Tao Luo
I am a CS Ph.D. candidate at the University of Pennsylvania (defending 2026), advised by Profs. Boon Thau Loo and Vincent Liu. I build agent infrastructure spanning RL post-training, LLM serving, and retrieval.
I am seeking full-time industry roles starting in 2026.
At Alibaba, I designed and shipped Partial Overlapping, a runtime scheduling mechanism for asynchronous agentic RL in ROLL that expands agent rollouts onto idle training GPUs, improving rollout throughput by 3.5x (featured in the ROME technical report). I built it entirely with coding agents (Claude Code, Codex), with zero human-written code; it was the first feature to ship this way in Alibaba's flagship post-training framework. It now powers agentic RL post-training of multiple production agents at 100B+ parameters on thousands of GPUs. I also designed and built RLix, an orchestration layer for concurrent agentic RL pipelines (2.6x rollout throughput in SWE-agent RL training; stars from NVIDIA, Google, xAI, Anthropic, ByteDance, Zhipu AI, and others). My work spans vLLM, Megatron-LM, and Ray.
During my Ph.D. at Penn, I led ParaFlex, a multiplexed heterogeneous LLM serving system that eliminates head-of-line blocking via stage-aligned parallelism. Earlier, during my M.S. at Columbia University, I introduced Privacy Budget Scheduling and developed DPF, the first scheduling algorithm for ML training under differential-privacy constraints. I also contribute to retrieval and data-systems research, spanning vector search and query optimization. My work has appeared at OSDI, SOSP, and SoCC.
Before academia, I spent ~4 years in quant investment, developing strategies and building infrastructure. I hold a B.S. in Financial Mathematics from Southern University of Science and Technology.
Selected Projects
Agentic RL Post-Training Infrastructure @Alibaba, DAMO Academy
- Proposed and shipped Partial Overlapping, a runtime scheduling mechanism for asynchronous agentic RL that expands agent rollouts to idle training GPUs, improving rollout throughput by 3.5x (featured in the ROME technical report).
- Leveraged coding agents extensively (Claude Code, Codex) to design, implement, and debug Partial Overlapping with zero human-written code (first high-priority feature in alibaba/ROLL shipped this way); featured in technical blogs (English/Chinese) as a case study for AI-assisted systems engineering.
- Deployed in production for agentic RL post-training of models with hundreds of billions of parameters on thousands of GPUs, including Qoder IDE (coding agent), iFlow CLI (terminal agent), Amap (travel-planning agent), and Alimama (ads).
- Extended Partial Overlapping to async multi-LoRA fine-tuning via per-adapter optimizers on a shared Megatron base model.
- Designed and built RLix, an orchestration layer for concurrent agentic RL pipelines that enables elastic GPU sharing and higher cluster utilization with minimal changes to training recipes (2.6x rollout throughput in SWE-agent RL training; stars from NVIDIA, Google, xAI, Anthropic, ByteDance, Zhipu AI, etc.).
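The core idea behind Partial Overlapping can be illustrated with a toy sketch: while an asynchronous RL step leaves some training GPUs idle, lend them to the rollout pool, then reclaim them before the next optimizer step. This is a minimal illustration of the scheduling idea only; none of these class or function names come from ROLL.

```python
# Toy sketch of the partial-overlapping idea (illustrative; not the ROLL API):
# idle training GPUs are temporarily lent to the rollout pool, then reclaimed
# before the optimizer step.

class GpuPool:
    def __init__(self, gpus):
        self.gpus = set(gpus)

    def lend(self, n):
        """Hand over up to n GPUs to another pool."""
        lent = set(sorted(self.gpus)[:n])
        self.gpus -= lent
        return lent

    def reclaim(self, gpus):
        self.gpus |= gpus


def run_step(train_pool, rollout_pool, rollout_backlog):
    # 1. Estimate how many training GPUs are idle this step, keeping a
    #    minimum of two for the training workers themselves.
    idle = max(0, len(train_pool.gpus) - 2)
    borrowed = train_pool.lend(min(idle, rollout_backlog))
    rollout_pool.reclaim(borrowed)

    # 2. Rollouts run on the expanded pool (generation itself omitted).
    served = min(rollout_backlog, len(rollout_pool.gpus))

    # 3. Reclaim exactly the borrowed GPUs before the next optimizer step.
    rollout_pool.gpus -= borrowed
    train_pool.reclaim(borrowed)
    return served


train = GpuPool(range(0, 4))    # GPUs 0-3 dedicated to training
rollout = GpuPool(range(4, 6))  # GPUs 4-5 dedicated to rollouts
print(run_step(train, rollout, rollout_backlog=5))  # 4: two borrowed GPUs joined
print(sorted(train.gpus))                           # [0, 1, 2, 3]
```

The real mechanism additionally has to move weights, manage KV caches, and keep training and generation frameworks (Megatron, vLLM) consistent; the sketch only captures the lend/reclaim scheduling loop.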
Heterogeneous Multi-Model LLM Serving at Scale @University of Pennsylvania
- Architected a multiplexed serving system with stage-aligned parallelism that eliminates head-of-line blocking, increasing token throughput by 1.6x while reducing median latency.
- Built multi-model KV cache management, distributed execution in vLLM and Ray, and NCCL concurrency controls.
- Developed algorithms for efficient model sharding, replication, placement, and scheduling across heterogeneous serving workloads.
- ParaFlex, SoCC’25 paper
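Head-of-line blocking, which ParaFlex targets, is easy to show with a toy queueing simulation (this is not the ParaFlex algorithm, just the motivating effect): a single FIFO lane forces short requests to wait behind a long one, while separate lanes let them complete independently.

```python
# Toy illustration of head-of-line blocking (not the ParaFlex algorithm):
# completion times for requests served in one shared FIFO lane vs. two lanes.

def completion_times(lanes):
    """lanes: list of FIFO lists of request durations; returns finish times."""
    finished = []
    for lane in lanes:
        t = 0
        for dur in lane:
            t += dur
            finished.append(t)
    return sorted(finished)

reqs = [10, 1, 1, 1]                         # one long request, three short ones
single = completion_times([reqs])            # [10, 11, 12, 13]
split = completion_times([[10], [1, 1, 1]])  # [1, 2, 3, 10]
print(single, split)
```

In the single-lane case the median latency is 11.5; with the long request isolated it drops to 2.5, even though total work is unchanged. Stage-aligned parallelism applies the same intuition at the level of model-parallel stages sharing GPUs.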
Query Optimization for Declarative Smart Contracts @UPenn
- Framed efficiency of Datalog-compiled smart contracts as a view selection problem under a non-standard, history-dependent cost model (Ethereum gas).
- Designed and implemented a selective view materialization algorithm with simplification-based pruning; formally proved algorithm correctness and pruning completeness.
- Reduced storage gas by ~78% and total gas by >50% over naive compilation, matching expert hand-tuned Solidity on a benchmark of widely deployed contracts.
- DeSCO paper, FAB’24 (co-located with VLDB).
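The view-selection trade-off can be sketched with a back-of-envelope cost model (hypothetical and simplified; not the DeSCO algorithm, and real EVM gas is history-dependent): materializing a view pays storage gas on every write but makes reads cheap, while recomputing on demand pays compute gas on every read.

```python
# Hypothetical cost model for the materialize-vs-recompute decision
# (illustrative only; real EVM gas pricing is history-dependent).

SSTORE_GAS = 20_000  # rough cost of an EVM storage write (simplified constant)

def materialize_is_cheaper(writes, reads, recompute_gas):
    materialized = writes * SSTORE_GAS  # maintain the view on every write
    on_demand = reads * recompute_gas   # recompute the view on every read
    return materialized < on_demand

print(materialize_is_cheaper(writes=10, reads=1000, recompute_gas=5_000))   # True
print(materialize_is_cheaper(writes=1000, reads=10, recompute_gas=5_000))   # False
```

A read-heavy view is worth materializing; a write-heavy one is not. The actual algorithm searches over sets of candidate views with simplification-based pruning rather than scoring each view independently.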
Privacy-Preserving Scheduling for ML Training @Columbia University
- Designed the first fair-allocation scheduling algorithm for ML training under differential-privacy constraints.
- Improved job throughput by 2x over FCFS under the same privacy budget, verified in large-scale simulations; proved formal efficiency and fairness guarantees.
- Privacy Budget Scheduling, OSDI’21
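A toy scheduler in the spirit of DPF (not the paper's exact algorithm) conveys the core idea: each data block carries a finite privacy budget, and the scheduler favors the pending job with the smallest dominant share, i.e. the largest fraction it demands of any block's remaining budget.

```python
# Toy DPF-style scheduler (illustrative; not the OSDI'21 algorithm): admit
# jobs in order of smallest dominant share over per-block privacy budgets.

def dpf_schedule(blocks, jobs):
    """blocks: {block: remaining_epsilon}; jobs: {name: {block: demand}}."""
    admitted = []
    pending = dict(jobs)
    while pending:
        # Dominant share = max fraction of any block's remaining budget asked for.
        def dominant(name):
            return max(d / blocks[b] for b, d in pending[name].items())
        name = min(pending, key=dominant)
        demands = pending.pop(name)
        if all(blocks[b] >= d for b, d in demands.items()):
            for b, d in demands.items():
                blocks[b] -= d  # spend privacy budget; it never replenishes
            admitted.append(name)
    return admitted

blocks = {"day1": 1.0, "day2": 1.0}
jobs = {
    "small": {"day1": 0.2},
    "big":   {"day1": 0.9, "day2": 0.9},
    "mid":   {"day2": 0.3},
}
print(dpf_schedule(blocks, jobs))  # ['small', 'mid']
```

Small-demand jobs are admitted first, so no single large job can exhaust the shared budget, which is the source of the throughput and fairness gains over FCFS.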
Honors & Service
- Program Committee: ACM Symposium on Cloud Computing 2025
- Manjushri Fellowship, University of Pennsylvania, 2021
- China Merchants Bank Scholarship, 2012-2014
- Pioneering Undergraduate Fellowship, 2011-2014
