Tao Luo

I am a CS Ph.D. candidate at the University of Pennsylvania (defending 2026), advised by Profs. Boon Thau Loo and Vincent Liu. I build agent infrastructure spanning RL post-training, LLM serving, and retrieval.

I am seeking full-time industry roles starting in 2026.

At Alibaba, I designed and shipped Partial Overlapping, a runtime scheduling mechanism for asynchronous agentic RL in ROLL that expands agent rollouts onto idle training GPUs, improving rollout throughput by 3.5x (featured in the ROME technical report). I built it entirely with coding agents (Claude Code, Codex) and zero human-written code, the first feature to ship this way in Alibaba’s flagship post-training framework. It now powers agentic RL post-training of multiple production agents at 100B+ parameters on thousands of GPUs. I also designed and built RLix, an orchestration layer for concurrent agentic RL pipelines (2.6x rollout throughput in SWE-agent RL training; GitHub stars from engineers at NVIDIA, Google, xAI, Anthropic, ByteDance, Zhipu AI, and others). My work spans vLLM, Megatron-LM, and Ray.

During my Ph.D. at Penn, I led ParaFlex, a multiplexed heterogeneous LLM serving system that eliminates head-of-line blocking via stage-aligned parallelism. Earlier, during my M.S. at Columbia University, I introduced Privacy Budget Scheduling and developed DPF, the first scheduling algorithm for ML training under differential-privacy constraints. I also contribute to retrieval and data-systems research, spanning vector search and query optimization. My work has appeared at OSDI, SOSP, and SoCC.

Before academia, I spent roughly four years in quantitative investment, developing strategies and building infrastructure. I hold a B.S. in Financial Mathematics from Southern University of Science and Technology.

Selected Projects

Agentic RL Post-Training Infrastructure @Alibaba, DAMO Academy

  • Proposed and shipped Partial Overlapping, a runtime scheduling mechanism for asynchronous agentic RL that expands agent rollouts to idle training GPUs, improving rollout throughput by 3.5x (featured in the ROME technical report).
  • Leveraged coding agents (Claude Code, Codex) extensively to design, implement, and debug Partial Overlapping with zero human-written code, the first high-priority feature in alibaba/ROLL shipped this way; featured in technical blogs (English/Chinese) as a case study in AI-assisted systems engineering.
  • Deployed in production for agentic RL post-training of models with hundreds of billions of parameters on thousands of GPUs, including Qoder IDE (coding agent), iFlow CLI (terminal agent), Amap (travel-planning agent), and Alimama (ads).
  • Extended Partial Overlapping to async multi-LoRA fine-tuning via per-adapter optimizers on a shared Megatron base model.
  • Designed and built RLix, an orchestration layer for concurrent agentic RL pipelines that enables elastic GPU sharing and higher cluster utilization with minimal changes to training recipes (2.6x rollout throughput in SWE-agent RL training; GitHub stars from engineers at NVIDIA, Google, xAI, Anthropic, ByteDance, Zhipu AI, and others).
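The core idea behind Partial Overlapping can be sketched in a few lines: while training is blocked waiting on rollout data, its GPUs are lent to the rollout pool and reclaimed when a step is ready. This toy is my own illustration; `GpuPool` and its methods are hypothetical names, not ROLL's actual API.

```python
# Toy sketch of partial overlapping in async RL (illustrative names only,
# not the ROLL API): idle training GPUs temporarily join the rollout pool.

class GpuPool:
    def __init__(self, rollout_gpus, train_gpus):
        self.rollout = set(rollout_gpus)   # GPUs dedicated to rollouts
        self.train = set(train_gpus)       # GPUs dedicated to training
        self.lent = set()                  # training GPUs currently doing rollouts

    def lend_idle_train_gpus(self):
        """Training is blocked on rollout data: expand the rollout pool."""
        self.lent = set(self.train)
        return self.rollout | self.lent

    def reclaim_for_training(self):
        """A training step is ready: shrink rollouts back to their own GPUs."""
        self.lent.clear()
        return self.train

pool = GpuPool(rollout_gpus={0, 1}, train_gpus={2, 3})
assert pool.lend_idle_train_gpus() == {0, 1, 2, 3}  # rollouts span all 4 GPUs
assert pool.reclaim_for_training() == {2, 3}        # training gets its GPUs back
```

The real mechanism must additionally migrate in-flight rollouts and keep weights in sync; this sketch only captures the elastic lend/reclaim cycle.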

Heterogeneous Multi-Model LLM Serving at Scale @University of Pennsylvania

  • Architected a multiplexed serving system with stage-aligned parallelism that eliminates head-of-line blocking, increasing token throughput by 1.6x while reducing median latency.
  • Built multi-model KV cache management, distributed execution in vLLM and Ray, and NCCL concurrency controls.
  • Developed algorithms for efficient model sharding, replication, placement, and scheduling across heterogeneous serving workloads.
  • ParaFlex, SoCC’25 paper
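The head-of-line blocking that motivates this design is easy to demonstrate with a toy queueing calculation. The costs below are made-up numbers for illustration, not ParaFlex's stage-aligned mechanism itself.

```python
# Illustrative toy, not ParaFlex: why heterogeneous models behind one FCFS
# queue suffer head-of-line blocking.

def fcfs_latency(jobs):
    """Average completion time when jobs run serially in arrival order."""
    t, total = 0.0, 0.0
    for cost in jobs:
        t += cost          # each job finishes after everything ahead of it
        total += t
    return total / len(jobs)

# A 10-unit request from a large model ahead of four 1-unit requests:
# every short request waits behind the long one.
print(fcfs_latency([10, 1, 1, 1, 1]))   # → 12.0

# Isolating the long request on its own lane restores short-job latency.
print(fcfs_latency([1, 1, 1, 1]))       # → 2.5
```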

Query Optimization for Declarative Smart Contracts @UPenn

  • Framed efficiency of Datalog-compiled smart contracts as a view selection problem under a non-standard, history-dependent cost model (Ethereum gas).
  • Designed and implemented a selective view materialization algorithm with simplification-based pruning; formally proved algorithm correctness and pruning completeness.
  • Reduced storage gas by ~78% and total gas by >50% over naive compilation, matching expert hand-tuned Solidity on a benchmark of widely deployed contracts.
  • DeSCO paper, FAB’24 (co-located with VLDB).
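The materialization decision at the heart of this work can be caricatured as a cost comparison: store a view (pay write gas) versus recompute it on every read. The formula and numbers below are simplified assumptions for illustration, not DeSCO's actual history-dependent gas model.

```python
# Toy view-selection rule under a gas-style cost model (made-up formula,
# not DeSCO's): materialize only if storage writes beat repeated recomputation.

def should_materialize(write_gas, recompute_gas, n_reads, read_gas):
    materialized = write_gas + n_reads * read_gas   # pay storage once, cheap reads
    on_demand = n_reads * recompute_gas             # recompute on every read
    return materialized < on_demand

# Hot view, expensive recomputation: worth storing.
print(should_materialize(write_gas=100, recompute_gas=50, n_reads=10, read_gas=5))
```

The actual problem is harder because Ethereum storage writes are history-dependent (writing a fresh slot costs more than updating one), which is what makes the cost model non-standard.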

Privacy-Preserving Scheduling for ML Training @Columbia University

  • Designed the first fair-allocation scheduling algorithm for ML training under differential-privacy constraints.
  • Improved job throughput by 2x over FCFS under the same privacy budget, verified in large-scale simulations; proved formal efficiency and fairness guarantees.
  • Privacy Budget Scheduling, OSDI’21
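A DPF-style grant order can be sketched as follows. This is a hypothetical simplification of the OSDI'21 algorithm: jobs demand privacy budget (epsilon) from data blocks and are granted in increasing order of their dominant share, the largest fraction of any block's budget they would consume. The real algorithm additionally handles dynamically arriving blocks and jobs.

```python
# Simplified DPF-style allocation sketch (not the exact OSDI'21 algorithm):
# grant jobs by increasing dominant share of the blocks' privacy budgets.

def dpf_schedule(jobs, block_budget):
    """jobs: {name: {block_id: epsilon_demand}}; block_budget: {block_id: epsilon}."""
    remaining = dict(block_budget)
    # Dominant share = max fraction of any block's budget a job demands.
    order = sorted(jobs, key=lambda j: max(
        eps / block_budget[b] for b, eps in jobs[j].items()))
    granted = []
    for j in order:
        if all(remaining[b] >= eps for b, eps in jobs[j].items()):
            for b, eps in jobs[j].items():
                remaining[b] -= eps      # consume budget from each touched block
            granted.append(j)
    return granted

jobs = {"small": {"day1": 0.1}, "big": {"day1": 0.9, "day2": 0.9}}
# Small-demand job is granted first; the big job is denied once day1 runs low,
# instead of starving everyone else as FCFS would.
print(dpf_schedule(jobs, {"day1": 0.5, "day2": 1.0}))   # → ['small']
```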

Honors & Service

  • Program Committee: ACM Symposium on Cloud Computing 2025
  • Manjushri Fellowship, University of Pennsylvania, 2021
  • China Merchants Bank Scholarship, 2012-2014
  • Pioneering Undergraduate Fellowship, 2011-2014