Serverless AI
Highlights of a UVA F25 course.
Shoutout to Professor Yue Cheng for a great Fall semester!
This course was structured similarly to most graduate courses at UVA—a mix of paper readings, presentations, and a course project. I was most interested in the LLM serving component, but also enjoyed discussions around cloud computing, FaaS, and computing hardware. I've highlighted some of my favorite serving readings from the semester below.
PagedAttention [1]
From Stoica's lab at Berkeley, this work introduces vLLM, an open-source inference engine. The authors find that previous KV cache memory management systems suffer from internal fragmentation (reserved but unused memory) and external fragmentation (free but non-contiguous memory). Inspired by virtual memory and paging in traditional operating systems, they propose PagedAttention, which allocates the KV cache in fixed-size blocks, much like pages. Because blocks can be shared across sequences, the design naturally extends to prefix caching.
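The core idea can be sketched in a few lines. This is a toy model, not vLLM's actual implementation: the cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks, so blocks are allocated on demand with no reservation.

```python
BLOCK_SIZE = 4  # tokens per KV block (illustrative; vLLM defaults differ)

class PagedKVCache:
    """Toy paged KV-cache allocator: blocks are allocated on demand."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:          # current block full (or first token)
            table.append(self.free_blocks.pop())  # grab a physical block lazily
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(5):  # a 5-token sequence needs ceil(5/4) = 2 blocks
    cache.append_token("seq0")
print(len(cache.block_tables["seq0"]))  # → 2
```

The key property is that at most one partially-filled block exists per sequence, which bounds internal fragmentation, and blocks need not be contiguous, which eliminates external fragmentation.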
Orca [2]
From SNU, this work introduces iteration-level scheduling (now widely known as continuous batching) and selective batching. Previous methods wait for every request in a batch to complete before admitting new ones; continuous batching instead schedules new requests as soon as any request in the batch completes, improving serving latency and throughput.
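The difference can be shown with a toy simulation (my own sketch, not Orca's scheduler): each request needs some number of decode iterations, and the two policies differ only in when a freed batch slot is refilled.

```python
from collections import deque

def static_batching(requests, batch_size):
    """Wait for the whole batch to finish before admitting new requests."""
    queue, t = deque(requests), 0
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        t += max(batch)  # the batch completes only when its longest request does
    return t

def continuous_batching(requests, batch_size):
    """Admit a waiting request the moment any slot frees up."""
    queue, running, t = deque(requests), [], 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())      # refill slots every iteration
        t += 1                                   # one decode step for all in-flight
        running = [r - 1 for r in running if r > 1]
    return t

# One long request (8 steps) plus seven short ones (1 step each), batch size 2:
reqs = [8] + [1] * 7
print(static_batching(reqs, 2), continuous_batching(reqs, 2))  # → 11 8
```

Under static batching the short requests are repeatedly held hostage by the long one; iteration-level scheduling lets them slip through as slots open.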
DistServe [3]
From the Hao AI lab at UCSD, this work introduces prefill-decode (PD) disaggregation. The authors identify that prefill and decode workloads have very different characteristics; decode steps should run with larger batch sizes due to their memory-bound nature, while prefill steps are compute-bound. They show that placing prefill and decode instances on separate GPUs maximizes per-GPU goodput.
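The compute-bound vs. memory-bound distinction comes down to arithmetic intensity. A back-of-the-envelope sketch (my own illustrative numbers, not the paper's measurements): every step must stream the model weights from memory once, while the FLOPs performed scale with the number of tokens processed in that step.

```python
def arithmetic_intensity(tokens_per_step, bytes_per_param=2):
    """FLOPs performed per parameter byte moved, for one forward step.

    Assumes ~2 FLOPs (multiply + add) per parameter per token and
    fp16 weights (2 bytes/param); both are rough illustrative figures.
    """
    flops_per_param = 2 * tokens_per_step
    return flops_per_param / bytes_per_param

prefill = arithmetic_intensity(tokens_per_step=2048)  # whole prompt in one step
decode  = arithmetic_intensity(tokens_per_step=1)     # one new token per request
print(prefill, decode)  # → 2048.0 1.0
```

Prefill does thousands of times more work per byte of weights moved, so it saturates compute, while decode is bottlenecked on memory bandwidth and benefits from batching many requests per step; hence DistServe's argument for provisioning the two phases on separate GPUs.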
Sarathi-Serve [4]
From Microsoft, this work introduces chunked prefills, which split prefill requests into smaller chunks. Because prefill is compute-bound, scheduling an entire prompt at once stalls ongoing decodes; chunking allows prefill work to be jointly scheduled with decodes each iteration without a significant latency penalty.
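A toy token-budget scheduler in this spirit (names and numbers are illustrative, not Sarathi-Serve's): each iteration packs the ongoing decodes first, then fills whatever token budget remains with the next chunk of a pending prefill.

```python
def build_iteration(decode_count, prefill_remaining, token_budget):
    """Return (decode tokens, prefill chunk size, prefill tokens left over)."""
    decode_tokens = decode_count                   # one token per decode request
    budget_left = max(token_budget - decode_tokens, 0)
    chunk = min(prefill_remaining, budget_left)    # piggyback a prefill chunk
    return decode_tokens, chunk, prefill_remaining - chunk

# A 1024-token prompt scheduled alongside 48 decodes, with a 256-token budget:
remaining, iters = 1024, 0
while remaining:
    _, chunk, remaining = build_iteration(48, remaining, 256)
    iters += 1
print(iters)  # → 5  (four 208-token chunks, then a final 192-token chunk)
```

The decodes keep making progress every iteration instead of stalling for the full prompt, at the cost of spreading the prefill over a few more iterations.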
MoonCake [5]
From Moonshot AI, this work details the serving platform behind Moonshot's Kimi. The authors propose a multi-level KV cache pool for PD-disaggregated architectures, using Remote Direct Memory Access (RDMA) for fast transfers between nodes.
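A minimal sketch of the multi-level lookup idea (my own simplification; the class and field names are illustrative, not MoonCake's design): check locally cached KV blocks first, fall back to the remote pool, and only recompute the prefill on a full miss.

```python
class KVCachePool:
    """Toy two-level KV cache: local DRAM, then a remote pool."""

    def __init__(self):
        self.local = {}   # prefix hash -> KV blocks held locally
        self.remote = {}  # prefix hash -> KV blocks on other nodes

    def get(self, prefix_hash):
        if prefix_hash in self.local:
            return self.local[prefix_hash], "local"
        if prefix_hash in self.remote:
            blocks = self.remote[prefix_hash]  # an RDMA transfer in practice
            self.local[prefix_hash] = blocks   # promote for future hits
            return blocks, "remote"
        return None, "miss"                    # caller recomputes the prefill

pool = KVCachePool()
pool.remote["h1"] = ["kv0", "kv1"]
print(pool.get("h1")[1])  # → remote (fetched once)
print(pool.get("h1")[1])  # → local  (promoted copy)
```

The payoff is that a prefix computed anywhere in the cluster can be reused everywhere, with a remote fetch being far cheaper than recomputing the prefill.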
BlitzScale [6]
From SJTU, this work proposes faster GPU autoscaling. Instead of loading model weights from slow network storage, the authors fetch them over RDMA from live replicas when deploying new instances of an already-running model. This pairs well with online RL workloads, where checkpoints are updated and redeployed every 90 minutes. Lequn Chen's blog on inter-node RL weight transfer is also an exceptional resource.
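A back-of-the-envelope comparison shows why the source of the weights matters so much for scale-up latency. The bandwidth figures below are illustrative assumptions, not measurements from the paper:

```python
def load_seconds(model_gb, bandwidth_gbps):
    """Time to transfer a model of `model_gb` gigabytes at `bandwidth_gbps` Gbit/s."""
    return model_gb * 8 / bandwidth_gbps  # GB -> Gbit, divided by Gbit/s

model_gb = 140  # e.g. a 70B-parameter model in fp16 (illustrative)
print(load_seconds(model_gb, 10))   # 10 Gbps path to object storage → 112.0 s
print(load_seconds(model_gb, 400))  # 400 Gbps RDMA NIC from a peer  → 2.8 s
```

Two orders of magnitude in cold-start time is the difference between autoscaling that can track a traffic spike and autoscaling that arrives after the spike is over.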
Bonus: Shuffling, Fast and Slow [7]
Not a serving work, but interesting nonetheless. Also from Stoica's lab, this work introduces a hybrid sorting method that combines cheap, slow storage (S3) with fast, expensive storage (Redis) to trade off cost and performance.
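The cost/performance trade-off can be sketched with a toy model of splitting intermediate shuffle data across the two tiers. The prices and bandwidths below are illustrative placeholders, not figures from the paper:

```python
def shuffle_plan(data_gb, fast_fraction,
                 fast_cost_per_gb=0.125, slow_cost_per_gb=0.004,
                 fast_gbps=10.0, slow_gbps=1.0):
    """Cost ($) and latency (s) for shuffling `data_gb` GB, with
    `fast_fraction` of it routed through the fast, expensive tier."""
    fast_gb = data_gb * fast_fraction
    slow_gb = data_gb - fast_gb
    cost = fast_gb * fast_cost_per_gb + slow_gb * slow_cost_per_gb
    # Tiers are read in parallel, so the slower tier dominates latency.
    latency = max(fast_gb / fast_gbps, slow_gb / slow_gbps)
    return cost, latency

for frac in (0.0, 0.5, 1.0):
    cost, latency = shuffle_plan(100, frac)
    print(f"fast={frac:.0%} cost=${cost:.2f} latency={latency:.1f}s")
```

Sweeping the fraction traces out the frontier: all-S3 is cheapest but slowest, all-Redis is fastest but priciest, and the hybrid points in between let the operator pick a budget-appropriate spot.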