Inference Infrastructure

This is a discussion of the basics of inference and the serving ecosystem from a systems design perspective. It focuses largely on techniques that scale serving, reduce latency, and improve efficiency and utilization in modern inference stacks.

  1. LLMs and Transformers - Introduction, embeddings, transformers, and attention mechanisms
  2. Inference and the KV Cache - Inference execution and the KV cache
  3. Sharding a Model - Pipeline, tensor, and expert parallelism
  4. Batching, Scheduling, and Paging - Continuous batching, Orca, and PagedAttention
  5. I/O-Aware Kernels - FlashAttention and FlashInfer
  6. Speculative Decoding - Speculative decoding, EAGLE, Medusa Trees, and Multi-Token Prediction
  7. Prefill-Decode Scheduling and Disaggregation - Chunked prefill and prefill-decode disaggregation
  8. KV Cache Management and Offload - Prefix caching and KV offload
  9. Appendix: Overview of Training - Fine-tuning, RLHF, RLAIF, quantization, and alignment techniques
  10. Appendix: GPU Hardware - Architecture, CUDA and ROCm, kernels and Triton, memory hierarchy
  11. Appendix: Inference Runtimes - LLM serving stacks, TensorRT, Triton, vLLM, and SGLang