Inference Infrastructure

This discussion covers the basics of inference and the serving ecosystem from a systems design perspective. It focuses largely on techniques that scale serving, reduce latency, and improve efficiency and utilization in modern inference stacks.

LLMs and Transformers - Introduction, embeddings, transformers, and attention mechanisms
Inference and the KV Cache - Inference execution and the KV cache
Sharding a Model - Pipeline, tensor, and expert parallelism
Batching, Scheduling, and Paging - Continuous batching, Orca, and PagedAttention
I/O-Aware Kernels - FlashAttention, FlashInfer, and Blockwise Parallel Transformers
Speculative Decoding - Sequential and parallel drafting, EAGLE, Medusa, MTP, DFlash, and DSpark
Prefill-Decode Scheduling and Disaggregation - Chunked prefill and prefill-decode disaggregation
KV Cache Management and Offload - Prefix caching and KV offload
Sharding the Input - Sequence and context parallelism, Ring Attention, and DeepSpeed-Ulysses
Appendix: Overview of Training - Fine-tuning, RLHF, RLAIF, quantization, and alignment techniques
Appendix: GPU Hardware - Architecture, CUDA and ROCm, kernels and Triton, and memory hierarchy
Appendix: Inference Runtimes - LLM serving stacks, TensorRT, Triton, vLLM, and SGLang