LLM Inference
LLMs and Transformers
Introduction, embeddings, transformers, and attention mechanisms
Inference and the KV Cache
Inference execution and the KV cache
Sharding a Model
Pipeline, tensor, and expert parallelism
Batching, Scheduling, and Paging
Continuous batching, Orca, and PagedAttention
I/O-Aware Kernels
FlashAttention and FlashInfer
Speculative Decoding
Speculative decoding, EAGLE, Medusa trees, and multi-token prediction
Prefill-Decode Scheduling and Disaggregation
Chunked prefill and prefill-decode disaggregation
KV Cache Management and Offload
Prefix caching and KV offload
Appendix: Overview of Training
Fine-tuning, RLHF, RLAIF, quantization, and alignment techniques
Appendix: GPU Hardware
Architecture, CUDA and ROCm, kernels and Triton, memory hierarchy
Appendix: Inference Runtimes
LLM serving stacks, TensorRT, Triton, vLLM, and SGLang