Inference Infrastructure
This discussion covers the basics of inference and the serving ecosystem from a systems design perspective. It focuses largely on techniques that scale serving, reduce latency, or improve efficiency and utilization in modern inference stacks.
- LLMs and Transformers: Introduction, embeddings, transformers, and attention mechanisms
- Inference and the KV Cache: Inference execution and the KV cache
- Sharding a Model: Pipeline, tensor, and expert parallelism
- Batching, Scheduling, and Paging: Continuous batching, Orca, and PagedAttention
- I/O-Aware Kernels: FlashAttention and FlashInfer
- Speculative Decoding: Speculative decoding, EAGLE, Medusa Trees, and Multi-Token Prediction
- Prefill-Decode Scheduling and Disaggregation: Chunked prefill and prefill-decode disaggregation
- KV Cache Management and Offload: Prefix caching and KV offload
- Appendix: Overview of Training: Fine-tuning, RLHF, RLAIF, quantization, and alignment techniques
- Appendix: GPU Hardware: Architecture, CUDA and ROCm, kernels and Triton, memory hierarchy
- Appendix: Inference Runtimes: LLM serving stacks, TensorRT, Triton, vLLM, and SGLang