LLM Inference
-
LLMs and Transformers
Introduction, embeddings, transformers and attention mechanisms -
Inference and the KV Cache
Inference execution and the KV cache -
Sharding a Model
Pipeline, tensor, and expert parallelism -
Batching, Scheduling, and Paging
Continuous batching, Orca, and PagedAttention -
I/O-Aware Kernels
FlashAttention and FlashInfer -
Speculative Decoding
Speculative decoding, EAGLE, Medusa Trees, and Multi-Token Prediction -
Prefill-Decode Scheduling and Disaggregation
Chunked prefill and prefill-decode disaggregation -
KV Cache Management and Offload
Prefix caching and KV offload -
Appendix: Overview of Training
Fine-tuning, RLHF, RLAIF, quantization, and alignment techniques -
Appendix: GPU Hardware
Architecture, CUDA and ROCm, kernels and Triton, memory hierarchy -
Appendix: Inference Runtimes
LLM serving stacks, TensorRT, Triton, vLLM, and SGLang
Histograms and Tail Latency
-
H2 Histogram
HdrHistogram, bucket construction, and configurable options -
Latency Is a Curve
Latency, single values vs. distributions, quantiles, histograms, and SLOs -
Hedging or Scattering? A Mixed Bag of Distributed Latencies
Service dependencies, request patterns, combined latencies, and real-world patterns