LLM Inference

Introduction, embeddings, transformers, and attention mechanisms

Inference execution and the KV cache

Sharding a Model

Pipeline, tensor, and expert parallelism

Continuous batching, Orca, and PagedAttention

I/O-Aware Kernels

FlashAttention and FlashInfer

Speculative decoding, EAGLE, Medusa Trees, and Multi-Token Prediction

Chunked prefill and prefill-decode disaggregation

Prefix caching and KV offload

Fine-tuning, RLHF, RLAIF, quantization, and alignment techniques

Architecture, CUDA and ROCm, kernels and Triton, and memory hierarchy

LLM serving stacks: TensorRT, Triton, vLLM, and SGLang