Speculative Decoding
Even when a model fits entirely into HBM and only a single model is being executed on the GPU, the primary bottleneck during LLM inference is the movement of model weights from the HBM, through the cache hierarchy into device registers for actual computation. This happens multiple times in a single batch, since different layers are loaded one after the other as the computation proceeds.
Speculative decoding attempts to address this by speculating a number of tokens cheaply, which can then be verified in parallel by the model. This verification is similar to prefill: the model is given all the speculated tokens with the appropriate masking to prevent tokens from being able to see into the future and asked to generate the entire batch, which is compared with the speculated tokens. The cost of fetching model weights is effectively amortized over multiple tokens; the verified tokens are added to the KV cache, while the extra tokens are discarded.
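The verification step can be sketched as a prefix comparison. This is a minimal illustration assuming greedy (argmax) decoding; the function name and the convention that the target returns one extra prediction after the last draft token are assumptions made for the example, not details taken from any particular framework.

```python
from typing import List, Sequence


def verify_greedy(target_predictions: Sequence[int],
                  draft_tokens: Sequence[int]) -> List[int]:
    """Return the tokens to emit after one parallel verification pass.

    target_predictions[i] is the target model's argmax token for the position
    of draft_tokens[i], computed with causal masking so that it only sees the
    prompt and draft_tokens[:i]. It holds one extra entry: the prediction
    that follows the final draft token (assumed convention).
    """
    if len(target_predictions) != len(draft_tokens) + 1:
        raise ValueError("expected one target prediction per draft token plus one")

    emitted: List[int] = []
    for i, draft in enumerate(draft_tokens):
        if target_predictions[i] != draft:
            # First mismatch: keep the target's own token and discard the
            # remaining draft tokens.
            emitted.append(target_predictions[i])
            return emitted
        emitted.append(draft)

    # Every draft token matched, so the trailing prediction comes for free.
    emitted.append(target_predictions[len(draft_tokens)])
    return emitted
```

Under these assumptions every pass emits at least one token, so the output matches what the target model would have produced on its own with greedy decoding.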
The traditional way of speculating the tokens is to use a Draft-Target-Model architecture, where a smaller model acts as a draft model to speculatively generate tokens, while the actual full-sized model acts as the target which verifies them in parallel and keeps only the tokens that match. Both the draft and target models reside in the HBM, so speculative decoding actually increases the overall memory required during inference. The number of draft tokens to generate, called the lookahead window, depends on the acceptance rate of the tokens and is empirically determined to balance the potential speedup with wasted work, with common values being 3-12.
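The trade-off behind the lookahead window can be made concrete with a simple model. The sketch below assumes that each draft token is accepted independently with the same probability `alpha`, and that one draft pass costs a fixed fraction `draft_cost_ratio` of a target pass; both are simplifying assumptions, and the parameter names are chosen for illustration.

```python
def expected_tokens_per_pass(alpha: float, window: int) -> float:
    """Expected tokens emitted per target pass for a given lookahead window.

    Assumes independent per-token acceptance with probability alpha, and that
    the target pass always contributes one token of its own (a correction or
    a bonus token), giving the geometric sum 1 + alpha + ... + alpha**window.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if window < 0:
        raise ValueError("window must be non-negative")
    if alpha == 1.0:
        return float(window + 1)
    return (1.0 - alpha ** (window + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, window: int, draft_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding under the assumptions above."""
    # One round costs `window` draft passes plus one target pass, measured in
    # units of a single target pass.
    round_cost = window * draft_cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, window) / round_cost


if __name__ == "__main__":
    # Illustrative values only: acceptance rate 0.8, draft pass at 5% of a target pass.
    for window in (3, 5, 8, 12):
        print(window, round(estimated_speedup(0.8, window, 0.05), 2))
```

In this model the expected number of accepted tokens saturates as the window grows while the drafting cost keeps increasing linearly, which is why the window is tuned empirically rather than made as large as possible.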
One option for the draft model is to use a smaller version of the target model, for instance using Llama-8B as a draft model for a target of Llama-70B. EAGLE uses another approach, where it trains a lightweight transformer-like decoder which uses the internal (feature) state of the target model to predict its draft tokens, rather than just relying on the output tokens. This reduces the overall uncertainty of predictions and results in a higher acceptance rate.
Another approach to speculative decoding is to eschew the draft model altogether and augment the original model to generate the draft tokens. In Medusa, the original model is frozen and a number of heads are added (similar to fine-tuning), where each head is trained to generate the draft token for a different future position. The Medusa heads are independent of each other and each generates tokens without knowing the actual predictions of prior heads, which means a misprediction by an early head often invalidates the entire draft sequence.
Medusa gets around this by taking the top-k predictions from every head and generating a number of possible draft sequences to validate using the Cartesian product of the individual outputs, called the Medusa Tree. Since the total number of possibilities is exponential in the number of heads and k, these draft sequences are usually pruned and only the most likely candidates validated in a single pass. The validated tokens from the candidate with the highest acceptance rate are selected and the process continues.
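The candidate construction can be sketched as follows. Scoring each candidate by the product of its per-head probabilities is a simple pruning heuristic assumed here for illustration; it is not necessarily the rule Medusa uses, and the function and parameter names are hypothetical.

```python
import heapq
from itertools import product
from math import prod
from typing import List, Sequence, Tuple

TokenProb = Tuple[int, float]  # (token id, probability assigned by one head)


def build_candidates(head_topk: Sequence[Sequence[TokenProb]],
                     max_candidates: int) -> List[Tuple[int, ...]]:
    """Enumerate draft sequences from per-head top-k lists and keep the best.

    head_topk[i] holds the top-k (token, probability) pairs of head i. The
    Cartesian product has k ** len(head_topk) entries, so only the
    max_candidates highest-scoring sequences are returned for verification.
    """
    scored = []
    for combo in product(*head_topk):
        tokens = tuple(token for token, _ in combo)
        # Assumed heuristic: treat heads as independent and multiply.
        score = prod(prob for _, prob in combo)
        scored.append((score, tokens))
    return [tokens for _, tokens in heapq.nlargest(max_candidates, scored)]
```

Enumerating the full product before pruning is only practical for small k and few heads; a real implementation would more plausibly expand the tree incrementally and share common prefixes during verification.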
Multi-Token Prediction (MTP) is a training-side technique designed to increase data efficiency by allowing models to derive more knowledge from the same datasets. It extends the basic transformer architecture: while models are typically trained to reduce the error during prediction of the next token, here the model is trained to reduce prediction error over each of several subsequent tokens. This mechanism can be repurposed during inference to generate multiple tokens per decode loop, which can be used for speculative decoding in lieu of those generated by a draft model. Models are designed to predict the next several tokens in parallel by creating independent heads, each implemented as a shallow, single-layer transformer that operates on the same hidden states produced by the shared computation pipeline (called the trunk).
Despite the similarities with Medusa (both add additional layers to the model), multi-token prediction cannot be bolted onto a model after the fact using fine-tuning; on the other hand, any such model can use the additional tokens for speculative decoding without further modification. The shallow transformers added by MTP typically have higher memory overhead than the Medusa heads (transformers vs. linear layers), but lower than an independent EAGLE draft model. In terms of token generation, MTP can generate all the draft tokens independently in a single pass (similar to the Medusa heads); however, in some variants, such as DeepSeek-V3, the heads are more heavyweight transformer blocks that operate sequentially to better preserve the inter-token dependencies (similar to EAGLE).
Another approach to generating multiple candidate tokens in parallel is to use diffusion models, which can generate multiple tokens in a single forward pass, as draft models. While such diffusion models can lower per-token latency by amortizing the generation time across the entire lookahead window, they usually underperform autoregressive drafters in accuracy, resulting in a lower acceptance rate. Block diffusion models are a hybrid of diffusion and autoregressive models where fixed-size blocks, each of which is generated in parallel using diffusion, are produced serially.
DFlash uses block diffusion models in an attempt to balance both generation latency and the acceptance rate. Similar to EAGLE, DFlash uses the hidden states of the target model as context, under the assumption that they implicitly encode information about long-range dependencies and future tokens. Parallel token generation allows DFlash to use a larger and more expressive draft model than purely autoregressive frameworks; however, as the number of layers in the model increases, the input context gets increasingly diluted, impacting quality. DFlash addresses this by using the context, not just as input to the model, but as part of the input to every layer of the model to increase prediction accuracy. DDTree is an extension to DFlash which uses a tree structure (conceptually similar to Medusa Trees) where the diffusion model generates multiple candidates for each position, after which a single draft tree is constructed and verified by the target model.
DSpark, which was introduced as part of DeepSeek-V4, increases the acceptance rate of speculative decoding in a highly concurrent production system both by improving the accuracy of the draft model, thereby increasing the numerator (accepted tokens), and by pruning low-confidence draft tokens from the verification batch, thereby decreasing the denominator (verified tokens). Block diffusion drafters lose inter-token dependencies within a generated block, which can lead to incoherent combinations of tokens and a decay of the acceptance rate deeper into the block. DSpark augments a DFlash-based parallel drafter with a sequential autoregressive stage to allow intra-block dependencies to inform the generation and reduce the decay for later tokens within the block. To preserve the performance benefits of parallel drafting, it uses a lightweight Markov model for its sequential pass, rather than a transformer-based sequential model, as in EAGLE.
DSpark also uses a systems-level optimization to reduce verification load: at high concurrency, verification is no longer entirely free and the GPU’s compute resources can be used for other active requests. DSpark uses a confidence model to determine which of the draft tokens are likely to be accepted and only runs the verification pass over those to reduce wasted computation. Rather than using a static threshold for admission of draft tokens, it uses a load-aware scheduler which calculates the desired batch size for verification as a function of the compute capacity available beyond processing active requests and then greedily fills the batch with the highest confidence draft tokens. This is conceptually similar to how chunked prefills, which are discussed in Prefill-Decode Scheduling and Disaggregation, decide how to co-locate prefill and decode tokens within a single batch.
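A load-aware admission step of this kind might look like the sketch below. It is not DSpark's scheduler: the budget formula, the use of cumulative prefix confidence as the ranking key, and all names are assumptions made to illustrate greedily filling a budget with the most promising draft tokens while never admitting a token whose predecessors were left out.

```python
import heapq
from typing import Dict, List


def verification_budget(capacity_tokens: int, active_requests: int) -> int:
    """Token slots left for draft verification (assumed: one slot per active request)."""
    return max(0, capacity_tokens - active_requests)


def fill_verification_batch(confidences: Dict[int, List[float]],
                            budget: int) -> Dict[int, int]:
    """Return how many leading draft tokens to verify for each request.

    confidences[r][i] is the estimated probability that draft token i of
    request r is accepted given that the earlier tokens of r were accepted.
    """
    admitted = {request: 0 for request in confidences}
    # Max-heap (via negation) holding the next unadmitted token of each
    # request, keyed by the probability that its whole prefix is accepted.
    frontier = [(-conf[0], request) for request, conf in confidences.items() if conf]
    heapq.heapify(frontier)

    while budget > 0 and frontier:
        neg_prefix_prob, request = heapq.heappop(frontier)
        admitted[request] += 1
        budget -= 1
        next_index = admitted[request]
        if next_index < len(confidences[request]):
            # Extend the prefix probability by the next token's confidence.
            next_key = neg_prefix_prob * confidences[request][next_index]
            heapq.heappush(frontier, (next_key, request))
    return admitted
```

For example, with a capacity of 8 tokens and 5 active requests the budget is 3 slots, which this sketch spends on the draft tokens most likely to survive verification.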
One subtlety of the selection process for draft tokens is that tokens accepted by the scheduler for verification get two chances of making the final output, i.e., they can be accepted by the verification pass or potentially generated during decode by the target model after token rejection, while tokens rejected by the scheduler only make the final output if they are generated by the target model. Since the models are probabilistic, a naive selection strategy could cause the output with speculative decoding enabled to be drawn from a different distribution than with it disabled, even though no incorrect or nonsensical tokens are generated (every output token has still been verified by the target model). DSpark guards against bias in the admission policy by conforming to the non-anticipating property, i.e., preventing later tokens in a batch from affecting the selection probability of earlier tokens. One mechanism to achieve this is to use a greedy early exit the moment admitting a token is no longer estimated to improve performance, so later tokens are not processed and do not influence the admission decision.
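The greedy early exit can be sketched as a left-to-right scan that stops at the first token not worth admitting. The benefit estimate used here (probability that the whole prefix is accepted, compared against a hypothetical `marginal_cost` threshold) is an assumption for illustration, not the estimator described for DSpark.

```python
from typing import Iterable


def admit_prefix(confidences: Iterable[float], marginal_cost: float) -> int:
    """Return how many leading draft tokens to admit for verification.

    Tokens are examined strictly in order. The scan exits as soon as the
    estimated benefit of admitting the next token (assumed: probability that
    it and all earlier tokens are accepted) no longer exceeds marginal_cost.
    Tokens after the exit point are never read, so they cannot influence the
    decision made for earlier tokens.
    """
    admitted = 0
    prefix_prob = 1.0
    for confidence in confidences:
        prefix_prob *= confidence
        if prefix_prob <= marginal_cost:
            break  # early exit: later tokens are not inspected
        admitted += 1
    return admitted
```

Because the decision for each position depends only on that position and the ones before it, changing the confidence of a later token leaves the admitted prefix unchanged.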
Speculative decoding trades additional compute (generated draft tokens are often thrown away) for better memory bandwidth utilization (multiple tokens can be accepted in a single pass). Consequently, the greatest benefits are when the model is memory-bound, i.e., large models at small batch sizes. As batch sizes increase, the workload shifts towards being compute-bound and the benefits decrease. Despite this, recent studies have shown that it benefits throughput for several evaluated workloads. Speculative decoding is an active research topic and there’s a comprehensive survey of the different techniques here.
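The shift from memory-bound to compute-bound can be illustrated with a rough roofline-style estimate. The sketch assumes about 2 FLOPs per parameter per token and ignores attention and KV cache traffic; the hardware and model numbers in the usage lines are illustrative, not measurements of any specific device.

```python
def step_time_s(params: float, bytes_per_param: float, batch_tokens: int,
                bandwidth_bytes_per_s: float, peak_flops: float) -> float:
    """Rough lower bound on the time of one forward pass over batch_tokens tokens."""
    # Every weight is streamed from HBM once per pass, regardless of batch size.
    memory_time = params * bytes_per_param / bandwidth_bytes_per_s
    # Assumed cost model: roughly 2 FLOPs per parameter per token.
    compute_time = 2.0 * params * batch_tokens / peak_flops
    return max(memory_time, compute_time)


if __name__ == "__main__":
    # Illustrative values: 8e9 parameters at 2 bytes each, 2e12 B/s of
    # bandwidth, 4e14 FLOP/s of peak compute, 5 tokens verified per sequence.
    hardware = dict(params=8e9, bytes_per_param=2.0,
                    bandwidth_bytes_per_s=2e12, peak_flops=4e14)
    for sequences in (8, 128):
        plain = step_time_s(batch_tokens=sequences, **hardware)
        speculative = step_time_s(batch_tokens=sequences * 5, **hardware)
        print(sequences, plain, speculative)
```

Under these assumptions, verifying extra tokens adds no time while the pass is limited by weight streaming, but lengthens the pass once the token count pushes compute time above memory time.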