Support Eagle-3 Speculative Decoding in llama.cpp #15902
-
Absolutely necessary!
-
How important do you think this step is? My experience from some time ago with tree-based speculative decoding is that it does not bring much benefit over basic, linear speculative sampling, while complicating the logic quite a bit.
Does the speculative layer maintain any additional state? Or is it purely a function of the 3 hidden states and the last token embedding?
-
Eagle-3 is currently the SOTA algorithm for speculative decoding, as demonstrated by Spec-Bench and the Eagle-3 paper. However, llama.cpp does not yet support it, while other major LLM inference frameworks such as TRT-LLM, vLLM, and SGLang already do, achieving roughly a 2-2.5x performance boost over native autoregressive decoding.
Furthermore, there are already several PRs and issues about implementing Eagle-3 in llama.cpp: #13908, #15305
Many models with Eagle-3 checkpoints are already available on Hugging Face (link), and users can also fine-tune their own Eagle-3 checkpoints using TensorRT-Model-Optimizer.
Based on the above, I see a significant need to implement Eagle-3 in llama.cpp to potentially make LLM inference faster and llama.cpp more competitive. Therefore, I would like to initiate a discussion with the llama.cpp team to align on the goals and implementation.
To implement Eagle-3 in llama.cpp, several components need to be addressed (this outline may not be 100% accurate, and I am happy to receive feedback on it):
Workflow: During inference, we need to record low-, middle-, and high-level features (the hidden states after the first, middle, and last decoder layers) in the forward pass of the target model. We then combine those hidden states with the last token embedding and feed the result to the speculative layer. The speculative layer generates a sequence of draft tokens autoregressively, which the target model verifies in parallel.
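The workflow above can be sketched roughly as follows. This is a toy, self-contained illustration, not llama.cpp code: the target and draft models are stubs with made-up shapes, and `target_forward`, `draft_forward`, `embed`, and `speculate` are hypothetical names. It only shows the data flow (capture three hidden states, fuse them with the last token embedding, draft a chain of tokens); a real Eagle-3 draft layer would also feed its own hidden state forward between draft steps.

```python
import random

HIDDEN = 4  # toy hidden size; real models use thousands of dimensions
random.seed(0)

def target_forward(token):
    """Stub target model step: returns (next_token, [low, mid, high]).
    In a real implementation the three vectors are the hidden states
    captured after the first, middle, and last decoder layers."""
    h = lambda: [random.random() for _ in range(HIDDEN)]
    return (token + 1) % 100, [h(), h(), h()]

def embed(token):
    """Stub token-embedding lookup."""
    return [token / 100.0] * HIDDEN

def draft_forward(fused, token_emb):
    """Stub Eagle-3 speculative layer: a pure function of the fused
    target features and the last token's embedding (no extra state)."""
    score = sum(fused) + sum(token_emb)
    return int(score * 1000) % 100

def speculate(last_token, hidden_states, n_draft=4):
    """Autoregressively generate a chain of draft tokens from the
    features recorded during the target model's forward pass."""
    fused = [x for h in hidden_states for x in h]  # concat low/mid/high
    drafts, tok = [], last_token
    for _ in range(n_draft):
        tok = draft_forward(fused, embed(tok))
        drafts.append(tok)
    return drafts

# One decode step: the target forward pass records the three hidden
# states, then the draft layer proposes tokens for batched verification.
tok, states = target_forward(42)
drafts = speculate(tok, states, n_draft=4)
print(len(drafts))  # 4 draft tokens, to be verified in one target batch
```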
Since the Eagle-3 checkpoint is model-specific, I propose to start with llama3. I would appreciate your feedback on this.