Conversation

@jukofyork (Collaborator) commented Nov 5, 2025

This is very much a draft PR, but I'm trying to drum up some interest in the idea...

To use this PR you need to pass an array of costs using the GGML_BATCH_COSTS environment variable, eg:

export GGML_BATCH_COSTS="1,0.605,0.473,0.408,0.374,0.351,0.334,0.32,0.313,0.3,0.289,0.287,0.278,0.272,0.269,0.27,0.265,0.261,0.26,0.256,0.254,0.253,0.251,0.248,0.248,0.247,0.245,0.247,0.244,0.243,0.243,0.243,0.242,0.241,0.239,0.239,0.238,0.238,0.238,0.237,0.238,0.238,0.234,0.235,0.234,0.233,0.232,0.233,0.233,0.232,0.232,0.231,0.23,0.231,0.231,0.231,0.23,0.23,0.229,0.229,0.23,0.229,0.228,0.228"

which can be plotted:

[Figure 1: normalized batch cost vs. batch size, plotted from the data below]
using this script:
import matplotlib.pyplot as plt
import numpy as np

data = [
  {
      "name": "DeepSeek-R1-0528 (NUMA)",
      "main": [1, 0.606, 0.472, 0.415, 0.378, 0.362, 0.339, 0.331, 0.321, 0.311, 0.298, 0.296, 0.287, 0.281, 0.277, 0.279, 0.275, 0.269, 0.269, 0.265, 0.262, 0.262, 0.259, 0.257, 0.256, 0.256, 0.254, 0.255, 0.252, 0.251, 0.252, 0.252, 0.25, 0.249, 0.247, 0.248, 0.246, 0.246, 0.246, 0.245, 0.246, 0.244, 0.243, 0.243, 0.242, 0.241, 0.24, 0.241, 0.241, 0.24, 0.24, 0.239, 0.238, 0.239, 0.239, 0.239, 0.238, 0.238, 0.236, 0.237, 0.237, 0.237, 0.236, 0.236]
  },
  {
      "name": "Kimi-K2-Instruct (NUMA)",
      "main": [1, 0.576, 0.446, 0.394, 0.353, 0.334, 0.321, 0.307, 0.301, 0.291, 0.287, 0.28, 0.276, 0.275, 0.27, 0.267, 0.267, 0.266, 0.264, 0.26, 0.26, 0.259, 0.258, 0.256, 0.256, 0.255, 0.254, 0.253, 0.252, 0.252, 0.25, 0.252, 0.249, 0.251, 0.249, 0.248, 0.247, 0.247, 0.246, 0.247, 0.246, 0.246, 0.246, 0.245, 0.245, 0.244, 0.244, 0.244, 0.243, 0.243, 0.243, 0.243, 0.243, 0.242, 0.242, 0.242, 0.242, 0.242, 0.241, 0.241, 0.241, 0.241, 0.24, 0.24]
  },
  {
      "name": "GLM-4.6 (RPC + PR 15405)",
      "main": [1, 1.027, 0.712, 0.544, 0.459, 0.389, 0.347, 0.315, 0.281, 0.256, 0.233, 0.214, 0.199, 0.186, 0.178, 0.166, 0.161, 0.153, 0.144, 0.139, 0.132, 0.128, 0.124, 0.12, 0.119, 0.115, 0.112, 0.107, 0.105, 0.102, 0.099, 0.097, 0.1, 0.098, 0.096, 0.093, 0.091, 0.09, 0.088, 0.086, 0.087, 0.086, 0.084, 0.083, 0.082, 0.08, 0.079, 0.077, 0.082, 0.082, 0.08, 0.08, 0.079, 0.077, 0.077, 0.075, 0.074, 0.073, 0.073, 0.072, 0.071, 0.071, 0.069, 0.069]
  },
  {
      "name": "Mistral-Large-Instruct-2411",
      "main": [1, 0.531, 0.397, 0.384, 0.334, 0.36, 0.364, 0.328, 0.178, 0.16, 0.146, 0.134, 0.124, 0.115, 0.108, 0.101, 0.089, 0.085, 0.08, 0.077, 0.073, 0.07, 0.067, 0.064, 0.059, 0.057, 0.055, 0.053, 0.052, 0.05, 0.049, 0.047, 0.048, 0.046, 0.045, 0.044, 0.043, 0.042, 0.041, 0.04, 0.039, 0.039, 0.038, 0.037, 0.037, 0.036, 0.035, 0.035, 0.039, 0.037, 0.037, 0.037, 0.036, 0.035, 0.034, 0.034, 0.034, 0.032, 0.033, 0.032, 0.031, 0.031, 0.031, 0.03]
  },
  {
      "name": "command-a-03-2025",
      "main": [1, 0.523, 0.381, 0.368, 0.329, 0.341, 0.338, 0.329, 0.167, 0.15, 0.137, 0.126, 0.116, 0.108, 0.101, 0.095, 0.088, 0.084, 0.079, 0.076, 0.072, 0.069, 0.066, 0.064, 0.057, 0.055, 0.053, 0.051, 0.05, 0.048, 0.047, 0.045, 0.046, 0.045, 0.044, 0.042, 0.041, 0.04, 0.04, 0.038, 0.039, 0.038, 0.037, 0.036, 0.036, 0.035, 0.034, 0.034, 0.038, 0.037, 0.036, 0.035, 0.034, 0.035, 0.034, 0.033, 0.032, 0.033, 0.031, 0.032, 0.031, 0.031, 0.031, 0.029]
  },
]

plt.figure(figsize=(10, 6))
x = np.arange(1, 65)  # Batch sizes from 1 to 64

# Plot each series with different markers
markers = ['o', 's', '^', 'D', 'v', 'p']
for i, series in enumerate(data):
  normalized = [v / series["main"][0] for v in series["main"]]
  plt.plot(x, normalized, marker=markers[i], linestyle='-', 
           markersize=5, label=series["name"])

plt.xlabel('Batch Size', fontsize=12)
plt.ylabel('Normalized Cost (relative to batch size 1)', fontsize=12)
plt.title('Batch Cost Scaling', fontsize=14, pad=20)
plt.legend(loc='upper right', bbox_to_anchor=(1, 1), framealpha=1)
plt.grid(True, alpha=0.3)
plt.xlim(1, 64)  # Ensure x-axis starts at 1
plt.xticks(np.arange(1, 64, 2))  # Show every other batch size
plt.tight_layout()
plt.show()

The way to read this is:

  • "computing a batch of 2 tokens costs 60.5% of the cost of computing the 2 tokens separately"
  • "computing a batch of 3 tokens costs 47.3% of the cost of computing the 3 tokens separately"
  • and so on...

These are then used (in place of --draft-p-min) to decide whether a draft sequence is predicted to have positive expected value, eg (a short sketch follows the list below):

  • If your current sequence of 2 tokens has a predicted probability (of the whole draft being accepted) of > 60.5%, then this is a +EV "gamble" and worth trying.
  • If your current sequence of 3 tokens has a predicted probability of < 47.3%, then this is a -EV "gamble" and not worth trying.
  • and so on...
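
As a purely illustrative example (the names below are mine, not the PR's code), the break-even test just compares the predicted acceptance probability against the profiled cost for that batch size:

# First few entries of the GGML_BATCH_COSTS array from the DeepSeek-R1 profile above;
# costs[n - 1] is the relative cost of computing a batch of n tokens.
costs = [1.0, 0.605, 0.473, 0.408]

def is_plus_ev(p_accepted: float, n: int) -> bool:
    """A draft filling a batch of n tokens is a +EV gamble if its predicted
    probability of being accepted exceeds the relative cost of that batch."""
    return p_accepted > costs[n - 1]

print(is_plus_ev(0.70, 2))  # True:  0.70 > 0.605
print(is_plus_ev(0.40, 3))  # False: 0.40 < 0.473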

I've tried quite a few variations on this, and the current version is as simple as possible, but with one caveat:

  • If you look at the graph for GLM-4.6 over RPC above, you can see that drafting a batch of 2 tokens is NEVER +EV, so the current code's logic of breaking as soon as a -EV batch is seen would mean we never try anything!
  • If you look at the graphs for Mistral-Large-Instruct-2411 and command-a-03-2025 you can see that looking ahead several sizes might be beneficial due to the weird "jaggedness" between batch sizes 3 and 8...

So this means we need a second parameter passed via the GGML_MAX_LOOK_AHEAD environment variable, eg:

For GLM-4.6 over RPC:

export GGML_MAX_LOOK_AHEAD=1

For Mistral-Large-Instruct-2411:

export GGML_MAX_LOOK_AHEAD=6

NOTE: The default value of GGML_MAX_LOOK_AHEAD is zero, and as long as your graph looks to be decaying monotonically it seems best to just leave it at the default (see the sketch below for how the look-ahead interacts with the costs)...
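
To make the interaction between GGML_BATCH_COSTS and GGML_MAX_LOOK_AHEAD concrete, here is a minimal Python sketch of the selection loop as I read the description above (the function name, signature and exact stopping rule are illustrative only, not the PR's actual code):

def pick_draft_length(seq_prob, costs, max_look_ahead=0):
    """seq_prob[n - 1] is the predicted probability that a draft filling a batch
    of n tokens is fully accepted, and costs[n - 1] is the relative cost of that
    batch (the GGML_BATCH_COSTS array). Returns the largest +EV batch size,
    scanning past at most max_look_ahead consecutive -EV sizes before giving up."""
    best = misses = 0
    for n in range(2, len(costs) + 1):  # a batch of 1 can never beat its own cost of 1.0
        if seq_prob[n - 1] > costs[n - 1]:
            best, misses = n, 0         # remember the largest +EV size seen so far
        else:
            misses += 1
            if misses > max_look_ahead:
                break
    return best

# GLM-4.6-over-RPC-style costs: a batch of 2 is never +EV (its cost is > 1), so with
# the default max_look_ahead=0 nothing would ever be drafted, but max_look_ahead=1
# lets the search skip past it:
costs    = [1.0, 1.027, 0.712, 0.544, 0.459]
seq_prob = [1.0, 0.95,  0.90,  0.80,  0.40]
print(pick_draft_length(seq_prob, costs, max_look_ahead=0))  # -> 0
print(pick_draft_length(seq_prob, costs, max_look_ahead=1))  # -> 4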


You can try to guess generic values by trial and error, but as the graphs plotted above show, this is unlikely to work (at all!), which is why I'm calling this "Profile Guided Speculative Decoding".

Here are 3 example scripts that I used to create the sets of values I plotted:

Basic example (ie: single machine, no NUMA):

#!/bin/bash

# Configuration variables
BATCHED_BENCH_PATH="$HOME/llama.cpp/build/bin/llama-batched-bench"
MODEL_PATH="$HOME/models/gguf/command-a-03-2025-Q5_X.gguf"

# Benchmark parameters
PROMPT_SIZE=0
MAX_DRAFT_SIZE=64
NUM_SAMPLES=4

# Model-specific parameters
MODEL_PARAMS="--n-gpu-layers 99 \
              --tensor-split 44,45 \
              --flash-attn 1"

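# NOTE: The -npl argument below expands to the list "1,2,...,MAX_DRAFT_SIZE" repeated
#       (NUM_SAMPLES + 1) times, so after the warmup set is discarded each batch size
#       has NUM_SAMPLES measurements to average.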
# Run the benchmark (NOTE: Add an extra set of samples, as the model needs to warm up to give accurate stats!)
echo "- Running benchmark for $(basename "$MODEL_PATH")..."
$BATCHED_BENCH_PATH \
  --model "$MODEL_PATH" \
  $MODEL_PARAMS \
  -pps \
  -npp "$PROMPT_SIZE" \
  -npl $(printf "%s," $(for i in $(seq 1 $((NUM_SAMPLES + 1))); do seq 1 $MAX_DRAFT_SIZE; done) | sed 's/,$//') \
  -ntg 1 \
  --output-format jsonl \
  | tee benchmark.log

# Extract the JSONL lines from the log (NOTE: This skips the extra set of samples we added above)
echo -n "- Extracting results from log..." >&2
cat benchmark.log | grep '^{' | tail -n "+$((MAX_DRAFT_SIZE + 1))" > results.jsonl
echo "Done ($(wc -l < results.jsonl) results extracted)." >&2

# Print the batch costs (relative to batch size of 1) rounded to 3dp
echo "----------------------------------------"
cat results.jsonl | jq -rs '
  (map(.pl) | max) as $max_pl |
  reduce .[] as $item (
    {};
    ($item.pl | tostring) as $pl |
    .[$pl].sum = (.[$pl].sum + $item.speed_tg) |
    .[$pl].count = (.[$pl].count + 1)
  ) |
  [range(1; $max_pl + 1) as $pl |
    (.[($pl|tostring)]).sum / .[($pl|tostring)].count
  ] as $values |
  $values[0] as $first |
  $values | map(. / $first | 1 / .) |
  map(. * 1000 | round | . / 1000) |
  "export GGML_BATCH_COSTS=\"" + (join(",")) + "\""
'
echo "----------------------------------------"

# Clean previous files
rm -f benchmark.log results.jsonl

RPC over 3 nodes:

#!/bin/bash

# Configuration variables
BATCHED_BENCH_PATH="$HOME/llama.cpp/build/bin/llama-batched-bench"
MODEL_PATH="$HOME/models/gguf/GLM-4.6-Q5_X.gguf"

# Benchmark parameters
PROMPT_SIZE=0
MAX_DRAFT_SIZE=64
NUM_SAMPLES=4

RPC_SERVERS="192.168.1.2:50052,192.168.1.3:50052"

TENSOR_SPLIT="16,15,15,15,15,17"

# Model-specific parameters
MODEL_PARAMS="--n-gpu-layers 99 \
              --flash-attn 1 \
              --rpc $RPC_SERVERS \
              --device CUDA0,RPC0,RPC1,RPC2,RPC3,CUDA1 \
              --tensor-split $TENSOR_SPLIT"

# Run the benchmark (NOTE: Add an extra set of samples, as the model needs to warm up to give accurate stats!)
echo "- Running benchmark for $(basename "$MODEL_PATH")..."
$BATCHED_BENCH_PATH \
  --model "$MODEL_PATH" \
  $MODEL_PARAMS \
  -pps \
  -npp "$PROMPT_SIZE" \
  -npl $(printf "%s," $(for i in $(seq 1 $((NUM_SAMPLES + 1))); do seq 1 $MAX_DRAFT_SIZE; done) | sed 's/,$//') \
  -ntg 1 \
  --output-format jsonl \
  | tee benchmark.log

# Extract the JSONL lines from the log (NOTE: This skips the extra set of samples we added above)
echo -n "- Extracting results from log..." >&2
cat benchmark.log | grep '^{' | tail -n "+$((MAX_DRAFT_SIZE + 1))" > results.jsonl
echo "Done ($(wc -l < results.jsonl) results extracted)." >&2

# Print the batch costs (relative to batch size of 1) rounded to 3dp
echo "----------------------------------------"
cat results.jsonl | jq -rs '
  (map(.pl) | max) as $max_pl |
  reduce .[] as $item (
    {};
    ($item.pl | tostring) as $pl |
    .[$pl].sum = (.[$pl].sum + $item.speed_tg) |
    .[$pl].count = (.[$pl].count + 1)
  ) |
  [range(1; $max_pl + 1) as $pl |
    (.[($pl|tostring)]).sum / .[($pl|tostring)].count
  ] as $values |
  $values[0] as $first |
  $values | map(. / $first | 1 / .) |
  map(. * 1000 | round | . / 1000) |
  "export GGML_BATCH_COSTS=\"" + (join(",")) + "\""
'
echo "----------------------------------------"

# Clean previous files
rm -f benchmark.log results.jsonl

Using NUMA, --override-tensor and CUDA_VISIBLE_DEVICES=0:

NOTE: Mind the use of --no-op-offload for the test! To use the full range of batch sizes up to 64, you will likely need my other hack from #17026 (comment), or to limit your maximum draft size to 32 via --draft-max 32 when running this PR...

#!/bin/bash

# Environment variables
export CUDA_VISIBLE_DEVICES=0

# Configuration variables
BATCHED_BENCH_PATH="$HOME/llama.cpp/build/bin/llama-batched-bench"
MODEL_PATH="$HOME/models/gguf/DeepSeek-R1-0528-Q4_X.gguf"

# Benchmark parameters
PROMPT_SIZE=0
MAX_DRAFT_SIZE=64
NUM_SAMPLES=4

# Model-specific parameters
MODEL_PARAMS="--n-gpu-layers 99 \
              --flash-attn 1 \
              --numa distribute \
              --threads $(nproc) \
              --override-tensor exps=CPU \
              --no-op-offload"

# Run the benchmark (NOTE: Add an extra set of samples, as the model needs to warm up to give accurate stats!)
echo "- Running benchmark for $(basename "$MODEL_PATH")..."
$BATCHED_BENCH_PATH \
  --model "$MODEL_PATH" \
  $MODEL_PARAMS \
  -pps \
  -npp "$PROMPT_SIZE" \
  -npl $(printf "%s," $(for i in $(seq 1 $((NUM_SAMPLES + 1))); do seq 1 $MAX_DRAFT_SIZE; done) | sed 's/,$//') \
  -ntg 1 \
  --output-format jsonl \
  | tee benchmark.log

# Extract the JSONL lines from the log (NOTE: This skips the extra set of samples we added above)
echo -n "- Extracting results from log..." >&2
cat benchmark.log | grep '^{' | tail -n "+$((MAX_DRAFT_SIZE + 1))" > results.jsonl
echo "Done ($(wc -l < results.jsonl) results extracted)." >&2

# Print the batch costs (relative to batch size of 1) rounded to 3dp
echo "----------------------------------------"
cat results.jsonl | jq -rs '
  (map(.pl) | max) as $max_pl |
  reduce .[] as $item (
    {};
    ($item.pl | tostring) as $pl |
    .[$pl].sum = (.[$pl].sum + $item.speed_tg) |
    .[$pl].count = (.[$pl].count + 1)
  ) |
  [range(1; $max_pl + 1) as $pl |
    (.[($pl|tostring)]).sum / .[($pl|tostring)].count
  ] as $values |
  $values[0] as $first |
  $values | map(. / $first | 1 / .) |
  map(. * 1000 | round | . / 1000) |
  "export GGML_BATCH_COSTS=\"" + (join(",")) + "\""
'
echo "----------------------------------------"

# Clean previous files
rm -f benchmark.log results.jsonl

NOTES:

  • You will need jq installed to extract the values at the end of the script (a Python equivalent of the jq post-processing is sketched after these notes).
  • You will probably need to customise your own script to try to run llama-batched-bench as similarly as possible to how you intend to use the target model.
  • It's important to average several runs (eg: NUM_SAMPLES=4) and let it discard the first set of batch results (as the first run is clearly biased for some reason).
  • Using PROMPT_SIZE > 0 doesn't seem to make much difference to the results, but it will make the script take much longer.
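
For reference, the jq post-processing in the scripts above should be equivalent to this Python sketch (assuming the results.jsonl lines contain the pl and speed_tg fields used by the jq filter):

import json
from collections import defaultdict

# Average the generation speed for each batch size (pl), then express the cost of an
# n-token batch relative to n single-token batches: cost[n] = avg_speed(1) / avg_speed(n).
sums, counts = defaultdict(float), defaultdict(int)
with open("results.jsonl") as f:
    for line in f:
        item = json.loads(line)
        sums[item["pl"]] += item["speed_tg"]
        counts[item["pl"]] += 1

avg = [sums[n] / counts[n] for n in range(1, max(sums) + 1)]
costs = [round(avg[0] / v, 3) for v in avg]
print('export GGML_BATCH_COSTS="' + ",".join(map(str, costs)) + '"')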

USE:

  • The --draft-p-min option will be completely ignored and should not be used with this.
  • I suggest you set --draft-max 64 and let the expected value calculation do the work (or --draft-max 32 if offloading the experts without my other hack [see above]).
  • The --draft-min option is also redundant, but be sure to set GGML_MAX_LOOK_AHEAD if needed [see above].
  • I am not 100% sure the existing logic with reusing via prompt_dft.push_back(id) works entirely correctly with this PR - the code is very dense and it's hard to see exactly what the effect of my result.resize(best_size) code is.

If you get it all working correctly, then running with a script like this:

#!/bin/bash

export GGML_BATCH_COSTS="1,0.605,0.473,0.408,0.374,0.351,0.334,0.32,0.313,0.3,0.289,0.287,0.278,0.272,0.269,0.27,0.265,0.261,0.26,0.256,0.254,0.253,0.251,0.248,0.248,0.247,0.245,0.247,0.244,0.243,0.243,0.243,0.242,0.241,0.239,0.239,0.238,0.238,0.238,0.237,0.238,0.238,0.234,0.235,0.234,0.233,0.232,0.233,0.233,0.232,0.232,0.231,0.23,0.231,0.231,0.231,0.23,0.23,0.229,0.229,0.23,0.229,0.228,0.228"

~/llama.cpp/build/bin/llama-server ... \
        --model-draft ~/models/gguf/draft_models/DeepSeek-R1-DRAFT-0.6B-64k-Q4_0.gguf \
        --gpu-layers-draft 99 \
        --draft-max 64 \
        --top-k 1 \
        --samplers "top_k"

should give a very large boost for high "draftability" prompts like "refactor this code" or "reword this report", and almost no degradation in TG tokens/s for low "draftability" prompts (depending on how "steppy" your array of values is, and assuming your draft:target active-parameter ratio is small [eg: 0.5B draft for 30B+ ideally]).

@jukofyork (Collaborator, Author) commented Nov 5, 2025

Server / server-windows (pull_request): Failing after 7m

This may not be a false positive and is probably down to the way I've hacked out the --draft-p-min option, and/or a bug related to this:

I am not 100% sure the existing logic with reusing via prompt_dft.push_back(id) works entirely correctly with this PR - the code is very dense and it's hard to see exactly what the effect of my result.resize(best_size) code is.

(this really needs investigating properly, but the code is very dense and it's not 100% clear exactly what is getting saved between re-entrant calls... I think it should be correct unless you use GGML_MAX_LOOK_AHEAD, but I'm not 100% sure...)

If this gets plenty of interest and seems to work well for other people, I'll definitely polish it up and make a proper PR with proper command line options - I just don't want to add those yet and then have to rebase every time somebody else adds a new argument...

I've actually tried a lot of different variations on calculating the expected value, look-ahead heuristics, and so on, but this is the simplest version that seems to work well (it might later be worth re-investigating ideas like this).
