Conversation

@jukofyork (Collaborator) commented Nov 5, 2025

This is very much a draft PR, but I'm trying to drum up some interest in the idea...

To use this PR you need to pass an array of costs using the GGML_BATCH_COSTS environment variable, eg:

export GGML_BATCH_COSTS="1,0.605,0.473,0.408,0.374,0.351,0.334,0.32,0.313,0.3,0.289,0.287,0.278,0.272,0.269,0.27,0.265,0.261,0.26,0.256,0.254,0.253,0.251,0.248,0.248,0.247,0.245,0.247,0.244,0.243,0.243,0.243,0.242,0.241,0.239,0.239,0.238,0.238,0.238,0.237,0.238,0.238,0.234,0.235,0.234,0.233,0.232,0.233,0.233,0.232,0.232,0.231,0.23,0.231,0.231,0.231,0.23,0.23,0.229,0.229,0.23,0.229,0.228,0.228"

which can be plotted:

[Figure 1: normalized batch cost vs. batch size, plotted from the data below]
using this script:
import matplotlib.pyplot as plt
import numpy as np

data = [
  {
      "name": "DeepSeek-R1-0528 (NUMA)",
      "main": [1, 0.606, 0.472, 0.415, 0.378, 0.362, 0.339, 0.331, 0.321, 0.311, 0.298, 0.296, 0.287, 0.281, 0.277, 0.279, 0.275, 0.269, 0.269, 0.265, 0.262, 0.262, 0.259, 0.257, 0.256, 0.256, 0.254, 0.255, 0.252, 0.251, 0.252, 0.252, 0.25, 0.249, 0.247, 0.248, 0.246, 0.246, 0.246, 0.245, 0.246, 0.244, 0.243, 0.243, 0.242, 0.241, 0.24, 0.241, 0.241, 0.24, 0.24, 0.239, 0.238, 0.239, 0.239, 0.239, 0.238, 0.238, 0.236, 0.237, 0.237, 0.237, 0.236, 0.236]
  },
  {
      "name": "Kimi-K2-Instruct (NUMA)",
      "main": [1, 0.576, 0.446, 0.394, 0.353, 0.334, 0.321, 0.307, 0.301, 0.291, 0.287, 0.28, 0.276, 0.275, 0.27, 0.267, 0.267, 0.266, 0.264, 0.26, 0.26, 0.259, 0.258, 0.256, 0.256, 0.255, 0.254, 0.253, 0.252, 0.252, 0.25, 0.252, 0.249, 0.251, 0.249, 0.248, 0.247, 0.247, 0.246, 0.247, 0.246, 0.246, 0.246, 0.245, 0.245, 0.244, 0.244, 0.244, 0.243, 0.243, 0.243, 0.243, 0.243, 0.242, 0.242, 0.242, 0.242, 0.242, 0.241, 0.241, 0.241, 0.241, 0.24, 0.24]
  },
  {
      "name": "GLM-4.6 (RPC + PR 15405)",
      "main": [1, 1.027, 0.712, 0.544, 0.459, 0.389, 0.347, 0.315, 0.281, 0.256, 0.233, 0.214, 0.199, 0.186, 0.178, 0.166, 0.161, 0.153, 0.144, 0.139, 0.132, 0.128, 0.124, 0.12, 0.119, 0.115, 0.112, 0.107, 0.105, 0.102, 0.099, 0.097, 0.1, 0.098, 0.096, 0.093, 0.091, 0.09, 0.088, 0.086, 0.087, 0.086, 0.084, 0.083, 0.082, 0.08, 0.079, 0.077, 0.082, 0.082, 0.08, 0.08, 0.079, 0.077, 0.077, 0.075, 0.074, 0.073, 0.073, 0.072, 0.071, 0.071, 0.069, 0.069]
  },
  {
      "name": "Mistral-Large-Instruct-2411",
      "main": [1, 0.531, 0.397, 0.384, 0.334, 0.36, 0.364, 0.328, 0.178, 0.16, 0.146, 0.134, 0.124, 0.115, 0.108, 0.101, 0.089, 0.085, 0.08, 0.077, 0.073, 0.07, 0.067, 0.064, 0.059, 0.057, 0.055, 0.053, 0.052, 0.05, 0.049, 0.047, 0.048, 0.046, 0.045, 0.044, 0.043, 0.042, 0.041, 0.04, 0.039, 0.039, 0.038, 0.037, 0.037, 0.036, 0.035, 0.035, 0.039, 0.037, 0.037, 0.037, 0.036, 0.035, 0.034, 0.034, 0.034, 0.032, 0.033, 0.032, 0.031, 0.031, 0.031, 0.03]
  },
  {
      "name": "command-a-03-2025",
      "main": [1, 0.523, 0.381, 0.368, 0.329, 0.341, 0.338, 0.329, 0.167, 0.15, 0.137, 0.126, 0.116, 0.108, 0.101, 0.095, 0.088, 0.084, 0.079, 0.076, 0.072, 0.069, 0.066, 0.064, 0.057, 0.055, 0.053, 0.051, 0.05, 0.048, 0.047, 0.045, 0.046, 0.045, 0.044, 0.042, 0.041, 0.04, 0.04, 0.038, 0.039, 0.038, 0.037, 0.036, 0.036, 0.035, 0.034, 0.034, 0.038, 0.037, 0.036, 0.035, 0.034, 0.035, 0.034, 0.033, 0.032, 0.033, 0.031, 0.032, 0.031, 0.031, 0.031, 0.029]
  },
]

plt.figure(figsize=(10, 6))
x = np.arange(1, 65)  # Batch sizes from 1 to 64

# Plot each series with different markers
markers = ['o', 's', '^', 'D', 'v', 'p']
for i, series in enumerate(data):
  normalized = [v / series["main"][0] for v in series["main"]]
  plt.plot(x, normalized, marker=markers[i], linestyle='-', 
           markersize=5, label=series["name"])

plt.xlabel('Batch Size', fontsize=12)
plt.ylabel('Normalized Cost (relative to batch size 1)', fontsize=12)
plt.title('Batch Cost Scaling', fontsize=14, pad=20)
plt.legend(loc='upper right', bbox_to_anchor=(1, 1), framealpha=1)
plt.grid(True, alpha=0.3)
plt.xlim(1, 64)  # Ensure x-axis starts at 1
plt.xticks(np.arange(1, 64, 2))  # Show every other batch size
plt.tight_layout()
plt.show()

The way to read this is:

  • "computing a batch of 2 tokens costs 60.5% of the cost of computing the 2 tokens separately"
  • "computing a batch of 3 tokens costs 47.3% of the cost of computing the 3 tokens separately"
  • and so on...

These are then used (in place of --draft-p-min) to decide whether a draft sequence is predicted to have positive expected value, eg (a short sketch follows the list below):

  • If your current sequence of 2 tokens has a predicted probability (of the whole draft being accepted) of > 60.5%, then this is a +EV "gamble" and worth trying.
  • If your current sequence of 3 tokens has a predicted probability of < 47.3%, then this is a -EV "gamble" and not worth trying.
  • and so on...
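
As a purely illustrative example (the names below are mine, not the PR's code), the break-even test just compares the predicted acceptance probability against the profiled cost for that batch size:

# First few entries of the GGML_BATCH_COSTS array from the DeepSeek-R1 profile above;
# costs[n - 1] is the relative cost of computing a batch of n tokens.
costs = [1.0, 0.605, 0.473, 0.408]

def is_plus_ev(p_accepted: float, n: int) -> bool:
    """A draft filling a batch of n tokens is a +EV gamble if its predicted
    probability of being accepted exceeds the relative cost of that batch."""
    return p_accepted > costs[n - 1]

print(is_plus_ev(0.70, 2))  # True:  0.70 > 0.605
print(is_plus_ev(0.40, 3))  # False: 0.40 < 0.473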

I've tried quite a few variations on this, and the current version is as simple as possible, but with one caveat:

  • If you look at the graph for GLM-4.6 over RPC above, you can see that drafting a batch of 2 tokens is NEVER +EV, so the current code's logic of breaking as soon as a -EV batch is seen would mean we never try anything!
  • If you look at the graphs for Mistral-Large-Instruct-2411 and command-a-03-2025 you can see that looking ahead several sizes might be beneficial due to the weird "jaggedness" between batch sizes 3 and 8...

So this means we need a second parameter passed via the GGML_MAX_LOOK_AHEAD environment variable, eg:

For GLM-4.6 over RPC:

export GGML_MAX_LOOK_AHEAD=1

For Mistral-Large-Instruct-2411:

export GGML_MAX_LOOK_AHEAD=6

NOTE: The default value of GGML_MAX_LOOK_AHEAD is zero, and as long as your graph looks to be decaying monotonically it seems best to just leave it at the default (see the sketch below for how the look-ahead interacts with the costs)...
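
To make the interaction between GGML_BATCH_COSTS and GGML_MAX_LOOK_AHEAD concrete, here is a minimal Python sketch of the selection loop as I read the description above (the function name, signature and exact stopping rule are illustrative only, not the PR's actual code):

def pick_draft_length(seq_prob, costs, max_look_ahead=0):
    """seq_prob[n - 1] is the predicted probability that a draft filling a batch
    of n tokens is fully accepted, and costs[n - 1] is the relative cost of that
    batch (the GGML_BATCH_COSTS array). Returns the largest +EV batch size,
    scanning past at most max_look_ahead consecutive -EV sizes before giving up."""
    best = misses = 0
    for n in range(2, len(costs) + 1):  # a batch of 1 can never beat its own cost of 1.0
        if seq_prob[n - 1] > costs[n - 1]:
            best, misses = n, 0         # remember the largest +EV size seen so far
        else:
            misses += 1
            if misses > max_look_ahead:
                break
    return best

# GLM-4.6-over-RPC-style costs: a batch of 2 is never +EV (its cost is > 1), so with
# the default max_look_ahead=0 nothing would ever be drafted, but max_look_ahead=1
# lets the search skip past it:
costs    = [1.0, 1.027, 0.712, 0.544, 0.459]
seq_prob = [1.0, 0.95,  0.90,  0.80,  0.40]
print(pick_draft_length(seq_prob, costs, max_look_ahead=0))  # -> 0
print(pick_draft_length(seq_prob, costs, max_look_ahead=1))  # -> 4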


You can try to guess generic values by trial and error, but as the graphs plotted above show, this is unlikely to work (at all!), which is why I'm calling this "Profile Guided Speculative Decoding".

Here are 3 example scripts that I used to create the sets of values I plotted:

Basic example (ie: single machine, no NUMA):

#!/bin/bash

# Configuration variables
BATCHED_BENCH_PATH="$HOME/llama.cpp/build/bin/llama-batched-bench"
MODEL_PATH="$HOME/models/gguf/command-a-03-2025-Q5_X.gguf"

# Benchmark parameters
PROMPT_SIZE=0
MAX_DRAFT_SIZE=64
NUM_SAMPLES=4

# Model-specific parameters
MODEL_PARAMS="--n-gpu-layers 99 \
              --tensor-split 44,45 \
              --flash-attn 1"

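# NOTE: The -npl argument below expands to the list "1,2,...,MAX_DRAFT_SIZE" repeated
#       (NUM_SAMPLES + 1) times, so after the warmup set is discarded each batch size
#       has NUM_SAMPLES measurements to average.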
# Run the benchmark (NOTE: Add an extra set of samples, as the model needs to warm up to give accurate stats!)
echo "- Running benchmark for $(basename "$MODEL_PATH")..."
$BATCHED_BENCH_PATH \
  --model "$MODEL_PATH" \
  $MODEL_PARAMS \
  -pps \
  -npp "$PROMPT_SIZE" \
  -npl $(printf "%s," $(for i in $(seq 1 $((NUM_SAMPLES + 1))); do seq 1 $MAX_DRAFT_SIZE; done) | sed 's/,$//') \
  -ntg 1 \
  --output-format jsonl \
  | tee benchmark.log

# Extract the JSONL lines from the log (NOTE: This skips the extra set of samples we added above)
echo -n "- Extracting results from log..." >&2
cat benchmark.log | grep '^{' | tail -n "+$((MAX_DRAFT_SIZE + 1))" > results.jsonl
echo "Done ($(wc -l < results.jsonl) results extracted)." >&2

# Print the batch costs (relative to batch size of 1) rounded to 3dp
echo "----------------------------------------"
cat results.jsonl | jq -rs '
  (map(.pl) | max) as $max_pl |
  reduce .[] as $item (
    {};
    ($item.pl | tostring) as $pl |
    .[$pl].sum = (.[$pl].sum + $item.speed_tg) |
    .[$pl].count = (.[$pl].count + 1)
  ) |
  [range(1; $max_pl + 1) as $pl |
    (.[($pl|tostring)]).sum / .[($pl|tostring)].count
  ] as $values |
  $values[0] as $first |
  $values | map(. / $first | 1 / .) |
  map(. * 1000 | round | . / 1000) |
  "export GGML_BATCH_COSTS=\"" + (join(",")) + "\""
'
echo "----------------------------------------"

# Clean previous files
rm -f benchmark.log results.jsonl

RPC over 3 nodes:

#!/bin/bash

# Configuration variables
BATCHED_BENCH_PATH="$HOME/llama.cpp/build/bin/llama-batched-bench"
MODEL_PATH="$HOME/models/gguf/GLM-4.6-Q5_X.gguf"

# Benchmark parameters
PROMPT_SIZE=0
MAX_DRAFT_SIZE=64
NUM_SAMPLES=4

RPC_SERVERS="192.168.1.2:50052,192.168.1.3:50052"

TENSOR_SPLIT="16,15,15,15,15,17"

# Model-specific parameters
MODEL_PARAMS="--n-gpu-layers 99 \
              --flash-attn 1 \
              --rpc $RPC_SERVERS \
              --device CUDA0,RPC0,RPC1,RPC2,RPC3,CUDA1 \
              --tensor-split $TENSOR_SPLIT"

# Run the benchmark (NOTE: Add an extra set of samples, as the model needs to warm up to give accurate stats!)
echo "- Running benchmark for $(basename "$MODEL_PATH")..."
$BATCHED_BENCH_PATH \
  --model "$MODEL_PATH" \
  $MODEL_PARAMS \
  -pps \
  -npp "$PROMPT_SIZE" \
  -npl $(printf "%s," $(for i in $(seq 1 $((NUM_SAMPLES + 1))); do seq 1 $MAX_DRAFT_SIZE; done) | sed 's/,$//') \
  -ntg 1 \
  --output-format jsonl \
  | tee benchmark.log

# Extract the JSONL lines from the log (NOTE: This skips the extra set of samples we added above)
echo -n "- Extracting results from log..." >&2
cat benchmark.log | grep '^{' | tail -n "+$((MAX_DRAFT_SIZE + 1))" > results.jsonl
echo "Done ($(wc -l < results.jsonl) results extracted)." >&2

# Print the batch costs (relative to batch size of 1) rounded to 3dp
echo "----------------------------------------"
cat results.jsonl | jq -rs '
  (map(.pl) | max) as $max_pl |
  reduce .[] as $item (
    {};
    ($item.pl | tostring) as $pl |
    .[$pl].sum = (.[$pl].sum + $item.speed_tg) |
    .[$pl].count = (.[$pl].count + 1)
  ) |
  [range(1; $max_pl + 1) as $pl |
    (.[($pl|tostring)]).sum / .[($pl|tostring)].count
  ] as $values |
  $values[0] as $first |
  $values | map(. / $first | 1 / .) |
  map(. * 1000 | round | . / 1000) |
  "export GGML_BATCH_COSTS=\"" + (join(",")) + "\""
'
echo "----------------------------------------"

# Clean previous files
rm -f benchmark.log results.jsonl

Using NUMA, --override-tensor and CUDA_VISIBLE_DEVICES=0:

NOTE: Mind the use of --no-op-offload for the test! To use the full range of batch sizes up to 64, you will likely need my other hack from #17026 (comment), or to limit your maximum draft size to 32 via --draft-max 32 when running this PR...

#!/bin/bash

# Environment variables
export CUDA_VISIBLE_DEVICES=0

# Configuration variables
BATCHED_BENCH_PATH="$HOME/llama.cpp/build/bin/llama-batched-bench"
MODEL_PATH="$HOME/models/gguf/DeepSeek-R1-0528-Q4_X.gguf"

# Benchmark parameters
PROMPT_SIZE=0
MAX_DRAFT_SIZE=64
NUM_SAMPLES=4

# Model-specific parameters
MODEL_PARAMS="--n-gpu-layers 99 \
              --flash-attn 1 \
              --numa distribute \
              --threads $(nproc) \
              --override-tensor exps=CPU \
              --no-op-offload"

# Run the benchmark (NOTE: Add an extra set of samples, as the model needs to warm up to give accurate stats!)
echo "- Running benchmark for $(basename "$MODEL_PATH")..."
$BATCHED_BENCH_PATH \
  --model "$MODEL_PATH" \
  $MODEL_PARAMS \
  -pps \
  -npp "$PROMPT_SIZE" \
  -npl $(printf "%s," $(for i in $(seq 1 $((NUM_SAMPLES + 1))); do seq 1 $MAX_DRAFT_SIZE; done) | sed 's/,$//') \
  -ntg 1 \
  --output-format jsonl \
  | tee benchmark.log

# Extract the JSONL lines from the log (NOTE: This skips the extra set of samples we added above)
echo -n "- Extracting results from log..." >&2
cat benchmark.log | grep '^{' | tail -n "+$((MAX_DRAFT_SIZE + 1))" > results.jsonl
echo "Done ($(wc -l < results.jsonl) results extracted)." >&2

# Print the batch costs (relative to batch size of 1) rounded to 3dp
echo "----------------------------------------"
cat results.jsonl | jq -rs '
  (map(.pl) | max) as $max_pl |
  reduce .[] as $item (
    {};
    ($item.pl | tostring) as $pl |
    .[$pl].sum = (.[$pl].sum + $item.speed_tg) |
    .[$pl].count = (.[$pl].count + 1)
  ) |
  [range(1; $max_pl + 1) as $pl |
    (.[($pl|tostring)]).sum / .[($pl|tostring)].count
  ] as $values |
  $values[0] as $first |
  $values | map(. / $first | 1 / .) |
  map(. * 1000 | round | . / 1000) |
  "export GGML_BATCH_COSTS=\"" + (join(",")) + "\""
'
echo "----------------------------------------"

# Clean previous files
rm -f benchmark.log results.jsonl

NOTES:

  • You will need jq installed to extract the values at the end of the script (a Python equivalent of the jq post-processing is sketched after these notes).
  • You will probably need to customise your own script to try to run llama-batched-bench as similarly as possible to how you intend to use the target model.
  • It's important to average several runs (eg: NUM_SAMPLES=4) and let it discard the first set of batch results (as the first run is clearly biased for some reason).
  • Using PROMPT_SIZE > 0 doesn't seem to make much difference to the results, but it will make the script take much longer.
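
For reference, the jq post-processing in the scripts above should be equivalent to this Python sketch (assuming the results.jsonl lines contain the pl and speed_tg fields used by the jq filter):

import json
from collections import defaultdict

# Average the generation speed for each batch size (pl), then express the cost of an
# n-token batch relative to n single-token batches: cost[n] = avg_speed(1) / avg_speed(n).
sums, counts = defaultdict(float), defaultdict(int)
with open("results.jsonl") as f:
    for line in f:
        item = json.loads(line)
        sums[item["pl"]] += item["speed_tg"]
        counts[item["pl"]] += 1

avg = [sums[n] / counts[n] for n in range(1, max(sums) + 1)]
costs = [round(avg[0] / v, 3) for v in avg]
print('export GGML_BATCH_COSTS="' + ",".join(map(str, costs)) + '"')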

USE:

  • The --draft-p-min option will be completely ignored and should not be used with this.
  • I suggest you set --draft-max 64 and let the expected value calculation do the work (or --draft-max 32 if offloading the experts without my other hack [see above]).
  • The --draft-min option is also redundant, but be sure to set GGML_MAX_LOOK_AHEAD if needed [see above].
  • I am not 100% sure the existing logic with reusing via prompt_dft.push_back(id) works entirely correctly with this PR - the code is very dense and it's hard to see exactly what the effect of my result.resize(best_size) code is.

If you get it all working correctly, then running with a script like this:

#!/bin/bash

export GGML_BATCH_COSTS="1,0.605,0.473,0.408,0.374,0.351,0.334,0.32,0.313,0.3,0.289,0.287,0.278,0.272,0.269,0.27,0.265,0.261,0.26,0.256,0.254,0.253,0.251,0.248,0.248,0.247,0.245,0.247,0.244,0.243,0.243,0.243,0.242,0.241,0.239,0.239,0.238,0.238,0.238,0.237,0.238,0.238,0.234,0.235,0.234,0.233,0.232,0.233,0.233,0.232,0.232,0.231,0.23,0.231,0.231,0.231,0.23,0.23,0.229,0.229,0.23,0.229,0.228,0.228"

~/llama.cpp/build/bin/llama-server ... \
        --model-draft ~/models/gguf/draft_models/DeepSeek-R1-DRAFT-0.6B-64k-Q4_0.gguf \
        --gpu-layers-draft 99 \
        --draft-max 64 \
        --top-k 1 \
        --samplers "top_k"

should give a very large boost for high "draftability" prompts like "refactor this code" or "reword this report", and almost no degradation in TG tokens/s for low "draftability" prompts (depending on how "steppy" your array of values is, and assuming your draft:target active-parameter ratio is small [eg: 0.5B draft for 30B+ ideally]).

@jukofyork (Collaborator, Author) commented Nov 5, 2025

Server / server-windows (pull_request): Failing after 7m

This may not be a false positive and is probably down to the way I've hacked out the --draft-p-min option, and/or a bug related to this:

I am not 100% sure the existing logic with reusing via prompt_dft.push_back(id) works entirely correctly with this PR - the code is very dense and it's hard to see exactly what the effect of my result.resize(best_size) code is.

(this really needs investigating properly, but the code is very dense and it's not 100% clear exactly what is getting saved between re-entrant calls... I think it should be correct unless you use GGML_MAX_LOOK_AHEAD, but I'm not 100% sure...)

If this gets plenty of interest and seems to work well for other people, I'll definitely polish it up and make a proper PR with proper command line options - I just don't want to add those yet and then have to rebase every time somebody else adds a new argument...

I've actually tried a lot of different variations on calculating the expected value, look-ahead heuristics, and so on, but this is the simplest version that seems to work well (it might later be worth re-investigating ideas like this).
