common: "Profile Guided Speculative Decoding" #17034
Draft · +31 −2
This is very much a draft PR, but trying to drum up some interest in this idea...
To use this PR you need to pass an array of costs using the `GGML_BATCH_COSTS` environment variable, which can be plotted using this script (click to see):
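The actual cost values were collapsed in the original description, so here is a purely hypothetical illustration of setting the variable, assuming it takes a comma-separated list of relative decode costs for batch sizes 1..N (check the PR diff for the real format):

```shell
# HYPOTHETICAL values and format -- profile your own hardware with the
# scripts further down; the real format is defined by the PR diff.
export GGML_BATCH_COSTS="1.00,1.65,2.10,2.48,2.80,3.05,3.25,3.40"
```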
The way to read this is: these values are then used (in place of `--draft-p-min`) to decide if a draft sequence is predicted to have positive expectation, eg:

- if the estimated acceptance probability is `> 60.5%`, then this is a +EV "gamble" and worth trying.
- if it is `< 47.3%`, then this is a -EV "gamble" and not worth trying.

I've tried quite a few variations on this, and the current version is as simple as possible, but with one caveat:
- For `GLM-4.6 over RPC` above you can see that drafting a batch of 2 tokens is NEVER +EV, and the current code's logic of breaking as soon as a -EV batch is seen will mean we never try anything!
- For `Mistral-Large-Instruct-2411` and `command-a-03-2025` you can see that looking ahead several sizes might be beneficial due to the weird "jaggedness" between batch sizes `3` and `8`...

So this means we need a second parameter, passed via the `GGML_MAX_LOOK_AHEAD` environment variable, eg:

- For `GLM-4.6 over RPC`: `export GGML_MAX_LOOK_AHEAD=1`
- For `Mistral-Large-Instruct-2411`: `export GGML_MAX_LOOK_AHEAD=6`

NOTE: The default value of `GGML_MAX_LOOK_AHEAD` is zero, and so long as your graph looks to be decaying monotonically, it seems best to just leave it at the default...

You can try to guess generic values by trial and error, but as the graphs I've plotted above show, this is unlikely to work (at all!), which is why I'm calling this "Profile Guided Speculative Decoding".
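One way to read the decision rule described above (my sketch, not the PR's actual code): each profiled batch size has a break-even acceptance probability derived from the cost array, a size is +EV when the estimated acceptance probability exceeds its threshold, and `GGML_MAX_LOOK_AHEAD` controls how many consecutive -EV sizes are skipped past before giving up. The `thresholds` values below are hypothetical:

```python
# Sketch of the batch-size selection rule (my reading, not the PR's
# implementation).  thresholds[i] is the profiled break-even acceptance
# probability for a draft batch of size i+1, and p_est is the estimated
# probability that the drafted batch is accepted.

def best_draft_size(thresholds, p_est, max_look_ahead=0):
    """Return the largest +EV draft size, looking past up to
    max_look_ahead consecutive -EV sizes before giving up."""
    best, misses = 0, 0
    for i, t in enumerate(thresholds):
        size = i + 1
        if p_est > t:          # +EV "gamble": worth drafting this many
            best, misses = size, 0
        else:                  # -EV: skip, unless lookahead is exhausted
            misses += 1
            if misses > max_look_ahead:
                break
    return best

# With no lookahead we stop at the first -EV size; with lookahead we can
# jump over a "jagged" region between batch sizes.
thresholds = [0.40, 0.70, 0.45, 0.50]   # hypothetical profile
print(best_draft_size(thresholds, 0.60))                    # -> 1
print(best_draft_size(thresholds, 0.60, max_look_ahead=1))  # -> 4
```

This also shows why a profile like the `GLM-4.6 over RPC` one needs `GGML_MAX_LOOK_AHEAD=1`: a single -EV size early in the array would otherwise end the search immediately.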
Here are 3 example scripts that I used to create the sets of values I plotted:
Basic example (ie: single machine, no NUMA):
RPC over 3 nodes:
Using NUMA, `--override-tensor` and `CUDA_VISIBLE_DEVICES=0`:

NOTE: the use of `--no-op-offload` for the test! To use this for the full range of batch sizes up to 64, you will likely need to use my other hack from #17026 (comment), or limit your maximum draft size to `32` via `--draft-max 32` when running this PR...

NOTES:

- You will need `jq` installed to extract the values at the end of the script.
- Try to run `llama-batched-bench` as similarly as possible to how you intend to use the target model.
- Take multiple samples (eg: `NUM_SAMPLES=4`) and let it discard the first set of batch results (as the first run is clearly biased for some reason).
- It may be more accurate with `PROMPT_SIZE > 0`, but will make the script take much longer if you raise the value above zero.

USE:

- The `--draft-p-min` option will be completely ignored and should not be used with this.
- Use `--draft-max 64` and let the expected value calculation do the work (or `--draft-max 32` if offloading the experts without my other hack [see above]).
- The `--draft-min` option is also redundant, but be sure to set `GGML_MAX_LOOK_AHEAD` if needed [see above].
- I'm not 100% sure that `prompt_dft.push_back(id)` works entirely correctly with this PR - the code is very dense and it is hard to see exactly what the effect of my `result.resize(best_size)` code is.

If you get it all working correctly, then running with a script like this:
should give a very large boost for high "draftability" prompts like "refactor this code" or "reword this report", and almost no degradation in TG tokens/s for low "draftability" prompts (depending on how "steppy" your array of values is, and assuming your `draft:target` active-parameter ratio is small [eg: a `0.5B` draft for a `30B+` target, ideally]).
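As an aside, the extraction step at the end of the profiling scripts above boils down to something like the following. This is a hypothetical sketch only (the real scripts use `jq` on `llama-batched-bench` output, and were collapsed in the original description); it turns averaged per-batch-size decode times into the relative cost array, discarding the biased first sample as the NOTES describe:

```python
# Hypothetical sketch: convert per-batch-size decode timings (seconds)
# into the relative cost array for GGML_BATCH_COSTS (format assumed to
# be comma-separated, batch size 1 first).  The real scripts use jq.

def batch_costs(samples):
    """samples[n] is a list of timings for batch size n+1; the first
    sample of each list is discarded (the first run is biased)."""
    avgs = [sum(s[1:]) / len(s[1:]) for s in samples]
    return [t / avgs[0] for t in avgs]  # normalise to batch size 1

samples = [
    [0.015, 0.010, 0.010],   # batch size 1
    [0.020, 0.013, 0.013],   # batch size 2
    [0.030, 0.016, 0.016],   # batch size 3
]
costs = batch_costs(samples)
print(",".join(f"{c:.2f}" for c in costs))  # -> 1.00,1.30,1.60
```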