danbev commented Nov 4, 2025

This is a work in progress to add support for GPU sampling.

The motivation for this feature is to enable sampling to be performed directly on the GPU as part of the computation graph being executed, allowing for some or all of the sampling to be done on the GPU.

For example, the GPU sampler chain might select/sample a token directly, in which case only the sampled token needs to be transferred from device memory to host memory.

It is also possible for the GPU samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers.

Currently, GPU sampling works in a similar manner to pooling: it is a function that is called by build_graph:

    // add GPU sampling layers (if any)
    llm->build_sampling(*this, params);

GPU samplers can be configured by creating sampler chains, where each sampler chain is associated with a specific sequence id:

    struct llama_sampler_chain_params params = llama_sampler_chain_default_params();
    struct llama_sampler * gpu_sampler_chain = llama_sampler_chain_init(params);
    llama_sampler_chain_add(gpu_sampler_chain, llama_sampler_gpu_init_greedy());
    std::vector<llama_sampler_seq_config> sampler_configs = {
        { 0, gpu_sampler_chain }
    };

The struct is defined as:

    struct llama_sampler_seq_config {
        llama_seq_id           seq_id;
        struct llama_sampler * sampler;
    };

These sampler configs are then passed as context params:

        llama_context_params cparams = llama_context_default_params();
        cparams.samplers = sampler_configs.data();
        cparams.n_samplers = sampler_configs.size();

When the graph is built, each configured sampler's _apply function is called, which allows it to add operations/nodes to the computation graph.
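
As a sketch (the GPU sampler interface itself is not shown in this description, so the signature below is an assumption based on the _apply_ggml naming used in the commits), a greedy GPU sampler might simply append an argmax node over the logits:

    // hedged sketch: the apply step adds an argmax over the logits to the
    // graph; only this single value then needs to be copied back to the host.
    static struct ggml_tensor * llama_sampler_gpu_greedy_apply_ggml(
            struct ggml_context * ctx, struct ggml_tensor * logits) {
        return ggml_argmax(ctx, logits); // index of the highest logit
    }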

This enables sampling to happen fully or partially on the GPU. The samplers might sample a single token, in which case only that token is transferred from device memory to host memory after llama_decode has been called. The sampled token can then be retrieved using:

    llama_token id = llama_get_sampled_token_ith(test_ctx.ctx, index);
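
For example (a sketch; the index argument is assumed to follow the same convention as llama_get_logits_ith, and the batch setup is elided):

    // hedged sketch: run a decode and read back the token sampled on the device
    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
    if (llama_decode(ctx, batch) == 0) {
        // assumption: -1 refers to the last position, as with llama_get_logits_ith
        llama_token id = llama_get_sampled_token_ith(ctx, -1);
    }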

It is also possible to run a GPU sampler that only filters the logits; then only the filtered logits are transferred back to the host, and sampling can proceed on the CPU with the normal (CPU) sampler chain. In this case the CPU samplers are configured as usual, but they now operate on already-filtered logits.
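
As a sketch of how the two chains might be set up (llama_sampler_gpu_init_top_k is an assumed constructor name, analogous to the greedy one above):

    // GPU chain: filter the logits on the device (assumed constructor name)
    struct llama_sampler * gpu_chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(gpu_chain, llama_sampler_gpu_init_top_k(40));

    // CPU chain: configured as usual, but it now sees already-filtered logits
    struct llama_sampler * cpu_chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(cpu_chain, llama_sampler_init_temp(0.8f));
    llama_sampler_chain_add(cpu_chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));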

Similar to the above handling of logits, it is possible for GPU samplers to compute the full probability distribution and transfer that to the host. The CPU samplers can then operate on those probabilities.

Building and running the tests

Download a model for testing:

$ cd models && wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf

Building the test:

$ cmake --build build --target test-gpu-sampling -j8

Running all tests:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R '^test-gpu-sampling$' -V

The following individual tests are available:

$ ctest --test-dir build -N -R test-gpu-sampling-
  Test 35: test-gpu-sampling-greedy
  Test 36: test-gpu-sampling-temp
  Test 37: test-gpu-sampling-softmax
  Test 38: test-gpu-sampling-top_k
  Test 39: test-gpu-sampling-top_p
  Test 40: test-gpu-sampling-mul_seq

Total Tests: 6

These can be run individually, for example:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R 'test-gpu-sampling-temp' -V

TODO

  • Allocate GPU sampler tensors on the same backend as the logits (dev_output.dev)
  • Implement GPU dist sampler
  • Allow GPU samplers to pre-allocate state tensors
  • Integrate GPU samplers with llama-cli
  • Integrate GPU samplers with llama-server
  • Implement true top-p sampler on GPU
  • Add missing GPU samplers (e.g. typical, mirostat, etc.)
  • Add support in all backends for operations like ggml_top_k on vocabulary-sized tensors
  • Add ggml_cumsum operation to all backends

Basically, I think we should have support in all backends for the operations that the GPU samplers use. At the moment this is not the case, and if the target backend device (the same device that holds the logits tensor) does not support an operation, a warning similar to this is printed:

Warning: backend does not support argsort operation required for top-k sampling
CPU backend will be used instead which defeats the purpose of having GPU samplers
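
A minimal sketch of such a check, using the existing ggml_backend_supports_op helper (ctx, logits and backend stand in for values available when the sampler builds its part of the graph):

    // hedged sketch: warn and fall back when the target backend cannot run an
    // operation the GPU sampler wants to add to the graph.
    struct ggml_tensor * sorted = ggml_argsort(ctx, logits, GGML_SORT_ORDER_DESC);
    if (!ggml_backend_supports_op(backend, sorted)) {
        fprintf(stderr, "warning: backend does not support argsort required for top-k sampling\n");
        // the scheduler will split the graph and run this node on the CPU backend
    }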

github-actions bot added the testing label on Nov 4, 2025
am17an commented Nov 5, 2025

One place this would be useful immediately is the diffusion-cli. I'm happy to test this when it's ready

This commit updates the llama_sampler_gpu_top_p_apply_ggml function
to use ggml_div_inplace instead of ggml_div, as the non-inplace version
generated an error on the WebGPU backend:
```console
/home/danbev/work/ai/llama.cpp-debug/ggml/src/ggml-webgpu/ggml-webgpu.cpp:2146: ggml_webgpu: Device error! Reason: 2, Message:
  Writable storage buffer binding aliasing found between [BindGroup "div_f32"] set at bind group index 0, binding index 1, and
  [BindGroup "div_f32"] set at bind group index 0, binding index 2, with overlapping ranges (offset: 0, size: 32) and (offset: 0,
  size: 32) in [Buffer "allocated_buffer"].
   - While encoding [ComputePassEncoder (unlabeled)].DispatchWorkgroups(1, 1, 1).
   - While finishing [CommandEncoder (unlabeled)].
```

It also sets ggml_data->filtered_ids as an output tensor, as it might
otherwise be reused before being read.
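
A rough sketch of the shape of that change (the tensor names here are illustrative, not the actual variables in the sampler):
```c++
// hedged sketch: renormalize the kept probabilities in place rather than
// writing into a separate destination tensor, and mark the filtered ids as a
// graph output so their buffer is not reused before it is read back.
probs = ggml_div_inplace(ctx, probs, kept_sum);   // previously: ggml_div(ctx, probs, kept_sum)
ggml_set_output(filtered_ids);
```
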
This commit adds a new cumulative sum (cumsum) operation to the ggml
library.

The motivation for this is to be able to implement a GPU distribution
sampler. I notice that there is work underway to add cumsum in other PRs,
so this commit can probably be removed once those are merged.
github-actions bot added the Nvidia GPU and ggml labels on Nov 6, 2025
This commit adds support for performing distribution sampling on the GPU.
It adds a function to the sampler interface for setting input tensors,
which will be called after the computation graph has been built and
scheduled.

For the dist sampler, this allows it to set a random uniform value
that is used to sample from the cumulative distribution.
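
For reference, the selection step this enables boils down to inverse-CDF sampling; a small CPU-side sketch of the math (not the actual GPU code) looks like this:
```c++
#include <cstddef>
#include <vector>

// hedged sketch: given the cumulative probabilities (e.g. from a cumsum over
// the softmaxed logits) and a uniform random value r in [0, 1), the sampled
// token is the first index whose cumulative probability reaches r.
static size_t sample_from_cumsum(const std::vector<float> & cumsum, float r) {
    for (size_t i = 0; i < cumsum.size(); ++i) {
        if (cumsum[i] >= r) {
            return i;
        }
    }
    return cumsum.size() - 1; // guard against floating-point round-off
}
```
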
This commit adds a function to set the ggml_backend_sched_t and
ggml_backend_t GPU-based samplers.

The motivation for this is that the tensors a GPU sampler creates
(new tensors and operations) should be allocated on the same backend
as the logits tensor produced by the model's graph. With this change
the samplers can use the scheduler and backend to set the correct
backend for the tensors they create. I'll try to find a nice way of
enforcing this, as it would be easy to miss doing this step otherwise.
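
A minimal sketch of how a sampler could use this, based on the existing ggml_backend_sched_set_tensor_backend helper (sched, ctx and logits are illustrative names for the values handed to the sampler):
```c++
// hedged sketch: place a tensor created by a GPU sampler on the same backend
// as the logits tensor so that the scheduler does not split the graph.
ggml_backend_t backend = ggml_backend_sched_get_tensor_backend(sched, logits);

struct ggml_tensor * sorted = ggml_argsort(ctx, logits, GGML_SORT_ORDER_DESC);
ggml_backend_sched_set_tensor_backend(sched, sorted, backend);
```
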
This commit adds checks to see if the target backend can support
operations like argsort (used by top-k sampling) and cont. Currently
these operations are not supported in all backends (e.g. the Metal
backend) and will cause runtime errors. The checks in this commit allow
us to avoid the error, but if you print the scheduler's debug table
(GGML_SCHED_DEBUG=2) you can see that there will be a split in the graph
so that the CPU backend runs these operations, which defeats the purpose
of GPU sampling.

We should probably fix/add support for the operations that are going to
be used in the GPU samplers to have this work most effectively.
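
For example, the split can be inspected like this (the binary path is an assumption based on the build target above):
```console
$ env GGML_SCHED_DEBUG=2 LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ./build/bin/test-gpu-sampling
```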

Metal issues:
* Metal ARGSORT only supports ne[0] <= 1024. GPU samplers need to sort the
  full vocabulary.
* CUMSUM is not implemented for the Metal backend. I just added it in a
  recent commit, and there are other open PRs that also look to be adding
  support for it.