creatorrr
Title: Deterministic inference mode (CUDA): RMSNorm, MatMul, Attention, KV-cache

Summary
Adds an opt-in deterministic mode that makes CUDA inference bit-identical for identical inputs—independent of batch size, prompt chunking, or concurrency. When enabled, RMSNorm, dense MatMul, and Attention use batch-invariant, fixed-reduction kernels and a stable, padded KV-cache layout.

Motivation

CUDA inference results for the same input can differ from run to run because kernel dispatch and floating-point reduction order change with batch size, prompt chunking, and server concurrency. That makes reproducible evaluation and debugging harder than they need to be; this mode removes those sources of variation at an opt-in cost.
What’s included

  • Deterministic RMSNorm (fixed per-row reduction order; batch-invariant; see the sketch after this list).
  • Deterministic MatMul for FP16/BF16 (fixed tiling, no split-K, FP32 accumulation).
  • Deterministic Attention (fixed split size over KV, stable softmax reduction, unified KV path; the KV cache is aligned and padded so chunked and one-shot prefill produce identical results).
  • Deterministic MoE mul_mat_id path.
  • Off by default; the normal fast paths are unchanged.
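
To make the reduction-order point concrete, here is a minimal sketch (not the PR's actual kernel) of a batch-invariant RMSNorm: one block per row with a fixed block size, so each row's floating-point reduction happens in exactly the same order no matter how many rows are in the batch.

```cuda
// Illustrative sketch, not the PR's actual kernel.
// One block per row with a fixed block size: every row is reduced by the
// same number of threads in the same order, so the result for a given
// row does not depend on how many other rows are in the batch.
__global__ void rms_norm_fixed_order(const float * x, float * y,
                                     int ncols, float eps) {
    const int row = blockIdx.x;
    const float * xr = x + (size_t) row * ncols;

    float acc = 0.0f;
    // Fixed strided loop: thread i always visits columns i, i+256, ...
    for (int c = threadIdx.x; c < ncols; c += blockDim.x) {
        acc += xr[c] * xr[c];
    }

    __shared__ float partial[256];        // assumes blockDim.x == 256
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Fixed-shape tree reduction: the same addition order on every launch.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    const float scale = rsqrtf(partial[0] / ncols + eps);
    for (int c = threadIdx.x; c < ncols; c += blockDim.x) {
        y[(size_t) row * ncols + c] = xr[c] * scale;
    }
}

// Launch: rms_norm_fixed_order<<<nrows, 256>>>(x, y, ncols, 1e-6f);
```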

Usage

  • Build: -DGGML_DETERMINISTIC=ON
  • Run: --deterministic (or set GGML_DETERMINISTIC=1; see the sketch after this list)
  • For fully reproducible generation: temperature=0, top_k=1, top_p=1.
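
For reference, the runtime gate can be as simple as the following sketch; ggml_is_deterministic and g_cli_deterministic are hypothetical names, and per the commit log the real plumbing also wires a CMake option into the same check.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper (illustrative names): deterministic mode is on if
// the CLI flag set it, or if GGML_DETERMINISTIC is set to a non-zero value.
static int g_cli_deterministic = 0;   // set by --deterministic

static int ggml_is_deterministic(void) {
    if (g_cli_deterministic) {
        return 1;
    }
    const char * env = getenv("GGML_DETERMINISTIC");
    return env != NULL && strcmp(env, "0") != 0;
}
```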

Scope & perf

  • Targets CUDA (BF16/FP16). CPU is already deterministic; other GPU backends unchanged.
  • Throughput trade-off in deterministic mode; default builds/perf unaffected when flag is off.

Tests

  • New tests assert:

    • run-to-run bit equality,
    • batch & chunking invariance,
    • attention/masking (incl. ALiBi),
    • deterministic MoE.

  • All of them pass on the NVIDIA GPUs tested (the bit-equality check is sketched below).
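
The bit-equality assertion is stricter than the usual tolerance-based comparison; conceptually it amounts to the following, where eval_prompt and VOCAB are hypothetical stand-ins for the actual test harness.

```c
#include <assert.h>
#include <string.h>

enum { VOCAB = 32000 };            // example vocabulary size
void eval_prompt(float * logits);  // hypothetical: one full forward pass

void check_run_to_run_determinism(void) {
    float a[VOCAB], b[VOCAB];
    eval_prompt(a);                // identical inputs on both runs
    eval_prompt(b);
    // bit-for-bit equality of the raw logits, not approximate closeness:
    assert(memcmp(a, b, sizeof(a)) == 0);
}
```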

Notes

  • Happy to rename the flag to LLAMA_DETERMINISTIC if maintainers prefer; currently GGML_DETERMINISTIC.

creatorrr and others added 9 commits September 14, 2025 00:28
Signed-off-by: Diwank Singh Tomer <[email protected]>
- Add deterministic mode plumbing (CMake option GGML_DETERMINISTIC, env var + CLI --deterministic).
- Project 01: RMSNorm tests for batch invariance and cross-run determinism; docs.
- Project 02: Deterministic CUDA matmul
  - Gate off cuBLAS in deterministic mode and route to custom kernels.
  - Implement mmvf-based deterministic column-tiling fallback; prefer mmf when eligible (the split-K issue this avoids is sketched after this commit list).
  - Expand test suite (F32/F16/BF16; M∈{256,512}, K∈{1024,4096}, B up to 64).
  - Optional MoE (mul_mat_id) invariance test scaffold (enable with TEST_MATMUL_ID=1).
  - Update docs/DETERMINISM.md with MatMul section.
- scripts/build-in-container.sh: GPU passthrough for docker when building with CUDA.
- Wire tests into CTest; both suites pass on CPU and CUDA (A4000 x2 + RTX 2000E Ada).
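
To show why avoiding split-K matters, here is a hedged contrast (illustrative kernels, not the PR's mmf/mmvf code): split-K merges partial sums with atomics in a scheduling-dependent order, while giving one block the whole K reduction fixes the order.

```cuda
// Split-K (non-deterministic): each block reduces a slice of K and merges
// its partial sum with atomicAdd. Float addition is not associative and
// the merge order depends on block scheduling, so results can vary run to
// run. y must be zero-initialized before launch.
__global__ void matvec_splitk(const float * A, const float * x, float * y,
                              int cols, int k_split) {
    const int row = blockIdx.x;
    const int k0  = blockIdx.y * k_split;
    const int k1  = min(k0 + k_split, cols);
    float acc = 0.0f;
    for (int k = k0 + threadIdx.x; k < k1; k += blockDim.x) {
        acc += A[row * cols + k] * x[k];
    }
    atomicAdd(&y[row], acc);  // merge order is not fixed
}

// Deterministic alternative: one block owns the entire K reduction for
// its row, so the accumulation order is fixed regardless of scheduling.
__global__ void matvec_det(const float * A, const float * x, float * y,
                           int cols) {
    const int row = blockIdx.x;
    float acc = 0.0f;
    for (int k = threadIdx.x; k < cols; k += blockDim.x) {
        acc += A[row * cols + k] * x[k];  // fixed stride, fixed order
    }
    __shared__ float tmp[256];            // assumes blockDim.x == 256
    tmp[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // fixed-shape tree
        if (threadIdx.x < s) tmp[threadIdx.x] += tmp[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) y[row] = tmp[0];
}
```
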
…ntial per-token per-slot matmul in deterministic mode when src1,dst are F32. Enable optional test via TEST_MATMUL_ID=1; matmul determinism tests now pass with MoE.

- Add deterministic branch in ggml_cuda_mul_mat_id to compute c[:,e,t] = as[:,:,ids[e,t]] @ b[:,e,t] sequentially (sketched below).
- Leaves fast path unchanged when not in deterministic mode or for non-F32 src1.
- Verified in container with GPUs (A4000 x2 + RTX 2000E Ada).
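
A host-side C sketch of the formula above (illustrative, F32 only; the real code dispatches CUDA kernels): the fixed t/e loop order and fixed left-to-right accumulation over K are what give batch invariance.

```c
#include <stddef.h>

// y = A @ x for one expert slot; fixed left-to-right accumulation order.
static void matvec_f32(const float * A, const float * x, float * y,
                       int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int k = 0; k < cols; ++k) {
            acc += A[r * cols + k] * x[k];
        }
        y[r] = acc;
    }
}

// c[:,e,t] = as[:,:,ids[e,t]] @ b[:,e,t], executed sequentially per token
// t and expert slot e so the execution order never depends on batching.
static void mul_mat_id_det(const float * as,  // [rows, cols, n_expert]
                           const float * b,   // [cols, n_used, n_tokens]
                           float * c,         // [rows, n_used, n_tokens]
                           const int * ids,   // [n_used, n_tokens]
                           int rows, int cols, int n_used, int n_tokens) {
    for (int t = 0; t < n_tokens; ++t) {
        for (int e = 0; e < n_used; ++e) {
            const int expert = ids[e + t * n_used];   // routing table
            matvec_f32(as + (size_t) expert * rows * cols,
                       b  + ((size_t) t * n_used + e) * cols,
                       c  + ((size_t) t * n_used + e) * rows,
                       rows, cols);
        }
    }
}
```
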
…olumns to F32 in deterministic path; enable MoE invariance test by default.

- In det mode, ggml_cuda_mul_mat_id now handles src1 types F32/F16/BF16 by copying single-column inputs to contiguous device buffer and converting to F32 before matmul; sequential per-token/slot execution guarantees batch invariance.
- Update tests to always run MoE invariance alongside main matmul checks.
- Verified across A4000 x2 and RTX 2000E Ada.
…an (CUDA forward) aligned with implemented deterministic dispatch and launch policy.
- Clarify KV stride constraint (multiples of 256) and mask padding; update docs Overview scope (the fixed-split idea is sketched below).
- Add attention determinism test (batch invariance, cross-run; ALiBi+sinks; softcap path for D=128/256).
- Add 03B planning docs and runbook for Ada/Ampere.
- Minor test improvements for matmul/rmsnorm determinism.
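
A scalar sketch of the fixed-split softmax referenced above (single query, single head, mask omitted; not the PR's FlashAttention kernel): the KV range is processed in fixed-size chunks and the running softmax state is merged in the same chunk order every time, which is what makes chunked and one-shot prefill agree.

```c
#include <math.h>
#include <string.h>

#define KV_SPLIT 256  // fixed split size; matches the padded KV stride
#define DV_MAX   256  // assumed max head dim for the scratch buffer

// Merge the running stats (m, sum, out) with one chunk's local stats.
// Because chunks are visited in a fixed order, the merge order is fixed.
static void merge(float * m, float * sum, float * out,
                  float m_c, float sum_c, const float * out_c, int dv) {
    const float m_new = *m > m_c ? *m : m_c;
    const float r0 = expf(*m - m_new), r1 = expf(m_c - m_new);
    for (int j = 0; j < dv; ++j) out[j] = out[j] * r0 + out_c[j] * r1;
    *sum = *sum * r0 + sum_c * r1;
    *m = m_new;
}

// out = softmax(scores) @ V for one query; dv must be <= DV_MAX.
void attn_one_query(const float * scores,  // [kv_len] = q.k / sqrt(d)
                    const float * v,       // [kv_len][dv]
                    float * out, int kv_len, int dv) {
    float m = -INFINITY, sum = 0.0f;
    memset(out, 0, dv * sizeof(float));

    for (int s0 = 0; s0 < kv_len; s0 += KV_SPLIT) {   // fixed chunk order
        const int s1 = s0 + KV_SPLIT < kv_len ? s0 + KV_SPLIT : kv_len;
        float m_c = -INFINITY, sum_c = 0.0f, out_c[DV_MAX] = {0};
        for (int i = s0; i < s1; ++i)
            if (scores[i] > m_c) m_c = scores[i];
        for (int i = s0; i < s1; ++i) {
            const float p = expf(scores[i] - m_c);
            sum_c += p;
            for (int j = 0; j < dv; ++j) out_c[j] += p * v[i * dv + j];
        }
        merge(&m, &sum, out, m_c, sum_c, out_c, dv);
    }
    for (int j = 0; j < dv; ++j) out[j] /= sum;
}
```
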
…antized K/V vec support (D=128 q4_0/q8_0); F16 tile fallback; MMA gated via env; tests for toggles + quantized; docs debug controls and clarifications; status updated
…ent special head sizes unsupported in det mode; add env flag docs; add vec/MMA probe helpers comment; enable dual-arch build runbook; minor test gating for toggles/MMA
Signed-off-by: Codex CLI <[email protected]>
Deterministic Attention (03C): KV-cache invariance foundation

- Add KV-cache invariance test for prefill/decode logit matching
- Add im2col3d stride handling for non-contiguous tensors
- Improve CUDA FA deterministic dispatch with better softcap validation
- Add phase 03C planning docs focusing on KV-cache prioritized approach
- Add test-in-container script for reproducible test environments
- Enhance graph construction and KV-cache handling for determinism
- Document commit history and project status updates

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@creatorrr (Author)

@ggerganov this is my first PR here. It's well tested and, full disclosure, contains vibe-coded contributions from gpt-5 / codex. I wanted to get a sense of direction, so please take a look and let me know whether this is how you'd have gone about it yourself, and what to change for it to become merge-worthy. Happy to redo it or break it up as long as the conceptual direction aligns.

@JohannesGaessler (Collaborator)

I don't want to maintain guarantees for bit-for-bit identical results as the batch size is varied. Determinism for prompt caching should be handled in llama.cpp by caching the logit distribution of the last prompt eval. Determinism for batched inference should be handled by allowing users to submit prompts as a batch to ensure that they're always executed in the exact same order.

@JohannesGaessler (Collaborator)

BTW, as I kind of implied in my previous post, you can already get bit-for-bit identical results by disabling prompt caching and using only a single slot for the HTTP server.

@github-actions github-actions bot added labels on Sep 15, 2025: documentation (Improvements or additions to documentation), script (Script related), testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning).