creatorrr
Title: Deterministic inference mode (CUDA): RMSNorm, MatMul, Attention, KV-cache

Summary
Adds an opt-in deterministic mode that makes CUDA inference bit-identical for identical inputs—independent of batch size, prompt chunking, or concurrency. When enabled, RMSNorm, dense MatMul, and Attention use batch-invariant, fixed-reduction kernels and a stable, padded KV-cache layout.

Motivation

CUDA inference results for the same input can differ from run to run because kernel dispatch and floating-point reduction order change with batch size, prompt chunking, and server concurrency. That makes reproducible evaluation and debugging harder than they need to be; this mode removes those sources of variation at an opt-in cost.
What’s included

  • Deterministic RMSNorm (fixed per-row reduction order; batch-invariant; see the sketch after this list).
  • Deterministic MatMul for FP16/BF16 (fixed tiling, no split-K, FP32 accumulation).
  • Deterministic Attention (fixed split size over KV, stable softmax reduction, unified KV path; the KV cache is aligned and padded so chunked and one-shot prefill produce identical results).
  • Deterministic MoE mul_mat_id path.
  • Off by default; the normal fast paths are unchanged.
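
To make the reduction-order point concrete, here is a minimal sketch (not the PR's actual kernel) of a batch-invariant RMSNorm: one block per row with a fixed block size, so each row's floating-point reduction happens in exactly the same order no matter how many rows are in the batch.

```cuda
// Illustrative sketch, not the PR's actual kernel.
// One block per row with a fixed block size: every row is reduced by the
// same number of threads in the same order, so the result for a given
// row does not depend on how many other rows are in the batch.
__global__ void rms_norm_fixed_order(const float * x, float * y,
                                     int ncols, float eps) {
    const int row = blockIdx.x;
    const float * xr = x + (size_t) row * ncols;

    float acc = 0.0f;
    // Fixed strided loop: thread i always visits columns i, i+256, ...
    for (int c = threadIdx.x; c < ncols; c += blockDim.x) {
        acc += xr[c] * xr[c];
    }

    __shared__ float partial[256];        // assumes blockDim.x == 256
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Fixed-shape tree reduction: the same addition order on every launch.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    const float scale = rsqrtf(partial[0] / ncols + eps);
    for (int c = threadIdx.x; c < ncols; c += blockDim.x) {
        y[(size_t) row * ncols + c] = xr[c] * scale;
    }
}

// Launch: rms_norm_fixed_order<<<nrows, 256>>>(x, y, ncols, 1e-6f);
```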

Usage

  • Build: -DGGML_DETERMINISTIC=ON
  • Run: --deterministic (or set GGML_DETERMINISTIC=1; see the sketch after this list)
  • For fully reproducible generation: temperature=0, top_k=1, top_p=1.
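
For reference, the runtime gate can be as simple as the following sketch; ggml_is_deterministic and g_cli_deterministic are hypothetical names, and per the commit log the real plumbing also wires a CMake option into the same check.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper (illustrative names): deterministic mode is on if
// the CLI flag set it, or if GGML_DETERMINISTIC is set to a non-zero value.
static int g_cli_deterministic = 0;   // set by --deterministic

static int ggml_is_deterministic(void) {
    if (g_cli_deterministic) {
        return 1;
    }
    const char * env = getenv("GGML_DETERMINISTIC");
    return env != NULL && strcmp(env, "0") != 0;
}
```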

Scope & perf

  • Targets CUDA (BF16/FP16). CPU is already deterministic; other GPU backends unchanged.
  • Throughput trade-off in deterministic mode; default builds/perf unaffected when flag is off.

Tests

  • New tests assert:

    • run-to-run bit equality,
    • batch & chunking invariance,
    • attention/masking (incl. ALiBi),
    • deterministic MoE.

  • All of them pass on the NVIDIA GPUs tested (the bit-equality check is sketched below).
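
The bit-equality assertion is stricter than the usual tolerance-based comparison; conceptually it amounts to the following, where eval_prompt and VOCAB are hypothetical stand-ins for the actual test harness.

```c
#include <assert.h>
#include <string.h>

enum { VOCAB = 32000 };            // example vocabulary size
void eval_prompt(float * logits);  // hypothetical: one full forward pass

void check_run_to_run_determinism(void) {
    float a[VOCAB], b[VOCAB];
    eval_prompt(a);                // identical inputs on both runs
    eval_prompt(b);
    // bit-for-bit equality of the raw logits, not approximate closeness:
    assert(memcmp(a, b, sizeof(a)) == 0);
}
```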

Notes

  • Happy to rename the flag to LLAMA_DETERMINISTIC if maintainers prefer; currently GGML_DETERMINISTIC.

creatorrr and others added 9 commits September 14, 2025 00:28
Signed-off-by: Diwank Singh Tomer <[email protected]>
- Add deterministic mode plumbing (CMake option GGML_DETERMINISTIC, env var + CLI --deterministic).
- Project 01: RMSNorm tests for batch invariance and cross-run determinism; docs.
- Project 02: Deterministic CUDA matmul
  - Gate off cuBLAS in deterministic mode and route to custom kernels.
  - Implement mmvf-based deterministic column-tiling fallback; prefer mmf when eligible (the split-K issue this avoids is sketched after this commit list).
  - Expand test suite (F32/F16/BF16; M∈{256,512}, K∈{1024,4096}, B up to 64).
  - Optional MoE (mul_mat_id) invariance test scaffold (enable with TEST_MATMUL_ID=1).
  - Update docs/DETERMINISM.md with MatMul section.
- scripts/build-in-container.sh: GPU passthrough for docker when building with CUDA.
- Wire tests into CTest; both suites pass on CPU and CUDA (A4000 x2 + RTX 2000E Ada).
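
To show why avoiding split-K matters, here is a hedged contrast (illustrative kernels, not the PR's mmf/mmvf code): split-K merges partial sums with atomics in a scheduling-dependent order, while giving one block the whole K reduction fixes the order.

```cuda
// Split-K (non-deterministic): each block reduces a slice of K and merges
// its partial sum with atomicAdd. Float addition is not associative and
// the merge order depends on block scheduling, so results can vary run to
// run. y must be zero-initialized before launch.
__global__ void matvec_splitk(const float * A, const float * x, float * y,
                              int cols, int k_split) {
    const int row = blockIdx.x;
    const int k0  = blockIdx.y * k_split;
    const int k1  = min(k0 + k_split, cols);
    float acc = 0.0f;
    for (int k = k0 + threadIdx.x; k < k1; k += blockDim.x) {
        acc += A[row * cols + k] * x[k];
    }
    atomicAdd(&y[row], acc);  // merge order is not fixed
}

// Deterministic alternative: one block owns the entire K reduction for
// its row, so the accumulation order is fixed regardless of scheduling.
__global__ void matvec_det(const float * A, const float * x, float * y,
                           int cols) {
    const int row = blockIdx.x;
    float acc = 0.0f;
    for (int k = threadIdx.x; k < cols; k += blockDim.x) {
        acc += A[row * cols + k] * x[k];  // fixed stride, fixed order
    }
    __shared__ float tmp[256];            // assumes blockDim.x == 256
    tmp[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // fixed-shape tree
        if (threadIdx.x < s) tmp[threadIdx.x] += tmp[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) y[row] = tmp[0];
}
```
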
…ntial per-token per-slot matmul in deterministic mode when src1,dst are F32. Enable optional test via TEST_MATMUL_ID=1; matmul determinism tests now pass with MoE.

- Add deterministic branch in ggml_cuda_mul_mat_id to compute c[:,e,t] = as[:,:,ids[e,t]] @ b[:,e,t] sequentially (sketched below).
- Leaves fast path unchanged when not in deterministic mode or for non-F32 src1.
- Verified in container with GPUs (A4000 x2 + RTX 2000E Ada).
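
A host-side C sketch of the formula above (illustrative, F32 only; the real code dispatches CUDA kernels): the fixed t/e loop order and fixed left-to-right accumulation over K are what give batch invariance.

```c
#include <stddef.h>

// y = A @ x for one expert slot; fixed left-to-right accumulation order.
static void matvec_f32(const float * A, const float * x, float * y,
                       int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int k = 0; k < cols; ++k) {
            acc += A[r * cols + k] * x[k];
        }
        y[r] = acc;
    }
}

// c[:,e,t] = as[:,:,ids[e,t]] @ b[:,e,t], executed sequentially per token
// t and expert slot e so the execution order never depends on batching.
static void mul_mat_id_det(const float * as,  // [rows, cols, n_expert]
                           const float * b,   // [cols, n_used, n_tokens]
                           float * c,         // [rows, n_used, n_tokens]
                           const int * ids,   // [n_used, n_tokens]
                           int rows, int cols, int n_used, int n_tokens) {
    for (int t = 0; t < n_tokens; ++t) {
        for (int e = 0; e < n_used; ++e) {
            const int expert = ids[e + t * n_used];   // routing table
            matvec_f32(as + (size_t) expert * rows * cols,
                       b  + ((size_t) t * n_used + e) * cols,
                       c  + ((size_t) t * n_used + e) * rows,
                       rows, cols);
        }
    }
}
```
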
…olumns to F32 in deterministic path; enable MoE invariance test by default.

- In det mode, ggml_cuda_mul_mat_id now handles src1 types F32/F16/BF16 by copying single-column inputs to contiguous device buffer and converting to F32 before matmul; sequential per-token/slot execution guarantees batch invariance.
- Update tests to always run MoE invariance alongside main matmul checks.
- Verified across A4000 x2 and RTX 2000E Ada.
…an (CUDA forward) aligned with implemented deterministic dispatch and launch policy.
- Clarify KV stride constraint (multiples of 256) and mask padding; update docs Overview scope (the fixed-split idea is sketched below).
- Add attention determinism test (batch invariance, cross-run; ALiBi+sinks; softcap path for D=128/256).
- Add 03B planning docs and runbook for Ada/Ampere.
- Minor test improvements for matmul/rmsnorm determinism.
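
A scalar sketch of the fixed-split softmax referenced above (single query, single head, mask omitted; not the PR's FlashAttention kernel): the KV range is processed in fixed-size chunks and the running softmax state is merged in the same chunk order every time, which is what makes chunked and one-shot prefill agree.

```c
#include <math.h>
#include <string.h>

#define KV_SPLIT 256  // fixed split size; matches the padded KV stride
#define DV_MAX   256  // assumed max head dim for the scratch buffer

// Merge the running stats (m, sum, out) with one chunk's local stats.
// Because chunks are visited in a fixed order, the merge order is fixed.
static void merge(float * m, float * sum, float * out,
                  float m_c, float sum_c, const float * out_c, int dv) {
    const float m_new = *m > m_c ? *m : m_c;
    const float r0 = expf(*m - m_new), r1 = expf(m_c - m_new);
    for (int j = 0; j < dv; ++j) out[j] = out[j] * r0 + out_c[j] * r1;
    *sum = *sum * r0 + sum_c * r1;
    *m = m_new;
}

// out = softmax(scores) @ V for one query; dv must be <= DV_MAX.
void attn_one_query(const float * scores,  // [kv_len] = q.k / sqrt(d)
                    const float * v,       // [kv_len][dv]
                    float * out, int kv_len, int dv) {
    float m = -INFINITY, sum = 0.0f;
    memset(out, 0, dv * sizeof(float));

    for (int s0 = 0; s0 < kv_len; s0 += KV_SPLIT) {   // fixed chunk order
        const int s1 = s0 + KV_SPLIT < kv_len ? s0 + KV_SPLIT : kv_len;
        float m_c = -INFINITY, sum_c = 0.0f, out_c[DV_MAX] = {0};
        for (int i = s0; i < s1; ++i)
            if (scores[i] > m_c) m_c = scores[i];
        for (int i = s0; i < s1; ++i) {
            const float p = expf(scores[i] - m_c);
            sum_c += p;
            for (int j = 0; j < dv; ++j) out_c[j] += p * v[i * dv + j];
        }
        merge(&m, &sum, out, m_c, sum_c, out_c, dv);
    }
    for (int j = 0; j < dv; ++j) out[j] /= sum;
}
```
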
…antized K/V vec support (D=128 q4_0/q8_0); F16 tile fallback; MMA gated via env; tests for toggles + quantized; docs debug controls and clarifications; status updated
…ent special head sizes unsupported in det mode; add env flag docs; add vec/MMA probe helpers comment; enable dual-arch build runbook; minor test gating for toggles/MMA
Signed-off-by: Codex CLI <[email protected]>
Deterministic Attention (03C): KV-cache invariance foundation

- Add KV-cache invariance test for prefill/decode logit matching
- Add im2col3d stride handling for non-contiguous tensors
- Improve CUDA FA deterministic dispatch with better softcap validation
- Add phase 03C planning docs focusing on KV-cache prioritized approach
- Add test-in-container script for reproducible test environments
- Enhance graph construction and KV-cache handling for determinism
- Document commit history and project status updates

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@creatorrr (Author)

@ggerganov this is my first PR here. It's well tested and, full disclosure, contains vibe-coded contributions from gpt-5 / codex. I wanted to get a sense of direction, so please take a look and let me know whether this is how you'd have gone about it yourself, and what to change for it to become merge-worthy. Happy to redo it or break it up as long as the conceptual direction aligns.

@JohannesGaessler (Collaborator)

I don't want to maintain guarantees for bit-for-bit identical results as the batch size is varied. Determinism for prompt caching should be handled in llama.cpp by caching the logit distribution of the last prompt eval. Determinism for batched inference should be handled by allowing users to submit prompts as a batch to ensure that they're always executed in the exact same order.

@JohannesGaessler (Collaborator)

BTW, as I kind of implied in my previous post, you can already get bit-for-bit identical results by disabling prompt caching and using only a single slot for the HTTP server.

@github-actions github-actions bot added labels on Sep 15, 2025: documentation (Improvements or additions to documentation), script (Script related), testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning).