Deterministic inference mode (CUDA): RMSNorm, MatMul, Attention, KV-cache #16016
base: master
Conversation
Signed-off-by: Diwank Singh Tomer <[email protected]>
- Add deterministic mode plumbing (CMake option GGML_DETERMINISTIC, env var + CLI --deterministic).
- Project 01: RMSNorm tests for batch invariance and cross-run determinism; docs.
- Project 02: Deterministic CUDA matmul
  - Gate off cuBLAS in deterministic mode and route to custom kernels.
  - Implement mmvf-based deterministic column-tiling fallback; prefer mmf when eligible.
  - Expand test suite (F32/F16/BF16; M∈{256,512}, K∈{1024,4096}, B up to 64).
  - Optional MoE (mul_mat_id) invariance test scaffold (enable with TEST_MATMUL_ID=1).
  - Update docs/DETERMINISM.md with MatMul section.
- scripts/build-in-container.sh: GPU passthrough for docker when building with CUDA.
- Wire tests into CTest; both suites pass on CPU and CUDA (A4000 x2 + RTX 2000E Ada).
…ntial per-token per-slot matmul in deterministic mode when src1,dst are F32. Enable optional test via TEST_MATMUL_ID=1; matmul determinism tests now pass with MoE.
- Add deterministic branch in ggml_cuda_mul_mat_id to compute c[:,e,t] = as[:,:,ids[e,t]] @ b[:,e,t] sequentially.
- Leave fast path unchanged when not in deterministic mode or for non-F32 src1.
- Verified in container with GPUs (A4000 x2 + RTX 2000E Ada).
…olumns to F32 in deterministic path; enable MoE invariance test by default.
- In det mode, ggml_cuda_mul_mat_id now handles src1 types F32/F16/BF16 by copying single-column inputs to a contiguous device buffer and converting to F32 before matmul; sequential per-token/slot execution guarantees batch invariance.
- Update tests to always run MoE invariance alongside main matmul checks.
- Verified across A4000 x2 and RTX 2000E Ada.
…an (CUDA forward) aligned with implemented deterministic dispatch and launch policy.
- Clarify KV stride constraint (multiples of 256) and mask padding; update docs Overview scope.
- Add attention determinism test (batch invariance, cross-run; ALiBi+sinks; softcap path for D=128/256).
- Add 03B planning docs and runbook for Ada/Ampere.
- Minor test improvements for matmul/rmsnorm determinism.
…antized K/V vec support (D=128 q4_0/q8_0); F16 tile fallback; MMA gated via env; tests for toggles + quantized; docs debug controls and clarifications; status updated
…ent special head sizes unsupported in det mode; add env flag docs; add vec/MMA probe helpers comment; enable dual-arch build runbook; minor test gating for toggles/MMA
Signed-off-by: Codex CLI <[email protected]>
Deterministic Attention (03C): KV-cache invariance foundation

- Add KV-cache invariance test for prefill/decode logit matching
- Add im2col3d stride handling for non-contiguous tensors
- Improve CUDA FA deterministic dispatch with better softcap validation
- Add phase 03C planning docs focusing on KV-cache prioritized approach
- Add test-in-container script for reproducible test environments
- Enhance graph construction and KV-cache handling for determinism
- Document commit history and project status updates

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@ggerganov this is my first PR here. It's well tested and, full disclosure, contains vibe code from gpt-5 / codex. But I wanted to get a sense of direction, so please take a look and let me know if this is how you'd have gone about this yourself, and what to change for it to become merge-worthy. Happy to redo / break it up as long as the conceptual direction aligns.
I don't want to maintain guarantees for bit-for-bit identical results as the batch size is varied. Determinism for prompt caching should be handled in llama.cpp by caching the logit distribution of the last prompt eval. Determinism for batched inference should be handled by allowing users to submit prompts as a batch to ensure that they're always executed in the exact same order.
BTW, as I kind of implied in my previous post, you can already get bit-for-bit identical results by disabling prompt caching and using only a single slot for the HTTP server.
Summary
Adds an opt-in deterministic mode that makes CUDA inference bit-identical for identical inputs—independent of batch size, prompt chunking, or concurrency. When enabled, RMSNorm, dense MatMul, and Attention use batch-invariant, fixed-reduction kernels and a stable, padded KV-cache layout.
Motivation
What’s included
- `mul_mat_id` path made deterministic.

Usage

- Build: `-DGGML_DETERMINISTIC=ON`
- Runtime: `--deterministic` (or `GGML_DETERMINISTIC=1`)
- Sampling: `temperature=0, top_k=1, top_p=1`.

Scope & perf
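Concretely, with the option and flag this PR introduces, a build-and-run sequence would look like the following (model path and sampling arguments are illustrative):

```shell
# Build with the deterministic mode from this PR compiled in
cmake -B build -DGGML_CUDA=ON -DGGML_DETERMINISTIC=ON
cmake --build build -j

# Enable at runtime via the CLI flag ...
./build/bin/llama-cli -m model.gguf --deterministic --temp 0 -p "hello"

# ... or via the environment variable
GGML_DETERMINISTIC=1 ./build/bin/llama-cli -m model.gguf --temp 0 -p "hello"
```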
Tests
New tests assert:
All passing on tested NVIDIA GPUs.
Notes
- `LLAMA_DETERMINISTIC` if maintainers prefer; currently `GGML_DETERMINISTIC`.