Releases · EAddario/llama.cpp
b5139
CUDA/HIP: Share the same unified memory allocation logic (#12934)

Replace the compile-time `GGML_HIP_UMA` option with the environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.
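The note implies a single runtime switch that selects managed (unified) memory instead of a plain device allocation. A minimal sketch of how such an env-var-gated allocation path could look; this is not the actual ggml implementation, and the helper name `device_malloc` is hypothetical:

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: allocate GPU memory, switching to unified (managed)
// memory when GGML_CUDA_ENABLE_UNIFIED_MEMORY is set in the environment.
static cudaError_t device_malloc(void ** ptr, size_t size) {
    if (std::getenv("GGML_CUDA_ENABLE_UNIFIED_MEMORY") != nullptr) {
        // managed memory migrates on demand between host and device,
        // which suits integrated GPUs that share system RAM
        return cudaMallocManaged(ptr, size);
    }
    // default: plain device allocation for dedicated GPUs
    return cudaMalloc(ptr, size);
}

int main() {
    void * buf = nullptr;
    if (device_malloc(&buf, 1 << 20) == cudaSuccess) { // 1 MiB test buffer
        cudaFree(buf);
    }
    return 0;
}
```

At runtime the switch is then just a matter of launching the same binary with `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` in its environment.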
b5137
llama : DeepSeek V2/V3 MLA implementation (#12801)

* Merged using squash to remove all noise commit messages
* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large
* Removed 3 conts (2x RoPE and 1x RMS-norm)
* Changed to use `<cmath>` instead of `<math.h>`
* Reverted removal of the 3 conts
* Used `reshape` in `llm_graph_context::build_attn_mha()` (see the sketch after this list)
* Use `k_pe = ggml_reshape`
* Removed the 3 conts again
* Removed the 3D views of `wk_b` and `wv_b`, and just save as 3D in GGUF
* Removed the MQA optimisation from `build_attn_mha()` as it yields no gains now
* Simplified the `is_mla` branch in `llm_build_deepseek2()`
* Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls
* Fixed the call to `build_attn` in `llm_build_t5_enc`
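For context on the `reshape`-vs-cont items above: in ggml, `ggml_reshape_3d` creates a zero-copy view of a contiguous tensor, whereas `ggml_cont` materialises a copy as an extra node in the compute graph. A minimal sketch, with a hypothetical stand-in for the `k_pe` tensor and arbitrary dimensions, not the actual DeepSeek graph code:

```cpp
#include "ggml.h"

int main() {
    // small scratch context; sizes here are arbitrary for illustration
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // hypothetical stand-in for the rotary part of the key (k_pe):
    // 64 features x 8 tokens
    struct ggml_tensor * k_pe = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 8);

    // zero-copy 3D view (64 x 1 x 8): no ggml_cont, so no extra copy node
    // is added to the compute graph
    struct ggml_tensor * k_pe_3d = ggml_reshape_3d(ctx, k_pe, 64, 1, 8);
    (void) k_pe_3d;

    ggml_free(ctx);
    return 0;
}
```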
b5133
Add performance print for gemma3 in example (#12929)
b5129
sync : ggml ggml-ci
b5126
ggml: disable CUDA graphs for unsupported DUP and CONT node types (#1…
b5072
hellaswag: display estimated score confidence interval (#12797)
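The release note does not state how the interval is computed; a common choice for an accuracy score over n binary-scored tasks is the normal-approximation (Wald) interval for a binomial proportion. A minimal sketch with hypothetical counts:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // hypothetical results: 312 tasks answered correctly out of 400
    const int n_correct = 312;
    const int n_total   = 400;

    const double acc = double(n_correct) / n_total;

    // normal-approximation standard error of a binomial proportion
    const double se = std::sqrt(acc * (1.0 - acc) / n_total);

    // 95% confidence interval: acc +/- 1.96 * se
    printf("acc = %.4f%% +/- %.4f%%\n", 100.0 * acc, 100.0 * 1.96 * se);
    return 0;
}
```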