Releases · EAddario/llama.cpp

03 Oct 22:05

128d522

b6686 Latest

chat : support Magistral thinking (#16413)

* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls

* feat: new flow in the chat template test suite for Magistral
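
As a rough illustration of the idea in this change, the sketch below splits a raw completion into reasoning, visible content, and a trailing tool-call payload, handling the reasoning span first. It is not the parser from the PR: the closing tag `[/THINK]`, the `[TOOL_CALLS]` marker, and the names `ParsedOutput` and `split_magistral_output` are assumptions made for the example.

```python
from typing import NamedTuple, Optional

THINK_OPEN = "[THINK]"        # tag named in the release note
THINK_CLOSE = "[/THINK]"      # assumed closing tag
TOOL_MARKER = "[TOOL_CALLS]"  # assumed tool-call marker


class ParsedOutput(NamedTuple):
    reasoning: Optional[str]
    content: str
    tool_calls_raw: Optional[str]


def split_magistral_output(text: str) -> ParsedOutput:
    """Extract the reasoning span first, then look for tool calls in the rest."""
    reasoning = None
    rest = text
    start = text.find(THINK_OPEN)
    if start != -1:
        body_start = start + len(THINK_OPEN)
        end = text.find(THINK_CLOSE, body_start)
        if end == -1:
            # Unterminated span: treat everything after the opening tag as reasoning.
            reasoning = text[body_start:].strip()
            rest = text[:start]
        else:
            reasoning = text[body_start:end].strip()
            rest = text[:start] + text[end + len(THINK_CLOSE):]

    # Tool calls are only searched for in what remains after the reasoning is removed,
    # so a marker that appears inside the reasoning span is not mistaken for a call.
    tool_calls_raw = None
    marker = rest.find(TOOL_MARKER)
    if marker != -1:
        tool_calls_raw = rest[marker + len(TOOL_MARKER):].strip()
        rest = rest[:marker]

    return ParsedOutput(reasoning, rest.strip(), tool_calls_raw)
```

Removing the reasoning span before looking for the tool-call marker is the ordering the release note describes; everything else here is a simplification.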

Assets 15

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB 2025-10-03T22:05:45Z

llama-b6686-bin-macos-arm64.zip
sha256:cd63e267ee8573f2e54801741adb2e22441ef69c508f41bf2b7cacfde3894be2
10.3 MB 2025-10-03T22:05:56Z

llama-b6686-bin-macos-x64.zip
sha256:bc93291bd2bc0e726001e9188f46d2abff39228d4d88287c9a8fb7846f105093
26.7 MB 2025-10-03T22:05:57Z

llama-b6686-bin-ubuntu-vulkan-x64.zip
sha256:0cc9cdf01b139f0a5ccd457d821feb75b6b348787e5adb291fda830f226ad82e
25.6 MB 2025-10-03T22:05:59Z

llama-b6686-bin-ubuntu-x64.zip
sha256:926dc741117b58fdb755ee87884db388f3e71f246e4c84a70a09ff02db9af1c0
12.3 MB 2025-10-03T22:06:00Z

llama-b6686-bin-win-cpu-arm64.zip
sha256:ac49e6d858107654c8b77b1b6aa8db4261dd049548aced4d78037de2a72ab9c6
10.5 MB 2025-10-03T22:06:01Z

llama-b6686-bin-win-cpu-x64.zip
sha256:02fadb555207657891b440f9f5d6a829ce029b2cf340f0e2a38a2daae3f137eb
13.6 MB 2025-10-03T22:06:02Z

llama-b6686-bin-win-cuda-12.4-x64.zip
sha256:aa8cddae5ccf2b1f33314ae221a39cc596670e20997565a99ab47f9f857f3edb
149 MB 2025-10-03T22:06:03Z

llama-b6686-bin-win-hip-radeon-x64.zip
sha256:06049a6bf1c0766c10b152717786711291d0cf35fea60e4b608607aab4b1c8ee
313 MB 2025-10-03T22:06:09Z

llama-b6686-bin-win-opencl-adreno-arm64.zip
sha256:99cfe22d17c5242bc3584876444f5e3cb3d6f3d36e3ac0bab368bfc6276d81d8
10.9 MB 2025-10-03T22:06:16Z

Source code (zip)
2025-10-03T18:51:48Z

Source code (tar.gz)
2025-10-03T18:51:48Z
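
Each binary asset above is listed with a `sha256:` digest, so a downloaded archive can be checked against it. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `matches_listed_digest` are made up for this example, and the local file path is assumed to be wherever the archive was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_digest(path: Path, listed: str) -> bool:
    """Compare a file against a digest written as 'sha256:<hex>' or bare hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return sha256_of(path) == expected
```

For example, `matches_listed_digest(Path("llama-b6686-bin-ubuntu-x64.zip"), "sha256:926dc741117b58fdb755ee87884db388f3e71f246e4c84a70a09ff02db9af1c0")` should return `True` for an intact download of that asset.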

03 Oct 15:00

github-actions

b6683

946f71e

llama : fix shapes for bert/mpt q/k norm (#16409)

Assets 15

03 Oct 11:05

github-actions

b6679

0e1f838

vulkan: Fix FA coopmat1 invalid array indexing (#16365)

When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
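
The effect of the wrong loop bound can be shown with a toy model that has no relation to the actual shader source. It assumes each thread holds a small array with `rows_per_thread` entries while `Br` counts the rows of the whole tile; the concrete sizes and the helper `out_of_range_indices` are hypothetical.

```python
from typing import List

BR = 16               # hypothetical number of rows in the tile
ROWS_PER_THREAD = 4   # hypothetical number of rows owned by one thread


def out_of_range_indices(loop_bound: int, array_len: int) -> List[int]:
    """Return the loop indices that would fall outside an array of array_len entries."""
    return [r for r in range(loop_bound) if r >= array_len]


# Looping to Br walks past the end of a rows_per_thread-sized array.
buggy = out_of_range_indices(BR, ROWS_PER_THREAD)

# Looping to rows_per_thread stays in bounds.
fixed = out_of_range_indices(ROWS_PER_THREAD, ROWS_PER_THREAD)
```

In a shader such out-of-range accesses are not guaranteed to fail visibly, which is consistent with the note that current drivers did not report errors.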

Assets 15

01 Oct 18:28

github-actions

b6660

4201dea

common: introduce http.h for httplib-based client (#16373)

* common: introduce http.h for httplib-based client

This change moves cpp-httplib based URL parsing and client setup into
a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`.

It is an iteration towards removing libcurl, while intentionally
minimizing changes to existing code to guarantee the same behavior when
`LLAMA_CURL` is used.

Signed-off-by: Adrien Gallouët

* tools : add missing WIN32_LEAN_AND_MEAN

Signed-off-by: Adrien Gallouët

---------

Signed-off-by: Adrien Gallouët
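
To illustrate the kind of URL handling the commit message describes, here is a sketch that splits a URL into a `scheme://host[:port]` prefix and a request path, the two pieces an HTTP client typically needs to open a connection and issue a request. It uses Python's standard library rather than cpp-httplib, and `split_url` is a hypothetical helper, not a function from `common/http.h`.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_url(url: str) -> Tuple[str, str]:
    """Split a URL into (scheme://host[:port], path[?query])."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    base = f"{parts.scheme}://{parts.netloc}"
    # An empty path means the root resource.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return base, path
```

Keeping this logic in one helper mirrors the motivation stated in the commit message: call sites share a single parsing routine instead of each handling URLs on its own.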

Assets 15

01 Oct 17:05

github-actions

b6658

2a9b633

Improve code block color theming (#16325)

* feat: Improve code block theming

* chore: update webui build output

* chore: Update webui static build

Assets 15

20 Sep 21:43

github-actions

b6527

7f76692

sync : ggml

Assets 15

19 Sep 08:09

github-actions

b6519

4b8560a

chat : fix build on arm64 (#16101)

Assets 15

15 Sep 07:39

github-actions

b6475

b8e09f0

model : add grok-2 support (#15539)

* add grok-2 support

* type fix

* type fix

* type fix

* "fix" vocab for invalid sequences

* fix expert tensor mapping and spaces in vocab

* add chat template

* fix norm tensor mapping

* rename layer_out_norm to ffn_post_norm

* ensure ffn_post_norm is mapped

* fix experts merging

* remove erroneous FFN_GATE entry

* concatenate split tensors and add more metadata

* process all expert layers and try cat instead of hstack

* add support for community BPE vocab

* fix expert feed forward length and ffn_down concat

* commit this too

* add ffn_up/gate/down, unsure if sequence is right

* add ffn_gate/down/up to tensor names

* correct residual moe (still not working)

* mess--

* fix embedding scale being applied twice

* add built in chat template

* change beta fast for grok if default value

* remove spm vocab in favor of community bpe vocab

* change attention temp length metadata type to integer

* update attention temp length metadata

* remove comment

* replace M_SQRT2 with std::sqrt(2)

* add yarn metadata, move defaults to hparams

Assets 15

10 Sep 21:13

github-actions

b6445

00681df

CUDA: Add `fastdiv` to `k_bin_bcast*`, giving 1-3% E2E performance (#…
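
The release title only names `fastdiv`, so the following is a sketch of one common way to divide by a divisor that stays fixed across many operations: precompute a magic multiplier and a shift once, then replace each division with a multiply, an add, and a shift. The function names and the exact scheme are assumptions for illustration and are not taken from the CUDA kernels.

```python
from typing import Tuple


def init_fastdiv_values(d: int) -> Tuple[int, int]:
    """Precompute (mp, shift) for a divisor d with 1 <= d < 2**32."""
    if not 1 <= d < 2**32:
        raise ValueError("divisor must be non-zero and fit in 32 bits")
    shift = (d - 1).bit_length()  # ceil(log2(d))
    mp = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return mp, shift


def fastdiv(n: int, mp: int, shift: int) -> int:
    """Compute n // d for 0 <= n < 2**32 using the precomputed values for d."""
    hi = (n * mp) >> 32  # high 32 bits of the 64-bit product
    return (hi + n) >> shift


# Precompute once, then reuse for every division by the same divisor.
mp, shift = init_fastdiv_values(3)
quotient = fastdiv(9, mp, shift)
```

Python integers do not overflow, so the sum `hi + n` is exact here; a fixed-width implementation has to make sure that addition cannot wrap. The payoff comes when the same divisor is reused for many index computations, since the precomputation cost is paid only once.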

Assets 15

06 Sep 12:53

github-actions

b6399

61bdfd5

server : implement prompt processing progress report in stream mode (…

Assets 15
