Conversation

@shalinib-ibm
Contributor

This patch implements tiled GEMM for large blocks, where we pack blocks of 64x64 and perform matmul.

30 ~ 50 % improvement in llama-bench and llama-batched-bench with Meta-Llama3-8B quantized models (Q4_0 and Q8_0).

Signed-off-by: Shalini Salomi Bodapati <[email protected]>
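For readers unfamiliar with the approach, the sketch below shows the classic blocked-GEMM loop structure over plain floats. It is illustrative only: the actual patch packs quantized 64x64 blocks into a vector-friendly layout and multiplies them with POWER MMA intrinsics rather than the scalar inner loops shown here.

#include <algorithm>
#include <cstdint>

// Illustrative 64x64-blocked GEMM over plain floats (row-major), not the patch code:
// C[m x n] += A[m x k] * B[k x n]
static void gemm_tiled_f32(int64_t m, int64_t n, int64_t k,
                           const float * A, const float * B, float * C) {
    const int64_t MC = 64, NC = 64, KC = 64;  // tile sizes matching the 64x64 blocks described above
    for (int64_t jc = 0; jc < n; jc += NC) {
        for (int64_t pc = 0; pc < k; pc += KC) {
            for (int64_t ic = 0; ic < m; ic += MC) {
                const int64_t ie = std::min(ic + MC, m);
                const int64_t je = std::min(jc + NC, n);
                const int64_t pe = std::min(pc + KC, k);
                // multiply one MC x KC block of A by one KC x NC block of B,
                // accumulating into the corresponding MC x NC block of C
                for (int64_t i = ic; i < ie; ++i) {
                    for (int64_t p = pc; p < pe; ++p) {
                        const float a = A[i*k + p];
                        for (int64_t j = jc; j < je; ++j) {
                            C[i*n + j] += a * B[p*n + j];
                        }
                    }
                }
            }
        }
    }
}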
@shalinib-ibm
Contributor Author

@taronaeo Can you please review this PR?

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 4, 2025
@shalinib-ibm
Contributor Author

@ggerganov Can you please review this PR?

Comment on lines +122 to +145

#include <pthread.h>

typedef vector unsigned char vec_t;
typedef __vector_quad acc_t;

static pthread_key_t t_data_key;

// per-thread scratch buffers for the packed A/B tiles and the compensation array
typedef struct {
    vec_t * A_pack;
    vec_t * B_pack;
    int   * comparray;
} thread_scratchpad_t;

// destructor registered with the pthread key; frees a thread's scratchpad on thread exit
void thread_cleanup(void * arg) {
    thread_scratchpad_t * data = (thread_scratchpad_t *)arg;
    if (data) {
        delete[] data->A_pack;
        delete[] data->B_pack;
        delete[] data->comparray;

        delete data;
    }
}

static bool key_created = false;

@ggerganov
Member

It would be better to avoid dynamic allocations - none of the code currently uses those. The mechanism for this is to use the wdata from ggml_compute_params to store scratch data. You'll need to reserve the worst-case wsize for your case.
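As a rough illustration of what that worst-case reservation might look like for this kernel (a sketch only: the function name and the MC/NC/KC tile parameters are assumptions, vec_t is the 16-byte vector typedef from the hunk above, and the real reservation would go wherever wsize is computed for the mul_mat op):

#include <cstddef>
#include <cstdint>

// Sketch of a worst-case scratch-size estimate for the tiled kernel,
// matching the three regions carved out of wdata below.
static size_t tiled_gemm_wsize(int64_t MC, int64_t NC, int64_t KC) {
    const size_t align = 128;               // alignment slack used when carving up wdata
    size_t sz = 0;
    sz += sizeof(vec_t) * MC * KC * 2;      // packed A tiles
    sz += sizeof(vec_t) * NC * KC * 2;      // packed B tiles
    sz += sizeof(int)   * MC * KC;          // comparray (per-block integer sums)
    return sz + 2 * align;                  // headroom for aligning both regions
}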

@shalinib-ibm
Contributor Author

@ggerganov Thank you for the input. I tried to avoid the dynamic allocation, but lost performance without the pthread-based code. Below is the code and a performance comparison after integrating a thread-local scratchpad using wdata.

void matmul_tiled(const ggml_compute_params * params,
                  int64_t m, int64_t n, int64_t mc, int64_t nc, int64_t kc) {
    char * wdata = (char *) params->wdata;
    constexpr size_t ALIGN = 128;
    auto align_ptr = [&](char * ptr, size_t alignment) {
        return (char *)(((uintptr_t)ptr + alignment - 1) & ~(alignment - 1));
    };
    char * ptr = align_ptr(wdata, ALIGN);
    vec_t * A_pack = (vec_t *)ptr;  ptr += sizeof(vec_t) * mc * kc * 2;
    vec_t * B_pack = (vec_t *)ptr;  ptr += sizeof(vec_t) * nc * kc * 2;
    int * comparray = (int *)align_ptr(ptr, ALIGN);  // integer part aligned too
    ptr += sizeof(int) * mc * kc;
    // rest of the original matmul_tiled() code unchanged
}

Benchmark (llama-bench)   Baseline   pthread-based TLS   ggml wdata-based TLS
pp128                     69 t/s     89 t/s              36 t/s
pp256                     69 t/s     94 t/s              36 t/s

This regression is likely due to:

  1. Loss of persistent per-thread cache locality: the previous pthread-based version reused the same buffers effectively across tiles.
  2. Higher memory-initialization overhead and contention on the shared buffer across threads.

I have also tried static allocation on the stack with just the following code, but it suffers a similar performance loss (38 t/s):

vec_t A_pack[mc*kc*2];
vec_t B_pack[nc*kc*2];
int comparray[mc*kc];

Can you please suggest a way forward?
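For reference, here is a minimal sketch of the lazy per-thread getter that would pair with the t_data_key/thread_cleanup shown in the hunk above. It illustrates the pthread-based mechanism being discussed (the getter name and the once-only guard are assumptions), not necessarily the exact patch code.

// Sketch of a lazy per-thread getter pairing with t_data_key/thread_cleanup above.
// Buffer sizes use the worst-case tile dimensions; names are illustrative.
static thread_scratchpad_t * get_thread_scratchpad(int64_t mc, int64_t nc, int64_t kc) {
    if (!key_created) {
        // note: in real code this must happen exactly once (e.g. via pthread_once)
        pthread_key_create(&t_data_key, thread_cleanup);
        key_created = true;
    }
    thread_scratchpad_t * data = (thread_scratchpad_t *) pthread_getspecific(t_data_key);
    if (data == NULL) {
        data = new thread_scratchpad_t;
        data->A_pack    = new vec_t[mc * kc * 2];
        data->B_pack    = new vec_t[nc * kc * 2];
        data->comparray = new int[mc * kc];
        pthread_setspecific(t_data_key, data);  // thread_cleanup frees it at thread exit
    }
    return data;  // buffers persist and are reused across tiles/calls on this thread
}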
