This fork implements low-level AMD GCN ISA optimizations for llama.cpp inference, specifically targeting the AMD MI50/MI60/Vega VII GPUs (GFX906 / Vega 20 architecture).
Key Achievement: Replaced generic shuffle-based reductions with fused DPP+ALU instructions, reducing instruction count by ~37% in critical reduction paths.
Test Configuration: ROCm backend, ngl=99, threads=12, batch=1024, KV cache: q8_0, Flash Attention enabled
| Model | Quant | Context | Test Type | Vanilla (t/s) | Fork (t/s) | Improvement (t/s) | Speedup |
|---|---|---|---|---|---|---|---|
| Qwen3 4B | Q4_0 | d=0 | pp512 | 1782.62 ± 0.59 | 2023.40 ± 0.86 | +240.78 | +13.5% |
| Qwen3 4B | Q4_0 | d=0 | tg128 | 127.95 ± 0.02 | 134.61 ± 0.04 | +6.66 | +5.2% |
| Qwen3 4B | Q4_0 | d=2048 | pp512 | 1382.44 ± 17.69 | 1612.72 ± 0.95 | +230.28 | +16.7% |
| Qwen3 4B | Q4_0 | d=2048 | tg128 | 81.58 ± 1.56 | 107.05 ± 0.03 | +25.47 | +31.2% |
| Qwen3 4B | Q4_1 | d=0 | pp512 | 1859.20 ± 0.61 | 1921.99 ± 0.46 | +62.79 | +3.4% |
| Qwen3 4B | Q4_1 | d=0 | tg128 | 132.30 ± 0.01 | 139.82 ± 0.02 | +7.52 | +5.7% |
| Qwen3 4B | Q4_1 | d=2048 | pp512 | 1498.35 ± 0.51 | 1541.53 ± 1.72 | +43.18 | +2.9% |
| Qwen3 4B | Q4_1 | d=2048 | tg128 | 88.51 ± 0.01 | 110.83 ± 0.02 | +22.32 | +25.2% |
| Qwen3VLMoE 30B | Q4_1 | d=0 | pp512 | 1245.10 ± 11.10 | 1362.27 ± 11.47 | +117.17 | +9.4% |
| Qwen3VLMoE 30B | Q4_1 | d=0 | tg128 | 97.65 ± 0.04 | 100.87 ± 0.03 | +3.22 | +3.3% |
| Qwen3VLMoE 30B | Q4_1 | d=2048 | pp512 | 1022.02 ± 19.17 | 1146.50 ± 8.23 | +124.48 | +12.2% |
| Qwen3VLMoE 30B | Q4_1 | d=2048 | tg128 | 70.10 ± 0.69 | 81.86 ± 0.04 | +11.76 | +16.8% |
Legend:
- pp512: Prompt processing with 512 tokens
- tg128: Text generation with 128 tokens
- d=0: No context
- d=2048: With 2048 tokens of context
- t/s: Tokens per second
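For reference, a llama-bench invocation approximating this configuration might look like the sketch below; the model path is a placeholder and flag spellings can vary between llama.cpp versions, so check `./build/bin/llama-bench --help` against your build.

```bash
# Sketch: all layers on GPU, 12 threads, batch 1024, q8_0 KV cache,
# flash attention on, pp512/tg128 at depths 0 and 2048.
./build/bin/llama-bench \
    -m /path/to/Qwen3-4B-Q4_0.gguf \
    -ngl 99 -t 12 -b 1024 \
    -ctk q8_0 -ctv q8_0 -fa 1 \
    -p 512 -n 128 -d 0,2048
```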
Replaced separate shuffle + arithmetic operations with single fused DPP+ALU instructions:
Before (2 instructions):

```cpp
float other = __shfl_xor(x, 1);  // DPP shuffle
x = x + other;                   // ALU add
```

After (1 instruction):

```cpp
x = hip_add_xor1_f32(x);         // fused v_add_f32_dpp
```

Impact:
- Reduction operations: 37% fewer instructions (10-15 → 8 instructions)
- Fused XOR 1, 2, 8 patterns
- XOR 4, 16 remain unfused (architectural limitation)
- Applied to: argmax, quantization, flash attention, top-k MoE
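To see how the fused and unfused steps compose, here is a minimal sketch of a wave-wide sum under the constraints listed above. `hip_add_xor2_f32` and `hip_add_xor8_f32` are assumed siblings of the `hip_add_xor1_f32` helper shown earlier, so treat the exact names and offset mix as illustrative rather than the fork's actual code.

```cpp
#include <hip/hip_runtime.h>

// Hypothetical sketch: wave64 sum mixing fused DPP+ALU steps (XOR 1, 2, 8)
// with plain shuffles for the offsets DPP cannot express (XOR 4, 16, 32).
// The hip_add_xor*_f32 helpers are assumed to come from the fork's common.cuh.
__device__ float warp_reduce_sum_gfx906(float x) {
    x = hip_add_xor1_f32(x);   // fused: v_add_f32_dpp quad_perm:[1,0,3,2]
    x = hip_add_xor2_f32(x);   // fused: v_add_f32_dpp quad_perm:[2,3,0,1]
    x += __shfl_xor(x, 4);     // unfused: no DPP control for XOR 4
    x = hip_add_xor8_f32(x);   // fused: XOR 8 within the 16-lane DPP row
    x += __shfl_xor(x, 16);    // unfused: crosses the DPP row boundary
    x += __shfl_xor(x, 32);    // unfused: crosses between wave halves
    return x;
}
```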
Replaced 8 scalar 32-bit loads with 2 vectorized 128-bit int4 loads:
Impact:
- ~2× memory throughput for Q4_0/Q4_1 quantization formats
- Improved vec_dot operation performance
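A minimal sketch of the load pattern, assuming 16-byte-aligned quant data; the function and variable names are illustrative, not the fork's actual `mmq.cuh` code.

```cpp
#include <hip/hip_runtime.h>

// Hypothetical sketch: eight scalar 32-bit loads replaced by two 128-bit
// (int4) vector loads of the same 32 bytes of quantized data.
__device__ void load_q4_quants(const int * __restrict__ qs, int vals[8]) {
    // Before: 8 independent dword loads
    //   for (int i = 0; i < 8; ++i) vals[i] = qs[i];

    // After: 2 vectorized 128-bit loads (requires 16-byte alignment)
    const int4 lo = reinterpret_cast<const int4 *>(qs)[0];
    const int4 hi = reinterpret_cast<const int4 *>(qs)[1];
    vals[0] = lo.x; vals[1] = lo.y; vals[2] = lo.z; vals[3] = lo.w;
    vals[4] = hi.x; vals[5] = hi.y; vals[6] = hi.z; vals[7] = hi.w;
}
```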
Flash attention and kernel-selection changes:
- Fixed kernel selection logic for GFX906 single-token generation
- GCN-tuned thread counts (`nthreads_KQ_q=2`, `nthreads_V_q=4`)
- Generic reduction templates with type-aware dispatch (see the sketch below)
- Restored GFX906 compatibility
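A minimal sketch of what such type-aware dispatch can look like, assuming a fused float path like the `warp_reduce_sum_gfx906` sketch above and a generic shuffle loop for every other type; the template name and structure are illustrative only.

```cpp
#include <hip/hip_runtime.h>

// Hypothetical sketch: generic reduction template with a type-aware
// specialization. Non-float types use the plain wave64 butterfly; float is
// routed to the fused DPP path sketched earlier.
template <typename T>
__device__ T warp_reduce_sum_any(T x) {
    #pragma unroll
    for (int offset = 32; offset > 0; offset >>= 1) {
        x += __shfl_xor(x, offset, 64);   // generic shuffle-based step
    }
    return x;
}

template <>
__device__ float warp_reduce_sum_any<float>(float x) {
    return warp_reduce_sum_gfx906(x);     // GFX906: fused DPP+ALU steps
}
```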
DPP (Data Parallel Primitives) on AMD GCN allow:
- Lane-to-lane data movement within a wavefront (64 threads)
- Fusion with ALU operations (add, max, etc.)
- Single-cycle execution for common patterns
Wait-state management with `s_nop` (critical for correctness):
```cpp
asm volatile(
    "s_nop 4\n"  // FIRST DPP: EXEC mask hazard protection
    "v_add_f32_dpp %0, %1, %1 quad_perm:[1,0,3,2] ..."
    : "=v"(result) : "v"(x)
);

asm volatile(
    "s_nop 1\n"  // SUBSEQUENT DPP: VGPR→DPP data hazard
    "v_add_f32_dpp %0, %1, %1 quad_perm:[2,3,0,1] ..."
    : "=v"(result) : "v"(x)
);
```

All llama.cpp-supported models work with this fork. Extensively tested with:
- Qwen3-4B (Q4_0, Q4_1)
- Qwen3VLMoE-30B (Q4_0, Q4_1) - Vision + MoE model
- ROCm 7.0.1 (tested version)
- CMake 3.21+
- HIP compiler toolchain
- AMD GFX906 GPU (MI50/MI60/Vega VII)
- Ubuntu 24.04 (tested, other distros should work)
```bash
# Ubuntu
sudo apt update
sudo apt install cmake build-essential

# Install ROCm 7.0.1 following AMD's official guide
# Note: the Tensile library for gfx906 must be imported for ROCm 7.0.1

# Verify ROCm installation
/opt/rocm/bin/rocm-smi
```

```bash
git clone https://github.com/iacopPBK/llama.cpp-gfx906.git
cd llama.cpp-gfx906
chmod +x SCRIPT_compile_MI50.sh
./SCRIPT_compile_MI50.sh
```

The compilation script automatically:
- Sets GFX906-specific compiler flags
- Enables HIP backend with GFX906 optimizations
- Builds with flash attention support
- Links against ROCm libraries (rocBLAS, hipBLAS)
```bash
# Edit SCRIPT_launch_server_MI50.sh to set your model path
vim SCRIPT_launch_server_MI50.sh

# Launch server with Flash Attention and KV quantization
./SCRIPT_launch_server_MI50.sh
```
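If you would rather bypass the script, a roughly equivalent direct invocation might look like this sketch; the model path is a placeholder and flag spellings differ across llama.cpp versions, so verify against `./build/bin/llama-server --help`.

```bash
# Sketch: all layers on GPU, flash attention, q8_0 KV cache
# (mirrors the benchmark configuration above).
./build/bin/llama-server \
    -m /path/to/Qwen3-4B-Q4_0.gguf \
    -ngl 99 \
    -fa on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --host 0.0.0.0 --port 8080
```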
The optimized build sets these environment variables automatically:

```bash
export HSA_OVERRIDE_GFX_VERSION=9.0.6
export HIP_VISIBLE_DEVICES=0
export ROCR_VISIBLE_DEVICES=0
export GGML_BACKEND_HIP=1
export HCC_AMDGPU_TARGET=gfx906
```

The build enables these optimizations:
- `GGML_HIP=ON` - Enable the HIP backend
- `GGML_HIP_GFX906_OPTIMIZED=ON` - GFX906-specific optimizations
- `CMAKE_HIP_ARCHITECTURES=gfx906` - Target the GFX906 architecture
- Flash attention with F16 precision (hardcoded)
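For reference, a manual configure/build along these lines should be roughly equivalent; this is a sketch derived from the options listed above, not a copy of what `SCRIPT_compile_MI50.sh` actually runs.

```bash
# Sketch: configure and build with the GFX906 options listed above.
cmake -B build \
    -DGGML_HIP=ON \
    -DGGML_HIP_GFX906_OPTIMIZED=ON \
    -DCMAKE_HIP_ARCHITECTURES=gfx906 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build -j"$(nproc)"
```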
Run the benchmark script:

```bash
./SCRIPT_llama_bench.sh
```

Example run with a vision model:

```bash
./build/bin/llama-cli \
    -m model.gguf \
    --image test.jpg \
    -p "Describe this image"
```
- `ggml/src/ggml-cuda/common.cuh` - Unified DPP optimization section, fused DPP+ALU templates
- `ggml/src/ggml-cuda/mmq.cuh` - Vectorized int4 loads for Q4_0/Q4_1
- `ggml/src/ggml-cuda/quantize.cu` - Fused warp reductions
- `ggml/src/ggml-cuda/fattn-vec.cuh` - GCN-tuned thread counts
- `ggml/src/ggml-cuda/fattn.cu` - Fixed kernel selection
- `ggml/src/ggml-cuda/vecdotq.cuh` - 2-byte aligned loads
- `ggml/src/ggml-cuda/argmax.cu` - Fused reductions
- `ggml/src/ggml-cuda/topk-moe.cu` - Fused reductions
Built with care for the AMD GFX906 community ❤️🔥