
[Transform] [Attention] [KV Cache] Support KV-cache integrated attention transform and quantization #428


Draft: kylesayrs wants to merge 12 commits into base: main

Conversation

@kylesayrs (Contributor) commented Aug 20, 2025

Purpose

  1. Support attention transforms
  2. Support attention quantization
  3. Support any KV cache quantization strategy
  4. Support KV cache quantization in HF transformers
LlamaAttention(
  (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
  (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
  (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (impl): QuantizedAttentionImpl()
  (kv_cache): QuantizedKVCache()
  (R3_q_attn): HadamardTransform(inverse=False)
  (R3_k_cache): HadamardTransform(inverse=False)
)
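
The module tree above can be read as follows: the hooked implementation, KV cache wrapper, and R3 rotations all live as named submodules of the attention layer. The snippet below is a minimal sketch of that arrangement; the class bodies are placeholders standing in for the compressed-tensors implementations, and attach_quantized_attention is a hypothetical helper, not this PR's API.

import torch
from torch import nn


class QuantizedAttentionImpl(nn.Module):
    """Placeholder for the impl submodule: because it is an nn.Module,
    PyTorch forward hooks (and anything built on them) can observe and
    fake-quantize the attention inputs it receives."""

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        return query  # real impl would fake-quantize attention inputs


class QuantizedKVCache(nn.Module):
    """Placeholder for the kv_cache submodule: holding k/v qparams here,
    rather than inside the cache object, makes them ordinary module
    parameters that calibration and serialization can see."""

    def forward(self, key: torch.Tensor, value: torch.Tensor):
        return key, value  # real impl would fake-quantize key/value states


class HadamardTransform(nn.Module):
    """Placeholder for the R3 rotations shown in the repr above."""

    def __init__(self, inverse: bool = False):
        super().__init__()
        self.inverse = inverse

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x  # real impl would apply a (possibly inverse) Hadamard rotation


def attach_quantized_attention(attn: nn.Module) -> None:
    # Register the pieces as named submodules so they appear in the module
    # tree exactly as in the LlamaAttention repr above and participate in
    # hooks and state_dict handling.
    attn.register_module("impl", QuantizedAttentionImpl())
    attn.register_module("kv_cache", QuantizedKVCache())
    attn.register_module("R3_q_attn", HadamardTransform(inverse=False))
    attn.register_module("R3_k_cache", HadamardTransform(inverse=False))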

Changes

  • Implement a hooked attention implementation
    • When a module has a quantization/transform config applied, its attention implementation is set to ct_hooked_attention
    • QuantizedAttentionImpl is registered as a submodule of the attention module and is called by ct_hooked_attention (see the sketch above)
    • Implementing it as a module allows integration with PyTorch hooks (and the HooksMixin), enabling (1) and (2)
  • Move the hooked KV cache to compressed-tensors
    • Moving qparams directly to the attention module (rather than manually tracking them in the KV cache) enables (3)
    • Moving qparam initialization to compressed-tensors enables (4)
    • Adding the ability to dynamically wrap an existing past_key_value allows for more caching flexibility (sliding window, etc.) and faster HF inference (see the cache-wrapper sketch after this list)
  • Implement attention matching in apply
    • TODO: any target which specifically matches attention is treated as attention quantization; anything else is treated as linear quantization
      • i.e. if "self_attn" in target, or module_name.split(".")[-1] in target (sketched after this list)


Signed-off-by: Kyle Sayers <[email protected]>