
[Transform] [Attention] [KV Cache] Support KV-cache integrated attention transform and quantization #428


Draft: kylesayrs wants to merge 12 commits into base: main

Conversation

@kylesayrs (Contributor) commented Aug 20, 2025

Purpose

  1. Support attention transforms
  2. Support attention quantization
  3. Support any KV cache quantization strategy
  4. Support KV cache quantization in HF transformers
LlamaAttention(
  (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
  (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
  (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
  (impl): QuantizedAttentionImpl()
  (kv_cache): QuantizedKVCache()
  (R3_q_attn): HadamardTransform(inverse=False)
  (R3_k_cache): HadamardTransform(inverse=False)
)
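
The module tree above can be read as follows: the hooked implementation, KV cache wrapper, and R3 rotations all live as named submodules of the attention layer. The snippet below is a minimal sketch of that arrangement; the class bodies are placeholders standing in for the compressed-tensors implementations, and attach_quantized_attention is a hypothetical helper, not this PR's API.

import torch
from torch import nn


class QuantizedAttentionImpl(nn.Module):
    """Placeholder for the impl submodule: because it is an nn.Module,
    PyTorch forward hooks (and anything built on them) can observe and
    fake-quantize the attention inputs it receives."""

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        return query  # real impl would fake-quantize attention inputs


class QuantizedKVCache(nn.Module):
    """Placeholder for the kv_cache submodule: holding k/v qparams here,
    rather than inside the cache object, makes them ordinary module
    parameters that calibration and serialization can see."""

    def forward(self, key: torch.Tensor, value: torch.Tensor):
        return key, value  # real impl would fake-quantize key/value states


class HadamardTransform(nn.Module):
    """Placeholder for the R3 rotations shown in the repr above."""

    def __init__(self, inverse: bool = False):
        super().__init__()
        self.inverse = inverse

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x  # real impl would apply a (possibly inverse) Hadamard rotation


def attach_quantized_attention(attn: nn.Module) -> None:
    # Register the pieces as named submodules so they appear in the module
    # tree exactly as in the LlamaAttention repr above and participate in
    # hooks and state_dict handling.
    attn.register_module("impl", QuantizedAttentionImpl())
    attn.register_module("kv_cache", QuantizedKVCache())
    attn.register_module("R3_q_attn", HadamardTransform(inverse=False))
    attn.register_module("R3_k_cache", HadamardTransform(inverse=False))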

Changes

  • Implement a hooked attention implementation
    • When a module has a quantization/transform config applied, its attention implementation is set to ct_hooked_attention
    • QuantizedAttentionImpl is registered as a submodule of the attention module and is called by ct_hooked_attention (see the sketch above)
    • Implementing it as a module allows integration with PyTorch hooks (and the HooksMixin), enabling (1) and (2)
  • Move the hooked KV cache to compressed-tensors
    • Moving qparams directly to the attention module (rather than manually tracking them in the KV cache) enables (3)
    • Moving qparam initialization to compressed-tensors enables (4)
    • Adding the ability to dynamically wrap an existing past_key_value allows for more caching flexibility (sliding window, etc.) and faster HF inference (see the cache-wrapper sketch after this list)
  • Implement attention matching in apply
    • TODO: any target which specifically matches attention is treated as attention quantization; anything else is treated as linear quantization
      • i.e. if "self_attn" in target, or module_name.split(".")[-1] in target (sketched after this list)


Signed-off-by: Kyle Sayers <[email protected]>