This repository contains a self-contained notebook that demonstrates how to install TensorRT-LLM, build inference engines, and evaluate Large Language Models (LLMs) with it. The focus is on optimizing inference performance through reduced precision (FP16, AWQ) and KV-Cache quantization.
It is intended for students and practitioners with limited compute resources who want a practical, reproducible introduction to LLM inference optimization. All experiments were conducted on an NVIDIA A100 GPU in Google Colab, making the workflow accessible without requiring direct access to on-premise datacenter hardware.
- Colab-Ready: The notebook is self-contained and can be run directly in Google Colab without complex local setup.
- Student-Friendly Model Size: Uses a 1B parameter model, large enough to show meaningful tradeoffs but small enough to run within resource-constrained environments.
- End-to-End Demonstration: Covers loading, quantization, and inference with multiple backends.
- Optimization Techniques: Demonstrates FP16 precision, INT8 KV-Cache quantization, and AWQ (activation-aware weight quantization).
- Evaluation Across Context Lengths: Benchmarks short-context and long-context scenarios to highlight where optimizations provide the most benefit.
FP16 Precision: Running the model in FP16 roughly halves memory usage compared to FP32 and improves throughput, with negligible accuracy degradation.
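As a rough illustration, the sketch below loads a small model at FP16 through TensorRT-LLM's high-level `LLM` API and generates from a single prompt. The model name is a placeholder, and details such as the `dtype` argument and the output object layout vary between TensorRT-LLM releases, so treat this as a sketch rather than a drop-in cell.

```python
# Minimal FP16 inference sketch using TensorRT-LLM's high-level LLM API.
# Argument names (e.g. `dtype`) can differ between TensorRT-LLM releases.
from tensorrt_llm import LLM, SamplingParams

# Any ~1B-parameter Hugging Face checkpoint works here; the name is a placeholder.
llm = LLM(model="meta-llama/Llama-3.2-1B", dtype="float16")

params = SamplingParams(max_tokens=64, temperature=0.8)
outputs = llm.generate(["Explain KV-Cache quantization in one sentence."],
                       sampling_params=params)

for out in outputs:
    print(out.outputs[0].text)
```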
INT8 KV-Cache Quantization: The KV-Cache grows linearly with sequence length and becomes the main memory bottleneck in long-context inference. Quantizing it to INT8 halves its footprint relative to FP16, enabling longer contexts and faster inference.
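The scaling behavior is easy to estimate with back-of-the-envelope arithmetic. The sketch below uses a hypothetical 1B-class layer/head configuration (not the exact model used in the notebook) to show how the per-sequence KV-Cache grows with context length and how INT8 storage halves it relative to FP16.

```python
# Back-of-the-envelope KV-Cache sizing: 2 tensors (K and V) per layer,
# each storing num_kv_heads * head_dim values per token of the sequence.
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 1B-class configuration (illustrative only).
layers, kv_heads, head_dim = 16, 8, 64

for ctx in (512, 4096, 32768):
    fp16 = kv_cache_bytes(ctx, layers, kv_heads, head_dim, bytes_per_elem=2)
    int8 = kv_cache_bytes(ctx, layers, kv_heads, head_dim, bytes_per_elem=1)
    print(f"ctx={ctx:6d}  FP16: {fp16 / 2**20:8.1f} MiB   INT8: {int8 / 2**20:8.1f} MiB")
```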
AWQ (Activation-Aware Weight Quantization): Quantizes weights to 4-bit or 8-bit while keeping activations in higher precision, balancing performance, memory, and accuracy for efficient deployment in both datacenter and edge settings.
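In TensorRT-LLM this is typically requested through a quantization config when the engine is built. The sketch below assumes the `QuantConfig`/`QuantAlgo` helpers exposed by the LLM API (`tensorrt_llm.llmapi`) and the `W4A16_AWQ` algorithm; module paths and enum names have shifted between releases, so check the installed version.

```python
# Sketch: building an engine with 4-bit AWQ weights via the LLM API.
# Assumes QuantConfig/QuantAlgo are importable from tensorrt_llm.llmapi
# in the installed release; calibration is handled internally by default.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ)

# Placeholder model name; quantization happens during engine build.
llm = LLM(model="meta-llama/Llama-3.2-1B", quant_config=quant_config)
print(llm.generate(["Hello"])[0].outputs[0].text)
```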
Long-context inference is increasingly important for applications such as code assistants, multi-document summarization, and conversational AI. Without optimization, KV-Cache memory quickly becomes a limiting factor. This project illustrates the importance of cache quantization by comparing short-context and long-context scenarios.
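A quick way to see the effect is to time the same engine on a short and a long prompt, as in the sketch below. The model name, prompt lengths, and single-run timing are placeholders for illustration; the notebook's actual benchmarks and a tool such as `trtllm-bench` give more reliable numbers.

```python
import time

from tensorrt_llm import LLM, SamplingParams

# Rough wall-clock comparison of short- vs long-context generation.
# Placeholder model; the engine must be built with a max sequence length
# large enough to hold the long prompt.
llm = LLM(model="meta-llama/Llama-3.2-1B", dtype="float16")
params = SamplingParams(max_tokens=128)

short_prompt = "Summarize the benefits of KV-Cache quantization."
long_prompt = "Here is a long passage that needs summarizing. " * 400

for name, prompt in [("short", short_prompt), ("long", long_prompt)]:
    start = time.perf_counter()
    out = llm.generate([prompt], sampling_params=params)
    elapsed = time.perf_counter() - start
    print(f"{name}-context: {elapsed:.2f} s, output: {out[0].outputs[0].text[:60]!r}")
```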
- Model Choice: A 1B parameter model was selected for reproducibility. Larger models can be substituted if additional GPU memory is available.
- GPU Support: Works on A100 and newer GPUs. Not supported on T4 GPUs, as TensorRT-LLM kernels (e.g., multi-head attention) are not optimized for them.
- Installation: Designed to run in Colab. For production-scale development, follow NVIDIA's TensorRT-LLM Installation Instructions.
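For reference, the Colab setup usually reduces to a single notebook cell that installs the wheel from NVIDIA's package index. The exact index URL and version pins change over time, so treat this as an assumption and defer to the official instructions.

```python
# Colab cell: the leading "!" runs a shell command from the notebook.
# The index URL below is an assumption and may change between releases;
# defer to NVIDIA's official TensorRT-LLM installation instructions.
!pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com
```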
- Self-contained Colab notebook demonstrating KV-Cache quantization.
- Evaluation on both short-context and long-context inputs.
- Integrate evaluation based on trtllm-bench to standardize comparisons across backends.
- Add GPU profiling with Nsight Systems / Nsight Compute to analyze kernel-level performance (a command sketch follows this list).
- Explore additional precision strategies (e.g., FP8) for inference on H100-class GPUs.
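For the profiling item above, a typical starting point is wrapping an exported inference script with Nsight Systems, assuming `nsys` is installed in the environment. The script name is a placeholder and the trace options shown are just a common default.

```python
# Colab cell (or drop the "!" to run from a shell): capture a timeline of
# CUDA kernels and host activity. "run_inference.py" is a placeholder for
# the script or exported notebook being profiled.
!nsys profile -o trtllm_inference --trace=cuda,nvtx python3 run_inference.py
```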