update gpu block size based on xattn #2764
base: master
Conversation
Force-pushed: 5a146f4 → e8311d7, e8311d7 → eded411, b56c1c5 → ac7a454
I'll leave C++ review to @vshampor
Waiting for Vasily
The referenced test changes SparseAttentionMode::TRISHAPE to SparseAttentionMode::XATTENTION without changing its expectations about the block size of 16 that TRISHAPE used to work perfectly with.
Also, I guess I'll have to ask again - why is it that only enabling xattention changes the GPU block size implicitly to 128? If it's such a performant block size anyway, why do you not change the GPU block size to 128 everywhere and avoid implicit block size changes of which the user is unaware?
See my previous comment
Pull Request Overview
This PR updates the GPU block size configuration to support XAttention, which uses a larger block size (256) than the standard GPU block size (16). The changes detect XAttention at runtime and configure the appropriate block size accordingly; a minimal sketch of this selection logic follows the list below.
- Adds XAttention detection logic based on cache dimensions
- Introduces sparse attention configuration support in benchmarking tools
- Refactors sparse attention setup into a reusable function
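A minimal sketch of the block-size selection described above. The constants mirror the values named in this PR; the helper function `select_block_size` and its signature are illustrative stand-ins, not the PR's actual `cache_manager.hpp` code:

```cpp
#include <cstddef>

// Block sizes as described in the PR; the selection helper below is a
// hypothetical stand-in for the logic in cache_manager.hpp.
constexpr std::size_t cpu_block_size = 32;
constexpr std::size_t gpu_block_size = 16;
constexpr std::size_t gpu_block_size_xattn = 256;

std::size_t select_block_size(bool all_gpu_device, bool has_xattention) {
    if (!all_gpu_device) {
        return cpu_block_size;  // CPU pipelines keep 32-token blocks
    }
    // XAttention operates on larger 256-token blocks; the plain GPU path stays at 16.
    return has_xattention ? gpu_block_size_xattn : gpu_block_size;
}
```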
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/cpp/src/continuous_batching/cache_manager.hpp | Adds XAttention detection and sets GPU block size to 256 when XAttention is enabled |
| tools/who_what_benchmark/whowhatbench/model_loaders.py | Extracts sparse attention configuration into a separate function and adds validation logic |
| tools/llm_bench/llm_bench_utils/ov_utils.py | Adds validation to prevent conflicting sparse attention configuration |
| tools/llm_bench/task/text_generation.py | Moves GenerationConfig import outside conditional block for broader scope |
```python
        scheduler_config.sparse_attention_config = openvino_genai.SparseAttentionConfig(**sparse_attention_kwargs)
        log.info("Sparse Attention mode ON")
    else:
        raise RuntimeError("==Failure ==: sparse_attention_config value can't be used with use_sparse_attention=False")
```
Copilot AI (Oct 16, 2025):
Extra space in error message prefix: '==Failure ==' should be '==Failure=='.
```diff
-        raise RuntimeError("==Failure ==: sparse_attention_config value can't be used with use_sparse_attention=False")
+        raise RuntimeError("==Failure==: sparse_attention_config value can't be used with use_sparse_attention=False")
```
```python
scheduler_params = cb_config or default_cb_config
if scheduler_params:
    logger.info(f"Scheduler parameters for:\n{scheduler_params}")
    configure_sparse_attention(scheduler_params, scheduler_config)
```
Copilot AI (Oct 16, 2025):
Trailing whitespace at end of line should be removed.
```diff
-    configure_sparse_attention(scheduler_params, scheduler_config) 
+    configure_sparse_attention(scheduler_params, scheduler_config)
```
```diff
 // set block_size depending on device
-const size_t cpu_block_size = 32, gpu_block_size = 16;
-m_block_size = all_gpu_device ? gpu_block_size : cpu_block_size;
+const size_t cpu_block_size = 32, gpu_block_size = 16, gpu_block_size_xattn = 256;
```
Copilot AI (Oct 16, 2025):
The variable `gpu_block_size_xattn` is defined here, but the `m_block_size` assignment was moved to line 89. This leaves `gpu_block_size_xattn` defined far from its only usage point. Consider moving this constant definition closer to where it's used (before line 89) or defining it as a class constant for better maintainability.
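For illustration, one possible shape of the class-constant alternative the comment suggests; the class body shown here is abbreviated and hypothetical, not the actual cache manager from this PR:

```cpp
#include <cstddef>

class CacheManager {
public:
    // Class-scope constants keep all block sizes in one discoverable place.
    static constexpr std::size_t cpu_block_size = 32;
    static constexpr std::size_t gpu_block_size = 16;
    static constexpr std::size_t gpu_block_size_xattn = 256;

private:
    std::size_t m_block_size = cpu_block_size;  // selected per device at init
};
```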
```cpp
        std::cout << "[XAttention]: DISABLED on GPU device." << std::endl;
    }
}
m_block_size = all_gpu_device ? ( has_xattention ? gpu_block_size_xattn : gpu_block_size ) : cpu_block_size;
```
Copilot AI (Oct 16, 2025):
[nitpick] The nested ternary operator makes this line difficult to read. Consider using an if-else statement for better clarity, especially since this is a critical configuration decision.
```diff
-m_block_size = all_gpu_device ? ( has_xattention ? gpu_block_size_xattn : gpu_block_size ) : cpu_block_size;
+if (all_gpu_device) {
+    if (has_xattention) {
+        m_block_size = gpu_block_size_xattn;
+    } else {
+        m_block_size = gpu_block_size;
+    }
+} else {
+    m_block_size = cpu_block_size;
+}
```
Tickets: CVS-173857