Conversation

@rnwang04 commented Sep 24, 2025

Tickets: CVS-173857

@github-actions bot added the category: continuous batching label Sep 24, 2025
@ceciliapeng2011 marked this pull request as draft September 24, 2025 05:32
@github-actions bot added the category: llm_bench label Sep 26, 2025
@Wovchena (Collaborator) left a comment

I'll leave C++ review to @vshampor

@Wovchena (Collaborator) left a comment

Waiting for Vasily

@ceciliapeng2011 added this to the 2025.4 milestone Oct 14, 2025
@vshampor (Contributor) commented Oct 15, 2025

The test at

```python
def test_cache_optimized_generation_is_similar_to_unoptimized(test_struct, apply_rotation, use_sparse_attention):
```

must be extended with the XAttention case, and/or additional tests must be added to demonstrate the intended behaviour when a user simply switches from SparseAttentionMode::TRISHAPE to SparseAttentionMode::XATTENTION without changing their expectation of the block size of 16 that TRISHAPE used to work perfectly with.
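
For concreteness, a minimal sketch of one way the parametrization could be extended, assuming the Python bindings expose SparseAttentionMode with an XATTENTION member and that SparseAttentionConfig accepts a mode argument (both are assumptions here, not confirmed API):

```python
import pytest
from openvino_genai import SparseAttentionConfig, SparseAttentionMode

# Assumption: run the existing similarity check once per sparse attention
# mode, so the new XATTENTION path is exercised alongside TRISHAPE.
@pytest.mark.parametrize("sparse_attention_mode", [
    SparseAttentionMode.TRISHAPE,
    SparseAttentionMode.XATTENTION,
])
def test_cache_optimized_generation_is_similar_to_unoptimized(
        test_struct, apply_rotation, use_sparse_attention, sparse_attention_mode):
    # Assumption: the config built here is fed into the pipeline exactly as
    # the existing test already does for TRISHAPE.
    sparse_attention_config = SparseAttentionConfig(mode=sparse_attention_mode)
    # ... existing similarity assertions, unchanged ...
```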

Also, I guess I'll have to ask again - why does enabling XAttention alone implicitly change the GPU block size to 256? If it is such a performant block size anyway, why not change the GPU block size to 256 everywhere and avoid implicit block size changes the user is unaware of?

@vshampor self-requested a review October 15, 2025 08:34
@vshampor (Contributor) left a comment

See my previous comment

@github-actions bot added the category: WWB label Oct 16, 2025
@Wovchena requested a review from Copilot October 16, 2025 11:23
Copilot AI left a comment

Pull Request Overview

This PR updates the GPU block size configuration to support XAttention, which uses a larger block size (256) compared to the standard GPU block size (16). The changes enable detection of XAttention at runtime and configure the appropriate block size accordingly.

  • Adds XAttention detection logic based on cache dimensions
  • Introduces sparse attention configuration support in benchmarking tools
  • Refactors sparse attention setup into a reusable function
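
For illustration only, a hedged sketch of how a user might opt into XAttention from Python once this lands; the SparseAttentionMode.XATTENTION member and passing scheduler_config to LLMPipeline as a property are assumptions based on the hunks below, not confirmed API:

```python
import openvino_genai

scheduler_config = openvino_genai.SchedulerConfig()
# Assumption: flipping these two fields is all a user does; the GPU block
# size switch from 16 to 256 then happens implicitly in cache_manager.hpp.
scheduler_config.use_sparse_attention = True
scheduler_config.sparse_attention_config = openvino_genai.SparseAttentionConfig(
    mode=openvino_genai.SparseAttentionMode.XATTENTION)

pipe = openvino_genai.LLMPipeline("model_dir", "GPU", scheduler_config=scheduler_config)
```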

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/cpp/src/continuous_batching/cache_manager.hpp | Adds XAttention detection and sets GPU block size to 256 when XAttention is enabled |
| tools/who_what_benchmark/whowhatbench/model_loaders.py | Extracts sparse attention configuration into a separate function and adds validation logic |
| tools/llm_bench/llm_bench_utils/ov_utils.py | Adds validation to prevent conflicting sparse attention configuration |
| tools/llm_bench/task/text_generation.py | Moves GenerationConfig import outside conditional block for broader scope |


```python
    scheduler_config.sparse_attention_config = openvino_genai.SparseAttentionConfig(**sparse_attention_kwargs)
    log.info("Sparse Attention mode ON")
else:
    raise RuntimeError("==Failure ==: sparse_attention_config value can't be used with use_sparse_attention=False")
```
Copilot AI commented Oct 16, 2025

Extra space in error message prefix: '==Failure ==' should be '==Failure=='.

Suggested change:

```diff
-    raise RuntimeError("==Failure ==: sparse_attention_config value can't be used with use_sparse_attention=False")
+    raise RuntimeError("==Failure==: sparse_attention_config value can't be used with use_sparse_attention=False")
```


```python
scheduler_params = cb_config or default_cb_config
if scheduler_params:
    logger.info(f"Scheduler parameters for:\n{scheduler_params}")
    configure_sparse_attention(scheduler_params, scheduler_config)
```
Copilot AI commented Oct 16, 2025

Trailing whitespace at end of line should be removed.

Suggested change (the removed line ends with a trailing space):

```diff
-    configure_sparse_attention(scheduler_params, scheduler_config) 
+    configure_sparse_attention(scheduler_params, scheduler_config)
```

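For context, a minimal sketch of what the extracted configure_sparse_attention helper might look like, inferred from the two hunks above and the error message in model_loaders.py; the exact key names and dict handling are assumptions:

```python
import openvino_genai

def configure_sparse_attention(scheduler_params, scheduler_config):
    # Assumption: scheduler_params is the dict parsed from the benchmark's
    # CB config, with optional 'use_sparse_attention' and
    # 'sparse_attention_config' entries.
    use_sparse_attention = scheduler_params.pop("use_sparse_attention", False)
    sparse_attention_kwargs = scheduler_params.pop("sparse_attention_config", None)
    if use_sparse_attention:
        scheduler_config.use_sparse_attention = True
        if sparse_attention_kwargs:
            scheduler_config.sparse_attention_config = openvino_genai.SparseAttentionConfig(
                **sparse_attention_kwargs)
    elif sparse_attention_kwargs:
        # Mirrors the validation shown above: an explicit config without the
        # enable flag is rejected rather than silently ignored.
        raise RuntimeError(
            "==Failure==: sparse_attention_config value can't be used with use_sparse_attention=False")
```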

```diff
 // set block_size depending on device
-const size_t cpu_block_size = 32, gpu_block_size = 16;
-m_block_size = all_gpu_device ? gpu_block_size : cpu_block_size;
+const size_t cpu_block_size = 32, gpu_block_size = 16, gpu_block_size_xattn = 256;
```
Copilot AI commented Oct 16, 2025

The variable gpu_block_size_xattn is defined but m_block_size assignment was moved to line 89. This leaves gpu_block_size_xattn defined far from its only usage point. Consider moving this constant definition closer to where it's used (before line 89) or defining it as a class constant for better maintainability.


```cpp
            std::cout << "[XAttention]: DISABLED on GPU device." << std::endl;
        }
    }
    m_block_size = all_gpu_device ? ( has_xattention ? gpu_block_size_xattn : gpu_block_size ) : cpu_block_size;
```
Copilot AI commented Oct 16, 2025

[nitpick] The nested ternary operator makes this line difficult to read. Consider using an if-else statement for better clarity, especially since this is a critical configuration decision.

Suggested change:

```diff
-    m_block_size = all_gpu_device ? ( has_xattention ? gpu_block_size_xattn : gpu_block_size ) : cpu_block_size;
+    if (all_gpu_device) {
+        if (has_xattention) {
+            m_block_size = gpu_block_size_xattn;
+        } else {
+            m_block_size = gpu_block_size;
+        }
+    } else {
+        m_block_size = cpu_block_size;
+    }
```



Labels

category: continuous batching · category: llm_bench · category: WWB

6 participants