Conversation

@rnwang04 commented Sep 24, 2025

Tickets: CVS-173857

@github-actions bot added the category: continuous batching label Sep 24, 2025
@ceciliapeng2011 marked this pull request as draft September 24, 2025 05:32
@github-actions bot added the category: llm_bench label Sep 26, 2025
@Wovchena (Collaborator) left a comment

I'll leave C++ review to @vshampor

@Wovchena (Collaborator) left a comment

Waiting for Vasily

@ceciliapeng2011 added this to the 2025.4 milestone Oct 14, 2025
@vshampor (Contributor) commented Oct 15, 2025

The test at

```python
def test_cache_optimized_generation_is_similar_to_unoptimized(test_struct, apply_rotation, use_sparse_attention):
```

must be extended with the XAttention case, and/or additional tests must be added to demonstrate the intended behaviour when a user simply switches from SparseAttentionMode::TRISHAPE to SparseAttentionMode::XATTENTION without changing their expectation of the block size of 16 that TRISHAPE used to work perfectly with.
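
For concreteness, a minimal sketch of one way the parametrization could be extended, assuming the Python bindings expose SparseAttentionMode with an XATTENTION member and that SparseAttentionConfig accepts a mode argument (both are assumptions here, not confirmed API):

```python
import pytest
from openvino_genai import SparseAttentionConfig, SparseAttentionMode

# Assumption: run the existing similarity check once per sparse attention
# mode, so the new XATTENTION path is exercised alongside TRISHAPE.
@pytest.mark.parametrize("sparse_attention_mode", [
    SparseAttentionMode.TRISHAPE,
    SparseAttentionMode.XATTENTION,
])
def test_cache_optimized_generation_is_similar_to_unoptimized(
        test_struct, apply_rotation, use_sparse_attention, sparse_attention_mode):
    # Assumption: the config built here is fed into the pipeline exactly as
    # the existing test already does for TRISHAPE.
    sparse_attention_config = SparseAttentionConfig(mode=sparse_attention_mode)
    # ... existing similarity assertions, unchanged ...
```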

Also, I guess I'll have to ask again - why does enabling XAttention alone implicitly change the GPU block size to 256? If it is such a performant block size anyway, why not change the GPU block size to 256 everywhere and avoid implicit block size changes the user is unaware of?

@vshampor self-requested a review October 15, 2025 08:34
@vshampor (Contributor) left a comment

See my previous comment

@github-actions bot added the category: WWB label Oct 16, 2025
@Wovchena requested a review from Copilot October 16, 2025 11:23
Copilot AI left a comment

Pull Request Overview

This PR updates the GPU block size configuration to support XAttention, which uses a larger block size (256) compared to the standard GPU block size (16). The changes enable detection of XAttention at runtime and configure the appropriate block size accordingly.

  • Adds XAttention detection logic based on cache dimensions
  • Introduces sparse attention configuration support in benchmarking tools
  • Refactors sparse attention setup into a reusable function
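
For illustration only, a hedged sketch of how a user might opt into XAttention from Python once this lands; the SparseAttentionMode.XATTENTION member and passing scheduler_config to LLMPipeline as a property are assumptions based on the hunks below, not confirmed API:

```python
import openvino_genai

scheduler_config = openvino_genai.SchedulerConfig()
# Assumption: flipping these two fields is all a user does; the GPU block
# size switch from 16 to 256 then happens implicitly in cache_manager.hpp.
scheduler_config.use_sparse_attention = True
scheduler_config.sparse_attention_config = openvino_genai.SparseAttentionConfig(
    mode=openvino_genai.SparseAttentionMode.XATTENTION)

pipe = openvino_genai.LLMPipeline("model_dir", "GPU", scheduler_config=scheduler_config)
```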

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/cpp/src/continuous_batching/cache_manager.hpp | Adds XAttention detection and sets GPU block size to 256 when XAttention is enabled |
| tools/who_what_benchmark/whowhatbench/model_loaders.py | Extracts sparse attention configuration into a separate function and adds validation logic |
| tools/llm_bench/llm_bench_utils/ov_utils.py | Adds validation to prevent conflicting sparse attention configuration |
| tools/llm_bench/task/text_generation.py | Moves GenerationConfig import outside conditional block for broader scope |


```python
    scheduler_config.sparse_attention_config = openvino_genai.SparseAttentionConfig(**sparse_attention_kwargs)
    log.info("Sparse Attention mode ON")
else:
    raise RuntimeError("==Failure ==: sparse_attention_config value can't be used with use_sparse_attention=False")
```
Copilot AI commented Oct 16, 2025

Extra space in error message prefix: '==Failure ==' should be '==Failure=='.

Suggested change:

```diff
-    raise RuntimeError("==Failure ==: sparse_attention_config value can't be used with use_sparse_attention=False")
+    raise RuntimeError("==Failure==: sparse_attention_config value can't be used with use_sparse_attention=False")
```


```python
scheduler_params = cb_config or default_cb_config
if scheduler_params:
    logger.info(f"Scheduler parameters for:\n{scheduler_params}")
    configure_sparse_attention(scheduler_params, scheduler_config)
```
Copilot AI commented Oct 16, 2025

Trailing whitespace at end of line should be removed.

Suggested change (the removed line ends with a trailing space):

```diff
-    configure_sparse_attention(scheduler_params, scheduler_config) 
+    configure_sparse_attention(scheduler_params, scheduler_config)
```

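For context, a minimal sketch of what the extracted configure_sparse_attention helper might look like, inferred from the two hunks above and the error message in model_loaders.py; the exact key names and dict handling are assumptions:

```python
import openvino_genai

def configure_sparse_attention(scheduler_params, scheduler_config):
    # Assumption: scheduler_params is the dict parsed from the benchmark's
    # CB config, with optional 'use_sparse_attention' and
    # 'sparse_attention_config' entries.
    use_sparse_attention = scheduler_params.pop("use_sparse_attention", False)
    sparse_attention_kwargs = scheduler_params.pop("sparse_attention_config", None)
    if use_sparse_attention:
        scheduler_config.use_sparse_attention = True
        if sparse_attention_kwargs:
            scheduler_config.sparse_attention_config = openvino_genai.SparseAttentionConfig(
                **sparse_attention_kwargs)
    elif sparse_attention_kwargs:
        # Mirrors the validation shown above: an explicit config without the
        # enable flag is rejected rather than silently ignored.
        raise RuntimeError(
            "==Failure==: sparse_attention_config value can't be used with use_sparse_attention=False")
```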

```diff
 // set block_size depending on device
-const size_t cpu_block_size = 32, gpu_block_size = 16;
-m_block_size = all_gpu_device ? gpu_block_size : cpu_block_size;
+const size_t cpu_block_size = 32, gpu_block_size = 16, gpu_block_size_xattn = 256;
```
Copilot AI commented Oct 16, 2025

The variable gpu_block_size_xattn is defined but m_block_size assignment was moved to line 89. This leaves gpu_block_size_xattn defined far from its only usage point. Consider moving this constant definition closer to where it's used (before line 89) or defining it as a class constant for better maintainability.


```cpp
            std::cout << "[XAttention]: DISABLED on GPU device." << std::endl;
        }
    }
    m_block_size = all_gpu_device ? ( has_xattention ? gpu_block_size_xattn : gpu_block_size ) : cpu_block_size;
```
Copilot AI commented Oct 16, 2025

[nitpick] The nested ternary operator makes this line difficult to read. Consider using an if-else statement for better clarity, especially since this is a critical configuration decision.

Suggested change:

```diff
-    m_block_size = all_gpu_device ? ( has_xattention ? gpu_block_size_xattn : gpu_block_size ) : cpu_block_size;
+    if (all_gpu_device) {
+        if (has_xattention) {
+            m_block_size = gpu_block_size_xattn;
+        } else {
+            m_block_size = gpu_block_size;
+        }
+    } else {
+        m_block_size = cpu_block_size;
+    }
```



Labels

category: continuous batching · category: llm_bench · category: WWB

6 participants