
Conversation

BoyuanFeng

Prior to this PR, cudagraph mode did not warm up.

However, both default mode and cudagraph mode need warmup. For a compiled function, cudagraph mode may record all of the autotuning candidate kernels into the graph if the function is not warmed up properly before capture.

torch.cuda.synchronize()
for _ in range(n_warmup):
    run_iteration()
torch.cuda.synchronize()
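
To illustrate why warmup matters here, a minimal sketch assuming a compiled function (the max-autotune mode, shapes, and iteration count are illustrative, not from this PR): running the function eagerly a few times lets compilation and autotuning finish first, so a later CUDA graph capture records only the chosen kernel.

import torch

@torch.compile(mode="max-autotune")
def fn(x):
    return x @ x

x = torch.randn(1024, 1024, device="cuda")

# Warm up eagerly first: compilation and autotuning run here, so a later
# CUDA graph capture records only the final, tuned kernel.
torch.cuda.synchronize()
for _ in range(3):
    fn(x)
torch.cuda.synchronize()
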
Contributor

@yf225 yf225 Sep 10, 2025

I wonder: does estimate_ms = benchmarker.benchmark_gpu(fn, estimation_iters=5, benchmark_iters=10) above already serve the warmup purpose?

Or, if that's not explicit enough, should we move this explicit warmup to before estimate_ms?

Author

Yes, moved it to the top.

Comment on lines 213 to 214
if should_clear_cache:
    cache.zero_()
Author

When measuring with cudagraph, we should not clear the cache, since that adds extra memory access time to the measurement. This matches the behavior of triton.testing.do_bench_cudagraph:
https://github.com/triton-lang/triton/blob/f90255886173b873dfb8b5bbb9a3f67951954660/python/triton/testing.py#L106-L111
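
For reference, a minimal sketch of that pattern (an arbitrary fn is assumed; this paraphrases do_bench_cudagraph rather than quoting it): capture n_repeat calls in one graph and time a single replay, with no cache clearing inside the capture.

import torch

def bench_cudagraph_ms(fn, n_repeat=100):
    # Warm up on a side stream so lazy init happens before capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        fn()
    torch.cuda.current_stream().wait_stream(s)

    # Capture n_repeat calls in one graph. Note there is no cache.zero_():
    # anything captured here would be replayed, and therefore timed, too.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for _ in range(n_repeat):
            fn()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    g.replay()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_repeat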

Contributor

I thought we should still clear the L2 cache when running with cudagraph. This is an issue with do_bench_cudagraph. cc @nmacchioni

Author

We use tensor.zero_() to clear the L2 cache. If we do that under cudagraph, the graph would contain both 1) the tensor.zero_() AND 2) the fn to benchmark. Would this lead to wrong results?

By contrast, in do_bench we run clear_cache outside the measured region, so the timing does NOT include the L2 cache clearing time:

    for i in range(n_repeat):
        # we don't want `fn` to accumulate gradient values
        # if it contains a backward pass. So we clear the
        # provided gradients
        if grad_to_none is not None:
            for x in grad_to_none:
                x.grad = None
        # we clear the L2 cache before each run
        runtime.driver.active.clear_cache(cache)
        # record time of `fn`
        start_event[i].record()
        fn()
        end_event[i].record()
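
Concretely, a hedged sketch of the concern (the cache size, fn, and shapes are illustrative): if the clear is captured, it is replayed, and therefore timed, together with fn.

import torch

# ~256 MB int32 buffer, similar in spirit to triton's cache tensor.
cache = torch.empty(64 * 1024 * 1024, dtype=torch.int, device="cuda")
x = torch.randn(1024, 1024, device="cuda")

def fn():
    return x @ x

fn()  # warm up before capture
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    cache.zero_()  # captured: replayed, and therefore timed, with fn()
    fn()
# CUDA-event timing around g.replay() now includes the zero_() kernel,
# i.e., the extra memory access time mentioned above.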

Contributor

L2 cache clearing is essential for more stable and accurate latency measurements. We explicitly exclude the latency of the CACHE_CLEAR_KERNEL from the kernel latencies: https://github.com/BoyuanFeng/tritonbench/blob/a0c75ad868f9d03f1cf6107da139e500cfe0b60b/tritonbench/components/do_bench/run.py#L259
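
In sketch form, a hedged reconstruction of that approach (not the exact run.py code; the event-summing details are simplified, and CACHE_CLEAR_KERNEL is a placeholder here): profile the runs, then drop the cache-clear kernel's events when summing GPU time.

import torch
from torch.profiler import ProfilerActivity, profile

# Placeholder: the real constant is the device-specific kernel name that
# cache.zero_() dispatches to; tritonbench hard-codes it in run.py.
CACHE_CLEAR_KERNEL = "<cache-clear kernel name>"

def profiled_latency_ms(fn, cache, n_repeat=10):
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(n_repeat):
            cache.zero_()  # clear L2 between runs for stable numbers
            fn()
        torch.cuda.synchronize()
    # Sum GPU kernel time, excluding the cache-clear kernel's events.
    total_us = sum(
        evt.device_time_total
        for evt in prof.events()
        if evt.device_time_total > 0 and evt.name != CACHE_CLEAR_KERNEL
    )
    return total_us / n_repeat / 1e3  # microseconds -> milliseconds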

Author

This is for the profiler-based benchmark, but not for the cudagraph-based benchmark?

Contributor

@xuzhao9 xuzhao9 Sep 10, 2025

We provide a cudagraph option for the profiler-based benchmark, e.g., --latency-measure-mode=profiler --cudagraph: #386

Contributor

Echoing Xu: the and evt.name != CACHE_CLEAR_KERNEL check below should exclude tensor.zero_() from the GPU trace, so I believe we should run tensor.zero_() even in the cudagraph-based benchmark. Otherwise --profiler --cudagraph vs. --profiler-only would give different latency results.

Author

Agree. CACHE_CLEAR_KERNEL would still appear in the CUDA trace when cudagraph is used. Thanks for the clarification!

    ]
)
assert (
    num_cache_clear_kernels == n_repeat
Author

Assert to make sure the L2 cache clear dispatches to the hard-coded kernel name.
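
A hedged reconstruction of the surrounding check (variable names follow the diff; prof is assumed to be the profiler handle):

num_cache_clear_kernels = sum(
    1 for evt in prof.events() if evt.name == CACHE_CLEAR_KERNEL
)
assert (
    num_cache_clear_kernels == n_repeat
), "cache.zero_() did not dispatch to the hard-coded CACHE_CLEAR_KERNEL"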

Contributor

This makes sense to me, thanks!

@@ -185,6 +185,14 @@ def _do_bench_profiler(
Returns:
Contributor

After this PR, the warmup parameter will not be used by this function. Can you help remove it?

Author

Let's keep it to match the API of _do_bench_inductor and triton.testing.do_bench.
