
Conversation

BoyuanFeng

Prior to this PR, cudagraph mode did not warm up.

However, both default mode and cudagraph mode need warmup. For a compiled function, cudagraph mode may record all of the autotuning candidate kernels into the graph if the function is not warmed up properly before capture.

torch.cuda.synchronize()
for _ in range(n_warmup):
    run_iteration()
torch.cuda.synchronize()
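
To illustrate why warmup matters here, a minimal sketch assuming a compiled function (the max-autotune mode, shapes, and iteration count are illustrative, not from this PR): running the function eagerly a few times lets compilation and autotuning finish first, so a later CUDA graph capture records only the chosen kernel.

import torch

@torch.compile(mode="max-autotune")
def fn(x):
    return x @ x

x = torch.randn(1024, 1024, device="cuda")

# Warm up eagerly first: compilation and autotuning run here, so a later
# CUDA graph capture records only the final, tuned kernel.
torch.cuda.synchronize()
for _ in range(3):
    fn(x)
torch.cuda.synchronize()
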
Contributor

@yf225 yf225 Sep 10, 2025

I wonder: does estimate_ms = benchmarker.benchmark_gpu(fn, estimation_iters=5, benchmark_iters=10) above already serve the warmup purpose?

Or, if that's not explicit enough, should we move this explicit warmup to before estimate_ms?

Author

Yes, moved it to the top.

Comment on lines 213 to 214
if should_clear_cache:
    cache.zero_()
Author

When measuring with cudagraph, we should not clear the cache, since that adds extra memory access time to the measurement. This matches the behavior of triton.testing.do_bench_cudagraph:
https://github.com/triton-lang/triton/blob/f90255886173b873dfb8b5bbb9a3f67951954660/python/triton/testing.py#L106-L111
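
For reference, a minimal sketch of that pattern (an arbitrary fn is assumed; this paraphrases do_bench_cudagraph rather than quoting it): capture n_repeat calls in one graph and time a single replay, with no cache clearing inside the capture.

import torch

def bench_cudagraph_ms(fn, n_repeat=100):
    # Warm up on a side stream so lazy init happens before capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        fn()
    torch.cuda.current_stream().wait_stream(s)

    # Capture n_repeat calls in one graph. Note there is no cache.zero_():
    # anything captured here would be replayed, and therefore timed, too.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for _ in range(n_repeat):
            fn()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    g.replay()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_repeat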

Contributor

I thought we should still clear the L2 cache when running with cudagraph. This is an issue with do_bench_cudagraph. cc @nmacchioni

Author

We use tensor.zero_() to clear the L2 cache. If we do that under cudagraph, the graph would contain both 1) the tensor.zero_() AND 2) the fn to benchmark. Would this lead to wrong results?

By contrast, in do_bench we run clear_cache outside the measured region, so the timing does NOT include the L2 cache clearing time:

    for i in range(n_repeat):
        # we don't want `fn` to accumulate gradient values
        # if it contains a backward pass. So we clear the
        # provided gradients
        if grad_to_none is not None:
            for x in grad_to_none:
                x.grad = None
        # we clear the L2 cache before each run
        runtime.driver.active.clear_cache(cache)
        # record time of `fn`
        start_event[i].record()
        fn()
        end_event[i].record()
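
Concretely, a hedged sketch of the concern (the cache size, fn, and shapes are illustrative): if the clear is captured, it is replayed, and therefore timed, together with fn.

import torch

# ~256 MB int32 buffer, similar in spirit to triton's cache tensor.
cache = torch.empty(64 * 1024 * 1024, dtype=torch.int, device="cuda")
x = torch.randn(1024, 1024, device="cuda")

def fn():
    return x @ x

fn()  # warm up before capture
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    cache.zero_()  # captured: replayed, and therefore timed, with fn()
    fn()
# CUDA-event timing around g.replay() now includes the zero_() kernel,
# i.e., the extra memory access time mentioned above.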

Contributor

L2 cache clearing is essential for more stable and accurate latency measurements. We explicitly exclude the latency of the CACHE_CLEAR_KERNEL from the kernel latencies: https://github.com/BoyuanFeng/tritonbench/blob/a0c75ad868f9d03f1cf6107da139e500cfe0b60b/tritonbench/components/do_bench/run.py#L259
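
In sketch form, a hedged reconstruction of that approach (not the exact run.py code; the event-summing details are simplified, and CACHE_CLEAR_KERNEL is a placeholder here): profile the runs, then drop the cache-clear kernel's events when summing GPU time.

import torch
from torch.profiler import ProfilerActivity, profile

# Placeholder: the real constant is the device-specific kernel name that
# cache.zero_() dispatches to; tritonbench hard-codes it in run.py.
CACHE_CLEAR_KERNEL = "<cache-clear kernel name>"

def profiled_latency_ms(fn, cache, n_repeat=10):
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(n_repeat):
            cache.zero_()  # clear L2 between runs for stable numbers
            fn()
        torch.cuda.synchronize()
    # Sum GPU kernel time, excluding the cache-clear kernel's events.
    total_us = sum(
        evt.device_time_total
        for evt in prof.events()
        if evt.device_time_total > 0 and evt.name != CACHE_CLEAR_KERNEL
    )
    return total_us / n_repeat / 1e3  # microseconds -> milliseconds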

Author

This is for the profiler-based benchmark, but not for the cudagraph-based benchmark?

Contributor

@xuzhao9 xuzhao9 Sep 10, 2025

We provide a cudagraph option for the profiler-based benchmark, e.g., --latency-measure-mode=profiler --cudagraph: #386

Contributor

Echoing Xu: the and evt.name != CACHE_CLEAR_KERNEL check below should exclude tensor.zero_() from the GPU trace, so I believe we should run tensor.zero_() even in the cudagraph-based benchmark. Otherwise --profiler --cudagraph vs. --profiler-only would give different latency results.

Author

Agree. CACHE_CLEAR_KERNEL would still appear in the CUDA trace when cudagraph is used. Thanks for the clarification!

    ]
)
assert (
    num_cache_clear_kernels == n_repeat
Author

Assert to make sure the L2 cache clear dispatches to the hard-coded kernel name.
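
A hedged reconstruction of the surrounding check (variable names follow the diff; prof is assumed to be the profiler handle):

num_cache_clear_kernels = sum(
    1 for evt in prof.events() if evt.name == CACHE_CLEAR_KERNEL
)
assert (
    num_cache_clear_kernels == n_repeat
), "cache.zero_() did not dispatch to the hard-coded CACHE_CLEAR_KERNEL"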

Contributor

This makes sense to me, thanks!

@@ -185,6 +185,14 @@ def _do_bench_profiler(
Returns:
Contributor

After this PR, the warmup parameter will not be used by this function. Can you help remove it?

Author

Let's keep it to match the API of _do_bench_inductor and triton.testing.do_bench.
