Merged
Changes from 6 commits
36 changes: 17 additions & 19 deletions tritonbench/components/do_bench/run.py
@@ -185,6 +185,14 @@ def _do_bench_profiler(
Returns:
List of measured kernel times in milliseconds (if return_mode="all") or single value.
"""
# we don't want any outside errors propagating into benchmarking
torch.cuda.synchronize()

# warmup `fn` (and catches any failures in the process)
for _ in range(3):
fn()
torch.cuda.synchronize()

# Get cache for L2 cache clearing
cache = triton.runtime.driver.active.get_empty_cache_for_benchmark()

@@ -193,36 +201,28 @@

# Calculate number of iterations based on target rep time
if estimate_ms == 0:
n_repeat = 100 # Default if function is very fast
n_repeat = 1000 # Default if function is very fast
else:
n_repeat = max(1, int(rep / estimate_ms))

# Helper function to execute one iteration
def run_iteration():
def run_iteration(should_clear_cache: bool):
if grad_to_none is not None:
for x in grad_to_none:
x.grad = None
cache.zero_()
if should_clear_cache:
cache.zero_()
Contributor Author:

When measuring with cudagraph, we should not clear the cache, since the cache-clearing kernel adds extra memory-access time. This matches the behavior of triton.testing.do_bench_cudagraph:
https://github.com/triton-lang/triton/blob/f90255886173b873dfb8b5bbb9a3f67951954660/python/triton/testing.py#L106-L111
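
Below is a minimal, illustrative sketch of that capture pattern (the toy fn, matrix sizes, and n_repeat are assumptions of this sketch, not the actual triton.testing.do_bench_cudagraph source): only the benchmarked function is captured into the graph, and whole-graph replays are timed with CUDA events.

import torch

# Sketch only: a toy workload standing in for the benchmarked `fn`.
assert torch.cuda.is_available()
x = torch.randn(1024, 1024, device="cuda")
y = torch.empty_like(x)

def fn():
    torch.mm(x, x, out=y)

# Warm up on a side stream before capture, as recommended for CUDA graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    fn()
torch.cuda.current_stream().wait_stream(s)

n_repeat = 100
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    for _ in range(n_repeat):
        fn()  # note: no cache.zero_() is captured into the graph

torch.cuda.synchronize()

# Time one whole-graph replay with CUDA events; each replay runs n_repeat calls.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
g.replay()
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) / n_repeat:.4f} ms per call")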

Contributor:

I thought we should still clear the L2 cache when running with cudagraph. This is an issue with do_bench_cudagraph, cc @nmacchioni.

Contributor Author:

We use tensor.zero_() to clear the L2 cache. If we do that under cudagraph, the captured graph would contain both 1) the tensor.zero_() AND 2) the fn being benchmarked. Would this lead to a wrong result?

By contrast, in do_bench, clear_cache runs outside the measured interval, so the measurement does NOT include the L2 cache-clearing time:

    for i in range(n_repeat):
        # we don't want `fn` to accumulate gradient values
        # if it contains a backward pass. So we clear the
        # provided gradients
        if grad_to_none is not None:
            for x in grad_to_none:
                x.grad = None
        # we clear the L2 cache before each run
        runtime.driver.active.clear_cache(cache)
        # record time of `fn`
        start_event[i].record()
        fn()
        end_event[i].record()

Contributor:

Clearing the L2 cache is essential for more stable and accurate latency measurements.
We will explicitly exclude the latency of the CACHE_CLEAR_KERNEL from the reported kernel latencies: https://github.com/BoyuanFeng/tritonbench/blob/a0c75ad868f9d03f1cf6107da139e500cfe0b60b/tritonbench/components/do_bench/run.py#L259
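
As an illustration of that exclusion, here is a self-contained sketch of summing GPU kernel time from a profiler trace while skipping the cache-clearing kernel (the toy fn, the buffer size, and discovering the kernel name at runtime instead of using the CACHE_CLEAR_KERNEL constant from run.py are assumptions of this sketch):

import torch
from torch.autograd import DeviceType
from torch.profiler import ProfilerActivity, profile

assert torch.cuda.is_available()
cache = torch.empty(256 * 1024 * 1024, dtype=torch.int8, device="cuda")  # L2-flush buffer
x = torch.randn(4096, 4096, device="cuda")

def fn():
    return x @ x

# Discover which kernel name(s) zero_() shows up as, rather than hard-coding them.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as warm:
    cache.zero_()
    torch.cuda.synchronize()
cache_clear_names = {e.name for e in warm.events() if e.device_type == DeviceType.CUDA}

n_repeat = 10
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(n_repeat):
        cache.zero_()  # appears in the trace as its own kernel
        fn()
    torch.cuda.synchronize()

# Sum GPU kernel time, skipping the cache-clear kernel(s), then normalize by iterations.
# self_device_time_total is named self_cuda_time_total on older PyTorch releases.
kernel_time_us = sum(
    e.self_device_time_total
    for e in prof.events()
    if e.device_type == DeviceType.CUDA and e.name not in cache_clear_names
)
print(f"avg kernel time: {kernel_time_us / 1000.0 / n_repeat:.4f} ms")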

Contributor Author:

Is this for the profiler-based benchmark but not for the cudagraph-based benchmark?

Contributor (@xuzhao9, Sep 10, 2025):

We provide a cudagraph option for the profiler-based benchmark, e.g., --latency-measure-mode=profiler --cudagraph: #386

Contributor:

Echoing Xu: the `and evt.name != CACHE_CLEAR_KERNEL` check below should exclude tensor.zero_() from the GPU trace, so I believe we should run tensor.zero_() even in the cudagraph-based benchmark. Otherwise --profiler --cudagraph vs. --profiler-only will give different latency results.

Contributor Author:

Agree. CACHE_CLEAR_KERNEL would still appear in the CUDA trace when cudagraph is used. Thanks for the clarification!

fn()

if use_cudagraph:
# Create CUDA graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
for _ in range(n_repeat):
run_iteration()
torch.cuda.synchronize()
else:
# Regular mode warmup
n_warmup = max(1, int(warmup / estimate_ms)) if estimate_ms > 0 else 25

torch.cuda.synchronize()
for _ in range(n_warmup):
run_iteration()
run_iteration(should_clear_cache=False)
torch.cuda.synchronize()

n_profiler_runs = 5
iterations_per_profiler_run = n_repeat
n_profiler_runs = 10

# Benchmark phase - collect kernel times for each iteration
all_kernel_times = []
@@ -243,8 +243,8 @@ def run_iteration():
g.replay()
else:
# Execute multiple iterations for regular mode
for _ in range(iterations_per_profiler_run):
run_iteration()
for _ in range(n_repeat):
run_iteration(should_clear_cache=True)
torch.cuda.synchronize()

# Collect all kernel execution intervals
@@ -299,9 +299,7 @@ def run_iteration():
)

# Convert to milliseconds and normalize by iterations
total_kernel_time_ms = (
total_kernel_time_us / 1000.0
) / iterations_per_profiler_run
total_kernel_time_ms = (total_kernel_time_us / 1000.0) / n_repeat
all_kernel_times.append(total_kernel_time_ms)

times = torch.tensor(all_kernel_times, dtype=torch.float)