[do_bench][easy] warmup cudagraph mode in do_bench_profiler #411
base: main
Conversation
torch.cuda.synchronize()
for _ in range(n_warmup):
    run_iteration()
torch.cuda.synchronize()
I wonder whether estimate_ms = benchmarker.benchmark_gpu(fn, estimation_iters=5, benchmark_iters=10) above already serves the warmup purpose. Or, if that is not explicit enough, should we move this explicit warmup to before estimate_ms?
Yes, moved it to the top.
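For reference, a minimal sketch of the agreed ordering: warm up explicitly before any latency estimation. fn and n_warmup here are placeholders; the actual function derives its warmup count and workload from the call site.

import torch

def fn():
    # Placeholder for the kernel under test.
    a = torch.randn(1024, 1024, device="cuda")
    return (a @ a).sum()

n_warmup = 25  # placeholder count

# Warm up BEFORE estimation, so both the latency estimate and any later
# cudagraph capture observe steady-state (compiled, autotuned) kernels.
torch.cuda.synchronize()
for _ in range(n_warmup):
    fn()
torch.cuda.synchronize()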
if should_clear_cache:
    cache.zero_()
When measuring with cudagraph, we should not clear the cache, since that adds extra memory-access time. This matches the behavior of triton.testing.do_bench_cudagraph:
https://github.com/triton-lang/triton/blob/f90255886173b873dfb8b5bbb9a3f67951954660/python/triton/testing.py#L106-L111
I thought we should still clear the L2 cache when running with cudagraph; this is an issue with do_bench_cudagraph.
cc @nmacchioni
We use tensor.zero_() to clear the L2 cache. If we did that inside the cudagraph, the graph would contain both 1) the tensor.zero_() AND 2) the fn to benchmark. Would this lead to a wrong result?
By contrast, in do_bench we run clear_cache outside the measurement, so the L2 cache clearing time is NOT included.
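To make the concern concrete, here is a minimal sketch (a matmul stands in for fn; buffer sizes are assumptions). Whatever is launched during capture is replayed as one unit, so a captured cache.zero_() is re-run on every replay, whereas eager do_bench can clear the cache outside the timed events.

import torch

cache = torch.empty(int(256e6), dtype=torch.int8, device="cuda")  # buffer large enough to flush L2
a = torch.randn(1024, 1024, device="cuda")
out = torch.empty_like(a)

def fn():
    torch.matmul(a, a, out=out)

# Recommended pre-capture warmup on a side stream.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        fn()
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    cache.zero_()  # captured together with fn -> its cost is replayed every time
    fn()
g.replay()  # timing this replay would include the zero_() time

# Eager do_bench, by contrast, clears the cache outside the timed region:
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
cache.zero_()  # not timed
start.record()
fn()           # timed
end.record()
torch.cuda.synchronize()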
for i in range(n_repeat):
    # we don't want `fn` to accumulate gradient values
    # if it contains a backward pass. So we clear the
    # provided gradients
    if grad_to_none is not None:
        for x in grad_to_none:
            x.grad = None
    # we clear the L2 cache before each run
    runtime.driver.active.clear_cache(cache)
    # record time of `fn`
    start_event[i].record()
    fn()
    end_event[i].record()
The L2 cache clear is essential for more stable and accurate latency measurements. We explicitly exclude the latency of the CACHE_CLEAR_KERNEL from the kernel latencies: https://github.com/BoyuanFeng/tritonbench/blob/a0c75ad868f9d03f1cf6107da139e500cfe0b60b/tritonbench/components/do_bench/run.py#L259
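Not the repo's exact code, but a sketch of the filtering idea using torch.profiler. CACHE_CLEAR_KERNEL is a placeholder below; the real kernel name is hard-coded in the linked file.

import torch
from torch.profiler import ProfilerActivity, profile

CACHE_CLEAR_KERNEL = "<hard-coded zero-fill kernel name>"  # placeholder; see linked run.py

cache = torch.empty(int(64e6), dtype=torch.int8, device="cuda")
a = torch.randn(1024, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    cache.zero_()  # shows up in the trace under the cache-clear kernel name
    (a @ a).sum()  # the kernel whose latency we actually want
torch.cuda.synchronize()

# Sum GPU time over all events except the cache-clear kernel.
# (self_device_time_total is the attribute name in recent PyTorch;
# older versions call it self_cuda_time_total.)
kernel_time_us = sum(
    evt.self_device_time_total
    for evt in prof.key_averages()
    if evt.self_device_time_total > 0 and evt.key != CACHE_CLEAR_KERNEL
)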
That exclusion applies to the profiler-based benchmark, but not to the cudagraph-based benchmark?
We provide a cudagraph option for the profiler-based benchmark, e.g. --latency-measure-mode=profiler --cudagraph: #386
Echoing Xu: the evt.name != CACHE_CLEAR_KERNEL check below should exclude tensor.zero_() from the GPU trace, so I believe we should run tensor.zero_() even with the cudagraph-based benchmark. Otherwise --profiler --cudagraph vs. --profiler-only will give different latency results.
Agreed. CACHE_CLEAR_KERNEL would still appear in the CUDA trace when cudagraph is used. Thanks for the clarification!
    ]
)
assert (
    num_cache_clear_kernels == n_repeat
This assert makes sure the L2 cache clear dispatches to the hard-coded kernel name.
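Continuing the profiler sketch above, the guarded invariant could be reconstructed roughly as follows (prof, CACHE_CLEAR_KERNEL, and n_repeat are the assumed names from that sketch and the diff):

# One cache-clear kernel is expected per measured repeat. If PyTorch ever
# dispatches zero_() to a differently named kernel, this fails loudly
# instead of silently mis-attributing the clear time to `fn`.
num_cache_clear_kernels = sum(
    1 for evt in prof.events() if evt.name == CACHE_CLEAR_KERNEL
)
assert num_cache_clear_kernels == n_repeat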
This makes sense to me, thanks!
@@ -185,6 +185,14 @@ def _do_bench_profiler(
Returns:
After this PR, the parameter warmup will not be used by this function. Can you help remove it?
Let's keep it to match the API of _do_bench_inductor and triton.testing.do_bench.
Prior to this PR, cudagraph mode did not warm up. However, both default mode and cudagraph mode need warmup: for a compiled function, cudagraph mode may record all of the autotuning kernels if the function is not warmed up properly.
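A sketch of the failure mode being fixed, assuming a torch.compile-style workload: without eager warmup, capture can record the autotuning sweep itself rather than only the chosen kernel.

import torch

compiled_fn = torch.compile(lambda x: torch.relu(x @ x))
x = torch.randn(1024, 1024, device="cuda")

# Warm up eagerly first: this runs compilation and autotuning to
# completion, so the capture below records only the finally selected
# kernels instead of the autotuning candidates.
for _ in range(3):
    compiled_fn(x)
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = compiled_fn(x)
g.replay()  # replays only the steady-state kernels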