From fca70df408b43cd8620f4590e706c8b598845b4a Mon Sep 17 00:00:00 2001
From: Eddie Yan
Date: Thu, 18 Sep 2025 16:44:21 +0000
Subject: [PATCH 1/3] update

---
 2.9.0/todo/result_cuda.md | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/2.9.0/todo/result_cuda.md b/2.9.0/todo/result_cuda.md
index d864beb..c981304 100644
--- a/2.9.0/todo/result_cuda.md
+++ b/2.9.0/todo/result_cuda.md
@@ -28,35 +28,35 @@ The categories below are as follows:
 ### deprecation
 ### new features
 - MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule bump ([#162209](https://github.com/pytorch/pytorch/pull/162209))
+- [CUDAGraph] Add getter for cuda graph exec to allow mutation of captured kernel params ([#161294](https://github.com/pytorch/pytorch/pull/161294))
 ### improvements
-### bug fixes
-- [FlexAttention][TF32] Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` ([#161102](https://github.com/pytorch/pytorch/pull/161102))
-### performance
-### docs
-### devs
-### Untopiced
-- Prevent cudaStreamSync when indexing GPU tensors with boolean CPU mask ([#156384](https://github.com/pytorch/pytorch/pull/156384))
-- cublaslt/hipblaslt persistent workspace ([#156495](https://github.com/pytorch/pytorch/pull/156495))
+- Make cublaslt/hipblaslt workspaces persistent ([#156495](https://github.com/pytorch/pytorch/pull/156495))
 - Remove unnecessary warnings during the ATen compilation process. ([#157703](https://github.com/pytorch/pytorch/pull/157703))
 - Slightly improve error message from repeat_interleave kernel ([#157996](https://github.com/pytorch/pytorch/pull/157996))
 - Add framework for explanations for common CUDA errors ([#158395](https://github.com/pytorch/pytorch/pull/158395))
 - [ROCm] delete un-needed workaround for tensor.item() ([#158486](https://github.com/pytorch/pytorch/pull/158486))
-- [CUDA] fix nansum in non-JIT build ([#158633](https://github.com/pytorch/pytorch/pull/158633))
 - [fbgemm_gpu] Upgrade KernelLauncher kernelLaunchCheck to print help string ([#158896](https://github.com/pytorch/pytorch/pull/158896))
 - [cutlass] Prep for cutlass upgrade by ignoring Wunused-but-set-variable ([#159276](https://github.com/pytorch/pytorch/pull/159276))
-- Disable cudagraph GCs by default ([#158649](https://github.com/pytorch/pytorch/pull/158649))
-- [CUDA] Decrease launch bounds of CTCLoss backward for blackwell ([#159522](https://github.com/pytorch/pytorch/pull/159522))
+- Workaround ATen SFINAE under libc++ ([#161101](https://github.com/pytorch/pytorch/pull/161101))
+- [ATen][CUDA][CUB] Implement changes to CCCL (CUB/Thrust/LibCUDACXX) usage in ATen ([#153373](https://github.com/pytorch/pytorch/pull/153373))
+- [Refactor] Add maybe unused flag to remove warning ([#157655](https://github.com/pytorch/pytorch/pull/157655))
+- [ATen][CUDA] Use new CCCL API in v2.8 ([#160554](https://github.com/pytorch/pytorch/pull/160554))
+### bug fixes
+- [FlexAttention][TF32] Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` ([#161102](https://github.com/pytorch/pytorch/pull/161102))
+- [CUDA] fix nansum in non-JIT build ([#158633](https://github.com/pytorch/pytorch/pull/158633))
+- [CUDA] Decrease launch bounds of CTCLoss backward for blackwell to avoid crash ([#159522](https://github.com/pytorch/pytorch/pull/159522))
+- [ROCm] fix large tensor sort on MI350 ([#161054](https://github.com/pytorch/pytorch/pull/161054))
+### performance
+- Use a nonblocking copy to avoid stream synchronization for GPU tensor indexing with CPU mask ([#156384](https://github.com/pytorch/pytorch/pull/156384))
+- Disable cudagraph GCs by default to improve capture performance ([#158649](https://github.com/pytorch/pytorch/pull/158649))
 - [ROCm] Set thread_work_size to 16 for vectorized elementwise kernels for MI300X ([#160444](https://github.com/pytorch/pytorch/pull/160444))
 - [ROCm] Improve reduction sum performance ([#160466](https://github.com/pytorch/pytorch/pull/160466))
-- [ROCm] fix large tensor sort on MI350 ([#161054](https://github.com/pytorch/pytorch/pull/161054))
-- Workaround ATen SFINAE under libc++ ([#161101](https://github.com/pytorch/pytorch/pull/161101))
 - [ROCm] Unroll loads in global_reduce ([#161181](https://github.com/pytorch/pytorch/pull/161181))
-- [CUDAGraph] Add getter for cuda graph exec ([#161294](https://github.com/pytorch/pytorch/pull/161294))
 - [ROCm] No-fence global reduce ([#161180](https://github.com/pytorch/pytorch/pull/161180))
-### not user facing
-- [ATen][CUDA][CUB] Implement changes to CCCL (CUB/Thrust/LibCUDACXX) usage in ATen ([#153373](https://github.com/pytorch/pytorch/pull/153373))
-- [Refactor] Add maybe unused flag to remove warning ([#157655](https://github.com/pytorch/pytorch/pull/157655))
 - [ROCm] Use opportunistic fastatomics based on hueristics ([#159430](https://github.com/pytorch/pytorch/pull/159430))
 - [ROCm] Limit number of values per thread for reductions on three dimensions ([#159652](https://github.com/pytorch/pytorch/pull/159652))
-- [ATen][CUDA] Use new CCCL API in v2.8 ([#160554](https://github.com/pytorch/pytorch/pull/160554))
+### docs
+### devs
+### Untopiced
+### not user facing
 ### security

From e58ae23bce639c11dc8f87baf116ce8810e50568 Mon Sep 17 00:00:00 2001
From: Eddie Yan
Date: Thu, 18 Sep 2025 16:44:45 +0000
Subject: [PATCH 2/3] move to done

---
 2.9.0/{todo => done}/result_cuda.md | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename 2.9.0/{todo => done}/result_cuda.md (100%)

diff --git a/2.9.0/todo/result_cuda.md b/2.9.0/done/result_cuda.md
similarity index 100%
rename from 2.9.0/todo/result_cuda.md
rename to 2.9.0/done/result_cuda.md

From b66ddad2d9cde2aa7ae9e45b5bb5e213081cd247 Mon Sep 17 00:00:00 2001
From: Eddie Yan
Date: Thu, 18 Sep 2025 16:46:57 +0000
Subject: [PATCH 3/3] remove rocm

---
 2.9.0/done/result_cuda.md | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/2.9.0/done/result_cuda.md b/2.9.0/done/result_cuda.md
index c981304..c34e83d 100644
--- a/2.9.0/done/result_cuda.md
+++ b/2.9.0/done/result_cuda.md
@@ -34,7 +34,6 @@ The categories below are as follows:
 - Remove unnecessary warnings during the ATen compilation process. ([#157703](https://github.com/pytorch/pytorch/pull/157703))
 - Slightly improve error message from repeat_interleave kernel ([#157996](https://github.com/pytorch/pytorch/pull/157996))
 - Add framework for explanations for common CUDA errors ([#158395](https://github.com/pytorch/pytorch/pull/158395))
-- [ROCm] delete un-needed workaround for tensor.item() ([#158486](https://github.com/pytorch/pytorch/pull/158486))
 - [fbgemm_gpu] Upgrade KernelLauncher kernelLaunchCheck to print help string ([#158896](https://github.com/pytorch/pytorch/pull/158896))
 - [cutlass] Prep for cutlass upgrade by ignoring Wunused-but-set-variable ([#159276](https://github.com/pytorch/pytorch/pull/159276))
 - Workaround ATen SFINAE under libc++ ([#161101](https://github.com/pytorch/pytorch/pull/161101))
@@ -45,16 +44,9 @@ The categories below are as follows:
 - [FlexAttention][TF32] Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` ([#161102](https://github.com/pytorch/pytorch/pull/161102))
 - [CUDA] fix nansum in non-JIT build ([#158633](https://github.com/pytorch/pytorch/pull/158633))
 - [CUDA] Decrease launch bounds of CTCLoss backward for blackwell to avoid crash ([#159522](https://github.com/pytorch/pytorch/pull/159522))
-- [ROCm] fix large tensor sort on MI350 ([#161054](https://github.com/pytorch/pytorch/pull/161054))
 ### performance
 - Use a nonblocking copy to avoid stream synchronization for GPU tensor indexing with CPU mask ([#156384](https://github.com/pytorch/pytorch/pull/156384))
 - Disable cudagraph GCs by default to improve capture performance ([#158649](https://github.com/pytorch/pytorch/pull/158649))
-- [ROCm] Set thread_work_size to 16 for vectorized elementwise kernels for MI300X ([#160444](https://github.com/pytorch/pytorch/pull/160444))
-- [ROCm] Improve reduction sum performance ([#160466](https://github.com/pytorch/pytorch/pull/160466))
-- [ROCm] Unroll loads in global_reduce ([#161181](https://github.com/pytorch/pytorch/pull/161181))
-- [ROCm] No-fence global reduce ([#161180](https://github.com/pytorch/pytorch/pull/161180))
-- [ROCm] Use opportunistic fastatomics based on hueristics ([#159430](https://github.com/pytorch/pytorch/pull/159430))
-- [ROCm] Limit number of values per thread for reductions on three dimensions ([#159652](https://github.com/pytorch/pytorch/pull/159652))
 ### docs
 ### devs
 ### Untopiced