From fca70df408b43cd8620f4590e706c8b598845b4a Mon Sep 17 00:00:00 2001
From: Eddie Yan
Date: Thu, 18 Sep 2025 16:44:21 +0000
Subject: [PATCH 1/3] update

---
 2.9.0/todo/result_cuda.md | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/2.9.0/todo/result_cuda.md b/2.9.0/todo/result_cuda.md
index d864beb..c981304 100644
--- a/2.9.0/todo/result_cuda.md
+++ b/2.9.0/todo/result_cuda.md
@@ -28,35 +28,35 @@ The categories below are as follows:
 ### deprecation
 ### new features
 - MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule bump ([#162209](https://github.com/pytorch/pytorch/pull/162209))
+- [CUDAGraph] Add getter for cuda graph exec to allow mutation of captured kernel params ([#161294](https://github.com/pytorch/pytorch/pull/161294))
 ### improvements
-### bug fixes
-- [FlexAttention][TF32] Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` ([#161102](https://github.com/pytorch/pytorch/pull/161102))
-### performance
-### docs
-### devs
-### Untopiced
-- Prevent cudaStreamSync when indexing GPU tensors with boolean CPU mask ([#156384](https://github.com/pytorch/pytorch/pull/156384))
-- cublaslt/hipblaslt persistent workspace ([#156495](https://github.com/pytorch/pytorch/pull/156495))
+- Make cublaslt/hipblaslt workspaces persistent ([#156495](https://github.com/pytorch/pytorch/pull/156495))
 - Remove unnecessary warnings during the ATen compilation process. ([#157703](https://github.com/pytorch/pytorch/pull/157703))
 - Slightly improve error message from repeat_interleave kernel ([#157996](https://github.com/pytorch/pytorch/pull/157996))
 - Add framework for explanations for common CUDA errors ([#158395](https://github.com/pytorch/pytorch/pull/158395))
 - [ROCm] delete un-needed workaround for tensor.item() ([#158486](https://github.com/pytorch/pytorch/pull/158486))
-- [CUDA] fix nansum in non-JIT build ([#158633](https://github.com/pytorch/pytorch/pull/158633))
 - [fbgemm_gpu] Upgrade KernelLauncher kernelLaunchCheck to print help string ([#158896](https://github.com/pytorch/pytorch/pull/158896))
 - [cutlass] Prep for cutlass upgrade by ignoring Wunused-but-set-variable ([#159276](https://github.com/pytorch/pytorch/pull/159276))
-- Disable cudagraph GCs by default ([#158649](https://github.com/pytorch/pytorch/pull/158649))
-- [CUDA] Decrease launch bounds of CTCLoss backward for blackwell ([#159522](https://github.com/pytorch/pytorch/pull/159522))
+- Workaround ATen SFINAE under libc++ ([#161101](https://github.com/pytorch/pytorch/pull/161101))
+- [ATen][CUDA][CUB] Implement changes to CCCL (CUB/Thrust/LibCUDACXX) usage in ATen ([#153373](https://github.com/pytorch/pytorch/pull/153373))
+- [Refactor] Add maybe unused flag to remove warning ([#157655](https://github.com/pytorch/pytorch/pull/157655))
+- [ATen][CUDA] Use new CCCL API in v2.8 ([#160554](https://github.com/pytorch/pytorch/pull/160554))
+### bug fixes
+- [FlexAttention][TF32] Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` ([#161102](https://github.com/pytorch/pytorch/pull/161102))
+- [CUDA] fix nansum in non-JIT build ([#158633](https://github.com/pytorch/pytorch/pull/158633))
+- [CUDA] Decrease launch bounds of CTCLoss backward for blackwell to avoid crash ([#159522](https://github.com/pytorch/pytorch/pull/159522))
+- [ROCm] fix large tensor sort on MI350 ([#161054](https://github.com/pytorch/pytorch/pull/161054))
+### performance
+- Use a nonblocking copy to avoid stream synchronization for GPU tensor indexing with CPU mask ([#156384](https://github.com/pytorch/pytorch/pull/156384))
+- Disable cudagraph GCs by default to improve capture performance ([#158649](https://github.com/pytorch/pytorch/pull/158649))
 - [ROCm] Set thread_work_size to 16 for vectorized elementwise kernels for MI300X ([#160444](https://github.com/pytorch/pytorch/pull/160444))
 - [ROCm] Improve reduction sum performance ([#160466](https://github.com/pytorch/pytorch/pull/160466))
-- [ROCm] fix large tensor sort on MI350 ([#161054](https://github.com/pytorch/pytorch/pull/161054))
-- Workaround ATen SFINAE under libc++ ([#161101](https://github.com/pytorch/pytorch/pull/161101))
 - [ROCm] Unroll loads in global_reduce ([#161181](https://github.com/pytorch/pytorch/pull/161181))
-- [CUDAGraph] Add getter for cuda graph exec ([#161294](https://github.com/pytorch/pytorch/pull/161294))
 - [ROCm] No-fence global reduce ([#161180](https://github.com/pytorch/pytorch/pull/161180))
-### not user facing
-- [ATen][CUDA][CUB] Implement changes to CCCL (CUB/Thrust/LibCUDACXX) usage in ATen ([#153373](https://github.com/pytorch/pytorch/pull/153373))
-- [Refactor] Add maybe unused flag to remove warning ([#157655](https://github.com/pytorch/pytorch/pull/157655))
 - [ROCm] Use opportunistic fastatomics based on hueristics ([#159430](https://github.com/pytorch/pytorch/pull/159430))
 - [ROCm] Limit number of values per thread for reductions on three dimensions ([#159652](https://github.com/pytorch/pytorch/pull/159652))
-- [ATen][CUDA] Use new CCCL API in v2.8 ([#160554](https://github.com/pytorch/pytorch/pull/160554))
+### docs
+### devs
+### Untopiced
+### not user facing
 ### security

From e58ae23bce639c11dc8f87baf116ce8810e50568 Mon Sep 17 00:00:00 2001
From: Eddie Yan
Date: Thu, 18 Sep 2025 16:44:45 +0000
Subject: [PATCH 2/3] move to done

---
 2.9.0/{todo => done}/result_cuda.md | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename 2.9.0/{todo => done}/result_cuda.md (100%)

diff --git a/2.9.0/todo/result_cuda.md b/2.9.0/done/result_cuda.md
similarity index 100%
rename from 2.9.0/todo/result_cuda.md
rename to 2.9.0/done/result_cuda.md

From b66ddad2d9cde2aa7ae9e45b5bb5e213081cd247 Mon Sep 17 00:00:00 2001
From: Eddie Yan
Date: Thu, 18 Sep 2025 16:46:57 +0000
Subject: [PATCH 3/3] remove rocm

---
 2.9.0/done/result_cuda.md | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/2.9.0/done/result_cuda.md b/2.9.0/done/result_cuda.md
index c981304..c34e83d 100644
--- a/2.9.0/done/result_cuda.md
+++ b/2.9.0/done/result_cuda.md
@@ -34,7 +34,6 @@ The categories below are as follows:
 - Remove unnecessary warnings during the ATen compilation process. ([#157703](https://github.com/pytorch/pytorch/pull/157703))
 - Slightly improve error message from repeat_interleave kernel ([#157996](https://github.com/pytorch/pytorch/pull/157996))
 - Add framework for explanations for common CUDA errors ([#158395](https://github.com/pytorch/pytorch/pull/158395))
-- [ROCm] delete un-needed workaround for tensor.item() ([#158486](https://github.com/pytorch/pytorch/pull/158486))
 - [fbgemm_gpu] Upgrade KernelLauncher kernelLaunchCheck to print help string ([#158896](https://github.com/pytorch/pytorch/pull/158896))
 - [cutlass] Prep for cutlass upgrade by ignoring Wunused-but-set-variable ([#159276](https://github.com/pytorch/pytorch/pull/159276))
 - Workaround ATen SFINAE under libc++ ([#161101](https://github.com/pytorch/pytorch/pull/161101))
@@ -45,16 +44,9 @@ The categories below are as follows:
 - [FlexAttention][TF32] Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` ([#161102](https://github.com/pytorch/pytorch/pull/161102))
 - [CUDA] fix nansum in non-JIT build ([#158633](https://github.com/pytorch/pytorch/pull/158633))
 - [CUDA] Decrease launch bounds of CTCLoss backward for blackwell to avoid crash ([#159522](https://github.com/pytorch/pytorch/pull/159522))
-- [ROCm] fix large tensor sort on MI350 ([#161054](https://github.com/pytorch/pytorch/pull/161054))
 ### performance
 - Use a nonblocking copy to avoid stream synchronization for GPU tensor indexing with CPU mask ([#156384](https://github.com/pytorch/pytorch/pull/156384))
 - Disable cudagraph GCs by default to improve capture performance ([#158649](https://github.com/pytorch/pytorch/pull/158649))
-- [ROCm] Set thread_work_size to 16 for vectorized elementwise kernels for MI300X ([#160444](https://github.com/pytorch/pytorch/pull/160444))
-- [ROCm] Improve reduction sum performance ([#160466](https://github.com/pytorch/pytorch/pull/160466))
-- [ROCm] Unroll loads in global_reduce ([#161181](https://github.com/pytorch/pytorch/pull/161181))
-- [ROCm] No-fence global reduce ([#161180](https://github.com/pytorch/pytorch/pull/161180))
-- [ROCm] Use opportunistic fastatomics based on hueristics ([#159430](https://github.com/pytorch/pytorch/pull/159430))
-- [ROCm] Limit number of values per thread for reductions on three dimensions ([#159652](https://github.com/pytorch/pytorch/pull/159652))
 ### docs
 ### devs
 ### Untopiced