## 2.9.0/done/result_cuda.md (moved from 2.9.0/todo/result_cuda.md)
The categories below are as follows:
### deprecation
### new features
- MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule bump ([#162209](https://github.com/pytorch/pytorch/pull/162209))
- [CUDAGraph] Add getter for cuda graph exec to allow mutation of captured kernel params ([#161294](https://github.com/pytorch/pytorch/pull/161294))
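A quick sketch of the CUDA Graphs capture/replay flow that the getter above plugs into. The capture API shown is existing `torch.cuda` surface; the exec-handle accessor in the final comment is an assumed name inferred from the PR title, not taken from its diff.

```python
import torch

# Static buffers: CUDA Graphs replay fixed device pointers, so inputs are
# mutated in place between replays rather than reallocated.
x = torch.zeros(8, device="cuda")

# Warm-up on a side stream before capture, per the torch.cuda.graph docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    y = x * 2
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x * 2  # captured, not executed yet

x.fill_(3.0)  # update the static input in place
g.replay()    # y is now 6.0 everywhere

# #161294 exposes the underlying cudaGraphExec_t so captured kernel params
# can be mutated via the driver API; the accessor name here is hypothetical:
# exec_handle = g.raw_cuda_graph_exec()
```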
### improvements
- Make cublaslt/hipblaslt workspaces persistent ([#156495](https://github.com/pytorch/pytorch/pull/156495)) (see the workspace sketch after this list)
- Remove unnecessary warnings during the ATen compilation process. ([#157703](https://github.com/pytorch/pytorch/pull/157703))
- Slightly improve error message from repeat_interleave kernel ([#157996](https://github.com/pytorch/pytorch/pull/157996))
- Add framework for explanations for common CUDA errors ([#158395](https://github.com/pytorch/pytorch/pull/158395))
- [ROCm] delete un-needed workaround for tensor.item() ([#158486](https://github.com/pytorch/pytorch/pull/158486))
- [fbgemm_gpu] Upgrade KernelLauncher kernelLaunchCheck to print help string ([#158896](https://github.com/pytorch/pytorch/pull/158896))
- [cutlass] Prep for cutlass upgrade by ignoring Wunused-but-set-variable ([#159276](https://github.com/pytorch/pytorch/pull/159276))
- [ROCm] Set thread_work_size to 16 for vectorized elementwise kernels for MI300X ([#160444](https://github.com/pytorch/pytorch/pull/160444))
- [ROCm] Improve reduction sum performance ([#160466](https://github.com/pytorch/pytorch/pull/160466))
- [ROCm] fix large tensor sort on MI350 ([#161054](https://github.com/pytorch/pytorch/pull/161054))
- Workaround ATen SFINAE under libc++ ([#161101](https://github.com/pytorch/pytorch/pull/161101))
- [ROCm] Unroll loads in global_reduce ([#161181](https://github.com/pytorch/pytorch/pull/161181))
- [ROCm] No-fence global reduce ([#161180](https://github.com/pytorch/pytorch/pull/161180))
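Context for the workspace item flagged above: PyTorch sizes its cuBLASLt/hipBLASLt workspace from the `CUBLASLT_WORKSPACE_SIZE` environment variable, and [#156495](https://github.com/pytorch/pytorch/pull/156495) makes that allocation persistent rather than per-call. A minimal sketch, assuming the value is read in KiB and only at first initialization:

```python
import os

# Assumption: the variable is consumed when cuBLASLt is first initialized
# (value in KiB), so it must be set before torch touches the GPU.
os.environ["CUBLASLT_WORKSPACE_SIZE"] = "4096"

import torch

a = torch.randn(512, 512, device="cuda", dtype=torch.bfloat16)
b = torch.randn(512, 512, device="cuda", dtype=torch.bfloat16)
c = a @ b  # GEMMs that take the Lt path now reuse one persistent workspace
```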
### bug fixes
- [FlexAttention][TF32] Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` ([#161102](https://github.com/pytorch/pytorch/pull/161102)) (see the sketch after this list)
- [CUDA] fix nansum in non-JIT build ([#158633](https://github.com/pytorch/pytorch/pull/158633))
- [CUDA] Decrease launch bounds of CTCLoss backward for blackwell to avoid crash ([#159522](https://github.com/pytorch/pytorch/pull/159522))
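For the FlexAttention item flagged above, a minimal sketch of initializing the precision knob explicitly. SDPA stands in for a FlexAttention call, and the accepted strings `"tf32"`/`"ieee"` are an assumption based on the newer fp32-precision API:

```python
import torch

# Set the knob explicitly rather than relying on the previously
# uninitialized default that #161102 handles.
torch.backends.cuda.matmul.fp32_precision = "tf32"  # or "ieee" for strict fp32

# SDPA used here as a stand-in for a FlexAttention call.
q, k, v = (torch.randn(2, 8, 128, 64, device="cuda") for _ in range(3))
out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```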
### performance
- Use a nonblocking copy to avoid stream synchronization for GPU tensor indexing with CPU mask ([#156384](https://github.com/pytorch/pytorch/pull/156384)) (illustrated after this list)
- Disable cudagraph GCs by default to improve capture performance ([#158649](https://github.com/pytorch/pytorch/pull/158649))
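An illustration of the indexing pattern the first performance item refers to:

```python
import torch

x = torch.randn(1_000_000, device="cuda")
mask = torch.rand(1_000_000) > 0.5  # boolean mask deliberately left on the CPU

# Indexing moves the mask to the GPU under the hood; with #156384 that copy
# is non-blocking instead of forcing a cudaStreamSynchronize.
y = x[mask]
```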
### docs
### devs
### Untopiced
### not user facing
- [ATen][CUDA][CUB] Implement changes to CCCL (CUB/Thrust/LibCUDACXX) usage in ATen ([#153373](https://github.com/pytorch/pytorch/pull/153373))
- [Refactor] Add maybe unused flag to remove warning ([#157655](https://github.com/pytorch/pytorch/pull/157655))
- [ROCm] Use opportunistic fastatomics based on heuristics ([#159430](https://github.com/pytorch/pytorch/pull/159430))
- [ROCm] Limit number of values per thread for reductions on three dimensions ([#159652](https://github.com/pytorch/pytorch/pull/159652))
- [ATen][CUDA] Use new CCCL API in v2.8 ([#160554](https://github.com/pytorch/pytorch/pull/160554))
### security