diff --git a/2.9.0/todo/result_cuda.md b/2.9.0/done/result_cuda.md
similarity index 76%
rename from 2.9.0/todo/result_cuda.md
rename to 2.9.0/done/result_cuda.md
index d864beb..c34e83d 100644
--- a/2.9.0/todo/result_cuda.md
+++ b/2.9.0/done/result_cuda.md
@@ -28,35 +28,27 @@ The categories below are as follows:
 ### deprecation
 ### new features
 - MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule bump ([#162209](https://github.com/pytorch/pytorch/pull/162209))
+- [CUDAGraph] Add getter for cuda graph exec to allow mutation of captured kernel params ([#161294](https://github.com/pytorch/pytorch/pull/161294))
 ### improvements
-### bug fixes
-- [FlexAttention][TF32] Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` ([#161102](https://github.com/pytorch/pytorch/pull/161102))
-### performance
-### docs
-### devs
-### Untopiced
-- Prevent cudaStreamSync when indexing GPU tensors with boolean CPU mask ([#156384](https://github.com/pytorch/pytorch/pull/156384))
-- cublaslt/hipblaslt persistent workspace ([#156495](https://github.com/pytorch/pytorch/pull/156495))
+- Make cublaslt/hipblaslt workspaces persistent ([#156495](https://github.com/pytorch/pytorch/pull/156495))
 - Remove unnecessary warnings during the ATen compilation process. ([#157703](https://github.com/pytorch/pytorch/pull/157703))
 - Slightly improve error message from repeat_interleave kernel ([#157996](https://github.com/pytorch/pytorch/pull/157996))
 - Add framework for explanations for common CUDA errors ([#158395](https://github.com/pytorch/pytorch/pull/158395))
-- [ROCm] delete un-needed workaround for tensor.item() ([#158486](https://github.com/pytorch/pytorch/pull/158486))
-- [CUDA] fix nansum in non-JIT build ([#158633](https://github.com/pytorch/pytorch/pull/158633))
 - [fbgemm_gpu] Upgrade KernelLauncher kernelLaunchCheck to print help string ([#158896](https://github.com/pytorch/pytorch/pull/158896))
 - [cutlass] Prep for cutlass upgrade by ignoring Wunused-but-set-variable ([#159276](https://github.com/pytorch/pytorch/pull/159276))
-- Disable cudagraph GCs by default ([#158649](https://github.com/pytorch/pytorch/pull/158649))
-- [CUDA] Decrease launch bounds of CTCLoss backward for blackwell ([#159522](https://github.com/pytorch/pytorch/pull/159522))
-- [ROCm] Set thread_work_size to 16 for vectorized elementwise kernels for MI300X ([#160444](https://github.com/pytorch/pytorch/pull/160444))
-- [ROCm] Improve reduction sum performance ([#160466](https://github.com/pytorch/pytorch/pull/160466))
-- [ROCm] fix large tensor sort on MI350 ([#161054](https://github.com/pytorch/pytorch/pull/161054))
 - Workaround ATen SFINAE under libc++ ([#161101](https://github.com/pytorch/pytorch/pull/161101))
-- [ROCm] Unroll loads in global_reduce ([#161181](https://github.com/pytorch/pytorch/pull/161181))
-- [CUDAGraph] Add getter for cuda graph exec ([#161294](https://github.com/pytorch/pytorch/pull/161294))
-- [ROCm] No-fence global reduce ([#161180](https://github.com/pytorch/pytorch/pull/161180))
-### not user facing
 - [ATen][CUDA][CUB] Implement changes to CCCL (CUB/Thrust/LibCUDACXX) usage in ATen ([#153373](https://github.com/pytorch/pytorch/pull/153373))
 - [Refactor] Add maybe unused flag to remove warning ([#157655](https://github.com/pytorch/pytorch/pull/157655))
-- [ROCm] Use opportunistic fastatomics based on hueristics ([#159430](https://github.com/pytorch/pytorch/pull/159430))
-- [ROCm] Limit number of values per thread for reductions on three dimensions ([#159652](https://github.com/pytorch/pytorch/pull/159652))
 - [ATen][CUDA] Use new CCCL API in v2.8 ([#160554](https://github.com/pytorch/pytorch/pull/160554))
+### bug fixes
+- [FlexAttention][TF32] Handle uninitialized `torch.backends.cuda.matmul.fp32_precision` ([#161102](https://github.com/pytorch/pytorch/pull/161102))
+- [CUDA] fix nansum in non-JIT build ([#158633](https://github.com/pytorch/pytorch/pull/158633))
+- [CUDA] Decrease launch bounds of CTCLoss backward for blackwell to avoid crash ([#159522](https://github.com/pytorch/pytorch/pull/159522))
+### performance
+- Use a nonblocking copy to avoid stream synchronization for GPU tensor indexing with CPU mask ([#156384](https://github.com/pytorch/pytorch/pull/156384))
+- Disable cudagraph GCs by default to improve capture performance ([#158649](https://github.com/pytorch/pytorch/pull/158649))
+### docs
+### devs
+### Untopiced
+### not user facing
 ### security