[ROCm] update release notes (#82)

jeffdaily · web-flow · commit 69fd560082f2 · 2025-09-18T12:29:50.000-07:00
diff --git a/2.9.0/done/result_jit.md b/2.9.0/done/result_jit.md
@@ -41,7 +41,6 @@ The categories below are as follows:
 - [BE][8/16] fix typos in torch/ (torch/csrc/jit/) ([#156318](https://github.com/pytorch/pytorch/pull/156318))
 - [BE][10/16] fix typos in torch/ (torch/csrc/jit/) ([#156320](https://github.com/pytorch/pytorch/pull/156320))
 - [nativert] Add OSS version of ModelRunner ([#159268](https://github.com/pytorch/pytorch/pull/159268))
-- [ROCm] Fix resource_strings.h ([#159996](https://github.com/pytorch/pytorch/pull/159996))
 - added stubs for jit tree views ([#156504](https://github.com/pytorch/pytorch/pull/156504))
 - Remove ts to export retracer ([#156857](https://github.com/pytorch/pytorch/pull/156857))
 - [BE][12/16] fix typos in torch/ ([#156602](https://github.com/pytorch/pytorch/pull/156602))
diff --git a/2.9.0/done/result_rocm.md b/2.9.0/done/result_rocm.md
@@ -0,0 +1,65 @@
+
+# Release Notes worksheet rocm
+
+The main goal of this process is to rephrase all the commit messages below to make them **clear and easy to read** by the end user. You should follow the following instructions to do so:
+
+* **Please clean up and format commit titles to be readable by the general PyTorch user.** Make sure you're [following the guidance here](https://docs.google.com/document/d/14OmgGBr1w6gl1VO47GGGdwrIaUNr92DFhQbY_NEk8mQ/edit)! Your resulting notes must be consistent and easy to read.
+* Please sort commits into the following categories (you should not rename the categories!). I tried to pre-sort these to ease your work; feel free to move commits around if the current categorization is not good.
+* Anything that is not public facing needs to be removed.
+* If anything is miscategorized/belongs to another domain, move it to `miscategorized.md`.
+* Please scan through `miscategorized.md` and handle any commits that belong within your domain according to these instructions.
+* We place a lot of emphasis on the “BC-breaking” and “deprecation” sections. Those should be where the most effort goes in. The “improvements” and “bug fixes” for Python API should be nice as well.
+* Once you are finished, move this very file from `todo/` to `done/` and submit a pull request.
+
+The categories below are as follows:
+
+* BC breaking: All commits that are BC-breaking. These are the most important commits. If any pre-sorted commit is actually BC-breaking, do move it to this section. Each commit should contain a paragraph explaining the rationale behind the change as well as an example of how to update user code [BC-Guidelines](https://docs.google.com/document/d/14OmgGBr1w6gl1VO47GGGdwrIaUNr92DFhQbY_NEk8mQ/edit#heading=h.a9htwgvvec1m).
+* Deprecations: All commits introducing deprecation. Each commit should include a small example explaining what should be done to update user code.
+* new_features: All commits introducing a new feature (new functions, new submodule, new supported platform etc)
+* improvements: All commits providing improvements to existing features should be here (new backend for a function, new argument, better numerical stability)
+* bug fixes: All commits that fix bugs and behaviors that do not match the documentation
+* performance: All commits that are added mainly for performance (we separate this from improvements above to make it easier for users to look for it)
+* documentation: All commits that add/update documentation
+* Developers: All commits that are not end-user facing but still impact people that compile from source, develop into pytorch, extend pytorch, etc
+* not user facing: All commits that are not public end-user facing and hence should be dropped from the release notes
+
+## rocm
+### bc breaking
+### deprecation
+### new features
+- OCP Micro-scaling Format (mx-fp8/mx-fp4) Support ([#151360](https://github.com/pytorch/pytorch/pull/151360))
+- Support experimental CU carveout torch._C._set_sm_carveout_experimental() ([#149466](https://github.com/pytorch/pytorch/pull/149466))
+- Add FP8 rowwise support to _scaled_grouped_mm ([#159075](https://github.com/pytorch/pytorch/pull/159075))
+### improvements
+- Additional hipify mappings ([#158056](https://github.com/pytorch/pytorch/pull/158056), [#158352](https://github.com/pytorch/pytorch/pull/158352), [#161992](https://github.com/pytorch/pytorch/pull/161992))
+- composable_kernel (CK) backend user interface refactored to improve user experience ([#152951](https://github.com/pytorch/pytorch/pull/152951))
+- Allow use of rocSOLVER for Cholesky inversion. ([#157154](https://github.com/pytorch/pytorch/pull/157154))
+- AOT Inductor enable gfx950 for max autotune using CK ([#159195](https://github.com/pytorch/pytorch/pull/159195))
+- Add flag torch.backends.miopen.immediate to toggle MIOpen Immediate Mode instead of relying on deterministic=True + benchmark=False ([#158951](https://github.com/pytorch/pytorch/pull/158951))
+- MIOpen convolutions no longer call reshape_ or unexpectedly change memory formats ([#161687](https://github.com/pytorch/pytorch/pull/161687))
+### bug fixes
+- Fix hip:0 device error in Inductor with cudagraph trees ([#161221](https://github.com/pytorch/pytorch/pull/161221))
+- Adapt to the ROCm 7.0 BC-breaking change in the amdclang compiler: `warpSize` is no longer constexpr ([#156979](https://github.com/pytorch/pytorch/pull/156979))
+- Fix resource_strings.h and jit_utils.cpp for the ROCm 7.0 BC-breaking change to hiprtc ([#159292](https://github.com/pytorch/pytorch/pull/159292), [#159996](https://github.com/pytorch/pytorch/pull/159996))
+- On Windows fix some build failures and support some BLAS calls ([#161981](https://github.com/pytorch/pytorch/pull/161981))
+- On Windows fix undefined symbol linker error after exposing MIOpen symbols ([#156479](https://github.com/pytorch/pytorch/pull/156479))
+- On Windows fix finding ROCm/HIP version ([#156486](https://github.com/pytorch/pytorch/pull/156486))
+- On Windows fix LoadHIP handling of environment variable paths ([#159080](https://github.com/pytorch/pytorch/pull/159080))
+- On Windows add hipcc compatibility flags to cpp_extension.py. ([#159790](https://github.com/pytorch/pytorch/pull/159790))
+- Symmetric memory set handle type for ROCm ([#161741](https://github.com/pytorch/pytorch/pull/161741))
+- In SDPA via AOTriton, scale logsumexp back to natural base ([#156903](https://github.com/pytorch/pytorch/pull/156903))
+- Check stream graph capture status in memcpy_and_sync inline function ([#158165](https://github.com/pytorch/pytorch/pull/158165))
+### performance
+- SDPA now uses AOTriton 0.11b ([#161754](https://github.com/pytorch/pytorch/pull/161754))
+- hipblaslt is used by default on gfx908 for ROCm >= 6.3 ([#159092](https://github.com/pytorch/pytorch/pull/159092))
+- Enable miopen channels last 3d for conv and batchnorm ([#160529](https://github.com/pytorch/pytorch/pull/160529))
+- Remove extra transposes in NHWC convolutions on MIOpen ([#160435](https://github.com/pytorch/pytorch/pull/160435))
+- Remove extra sync in tensor.item() ([#158486](https://github.com/pytorch/pytorch/pull/158486))
+- Elementwise and reduction kernel perf improvements ([#159430](https://github.com/pytorch/pytorch/pull/159430), [#159652](https://github.com/pytorch/pytorch/pull/159652), [#160444](https://github.com/pytorch/pytorch/pull/160444), [#160466](https://github.com/pytorch/pytorch/pull/160466), [#161054](https://github.com/pytorch/pytorch/pull/161054), [#161180](https://github.com/pytorch/pytorch/pull/161180), [#161181](https://github.com/pytorch/pytorch/pull/161181))
+- Symmetric Memory Performance improvements for two-shot allreduce ([#156746](https://github.com/pytorch/pytorch/pull/156746))
+- Enable build of fbgemm_gpu genai sources for grouped gemm support. ([#160676](https://github.com/pytorch/pytorch/pull/160676))
+### docs
+### devs
+### Untopiced
+### not user facing
+### security
diff --git a/2.9.0/todo/result_distributed.md b/2.9.0/todo/result_distributed.md
@@ -52,8 +52,6 @@ The categories below are as follows:
 - [tp] improve parallelize_module API to support more cases ([#157182](https://github.com/pytorch/pytorch/pull/157182))
 - Script for consolidation of sharded safetensor files ([#154743](https://github.com/pytorch/pytorch/pull/154743))
 - HF - consolidate shards of safetensors files to full tensors in finish step ([#156705](https://github.com/pytorch/pytorch/pull/156705))
-- [ROCm][SymmetricMemory] Performance improvements for two-shot allreduce ([#156746](https://github.com/pytorch/pytorch/pull/156746))
-- [ROCm] Remove use of `warpsize` on host-side compilation ([#156979](https://github.com/pytorch/pytorch/pull/156979))
 - [SymmMem] Add NVSHMEM_CHECK macro ([#157174](https://github.com/pytorch/pytorch/pull/157174))
 - [PT] support custom all_gather and reduce_scatter comms ([#155189](https://github.com/pytorch/pytorch/pull/155189))
 - Fix typo: 'Intializes' → 'Initializes' in _distributed_c10d.pyi docst… ([#157455](https://github.com/pytorch/pytorch/pull/157455))
@@ -201,8 +199,6 @@ The categories below are as follows:
 - [SymmMem] Increase minimum nthreads to cover sync needs in NVL72 ([#161983](https://github.com/pytorch/pytorch/pull/161983))
 - [SymmMem] Use non-blocking version of getmem ([#162006](https://github.com/pytorch/pytorch/pull/162006))
 - [c10d] Lessen density of barrier warning ([#162015](https://github.com/pytorch/pytorch/pull/162015))
-- [ROCm/Windows] Fix build failures and support some BLAS calls ([#161981](https://github.com/pytorch/pytorch/pull/161981))
-- [Symmetric memory] set handle type for ROCm ([#161741](https://github.com/pytorch/pytorch/pull/161741))
 - [PP] Add profiling to schedule execution ([#160753](https://github.com/pytorch/pytorch/pull/160753))
 - [DCP][HuggingFace] Add Support for dequantization of SafeTensors checkpoints ([#160682](https://github.com/pytorch/pytorch/pull/160682))
 - Don't require FakeStore to be passed into fake backend ([#162164](https://github.com/pytorch/pytorch/pull/162164))
diff --git a/2.9.0/todo/result_inductor.md b/2.9.0/todo/result_inductor.md
@@ -62,7 +62,6 @@ The categories below are as follows:
 - Add inputs and outputs in Triton Kernel FX Graph segment ([#158174](https://github.com/pytorch/pytorch/pull/158174))
 - [Optimus] Support decompose mm with dynamic shapes ([#158821](https://github.com/pytorch/pytorch/pull/158821))
 - Enable dynamic shapes for foreach operations by default ([#158985](https://github.com/pytorch/pytorch/pull/158985))
-- [ROCm][CK][Inductor] enable gfx950 for max autotune with CK ([#159195](https://github.com/pytorch/pytorch/pull/159195))
 - [cutlass] rename EVT args within kernels for code caching ([#159243](https://github.com/pytorch/pytorch/pull/159243))
 - All custom operators go through Inductor's graph.call_function ([#159174](https://github.com/pytorch/pytorch/pull/159174))
 - [AOTInductor] Add test for enabling CUDACachingAllocator for AOTInductor's Weight ([#159279](https://github.com/pytorch/pytorch/pull/159279))
@@ -161,7 +160,6 @@ The categories below are as follows:
 - Support caching if joint_custom_pre_pass/joint_custom_post_pass implement the proper interface ([#157990](https://github.com/pytorch/pytorch/pull/157990))
 - Fix is_unaligned usage of statically_known_true ([#157845](https://github.com/pytorch/pytorch/pull/157845))
 - Return false in statically_known_multiple_of if numerator has more than 20 unique symbols ([#157855](https://github.com/pytorch/pytorch/pull/157855))
-- [ROCm][Inductor][CK] update API for gemm-multiD change ([#156122](https://github.com/pytorch/pytorch/pull/156122))
 - Add size_hints to cache key ([#158026](https://github.com/pytorch/pytorch/pull/158026))
 - [Bugfix][Inductor] Fix dependency list merged incorrectly for a custom op with multiple mutated inputs and None return type. ([#157133](https://github.com/pytorch/pytorch/pull/157133))
 - [aot] add format_consts_to_cpp function for further development. ([#157608](https://github.com/pytorch/pytorch/pull/157608))
@@ -294,7 +292,6 @@ The categories below are as follows:
 - Add kernel stack traces tlparse dump (#160608) ([#160779](https://github.com/pytorch/pytorch/pull/160779))
 - [MTIA] add correct name for CFF in tlparse ([#160599](https://github.com/pytorch/pytorch/pull/160599))
 - Add cutedsl template support to compile ([#160108](https://github.com/pytorch/pytorch/pull/160108))
-- [ROCm][inductor][dashboard] Add GPT2ForSequenceClassification to use_larger_multiplier_for_smaller_tensor list ([#160001](https://github.com/pytorch/pytorch/pull/160001))
 - Add signpost to provenance tracking error ([#160755](https://github.com/pytorch/pytorch/pull/160755))
 - [cpp][inductor] Fix crash on bmm when input is used twice. ([#160087](https://github.com/pytorch/pytorch/pull/160087))
 - Fix duplicated kernel name in kernel stack trace tracking ([#160905](https://github.com/pytorch/pytorch/pull/160905))
diff --git a/2.9.0/todo/result_nn_frontend.md b/2.9.0/todo/result_nn_frontend.md
@@ -41,7 +41,6 @@ The categories below are as follows:
 - Support deterministic upsample trilinear backward ([#154239](https://github.com/pytorch/pytorch/pull/154239))
 - Add device check in `mse_loss` ([#155089](https://github.com/pytorch/pytorch/pull/155089))
 - Fused RMSNorm Housekeeping ([#159317](https://github.com/pytorch/pytorch/pull/159317))
-- [ROCm] revamp miopen integration ([#161687](https://github.com/pytorch/pytorch/pull/161687))
 - NLLLoss: validate target is 0D when input is 1D ([#161412](https://github.com/pytorch/pytorch/pull/161412))
 ### not user facing
 - add test_batchnorn_2D and 3D tests ([#156498](https://github.com/pytorch/pytorch/pull/156498))
diff --git a/2.9.0/todo/result_quantization.md b/2.9.0/todo/result_quantization.md
@@ -88,6 +88,5 @@ The categories below are as follows:
 - Fix qembeddingbag_byte_prepack_meta to use sym_sizes ([#159985](https://github.com/pytorch/pytorch/pull/159985))
 - Using std::make_unique<T>() instead of unique<T>(new T()) ([#160723](https://github.com/pytorch/pytorch/pull/160723))
 - Using std::vector or c10::SmallVector instead of CArray ([#160959](https://github.com/pytorch/pytorch/pull/160959))
-- [ROCm] fix numpy version detection and adjust fudge_factors for MI355 ([#161429](https://github.com/pytorch/pytorch/pull/161429))
 - Enable more nightly tests on s390x ([#160893](https://github.com/pytorch/pytorch/pull/160893))
 ### security
diff --git a/2.9.0/todo/result_releng.md b/2.9.0/todo/result_releng.md
@@ -53,7 +53,6 @@ The categories below are as follows:
 - [BE] bump test dependency `z3-solver` to drop using deprecated `pkg_resources` ([#158905](https://github.com/pytorch/pytorch/pull/158905))
 - Enable MI355X PyTorch CI testing. ([#158889](https://github.com/pytorch/pytorch/pull/158889))
 - Setup TorchBench in Docker (d72ebefe3fa)
-- [ROCm] Update jit_utils.cpp trait modification based on HIP version. ([#159292](https://github.com/pytorch/pytorch/pull/159292))
 - Enable sample nightly PT2 benchmark on B200 ([#158011](https://github.com/pytorch/pytorch/pull/158011))
 - [Take 2] Setup TorchBench in Docker  ([#159300](https://github.com/pytorch/pytorch/pull/159300))
 - [BE]: ruff PLC0207 - use maxsplit kwarg ([#160107](https://github.com/pytorch/pytorch/pull/160107))
@@ -127,7 +126,6 @@ The categories below are as follows:
 - [audio hash update] update the pinned audio hash ([#158402](https://github.com/pytorch/pytorch/pull/158402))
 - [BE] Get rid of final mentions of BUILD_SPLIT_CUDA ([#158453](https://github.com/pytorch/pytorch/pull/158453))
 - ci: Update lint workflow to only run on changed files for PRs ([#158518](https://github.com/pytorch/pytorch/pull/158518))
-- [ROCm][CI] Last known good HIP patch ([#158596](https://github.com/pytorch/pytorch/pull/158596))
 - Fix s390x CI: ensure that all python dependencies are installed when … ([#158552](https://github.com/pytorch/pytorch/pull/158552))
 - Use linux.12xlarge.memory to build for H100/sm_90 ([#158598](https://github.com/pytorch/pytorch/pull/158598))
 - setup pinned commit for vllm in pytorch ci ([#158591](https://github.com/pytorch/pytorch/pull/158591))
diff --git a/2.9.0/todo/result_rocm.md b/2.9.0/todo/result_rocm.md
diff --git a/2.9.0/todo/result_skip.md b/2.9.0/todo/result_skip.md
diff --git a/2.9.0/todo/result_sparse_frontend.md b/2.9.0/todo/result_sparse_frontend.md