@junjihashimoto junjihashimoto commented Sep 5, 2025

Add a shader for subgroup matrix multiplication.

$ sysctl -n machdep.cpu.brand_string
Apple M4 Max

$ for i in 11 12 13 ; do MATMUL_VERSION=$i ./build/matmul  | grep 'Dispatching\|FLOPS' ; done
# Without subgroupMatrix (f16) 
[info] Dispatching Kernel version 11: f16: 2D blocktiling with loop unrolling, vectorization and transpose, 30 iterations ...
25.8 milliseconds / dispatch ~ 10640.35 GFLOPS
# With subgroupMatrix (f16) 
[info] Dispatching Kernel version 12: f16: Subgroup matrix multiply with transpose, 30 iterations ...
20.7 milliseconds / dispatch ~ 13299.45 GFLOPS
# With subgroupMatrix (f32) 
[info] Dispatching Kernel version 13: f32: Subgroup matrix multiply with transpose (default), 30 iterations ...
24.6 milliseconds / dispatch ~ 11185.63 GFLOPS
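
The relative speedups implied by the log above can be worked out with a few lines of arithmetic. This is only a minimal sketch: the `reported_gflops` dictionary and the `speedup` helper are hypothetical names introduced for illustration, and the only data used are the GFLOPS figures printed in the log.

```python
# GFLOPS figures copied from the benchmark log above (Apple M4 Max).
# The dictionary and helper names are illustrative and not part of the PR.
reported_gflops = {
    11: 10640.35,  # f16, 2D blocktiling, no subgroup matrix
    12: 13299.45,  # f16, subgroup matrix multiply with transpose
    13: 11185.63,  # f32, subgroup matrix multiply with transpose
}


def speedup(version: int, baseline: int = 11) -> float:
    """Throughput of `version` relative to `baseline` (higher is faster)."""
    return reported_gflops[version] / reported_gflops[baseline]


if __name__ == "__main__":
    for v in (12, 13):
        print(f"version {v} vs 11: {speedup(v):.2f}x")
```

On these numbers the f16 subgroup-matrix kernel comes out at roughly 1.25x the f16 blocktiling kernel, and the f32 subgroup-matrix kernel at roughly 1.05x. The second ratio compares kernels of different precision, so it is not a like-for-like comparison.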

junjihashimoto commented Sep 5, 2025

The main branch does not seem to output any shader compilation errors.

# Ensure source present on required commit (idempotent remote setup)
if(NOT DEFINED DAWN_COMMIT OR DAWN_COMMIT STREQUAL "")
-  set(DAWN_COMMIT "e1d6e12337080cf9f6d8726209e86df449bc6e9a" CACHE STRING "Dawn commit to checkout" FORCE)
+  set(DAWN_COMMIT "3f79f3aefe0b0a498002564fcfb13eb21ab6c047" CACHE STRING "Dawn commit to checkout" FORCE)

google/dawn@d7d27a6
Required to set the subgroup size to 32 on macOS.

@junjihashimoto junjihashimoto marked this pull request as ready for review October 13, 2025 06:14