Sorry to bother you. I've run into a performance issue with the HGEMM kernel: on A100 and H100 GPUs, it reaches only about 60% of the expected performance.
Could you please advise which parts of the code should be adjusted or optimized for these architectures?