
[ROCm] Improve perf for elementwise broadcast with mixed dtype#163562

Closed
jerrymannil wants to merge 1 commit into pytorch:main from jerrymannil:patch-1

Conversation

@jerrymannil (Collaborator) commented Sep 22, 2025

* Unroll loops manually to hide memory access latency
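For context, below is a minimal sketch of the manual-unrolling idea, not the actual kernel changed by this PR (the real change is guarded by ifdef USE_ROCM inside PyTorch's elementwise kernels). The kernel name, the unroll factor of 4, and the indexing scheme are illustrative assumptions, and plain float stands in for the mixed float/bfloat16 case:

__global__ void mul_broadcast_unrolled(const float* a,   // shape (rows, 1): one value per row, broadcast across columns
                                       const float* b,   // shape (rows, cols), contiguous
                                       float* out,
                                       long n,            // total number of output elements (rows * cols)
                                       long cols) {
  constexpr int UNROLL = 4;
  long base = ((long)blockIdx.x * blockDim.x + threadIdx.x) * UNROLL;

  if (base + UNROLL <= n) {
    float va[UNROLL], vb[UNROLL];
    // Issue all loads first so their memory latencies overlap.
    #pragma unroll
    for (int i = 0; i < UNROLL; ++i) {
      long idx = base + i;
      va[i] = a[idx / cols];   // broadcast operand
      vb[i] = b[idx];
    }
    // Consume the loaded values only after every load has been issued.
    #pragma unroll
    for (int i = 0; i < UNROLL; ++i) {
      out[base + i] = va[i] * vb[i];
    }
  } else {
    // Tail: handle the remaining elements one at a time.
    for (long idx = base; idx < n; ++idx) {
      out[idx] = a[idx / cols] * b[idx];
    }
  }
}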
@pytorch-bot bot commented Sep 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163562

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job

As of commit e774647 with merge base e558f7a:

NEW FAILURE - The following job has failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: rocm (AMD GPU support for Pytorch) and release notes: cuda (release notes category) labels Sep 22, 2025
@jerrymannil (Collaborator, Author)

Reproducer:

import time
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--events", action="store_true", help="Use CUDA events")
parser.add_argument("--check", action="store_true", help="Enable correctness check")
args = parser.parse_args()

shapes = [[(34816, 1), (34816, 3840)]]

for shape in shapes:
    a = torch.randn(shape[0], device='cuda', dtype=torch.float)
    b = torch.randn(shape[1], device='cuda', dtype=torch.bfloat16)
    for i in range(20):
        if args.check and i == 5:
            a_cpu = a.cpu()
            b_cpu = b.cpu()
            c_cpu = torch.mul(a_cpu, b_cpu)
            c = torch.mul(a, b)
            assert torch.equal(c.cpu(), c_cpu)
        _ = torch.mul(a, b)
    torch.cuda.synchronize()

    if args.events:
        start_evt = torch.cuda.Event(enable_timing=True)
        end_evt = torch.cuda.Event(enable_timing=True)
        start_evt.record()
    else:
        start_time = time.perf_counter_ns()

    for _ in range(100):
        c = torch.mul(a, b)

    if args.events:
        end_evt.record()
        torch.cuda.synchronize()
    else:
        torch.cuda.synchronize()
        end_time = time.perf_counter_ns()

    if args.events:
        print(f"Avg time for shape {shape}: {start_evt.elapsed_time(end_evt) / 100 * 1e3:.2f} us")
    else:
        print(f"Avg time for shape {shape}: {(end_time - start_time) / (100 * 1e3):.2f} us")
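Saved as, say, repro.py (the filename is arbitrary), the reproducer can be run as python repro.py for wall-clock timing or python repro.py --events to time with CUDA events; adding --check also verifies the result against a CPU reference.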

Results on MI325X

Before:
Avg time for shape [(34816, 1), (34816, 3840)]: 432.10 us

After:
Avg time for shape [(34816, 1), (34816, 3840)]: 381.74 us
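That is roughly a 12% reduction in average time for this shape.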

@jeffdaily jeffdaily added the release notes: rocm, ciflow/rocm (Trigger "default" config CI on ROCm), and ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300) labels and removed the release notes: cuda (release notes category) label Sep 22, 2025
@jeffdaily (Collaborator)

@pytorchbot merge -f "change is completely inside ifdef USE_ROCM, ROCm CI is passing"

@jerrymannil (Collaborator, Author)

The single failure is an intermittent issue; I am able to run the test fine in my local setup:

 python test/inductor/test_cuda_repro.py -k "test_repeated_masked_load" --verbose
test_repeated_masked_load (__main__.CudaReproTests.test_repeated_masked_load) ... expected failure

----------------------------------------------------------------------
Ran 1 test in 0.004s

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@jerrymannil jerrymannil deleted the patch-1 branch September 23, 2025 17:52
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
…ch#163562)

* Unroll loops manually to hide memory access latency

Co-author: @amd-hhashemi

Pull Request resolved: pytorch#163562
Approved by: https://github.com/jeffdaily

Labels

ciflow/rocm (Trigger "default" config CI on ROCm), ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300), Merged, module: rocm (AMD GPU support for Pytorch), open source, release notes: rocm
