
[ROCm] Specialized binary elementwise broadcast kernel for mixed dtypes with float/bfloat16/half#167233

Closed
jerrymannil wants to merge 2 commits into pytorch:main from jerrymannil:patch-1

Conversation

@jerrymannil
Collaborator

@jerrymannil jerrymannil commented Nov 6, 2025

  • c10::fetch_and_cast and c10::cast_and_store produce branchy code, since they have to support all datatypes at runtime
  • So, we add special handling for binary elementwise broadcast with mixed dtypes of float/bfloat16/half (see the sketch after this list)
  • This improves performance for these mixed-dtype cases (benchmark results below)
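
To make the first bullet concrete, here is a minimal, self-contained C++ sketch of the idea. It is not the actual c10 implementation, and the names fetch_and_cast_generic and fetch_and_cast_specialized are made up for illustration: the generic path must branch on the runtime dtype for every element it loads, while a kernel instantiated for a fixed float/bfloat16/half combination compiles the cast down to a plain conversion.

#include <cstdio>

// Toy stand-in for c10::ScalarType; the real enum covers every PyTorch dtype.
enum class ScalarType { Float, Half, BFloat16 /* , ... many more ... */ };

// Generic path (simplified): the source dtype is only known at runtime, so
// every element load goes through a switch. With all dtypes supported, the
// inner loop ends up full of branches.
template <typename dest_t>
dest_t fetch_and_cast_generic(ScalarType src_type, const void* ptr) {
  switch (src_type) {
    case ScalarType::Float:
      return static_cast<dest_t>(*static_cast<const float*>(ptr));
    // Half, BFloat16, and every other dtype each add another case here.
    default:
      return dest_t{};
  }
}

// Specialized path (sketch): when the operand dtypes are fixed at kernel
// instantiation time (e.g. a bfloat16 input feeding a float op), the load
// plus cast is a straight conversion with no per-element dispatch.
template <typename dest_t, typename src_t>
dest_t fetch_and_cast_specialized(const void* ptr) {
  return static_cast<dest_t>(*static_cast<const src_t*>(ptr));
}

int main() {
  float x = 1.5f;
  std::printf("%f\n", fetch_and_cast_generic<double>(ScalarType::Float, &x));
  std::printf("%f\n", fetch_and_cast_specialized<double, float>(&x));
}

The same contrast plays out inside the GPU elementwise kernel's inner loop, which is why hard-coding the handful of float/bfloat16/half combinations pays off for broadcasted binary ops.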

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@pytorch-bot

pytorch-bot bot commented Nov 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167233

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 598d9b3 with merge base 73078f3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: rocm and release notes: cuda labels Nov 6, 2025
@jerrymannil jerrymannil marked this pull request as draft November 6, 2025 18:03
@jerrymannil
Collaborator Author

jerrymannil commented Nov 6, 2025

Reproducer:

import time
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--events", action="store_true", help="Use CUDA events")
parser.add_argument("--check", action="store_true", help="Enable correctness check")
args = parser.parse_args()

shapes = [[(34816, 1), (34816, 3840)]]


for shape in shapes:
    a = torch.randn(shape[0], device='cuda', dtype=torch.float)
    b = torch.randn(shape[1], device='cuda', dtype=torch.bfloat16)
    for i in range(20):  # warmup (optional correctness check on iteration 5)
        if args.check and i == 5:
            a_cpu = a.cpu()
            b_cpu = b.cpu()
            c_cpu = torch.mul(a_cpu, b_cpu)
            c = torch.mul(a, b)
            torch.cuda.synchronize()
            assert torch.equal(c.cpu(), c_cpu)
        _ = torch.mul(a, b)
    torch.cuda.synchronize()

    if args.events:
        start_evt = torch.cuda.Event(enable_timing=True)
        end_evt = torch.cuda.Event(enable_timing=True)
        start_evt.record()
    else:
        start_time = time.perf_counter_ns()

    for _ in range(100):
        c = torch.mul(a, b)

    if args.events:
        end_evt.record()
        torch.cuda.synchronize()
    else:
        torch.cuda.synchronize()
        end_time = time.perf_counter_ns()

    if args.events:
        print(f"Avg time for shape {shape}: {start_evt.elapsed_time(end_evt) / 100 * 1e3:.2f} us")
    else:
        print(f"Avg time for shape {shape}: {(end_time - start_time) / (100 * 1e3):.2f} us")

Results (MI300X):

Before:
Avg time for shape [(34816, 1), (34816, 3840)]: 406.78 us

After (≈1.75× faster):
Avg time for shape [(34816, 1), (34816, 3840)]: 231.73 us

@jeffdaily jeffdaily added the release notes: rocm and ciflow/rocm-mi300 labels and removed the release notes: cuda label Nov 6, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/rocm-mi300 label Nov 6, 2025
@jerrymannil jerrymannil marked this pull request as ready for review November 6, 2025 21:13
@jeffdaily jeffdaily added the ciflow/rocm-mi300 label Nov 6, 2025
@jerrymannil
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 7, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

jerrymannil added a commit to ROCm/pytorch that referenced this pull request Nov 7, 2025
[ROCm] Specialized binary elementwise broadcast kernel for mixed dtypes with float/bfloat16/half (#2791)

cherry-pick of pytorch#167233

Fixes #SWDEV-551924
jerrymannil added a commit to ROCm/pytorch that referenced this pull request Nov 7, 2025
[ROCm] Specialized binary elementwise broadcast kernel for mixed dtypes with float/bfloat16/half (#2795)

cherry-pick of pytorch#167233

Fixes #SWDEV-551924
@jerrymannil jerrymannil deleted the patch-1 branch November 7, 2025 03:15
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025
[ROCm] Specialized binary elementwise broadcast kernel for mixed dtypes with float/bfloat16/half (pytorch#167233)

* `c10::fetch_and_cast` and `c10::cast_and_store` produce branchy code since they support all datatypes
* So, we do special handling for binary elementwise broadcast with mixed dtypes of float/bfloat16/half
* This improves performance

Pull Request resolved: pytorch#167233
Approved by: https://github.com/jeffdaily
jerrymannil added a commit to ROCm/pytorch that referenced this pull request Nov 19, 2025