[ROCm] Improve perf for elementwise broadcast with mixed dtype #163562
jerrymannil wants to merge 1 commit into pytorch:main from jerrymannil:patch-1
* Unroll loops manually to hide memory access latency
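
To illustrate the idea behind the bullet above, here is a minimal host-side C++ sketch of manual unrolling for a broadcast add with mixed dtypes. It is not the kernel code from this PR: the function name `add_broadcast_unrolled`, the unroll factor of 4, and the `float`/`double` operand types are assumptions chosen for illustration. The point it shows is that all loads in an unrolled group are issued before any of them is consumed, so independent memory accesses can overlap instead of serializing behind each arithmetic step.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the PR:
// out[r][c] = a[r][c] + float(b[c]), where b (length `cols`) is broadcast
// across every row of a. The operands deliberately have different dtypes.
std::vector<float> add_broadcast_unrolled(const std::vector<float>& a,
                                          const std::vector<double>& b,
                                          std::size_t rows, std::size_t cols) {
    constexpr std::size_t kUnroll = 4;  // assumed unroll factor
    std::vector<float> out(rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        const float* a_row = a.data() + r * cols;
        float* out_row = out.data() + r * cols;
        std::size_t c = 0;
        for (; c + kUnroll <= cols; c += kUnroll) {
            // Issue every load of the group first; none depends on another.
            const float a0 = a_row[c], a1 = a_row[c + 1];
            const float a2 = a_row[c + 2], a3 = a_row[c + 3];
            const double b0 = b[c], b1 = b[c + 1];
            const double b2 = b[c + 2], b3 = b[c + 3];
            // Only then convert, compute, and store.
            out_row[c]     = a0 + static_cast<float>(b0);
            out_row[c + 1] = a1 + static_cast<float>(b1);
            out_row[c + 2] = a2 + static_cast<float>(b2);
            out_row[c + 3] = a3 + static_cast<float>(b3);
        }
        // Tail loop for columns that do not fill a whole unrolled group.
        for (; c < cols; ++c) {
            out_row[c] = a_row[c] + static_cast<float>(b[c]);
        }
    }
    return out;
}
```

On a GPU the same pattern would live inside the per-thread loop of the elementwise kernel; whether a given unroll factor helps depends on register pressure and occupancy, so it should be measured rather than assumed.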
❌ 1 New Failure, 1 Cancelled Job as of commit e774647 with merge base e558f7a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Reproducer: Results on MI325X
@pytorchbot merge -f "change is completely inside ifdef USE_ROCM, ROCm CI is passing"
The single failure is an intermittent issue.
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
…ch#163562)
* Unroll loops manually to hide memory access latency
Co-author: @amd-hhashemi
Pull Request resolved: pytorch#163562
Approved by: https://github.com/jeffdaily
Co-author: @amd-hhashemi
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd