[ROCm][inductor] More configs for pointwise kernels. #166470
naromero77amd wants to merge 3 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166470
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit ac96ea4 with merge base 1e836bc. This comment was automatically generated by Dr. CI and updates every 15 minutes.
PaulZhang12 left a comment:
LGTM! Seems like another one of those layout issues with num_warps=1....
        num_stages=2,
        waves_per_eu=1,  # 20% improvement
    ),
    triton_config_with_settings(
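To make the shape of the diff above concrete, here is a minimal, self-contained sketch of what such a pointwise config candidate might look like. The helper `make_config` and the field layout are assumptions standing in for inductor's actual `triton_config_with_settings(...)`; `waves_per_eu` is the ROCm-specific occupancy hint mentioned in the diff.

```python
# Hypothetical sketch (NOT PyTorch's actual API) of extra ROCm pointwise
# autotuning candidates. `make_config` mimics the role of
# `triton_config_with_settings(...)` from the diff above.
def make_config(xblock, num_warps, num_stages, waves_per_eu=None):
    cfg = {
        "XBLOCK": xblock,
        "num_warps": num_warps,
        "num_stages": num_stages,
    }
    if waves_per_eu is not None:
        # ROCm-only occupancy hint; it has no CUDA equivalent.
        cfg["waves_per_eu"] = waves_per_eu
    return cfg

def rocm_pointwise_configs():
    # Extra candidates that would be tried only on ROCm/HIP builds.
    return [
        make_config(2048, num_warps=8, num_stages=2, waves_per_eu=1),
        make_config(4096, num_warps=4, num_stages=2),
    ]
```

The autotuner would simply append these to the usual candidate list and benchmark each one; the `waves_per_eu=1` entry is the one the diff annotates with a 20% improvement.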
Can we conditionalize this on whether an atomic add is actually present?
I don't see how we can do this.
@jataylo any ideas?
See num_stores in the heuristics:
pytorch/torch/_inductor/codegen/triton.py, line 5052 (at e0604d3)
At the moment, `tl.atomic_add` ops are counted as stores, so I could not distinguish a kernel with one atomic_add from a kernel with one ordinary store. I had to add another field to the inductor_meta structure.
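The distinction described above can be sketched in a few lines. This is not PyTorch's actual code, and the field name `num_atomic_adds` is an assumption; it only illustrates why counting atomic adds separately, rather than folding them into the store count, lets the heuristic tell the two kernels apart.

```python
# Minimal sketch of counting atomic adds separately from plain stores,
# so "one atomic_add" is distinguishable from "one ordinary store".
# The `num_atomic_adds` field name is hypothetical.
def analyze_stores(store_ops):
    num_stores = 0
    num_atomic_adds = 0
    for op in store_ops:
        num_stores += 1          # atomic adds are counted as stores too
        if op == "atomic_add":
            num_atomic_adds += 1
    # These counts would be recorded in inductor_meta for the heuristic.
    return {"num_stores": num_stores, "num_atomic_adds": num_atomic_adds}
```

With only `num_stores`, a kernel with a single `tl.atomic_add` and a kernel with a single plain store both report 1; the extra field is what disambiguates them.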
Sounds good; I was just pointing to where we do similar analysis. Makes sense that you need a new field.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This config improves performance by 250% on some kernels that contain `tl.atomic_add(...)`. Again, we conditionalize for ROCm/HIP, so there is no impact to NV. Pull Request resolved: #166470. Approved by: https://github.com/PaulZhang12, https://github.com/mlazos, https://github.com/eellison, https://github.com/jansel
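The "conditionalize for ROCm/HIP" point can be sketched as a simple gate on the candidate list. The `is_hip` flag stands in for a check such as `torch.version.hip is not None` (non-None on ROCm builds); `candidate_configs` is a hypothetical helper, not inductor's real function.

```python
# Hedged sketch of gating extra ROCm-only configs so CUDA (NV) autotuning
# is completely unchanged. `is_hip` would come from something like
# `torch.version.hip is not None` in a real PyTorch build.
def candidate_configs(base_configs, extra_rocm_configs, is_hip):
    if is_hip:
        # ROCm/HIP: try the additional configs as well.
        return base_configs + extra_rocm_configs
    # CUDA and other backends: the candidate list is untouched.
    return base_configs
```

Because the extra configs are only appended on HIP builds, the autotuner's search space (and therefore its results) on NVIDIA hardware is byte-for-byte identical to before.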
…ports (#2807): These are backports based on the following upstream PRs; cherry-picks were performed where possible.

- pytorch#163908 (persistent reduction autotune)
- pytorch#161280 (reduction)
- pytorch#162053 (foreach)
- pytorch#163197 (pointwise)
- pytorch#166470 (pointwise config for atomic add)

Also included are some additional customer-specific configs which were not upstreamed but are in this backport to 2.9 (#2723). Did not backport filter functions such as `_maybe_filter_configs_for_tma_restrictions` (https://github.com/ROCm/pytorch/blob/release/2.9/torch/_inductor/runtime/triton_heuristics.py#L2614).

Co-authored-by: Jack Taylor <jack.taylor@amd.com>
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Sampsa Riikonen <sriikone@amd.com>
Co-authored-by: AmdSampsa <sampsa.riikonen@amd.com>