[ROCm][inductor] autotune support for persistent reduction kernels #163908
naromero77amd wants to merge 9 commits into pytorch:main from
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163908
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 90f2fe2 with merge base 382b015.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This is a redo of this PR: #162056, but this time the branch was created in the ROCm repo.
Force-pushed from f894ae7 to e549d1c.
We demonstrate that the updated heuristics are generally applicable to other models, such as the HuggingFace (HF) Inductor Dashboard benchmark suite. Here is a high-level outline of the steps that were taken. We ran the HF benchmark suite from the Inductor Dashboard, then extracted the relevant kernels and benchmarked them with and without this PR (the baseline also used TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1). There is a geomean speed-up of 1.08x across more than 400 kernels.
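For reference, the same autotune setting can also be toggled from Python rather than via the environment variable. Below is a minimal sketch assuming a recent PyTorch build with Inductor and a CUDA or ROCm device; the softmax workload and tensor shapes are illustrative assumptions, not the HF benchmark models used above.

```python
# Minimal sketch: enable pointwise/reduction max-autotune in Inductor from Python,
# equivalent to setting TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1 in the environment.
import torch
import torch._inductor.config as inductor_config

inductor_config.max_autotune_pointwise = True

def softmax_rows(x):
    # Row-wise softmax; reductions like this commonly lower to persistent
    # reduction kernels in Inductor when the reduced dimension fits in one block.
    return torch.softmax(x, dim=-1)

compiled = torch.compile(softmax_rows)
x = torch.randn(1024, 4096, device="cuda")  # "cuda" also covers ROCm/HIP devices
out = compiled(x)
```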
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…ytorch#163908) After the removal of want_no_x_dim for persistent reduction kernels, we can improve the autotuning setup for persistent reduction kernels. Currently, even with tuning enabled, filtering will only try a single config in many cases. Avoid filtering in autotune mode, and override the MAX_BLOCK limit. Also always include tiny_config when autotuning is enabled. Contributions from several members of the AMD Inductor and Triton teams: @jataylo @iupaikov-amd @AmdSampsa @xiaohuguo2023 Pull Request resolved: pytorch#163908 Approved by: https://github.com/jansel, https://github.com/PaulZhang12
…ports (#2807) These are backports based on the following upstream PRs; cherry-picks were performed where possible:
pytorch#163908 (persistent reduction autotune)
pytorch#161280 (reduction)
pytorch#162053 (foreach)
pytorch#163197 (pointwise)
pytorch#166470 (pointwise config for atomic add)
Also included are some additional customer-specific configs which were not upstreamed but are in this backport to 2.9 (#2723). Did not backport filter functions such as `_maybe_filter_configs_for_tma_restrictions`: https://github.com/ROCm/pytorch/blob/release/2.9/torch/_inductor/runtime/triton_heuristics.py#L2614
Co-authored-by: Jack Taylor <jack.taylor@amd.com>
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Sampsa Riikonen <sriikone@amd.com>
Co-authored-by: AmdSampsa <sampsa.riikonen@amd.com>
After the removal of want_no_x_dim for persistent reduction kernels, we can improve their autotuning setup.
Currently, even with tuning enabled, filtering leaves only a single config to try in many cases. This PR avoids filtering when autotune mode is enabled, overrides the MAX_BLOCK limit, and always includes tiny_config when autotuning is enabled.
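For intuition, here is a minimal, hypothetical sketch of that idea. The names `KernelConfig` and `persistent_reduction_configs` are illustrative stand-ins, not the actual Inductor `triton_heuristics` code touched by this PR: with autotuning off, the candidate list is filtered down to a single config; with autotuning on, the full candidate list is kept and a tiny config is always appended.

```python
# Illustrative sketch only; these are hypothetical stand-ins for the real
# Inductor helpers that generate persistent-reduction kernel configs.
from dataclasses import dataclass

@dataclass(frozen=True)
class KernelConfig:
    xblock: int      # elements along the non-reduction (x) dimension per program
    num_warps: int   # warps/wavefronts launched per program

def persistent_reduction_configs(xnumel: int, rnumel: int,
                                 max_autotune: bool = False) -> list[KernelConfig]:
    candidates = [
        KernelConfig(xblock=1, num_warps=8),
        KernelConfig(xblock=8, num_warps=8),
        KernelConfig(xblock=32, num_warps=4),
    ]
    if not max_autotune:
        # Heuristic-only path: filtering collapses the search to a single config.
        return candidates[:1]
    # Autotune path: keep every candidate (no MAX_BLOCK-style filtering here)
    # and always include a tiny config as a safe fallback.
    tiny = KernelConfig(xblock=1, num_warps=1)
    return candidates + [tiny]
```

Under this sketch, the autotuner would benchmark each returned config at compile time and cache the fastest one, which is why keeping more than one candidate matters once autotuning is enabled.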
Contributions from several members of the AMD Inductor and Triton teams: @jataylo @iupaikov-amd @AmdSampsa @xiaohuguo2023
cc @mlazos @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben