[Inductor] Naive foreach autotune support #162053
jataylo wants to merge 7 commits into pytorch:main from ...
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162053
Note: Links to docs will display an error until the docs builds have been completed. ❌ 2 New Failures, 13 Unrelated Failures as of commit ff32d98 with merge base 2b93d5b.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Warning: Unknown label
Please add the new label to .github/pytorch-probot.yml
):
    configs.append(triton.Config({}, num_stages=1, num_warps=8))
else:
    for warps in [1, 2, 4, 8]:
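For readers without the full diff in front of them, here is a minimal sketch of the config-generation pattern being discussed. The helper name and the exact gating condition are assumptions for illustration, not the actual change (which lives in Inductor's Triton heuristics):

```python
import triton
from torch._inductor import config as inductor_config

def naive_foreach_configs():
    """Sketch: one fixed config unless autotuning is requested, else a small warp sweep."""
    configs = []
    if not (inductor_config.max_autotune or inductor_config.max_autotune_pointwise):
        # Previous behaviour: a single hard-coded config, so nothing gets benchmarked.
        configs.append(triton.Config({}, num_stages=1, num_warps=8))
    else:
        # Naive autotuning: try a handful of warp counts and let the autotuner
        # keep whichever launches fastest for this kernel.
        for warps in [1, 2, 4, 8]:
            configs.append(triton.Config({}, num_stages=1, num_warps=warps))
    return configs
```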
Do you have any more details on the kernels that were improved?
Let me dig that out
Along w/ Elias' comment, depending on the kernels it might be good to run some optimizer microbenchmarks after this to see if there's any improvement. Thanks for looking at this!
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased 41b4090 to f4fb636.
@mlazos, @eellison is there any sort of benchmark we can run to display perf results? The problem is the workload this is using is private. Another annoying issue is that foreach kernels do not support TORCHINDUCTOR_BENCHMARK_KERNEL=1, so it's not enough to copy over some codegen as there is no call harness. We can conditionalise for ROCm only if this makes things a bit easier to land.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased f4fb636 to 9fde70d.
Optimizers are a pretty good workload for this; they will fuse a bunch of foreach ops together. To make it realistic you can just take weight sizes from a public model and init them with random data.
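A rough sketch of such a microbenchmark, in the spirit of the suggestion above. Everything here (the shapes, the choice of Adam, the iteration counts) is illustrative and assumes a CUDA device; it is not code from this PR:

```python
import torch

# Hypothetical weight shapes; the suggestion is to lift these from a public model.
shapes = [(4096, 4096), (4096, 11008), (11008, 4096), (4096,)] * 8
params = [torch.randn(s, device="cuda", requires_grad=True) for s in shapes]
for p in params:
    p.grad = torch.randn_like(p)

# foreach=True keeps Adam on the horizontally fused path that produces the
# foreach/combo kernels; compiling the step exercises the generated Triton code.
opt = torch.optim.Adam(params, lr=1e-3, foreach=True)
step = torch.compile(opt.step)

for _ in range(3):  # warmup + compilation
    step()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(10):
    step()
end.record()
torch.cuda.synchronize()
print(f"avg optimizer step: {start.elapsed_time(end) / 10:.3f} ms")
```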
I think if we just have one config in the else statement we're good to reland, lmk if that's okay.
The else statement needs to have more configs; you wouldn't know a priori which is the fastest config. The point is to try more configs, as is done for reductions, persistent reductions, etc. There should be no increased compile time with autotune disabled, since the one config is the same one as before by construction. More details would be helpful with respect to the slowdown which is being observed.
Yeah sorry for the misunderstanding, it turns out the internal team has max_autotune_pointwise enabled, which explains why they were seeing this, and I missed the negation in the conditional, so my previous comment was incorrect. I asked them to turn it off; if they are unable to because there are too many jobs, we should switch this to only max_autotune. I'm waiting for their response but I think we should be able to reland after they confirm. Sorry for the churn! I also really want this and want to get it in. The perf looks great!
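For anyone reproducing this, a quick way to inspect or flip those knobs from user code; the attribute and env-var names below follow the usual inductor config pattern and should be double-checked against the PyTorch version in use:

```python
import torch._inductor.config as inductor_config

# Inductor typically checks `max_autotune or max_autotune_pointwise`, so either
# flag pulls in the extra foreach configs added by this PR.
print(inductor_config.max_autotune)            # usually set via TORCHINDUCTOR_MAX_AUTOTUNE=1
print(inductor_config.max_autotune_pointwise)  # usually set via TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1

# What the internal team was asked to do, expressed in code rather than env vars:
inductor_config.max_autotune_pointwise = False
```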
cc: @jataylo @mlazos Currently, [...]
The logic is that [...] So, I would recommend against only activating foreach tuning with [...]
@naromero77amd Yeah I think adding an additional flag TORCHINDUCTOR_MAX_AUTOTUNE_COMBOKERNEL is perhaps a better solution. The main reasons being that 1) it takes a long time because these kernels can be large w/ optimizers, and 2) it can cause OOMs when autotuning the optimizer. This is because the optimizer kernels we generate are inplace mutations, which induce a clone when we autotune (we can't mutate the actual params). What do you think?
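To make the OOM concern concrete, here is a generic sketch of the clone-before-benchmark pattern (not Inductor's actual autotuning code): because the kernel mutates its arguments in place, each timing run has to operate on clones, which temporarily duplicates the params, grads, and optimizer state being tuned.

```python
import torch

def bench_mutating_kernel(kernel, args, iters=10):
    """Time a kernel that mutates its inputs without corrupting the real tensors."""
    # Cloning protects the real params, but it is also what costs extra memory
    # when the arguments are an optimizer's full set of params, grads, and state.
    cloned = [a.clone() if isinstance(a, torch.Tensor) else a for a in args]
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        kernel(*cloned)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```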
This reverts commit 6c5db82. Reverted pytorch#162053 on behalf of https://github.com/mlazos due to Sorry, there's an internal slowdown due to the extra triton configs you added ([comment](pytorch#162053 (comment)))
@pytorchbot merge
I talked to the internal team; they were able to remove the flag. Relanding! Thank you for your patience.
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. The first few are: [...] Dig deeper by viewing the failures on hud.
@mlazos Due to some unrelated failures, you will need to do something like: [...]
❌ 🤖 pytorchbot command failed: Try [...]
@pytorchbot merge -i |
Merge failed. Reason: 12 jobs have failed, first few of them are: pull / linux-jammy-py3.13-clang12 / test (default, 2, 5, lf.linux.4xlarge), pull / linux-jammy-py3.10-gcc11 / test (default, 2, 5, lf.linux.2xlarge), pull / linux-jammy-py3.10-clang18-asan / test (default, 3, 7, lf.linux.4xlarge), pull / linux-jammy-py3.10-clang12 / test (default, 2, 5, lf.linux.4xlarge), trunk / linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 3, 5, linux.g6.4xlarge.experimental.nvidia.gpu). Details for Dev Infra team: Raised by workflow job.
@pytorchbot merge -f "Unrelated failures, relanding a reverted PR" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use [...] Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…ports (#2807)

These are backports based on the following upstream PRs. Cherry-picks were performed where possible.

pytorch#163908 (persistent reduction autotune)
pytorch#161280 (reduction)
pytorch#162053 (foreach)
pytorch#163197 (pointwise)
pytorch#166470 (pointwise config for atomic add)

Also included are some additional customer-specific configs which were not upstreamed but are in this backport to 2.9: #2723. Did not backport filter functions such as `_maybe_filter_configs_for_tma_restrictions` (https://github.com/ROCm/pytorch/blob/release/2.9/torch/_inductor/runtime/triton_heuristics.py#L2614).

Co-authored-by: Jack Taylor <jack.taylor@amd.com>
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Sampsa Riikonen <sriikone@amd.com>
Co-authored-by: AmdSampsa <sampsa.riikonen@amd.com>
Pull Request resolved: #162053
Approved by: https://github.com/mlazos, https://github.com/naromero77amd, https://github.com/jeffdaily
Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
Initial autotuning support for foreach kernels, a 4x improvement for some kernels in an internal workload. More improvements can surely be made here in the future. Removing num_warps from the definition to enable autotune support in generated wrapper code.
Before:
triton_for_fused_18.kd 🔍 | 4.986 ms | 4.986 ms | 2.493 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.098 ms | 0.098 ms | 0.049 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.036 ms | 0.036 ms | 0.018 ms | 2 |
After:
triton_for_fused_18.kd 🔍 | 1.273 ms | 1.273 ms | 0.636 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.044 ms | 0.044 ms | 0.022 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.024 ms | 0.024 ms | 0.012 ms | 2 |
The num_warps=8 default comes from https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton_combo_kernel.py#L374
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @mlazos
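For anyone wanting to see what this produces locally, dumping Inductor's generated output code shows the foreach/combo kernel definitions and the launch configs that end up being used. A small example using standard PyTorch logging, assuming a CUDA device; nothing here is specific to this PR:

```python
import torch
import torch._logging

# Print the generated Triton wrapper code, including foreach/combo kernels,
# whenever something is compiled.
torch._logging.set_logs(output_code=True)

model = torch.nn.Linear(1024, 1024, device="cuda")
model(torch.randn(8, 1024, device="cuda")).sum().backward()

opt = torch.optim.Adam(model.parameters(), foreach=True)
torch.compile(opt.step)()
```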