[Inductor] Fix Diode / exhaustive autotune crash on AMD#169225

Closed
JChunX wants to merge 1 commit into pytorch:main from JChunX:export-D87963125

Conversation

@JChunX
Contributor

@JChunX JChunX commented Nov 28, 2025

Summary:
Two issues prevent using Diode with an expanded search space on AMD:

  1. matrix_instr_nonkdim=2 combined with kpack=2 causes the Triton compile to fail.

  2. GROUP_M=0 crashes AMD GPUs (but not NVIDIA).
    repro: P2057901593
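The fix implied by the two issues above can be sketched as a config filter: before autotuning, drop any Triton matmul config that hits either known-bad case on ROCm. This is a minimal illustrative sketch, not the actual Inductor implementation; the function name `filter_amd_configs`, the `is_rocm` flag, and the plain-dict config representation are all assumptions for illustration.

```python
def filter_amd_configs(configs, is_rocm=True):
    """Hypothetical sketch: return only autotune configs safe on AMD (ROCm).

    Mirrors the two failure modes from the PR summary; key names are
    illustrative, not Inductor's real config schema.
    """
    if not is_rocm:
        # NVIDIA tolerates these configs, so no filtering is needed.
        return list(configs)
    safe = []
    for cfg in configs:
        # Issue 1: matrix_instr_nonkdim=2 together with kpack=2 fails to compile.
        if cfg.get("matrix_instr_nonkdim") == 2 and cfg.get("kpack") == 2:
            continue
        # Issue 2: GROUP_M=0 crashes the AMD GPU.
        if cfg.get("GROUP_M") == 0:
            continue
        safe.append(cfg)
    return safe
```

On NVIDIA the full search space is kept, so the pruning only narrows the exhaustive/Diode search where the configs are known to crash or fail compilation.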

Test Plan:
MODEL=822608598; SNAPSHOT=0

TORCHINDUCTOR_COMPILE_THREADS=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCH_COMPILE_DEBUG=1 HIP_VISIBLE_DEVICES=7 buck2 run -m rocm640 mode/opt-split-dwarf mode/inplace mode/amd-gpu -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=mi300 caffe2/torch/fb/model_transform/fx2trt/packaging:generate_merge_net_file -- --action=generate --max-batch-size=3072 --preset_lowerer='relu_nan_to_num;disable_new_lowering_weights' --input-file=/home/${USER}/models/${MODEL}/${SNAPSHOT}/input.predictor.disagg.gpu.merge --output-file=/home/${USER}/models/${MODEL}/${SNAPSHOT}/fp8_amd_output_diode.predictor.disagg.gpu.merge --diode-config="{'top_k': 100, 'expand_search_space': True, 'discard_unpredicted': False}" --lower-backend aot_inductor --add_passes="use_matmul_lce_replace_normal_LCE,use_triton_dot_compress,use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction" --aot_inductor_config="{'max_autotune': True, 'comprehensive_padding': False, 'aot_inductor.use_runtime_constant_folding': True}" --hardware-type GFX942_X86 --node_replacement_dict="{'torch.nn.Linear':{'(3000+, 3000+)':'fp8_float_model_dynamic_quantization_rowwise_triton'}}" 2>&1 | tee ~/logs/lower_diode.log

Differential Revision: D87963125

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

@pytorch-bot

pytorch-bot bot commented Nov 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169225

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 7 Unrelated Failures

As of commit a137b1a with merge base 6f7dcf5 (image):

NEW FAILURE - The following job has failed:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-codesync

meta-codesync bot commented Nov 28, 2025

@JChunX has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87963125.

JChunX added a commit to JChunX/pytorch that referenced this pull request Dec 1, 2025
@JChunX JChunX added the topic: not user facing topic category label Dec 1, 2025
JChunX added a commit to JChunX/pytorch that referenced this pull request Dec 1, 2025
JChunX added a commit to JChunX/pytorch that referenced this pull request Dec 2, 2025
JChunX added a commit to JChunX/pytorch that referenced this pull request Dec 2, 2025
@coconutruben coconutruben self-requested a review December 2, 2025 18:06
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 2, 2025
Reviewed By: coconutruben

Differential Revision: D87963125
@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 3, 5, linux.g6.4xlarge.experimental.nvidia.gpu)

Details for Dev Infra team (raised by workflow job)

@huydhn
Contributor

huydhn commented Dec 3, 2025

@pytorchbot merge -i

JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025

Pull Request resolved: #169225
Approved by: https://github.com/coconutruben