[Inductor] Fix Diode / exhaustive autotune crash on AMD#169225

Closed
JChunX wants to merge 1 commit into pytorch:main from JChunX:export-D87963125

Conversation

@JChunX
Contributor

@JChunX JChunX commented Nov 28, 2025

Summary:
Two issues prevent using Diode with an expanded search space on AMD:

  1. matrix_instr_nonkdim=2 combined with kpack=2 causes the Triton compile to fail.

  2. GROUP_M=0 crashes AMD GPUs (but not NVIDIA).
    repro: P2057901593
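The fix implied by the two issues above can be sketched as a config filter: before autotuning, drop any Triton matmul config that hits either known-bad case on ROCm. This is a minimal illustrative sketch, not the actual Inductor implementation; the function name `filter_amd_configs`, the `is_rocm` flag, and the plain-dict config representation are all assumptions for illustration.

```python
def filter_amd_configs(configs, is_rocm=True):
    """Hypothetical sketch: return only autotune configs safe on AMD (ROCm).

    Mirrors the two failure modes from the PR summary; key names are
    illustrative, not Inductor's real config schema.
    """
    if not is_rocm:
        # NVIDIA tolerates these configs, so no filtering is needed.
        return list(configs)
    safe = []
    for cfg in configs:
        # Issue 1: matrix_instr_nonkdim=2 together with kpack=2 fails to compile.
        if cfg.get("matrix_instr_nonkdim") == 2 and cfg.get("kpack") == 2:
            continue
        # Issue 2: GROUP_M=0 crashes the AMD GPU.
        if cfg.get("GROUP_M") == 0:
            continue
        safe.append(cfg)
    return safe
```

On NVIDIA the full search space is kept, so the pruning only narrows the exhaustive/Diode search where the configs are known to crash or fail compilation.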

Test Plan:
MODEL=822608598; SNAPSHOT=0

TORCHINDUCTOR_COMPILE_THREADS=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCH_COMPILE_DEBUG=1 HIP_VISIBLE_DEVICES=7 buck2 run -m rocm640 mode/opt-split-dwarf mode/inplace mode/amd-gpu -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=mi300 caffe2/torch/fb/model_transform/fx2trt/packaging:generate_merge_net_file -- --action=generate --max-batch-size=3072 --preset_lowerer='relu_nan_to_num;disable_new_lowering_weights' --input-file=/home/${USER}/models/${MODEL}/${SNAPSHOT}/input.predictor.disagg.gpu.merge --output-file=/home/${USER}/models/${MODEL}/${SNAPSHOT}/fp8_amd_output_diode.predictor.disagg.gpu.merge --diode-config="{'top_k': 100, 'expand_search_space': True, 'discard_unpredicted': False}" --lower-backend aot_inductor --add_passes="use_matmul_lce_replace_normal_LCE,use_triton_dot_compress,use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction" --aot_inductor_config="{'max_autotune': True, 'comprehensive_padding': False, 'aot_inductor.use_runtime_constant_folding': True}" --hardware-type GFX942_X86 --node_replacement_dict="{'torch.nn.Linear':{'(3000+, 3000+)':'fp8_float_model_dynamic_quantization_rowwise_triton'}}" 2>&1 | tee ~/logs/lower_diode.log

Differential Revision: D87963125

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

@pytorch-bot

pytorch-bot bot commented Nov 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169225

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 7 Unrelated Failures

As of commit a137b1a with merge base 6f7dcf5 (image):

NEW FAILURE - The following job has failed:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-codesync

meta-codesync bot commented Nov 28, 2025

@JChunX has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87963125.

JChunX added a commit to JChunX/pytorch that referenced this pull request Dec 1, 2025
@JChunX JChunX added the topic: not user facing topic category label Dec 1, 2025
JChunX added a commit to JChunX/pytorch that referenced this pull request Dec 1, 2025
JChunX added a commit to JChunX/pytorch that referenced this pull request Dec 2, 2025
JChunX added a commit to JChunX/pytorch that referenced this pull request Dec 2, 2025
@coconutruben coconutruben self-requested a review December 2, 2025 18:06
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 2, 2025
Reviewed By: coconutruben

Differential Revision: D87963125
@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 3, 5, linux.g6.4xlarge.experimental.nvidia.gpu)

Details for Dev Infra team (raised by workflow job)

@huydhn
Contributor

huydhn commented Dec 3, 2025

@pytorchbot merge -i

JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025

Pull Request resolved: #169225
Approved by: https://github.com/coconutruben