[Inductor] Fix Diode / exhaustive autotune crash on AMD #169225
JChunX wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169225
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 7 Unrelated Failures
As of commit a137b1a with merge base 6f7dcf5:
NEW FAILURE - The following job has failed:
UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed 2132616 to 836994e
Summary:
Two issues prevent using Diode with an expanded search space on AMD:
1. matrix_instr_nonkdim=2 combined with kpack=2 causes Triton compilation to fail
2. GROUP_M=0 crashes the AMD GPU (but not NVIDIA); a sketch of the corresponding config filter follows the repro note
repro: P2057901593
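To make the two failure modes concrete, here is a minimal, hypothetical sketch of the kind of config pruning this fix implies, assuming candidate Triton GEMM configs are plain dicts. The function name is_valid_rocm_mm_config and the dict keys are illustrative only, not the actual Inductor code path.

# Hypothetical sketch: prune known-bad Triton GEMM configs on ROCm before they
# reach compilation/benchmarking. Names are illustrative; the real fix lives in
# Inductor's mm config generation.
from typing import Any

def is_valid_rocm_mm_config(cfg: dict[str, Any]) -> bool:
    # matrix_instr_nonkdim=2 together with kpack=2 fails Triton compilation on
    # AMD (the real constraint may be stated more generally than this pair).
    if cfg.get("matrix_instr_nonkdim") == 2 and cfg.get("kpack") == 2:
        return False
    # GROUP_M=0 crashes on AMD GPUs (but not on NVIDIA), so require at least 1.
    if cfg.get("GROUP_M", 1) < 1:
        return False
    return True

candidate_configs = [
    {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64, "GROUP_M": 8,
     "matrix_instr_nonkdim": 16, "kpack": 2},
    {"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32, "GROUP_M": 0,
     "matrix_instr_nonkdim": 2, "kpack": 2},
]
configs_to_benchmark = [c for c in candidate_configs if is_valid_rocm_mm_config(c)]
# Only the first config survives; the second would fail to compile or crash on AMD.

Filtering at config-generation time keeps the expanded search space usable on AMD without special-casing the benchmarking loop.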
Test Plan:
MODEL=822608598; SNAPSHOT=0
TORCHINDUCTOR_COMPILE_THREADS=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 \
TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCH_COMPILE_DEBUG=1 HIP_VISIBLE_DEVICES=7 \
buck2 run -m rocm640 mode/opt-split-dwarf mode/inplace mode/amd-gpu \
  -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=mi300 \
  caffe2/torch/fb/model_transform/fx2trt/packaging:generate_merge_net_file -- \
  --action=generate --max-batch-size=3072 \
  --preset_lowerer='relu_nan_to_num;disable_new_lowering_weights' \
  --input-file=/home/${USER}/models/${MODEL}/${SNAPSHOT}/input.predictor.disagg.gpu.merge \
  --output-file=/home/${USER}/models/${MODEL}/${SNAPSHOT}/fp8_amd_output_diode.predictor.disagg.gpu.merge \
  --diode-config="{'top_k': 100, 'expand_search_space': True, 'discard_unpredicted': False}" \
  --lower-backend aot_inductor \
  --add_passes="use_matmul_lce_replace_normal_LCE,use_triton_dot_compress,use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction" \
  --aot_inductor_config="{'max_autotune': True, 'comprehensive_padding': False, 'aot_inductor.use_runtime_constant_folding': True}" \
  --hardware-type GFX942_X86 \
  --node_replacement_dict="{'torch.nn.Linear':{'(3000+, 3000+)':'fp8_float_model_dynamic_quantization_rowwise_triton'}}" \
  2>&1 | tee ~/logs/lower_diode.log
Differential Revision: D87963125
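For context outside the internal buck2 flow above, a rough open-source analogue of what this test exercises is max-autotune GEMM lowering with Inductor's exhaustive Triton search space on a ROCm device. The snippet below is a sketch under that assumption, not the actual test plan.

# Rough open-source analogue (a sketch, not the internal test plan): enable
# max-autotune GEMM lowering with the exhaustive Triton search space, the
# Inductor path on which the Diode / exhaustive-autotune crash showed up.
import torch
import torch._inductor.config as inductor_config

inductor_config.max_autotune = True
inductor_config.max_autotune_gemm_backends = "TRITON"
inductor_config.max_autotune_gemm_search_space = "EXHAUSTIVE"

def mm(a, b):
    return a @ b

compiled_mm = torch.compile(mm, mode="max-autotune")

if torch.cuda.is_available():  # ROCm builds also report True here
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    out = compiled_mm(a, b)  # previously could fail to compile or crash on AMD

The TORCHINDUCTOR_MAX_AUTOTUNE=1 environment variable in the command above corresponds to the max_autotune config flag; the Diode-specific expand_search_space option is internal and has no direct open-source equivalent here.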
Force-pushed 836994e to 203e751
Force-pushed 203e751 to facd747
Force-pushed facd747 to 62bbc0f
Force-pushed 62bbc0f to a137b1a
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed
Reason: 1 job has failed: trunk / linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 3, 5, linux.g6.4xlarge.experimental.nvidia.gpu)
Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
Merge started
Your change will be merged while ignoring the following 8 checks: trunk / linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 3, 5, linux.g6.4xlarge.experimental.nvidia.gpu), trunk / linux-jammy-rocm-py3.10 / test (default, 4, 6, linux.rocm.gpu.gfx942.1), trunk / linux-jammy-rocm-py3.10 / test (default, 5, 6, linux.rocm.gpu.gfx942.1), trunk / linux-jammy-rocm-py3.10 / test (default, 3, 6, linux.rocm.gpu.gfx942.1), trunk / linux-jammy-rocm-py3.10 / test (default, 6, 6, linux.rocm.gpu.gfx942.1), trunk / linux-jammy-rocm-py3.10 / test (default, 2, 6, linux.rocm.gpu.gfx942.1), trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 3, linux.rocm.gpu.gfx942.4), trunk / linux-jammy-rocm-py3.10 / test (default, 1, 6, linux.rocm.gpu.gfx942.1).
Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #169225
Approved by: https://github.com/coconutruben
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo