[Inductor][Triton][FP8] Support tile-wise (1x128) scaling in Inductor by jananisriram · Pull Request #165132 · pytorch/pytorch

jananisriram · 2025-10-10T07:59:37Z

Summary:
Support tile-wise 1x128 scaling in Inductor Triton for FP8 GEMMs, i.e. each scaling value for tensors a and b applies to one 1x128 slice of the input.
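As a rough illustration of what a 1x128 scale layout means, here is a minimal pure-Python sketch. It is not the Inductor or Triton implementation. It assumes the common amax-based recipe (scale = amax / 448, where 448 is the largest finite float8_e4m3fn value), and the helper names `tilewise_scales` and `scale_to_fp8_range` are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of float8_e4m3fn
TILE = 128            # tile width for 1x128 scaling (assumed to run along K)


def tilewise_scales(matrix, tile=TILE, fp8_max=FP8_E4M3_MAX):
    """Return one scale per (row, tile): shape M x ceil(K / tile)."""
    scales = []
    for row in matrix:
        row_scales = []
        for start in range(0, len(row), tile):
            amax = max(abs(v) for v in row[start:start + tile])
            # Guard against an all-zero tile so later division is safe.
            row_scales.append(amax / fp8_max if amax > 0 else 1.0)
        scales.append(row_scales)
    return scales


def scale_to_fp8_range(matrix, scales, tile=TILE):
    """Divide each element by its tile's scale (the FP8 cast itself is omitted)."""
    return [
        [v / scales[i][j // tile] for j, v in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


# M=2, K=256: two 1x128 tiles per row, so `scales` has shape 2 x 2.
a = [[float((i + 1) * (j % 128 + 1)) for j in range(256)] for i in range(2)]
scales = tilewise_scales(a)
scaled = scale_to_fp8_range(a, scales)
```

Under these assumptions an M x K input carries an M x ceil(K / 128) grid of scales, so every scaled tile fits inside the FP8 representable range independently of its neighbours.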

NOTE: Block-wise 128x128 and 1x128 scaling is only supported in CUDA 12.9+; therefore, tile-wise scaling is currently unsupported in fbcode (CUDA 12.4). Use OSS PyTorch to run tile-wise scaling (as with deepseek-style scaling).
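A script that wants to skip tile-wise scaling on older toolkits could gate on the CUDA version named in the note. The sketch below is only an illustration: `supports_blockwise_fp8_scaling` is a hypothetical helper, not a PyTorch API, and in practice the version string would typically come from `torch.version.cuda` (which is `None` on builds without CUDA).

```python
def supports_blockwise_fp8_scaling(cuda_version, minimum=(12, 9)):
    """Return True if a CUDA version string such as "12.9" meets the minimum."""
    if not cuda_version:  # None or empty string: no CUDA toolkit in this build
        return False
    # Pad with "0" so a bare major version like "13" still parses.
    parts = (cuda_version.split(".") + ["0"])[:2]
    major, minor = (int(part) for part in parts)
    return (major, minor) >= minimum


print(supports_blockwise_fp8_scaling("12.4"))  # False
print(supports_blockwise_fp8_scaling("12.9"))  # True
```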

Test Plan:
Works out-of-the-box with TritonBench:

TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 256 --n 768 --k 512 --output="/home/jananisriram/personal/random_bench.csv" --scaling-pair=BlockWise1x128,BlockWise1x128 --atol=1e-2 --rtol=0.5

Differential Revision: D84025878

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

pytorch-bot · 2025-10-10T07:59:41Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165132

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

ROCm failures during provisioning step due to network issues

❌ 1 Cancelled Job

As of commit 4820cba with merge base 56a809a:

CANCELLED JOB - The following job was cancelled. Please retry:

Limited CI on H100 / linux-jammy-cuda12_8-py3_10-gcc11-sm90-FA3-ABI-stable-test / test (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2025-10-10T07:59:45Z

@jananisriram has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84025878.

aten/src/ATen/native/cuda/Blas.cpp

PaulZhang12 · 2025-10-10T14:00:08Z

Overall, this looks great. Will wait for @eqy to comment on the BLAS changes. cc @eellison

test/inductor/test_fp8.py

torch/_inductor/kernel/mm.py

drisspg

Much cleaner, a couple small comments. Great job!

…#165132)

Summary:
Pull Request resolved: #165132

Support tile-wise `1x128` scaling in Inductor Triton for FP8 GEMMs, i.e. scaling values along tensors `a` and `b` represent a `1x128` slice of input.

NOTE: Block-wise `128x128` and `1x128` scaling is only supported in CUDA 12.9+; therefore, tile-wise scaling is currently unsupported in `fbcode` (CUDA 12.4). Use OSS PyTorch to run tile-wise scaling (as with deepseek-style scaling).

Test Plan:
Works out-of-the-box with TritonBench:

```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 256 --n 768 --k 512 --output="/home/jananisriram/personal/random_bench.csv" --scaling-pair=BlockWise1x128,BlockWise1x128 --atol=1e-2 --rtol=0.5
```

Differential Revision: D84025878

njriasan

@jananisriram See my feedback. Overall this seems good, but I have some suggestions to have a single point of failure for the allowed scaling options. I'll defer to you though regarding what combinations should actually be supported and how to enforce those.

torch/_meta_registrations.py

torch/_inductor/kernel/mm.py

facebook-github-bot · 2025-11-03T16:41:16Z

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorchmergebot · 2025-11-03T16:43:13Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot · 2025-11-03T16:43:33Z

Merge failed

Reason: 1 jobs have failed, first few of them are: Limited CI on H100 / linux-jammy-cuda12_8-py3_10-gcc11-sm90-FA3-ABI-stable-test / test

jeanschmidt · 2025-11-03T18:22:05Z

@pytorchbot merge -i

pytorchmergebot · 2025-11-03T18:28:02Z

Merge started

Your change will be merged while ignoring the following 1 checks: Limited CI on H100 / linux-jammy-cuda12_8-py3_10-gcc11-sm90-FA3-ABI-stable-test / test

pytorchmergebot · 2025-11-03T18:32:53Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

jeanschmidt · 2025-11-03T18:34:31Z

@pytorchbot -f "merge with -i is not ignoring a canceled job, as it says it will. Seems a bug"

pytorch-bot · 2025-11-03T18:34:33Z

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'merge with -i is not ignoring a canceled job, as it says it will. Seems a bug' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

jeanschmidt · 2025-11-03T18:34:51Z

@pytorchbot merge -f "merge with -i is not ignoring a canceled job, as it says it will. Seems a bug"

pytorchmergebot · 2025-11-03T18:36:49Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

…#165132)

Summary:
Support tile-wise `1x128` scaling in Inductor Triton for FP8 GEMMs, i.e. scaling values along tensors `a` and `b` represent a `1x128` slice of input.

NOTE: Block-wise `128x128` and `1x128` scaling is only supported in CUDA 12.9+; therefore, tile-wise scaling is currently unsupported in `fbcode` (CUDA 12.4). Use OSS PyTorch to run tile-wise scaling (as with deepseek-style scaling).

Test Plan:
Works out-of-the-box with TritonBench:

```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 256 --n 768 --k 512 --output="/home/jananisriram/personal/random_bench.csv" --scaling-pair=BlockWise1x128,BlockWise1x128 --atol=1e-2 --rtol=0.5
```

Differential Revision: D84025878

Pull Request resolved: #165132

Approved by: https://github.com/eqy, https://github.com/drisspg, https://github.com/njriasan

jananisriram requested review from Aidyn-A, eqy and syed-ahmed as code owners October 10, 2025 07:59

pytorch-bot bot added ciflow/inductor module: inductor labels Oct 10, 2025

meta-codesync bot added fb-exported meta-exported labels Oct 10, 2025

jananisriram requested review from NikhilAPatel, PaulZhang12, drisspg, njriasan and slayton58 October 10, 2025 08:00

PaulZhang12 reviewed Oct 10, 2025

aten/src/ATen/native/cuda/Blas.cpp (outdated, resolved)

eqy approved these changes Oct 10, 2025

pytorch-bot bot added the ciflow/trunk label Oct 10, 2025

eellison added the ciflow/h100 label Oct 10, 2025

drisspg added ciflow/b200 ciflow/rocm labels Oct 10, 2025

drisspg reviewed Oct 10, 2025

test/inductor/test_fp8.py (outdated, resolved)

drisspg reviewed Oct 10, 2025

torch/_inductor/kernel/mm.py (outdated, resolved)

drisspg approved these changes Oct 10, 2025

jananisriram added the topic: not user facing label Oct 10, 2025

jananisriram force-pushed the export-D84025878 branch from 1fbc8e0 to 8988580 October 10, 2025 22:46

njriasan approved these changes Oct 13, 2025

torch/_meta_registrations.py (outdated, resolved)

torch/_inductor/kernel/mm.py (resolved)

torch/_inductor/kernel/mm.py (resolved)

torch/_inductor/kernel/mm.py (resolved)

jananisriram force-pushed the export-D84025878 branch from 8988580 to 63f2493 October 16, 2025 06:57

jananisriram force-pushed the export-D84025878 branch from 4a4c81d to c74c91a October 20, 2025 15:33

facebook-github-bot force-pushed the export-D84025878 branch from c74c91a to b3ef393 October 27, 2025 19:16

facebook-github-bot force-pushed the export-D84025878 branch 2 times, most recently from 48d3c59 to ec744b3 October 27, 2025 20:49

facebook-github-bot force-pushed the export-D84025878 branch from ec744b3 to a9226e6 October 27, 2025 20:51

facebook-github-bot force-pushed the export-D84025878 branch from a9226e6 to 4820cba October 29, 2025 21:22

pytorchmergebot added the merging label Nov 3, 2025

pytorchmergebot removed the merging label Nov 3, 2025

pytorchmergebot added the merging label Nov 3, 2025

pytorchmergebot closed this in aa4a8c9 Nov 3, 2025

pytorchmergebot added Merged and removed merging labels Nov 3, 2025

github-actions bot deleted the export-D84025878 branch December 4, 2025 02:18