[ROCm/Windows] Support aotriton for scaled_dot_product_attention on Windows. #162330
jammm wants to merge 4 commits into pytorch:main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162330
Note: Links to docs will display an error until the docs builds have been completed.
❌ 19 New Failures, 2 Cancelled Jobs, 2 Unrelated Failures (as of commit 4f46a52 with merge base 5b9114b)
NEW FAILURES - The following jobs have failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
cc @ScottTodd |
|
cc @xinyazhang |
|
@pytorchbot label "release notes: rocm |
|
❌ 🤖 pytorchbot command failed: |
|
@pytorchbot label "release notes: rocm" |
|
@pytorchbot label "topic: performance" |
|
@pytorchbot label "module: windows" |
Force-pushed from a44f41f to f7ebef2
|
Wait a minute, so this is actually TheRock's WoW |
Force-pushed from f7ebef2 to 3c50ab2
|
Great to see Xinya has approved :D Who else do we need here as a reviewer with merge privileges? Jeff? |
```diff
 if(USE_ROCM)
-  if(UNIX AND (USE_FLASH_ATTENTION OR USE_MEM_EFF_ATTENTION))
+  if(USE_FLASH_ATTENTION OR USE_MEM_EFF_ATTENTION)
     include(cmake/External/aotriton.cmake)
```
Thanks, I tested this using https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py, with and without `--enable-pytorch-flash-attention-windows`.

- Both builds succeeded.
- Running PyTorch succeeded with aotriton enabled, and ComfyUI seemed to generate images on my gfx1100 GPU using the memory-efficient attention implementation (after setting the `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` env var); a minimal sanity check is sketched after the log below.
- With the option enabled, I see 45 MB more logs (15 MB -> 60 MB), including 5337 instances of this warning. It seems to just be a warning, possibly fixed by forcing Python into UTF-8 mode (will verify):
```
Message: '%s %s -> %s'
Arguments: ('copying', 'torch\\lib\\aotriton.images\\amd-gfx11xx\\flash\\bwd_kernel_dq\\FONLY__\uff0afp32@16_48_0_T_T_1___gfx11xx.aks2', 'build\\lib.win-amd64-cpython-312\\torch\\lib\\aotriton.images\\amd-gfx11xx\\flash\\bwd_kernel_dq')
--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\Nod-Shark16\AppData\Local\Programs\Python\Python312\Lib\logging\__init__.py", line 1163, in emit
    stream.write(msg + self.terminator)
  File "C:\Users\Nod-Shark16\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\uff0a' in position 73: character maps to <undefined>
Call stack:
  File "D:\b\pytorch_main\setup.py", line 1785, in <module>
    main()
  File "D:\b\pytorch_main\setup.py", line 1766, in main
    setup(
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\__init__.py", line 117, in setup
    return distutils.core.setup(**attrs)
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\_distutils\core.py", line 186, in setup
    return run_commands(dist)
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\_distutils\core.py", line 202, in run_commands
    dist.run_commands()
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\_distutils\dist.py", line 1002, in run_commands
    self.run_command(cmd)
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\dist.py", line 1104, in run_command
    super().run_command(command)
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\_distutils\dist.py", line 1021, in run_command
    cmd_obj.run()
  File "D:\b\pytorch_main\setup.py", line 1353, in run
    super().run()
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\command\bdist_wheel.py", line 370, in run
    self.run_command("build")
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\_distutils\cmd.py", line 357, in run_command
    self.distribution.run_command(command)
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\dist.py", line 1104, in run_command
    super().run_command(command)
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\_distutils\dist.py", line 1021, in run_command
    cmd_obj.run()
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\_distutils\command\build.py", line 135, in run
    self.run_command(cmd_name)
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\_distutils\cmd.py", line 357, in run_command
    self.distribution.run_command(command)
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\dist.py", line 1104, in run_command
    super().run_command(command)
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\_distutils\dist.py", line 1021, in run_command
    cmd_obj.run()
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\command\build_py.py", line 78, in run
    self.build_package_data()
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\command\build_py.py", line 171, in build_package_data
    _outf, _copied = self.copy_file(srcfile, target)
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\command\build_py.py", line 64, in copy_file
    return super().copy_file(  # pyright: ignore[reportReturnType] # pypa/distutils#309
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\_distutils\cmd.py", line 421, in copy_file
    return file_util.copy_file(
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\_distutils\file_util.py", line 130, in copy_file
    log.info("%s %s -> %s", action, src, dir)
```
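Not part of the PR itself, but for anyone reproducing the test above: a minimal sketch of how one might confirm the memory-efficient SDPA path on a gfx1100 card, assuming a ROCm/Windows build with aotriton enabled (tensor shapes and dtype are arbitrary choices here, not from this thread):

```python
# Hedged sketch, not the exact check used above. The env var is assumed to need
# setting before the first SDPA dispatch, so set it before importing torch.
import os
os.environ["TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL"] = "1"  # needed for gfx11xx at the time of this PR

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Arbitrary (batch, heads, seq_len, head_dim) problem size in fp16.
q, k, v = (torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# Restrict dispatch to the memory-efficient backend so a silent fallback to the
# math path would raise instead of masking a missing aotriton build.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)

print(out.shape)  # torch.Size([1, 8, 128, 64])
```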
I rebuilt (without fully cleaning my build/source dirs) with the `PYTHONUTF8=1` environment variable and didn't see the warnings. Hopefully a clean rebuild (including deleting `torch/lib/aotriton.images/` in the source dir) is also warning-free. We can add that env var to our downstream build script and any upstream build scripts we contribute (see #160776).
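For context, a self-contained illustration (not taken from the build logs) of why the default cp1252 stream encoding fails on these aotriton kernel image names, and why UTF-8 mode (`PYTHONUTF8=1`) avoids it:

```python
# The fullwidth asterisk (U+FF0A) in the aotriton kernel image name has no cp1252 mapping,
# so any logging handler writing to a cp1252-encoded stream raises UnicodeEncodeError.
name = "FONLY__\uff0afp32@16_48_0_T_T_1___gfx11xx.aks2"

try:
    name.encode("cp1252")
except UnicodeEncodeError as exc:
    print(exc)  # 'charmap' codec can't encode character '\uff0a' ...

# UTF-8 can represent the name, which is what PYTHONUTF8=1 gives stdio and logging streams.
print(name.encode("utf-8"))
```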
|
@jeffdaily PTAL. Received approval from @xinyazhang and @ScottTodd. |
## Motivation

Progress on #1040, getting closer to enabling aotriton in PyTorch on Windows.

## Technical Details

This will supersede #1409 and is dependent on pytorch/pytorch#162330. The UTF8 change I believe helps with warnings about logs for copying files with unicode characters in their names:

```
Message: '%s %s -> %s'
Arguments: ('copying', 'torch\\lib\\aotriton.images\\amd-gfx11xx\\flash\\bwd_kernel_dq\\FONLY__\uff0afp32@16_48_0_T_T_1___gfx11xx.aks2', 'build\\lib.win-amd64-cpython-312\\torch\\lib\\aotriton.images\\amd-gfx11xx\\flash\\bwd_kernel_dq')
--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\Nod-Shark16\AppData\Local\Programs\Python\Python312\Lib\logging\__init__.py", line 1163, in emit
    stream.write(msg + self.terminator)
  File "C:\Users\Nod-Shark16\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\uff0a' in position 73: character maps to <undefined>
Call stack:
  File "D:\b\pytorch_main\setup.py", line 1785, in <module>
    main()
  File "D:\b\pytorch_main\setup.py", line 1766, in main
    setup(
  File "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\setuptools\__init__.py", line 117, in setup
    return distutils.core.setup(**attrs)
```

## Test Plan

Tested with local builds on Windows with and without `--enable-pytorch-flash-attention-windows`.

## Test Result

Builds succeeded, ComfyUI generated images on my gfx1100 GPU (needed `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` for aotriton on that GPU).

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests
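A hedged sketch of how a downstream build script could opt the `setup.py` child process into UTF-8 mode; the command and paths below are placeholders, not the actual `build_prod_wheels.py` interface:

```python
# Sketch only: force UTF-8 mode for the PyTorch build subprocess so copying the
# aotriton kernel images (which have non-ASCII names) does not trip logging errors.
import os
import subprocess
import sys

env = dict(os.environ, PYTHONUTF8="1")
subprocess.run(
    [sys.executable, "setup.py", "bdist_wheel"],  # placeholder invocation
    cwd="pytorch",                                # placeholder source dir
    env=env,
    check=True,
)
```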
|
Lint test fails because: But |
No that's the fix to the regression that broke the CUDA builds. The merge failures are unrelated and should be fixed once they're fixed elsewhere |
Kinda curious where and when they should be fixed 🤔 Oh, #162881 (comment) 👀 |
|
@pytorchbot merge -f "the cuda build OOM that caused a revert of this PR has been fixed, all other failures are unrelated" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…indows. (pytorch#162330) Enables flash attention and/or memory efficient attention on Windows with scaled_dot_product_attention via aotriton. Already tested to be working on Windows with TheRock. Steps to enable: simply set `USE_FLASH_ATTENTION=1` and `USE_MEM_EFF_ATTENTION=1` as usual. See https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py#L578-L604 Pull Request resolved: pytorch#162330 Approved by: https://github.com/xinyazhang, https://github.com/ScottTodd, https://github.com/jeffdaily Co-authored-by: Scott Todd <scott.todd0@gmail.com>
…ion on Windows. (pytorch#162330)" This reverts commit 62843c1. Reverted pytorch#162330 on behalf of https://github.com/atalman due to Sorry reverting looks like broke windows nightlies see pytorch#162881 ([comment](pytorch#162330 (comment)))
…indows. (pytorch#162330) Enables flash attention and/or memory efficient attention on Windows with scaled_dot_product_attention via aotriton. Already tested to be working on Windows with TheRock. Steps to enable: simply set `USE_FLASH_ATTENTION=1` and `USE_MEM_EFF_ATTENTION=1` as usual. See https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py#L578-L604 Pull Request resolved: pytorch#162330 Approved by: https://github.com/jeffdaily Co-authored-by: Scott Todd <scott.todd0@gmail.com>
Fixes: pytorch#163958 Cherry-pick pytorch#161754 Cherry-pick pytorch#162330 Cherry-pick pytorch#163373 Cherry-pick pytorch#163745 Note TF32 support is still being plagued by `HIPBLASLT_ALLOW_TF32`, which should be handled by another PR due to its complexity. --------- Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com> Co-authored-by: Scott Todd <scott.todd0@gmail.com>
Enables flash attention and/or memory efficient attention on Windows with scaled_dot_product_attention via aotriton.
Already tested to be working on Windows with TheRock.
Steps to enable: simply set `USE_FLASH_ATTENTION=1` and `USE_MEM_EFF_ATTENTION=1` as usual. See https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py#L578-L604 (a quick runtime check is sketched after the cc list below).

cc @peterjc123 @mszhanyi @skyline75489 @nbcsm @iremyux @Blackhex @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
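For anyone validating a wheel built this way, a quick runtime check (a sketch, not part of this change); note that `flash_sdp_enabled`/`mem_efficient_sdp_enabled` report whether the backends are allowed by the runtime toggles, not whether a given input will actually select them:

```python
# Quick post-install check of a ROCm wheel built with USE_FLASH_ATTENTION=1 and
# USE_MEM_EFF_ATTENTION=1.
import torch

print(torch.version.hip)                                # non-None on a ROCm build
print(torch.backends.cuda.flash_sdp_enabled())          # flash attention backend toggle
print(torch.backends.cuda.mem_efficient_sdp_enabled())  # memory-efficient backend toggle
```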