
[FlexFlash] CuteDSL flat indexer needs to be colexicographic in coordinate space #166657

Closed
drisspg wants to merge 15 commits into gh/drisspg/217/base from gh/drisspg/217/head

Conversation

drisspg (Contributor) commented Oct 30, 2025

[ghstack-poisoned]
pytorch-bot bot commented Oct 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166657

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit cba6288 with merge base 687c15c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

drisspg added 6 commits that referenced this pull request Oct 30, 2025
drisspg changed the title from "Take 2" to "[FlexFlash] CuteDSL flat indexer needs to be colexicographic in coordinate space" Oct 30, 2025
drisspg added a commit that referenced this pull request Oct 30, 2025
drisspg added the release notes: nn label Oct 30, 2025
drisspg added a commit that referenced this pull request Oct 30, 2025
drisspg requested a review from albanD as a code owner October 31, 2025 00:17
drisspg added a commit that referenced this pull request Oct 31, 2025
drisspg requested review from Chillee, eellison, and v0i0 and removed the request for albanD October 31, 2025 00:17
eellison (Contributor) left a comment

Can you add some tests with views, to be safe?

Comment on lines +86 to +89
original_make_indexer = FixedLayout.make_indexer

def cutedsl_make_indexer(self):
    return _fixed_indexer_cute(self.size, self.stride, self.offset)
A reviewer (Contributor) commented:

Is it just FixedLayout.make_indexer that encodes lexicographic order? I see make_reindexer, but I think that works because it just dispatches to FixedLayout in the end.

drisspg (Author) replied:

I think one thing I need to make clearer is that it is not actually the colexicographic part that is the problem.

So make_indexer is producing an index in pointer space.

If I have a tensor with shape (10, 15) and stride (15, 1), and I index at Tensor[9, 12], this does dot(index, stride), so 9 * 15 + 12 * 1 = 147, and this is what is given to the load.

In the current CuteDSL path, we are passing the indices directly to the on-device cute tensor, which for all intents behaves like a pytorch tensor.

So it actually wants cute_tensor[9, 12] and then it does the stride dot itself. It also happens to always accept a flat 1D index, in which case the total number of possible indices is prod(shape) = 10 * 15 = 150 and the mapping from 1D index to ND index is colexicographic.

So that would be sum_n(index_n * prod(shape[:n])), or 9 * 1 + 12 * 10 = 129.

Does this make sense?
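A minimal sketch of the two mappings (illustrative Python, not the actual inductor/CuteDSL helpers):

import math

shape = (10, 15)
stride = (15, 1)  # row-major / C-contiguous
coord = (9, 12)

# What make_indexer produces today: an offset in pointer space,
# i.e. dot(coord, stride).
ptr_offset = sum(c * s for c, s in zip(coord, stride))
assert ptr_offset == 9 * 15 + 12 * 1 == 147

# What the cute tensor's flat 1D index expects: colexicographic order,
# where the leftmost coordinate varies fastest.
colex_flat = sum(c * math.prod(shape[:n]) for n, c in enumerate(coord))
assert colex_flat == 9 * 1 + 12 * 10 == 129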

def cutedsl_make_indexer(self):
    return _fixed_indexer_cute(self.size, self.stride, self.offset)

FixedLayout.make_indexer = cutedsl_make_indexer  # type: ignore[assignment]
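Since original_make_indexer is saved above, the patch can be kept scoped with a restore, e.g. (a sketch; the context-manager name is hypothetical):

from contextlib import contextmanager

@contextmanager
def cutedsl_indexer_patched():
    # Swap in the colexicographic indexer, then restore the saved original.
    saved = FixedLayout.make_indexer
    FixedLayout.make_indexer = cutedsl_make_indexer  # type: ignore[assignment]
    try:
        yield
    finally:
        FixedLayout.make_indexer = saved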
A reviewer (Contributor) commented:

This gives me a little pause, because I'm not sure what other locations in our codebase encode row-major indexing. Another solution would be to wrap the indexes and expressions in Identity, which would make it easier to do the row-major -> column-major transformation.

But I can't actually find anywhere else we do this, as far as codegen goes.
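For reference, a hypothetical sketch of that alternative, tagging index expressions with an opaque wrapper so a later pass can find and reorder them (this Identity is a stand-in class, not an existing inductor helper):

import sympy

class Identity(sympy.Function):
    # Opaque marker: sympy leaves unknown functions unevaluated, so the
    # wrapped index expression stays identifiable for a later rewrite pass.
    nargs = (1,)

i0, i1 = sympy.symbols("i0 i1")
row_major = Identity(i0 * 15 + i1)
# A codegen pass could match Identity(...) nodes and rewrite them into
# the column-major (colexicographic) form, e.g. i0 * 1 + i1 * 10.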

drisspg added a commit that referenced this pull request Nov 1, 2025
drisspg (Author) commented Nov 1, 2025

> Can you add some tests with views, to be safe?

Good call. This IMAs (illegal memory access), but the indexing looks correct to me; I would expect that DLPack handles the offset.

    prof_result["found"],
    f"Flash attention kernel unexpectedly found when force_flash=False. Kernels: {prof_result['kernel_names']}",
)
# @dtypes(torch.float16, torch.bfloat16)
drisspg (Author):

Uncomment before landing.

drisspg (Author) commented Nov 1, 2025

> Good call. This IMAs (illegal memory access), but the indexing looks correct to me; I would expect that DLPack handles the offset.

The indexing is correct; the problem is that the cache is hitting:

TestFlexFlashCUDA.test_flash_attention_with_mask_mod_buffer_cuda_float16
TestFlexFlashCUDA.test_flash_attention_mask_mod_with_view_buffer_cuda_float16

The only difference between these two tests is that the captured tensor is a slice with the same shape in case 1.

Confirmed this by adding a compile cache clear in between, which fixes the test.

So concretely, the closure we generate for the mask_mod is:

    @cute.jit
    def mask_mod(b_idx, h_idx, q_idx, kv_idx, aux_tensors):

        in_ptr8 = aux_tensors[0]
        tmp1 = q_idx
        tmp2 = kv_idx
        tmp3 = operator.ge(tmp1, tmp2)
        tmp4 = h_idx
        tmp5 = ssa_to_indexable(tmp4, cutlass.Int32)
        tmp6 = cute.make_fragment(1, cutlass.Float16)
        tmp6[0] = (in_ptr8[tmp5])
        tmp7 = (tmp6.load()).to(cutlass.Float32)
        tmp8 = operator.gt(tmp7, cute.full_like(tmp7, 0))
        mask_mod_output = tmp3 | tmp8

        return mask_mod_output
        
        

The input tensor in aux_tensors has shape and stride assert_size_stride(arg5_1, (4,), (3,)).

We have an identical mask mod, so we produce the same cache key and hit, but now aux_tensors contains a tensor with shape and stride assert_size_stride(arg5_1, (4,), (1,)) -> the contiguous case.

We are creating the tensors with: https://github.com/Dao-AILab/flash-attention/blob/0256114fe2381ab293503219bdd9078de3cd26b3/flash_attn/cute/interface.py#L349C1-L351C91

So no alignment or leading dim is specified.
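A minimal, hypothetical sketch of the collision (cache_key is a stand-in for whatever key FA derives; the point is that a source-only key cannot distinguish the two layouts):

import torch

def cache_key(mask_mod_source: str) -> str:
    # Source-only key: identical for both tests.
    return mask_mod_source

base = torch.randn(4, 3)
sliced = base[:, 0]       # shape (4,), stride (3,): the view/slice case
contig = torch.randn(4)   # shape (4,), stride (1,): the contiguous case

assert sliced.shape == contig.shape
assert sliced.stride() != contig.stride()
# Same closure source => same key, so the second run reuses a kernel
# compiled for the other layout's strides.
assert cache_key("mask_mod_src") == cache_key("mask_mod_src")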

drisspg added commits that referenced this pull request Nov 1, 2025
drisspg (Author) commented Nov 1, 2025

Made a repro: #166789. Landing this, since that makes it easier for the CuteDSL folks to repro, and the issue is unrelated to this PR. I updated the tests to use different mask mods, so we don't hit the cache in FA.

drisspg (Author) commented Nov 1, 2025

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label Nov 1, 2025
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 2 jobs have failed, first few of them are: trunk / win-vs2022-cpu-py3 / build, trunk / win-vs2022-cuda12.8-py3 / build

Details for Dev Infra team. Raised by workflow job.

drisspg added a commit that referenced this pull request Nov 1, 2025
drisspg (Author) commented Nov 1, 2025

@pytorchbot merge

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

etaf pushed a commit to etaf/pytorch-inductor-xpu that referenced this pull request Nov 4, 2025
…ate space (pytorch#166657)

Benchmarks on Hopper:
Note the triton impl is not using max-autotune because I didn't feel like waiting for 90x plots.
[Image: combined_comparison benchmark plot]

Pull Request resolved: pytorch#166657
Approved by: https://github.com/v0i0, https://github.com/mlazos, https://github.com/eellison
ghstack dependencies: pytorch#166359
drizzlezyk pushed a commit to Ascend/pytorch that referenced this pull request Nov 17, 2025
…calars)

Co-authored-by: dilililiwhy <why.wuhuanyu@huawei.com>

TORCH MAIN SYNC: add update_wrapped_number (bugfix to ForwardADWithScalars)
Created-by: dilililiwhy, Merged-by: ascend-robot
References: pytorch/pytorch#160513, pytorch/pytorch#165784, pytorch/pytorch#166657
See merge request: Ascend/pytorch!26081
@github-actions github-actions bot deleted the gh/drisspg/217/head branch December 2, 2025 02:17

Labels

ciflow/inductor, ciflow/trunk, Merged, module: inductor, release notes: nn