
[CuTe] Change the logic of pycute manipulation ops like coalesce, complement from co-lex to lex #162690

Closed
fduwjj wants to merge 7 commits into gh/fduwjj/195/base from gh/fduwjj/195/head

Conversation

@fduwjj (Contributor) commented Sep 11, 2025

Stack from ghstack (oldest at bottom):

PyTorch tensor iteration (.view, contiguous, broadcasting) and NumPy array indexing both follow lexicographic (row-major) order. In lexicographic (lex) order on (i0, i1, …, i{k-1}), the leftmost index has the largest stride and changes slowest, the rightmost index changes fastest, and the last dimension is usually contiguous.

The original pycute, however, is based entirely on co-lex order. After porting its code into PyTorch (with some cosmetic changes), we now make it lex so that it can serve use cases such as DeviceMesh internal bookkeeping.

Changes included in this PR:

  1. Changed all ported APIs, including prefix_product (stride inference, renamed to suffix_product), idx2crd, crd2idx, coalesce, composition, complement, right_inverse, and left_inverse, to make them work in lex order.
  2. Added more unit tests for the APIs above, since the existing tests did not have full coverage.
  3. Fixed a bug inside composition that led to an infinite recursive call.
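The renamed suffix_product can be sketched as follows (a hedged illustration, not the actual pycute code): in lex (row-major) order, each stride is the product of all dimensions to its right, so the last dimension is contiguous.

```python
from typing import Tuple

def suffix_product(shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Infer lex (row-major) strides: each stride is the product of the
    dimensions to its right, so the last dimension gets stride 1."""
    strides, acc = [], 1
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))

print(suffix_product((2, 3, 4)))  # (12, 4, 1)
```

A co-lex prefix_product would instead accumulate from the left, giving (1, 2, 6) for the same shape.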

cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci


pytorch-bot bot commented Sep 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162690

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3fc2173 with merge base 5babb4d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the "oncall: distributed" (Add this issue/PR to distributed oncall triage queue) label Sep 11, 2025
fduwjj added a commit that referenced this pull request Sep 11, 2025
@fduwjj fduwjj requested review from ezyang, fegin and tianyu-l September 11, 2025 05:14
@fduwjj fduwjj changed the title from "[CuTe] Change the logic of coalesce, complement from co to lexico" to "[CuTe] Change the logic of pycute manipulation ops like coalesce, complement from co-lex to lex" Sep 11, 2025
fduwjj added a commit that referenced this pull request Sep 11, 2025


# Exclusive prefix product with output congruent to input a
# Exclusive prefix product with output congruent to input a (lexicographic)
Contributor:

It's not a prefix product anymore right? It's a suffix_product now

Contributor:

It's interesting that you decided to reverse the output. So it's also like a reversed suffix product

Contributor (Author):

Agreed, we should name it suffix_product. No, it is not a reversed suffix product; I just didn't want to do insert(0, suffix_product(a[i], current_init)).

if is_tuple(shape) and is_tuple(stride):  # "int" tuple tuple
    assert len(shape) == len(stride)
    return tuple(idx2crd(idx, s, d) for s, d in zip(shape, stride))
# Process from left to right for lexicographic ordering (opposite of crd2idx)
Contributor:

hmm? The old code processed left to right too??

Contributor:

I think you meant to say "right to left" here.

Contributor:

But you don't actually process right-to-left. It's probably dramatically simpler to go right-to-left

@fduwjj (Author) commented Sep 12, 2025:

You are right; we don't even need to change this line. As long as the order is lex, processing left to right or right to left does not matter much here.
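The thread's conclusion can be checked with a minimal lex-order sketch of idx2crd/crd2idx (illustrative flat-tuple versions, not the ported code): each coordinate depends only on its own shape/stride pair, so the processing direction is irrelevant.

```python
def idx2crd(idx, shape, stride):
    # Each mode is computed independently from its own (shape, stride)
    # pair, so left-to-right vs right-to-left processing is equivalent.
    return tuple((idx // d) % s for s, d in zip(shape, stride))

def crd2idx(crd, stride):
    return sum(c * d for c, d in zip(crd, stride))

shape, stride = (2, 3, 4), (12, 4, 1)  # lex (row-major) strides
assert all(crd2idx(idx2crd(i, shape, stride), stride) == i for i in range(24))
print(idx2crd(13, shape, stride))  # (1, 0, 1)
```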

# replace our shape-1 with anything
elif result_shape[-1] == 1:
    result_shape[-1] = shape
    result_stride[-1] = stride
Contributor:

Why did we lose this case?

Contributor:

So it looks like the way the old code worked was by "absorbing" new shape/stride into shape-1 as we iterate over the shape/stride. Because you didn't change the iteration order above (you are still going left-to-right), this strategy no longer works and you had to invert everything. It does seem plausible you made enough changes to make it work, but there's probably a simpler way to do it.

result_stride.pop()
prev_shape = result_shape.pop()
result_shape.append(shape * prev_shape)
result_stride.append(stride)
Contributor:

Yeah, it doesn't feel like it should look like this. Instead, it feels like we should have processed shape, stride in reversed order at the loop above
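The reviewer's right-to-left suggestion can be sketched concretely. This is a hedged, simplified coalesce for flat shape/stride tuples (illustrative only, not the PR's implementation): processing modes from the right, a new mode merges into the current head whenever its stride equals the head's shape times the head's stride, i.e. the two modes are contiguous in lex order.

```python
def coalesce(shape, stride):
    """Merge adjacent modes in lex order, processing right to left."""
    out_shape, out_stride = [shape[-1]], [stride[-1]]
    for s, d in zip(reversed(shape[:-1]), reversed(stride[:-1])):
        if out_shape[0] == 1:                     # shape-1 head absorbs anything
            out_shape[0], out_stride[0] = s, d
        elif s == 1:                              # skip shape-1 modes
            continue
        elif d == out_shape[0] * out_stride[0]:   # contiguous: merge into head
            out_shape[0] *= s
        else:                                     # not mergeable: new head
            out_shape.insert(0, s)
            out_stride.insert(0, d)
    return tuple(out_shape), tuple(out_stride)

print(coalesce((2, 3, 4), (12, 4, 1)))  # ((24,), (1,))
```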

elif is_int(layout):
    return Layout(layout)
return right_inverse(make_layout(layout, complement(layout)))  # type: ignore[arg-type]
return right_inverse(make_layout(complement(layout), layout))  # type: ignore[arg-type]
Contributor:

What's going on here?

@fduwjj (Author) commented Sep 12, 2025:

Same answer as for right_inverse: I found that as long as we swap the order, i == inv_layout(layout(i)) holds in the lex case. You might ask for a mathematical proof; I don't have one, to be honest, and since we don't use these ops yet, I decided to cut corners here. Pycute has two unit tests, test/distributed/_pycute/test_left_inverse.py and test/distributed/_pycute/test_right_inverse.py, which test this logic pretty thoroughly IIUC.

current_idx = shape * stride

result_shape.reverse()
result_stride.reverse()
Contributor:

This is also surprising

Contributor:

TBF I guess we don't care that much about these, as we have no use of them right now

Contributor (Author):

Yes, you are right, but this change makes CI happy (passing unit tests) in the lexicographic case.

Contributor:

Just need to make sure the UTs aren't actually explicitly checking for colex.

Contributor (Author):

OK, one more point: the definition of right_inverse is that an index maps back to itself under layout(inv_layout(i)), and the unit tests are pretty thorough on this case. So I think with this change it will work in the lex case, because the crd-to-idx mapping now uses lex order.
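The layout(inv_layout(i)) == i property being discussed can be checked numerically. Below is a hedged sketch (a brute-force inverse table, not the PR's right_inverse algorithm): a layout is treated as a function from a lex linear index to a memory offset, and the round-trip is verified on a small bijective layout.

```python
def lex_strides(shape):
    # Row-major (lex) strides: suffix products of the shape.
    strides, acc = [], 1
    for s in reversed(shape):
        strides.append(acc)
        acc *= s
    return tuple(reversed(strides))

def layout_fn(shape, stride):
    """A layout as a function: lex linear index -> memory offset."""
    lex = lex_strides(shape)
    def f(i):
        crd = tuple((i // d) % s for s, d in zip(shape, lex))
        return sum(c * d for c, d in zip(crd, stride))
    return f

# A permuted 3x4 layout: offsets i*1 + j*3 cover [0, 12) exactly once.
A = layout_fn((3, 4), (1, 3))
inv = {A(i): i for i in range(12)}  # brute-force right-inverse table
assert all(A(inv[x]) == x for x in range(12))
```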


# Reverse the lists: we build them back-to-front, since appending is cheaper than inserting at the front.
result_shape.reverse()
result_stride.reverse()
Contributor:

Everything else here looks plausible but I am going to have to sweat the math lol

Contributor (Author):

Sure. Basically we just do it in the reverse order from what is described here: https://docs.nvidia.com/cutlass/media/docs/cpp/cute/02_layout_algebra.html

And I tried to read the proof in https://leimao.github.io/article/CuTe-Layout-Algebra/, for example Definition 2.13 (Composition, Restricted Case), from right to left. At the end of the day the divisor is the same; only the division order changes.

@ezyang (Contributor) left a comment:

While I think some of the changes here can be done more simply, it does look plausible. Not sure if you think the test coverage is good enough, since you certainly had to add more tests!

@fduwjj (Author) commented Sep 11, 2025

@ezyang thanks for the really fast review. Let me see if we can make the parts you commented on simpler. I can also ask an LLM (in another PR) to generate more unit tests for this code.

def prefix_product(a: IntTuple, init: IntTuple = 1) -> IntTuple:
    if is_tuple(a):
        if is_tuple(init):  # tuple tuple
            assert len(a) == len(init)
Collaborator:

With all these length asserts, may want to create a zip_strict wrapper. I think we already have one in the codebase somewhere

Contributor (Author):

I can add a TODO here so we can do the cosmetic change in another PR.

fduwjj added a commit that referenced this pull request Sep 12, 2025
@fduwjj fduwjj added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 12, 2025
@fduwjj (Author) commented Sep 16, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


pytorchmergebot pushed a commit that referenced this pull request Sep 16, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Pull Request resolved: pytorch#162690
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#162413, pytorch#162534, pytorch#162414
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, oncall: distributed (Add this issue/PR to distributed oncall triage queue), release notes: DeviceMesh


4 participants