
Conversation

@wconstab (Contributor) commented Jul 14, 2025

Stack from ghstack (oldest at bottom):

The local_tensor input to grouped_mm has a stride requirement.

(See `_meta_grouped_mm_common` in meta_registrations.py or
`check_valid_strides_and_return_transposed` in native/cuda/Blas.cpp.)

Don't allow sharding a tensor if its shape would result in an
incompatible local_tensor stride.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k
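
For context, here is a rough sketch of the kind of layout check involved. It is paraphrased from memory rather than copied from `_meta_grouped_mm_common` / `check_valid_strides_and_return_transposed`, so the helper name, exact bounds, and alignment handling are illustrative:

```python
import torch

def grouped_mm_strides_ok(
    shape: tuple[int, ...], stride: tuple[int, ...], dtype: torch.dtype
) -> bool:
    # Roughly: one of the last two dims must be contiguous (stride 1), and the
    # stride of the other one must be a multiple of 16 bytes / element_size.
    end = len(shape) - 1
    alignment = 16 // torch.empty(0, dtype=dtype).element_size()
    if stride[end] == 1 and stride[end - 1] >= max(1, shape[end]):
        return stride[end - 1] % alignment == 0  # row-major-like layout
    if stride[end - 1] == 1 and stride[end] >= max(1, shape[end - 1]):
        return stride[end] % alignment == 0  # column-major-like layout
    return False

# With bf16 (alignment = 8 elements), a contiguous local shard whose last dim
# ends up with size 10 has strides (10, 1) and fails the alignment check.
print(grouped_mm_strides_ok((64, 16), (16, 1), torch.bfloat16))  # True
print(grouped_mm_strides_ok((64, 10), (10, 1), torch.bfloat16))  # False
```

A sharding whose local shape produces strides that fail this kind of check is what this PR filters out when enumerating strategies.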

pytorch-bot bot commented Jul 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158245

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 1 Pending, 1 Unrelated Failure

As of commit d56e755 with merge base 0d17029:

UNSTABLE - One job is marked as unstable, possibly due to flakiness on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jul 14, 2025
wconstab added a commit that referenced this pull request Jul 14, 2025
ghstack-source-id: 8e05bf9
Pull Request resolved: #158245
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jul 14, 2025
@wconstab wconstab added the release notes: distributed (dtensor) release notes category label Jul 14, 2025
@XilunWu (Contributor) left a comment

LGTM, but I have some suggestions and a few points of confusion to clarify.

meta.shape, mesh, placements
)
return local_shape, local_stride, meta.dtype

Contributor:

local_shape_stride -> compute_local_tensor_meta? And return TensorMeta instead of a Tuple.

Contributor Author (wconstab):

fixed

)

def valid_grouped_mm_strides(
input_specs: list[DTensorSpec], output_specs: tuple[Optional[DTensorSpec], ...]
Contributor:

Just a style preference: have mat_a_spec and mat_b_spec instead of the input_specs list, which is clearer to me. WDYT?

Contributor Author (wconstab):

No can do. This function has to have a generic signature that is not specific to grouped_mm, because its signature is defined by the API of expand_to_full_mesh_op_strategy, which is a generic util that can be used by any op.

dtype: torch.dtype,
new_local_stride: tuple[int, ...],
) -> bool:
# copied from `_meta_grouped_mm_common` in meta_registrations.py
Contributor:

Any reason we prefer replicating the function over calling it?

Contributor Author (wconstab):

I can't literally call the checker because it's buried inside a meta-fn, but perhaps I should just refactor that one so I can.

I also considered a more direct approach: create actual meta tensors for the local tensors, then call the grouped_mm meta function with them. I could do this under a try/except, and if it throws any error I'd call it an invalid sharding. This seems better in a way, so I might try that and see if it is easy to do.
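
For illustration, a minimal sketch of that alternative, under the assumption that the grouped-mm op is exposed as `torch._grouped_mm` with an `offs` argument (names and signature written from memory and may differ by PyTorch version; this is not the PR's code):

```python
import torch

def sharding_accepted_by_grouped_mm(a_shape, a_stride, b_shape, b_stride, dtype, num_groups):
    # Build meta tensors with the candidate local shapes/strides and let the
    # grouped_mm meta kernel do the validation; any error is treated as an
    # invalid sharding (bad strides, shapes, or dtype).
    a = torch.empty_strided(a_shape, a_stride, dtype=dtype, device="meta")
    b = torch.empty_strided(b_shape, b_stride, dtype=dtype, device="meta")
    offs = torch.empty(num_groups, dtype=torch.int32, device="meta")
    try:
        torch._grouped_mm(a, b, offs=offs)
        return True
    except Exception:
        return False
```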

Contributor:

> create actual meta tensors for the local tensors, then call the grouped_mm meta with them

It could work. But I'm also okay with landing this approach first and then revising if that works out.

Comment on lines +244 to +246
is_valid_strategy_cb: Optional[
Callable[[list[DTensorSpec], tuple[Optional[DTensorSpec], ...]], bool]
] = None,
Contributor:

I don't like that we restrict the callable type to be list[DTensorSpec], tuple[...] -> bool. Do we have to specify the argument types and return type? Can't we just use the type hint Optional[Callable]? If the type checker complains, I'd prefer something like list[Any/object] for the argument types.

Contributor Author (wconstab):

Uh.. I think this is a perfect example of a case where it is important to restrict the types of the function. Why? Because more than one user can write a different callback function, but they must all share the same type signature, so it should be well defined as an API.

The idea here is to define an API that gives enough information to let the per-operator callback make its choices. If you think more info is needed, then I think we should add it to the type signature explicitly. If we find out later that more info is needed, we can add it to all the callbacks that have been implemented by that time.
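
As an illustration of that contract, a callback with the `list[DTensorSpec], tuple[Optional[DTensorSpec], ...] -> bool` signature could look like the sketch below. The rejection rule is a made-up placeholder (not the grouped_mm check), and the private `DTensorSpec` import path may vary between PyTorch versions:

```python
from typing import Optional

from torch.distributed.tensor import Shard
from torch.distributed.tensor._dtensor_spec import DTensorSpec  # private; path may vary

def reject_last_dim_sharding(
    input_specs: list[DTensorSpec],
    output_specs: tuple[Optional[DTensorSpec], ...],
) -> bool:
    # Placeholder rule: drop any strategy that shards the last dim of an input.
    for spec in input_specs:
        if spec.tensor_meta is None:
            continue  # no shape/stride info available to check against
        last_dim = len(spec.tensor_meta.shape) - 1
        if any(isinstance(p, Shard) and p.dim == last_dim for p in spec.placements):
            return False
    return True
```

An op strategy would then pass this as `is_valid_strategy_cb=reject_last_dim_sharding` to `expand_to_full_mesh_op_strategy`.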

spec_list: list[Optional[DTensorSpec]] = []
for specs in zip(*strategy_comb):
    if specs[0] is not None:
        # TODO: we should fill in tensor_meta here. If nothing else, it helps the filter strategy callback
Contributor:

cc @zpcore

Comment on lines +284 to +288
else:
    if spec_list[0] is not None:
        output_specs = spec_list[0]  # type: ignore[assignment]
    else:
        raise RuntimeError("output spec is None")
Contributor:

Can/should input_index < 1 be legitimate? IMO it means that there's no output, which contradicts the `if spec_list[0] is not None` branch, where the first spec is treated as output_spec.

Contributor Author (wconstab):

FYI I just moved this logic from below, so it is pre-existing logic you are complaining about.

That said, I agree it's not well written:

1. `input_index > 1` is checked above, but `== 1` is not. That means the else branch has to handle both `== 0` and `== 1` (and perhaps also `< 1`, which would be bad). It could be written more explicitly, as sketched below.
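
For concreteness, a more explicit version of that branch might look like the following fragment (illustrative only, not code from this PR):

```python
if input_index > 1:
    # multiple outputs: the first `input_index` specs are the output specs
    output_specs = tuple(spec_list[:input_index])
elif input_index == 1:
    # single output: the first spec is the output spec and must be present
    if spec_list[0] is None:
        raise RuntimeError("output spec is None")
    output_specs = spec_list[0]
else:
    # input_index < 1 would mean "no outputs", which is not supported
    raise RuntimeError(f"unexpected input_index: {input_index}")
```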

Contributor:

I see, thanks for clarifying!

wconstab added a commit that referenced this pull request Jul 14, 2025
ghstack-source-id: 14ca70f
Pull Request resolved: #158245
# check inputs shardable
inputs_shardable = all(
output_specs: tuple[Optional[DTensorSpec], ...]
if input_index > 1:
Member:

Not related to this PR, but I asked AI to figure out what input_index is used for :) A comment would be helpful.

Contributor Author (wconstab):

Ask and you shall receive.
(Updated the docstring for expand_to_full_mesh_op_strategy to cover this.)
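
The updated docstring itself isn't quoted in this thread. As I understand it, each strategy combination lists output placements first and input placements after, and input_index marks where the inputs start (i.e. the number of outputs). A toy illustration under that assumption:

```python
# Placeholder strings stand in for DTensorSpecs in one strategy combination.
spec_list = ["out_spec", "mat_a_spec", "mat_b_spec"]
input_index = 1  # single-output op, so inputs start at index 1

output_specs = spec_list[:input_index]  # specs describing the op's output(s)
input_specs = spec_list[input_index:]   # specs describing the op's inputs
print(output_specs, input_specs)  # ['out_spec'] ['mat_a_spec', 'mat_b_spec']
```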

)

def valid_grouped_mm_strides(
input_specs: list[DTensorSpec], output_specs: tuple[Optional[DTensorSpec], ...]
Member:

output_specs is not used.

Member:

I see, you are trying to match the is_valid_strategy_cb pattern, though it is not used here.

Contributor Author (wconstab):

Correct, I added it to the signature because I thought it might be useful for some other ops, even though I did not use it for this op.

@zpcore (Member) left a comment

LGTM!

wconstab added a commit that referenced this pull request Jul 14, 2025
ghstack-source-id: 468444c
Pull Request resolved: #158245
@XilunWu (Contributor) left a comment

Thanks for addressing my question, the PR looks good to me.

@wconstab (Contributor Author) commented:

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 14, 2025
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-actions github-actions bot deleted the gh/wconstab/425/head branch August 15, 2025 02:20
Labels

ciflow/inductor, ciflow/trunk, Merged, oncall: distributed, release notes: distributed (dtensor)
