[Dynamo x FSDP][2/x] Small changes to distributed to make it dynamo friendly #106886
Conversation
…riendly [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/106886
Note: Links to docs will display an error until the docs builds have been completed.
✅ 4 Unrelated Failures
As of commit 44907c4: UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
registry_key = getattr(module, REGISTRY_KEY, None)
if registry_key is None:
    default_registry: Dict[str, RegistryItem] = OrderedDict()
    setattr(module, REGISTRY_KEY, default_registry)
    return default_registry
else:
    return registry_key
Changed because setdefault NYI
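The rewrite above can be sketched stand-alone. `FakeModule`, `get_registry`, and the key name below are hypothetical stand-ins for the real FSDP code, assuming only that the registry lives as an ordinary attribute on the module:

```python
from collections import OrderedDict
from typing import Dict

REGISTRY_KEY = "_registry"  # hypothetical key name, not the real one


class FakeModule:
    """Plain stand-in for nn.Module; only attribute storage matters here."""


def get_registry(module) -> Dict[str, object]:
    # Dynamo-friendly expansion of module.__dict__.setdefault(...):
    # an explicit getattr-with-default plus a branch traces cleanly,
    # whereas dict.setdefault was not yet implemented in Dynamo.
    registry = getattr(module, REGISTRY_KEY, None)
    if registry is None:
        registry = OrderedDict()
        setattr(module, REGISTRY_KEY, registry)
    return registry


m = FakeModule()
assert get_registry(m) is get_registry(m)  # second call reuses the registry
```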
-return _module_state_mapping.get(module, None)
+if module in _module_state_mapping:
+    return _module_state_mapping[module]
+else:
+    return None
get w/ default - NYI
this one LGTM
The new code is worse perf though, you do the dict lookup twice, whereas with .get() you only need to do it once
 class GroupMember(metaclass=_WorldMeta):
-    NON_GROUP_MEMBER = object()
+    NON_GROUP_MEMBER = -100
comparison w/ object NYI - -100 is spiritually okay (any identity)
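A quick illustration of why the swap works, and of the caveat the "spiritually okay" remark glosses over (no legitimate value may ever collide with -100):

```python
# An object() sentinel matches only by identity; Dynamo could not model
# that comparison at the time. An int sentinel matches by value instead,
# which is safe only because no real rank value can equal -100.
OBJ_SENTINEL = object()
INT_SENTINEL = -100

assert object() != OBJ_SENTINEL      # a fresh object never equals the sentinel
assert OBJ_SENTINEL == OBJ_SENTINEL  # only the sentinel object itself matches
assert (-100) == INT_SENTINEL        # any -100, from anywhere, matches
```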
hmm.. yes, looks ok 👍
This one also seems simple to fix directly in Dynamo
# TODO(voz): Don't graph break on this
warnings.warn(
    "An unexpected prefix is detected. This case "
    " should only happen when using DMP with FSDP. "
    f"prefix = {prefix}, "
    f"submodule_name = {submodule_name}"
)
warn NYI
wconstab
left a comment
mostly LGTM. couple of questions
| """ | ||
| default_registry: Dict[str, RegistryItem] = OrderedDict() | ||
| return module.__dict__.setdefault(REGISTRY_KEY, default_registry) # type: ignore[call-overload] | ||
| registry_key = getattr(module, REGISTRY_KEY, None) |
nit, but IIUC the LHS var here would be more aptly named registry, as it is 'the registry' itself, accessed by using REGISTRY_KEY in the module's dict
agreed, good nit
registry_key = getattr(module, REGISTRY_KEY, None)
if registry_key is None:
    default_registry: Dict[str, RegistryItem] = OrderedDict()
    setattr(module, REGISTRY_KEY, default_registry)
Does your new version go through module's setattr method, whereas the previous one didn't? I am not sure if that's important, but sometimes subtle bugs creep in that way
I think they are identical, will test.
This will make perf worse: getattr/setattr is ordinarily slower than a dict access, and nn.Module has a fairly complicated __setattr__ handler that makes it worse. getattr access on nn.Module is famously slow (remember the conversation about CSE?)
I'm also just not that keen on this change; we should be in the business of modeling __dict__ properly...
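The __setattr__ question raised in this thread can be demonstrated with a toy class standing in for nn.Module's custom handler: `setattr` routes through the class's `__setattr__`, while writing via `__dict__` bypasses it entirely.

```python
class Tracked:
    """Toy stand-in for a class with a custom (and costly) __setattr__."""

    def __init__(self):
        object.__setattr__(self, "calls", 0)

    def __setattr__(self, name, value):
        # count every attribute write that goes through the handler
        object.__setattr__(self, "calls", self.calls + 1)
        object.__setattr__(self, name, value)


t = Tracked()
setattr(t, "a", 1)              # routed through __setattr__
t.__dict__.setdefault("b", 2)   # direct dict write: handler never runs
assert t.calls == 1
assert t.a == 1 and t.b == 2
```

So the two spellings are not identical: switching from `__dict__.setdefault` to `setattr` starts invoking the handler.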
 if _rank_not_in_group(pg):
     raise RuntimeError("Invalid process group specified")
-pg_store = _world.pg_map.get(pg, None)
+pg_store = _world.pg_map[pg] if pg in _world.pg_map else None
ok
| f"submodule_name = {submodule_name}" | ||
| ) | ||
| if ( | ||
| not torch.distributed._functional_collectives.is_torchdynamo_compiling() |
nit: where should we move this util? seems like it will take on a bigger life than functional_collectives
Not there, that's for sure! There's 2 options I think:

1. A top-level distributed-only flag in utils or the top-level module __init__, like torch.distributed.utils.is_compiling().
2. Remove it entirely from distributed and move it to a torch-level util, since it's a generally useful "import -> check flag" pattern that other modules will want as well.
yea, i'd be in favor of 2. i am wary of import cycles, that may be the very reason we (I?) put it there in the first place. So probably don't try to do it in this PR unless you're feeling lucky :)
if (
    not torch.distributed._functional_collectives.is_torchdynamo_compiling()
):
    # TODO(voz): Don't graph break on this
do you have a plan to not graph break on warn? (not in this PR, what you did here looks good for now)
Kind of - it's the same as print. The TODO is perhaps too promise-y? There are a few strategies we can do here.
I like to file an issue and then link to the issue, makes it easier to track
    with no_dispatch():
        tensor.record_stream(stream)
else:
    tensor.record_stream(stream)
is no_dispatch just a problematic graph-break under compile? or something else?
what does compile do with tensor.record_stream anyway?
The no_dispatch was added in #88014 cc @fegin
Looking over the PR, it looks like this is because we don't actually support Stream arguments in torch dispatch, so it just chokes. If Dynamo is able to answer whether any torch dispatch modes are active (it should answer False), a better version of this would be to check for active modes before disabling dispatch.
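A minimal sketch of that suggestion, with toy stand-ins (`dispatch_guard`, `fake_no_dispatch`) rather than the real torch APIs:

```python
from contextlib import contextmanager, nullcontext


def dispatch_guard(modes_active: bool, no_dispatch_factory):
    # Take the escape hatch only when a dispatch mode is actually live;
    # otherwise use a no-op context, so under compile (where the answer
    # should be False) no_dispatch is never entered at all.
    return no_dispatch_factory() if modes_active else nullcontext()


entered = []


@contextmanager
def fake_no_dispatch():  # toy stand-in for torch's no_dispatch
    entered.append(True)
    yield


with dispatch_guard(False, fake_no_dispatch):
    pass
assert entered == []       # no mode active: escape hatch untouched

with dispatch_guard(True, fake_no_dispatch):
    pass
assert entered == [True]   # mode active: real no_dispatch used
```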
# Check that all ranks plan to all-gather the same index parameters
for (r1, i1), (r2, i2) in itertools.combinations(
    (
if not torch.distributed._functional_collectives.is_torchdynamo_compiling():
is this error-checking logic really safe to skip under compile?
in the practical sense, probably. but technically we're supposed to throw the same set of errors in PT2 as eager.
It's not amazing to skip; this also plagued me. One thing we can do is potentially pull this out into an op. It just didn't meet the importance bar for the first MVP.
I agree that this is not important for the MVP. I do not know of any case where this error was raised (but that is also a good thing).
ok.
expect_true will work here too
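A sketch of keeping the check as plain runtime assertions (the expect_true idea: it is pure error-checking, so it can fire at runtime rather than become traced control flow). The helper name and plan shape below are hypothetical:

```python
import itertools


def check_ranks_agree(plans):
    # Check that all ranks plan to all-gather the same index parameters.
    # Pure error-checking: nothing downstream depends on the branch, so
    # under compile this can be deferred to a runtime assert instead of
    # causing a graph break.
    for (r1, i1), (r2, i2) in itertools.combinations(enumerate(plans), 2):
        assert i1 == i2, f"rank {r1} plans {i1} but rank {r2} plans {i2}"


check_ranks_agree([(0, 1, 2)] * 3)  # all ranks agree: no error

mismatch_caught = False
try:
    check_ranks_agree([(0, 1), (0, 2)])
except AssertionError:
    mismatch_caught = True
assert mismatch_caught
```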
# We don't run a even queue for freeing under torch compile atm
# But maybe we need to? TODO(voz): Look into this
i think its fine to diverge here, insofar as compile captures a coherent program that it can optimize. e.g. I don't care if compile uses the same sets of streams/events or if it totally ignores eager prefetching and does its own prefetching.
but I care that compiling fsdp doesn't fall on its face if we turn prefetching on and some other code hangs waiting for an event that will never be enqueued.
awgu
left a comment
The changes look good to me!
…it dynamo friendly" [ghstack-poisoned]
wconstab
left a comment
LGTM if you fix the nits, thanks!
…it dynamo friendly" [ghstack-poisoned]
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
…it dynamo friendly" [ghstack-poisoned]
Successfully rebased
    free_event.record()
    state._free_event_queue.enqueue(free_event)
if not torch.distributed._functional_collectives.is_torchdynamo_compiling():
    # We don't run a even queue for freeing under torch compile atm
s/even/event/?
    )
if not torch.distributed._functional_collectives.is_torchdynamo_compiling():
    # TODO(voz): Don't graph break on this - dynamo hates the n1 != n2
    # tensor comparison control flow.
You kind of want something like expect_true here; you can defer the equality check to runtime because it's purely for error checking.
@awgu calls the shots here, my comments are non blocking

These are good comments, I will address
…it dynamo friendly" [ghstack-poisoned]
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):