Support DDP ignored parameters in DDPOptimizer #88460
DDP._set_params_and_buffers_to_ignore_for_model(m, parameters_to_ignore)
ddp_m = DDP(m, device_ids=self.device_ids, bucket_cap_mb=25)
parameter_ids_to_ignore = [
    id(ddp_m.module.get_parameter(p))
So, this seems better than the hacky FQN/mangled-name approach. But is it totally reliable? I wondered if there could be edge cases, or if dynamo could make copies, etc.
torch/_dynamo/eval_frame.py
parameter_ids_to_ignore=[
    id(ddp_module.module.get_parameter(p))
    for p in ddp_module.parameters_to_ignore
],
If these parameters are id-stable, why not just annotate them directly? At the DDP level, whenever I add to parameters_to_ignore it should be something like:
def mark_parameter_as_ignored(module, name):
    params = dict(module.named_parameters())
    assert name in params
    ignored_parameter_list.append(name)
    params[name]._ignored = True
And then you don't need to leak your bookkeeping of ignored_parameter_list anywhere else (you could even get rid of it, potentially).
And in dynamo, you would just do:
if p.requires_grad and not getattr(p, "_ignored", False):
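To make the suggestion concrete, here is a small, self-contained sketch; the function name and the _ignored attribute are illustrative of the proposal above, not actual Dynamo or DDP code:

```python
import torch.nn as nn

def trainable_parameters(module: nn.Module):
    # Skip parameters explicitly marked as ignored, in addition to the
    # usual requires_grad filter.
    return [
        p
        for p in module.parameters()
        if p.requires_grad and not getattr(p, "_ignored", False)
    ]

m = nn.Linear(4, 4)
m.bias._ignored = True  # mark a parameter directly, as proposed above

kept = {id(p) for p in trainable_parameters(m)}
assert id(m.weight) in kept
assert id(m.bias) not in kept
```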
Thanks @voznesenskym, I think that's a great idea.
cc @aazzolini @mrshenli any issues with this approach?
SGTM. This is also what we are proposing for the new annotation-based API: https://fb.quip.com/bpvPA6f2dtrA
The only thing is that we might want this at the parameter level (instead of the module level) for DDP to have parity.
> The only thing is that we might want this at the parameter level (instead of the module level) for DDP to have parity.

I thought it already was at the parameter level? See my latest code; I think it's what you want.
But now I'm confused: in DDPOptimizer I simply ignore all buffers, since I thought the implication was that they never require grad and thus wouldn't be allreduced by DDP. If some buffers do get allreduced by DDP, then I'd want to follow this up with another PR that tests buffers and gets that behavior right.
For now I've marked both params and buffers that are in the parameters_and_buffers_to_ignore list with the same marker on the DDP side, since that seems consistent with the convention there.
@albanD says we can't rely on parameter ids being stable: that mostly works, but there are a few edge cases where it doesn't. In particular, reparametrization cannot always preserve the original parameter id.
What does 'reparametrization' mean exactly?
I'm thinking it might be worth sticking with the current approach (marking params), as it is simple and the consequences of getting it wrong are relatively minor (graph breaks wouldn't exactly match DDP's buckets, so perf would degrade anywhere from a little bit to matching dynamo+DDP without graph breaks).
But if there is another scheme that is not too complex, I'd be open to it.
Reparametrization is when you register a rule to recompute a parameter every time before it's used: https://pytorch.org/tutorials/intermediate/parametrizations.html
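For reference, a minimal sketch of why this breaks identity-based bookkeeping, using the parametrization API from the linked tutorial (the Symmetric parametrization is adapted from that page):

```python
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Symmetric(nn.Module):
    # Recompute a symmetric weight from the underlying tensor on every access.
    def forward(self, X):
        return X.triu() + X.triu(1).transpose(-1, -2)

m = nn.Linear(4, 4)
weight_id_before = id(m.weight)
parametrize.register_parametrization(m, "weight", Symmetric())

# m.weight is now a freshly computed tensor rather than the original
# Parameter object, so tracking parameters by id(m.weight) no longer works.
print(id(m.weight) == weight_id_before)  # False
```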
OK, I think I want to propose that we just land this as is:
- I'm not sure any of today's users of DDP's ignored-parameters flag are also using parametrizations.
- It wouldn't be catastrophic if ignored_parameters were not honored in DDPOptimizer (that is the de facto behavior today).
- We could potentially revisit this later.
Also, I'm curious: does dynamo+AOT handle parametrization on its own currently?
for name, param in module.named_parameters():
    if name in params_and_buffers_to_ignore:
        param._ddp_ignored = True
for name, buffer in module.named_buffers():
    if name in params_and_buffers_to_ignore:
        buffer._ddp_ignored = True
@mrshenli I thought buffers by definition do not require grad, and therefore DDP ignores them by default?
If not, I should update the logic in DDPOptimizer accordingly.
Buffers are by default broadcast right before the forward pass if broadcast_buffers=True is passed to the DDP constructor (and it is True by default). But if a buffer appears in the ignored parameters/buffers list, it's not part of that broadcast.
I think buffers shouldn't count for the purposes of splitting the model, since we're not syncing them after the backward pass.
OK, this makes sense. Then my change is fine: we mark them, but we still ignore them in DDPOptimizer.
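To make the buffer behavior above concrete, a minimal single-process sketch (hypothetical example; the gloo group, address, and port exist only so DDP can be constructed locally and are not part of this PR):

```python
import os
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group so DDP can be constructed for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.BatchNorm1d(8)  # has buffers: running_mean, running_var

# Buffers named here are skipped by the pre-forward broadcast; buffers are
# never part of the gradient allreduce either way, so DDPOptimizer's bucket
# splitting can keep ignoring them.
DDP._set_params_and_buffers_to_ignore_for_model(
    model, ["running_mean", "running_var"]
)
ddp_model = DDP(model, broadcast_buffers=True)  # True is also the default

dist.destroy_process_group()
```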
buckets[0].size += p.storage().nbytes()
# TODO correct FQ name?
- buckets[0].params.append(f"{node}_{name}")
+ buckets[0].params.append(f"{node.target}_{name}")
Do we still need this?
It is just useful for visualization purposes; see the debug output in the next PR in this stack. The buckets table is printed using this string.
aazzolini left a comment:
Let's land as is, but can you please add plenty of comments in the code explaining where the logic breaks down and how we could solve it?
@pytorchbot merge -f "Flaky CI, no gpus available on gpu runner"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Pull Request resolved: pytorch#88460 Approved by: https://github.com/aazzolini
Stack from ghstack (oldest at bottom):
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx