@wconstab (Contributor) commented Nov 4, 2022

Stack from ghstack (oldest at bottom):

Performance benchmarks on 6 popular models (hf_Bert, hf_T5_large, hf_T5, hf_GPT2_large, timm_vision_transformer, resnet50) from 1-64 GPUs compiled with
torchinductor show performance gains or parity with eager, and show
regressions without DDPOptimizer. *Note: resnet50 with a small batch size shows a regression with the optimizer, in part due to failing to compile one subgraph because of input mutation; this will be fixed.

Correctness checks are implemented in CI (test_dynamo_distributed.py),
via single-gpu benchmark scripts iterating over many models
(benchmarks/dynamo/torchbench.py, timm_models.py, huggingface.py),
and via [multi-gpu benchmark scripts in torchbench](https://github.com/pytorch/benchmark/tree/main/userbenchmark/ddp_experiments).

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire
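For anyone comparing against the previous behavior: the benchmark scripts toggle this with a `--no-optimize-ddp` flag, and the equivalent knob in user code is a dynamo config flag. A minimal sketch, assuming the flag is named `optimize_ddp` (inferred from the benchmark flag; check `torch._dynamo.config` in your build):

```python
import torch._dynamo

# DDPOptimizer (graph splitting at DDP bucket boundaries) is on by default
# after this PR. Flip the config flag to fall back to compiling the whole
# graph as one unit, e.g. to reproduce the "no_optimize" benchmark rows.
# NOTE: the flag name `optimize_ddp` is assumed from --no-optimize-ddp.
torch._dynamo.config.optimize_ddp = False
```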

@pytorch-bot (bot) commented Nov 4, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88523

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failure

As of commit bba6c0a:

The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Nov 4, 2022
ghstack-source-id: 0d774d1
Pull Request resolved: #88523
@wconstab wconstab added the release notes: distributed (ddp) release notes category label Nov 4, 2022
@soumith (Contributor) commented Nov 4, 2022

Can you link to the spreadsheet of the benchmarks' performance numbers here?

@davidberard98 (Contributor) commented Nov 4, 2022

Results on 6 models (hf_Bert, hf_GPT2_large, hf_T5, hf_T5_large, resnet50, timm_vision_transformer) are below.

These boxplots compare dynamo+ddp performance to eager+ddp performance; speedup > 1 is good. The data is from ~8 measurements taken on an AWS cluster with 8x A100 per node (i.e. 8 nodes = 64 gpus).

Without DDPOptimizer we see that most models are slower with dynamo than eager:
[boxplot: nov01_without_ddp]

With DDPOptimizer we see that most models are either faster, or slightly slower than eager:
[boxplot: nov01_with_ddpoptimizer]

For internal users: N2750232

@soumith (Contributor) commented Nov 4, 2022

With DDPOptimizer, 1-GPU ResNet50 seems to show a slowdown (below 1.0x). Why is that?

@davidberard98 (Contributor)

@soumith we think this is just noise plus the fact that resnet50 + 1 gpu spends too much time on communication to see a speedup from dynamo (at least on this batch size of 32, which is the torchbench default for resnet50). We re-ran over a larger number of samples (28, compared to 8 in the data above) on only the resnet50 model, results shown below (N2752413). We also ran on batch_size=128, where we do see a speedup.

[plots: resnet50_raw_samples_128, resnet50_raw_samples, resnet50_speedup]
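To make the noise argument concrete: each sample's speedup is the ratio of the eager time to the dynamo time, and a median over more samples is what stabilizes charts like these. A toy sketch with synthetic numbers (illustrative only, not the benchmark data):

```python
import statistics

# Synthetic per-iteration times in ms (illustrative, not measured data)
# for a comms-bound model where compute time roughly matches comms time.
eager_ms  = [102.0, 98.5, 101.2, 99.8, 100.6, 103.1, 97.9, 100.2]
dynamo_ms = [101.5, 99.0, 102.3, 98.7, 100.9, 101.8, 99.4, 100.0]

# Per-sample speedups hover around 1.0, so with only 8 samples the
# median can easily land on either side of 1.0x; more samples tighten it.
speedups = sorted(e / d for e, d in zip(eager_ms, dynamo_ms))
print(f"median speedup: {statistics.median(speedups):.3f}")
```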

@wconstab (Contributor, Author) commented Nov 4, 2022

> with DDPOptimizer and 1-GPU ResNet50 seems to have a slowdown below 1.0x, why is that?

I'm surprised it is so pronounced. Earlier data is here: https://www.internalfb.com/intern/anp/view/?id=2750232

However, we learned two things about resnet:

  • at this model size we saw compute time about matching comms time, leaving us pretty susceptible to noise
  • one subgraph in resnet ends up having an in-place mutation, which currently causes that subgraph to fall back to eager. We lost compute optimization for that subgraph.

We also changed the batch size from 32 to 128 at one point and got lower noise, but I'm not sure which batch size is in these charts.

The compilation failure is something we need to fix at the AOTAutograd layer. It's being tracked in the AOTAutograd 2.0 discussion, and as far as I understand it's better to fix it in the autograd layer than to work around it in the DDPOptimizer layer. (The alternative is that we could try to make DDPOptimizer avoid splits that upset AOTAutograd, but that would complicate DDPOptimizer and also deviate from the buckets used by DDP.)
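To make the "splits that follow DDP's buckets" idea concrete, here is a toy sketch of the greedy bucketing that determines where the graph breaks would land. This is illustrative only, not the real DDPOptimizer code; the 25 MB cap matches DDP's default `bucket_cap_mb`:

```python
def split_into_buckets(param_sizes_bytes, bucket_cap_bytes=25 * 1024 * 1024):
    """Greedily group parameter sizes into buckets, mirroring how DDP
    schedules allreduces and, roughly, where DDPOptimizer places graph
    breaks. Toy sketch only -- not the actual implementation."""
    buckets, current, current_bytes = [], [], 0
    for size in param_sizes_bytes:
        # Close the current bucket once adding this param would overflow it.
        if current and current_bytes + size > bucket_cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets

# Example: four 10 MB "layers" with a 25 MB cap split 2+2, i.e. one
# graph break in the middle of the model.
mb = 1024 * 1024
print(split_into_buckets([10 * mb] * 4))
```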

@wconstab (Contributor, Author) commented Nov 7, 2022

I've run an accuracy/functionality sweep across torchbench, timm_models, huggingface scripts.

The sweep was done on master @ bbaa0637df93292eb372b355f01756437aed3ce9 by running:

```shell
for SUITE in "torchbench" "huggingface" "timm_models"
do
    SCRIPT="benchmarks/dynamo/${SUITE}.py"
    python $SCRIPT --training --accuracy --backend eager --ddp --no-optimize-ddp --output ${SUITE}_eager_ddp_no_optimize.csv
    python $SCRIPT --training --accuracy --backend aot_eager --ddp --no-optimize-ddp --output ${SUITE}_aot_eager_ddp_no_optimize.csv
    python $SCRIPT --training --accuracy --backend aot_eager --ddp --output ${SUITE}_aot_eager_ddp.csv
    python $SCRIPT --training --accuracy --inductor --ddp --no-optimize-ddp --output ${SUITE}_inductor_ddp_no_optimize.csv
    python $SCRIPT --training --accuracy --inductor --ddp --output ${SUITE}_inductor_ddp.csv
done
```

The data is here

| Suite       | Pass | Fail to run | Fail Accuracy | Total Models | Pass Rate |
|-------------|------|-------------|---------------|--------------|-----------|
| Huggingface | 31   | 12          | 1             | 44           | 0.7045    |
| timm_models | 59   | 1           | 2             | 62           | 0.9516    |
| torchbench  | 43   | 9           | 4             | 56           | 0.7679    |
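The pass rates above are simply pass / total; a quick sanity check over the same counts:

```python
# Counts copied from the sweep table: (pass, fail_to_run, fail_accuracy)
sweep = {
    "huggingface": (31, 12, 1),
    "timm_models": (59, 1, 2),
    "torchbench": (43, 9, 4),
}

rates = {}
for suite, (passed, fail_run, fail_acc) in sweep.items():
    total = passed + fail_run + fail_acc
    rates[suite] = passed / total
    print(f"{suite}: {passed}/{total} = {rates[suite]:.4f}")
```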

While the pass rates are not great, only 1 issue in the bunch is unique to turning DDPOptimizer on, and it's an AOTAutograd input-mutation issue that we expect to be fixed by the updated AOTAutograd implementation that handles mutation.

The remainder of the issues (either crashes or accuracy failures) happen either with dynamo-eager+ddp or with dynamo-inductor+ddp, but without using DDPOptimizer. The most common error categories I've sampled are:

  • RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one...
  • RuntimeError: Trying to backward through the graph a second time...
  • torch._dynamo.utils: [ERROR] Accuracy failed for key name logits

plus probably a few other, less common cases.

I therefore propose to land this enablement PR as-is, and file new issues for categories of DDP+dynamo/inductor failures to triage accordingly.
cc @davidberard98 @aazzolini @soumith @ezyang @bdhirsh

@soumith (Contributor) commented Nov 7, 2022

Landing it sounds reasonable.

wconstab added a commit that referenced this pull request Nov 8, 2022
ghstack-source-id: 85a90ba
Pull Request resolved: #88523
wconstab added a commit that referenced this pull request Nov 9, 2022
ghstack-source-id: 488dae5
Pull Request resolved: #88523
@wconstab (Contributor, Author) commented Nov 9, 2022

New accuracy sweep here

@davidberard98 seems close to getting the accuracy checks running in the real perf benchmarks. @bdhirsh is working on getting AotAutograd improvements to a state we can at least confirm these graph-break cases are resolved.

@albanD albanD removed their request for review November 22, 2022 22:37
@anjali411 anjali411 removed their request for review November 28, 2022 14:39
@wconstab (Contributor, Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 28, 2022
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: The following mandatory check(s) failed (Rule superuser):

Dig deeper by viewing the failures on hud


@wconstab (Contributor, Author)

@pytorchbot merge -f "Flaky CI"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

@pytorchmergebot (Collaborator)

Merge failed

Reason: Command `git -C /home/runner/work/pytorch/pytorch rebase origin/master` returned non-zero exit code 1

```
Rebasing (1/1)
Auto-merging benchmarks/dynamo/distributed.py
CONFLICT (content): Merge conflict in benchmarks/dynamo/distributed.py
error: could not apply 531c9ff3f3... Enable DDPOptimizer by default in dynamo (#88523)
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 531c9ff3f3... Enable DDPOptimizer by default in dynamo (#88523)
```

@wconstab (Contributor, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot (Collaborator)

Merge failed

Reason: The following mandatory check(s) failed (Rule superuser):

Dig deeper by viewing the failures on hud


@wconstab (Contributor, Author)

@pytorchbot merge -f "Flaky CI (gpu SIGIOT)"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022

Pull Request resolved: pytorch#88523
Approved by: https://github.com/davidberard98
@facebook-github-bot facebook-github-bot deleted the gh/wconstab/31/head branch June 8, 2023 19:16