@wconstab (Contributor) commented Nov 4, 2022

Stack from ghstack (oldest at bottom):

Performance benchmarks on 6 popular models (hf_Bert, hf_T5_large, hf_T5, hf_GPT2_large, timm_vision_transformer, resnet50) from 1-64 GPUs compiled with
torchinductor show performance gains or parity with eager, and show
regressions without DDPOptimizer. *Note: resnet50 with a small batch size shows a regression with the optimizer, in part due to failing to compile one subgraph because of input mutation; this will be fixed.

Correctness checks are implemented in CI (test_dynamo_distributed.py),
via single-gpu benchmark scripts iterating over many models
(benchmarks/dynamo/torchbench.py, timm_models.py, huggingface.py),
and via [multi-gpu benchmark scripts in torchbench](https://github.com/pytorch/benchmark/tree/main/userbenchmark/ddp_experiments).

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire
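For anyone comparing against the previous behavior: the benchmark scripts toggle this with a `--no-optimize-ddp` flag, and the equivalent knob in user code is a dynamo config flag. A minimal sketch, assuming the flag is named `optimize_ddp` (inferred from the benchmark flag; check `torch._dynamo.config` in your build):

```python
import torch._dynamo

# DDPOptimizer (graph splitting at DDP bucket boundaries) is on by default
# after this PR. Flip the config flag to fall back to compiling the whole
# graph as one unit, e.g. to reproduce the "no_optimize" benchmark rows.
# NOTE: the flag name `optimize_ddp` is assumed from --no-optimize-ddp.
torch._dynamo.config.optimize_ddp = False
```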

@pytorch-bot (bot) commented Nov 4, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88523

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Failure

As of commit bba6c0a:

The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Nov 4, 2022
ghstack-source-id: 0d774d1
Pull Request resolved: #88523
@wconstab wconstab added the release notes: distributed (ddp) release notes category label Nov 4, 2022
@soumith (Contributor) commented Nov 4, 2022

Can you link to the spreadsheet of the benchmarks' performance numbers here?

@davidberard98 (Contributor) commented Nov 4, 2022

Results on 6 models (hf_Bert, hf_GPT2_large, hf_T5, hf_T5_large, resnet50, timm_vision_transformer) are below.

These boxplots compare dynamo+ddp performance to eager+ddp performance; speedup > 1 is good. The data is from ~8 measurements taken on an AWS cluster with 8x A100 per node (i.e. 8 nodes = 64 gpus).

Without DDPOptimizer we see that most models are slower with dynamo than eager:
[boxplot: nov01_without_ddp]

With DDPOptimizer we see that most models are either faster, or slightly slower than eager:
[boxplot: nov01_with_ddpoptimizer]

For internal users: N2750232

@soumith (Contributor) commented Nov 4, 2022

With DDPOptimizer, 1-GPU ResNet50 seems to show a slowdown (below 1.0x). Why is that?

@davidberard98 (Contributor)

@soumith we think this is just noise plus the fact that resnet50 + 1 gpu spends too much time on communication to see a speedup from dynamo (at least on this batch size of 32, which is the torchbench default for resnet50). We re-ran over a larger number of samples (28, compared to 8 in the data above) on only the resnet50 model, results shown below (N2752413). We also ran on batch_size=128, where we do see a speedup.

[plots: resnet50_raw_samples_128, resnet50_raw_samples, resnet50_speedup]
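To make the noise argument concrete: each sample's speedup is the ratio of the eager time to the dynamo time, and a median over more samples is what stabilizes charts like these. A toy sketch with synthetic numbers (illustrative only, not the benchmark data):

```python
import statistics

# Synthetic per-iteration times in ms (illustrative, not measured data)
# for a comms-bound model where compute time roughly matches comms time.
eager_ms  = [102.0, 98.5, 101.2, 99.8, 100.6, 103.1, 97.9, 100.2]
dynamo_ms = [101.5, 99.0, 102.3, 98.7, 100.9, 101.8, 99.4, 100.0]

# Per-sample speedups hover around 1.0, so with only 8 samples the
# median can easily land on either side of 1.0x; more samples tighten it.
speedups = sorted(e / d for e, d in zip(eager_ms, dynamo_ms))
print(f"median speedup: {statistics.median(speedups):.3f}")
```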

@wconstab (Contributor, Author) commented Nov 4, 2022

> with DDPOptimizer and 1-GPU ResNet50 seems to have a slowdown below 1.0x, why is that?

I'm surprised it is so pronounced. Earlier data is here: https://www.internalfb.com/intern/anp/view/?id=2750232

However, we learned two things about resnet:

  • at this model size we saw compute time about matching comms time, leaving us pretty susceptible to noise
  • one subgraph in resnet ends up having an in-place mutation, which currently causes that subgraph to fall back to eager. We lost compute optimization for that subgraph.

We also changed the batch size from 32 to 128 at one point and got lower noise, but I'm not sure which batch size is in these charts.

The compilation failure is something we need to fix at the AOTAutograd layer. It's being tracked in the AOTAutograd 2.0 discussion, and as far as I understand it's better to fix it in the autograd layer than to work around it in the DDPOptimizer layer. (The alternative is that we could try to make DDPOptimizer avoid splits that upset AOTAutograd, but that would complicate DDPOptimizer and also deviate from the buckets used by DDP.)
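To make the "splits that follow DDP's buckets" idea concrete, here is a toy sketch of the greedy bucketing that determines where the graph breaks would land. This is illustrative only, not the real DDPOptimizer code; the 25 MB cap matches DDP's default `bucket_cap_mb`:

```python
def split_into_buckets(param_sizes_bytes, bucket_cap_bytes=25 * 1024 * 1024):
    """Greedily group parameter sizes into buckets, mirroring how DDP
    schedules allreduces and, roughly, where DDPOptimizer places graph
    breaks. Toy sketch only -- not the actual implementation."""
    buckets, current, current_bytes = [], [], 0
    for size in param_sizes_bytes:
        # Close the current bucket once adding this param would overflow it.
        if current and current_bytes + size > bucket_cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets

# Example: four 10 MB "layers" with a 25 MB cap split 2+2, i.e. one
# graph break in the middle of the model.
mb = 1024 * 1024
print(split_into_buckets([10 * mb] * 4))
```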

@wconstab (Contributor, Author) commented Nov 7, 2022

I've run an accuracy/functionality sweep across torchbench, timm_models, huggingface scripts.

The sweep was done on master @ bbaa0637df93292eb372b355f01756437aed3ce9 by running:

```shell
for SUITE in "torchbench" "huggingface" "timm_models"
do
    SCRIPT="benchmarks/dynamo/${SUITE}.py"
    python $SCRIPT --training --accuracy --backend eager --ddp --no-optimize-ddp --output ${SUITE}_eager_ddp_no_optimize.csv
    python $SCRIPT --training --accuracy --backend aot_eager --ddp --no-optimize-ddp --output ${SUITE}_aot_eager_ddp_no_optimize.csv
    python $SCRIPT --training --accuracy --backend aot_eager --ddp --output ${SUITE}_aot_eager_ddp.csv
    python $SCRIPT --training --accuracy --inductor --ddp --no-optimize-ddp --output ${SUITE}_inductor_ddp_no_optimize.csv
    python $SCRIPT --training --accuracy --inductor --ddp --output ${SUITE}_inductor_ddp.csv
done
```

The data is here

| Suite       | Pass | Fail to run | Fail Accuracy | Total Models | Pass Rate |
|-------------|------|-------------|---------------|--------------|-----------|
| Huggingface | 31   | 12          | 1             | 44           | 0.7045    |
| timm_models | 59   | 1           | 2             | 62           | 0.9516    |
| torchbench  | 43   | 9           | 4             | 56           | 0.7679    |
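The pass rates above are simply pass / total; a quick sanity check over the same counts:

```python
# Counts copied from the sweep table: (pass, fail_to_run, fail_accuracy)
sweep = {
    "huggingface": (31, 12, 1),
    "timm_models": (59, 1, 2),
    "torchbench": (43, 9, 4),
}

rates = {}
for suite, (passed, fail_run, fail_acc) in sweep.items():
    total = passed + fail_run + fail_acc
    rates[suite] = passed / total
    print(f"{suite}: {passed}/{total} = {rates[suite]:.4f}")
```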

While the pass rates are not great, only 1 issue in the bunch is unique to turning DDPOptimizer on, and it's an AOTAutograd input-mutation issue that we expect to be fixed by the updated AOTAutograd implementation that handles mutation.

The remainder of the issues (either crashes or accuracy failures) happen either with dynamo-eager+ddp or with dynamo-inductor+ddp, but without using DDPOptimizer. The most common error categories I've sampled are:

  • RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one...
  • RuntimeError: Trying to backward through the graph a second time...
  • torch._dynamo.utils: [ERROR] Accuracy failed for key name logits

plus probably a few other, less common cases.

I therefore propose to land this enablement PR as-is, and file new issues for categories of DDP+dynamo/inductor failures to triage accordingly.
cc @davidberard98 @aazzolini @soumith @ezyang @bdhirsh

@soumith (Contributor) commented Nov 7, 2022

Landing it sounds reasonable.

wconstab added a commit that referenced this pull request Nov 8, 2022
ghstack-source-id: 85a90ba
Pull Request resolved: #88523
wconstab added a commit that referenced this pull request Nov 9, 2022
ghstack-source-id: 488dae5
Pull Request resolved: #88523
@wconstab (Contributor, Author) commented Nov 9, 2022

New accuracy sweep here

@davidberard98 seems close to getting the accuracy checks running in the real perf benchmarks. @bdhirsh is working on getting AotAutograd improvements to a state we can at least confirm these graph-break cases are resolved.

@albanD albanD removed their request for review November 22, 2022 22:37
@anjali411 anjali411 removed their request for review November 28, 2022 14:39
@wconstab (Contributor, Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 28, 2022
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: The following mandatory check(s) failed (Rule superuser):

Dig deeper by viewing the failures on hud


@wconstab (Contributor, Author)

@pytorchbot merge -f "Flaky CI"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

@pytorchmergebot (Collaborator)

Merge failed

Reason: Command `git -C /home/runner/work/pytorch/pytorch rebase origin/master` returned non-zero exit code 1

```
Rebasing (1/1)
Auto-merging benchmarks/dynamo/distributed.py
CONFLICT (content): Merge conflict in benchmarks/dynamo/distributed.py
error: could not apply 531c9ff3f3... Enable DDPOptimizer by default in dynamo (#88523)
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 531c9ff3f3... Enable DDPOptimizer by default in dynamo (#88523)
```

@wconstab (Contributor, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot (Collaborator)

Merge failed

Reason: The following mandatory check(s) failed (Rule superuser):

Dig deeper by viewing the failures on hud


@wconstab (Contributor, Author)

@pytorchbot merge -f "Flaky CI (gpu SIGIOT)"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022

Pull Request resolved: pytorch#88523
Approved by: https://github.com/davidberard98
@facebook-github-bot facebook-github-bot deleted the gh/wconstab/31/head branch June 8, 2023 19:16