Conversation

@soumith (Contributor) commented Nov 1, 2022

@mlazos: skips item() calls when compiling with dynamo by defining a helper function _get_value, which returns either the result of .item() (in eager) or the scalar CPU tensor itself (when compiling with dynamo). This was done because removing item() calls entirely significantly regresses eager perf. Additionally, _dispatch_sqrt calls the appropriate sqrt function (math.sqrt or torch.sqrt).
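
For context, a minimal sketch of what such helpers could look like; torch.compiler.is_compiling() is assumed here as the compile-detection check, and the bodies are illustrative rather than the exact code that was merged:

```python
# Illustrative sketch only -- not the merged implementation.
# Assumption: torch.compiler.is_compiling() as the "are we under dynamo?" check.
import math
import torch

def _get_value(x):
    # Under dynamo, keep the 0-dim CPU tensor so no .item() graph break / sync occurs;
    # in eager, call .item() because Python-scalar math is cheaper than 0-dim tensor math.
    if torch.compiler.is_compiling():
        return x
    return x.item()

def _dispatch_sqrt(x):
    # Pick the sqrt that matches whatever _get_value returned.
    if isinstance(x, torch.Tensor):
        return torch.sqrt(x)
    return math.sqrt(x)
```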

Fixes pytorch/torchdynamo#1083

This PR will no longer be needed once symint support is default.

This PR closes all remaining graph breaks in the optimizers (!!)

cc @mlazos @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire

@soumith requested a review from albanD as a code owner on November 1, 2022 04:25
@pytorch-bot bot commented Nov 1, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88173

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 failure as of commit 5a05a87.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@soumith (Contributor, Author) commented Nov 1, 2022

After this PR lands, I suspect we can remove the capturable keyword and what it represents.

@albanD (Collaborator) left a comment

Does this make it capturable? These are still CPU Tensors.

@albanD added the release notes: nn and topic: improvements labels on Nov 1, 2022
@albanD (Collaborator) commented Nov 1, 2022

Also this will trigger #74424 again. So not sure if we can do this.

@soumith (Contributor, Author) commented Nov 1, 2022

Also this will trigger #74424 again. So not sure if we can do this.

I don't think it will trigger that issue with XLA, because we've removed all item calls. We can even get rid of most list comprehensions. So I have high hopes that this won't be an issue for XLA.

@soumith (Contributor, Author) commented Nov 1, 2022

Does this make it capturable? These are still CPU Tensors.

Capturable (I believe) is whether dynamo can get a whole graph of it -- not that it should all be on GPU.

@albanD (Collaborator) commented Nov 1, 2022

Capturable (I believe) is whether dynamo can get a whole graph of it

Capturable is used in the context of cudagraph: "whether this instance is safe to capture in a CUDA graph.".
So we do need the step to be on the GPU when this is true.

@soumith (Contributor, Author) commented Nov 1, 2022

Capturable is used in the context of cudagraph: "whether this instance is safe to capture in a CUDA graph.".

Oh, interesting. In that case, I have to check the capturable code; it's probably buggy as-is, because when learning-rate drops happen the scalar changes/drops, but cudagraph would have baked it in.

@albanD (Collaborator) commented Nov 1, 2022

I think there are two things here:

  1. This patch:
    A scalar Tensor on CPU will be slower than a plain Python number. I guess we're happy with a small hit here, but I think it would be good to measure it, as some of these optimizers do quite a few ops (and our framework overhead is non-trivial). A rough sketch of such a measurement follows this comment.

  2. The capturable flag:
    I think there is still an impasse here though:

  • For this optimizer to be capturable by cudagraph, we MUST have the step on the GPU
  • To get OK performance outside of cudagraph, we MUST keep the step on the CPU

We could consider checking during the step itself whether we're using cudagraph, but that would mean we might have to move the step then (which is a sync point, so not good).

Also, I'm not sure how dynamo will handle that since it wants to use cudagraph. I guess it already breaks on any mix of CPU/GPU compute? And so this change will not really remove the graph breaks within dynamo?
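
A rough sketch of the kind of measurement point 1 calls for, timing an Adam-style update with step as a plain Python number versus a 0-dim CPU tensor; the sizes, iteration count, and update shape are illustrative assumptions, not the benchmark actually run for this PR:

```python
# Rough eager-mode timing sketch (arbitrary sizes; illustrative only).
import time
import torch

param = torch.randn(1024, 1024)
grad = torch.randn_like(param)
exp_avg = torch.zeros_like(param)
exp_avg_sq = torch.zeros_like(param)
beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8

def run(step, iters=200):
    # `step` is either a Python int or a 0-dim CPU tensor.
    exp_avg.zero_()
    exp_avg_sq.zero_()
    start = time.perf_counter()
    for _ in range(iters):
        step = step + 1
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
        bias_correction1 = 1 - beta1 ** step  # float math vs. 0-dim tensor math
        bias_correction2 = 1 - beta2 ** step
        denom = (exp_avg_sq / bias_correction2).sqrt_().add_(eps)
        param.sub_((lr / bias_correction1) * (exp_avg / denom))
    return time.perf_counter() - start

print("python int step      :", run(0))
print("0-dim CPU tensor step:", run(torch.zeros(())))
```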

@mlazos (Contributor) commented Nov 12, 2022

@albanD One question: if the step is on GPU, this would be bad without cudagraphs because we're launching a kernel to do step + 1, essentially, right? I think in the dynamo case this won't matter, because this op will get fused into a larger kernel anyway. So could we make capturable=True by default, and then people who aren't using dynamo or cudagraphs can manually set capturable=False?

@albanD (Collaborator) commented Nov 14, 2022

this would be bad without cudagraphs because we're launching a kernel to do step + 1 essentially right?

Correct.

could we make capturable=True by default, and then for people who aren't using dynamo or cudagraphs, they can manually set capturable=False?

We can't make it the default, no. For GPU-like backends, having these small ops on the device is prohibitively slow (#74424).

@albanD (Collaborator) left a comment

Sounds good to me!

@byronyi commented Nov 15, 2022

For GPU-like backends, having these small ops on the device is prohibitively slow #74424

I guess XLA is not a GPU-like backend, as it is lazy?

in each training iteration, now state['step'] needs to be moved back to CPU from CUDA or XLA to compute things like step = step_t.item()

It seems the overhead mainly comes from moving those tensors back to CPU, which is exactly the issue this PR fixes.

cc @JackCaoG; I believe it would be beneficial to make capturable=True by default for all fusion-capable backends, e.g. XLA or CUDA graphs.

@JackCaoG (Collaborator) commented:

Generally speaking, removing the item call will only be beneficial to the XLA backend. The item call is one of the common causes of early execution. Without it, lazy will fuse all of the optimizer + forward + backward. I am not entirely sure what capturable does; @shunting314 might be able to comment for our dynamo->xla bridge.

@shunting314 (Contributor) commented:

@shunting314 might be able to comment for our dynamo->xla bridge.

In the dynamo -> xla bridge for training, we are not actually running the optimizer ( https://github.com/pytorch/pytorch/blob/master/benchmarks/dynamo/common.py#L894 ). But it looks like people are adding them.

@byronyi commented Nov 15, 2022

I am not entirely sure what capturable does

If capturable is True, then the step is a device tensor instead of being kept on the CPU, and the parameter decay is calculated on device. This helps reduce the CPU overhead when the backend can fuse everything into a single optimizer computation (used mainly for CUDA graphs, but potentially benefiting other backends as well).

See https://github.com/pytorch/pytorch/blob/master/torch/optim/adam.py#L366-L393
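
For readers following along, here is a condensed sketch of that distinction, loosely modeled on the linked Adam code; the function name and structure are simplified assumptions rather than the actual library source:

```python
import torch

def _bias_corrected_step_size(step_t: torch.Tensor, lr: float, beta1: float,
                              capturable: bool):
    # step_t is the optimizer's state['step'] tensor for one parameter.
    step_t += 1
    if capturable:
        # capturable=True: step_t lives on the same CUDA device as the params and
        # all arithmetic stays on device, so the update can be recorded into a
        # CUDA graph without any host synchronization.
        bias_correction1 = 1 - torch.pow(beta1, step_t)
        return lr / bias_correction1  # 0-dim device tensor
    else:
        # capturable=False: step_t is a 0-dim CPU tensor; .item() pulls it out as a
        # Python number so the remaining scalar math is cheap host-side arithmetic
        # (this is the .item() call the PR routes through _get_value under dynamo).
        step = step_t.item()
        bias_correction1 = 1 - beta1 ** step
        return lr / bias_correction1  # plain Python float
```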

@mlazos requested a review from janeyx99 as a code owner on December 2, 2022 02:17
@mlazos self-assigned this on Dec 2, 2022
@albanD (Collaborator) commented Dec 7, 2022

Do we have the sanity check benchmark as discussed above to make sure nothing is obviously broken in terms of speed due to these changes?

@mlazos (Contributor) commented Dec 9, 2022

Update: This causes a pretty bad perf regression in both single- and multi-tensor mode for eager. As a workaround I will use is_compiling to call item() when in eager, and to use CPU tensors when using dynamo. I will verify this fixes the regressions, and then we should be able to merge this.

cc @albanD

@mlazos changed the title from "[optim] remove .item calls in all optimizers" to "[optim] avoid .item calls in all optimizers when compiling with dynamo" on Dec 9, 2022
@mlazos changed the title from "[optim] avoid .item calls in all optimizers when compiling with dynamo" to "[optim] skip .item calls in all optimizers when compiling with dynamo" on Dec 9, 2022
@mlazos (Contributor) commented Dec 10, 2022

Update: The current version matches the eager perf of the old version, per Alban's request. Should be good to go, just needs a sign-off. Updated numbers.

Test issues are related to the dynamo import in optim; they should be fixed now, I think.

@albanD (Collaborator) left a comment

That is acceptable, but really not great in terms of readability.

param.addcdiv_(grad, denom)
param.addcdiv_(exp_avg, denom)
else:
mu_product_next = _get_value(mu_product) * mu * mu_next

Why do you do the _get_value on mu_product here and not on mu_product_next like in the old code?

@albanD (Collaborator) commented Dec 12, 2022

By the way @mlazos, I guess that #90624 should actually allow you to remove the .item() changes?
What about the other changes? I think it is part of the dynamo goal to be able to do these integrations without modifying user code with if is_compiling, a la torchscript?

@mlazos (Contributor) commented Dec 12, 2022

I don't think symint is ready for prime time, since there are still some lingering bugs. Once dynamic shapes are on by default, I'm fine with removing this.

Yes, we usually don't want to modify code to compile with dynamo. In most other cases we don't care about eager perf as much, though. In this case, if we didn't care about eager perf, I'd just remove the item calls and call it a day, since dynamo provides a good speedup. It ends up being a trade-off: is it worth modifying the compiler and adding feature X, versus just removing X from the code if it isn't really necessary? With user code I lean towards the former, while in code we own it's more balanced.

@mlazos (Contributor) commented Dec 12, 2022

@pytorchbot merge -f "Unrelated test failure"

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).


@albanD (Collaborator) commented Dec 12, 2022

In most other cases we don't care about eager perf as much though.

I would challenge that, actually. People who write library code (nn, vision, audio, etc.) need to support both. So this is a major use case!

@github-actions bot deleted the optim_item branch on June 12, 2024 01:53

Successfully merging this pull request may close these issues:

Remove .item() calls from optimizer.step()