Conversation

@soumith (Contributor) commented Nov 1, 2022

@mlazos: skips item() calls when compiling with dynamo by defining a helper function _get_value, which returns either the result of .item() (in eager) or the scalar CPU tensor itself (when compiling with dynamo). This was done because removing item() calls entirely significantly regresses eager perf. Additionally, _dispatch_sqrt calls the appropriate sqrt function (math.sqrt or torch.sqrt).
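
For context, a minimal sketch of what such helpers could look like; torch.compiler.is_compiling() is assumed here as the compile-detection check, and the bodies are illustrative rather than the exact code that was merged:

```python
# Illustrative sketch only -- not the merged implementation.
# Assumption: torch.compiler.is_compiling() as the "are we under dynamo?" check.
import math
import torch

def _get_value(x):
    # Under dynamo, keep the 0-dim CPU tensor so no .item() graph break / sync occurs;
    # in eager, call .item() because Python-scalar math is cheaper than 0-dim tensor math.
    if torch.compiler.is_compiling():
        return x
    return x.item()

def _dispatch_sqrt(x):
    # Pick the sqrt that matches whatever _get_value returned.
    if isinstance(x, torch.Tensor):
        return torch.sqrt(x)
    return math.sqrt(x)
```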

Fixes pytorch/torchdynamo#1083

This PR will no longer be needed once symint support is default.

This PR closes all remaining graph breaks in the optimizers (!!)

cc @mlazos @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire

@soumith requested a review from albanD as a code owner on November 1, 2022 04:25
@pytorch-bot bot commented Nov 1, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88173

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 failure as of commit 5a05a87.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@soumith (Contributor, Author) commented Nov 1, 2022

After this PR lands, I suspect we can remove the capturable keyword and what it represents.

@albanD (Collaborator) left a comment

Does this make it capturable? These are still CPU Tensors.

@albanD added the release notes: nn and topic: improvements labels on Nov 1, 2022
@albanD (Collaborator) commented Nov 1, 2022

Also this will trigger #74424 again. So not sure if we can do this.

@soumith (Contributor, Author) commented Nov 1, 2022

Also this will trigger #74424 again. So not sure if we can do this.

I don't think it will trigger that issue with XLA, because we've removed all item calls. We can even get rid of most list comprehensions. So I have high hopes that this won't be an issue for XLA.

@soumith (Contributor, Author) commented Nov 1, 2022

Does this make it capturable? These are still CPU Tensors.

Capturable (I believe) is whether dynamo can get a whole graph of it -- not that it should all be on GPU.

@albanD (Collaborator) commented Nov 1, 2022

Capturable (I believe) is whether dynamo can get a whole graph of it

Capturable is used in the context of cudagraph: "whether this instance is safe to capture in a CUDA graph.".
So we do need the step to be on the GPU when this is true.

@soumith (Contributor, Author) commented Nov 1, 2022

Capturable is used in the context of cudagraph: "whether this instance is safe to capture in a CUDA graph.".

Oh, interesting. In that case, I have to check the capturable code; it's probably buggy as-is, because when learning-rate drops happen the scalar changes/drops, but cudagraph would have baked it in.

@albanD (Collaborator) commented Nov 1, 2022

I think there are two things here:

  1. This patch:
    A scalar Tensor on CPU will be slower than a plain Python number. I guess we're happy with a small hit here, but I think it would be good to measure it, as some of these optimizers do quite a few ops (and our framework overhead is non-trivial). A rough sketch of such a measurement follows this comment.

  2. The capturable flag:
    I think there is still an impasse here though:

  • For this optimizer to be capturable by cudagraph, we MUST have the step on the GPU
  • To get OK performance outside of cudagraph, we MUST keep the step on the CPU

We could consider checking during the step itself whether we're using cudagraph, but that would mean we might have to move the step then (which is a sync point, so not good).

Also, I'm not sure how dynamo will handle that since it wants to use cudagraph. I guess it already breaks on any mix of CPU/GPU compute? And so this change will not really remove the graph breaks within dynamo?
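
A rough sketch of the kind of measurement point 1 calls for, timing an Adam-style update with step as a plain Python number versus a 0-dim CPU tensor; the sizes, iteration count, and update shape are illustrative assumptions, not the benchmark actually run for this PR:

```python
# Rough eager-mode timing sketch (arbitrary sizes; illustrative only).
import time
import torch

param = torch.randn(1024, 1024)
grad = torch.randn_like(param)
exp_avg = torch.zeros_like(param)
exp_avg_sq = torch.zeros_like(param)
beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8

def run(step, iters=200):
    # `step` is either a Python int or a 0-dim CPU tensor.
    exp_avg.zero_()
    exp_avg_sq.zero_()
    start = time.perf_counter()
    for _ in range(iters):
        step = step + 1
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
        bias_correction1 = 1 - beta1 ** step  # float math vs. 0-dim tensor math
        bias_correction2 = 1 - beta2 ** step
        denom = (exp_avg_sq / bias_correction2).sqrt_().add_(eps)
        param.sub_((lr / bias_correction1) * (exp_avg / denom))
    return time.perf_counter() - start

print("python int step      :", run(0))
print("0-dim CPU tensor step:", run(torch.zeros(())))
```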

@mlazos (Contributor) commented Nov 12, 2022

@albanD One question: if the step is on GPU, this would be bad without cudagraphs because we're launching a kernel to do step + 1, essentially, right? I think in the dynamo case this won't matter, because this op will get fused into a larger kernel anyway. So could we make capturable=True by default, and then people who aren't using dynamo or cudagraphs can manually set capturable=False?

@albanD (Collaborator) commented Nov 14, 2022

this would be bad without cudagraphs because we're launching a kernel to do step + 1 essentially right?

Correct.

could we make capturable=True by default, and then for people who aren't using dynamo or cudagraphs, they can manually set capturable=False?

We can't make it the default, no. For GPU-like backends, having these small ops on the device is prohibitively slow (#74424).

@albanD (Collaborator) left a comment

Sounds good to me!

@byronyi commented Nov 15, 2022

For GPU-like backends, having these small ops on the device is prohibitively slow #74424

I guess XLA is not a GPU-like backend, as it is lazy?

in each training iteration, now state['step'] needs to be moved back to CPU from CUDA or XLA to compute things like step = step_t.item()

It seems the overhead mainly comes from moving those tensors back to CPU, which is exactly the issue this PR fixes.

cc @JackCaoG; I believe it would be beneficial to make capturable=True by default for all fusion-capable backends, e.g. XLA or CUDA graphs.

@JackCaoG (Collaborator) commented:

Generally speaking, removing the item call will only be beneficial to the XLA backend. The item call is one of the common causes of early execution. Without it, lazy will fuse all of the optimizer + forward + backward. I am not entirely sure what capturable does; @shunting314 might be able to comment for our dynamo->xla bridge.

@shunting314 (Contributor) commented:

@shunting314 might be able to comment for our dynamo->xla bridge.

In the dynamo -> xla bridge for training, we are not actually running the optimizer ( https://github.com/pytorch/pytorch/blob/master/benchmarks/dynamo/common.py#L894 ). But it looks like people are adding them.

@byronyi commented Nov 15, 2022

I am not entirely sure what capturable does

If capturable is True, then the step is a device tensor instead of being kept on the CPU, and the parameter decay is calculated on device. This helps reduce the CPU overhead when the backend can fuse everything into a single optimizer computation (used mainly for CUDA graphs, but potentially benefiting other backends as well).

See https://github.com/pytorch/pytorch/blob/master/torch/optim/adam.py#L366-L393
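
For readers following along, here is a condensed sketch of that distinction, loosely modeled on the linked Adam code; the function name and structure are simplified assumptions rather than the actual library source:

```python
import torch

def _bias_corrected_step_size(step_t: torch.Tensor, lr: float, beta1: float,
                              capturable: bool):
    # step_t is the optimizer's state['step'] tensor for one parameter.
    step_t += 1
    if capturable:
        # capturable=True: step_t lives on the same CUDA device as the params and
        # all arithmetic stays on device, so the update can be recorded into a
        # CUDA graph without any host synchronization.
        bias_correction1 = 1 - torch.pow(beta1, step_t)
        return lr / bias_correction1  # 0-dim device tensor
    else:
        # capturable=False: step_t is a 0-dim CPU tensor; .item() pulls it out as a
        # Python number so the remaining scalar math is cheap host-side arithmetic
        # (this is the .item() call the PR routes through _get_value under dynamo).
        step = step_t.item()
        bias_correction1 = 1 - beta1 ** step
        return lr / bias_correction1  # plain Python float
```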

@mlazos requested a review from janeyx99 as a code owner on December 2, 2022 02:17
@mlazos self-assigned this on Dec 2, 2022
@albanD (Collaborator) commented Dec 7, 2022

Do we have the sanity check benchmark as discussed above to make sure nothing is obviously broken in terms of speed due to these changes?

@mlazos (Contributor) commented Dec 9, 2022

Update: This causes a pretty bad perf regression in both single- and multi-tensor mode for eager. As a workaround I will use is_compiling to call item() when in eager, and to use CPU tensors when using dynamo. I will verify this fixes the regressions, and then we should be able to merge this.

cc @albanD

@mlazos changed the title from "[optim] remove .item calls in all optimizers" to "[optim] avoid .item calls in all optimizers when compiling with dynamo" on Dec 9, 2022
@mlazos changed the title from "[optim] avoid .item calls in all optimizers when compiling with dynamo" to "[optim] skip .item calls in all optimizers when compiling with dynamo" on Dec 9, 2022
@mlazos (Contributor) commented Dec 10, 2022

Update: The current version matches the eager perf of the old version, per Alban's request. Should be good to go, just needs a sign-off. Updated numbers.

Test issues are related to the dynamo import in optim; they should be fixed now, I think.

@albanD (Collaborator) left a comment

That is acceptable, but really not great in terms of readability.

param.addcdiv_(grad, denom)
param.addcdiv_(exp_avg, denom)
else:
mu_product_next = _get_value(mu_product) * mu * mu_next

Why do you do the _get_value on mu_product here and not on mu_product_next like in the old code?

@albanD (Collaborator) commented Dec 12, 2022

By the way @mlazos, I guess that #90624 should actually allow you to remove the .item() changes?
What about the other changes? I think it is part of the dynamo goal to be able to do these integrations without modifying user code with if is_compiling, a la torchscript?

@mlazos (Contributor) commented Dec 12, 2022

I don't think symint is ready for prime time, since there are still some lingering bugs. Once dynamic shapes are on by default, I'm fine with removing this.

Yes, we usually don't want to modify code to compile with dynamo. In most other cases we don't care about eager perf as much, though. In this case, if we didn't care about eager perf, I'd just remove the item calls and call it a day, since dynamo provides a good speedup. It ends up being a trade-off: is it worth modifying the compiler and adding feature X, versus just removing X from the code if it isn't really necessary? With user code I lean towards the former, while in code we own it's more balanced.

@mlazos (Contributor) commented Dec 12, 2022

@pytorchbot merge -f "Unrelated test failure"

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).


@albanD (Collaborator) commented Dec 12, 2022

In most other cases we don't care about eager perf as much though.

I would challenge that, actually. People who write library code (nn, vision, audio, etc.) need to support both. So this is a major use case!

@github-actions bot deleted the optim_item branch on June 12, 2024 01:53

Successfully merging this pull request may close these issues:

Remove .item() calls from optimizer.step()