[BE]: Follow detach().clone() pattern for SGD #144468

Skylion007 · 2025-01-09T14:48:57Z

Clone() copies the gradients too, but we immediately detach them. Detach returns a view of the tensor without its gradients, and clone then copies only that subset. Related to #144270

pytorch-bot · 2025-01-09T14:49:01Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144468

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

CI workflows being skipped on PR

❌ 8 New Failures

As of commit 84f1442 with merge base 8f54e56:

NEW FAILURES - The following jobs have failed:

pull / linux-focal-cuda12.6-py3.10-gcc11 / test (default, 5, 5, ephemeral.linux.4xlarge.nvidia.gpu) (gh)
profiler/test_memory_profiler.py::TestMemoryProfilerE2E::test_categories_e2e_simple_module_fwd_bwd_step
pull / linux-focal-cuda12.6-py3.10-gcc11-sm89 / test (default, 5, 5, ephemeral.linux.g6.4xlarge.experimental.nvidia.gpu) (gh)
profiler/test_memory_profiler.py::TestMemoryProfilerE2E::test_categories_e2e_simple_module_fwd_bwd_step
pull / linux-focal-py3.9-clang10 / test (crossref, 2, 2, ephemeral.linux.2xlarge) (gh)
profiler/test_memory_profiler.py::TestMemoryProfilerE2E::test_categories_e2e_simple_module_fwd_bwd_step
pull / linux-focal-py3.9-clang10 / test (default, 2, 5, ephemeral.linux.4xlarge) (gh)
profiler/test_memory_profiler.py::TestMemoryProfilerE2E::test_categories_e2e_simple_module_fwd_bwd_step
pull / linux-jammy-py3.10-clang15-asan / test (default, 4, 6, ephemeral.linux.4xlarge) (gh)
profiler/test_memory_profiler.py::TestMemoryProfilerE2E::test_categories_e2e_simple_module_fwd_bwd_step
pull / linux-jammy-py3.9-gcc11 / test (default, 3, 5, ephemeral.linux.2xlarge) (gh)
profiler/test_memory_profiler.py::TestMemoryProfilerE2E::test_categories_e2e_simple_module_fwd_bwd_step
trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable) (gh)
profiler/test_memory_profiler.py::TestMemoryProfilerE2E::test_categories_e2e_simple_module_fwd_bwd_step
trunk / win-vs2022-cpu-py3 / test (default, 3, 3, ephemeral.windows.4xlarge.nonephemeral) (gh)
profiler\test_memory_profiler.py::TestMemoryProfilerE2E::test_categories_e2e_simple_module_fwd_bwd_step

This comment was automatically generated by Dr. CI and updates every 15 minutes.

janeyx99

Okay, though practically speaking this is likely not changing much perf in most use cases as grad normally does not require grad.
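For illustration, a minimal sketch of the point above, assuming a toy parameter `w` (hypothetical, not part of this PR): a gradient produced by an ordinary backward pass does not require grad, so the order of `clone` and `detach` only matters for autograd bookkeeping when the gradient was built with `create_graph=True`.

```python
import torch

# Hypothetical toy parameter, used only for illustration.
w = torch.randn(4, requires_grad=True)

# Ordinary backward pass: the accumulated gradient is a plain tensor.
(w ** 2).sum().backward()
plain_grad_requires_grad = w.grad.requires_grad

# Double-backward setup: the gradient itself is part of an autograd graph.
(higher_order_grad,) = torch.autograd.grad((w ** 2).sum(), w, create_graph=True)
graph_grad_requires_grad = higher_order_grad.requires_grad

print(plain_grad_requires_grad, graph_grad_requires_grad)
```

Only in the second case does cloning the gradient record anything in the graph, which is why the change is expected to have little effect in typical training loops.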

janeyx99 · 2025-01-09T15:27:51Z

@pytorchbot merge

pytorchmergebot · 2025-01-09T15:29:29Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2025-01-09T16:59:48Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable)

Details for Dev Infra team

Raised by workflow job

Skylion007 · 2025-01-09T19:04:54Z

@pytorchbot merge -r

pytorchmergebot · 2025-01-09T19:06:55Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-01-09T19:06:59Z

Tried to rebase and push PR #144468, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

pytorchmergebot · 2025-01-09T19:06:59Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

Skylion007 · 2025-01-09T19:12:11Z

@janeyx99 Seems like a false positive, but I would like to get confirmation

janeyx99 · 2025-01-09T19:15:54Z

Ah I think you do have to modify the test now that the version counts are different.

github-actions · 2025-03-10T20:35:27Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

Skylion007 · 2025-05-04T16:09:27Z

@pytorchbot rebase

pytorchmergebot · 2025-05-04T16:10:58Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-05-04T16:11:01Z

Successfully rebased skylion007/fix-sgd-detach-2025-01-09 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout skylion007/fix-sgd-detach-2025-01-09 && git pull --rebase)

cyyever

Are there still undiscovered clones like these?

cyyever · 2025-05-04T16:56:15Z

torch/optim/sgd.py


            if buf is None:
-                buf = torch.clone(grad).detach()
+                buf = torch.clone(grad.detach())


Use grad.detach().clone()? That is a common pattern in the code base.
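A rough sketch of how the two orderings compare, assuming a stand-in tensor `grad` that is itself tracked by autograd (the uncommon case; the name and setup are illustrative, not taken from `torch/optim/sgd.py`):

```python
import torch

# Hypothetical gradient that is tracked by autograd (the uncommon case).
grad = torch.randn(3, requires_grad=True)

clone_then_detach = torch.clone(grad).detach()  # ordering before this PR
detach_then_clone = grad.detach().clone()       # suggested pattern

# Both results hold the same values, in fresh storage, outside the graph.
same_values = torch.equal(clone_then_detach, detach_then_clone)
both_untracked = (not clone_then_detach.requires_grad
                  and not detach_then_clone.requires_grad)
fresh_storage = (clone_then_detach.data_ptr() != grad.data_ptr()
                 and detach_then_clone.data_ptr() != grad.data_ptr())

# The difference is the intermediate: cloning a tracked tensor records a
# graph node, while cloning an already detached tensor does not.
tracked_clone_has_grad_fn = grad.clone().grad_fn is not None
untracked_clone_has_grad_fn = grad.detach().clone().grad_fn is not None

print(same_values, both_untracked, fresh_storage)
print(tracked_clone_has_grad_fn, untracked_clone_has_grad_fn)
```

The final buffers are equivalent either way; detaching first just avoids creating a graph node that is immediately thrown away.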

cyyever · 2025-05-04T16:56:28Z

torch/optim/sgd.py

                        buf = device_momentum_buffer_list[i] = momentum_buffer_list[
                            indices[i]
-                        ] = torch.clone(device_grads[i]).detach()
+                        ] = torch.clone(device_grads[i].detach())


Same as the above.

albanD · 2025-05-05T19:28:55Z

> Clone() copies the gradients too

That is NOT true.
torch.clone() and Tensor.clone() do NOT clone the gradients.
Maybe you're thinking of deepcopy(), which indeed does?
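A small sketch of that distinction, assuming a toy leaf tensor `x` with a populated `.grad` (hypothetical names, for illustration only):

```python
import copy
import warnings

import torch

# Hypothetical leaf tensor with a populated .grad field.
x = torch.ones(2, requires_grad=True)
(x * 2).sum().backward()  # x.grad is now a tensor of twos

cloned = x.clone()
with warnings.catch_warnings():
    # Reading .grad on a non-leaf tensor emits a warning; silence it here.
    warnings.simplefilter("ignore")
    cloned_grad = cloned.grad

deep = copy.deepcopy(x)

clone_has_grad = cloned_grad is not None
deepcopy_has_grad = deep.grad is not None
deepcopy_grad_matches = deep.grad is not None and torch.equal(deep.grad, x.grad)
deepcopy_grad_is_new_object = deep.grad is not x.grad

print(clone_has_grad, deepcopy_has_grad, deepcopy_grad_matches)
```

`clone()` copies only the tensor's data (and stays connected to the graph), whereas `copy.deepcopy()` on a leaf tensor also carries over a copy of its `.grad`.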

janeyx99

As clone does not copy gradients, can you update the PR description? We can still follow @cyyever's suggestions here to land cleaner code for BE, even though it will only be marginally better.

Skylion007 requested review from albanD and ezyang January 9, 2025 14:48

Skylion007 requested a review from janeyx99 as a code owner January 9, 2025 14:48

pytorch-bot bot added the release notes: optim label Jan 9, 2025

Skylion007 requested a review from malfet January 9, 2025 14:56

Skylion007 added the better-engineering and release notes: performance_as_product labels Jan 9, 2025

pytorchbot added the open source label Jan 9, 2025

janeyx99 approved these changes Jan 9, 2025

pytorch-bot bot added the ciflow/trunk label Jan 9, 2025

janeyx99 added the triaged label Jan 9, 2025

pytorchmergebot added the merging label Jan 9, 2025

pytorchmergebot removed the merging label Jan 9, 2025

albanD removed their request for review January 9, 2025 20:01

github-actions bot added the Stale label Mar 10, 2025

github-actions bot closed this Apr 9, 2025

Skylion007 reopened this May 4, 2025

Skylion007 requested a review from albanD as a code owner May 4, 2025 16:09

[BE]: Improve SGD efficiency by cloning less data

84f1442

pytorchmergebot force-pushed the skylion007/fix-sgd-detach-2025-01-09 branch from 1b78678 to 84f1442 Compare May 4, 2025 16:11

cyyever approved these changes May 4, 2025

janeyx99 changed the title ~~[BE]: Improve SGD efficiency by cloning less data~~ [BE]: Follow detach().clone() pattern May 5, 2025

janeyx99 reviewed May 5, 2025

janeyx99 changed the title ~~[BE]: Follow detach().clone() pattern~~ [BE]: Follow detach().clone() pattern for SGD May 5, 2025

github-actions bot closed this Jun 4, 2025
