[nn] zero_grad() set_to_none default True #92731
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92731
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 1 Pending (as of commit b589af5)
This comment was automatically generated by Dr. CI and updates every 15 minutes.
albanD left a comment:
Quite a few tests need updating. But YES!
Attempts to fix #92656 [ghstack-poisoned]
Can you please add a bc-breaking note here?
ngimel left a comment:
LG, if tests pass
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
ooooh exciting. this is a big change :)
This PR leads to speedups across many models in torchbench: 22 models obtain over a 1.03X speedup on A100! Thanks for your work!
@FindHao Recently came back from PTO, so sorry this response is delayed. Thanks for this callout! I'm curious about the yolov3 slowdown: have you been able to root-cause it thus far? The simple workaround is to directly pass set_to_none=False to regain perf, but I would like to help with figuring out the cause here.
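For reference, a minimal sketch of that workaround (the toy model and optimizer below are placeholders, not torchbench code):

```python
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(4, 8)).sum().backward()
optimizer.step()

# Keep the pre-#92731 behavior: fill the existing .grad tensors with zeros
# instead of replacing them with None.
optimizer.zero_grad(set_to_none=False)
```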
Hi @janeyx99, we found it is caused by torch.cuda.empty_cache(), which takes longer than in the original version. Since torchbench only tests one iteration of training and we don't need to empty the cache, we removed this call as a workaround. But we still don't know why this function takes longer. Do you have any ideas?
@FindHao Ah, I spoke with @albanD and this is not surprising. When set_to_none was False, the same grad tensor was allocated once and kept alive throughout the iterations (it would be filled with 0s and then filled with values, and so forth). Now, because we set grad to None, the number of allocations increases as we alternate between None -> real values -> None -> real values, and so forth. This is typically not a problem except for certain configurations (like a particular batch_size on a particular GPU) where the stars align just right and allocations incur actual communication with the GPU rather than being serviced from already-allocated memory.
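A rough sketch of the allocation pattern being described (illustrative only; the exact allocator behavior depends on the model, batch size, and GPU):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512, device=device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(3):
    model(torch.randn(64, 512, device=device)).sum().backward()
    opt.step()

    # New default (set_to_none=True): .grad becomes None here, so the next
    # backward() must allocate a fresh grad tensor each iteration
    # (None -> real values -> None -> ...).
    opt.zero_grad()

    # Old default: the same grad tensor stays alive across iterations and is
    # simply refilled with zeros, so no new allocation is requested.
    # opt.zero_grad(set_to_none=False)
```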
@janeyx99 Thanks for your explanation! It makes sense. I have another question. If I understand it correctly, setting it to None means marking the memory allocated to current tensors as
No,
Haha, I will attempt to answer this question by setting down some terminology. PyTorch has a CUDACachingAllocator which reserves and manages memory for the duration of a PyTorch program. In a sense, you can imagine that it reserves a chunk of memory from the GPU and sits on top of the actual GPU so that every time the program releases or requests memory, we don't have to talk to the GPU. For example, if the PyTorch program releases memory, our CachingAllocator will hold onto it instead of immediately releasing it to the GPU, so that later on, when the program wants memory again, it can lend that memory out. This saves time because communication with the GPU is avoided entirely.

Thus, we have the concepts of memory reserved and memory allocated. The memory reserved is the total memory managed by the CUDACachingAllocator, and the memory allocated is the memory taken up by actual PyTorch tensors. Setting grads to None here will "free" the tensor so that the memory allocated goes down immediately, BUT the memory reserved remains the same. Calling torch.cuda.empty_cache() empties the cache so that memory reserved approaches memory allocated.
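To make the reserved-vs-allocated distinction concrete, here is a small illustrative snippet (not part of this PR) using PyTorch's CUDA memory introspection APIs:

```python
import torch

assert torch.cuda.is_available()

x = torch.randn(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated())  # bytes held by live tensors
print(torch.cuda.memory_reserved())   # bytes held by the caching allocator

del x  # allocated drops, but reserved stays the same (the block is cached)
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

torch.cuda.empty_cache()  # return unused cached blocks to the GPU driver
print(torch.cuda.memory_reserved())   # reserved now approaches allocated
```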
Attempts to fix #92656
BC-breaking! This changes the default of zero_grad in optim and in nn so that grads are set to None instead of zero tensors by default. We are changing the default because there are proven perf wins and existing code has typically not regressed due to this change. (This note will probably need to be fleshed out more.)
Stack from ghstack (oldest at bottom):
BC-breaking note
Gradients are now set to `None` instead of zeros by default in `torch.optim.*.zero_grad()` and `torch.nn.Module.zero_grad()`.

This changes the default behavior of `zero_grad()` to reset the grads by setting them to `None` instead of zero tensors. In other words, the `set_to_none` kwarg is now `True` by default instead of `False`. Setting grads to `None` reduces peak memory usage and increases performance. This will break code that directly accesses data or does computation on the grads after calling `zero_grad()`, as they will now be `None`. To revert to the old behavior, pass in `zero_grad(set_to_none=False)`.

| 1.13 | 2.0 |
| --- | --- |
| `zero_grad()` fills existing `.grad` tensors with zeros (`set_to_none=False` by default) | `zero_grad()` sets `.grad` to `None` (`set_to_none=True` by default) |
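As a hedged illustration of the kind of code this change breaks, and how to opt back into the old behavior (the toy model and optimizer below are placeholders):

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(2, 4)).sum().backward()
opt.step()
opt.zero_grad()

p = next(model.parameters())
# 1.13: p.grad was a zero tensor here, so p.grad.add_(1.0) worked.
# 2.0:  p.grad is None, so the same line now raises an AttributeError.
print(p.grad)  # None

# Opt back into the old behavior by passing set_to_none=False.
model(torch.randn(2, 4)).sum().backward()  # repopulate the grads
opt.zero_grad(set_to_none=False)
print(p.grad)  # a zero tensor, as in 1.13
```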