Added requested_bytes to CUDA Caching Allocator Stats #88575
Conversation
This pull request was exported from Phabricator. Differential Revision: D40810674
zdevito left a comment:
It makes sense to record this stat. I have a few inline comments. I also think that there is code missing to handle resetting the statistic when all the other statistics are reset. I don't see tests for reading requested_bytes out of the block info.
Thank you for the review! I've addressed the inline comments, added tests for reading requested_bytes out of segment info/block info, and added code to reset the statistic when all the other statistics are reset.
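For context, a minimal sketch of what such a test might look like, assuming the stat is exposed through `torch.cuda.memory_snapshot()` under a `requested_size` key and is covered by the standard reset helpers (both assumptions inferred from the discussion above, not confirmed by this thread):

```python
import torch

# Hypothetical check that requested size is readable from segment/block info.
# The key names below are assumptions; adjust them to the actual snapshot schema.
x = torch.empty(1_000_000, device="cuda")
for segment in torch.cuda.memory_snapshot():
    for block in segment["blocks"]:
        if block["state"] == "active_allocated":
            # The requested size should never exceed the (possibly rounded) block size.
            assert block["requested_size"] <= block["size"]

# Resetting should cover the new stat alongside the existing ones.
del x
torch.cuda.reset_peak_memory_stats()         # peak counters -> current values
torch.cuda.reset_accumulated_memory_stats()  # allocated/freed counters -> 0
stats = torch.cuda.memory_stats()
assert stats["requested_bytes.all.peak"] == stats["requested_bytes.all.current"]
assert stats["requested_bytes.all.freed"] == 0
```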
zdevito left a comment:
Looks good. I have a couple of nits that should be addressed (missing block info export and test), but otherwise this looks good to me.
Summary: Pull Request resolved: pytorch#88575

The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes, however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to CUDA caching allocator stats to help track whether rounding is adding too much overhead and to help tune the roundup_power2_divisions flag:
- "requested_bytes.{all,large_pool,small_pool}.{current,peak,allocated,freed}": memory requested by client code; compare this with allocated_bytes to check whether allocation rounding adds too much overhead.

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674

fbshipit-source-id: d71624c0173c1ca272a1a76acc2bc1ff022edfab
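A minimal sketch of how the new stat can be consumed, assuming a CUDA build that includes this change; the tensor size and the overhead calculation are illustrative, not part of the PR:

```python
import torch

# The rounding granularity can be tuned before the process starts, e.g.:
#   PYTORCH_CUDA_ALLOC_CONF=roundup_power2_divisions:4 python train.py

x = torch.empty(5_000_000, device="cuda")  # client requests ~20 MB of fp32

stats = torch.cuda.memory_stats()
requested = stats["requested_bytes.all.current"]
allocated = stats["allocated_bytes.all.current"]

# Bytes handed out beyond what client code actually asked for, i.e. the
# overhead introduced by allocation rounding.
print(f"requested={requested}, allocated={allocated}, "
      f"rounding overhead={allocated - requested}")
```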
@c-odrin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Follow-up: Memory usage increase after pytorch#88575
Memory usage increases after #88575; Docker crashes with exit code 137, which clearly means out of memory. Pull Request resolved: #94548. Approved by: https://github.com/seemethere