Caching allocator tracing #86241
Conversation
Adds a ring buffer to the memory history that keeps a rolling record of the last N actions the allocator performed.
* _memory_viz is updated to print better information about segments
* _memory_viz can pretty-print the trace
TODO:
* Make N configurable, so the amount of data in the snapshot can be limited
* Fix race conditions when exposing internal caching allocator structures.
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86241
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 9903d85.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Adds a ring buffer to the memory history that keeps a rolling record of the last N actions the allocator performed.
* _memory_viz is updated to print better information about segments
* _memory_viz can pretty-print the trace
* Can control whether we capture context information and attach it to the trace.
ghstack-source-id: 36c6749
Pull Request resolved: #86241
We currently can take snapshots of the state of allocated CUDA memory, but we do not have a way to correlate these snapshots with the actions the allocator took between snapshots. This PR adds a simple fixed-size buffer that records the major actions the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes them with the snapshot information. Capturing periodic snapshots with a large enough trace buffer makes it possible to see how the allocator state changes over time.

We plan to use this functionality to guide how settings in the allocator can be adjusted, and eventually to build a more robust overall algorithm.

As a component of this functionality, we also add the ability to get a callback when the allocator is about to throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).

This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots, distinguishing between internal and external fragmentation.
[ghstack-poisoned]
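The "rolling buffer of the last N actions" behavior described above can be sketched in pure Python. This is an illustrative stand-in, not the PR's C++ implementation: the class, method, and field names here are hypothetical, though the action names and the idea that a snapshot is itself a recorded event come from the PR description.

```python
from collections import deque

class TraceBuffer:
    """Illustrative sketch of a fixed-size allocator trace buffer."""

    def __init__(self, maxlen=128):
        # deque(maxlen=...) drops the oldest entry once the buffer is
        # full, giving the rolling "last N actions" behavior.
        self.entries = deque(maxlen=maxlen)

    def record(self, action, addr=0, size=0):
        self.entries.append({"action": action, "addr": addr, "size": size})

    def snapshot(self):
        # Taking a snapshot is itself one of the recorded actions
        # (SNAPSHOT), and the snapshot carries the trace collected so far.
        self.record("snapshot")
        return list(self.entries)

buf = TraceBuffer(maxlen=3)
buf.record("alloc", addr=0x1000, size=512)
buf.record("alloc", addr=0x2000, size=1024)
buf.record("free", addr=0x1000, size=512)
trace = buf.snapshot()  # oldest entry is evicted when SNAPSHOT is appended
```

With a buffer of size 3, appending the SNAPSHOT entry evicts the first ALLOC, so the returned trace holds only the most recent three actions.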
Adds a ring buffer to the memory history that keeps a rolling record of the last N actions the allocator performed.
* _memory_viz is updated to print better information about segments
* _memory_viz can pretty-print the trace
* Can control whether we capture context information and attach it to the trace.
ghstack-source-id: 04758de
Pull Request resolved: #86241
c10/cuda/CUDACachingAllocator.h
Outdated
};

struct TraceEntry {
  enum Action { ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, SNAPSHOT, OOM };
Can you leave short comments explaining what each of the actions means?
test/test_cuda.py
Outdated
x = False

def cb(device, alloc, device_alloc, device_free):
    print(device, alloc, device_alloc, device_free)
debug print?
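The callback signature in the test above matches the OOM-observer hook the PR describes ("a callback when the allocator is about to throw an OOM"). A pure-Python sketch of how such observers might be invoked, with hypothetical function names standing in for the real C++ hook:

```python
observers = []

def attach_oom_observer(cb):
    """Register a callback fired just before an out-of-memory error is
    raised (illustrative stand-in for the hook this PR adds)."""
    observers.append(cb)

def raise_oom(device, alloc, device_alloc, device_free):
    # Notify observers *before* raising, so they can snapshot allocator
    # state while the failing allocation's context is still live; after
    # the exception propagates, C++ destructors may already have freed
    # the tensors that caused the OOM.
    for cb in observers:
        cb(device, alloc, device_alloc, device_free)
    raise MemoryError(f"tried to allocate {alloc} bytes on device {device}")

seen = []
attach_oom_observer(lambda device, alloc, da, df: seen.append((device, alloc)))
try:
    raise_oom(0, 512, 4096, 128)
except MemoryError:
    pass
```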
c10/cuda/CUDACachingAllocator.cpp
Outdated
});
if (block->history) {
  record_trace(
      TraceEntry::FREE,
I'm uncertain whether FREE should be recorded here. True, this is where the FREE request happens, but the actual freeing might not happen if other streams are using this block (see the check below), so the trace would be confusing: it would imply that memory was freed when in fact it wasn't.
That is a good point. I'm going to change the trace events to record both 'free_requested' and 'free_completed'. Keeping free_requested separate is still useful when experimenting with other allocator behaviors, where a trace of what the user code called is needed to simulate a different allocator.
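The two-event scheme discussed in this thread can be sketched as follows (a pure-Python illustration; the real logic lives in CUDACachingAllocator.cpp, and the dict keys here are hypothetical):

```python
def free_block(block, trace):
    """Always record that the user requested a free, but record completion
    only when no other streams still reference the block."""
    trace.append(("free_requested", block["addr"]))
    if block["stream_uses"]:
        # Another stream still uses this memory; the real allocator
        # defers the actual free until those stream events complete.
        return False
    trace.append(("free_completed", block["addr"]))
    return True

trace = []
free_block({"addr": 0x1000, "stream_uses": {"stream1"}}, trace)  # deferred
free_block({"addr": 0x2000, "stream_uses": set()}, trace)        # completed
```

The busy block contributes only a free_requested event, so a trace reader can distinguish what user code asked for from what the allocator actually did.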
test/test_cuda.py
Outdated
del x
torch.cuda.empty_cache()
ss = torch.cuda.memory._snapshot()
assert(ss['device_traces'][0][-1]['action'] == 'segment_free')
self.assertTrue
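Assuming the snapshot shape exercised in the test above (a per-device list of trace entries under 'device_traces'), a small post-processing helper might look like this. The helper name is hypothetical and is not part of _memory_viz.py:

```python
def last_action(snapshot, device=0):
    """Return the most recent trace action recorded for a device,
    mirroring the ss['device_traces'][device][-1]['action'] access
    in the test above."""
    traces = snapshot.get("device_traces", [])
    if device >= len(traces) or not traces[device]:
        return None
    return traces[device][-1]["action"]

# A dict shaped like the torch.cuda.memory._snapshot() output
# that the test above inspects.
ss = {"device_traces": [[{"action": "alloc"}, {"action": "segment_free"}]]}
```

For example, last_action(ss) yields the final recorded action for device 0, and an out-of-range device index yields None rather than raising.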
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 additional job has failed; the first few of them are: trunk. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 additional job has failed; the first few of them are: trunk. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -f "tests appear to be passing but thinks a green check didn't pass"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Command. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hey @zdevito. |
Summary: We currently can take snapshots of the state of allocated CUDA memory, but we do not have a way to correlate these snapshots with the actions the allocator took between snapshots. This PR adds a simple fixed-size buffer that records the major actions the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes them with the snapshot information. Capturing periodic snapshots with a large enough trace buffer makes it possible to see how the allocator state changes over time. We plan to use this functionality to guide how settings in the allocator can be adjusted, and eventually to build a more robust overall algorithm. As a component of this functionality, we also add the ability to get a callback when the allocator is about to throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught). This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots, distinguishing between internal and external fragmentation.
Pull Request resolved: #86241
Approved by: https://github.com/ngimel
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/91b1bae1df1e72e17d2ab296845c214bc39422a0
Reviewed By: seemethere
Differential Revision: D40217961
Pulled By: seemethere
fbshipit-source-id: 33751b6f3b87f5e47816e146615870ac8bbbad87