Conversation

@zdevito (Contributor) commented Oct 4, 2022

Stack from ghstack (oldest at bottom):

We can currently take snapshots of the state of allocated CUDA memory, but we have no way to correlate those snapshots with the actions the allocator took between them. This PR adds a simple fixed-size buffer that records the major actions the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes them with the snapshot information. Capturing periodic snapshots with a large enough trace buffer makes it possible to see how the allocator state changes over time.
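The rolling trace can be sketched in pure Python (an illustrative toy only; the real buffer lives inside the C++ caching allocator, and the names `TRACE_BUFFER_SIZE` and `record_trace` here are hypothetical):

```python
from collections import deque

# Fixed-size ring buffer of allocator actions: once full, recording a new
# action silently evicts the oldest entry. "N" is the buffer size; making
# it configurable is listed as a TODO in this PR.
TRACE_BUFFER_SIZE = 4

trace = deque(maxlen=TRACE_BUFFER_SIZE)

def record_trace(action, addr, size):
    # Appending to a bounded deque drops the oldest entry automatically.
    trace.append({"action": action, "addr": addr, "size": size})

for i in range(6):
    record_trace("ALLOC", 0x1000 + i, 512)

# Only the 4 most recent of the 6 recorded actions survive.
assert len(trace) == 4 and trace[0]["addr"] == 0x1000 + 2
```

A snapshot would then bundle the current contents of the trace alongside the segment state, which is what lets periodic snapshots be correlated over time.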

We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm.

As part of this functionality, we also add the ability to register a callback that fires when the allocator is about to throw an OOM error, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).
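The observer idea can be sketched with a toy allocator (pure Python; `attach_oom_observer` and the capacity model are invented for illustration, but the callback signature mirrors the one exercised in this PR's test code):

```python
# Toy sketch of the OOM-observer hook. Observers run *before* the error is
# raised, so they can still inspect the allocator state that caused the OOM.
_oom_observers = []

def attach_oom_observer(fn):
    _oom_observers.append(fn)

def allocate(nbytes, capacity, used):
    if used + nbytes > capacity:
        for fn in _oom_observers:
            # Mirrors the callback arguments used in this PR's tests:
            # (device, alloc, device_alloc, device_free)
            fn(0, nbytes, used, capacity - used)
        raise MemoryError("out of memory")
    return used + nbytes

seen = []
attach_oom_observer(lambda dev, alloc, da, df: seen.append((dev, alloc, da, df)))
used = allocate(256, capacity=1024, used=0)
try:
    allocate(1024, capacity=1024, used=used)
except MemoryError:
    pass
print(seen)  # → [(0, 1024, 256, 768)]
```

In the real feature the callback would typically call the snapshot API, capturing the full trace at the moment of failure rather than after unwinding.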

This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots, distinguishing between internal and external fragmentation.

Adds a ring buffer to the memory history that keeps a rolling record of
the last N actions the allocator performed.

* _memory_viz is updated to print better information about segments
* _memory_viz can pretty print the trace

TODO:

* Make the N configurable, so the amount of data in the snapshot can be limited
* Fix race condition problems with exposing internal caching allocator structures.

[ghstack-poisoned]
@pytorch-bot bot commented Oct 4, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86241

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9903d85:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

zdevito added a commit that referenced this pull request Oct 5, 2022
Adds a ring-buffer to the memory history that keeps a rolling buffer of
the last N actions that the allocator did.

* _memory_viz is updated to print better information about segments
* _memory_viz can pretty print the trace
* Can control whether we capture context information and attach to the
  trace.
ghstack-source-id: 36c6749
Pull Request resolved: #86241
@zdevito zdevito requested a review from ngimel October 5, 2022 17:38
};

struct TraceEntry {
  enum Action { ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, SNAPSHOT, OOM };
Collaborator
can you leave small comments about what each of the actions mean?
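For readers of this page, the actions can be glossed from the PR description (an editor's paraphrase, not authoritative documentation):

```python
# Summary of the TraceEntry actions added by this PR. The descriptions are
# paraphrased from the PR text, not taken from the source comments.
TRACE_ACTIONS = {
    "ALLOC": "a block of memory was handed out to the caller",
    "FREE": "a block was returned to the caching allocator",
    "SEGMENT_ALLOC": "a new segment was obtained from the device (cudaMalloc)",
    "SEGMENT_FREE": "a segment was returned to the device (cudaFree)",
    "SNAPSHOT": "a memory snapshot was taken at this point in the trace",
    "OOM": "the allocator was about to raise an out-of-memory error",
}

assert len(TRACE_ACTIONS) == 6
```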

x = False

def cb(device, alloc, device_alloc, device_free):
    print(device, alloc, device_alloc, device_free)
Collaborator
debug print?

});
if (block->history) {
  record_trace(
      TraceEntry::FREE,
Collaborator
I'm uncertain whether FREE should be recorded here. True, this is where the FREE request happens, but the actual freeing might not occur if other streams are still using this block (see the check below), so the trace would be confusing: it would imply that memory was freed when in fact it wasn't.

Contributor Author

That is a good point. I'm going to change the trace events to record both 'free_requested' and 'free_completed'. Keeping free_requested separate is still useful when experimenting with other allocator behavior, where a trace of what the user code called is needed to simulate a different allocator.
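The proposed split can be sketched in a toy model (pure Python, illustrative only; the cross-stream check is simplified to a boolean):

```python
trace = []

def free_block(block, in_use_by_other_streams):
    # Always record that user code asked for the free...
    trace.append(("free_requested", block))
    if in_use_by_other_streams:
        # ...but if another stream still uses the block, the actual free is
        # deferred, so no free_completed event is recorded yet.
        return
    trace.append(("free_completed", block))

free_block("A", in_use_by_other_streams=False)
free_block("B", in_use_by_other_streams=True)
print(trace)
# → [('free_requested', 'A'), ('free_completed', 'A'), ('free_requested', 'B')]
```

A simulator replaying only the free_requested events then sees exactly what user code called, while the free_completed events show what the allocator actually did.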

del x
torch.cuda.empty_cache()
ss = torch.cuda.memory._snapshot()
assert(ss['device_traces'][0][-1]['action'] == 'segment_free')
Collaborator
self.assertTrue

@zdevito zdevito requested a review from ngimel October 7, 2022 00:19
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 7, 2022
@zdevito (Contributor Author) commented Oct 7, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 additional jobs have failed, first few of them are: trunk


@zdevito (Contributor Author) commented Oct 7, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 additional jobs have failed, first few of them are: trunk


@zdevito (Contributor Author) commented Oct 7, 2022

@pytorchbot merge -f "tests appear to be passing but thinks a green check didn't pass"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).


@pytorchmergebot (Collaborator)

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 3e1443115436551e161baecb8f7e3a8f22787590 returned non-zero exit code 1

Auto-merging test/test_cuda.py
Auto-merging torch/_C/__init__.pyi.in
CONFLICT (content): Merge conflict in torch/_C/__init__.pyi.in
Auto-merging torch/csrc/cuda/Module.cpp
CONFLICT (content): Merge conflict in torch/csrc/cuda/Module.cpp
Auto-merging torch/cuda/memory.py
CONFLICT (content): Merge conflict in torch/cuda/memory.py
error: could not apply 3e14431154... Adds a ring-buffer to the memory history that keeps a rolling buffer of
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".

@zdevito (Contributor Author) commented Oct 7, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@github-actions bot commented Oct 7, 2022

Hey @zdevito.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Oct 10, 2022
Pull Request resolved: #86241
Approved by: https://github.com/ngimel

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/91b1bae1df1e72e17d2ab296845c214bc39422a0

Reviewed By: seemethere

Differential Revision: D40217961

Pulled By: seemethere

fbshipit-source-id: 33751b6f3b87f5e47816e146615870ac8bbbad87
@facebook-github-bot facebook-github-bot deleted the gh/zdevito/190/head branch June 8, 2023 19:26
Labels

ciflow/trunk (Trigger trunk jobs on your pull request), cla signed, Merged
