Caching allocator tracing #86241
Conversation
Adds a ring buffer to the memory history that keeps a rolling record of the last N actions the allocator performed.
* _memory_viz is updated to print better information about segments
* _memory_viz can pretty-print the trace
TODO:
* Make N configurable, so the amount of data in the snapshot can be limited
* Fix race conditions when exposing internal caching allocator structures.
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86241
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 9903d85.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Adds a ring buffer to the memory history that keeps a rolling record of the last N actions the allocator performed.
* _memory_viz is updated to print better information about segments
* _memory_viz can pretty-print the trace
* Can control whether we capture context information and attach it to the trace.
ghstack-source-id: 36c6749
Pull Request resolved: #86241
We currently can take snapshots of the state of allocated CUDA memory, but we do not have a way to correlate these snapshots with the actions the allocator took between snapshots. This PR adds a simple fixed-size buffer that records the major actions the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes them with the snapshot information. Capturing periodic snapshots with a large enough trace buffer makes it possible to see how the allocator state changes over time.

We plan to use this functionality to guide how settings in the allocator can be adjusted, and eventually to build a more robust overall algorithm.

As a component of this functionality, we also add the ability to get a callback when the allocator is about to throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).

This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots, distinguishing between internal and external fragmentation.
[ghstack-poisoned]
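The "rolling buffer of the last N actions" behavior described above can be sketched in pure Python. This is an illustrative stand-in, not the PR's C++ implementation: the class, method, and field names here are hypothetical, though the action names and the idea that a snapshot is itself a recorded event come from the PR description.

```python
from collections import deque

class TraceBuffer:
    """Illustrative sketch of a fixed-size allocator trace buffer."""

    def __init__(self, maxlen=128):
        # deque(maxlen=...) drops the oldest entry once the buffer is
        # full, giving the rolling "last N actions" behavior.
        self.entries = deque(maxlen=maxlen)

    def record(self, action, addr=0, size=0):
        self.entries.append({"action": action, "addr": addr, "size": size})

    def snapshot(self):
        # Taking a snapshot is itself one of the recorded actions
        # (SNAPSHOT), and the snapshot carries the trace collected so far.
        self.record("snapshot")
        return list(self.entries)

buf = TraceBuffer(maxlen=3)
buf.record("alloc", addr=0x1000, size=512)
buf.record("alloc", addr=0x2000, size=1024)
buf.record("free", addr=0x1000, size=512)
trace = buf.snapshot()  # oldest entry is evicted when SNAPSHOT is appended
```

With a buffer of size 3, appending the SNAPSHOT entry evicts the first ALLOC, so the returned trace holds only the most recent three actions.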
Adds a ring buffer to the memory history that keeps a rolling record of the last N actions the allocator performed.
* _memory_viz is updated to print better information about segments
* _memory_viz can pretty-print the trace
* Can control whether we capture context information and attach it to the trace.
ghstack-source-id: 04758de
Pull Request resolved: #86241
c10/cuda/CUDACachingAllocator.h
Outdated
};

struct TraceEntry {
  enum Action { ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, SNAPSHOT, OOM };
Can you leave short comments explaining what each of the actions means?
test/test_cuda.py
Outdated
x = False

def cb(device, alloc, device_alloc, device_free):
    print(device, alloc, device_alloc, device_free)
debug print?
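The callback signature in the test above matches the OOM-observer hook the PR describes ("a callback when the allocator is about to throw an OOM"). A pure-Python sketch of how such observers might be invoked, with hypothetical function names standing in for the real C++ hook:

```python
observers = []

def attach_oom_observer(cb):
    """Register a callback fired just before an out-of-memory error is
    raised (illustrative stand-in for the hook this PR adds)."""
    observers.append(cb)

def raise_oom(device, alloc, device_alloc, device_free):
    # Notify observers *before* raising, so they can snapshot allocator
    # state while the failing allocation's context is still live; after
    # the exception propagates, C++ destructors may already have freed
    # the tensors that caused the OOM.
    for cb in observers:
        cb(device, alloc, device_alloc, device_free)
    raise MemoryError(f"tried to allocate {alloc} bytes on device {device}")

seen = []
attach_oom_observer(lambda device, alloc, da, df: seen.append((device, alloc)))
try:
    raise_oom(0, 512, 4096, 128)
except MemoryError:
    pass
```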
c10/cuda/CUDACachingAllocator.cpp
Outdated
});
if (block->history) {
  record_trace(
      TraceEntry::FREE,
I'm uncertain whether FREE should be recorded here. True, this is where the FREE request happens, but the actual freeing might not happen if other streams are using this block (see the check below), so the trace would be confusing: it would imply that memory was freed when in fact it wasn't.
That is a good point. I'm going to change the trace events to record both 'free_requested' and 'free_completed'. Keeping free_requested separate is still useful when experimenting with other allocator behaviors, where a trace of what the user code called is needed to simulate a different allocator.
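The two-event scheme discussed in this thread can be sketched as follows (a pure-Python illustration; the real logic lives in CUDACachingAllocator.cpp, and the dict keys here are hypothetical):

```python
def free_block(block, trace):
    """Always record that the user requested a free, but record completion
    only when no other streams still reference the block."""
    trace.append(("free_requested", block["addr"]))
    if block["stream_uses"]:
        # Another stream still uses this memory; the real allocator
        # defers the actual free until those stream events complete.
        return False
    trace.append(("free_completed", block["addr"]))
    return True

trace = []
free_block({"addr": 0x1000, "stream_uses": {"stream1"}}, trace)  # deferred
free_block({"addr": 0x2000, "stream_uses": set()}, trace)        # completed
```

The busy block contributes only a free_requested event, so a trace reader can distinguish what user code asked for from what the allocator actually did.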
test/test_cuda.py
Outdated
del x
torch.cuda.empty_cache()
ss = torch.cuda.memory._snapshot()
assert(ss['device_traces'][0][-1]['action'] == 'segment_free')
self.assertTrue
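Assuming the snapshot shape exercised in the test above (a per-device list of trace entries under 'device_traces'), a small post-processing helper might look like this. The helper name is hypothetical and is not part of _memory_viz.py:

```python
def last_action(snapshot, device=0):
    """Return the most recent trace action recorded for a device,
    mirroring the ss['device_traces'][device][-1]['action'] access
    in the test above."""
    traces = snapshot.get("device_traces", [])
    if device >= len(traces) or not traces[device]:
        return None
    return traces[device][-1]["action"]

# A dict shaped like the torch.cuda.memory._snapshot() output
# that the test above inspects.
ss = {"device_traces": [[{"action": "alloc"}, {"action": "segment_free"}]]}
```

For example, last_action(ss) yields the final recorded action for device 0, and an out-of-range device index yields None rather than raising.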
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 additional job has failed; the first few of them are: trunk. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 additional job has failed; the first few of them are: trunk. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -f "tests appear to be passing but thinks a green check didn't pass"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Command. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hey @zdevito. |
Summary: We currently can take snapshots of the state of allocated CUDA memory, but we do not have a way to correlate these snapshots with the actions the allocator took between snapshots. This PR adds a simple fixed-size buffer that records the major actions the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes them with the snapshot information. Capturing periodic snapshots with a large enough trace buffer makes it possible to see how the allocator state changes over time. We plan to use this functionality to guide how settings in the allocator can be adjusted, and eventually to build a more robust overall algorithm. As a component of this functionality, we also add the ability to get a callback when the allocator is about to throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught). This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots, distinguishing between internal and external fragmentation.
Pull Request resolved: #86241
Approved by: https://github.com/ngimel
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/91b1bae1df1e72e17d2ab296845c214bc39422a0
Reviewed By: seemethere
Differential Revision: D40217961
Pulled By: seemethere
fbshipit-source-id: 33751b6f3b87f5e47816e146615870ac8bbbad87