Conversation

@ssnl ssnl commented Jan 6, 2018

  1. Adds torch.cuda.memory_cached, torch.cuda.max_memory_cached, torch.cuda.memory_allocated and torch.cuda.max_memory_allocated to provide per-device memory stats. These will be useful for monitoring and benchmarking.
  2. Adds two tests (single/multi-gpu) to test these four methods.

related issue: #1529
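
For reference, a minimal usage sketch of the four new functions (illustrative sizes; calls default to the current device):

import torch

x = torch.randn(1024, 1024).cuda()          # ~4 MB of float32 data on the current GPU
print(torch.cuda.memory_allocated())        # bytes currently occupied by tensors
print(torch.cuda.max_memory_allocated())    # peak tensor usage since the start of the program
print(torch.cuda.memory_cached())           # bytes held by the caching allocator
print(torch.cuda.max_memory_cached())       # peak cache size since the start of the program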

// peak memory allocated
static size_t max_memory_allocated;

// current memory allocated
static size_t memory_allocated;

"""Returns the current total GPU memory usage by tensors in bytes.
.. note:: This is likely less than the amount shown in `nvidia-smi` since
some unused memory can be held by the cached memory allocator and some

# comp > 0: increased
# comp = 0: equal
# comp < 0: decreased
nonlocal last_m, max_m

for i in range(int(N / 2)):
    x = tensors2[i].numel()
    del tensors2[i]
    assert_change(-x)

ezyang commented Jan 7, 2018

Great, very happy to see that this happened. :)

OOC, how did you check that you hit all of the necessary entry points to modify the allocator?

ssnl commented Jan 7, 2018

I'll update the PR to provide usage per GPU instead. :)

ssnl commented Jan 7, 2018

@ezyang I didn't do any tests, but I checked with @colesbury on that :)
Btw, should the nightly build fail?

yf225 commented Jan 7, 2018

Windows failure is the following:

22:45:58 C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-build\aten\src\THC\THCCachingAllocator.cpp(172): error C2039: 'max': is not a member of 'std'
22:45:58 C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.11.25503\include\unordered_map(15): note: see declaration of 'std'
22:45:58 C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-build\aten\src\THC\THCCachingAllocator.cpp(172): error C3861: 'max': identifier not found

To fix it we can add #include <algorithm>, according to https://social.msdn.microsoft.com/Forums/vstudio/en-US/f5915ad0-a9d1-49f3-8643-ffd623f72b93/error-c2039-max-is-not-a-member-of-std?forum=vcgeneral

ssnl commented Jan 7, 2018

@yf225 I'll add it, thanks!

ezyang commented Jan 7, 2018

I noticed that you added stats for the "residency" (the actual amount of live tensor data), but not for the memory that was actually allocated from the GPU. The latter is also a pretty useful stat to collect, since (1) it is what end users actually see when they query nvidia-smi, and (2) it is what determines whether you are actually out of memory. Fragmentation could mean we are wasting memory in the caching allocator that doesn't show up in the residency computation. (Though tensor allocations tend to be big, so this isn't a big deal in practice...)

There's tons of other stats we could add to the allocator but I guess no one needed them, so we ain't gonna add 'em :)
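
To make the distinction concrete, here is a rough sketch (not from this PR) of how residency and actual GPU allocation can diverge once both kinds of stats are exposed:

import torch

x = torch.randn(1024, 1024).cuda()
del x
# The tensor is gone, so residency drops back down...
print(torch.cuda.memory_allocated())   # ~0 bytes held by live tensors
# ...but the caching allocator keeps the block, so the memory is still
# allocated from the GPU and still shows up in nvidia-smi.
print(torch.cuda.memory_cached())      # still >= 4 MB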

ezyang commented Jan 7, 2018

Re CI failures, I think there is an OOD. Looking.

ezyang commented Jan 7, 2018

@pytorchbot retest this please

ssnl commented Jan 7, 2018

@ezyang I'd love to add actual allocated GPU usage (unused in cache + context), but I'm not sure how to count the size of the context. Perhaps just getting total cache size is fine for now?

Also, I actually think that total residency memory (i.e., the memory pointed to by tensors) is sometimes a better indication for OOM, because there can be a large amount of available memory in the cache: even when nvidia-smi shows little memory left, users may still be able to allocate large tensors.
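
A small sketch of the scenario described above (hypothetical sizes, just to illustrate):

import torch

big = torch.cuda.FloatTensor(256, 1024, 1024)   # ~1 GB live on the GPU
del big
# nvidia-smi still reports ~1 GB in use because the cache holds the freed block,
# but the next allocation of a similar size can reuse it without a new cudaMalloc:
again = torch.cuda.FloatTensor(256, 1024, 1024)
del again
# torch.cuda.empty_cache() returns the unused cached blocks to the driver, so
# nvidia-smi drops back toward the residency reported by memory_allocated().
torch.cuda.empty_cache()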

ezyang commented Jan 7, 2018

Yep, the number of blocks in the cache is the usual thing to use.

And yes, you're right, residency can be low while the cache is large, which is why residency is important too :)

apaszke commented Jan 7, 2018

One way to get more detailed stats would be to just go over all blocks and accumulate the information on demand, instead of keeping running stats. That's more expensive, but I'd assume these functions are mostly useful for one-off debugging anyway.
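
A tiny illustration of the tradeoff in plain Python (a hypothetical model of the allocator's block list, not the actual C++ code):

# Hypothetical block records standing in for the caching allocator's state.
blocks = [
    {"size": 512, "in_use": True},
    {"size": 2048, "in_use": False},   # cached but currently free
]

# On-demand accumulation: walk every block each time a stat is requested.
allocated_now = sum(b["size"] for b in blocks if b["in_use"])
cached_now = sum(b["size"] for b in blocks)

# Running stats (what the PR implements) update counters inside malloc/free
# instead, which is also the only way to capture a true peak value such as
# max_memory_allocated.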

*devPtr = (void*)block->ptr;

// update running total and peak on every successful allocation
memory_allocated += block->size;
max_memory_allocated = std::max(max_memory_allocated, memory_allocated);

ssnl commented Jan 7, 2018

@apaszke Yes, but the max memory allocated is also a useful stat, so I went with the running stats route.

ezyang commented Jan 7, 2018

Yeah, generally you want running stats if at all possible.

@ssnl force-pushed the cuda_mem branch 2 times, most recently from b18a68f to 1bb7125 on January 8, 2018 at 18:47

ssnl commented Jan 8, 2018

Added cache stats. Changed the functions to provide per-device stats.
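
A quick sketch of the per-device form (assuming the optional device index argument described above; requires at least two GPUs):

import torch

with torch.cuda.device(1):
    y = torch.randn(1024, 1024).cuda()   # lands on device 1

print(torch.cuda.memory_allocated(0))    # device 0, unaffected by y
print(torch.cuda.memory_allocated(1))    # device 1, includes y
print(torch.cuda.memory_cached(1))       # cache held by the allocator on device 1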

ssnl commented Jan 8, 2018

@pytorchbot retest this please

@ssnl force-pushed the cuda_mem branch 2 times, most recently from 159b2ec to 28958c4 on January 8, 2018 at 22:45

ssnl commented Jan 8, 2018

The build errors are fixed in #4544

ssnl commented Jan 9, 2018

@pytorchbot retest this please

def alloc(*size):
    with torch.cuda.device(device):
        # NOTE: do **not** use methods that will have additional
        # overhead, e.g., inplace random sampling methods.

assert_change(-1)
self.assertEqual(torch.cuda.memory_allocated(device), m0)

assert_change(0, empty_cache=True)

ssnl commented Jan 9, 2018

Again, the test failures are unrelated, and are fixed in #4544.
