Conversation

@ssnl ssnl commented Jan 6, 2018

  1. Adds torch.cuda.memory_cached, torch.cuda.max_memory_cached, torch.cuda.memory_allocated and torch.cuda.max_memory_allocated to provide per-device memory stats. These will be useful for monitoring and benchmarking.
  2. Adds two tests (single/multi-gpu) to test these four methods.

related issue: #1529
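
For reference, a minimal usage sketch of the four new functions (illustrative sizes; calls default to the current device):

import torch

x = torch.randn(1024, 1024).cuda()          # ~4 MB of float32 data on the current GPU
print(torch.cuda.memory_allocated())        # bytes currently occupied by tensors
print(torch.cuda.max_memory_allocated())    # peak tensor usage since the start of the program
print(torch.cuda.memory_cached())           # bytes held by the caching allocator
print(torch.cuda.max_memory_cached())       # peak cache size since the start of the program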

// peak memory allocated
static size_t max_memory_allocated;

// current memory allocated
static size_t memory_allocated;

"""Returns the current total GPU memory usage by tensors in bytes.
.. note:: This is likely less than the amount shown in `nvidia-smi` since
some unused memory can be held by the cached memory allocator and some

# comp > 0: increased
# comp = 0: equal
# comp < 0: decreased
nonlocal last_m, max_m

for i in range(int(N / 2)):
    x = tensors2[i].numel()
    del tensors2[i]
    assert_change(-x)

ezyang commented Jan 7, 2018

Great, very happy to see that this happened. :)

OOC, how did you check that you hit all of the necessary entry points to modify the allocator?

ssnl commented Jan 7, 2018

I'll update the PR to provide usage per GPU instead. :)

ssnl commented Jan 7, 2018

@ezyang I didn't do any tests, but I checked with @colesbury on that :)
Btw, should the nightly build fail?

yf225 commented Jan 7, 2018

Windows failure is the following:

22:45:58 C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-build\aten\src\THC\THCCachingAllocator.cpp(172): error C2039: 'max': is not a member of 'std'
22:45:58 C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.11.25503\include\unordered_map(15): note: see declaration of 'std'
22:45:58 C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-build\aten\src\THC\THCCachingAllocator.cpp(172): error C3861: 'max': identifier not found

To fix it we can add #include <algorithm>, according to https://social.msdn.microsoft.com/Forums/vstudio/en-US/f5915ad0-a9d1-49f3-8643-ffd623f72b93/error-c2039-max-is-not-a-member-of-std?forum=vcgeneral

ssnl commented Jan 7, 2018

@yf225 I'll add it, thanks!

ezyang commented Jan 7, 2018

I noticed that you added stats for the "residency" (the actual amount of live tensor data), but not for the memory that was actually allocated from the GPU. The latter is also a pretty useful stat to collect, since (1) it is what end users actually see when they query nvidia-smi, and (2) it is what determines whether you are actually out of memory. Fragmentation could mean we are wasting memory in the caching allocator that doesn't show up in the residency computation. (Though tensor allocations tend to be big, so this isn't a big deal in practice...)

There's tons of other stats we could add to the allocator but I guess no one needed them, so we ain't gonna add 'em :)
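
To make the distinction concrete, here is a rough sketch (not from this PR) of how residency and actual GPU allocation can diverge once both kinds of stats are exposed:

import torch

x = torch.randn(1024, 1024).cuda()
del x
# The tensor is gone, so residency drops back down...
print(torch.cuda.memory_allocated())   # ~0 bytes held by live tensors
# ...but the caching allocator keeps the block, so the memory is still
# allocated from the GPU and still shows up in nvidia-smi.
print(torch.cuda.memory_cached())      # still >= 4 MB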

ezyang commented Jan 7, 2018

Re CI failures, I think there is an OOD. Looking.

ezyang commented Jan 7, 2018

@pytorchbot retest this please

ssnl commented Jan 7, 2018

@ezyang I'd love to add actual allocated GPU usage (unused in cache + context), but I'm not sure how to count the size of the context. Perhaps just getting total cache size is fine for now?

Also, I actually think that total residency memory (i.e., the memory pointed to by tensors) is sometimes a better indication for OOM, because there can be a large amount of available memory in the cache: even when nvidia-smi shows little memory left, users may still be able to allocate large tensors.
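
A small sketch of the scenario described above (hypothetical sizes, just to illustrate):

import torch

big = torch.cuda.FloatTensor(256, 1024, 1024)   # ~1 GB live on the GPU
del big
# nvidia-smi still reports ~1 GB in use because the cache holds the freed block,
# but the next allocation of a similar size can reuse it without a new cudaMalloc:
again = torch.cuda.FloatTensor(256, 1024, 1024)
del again
# torch.cuda.empty_cache() returns the unused cached blocks to the driver, so
# nvidia-smi drops back toward the residency reported by memory_allocated().
torch.cuda.empty_cache()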

ezyang commented Jan 7, 2018

Yep, the number of blocks in the cache is the usual thing to use.

And yes, you're right, residency can be low while the cache is large, which is why residency is important too :)

apaszke commented Jan 7, 2018

One way to get more detailed stats would be to just go over all blocks and accumulate the information on demand, instead of keeping running stats. That's more expensive, but I'd assume these functions are mostly useful for one-off debugging anyway.
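
A tiny illustration of the tradeoff in plain Python (a hypothetical model of the allocator's block list, not the actual C++ code):

# Hypothetical block records standing in for the caching allocator's state.
blocks = [
    {"size": 512, "in_use": True},
    {"size": 2048, "in_use": False},   # cached but currently free
]

# On-demand accumulation: walk every block each time a stat is requested.
allocated_now = sum(b["size"] for b in blocks if b["in_use"])
cached_now = sum(b["size"] for b in blocks)

# Running stats (what the PR implements) update counters inside malloc/free
# instead, which is also the only way to capture a true peak value such as
# max_memory_allocated.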

*devPtr = (void*)block->ptr;

// update running total and peak on every successful allocation
memory_allocated += block->size;
max_memory_allocated = std::max(max_memory_allocated, memory_allocated);

ssnl commented Jan 7, 2018

@apaszke Yes, but the max memory allocated is also a useful stat, so I went with the running stats route.

ezyang commented Jan 7, 2018

Yeah, generally you want running stats if at all possible.

@ssnl force-pushed the cuda_mem branch 2 times, most recently from b18a68f to 1bb7125 on January 8, 2018 at 18:47

ssnl commented Jan 8, 2018

Added cache stats. Changed the functions to provide per-device stats.
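
A quick sketch of the per-device form (assuming the optional device index argument described above; requires at least two GPUs):

import torch

with torch.cuda.device(1):
    y = torch.randn(1024, 1024).cuda()   # lands on device 1

print(torch.cuda.memory_allocated(0))    # device 0, unaffected by y
print(torch.cuda.memory_allocated(1))    # device 1, includes y
print(torch.cuda.memory_cached(1))       # cache held by the allocator on device 1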

ssnl commented Jan 8, 2018

@pytorchbot retest this please

@ssnl force-pushed the cuda_mem branch 2 times, most recently from 159b2ec to 28958c4 on January 8, 2018 at 22:45

ssnl commented Jan 8, 2018

The build errors are fixed in #4544

ssnl commented Jan 9, 2018

@pytorchbot retest this please

def alloc(*size):
    with torch.cuda.device(device):
        # NOTE: do **not** use methods that will have additional
        # overhead, e.g., inplace random sampling methods.

assert_change(-1)
self.assertEqual(torch.cuda.memory_allocated(device), m0)

assert_change(0, empty_cache=True)

ssnl commented Jan 9, 2018

Again, the test failures are unrelated, and are fixed in #4544.
