Methods for checking CUDA memory usage #4511
Conversation
aten/src/THC/THCCachingAllocator.cpp
Outdated
static size_t max_memory_allocated;

// current memory allocated
static size_t memory_allocated;
This comment was marked as off-topic.
torch/cuda/__init__.py
Outdated
| """Returns the current total GPU memory usage by tensors in bytes. | ||
| .. note:: This is likely less than the amount shown in `nvidia-smi` since | ||
| some unused memory can be held by the cached memory allocator and some |
This comment was marked as off-topic.
test/test_cuda.py
Outdated
# comp > 0: increased
# comp = 0: equal
# comp < 0: decreased
nonlocal last_m, max_m
This comment was marked as off-topic.
test/test_cuda.py
Outdated
for i in range(int(N / 2)):
    x = tensors2[i].numel()
    del tensors2[i]
    assert_change(-x)
This comment was marked as off-topic.
Great, very happy to see that this happened. :) OOC, how did you check that you hit all of the necessary entry points to modify the allocator?
I'll update the PR to provide usage per GPU instead. :)
@ezyang I didn't do any tests, but I checked with @colesbury on that :)
The Windows failure is the following: To fix it we can add
@yf225 I'll add it, thanks!
I noticed that you added stats for the "residency" (the actual amount of live data), but not for the memory that was actually allocated from the GPU. The latter is also a pretty useful stat to collect, since (1) it is the number end users will actually see when they query nvidia-smi, and (2) it is what determines whether you are actually out of memory or not. Fragmentation could mean we are wasting memory in the caching allocator, which doesn't show up in the residency computation. (Though, tensor allocations tend to be big, so this isn't a big deal in practice...) There are tons of other stats we could add to the allocator, but I guess no one needed them, so we ain't gonna add 'em :)
Re CI failures, I think there is an OOD. Looking.
@pytorchbot retest this please
@ezyang I'd love to add the actual allocated GPU usage (unused memory in cache + context), but I'm not sure how to count the size of the context. Perhaps just getting the total cache size is fine for now? Also, I actually think that total residency memory (i.e., memory pointed to by tensors) is sometimes a better indication for OOM, because there can be a large amount of available memory in the cache: even when nvidia-smi shows little memory left, users may still be able to allocate large tensors.
Yep, the number of blocks in the cache is the usual thing to use. And yes, you're right, residency can be low while the cache is large, which is why residency is important too :)
One way to get more detailed stats would be to just go over all blocks and accumulate the information then, instead of keeping running stats. That's more expensive, but I'd assume these functions are mostly useful for one-off debugging.
aten/src/THC/THCCachingAllocator.cpp
Outdated
*devPtr = (void*)block->ptr;

memory_allocated += block->size;
max_memory_allocated = std::max(max_memory_allocated, memory_allocated);
This comment was marked as off-topic.
@apaszke Yes, but the max memory allocated is also a useful stat, so I went with the running stats route.
Yeah, generally you want running stats if at all possible.
torch/cuda/__init__.py
Outdated
This comment was marked as off-topic.
torch/_tensor_docs.py
Outdated
This comment was marked as off-topic.
Force-pushed from b18a68f to 1bb7125
Added cache stats. Changed the functions to provide per-device stats.
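A hypothetical monitoring snippet using the per-device form, assuming each of the four functions accepts a device index and defaults to the current device:

```python
import torch

# Print all four stats for every visible GPU.
for dev in range(torch.cuda.device_count()):
    print("cuda:{}: allocated={} max_allocated={} cached={} max_cached={}".format(
        dev,
        torch.cuda.memory_allocated(dev),
        torch.cuda.max_memory_allocated(dev),
        torch.cuda.memory_cached(dev),
        torch.cuda.max_memory_cached(dev),
    ))
```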
@pytorchbot retest this please
Force-pushed from 159b2ec to 28958c4
The build errors are fixed in #4544
docs/source/notes/cuda.rst
Outdated
This comment was marked as off-topic.
@pytorchbot retest this please
test/test_cuda.py
Outdated
def alloc(*size):
    with torch.cuda.device(device):
        # NOTE: do **not** use methods that will have additional
        # overhead, e.g., inplace random sampling methods.
This comment was marked as off-topic.
assert_change(-1)
self.assertEqual(torch.cuda.memory_allocated(device), m0)

assert_change(0, empty_cache=True)
This comment was marked as off-topic.
Again, the test failures are unrelated and are fixed in #4544.
Adds torch.cuda.memory_cached, torch.cuda.max_memory_cached, torch.cuda.memory_allocated, and torch.cuda.max_memory_allocated to provide per-device memory stats. These will be useful for monitoring and benchmarking.

Related issue: #1529