Improve CUDA out-of-memory error message #13751
Conversation
The new error message now looks like (from Python):

RuntimeError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 11.93 GiB total; 4.00 GiB allocated; 179.00 KiB cached)

Summary of terms:
"total": total global memory on GPU
"allocated": memory allocated by the program using the caching allocator
"cached": memory held by the allocator but not used by the program

The "allocated" amount does not include memory allocated outside of the caching allocator, such as memory allocated by other programs or memory held by the driver.

Note that at this point cuda_malloc_retry has already returned all possible "cached" memory to the driver. The only remaining "cached" memory is split from a larger block that is partially in-use.
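As a rough illustration of how the message surfaces in Python (a sketch; the tensor shape is an arbitrary size chosen to exceed any current GPU's memory):

```
import torch

try:
    # 1024*1024*1024*16 float32 elements is 64 GiB; large enough to fail.
    x = torch.empty(1024, 1024, 1024, 16, device="cuda")
except RuntimeError as e:
    # Prints the new message, e.g.:
    # CUDA out of memory. Tried to allocate 64.00 GiB (GPU 0; ...)
    print(e)
```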
```
  Block search_key(device, stream, size);
- auto& free_blocks = small ? large_blocks : small_blocks;
+ auto& free_blocks = small ? small_blocks : large_blocks;
```
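For context, this conditional picks which free-block pool the caching allocator searches for a given request. A minimal Python sketch of that selection (the 1 MiB small/large cutoff here is an assumption for illustration, not necessarily the allocator's exact constant):

```
SMALL_SIZE = 1024 * 1024  # assumed cutoff between "small" and "large" requests

small_blocks = []  # free blocks held in the small-allocation pool
large_blocks = []  # free blocks held in the large-allocation pool

def free_pool_for(size):
    small = size <= SMALL_SIZE
    # The fix: a small request must search the small pool (and a large
    # request the large pool); the old code had the two pools swapped.
    return small_blocks if small else large_blocks
```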
aten/src/THC/THCCachingAllocator.cpp
```
cudaGetLastError(); // clear CUDA error
```
```
cudaDeviceProp prop;
AT_CUDA_CHECK(cudaGetDeviceProperties(&prop, device));
```
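The per-device numbers that feed the message can also be read from Python; a sketch (torch.cuda.memory_cached was the name of the cached-memory query around the time of this PR; it was renamed in later releases):

```
import torch

device = 0
props = torch.cuda.get_device_properties(device)
print(f"total capacity: {props.total_memory / 2**30:.2f} GiB")
print(f"allocated:      {torch.cuda.memory_allocated(device) / 2**30:.2f} GiB")
print(f"cached:         {torch.cuda.memory_cached(device) / 2**30:.2f} GiB")
```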
facebook-github-bot left a comment
@colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
```
The new error message now looks like (from Python):
RuntimeError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 11.93 GiB total capacity; 4.00 GiB already allocated; 7.33 GiB free; 179.00 KiB cached)
Summary of terms:
"total capacity": total global memory on GPU
"already allocated": memory allocated by the program using the
caching allocator
"free": free memory as reported by the CUDA API
"cached": memory held by the allocator but not used by the program
The "allocated" amount does not include memory allocated outside
of the caching allocator, such as memory allocated by other programs
or memory held by the driver.
The sum of "allocated" + "free" + "cached" may be less than the
total capacity due to memory held by the driver and usage by other
programs.
Note that at this point cuda_malloc_retry has already returned all
possible "cached" memory to the driver. The only remaining "cached"
memory is split from a larger block that is partially in-use.
```
This also fixes an issue where an out-of-memory error could cause an unrelated subsequent CUDA kernel launch to fail because `cudaGetLastError()` was not cleared.
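A sketch of the behavior this fixes (the allocation size is an arbitrary value chosen to trigger the failure):

```
import torch

try:
    torch.empty(1 << 36, device="cuda")  # 256 GiB of float32; will fail
except RuntimeError:
    pass  # out of memory, caught and handled

# Before the fix, the sticky cudaErrorMemoryAllocation left by the failed
# cudaMalloc could make this unrelated launch fail; now it succeeds.
y = torch.ones(8, device="cuda") + 1
```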
Pull Request resolved: pytorch/pytorch#13751
Differential Revision: D13007177
Pulled By: colesbury
fbshipit-source-id: ea7121461b3f2a34646102959b45bde19f2fabab