
Conversation

colesbury (Member) commented Nov 8, 2018

The new error message now looks like (from Python):

  RuntimeError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 11.93 GiB total capacity; 4.00 GiB already allocated; 7.33 GiB free; 179.00 KiB cached)

Summary of terms:

  "total capacity": total global memory on GPU
  "already allocated": memory allocated by the program using the
                       caching allocator
  "free": free memory as reported by the CUDA API
  "cached": memory held by the allocator but not used by the program
 
  The "allocated" amount  does not include memory allocated outside
  of the caching allocator, such as memory allocated by other programs
  or memory held by the driver.
 
  The sum of "allocated" + "free" + "cached" may be less than the
  total capacity due to memory held by the driver and usage by other
  programs.

  Note that at this point cuda_malloc_retry has already returned all
  possible "cached" memory to the driver. The only remaining "cached"
  memory is split from a larger block that is partially in-use.

This also fixes an issue where an out-of-memory error could cause an unrelated subsequent CUDA kernel launch to fail because cudaGetLastError() was not cleared.
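
To make that second fix concrete, here is a minimal standalone sketch (a hypothetical repro, not the PR's code) of why the stale error left by a failed cudaMalloc has to be cleared:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // A failed allocation records cudaErrorMemoryAllocation as the CUDA
  // runtime's "last error"; it stays pending until something reads it.
  void* p = nullptr;
  cudaError_t err = cudaMalloc(&p, (size_t)1 << 62);  // absurd size: fails
  printf("cudaMalloc: %s\n", cudaGetErrorString(err));

  // Without this call, the next cudaGetLastError() -- for example, the
  // status check after an unrelated kernel launch -- would return the
  // stale error and make that launch appear to have failed.
  cudaGetLastError();  // clear the pending CUDA error

  printf("pending error now: %s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}
```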

soumith (Contributor) commented Nov 8, 2018

I think "allocated" here is confusing, because it conflicts with "Tried to allocate". Would it make sense to say:

GPU 0; 11.93 GiB total capacity; 4.00 GiB already allocated; 179.00 KiB cached


  Block search_key(device, stream, size);
- auto& free_blocks = small ? large_blocks : small_blocks;
+ auto& free_blocks = small ? small_blocks : large_blocks;
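
For context, a simplified sketch of the pool selection this hunk corrects (the struct and pool names mirror the fragment above, but the layout is hypothetical, not the allocator's exact code); a small request must search the small-block pool, which is what the fixed ternary does:

```cpp
#include <cstddef>
#include <set>

// Hypothetical, simplified shape of the caching allocator's bookkeeping.
struct Block {
  int device;
  long stream;  // stands in for cudaStream_t in this sketch
  size_t size;
};

struct BlockComparator {
  bool operator()(const Block* a, const Block* b) const {
    if (a->device != b->device) return a->device < b->device;
    if (a->stream != b->stream) return a->stream < b->stream;
    return a->size < b->size;
  }
};

using BlockPool = std::set<Block*, BlockComparator>;

BlockPool large_blocks;  // cached blocks above the small-request cutoff
BlockPool small_blocks;  // cached blocks at or below the cutoff

// The corrected line: small requests search small_blocks, large requests
// search large_blocks. The pre-fix version had the two pools swapped.
BlockPool& pool_for(bool small) {
  return small ? small_blocks : large_blocks;
}
```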


cudaGetLastError(); // clear CUDA error

cudaDeviceProp prop;
AT_CUDA_CHECK(cudaGetDeviceProperties(&prop, device));
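
For illustration, a standalone sketch (plain CUDA runtime calls, with ATen's AT_CUDA_CHECK macro dropped so it compiles on its own) of where the message's "total capacity" and "free" figures come from:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int device = 0;
  cudaSetDevice(device);

  // "total capacity": total global memory on the GPU.
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);

  // "free": free memory as reported by the CUDA API.
  size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);

  const double gib = 1024.0 * 1024.0 * 1024.0;
  printf("GPU %d; %.2f GiB total capacity; %.2f GiB free\n",
         device, prop.totalGlobalMem / gib, free_bytes / gib);
  return 0;
}
```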


facebook-github-bot (Contributor) left a comment


@colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 9, 2018
Summary: (same as the PR description above)
Pull Request resolved: pytorch/pytorch#13751

Differential Revision: D13007177

Pulled By: colesbury

fbshipit-source-id: ea7121461b3f2a34646102959b45bde19f2fabab
ezyang added the merged label Jun 25, 2019