Add env variable to bypass CUDACachingAllocator for debugging #45294
Conversation
Summary: While tracking down a recent memory corruption bug we found that cuda-memcheck wasn't finding the bad accesses, and @ngimel pointed out that it's because we use a caching allocator, so a lot of "out of bounds" accesses land in a valid slab.

This PR adds a runtime knob (`PYTORCH_CUDA_DEBUG_MEMORY`) that, when set, bypasses the caching allocator's caching logic so that allocations go straight to cudaMalloc. This way, cuda-memcheck will actually work.

Test Plan: Insert some memory errors and run a test under cuda-memcheck; observe that cuda-memcheck flags an error where expected. Specifically, I removed the output-masking logic here: https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826

And ran:

```
PYTORCH_CUDA_DEBUG_MEMORY=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```
ghstack-source-id: 5a28c87
Pull Request resolved: #45294
So this is a pretty minimal debug path for the CUDA allocator; I'd love some advice on whether this is a reasonable approach, and whether there's anything else I should be doing here. Suggestions for perf testing would also be great. I think we have an operator-overhead benchmark that I'll try out; anything else? It seems like a perf-sensitive path, so with any luck this doesn't slow it down.
Codecov Report

```
@@         Coverage Diff          @@
##  gh/bertmaher/23/base  #45294  +/-  ##
==========================================
  Coverage    68.05%   68.05%
==========================================
  Files          396      393      -3
  Lines        51232    50914    -318
==========================================
- Hits         34864    34651    -213
+ Misses       16368    16263    -105
```

Continue to review the full report at Codecov.
dzhulgakov left a comment:
Looks good. Maybe call it `PYTORCH_NO_CUDA_MEMORY_CACHING` or something like that? "Debugging" implies some fancy tool/report.
Yeah, this looks fine, but I agree with Dmytro on the env var renaming. You should also think about where to document this option.
This section, https://pytorch.org/docs/master/cuda.html#memory-management, looks like a reasonable place for the documentation.
Summary: While tracking down a recent memory corruption bug we found that cuda-memcheck wasn't finding the bad accesses, and @ngimel pointed out that it's because we use a caching allocator, so a lot of "out of bounds" accesses land in a valid slab.

This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set, bypasses the caching allocator's caching logic so that allocations go straight to cudaMalloc. This way, cuda-memcheck will actually work.

Test Plan: Insert some memory errors and run a test under cuda-memcheck; observe that cuda-memcheck flags an error where expected. Specifically, I removed the output-masking logic here: https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826

And ran:

```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```

ghstack-source-id: 6b44289
Pull Request resolved: #45294
@bertmaher merged this pull request in 03342af.
Stack from ghstack:
Differential Revision: D23964734