
Conversation

@bertmaher
Contributor

bertmaher commented Sep 24, 2020

Stack from ghstack:

Summary: While tracking down a recent memory corruption bug, we found that
cuda-memcheck wasn't finding the bad accesses, and @ngimel pointed out that
this is because we use a caching allocator: many "out of bounds" accesses
still land inside a valid slab.

This PR adds a runtime knob (`PYTORCH_CUDA_DEBUG_MEMORY`) that, when set,
bypasses the caching allocator's caching logic so that allocations go straight
to cudaMalloc. This way, cuda-memcheck will actually work.
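
For illustration, here is a minimal sketch of the bypass pattern. This is not the actual CUDACachingAllocator change; the helper names are hypothetical, and the knob name follows the description above (it gets renamed later in this thread).

```
// Sketch only: hypothetical helpers showing how an env-var knob can route
// allocations around a caching layer. Not the real CUDACachingAllocator code.
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Read the knob once; any value enables the bypass.
static bool force_uncached_allocations() {
  static const bool force =
      std::getenv("PYTORCH_CUDA_DEBUG_MEMORY") != nullptr;
  return force;
}

// With the knob set, allocations skip the caching pools and go straight to
// the driver, so cuda-memcheck sees the true bounds of each buffer.
void* raw_alloc(std::size_t nbytes) {
  void* ptr = nullptr;
  if (force_uncached_allocations()) {
    cudaError_t err = cudaMalloc(&ptr, nbytes);
    if (err != cudaSuccess) {
      std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
      return nullptr;
    }
    return ptr;
  }
  // ... otherwise serve the request from the caching allocator's pools ...
  return ptr;
}

void raw_delete(void* ptr) {
  if (force_uncached_allocations()) {
    cudaFree(ptr);  // return the block to the driver immediately
    return;
  }
  // ... otherwise hand the block back to the cache for reuse ...
}
```

Caching the getenv result in a static keeps the hot path to a single branch, rather than re-reading the environment on every allocation.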

Test Plan: Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.

Specifically I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826

And ran:

PYTORCH_CUDA_DEBUG_MEMORY=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py

Differential Revision: D23964734
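
As a self-contained illustration of the class of bug involved (not the fuser change linked above), the kernel below makes a deliberate out-of-bounds write of the kind cuda-memcheck can flag once the buffer comes straight from cudaMalloc:

```
// Deliberate out-of-bounds write: 1024 threads store into a 1000-float
// buffer. With an uncached cudaMalloc allocation, cuda-memcheck flags the
// writes from the trailing threads as invalid global writes.
#include <cuda_runtime.h>

__global__ void oob_write(float* buf) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  buf[i] = 1.0f;  // bug: no bounds check on i
}

int main() {
  const int n = 1000;  // deliberately not a multiple of the block size
  float* buf = nullptr;
  cudaMalloc(&buf, n * sizeof(float));
  oob_write<<<(n + 255) / 256, 256>>>(buf);  // 4 blocks x 256 = 1024 threads
  cudaDeviceSynchronize();
  cudaFree(buf);
  return 0;
}
```

Under the caching allocator, an overrun like this would typically land inside a valid cached slab and go unreported; with the knob set, the tensor's storage ends exactly where its own cudaMalloc allocation does.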

bertmaher added a commit that referenced this pull request Sep 24, 2020
ghstack-source-id: 5a28c87
Pull Request resolved: #45294
@bertmaher
Contributor Author

So this is a pretty minimal debug path for the CUDA allocator; I'd love some advice on whether this is a reasonable approach and whether there's anything else I should be doing here. Suggestions for perf testing would also be great: I think we have an operator-overhead benchmark that I'll try out; is there anything else? This seems like a perf-sensitive path, so with any luck the change doesn't slow it down.
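
For a rough sense of the cost (a standalone sketch, not the operator-overhead benchmark mentioned above), timing bare cudaMalloc/cudaFree pairs gives a feel for the per-allocation driver overhead that caching normally hides:

```
// Standalone timing sketch: average latency of a cudaMalloc/cudaFree pair.
// This approximates the extra cost paid per allocation when caching is
// bypassed; it is not a PyTorch benchmark.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const int iters = 1000;
  const std::size_t nbytes = 4096;

  // Warm up the CUDA context so initialization doesn't skew the timing.
  void* warm = nullptr;
  cudaMalloc(&warm, nbytes);
  cudaFree(warm);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    void* p = nullptr;
    cudaMalloc(&p, nbytes);
    cudaFree(p);
  }
  auto end = std::chrono::steady_clock::now();

  double avg_us =
      std::chrono::duration<double, std::micro>(end - start).count() / iters;
  std::printf("avg cudaMalloc+cudaFree: %.2f us\n", avg_us);
  return 0;
}
```

Numbers will vary by driver and GPU; since the knob is strictly a debugging aid, the more important check is that the added branch doesn't slow the default cached path.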

@codecov

codecov bot commented Sep 24, 2020

Codecov Report

Merging #45294 into gh/bertmaher/23/base will increase coverage by 0.00%.
The diff coverage is n/a.


@@                  Coverage Diff                   @@
##           gh/bertmaher/23/base   #45294    +/-   ##
======================================================
  Coverage                 68.05%   68.05%            
======================================================
  Files                       396      393     -3     
  Lines                     51232    50914   -318     
======================================================
- Hits                      34864    34651   -213     
+ Misses                    16368    16263   -105     
| Impacted Files | Coverage Δ |
| --- | --- |
| torch/distributed/rpc/options.py | 33.33% <0.00%> (-50.01%) ⬇️ |
| torch/distributed/rpc/backend_registry.py | 32.35% <0.00%> (-16.04%) ⬇️ |
| torch/utils/_benchmark/utils/common.py | 77.68% <0.00%> (-13.23%) ⬇️ |
| torch/testing/_internal/common_cuda.py | 54.21% <0.00%> (-9.83%) ⬇️ |
| torch/backends/cuda/__init__.py | 62.50% <0.00%> (-8.34%) ⬇️ |
| torch/distributed/optim/optimizer.py | 29.78% <0.00%> (-7.57%) ⬇️ |
| torch/nn/quantized/modules/conv.py | 85.25% <0.00%> (-4.32%) ⬇️ |
| torch/optim/adagrad.py | 79.03% <0.00%> (-4.31%) ⬇️ |
| torch/testing/_internal/dist_utils.py | 33.33% <0.00%> (-2.11%) ⬇️ |
| torch/onnx/symbolic_opset12.py | 25.00% <0.00%> (-1.79%) ⬇️ |
| ... and 37 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Collaborator

@dzhulgakov left a comment

Looks good. Maybe call it `PYTORCH_NO_CUDA_MEMORY_CACHING` or something like that? 'debugging' implies some fancy tool/report.

@ezyang
Contributor

ezyang commented Sep 25, 2020

Yeah, this looks fine, but agreed with Dmytro on env var renaming. You should also think about where to document this option.

@ngimel
Collaborator

ngimel commented Sep 25, 2020

This section https://pytorch.org/docs/master/cuda.html#memory-management looks like a reasonable place for documentation.

bertmaher added a commit that referenced this pull request Sep 28, 2020
Summary: While tracking down a recent memory corruption bug we found that
cuda-memcheck wasn't finding the bad accesses, and @ngimel pointed out that
it's because we use a caching allocator so a lot of "out of bounds" accesses
land in a valid slab.

This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set,
bypasses the caching allocator's caching logic so that allocations go straight
to cudaMalloc.  This way, cuda-memcheck will actually work.

Test Plan: Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.

Specifically I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826

And ran:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```

ghstack-source-id: 6b44289
Pull Request resolved: #45294
@facebook-github-bot
Contributor

@bertmaher merged this pull request in 03342af.

@facebook-github-bot deleted the gh/bertmaher/23/head branch October 2, 2020 14:17