Fix race condition, bad lock hierarchy. Move getFreeMutex() into AutoNcclGroup. #22173
Conversation
Does calling […]
I believe this PR is related to #14870.

NCCL 2.x documentation concerning Group Calls: https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/usage/groups.html

From the NCCL sources, in ncclGroupEnd: […]

In light of the above, it's not clear to me whether NCCL 2 fixed the race with cudaFree or not. However, the NCCL 2 code guarantees that only the last CPU thread to arrive at the collective for a given communicator will perform the launching for all devices. For sufficiently new (CUDART_VERSION >= 9000) and capable hardware, it will use cudaCooperativeLaunchMultiDevice.

An alternative to this PR would be to remove the cuda_free_mutex and replace it with the recursive_mutex in the same file. This would also prevent deadlock, but without removing the cudaFree race protection. However, it is not clear what the performance impact would be relative to just removing the cuda_free_mutex when NCCL v2 is in use.
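For context on the group semantics referenced above, the usage pattern the NCCL 2 documentation describes is that per-device collective calls are queued between ncclGroupStart() and ncclGroupEnd(), and nothing is launched until the group is closed. A minimal sketch (buffer, stream, and communicator setup are assumed to happen elsewhere; all names here are placeholders, not code from this PR):

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

// Sketch: one CPU thread driving an allreduce across several local GPUs
// with the NCCL 2 group API. Error checking is omitted for brevity.
void allReduceGrouped(ncclComm_t* comms, cudaStream_t* streams,
                      float** sendbufs, float** recvbufs,
                      size_t count, int nDevices) {
  ncclGroupStart();
  for (int i = 0; i < nDevices; ++i) {
    ncclAllReduce(sendbufs[i], recvbufs[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  }
  // The collectives are only launched here, as a single batched group.
  ncclGroupEnd();
}
```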
Additional related issues/discussions: […]
Thanks for the detailed analysis, @jeffdaily. I went and looked at the NCCL code and did some more investigation myself today and found the following:
Regarding the lock ordering, this is indeed an issue. I like the idea of using the […]
Force-pushed from 8fd31c7 to 42cf602.
Hi @pietern. I have rebased the PR and revised it based on your comments. The free lock remains and its lifetime is managed by the […]
Thanks for updating the PR @jeffdaily. This looks good to me. @mrshenli, can you take a look as well?
@pietern I am looking now.
mrshenli left a comment:
Thanks! LGTM too!
Review thread on the diff context in ProcessGroupNCCL.cpp: namespace { … struct AutoNcclGroup
Question: is this the same as the one in nccl.h except for the error checking? Is it possible to directly use that one here?
Yes, the code here is identical to the one in nccl.h except for the error checking macro. This 'new' AutoNcclGroup is local to the ProcessGroupNCCL.cpp file. I chose to repeat the code here instead of trying to add a header dependency on torch/csrc/cuda when the code there already depends on c10 -- a circular dependency? I'm not yet completely familiar with how the PyTorch directories and libraries are structured. Let me know how I should proceed. Thanks.
I see, thanks! Then, let's keep it as is.
facebook-github-bot left a comment:
@pietern is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Thanks @jeffdaily and @mrshenli!
This PR breaks master CUDA builds with: Jul 18 14:59:18 In file included from /var/lib/jenkins/workspace/torch/lib/c10d/../c10d/ProcessGroupNCCL.hpp:6:0, […] I am reverting it.
How did CI not catch this earlier? This […]
@yf225 can we fix this with a new PR rather than revert the entire PR?
@jeffdaily If the fix PR takes longer, it would mean that all other PRs based on current master have their CUDA builds broken, which I don't think is a good idea. Since most of this PR has been reviewed, opening a new PR with the fix added shouldn't take too long to review, and we can make sure CI is green before merging it.
@yf225 ok to revert. I will start the new PR now, with the fix added.
@ezyang do you know why our CI didn't catch this error?
@mrshenli I suspect that it's because the last CI run was too old (16 days ago).
There are two mutexes within CUDACachingAllocator that cause a deadlock. One of the mutexes was added in order to work around the issue of NCCL interacting poorly with cudaFree. See […]
As of NCCL version 2 and its new group start/end APIs, the protection surrounding cudaFree() is no longer needed. The PyTorch code was updated to use the NCCL 2 group start/end API, but the corresponding cuda_free_mutex and its getter getFreeMutex() were not revised. This PR removes the use of getFreeMutex() when NCCL 2 is in use by moving the calls to getFreeMutex() into AutoNcclGroup. That way, depending on the NCCL version, we either take the mutex or use the new group APIs.
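As a rough sketch of that idea (not necessarily the exact code in the PR: the NCCL_MAJOR version guard, the include paths, and the omission of error checking are assumptions here), the RAII helper either takes the allocator's free mutex or brackets the launches with the NCCL 2 group calls:

```cpp
#include <nccl.h>

#include <c10/cuda/CUDACachingAllocator.h>

// Sketch only. Pre-NCCL2 builds keep serializing against cudaFree via the
// caching allocator's free mutex; NCCL2+ builds rely on the group API
// instead, so the mutex is never taken and the bad lock order goes away.
struct AutoNcclGroup {
  AutoNcclGroup() {
#if !defined(NCCL_MAJOR) || (NCCL_MAJOR < 2)
    c10::cuda::CUDACachingAllocator::getFreeMutex()->lock();
#else
    ncclGroupStart();  // error checking omitted
#endif
  }
  ~AutoNcclGroup() {
#if !defined(NCCL_MAJOR) || (NCCL_MAJOR < 2)
    c10::cuda::CUDACachingAllocator::getFreeMutex()->unlock();
#else
    ncclGroupEnd();  // error checking omitted
#endif
  }
};
```

Collective launches would then be scoped with a local AutoNcclGroup instance, so the mutex, when it is used at all, is held only for the duration of the launch.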
The deadlock scenario is as follows, thanks to @skeelyamd:
The deadlock occurs between hip_free_mutex (aka cuda_free_mutex in github) (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L165) and mutex (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L162).
hip_free_mutex is exported from THCCachingAllocator in getFreeMutex (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L660) and is acquired in ProcessGroupNCCL::collective (https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.cpp#L397), which then calls back into THCCachingAllocator via c10::cuda::CUDACachingAllocator::recordStream (https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.cpp#L416 to https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L655 to https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L379). At this point it acquires mutex (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L384).
This requires hip_free_mutex to be locked before mutex.
However, in free_blocks (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L505) THCCachingAllocator locks hip_free_mutex. free_blocks is called from emptyCache (https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllocator.cpp#L328), which locks mutex.
That requires mutex to be locked before hip_free_mutex.
emptyCache and ProcessGroupNCCL::collective therefore must not execute concurrently, but this does occur, deadlocking the CPU threads involved.
free_blocks is also called by malloc (via cuda_malloc_retry -> free_cached_blocks -> free_blocks), which also locks mutex first, so malloc likewise must not execute concurrently with ProcessGroupNCCL::collective.
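To make the inverted lock order concrete, here is a self-contained toy sketch (standard-library mutexes with made-up names, not the actual PyTorch code) of the two call paths described above. If the two functions run concurrently, each can take its first lock and then block forever on the other's:

```cpp
#include <mutex>

std::mutex cuda_free_mutex;  // stands in for the mutex returned by getFreeMutex()
std::mutex alloc_mutex;      // stands in for the allocator's internal 'mutex'

// Path 1: ProcessGroupNCCL::collective -> recordStream.
// Locks cuda_free_mutex first, then alloc_mutex.
void collectiveLikePath() {
  std::lock_guard<std::mutex> free_lock(cuda_free_mutex);
  // ... enqueue NCCL work ...
  std::lock_guard<std::mutex> lock(alloc_mutex);  // recordStream
  // ... record the stream on the allocation ...
}

// Path 2: emptyCache (or the malloc retry path) -> free_blocks.
// Locks alloc_mutex first, then cuda_free_mutex: the opposite order.
void emptyCacheLikePath() {
  std::lock_guard<std::mutex> lock(alloc_mutex);
  // ... walk the cached blocks ...
  std::lock_guard<std::mutex> free_lock(cuda_free_mutex);  // free_blocks
  // ... cudaFree the blocks ...
}
```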