Fix nccl regression on PyTorch 2.3 upgrade by fxmarty · Pull Request #2099 · huggingface/text-generation-inference

fxmarty · 2024-06-20T18:01:56Z

As per title, fixes NVIDIA/nccl#1251 in TGI's cuda image, regression introduced in #1730 & #1833

We hit this issue with, for example, the Llama 3 70B model with TP=4 or TP=8 on H100 and default CUDA graphs. The hang can be reproduced with text-generation-benchmark --tokenizer-name meta-llama/Meta-Llama-3-70B-Instruct --sequence-length 128 --decode-length 10 --warmups 2 --runs 100 -b 1, where the shards hang in:

Thread 1302975 (active): "MainThread"
    sched_yield (libc.so.6)
    ncclLaunchKernelBefore_NoUncapturedCuda (enqueue.cc:968)
    doLaunches (group.cc:161)
    groupLaunch (group.cc:339)
    ncclGroupEndInternal (group.cc:418)
    ncclGroupEndInternal (group.cc:368)
    ncclEnqueueCheck (enqueue.cc:1981)
    ncclAllReduce (collectives.cc:49)
    c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#2}> (libtorch_cuda.so)
    c10d::ProcessGroupNCCL::allreduce_impl (libtorch_cuda.so)
    c10d::ProcessGroupNCCL::allreduce (libtorch_cuda.so)
    c10d::ops::(anonymous namespace)::allreduce_CUDA (libtorch_cpu.so)
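
Because the shards hang rather than crash, an automated run can block forever on this. A minimal sketch, assuming Python 3 on the host: it runs an arbitrary command under a timeout and classifies the outcome, so a stuck benchmark shows up as "hang" instead of stalling the job. The helper name run_with_timeout and the constant BENCHMARK_CMD are hypothetical and not part of TGI; the flags are the ones from the repro command quoted above.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as 'ok', 'failed', or 'hang'.

    Hypothetical helper, not part of TGI: 'hang' only means the process
    did not exit within timeout_s seconds.
    """
    try:
        # subprocess.run kills the child and raises once the timeout expires.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if result.returncode == 0 else "failed"


# Flags copied from the repro command above; nothing is executed here.
BENCHMARK_CMD = [
    "text-generation-benchmark",
    "--tokenizer-name", "meta-llama/Meta-Llama-3-70B-Instruct",
    "--sequence-length", "128",
    "--decode-length", "10",
    "--warmups", "2",
    "--runs", "100",
    "-b", "1",
]
```

With a TGI server already running, something like run_with_timeout(BENCHMARK_CMD, timeout_s=600) would be the call; the 600 second budget is an arbitrary assumption and should be sized to the expected benchmark duration.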

PyTorch 2.3 has a hard requirement on nccl 2.20.5 so I am not completely sure this fix is fine. We could also choose to downgrade.

Interesting read as well: https://pytorch.slack.com/archives/C3PDTEV8E/p1713223950622429?thread_ts=1712807088.459829&cid=C3PDTEV8E

Will wait for the build to run to check TGI's benchmark again & any potential regression.

fxmarty · 2024-06-20T18:36:12Z

Dockerfile

    pip install -r requirements_cuda.txt && \
-    pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir
+    pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir && \
+    pip install nvidia-nccl-cu12==2.22.3


I would have liked to use pyproject.toml for that, but poetry cannot handle the conflict, see python-poetry/poetry#697 (comment)
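
Since the pin is applied by a second pip install rather than by the lock file, it can be useful to confirm inside the image that the override actually took effect. A rough sketch using only the standard library; the function names are hypothetical, and the version parsing assumes plain dotted numeric versions such as 2.22.3. It only inspects pip metadata and says nothing about which shared object is loaded at runtime.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.22.3' into (2, 22, 3) so versions compare as tuples.

    Assumption: the version is purely numeric and dot-separated.
    """
    return tuple(int(part) for part in text.split("."))


def nccl_wheel_is_at_least(minimum, package="nvidia-nccl-cu12"):
    """Return True or False, or None when the wheel is not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return parse_version(installed) >= parse_version(minimum)
```

In the image built from this Dockerfile, nccl_wheel_is_at_least("2.22.3") would be expected to return True.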

Narsil · 2024-06-24T17:24:17Z

Thanks a lot for the find, the fix and the details.

I'm more on the fence of waiting for torch to fix it (2.3.1 hasn't fixed it yet) since afaik this does NOT affect production.
If it did, 100% on your solution (seems better than downgrading for the time being since torch 2.3 still received some nice upgrades).

fxmarty · 2024-06-25T09:36:08Z

As you'd like. I am using this fix to benchmark.

Hugoch · 2024-07-01T11:21:48Z

Nice fix @fxmarty !
I confirm that upgrading NCCL as proposed fixes the systematic hang on 8xH100 P5 instances, where TGI freezes without crashing. PyTorch 2.4 should be released this month, so let's check whether NCCL gets updated there; otherwise it would be nice to merge this patch.

OlivierDehaene

Since this affects real deployments, let's merge this.

OlivierDehaene · 2024-07-08T08:11:59Z

Dockerfile

+    pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir && \
+    pip install nvidia-nccl-cu12==2.22.3
+
+ENV LD_PRELOAD=/opt/conda/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2


Why do we need to preload?

Otherwise, the shared object is not used. The current base docker image of TGI is nvidia/cuda:12.1.0-base-ubuntu22.04, where there is no libnccl.so anywhere and it is not loaded by pytorch either, although we have /opt/conda/lib/libcudart.so.12.1.105 etc. COPY --from=pytorch-install /opt/conda /opt/conda does not seem to copy any libnccl.so. Weird.
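
One way to check whether the preload takes effect is to look at which libnccl files are mapped into a running shard. A minimal sketch, assuming a Linux container where /proc is available; the function names are hypothetical. It parses the text of /proc/<pid>/maps and returns any mapped path that contains "libnccl".

```python
def nccl_paths_from_maps(maps_text):
    """Return sorted, de-duplicated libnccl paths found in a /proc/<pid>/maps dump."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, then an optional pathname.
        fields = line.split(None, 5)
        if len(fields) == 6 and "libnccl" in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def nccl_paths_in_current_process():
    """Linux only: inspect the mappings of the calling process."""
    with open("/proc/self/maps") as handle:
        return nccl_paths_from_maps(handle.read())
```

Called from inside a shard process started with the LD_PRELOAD above, nccl_paths_in_current_process() would be expected to list the path under site-packages/nvidia/nccl/lib; an empty list would suggest the shared object is not loaded.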

* fix nccl issue
* add note in dockerfile
* use v2.22.3 that also fixes @samsamoa's repro
* poetry actually can't handle the conflict between torch and nccl
* set LD_PRELOAD

fxmarty added 2 commits June 20, 2024 17:57

fix nccl issue

2502ce4

add note in dockerfile

a76b6f4

fxmarty requested review from Hugoch and OlivierDehaene June 20, 2024 18:05

fxmarty added 2 commits June 20, 2024 18:12

use v2.22.3 that also fixes @samsamoa's repro

27a3792

poetry actually can't handle the conflict between torch and nccl

62a1ddb

set LD_PRELOAD

a1695ce

OlivierDehaene approved these changes Jul 8, 2024

Hugoch mentioned this pull request Jul 8, 2024

Queue size increases indefinitely #2192

Closed

4 tasks

OlivierDehaene merged commit 4c50b6d into main Jul 8, 2024

OlivierDehaene deleted the fix-nccl-regression branch July 8, 2024 15:52

HoKim98 mentioned this pull request Jul 11, 2024

Tgi crash on multi GPUs #2207

Closed

4 tasks
