Add gpu guard for broadcast_coalesce #5655
Merged
This patch fixes a bug triggered by #5182 when the model has multiple layers and DDP is run on a single node with a subset of GPUs per process.
For example, as in the test, we run 2 processes on an 8-GPU node, and all 8 GPUs are visible to both processes. We create the DDP model with `nn.parallel.DistributedDataParallel(model_DDP, device_ids=gpu_subset)`, where `gpu_subset` is 0,1,2,3 for process 1 and 4,5,6,7 for process 2. `utils::flatten_dense_tensors(chunk.tensors)` creates a new tensor that is a flattened version of the layer weights. Without this patch, that tensor is allocated on the default GPU 0, even though all layer weights for process 2 are on GPU 4; this later errors out when the broadcast requires the tensor to be on GPU 4 for process 2.
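A minimal sketch of the setup described above, assuming a hypothetical `run(rank)` entry point, a toy model, and the PR-era DDP API that accepts multiple `device_ids` per process; the actual test differs:

```python
import torch.nn as nn
import torch.distributed as dist

def run(rank, world_size=2):
    # Both processes see all 8 GPUs; each one drives its own subset.
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456',
                            rank=rank, world_size=world_size)
    gpu_subset = [0, 1, 2, 3] if rank == 0 else [4, 5, 6, 7]

    # The model's weights live on the first GPU of the subset
    # (GPU 0 for process 1, GPU 4 for process 2).
    model = nn.Linear(8, 8).cuda(gpu_subset[0])
    model_DDP = nn.parallel.DistributedDataParallel(model, device_ids=gpu_subset)
    return model_DDP
```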
The GPU guard inside the for loop has nothing to do with the current bug; I thought it would be good to add it as a safety guard.
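A minimal sketch of the guard pattern in Python, for illustration only (the actual change lives in the C++ `broadcast_coalesce` path; `coalesce_chunk` and its placement are assumptions):

```python
import torch
from torch._utils import _flatten_dense_tensors

def coalesce_chunk(chunk_tensors):
    # Without a device guard the flat buffer can land on the default GPU 0,
    # even when every tensor in the chunk lives on, say, GPU 4.
    # Switching to the chunk's device keeps the flattened tensor co-located
    # with the weights it flattens, so the subsequent broadcast succeeds.
    with torch.cuda.device(chunk_tensors[0].get_device()):
        return _flatten_dense_tensors(chunk_tensors)
```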
@apaszke