Conversation

@ailzhang (Contributor) commented Mar 9, 2018

This patch fixes a bug triggered by #5182 that occurs when the model has multiple layers and DDP runs on a single node, with each process using a subset of the GPUs.
For example, as in the test, we run 2 processes on an 8-GPU node, with all GPUs visible to both processes. We create the DDP model with nn.parallel.DistributedDataParallel(model_DDP, device_ids=gpu_subset), where gpu_subset is 0,1,2,3 for process 1 and 4,5,6,7 for process 2. A sketch of this setup is shown below.
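For illustration, here is a hedged sketch of that setup; the toy module, process-group initialization, and launcher details are assumptions, not the exact test code:

```python
import torch.distributed as dist
import torch.nn as nn

# Assumed launcher: two processes on one 8-GPU node, ranks 0 and 1.
dist.init_process_group(backend="nccl", init_method="env://")
rank = dist.get_rank()

# Each process drives half of the node's GPUs, as described above.
gpu_subset = [0, 1, 2, 3] if rank == 0 else [4, 5, 6, 7]

# A toy multi-layer model; its parameters live on GPU 0 for process 1
# and on GPU 4 for process 2.
model_DDP = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 10)).cuda(gpu_subset[0])
model_DDP = nn.parallel.DistributedDataParallel(model_DDP, device_ids=gpu_subset)
```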
utils::flatten_dense_tensors(chunk.tensors) creates a new tensor that is a flattened version of the layer weights. Without this patch, that tensor is allocated on the default GPU 0 even though all the layer weights for process 2 are on GPU 4, which later errors out when the broadcast requires the tensor to be on GPU 4 for process 2.
The GPU guard inside the for loop has nothing to do with the current bug; I thought it was worth adding as a safety guard.
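The idea behind the fix, sketched in Python for illustration (the actual patch guards the C++ utils::flatten_dense_tensors path; flatten_on_device is a hypothetical helper name):

```python
import torch
from torch._utils import _flatten_dense_tensors

def flatten_on_device(tensors):
    # `tensors` is a non-empty list of CUDA tensors that all live on the same
    # GPU (e.g. the per-layer weights of one DDP chunk). Pinning the current
    # CUDA device to their device ensures any buffer allocated while
    # flattening lands there rather than on the default GPU 0.
    with torch.cuda.device(tensors[0].get_device()):
        return _flatten_dense_tensors(tensors)
```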
@apaszke
