Slightly improve DistributedDataParallel (single-GPU binding) multi-process distributed training performance #4870
@csarofeen's simple DistributedDataParallel (DDP) version (with buffer syncing ON, for a fair comparison, since our DDP always syncs buffers) reaches 0.159 sec per iteration when training ResNet50 with a mini-batch size of 32 per GPU, using 8 DDP processes on a single node with 8 P100s. Our current DDP reaches 0.164 sec per iteration in the same single-node, 8-process setup.

This PR improves single-node distributed training on 8 P100s from 0.164 sec/iteration to 0.162 sec/iteration.
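For context, below is a minimal sketch (not the PR's actual benchmark script) of the single-GPU-binding setup being measured: one process per GPU, each wrapping its model in DDP with a single `device_ids` entry. The `env://` init method, rank handling, and dummy data are illustrative assumptions; only the ResNet50 model and 32 images/GPU batch size come from the numbers quoted above.

```python
import torch
import torch.distributed as dist
import torchvision.models as models
from torch.nn.parallel import DistributedDataParallel as DDP


def main(rank, world_size):
    # One process per GPU: bind this process to its own device.
    dist.init_process_group(backend="nccl", init_method="env://",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = models.resnet50().cuda(rank)
    # A single entry in device_ids is the "single-GPU binding" mode.
    model = DDP(model, device_ids=[rank])

    criterion = torch.nn.CrossEntropyLoss().cuda(rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Dummy tensors standing in for this process's data shard (32 images/GPU).
    images = torch.randn(32, 3, 224, 224, device=rank)
    target = torch.randint(0, 1000, (32,), device=rank)

    for _ in range(10):  # timed iterations
        optimizer.zero_grad()
        loss = criterion(model(images), target)
        loss.backward()   # gradients are all-reduced across the processes here
        optimizer.step()
```

In this mode each of the 8 processes owns exactly one GPU, so the per-iteration time quoted above is dominated by the forward/backward compute plus the gradient all-reduce (and the buffer sync) that DDP performs during `backward()`.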