
Conversation


@teng-li (Contributor) commented on Jan 26, 2018

@csarofeen's simple DistributedDataParallel (DDP) implementation (with buffer synchronization enabled, for a fair comparison since our DDP always syncs buffers) reaches 0.159 sec per iteration when training ResNet-50 with a mini-batch size of 32 per GPU, using 8 DDP processes on a single node with 8 P100 GPUs. Under the same setup (8 processes, single node), our current DDP reaches 0.164 sec per iteration.

This PR improves performance from 0.164 sec/iteration to 0.162 sec/iteration for single-node distributed training on 8 P100s.
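For context, a minimal sketch of the measured configuration is shown below: one process per GPU, each binding ResNet-50 to a single device via DistributedDataParallel, with a per-GPU mini-batch of 32. This is not the benchmark script from this PR; it uses the current torch.distributed / DistributedDataParallel API (which differs from the 2018-era API), assumes torchvision is available, and assumes the launcher exports the usual RANK/WORLD_SIZE/MASTER_ADDR/LOCAL_RANK environment variables (e.g. via torchrun).

```python
# Sketch only: approximates the benchmarked setup (8 processes, 1 GPU each,
# ResNet-50, batch size 32 per GPU), not the authors' actual script.
import os

import torch
import torch.distributed as dist
import torchvision.models as models  # assumption: torchvision is installed
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Assumes RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT are set by the launcher.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])  # assumption: launcher (e.g. torchrun) exports this
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    # Single-GPU binding: each process owns exactly one device; gradients are
    # all-reduced across processes and buffers are kept in sync by DDP.
    model = models.resnet50().to(device)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss().to(device)

    # Dummy tensors standing in for an ImageNet loader; per-GPU mini-batch of 32.
    inputs = torch.randn(32, 3, 224, 224, device=device)
    targets = torch.randint(0, 1000, (32,), device=device)

    for _ in range(10):
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()   # gradient all-reduce overlaps with backward in DDP
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched as one process per GPU on a single 8-GPU node, per-iteration time in this kind of setup is dominated by the backward pass plus the gradient all-reduce, which is the cost the numbers above compare.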

@teng-li changed the title from "Slightly improve DDP single GPU multi-process dist training performance" to "Slightly improve DistributedDataParallel single GPU multi-process dist training performance" on Jan 26, 2018
@teng-li changed the title from "Slightly improve DistributedDataParallel single GPU multi-process dist training performance" to "Slightly improve DistributedDataParallel (single-GPU binding) multi-process distributed training performance" on Jan 26, 2018
@apaszke merged commit ae28411 into pytorch:master on Jan 27, 2018