Conversation

@rohan-varma rohan-varma (Contributor) commented Jul 9, 2020

Stack from ghstack:

Closes #25162.

DistributedSampler allows data to be split evenly across workers in
DDP, but it has always added extra samples so that the data can be split evenly
when the number of samples is not divisible by the number of workers. This can
cause issues, for example when computing distributed validation accuracy, where
some samples would be counted twice. Applications may also not want to repeat
data within a single epoch during training and would rather drop the tail.

This PR adds a drop_last option that drops the tail of the data so that the
effective dataset size is evenly divisible across the workers. This ensures
that DDP trains without issue (there are no uneven inputs) and that each
replica gets an equal number of data indices.

The change is backwards compatible: the default value of `drop_last` is `False`, so the old behavior is preserved by default.
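
As a usage sketch (illustrative only, not code from the diff: the dataset, sizes, and ranks below are made up, and it assumes the new `drop_last` keyword on `torch.utils.data.DistributedSampler` described above):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Hypothetical dataset: 10 samples across 3 replicas. 10 is not divisible by 3,
# so drop_last=True trims the tail instead of padding with repeated samples.
dataset = TensorDataset(torch.arange(10).float())

# In real DDP code num_replicas/rank come from the process group; they are
# passed explicitly here so the snippet runs standalone.
sampler = DistributedSampler(dataset, num_replicas=3, rank=0,
                             shuffle=True, drop_last=True)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # keep shuffling consistent across replicas per epoch
    for (batch,) in loader:
        pass  # train / validate on batch; every rank sees 3 samples, none repeated
```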

Differential Revision: [D22449974](https://our.internmc.facebook.com/intern/diff/D22449974/)

rohan-varma added a commit that referenced this pull request Jul 9, 2020
@dr-ci dr-ci bot commented Jul 9, 2020

💊 CI failures summary and remediations

As of commit 61eb3e8 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


1 failure confirmed as flaky and can be ignored:

  • binary_windows_libtorch_3_7_cpu_debug_build


@mrshenli mrshenli (Contributor) left a comment

LGTM! Test failure is real:

Jul 09 04:15:46 ======================================================================
Jul 09 04:15:46 ERROR [0.106s]: test_DistributedSampler_padding (__main__.TestDistBackend)
Jul 09 04:15:46 ----------------------------------------------------------------------
Jul 09 04:15:46 Traceback (most recent call last):
Jul 09 04:15:46   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper
Jul 09 04:15:46     self._join_processes(fn)
Jul 09 04:15:46   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 311, in _join_processes
Jul 09 04:15:46     self._check_return_codes(elapsed_time)
Jul 09 04:15:46   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 344, in _check_return_codes
Jul 09 04:15:46     raise RuntimeError(error)
Jul 09 04:15:46 RuntimeError: Processes 0 exited with error code 10

# that each rank gets the same amount of data when iterating this
# dataloader.
self.num_samples = math.ceil((len(self.dataset) - self.num_replicas) / self.num_replicas)
self.total_size = self.num_samples * self.num_replicas

this line can be moved out of the if-else block?
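
For reference, a standalone sketch of what hoisting that assignment could look like (illustrative only; `compute_sizes` is a hypothetical helper, not code from this PR):

```python
import math

def compute_sizes(dataset_len: int, num_replicas: int, drop_last: bool):
    # Mirrors the per-rank size computation discussed above.
    if drop_last and dataset_len % num_replicas != 0:
        # Drop the tail so the data splits evenly across replicas.
        num_samples = math.ceil((dataset_len - num_replicas) / num_replicas)
    else:
        # Old behavior: pad with extra samples (or the length already divides evenly).
        num_samples = math.ceil(dataset_len / num_replicas)
    # total_size is the same expression on both branches, so it can live
    # outside the if-else, as suggested.
    total_size = num_samples * num_replicas
    return num_samples, total_size

# e.g. 10 samples over 3 replicas:
#   drop_last=True  -> (3, 9)   (one sample dropped)
#   drop_last=False -> (4, 12)  (two samples repeated)
```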

@zhaojuanmao zhaojuanmao (Contributor) left a comment


overall looks good to me, just one minor nit

process_group_sync = res50_model_sync.layer1[0].bn1.process_group
self.assertEqual(process_group_sync, process_group)

def test_DistributedSampler_padding(self):

nit: I'm wondering whether the test can be moved to a data loader-related test file?

@rohan-varma rohan-varma (Contributor, Author) commented Jul 16, 2020


I'd prefer to leave it in this file, since test_dataloader doesn't have the multiprocessing setup and we already have support for it in test_distributed, which we need because the test uses distributed collectives. Let me know if it's better to have this in the dataloader test though, and we can probably add a basic multiprocessing setup there to enable that.
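
As a rough illustration of what the padding behavior boils down to (this is not the actual test_DistributedSampler_padding test, which needs the multiprocess harness and collectives in test_distributed), the per-rank index lists can be sanity-checked in a single process:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

world_size = 4
dataset = TensorDataset(torch.arange(11).float())  # 11 is not divisible by 4

# One sampler per hypothetical rank; passing num_replicas/rank explicitly
# avoids needing an initialized process group for this sketch.
per_rank = [
    list(DistributedSampler(dataset, num_replicas=world_size, rank=r,
                            shuffle=False, drop_last=True))
    for r in range(world_size)
]

# Every rank gets the same number of indices, and no index is handed out twice.
assert len({len(indices) for indices in per_rank}) == 1
flat = [i for indices in per_rank for i in indices]
assert len(flat) == len(set(flat))
```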

rohan-varma added a commit that referenced this pull request Jul 17, 2020
rohan-varma added a commit that referenced this pull request Jul 20, 2020
rohan-varma added a commit that referenced this pull request Jul 24, 2020
rohan-varma added a commit that referenced this pull request Jul 27, 2020
rohan-varma added a commit that referenced this pull request Jul 27, 2020
@facebook-github-bot

This pull request has been merged in 5ed7cd0.
