Conversation

@rohan-varma rohan-varma (Contributor) commented Jul 9, 2020

Stack from ghstack:

Closes #25162.

DistributedSampler allows data to be split evenly across workers in
DDP, but it has always added extra samples so that the data can be split evenly
when the number of samples is not divisible by the number of workers. This can
cause issues, for example when computing distributed validation accuracy, where
some samples would be counted twice. Applications may also not want to repeat
data within a single epoch during training and would rather drop the tail.

This PR adds a drop_last option that drops the tail of the data so that the
effective dataset size is evenly divisible across the workers. This ensures
that DDP trains without issue (there are no uneven inputs) and that each
replica gets an equal number of data indices.

The change is backwards compatible: the default value of `drop_last` is `False`, so the old behavior is preserved by default.
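
As a usage sketch (illustrative only, not code from the diff: the dataset, sizes, and ranks below are made up, and it assumes the new `drop_last` keyword on `torch.utils.data.DistributedSampler` described above):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Hypothetical dataset: 10 samples across 3 replicas. 10 is not divisible by 3,
# so drop_last=True trims the tail instead of padding with repeated samples.
dataset = TensorDataset(torch.arange(10).float())

# In real DDP code num_replicas/rank come from the process group; they are
# passed explicitly here so the snippet runs standalone.
sampler = DistributedSampler(dataset, num_replicas=3, rank=0,
                             shuffle=True, drop_last=True)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # keep shuffling consistent across replicas per epoch
    for (batch,) in loader:
        pass  # train / validate on batch; every rank sees 3 samples, none repeated
```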

Differential Revision: [D22449974](https://our.internmc.facebook.com/intern/diff/D22449974/)

rohan-varma added a commit that referenced this pull request Jul 9, 2020
@dr-ci dr-ci bot commented Jul 9, 2020

💊 CI failures summary and remediations

As of commit 61eb3e8 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


1 failure confirmed as flaky and can be ignored:

  • binary_windows_libtorch_3_7_cpu_debug_build


@mrshenli mrshenli (Contributor) left a comment

LGTM! Test failure is real:

Jul 09 04:15:46 ======================================================================
Jul 09 04:15:46 ERROR [0.106s]: test_DistributedSampler_padding (__main__.TestDistBackend)
Jul 09 04:15:46 ----------------------------------------------------------------------
Jul 09 04:15:46 Traceback (most recent call last):
Jul 09 04:15:46   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper
Jul 09 04:15:46     self._join_processes(fn)
Jul 09 04:15:46   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 311, in _join_processes
Jul 09 04:15:46     self._check_return_codes(elapsed_time)
Jul 09 04:15:46   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 344, in _check_return_codes
Jul 09 04:15:46     raise RuntimeError(error)
Jul 09 04:15:46 RuntimeError: Processes 0 exited with error code 10

# that each rank gets the same amount of data when iterating this
# dataloader.
self.num_samples = math.ceil((len(self.dataset) - self.num_replicas) / self.num_replicas)
self.total_size = self.num_samples * self.num_replicas

this line can be moved out of the if-else block?
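
For reference, a standalone sketch of what hoisting that assignment could look like (illustrative only; `compute_sizes` is a hypothetical helper, not code from this PR):

```python
import math

def compute_sizes(dataset_len: int, num_replicas: int, drop_last: bool):
    # Mirrors the per-rank size computation discussed above.
    if drop_last and dataset_len % num_replicas != 0:
        # Drop the tail so the data splits evenly across replicas.
        num_samples = math.ceil((dataset_len - num_replicas) / num_replicas)
    else:
        # Old behavior: pad with extra samples (or the length already divides evenly).
        num_samples = math.ceil(dataset_len / num_replicas)
    # total_size is the same expression on both branches, so it can live
    # outside the if-else, as suggested.
    total_size = num_samples * num_replicas
    return num_samples, total_size

# e.g. 10 samples over 3 replicas:
#   drop_last=True  -> (3, 9)   (one sample dropped)
#   drop_last=False -> (4, 12)  (two samples repeated)
```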

@zhaojuanmao zhaojuanmao (Contributor) left a comment


overall looks good to me, just one minor nit

process_group_sync = res50_model_sync.layer1[0].bn1.process_group
self.assertEqual(process_group_sync, process_group)

def test_DistributedSampler_padding(self):

nit: I'm wondering whether the test can be moved to a data loader-related test file?

@rohan-varma rohan-varma (Contributor, Author) commented Jul 16, 2020


I'd prefer to leave it in this file, since test_dataloader doesn't have the multiprocessing setup and we already have support for it in test_distributed, which we need because the test uses distributed collectives. Let me know if it's better to have this in the dataloader test though, and we can probably add a basic multiprocessing setup there to enable that.
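
As a rough illustration of what the padding behavior boils down to (this is not the actual test_DistributedSampler_padding test, which needs the multiprocess harness and collectives in test_distributed), the per-rank index lists can be sanity-checked in a single process:

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset

world_size = 4
dataset = TensorDataset(torch.arange(11).float())  # 11 is not divisible by 4

# One sampler per hypothetical rank; passing num_replicas/rank explicitly
# avoids needing an initialized process group for this sketch.
per_rank = [
    list(DistributedSampler(dataset, num_replicas=world_size, rank=r,
                            shuffle=False, drop_last=True))
    for r in range(world_size)
]

# Every rank gets the same number of indices, and no index is handed out twice.
assert len({len(indices) for indices in per_rank}) == 1
flat = [i for indices in per_rank for i in indices]
assert len(flat) == len(set(flat))
```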

rohan-varma added a commit that referenced this pull request Jul 17, 2020
rohan-varma added a commit that referenced this pull request Jul 20, 2020
rohan-varma added a commit that referenced this pull request Jul 24, 2020
rohan-varma added a commit that referenced this pull request Jul 27, 2020
rohan-varma added a commit that referenced this pull request Jul 27, 2020
@facebook-github-bot

This pull request has been merged in 5ed7cd0.
