Allow drop_last option in DistributedSampler #41171
Conversation
DistributedSampler allows data to be split evenly across workers in DDP, but it has always added additional samples so that the data can be split evenly when the number of samples is not divisible by the number of workers. This can cause issues, such as when computing distributed validation accuracy, where some samples would be counted twice. This PR adds a drop_last option where the tail of the data is dropped so that the effective dataset size is still evenly divisible across the workers. This ensures that DDP can train without uneven inputs and that each replica gets an equal number of data indices.

Pull Request resolved: #41171
Differential Revision: [D22449974](https://our.internmc.facebook.com/intern/diff/D22449974/)
💊 CI failures summary and remediations. As of commit 61eb3e8 (more details on the Dr. CI page): Looks good so far! There are no failures yet; 1 failure was confirmed as flaky and can be ignored.
mrshenli left a comment:
LGTM! Test failure is real:
```
Jul 09 04:15:46 ======================================================================
Jul 09 04:15:46 ERROR [0.106s]: test_DistributedSampler_padding (__main__.TestDistBackend)
Jul 09 04:15:46 ----------------------------------------------------------------------
Jul 09 04:15:46 Traceback (most recent call last):
Jul 09 04:15:46   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper
Jul 09 04:15:46     self._join_processes(fn)
Jul 09 04:15:46   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 311, in _join_processes
Jul 09 04:15:46     self._check_return_codes(elapsed_time)
Jul 09 04:15:46   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 344, in _check_return_codes
Jul 09 04:15:46     raise RuntimeError(error)
Jul 09 04:15:46 RuntimeError: Processes 0 exited with error code 10
```
torch/utils/data/distributed.py (outdated)
```python
# that each rank gets the same amount of data when iterating this
# dataloader.
self.num_samples = math.ceil((len(self.dataset) - self.num_replicas) / self.num_replicas)
self.total_size = self.num_samples * self.num_replicas
```
this line can be moved out of the if-else block?
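For reference, here is a minimal sketch of the arithmetic in the outdated diff above, with the total_size computation hoisted out of the if-else as the comment suggests. `_compute_split` is a hypothetical helper for illustration, not code from the PR:

```python
import math

def _compute_split(dataset_len: int, num_replicas: int, drop_last: bool):
    # Hypothetical helper mirroring the branching logic in the diff above.
    if drop_last and dataset_len % num_replicas != 0:
        # Drop the tail: subtracting num_replicas before the ceil keeps every
        # rank's share inside the truncated index list.
        num_samples = math.ceil((dataset_len - num_replicas) / num_replicas)
    else:
        # Exact fit, or pad with repeated indices: round up so every replica
        # gets the same count.
        num_samples = math.ceil(dataset_len / num_replicas)
    # Identical in both branches, so it can live outside the if-else.
    total_size = num_samples * num_replicas
    return num_samples, total_size

# 10 samples across 4 replicas: drop 2 indices vs. pad with 2 duplicates.
assert _compute_split(10, 4, drop_last=True) == (2, 8)
assert _compute_split(10, 4, drop_last=False) == (3, 12)
```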
zhaojuanmao left a comment:
overall looks good to me, just one minor nit
```python
        process_group_sync = res50_model_sync.layer1[0].bn1.process_group
        self.assertEqual(process_group_sync, process_group)

    def test_DistributedSampler_padding(self):
```
nit: I'm wondering whether the test can be moved to a data loader related test file?
I'd prefer to leave it in this file as test_dataloader doesn't have the multiprocessing setup, and we have the support for it in test_distributed, which we need since the test uses distributed collectives. Let me know if it's better to have this in the dataloader test though, and we can probably add a basic multiprocessing setup in the test to enable that.
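Purely as an illustration of the behavior under discussion (not the PR's actual test): because DistributedSampler accepts explicit num_replicas and rank arguments, the index split itself can be checked without a process group or the multiprocessing setup mentioned above, assuming a build that includes the new drop_last argument:

```python
import torch
from torch.utils.data import TensorDataset, DistributedSampler

def check_drop_last_split():
    # 11 samples across 4 replicas is not an even split.
    dataset = TensorDataset(torch.arange(11))
    all_indices = []
    for rank in range(4):
        sampler = DistributedSampler(
            dataset, num_replicas=4, rank=rank, shuffle=False, drop_last=True
        )
        all_indices.extend(list(sampler))
    # ceil((11 - 4) / 4) == 2 indices per replica, 8 in total, none repeated.
    assert len(all_indices) == 8
    assert len(set(all_indices)) == 8

check_drop_last_split()
```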
This pull request has been merged in 5ed7cd0.
Stack from ghstack:
Closes #25162.
DistributedSampler allows data to be split evenly across workers in
DDP, but it has always added additional samples in order for the data to be
evenly split in the case that the # of samples is not evenly divisible by the
number of workers. This can cause issues, such as when computing distributed
validation accuracy, where some samples would be counted twice. Applications may also prefer not to repeat data within a single epoch during training, and would rather drop the tail.
This PR adds a drop_last option where the tail of the data is dropped such that
the effective dataset size is still evenly divisible across the workers. This
ensures that DDP can train fine (there are no uneven inputs) and each replica
gets an equal number of data indices.
The change is backwards compatible because the default value of
`drop_last` is `False`, so the old behavior is the default.

Differential Revision: D22449974
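For reference, a usage sketch of the new option in a typical DDP input pipeline. It assumes a PyTorch build that includes this PR and a process group already initialized elsewhere (e.g. via torchrun and init_process_group); the dataset and loop are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

# Toy dataset standing in for the real training or validation set.
dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))

# drop_last=True trims the tail so no sample index is repeated across
# replicas; the default drop_last=False keeps the old padding behavior.
sampler = DistributedSampler(dataset, shuffle=True, drop_last=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # deterministic per-epoch reshuffle
    for inputs, targets in loader:
        pass  # forward / backward / optimizer step with the DDP-wrapped model
```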