Chunk Dataset API #26545
Closed · thiagocrepaldi wants to merge 21 commits into pytorch:master from thiagocrepaldi:thiagofc/chunk_dataset
Conversation
Before, the whole chunk cache was being shuffled. Now, only the new chunk is shuffled before being appended to the cache. An epoch concept was added to DistributedChunkSampler.
Labels: module: dataloader (Related to torch.utils.data.DataLoader and Sampler), module: typing (Related to mypy type annotations)
ChunkDataset API proposal
Problem to be solved
Typical data loading in PyTorch assumes that all of the data is accessible to every participating process. Randomization is performed by the sampler, which knows the total length of the dataset. While this approach is simple and natural for scenarios such as a directory full of images, it does not map well to situations where a large dataset of unknown size is spread across a collection of files, or stored in a single large file. Global randomization incurs many disk seeks, and the user needs to carefully partition the data to support distributed training. Manually splitting the data, distributing it amongst computing units without duplicates, and performing efficient shuffling are not strictly related to training models, but are still important. We often implement similar boilerplate code in different projects, which increases development time.
Proposed solution
The proposed `ChunkDataset` is a stateful dataset that supports hierarchical sampling and efficient reading through chunks. A *chunk*, in this context, could be a file (such as an audio or image file), a section of a file in the case of a large text file, a folder, a URL, or any other abstraction that allows the data to be segmented into pieces of roughly the same size.

Unlike regular datasets, `ChunkDataset` implements two levels of sampling, i.e. hierarchical sampling. In the first level, a chunk is selected based on a sampling strategy; in the second, a sample is selected from that chunk using another (or a similar) sampling strategy. The hierarchical sampling approach adopted here provides satisfactory randomness and is inspired by the following paper.

By using the ChunkDataset API, tasks such as splitting data between computing units with proper randomization become trivial. All the user has to do is provide a `ChunkDataReader` implementation that reads a chunk, instantiate a `DistributedChunkSampler` with the desired shuffling strategy, and finally put everything together in a `ChunkDataset` instance. Once this dataset is passed to the PyTorch `DataLoader`, every worker learns its correct rank, reads its own piece of the data, and continues with the regular `DataLoader` flow.

Brief discussion on API
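Before going class by class, the two-level sampling idea described above can be illustrated with a small, self-contained sketch. The toy data and function names below are illustrative only; this is not the PR's code.

```python
import random

# Toy "dataset": a list of chunks, each chunk a list of samples. In practice a
# chunk would be a file or file section read on demand, not an in-memory list.
chunks = [list(range(i * 10, i * 10 + 10)) for i in range(4)]  # 4 chunks, 10 samples each

def iterate_hierarchically(chunks, rng):
    """First level: visit chunks in shuffled order.
    Second level: shuffle the samples inside each chunk before yielding them."""
    chunk_order = list(range(len(chunks)))
    rng.shuffle(chunk_order)          # level 1: chunk-level shuffle
    for ci in chunk_order:
        samples = list(chunks[ci])
        rng.shuffle(samples)          # level 2: in-chunk shuffle
        yield from samples

out = list(iterate_hierarchically(chunks, random.Random(0)))
# Every sample is produced exactly once, yet only one chunk needs to be
# resident at a time -- the property that avoids global disk seeks.
assert sorted(out) == list(range(40))
```

The randomness is weaker than a fully global shuffle (samples from the same chunk stay near each other in one epoch), which is the trade-off the cited paper argues is acceptable.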
ChunkDataReader class
In order to read a particular chunk chosen by `DistributedChunkSampler`, the user has to implement a reader class that extends `ChunkDataReader`:

DistributedChunkSampler class
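To make concrete what the sampler discussed in this section does, here is a simplified, self-contained stand-in. It is not the PR's implementation; the constructor parameters mirror the ones named in the description, but the class body is an assumption.

```python
import random

class SketchDistributedChunkSampler:
    """Simplified stand-in for the proposed DistributedChunkSampler.

    Draws chunk indices from `num_chunks` (the dataset size is unknown, so no
    dataset object is passed in), partitions them across `num_replicas` ranks,
    and -- unlike DistributedSampler -- does NOT pad the index list, so no
    chunk is ever read twice by different workers."""

    def __init__(self, num_chunks, num_replicas=1, rank=0, shuffle=True):
        self.num_chunks = num_chunks
        self.num_replicas = num_replicas
        self.rank = rank
        self.shuffle = shuffle
        self.epoch = 0

    def set_epoch(self, epoch):
        # Called between epochs so each epoch uses a different shuffle seed.
        self.epoch = epoch

    def __iter__(self):
        indices = list(range(self.num_chunks))
        if self.shuffle:
            # Seeding with the epoch gives every rank the same global order.
            random.Random(self.epoch).shuffle(indices)
        # Unpadded round-robin slice; trailing ranks may get one chunk fewer.
        return iter(indices[self.rank::self.num_replicas])

# All ranks together cover every chunk exactly once (no padding, no duplicates).
parts = [list(SketchDistributedChunkSampler(10, num_replicas=3, rank=r))
         for r in range(3)]
assert sorted(i for p in parts for i in p) == list(range(10))
```

The per-rank lists can differ in length by one, which is exactly why padding is omitted: padding would force a duplicate chunk read on some worker.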
`DistributedChunkSampler` is already implemented; the user only needs to instantiate it and inject it into `ChunkDataset`.

Similarly to `DistributedSampler`, `DistributedChunkSampler` takes :attr:`num_replicas`, :attr:`rank` and :attr:`shuffle` in its constructor to specify the number of processes participating in the distributed training, the current rank of the process, and the shuffling strategy. One main difference between the two samplers is that, because `DistributedChunkSampler` operates on an `IterableDataset` of unknown size, it takes :attr:`num_chunks` as input to draw indices from, as opposed to `DistributedSampler`'s :attr:`dataset` parameter. Another important difference is that `DistributedSampler` pads its generated indices, which can't be done for chunks, as padding would cause duplicate reads on different workers.
The `DistributedChunkSampler` public API is:

ChunkDataset class
`ChunkDataset` is already implemented; the user only needs to instantiate it and inject it into the PyTorch `DataLoader`.

As mentioned before, `ChunkDataset` is an `IterableDataset` implementation, which focuses on representing a dataset of unknown size. Once it is passed to the PyTorch `DataLoader`, it iterates over the dataset until it is exhausted. At that point, an exception is raised internally and reading finishes gracefully. `ChunkDataset` must be `reset` after each epoch to reset the internal state of the sampler and, optionally, to improve shuffling by injecting the `epoch` number.
The `ChunkDataset` public API is:

PS: this PR builds on `IterableDataset` and the original C++ implementation of the ChunkDataset API.
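To close, the reader → sampler → dataset → loader workflow the proposal describes can be sketched end to end. Since this PR was closed and the Python API never merged, every name below is a hypothetical stand-in inferred from the proposal text, not an actual torch API.

```python
import random

class ToyChunkReader:
    """Plays the role of a user-supplied ChunkDataReader: given a chunk
    index, return all samples of that chunk (fabricated numbers here; a real
    reader would parse a file, file section, folder, or URL)."""
    def __init__(self, chunk_size=4, num_chunks=5):
        self.chunk_size = chunk_size
        self.num_chunks = num_chunks
    def __call__(self, chunk_index):
        start = chunk_index * self.chunk_size
        return list(range(start, start + self.chunk_size))

def chunk_indices_for_rank(num_chunks, num_replicas, rank, epoch):
    """Plays the role of DistributedChunkSampler: epoch-seeded shuffle of the
    chunk indices, then an unpadded per-rank slice (no duplicate reads)."""
    indices = list(range(num_chunks))
    random.Random(epoch).shuffle(indices)
    return indices[rank::num_replicas]

def iterate_epoch(reader, num_replicas, rank, epoch):
    """Plays the role of ChunkDataset + DataLoader on one worker: read each
    assigned chunk, shuffle it, and stream out the samples."""
    rng = random.Random(epoch)
    for ci in chunk_indices_for_rank(reader.num_chunks, num_replicas, rank, epoch):
        samples = reader(ci)
        rng.shuffle(samples)
        yield from samples

reader = ToyChunkReader()
# Two "workers": together they see every sample exactly once per epoch.
epoch0 = [s for r in range(2) for s in iterate_epoch(reader, 2, r, epoch=0)]
assert sorted(epoch0) == list(range(20))
```

Passing a different `epoch` value reshuffles both the chunk assignment and the in-chunk order, which is what the `reset`/`set_epoch` mechanism provides in the actual proposal.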