Open
Labels
- better-engineering: Relatively self-contained tasks for better engineering contributors
- module: dataloader: Related to torch.utils.data.DataLoader and Sampler
- triaged: This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Description
🚀 Feature
We want to build a unified data pipeline interface that offers building blocks for others to build on with the following objectives:
- Standardize datasets across domains.
- Offer flexible building blocks that can be combined to obtain other datasets.
- Enable datasets that do not fit in memory.
- Share code among domains.
- Facilitate parallel loading and processing of data.
- Decouple data loading and preprocessing/transformation.
- Offer static typing for datasets.
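To make the objectives above more concrete, here is a minimal sketch of what composable, statically typed building blocks could look like. It is plain Python and not a proposed API: the `Pipe` class, its `map`/`filter`/`batch` methods, and the `read_lines` reader are all hypothetical names chosen for illustration. Each stage is lazy, so records are streamed rather than held in memory, and loading (`read_lines`) stays separate from preprocessing (`map`, `filter`).

```python
from typing import Callable, Generic, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class Pipe(Generic[T]):
    """Hypothetical building block: a lazy, re-iterable stream of T."""

    def __init__(self, source: Callable[[], Iterable[T]]) -> None:
        # Store a factory, not an iterator, so the pipe can be iterated again.
        self._source = source

    def __iter__(self) -> Iterator[T]:
        return iter(self._source())

    def map(self, fn: Callable[[T], U]) -> "Pipe[U]":
        # Preprocessing is a separate stage layered on top of loading.
        return Pipe(lambda: (fn(x) for x in self))

    def filter(self, pred: Callable[[T], bool]) -> "Pipe[T]":
        return Pipe(lambda: (x for x in self if pred(x)))

    def batch(self, size: int) -> "Pipe[List[T]]":
        def gen() -> Iterator[List[T]]:
            buf: List[T] = []
            for x in self:
                buf.append(x)
                if len(buf) == size:
                    yield buf
                    buf = []
            if buf:  # emit the final, possibly smaller, batch
                yield buf

        return Pipe(gen)


def read_lines() -> Iterator[str]:
    # Stand-in for a reader that streams records from disk one at a time.
    for line in ["the cat", "", "sat on", "the mat", "again"]:
        yield line


# Combine the blocks: drop empty lines, tokenize, then group into batches of 2.
tokens = Pipe(read_lines).filter(lambda s: len(s) > 0).map(str.split).batch(2)
batches = list(tokens)
```

Because every stage carries a type parameter, a static checker can infer that `tokens` is a `Pipe[List[List[str]]]`, which is one possible reading of the static typing objective.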
Motivation
- The domains currently each have their own non-standard dataset structure, which may also download the data. This duplicates effort and adds complexity for the user.
- A common bottleneck when generating datasets is reading the data. We want to offer an interface that enables reading the data and running initial preprocessing while maximizing utilization of the available computing resources.
- We may want to leverage specialized libraries such as NVIDIA DALI.
Additional Information
- torch.utils.data
- tf.data (e.g. uses dictionary for data point iteration)
- fast.ai's basic_data and data_block
- tnt
- torchnet
- torchdata
Datasets:
- Re-write language_modeling datasets (PennTreebank, WikiText103, WikiText2) text#624, Add a new dataset - Enwik9 text#610, and new dataset format with librispeech and commonvoice audio#303: new datasets in domains
- Add VGGface2 dataset vision#1193: wants to select which metadata to return
- Internal: overview torchtext core torchvision
- safe datasets
Dataloader:
- torchaudio background iterator
- Shared Dataset Functionality #24915: wants to re-use worker processes
- FastDataLoader
- python 3.8 shared memory
- Internal: torchdata gil experiment DataLoader+Iterable
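As an illustration of the background-iterator idea in the list above, and of the reading bottleneck described under Motivation, the following is a minimal sketch that reads ahead in a background thread while the consumer processes earlier items. The `prefetch` function, the `_Failure` wrapper, and the `buffer_size` parameter are hypothetical names, not part of torchaudio or torch.utils.data. Threads mainly help when reading is I/O-bound; CPU-bound preprocessing in Python is still limited by the GIL.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


class _Failure:
    """Carries an exception from the worker thread to the consumer."""

    def __init__(self, exc: BaseException) -> None:
        self.exc = exc


def prefetch(source: Iterable[T], buffer_size: int = 4) -> Iterator[T]:
    """Yield items from `source` in order, reading ahead in a background thread."""
    q: "queue.Queue[object]" = queue.Queue(maxsize=buffer_size)

    def worker() -> None:
        try:
            for item in source:
                q.put(item)  # blocks when the buffer is full
            q.put(_DONE)
        except BaseException as exc:  # forward errors instead of hanging
            q.put(_Failure(exc))

    # Daemon thread: it will not keep the process alive if the consumer
    # stops iterating early. A fuller version would add explicit shutdown.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()
        if item is _DONE:
            return
        if isinstance(item, _Failure):
            raise item.exc
        yield item  # type: ignore[misc]


def slow_reader() -> Iterator[int]:
    # Stand-in for reading records from disk or the network.
    for i in range(10):
        yield i


loaded = list(prefetch(slow_reader(), buffer_size=2))
```

A single worker and a FIFO queue keep the output order identical to the input order, which makes the wrapper a drop-in replacement for iterating over the source directly.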
Features:
- Move collate_fn functionality / responsibility into Dataset object #12672: wants to move collate_fn functionality to datasets
- ChunkDataset API proposal #26547: wants distributed random sampling
- Sampler for IterableDataset #28743: wants a sampler for iterable datasets
- MultiCompose and SegmentationCompose [proof-of-concept] vision#1315: wants to apply one instance of a random transform sequence to many images
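The last item above, applying a single instance of a random transform sequence to several images, can be sketched by drawing the random decisions once and reusing them for every input. This is only an illustration under stated assumptions: it does not reproduce the vision#1315 proof-of-concept, images are modeled as nested lists rather than tensors, and `sample_transform` and `transform_together` are hypothetical names.

```python
import random
from typing import Callable, List, Sequence

Image = List[List[int]]  # toy stand-in for an image: rows of pixel values


def hflip(img: Image) -> Image:
    return [row[::-1] for row in img]


def vflip(img: Image) -> Image:
    return img[::-1]


def sample_transform(rng: random.Random) -> Callable[[Image], Image]:
    """Draw the random decisions once and return a deterministic transform."""
    do_h = rng.random() < 0.5
    do_v = rng.random() < 0.5

    def apply(img: Image) -> Image:
        if do_h:
            img = hflip(img)
        if do_v:
            img = vflip(img)
        return img

    return apply


def transform_together(images: Sequence[Image], rng: random.Random) -> List[Image]:
    t = sample_transform(rng)  # one draw shared by every input
    return [t(img) for img in images]


# An image and its segmentation mask must receive the same flips.
image = [[1, 2], [3, 4]]
mask = [[10, 20], [30, 40]]
out_image, out_mask = transform_together([image, mask], random.Random(0))
```

Separating "sample the parameters" from "apply the transform" is the key design choice: calling a random transform independently on the image and on the mask would let the two drift out of alignment.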