Shared Dataset Functionality

## 🚀 Feature

We want to build a unified data pipeline interface that offers building blocks for others to build on, with the following objectives:
* Standardize datasets across domains.
* Offer flexible building blocks that can be combined to obtain other datasets.
* Enable datasets that do not fit in memory.
* Share code among domains.
* Facilitate parallel loading and processing of data.
* Decouple data loading and preprocessing/transformation.
* Offer static typing for datasets.
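
To make the objectives above concrete, here is a minimal sketch of what composable, lazily evaluated, statically typed building blocks could look like. It uses only the standard library. The `Pipeline` class, its `map`/`filter`/`batch` methods, and the `read_lines` loader are hypothetical names chosen for illustration, not a proposed or existing API.

```python
from typing import Callable, Generic, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class Pipeline(Generic[T]):
    """Hypothetical building block: a typed, lazily evaluated stream of samples."""

    def __init__(self, source: Callable[[], Iterable[T]]) -> None:
        # Store a factory rather than an iterator so the pipeline can be
        # iterated more than once (for example, once per epoch).
        self._source = source

    def __iter__(self) -> Iterator[T]:
        return iter(self._source())

    def map(self, fn: Callable[[T], U]) -> "Pipeline[U]":
        # Preprocessing is attached as a separate stage, decoupled from loading.
        return Pipeline(lambda: (fn(x) for x in self))

    def filter(self, pred: Callable[[T], bool]) -> "Pipeline[T]":
        return Pipeline(lambda: (x for x in self if pred(x)))

    def batch(self, size: int) -> "Pipeline[List[T]]":
        def gen() -> Iterator[List[T]]:
            buf: List[T] = []
            for x in self:
                buf.append(x)
                if len(buf) == size:
                    yield buf
                    buf = []
            if buf:  # emit the final, possibly smaller, batch
                yield buf

        return Pipeline(gen)


def read_lines() -> Iterator[str]:
    # Stand-in for a loader that streams records from disk or the network;
    # nothing is materialized in memory up front.
    for i in range(7):
        yield f"sample-{i}"


dataset = (
    Pipeline(read_lines)
    .map(str.upper)
    .filter(lambda s: not s.endswith("3"))
    .batch(2)
)
batches = list(dataset)
```

Because every stage is a generator, only one batch needs to be held in memory at a time, and a type checker can follow the element type through each stage (`Pipeline[str]` to `Pipeline[List[str]]` here).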

## Motivation

* Each domain library (e.g. text, audio, vision) currently has its own non-standard dataset structure, which may also download the data. This duplicates effort and adds complexity for the user.
* A common bottleneck when generating datasets is reading the data. We want to offer an interface that enables reading the data and running initial preprocessing while making full use of the available computing resources.
* We may want to leverage specialized libraries such as NVIDIA DALI.
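
As an illustration of the read bottleneck described above, the sketch below overlaps loading with consumption by reading from the source in a background thread. The `prefetch` function and `buffer_size` parameter are hypothetical names, and this is only one possible approach: in CPython a thread mainly helps with I/O-bound loading, while CPU-bound preprocessing would need processes instead.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def prefetch(source: Iterable[T], buffer_size: int = 4) -> Iterator[T]:
    """Hypothetical helper: iterate `source` in a background thread.

    Limitations of this sketch: exceptions raised by `source` are not
    propagated to the consumer, and if the consumer stops early the worker
    stays blocked until the process exits (it is a daemon thread).
    """
    q: "queue.Queue[object]" = queue.Queue(maxsize=buffer_size)

    def worker() -> None:
        try:
            for item in source:
                q.put(item)  # blocks when the buffer is full
        finally:
            q.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item  # type: ignore[misc]


# The consumer sees the same items in the same order as the source.
loaded = list(prefetch(range(10), buffer_size=3))
```

The bounded queue keeps memory use predictable: the worker can run at most `buffer_size` items ahead of the consumer.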

## Additional Information

* [torch.utils.data](https://github.com/pytorch/pytorch/blob/master/torch/utils/data/)
* [tf.data](https://www.tensorflow.org/beta/guide/data) (e.g. uses dictionary for data point iteration)
* fast.ai's [basic_data](https://docs.fast.ai/basic_data.html) and [data_block](https://docs.fast.ai/data_block.html)
* [tnt](https://github.com/pytorch/tnt/blob/master/torchnet/dataset/dataset.py)
* [torchnet](https://github.com/torchnet/torchnet/tree/master/dataset)
* [~~torchdata~~](https://pypi.org/project/torchdata/)

Datasets:
* pytorch/text#624, pytorch/text#610, pytorch/audio#303: new datasets in domains
* pytorch/vision#1193 wants to select which metadata to return
* Internal: [overview](https://fb.quip.com/vlWwA35cmq0t) [torchtext](https://fb.quip.com/LncwAsC1cUZt) [core](https://fb.quip.com/B0PeACndlZEE) [torchvision](https://fb.quip.com/WGsUApsce6xN)
* [safe datasets](https://github.com/msamogh/nonechucks)

Dataloader:
* [torchaudio background iterator](https://github.com/pytorch/audio/blob/master/torchaudio/datasets/utils.py#L314)
* #24915 wants to re-use worker processes
* [FastDataLoader](https://github.com/pytorch/pytorch/issues/15849#issuecomment-573921048)
* [python 3.8 shared memory](https://docs.python.org/3/library/multiprocessing.shared_memory.html)
* Internal: [torchdata](https://fb.quip.com/ekJJAsYqMG7X) [gil](https://docs.google.com/document/d/1InJP79dWTIYj-xGVU65Y2r-K2HeL6t2l1xGDfKTU4Rw/edit#) [experiment](https://fb.quip.com/imVLAOdyJfAI) [DataLoader+Iterable](https://fb.workplace.com/groups/2162019300778793/permalink/3398854433474998/)

Features:
* #12672 wants to move collate_fn functionality to datasets
* #26547 wants distributed random sampling
* #28743 wants a sampler for iterable datasets
* pytorch/vision#1315 wants to apply one instance of a random transform sequence to many images
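
The last request, applying one sampled instance of a random transform to many images, can be sketched by separating parameter sampling from application. This is a standard-library illustration under assumed names (`sample_flip_and_shift`, the `Image` alias, and the flip/shift operations are hypothetical), not how any existing transform library does it.

```python
import random
from typing import Callable, List

Image = List[List[int]]  # toy stand-in for an image: rows of pixel values


def sample_flip_and_shift(rng: random.Random) -> Callable[[Image], Image]:
    """Sample random parameters once and return a deterministic transform."""
    flip = rng.random() < 0.5   # whether to flip horizontally
    shift = rng.randint(0, 3)   # brightness offset, inclusive bounds

    def apply(img: Image) -> Image:
        rows = [row[::-1] if flip else list(row) for row in img]
        return [[value + shift for value in row] for row in rows]

    return apply


frames: List[Image] = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]

# One sampled instance is reused for every frame, so all frames receive
# the same flip decision and the same shift.
transform = sample_flip_and_shift(random.Random(0))
transformed = [transform(frame) for frame in frames]
```

Returning a closure keeps the randomness in one place: calling `sample_flip_and_shift` again yields a new instance, while reusing `transform` guarantees identical treatment of related images (for example, video frames or an image and its mask).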

cc @SsnL @fmassa @zhangguanheng66 @vincentqb @mrshenli 

Shared Dataset Functionality #24915

🚀 Feature

Motivation

Additional Information

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Shared Dataset Functionality #24915

Description

🚀 Feature

Motivation

Additional Information

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions