Description
🚀 The feature
DataPipe support for very large datasets at source and sink levels:
Source level:
Support for large sharded datasets (think 100M protein sequences across 1000s of CSV files, 100M molecular graphs in 1000s of pickled binaries, ...)
- allow for loading and batched pre-processing
- some support for shuffling
Sink level:
Support for easy parallelization of dataloading when training with DistributedDataParallel
- distribute the input data across GPUs and workers without custom code
- provide options to deal with uneven data distribution at the dataset level
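To make the request concrete, here is a purely hypothetical sketch of how the two levels could fit together. `shard_files` and `distribute_to_workers` are invented placeholder names, not existing or proposed torchdata APIs; `parse_csv_files` refers to the README example further below, and `shuffle`/`batch` are existing functional DataPipes:

```python
from torchdata.datapipes.iter import FileLister

# Purely illustrative; "shard_files" and "distribute_to_workers" are invented
# placeholders for the source- and sink-level behaviour requested above.
dp = (
    FileLister(root="dataset/", masks="*.csv")  # placeholder path/pattern
    .shard_files()             # source level: split the 1000s of files into chunks
    .shuffle()                 # some support for shuffling
    .parse_csv_files()         # parsing step (see the README example below)
    .batch(batch_size=1024)    # source level: batched pre-processing
    .distribute_to_workers()   # sink level: one slice per GPU / DataLoader worker
)
```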
Motivation, pitch
Writing training pipelines for very large datasets is currently quite labor-intensive, yet it often relies on a few recurring concepts that I think could be abstracted away.
1 - pre-processing steps are often vectorizable, so loading the data row by row as in the README example below leads to inefficiencies:

```python
import csv

from torchdata.datapipes import functional_datapipe
from torchdata.datapipes.iter import IterDataPipe


@functional_datapipe("parse_csv_files")
class CSVParserIterDataPipe(IterDataPipe):
    def __init__(self, dp, **fmtparams):
        self.dp = dp
        self.fmtparams = fmtparams

    def __iter__(self):
        # Yields one (filename, row) pair at a time, with no opportunity
        # to batch or vectorize the parsing / pre-processing work.
        for filename, stream in self.dp:
            reader = csv.reader(stream, **self.fmtparams)
            for row in reader:
                yield filename, row
```

A proposed way forward would be an implementation that allows batching and batched processing of datasets.
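For illustration only, a minimal sketch of what batched pre-processing could look like with DataPipes that already exist (`batch` and `map`); the `parse_rows_batch` function, the file path, and the batch size are hypothetical placeholders for a user-defined vectorized step:

```python
from torchdata.datapipes.iter import FileLister, FileOpener

# Hypothetical vectorized pre-processing step: receives a whole batch of
# (filename, row) pairs at once instead of a single row at a time.
def parse_rows_batch(batch):
    return [(filename, row) for filename, row in batch]

dp = FileLister(root="data/", masks="*.csv")   # "data/" is a placeholder path
dp = FileOpener(dp, mode="rt")
dp = dp.parse_csv_files()        # functional form registered by the class above
dp = dp.batch(batch_size=1024)   # group rows so pre-processing can operate per batch
dp = dp.map(parse_rows_batch)    # apply the batched transform
```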
2 - when using DDP with datasets like the above, in the setting where there are multiple GPUs and multiple workers per GPU processing and loading the data, I think the current approach is to:
- shard the initial dataset (into nfiles, for example)
- distribute the files amongst the workers * GPUs to ensure that every GPU gets a different slice of the dataset

This could be facilitated if there were a way to load a dataset in chunks and distribute the chunks across workers automatically using DataPipes.
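As an illustration of the custom code this currently requires, here is a minimal sketch of manual rank/worker sharding; `KeepMyShard` is a hypothetical helper, not an existing API, and it assumes `torch.distributed` is initialized when running under DDP:

```python
import operator

import torch.distributed as dist
from torch.utils.data import get_worker_info
from torchdata.datapipes.iter import FileLister


class KeepMyShard:
    """Hypothetical filter: keeps only the files owned by this (rank, worker) pair."""

    def __init__(self, num_workers):
        self.num_workers = num_workers

    def __call__(self, indexed_item):
        idx, _ = indexed_item
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        worker = get_worker_info()  # None when not inside a DataLoader worker
        worker_id = worker.id if worker is not None else 0
        total_shards = world_size * self.num_workers
        return idx % total_shards == rank * self.num_workers + worker_id


# Enumerate the files, keep every (world_size * num_workers)-th one for this
# process, then drop the index again.
files_dp = FileLister(root="data/", masks="*.csv")  # placeholder path/pattern
files_dp = files_dp.enumerate().filter(KeepMyShard(num_workers=4))
files_dp = files_dp.map(operator.itemgetter(1))
```

Note that a scheme like this only yields even shards when the number of files is a multiple of world_size * num_workers, which is exactly the uneven-distribution problem mentioned above.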
Alternatives
Some in-progress/parallel implementations of this that I've come across:
- A Chunk Dataset API was proposed and implemented in pytorch/pytorch#26545
- The Petastorm project tries to address this problem specifically for Parquet files: https://github.com/uber/petastorm - however, I found it not to work in practice because it does not guarantee even shards: Petastorm sharding + Distributed PyTorch uber/petastorm#508
- YogaDL is an open-source data sharding library that adds a caching layer for initial dataset shuffling: https://github.com/determined-ai/yogadl
Additional context
No response