Support large data inputs and DDP consumption #245

@henripal

Description

🚀 The feature

DataPipe support for very large datasets at source and sink levels:

Source level:
Support for large sharded datasets (think 100M protein sequences across 1000s of CSV files, 100M molecular graphs in 1000s of pickled binaries, ...)

  • allow for loading and batched pre-processing
  • some support for shuffling (see the sketch after this list)
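
As a rough illustration of the source level, here is a minimal sketch of what loading and shuffling such a sharded dataset can look like with today's DataPipes, assuming torchdata's FileLister / FileOpener / parse_csv / shuffle primitives; the paths and buffer size are purely illustrative:

from torchdata.datapipes.iter import FileLister, FileOpener

# List the thousands of shard files and open them lazily as text streams.
dp = FileLister(root="data/protein_shards", masks="*.csv")
dp = FileOpener(dp, mode="rt")

# Rows are parsed one at a time; batched pre-processing on top of this is
# exactly what the motivation below asks for.
dp = dp.parse_csv()

# Buffer-based shuffling only mixes rows within a bounded window, which is
# the extent of the shuffling support available today for datasets this size.
dp = dp.shuffle(buffer_size=10_000)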

Sink level:
Support for easy parallelization of dataloading when training with DistributedDataParallel

  • distribute the input data across GPUs and workers without custom code
  • provide options to deal with uneven data distribution at the dataset level

Motivation, pitch

Writing training pipelines for very large datasets is currently quite labor-intensive, yet it keeps reusing a few concepts that I think could be abstracted away.
1 - pre-processing steps are often vectorizable, so loading the data row by row as in the README example:

import csv

from torchdata.datapipes import functional_datapipe
from torchdata.datapipes.iter import IterDataPipe


@functional_datapipe("parse_csv_files")
class CSVParserIterDataPipe(IterDataPipe):
    def __init__(self, dp, **fmtparams):
        self.dp = dp
        self.fmtparams = fmtparams

    def __iter__(self):
        for filename, stream in self.dp:
            reader = csv.reader(stream, **self.fmtparams)
            for row in reader:
                yield filename, row

would be inefficient, because every row is handled individually. A proposed way forward would be an implementation that allows batching and batched processing of datasets, along the lines of the sketch below.
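
A hedged sketch of what that could look like with the existing .batch() and .map() functionals; tokenize_batch is a hypothetical vectorized pre-processing step, not an existing API:

from torchdata.datapipes.iter import FileLister, FileOpener

def tokenize_batch(rows):
    # Stand-in for a vectorized operation (e.g. batched tokenization);
    # it receives a whole batch of parsed CSV rows at once.
    return [row[0].upper() for row in rows]

dp = FileLister(root="data/shards", masks="*.csv")
dp = FileOpener(dp, mode="rt").parse_csv()
dp = dp.batch(batch_size=256)   # amortize per-item overhead over 256 rows
dp = dp.map(tokenize_batch)     # map now sees whole batches, not single rows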

2 - when using DDP with datasets like the above, in the setting where each of several GPUs has multiple workers loading and processing data, I think the current approach is to:

  • shard the initial dataset (in n files, for example)
  • distribute the files across the workers × GPUs so that every GPU gets a different slice of the dataset

This could be made much easier if there were a way to load a dataset in chunks and distribute the chunks across workers automatically using DataPipes. The sketch below shows what currently has to be written by hand.
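
A minimal sketch of that by-hand sharding, assuming torch.distributed is initialized and the pipe runs inside DataLoader workers; ShardByRankAndWorker, the paths, and the round-robin policy are my own illustration, not existing torchdata behavior:

import torch.distributed as dist
from torch.utils.data import get_worker_info
from torchdata.datapipes.iter import FileLister, IterDataPipe

class ShardByRankAndWorker(IterDataPipe):
    def __init__(self, source_dp):
        self.source_dp = source_dp

    def __iter__(self):
        # One "consumer" per (GPU rank, DataLoader worker) pair.
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        info = get_worker_info()
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1

        consumers = world_size * num_workers
        my_index = rank * num_workers + worker_id
        # Round-robin the shard files across consumers; handling an uneven
        # split is left to the user, which is what this issue wants to avoid.
        for i, item in enumerate(self.source_dp):
            if i % consumers == my_index:
                yield item

files = FileLister(root="data/shards", masks="*.csv")
sharded_files = ShardByRankAndWorker(files)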

Alternatives

Some in-progress / parallel implementations of this I've come across:

Additional context

No response
