Description
🚀 The feature
DataPipe support for very large datasets at source and sink levels:
Source level:
Support for large sharded datasets (think 100M protein sequences across 1000s of CSV files, 100M molecular graphs in 1000s of pickled binaries, ...)
- allow for loading and batched pre-processing
- some support for shuffling
Sink level:
Support for easy parallelization of dataloading when training with DistributedDataParallel
- distribute the input data across GPUs and workers without custom code
- provide options to deal with uneven data distribution at the dataset level
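To make the request concrete, here is a purely hypothetical sketch of how the two levels could fit together. `shard_files` and `distribute_to_workers` are invented placeholder names, not existing or proposed torchdata APIs; `parse_csv_files` refers to the README example further below, and `shuffle`/`batch` are existing functional DataPipes:

```python
from torchdata.datapipes.iter import FileLister

# Purely illustrative; "shard_files" and "distribute_to_workers" are invented
# placeholders for the source- and sink-level behaviour requested above.
dp = (
    FileLister(root="dataset/", masks="*.csv")  # placeholder path/pattern
    .shard_files()             # source level: split the 1000s of files into chunks
    .shuffle()                 # some support for shuffling
    .parse_csv_files()         # parsing step (see the README example below)
    .batch(batch_size=1024)    # source level: batched pre-processing
    .distribute_to_workers()   # sink level: one slice per GPU / DataLoader worker
)
```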
Motivation, pitch
Writing training pipelines for very large datasets is currently quite labor-intensive, yet it often relies on a few recurring concepts that I think could be abstracted away.
1 - pre-processing steps are often vectorizable, so loading the data row by row as in the README example below leads to inefficiencies:

```python
import csv

from torchdata.datapipes import functional_datapipe
from torchdata.datapipes.iter import IterDataPipe


@functional_datapipe("parse_csv_files")
class CSVParserIterDataPipe(IterDataPipe):
    def __init__(self, dp, **fmtparams):
        self.dp = dp
        self.fmtparams = fmtparams

    def __iter__(self):
        # Yields one (filename, row) pair at a time, with no opportunity
        # to batch or vectorize the parsing / pre-processing work.
        for filename, stream in self.dp:
            reader = csv.reader(stream, **self.fmtparams)
            for row in reader:
                yield filename, row
```

A proposed way forward would be an implementation that allows batching and batched processing of datasets.
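For illustration only, a minimal sketch of what batched pre-processing could look like with DataPipes that already exist (`batch` and `map`); the `parse_rows_batch` function, the file path, and the batch size are hypothetical placeholders for a user-defined vectorized step:

```python
from torchdata.datapipes.iter import FileLister, FileOpener

# Hypothetical vectorized pre-processing step: receives a whole batch of
# (filename, row) pairs at once instead of a single row at a time.
def parse_rows_batch(batch):
    return [(filename, row) for filename, row in batch]

dp = FileLister(root="data/", masks="*.csv")   # "data/" is a placeholder path
dp = FileOpener(dp, mode="rt")
dp = dp.parse_csv_files()        # functional form registered by the class above
dp = dp.batch(batch_size=1024)   # group rows so pre-processing can operate per batch
dp = dp.map(parse_rows_batch)    # apply the batched transform
```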
2 - when using DDP with datasets like the above, in the setting where there are multiple GPUs and multiple workers per GPU processing and loading the data, I think the current approach is to:
- shard the initial dataset (into nfiles, for example)
- distribute the files amongst the workers * GPUs to ensure that every GPU gets a different slice of the dataset

This could be facilitated if there were a way to load a dataset in chunks and distribute the chunks across workers automatically using DataPipes.
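As an illustration of the custom code this currently requires, here is a minimal sketch of manual rank/worker sharding; `KeepMyShard` is a hypothetical helper, not an existing API, and it assumes `torch.distributed` is initialized when running under DDP:

```python
import operator

import torch.distributed as dist
from torch.utils.data import get_worker_info
from torchdata.datapipes.iter import FileLister


class KeepMyShard:
    """Hypothetical filter: keeps only the files owned by this (rank, worker) pair."""

    def __init__(self, num_workers):
        self.num_workers = num_workers

    def __call__(self, indexed_item):
        idx, _ = indexed_item
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        worker = get_worker_info()  # None when not inside a DataLoader worker
        worker_id = worker.id if worker is not None else 0
        total_shards = world_size * self.num_workers
        return idx % total_shards == rank * self.num_workers + worker_id


# Enumerate the files, keep every (world_size * num_workers)-th one for this
# process, then drop the index again.
files_dp = FileLister(root="data/", masks="*.csv")  # placeholder path/pattern
files_dp = files_dp.enumerate().filter(KeepMyShard(num_workers=4))
files_dp = files_dp.map(operator.itemgetter(1))
```

Note that a scheme like this only yields even shards when the number of files is a multiple of world_size * num_workers, which is exactly the uneven-distribution problem mentioned above.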
Alternatives
Some in-progress/parallel implementations of this that I've come across:
- A Chunk Dataset API was proposed and implemented in pytorch/pytorch#26545
- The Petastorm project tries to address this problem specifically for Parquet files: https://github.com/uber/petastorm - however, I found it not to work in practice because it does not guarantee even shards: Petastorm sharding + Distributed PyTorch uber/petastorm#508
- YogaDL is an open-source data sharding library that adds a caching layer for initial dataset shuffling: https://github.com/determined-ai/yogadl
Additional context
No response