
[RFC] DataLoader architecture updates and TarDataset implementation #49440

@VitalyFedyunin

Description


DataLoader architecture updates and TarDataset implementation

Problem statement

This proposal aims to construct a modular, user-friendly, and performant toolset to address the ambiguous activity referred to as “dataloading” within PyTorch, an over-simplification encouraged by the monolithic DataLoader abstraction prescribed today. In reality, “dataloading” is a diverse set of operations that should be supported by extensible building blocks, out of which the present abstractions, and far more, could easily be built. Some typical needs which are barely supported by the present implementation include:

  • Lazy loading - Users want to point PyTorch to a remote data source (e.g. http, S3, GCP, Azure, Manifold, Hive) and iterate over the contents without downloading the entire dataset, ideally only downloading samples as soon as they are needed. Further, if a user writes such code for a remote storage platform, there is no natural place to contribute it for public or private reuse.
  • Structured data, heterogeneous storage
    • There are hundreds of ways to store a single structured dataset, each requiring a custom or highly configured DataLoader. Users want to take advantage of modularity and not reimplement complete DataSets over and over again. Suppose we have a simple dataset of sample:label pairs of images and ints. There are a number of dimensions whose outer product enumerates the possible storage formats for this data, each requiring a distinct (or specifically configured) DataLoader:

      • Primitive formats - Are images stored in an image storage format (e.g. one of these), as a tensor (.pt), as a serialized (e.g. pickle, json) Python data structure, etc.?
      • Grouping - Are pairs grouped together by directory, by an archive format (tar, HDF5), by a serialization format (e.g. json, protobuf), by common file string (e.g. image00012321.jpg, label00012321.txt), by meaningful filenames (e.g. image_00023423_dog.jpg), by contents of file headers, by pickled Python data structures, etc? Are filenames otherwise meaningful? Are file headers otherwise meaningful?
      • Sharding - Is the dataset partitioned for performance reasons into arbitrary groups, each containing grouped pairs? In which grouping format (e.g. directories, tar, arrow, parquet, HDF5)?
      • Compression - Are files or groups compressed or binarized, e.g. gz, zip, protobuf?
      • Locale - Are the files local, remote via custom request format (e.g. proprietary data, public REST API, kaggle dataset), on an http server, in cloud object storage?
    • The above example is only for an extremely simple data structure case. The reality of data is often dramatically more heterogeneous and complex (e.g. variable-length lists of bounding box points and strings in object detection, highly nested structures of user or product features in ranking).

    • Further, users want to find or contribute ops to decode specific file types (e.g. HDF5) and accelerated kernels (e.g. GPU mp3 decoding).

    • Given PyTorch maintains decoders for many storage formats, users want more powerful top-level abstractions, such as simply pointing PyTorch to a local or remote directory and receiving an iterator over best-efforts deserializations of the files within.

  • Shuffling - Users want control over when shuffling occurs within “Dataloading.” It often makes a big performance and accuracy difference whether samples within a shard are shuffled, samples are globally shuffled, shards are shuffled, etc.
  • Pipelining and parallelism - Users want to be able to pipeline their loading and preprocessing (rather than make multiple CPU passes, for example), specify a number of workers to read and preprocess data, and not worry about whether reading, preprocessing, or model execution are starved. This can include asynchronous processes which prefetch data to feed to others.

TensorFlow addresses many of the above needs with TFRecord, dramatically simplifying the problem by taking a strong opinion on the data format with which TensorFlow works best. This has been extremely successful from a performance perspective. However, by prescribing a single storage format, all others are demoted, and the diversity of data needs and entrenched formats has made ubiquitous adoption of TFRecord for storage practically impossible. We’ve heard directly from users that they do not want to be forced into a single first-class format, and the public datasets (which Google rehosts in TFRecord) tend to agree (by completely disagreeing on format). For this reason, we prefer extensibility over prescription: we provide performant support for a basic set of formats in-tree (e.g. Hive and Manifold internally, tar shards and Arrow externally), but users can easily plug in modular extensions for new formats.

Underlying DataLoader Issues

Beyond the needs described above, the existing DataLoader is also a frequent source of user requests and github issues. Such feedback includes, but is not limited to:

  • Fork and general multi-processing memory usage patterns - There are multiple reports on GitHub of users confused about how fork’s copy-on-write and Python’s reference counting interact, leading to OOMs. PyTorch users end up shopping for custom solutions such as separate list-management processes, shared binary segments, etc.
  • Threading vs Multiprocessing - Different use cases require one or the other. For example, threading generally performs better while multiprocessing works better with third-party libraries with non-threadlocal state.
  • Overcomplication of solutions - TarDataset requires custom shuffling and sampling implemented as Datasets, while our built-in solution requires altering the DataLoader. It would be best to separate data processing (reordering included) from process management.
  • Multiprocessing support - Today, proper pre-fetching is not possible due to the synchronous nature of Datasets. In order to bypass this, users must implement custom multiprocessing-enabled Datasets and DataLoaders themselves.
  • Manual sharding - Sharding is increasingly becoming a mandatory feature, allowing better multiprocessing and distributed execution. Currently users must implement it manually.
Finally, the ubiquity of the DataLoader necessitates strong backward compatibility. For this reason we do not plan to deprecate any existing functionality, but in some cases may offer a more modern way of doing things.

Solution

Break the Dataset down into smaller components, DataPipes, each reducing its logic to a queue of data in and a queue of data out.

The DataLoader observes the acyclic graph of DataPipes and provides the necessary level of parallelism using multiprocessing and multithreading.

Bear in mind that even though we use IterDataPipe in the examples below, all of this is also applicable to MapDataPipe.

Separating into smaller DataPipes and connecting them together

class ListFiles(datapipes.iter.IterDataPipe):
    # ... (construction details such as the root directory are omitted)
    def __iter__(self):
        # yield file_names one by one
        ...

class LoadFiles(datapipes.iter.IterDataPipe):
    def __init__(self, listfiles_dp):
        self._listfiles_dp = listfiles_dp
        # ...

    def __iter__(self):
        for file_name in self._listfiles_dp:
            yield (file_name, load_file(file_name))

This will allow us to simplify DataPipe code and make it reusable across various implementations (for example ImageFolder and TarDataset). It is also necessary when moving memory-consuming DataPipes into separate processes.

Turning IterDataPipe (or IterableDataset) and MapDataPipe (or MapDataset) into NonBlockingIterDataPipe and NonBlockingMapDataPipe

Multiprocessing/threading support makes us prefer nonblocking_next over the __next__ function. The key difference is that nonblocking_next may raise a NotAvailable exception, meaning that the data is not yet available and should be requested again via nonblocking_next.

A DataPipe (or older Dataset) which implements only nonblocking_next can be used as a standard DataPipe, because the parent class provides the necessary API:

class NonBlockingIterDataPipe(datapipes.iter.IterDataPipe):
    def __iter__(self):
        return self

    def __next__(self):
        while True:
            try:
                return self.nonblocking_next()
            except StopIteration:
                raise
            except NotAvailable:
                # Data is not ready yet; give the event loop a chance to progress.
                time.sleep(DELAY)
                EventLoop.iteration()

    def nonblocking_next(self):
        raise NotImplementedError

An existing synchronous DataPipe (or older Dataset) can be turned into a non-blocking DataPipe using a helper function:

def EnsureNonBlockingNextDataPipe(validated_datapipe):
    if not isinstance(validated_datapipe, IterDataPipe):
        raise Exception('Not an IterDataPipe')
    if isinstance(validated_datapipe, NonBlockingIterDataPipe):
        return validated_datapipe
    if not hasattr(validated_datapipe, '_as_iterator'):
        setattr(validated_datapipe, '_as_iterator', None)
    if not hasattr(validated_datapipe, 'nonblocking_next'):
        def nonblocking_next(self):
            # Fall back to the regular iterator protocol of the wrapped DataPipe.
            if self._as_iterator is None:
                self._as_iterator = iter(self)
            return next(self._as_iterator)
        validated_datapipe.nonblocking_next = types.MethodType(nonblocking_next, validated_datapipe)
    return validated_datapipe

The combination of these two approaches allows mixing old-style DataPipes (and Datasets) with new non-blocking DataPipes.

As nonblocking_next does not guarantee that a result is returned, it can be used to schedule requests ahead:

class Prefetcher(datapipes.iter.NonBlockingIterDataPipe):
    def __init__(self, source_dp, buffer_size=10):
        self._source_dp = source_dp
        self._buffer_size = buffer_size
        self._buffer = []
        self._source_depleted = False

    def nonblocking_next(self):
        if not self._source_depleted:
            # Top up the buffer with as many items as are currently available.
            while len(self._buffer) < self._buffer_size:
                try:
                    data = self._source_dp.nonblocking_next()
                except NotAvailable:
                    # break or put more requests, depending on the implementation
                    break
                except StopIteration:
                    self._source_depleted = True
                    break
                self._buffer.append(data)
        if len(self._buffer):
            data = self._buffer.pop(0)
            return data
        else:
            if self._source_depleted:
                raise StopIteration
            else:
                raise NotAvailable

A similar approach will be applied to MapDataPipe with nonblocking_get(id).
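
A minimal sketch of what that could look like is below; it mirrors NonBlockingIterDataPipe above, and the class name and details are assumptions rather than a final API:

class NonBlockingMapDataPipe(datapipes.map.MapDataPipe):
    def __getitem__(self, index):
        while True:
            try:
                return self.nonblocking_get(index)
            except NotAvailable:
                # Data for this index is not ready yet; let the event loop progress and retry.
                time.sleep(DELAY)
                EventLoop.iteration()

    def nonblocking_get(self, index):
        raise NotImplementedError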

Connecting blocks with queues

Having all DataPipes non-blocking (asynchronous) allows connecting them with a pair of queues.

For example, in the multiprocessing version, the subprocess main loop can look like this:

def IteratableDataPipeToQueuesLoop(source_datapipe, req_queue, res_queue):
    steps = 0
    EventLoop.enabled = False
    for _ in IteratableDataPipeBehindQueues(source_datapipe, req_queue, res_queue, raise_stop=True):
        steps += 1
        time.sleep(DELAY)

def IteratableDataPipeBehindQueues(source_datapipe, req_queue, res_queue, raise_stop=False):
    source_datapipe = EnsureNonBlockingNextDataPipe(source_datapipe)
    while True:
        try:
            req_queue.get(block=False)
        except queue.Empty:  # queue.Empty from the standard 'queue' module
            # No request pending; return control to the caller.
            yield True
            continue
        while True:
            try:
                value = source_datapipe.nonblocking_next()
            except NotAvailable:
                yield True
                continue
            except StopIteration:
                # Forward the end-of-data signal to the consumer process.
                res_queue.put(StopIteration())
                if raise_stop:
                    raise StopIteration
                else:
                    yield True
                break
            res_queue.put(value)
            yield True  # Returns control
            break  # One response per request

Then the main process can transparently access this DataPipe with a simple wrapper:

class QIteratableDataPipe(datapipes.iter.NonBlockingIterDataPipe):
    def __init__(self, request_queue, response_queue, response_wait_time=0.00001):
        self._req_q = request_queue
        self._res_q = response_queue
        self._req_sent = False
        self.counter = 0
        self._stop_iteration = False
        self._response_wait_time = response_wait_time

    def nonblocking_next(self):
        if self._stop_iteration:
            raise Exception('next called after receiving StopIteration')
        if not self._req_sent:
            # Issue a request for the next item; the id is just a running counter.
            self._req_q.put(self.counter)
            self.counter += 1
            self._req_sent = True
        try:
            value = self._res_q.get(block=True, timeout=self._response_wait_time)
        except queue.Empty:
            # No response yet; the caller should retry later.
            raise NotAvailable
        self._req_sent = False
        if isinstance(value, StopIteration):
            self._stop_iteration = True
            raise StopIteration
        return value

This allows sending a DataPipe into a separate process with a few lines of code:

req_queue = multiprocessing.Queue()
res_queue = multiprocessing.Queue()
p2 = multiprocessing.Process(target=IteratableDataPipeToQueuesLoop, args=(source_datapipe, req_queue, res_queue))
p2.start()
separated_source_datapipe = QIteratableDataPipe(req_queue, res_queue)

Please note that allowing only one outstanding request in the queue is an implementation restriction, not something enforced by the design.

DataLoaderQueue

The examples above use the standard multiprocessing Queue, but it is not the best choice (performance-wise) in some cases and does not work at all in others. Instead, we suggest replacing it with a higher-level abstraction, DataLoaderQueue.

DataLoaderQueue is used to pass data between elements of a pipeline inside a single thread, between threads, between processes, and in distributed environments. The DataLoader will replace the queue with the implementation best suited to the situation, but all of them should follow these requirements:

  • Non-blocking
  • Guaranteed delivery
  • Guaranteed no duplicates
  • Guaranteed order
  • Customizable length
  • Queue is always between TWO processes/threads

API:

  • def get(blocking=True) - returns any Python structure, or raises NotAvailableException or QueueError
  • def put(data, blocking=True) - data is any Python structure; may raise QueueError

A DataLoaderQueue implementation also defines its ‘serialization’ technique, ranging from simply passing an object reference within the same thread, to IPC calls and full object serialization for passing data over the network.
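
For illustration, a minimal in-thread DataLoaderQueue satisfying the API above could look like the following sketch (NotAvailableException and QueueError come from the API description; the class itself and its details are assumptions):

import collections

class InThreadDataLoaderQueue:
    """Simplest DataLoaderQueue sketch: passes object references within a
    single thread, preserving order and delivering each item exactly once."""

    def __init__(self, max_length=None):
        self._items = collections.deque()
        self._max_length = max_length  # customizable length

    def put(self, data, blocking=True):
        if self._max_length is not None and len(self._items) >= self._max_length:
            # Within a single thread a full queue cannot drain while we wait, so fail fast.
            raise QueueError('queue is full')
        self._items.append(data)

    def get(self, blocking=True):
        if not self._items:
            # Nothing can arrive while this thread is blocked here, so report
            # NotAvailableException regardless of the blocking flag.
            raise NotAvailableException
        return self._items.popleft()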

User API

DataPipes should work as standard iterators (or implement __getitem__) outside of the DataLoader.

numbers_dp = datapipes.iter.Numbers() # Returns range of integers
dp1, dp2, dp3 = datapipes.iter.Multiply(numbers_dp, 3) # Creates 3 copies of input data
def mult100(x):
  return x * 100
dp2_modified = datapipes.iter.Callable(dp2, mult100)
def mult111(x):
  return x * 111
dp3_modified = datapipes.iter.Callable(dp3, mult111)
joined_dp = datapipes.iter.GreedyJoin(dp1, dp2_modified, dp3_modified)
for i in iter(joined_dp):
  print(i) # 0 0 0 1 100 111 222 200 2 ......

The DataLoader output should be exactly the same, but different pieces of the graph might be executed as separate threads/processes.

for i in DataLoader(joined_dp):
   print(i) # 0 0 0 1 100 111 222 200 2 ......

Naming

There are a number of concepts which we would like to take this refactoring opportunity to clarify, though we also emphasize the importance of backward compatibility. We propose the following naming scheme for the components described within the scope of this doc, including typical end-user code samples.

  • Dataset - A factory producing a data preparation iterator, a graph of DataPipes.
    • ImageNet() -> function or class (doesn’t matter) returning an iterator (DataPipe) over ImageNet batches.
    • There is no Dataset “base class” for the purposes of a given function signature or functionality; it is now only a name. The existing Dataset classes remain for BC, but they simply wrap DataPipes.
  • DataPipe - A node in a data preparation graph, mapping one iterator or index to another (e.g. Untar(ListTarFiles(...))).
  • DataLoader - An execution engine for passing data through the datapipe graph and persisting loading settings, taking advantage of device and parallelism opportunities.

Sharding

Sharding should be implemented at the framework level and hidden from DataPipe users. DataPipe developers will get control over sharding settings and running configurations. The DataLoader will decide how to split the DataPipe graph into shards and choose the run configuration.

DataPipe blocks will tell the DataLoader whether they support sharding via datapipe.is_shardable(). If the function is not defined, the DataPipe is considered non-shardable.
The DataLoader will call back into DataPipe objects with sharding settings using datapipe.sharding_settings(total_num_of_shards, id_of_shard); a sketch of a shardable DataPipe follows the example below.

Example:

list_files_dp = datapipes.map.ListFiles(root = '.')             # marked as non-shardable
load_bins_dp = datapipes.map.LoadFiles(list_files_dp)           # marked as shardable
decode_images_dp = datapipes.map.DecodeImages(load_bins_dp)     # marked as shardable
transform_dp = datapipes.map.TransformImages(decode_images_dp)  # marked as shardable
shuffle_dp = datapipes.map.Shuffle(transform_dp)                # marked as non-shardable
sampler_dp = datapipes.iter.Sampler(shuffle_dp)                 # marked as non-shardable
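
For illustration, a DataPipe that opts into sharding might implement the two hooks from above roughly as follows (is_shardable() and sharding_settings() are the proposed API; the round-robin skipping policy and the class name are assumptions for this sketch):

class ShardableLoadFiles(datapipes.iter.IterDataPipe):
    def __init__(self, listfiles_dp):
        self._listfiles_dp = listfiles_dp
        self._num_shards = 1
        self._shard_id = 0

    def is_shardable(self):
        # Tells the DataLoader this DataPipe may be split across shards.
        return True

    def sharding_settings(self, total_num_of_shards, id_of_shard):
        # Called back by the DataLoader once it decides the run configuration.
        self._num_shards = total_num_of_shards
        self._shard_id = id_of_shard

    def __iter__(self):
        for i, file_name in enumerate(self._listfiles_dp):
            # One possible policy: each shard takes every num_shards-th item.
            if i % self._num_shards == self._shard_id:
                yield (file_name, load_file(file_name))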

Individual Process (Thread)

Situations like prefetching and large non-forkable arrays require spawning a separate process for a DataPipe. DataPipe blocks will tell the DataLoader whether they are recommended to be executed as a separate process via datapipe.is_separate_process().
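
For example, the Prefetcher from above could hint that it belongs in its own process (a sketch; only is_separate_process() comes from the text above):

class Prefetcher(datapipes.iter.NonBlockingIterDataPipe):
    # ... (constructor and nonblocking_next as shown earlier)

    def is_separate_process(self):
        # Recommend that the DataLoader run this DataPipe in a dedicated
        # process so its buffer does not compete with the main loop.
        return True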

Lazy Initialization

In some cases it is inefficient to initialize DataPipe data before usage. For example, we may want to postpone loading the full list of files until after forking out a file scanner. For this purpose, a lazy_init function will be called prior to any __len__, __getitem__, or __iter__ operators.
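
A sketch of how this could look for the ListFiles example (lazy_init is the hook named above; the class name and directory-scanning details are assumptions):

import os

class LazyListFiles(datapipes.iter.IterDataPipe):
    def __init__(self, root='.'):
        self._root = root
        self._file_names = None  # populated lazily, after any fork

    def lazy_init(self):
        # Called by the framework before __len__ / __getitem__ / __iter__, so the
        # potentially large file list is built in the worker process, not the parent.
        if self._file_names is None:
            self._file_names = sorted(os.listdir(self._root))

    def __iter__(self):
        self.lazy_init()
        return iter(self._file_names)

    def __len__(self):
        self.lazy_init()
        return len(self._file_names)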

Functional DataPipe

DataLoader should not care about any data logic (including sampling, shuffle, and collate).

Moving Sampler from DataLoader into separate DataPipe

We are planning to create a Sampler DataPipe for each existing sampling strategy, as well as a wrapper around existing Sampler classes.
PR: #49363

# use default sequential sampler (basically do nothing)
sequential_sampled_ds = datapipes.iter.Sample(iter_ds) 

# use random sampler with replacement to generate random item from input datapipe
random_sampled_ds = datapipes.iter.Sample(iter_ds, sampler=RandomSampler, replacement=True) 

Note:

All of the Sampler DataPipes can be replaced by other iterable DataPipes, and a Sampler is not required in the data pipeline.

  • RandomSampler without replacement or SubsetSampler can be replaced by ShuffleIterableDataset with different buffer size
  • WeightedSampler -> WeightedShuffleDataset (If needed)
  • BatchSampler -> BatchDataset
  • Other customized samplers can be replaced by Callable DataPipe to run customized sample function
In general, the Sampler DataPipe is not recommended for new pipelines; we keep it only to avoid breaking backward compatibility.
Example of a replacement for SubsetSampler:
def subset_sampler(ds, buffer_size=100):  # buffer_size was an implicit free variable in the original sketch
    buffer = []
    for x in ds:
        if len(buffer) == buffer_size:
            # Reservoir-style sampling: emit a random buffered item and replace it.
            idx = random.randint(0, buffer_size - 1)
            yield buffer[idx]
            buffer[idx] = x
        else:
            buffer.append(x)
    random.shuffle(buffer)
    while buffer:
        yield buffer.pop()

out = datapipes.iter.Callable(ds, subset_sampler)

Moving Collate functions from DataLoader into separate DataPipes

We are going to move the collate logic out of the DataLoader and implement it as an IterDataPipe; it will accept old collate functions as an argument, or can be rewritten entirely.
PR: #48933

batch_dp = datapipes.iter.BatchNumbers() # Returns batch of integers [1,2,3],[4,5,6],..
default_collated_dp = datapipes.iter.Collate(batch_dp) # use original default collate function
for i in DataLoader(default_collated_dp):
   print(i) # tensor([1, 2, 3]), tensor([4, 5, 6]), ...
 
def collate_fn(batch):
    sum = batch[0] + batch[1] + batch[2]
    return torch.tensor(sum, dtype=torch.float)
default_collated_dp = datapipes.iter.Collate(batch_dp, collate_fn=collate_fn)
for i in DataLoader(default_collated_dp):
   print(i) # tensor([6.]), tensor([15.]), ...

Moving Shuffle from DataLoader into separate Datasets

class BufferedShuffleDataset(IterableDataset[T_co]):
    r"""Dataset shuffled from the original dataset.

    This class is useful to shuffle an existing instance of an IterableDataset.
    The buffer with `buffer_size` is filled with the items from the dataset first. Then,
    each item will be yielded from the buffer by reservoir sampling via iterator.

    `buffer_size` is required to be larger than 0. For `buffer_size == 1`, the
    dataset is not shuffled. In order to fully shuffle the whole dataset, `buffer_size`
    is required to be greater than or equal to the size of dataset.

    When it is used with :class:`~torch.utils.data.DataLoader`, each item in the
    dataset will be yielded from the :class:`~torch.utils.data.DataLoader` iterator.
    And, the method to set up a random seed is different based on :attr:`num_workers`.

    For single-process mode (:attr:`num_workers == 0`), the random seed is required to
    be set before the :class:`~torch.utils.data.DataLoader` in the main process.

        >>> ds = BufferedShuffleDataset(dataset)
        >>> random.seed(...)
        >>> print(list(torch.utils.data.DataLoader(ds, num_workers=0)))

    For multi-process mode (:attr:`num_workers > 0`), the random seed is set by a callable
    function in each worker.

        >>> ds = BufferedShuffleDataset(dataset)
        >>> def init_fn(worker_id):
        ...     random.seed(...)
        >>> print(list(torch.utils.data.DataLoader(ds, ..., num_workers=n, worker_init_fn=init_fn)))

    Arguments:
        dataset (IterableDataset): The original IterableDataset.
        buffer_size (int): The buffer size for shuffling.
    """
    dataset: IterableDataset[T_co]
    buffer_size: int

    def __init__(self, dataset: IterableDataset[T_co], buffer_size: int) -> None:
        super(BufferedShuffleDataset, self).__init__()
        assert buffer_size > 0, "buffer_size should be larger than 0"
        self.dataset = dataset
        self.buffer_size = buffer_size

    def __iter__(self) -> Iterator[T_co]:
        buf: List[T_co] = []
        for x in self.dataset:
            if len(buf) == self.buffer_size:
                idx = random.randint(0, self.buffer_size - 1)
                yield buf[idx]
                buf[idx] = x
            else:
                buf.append(x)
        random.shuffle(buf)
        while buf:
            yield buf.pop()

iter_dp = dp # Returns 0, 1, 2, 3, 4, 5, 6, 7, 8,...
shuffled_dp = datapipes.iter.Shuffle(iter_dp) # Returns 5, 2, 9, 0,...

Other functional DataPipes

In order to provide a more versatile API, we plan to add more functional DataPipes for users; a minimal sketch of one of them (Filter) follows the list below.

  • Batch
  • PaddedBatch
  • Unbatch
  • Repeat
  • Cache
  • Filter
  • Zip
  • ...
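
As a flavor of how small these building blocks can be, here is a minimal Filter sketch (an illustration only, not a committed API):

class Filter(datapipes.iter.IterDataPipe):
    def __init__(self, source_dp, predicate):
        self._source_dp = source_dp
        self._predicate = predicate

    def __iter__(self):
        for item in self._source_dp:
            # Pass through only the items for which the predicate holds.
            if self._predicate(item):
                yield item

# Usage: keep only even numbers from the Numbers datapipe used earlier.
evens_dp = Filter(datapipes.iter.Numbers(), lambda x: x % 2 == 0)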

Reproducibility and randomness

Reproducibility should be part of the DataLoader implementation, so that a random seed can be defined consistently across the various parallelization techniques.

Async (non-blocking) operations also introduce non-determinism of order, so we would need to implement a DataLoader attribute to control the ordering of non-blocking call fulfillments and guarantee deterministic order.
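
One possible approach (a sketch of the idea, not part of the current design) is to tag each request with a sequence number and reorder fulfillments on the consumer side:

class OrderedResults:
    """Buffers out-of-order results and releases them strictly by sequence number."""

    def __init__(self):
        self._next_to_emit = 0
        self._pending = {}

    def add(self, seq_no, value):
        self._pending[seq_no] = value

    def pop_ready(self):
        # Yield results in request order, holding back anything whose
        # predecessors have not arrived yet.
        while self._next_to_emit in self._pending:
            yield self._pending.pop(self._next_to_emit)
            self._next_to_emit += 1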

To Do

This document doesn’t touch on the problem of varying batch size across different phases of processing. It is achievable by passing a list of objects into the queue and will be considered during the queue implementation phase. However, a code example should eventually be added here.

This document doesn't cover distributed training in detail. We are going to expand on this topic with additional sharding parameters and queue implementations.

Considerations

User-defined sharding was considered unnecessary at this early stage; however, nothing in the proposed architecture prevents implementing it later.

A pure C++ implementation was considered too inflexible. However, nothing prevents users from creating DataPipes with C++ internals.

TorchScript can be used inside DataPipes, but we are not limited to it.

Arrow/Proto/… can be used to pass data between DataPipes.

Error Tracing?

C++

cc @ssnl @VitalyFedyunin @ejguan
