Shared Dataset Functionality #24915

@vincentqb

🚀 Feature

We want to build a unified data pipeline interface that offers building blocks for others to build on with the following objectives:

  • Standardize datasets across domains.
  • Offer flexible building blocks that can be combined to obtain other datasets.
  • Enable datasets that do not fit in memory.
  • Share code among domains.
  • Facilitate parallel loading and processing of data.
  • Decouple data loading and preprocessing/transformation.
  • Offer static typing for datasets.
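
To make the objectives above concrete, here is a minimal sketch of what composable, lazily evaluated, statically typed building blocks might look like. It is not a proposed API: `Source`, `from_iterable`, and the `map`/`filter`/`chain` methods are hypothetical names chosen for illustration, and the sketch uses only the Python standard library.

```python
from typing import Callable, Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class Source(Generic[T]):
    """Hypothetical building block: a re-iterable, lazy stream of samples."""

    def __init__(self, make_iter: Callable[[], Iterator[T]]) -> None:
        # Store a factory rather than the data, so nothing is held in memory
        # until the stream is actually iterated.
        self._make_iter = make_iter

    def __iter__(self) -> Iterator[T]:
        return self._make_iter()

    def map(self, fn: Callable[[T], U]) -> "Source[U]":
        # Transformation is declared separately from loading and runs lazily.
        return Source(lambda: (fn(x) for x in self))

    def filter(self, pred: Callable[[T], bool]) -> "Source[T]":
        return Source(lambda: (x for x in self if pred(x)))

    def chain(self, other: "Source[T]") -> "Source[T]":
        # Combine two sources into a new one without copying either.
        def it() -> Iterator[T]:
            yield from self
            yield from other

        return Source(it)


def from_iterable(items: Iterable[T]) -> Source[T]:
    """Wrap any re-iterable object (list, range, ...) as a Source."""
    return Source(lambda: iter(items))


numbers = from_iterable(range(5))
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
combined = evens_squared.chain(from_iterable([100]))
print(list(combined))  # [0, 4, 16, 100]
```

Because each block stores only a factory for its iterator, samples are produced one at a time, which is one way to support datasets that do not fit in memory. The generic type parameter lets a type checker follow the sample type through each `map` call.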

Motivation

  • Each domain currently has its own non-standard dataset structure, which may also download the data. This duplicates effort and adds complexity for the user.
  • A common bottleneck when generating datasets is reading the data. We want to offer an interface that enables reading the data and running initial preprocessing while maximizing utilization of the available computing resources.
  • We may want to leverage specialized libraries such as NVIDIA DALI.
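
As an illustration of the reading bottleneck described above, the following sketch overlaps I/O-bound reads using a thread pool from the Python standard library. It is only an assumption about how such an interface could behave: `parallel_load` and `fake_read` are hypothetical names, and `fake_read` stands in for a real file or network read.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def parallel_load(
    keys: Iterable[K], read: Callable[[K], V], num_workers: int = 4
) -> Iterator[V]:
    """Hypothetical helper: apply `read` to each key on worker threads.

    Results are yielded in the same order as `keys`.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map submits all tasks up front and preserves input order.
        yield from pool.map(read, keys)


def fake_read(index: int) -> str:
    # Placeholder for reading and decoding one sample from disk.
    return f"sample-{index}"


samples = list(parallel_load(range(3), fake_read, num_workers=2))
print(samples)  # ['sample-0', 'sample-1', 'sample-2']
```

Threads suit I/O-bound reads. CPU-heavy preprocessing would more likely need worker processes or a specialized library, and a production version would also bound the number of in-flight tasks instead of submitting them all at once.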

Additional Information

Datasets:

Dataloader:

Features:

cc @ssnl @fmassa @zhangguanheng66 @vincentqb @mrshenli

Labels

  • better-engineering: Relatively self-contained tasks for better engineering contributors
  • module: dataloader: Related to torch.utils.data.DataLoader and Sampler
  • triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
