Sampler for IterableDataset #28743

@daniel-j-h

Description

🚀 Feature

IterableDataset is too restrictive in that it cannot be combined with samplers. Sampling from a stream is well understood and can be done on the fly; IterableDataset should support these use cases.

Motivation

The IterableDataset abstraction is a great fit for a stream of data that we iterate over in a forward-only fashion. Right now, however, it is not compatible with samplers. From the docs:

Neither sampler nor batch_sampler is compatible with iterable-style datasets, since such datasets have no notion of a key or an index.

Here are two use cases where sampling from an IterableDataset is necessary:

  1. The user knows the total size in advance

For example, I have one IterableDataset per video (each yielding clips), and I know the number of frames for each video and the total number of videos in advance. I can sample k random clips with

import itertools
import random

# pick k distinct positions out of the known total and build a mask
pick = set(random.sample(range(self.total), k))
mask = [i in pick for i in range(self.total)]

# chain the per-video iterators into one stream; keep only the picked positions
it = itertools.chain(*self.videos)
it = itertools.compress(it, mask)

and wrap this in an IterableDataset so that all videos are walked through only once, as in the sketch below.
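Putting this together, here is a minimal sketch of what such a dataset could look like. Everything here is illustrative: SampledClips, videos (a list of per-video clip iterables), total, and k are assumed names for this example, not an existing API.

import itertools
import random

from torch.utils.data import IterableDataset


class SampledClips(IterableDataset):
    # Hypothetical sketch: yield k clips sampled uniformly at random from
    # a fixed set of videos whose total clip count is known in advance.

    def __init__(self, videos, total, k):
        self.videos = videos  # list of per-video clip iterables
        self.total = total    # known total number of clips across all videos
        self.k = k

    def __iter__(self):
        # Pick k distinct positions up front, then walk the stream once,
        # keeping only the clips at the picked positions.
        pick = set(random.sample(range(self.total), self.k))
        mask = (i in pick for i in range(self.total))
        return itertools.compress(itertools.chain(*self.videos), mask)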

  2. The user does not know the total size in advance

For example, I have videos with clips but I don't want to (or cannot) determine the number of frames per video, and therefore don't know the total size in advance. I can still sample k random clips out of an unknown n total clips, e.g. via reservoir sampling, and still walk through all videos only once; see the sketch below.
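Here is a minimal sketch of the unknown-size variant using reservoir sampling (Algorithm R). Again, the class name and constructor arguments are assumptions for illustration:

import itertools
import random

from torch.utils.data import IterableDataset


class ReservoirSampledClips(IterableDataset):
    # Hypothetical sketch: yield k clips sampled uniformly at random from
    # a stream of unknown total length, in a single pass.

    def __init__(self, videos, k):
        self.videos = videos  # list of per-video clip iterables
        self.k = k

    def __iter__(self):
        reservoir = []

        for i, clip in enumerate(itertools.chain(*self.videos)):
            if i < self.k:
                # Fill the reservoir with the first k clips.
                reservoir.append(clip)
            else:
                # Replace a random reservoir slot with probability k / (i + 1),
                # which keeps every clip seen so far equally likely to end up
                # in the final sample.
                j = random.randrange(i + 1)
                if j < self.k:
                    reservoir[j] = clip

        return iter(reservoir)

One caveat with this approach: the reservoir keeps all k sampled clips in memory, and the order in which they are yielded is not itself random.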

What are your thoughts on this?

cc @ssnl

Metadata

Labels: module: dataloader (Related to torch.utils.data.DataLoader and Sampler), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
