Description
🚀 Feature
The IterableDataset is too restrictive because it cannot be combined with samplers. Sampling from a stream is well understood and can be done on the fly. IterableDataset should support these use cases.
Motivation
The IterableDataset abstraction is great for abstracting a stream of data we want to iterate over in a forward fashion. Right now it is not compatible with samplers, though. From the docs:
Neither sampler nor batch_sampler is compatible with iterable-style datasets, since such datasets have no notion of a key or an index.
Here are two different use-cases where sampling from an IterableDataset is necessary:
- The user knows the total size in advance
For example, I have one IterableDataset per video (yielding clips), and I know the number of frames for each video and the total number of videos in advance. I can sample k random clips with
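import itertools
import random
# Choose k distinct clip positions out of the known total.
pick = set(random.sample(range(self.total), k))
mask = (i in pick for i in range(self.total))
# Walk all videos once, keeping only the clips at the chosen positions.
it = itertools.chain.from_iterable(self.videos)
it = itertools.compress(it, mask)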
and abstract over this in an IterableDataset to only walk once through all videos.
- The user does not know the total size in advance
For example, I have videos with clips, but I don't want to (or cannot) get the number of frames per video, and therefore I don't know the total size in advance. I can still sample k random clips out of an unknown total of n clips, e.g. via reservoir sampling, and only walk once through all videos.
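To make the second case concrete, here is a minimal sketch of reservoir sampling (the classic "Algorithm R") over a stream of unknown length. It uses only the standard library; the function name `reservoir_sample` and its `seed` parameter are hypothetical and are not part of PyTorch.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable of unknown length.

    Hypothetical helper for illustration. If the stream yields fewer than
    k items, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item n (0-based) replaces a random slot with probability k / (n + 1).
            j = rng.randint(0, n)  # both endpoints inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: sample 5 clips from a generator whose length is never queried.
clips = (f"clip_{i}" for i in range(1000))
sample = reservoir_sample(clips, k=5, seed=0)
```

One possible way to use this would be to call it from the `__iter__` of an IterableDataset and yield the returned items. Note that the sample is only final once the whole stream has been consumed, so nothing can be yielded before the single pass completes.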
What are your thoughts on this?
cc @ssnl