Skip to content

Conversation

@xzhu1900
Copy link
Contributor

This PR contains the implementation of chunk dataset, with the API proposed in PR #15562

A chunk dataset is derived from StatefulDataset. It utilizes worker threads to prefetches chunk data, splits it into batches and caches them into a queue. When get_batch is called from dataloader, batch data is retrieved from the queue, and data in new chunks will be pushed for later following batches.

Chunk dataset uses two samplers (chunk_sampler and example_sampler) to perform sampling. The chunk_sampler decides which chunk to load, and example_sampler shuffles the examples inside a specific chunk. More detail of this sampling approach can be found here: http://martin.zinkevich.org/publications/nips2010.pdf

Copy link
Contributor

@goldsborough goldsborough left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions. Likely need some more back and forth

@xzhu1900 xzhu1900 requested a review from yf225 as a code owner January 19, 2019 00:54
ChunkReader chunk_reader_;

// chunk sampler to shuffle different chunks
samplers::ThreadSafeSampler<ChunkSamplerType> chunk_sampler_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned that you're forcing the locking strategy here, while the provided sampler could have already been thread safe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be and it won't break the logic even if someone provided a thread safe sampler. Besides, we only have two non thread safe samplers in the cpp library to use. This also does not break the API, if we develop several thread safe samplers, we can remove the wrapper later with some checks at the constructor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest it's better to add this later than remove it later, since the direction you're proposing is a breaking change (and the other one is not). That's my concern. Most samplers can easily be implemented using atomics and won't need any locks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a static_assert that the Sampler someone templates on is a subclass of some dummy ThreadSafeSampler class, whose only purpose is to make people aware that the code they wrote should be made thread safe if they want to use it in this dataset.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do this in new PR. What we need is a distributed, thread safe sampler for this. I can provide an update as the next PR.

Copy link
Contributor

@apaszke apaszke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although it would be good to remove the reserve call which assumes a bit too much about the container API.

UnwrappedBatchType current_batch;

// Allocate the batch memory ahead of time.
current_batch.reserve(batch_size_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a bit of a stretch. It heavily assumes that batches are std::vectors. While requiring .size() and .empty() is benign, this is very specific.

@xzhu1900 xzhu1900 closed this Jan 26, 2019
@xzhu1900
Copy link
Contributor Author

Reopen to run CI tests.

@xzhu1900 xzhu1900 reopened this Jan 26, 2019
@xzhu1900 xzhu1900 closed this Jan 28, 2019
@xzhu1900
Copy link
Contributor Author

Re-try the CI test.

@xzhu1900 xzhu1900 reopened this Jan 28, 2019
@xzhu1900 xzhu1900 force-pushed the xuzhu/chunk_dataset_imp branch from 0d103bb to db4f118 Compare January 28, 2019 20:05
@zou3519 zou3519 mentioned this pull request Jan 29, 2019
@soumith soumith force-pushed the xuzhu/chunk_dataset_imp branch from db4f118 to 6daf2cf Compare January 29, 2019 18:30
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

#pragma once

#include <torch/data/datasets/stateful.h>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see that you resolved this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am preparing a PR to remove the LockedSampler that we all did not like. I will include this in that too.


ExampleSampler& example_sampler_;

// configurable maximun number of elements the queue can hold at one time.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: maximum

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants