-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Chunk dataset implementation #15932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Chunk dataset implementation #15932
Conversation
goldsborough
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions. Likely need some more back and forth
| ChunkReader chunk_reader_; | ||
|
|
||
| // chunk sampler to shuffle different chunks | ||
| samplers::ThreadSafeSampler<ChunkSamplerType> chunk_sampler_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit concerned that you're forcing the locking strategy here, while the provided sampler could have already been thread safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be and it won't break the logic even if someone provided a thread safe sampler. Besides, we only have two non thread safe samplers in the cpp library to use. This also does not break the API, if we develop several thread safe samplers, we can remove the wrapper later with some checks at the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest it's better to add this later than remove it later, since the direction you're proposing is a breaking change (and the other one is not). That's my concern. Most samplers can easily be implemented using atomics and won't need any locks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have a static_assert that the Sampler someone templates on is a subclass of some dummy ThreadSafeSampler class, whose only purpose is to make people aware that the code they wrote should be made thread safe if they want to use it in this dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we do this in new PR. What we need is a distributed, thread safe sampler for this. I can provide an update as the next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, although it would be good to remove the reserve call which assumes a bit too much about the container API.
| UnwrappedBatchType current_batch; | ||
|
|
||
| // Allocate the batch memory ahead of time. | ||
| current_batch.reserve(batch_size_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a bit of a stretch. It heavily assumes that batches are std::vectors. While requiring .size() and .empty() is benign, this is very specific.
|
Reopen to run CI tests. |
|
Re-try the CI test. |
0d103bb to
db4f118
Compare
…chDataBuffer to detail namespace and many more.
db4f118 to
6daf2cf
Compare
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
| #pragma once | ||
|
|
||
| #include <torch/data/datasets/stateful.h> | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see that you resolved this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am preparing a PR to remove the LockedSampler that we all did not like. I will include this in that too.
|
|
||
| ExampleSampler& example_sampler_; | ||
|
|
||
| // configurable maximun number of elements the queue can hold at one time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: maximum
This PR contains the implementation of chunk dataset, with the API proposed in PR #15562
A chunk dataset is derived from StatefulDataset. It utilizes worker threads to prefetches chunk data, splits it into batches and caches them into a queue. When get_batch is called from dataloader, batch data is retrieved from the queue, and data in new chunks will be pushed for later following batches.
Chunk dataset uses two samplers (chunk_sampler and example_sampler) to perform sampling. The chunk_sampler decides which chunk to load, and example_sampler shuffles the examples inside a specific chunk. More detail of this sampling approach can be found here: http://martin.zinkevich.org/publications/nips2010.pdf