Skip to content

Commit a102ba2

Browse files
westonpacelidavidm
authored andcommitted
ARROW-12288: [C++] Create Scanner interface
To prepare for the AsyncScanner this PR creates a Scanner interface and, along the way, simplifies the current Scanner API so that the new scanner won't need to match. ## What is removed: * `Scanner::GetFragments` was only used in `FileSystemDataset::Write`. The correct source of truth for fragments is the `Dataset`. Note: The python implementation exposed this method but it was not documented or used in any unit test. I think it can be safely removed and we need not worry about deprecation. * `Scanner::schema` is redundant and ambiguous. There are two schemas at the scan level. The dataset schema (the unified master schema that we expect all fragment schemas to be a subset of) and the projection schema (a combination of the dataset schema and the projection expression). Both of these are available on the scan options object and there is an accessor for these options so the caller might as well get them from there. This schema function was exposed via R and used internally there but I think any uses can be easily changed to using the options. * `FileFormat::splittable` and `Fragment::splittable`. These were intended to advertise that batch readahead was available on the given fragment/format. However, there is no need to advertise this. They are not used by the `SyncScanner` and the `AsyncScanner` will just assume that the format/fragment's will utilize readahead if they can (respecting the readahead options in `ScanOptions`) * Direct instantiation of `Scanner`. All `Scanner` creation should go through `ScannerBuilder` now. This allows the `ScannerBuilder` to determine what implementation to use. This was mostly the way things were implemented already. Only a few tests instantiated a `Scanner` directly. ## What is deprecated * `Scanner::Scan` is going to be deprecated (ARROW-11797). It will not be implemented by `AsyncScanner`. I do not actually deprecate it in this PR as I reserve that for ARROW-11797. Unfortunately, this method was exposed via python & R and likely was used so deprecation is recommended over outright removal. ## What is new * `Scanner::ScanBatches` and `Scanner::ScanBatchesUnordered` have been added. These functions will be the new preferred "scan" method going forward. This allows the parallelization (batch readahead, file readahead, etc.) to be handled by C++ and simplifies the user's life. * `ScanOptions::batch_readahead` and `ScanOptions::fragment_readahead` options allow more fine grained control over how to perform readahead. One technicality is that these options will not be respected well by the `SyncScanner` (although I think the current ARROW-11797 utilizes batch readahead) so they are more placeholders for when we implement `AsyncScanner`. * `ScanOptions::cpu_executor` and `ScanOptions::io_context` are added and should be fairly self explanatory. * `ScanOptions::use_async` will toggle which scanner to use. Closes apache#9947 from westonpace/feature/arrow-12288 Authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
1 parent 1ed6819 commit a102ba2

15 files changed

Lines changed: 396 additions & 87 deletions

File tree

cpp/src/arrow/dataset/dataset.h

Lines changed: 0 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -64,9 +64,6 @@ class ARROW_DS_EXPORT Fragment : public std::enable_shared_from_this<Fragment> {
6464
/// To receive a record batch stream which is fully filtered and projected, use Scanner.
6565
virtual Result<ScanTaskIterator> Scan(std::shared_ptr<ScanOptions> options) = 0;
6666

67-
/// \brief Return true if the fragment can benefit from parallel scanning.
68-
virtual bool splittable() const = 0;
69-
7067
virtual std::string type_name() const = 0;
7168
virtual std::string ToString() const { return type_name(); }
7269

@@ -111,8 +108,6 @@ class ARROW_DS_EXPORT InMemoryFragment : public Fragment {
111108

112109
Result<ScanTaskIterator> Scan(std::shared_ptr<ScanOptions> options) override;
113110

114-
bool splittable() const override { return false; }
115-
116111
std::string type_name() const override { return "in-memory"; }
117112

118113
protected:

cpp/src/arrow/dataset/dataset_test.cc

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -442,7 +442,7 @@ TEST_F(TestEndToEnd, EndToEndSingleDataset) {
442442
// In the simplest case, consumption is simply conversion to a Table.
443443
ASSERT_OK_AND_ASSIGN(auto table, scanner->ToTable());
444444

445-
auto expected = TableFromJSON(scanner->schema(), {R"([
445+
auto expected = TableFromJSON(scanner_builder->projected_schema(), {R"([
446446
{"sales": 152.25, "model": "3", "country": "CA"},
447447
{"sales": 273.5, "model": "3", "country": "US"}
448448
])"});
@@ -547,7 +547,7 @@ class TestSchemaUnification : public TestUnionDataset {
547547
void AssertScanEquals(std::shared_ptr<Scanner> scanner,
548548
const std::vector<TupleType>& expected_rows) {
549549
std::vector<std::string> columns;
550-
for (const auto& field : scanner->schema()->fields()) {
550+
for (const auto& field : scanner->options()->projected_schema->fields()) {
551551
columns.push_back(field->name());
552552
}
553553

cpp/src/arrow/dataset/file_base.cc

Lines changed: 7 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -369,7 +369,7 @@ struct WriteState {
369369
std::unordered_map<std::string, std::unique_ptr<WriteQueue>> queues;
370370
};
371371

372-
Status WriteNextBatch(WriteState& state, const std::shared_ptr<ScanTask>& scan_task,
372+
Status WriteNextBatch(WriteState& state, const std::shared_ptr<Fragment>& fragment,
373373
std::shared_ptr<RecordBatch> batch) {
374374
ARROW_ASSIGN_OR_RAISE(auto groups, state.write_options.partitioning->Partition(batch));
375375
batch.reset(); // drop to hopefully conserve memory
@@ -382,8 +382,8 @@ Status WriteNextBatch(WriteState& state, const std::shared_ptr<ScanTask>& scan_t
382382

383383
std::unordered_set<WriteQueue*> need_flushed;
384384
for (size_t i = 0; i < groups.batches.size(); ++i) {
385-
auto partition_expression = and_(std::move(groups.expressions[i]),
386-
scan_task->fragment()->partition_expression());
385+
auto partition_expression =
386+
and_(std::move(groups.expressions[i]), fragment->partition_expression());
387387
auto batch = std::move(groups.batches[i]);
388388

389389
ARROW_ASSIGN_OR_RAISE(auto part,
@@ -432,7 +432,7 @@ Future<> WriteInternal(const ScanOptions& scan_options, WriteState& state,
432432
ARROW_ASSIGN_OR_RAISE(auto batches_gen, scan_task->ExecuteAsync(cpu_executor));
433433
std::function<Status(std::shared_ptr<RecordBatch> batch)> batch_visitor =
434434
[&, scan_task](std::shared_ptr<RecordBatch> batch) {
435-
return WriteNextBatch(state, scan_task, std::move(batch));
435+
return WriteNextBatch(state, scan_task->fragment(), std::move(batch));
436436
};
437437
scan_futs.push_back(VisitAsyncGenerator(batches_gen, batch_visitor));
438438
} else {
@@ -441,7 +441,7 @@ Future<> WriteInternal(const ScanOptions& scan_options, WriteState& state,
441441

442442
for (auto maybe_batch : batches) {
443443
ARROW_ASSIGN_OR_RAISE(auto batch, maybe_batch);
444-
RETURN_NOT_OK(WriteNextBatch(state, scan_task, std::move(batch)));
444+
RETURN_NOT_OK(WriteNextBatch(state, scan_task->fragment(), std::move(batch)));
445445
}
446446

447447
return Status::OK();
@@ -469,21 +469,8 @@ Status FileSystemDataset::Write(const FileSystemDatasetWriteOptions& write_optio
469469
//
470470
// NB: neither of these will have any impact whatsoever on the common case of writing
471471
// an in-memory table to disk.
472-
ARROW_ASSIGN_OR_RAISE(auto fragment_it, scanner->GetFragments());
473-
ARROW_ASSIGN_OR_RAISE(FragmentVector fragments, fragment_it.ToVector());
474-
ScanTaskVector scan_tasks;
475-
476-
for (const auto& fragment : fragments) {
477-
auto options = std::make_shared<ScanOptions>(*scanner->options());
478-
// Avoid contention with multithreaded readers
479-
options->use_threads = false;
480-
ARROW_ASSIGN_OR_RAISE(auto scan_task_it,
481-
Scanner(fragment, std::move(options)).Scan());
482-
for (auto maybe_scan_task : scan_task_it) {
483-
ARROW_ASSIGN_OR_RAISE(auto scan_task, maybe_scan_task);
484-
scan_tasks.push_back(std::move(scan_task));
485-
}
486-
}
472+
ARROW_ASSIGN_OR_RAISE(auto scan_task_it, scanner->Scan());
473+
ARROW_ASSIGN_OR_RAISE(ScanTaskVector scan_tasks, scan_task_it.ToVector());
487474

488475
WriteState state(write_options);
489476
auto res = internal::RunSynchronously<arrow::detail::Empty>(

cpp/src/arrow/dataset/file_base.h

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -134,9 +134,6 @@ class ARROW_DS_EXPORT FileFormat : public std::enable_shared_from_this<FileForma
134134
/// \brief The name identifying the kind of file format
135135
virtual std::string type_name() const = 0;
136136

137-
/// \brief Return true if fragments of this format can benefit from parallel scanning.
138-
virtual bool splittable() const { return false; }
139-
140137
virtual bool Equals(const FileFormat& other) const = 0;
141138

142139
/// \brief Indicate if the FileSource is supported/readable by this format.
@@ -176,7 +173,6 @@ class ARROW_DS_EXPORT FileFragment : public Fragment {
176173

177174
std::string type_name() const override { return format_->type_name(); }
178175
std::string ToString() const override { return source_.path(); };
179-
bool splittable() const override { return format_->splittable(); }
180176

181177
const FileSource& source() const { return source_; }
182178
const std::shared_ptr<FileFormat>& format() const { return format_; }

cpp/src/arrow/dataset/file_csv.cc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -45,7 +45,7 @@ using internal::checked_cast;
4545
using internal::checked_pointer_cast;
4646
using internal::Executor;
4747
using internal::SerialExecutor;
48-
using RecordBatchGenerator = AsyncGenerator<std::shared_ptr<RecordBatch>>;
48+
using RecordBatchGenerator = std::function<Future<std::shared_ptr<RecordBatch>>()>;
4949

5050
Result<std::unordered_set<std::string>> GetColumnNames(
5151
const csv::ParseOptions& parse_options, util::string_view first_block,

cpp/src/arrow/dataset/file_ipc.h

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -42,8 +42,6 @@ class ARROW_DS_EXPORT IpcFileFormat : public FileFormat {
4242
return type_name() == other.type_name();
4343
}
4444

45-
bool splittable() const override { return true; }
46-
4745
Result<bool> IsSupported(const FileSource& source) const override;
4846

4947
/// \brief Return the schema of the file if possible.

cpp/src/arrow/dataset/file_ipc_test.cc

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -234,6 +234,13 @@ class TestIpcFileSystemDataset : public testing::Test,
234234
format_ = ipc_format;
235235
SetWriteOptions(ipc_format->DefaultWriteOptions());
236236
}
237+
238+
std::shared_ptr<Scanner> MakeScanner(const std::shared_ptr<Dataset>& dataset,
239+
const std::shared_ptr<ScanOptions>& scan_options) {
240+
ScannerBuilder builder(dataset, scan_options);
241+
EXPECT_OK_AND_ASSIGN(auto scanner, builder.Finish());
242+
return scanner;
243+
}
237244
};
238245

239246
TEST_F(TestIpcFileSystemDataset, WriteWithIdenticalPartitioningSchema) {
@@ -259,7 +266,7 @@ TEST_F(TestIpcFileSystemDataset, WriteExceedsMaxPartitions) {
259266
// require that no batch be grouped into more than 2 written batches:
260267
write_options_.max_partitions = 2;
261268

262-
auto scanner = std::make_shared<Scanner>(dataset_, scan_options_);
269+
auto scanner = MakeScanner(dataset_, scan_options_);
263270
EXPECT_RAISES_WITH_MESSAGE_THAT(Invalid, testing::HasSubstr("This exceeds the maximum"),
264271
FileSystemDataset::Write(write_options_, scanner));
265272
}

cpp/src/arrow/dataset/file_parquet.h

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -70,8 +70,6 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
7070

7171
std::string type_name() const override { return kParquetTypeName; }
7272

73-
bool splittable() const override { return true; }
74-
7573
bool Equals(const FileFormat& other) const override;
7674

7775
struct ReaderOptions {

cpp/src/arrow/dataset/scanner.cc

Lines changed: 120 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -70,7 +70,108 @@ Result<RecordBatchGenerator> ScanTask::ExecuteAsync(internal::Executor*) {
7070

7171
bool ScanTask::supports_async() const { return false; }
7272

73-
Result<FragmentIterator> Scanner::GetFragments() {
73+
Result<ScanTaskIterator> Scanner::Scan() {
74+
// TODO(ARROW-12289) This is overridden in SyncScanner and will never be implemented in
75+
// AsyncScanner. It is deprecated and will eventually go away.
76+
return Status::NotImplemented("This scanner does not support the legacy Scan() method");
77+
}
78+
79+
Result<EnumeratedRecordBatchIterator> Scanner::ScanBatchesUnordered() {
80+
// If a scanner doesn't support unordered scanning (i.e. SyncScanner) then we just
81+
// fall back to an ordered scan and assign the appropriate tagging
82+
ARROW_ASSIGN_OR_RAISE(auto ordered_scan, ScanBatches());
83+
return AddPositioningToInOrderScan(std::move(ordered_scan));
84+
}
85+
86+
Result<EnumeratedRecordBatchIterator> Scanner::AddPositioningToInOrderScan(
87+
TaggedRecordBatchIterator scan) {
88+
ARROW_ASSIGN_OR_RAISE(auto first, scan.Next());
89+
if (IsIterationEnd(first)) {
90+
return MakeEmptyIterator<EnumeratedRecordBatch>();
91+
}
92+
struct State {
93+
State(TaggedRecordBatchIterator source, TaggedRecordBatch first)
94+
: source(std::move(source)),
95+
batch_index(0),
96+
fragment_index(0),
97+
finished(false),
98+
prev_batch(std::move(first)) {}
99+
TaggedRecordBatchIterator source;
100+
int batch_index;
101+
int fragment_index;
102+
bool finished;
103+
TaggedRecordBatch prev_batch;
104+
};
105+
struct EnumeratingIterator {
106+
Result<EnumeratedRecordBatch> Next() {
107+
if (state->finished) {
108+
return IterationEnd<EnumeratedRecordBatch>();
109+
}
110+
ARROW_ASSIGN_OR_RAISE(auto next, state->source.Next());
111+
if (IsIterationEnd<TaggedRecordBatch>(next)) {
112+
state->finished = true;
113+
return EnumeratedRecordBatch{
114+
{std::move(state->prev_batch.record_batch), state->batch_index, true},
115+
{std::move(state->prev_batch.fragment), state->fragment_index, true}};
116+
}
117+
auto prev = std::move(state->prev_batch);
118+
bool prev_is_last_batch = false;
119+
auto prev_batch_index = state->batch_index;
120+
auto prev_fragment_index = state->fragment_index;
121+
// Reference equality here seems risky but a dataset should have a constant set of
122+
// fragments which should be consistent for the lifetime of a scan
123+
if (prev.fragment.get() != next.fragment.get()) {
124+
state->batch_index = 0;
125+
state->fragment_index++;
126+
prev_is_last_batch = true;
127+
} else {
128+
state->batch_index++;
129+
}
130+
state->prev_batch = std::move(next);
131+
return EnumeratedRecordBatch{
132+
{std::move(prev.record_batch), prev_batch_index, prev_is_last_batch},
133+
{std::move(prev.fragment), prev_fragment_index, false}};
134+
}
135+
std::shared_ptr<State> state;
136+
};
137+
return EnumeratedRecordBatchIterator(
138+
EnumeratingIterator{std::make_shared<State>(std::move(scan), std::move(first))});
139+
}
140+
141+
Result<TaggedRecordBatchIterator> SyncScanner::ScanBatches() {
142+
// TODO(ARROW-11797) Provide a better implementation that does readahead. Also, add
143+
// unit testing
144+
ARROW_ASSIGN_OR_RAISE(auto scan_task_it, Scan());
145+
struct BatchIter {
146+
explicit BatchIter(ScanTaskIterator scan_task_it)
147+
: scan_task_it(std::move(scan_task_it)) {}
148+
149+
Result<TaggedRecordBatch> Next() {
150+
while (true) {
151+
if (current_task == nullptr) {
152+
ARROW_ASSIGN_OR_RAISE(current_task, scan_task_it.Next());
153+
if (IsIterationEnd<std::shared_ptr<ScanTask>>(current_task)) {
154+
return IterationEnd<TaggedRecordBatch>();
155+
}
156+
ARROW_ASSIGN_OR_RAISE(batch_it, current_task->Execute());
157+
}
158+
ARROW_ASSIGN_OR_RAISE(auto next, batch_it.Next());
159+
if (IsIterationEnd<std::shared_ptr<RecordBatch>>(next)) {
160+
current_task = nullptr;
161+
} else {
162+
return TaggedRecordBatch{next, current_task->fragment()};
163+
}
164+
}
165+
}
166+
167+
ScanTaskIterator scan_task_it;
168+
RecordBatchIterator batch_it;
169+
std::shared_ptr<ScanTask> current_task;
170+
};
171+
return TaggedRecordBatchIterator(BatchIter(std::move(scan_task_it)));
172+
}
173+
174+
Result<FragmentIterator> SyncScanner::GetFragments() {
74175
if (fragment_ != nullptr) {
75176
return MakeVectorIterator(FragmentVector{fragment_});
76177
}
@@ -81,7 +182,7 @@ Result<FragmentIterator> Scanner::GetFragments() {
81182
return GetFragmentsFromDatasets({dataset_}, scan_options_->filter);
82183
}
83184

84-
Result<ScanTaskIterator> Scanner::Scan() {
185+
Result<ScanTaskIterator> SyncScanner::Scan() {
85186
// Transforms Iterator<Fragment> into a unified
86187
// Iterator<ScanTask>. The first Iterator::Next invocation is going to do
87188
// all the work of unwinding the chained iterators.
@@ -110,7 +211,7 @@ ScannerBuilder::ScannerBuilder(std::shared_ptr<Dataset> dataset,
110211
fragment_(nullptr),
111212
scan_options_(std::move(scan_options)) {
112213
scan_options_->dataset_schema = dataset_->schema();
113-
DCHECK_OK(Filter(literal(true)));
214+
DCHECK_OK(Filter(scan_options_->filter));
114215
}
115216

116217
ScannerBuilder::ScannerBuilder(std::shared_ptr<Schema> schema,
@@ -120,13 +221,17 @@ ScannerBuilder::ScannerBuilder(std::shared_ptr<Schema> schema,
120221
fragment_(std::move(fragment)),
121222
scan_options_(std::move(scan_options)) {
122223
scan_options_->dataset_schema = std::move(schema);
123-
DCHECK_OK(Filter(literal(true)));
224+
DCHECK_OK(Filter(scan_options_->filter));
124225
}
125226

126227
const std::shared_ptr<Schema>& ScannerBuilder::schema() const {
127228
return scan_options_->dataset_schema;
128229
}
129230

231+
const std::shared_ptr<Schema>& ScannerBuilder::projected_schema() const {
232+
return scan_options_->projected_schema;
233+
}
234+
130235
Status ScannerBuilder::Project(std::vector<std::string> columns) {
131236
return SetProjection(scan_options_.get(), std::move(columns));
132237
}
@@ -170,9 +275,15 @@ Result<std::shared_ptr<Scanner>> ScannerBuilder::Finish() {
170275
}
171276

172277
if (dataset_ == nullptr) {
173-
return std::make_shared<Scanner>(fragment_, scan_options_);
278+
// AsyncScanner does not support this method of running. It may in the future
279+
return std::make_shared<SyncScanner>(fragment_, scan_options_);
280+
}
281+
if (scan_options_->use_async) {
282+
// TODO(ARROW-12289)
283+
return Status::NotImplemented("The asynchronous scanner is not yet available");
284+
} else {
285+
return std::make_shared<SyncScanner>(dataset_, scan_options_);
174286
}
175-
return std::make_shared<Scanner>(dataset_, scan_options_);
176287
}
177288

178289
static inline RecordBatchVector FlattenRecordBatchVector(
@@ -202,13 +313,13 @@ struct TableAssemblyState {
202313
}
203314
};
204315

205-
Result<std::shared_ptr<Table>> Scanner::ToTable() {
316+
Result<std::shared_ptr<Table>> SyncScanner::ToTable() {
206317
return internal::RunSynchronously<std::shared_ptr<Table>>(
207318
[this](Executor* executor) { return ToTableInternal(executor); },
208319
scan_options_->use_threads);
209320
}
210321

211-
Future<std::shared_ptr<Table>> Scanner::ToTableInternal(
322+
Future<std::shared_ptr<Table>> SyncScanner::ToTableInternal(
212323
internal::Executor* cpu_executor) {
213324
ARROW_ASSIGN_OR_RAISE(auto scan_task_it, Scan());
214325
auto task_group = scan_options_->TaskGroup();
@@ -218,6 +329,7 @@ Future<std::shared_ptr<Table>> Scanner::ToTableInternal(
218329
/// and the mutex/batches fail out of scope.
219330
auto state = std::make_shared<TableAssemblyState>();
220331

332+
// TODO (ARROW-11797) Migrate to using ScanBatches()
221333
size_t scan_task_id = 0;
222334
std::vector<Future<>> scan_futures;
223335
for (auto maybe_scan_task : scan_task_it) {

0 commit comments

Comments
 (0)