Adds a naive implementation of `AsyncScanner` which is different from `SyncScanner` in a few ways:
* It does not use `ScanTask` and instead relies on `Fragment::ScanBatchesAsync` which returns `RecordBatchGenerator`.
* It does an unordered scan by default (i.e. batches from file N may arrive before all batches from file N-1 have arrived) but can produce an ordered scan on request.
* It uses the unordered scan for `ToTable`.
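
To make the difference between the unordered and ordered scans concrete, here is a minimal sketch. It does not use the Arrow API: `Batch`, `UnorderedScan`, `OrderedScan`, `ExampleBatches` and `FragmentSequence` are hypothetical names, and the `arrival` field is an assumed, simulated arrival time standing in for real asynchronous I/O.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a record batch (not an Arrow type): the fragment
// (file) it came from, its position within that fragment, and a simulated
// time at which it becomes available.
struct Batch {
  int fragment;
  int index;
  int arrival;
};

// Unordered scan: batches are emitted in arrival order, so a batch from
// fragment N may be emitted before fragment N-1 has finished.
std::vector<Batch> UnorderedScan(std::vector<Batch> batches) {
  std::stable_sort(batches.begin(), batches.end(),
                   [](const Batch& a, const Batch& b) {
                     return a.arrival < b.arrival;
                   });
  return batches;
}

// Ordered scan: every batch of fragment N-1 is emitted before any batch of
// fragment N, and batches within a fragment keep their original order.
std::vector<Batch> OrderedScan(std::vector<Batch> batches) {
  std::stable_sort(batches.begin(), batches.end(),
                   [](const Batch& a, const Batch& b) {
                     if (a.fragment != b.fragment) return a.fragment < b.fragment;
                     return a.index < b.index;
                   });
  return batches;
}

// Two fragments with two batches each; the second batch of fragment 0 is slow.
std::vector<Batch> ExampleBatches() {
  return {{0, 0, 1}, {0, 1, 5}, {1, 0, 2}, {1, 1, 3}};
}

// Helper that extracts the sequence of fragment ids from a scan result.
std::vector<int> FragmentSequence(const std::vector<Batch>& batches) {
  std::vector<int> out;
  for (const Batch& b : batches) out.push_back(b.fragment);
  return out;
}
```

With these example batches, the unordered scan yields the fragment sequence 0, 1, 1, 0, while the ordered scan yields 0, 0, 1, 1. An operation such as `ToTable` can use the unordered variant when it does not need batches in file order.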
It is "naive" because this PR does not add a complete implementation for `FileFragment::ScanBatchesAsync`. This method relies on `FileFormat::ScanBatchesAsync` (in the same way that `FileFragment::Scan` relies on `FileFormat::ScanFile`). This method (`FileFormat::ScanBatchesAsync`) _should_ be overridden in each of the formats (to rely on an async reader) but it is not (yet).
As a result, the performance of `AsyncScanner` is poor, since it does neither "per-file" nor "per-batch" parallelism. Follow-up tasks are ARROW-12355 (CSV), ARROW-11772 (IPC), and ARROW-11843 (Parquet).
In addition, this PR is built on top of ARROW-12287 so that will need to be merged first. It will also need to rebase changes from ARROW-12161 and ARROW-11797.
Closes apache#10008 from westonpace/feature/arrow-12289
Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>