ARROW-14658: [C++] Add basic support for nested field refs in scanning
This implements the following:
- Being able to project and filter on nested fields in the scanner/query engine.
Parquet, ORC, and Feather are supported and tested. For ORC and Feather, we read the entire top-level column. (CSV does not support reading any nested types, though if it does in the future, it should behave the same as Feather/ORC.) For Parquet, we could materialize only the leaf nodes necessary for the projection, but without ARROW-1888 this will fail later on in the scanning pipeline, so for now Parquet behaves the same as Feather/ORC.
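
The read strategy above can be sketched with a toy model. This is plain Python with dicts standing in for columns, not the Arrow C++ or Python API, and every name in it (`top_level_columns`, `resolve`, the sample rows) is hypothetical. It only illustrates the idea of loading whole top-level columns first and resolving the nested refs afterwards.

```python
# Toy model only: plain dicts stand in for a file's columns; this is not Arrow's API.

def top_level_columns(refs):
    """Return the top-level column names needed to satisfy nested field refs.

    Each ref is a tuple of field names, e.g. ("user", "address", "city").
    """
    seen = []
    for ref in refs:
        if ref[0] not in seen:
            seen.append(ref[0])
    return seen


def resolve(row, ref):
    """Walk a nested field ref through one row of struct-like dicts."""
    value = row
    for name in ref:
        value = value[name]
    return value


# A "file" with one nested column and one flat column.
rows = [
    {"user": {"id": 1, "address": {"city": "Oslo"}}, "score": 10},
    {"user": {"id": 2, "address": {"city": "Lima"}}, "score": 20},
]

projection = [("user", "address", "city")]
filter_ref = ("user", "id")

# Step 1: read whole top-level columns (the Feather/ORC behaviour described above).
needed = top_level_columns(projection + [filter_ref])
loaded = [{name: row[name] for name in needed} for row in rows]

# Step 2: filter and project on the nested fields after loading.
result = [
    {".".join(ref): resolve(row, ref) for ref in projection}
    for row in loaded
    if resolve(row, filter_ref) > 1
]
```

In this sketch the unrelated `score` column is never loaded, but all of `user` is, even though only two of its leaves are used. That is the cost the Parquet leaf-only materialization would avoid.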
The following are not implemented:
- Normally, the scanner can fill in a column of nulls if a requested column does not exist in a file. This is not supported for nested field refs because we need ARROW-1888 to be implemented.
- A nested field ref cannot be used as a key/target of an aggregation or join. However, you can first project the nested fields into their own fields, then aggregate/join on them as usual.
This limitation exists because the aggregate/join nodes currently compute a FieldPath to resolve a FieldRef, but then throw away the path, keeping only the first index. To implement this, we would need to store the FieldPath and use the struct_field kernel to resolve the actual array. However, this would add overhead, and we should be careful about regressions here, especially in the common case of no nested field refs.
- Only FieldRefs consisting of field names are supported. For FieldRefs consisting of FieldPath (= a sequence of indices), the semantics are unclear. So far, the scanner is robust to individual files having fields in a different order than the overall dataset, but this won't work for FieldPath, so either we must require that the schema is consistent across files, or come up with some way to map file schemas onto the dataset schema so that indices have a consistent meaning.
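
The last two limitations can be illustrated with a small toy model. This is plain Python, not Arrow's FieldRef/FieldPath implementation: schemas are nested lists of `(name, children)` pairs, and `find_path` and `get_by_path` are hypothetical helpers. It shows what is lost when only the first index of a path is kept, and why an index path computed against the dataset schema can point at a different field in a file whose fields are ordered differently.

```python
# Toy model only: schemas are nested lists of (name, children) pairs, not Arrow types.

def find_path(schema, names):
    """Resolve a sequence of field names to a list of indices (a 'field path')."""
    path = []
    fields = schema
    for name in names:
        index = [field_name for field_name, _ in fields].index(name)
        path.append(index)
        fields = fields[index][1]
    return path


def get_by_path(schema, path):
    """Follow a list of indices and return the name of the field it lands on."""
    fields = schema
    name = None
    for index in path:
        name, fields = fields[index]
    return name


# Dataset schema: id, user{name, age}
dataset_schema = [("id", []), ("user", [("name", []), ("age", [])])]
# One file stores the struct's children in a different order: id, user{age, name}
file_schema = [("id", []), ("user", [("age", []), ("name", [])])]

ref = ["user", "age"]

full_path = find_path(dataset_schema, ref)   # [1, 1]
truncated = full_path[:1]                    # keeps only the first index: [1]

# Keeping only the first index lands on the struct column, not the leaf.
via_truncated = get_by_path(dataset_schema, truncated)   # "user"
via_full = get_by_path(dataset_schema, full_path)        # "age"

# Name-based resolution is redone per schema, so it still finds "age" in the file.
file_path = find_path(file_schema, ref)                  # [1, 0]

# Reusing the dataset's index path on the file silently selects another field.
wrong_field = get_by_path(file_schema, full_path)        # "name"
```

Resolving names against each file's own schema keeps the scanner robust to reordering. An index-based ref would need either identical schemas across files or an explicit mapping from file indices to dataset indices.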
Closes apache#11704 from lidavidm/arrow-14658
Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>