Add date_partition_column to SparkSource #4835

@niklasvm

Description

Is your feature request related to a problem? Please describe.
The current Spark implementation scans all parquet files. This can be made faster and more efficient by specifying a date_partition_column: during execution, this column would be used to filter the data at the file level, so only files whose partition date falls within the requested range are scanned.
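A minimal sketch of the kind of partition predicate this would generate. The helper name, the column name `event_date`, and the date format are hypothetical, for illustration only; the actual implementation would inject such a clause into the offline store's retrieval query so the engine can prune partitions before scanning files.

```python
from datetime import datetime

def partition_filter(date_partition_column: str,
                     start: datetime, end: datetime) -> str:
    """Build a WHERE-clause fragment restricting the scan to partitions
    whose date falls inside [start, end] (hypothetical helper)."""
    return (
        f"{date_partition_column} >= '{start.strftime('%Y-%m-%d')}' "
        f"AND {date_partition_column} <= '{end.strftime('%Y-%m-%d')}'"
    )

clause = partition_filter("event_date",
                          datetime(2024, 1, 1), datetime(2024, 1, 31))
print(clause)
# event_date >= '2024-01-01' AND event_date <= '2024-01-31'
```

Because the predicate references only the partition column, Spark can satisfy it from directory names alone and skip non-matching files entirely, rather than reading every parquet footer.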

Describe the solution you'd like
Add date_partition_column to SparkSource. A similar implementation already exists for the AthenaSource.

Describe alternatives you've considered
None

I have implemented this locally and it works. I'm happy to open a PR.
