Description
The DataSource abstract interface contains methods such as validate and get_table_column_names_and_types, for validation and schema extraction respectively. This sits poorly with the separation of concerns between DataSource and OfflineStore: a DataSource is supposed to be a static description of a source dataset, while an OfflineStore is an engine that knows how to read one or more data source types. Keeping these methods on DataSource classes means that data sources must also be able to access the underlying data somehow.
I propose moving the validate method to the OfflineStore abstract class as validate_data_source. This also makes sense for scenarios where a single source can be read by multiple offline stores. For example, FileSource (which can be read by dask, duckdb, and probably spark in the future) is currently validated with pyarrow, instead of leaving it up to each OfflineStore to choose how to validate the source.
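To make the proposed split concrete, here is a minimal sketch. It is not the actual Feast code: the simplified FileSource dataclass, the ParquetOnlyStore class, and the file-extension check are all hypothetical stand-ins chosen for illustration. Only the names OfflineStore, FileSource, and validate_data_source come from the proposal above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSource:
    """Static description of a file-backed dataset (simplified, hypothetical)."""

    path: str


class OfflineStore(ABC):
    """Engine that knows how to read one or more data source types."""

    @abstractmethod
    def validate_data_source(self, data_source: object) -> None:
        """Raise an exception if this store cannot read the given source."""


class ParquetOnlyStore(OfflineStore):
    """Hypothetical store that decides for itself how to validate a source."""

    def validate_data_source(self, data_source: object) -> None:
        # The store, not the data source, owns the validation logic.
        if not isinstance(data_source, FileSource):
            raise TypeError(
                f"unsupported source type: {type(data_source).__name__}"
            )
        # Stand-in check; a real store would inspect the data with its own engine.
        if not data_source.path.endswith(".parquet"):
            raise ValueError(f"expected a parquet file, got: {data_source.path}")
```

With this shape, FileSource carries no I/O logic at all, and each store that supports it (dask, duckdb, and so on) could implement validate_data_source with its own reader rather than relying on a shared pyarrow check.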