Support for DeltaLake, Iceberg, Hudi as an offline source covered by Spark, Ray and Dask engines #2406

**Is your feature request related to a problem? Please describe.**
Feast can currently run on [Spark](https://github.com/feast-dev/feast/tree/master/sdk/python/feast/infra/offline_stores/contrib/spark_offline_store) and on [Dask and Ray](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/offline_stores/file.py). Using these engines, support for [Delta Lake](https://delta.io/), [Apache Iceberg](https://iceberg.apache.org/) and [Apache Hudi](https://hudi.apache.org/) (which the community has requested) can be added. This feature would be very helpful for teams that build on `Spark`, and also for integrations with `Flink` or `Spark Streaming`, e.g. when historical features are saved in data lake formats.

**Describe the solution you'd like**
![EnginesSources drawio](https://user-images.githubusercontent.com/14150080/158275190-3ff14de2-375e-4f27-adaf-819930119032.png)

The solution assumes adding new data lake sources:
* `DeltaDataSource`
* `IcebergDataSource`
* `HudiDataSource`

and also support for CSV files.

Support for the data lake sources will be provided by the Spark engine (which is already in contrib) for users who run Feast on Spark (or Databricks), and also by Dask and Ray.

Additional assumptions:
1) Data sources can be mixed, e.g. you can use `DeltaDataSource`, `CSV` and `IcebergDataSource` together to fetch historical features.
2) Changing the engine won't require code changes (only the `feature_store.yaml` configuration), so a user can test on a laptop using Dask (without any cluster setup) and then deploy to a Spark cluster (or to Dask or Ray clusters).
3) The implementation should make it simple to add new data sources, such as [Apache Arrow Flight](https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/) (if a Python API is added), and to mix them with other data sources in the future.
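
To make assumptions 1 and 2 concrete, here is a minimal sketch of how readers could be looked up per engine and source type. Every name in it (`SourceSpec`, `SUPPORTED`, `READERS`, `load_sources`) is hypothetical and does not exist in Feast; the support matrix is also an assumption, and strings stand in for real data frames so the sketch runs without any engine installed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class SourceSpec:
    """Hypothetical stand-in for DeltaDataSource, IcebergDataSource, etc."""
    kind: str  # "delta", "iceberg", "hudi" or "csv"
    path: str


# A reader turns a path into an engine-specific frame. A string stands in
# for the frame here so the sketch runs without Spark, Dask or Ray.
Reader = Callable[[str], str]


def _make_reader(engine: str, kind: str) -> Reader:
    return lambda path: f"{engine}:{kind}:{path}"


# Assumed support matrix: Spark reads every format; Dask and Ray skip Hudi.
SUPPORTED: Dict[str, Tuple[str, ...]] = {
    "spark": ("delta", "iceberg", "hudi", "csv"),
    "dask": ("delta", "iceberg", "csv"),
    "ray": ("delta", "iceberg", "csv"),
}

# One reader per (engine, source kind) pair; adding a new source type
# means adding entries here, not touching load_sources.
READERS: Dict[Tuple[str, str], Reader] = {
    (engine, kind): _make_reader(engine, kind)
    for engine, kinds in SUPPORTED.items()
    for kind in kinds
}


def load_sources(engine: str, sources: List[SourceSpec]) -> List[str]:
    """Load a mix of sources with whichever engine the configuration names."""
    frames = []
    for source in sources:
        reader = READERS.get((engine, source.kind))
        if reader is None:
            raise NotImplementedError(f"{engine} cannot read {source.kind} sources")
        frames.append(reader(source.path))
    return frames


sources = [SourceSpec("delta", "/data/drivers"), SourceSpec("csv", "/data/trips.csv")]
local_frames = load_sources("dask", sources)     # laptop run
cluster_frames = load_sources("spark", sources)  # same sources, engine swapped
```

Only the engine name changes between the two calls, which is the behaviour the `feature_store.yaml` switch is meant to give.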

## Delta Lake
Support for Delta Lake for Feast on Spark has already been proposed and tested on Databricks and local Spark (https://github.com/qooba/feast-pyspark). The solution is based on plain `pyspark` rather than on `Spark SQL` and `Jinja`, so it remains to be decided which implementation is more desirable and maintainable.

Support for Delta Lake for Feast on Dask (Ray) can be implemented using the Delta Lake Python interfaces:
* https://github.com/delta-io/delta-rs
* https://github.com/rajagurunath/dask_deltatable
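
As a rough illustration of the delta-rs route, the sketch below reads a Delta table into pandas through the `deltalake` Python package (the Python bindings of delta-rs). The helper name `read_delta_as_pandas` is hypothetical and is not part of Feast or of either library; it assumes `deltalake` is installed.

```python
from typing import Optional, Sequence


def read_delta_as_pandas(
    table_uri: str,
    columns: Optional[Sequence[str]] = None,
    version: Optional[int] = None,
):
    """Hypothetical helper: load a Delta table into a pandas DataFrame."""
    # Imported lazily so this module still imports where `deltalake`
    # (the delta-rs Python package) is not installed.
    from deltalake import DeltaTable

    # `version` pins a specific table version; None reads the latest one.
    table = DeltaTable(table_uri, version=version)
    # Restrict the read to the requested columns, if any were given.
    return table.to_pandas(columns=list(columns) if columns else None)


# Example call (the path is a placeholder, not a real table):
# df = read_delta_as_pandas("/data/driver_stats", columns=["driver_id", "conv_rate"])
```

The resulting frame could then be handed to Dask with `dask.dataframe.from_pandas(df, npartitions=...)`, although that loads the whole table into memory first; a partition-aware reader such as `dask_deltatable` would avoid this.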

## Apache Iceberg

Apache Iceberg is covered by the Spark engine and also by the [Python API](https://iceberg.apache.org/docs/latest/python-feature-support/), which can be used to add a Dask/Ray implementation.

## Apache Hudi

Apache Hudi is covered by Spark. Currently there is no Python API (as far as I know).

## CSV 

Support for `csv` files is intended for data scientists who would like to conduct ad-hoc experiments.
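
For a sense of what such an ad-hoc experiment might look like, here is a small sketch using plain pandas rather than Feast itself. The column names (`driver_id`, `event_timestamp`, `conv_rate`) and the data are made up for illustration. It reads feature rows from CSV text and joins, for each entity row, the latest feature value at or before that row's timestamp.

```python
import io

import pandas as pd

# Made-up feature rows; in practice this would be a CSV file on disk.
CSV_TEXT = """driver_id,event_timestamp,conv_rate
1001,2022-03-01 10:00:00,0.10
1001,2022-03-02 10:00:00,0.20
1002,2022-03-01 12:00:00,0.30
"""

features = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["event_timestamp"])

# Entities and the points in time at which their features are wanted.
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": pd.to_datetime(
            ["2022-03-01 18:00:00", "2022-03-03 00:00:00"]
        ),
    }
)

# merge_asof needs both sides sorted on the key; by default it picks the
# last feature row whose timestamp is not later than the entity row's.
joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="driver_id",
)
```

Driver 1001 is matched with the value recorded on 2022-03-01, not the later one, because the later row did not yet exist at the requested time.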

**Describe alternatives you've considered**
N/A

**Additional context**
N/A

