Support for DeltaLake, Iceberg, Hudi as an offline source covered by Spark, Ray and Dask engines #2406

@qooba

Is your feature request related to a problem? Please describe.
Feast can currently run on Spark, Dask and Ray. Using these engines, support for Delta Lake, Apache Iceberg and Apache Hudi (which the community has requested) can be added. This feature would be very helpful for teams that build on Spark, and also for integrations with Flink or Spark Streaming, e.g. when historical features are saved in data lake formats.

Describe the solution you'd like
[Diagram: EnginesSources]

The solution assumes adding new data lake sources:

  • DeltaDataSource
  • IcebergDataSource
  • HudiDataSource

and also support for CSV files.

Support for the data lake sources will be provided by the Spark engine (which is already in contrib) for users who run Feast on Spark (or Databricks), and also by Dask and Ray.

Additional assumptions:

  1. Data sources can be mixed, e.g. you can use DeltaDataSource, CSV and IcebergDataSource together to fetch historical features.
  2. Changing the engine won't require changes in the code (only in the feature_store.yaml configuration), so a user can test on a laptop using Dask (without any cluster setup) and then deploy to a Spark cluster (or to Dask and Ray clusters).
  3. The implementation should make it simple to add new DataSources, such as Apache Arrow Flight (if a Python API is added), and to mix them with other data sources in the future.
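
The three assumptions above could be sketched roughly as follows. This is only an illustration, not Feast code: the `path` field, the `register` and `read_sources` helpers, and the stub readers that return strings instead of DataFrames are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Type


@dataclass(frozen=True)
class DataSource:
    """Base class for the proposed sources (the `path` field is hypothetical)."""
    path: str


class DeltaDataSource(DataSource): ...
class IcebergDataSource(DataSource): ...
class CsvDataSource(DataSource): ...


Reader = Callable[[DataSource], str]
# (engine name, source class) -> reader. A real reader would return a DataFrame.
_READERS: Dict[Tuple[str, Type[DataSource]], Reader] = {}


def register(engine: str, source_cls: Type[DataSource]):
    """Decorator that plugs a reader in for one engine/source pair."""
    def wrap(fn: Reader) -> Reader:
        _READERS[(engine, source_cls)] = fn
        return fn
    return wrap


def read_sources(engine: str, sources: List[DataSource]) -> List[str]:
    """Read a mixed list of sources with the engine named in the configuration."""
    results = []
    for src in sources:
        key = (engine, type(src))
        if key not in _READERS:
            raise NotImplementedError(
                f"{type(src).__name__} is not supported on {engine}")
        results.append(_READERS[key](src))
    return results


# Stub readers standing in for the real Dask and Spark implementations.
for _engine in ("dask", "spark"):
    for _cls in (DeltaDataSource, IcebergDataSource, CsvDataSource):
        register(_engine, _cls)(
            lambda src, e=_engine: f"{e} read {type(src).__name__} at {src.path}")

# The same mixed source list works on either engine; only the name changes.
sources = [DeltaDataSource("/lake/drivers"), CsvDataSource("/tmp/trips.csv")]
print(read_sources("dask", sources))
print(read_sources("spark", sources))
```

In this sketch, adding a new source type means defining one class and registering one reader per engine, while user code that calls `read_sources` stays unchanged.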

Delta Lake

Delta Lake support for Feast on Spark has already been proposed and tested on Databricks and on local Spark (https://github.com/qooba/feast-pyspark). That solution is based on plain PySpark rather than on Spark SQL and Jinja, so it remains to be decided which implementation is more desirable and maintainable.

Delta Lake support for Feast on Dask (Ray) can be implemented using the Delta Lake Python interface.
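
A minimal sketch of such a reader, assuming the `deltalake` Python package (delta-rs) is the interface meant here. The function name `load_delta_as_pandas` is hypothetical, and handing the result over to Dask or Ray is left out.

```python
from typing import List, Optional


def load_delta_as_pandas(table_uri: str, columns: Optional[List[str]] = None):
    """Read the current snapshot of a Delta table into a pandas DataFrame."""
    # Imported lazily so that only users of Delta sources need the package.
    from deltalake import DeltaTable

    table = DeltaTable(table_uri)
    # Column projection keeps only the requested feature columns.
    return table.to_pandas(columns=columns)
```

A call might look like `load_delta_as_pandas("/lake/driver_stats", ["driver_id", "event_timestamp"])`, where the path and column names are placeholders.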

Apache Iceberg

Apache Iceberg is covered by the Spark engine and also has a Python API, which can be used to add a Dask/Ray implementation.
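
A rough sketch of the Python route, assuming the `pyiceberg` package, which may differ from the Python API that was available when this was written. The function name, catalog name and table identifier are hypothetical, and the catalog connection details are expected to come from the usual pyiceberg configuration.

```python
from typing import Sequence


def load_iceberg_as_arrow(catalog_name: str, identifier: str,
                          columns: Sequence[str] = ("*",)):
    """Scan an Iceberg table and return the result as a PyArrow table."""
    # Imported lazily so that only users of Iceberg sources need the package.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(catalog_name)
    table = catalog.load_table(identifier)
    # selected_fields projects the scan onto the requested columns.
    return table.scan(selected_fields=tuple(columns)).to_arrow()
```

The returned Arrow table could then be passed to whichever engine is configured, for example via a pandas conversion.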

Apache Hudi

Apache Hudi is covered by Spark. Currently there is no Python API (as far as I know).

CSV

CSV file support is intended for data scientists who would like to conduct ad-hoc experiments.
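
For such ad-hoc experiments, a CSV reader could be as small as the sketch below, which uses only the Python standard library. The function name `read_feature_csv`, the column names and the sample values are made up for illustration.

```python
import csv
import io
from datetime import datetime
from typing import Dict, List


def read_feature_csv(text: str,
                     timestamp_field: str = "event_timestamp") -> List[Dict[str, object]]:
    """Parse CSV text into rows, converting the timestamp column to datetime."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        parsed: Dict[str, object] = dict(row)
        # Other columns stay as strings; only the timestamp is converted.
        parsed[timestamp_field] = datetime.fromisoformat(row[timestamp_field])
        rows.append(parsed)
    return rows


sample = """driver_id,event_timestamp,conv_rate
1001,2022-03-01T10:00:00,0.51
1001,2022-03-02T10:00:00,0.64
"""
rows = read_feature_csv(sample)
print(len(rows), rows[-1]["conv_rate"])  # prints: 2 0.64
```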

Describe alternatives you've considered
N/A

Additional context
N/A
