
Add Spark materialization engine for parallel, distributed materialization of large datasets. #3167

@ckarwicki

Description


Is your feature request related to a problem? Please describe.

The current implementation of the Spark offline store does not include a Spark-based materialization engine. Materialization still happens on the driver node and is limited by its resources, which makes it slow and inefficient and makes the Spark offline store much less useful than it could be.

Describe the solution you'd like
A Spark-based materialization engine that distributes the materialization work across Spark executors (a rough sketch of the idea follows below).
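
To make the request concrete, here is a minimal, hedged sketch of the core idea rather than Feast's actual engine API: push the online-store writes onto Spark executors via foreachPartition instead of collecting the offline query result onto the driver. The function write_partition_to_online_store and the input path are hypothetical placeholders; only the PySpark calls (foreachPartition, spark.read.parquet) are real APIs.

```python
from pyspark.sql import SparkSession, DataFrame


def write_partition_to_online_store(rows) -> None:
    """Runs on an executor for one partition of the feature data.

    Hypothetical placeholder: a real engine would serialize the rows into the
    online store's write format and push them to the configured online store
    (e.g. Redis, DynamoDB), opening one connection per partition.
    """
    for row in rows:
        _ = row  # a real implementation would write the row to the online store here


def materialize_with_spark(feature_df: DataFrame) -> None:
    # The driver only coordinates; every partition is written by an executor,
    # so materialization throughput scales with the size of the Spark cluster.
    feature_df.foreachPartition(write_partition_to_online_store)


if __name__ == "__main__":
    spark = SparkSession.builder.appName("feast-spark-materialization-sketch").getOrCreate()
    # Hypothetical input: the offline feature view data for the materialization window.
    feature_df = spark.read.parquet("s3a://example-bucket/feature_view_snapshot/")
    materialize_with_spark(feature_df)
```

A real engine would plug into Feast's materialization interface and pull the DataFrame from the Spark offline store rather than reading parquet directly; the sketch only shows how the write path moves off the driver.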

Describe alternatives you've considered
BytewaxMaterializationEngine - it relies on offline_job.to_remote_storage(), but SparkRetrievalJob does not support to_remote_storage(). We would also rather use a single stack for job execution (preferably Spark) instead of two. A sketch of what adding to_remote_storage() to SparkRetrievalJob might involve is included below.
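
For reference, a very rough sketch of what to_remote_storage() support could look like, assuming the retrieval job can expose its result as a Spark DataFrame and that a shared staging path (e.g. S3) is available. The function and parameter names are illustrative, not Feast's API; the assumed retrieval_job.to_spark_df() call is labeled as such in the code.

```python
from typing import List

from pyspark.sql import SparkSession


def to_remote_storage_sketch(spark: SparkSession, retrieval_job, staging_path: str) -> List[str]:
    """Illustrative only: stage a retrieval job's result on remote storage.

    Assumptions: retrieval_job.to_spark_df() returns the result as a Spark
    DataFrame, and staging_path points at shared storage (s3a://, gs://, ...).
    """
    result_df = retrieval_job.to_spark_df()
    # Executors write the parquet files directly to remote storage; nothing is
    # collected on the driver.
    result_df.write.mode("overwrite").parquet(staging_path)

    # Return the written part-file URIs so a downstream engine (e.g. Bytewax
    # workers) can consume them independently.
    return list(spark.read.parquet(staging_path).inputFiles())
```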

Additional context
A spark_materialization_engine would make Feast highly scalable and let it leverage Spark's full potential. Right now it is very limited.
