Modularize ingestion distributed compute engine support #444

@ches

Description

This is a companion to #402 and the larger topic of storage engine modularization which was realized in #529 and subsequent PRs that implemented the new interfaces.

Just as adding support for new storage engines tends to cause a dependency explosion for Feast ingestion & serving, the same is true for the Beam Runner / job management adapter glue in core (all of this could move to serving under future plans, but that won't change the fundamental problem this issue is about).

So for both storage and compute engines, I feel that some modularity strategy is needed for loose binding at build time, configurable at runtime. The goals would be to:

  • Minimize the dependency pains that developers and contributors to Feast need to deal with when they are not actively working on a particular stack. The dependency trees are often large and fragile, especially in the Hadoop ecosystem (e.g. Hive and Spark).
  • Reduce deployment bloat for operators who wish to package Feast internally with only the module JARs needed to support their organization's stack. (IIRC last I checked, hadoop-common or hadoop-client alone leave you with close to 200 MB of JARs, and beam-runners-spark and beam-sdks-java-io-hcatalog, among others, pull in these dependencies. They are provided scope, but I believe the point stands.)

Possibilities might be OSGi or java.util.ServiceLoader (and Spring integration or alternatives thereof). Open to other ideas!
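To make the ServiceLoader idea concrete, here is a minimal sketch of how a module JAR could expose an engine implementation that core discovers at runtime without a compile-time dependency. The `StorageEngineProvider` interface and its methods are hypothetical names for illustration, not Feast's actual API:

```java
import java.util.Map;
import java.util.ServiceLoader;

public class PluginLoaderSketch {
    // Hypothetical SPI that each storage/compute engine module would implement.
    // A module JAR registers its implementation class in
    // META-INF/services/PluginLoaderSketch$StorageEngineProvider.
    public interface StorageEngineProvider {
        String type(); // e.g. "bigquery", "redis" (illustrative values)
        void configure(Map<String, String> options);
    }

    public static void main(String[] args) {
        // Core discovers whichever provider JARs are on the classpath at
        // runtime; build-time it only depends on the SPI interface.
        ServiceLoader<StorageEngineProvider> loader =
                ServiceLoader.load(StorageEngineProvider.class);
        for (StorageEngineProvider provider : loader) {
            System.out.println("Discovered engine: " + provider.type());
        }
    }
}
```

With this shape, operators control which engines are available simply by choosing which module JARs to deploy alongside core, which addresses both goals above. OSGi would add stronger isolation (per-module classloaders, versioned dependencies) at the cost of considerably more machinery.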

Relates to #362
