feat: Add HDFS as a feature registry #5655
Merged

17 commits (all by chimeyrock999):
- d1c3270 feature(): support hdfs as registry
- 4401b54 fix(): add type for HDFSRegistryStore
- 943b2d3 fix(): reformat code of registry.py file
- b77ea6b fix(): change hdfs remove api and fix hdfs registry test
- ed4fb69 ci: install hadoop dependencies for pyarrow.fs.HDFSFileSystem
- 751e3d5 ci: fix install-hadoop-dependencies-ci
- abdc263 ci: typo in install-hadoop-dependencies-ci
- 017e0f5 ci: add HADOOP_USER_NAME env var
- 9ffb856 docs: add document for HDFS registry
- c4cd9b5 fix: change wait logs of hdfs_registry test to ensure containers are …
- 3e0f467 Merge branch 'feast-dev:master' into master
- 2d075d3 ci: fix typo in install-hadoop-dependencies
- 720a9ca docs(): Add pre-requisites for hdfs registry
- d826cd5 ci(): cache hadoop tarball
- b6e0018 ci(): rename hadoop-3.4.2 to hadoop
- 9ecb03f ci(): readd install-hadoop-dependencies-ci
- 8c9eadf Merge branch 'feast-dev:master' into master
New documentation file (42 additions):

# HDFS Registry

## Description

The HDFS registry provides support for storing the protobuf representation of your feature store objects (data sources, feature views, feature services, etc.) in the Hadoop Distributed File System (HDFS).

While it can be used in production, there are still inherent limitations with file-based registries, since changing a single field in the registry requires rewriting the whole registry file. With multiple concurrent writers, this presents a risk of data loss, or it bottlenecks writes to the registry, since all changes have to be serialized (e.g. when running materialization for multiple feature views or time ranges concurrently).
### Pre-requisites

The HDFS registry requires Hadoop 3.3+ to be installed and the `HADOOP_HOME` environment variable to be set.
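As a quick pre-flight check, the environment can be verified from Python before starting Feast. This is an illustrative sketch, not part of Feast, and the assumption it encodes (that pyarrow's HDFS binding needs a local Hadoop install reachable via `HADOOP_HOME`) follows from the pre-requisites above:

```python
import os

# Illustrative pre-flight check (not part of Feast): the HDFS registry loads
# libhdfs through the local Hadoop installation, so HADOOP_HOME must point at
# a Hadoop 3.3+ install before the Feast process starts.
hadoop_home = os.environ.get("HADOOP_HOME")
if not hadoop_home:
    raise RuntimeError("HADOOP_HOME is not set; the HDFS registry cannot load libhdfs")

# Importing the binding surfaces missing native dependencies early.
from pyarrow.fs import HadoopFileSystem  # noqa: F401
```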
### Authentication and User Configuration

The HDFS registry uses `pyarrow.fs.HadoopFileSystem` and **does not** support specifying HDFS users or Kerberos credentials directly in the `feature_store.yaml` configuration. It relies entirely on the Hadoop and system environment configuration available to the process running Feast.

By default, `pyarrow.fs.HadoopFileSystem` inherits authentication from the underlying Hadoop client libraries and environment variables, such as:

- `HADOOP_USER_NAME`
- `KRB5CCNAME`
- `hadoop.security.authentication`
- Any other relevant properties in `core-site.xml` and `hdfs-site.xml`

For more information, refer to:

- [pyarrow.fs.HadoopFileSystem API Reference](https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html)
- [Hadoop Security: Simple & Kerberos Authentication](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SecureMode.html)
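To make the environment-driven behaviour concrete, here is a minimal sketch; the host, port, and user name are placeholders, and it assumes a cluster running simple (non-Kerberos) authentication:

```python
import os

# Assumption: the cluster uses simple authentication, where HADOOP_USER_NAME
# selects the effective HDFS user. It must be set before the Hadoop client
# libraries initialize, i.e. before the first HDFS call in the process.
os.environ["HADOOP_USER_NAME"] = "feast"  # placeholder user

from pyarrow.fs import HadoopFileSystem

# Placeholder NameNode host and port; Feast derives these from registry.path.
hdfs = HadoopFileSystem("namenode.example.com", 8020)
print(hdfs.get_file_info("/feast"))  # runs as the HADOOP_USER_NAME user
```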
## Example

An example of how to configure this would be:

{% code title="feature_store.yaml" %}
```yaml
project: feast_hdfs
registry:
  path: hdfs://[YOUR NAMENODE HOST]:[YOUR NAMENODE PORT]/[PATH TO REGISTRY]/registry.pb
  cache_ttl_seconds: 60
online_store: null
offline_store: null
```
{% endcode %}
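Once this configuration is in place, running `feast apply` creates the registry file at the configured HDFS path; until then, reads fail with a "Registry not found" error, as the store implementation below shows.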
sdk/python/feast/infra/registry/contrib/hdfs/hdfs_registry_store.py (121 additions):

```python
import json
import uuid
from pathlib import Path, PurePosixPath
from typing import Optional
from urllib.parse import urlparse

from pyarrow import fs

from feast.infra.registry.registry_store import RegistryStore
from feast.protos.feast.core.Registry_pb2 import Registry as RegistryProto
from feast.repo_config import RegistryConfig
from feast.utils import _utc_now


class HDFSRegistryStore(RegistryStore):
    """HDFS implementation of RegistryStore.

    registry_config.path should be an HDFS path like hdfs://namenode:8020/path/to/registry.db
    """

    def __init__(self, registry_config: RegistryConfig, repo_path: Path):
        try:
            from pyarrow.fs import HadoopFileSystem
        except ImportError as e:
            from feast.errors import FeastExtrasDependencyImportError

            raise FeastExtrasDependencyImportError(
                "pyarrow.fs.HadoopFileSystem", str(e)
            )
        uri = registry_config.path
        self._uri = urlparse(uri)
        if self._uri.scheme != "hdfs":
            raise ValueError(
                f"Unsupported scheme {self._uri.scheme} in HDFS path {uri}"
            )
        self._hdfs = HadoopFileSystem(self._uri.hostname, self._uri.port or 8020)
        self._path = PurePosixPath(self._uri.path)

    def get_registry_proto(self):
        registry_proto = RegistryProto()
        if _check_hdfs_path_exists(self._hdfs, str(self._path)):
            with self._hdfs.open_input_file(str(self._path)) as f:
                registry_proto.ParseFromString(f.read())
            return registry_proto
        raise FileNotFoundError(
            f'Registry not found at path "{self._uri.geturl()}". Have you run "feast apply"?'
        )

    def update_registry_proto(self, registry_proto: RegistryProto):
        self._write_registry(registry_proto)

    def teardown(self):
        # Nothing to do if the registry file does not exist.
        if _check_hdfs_path_exists(self._hdfs, str(self._path)):
            self._hdfs.delete_file(str(self._path))

    def _write_registry(self, registry_proto: RegistryProto):
        """Write the registry protobuf to HDFS, replacing the whole file."""
        registry_proto.version_id = str(uuid.uuid4())
        registry_proto.last_updated.FromDatetime(_utc_now())

        dir_path = self._path.parent
        if not _check_hdfs_path_exists(self._hdfs, str(dir_path)):
            self._hdfs.create_dir(str(dir_path), recursive=True)

        with self._hdfs.open_output_stream(str(self._path)) as f:
            f.write(registry_proto.SerializeToString())

    def set_project_metadata(self, project: str, key: str, value: str):
        """Set a custom project metadata key-value pair in the registry (HDFS backend)."""
        registry_proto = self.get_registry_proto()
        found = False

        for pm in registry_proto.project_metadata:
            if pm.project == project:
                # Load JSON metadata from project_uuid
                try:
                    meta = json.loads(pm.project_uuid) if pm.project_uuid else {}
                except Exception:
                    meta = {}

                if not isinstance(meta, dict):
                    meta = {}

                meta[key] = value
                pm.project_uuid = json.dumps(meta)
                found = True
                break

        if not found:
            # Create new ProjectMetadata entry
            from feast.project_metadata import ProjectMetadata

            pm = ProjectMetadata(project_name=project)
            pm.project_uuid = json.dumps({key: value})
            registry_proto.project_metadata.append(pm.to_proto())

        # Write back
        self.update_registry_proto(registry_proto)

    def get_project_metadata(self, project: str, key: str) -> Optional[str]:
        """Get a custom project metadata key from the registry (HDFS backend)."""
        registry_proto = self.get_registry_proto()

        for pm in registry_proto.project_metadata:
            if pm.project == project:
                try:
                    meta = json.loads(pm.project_uuid) if pm.project_uuid else {}
                except Exception:
                    meta = {}

                if not isinstance(meta, dict):
                    return None
                return meta.get(key, None)
        return None


def _check_hdfs_path_exists(hdfs, path: str) -> bool:
    info = hdfs.get_file_info([path])[0]
    return info.type != fs.FileType.NotFound
```
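For illustration, here is a hedged sketch of driving the store directly from Python, outside `feast apply`. The HDFS URI is a placeholder, and the `RegistryConfig` arguments shown are assumptions about sensible defaults rather than a prescribed setup:

```python
from pathlib import Path

from feast.infra.registry.contrib.hdfs.hdfs_registry_store import HDFSRegistryStore
from feast.repo_config import RegistryConfig

# Placeholder URI; in practice this comes from registry.path in feature_store.yaml.
config = RegistryConfig(
    registry_type="file",
    path="hdfs://namenode.example.com:8020/feast/registry.db",
)
store = HDFSRegistryStore(config, repo_path=Path("."))

# Raises FileNotFoundError until `feast apply` has created the registry file.
proto = store.get_registry_proto()
print(proto.version_id, proto.last_updated)
```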