feat: Operational metrics for offline store and SOX metrics for both #6340

Merged
ntkathole merged 5 commits into feast-dev:master from jyejare:remaining_ops_metrics on Jun 3, 2026
Conversation

@jyejare

@jyejare jyejare commented Apr 27, 2026

Collaborator

What this PR does / why we need it:

Adds operational metrics instrumentation for Feast's offline store and structured SOX audit logging for both online and offline feature retrieval paths. This is needed to satisfy the operational observability requirements.

Changes:

Offline Store RED Metrics (metrics.py, offline_store.py)

* New Prometheus metrics: feast_offline_store_request_total (Counter), feast_offline_store_request_latency_seconds (Histogram), feast_offline_store_row_count (Histogram)
* Instrumented RetrievalJob.to_arrow() as the single source of truth for offline retrieval metrics: it captures request count, error rate, latency, and row count
* Defensive try/except in the finally block ensures instrumentation failures never mask query errors
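
The instrumentation pattern described above can be sketched roughly as follows. This is a standard-library-only illustration, not the code from this PR: `InMemoryMetrics` is a hypothetical stand-in for the Prometheus Counter and Histograms, and `instrumented_to_arrow` is a hypothetical wrapper around a retrieval callable.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


class InMemoryMetrics:
    """Hypothetical stand-in for the Prometheus counter and histograms."""

    def __init__(self) -> None:
        self.request_total: Dict[Tuple[str, str], int] = defaultdict(int)
        self.latency_seconds: List[float] = []
        self.row_counts: List[int] = []

    def record(self, method: str, status: str, latency: float, rows: int) -> None:
        self.request_total[(method, status)] += 1
        self.latency_seconds.append(latency)
        if status == "success":
            self.row_counts.append(rows)


def instrumented_to_arrow(
    run_query: Callable[[], Sequence], metrics: InMemoryMetrics
) -> Sequence:
    """Run the query and record count, status, latency, and rows once."""
    start = time.monotonic()
    status = "error"
    rows = 0
    try:
        result = run_query()
        rows = len(result)
        status = "success"
        return result
    finally:
        try:
            metrics.record("to_arrow", status, time.monotonic() - start, rows)
        except Exception:
            # Instrumentation failures must never mask the query's own error.
            pass
```

Recording in a single `finally` block means the success and error paths share one code path, and the inner `try/except` keeps a broken metrics backend from replacing the original exception.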

SOX Audit Logging (metrics.py, feature_server.py, offline_store.py)

* emit_online_audit_log(): structured JSON audit entries for online requests, capturing requestor identity, entity keys, feature views, feature count, status, and latency
* emit_offline_audit_log(): structured JSON audit entries for offline retrievals, capturing method, feature views, row count, status, start/end timestamps, and duration
* Logs routed to a dedicated feast.audit logger so operators can independently route audit entries to a SOX-compliant sink
* Only entity key names are logged (not values) to minimize PII exposure
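
A minimal sketch of what such an audit entry might look like, using only the standard library. The logger name `feast.audit` comes from the description above; the function names and JSON field names below are illustrative assumptions, not the signatures of `emit_online_audit_log()` in this PR.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Iterable, Mapping

# Dedicated logger name taken from the PR description.
audit_logger = logging.getLogger("feast.audit")


def build_online_audit_entry(
    requestor: str,
    entities: Mapping[str, Iterable],
    feature_views: Iterable[str],
    feature_count: int,
    status: str,
    latency_ms: float,
) -> str:
    """Serialize one audit record; the field names are illustrative."""
    entry = {
        "event": "online_feature_request",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "requestor": requestor,
        # Only the entity key names are recorded, never the values.
        "entity_keys": sorted(entities.keys()),
        "feature_views": sorted(feature_views),
        "feature_count": feature_count,
        "status": status,
        "latency_ms": latency_ms,
    }
    return json.dumps(entry, sort_keys=True)


def emit_online_audit_entry(**fields) -> None:
    """Write the serialized entry to the dedicated audit logger."""
    audit_logger.info(build_online_audit_entry(**fields))
```

Because the entries go to a named logger, attaching a handler to `feast.audit` (and, if desired, setting `propagate = False` on it) is enough to send them to a separate sink without touching application logs.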

Online Audit Integration (feature_server.py)

* Extracted _parse_feature_info() helper to DRY up feature view name / count extraction (shared by _resolve_feature_counts and _emit_online_audit)
* _emit_online_audit() wraps audit emission with best-effort error handling (logs at warning level on failure, never breaks the request)
* Wired into /get-online-features endpoint's finally block
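
The shared helper and the best-effort wrapper might be shaped roughly like this. All names below are hypothetical, and the sketch assumes feature references in the `feature_view:feature_name` form; the real `_parse_feature_info()` may also handle feature services and other inputs.

```python
import logging
from typing import Any, Callable, Dict, List, Sequence, Tuple

logger = logging.getLogger(__name__)


def parse_feature_info(feature_refs: Sequence[str]) -> Tuple[List[str], int]:
    """Return (sorted feature view names, feature count) for "view:feature" refs."""
    views = {ref.split(":", 1)[0] for ref in feature_refs if ":" in ref}
    return sorted(views), len(feature_refs)


def emit_audit_best_effort(
    emit: Callable[..., None], feature_refs: Sequence[str], status: str
) -> bool:
    """Emit an audit entry; log a warning instead of raising on failure."""
    try:
        views, count = parse_feature_info(feature_refs)
        emit(feature_views=views, feature_count=count, status=status)
        return True
    except Exception as exc:
        logger.warning("Audit log emission failed: %s", exc)
        return False


def handle_request(
    feature_refs: Sequence[str],
    fetch: Callable[[], Dict[str, Any]],
    emit: Callable[..., None],
) -> Dict[str, Any]:
    """Hypothetical handler: the audit call sits in the finally block."""
    status = "error"
    try:
        response = fetch()
        status = "success"
        return response
    finally:
        emit_audit_best_effort(emit, feature_refs, status)
```

For example, `parse_feature_info(["driver_stats:conv_rate", "driver_stats:acc_rate"])` yields one feature view and a feature count of two, and a failing `emit` callable only produces a warning while the response is still returned.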

Configuration (base_config.py)

* offline_features: bool = True - toggle for offline store Prometheus metrics
* audit_logging: bool = False - toggle for structured audit log emission (opt-in)
* Both flags integrated into _MetricsFlags and build_metrics_flags()
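
As a rough illustration of how the two toggles and their defaults could flow into a resolved flags object: the class and function names below are hypothetical dataclass stand-ins, not the actual config model, `_MetricsFlags`, or `build_metrics_flags()` from base_config.py and metrics.py. Only the two field names and their defaults come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricsConfigSketch:
    """Hypothetical config model carrying the two new toggles."""

    offline_features: bool = True  # offline store Prometheus metrics (on by default)
    audit_logging: bool = False  # structured audit log emission (opt-in)


@dataclass(frozen=True)
class MetricsFlagsSketch:
    """Hypothetical immutable snapshot consulted on the request path."""

    offline_features: bool
    audit_logging: bool


def build_flags_sketch(config: MetricsConfigSketch) -> MetricsFlagsSketch:
    """Resolve the config into flags once, so hot paths only read booleans."""
    return MetricsFlagsSketch(
        offline_features=config.offline_features,
        audit_logging=config.audit_logging,
    )
```

With no overrides, this sketch yields offline metrics enabled and audit logging disabled, matching the defaults listed above.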

@jyejare jyejare requested review from a team as code owners April 27, 2026 14:03
@jyejare jyejare requested review from ejscribner, robhowley and shuchu and removed request for a team April 27, 2026 14:03

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.

Comment thread sdk/python/feast/infra/offline_stores/offline_store.py Outdated
@jyejare jyejare force-pushed the remaining_ops_metrics branch from 9c69d73 to e0a5d54 Compare April 28, 2026 00:57
Comment thread sdk/python/tests/unit/test_metrics.py Outdated
Comment thread sdk/python/feast/metrics.py
@ntkathole

Member

Add metrics in docs/reference/feature-servers/python-feature-server.md

Comment thread sdk/python/feast/feature_server.py Outdated
Comment thread sdk/python/feast/infra/feature_servers/base_config.py
@jyejare jyejare force-pushed the remaining_ops_metrics branch from 2225fc3 to dcfaab9 Compare May 11, 2026 08:59
@ntkathole ntkathole force-pushed the remaining_ops_metrics branch from dcfaab9 to 55b229e Compare May 14, 2026 09:05
Comment thread sdk/python/feast/infra/offline_stores/offline_store.py Outdated
@jyejare jyejare force-pushed the remaining_ops_metrics branch 3 times, most recently from 469b8c5 to ad546f6 Compare May 29, 2026 14:20
detail=f"Failed to query Prometheus: {e}",
)

@router.get("/system-metrics/query", tags=["System Metrics"])

Member

Is no authentication check required on any of the system-metrics endpoints?

Collaborator Author

Yup, not required. These endpoints expose operational infrastructure metrics, not feature-level data, so no additional RBAC (assert_permissions) checks are needed.

Member

Just to clarify: anyone who has access to the UI can query Prometheus through the /query and /query_range proxy to the cluster-wide Thanos, and it's on Prometheus or Thanos to accept or reject queries - is that what you mean? Also, any user who can reach the UI can make Feast act as a proxy to Prometheus. The query goes out with the Feast pod's SA token - not the caller's identity. So Prometheus sees "Feast SA is asking this query" and applies the Feast SA's permissions?

@jyejare jyejare Jun 1, 2026

Collaborator Author

Yes, I think we don't want to introduce RBAC / user_token for viewing the operational metrics from Prometheus.

it's on Prometheus or Thanos to accept or reject queries - is that what you mean?

No, AFAIK we don't have, and don't want to get into, Thanos authentication.

Also, Any user who can reach the UI can make Feast act as a proxy to prometheus.

Agreed, this is a security concern, and it remains one even when only an authenticated Feast user can reach Prometheus. (Even today, only an authenticated user is able to access the UI in production deployments.)

The query goes out with the Feast pod's SA token - not the caller's identity. So prometheus sees "Feast SA is asking this query" and applies Feast SA's permissions ?

Yeah, but again, Prometheus-side authorization that allows access to only Feast or project-level metrics is a challenge. We would either need authorization on the Prometheus side, or the PromQL queries would need to be rebuilt/customized in code with a project restriction based on the user's request, while the dashboard still offers a way to run a custom PromQL query.

WDYT?
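
To make the second option concrete, here is one hedged sketch of rebuilding queries in code with a project restriction. It was not part of this PR (the proxy endpoints were removed instead). The metric names come from the PR description; the `project` label, the template names, and `build_project_query` are all assumptions for illustration.

```python
import re
from typing import Dict

# Hypothetical allow-list of query templates. The `project` label is assumed.
QUERY_TEMPLATES: Dict[str, str] = {
    "offline_request_rate": (
        'sum(rate(feast_offline_store_request_total{{project="{project}"}}[5m]))'
    ),
    "offline_latency_p95": (
        "histogram_quantile(0.95, sum by (le) (rate("
        'feast_offline_store_request_latency_seconds_bucket{{project="{project}"}}[5m])))'
    ),
}

# Restrict project names to a conservative identifier pattern (an assumption)
# so that a caller cannot inject extra PromQL through the label value.
_PROJECT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_project_query(template_name: str, project: str) -> str:
    """Return an allow-listed PromQL query scoped to a single project."""
    if template_name not in QUERY_TEMPLATES:
        raise KeyError(f"Unknown query template: {template_name}")
    if not _PROJECT_RE.fullmatch(project):
        raise ValueError(f"Invalid project name: {project!r}")
    return QUERY_TEMPLATES[template_name].format(project=project)
```

The trade-off is the one raised above: an allow-list keeps every query scoped to the caller's project, but it cannot also offer free-form custom PromQL from the dashboard.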

Collaborator Author

Removed the fetching (GET) of operational metrics from Feast as discussed. Please re-review.

Comment thread sdk/python/feast/api/registry/rest/system_metrics.py Outdated
Comment thread sdk/python/feast/api/registry/rest/system_metrics.py Outdated
Comment thread sdk/python/feast/infra/feature_servers/base_config.py Outdated
Comment thread sdk/python/feast/api/registry/rest/__init__.py Outdated
Comment thread sdk/python/feast/api/registry/rest/system_metrics.py Outdated
Comment thread sdk/python/feast/api/registry/rest/system_metrics.py Outdated
@jyejare jyejare force-pushed the remaining_ops_metrics branch from ad546f6 to bb4f8b5 Compare June 1, 2026 08:39
@jyejare jyejare force-pushed the remaining_ops_metrics branch 2 times, most recently from 63a7116 to b468808 Compare June 1, 2026 09:23
jyejare added 5 commits June 2, 2026 20:14
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
@jyejare jyejare force-pushed the remaining_ops_metrics branch from 89a0ea0 to 76a3118 Compare June 2, 2026 14:44
@ntkathole ntkathole merged commit 65b1b80 into feast-dev:master Jun 3, 2026
33 checks passed
franciscojavierarceo pushed a commit that referenced this pull request Jun 13, 2026
# [0.64.0](v0.63.0...v0.64.0) (2026-06-13)

### Bug Fixes

* Add async_supported property to RedisOnlineStore ([9b088fe](9b088fe))
* Add missing feast init templates to operator CRD and enhance persistence documentation ([1941d4d](1941d4d))
* Allow to publish from reference branch ([5458ec8](5458ec8))
* API calls list ([4203eb7](4203eb7))
* **bigquery:** Enable list inference for parquet loads in offline_write_batch ([9243497](9243497)), closes [#5845](#5845)
* Bump grpcio dependencies ([07b4782](07b4782))
* **compute-engine/local:** Honor field_mapping on join keys in dedup + join nodes ([#6395](#6395)) ([bd01824](bd01824))
* **dynamodb:** Avoid tag race condition by using diff-based tag updates ([#6479](#6479)) ([bad2b7d](bad2b7d)), closes [#6418](#6418)
* **dynamodb:** Fix mypy type for _build_projection_expression return ([217b4da](217b4da))
* Fix intermittent async test failures for DynamoDB and Redis ([63c5eb1](63c5eb1))
* Fix mongodb blog title ([57d28d4](57d28d4))
* Fix shared SQL registry crash - avoid unnecessary UDF deserialization in proto cache building ([ac588d7](ac588d7))
* Fix SparkRetrievalJob.persist() failing for SparkSource ([209d7cd](209d7cd))
* Fixed formatting and image for mongo blog ([#6377](#6377)) ([f8389fb](f8389fb))
* Fixes for ray source ([7f592a4](7f592a4))
* **go:** skip registry refresh when cache_ttl_seconds <= 0 ([97ed40c](97ed40c))
* Handle array of strings columns in Athena materialization ([#6324](#6324)) ([4ed0278](4ed0278))
* make milvus VARCHAR max_length configurable, remove hardcoded 512 limit ([3b98c22](3b98c22))
* **operator:** Set appProtocol: grpc on registry gRPC Service ([#6367](#6367)) ([c9ae2b4](c9ae2b4))
* PyJWT 2.10+ added validation that rejects empty HMAC keys ([e756ffe](e756ffe))
* RemoteOnlineStore sends all features in a single HTTP request ([8f187dd](8f187dd))
* Remove registry proto dump to enforce RBAC and add permission checks to Commit/Refresh RPCs ([328431f](328431f))
* Remove selector migration job - no longer needed ([51c325e](51c325e))
* replace broken .claude skill symlink with correct relative path ([4541690](4541690))
* Replace selector label strip patch with migration Job for upgrade-safe selector uniqueness ([00dea50](00dea50))
* Scope feature view name conflict check to current project in file-based registry ([#6369](#6369)) ([a4fde83](a4fde83)), closes [#6209](#6209)
* **snowflake:** Stop double-quoting connection identifiers ([#6462](#6462)) ([e914d59](e914d59))
* **spark:** S3/GCS PyArrow filesystem resolution for staging paths ([#6442](#6442)) ([ae50414](ae50414))
* **trino:** Clean up temporary entity tables after retrieval ([#6381](#6381)) ([d86b13d](d86b13d)), closes [#6306](#6306)
* Update go-feature-server base image to Go 1.25 and fix operator Dockerfile COPY permissions ([86ef0bc](86ef0bc))

### Features

* [Backend] Data Quality Monitoring with native compute, multi-backend support, REST API, CLI ([#6202](#6202)) ([5458c37](5458c37))
* Add apache flink compute engine ([#6476](#6476)) ([9636d6a](9636d6a))
* Add demo noteboooks for users ([e362173](e362173))
* Add enabled/disabled toggle for feature views ([#6401](#6401)) ([5f1fa0d](5f1fa0d)), closes [#6395](#6395)
* Add Label View to init template ([ec272d5](ec272d5))
* Add mTLS support to remote registry gRPC client ([#6474](#6474)) ([c9602d8](c9602d8))
* Add Prometheus gauges for FeatureStore installation telemetry ([#6354](#6354)) ([1b681b7](1b681b7))
* Adds registry REST API endpoints for managing entities, data sources, and feature views ([#6413](#6413)) ([f77bd1d](f77bd1d))
* Allow CRUD on entities, data sources, and feature views from UI ([#6412](#6412)) ([2321c07](2321c07))
* Allow default openlineage configuration ([#6467](#6467)) ([276b6df](276b6df))
* **bigquery:** Support DATE-type event timestamp columns ([#6362](#6362)) ([753dee5](753dee5)), closes [#2530](#2530)
* **cli:** Add `feast projects delete` command (closes [#5095](#5095)) ([#6318](#6318)) ([1a4b96c](1a4b96c))
* Data Quality Monitoring added in feast UI ([#6422](#6422)) ([fa271be](fa271be))
* **dynamodb:** Use ProjectionExpression when requested_features is set ([0adc906](0adc906)), closes [#6058](#6058)
* Enhance DataSource and FeatureView modals with error handling and submission states ([96d7169](96d7169))
* Expose registry endpoints on feature server for MCP access ([f77981c](f77981c))
* Feast First-Class LabelView Implementation ([#6292](#6292)) ([c0e7e5d](c0e7e5d))
* Feast-MLflow Integration ([#6235](#6235)) ([7279c75](7279c75))
* Operational metrics for offline store and SOX metrics for both ([#6340](#6340)) ([65b1b80](65b1b80))
* Pre-compute feature service ([8011550](8011550))
* REST API-backed UI for RBAC compatibility and per-page lazy loading ([#6414](#6414)) ([6ae80af](6ae80af))
* Support non-string map key types ([#6382](#6382)) ([#6383](#6383)) ([728aa2e](728aa2e))
* Update FeatureStore CRD with DRA Fields ([01241e4](01241e4))

### Performance Improvements

* Cache feature view resolution in get_online_features to reduce per-request overhead ([55c2f18](55c2f18))
* Optimize feature serving latency with batched async Redis, cached checks fix ([103809a](103809a))
* Replace MessageToDict with optimized custom dict builder ([#6015](#6015)) ([9902064](9902064))

2 participants