feat: [Backend] Data Quality Monitoring with native compute, multi-backend support, REST API, CLI #6202

Merged: ntkathole merged 10 commits into feast-dev:master from jyejare:monitoring_plus on Jun 9, 2026

@jyejare jyejare commented Mar 31, 2026

Collaborator

To see the monitoring UI in action, see the demo in PR #6422.

What this PR does / why we need it:

This PR introduces comprehensive feature quality monitoring capabilities to Feast, enabling proactive tracking of feature distributions and data quality metrics. Currently, Feast has no built-in tools for monitoring feature health in production — ML teams must build custom solutions to detect issues like distribution shifts, elevated null rates, or degraded data quality before they silently impact model performance.

What it adds:

Core Monitoring Engine

  • Hybrid computation engine — SQL push-down on the native OfflineStore as the primary compute path, with a Python-based (PyArrow/NumPy) fallback for backends that don't implement native compute. This leverages the offline store as a compute engine (same architecture as Feast materialization).
  • Fully native storage — Monitoring metrics are stored within the configured OfflineStore backend itself (no separate monitoring database). Six static methods on the OfflineStore base class (compute_monitoring_metrics, get_monitoring_max_timestamp, ensure_monitoring_tables, save_monitoring_metrics, query_monitoring_metrics, clear_monitoring_baseline) handle compute and storage.
  • PyArrow-based metrics computation (MetricsCalculator) — Backend-agnostic statistical computation as fallback, supporting:
    • Numeric features: mean, stddev, min/max, percentiles (p50/p75/p90/p95/p99), null rate, histograms
    • Categorical features: top-N value counts with other/unique counts
    • Automatic feature type classification from Feast's PrimitiveFeastType and ValueType
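The metrics listed above can be illustrated with a small standard-library sketch. This is not the PR's MetricsCalculator, which is described as PyArrow/NumPy based; the function names, the population standard deviation, the linear-interpolation percentile, and the equal-width histogram are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from statistics import fmean, pstdev


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 1]) over a sorted list."""
    if not sorted_vals:
        return None
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def numeric_metrics(values, bins=4):
    """Summary statistics for a numeric column; None marks a null."""
    present = sorted(v for v in values if v is not None)
    total = len(values)
    if not present:
        return {"count": total, "null_rate": 1.0 if total else None}
    lo, hi = present[0], present[-1]
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    hist = [0] * bins
    for v in present:
        # Clamp so the maximum value lands in the last bucket.
        hist[min(int((v - lo) / width), bins - 1)] += 1
    return {
        "count": total,
        "null_rate": (total - len(present)) / total,
        "mean": fmean(present),
        "stddev": pstdev(present),  # population stddev (an assumption)
        "min": lo,
        "max": hi,
        "p50": percentile(present, 0.50),
        "p95": percentile(present, 0.95),
        "histogram": hist,
    }


def categorical_metrics(values, top_n=2):
    """Top-N value counts, with the remainder folded into an 'other' count."""
    counts = Counter(v for v in values if v is not None)
    top = counts.most_common(top_n)
    return {
        "top_values": dict(top),
        "other_count": sum(counts.values()) - sum(c for _, c in top),
        "unique_count": len(counts),
    }
```

Nulls are excluded from the statistics but still counted in the null rate, which is what keeps an all-null column from raising instead of reporting a null rate of 1.0.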

Multi-Backend Support (8 Offline Stores)

All 6 native monitoring methods implemented for each backend with dialect-specific SQL:

| Backend | Compute | Storage | Dialect highlights |
| --- | --- | --- | --- |
| PostgreSQL | SQL push-down | INSERT ON CONFLICT | PERCENTILE_CONT, WIDTH_BUCKET |
| Snowflake | SQL push-down | MERGE with VARIANT JSON | APPROX_PERCENTILE, WIDTH_BUCKET |
| BigQuery | SQL push-down | MERGE into BQ tables | APPROX_QUANTILES, parameterized queries |
| Redshift | SQL push-down | MERGE via Data API | APPROXIMATE PERCENTILE_DISC |
| Spark | SparkSQL push-down | Parquet tables | PERCENTILE_APPROX, spark.sql() |
| Oracle | SQL via Ibis | MERGE FROM DUAL | PERCENTILE_CONT WITHIN GROUP |
| DuckDB | In-memory SQL | Parquet files | QUANTILE_CONT, HISTOGRAM |
| Dask | PyArrow compute | Parquet files | pyarrow.compute + numpy |
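To make the dialect differences concrete, here is a rough sketch of how a percentile expression could be chosen per backend. The SQL function names come from the table above; the `percentile_expr` helper, the dialect keys, and the exact expressions are assumptions, and the SQL generated by the PR may differ.

```python
# Percentile expression templates per SQL dialect (illustrative only).
PERCENTILE_TEMPLATES = {
    "postgres": "PERCENTILE_CONT({q}) WITHIN GROUP (ORDER BY {col})",
    "oracle": "PERCENTILE_CONT({q}) WITHIN GROUP (ORDER BY {col})",
    "redshift": "APPROXIMATE PERCENTILE_DISC({q}) WITHIN GROUP (ORDER BY {col})",
    "snowflake": "APPROX_PERCENTILE({col}, {q})",
    "spark": "PERCENTILE_APPROX({col}, {q})",
    "duckdb": "QUANTILE_CONT({col}, {q})",
}


def percentile_expr(dialect, column, q):
    """Return a SQL fragment computing the q-th percentile (q in [0, 1]).

    `column` must be a trusted identifier; this sketch does no quoting.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if dialect == "bigquery":
        # APPROX_QUANTILES(x, 100) returns 101 boundaries; index by percent.
        return f"APPROX_QUANTILES({column}, 100)[OFFSET({round(q * 100)})]"
    try:
        template = PERCENTILE_TEMPLATES[dialect]
    except KeyError:
        raise ValueError(f"unsupported dialect: {dialect}") from None
    return template.format(q=q, col=column)
```

Keeping the per-dialect differences in one lookup table means the surrounding query builder can stay identical across backends.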

Multi-Granularity Time-Series Metrics

  • 5 granularities: daily, weekly, biweekly, monthly, quarterly
  • Auto-compute mode: Detects latest event timestamp and computes all granularities in one job
  • Pre-computed metrics stored per date + granularity for fast retrieval
  • On-demand transient compute: Fresh statistics for arbitrary date ranges (not stored)
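As a sketch of what auto-compute mode might do once it has the latest event timestamp, the snippet below derives one date window per granularity. The window lengths (1, 7, 14, 30, and 90 days) and the choice of windows ending on the latest date are assumptions; the PR description does not state how each granularity is bounded.

```python
from datetime import date, timedelta

# Assumed window lengths in days; not taken from the PR.
WINDOW_DAYS = {
    "daily": 1,
    "weekly": 7,
    "biweekly": 14,
    "monthly": 30,
    "quarterly": 90,
}


def windows_ending_at(latest):
    """Map each granularity to an inclusive (start_date, end_date) window."""
    return {
        granularity: (latest - timedelta(days=days - 1), latest)
        for granularity, days in WINDOW_DAYS.items()
    }
```

A single job can then loop over the returned windows and store one metrics row per date and granularity.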

Batch + Log Data Source Support

  • Batch source: Reads from the feature view's batch_source via OfflineStore.pull_all_from_table_or_query()
  • Log source: Reads from feature serving logs via FeatureService.logging_config destination, using __log_timestamp as event timestamp
  • Feature name normalization: Prefixed log column names (driver_stats__conv_rate) are parsed back to their original feature_view_name + feature_name for storage compatibility and drift detection
  • data_source_type column (batch / log) differentiates metrics in storage
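The feature name normalization described above can be sketched as a split on the `__` separator. The helper name `split_log_column` is hypothetical, and splitting on the first separator assumes the feature view name itself contains no double underscore.

```python
def split_log_column(column, separator="__"):
    """Split a prefixed log column into (feature_view_name, feature_name)."""
    view, sep, feature = column.partition(separator)
    if not sep or not view or not feature:
        raise ValueError(f"not a prefixed log column: {column!r}")
    return view, feature
```

For example, `split_log_column("driver_stats__conv_rate")` yields the pair `("driver_stats", "conv_rate")`, which matches the keys used for batch metrics and so allows batch and log metrics to be compared.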

Orchestration Service (MonitoringService)

  • Ties registry, offline store, calculator, and storage together
  • Computes and aggregates metrics at feature, feature view, and feature service levels
  • Cached OfflineStore instance for performance
  • Unified compute/timestamp methods handling both batch and log paths with SQL push-down + fallback

NaN/Inf Sanitization

  • Multi-layered protection against NaN/Inf float values that break JSON serialization:
    • opt_float() in monitoring_utils.py — sanitizes at SQL result parsing
    • _sanitize_floats() in monitoring_service.py — final safety net on all API read paths
    • Applied in PyArrow compute paths (MetricsCalculator, Dask backend)
  • Prevents HTTP 422 errors caused by "Out of range float values are not JSON compliant: nan"
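The two sanitization layers might look roughly like the sketch below. The names `to_optional_float` and `sanitize_floats` are illustrative stand-ins for `opt_float()` and `_sanitize_floats()`, whose actual bodies are not shown in this PR description.

```python
import json
import math


def to_optional_float(value):
    """Convert a SQL result value to float, mapping None/NaN/Inf to None."""
    if value is None:
        return None
    number = float(value)
    return number if math.isfinite(number) else None


def sanitize_floats(obj):
    """Recursively replace non-finite floats with None in nested containers."""
    if isinstance(obj, float):
        return obj if math.isfinite(obj) else None
    if isinstance(obj, dict):
        return {key: sanitize_floats(val) for key, val in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [sanitize_floats(val) for val in obj]
    return obj


payload = {"mean": float("nan"), "histogram": [1.0, float("inf")]}
# allow_nan=False makes json.dumps raise on NaN/Inf, so this call only
# succeeds because the payload was sanitized first.
encoded = json.dumps(sanitize_floats(payload), allow_nan=False)
```

Sanitizing both at parse time and on the read path is redundant by design: the second pass catches values produced by code paths that never went through the first.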

Shared Utilities (monitoring_utils.py)

  • Centralized table name constants, column lists, PK definitions
  • monitoring_table_meta(), opt_float(), empty_numeric_metric(), empty_categorical_metric(), normalize_monitoring_row(), build_view_aggregate()
  • Used by all 8 backends — eliminates duplication and ensures consistency

DQM Job Engine (DQMJobManager)

  • Asynchronous job abstraction for metric computation (compute, baseline, auto_compute)
  • Job status tracking in feast_monitoring_jobs table
  • Forwards all parameters including set_baseline to the compute engine
  • Supports future integration with Ray/Spark job runners
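A minimal sketch of an asynchronous job abstraction with status tracking follows. It keeps state in memory rather than in the feast_monitoring_jobs table, and the class name, status strings, and method names are assumptions rather than the DQMJobManager API.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class InMemoryJobManager:
    """Runs compute callables in the background and tracks their status."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._executor = ThreadPoolExecutor(max_workers=2)
        self._jobs = {}
        self._futures = {}

    def submit(self, job_type, **params):
        """Queue a job and return its id immediately (non-blocking)."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = {
            "type": job_type,
            "status": "pending",
            "result": None,
            "error": None,
        }
        self._futures[job_id] = self._executor.submit(self._run, job_id, params)
        return job_id

    def _run(self, job_id, params):
        job = self._jobs[job_id]
        job["status"] = "running"
        try:
            # Forward every parameter (including set_baseline) unchanged.
            job["result"] = self._compute_fn(**params)
            job["status"] = "succeeded"
        except Exception as exc:
            job["error"] = str(exc)
            job["status"] = "failed"

    def status(self, job_id):
        return self._jobs[job_id]["status"]

    def wait(self, job_id):
        """Block until the job finishes and return its record."""
        self._futures[job_id].result()
        return self._jobs[job_id]
```

Passing `**params` straight through is the point of the sketch: a manager that picks parameters one by one can silently drop a flag such as set_baseline.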

REST API (/monitoring/)

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /monitoring/compute | Submit batch DQM job |
| POST | /monitoring/auto_compute | Auto-detect dates, all granularities |
| POST | /monitoring/compute/transient | On-demand compute (not stored) |
| POST | /monitoring/compute/log | Compute from serving logs |
| POST | /monitoring/auto_compute/log | Auto-detect log dates, all granularities |
| GET | /monitoring/jobs/{job_id} | DQM job status |
| GET | /monitoring/metrics/features | Per-feature metrics |
| GET | /monitoring/metrics/feature_views | Per-view aggregates |
| GET | /monitoring/metrics/feature_services | Per-service aggregates |
| GET | /monitoring/metrics/baseline | Baseline distribution retrieval |
| GET | /monitoring/metrics/timeseries | Time-series data for trend analysis |

All endpoints support cascading filters: project, feature_service_name, feature_view_name, feature_name, granularity, data_source_type, date range.

RBAC enforced using existing AuthzedAction.DESCRIBE (read) and AuthzedAction.UPDATE (compute).
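As an illustration of the cascading filters, the helper below builds a URL for one of the GET metrics endpoints. It assumes the filters are passed as query parameters with the names listed above; the `metrics_url` helper and the host and port in the usage note are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint suffixes taken from the GET routes listed above.
METRIC_LEVELS = {
    "features",
    "feature_views",
    "feature_services",
    "baseline",
    "timeseries",
}


def metrics_url(base_url, level, **filters):
    """Build a /monitoring/metrics/<level> URL, skipping unset filters."""
    if level not in METRIC_LEVELS:
        raise ValueError(f"unknown metrics level: {level}")
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    url = f"{base_url.rstrip('/')}/monitoring/metrics/{level}"
    return f"{url}?{query}" if query else url
```

A call such as `metrics_url("http://localhost:6572", "features", project="my_project", granularity="daily")` would produce a URL that any HTTP client can fetch, subject to the RBAC rules above.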

CLI (feast monitor run)

Options:
  --feature-view TEXT       Feature view name (omit for all)
  --feature-service TEXT    Feature service name (required for --source-type log with explicit dates)
  --feature-name TEXT       Feature name(s), repeatable
  --start-date TEXT         Start date YYYY-MM-DD (omit for auto-detect)
  --end-date TEXT           End date YYYY-MM-DD (omit for auto-detect)
  --granularity TEXT        daily | weekly | biweekly | monthly | quarterly
  --set-baseline            Mark this computation as baseline
  --source-type TEXT        batch | log | all (default: batch)
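For orchestrators that launch the CLI from Python, a small helper can assemble the command from the options above. The flags are the ones listed; the `monitor_run_argv` helper and the example values are hypothetical.

```python
import shlex


def monitor_run_argv(feature_view=None, feature_service=None, feature_names=(),
                     start_date=None, end_date=None, granularity=None,
                     set_baseline=False, source_type=None):
    """Assemble an argv list for `feast monitor run` from optional settings."""
    argv = ["feast", "monitor", "run"]
    if feature_view:
        argv += ["--feature-view", feature_view]
    if feature_service:
        argv += ["--feature-service", feature_service]
    for name in feature_names:  # --feature-name is repeatable
        argv += ["--feature-name", name]
    if start_date:
        argv += ["--start-date", start_date]
    if end_date:
        argv += ["--end-date", end_date]
    if granularity:
        argv += ["--granularity", granularity]
    if set_baseline:
        argv.append("--set-baseline")
    if source_type:
        argv += ["--source-type", source_type]
    return argv


command = shlex.join(
    monitor_run_argv(
        feature_view="driver_stats",
        feature_names=["conv_rate", "acc_rate"],
        granularity="daily",
        set_baseline=True,
    )
)
```

The list form can be passed directly to `subprocess.run`, while the joined string is convenient for logging or a cron entry.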

Auto-Baseline on feast apply

  • Automatically queues baseline metric computation for new features on feast apply
  • Non-blocking (async DQM job), idempotent (skips existing baselines)
  • Configurable — can be disabled via feature_store.yaml:
data_quality_monitoring:
  auto_baseline: false
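The idempotent behavior described above can be sketched as a set difference between the applied features and those that already have a baseline. The function name and the (feature view, feature) tuple keys are assumptions for illustration.

```python
def features_needing_baseline(applied, existing_baselines, auto_baseline=True):
    """Return applied (view, feature) pairs that still lack a baseline.

    Returns an empty list when auto_baseline is disabled, so repeated
    `feast apply` runs never recompute an existing baseline.
    """
    if not auto_baseline:
        return []
    existing = set(existing_baselines)
    return [key for key in applied if key not in existing]
```

Running it twice with the first result merged into `existing_baselines` returns an empty list the second time, which is the property that makes the apply hook safe to repeat.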

Feast Operator Support

  • New CRD type: DataQualityMonitoringConfig added to FeatureStoreSpec
  • Operator generates data_quality_monitoring section in feature_store.yaml when config is set
  • DeepCopy methods auto-generated via make generate
  • Disabling auto-baseline from operator CR:
apiVersion: feast.dev/v1
kind: FeatureStore
spec:
  feastProject: my_project
  dataQualityMonitoring:
    autoBaseline: false
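The mapping from the CR's camelCase fields to the snake_case section in feature_store.yaml can be pictured with the sketch below. The operator itself is written in Go with explicit struct fields, so this generic key conversion is only an illustration and is meant for the dataQualityMonitoring subsection alone.

```python
import re


def camel_to_snake(name):
    """Convert a camelCase key such as autoBaseline to auto_baseline."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_repo_config(section):
    """Recursively convert the keys of a nested CR section to snake_case."""
    return {
        camel_to_snake(key): to_repo_config(val) if isinstance(val, dict) else val
        for key, val in section.items()
    }
```

Applied to `{"dataQualityMonitoring": {"autoBaseline": False}}`, this yields the `data_quality_monitoring` / `auto_baseline` structure shown in the feature_store.yaml example earlier.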

Documentation

  • How-to guide: docs/how-to-guides/feature-monitoring.md — Production setup, CLI usage, REST API reference, orchestrator integration (Airflow, KFP, cron, K8s CronJob), backend compatibility table
  • Quickstart notebook: examples/monitoring/monitoring-quickstart.ipynb — 12-step hands-on walkthrough with visualization examples
  • docs/SUMMARY.md updated with links to both

Design decisions:

  • Native OfflineStore compute + storage — Each backend implements its own SQL push-down for metrics calculation and uses its native UPSERT/MERGE for storage. No separate monitoring database needed.
  • Hybrid fallback — Backends that don't implement native compute fall back to Python/PyArrow, ensuring all offline stores are supported.
  • Separate /monitoring/ route rather than extending existing /metrics/ — The existing metrics route serves registry inventory metadata; monitoring serves statistical feature quality data with a different data path.
  • DQM Job Engine for async computation — Supports future Ray/Spark integration for distributed metric computation.
  • Top-level data_quality_monitoring config — Sits alongside materialization and openlineage in RepoConfig, reflecting that it spans offline store compute/storage + apply trigger + server API.

Which issue(s) this PR fixes:

Partially Fixes #5919

Checks

  • I've made sure the tests are passing.
  • My commits are signed off (git commit -s)
  • My PR title follows conventional commits format

Testing Strategy

  • Unit tests
  • Integration tests
  • Operator unit tests (Ginkgo)

Test coverage (all passing):

| Test Suite | Count | Covers |
| --- | --- | --- |
| test_metrics_calculator.py | 30+ | Numeric/categorical computation, edge cases (empty, all-null, single value, high cardinality), type classification, PyArrow type classification, NaN/Inf sanitization, JSON serializability |
| test_compute_correctness.py | 40+ | Per-backend metric accuracy for all 8 offline stores (DuckDB, Dask, PostgreSQL, Snowflake, BigQuery, Redshift, Oracle, Spark): numeric stats, histograms, categorical counts |
| test_monitoring_integration.py | 16+ | End-to-end batch/log computation, baseline flow, view/service aggregation, native storage dispatch, log feature name normalization, REST API endpoints, CLI, RBAC enforcement |
| repo_config_test.go | 92 | Operator repo config generation including DataQualityMonitoring config with auto_baseline disabled, YAML serialization verification |

Snyk SAST scan: 0 vulnerabilities across all new files.

@jyejare jyejare requested a review from a team as a code owner March 31, 2026 10:53
@jyejare jyejare marked this pull request as draft March 31, 2026 10:54
devin-ai-integration[bot]

This comment was marked as resolved.

@jyejare jyejare force-pushed the monitoring_plus branch 4 times, most recently from d0b45bb to c06853e Compare April 21, 2026 14:00
@jyejare jyejare marked this pull request as ready for review April 21, 2026 14:34
@jyejare jyejare requested review from a team and sudohainguyen as code owners April 21, 2026 14:34
@jyejare jyejare requested review from lokeshrangineni, robhowley and tokoko and removed request for a team April 21, 2026 14:34
devin-ai-integration[bot]

This comment was marked as resolved.

@jyejare jyejare changed the title from "feat: Add feature quality monitoring with statistical metrics, REST API, and CLI" to "feat: Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config" on Apr 21, 2026
@jyejare jyejare changed the title from "feat: Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config" to "feat: [Backend] Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config" on Apr 21, 2026
@jyejare jyejare changed the title from "feat: [Backend] Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config" to "feat: [Backend] DQM with native compute, multi-backend support, REST API, CLI" on Apr 21, 2026
@jyejare jyejare force-pushed the monitoring_plus branch from 3da4dde to 0344087 Compare May 5, 2026 08:52
Comment thread sdk/python/feast/api/registry/rest/monitoring.py Outdated
Comment thread sdk/python/feast/monitoring/monitoring_utils.py Outdated
Comment thread sdk/python/feast/monitoring/dqm_job_manager.py Outdated
Comment thread sdk/python/feast/monitoring/monitoring_service.py Outdated
Comment thread sdk/python/feast/monitoring/monitoring_store.py Outdated
Comment thread infra/feast-operator/api/v1/featurestore_types.go Outdated
(feast_feature_freshness_seconds)."""


class DqmInitialDistributionConfig(FeastConfigBaseModel):

@ntkathole ntkathole May 6, 2026

Member


I think these configs should also live at the top level rather than under feature_server. Monitoring uses the offline store, not the online server, so this field is closer to materialization, which is a top-level config.

#feature_store.yaml        

feature_monitoring:   
  auto_baseline: false 

This matches the existing pattern: materialization spans the offline and online stores, openlineage spans apply and materialize, and feature_monitoring would span the offline store (compute/storage), the apply trigger, and the server API.

Collaborator Author


This is probably because the existing metrics config lives under the server config. Metrics (even operational metrics) could also be computed for the offline store, so I think it would be good to move both the operational metrics and the DQM metrics under a parent monitoring config.

@jyejare jyejare force-pushed the monitoring_plus branch from 58795bc to a26ab3c Compare May 8, 2026 07:23
Comment thread sdk/python/feast/repo_operations.py Outdated
Comment thread sdk/python/feast/repo_operations.py Outdated
Comment thread sdk/python/feast/api/registry/rest/system_metrics.py Outdated
Comment thread sdk/python/feast/infra/offline_stores/duckdb.py Outdated
jyejare added a commit to jyejare/feast that referenced this pull request Jun 3, 2026
- Forward set_baseline parameter in DQMJobManager.execute_job for
  compute jobs so user intent to mark a computation as baseline is
  no longer silently dropped.
- Add "1=1" fallback when ts_filter is empty (both start_date and
  end_date are None) in BigQuery, PostgreSQL, and Snowflake monitoring
  compute to prevent invalid SQL generation.

Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
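The "1=1" fallback from the commit message above can be sketched as follows. The helper name, the `%s` placeholder style, and the half-open date range are assumptions; the backends in the PR may build the filter differently.

```python
def build_ts_filter(ts_column, start_date=None, end_date=None):
    """Return (where_clause, params) for an optional timestamp range.

    Falls back to the always-true predicate "1=1" when neither bound is
    given, so the result can always follow WHERE without producing
    invalid SQL. `ts_column` must be a trusted identifier.
    """
    clauses, params = [], []
    if start_date is not None:
        clauses.append(f"{ts_column} >= %s")
        params.append(start_date)
    if end_date is not None:
        clauses.append(f"{ts_column} < %s")
        params.append(end_date)
    return " AND ".join(clauses) or "1=1", params
```

Without the fallback, a query template such as `SELECT ... WHERE {ts_filter}` would end in a bare WHERE when both dates are None.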
@jyejare jyejare force-pushed the monitoring_plus branch 2 times, most recently from 0fe6970 to d114171 Compare June 3, 2026 12:36
@jyejare jyejare force-pushed the monitoring_plus branch 7 times, most recently from 462da5b to 1d61e7c Compare June 4, 2026 14:22
@ntkathole

Member

@jyejare please resolve conflicts

jyejare and others added 10 commits June 9, 2026 13:43
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
@jyejare jyejare force-pushed the monitoring_plus branch from 1d61e7c to 50b45ac Compare June 9, 2026 08:19

@ntkathole ntkathole left a comment

Member


Looks good

@ntkathole ntkathole merged commit 5458c37 into feast-dev:master Jun 9, 2026
36 of 37 checks passed
franciscojavierarceo pushed a commit that referenced this pull request Jun 13, 2026
# [0.64.0](v0.63.0...v0.64.0) (2026-06-13)

### Bug Fixes

* Add async_supported property to RedisOnlineStore ([9b088fe](9b088fe))
* Add missing feast init templates to operator CRD and enhance persistence documentation ([1941d4d](1941d4d))
* Allow to publish from reference branch ([5458ec8](5458ec8))
* API calls list ([4203eb7](4203eb7))
* **bigquery:** Enable list inference for parquet loads in offline_write_batch ([9243497](9243497)), closes [#5845](#5845)
* Bump grpcio dependencies ([07b4782](07b4782))
* **compute-engine/local:** Honor field_mapping on join keys in dedup + join nodes ([#6395](#6395)) ([bd01824](bd01824))
* **dynamodb:** Avoid tag race condition by using diff-based tag updates ([#6479](#6479)) ([bad2b7d](bad2b7d)), closes [#6418](#6418)
* **dynamodb:** Fix mypy type for _build_projection_expression return ([217b4da](217b4da))
* Fix intermittent async test failures for DynamoDB and Redis ([63c5eb1](63c5eb1))
* Fix mongodb blog title ([57d28d4](57d28d4))
* Fix shared SQL registry crash - avoid unnecessary UDF deserialization in proto cache building ([ac588d7](ac588d7))
* Fix SparkRetrievalJob.persist() failing for SparkSource ([209d7cd](209d7cd))
* Fixed formatting and image for mongo blog ([#6377](#6377)) ([f8389fb](f8389fb))
* Fixes for ray source ([7f592a4](7f592a4))
* **go:** skip registry refresh when cache_ttl_seconds <= 0 ([97ed40c](97ed40c))
* Handle array of strings columns in Athena materialization ([#6324](#6324)) ([4ed0278](4ed0278))
* make milvus VARCHAR max_length configurable, remove hardcoded 512 limit ([3b98c22](3b98c22))
* **operator:** Set appProtocol: grpc on registry gRPC Service ([#6367](#6367)) ([c9ae2b4](c9ae2b4))
* PyJWT 2.10+ added validation that rejects empty HMAC keys ([e756ffe](e756ffe))
* RemoteOnlineStore sends all features in a single HTTP request ([8f187dd](8f187dd))
* Remove registry proto dump to enforce RBAC and add permission checks to Commit/Refresh RPCs ([328431f](328431f))
* Remove selector migration job - no longer needed ([51c325e](51c325e))
* replace broken .claude skill symlink with correct relative path ([4541690](4541690))
* Replace selector label strip patch with migration Job for upgrade-safe selector uniqueness ([00dea50](00dea50))
* Scope feature view name conflict check to current project in file-based registry ([#6369](#6369)) ([a4fde83](a4fde83)), closes [#6209](#6209)
* **snowflake:** Stop double-quoting connection identifiers ([#6462](#6462)) ([e914d59](e914d59))
* **spark:** S3/GCS PyArrow filesystem resolution for staging paths ([#6442](#6442)) ([ae50414](ae50414))
* **trino:** Clean up temporary entity tables after retrieval ([#6381](#6381)) ([d86b13d](d86b13d)), closes [#6306](#6306)
* Update go-feature-server base image to Go 1.25 and fix operator Dockerfile COPY permissions ([86ef0bc](86ef0bc))

### Features

* [Backend] Data Quality Monitoring with native compute, multi-backend support, REST API, CLI ([#6202](#6202)) ([5458c37](5458c37))
* Add apache flink compute engine ([#6476](#6476)) ([9636d6a](9636d6a))
* Add demo notebooks for users ([e362173](e362173))
* Add enabled/disabled toggle for feature views ([#6401](#6401)) ([5f1fa0d](5f1fa0d)), closes [#6395](#6395)
* Add Label View to init template ([ec272d5](ec272d5))
* Add mTLS support to remote registry gRPC client ([#6474](#6474)) ([c9602d8](c9602d8))
* Add Prometheus gauges for FeatureStore installation telemetry ([#6354](#6354)) ([1b681b7](1b681b7))
* Adds registry REST API endpoints for managing entities, data sources, and feature views ([#6413](#6413)) ([f77bd1d](f77bd1d))
* Allow CRUD on entities, data sources, and feature views from UI ([#6412](#6412)) ([2321c07](2321c07))
* Allow default openlineage configuration ([#6467](#6467)) ([276b6df](276b6df))
* **bigquery:** Support DATE-type event timestamp columns ([#6362](#6362)) ([753dee5](753dee5)), closes [#2530](#2530)
* **cli:** Add `feast projects delete` command (closes [#5095](#5095)) ([#6318](#6318)) ([1a4b96c](1a4b96c))
* Data Quality Monitoring added in feast UI ([#6422](#6422)) ([fa271be](fa271be))
* **dynamodb:** Use ProjectionExpression when requested_features is set ([0adc906](0adc906)), closes [#6058](#6058)
* Enhance DataSource and FeatureView modals with error handling and submission states ([96d7169](96d7169))
* Expose registry endpoints on feature server for MCP access ([f77981c](f77981c))
* Feast First-Class LabelView Implementation ([#6292](#6292)) ([c0e7e5d](c0e7e5d))
* Feast-MLflow Integration ([#6235](#6235)) ([7279c75](7279c75))
* Operational metrics for offline store and SOX metrics for both ([#6340](#6340)) ([65b1b80](65b1b80))
* Pre-compute feature service ([8011550](8011550))
* REST API-backed UI for RBAC compatibility and per-page lazy loading ([#6414](#6414)) ([6ae80af](6ae80af))
* Support non-string map key types ([#6382](#6382)) ([#6383](#6383)) ([728aa2e](728aa2e))
* Update FeatureStore CRD with DRA Fields ([01241e4](01241e4))

### Performance Improvements

* Cache feature view resolution in get_online_features to reduce per-request overhead ([55c2f18](55c2f18))
* Optimize feature serving latency with batched async Redis, cached checks fix ([103809a](103809a))
* Replace MessageToDict with optimized custom dict builder ([#6015](#6015)) ([9902064](9902064))

Development

Successfully merging this pull request may close these issues.

Revamp Data Quality Monitoring

3 participants