feat: [Backend] Data Quality Monitoring with native compute, multi-backend support, REST API, CLI #6202
(feast_feature_freshness_seconds)."""

class DqmInitialDistributionConfig(FeastConfigBaseModel):
I think these configs should also live at the top level instead of under feature_server. They use the offline store, not the online server. This field is closer to materialization, which is a top-level config.
# feature_store.yaml
feature_monitoring:
  auto_baseline: false
This matches the existing pattern: materialization spans offline + online stores, openlineage spans apply + materialize, and feature_monitoring spans the offline store (compute/storage) + apply trigger + server API.
This could be because the existing metrics config resides under the server config. Metrics (even operational metrics) could also be computed for the offline store, so I think it would be good to move both the operational metrics and the DQM metrics under a parent monitoring config.
- Forward set_baseline parameter in DQMJobManager.execute_job for compute jobs so user intent to mark a computation as baseline is no longer silently dropped.
- Add "1=1" fallback when ts_filter is empty (both start_date and end_date are None) in BigQuery, PostgreSQL, and Snowflake monitoring compute to prevent invalid SQL generation.

Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
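The "1=1" fallback described in the commit message above can be illustrated with a minimal sketch. This is not the PR's actual code: `build_ts_filter` and `build_count_query` are hypothetical helpers, and the table and column names are placeholders. It only shows why an empty timestamp filter would otherwise produce a dangling `WHERE`.

```python
from datetime import datetime
from typing import Optional


def build_ts_filter(
    ts_column: str,
    start_date: Optional[datetime],
    end_date: Optional[datetime],
) -> str:
    """Build a timestamp predicate (hypothetical helper, not Feast API).

    Falls back to "1=1" when neither bound is given, so the caller can
    always write "WHERE <predicate>" without emitting invalid SQL.
    """
    conditions = []
    if start_date is not None:
        conditions.append(f"{ts_column} >= '{start_date.isoformat()}'")
    if end_date is not None:
        conditions.append(f"{ts_column} < '{end_date.isoformat()}'")
    # An empty list joins to "", which is falsy, so "1=1" is used instead.
    return " AND ".join(conditions) or "1=1"


def build_count_query(
    table: str,
    ts_column: str,
    start_date: Optional[datetime] = None,
    end_date: Optional[datetime] = None,
) -> str:
    """Assemble a simple row-count query using the predicate above."""
    ts_filter = build_ts_filter(ts_column, start_date, end_date)
    return f"SELECT COUNT(*) FROM {table} WHERE {ts_filter}"


if __name__ == "__main__":
    print(build_count_query("driver_stats", "event_timestamp"))
    print(build_count_query("driver_stats", "event_timestamp", datetime(2024, 1, 1)))
```

The sketch interpolates values into the SQL string only to keep the example short; real backends would more likely bind the dates as query parameters.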
@jyejare please resolve conflicts
# [0.64.0](v0.63.0...v0.64.0) (2026-06-13)

### Bug Fixes

* Add async_supported property to RedisOnlineStore ([9b088fe](9b088fe))
* Add missing feast init templates to operator CRD and enhance persistence documentation ([1941d4d](1941d4d))
* Allow to publish from reference branch ([5458ec8](5458ec8))
* API calls list ([4203eb7](4203eb7))
* **bigquery:** Enable list inference for parquet loads in offline_write_batch ([9243497](9243497)), closes [#5845](#5845)
* Bump grpcio dependencies ([07b4782](07b4782))
* **compute-engine/local:** Honor field_mapping on join keys in dedup + join nodes ([#6395](#6395)) ([bd01824](bd01824))
* **dynamodb:** Avoid tag race condition by using diff-based tag updates ([#6479](#6479)) ([bad2b7d](bad2b7d)), closes [#6418](#6418)
* **dynamodb:** Fix mypy type for _build_projection_expression return ([217b4da](217b4da))
* Fix intermittent async test failures for DynamoDB and Redis ([63c5eb1](63c5eb1))
* Fix mongodb blog title ([57d28d4](57d28d4))
* Fix shared SQL registry crash - avoid unnecessary UDF deserialization in proto cache building ([ac588d7](ac588d7))
* Fix SparkRetrievalJob.persist() failing for SparkSource ([209d7cd](209d7cd))
* Fixed formatting and image for mongo blog ([#6377](#6377)) ([f8389fb](f8389fb))
* Fixes for ray source ([7f592a4](7f592a4))
* **go:** skip registry refresh when cache_ttl_seconds <= 0 ([97ed40c](97ed40c))
* Handle array of strings columns in Athena materialization ([#6324](#6324)) ([4ed0278](4ed0278))
* make milvus VARCHAR max_length configurable, remove hardcoded 512 limit ([3b98c22](3b98c22))
* **operator:** Set appProtocol: grpc on registry gRPC Service ([#6367](#6367)) ([c9ae2b4](c9ae2b4))
* PyJWT 2.10+ added validation that rejects empty HMAC keys ([e756ffe](e756ffe))
* RemoteOnlineStore sends all features in a single HTTP request ([8f187dd](8f187dd))
* Remove registry proto dump to enforce RBAC and add permission checks to Commit/Refresh RPCs ([328431f](328431f))
* Remove selector migration job - no longer needed ([51c325e](51c325e))
* replace broken .claude skill symlink with correct relative path ([4541690](4541690))
* Replace selector label strip patch with migration Job for upgrade-safe selector uniqueness ([00dea50](00dea50))
* Scope feature view name conflict check to current project in file-based registry ([#6369](#6369)) ([a4fde83](a4fde83)), closes [#6209](#6209)
* **snowflake:** Stop double-quoting connection identifiers ([#6462](#6462)) ([e914d59](e914d59))
* **spark:** S3/GCS PyArrow filesystem resolution for staging paths ([#6442](#6442)) ([ae50414](ae50414))
* **trino:** Clean up temporary entity tables after retrieval ([#6381](#6381)) ([d86b13d](d86b13d)), closes [#6306](#6306)
* Update go-feature-server base image to Go 1.25 and fix operator Dockerfile COPY permissions ([86ef0bc](86ef0bc))

### Features

* [Backend] Data Quality Monitoring with native compute, multi-backend support, REST API, CLI ([#6202](#6202)) ([5458c37](5458c37))
* Add apache flink compute engine ([#6476](#6476)) ([9636d6a](9636d6a))
* Add demo noteboooks for users ([e362173](e362173))
* Add enabled/disabled toggle for feature views ([#6401](#6401)) ([5f1fa0d](5f1fa0d)), closes [#6395](#6395)
* Add Label View to init template ([ec272d5](ec272d5))
* Add mTLS support to remote registry gRPC client ([#6474](#6474)) ([c9602d8](c9602d8))
* Add Prometheus gauges for FeatureStore installation telemetry ([#6354](#6354)) ([1b681b7](1b681b7))
* Adds registry REST API endpoints for managing entities, data sources, and feature views ([#6413](#6413)) ([f77bd1d](f77bd1d))
* Allow CRUD on entities, data sources, and feature views from UI ([#6412](#6412)) ([2321c07](2321c07))
* Allow default openlineage configuration ([#6467](#6467)) ([276b6df](276b6df))
* **bigquery:** Support DATE-type event timestamp columns ([#6362](#6362)) ([753dee5](753dee5)), closes [#2530](#2530)
* **cli:** Add `feast projects delete` command (closes [#5095](#5095)) ([#6318](#6318)) ([1a4b96c](1a4b96c))
* Data Quality Monitoring added in feast UI ([#6422](#6422)) ([fa271be](fa271be))
* **dynamodb:** Use ProjectionExpression when requested_features is set ([0adc906](0adc906)), closes [#6058](#6058)
* Enhance DataSource and FeatureView modals with error handling and submission states ([96d7169](96d7169))
* Expose registry endpoints on feature server for MCP access ([f77981c](f77981c))
* Feast First-Class LabelView Implementation ([#6292](#6292)) ([c0e7e5d](c0e7e5d))
* Feast-MLflow Integration ([#6235](#6235)) ([7279c75](7279c75))
* Operational metrics for offline store and SOX metrics for both ([#6340](#6340)) ([65b1b80](65b1b80))
* Pre-compute feature service ([8011550](8011550))
* REST API-backed UI for RBAC compatibility and per-page lazy loading ([#6414](#6414)) ([6ae80af](6ae80af))
* Support non-string map key types ([#6382](#6382)) ([#6383](#6383)) ([728aa2e](728aa2e))
* Update FeatureStore CRD with DRA Fields ([01241e4](01241e4))

### Performance Improvements

* Cache feature view resolution in get_online_features to reduce per-request overhead ([55c2f18](55c2f18))
* Optimize feature serving latency with batched async Redis, cached checks fix ([103809a](103809a))
* Replace MessageToDict with optimized custom dict builder ([#6015](#6015)) ([9902064](9902064))
To see the monitoring UI in action, visit PR #6422 and see the demo.
What this PR does / why we need it:
This PR introduces comprehensive feature quality monitoring capabilities to Feast, enabling proactive tracking of feature distributions and data quality metrics. Currently, Feast has no built-in tools for monitoring feature health in production — ML teams must build custom solutions to detect issues like distribution shifts, elevated null rates, or degraded data quality before they silently impact model performance.
What it adds:
Core Monitoring Engine
- … `OfflineStore` as the primary compute path, with a Python-based (PyArrow/NumPy) fallback for backends that don't implement native compute. This leverages the offline store as a compute engine (same architecture as Feast materialization).
- … `OfflineStore` backend itself (no separate monitoring database). Six static methods on the `OfflineStore` base class (`compute_monitoring_metrics`, `get_monitoring_max_timestamp`, `ensure_monitoring_tables`, `save_monitoring_metrics`, `query_monitoring_metrics`, `clear_monitoring_baseline`) handle compute and storage.
- … (`MetricsCalculator`): Backend-agnostic statistical computation as fallback, supporting:
  - … `PrimitiveFeastType` and `ValueType`

Multi-Backend Support (8 Offline Stores)

All 6 native monitoring methods implemented for each backend with dialect-specific SQL:

- `INSERT ON CONFLICT`; `PERCENTILE_CONT`, `WIDTH_BUCKET`
- `MERGE` with `VARIANT` JSON; `APPROX_PERCENTILE`, `WIDTH_BUCKET`
- `MERGE` into BQ tables; `APPROX_QUANTILES`, parameterized queries
- `MERGE` via Data API; `APPROXIMATE PERCENTILE_DISC`
- `PERCENTILE_APPROX`, `spark.sql()`
- `MERGE FROM DUAL`; `PERCENTILE_CONT WITHIN GROUP`
- `QUANTILE_CONT`, `HISTOGRAM`
- `pyarrow.compute` + `numpy`

Multi-Granularity Time-Series Metrics

- `daily`, `weekly`, `biweekly`, `monthly`, `quarterly`

Batch + Log Data Source Support

- … `batch_source` via `OfflineStore.pull_all_from_table_or_query()`
- … `FeatureService.logging_config` destination, using `__log_timestamp` as event timestamp
- … `driver_stats__conv_rate`) are parsed back to their original `feature_view_name` + `feature_name` for storage compatibility and drift detection
- `data_source_type` column (batch/log) differentiates metrics in storage

Orchestration Service (`MonitoringService`)

- … `OfflineStore` instance for performance

NaN/Inf Sanitization

- … `NaN`/`Inf` float values that break JSON serialization:
  - `opt_float()` in `monitoring_utils.py`: sanitizes at SQL result parsing
  - `_sanitize_floats()` in `monitoring_service.py`: final safety net on all API read paths
- … `Out of range float values are not JSON compliant: nan`

Shared Utilities (`monitoring_utils.py`)

- `monitoring_table_meta()`, `opt_float()`, `empty_numeric_metric()`, `empty_categorical_metric()`, `normalize_monitoring_row()`, `build_view_aggregate()`

DQM Job Engine (`DQMJobManager`)

- … `compute`, `baseline`, `auto_compute`)
- … `feast_monitoring_jobs` table
- … `set_baseline` to the compute engine

REST API (`/monitoring/`)

- `POST` `/monitoring/compute`
- `POST` `/monitoring/auto_compute`
- `POST` `/monitoring/compute/transient`
- `POST` `/monitoring/compute/log`
- `POST` `/monitoring/auto_compute/log`
- `GET` `/monitoring/jobs/{job_id}`
- `GET` `/monitoring/metrics/features`
- `GET` `/monitoring/metrics/feature_views`
- `GET` `/monitoring/metrics/feature_services`
- `GET` `/monitoring/metrics/baseline`
- `GET` `/monitoring/metrics/timeseries`

All endpoints support cascading filters: `project`, `feature_service_name`, `feature_view_name`, `feature_name`, `granularity`, `data_source_type`, date range.

RBAC enforced using existing `AuthzedAction.DESCRIBE` (read) and `AuthzedAction.UPDATE` (compute).

CLI (`feast monitor run`)

Auto-Baseline on `feast apply`

- … `feast apply`
- … `feature_store.yaml`:

Feast Operator Support

- `DataQualityMonitoringConfig` added to `FeatureStoreSpec`
- … `data_quality_monitoring` section in `feature_store.yaml` when config is set
- … `make generate`

Documentation

- `docs/how-to-guides/feature-monitoring.md`: Production setup, CLI usage, REST API reference, orchestrator integration (Airflow, KFP, cron, K8s CronJob), backend compatibility table
- `examples/monitoring/monitoring-quickstart.ipynb`: 12-step hands-on walkthrough with visualization examples
- `docs/SUMMARY.md` updated with links to both

Design decisions:

- … `OfflineStore` compute + storage: Each backend implements its own SQL push-down for metrics calculation and uses its native UPSERT/MERGE for storage. No separate monitoring database needed.
- … `/monitoring/` route rather than extending existing `/metrics/`: The existing metrics route serves registry inventory metadata; monitoring serves statistical feature quality data with a different data path.
- … `data_quality_monitoring` config: Sits alongside `materialization` and `openlineage` in `RepoConfig`, reflecting that it spans offline store compute/storage + apply trigger + server API.

Which issue(s) this PR fixes:
Partially Fixes #5919
Checks
- … `git commit -s`)

Testing Strategy

Test coverage (all passing):

- `test_metrics_calculator.py`
- `test_compute_correctness.py`
- `test_monitoring_integration.py`
- `repo_config_test.go`

Snyk SAST scan: 0 vulnerabilities across all new files.
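The NaN/Inf sanitization mentioned in the PR description can be sketched roughly as follows. The description names `opt_float()` and `_sanitize_floats()` but does not show their bodies, so the implementations below are assumptions written for illustration (the second helper is named `sanitize_floats` here to make clear it is not the PR's function). The idea is to replace non-finite floats with `None` so that strict JSON encoding does not fail.

```python
import json
import math
from typing import Any, Optional


def opt_float(value: Any) -> Optional[float]:
    """Convert a raw SQL result cell to a finite float, else None (assumed behavior)."""
    if value is None:
        return None
    try:
        result = float(value)
    except (TypeError, ValueError):
        return None
    return result if math.isfinite(result) else None


def sanitize_floats(payload: Any) -> Any:
    """Recursively replace NaN/Inf floats with None in dicts, lists, and tuples."""
    if isinstance(payload, float):
        return payload if math.isfinite(payload) else None
    if isinstance(payload, dict):
        return {key: sanitize_floats(val) for key, val in payload.items()}
    if isinstance(payload, (list, tuple)):
        return [sanitize_floats(val) for val in payload]
    return payload


if __name__ == "__main__":
    row = {"mean": float("nan"), "histogram": [1.0, float("inf")], "count": 3}
    # allow_nan=False makes json.dumps raise ValueError on NaN/Inf,
    # so this call only succeeds because the payload was sanitized first.
    print(json.dumps(sanitize_floats(row), allow_nan=False))
```

Sanitizing both at parse time and again on the read path, as the description outlines, means a value that slips past the first layer is still caught before serialization.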