Description
Is your feature request related to a problem? Please describe.
The Feature Server currently has limited built-in observability. While there is initial OpenTelemetry (OTEL) support, it does not expose standard service-level RED metrics (Rate, Errors, Duration) needed to operate the server in production.
Users need visibility into throughput, error rates, and latency for the core APIs:
- /get-online-features
- /retrieve-online-documents
- /push
- /write-to-online-store
- /materialize
- /materialize-incremental
Today this often requires custom middleware or external tooling.
Describe the solution you'd like
Extend Feature Server’s OTEL integration to emit standard RED metrics out of the box, with useful Feast-specific breakdowns.
Metrics per endpoint should include:
- Request rate/count
- Error count/rate (by HTTP status class, optionally basic error categories)
- Latency histograms (supporting p50/p95/p99)
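As a rough illustration of why bucketed histograms are enough for p50/p95/p99, here is a minimal stdlib-only sketch. The bucket bounds and the names `status_class`, `LatencyHistogram`, and `quantile_upper_bound` are assumptions for illustration, not existing Feast or OpenTelemetry APIs; in a real implementation an OpenTelemetry histogram instrument would do the bucketing and the metrics backend would estimate the percentiles.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds; real values would be configurable.
DEFAULT_BOUNDS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0)


def status_class(status_code: int) -> str:
    """Collapse an HTTP status code into a low-cardinality class such as '2xx'."""
    return f"{status_code // 100}xx"


class LatencyHistogram:
    """Fixed-bucket histogram; each bucket counts values <= its upper bound."""

    def __init__(self, bounds=DEFAULT_BOUNDS):
        self.bounds = sorted(bounds)
        # One extra slot at the end for values above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def record(self, seconds: float) -> None:
        self.counts[bisect_left(self.bounds, seconds)] += 1

    def quantile_upper_bound(self, q: float):
        """Upper bound of the bucket holding the q-th quantile (None if empty)."""
        total = sum(self.counts)
        if total == 0:
            return None
        target = q * total
        running = 0
        for index, count in enumerate(self.counts):
            running += count
            if running >= target:
                if index < len(self.bounds):
                    return self.bounds[index]
                return float("inf")  # quantile fell in the overflow bucket
        return float("inf")
```

The estimate is only as precise as the bucket layout, which is why the bounds need to match the latencies each endpoint actually produces.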
Additional breakdowns:
- Break out metrics by Feature Service name (and Feature View name where applicable/available), with safeguards to limit label cardinality (e.g., allowlist and/or config flags; default off if needed).
- Segment latency by number of features requested using configurable bins (e.g., 1–10, 11–50, 51–200, 201+).
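The two breakdowns above could be derived with small helpers along these lines. This is only a sketch: `bounded_label`, `feature_count_bin`, the `"other"` fallback value, and the default edges are hypothetical names and choices, and the bin labels use an ASCII hyphen so they stay easy to query.

```python
from bisect import bisect_left

# Upper edges of the example bins; the last bin (201+) is open-ended.
DEFAULT_BIN_EDGES = (10, 50, 200)


def feature_count_bin(feature_count: int, edges=DEFAULT_BIN_EDGES) -> str:
    """Map a feature count to a bin label such as '11-50' or '201+'."""
    edges = sorted(edges)
    index = bisect_left(edges, feature_count)
    lower = 1 if index == 0 else edges[index - 1] + 1
    if index == len(edges):
        return f"{lower}+"
    return f"{lower}-{edges[index]}"


def bounded_label(value, allowlist, enabled=True, fallback="other"):
    """Return a label value that cannot exceed len(allowlist) + 1 distinct values.

    Returns None when the breakdown is disabled or the value is missing,
    meaning the caller should omit the label entirely.
    """
    if not enabled or value is None:
        return None
    return value if value in allowlist else fallback
```

With an allowlist of known Feature Service names, any unlisted name collapses into a single fallback series, so the number of time series per metric stays bounded regardless of what clients send.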
Implementation note (based on current feature_server.py structure):
- Add a FastAPI middleware to record RED metrics for every request (keyed by endpoint + status_class).
- Populate Feast-specific labels (feature_service / feature_view / feature_count_bin, etc.) inside the request handlers (since they come from the parsed request body) and attach them via request.state for the middleware to include.
- For long-running endpoints (/materialize*), ensure duration histograms support multi-second/minute latencies.
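The middleware and handler split described above might look roughly like the following. It is written as a pure ASGI middleware using only the standard library so it can be read in isolation; `RedMetricsMiddleware`, the `record` callback, the `red_labels` key, and `fake_handler` are all hypothetical. It also assumes that Starlette backs `request.state` with the `scope["state"]` dict (as recent versions do), so a handler setting `request.state.red_labels` would be visible here. In a real implementation `record` would forward to OpenTelemetry counter and histogram instruments.

```python
import asyncio
import time


class RedMetricsMiddleware:
    """Pure ASGI middleware that reports labels and duration for each HTTP request."""

    def __init__(self, app, record):
        self.app = app
        self.record = record  # hypothetical callback: record(labels, seconds)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        status = 500  # assumed when the app raises before sending a response

        async def send_wrapper(message):
            nonlocal status
            if message["type"] == "http.response.start":
                status = message["status"]
            await send(message)

        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            seconds = time.perf_counter() - start
            # Raw paths are fine for a fixed endpoint set; unmatched paths
            # would need normalizing to keep cardinality bounded.
            labels = {
                "endpoint": scope["path"],
                "status_class": f"{status // 100}xx",
            }
            # Extra labels attached by the handler after parsing the body.
            labels.update(scope.get("state", {}).get("red_labels", {}))
            self.record(labels, seconds)


async def fake_handler(scope, receive, send):
    """Stand-in for a request handler that attaches Feast-specific labels."""
    scope.setdefault("state", {})["red_labels"] = {"feature_count_bin": "11-50"}
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"{}"})


def run_demo():
    """Drive one fake request through the middleware and return what was recorded."""
    recorded = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        pass

    app = RedMetricsMiddleware(
        fake_handler, lambda labels, seconds: recorded.append((labels, seconds))
    )
    scope = {"type": "http", "method": "POST", "path": "/get-online-features"}
    asyncio.run(app(scope, receive, send))
    return recorded
```

Because the duration is passed to a single callback, that callback could route /materialize* requests to a histogram with wider bucket bounds than the one used for online reads.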
Optionally expose basic internal timings via tracing spans and/or additional histograms:
- online store read/write duration
- on-demand transformation execution duration
- materialization step duration
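Internal timings like these could be captured with a small context manager wrapped around each step. The sketch below is stdlib-only; `timed_step`, the step names, and the `record` callback are hypothetical, and in practice the callback would likely forward to an OpenTelemetry histogram or be replaced by a tracing span.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_step(name, record, clock=time.perf_counter):
    """Time the wrapped block and report it, even if the block raises."""
    start = clock()
    try:
        yield
    finally:
        record(name, clock() - start)


def example_usage():
    """Time a placeholder step and return the recorded (name, seconds) pairs."""
    timings = []
    with timed_step("online_store_read", lambda n, s: timings.append((n, s))):
        sum(range(1000))  # placeholder for the actual store read
    return timings
```

Reporting in a `finally` block means failed reads and writes are still timed, which keeps the duration data consistent with the error counts.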
Metrics should use native OpenTelemetry APIs and work with common OTEL collectors/exporters (e.g., Prometheus, OpenTelemetry Collector, Grafana).
Describe alternatives you've considered
- Custom HTTP middleware
- Service mesh / proxy metrics (these miss Feast-specific context such as Feature Service, Feature View, and feature count)
- Manual instrumentation
Additional context
Standard RED metrics with Feast-aware labels would significantly improve Feature Server operability and align Feast with common production observability practices, while keeping label cardinality under control.