Improve Feature Server Observability #5920

@franciscojavierarceo

Is your feature request related to a problem? Please describe.

The Feature Server currently has limited built-in observability. While there is initial OpenTelemetry (OTEL) support, it does not expose standard service-level RED metrics (Rate, Errors, Duration) needed to operate the server in production.

Users need visibility into throughput, error rates, and latency for the core APIs:

  • /get-online-features
  • /retrieve-online-documents
  • /push
  • /write-to-online-store
  • /materialize
  • /materialize-incremental

Today this often requires custom middleware or external tooling.

Describe the solution you'd like

Extend Feature Server’s OTEL integration to emit standard RED metrics out of the box, with useful Feast-specific breakdowns.

Metrics per endpoint should include:

  • Request rate/count
  • Error count/rate (by HTTP status class, optionally basic error categories)
  • Latency histograms (supporting p50/p95/p99)
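To make the histogram requirement concrete, here is a rough sketch of how a backend can estimate p50/p95/p99 from bucket counts by linear interpolation, similar in spirit to what Prometheus does for histogram quantiles. The bucket bounds and the helper names (`status_class`, `observe`, `estimate_quantile`) are made up for illustration and are not part of Feast or OpenTelemetry.

```python
from bisect import bisect_left
from typing import List, Optional

# Hypothetical bucket upper bounds in seconds. One extra overflow bucket
# (index len(BOUNDS)) holds everything slower than the last bound.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]


def status_class(status_code: int) -> str:
    """Collapse an HTTP status code into a low-cardinality class like '2xx'."""
    return f"{status_code // 100}xx"


def observe(counts: List[int], seconds: float) -> None:
    """Increment the first bucket whose upper bound is >= seconds."""
    counts[bisect_left(BOUNDS, seconds)] += 1


def estimate_quantile(counts: List[int], q: float) -> Optional[float]:
    """Estimate quantile q (0..1) by interpolating inside the matching bucket."""
    total = sum(counts)
    if total == 0:
        return None
    rank = q * total
    cumulative = 0
    for i, count in enumerate(counts):
        if count > 0 and cumulative + count >= rank:
            if i == len(BOUNDS):
                # Overflow bucket has no upper bound, so report the last
                # finite bound as a floor on the true value.
                return BOUNDS[-1]
            lower = BOUNDS[i - 1] if i > 0 else 0.0
            upper = BOUNDS[i]
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative += count
    return BOUNDS[-1]
```

The accuracy of the estimate depends entirely on where the bucket boundaries sit relative to real latencies, which is why the default boundaries matter for each endpoint.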

Additional breakdowns:

  • Break out metrics by Feature Service name (and Feature View name where applicable/available), with safeguards to limit label cardinality (e.g., allowlist and/or config flags; default off if needed).
  • Segment latency by number of features requested using configurable bins (e.g., 1–10, 11–50, 51–200, 201+).
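A minimal sketch of both safeguards, assuming the example bins above and a simple allowlist. The names `feature_count_bin` and `bounded_label`, and the `"other"` fallback value, are hypothetical and only meant to illustrate the idea.

```python
from bisect import bisect_left
from typing import Collection, Optional

# Inclusive upper bounds for the example bins: 1-10, 11-50, 51-200, 201+.
# In a real implementation these would come from configuration.
BIN_UPPER_BOUNDS = [10, 50, 200]
BIN_LABELS = ["1-10", "11-50", "51-200", "201+"]


def feature_count_bin(feature_count: int) -> str:
    """Map a requested-feature count to its bin label.

    Counts below 1 fall into the first bin in this sketch.
    """
    return BIN_LABELS[bisect_left(BIN_UPPER_BOUNDS, feature_count)]


def bounded_label(
    value: Optional[str],
    allowlist: Collection[str],
    fallback: str = "other",
) -> str:
    """Return value only if it is allowlisted, otherwise a fixed fallback.

    This caps the number of distinct label values at len(allowlist) + 1.
    """
    if value is not None and value in allowlist:
        return value
    return fallback
```

For example, `feature_count_bin(37)` gives `"11-50"`, and a Feature Service name that is not on the allowlist is reported as `"other"` rather than creating a new time series.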

Implementation note (based on current feature_server.py structure):

  • Add a FastAPI middleware to record RED metrics for every request (keyed by endpoint + status_class).
  • Populate Feast-specific labels (feature_service / feature_view / feature_count_bin, etc.) inside the request handlers (since they come from the parsed request body) and attach them via request.state for the middleware to include.
  • For long-running endpoints (/materialize*), ensure duration histograms support multi-second/minute latencies.
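The middleware part could look roughly like the sketch below. It is written as a plain ASGI middleware so it has no dependencies. `RedMetricsMiddleware`, the `record` callback, and the `metric_labels` key are hypothetical names. It also assumes handlers set `request.state.metric_labels = {...}` and that `request.state` is backed by `scope["state"]`, as it is in recent Starlette releases.

```python
import time
from typing import Any, Awaitable, Callable, Dict

# Hypothetical callback: receives the label set and the duration in seconds.
Recorder = Callable[[Dict[str, str], float], None]
ASGIApp = Callable[..., Awaitable[None]]


class RedMetricsMiddleware:
    """Times every HTTP request and reports endpoint, status class, duration."""

    def __init__(self, app: ASGIApp, record: Recorder) -> None:
        self.app = app
        self.record = record

    async def __call__(self, scope: Dict[str, Any], receive, send) -> None:
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        # Shared dict that handlers can fill with Feast-specific labels.
        state = scope.setdefault("state", {})
        # Assume a server error unless the app actually starts a response.
        status = {"code": 500}

        async def send_wrapper(message: Dict[str, Any]) -> None:
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            labels = {
                # Raw path is fine here because the server's routes are fixed;
                # parameterized routes would need the route template instead.
                "endpoint": scope["path"],
                "status_class": f"{status['code'] // 100}xx",
            }
            labels.update(state.get("metric_labels", {}))
            self.record(labels, time.perf_counter() - start)
```

It would be registered with something like `app.add_middleware(RedMetricsMiddleware, record=...)`, where `record` wraps an OTEL counter (`add(1, attributes)`) and histogram (`record(seconds, attributes)`). For `/materialize*`, that histogram would need bucket boundaries reaching into minutes rather than the sub-second defaults suited to online reads.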

Optionally expose basic internal timings via tracing spans and/or additional histograms:

  • online store read/write duration
  • on-demand transformation execution duration
  • materialization step duration
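For the internal timings, a small context manager is probably enough as a starting point. The sketch below uses only the standard library and a plain dict as the sink. The `timed` helper and the step names are hypothetical, and in practice the sink would be an OTEL histogram or a tracing span instead.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


@contextmanager
def timed(step: str, sink: Dict[str, List[float]]) -> Iterator[None]:
    """Record how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink.setdefault(step, []).append(time.perf_counter() - start)


# Usage: wrap the internal step that should be measured.
timings: Dict[str, List[float]] = {}
with timed("online_store_read", timings):
    pass  # the online store read would happen here
```

Recording in `finally` means failed reads and transformations still show up in the duration data, which keeps the timing series consistent with the error counts.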

Metrics should use native OpenTelemetry APIs and work with common OTEL collectors, exporters, and backends (e.g., the OpenTelemetry Collector, Prometheus, Grafana).

Describe alternatives you've considered

  • Custom HTTP middleware
  • Service mesh / proxy metrics (these miss Feast-specific context such as Feature Service, Feature View, and feature count)
  • Manual instrumentation

Additional context

Standard RED metrics with Feast-aware labels would significantly improve Feature Server operability and align Feast with common production observability practices, while keeping label cardinality under control.
