Improve Feature Server Observability

**Is your feature request related to a problem? Please describe.**

The Feature Server currently has limited built-in observability. While there is initial OpenTelemetry (OTEL) support, it does not expose standard service-level RED metrics (Rate, Errors, Duration) needed to operate the server in production.

Users need visibility into throughput, error rates, and latency for the core APIs:

- `/get-online-features`
- `/retrieve-online-documents`
- `/push`
- `/write-to-online-store`
- `/materialize`
- `/materialize-incremental`

Today, getting this visibility typically requires custom middleware or external tooling.

**Describe the solution you'd like**

Extend the Feature Server's OTEL integration to emit standard RED metrics out of the box, with useful Feast-specific breakdowns.

Metrics per endpoint should include:

- **Request rate/count**
- **Error count/rate** (by HTTP status class such as `2xx`, `4xx`, `5xx`, and optionally by basic error category)
- **Latency histograms** (supporting p50/p95/p99)
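
Percentiles such as p50/p95/p99 are normally estimated from histogram bucket counts, so the bucket boundaries determine how precise they can be. The sketch below illustrates that estimation with linear interpolation inside a bucket. The boundaries in `BOUNDS` and the helper names are hypothetical and are not taken from Feast or from any OTEL default.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds (illustrative only).
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]


def new_counts():
    """One counter per bucket, plus a final overflow bucket."""
    return [0] * (len(BOUNDS) + 1)


def observe(counts, value):
    """Count `value` in the first bucket whose upper bound is >= value."""
    counts[bisect_left(BOUNDS, value)] += 1


def estimate_quantile(counts, q):
    """Estimate the q-quantile (0 < q <= 1) from bucket counts.

    Assumes observations are spread evenly inside each bucket, which is
    the usual approximation made by histogram-based percentile queries.
    """
    total = sum(counts)
    if total == 0:
        return None
    rank = q * total
    seen = 0
    for i, count in enumerate(counts):
        if count and seen + count >= rank:
            if i == len(BOUNDS):
                # Overflow bucket has no upper bound; report the last bound.
                return BOUNDS[-1]
            lower = BOUNDS[i - 1] if i > 0 else 0.0
            return lower + (BOUNDS[i] - lower) * (rank - seen) / count
        seen += count
    return BOUNDS[-1]
```

With boundaries like these, any request slower than 5 seconds lands in the overflow bucket and its percentile is capped at the last bound, which is why bucket choice matters for slow endpoints.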

Additional breakdowns:

- Break out metrics by **Feature Service name** (and **Feature View name** where applicable/available), with safeguards to limit label cardinality (e.g., allowlist and/or config flags; default off if needed).
- Segment latency by **number of features requested** using configurable bins (e.g., `1–10`, `11–50`, `51–200`, `201+`).
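
As a rough illustration of the two breakdowns above, the following sketch maps a feature count to a bin label and passes a Feature Service name through an allowlist so the label set stays bounded. All names here (`feature_count_bin`, `bounded_label`, the `"other"` fallback) are hypothetical, and the bin labels use an ASCII hyphen rather than the en dash shown in the example bins.

```python
from bisect import bisect_left

# Inclusive upper bounds matching the example bins 1-10, 11-50, 51-200, 201+.
# A real implementation would presumably read these from configuration.
DEFAULT_BIN_UPPER_BOUNDS = (10, 50, 200)


def feature_count_bin(n, upper_bounds=DEFAULT_BIN_UPPER_BOUNDS):
    """Return the bin label for a request asking for `n` features."""
    if n < 1:
        return "0"
    i = bisect_left(upper_bounds, n)  # index of first bound >= n
    lower = upper_bounds[i - 1] + 1 if i > 0 else 1
    if i == len(upper_bounds):
        return f"{lower}+"  # open-ended last bin
    return f"{lower}-{upper_bounds[i]}"


def bounded_label(value, allowlist, enabled=True, fallback="other"):
    """Limit label cardinality for names such as a Feature Service.

    Returns None when the breakdown is disabled or no value is known
    (the caller should then omit the label), the value itself when it is
    allowlisted, and a shared fallback bucket otherwise.
    """
    if not enabled or value is None:
        return None
    return value if value in allowlist else fallback
```

For example, `feature_count_bin(37)` yields `"11-50"`, and a Feature Service that is not on the allowlist is reported under the shared `"other"` label instead of creating a new time series.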

Implementation notes (based on the current `feature_server.py` structure):
- Add a FastAPI middleware that records RED metrics for every request, keyed by `endpoint` and `status_class`.
- Populate Feast-specific labels (`feature_service`, `feature_view`, `feature_count_bin`, etc.) inside the request handlers, since they come from the parsed request body, and attach them to `request.state` so the middleware can include them.
- For long-running endpoints (`/materialize*`), ensure duration histograms support multi-second/minute latencies.
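
The notes above could be sketched as a dependency-free ASGI middleware, which FastAPI can mount like any other middleware. This is a minimal sketch under stated assumptions, not the proposed implementation: `REDMetricsMiddleware`, the `record` callback, and the `feast_metric_labels` key are hypothetical names, and it relies on Starlette backing `request.state` with `scope["state"]`.

```python
import time


class REDMetricsMiddleware:
    """Record one duration observation per HTTP request.

    `record(duration_seconds, attributes)` is a hypothetical callback.
    In a real integration it could forward to an OTEL histogram.
    """

    def __init__(self, app, record):
        self.app = app
        self.record = record

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass lifespan/websocket traffic through untouched.
            await self.app(scope, receive, send)
            return

        # Handlers can attach labels here (via request.state in FastAPI,
        # assuming request.state is backed by scope["state"]).
        scope.setdefault("state", {})
        status = {"code": 500}  # assumed if the app fails before responding

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            duration = time.perf_counter() - start
            attributes = {
                # Raw path; fine for fixed routes, but unknown paths
                # should be collapsed in practice to bound cardinality.
                "endpoint": scope["path"],
                "status_class": f"{status['code'] // 100}xx",
            }
            attributes.update(scope["state"].get("feast_metric_labels", {}))
            self.record(duration, attributes)
```

A handler would then set something like `request.state.feast_metric_labels = {"feature_count_bin": "11-50"}` after parsing the body, and the app would register the middleware with `app.add_middleware(REDMetricsMiddleware, record=...)`. Recording in a `finally` block means failed requests are still counted, with the status class defaulting to `5xx`.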

Optionally expose basic internal timings via tracing spans and/or additional histograms:
- online store read/write duration
- on-demand transformation execution duration
- materialization step duration
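
One possible shape for these internal timings is a small context manager wrapped around each step. The helper below is a hypothetical sketch: `timed_step` and its `record` callback are illustrative names, and a real integration would likely forward the measurement to an OTEL histogram or a tracing span instead.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_step(name, record, clock=time.perf_counter):
    """Time the wrapped block and report it as (name, seconds, outcome).

    `record` is a hypothetical callback; `clock` is injectable so the
    helper can be tested without real delays.
    """
    start = clock()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        raise  # never swallow the caller's exception
    finally:
        record(name, clock() - start, outcome)
```

Usage would look like `with timed_step("online_store_read", record): ...` around the store call, so a failed read is still timed and tagged with `outcome="error"`.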

Metrics should use native OpenTelemetry APIs and work with common OTEL-compatible pipelines and backends (e.g., OpenTelemetry Collector, Prometheus, Grafana).

**Describe alternatives you've considered**

- Custom HTTP middleware
- Service mesh / proxy metrics (these miss Feast-specific context such as Feature Service, Feature View, and feature count)
- Manual instrumentation

**Additional context**

Standard RED metrics with Feast-aware labels would significantly improve Feature Server operability and align Feast with common production observability practices, while keeping label cardinality under control.


Improve Feature Server Observability #5920
