Overview
In brief, modern applications should provide a richer way to troubleshoot problems than trawling through log data. Logs record what happened (an event), but lack other dimensions such as how long something took and what triggered it. Logging should be a last resort when troubleshooting; metrics and trace data should be used first, and these can link to the applicable logs to reduce noise and simplify troubleshooting. Metrics help answer questions known ahead of time, while trace data can go further and answer questions that only come up later.
This epic is about implementing tracing and metrics specifically for webhooks (but with the intention of clearing the path to easily instrument the operators, expanding the troubleshooting capabilities).
OpenTelemetry SDKs and OTLP will be used, but this should be explained further in the Improve Observability initiative ticket (currently in a non-public repository).
The diagram below gives a high-level overview of where the various telemetry data can end up, and will likely become a stack/demo to aid in development and eventually assist Stackable users in getting set up.
Part 1
This is the library side implementation, and does not cover actual operator implementations.
### Tasks
- [ ] https://github.com/stackabletech/demos/pull/35
- [x] Instrument the webhook handlers with [`#[tracing::instrument]`](https://docs.rs/tracing/latest/tracing/attr.instrument.html) and [`tracing::debug!(...)`](https://docs.rs/tracing/latest/tracing/macro.debug.html#examples) (operator-rs) https://github.com/stackabletech/operator-rs/pull/758
- [x] Create tracing subscriber initialization helpers (operator-rs) https://github.com/stackabletech/operator-rs/pull/758
- [ ] https://github.com/stackabletech/operator-rs/pull/767
- [ ] https://github.com/stackabletech/operator-rs/pull/811
- [ ] https://github.com/stackabletech/operator-rs/pull/796
- [ ] https://github.com/stackabletech/operator-rs/pull/801
- [ ] https://github.com/stackabletech/operator-rs/pull/815
Acceptance Criteria
- Reusable code for when we do the same to operators.
- Using Semantic Conventions (even if done by a library) for span fields.
- Uses OTLP as the transport (prefer gRPC over HTTP, avoid JSON serialization).
- Instrument something (dummy webhook?) and see traces in Jaeger.
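
To make the Semantic Conventions criterion more concrete, here is a small standard-library-only sketch that maps request data to attribute names from the OpenTelemetry HTTP semantic conventions (`http.request.method`, `url.path`, `http.response.status_code`). The `RequestInfo` struct and `span_fields` function are hypothetical names used only for illustration. In practice a library would most likely record these fields on the span directly.

```rust
use std::collections::BTreeMap;

// Hypothetical request summary; the names are illustrative only.
struct RequestInfo<'a> {
    method: &'a str,
    path: &'a str,
    status: u16,
}

// Build span fields keyed by OpenTelemetry HTTP semantic convention names,
// so that any backend can interpret them without custom mapping.
fn span_fields(req: &RequestInfo<'_>) -> BTreeMap<&'static str, String> {
    let mut fields = BTreeMap::new();
    fields.insert("http.request.method", req.method.to_string());
    fields.insert("url.path", req.path.to_string());
    fields.insert("http.response.status_code", req.status.to_string());
    fields
}

fn main() {
    let req = RequestInfo { method: "POST", path: "/convert", status: 200 };
    for (key, value) in span_fields(&req) {
        println!("{key}={value}");
    }
}
```

Using the shared attribute names rather than ad hoc ones such as `method` or `status` is what lets the same code be reused when the operators are instrumented later.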
Part 2
Moved to https://github.com/stackabletech/issues/issues/598
Part 3
Plan to implement OpenTelemetry Metrics Provider for Operators (Prometheus, and/or OTLP export).
