# Abstract ClickHouse usage behind a pluggable OLAP/Observability interface #5679
Proposal: Pluggable OLAP Backend for TensorZero Observability (ClickHouse + StarRocks)

**Author:** @blarghmatey
**Date:** 2026-01-16
**Status:** Draft / Request for Feedback
**Scope:** `tensorzero/tensorzero` (core + gateway + docs/examples)

## Summary
TensorZero currently uses ClickHouse as the storage engine for its observability and analytics data (e.g., inference logs, feedback, and usage metrics). This proposal suggests abstracting that storage behind a pluggable interface. The goal is to support additional OLAP engines for the same use cases where ClickHouse is employed today, without materially changing the user-facing behavior or observability features.
I’m opening this to get maintainers’ feedback on whether this direction is aligned with the project’s roadmap and how you’d like it to be structured.
## Motivation

### 1. Deployment flexibility

Many teams already have an OLAP engine in place (such as StarRocks) and would prefer to reuse it rather than add and operate a separate ClickHouse cluster. Supporting multiple OLAP options lowers the operational barrier to adopting TensorZero's observability features.
### 2. Future extensibility

Right now, ClickHouse is deeply baked into the codebase and documentation:

- `tensorzero-core/src/db/clickhouse/**` (connection info, migrations, tests).
- Docs (`docs/deployment/clickhouse.mdx`, `docs/quickstart.mdx`) that assume `TENSORZERO_CLICKHOUSE_URL`.
- Examples that pass `clickhouse_url=...` and notebooks that import `clickhouse_connect`.

If we introduce a clear abstraction around "observability storage" now, it becomes much easier to add or swap backends later without invasive changes across the core, gateway, and docs.
### 3. StarRocks specifically

StarRocks is an MPP, columnar OLAP engine with relatively familiar SQL semantics. It's commonly used for analytical workloads similar to what TensorZero's observability layer provides, which makes it a natural candidate for this kind of backend. The main differences are in DDL, function sets, and ingestion patterns, which can be hidden behind a Rust abstraction layer.
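To make the dialect gap concrete, here is a rough side-by-side of what one logical table could look like in each dialect, embedded as Rust string constants the way a migration might carry them. The schema below is purely illustrative; it is not TensorZero's actual table layout:

```rust
// Hypothetical DDL pair for the same logical table, illustrating the kind
// of dialect differences a backend abstraction would need to hide.
// This is NOT TensorZero's real schema.
const CLICKHOUSE_DDL: &str = r#"
CREATE TABLE IF NOT EXISTS inference_events (
    id UUID,
    function_name String,
    created_at DateTime64(3),
    payload String
)
ENGINE = MergeTree()
ORDER BY (function_name, created_at)
"#;

const STARROCKS_DDL: &str = r#"
CREATE TABLE IF NOT EXISTS inference_events (
    id VARCHAR(36),            -- StarRocks has no native UUID type
    function_name VARCHAR(255),
    created_at DATETIME,
    payload STRING
)
DUPLICATE KEY (id)             -- append-only log semantics
DISTRIBUTED BY HASH(id)
"#;
```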
## Current ClickHouse Usage (High-Level)

Based on a repo scan, ClickHouse is used in the following ways:

- **Rust core (`tensorzero-core`):**
  - `ClickHouseConnectionInfo` and related code under `tensorzero-core/src/db/clickhouse`.
  - Migrations under `tensorzero-core/src/db/clickhouse/migration_manager/migrations/` (e.g., `Migration0034<'a> { pub clickhouse: &'a ClickHouseConnectionInfo }`).
  - Test helpers in `tensorzero-core/src/db/clickhouse/test_helpers.rs` (e.g., `get_clickhouse()` and `get_clickhouse_replica()`).
- **Gateway / Python clients:**
  - `TensorZeroGateway.build_embedded` and `AsyncTensorZeroGateway.build_embedded` accept a `clickhouse_url` argument.
  - `patch_openai_client(..., clickhouse_url=...)` is used to integrate with the Gateway from the OpenAI client.
  - `examples/**` assumes ClickHouse is the only OLAP backend.
- **Docs & deployment:**
  - `docs/deployment/clickhouse.mdx` describes ClickHouse as the observability engine.
  - `docs/quickstart.mdx`, Docker Compose, and Helm examples configure `TENSORZERO_CLICKHOUSE_URL` and show ClickHouse deployments explicitly.

Functionally, the observability layer is a write path for inference logs, feedback, and usage metrics, plus an analytical read path over that data. The key observation is that this path is not currently abstracted as a generic "observability store" – it is specifically "ClickHouse".
## Proposal

### 1. Introduce a generic Observability Store abstraction (Rust)

Define a Rust-level trait that captures what the Gateway and UI need from the observability store, abstracting away ClickHouse specifics.

High-level sketch:
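The following is illustrative only: none of these types exist in the codebase today, the method set would be derived from what the gateway and UI actually call, and it assumes the `async_trait` crate:

```rust
use std::sync::Arc;
use async_trait::async_trait;

// Placeholder types for illustration only; the real row/query types would
// mirror TensorZero's existing ClickHouse schema.
pub struct InferenceRow;
pub struct FeedbackRow;
pub struct ObservabilityQuery;
pub struct QueryResult;
type Error = Box<dyn std::error::Error + Send + Sync>;

/// What the gateway and UI need from the observability store, independent
/// of which OLAP engine backs it.
#[async_trait]
pub trait ObservabilityStore: Send + Sync {
    /// Run any pending schema migrations for this backend.
    async fn run_migrations(&self) -> Result<(), Error>;

    /// Batched write path for inference logs.
    async fn write_inferences(&self, rows: Vec<InferenceRow>) -> Result<(), Error>;

    /// Batched write path for feedback.
    async fn write_feedback(&self, rows: Vec<FeedbackRow>) -> Result<(), Error>;

    /// Analytical read path used by the UI and evaluations.
    async fn query(&self, request: ObservabilityQuery) -> Result<QueryResult, Error>;
}

/// Higher-level components would hold a trait object rather than a
/// concrete ClickHouse client.
pub struct Gateway {
    store: Arc<dyn ObservabilityStore>,
}
```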
Then:

- Implement `ObservabilityStore` for the existing ClickHouse implementation (wrapping `ClickHouseConnectionInfo` and the current migration manager).
- Add a `StarRocksStore` that also implements `ObservabilityStore`, with its own connection handling, DDL, and query dialect.
- Higher-level components (gateway, UI) would depend on `dyn ObservabilityStore` instead of directly on `ClickHouseConnectionInfo`.
### 2. Migration and schema abstraction

Today, migrations live in `tensorzero-core/src/db/clickhouse/migration_manager`. They are very ClickHouse-specific (engines, replication config, views, etc.). To support multiple backends, we could:

- Factor out a minimal `MigrationBackend` interface (sketched below).
- Keep the existing ClickHouse migrations as-is but drive them through a `ClickHouseMigrationBackend` that implements `MigrationBackend`.
- Add a parallel set of StarRocks migrations.
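A minimal sketch of that interface, with placeholder method names (the real set would mirror what the current migration manager already needs):

```rust
use async_trait::async_trait;

type Error = Box<dyn std::error::Error + Send + Sync>;

/// Backend-agnostic migration interface (sketch). Existing ClickHouse
/// migrations would keep their logic behind a `ClickHouseMigrationBackend`;
/// a StarRocks backend would supply its own DDL.
#[async_trait]
pub trait MigrationBackend: Send + Sync {
    /// Stable identifier, e.g. "0034".
    fn id(&self) -> &'static str;

    /// Whether this migration still needs to run against the target database.
    async fn should_apply(&self) -> Result<bool, Error>;

    /// Apply the backend-specific DDL.
    async fn apply(&self) -> Result<(), Error>;
}
```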
Alternate approach (if you prefer less duplication): define a shared logical schema once and have each backend emit its own DDL from it.
I’d appreciate guidance on which approach you’d prefer: duplicated but explicit per‑backend migrations vs. a shared logical schema that emits different DDL per backend.
### 3. Runtime backend selection

Introduce a generic configuration story:

- New environment variables (names are up for discussion):
  - `TENSORZERO_OBSERVABILITY_BACKEND=clickhouse|starrocks`
  - `TENSORZERO_OBSERVABILITY_URL=...`
- Maintain backward compatibility: if `TENSORZERO_OBSERVABILITY_BACKEND` is unset but `TENSORZERO_CLICKHOUSE_URL` is present, infer `ObservabilityEngine::ClickHouse` with `ObservabilityConfig.url = TENSORZERO_CLICKHOUSE_URL`.
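A sketch of how that resolution could look; `ObservabilityEngine` and `ObservabilityConfig` are hypothetical names from this proposal, and error handling is simplified:

```rust
/// Hypothetical config model for this proposal.
#[derive(Debug, Clone)]
pub enum ObservabilityEngine {
    ClickHouse,
    StarRocks,
}

#[derive(Debug, Clone)]
pub struct ObservabilityConfig {
    pub engine: ObservabilityEngine,
    pub url: String,
}

/// Resolve the backend from env vars; returns `Ok(None)` when observability
/// is not configured at all.
pub fn resolve_observability_config() -> Result<Option<ObservabilityConfig>, String> {
    let backend = std::env::var("TENSORZERO_OBSERVABILITY_BACKEND").ok();
    match backend.as_deref() {
        Some("clickhouse") | Some("starrocks") => {
            let engine = match backend.as_deref() {
                Some("starrocks") => ObservabilityEngine::StarRocks,
                _ => ObservabilityEngine::ClickHouse,
            };
            let url = std::env::var("TENSORZERO_OBSERVABILITY_URL")
                .map_err(|_| "TENSORZERO_OBSERVABILITY_URL must be set".to_string())?;
            Ok(Some(ObservabilityConfig { engine, url }))
        }
        Some(other) => Err(format!("unknown observability backend: {other}")),
        // Backward compatibility: fall back to the existing ClickHouse env var.
        None => Ok(std::env::var("TENSORZERO_CLICKHOUSE_URL").ok().map(|url| {
            ObservabilityConfig {
                engine: ObservabilityEngine::ClickHouse,
                url,
            }
        })),
    }
}
```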
In the Python helpers and gateway constructors:

- Add new parameters such as `observability_backend` and `observability_url` (final names up for discussion).
- Continue to support the existing `clickhouse_url` for backward compatibility (and translate it into the new config model under the hood).

This way, existing deployments that only set `TENSORZERO_CLICKHOUSE_URL` continue to work unchanged, while new users can explicitly opt into StarRocks or other backends.

### 4. StarRocks backend implementation (initial scope)
For a first pass, the StarRocks backend could aim for feature equivalence for core observability: inference logs, feedback, and the queries the UI and evaluations depend on.

Development scope:

- Implement `ObservabilityStore` (and the migration interface) for StarRocks.
- Add deployment documentation alongside `docs/deployment/clickhouse.mdx`.

We could start with a "minimal viable backend": just enough schema and queries to support core observability, expanding toward feature parity over time.
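As a sketch of where that could start, building on the `ObservabilityStore` trait above: StarRocks speaks the MySQL wire protocol for queries, so `sqlx` with its MySQL driver is one plausible client choice, though that is an assumption rather than a decision, and every name here is hypothetical:

```rust
use async_trait::async_trait;

/// Illustrative skeleton only.
pub struct StarRocksStore {
    pool: sqlx::MySqlPool,
}

#[async_trait]
impl ObservabilityStore for StarRocksStore {
    async fn run_migrations(&self) -> Result<(), Error> {
        // Would drive StarRocks-specific DDL through the MigrationBackend
        // interface sketched in section 2.
        todo!()
    }

    async fn write_inferences(&self, _rows: Vec<InferenceRow>) -> Result<(), Error> {
        // Plain INSERTs over the MySQL protocol would do for a prototype;
        // StarRocks' Stream Load HTTP API is the usual path for high-volume
        // ingestion and could replace this later.
        todo!()
    }

    async fn write_feedback(&self, _rows: Vec<FeedbackRow>) -> Result<(), Error> {
        todo!()
    }

    async fn query(&self, _request: ObservabilityQuery) -> Result<QueryResult, Error> {
        todo!()
    }
}
```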
Data migration tooling from ClickHouse to StarRocks (for existing users) could be considered separately.
### 5. Documentation and examples

To keep the docs coherent and approachable, we could add a StarRocks deployment guide (parallel to `docs/deployment/clickhouse.mdx`) while keeping ClickHouse documented as the default backend.

Recipes and notebooks that directly use `clickhouse_connect` and ClickHouse SQL could, at least initially, remain ClickHouse-specific, with notes indicating that they assume a ClickHouse backend. In a later phase, they could be refactored to use a small higher-level query API that is backend-agnostic.

## Backward Compatibility
The primary concerns and mitigations:
- **Existing env vars:**
  - Keep supporting `TENSORZERO_CLICKHOUSE_URL` exactly as today.
  - Introduce the `TENSORZERO_OBSERVABILITY_*` variables as an optional configuration layer.
- **Existing Python APIs:**
  - Keep the existing `clickhouse_url` arguments working.
  - Add `observability_*` arguments once that API is stable.
- **Existing migrations and tests:**
  - Leave the current ClickHouse migrations and tests unchanged; new backends would get their own parallel migrations and test coverage.
## Potential Risks and Open Questions

- **Migration complexity:** How do you prefer to handle multi-backend migrations?
- **SQL dialect drift:** Some queries or functions (especially around arrays, JSON, or advanced aggregation) might not map 1:1 between ClickHouse and StarRocks. Is it acceptable if StarRocks initially supports a subset of observability features while parity is improved over time?
- **CI and maintenance cost:** Are you open to running the test suite (or a subset of it) against multiple backends in CI? This would add some maintenance cost but is important to prevent backend-specific regressions.
- **Roadmap alignment:** Does a pluggable observability backend fit the project's vision? Or is the intention to double down on ClickHouse as the sole, deeply integrated observability engine?
- **Data migration tools:** Would you want this proposal to also include guidance or tooling for migrating existing ClickHouse data into StarRocks, or is that out of scope for now?
## Proposed Implementation Phases

If this direction is acceptable, a possible phased approach:

1. **Design & alignment:** agree on the interfaces and naming (`ObservabilityStore`, `ObservabilityConfig`, env var names).
2. **Abstraction groundwork:**
   - Introduce `ObservabilityStore` and wrap the existing ClickHouse implementation.
   - Port higher-level components to `dyn ObservabilityStore`.
   - Add the `TENSORZERO_OBSERVABILITY_*` env vars and keep ClickHouse as the default.
3. **StarRocks prototype:** implement `StarRocksStore` with just enough schema and queries to support core observability.
4. **Feature parity & documentation:** expand StarRocks query coverage and add deployment docs and examples.
5. **Optional:** a higher-level, backend-agnostic query API for recipes and notebooks.
## Request for Feedback

I'd love feedback from maintainers and contributors on:

- Whether this direction is aligned with the project's roadmap.
- The proposed abstractions and naming (`ObservabilityStore`, env vars).

If this seems like a reasonable direction, I'd be happy to start on the design and abstraction-groundwork phases described above and iterate in follow-up PRs.
Thanks for considering this proposal, and I’m looking forward to your thoughts.