# Abstract ClickHouse usage behind a pluggable OLAP/Observability interface #5679
Proposal: Pluggable OLAP Backend for TensorZero Observability (ClickHouse + StarRocks)

**Author:** @blarghmatey
**Date:** 2026-01-16
**Status:** Draft / Request for Feedback
**Scope:** `tensorzero/tensorzero` (core + gateway + docs/examples)

## Summary
TensorZero currently uses ClickHouse as the storage engine for its observability and analytics data (e.g., inference logs, feedback, and usage metrics). This proposal suggests abstracting that storage behind a pluggable interface. The goal is to support additional OLAP engines for the same use cases where ClickHouse is employed today, without materially changing the user-facing behavior or observability features.
I’m opening this to get maintainers’ feedback on whether this direction is aligned with the project’s roadmap and how you’d like it to be structured.
## Motivation

### 1. Deployment flexibility

Many teams already have an OLAP engine in place (such as StarRocks) and would prefer to reuse it rather than add and operate a separate ClickHouse cluster. Supporting multiple OLAP options lowers the operational barrier to adopting TensorZero's observability features.
### 2. Future extensibility

Right now, ClickHouse is deeply baked into the codebase and documentation:

- `tensorzero-core/src/db/clickhouse/**` (connection info, migrations, tests).
- Docs (`docs/deployment/clickhouse.mdx`, `docs/quickstart.mdx`) that assume `TENSORZERO_CLICKHOUSE_URL`.
- Examples that pass `clickhouse_url=...` and notebooks that import `clickhouse_connect`.

If we introduce a clear abstraction around "observability storage" now, it becomes much easier to add or swap backends later without invasive changes across the core, gateway, and docs.
### 3. StarRocks specifically

StarRocks is an MPP, columnar OLAP engine with relatively familiar SQL semantics. It's commonly used for analytical workloads similar to what TensorZero's observability layer provides, which makes it a natural candidate for this kind of backend. The main differences are in DDL, function sets, and ingestion patterns, which can be hidden behind a Rust abstraction layer.
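To make the dialect gap concrete, here is a rough side-by-side of what one logical table could look like in each dialect, embedded as Rust string constants the way a migration might carry them. The schema below is purely illustrative; it is not TensorZero's actual table layout:

```rust
// Hypothetical DDL pair for the same logical table, illustrating the kind
// of dialect differences a backend abstraction would need to hide.
// This is NOT TensorZero's real schema.
const CLICKHOUSE_DDL: &str = r#"
CREATE TABLE IF NOT EXISTS inference_events (
    id UUID,
    function_name String,
    created_at DateTime64(3),
    payload String
)
ENGINE = MergeTree()
ORDER BY (function_name, created_at)
"#;

const STARROCKS_DDL: &str = r#"
CREATE TABLE IF NOT EXISTS inference_events (
    id VARCHAR(36),            -- StarRocks has no native UUID type
    function_name VARCHAR(255),
    created_at DATETIME,
    payload STRING
)
DUPLICATE KEY (id)             -- append-only log semantics
DISTRIBUTED BY HASH(id)
"#;
```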
## Current ClickHouse Usage (High-Level)

Based on a repo scan, ClickHouse is used in the following ways:

- **Rust core (`tensorzero-core`):**
  - `ClickHouseConnectionInfo` and related code under `tensorzero-core/src/db/clickhouse`.
  - Migrations under `tensorzero-core/src/db/clickhouse/migration_manager/migrations/` (e.g., `Migration0034<'a> { pub clickhouse: &'a ClickHouseConnectionInfo }`).
  - Test helpers in `tensorzero-core/src/db/clickhouse/test_helpers.rs` (e.g., `get_clickhouse()` and `get_clickhouse_replica()`).
- **Gateway / Python clients:**
  - `TensorZeroGateway.build_embedded` and `AsyncTensorZeroGateway.build_embedded` accept a `clickhouse_url` argument.
  - `patch_openai_client(..., clickhouse_url=...)` is used to integrate with the Gateway from the OpenAI client.
  - `examples/**` assumes ClickHouse is the only OLAP backend.
- **Docs & deployment:**
  - `docs/deployment/clickhouse.mdx` describes ClickHouse as the observability engine.
  - `docs/quickstart.mdx`, Docker Compose, and Helm examples configure `TENSORZERO_CLICKHOUSE_URL` and show ClickHouse deployments explicitly.

Functionally, the observability layer is a write path for inference logs, feedback, and usage metrics, plus an analytical read path over that data. The key observation is that this path is not currently abstracted as a generic "observability store" – it is specifically "ClickHouse".
## Proposal

### 1. Introduce a generic Observability Store abstraction (Rust)

Define a Rust-level trait that captures what the Gateway and UI need from the observability store, abstracting away ClickHouse specifics.

High-level sketch:
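The following is illustrative only: none of these types exist in the codebase today, the method set would be derived from what the gateway and UI actually call, and it assumes the `async_trait` crate:

```rust
use std::sync::Arc;
use async_trait::async_trait;

// Placeholder types for illustration only; the real row/query types would
// mirror TensorZero's existing ClickHouse schema.
pub struct InferenceRow;
pub struct FeedbackRow;
pub struct ObservabilityQuery;
pub struct QueryResult;
type Error = Box<dyn std::error::Error + Send + Sync>;

/// What the gateway and UI need from the observability store, independent
/// of which OLAP engine backs it.
#[async_trait]
pub trait ObservabilityStore: Send + Sync {
    /// Run any pending schema migrations for this backend.
    async fn run_migrations(&self) -> Result<(), Error>;

    /// Batched write path for inference logs.
    async fn write_inferences(&self, rows: Vec<InferenceRow>) -> Result<(), Error>;

    /// Batched write path for feedback.
    async fn write_feedback(&self, rows: Vec<FeedbackRow>) -> Result<(), Error>;

    /// Analytical read path used by the UI and evaluations.
    async fn query(&self, request: ObservabilityQuery) -> Result<QueryResult, Error>;
}

/// Higher-level components would hold a trait object rather than a
/// concrete ClickHouse client.
pub struct Gateway {
    store: Arc<dyn ObservabilityStore>,
}
```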
Then:

- Implement `ObservabilityStore` for the existing ClickHouse implementation (wrapping `ClickHouseConnectionInfo` and the current migration manager).
- Add a `StarRocksStore` that also implements `ObservabilityStore`, with its own connection handling, DDL, and query dialect.
- Higher-level components (gateway, UI) would depend on `dyn ObservabilityStore` instead of directly on `ClickHouseConnectionInfo`.
### 2. Migration and schema abstraction

Today, migrations live in `tensorzero-core/src/db/clickhouse/migration_manager`. They are very ClickHouse-specific (engines, replication config, views, etc.). To support multiple backends, we could:

- Factor out a minimal `MigrationBackend` interface (sketched below).
- Keep the existing ClickHouse migrations as-is but drive them through a `ClickHouseMigrationBackend` that implements `MigrationBackend`.
- Add a parallel set of StarRocks migrations.
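A minimal sketch of that interface, with placeholder method names (the real set would mirror what the current migration manager already needs):

```rust
use async_trait::async_trait;

type Error = Box<dyn std::error::Error + Send + Sync>;

/// Backend-agnostic migration interface (sketch). Existing ClickHouse
/// migrations would keep their logic behind a `ClickHouseMigrationBackend`;
/// a StarRocks backend would supply its own DDL.
#[async_trait]
pub trait MigrationBackend: Send + Sync {
    /// Stable identifier, e.g. "0034".
    fn id(&self) -> &'static str;

    /// Whether this migration still needs to run against the target database.
    async fn should_apply(&self) -> Result<bool, Error>;

    /// Apply the backend-specific DDL.
    async fn apply(&self) -> Result<(), Error>;
}
```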
Alternate approach (if you prefer less duplication): define a shared logical schema once and have each backend emit its own DDL from it.
I’d appreciate guidance on which approach you’d prefer: duplicated but explicit per‑backend migrations vs. a shared logical schema that emits different DDL per backend.
### 3. Runtime backend selection

Introduce a generic configuration story:

- New environment variables (names are up for discussion):
  - `TENSORZERO_OBSERVABILITY_BACKEND=clickhouse|starrocks`
  - `TENSORZERO_OBSERVABILITY_URL=...`
- Maintain backward compatibility: if `TENSORZERO_OBSERVABILITY_BACKEND` is unset but `TENSORZERO_CLICKHOUSE_URL` is present, infer `ObservabilityEngine::ClickHouse` with `ObservabilityConfig.url = TENSORZERO_CLICKHOUSE_URL`.
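A sketch of how that resolution could look; `ObservabilityEngine` and `ObservabilityConfig` are hypothetical names from this proposal, and error handling is simplified:

```rust
/// Hypothetical config model for this proposal.
#[derive(Debug, Clone)]
pub enum ObservabilityEngine {
    ClickHouse,
    StarRocks,
}

#[derive(Debug, Clone)]
pub struct ObservabilityConfig {
    pub engine: ObservabilityEngine,
    pub url: String,
}

/// Resolve the backend from env vars; returns `Ok(None)` when observability
/// is not configured at all.
pub fn resolve_observability_config() -> Result<Option<ObservabilityConfig>, String> {
    let backend = std::env::var("TENSORZERO_OBSERVABILITY_BACKEND").ok();
    match backend.as_deref() {
        Some("clickhouse") | Some("starrocks") => {
            let engine = match backend.as_deref() {
                Some("starrocks") => ObservabilityEngine::StarRocks,
                _ => ObservabilityEngine::ClickHouse,
            };
            let url = std::env::var("TENSORZERO_OBSERVABILITY_URL")
                .map_err(|_| "TENSORZERO_OBSERVABILITY_URL must be set".to_string())?;
            Ok(Some(ObservabilityConfig { engine, url }))
        }
        Some(other) => Err(format!("unknown observability backend: {other}")),
        // Backward compatibility: fall back to the existing ClickHouse env var.
        None => Ok(std::env::var("TENSORZERO_CLICKHOUSE_URL").ok().map(|url| {
            ObservabilityConfig {
                engine: ObservabilityEngine::ClickHouse,
                url,
            }
        })),
    }
}
```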
In the Python helpers and gateway constructors:

- Add new parameters such as `observability_backend` and `observability_url` (final names up for discussion).
- Continue to support the existing `clickhouse_url` for backward compatibility (and translate it into the new config model under the hood).

This way, existing deployments that only set `TENSORZERO_CLICKHOUSE_URL` continue to work unchanged, while new users can explicitly opt into StarRocks or other backends.

### 4. StarRocks backend implementation (initial scope)
For a first pass, the StarRocks backend could aim for feature equivalence for core observability: inference logs, feedback, and the queries the UI and evaluations depend on.

Development scope:

- Implement `ObservabilityStore` (and the migration interface) for StarRocks.
- Add deployment documentation alongside `docs/deployment/clickhouse.mdx`.

We could start with a "minimal viable backend": just enough schema and queries to support core observability, expanding toward feature parity over time.
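As a sketch of where that could start, building on the `ObservabilityStore` trait above: StarRocks speaks the MySQL wire protocol for queries, so `sqlx` with its MySQL driver is one plausible client choice, though that is an assumption rather than a decision, and every name here is hypothetical:

```rust
use async_trait::async_trait;

/// Illustrative skeleton only.
pub struct StarRocksStore {
    pool: sqlx::MySqlPool,
}

#[async_trait]
impl ObservabilityStore for StarRocksStore {
    async fn run_migrations(&self) -> Result<(), Error> {
        // Would drive StarRocks-specific DDL through the MigrationBackend
        // interface sketched in section 2.
        todo!()
    }

    async fn write_inferences(&self, _rows: Vec<InferenceRow>) -> Result<(), Error> {
        // Plain INSERTs over the MySQL protocol would do for a prototype;
        // StarRocks' Stream Load HTTP API is the usual path for high-volume
        // ingestion and could replace this later.
        todo!()
    }

    async fn write_feedback(&self, _rows: Vec<FeedbackRow>) -> Result<(), Error> {
        todo!()
    }

    async fn query(&self, _request: ObservabilityQuery) -> Result<QueryResult, Error> {
        todo!()
    }
}
```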
Data migration tooling from ClickHouse to StarRocks (for existing users) could be considered separately.
### 5. Documentation and examples

To keep the docs coherent and approachable, we could add a StarRocks deployment guide (parallel to `docs/deployment/clickhouse.mdx`) while keeping ClickHouse documented as the default backend.

Recipes and notebooks that directly use `clickhouse_connect` and ClickHouse SQL could, at least initially, remain ClickHouse-specific, with notes indicating that they assume a ClickHouse backend. In a later phase, they could be refactored to use a small higher-level query API that is backend-agnostic.

## Backward Compatibility
The primary concerns and mitigations:
- **Existing env vars:**
  - Keep supporting `TENSORZERO_CLICKHOUSE_URL` exactly as today.
  - Introduce the `TENSORZERO_OBSERVABILITY_*` variables as an optional configuration layer.
- **Existing Python APIs:**
  - Keep the existing `clickhouse_url` arguments working.
  - Add `observability_*` arguments once that API is stable.
- **Existing migrations and tests:**
  - Leave the current ClickHouse migrations and tests unchanged; new backends would get their own parallel migrations and test coverage.
## Potential Risks and Open Questions

- **Migration complexity:** How do you prefer to handle multi-backend migrations?
- **SQL dialect drift:** Some queries or functions (especially around arrays, JSON, or advanced aggregation) might not map 1:1 between ClickHouse and StarRocks. Is it acceptable if StarRocks initially supports a subset of observability features while parity is improved over time?
- **CI and maintenance cost:** Are you open to running the test suite (or a subset of it) against multiple backends in CI? This would add some maintenance cost but is important to prevent backend-specific regressions.
- **Roadmap alignment:** Does a pluggable observability backend fit the project's vision? Or is the intention to double down on ClickHouse as the sole, deeply integrated observability engine?
- **Data migration tools:** Would you want this proposal to also include guidance or tooling for migrating existing ClickHouse data into StarRocks, or is that out of scope for now?
## Proposed Implementation Phases

If this direction is acceptable, a possible phased approach:

1. **Design & alignment:** agree on the interfaces and naming (`ObservabilityStore`, `ObservabilityConfig`, env var names).
2. **Abstraction groundwork:**
   - Introduce `ObservabilityStore` and wrap the existing ClickHouse implementation.
   - Port higher-level components to `dyn ObservabilityStore`.
   - Add the `TENSORZERO_OBSERVABILITY_*` env vars and keep ClickHouse as the default.
3. **StarRocks prototype:** implement `StarRocksStore` with just enough schema and queries to support core observability.
4. **Feature parity & documentation:** expand StarRocks query coverage and add deployment docs and examples.
5. **Optional:** a higher-level, backend-agnostic query API for recipes and notebooks.
## Request for Feedback

I'd love feedback from maintainers and contributors on:

- Whether this direction is aligned with the project's roadmap.
- The proposed abstractions and naming (`ObservabilityStore`, env vars).

If this seems like a reasonable direction, I'd be happy to start on the design and abstraction-groundwork phases described above and iterate in follow-up PRs.
Thanks for considering this proposal, and I’m looking forward to your thoughts.