Which problem is this feature request solving?
This feature request proposes switching the data insertion mechanism in the CloudQuery ClickHouse destination plugin from the current native protocol (`clickhouse-go`) approach to using ClickHouse's HTTP interface with `FORMAT ArrowStream`. The goal is to significantly improve insertion performance and reduce plugin-side resource consumption (CPU, memory) by eliminating the current Go-based intermediate type conversion layer (`typeconv/ch/values`) and leveraging ClickHouse's native, optimized C++ Arrow parser directly.
Current Behavior & Problem
- The plugin currently receives data as `apache/arrow-go` `arrow.Record` batches via the `WriteTableBatch` method.
- Before sending data to ClickHouse using the `clickhouse-go/v2` driver's batch interface (`driver.Batch`), the plugin performs a conversion step within the `typeconv/ch/values` package (specifically `values.FromArray` and `values.BatchAddRecords`).
- This conversion translates each `arrow.Array` chunk into a corresponding Go slice type (e.g., `[]*int32`, `[]*time.Time`, `[]map[string]any`, `[][]*string`) that the `clickhouse-go` driver's `Append` method expects. This effectively follows an `Arrow (plugin) -> Go Slice (plugin) -> CH Native (driver)` data flow before reaching the server (a simplified sketch of this flow follows the list).
- Problem: This intermediate `Arrow -> Go Slice` conversion step within the Go plugin (`typeconv/ch/values`) can be a significant performance bottleneck and source of high resource usage:
  - CPU Intensive: It involves iterating over potentially large Arrow arrays and performing type-specific conversions. Reflection is used for complex types (lists, maps, structs), adding overhead.
  - Memory Intensive: It requires allocating Go slices for every column chunk in the batch. For complex or nested types, these allocations can become substantial.
  - Redundant Work: ClickHouse itself has highly optimized native capabilities to ingest data directly from the Arrow IPC Stream format. The current plugin performs work (`Arrow -> Go Slice`) that the database could potentially do more efficiently.
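
For context, a minimal sketch of the kind of per-column conversion this flow implies, assuming the `clickhouse-go/v2` batch API; the function name, table name, and restriction to `Int64` columns are illustrative and not the plugin's actual code:

```go
// Hypothetical illustration of the current Arrow -> Go slice -> native batch
// flow; the real plugin handles all supported types (including nested ones
// via reflection), null bitmaps, and batching across multiple records.
package chdest

import (
	"context"

	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/array"
)

func writeRecordNative(ctx context.Context, conn driver.Conn, rec arrow.Record) error {
	batch, err := conn.PrepareBatch(ctx, "INSERT INTO example_table")
	if err != nil {
		return err
	}
	for colIdx := 0; colIdx < int(rec.NumCols()); colIdx++ {
		col, ok := rec.Column(colIdx).(*array.Int64)
		if !ok {
			continue // every other type needs its own conversion branch
		}
		// Intermediate Go slice allocation -- the per-column overhead this
		// feature request aims to eliminate.
		values := make([]*int64, col.Len())
		for i := 0; i < col.Len(); i++ {
			if !col.IsNull(i) {
				v := col.Value(i)
				values[i] = &v
			}
		}
		if err := batch.Column(colIdx).Append(values); err != nil {
			return err
		}
	}
	return batch.Send()
}
```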
Alternatives Considered
- Optimizing `typeconv/ch/values`: While possible (e.g., using code generation instead of reflection), it's unlikely to match the performance of ClickHouse's native C++ Arrow parser and still incurs Go allocation overhead for the intermediate slices.
- Using `clickhouse-go`'s HTTP Interface: The `clickhouse-go/v2` library does have HTTP support, but it currently only implements the `Native` format over HTTP for compatibility reasons. Support for other formats like `Arrow` over HTTP is planned for a future v3 but is low priority and not yet implemented. Therefore, relying on the existing driver's HTTP capabilities is not currently a viable alternative for sending Arrow data.
Success Criteria
- Measurable improvement in insertion throughput (rows/sec or MB/sec) compared to the current implementation under various load conditions (simple vs. complex schemas, small vs. large batches).
- Measurable reduction in plugin CPU and peak memory usage during insertion operations.
- Successful migration and data insertion for all currently supported data types and schema configurations via the new HTTP transport.
Describe the solution you'd like
Proposed Behavior
- Replace Transport: Replace the `clickhouse-go` native protocol connection (`clickhouse.Conn`) for insertion operations with Go's standard `net/http` client, configured for efficiency (connection pooling, keep-alives, HTTP/2).
- Utilize `FORMAT ArrowStream`: Modify the `INSERT` query generation to use `FORMAT ArrowStream`.
- Direct Arrow Serialization: In `WriteTableBatch` (or a refactored equivalent), as sketched after this list:
  - Take the incoming `[]arrow.Record` batch.
  - Use the `apache/arrow-go/v18/arrow/ipc` package (`ipc.NewWriter`) to serialize the batch directly into the Arrow IPC Stream format.
  - Crucially: The schema written to the IPC stream must be canonized based on the target ClickHouse table schema, using Arrow types that ClickHouse's `ArrowStream` parser expects for input (referencing ClickHouse documentation for type mapping, e.g., SDK `UUID` type -> Arrow `String` type, SDK `JSON` type -> Arrow `String` type). The existing `typeconv.CanonizedTable` logic needs adaptation for this mapping.
  - Stream the serialized Arrow IPC data as the body of an HTTP `POST` request using memory-efficient techniques like `io.Pipe`.
  - Optionally support request body compression (`gzip`) via `Content-Encoding`.
- Offload Parsing: ClickHouse receives the Arrow stream via HTTP and uses its native parser to ingest the data directly.
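
A minimal sketch of the proposed write path, assuming an HTTP endpoint such as `http://localhost:8123` and omitting authentication, schema canonization, and `gzip` compression for brevity; `insertArrowStream` and its parameters are illustrative names rather than the plugin's actual API:

```go
// Stream Arrow records to ClickHouse over HTTP with FORMAT ArrowStream,
// using io.Pipe so the batch is never fully buffered in plugin memory.
package chdest

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/url"

	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/ipc"
)

func insertArrowStream(ctx context.Context, client *http.Client, baseURL, table string, recs []arrow.Record) error {
	if len(recs) == 0 {
		return nil
	}
	pr, pw := io.Pipe()

	// Writer goroutine: serialize records into the Arrow IPC stream format
	// directly into the request body.
	go func() {
		w := ipc.NewWriter(pw, ipc.WithSchema(recs[0].Schema()))
		var err error
		for _, rec := range recs {
			if err = w.Write(rec); err != nil {
				break
			}
		}
		if cerr := w.Close(); err == nil {
			err = cerr
		}
		pw.CloseWithError(err)
	}()

	// The INSERT statement is passed via the "query" URL parameter; the
	// request body carries the Arrow IPC stream itself.
	q := url.Values{"query": {fmt.Sprintf("INSERT INTO %s FORMAT ArrowStream", table)}}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/?"+q.Encode(), pr)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/octet-stream")

	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(io.LimitReader(resp.Body, 4096))
		return fmt.Errorf("clickhouse insert failed: %s: %s", resp.Status, body)
	}
	return nil
}
```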
Clarification on Conversion
- Conversion Still Occurs: It's important to acknowledge that the conversion from the Arrow format to ClickHouse's internal columnar representation still needs to happen.
- Shifted Responsibility: This feature request shifts this conversion responsibility from the CloudQuery Go plugin (`Arrow -> Go Slice -> CH Native`) to the ClickHouse server itself (`Arrow IPC Stream -> CH Internal`).
- Cost Difference: The direct `Arrow -> CH Internal` conversion performed by the ClickHouse server (typically implemented in optimized C++) is expected to be significantly cheaper in terms of CPU and memory overhead than the current intermediate `Arrow -> Go Slice` conversion performed within the Go plugin, which involves Go runtime overhead, potential reflection, and explicit Go slice allocations.
Expected Benefits
- Improved Insertion Throughput: Significant reduction in plugin-side CPU usage by eliminating the Go conversion step should lead to higher insertion rates, especially for large batches or complex schemas.
- Reduced Plugin Memory Usage: Eliminating the allocation of large Go slices in `typeconv/ch/values` should lower the plugin's peak memory footprint during writes. Streaming the request body via `io.Pipe` further enhances memory efficiency.
- Leverage Native ClickHouse Optimizations: Directly utilizes ClickHouse's likely highly optimized C++ implementation for Arrow data ingestion.
- Simplified Plugin Code (Runtime): While canonization logic remains, the complex runtime conversion code in Go (`typeconv/ch/values`) is removed.
Implementation Considerations & Challenges
- Transport Layer Rewrite: Requires replacing `clickhouse.Conn` usage for writes with a properly configured `net/http.Client` (handling TLS, pooling, timeouts, auth); a configuration sketch follows this list.
- Configuration Changes: The `spec.Spec` needs updates to handle HTTP connection parameters (URL, user, pass, etc.) instead of/alongside the DSN. `spec.Options` needs rewriting.
- Serialization Logic: Implement Arrow IPC Stream serialization using `arrow/ipc` and `io.Pipe` for efficient streaming.
- Schema Canonization Adaptation: The logic in `typeconv.CanonizedTable`/`CanonizedField` must be updated to map from SDK Arrow types to the specific Arrow base types expected by ClickHouse's `ArrowStream` input format for the corresponding target ClickHouse column type.
- DDL and Schema Introspection: All operations currently relying on `conn.Exec` or `conn.Query` (e.g., `MigrateTables`, `getTableDefinitions`, `getPartitionKeyAndSortingKey`) must be rewritten to use HTTP requests (e.g., `POST /?query=CREATE TABLE ... FORMAT ...`, `POST /?query=SELECT ... FROM system.columns ... FORMAT JSON`). This includes parsing the responses (e.g., JSON) from these administrative queries. This is a substantial change.
- Error Handling: Implement robust handling for HTTP status codes and parsing of error messages from ClickHouse HTTP responses.
- Compression: Decide on and potentially implement optional request body compression (`gzip`).
- Testing: Thorough performance benchmarking (throughput, CPU, memory) against the current implementation is essential to validate the benefits across different data shapes and volumes. Functional testing for all types and DDL operations is critical.
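
To make the transport and introspection points above concrete, here is a hedged sketch of a tuned `net/http.Client` plus a `system.columns` query over HTTP with `FORMAT JSON`; the helper names, timeout values, and response struct are assumptions, not existing plugin code:

```go
// Transport configuration and an administrative query over ClickHouse's HTTP
// interface, decoding the FORMAT JSON response body.
package chdest

import (
	"context"
	"crypto/tls"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// newClickHouseHTTPClient returns an http.Client with keep-alives, pooling,
// and TLS configured; HTTP/2 is negotiated automatically over TLS.
func newClickHouseHTTPClient(tlsCfg *tls.Config) *http.Client {
	return &http.Client{
		Timeout: 5 * time.Minute, // overall request timeout; tune via spec
		Transport: &http.Transport{
			TLSClientConfig:     tlsCfg,
			MaxIdleConns:        100,
			MaxIdleConnsPerHost: 10,
			IdleConnTimeout:     90 * time.Second,
		},
	}
}

// columnsResponse mirrors the shape of ClickHouse's FORMAT JSON output
// (a "data" array of row objects); only the fields used here are declared.
type columnsResponse struct {
	Data []struct {
		Name string `json:"name"`
		Type string `json:"type"`
	} `json:"data"`
}

// getTableColumns runs a SELECT against system.columns via HTTP basic auth
// and the "query" URL parameter, then decodes the JSON response.
func getTableColumns(ctx context.Context, client *http.Client, baseURL, user, pass, database, table string) (*columnsResponse, error) {
	query := fmt.Sprintf(
		"SELECT name, type FROM system.columns WHERE database = '%s' AND table = '%s' FORMAT JSON",
		database, table) // real code should escape/parameterize identifiers
	u := baseURL + "/?" + url.Values{"query": {query}}.Encode()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, u, nil)
	if err != nil {
		return nil, err
	}
	req.SetBasicAuth(user, pass)

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(io.LimitReader(resp.Body, 4096))
		return nil, fmt.Errorf("clickhouse query failed: %s: %s", resp.Status, body)
	}

	var out columnsResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}
```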
Pull request (optional)
- I can submit a pull request