feat: Improve ClickHouse plugin performance by using HTTP interface with ArrowStream format #20621

@ei-grad

Description

Which problem is this feature request solving?

This feature request proposes switching the data insertion mechanism in the CloudQuery ClickHouse destination plugin from the current native-protocol (clickhouse-go) approach to ClickHouse's HTTP interface with the ArrowStream format. The goal is to significantly improve insertion performance and reduce plugin-side resource consumption (CPU, memory) by eliminating the current Go-based intermediate type-conversion layer (typeconv/ch/values) and leveraging ClickHouse's native, optimized C++ Arrow parser directly.

Current Behavior & Problem

  • The plugin currently receives data as apache/arrow-go arrow.Record batches via the WriteTableBatch method.
  • Before sending data to ClickHouse using the clickhouse-go/v2 driver's batch interface (driver.Batch), the plugin performs a conversion step within the typeconv/ch/values package (specifically values.FromArray and values.BatchAddRecords).
  • This conversion translates each arrow.Array chunk into a corresponding Go slice type (e.g., []*int32, []*time.Time, []map[string]any, [][]*string) that the clickhouse-go driver's Append method expects. This effectively follows an Arrow (plugin) -> Go Slice (plugin) -> CH Native (driver) data flow before reaching the server.
  • Problem: This intermediate Arrow -> Go Slice conversion step within the Go plugin (typeconv/ch/values) can be a significant performance bottleneck and source of high resource usage (see the sketch after this list):
    • CPU Intensive: It involves iterating over potentially large Arrow arrays and performing type-specific conversions. Reflection is used for complex types (lists, maps, structs), adding overhead.
    • Memory Intensive: It requires allocating Go slices for every column chunk in the batch. For complex or nested types, these allocations can become substantial.
    • Redundant Work: ClickHouse itself has highly optimized native capabilities to ingest data directly from the Arrow IPC Stream format. The current plugin performs work (Arrow -> Go Slice) that the database could potentially do more efficiently.
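For illustration, a minimal sketch of the kind of per-column work this conversion implies. This is a hypothetical simplification, not the actual typeconv/ch/values code (which also handles nested types via reflection):

```go
package main

import "github.com/apache/arrow-go/v18/arrow/array"

// int32PtrsFromArray mirrors the Arrow -> Go Slice step for one nullable
// Int32 column chunk: a full pass over the array plus a fresh Go slice
// allocation, repeated for every column of every batch.
func int32PtrsFromArray(arr *array.Int32) []*int32 {
	out := make([]*int32, arr.Len()) // per-chunk allocation
	for i := 0; i < arr.Len(); i++ {
		if arr.IsValid(i) {
			v := arr.Value(i)
			out[i] = &v // nullable values become pointer slices for clickhouse-go
		}
	}
	return out
}
```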

Alternatives Considered

  • Optimizing typeconv/ch/values: While possible (e.g., using code generation instead of reflection), it's unlikely to match the performance of ClickHouse's native C++ Arrow parser and still incurs Go allocation overhead for the intermediate slices.
  • Using clickhouse-go's HTTP Interface: The clickhouse-go/v2 library does have HTTP support, but currently only implements the Native format over HTTP for compatibility reasons. Support for other formats like Arrow over HTTP is planned for a future v3 but is low priority and not yet implemented. Therefore, relying on the existing driver's HTTP capabilities is not currently a viable alternative for sending Arrow data.

Success Criteria

  • Measurable improvement in insertion throughput (rows/sec or MB/sec) compared to the current implementation under various load conditions (simple vs. complex schemas, small vs. large batches).
  • Measurable reduction in plugin CPU and peak memory usage during insertion operations.
  • Successful migration and data insertion for all currently supported data types and schema configurations via the new HTTP transport.

Describe the solution you'd like

Proposed Behavior

  • Replace Transport: Replace the clickhouse-go native protocol connection (clickhouse.Conn) for insertion operations with Go's standard net/http client, configured for efficiency (connection pooling, keep-alives, HTTP/2).
  • Utilize FORMAT ArrowStream: Modify the INSERT query generation to use FORMAT ArrowStream.
  • Direct Arrow Serialization: In WriteTableBatch (or a refactored equivalent):
    • Take the incoming []arrow.Record batch.
    • Use the apache/arrow-go/v18/arrow/ipc package (ipc.NewWriter) to serialize the batch directly into the Arrow IPC Stream format.
    • Crucially: The schema written to the IPC stream must be canonized based on the target ClickHouse table schema, using Arrow types that ClickHouse's ArrowStream parser expects for input (referencing ClickHouse documentation for type mapping, e.g., SDK UUIDType -> Arrow StringType, SDK JSONType -> Arrow StringType). The existing typeconv.CanonizedTable logic needs adaptation for this mapping.
    • Stream the serialized Arrow IPC data as the body of an HTTP POST request using memory-efficient techniques like io.Pipe (a sketch follows this list).
    • Optionally support request body compression (gzip) via Content-Encoding.
  • Offload Parsing: ClickHouse receives the Arrow stream via HTTP and uses its native parser to ingest the data directly.
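A minimal sketch of the proposed write path, assuming a plain net/http client; insertArrowStream and its parameters are illustrative assumptions, not the plugin's actual API:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/url"

	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/ipc"
)

// insertArrowStream serializes records into the Arrow IPC stream format and
// streams them as the body of a single INSERT ... FORMAT ArrowStream request.
// The records' schema is assumed to already be canonized for ClickHouse.
func insertArrowStream(ctx context.Context, client *http.Client, baseURL, table string, records []arrow.Record) error {
	if len(records) == 0 {
		return nil
	}
	pr, pw := io.Pipe()
	go func() {
		// Serialize directly into the pipe so the whole batch is never
		// buffered in memory. Optionally wrap pw in a gzip.Writer and set
		// Content-Encoding: gzip on the request.
		w := ipc.NewWriter(pw, ipc.WithSchema(records[0].Schema()))
		var err error
		for _, rec := range records {
			if err = w.Write(rec); err != nil {
				break
			}
		}
		if cerr := w.Close(); err == nil {
			err = cerr
		}
		pw.CloseWithError(err) // propagate serialization errors to the HTTP request
	}()

	q := url.Values{}
	q.Set("query", fmt.Sprintf("INSERT INTO %s FORMAT ArrowStream", table))
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/?"+q.Encode(), pr)
	if err != nil {
		return err
	}
	// Auth (e.g. the X-ClickHouse-User / X-ClickHouse-Key headers) omitted for brevity.
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("clickhouse insert failed: %s: %s", resp.Status, body)
	}
	return nil
}
```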

Clarification on Conversion

  • Conversion Still Occurs: It's important to acknowledge that the conversion from the Arrow format to ClickHouse's internal columnar representation still needs to happen.
  • Shifted Responsibility: This feature request shifts this conversion responsibility from the CloudQuery Go plugin (Arrow -> Go Slice -> CH Native) to the ClickHouse server itself (Arrow IPC Stream -> CH Internal).
  • Cost Difference: The direct Arrow -> CH Internal conversion performed by the ClickHouse server (typically implemented in optimized C++) is expected to be significantly cheaper in terms of CPU and memory overhead compared to the current intermediate Arrow -> Go Slice conversion performed within the Go plugin, which involves Go runtime overhead, potential reflection, and explicit Go slice allocations.

Expected Benefits

  • Improved Insertion Throughput: Significant reduction in plugin-side CPU usage by eliminating the Go conversion step should lead to higher insertion rates, especially for large batches or complex schemas.
  • Reduced Plugin Memory Usage: Eliminating the allocation of large Go slices in typeconv/ch/values should lower the plugin's peak memory footprint during writes. Streaming the request body via io.Pipe further enhances memory efficiency.
  • Leverage Native ClickHouse Optimizations: Directly utilizes ClickHouse's native C++ implementation for Arrow data ingestion, which is likely far better optimized than the plugin-side Go conversion.
  • Simplified Plugin Code (Runtime): While canonization logic remains (sketched below), the complex runtime conversion code in Go (typeconv/ch/values) is removed.
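As an illustration of what remains on the plugin side, a minimal canonization sketch, assuming the SDK's UUID and JSON extension types map to plain Arrow strings per ClickHouse's Arrow type documentation. Helper names and mapping rules are assumptions, and the record columns would need to be rebuilt to match the lowered schema (not shown):

```go
package main

import (
	"github.com/apache/arrow-go/v18/arrow"
	"github.com/cloudquery/plugin-sdk/v4/types"
)

// canonizedSchema lowers SDK extension types to base Arrow types that
// ClickHouse's ArrowStream parser accepts. The mapping here is an assumption;
// the authoritative rules are ClickHouse's documented Arrow type mapping.
func canonizedSchema(s *arrow.Schema) *arrow.Schema {
	fields := make([]arrow.Field, s.NumFields())
	for i := 0; i < s.NumFields(); i++ {
		f := s.Field(i)
		switch f.Type.(type) {
		case *types.UUIDType, *types.JSONType: // e.g. UUID -> String, JSON -> String
			f.Type = arrow.BinaryTypes.String
		}
		fields[i] = f
	}
	return arrow.NewSchema(fields, nil)
}
```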

Implementation Considerations & Challenges

  • Transport Layer Rewrite: Requires replacing clickhouse.Conn usage for writes with a properly configured net/http.Client (handling TLS, pooling, timeouts, auth).
  • Configuration Changes: The spec.Spec needs updates to handle HTTP connection parameters (URL, user, password, etc.) instead of, or alongside, the DSN; spec.Options also needs rewriting.
  • Serialization Logic: Implement Arrow IPC Stream serialization using arrow/ipc and io.Pipe for efficient streaming.
  • Schema Canonization Adaptation: The logic in typeconv.CanonizedTable/CanonizedField must be updated to map from SDK Arrow types to the specific Arrow base types expected by ClickHouse's ArrowStream input format for the corresponding target ClickHouse column type.
  • DDL and Schema Introspection: All operations currently relying on conn.Exec or conn.Query (e.g., MigrateTables, getTableDefinitions, getPartitionKeyAndSortingKey) must be rewritten to use HTTP requests (e.g., POST /?query=CREATE TABLE... FORMAT ..., POST /?query=SELECT ... FROM system.columns ... FORMAT JSON). This includes parsing the responses (e.g., JSON) from these administrative queries. This is a substantial change (a sketch of the HTTP query path follows this list).
  • Error Handling: Implement robust handling for HTTP status codes and parsing error messages from ClickHouse HTTP responses.
  • Compression: Decide on and potentially implement optional request body compression (gzip).
  • Testing: Thorough performance benchmarking (throughput, CPU, memory) against the current implementation is essential to validate the benefits across different data shapes and volumes. Functional testing for all types and DDL operations is critical.
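For example, a minimal sketch of running an introspection query over HTTP and decoding its JSON response. The queryJSON helper and its error handling are assumptions about one possible shape, not a finished design:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

// jsonResult matches the envelope of ClickHouse's FORMAT JSON output,
// which wraps result rows in a "data" array.
type jsonResult struct {
	Data []map[string]any `json:"data"`
}

func queryJSON(ctx context.Context, client *http.Client, baseURL, query string) ([]map[string]any, error) {
	q := url.Values{"query": {query + " FORMAT JSON"}}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		baseURL+"/?"+q.Encode(), strings.NewReader(""))
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		// On failure ClickHouse returns a plain-text error message in the body.
		body, _ := io.ReadAll(resp.Body)
		return nil, fmt.Errorf("query failed: %s: %s", resp.Status, body)
	}
	var res jsonResult
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		return nil, err
	}
	return res.Data, nil
}
```

Usage for schema introspection might then look like queryJSON(ctx, client, baseURL, "SELECT name, type FROM system.columns WHERE table = 'my_table'").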

Pull request (optional)

  • I can submit a pull request
