Which problem is this feature request solving?
This feature request proposes switching the data insertion mechanism in the CloudQuery ClickHouse destination plugin from the current native protocol (`clickhouse-go`) approach to using ClickHouse's HTTP interface with `FORMAT ArrowStream`. The goal is to significantly improve insertion performance and reduce plugin-side resource consumption (CPU, memory) by eliminating the current Go-based intermediate type conversion layer (`typeconv/ch/values`) and leveraging ClickHouse's native, optimized C++ Arrow parser directly.
Current Behavior & Problem
- The plugin currently receives data as `apache/arrow-go` `arrow.Record` batches via the `WriteTableBatch` method.
- Before sending data to ClickHouse using the `clickhouse-go/v2` driver's batch interface (`driver.Batch`), the plugin performs a conversion step within the `typeconv/ch/values` package (specifically `values.FromArray` and `values.BatchAddRecords`).
- This conversion translates each `arrow.Array` chunk into a corresponding Go slice type (e.g., `[]*int32`, `[]*time.Time`, `[]map[string]any`, `[][]*string`) that the `clickhouse-go` driver's `Append` method expects. This effectively follows an `Arrow (plugin) -> Go Slice (plugin) -> CH Native (driver)` data flow before reaching the server (a simplified sketch of this flow follows the list).
- Problem: This intermediate `Arrow -> Go Slice` conversion step within the Go plugin (`typeconv/ch/values`) can be a significant performance bottleneck and source of high resource usage:
  - CPU Intensive: It involves iterating over potentially large Arrow arrays and performing type-specific conversions. Reflection is used for complex types (lists, maps, structs), adding overhead.
  - Memory Intensive: It requires allocating Go slices for every column chunk in the batch. For complex or nested types, these allocations can become substantial.
  - Redundant Work: ClickHouse itself has highly optimized native capabilities to ingest data directly from the Arrow IPC Stream format. The current plugin performs work (`Arrow -> Go Slice`) that the database could potentially do more efficiently.
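
For context, a minimal sketch of the kind of per-column conversion this flow implies, assuming the `clickhouse-go/v2` batch API; the function name, table name, and restriction to `Int64` columns are illustrative and not the plugin's actual code:

```go
// Hypothetical illustration of the current Arrow -> Go slice -> native batch
// flow; the real plugin handles all supported types (including nested ones
// via reflection), null bitmaps, and batching across multiple records.
package chdest

import (
	"context"

	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/array"
)

func writeRecordNative(ctx context.Context, conn driver.Conn, rec arrow.Record) error {
	batch, err := conn.PrepareBatch(ctx, "INSERT INTO example_table")
	if err != nil {
		return err
	}
	for colIdx := 0; colIdx < int(rec.NumCols()); colIdx++ {
		col, ok := rec.Column(colIdx).(*array.Int64)
		if !ok {
			continue // every other type needs its own conversion branch
		}
		// Intermediate Go slice allocation -- the per-column overhead this
		// feature request aims to eliminate.
		values := make([]*int64, col.Len())
		for i := 0; i < col.Len(); i++ {
			if !col.IsNull(i) {
				v := col.Value(i)
				values[i] = &v
			}
		}
		if err := batch.Column(colIdx).Append(values); err != nil {
			return err
		}
	}
	return batch.Send()
}
```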
Alternatives Considered
- Optimizing `typeconv/ch/values`: While possible (e.g., using code generation instead of reflection), it's unlikely to match the performance of ClickHouse's native C++ Arrow parser and still incurs Go allocation overhead for the intermediate slices.
- Using `clickhouse-go`'s HTTP Interface: The `clickhouse-go/v2` library does have HTTP support, but it currently only implements the `Native` format over HTTP for compatibility reasons. Support for other formats like `Arrow` over HTTP is planned for a future v3 but is low priority and not yet implemented. Therefore, relying on the existing driver's HTTP capabilities is not currently a viable alternative for sending Arrow data.
Success Criteria
- Measurable improvement in insertion throughput (rows/sec or MB/sec) compared to the current implementation under various load conditions (simple vs. complex schemas, small vs. large batches).
- Measurable reduction in plugin CPU and peak memory usage during insertion operations.
- Successful migration and data insertion for all currently supported data types and schema configurations via the new HTTP transport.
Describe the solution you'd like
Proposed Behavior
- Replace Transport: Replace the `clickhouse-go` native protocol connection (`clickhouse.Conn`) for insertion operations with Go's standard `net/http` client, configured for efficiency (connection pooling, keep-alives, HTTP/2).
- Utilize `FORMAT ArrowStream`: Modify the `INSERT` query generation to use `FORMAT ArrowStream`.
- Direct Arrow Serialization: In `WriteTableBatch` (or a refactored equivalent), as sketched after this list:
  - Take the incoming `[]arrow.Record` batch.
  - Use the `apache/arrow-go/v18/arrow/ipc` package (`ipc.NewWriter`) to serialize the batch directly into the Arrow IPC Stream format.
  - Crucially: The schema written to the IPC stream must be canonized based on the target ClickHouse table schema, using Arrow types that ClickHouse's `ArrowStream` parser expects for input (referencing ClickHouse documentation for type mapping, e.g., SDK `UUID` type -> Arrow `String` type, SDK `JSON` type -> Arrow `String` type). The existing `typeconv.CanonizedTable` logic needs adaptation for this mapping.
  - Stream the serialized Arrow IPC data as the body of an HTTP `POST` request using memory-efficient techniques like `io.Pipe`.
  - Optionally support request body compression (`gzip`) via `Content-Encoding`.
- Offload Parsing: ClickHouse receives the Arrow stream via HTTP and uses its native parser to ingest the data directly.
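
A minimal sketch of the proposed write path, assuming an HTTP endpoint such as `http://localhost:8123` and omitting authentication, schema canonization, and `gzip` compression for brevity; `insertArrowStream` and its parameters are illustrative names rather than the plugin's actual API:

```go
// Stream Arrow records to ClickHouse over HTTP with FORMAT ArrowStream,
// using io.Pipe so the batch is never fully buffered in plugin memory.
package chdest

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/url"

	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/ipc"
)

func insertArrowStream(ctx context.Context, client *http.Client, baseURL, table string, recs []arrow.Record) error {
	if len(recs) == 0 {
		return nil
	}
	pr, pw := io.Pipe()

	// Writer goroutine: serialize records into the Arrow IPC stream format
	// directly into the request body.
	go func() {
		w := ipc.NewWriter(pw, ipc.WithSchema(recs[0].Schema()))
		var err error
		for _, rec := range recs {
			if err = w.Write(rec); err != nil {
				break
			}
		}
		if cerr := w.Close(); err == nil {
			err = cerr
		}
		pw.CloseWithError(err)
	}()

	// The INSERT statement is passed via the "query" URL parameter; the
	// request body carries the Arrow IPC stream itself.
	q := url.Values{"query": {fmt.Sprintf("INSERT INTO %s FORMAT ArrowStream", table)}}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/?"+q.Encode(), pr)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/octet-stream")

	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(io.LimitReader(resp.Body, 4096))
		return fmt.Errorf("clickhouse insert failed: %s: %s", resp.Status, body)
	}
	return nil
}
```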
Clarification on Conversion
- Conversion Still Occurs: It's important to acknowledge that the conversion from the Arrow format to ClickHouse's internal columnar representation still needs to happen.
- Shifted Responsibility: This feature request shifts this conversion responsibility from the CloudQuery Go plugin (`Arrow -> Go Slice -> CH Native`) to the ClickHouse server itself (`Arrow IPC Stream -> CH Internal`).
- Cost Difference: The direct `Arrow -> CH Internal` conversion performed by the ClickHouse server (typically implemented in optimized C++) is expected to be significantly cheaper in terms of CPU and memory overhead than the current intermediate `Arrow -> Go Slice` conversion performed within the Go plugin, which involves Go runtime overhead, potential reflection, and explicit Go slice allocations.
Expected Benefits
- Improved Insertion Throughput: Significant reduction in plugin-side CPU usage by eliminating the Go conversion step should lead to higher insertion rates, especially for large batches or complex schemas.
- Reduced Plugin Memory Usage: Eliminating the allocation of large Go slices in `typeconv/ch/values` should lower the plugin's peak memory footprint during writes. Streaming the request body via `io.Pipe` further enhances memory efficiency.
- Leverage Native ClickHouse Optimizations: Directly utilizes ClickHouse's likely highly optimized C++ implementation for Arrow data ingestion.
- Simplified Plugin Code (Runtime): While canonization logic remains, the complex runtime conversion code in Go (`typeconv/ch/values`) is removed.
Implementation Considerations & Challenges
- Transport Layer Rewrite: Requires replacing `clickhouse.Conn` usage for writes with a properly configured `net/http.Client` (handling TLS, pooling, timeouts, auth); a configuration sketch follows this list.
- Configuration Changes: The `spec.Spec` needs updates to handle HTTP connection parameters (URL, user, pass, etc.) instead of/alongside the DSN. `spec.Options` needs rewriting.
- Serialization Logic: Implement Arrow IPC Stream serialization using `arrow/ipc` and `io.Pipe` for efficient streaming.
- Schema Canonization Adaptation: The logic in `typeconv.CanonizedTable`/`CanonizedField` must be updated to map from SDK Arrow types to the specific Arrow base types expected by ClickHouse's `ArrowStream` input format for the corresponding target ClickHouse column type.
- DDL and Schema Introspection: All operations currently relying on `conn.Exec` or `conn.Query` (e.g., `MigrateTables`, `getTableDefinitions`, `getPartitionKeyAndSortingKey`) must be rewritten to use HTTP requests (e.g., `POST /?query=CREATE TABLE ... FORMAT ...`, `POST /?query=SELECT ... FROM system.columns ... FORMAT JSON`). This includes parsing the responses (e.g., JSON) from these administrative queries. This is a substantial change.
- Error Handling: Implement robust handling for HTTP status codes and parsing of error messages from ClickHouse HTTP responses.
- Compression: Decide on and potentially implement optional request body compression (`gzip`).
- Testing: Thorough performance benchmarking (throughput, CPU, memory) against the current implementation is essential to validate the benefits across different data shapes and volumes. Functional testing for all types and DDL operations is critical.
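
To make the transport and introspection points above concrete, here is a hedged sketch of a tuned `net/http.Client` plus a `system.columns` query over HTTP with `FORMAT JSON`; the helper names, timeout values, and response struct are assumptions, not existing plugin code:

```go
// Transport configuration and an administrative query over ClickHouse's HTTP
// interface, decoding the FORMAT JSON response body.
package chdest

import (
	"context"
	"crypto/tls"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// newClickHouseHTTPClient returns an http.Client with keep-alives, pooling,
// and TLS configured; HTTP/2 is negotiated automatically over TLS.
func newClickHouseHTTPClient(tlsCfg *tls.Config) *http.Client {
	return &http.Client{
		Timeout: 5 * time.Minute, // overall request timeout; tune via spec
		Transport: &http.Transport{
			TLSClientConfig:     tlsCfg,
			MaxIdleConns:        100,
			MaxIdleConnsPerHost: 10,
			IdleConnTimeout:     90 * time.Second,
		},
	}
}

// columnsResponse mirrors the shape of ClickHouse's FORMAT JSON output
// (a "data" array of row objects); only the fields used here are declared.
type columnsResponse struct {
	Data []struct {
		Name string `json:"name"`
		Type string `json:"type"`
	} `json:"data"`
}

// getTableColumns runs a SELECT against system.columns via HTTP basic auth
// and the "query" URL parameter, then decodes the JSON response.
func getTableColumns(ctx context.Context, client *http.Client, baseURL, user, pass, database, table string) (*columnsResponse, error) {
	query := fmt.Sprintf(
		"SELECT name, type FROM system.columns WHERE database = '%s' AND table = '%s' FORMAT JSON",
		database, table) // real code should escape/parameterize identifiers
	u := baseURL + "/?" + url.Values{"query": {query}}.Encode()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, u, nil)
	if err != nil {
		return nil, err
	}
	req.SetBasicAuth(user, pass)

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(io.LimitReader(resp.Body, 4096))
		return nil, fmt.Errorf("clickhouse query failed: %s: %s", resp.Status, body)
	}

	var out columnsResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}
```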
Pull request (optional)
- I can submit a pull request