# K2I Documentation

K2I is a standalone Rust service for Kafka-to-Apache-Iceberg ingestion. It consumes a configured Kafka topic, decodes raw or Confluent-framed Protobuf messages, keeps recent rows visible through an Arrow-backed local read path, and writes Parquet data files through Iceberg catalog commits.
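K2I's actual decoder is part of its Rust source; as a minimal illustration of the Confluent framing it has to strip before Protobuf decoding, here is a Python sketch of the standard wire format (magic byte, big-endian schema ID, zigzag-varint message-index list). The function name and error handling are illustrative, not K2I's API.

```python
import struct

def parse_confluent_frame(buf: bytes):
    """Split a Confluent-framed Protobuf record into (schema_id, indexes, payload).

    Standard Confluent wire format:
      byte 0      magic byte, always 0x00
      bytes 1-4   schema ID, big-endian u32
      then        zigzag-varint message-index array (count, then each index);
                  the common single-index [0] case is encoded as one 0x00 byte
      rest        raw Protobuf message bytes
    """
    if not buf or buf[0] != 0:
        raise ValueError("not Confluent-framed (bad magic byte)")
    (schema_id,) = struct.unpack_from(">I", buf, 1)
    pos = 5

    def read_zigzag_varint(p):
        shift, raw = 0, 0
        while True:
            b = buf[p]
            p += 1
            raw |= (b & 0x7F) << shift
            if not (b & 0x80):
                break
            shift += 7
        return (raw >> 1) ^ -(raw & 1), p  # zigzag decode to signed int

    count, pos = read_zigzag_varint(pos)
    if count == 0:
        indexes = [0]  # shorthand: first message type in the schema file
    else:
        indexes = []
        for _ in range(count):
            idx, pos = read_zigzag_varint(pos)
            indexes.append(idx)
    return schema_id, indexes, buf[pos:]
```

Raw (non-framed) Protobuf messages skip this step entirely; only records produced through a Confluent Schema Registry serializer carry the header.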

The current release docs are organized around the working implementation and local verification flows. Historical PRDs, research notes, and older website drafts live under `archive`.

## Start Here

| Guide | Use It For |
| --- | --- |
| Kafka to Iceberg | Main explanation of the K2I data path |
| Quickstart | Local Docker proof and first manual run |
| Configuration | Complete TOML reference |
| Architecture | System design, ordering, and hot/cold visibility |
| Comparisons | K2I vs Kafka Connect, Flink, Spark, TableFlow, and Moonlink |
| FAQ | Short answers to common user questions |

## Implementation Deep Dives

| Guide | Use It For |
| --- | --- |
| DuckDB Iceberg Validation | Docker E2E, direct Parquet reads, and DuckDB `iceberg_scan` |
| Schema Registry Protobuf | Confluent Protobuf decoding and schema evolution behavior |
| Iceberg REST Catalog | REST catalog commits and catalog backend caveats |
| Commands | CLI command reference and E2E scripts |
| Man Pages | Generated man pages for every CLI command and subcommand |
| Deployment | Deployment patterns and operational notes |
| Troubleshooting | Common issues and recovery guidance |
| Production Readiness | Verification status, caveats, and follow-up issues |

## Quick Local Proof

```sh
# Correctness flow: Protobuf evolution, read-state RPC, DuckDB Parquet checks
scripts/e2e-docker.sh

# Real Iceberg REST metadata and DuckDB iceberg_scan
scripts/e2e-docker-iceberg.sh

# 100,000-row Iceberg load profile
K2I_E2E_LOAD_MESSAGES=100000 scripts/e2e-docker-iceberg-load.sh
```

The Iceberg E2E success line is:

```
ok: DuckDB iceberg_scan validated real Iceberg metadata
```

## Current Release Scope

K2I is production-oriented, but the docs intentionally keep caveats visible:

- one configured Kafka topic and one configured Iceberg table per process today;
- the REST catalog real-metadata path is validated locally; the Glue, Hive, and Nessie abstractions require backend-specific validation;
- hot reads are served by the local read-state RPC, while query engines see data only after an Iceberg commit;
- exactly-once-style durability is designed around manual Kafka offsets, transaction-log records, idempotency records, immutable Parquet writes, and atomic Iceberg commits;
- multi-partition hardening, startup recovery application, async Kafka commit acknowledgement, per-entry fsync behavior, GCS/Azure writer wiring, and maintenance scheduler wiring remain production follow-ups.
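The idempotency-record piece of the durability design can be sketched as a dedupe keyed by Kafka coordinates: a replayed record (same topic, partition, and offset) is detected before a second write is attempted. This in-memory Python sketch is purely illustrative; K2I's real mechanism is persistent and Rust-based, and the class and method names here are hypothetical.

```python
class IdempotencyLog:
    """Illustrative offset-keyed dedupe for at-least-once Kafka delivery.

    A real implementation would persist keys (e.g. alongside the
    transaction log) so replays are still caught after a restart.
    """

    def __init__(self):
        self._seen = set()

    def first_time(self, topic: str, partition: int, offset: int) -> bool:
        """Return True the first time a (topic, partition, offset) is seen."""
        key = (topic, partition, offset)
        if key in self._seen:
            return False  # replayed record: skip the write
        self._seen.add(key)
        return True
```

Combined with immutable Parquet files and atomic Iceberg commits, this is what lets a consumer replay after a crash without producing duplicate rows.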

See Production Readiness before broad rollout.