
K2I: Kafka to Apache Iceberg in One Rust Binary

Stream final-form Kafka events into Apache Iceberg tables with Protobuf Schema Registry decoding, Arrow hot reads, Parquet writes, and Docker-verified DuckDB/Iceberg validation.



K2I is an open-source, standalone Kafka-to-Iceberg ingestion engine written in Rust. It consumes a Kafka topic, decodes raw or Confluent-framed Protobuf messages, keeps recent rows visible through an Arrow-backed local read path, and flushes Parquet data files through Iceberg catalog commits.

K2I is built for teams that want fresh lakehouse tables without operating a Flink job, Spark micro-batch pipeline, or Kafka Connect cluster for a simple final-form event stream. Today, one process owns one configured Kafka topic and writes to one configured Iceberg table.

Use K2I When

  • You have final-form Kafka events that should become analytics rows in Apache Iceberg.
  • You want a single Rust service/container instead of a stream-processing cluster for this ingestion job.
  • You need local fresh-read visibility before the next Iceberg snapshot is committed.
  • You use Confluent Schema Registry Protobuf and want additive schema evolution guarded by readiness checks.
  • You want local Docker E2E validation that proves the written Iceberg table is readable by DuckDB.

K2I is not a general stream processing framework. Use Flink or another stream processor for joins, windows, stateful transformations, multi-source ETL, or complex CDC delete/update semantics.

Quick Local Proof

Run the real Iceberg/DuckDB flow locally:

scripts/e2e-docker-iceberg.sh

The script starts Kafka, Schema Registry, K2I, an Iceberg REST fixture, and the E2E runner. A passing run ends with:

ok: DuckDB iceberg_scan validated real Iceberg metadata

Run the local Iceberg load profile with 100,000 rows:

K2I_E2E_LOAD_MESSAGES=100000 scripts/e2e-docker-iceberg-load.sh

What K2I Does

| Capability | Current behavior |
| --- | --- |
| Kafka ingest | Uses rdkafka, manual offset management, batching, retry, and backpressure |
| Payload decoding | Raw bytes, JSON-compatible raw payloads, and Confluent-framed Protobuf |
| Schema Registry | Resolves Protobuf descriptors, caches schemas in memory and on disk, supports subject strategies |
| Schema evolution | Adds compatible nullable Protobuf fields and pauses readiness on breaking changes |
| Hot reads | Exposes local read-state RPC over a Unix socket with Arrow IPC rows and committed file references |
| Iceberg writes | Writes Parquet files and commits real Iceberg REST metadata through iceberg-rust |
| Durability design | Records offsets, flushes, files, schema events, and idempotency data in an append-only transaction log |
| Operations | HTTP health/readiness, Prometheus metrics, CLI commands, generated man pages, and Docker E2E scripts |
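To make the "Confluent-framed Protobuf" decoding concrete, here is a minimal, illustrative sketch of the Confluent wire format: a 0x00 magic byte, a big-endian u32 schema ID, then (for Protobuf) a zigzag-varint list of message indexes locating the message inside the registered .proto, followed by the serialized payload. This is not K2I's actual decoder, just the framing it has to strip before handing bytes to a Protobuf descriptor.

```rust
/// Parse a Confluent-framed Protobuf record into (schema ID, message
/// indexes, remaining payload). Returns None for non-framed input.
fn parse_confluent_frame(buf: &[u8]) -> Option<(u32, Vec<i64>, &[u8])> {
    if buf.len() < 5 || buf[0] != 0x00 {
        return None; // no 0x00 magic byte: not Confluent-framed
    }
    let schema_id = u32::from_be_bytes([buf[1], buf[2], buf[3], buf[4]]);
    let mut pos = 5;
    // First varint is the length of the message-index array.
    let (count, used) = read_zigzag_varint(&buf[pos..])?;
    pos += used;
    let mut indexes = Vec::with_capacity(count.max(0) as usize);
    for _ in 0..count {
        let (idx, used) = read_zigzag_varint(&buf[pos..])?;
        pos += used;
        indexes.push(idx);
    }
    Some((schema_id, indexes, &buf[pos..]))
}

/// Read one zigzag-encoded signed varint; returns (value, bytes consumed).
fn read_zigzag_varint(buf: &[u8]) -> Option<(i64, usize)> {
    let mut raw: u64 = 0;
    for (i, &b) in buf.iter().enumerate().take(10) {
        raw |= ((b & 0x7f) as u64) << (7 * i);
        if b & 0x80 == 0 {
            let value = ((raw >> 1) as i64) ^ -((raw & 1) as i64);
            return Some((value, i + 1));
        }
    }
    None // varint runs past the buffer (or 10-byte limit)
}

fn main() {
    // Magic 0x00, schema ID 7, empty index array (top-level message), payload "hi".
    let frame = [0x00, 0, 0, 0, 7, 0x00, b'h', b'i'];
    let (id, idx, payload) = parse_confluent_frame(&frame).unwrap();
    assert_eq!(id, 7);
    assert!(idx.is_empty());
    assert_eq!(payload, b"hi");
}
```

In the real wire format a single 0x00 index byte is the common case (first message in the schema); nested message types produce a longer index list.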

Architecture

flowchart LR
    kafka[(Kafka topic)]
    registry[(Schema Registry)]
    client[Local read client]
    metrics[Health / Metrics]

    subgraph k2i[K2I Rust process]
        consumer[Kafka consumer<br/>manual offsets + backpressure]
        decoder[Decoder<br/>raw / Protobuf]
        hot[Arrow hot buffer<br/>read LSNs]
        wal[Transaction log<br/>offsets + idempotency]
        writer[Parquet writer]
        rpc[Read-state RPC<br/>Unix socket]
    end

    store[(Object storage<br/>Parquet data files)]
    catalog[(Iceberg catalog<br/>REST / Glue / Hive / Nessie)]
    engines[Query engines<br/>DuckDB / Trino / Spark]

    kafka --> consumer --> decoder --> hot --> writer --> store
    registry --> decoder
    hot --> rpc --> client
    decoder --> wal
    writer --> wal
    writer --> catalog
    catalog --> engines
    store --> engines
    k2i --> metrics

K2I separates hot and cold visibility:

flowchart TB
    record[Kafka record accepted]
    hot[Hot path<br/>Arrow buffer + read-state RPC]
    flush[Flush trigger<br/>time / size / count]
    cold[Cold path<br/>Parquet + Iceberg snapshot]
    query_hot[Local freshness reads]
    query_cold[Lakehouse queries]

    record --> hot --> query_hot
    hot --> flush --> cold --> query_cold

Hot-path visibility is local and intended for co-located readers or sidecars. Cold-path visibility depends on flush thresholds, object-store writes, and Iceberg catalog commit timing.
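The flush trigger in the diagram fires on any of three thresholds (time, size, or count). A minimal sketch of that policy, with field names chosen for illustration rather than taken from K2I's actual internals:

```rust
use std::time::{Duration, Instant};

/// Illustrative flush policy mirroring the time / size / count triggers
/// described above (names are assumptions, not K2I's real types).
struct FlushPolicy {
    max_age: Duration, // cf. flush_interval_seconds
    max_bytes: usize,  // cf. max_size_mb
    max_rows: usize,   // cf. flush_batch_size
}

/// State of the current Arrow hot buffer.
struct BufferState {
    opened_at: Instant,
    bytes: usize,
    rows: usize,
}

impl FlushPolicy {
    /// A flush fires as soon as any single threshold is crossed.
    fn should_flush(&self, buf: &BufferState, now: Instant) -> bool {
        now.duration_since(buf.opened_at) >= self.max_age
            || buf.bytes >= self.max_bytes
            || buf.rows >= self.max_rows
    }
}

fn main() {
    let policy = FlushPolicy {
        max_age: Duration::from_secs(30),
        max_bytes: 500 * 1024 * 1024,
        max_rows: 10_000,
    };
    // Row count alone trips the flush even though size and age have not.
    let buf = BufferState { opened_at: Instant::now(), bytes: 1024, rows: 10_000 };
    assert!(policy.should_flush(&buf, Instant::now()));
}
```

Because cold-path visibility waits for one of these triggers plus the object-store write and catalog commit, tuning them trades Iceberg freshness against file count and commit rate.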

K2I vs Alternatives

| Dimension | K2I | Kafka Connect Iceberg Sink | Flink Iceberg Sink | Spark Micro-Batch | Confluent TableFlow | Moonlink |
| --- | --- | --- | --- | --- | --- | --- |
| Primary fit | Final-form Kafka events to Iceberg | Connector-based ingestion | Stream processing and transforms | Batch/micro-batch ETL | Managed Confluent pipeline | Postgres CDC to Iceberg |
| Deployment | Single Rust binary/container | Kafka Connect cluster | Flink cluster | Spark runtime | Managed service | Service/extension stack |
| Transformations | Intentionally minimal | SMT/basic connector config | Strong | Strong | Limited/managed | CDC-focused |
| Hot reads | Local Arrow read-state RPC | No | No native local hot path | No | No local hot path | CDC-oriented |
| Schema path | Confluent Protobuf additive evolution | Connector/schema dependent | Engine dependent | Job dependent | Managed | CDC/schema dependent |
| Choose when | Events are already analytics-shaped | You already run Connect | You need joins/windows/state | Batch jobs are acceptable | You use Confluent Cloud | Source is Postgres |

See comparisons for the longer decision guide.

Installation

Download the latest binary from the GitHub Releases page.

macOS

brew install osodevops/tap/k2i

Linux / macOS Shell Installer

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/osodevops/k2i/releases/latest/download/k2i-cli-installer.sh | sh

Docker

docker pull ghcr.io/osodevops/k2i:latest
docker run --rm -v /path/to/config:/etc/k2i ghcr.io/osodevops/k2i:latest ingest --config /etc/k2i/config.toml

From Source

git clone https://github.com/osodevops/k2i.git
cd k2i
cargo build --release

Binary location: target/release/k2i

Source builds require Rust 1.75+, CMake, and OpenSSL development libraries. Kerberos/GSSAPI support is not enabled by default; if you need it, add the gssapi feature to rdkafka and install the matching SASL development libraries for your platform.

Quick Start

Create a minimal configuration:

[kafka]
bootstrap_servers = ["localhost:9092"]
topic = "events"
consumer_group = "k2i-ingestion"

[kafka.format]
type = "raw"

[schema_evolution]
mode = "auto-additive"
on_breaking_change = "pause"
schema_update_min_interval_seconds = 60

[buffer]
max_size_mb = 500
flush_interval_seconds = 30
flush_batch_size = 10000

[iceberg]
catalog_type = "rest"
warehouse_path = "s3://my-bucket/warehouse"
database_name = "analytics"
table_name = "events"
rest_uri = "http://localhost:8181"

[transaction_log]
log_dir = "./transaction_logs"

[monitoring]
metrics_port = 9090
health_port = 8080

[rpc]
enabled = false

Validate and run:

k2i validate --config config.toml
k2i ingest --config config.toml

Monitor:

k2i status --url http://localhost:8080
curl http://localhost:8080/health
curl http://localhost:9090/metrics

Use config/example.toml and the configuration reference for the complete set of options.
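The `[schema_evolution]` block above sets `mode = "auto-additive"` with `on_breaking_change = "pause"`. As an illustrative sketch (not K2I's actual compatibility checker), an additive check accepts a new schema only when every existing field is unchanged and every added field is optional, so old rows stay readable as NULLs; anything else would pause readiness:

```rust
use std::collections::HashMap;

/// Simplified Protobuf field descriptor for the compatibility sketch.
#[derive(Clone, PartialEq)]
struct Field {
    number: i32,
    type_name: String,
    optional: bool,
}

/// Accept the new schema only if it is a purely additive, nullable change.
fn is_additive(old: &HashMap<String, Field>, new: &HashMap<String, Field>) -> bool {
    // Every existing field must survive with the same number and type.
    if !old.iter().all(|(name, f)| new.get(name) == Some(f)) {
        return false;
    }
    // Every added field must be optional so historical rows decode as NULL.
    new.iter()
        .filter(|(name, _)| !old.contains_key(*name))
        .all(|(_, f)| f.optional)
}

fn main() {
    let field = |n, t: &str, opt| Field { number: n, type_name: t.into(), optional: opt };
    let old: HashMap<String, Field> =
        HashMap::from([("id".to_string(), field(1, "int64", false))]);

    let mut new = old.clone();
    new.insert("note".to_string(), field(2, "string", true));
    assert!(is_additive(&old, &new)); // adding a nullable field is accepted

    new.insert("req".to_string(), field(3, "string", false));
    assert!(!is_additive(&old, &new)); // a required field would pause readiness
}
```

`schema_update_min_interval_seconds` then rate-limits how often such an accepted change is applied.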

What Is Validated

The current implementation has been verified locally with:

cargo fmt --all --check
git diff --check
cargo check --workspace --all-targets
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace --no-fail-fast
cargo run -p k2i-cli -- completions man --output-dir docs/man/man1
cargo test -p k2i-cli --test man_pages --no-fail-fast
scripts/e2e-docker.sh
K2I_E2E_LOAD_MESSAGES=100000 scripts/e2e-docker-load.sh
scripts/e2e-docker-iceberg.sh
K2I_E2E_LOAD_MESSAGES=100000 scripts/e2e-docker-iceberg-load.sh

The Docker flows cover Protobuf schema evolution, schema-pause readiness behavior, read-state RPC, direct Parquet validation with DuckDB, real Iceberg REST metadata commits, snapshot growth, and DuckDB iceberg_scan.

Current Release Caveats

K2I is ready for a first public release as a production-oriented Kafka-to-Iceberg ingestion engine. These areas remain explicit follow-up items before broad production rollout:

  • Multi-partition flush and offset commit behavior needs continued hardening.
  • Startup recovery computes state, but Kafka seeking/deduplication and startup orphan cleanup need further wiring.
  • Kafka offset commits are async; broker durability acknowledgement is not confirmed by the current helper.
  • Transaction-log entries are flushed, but not every entry is fsynced individually.
  • GCS and Azure object-store configuration is declared, but writer creation is not complete for those backends.
  • Maintenance commands and task implementations exist; scheduler wiring should be reviewed for each deployment.

See Production Readiness for the detailed review checklist.

Documentation

| Guide | Description |
| --- | --- |
| Kafka to Iceberg | Main explanation of the K2I data path |
| Quickstart | Local proof and first manual run |
| Configuration | Complete TOML reference |
| Architecture | System design, ordering, and hot/cold reads |
| Comparisons | K2I vs Kafka Connect, Flink, Spark, TableFlow, and Moonlink |
| DuckDB Iceberg Validation | How local Docker E2E proves real Iceberg metadata |
| Schema Registry Protobuf | Protobuf decoding and schema evolution |
| Iceberg REST Catalog | REST catalog commit path and backend caveats |
| Commands | CLI command reference |
| Man Pages | Generated man pages for every CLI command and subcommand |
| Deployment | Deployment patterns and operational notes |
| Troubleshooting | Common issues and recovery guidance |
| FAQ | Short answers for common user questions |
| Production Readiness | Verification status, caveats, and follow-up issues |

Project Structure

k2i/
|-- crates/
|   |-- k2i-core/         # Core ingestion library
|   |-- k2i-cli/          # CLI binary and HTTP server
|   |-- k2i-rpc/          # Read-state protocol types and framing
|   |-- k2i-rpc-server/   # Unix socket RPC server
|   `-- k2i-e2e-runner/   # Docker E2E producer/verifier
|-- config/               # Example configuration
|-- docker/e2e/           # Local E2E compose stacks
|-- docs/                 # Current release docs plus historical archive
`-- scripts/              # E2E wrapper scripts

FAQ

Is K2I a Kafka Connect plugin?

No. K2I is a standalone Rust service with its own Kafka consumer, transaction log, writer, CLI, health server, and metrics server.

Does K2I provide exactly-once delivery?

K2I is designed for exactly-once-style durability by combining manual Kafka offset management, transaction-log recovery records, idempotency records, immutable Parquet writes, and atomic Iceberg commits. See the production-readiness caveats for the remaining hardening work.
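As an illustrative sketch of the idempotency side of that design (not K2I's actual transaction-log format), startup can replay logged (partition, offset) idempotency records into an index and skip any Kafka record that was already durably recorded:

```rust
use std::collections::HashSet;

/// Identity of a Kafka record for dedup purposes.
#[derive(Hash, PartialEq, Eq, Clone, Copy)]
struct RecordId {
    partition: i32,
    offset: i64,
}

/// In-memory dedup index rebuilt from an append-only log on startup.
struct IdempotencyIndex {
    seen: HashSet<RecordId>,
}

impl IdempotencyIndex {
    /// Recover the index by replaying logged idempotency records.
    fn recover(log: &[RecordId]) -> Self {
        Self { seen: log.iter().copied().collect() }
    }

    /// Returns true if the record is new and should be ingested;
    /// false if it was already processed before the restart.
    fn accept(&mut self, id: RecordId) -> bool {
        self.seen.insert(id)
    }
}

fn main() {
    let log = [
        RecordId { partition: 0, offset: 41 },
        RecordId { partition: 0, offset: 42 },
    ];
    let mut idx = IdempotencyIndex::recover(&log);
    assert!(!idx.accept(RecordId { partition: 0, offset: 42 })); // replay is skipped
    assert!(idx.accept(RecordId { partition: 0, offset: 43 })); // new record ingested
}
```

Combined with immutable Parquet files and atomic Iceberg commits, re-delivered records after a crash become no-ops rather than duplicate rows.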

How fresh is data in K2I?

Recent rows can be visible through the local read-state RPC before the next cold commit. Iceberg query engines see data after a flush writes Parquet and commits an Iceberg snapshot.

Can DuckDB read tables written by K2I?

Yes. The Docker Iceberg E2E validates K2I output with DuckDB direct Parquet reads and DuckDB iceberg_scan against real Iceberg REST metadata.

Is K2I a CDC tool?

No. K2I is Kafka-native and optimized for append-oriented event streams. CDC updates/deletes and deletion vectors are outside the current scope.

See the full FAQ.

Looking for Enterprise Apache Kafka Support?

OSO engineers are focused on deploying, operating, and maintaining Apache Kafka platforms. If you need SLA-backed support, security review, deployment help, or a broader data lakehouse strategy, contact enquiries@oso.sh or visit oso.sh/contact.

Contributing

  • Report bugs with a minimal reproduction and relevant config.
  • Suggest features with the target workflow and failure mode.
  • Run the verification commands above before opening a PR that changes ingestion, schema, catalog, or CLI behavior.
  • Regenerate man pages after changing CLI help text, flags, or subcommands:
cargo run -p k2i-cli -- completions man --output-dir docs/man/man1

License

K2I is licensed under the Apache License 2.0.

Acknowledgments

K2I draws architectural inspiration from Moonlink by Mooncake Labs, adapted for Kafka-native final-form event streams rather than Postgres CDC.

