model/iceberg_mode: generalize config to key/value/headers sections#30467
model/iceberg_mode: generalize config to key/value/headers sections#30467wdberkeley wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR generalizes model::iceberg_mode from legacy flat variants into a structured key/value/headers configuration while preserving legacy string and wire compatibility where possible.
Changes:
- Replaces the internal
iceberg_moderepresentation with disabled/enabled section configs. - Adds parsing, formatting, serde round-trip behavior, and Boost tests for legacy and new config strings.
- Migrates datalake and metrics code to use the new value-section API.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/v/model/metadata.h |
Defines the new structured iceberg_mode API and config types. |
src/v/model/model.cc |
Implements parsing, formatting, and wire serialization for legacy and section-based modes. |
src/v/model/tests/iceberg_mode_test.cc |
Adds coverage for parsing, formatting, string round-trips, and wire round-trips. |
src/v/model/tests/BUILD |
Registers the new Boost test target. |
src/v/datalake/datalake_manager.cc |
Switches datalake resolver/translator selection to the value section. |
src/v/cluster/metrics_reporter.cc |
Updates Iceberg metrics classification to use the new API. |
docs/iceberg-mode-config-design.md |
Adds the design document for the generalized Iceberg mode configuration. |
Comments suppressed due to low confidence (2)
src/v/model/model.cc:838
- This compatibility shortcut drops non-default fields that are not part of the legacy encoding, such as
headers:on_decode_error=dropwith the defaultvalue_type=binary(and key/valuesubjectorprotobuf_namewhen their mode remains binary). Those strings parse intoenabled_implvalues that compare unequal after a wire or string round-trip, despite being accepted by the new grammar. Either reject/normalize options that are not meaningful for the selected mode during parsing, or include all non-default stored fields in the check before using a legacy discriminant.
if (
e.key.mode == iceberg_mode::schema_mode::binary
&& e.headers.value_type == iceberg_mode::header_value_type::binary) {
src/v/model/model.cc:916
- The legacy formatting path is selected based only on
key.modeandheaders.value_type, so accepted configs with other non-default fields (for exampleheaders:on_decode_error=dropwhilevalue_typedefaults to binary, or asubjecton a binary key/value section) are formatted as a legacy token and lose those fields on parse. Please either reject/normalize those fields when they are not meaningful, or only choose the legacy string when the full stored config is equivalent to that legacy mode.
if (
e.key.mode == schema_mode::binary
&& e.headers.value_type == header_value_type::binary) {
Retry command for Build#84416please wait until all jobs are finished before running the slash command |
3d92530 to
f8deed0
Compare
Replaces the flat 4-variant iceberg_mode (disabled, key_value,
value_schema_id_prefix, value_schema_latest) with a structured
section-based config supporting independent key, value, and headers
options.
New config grammar:
<section> (";" <section>)*
section ::= ("key"|"value"|"headers") ":" <opts>
Key/value opts: mode=(binary|schema_id_prefix|schema_latest),
subject=<str>, protobuf_name=<str>
Headers opts: value_type=(binary|string),
on_decode_error=(replace|null|drop)
All old config strings (disabled, key_value, value_schema_id_prefix,
value_schema_latest[:subject=S,protobuf_name=N]) remain valid and parse
into equivalent enabled_impl objects. Wire format preserves old
discriminants 0-3 for configs expressible in the old format; discriminant
4 is added for configs requiring key schema or string headers.
The key and headers sections are fully parsed, validated, and
round-tripped but not yet acted on by the translation pipeline — that
is Rocks 2 and 3. datalake_manager.cc is migrated to the new API and
currently reads only the value section's schema mode.
f8deed0 to
aac388c
Compare
Retry command for Build#84517please wait until all jobs are finished before running the slash command |
| /// Action to take when a header value cannot be decoded as UTF-8. | ||
| /// Only meaningful when header_value_type == string. | ||
| enum class header_on_decode_error : uint8_t { | ||
| replace, ///< Replace invalid byte sequences with U+FFFD. | ||
| null, ///< Set the header value to null. | ||
| drop, ///< Drop the entire header entry. | ||
| }; |
There was a problem hiding this comment.
Just wondering if the header_on_decode_error is a hard requirement. It's a bit hard to articulate why I'd ever want to choose anything but replace. Was it a customer ask?
If not, maybe we just do replace by default and omit on_decode_error as a section, and in the future add whatever else we might want?
| static iceberg_mode disabled; | ||
| static iceberg_mode key_value; | ||
| static iceberg_mode value_schema_id_prefix; |
| constexpr uint8_t wire_value_schema_latest = 3; | ||
| constexpr uint8_t wire_key_and_value_schema = 4; | ||
|
|
||
| static ss::logger iceberg_mode_log("iceberg_mode"); |
There was a problem hiding this comment.
nit: kind of an odd choice, relative to the rest of the codebase
| if (v == "binary") { | ||
| cfg.mode = iceberg_mode::schema_mode::binary; | ||
| } else if (v == "schema_id_prefix") { | ||
| cfg.mode = iceberg_mode::schema_mode::schema_id_prefix; | ||
| } else if (v == "schema_latest") { | ||
| cfg.mode = iceberg_mode::schema_mode::schema_latest; | ||
| } else { | ||
| return false; | ||
| } |
| // Bit flags tracking seen known sections: bit 0=key, 1=value, 2=headers. | ||
| uint8_t seen_known = 0; |
There was a problem hiding this comment.
Why is it important to use bit flags here, vs storing the set of seen sections?
| cfg.subject = {}; | ||
| cfg.protobuf_name = {}; |
There was a problem hiding this comment.
Here and for the header on_decode_error, WDYT about making it an error instead? By being permissive I'm a little worried that a bad config could sneak in and surprise is one day
| ], | ||
| ) | ||
|
|
||
| redpanda_cc_btest( |
There was a problem hiding this comment.
nit: new tests should prefer gtest, unless there's something specific in boost you are trying to leverage?
| BOOST_CHECK_EQUAL(*m, model::iceberg_mode::disabled); | ||
| } | ||
|
|
||
| // --- legacy strings --- |
There was a problem hiding this comment.
It'd also be good to make sure the ducktape upgrade tests that use Iceberg are tested with all the legacy strings
Replaces the flat 4-variant iceberg_mode (disabled, key_value, value_schema_id_prefix, value_schema_latest) with a structured section-based config supporting independent key, value, and headers options.
New config grammar:
Key/value opts: mode=(binary|schema_id_prefix|schema_latest),
subject=, protobuf_name=
Headers opts: value_type=(binary|string),
on_decode_error=(replace|null|drop)
All old config strings (disabled, key_value, value_schema_id_prefix, value_schema_latest[:subject=S,protobuf_name=N]) remain valid and parse into equivalent enabled_impl objects. Wire format preserves old discriminants 0-3 for configs expressible in the old format; discriminant 4 is added for configs requiring key schema or string headers.
The key and headers sections are fully parsed, validated, and round-tripped but not yet acted on by the translation pipeline. datalake_manager.cc is migrated to the new API and currently reads only the value section's schema mode.
Follow-ups will implement key and header translation.
Release notes come when we are landing the final PR, not this initial one.
Backports Required
Release Notes