Skip to content

Commit ebce99f

Browse files
g-talbotclaude
andcommitted
feat(merge): legacy-prefix promotion path + schema-evolution body cols
Two adversarial-review follow-ups grouped because they share the streaming engine's input-routing and union-schema seams. ## (b) Legacy-prefix promotion A new operation type pairs a prefix_len=0 split with prefix_len>0 peers in one merge, so legacy splits can be folded into prefix- aligned buckets instead of aging out via retention. Adds: - `ParquetMergeOperation::promote_legacy(splits, target_prefix_len)`: relaxes MP-3 to allow mixed `rg_partition_prefix_len` as long as every input is `<= target`. Sort_fields + window equality unchanged. - `ParquetMergeOperation::target_prefix_len_override: Option<u32>` field records the promotion target; `None` is the default regular-merge form. - `merge_parquet_split_metadata(..., mixed_prefix_ok)`: skips the input-side prefix-len equality check in promotion mode. The output prefix_len still comes from the writer's KV stamp via `MergeOutputFile.output_rg_partition_prefix_len` (CS-1 holds by construction post-F1). - `merge::execute_merge_operation(op, sources, ...)`: new thin executor that opens each input as either `LegacyInputAdapter` (when `split.rg_partition_prefix_len < target`) or `StreamingParquetReader` (otherwise), then feeds them to the streaming engine. Becomes the seam PR-7 will wire from above. Tests: - `test_promote_legacy_pairs_legacy_with_aligned_peer`, `test_promote_legacy_rejects_higher_prefix_input`, `test_promote_legacy_still_enforces_sort_fields`, `test_promote_legacy_all_at_target_is_valid`. - `test_mixed_prefix_ok_skips_input_equality_check`. - `test_promote_legacy_executor_end_to_end`: legacy single-RG + aligned multi-RG → 3-RG output passing `assert_unique_rg_prefix_keys` with `prefix_len = 1`, plus metastore CS-1. - `test_executor_mismatched_sources_count_bails`. ## F6 + F13: Schema evolution for body columns The merger now supports MC-4 across heterogeneous body-col schemas: - F6: `normalize_type` collapses `Binary`/`LargeBinary` (and dict variants) to `Binary`, analogous to the existing string-flavour collapse. Two inputs whose body col differs only by byte-array flavour merge cleanly; before this they hit a "type conflict" at alignment time. - F13: `streaming_writer.rs::write_list_via_serialized_column_writer` (renamed from `..._non_nullable_...`) now handles nullable outer `List<T>` / `LargeList<T>`. MC-4 forces the union to be nullable when a List col is present in only some inputs; before this the writer rejected the merged output. Uses Dremel max_def_level = 2 (0 = outer null, 1 = empty list, 2 = element present) for nullable outer; non-nullable path unchanged. Test: `test_mc2_mixed_schemas_round_trip` builds two inputs A and B with the same sort schema but different body cols (Utf8 vs Dict<Utf8>, LargeBinary vs Binary, List<Float64> in A only, Int32 A-only, Int64 B-only, common Float64). The merge produces the union schema; per-row rendering via `render_cell` matches across flavour boundaries; List cells from B render as nulls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 8882c0f commit ebce99f

7 files changed

Lines changed: 938 additions & 116 deletions

File tree

quickwit/quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_executor.rs

Lines changed: 15 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -231,11 +231,23 @@ impl Handler<ParquetMergeScratch> for ParquetMergeExecutor {
231231
return Ok(());
232232
}
233233

234+
// `mixed_prefix_ok` matches the operation's promotion mode:
235+
// promote-legacy operations bundle inputs from different
236+
// `rg_partition_prefix_len` buckets (the adapter normalizes
237+
// them at read time), so the input-side equality check in
238+
// `merge_parquet_split_metadata` would spuriously fail. Regular
239+
// merges keep the strict check.
240+
let mixed_prefix_ok = scratch
241+
.merge_operation
242+
.target_prefix_len_override
243+
.is_some();
244+
234245
let mut merged_splits = Vec::with_capacity(outputs.len());
235246
for output in &outputs {
236-
let mut metadata = merge_parquet_split_metadata(input_splits, output)
237-
.context("failed to build merge output metadata")
238-
.map_err(|e| ActorExitStatus::from(anyhow::anyhow!(e)))?;
247+
let mut metadata =
248+
merge_parquet_split_metadata(input_splits, output, mixed_prefix_ok)
249+
.context("failed to build merge output metadata")
250+
.map_err(|e| ActorExitStatus::from(anyhow::anyhow!(e)))?;
239251

240252
// Use the split ID that was assigned when the merge operation was
241253
// planned, rather than the one generated inside

quickwit/quickwit-parquet-engine/src/merge/metadata_aggregation.rs

Lines changed: 60 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -40,10 +40,17 @@ use crate::split::{ParquetSplitId, ParquetSplitMetadata};
4040
/// # Preconditions
4141
///
4242
/// All input splits must share the same kind, index_uid, partition_id,
43-
/// sort_fields, and window.
43+
/// sort_fields, and window. In the default case (`mixed_prefix_ok = false`)
44+
/// they must also share `rg_partition_prefix_len`. In legacy-promotion
45+
/// mode (`mixed_prefix_ok = true`) the prefix-len equality check is
46+
/// skipped because inputs come from different prefix buckets — the
47+
/// output's prefix_len is taken from the writer's KV stamp via
48+
/// `output.output_rg_partition_prefix_len` (CS-1), so the input-side
49+
/// equality is no longer load-bearing for the metastore record.
4450
pub fn merge_parquet_split_metadata(
4551
inputs: &[ParquetSplitMetadata],
4652
output: &MergeOutputFile,
53+
mixed_prefix_ok: bool,
4754
) -> Result<ParquetSplitMetadata> {
4855
if inputs.is_empty() {
4956
bail!("merge_parquet_split_metadata requires at least one input split");
@@ -93,10 +100,11 @@ pub fn merge_parquet_split_metadata(
93100
first.window
94101
);
95102
}
96-
if input.rg_partition_prefix_len != first.rg_partition_prefix_len {
103+
if !mixed_prefix_ok && input.rg_partition_prefix_len != first.rg_partition_prefix_len {
97104
bail!(
98105
"input {} has rg_partition_prefix_len {}, expected {} — splits with different \
99-
prefix lengths must not appear in the same merge",
106+
prefix lengths must not appear in the same regular merge (legacy-promotion \
107+
operations bypass this check)",
100108
i,
101109
input.rg_partition_prefix_len,
102110
first.rg_partition_prefix_len
@@ -248,7 +256,7 @@ mod tests {
248256
make_test_split("s1", (1200, 2000), 0),
249257
];
250258
let output = make_output(200, 9000);
251-
let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
259+
let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
252260

253261
// Invariant fields come from inputs.
254262
assert_eq!(result.kind, ParquetSplitKind::Metrics);
@@ -267,7 +275,7 @@ mod tests {
267275
make_test_split("s1", (1200, 2000), 0),
268276
];
269277
let output = make_output_with_metadata(200, 9000, (1000, 2000), &["cpu.usage", "mem.used"]);
270-
let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
278+
let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
271279

272280
// Data-dependent fields come from the output, not inputs.
273281
assert_eq!(result.time_range.start_secs, 1000);
@@ -302,7 +310,7 @@ mod tests {
302310
.or_default()
303311
.insert("api".to_string());
304312

305-
let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
313+
let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
306314

307315
let service_values = result.low_cardinality_tags.get("service").unwrap();
308316
assert_eq!(service_values.len(), 2);
@@ -323,7 +331,7 @@ mod tests {
323331
.insert(format!("host-{i}"));
324332
}
325333

326-
let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
334+
let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
327335

328336
assert!(result.high_cardinality_tag_keys.contains("host"));
329337
assert!(!result.low_cardinality_tags.contains_key("host"));
@@ -337,15 +345,15 @@ mod tests {
337345
make_test_split("s2", (1000, 2000), 2),
338346
];
339347
let output = make_output(300, 12000);
340-
let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
348+
let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
341349

342350
assert_eq!(result.num_merge_ops, 3); // max(2,2,2) + 1
343351
}
344352

345353
#[test]
346354
fn test_empty_inputs_error() {
347355
let output = make_output(0, 0);
348-
let result = merge_parquet_split_metadata(&[], &output);
356+
let result = merge_parquet_split_metadata(&[], &output, false);
349357
assert!(result.is_err());
350358
assert!(
351359
result
@@ -362,7 +370,7 @@ mod tests {
362370
s1.kind = ParquetSplitKind::Sketches;
363371

364372
let output = make_output(200, 9000);
365-
let result = merge_parquet_split_metadata(&[s0, s1], &output);
373+
let result = merge_parquet_split_metadata(&[s0, s1], &output, false);
366374
assert!(result.is_err());
367375
}
368376

@@ -373,7 +381,7 @@ mod tests {
373381
s1.index_uid = "other-index:00000000000000000000000002".to_string();
374382

375383
let output = make_output(200, 9000);
376-
let result = merge_parquet_split_metadata(&[s0, s1], &output);
384+
let result = merge_parquet_split_metadata(&[s0, s1], &output, false);
377385
assert!(result.is_err());
378386
}
379387

@@ -384,7 +392,7 @@ mod tests {
384392
s1.partition_id = 99;
385393

386394
let output = make_output(200, 9000);
387-
let result = merge_parquet_split_metadata(&[s0, s1], &output);
395+
let result = merge_parquet_split_metadata(&[s0, s1], &output, false);
388396
assert!(result.is_err());
389397
}
390398

@@ -395,7 +403,7 @@ mod tests {
395403
s1.sort_fields = "different|schema/V2".to_string();
396404

397405
let output = make_output(200, 9000);
398-
let result = merge_parquet_split_metadata(&[s0, s1], &output);
406+
let result = merge_parquet_split_metadata(&[s0, s1], &output, false);
399407
assert!(result.is_err());
400408
}
401409

@@ -406,7 +414,7 @@ mod tests {
406414
s1.window = Some(2000..5600);
407415

408416
let output = make_output(200, 9000);
409-
let result = merge_parquet_split_metadata(&[s0, s1], &output);
417+
let result = merge_parquet_split_metadata(&[s0, s1], &output, false);
410418
assert!(result.is_err());
411419
}
412420

@@ -417,7 +425,7 @@ mod tests {
417425
s1.rg_partition_prefix_len = 1;
418426

419427
let output = make_output(200, 9000);
420-
let result = merge_parquet_split_metadata(&[s0, s1], &output);
428+
let result = merge_parquet_split_metadata(&[s0, s1], &output, false);
421429
let err = result.expect_err("merge must reject mismatched prefix lengths");
422430
let msg = err.to_string();
423431
assert!(
@@ -442,7 +450,7 @@ mod tests {
442450
// num_row_groups = 2 + writer reports demoted prefix_len = 0
443451
// (the legacy writer's choice for a row-count-driven multi-RG).
444452
let output = make_output_full_with_prefix(200, 9000, 2, 0, (1000, 2000), &["cpu.usage"]);
445-
let result = merge_parquet_split_metadata(&[s0, s1], &output).unwrap();
453+
let result = merge_parquet_split_metadata(&[s0, s1], &output, false).unwrap();
446454
assert_eq!(result.rg_partition_prefix_len, 0);
447455
}
448456

@@ -457,10 +465,42 @@ mod tests {
457465
s1.rg_partition_prefix_len = 3;
458466

459467
let output = make_output_full_with_prefix(200, 9000, 1, 3, (1000, 2000), &["cpu.usage"]);
460-
let result = merge_parquet_split_metadata(&[s0, s1], &output).unwrap();
468+
let result = merge_parquet_split_metadata(&[s0, s1], &output, false).unwrap();
461469
assert_eq!(result.rg_partition_prefix_len, 3);
462470
}
463471

472+
#[test]
473+
fn test_mixed_prefix_ok_skips_input_equality_check() {
474+
// Promotion mode: inputs come from different prefix buckets
475+
// (e.g. one prefix_len=0 legacy + one prefix_len=2 aligned).
476+
// With `mixed_prefix_ok = true` the aggregator must accept this
477+
// and take the output prefix from the writer's stamp.
478+
let mut legacy = make_test_split("s0", (1000, 2000), 0);
479+
legacy.rg_partition_prefix_len = 0;
480+
let mut aligned = make_test_split("s1", (1000, 2000), 0);
481+
aligned.rg_partition_prefix_len = 2;
482+
483+
// Writer stamps prefix_len = 2 on the multi-RG output (streaming
484+
// engine output that successfully promoted the legacy input).
485+
let output = make_output_full_with_prefix(300, 12000, 3, 2, (1000, 2000), &["cpu.usage"]);
486+
487+
let result =
488+
merge_parquet_split_metadata(&[legacy.clone(), aligned.clone()], &output, true)
489+
.expect("mixed-prefix inputs must be accepted in promotion mode");
490+
assert_eq!(
491+
result.rg_partition_prefix_len, 2,
492+
"output prefix matches the writer's stamp (CS-1)",
493+
);
494+
495+
// Same inputs without the mixed_prefix_ok flag must still fail.
496+
let strict = merge_parquet_split_metadata(&[legacy, aligned], &output, false);
497+
let err = strict.expect_err("strict mode must reject mixed-prefix inputs");
498+
assert!(
499+
err.to_string().contains("rg_partition_prefix_len"),
500+
"error should mention the prefix-len mismatch, got: {err}",
501+
);
502+
}
503+
464504
#[test]
465505
fn test_output_prefix_len_preserved_on_multi_rg_streaming_engine() {
466506
// CS-1 regression for F1: the streaming engine produces
@@ -479,7 +519,7 @@ mod tests {
479519
// num_row_groups = 3 (multi-RG) AND writer reports prefix_len = 2
480520
// (the streaming engine's stamp because it verified alignment).
481521
let output = make_output_full_with_prefix(300, 12000, 3, 2, (1000, 2000), &["cpu.usage"]);
482-
let result = merge_parquet_split_metadata(&[s0, s1], &output).unwrap();
522+
let result = merge_parquet_split_metadata(&[s0, s1], &output, false).unwrap();
483523
assert_eq!(
484524
result.rg_partition_prefix_len, 2,
485525
"metastore must mirror the writer's KV (CS-1); multi-RG aligned output keeps its \
@@ -494,7 +534,7 @@ mod tests {
494534
make_test_split("s1", (1000, 2000), 0),
495535
];
496536
let output = make_output(200, 9000);
497-
let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
537+
let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
498538

499539
assert_ne!(result.split_id.as_str(), "s0");
500540
assert_ne!(result.split_id.as_str(), "s1");
@@ -510,7 +550,7 @@ mod tests {
510550
output.row_keys_proto = None;
511551
output.zonemap_regexes = HashMap::new();
512552

513-
let result = merge_parquet_split_metadata(&inputs, &output).unwrap();
553+
let result = merge_parquet_split_metadata(&inputs, &output, false).unwrap();
514554

515555
assert!(result.row_keys_proto.is_none());
516556
assert!(result.zonemap_regexes.is_empty());

quickwit/quickwit-parquet-engine/src/merge/mod.rs

Lines changed: 79 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,7 @@ mod writer;
3131
mod tests;
3232

3333
use std::path::{Path, PathBuf};
34+
use std::sync::Arc;
3435

3536
use anyhow::{Context, Result, bail};
3637
use arrow::array::RecordBatch;
@@ -41,8 +42,9 @@ pub use self::merge_order::MergeRun;
4142
use crate::sort_fields::{equivalent_schemas_for_compaction, parse_sort_fields};
4243
use crate::sorted_series::SORTED_SERIES_COLUMN;
4344
use crate::storage::{
44-
PARQUET_META_NUM_MERGE_OPS, PARQUET_META_RG_PARTITION_PREFIX_LEN, PARQUET_META_SORT_FIELDS,
45-
PARQUET_META_WINDOW_DURATION, PARQUET_META_WINDOW_START, ParquetWriterConfig,
45+
ColumnPageStream, LegacyInputAdapter, PARQUET_META_NUM_MERGE_OPS,
46+
PARQUET_META_RG_PARTITION_PREFIX_LEN, PARQUET_META_SORT_FIELDS, PARQUET_META_WINDOW_DURATION,
47+
PARQUET_META_WINDOW_START, ParquetWriterConfig, RemoteByteSource, StreamingParquetReader,
4648
};
4749

4850
/// Configuration for a merge operation.
@@ -483,3 +485,78 @@ fn extract_and_validate_input_metadata(paths: &[PathBuf]) -> Result<InputMetadat
483485
rg_partition_prefix_len: consensus_prefix_len.unwrap_or(0),
484486
})
485487
}
488+
489+
/// Execute a [`policy::ParquetMergeOperation`] by opening each input through
490+
/// the appropriate `ColumnPageStream` impl, then feeding the streams
491+
/// to the streaming merge engine.
492+
///
493+
/// Routing per input:
494+
/// - **Regular merge** (`op.target_prefix_len_override == None`): every split is opened directly
495+
/// via [`StreamingParquetReader`]. MP-3 already requires inputs to share
496+
/// `rg_partition_prefix_len`, so the streaming engine sees uniform metadata.
497+
/// - **Promotion merge** (`op.target_prefix_len_override == Some(target)`): splits with
498+
/// `rg_partition_prefix_len < target` are opened through [`LegacyInputAdapter`] with the same
499+
/// target — the adapter re-encodes the file in memory as prefix-aligned and rewrites the
500+
/// `qh.rg_partition_prefix_len` KV. Splits already at `target` are opened directly. The streaming
501+
/// engine then consumes a homogeneous stream advertising `prefix_len = target` on every input.
502+
///
503+
/// `sources` is parallel to `op.splits`: `sources[i]` provides byte-
504+
/// range reads against `op.splits[i].parquet_file`. The caller (e.g.
505+
/// the executor wrapper that lives outside this crate) is responsible
506+
/// for materializing one [`RemoteByteSource`] per split based on its
507+
/// storage backend (S3, local FS, etc.). Splits with names that
508+
/// cannot be opened by the source surface as `LegacyAdapterError::Io`
509+
/// or `ParquetReadError`.
510+
///
511+
/// Returns the merge engine's [`MergeOutputFile`]s. Conversion to
512+
/// `ParquetSplitMetadata` for the metastore is the caller's
513+
/// responsibility — use [`metadata_aggregation::merge_parquet_split_metadata`]
514+
/// with `mixed_prefix_ok = op.target_prefix_len_override.is_some()`.
515+
pub async fn execute_merge_operation(
516+
op: &policy::ParquetMergeOperation,
517+
sources: Vec<Arc<dyn RemoteByteSource>>,
518+
output_dir: &Path,
519+
config: &MergeConfig,
520+
) -> Result<Vec<MergeOutputFile>> {
521+
if sources.len() != op.splits.len() {
522+
bail!(
523+
"execute_merge_operation: sources.len() ({}) != op.splits.len() ({})",
524+
sources.len(),
525+
op.splits.len(),
526+
);
527+
}
528+
529+
let mut streams: Vec<Box<dyn ColumnPageStream>> = Vec::with_capacity(op.splits.len());
530+
for (split, source) in op.splits.iter().zip(sources.into_iter()) {
531+
let path = PathBuf::from(&split.parquet_file);
532+
let stream: Box<dyn ColumnPageStream> = match op.target_prefix_len_override {
533+
Some(target) if split.rg_partition_prefix_len < target => {
534+
// Promote this legacy input. The adapter re-encodes in
535+
// memory and presents itself as a prefix_len = target
536+
// single-RG stream to the merge engine.
537+
let adapter = LegacyInputAdapter::try_open(source, path, target)
538+
.await
539+
.with_context(|| {
540+
format!(
541+
"opening legacy adapter for split {} with target_prefix_len = {target}",
542+
split.split_id,
543+
)
544+
})?;
545+
Box::new(adapter)
546+
}
547+
_ => {
548+
// Direct streaming reader: regular merge, or promotion
549+
// where this input already satisfies the target.
550+
let reader = StreamingParquetReader::try_open(source, path)
551+
.await
552+
.with_context(|| {
553+
format!("opening streaming reader for split {}", split.split_id)
554+
})?;
555+
Box::new(reader)
556+
}
557+
};
558+
streams.push(stream);
559+
}
560+
561+
streaming::streaming_merge_sorted_parquet_files(streams, output_dir, config).await
562+
}

0 commit comments

Comments
 (0)