Add row-oriented byte encoder (vortex-row crate) by joseph-isaacs · Pull Request #7985 · vortex-data/vortex

joseph-isaacs · 2026-05-18T15:59:57Z

Part 25 of 25 in the stacked PR series adding vortex-row. This is the top of the stack.

This PR contains exactly one commit; review just that diff in isolation. The full design + rationale for the entire series is below.

What this commit does

Two more chunk-walking kernels alongside the BitPacked one. Both register via the inventory-based registry.

FoR (Frame of Reference):

Common fused path: FoR around a BitPacked storage with an unsigned reference. Walks the bit-packed chunks via FoR::unchecked_unfor_pack into a stack buffer with the base wrapping-added inline, then encodes rows from that buffer.
Slow path: FoR around a Primitive storage. Walks the canonical buffer once with a per-row wrapping_add and the row encode.

Delta:

Uses the existing chunked decompress_primitive to write into a primitive buffer, then encodes rows from that buffer. Skips the PrimitiveArray wrapping + validity attach.

Adds for_i64_* and delta_i64_* bench triplets.
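
As an illustration of the FoR slow path described above, here is a minimal sketch that adds the reference to each stored residual with `wrapping_add` and row-encodes the result in the same loop. The function names, the `u32` element type, and the `0x01` non-null sentinel are assumptions made for the example and are not taken from the PR.

```rust
/// Row-encode one non-null u32 as a sentinel byte plus big-endian value bytes.
/// The 0x01 sentinel is an assumption of this sketch.
fn encode_u32_row(value: u32, out: &mut [u8]) {
    out[0] = 0x01;
    out[1..5].copy_from_slice(&value.to_be_bytes());
}

/// FoR slow path: un-FoR each residual and row-encode it immediately,
/// without materializing an intermediate decoded array.
fn for_encode_rows(residuals: &[u32], reference: u32) -> Vec<u8> {
    let mut out = vec![0u8; residuals.len() * 5];
    for (residual, slot) in residuals.iter().zip(out.chunks_exact_mut(5)) {
        encode_u32_row(residual.wrapping_add(reference), slot);
    }
    out
}
```

Calling `for_encode_rows(&[0, 5, 2], 100)` yields three 5-byte rows holding 100, 105, and 102.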

Stack

#	PR	Title	Branch
1	#7986	vortex-row: crate scaffolding	`claude/row-c01-crate-scaffolding`
2	#7987	vortex-row: add SortField and RowEncodeOptions	`claude/row-c02-sortfield-options`
3	#7988	vortex-row: codec for fixed-width canonical types	`claude/row-c03-codec-fixed-width`
4	#7989	vortex-row: codec for varlen canonical types	`claude/row-c04-codec-varlen`
5	#7990	vortex-row: codec for nested canonical types	`claude/row-c05-codec-nested`
6	#7991	vortex-row: compute_sizes helper and RowSize ScalarFn	`claude/row-c06-rowsize-scalarfn`
7	#7992	vortex-row: RowEncode ScalarFn	`claude/row-c07-rowencode-scalarfn`
8	#7993	vortex-row: convert_columns + tests + bench scaffolding	`claude/row-c08-convert-columns-tests-bench`
9	#7994	Skip ListView validation in row encoder output	`claude/row-c09-skip-listview-validation`
10	#7995	Add validity fast-path helper for the four pattern-matching encoders	`claude/row-c10-validity-fast-path`
11	#7996	Skip zero-init of output buffer	`claude/row-c11-skip-zero-init`
12	#7997	Auto-vectorize pure-fixed offsets construction	`claude/row-c12-vectorize-pure-fixed-offsets`
13	#7998	Auto-vectorize mixed-path offsets construction	`claude/row-c13-vectorize-mixed-offsets`
14	#7999	Rewrite varlen 32-byte block encoder with copy_nonoverlapping	`claude/row-c14-varlen-block-copy-nonoverlapping`
15	#8000	Walk VarBinView rows directly in row encoder hot loop	`claude/row-c15-walk-varbinview-directly`
16	#8001	Add arithmetic-write fast path for fixed-before-varlen columns	`claude/row-c16-arith-write-fast-path`
17	#8002	Specialize Constant for the arithmetic-write fast path	`claude/row-c17-specialize-constant-arith`
18	#8003	RowSizeKernel and RowEncodeKernel dispatch helpers	`claude/row-c18-kernel-dispatch-helpers`
19	#8004	Inventory-based registry for downstream encoding kernels	`claude/row-c19-inventory-registry`
20	#8005	Constant row-encode kernel	`claude/row-c20-constant-kernel`
21	#8006	Dict row-encode kernel	`claude/row-c21-dict-kernel`
22	#8007	Patched row-encode kernel	`claude/row-c22-patched-kernel`
23	#8008	RunEnd row-encode kernel (vortex-runend)	`claude/row-c23-runend-kernel`
24	#8009	BitPacked row-encode kernel (vortex-fastlanes)	`claude/row-c24-bitpacked-kernel`
25	#7985	FoR and Delta row-encode kernels (vortex-fastlanes)	`claude/row-pr3-kernels`

Base of this PR: #8009 (claude/row-c24-bitpacked-kernel)
This is the top of the stack.

Combined context

This is the top-of-stack PR for the 25-commit vortex-row series. Each previous PR contains exactly one commit; the full diff (all 25 commits) is the union of the stack.

The overall change introduces a row-encoded representation for sorting/joining columnar Vortex data: N input columns are encoded into a single byte-comparable ListView<u8> per row, with per-column sort options (ascending/descending, nulls-first/last). The series builds the crate bottom-up:

PRs 1-8 (cumulative ref: claude/row-pr1-base, at the tip of #7993, "vortex-row: convert_columns + tests + bench scaffolding"): crate scaffolding, options struct, the byte-codec for fixed-width / varlen / nested canonical dtypes, the RowSize and RowEncode scalar functions, and the convert_columns user-facing entry point with tests and bench scaffolding.
PRs 9-17 (cumulative ref: claude/row-pr2-perf, at the tip of #8002, "Specialize Constant for the arithmetic-write fast path"): performance work on the codec hot path (skipping ListView validation and zero-init, validity fast-path, auto-vectorized offset construction, copy_nonoverlapping-based varlen block encoder, direct VarBinView walking, and the arithmetic-write fast path with Constant specialization).
PRs 18-25 (cumulative ref: claude/row-pr3-kernels, this PR): per-encoding kernels (dispatch helpers, the inventory-based registry for downstream encodings, and Constant / Dict / Patched / RunEnd / BitPacked / FoR / Delta kernels that skip canonicalization).

Add an empty `vortex-row` crate with a minimal `initialize` stub so the following commits can layer in the row-encoder, codec, scalar functions, and per-encoding kernels without touching the workspace skeleton each time. The crate is wired into the workspace members list and workspace dependency table; `public-api.lock` is generated against the stub. Signed-off-by: Claude <noreply@anthropic.com>

Introduce the per-column sort-field options and the variadic-function options struct used by the upcoming RowSize / RowEncode scalar functions. `RowEncodeOptions::fields` uses a `SmallVec<[SortField; 4]>` so typical 1-4 column keys avoid a heap allocation. Includes a compact serialize / deserialize helper used later by the scalar-function metadata round-trip. Signed-off-by: Claude <noreply@anthropic.com>

Add the byte-encoding kernels for the fixed-width portion of the row encoder: Null, Bool, Primitive (12 PTypes), and Decimal (i8..i128). Each encoder writes a 1-byte sentinel followed by the value's row-comparable bytes (sign-flipped big-endian for signed ints, sign-aware mask for floats, etc.). The size pass is a constant `width-per-row` add for these types; the encode pass walks rows and writes into the shared output buffer at `offsets[i] + cursors[i]`. `row_width_for_dtype` classifies the column based purely on its DType. Scalar-level encoders (`encode_scalar_primitive` / `encode_scalar_bool` / `encode_scalar_null` / `encode_scalar` / `encoded_size_for_scalar`) are included for the same fixed-width subset; varlen and nested canonical variants bail with a clear "not yet supported" error and land in follow-up commits. The implementation is deliberately the simplest correct version: bounds-checked array indexing, no `copy_nonoverlapping`, no validity fast-path helper. Subsequent PRs evolve this toward the optimized form. Signed-off-by: Claude <noreply@anthropic.com>
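
To make the "sentinel plus sign-flipped big-endian" layout concrete, here is a minimal sketch for a single optional `i64`. The sentinel values (`0x00` for null, `0x01` for non-null, giving nulls-first order) and the choice to implement descending order by inverting the value bytes are assumptions of this example, not details confirmed by the commit.

```rust
/// Encode an optional i64 as 9 byte-comparable bytes: 1 sentinel + 8 value bytes.
/// Assumed sentinels: 0x00 = null, 0x01 = non-null (nulls sort first).
fn encode_i64(value: Option<i64>, descending: bool) -> [u8; 9] {
    let mut out = [0u8; 9];
    if let Some(v) = value {
        out[0] = 0x01;
        // Flipping the sign bit maps two's-complement order onto unsigned order;
        // big-endian layout then makes bytewise comparison match numeric comparison.
        let mut bytes = ((v as u64) ^ (1u64 << 63)).to_be_bytes();
        if descending {
            // Assumed scheme: invert the value bytes to reverse the order.
            for b in &mut bytes {
                *b = !*b;
            }
        }
        out[1..].copy_from_slice(&bytes);
    }
    out
}
```

With this layout, comparing two encoded rows with a plain byte comparison gives the same answer as comparing the original integers.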

Extend the codec to handle Utf8/Binary via VarBinView arrays. Each value encodes as a 1-byte sentinel followed by 32-byte chunks: every full chunk has a 0xFF continuation marker; the final partial chunk pads with zeros and writes the partial length (1..=32) as its trailing byte. `encode_varlen_value` uses the simple byte-at-a-time XOR loop here; a faster `copy_nonoverlapping` + stamped continuation version replaces it in PR 2. `encode_varbinview` uses `arr.with_iterator(...)` for both the nullable and non-nullable branches; a direct view walk for the no-nulls branch lands in PR 2 too. `row_width_for_dtype` now returns `Variable` for Utf8/Binary; the size pass and encode dispatchers route through `add_size_varbinview` / `encode_varbinview` correspondingly. The scalar encoder gains `encode_scalar_varlen` and the matching Utf8/Binary arms. Signed-off-by: Claude <noreply@anthropic.com>
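
The block layout can be sketched as follows for the ascending case. The three sentinel values (`0x00` null, `0x01` empty, `0x02` non-empty) and the handling of empty values are assumptions of this example; the commit only states that a 1-byte sentinel precedes the 32-byte chunks.

```rust
const BLOCK: usize = 32;

/// Append the encoding of one optional byte string: a sentinel, then 33-byte
/// blocks made of 32 data bytes plus one marker byte.
/// Assumed sentinels: 0x00 null, 0x01 empty, 0x02 non-empty.
fn encode_varlen(value: Option<&[u8]>, out: &mut Vec<u8>) {
    let bytes = match value {
        None => {
            out.push(0x00);
            return;
        }
        Some(b) if b.is_empty() => {
            out.push(0x01);
            return;
        }
        Some(b) => b,
    };
    out.push(0x02);
    let mut chunks = bytes.chunks(BLOCK).peekable();
    while let Some(chunk) = chunks.next() {
        out.extend_from_slice(chunk);
        if chunks.peek().is_some() {
            // Full block with more data after it: continuation marker.
            out.push(0xFF);
        } else {
            // Final block: zero-pad to 32 bytes, then record its length (1..=32).
            out.resize(out.len() + (BLOCK - chunk.len()), 0);
            out.push(chunk.len() as u8);
        }
    }
}

/// Convenience wrapper returning a fresh buffer.
fn encode_varlen_to_vec(value: Option<&[u8]>) -> Vec<u8> {
    let mut out = Vec::new();
    encode_varlen(value, &mut out);
    out
}
```

Because the trailing length byte of a final block is at most 32 and the continuation marker is 0xFF, a value that ends inside a block always sorts before a longer value sharing the same prefix.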

Extend the codec to handle Struct, FixedSizeList, and Extension canonical variants. Each nested row encodes as `outer_sentinel | child bytes...`; for null rows the child bytes are zero-filled after the recursive encoders run so two null rows compare equal regardless of which non-null values would have been written by the children. `row_width_for_dtype` recurses through Struct fields and FSL elements to return `Fixed(w)` when every leaf is fixed; otherwise `Variable`. Extension delegates to its storage dtype. List remains `Variable` and ListView still bails (the row encoder's output is itself a ListView, so nested ListView isn't a near-term use case). Variant and Union bail explicitly. Signed-off-by: Claude <noreply@anthropic.com>

Add the size-pass machinery used by both RowSize and the upcoming RowEncode pipeline. `compute_sizes` walks the N input columns once, classifying each via `row_width_for_dtype` and accumulating fixed-width-prefix sums in `fixed_per_row` while pushing per-row sums of variable-length columns into a lazily allocated `var_lengths` vec. The classification result (`ColKind` + `SizePassResult`) is private to the crate; RowEncode consumes it in a later commit to choose between the arithmetic and cursor encode paths. `RowSize` returns a `Struct { fixed: U32, var: U32 }` so callers can read the per-row width without realizing the constant `fixed` slot as a per-row buffer (it's a `ConstantArray`); the `var` slot is a `ConstantArray(0)` when no varlen column is present. `dispatch_size` is the fallback-only path for PR 1 (canonicalize, then codec::field_size). The `RowSizeKernel` trait exists but is unused; per-encoding fast paths and the inventory registry arrive in PR 3. `initialize()` does NOT register RowSize yet - that lands once RowEncode is in place, so the session-registered pair appears together. Signed-off-by: Claude <noreply@anthropic.com>

Add the RowEncode variadic scalar function: encode N input columns into a single ListView<u8> in a five-phase pipeline.

Phase 1: size pass via `compute_sizes`.
Phase 2: allocate a zero-initialized output buffer sized to fit every row's encoded bytes; bail if the total exceeds u32::MAX.
Phase 3: build per-row `listview_offsets`: i * fixed_per_row for the pure-fixed case, or i * fixed_per_row + exclusive cumsum of varlen lengths otherwise. Uses the simple `Vec::push` + `checked_add` loop.
Phase 4: walk columns left-to-right and call `dispatch_encode` for every column (cursor path for all). Each call writes its per-row bytes at `offsets[i] + cursors[i]` and advances the cursor.
Phase 5: build the ListView<u8> via the validating `try_new` constructor.

`dispatch_encode` is the canonicalize-then-`codec::field_encode` fallback; in-crate kernel arms and the inventory registry land in PR 3. The `RowEncodeKernel` trait is defined but unused. PR 2 will iterate on this pipeline (skip zero-init, skip ListView validation, auto-vectorize the offsets loop, etc.). Signed-off-by: Claude <noreply@anthropic.com>

Wire the RowSize/RowEncode scalar functions to the user-facing API:
- `convert_columns` accepts a slice of input arrays and per-column SortFields, constructs `RowEncodeOptions` + `VecExecutionArgs`, and returns the encoded `ListViewArray<u8>`.
- `compute_row_sizes` returns just the per-row sizes (the `Struct { fixed: u32, var: u32 }` output of `RowSize`).
- `initialize()` now registers `RowSize` and `RowEncode` on the given session so they are reachable via the expression layer.

Tests cover sort-order round-trips for bool, primitive (i64 asc/desc, u32, f64), utf8, multi-column, nulls_first/last, struct sort-order, the single-buffer invariant of the ListView output, and the structural shape of `RowSize`. Tests that exercise per-encoding fast paths (`constant_path_matches_canonical`, `dict_path_matches_canonical`) land together with their respective kernels in PR 3.

The bench file uses divan + mimalloc and reports throughput in GB/s of encoded output bytes for primitive_i64, utf8, and struct_mixed. Each has an `arrow_row` baseline and a `vortex` measurement. Per-encoding fast-path scenarios (constant/dict/patched/bitpacked/for/delta) gain their triplets in PR 3.

Baseline measurements at this commit (sample-count=10):
primitive_i64_vortex ~1.97 GB/s (vs arrow-row 4.12 GB/s)
utf8_vortex ~0.87 GB/s (vs arrow-row 1.56 GB/s)
struct_mixed_vortex ~0.95 GB/s (vs arrow-row 1.19 GB/s)

PR 2 closes most of the gap by replacing the validating `ListViewArray::try_new` with `new_unchecked`, skipping the buffer zero-init, auto-vectorizing the offsets and varlen-block paths, etc. Signed-off-by: Claude <noreply@anthropic.com>

The encoder constructs the ListView's elements/offsets/sizes itself and maintains every invariant by construction: monotone offsets, each slice's `offsets[i] + sizes[i] <= total`, pairwise-disjoint slices. `ListViewArray::try_new` re-walks every row to validate those properties, which doubles as a memory pass over the just-built offsets/sizes arrays. Switch to `unsafe { ListViewArray::new_unchecked(...) }` with an inline SAFETY comment justifying each invariant. primitive_i64_vortex throughput improves from ~1.80 GB/s to ~4.7 GB/s on isolated runs (the validate walk dominates for small per-row payloads; larger varlen rows show smaller % improvements). Signed-off-by: Claude <noreply@anthropic.com>

…h it Most production columns are non-nullable or `AllValid`, in which case the per-row `mask.value(i)` branch is dead weight. Introduce a `ValidityKind { AllValid, Mask(...) }` helper resolved exactly once per column, and pattern-match on it in the four encoders that loop over rows: `encode_primitive_typed`, `encode_bool`, `encode_varbinview`, `add_size_varbinview`. For NonNullable / AllValid columns this skips the mask materialization entirely, and the inner loop has no validity branch. For nullable columns the materialized mask is held once instead of re-resolved per row. Yields ~10% across canonical paths on isolated runs; combines with the later auto-vectorization commit because removing the per-row branch makes the inner loop a candidate for the compiler's vectorizer. Signed-off-by: Claude <noreply@anthropic.com>
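
A minimal sketch of the resolve-once, match-outside-the-loop pattern. The `ValidityKind` here is a simplified stand-in that borrows a `&[bool]` rather than a Vortex mask, and the `u32` column, the 5-byte slot width, and the `0x01` sentinel are assumptions of the example.

```rust
/// Validity resolved once per column (simplified stand-in for the real helper).
enum ValidityKind<'a> {
    AllValid,
    Mask(&'a [bool]),
}

/// Encode a u32 column as 5-byte slots: sentinel + big-endian value, or 5 zero bytes for null.
fn encode_u32_column(values: &[u32], validity: &ValidityKind<'_>, out: &mut Vec<u8>) {
    match validity {
        // The match happens once; this loop carries no per-row validity branch.
        ValidityKind::AllValid => {
            for v in values {
                out.push(0x01);
                out.extend_from_slice(&v.to_be_bytes());
            }
        }
        // Nullable column: the mask is borrowed once and indexed per row.
        ValidityKind::Mask(mask) => {
            for (v, &valid) in values.iter().zip(mask.iter()) {
                if valid {
                    out.push(0x01);
                    out.extend_from_slice(&v.to_be_bytes());
                } else {
                    out.extend_from_slice(&[0u8; 5]);
                }
            }
        }
    }
}
```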

`BufferMut::with_capacity(total_len).push_n(0u8, total_len)` issues a memset of the entire output range, only to have every byte overwritten by the encoders. The encoders cover every byte by construction:
- Fixed-width non-null slots: sentinel + value bytes.
- Fixed-width null slots: sentinel + explicit per-byte zero-fill loop.
- Varlen blocks: full blocks are written by `encode_varlen_value`; the partial-block tail is zero-padded by that same function.
- Struct/FSL null bodies: zero-filled after the child encoders run.

Switch to `unsafe { out_buf.set_len(total_len) }` with a SAFETY comment recording the invariant. Reclaims a `total_len`-byte memset per call; for varlen-heavy inputs (multiple MB of output) this saves real time.

dict_utf8 (varlen heavy) throughput: ~3.74 GB/s → ~4.55 GB/s. Signed-off-by: Claude <noreply@anthropic.com>

The pure-fixed branch built `listview_offsets` via `Vec::push` + `checked_mul`, which forces the compiler to emit a per-iteration overflow branch and a `push`-style length-update sequence. Both inhibit the autovectorizer. We already validated `total` (= `nrows * fixed_per_row`) fits in u32 before reaching Phase 3, so each individual `i * fixed_per_row` also fits. Replace the loop with a raw `ptr.add(i).write(...)` write through the reserved capacity and a final `set_len(nrows)`. LLVM lowers the inner write to a SIMD store on x86 (verified via cargo asm in earlier iterations). primitive_i64_vortex throughput: ~4.96 GB/s → ~7.74 GB/s on isolated runs. The mixed branch gets the same treatment in the next commit. Signed-off-by: Claude <noreply@anthropic.com>
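
A sketch of the transformation on a plain `Vec<u32>`: validate the total once, then write each offset through the reserved capacity and set the length at the end. The function name and the `Option` return are illustrative; the real code operates on the crate's own buffer types.

```rust
/// Build per-row offsets `i * fixed_per_row` without `Vec::push`.
/// Returns `None` if the total encoded size does not fit in `u32`.
fn pure_fixed_offsets(nrows: usize, fixed_per_row: u32) -> Option<Vec<u32>> {
    // Validate the total once up front; this bounds every individual product.
    let total = (nrows as u64).checked_mul(fixed_per_row as u64)?;
    if total > u32::MAX as u64 {
        return None;
    }
    let mut offsets: Vec<u32> = Vec::with_capacity(nrows);
    let ptr = offsets.as_mut_ptr();
    for i in 0..nrows {
        // SAFETY: `i < nrows <= capacity`, so the write stays inside the allocation.
        // `i * fixed_per_row < total <= u32::MAX` whenever `fixed_per_row > 0`,
        // and the product is 0 otherwise, so the multiply cannot overflow.
        unsafe { ptr.add(i).write(i as u32 * fixed_per_row) };
    }
    // SAFETY: the loop above initialized exactly `nrows` elements.
    unsafe { offsets.set_len(nrows) };
    Some(offsets)
}
```

The loop body has no overflow branch and no length update, which is the shape the commit message credits for enabling vectorization.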

Apply the same `Vec::push` → raw-pointer-write transformation to the mixed (fixed-plus-varlen) branch of Phase 3. We already validated the total fits in u32 upstream, so `wrapping_mul` / `wrapping_add` here are sound. Mixed-path throughput stays within bench noise; this commit keeps the pure-fixed and mixed branches structurally identical so reviewers see the same shape regardless of whether varlen is present. Signed-off-by: Claude <noreply@anthropic.com>

The byte-at-a-time XOR loop is per-byte branch-heavy: 32 conditional writes per block on each path, even for the ascending (no-XOR) case where the body is exactly a `memcpy(32) + stamp(1)`. Rewrite `encode_varlen_value` with two distinct fast paths:
- Ascending: `copy_nonoverlapping(src, dst, 32)` + a single 0xFF stamp. The compiler folds the loop into a SIMD memcpy.
- Descending: a `xor_copy_block` helper that XOR-copies 32 bytes via four u64 reads/writes; LLVM lowers it to SIMD on x86.

The partial-block tail uses `write_bytes` for the zero-padding instead of a per-byte loop.

utf8 throughput: ~0.92 GB/s → ~1.39 GB/s. struct_mixed: +35%. Signed-off-by: Claude <noreply@anthropic.com>
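
The descending-path helper can be sketched in safe Rust as four 8-byte word operations. This is an illustration of the idea only: the signature is an assumption, and the commit's version presumably uses raw pointer reads and writes rather than slice chunks.

```rust
/// Copy a 32-byte block while inverting every bit, four u64 words at a time.
fn xor_copy_block(src: &[u8; 32], dst: &mut [u8; 32]) {
    for (s, d) in src.chunks_exact(8).zip(dst.chunks_exact_mut(8)) {
        let mut word_bytes = [0u8; 8];
        word_bytes.copy_from_slice(s);
        // XOR with all-ones inverts the word; native byte order is fine because
        // inversion is independent per byte.
        let word = u64::from_ne_bytes(word_bytes) ^ u64::MAX;
        d.copy_from_slice(&word.to_ne_bytes());
    }
}
```

Inverting every byte reverses the bytewise ordering of two blocks, which is what a descending sort key needs.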

`arr.with_iterator(...)` constructs an `Option<&[u8]>` per row through a trait-object dispatch and a branch-and-merge that hides the inline-vs-buffer view from the compiler. On the AllValid path we don't need the Option (no nulls) and we want the compiler to see the inline-vs-buffer branch directly so it can keep the inline arm in registers. Walk `arr.views()` directly and resolve each view via `is_inlined() → as_inlined().value()` vs `as_view() → buffers[idx][offset..len]`. Cache data-buffer slices once before the loop (SmallVec for ≤4 buffers, the common case). Nullable path is unchanged because the Option<&[u8]> shape is already what we want when nulls are possible. utf8 throughput: ~1.49 GB/s → ~1.84 GB/s. Signed-off-by: Claude <noreply@anthropic.com>

ColKind::Fixed { before_varlen: true, .. } columns have a constant within-row write offset (sum of preceding fixed-column widths plus i * fixed_per_row plus var_prefix[i] when varlen columns are present). For these we don't need a per-row cursor; the position is pure arithmetic. Adds dispatch_encode_fixed_arith + field_encode_fixed_arithmetic and routes the relevant ColKind arm of execute_row_encode's phase 4 through them. Fixed-after-varlen columns and varlen columns continue through the existing cursor path. primitive_i64 vortex 3.0 -> 6+ GB/s. Signed-off-by: Claude <noreply@anthropic.com>
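
For the pure-fixed case (no varlen columns, so no `var_prefix` term), the arithmetic position reduces to `i * fixed_per_row + col_offset`. A minimal sketch, where the function name, the `u32` column type, and the `0x01` sentinel are assumptions of the example:

```rust
/// Write one non-null u32 column (1 sentinel byte + 4 value bytes per row) into a
/// row-major output buffer. `col_offset` is the summed width of preceding fixed columns.
fn encode_u32_fixed_arith(values: &[u32], fixed_per_row: usize, col_offset: usize, out: &mut [u8]) {
    for (i, v) in values.iter().enumerate() {
        // Position is pure arithmetic: no per-row cursor to load, bump, and store.
        let pos = i * fixed_per_row + col_offset;
        out[pos] = 0x01;
        out[pos + 1..pos + 5].copy_from_slice(&v.to_be_bytes());
    }
}
```

Two such columns with `fixed_per_row = 10` would be written with `col_offset` 0 and 5 respectively, each in its own pass over the rows.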

When a ConstantArray feeds the fixed-before-varlen arithmetic path, the encoded scalar bytes are the same for every row. Hoist them into 1-2 register-sized loads outside the loop and emit direct write_unaligned stores per row. Specialized for encoded lengths 2 (bool/i8), 5 (i32), 9 (i64), 17 (i128). Other lengths fall back to copy_nonoverlapping. The var_prefix case (Constant after a varlen column) takes the same shape but computes per-row positions arithmetically rather than via a running cursor. constant_i64_vortex_without_kernel 2.47 -> ~6 GB/s (PR3 commit 3.3 adds the same specialization to the Constant kernel itself). Signed-off-by: Claude <noreply@anthropic.com>

Wire per-encoding fast-path traits into `dispatch_size` and `dispatch_encode`. Both helpers now try the in-crate downcast arms (Constant, Dict, Patched) before falling back to canonicalization. This commit adds stub impls returning `Ok(None)` so the existing behavior is preserved bit-for-bit; subsequent commits replace each stub with its real impl. Keeping the wiring change separate from the algorithm work makes the kernel impl commits trivially reviewable in isolation (they only touch one file each). The kernel module is `mod kernels` (crate-private) so the impls satisfy the orphan rule (trait defined in `vortex-row`, types from `vortex-array`) without leaking the impls into the crate's public surface. Signed-off-by: Claude <noreply@anthropic.com>

Encodings that live outside `vortex-array` (e.g. RunEnd, BitPacked, FoR, Delta) can't be downcast from inside the variadic dispatch loops - vortex-array doesn't know about them, and reversing the dependency would create a cycle. Add a `RowEncodeRegistration` that downstream crates submit via the inventory crate. `lookup(&array_id)` lazily builds an `ArrayId → (size, encode)` HashMap on first call, behind a `OnceLock` so the build is single-threaded and the lookups are wait-free thereafter. Wire the lookup into `dispatch_size` / `dispatch_encode` after the in-crate downcast attempts: in-crate kernels take precedence (constant-time downcast), then downstream registrations (HashMap lookup), then the canonicalization fallback. Signed-off-by: Claude <noreply@anthropic.com>
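
The lazy-map lookup can be sketched with the standard library alone. Here a static slice stands in for the registrations that the `inventory` crate would collect, and the `Registration` struct, the string ids, and the toy size functions are all hypothetical names for illustration.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

type SizeFn = fn(usize) -> usize;

/// Hypothetical registration record (the PR submits these via `inventory`).
struct Registration {
    array_id: &'static str,
    size: SizeFn,
}

fn runend_size(rows: usize) -> usize {
    rows * 9
}

fn bitpacked_size(rows: usize) -> usize {
    rows * 5
}

/// Stand-in for iterating the collected registrations.
static REGISTRATIONS: &[Registration] = &[
    Registration { array_id: "runend", size: runend_size },
    Registration { array_id: "bitpacked", size: bitpacked_size },
];

/// Build the id -> kernel map on first use; later calls only read it.
fn lookup(array_id: &str) -> Option<SizeFn> {
    static MAP: OnceLock<HashMap<&'static str, SizeFn>> = OnceLock::new();
    MAP.get_or_init(|| REGISTRATIONS.iter().map(|r| (r.array_id, r.size)).collect())
        .get(array_id)
        .copied()
}
```

A dispatcher would call `lookup` only after its in-crate downcast attempts fail, and fall back to canonicalization when it returns `None`.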

Replace the stub `RowSizeKernel` / `RowEncodeKernel` impls for `ConstantArray` with real implementations that skip canonicalization. The size pass adds the (constant) per-row scalar size to every entry of the shared `sizes` slice. The encode pass encodes the scalar bytes once into a small heap buffer, then `copy_nonoverlapping`s those bytes into each row's slot. Per-row work is one `copy_nonoverlapping(N)` plus one cursor add, where `N` is typically 9 (i64), 5 (i32), or 17 (i128). Add a `constant_i64_*` bench triplet (arrow-row baseline, vortex with kernel, vortex through canonicalization) and a `constant_path_matches_canonical` test that round-trips bytes both ways and asserts they're identical. Signed-off-by: Claude <noreply@anthropic.com>

Replace the stub `RowSizeKernel` / `RowEncodeKernel` impls for `Dict` with real implementations that skip canonicalization. Strategy: encode each unique value once into a small per-value buffer, then materialize the per-row contribution by indexing into the buffer via the codes array. Per-row cost becomes one `copy_from_slice` of the value's encoded bytes rather than re-encoding from scratch. Amortizes the encode work over the dictionary's cardinality instead of the row count. When values.len() > codes.len() the kernel declines (the canonical path is at least as fast because each value would be touched ≤ 1 time). `add_codes_sizes::<T>` has a u8 fast-path that reads the codes as a raw `&[u8]` slice to elide TryInto overhead. Includes `dict_utf8_*` bench triplet (arrow-row baseline, vortex with kernel, vortex through canonicalization) and a `dict_path_matches_canonical` round-trip test. Signed-off-by: Claude <noreply@anthropic.com>
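
The strategy can be sketched for a fixed-width dictionary: encode every dictionary value once, then copy the pre-encoded bytes per row through the codes. The function name, the `u32` values with `u8` codes, and the `0x01` sentinel are assumptions of this example; the real kernel also handles varlen values and the size pass.

```rust
/// Row-encode a dictionary-encoded u32 column.
/// Returns `None` (declines) when the dictionary is larger than the row count.
fn dict_encode_rows(values: &[u32], codes: &[u8]) -> Option<Vec<u8>> {
    if values.len() > codes.len() {
        return None;
    }
    const W: usize = 5; // 1 sentinel byte + 4 value bytes
    // Encode each unique value exactly once.
    let mut encoded_values = vec![0u8; values.len() * W];
    for (v, slot) in values.iter().zip(encoded_values.chunks_exact_mut(W)) {
        slot[0] = 0x01;
        slot[1..].copy_from_slice(&v.to_be_bytes());
    }
    // Per-row work is a single slice copy indexed by the code.
    let mut out = vec![0u8; codes.len() * W];
    for (&code, row) in codes.iter().zip(out.chunks_exact_mut(W)) {
        let start = code as usize * W;
        row.copy_from_slice(&encoded_values[start..start + W]);
    }
    Some(out)
}
```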

Replace the stub `RowSizeKernel` / `RowEncodeKernel` impls for `Patched` with real implementations. Size pass: per-row size matches the inner array exactly because patches share the inner dtype. Just delegate to `dispatch_size` on the inner array. Encode pass: delegate to `dispatch_encode` on the inner array, then walk the patch indices and overwrite each patched row's value bytes in place. Patched arrays in our hot paths are Primitive-typed (BitPacked with patches, etc.), so the kernel checks `DType::Primitive` upfront and declines for anything else. Pre-cursor snapshot is captured before the inner encoder advances `cursors`, so the overlay knows each row's slot start position. Adds `patched_i32_*` bench triplet. Patched-specific tests live next to the kernel in `kernels/patched.rs::tests` (round-trip vs canonical, both single-chunk and multi-chunk). Signed-off-by: Claude <noreply@anthropic.com>

Add a row-encode kernel for `RunEnd` arrays via the inventory-based registry: the encoding lives in `vortex-runend` which depends on `vortex-array` (not the other way around), so a direct downcast inside `dispatch_size` / `dispatch_encode` would create a cycle. The kernel is functionally analogous to the Dict kernel: encode each unique run-value once into a small per-value buffer, then broadcast the value's encoded bytes across each row in its run. The per-unique-value cost is amortized over the number of runs rather than the row count. `walk_runs` translates the run-end array's `(prev_end, curr_end)` windows into `(start_logical, stop_logical)` row ranges accounting for the array's slice offset and length. When ends.len() > len (very sparse runs, or pathological inputs) the kernel declines so canonicalization stays the dominant path. Includes a round-trip test in `compute/row_encode.rs` checking that the RunEnd path matches the canonical path bit-for-bit. Signed-off-by: Claude <noreply@anthropic.com>
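
The window translation can be sketched as below, assuming `ends[k]` is the exclusive physical end of run `k` and the sliced array covers physical rows `offset..offset + len`. The signature and the `(run, start, stop)` tuple are illustrative and not the PR's actual `walk_runs` API.

```rust
/// Translate run ends into `(run_index, start_logical, stop_logical)` row ranges
/// for an array sliced to physical rows `offset..offset + len`.
fn walk_runs(ends: &[usize], offset: usize, len: usize) -> Vec<(usize, usize, usize)> {
    let mut ranges = Vec::new();
    let mut prev_end = 0usize;
    for (run, &curr_end) in ends.iter().enumerate() {
        // Clamp the physical window to the slice, then shift into logical coordinates.
        let start = prev_end.max(offset);
        let stop = curr_end.min(offset + len);
        if start < stop {
            ranges.push((run, start - offset, stop - offset));
        }
        prev_end = curr_end;
    }
    ranges
}
```

Each returned range tells the encoder which logical rows receive the pre-encoded bytes of run `run`; runs that fall entirely outside the slice produce no range.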

Add a row-encode kernel for BitPacked arrays. The kernel walks the packed storage in 1024-element fastlanes chunks via `BitUnpackedChunks::full_chunks`, unpacks each chunk into a stack-local buffer, and writes the row-encoded bytes for that chunk in one pass. Patches (when present) are applied per-chunk to the stack buffer so a patched cell encodes its corrected value rather than the bit-packed placeholder. The shared `row_encode_common` module factors out the per-chunk encode primitive (`encode_primitive_chunk`) and a small `PrimRowEncode` trait — the same shape FoR and Delta will use in the next commit so those kernels can share the chunk-walk machinery. Kernel is registered via the `inventory`-based registry, since `vortex-fastlanes` depends on `vortex-array`. Includes a `bitpacked_i32_*` bench triplet (arrow-row baseline, vortex with kernel, vortex through canonicalization). Signed-off-by: Claude <noreply@anthropic.com>

Two more chunk-walking kernels alongside the BitPacked one. Both register via the inventory-based registry.

FoR (Frame of Reference):
- Common fused path: FoR around a BitPacked storage with an unsigned reference. Walks the bit-packed chunks via `FoR::unchecked_unfor_pack` into a stack buffer with the base wrapping-added inline, then encodes rows from that buffer.
- Slow path: FoR around a Primitive storage. Walks the canonical buffer once with a per-row wrapping_add and the row encode.

Delta:
- Use the existing chunked `decompress_primitive` to write into a primitive buffer, then encode rows from that buffer. Skips the PrimitiveArray wrapping + validity attach.

Adds `for_i64_*` and `delta_i64_*` bench triplets. Signed-off-by: Claude <noreply@anthropic.com>

codspeed-hq · 2026-05-18T16:07:04Z

Merging this PR will improve performance by 18%

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 4 improved benchmarks
✅ 1217 untouched benchmarks

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	Simulation	`chunked_varbinview_canonical_into[(1000, 10)]`	197.9 µs	162 µs	+22.19%
⚡	Simulation	`chunked_varbinview_into_canonical[(100, 100)]`	358.4 µs	323.5 µs	+10.78%
⚡	Simulation	`chunked_varbinview_into_canonical[(1000, 10)]`	211.2 µs	175.8 µs	+20.11%
⚡	Simulation	`chunked_varbinview_opt_canonical_into[(1000, 10)]`	224.8 µs	188.6 µs	+19.23%

Comparing claude/row-pr3-kernels (7021233) with develop (faf7e42)

Two unrelated CI failures on PR #7985:
1. Check generated source files: vortex-row/public-api.lock was stale - field_encode_fixed_arithmetic became pub in the arithmetic-write commit but the lock wasn't regenerated.
2. Rust publish dry-run: vortex-row's dev-dep on vortex-fastlanes was inherited from the workspace with a version specifier. Since vortex-fastlanes itself depends on vortex-row (for the inventory kernel registration), cargo publish couldn't resolve the version on crates.io. Drop the workspace inheritance and use a path-only dev-dep for vortex-fastlanes - the bench file is the only consumer and cargo strips path-only dev-deps from the published manifest.

Signed-off-by: Claude <noreply@anthropic.com>

claude added 25 commits May 17, 2026 22:00

joseph-isaacs changed the title ~~Add row-oriented byte encoder (vortex-row crate)~~ FoR and Delta row-encode kernels (vortex-fastlanes) May 18, 2026

joseph-isaacs changed the base branch from develop to claude/row-c24-bitpacked-kernel May 18, 2026 16:06

joseph-isaacs changed the title ~~FoR and Delta row-encode kernels (vortex-fastlanes)~~ Add row-oriented byte encoder (vortex-row crate) May 18, 2026

joseph-isaacs changed the base branch from claude/row-c24-bitpacked-kernel to develop May 18, 2026 16:09

joseph-isaacs added the changelog/feature label ("A new feature") May 18, 2026, with Claude

joseph-isaacs wants to merge 26 commits into develop from claude/row-pr3-kernels
