Add row-oriented byte encoder (vortex-row crate)#7985

Open
joseph-isaacs wants to merge 26 commits into develop from claude/row-pr3-kernels

@joseph-isaacs joseph-isaacs commented May 18, 2026

Part 25 of 25 in the stacked PR series adding vortex-row. This is the top of the stack.

This PR contains exactly one commit; review just that diff in isolation. The full design + rationale for the entire series is below.

What this commit does

Two more chunk-walking kernels alongside the BitPacked one. Both register via the inventory-based registry.

FoR (Frame of Reference):

  • Common fused path: FoR around a BitPacked storage with an unsigned reference. Walks the bit-packed chunks via FoR::unchecked_unfor_pack into a stack buffer with the base wrapping-added inline, then encodes rows from that buffer.
  • Slow path: FoR around a Primitive storage. Walks the canonical buffer once with a per-row wrapping_add and the row encode.

Delta:

  • Uses the existing chunked decompress_primitive to write into a primitive buffer, then encodes rows from that buffer. Skips the PrimitiveArray wrapping + validity attach.

Adds for_i64_* and delta_i64_* bench triplets.
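
As a rough illustration of the FoR slow path described above, the sketch below adds the reference to each stored residual with `wrapping_add` and row-encodes the result in the same pass. It assumes `u64` values, ascending order, no nulls, and a non-null sentinel of `0x01`; `encode_for_u64` is a hypothetical name and not a function from vortex-fastlanes.

```rust
/// Hypothetical sketch of a FoR slow path: un-FoR and row-encode in one walk.
/// Assumes u64 values, ascending order, no nulls, and a 0x01 non-null sentinel.
fn encode_for_u64(residuals: &[u64], reference: u64, out: &mut Vec<u8>) {
    for &r in residuals {
        // Frame-of-reference decode: the stored residual plus the base value.
        let value = reference.wrapping_add(r);
        // Row encode: sentinel byte, then big-endian bytes so that byte order
        // matches numeric order for unsigned integers.
        out.push(0x01);
        out.extend_from_slice(&value.to_be_bytes());
    }
}
```

The fused path differs only in where the residuals come from (an unpacked bit-packed chunk rather than a primitive buffer); the per-row encode step has the same shape.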

Stack

| # | PR | Title | Branch |
|---|---|---|---|
| 1 | #7986 | vortex-row: crate scaffolding | claude/row-c01-crate-scaffolding |
| 2 | #7987 | vortex-row: add SortField and RowEncodeOptions | claude/row-c02-sortfield-options |
| 3 | #7988 | vortex-row: codec for fixed-width canonical types | claude/row-c03-codec-fixed-width |
| 4 | #7989 | vortex-row: codec for varlen canonical types | claude/row-c04-codec-varlen |
| 5 | #7990 | vortex-row: codec for nested canonical types | claude/row-c05-codec-nested |
| 6 | #7991 | vortex-row: compute_sizes helper and RowSize ScalarFn | claude/row-c06-rowsize-scalarfn |
| 7 | #7992 | vortex-row: RowEncode ScalarFn | claude/row-c07-rowencode-scalarfn |
| 8 | #7993 | vortex-row: convert_columns + tests + bench scaffolding | claude/row-c08-convert-columns-tests-bench |
| 9 | #7994 | Skip ListView validation in row encoder output | claude/row-c09-skip-listview-validation |
| 10 | #7995 | Add validity fast-path helper for the four pattern-matching encoders | claude/row-c10-validity-fast-path |
| 11 | #7996 | Skip zero-init of output buffer | claude/row-c11-skip-zero-init |
| 12 | #7997 | Auto-vectorize pure-fixed offsets construction | claude/row-c12-vectorize-pure-fixed-offsets |
| 13 | #7998 | Auto-vectorize mixed-path offsets construction | claude/row-c13-vectorize-mixed-offsets |
| 14 | #7999 | Rewrite varlen 32-byte block encoder with copy_nonoverlapping | claude/row-c14-varlen-block-copy-nonoverlapping |
| 15 | #8000 | Walk VarBinView rows directly in row encoder hot loop | claude/row-c15-walk-varbinview-directly |
| 16 | #8001 | Add arithmetic-write fast path for fixed-before-varlen columns | claude/row-c16-arith-write-fast-path |
| 17 | #8002 | Specialize Constant for the arithmetic-write fast path | claude/row-c17-specialize-constant-arith |
| 18 | #8003 | RowSizeKernel and RowEncodeKernel dispatch helpers | claude/row-c18-kernel-dispatch-helpers |
| 19 | #8004 | Inventory-based registry for downstream encoding kernels | claude/row-c19-inventory-registry |
| 20 | #8005 | Constant row-encode kernel | claude/row-c20-constant-kernel |
| 21 | #8006 | Dict row-encode kernel | claude/row-c21-dict-kernel |
| 22 | #8007 | Patched row-encode kernel | claude/row-c22-patched-kernel |
| 23 | #8008 | RunEnd row-encode kernel (vortex-runend) | claude/row-c23-runend-kernel |
| 24 | #8009 | BitPacked row-encode kernel (vortex-fastlanes) | claude/row-c24-bitpacked-kernel |
| 25 | #7985 | FoR and Delta row-encode kernels (vortex-fastlanes) | claude/row-pr3-kernels |

Base of this PR: #8009 (claude/row-c24-bitpacked-kernel)
This is the top of the stack.

Combined context

This is the top-of-stack PR for the 25-commit vortex-row series. Each previous PR contains exactly one commit; the full diff (all 25 commits) is the union of the stack.

The overall change introduces a row-encoded representation for sorting/joining columnar Vortex data: N input columns are encoded into a single byte-comparable ListView<u8> per row, with per-column sort options (ascending/descending, nulls-first/last). The series builds the crate bottom-up:

  • PRs 1-8 (cumulative ref: claude/row-pr1-base at #7993's tip): crate scaffolding, options struct, the byte-codec for fixed-width / varlen / nested canonical dtypes, the RowSize and RowEncode scalar functions, and the convert_columns user-facing entry point with tests and bench scaffolding.
  • PRs 9-17 (cumulative ref: claude/row-pr2-perf at #8002's tip): performance work on the codec hot path, namely skipping ListView validation and zero-init, the validity fast path, auto-vectorized offset construction, the copy_nonoverlapping-based varlen block encoder, direct VarBinView walking, and the arithmetic-write fast path with Constant specialization.
  • PRs 18-25 (cumulative ref: claude/row-pr3-kernels, this PR): per-encoding kernels — dispatch helpers, the inventory-based registry for downstream encodings, and Constant / Dict / Patched / RunEnd / BitPacked / FoR / Delta kernels that skip canonicalization.

claude added 25 commits May 17, 2026 22:00
Add an empty `vortex-row` crate with a minimal `initialize` stub so the
following commits can layer in the row-encoder, codec, scalar functions,
and per-encoding kernels without touching the workspace skeleton each
time. The crate is wired into the workspace members list and workspace
dependency table; `public-api.lock` is generated against the stub.

Signed-off-by: Claude <noreply@anthropic.com>
Introduce the per-column sort-field options and the variadic-function
options struct used by the upcoming RowSize / RowEncode scalar functions.

`RowEncodeOptions::fields` uses a `SmallVec<[SortField; 4]>` so typical
1-4 column keys avoid a heap allocation. Includes a compact serialize /
deserialize helper used later by the scalar-function metadata round-trip.

Signed-off-by: Claude <noreply@anthropic.com>
Add the byte-encoding kernels for the fixed-width portion of the row
encoder: Null, Bool, Primitive (12 PTypes), and Decimal (i8..i128). Each
encoder writes a 1-byte sentinel followed by the value's row-comparable
bytes (sign-flipped big-endian for signed ints, sign-aware mask for
floats, etc.).

The size pass is a constant `width-per-row` add for these types; the
encode pass walks rows and writes into the shared output buffer at
`offsets[i] + cursors[i]`. `row_width_for_dtype` classifies the column
based purely on its DType.

Scalar-level encoders (`encode_scalar_primitive` / `encode_scalar_bool`
/ `encode_scalar_null` / `encode_scalar` / `encoded_size_for_scalar`)
are included for the same fixed-width subset; varlen and nested
canonical variants bail with a clear "not yet supported" error and
land in follow-up commits.

The implementation is deliberately the simplest correct version:
bounds-checked array indexing, no `copy_nonoverlapping`, no validity
fast-path helper. Subsequent PRs evolve this toward the optimized form.
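
A minimal sketch of the fixed-width layout described above, for `i64` only. The sentinel values (`0x00` for null, `0x01` for non-null, i.e. nulls-first) and the name `encode_i64` are assumptions made for illustration; the crate's actual sentinels and function names may differ.

```rust
// Assumed sentinel values for this sketch (nulls sort first).
const NULL_SENTINEL: u8 = 0x00;
const VALID_SENTINEL: u8 = 0x01;

/// Encode an optional i64 as 9 bytes: 1 sentinel byte + 8 value bytes.
fn encode_i64(value: Option<i64>, descending: bool) -> [u8; 9] {
    let mut out = [0u8; 9];
    match value {
        // Null rows keep zero-filled value bytes so all nulls compare equal.
        None => out[0] = NULL_SENTINEL,
        Some(v) => {
            out[0] = VALID_SENTINEL;
            // Flip the sign bit so negatives sort before positives when the
            // bytes are compared as unsigned big-endian.
            let mut bytes = ((v as u64) ^ (1u64 << 63)).to_be_bytes();
            if descending {
                // Inverting every value byte reverses the byte-wise order.
                for b in &mut bytes {
                    *b = !*b;
                }
            }
            out[1..].copy_from_slice(&bytes);
        }
    }
    out
}
```

Comparing two encoded rows with a plain byte-wise comparison then gives the same answer as comparing the original values under the chosen sort direction.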

Signed-off-by: Claude <noreply@anthropic.com>
Extend the codec to handle Utf8/Binary via VarBinView arrays. Each value
encodes as a 1-byte sentinel followed by 32-byte chunks: every full
chunk has a 0xFF continuation marker; the final partial chunk pads with
zeros and writes the partial length (1..=32) as its trailing byte.

`encode_varlen_value` uses the simple byte-at-a-time XOR loop here; a
faster `copy_nonoverlapping` + stamped continuation version replaces it
in PR 2. `encode_varbinview` uses `arr.with_iterator(...)` for both the
nullable and non-nullable branches; a direct view walk for the no-nulls
branch lands in PR 2 too.

`row_width_for_dtype` now returns `Variable` for Utf8/Binary; the size
pass and encode dispatchers route through `add_size_varbinview` /
`encode_varbinview` correspondingly. The scalar encoder gains
`encode_scalar_varlen` and the matching Utf8/Binary arms.
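
The block layout can be sketched as follows for the ascending, non-null case. This is an illustration under stated assumptions, not the crate's `encode_varlen_value`: the sentinel value `0x01`, the treatment of an empty value as sentinel-only, and the name `encode_varlen_ascending` are all hypothetical.

```rust
const BLOCK: usize = 32;
const CONTINUATION: u8 = 0xFF;
const NON_NULL_SENTINEL: u8 = 0x01; // assumed value for this sketch

/// Encode a non-null byte string in ascending order.
/// Layout: sentinel, then 33-byte units of (32 data bytes, 1 marker byte).
fn encode_varlen_ascending(value: &[u8]) -> Vec<u8> {
    let mut out = vec![NON_NULL_SENTINEL];
    let mut chunks = value.chunks(BLOCK).peekable();
    while let Some(chunk) = chunks.next() {
        let start = out.len();
        out.extend_from_slice(chunk);
        // Zero-pad the final chunk up to a full 32-byte block (no-op when full).
        out.resize(start + BLOCK, 0);
        if chunks.peek().is_some() {
            // More data follows: stamp the continuation marker.
            out.push(CONTINUATION);
        } else {
            // Last block: record how many of its bytes are real data (1..=32).
            out.push(chunk.len() as u8);
        }
    }
    out
}
```

Because the marker of a non-final block (`0xFF`) is larger than any final-block length, a value that is a strict prefix of another sorts first.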

Signed-off-by: Claude <noreply@anthropic.com>
Extend the codec to handle Struct, FixedSizeList, and Extension
canonical variants. Each nested row encodes as `outer_sentinel | child
bytes...`; for null rows the child bytes are zero-filled after the
recursive encoders run so two null rows compare equal regardless of
which non-null values would have been written by the children.

`row_width_for_dtype` recurses through Struct fields and FSL elements
to return `Fixed(w)` when every leaf is fixed; otherwise `Variable`.
Extension delegates to its storage dtype. List remains `Variable` and
ListView still bails (the row encoder's output is itself a ListView, so
nested ListView isn't a near-term use case). Variant and Union bail
explicitly.

Signed-off-by: Claude <noreply@anthropic.com>
Add the size-pass machinery used by both RowSize and the upcoming
RowEncode pipeline. `compute_sizes` walks the N input columns once,
classifying each via `row_width_for_dtype` and accumulating
fixed-width-prefix sums in `fixed_per_row` while pushing per-row sums
of variable-length columns into a lazily allocated `var_lengths` vec.

The classification result (`ColKind` + `SizePassResult`) is private to
the crate; RowEncode consumes it in a later commit to choose between
the arithmetic and cursor encode paths.

`RowSize` returns a `Struct { fixed: U32, var: U32 }` so callers can
read the per-row width without realizing the constant `fixed` slot as
a per-row buffer (it's a `ConstantArray`); the `var` slot is a
`ConstantArray(0)` when no varlen column is present.

`dispatch_size` is the fallback-only path for PR 1 (canonicalize, then
codec::field_size). The `RowSizeKernel` trait exists but is unused; per-
encoding fast paths and the inventory registry arrive in PR 3.

`initialize()` does NOT register RowSize yet - that lands once
RowEncode is in place, so the session-registered pair appears together.

Signed-off-by: Claude <noreply@anthropic.com>
Add the RowEncode variadic scalar function: encode N input columns into
a single ListView<u8> in a five-phase pipeline.

  Phase 1: size pass via `compute_sizes`.
  Phase 2: allocate a zero-initialized output buffer sized to fit every
           row's encoded bytes; bail if the total exceeds u32::MAX.
  Phase 3: build per-row `listview_offsets`: i * fixed_per_row for the
           pure-fixed case, or i * fixed_per_row + exclusive cumsum of
           varlen lengths otherwise. Uses the simple `Vec::push` +
           `checked_add` loop.
  Phase 4: walk columns left-to-right and call `dispatch_encode` for
           every column (cursor path for all). Each call writes its
           per-row bytes at `offsets[i] + cursors[i]` and advances the
           cursor.
  Phase 5: build the ListView<u8> via the validating `try_new`
           constructor.

`dispatch_encode` is the canonicalize-then-`codec::field_encode`
fallback; in-crate kernel arms and the inventory registry land in PR 3.
The `RowEncodeKernel` trait is defined but unused. PR 2 will iterate
on this pipeline (skip zero-init, skip ListView validation, auto-
vectorize the offsets loop, etc.).
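
Phase 3 can be pictured with the sketch below, which follows the `Vec::push` + `checked_add` shape named above. The function name `build_offsets` and its signature are assumptions for illustration; returning `None` stands in for the overflow bail-out.

```rust
/// Build the per-row start offsets of the output ListView.
/// `var_lengths` is `None` in the pure-fixed case; otherwise it holds the
/// summed varlen byte count of each row.
fn build_offsets(
    nrows: usize,
    fixed_per_row: u32,
    var_lengths: Option<&[u32]>,
) -> Option<Vec<u32>> {
    let mut offsets = Vec::with_capacity(nrows);
    // Exclusive cumulative sum of the varlen lengths seen so far.
    let mut var_prefix: u32 = 0;
    for i in 0..nrows {
        let fixed = u32::try_from(i).ok()?.checked_mul(fixed_per_row)?;
        offsets.push(fixed.checked_add(var_prefix)?);
        if let Some(lens) = var_lengths {
            var_prefix = var_prefix.checked_add(lens[i])?;
        }
    }
    Some(offsets)
}
```

For example, three rows with 9 fixed bytes each and varlen lengths 34, 1, 67 start at offsets 0, 43, and 53.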

Signed-off-by: Claude <noreply@anthropic.com>
Wire the RowSize/RowEncode scalar functions to the user-facing API:

- `convert_columns` accepts a slice of input arrays and per-column
  SortFields, constructs `RowEncodeOptions` + `VecExecutionArgs`, and
  returns the encoded `ListViewArray<u8>`.
- `compute_row_sizes` returns just the per-row sizes (the `Struct
  { fixed: u32, var: u32 }` output of `RowSize`).
- `initialize()` now registers `RowSize` and `RowEncode` on the given
  session so they are reachable via the expression layer.

Tests cover sort-order round-trips for bool, primitive (i64 asc/desc,
u32, f64), utf8, multi-column, nulls_first/last, struct sort-order, the
single-buffer invariant of the ListView output, and the structural
shape of `RowSize`. Tests that exercise per-encoding fast paths
(`constant_path_matches_canonical`, `dict_path_matches_canonical`) land
together with their respective kernels in PR 3.

The bench file uses divan + mimalloc and reports throughput in GB/s of
encoded output bytes for primitive_i64, utf8, and struct_mixed. Each
has an `arrow_row` baseline and a `vortex` measurement. Per-encoding
fast-path scenarios (constant/dict/patched/bitpacked/for/delta) gain
their triplets in PR 3.

Baseline measurements at this commit (sample-count=10):
  primitive_i64_vortex  ~1.97 GB/s  (vs arrow-row 4.12 GB/s)
  utf8_vortex           ~0.87 GB/s  (vs arrow-row 1.56 GB/s)
  struct_mixed_vortex   ~0.95 GB/s  (vs arrow-row 1.19 GB/s)

PR 2 closes most of the gap by replacing the validating
`ListViewArray::try_new` with `new_unchecked`, skipping the buffer
zero-init, auto-vectorizing the offsets and varlen-block paths, etc.

Signed-off-by: Claude <noreply@anthropic.com>
The encoder constructs the ListView's elements/offsets/sizes itself and
maintains every invariant by construction: monotone offsets, each
slice's `offsets[i] + sizes[i] <= total`, pairwise-disjoint slices.
`ListViewArray::try_new` re-walks every row to validate those properties,
which doubles as a memory pass over the just-built offsets/sizes arrays.

Switch to `unsafe { ListViewArray::new_unchecked(...) }` with an inline
SAFETY comment justifying each invariant.

primitive_i64_vortex throughput improves from ~1.80 GB/s to ~4.7 GB/s
on isolated runs (the validate walk dominates for small per-row payloads;
larger varlen rows show smaller % improvements).

Signed-off-by: Claude <noreply@anthropic.com>
…h it

Most production columns are non-nullable or `AllValid`, in which case
the per-row `mask.value(i)` branch is dead weight. Introduce a
`ValidityKind { AllValid, Mask(...) }` helper resolved exactly once per
column, and pattern-match on it in the four encoders that loop over
rows: `encode_primitive_typed`, `encode_bool`, `encode_varbinview`,
`add_size_varbinview`.

For NonNullable / AllValid columns this skips the mask materialization
entirely, and the inner loop has no validity branch. For nullable
columns the materialized mask is held once instead of re-resolved per
row.

Yields ~10% across canonical paths on isolated runs; combines with the
later auto-vectorization commit because removing the per-row branch
makes the inner loop a candidate for the compiler's vectorizer.
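
A simplified sketch of the pattern, assuming a `u8` column, a plain `&[bool]` in place of the real mask type, and sentinels of `0x01` / `0x00`. The enum mirrors the `ValidityKind` shape named above, but the payload type and the function `encode_u8_column` are hypothetical.

```rust
/// Validity resolved once per column instead of once per row.
enum ValidityKind<'a> {
    AllValid,
    /// Stand-in for the materialized mask; `true` means the row is valid.
    Mask(&'a [bool]),
}

/// Encode each row as (sentinel, value); null rows are (0, 0).
fn encode_u8_column(values: &[u8], validity: &ValidityKind<'_>, out: &mut Vec<u8>) {
    match validity {
        // No validity branch inside this loop.
        ValidityKind::AllValid => {
            for &v in values {
                out.extend_from_slice(&[1, v]);
            }
        }
        // The mask is consulted per row, but it was resolved only once.
        ValidityKind::Mask(mask) => {
            for (&v, &valid) in values.iter().zip(mask.iter()) {
                if valid {
                    out.extend_from_slice(&[1, v]);
                } else {
                    out.extend_from_slice(&[0, 0]);
                }
            }
        }
    }
}
```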

Signed-off-by: Claude <noreply@anthropic.com>
`BufferMut::with_capacity(total_len).push_n(0u8, total_len)` issues a
memset of the entire output range, only to have every byte overwritten
by the encoders. The encoders cover every byte by construction:

- Fixed-width non-null slots: sentinel + value bytes.
- Fixed-width null slots: sentinel + explicit per-byte zero-fill loop.
- Varlen blocks: full blocks are written by `encode_varlen_value`; the
  partial-block tail is zero-padded by that same function.
- Struct/FSL null bodies: zero-filled after the child encoders run.

Switch to `unsafe { out_buf.set_len(total_len) }` with a SAFETY comment
recording the invariant. Reclaims a `total_len`-byte memset per call;
for varlen-heavy inputs (multiple MB of output) this saves real time.

dict_utf8 (varlen heavy) throughput: ~3.74 GB/s → ~4.55 GB/s.

Signed-off-by: Claude <noreply@anthropic.com>
The pure-fixed branch built `listview_offsets` via `Vec::push` +
`checked_mul`, which forces the compiler to emit a per-iteration
overflow branch and a `push`-style length-update sequence. Both
inhibit the autovectorizer.

We already validated `total` (= `nrows * fixed_per_row`) fits in u32
before reaching Phase 3, so each individual `i * fixed_per_row` also
fits. Replace the loop with a raw `ptr.add(i).write(...)` write through
the reserved capacity and a final `set_len(nrows)`. LLVM lowers the
inner write to a SIMD store on x86 (verified via cargo asm in earlier
iterations).

primitive_i64_vortex throughput: ~4.96 GB/s → ~7.74 GB/s on isolated
runs. The mixed branch gets the same treatment in the next commit.
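
The transformation can be sketched as below. This is an illustration of the technique rather than the commit's code: `pure_fixed_offsets` is a hypothetical name, and the up-front assertion stands in for the earlier `total` validation.

```rust
/// Build `i * fixed_per_row` offsets by writing through reserved capacity.
fn pure_fixed_offsets(nrows: usize, fixed_per_row: u32) -> Vec<u32> {
    // Validate the total once so the loop body needs no overflow check.
    let total = (nrows as u64)
        .checked_mul(u64::from(fixed_per_row))
        .expect("row count times width overflows u64");
    assert!(total <= u64::from(u32::MAX), "encoded size exceeds u32::MAX");

    let mut offsets: Vec<u32> = Vec::with_capacity(nrows);
    let ptr = offsets.as_mut_ptr();
    for i in 0..nrows {
        // SAFETY: i < nrows <= capacity, so the write stays inside the
        // allocation reserved by `with_capacity`.
        unsafe { ptr.add(i).write((i as u32).wrapping_mul(fixed_per_row)) };
    }
    // SAFETY: the loop above initialized elements 0..nrows.
    unsafe { offsets.set_len(nrows) };
    offsets
}
```

With no `push` bookkeeping and no per-iteration overflow branch, the loop body is a plain multiply-and-store, which is the shape a vectorizer can handle.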

Signed-off-by: Claude <noreply@anthropic.com>
Apply the same `Vec::push` → raw-pointer-write transformation to the
mixed (fixed-plus-varlen) branch of Phase 3. We already validated the
total fits in u32 upstream, so `wrapping_mul` / `wrapping_add` here are
sound.

Mixed-path throughput stays within bench noise; this commit keeps the pure-fixed
and mixed branches structurally identical so reviewers see the same
shape regardless of whether varlen is present.

Signed-off-by: Claude <noreply@anthropic.com>
The byte-at-a-time XOR loop is per-byte branch-heavy: 32 conditional
writes per block on each path, even for the ascending (no-XOR) case
where the body is exactly a `memcpy(32) + stamp(1)`.

Rewrite `encode_varlen_value` with two distinct fast paths:
- Ascending: `copy_nonoverlapping(src, dst, 32)` + a single 0xFF stamp.
  The compiler folds the loop into a SIMD memcpy.
- Descending: a `xor_copy_block` helper that XOR-copies 32 bytes via
  four u64 reads/writes; LLVM lowers it to SIMD on x86.

The partial-block tail uses `write_bytes` for the zero-padding instead
of a per-byte loop.

utf8 throughput: ~0.92 GB/s → ~1.39 GB/s.
struct_mixed: +35%.
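
A safe-Rust sketch of the descending-path idea: invert 32 bytes by processing them as four `u64` words. The name `xor_copy_block` comes from the commit message, but this body is an assumption; the real helper may use raw pointer reads and writes instead.

```rust
/// Copy `src` into `dst` with every byte XORed with 0xFF (bitwise NOT),
/// working on four 8-byte words rather than 32 individual bytes.
fn xor_copy_block(src: &[u8; 32], dst: &mut [u8; 32]) {
    for (d, s) in dst.chunks_exact_mut(8).zip(src.chunks_exact(8)) {
        let mut word = [0u8; 8];
        word.copy_from_slice(s);
        // Native-endian round trip: inverting bits is byte-order independent.
        let inverted = !u64::from_ne_bytes(word);
        d.copy_from_slice(&inverted.to_ne_bytes());
    }
}
```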

Signed-off-by: Claude <noreply@anthropic.com>
`arr.with_iterator(...)` constructs an `Option<&[u8]>` per row through a
trait-object dispatch and a branch-and-merge that hides the
inline-vs-buffer view from the compiler. On the AllValid path we don't
need the Option (no nulls) and we want the compiler to see the
inline-vs-buffer branch directly so it can keep the inline arm in
registers.

Walk `arr.views()` directly and resolve each view via `is_inlined() →
as_inlined().value()` vs `as_view() → buffers[idx][offset..offset + len]`. Cache
data-buffer slices once before the loop (SmallVec for ≤4 buffers, the
common case). Nullable path is unchanged because the Option<&[u8]>
shape is already what we want when nulls are possible.

utf8 throughput: ~1.49 GB/s → ~1.84 GB/s.

Signed-off-by: Claude <noreply@anthropic.com>
ColKind::Fixed { before_varlen: true, .. } columns have a constant
within-row write offset (sum of preceding fixed-column widths plus
i * fixed_per_row plus var_prefix[i] when varlen columns are present).
For these we don't need a per-row cursor; the position is pure
arithmetic.

Adds dispatch_encode_fixed_arith + field_encode_fixed_arithmetic and
routes the relevant ColKind arm of execute_row_encode's phase 4
through them. Fixed-after-varlen columns and varlen columns continue
through the existing cursor path.

primitive_i64 vortex 3.0 -> 6+ GB/s.
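
The position arithmetic can be sketched as follows for a 2-byte column (sentinel + `u8`). The function `encode_fixed_arith` and its parameters are hypothetical stand-ins for `dispatch_encode_fixed_arith` and friends; only the formula is taken from the description above.

```rust
/// Write one fixed-width column into the shared row buffer without a cursor.
/// `col_offset` is the sum of the widths of the preceding fixed columns;
/// `var_prefix[i]` is the total varlen bytes of rows before row `i`.
fn encode_fixed_arith(
    values: &[u8],
    col_offset: usize,
    fixed_per_row: usize,
    var_prefix: Option<&[usize]>,
    out: &mut [u8],
) {
    for (i, &v) in values.iter().enumerate() {
        // The write position is pure arithmetic on the row index.
        let pos = i * fixed_per_row + var_prefix.map_or(0, |p| p[i]) + col_offset;
        out[pos] = 1; // assumed non-null sentinel
        out[pos + 1] = v;
    }
}
```

Because no column depends on a cursor left behind by the previous one, two fixed columns can be written in either order and produce the same buffer.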

Signed-off-by: Claude <noreply@anthropic.com>
When a ConstantArray feeds the fixed-before-varlen arithmetic path,
the encoded scalar bytes are the same for every row. Hoist them into
1-2 register-sized loads outside the loop and emit direct
write_unaligned stores per row. Specialized for encoded lengths 2
(bool/i8), 5 (i32), 9 (i64), 17 (i128). Other lengths fall back to
copy_nonoverlapping. The var_prefix case (Constant after a varlen
column) takes the same shape but computes per-row positions
arithmetically rather than via a running cursor.

constant_i64_vortex_without_kernel 2.47 -> ~6 GB/s (PR3 commit 3.3
adds the same specialization to the Constant kernel itself).

Signed-off-by: Claude <noreply@anthropic.com>
Wire per-encoding fast-path traits into `dispatch_size` and
`dispatch_encode`. Both helpers now try the in-crate downcast arms
(Constant, Dict, Patched) before falling back to canonicalization.

This commit adds stub impls returning `Ok(None)` so the existing
behavior is preserved bit-for-bit; subsequent commits replace each
stub with its real impl. Keeping the wiring change separate from the
algorithm work makes the kernel impl commits trivially reviewable in
isolation (they only touch one file each).

The kernel module is `mod kernels` (crate-private) so the impls
satisfy the orphan rule (trait defined in `vortex-row`, types from
`vortex-array`) without leaking the impls into the crate's public
surface.

Signed-off-by: Claude <noreply@anthropic.com>
Encodings that live outside `vortex-array` (e.g. RunEnd, BitPacked, FoR,
Delta) can't be downcast from inside the variadic dispatch loops -
vortex-array doesn't know about them, and reversing the dependency
would create a cycle.

Add a `RowEncodeRegistration` that downstream crates submit via the
inventory crate. `lookup(&array_id)` lazily builds an `ArrayId → (size,
encode)` HashMap on first call, behind a `OnceLock` so the build is
single-threaded and the lookups are wait-free thereafter.

Wire the lookup into `dispatch_size` / `dispatch_encode` after the
in-crate downcast attempts: in-crate kernels take precedence (constant-
time downcast), then downstream registrations (HashMap lookup), then
the canonicalization fallback.
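
The lookup and dispatch order can be sketched with the standard library alone. Here a static slice stands in for the registrations that the inventory crate would collect, and the kernel signature, array IDs, and function names are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Hypothetical kernel signature: row count in, total encoded bytes out.
type SizeFn = fn(usize) -> usize;

struct Registration {
    array_id: &'static str,
    size: SizeFn,
}

fn nine_bytes_per_row(nrows: usize) -> usize {
    nrows * 9
}

// Stand-in for the entries downstream crates would submit via inventory.
static REGISTRATIONS: &[Registration] = &[Registration {
    array_id: "example.for",
    size: nine_bytes_per_row,
}];

/// Build the id -> kernel table on first use; later calls only read it.
fn lookup(array_id: &str) -> Option<SizeFn> {
    static TABLE: OnceLock<HashMap<&'static str, SizeFn>> = OnceLock::new();
    TABLE
        .get_or_init(|| REGISTRATIONS.iter().map(|r| (r.array_id, r.size)).collect())
        .get(array_id)
        .copied()
}

/// Dispatch order: in-crate kernel, then registered kernel, then fallback.
fn dispatch(array_id: &str) -> &'static str {
    if array_id == "example.constant" {
        return "in-crate kernel";
    }
    if lookup(array_id).is_some() {
        return "registered kernel";
    }
    "canonicalize fallback"
}
```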

Signed-off-by: Claude <noreply@anthropic.com>
Replace the stub `RowSizeKernel` / `RowEncodeKernel` impls for
`ConstantArray` with real implementations that skip canonicalization.

The size pass adds the (constant) per-row scalar size to every entry of
the shared `sizes` slice. The encode pass encodes the scalar bytes once
into a small heap buffer, then `copy_nonoverlapping`s those bytes into
each row's slot. Per-row work is one `copy_nonoverlapping(N)` plus one
cursor add, where `N` is typically 9 (i64), 5 (i32), or 17 (i128).

Add a `constant_i64_*` bench triplet (arrow-row baseline, vortex with
kernel, vortex through canonicalization) and a
`constant_path_matches_canonical` test that round-trips bytes both
ways and asserts they're identical.

Signed-off-by: Claude <noreply@anthropic.com>
Replace the stub `RowSizeKernel` / `RowEncodeKernel` impls for `Dict`
with real implementations that skip canonicalization.

Strategy: encode each unique value once into a small per-value buffer,
then materialize the per-row contribution by indexing into the buffer
via the codes array. Per-row cost becomes one `copy_from_slice` of the
value's encoded bytes rather than re-encoding from scratch. Amortizes
the encode work over the dictionary's cardinality instead of the row
count.

When values.len() > codes.len() the kernel declines (the canonical path
is at least as fast because each value would be touched ≤ 1 time).

`add_codes_sizes::<T>` has a u8 fast-path that reads the codes as a raw
`&[u8]` slice to elide TryInto overhead.

Includes `dict_utf8_*` bench triplet (arrow-row baseline, vortex with
kernel, vortex through canonicalization) and a
`dict_path_matches_canonical` round-trip test.
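
The strategy can be sketched for an `i64` dictionary with `u8` codes. The names `encode_dict_value` and `encode_dict`, the `0x01` sentinel, and the boolean "declined" return are assumptions for illustration, not the kernel's actual interface.

```rust
/// Encode one non-null i64 as sentinel + sign-flipped big-endian bytes.
fn encode_dict_value(v: i64) -> [u8; 9] {
    let mut out = [1u8; 9]; // byte 0 is the assumed non-null sentinel
    out[1..].copy_from_slice(&((v as u64) ^ (1u64 << 63)).to_be_bytes());
    out
}

/// Returns `false` (declines) when the dictionary is larger than the code
/// array, since each value would then be used at most once on average.
fn encode_dict(values: &[i64], codes: &[u8], out: &mut Vec<u8>) -> bool {
    if values.len() > codes.len() {
        return false;
    }
    // Encode each distinct value exactly once.
    let encoded: Vec<[u8; 9]> = values.iter().map(|&v| encode_dict_value(v)).collect();
    // Per-row work is a lookup plus a 9-byte copy.
    for &code in codes {
        out.extend_from_slice(&encoded[code as usize]);
    }
    true
}
```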

Signed-off-by: Claude <noreply@anthropic.com>
Replace the stub `RowSizeKernel` / `RowEncodeKernel` impls for
`Patched` with real implementations.

Size pass: per-row size matches the inner array exactly because
patches share the inner dtype. Just delegate to `dispatch_size` on the
inner array.

Encode pass: delegate to `dispatch_encode` on the inner array, then
walk the patch indices and overwrite each patched row's value bytes in
place. Patched arrays in our hot paths are Primitive-typed (BitPacked
with patches, etc.), so the kernel checks `DType::Primitive` upfront
and declines for anything else.

Pre-cursor snapshot is captured before the inner encoder advances
`cursors`, so the overlay knows each row's slot start position.

Adds `patched_i32_*` bench triplet. Patched-specific tests live next
to the kernel in `kernels/patched.rs::tests` (round-trip vs canonical,
both single-chunk and multi-chunk).
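
A simplified picture of the overlay step, assuming a single `u32` column appended to a byte vector, so the snapshot is just the buffer length before the inner encode. The function `encode_patched` and its signature are hypothetical.

```rust
/// Encode the inner values, then overwrite the value bytes of patched rows.
/// Each row is 5 bytes: an assumed 0x01 sentinel + 4 big-endian value bytes.
fn encode_patched(
    inner: &[u32],
    patch_indices: &[usize],
    patch_values: &[u32],
    out: &mut Vec<u8>,
) {
    // Snapshot where this column's rows begin, before the inner encode runs.
    let base = out.len();
    for &v in inner {
        out.push(1);
        out.extend_from_slice(&v.to_be_bytes());
    }
    // Overlay: rewrite only the value bytes of each patched row in place.
    for (&row, &v) in patch_indices.iter().zip(patch_values) {
        let slot = base + row * 5;
        out[slot + 1..slot + 5].copy_from_slice(&v.to_be_bytes());
    }
}
```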

Signed-off-by: Claude <noreply@anthropic.com>
Add a row-encode kernel for `RunEnd` arrays via the inventory-based
registry: the encoding lives in `vortex-runend` which depends on
`vortex-array` (not the other way around), so a direct downcast inside
`dispatch_size` / `dispatch_encode` would create a cycle.

The kernel is functionally analogous to the Dict kernel: encode each
unique run-value once into a small per-value buffer, then broadcast the
value's encoded bytes across each row in its run. The per-unique-value
cost is amortized over the number of runs rather than the row count.

`walk_runs` translates the run-end array's `(prev_end, curr_end)`
windows into `(start_logical, stop_logical)` row ranges accounting for
the array's slice offset and length.
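
One way to picture that translation is the sketch below, which clips each physical run to the sliced window and rebases it to logical row numbers. The signature and return type are assumptions; the real `walk_runs` may take a callback or different integer types.

```rust
/// Map run ends to `(run_index, start_logical, stop_logical)` ranges for a
/// slice that starts at physical row `offset` and spans `len` rows.
fn walk_runs(ends: &[usize], offset: usize, len: usize) -> Vec<(usize, usize, usize)> {
    let window_end = offset + len;
    let mut ranges = Vec::new();
    let mut prev_end = 0;
    for (run, &curr_end) in ends.iter().enumerate() {
        // Intersect the physical run [prev_end, curr_end) with the window.
        let lo = prev_end.max(offset);
        let hi = curr_end.min(window_end);
        if lo < hi {
            ranges.push((run, lo - offset, hi - offset));
        }
        if curr_end >= window_end {
            break; // later runs lie entirely past the slice
        }
        prev_end = curr_end;
    }
    ranges
}
```

Each returned range then receives the pre-encoded bytes of its run value.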

When ends.len() > len (very sparse runs, or pathological inputs) the
kernel declines so canonicalization stays the dominant path.

Includes a round-trip test in `compute/row_encode.rs` checking that
the RunEnd path matches the canonical path bit-for-bit.

Signed-off-by: Claude <noreply@anthropic.com>
Add a row-encode kernel for BitPacked arrays. The kernel walks the
packed storage in 1024-element fastlanes chunks via
`BitUnpackedChunks::full_chunks`, unpacks each chunk into a stack-local
buffer, and writes the row-encoded bytes for that chunk in one pass.

Patches (when present) are applied per-chunk to the stack buffer so a
patched cell encodes its corrected value rather than the bit-packed
placeholder.

The shared `row_encode_common` module factors out the per-chunk encode
primitive (`encode_primitive_chunk`) and a small `PrimRowEncode`
trait — the same shape FoR and Delta will use in the next commit so
those kernels can share the chunk-walk machinery.

Kernel is registered via the `inventory`-based registry, since
`vortex-fastlanes` depends on `vortex-array`.

Includes a `bitpacked_i32_*` bench triplet (arrow-row baseline, vortex
with kernel, vortex through canonicalization).

Signed-off-by: Claude <noreply@anthropic.com>
Two more chunk-walking kernels alongside the BitPacked one. Both
register via the inventory-based registry.

FoR (Frame of Reference):
- Common fused path: FoR around a BitPacked storage with an unsigned
  reference. Walks the bit-packed chunks via `FoR::unchecked_unfor_pack`
  into a stack buffer with the base wrapping-added inline, then encodes
  rows from that buffer.
- Slow path: FoR around a Primitive storage. Walks the canonical buffer
  once with a per-row wrapping_add and the row encode.

Delta:
- Use the existing chunked `decompress_primitive` to write into a
  primitive buffer, then encode rows from that buffer. Skips the
  PrimitiveArray wrapping + validity attach.

Adds `for_i64_*` and `delta_i64_*` bench triplets.

Signed-off-by: Claude <noreply@anthropic.com>

codspeed-hq Bot commented May 18, 2026

Merging this PR will improve performance by 18%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 4 improved benchmarks
✅ 1217 untouched benchmarks

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 197.9 µs | 162 µs | +22.19% |
| Simulation | chunked_varbinview_into_canonical[(100, 100)] | 358.4 µs | 323.5 µs | +10.78% |
| Simulation | chunked_varbinview_into_canonical[(1000, 10)] | 211.2 µs | 175.8 µs | +20.11% |
| Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] | 224.8 µs | 188.6 µs | +19.23% |


Comparing claude/row-pr3-kernels (7021233) with develop (faf7e42)

@joseph-isaacs joseph-isaacs changed the title from "FoR and Delta row-encode kernels (vortex-fastlanes)" to "Add row-oriented byte encoder (vortex-row crate)" May 18, 2026
@joseph-isaacs joseph-isaacs changed the base branch from claude/row-c24-bitpacked-kernel to develop May 18, 2026 16:09
Two unrelated CI failures on PR #7985:

1. Check generated source files: vortex-row/public-api.lock was stale
   - field_encode_fixed_arithmetic became pub in the arithmetic-write
   commit but the lock wasn't regenerated.

2. Rust publish dry-run: vortex-row's dev-dep on vortex-fastlanes was
   inherited from the workspace with a version specifier. Since
   vortex-fastlanes itself depends on vortex-row (for the inventory
   kernel registration), cargo publish couldn't resolve the version
   on crates.io. Drop the workspace inheritance and use a path-only
   dev-dep for vortex-fastlanes - the bench file is the only consumer
   and cargo strips path-only dev-deps from the published manifest.

Signed-off-by: Claude <noreply@anthropic.com>
@joseph-isaacs joseph-isaacs added the changelog/feature label May 18, 2026 (with Claude)