Add row-oriented byte encoder (vortex-row crate)#7985
Open
joseph-isaacs wants to merge 26 commits into
Add an empty `vortex-row` crate with a minimal `initialize` stub so the following commits can layer in the row-encoder, codec, scalar functions, and per-encoding kernels without touching the workspace skeleton each time. The crate is wired into the workspace members list and workspace dependency table; `public-api.lock` is generated against the stub. Signed-off-by: Claude <noreply@anthropic.com>
Introduce the per-column sort-field options and the variadic-function options struct used by the upcoming RowSize / RowEncode scalar functions. `RowEncodeOptions::fields` uses a `SmallVec<[SortField; 4]>` so typical 1-4 column keys avoid a heap allocation. Includes a compact serialize / deserialize helper used later by the scalar-function metadata round-trip. Signed-off-by: Claude <noreply@anthropic.com>
Add the byte-encoding kernels for the fixed-width portion of the row encoder: Null, Bool, Primitive (12 PTypes), and Decimal (i8..i128). Each encoder writes a 1-byte sentinel followed by the value's row-comparable bytes (sign-flipped big-endian for signed ints, sign-aware mask for floats, etc.). The size pass is a constant `width-per-row` add for these types; the encode pass walks rows and writes into the shared output buffer at `offsets[i] + cursors[i]`. `row_width_for_dtype` classifies the column based purely on its DType. Scalar-level encoders (`encode_scalar_primitive` / `encode_scalar_bool` / `encode_scalar_null` / `encode_scalar` / `encoded_size_for_scalar`) are included for the same fixed-width subset; varlen and nested canonical variants bail with a clear "not yet supported" error and land in follow-up commits. The implementation is deliberately the simplest correct version: bounds-checked array indexing, no `copy_nonoverlapping`, no validity fast-path helper. Subsequent PRs evolve this toward the optimized form. Signed-off-by: Claude <noreply@anthropic.com>
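As an illustration of the fixed-width layout described above, here is a minimal sketch of a sentinel byte followed by row-comparable value bytes. The function names, the sentinel values (`0x00` for null, `0x01` for valid), and the way descending order is applied are assumptions for this example, not the crate's actual code.

```rust
// Sentinel values are assumptions for this sketch (they give nulls-first order).
const NULL_SENTINEL: u8 = 0x00;
const VALID_SENTINEL: u8 = 0x01;

/// Encode an optional i64 as 1 sentinel byte + 8 value bytes.
fn encode_i64(value: Option<i64>, descending: bool) -> [u8; 9] {
    let mut out = [0u8; 9];
    match value {
        // Null rows keep zero-filled value bytes.
        None => out[0] = NULL_SENTINEL,
        Some(v) => {
            out[0] = VALID_SENTINEL;
            // Flip the sign bit so negatives sort before positives, then go big-endian.
            let mut bytes = ((v as u64) ^ (1u64 << 63)).to_be_bytes();
            if descending {
                // Inverting every value byte reverses the byte-wise order.
                for b in &mut bytes {
                    *b = !*b;
                }
            }
            out[1..].copy_from_slice(&bytes);
        }
    }
    out
}

/// Encode a non-null f64 (ascending) with a sign-aware mask.
fn encode_f64(value: f64) -> [u8; 9] {
    let bits = value.to_bits();
    // Negative floats: flip all bits. Non-negative floats: flip only the sign bit.
    let mask = if bits >> 63 == 1 { u64::MAX } else { 1u64 << 63 };
    let mut out = [VALID_SENTINEL; 9];
    out[1..].copy_from_slice(&(bits ^ mask).to_be_bytes());
    out
}
```

Comparing two encoded arrays byte by byte then gives the same order as comparing the original values, which is the property the row encoder relies on.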
Extend the codec to handle Utf8/Binary via VarBinView arrays. Each value encodes as a 1-byte sentinel followed by 32-byte chunks: every full chunk has a 0xFF continuation marker; the final partial chunk pads with zeros and writes the partial length (1..=32) as its trailing byte. `encode_varlen_value` uses the simple byte-at-a-time XOR loop here; a faster `copy_nonoverlapping` + stamped continuation version replaces it in PR 2. `encode_varbinview` uses `arr.with_iterator(...)` for both the nullable and non-nullable branches; a direct view walk for the no-nulls branch lands in PR 2 too. `row_width_for_dtype` now returns `Variable` for Utf8/Binary; the size pass and encode dispatchers route through `add_size_varbinview` / `encode_varbinview` correspondingly. The scalar encoder gains `encode_scalar_varlen` and the matching Utf8/Binary arms. Signed-off-by: Claude <noreply@anthropic.com>
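The block layout above can be sketched as follows for the ascending, non-null case. The sentinel values and the handling of empty values (a distinct sentinel with no blocks) are assumptions of this sketch; the commit message does not spell them out.

```rust
const BLOCK: usize = 32;
const CONTINUATION: u8 = 0xFF;
// Sentinel values and empty-value handling are assumptions for this sketch.
const EMPTY_SENTINEL: u8 = 0x01;
const NON_EMPTY_SENTINEL: u8 = 0x02;

/// Encode one non-null value: sentinel, then 33-byte blocks (32 data bytes + 1 marker).
fn encode_varlen(value: &[u8]) -> Vec<u8> {
    if value.is_empty() {
        return vec![EMPTY_SENTINEL];
    }
    let blocks = value.len().div_ceil(BLOCK);
    let mut out = Vec::with_capacity(1 + blocks * (BLOCK + 1));
    out.push(NON_EMPTY_SENTINEL);
    let mut chunks = value.chunks(BLOCK).peekable();
    while let Some(chunk) = chunks.next() {
        out.extend_from_slice(chunk);
        if chunks.peek().is_some() {
            // A chunk followed by more data is always a full 32 bytes.
            out.push(CONTINUATION);
        } else {
            // Final chunk: zero-pad to 32 bytes, then store its length (1..=32).
            out.resize(out.len() + (BLOCK - chunk.len()), 0);
            out.push(chunk.len() as u8);
        }
    }
    out
}
```

Because the length byte of a final block is at most 32 and the continuation marker is `0xFF`, a value that is a strict prefix of another always sorts first.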
Extend the codec to handle Struct, FixedSizeList, and Extension canonical variants. Each nested row encodes as `outer_sentinel | child bytes...`; for null rows the child bytes are zero-filled after the recursive encoders run so two null rows compare equal regardless of which non-null values would have been written by the children. `row_width_for_dtype` recurses through Struct fields and FSL elements to return `Fixed(w)` when every leaf is fixed; otherwise `Variable`. Extension delegates to its storage dtype. List remains `Variable` and ListView still bails (the row encoder's output is itself a ListView, so nested ListView isn't a near-term use case). Variant and Union bail explicitly. Signed-off-by: Claude <noreply@anthropic.com>
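The recursive width classification can be sketched with a miniature dtype tree. `MiniDType` and `row_width` are hypothetical stand-ins for the real Vortex `DType` and `row_width_for_dtype`, and the widths (one sentinel byte per value and per nested row) are assumptions taken from the layout described in these commits.

```rust
#[derive(Debug, PartialEq)]
enum RowWidth {
    Fixed(u32),
    Variable,
}

// Hypothetical miniature dtype tree; not the real Vortex `DType`.
enum MiniDType {
    Bool,
    Int { bytes: u32 },
    Utf8,
    Struct(Vec<MiniDType>),
    FixedSizeList(Box<MiniDType>, u32),
}

fn row_width(dtype: &MiniDType) -> RowWidth {
    match dtype {
        // 1 sentinel byte + value bytes.
        MiniDType::Bool => RowWidth::Fixed(2),
        MiniDType::Int { bytes } => RowWidth::Fixed(1 + bytes),
        MiniDType::Utf8 => RowWidth::Variable,
        MiniDType::Struct(fields) => {
            let mut total = 1u32; // outer sentinel
            for field in fields {
                match row_width(field) {
                    RowWidth::Fixed(w) => total += w,
                    // A single variable-width leaf makes the whole struct variable.
                    RowWidth::Variable => return RowWidth::Variable,
                }
            }
            RowWidth::Fixed(total)
        }
        MiniDType::FixedSizeList(elem, n) => match row_width(elem) {
            RowWidth::Fixed(w) => RowWidth::Fixed(1 + w * n),
            RowWidth::Variable => RowWidth::Variable,
        },
    }
}
```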
Add the size-pass machinery used by both RowSize and the upcoming
RowEncode pipeline. `compute_sizes` walks the N input columns once,
classifying each via `row_width_for_dtype` and accumulating
fixed-width-prefix sums in `fixed_per_row` while pushing per-row sums
of variable-length columns into a lazily allocated `var_lengths` vec.
The classification result (`ColKind` + `SizePassResult`) is private to
the crate; RowEncode consumes it in a later commit to choose between
the arithmetic and cursor encode paths.
`RowSize` returns a `Struct { fixed: U32, var: U32 }` so callers can
read the per-row width without realizing the constant `fixed` slot as
a per-row buffer (it's a `ConstantArray`); the `var` slot is a
`ConstantArray(0)` when no varlen column is present.
`dispatch_size` is the fallback-only path for PR 1 (canonicalize, then
codec::field_size). The `RowSizeKernel` trait exists but is unused; per-
encoding fast paths and the inventory registry arrive in PR 3.
`initialize()` does NOT register RowSize yet - that lands once
RowEncode is in place, so the session-registered pair appears together.
Signed-off-by: Claude <noreply@anthropic.com>
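A rough sketch of the size pass described in this commit, assuming each column has already been reduced to either a constant width or a slice of per-row lengths. `ColumnSizes` and this `compute_sizes` signature are hypothetical simplifications, not the crate's types.

```rust
// Hypothetical pre-classified column: a constant width or per-row encoded lengths.
enum ColumnSizes<'a> {
    Fixed(u32),
    Variable(&'a [u32]),
}

/// Returns (fixed bytes per row, per-row varlen sums if any varlen column exists).
fn compute_sizes(columns: &[ColumnSizes<'_>], nrows: usize) -> (u32, Option<Vec<u32>>) {
    let mut fixed_per_row = 0u32;
    // Allocated lazily: pure fixed-width inputs never pay for the vector.
    let mut var_lengths: Option<Vec<u32>> = None;
    for col in columns {
        match col {
            ColumnSizes::Fixed(w) => fixed_per_row += *w,
            ColumnSizes::Variable(lens) => {
                assert_eq!(lens.len(), nrows);
                let acc = var_lengths.get_or_insert_with(|| vec![0; nrows]);
                for (a, l) in acc.iter_mut().zip(lens.iter()) {
                    *a += *l;
                }
            }
        }
    }
    (fixed_per_row, var_lengths)
}
```

Returning `None` for the varlen part mirrors the idea that the `var` slot can stay a constant zero when no varlen column is present.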
Add the RowEncode variadic scalar function: encode N input columns into
a single ListView<u8> in a five-phase pipeline.
Phase 1: size pass via `compute_sizes`.
Phase 2: allocate a zero-initialized output buffer sized to fit every
row's encoded bytes; bail if the total exceeds u32::MAX.
Phase 3: build per-row `listview_offsets`: i * fixed_per_row for the
pure-fixed case, or i * fixed_per_row + exclusive cumsum of
varlen lengths otherwise. Uses the simple `Vec::push` +
`checked_add` loop.
Phase 4: walk columns left-to-right and call `dispatch_encode` for
every column (cursor path for all). Each call writes its
per-row bytes at `offsets[i] + cursors[i]` and advances the
cursor.
Phase 5: build the ListView<u8> via the validating `try_new`
constructor.
`dispatch_encode` is the canonicalize-then-`codec::field_encode`
fallback; in-crate kernel arms and the inventory registry land in PR 3.
The `RowEncodeKernel` trait is defined but unused. PR 2 will iterate
on this pipeline (skip zero-init, skip ListView validation, auto-
vectorize the offsets loop, etc.).
Signed-off-by: Claude <noreply@anthropic.com>
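Phase 3 can be sketched as a running sum, which is equivalent to `i * fixed_per_row` plus the exclusive cumulative sum of the varlen lengths. `build_offsets` is a hypothetical name, and returning `None` on overflow stands in for the bail described in Phase 2.

```rust
/// Returns per-row start offsets plus the total encoded length,
/// or None if the total would not fit in u32.
fn build_offsets(
    nrows: usize,
    fixed_per_row: u32,
    var_lengths: Option<&[u32]>,
) -> Option<(Vec<u32>, u32)> {
    let mut offsets = Vec::with_capacity(nrows);
    let mut next = 0u32;
    for i in 0..nrows {
        // Row i starts where row i - 1 ended.
        offsets.push(next);
        let var = var_lengths.map_or(0, |v| v[i]);
        next = next.checked_add(fixed_per_row)?.checked_add(var)?;
    }
    Some((offsets, next))
}
```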
Wire the RowSize/RowEncode scalar functions to the user-facing API:
- `convert_columns` accepts a slice of input arrays and per-column
SortFields, constructs `RowEncodeOptions` + `VecExecutionArgs`, and
returns the encoded `ListViewArray<u8>`.
- `compute_row_sizes` returns just the per-row sizes (the `Struct
{ fixed: u32, var: u32 }` output of `RowSize`).
- `initialize()` now registers `RowSize` and `RowEncode` on the given
session so they are reachable via the expression layer.
Tests cover sort-order round-trips for bool, primitive (i64 asc/desc,
u32, f64), utf8, multi-column, nulls_first/last, struct sort-order, the
single-buffer invariant of the ListView output, and the structural
shape of `RowSize`. Tests that exercise per-encoding fast paths
(`constant_path_matches_canonical`, `dict_path_matches_canonical`) land
together with their respective kernels in PR 3.
The bench file uses divan + mimalloc and reports throughput in GB/s of
encoded output bytes for primitive_i64, utf8, and struct_mixed. Each
has an `arrow_row` baseline and a `vortex` measurement. Per-encoding
fast-path scenarios (constant/dict/patched/bitpacked/for/delta) gain
their triplets in PR 3.
Baseline measurements at this commit (sample-count=10):
primitive_i64_vortex ~1.97 GB/s (vs arrow-row 4.12 GB/s)
utf8_vortex ~0.87 GB/s (vs arrow-row 1.56 GB/s)
struct_mixed_vortex ~0.95 GB/s (vs arrow-row 1.19 GB/s)
PR 2 closes most of the gap by replacing the validating
`ListViewArray::try_new` with `new_unchecked`, skipping the buffer
zero-init, auto-vectorizing the offsets and varlen-block paths, etc.
Signed-off-by: Claude <noreply@anthropic.com>
The encoder constructs the ListView's elements/offsets/sizes itself and
maintains every invariant by construction: monotone offsets, each
slice's `offsets[i] + sizes[i] <= total`, pairwise-disjoint slices.
`ListViewArray::try_new` re-walks every row to validate those properties,
which doubles as a memory pass over the just-built offsets/sizes arrays.
Switch to `unsafe { ListViewArray::new_unchecked(...) }` with an inline
SAFETY comment justifying each invariant.
primitive_i64_vortex throughput improves from ~1.80 GB/s to ~4.7 GB/s
on isolated runs (the validate walk dominates for small per-row payloads;
larger varlen rows show smaller % improvements).
Signed-off-by: Claude <noreply@anthropic.com>
…h it
Most production columns are non-nullable or `AllValid`, in which case
the per-row `mask.value(i)` branch is dead weight. Introduce a
`ValidityKind { AllValid, Mask(...) }` helper resolved exactly once per
column, and pattern-match on it in the four encoders that loop over
rows: `encode_primitive_typed`, `encode_bool`, `encode_varbinview`,
`add_size_varbinview`.
For NonNullable / AllValid columns this skips the mask materialization
entirely, and the inner loop has no validity branch. For nullable
columns the materialized mask is held once instead of re-resolved per
row.
Yields ~10% across canonical paths on isolated runs; combines with the
later auto-vectorization commit because removing the per-row branch
makes the inner loop a candidate for the compiler's vectorizer.
Signed-off-by: Claude <noreply@anthropic.com>
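The shape of that helper can be sketched as below. The `ValidityKind` here borrows a plain `&[bool]` rather than a Vortex mask, and `encode_i32_column` is a hypothetical encoder; the point is only that validity is resolved once, outside the row loop.

```rust
// Simplified stand-in: the real helper wraps a Vortex mask, not a bool slice.
enum ValidityKind<'a> {
    AllValid,
    Mask(&'a [bool]),
}

fn push_valid_i32(v: i32, out: &mut Vec<u8>) {
    out.push(1); // assumed "valid" sentinel
    out.extend_from_slice(&((v as u32) ^ (1 << 31)).to_be_bytes());
}

fn encode_i32_column(values: &[i32], validity: &ValidityKind<'_>, out: &mut Vec<u8>) {
    // Match once per column so the hot loop has no validity branch when all rows are valid.
    match validity {
        ValidityKind::AllValid => {
            for &v in values {
                push_valid_i32(v, out);
            }
        }
        ValidityKind::Mask(mask) => {
            for (&v, &valid) in values.iter().zip(mask.iter()) {
                if valid {
                    push_valid_i32(v, out);
                } else {
                    // Null slot: assumed null sentinel followed by zero-filled value bytes.
                    out.extend_from_slice(&[0u8; 5]);
                }
            }
        }
    }
}
```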
`BufferMut::with_capacity(total_len).push_n(0u8, total_len)` issues a
memset of the entire output range, only to have every byte overwritten
by the encoders. The encoders cover every byte by construction:
- Fixed-width non-null slots: sentinel + value bytes.
- Fixed-width null slots: sentinel + explicit per-byte zero-fill loop.
- Varlen blocks: full blocks are written by `encode_varlen_value`; the
partial-block tail is zero-padded by that same function.
- Struct/FSL null bodies: zero-filled after the child encoders run.
Switch to `unsafe { out_buf.set_len(total_len) }` with a SAFETY comment
recording the invariant. Reclaims a `total_len`-byte memset per call;
for varlen-heavy inputs (multiple MB of output) this saves real time.
dict_utf8 (varlen heavy) throughput: ~3.74 GB/s → ~4.55 GB/s.
Signed-off-by: Claude <noreply@anthropic.com>
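The same idea in plain `Vec<u8>` terms, as a sketch rather than the commit's code: reserve capacity, write every byte, and only then set the length. Unlike the commit, which calls `set_len` on the output buffer up front, this version writes through `spare_capacity_mut` first so that no uninitialized byte is ever exposed as `u8`. `encode_u8_rows` and its 2-byte row layout are hypothetical.

```rust
use std::mem::MaybeUninit;

/// Encode each optional u8 as 2 bytes (sentinel + value) without a zeroing pass.
fn encode_u8_rows(values: &[Option<u8>]) -> Vec<u8> {
    let total_len = values.len() * 2;
    let mut out: Vec<u8> = Vec::with_capacity(total_len);
    let spare: &mut [MaybeUninit<u8>] = &mut out.spare_capacity_mut()[..total_len];
    for (slot, v) in spare.chunks_exact_mut(2).zip(values) {
        match v {
            Some(x) => {
                slot[0].write(1);
                slot[1].write(*x);
            }
            None => {
                // Null slots are written explicitly, so no byte is left uninitialized.
                slot[0].write(0);
                slot[1].write(0);
            }
        }
    }
    // SAFETY: the loop above wrote all `total_len` bytes, and capacity >= total_len.
    unsafe { out.set_len(total_len) };
    out
}
```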
The pure-fixed branch built `listview_offsets` via `Vec::push` + `checked_mul`, which forces the compiler to emit a per-iteration overflow branch and a `push`-style length-update sequence. Both inhibit the autovectorizer. We already validated `total` (= `nrows * fixed_per_row`) fits in u32 before reaching Phase 3, so each individual `i * fixed_per_row` also fits. Replace the loop with a raw `ptr.add(i).write(...)` write through the reserved capacity and a final `set_len(nrows)`. LLVM lowers the inner write to a SIMD store on x86 (verified via cargo asm in earlier iterations). primitive_i64_vortex throughput: ~4.96 GB/s → ~7.74 GB/s on isolated runs. The mixed branch gets the same treatment in the next commit. Signed-off-by: Claude <noreply@anthropic.com>
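A sketch of that transformation, assuming the total is validated once up front. `fixed_offsets` is a hypothetical name, and whether a given compiler version vectorizes the loop is not something this example demonstrates.

```rust
/// Build offsets[i] = i * fixed_per_row with one up-front bound check.
fn fixed_offsets(nrows: usize, fixed_per_row: u32) -> Vec<u32> {
    // Validate the total once so the loop body needs no overflow branch.
    let total = (nrows as u128) * (fixed_per_row as u128);
    assert!(total <= u32::MAX as u128, "encoded output exceeds u32::MAX bytes");

    let mut offsets: Vec<u32> = Vec::with_capacity(nrows);
    let ptr = offsets.as_mut_ptr();
    for i in 0..nrows {
        // SAFETY: i < nrows <= capacity, so the write stays inside the allocation.
        // wrapping_mul cannot actually wrap because i * fixed_per_row <= total.
        unsafe { ptr.add(i).write((i as u32).wrapping_mul(fixed_per_row)) };
    }
    // SAFETY: all `nrows` elements were initialized by the loop above.
    unsafe { offsets.set_len(nrows) };
    offsets
}
```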
Apply the same `Vec::push` → raw-pointer-write transformation to the mixed (fixed-plus-varlen) branch of Phase 3. We already validated the total fits in u32 upstream, so `wrapping_mul` / `wrapping_add` here are sound. Throughput on the mixed paths stays within bench noise; this commit keeps the pure-fixed and mixed branches structurally identical so reviewers see the same shape regardless of whether varlen is present. Signed-off-by: Claude <noreply@anthropic.com>
The byte-at-a-time XOR loop is per-byte branch-heavy: 32 conditional writes per block on each path, even for the ascending (no-XOR) case where the body is exactly a `memcpy(32) + stamp(1)`. Rewrite `encode_varlen_value` with two distinct fast paths:
- Ascending: `copy_nonoverlapping(src, dst, 32)` + a single 0xFF stamp. The compiler folds the loop into a SIMD memcpy.
- Descending: a `xor_copy_block` helper that XOR-copies 32 bytes via four u64 reads/writes; LLVM lowers it to SIMD on x86.
The partial-block tail uses `write_bytes` for the zero-padding instead of a per-byte loop.
utf8 throughput: ~0.92 GB/s → ~1.39 GB/s. struct_mixed: +35%.
Signed-off-by: Claude <noreply@anthropic.com>
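For the descending path, a safe-Rust sketch of a 32-byte XOR copy done as four u64 words. It borrows the `xor_copy_block` name from the commit message, but the signature and body are assumptions; the real helper may use raw pointers.

```rust
/// Copy 32 bytes from `src` to `dst`, inverting every bit, as four u64 words.
fn xor_copy_block(src: &[u8; 32], dst: &mut [u8; 32]) {
    for (d, s) in dst.chunks_exact_mut(8).zip(src.chunks_exact(8)) {
        // Native-endian round trip: XOR with all ones is byte-order independent.
        let word = u64::from_ne_bytes(s.try_into().unwrap()) ^ u64::MAX;
        d.copy_from_slice(&word.to_ne_bytes());
    }
}
```

Working on fixed-size arrays gives the compiler a known trip count, which is the property that makes this kind of loop a candidate for SIMD lowering.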
`arr.with_iterator(...)` constructs an `Option<&[u8]>` per row through a trait-object dispatch and a branch-and-merge that hides the inline-vs-buffer view from the compiler. On the AllValid path we don't need the Option (no nulls) and we want the compiler to see the inline-vs-buffer branch directly so it can keep the inline arm in registers. Walk `arr.views()` directly and resolve each view via `is_inlined() → as_inlined().value()` vs `as_view() → buffers[idx][offset..offset + len]`. Cache data-buffer slices once before the loop (SmallVec for ≤4 buffers, the common case). The nullable path is unchanged because the `Option<&[u8]>` shape is already what we want when nulls are possible. utf8 throughput: ~1.49 GB/s → ~1.84 GB/s. Signed-off-by: Claude <noreply@anthropic.com>
ColKind::Fixed { before_varlen: true, .. } columns have a constant
within-row write offset (sum of preceding fixed-column widths plus
i * fixed_per_row plus var_prefix[i] when varlen columns are present).
For these we don't need a per-row cursor; the position is pure
arithmetic.
Adds dispatch_encode_fixed_arith + field_encode_fixed_arithmetic and
routes the relevant ColKind arm of execute_row_encode's phase 4
through them. Fixed-after-varlen columns and varlen columns continue
through the existing cursor path.
primitive_i64 vortex 3.0 -> 6+ GB/s.
Signed-off-by: Claude <noreply@anthropic.com>
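The position arithmetic can be sketched as below, using a hypothetical 2-byte column (sentinel + u8 value). `write_fixed_column`, `col_offset`, and the `usize` types are illustrative assumptions; `var_prefix[i]` is taken to be the total varlen bytes of all rows before row `i`.

```rust
/// Write a 2-byte fixed column into pre-sized row storage without a per-row cursor.
fn write_fixed_column(
    out: &mut [u8],
    values: &[u8],
    fixed_per_row: usize,
    col_offset: usize, // sum of the widths of the preceding fixed columns
    var_prefix: Option<&[usize]>,
) {
    for (i, &v) in values.iter().enumerate() {
        // The write position is pure arithmetic on the row index.
        let pos = i * fixed_per_row + col_offset + var_prefix.map_or(0, |p| p[i]);
        out[pos] = 1; // assumed "valid" sentinel
        out[pos + 1] = v;
    }
}
```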
When a ConstantArray feeds the fixed-before-varlen arithmetic path, the encoded scalar bytes are the same for every row. Hoist them into 1-2 register-sized loads outside the loop and emit direct write_unaligned stores per row. Specialized for encoded lengths 2 (bool/i8), 5 (i32), 9 (i64), 17 (i128). Other lengths fall back to copy_nonoverlapping. The var_prefix case (Constant after a varlen column) takes the same shape but computes per-row positions arithmetically rather than via a running cursor. constant_i64_vortex_without_kernel 2.47 -> ~6 GB/s (PR3 commit 3.3 adds the same specialization to the Constant kernel itself). Signed-off-by: Claude <noreply@anthropic.com>
Wire per-encoding fast-path traits into `dispatch_size` and `dispatch_encode`. Both helpers now try the in-crate downcast arms (Constant, Dict, Patched) before falling back to canonicalization. This commit adds stub impls returning `Ok(None)` so the existing behavior is preserved bit-for-bit; subsequent commits replace each stub with its real impl. Keeping the wiring change separate from the algorithm work makes the kernel impl commits trivially reviewable in isolation (they only touch one file each). The kernel module is `mod kernels` (crate-private) so the impls satisfy the orphan rule (trait defined in `vortex-row`, types from `vortex-array`) without leaking the impls into the crate's public surface. Signed-off-by: Claude <noreply@anthropic.com>
Encodings that live outside `vortex-array` (e.g. RunEnd, BitPacked, FoR, Delta) can't be downcast from inside the variadic dispatch loops - vortex-array doesn't know about them, and reversing the dependency would create a cycle. Add a `RowEncodeRegistration` that downstream crates submit via the inventory crate. `lookup(&array_id)` lazily builds an `ArrayId → (size, encode)` HashMap on first call, behind a `OnceLock` so the build is single-threaded and the lookups are wait-free thereafter. Wire the lookup into `dispatch_size` / `dispatch_encode` after the in-crate downcast attempts: in-crate kernels take precedence (constant- time downcast), then downstream registrations (HashMap lookup), then the canonicalization fallback. Signed-off-by: Claude <noreply@anthropic.com>
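A std-only sketch of the lazy lookup table. The real registry collects registrations through the `inventory` crate; here a static slice stands in for that collection step, and `Registration`, the string ids, and the `size` signature are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Hypothetical registration record; the real one carries size and encode kernels.
struct Registration {
    array_id: &'static str,
    size: fn(usize) -> usize,
}

fn runend_size(nrows: usize) -> usize {
    nrows * 9
}

fn bitpacked_size(nrows: usize) -> usize {
    nrows * 5
}

// Stand-in for the set of registrations submitted by downstream crates.
static REGISTRATIONS: &[Registration] = &[
    Registration { array_id: "runend", size: runend_size },
    Registration { array_id: "bitpacked", size: bitpacked_size },
];

/// Build the id -> registration map on first use, then serve read-only lookups.
fn lookup(array_id: &str) -> Option<&'static Registration> {
    static TABLE: OnceLock<HashMap<&'static str, &'static Registration>> = OnceLock::new();
    TABLE
        .get_or_init(|| REGISTRATIONS.iter().map(|r| (r.array_id, r)).collect())
        .get(array_id)
        .copied()
}
```

A dispatcher would call `lookup` only after its in-crate downcast attempts fail, and fall back to canonicalization when it returns `None`.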
Replace the stub `RowSizeKernel` / `RowEncodeKernel` impls for `ConstantArray` with real implementations that skip canonicalization. The size pass adds the (constant) per-row scalar size to every entry of the shared `sizes` slice. The encode pass encodes the scalar bytes once into a small heap buffer, then `copy_nonoverlapping`s those bytes into each row's slot. Per-row work is one `copy_nonoverlapping(N)` plus one cursor add, where `N` is typically 9 (i64), 5 (i32), or 17 (i128). Add a `constant_i64_*` bench triplet (arrow-row baseline, vortex with kernel, vortex through canonicalization) and a `constant_path_matches_canonical` test that round-trips bytes both ways and asserts they're identical. Signed-off-by: Claude <noreply@anthropic.com>
Replace the stub `RowSizeKernel` / `RowEncodeKernel` impls for `Dict` with real implementations that skip canonicalization. Strategy: encode each unique value once into a small per-value buffer, then materialize the per-row contribution by indexing into the buffer via the codes array. Per-row cost becomes one `copy_from_slice` of the value's encoded bytes rather than re-encoding from scratch. Amortizes the encode work over the dictionary's cardinality instead of the row count. When values.len() > codes.len() the kernel declines (the canonical path is at least as fast because each value would be touched ≤ 1 time). `add_codes_sizes::<T>` has a u8 fast-path that reads the codes as a raw `&[u8]` slice to elide TryInto overhead. Includes `dict_utf8_*` bench triplet (arrow-row baseline, vortex with kernel, vortex through canonicalization) and a `dict_path_matches_canonical` round-trip test. Signed-off-by: Claude <noreply@anthropic.com>
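The strategy can be sketched for a non-null i64 dictionary with u8 codes. `encode_dict_i64`, the flat output buffer, and the 9-byte layout are assumptions for illustration; the real kernel writes into the shared row buffer at per-row positions.

```rust
/// Assumed fixed-width layout: sentinel + sign-flipped big-endian bytes.
fn encode_i64_value(v: i64) -> [u8; 9] {
    let mut out = [1u8; 9];
    out[1..].copy_from_slice(&((v as u64) ^ (1u64 << 63)).to_be_bytes());
    out
}

/// Returns None when the kernel declines (more dictionary values than rows).
fn encode_dict_i64(values: &[i64], codes: &[u8]) -> Option<Vec<u8>> {
    if values.len() > codes.len() {
        return None;
    }
    // Encode each unique value exactly once.
    let encoded: Vec<[u8; 9]> = values.iter().map(|&v| encode_i64_value(v)).collect();
    let mut out = Vec::with_capacity(codes.len() * 9);
    for &code in codes {
        // Per-row work is a single copy of pre-encoded bytes.
        out.extend_from_slice(&encoded[code as usize]);
    }
    Some(out)
}
```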
Replace the stub `RowSizeKernel` / `RowEncodeKernel` impls for `Patched` with real implementations. Size pass: per-row size matches the inner array exactly because patches share the inner dtype. Just delegate to `dispatch_size` on the inner array. Encode pass: delegate to `dispatch_encode` on the inner array, then walk the patch indices and overwrite each patched row's value bytes in place. Patched arrays in our hot paths are Primitive-typed (BitPacked with patches, etc.), so the kernel checks `DType::Primitive` upfront and declines for anything else. Pre-cursor snapshot is captured before the inner encoder advances `cursors`, so the overlay knows each row's slot start position. Adds `patched_i32_*` bench triplet. Patched-specific tests live next to the kernel in `kernels/patched.rs::tests` (round-trip vs canonical, both single-chunk and multi-chunk). Signed-off-by: Claude <noreply@anthropic.com>
Add a row-encode kernel for `RunEnd` arrays via the inventory-based registry: the encoding lives in `vortex-runend` which depends on `vortex-array` (not the other way around), so a direct downcast inside `dispatch_size` / `dispatch_encode` would create a cycle. The kernel is functionally analogous to the Dict kernel: encode each unique run-value once into a small per-value buffer, then broadcast the value's encoded bytes across each row in its run. The per-unique-value cost is amortized over the number of runs rather than the row count. `walk_runs` translates the run-end array's `(prev_end, curr_end)` windows into `(start_logical, stop_logical)` row ranges accounting for the array's slice offset and length. When ends.len() > len (very sparse runs, or pathological inputs) the kernel declines so canonicalization stays the dominant path. Includes a round-trip test in `compute/row_encode.rs` checking that the RunEnd path matches the canonical path bit-for-bit. Signed-off-by: Claude <noreply@anthropic.com>
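The run-to-row-range translation can be sketched as below. This `walk_runs` returns a vector of `(run_index, start_logical, stop_logical)` tuples, which is an assumed shape; the real helper may take a callback or different index types.

```rust
/// Map physical run ends onto logical row ranges of a slice [offset, offset + len).
fn walk_runs(ends: &[usize], offset: usize, len: usize) -> Vec<(usize, usize, usize)> {
    let mut ranges = Vec::new();
    let mut prev_end = 0usize;
    for (run, &end) in ends.iter().enumerate() {
        // Clamp the run's physical window to the slice.
        let start = prev_end.max(offset);
        let stop = end.min(offset + len);
        if start < stop {
            // Shift into logical (slice-relative) row coordinates.
            ranges.push((run, start - offset, stop - offset));
        }
        prev_end = end;
    }
    ranges
}
```

Each returned range tells the encoder which rows should receive the pre-encoded bytes of that run's value.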
Add a row-encode kernel for BitPacked arrays. The kernel walks the packed storage in 1024-element fastlanes chunks via `BitUnpackedChunks::full_chunks`, unpacks each chunk into a stack-local buffer, and writes the row-encoded bytes for that chunk in one pass. Patches (when present) are applied per-chunk to the stack buffer so a patched cell encodes its corrected value rather than the bit-packed placeholder. The shared `row_encode_common` module factors out the per-chunk encode primitive (`encode_primitive_chunk`) and a small `PrimRowEncode` trait — the same shape FoR and Delta will use in the next commit so those kernels can share the chunk-walk machinery. Kernel is registered via the `inventory`-based registry, since `vortex-fastlanes` depends on `vortex-array`. Includes a `bitpacked_i32_*` bench triplet (arrow-row baseline, vortex with kernel, vortex through canonicalization). Signed-off-by: Claude <noreply@anthropic.com>
Two more chunk-walking kernels alongside the BitPacked one. Both register via the inventory-based registry.
FoR (Frame of Reference):
- Common fused path: FoR around a BitPacked storage with an unsigned reference. Walks the bit-packed chunks via `FoR::unchecked_unfor_pack` into a stack buffer with the base wrapping-added inline, then encodes rows from that buffer.
- Slow path: FoR around a Primitive storage. Walks the canonical buffer once with a per-row wrapping_add and the row encode.
Delta:
- Use the existing chunked `decompress_primitive` to write into a primitive buffer, then encode rows from that buffer. Skips the PrimitiveArray wrapping + validity attach.
Adds `for_i64_*` and `delta_i64_*` bench triplets.
Signed-off-by: Claude <noreply@anthropic.com>
Merging this PR will improve performance by 18%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 197.9 µs | 162 µs | +22.19% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(100, 100)] | 358.4 µs | 323.5 µs | +10.78% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(1000, 10)] | 211.2 µs | 175.8 µs | +20.11% |
| ⚡ | Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] | 224.8 µs | 188.6 µs | +19.23% |
Comparing claude/row-pr3-kernels (7021233) with develop (faf7e42)
Two unrelated CI failures on PR #7985:
1. Check generated source files: vortex-row/public-api.lock was stale - field_encode_fixed_arithmetic became pub in the arithmetic-write commit but the lock wasn't regenerated.
2. Rust publish dry-run: vortex-row's dev-dep on vortex-fastlanes was inherited from the workspace with a version specifier. Since vortex-fastlanes itself depends on vortex-row (for the inventory kernel registration), cargo publish couldn't resolve the version on crates.io. Drop the workspace inheritance and use a path-only dev-dep for vortex-fastlanes - the bench file is the only consumer and cargo strips path-only dev-deps from the published manifest.
Signed-off-by: Claude <noreply@anthropic.com>
Part 25 of 25 in the stacked PR series adding `vortex-row`. This is the top of the stack.
This PR contains exactly one commit; review just that diff in isolation. The full design + rationale for the entire series is below.
What this commit does
Two more chunk-walking kernels alongside the BitPacked one. Both register via the inventory-based registry.
FoR (Frame of Reference):
- Common fused path: FoR around a BitPacked storage with an unsigned reference. Walks the bit-packed chunks via `FoR::unchecked_unfor_pack` into a stack buffer with the base wrapping-added inline, then encodes rows from that buffer.
- Slow path: FoR around a Primitive storage. Walks the canonical buffer once with a per-row wrapping_add and the row encode.
Delta:
- Uses the existing chunked `decompress_primitive` to write into a primitive buffer, then encodes rows from that buffer. Skips the PrimitiveArray wrapping + validity attach.
Adds `for_i64_*` and `delta_i64_*` bench triplets.
Stack
- `claude/row-c01-crate-scaffolding`
- `claude/row-c02-sortfield-options`
- `claude/row-c03-codec-fixed-width`
- `claude/row-c04-codec-varlen`
- `claude/row-c05-codec-nested`
- `claude/row-c06-rowsize-scalarfn`
- `claude/row-c07-rowencode-scalarfn`
- `claude/row-c08-convert-columns-tests-bench`
- `claude/row-c09-skip-listview-validation`
- `claude/row-c10-validity-fast-path`
- `claude/row-c11-skip-zero-init`
- `claude/row-c12-vectorize-pure-fixed-offsets`
- `claude/row-c13-vectorize-mixed-offsets`
- `claude/row-c14-varlen-block-copy-nonoverlapping`
- `claude/row-c15-walk-varbinview-directly`
- `claude/row-c16-arith-write-fast-path`
- `claude/row-c17-specialize-constant-arith`
- `claude/row-c18-kernel-dispatch-helpers`
- `claude/row-c19-inventory-registry`
- `claude/row-c20-constant-kernel`
- `claude/row-c21-dict-kernel`
- `claude/row-c22-patched-kernel`
- `claude/row-c23-runend-kernel`
- `claude/row-c24-bitpacked-kernel`
- `claude/row-pr3-kernels`
Base of this PR: #8009 (`claude/row-c24-bitpacked-kernel`). This is the top of the stack.
Combined context
This is the top-of-stack PR for the 25-commit `vortex-row` series. Each previous PR contains exactly one commit; the full diff (all 25 commits) is the union of the stack.
The overall change introduces a row-encoded representation for sorting/joining columnar Vortex data: N input columns are encoded into a single byte-comparable `ListView<u8>` per row, with per-column sort options (ascending/descending, nulls-first/last). The series builds the crate bottom-up:
- PR 1 (`claude/row-pr1-base` at "vortex-row: convert_columns + tests + bench scaffolding" #7993's tip): crate scaffolding, options struct, the byte-codec for fixed-width / varlen / nested canonical dtypes, the `RowSize` and `RowEncode` scalar functions, and the `convert_columns` user-facing entry point with tests and bench scaffolding.
- PR 2 (`claude/row-pr2-perf` at "Specialize Constant for the arithmetic-write fast path" #8002's tip): performance work on the codec hot path: skipping ListView validation and zero-init, validity fast-path, auto-vectorized offset construction, `copy_nonoverlapping`-based varlen block encoder, direct VarBinView walking, and the arithmetic-write fast path with Constant specialization.
- PR 3 (`claude/row-pr3-kernels`, this PR): per-encoding kernels: dispatch helpers, the inventory-based registry for downstream encodings, and Constant / Dict / Patched / RunEnd / BitPacked / FoR / Delta kernels that skip canonicalization.