feat(table): add LsmWriteSpec for the MemWAL LSM write path #3396
touch-of-grey wants to merge 1 commit into
jackye1995 left a comment
Similar to the other PR, this should add TypeScript support.
}

#[pymethods]
impl MergeInsertSpec {
We should call this LsmWriteSpec, since it should also be reusable for inserts into tables without a primary key.
})
}

pub fn set_merge_insert_spec<'a>(
We should also add an unset function so that people can revert to the old behavior if necessary.
/// row of each input batch is hashed (to pick the target writer).
/// The caller becomes responsible for pre-sharding the input.
/// No-op for the unsharded variant.
pub fn assume_pre_sharded(&self) -> Self {
The information set by this API is not actually stored. It should be supplied at runtime when running merge insert, not be part of the persisted sharding spec.
let mut dataset = (*table.dataset.get().await?).clone();
dataset
    .initialize_mem_wal_with_shards(
Examine exactly what information is persisted by the MemWAL index; we should not record any other, unnecessary information in the spec. Things like tuning knobs and assume-pre-sharded are runtime configs that should not be part of the shard spec, but should be passed to the specific function call.
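A minimal sketch of the split this comment asks for, written as a toy Python model rather than the actual Lance or LanceDB API. Every name here (`PersistedShardSpec`, `WriteOptions`, `persist`, `needs_row_routing`, and the fields) is hypothetical; the only point is that runtime knobs have no path into the serialized spec.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PersistedShardSpec:
    """What the index would store (hypothetical fields, not the real schema)."""
    column: str
    num_buckets: int


@dataclass(frozen=True)
class WriteOptions:
    """Per-call knobs (hypothetical); never serialized with the spec."""
    assume_pre_sharded: bool = False
    max_buffered_rows: int = 10_000


def persist(spec: PersistedShardSpec) -> str:
    # Only the spec is serialized; WriteOptions is not an input at all.
    return json.dumps(asdict(spec), sort_keys=True)


def needs_row_routing(options: WriteOptions = WriteOptions()) -> bool:
    # The caller opts out of per-row routing at call time, not in the spec.
    return not options.assume_pre_sharded
```

Because `persist` takes only the spec, adding or changing a runtime option cannot alter what is stored.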
/// `num_updated_rows`, and `num_deleted_rows` are all zero and this field
/// holds the total row count written.
#[serde(default)]
pub num_rows: u64,
The merge result and merge writer related changes should be added separately; this PR should only concern setting and clearing the spec.
Add LsmWriteSpec (Bucket / Unsharded) and Table::set_lsm_write_spec / unset_lsm_write_spec to install and clear the spec that selects Lance's MemWAL LSM write path. The actual merge_insert dispatch and writer are a follow-up. Python and TypeScript bindings included. Split out from lancedb#3354; the unenforced primary key half landed in lancedb#3394.
a5ce403 to 3b647ff
Summary
Adds `LsmWriteSpec` and `Table::set_lsm_write_spec` / `unset_lsm_write_spec` to install and clear the spec that selects Lance's MemWAL LSM-style write path:
- `LsmWriteSpec::bucket(column, num_buckets)` hash-partitions writes by the unenforced primary key column.
- `LsmWriteSpec::unsharded()` uses a single MemWAL shard.
- `with_maintained_indexes(...)` lists indexes the MemWAL keeps up to date.
- `set_lsm_write_spec` persists the spec in the MemWAL index; `unset_lsm_write_spec` removes it (dropping the MemWAL index), reverting to the standard `merge_insert` path. `unset` is idempotent.
- Python and TypeScript bindings are included (`setLsmWriteSpec` / `unsetLsmWriteSpec`). `RemoteTable` returns `NotSupported`.

The actual `merge_insert` LSM dispatch and ShardWriter write path are a follow-up PR; this PR only sets and clears the spec.
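The behavior described in this summary can be illustrated with a small in-memory Python model. This is a sketch under stated assumptions, not the bindings added by the PR: `ToyTable`, `shard_for`, `write_path`, and the use of CRC32 as the hash are all hypothetical, and the real hash function and storage are not specified here.

```python
import zlib
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class LsmWriteSpec:
    kind: str  # "bucket" or "unsharded"
    column: Optional[str] = None
    num_buckets: int = 1
    maintained_indexes: Tuple[str, ...] = ()

    @staticmethod
    def bucket(column: str, num_buckets: int) -> "LsmWriteSpec":
        if num_buckets < 1:
            raise ValueError("num_buckets must be >= 1")
        return LsmWriteSpec("bucket", column, num_buckets)

    @staticmethod
    def unsharded() -> "LsmWriteSpec":
        return LsmWriteSpec("unsharded")

    def with_maintained_indexes(self, *names: str) -> "LsmWriteSpec":
        # Returns a new spec; the original stays unchanged.
        return replace(self, maintained_indexes=tuple(names))

    def shard_for(self, key: str) -> int:
        # Hypothetical routing: CRC32 stands in for the real hash.
        if self.kind == "unsharded":
            return 0
        return zlib.crc32(key.encode("utf-8")) % self.num_buckets


class ToyTable:
    """In-memory stand-in for a table; models only set/unset of the spec."""

    def __init__(self) -> None:
        self._spec: Optional[LsmWriteSpec] = None

    def set_lsm_write_spec(self, spec: LsmWriteSpec) -> None:
        self._spec = spec

    def unset_lsm_write_spec(self) -> None:
        # Idempotent: clearing an already-clear spec is a no-op.
        self._spec = None

    def write_path(self) -> str:
        return "standard" if self._spec is None else "lsm"
```

In this model, calling `unset_lsm_write_spec` twice leaves the table on the standard path both times, which is the idempotence the summary describes.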
Context
Split out from #3354; the unenforced primary key half landed in #3394.
Addresses review feedback: renamed from `MergeInsertSpec` (the spec is reusable beyond `merge_insert`); runtime-only knobs (writer tuning, per-row validation) are kept out of the persisted spec; and an `unset` was added.