Description
tl;dr: The Rust SDK's Connection::create_empty_table does not support setting embedding columns. It only uses TableDefinition::new_from_schema(schema), which marks every column as physical.
The Python SDK does something surprising here. It does not use create_empty_table, which is what you would assume when looking at AsyncConnection.create_table(data=None, schema=<something derived from e.g. Pydantic>):
lancedb/python/python/lancedb/db.py
Lines 1154 to 1173 in 0b7b274
data, schema = sanitize_create_table(
    data, schema, metadata, on_bad_vectors, fill_value
)
validate_schema(schema)
if exist_ok is None:
    exist_ok = False
if mode is None:
    mode = "create"
if mode == "create" and exist_ok:
    mode = "exist_ok"
if data is None:
    new_table = await self._inner.create_empty_table(
        name,
        mode,
        schema,
        namespace=namespace,
        storage_options=storage_options,
    )
Instead, sanitize_create_table creates an empty record batch and returns it as data, so the data is None branch above is not taken when a schema is given.
Let's take a look at the following Rust example:
let schema = Arc::new(Schema::new(Fields::from_iter([Arc::new(Field::new(
    "name",
    DataType::Utf8,
    true,
))])));
let ed = EmbeddingDefinition {
    source_column: "name".to_owned(),
    dest_column: Some("name_embedding".to_owned()),
    embedding_name: "something-registered".to_owned(),
};
let table = connection
    .create_empty_table("test", schema)
    .mode(CreateTableMode::Overwrite)
    .add_embedding(ed)
    .unwrap()
    .execute()
    .await
    .unwrap();

I would expect that Table::add would now create embeddings for my input batch. Instead, the created table does not have a name_embedding column.
In Python, the schema is generated, e.g. from a Pydantic model. The schema passed to Rust's create_table is complete: it includes the embedding fields in the Arrow schema and the metadata for the table definition.
Building a similar schema in Rust and passing it to create_empty_table does not work, because TableDefinition::new_from_schema erases the whole table definition. I assume that is why Python avoids this path and instead creates an empty batch, so that it goes through CreateTableBuilder::into_request, which hands the embedding function definitions to WithEmbeddings correctly.
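To make the difference concrete, here is a minimal std-only model of the two behaviours. It is not lancedb's code: TableDef, ColumnKind, EmbeddingDef and with_embeddings are hypothetical stand-ins, and the fallback name `<source>_embedding` for a missing dest_column is an assumption of this sketch.

```rust
// Simplified, std-only model; none of these types are lancedb's real ones.

/// Hypothetical stand-in mirroring the three fields used in the example above.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddingDef {
    pub source_column: String,
    pub dest_column: Option<String>,
    pub embedding_name: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ColumnKind {
    Physical,
    Embedding(EmbeddingDef),
}

#[derive(Debug, Clone, PartialEq)]
pub struct TableDef {
    pub columns: Vec<(String, ColumnKind)>,
}

impl TableDef {
    /// Models the behaviour described in this issue: every schema field
    /// becomes a physical column, so embedding information is lost.
    pub fn new_from_schema(fields: &[&str]) -> Self {
        let columns = fields
            .iter()
            .map(|f| (f.to_string(), ColumnKind::Physical))
            .collect();
        TableDef { columns }
    }

    /// Hypothetical alternative: start from the schema, then mark (or append)
    /// the destination column of each embedding definition.
    pub fn with_embeddings(fields: &[&str], embeddings: &[EmbeddingDef]) -> Self {
        let mut def = Self::new_from_schema(fields);
        for e in embeddings {
            // Assumed default name when no destination column is given.
            let dest = e
                .dest_column
                .clone()
                .unwrap_or_else(|| format!("{}_embedding", e.source_column));
            match def.columns.iter().position(|(name, _)| *name == dest) {
                Some(i) => def.columns[i].1 = ColumnKind::Embedding(e.clone()),
                None => def.columns.push((dest, ColumnKind::Embedding(e.clone()))),
            }
        }
        def
    }

    /// Names of all columns that carry an embedding definition.
    pub fn embedding_columns(&self) -> Vec<&str> {
        self.columns
            .iter()
            .filter(|(_, kind)| matches!(kind, ColumnKind::Embedding(_)))
            .map(|(name, _)| name.as_str())
            .collect()
    }
}
```

In this model, TableDef::new_from_schema(&["name", "name_embedding"]) reports no embedding columns even though the field is present in the schema, which matches what I see with create_empty_table.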
Nothing in the Rust docs says that embedding columns cannot be created with create_empty_table, and given the public API design, the internals seem a bit off.
CreateTableBuilder<false>::new and CreateTableBuilder<false>::execute should support adding embeddings.
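A rough sketch of the behaviour I have in mind, again as a std-only model rather than lancedb's actual builder. TableBuilder, EmbeddingSpec and CreatedTable are hypothetical names, execute is synchronous here for brevity, and the source-column check in add_embedding is an assumption of the sketch.

```rust
// Std-only model of a const-generic builder whose no-data variant still
// honours embedding definitions. Not lancedb's real implementation.

#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddingSpec {
    pub source_column: String,
    pub dest_column: String,
}

#[derive(Debug, PartialEq)]
pub struct CreatedTable {
    pub name: String,
    pub columns: Vec<String>,
    pub embedding_columns: Vec<String>,
}

pub struct TableBuilder<const HAS_DATA: bool> {
    name: String,
    fields: Vec<String>,
    embeddings: Vec<EmbeddingSpec>,
}

// Shared by both variants, like the builder options in the example above.
impl<const HAS_DATA: bool> TableBuilder<HAS_DATA> {
    pub fn add_embedding(mut self, spec: EmbeddingSpec) -> Result<Self, String> {
        // Assumed validation: the source column must exist in the schema.
        if !self.fields.contains(&spec.source_column) {
            return Err(format!("unknown source column: {}", spec.source_column));
        }
        self.embeddings.push(spec);
        Ok(self)
    }
}

// The "empty table" variant: no data, but embeddings are still applied.
impl TableBuilder<false> {
    pub fn new(name: &str, fields: &[&str]) -> Self {
        TableBuilder {
            name: name.to_string(),
            fields: fields.iter().map(|f| f.to_string()).collect(),
            embeddings: Vec::new(),
        }
    }

    pub fn execute(self) -> CreatedTable {
        let mut columns = self.fields;
        let mut embedding_columns = Vec::new();
        for spec in self.embeddings {
            // Append the destination column unless the schema already has it.
            if !columns.contains(&spec.dest_column) {
                columns.push(spec.dest_column.clone());
            }
            embedding_columns.push(spec.dest_column);
        }
        CreatedTable {
            name: self.name,
            columns,
            embedding_columns,
        }
    }
}
```

With this shape, building an empty table named "test" with a single name field and one embedding definition yields a table that already has the name_embedding column, so a later add could fill it.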