Rust: Docs discrepancy for creating empty tables between Rust and Python. #2759

@valkum

Description

tl;dr: The Rust SDK's Connection::create_empty_table does not support setting embedding columns. It only uses TableDefinition::new_from_schema(schema), which marks every column as physical.

The Python SDK takes a surprising route here. It does not call create_empty_table, as you would assume when looking at AsyncConnection.create_table(data=None, schema=<something derived from e.g. Pydantic>):

data, schema = sanitize_create_table(
    data, schema, metadata, on_bad_vectors, fill_value
)
validate_schema(schema)
if exist_ok is None:
    exist_ok = False
if mode is None:
    mode = "create"
if mode == "create" and exist_ok:
    mode = "exist_ok"
if data is None:
    new_table = await self._inner.create_empty_table(
        name,
        mode,
        schema,
        namespace=namespace,
        storage_options=storage_options,
    )

Instead, sanitize_create_table creates an empty record batch and returns it as data, so data is no longer None by the time the check above runs.
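
The control flow described above can be modelled in a few lines. This is a toy sketch in plain Rust, not code from either SDK: `Path`, `sanitize`, and `create_table` are hypothetical names, and rows are plain strings instead of Arrow batches. It only illustrates why the create_empty_table branch becomes unreachable once sanitisation substitutes an empty batch for missing data.

```rust
// Toy model: hypothetical names, not the Python or Rust SDK.

#[derive(Debug, PartialEq)]
enum Path {
    /// The create_empty_table branch (taken only if data is still None).
    EmptyTable,
    /// The regular create-with-data branch, carrying the number of rows.
    WithData(usize),
}

/// Stand-in for the behaviour described for sanitize_create_table:
/// missing data is replaced by an empty batch.
fn sanitize(data: Option<Vec<String>>) -> Option<Vec<String>> {
    Some(data.unwrap_or_default())
}

/// Stand-in for the branch at the end of the Python snippet.
fn create_table(data: Option<Vec<String>>) -> Path {
    match sanitize(data) {
        None => Path::EmptyTable,
        Some(rows) => Path::WithData(rows.len()),
    }
}
```

In this model, `create_table(None)` ends up on the `WithData` path with zero rows, which is the analogue of Python going through the data-carrying code path even when no data was supplied.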

Let's take a look at the following Rust example:

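use std::sync::Arc;
use arrow_schema::{DataType, Field, Fields, Schema};
// CreateTableMode and EmbeddingDefinition come from the lancedb crate.

let schema = Arc::new(Schema::new(Fields::from_iter([Arc::new(Field::new(
    "name",
    DataType::Utf8,
    true,
))])));
// Ask for a "name_embedding" column computed from "name".
let ed = EmbeddingDefinition {
    source_column: "name".to_owned(),
    dest_column: Some("name_embedding".to_owned()),
    embedding_name: "something-registered".to_owned(),
};
// Runs inside an async context; `connection` is an open Connection.
let table = connection
    .create_empty_table("test", schema)
    .mode(CreateTableMode::Overwrite)
    .add_embedding(ed)
    .unwrap()
    .execute()
    .await
    .unwrap();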

I would expect Table::add to now create embeddings for my input batch. Instead, the created table does not have a name_embedding column.

Looking at Python, the schema is generated, e.g. from a Pydantic model. The schema passed to Rust's create_table is complete: it includes the embedding fields in the Arrow schema and the metadata for the table definition.
Creating a similar Schema in Rust and passing it to create_empty_table does not work, because TableDefinition::new_from_schema erases the whole table definition. I assume that is why Python does not use it and instead creates an empty batch, so that it goes through CreateTableBuilder::into_request, which hands the embedding function definitions to WithEmbeddings correctly.

Nothing in the Rust docs says that I cannot create embedding columns using create_empty_table, and judging by the public API design, the internals seem a bit off.
CreateTableBuilder<false>::new and CreateTableBuilder<false>::execute should support adding embeddings.
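
To make the suspected failure mode concrete, here is a toy model in plain Rust. None of these types are LanceDB's: `ColumnKind`, `ToyEmbedding`, `ToyTableDefinition`, `from_schema`, and `with_embeddings` are hypothetical stand-ins that mirror only the behaviour described in this issue. The point is that a constructor which receives nothing but a schema has no place to record an embedding definition, so every column ends up physical.

```rust
// Toy model only: hypothetical stand-ins, not LanceDB's API.

#[derive(Debug, Clone, PartialEq)]
enum ColumnKind {
    /// Column stored as-is.
    Physical,
    /// Column computed from `source` by the embedding function `function`.
    Embedding { source: String, function: String },
}

#[derive(Debug, Clone, PartialEq)]
struct ToyEmbedding {
    source_column: String,
    dest_column: String,
    embedding_name: String,
}

#[derive(Debug, PartialEq)]
struct ToyTableDefinition {
    columns: Vec<(String, ColumnKind)>,
}

impl ToyTableDefinition {
    /// Mirrors the behaviour described for TableDefinition::new_from_schema:
    /// every field of the schema becomes a physical column.
    fn from_schema(fields: &[&str]) -> Self {
        Self {
            columns: fields
                .iter()
                .map(|f| (f.to_string(), ColumnKind::Physical))
                .collect(),
        }
    }

    /// Sketch of what the issue asks for: keep the physical columns and
    /// append one embedding column per definition.
    fn with_embeddings(fields: &[&str], embeddings: &[ToyEmbedding]) -> Self {
        let mut def = Self::from_schema(fields);
        for e in embeddings {
            def.columns.push((
                e.dest_column.clone(),
                ColumnKind::Embedding {
                    source: e.source_column.clone(),
                    function: e.embedding_name.clone(),
                },
            ));
        }
        def
    }

    /// True if `name` exists and is recorded as an embedding column.
    fn is_embedding(&self, name: &str) -> bool {
        self.columns
            .iter()
            .any(|(n, k)| n == name && matches!(k, ColumnKind::Embedding { .. }))
    }
}
```

In this model, `from_schema(&["name", "name_embedding"])` yields two physical columns even though the second one was meant to be an embedding, which matches the observation that passing a complete schema to create_empty_table does not help. Only the `with_embeddings` path keeps the link between `name` and `name_embedding`.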

Link

No response

Labels: documentation (Improvements or additions to documentation)
