A SQLite extension for generating text embeddings with llama.cpp. A sister project to sqlite-vec and sqlite-rembed. A work-in-progress!
sqlite-lembed uses embedding models in the GGUF format to generate embeddings. These are a bit hard to find or convert, so here's a sample model you can use:
```bash
curl -L -o all-MiniLM-L6-v2.e4ce9877.q8_0.gguf https://huggingface.co/asg017/sqlite-lembed-model-examples/resolve/main/all-MiniLM-L6-v2/all-MiniLM-L6-v2.e4ce9877.q8_0.gguf
```

This is the sentence-transformers/all-MiniLM-L6-v2 model that I converted to the .gguf format, and quantized at Q8_0 (made smaller at the expense of some quality).
To load it into sqlite-lembed, register it with the temp.lembed_models table.
```sql
.load ./lembed0

INSERT INTO temp.lembed_models(name, model)
  select 'all-MiniLM-L6-v2', lembed_model_from_file('all-MiniLM-L6-v2.e4ce9877.q8_0.gguf');

select lembed(
  'all-MiniLM-L6-v2',
  'The United States Postal Service is an independent agency...'
);
```

The temp.lembed_models virtual table lets you "register" models with pure INSERT INTO statements. The name column is a unique identifier for a given model, and the model column is set by passing the on-disk path of the .gguf file to the lembed_model_from_file() function.
sqlite-lembed works well with sqlite-vec, a SQLite extension for vector search. Embeddings generated with lembed() use the same BLOB format for vectors that sqlite-vec uses.
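To make the shared BLOB format a bit more concrete, here is a minimal Python sketch using only the standard library. It assumes the compact vector format is a flat array of 4-byte floats in little-endian order, so a 384-dimensional all-MiniLM-L6-v2 embedding would take 1536 bytes. The helper names serialize_f32 and deserialize_f32 are hypothetical, and the vector is made up rather than a real embedding.

```python
import sqlite3
import struct


def serialize_f32(vector):
    """Pack a list of floats into a flat float32 BLOB (assumed little-endian)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def deserialize_f32(blob):
    """Unpack a float32 BLOB back into a list of Python floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


# Made-up 4-dimensional vector; the values are exactly representable in float32,
# so they survive the round trip unchanged.
vector = [0.5, -1.25, 2.0, 0.0]
blob = serialize_f32(vector)

# Store the BLOB in an ordinary SQLite table and read it back.
db = sqlite3.connect(":memory:")
db.execute("create table demo(embedding blob)")
db.execute("insert into demo(embedding) values (?)", (blob,))
(stored,) = db.execute("select embedding from demo").fetchone()

print(len(stored))              # 4 bytes per dimension
print(deserialize_f32(stored))
```

A client library could use the same kind of packing to pass its own vectors as query parameters, provided the byte layout assumption above holds.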
Here's a sample "semantic search" application, made from a sample dataset of news article headlines.
```sql
create table articles(
  headline text
);

-- Random NPR headlines from 2024-06-04
insert into articles VALUES
  ('Shohei Ohtani''s ex-interpreter pleads guilty to charges related to gambling and theft'),
  ('The jury has been selected in Hunter Biden''s gun trial'),
  ('Larry Allen, a Super Bowl champion and famed Dallas Cowboy, has died at age 52'),
  ('After saying Charlotte, a lone stingray, was pregnant, aquarium now says she''s sick'),
  ('An Epoch Times executive is facing money laundering charge');

-- Build a vector table with embeddings of article headlines
create virtual table vec_articles using vec0(
  headline_embeddings float[384]
);

insert into vec_articles(rowid, headline_embeddings)
  select rowid, lembed('all-MiniLM-L6-v2', headline)
  from articles;
```
Now we have a regular articles table that stores text headlines, and a vec_articles virtual table that stores embeddings of the article headlines, using the all-MiniLM-L6-v2 model.
To perform a "semantic search" on the embeddings, we can query the vec_articles table with an embedding of our query, and join the results back to our articles table to retrieve the original headlines.
```sql
.param set :query 'firearm courtroom'

with matches as (
  select
    rowid,
    distance
  from vec_articles
  where headline_embeddings match lembed('all-MiniLM-L6-v2', :query)
  order by distance
  limit 3
)
select
  headline,
  distance
from matches
left join articles on articles.rowid = matches.rowid;

/*
+--------------------------------------------------------------+------------------+
|                           headline                           |     distance     |
+--------------------------------------------------------------+------------------+
| Shohei Ohtani's ex-interpreter pleads guilty to charges rela | 1.14812409877777 |
| ted to gambling and theft                                    |                  |
+--------------------------------------------------------------+------------------+
| The jury has been selected in Hunter Biden's gun trial       | 1.18380105495453 |
+--------------------------------------------------------------+------------------+
| An Epoch Times executive is facing money laundering charge   | 1.27715671062469 |
+--------------------------------------------------------------+------------------+
*/
```

Notice how "firearm courtroom" doesn't appear in any of these headlines, but the search can still figure out that "Hunter Biden's gun trial" is related, and the other two justice-related articles also rank at the top.
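Conceptually, the query above ranks rows by the distance between each stored embedding and the embedding of the query text, then joins the closest rowids back to the source table. The Python sketch below is a brute-force illustration of that idea, assuming Euclidean (L2) distance as the metric. It is not how sqlite-vec is implemented. The knn helper, the shortened headlines, and the 3-dimensional vectors are all made up for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(query, rows, k):
    """Return the k (rowid, distance) pairs closest to query, nearest first."""
    scored = [(rowid, l2_distance(query, vec)) for rowid, vec in rows.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical stand-ins for the articles and vec_articles tables.
headlines = {1: "gun trial", 2: "lone stingray", 3: "money laundering charge"}
embeddings = {
    1: [1.0, 0.0, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [0.6, 0.0, 0.8],
}

# Hypothetical query embedding, standing in for lembed(model, :query).
query_embedding = [1.0, 0.0, 0.0]

matches = knn(query_embedding, embeddings, k=2)

# "Join" the matching rowids back to the headlines.
for rowid, distance in matches:
    print(headlines[rowid], round(distance, 3))
```

A smaller distance means a closer match, which is why the query orders by distance ascending before applying the limit.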
Most embedding models out there are provided as PyTorch/ONNX models, but sqlite-lembed uses models in the GGUF file format. However, since ggml/GGUF is relatively new, GGUF embedding models can be hard to find. You can always convert models yourself, or here are a few pre-converted embedding models already in GGUF format:
| Model Name | Link |
|---|---|
| nomic-embed-text-v1.5 | https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF |
| mxbai-embed-large-v1 | https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1 |
- No batch support yet. llama.cpp has support for batch processing multiple inputs, but I haven't figured that out yet. Add a 👍 to Issue #2 if you want to see this fixed.
- Pre-compiled versions of sqlite-lembed don't use the GPU. This was done to make compiling/distribution easier, but that means it will likely take a long time to generate embeddings. If you need it to go faster, try compiling sqlite-lembed yourself (docs coming soon).