7

I am using LangChain's ParentDocumentRetriever. Using mostly the code from their webpage, I managed to create an instance of ParentDocumentRetriever using bge-large embeddings, the NLTK text splitter, and Chroma:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import NLTKTextSplitter
from langchain.vectorstores import Chroma

embedding_function = HuggingFaceEmbeddings(model_name='BAAI/bge-large-en-v1.5', cache_folder=hf_embed_path)
# This text splitter is used to create the child documents
child_splitter = NLTKTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=embedding_function,
    persist_directory="./chroma_db_child"
)

# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

retriever.add_documents(docs, ids=None)

I added documents to it so that I can query using the small chunks for matching but get back the full document: matching_docs = retriever.get_relevant_documents(query_text). The Chroma collection 'full_documents' was stored in ./chroma_db_child. I can read the collection and query it, and I get back the chunks, which is what is expected:

vector_db = Chroma(
    collection_name="full_documents",
    embedding_function=embedding_function,
    persist_directory="./chroma_db_child"
)

matching_doc = vector_db.max_marginal_relevance_search('whatever', 3)
len(matching_doc)
>>3

One thing I can't figure out is how to persist the whole structure. This code uses store = InMemoryStore(), which means that once I stop execution, it goes away.

Is there a way, perhaps using something else instead of InMemoryStore(), to create the ParentDocumentRetriever and persist both the full documents and the chunks, so that I can restore them later without having to go through the retriever.add_documents(docs, ids=None) step again?

2
  • I'd like to add to @Corinna K.'s answer and point to a remote persistent solution that might be preferable depending on the use case: stackoverflow.com/a/77865835/10680282 Commented Jan 11, 2024 at 20:30
  • @DaMako, how can you connect to this Chroma and LocalFileStore with the Chroma persistent client? I have generator code that generates the vector store and local store. My LLM Python code is lightweight; it needs to connect to this Chroma db and query it. Commented Mar 13, 2024 at 12:19

2 Answers

7

I had the same problem and found the solution here: https://github.com/langchain-ai/langchain/issues/9345

You need to use the create_kv_docstore() function like this:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import LocalFileStore
from langchain.storage._lc_store import create_kv_docstore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name='BAAI/bge-large-en-v1.5')  # or any other embedding model

fs = LocalFileStore("./store_location")
store = create_kv_docstore(fs)  # wraps the byte store so it can hold Documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings, persist_directory="./db")
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(documents, ids=None)

You will end up with two folders: the Chroma db "./db" with the child chunks and the "./store_location" folder with the parent documents.

I think it is also possible to save the documents in a Redis db or in Azure Blob Storage (https://python.langchain.com/docs/integrations/document_loaders/azure_blob_storage_container), but I am not sure.


3 Comments

thanks, this helped! Do you know how I can adjust the vectorstore if I don't want to use Chroma but Faiss?
@MaxlGemeinderat just change to vectorstore = FAISS()
Coming back to this: how can you load the ParentDocumentRetriever (e.g. in a different notebook) if you save it like this? So how to load the 2 folders and set it up as a retriever?
0

The guide LangChain - Parent-Document Retriever Deepdive with Custom PgVector Store (https://www.youtube.com/watch?v=wxRQe3hhFwU) describes a custom class based on BaseStore that may also solve the persistent-docstore problem, using pgvector instead of file storage.

Alternatively, simply specify byte_store=store in your ParentDocumentRetriever instead of docstore=store.

Then you will be able to use store = LocalFileStore("path_to_cache") directly, without the extra create_kv_docstore import.

