7

I am using LangChain's ParentDocumentRetriever. Using mostly the code from their webpage, I managed to create an instance of ParentDocumentRetriever using bge-large embeddings, the NLTK text splitter, and Chroma:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import NLTKTextSplitter
from langchain.vectorstores import Chroma

embedding_function = HuggingFaceEmbeddings(model_name='BAAI/bge-large-en-v1.5', cache_folder=hf_embed_path)
# This text splitter is used to create the child documents
child_splitter = NLTKTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=embedding_function,
    persist_directory="./chroma_db_child"
)

# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

retriever.add_documents(docs, ids=None)

I added documents to it so that I can query using the small chunks for matching but get back the full document: matching_docs = retriever.get_relevant_documents(query_text). The Chroma collection 'full_documents' was stored in ./chroma_db_child. I can read the collection and query it, and I get back the chunks, which is what is expected:

vector_db = Chroma(
    collection_name="full_documents",
    embedding_function=embedding_function,
    persist_directory="./chroma_db_child"
)

matching_doc = vector_db.max_marginal_relevance_search('whatever', 3)
len(matching_doc)
>>3

One thing I can't figure out is how to persist the whole structure. This code uses store = InMemoryStore(), which means that once I stop execution, it goes away.

Is there a way, perhaps using something else instead of InMemoryStore(), to create the ParentDocumentRetriever and persist both the full documents and the chunks, so that I can restore them later without having to go through the retriever.add_documents(docs, ids=None) step again?

2
  • I'd like to add to @Corinna K.'s answer and point to a remote persistent solution that might be preferable depending on the use case: stackoverflow.com/a/77865835/10680282 Commented Jan 11, 2024 at 20:30
  • @DaMako, how can you connect to this Chroma and LocalFileStore with the Chroma persistent client? I have generator code that generates the vector store and local store. My LLM Python code is lightweight; it needs to connect to this Chroma db and query it. Commented Mar 13, 2024 at 12:19

2 Answers

7

I had the same problem and found the solution here: https://github.com/langchain-ai/langchain/issues/9345

You need to use the create_kv_docstore() function like this:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import LocalFileStore
from langchain.storage._lc_store import create_kv_docstore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name='BAAI/bge-large-en-v1.5')  # or any other embedding model

fs = LocalFileStore("./store_location")
store = create_kv_docstore(fs)  # wraps the byte store so it can hold Documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings, persist_directory="./db")
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(documents, ids=None)

You will end up with two folders: the Chroma db "./db" with the child chunks and the "./store_location" folder with the parent documents.

I think it is also possible to save the documents in a Redis db or in Azure Blob Storage (https://python.langchain.com/docs/integrations/document_loaders/azure_blob_storage_container), but I am not sure.


3 Comments

thanks, this helped! Do you know how I can adjust the vectorstore if I don't want to use Chroma but Faiss?
@MaxlGemeinderat just change to vectorstore = FAISS()
Coming back to this: how can you load the ParentDocumentRetriever (e.g. in a different notebook) if you save it like this? So how to load the 2 folders and set it up as a retriever?
0

The guide LangChain - Parent-Document Retriever Deepdive with Custom PgVector Store (https://www.youtube.com/watch?v=wxRQe3hhFwU) describes a custom class based on BaseStore that may also solve the persistent-docstore problem, using pgvector instead of file storage.

Alternatively, simply specify byte_store=store in your ParentDocumentRetriever instead of docstore=store.

Then you will be able to use store = LocalFileStore("path_to_cache") directly, without the extra create_kv_docstore import.

