MultiVector Retriever
It can often be beneficial to store multiple vectors per document, and there are several use cases where this pays off. LangChain has a base MultiVectorRetriever which makes querying this type of setup easy. Much of the complexity lies in how to create the multiple vectors per document. This notebook covers some of the common ways to create those vectors and use the MultiVectorRetriever.
The methods to create multiple vectors per document include:
- Smaller chunks: split a document into smaller chunks, and embed those (this is ParentDocumentRetriever).
- Summary: create a summary for each document, embed that along with (or instead of) the document.
- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, embed those along with (or instead of) the document.
Note that this also enables another method of adding embeddings: manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, giving you more control; a minimal sketch of this is shown below.
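For example, a rough sketch of the manual approach might look like the following (this is not part of the notebook's walkthrough; the collection name, example text, and query strings are placeholders):

import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain.vectorstores import Chroma
from langchain_core.documents import Document

id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=Chroma(collection_name="manual_queries", embedding_function=OpenAIEmbeddings()),
    byte_store=InMemoryByteStore(),
    id_key=id_key,
)

# A parent document and some hand-written queries that should surface it
parent = Document(page_content="LangChain is a framework for building applications with LLMs.")
parent_id = str(uuid.uuid4())
manual_queries = [
    Document(page_content="What is LangChain?", metadata={id_key: parent_id}),
    Document(page_content="Which framework helps build LLM applications?", metadata={id_key: parent_id}),
]

# Embed only the hand-written queries, but return the parent document at query time
retriever.vectorstore.add_documents(manual_queries)
retriever.docstore.mset([(parent_id, parent)])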
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.storage import InMemoryByteStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
loaders = [
    TextLoader("../../paul_graham_essay.txt"),
    TextLoader("../../state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)
Smaller chunks
It can often be useful to retrieve larger chunks of information but embed smaller chunks. This allows the embeddings to capture the semantic meaning as closely as possible, while still passing as much context as possible downstream. Note that this is what the ParentDocumentRetriever does; here we show what is going on under the hood.
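For reference, a rough sketch of the equivalent setup using ParentDocumentRetriever directly might look like this (it reuses the Chroma, OpenAIEmbeddings, and RecursiveCharacterTextSplitter imports and the docs list from the cells above; the collection name is arbitrary):

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

parent_retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="split_parents", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),
)
parent_retriever.add_documents(docs)

The cells below reproduce that behavior manually with the MultiVectorRetriever.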
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
import uuid
doc_ids = [str(uuid.uuid4()) for _ in docs]
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)
retriever.vectorstore.add_documents(sub_docs)
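# Store the full parent documents keyed by id (the byte_store passed above is exposed as retriever.docstore)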
retriever.docstore.mset(list(zip(doc_ids, docs)))
# Vectorstore alone retrieves the small chunks
retriever.vectorstore.similarity_search("justice breyer")[0]
Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '3f826cfe-78bd-468d-adb8-f5c2719255df', 'source': '../../state_of_the_union.txt'})
# Retriever returns larger chunks
len(retriever.get_relevant_documents("justice breyer")[0].page_content)
9875
The default search type the retriever performs on the vector database is a similarity search. LangChain vector stores also support searching via Max Marginal Relevance, so if you want this instead you can just set the search_type property as follows:
from langchain.retrievers.multi_vector import SearchType
retriever.search_type = SearchType.mmr
len(retriever.get_relevant_documents("justice breyer")[0].page_content)
9875
Summary
Oftentimes a summary may be able to distill more accurately what a chunk is about, leading to better retrieval. Here we show how to create summaries, and then embed those.
import uuid
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()
)
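# Summarize each of the (large) parent chunks, with at most 5 requests in flight at once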
summaries = chain.batch(docs, {"max_concurrency": 5})
# The vectorstore to use to index the summaries
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
# # We can also add the original chunks to the vectorstore if we so want
# for i, doc in enumerate(docs):
#     doc.metadata[id_key] = doc_ids[i]
# retriever.vectorstore.add_documents(docs)
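# The vectorstore alone retrieves the summaries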
sub_docs = vectorstore.similarity_search("justice breyer")
sub_docs[0]
Document(page_content="The document is a speech given by the President of the United States, highlighting various issues and priorities. The President discusses the nomination of Judge Ketanji Brown Jackson for the Supreme Court and emphasizes the importance of securing the border and fixing the immigration system. The President also mentions the need to protect women's rights, support LGBTQ+ Americans, pass the Equality Act, and sign bipartisan bills into law. Additionally, the President addresses the opioid epidemic, mental health, support for veterans, and the fight against cancer. The speech concludes with a message of unity and optimism for the future of the United States.", metadata={'doc_id': '1f0bb74d-4878-43ae-9a5d-4c63fb308ca1'})
retrieved_docs = retriever.get_relevant_documents("justice breyer")
len(retrieved_docs[0].page_content)
9194
Hypothetical Queries
An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions can then be embedded and used to retrieve the original document.
functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)
chain.invoke(docs[0])
["What was the author's initial career choice before deciding to switch to AI?",
'Why did the author become disillusioned with AI during his first year of grad school?',
'What realization did the author have when visiting the Carnegie Institute?']
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})
# The vectorstore to use to index the hypothetical questions
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )
retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
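# The vectorstore alone retrieves the hypothetical questions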
sub_docs = vectorstore.similarity_search("justice breyer")
sub_docs
[Document(page_content='Who is the nominee for the United States Supreme Court, and what is their background?', metadata={'doc_id': 'd4a82bd9-9001-4bd7-bff1-d8ba2dca9692'}),
Document(page_content='Why did Robert Morris suggest the narrator to quit Y Combinator?', metadata={'doc_id': 'aba9b00d-860b-4b93-8e80-87dc08fa461d'}),
Document(page_content='What events led to the narrator deciding to hand over Y Combinator to someone else?', metadata={'doc_id': 'aba9b00d-860b-4b93-8e80-87dc08fa461d'}),
Document(page_content="How does the Bipartisan Infrastructure Law aim to improve America's infrastructure?", metadata={'doc_id': '822c2ba8-0abe-4f28-a72e-7eb8f477cc3d'})]
retrieved_docs = retriever.get_relevant_documents("justice breyer")
len(retrieved_docs[0].page_content)
9194