MongoDB Atlas
MongoDB Atlas is a document database that can be used as a vector database.
In this walkthrough, we'll demo the SelfQueryRetriever with a MongoDB Atlas vector store.
Creating a MongoDB Atlas vectorstore
First we'll want to create a MongoDB Atlas VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.
NOTE: The self-query retriever requires you to have lark installed (pip install lark). We also need the pymongo package.
#!pip install lark pymongo
We want to use OpenAIEmbeddings, so we need an OpenAI API key.
import os
OPENAI_API_KEY = "Use your OpenAI key"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
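Hard-coding the key works for a quick demo. As a hedged alternative, the sketch below reads a required setting from the environment and fails early with a clear message. The helper name require_env is hypothetical and is not part of LangChain or the OpenAI client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`; raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running this walkthrough"
        )
    return value
```

With this in place, OPENAI_API_KEY = require_env("OPENAI_API_KEY") could replace the hard-coded assignment, and the same helper could be reused for the MongoDB connection string.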
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import MongoDBAtlasVectorSearch
from pymongo import MongoClient
CONNECTION_STRING = "Use your MongoDB Atlas connection string"
DB_NAME = "Name of your MongoDB Atlas database"
COLLECTION_NAME = "Name of your collection in the database"
INDEX_NAME = "Name of a search index defined on the collection"
# Use a separate name so the MongoClient class is not shadowed by its instance
client = MongoClient(CONNECTION_STRING)
collection = client[DB_NAME][COLLECTION_NAME]
embeddings = OpenAIEmbeddings()
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "action"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "genre": "thriller", "rating": 8.2},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "rating": 8.3, "genre": "drama"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={"year": 1979, "rating": 9.9, "genre": "science fiction"},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "genre": "thriller", "rating": 9.0},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated", "rating": 9.3},
    ),
]
vectorstore = MongoDBAtlasVectorSearch.from_documents(
    docs,
    embeddings,
    collection=collection,
    index_name=INDEX_NAME,
)
Now, let's create a vector search index on your cluster. In the example below, embedding is the name of the field that contains the embedding vector. Please refer to the documentation for more details on how to define an Atlas Vector Search index. You can name the index {COLLECTION_NAME}, as long as INDEX_NAME above is set to the same name, and create the index on the namespace {DB_NAME}.{COLLECTION_NAME}. Finally, write the following definition in the JSON editor on MongoDB Atlas:
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "dimensions": 1536,
        "similarity": "cosine",
        "type": "knnVector"
      },
      "genre": {
        "type": "token"
      },
      "rating": {
        "type": "number"
      },
      "year": {
        "type": "number"
      }
    }
  }
}
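The filter fields in the index definition only help if they are named exactly after the metadata keys stored with the documents. The following is a minimal sketch of a local sanity check, assuming the demo metadata keys genre, rating, and year; missing_filter_fields is a hypothetical helper, not an Atlas or LangChain API.

```python
import json

# Metadata keys used by the demo documents in this walkthrough
METADATA_KEYS = {"year", "rating", "genre"}


def missing_filter_fields(index_definition: dict, metadata_keys: set) -> set:
    """Return the metadata keys that have no explicit mapping in the index definition."""
    mapped = set(index_definition.get("mappings", {}).get("fields", {}))
    return set(metadata_keys) - mapped


index_definition = {
    "mappings": {
        "dynamic": True,
        "fields": {
            "embedding": {"dimensions": 1536, "similarity": "cosine", "type": "knnVector"},
            "genre": {"type": "token"},
            "rating": {"type": "number"},
            "year": {"type": "number"},
        },
    }
}

missing = missing_filter_fields(index_definition, METADATA_KEYS)
print("unmapped metadata keys:", sorted(missing))
# Serialize the dict to JSON text that can be pasted into the Atlas JSON editor
print(json.dumps(index_definition, indent=2))
```

A misspelled field name, for example ratings instead of rating, would show up in the "unmapped metadata keys" line before the definition is pasted into Atlas.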
Creating our self-querying retriever
Now we can instantiate our retriever. To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents.
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
document_content_description = "Brief summary of a movie"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)
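Conceptually, the retriever asks the LLM to split a question into a semantic query plus a structured condition on the metadata fields described above, and that condition is then translated into a filter the vector store understands. The sketch below illustrates the translation step only. Comparison, Operation, and to_mongo_filter are hypothetical stand-ins written for this example; LangChain's internal classes and the exact filter syntax sent to Atlas may differ.

```python
from dataclasses import dataclass
from typing import Any, List, Union


@dataclass
class Comparison:
    """Hypothetical stand-in for one metadata condition, e.g. rating >= 9."""
    attribute: str
    comparator: str  # one of: eq, ne, gt, gte, lt, lte
    value: Any


@dataclass
class Operation:
    """Hypothetical stand-in for a logical combination of conditions."""
    operator: str  # "and" or "or"
    arguments: List[Union["Comparison", "Operation"]]


def to_mongo_filter(node) -> dict:
    """Translate a condition tree into a MongoDB-style query document."""
    if isinstance(node, Comparison):
        return {node.attribute: {f"${node.comparator}": node.value}}
    if isinstance(node, Operation):
        return {f"${node.operator}": [to_mongo_filter(arg) for arg in node.arguments]}
    raise TypeError(f"unsupported node: {node!r}")


# "What's a highly rated (above or equal 9) thriller film?" could decompose into:
condition = Operation(
    "and",
    [Comparison("rating", "gte", 9), Comparison("genre", "eq", "thriller")],
)
print(to_mongo_filter(condition))
```

The attribute names and types in metadata_field_info are what let the LLM produce conditions like these, so they should match the stored metadata keys.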
Testing it out
And now we can try actually using our retriever!
# This example only specifies a relevant query
retriever.get_relevant_documents("What are some movies about dinosaurs")
# This example only specifies a filter
retriever.get_relevant_documents("What are some highly rated movies (above 9)?")
# This example specifies a query and a filter
retriever.get_relevant_documents(
    "I want to watch a movie about toys rated higher than 9"
)
# This example specifies a composite filter
retriever.get_relevant_documents(
    "What's a highly rated (above or equal 9) thriller film?"
)
# This example specifies a query and composite filter
retriever.get_relevant_documents(
    "What's a movie after 1990 but before 2005 that's all about dinosaurs, "
    "and preferably has a lot of action"
)
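To reason about which documents the metadata conditions in these queries should let through, the same conditions can be applied to the demo metadata in plain Python. This is only a local sketch of the filtering part: the semantic part (dinosaurs, toys) is handled by the vector search, and the LLM may pick a slightly different filter than the one assumed here.

```python
# Metadata copied from the six demo documents seeded earlier
movies = [
    {"year": 1993, "rating": 7.7, "genre": "action"},
    {"year": 2010, "rating": 8.2, "genre": "thriller"},
    {"year": 2019, "rating": 8.3, "genre": "drama"},
    {"year": 1979, "rating": 9.9, "genre": "science fiction"},
    {"year": 2006, "rating": 9.0, "genre": "thriller"},
    {"year": 1995, "rating": 9.3, "genre": "animated"},
]


def years_matching(predicate):
    """Return the release years of the movies whose metadata satisfies the predicate."""
    return sorted(m["year"] for m in movies if predicate(m))


# "highly rated movies (above 9)"
above_9 = years_matching(lambda m: m["rating"] > 9)
# "highly rated (above or equal 9) thriller film"
top_thrillers = years_matching(lambda m: m["rating"] >= 9 and m["genre"] == "thriller")
# "after 1990 but before 2005 ... a lot of action"
nineties_action = years_matching(
    lambda m: 1990 < m["year"] < 2005 and m["genre"] == "action"
)

print(above_9)          # [1979, 1995]
print(top_thrillers)    # [2006]
print(nineties_action)  # [1993]
```

Note that the 2006 thriller, rated exactly 9.0, passes the "above or equal 9" condition but not the strict "above 9" one.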
Filter k
We can also use the self query retriever to specify k: the number of documents to fetch.
We can do this by passing enable_limit=True to the constructor.
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    verbose=True,
    enable_limit=True,
)
# This example specifies a relevant query and a limit of two documents
retriever.get_relevant_documents("What are two movies about dinosaurs?")
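As a rough mental model, a limit taken from the question (here, "two") caps how many of the best-scoring matches come back. The sketch below shows that idea in isolation. top_k is a hypothetical helper, and the scores are made-up illustrative values, not output from Atlas or the retriever.

```python
import heapq


def top_k(scored_docs, k):
    """Return the k highest-scoring (score, text) pairs, best first."""
    return heapq.nlargest(k, scored_docs, key=lambda pair: pair[0])


# Illustrative similarity scores only; real scores come from the vector search
scored = [
    (0.91, "A bunch of scientists bring back dinosaurs and mayhem breaks loose"),
    (0.40, "Toys come alive and have a blast doing so"),
    (0.35, "Leo DiCaprio gets lost in a dream within a dream within a dream within a ..."),
    (0.62, "Three men walk into the Zone, three men walk out of the Zone"),
]

two_best = top_k(scored, 2)
for score, text in two_best:
    print(f"{score:.2f}  {text}")
```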