MongoDB Atlas

MongoDB Atlas is a document database that can be used as a vector database.

In this walkthrough, we'll demo the SelfQueryRetriever with a MongoDB Atlas vector store.

Creating a MongoDB Atlas vectorstore

First we'll want to create a MongoDB Atlas VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.

NOTE: The self-query retriever requires you to have lark installed (pip install lark). We also need the pymongo package.

#!pip install lark pymongo

We want to use OpenAIEmbeddings so we have to get the OpenAI API Key.

import os

OPENAI_API_KEY = "Use your OpenAI key"

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import MongoDBAtlasVectorSearch
from pymongo import MongoClient

CONNECTION_STRING = "Use your MongoDB Atlas connection string"
DB_NAME = "Name of your MongoDB Atlas database"
COLLECTION_NAME = "Name of your collection in the database"
INDEX_NAME = "Name of a search index defined on the collection"

# Use a distinct variable name so the MongoClient class is not shadowed
client = MongoClient(CONNECTION_STRING)
collection = client[DB_NAME][COLLECTION_NAME]

embeddings = OpenAIEmbeddings()
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "action"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "genre": "thriller", "rating": 8.2},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "rating": 8.3, "genre": "drama"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={"year": 1979, "rating": 9.9, "genre": "science fiction"},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "genre": "thriller", "rating": 9.0},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated", "rating": 9.3},
    ),
]

vectorstore = MongoDBAtlasVectorSearch.from_documents(
    docs,
    embeddings,
    collection=collection,
    index_name=INDEX_NAME,
)
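To see why the index definition refers to embedding, genre, rating and year as plain field names, it can help to picture one stored record. The following is a minimal sketch, assuming the store writes the page content under `text`, the vector under `embedding`, and each metadata key as a top-level field. `to_stored_record` is a hypothetical helper for illustration, not a LangChain or PyMongo function, and the vector is a placeholder. Inspect a document in your own collection to confirm the actual layout.

```python
def to_stored_record(page_content, metadata, embedding,
                     text_key="text", embedding_key="embedding"):
    """Build an illustrative record: text, vector, and metadata keys side by side.

    The key names are assumptions for this sketch, not a guaranteed schema.
    """
    return {text_key: page_content, embedding_key: embedding, **metadata}


record = to_stored_record(
    "Toys come alive and have a blast doing so",
    {"year": 1995, "genre": "animated", "rating": 9.3},
    embedding=[0.0] * 1536,  # placeholder vector; real values come from the embedding model
)
print(sorted(record))  # prints ['embedding', 'genre', 'rating', 'text', 'year']
```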

Now, let's create a vector search index on your cluster. In the example below, embedding is the name of the field that contains the embedding vector. Refer to the documentation for details on how to define an Atlas Vector Search index. Create the index on the namespace {DB_NAME}.{COLLECTION_NAME}. You can name it {COLLECTION_NAME}, as long as the name matches the INDEX_NAME passed to the vector store above. Finally, write the following definition in the JSON editor on MongoDB Atlas:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "dimensions": 1536,
        "similarity": "cosine",
        "type": "knnVector"
      },
      "genre": {
        "type": "token"
      },
      "rating": {
        "type": "number"
      },
      "year": {
        "type": "number"
      }
    }
  }
}
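A mismatch between the field names in the index definition and the metadata keys on the documents is easy to miss. The sketch below uses only the standard library to compare the two. `missing_index_fields` is a hypothetical helper, not part of LangChain or PyMongo, and the definition shown contains a deliberately misspelled key to demonstrate the check.

```python
def missing_index_fields(index_definition, metadata_keys):
    """Return metadata keys that have no explicit mapping in the index definition."""
    mapped = index_definition.get("mappings", {}).get("fields", {})
    return sorted(key for key in metadata_keys if key not in mapped)


# Hypothetical definition with a deliberately misspelled key ("ratings").
definition = {
    "mappings": {
        "dynamic": True,
        "fields": {
            "embedding": {"dimensions": 1536, "similarity": "cosine", "type": "knnVector"},
            "genre": {"type": "token"},
            "ratings": {"type": "number"},
            "year": {"type": "number"},
        },
    }
}

# Metadata keys used by the demo documents in this walkthrough.
print(missing_index_fields(definition, ["genre", "rating", "year"]))  # prints ['rating']
```

An empty list means every metadata key used for filtering has an explicit mapping.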

Creating our self-querying retriever

Now we can instantiate our retriever. To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents.

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
document_content_description = "Brief summary of a movie"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)

Testing it out

And now we can try actually using our retriever!

# This example only specifies a relevant query
retriever.get_relevant_documents("What are some movies about dinosaurs")
# This example only specifies a filter
retriever.get_relevant_documents("What are some highly rated movies (above 9)?")
# This example specifies a query and a filter
retriever.get_relevant_documents(
    "I want to watch a movie about toys rated higher than 9"
)
# This example specifies a composite filter
retriever.get_relevant_documents(
    "What's a highly rated (above or equal 9) thriller film?"
)
# This example specifies a query and composite filter
retriever.get_relevant_documents(
    "What's a movie after 1990 but before 2005 that's all about dinosaurs, "
    "and preferably has a lot of action"
)
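Behind each call, the retriever asks the LLM to split the question into a semantic query and a metadata filter, and the filter is then expressed in the vector store's own filter syntax. The sketch below illustrates the filtering half in memory, using a small subset of MongoDB-style operators against the demo metadata. `matches` and the two filters are hypothetical illustrations. The filter the LLM actually produces may differ, and this is not LangChain's translator.

```python
import operator

# MongoDB-style comparison operators supported by this sketch.
_OPS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, flt):
    """Evaluate a small subset of MongoDB filter syntax against one metadata dict."""
    for key, cond in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in cond):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in cond):
                return False
        else:
            # cond maps operators to expected values, e.g. {"$gt": 9}
            for op, expected in cond.items():
                if key not in metadata or not _OPS[op](metadata[key], expected):
                    return False
    return True


# Metadata of the six demo documents, in the order they were defined.
MOVIES = [
    {"year": 1993, "rating": 7.7, "genre": "action"},
    {"year": 2010, "rating": 8.2, "genre": "thriller"},
    {"year": 2019, "rating": 8.3, "genre": "drama"},
    {"year": 1979, "rating": 9.9, "genre": "science fiction"},
    {"year": 2006, "rating": 9.0, "genre": "thriller"},
    {"year": 1995, "rating": 9.3, "genre": "animated"},
]

# A plausible filter for "highly rated movies (above 9)"
above_nine = {"rating": {"$gt": 9}}
# A plausible filter for "highly rated (above or equal 9) thriller film"
top_thriller = {"$and": [{"rating": {"$gte": 9}}, {"genre": {"$eq": "thriller"}}]}

print([m["year"] for m in MOVIES if matches(m, above_nine)])    # prints [1979, 1995]
print([m["year"] for m in MOVIES if matches(m, top_thriller)])  # prints [2006]
```

Note that the 2006 film, rated exactly 9.0, passes the "above or equal 9" filter but not the strict "above 9" one.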

Filter k

We can also use the self query retriever to specify k: the number of documents to fetch.

We can do this by passing enable_limit=True to the constructor.

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    verbose=True,
    enable_limit=True,
)
# This example specifies a query and a limit ("two")
retriever.get_relevant_documents("What are two movies about dinosaurs?")
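Conceptually, enabling the limit lets a count mentioned in the question (here, "two") become the k used for the search. The following is a minimal sketch of that idea under stated assumptions: `ParsedQuery` and `search_kwargs` are hypothetical names, and the fallback of 4 is an assumed default chosen for illustration, not a value taken from this walkthrough.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedQuery:
    """Hypothetical stand-in for what the LLM extracts from a question."""
    query: str
    limit: Optional[int] = None


def search_kwargs(parsed, default_k=4):
    """Use the extracted limit as k when present, otherwise fall back to default_k."""
    k = parsed.limit if parsed.limit is not None else default_k
    return {"k": k}


print(search_kwargs(ParsedQuery("dinosaurs", limit=2)))  # prints {'k': 2}
print(search_kwargs(ParsedQuery("dinosaurs")))           # prints {'k': 4}
```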