Machine Learning Spot

LangChain Chromadb: 3 Easy Steps to Make a Vector Database


LangChain Chromadb: LangChain is a framework for chaining together LLMs and other components to build applications powered by Large Language Models. These applications often work with data in numerical form: embeddings, long lists of numbers (vectors) that capture the meaning of text and enable semantic search. But where do we save those vectors? This is where vector stores like ChromaDB come to help us.

We have already covered LangChain FAISS, so why do we need LangChain ChromaDB? Well, in FAISS, we used Python's pickle feature to save the vector embeddings to our local machine, but ChromaDB instead creates a SQLite file along with a set of metadata files.

We use the phrase LangChain ChromaDB because ChromaDB is designed to integrate easily with LangChain. In this article, we will use LangChain with OpenAI to generate embeddings, and ChromaDB to save and load them.

After completing this article, you will also be able to design your own RAG using ChromaDB. We have already designed a RAG using FAISS; you will only have to replace a few lines to build it with ChromaDB, as everything else remains the same, and you will gain extra features, among them a dedicated, organized database instead of a pickle file.

So without any delay, let’s get started!

Installing Each Required Package

Use the line below to install every package whose modules we will import in this blog on LangChain ChromaDB:

pip install -U langchain langchain-community langchain-openai tiktoken chromadb

Importing the Required Modules

Every module required to complete this blog on LangChain ChromaDB is gathered here. The main purpose of each import is explained in the comments alongside the code.

from langchain_openai import OpenAI  # To interact with OpenAI
from langchain_community.document_loaders import TextLoader  # To load a text file
from langchain_openai import OpenAIEmbeddings  # To convert our text to embeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter  # To split text and reduce the token count per chunk
import os  # To set the OpenAI API key as an environment variable
from langchain_community.vectorstores import Chroma  # The Chroma vector store

Establishing a Connection with OpenAI

To establish a connection with OpenAI, we need an OpenAI API key. We have chosen OpenAI to generate the embeddings, but any other platform, such as Hugging Face Transformers, can also be used for this purpose.

os.environ["OPENAI_API_KEY"] = "My API"  # Replace "My API" with your own API key
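Hardcoding the key in a notebook risks leaking it if you ever share the file. As a minimal, hedged alternative (standard-library only; the helper name get_openai_key is ours, not part of LangChain), you can read the key from the environment and prompt for it only when it is missing:

```python
import os
from getpass import getpass

def get_openai_key() -> str:
    """Return the OpenAI API key, prompting interactively only if it
    is not already set in the environment."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        key = getpass("OpenAI API key: ")  # input stays hidden on screen
        os.environ["OPENAI_API_KEY"] = key
    return key
```

Any code that needs the key can then call get_openai_key() instead of pasting the secret inline.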

After completing all of the above, we are now ready to create our vector store. Let's read on in this LangChain ChromaDB post to see how it's done.

1. Loading a Document

Just like in the post on LangChain FAISS, here in this post on LangChain ChromaDB, where we are discussing the creation of a vector store using ChromaDB, we need a file with some text that we will convert into embeddings, so we are using the same file. You can download it from here.

loader = TextLoader("file path")  # Replace "file path" with the path to your own file, including the file name
documents = loader.load()  # Uses the load method of TextLoader to load the entire document

If you are using Colab, just drag and drop the file into your Colab notebook for learning purposes. You can also use our file elon.txt; if you do, just replace the file path with elon.txt in the code above.

Remember, you are not limited to a text file; any other document containing text can also be used, just like the PDF file we used in our other blog on LangChain RAG. You can read about other document loaders here. For this post on LangChain ChromaDB, let's stick to a text file.
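For instance, loading a PDF follows the same loader pattern. This is a sketch, not code from this tutorial: it assumes the pypdf package is installed alongside langchain-community, and "my_file.pdf" is a placeholder path:

```python
def load_pdf(path: str):
    """Load a PDF into LangChain Documents (typically one per page).
    Assumes `pypdf` is installed alongside langchain-community."""
    from langchain_community.document_loaders import PyPDFLoader
    return PyPDFLoader(path).load()

# documents = load_pdf("my_file.pdf")  # hypothetical path
```

The rest of the pipeline (splitting, embedding, storing) stays exactly the same regardless of which loader produced the documents.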

2. Embeddings Creation and Text Splitting

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings() # Embeddings object that will be used to create embeddings

In the text splitter, chunk_size is 1000, i.e., each chunk of our text will have at most 1000 characters. This is important as each model has a limit on tokens.

To make sure related information stays together, we are using a recursive text splitter, which keeps relevant chunks next to each other. You can adjust chunk_overlap according to your desired output: with a value of 100, as here, up to 100 characters can appear in both adjacent chunks; reduce it to 20 and the shared span shrinks to at most 20 characters.
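To build intuition for chunk_size and chunk_overlap, here is a toy, character-level chunker in plain Python. It is not LangChain's actual splitter (which also splits on separators such as newlines before falling back to characters); it only illustrates how overlapping windows work:

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Toy chunker: fixed-size windows that advance by
    chunk_size - chunk_overlap characters, so each chunk shares its
    first chunk_overlap characters with the tail of the previous one."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
    return chunks

# chunk_text("abcdefghijklmnopqrstuvwxyz", 10, 3)
# -> ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Notice how "hij" appears at the end of the first chunk and the start of the second: that repeated span is exactly the overlap, and it is what keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.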

Here is your first task: after completing this post on LangChain ChromaDB, try changing chunk_size and chunk_overlap and see what happens to the output.

3. LangChain Chromadb: The Final Step, VectorStore Creation

With ChromaDB, we don't need to save the vector store separately. The beauty of ChromaDB is that it creates a proper vector database, a SQLite file plus useful metadata files, and we can give it a path to use as a persistent directory. It is called a persistent directory because the embeddings persist wherever we choose to save them, so you can provide any path where you want the embeddings stored. Copy the code below and supply a path of your choice.

db = Chroma.from_documents(docs, embeddings, persist_directory="my_embeddings/15oct1997")

You can see in the picture below how ChromaDB saved the embeddings, in the form of a SQLite file plus metadata, so they can be used later. It is also easy to update the information saved there without manually opening and re-saving files the old-school way, but for that, you will have to wait for our next blog post.
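When you want the embeddings back in a later session, you do not have to re-embed the documents. Here is a hedged sketch of reopening the persisted store (it assumes the directory was created as above with OpenAI embeddings, and the helper name load_vectorstore is ours, not a LangChain API):

```python
def load_vectorstore(persist_dir: str):
    """Reopen a Chroma store that was previously written to persist_dir."""
    from langchain_community.vectorstores import Chroma
    from langchain_openai import OpenAIEmbeddings

    # The same embedding model must be used for storing and querying,
    # otherwise the similarity scores are meaningless.
    return Chroma(persist_directory=persist_dir,
                  embedding_function=OpenAIEmbeddings())

# db = load_vectorstore("my_embeddings/15oct1997")  # path from the step above
```

Once reloaded, the store supports the same similarity_search calls as a freshly built one.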

Let's end our post on LangChain ChromaDB by retrieving some relevant information:

query = "who is elon musk"
docs = db.similarity_search(query)
print(docs[0].page_content)

Output:

Born: June 28, 1971, Pretoria, South Africa (age 52)
Founder: PayPal SpaceX Zip2
Recent News
Mar. 8, 2024, 6:21 PM ET (AP)
OpenAI has 'full confidence' in CEO Sam Altman after investigation, reinstates him to board
Mar. 6, 2024, 7:24 PM ET (AP)
Texas approves land-swapping deal with SpaceX as company hopes to expand rocket-launch operations
Top Questions
When was Elon Musk born?
Where did Elon Musk go to school?
What did Elon Musk accomplish?
Elon Musk (born June 28, 1971, Pretoria, South Africa) South African-born American entrepreneur who cofounded the electronic-payment firm PayPal and formed SpaceX, maker of launch vehicles and spacecraft. He was also one of the first significant investors in...

If we send this retrieved information to an LLM, we get a RAG that produces a proper, well-structured answer; you can learn how to do it by reading our blog on LangChain RAG.
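As a preview of that RAG step, the vector store can be wrapped as a retriever, the component an LLM chain queries for context. A minimal sketch (the helper name make_retriever and the default value of k are our choices, not from this tutorial):

```python
def make_retriever(db, k: int = 4):
    """Wrap a vector store as a retriever returning the k most similar
    chunks; a RAG chain feeds these chunks to the LLM as context."""
    return db.as_retriever(search_kwargs={"k": k})
```

A chain built with LangChain can then call make_retriever(db) wherever it expects a retriever, leaving the rest of the pipeline unchanged.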

Conclusion

Congratulations! By completing this post on LangChain ChromaDB, you now know what ChromaDB is and how to create and use a Chroma vector store. Stay in touch with our future posts by subscribing to our newsletter, where we will tell you about tutorials like this one on LangChain ChromaDB, and many others.
I look forward to your feedback. Happy learning!
