
LangChain FAISS: 3 Easy Steps to Make Personal Vector Store


LangChain FAISS (Facebook AI Similarity Search) is a simple, free way to build your own vector store. By combining these two tools, LangChain and FAISS, you can build your own text reader, your own research tool, or your own article and report writer; for convenience, let's call the pair LangChain FAISS throughout this post.

After completing this blog, you will know what FAISS is, how easily you can run similarity searches with it, how to talk to your documents, and, most importantly, how to build your own vector store. It is as easy as pie!

Setting Up the LangChain FAISS

To use LangChain FAISS, you will need the langchain-community, faiss-cpu, and langchain-openai packages. You can install everything with:

pip install -U langchain langchain-community faiss-cpu langchain-openai tiktoken

In the line above, we install LangChain, as this library is the plot of our entire story: it is the framework that everything else plugs into. From langchain-community we will import the FAISS vector store integration, while faiss-cpu provides the underlying engine used to search, retrieve, save, and load vectors. FAISS is the protagonist, and LangChain is the plot of our story. With langchain-openai we communicate with the OpenAI API, and tiktoken handles token counting for OpenAI models.
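If you want to confirm that the installation worked before going further, a quick sanity check like the sketch below imports the two core packages and prints their versions (this assumes both expose a __version__ attribute, which recent releases do):

import faiss  # the similarity search engine installed via faiss-cpu
import langchain  # the framework tying everything together

print("FAISS version:", faiss.__version__)
print("LangChain version:", langchain.__version__)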

Importing Everything Required

So that nothing is missed and you don't run into import errors, I have combined everything you need to import for this LangChain FAISS vector store walkthrough in one place.


from langchain_openai import OpenAI  # to interact with OpenAI's LLMs
from langchain_community.document_loaders import TextLoader  # to load text files
from langchain_community.vectorstores import FAISS  # the FAISS vector store
from langchain_openai import OpenAIEmbeddings  # to convert our text to embeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter  # to split text into chunks that fit token limits
import os  # to set the OpenAI API key

1. Setting the Environment Variable for the OpenAI API

We need an OpenAI API key to use their text-to-embeddings converter, and their LLM when needed. For this blog, the embeddings are the most important part, since vector embeddings are exactly what a vector store stores.

Get your OpenAI API key from the OpenAI platform and then set it using the code below.

os.environ["OPENAI_API_KEY"] = "My API"  # replace My API with your own API key
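Hardcoding a key in your script is risky if you ever share the file. As an alternative, here is a minimal sketch that uses Python's standard getpass module to prompt for the key at runtime instead:

import os
from getpass import getpass  # standard library; hides the key while you type

# Only prompt if the key isn't already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")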

2. The Show Begins: Creating the Embeddings

loader = TextLoader("file path")  # replace "file path" with the path to your own file
documents = loader.load()  # the load method of TextLoader loads the entire document

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)  # split the document into chunks
embeddings = OpenAIEmbeddings()  # create the embeddings object
db = FAISS.from_documents(docs, embeddings)  # create the FAISS vector store

Since in this LangChain FAISS showdown we are building a vector store, we need some text embeddings to put in it. The line db = FAISS.from_documents(docs, embeddings) embeds the split texts stored in docs and indexes the resulting vectors in FAISS.

For the text splitter, we used the recursive splitter so that related information stays next to each other. The chunk size is kept at 1000 characters because LLMs have token limits that vary from model to model, and chunk_overlap is set to 0 so that consecutive chunks share no characters. A quick way to inspect what the splitter produced is shown below.
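Here is that quick inspection, a minimal sketch assuming the docs list created above:

# Inspect the split before indexing: chunk count and a preview of the first chunk
print(f"Number of chunks: {len(docs)}")
print(f"First chunk length: {len(docs[0].page_content)} characters")
print(docs[0].page_content[:200])  # first 200 characters of the first chunk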

3. The Final Stage: Storing the Vectors

To save a FAISS index, LangChain simply writes files to your local machine (an index file plus a pickle file for the metadata) that can be reloaded later, so this LangChain FAISS solution is free of charge. You can save as many vector stores as you want, depending on your disk space.

db.save_local("faiss_index")  # replace faiss_index with a name of your choice

Use the line above to save the index to your machine, and the line below to load it back.

new_db = FAISS.load_local("faiss_index", embeddings)  # the saved index is loaded into new_db
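One caveat: recent releases of langchain-community refuse to unpickle an index by default, since loading pickle files from untrusted sources is unsafe. If the line above raises a deserialization error, you may need to opt in explicitly, as in this sketch (only do this for index files you created yourself):

new_db = FAISS.load_local(
    "faiss_index",
    embeddings,
    allow_dangerous_deserialization=True,  # we trust this file because we wrote it
)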

Okay, so we saved it and loaded it, but is that enough? Yes, we have created a vector store, but what benefit do we get without similarity search? Let's try that out too!

LangChain FAISS: How Similarity Search Is Done

You don't have to implement the similarity search yourself; you just need to put your query in a string and pass it to the similarity_search method, as shown below.

query = "When was Elon Musk born"
docs = new_db.similarity_search(query)

Our vector store contained information about Elon Musk, with only the initial portion of the article copied from this source. When we printed the top result, this was the output we obtained.

print(docs[0].page_content)

Output:

Mar. 6, 2024, 7:24 PM ET (AP)
Texas approves land-swapping deal with SpaceX as company hopes to expand rocket-launch operations
Top Questions
When was Elon Musk born?
Where did Elon Musk go to school?
What did Elon Musk accomplish?
Elon Musk (born June 28, 1971, Pretoria, South Africa) South African-born American entrepreneur who cofounded the electronic-payment firm......

It retrieved as much relevant information as it could. To see the full document, you can download it from here.

You can also perform a similarity search with scores, which you can later use for building logic or estimation purposes. This is how it is done.

docs_and_scores = db.similarity_search_with_score(query)
docs_and_scores[0]

So we just appended _with_score to similarity_search, and this was the output we got.

(Document(page_content='Mar. 6, 2024, 7:24 PM ET (AP)\nTexas approves land-swapping deal with SpaceX as company hopes to expand rocket-launch operations\nTop Questions\nWhen was Elon Musk born?\nWhere did Elon Musk go to school?\nWhat did Elon Musk accomplish?\nElon Musk (born June 28, 1971, Pretoria, South Africa) South African-born American entrepreneur who cofounded the electronic-payment firm PayPal and formed SpaceX, maker of launch vehicles and spacecraft. He was also one of the first significant investors in, as well as chief executive officer of, the electric car manufacturer Tesla. In addition, Musk acquired Twitter (later X) in 2022.', metadata={'source': 'elon.txt'}), 0.3366713)

The score above is 0.3366713. The score is a distance, so lower values indicate higher relevance; chunks with higher scores were less relevant and therefore ranked lower.
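Since the method returns (document, score) pairs, you can use the score to build simple relevance logic, for example keeping only chunks below a distance cutoff. A minimal sketch follows; the 0.5 threshold is an arbitrary illustrative value you would tune for your own data:

# Keep only chunks whose distance score is below a cutoff (lower = closer match)
THRESHOLD = 0.5  # illustrative value, not a recommendation
relevant_docs = [doc for doc, score in docs_and_scores if score < THRESHOLD]

for doc in relevant_docs:
    print(doc.page_content[:100])  # preview each sufficiently relevant chunk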

Conclusion:

Way to go! By completing this blog on LangChain FAISS, you now know how to create your own personal vector store and how to search it. To read more blogs like this, subscribe to our newsletter, and don't forget to read our other posts on AI.

For suggestions and feedback, contact us.
