
LangChain Pinecone: Pinecone VectorStore Made Easy (7 Simple Steps)


LangChain Pinecone: It won’t be wrong to say that Pinecone is the most famous of the vector stores. Pinecone is a closed-source, cloud-based vector store used to store vector embeddings, and I am writing this blog to help you learn how to use it with LangChain.

By completing this short blog, you will learn:

  1. How to use multiple PDFs
  2. How to split the Text into chunks
  3. How to get embeddings using OpenAI
  4. How to create indexes in Pinecone programmatically and also manually
  5. How to store embeddings in the Pinecone Vector cloud
  6. How to retrieve information
  7. How to use Pinecone and create a Pinecone index
  8. How to use Pinecone with LangChain
  9. Difference between Pinecone Serverless and Pod-Based Services
  10. How to use Pinecone on a free starter account

Note: This blog on LangChain Pinecone is part of our LangChain learning series, in which I am striving to make you proficient in LangChain. The above list is just a sneak peek at what you will learn in this blog on LangChain Pinecone.

Installing All Required Packages

Before writing code for this blog, let us install each package we require.

pip install langchain_pinecone langchain-openai tiktoken langchain pypdf pinecone-client --quiet

The --quiet flag here suppresses the output. This is helpful when you don’t want to see every line of installation detail.

Importing Every Required Module

import os

from langchain_community.document_loaders import PyPDFDirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone, PodSpec, ServerlessSpec  # for creating the index and its spec

To ensure you don’t run into import errors, I have gathered every module we are going to use in this blog into a single code block, so your LangChain Pinecone experience stays smooth.

1. Loading Multiple Documents

Among the document loaders, PyPDFDirectoryLoader is used to load multiple PDF files, while PyPDFLoader handles a single PDF. Since we are going to upload two PDFs in this blog on LangChain Pinecone, we will use the directory loader.
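If you ever have just a single PDF, a minimal sketch with PyPDFLoader would look like this (the path simply points at one of the papers we download below):

single_loader = PyPDFLoader("papers/yolov7paper.pdf")  # single-file alternative to the directory loader
single_pages = single_loader.load()  # returns one Document per page
print(len(single_pages))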

Before loading anything, let’s make a directory, either manually or by running a command, as I do in my Google Colab notebook.

!mkdir papers

Since our PDFs are research papers here, I have named the folder ‘papers’.

Now let’s download the papers from my Google Drive links.

!gdown 1P094bOfyK7xmSJAhO6JV6duegLYOIyvg -O papers/yolov7paper.pdf
!gdown 1WV1MSW8CVm4keaV05pi9SGGac6RqA_kZ -O papers/deepseekcoder.pdf

If you are a beginner, I recommend using these links too; later, you can replace them with your own. The gdown command downloads the files from the links, and the -O flag at the end sets the path where each file should go; I have used the papers folder here.

Now that we have downloaded our PDF files, let us load our documents so that we can begin transforming them into embeddings to save into our Pinecone VectorStore. Use the code below to load them.

loader = PyPDFDirectoryLoader("papers")
data = loader.load()

The loader instance above loads the data from the PDFs into the variable “data.” This data is in raw form, and we need to convert it into embeddings. Let’s see what our data looks like.

print(data)

Output:

Document(page_content='YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object\ndetectors\nChien-Yao Wang1, Alexey Bochkovskiy, and Hong-Yuan Mark Liao1\n1Institute of Information Science, Academia Sinica, Taiwan\[email protected], [email protected], and [email protected]\nAbstract\nYOLOv7 surpasses all known object detectors in both\nspeed and accuracy in the range from 5 FPS to 160 FPS\nand has the highest accuracy 56.8% AP among all known\nreal-time object detectors with 30 FPS .......

The output contained combined data from both of our research papers in text form. To view what it looked like, you can click on this link.

But to convert them into embeddings, we need them in chunks, so let us convert them into chunks now.

2. LangChain Pinecone: Splitting Text into Smaller Chunks

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
text_chunks = text_splitter.split_documents(data)
print(text_chunks)

Output:

[Document(page_content='DeepSeek-Coder: When the Large Language Model Meets\nProgramming - The Rise of Code Intelligence\nDaya Guo*1, Qihao Zhu∗1,2, Dejian Yang1, Zhenda Xie1, Kai Dong1, Wentao Zhang1\nGuanting Chen1, Xiao Bi1, Y. Wu1, Y.K. Li1, Fuli Luo1, Yingfei Xiong2, Wenfeng Liang1\n1DeepSeek-AI\n2Key Lab of HCST (PKU), MOE; SCS, Peking University\n{zhuqh, guodaya}@deepseek.com\nhttps://github.com/deepseek-ai/DeepSeek-Coder\nAbstract\nThe rapid development of large language models has revolutionized code intelligence in', metadata={'source': 'papers/deepseekcoder.pdf', 'page': 0}), Document(page_content='software development. However, the predominance......

As you can see above, the text is now broken into smaller chunks. To view the complete output, click here. You can observe how it looked before splitting it and how it looks now.

Let’s see how many chunks our textual data got divided into.

len(text_chunks)

Output:

310

So our textual data is divided into 310 chunks of up to 500 characters each, with an overlap of 20 characters, exactly as we specified. Now it will be easier for the model to accept our data, so let’s use OpenAI to convert our text into embeddings. But before that, we need to set up the OpenAI environment. Let’s do it.
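If you want to sanity-check the splitting before moving on, you can peek at a single chunk and its metadata:

# Optional sanity check: inspect the first chunk's size and source metadata
print(len(text_chunks[0].page_content))  # at most 500 characters
print(text_chunks[0].metadata)           # e.g. {'source': 'papers/deepseekcoder.pdf', 'page': 0}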

3. LangChain Pinecone: Setting up OpenAI Environment to Get Embeddings

To set up the OpenAI environment, you first need an OpenAI API key. To get it, just go to this link, log in, and generate one. You will get 5 USD of free credits if you have never created an account before, and these credits are enough to learn how to build projects using OpenAI embeddings.

After getting an API key, use the code below to set up the OpenAI environment. Replace “My API” with your own API key.

os.environ['OPENAI_API_KEY'] = "My API"
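If you would rather not hardcode the key in a notebook you might share, a minimal alternative is to prompt for it at runtime with Python’s getpass module:

# Optional: prompt for the key at runtime instead of pasting it into the notebook
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass("Enter your OpenAI API key: ")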

Now we will make an instance of OpenAI embeddings.

embeddings = OpenAIEmbeddings()

This instance will be used to convert our chunked textual data into embeddings, but where will we save them? On our PC? No! We are going to use the Pinecone VectorStore. Now the question arises: how are we going to make a vector store? Well, first, go to Pinecone’s website and sign up.

After signing up, let’s see how to create a Pinecone Vectorstore that we will use to save our OpenAI embeddings using Langchain. So, what are we waiting for? Let’s move to the next level in our LangChain Pinecone blog!

4. Setting up Pinecone VectorStore

There are several ways to set up a Pinecone VectorStore. One way is to use the GUI and set up an index, just as shown in the picture below.

Just give it a name, assign the specifications, and voila! Your Pinecone VectorStore is ready to be used!

Since we are programmers, we need to know how to create our Pinecone VectorStore programmatically, so let’s explore the second option: creating a Pinecone index in code.

5. Creating an Index of Pinecone VectorStore (programmatically)

Creating a Pinecone index programmatically is, in effect, the same as creating the Pinecone VectorStore itself: a vector store is identified by its index, which is why it is referred to by the index name. So, as soon as we create a Pinecone index, we have created our Pinecone VectorStore.

To create an index, you need to go to the website of Pinecone and get an API key for Pinecone, just like we got an API key for OpenAI. Now we are getting it for Pinecone to create our Pinecone VectorStore.

LangChain Pinecone: Pinecone VectorStore API key

As soon as you create your Pinecone Account ID by signing up on Pinecone, an API key is automatically generated, as shown in the above picture. You can find it in the API keys section. Just copy the API key and paste it into the below code.

pc = Pinecone(api_key="My Pinecone API")  # Replace "My Pinecone API" with your own Pinecone API key.

Now, after creating a Pinecone client instance with your API key, let’s create an index for the Pinecone VectorStore, another milestone in our blog on LangChain Pinecone!

pc.create_index(
    name='exampleindex',
    dimension=1536,
    metric='cosine',
    spec=PodSpec(environment="gcp-starter")
)

In name, we assign the name we want to give to our index. I have assigned “exampleindex” here.

Dimension tells Pinecone how many dimensions each embedding vector will have; these dimensions are what capture the semantic meaning that lets a model find and compare words according to the user’s query. To learn how vector embeddings work, read my blog on vector embeddings.

OpenAI’s default embedding model always produces 1536-dimensional vectors, no matter what the word or sentence is; that is why I have assigned 1536 as the dimension. If, in the future, you decide to use another model, you will have to set the dimension according to that model.
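You can verify this dimensionality yourself by embedding a short piece of text with the embeddings instance we created earlier:

# Quick check: OpenAI's default embedding model returns 1536-dimensional vectors
sample_vector = embeddings.embed_query("hello world")
print(len(sample_vector))  # 1536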

For metric, too, there are multiple options available, like Euclidean, cosine, etc., but for OpenAI embeddings we are using cosine, since we want cosine similarity search when retrieving the relevant information.
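If cosine similarity is new to you, here is a tiny illustrative sketch (my own example, separate from the tutorial pipeline) of how it compares two vectors:

# Illustration only: cosine similarity between two small vectors using numpy
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # ~1.0, because the vectors point in the same direction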

In terms of spec and environment, Pinecone offers a pod-based service, in which we buy a pod to host the indexes of our vector store, and a serverless service, where we don’t need to buy a fixed pod; capacity adjusts itself as we create indexes and upload our embeddings.

To make this blog beneficial to everyone, I am using the free plan, so in spec we have to use Pinecone’s PodSpec with the free environment, which goes by the name “gcp-starter.”

If you ever use serverless, only the spec line will be different; everything else stays the same. More environments are available on Pinecone’s paid plans for both serverless and pod-based services.

# Example of the spec if we were using serverless
spec=ServerlessSpec(
        cloud='aws', 
        region='us-west-2')

If you do buy a plan, Pinecone serverless might be cheaper than pods, since you don’t pay for space you aren’t using; both options are available.

Anyway, by running the above code, your new vector store with the name ‘exampleindex’ should have been created, and it will also be visible in your account, as you can see in the picture below.

Pinecone LangChain: Index Created in our Pinecone VectorStore
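One practical tip: if you rerun the notebook, create_index will raise an error because the index already exists, so you may want to guard the call. A minimal sketch, assuming the pc client from above and a pinecone-client version that exposes list_indexes().names():

# Avoid an error on reruns: only create the index if it does not already exist
if "exampleindex" not in pc.list_indexes().names():
    pc.create_index(
        name='exampleindex',
        dimension=1536,
        metric='cosine',
        spec=PodSpec(environment="gcp-starter")
    )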

To see what our Pinecone VectorStore has right now, let’s use the below code.

index = pc.Index("exampleindex")
index.describe_index_stats()

Output

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

The stats of our Pinecone VectorStore show that it is empty. Since our blog is on LangChain Pinecone, let’s start filling it with our data using LangChain.

To use Langchain to fill our Pinecone VectorStore, we first need to set up an environment for Pinecone, just like we did for the OpenAI API.

6. LangChain Pinecone: Setting up Environment for Pinecone

Bring out your Pinecone API key again, the same one we used above to create the index, and replace “My API” with it, just as you did before.

To access Pinecone, we need an API key, but to access a particular index, we also need to say exactly which index we want and which environment it uses. So along with your API key, bring your index name and environment name and write them in the code below.

In our case, we were using the “gcp-starter” environment, and the index name was “exampleindex,” so I have written them below.

os.environ['PINECONE_API_KEY'] = "My API"
os.environ['PINECONE_ENVIRONMENT'] = "gcp-starter"
os.environ['PINECONE_INDEX_NAME'] = "exampleindex"

Hurray! Our environment is set; now it is time to fill our index with our data.

# Filling our vector store with our data while converting it into vector embeddings

docsearch = PineconeVectorStore.from_documents(text_chunks, embeddings, index_name="exampleindex")

This one line of LangChain will fetch all of your chunked data, convert it into vector embeddings using the OpenAIEmbeddings instance, and fill your Pinecone VectorStore.
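A quick note: on a later run, when the index already contains your embeddings, you don’t have to upload everything again; a minimal sketch for reconnecting to the existing index with the same embeddings instance looks like this:

# Reconnect to an index that already holds your vectors, without re-uploading
docsearch = PineconeVectorStore.from_existing_index(index_name="exampleindex", embedding=embeddings)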

Now let’s see the stats of our VectorStore again.

index.describe_index_stats()

Output:

{'dimension': 1536, 'index_fullness': 0.0031, 'namespaces': {'': {'vector_count': 310}}, 'total_vector_count': 310}

So, as you can see in the statistics, our VectorStore now has data in it. To see how it looked in the GUI on the Pinecone website before and after the vector embeddings were saved, watch the video below.

Now it’s time to retrieve information, so let’s query our data and fetch the relevant information saved in docsearch.

7. Retrieving Relevant Information

query = "what is deep seek coder"
docs = docsearch.similarity_search(query)
print(docs[0].page_content)

Output:

Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art
performance among open-source code models across multiple benchmarks but also surpasses
existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models
are under a permissive license that allows for both research and unrestricted commercial use.
Figure 1|The Performance of DeepSeek-Coder

The fetched information comes from our second document, the DeepSeek-Coder research paper. Now try querying our other document as well.
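By default, similarity_search returns only the top few most similar chunks; you can control how many with the k parameter, or wrap the vector store as a retriever if you plan to plug it into a chain later. A small sketch:

# Control how many chunks come back with the k parameter
docs = docsearch.similarity_search("what is deep seek coder", k=3)
print(len(docs))  # 3

# Or expose the vector store as a retriever for use in chains
retriever = docsearch.as_retriever(search_kwargs={"k": 3})
relevant_docs = retriever.get_relevant_documents("what is deep seek coder")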

You can send the retrieved information to an LLM, which will answer your query in a structured way, i.e., you will be talking to your documents, but that’s beyond the scope of this blog on LangChain Pinecone. To learn how to do that, go through our LangChain RAG blog.

Conclusion:

Congratulations! You have now successfully created your vector store from scratch in Pinecone, and I hope this blog has helped you gain valuable knowledge about Pinecone, along with many other things like handling multiple PDFs, text splitting, and more.

Just like this blog on LangChain Pinecone, I have also written blogs on LangChain FAISS and LangChain Chroma that I recommend you check out, along with other related blogs on LangChain.

Lastly, don’t forget to share your valuable feedback; our team truly values it.
