Machine Learning Spot

How to Store Embeddings Using LangChain Weaviate

LangChain Weaviate Feature Image

LangChain Weaviate is another easy-to-use vector store, and in this blog I will make it easy for you to learn how to use LangChain Weaviate to create and store embeddings using Hugging Face. From API key creation to working code, you will learn everything in this blog.

Let’s quickly start setting up the vector store.

Steps to get API

Step 1: Once you have signed up, you will be redirected to the dashboard, where you need to click on Create Cluster.

Step 2: Give your vector store a name and click on Create. Here, this type of vector store is called a Sandbox. I named mine “my-sandbox”; replace it with a name of your choice.

Step 3: Click on the drop-down symbol at the extreme right of your LangChain Weaviate VectorStore’s title. In the picture below, the drop-down symbol is outside the red box, at the extreme right.

Step 4: After clicking the drop-down icon, click on API Keys, and your API key will appear, as shown below.

I shared these as steps to make it easier for you, since there are multiple options and it is easy to get confused. Along with the API key for storing embeddings, you will also need the VectorStore URL to use in your LangChain Weaviate program. You can find it in the picture below, marked in red.
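With both values in hand, one common pattern (just a sketch of mine; the environment variable names below are my own choice, not something Weaviate requires) is to keep secrets out of your source code and read them from environment variables:

```python
import os

# Hypothetical environment variable names; set these in your shell or
# notebook before running, instead of hard-coding secrets in the script.
os.environ.setdefault("WEAVIATE_API_KEY", "your-weaviate-api-key")
os.environ.setdefault("WEAVIATE_URL", "https://my-sandbox-xxxx.weaviate.network")

WEAVIATE_API_KEY = os.environ["WEAVIATE_API_KEY"]
WEAVIATE_CLUSTER = os.environ["WEAVIATE_URL"]
```

The `setdefault` calls only provide placeholders; real values you export in your environment take precedence.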

LangChain Weaviate: Installing The Required Packages

Let’s install each package used in this blog on LangChain Weaviate. The core packages that you will always need for any program that deals with Weaviate and LangChain are the first two, i.e., weaviate-client and langchain.

%pip install weaviate-client
%pip install langchain
%pip install unstructured
%pip install "unstructured[pdf]"
%pip install sentence-transformers

Importing The Modules

As always, I am listing every module used in this blog post in a single block so that the possibility of errors at your end is reduced.

from langchain_community.document_loaders import PyPDFDirectoryLoader, PyPDFLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter

import weaviate
from langchain.vectorstores import Weaviate

Working of Code

I am going to use a PDF as the source of data, so the unstructured package and PyPDFDirectoryLoader will be used. We use the directory loader when there are multiple PDFs in a directory, but it also works for a single file. You can also use plain pypdf for a single file.

This data will then be split into chunks using a text splitter, converted into embeddings, and stored in the Weaviate VectorStore. In LangChain Weaviate, we also need to create a schema, which you can copy and paste from this blog to save time.
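To get an intuition for what “chunks with overlap” means before we use LangChain’s splitter, here is a minimal pure-Python sketch. This is not how RecursiveCharacterTextSplitter actually works (it recursively tries separators such as paragraphs and sentences); this naive version just slides a fixed-size window:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=20):
    """Naive fixed-size chunking with overlap (illustration only)."""
    step = chunk_size - chunk_overlap  # advance less than a full chunk
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

sample = "x" * 2500
chunks = chunk_text(sample)
print(len(chunks))     # 3 windows cover 2500 characters
print(len(chunks[0]))  # each full chunk is 1000 characters
```

The last 20 characters of one chunk reappear at the start of the next, which is the point of the overlap: context at chunk boundaries is not lost.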

Setting Up The Data

Let’s start by assigning the keys to variables: the Weaviate keys, which are the most essential, and the Hugging Face key, since we are going to use Hugging Face.

WEAVIATE_API_KEY = "Your Weaviate API key"
WEAVIATE_CLUSTER = "Your Weaviate URL"
huggingface_key = "Hugging Face API Key"

We already saw at the beginning of the blog how to get the Weaviate URL and API key. To get a Hugging Face API key, create an account on Hugging Face, go to Settings, and then to Access Tokens. Confirm your email; once it is confirmed, generate a new token, i.e., an API key, by clicking on New Token.

Now let’s make a directory called “Data” and store our PDF’s text data to process. I am going to use Metamorphosis by Kafka.

# Making a directory to store the PDF and fetching its textual data

!mkdir Data  # create a directory named Data


# Before running the code below, save your PDF in the Data directory

from langchain_community.document_loaders import PyPDFDirectoryLoader, PyPDFLoader

loader = PyPDFDirectoryLoader("/content/Data") # replace /content/Data with your file path

data = loader.load()  # The variable data will hold the textual data of the PDF

# Breaking the textual data into chunks


from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
 

docs = text_splitter.split_documents(data)  # split the text in data and store the chunks in docs

# To have a sneak peek, print the length and contents

print(len(docs))

print(docs)

LangChain Weaviate: Setting Up the Client

The client works as a messenger that establishes the connection between your VectorStore and your code. Without creating a client, you won’t be able to communicate with your VectorStore.

#Creating a Weaviate Client 

import weaviate
from langchain.vectorstores import Weaviate

# Connect to the Weaviate cluster
auth_config = weaviate.auth.AuthApiKey(api_key = WEAVIATE_API_KEY)
WEAVIATE_URL = WEAVIATE_CLUSTER

# Using the above API key and URL, create a client

client = weaviate.Client(
    url = WEAVIATE_URL,
    additional_headers = {"X-HuggingFace-Api-Key": huggingface_key},
    auth_client_secret = auth_config,
    startup_period = 10
)


Use client.is_ready() to check whether your client is ready.
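As a small convenience (a sketch of my own, not part of the Weaviate API), you can wrap that check in a helper that fails loudly instead of silently continuing with a dead connection:

```python
def ensure_ready(client):
    """Raise early if the Weaviate cluster is unreachable."""
    if not client.is_ready():
        raise RuntimeError("Could not reach the Weaviate cluster; check the URL and API key")
    return True

# Usage (assumes the `client` object created above):
# ensure_ready(client)
```

Failing at this point is much easier to debug than a cryptic error later when you try to create the schema or add texts.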

Writing Schema For LangChain Weaviate Based VectorStore

A schema is a structure that defines how data will be stored in, and queried from, the VectorStore. In Pinecone, this happened automatically on the backend.

In the context of LangChain Weaviate, a schema is defined using a JSON-like structure that includes:

  1. Classes: These are the main categories or types of objects that you want to store in the database. Each class has a name and a description. Here, the class is named Chatbot.
  2. Properties: These are the attributes or characteristics of the objects that belong to a class. Each property has a name, data type, and description. The description here is “Documents for chatbot.”
  3. Vectorizer: This is the module used to convert the data into vectors. Here we used the text2vec-huggingface vectorizer, which uses the Hugging Face Transformers library to convert text data into vectors.
  4. ModuleConfig: This is the configuration for the vectorizer, which includes the model name and the type of data it handles. Here we used “sentence-transformers/all-MiniLM-L6-v2” as the model and “text” as the data type.


#Writing the Schema

# define input structure
client.schema.delete_all()
client.schema.get()
schema = {
    "classes": [
        {
            "class": "Chatbot",
            "description": "Documents for chatbot",
            "vectorizer": "text2vec-huggingface",  # Use the Hugging Face vectorizer
            "moduleConfig": {
                "text2vec-huggingface": {
                    "model": "sentence-transformers/all-MiniLM-L6-v2",  # model name
                    "type": "text"  # type of data
                }
            },
            "properties": [
                {
                    "dataType": ["text"],
                    "description": "The content of the paragraph",
                    "moduleConfig": {
                        "text2vec-huggingface": {
                            "skip": False,
                            "vectorizePropertyName": False,
                        }
                    },
                    "name": "content",
                },
            ],
        },
    ]
}

client.schema.create(schema)
vectorstore = Weaviate(client, "Chatbot", "content", attributes=["source"])

Finally, Storing In VectorStore

While storing into a VectorStore, we have to make sure the metadata goes in with the text too. In this last section of the blog on LangChain Weaviate, we are going to do exactly that.

# Create a list of tuples where each tuple contains the page content and metadata of a document
text_meta_pair = [(doc.page_content, doc.metadata) for doc in docs]

# Transpose the list of tuples into two separate lists: texts and meta
texts, meta = list(zip(*text_meta_pair))

# Add the text data and its corresponding metadata to the vector store
vectorstore.add_texts(texts, meta)

Voila! This is how the LangChain Weaviate VectorStore will look with the 146 chunks that we also checked programmatically earlier.

LangChain Weaviate End Result
Figures might vary based on your document and the number of chunks you divided it into.

Conclusion:

Hurray! You have learned how to use LangChain Weaviate to store vector embeddings. If you have any suggestions or feedback, feel free to reach out. You can read more such blogs on LangChain here.
