Machine Learning Spot

LangChain PDF Loader: How to Master PDFs in LangChain (2024)


PDFs are among the most commonly used document formats, so anyone working in generative AI should know how to handle them in LangChain.

This blog on the LangChain PDF loader covers how to load PDFs, whether a single file, multiple files, or a complete directory, and how to split both the documents and the text inside them.

So, let’s begin our journey.

Installing the Essentials for the Blog

Only three packages need to be installed for this blog. Since we are dealing with PDFs, the pypdf package is the simplest choice.

%pip install langchain -U 
%pip install langchain-community
%pip install pypdf

How to Load a Single PDF

To see how to load a PDF, we need a PDF file. I am using Franz Kafka's novella Metamorphosis. You can download it from here if you want to follow along.

from langchain_community.document_loaders import PyPDFLoader # To load pdf file
loader = PyPDFLoader("/Metamorphosis.pdf") # Replace it with your file path.

Splitting the PDF


Now, let’s split this PDF. This is just as easy: the load_and_split method loads the file and splits it into one document per page in a single call.

pages = loader.load_and_split()

Now, let’s see how many pages it’s split into using the len function.

len(pages)

Output:

70

You can print all 70 pages stored in the “pages” variable using the loop below.

for i, page in enumerate(pages):
    print(f"Page {i + 1}:")
    print(page)
    print("\n" + "-"*50 + "\n")

To see the complete output of all 70 pages, go to this link. To keep this blog easy to read, only the first page of the output is pasted below.

Output:

Page 1:
page_content='Metamorphosis  
 
by Franz Kafka  
 
Translated by David Wyllie  
 
 
 
 
 
 
 
        Prepared and Published by:  
 Ebd 
E-BooksDirectory.com' metadata={'source': '/Metamorphosis.pdf', 'page': 0}
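As the output shows, each element of pages is a Document with two attributes: page_content (the text) and metadata (the source path and a zero-indexed page number). The sketch below shows how to access them; a minimal stand-in class is used in place of LangChain's actual Document so the snippet runs on its own.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:  # minimal stand-in for LangChain's Document class
    page_content: str
    metadata: dict = field(default_factory=dict)

# One entry of `pages`, reconstructed from the output above
pages = [Doc(
    page_content="Metamorphosis \n \nby Franz Kafka",
    metadata={"source": "/Metamorphosis.pdf", "page": 0},
)]

first = pages[0]
print(first.metadata["page"])    # pages are zero-indexed, so this prints 0
print(first.page_content[:13])   # slice the text instead of printing it all
```

With real loader output, the same attribute access works on every element of pages.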

Splitting the Text Inside the PDF

Text splitters are not PDF loaders, but you still need them to break down the text inside the PDF. One of the most commonly used is the recursive character text splitter, which we use in the code below.

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(pages)

Running a similar printing loop over docs shows that the pages are now divided into smaller chunks than before.

for i, chunk in enumerate(docs):
    print(f"Chunk {i + 1}:")
    print(chunk)
    print("\n" + "-"*50 + "\n")

We can increase chunk_overlap in the code above to keep the context intact. We kept it at 0 here, but in my experience a value of at least 20 is good practice: each chunk then repeats the last 20 characters of the previous one, which helps preserve context across chunk boundaries.
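To see why overlap helps, here is a toy fixed-size splitter. This is not RecursiveCharacterTextSplitter's real algorithm (that one prefers to split on separators like paragraph and sentence breaks first); it only illustrates what the overlap parameter does mechanically.

```python
# Toy splitter: NOT LangChain's actual algorithm, just an illustration
# of how chunk_overlap makes adjacent chunks share characters.
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap   # each chunk starts this far after the last
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sentence = "Gregor Samsa woke from troubled dreams and found himself transformed."
chunks = split_with_overlap(sentence, chunk_size=30, chunk_overlap=10)

# With overlap, the last 10 characters of a chunk reappear at the
# start of the next one, so context survives the cut:
print(chunks[0][-10:] == chunks[1][:10])  # True
```

With chunk_overlap=0, each cut is hard and a sentence can be severed with no shared context on either side; that is what a non-zero overlap avoids.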

Let’s see the number of chunks our PDF text is divided into.

len(docs)

Output:

198

To learn more about text splitters, read my blog on text splitters.

LangChain PDF Loader: Loading an Entire Directory of PDFs

Sometimes, we want to load multiple PDFs at once by pointing the loader at an entire directory.

So first, we will make a directory with the name my_directory. In that directory, there will be two PDFs. Both are research papers: one on YOLOv7 and the other on DeepSeek-Coder.

Let’s see how to do that in the code below.

from langchain_community.document_loaders import PyPDFDirectoryLoader
!mkdir my_directory

We imported the PyPDFDirectoryLoader to load an entire directory of PDFs. Now, we will create the directory and download the PDFs into it using the gdown command. This can also be done manually.

!gdown 1P094bOfyK7xmSJAhO6JV6duegLYOIyvg -O my_directory/yolov7paper.pdf
!gdown 1WV1MSW8CVm4keaV05pi9SGGac6RqA_kZ -O my_directory/deepseekcoder.pdf

Finally, we load the entire directory using the load method and save the result in the variable “data”.

loader = PyPDFDirectoryLoader("my_directory")
data = loader.load()

Printing the data through the loop below lets you see what’s inside. You will notice that pages from both PDFs are present.

for i, page in enumerate(data):
    print(f"Page {i + 1}:")
    print(page)
    print("\n" + "-"*50 + "\n")
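When a directory is loaded, every Document carries its file path in metadata['source'], so you can tell the two papers apart programmatically. The sketch below groups pages by source file; a stand-in Doc class mimics LangChain's Document so the snippet runs on its own, and the page texts are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Doc:  # minimal stand-in for LangChain's Document class
    page_content: str
    metadata: dict = field(default_factory=dict)

# Placeholder pages standing in for the real directory-loader output
data = [
    Doc("YOLOv7 abstract...", {"source": "my_directory/yolov7paper.pdf", "page": 0}),
    Doc("DeepSeek-Coder abstract...", {"source": "my_directory/deepseekcoder.pdf", "page": 0}),
    Doc("YOLOv7 introduction...", {"source": "my_directory/yolov7paper.pdf", "page": 1}),
]

# Group pages by the PDF they came from
by_source = defaultdict(list)
for page in data:
    by_source[page.metadata["source"]].append(page)

for source, pages in by_source.items():
    print(f"{source}: {len(pages)} page(s)")
```

The same grouping works unchanged on the real data variable from PyPDFDirectoryLoader.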

Use cases of LangChain PDF Loader

The most important use of the LangChain PDF loader is in RAG (retrieval-augmented generation). You can learn how I developed RAG in 7 simple steps here in my blog on LangChain RAG.

For RAG, we need to provide the LLM with extra information that we have in the form of documents, so the next time your data is in PDF form, use the LangChain PDF loader method above.
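To make the flow concrete, here is a bare-bones sketch of the "retrieval" step: pick the chunk that shares the most words with the question, then paste it into the prompt. This is not LangChain's API; real RAG pipelines use embeddings and a vector store, and the chunks and question below are made up for illustration.

```python
# Naive keyword retrieval: score each chunk by word overlap with the question.
def retrieve(question: str, chunks: list[str]) -> str:
    q_words = set(question.lower().split())
    return max(chunks, key=lambda c: len(q_words & set(c.lower().split())))

chunks = [
    "Gregor Samsa woke one morning transformed into a giant insect.",
    "The novella was translated into English by David Wyllie.",
]
question = "Who translated the novella into English?"
context = retrieve(question, chunks)

# The retrieved chunk becomes extra context for the LLM:
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
print(context)
```

Swapping the naive retrieve for embedding similarity over your split docs is exactly what a vector store does for you.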

Other uses include text classification, sentiment analysis, chatbot building, document summarization, and any other operation performed on the text of a document.

Conclusion:

Hurray! You have now mastered how to deal with PDFs in LangChain. If you want me to write any other blog or you want to give any suggestions or report an error, then please feel free to reach out.
This blog on the LangChain PDF loader was just one part of my LangChain series. If you want to learn more about LangChain, visit the MLguide section.
