If you work with generative AI, you will sooner or later need to pull text out of PDFs, one of the most common document formats. This blog on the LangChain PDF loader shows how to load a single PDF, multiple PDFs, or an entire directory of PDFs, and how to split the text inside them.
So, let’s begin our journey.
Installing the Essentials for the Blog
Only three packages need to be installed for this blog. Since we are dealing with PDFs, the PyPDF module is the simplest choice.
%pip install langchain -U
%pip install langchain-community
%pip install pypdf
How to Load a Single PDF
To see how to load a PDF, we need a PDF file. I am using a novel by Kafka; you can download it from here if you want to follow along.
from langchain_community.document_loaders import PyPDFLoader # To load pdf file
loader = PyPDFLoader("/Metamorphosis.pdf") # Replace it with your file path.
Splitting the PDF
Now, let’s split this PDF into pages. This is just as easy: the load_and_split method loads the file and splits it in a single step.
pages = loader.load_and_split()
Now, let’s see how many pages it was split into using the len function.
len(pages)
Output:
70
You can see all 70 pages in the “pages” variable using the loop below.
for i, page in enumerate(pages):
    print(f"Page {i + 1}:")
    print(page)
    print("\n" + "-"*50 + "\n")
To see the complete output for all 70 pages, go to this link. To keep this LangChain PDF Loader blog easy to read, I have pasted only the first page of the output below.
Output:
Page 1:
page_content='Metamorphosis
by Franz Kafka
Translated by David Wyllie
Prepared and Published by:
Ebd
E-BooksDirectory.com' metadata={'source': '/Metamorphosis.pdf', 'page': 0}
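Note that in the metadata above, 'page' is 0 even though our printing loop labels it "Page 1": the loader stores 0-based page indices, while the loop adds 1 for human-readable labels. A minimal sketch of that off-by-one relationship (the helper name is my own, purely for illustration):

```python
def display_label(zero_based_page: int) -> str:
    # Loader metadata stores 0-based page indices;
    # readers expect 1-based page labels.
    return f"Page {zero_based_page + 1}"

print(display_label(0))   # metadata 'page': 0 is the first page
print(display_label(69))  # the last of our 70 pages
```

Keep this in mind when you use the metadata to cite a page number back to a user.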
Splitting the Text Inside the PDF
Text splitters are not PDF loaders, but you will usually need one to break down the text inside the PDF. One of the most commonly used splitters is the recursive character text splitter, which we use in the code below.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(pages)
Printing with a similar loop shows that the pages are now divided into smaller chunks than before. Note that this time we iterate over docs, not pages.
for i, doc in enumerate(docs):
    print(f"Chunk {i + 1}:")
    print(doc)
    print("\n" + "-"*50 + "\n")
We can increase chunk_overlap in the code above to keep the context intact across chunk boundaries. We kept it at 0 here, but in my experience a value of at least 20 is good practice: the last 20 characters of one chunk are repeated at the start of the next, so sentences are not cut off without any surrounding context.
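To build intuition for what chunk_overlap does, here is a plain-Python sketch of fixed-size chunking with overlap. This is a deliberate simplification of my own: the real RecursiveCharacterTextSplitter also tries to split on separators such as paragraph and sentence boundaries before falling back to character counts.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-size chunker: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share
    chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "a" * 50 + "b" * 50
chunks = chunk_with_overlap(text, chunk_size=40, chunk_overlap=20)
# Consecutive chunks share their boundary characters:
print(chunks[0][-20:] == chunks[1][:20])  # True
```

With overlap set to 0, each chunk would start exactly where the previous one ended, which is why context can be lost at the boundaries.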
Let’s see how many chunks our PDF text is divided into.
len(docs)
Output:
198
To learn more about text splitters, read my blog on text splitters.
LangChain PDF Loader: Loading an Entire Directory of PDFs
Sometimes, we want to load multiple PDFs at once by pointing the loader at an entire directory.
First, we will make a directory named my_directory containing two PDFs. Both are research papers: one on YOLOv7 and the other on DeepSeek-Coder.
Let’s see how to do that in the code below.
from langchain_community.document_loaders import PyPDFDirectoryLoader
!mkdir my_directory
We imported PyPDFDirectoryLoader to load an entire directory of PDFs. Next, we create the directory and download the PDFs into it using the gdown command (install it with pip install gdown if needed). You can also place the files there manually.
!gdown 1P094bOfyK7xmSJAhO6JV6duegLYOIyvg -O my_directory/yolov7paper.pdf
!gdown 1WV1MSW8CVm4keaV05pi9SGGac6RqA_kZ -O my_directory/deepseekcoder.pdf
Finally, we load the entire directory with the load method and save the result in the variable "data".
loader = PyPDFDirectoryLoader("my_directory")
data = loader.load()
Printing the result with the same kind of loop shows what’s inside data. You will notice that pages from both PDFs are present.
for i, page in enumerate(data):
    print(f"Page {i + 1}:")
    print(page)
    print("\n" + "-"*50 + "\n")
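Conceptually, PyPDFDirectoryLoader walks the directory, finds the PDF files, and runs a per-file PDF loader on each one. A minimal sketch of that discovery step (the file names and helper below are illustrative only, not the library's internals):

```python
from pathlib import Path
import tempfile

def find_pdfs(directory: str) -> list[str]:
    # A directory loader conceptually globs for PDFs like this,
    # then feeds each file to a per-file PDF loader.
    return sorted(str(p) for p in Path(directory).glob("*.pdf"))

# Demo with a throwaway directory containing two empty placeholder files.
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "yolov7paper.pdf").touch()
    (Path(d) / "deepseekcoder.pdf").touch()
    (Path(d) / "notes.txt").touch()  # ignored: not a PDF
    pdfs = find_pdfs(d)
    print(len(pdfs))  # 2
```

This is why dropping extra PDFs into my_directory is enough for them to show up in data on the next load.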
Use cases of LangChain PDF Loader
The most important use of the LangChain PDF loader is in RAG (retrieval-augmented generation). You can learn how I built a RAG pipeline in 7 simple steps in my blog on LangChain RAG.
RAG works by supplying the LLM with extra information from our own documents, so the next time your data is in PDF form, use the loading methods shown above.
Other uses include text classification, sentiment analysis, chatbot building, document summarization, and any other operation you can perform on the text of a document.
Conclusion:
Hurray! You have now mastered how to deal with PDFs in LangChain. If you want me to write any other blog or you want to give any suggestions or report an error, then please feel free to reach out.
This blog on the LangChain PDF Loader was just one part of my LangChain series. If you want to learn more about LangChain, visit the MLguide section.