
LangChain Document Loader: How To Load Any Document Effortlessly


The LangChain Document Loader is another feature of LangChain that gives developers more power and flexibility. With its versatility and ease of use, a developer can enhance any app by loading documents from almost any source and then performing various operations on them.

In this blog, you will learn:

  1. What is the LangChain Document Loader?
  2. The various types of LangChain Document Loaders
  3. How a LangChain Document Loader works
  4. How to load YouTube transcripts as documents
  5. How to load a simple text file
  6. How to load multiple PDF files
  7. How to load a CSV file
  8. How to load any document using a LangChain Document Loader, and much more

What is the LangChain Document Loader?

The LangChain Document Loader is a dedicated LangChain module designed to load documents or fetch information from different sources specified by the user, such as webpages, YouTube videos, directories, or PDF files. It converts this data into a format LangChain can easily work with, organizing the text within a Document object along with its metadata.
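
To make that concrete, here is a minimal sketch of what such a Document object looks like; the field values below are made up for illustration, but page_content and metadata are the actual attributes LangChain uses:

from langchain_core.documents import Document # The object every loader produces

# Hypothetical example: a loader would normally build this for you.
doc = Document(
    page_content="Franz Kafka was a German-language novelist...",
    metadata={"source": "kafka.txt"},
)

print(doc.page_content) # The text itself
print(doc.metadata)     # Where it came from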

Some of the benefits that LangChain Document Loaders provide are:

  1. They help us work around the token limits imposed by LLMs.
  2. They are crucial players in LangChain RAG systems.
  3. They act as a bridge between raw data and LLMs, helping to organize data in a way that LLMs can understand.
  4. They help in organizing the data.
  5. Many loaders give you granular control over your data; you can work with either the entire document or a specific section of it without any extra hassle (see the sketch after this list).
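
As a quick, hedged illustration of that granular control (assuming a local file named example.pdf, which is just a placeholder), you can load a whole PDF but then work with a single page of it:

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example.pdf") # Placeholder path; use your own PDF
pages = loader.load()               # One Document per page

third_page = pages[2]               # Work with one section instead of the whole file
print(third_page.metadata)          # e.g. {'source': 'example.pdf', 'page': 2}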

Document Loading: How It Works and How to Load Documents

Step 1: Clarify your goal. Identify your goal and the type of data you need, and then search for a relevant document loader. There are more than 160 document loaders in LangChain, and almost every one of them loads a document in the same way. You can see the list of document loaders here.

Step 2: Know your data source. Where does the data live: in a local file, at a webpage URL, or somewhere else? For local files, specify the exact path on your system (e.g., “C:/Users/username/Documents/my_data.txt”). If it is on a webpage, give the loader the complete URL.

Step 3: Let the loader do the work. In this step, you do nothing; it is the job of the document loader to retrieve data from the specified source.

Step 4: Object creation. The loader transforms the retrieved data into a format LangChain can understand: the document object. It contains the data as well as the metadata.

Step 5: Begin the processing. Once the document object is created, it is ready for processing, whether that is data retrieval, breaking the text into chunks, or any other processing you need.
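
Those five steps translate into almost the same few lines of code for every loader. Here is a hedged sketch of the general pattern, using WebBaseLoader for a webpage as an example; it assumes the LangChain packages installed in the next section plus beautifulsoup4, and the URL is simply our own site:

%pip install beautifulsoup4 # WebBaseLoader parses HTML with BeautifulSoup

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://machinelearningspot.com") # Step 2: point the loader at the source
docs = loader.load()                                      # Steps 3 and 4: fetch the data and build Document objects

print(docs[0].metadata)           # Step 5: process the documents however you need
print(docs[0].page_content[:200])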

The steps I shared above are a general outline for every document, so while loading different kinds of documents you may have to install different packages or even obtain an API key. However, the general steps remain the same as above.

Furthermore, you will not always find a direct way to load your entire document exactly the way you want, so sometimes you will have to work out part of the process yourself.

Let’s see how documents are loaded through code

Installing Essential Packages

Before getting started, there are essential packages that you need to have to run the code. Let’s install them first.

%pip install langchain 
%pip install langchain-community 
%pip install tiktoken
%pip install pypdf
%pip install langchain-openai
%pip install firecrawl-py

Loading a Text File and Breaking it into Chunks

The text file used below can be found here. The information was copied from the Britannica encyclopedia.

                    # Importing necessary Modules

from langchain_community.document_loaders import TextLoader # To load the text file
from langchain_text_splitters import RecursiveCharacterTextSplitter # To split text and reduce token size

                   # File Loading and object creation

loader = TextLoader("file path") # Replace "file path" with the path to your own file, including the file name
documents = loader.load() # Uses the load method of TextLoader to load the entire document

                   # Splitting Text (Process on File)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents) # The split text is saved in docs
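
To confirm the load and the split worked, here is a quick optional check (nothing below is required by the loader itself):

print(len(docs))                  # How many chunks the splitter produced
print(docs[0].page_content[:200]) # A peek at the first chunk
print(docs[0].metadata)           # Metadata carried over from the original file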

I have used this loader in many of my blogs; you can read how I converted the loaded text into embeddings and saved them to my local machine using FAISS in 3 easy steps.

Single PDF Loading and Splitting

The PDF used here can be found using this link:

                        # Importing necessary Modules

from langchain_community.document_loaders import PyPDFLoader # To load pdf file
from langchain_text_splitters import RecursiveCharacterTextSplitter # To split text to reduce token size.

                        # File Loading and object creation
loader = PyPDFLoader("/content/Metamorphosis.pdf") # Replace it with your file path.
pages = loader.load_and_split() # Loads the PDF and splits it into smaller Documents, roughly one per page

                        # Splitting Text (Process on File)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(pages)
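
Each page Document keeps track of where it came from; here is a quick optional peek at the metadata (the exact keys can vary slightly between pypdf versions):

print(pages[0].metadata) # Typically something like {'source': '/content/Metamorphosis.pdf', 'page': 0}
print(len(docs))         # Number of chunks produced by the splitter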

I used this LangChain Document Loader to implement a LangChain RAG in 7 simple steps, which I used to query my document.

Loading an Entire PDF Directory

I made a directory in Google Colab and downloaded two research papers from my Google Drive.

                      #Working on Directory

!mkdir papers #Making a Directory with name papers

                      #Downloading The Files in Our Directory

!gdown 1P094bOfyK7xmSJAhO6JV6duegLYOIyvg -O papers/yolov7paper.pdf 
# Note: after the -O flag I provide the path to my directory along with the file name.
!gdown 1WV1MSW8CVm4keaV05pi9SGGac6RqA_kZ -O papers/deepseekcoder.pdf

                      #Loading Entire Directory having PDF Files

from langchain_community.document_loaders import PyPDFDirectoryLoader # To load every PDF in a directory

loader = PyPDFDirectoryLoader("papers") # Loading the directory "papers"
data = loader.load() # Object creation

                 #Text Splitting (Some Process On The Document)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20) 

text_chunks = text_splitter.split_documents(data)
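
Because every chunk keeps its source file in its metadata, you can optionally check that both papers made it in:

sources = {chunk.metadata["source"] for chunk in text_chunks}
print(sources) # Should list both files, e.g. {'papers/yolov7paper.pdf', 'papers/deepseekcoder.pdf'}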

I used the above code snippet on the LangChain Document Loader in my blog on how to use LangChain Pinecone in 7 simple steps.

Loading a YouTube Video Transcript

                      #Loading YouTube Video

%pip install youtube-transcript-api pytube # Packages the YouTube loader relies on

from langchain_community.document_loaders import YoutubeLoader # To load YouTube transcripts

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=VCCgdRF0AIA", add_video_info=True)

transcript = loader.load() #Object creation

                    #Text Splitting (Process On The Document)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=30)
docs = text_splitter.split_documents(transcript) # docs contains the split text.

While it’s not always necessary to split the text, doing so can help us avoid hitting token limits when working with LLMs. I used the LangChain Document Loader to develop a simple YouTube Video Summarizer capable of handling videos of any length.
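We installed tiktoken at the start but have not used it yet; here is a hedged sketch of how you could check a chunk's token count against a model's context limit (the encoding name cl100k_base is an assumption that matches recent OpenAI models):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base") # Assumed encoding; pick the one matching your model
num_tokens = len(encoding.encode(docs[0].page_content))
print(num_tokens) # If this stays well below the model's context window, the chunk size is safe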

Using FireCrawl as a Web Crawler

LangChain Document Loaders even give you the flexibility and power to crawl an entire website, depending on your use case. You can find a myriad of document loaders. Let's see how to crawl a website.

We will use FireCrawl as a LangChain Document Loader; you can use the code below to do it. We have already installed the required package (firecrawl-py) above. To learn more about FireCrawl and LangChain, read our blog on LangChain FireCrawl.

#Initializing the Document Loader

from langchain_community.document_loaders import FireCrawlLoader # To crawl or scrape a website

loader = FireCrawlLoader(
    api_key="your-firecrawl-api-key", # Replace with your own FireCrawl API key
    url="https://machinelearningspot.com",
    mode="crawl"
)

           #Loading Document then Printing to see contents

crawl_docs = loader.load()

print(crawl_docs)

The above code crawls our entire website, machinelearningspot.com. If you want to crawl another website, just replace the URL, and if you only want to scrape a single page, change the mode to "scrape", as sketched below.
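
Here is a minimal sketch of that scrape variant (same loader, only the mode changes; the URL is just an example):

loader = FireCrawlLoader(
    api_key="your-firecrawl-api-key", # Replace with your own FireCrawl API key
    url="https://machinelearningspot.com",
    mode="scrape" # Scrapes only the single page at the given URL
)

scrape_docs = loader.load()
print(scrape_docs[0].page_content[:200])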

Loading CSV

To load a CSV, we will first create a random CSV filled with random numbers and then load that document. If you already have a CSV, you can replace this random one with your own file. Let's see the process below.

              #Creating Random CSV to Experiment


import csv
import random

# Define the number of rows and columns
num_rows = 10
num_cols = 5

# Generate random data
data = [[random.randint(1, 100) for _ in range(num_cols)] for _ in range(num_rows)]

# Write data to a CSV file
with open('random.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

print("CSV file 'random.csv' has been generated with random data.")



              #Loading CSV File 

from langchain_community.document_loaders.csv_loader import CSVLoader


loader = CSVLoader(file_path='/content/random.csv') # Give your own file path
data = loader.load() #Object creation

The next step is using this object for processing, which can be anything. For example, if the CSV held the results of one or more students, you could use it to ask an LLM questions about those results, as sketched below.
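
Here is a hedged sketch of that idea (it assumes an OpenAI API key is set in the OPENAI_API_KEY environment variable, and the model name and question are just examples); you could pass the loaded rows to an LLM like this:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini") # Example model; use whichever you have access to

# CSVLoader creates one Document per row, so join their text into a single context string.
csv_text = "\n".join(doc.page_content for doc in data)

response = llm.invoke(
    f"Here is the content of a CSV file:\n{csv_text}\n\nWhich row contains the largest number?"
)
print(response.content)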

I have not done anything special; I wanted to load a CSV, so I just typed "LangChain CSV loader" into Google, found the loader in the same list I shared above, and used it here. You can do the same to load any document; it is super easy with a LangChain Document Loader.

Practice Time, Now You Try! Loading a JSON File

Your task is to find out how to load a JSON file using a LangChain Document Loader. We loaded the previous document types together, but now I want you to do it yourself.

You have two obvious options:

  1. Search on Google or any other search engine.
  2. Use the list to manually find it.

Any other approach is up to you. Now go and start searching and then implementing. Don't use or look at the code below until you have tried it yourself.

                      #Generating a Random JSON
import json
import random

# Function to generate a random JSON object
def generate_random_json():
    data = {
        "name": random.choice(["Alice", "Bob", "Charlie", "David", "Eve"]),
        "age": random.randint(18, 60),
        "city": random.choice(["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"]),
        "is_student": random.choice([True, False]),
        "scores": [random.randint(50, 100) for _ in range(5)]
    }
    return data

# Generate a random JSON object
random_json = generate_random_json()

# Write the JSON object to a file
with open('random.json', 'w') as f:
    json.dump(random_json, f)
                      
                #Having a Sneak Peek Into Our JSON

from pathlib import Path
from pprint import pprint

file_path='/content/random.json'
data = json.loads(Path(file_path).read_text())

pprint(data) 
#Note: this is not an ordinary print; pprint (pretty-print) formats the JSON nicely so you can sneak a peek at it.

               #Loading The JSON Content and making an Object 

!pip install jq #this package is essential to access the JSON content

from langchain_community.document_loaders import JSONLoader #To load JSON

loader = JSONLoader(
    file_path='/content/random.json',
    jq_schema='.scores[]',
    text_content=False) #Loaded the Content From our File

data = loader.load() #Successfully created Object

print(data) #Used simple print to see what we loaded

There are various ways to load JSON contents using LangChain Document Loader. Check the one that fits you the best here.
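
For instance, the jq_schema above pulled out only the scores; as a hedged sketch, changing the schema lets you load other parts of the same file (here, the whole JSON object as a single document):

loader = JSONLoader(
    file_path='/content/random.json',
    jq_schema='.', # Load the entire JSON object this time
    text_content=False)

print(loader.load())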

Conclusion:

Hurray! I hope that after successfully loading the documents above, you will be able to load any document. Just go and try to load a document of your choice that was not covered in this blog.

This blog on LangChain Document Loaders was a continuation of my LangChain blog series. For suggestions, feel free to contact us.
