Machine Learning Spot

How to Use LangChain CSV Loader 101: Effortless Data-Handling

The LangChain CSV loader makes it easier to work with CSV files. When working with CSV files in Python, especially in data processing and machine learning pipelines, it is vital to have a reliable way to load, parse, and manipulate data, and LangChain’s CSVLoader is a convenient tool for this purpose. CSV (comma-separated values) files are a format worth mastering, so this blog walks through the main features of the LangChain CSV loader.
Let’s set up the requirements first and then load a CSV!

Installing The Requirements

We’ll begin with the installations required to follow along with the blog and use the LangChain CSV loader.

%pip install langchain
%pip install langchain-community
%pip install tiktoken

Load Your CSV File

To load a CSV, you first need a CSV file. The one used here is a sample dataset of 100 customer records, customers-100.csv.

We need to import the CSV loader and provide it with the path of our file when initialising it. After initialisation, calling the .load() method reads the file and returns the data.

from langchain_community.document_loaders.csv_loader import CSVLoader

file_path = "/content/customers-100.csv"

loader = CSVLoader(file_path=file_path)  # Initialize the CSV loader
data = loader.load()  # Read the file; each row becomes one document

Our CSV is a large one, so we will print only the first five rows to see how things are going.

for i, row in enumerate(data):
    if i >= 5:  # Print only the first 5 rows
        break
    print(row)

Let’s observe the magic of the LangChain CSV loader by looking at the output.

Output:

page_content='Index: 1
Customer Id: DD37Cf93aecA6Dc
First Name: Sheryl
Last Name: Baxter
Company: Rasmussen Group
City: East Leonard
Country: Chile
Phone 1: 229.077.5154
Phone 2: 397.884.0519x718
Email: [email protected]
Subscription Date: 2020-08-24
Website: http://www.stephenson.com/' metadata={'source': '/content/customers-100.csv', 'row': 0}
page_content='Index: 2
Customer Id: 1Ef7b82A4CAAD10
First Name: Preston
Last Name: Lozano
Company: Vega-Gentry
City: East Jimmychester
Country: Djibouti
Phone 1: 5153435776
Phone 2: 686-620-1820x944
Email: [email protected]
Subscription Date: 2021-04-23
Website: http://www.hobbs.com/' metadata={'source': '/content/customers-100.csv', 'row': 1}
page_content='Index: 3
Customer Id: 6F94879bDAfE5a6
First Name: Roy
Last Name: Berry
Company: Murillo-Perry
City: Isabelborough
Country: Antigua and Barbuda
Phone 1: +1-539-402-0259
Phone 2: (496)978-3969x58947
Email: [email protected]
Subscription Date: 2020-03-25
Website: http://www.lawrence.com/' metadata={'source': '/content/customers-100.csv', 'row': 2}
page_content='Index: 4
Customer Id: 5Cef8BFA16c5e3c
First Name: Linda
Last Name: Olsen
Company: Dominguez, Mcmillan and Donovan
City: Bensonview
Country: Dominican Republic
Phone 1: 001-808-617-6467x12895
Phone 2: +1-813-324-8756
Email: [email protected]
Subscription Date: 2020-06-02
Website: http://www.good-lyons.com/' metadata={'source': '/content/customers-100.csv', 'row': 3}
page_content='Index: 5
Customer Id: 053d585Ab6b3159
First Name: Joanna
Last Name: Bender
Company: Martin, Lang and Andrade
City: West Priscilla
Country: Slovakia (Slovak Republic)
Phone 1: 001-234-203-0635x76146
Phone 2: 001-199-446-3860x3486
Email: [email protected]
Subscription Date: 2021-04-17
Website: https://goodwin-ingram.com/' metadata={'source': '/content/customers-100.csv', 'row': 4}
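
To make the structure of this output less magical, here is a rough sketch of how each CSV row could be turned into such a record using only Python’s standard csv module. It is an approximation for illustration, not LangChain’s actual implementation; the function name rows_to_records and the two-row sample data are assumptions made for this example.

```python
import csv
import io

def rows_to_records(csv_text, source):
    """Turn each CSV row into a (page_content, metadata) pair."""
    records = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader):
        # One "column: value" line per field, joined by newlines
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        metadata = {"source": source, "row": i}
        records.append((page_content, metadata))
    return records

# Hypothetical two-row sample with a few of the columns from customers-100.csv
sample = "Index,First Name,Country\n1,Sheryl,Chile\n2,Preston,Djibouti\n"

records = rows_to_records(sample, source="/content/customers-100.csv")
for page_content, metadata in records:
    print(page_content)
    print(metadata)
```

The real loader returns Document objects with page_content and metadata attributes rather than plain tuples, but the idea is the same: one document per row, with the row number kept in the metadata.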

Customising with LangChain CSV loader

We can also customize how the CSV is parsed using the csv_args argument of the LangChain CSV loader. In csv_args, we can specify the delimiter, which tells the program which character separates the values in the data.

The quotechar is the character used to quote fields that contain special characters, such as the delimiter. The default is a double quote (").

The fieldnames argument lets you provide custom names for the columns in the CSV file. This is useful if the CSV file does not include a header row or if you want to rename the columns. The new names are then used in the loaded documents; the original CSV file is not changed.

loader = CSVLoader(
    file_path=file_path,
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": ["number", "Customers", "Customer Names"],
    },
)

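In the output generated by the next block of code, the Index column of the CSV file appears as “number”, Customer Id as “Customers”, and First Name as “Customer Names”. Note that because fieldnames is supplied, the file’s own header row is read as an ordinary data row, so it shows up as the first record.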

data = loader.load()
for record in data[:2]:
    print(record)
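
Since the loader hands csv_args to Python’s csv.DictReader, the effect of fieldnames can be previewed with the standard library alone. The sketch below uses a made-up four-column sample (an assumption for illustration) and shows two things worth knowing: once fieldnames is supplied, the file’s own header line is read as the first data row, and any columns beyond the names you provide are collected in a list under the key None.

```python
import csv
import io

# Hypothetical sample; the real file has more columns
sample = "Index,Customer Id,First Name,Last Name\n1,DD37Cf93aecA6Dc,Sheryl,Baxter\n"

reader = csv.DictReader(
    io.StringIO(sample),
    delimiter=",",
    quotechar='"',
    fieldnames=["number", "Customers", "Customer Names"],
)
rows = list(reader)

# The original header line comes back as the first "data" row
print(rows[0]["number"])
# The fourth column has no custom name, so it is stored under the key None
print(rows[1][None])
```

If the header row should not appear among the loaded documents, one option is to skip the first record, for example by slicing with data[1:].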

Changing Your CSV’s Source Identity

In the LangChain CSV loader, the ‘source’ key in each document’s metadata identifies the origin of that document. This is particularly useful when you need to track where each piece of data came from, especially in larger datasets or when you have to integrate with systems that require source attribution.

When we specify a source_column while loading the CSV file using the LangChain CSV loader, the value in that column for each row is assigned to the source key in the metadata of the corresponding document. You can then trace each document back to its original row in the CSV file.

Now, let’s change the ‘source’ key using the code below:

from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path=file_path, source_column="First Name")

data = loader.load()

Let’s print the first five rows again, just like before. This way, you can compare it with the previous output.

for i, row in enumerate(data):
    if i >= 5:  # Print only the first 5 rows
        break
    print(row)

In the output below, you can observe that the source key in the metadata now holds the value from the column we specified (First Name). Previously, the source key was the path of the file, i.e. ‘/content/customers-100.csv’ in my case.

Output:

page_content='Index: 1
Customer Id: DD37Cf93aecA6Dc
First Name: Sheryl
Last Name: Baxter
Company: Rasmussen Group
City: East Leonard
Country: Chile
Phone 1: 229.077.5154
Phone 2: 397.884.0519x718
Email: [email protected]
Subscription Date: 2020-08-24
Website: http://www.stephenson.com/' metadata={'source': 'Sheryl', 'row': 0}
page_content='Index: 2
Customer Id: 1Ef7b82A4CAAD10
First Name: Preston
Last Name: Lozano
Company: Vega-Gentry
City: East Jimmychester
Country: Djibouti
Phone 1: 5153435776
Phone 2: 686-620-1820x944
Email: [email protected]
Subscription Date: 2021-04-23
Website: http://www.hobbs.com/' metadata={'source': 'Preston', 'row': 1}
page_content='Index: 3
Customer Id: 6F94879bDAfE5a6
First Name: Roy
Last Name: Berry
Company: Murillo-Perry
City: Isabelborough
Country: Antigua and Barbuda
Phone 1: +1-539-402-0259
Phone 2: (496)978-3969x58947
Email: [email protected]
Subscription Date: 2020-03-25
Website: http://www.lawrence.com/' metadata={'source': 'Roy', 'row': 2}
page_content='Index: 4
Customer Id: 5Cef8BFA16c5e3c
First Name: Linda
Last Name: Olsen
Company: Dominguez, Mcmillan and Donovan
City: Bensonview
Country: Dominican Republic
Phone 1: 001-808-617-6467x12895
Phone 2: +1-813-324-8756
Email: [email protected]
Subscription Date: 2020-06-02
Website: http://www.good-lyons.com/' metadata={'source': 'Linda', 'row': 3}
page_content='Index: 5
Customer Id: 053d585Ab6b3159
First Name: Joanna
Last Name: Bender
Company: Martin, Lang and Andrade
City: West Priscilla
Country: Slovakia (Slovak Republic)
Phone 1: 001-234-203-0635x76146
Phone 2: 001-199-446-3860x3486
Email: [email protected]
Subscription Date: 2021-04-17
Website: https://goodwin-ingram.com/' metadata={'source': 'Joanna', 'row': 4}
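
First names are not guaranteed to be unique, so two documents could end up with the same source. If the goal is to trace each document back to exactly one row, a column such as Customer Id may be a better fit. Below is a minimal sketch for checking whether a column’s values are unique before using it as source_column, using only the standard library; the helper name column_is_unique and the sample rows are assumptions for illustration.

```python
import csv
import io

def column_is_unique(csv_text, column):
    """Return True if no value in the given column appears more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [row[column] for row in reader]
    return len(values) == len(set(values))

# Hypothetical sample in which a first name repeats but the IDs do not
sample = (
    "Customer Id,First Name\n"
    "A1,Linda\n"
    "B2,Roy\n"
    "C3,Linda\n"
)

print(column_is_unique(sample, "First Name"))
print(column_is_unique(sample, "Customer Id"))
```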

Loading CSV Data From String

The LangChain CSV loader can even load CSV data held in a string. Because the loader expects a file path, Python’s tempfile module is used here to write the string to a temporary file first. This is useful when we have CSV data in string format.

import tempfile  # Used to create temporary files
from langchain_community.document_loaders.csv_loader import CSVLoader  # Used to load the CSV data

Now, we will define the CSV data string that mimics the content of a CSV file. The .strip() method is used to remove any leading or trailing whitespace.


string_data = """
"Fruit", "Color", "Average Weight (grams)"
"Apple", "Red", 182
"Banana", "Yellow", 118
"Cherry", "Red", 8
"Grapefruit", "Pink", 230
""".strip()

Next, we use tempfile.NamedTemporaryFile to create a temporary file and write the CSV string data into it.

The call NamedTemporaryFile(delete=False, mode="w+") creates a temporary file opened for writing and reading in text mode that is not deleted automatically when it is closed.

# Create a temporary file and write the CSV string data into it
with tempfile.NamedTemporaryFile(delete=False, mode="w+") as temp_file:
    temp_file.write(string_data)
    temp_file_path = temp_file.name  # Retrieve the file path

Inside the with block, temp_file.write(string_data) writes the CSV string data to the temporary file, and temp_file.name retrieves the path of the temporary file.

Finally, we load the data with the LangChain CSV loader, since the string now lives in a CSV file. Any of the operations covered earlier in this blog work on this file too, but first, let’s load it:

loader = CSVLoader(file_path=temp_file_path)
data = loader.load()
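
Because the temporary file was created with delete=False, it stays on disk after the with block ends. Once the data has been loaded into memory, the file can be removed. Here is a minimal sketch of the full write, read, and clean-up cycle using only the standard library; reading the file back with plain open() is a stand-in for the loader step, and the shortened two-column string is an assumption for illustration.

```python
import os
import tempfile

string_data = "Fruit,Color\nApple,Red\n"

# Write the CSV text to a temporary file that survives the with block
with tempfile.NamedTemporaryFile(delete=False, mode="w+", suffix=".csv") as temp_file:
    temp_file.write(string_data)
    temp_file_path = temp_file.name

try:
    # Stand-in for CSVLoader(file_path=temp_file_path).load()
    with open(temp_file_path) as f:
        contents = f.read()
finally:
    # Remove the file once it is no longer needed
    os.remove(temp_file_path)

print(contents == string_data)
print(os.path.exists(temp_file_path))
```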

In the end, let’s print and see the first record

for record in data[:1]:
    print(record)

Output:

page_content='Fruit: Apple
"Color": "Red"
"Average Weight (grams)": 182' metadata={'source': '/tmp/tmp6ymk63u8', 'row': 0}

To extract and format the entire output from LangChain’s CSVLoader result, you can iterate through the data object and access both the page_content and metadata fields of each record
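
A minimal sketch of such a loop is shown here. The helper name format_records is an assumption for illustration, and the SimpleNamespace object is only a stand-in so the snippet runs without a CSV file; it mimics the two attributes, page_content and metadata, that each loaded record exposes.

```python
from types import SimpleNamespace

def format_records(records):
    """Build a readable text block for each record's content and metadata."""
    blocks = []
    for record in records:
        row = record.metadata.get("row")
        blocks.append(
            f"Row {row}:\n{record.page_content}\nMetadata: {record.metadata}\n"
        )
    return "\n".join(blocks)

# Stand-in record (hypothetical values) mimicking what loader.load() returns
sample = [
    SimpleNamespace(
        page_content='Fruit: Apple\n"Color": "Red"',
        metadata={"source": "/tmp/example.csv", "row": 0},
    )
]

print(format_records(sample))
```

With the data list loaded earlier, the call would simply be print(format_records(data)).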

Output:

Row 0:
Fruit: Apple
"Color": "Red"
"Average Weight (grams)": 182
Metadata: {'source': '/tmp/tmp6ymk63u8', 'row': 0}

Row 1:
Fruit: Banana
"Color": "Yellow"
"Average Weight (grams)": 118
Metadata: {'source': '/tmp/tmp6ymk63u8', 'row': 1}

Row 2:
Fruit: Cherry
"Color": "Red"
"Average Weight (grams)": 8
Metadata: {'source': '/tmp/tmp6ymk63u8', 'row': 2}

Row 3:
Fruit: Grapefruit
"Color": "Pink"
"Average Weight (grams)": 230
Metadata: {'source': '/tmp/tmp6ymk63u8', 'row': 3}
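
The quotation marks around Color and Average Weight (grams) in the output above come from the spaces after the commas in the string: a field that starts with a space is not treated as a quoted field, so its quotes are kept as literal characters. Python’s csv module has a skipinitialspace option that ignores whitespace right after the delimiter. The sketch below shows the difference with csv.DictReader directly, on a shortened copy of the string. Since csv_args is forwarded to csv.DictReader, adding "skipinitialspace": True there is expected to have the same effect on the loader, though that is an assumption rather than something shown in this blog.

```python
import csv
import io

string_data = '''"Fruit", "Color", "Average Weight (grams)"
"Apple", "Red", 182'''

# Default parsing: the space after each comma keeps the quotes as literal text
default_rows = list(csv.DictReader(io.StringIO(string_data)))
print(list(default_rows[0].keys()))

# skipinitialspace=True drops the space, so the quotes are parsed normally
clean_rows = list(csv.DictReader(io.StringIO(string_data), skipinitialspace=True))
print(clean_rows[0])
```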

Conclusion:

Congratulations! In this blog, you learned the capabilities of the LangChain CSV loader. This blog is just one part of our LangChain series, which you can explore further. For suggestions, corrections, or anything else, feel free to reach out.
