LangChain’s CSV loader makes it easier to work with CSV files. When you handle CSV files in Python, especially in data processing and machine learning pipelines, you need a reliable way to load and parse the data. LangChain’s CSVLoader does this by turning each row of a CSV file into a document. CSV (comma-separated values) files are a common format worth mastering, so this blog walks through the main features of the LangChain CSV loader.
Let’s see how to load a CSV first!
Installing The Requirements
We’ll begin with the installations required to follow along with the blog and use the LangChain CSV loader.
%pip install langchain
%pip install langchain-community
%pip install tiktoken
Load Your CSV File
To load your CSV, you need to have a CSV first. The CSV we are using contains 100 customer records, downloaded from here.
We need to import the CSV loader and pass it the path of our file when initialising it. After initialisation, calling the .load() method reads the file and returns its rows as a list of documents.
from langchain_community.document_loaders.csv_loader import CSVLoader
file_path = "/content/customers-100.csv"
loader = CSVLoader(file_path=file_path)  # Initialise the CSV loader
data = loader.load()
Our CSV has 100 rows, so we will print only the first 5 to see how things are going.
for i, row in enumerate(data):
    if i >= 5:  # Print only the first 5 rows
        break
    print(row)
Let’s observe the magic of the LangChain CSV loader by looking at the output.
Output:
page_content='Index: 1
Customer Id: DD37Cf93aecA6Dc
First Name: Sheryl
Last Name: Baxter
Company: Rasmussen Group
City: East Leonard
Country: Chile
Phone 1: 229.077.5154
Phone 2: 397.884.0519x718
Email: [email protected]
Subscription Date: 2020-08-24
Website: http://www.stephenson.com/' metadata={'source': '/content/customers-100.csv', 'row': 0}
page_content='Index: 2
Customer Id: 1Ef7b82A4CAAD10
First Name: Preston
Last Name: Lozano
Company: Vega-Gentry
City: East Jimmychester
Country: Djibouti
Phone 1: 5153435776
Phone 2: 686-620-1820x944
Email: [email protected]
Subscription Date: 2021-04-23
Website: http://www.hobbs.com/' metadata={'source': '/content/customers-100.csv', 'row': 1}
page_content='Index: 3
Customer Id: 6F94879bDAfE5a6
First Name: Roy
Last Name: Berry
Company: Murillo-Perry
City: Isabelborough
Country: Antigua and Barbuda
Phone 1: +1-539-402-0259
Phone 2: (496)978-3969x58947
Email: [email protected]
Subscription Date: 2020-03-25
Website: http://www.lawrence.com/' metadata={'source': '/content/customers-100.csv', 'row': 2}
page_content='Index: 4
Customer Id: 5Cef8BFA16c5e3c
First Name: Linda
Last Name: Olsen
Company: Dominguez, Mcmillan and Donovan
City: Bensonview
Country: Dominican Republic
Phone 1: 001-808-617-6467x12895
Phone 2: +1-813-324-8756
Email: [email protected]
Subscription Date: 2020-06-02
Website: http://www.good-lyons.com/' metadata={'source': '/content/customers-100.csv', 'row': 3}
page_content='Index: 5
Customer Id: 053d585Ab6b3159
First Name: Joanna
Last Name: Bender
Company: Martin, Lang and Andrade
City: West Priscilla
Country: Slovakia (Slovak Republic)
Phone 1: 001-234-203-0635x76146
Phone 2: 001-199-446-3860x3486
Email: [email protected]
Subscription Date: 2021-04-17
Website: https://goodwin-ingram.com/' metadata={'source': '/content/customers-100.csv', 'row': 4}
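To make the shape of this output concrete, here is a rough sketch of how one CSV row maps to page content and metadata. It uses only Python's standard library and plain dictionaries, so it is an approximation of the behaviour seen above, not LangChain's actual code. The function name rows_to_records and the file name sample.csv are hypothetical.

```python
import csv
import io

# A tiny sample standing in for a CSV file on disk.
sample_csv = "Index,First Name,Country\n1,Sheryl,Chile\n2,Preston,Djibouti\n"


def rows_to_records(text, source):
    """Approximate the loader's output: one record per CSV row."""
    records = []
    for row_number, row in enumerate(csv.DictReader(io.StringIO(text))):
        # Each column becomes a "name: value" line in the page content.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # The metadata records where the row came from and its zero-based position.
        metadata = {"source": source, "row": row_number}
        records.append({"page_content": page_content, "metadata": metadata})
    return records


records = rows_to_records(sample_csv, source="sample.csv")
print(records[0]["page_content"])
print(records[0]["metadata"])
```

Note that the row number in the metadata starts at 0 and counts data rows only, which is why the record with Index 1 has 'row': 0.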
Customising with LangChain CSV loader
We can also customise how the CSV is parsed through the csv_args argument of the LangChain CSV loader. In csv_args, we can specify the delimiter, which is the character that separates the values in the data.
The quotechar is the character used to quote fields that contain special characters, such as the delimiter. The default is a double quote (").
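The delimiter and quotechar options have the same names as the ones in Python's built-in csv module, so their effect can be previewed with the standard library alone. The sketch below assumes the loader hands csv_args to csv.DictReader unchanged. The sample data is made up for illustration.

```python
import csv
import io

# Semicolon-delimited data where single quotes protect a field containing ';'.
raw = "name;note\n'Vega; Gentry';'ok'\n"

reader = csv.DictReader(io.StringIO(raw), delimiter=";", quotechar="'")
rows = list(reader)

# The quoted field keeps its semicolon instead of being split in two.
print(rows[0])
```

Without quotechar="'", the first field would be split at the semicolon inside the company name.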
Fieldnames
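The fieldnames argument lets you provide custom names for the columns in the CSV file. This is useful if the CSV file does not include a header row or if you want to refer to the columns by different names in your program. It does not change the original CSV file. Keep in mind that Python’s csv.DictReader treats the first line of the file as data when fieldnames is supplied, so an existing header row may show up as the first record.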
loader = CSVLoader(
    file_path=file_path,
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": ["number", "Customers", "Customer Names"],
    },
)
As you will see in the output of the next block of code, the Index column of the CSV file is now named “number”, Customer Id is named “Customers”, and First Name is named “Customer Names”.
data = loader.load()
for record in data[:2]:
    print(record)
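Two side effects of fieldnames are easy to miss, and they can be seen with Python's csv.DictReader directly. This is a standard-library sketch under the assumption that the loader parses the file with csv.DictReader. The exact text the loader prints for these rows may differ between LangChain versions.

```python
import csv
import io

raw = "Index,Customer Id,First Name,Last Name\n1,DD37Cf93aecA6Dc,Sheryl,Baxter\n"

reader = csv.DictReader(
    io.StringIO(raw),
    fieldnames=["number", "Customers", "Customer Names"],
)
rows = list(reader)

# With explicit fieldnames, the header line is read as an ordinary data row.
print(rows[0])
# Columns beyond the supplied names are gathered in a list under the key None.
print(rows[1][None])
```

If the file has a header row and you only want different names, supplying one name per column and skipping the first record afterwards is one way to keep the data clean.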
Changing Your CSV’s Source Identity
In the LangChain CSV loader, the ‘source’ key in each document’s metadata identifies the origin of that document. This is particularly useful when you need to track where each piece of data came from, especially in larger datasets or when you integrate with systems that require source attribution.
When we specify a source_column while loading the CSV file, the value in that column for each row is assigned to the source key in the metadata of the corresponding document. You can then easily trace the data to its original row in the CSV file.
Now it’s our turn to change the ‘source’ key using the code below.
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path=file_path, source_column="First Name")
data = loader.load()
Let’s print the first five rows again, just like before. This way, you can compare it with the previous output.
for i, row in enumerate(data):
    if i >= 5:  # Print only the first 5 rows
        break
    print(row)
In the output below, you can see that the source key in the metadata now holds the value of the column we specified. Previously, the source key was the path of the file, i.e. ‘/content/customers-100.csv’ in my case.
Output:
page_content='Index: 1
Customer Id: DD37Cf93aecA6Dc
First Name: Sheryl
Last Name: Baxter
Company: Rasmussen Group
City: East Leonard
Country: Chile
Phone 1: 229.077.5154
Phone 2: 397.884.0519x718
Email: [email protected]
Subscription Date: 2020-08-24
Website: http://www.stephenson.com/' metadata={'source': 'Sheryl', 'row': 0}
page_content='Index: 2
Customer Id: 1Ef7b82A4CAAD10
First Name: Preston
Last Name: Lozano
Company: Vega-Gentry
City: East Jimmychester
Country: Djibouti
Phone 1: 5153435776
Phone 2: 686-620-1820x944
Email: [email protected]
Subscription Date: 2021-04-23
Website: http://www.hobbs.com/' metadata={'source': 'Preston', 'row': 1}
page_content='Index: 3
Customer Id: 6F94879bDAfE5a6
First Name: Roy
Last Name: Berry
Company: Murillo-Perry
City: Isabelborough
Country: Antigua and Barbuda
Phone 1: +1-539-402-0259
Phone 2: (496)978-3969x58947
Email: [email protected]
Subscription Date: 2020-03-25
Website: http://www.lawrence.com/' metadata={'source': 'Roy', 'row': 2}
page_content='Index: 4
Customer Id: 5Cef8BFA16c5e3c
First Name: Linda
Last Name: Olsen
Company: Dominguez, Mcmillan and Donovan
City: Bensonview
Country: Dominican Republic
Phone 1: 001-808-617-6467x12895
Phone 2: +1-813-324-8756
Email: [email protected]
Subscription Date: 2020-06-02
Website: http://www.good-lyons.com/' metadata={'source': 'Linda', 'row': 3}
page_content='Index: 5
Customer Id: 053d585Ab6b3159
First Name: Joanna
Last Name: Bender
Company: Martin, Lang and Andrade
City: West Priscilla
Country: Slovakia (Slovak Republic)
Phone 1: 001-234-203-0635x76146
Phone 2: 001-199-446-3860x3486
Email: [email protected]
Subscription Date: 2021-04-17
Website: https://goodwin-ingram.com/' metadata={'source': 'Joanna', 'row': 4}
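A source value is most useful for tracing when it is unique per row. A column such as First Name can repeat across customers, while a column such as Customer Id is a safer choice. The sketch below is one way to check a column before using it as source_column. The helper duplicated_values and the third data row are hypothetical and only there for illustration.

```python
import csv
import io
from collections import Counter

raw = (
    "Customer Id,First Name\n"
    "DD37Cf93aecA6Dc,Sheryl\n"
    "1Ef7b82A4CAAD10,Preston\n"
    "0000000000000AA,Sheryl\n"  # made-up row that repeats a first name
)


def duplicated_values(text, column):
    """Return the values that appear more than once in the given column."""
    counts = Counter(row[column] for row in csv.DictReader(io.StringIO(text)))
    return sorted(value for value, n in counts.items() if n > 1)


print(duplicated_values(raw, "First Name"))
print(duplicated_values(raw, "Customer Id"))
```

An empty list means every row would get its own distinct source value.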
Loading CSV Data From String
The LangChain CSV loader reads from a file path, but we can still load CSV data held in a string by first writing it to a temporary file. Python’s tempfile module is used here. This is useful when we have CSV data in string format.
import tempfile  # Used to create temporary files
from langchain_community.document_loaders.csv_loader import CSVLoader  # Used to load the CSV data
Now, we will define a string that mimics the content of a CSV file. The .strip() method removes any leading or trailing whitespace.
string_data = """
"Fruit", "Color", "Average Weight (grams)"
"Apple", "Red", 182
"Banana", "Yellow", 118
"Cherry", "Red", 8
"Grapefruit", "Pink", 230
""".strip()
Next, we use tempfile.NamedTemporaryFile to create a temporary file and write the CSV string data into it. The call NamedTemporaryFile(delete=False, mode="w+") creates a temporary file opened for reading and writing that is not deleted automatically when it is closed.
# Create a temporary file and write the CSV string data into it
with tempfile.NamedTemporaryFile(delete=False, mode="w+") as temp_file:
    temp_file.write(string_data)
    temp_file_path = temp_file.name  # Retrieve the file path
Here, temp_file.write(string_data) writes the CSV string data to the temporary file, and temp_file.name retrieves the path of the temporary file.
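Because delete=False is used, nothing removes the temporary file for you once you are done with it. The sketch below shows one way to clean up, using only the standard library. The suffix, newline, and encoding arguments are optional extras that were not in the original call, and the short csv_text is a stand-in for string_data.

```python
import os
import tempfile

csv_text = '"Fruit","Color"\n"Apple","Red"\n'

# delete=False keeps the file on disk after the with-block closes it.
with tempfile.NamedTemporaryFile(
    delete=False, mode="w+", suffix=".csv", newline="", encoding="utf-8"
) as temp_file:
    temp_file.write(csv_text)
    temp_path = temp_file.name

try:
    # The path stays valid here, so any loader or reader can open it.
    with open(temp_path, newline="", encoding="utf-8") as handle:
        contents = handle.read()
finally:
    # Remove the file explicitly, since it is not deleted automatically.
    os.remove(temp_path)

still_exists = os.path.exists(temp_path)
print(contents == csv_text, still_exists)
```

In the blog's flow, the loader.load() call would take the place of the open() call inside the try block.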
Finally, we load the data using the LangChain CSV loader, since the string is now stored as a CSV file. You can apply any of the options covered in this blog to this file, too, but first, let’s load it.
loader = CSVLoader(file_path=temp_file_path)
data = loader.load()
Now let’s print the first record.
for record in data[:1]:
    print(record)
Output:
page_content='Fruit: Apple
"Color": "Red"
"Average Weight (grams)": 182' metadata={'source': '/tmp/tmp6ymk63u8', 'row': 0}
To extract and format the entire output from LangChain’s CSVLoader result, you can iterate through the data object and access both the page_content and metadata fields of each record.
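A minimal sketch of such a loop is shown here. So that it runs on its own, it uses a stand-in object with the same two attributes instead of a real loader result. In practice, data would be the list returned by loader.load(). The helper name format_records and the path /tmp/example.csv are hypothetical.

```python
from types import SimpleNamespace

# Stand-in for the list returned by loader.load(); real records are
# LangChain Document objects exposing the same two attributes.
data = [
    SimpleNamespace(
        page_content="Fruit: Apple",
        metadata={"source": "/tmp/example.csv", "row": 0},
    ),
]


def format_records(records):
    """Build one text block per record: row label, content, then metadata."""
    blocks = []
    for index, record in enumerate(records):
        blocks.append(
            f"Row {index}:\n{record.page_content}\nMetadata: {record.metadata}"
        )
    return "\n".join(blocks)


output = format_records(data)
print(output)
```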
Output:
Row 0:
Fruit: Apple
"Color": "Red"
"Average Weight (grams)": 182
Metadata: {'source': '/tmp/tmp6ymk63u8', 'row': 0}
Row 1:
Fruit: Banana
"Color": "Yellow"
"Average Weight (grams)": 118
Metadata: {'source': '/tmp/tmp6ymk63u8', 'row': 1}
Row 2:
Fruit: Cherry
"Color": "Red"
"Average Weight (grams)": 8
Metadata: {'source': '/tmp/tmp6ymk63u8', 'row': 2}
Row 3:
Fruit: Grapefruit
"Color": "Pink"
"Average Weight (grams)": 230
Metadata: {'source': '/tmp/tmp6ymk63u8', 'row': 3}
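The quote characters around Color and Average Weight in the output above come from the spaces after each comma in string_data. By default, Python's csv module only treats a quote as special when it is the first character of a field, so a leading space turns the quotes into literal text. The standard-library sketch below shows the difference that skipinitialspace=True makes. Passing csv_args={"skipinitialspace": True} to the loader should have the same effect, though that is an assumption based on the loader forwarding csv_args to csv.DictReader.

```python
import csv
import io

raw = '"Fruit", "Color", "Average Weight (grams)"\n"Apple", "Red", 182\n'

# Default parsing: the space after each comma becomes part of the field,
# so the quote characters are kept as literal text.
default_row = next(csv.DictReader(io.StringIO(raw)))

# skipinitialspace=True drops the space after the delimiter, so quoting works.
clean_row = next(csv.DictReader(io.StringIO(raw), skipinitialspace=True))

print(default_row)
print(clean_row)
```

Removing the spaces after the commas in the string itself would fix the output in the same way.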
Conclusion:
Congratulations! In this blog, you learned what the LangChain CSV loader can do. This blog is part of our LangChain series, which you can explore here. For suggestions, corrections, or anything else, feel free to reach out.