Machine Learning Spot

LangChain Text Splitters: How to Split Text (3 Examples)


LangChain Text Splitters help us split text, especially when we need it to fit into an LLM's context window, for example in RAG and many other applications.

In this blog, we will discuss three widely used text splitters; in my next blog, I will discuss the rest.
In this blog, you will learn about:

  1. Character Text Splitter
  2. Recursive Character Text Splitter
  3. Token Splitter

Installing Required Packages

This blog doesn’t require many installations; only the LangChain package, LangChain’s text-splitters package, and tiktoken are needed.

%pip install langchain
%pip install tiktoken
%pip install -qU langchain-text-splitters

Let’s jump into the realm of text splitters with character text splitters.

Why Use Character Text Splitter In LangChain Text Splitters

This one among the LangChain text splitters is used when we want to split our text based on a character (or a fixed string of characters).

The separator can be any string of characters: ‘\n\n’, or even ‘\n\n\n\n’ (four newline characters), a simple space ‘ ‘, or any other character.

When we use a character text splitter, it chunks our text by dividing it at every occurrence of that particular character.

The first and top priority of a character text splitter is to divide the text at the separator we provide.

Other arguments of the character text splitter, like chunk overlap and chunk size, are secondary to the separator, so there is a high chance they won’t take effect. These arguments work excellently in the Recursive Character Text Splitter, discussed later.

Let’s get some text to split and try our first splitter among the LangChain text splitters. The text is copied from the Encyclopedia Britannica; I will trim it here to keep this blog easy on the eyes.

Steps to Use a Text Splitter

Generally, we can divide text splitting into these 3 steps:

  1. Loading Document
  2. Initialization of Text Splitter
  3. Splitting Document

Step 1: Loading the Document – Sample Text for All Three LangChain Text Splitters

The sample text I use here will be the same throughout the blog for all three LangChain text splitters. Replace it with your own and jump to your desired text splitter.

text = """Who Invented the Internet?
Written by
Fact-checked by

\n\n\n\n

Internet http://www blue screen. Hompepage blog 2009, history and society, media news television, .........

Did You Know? The History of the Internet. Discover how the Internet grew to connect billions of people worldwide.
How was the Internet invented?
Learn more about the history and development of the Internet.
.........
............
9 American Political Scandals
Shadow of a man holding large knife in his hand inside of some dark, spooky buiding
7 of History's Most Notorious Serial Killers
Home \n\n\n\n
Demystified
Science
Science & Tech
What's the Difference Between a Solstice and an Equinox?
Written by
Fact-checked by
Demystified: ...........

...... happen in March (about March 21) and September (about September 23). These are the days when the Sun is exactly above the Equator, which makes day and night of equal length."""

Alternatively, you can load the text from a file:

with open("…/Internet.txt") as f:
    text = f.read()

Replace …/Internet.txt with the path to the file containing the text you desire to split.

Step 2: Initialization of Text Splitter and Super Power of Character Text Splitter (Parameters)

After loading your text, the next step is the most crucial: initializing the character text splitter. It accepts several arguments, but only two of them are most important here.

  1. separator
  2. is_separator_regex

Let’s initialize it first and then see the details, but before initializing, don’t forget to import the module.

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n\n\n",
    chunk_size=150,
    chunk_overlap=0,
    is_separator_regex=True,
)

The separator decides which characters are used to separate the text. As discussed earlier, this can be almost anything: a single newline character, four newline characters, a letter of the alphabet, or even a space " ".

The parameter is_separator_regex tells the splitter whether the separator is a regex (short for regular expression), the pattern language used for complex searches, substitutions, and manipulations of strings. Regex has special syntax for this; for example, a run of four or more newline characters can be matched with the pattern \n{4,}, where {4,} is a quantifier meaning "four or more" of the preceding \n. In our case, the separator is the literal string \n\n\n\n.

Since our separator contains no special regex characters, it behaves the same whether is_separator_regex is set to True or False: the splitter will simply find the characters we have specified and split our text there.
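The regex behavior can be illustrated with Python's built-in `re` module, independent of LangChain (the sample strings here are made up for illustration):

```python
import re

# Three parts separated by runs of four-or-more newline characters.
text = "Part one.\n\n\n\nPart two.\n\n\n\n\n\nPart three."

# The quantifier {4,} means "four or more" of the preceding token (\n),
# so it matches both the 4-newline and the 6-newline run.
parts = re.split(r"\n{4,}", text)
print(parts)  # ['Part one.', 'Part two.', 'Part three.']
```

This is the same splitting primitive the character splitter uses internally when `is_separator_regex=True`.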

The other parameters, chunk_size and chunk_overlap, are important but not very effective in the character text splitter. So let’s finish the character text splitter code and then move on to the recursive character text splitter, where these parameters are the main heroes.

Step 3: Splitting Document Based on Character

Now, use the create_documents method of text_splitter to split the text on our character.

splitted_texts = text_splitter.create_documents([text])
print(splitted_texts[0])
 

We print only the first chunk, since the four newline characters first appear right at the beginning, just after “Fact-checked by”. Let’s observe our output.

page_content='Who Invented the Internet?\nWritten by\nFact-checked by'

There is a high chance you will see a warning if you have set a chunk size, but that’s okay: a character splitter makes chunks based on the separator first. Since the text contains \n\n\n\n twice, it should break into 3 chunks. Let’s check with the len function.

len(splitted_texts)

Output

3

Hurray! It worked. Now, let’s see what the Recursive Character Text Splitter does.

Recursive Character Text Splitter: Most Useful For Document Splitting (RAG)

I have used this one in many of my previous blogs, like the YouTube Summarizer, LangChain RAG, and many others. The specialty of this text splitter is that it can retain the context of the chunks; moreover, parameters like chunk_size and chunk_overlap are the main parameters here.

Now let’s import the RecursiveCharacterTextSplitter and directly jump to step 2 of initialization.

from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=50,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

No matter how large the text you ask it to divide, it will do so with ease. The chunk_size indicates the maximum size you want each chunk to be. For example, with a chunk size of 50, each chunk will contain at most 50 characters.

The chunk_overlap helps retain context by sharing some characters between consecutive chunks: it specifies how many characters from the end of one chunk should be repeated at the start of the next chunk.

Now let’s create the split document and compare some chunks by printing.

splitted_texts = text_splitter.create_documents([text])

# Printing the first ten chunks for comparison
for doc in splitted_texts[:10]:
    print(doc)

Output:


page_content='Who Invented the Internet?\nWritten by'
page_content='Written by\nFact-checked by'
page_content='Hompepage blog 2009, history and society, media'
page_content='and society, media news television, crowd opinion'
page_content='crowd opinion protest, In the News 2009, breaking'
page_content='News 2009, breaking news'
page_content='© Khlobystov Alexey/Shutterstock.com'
page_content='What most of us think of as the Internet is'
page_content='as the Internet is really just the pretty face of'
page_content='the pretty face of the operation—browser windows,'

As you can see, not every chunk has exactly 50 characters, but every chunk stays under the 50-character limit. Newlines and other special characters count toward that limit, and the splitter prefers to cut at separators (paragraph breaks, newlines, spaces) rather than in the middle of a word, so it adjusts each chunk slightly. Chunks may also be shorter when the text does not divide neatly into exact 50-character pieces.

Token Splitter

Token splitters might be more effective than other splitters. If you are not sure how tokenization works: short words are usually kept as single tokens, while longer words are broken into multiple tokens based on the tokenizer's rules and algorithms. We are going to see the easiest version of the token splitter.

Long words are broken into smaller tokens, but the method of tokenization varies for each tokenizer; GPT-4, GPT-4o, and legacy GPT models, for example, each use different tokenizers.

The text splitter below uses tiktoken's byte-pair-encoding tokenizer, so it divides the text at token boundaries rather than at word boundaries. The text is split according to your specified chunk size, measured in tokens, and as the output will show, a chunk boundary can even fall in the middle of a word.

Simplest Token Splitter – Splitting by Token Count

The steps here are the same as for the text splitters discussed before: import, initialize, and create. We are using the same text as above, and we already know what chunk_size and chunk_overlap do.

Initializing:

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

Split Document Creation

splitted_texts = text_splitter.create_documents([text])

After running the splitted_texts variable in a Google Colab cell, I obtained the following output. You can also view it using the print function. Note that I have trimmed the output for better readability in the blog.

Output:

Document(page_content='Who Invented the Internet?\nWritten by'),
 Document(page_content='\nFact-checked by\n\n\n\n\n\n\n\n'),
 Document(page_content='Internet http://www blue screen. Hompepage'),
 Document(page_content=' blog 2009, history and society, media news television'),
 Document(page_content=', crowd opinion protest, In the News 2009,'),
 Document(page_content=' breaking news\n© Khlobystov Alex'),
 Document(page_content='ey/Shutterstock.com\nWhat most of'),
 Document(page_content=' us think of as the Internet is really just the'),
 Document(page_content=' pretty face of the operation—browser windows, websites'),
 Document(page_content=', URLs, and search bars. But the real'),
 Document(page_content=' Internet, the brain behind the information superhighway'),
 Document(page_content=', is an intricate set of protocols and rules that'),
 Document(page_content=' someone had to develop before we could get to the'),
 Document(page_content=' World Wide Web. Computer scientists Vinton Cerf'),
 Document(page_content=' and Bob Kahn are credited with inventing the Internet')

Conclusion:

Congratulations! You have learned so much about LangChain Text Splitters, which makes me confident that you can implement these in various projects now.

If you have any feedback or suggestions, feel free to reach out.
Also, check my other blogs on LangChain to make yourself a pro!
