LangChain web scraping and web crawling are among the most enjoyable things you can do with LangChain. There are several ways to scrape and crawl the web, but today we are going to explore the shortest and easiest one, which is still effective and will help you gather data with LangChain in no time.
Without delay, let's get started!
FireCrawl: Tool of the day
For LangChain web scraping, we are going to use FireCrawl and load it through a LangChain document loader. It is really simple to use: you just need its API key, choose the website to be scraped, and set the mode to scrape. With only one more line, the data is scraped and ready to be used.
Installing Every Required Package and Importing Module
%pip install langchain
%pip install langchain-community
%pip install langchain-openai
%pip install firecrawl-py
from langchain_community.document_loaders import FireCrawlLoader
Step 1: Getting FireCrawl API
To get the API key, you just need to sign up for FireCrawl, go to the playground tab, and copy the API key, as shown in the picture below.
Step 2: Road to LangChain Web Scraping
Use the FireCrawl API key to access the webpage to be scraped.
# Replace "Your API" with your own FireCrawl API key
loader = FireCrawlLoader(
    api_key="Your API",
    url="https://machinelearningspot.com/langchain-youtube-summarizer/",
    mode="scrape",
)
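If you would rather not paste the key into the notebook, one option is to read it from an environment variable and pass it to the loader. This is only a minimal sketch: the helper name get_firecrawl_key and the variable name FIRECRAWL_API_KEY are assumptions made for this example, not something the tutorial requires.

```python
import os


def get_firecrawl_key(env_var: str = "FIRECRAWL_API_KEY") -> str:
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage with the loader from this tutorial (uncomment in your notebook):
# loader = FireCrawlLoader(
#     api_key=get_firecrawl_key(),
#     url="https://machinelearningspot.com/langchain-youtube-summarizer/",
#     mode="scrape",
# )
```

This way the key stays out of the notebook if you share or publish it.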
Step 3: Loading Information as Documents
scrape_docs = loader.load()
So by following just 3 steps, I got my data using FireCrawl.
Let’s see what’s inside scrape_docs by printing it. You can see my output by opening this file or by reading the excerpt below. It contains the metadata plus the data for the entire page.
Sneak Peek Into Output:
[Document(page_content='\n\n[![ML Spot Logo](https://machinelearningspot.com/wp-content/uploads/2024/02/ML-Spot-Logo.png)](https://machinelearningspot.com)\n\n[Machine Learning Spot](https://machinelearningspot.com)\n\n---------------------------------------------------------\n\n* [Home](https://machinelearningspot.com/)\n \n* [About Us](https://machinelearningspot.com/about-machine-learning-spot/)\n \n* [AI Tools](https://machinelearningspot.com/ai-tools/)\n \n* [AI News & Insight](https://machinelearningspot.com/ai-news-insight/)\n \n* [ML Guide](https://machinelearningspot.com/machine-learning-guide/)\n \n* [Contact](https://machinelearningspot.com/contact-us/)\n \n\nMenu\n\n* [Home](https://machinelearningspot.com/)\n \n* [About Us](https://machinelearningspot.com/about-machine-learning-spot/)\n \n* [AI Tools](https://machinelearningspot.com/ai-tools/)\n \n* [AI News & Insight](https://machinelearningspot.com/ai-news-insight/)\n \n* [ML Guide](https://machinelearningspot.com/machine-learning-guide/)\n \n* [Contact](https://machinelearningspot.com/contact-us/)\n \n\nHow to Build LangChain YouTube Summarizer (In 4-Steps & Easy Way)\n=================================================================\n\n* [Muhammad Talal Khan Afridi](https://machinelearningspot.com/author/talal/)\n \n* [April 5, 2024](https://machinelearningspot.com/2024/04/05/)\n \n\n![Thumnail of LangChain YouTube Summarizer](https://machinelearningspot.com/wp-content/uploads/2024/04/summarizer.png)\n\nLangChain YouTube Summarizer: In simple and easy steps, you are going to make your own YouTube video summarizer using LangChain. 
You just need to keep following the instructions that I am going to share here.\n\nWhat’s special about this tutorial is that it works for any duration of video, whether it be 1 hour, 2 hours, or 3 hours**.** It **works for any duration,** and it **generates an AI summary** according to your own given requirement.\n\nPrerequisites of Tutorial\n-------------------------\n\nThis tutorial on Langchain YouTube summarizer **requires no prerequisites at all,** yes! I am going to simplify it so much that even if you don’t have any idea of programming, by following the given instructions, you can make it yourself.......
You can also read the original article on how to make a YouTube video summarizer and compare it with the scraped information.
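Printing the whole list is hard to read, so here is a small sketch that summarizes each document instead. It relies only on the page_content and metadata attributes visible in the output above. The helper name summarize_docs is made up for this example, and sample_docs is a stand-in object so the snippet runs without an API key; in your notebook you would pass scrape_docs.

```python
from types import SimpleNamespace


def summarize_docs(docs, preview_chars=80):
    """Return one small summary dict per document (hypothetical helper)."""
    summaries = []
    for index, doc in enumerate(docs):
        # Collapse newlines and repeated spaces so the preview fits on one line
        one_line = " ".join(doc.page_content.split())
        summaries.append({
            "index": index,
            "characters": len(doc.page_content),
            "metadata_keys": sorted(doc.metadata),
            "preview": one_line[:preview_chars],
        })
    return summaries


# Stand-in for scrape_docs, for illustration only (not real FireCrawl output)
sample_docs = [
    SimpleNamespace(
        page_content="\n\nHow to Build LangChain YouTube Summarizer\n====",
        metadata={"title": "demo"},
    )
]

for row in summarize_docs(sample_docs):
    print(row)
```

With real data, summarize_docs(scrape_docs) gives you a quick look at how much text was scraped and which metadata fields came back.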
LangChain Web Crawling
Now it’s time to learn how to crawl an entire website and get every page as markdown, so that nothing is left out.
It is similar to LangChain web scraping; you just need to change the mode.
# Initializing the document loader (replace "Your API" with your own API key)
loader = FireCrawlLoader(
    api_key="Your API",
    url="https://machinelearningspot.com",
    mode="crawl",
)
#Loading Document then Printing to see contents
crawl_docs = loader.load()
print(crawl_docs)
The above code will crawl the entire website. Since our website is full of the latest articles on AI tools and LangChain tutorials, it would be difficult to share the entire output here, but you can run the code and see it yourself. A trimmed output is shown below to give you an idea.
[Document(page_content='\n\n[![ML Spot Logo](https://machinelearningspot.com/wp-content/uploads/2024/02/ML-Spot-Logo.png)](https://machinelearningspot.com)\n\n[Machine Learning Spot](https://machinelearningspot.com)\n\n---------------------------------------------------------\n\n* [Home](https://machinelearningspot.com/)\n \n* [About Us](https://machinelearningspot.com/about-machine-learning-spot/)\n \n* [AI Tools](https://machinelearningspot.com/ai-tools/)\n \n* [AI News & Insight](https://machinelearningspot.com/ai-news-insight/)\n \n* [ML Guide](https://machinelearningspot.com/machine-learning-guide/)\n \n* [Contact](https://machinelearningspot.com/contact-us/)\n \n\nMenu\n\n* [Home](https://machinelearningspot.com/)\n \n* [About Us](https://machinelearningspot.com/about-machine-learning-spot/)\n \n* [AI Tools](https://machinelearningspot.com/ai-tools/)\n \n* [AI News & Insight](https://machinelearningspot.com/ai-news-insight/)\n \n* [ML Guide](https://machinelearningspot.com/machine-learning-guide/)\n \n* [Contact](https://machinelearningspot.com/contact-us/)\n \n\nThe Best 6 AI Project Management Tools for Smooth Workflow\n==========================================================\n\n* [Saman Shoaib](https://machinelearningspot.com/author/samanshoaib/)\n \n* [April 29, 2024](https://machinelearningspot.com/2024/04/29/)\n ..................
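A crawl returns many documents, so you may want to save them once and reuse them instead of crawling again. The following is a rough sketch using only the Python standard library. The helper names save_docs_jsonl and load_docs_jsonl and the file name crawl_docs.jsonl are assumptions made for this example, and the loader gives back plain dictionaries rather than LangChain Document objects.

```python
import json


def save_docs_jsonl(docs, path):
    """Write one JSON object per document to a .jsonl file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for doc in docs:
            record = {"page_content": doc.page_content, "metadata": doc.metadata}
            # default=str keeps the save from failing on unusual metadata values
            handle.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")


def load_docs_jsonl(path):
    """Read the file back as a list of plain dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Usage in your notebook after the crawl has finished:
# save_docs_jsonl(crawl_docs, "crawl_docs.jsonl")
# pages = load_docs_jsonl("crawl_docs.jsonl")
# print(len(pages), "pages saved")
```

One line per page keeps the file easy to inspect and to load again later.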
Conclusion:
Hurray! I hope you have learned a lot from this article on LangChain web scraping. As a next step, you can connect the scraped data to a RAG pipeline and use an LLM to talk with your chosen website. If you want to learn about RAG, then read this article on LangChain RAG.
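As a small preview of that next step, a RAG pipeline usually starts by splitting long pages into smaller overlapping chunks. LangChain ships its own text splitters for this; the plain-Python function below is only a stand-in to show the idea, and the name chunk_text and the default sizes are assumptions made for this example.

```python
def chunk_text(text, size=1000, overlap=100):
    """Split text into chunks of up to `size` characters that share `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once the current chunk has reached the end of the text
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks


# Usage in your notebook with the scraped page:
# chunks = chunk_text(scrape_docs[0].page_content)
# print(len(chunks), "chunks")
```

The overlap keeps a sentence that falls on a chunk boundary visible in both neighbouring chunks.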