Over the past years I’ve been looking for something like this…
…a way to keep up with open source projects, both existing and newly created.
It’s time to gain some leverage with AI.
Scrapegraph-ai is an open-source Python library that rethinks web scraping by combining Large Language Models (LLMs) with directed graph logic to automate the creation of scraping pipelines.
It builds pipelines for websites and for local documents (XML, HTML, JSON, Markdown, etc.).
Its design follows the philosophy of “You Only Scrape Once”: extract data from websites and documents efficiently, moving away from repetitive scraping and toward more strategic data gathering.
Developers can leverage this library to construct custom scraping tasks using simple prompts, which are then interpreted by AI to form a tailored scraping workflow.
With Scrapegraph-ai, you can automate data extraction and pipeline construction without repeated manual setup.
The Scrapegraph-ai Project
- The Scrapegraph-ai project is fully open source, and you can explore:
With Scrapegraph-ai there are no repeated scraping tasks: define, deploy, and extract.
Why Scrapegraph-ai?
Scrapegraph-ai provides a cutting-edge platform for anyone looking to leverage AI capabilities in web scraping without the typical complexities associated with traditional methods.
Traditional scraping tools rely on fixed patterns and manual configurations, whereas ScrapeGraphAI adapts to website structure changes using LLMs, reducing the need for constant developer intervention.
- Ease of Use: Initiate complex scraping operations with just a short prompt.
- Efficiency: Reduce the time and effort spent on repetitive data collection tasks.
- AI Integration: Utilize advanced AI techniques to optimize data extraction.
- Customizable Workflows: Tailor scraping pipelines to your specific needs with the aid of LLMs and graph logic.
ScrapeGraphAI supports several LLMs, including GPT, Gemini, Groq, Azure, Hugging Face, and local models that can run on your machine using Ollama.
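The exact configuration keys vary by provider and library version, so verify against the docs for your installed release; as a rough, hypothetical illustration (the field names below follow the pattern used in the Scrapegraph-ai documentation, and the model name is just an example), a local Ollama setup might be declared like this:

```python
# Hypothetical provider config dict in the style Scrapegraph-ai uses.
# Keys and values here are illustrative; check the docs for your version.
graph_config = {
    "llm": {
        "model": "ollama/llama3",          # a model served locally by Ollama
        "temperature": 0,                   # deterministic output for scraping
        "base_url": "http://localhost:11434",  # default local Ollama endpoint
    },
    "verbose": True,
}

print(graph_config["llm"]["model"])
```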
| Name | Description |
| --- | --- |
| Scrapegraph Project Documentation | Documentation for the Scrapegraph project |
| scrapegraphai PyPI Package | Python package for Scrapegraph-AI available on PyPI |
| Scrapegraph Streamlit Demo | Streamlit demo application for Scrapegraph |
| Scrapegraph-AI Streamlit App | Live demo of the Scrapegraph-AI project using Streamlit |
| Scrapegraph Colab Notebook | Interactive Colab notebook to try out Scrapegraph |
Scrapegraph-AI: Python Web Scraping with AI
Concept
📜 Scrapegraph-ai is a Python library that enhances web scraping with the use of Large Language Models (LLMs) and graph logic. It is designed to automate the construction of scraping pipelines for websites, documents, and XML files.
Core Idea
💡 The philosophy of “You Only Scrape Once” underlines the library’s goal to extract information efficiently, aiming to reduce the need for repetitive scraping tasks.
Key Functionality
- 🎯 Users provide prompts detailing the data they wish to extract.
- 🧠 Utilizes LLMs and graph logic to tailor scraping pipelines based on these prompts.
Implementation
- 🛠️ Central to the library is the SmartScraper class, which uses a directed graph approach integrating common nodes found in web scraping pipelines.
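The real SmartScraper graph wires together fetching, parsing, and LLM-answer nodes; as a stdlib-only sketch of that directed-graph idea (none of these classes or node names come from scrapegraphai itself, and the HTML is hardcoded instead of fetched), a pipeline of nodes passing a shared state might look like:

```python
# Illustrative sketch, NOT the real scrapegraphai classes: a scraping
# pipeline modeled as a chain of graph nodes, each transforming a shared
# state dict -- the same structural idea SmartScraper builds on.
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of every <h2> tag."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.titles.append(data.strip())


def fetch_node(state):
    # A real pipeline would fetch over HTTP (e.g. with Playwright).
    state["html"] = "<html><body><h2>Post A</h2><h2>Post B</h2></body></html>"
    return state


def parse_node(state):
    parser = TitleParser()
    parser.feed(state["html"])
    state["titles"] = parser.titles
    return state


def answer_node(state):
    # In scrapegraphai this step would ask an LLM to shape the answer.
    state["answer"] = {"titles": state["titles"]}
    return state


# The "graph": an ordered chain of nodes sharing one state.
pipeline = [fetch_node, parse_node, answer_node]
state = {}
for node in pipeline:
    state = node(state)

print(state["answer"])  # {'titles': ['Post A', 'Post B']}
```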
Further Exploration
- 📖 For more detailed information and usage instructions, consulting the official Scrapegraph-ai documentation is recommended.
In Essence
- 🚀 Scrapegraph-ai appears to be a promising tool for those interested in web scraping with Python, especially with an AI-powered approach to automate and simplify pipeline creation.
How to use ScrapeGraph
Scrapegraph-ai use cases: https://github.com/VinciGit00/Scrapegraph-ai/tree/main/examples
```sh
sudo apt update
sudo apt install python3
sudo apt install python3.10-venv
```
Create the venv for ScrapeGraph:
```sh
python3 -m venv scrapegraphai
source scrapegraphai/bin/activate
pip install scrapegraphai==1.11.3 #https://pypi.org/project/scrapegraphai/
playwright install
```
Now the package is ready:
ScrapeGraph With Ollama
Let’s run it fully locally:
Python Code - ScrapeGraph Ollama ⏬
Dockerfile for ScrapeGraph ⏬
```dockerfile
#https://hub.docker.com/_/python
FROM python:3.11

# LABEL org.opencontainers.image.source https://github.com/JAlcocerT/Streamlit-MultiChat
# LABEL maintainer="Jesus Alcocer Tagua"

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    software-properties-common \
    git \
    nano \
    chromium-driver \
    && rm -rf /var/lib/apt/lists/*

# Install Python packages
RUN pip install scrapegraphai==0.9.0b7 \
    nest_asyncio \
    playwright

# Run Playwright commands to install browser dependencies and browsers
RUN playwright install-deps \
    && playwright install

EXPOSE 8501
```

```sh
docker build -t scrapegraph_cont:latest .
#podman build -t scrapegraph_cont:latest .
```
Deploying ScrapeGraph with Ollama ⏬
```yaml
version: '3.8'

services:
  scrapper:
    image: scrapegraph_cont:latest #python:3.11
    container_name: scrapegraph_cont
    ports:
      - "8701:8501"
    working_dir: /app
    #command: python3 app.py
    command: tail -f /dev/null #keep it running

  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

  ollama-webui:
    image: ghcr.io/ollama-webui/ollama-webui:main #https://github.com/open-webui/open-webui
    container_name: ollama-webui
    ports:
      - "3000:8080" # 3000 is the port that you will access in your browser
    extra_hosts: # note: 'extra_hosts' is the valid compose key, not 'add-host'
      - "host.docker.internal:host-gateway"
    volumes:
      - ollama-webui_data:/app/backend/data
    restart: always
    # networks: ["nginx_default"] #optional

# networks: #optional
#   nginx_default: #optional
#     external: true #optional

volumes:
  ollama_data:
  ollama-webui_data:
```

```sh
docker exec -it scrapegraph_cont /bin/bash
#podman exec -it scrapegraph_cont /bin/bash
```
You can create a user and password for Open WebUI (the Ollama web UI), and later disable further sign-ups by adding `ENABLE_SIGNUP: false` to the ollama-webui service’s environment in the compose file.
Takeaways
- Web scraping is a powerful technique for extracting data from websites.
- Proxy networks and tools like Bright Data’s Scraping Browser can help avoid IP blocking and captchas.
- Puppeteer is a useful tool for web scraping, though for larger jobs it is best paired with a proxy network to avoid IP blocking.
If you are looking for a tool to get general knowledge about a website, you can use the web-check project - MIT Licensed
- https://github.com/Lissy93/web-check
- Use it at https://web-check.xyz/
- It also provides links to other very interesting tools for exploring a domain/website
🕵️‍♂️ All-in-one OSINT tool for analysing any website
And you can use these projects together with https://github.com/datopian/markdowndb - MIT Licensed
- https://markdowndb.com/ - A rich API to your markdown files in seconds.
Turn markdown files into structured, queryable data with JS. Build markdown-powered docs, blogs, and sites quickly and reliably.
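markdowndb is a JavaScript tool; to make the idea concrete in this post’s main language, here is a minimal Python sketch of the same concept. Note that `parse_front_matter` is a made-up helper (not part of markdowndb) and only handles flat `key: value` front matter:

```python
# Toy sketch of the markdowndb idea: turn a markdown file's front matter
# into a queryable dict plus the remaining body text.
def parse_front_matter(text):
    """Extract simple `key: value` front matter delimited by '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter block at the top
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            body = "\n".join(lines[i + 1:])
            return meta, body
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return {}, text  # unterminated front matter: treat as plain body


doc = """---
title: Scraping with AI
tags: python, llm
---
Body text here."""

meta, body = parse_front_matter(doc)
print(meta)  # {'title': 'Scraping with AI', 'tags': 'python, llm'}
```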
Other F/OSS Tools for Scraping
The good ol’ BeautifulSoup… ⏬
- EmbedChain
It uses LangChain (which is also capable of scraping).
- GPTCrawler
Crawl a site to generate knowledge files to create your own custom GPT from a URL.
- Firecrawl - https://github.com/mendableai/firecrawl - AGPL v3.0
Crawl all accessible subpages and get clean markdown for each. No sitemap required.
Firecrawl allows you to turn entire websites into LLM-ready markdown.
- Scrape: turn any URL into clean data
- Crawl: Firecrawl can recursively search through a URL’s subdomains and gather the content
- https://docs.firecrawl.dev/contributing/guide
- Playground - https://www.firecrawl.dev/playground
- https://www.firecrawl.dev/pricing - API!
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
How to use FireCrawl ⏬
```sh
pip install firecrawl-py
```
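The firecrawl-py client talks to Firecrawl’s hosted API and needs an API key, so it cannot be demoed offline. As a self-contained illustration of the core “website to markdown” idea only, here is a toy stdlib converter (`ToMarkdown` is a made-up class for this post, not part of firecrawl-py, and it handles only headings and plain text):

```python
# Toy stdlib-only HTML-to-markdown sketch, illustrating what tools like
# Firecrawl do at much greater fidelity.
from html.parser import HTMLParser


class ToMarkdown(HTMLParser):
    """Very small HTML-to-markdown converter (h1/h2 headings and text only)."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.prefix = "# "
        elif tag == "h2":
            self.prefix = "## "
        else:
            self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self.prefix + text)

    def markdown(self):
        return "\n\n".join(self.out)


converter = ToMarkdown()
converter.feed("<h1>Title</h1><p>Some body text.</p>")
print(converter.markdown())
```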
Use FireCrawl locally… ⏬
Use FireCrawl together with… ⏬
- Scraperr - https://github.com/jaypyles/Scraperr - MIT Licensed
Self-hosted webscraper.
Use Scrapper… ⏬
- Some Other Projects for Scraping the Web
- https://github.com/DormyMo/SpiderKeeper - admin ui for scrapy/open source scrapinghub
- https://github.com/crawlab-team/crawlab - Distributed web crawler admin platform for spiders management regardless of languages and frameworks
- https://github.com/Gerapy/Gerapy - Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js - MIT Licensed ❤️
- https://github.com/my8100/scrapydweb - Web app for Scrapyd cluster management, Scrapy log analysis & visualization, auto packaging, timer tasks, monitor & alert, and a mobile UI
- https://github.com/apify/crawlee - Apache v2 Licensed
- Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers.
- In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
- https://github.com/raznem/parsera - GPLv2
- Lightweight library for scraping web-sites with LLMs
FAQ
What are rotating proxies ⏬
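Briefly: rotating proxies route each outbound request through a different IP from a pool, so the target site sees traffic coming from many sources instead of one. A minimal round-robin sketch of the rotation logic (the proxy addresses are placeholders, not real endpoints):

```python
# Toy sketch of proxy rotation: hand each request the next proxy from a
# pool, wrapping around when the pool is exhausted.
from itertools import cycle

proxy_pool = cycle([
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
])


def next_proxy():
    """Return the proxy to use for the next request (round-robin)."""
    return next(proxy_pool)


# Four requests against a pool of three: the fourth wraps back to proxy-a.
assignments = [next_proxy() for _ in range(4)]
print(assignments)
```

Real rotating-proxy services add health checks, geo-targeting, and per-request session handling on top of this basic idea.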
How can I understand & filter JSON ⏬
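As a quick standard-library example of filtering scraped JSON (the data here is made up), the same filter jq would express as `jq '.items[] | select(.price > 15)'` looks like this in Python:

```python
# Filtering scraped JSON with only the standard library.
import json

raw = '{"items": [{"name": "A", "price": 10}, {"name": "B", "price": 25}]}'
data = json.loads(raw)

# Keep only the names of items above a price threshold.
expensive = [item["name"] for item in data["items"] if item["price"] > 15]
print(expensive)  # ['B']
```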
- Interesting web browsers:

```sh
sudo apt install lynx
lynx duckduckgo.com
```
How to Take Screenshots of Websites
- https://github.com/sindresorhus/capture-website - MIT Licensed ❤️
Capture screenshots of websites
How to Monitor Website Changes
- Free open source web page change detection and website watching: simply monitor which websites had a text change, with restock monitoring, website defacement monitoring, and price change notifications. Designed for simplicity.
- Create agents that monitor and act on your behalf. Your agents are standing by!
- Watch (parts of) webpages and get notified when something changes via e-mail, on your phone, or via other means. Highly configurable.
How to Monitor Broken Links of Websites
How to Use LinkChecker with Docker ⏬
- Use LinkChecker with their GHCR image:

```sh
docker run --rm -it -u $(id -u):$(id -g) ghcr.io/linkchecker/linkchecker:latest --verbose https://www.example.com
#podman run --rm -it ghcr.io/linkchecker/linkchecker:latest --verbose https://fossengineer.com > linkchecker_output.txt
```
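LinkChecker crawls live sites, which cannot be shown offline; as a small sketch of just its first step (extracting links from a page and flagging obviously broken ones), using only the standard library:

```python
# Offline sketch of the first stage of a link checker: pull hrefs out of
# HTML and flag links whose scheme is malformed or unexpected.
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


html = '<a href="https://example.com">ok</a> <a href="htp://typo.com">bad</a>'
extractor = LinkExtractor()
extractor.feed(html)

# Anything that is not plain http(s) is suspicious here.
bad = [u for u in extractor.links if urlparse(u).scheme not in ("http", "https")]
print(bad)  # ['htp://typo.com']
```

A real checker would then issue HTTP requests for each link and report status codes, which is what the LinkChecker container above does.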
F/OSS RAG Frameworks
| Name | Description |
| --- | --- |
| LangChain Document Loaders | Documentation for document loaders in LangChain |
| Mem0 (ex-EmbedChain) | A tool for creating and managing embeddings (no link provided) |
| PandasAI | A library for integrating AI with Pandas dataframes (no link provided) |
| LlamaIndex Data Connectors | Documentation for data connectors in LlamaIndex |
| Danswer-AI | Gen-AI Chat for Teams - MIT Licensed |
But there are more interesting tools…
One of the key features of LangGraph is the addition of cycles to the agent runtime, enabling repetitive loops essential for agent operation. LangGraph also introduces two main agent runtimes: the agent executor and the chat agent executor.
The agent executor is similar to LangChain’s agent executor, but rebuilt in LangGraph. The chat agent executor, on the other hand, handles agent states as a list of messages, perfect for chat-based models.
Name | Description |
---|---|
LangGraph | A framework for building language model-powered applications |
LangGraph-Studio | An IDE for building and debugging agents using LangGraph |
LangGraph-Studio Blog Post | Blog post introducing LangGraph-Studio as the first agent IDE |
LangGraph Studio is a groundbreaking tool that has the potential to transform the development experience of complex agentic applications.