Over the past years ive’been something like this…

Me Getting Up to Date with Open Source

…to be able to catch in all open source projects (previous and newly created).

It’s time to gain some leverage with AI

Scrapegraph-ai is an open-source, Python library that revolutionizes web scraping by integrating Large Language Models (LLMs) and graph logic to automate the creation of scraping pipelines.

It uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.).

Its sophisticated design allows for the extraction of data from websites, documents, and even XML files efficiently, with the philosophy of “You Only Scrape Once”—signifying a move towards less repetitive and more strategic data gathering.

Star History Chart

Developers can leverage this library to construct custom scraping tasks using simple prompts, which are then interpreted by AI to form a tailored scraping workflow.

With Scrapegraph-ai, you can automate data extraction and pipeline construction without repeated manual setup.

The Scrapegraph-ai Project

With Scrapegraph-ai, no repeated scraping tasks: Define, Deploy, Extract. Scrape Cleverly Repeatedly Automate Pipelines Effectively.

Why Scrapegraph-ai?

Scrapegraph-ai provides a cutting-edge platform for anyone looking to leverage AI capabilities in web scraping without the typical complexities associated with traditional methods.

Traditional scraping tools rely on fixed patterns and manual configurations, whereas ScrapeGraphAI adapts to website structure changes using LLMs, reducing the need for constant developer intervention.

  • Ease of Use: Initiate complex scraping operations with just a few lines of prompt.
  • Efficiency: Reduce the time and effort spent on repetitive data collection tasks.
  • AI Integration: Utilize advanced AI techniques to optimize data extraction.
  • Customizable Workflows: Tailor scraping pipelines to your specific needs with the aid of LLMs and graph logic.

ScrapeGraphAI supports several LLMs, including GPT, Gemini, Groq, Azure, Hugging Face, and local models that can run on your machine using Ollama.

Project Description
Scrapegraph Project Documentation Documentation for the Scrapegraph project
scrapegraphai PyPI Package Python package for Scrapegraph-AI available on PyPI
Scrapegraph Streamlit Demo Streamlit demo application for Scrapegraph
Scrapegraph-AI Streamlit App Live demo of the Scrapegraph-AI project using Streamlit
Scrapegraph Colab Notebook Interactive Colab notebook to try out Scrapegraph

Scrapegraph-AI: Python Web Scraping with AI

Concept

📜 Scrapegraph-ai is a Python library that enhances web scraping with the use of Large Language Models (LLMs) and graph logic. It is designed to automate the construction of scraping pipelines for websites, documents, and XML files.

Core Idea

💡 The philosophy of “You Only Scrape Once” underlines the library’s goal to extract information efficiently, aiming to reduce the need for repetitive scraping tasks.

Key Functionality

  • 🎯 Users provide prompts detailing the data they wish to extract.
  • 🧠 Utilizes LLMs and graph logic to tailor scraping pipelines based on these prompts.

Implementation

  • 🛠️ Central to the library is the SmartScraper class, which uses a directed graph approach integrating common nodes found in web scraping pipelines.

Further Exploration

  • 📖 For more detailed information and usage instructions, consulting the official Scrapegraph-ai documentation is recommended.

In Essence

  • 🚀 Scrapegraph-ai appears to be a promising tool for those interested in web scraping with Python, especially with an AI-powered approach to automate and simplify pipeline creation.

How to use ScrapeGraph

Scrapegraphai Usecase: https://github.com/VinciGit00/Scrapegraph-ai/tree/main/examples

sudo apt update
sudo apt install python3

sudo apt install python3.10-venv

Create the venv for ScrapeGraph:

python3 -m venv scrapegraphai
source scrapegraphai/bin/activate

pip install scrapegraphai==1.11.3 #https://pypi.org/project/scrapegraphai/

playwright install

Now the package is ready:

ScrapeGraph With Ollama

Let’s do it fully local:

Python Code - ScrapeGraph Ollama ⏬
Dockerfile for ScrapeGraph ⏬
FROM python:3.11  
#https://hub.docker.com/_/python

# LABEL org.opencontainers.image.source https://github.com/JAlcocerT/Streamlit-MultiChat
# LABEL maintainer="Jesus Alcocer Tagua"

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME

COPY . ./


RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    software-properties-common \
    git \
    nano \
    chromium-driver \
    && rm -rf /var/lib/apt/lists/*

# Install Python packages
RUN pip install scrapegraphai==0.9.0b7 \
    nest_asyncio \
    playwright

# Run Playwright commands to install browser dependencies and browsers
RUN playwright install-deps \
    && playwright install

EXPOSE 8501


###podman build -t scrapegraph_cont:latest .
#docker build -t scrapegraph_cont:latest .
Deploying ScrapeGraph with Ollama ⏬
version: '3.8'

services:
  scrapper:
    image: scrapegraph_cont:latest #python:3.11
    container_name: scrapegraph_cont
    ports:
      - "8701:8501"
    working_dir: /app
    #command: python3 app.py
    command: tail -f /dev/null #keep it running

  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

  ollama-webui:
    image: ghcr.io/ollama-webui/ollama-webui:main #https://github.com/open-webui/open-webui
    container_name: ollama-webui
    ports:
      - "3000:8080" # 3000 is the port that you will access in your browser
    add-host:
      - "host.docker.internal:host-gateway"
    volumes:
      - ollama-webui_data:/app/backend/data
    restart: always
#     networks: ["nginx_default"] #optional

# networks: #optional
#   nginx_default: #optional
#     external: true #optional       

volumes:
  ollama_data:
  ollama-webui_data:


#docker exec -it scrapegraph_cont /bin/bash
#podman exec -it scrapegraph_cont /bin/bash

You can create user and pass for Open Web UI (Ollama web UI) and later on update the compose with ENABLE_SIGNUP: false


Takeaways

  • Web scraping is a powerful technique for extracting data from websites.
  • Proxy networks and tools like Bright Data’s Scraping Browser can help avoid IP blocking and captchas.
  • Puppeteer is a useful tool for web scraping, but it’s essential to use it with a proxy network.

If you are looking for a tool to get general knowledge about a website, you can use the web-check project - MIT Licensed

🕵️‍♂️ All-in-one OSINT tool for analysing any website

And you can use these projects together with https://github.com/datopian/markdowndb - MIT Licensed

Turn markdown files into structured, queryable data with JS. Build markdown-powered docs, blogs, and sites quickly and reliably.

Other F/OSS Tools for Scrapping

The Good’ol BeautifulSoup… ⏬

Star History Chart

It uses LangChain (which its also capable of scrapping)

Crawl a site to generate knowledge files to create your own custom GPT from a URL

Crawl all accessible subpages and give you clean markdown for each. No sitemap required.

Firecrawl allows you to turn entire websites into LLM-ready markdown.

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

How to use FireCrawl ⏬
pip install firecrawl-py
Use FireCrawl locally… ⏬
Use FireCrawl together with… ⏬

Self-hosted webscraper.

Use Scrapper… ⏬
  • Some Other Projects for Scrapping the Web
  • https://github.com/apify/crawlee - Apache v2 Licensed
    • Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers.
    • In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
  • https://github.com/raznem/parsera - GPLv2
    • Lightweight library for scraping web-sites with LLMs

FAQ

What are rotating proxies ⏬
How can I understand & filter JSON ⏬
sudo apt install lynx
lynx duckduckgo.com

How to Take ScreenShots of Webs

Capture screenshots of websites

The best and simplest free open source web page change detection, website watcher, restock monitor and notification service. Restock Monitor, change detection. Designed for simplicity - Simply monitor which websites had a text change for free. Free Open source web page change detection, Website defacement monitoring, Price change notification

Create agents that monitor and act on your behalf. Your agents are standing by!

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.

How to Use LinkChecker with Docker ⏬
#podman run --rm -it ghcr.io/linkchecker/linkchecker:latest --verbose https://fossengineer.com > linkchecker_output.txt

docker run --rm -it -u $(id -u):$(id -g) ghcr.io/linkchecker/linkchecker:latest --verbose https://www.example.com

F/OSS RAG Frameworks

Name Description
LangChain Document Loaders Documentation for document loaders in LangChain
Mem0 (ex-EmbedChain) A tool for creating and managing embeddings (no link provided)
PandasAI A library for integrating AI with Pandas dataframes (no link provided)
LlamaIndex Data Connectors Documentation for data connectors in LlamaIndex
LlamaIndex Data Connectors Documentation for data connectors in LlamaIndex
Danswer-AI Gen-AI Chat for Teams - MIT Licensed

But there are more interesting tools…

One of the key features of LangGraph is the addition of cycles to the agent runtime, enabling repetitive loops essential for agent operation. LangGraph also introduces two main agent runtimes: the agent executor and the chat agent executor.

The agent executor is similar to LangChain’s agent executor, but rebuilt in LangGraph. The chat agent executor, on the other hand, handles agent states as a list of messages, perfect for chat-based models.

Name Description
LangGraph A framework for building language model-powered applications
LangGraph-Studio An IDE for building and debugging agents using LangGraph
LangGraph-Studio Blog Post Blog post introducing LangGraph-Studio as the first agent IDE

LangGraph Studio is a groundbreaking tool that has the potential to transform the development experience of complex agentic applications.