How to use ScrapeGraph AI - Scraping with Ollama

Over the past years ive’been something like this…

Me Getting Up to Date with Open Source

…to be able to catch in all open source projects (previous and newly created).

It’s time to gain some leverage with AI

Scrapegraph-ai is an open-source, Python library that revolutionizes web scraping by integrating Large Language Models (LLMs) and graph logic to automate the creation of scraping pipelines.

It uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.).

Its sophisticated design allows for the extraction of data from websites, documents, and even XML files efficiently, with the philosophy of “You Only Scrape Once”—signifying a move towards less repetitive and more strategic data gathering.

Developers can leverage this library to construct custom scraping tasks using simple prompts, which are then interpreted by AI to form a tailored scraping workflow.

With Scrapegraph-ai, you can automate data extraction and pipeline construction without repeated manual setup.

The Scrapegraph-ai Project

The Scrapegraph-ai Project is fully open source and you can explore:
- The Scrapegraph-ai Official Page
- The Scrapegraph-ai Source Code at GitHub
  - License: MIT ❤️

With Scrapegraph-ai, no repeated scraping tasks: Define, Deploy, Extract. Scrape Cleverly Repeatedly Automate Pipelines Effectively.

Why Scrapegraph-ai?

Scrapegraph-ai provides a cutting-edge platform for anyone looking to leverage AI capabilities in web scraping without the typical complexities associated with traditional methods.

Traditional scraping tools rely on fixed patterns and manual configurations, whereas ScrapeGraphAI adapts to website structure changes using LLMs, reducing the need for constant developer intervention.

Ease of Use: Initiate complex scraping operations with just a few lines of prompt.
Efficiency: Reduce the time and effort spent on repetitive data collection tasks.
AI Integration: Utilize advanced AI techniques to optimize data extraction.
Customizable Workflows: Tailor scraping pipelines to your specific needs with the aid of LLMs and graph logic.

ScrapeGraphAI supports several LLMs, including GPT, Gemini, Groq, Azure, Hugging Face, and local models that can run on your machine using Ollama.

Project	Description
Scrapegraph Project Documentation	Documentation for the Scrapegraph project
scrapegraphai PyPI Package	Python package for Scrapegraph-AI available on PyPI
Scrapegraph Streamlit Demo	Streamlit demo application for Scrapegraph
Scrapegraph-AI Streamlit App	Live demo of the Scrapegraph-AI project using Streamlit
Scrapegraph Colab Notebook	Interactive Colab notebook to try out Scrapegraph

Scrapegraph-AI: Python Web Scraping with AI

Concept

📜 Scrapegraph-ai is a Python library that enhances web scraping with the use of Large Language Models (LLMs) and graph logic. It is designed to automate the construction of scraping pipelines for websites, documents, and XML files.

Core Idea

💡 The philosophy of “You Only Scrape Once” underlines the library’s goal to extract information efficiently, aiming to reduce the need for repetitive scraping tasks.

Key Functionality

🎯 Users provide prompts detailing the data they wish to extract.
🧠 Utilizes LLMs and graph logic to tailor scraping pipelines based on these prompts.

Implementation

🛠️ Central to the library is the SmartScraper class, which uses a directed graph approach integrating common nodes found in web scraping pipelines.

Further Exploration

📖 For more detailed information and usage instructions, consulting the official Scrapegraph-ai documentation is recommended.

In Essence

🚀 Scrapegraph-ai appears to be a promising tool for those interested in web scraping with Python, especially with an AI-powered approach to automate and simplify pipeline creation.

How to use ScrapeGraph

Scrapegraphai Usecase: https://github.com/VinciGit00/Scrapegraph-ai/tree/main/examples

sudo apt update
sudo apt install python3

sudo apt install python3.10-venv

Create the venv for ScrapeGraph:

python3 -m venv scrapegraphai
source scrapegraphai/bin/activate

pip install scrapegraphai==1.11.3 #https://pypi.org/project/scrapegraphai/

playwright install

Now the package is ready:

https://platform.openai.com/api-keys
- https://scrapegraph-ai.readthedocs.io/en/latest/index.html

ScrapeGraph With Ollama

Let’s do it fully local:

Python Code - ScrapeGraph Ollama ⏬

Dockerfile for ScrapeGraph ⏬

FROM python:3.11  
#https://hub.docker.com/_/python

# LABEL org.opencontainers.image.source https://github.com/JAlcocerT/Streamlit-MultiChat
# LABEL maintainer="Jesus Alcocer Tagua"

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME

COPY . ./


RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    software-properties-common \
    git \
    nano \
    chromium-driver \
    && rm -rf /var/lib/apt/lists/*

# Install Python packages
RUN pip install scrapegraphai==0.9.0b7 \
    nest_asyncio \
    playwright

# Run Playwright commands to install browser dependencies and browsers
RUN playwright install-deps \
    && playwright install

EXPOSE 8501


###podman build -t scrapegraph_cont:latest .
#docker build -t scrapegraph_cont:latest .

Deploying ScrapeGraph with Ollama ⏬

version: '3.8'

services:
  scrapper:
    image: scrapegraph_cont:latest #python:3.11
    container_name: scrapegraph_cont
    ports:
      - "8701:8501"
    working_dir: /app
    #command: python3 app.py
    command: tail -f /dev/null #keep it running

  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

  ollama-webui:
    image: ghcr.io/ollama-webui/ollama-webui:main #https://github.com/open-webui/open-webui
    container_name: ollama-webui
    ports:
      - "3000:8080" # 3000 is the port that you will access in your browser
    add-host:
      - "host.docker.internal:host-gateway"
    volumes:
      - ollama-webui_data:/app/backend/data
    restart: always
#     networks: ["nginx_default"] #optional

# networks: #optional
#   nginx_default: #optional
#     external: true #optional       

volumes:
  ollama_data:
  ollama-webui_data:


#docker exec -it scrapegraph_cont /bin/bash
#podman exec -it scrapegraph_cont /bin/bash

You can create user and pass for Open Web UI (Ollama web UI) and later on update the compose with ENABLE_SIGNUP: false

Takeaways

Web scraping is a powerful technique for extracting data from websites.
Proxy networks and tools like Bright Data’s Scraping Browser can help avoid IP blocking and captchas.
Puppeteer is a useful tool for web scraping, but it’s essential to use it with a proxy network.

If you are looking for a tool to get general knowledge about a website, you can use the web-check project - MIT Licensed

https://github.com/Lissy93/web-check
- Use it at https://web-check.xyz/
- It provides also links to other very interesting tools to explore a domain/website

🕵️‍♂️ All-in-one OSINT tool for analysing any website

And you can use these projects together with https://github.com/datopian/markdowndb - MIT Licensed

https://markdowndb.com/ - A rich API to your markdown files in seconds.

Turn markdown files into structured, queryable data with JS. Build markdown-powered docs, blogs, and sites quickly and reliably.

Other F/OSS Tools for Scrapping

The Good’ol BeautifulSoup… ⏬

EmbedChain
- https://fossengineer.com/embedchain-ai/#embedchain-for-scrapping-content
- https://docs.embedchain.ai/get-started/quickstart

It uses LangChain (which its also capable of scrapping)

Crawl a site to generate knowledge files to create your own custom GPT from a URL

Firecrawl - https://github.com/mendableai/firecrawl - aGPL v3.0
- https://github.com/mendableai/firecrawl/tree/main/apps/python-sdk
- https://pypi.org/project/firecrawl-py/

Crawl all accessible subpages and give you clean markdown for each. No sitemap required.

Firecrawl allows you to turn entire websites into LLM-ready markdown.

Scrape: Turn any url into clean data
Crawl: Firecrawl can recursively search through a urls subdomains, and gather the content
https://docs.firecrawl.dev/contributing/guide
- Playground - https://www.firecrawl.dev/playground
- https://www.firecrawl.dev/pricing - API!

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

How to use FireCrawl ⏬

pip install firecrawl-py

Use FireCrawl locally… ⏬

https://github.com/mendableai/firecrawl/blob/main/SELF_HOST.md

Use FireCrawl together with… ⏬

Scraperr - https://github.com/jaypyles/Scraperr - MIT Licensed
- https://github.com/jaypyles/Scraperr?tab=MIT-1-ov-file#readme

Self-hosted webscraper.

Use Scrapper… ⏬

Some Other Projects for Scrapping the Web
- https://github.com/DormyMo/SpiderKeeper - admin ui for scrapy/open source scrapinghub
- https://github.com/crawlab-team/crawlab - Distributed web crawler admin platform for spiders management regardless of languages and frameworks. 分布式爬虫管理平台，支持任何语言和框架
- https://github.com/Gerapy/Gerapy - Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js - MIT Licensed ❤️
- https://github.com/my8100/scrapydweb - Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI. DEMO 👉
https://github.com/apify/crawlee - Apache v2 Licensed
- Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers.
- In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://github.com/raznem/parsera - GPLv2
- Lightweight library for scraping web-sites with LLMs

FAQ

What are rotating proxies ⏬

How can I understand & filter JSON ⏬

https://github.com/ynqa/jnv

Interesting web browsers:
- https://github.com/spartrekus/links2
- Lynx

sudo apt install lynx
lynx duckduckgo.com

How to Take ScreenShots of Webs

https://github.com/sindresorhus/capture-website - MIT Licensed ❤️

Capture screenshots of websites

https://github.com/dgtlmoon/changedetection.io
- https://github.com/dgtlmoon/changedetection.io?tab=Apache-2.0-1-ov-file#readme ✅

The best and simplest free open source web page change detection, website watcher, restock monitor and notification service. Restock Monitor, change detection. Designed for simplicity - Simply monitor which websites had a text change for free. Free Open source web page change detection, Website defacement monitoring, Price change notification

Create agents that monitor and act on your behalf. Your agents are standing by!

https://github.com/thp/urlwatch?tab=License-1-ov-file#readme

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.

How to Monitor Broken Links of Websites

How to Use LinkChecker with Docker ⏬

Use LinkChecker with their GHCR Image

#podman run --rm -it ghcr.io/linkchecker/linkchecker:latest --verbose https://fossengineer.com > linkchecker_output.txt

docker run --rm -it -u $(id -u):$(id -g) ghcr.io/linkchecker/linkchecker:latest --verbose https://www.example.com

F/OSS RAG Frameworks

Name	Description
LangChain Document Loaders	Documentation for document loaders in LangChain
Mem0 (ex-EmbedChain)	A tool for creating and managing embeddings (no link provided)
PandasAI	A library for integrating AI with Pandas dataframes (no link provided)
LlamaIndex Data Connectors	Documentation for data connectors in LlamaIndex
LlamaIndex Data Connectors	Documentation for data connectors in LlamaIndex
Danswer-AI	Gen-AI Chat for Teams - MIT Licensed

But there are more interesting tools…

One of the key features of LangGraph is the addition of cycles to the agent runtime, enabling repetitive loops essential for agent operation. LangGraph also introduces two main agent runtimes: the agent executor and the chat agent executor.

The agent executor is similar to LangChain’s agent executor, but rebuilt in LangGraph. The chat agent executor, on the other hand, handles agent states as a list of messages, perfect for chat-based models.

Name	Description
LangGraph	A framework for building language model-powered applications
LangGraph-Studio	An IDE for building and debugging agents using LangGraph
LangGraph-Studio Blog Post	Blog post introducing LangGraph-Studio as the first agent IDE

LangGraph Studio is a groundbreaking tool that has the potential to transform the development experience of complex agentic applications.

The Scrapegraph-ai Project#

Why Scrapegraph-ai?#

Scrapegraph-AI: Python Web Scraping with AI#

Concept#

Core Idea#

Key Functionality#

Implementation#

Further Exploration#

In Essence#

How to use ScrapeGraph#

ScrapeGraph With Ollama#

Takeaways#

Other F/OSS Tools for Scrapping#

FAQ#

How to Take ScreenShots of Webs#

How to Monitor Broken Links of Websites#

F/OSS RAG Frameworks#