python 3.11+ · MIT · beta · API stable

The web-scraping library agents deserve.

Selector-cheap. LLM-resilient. Replay-safe. Self-hosted. A five-tier router that escalates HTTP → browser → stealth → agent only when reality demands.

// install

pip install "scrapo-ai[browser,anthropic,mcp]"


5 tier router

1× LLM, then cached

∞ replays

HTML: markdown + chunks
METADATA: json-ld · og · microdata
JSON: pretty-printed
LD+JSON: structured
RSS / Atom: feeds
PDF: text extraction
PLAINTEXT: verbatim

// 01 · the router

Five tiers.
Escalated only when reality demands.

Most pages return on Tier 0. When they don't, Scrapo climbs — not on your manual heuristics, but on real failure signals: status codes, anti-bot fingerprints, missing schema fields, unrendered SPA shells.

before the ladder

API-first · zero-LLM metadata.

Known sites resolve through public APIs first — Wikipedia & Wikimedia skip the tiers (and the CAPTCHAs) entirely, reporting via="api:wikipedia". Then embedded JSON-LD, OpenGraph, and microdata are read with zero LLM cost. Many pages are answered before T0 ever fires.
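As a rough illustration of the zero-LLM metadata step, the sketch below pulls JSON-LD blocks and OpenGraph tags out of raw HTML using only the standard library. The names `MetadataParser` and `read_metadata` are hypothetical and are not part of Scrapo's API; microdata is left out to keep the example short.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects JSON-LD blocks and OpenGraph <meta> tags (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.json_ld: list = []
        self.open_graph: dict[str, str] = {}
        self._in_json_ld = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "script" and attr.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []
        elif tag == "meta" and (attr.get("property") or "").startswith("og:"):
            self.open_graph[attr["property"]] = attr.get("content") or ""

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed block: skip it rather than fail the page


def read_metadata(html: str) -> dict:
    """Return structured data found in the page without any model call."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {"json_ld": parser.json_ld, "open_graph": parser.open_graph}
```

If the fields a caller asks for are already present in this output, nothing further needs to run for that page.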

HTTP httpx · async · redirect-aware


Plain HTTP.

A direct GET with sensible defaults. If the body returns 200 with content, this tier serves it. No browser, no proxy, no LLM.

escalates on: 403 · 429 · empty body

SESSIONED cookies · auth · retries


Sessioned HTTP.

Persistent sessions with cookie carry-over, retry budgets, and per-host concurrency limits. Still no browser. Most logged-in pages live here.

escalates on: CF challenge · Akamai BMP · JS gate
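One way to picture retry budgets combined with per-host concurrency limits is the sketch below. It is an assumption-laden illustration, not Scrapo's implementation: `HostLimiter`, `max_attempts`, and the injected `fetch` callable (which returns an HTTP status) are all hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable
from urllib.parse import urlsplit

# Hypothetical transport: takes a URL and returns an HTTP status code.
Fetch = Callable[[str], Awaitable[int]]


class HostLimiter:
    """Caps in-flight requests per host and retries within a fixed budget."""

    def __init__(self, per_host: int = 2, max_attempts: int = 3) -> None:
        self._per_host = per_host
        self._max_attempts = max_attempts
        self._semaphores: dict[str, asyncio.Semaphore] = {}

    def _semaphore(self, url: str) -> asyncio.Semaphore:
        host = urlsplit(url).netloc
        if host not in self._semaphores:
            self._semaphores[host] = asyncio.Semaphore(self._per_host)
        return self._semaphores[host]

    async def get(self, url: str, fetch: Fetch) -> int:
        async with self._semaphore(url):
            status = 0
            for attempt in range(self._max_attempts):
                status = await fetch(url)
                if status not in (429, 503):
                    return status
                # Short exponential backoff, kept tiny for demonstration.
                await asyncio.sleep(0.01 * 2 ** attempt)
            return status  # budget spent: hand the last status to the caller
```

When the budget runs out, the caller still receives the final status, which is the kind of signal a router could use to decide whether to escalate.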

HEADLESS playwright · chromium


Headless browser.

Full JS execution for SPA shells. Intercepted network traffic is captured to the snapshot store, so re-extractions are deterministic rather than flaky.

escalates on: bot fingerprint · missing fields

STEALTH stealth + proxy pool


Stealth + rotating identity.

Browser hardening, proxy adapters (Bright Data · Oxylabs · Scrapfly · Zyte), rotating identity with health checks. The escalator's express car.

escalates on: interaction needed · multi-step flow
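A rotating identity with health checks could look roughly like the round-robin pool below. This is only a sketch: `ProxyPool`, `Proxy`, and `max_failures` are hypothetical names, and real provider adapters would carry credentials and session state that are omitted here.

```python
from dataclasses import dataclass


@dataclass
class Proxy:
    url: str
    failures: int = 0  # consecutive failures since the last success


class ProxyPool:
    """Round-robin rotation that skips proxies with too many failures."""

    def __init__(self, urls: list[str], max_failures: int = 2) -> None:
        if not urls:
            raise ValueError("at least one proxy URL is required")
        self._proxies = [Proxy(url) for url in urls]
        self._max_failures = max_failures
        self._cursor = 0

    def next(self) -> Proxy:
        # Visit each proxy at most once per call, starting at the cursor.
        for _ in range(len(self._proxies)):
            proxy = self._proxies[self._cursor]
            self._cursor = (self._cursor + 1) % len(self._proxies)
            if proxy.failures < self._max_failures:
                return proxy
        raise RuntimeError("no healthy proxies left in the pool")

    def report(self, proxy: Proxy, ok: bool) -> None:
        # A success resets the counter; a failure moves the proxy toward eviction.
        proxy.failures = 0 if ok else proxy.failures + 1
```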

AGENT observe · act · extract


Agent driver.

LLM-driven runner for forms, pagination, and conditional flows. The last resort and the most expensive, so Scrapo treats it that way.

terminal tier · target reached

// rule of routing

You can cap the ceiling per call (--max-tier 2) or let Scrapo decide. Either way, every escalation is logged with the trigger that caused it.
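The routing rule can be sketched as a loop over tiers that stops at the first success or at the ceiling, recording the trigger for each escalation. Everything here is hypothetical (`route`, `Attempt`, `Outcome`, and the tier callables); it shows the shape of the idea under the assumption that each tier reports either a body or a failure trigger.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    tier: int
    trigger: Optional[str]  # None means this tier served the page


@dataclass
class Outcome:
    body: Optional[str]
    served_by: Optional[int]
    log: list[Attempt]


# A tier takes a URL and returns (body, trigger); trigger is None on success.
Tier = Callable[[str], tuple[Optional[str], Optional[str]]]


def route(url: str, tiers: list[Tier], max_tier: Optional[int] = None) -> Outcome:
    """Climb the tiers in order, never past max_tier, logging each trigger."""
    top = len(tiers) - 1
    ceiling = top if max_tier is None else min(max_tier, top)
    log: list[Attempt] = []
    for index in range(ceiling + 1):
        body, trigger = tiers[index](url)
        log.append(Attempt(index, trigger))
        if trigger is None:
            return Outcome(body, index, log)
    return Outcome(None, None, log)  # ceiling reached without success
```

With three stand-in tiers where only the third succeeds, `route(url, tiers)` is served by tier 2, while `route(url, tiers, max_tier=1)` stops early and returns a log of the two triggers it saw.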

// 02 · in code

Three lines for a markdown page.
Six for a typed model.

The first call to a new shape pays an LLM tax. Every subsequent call uses the cached selectors. When the page drifts, the cache rebuilds itself.

scrape.py

# A markdown-clean fetch with provenance.
# Uses top-level await: run it inside an async function or an asyncio-aware REPL.
import scrapo

res = await scrapo.scrape("https://news.ycombinator.com/")

print(res.markdown)              # clean markdown
print(res.chunks[0].provenance)  # url · selector · byte range · heading trail
print(res.run_id)                # replayable, diffable, archived

# Typed extraction: LLM once, selectors forever.
from pydantic import BaseModel
import scrapo

class Offer(BaseModel):
    title: str
    price: float
    currency: str = "USD"

class Listing(BaseModel):
    page_title: str
    offers: list[Offer] = []

# First call: LLM extracts + caches CSS selectors.
# Every call after: zero LLM cost, selector-based.
# `url` is a placeholder for the listing page you want to extract.
res = await scrapo.scrape(url, schema=Listing)

listing: Listing = res.data

# Watch a page and pay only when it actually changes.
# `Pricing` is a placeholder for a pydantic model, defined like Listing above.
import scrapo

w = await scrapo.watch("https://example.com/pricing", schema=Pricing)
change = await w.refresh()        # conditional GET → 304 = free, zero LLM
if change.changed:
    print(change.summary())

# Batch: one shared browser pool, per-URL error isolation.
items = await scrapo.batch_scrape(urls, schema=Product, main_content=True)

# No event loop? The sync API works in scripts and notebooks.
res = scrapo.scrape_sync("https://example.com/")

# Single page — cap the tier ceiling, export markdown + screenshot.
scrapo scrape https://example.com --max-tier 3 --out-md page.md --screenshot

# Crawl with budgets, or just map a site's URLs.
scrapo crawl https://docs.python.org/3/ --max-depth 2 --max-pages 100
scrapo map https://docs.python.org/3/ --out urls.txt

# Batch many URLs straight to JSONL / CSV.
scrapo batch https://a.com/ https://b.com/ --out-jsonl out.jsonl

# Deterministic replay + field-level diff — no network, no LLM.
scrapo replay <run_id>
scrapo diff <run_a> <run_b>

# Watch for changes · serve a local UI · expose an MCP server.
scrapo watch-add https://example.com/pricing --interval 3600
scrapo serve
scrapo mcp

// 03 · what's in the box

Nine load-bearing decisions.

Deterministic replay

Every fetch is archived. Re-extract from yesterday's HTML. Diff two runs field-by-field. The audit trail is the database.
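To make the replay-and-diff idea concrete, here is a minimal in-memory sketch. It assumes a run id derived from the URL and content, which may not be how Scrapo assigns ids; `SnapshotStore` and `diff_fields` are hypothetical names.

```python
import hashlib
from typing import Callable


class SnapshotStore:
    """In-memory stand-in for an archive of fetched HTML, keyed by run id."""

    def __init__(self) -> None:
        self._runs: dict[str, str] = {}

    def archive(self, url: str, html: str) -> str:
        # Assumption: a content-derived id; a real store may use other ids.
        run_id = hashlib.sha256(f"{url}\n{html}".encode("utf-8")).hexdigest()[:12]
        self._runs[run_id] = html
        return run_id

    def replay(self, run_id: str, extract: Callable[[str], dict]) -> dict:
        # Re-extraction reads only the archived HTML: no network involved.
        return extract(self._runs[run_id])


def diff_fields(old: dict, new: dict) -> dict:
    """Return {field: (old_value, new_value)} for every field that differs."""
    keys = sorted(old.keys() | new.keys())
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}
```

Because `replay` takes the extractor as an argument, yesterday's HTML can be re-read with today's extraction logic and the two results compared field by field.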

Selector self-healing

The first run is LLM-driven; selectors are cached afterward. When a page drifts and the cache fails, Scrapo re-derives — quietly, automatically.
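The derive-once, reuse, re-derive-on-drift cycle might be sketched like this. Two loud assumptions: regular expressions stand in for CSS selectors to keep the example dependency-free, and `derive` is a hypothetical stand-in for the LLM step. `SelectorCache` is not a Scrapo class.

```python
import re
from typing import Callable, Optional

# Hypothetical stand-in for the LLM step: proposes one pattern per field.
Deriver = Callable[[str], dict[str, str]]


class SelectorCache:
    """Reuses cached patterns per page shape and re-derives them on drift."""

    def __init__(self, derive: Deriver) -> None:
        self._derive = derive
        self._patterns: dict[str, dict[str, str]] = {}
        self.derivations = 0  # how many times the expensive step ran

    @staticmethod
    def _apply(patterns: dict[str, str], html: str) -> Optional[dict[str, str]]:
        out: dict[str, str] = {}
        for field, pattern in patterns.items():
            match = re.search(pattern, html)
            if match is None:
                return None  # a missing field is treated as drift
            out[field] = match.group(1)
        return out

    def extract(self, shape: str, html: str) -> dict[str, str]:
        cached = self._patterns.get(shape)
        if cached is not None:
            result = self._apply(cached, html)
            if result is not None:
                return result  # cache hit: no derivation needed
        # Cache miss or drift: derive fresh patterns and store them.
        self.derivations += 1
        patterns = self._derive(html)
        result = self._apply(patterns, html)
        if result is None:
            raise ValueError("derived patterns did not match the page")
        self._patterns[shape] = patterns
        return result
```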

SSRF + PII guards

IP-obfuscation detection blocks internal targets. Opt-in robots gate. PII redaction on snapshots. Append-only audit log.
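A guard against obfuscated internal targets could be approximated as below. This is a sketch under assumptions: `is_internal_target` is a hypothetical name, it relies on `socket.inet_aton` to normalize decimal, hex, and octal IPv4 spellings (behavior that comes from the platform's C library), and it does not resolve DNS names, which a production guard would also need to do.

```python
import ipaddress
import socket
from typing import Optional, Union
from urllib.parse import urlsplit

Address = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def literal_address(host: str) -> Optional[Address]:
    """Parse a host as an IP literal, including obfuscated IPv4 spellings."""
    try:
        return ipaddress.ip_address(host)
    except ValueError:
        pass
    try:
        # inet_aton accepts forms such as "2130706433" and "0x7f.0.0.1".
        return ipaddress.ip_address(socket.inet_aton(host))
    except (OSError, ValueError):
        return None


def is_internal_target(url: str) -> bool:
    """Return True when the URL should be refused as an internal target."""
    host = urlsplit(url).hostname
    if not host:
        return True  # nothing to vet: refuse
    if host == "localhost" or host.endswith(".localhost"):
        return True
    address = literal_address(host)
    if address is None:
        return False  # a DNS name; a real guard would resolve and re-check
    return not address.is_global
```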

MCP server built-in

One command — scrapo mcp — gives Claude Code, Claude Desktop, and Cursor seven first-class tools: scrape, crawl, map, batch, replay, diff, list_runs.

Model pinning

Strict mode prevents silent LLM drift in production. Pin the model, the prompt, the schema version. Promotions are intentional.
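One plausible way to enforce a pin is to fingerprint the model, prompt, and schema version together and refuse to run when the fingerprint changes. The sketch below assumes that approach; `Pin`, `check_pin`, and `PinMismatch` are hypothetical names, not Scrapo's API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Pin:
    model: str
    prompt: str
    schema_version: str

    def fingerprint(self) -> str:
        # Stable serialization so the same pin always hashes the same way.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


class PinMismatch(RuntimeError):
    """Raised in strict mode when the running config differs from the pin."""


def check_pin(expected_fingerprint: str, current: Pin, strict: bool = True) -> bool:
    matches = current.fingerprint() == expected_fingerprint
    if strict and not matches:
        raise PinMismatch("model, prompt, or schema version changed since pinning")
    return matches
```

Promoting a new model then becomes an explicit act: store the new fingerprint, rather than letting a changed default slip through.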

Per-chunk provenance

Every extracted chunk carries its source: URL, selector path, byte range, heading trail. You can always trace a value back to its sentence.
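A chunk record with provenance might be shaped like the sketch below, which splits markdown into paragraph lines while tracking byte offsets and the heading trail. The `Chunk` fields and `chunk_markdown` are hypothetical, and the selector path is omitted because this example starts from markdown rather than from the DOM.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    url: str
    byte_start: int
    byte_end: int
    heading_trail: tuple[str, ...]


def chunk_markdown(markdown: str, url: str) -> list[Chunk]:
    """Emit one chunk per non-heading line, with offsets into the UTF-8 bytes."""
    chunks: list[Chunk] = []
    trail: list[str] = []
    offset = 0
    for line in markdown.splitlines(keepends=True):
        size = len(line.encode("utf-8"))
        stripped = line.strip()
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            title = stripped.lstrip("#").strip()
            # Keep ancestors above this level, then append the new heading.
            trail = trail[: level - 1] + [title]
        elif stripped:
            chunks.append(Chunk(stripped, url, offset, offset + size, tuple(trail)))
        offset += size
    return chunks
```

Slicing the original bytes with `byte_start` and `byte_end` returns the exact line a value came from, which is what makes tracing a value back to its sentence possible.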

API-first fast path

Known sites resolve through public APIs before the tier ladder. Wikipedia & Wikimedia skip CAPTCHAs entirely and return via="api:..." — faster, cheaper, no browser.

Zero-LLM metadata

JSON-LD, OpenGraph, Twitter tags, and microdata are read straight from the page before any selector cache or LLM call. Structured data costs zero tokens.

Watch & change alerts

Monitor any URL with conditional GETs — 304 Not Modified is free. Self-hosted scheduler fires webhooks with a field-level diff only when content actually changes.
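The conditional-GET mechanics can be sketched with an injected transport, so the logic is visible without touching the network. `Watcher`, `Response`, and the `transport` callable are hypothetical; the only protocol facts relied on are the standard `ETag` / `If-None-Match` headers and the 304 status.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Response:
    status: int
    body: str = ""
    etag: Optional[str] = None


# Hypothetical transport: takes (url, headers) and returns a Response.
Transport = Callable[[str, dict[str, str]], Response]


class Watcher:
    """Remembers the last ETag and body, and reports real changes only."""

    def __init__(self, url: str, transport: Transport) -> None:
        self.url = url
        self._transport = transport
        self._etag: Optional[str] = None
        self._body: Optional[str] = None

    def refresh(self) -> bool:
        """Return True only when content changed since the last refresh."""
        headers = {"If-None-Match": self._etag} if self._etag else {}
        response = self._transport(self.url, headers)
        if response.status == 304:
            return False  # server confirmed nothing changed: no parsing, no LLM
        changed = response.body != self._body
        self._etag, self._body = response.etag, response.body
        return changed
```

A scheduler built on this would fire its webhook only when `refresh()` returns True, attaching a field-level diff of the old and new extractions.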

// 04 · why

Most scraping stacks pay LLM tax on every page.
Scrapo pays once.

the old way ×

One bespoke scraper per site. Re-written every layout drift.
LLM call on every page — selectors discovered then thrown away.
Stealth, proxies, and retries woven through application code.
No audit trail. "Did this number change because the page did?"
Schema change ⇒ re-scrape the whole internet.
Agent frameworks bolt scraping on as an afterthought.

with scrapo ●

One library. Five tiers. Auto-escalates on real failure signals.
API-first + embedded metadata clear most pages at zero LLM cost.
LLM extracts once. Selectors cache and self-heal on drift.
Proxy & stealth as pluggable adapters — swap without rewrites.
Every run replayable. Field-level diff between any two snapshots.
MCP server + watch alerts make Scrapo native to the agent runtime.

// adapters

llm: Anthropic · OpenAI · Gemini · DeepSeek · OpenRouter · Ollama · mock
proxy: Bright Data · Oxylabs · Scrapfly · Zyte · BYO
storage: SQLite (WAL) · S3

// 05 · get started

Install.
Scrape something.
Replay it tomorrow.

$ pip install "scrapo-ai[browser,anthropic,mcp]"

