scrapo
python 3.11+ · MIT · alpha · API stable

The web-scraping library agents deserve.

Selector-cheap. LLM-resilient. Replay-safe. Self-hosted. A five-tier router that escalates HTTP → browser → stealth → agent only when reality demands.

// install
pip install "scrapo[browser,anthropic,mcp]"
5-tier router · LLM once, then cached · deterministic replays
  • HTML: markdown + chunks
  • JSON: pretty-printed
  • LD+JSON: structured
  • RSS / Atom: feeds
  • PDF: text extraction
  • PLAINTEXT: verbatim
// 01 · the router

Five tiers.
Escalated only when reality demands.

Most pages return on Tier 0. When they don't, Scrapo climbs — not on your manual heuristics, but on real failure signals: status codes, anti-bot fingerprints, missing schema fields, unrendered SPA shells.

T0
HTTP httpx · async · redirect-aware

Plain HTTP.

A direct GET with sensible defaults. If the response is a 200 with a non-empty body, this tier serves it. No browser, no proxy, no LLM.

escalates on: 403 · 429 · empty body
T1
SESSIONED cookies · auth · retries

Sessioned HTTP.

Persistent sessions with cookie carry-over, retry budgets, and per-host concurrency limits. Still no browser. Most logged-in pages live here.

escalates on: CF challenge · Akamai BMP · JS gate
T2
HEADLESS playwright · chromium

Headless browser.

Full JS execution for SPA shells. Intercepted network traffic is captured to the snapshot store, so re-extractions are deterministic rather than flaky.

escalates on: bot fingerprint · missing fields
T3
STEALTH stealth + proxy pool

Stealth + rotating identity.

Browser hardening, proxy adapters (Bright Data · Oxylabs · Scrapfly · Zyte), rotating identity with health checks. The escalator's express car.

escalates on: interaction needed · multi-step flow
T4
AGENT observe · act · extract

Agent driver.

LLM-driven runner for forms, pagination, and conditional flows. The last resort and the most expensive, so Scrapo treats it that way.

terminal tier · target reached
// rule of routing

You can cap the ceiling per call (--max-tier 2) or let Scrapo decide. Either way, every escalation is logged with the trigger that caused it.
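
The escalation rule can be sketched in a few lines. This is an illustration of the idea, not Scrapo's internals: the names `Tier`, `Attempt`, `ESCALATE_ON`, and `next_tier`, and the signal strings, are all assumptions made for the example. It shows a per-tier trigger table, a ceiling, and a returned trigger that a caller could log.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    HTTP = 0
    SESSIONED = 1
    HEADLESS = 2
    STEALTH = 3
    AGENT = 4


@dataclass
class Attempt:
    """Outcome of one fetch attempt (hypothetical shape)."""
    status: int
    body: str
    signals: frozenset = frozenset()  # e.g. {"js_gate", "missing_fields"}


# Signals that push a request past each tier, mirroring the tier list above.
ESCALATE_ON = {
    Tier.HTTP: {"status_403", "status_429", "empty_body"},
    Tier.SESSIONED: {"cf_challenge", "akamai_bmp", "js_gate"},
    Tier.HEADLESS: {"bot_fingerprint", "missing_fields"},
    Tier.STEALTH: {"interaction_needed", "multi_step_flow"},
}


def observed_signals(attempt: Attempt) -> set:
    """Combine detector-supplied signals with ones derived from the response."""
    found = set(attempt.signals)
    if attempt.status in (403, 429):
        found.add(f"status_{attempt.status}")
    if not attempt.body.strip():
        found.add("empty_body")
    return found


def next_tier(current: Tier, attempt: Attempt, max_tier: Tier = Tier.AGENT):
    """Return (tier, trigger) to escalate to, or None when no trigger
    fired for this tier or the ceiling has been reached."""
    triggers = ESCALATE_ON.get(current, set()) & observed_signals(attempt)
    if not triggers or current >= max_tier:
        return None
    return Tier(current + 1), sorted(triggers)[0]
```

Passing `max_tier=Tier.HEADLESS` plays the role of `--max-tier 2` here: a bot-fingerprint signal on Tier 2 is observed but does not climb any further.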

// 02 · in code

Three lines for a markdown page.
Six for a typed model.

The first call to a new shape pays an LLM tax. Every subsequent call uses the cached selectors. When the page drifts, the cache rebuilds itself.

scrape.py
# A markdown-clean fetch with provenance.
import asyncio
import scrapo

async def fetch_markdown():
    res = await scrapo.scrape("https://news.ycombinator.com/")
    print(res.markdown)              # clean markdown
    print(res.chunks[0].provenance)  # url · selector · byte range · heading trail
    print(res.run_id)                # replayable, diffable, archived

asyncio.run(fetch_markdown())        # await needs a running event loop

# Typed extraction: LLM once, selectors forever.
import asyncio
from pydantic import BaseModel
import scrapo

class Offer(BaseModel):
    title: str
    price: float
    currency: str = "USD"

class Listing(BaseModel):
    page_title: str
    offers: list[Offer] = []

async def extract_listing(url: str) -> Listing:
    # First call: LLM extracts + caches CSS selectors.
    # Every call after: zero LLM cost, selector-based.
    res = await scrapo.scrape(url, schema=Listing)
    listing: Listing = res.data
    return listing

# Placeholder URL; substitute the listing page you want to extract.
asyncio.run(extract_listing("https://example.com/"))

# Single page, with screenshot and markdown export.
scrapo scrape https://example.com --out-md page.md --screenshot

# Recursive crawl with depth and page budget.
scrapo crawl https://docs.python.org/3/ --max-depth 2 --max-pages 100

# Deterministic replay from archived HTML — no network.
scrapo replay <run_id>
scrapo diff <run_a> <run_b>

# Expose as an MCP server for Claude Code, Claude Desktop, Cursor.
scrapo mcp
// 03 · what's in the box

Six load-bearing
decisions.

01

Deterministic replay

Every fetch is archived. Re-extract from yesterday's HTML. Diff two runs field-by-field. The audit trail is the database.
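
What a field-by-field diff between two runs might look like, as a minimal sketch. It assumes each run's extracted data is available as plain dicts and lists; `diff_fields` is a hypothetical helper written for this example, not a Scrapo function.

```python
from typing import Any


def diff_fields(old: Any, new: Any, path: str = "") -> dict:
    """Map each changed field path to its (old, new) pair.

    A missing key or list item is reported as None on that side.
    """
    if isinstance(old, dict) and isinstance(new, dict):
        changes = {}
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}" if path else str(key)
            changes.update(diff_fields(old.get(key), new.get(key), child))
        return changes
    if isinstance(old, list) and isinstance(new, list):
        changes = {}
        for i in range(max(len(old), len(new))):
            a = old[i] if i < len(old) else None
            b = new[i] if i < len(new) else None
            changes.update(diff_fields(a, b, f"{path}[{i}]"))
        return changes
    return {} if old == new else {path: (old, new)}


run_a = {"page_title": "Deals", "offers": [{"title": "Widget", "price": 9.99}]}
run_b = {"page_title": "Deals", "offers": [{"title": "Widget", "price": 12.5},
                                           {"title": "Gadget", "price": 3.0}]}
changes = diff_fields(run_a, run_b)
```

Here `changes` holds two entries: `offers[0].price` went from 9.99 to 12.5, and `offers[1]` appeared in the second run.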

02

Selector self-healing

The first run is LLM-driven; selectors are cached afterward. When a page drifts and the cache fails, Scrapo re-derives — quietly, automatically.
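
The cache-then-re-derive loop, sketched under loud assumptions: regular expressions stand in for CSS selectors so the example needs no dependencies, `derive` stands in for the LLM call, and `SelfHealingExtractor` is a hypothetical name rather than part of Scrapo.

```python
import re
from typing import Callable, Optional


def apply_selectors(selectors: dict, html: str) -> Optional[dict]:
    """Extract every field, or return None if any selector no longer matches."""
    out = {}
    for field, pattern in selectors.items():
        match = re.search(pattern, html)
        if match is None:
            return None
        out[field] = match.group(1)
    return out


class SelfHealingExtractor:
    def __init__(self, derive: Callable[[str], dict]):
        self.derive = derive   # the expensive step (an LLM call in practice)
        self.cache = {}        # shape key -> {field: regex with one group}
        self.derivations = 0

    def extract(self, shape_key: str, html: str) -> dict:
        cached = self.cache.get(shape_key)
        if cached is not None:
            data = apply_selectors(cached, html)
            if data is not None:
                return data    # cache hit: no derivation cost
        # Cache miss or drift: re-derive, store, and apply once more.
        self.derivations += 1
        self.cache[shape_key] = self.derive(html)
        data = apply_selectors(self.cache[shape_key], html)
        if data is None:
            raise ValueError(f"derived selectors do not match {shape_key!r}")
        return data


def fake_derive(html: str) -> dict:
    """Toy stand-in for the LLM: picks whichever heading tag the page uses."""
    tag = "h1" if "<h1>" in html else "h2"
    return {"title": rf"<{tag}>(.*?)</{tag}>"}


extractor = SelfHealingExtractor(fake_derive)
first = extractor.extract("listing", "<h1>Alpha</h1>")   # derives and caches
second = extractor.extract("listing", "<h1>Beta</h1>")   # served from cache
healed = extractor.extract("listing", "<h2>Gamma</h2>")  # drift: re-derives
```

The counter makes the cost visible: three extractions, two derivations, and the second derivation happens only because the page changed shape.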

03

SSRF + PII guards

IP-obfuscation detection blocks internal targets. Opt-in robots gate. PII redaction on snapshots. Append-only audit log.
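
One way an IP-obfuscation check could work, as a sketch using only the standard library. `literal_ip` and `is_blocked` are hypothetical names, and nothing here is taken from Scrapo's source. The idea: normalise legacy IPv4 spellings before testing whether the address is publicly routable.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def literal_ip(host: str):
    """Return the IP address a host literal denotes, or None for a hostname."""
    try:
        return ipaddress.ip_address(host)  # canonical IPv4 / IPv6
    except ValueError:
        pass
    try:
        # inet_aton accepts legacy spellings such as 2130706433,
        # 0x7f000001 and 127.1, which all mean 127.0.0.1.
        return ipaddress.IPv4Address(socket.inet_aton(host))
    except OSError:
        return None


def is_blocked(url: str) -> bool:
    host = urlsplit(url).hostname
    if not host:
        return True
    ip = literal_ip(host)
    if ip is None:
        # Ordinary hostname. A fuller guard would also resolve DNS
        # and check every returned address the same way.
        return False
    return not ip.is_global
```

The sketch only covers IP literals. Hostnames that resolve to internal addresses need the DNS step noted in the comment.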

04

MCP server built-in

One command — scrapo mcp — and Claude Code, Claude Desktop, and Cursor get scrape, crawl, replay, diff as first-class tools.

05

Model pinning

Strict mode prevents silent LLM drift in production. Pin the model, the prompt, the schema version. Promotions are intentional.

06

Per-chunk provenance

Every extracted chunk carries its source: URL, selector path, byte range, heading trail. You can always trace a value back to its sentence.
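
A rough sketch of how a chunk could carry its byte range and heading trail. It assumes markdown input and UTF-8 byte offsets, and it leaves out the selector path, which needs the original DOM. `Chunk` and `chunk_markdown` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    url: str
    byte_start: int      # offsets into the UTF-8 encoded document
    byte_end: int
    heading_trail: tuple


def chunk_markdown(markdown: str, url: str) -> list:
    """Emit one chunk per non-empty body line, tagged with its provenance."""
    chunks, trail, offset = [], [], 0
    for line in markdown.splitlines(keepends=True):
        size = len(line.encode("utf-8"))
        stripped = line.strip()
        if stripped.startswith("#"):
            # A heading at level N replaces the trail from depth N onward.
            level = len(stripped) - len(stripped.lstrip("#"))
            trail = trail[: level - 1] + [stripped.lstrip("#").strip()]
        elif stripped:
            chunks.append(
                Chunk(stripped, url, offset, offset + size, tuple(trail))
            )
        offset += size
    return chunks


doc = "# Guide\n\n## Install\n\npip install café\n"
chunks = chunk_markdown(doc, "https://example.com/guide")
```

Slicing the encoded document with a chunk's byte range returns the chunk's own line, which is what makes a value traceable back to its sentence.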

// 04 · why

Most scraping stacks pay LLM tax on every page.
Scrapo pays once.

the old way
  • One bespoke scraper per site. Rewritten at every layout drift.
  • LLM call on every page — selectors discovered then thrown away.
  • Stealth, proxies, and retries woven through application code.
  • No audit trail. "Did this number change because the page did?"
  • Schema change ⇒ re-scrape the whole internet.
  • Agent frameworks bolt scraping on as an afterthought.
with scrapo
  • One library. Five tiers. Auto-escalates on real failure signals.
  • LLM extracts once. Selectors cache. Re-derives only on drift.
  • Proxy & stealth as pluggable adapters — swap without rewrites.
  • Every run replayable. Field-level diff between any two snapshots.
  • Re-extract from archived HTML — schema changes are free.
  • MCP server makes Scrapo native to the agent runtime.
// adapters
  • llm: Anthropic, OpenAI, Gemini, mock
  • proxy: Bright Data, Oxylabs, Scrapfly, Zyte, BYO
  • storage: SQLite · WAL, S3
// 05 · get started

Install.
Scrape something.
Replay it tomorrow.

$ pip install "scrapo[browser,anthropic,mcp]"