Selector-cheap. LLM-resilient. Replay-safe. Self-hosted. A five-tier router that escalates HTTP → session → browser → stealth → agent only when reality demands it.
pip install "scrapo[browser,anthropic,mcp]"
Most pages return on Tier 0. When they don't, Scrapo climbs — not on your manual heuristics, but on real failure signals: status codes, anti-bot fingerprints, missing schema fields, unrendered SPA shells.
A direct GET with sensible defaults. If the response is a 200 with a non-empty body, this tier serves it. No browser, no proxy, no LLM.
Persistent sessions with cookie carry-over, retry budgets, and per-host concurrency limits. Still no browser. Most logged-in pages live here.
Full JS execution for SPA shells. Intercepted network traffic is captured to the snapshot store, so re-extractions are deterministic, never flaky.
Browser hardening, proxy adapters (Bright Data · Oxylabs · Scrapfly · Zyte), rotating identity with health checks. The escalator's express car.
LLM-driven runner for forms, pagination, and conditional flows. The last resort and the most expensive, so Scrapo treats it that way.
You can cap the ceiling per call (--max-tier 2) or let Scrapo decide. Either way, every escalation is logged with the trigger that caused it.
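As a rough illustration of the escalation described above, the sketch below walks a list of tiers in order and climbs only when a failure signal fires, logging the trigger each time. It is not Scrapo's implementation: `FetchResult`, `Tier`, `detect_failure`, and `escalate` are hypothetical names, the tier fetchers are fakes, and the signal checks are simplified stand-ins for the real ones.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class FetchResult:
    status: int
    body: str


@dataclass
class Tier:
    level: int
    name: str
    fetch: Callable[[str], FetchResult]  # hypothetical per-tier fetcher


def detect_failure(result: FetchResult) -> Optional[str]:
    """Return the name of the failure signal, or None if the result is usable."""
    if result.status != 200:
        return f"status:{result.status}"
    if not result.body.strip():
        return "empty_body"
    # Simplified stand-in for "unrendered SPA shell" detection.
    if '<div id="root"></div>' in result.body:
        return "unrendered_spa_shell"
    return None


def escalate(
    url: str, tiers: List[Tier], max_tier: Optional[int] = None
) -> Tuple[Optional[FetchResult], List[Tuple[int, str]]]:
    """Try tiers in order; record (tier level, trigger) for every escalation."""
    log: List[Tuple[int, str]] = []
    for tier in tiers:
        if max_tier is not None and tier.level > max_tier:
            break  # the caller capped the ceiling
        result = tier.fetch(url)
        trigger = detect_failure(result)
        if trigger is None:
            return result, log
        log.append((tier.level, trigger))
    return None, log  # every allowed tier failed


# Fake fetchers so the sketch runs without a network.
tiers = [
    Tier(0, "plain_http", lambda url: FetchResult(403, "")),
    Tier(1, "session", lambda url: FetchResult(200, '<div id="root"></div>')),
    Tier(2, "browser", lambda url: FetchResult(200, "<h1>Rendered</h1>")),
]

result, log = escalate("https://example.com", tiers, max_tier=2)
capped_result, capped_log = escalate("https://example.com", tiers, max_tier=1)
```

With the fake fetchers above, the uncapped call succeeds on the browser tier after two logged escalations, while the call capped at tier 1 returns no result and the same two log entries.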
The first call to a new shape pays an LLM tax. Every subsequent call uses the cached selectors. When the page drifts, the cache rebuilds itself.
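The cache-then-rebuild behavior can be pictured with a small sketch. Everything here is an assumption made for illustration: `SelectorCache`, `fake_llm_derive`, and the dict-based `Page` (a stand-in for a parsed DOM that maps a CSS selector to its text) are hypothetical and do not reflect Scrapo's internals.

```python
from typing import Callable, Dict, Optional

Page = Dict[str, str]       # stand-in for a parsed DOM: selector -> text
Selectors = Dict[str, str]  # schema field -> CSS selector


class SelectorCache:
    def __init__(self, derive: Callable[[Page], Selectors]) -> None:
        self._derive = derive  # the expensive step (an LLM call in the text)
        self._cache: Dict[str, Selectors] = {}
        self.derive_calls = 0

    @staticmethod
    def _apply(page: Page, selectors: Selectors) -> Optional[Dict[str, str]]:
        try:
            return {field: page[sel] for field, sel in selectors.items()}
        except KeyError:
            return None  # a cached selector no longer matches: the page drifted

    def extract(self, shape_key: str, page: Page) -> Dict[str, str]:
        selectors = self._cache.get(shape_key)
        if selectors is not None:
            data = self._apply(page, selectors)
            if data is not None:
                return data  # cache hit: no derivation cost
        # Cache miss or drift: derive again and overwrite the cached entry.
        self.derive_calls += 1
        selectors = self._derive(page)
        self._cache[shape_key] = selectors
        data = self._apply(page, selectors)
        if data is None:
            raise ValueError("derived selectors do not match the page")
        return data


def fake_llm_derive(page: Page) -> Selectors:
    # Toy replacement for the LLM: pick the first selector mentioning "title".
    return {"title": next(sel for sel in page if "title" in sel)}


cache = SelectorCache(fake_llm_derive)
first = cache.extract("shop", {"h1.title": "Widget"})            # derives
second = cache.extract("shop", {"h1.title": "Gadget"})           # cached
drifted = cache.extract("shop", {"h2.product-title": "Gizmo"})   # re-derives
```

In this toy run the derivation step fires twice: once for the first page and once after the markup changes from `h1.title` to `h2.product-title`.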
# A markdown-clean fetch with provenance.
import asyncio
import scrapo

async def main():
    res = await scrapo.scrape("https://news.ycombinator.com/")
    print(res.markdown)              # clean markdown
    print(res.chunks[0].provenance)  # url · selector · byte range · heading trail
    print(res.run_id)                # replayable, diffable, archived

asyncio.run(main())
# Typed extraction: LLM once, selectors forever.
import asyncio
from pydantic import BaseModel
import scrapo

class Offer(BaseModel):
    title: str
    price: float
    currency: str = "USD"

class Listing(BaseModel):
    page_title: str
    offers: list[Offer] = []

async def main(url: str) -> Listing:
    # First call: LLM extracts + caches CSS selectors.
    # Every call after: zero LLM cost, selector-based.
    res = await scrapo.scrape(url, schema=Listing)
    listing: Listing = res.data
    return listing

# Placeholder URL; substitute the listing page you want to extract.
asyncio.run(main("https://example.com"))
# Single page, with screenshot and markdown export.
scrapo scrape https://example.com --out-md page.md --screenshot
# Recursive crawl with depth and page budget.
scrapo crawl https://docs.python.org/3/ --max-depth 2 --max-pages 100
# Deterministic replay from archived HTML — no network.
scrapo replay <run_id>
scrapo diff <run_a> <run_b>
# Expose as an MCP server for Claude Code, Claude Desktop, Cursor.
scrapo mcp
Every fetch is archived. Re-extract from yesterday's HTML. Diff two runs field-by-field. The audit trail is the database.
The first run is LLM-driven; selectors are cached afterward. When a page drifts and the cache fails, Scrapo re-derives — quietly, automatically.
IP-obfuscation detection blocks internal targets. Opt-in robots gate. PII redaction on snapshots. Append-only audit log.
One command — scrapo mcp — and Claude Code, Claude Desktop, and Cursor get scrape, crawl, replay, diff as first-class tools.
Strict mode prevents silent LLM drift in production. Pin the model, the prompt, the schema version. Promotions are intentional.
Every extracted chunk carries its source: URL, selector path, byte range, heading trail. You can always trace a value back to its sentence.
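To make the field-by-field diff mentioned above concrete, here is a minimal sketch that compares two extraction results held as nested dicts. The function name `diff_fields` and the sample data are hypothetical; this is not the output format of `scrapo diff`.

```python
from typing import Any, Dict, Tuple


def diff_fields(
    a: Dict[str, Any], b: Dict[str, Any], prefix: str = ""
) -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted.field.path: (old, new)} for every field that differs.

    A missing field is treated the same as a field set to None.
    """
    changes: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        old, new = a.get(key), b.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse into nested objects so the path names the exact field.
            changes.update(diff_fields(old, new, prefix=f"{path}."))
        elif old != new:
            changes[path] = (old, new)
    return changes


# Made-up results from two runs of the same page.
run_a = {"page_title": "Deals", "offer": {"title": "Widget", "price": 9.99, "currency": "USD"}}
run_b = {"page_title": "Deals", "offer": {"title": "Widget", "price": 12.49, "currency": "USD"}}

changes = diff_fields(run_a, run_b)
```

For the sample data, only `offer.price` is reported as changed; unchanged fields are omitted from the result.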