---
name: gottem
description: "Fetch live web pages through the gottem scraping API — one HTTP call that auto-routes across Firecrawl, Spider, ZenRows, Brightdata, Zyte, Apify, Oxylabs, Browserbase, Browserless, and more. Use this skill whenever the user asks to scrape, fetch, download, extract, crawl, or read a public URL — especially when they mention anti-bot pages, dynamic JavaScript content, or want to validate the same page across multiple providers. Also use it for LLM training corpora that need cross-source verification (gottem's /v1/compare runs the URL through several scrapers and confirms by SHA-256 which one is the ground truth). Authenticate with the GOTTEM_KEY environment variable; the dashboard is at https://gottem.dev/dashboard/keys."
---

# gottem

One web-scraping API in front of every major scraper. Hand it a URL,
get back the page plus a small routing receipt — which vendor answered,
which tier the request resolved at, what it cost, how long it took.

Drop this skill into any agent that needs reliable scraping. It
handles vendor selection, auto-escalation through anti-bot tiers, and
content verification for you. No SDK to install — every call is a
single HTTP request with `curl`, `fetch`, `requests`, or whatever the
agent already has.

## When to invoke

Trigger this skill when the user asks to:

- **Scrape / fetch / download** a URL (one-shot or batched).
- **Read a page** that's blocked, paywalled, JS-rendered, captcha-gated,
  or geo-fenced. gottem escalates automatically.
- **Verify content across sources** — useful for LLM training corpora,
  fact-checking, news ingestion, or any pipeline where a single
  scraper getting fooled by an anti-bot challenge would poison the
  dataset.
- **Switch providers** without rewriting code — same JSON regardless
  of which vendor handled the fetch.
- **BYOK** — store a vendor key once, route through it for a flat
  infra fee instead of pay-per-credit.

## Setup

```bash
export GOTTEM_KEY=gtm_••••••••••••••••••••••••
export GOTTEM_BASE_URL=https://api.gottem.dev
```

A key is created from the dashboard at
https://gottem.dev/dashboard/keys. Top up credits any time
(1 credit = $0.0001).
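
To budget a top-up, the stated rate converts directly. A minimal sketch; the helper names are illustrative and not part of the gottem API:

```python
from decimal import Decimal

# Rate stated above: 1 credit = $0.0001.
USD_PER_CREDIT = Decimal("0.0001")


def credits_to_usd(credits):
    """Convert a credit amount to US dollars without float rounding."""
    return Decimal(str(credits)) * USD_PER_CREDIT


def usd_to_credits(usd):
    """Convert a dollar amount to the number of credits it buys."""
    return Decimal(str(usd)) / USD_PER_CREDIT
```

`Decimal` is used so that small per-page charges add up exactly.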

## Core recipes

### 1. Fetch one URL — auto-routed

The default. gottem starts on the lowest-cost viable route and escalates
only on failure.

```bash
curl -sS "$GOTTEM_BASE_URL/scrape" \
  -H "Authorization: Bearer $GOTTEM_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

Returns:

```json
{
  "url": "https://example.com",
  "status": 200,
  "provider": "spider",
  "route": "spider.fetch",
  "tier": 0,
  "cost_credits": 0,
  "elapsed_ms": 412,
  "attempt": 1,
  "content_bytes": 1256,
  "content": "<!doctype html>…"
}
```
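
The same call from Python, as a sketch using only the standard library. The function names and the 60-second timeout are assumptions; the endpoint, headers, and response fields are the ones shown above.

```python
import json
import os
import urllib.request


def build_scrape_request(url, base_url=None, key=None):
    """Build (but do not send) the POST /scrape request shown above."""
    base_url = base_url or os.environ.get("GOTTEM_BASE_URL", "https://api.gottem.dev")
    key = key or os.environ["GOTTEM_KEY"]  # read from env, never hard-coded
    return urllib.request.Request(
        f"{base_url}/scrape",
        data=json.dumps({"url": url}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def routing_receipt(response):
    """Pick the routing fields out of a decoded /scrape response."""
    fields = ("provider", "route", "tier", "cost_credits", "elapsed_ms", "attempt")
    return {name: response.get(name) for name in fields}


def scrape(url):
    """Send the request and decode the JSON body (a real, billable fetch)."""
    # The 60-second timeout is an arbitrary choice for this sketch.
    with urllib.request.urlopen(build_scrape_request(url), timeout=60) as resp:
        return json.load(resp)
```

Calling `routing_receipt(scrape("https://example.com"))` would return just the receipt, without the page body.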

### 2. Verify content across providers

Use this when you can't trust a single scraper — anti-bot challenge
pages, soft-failures with half-rendered DOM, partial loads. gottem
fans the URL across the routes you list, hashes every response with
SHA-256, and reports who agrees. The `best` route is picked
deterministically: good quality first, then lowest cost, then route id
as the tie-breaker.

```bash
curl -sS "$GOTTEM_BASE_URL/v1/compare" \
  -H "Authorization: Bearer $GOTTEM_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "routes": ["firecrawl.scrape", "zyte.api", "spider.smart"]
  }' | jq '.best, .good_count'
```

Returns a `results[]` array (one entry per route), a `variants[]`
array (distinct content collapsed by SHA), and `best` / `good_count`
/ `total_cost_credits`. Every fetch is real and billed — the merge
only dedupes the payload.
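
To make the grouping and selection ideas concrete, here is a client-side illustration. It is not gottem's implementation, and the entry keys `route`, `content`, `good`, and `cost_credits` are hypothetical stand-ins for whatever `results[]` actually carries.

```python
import hashlib
from collections import defaultdict


def group_by_sha256(results):
    """Collapse responses with byte-identical content into variants."""
    routes_by_digest = defaultdict(list)
    for entry in results:
        digest = hashlib.sha256(entry["content"].encode("utf-8")).hexdigest()
        routes_by_digest[digest].append(entry["route"])
    variants = [
        {"sha256": digest, "routes": sorted(routes)}
        for digest, routes in routes_by_digest.items()
    ]
    # Most agreement first; the digest breaks ties so the order is stable.
    variants.sort(key=lambda v: (-len(v["routes"]), v["sha256"]))
    return variants


def pick_best(results):
    """Order by good quality, then lowest cost, then route id."""
    if not results:
        return None
    # `not good` sorts good entries (False) ahead of bad ones (True).
    # Falling back to a non-good route when none is good is an assumption.
    best = min(results, key=lambda r: (not r["good"], r["cost_credits"], r["route"]))
    return best["route"]
```

Two routes that return the same bytes land in one variant, so a lone challenge page stands out as a variant with a single route.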

Threshold rule of thumb: keep rows where `good_count >= 3` for
training data; quarantine the rest with full provenance.
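
The threshold rule could be applied like this. A sketch that assumes each row is a decoded `/v1/compare` response; the function name is illustrative.

```python
def partition_rows(rows, min_good=3):
    """Split compare responses into rows to keep and rows to quarantine."""
    keep, quarantine = [], []
    for row in rows:
        if row.get("good_count", 0) >= min_good:
            keep.append(row)
        else:
            # The whole response is retained so its provenance
            # (results, variants) travels with the quarantined row.
            quarantine.append(row)
    return keep, quarantine
```

Rows missing `good_count` are treated as zero and quarantined rather than dropped.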

### 3. Bring your own vendor key

Already paying Firecrawl, Brightdata, or another vendor directly?
Store the key once, AES-256-GCM at rest. gottem routes through it
and charges only the flat infra fee.

```bash
curl -sS "$GOTTEM_BASE_URL/v1/byok/keys" \
  -H "Authorization: Bearer $GOTTEM_KEY" \
  -H "Content-Type: application/json" \
  -d '{"vendor": "firecrawl", "key": "fc_••••••••"}'
```

### 4. Pin a specific provider

When a workflow needs an exact vendor (legal, compliance, parity
testing), pass `force_provider`. The catalog is listed at `GET /routes`.

```bash
curl -sS "$GOTTEM_BASE_URL/scrape" \
  -H "Authorization: Bearer $GOTTEM_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "force_provider": "brightdata.scrape"
  }'
```

### 5. Crawl a site as NDJSON

When the user wants many pages (a whole site / section), use `/crawl`
instead of looping `/scrape`. The response streams as NDJSON, and
server memory stays constant.

```bash
curl -sN "$GOTTEM_BASE_URL/crawl" \
  -H "Authorization: Bearer $GOTTEM_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url":     "https://example.com",
    "limit":   50,
    "depth":   2,
    "formats": ["markdown"],
    "return_links": true
  }'
```

Each `\n`-delimited line is a complete JSON object for one page (or one
in-band error). See the **Crawling — `/crawl`** section below for the
full request / response shape, engines, format pipeline, and billing.

## Modes (for `/scrape`)

- `ladder` (default): lowest-cost-first, escalate on failure.
- `race`: try several providers in parallel, take the fastest valid response.
- `hedge`: primary route plus staggered backups; falls back if the primary stalls.

```bash
curl -sS "$GOTTEM_BASE_URL/scrape" \
  -H "Authorization: Bearer $GOTTEM_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://target.example", "mode": "race"}'
```
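
As a mental model for `ladder`, the escalation can be simulated locally. This is a conceptual sketch, not the server's code: `fetch` is a stand-in callable, and the route costs in any usage are made up.

```python
def ladder(routes, fetch):
    """Try routes from cheapest to most expensive; return the first success.

    `routes` is a list of (route_id, cost) pairs. `fetch(route_id)` returns
    page content or raises an exception on failure.
    """
    attempt = 0
    for route_id, cost in sorted(routes, key=lambda r: (r[1], r[0])):
        attempt += 1
        try:
            content = fetch(route_id)
        except Exception:
            continue  # escalate to the next, more expensive route
        return {"route": route_id, "cost": cost, "attempt": attempt, "content": content}
    return None  # every route failed
```

`race` and `hedge` differ only in scheduling: they start several routes concurrently instead of one at a time.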

## Crawling — `/crawl` (streaming NDJSON)

When the user wants **many pages** (a whole site, a section, a sitemap),
hit `POST /crawl` instead of looping `/scrape`. The response is
**NDJSON streaming** — one JSON object per line, flushed as each page
lands. Server memory stays constant regardless of crawl size.

```bash
curl -sN "$GOTTEM_BASE_URL/crawl" \
  -H "Authorization: Bearer $GOTTEM_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "url":     "https://target.example",
        "limit":   50,
        "depth":   2,
        "formats": ["markdown"],
        "return_links": true
      }'
```

Engines (`"engine"` field):

- `"auto"` (default) — Spider's native `/crawl` if the account has a Spider key, otherwise `"local"`.
- `"spider_cloud"` — single round-trip to Spider's `/crawl`, JSONL streamed back.
- `"local"` — gottem-side BFS over the scrape ladder. Each fetched URL gets the same provider escalation as a `/scrape` call. Link discovery uses spider's primitives on already-fetched bytes (no double-fetch). The visited set, depth limits, allow / deny rules, and robots handling are all delegated to spider.

Per-page response shape (one NDJSON line):

```json
{
  "url":              "https://target.example/page",
  "depth":            1,
  "status":           200,
  "content":          "# Page …",
  "content_by_format": { "markdown": "# Page …" },
  "links":            ["https://target.example/other"],
  "route_id":         "firecrawl.scrape",
  "tier":             4,
  "cost_milli":       10,
  "credits_charged":  "0.0100",
  "elapsed_ms":       842
}
```

Per-page errors become NDJSON error lines (`{"error":"…","code":"…"}`); the crawl continues.
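
A client can separate page lines from error lines as they arrive. A minimal sketch; the function name is illustrative, and skipping blank lines is a defensive assumption rather than documented behavior.

```python
import json


def split_ndjson(lines):
    """Separate page records from in-band error records in a /crawl stream."""
    pages, errors = [], []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines defensively
        record = json.loads(raw)
        # Error lines are recognized by their "error" key.
        (errors if "error" in record else pages).append(record)
    return pages, errors
```

`lines` can be any iterable of `str` or `bytes` lines, such as an open HTTP response object iterated line by line.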

Format pipeline — when `formats` is non-empty, the cloud runs the same `spider_transformations` pass that `/scrape` uses, **per page**, on the returned bytes. Non-Spider routes that return HTML are transformed into markdown / text / screenshot server-side; routes that already return markdown pass through. HTML and screenshot outputs are omitted from pages where the source can't produce them.

Dynamic params — `param: { "k": "v", ... }` flows into every per-page request's `{{param:k}}` template slots. Use this for vendor-specific knobs without redeploys.
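
To picture what a template slot does, here is a local illustration of the substitution. It is not the server's implementation: the allowed key characters and the choice to leave unknown slots untouched are both assumptions.

```python
import re

# Matches slots such as {{param:country}}; the key charset is an assumption.
SLOT = re.compile(r"\{\{param:([A-Za-z0-9_]+)\}\}")


def fill_slots(template, params):
    """Replace each {{param:k}} slot with params[k]; leave unknown slots as-is."""
    return SLOT.sub(lambda m: str(params.get(m.group(1), m.group(0))), template)
```

With `{"country": "us"}`, a template containing `{{param:country}}` would have that slot replaced by `us`.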

Cancellation — close the connection without reading the rest of the response body. The server detects the disconnect, fires the orchestrator's cancel token, and aborts in-flight fetches.

Billing — per page. Each emitted line carries its `credits_charged` so the client can compute running cost mid-stream.
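
Running cost can be tracked while the stream is still open. A sketch that assumes `credits_charged` arrives as a decimal string, as in the sample line above, and that error lines carry no charge.

```python
import json
from decimal import Decimal


def running_cost(lines):
    """Yield (url, cumulative credits) after each page line in a crawl stream."""
    total = Decimal("0")
    for raw in lines:
        record = json.loads(raw)
        if "error" in record:
            continue  # assumption: error lines are not charged
        total += Decimal(str(record.get("credits_charged", "0")))
        yield record["url"], total
```

Because it is a generator, the caller can stop the crawl as soon as `total` crosses a budget.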

## Error shape

```json
{ "error": "human-readable message", "code": "MACHINE_READABLE_CODE" }
```

Common codes the skill should handle:

- `INVALID_URL` — fix the URL and retry.
- `INSUFFICIENT_CREDITS` — surface to the user; top up at
  https://gottem.dev/dashboard.
- `RATE_LIMIT_EXCEEDED` — back off using the `Retry-After` header
  on the 429.
- `EXHAUSTED` — every route in the band failed; try a higher
  `tier_max`, switch `mode`, or report blocked.
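
The handling above can be sketched as a small dispatcher. The action labels are invented for illustration, and `Retry-After` is assumed to be a number of seconds, with a 1-second fallback when it is missing or not numeric.

```python
def next_action(status, headers, body):
    """Map an error response to what the skill should do next."""
    code = body.get("code")
    if code == "INVALID_URL":
        return ("fix_url", None)
    if code == "INSUFFICIENT_CREDITS":
        return ("ask_user_to_top_up", "https://gottem.dev/dashboard")
    if code == "RATE_LIMIT_EXCEEDED" or status == 429:
        try:
            delay = float(headers.get("Retry-After", 1))
        except (TypeError, ValueError):
            delay = 1.0  # header present but not a plain number of seconds
        return ("retry_after", delay)
    if code == "EXHAUSTED":
        return ("raise_tier_or_report_blocked", None)
    # Unknown code: pass the human-readable message through.
    return ("surface_error", body.get("error"))
```

Keeping the mapping in one function gives every recipe the same failure behavior.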

## Patterns the skill should follow

- **Never log the key.** Read `GOTTEM_KEY` from env every call; never
  print it back to the user or include it in error messages.
- **Always pass `Content-Type: application/json`** on `POST`s.
- **Treat the response as the source of truth.** The `provider` and
  `route` fields tell you who actually answered — don't assume.
- **For training pipelines: prefer `/v1/compare` with three or more
  routes**, and quarantine rows where `good_count < 3` instead of
  dropping silently.
- **For interactive UX: prefer `/scrape` with `mode: "race"`** — the
  fastest valid response wins.
- **Pricing:** 1 credit = $0.0001. Successful fetches are billed; the
  `cost_credits` in the response is the truth, not a quote.
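
One way to honor the first rule is to scrub the key from any text before it is logged or shown. A minimal sketch; the helper name and the `[REDACTED]` marker are illustrative.

```python
import os


def redact(message, key=None):
    """Remove the API key from text before logging or showing it to the user."""
    if key is None:
        key = os.environ.get("GOTTEM_KEY", "")
    # An empty key would match everywhere, so leave the message untouched.
    return message.replace(key, "[REDACTED]") if key else message
```

Wrapping every error string in `redact(...)` keeps the key out of messages that echo request headers.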

## Useful links

- Hosted API: `https://api.gottem.dev`
- Dashboard / keys: `https://gottem.dev/dashboard/keys`
- HTTP reference: `https://gottem.dev/docs`
- Open-source engine: `https://github.com/spider-rs/gottem`
- Machine-readable index: `https://gottem.dev/llms.txt`
