# Advanced Scraping Guide
Source: https://docs.firecrawl.dev/advanced-scraping-guide
Configure scrape options, browser actions, crawl, map, and the agent endpoint with Firecrawl's full API surface.
Reference for every option across Firecrawl's scrape, crawl, map, and agent endpoints.
## Basic scraping
To scrape a single page and get clean markdown content, use the `/scrape` endpoint.
```python Python theme={null}
# pip install firecrawl-py
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
doc = firecrawl.scrape("https://firecrawl.dev")
print(doc.markdown)
```
```js Node theme={null}
// npm install @mendable/firecrawl-js
import { Firecrawl } from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });
const doc = await firecrawl.scrape('https://firecrawl.dev');
console.log(doc.markdown);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://docs.firecrawl.dev"
}'
```
## Scraping PDFs
Firecrawl supports PDFs. Use the `parsers` option (e.g., `parsers: ["pdf"]`) when you want to ensure PDF parsing. You can control the parsing strategy with the `mode` option:
* **`auto`** (default) — attempts fast text-based extraction first, then falls back to OCR if needed.
* **`fast`** — text-based parsing only (embedded text). Fastest, but skips scanned/image-heavy pages.
* **`ocr`** — forces OCR parsing on every page. Use for scanned documents or when `auto` misclassifies a page.
`{ type: "pdf" }` and `"pdf"` both default to `mode: "auto"`.
```json theme={null}
"parsers": [{ "type": "pdf", "mode": "fast", "maxPages": 50 }]
```
## Scrape options
When using the `/scrape` endpoint, you can customize the request with the following options.
### Formats (`formats`)
The `formats` array controls which output types the scraper returns. Default: `["markdown"]`.
**String formats**: pass the name directly (e.g. `"markdown"`).
| Format | Description |
| ---------- | ---------------------------------------------------------------------------- |
| `markdown` | Page content converted to clean Markdown. |
| `html` | Processed HTML with unnecessary elements removed. |
| `rawHtml` | Original HTML exactly as returned by the server. |
| `links` | All links found on the page. |
| `images` | All images found on the page. |
| `summary` | An LLM-generated summary of the page content. |
| `branding` | Brand identity (colors, fonts, typography, spacing, UI components). |
**Object formats**: pass an object with `type` and additional options.
| Format | Options | Description |
| ---------------- | ---------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| `json` | `prompt?: string`, `schema?: object` | Extract structured data using an LLM. Provide a JSON schema and/or a natural-language prompt (max 10,000 characters). |
| `screenshot` | `fullPage?: boolean`, `quality?: number`, `viewport?: { width, height }` | Capture a screenshot. Max one per request. Viewport max resolution is 7680×4320. Screenshot URLs expire after 24 hours. |
| `changeTracking` | `modes?: ("json" \| "git-diff")[]`, `tag?: string`, `schema?: object`, `prompt?: string` | Track changes between scrapes. Requires `"markdown"` to also be in the formats array. |
| `attributes` | `selectors: [{ selector: string, attribute: string }]` | Extract specific HTML attributes from elements matching CSS selectors. |
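For example, a single request can mix string and object formats. A minimal sketch (the prompt and schema here are illustrative, and the structured result is read from the document's `json` field):
```typescript theme={null}
import { Firecrawl } from '@mendable/firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });

// Request markdown plus LLM-extracted structured data in one call.
const doc = await firecrawl.scrape('https://firecrawl.dev', {
  formats: [
    'markdown',
    {
      type: 'json',
      prompt: 'Extract the product name and tagline', // illustrative prompt
      schema: {
        type: 'object',
        properties: {
          name: { type: 'string' },
          tagline: { type: 'string' },
        },
      },
    },
  ],
});

console.log(doc.markdown);
console.log(doc.json); // structured output from the json format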
### Content filtering
These parameters control which parts of the page appear in the output. When `onlyMainContent` is `true` (the default), boilerplate (nav, footer, etc.) is stripped. `includeTags` and `excludeTags` are applied against the original page DOM, not the post-filtered result, so your selectors should target elements as they appear in the source HTML. Set `onlyMainContent: false` to use the full page as the starting point for tag filtering.
| Parameter | Type | Default | Description |
| ----------------- | --------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `onlyMainContent` | `boolean` | `true` | Return only the main content. Set `false` for the full page. |
| `includeTags` | `array` | — | CSS selectors to include — tags, classes, IDs, or attribute selectors (e.g. `["h1", "p", ".main-content", "[data-testid=\"main\"]"]`). |
| `excludeTags` | `array` | — | CSS selectors to exclude — tags, classes, IDs, or attribute selectors (e.g. `["#ad", "#footer", "[role=\"banner\"]"]`). |
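For instance, to start from the full page and then narrow the output to article content while dropping ads and the footer (the URL and selectors are illustrative):
```typescript theme={null}
import { Firecrawl } from '@mendable/firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });

// Tag filters run against the original DOM, so target source-HTML selectors.
const doc = await firecrawl.scrape('https://example.com', {
  onlyMainContent: false,
  includeTags: ['article', '.main-content'],
  excludeTags: ['#ad', '#footer'],
  formats: ['markdown'],
});

console.log(doc.markdown);
```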
### Timing and cache
| Parameter | Type | Default | Description |
| --------- | -------------- | ----------- | ------------------------------------------------------------------------------------------------------ |
| `waitFor` | `integer` (ms) | `0` | Extra wait time before scraping, on top of smart-wait. Use sparingly. |
| `maxAge` | `integer` (ms) | `172800000` | Return a cached version if fresher than this value (default is 2 days). Set `0` to always fetch fresh. |
| `timeout` | `integer` (ms) | `60000` | Max request duration before aborting (default is 60 seconds). Minimum is 1000 (1 second). |
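As a sketch, here is a request that bypasses the cache and allows extra settle time for a slow page (values are illustrative):
```typescript theme={null}
import { Firecrawl } from '@mendable/firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });

const doc = await firecrawl.scrape('https://example.com', {
  maxAge: 0,      // always fetch fresh, never serve from cache
  waitFor: 2000,  // extra 2s on top of smart-wait
  timeout: 30000, // abort if the request takes longer than 30s
});
```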
### PDF parsing
| Parameter | Type | Default | Description |
| --------- | ------- | --------- | -------------------------------------------------------------------------------- |
| `parsers` | `array` | `["pdf"]` | Controls PDF processing. `[]` to skip parsing and return base64 (1 credit flat). |
```json theme={null}
{ "type": "pdf", "mode": "fast" | "auto" | "ocr", "maxPages": 10 }
```
| Property | Type | Default | Description |
| ---------- | --------------------------- | ------------ | ------------------------------------------------------------------------------------- |
| `type` | `"pdf"` | *(required)* | Parser type. |
| `mode` | `"fast" \| "auto" \| "ocr"` | `"auto"` | `fast`: text-based extraction only. `auto`: fast with OCR fallback. `ocr`: force OCR. |
| `maxPages` | `integer` | — | Cap the number of pages to parse. |
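For example, forcing OCR on a scanned document while capping the page count (the URL and values are illustrative):
```typescript theme={null}
import { Firecrawl } from '@mendable/firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });

// Scanned PDFs have no embedded text layer, so force OCR.
const doc = await firecrawl.scrape('https://example.com/scanned-report.pdf', {
  parsers: [{ type: 'pdf', mode: 'ocr', maxPages: 20 }],
});

console.log(doc.markdown);
```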
### Actions
Run browser actions before scraping. This is useful for dynamic content, navigation, or user-gated pages. You can include up to 50 actions per request, and the combined wait time across all `wait` actions and `waitFor` must not exceed 60 seconds.
| Action | Parameters | Description |
| ------------------- | ------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `wait` | `milliseconds?: number`, `selector?: string` | Wait for a fixed duration **or** until an element is visible (provide one, not both). When using `selector`, times out after 30 seconds. |
| `click` | `selector: string`, `all?: boolean` | Click an element matching the CSS selector. Set `all: true` to click every match. |
| `write` | `text: string` | Type text into the currently focused field. You must focus the element with a `click` action first. |
| `press` | `key: string` | Press a keyboard key (e.g. `"Enter"`, `"Tab"`, `"Escape"`). |
| `scroll` | `direction?: "up" \| "down"`, `selector?: string` | Scroll the page or a specific element. Direction defaults to `"down"`. |
| `screenshot` | `fullPage?: boolean`, `quality?: number`, `viewport?: { width, height }` | Capture a screenshot. Max viewport resolution is 7680×4320. |
| `scrape` | *(none)* | Capture the current page HTML at this point in the action sequence. |
| `executeJavascript` | `script: string` | Run JavaScript code in the page. Return values are available in the `actions.javascriptReturns` array of the response. |
| `pdf` | `format?: string`, `landscape?: boolean`, `scale?: number` | Generate a PDF. Supported formats: `"A0"` through `"A6"`, `"Letter"`, `"Legal"`, `"Tabloid"`, `"Ledger"`. Defaults to `"Letter"`. |
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR-API-KEY')
doc = firecrawl.scrape(
    'https://example.com',
    actions=[
        { 'type': 'wait', 'milliseconds': 1000 },
        { 'type': 'click', 'selector': '#accept' },
        { 'type': 'scroll', 'direction': 'down' },
        { 'type': 'click', 'selector': '#q' },
        { 'type': 'write', 'text': 'firecrawl' },
        { 'type': 'press', 'key': 'Enter' },
        { 'type': 'wait', 'milliseconds': 2000 }
    ],
    formats=['markdown']
)
print(doc.markdown)
```
```js Node theme={null}
import { Firecrawl } from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });
const doc = await firecrawl.scrape('https://example.com', {
actions: [
{ type: 'wait', milliseconds: 1000 },
{ type: 'click', selector: '#accept' },
{ type: 'scroll', direction: 'down' },
{ type: 'click', selector: '#q' },
{ type: 'write', text: 'firecrawl' },
{ type: 'press', key: 'Enter' },
{ type: 'wait', milliseconds: 2000 }
],
formats: ['markdown']
});
console.log(doc.markdown);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://example.com",
"actions": [
{ "type": "wait", "milliseconds": 1000 },
{ "type": "click", "selector": "#accept" },
{ "type": "scroll", "direction": "down" },
{ "type": "click", "selector": "#q" },
{ "type": "write", "text": "firecrawl" },
{ "type": "press", "key": "Enter" },
{ "type": "wait", "milliseconds": 2000 }
],
"formats": ["markdown"]
}'
```
#### Action execution notes
* **Write** requires a preceding `click` to focus the target element.
* **Scroll** accepts an optional `selector` to scroll a specific element instead of the page.
* **Wait** accepts either `milliseconds` (fixed delay) or `selector` (wait until visible).
* Actions run **sequentially**: each step completes before the next begins.
* Actions are **not supported for PDFs**. If the URL resolves to a PDF, the request will fail.
#### Advanced action examples
**Taking a screenshot:**
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://example.com",
"actions": [
{ "type": "click", "selector": "#load-more" },
{ "type": "wait", "milliseconds": 1000 },
{ "type": "screenshot", "fullPage": true, "quality": 80 }
]
}'
```
**Clicking multiple elements:**
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://example.com",
"actions": [
{ "type": "click", "selector": ".expand-button", "all": true },
{ "type": "wait", "milliseconds": 500 }
],
"formats": ["markdown"]
}'
```
**Generating a PDF:**
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://example.com",
"actions": [
{ "type": "pdf", "format": "A4", "landscape": false }
]
}'
```
**Executing JavaScript (e.g. extracting embedded page data):**
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://example.com",
"actions": [
{ "type": "executeJavascript", "script": "document.querySelector(\"#__NEXT_DATA__\").textContent" }
],
"formats": ["markdown"]
}'
```
The return value of each `executeJavascript` action is captured in the `actions.javascriptReturns` array of the response.
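Via the SDKs, the same values are available on the scraped document. A minimal sketch, assuming the SDK surfaces action results on the document's `actions` field as in the raw API response:
```typescript theme={null}
import { Firecrawl } from '@mendable/firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });

const doc = await firecrawl.scrape('https://example.com', {
  actions: [
    { type: 'executeJavascript', script: 'document.title' },
  ],
  formats: ['markdown'],
});

// Each executeJavascript action contributes one entry, in order.
const returns = (doc as any).actions?.javascriptReturns;
console.log(returns?.[0]);
```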
### Full scrape example
The following request combines multiple scrape options:
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://docs.firecrawl.dev",
"formats": [
"markdown",
"links",
"html",
"rawHtml",
{ "type": "screenshot", "fullPage": true, "quality": 80 }
],
"includeTags": ["h1", "p", "a", ".main-content"],
"excludeTags": ["#ad", "#footer"],
"onlyMainContent": false,
"waitFor": 1000,
"timeout": 15000,
"parsers": ["pdf"]
}'
```
This request returns markdown, HTML, raw HTML, links, and a full-page screenshot. It scopes tag filtering to the listed `includeTags` and `excludeTags`, starts from the full page (`onlyMainContent: false`), waits an extra second before scraping, aborts after 15 seconds, and enables PDF parsing.
### Understanding the Frontend
The frontend uses AI Elements components to provide a complete chat interface:
**Key Features:**
* **Conversation Display**: The `Conversation` component automatically handles message scrolling and display
* **Message Rendering**: Each message part is rendered based on its type (text, reasoning, tool calls)
* **Tool Visualization**: Tool calls are displayed with collapsible sections showing inputs and outputs
* **Interactive Controls**: Users can toggle web search, select models, and attach files
* **Message Actions**: Copy and retry actions for assistant messages
To ensure the markdown from the LLM is correctly rendered, add the following import to your `app/globals.css` file:
```css theme={null}
@source "../node_modules/streamdown/dist/index.js";
```
This imports the necessary styles for rendering markdown content in the message responses.
Create the chat API endpoint at `app/api/chat/route.ts`. This route will handle incoming messages and stream responses from the AI.
```typescript theme={null}
import { streamText, UIMessage, convertToModelMessages } from "ai";
import { createOpenAI } from "@ai-sdk/openai";
const openai = createOpenAI({
apiKey: process.env.OPENAI_API_KEY!,
});
// Allow streaming responses up to 5 minutes
export const maxDuration = 300;
export async function POST(req: Request) {
const {
messages,
model,
webSearch,
}: {
messages: UIMessage[];
model: string;
webSearch: boolean;
} = await req.json();
const result = streamText({
model: openai(model),
messages: convertToModelMessages(messages),
system:
"You are a helpful assistant that can answer questions and help with tasks.",
});
// send sources and reasoning back to the client
return result.toUIMessageStreamResponse({
sendSources: true,
sendReasoning: true,
});
}
```
This basic route:
* Receives messages from the frontend
* Uses the OpenAI model selected by the user
* Streams responses back to the client
* Doesn't include tools yet - we'll add those next
Create a `.env.local` file in your project root:
```bash theme={null}
touch .env.local
```
Add your OpenAI API key:
```env theme={null}
OPENAI_API_KEY=sk-your-openai-api-key
```
The `OPENAI_API_KEY` is required for the AI model to function.
Now you can test the AI SDK chatbot without Firecrawl integration. Start the development server:
```bash theme={null}
npm run dev
```
Open [localhost:3000](http://localhost:3000) in your browser and test the basic chat functionality. The assistant should respond to messages, but won't have web scraping or search capabilities yet.
Now let's enhance the assistant with web scraping and search capabilities using Firecrawl.
### Install Firecrawl SDK
Firecrawl converts websites into LLM-ready formats with scraping and search capabilities:
```bash theme={null}
npm i @mendable/firecrawl-js
```
### Create the Tools File
Create a `lib` folder and add a `tools.ts` file inside it:
```bash theme={null}
mkdir lib && touch lib/tools.ts
```
Add the following code to define the web scraping and search tools:
```typescript lib/tools.ts theme={null}
import FirecrawlApp from "@mendable/firecrawl-js";
import { tool } from "ai";
import { z } from "zod";
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
export const scrapeWebsiteTool = tool({
description: 'Scrape content from any website URL',
inputSchema: z.object({
url: z.string().url().describe('The URL to scrape')
}),
execute: async ({ url }) => {
console.log('Scraping:', url);
const result = await firecrawl.scrape(url, {
formats: ['markdown'],
onlyMainContent: true,
timeout: 30000
});
console.log('Scraped content preview:', result.markdown?.slice(0, 200) + '...');
return { content: result.markdown };
}
});
export const searchWebTool = tool({
description: 'Search the web using Firecrawl',
inputSchema: z.object({
query: z.string().describe('The search query'),
limit: z.number().optional().describe('Number of results'),
location: z.string().optional().describe('Location for localized results'),
tbs: z.string().optional().describe('Time filter (qdr:h, qdr:d, qdr:w, qdr:m, qdr:y)'),
sources: z.array(z.enum(['web', 'news', 'images'])).optional().describe('Result types'),
categories: z.array(z.enum(['github', 'research', 'pdf'])).optional().describe('Filter categories'),
}),
execute: async ({ query, limit, location, tbs, sources, categories }) => {
console.log('Searching:', query);
const response = await firecrawl.search(query, {
...(limit && { limit }),
...(location && { location }),
...(tbs && { tbs }),
...(sources && { sources }),
...(categories && { categories }),
}) as { web?: Array<{ title?: string; url?: string; description?: string }> };
const results = (response.web || []).map((item) => ({
title: item.title || item.url || 'Untitled',
url: item.url || '',
description: item.description || '',
}));
console.log('Search results:', results.length);
return { results };
},
});
```
### Understanding the Tools
**Scrape Website Tool:**
* Accepts a URL as input (validated by Zod schema)
* Uses Firecrawl's `scrape` method to fetch the page as markdown
* Extracts only the main content to reduce token usage
* Returns the scraped content for the AI to analyze
**Search Web Tool:**
* Accepts a search query with optional filters
* Uses Firecrawl's `search` method to find relevant web pages
* Supports advanced filters like location, time range, and content categories
* Returns structured results with titles, URLs, and descriptions
Learn more about tools: [ai-sdk.dev/docs/foundations/tools](https://ai-sdk.dev/docs/foundations/tools).
Now update your `app/api/chat/route.ts` to include the Firecrawl tools we just created.
```typescript theme={null}
import { streamText, UIMessage, stepCountIs, convertToModelMessages } from "ai";
import { createOpenAI } from "@ai-sdk/openai";
import { scrapeWebsiteTool, searchWebTool } from "@/lib/tools";
const openai = createOpenAI({
apiKey: process.env.OPENAI_API_KEY!,
});
export const maxDuration = 300;
export async function POST(req: Request) {
const {
messages,
model,
webSearch,
}: {
messages: UIMessage[];
model: string;
webSearch: boolean;
} = await req.json();
const result = streamText({
model: openai(model),
messages: convertToModelMessages(messages),
system:
"You are a helpful assistant that can answer questions and help with tasks.",
// Add the Firecrawl tools here
tools: {
scrapeWebsite: scrapeWebsiteTool,
searchWeb: searchWebTool,
},
stopWhen: stepCountIs(5),
toolChoice: webSearch ? "auto" : "none",
});
return result.toUIMessageStreamResponse({
sendSources: true,
sendReasoning: true,
});
}
```
The key changes from the basic route:
* Import `stepCountIs` from the AI SDK
* Import the Firecrawl tools from `@/lib/tools`
* Add the `tools` object with both `scrapeWebsite` and `searchWeb` tools
* Add `stopWhen: stepCountIs(5)` to limit execution steps
* Set `toolChoice` to "auto" when web search is enabled, "none" otherwise
Learn more about `streamText`: [ai-sdk.dev/docs/reference/ai-sdk-core/stream-text](https://ai-sdk.dev/docs/reference/ai-sdk-core/stream-text).
Update your `.env.local` file to include your Firecrawl API key:
```env theme={null}
OPENAI_API_KEY=sk-your-openai-api-key
FIRECRAWL_API_KEY=fc-your-firecrawl-api-key
```
Get your Firecrawl API key from [firecrawl.dev](https://firecrawl.dev).
Restart your development server:
```bash theme={null}
npm run dev
```
Open [localhost:3000](http://localhost:3000) and test the enhanced assistant:
1. Toggle the "Search" button to enable web search
2. Ask: "What are the latest features from firecrawl.dev?"
3. Watch as the AI calls the `searchWeb` or `scrapeWebsite` tool
4. See the tool execution in the UI with inputs and outputs
5. Read the AI's analysis based on the scraped data
## How It Works
### Message Flow
1. **User sends a message**: The user types a question and clicks submit
2. **Frontend sends request**: `useChat` sends the message to `/api/chat` with the selected model and web search setting
3. **Backend processes message**: The API route receives the message and calls `streamText`
4. **AI decides on tools**: The model analyzes the question and decides whether to use `scrapeWebsite` or `searchWeb` (only if web search is enabled)
5. **Tools execute**: If tools are called, Firecrawl scrapes or searches the web
6. **AI generates response**: The model analyzes tool results and generates a natural language response
7. **Frontend displays results**: The UI shows tool calls and the final response in real-time
### Tool Calling Process
The AI SDK's tool calling system ([ai-sdk.dev/docs/foundations/tools](https://ai-sdk.dev/docs/foundations/tools)) works as follows:
1. The model receives the user's message and available tool descriptions
2. If the model determines a tool is needed, it generates a tool call with parameters
3. The SDK executes the tool function with those parameters
4. The tool result is sent back to the model
5. The model uses the result to generate its final response
This all happens automatically within a single `streamText` call, with results streaming to the frontend in real-time.
## Key Features
### Model Selection
The application supports multiple OpenAI models:
* **GPT-5 Mini (Thinking)**: Recent OpenAI model with advanced reasoning capabilities
* **GPT-4o Mini**: Fast and cost-effective model
Users can switch between models using the dropdown selector.
### Web Search Toggle
The Search button controls whether the AI can use Firecrawl tools:
* **Enabled**: AI can call `scrapeWebsite` and `searchWeb` tools as needed
* **Disabled**: AI responds only with its training knowledge
This gives users control over when to use web data versus the model's built-in knowledge.
## Customization Ideas
### Add More Tools
Extend the assistant with additional tools:
* Database lookups for internal company data
* CRM integration to fetch customer information
* Email sending capabilities
* Document generation
Each tool follows the same pattern: define a schema with Zod, implement the execute function, and register it in the `tools` object.
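For instance, a hypothetical internal document-lookup tool following that pattern (`findDocument` and its data source are placeholders for your own backend):
```typescript theme={null}
import { tool } from "ai";
import { z } from "zod";

// Placeholder for your own data source (database, CMS, etc.).
async function findDocument(id: string): Promise<string> {
  return `Contents of document ${id}`;
}

export const lookupDocumentTool = tool({
  description: "Fetch an internal document by its ID",
  inputSchema: z.object({
    id: z.string().describe("The document ID to fetch"),
  }),
  execute: async ({ id }) => {
    const content = await findDocument(id);
    return { content };
  },
});
```
Register it in the route's `tools` object (e.g. `lookupDocument: lookupDocumentTool`) and the model can call it just like the Firecrawl tools.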
### Change the AI Model
Swap OpenAI for another provider:
```typescript theme={null}
import { anthropic } from "@ai-sdk/anthropic";
const result = streamText({
  model: anthropic("claude-sonnet-4-5"),
// ... rest of config
});
```
The AI SDK supports 20+ providers with the same API. Learn more: [ai-sdk.dev/docs/foundations/providers-and-models](https://ai-sdk.dev/docs/foundations/providers-and-models).
### Customize the UI
AI Elements components are built on shadcn/ui, so you can:
* Modify component styles in the component files
* Add new variants to existing components
* Create custom components that match the design system
## Best Practices
1. **Use appropriate tools**: Choose `searchWeb` to find relevant pages first, `scrapeWebsite` for single pages, or let the AI decide
2. **Monitor API usage**: Track your Firecrawl and OpenAI API usage to avoid unexpected costs
3. **Handle errors gracefully**: The tools include error handling, but consider adding user-facing error messages
4. **Optimize performance**: Use streaming to provide immediate feedback and consider caching frequently accessed content
5. **Set reasonable limits**: The `stopWhen: stepCountIs(5)` setting prevents excessive tool calls and runaway costs
***
## Related Resources
* Explore the AI SDK for building AI-powered applications with streaming, tool calling, and multi-provider support.
* Pre-built UI components for AI applications built on shadcn/ui.
# Building a Brand Style Guide Generator with Firecrawl
Source: https://docs.firecrawl.dev/developer-guides/cookbooks/brand-style-guide-generator-cookbook
Generate professional PDF brand style guides by extracting design systems from any website using Firecrawl's branding format
Build a brand style guide generator that automatically extracts colors, typography, spacing, and visual identity from any website and compiles it into a professional PDF document.
## What You'll Build
A Node.js application that takes any website URL, extracts its complete brand identity using Firecrawl's branding format, and generates a polished PDF style guide with:
* Color palette with hex values
* Typography system (fonts, sizes, weights)
* Spacing and layout specifications
* Logo and brand imagery
* Theme information (light/dark mode)
## Prerequisites
* Node.js 18 or later installed
* A Firecrawl API key from [firecrawl.dev](https://firecrawl.dev)
* Basic knowledge of TypeScript and Node.js
Start by creating a new directory for your project and initializing it:
```bash theme={null}
mkdir brand-style-guide-generator && cd brand-style-guide-generator
npm init -y
```
Update your `package.json` to use ES modules:
```json package.json theme={null}
{
"name": "brand-style-guide-generator",
"version": "1.0.0",
"type": "module",
"scripts": {
"start": "npx tsx index.ts"
}
}
```
Install the required packages for web scraping and PDF generation:
```bash theme={null}
npm i @mendable/firecrawl-js pdfkit
npm i -D typescript tsx @types/node @types/pdfkit
```
These packages provide:
* `@mendable/firecrawl-js`: Firecrawl SDK for extracting brand identity from websites
* `pdfkit`: PDF document generation library
* `tsx`: TypeScript execution for Node.js
Create the main application file at `index.ts`. This script extracts brand identity from a URL and generates a professional PDF style guide.
```typescript index.ts theme={null}
import Firecrawl from "@mendable/firecrawl-js";
import PDFDocument from "pdfkit";
import fs from "fs";
const API_KEY = "fc-YOUR-API-KEY";
const URL = "https://firecrawl.dev";
async function main() {
const fc = new Firecrawl({ apiKey: API_KEY });
const { branding: b } = (await fc.scrape(URL, { formats: ["branding"] })) as any;
const doc = new PDFDocument({ size: "A4", margin: 50 });
doc.pipe(fs.createWriteStream("brand-style-guide.pdf"));
// Fetch logo (PNG/JPG only)
let logoImg: Buffer | null = null;
try {
const logoUrl = b.images?.favicon || b.images?.ogImage;
if (logoUrl?.match(/\.(png|jpg|jpeg)$/i)) {
const res = await fetch(logoUrl);
logoImg = Buffer.from(await res.arrayBuffer());
}
} catch {}
// Header with logo
doc.rect(0, 0, 595, 120).fill(b.colors?.primary || "#333");
const titleX = logoImg ? 130 : 50;
if (logoImg) doc.image(logoImg, 50, 30, { height: 60 });
doc.fontSize(36).fillColor("#fff").text("Brand Style Guide", titleX, 38);
doc.fontSize(14).text(URL, titleX, 80);
// Colors
doc.fontSize(18).fillColor("#333").text("Colors", 50, 160);
const colors = Object.entries(b.colors || {}).filter(([, v]) => typeof v === "string" && (v as string).startsWith("#"));
colors.forEach(([k, v], i) => {
const x = 50 + i * 100;
doc.rect(x, 195, 80, 80).fill(v as string);
doc.fontSize(10).fillColor("#333").text(k, x, 282, { width: 80, align: "center" });
doc.fontSize(9).fillColor("#888").text(v as string, x, 296, { width: 80, align: "center" });
});
// Typography
doc.fontSize(18).fillColor("#333").text("Typography", 50, 340);
doc.fontSize(13).fillColor("#444");
doc.text(`Primary Font: ${b.typography?.fontFamilies?.primary || "—"}`, 50, 370);
doc.text(`Heading Font: ${b.typography?.fontFamilies?.heading || "—"}`, 50, 392);
doc.fontSize(12).fillColor("#666").text("Font Sizes:", 50, 422);
Object.entries(b.typography?.fontSizes || {}).forEach(([k, v], i) => {
doc.text(`${k.toUpperCase()}: ${v}`, 70, 445 + i * 22);
});
// Spacing & Theme
doc.fontSize(18).fillColor("#333").text("Spacing & Theme", 320, 340);
doc.fontSize(13).fillColor("#444");
doc.text(`Base Unit: ${b.spacing?.baseUnit}px`, 320, 370);
doc.text(`Border Radius: ${b.spacing?.borderRadius}`, 320, 392);
doc.text(`Color Scheme: ${b.colorScheme}`, 320, 414);
doc.end();
console.log("Generated: brand-style-guide.pdf");
}
main();
```
For this simple project, the API key is placed directly in the code. If you plan to push this to GitHub or share it with others, move the key to a `.env` file and use `process.env.FIRECRAWL_API_KEY` instead.
Replace `fc-YOUR-API-KEY` with your Firecrawl API key from [firecrawl.dev](https://firecrawl.dev).
### Understanding the Code
**Key Components:**
* **Firecrawl Branding Format**: The `branding` format extracts comprehensive brand identity including colors, typography, spacing, and images
* **PDFKit Document**: Creates a professional A4 PDF with proper margins and sections
* **Color Swatches**: Renders visual color blocks with hex values and semantic names
* **Typography Display**: Shows font families and sizes in an organized layout
* **Spacing & Theme**: Documents the design system's spacing units and color scheme
Run the script to generate a brand style guide:
```bash theme={null}
npm start
```
The script will:
1. Extract the brand identity from the target URL using Firecrawl's branding format
2. Generate a PDF named `brand-style-guide.pdf`
3. Save it in your project directory
To generate a style guide for a different website, simply change the `URL` constant in `index.ts`.
## How It Works
### Extraction Process
1. **URL Input**: The generator receives a target website URL
2. **Firecrawl Scrape**: Calls the `/scrape` endpoint with the `branding` format
3. **Brand Analysis**: Firecrawl analyzes the page's CSS, fonts, and visual elements
4. **Data Return**: Returns a structured `BrandingProfile` object with all design tokens
### PDF Generation Process
1. **Header Creation**: Generates a colored header using the primary brand color
2. **Logo Fetch**: Downloads and embeds the logo or favicon if available
3. **Color Palette**: Renders each color as a visual swatch with metadata
4. **Typography Section**: Documents font families, sizes, and weights
5. **Spacing Info**: Includes base units, border radius, and theme mode
### Branding Profile Structure
The [branding format](https://docs.firecrawl.dev/features/scrape#%2Fscrape-with-branding-endpoint) returns detailed brand information:
```typescript theme={null}
{
colorScheme: "dark" | "light",
logo: "https://example.com/logo.svg",
colors: {
primary: "#FF6B35",
secondary: "#004E89",
accent: "#F77F00",
background: "#1A1A1A",
textPrimary: "#FFFFFF",
textSecondary: "#B0B0B0"
},
typography: {
fontFamilies: { primary: "Inter", heading: "Inter", code: "Roboto Mono" },
fontSizes: { h1: "48px", h2: "36px", body: "16px" },
fontWeights: { regular: 400, medium: 500, bold: 700 }
},
spacing: {
baseUnit: 8,
borderRadius: "8px"
},
images: {
logo: "https://example.com/logo.svg",
favicon: "https://example.com/favicon.ico"
}
}
```
## Key Features
### Automatic Color Extraction
The generator identifies and categorizes all brand colors:
* **Primary & Secondary**: Main brand colors
* **Accent**: Highlight and CTA colors
* **Background & Text**: UI foundation colors
* **Semantic Colors**: Success, warning, error states
### Typography Documentation
Captures the complete type system:
* **Font Families**: Primary, heading, and monospace fonts
* **Size Scale**: All heading and body text sizes
* **Font Weights**: Available weight variations
### Visual Output
The PDF includes:
* Color-coded header matching the brand
* Embedded logo when available
* Professional layout with clear hierarchy
* Metadata footer with generation date
## Customization Ideas
### Add Component Documentation
Extend the generator to include UI component styles:
```typescript theme={null}
// Add after the Spacing & Theme section
if (b.components) {
doc.addPage();
doc.fontSize(20).fillColor("#333").text("UI Components", 50, 50);
// Document button styles
if (b.components.buttonPrimary) {
const btn = b.components.buttonPrimary;
doc.fontSize(14).text("Primary Button", 50, 90);
doc.rect(50, 110, 120, 40).fill(btn.background);
doc.fontSize(12).fillColor(btn.textColor).text("Button", 50, 120, { width: 120, align: "center" });
}
}
```
### Export Multiple Formats
Add JSON export alongside the PDF:
```typescript theme={null}
// Add before doc.end()
fs.writeFileSync("brand-data.json", JSON.stringify(b, null, 2));
```
### Batch Processing
Generate guides for multiple websites:
```typescript theme={null}
const websites = [
"https://stripe.com",
"https://linear.app",
"https://vercel.com"
];
for (const site of websites) {
const { branding } = await fc.scrape(site, { formats: ["branding"] }) as any;
// Generate PDF for each site...
}
```
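To stay within rate limits during a batch run, you may want to pause between requests; a minimal sketch extending the loop above (the 2-second delay is an arbitrary example value):
```typescript theme={null}
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

for (const site of websites) {
  const { branding } = (await fc.scrape(site, { formats: ["branding"] })) as any;
  // Generate PDF for each site...
  await sleep(2000); // pause between requests; tune to your plan's limits
}
```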
### Custom PDF Themes
Create different PDF styles based on the extracted theme:
```typescript theme={null}
const isDarkMode = b.colorScheme === "dark";
const headerBg = isDarkMode ? b.colors?.background : b.colors?.primary;
const textColor = isDarkMode ? "#fff" : "#333";
```
## Best Practices
1. **Handle Missing Data**: Not all websites expose complete branding information. Always provide fallback values for missing properties.
2. **Respect Rate Limits**: When batch processing multiple sites, add delays between requests to respect Firecrawl's rate limits.
3. **Cache Results**: Store extracted branding data to avoid repeated API calls for the same site.
4. **Image Format Handling**: Some logos may be in formats PDFKit doesn't support (like SVG). Consider adding format conversion or graceful fallbacks.
5. **Error Handling**: Wrap the generation process in try-catch blocks and provide meaningful error messages.
***
## Related Resources
* Learn more about the branding format and all available properties you can extract.
* Complete API reference for the scrape endpoint with all format options.
* Learn more about PDFKit for advanced PDF customization options.
* Process multiple URLs efficiently with batch scraping.
# Full-Stack Templates
Source: https://docs.firecrawl.dev/developer-guides/examples
Explore real-world examples and tutorials for Firecrawl
## Official Examples
* [open-lovable](https://github.com/firecrawl/open-lovable): Build a RAG-powered chatbot with live web data
* [open-agent-builder](https://github.com/firecrawl/open-agent-builder): Build and deploy AI agents with web scraping capabilities
* [fireplexity](https://github.com/firecrawl/fireplexity): AI search engine with real-time citations, streaming responses, and live data
* [firegeo](https://github.com/firecrawl/firegeo): SaaS starter with brand monitoring, authentication, and billing
* [fire-enrich](https://github.com/firecrawl/fire-enrich): AI-powered data enrichment tool that transforms emails into rich datasets
* [firesearch](https://github.com/firecrawl/firesearch): AI-powered deep research tool with validated answers and citations
* [firestarter](https://github.com/firecrawl/firestarter): Instantly create AI chatbots for any website with RAG-powered search
* [ai-ready-website](https://github.com/firecrawl/ai-ready-website): Transform any website into AI-ready structured data
* [open-researcher](https://github.com/firecrawl/open-researcher): AI-powered research assistant that gathers and analyzes web data
# Anthropic
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/anthropic
Use Firecrawl with Claude for web scraping + AI workflows
Integrate Firecrawl with Claude to build AI applications powered by web data.
## Setup
```bash theme={null}
npm install @mendable/firecrawl-js @anthropic-ai/sdk zod zod-to-json-schema
```
Create `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
ANTHROPIC_API_KEY=your_anthropic_key
```
> **Note:** If using Node < 20, install `dotenv` and add `import 'dotenv/config'` to your code.
## Scrape + Summarize
This example demonstrates a simple workflow: scrape a website and summarize the content using Claude.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import Anthropic from '@anthropic-ai/sdk';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const scrapeResult = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const message = await anthropic.messages.create({
model: 'claude-haiku-4-5',
max_tokens: 1024,
messages: [
{ role: 'user', content: `Summarize in 100 words: ${scrapeResult.markdown}` }
]
});
console.log('Response:', message);
```
## Tool Use
This example shows how to use Claude's tool use feature to let the model decide when to scrape websites based on user requests.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { Anthropic } from '@anthropic-ai/sdk';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';
const firecrawl = new FirecrawlApp({
apiKey: process.env.FIRECRAWL_API_KEY
});
const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY
});
const ScrapeArgsSchema = z.object({
url: z.string()
});
console.log("Sending user message to Claude and requesting tool use if necessary...");
const response = await anthropic.messages.create({
model: 'claude-haiku-4-5',
max_tokens: 1024,
tools: [{
name: 'scrape_website',
description: 'Scrape and extract markdown content from a website URL',
input_schema: zodToJsonSchema(ScrapeArgsSchema, 'ScrapeArgsSchema') as any
}],
messages: [{
role: 'user',
content: 'What is Firecrawl? Check firecrawl.dev'
}]
});
const toolUse = response.content.find(block => block.type === 'tool_use');
if (toolUse && toolUse.type === 'tool_use') {
const input = toolUse.input as { url: string };
console.log(`Calling tool: ${toolUse.name} | URL: ${input.url}`);
const result = await firecrawl.scrape(input.url, {
formats: ['markdown']
});
console.log(`Scraped content preview: ${result.markdown?.substring(0, 300)}...`);
// Continue with the conversation or process the scraped content as needed
}
```
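To complete the loop, you can hand the scraped markdown back to Claude as a `tool_result` block so it can answer the original question. A sketch of that continuation (it belongs inside the `if` block above and reuses `anthropic`, `response`, `toolUse`, `result`, and the schema helpers from the example):
```typescript theme={null}
// Send the tool result back so Claude can compose a final answer.
const followUp = await anthropic.messages.create({
  model: 'claude-haiku-4-5',
  max_tokens: 1024,
  tools: [{
    name: 'scrape_website',
    description: 'Scrape and extract markdown content from a website URL',
    input_schema: zodToJsonSchema(ScrapeArgsSchema, 'ScrapeArgsSchema') as any
  }],
  messages: [
    { role: 'user', content: 'What is Firecrawl? Check firecrawl.dev' },
    { role: 'assistant', content: response.content },
    {
      role: 'user',
      content: [{
        type: 'tool_result',
        tool_use_id: toolUse.id,
        content: result.markdown ?? ''
      }]
    }
  ]
});

const answer = followUp.content.find(block => block.type === 'text');
if (answer && answer.type === 'text') console.log(answer.text);
```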
## Structured Extraction
This example demonstrates how to use Claude to extract structured data from scraped website content.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import Anthropic from '@anthropic-ai/sdk';
import { z } from 'zod';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const CompanyInfoSchema = z.object({
name: z.string(),
industry: z.string().optional(),
description: z.string().optional()
});
const scrapeResult = await firecrawl.scrape('https://stripe.com', {
formats: ['markdown'],
onlyMainContent: true
});
const prompt = `Extract company information from this website content.
Output ONLY valid JSON in this exact format (no markdown, no explanation):
{
"name": "Company Name",
"industry": "Industry",
"description": "One sentence description"
}
Website content:
${scrapeResult.markdown}`;
const message = await anthropic.messages.create({
model: 'claude-haiku-4-5',
max_tokens: 1024,
messages: [
{ role: 'user', content: prompt },
{ role: 'assistant', content: '{' }
]
});
const textBlock = message.content.find(block => block.type === 'text');
if (textBlock && textBlock.type === 'text') {
const jsonText = '{' + textBlock.text;
const companyInfo = CompanyInfoSchema.parse(JSON.parse(jsonText));
console.log(companyInfo);
}
```
For more examples, check the [Claude documentation](https://docs.anthropic.com/claude/docs).
# ElevenAgents
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/elevenagents
Give ElevenLabs voice and chat agents real-time web access with Firecrawl
Give your [ElevenAgents](https://elevenlabs.io/agents) voice and chat agents the ability to scrape, search, and crawl the web in real time using Firecrawl. This guide covers two integration paths:
1. **MCP server** — connect the hosted Firecrawl MCP server for zero-code setup.
2. **Server webhook tool** — point a custom tool at Firecrawl's REST API for full control over requests.
## Prerequisites
* An [ElevenLabs](https://elevenlabs.io) account with access to ElevenAgents
* A Firecrawl API key from your [firecrawl.dev dashboard](https://firecrawl.dev)
## Option 1: Firecrawl MCP Server
The fastest way to give an agent web access. ElevenAgents supports remote MCP servers, and Firecrawl provides a hosted MCP endpoint.
### Add the MCP server
1. Open the [Integrations page](https://elevenlabs.io/app/agents/integrations) in ElevenLabs and click **+ Add integration**.
2. Select **Custom MCP Server** from the integration library.
3. Fill in the following fields:
| Field | Value |
| --------------- | ------------------------------------------------------------ |
| **Name** | Firecrawl |
| **Description** | Search, scrape, crawl, and extract content from any website. |
| **Server type** | Streamable HTTP |
| **Server URL** | `https://mcp.firecrawl.dev/YOUR_FIRECRAWL_API_KEY/v2/mcp` |
Replace `YOUR_FIRECRAWL_API_KEY` with your actual key. Leave the **Type** dropdown set to **Value**. Treat this URL as a secret — it contains your API key.
You must select **Streamable HTTP** as the server type. The default SSE option does not work with the Firecrawl MCP endpoint.
4. Under **Tool Approval Mode**, choose an approval level:
* **No Approval** — the agent uses tools freely. Fine for read-only scraping.
* **Fine-Grained Tool Approval** — lets you pre-select which tools can run automatically and which require approval. Good for controlling expensive crawl operations.
* **Always Ask** (default) — the agent requests permission before every tool call.
5. Check **I trust this server**, then click **Add Server**.
ElevenLabs will connect to the server and list the available tools (scrape, search, crawl, map, and more).
### Attach it to an agent
1. Create or open an agent in the [ElevenAgents dashboard](https://elevenlabs.io/app/agents/agents).
2. Go to the **Tools** tab, then select the **MCP** sub-tab.
3. Click **Add server** and select the **Firecrawl** integration from the dropdown.
### Update the system prompt
In the **Agent** tab, add instructions to the **System prompt** so the agent knows when to use Firecrawl. For example:
```text theme={null}
You are a helpful research assistant. When the user asks about a website,
a company, or any topic that requires up-to-date information, use the
Firecrawl tools to search the web or scrape the relevant page, then
summarize the results.
```
### Test it
Click **Preview** in the top navigation bar. You can test using the text chat input or by starting a voice call. Try a prompt like:
> "What does firecrawl.dev do? Go to the site and summarize it for me."
The agent will call the Firecrawl MCP `scrape` tool, receive the page markdown, and respond with a summary.
***
## Option 2: Server Webhook Tool
Use this approach when you need precise control over request parameters (formats, headers, timeouts, etc.) or want to call a specific Firecrawl endpoint without exposing the full MCP tool set.
### Scrape tool
Create a tool that scrapes a single URL and returns its content as markdown.
1. Open your agent and go to the **Tools** tab.
2. Click **Add tool** and select **Webhook**.
3. Configure the tool:
| Field | Value |
| --------------- | ---------------------------------------------------------- |
| **Name** | `scrape_website` |
| **Description** | Scrape content from a URL and return it as clean markdown. |
| **Method** | POST |
| **URL** | `https://api.firecrawl.dev/v2/scrape` |
The **Method** field defaults to GET — make sure to change it to **POST**.
4. Scroll to the **Headers** section and click **Add header** for authentication:
| Header | Value |
| --------------- | ------------------------------- |
| `Authorization` | `Bearer YOUR_FIRECRAWL_API_KEY` |
Alternatively, if you have workspace auth connections configured, you can use the **Authentication** dropdown instead.
5. Add a **body parameter**:
| Parameter | Type | Description | Required |
| --------- | ------ | ----------------- | -------- |
| `url` | string | The URL to scrape | Yes |
6. Click **Add tool**.
The Firecrawl API returns the page content as markdown by default. The agent receives the JSON response and can use the `markdown` field to answer questions.
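For reference, the request this webhook tool sends is equivalent to the following fetch call (a sketch; the REST response nests the document under `data`):
```typescript theme={null}
const res = await fetch('https://api.firecrawl.dev/v2/scrape', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: 'Bearer YOUR_FIRECRAWL_API_KEY',
  },
  body: JSON.stringify({ url: 'https://firecrawl.dev' }),
});

const json = await res.json();
console.log(json.data?.markdown?.slice(0, 200)); // markdown lives on data.markdown
```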
### Search tool
Create a tool that searches the web and returns results with scraped content.
1. Click **Add tool** → **Webhook** again and configure:
| Field | Value |
| --------------- | ------------------------------------------------------------------------- |
| **Name** | `search_web` |
| **Description** | Search the web for a query and return relevant results with page content. |
| **Method** | POST |
| **URL** | `https://api.firecrawl.dev/v2/search` |
2. Add the same `Authorization` header as above.
3. Add **body parameters**:
| Parameter | Type | Description | Required |
| --------- | ------ | ----------------------------------------------- | -------- |
| `query` | string | The search query | Yes |
| `limit` | number | Maximum number of results to return (default 5) | No |
4. Click **Add tool**.
### Update the system prompt
In the **Agent** tab, update the **System prompt**:
```text theme={null}
You are a knowledgeable assistant with access to web tools.
- Use `scrape_website` when the user gives you a specific URL to read.
- Use `search_web` when the user asks a general question that requires
finding information online.
Always summarize the information concisely and cite the source URL.
```
### Test it
Click **Preview** and try asking:
> "Search for the latest Next.js features and give me a summary."
The agent will call `search_web`, receive results from Firecrawl, and respond with a summary of the findings.
***
## Tips
* **Model selection** — For reliable tool calling, use a high-intelligence model such as GPT-4o, Claude Sonnet 4.5 or later, or Gemini 2.5 Flash. Smaller models may struggle to extract the correct parameters.
* **Keep prompts specific** — Tell the agent exactly when to use each tool. Vague instructions lead to missed or incorrect tool calls.
* **Limit response size** — For voice agents, long scraped pages can overwhelm the LLM context. Use `onlyMainContent: true` in scrape options (or instruct the agent to summarize aggressively) to keep responses concise.
* **Tool call sounds** — In the webhook or MCP tool settings, you can configure a **Tool call sound** to play ambient audio while a tool runs. This signals to the user that the agent is working.
## Resources
* [ElevenAgents documentation](https://elevenlabs.io/docs/eleven-agents/overview)
* [ElevenAgents tools overview](https://elevenlabs.io/docs/eleven-agents/customization/tools)
* [ElevenAgents MCP integration](https://elevenlabs.io/docs/eleven-agents/customization/tools/mcp)
* [Firecrawl API reference](https://docs.firecrawl.dev/api-reference/v2-introduction)
* [Firecrawl MCP server](https://docs.firecrawl.dev/mcp-server)
# Gemini
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/gemini
Use Firecrawl with Google's Gemini AI for web scraping + AI workflows
Integrate Firecrawl with Google's Gemini for AI applications powered by web data.
## Setup
```bash theme={null}
npm install @mendable/firecrawl-js @google/genai
```
Create `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
GEMINI_API_KEY=your_gemini_key
```
> **Note:** If using Node < 20, install `dotenv` and add `import 'dotenv/config'` to your code.
## Scrape + Summarize
This example demonstrates a simple workflow: scrape a website and summarize the content using Gemini.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { GoogleGenAI } from '@google/genai';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const scrapeResult = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const response = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: `Summarize: ${scrapeResult.markdown}`,
});
console.log('Summary:', response.text);
```
## Content Analysis
This example shows how to analyze website content using Gemini's multi-turn conversation capabilities.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { GoogleGenAI } from '@google/genai';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const scrapeResult = await firecrawl.scrape('https://news.ycombinator.com/', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const chat = ai.chats.create({
model: 'gemini-2.5-flash'
});
// Ask for the top 3 stories on Hacker News
const result1 = await chat.sendMessage({
message: `Based on this website content from Hacker News, what are the top 3 stories right now?\n\n${scrapeResult.markdown}`
});
console.log('Top 3 Stories:', result1.text);
// Ask for the 4th and 5th stories on Hacker News
const result2 = await chat.sendMessage({
message: `Now, what are the 4th and 5th top stories on Hacker News from the same content?`
});
console.log('4th and 5th Stories:', result2.text);
```
## Structured Extraction
This example demonstrates how to extract structured data using Gemini's JSON mode from scraped website content.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { GoogleGenAI, Type } from '@google/genai';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const scrapeResult = await firecrawl.scrape('https://stripe.com', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const response = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: `Extract company information: ${scrapeResult.markdown}`,
config: {
responseMimeType: 'application/json',
responseSchema: {
type: Type.OBJECT,
properties: {
name: { type: Type.STRING },
industry: { type: Type.STRING },
description: { type: Type.STRING },
products: {
type: Type.ARRAY,
items: { type: Type.STRING }
}
},
propertyOrdering: ['name', 'industry', 'description', 'products']
}
}
});
console.log('Extracted company info:', response?.text);
```
For more examples, check the [Gemini documentation](https://ai.google.dev/docs).
# Agent Development Kit (ADK)
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/google-adk
Integrate Firecrawl with Google's ADK using MCP for advanced agent workflows
Integrate Firecrawl with Google's Agent Development Kit (ADK) to build powerful AI agents with web scraping capabilities through the Model Context Protocol (MCP).
## Overview
Firecrawl provides an MCP server that seamlessly integrates with Google's ADK, enabling your agents to efficiently scrape, crawl, and extract structured data from any website. The integration supports both cloud-based and self-hosted Firecrawl instances with streamable HTTP for optimal performance.
## Features
* Efficient web scraping, crawling, and content discovery from any website
* Advanced search capabilities and intelligent content extraction
* Deep research and high-volume batch scraping
* Flexible deployment (cloud-based or self-hosted)
* Optimized for modern web environments with streamable HTTP support
## Prerequisites
* Obtain an API key for Firecrawl from [firecrawl.dev](https://firecrawl.dev)
* Install Google ADK
## Setup
```python Remote MCP Server theme={null}
from google.adk.agents.llm_agent import Agent
from google.adk.tools.mcp_tool.mcp_session_manager import StreamableHTTPServerParams
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
FIRECRAWL_API_KEY = "YOUR-API-KEY"
root_agent = Agent(
model="gemini-2.5-pro",
name="firecrawl_agent",
description='A helpful assistant for scraping websites with Firecrawl',
instruction='Help the user search for website content',
tools=[
MCPToolset(
connection_params=StreamableHTTPServerParams(
url=f"https://mcp.firecrawl.dev/{FIRECRAWL_API_KEY}/v2/mcp",
),
)
],
)
```
```python Local MCP Server theme={null}
from google.adk.agents.llm_agent import Agent
from google.adk.tools.mcp_tool.mcp_session_manager import StdioConnectionParams
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
from mcp import StdioServerParameters
root_agent = Agent(
model='gemini-2.5-pro',
name='firecrawl_agent',
description='A helpful assistant for scraping websites with Firecrawl',
instruction='Help the user search for website content',
tools=[
MCPToolset(
connection_params=StdioConnectionParams(
server_params=StdioServerParameters(
command='npx',
args=[
"-y",
"firecrawl-mcp",
],
env={
"FIRECRAWL_API_KEY": "YOUR-API-KEY",
}
),
timeout=30,
),
)
],
)
```
## Available Tools
| Tool | Name | Description |
| ------------------ | ------------------------------ | ------------------------------------------------------------------------------------ |
| Scrape Tool | `firecrawl_scrape` | Scrape content from a single URL with advanced options |
| Batch Scrape Tool | `firecrawl_batch_scrape` | Scrape multiple URLs efficiently with built-in rate limiting and parallel processing |
| Check Batch Status | `firecrawl_check_batch_status` | Check the status of a batch operation |
| Map Tool | `firecrawl_map` | Map a website to discover all indexed URLs on the site |
| Search Tool | `firecrawl_search` | Search the web and optionally extract content from search results |
| Crawl Tool | `firecrawl_crawl` | Start an asynchronous crawl with advanced options |
| Check Crawl Status | `firecrawl_check_crawl_status` | Check the status of a crawl job |
| Extract Tool | `firecrawl_extract` | Extract structured information from web pages using LLM capabilities |
## Configuration
### Required Configuration
**`FIRECRAWL_API_KEY`**: Your Firecrawl API key

* Required when using the cloud API (default)
* Optional when using a self-hosted instance with `FIRECRAWL_API_URL`
### Optional Configuration
**Firecrawl API URL (for self-hosted instances)**:
* `FIRECRAWL_API_URL`: Custom API endpoint
* Example: `https://firecrawl.your-domain.com`
* If not provided, the cloud API will be used
**Retry configuration**:
* `FIRECRAWL_RETRY_MAX_ATTEMPTS`: Maximum retry attempts (default: 3)
* `FIRECRAWL_RETRY_INITIAL_DELAY`: Initial delay in milliseconds (default: 1000)
* `FIRECRAWL_RETRY_MAX_DELAY`: Maximum delay in milliseconds (default: 10000)
* `FIRECRAWL_RETRY_BACKOFF_FACTOR`: Exponential backoff multiplier (default: 2)
**Credit usage monitoring**:
* `FIRECRAWL_CREDIT_WARNING_THRESHOLD`: Warning threshold (default: 1000)
* `FIRECRAWL_CREDIT_CRITICAL_THRESHOLD`: Critical threshold (default: 100)
## Example: Web Research Agent
```python theme={null}
from google.adk.agents.llm_agent import Agent
from google.adk.tools.mcp_tool.mcp_session_manager import StreamableHTTPServerParams
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
FIRECRAWL_API_KEY = "YOUR-API-KEY"
# Create a research agent
research_agent = Agent(
model="gemini-2.5-pro",
name="research_agent",
description='An AI agent that researches topics by scraping and analyzing web content',
instruction='''You are a research assistant. When given a topic or question:
1. Use the search tool to find relevant websites
2. Scrape the most relevant pages for detailed information
3. Extract structured data when needed
4. Provide comprehensive, well-sourced answers''',
tools=[
MCPToolset(
connection_params=StreamableHTTPServerParams(
url=f"https://mcp.firecrawl.dev/{FIRECRAWL_API_KEY}/v2/mcp",
),
)
],
)
# Use the agent
response = research_agent.run("What are the latest features in Python 3.13?")
print(response)
```
## Best Practices
1. **Use the right tool for the job**:
* `firecrawl_search` when you need to find relevant pages first
* `firecrawl_scrape` for single pages
* `firecrawl_batch_scrape` for multiple known URLs
* `firecrawl_crawl` for discovering and scraping entire sites
2. **Monitor your usage**: Configure credit thresholds to avoid unexpected usage
3. **Handle errors gracefully**: Configure retry settings based on your use case
4. **Optimize performance**: Use batch operations when scraping multiple URLs
***
## Related Resources
* Learn how to build powerful multi-agent AI systems using Google's ADK framework with Firecrawl for web scraping capabilities.
* Learn more about Firecrawl's Model Context Protocol (MCP) server integration and capabilities.
* Explore the official Google Agent Development Kit documentation for comprehensive guides and API references.
# LangChain
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/langchain
Use Firecrawl with LangChain for web scraping + AI workflows
Integrate Firecrawl with LangChain to build AI applications powered by web data.
## Setup
```bash theme={null}
npm install @langchain/openai @mendable/firecrawl-js
```
Create `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
```
> **Note:** If using Node \< 20, install `dotenv` and add `import 'dotenv/config'` to your code.
## Scrape + Chat
This example demonstrates a simple workflow: scrape a website and process the content using LangChain.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { ChatOpenAI } from '@langchain/openai';
import { HumanMessage } from '@langchain/core/messages';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const chat = new ChatOpenAI({
model: 'gpt-5-nano',
apiKey: process.env.OPENAI_API_KEY
});
const scrapeResult = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const response = await chat.invoke([
new HumanMessage(`Summarize: ${scrapeResult.markdown}`)
]);
console.log('Summary:', response.content);
```
## Chains
This example shows how to build a LangChain chain to process and analyze scraped content.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { ChatOpenAI } from '@langchain/openai';
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const model = new ChatOpenAI({
model: 'gpt-5-nano',
apiKey: process.env.OPENAI_API_KEY
});
const scrapeResult = await firecrawl.scrape('https://stripe.com', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
// Create processing chain
const prompt = ChatPromptTemplate.fromMessages([
['system', 'You are an expert at analyzing company websites.'],
['user', 'Extract the company name and main products from: {content}']
]);
const chain = prompt.pipe(model).pipe(new StringOutputParser());
// Execute the chain
const result = await chain.invoke({
content: scrapeResult.markdown
});
console.log('Chain result:', result);
```
## Tool Calling
This example demonstrates how to use LangChain's tool calling feature to let the model decide when to scrape websites.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { ChatOpenAI } from '@langchain/openai';
import { DynamicStructuredTool } from '@langchain/core/tools';
import { z } from 'zod';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
// Create the scraping tool
const scrapeWebsiteTool = new DynamicStructuredTool({
name: 'scrape_website',
description: 'Scrape content from any website URL',
schema: z.object({
url: z.string().url().describe('The URL to scrape')
}),
func: async ({ url }) => {
console.log('Scraping:', url);
const result = await firecrawl.scrape(url, {
formats: ['markdown']
});
console.log('Scraped content preview:', result.markdown?.substring(0, 200) + '...');
return result.markdown || 'No content scraped';
}
});
const model = new ChatOpenAI({
model: 'gpt-5-nano',
apiKey: process.env.OPENAI_API_KEY
}).bindTools([scrapeWebsiteTool]);
const response = await model.invoke('What is Firecrawl? Visit firecrawl.dev and tell me about it.');
console.log('Response:', response.content);
console.log('Tool calls:', response.tool_calls);
```
## Structured Data Extraction
This example shows how to extract structured data using LangChain's structured output feature.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { ChatOpenAI } from '@langchain/openai';
import { z } from 'zod';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const scrapeResult = await firecrawl.scrape('https://stripe.com', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const CompanyInfoSchema = z.object({
name: z.string(),
industry: z.string(),
description: z.string(),
products: z.array(z.string())
});
const model = new ChatOpenAI({
model: 'gpt-5-nano',
apiKey: process.env.OPENAI_API_KEY
}).withStructuredOutput(CompanyInfoSchema);
const companyInfo = await model.invoke([
{
role: 'system',
content: 'Extract company information from website content.'
},
{
role: 'user',
content: `Extract data: ${scrapeResult.markdown}`
}
]);
console.log('Extracted company info:', companyInfo);
```
For more examples, check the [LangChain documentation](https://js.langchain.com/docs).
# LangGraph
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/langgraph
Integrate Firecrawl with LangGraph for building agent workflows
This guide shows how to integrate Firecrawl with LangGraph to build AI agent workflows that can scrape and process web content.
## Setup
```bash theme={null}
npm install @langchain/langgraph @langchain/openai @mendable/firecrawl-js
```
Create `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
```
> **Note:** If using Node \< 20, install `dotenv` and add `import 'dotenv/config'` to your code.
## Basic Workflow
This example demonstrates a basic LangGraph workflow that scrapes a website and analyzes the content.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { ChatOpenAI } from '@langchain/openai';
import { StateGraph, MessagesAnnotation, START, END } from '@langchain/langgraph';
// Initialize Firecrawl
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
// Initialize LLM
const llm = new ChatOpenAI({
model: "gpt-5-nano",
apiKey: process.env.OPENAI_API_KEY
});
// Define the scrape node
async function scrapeNode(state: typeof MessagesAnnotation.State) {
console.log('Scraping...');
const result = await firecrawl.scrape('https://firecrawl.dev', { formats: ['markdown'] });
return {
messages: [{
role: "system",
content: `Scraped content: ${result.markdown}`
}]
};
}
// Define the analyze node
async function analyzeNode(state: typeof MessagesAnnotation.State) {
console.log('Analyzing...');
const response = await llm.invoke(state.messages);
return { messages: [response] };
}
// Build the graph
const graph = new StateGraph(MessagesAnnotation)
.addNode("scrape", scrapeNode)
.addNode("analyze", analyzeNode)
.addEdge(START, "scrape")
.addEdge("scrape", "analyze")
.addEdge("analyze", END);
// Compile the graph
const app = graph.compile();
// Run the workflow
const result = await app.invoke({
messages: [{ role: "user", content: "Summarize the website" }]
});
console.log(JSON.stringify(result, null, 2));
```
## Multi-Step Workflow
This example demonstrates a more complex workflow that scrapes multiple URLs and processes them.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { ChatOpenAI } from '@langchain/openai';
import { StateGraph, Annotation, START, END } from '@langchain/langgraph';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const llm = new ChatOpenAI({ model: "gpt-5-nano", apiKey: process.env.OPENAI_API_KEY });
// Define custom state
const WorkflowState = Annotation.Root({
urls: Annotation<string[]>(),
scrapedData: Annotation<Array<{ url: string; content: string }>>(),
summary: Annotation<string>()
});
// Scrape multiple URLs
async function scrapeMultiple(state: typeof WorkflowState.State) {
const scrapedData = [];
for (const url of state.urls) {
const result = await firecrawl.scrape(url, { formats: ['markdown'] });
scrapedData.push({ url, content: result.markdown || '' });
}
return { scrapedData };
}
// Summarize all scraped content
async function summarizeAll(state: typeof WorkflowState.State) {
const combinedContent = state.scrapedData
.map(item => `Content from ${item.url}:\n${item.content}`)
.join('\n\n');
const response = await llm.invoke([
{ role: "user", content: `Summarize these websites:\n${combinedContent}` }
]);
return { summary: response.content as string };
}
// Build the workflow graph
const workflow = new StateGraph(WorkflowState)
.addNode("scrape", scrapeMultiple)
.addNode("summarize", summarizeAll)
.addEdge(START, "scrape")
.addEdge("scrape", "summarize")
.addEdge("summarize", END);
const app = workflow.compile();
// Execute workflow
const result = await app.invoke({
urls: ["https://firecrawl.dev", "https://firecrawl.dev/pricing"]
});
console.log(result.summary);
```
For more examples, check the [LangGraph documentation](https://langchain-ai.github.io/langgraphjs/).
# LlamaIndex
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/llamaindex
Use Firecrawl with LlamaIndex for RAG applications
Integrate Firecrawl with LlamaIndex to build AI applications with vector search and embeddings powered by web content.
## Setup
```bash theme={null}
npm install llamaindex @llamaindex/openai @mendable/firecrawl-js
```
Create `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
```
> **Note:** If using Node \< 20, install `dotenv` and add `import 'dotenv/config'` to your code.
## RAG with Vector Search
This example demonstrates how to use LlamaIndex with Firecrawl to crawl a website, create embeddings, and query the content using RAG.
```typescript theme={null}
import Firecrawl from '@mendable/firecrawl-js';
import { Document, VectorStoreIndex, Settings } from 'llamaindex';
import { OpenAI, OpenAIEmbedding } from '@llamaindex/openai';
Settings.llm = new OpenAI({ model: "gpt-4o" });
Settings.embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });
const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
const crawlResult = await firecrawl.crawl('https://firecrawl.dev', {
limit: 10,
scrapeOptions: { formats: ['markdown'] }
});
console.log(`Crawled ${crawlResult.data.length} pages`);
const documents = crawlResult.data.map((page: any, i: number) =>
new Document({
text: page.markdown,
id_: `page-${i}`,
metadata: { url: page.metadata?.sourceURL }
})
);
const index = await VectorStoreIndex.fromDocuments(documents);
console.log('Vector index created with embeddings');
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query({ query: 'What is Firecrawl and how does it work?' });
console.log('\nAnswer:', response.toString());
```
For more examples, check the [LlamaIndex documentation](https://ts.llamaindex.ai/).
# Mastra
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/mastra
Use Firecrawl with Mastra for building AI workflows
Integrate Firecrawl with Mastra, the TypeScript framework for building AI agents and workflows.
## Setup
```bash theme={null}
npm install @mastra/core @mendable/firecrawl-js zod
```
Create `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
```
> **Note:** If using Node \< 20, install `dotenv` and add `import 'dotenv/config'` to your code.
## Multi-Step Workflow
This example demonstrates a complete workflow that searches, scrapes, and summarizes documentation using Firecrawl and Mastra.
```typescript theme={null}
import { createWorkflow, createStep } from "@mastra/core/workflows";
import { z } from "zod";
import Firecrawl from "@mendable/firecrawl-js";
import { Agent } from "@mastra/core/agent";
const firecrawl = new Firecrawl({
apiKey: process.env.FIRECRAWL_API_KEY || "fc-YOUR_API_KEY"
});
const agent = new Agent({
name: "summarizer",
instructions: "You are a helpful assistant that creates concise summaries of documentation.",
model: "openai/gpt-5-nano",
});
// Step 1: Search with Firecrawl SDK
const searchStep = createStep({
id: "search",
inputSchema: z.object({
query: z.string(),
}),
outputSchema: z.object({
url: z.string(),
title: z.string(),
}),
execute: async ({ inputData }: { inputData: { query: string } }) => {
console.log(`Searching: ${inputData.query}`);
const searchResults = await firecrawl.search(inputData.query, { limit: 1 });
const webResults = (searchResults as any)?.web;
if (!webResults || !Array.isArray(webResults) || webResults.length === 0) {
throw new Error("No search results found");
}
const firstResult = webResults[0];
console.log(`Found: ${firstResult.title}`);
return {
url: firstResult.url,
title: firstResult.title,
};
},
});
// Step 2: Scrape the URL with Firecrawl SDK
const scrapeStep = createStep({
id: "scrape",
inputSchema: z.object({
url: z.string(),
title: z.string(),
}),
outputSchema: z.object({
markdown: z.string(),
title: z.string(),
}),
execute: async ({ inputData }: { inputData: { url: string; title: string } }) => {
console.log(`Scraping: ${inputData.url}`);
const scrapeResult = await firecrawl.scrape(inputData.url, {
formats: ["markdown"],
});
console.log(`Scraped: ${scrapeResult.markdown?.length || 0} characters`);
return {
markdown: scrapeResult.markdown || "",
title: inputData.title,
};
},
});
// Step 3: Summarize with the agent
const summarizeStep = createStep({
id: "summarize",
inputSchema: z.object({
markdown: z.string(),
title: z.string(),
}),
outputSchema: z.object({
summary: z.string(),
}),
execute: async ({ inputData }: { inputData: { markdown: string; title: string } }) => {
console.log(`Summarizing: ${inputData.title}`);
const prompt = `Summarize the following documentation in 2-3 sentences:\n\nTitle: ${inputData.title}\n\n${inputData.markdown}`;
const result = await agent.generate(prompt);
console.log(`Summary generated`);
return { summary: result.text };
},
});
// Create workflow
export const workflow = createWorkflow({
id: "firecrawl-workflow",
inputSchema: z.object({
query: z.string(),
}),
outputSchema: z.object({
summary: z.string(),
}),
steps: [searchStep, scrapeStep, summarizeStep],
})
.then(searchStep)
.then(scrapeStep)
.then(summarizeStep)
.commit();
async function testWorkflow() {
const run = await workflow.createRunAsync();
const result = await run.start({
inputData: { query: "Firecrawl documentation" }
});
if (result.status === "success") {
const { summarize } = result.steps;
if (summarize.status === "success") {
console.log(`\n${summarize.output.summary}`);
}
} else {
console.error("Workflow failed:", result.status);
}
}
testWorkflow().catch(console.error);
```
For more examples, check the [Mastra documentation](https://mastra.ai/docs).
# OpenAI
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/openai
Use Firecrawl with OpenAI for web scraping + AI workflows
Integrate Firecrawl with OpenAI to build AI applications powered by web data.
## Setup
```bash theme={null}
npm install @mendable/firecrawl-js openai zod
```
Create `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
```
> **Note:** If using Node \< 20, install `dotenv` and add `import 'dotenv/config'` to your code.
## Scrape + Summarize
This example demonstrates a simple workflow: scrape a website and summarize the content using an OpenAI model.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import OpenAI from 'openai';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
// Scrape the website content
const scrapeResult = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
// Summarize with OpenAI model
const completion = await openai.chat.completions.create({
model: 'gpt-5-nano',
messages: [
{ role: 'user', content: `Summarize: ${scrapeResult.markdown}` }
]
});
console.log('Summary:', completion.choices[0]?.message.content);
```
## Function Calling
This example shows how to use OpenAI's function calling feature to let the model decide when to scrape websites based on user requests.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import OpenAI from 'openai';
import { z } from 'zod';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const ScrapeArgsSchema = z.object({
url: z.string().describe('The URL of the website to scrape')
});
const tools = [{
type: 'function' as const,
function: {
name: 'scrape_website',
description: 'Scrape content from any website URL',
parameters: z.toJSONSchema(ScrapeArgsSchema)
}
}];
const response = await openai.chat.completions.create({
model: 'gpt-5-nano',
messages: [{
role: 'user',
content: 'What is Firecrawl? Visit firecrawl.dev and tell me about it.'
}],
tools
});
const message = response.choices[0]?.message;
if (message?.tool_calls && message.tool_calls.length > 0) {
for (const toolCall of message.tool_calls) {
if (toolCall.type === 'function') {
console.log('Tool called:', toolCall.function.name);
const args = ScrapeArgsSchema.parse(JSON.parse(toolCall.function.arguments));
const result = await firecrawl.scrape(args.url, {
formats: ['markdown'] // Other formats: html, links, etc.
});
console.log('Scraped content:', result.markdown?.substring(0, 200) + '...');
// Send the scraped content back to the model for final response
const finalResponse = await openai.chat.completions.create({
model: 'gpt-5-nano',
messages: [
{
role: 'user',
content: 'What is Firecrawl? Visit firecrawl.dev and tell me about it.'
},
message,
{
role: 'tool',
tool_call_id: toolCall.id,
content: result.markdown || 'No content scraped'
}
],
tools
});
console.log('Final response:', finalResponse.choices[0]?.message?.content);
}
}
} else {
console.log('Direct response:', message?.content);
}
```
## Structured Data Extraction
This example demonstrates how to use OpenAI models with structured outputs to extract specific data from scraped content.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import OpenAI from 'openai';
import { z } from 'zod';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const scrapeResult = await firecrawl.scrape('https://stripe.com', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const CompanyInfoSchema = z.object({
name: z.string(),
industry: z.string(),
description: z.string(),
products: z.array(z.string())
});
const response = await openai.chat.completions.create({
model: 'gpt-5-nano',
messages: [
{
role: 'system',
content: 'Extract company information from website content.'
},
{
role: 'user',
content: `Extract data: ${scrapeResult.markdown}`
}
],
response_format: {
type: 'json_schema',
json_schema: {
name: 'company_info',
schema: z.toJSONSchema(CompanyInfoSchema),
strict: true
}
}
});
const content = response.choices[0]?.message?.content;
const companyInfo = content ? CompanyInfoSchema.parse(JSON.parse(content)) : null;
console.log('Validated company info:', companyInfo);
```
## Search + Analyze
This example combines Firecrawl's search capabilities with OpenAI model analysis to find and summarize information from multiple sources.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import OpenAI from 'openai';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
// Search for relevant information
const searchResult = await firecrawl.search('Next.js 16 new features', {
limit: 3,
sources: [{ type: 'web' }], // Other sources: { type: 'news' }, { type: 'images' }
scrapeOptions: { formats: ['markdown'] }
});
console.log('Search results:', searchResult.web?.length, 'pages found');
// Analyze and summarize the key features
const analysis = await openai.chat.completions.create({
model: 'gpt-5-nano',
messages: [{
role: 'user',
content: `Summarize the key features: ${JSON.stringify(searchResult)}`
}]
});
console.log('Analysis:', analysis.choices[0]?.message?.content);
```
## Responses API with MCP
This example shows how to use OpenAI's Responses API with Firecrawl configured as an MCP (Model Context Protocol) server.
```typescript theme={null}
import OpenAI from 'openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const response = await openai.responses.create({
model: 'gpt-5-nano',
tools: [
{
type: 'mcp',
server_label: 'firecrawl',
server_description: 'A web search and scraping MCP server to scrape and extract content from websites.',
server_url: `https://mcp.firecrawl.dev/${process.env.FIRECRAWL_API_KEY}/v2/mcp`,
require_approval: 'never'
}
],
input: 'Find out what the top stories on Hacker News are and the latest blog post on OpenAI and summarize them in a bullet point format'
});
console.log('Response:', JSON.stringify(response.output, null, 2));
```
For more examples, check the [OpenAI documentation](https://platform.openai.com/docs).
# Vercel AI SDK
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/vercel-ai-sdk
Firecrawl tools for Vercel AI SDK. Web scraping, search, interact, and crawling for AI applications.
Firecrawl tools for the Vercel AI SDK. Search, scrape, interact with pages, and crawl the web in your AI applications.
## Install
```bash theme={null}
npm install firecrawl-aisdk ai
```
Set environment variables:
```bash theme={null}
FIRECRAWL_API_KEY=fc-your-key # https://firecrawl.dev
AI_GATEWAY_API_KEY=your-key # https://vercel.com/ai-gateway
```
These examples use the [Vercel AI Gateway](https://vercel.com/ai-gateway) string model format, but Firecrawl tools work with any AI SDK provider. You can also use provider imports like `anthropic('claude-sonnet-4-5-20250514')` from `@ai-sdk/anthropic`.
## Quick Start
`FirecrawlTools()` bundles `search`, `scrape`, and `interact` by default.
```typescript theme={null}
import { generateText, stepCountIs } from 'ai';
import { FirecrawlTools } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
tools: FirecrawlTools(),
stopWhen: stepCountIs(30),
prompt: `
1. Use interact on Hacker News to identify the top story
2. Search for other perspectives on the same topic
3. Scrape the most relevant pages you found
4. Summarize everything you found
`,
});
```
## FirecrawlTools
`FirecrawlTools()` gives you the default tools plus an auto-generated `systemPrompt` you can pass to `generateText`.
```typescript theme={null}
import { generateText, stepCountIs } from 'ai';
import { FirecrawlTools } from 'firecrawl-aisdk';
const tools = FirecrawlTools();
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
system: `${tools.systemPrompt}\n\nAnswer with citations when possible.`,
tools,
stopWhen: stepCountIs(20),
prompt: 'Find the current Firecrawl pricing page and explain the available plans.',
});
```
You can customize defaults, opt into async tools, or disable individual tools:
```typescript theme={null}
const tools = FirecrawlTools({
search: { limit: 5 },
scrape: { formats: ['markdown'], onlyMainContent: true },
interact: { profile: { name: 'my-session', saveChanges: true } },
crawl: true,
agent: true,
});
```
```typescript theme={null}
// Disable interact, keep search + scrape
FirecrawlTools({ interact: false });
// Opt into deprecated browser compatibility
FirecrawlTools({ browser: {} });
// Include every available tool
FirecrawlTools({ all: true });
```
When scraping to answer a question about a page, prefer query format:
```typescript theme={null}
formats: [{ type: 'query', prompt: 'What does this page say about pricing and rate limits?' }]
```
Use `formats: ['markdown']` only when you need the full page content.
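For example, you can make query format the default for every scrape the model makes (a sketch; it assumes the `scrape` options accept the same object formats shown above):
```typescript theme={null}
// Sketch: default all scrape tool calls to query format (prompt text is illustrative)
const tools = FirecrawlTools({
scrape: {
formats: [{ type: 'query', prompt: 'Answer the current question using this page.' }]
}
});
```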
## Individual Tools
Every tool can be used directly or called with options:
```typescript theme={null}
import { generateText } from 'ai';
import { scrape, search } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Search for Firecrawl, then scrape the most relevant result.',
tools: { search, scrape },
});
const customScrape = scrape({ apiKey: 'fc-custom-key', apiUrl: 'https://api.firecrawl.dev' });
```
### Search + Scrape
```typescript theme={null}
import { generateText } from 'ai';
import { search, scrape } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Search for Firecrawl, scrape the top official result, and explain what it does.',
tools: { search, scrape },
});
```
### Map
```typescript theme={null}
import { generateText } from 'ai';
import { map } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Map https://docs.firecrawl.dev and list the main sections.',
tools: { map },
});
```
### Stream
```typescript theme={null}
import { streamText, stepCountIs } from 'ai';
import { scrape } from 'firecrawl-aisdk';
const result = streamText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'What are the first 100 words of firecrawl.dev?',
tools: { scrape },
stopWhen: stepCountIs(3),
});
for await (const chunk of result.textStream) {
process.stdout.write(chunk);
}
await result.fullStream;
```
## Interact
`interact()` creates a scrape-backed interactive session. Call `start(url)` to bootstrap a session and get a live view URL, then let the model reuse that session through the `interact` tool.
```typescript theme={null}
import { generateText, stepCountIs } from 'ai';
import { interact, search } from 'firecrawl-aisdk';
const interactTool = interact();
console.log('Live view:', await interactTool.start('https://news.ycombinator.com'));
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
tools: { interact: interactTool, search },
stopWhen: stepCountIs(25),
prompt: 'Use interact on the current Hacker News session, find the top story, then search for more context.',
});
await interactTool.close();
```
If you need the explicit live view URL after startup, use `interactTool.interactiveLiveViewUrl`.
Reuse browser state across sessions with profiles:
```typescript theme={null}
const interactTool = interact({
profile: { name: 'my-session', saveChanges: true },
});
```
`browser()` is deprecated. Prefer `interact()`.
## Async Tools
Crawl, batch scrape, and agent return a job ID. Pair them with `poll`.
### Crawl
```typescript theme={null}
import { generateText } from 'ai';
import { crawl, poll } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Crawl https://docs.firecrawl.dev (limit 3 pages) and summarize.',
tools: { crawl, poll },
});
```
### Batch Scrape
```typescript theme={null}
import { generateText } from 'ai';
import { batchScrape, poll } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Scrape https://firecrawl.dev and https://docs.firecrawl.dev, then compare them.',
tools: { batchScrape, poll },
});
```
### Agent
Autonomous web data gathering that searches, navigates, and extracts on its own.
```typescript theme={null}
import { generateText, stepCountIs } from 'ai';
import { agent, poll } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Find the founders of Firecrawl, their roles, and their backgrounds.',
tools: { agent, poll },
stopWhen: stepCountIs(10),
});
```
## Logging
```typescript theme={null}
import { generateText } from 'ai';
import { logStep, scrape, stepLogger } from 'firecrawl-aisdk';
const logger = stepLogger();
const { text, usage } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Scrape https://firecrawl.dev and summarize it.',
tools: { scrape },
onStepFinish: logger.onStep,
experimental_onToolCallFinish: logger.onToolCallFinish,
});
logger.close();
logger.summary(usage);
await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Scrape https://firecrawl.dev and summarize it again.',
tools: { scrape },
onStepFinish: logStep,
});
```
## All Exports
```typescript theme={null}
import {
// Core tools
search, // Search the web
scrape, // Scrape a single URL
map, // Discover URLs on a site
crawl, // Crawl multiple pages (async, use poll)
batchScrape, // Scrape multiple URLs (async, use poll)
agent, // Autonomous web research (async, use poll)
// Job management
poll, // Poll async jobs for results
status, // Check job status
cancel, // Cancel running jobs
// Browser/session tools
interact, // interact({ profile: { name: '...' } })
browser, // deprecated compatibility export
// All-in-one bundle
FirecrawlTools, // FirecrawlTools({ search, scrape, interact, crawl, agent })
// Helpers
stepLogger, // Token stats per tool call
logStep, // Simple one-liner logging
} from 'firecrawl-aisdk';
```
# MCP Web Search & Scrape in ChatGPT
Source: https://docs.firecrawl.dev/developer-guides/mcp-setup-guides/chatgpt
Add web scraping and search to ChatGPT in 2 minutes
MCP support in ChatGPT is currently a beta feature. The interface and availability may change as OpenAI continues development.
**Availability:** Developer mode with MCP connectors is not available on the Free tier and requires a paid ChatGPT subscription. Availability and features vary by plan and region. See [OpenAI's documentation on Developer Mode](https://help.openai.com/en/articles/12584461-developer-mode-apps-and-full-mcp-connectors-in-chatgpt-beta) for current availability and setup instructions.
Add web scraping and search capabilities to ChatGPT with Firecrawl MCP.
## Quick Setup
### 1. Get Your API Key
Sign up at [firecrawl.dev/app/api-keys](https://www.firecrawl.dev/app/api-keys) and copy your API key.
### 2. Enable Developer Mode
Open ChatGPT settings by clicking your username in the bottom left corner, or navigate directly to [chatgpt.com/#settings](https://chatgpt.com/#settings).
In the settings modal, scroll to the bottom and select **Advanced Settings**. Toggle **Developer mode** to ON.
### 3. Create the Connector
With Developer mode enabled, go to the **Apps & Connectors** tab in the same settings modal.
Click the **Create** button in the top right corner.
Fill in the connector details:
* **Name:** `Firecrawl MCP`
* **Description:** `Web scraping, crawling, search, and content extraction` (optional)
* **MCP Server URL:** `https://mcp.firecrawl.dev/YOUR_API_KEY_HERE/v2/mcp`
* **Authentication:** `None`
Replace `YOUR_API_KEY_HERE` in the URL with your actual [Firecrawl API key](https://www.firecrawl.dev/app/api-keys).
Check the **"I understand and want to continue"** checkbox, then click **Create**.
### 4. Verify Setup
Go back to the main ChatGPT interface. You should see **Developer mode** displayed, indicating that MCP connectors are active.
If you do not see Developer mode, reload the page. If it still does not appear, open settings again and verify that Developer mode is toggled ON under Advanced Settings.
### 5. Access Firecrawl Tools
To use Firecrawl in a conversation, click the **+** button in the chat input, then select **More** and choose **Firecrawl MCP**.
## Quick Demo
With Firecrawl MCP selected, try these prompts:
**Search:**
```
Search for the latest React Server Components updates
```
**Scrape:**
```
Scrape firecrawl.dev and tell me what it does
```
**Get docs:**
```
Scrape the Vercel documentation for edge functions and summarize it
```
## Tool Confirmation
When ChatGPT uses the Firecrawl MCP tools, you may see a confirmation prompt asking for your approval. Some ChatGPT Desktop versions auto-approve tool calls without showing this dialog. If no prompt appears, you can verify the tool was invoked by checking for a "Called tool" section in the response or reviewing your usage at [firecrawl.dev/app/usage](https://www.firecrawl.dev/app/usage).
**No confirmation prompt appearing?** If ChatGPT answers your question without showing a confirmation dialog, the Firecrawl MCP connector is most likely not attached to the conversation. Go back to [Step 5](#5-access-firecrawl-tools) and make sure you click the **+** button, select **More**, and choose **Firecrawl MCP** before sending your prompt. The connector must be attached to each new conversation.
You can check **"Remember for this conversation"** to avoid repeated confirmations during the same chat session. This security measure is implemented by OpenAI to ensure MCP tools do not perform unintended actions.
Once confirmed, ChatGPT will execute the request and return the results.
# MCP Web Search & Scrape in Claude.ai
Source: https://docs.firecrawl.dev/developer-guides/mcp-setup-guides/claude-ai
Add web scraping and search to Claude.ai (Co-work) in 2 minutes
Add web scraping and search capabilities to Claude.ai with Firecrawl MCP using custom connectors.
Looking for Claude Code setup? See the [Claude Code guide](/quickstarts/claude-code) instead.
## Quick Setup
### 1. Get Your API Key
Sign up at [firecrawl.dev/app/api-keys](https://www.firecrawl.dev/app/api-keys) and copy your API key.
### 2. Add Custom Connector
Go to [Settings > Connectors](https://claude.ai/settings/connectors) in Claude.ai and click **Add custom connector**.
Fill in the connector details:
* **URL:** `https://mcp.firecrawl.dev/YOUR_API_KEY/v2/mcp`
* **OAuth Client ID:** Leave blank
* **OAuth Client Secret:** Leave blank
Replace `YOUR_API_KEY` in the URL with your actual [Firecrawl API key](https://www.firecrawl.dev/app/api-keys). Your API key is embedded directly in the URL, so no additional authentication fields are needed.
Click **Add** to save the connector.
### 3. Enable in Conversation
In any Claude.ai conversation, click the **+** button at the bottom left, go to **Connectors**, and enable the Firecrawl connector.
## Quick Demo
With the Firecrawl connector enabled, try these prompts:
**Search the web:**
```
Search for the latest Next.js 15 features
```
**Scrape a page:**
```
Scrape firecrawl.dev and tell me what it does
```
**Get documentation:**
```
Find and scrape the Stripe API docs for payment intents
```
Claude will automatically use Firecrawl's search and scrape tools to get the information.
# MCP Web Search & Scrape in Factory AI
Source: https://docs.firecrawl.dev/developer-guides/mcp-setup-guides/factory-ai
Add web scraping and search to Factory AI in 2 minutes
Add web scraping and search capabilities to Factory AI with Firecrawl MCP.
## Quick Setup
### 1. Get Your API Key
Sign up at [firecrawl.dev/app](https://firecrawl.dev/app) and copy your API key.
### 2. Install Factory AI CLI
Install the [Factory AI CLI](https://docs.factory.ai/cli/getting-started/quickstart) if you haven't already:
**macOS/Linux:**
```bash theme={null}
curl -fsSL https://app.factory.ai/cli | sh
```
**Windows:**
```powershell theme={null}
iwr https://app.factory.ai/cli/install.ps1 -useb | iex
```
### 3. Add Firecrawl MCP Server
In the Factory droid CLI, add Firecrawl using the `/mcp add` command:
```bash theme={null}
/mcp add firecrawl "npx -y firecrawl-mcp" -e FIRECRAWL_API_KEY=your-api-key-here
```
Replace `your-api-key-here` with your actual Firecrawl API key.
### 4. Done!
The Firecrawl tools are now available in your Factory AI session!
## Quick Demo
Try these in Factory AI:
**Search the web:**
```
Search for the latest Next.js 15 features
```
**Scrape a page:**
```
Scrape firecrawl.dev and tell me what it does
```
**Get documentation:**
```
Find and scrape the Stripe API docs for payment intents
```
Factory will automatically use Firecrawl's search and scrape tools to get the information.
# Choosing the Data Extractor
Source: https://docs.firecrawl.dev/developer-guides/usage-guides/choosing-the-data-extractor
Compare /agent, /extract, and /scrape (JSON mode) to pick the right tool for structured data extraction
Firecrawl offers three approaches for extracting structured data from web pages. Each serves different use cases with varying levels of automation and control.
## Quick Comparison
| Feature | `/agent` | `/extract` | `/scrape` (JSON mode) |
| ------------------- | -------------------------------------- | ------------------------------------------ | ---------------------------- |
| **Status** | Active | Deprecated (use `/agent`) | Active |
| **URL Required** | No (optional) | Yes (wildcards supported) | Yes (single URL) |
| **Scope** | Web-wide discovery | Multiple pages/domains | Single page |
| **URL Discovery** | Autonomous web search | Crawls from given URLs | None |
| **Processing** | Asynchronous | Asynchronous | Synchronous |
| **Schema Required** | No (prompt or schema) | No (prompt or schema) | No (prompt or schema) |
| **Pricing** | Dynamic (5 free runs/day) | Token-based (1 credit = 15 tokens) | 1 credit/page |
| **Best For** | Research, discovery, complex gathering | Multi-page extraction (when you know URLs) | Known single-page extraction |
## 1. `/agent` Endpoint
The `/agent` endpoint is Firecrawl's most advanced offering—the successor to `/extract`. It uses AI agents to autonomously search, navigate, and gather data from across the web.
### Key Characteristics
* **URLs Optional**: Just describe what you need via `prompt`; URLs are completely optional
* **Autonomous Navigation**: The agent searches and navigates deep into sites to find your data
* **Deep Web Search**: Autonomously discovers information across multiple domains and pages
* **Parallel Processing**: Processes multiple sources simultaneously for faster results
* **Models Available**: `spark-1-mini` (default, 60% cheaper) and `spark-1-pro` (higher accuracy)
### Example
```python Python theme={null}
from firecrawl import Firecrawl
from pydantic import BaseModel, Field
from typing import List, Optional
app = Firecrawl(api_key="fc-YOUR_API_KEY")
class Founder(BaseModel):
name: str = Field(description="Full name of the founder")
role: Optional[str] = Field(None, description="Role or position")
background: Optional[str] = Field(None, description="Professional background")
class FoundersSchema(BaseModel):
founders: List[Founder] = Field(description="List of founders")
result = app.agent(
prompt="Find the founders of Firecrawl",
schema=FoundersSchema,
model="spark-1-mini",
max_credits=100
)
print(result.data)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
import { z } from 'zod';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
const result = await firecrawl.agent({
prompt: "Find the founders of Firecrawl",
schema: z.object({
founders: z.array(z.object({
name: z.string().describe("Full name of the founder"),
role: z.string().describe("Role or position").optional(),
background: z.string().describe("Professional background").optional()
})).describe("List of founders")
}),
model: "spark-1-mini",
maxCredits: 100
});
console.log(result.data);
```
```bash cURL theme={null}
curl -X POST "https://api.firecrawl.dev/v2/agent" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Find the founders of Firecrawl",
"model": "spark-1-mini",
"maxCredits": 100,
"schema": {
"type": "object",
"properties": {
"founders": {
"type": "array",
"description": "List of founders",
"items": {
"type": "object",
"properties": {
"name": { "type": "string", "description": "Full name" },
"role": { "type": "string", "description": "Role or position" },
"background": { "type": "string", "description": "Professional background" }
},
"required": ["name"]
}
}
},
"required": ["founders"]
}
}'
```
### Best Use Case: Autonomous Research & Discovery
**Scenario**: You need to find information about AI startups that raised Series A funding, including their founders and funding amounts.
**Why `/agent`**: You don't know which websites contain this information. The agent will autonomously search the web, navigate to relevant sources (Crunchbase, news sites, company pages), and compile the structured data for you.
For more details, see the [Agent documentation](/features/agent).
***
## 2. `/extract` Endpoint
**Use `/agent` instead**: We recommend migrating to [`/agent`](/features/agent)—it's faster, more reliable, doesn't require URLs, and handles all `/extract` use cases plus more.
The `/extract` endpoint collects structured data from specified URLs or entire domains using LLM-powered extraction.
### Key Characteristics
* **URLs Typically Required**: Provide at least one URL (supports wildcards like `example.com/*`)
* **Domain Crawling**: Can crawl and parse all URLs discovered in a domain
* **Web Search Enhancement**: Optional `enableWebSearch` to follow links outside specified domains
* **Schema Optional**: Supports strict JSON schema OR natural language prompts
* **Async Processing**: Returns job ID for status checking
### The URL Limitation
The fundamental challenge with `/extract` is that you typically need to know URLs upfront:
1. **Discovery gap**: For tasks like "find YC W24 companies," you don't know which URLs contain the data. You'd need a separate search step before calling `/extract`.
2. **Awkward web search**: While `enableWebSearch` exists, it's constrained to start from URLs you provide—an awkward workflow for discovery tasks.
3. **Why `/agent` was created**: `/extract` is good at extracting from known locations, but less effective at discovering where data lives.
### Example
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
schema = {
"type": "object",
"properties": {"description": {"type": "string"}},
"required": ["description"],
}
res = firecrawl.extract(
urls=["https://docs.firecrawl.dev"],
prompt="Extract the page description",
schema=schema,
)
print(res.data["description"])
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const schema = {
type: 'object',
properties: {
title: { type: 'string' }
},
required: ['title']
};
const res = await firecrawl.extract({
urls: ['https://docs.firecrawl.dev'],
prompt: 'Extract the page title',
schema,
scrapeOptions: { formats: [{ type: 'json', prompt: 'Extract', schema }] }
});
console.log(res.status || res.success, res.data);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/extract" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://docs.firecrawl.dev"],
"prompt": "Extract the page title",
"schema": {
"type": "object",
"properties": {"title": {"type": "string"}},
"required": ["title"]
},
"scrapeOptions": {
"formats": [{"type": "json", "prompt": "Extract", "schema": {"type": "object"}}]
}
}'
```
### Best Use Case: Targeted Multi-Page Extraction
**Scenario**: You have your competitor's documentation URL and want to extract all their API endpoints from `docs.competitor.com/*`.
**Why `/extract` worked here**: You knew the exact domain. But even then, `/agent` with URLs provided would typically give better results today.
For more details, see the [Extract documentation](/features/extract).
***
## 3. `/scrape` Endpoint with JSON Mode
The `/scrape` endpoint with JSON mode is the most controlled approach—it extracts structured data from a single known URL using an LLM to parse the page content into your specified schema.
### Key Characteristics
* **Single URL Only**: Designed for extracting data from one specific page at a time
* **Exact URL Required**: You must know the precise URL containing the data
* **Schema Optional**: Can use JSON schema OR just a prompt (LLM chooses structure)
* **Synchronous**: Returns data immediately (no job polling needed)
* **Additional Formats**: Can combine JSON extraction with markdown, HTML, screenshots in one request
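On that last point: a single request can return structured JSON alongside the page markdown. A minimal Node sketch (the schema is illustrative; field names follow the v2 scrape response):
```js theme={null}
import Firecrawl from "@mendable/firecrawl-js";
const app = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
// One request, two outputs: extracted JSON plus the full page markdown
const result = await app.scrape("https://firecrawl.dev", {
formats: [
{ type: "json", schema: { type: "object", properties: { company_mission: { type: "string" } }, required: ["company_mission"] } },
"markdown"
]
});
console.log(result.json); // structured extraction
console.log(result.markdown?.length); // full page content
```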
### Example
```python Python theme={null}
from firecrawl import Firecrawl
from pydantic import BaseModel
app = Firecrawl(api_key="fc-YOUR-API-KEY")
class CompanyInfo(BaseModel):
company_mission: str
supports_sso: bool
is_open_source: bool
is_in_yc: bool
result = app.scrape(
'https://firecrawl.dev',
formats=[{
"type": "json",
"schema": CompanyInfo.model_json_schema()
}],
only_main_content=False,
timeout=120000
)
print(result)
```
```js Node theme={null}
import Firecrawl from "@mendable/firecrawl-js";
import { z } from "zod";
const app = new Firecrawl({
apiKey: "fc-YOUR_API_KEY"
});
// Define schema to extract contents into
const schema = z.object({
company_mission: z.string(),
supports_sso: z.boolean(),
is_open_source: z.boolean(),
is_in_yc: z.boolean()
});
const result = await app.scrape("https://firecrawl.dev", {
formats: [{
type: "json",
schema: schema
}],
});
console.log(result);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"formats": [ {
"type": "json",
"schema": {
"type": "object",
"properties": {
"company_mission": {
"type": "string"
},
"supports_sso": {
"type": "boolean"
},
"is_open_source": {
"type": "boolean"
},
"is_in_yc": {
"type": "boolean"
}
},
"required": [
"company_mission",
"supports_sso",
"is_open_source",
"is_in_yc"
]
}
} ]
}'
```
### Best Use Case: Single-Page Precision Extraction
**Scenario**: You're building a price monitoring tool and need to extract the price, stock status, and product details from a specific product page you already have the URL for.
**Why `/scrape` with JSON mode**: You know exactly which page contains the data, need precise single-page extraction, and want synchronous results without job management overhead.
For more details, see the [JSON mode documentation](/features/llm-extract).
***
## Decision Guide
**Do you know the exact URL(s) containing your data?**
* **NO** → Use `/agent` (autonomous web discovery)
* **YES**
* **Single page?** → Use `/scrape` with JSON mode
* **Multiple pages?** → Use `/agent` with URLs (or batch `/scrape`)
### Recommendations by Scenario
| Scenario | Recommended Endpoint |
| -------------------------------------------------- | ------------------------------- |
| "Find all AI startups and their funding" | `/agent` |
| "Extract data from this specific product page" | `/scrape` (JSON mode) |
| "Get all blog posts from competitor.com" | `/agent` with URL |
| "Monitor prices across multiple known URLs" | `/scrape` with batch processing |
| "Research companies in a specific industry" | `/agent` |
| "Extract contact info from 50 known company pages" | `/scrape` with batch processing |
***
## Pricing
| Endpoint | Cost | Notes |
| --------------------- | ---------------------------------- | ------------------------------------- |
| `/scrape` (JSON mode) | 1 credit/page | Fixed, predictable |
| `/extract` | Token-based (1 credit = 15 tokens) | Variable based on content |
| `/agent` | Dynamic | 5 free runs/day; varies by complexity |
### Example: "Find the founders of Firecrawl"
| Endpoint | How It Works | Credits Used |
| ---------- | ----------------------------------------------- | ---------------------- |
| `/scrape` | You find the URL manually, then scrape 1 page | \~1 credit |
| `/extract` | You provide URL(s), it extracts structured data | Variable (token-based) |
| `/agent` | Just send the prompt—agent finds and extracts | \~100–500 credits |
**Tradeoff**: `/scrape` is cheapest but requires you to know the URL. `/agent` costs more but handles discovery automatically.
For detailed pricing, see [Firecrawl Pricing](https://firecrawl.dev/pricing).
***
## Migration: `/extract` → `/agent`
If you're currently using `/extract`, migration is straightforward:
**Before (extract):**
```python theme={null}
result = app.extract(
urls=["https://example.com/*"],
prompt="Extract product information",
schema=schema
)
```
**After (agent):**
```python theme={null}
result = app.agent(
urls=["https://example.com"], # Optional - can omit entirely
prompt="Extract product information from example.com",
schema=schema,
model="spark-1-mini" # or "spark-1-pro" for higher accuracy
)
```
The key advantage: with `/agent`, you can drop the URLs entirely and just describe what you need.
***
## Key Takeaways
1. **Know the exact URL?** Use `/scrape` with JSON mode—it's the cheapest (1 credit/page), fastest (synchronous), and most predictable option.
2. **Need autonomous research?** Use `/agent`—it handles discovery automatically with 5 free runs/day, then dynamic pricing based on complexity.
3. **Migrate from `/extract`** to `/agent` for new projects—`/agent` is the successor with better capabilities.
4. **Cost vs. convenience tradeoff**: `/scrape` is most cost-effective when you know your URLs; `/agent` costs more but eliminates manual URL discovery.
***
## Further Reading
* [Agent documentation](/features/agent)
* [Agent models](/features/models)
* [JSON mode documentation](/features/llm-extract)
* [Extract documentation](/features/extract)
* [Batch scraping](/features/batch-scrape)
# Firecrawl + Dify
Source: https://docs.firecrawl.dev/developer-guides/workflow-automation/dify
Official plugin for Firecrawl + Dify AI workflow automation
**Official Dify Plugin:** [marketplace.dify.ai/plugins/langgenius/firecrawl](https://marketplace.dify.ai/plugins/langgenius/firecrawl)
Official plugin by Dify team • 44,000+ installs • Chatflow & Agent apps • Free to use
## Dify Integration Overview
Dify is an open-source LLM app development platform. The official Firecrawl plugin enables web crawling and scraping directly in your AI workflows.
* Build visual pipelines with Firecrawl nodes for data extraction
* Give AI agents the power to scrape live web data on demand
## Firecrawl Tools in Dify
**Scrape:** Convert any URL into clean, structured data. Transform raw HTML into actionable insights.
**Use Cases:** Extract product data, scrape article content, get structured data with JSON mode.
**Crawl:** Perform recursive crawls of websites and subdomains to gather extensive content.
**Use Cases:** Full site content extraction, documentation scraping, multi-page data collection.
**Map:** Generate a complete map of all URLs present on a website.
**Use Cases:** Site structure analysis, SEO auditing, URL discovery for batch scraping.
**Crawl Job:** Retrieve scraping results based on a Job ID or cancel ongoing tasks.
**Use Cases:** Monitor long-running crawls, manage async scraping workflows, cancel operations when needed.
## Getting Started
1. Access the [Dify Plugin Marketplace](https://marketplace.dify.ai/plugins/langgenius/firecrawl) and install the Firecrawl tool
2. Visit [Firecrawl API Keys](https://www.firecrawl.dev/app/api-keys) and create a new API key
3. Navigate to **Plugins > Firecrawl > To Authorize** and input your API key
4. Drag Firecrawl tools into your Chatflow, Workflow, or Agent application
5. Set up parameters and test your workflow
## Usage Patterns
**Visual Pipeline Integration**
1. Add Firecrawl node to your pipeline
2. Select action (Map, Crawl, Scrape)
3. Define input variables
4. Execute pipeline sequentially
**Example Flow:**
```
User Input → Firecrawl (Scrape) → LLM Processing → Response
```
**Automated Data Processing**
Build multi-step workflows with:
* Scheduled scraping
* Data transformation
* Database storage
* Notifications
**Example Flow:**
```
Schedule Trigger → Firecrawl (Crawl) → Data Processing → Storage
```
**AI-Powered Web Access**
Give agents real-time web scraping capabilities:
1. Add Firecrawl tool to Agent
2. Agent autonomously decides when to scrape
3. LLM analyzes extracted content
4. Agent provides informed responses
**Use Case:** Customer support agents that reference live documentation
## Common Use Cases
* Build RAG-powered chatbots that scrape and reference live website content
* Agents that research topics by scraping and analyzing multiple sources
* Automated workflows that track competitor websites and alert on changes
* Extract and enrich data from websites into structured databases
## Firecrawl Actions
| Tool | Description | Best For |
| ------------- | ------------------------------ | ----------------------- |
| **Scrape** | Single-page data extraction | Quick content capture |
| **Crawl** | Multi-page recursive crawling | Full site extraction |
| **Map** | URL discovery and site mapping | SEO analysis, URL lists |
| **Crawl Job** | Async job management | Long-running operations |
## Best Practices
* Let agents decide when to scrape
* Use natural language instructions
* Enable tool calling in LLM settings
* Monitor token usage with large scrapes
* Use Map before Crawl for large sites
* Set appropriate crawl limits
* Add error handling nodes
* Test with small datasets first
## Dify vs Other Platforms
| Feature | Dify | Make | Zapier | n8n |
| --------------- | -------------------- | ------------------- | ------------------- | ------------------- |
| **Type** | LLM app platform | Workflow automation | Workflow automation | Workflow automation |
| **Best For** | AI agents & chatbots | Visual workflows | Quick automation | Developer control |
| **Pricing** | Open-source + Cloud | Operations-based | Per-task | Flat monthly |
| **AI-Native** | Yes | Partial | Partial | Partial |
| **Self-Hosted** | Yes | No | No | Yes |
**Pro Tip:** Dify excels at building AI-native applications where agents need dynamic web access. Perfect for chatbots, research assistants, and AI tools that need live data.
# Firecrawl + Make
Source: https://docs.firecrawl.dev/developer-guides/workflow-automation/make
Official integration and workflow automation for Firecrawl + Make
**Official Make Integration:** [make.com/en/integrations/firecrawl](https://www.make.com/en/integrations/firecrawl)
Connect with 3,000+ apps • Visual workflow builder • Enterprise-grade automation • AI-powered scenarios
## Make Integration Overview
Make (formerly Integromat) provides a verified, officially supported Firecrawl integration maintained by Mendable.
* Design complex automations with Make's intuitive visual interface
* Scale securely with enterprise-grade automation and controls
## Firecrawl Modules in Make
### Crawl a Website
Crawl a URL and get its content from multiple pages
***
### Extract a Website
Extract structured data from pages using LLMs
***
### Scrape a Website
Scrape a URL and get its content from a single page
***
### Map a Website
Map multiple URLs based on options
***
### Search a Website
Web search with Firecrawl's scraping capabilities, returning full-page content for any search query
***
### Get Crawl Status
Get the status of a given crawl event ID
***
### Get Extract Status
Get the status of a given extraction event ID
***
### Make an API Call
Perform arbitrary authorized API calls for custom use cases
## Popular App Integrations
* **Google Sheets** - Track and log scraped data in spreadsheets
* **Airtable** - Build structured databases with scraped content
* **Google Drive** - Store scraped files and reports
* **Notion** - Organize research and web data
* **Slack** - Get alerts for website changes and updates
* **Telegram Bot** - Instant notifications for monitoring
* **Gmail** - Email reports and digests
* **Microsoft 365 Email** - Enterprise email automation
* **HubSpot CRM** - Enrich leads with web data
* **monday.com** - Track competitor intelligence
* **ClickUp** - Manage research tasks
* **OpenAI (ChatGPT, DALL-E)** - Analyze and summarize scraped content
* **Google Gemini AI** - Process and extract insights
* **Perplexity AI** - Enhanced research workflows
* **Make AI Agents** - Build adaptive AI-powered automations
## Common Workflow Patterns
**Schedule** → Firecrawl (Scrape) → Google Sheets (Log) → Slack (Alert)
Track competitor websites and get instant notifications
**Google Forms** → Firecrawl (Scrape company site) → HubSpot CRM (Update)
Automatically enrich leads with company data
**Schedule** → Firecrawl (Crawl blog) → OpenAI (Summarize) → Gmail (Send digest)
Automated content curation and distribution
**Schedule (Hourly)** → Firecrawl (Scrape) → Filter → Telegram (Alert)
Real-time price tracking and alerts
## Getting Started
1. Get your API key at [firecrawl.dev](https://firecrawl.dev)
2. Log into [Make](https://make.com) and click "Create a new scenario"
3. Search for "Firecrawl" and select your desired action
4. Add your Firecrawl API key to authenticate
5. Set up your workflow parameters and run a test
## Firecrawl Actions Overview
| Module | Use Case | Best For |
| --------------------- | -------------------------------- | ------------------- |
| **Scrape a Website** | Single-page data extraction | Quick data capture |
| **Crawl a Website** | Multi-page content collection | Full site scraping |
| **Extract a Website** | AI-powered structured extraction | Complex data needs |
| **Search a Website** | Search + full content | Research automation |
| **Map a Website** | URL discovery | SEO analysis |
## Best Practices
* Use **Scrape** for single pages (fastest)
* Use **Crawl** with limits for large sites
* Schedule appropriately to avoid rate limits
* Add error handling modules
* Schedule strategically (hourly/daily/weekly)
* Use filters to prevent unnecessary runs
* Set crawl limits to control API usage
* Test in Firecrawl playground first
## Industry Use Cases
* **E-commerce**: competitor price monitoring, product availability tracking, review aggregation and analysis, inventory level monitoring
* **Real estate**: listing aggregation from multiple sources, market trend analysis, property data enrichment, competitive pricing intelligence
* **Marketing & SEO**: competitor content monitoring, SEO performance tracking, backlink analysis, social media mention tracking
* **Finance**: market data collection, news and sentiment aggregation, regulatory filing monitoring, stock price tracking
* **Recruiting**: job posting aggregation, company research automation, candidate background enrichment, salary benchmarking
## Make vs Zapier vs n8n
| Feature | Make | Zapier | n8n |
| ------------------ | --------------------------------- | ---------------- | -------------------- |
| **Setup** | Visual builder, cloud | No-code, cloud | Self-hosted or cloud |
| **Pricing** | Operations-based | Per-task pricing | Flat monthly |
| **Integrations** | 3,000+ apps | 8,000+ apps | 400+ integrations |
| **Complexity** | Advanced workflows | Simple workflows | Complex workflows |
| **Best For** | Visual automation, mid-complexity | Quick automation | Developer control |
| **Learning Curve** | Moderate | Easy | Moderate-Advanced |
**Pro Tip:** Make excels at visual workflow design and complex automations. Perfect for teams that need more control than Zapier but prefer visual building over n8n's code-first approach.
# Firecrawl + n8n
Source: https://docs.firecrawl.dev/developer-guides/workflow-automation/n8n
Learn how to use Firecrawl with n8n for web scraping automation, a complete step-by-step guide.
## Introduction to Firecrawl and n8n
Web scraping automation has become essential for modern businesses. Whether you need to monitor competitor prices, aggregate content, generate leads, or power AI applications with real-time data, the combination of Firecrawl and n8n provides a powerful solution without requiring programming knowledge.
**What is n8n?**
n8n is an open-source workflow automation platform that connects different tools and services together. Think of it as a visual programming environment where you drag and drop nodes onto a canvas, connect them, and create automated workflows. With over 400 integrations, n8n lets you build complex automations without writing code.
## Why Use Firecrawl with n8n?
Traditional web scraping presents several challenges. Custom scripts break when websites update their structure. Anti-bot systems block automated requests. JavaScript-heavy sites don't render properly. Infrastructure requires constant maintenance.
Firecrawl handles these technical complexities on the scraping side, while n8n provides the automation framework. Together, they let you build production-ready workflows that:
* Extract data from any website reliably
* Connect scraped data to other business tools
* Run on schedules or triggered by events
* Scale from simple tasks to complex pipelines
This guide will walk you through setting up both platforms and building your first scraping workflow from scratch.
## Step 1: Create Your Firecrawl Account
Firecrawl provides the web scraping capabilities for your workflows. Let's set up your account and get your API credentials.
### Sign Up for Firecrawl
1. Navigate to [firecrawl.dev](https://firecrawl.dev) in your web browser
2. Click the "Get Started" or "Sign Up" button
3. Create an account using your email address or GitHub login
4. Verify your email if prompted
### Get Your API Key
After signing in, you need an API key to connect Firecrawl to n8n:
1. Go to your Firecrawl dashboard
2. Navigate to the [API Keys page](https://www.firecrawl.dev/app/api-keys)
3. Click "Create New API Key"
4. Give your key a descriptive name (e.g., "n8n Integration")
5. Copy the generated API key and save it somewhere secure
Your API key is like a password. Keep it secure and never share it publicly. You'll need this key in the next section.
Firecrawl provides free credits when you sign up, which are enough to test your workflows and complete this tutorial.
## Step 2: Set Up n8n
n8n offers two deployment options: cloud-hosted or self-hosted. For beginners, the cloud version is the fastest way to get started.
### Choose Your n8n Version
**n8n Cloud (Recommended for beginners):**
* No installation required
* Free tier available
* Managed infrastructure
* Automatic updates
**Self-Hosted:**
* Complete data control
* Run on your own servers
* Requires Docker installation
* Good for advanced users with specific security requirements
Choose the option that fits your needs. Both paths will lead you to the same workflow editor interface.
### Option A: n8n Cloud (Recommended for Beginners)
1. Visit [n8n.cloud](https://n8n.cloud)
2. Click "Start Free" or "Sign Up"
3. Register using your email address or GitHub
4. Complete the verification process
5. You'll be directed to your n8n dashboard
The free tier provides everything you need to build and test workflows. You can upgrade later if you need more execution time or advanced features.
### Option B: Self-Hosted with Docker
If you prefer to run n8n on your own infrastructure, you can set it up quickly using Docker.
**Prerequisites:**
* [Docker Desktop](https://www.docker.com/products/docker-desktop/) installed on your computer
* Basic familiarity with command line/terminal
**Installation Steps:**
1. Open your terminal or command prompt
2. Create a Docker volume to persist your workflow data:
```bash theme={null}
docker volume create n8n_data
```
This volume stores your workflows, credentials, and execution history so they persist even if you restart the container.
3. Run the n8n Docker container:
```bash theme={null}
docker run -it --rm --name n8n -p 5678:5678 -v n8n_data:/home/node/.n8n docker.n8n.io/n8nio/n8n
```
4. Wait for n8n to start. You'll see output indicating the server is running
5. Open your web browser and navigate to `http://localhost:5678`
6. Create your n8n account by registering with an email
Your self-hosted n8n instance is now running locally. The interface is identical to n8n Cloud, so you can follow the rest of this guide regardless of which option you chose.
The `--rm` flag automatically removes the container when you stop it, but your data remains safe in the `n8n_data` volume. For production deployments, see the [n8n self-hosting documentation](https://docs.n8n.io/hosting/) for more advanced configuration options.
### Understanding the n8n Interface
When you first log in to n8n, you'll see the main dashboard. Key interface elements:
* **Workflows**: Your saved automations appear here
* **Executions**: History of workflow runs
* **Credentials**: Stored API keys and authentication tokens
* **Settings**: Account and workspace configuration
Click "Create New Workflow" to open the workflow editor.
### The Workflow Canvas
The workflow editor is where you'll build your automations. Important elements:
* **Canvas**: The main area where you place and connect nodes
* **Add Node Button (+)**: Click this to add new nodes to your workflow
* **Node Panel**: Opens when you click "+" showing all available nodes
* **Execute Workflow**: Runs your workflow manually for testing
* **Save**: Saves your workflow configuration
Let's build your first workflow by adding the Firecrawl node.
## Step 3: Install and Configure the Firecrawl Node
n8n includes native support for Firecrawl. You'll install the node and connect it to your Firecrawl account using the API key you created earlier.
### Add the Firecrawl Node to Your Workflow
1. In your new workflow canvas, click the "**+**" button in the center
2. The node selection panel opens on the right side
3. In the search box at the top, type "**Firecrawl**"
4. You'll see the Firecrawl node appear in the search results
5. Click "**Install**" next to the Firecrawl node
6. Wait for the installation to complete (this takes a few seconds)
7. Once installed, click on the Firecrawl node to add it to your canvas
The Firecrawl node will appear on your canvas as a box with the Firecrawl logo. This node represents a single Firecrawl operation in your workflow.
### Connect Your Firecrawl API Key
**n8n Cloud users:** Instead of manually entering an API key, you can use the one-click **"Connect to Firecrawl"** OAuth button when adding the Firecrawl node. This automatically creates a new Firecrawl team linked to your n8n account and grants **100,000 free credits**. To view these credits on the [Firecrawl dashboard](https://www.firecrawl.dev/app/usage), make sure you switch to your n8n-linked team using the team selector in the top-left corner.
Before you can use the Firecrawl node, you need to authenticate it with your API key:
1. Click on the Firecrawl node box to open its configuration panel on the right
2. At the top, you'll see a "Credential to connect with" dropdown
3. Since this is your first time, click "**Create New Credential**"
4. A credential configuration window opens
5. Enter a name for this credential (e.g., "My Firecrawl Account")
6. Paste your Firecrawl API key in the "API Key" field
7. Click "**Save**" at the bottom
The credential is now saved in n8n. You won't need to enter your API key again for future Firecrawl nodes.
### Test Your Connection
Let's verify that your Firecrawl node is properly connected:
1. With the Firecrawl node still selected, look at the configuration panel
2. In the "Resource" dropdown, select "**Scrape a url and get its content**"
3. In the "URL" field, enter: `https://firecrawl.dev`
4. Leave other settings at their defaults for now
5. Click the "**Test step**" button at the bottom right of the node
If everything is configured correctly, you'll see the scraped content from firecrawl.dev appear in the output panel below the node.
Congratulations! Your Firecrawl node is now connected and working.
## Step 4: Create Your Telegram Bot
Before building your first workflow, you'll need a Telegram bot to receive notifications. Telegram bots are free and easy to create through Telegram's BotFather.
### Create a Bot with BotFather
1. Open Telegram on your phone or desktop
2. Search for "**@BotFather**" (the official bot from Telegram)
3. Start a conversation with BotFather by clicking "**Start**"
4. Send the command `/newbot` to create a new bot
5. BotFather will ask you to choose a name for your bot (this is the display name users will see)
6. Enter a name like "**My Firecrawl Bot**"
7. Next, choose a username for your bot. It must end with "bot" (e.g., "**my\_firecrawl\_updates\_bot**")
8. If the username is available, BotFather will create your bot and send you a message with your bot token
Save your bot token securely. This token is like a password that allows n8n to send messages as your bot. Never share it publicly.
### Get Your Chat ID
To send messages to yourself, you need your Telegram chat ID:
1. Open your web browser and visit this URL (replace `YOUR_BOT_TOKEN` with your actual bot token):
```
https://api.telegram.org/botYOUR_BOT_TOKEN/getUpdates
```
2. Keep this browser tab open
3. Now, search for your bot's username in Telegram (the one you just created)
4. Start a conversation with your bot by clicking "**Start**"
5. Send any message to your bot (e.g., "hello")
6. Go back to the browser tab and refresh the page
7. Look for the `"chat":{"id":` field in the JSON response
8. The number next to `"id":` is your chat ID (e.g., `123456789`)
9. Save this chat ID for later
Your chat ID is the unique identifier for your conversation with the bot. You'll use this to tell n8n where to send messages.
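If you'd rather grab the chat ID programmatically, here's a minimal Python sketch that calls the same `getUpdates` endpoint (the `requests` library and the `YOUR_BOT_TOKEN` placeholder are assumptions; make sure you've messaged your bot first so the update list isn't empty):

```python Python theme={null}
# Minimal sketch: read your Telegram chat ID from the Bot API's getUpdates method.
import requests

BOT_TOKEN = "YOUR_BOT_TOKEN"  # placeholder: paste the token from BotFather

resp = requests.get(f"https://api.telegram.org/bot{BOT_TOKEN}/getUpdates", timeout=10)
resp.raise_for_status()

for update in resp.json().get("result", []):
    message = update.get("message")
    if message:
        # Each update carries the chat ID at message.chat.id
        print("Chat ID:", message["chat"]["id"])
```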
You now have everything needed to integrate Telegram with your n8n workflows.
## Step 5: Build Practical Workflows with Telegram
Now let's build three real-world workflows that send information directly to your Telegram. These examples demonstrate different Firecrawl operations and how to integrate them with Telegram notifications.
### Example 1: Daily Firecrawl Product Updates Summary
Get a daily summary of Firecrawl product updates delivered to your Telegram every morning.
**What you'll build:**
* Scrapes Firecrawl's product updates blog at 9 AM daily
* Uses AI to generate a summary of the content
* Sends the summary to your Telegram
**Step-by-step:**
1. Create a new workflow in n8n
2. Add a **Schedule Trigger** node:
* Click the "**+**" button on canvas
* Search for "**Schedule Trigger**"
* Configure: Every day at 9:00 AM
3. Add the **Firecrawl** node:
* Click "**+**" next to Schedule Trigger
* Search for and add "**Firecrawl**"
* Select your Firecrawl credential
* Configure:
* **Resource**: Scrape a url and get its content
* **URL**: `https://www.firecrawl.dev/blog/category/product-updates`
* **Formats**: Select "Summary"
4. Add the **Telegram** node:
* Click "**+**" next to Firecrawl
* Search for "**Telegram**"
* Click "**Send a text message**" to add it to the canvas
5. Set up Telegram credentials:
* Click on the Telegram node to open its configuration
* In the "Credential to connect with" dropdown, click "**Create New Credential**"
* Paste your bot token from BotFather
* Click "**Save**"
6. Configure the Telegram message:
* **Operation**: Send Message
* **Chat ID**: Enter your chat ID
* **Text**: Enter a placeholder message (e.g., "hello") for now
* Click **Execute step** to confirm the bot delivers the test message to your chat
* Once the test succeeds, drag the `summary` field from the Firecrawl node's output into the **Text** field so the daily summary replaces the placeholder
7. Test the workflow:
* Click "**Execute Workflow**"
* Check your Telegram for the summary message
8. Activate the workflow by toggling the "**Active**" switch
Your Telegram bot will now send you a daily summary of Firecrawl product updates every morning at 9 AM.
### Example 2: AI News Search to Telegram
This workflow uses Firecrawl's Search operation to find AI news and send formatted results to Telegram.
**Key differences from Example 1:**
* Uses a **Manual Trigger** instead of Schedule (run on demand)
* Uses **Search** operation instead of Scrape
* Includes a **Code** node to format multiple results
**Build the workflow:**
1. Create a new workflow and add a **Manual Trigger** node
2. Add **Firecrawl** node with these settings:
* **Resource**: Search and optionally scrape search results
* **Query**: `ai news`
* **Limit**: 5
3. Add a **Code** node to format the search results:
* Select "Run Once for All Items"
* Paste this code:
```javascript theme={null}
// Gather every item emitted by the Firecrawl node
const results = $input.all();
let message = "Latest AI News:\n\n";

results.forEach((item) => {
  // Search results are nested under data.web in the Firecrawl response
  const webData = item.json.data.web;
  webData.forEach((article, index) => {
    message += `${index + 1}. ${article.title}\n`;
    message += `${article.description}\n`;
    message += `${article.url}\n\n`;
  });
});

// Return a single item so the Telegram node receives one formatted message
return [{ json: { message } }];
```
4. Update **Telegram** node (using your saved credential):
* **Text**: Drag the `message` field from Code node
Replace the Manual Trigger with a Schedule Trigger to get automatic AI news updates at set intervals.
### Example 3: AI-Powered News Summary
This workflow adds AI to Example 2, using OpenAI to generate intelligent summaries of the latest AI news before sending to Telegram.
**Key changes from Example 2:**
* Add **OpenAI credentials** setup
* Add **AI Agent** node between Code and Telegram
* AI Agent analyzes and summarizes all the news articles intelligently
* Telegram receives the AI-generated summary instead of raw news list
**Modify the workflow:**
1. **Get your OpenAI API key**:
* Go to [platform.openai.com/api-keys](https://platform.openai.com/api-keys)
* Sign in or create an account
* Click "**Create new secret key**"
* Give it a name (e.g., "n8n Integration")
* Copy the API key immediately (you won't see it again)
2. **Add and connect the AI Agent node**:
* Click "**+**" after the Code node
* Search for "**Basic LLM Chain**" or "**AI Agent**"
* Drag the `message` field from the Code node to the AI Agent's input prompt field
* Select **OpenAI** as the LLM provider
3. **Add your OpenAI credentials**:
* Click "**Create New Credential**" for OpenAI
* Paste your OpenAI API key
* Select model: **gpt-5-mini** (cost-effective) or **gpt-5** (more capable)
* Click "**Save**"
4. **Add the system prompt to the AI Agent**:
* In the AI Agent node, add this system prompt:
```
You are an AI news analyst. Analyze the provided AI news articles and create a concise,
insightful summary highlighting the most important developments and trends.
Group related topics together and provide context about why these developments matter.
Keep the summary conversational and engaging, around 3-4 paragraphs.
```
5. **Update the Telegram node and test**:
* Update the Telegram node:
* **Text**: Drag the AI Agent's output (the generated summary)
* Remove the old mapping to the Code node's message
* Click "**Execute Workflow**" to test
* The AI will analyze all news articles and create a summary
* Check your Telegram for the AI-generated summary
The AI Agent receives all the formatted news articles and creates an intelligent summary, making it easier to understand trends and important developments at a glance.
## Understanding Firecrawl Operations
Now that you've built some workflows, let's explore the different Firecrawl operations available in n8n. Each operation is designed for specific web scraping use cases.
### Scrape a url and get its content
Extracts content from a single web page and returns it in various formats.
**What it does:**
* Scrapes a single URL
* Returns clean markdown, HTML, or AI-generated summaries
* Can capture screenshots and extract links
**Best for:**
* Article extraction
* Product page monitoring
* Blog post scraping
* Generating page summaries
**Example use case:** Daily blog summaries (like Example 1 above)
### Search and optionally scrape search results
Performs web searches and returns results with optional content scraping.
**What it does:**
* Searches the web, news, or images
* Returns titles, descriptions, and URLs
* Optionally scrapes the full content of results
**Best for:**
* Research automation
* News monitoring
* Trend discovery
* Finding relevant content
**Example use case:** AI news aggregation (like Example 2 above)
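For reference, the Search operation maps to Firecrawl's search API. A rough Python SDK equivalent of Example 2's node settings looks like this (a sketch; it assumes the SDK's `search` method returns results with a `web` list, matching the `data.web` array the Code node reads):

```python Python theme={null}
# Sketch: roughly what the n8n Search operation calls under the hood.
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# Same settings as Example 2: query "ai news", limit 5
results = firecrawl.search("ai news", limit=5)

for article in results.web:
    print(article.title, article.url)
```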
### Crawl a website
Recursively discovers and scrapes multiple pages from a website.
**What it does:**
* Follows links automatically
* Scrapes multiple pages in one operation
* Can filter URLs by patterns
**Best for:**
* Full documentation extraction
* Site archiving
* Multi-page data collection
### Map a website and get urls
Returns all URLs found on a website without scraping content.
**What it does:**
* Discovers all links on a site
* Returns clean URL list
* Fast and lightweight
**Best for:**
* URL discovery
* Sitemap generation
* Planning larger crawls
### Extract Data
Uses AI to extract structured information based on custom prompts or schemas.
**What it does:**
* AI-powered data extraction
* Returns data in your specified format
* Works across multiple pages
**Best for:**
* Custom data extraction
* Building databases
* Structured information gathering
### Batch Scrape
Scrapes multiple URLs in parallel efficiently.
**What it does:**
* Processes multiple URLs at once
* More efficient than loops
* Returns all results together
**Best for:**
* Processing URL lists
* Bulk data collection
* Large-scale scraping projects
### Agent
Uses an AI agent to autonomously browse and extract data from websites based on a natural language prompt.
**What it does:**
* Accepts a prompt describing what data you need
* AI agent navigates and extracts information autonomously
* Available in **Sync** mode (waits for results) and **Async** mode (returns a job ID immediately)
* Use **Get Agent Status** to poll for results when using Async mode
**Best for:**
* Complex, multi-page data gathering guided by a prompt
* Extracting information when you don't know the exact page structure
* Research tasks that require navigating multiple pages
**Sync vs. Async:**
* **Agent (Sync)** starts the job and waits for the result in one step — simplest for most use cases. The **Max Wait Time** parameter controls how long the node polls before timing out (default: 300 seconds, maximum: 600 seconds). If the agent job takes longer than this, the node returns a timeout status even though the job may still complete on the Firecrawl side. For jobs that may exceed 10 minutes, use the async mode instead.
* **Agent (Async)** returns a job ID immediately. Add a second Firecrawl node with the **Get Agent Status** operation to retrieve results once the job completes.
For details on the agent feature, see the [Agent documentation](/features/agent).
### Browser Sandbox
Provides a persistent browser session that you can control with code, allowing multi-step browser automation within a single session.
**Operations:**
* **Create Browser Session** — starts a new browser session and returns a `sessionId`
* **Execute Browser Code** — runs JavaScript, Python, or bash code in the session (using the `sessionId` from the Create step)
* **List Browser Sessions** — lists active or destroyed sessions
* **Delete Browser Session** — destroys a session when you are done
**Best for:**
* Multi-step browser workflows that require maintaining state across pages
* Dynamic page navigation where the number of steps is not known in advance
* Workflows that use persistent browser profiles to preserve cookies and localStorage across runs
In n8n, select the **Browser** resource on the Firecrawl node to access these operations. Pass the `sessionId` from the Create step into each subsequent Execute or Delete step. Use n8n's **Loop Over Items** node to iterate through a dynamic list of pages, calling Execute for each one within the same session.
For details on the Browser Sandbox feature, see the [Browser Sandbox documentation](/features/browser-sandbox).
## Workflow Templates and Examples
Instead of building from scratch, you can start with pre-built templates. The n8n community has created numerous Firecrawl workflows you can copy and customize.
### Featured Templates
* Build an AI chatbot with web access using Firecrawl and n8n
* Ready-to-use templates for lead generation, price monitoring, and more
* Scrape pages into embeddings and store in Supabase pgvector for RAG
* Scrape company websites and extract structured business signals
* Scrape pages into embeddings and store in Pinecone for RAG
* Browse hundreds of community workflows that use Firecrawl
* View the official integration documentation
### How to Import Templates
To use a template from the n8n community:
1. Click on a workflow template link
2. Click the "**Import template**" button on the template page (the button label includes your n8n instance, e.g., `localhost:5678` for a self-hosted setup)
3. The workflow opens in your n8n instance
4. Configure credentials for each node
5. Customize settings for your use case
6. Activate the workflow
## Best Practices
Follow these guidelines to build reliable, efficient workflows:
### Testing and Debugging
* Always test workflows manually before activating schedules
* Use the "**Execute Workflow**" button to test the entire flow
* Check output data at each node to verify correctness
* Use the "**Executions**" tab to review past runs and debug issues
### Error Handling
* Add Error Trigger nodes to catch and handle failures
* Set up notifications when workflows fail
* Use the "**Continue On Fail**" setting for non-critical nodes
* Monitor your workflow executions regularly
### Performance Optimization
* Use Batch Scrape for multiple URLs instead of loops
* Set appropriate rate limits to avoid overwhelming target sites
* Cache data when possible to reduce unnecessary requests
* Schedule intensive workflows during off-peak hours
### Security
* Never expose API keys in workflow configurations
* Use n8n's credential system to securely store authentication
* Be careful when sharing workflows publicly
* Follow target websites' terms of service and robots.txt
## Next Steps
You now have the fundamentals to build web scraping automations with Firecrawl and n8n. Here's how to continue learning:
### Explore Advanced Features
* Study webhook configurations for real-time data processing
* Experiment with AI-powered extraction using prompts and schemas
* Build complex multi-step workflows with branching logic
### Join the Community
* [Firecrawl Discord](https://discord.gg/firecrawl) - Get help with Firecrawl and discuss web scraping
* [n8n Community Forum](https://community.n8n.io/) - Ask questions about workflow automation
* Share your workflows and learn from others
### Recommended Learning Path
1. Complete the example workflows in this guide
2. Modify templates from the community library
3. Build a workflow to solve a real problem in your work
4. Explore advanced Firecrawl operations
5. Contribute your own templates to help others
**Need help?** If you're stuck or have questions, the Firecrawl and n8n communities are active and helpful. Don't hesitate to ask for guidance as you build your automations.
## Additional Resources
* [Firecrawl API Documentation](/api-reference/v2-introduction)
* [n8n Documentation](https://docs.n8n.io/)
* [Web Scraping Best Practices](https://www.firecrawl.dev/blog)
# Firecrawl + Zapier
Source: https://docs.firecrawl.dev/developer-guides/workflow-automation/zapier
Official tutorials and Zapier integration templates for Firecrawl + Zapier automation
**Official Zapier Integration:** [zapier.com/apps/firecrawl/integrations](https://zapier.com/apps/firecrawl/integrations)
Connect with 8,000+ apps • No-code automation • Pre-built Zap templates • Cloud-based
## Official Blog Post
Real-world case study: How Zapier integrated Firecrawl into Zapier Chatbots in a single afternoon.
## Popular Integrations
### Google Sheets
→ [View Integration](https://zapier.com/apps/google-sheets/integrations/firecrawl)
Track competitor data, centralize marketing insights, and automate data collection.
**Best For:** Business owners, marketing teams
***
### Airtable
→ [View Integration](https://zapier.com/apps/airtable/integrations/firecrawl)
Build lead generation databases and content aggregation systems with structured storage.
**Best For:** Sales teams, project managers
***
### Zapier Tables
→ [View Integration](https://zapier.com/apps/zapier-tables/integrations/firecrawl)
No-code database automation for employee onboarding and centralized lead management.
**Best For:** HR teams, operations
### Slack
→ [View Integration](https://zapier.com/apps/slack/integrations/firecrawl)
Get website change notifications, competitor monitoring alerts, and market intelligence updates.
**Best For:** Marketing teams, product managers
***
### Telegram
→ [View Integration](https://zapier.com/apps/telegram/integrations/firecrawl)
Instant price alerts, breaking news notifications, and real-time monitoring.
**Best For:** Traders, news enthusiasts
### HubSpot
→ [View Integration](https://zapier.com/apps/hubspot/integrations/firecrawl)
Contact enrichment, lead scoring with web data, and marketing automation.
**Best For:** Marketing ops, sales ops
***
### Pipedrive
→ [View Integration](https://zapier.com/apps/pipedrive/integrations/firecrawl)
Lead enrichment from websites and competitor intelligence tracking.
**Best For:** Sales teams, account executives
***
### Attio
→ [View Integration](https://zapier.com/apps/attio/integrations/firecrawl)
Modern CRM data enrichment and relationship intelligence.
**Best For:** Modern sales teams
### Google Docs
→ [View Integration](https://zapier.com/apps/google-docs/integrations/firecrawl)
Automated report generation, research documentation, and content aggregation.
**Best For:** Researchers, content creators
***
### Notion
→ [View Integration](https://zapier.com/apps/notion/integrations/firecrawl)
Knowledge base updates, research library building, and content curation.
**Best For:** Product teams, researchers
### Schedule by Zapier
→ [View Integration](https://zapier.com/apps/schedule/integrations/firecrawl)
Run hourly, daily, weekly, or monthly scraping automatically.
***
### Zapier Interfaces
→ [View Integration](https://zapier.com/apps/interfaces/integrations/firecrawl)
Build custom internal tools with form-based scraping and team dashboards.
**Best For:** Operations teams
***
### Zapier Chatbots
→ [View Integration](https://zapier.com/apps/zapier-chatbots/integrations/firecrawl)
AI chatbots with live web knowledge for customer support and lead generation.
This official Zapier product uses Firecrawl internally.
## Firecrawl Actions
| Action | Use Case |
| --------------------------- | ----------------------------------------- |
| **Scrape URL** | Quick single-page data capture |
| **Crawl Website** | Full site scraping with multiple pages |
| **Extract Structured Data** | AI-powered extraction with custom schemas |
| **Search Web** | Research automation with search + scrape |
| **Map Website** | SEO analysis and site structure mapping |
## Quick Reference
1. Sign up at [firecrawl.dev](https://firecrawl.dev)
2. Get your API key
3. Create a Zap in Zapier
4. Connect Firecrawl with your API key
5. Choose your workflow and activate
* Use **Scrape URL** for single pages (faster)
* Schedule strategically (hourly/daily/weekly)
* Test in Firecrawl playground first
* Add error handling for failed scrapes
* Use filters to prevent unnecessary runs
## Industry Use Cases
* **E-commerce**: price monitoring across competitors, product availability tracking, review aggregation
* **Real estate**: listing aggregation, market trend analysis, property data collection
* **Marketing & SEO**: competitor content tracking, SEO monitoring, backlink analysis
* **Finance**: market data collection, news aggregation, regulatory filing monitoring
* **Recruiting**: job posting aggregation, company research automation, candidate information enrichment
## Zapier vs n8n
| Feature | Zapier | n8n |
| ---------------- | ------------------------------------- | ------------------------ |
| **Setup** | No-code, cloud-based | Self-hosted or cloud |
| **Pricing** | Per-task pricing | Flat monthly |
| **Integrations** | 8,000+ apps | 400+ integrations |
| **Best For** | Quick automation, non-technical users | Custom logic, developers |
**Pro Tip:** Start with Zapier's pre-built templates and customize as needed. Perfect for quick, no-code automation!
# Agent
Source: https://docs.firecrawl.dev/features/agent
Gather data wherever it lives on the web.
Firecrawl `/agent` is a magic API that searches, navigates, and gathers data from the widest range of websites, finding data in hard-to-reach places and uncovering data in ways no other API can. It accomplishes in a few minutes what would take a human many hours — end-to-end data collection, without scripts or manual work.
Whether you need one data point or entire datasets at scale, Firecrawl `/agent` works to get your data.
**Think of `/agent` as deep research for data, wherever it is!**
**Research Preview**: Agent is in early access. Expect rough edges. It will get significantly better over time. [Share feedback →](mailto:product@firecrawl.com)
Agent builds on everything great about `/extract` and takes it further:
* **No URLs Required**: Just describe what you need via `prompt` parameter. URLs are optional
* **Deep Web Search**: Autonomously searches and navigates deep into sites to find your data
* **Reliable and Accurate**: Works with a wide variety of queries and use cases
* **Faster**: Processes multiple sources in parallel for quicker results
Test the agent in the interactive playground — no code required.
## Using `/agent`
The only required parameter is `prompt`. Simply describe what data you want to extract. For structured output, provide a JSON schema. The SDKs support Pydantic (Python) and Zod (Node) for type-safe schema definitions:
```python Python theme={null}
from firecrawl import Firecrawl
from pydantic import BaseModel, Field
from typing import List, Optional
app = Firecrawl(api_key="fc-YOUR_API_KEY")
class Founder(BaseModel):
name: str = Field(description="Full name of the founder")
role: Optional[str] = Field(None, description="Role or position")
background: Optional[str] = Field(None, description="Professional background")
class FoundersSchema(BaseModel):
founders: List[Founder] = Field(description="List of founders")
result = app.agent(
prompt="Find the founders of Firecrawl",
schema=FoundersSchema,
model="spark-1-mini",
max_credits=100
)
print(result.data)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
import { z } from 'zod';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
const result = await firecrawl.agent({
prompt: "Find the founders of Firecrawl",
schema: z.object({
founders: z.array(z.object({
name: z.string().describe("Full name of the founder"),
role: z.string().describe("Role or position").optional(),
background: z.string().describe("Professional background").optional()
})).describe("List of founders")
}),
model: "spark-1-mini",
maxCredits: 100
});
console.log(result.data);
```
```bash cURL theme={null}
curl -X POST "https://api.firecrawl.dev/v2/agent" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Find the founders of Firecrawl",
"model": "spark-1-mini",
"maxCredits": 100,
"schema": {
"type": "object",
"properties": {
"founders": {
"type": "array",
"description": "List of founders",
"items": {
"type": "object",
"properties": {
"name": { "type": "string", "description": "Full name" },
"role": { "type": "string", "description": "Role or position" },
"background": { "type": "string", "description": "Professional background" }
},
"required": ["name"]
}
}
},
"required": ["founders"]
}
}'
```
### Response
```json JSON theme={null}
{
"success": true,
"status": "completed",
"data": {
"founders": [
{
"name": "Eric Ciarla",
"role": "Co-founder",
"background": "Previously at Mendable"
},
{
"name": "Nicolas Camara",
"role": "Co-founder",
"background": "Previously at Mendable"
},
{
"name": "Caleb Peffer",
"role": "Co-founder",
"background": "Previously at Mendable"
}
]
},
"expiresAt": "2024-12-15T00:00:00.000Z",
"creditsUsed": 15
}
```
## Providing URLs (Optional)
You can optionally provide URLs to focus the agent on specific pages:
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
result = app.agent(
urls=["https://docs.firecrawl.dev", "https://firecrawl.dev/pricing"],
prompt="Compare the features and pricing information from these pages"
)
print(result.data)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
const result = await firecrawl.agent({
urls: ["https://docs.firecrawl.dev", "https://firecrawl.dev/pricing"],
prompt: "Compare the features and pricing information from these pages"
});
console.log(result.data);
```
```bash cURL theme={null}
curl -X POST "https://api.firecrawl.dev/v2/agent" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://docs.firecrawl.dev",
"https://firecrawl.dev/pricing"
],
"prompt": "Compare the features and pricing information from these pages"
}'
```
## Job Status and Completion
Agent jobs run asynchronously. When you submit a job, you'll receive a Job ID that you can use to check status:
* **Default method**: `agent()` waits and returns final results
* **Start then poll**: Use `start_agent` (Python) or `startAgent` (Node) to get a Job ID immediately, then poll with `get_agent_status` / `getAgentStatus`
Job results are available via the API for 24 hours after completion. After this period, you can still view your agent history and results in the [activity logs](https://www.firecrawl.dev/app/logs).
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
# Start an agent job
agent_job = app.start_agent(
prompt="Find the founders of Firecrawl"
)
# Check the status
status = app.get_agent_status(agent_job.id)
print(status)
# Example output:
# status='completed'
# success=True
# data={ ... }
# expires_at=datetime.datetime(...)
# credits_used=15
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
// Start an agent job
const started = await firecrawl.startAgent({
prompt: "Find the founders of Firecrawl"
});
// Check the status
if (started.id) {
const status = await firecrawl.getAgentStatus(started.id);
console.log(status.status, status.data);
}
```
```bash cURL theme={null}
curl -X GET "https://api.firecrawl.dev/v2/agent/YOUR_JOB_ID" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY"
```
### Possible States
| Status | Description |
| ------------ | ------------------------------------------ |
| `processing` | The agent is still working on your request |
| `completed` | Extraction finished successfully |
| `failed` | An error occurred during extraction |
| `cancelled` | The job was cancelled by the user |
#### Processing Example
```json JSON theme={null}
{
"success": true,
"status": "processing",
"expiresAt": "2024-12-15T00:00:00.000Z"
}
```
#### Completed Example
```json JSON theme={null}
{
"success": true,
"status": "completed",
"data": {
"founders": [
{
"name": "Eric Ciarla",
"role": "Co-founder"
},
{
"name": "Nicolas Camara",
"role": "Co-founder"
},
{
"name": "Caleb Peffer",
"role": "Co-founder"
}
]
},
"expiresAt": "2024-12-15T00:00:00.000Z",
"creditsUsed": 15
}
```
## Share agent runs
You can share agent runs directly from the Agent playground. Shared links are public — anyone with the link can view the run output and activity — and you can revoke access at any time to disable the link. Shared pages are not indexed by search engines.
## Model Selection
Firecrawl Agent offers two models. **Spark 1 Mini is 60% cheaper** and is the default — perfect for most use cases. Upgrade to Spark 1 Pro when you need maximum accuracy on complex tasks.
| Model | Cost | Accuracy | Best For |
| -------------- | --------------- | -------- | ------------------------------------- |
| `spark-1-mini` | **60% cheaper** | Standard | Most tasks (default) |
| `spark-1-pro` | Standard | Higher | Complex research, critical extraction |
**Start with Spark 1 Mini** (default) — it handles most extraction tasks well at 60% lower cost. Switch to Pro only for complex multi-domain research or when accuracy is critical.
### Spark 1 Mini (Default)
`spark-1-mini` is our efficient model, ideal for straightforward data extraction tasks.
**Use Mini when:**
* Extracting simple data points (contact info, pricing, etc.)
* Working with well-structured websites
* Cost efficiency is a priority
* Running high-volume extraction jobs
### Spark 1 Pro
`spark-1-pro` is our flagship model, designed for maximum accuracy on complex extraction tasks.
**Use Pro when:**
* Performing complex competitive analysis
* Extracting data that requires deep reasoning
* Accuracy is critical for your use case
* Dealing with ambiguous or hard-to-find data
### Specifying a Model
Pass the `model` parameter to select which model to use:
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
# Using Spark 1 Mini (default - can be omitted)
result = app.agent(
prompt="Find the pricing of Firecrawl",
model="spark-1-mini"
)
# Using Spark 1 Pro for complex tasks
result = app.agent(
prompt="Compare all enterprise features and pricing across Firecrawl, Apify, and ScrapingBee",
model="spark-1-pro"
)
print(result.data)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
// Using Spark 1 Mini (default - can be omitted)
const result = await firecrawl.agent({
prompt: "Find the pricing of Firecrawl",
model: "spark-1-mini"
});
// Using Spark 1 Pro for complex tasks
const resultPro = await firecrawl.agent({
prompt: "Compare all enterprise features and pricing across Firecrawl, Apify, and ScrapingBee",
model: "spark-1-pro"
});
console.log(result.data);
console.log(resultPro.data);
```
```bash cURL theme={null}
# Using Spark 1 Mini (default)
curl -X POST "https://api.firecrawl.dev/v2/agent" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Find the pricing of Firecrawl",
"model": "spark-1-mini"
}'
# Using Spark 1 Pro for complex tasks
curl -X POST "https://api.firecrawl.dev/v2/agent" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Compare all enterprise features and pricing across Firecrawl, Apify, and ScrapingBee",
"model": "spark-1-pro"
}'
```
## Parameters
| Parameter | Type | Required | Description |
| ------------ | ------ | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `prompt` | string | **Yes** | Natural language description of the data you want to extract (max 10,000 characters) |
| `model` | string | No | Model to use: `spark-1-mini` (default) or `spark-1-pro` |
| `urls` | array | No | Optional list of URLs to focus the extraction |
| `schema` | object | No | Optional JSON schema for structured output |
| `maxCredits` | number | No | Maximum number of credits to spend on this agent task. Defaults to **2,500** if not set. The dashboard supports values up to **2,500**; for higher limits, set `maxCredits` via the API (values above 2,500 are always treated as paid requests). If the limit is reached, the job fails and **no data is returned**. Failed runs are not billed: credits used for AI reasoning are never charged on failure, any credits used for tool calls during the run (scraping, search, mapping, etc.) are refunded, and the response reports `creditsUsed: 0`. |
## Agent vs Extract: What's Improved
| Feature | Agent (New) | Extract |
| ----------------- | ----------- | -------- |
| URLs Required | No | Yes |
| Speed | Faster | Standard |
| Cost | Lower | Standard |
| Reliability | Higher | Standard |
| Query Flexibility | High | Moderate |
## Example Use Cases
* **Research**: "Find the top 5 AI startups and their funding amounts"
* **Competitive Analysis**: "Compare pricing plans between Slack and Microsoft Teams"
* **Data Gathering**: "Extract contact information from company websites"
* **Content Summarization**: "Summarize the latest blog posts about web scraping"
## CSV Upload in Agent Playground
The [Agent Playground](https://www.firecrawl.dev/app/agent) supports CSV upload for batch processing. Your CSV can contain one or more columns of input data. For example, a single column of company names, or multiple columns such as company name, product, and website URL. Each row represents one item for the agent to process.
Upload your CSV, then add output columns using the "+" button in the grid header. Each column has its own prompt — click a column header to describe what the agent should find for that field (e.g., "CEO or founder name", "Total funding raised"). Hit Run, and the agent processes each row in parallel, filling in the results.
## API Reference
Check out the [Agent API Reference](/api-reference/endpoint/agent) for more details.
Have feedback or need help? Email [help@firecrawl.com](mailto:help@firecrawl.com).
## Pricing
Firecrawl Agent uses **dynamic billing** that scales with the complexity of your data extraction request. You pay based on the actual work Agent performs, ensuring fair pricing whether you're extracting simple data points or complex structured information from multiple sources.
### How Agent pricing works
Agent pricing is **dynamic and credit-based** during Research Preview:
* **Simple extractions** (like contact info from a single page) typically use fewer credits and cost less
* **Complex research tasks** (like competitive analysis across multiple domains) use more credits but reflect the total effort involved
* **Transparent usage** shows you exactly how many credits each request consumed
* **Credit conversion** automatically converts Agent usage into standard Firecrawl credits for easy billing
Credit usage varies based on the complexity of your prompt, the amount of data processed, and the structure of the output requested. As a rough guide, most agent runs consume **a few hundred credits**, though simpler single-page tasks may use less and complex multi-domain research may use more.
### Parallel Agents Pricing
If you run multiple agents in parallel with Spark-1 Fast, pricing is much more predictable: a flat 10 credits per cell (each output field filled in the playground grid).
### Getting started
**All users** receive **5 free daily runs**, which can be used from either the playground or the API, to explore Agent's capabilities without any cost.
Additional usage is billed against your standard Firecrawl credit balance.
### Managing costs
Agent can be expensive, but there are some ways to decrease the cost:
* **Start with free runs**: Use your 5 daily free requests to understand pricing
* **Set a `maxCredits` parameter**: Limit your spending by setting a maximum number of credits you're willing to spend. The dashboard caps this at 2,500 credits; to set a higher limit, use the `maxCredits` parameter directly via the API (note: values above 2,500 are always billed as paid requests)
* **Optimize prompts**: More specific prompts often use fewer credits
* **Break large tasks into smaller runs**: A single agent run has an output ceiling based on the underlying model's generation capacity (\~150-200 rows of structured data). For large extraction jobs, split by category, region, or URL batch (3-5 URLs per run) and merge the results, as shown in the sketch after this list. This also keeps each run well under the `maxCredits` limit.
* **Monitor usage**: Track your consumption through the dashboard
* **Set expectations**: Complex multi-domain research will use more credits than simple single-page extractions
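As an illustration of the split-and-merge approach above, here's a minimal sketch that runs the same prompt over small URL batches and merges the results (the URL list, prompt, and `companies` output field are hypothetical; pass a `schema` as shown earlier if you need a guaranteed output shape):

```python Python theme={null}
# Sketch: split a large extraction job into small agent runs and merge the results.
from firecrawl import Firecrawl

app = Firecrawl(api_key="fc-YOUR_API_KEY")

# Hypothetical input list: a dozen company sites to process
urls = [f"https://example.com/company/{i}" for i in range(12)]
all_rows = []

# 4 URLs per run keeps each job small and well under the maxCredits cap
for i in range(0, len(urls), 4):
    result = app.agent(
        prompt="Extract the company name and total funding raised",
        urls=urls[i:i + 4],
        max_credits=200,  # cap spend per run
    )
    # "companies" is a hypothetical field; its shape depends on your prompt/schema
    all_rows.extend(result.data.get("companies", []))

print(f"Collected {len(all_rows)} rows across {len(urls) // 4} runs")
```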
Try Agent now at [firecrawl.dev/app/agent](https://www.firecrawl.dev/app/agent) to see how credit usage scales with your specific use cases.
Pricing is subject to change as we move from Research Preview to general availability. Current users will receive advance notice of any pricing updates.
> Are you an AI agent that needs a Firecrawl API key? See [firecrawl.dev/agent-onboarding/SKILL.md](https://www.firecrawl.dev/agent-onboarding/SKILL.md) for automated onboarding instructions.
# Batch Scrape
Source: https://docs.firecrawl.dev/features/batch-scrape
Scrape multiple URLs in a single batch job
Batch scrape lets you scrape multiple URLs in a single job. Pass a list of URLs and optional parameters, and Firecrawl processes them concurrently and returns all results together.
* Works like `/crawl` but for an explicit list of URLs
* Synchronous and asynchronous modes
* Supports all scrape options including structured extraction
* Configurable concurrency per job
## How it works
You can run a batch scrape in two ways:
| Mode | SDK method (JS / Python) | Behavior |
| ------------ | ----------------------------------------- | ---------------------------------------------------------------- |
| Synchronous | `batchScrape` / `batch_scrape` | Starts the batch and waits for completion, returning all results |
| Asynchronous | `startBatchScrape` / `start_batch_scrape` | Starts the batch and returns a job ID for polling or webhooks |
## Basic usage
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
start = firecrawl.start_batch_scrape([
"https://firecrawl.dev",
"https://docs.firecrawl.dev",
], formats=["markdown"]) # returns id
job = firecrawl.batch_scrape([
"https://firecrawl.dev",
"https://docs.firecrawl.dev",
], formats=["markdown"], poll_interval=2, wait_timeout=120)
print(job.status, job.completed, job.total)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
// Start a batch scrape job
const { id } = await firecrawl.startBatchScrape([
'https://firecrawl.dev',
'https://docs.firecrawl.dev'
], {
options: { formats: ['markdown'] },
});
// Wait for completion
const job = await firecrawl.batchScrape([
'https://firecrawl.dev',
'https://docs.firecrawl.dev'
], { options: { formats: ['markdown'] }, pollInterval: 2, timeout: 120 });
console.log(job.status, job.completed, job.total);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/batch/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://firecrawl.dev", "https://docs.firecrawl.dev"],
"formats": ["markdown"]
}'
```
### Response
Calling `batchScrape` / `batch_scrape` returns the full results when the batch completes.
```json Completed theme={null}
{
"status": "completed",
"total": 36,
"completed": 36,
"creditsUsed": 36,
"expiresAt": "2024-00-00T00:00:00.000Z",
"next": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789?skip=26",
"data": [
{
"markdown": "[Firecrawl Docs home page!...",
"html": "...",
"metadata": {
"title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
"language": "en",
"sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
"description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
"ogLocaleAlternate": [],
"statusCode": 200
}
},
...
]
}
```
Calling `startBatchScrape` / `start_batch_scrape` returns a job ID you can track via `getBatchScrapeStatus` / `get_batch_scrape_status`, the API endpoint `/batch/scrape/{id}`, or webhooks. Job results are available via the API for 24 hours after completion. After this period, you can still view your batch scrape history and results in the [activity logs](https://www.firecrawl.dev/app/logs).
```json theme={null}
{
"success": true,
"id": "123-456-789",
"url": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789"
}
```
## Concurrency
By default, a batch scrape job uses your team's full concurrent browser limit (see [Rate Limits](/rate-limits)). You can lower this per job with the `maxConcurrency` parameter.
For example, `maxConcurrency: 50` limits that job to 50 simultaneous scrapes. Setting this value too low on large batches will significantly slow down processing, so only reduce it if you need to leave capacity for other concurrent jobs.
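For example, here's how you might cap a batch at 50 simultaneous scrapes with the Python SDK (a sketch; it assumes the SDK exposes the API's `maxConcurrency` option as `max_concurrency`):

```python Python theme={null}
# Sketch: limit this batch job to 50 concurrent scrapes.
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

job = firecrawl.batch_scrape(
    ["https://firecrawl.dev", "https://docs.firecrawl.dev"],
    formats=["markdown"],
    max_concurrency=50,  # assumption: SDK name for the maxConcurrency parameter
)
print(job.status, job.completed, job.total)
```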
## Structured extraction
You can use batch scrape to extract structured data from every page in the batch. This is useful when you want the same schema applied to a list of URLs.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")
# Scrape multiple websites:
batch_scrape_result = firecrawl.batch_scrape(
['https://docs.firecrawl.dev', 'https://docs.firecrawl.dev/sdks/overview'],
formats=[{
'type': 'json',
'prompt': 'Extract the title and description from the page.',
'schema': {
'type': 'object',
'properties': {
'title': {'type': 'string'},
'description': {'type': 'string'}
},
'required': ['title', 'description']
}
}]
)
print(batch_scrape_result)
# Or, you can use the start method:
batch_scrape_job = firecrawl.start_batch_scrape(
['https://docs.firecrawl.dev', 'https://docs.firecrawl.dev/sdks/overview'],
formats=[{
'type': 'json',
'prompt': 'Extract the title and description from the page.',
'schema': {
'type': 'object',
'properties': {
'title': {'type': 'string'},
'description': {'type': 'string'}
},
'required': ['title', 'description']
}
}]
)
print(batch_scrape_job)
# You can then use the job ID to check the status of the batch scrape:
batch_scrape_status = firecrawl.get_batch_scrape_status(batch_scrape_job.id)
print(batch_scrape_status)
```
```js Node theme={null}
import Firecrawl, { ScrapeResponse } from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({apiKey: "fc-YOUR_API_KEY"});
// Define schema to extract contents into
const schema = {
type: "object",
properties: {
title: { type: "string" },
description: { type: "string" }
},
required: ["title", "description"]
};
// Scrape multiple websites (synchronous):
const batchScrapeResult = await firecrawl.batchScrape(['https://docs.firecrawl.dev', 'https://docs.firecrawl.dev/sdks/overview'], {
formats: [
{
type: "json",
prompt: "Extract the title and description from the page.",
schema: schema
}
]
});
// Output all the results of the batch scrape:
console.log(batchScrapeResult)
// Or, you can use the start method:
const batchScrapeJob = await firecrawl.startBatchScrape(['https://docs.firecrawl.dev', 'https://docs.firecrawl.dev/sdks/overview'], {
formats: [
{
type: "json",
prompt: "Extract the title and description from the page.",
schema: schema
}
]
});
console.log(batchScrapeJob)
// You can then use the job ID to check the status of the batch scrape:
const batchScrapeStatus = await firecrawl.getBatchScrapeStatus(batchScrapeJob.id);
console.log(batchScrapeStatus)
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/batch/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"urls": ["https://docs.firecrawl.dev", "https://docs.firecrawl.dev/sdks/overview"],
"formats" : [{
"type": "json",
"prompt": "Extract the title and description from the page.",
"schema": {
"type": "object",
"properties": {
"title": {
"type": "string"
},
"description": {
"type": "string"
}
},
"required": [
"title",
"description"
]
}
}]
}'
```
### Response
`batchScrape` / `batch_scrape` returns full results:
```json Completed theme={null}
{
"status": "completed",
"total": 36,
"completed": 36,
"creditsUsed": 36,
"expiresAt": "2024-00-00T00:00:00.000Z",
"next": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789?skip=26",
"data": [
{
"json": {
"title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
"description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot."
}
},
...
]
}
```
`startBatchScrape` / `start_batch_scrape` returns a job ID:
```json theme={null}
{
"success": true,
"id": "123-456-789",
"url": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789"
}
```
## Webhooks
You can configure webhooks to receive real-time notifications as each URL in your batch is scraped. This lets you process results immediately instead of waiting for the entire batch to complete.
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/batch/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"urls": [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
],
"webhook": {
"url": "https://your-domain.com/webhook",
"metadata": {
"any_key": "any_value"
},
"events": ["started", "page", "completed"]
}
}'
```
### Event types
| Event | Description |
| ------------------------ | ------------------------------------- |
| `batch_scrape.started` | The batch scrape job has begun |
| `batch_scrape.page` | A single URL was successfully scraped |
| `batch_scrape.completed` | All URLs have been processed |
| `batch_scrape.failed` | The batch scrape encountered an error |
### Payload
Each webhook delivery includes a JSON body with the following structure:
```json theme={null}
{
"success": true,
"type": "batch_scrape.page",
"id": "batch-job-id",
"data": [...],
"metadata": {},
"error": null
}
```
### Verifying webhook signatures
Every webhook request from Firecrawl includes an `X-Firecrawl-Signature` header containing an HMAC-SHA256 signature. Always verify this signature to ensure the webhook is authentic and has not been tampered with.
1. Get your webhook secret from the [Advanced tab](https://www.firecrawl.dev/app/settings?tab=advanced) of your account settings
2. Extract the signature from the `X-Firecrawl-Signature` header
3. Compute HMAC-SHA256 of the raw request body using your secret
4. Compare with the signature header using a timing-safe function
Never process a webhook without verifying its signature first. The `X-Firecrawl-Signature` header contains the signature in the format: `sha256=abc123def456...`
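As a minimal Python sketch of those four steps (the function name is illustrative, and the `sha256=` header parsing follows the format described above; see the Webhook Security docs below for the complete, supported examples):

```python Python theme={null}
# Sketch: verify the X-Firecrawl-Signature header against the raw request body.
import hashlib
import hmac

def verify_signature(raw_body: bytes, signature_header: str, secret: str) -> bool:
    # Header format: "sha256=<hex digest>"
    received = signature_header.removeprefix("sha256=")
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # Timing-safe comparison guards against signature-guessing via response timing
    return hmac.compare_digest(expected, received)
```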
For complete implementation examples in JavaScript and Python, see the [Webhook Security documentation](/webhooks/security).
For comprehensive webhook documentation including detailed event payloads, advanced configuration, and troubleshooting, see the [Webhooks documentation](/webhooks/overview).
> Are you an AI agent that needs a Firecrawl API key? See [firecrawl.dev/agent-onboarding/SKILL.md](https://www.firecrawl.dev/agent-onboarding/SKILL.md) for automated onboarding instructions.
# Change Tracking
Source: https://docs.firecrawl.dev/features/change-tracking
Detect and monitor changes in web content between scrapes
Change tracking compares the current content of a page against the last time you scraped it. Add `changeTracking` to your `formats` array to detect whether a page is new, unchanged, or modified, and optionally get a structured diff of what changed.
* Works with `/scrape`, `/crawl`, and `/batch/scrape`
* Two diff modes: `git-diff` for line-level changes, `json` for field-level comparison
* Scoped to your team, and optionally scoped to a tag that you pass in
## How it works
Every scrape with `changeTracking` enabled stores a snapshot and compares it against the previous snapshot for that URL. Snapshots are stored persistently and do not expire, so comparisons remain accurate regardless of how much time has passed between scrapes.
| Scrape | Result |
| ----------------- | -------------------------------------------------- |
| First time | `changeStatus: "new"` (no previous version exists) |
| Content unchanged | `changeStatus: "same"` |
| Content modified | `changeStatus: "changed"` (diff data available) |
| Page removed | `changeStatus: "removed"` |
The response includes these fields in the `changeTracking` object:
| Field | Type | Description |
| ------------------ | --------------------- | ---------------------------------------------------------------------------------------------- |
| `previousScrapeAt` | `string \| null` | Timestamp of the previous scrape (`null` on first scrape) |
| `changeStatus` | `string` | `"new"`, `"same"`, `"changed"`, or `"removed"` |
| `visibility` | `string` | `"visible"` (discoverable via links/sitemap) or `"hidden"` (URL works but is no longer linked) |
| `diff` | `object \| undefined` | Line-level diff (only present in `git-diff` mode when status is `"changed"`) |
| `json` | `object \| undefined` | Field-level comparison (only present in `json` mode when status is `"changed"`) |
## Basic usage
Include both `markdown` and `changeTracking` in the `formats` array. The `markdown` format is required because change tracking compares pages via their markdown content.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
result = firecrawl.scrape(
"https://example.com/pricing",
formats=["markdown", "changeTracking"]
)
print(result.changeTracking)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const result = await firecrawl.scrape('https://example.com/pricing', {
formats: ['markdown', 'changeTracking']
});
console.log(result.changeTracking);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"formats": ["markdown", "changeTracking"]
}'
```
### Response
On the first scrape, `changeStatus` is `"new"` and `previousScrapeAt` is `null`:
```json theme={null}
{
"success": true,
"data": {
"markdown": "# Pricing\n\nStarter: $9/mo\nPro: $29/mo...",
"changeTracking": {
"previousScrapeAt": null,
"changeStatus": "new",
"visibility": "visible"
}
}
}
```
On subsequent scrapes, `changeStatus` reflects whether content changed:
```json theme={null}
{
"success": true,
"data": {
"markdown": "# Pricing\n\nStarter: $12/mo\nPro: $39/mo...",
"changeTracking": {
"previousScrapeAt": "2025-06-01T10:00:00.000+00:00",
"changeStatus": "changed",
"visibility": "visible"
}
}
}
```
## Git-diff mode
The `git-diff` mode returns line-by-line changes in a format similar to `git diff`. Pass an object in the `formats` array with `modes: ["git-diff"]`:
```python Python theme={null}
result = firecrawl.scrape(
"https://example.com/pricing",
formats=[
"markdown",
{
"type": "changeTracking",
"modes": ["git-diff"]
}
]
)
if result.changeTracking.changeStatus == "changed":
print(result.changeTracking.diff.text)
```
```js Node theme={null}
const result = await firecrawl.scrape('https://example.com/pricing', {
formats: [
'markdown',
{ type: 'changeTracking', modes: ['git-diff'] }
]
});
if (result.changeTracking.changeStatus === 'changed') {
console.log(result.changeTracking.diff.text);
}
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"formats": [
"markdown",
{ "type": "changeTracking", "modes": ["git-diff"] }
]
}'
```
### Response
The `diff` object contains both a plain-text diff and a structured JSON representation:
```json theme={null}
{
"changeTracking": {
"previousScrapeAt": "2025-06-01T10:00:00.000+00:00",
"changeStatus": "changed",
"visibility": "visible",
"diff": {
"text": "@@ -1,3 +1,3 @@\n # Pricing\n-Starter: $9/mo\n-Pro: $29/mo\n+Starter: $12/mo\n+Pro: $39/mo",
"json": {
"files": [{
"chunks": [{
"content": "@@ -1,3 +1,3 @@",
"changes": [
{ "type": "normal", "content": "# Pricing" },
{ "type": "del", "ln": 2, "content": "Starter: $9/mo" },
{ "type": "del", "ln": 3, "content": "Pro: $29/mo" },
{ "type": "add", "ln": 2, "content": "Starter: $12/mo" },
{ "type": "add", "ln": 3, "content": "Pro: $39/mo" }
]
}]
}]
}
}
}
}
```
The structured `diff.json` object contains:
* `files`: array of changed files (typically one for web pages)
* `chunks`: sections of changes within a file
* `changes`: individual line changes with `type` (`"add"`, `"del"`, or `"normal"`), line number (`ln`), and `content`
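As an example of walking that structure, the sketch below groups changed lines by type (the helper name and return shape are illustrative):
```python Python theme={null}
def summarize_diff(change_tracking: dict) -> dict:
    """Group line changes from changeTracking.diff.json by type."""
    added, removed = [], []
    for changed_file in change_tracking["diff"]["json"]["files"]:
        for chunk in changed_file["chunks"]:
            for change in chunk["changes"]:
                if change["type"] == "add":
                    added.append(change["content"])
                elif change["type"] == "del":
                    removed.append(change["content"])
    return {"added": added, "removed": removed}

# With the sample diff above:
# {"added": ["Starter: $12/mo", "Pro: $39/mo"], "removed": ["Starter: $9/mo", "Pro: $29/mo"]}
```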
## JSON mode
The `json` mode extracts specific fields from both the current and previous version of the page using a schema you define. This is useful for tracking changes in structured data like prices, stock levels, or metadata without parsing a full diff.
Pass `modes: ["json"]` with a `schema` defining the fields to extract:
```python Python theme={null}
result = firecrawl.scrape(
"https://example.com/product/widget",
formats=[
"markdown",
{
"type": "changeTracking",
"modes": ["json"],
"schema": {
"type": "object",
"properties": {
"price": { "type": "string" },
"availability": { "type": "string" }
}
}
}
]
)
if result.changeTracking.changeStatus == "changed":
changes = result.changeTracking.json
print(f"Price: {changes['price']['previous']} → {changes['price']['current']}")
```
```js Node theme={null}
const result = await firecrawl.scrape('https://example.com/product/widget', {
formats: [
'markdown',
{
type: 'changeTracking',
modes: ['json'],
schema: {
type: 'object',
properties: {
price: { type: 'string' },
availability: { type: 'string' }
}
}
}
]
});
if (result.changeTracking.changeStatus === 'changed') {
const changes = result.changeTracking.json;
console.log(`Price: ${changes.price.previous} → ${changes.price.current}`);
}
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/product/widget",
"formats": [
"markdown",
{
"type": "changeTracking",
"modes": ["json"],
"schema": {
"type": "object",
"properties": {
"price": { "type": "string" },
"availability": { "type": "string" }
}
}
}
]
}'
```
### Response
Each field in the schema is returned with `previous` and `current` values:
```json theme={null}
{
"changeTracking": {
"previousScrapeAt": "2025-06-05T08:00:00.000+00:00",
"changeStatus": "changed",
"visibility": "visible",
"json": {
"price": {
"previous": "$19.99",
"current": "$24.99"
},
"availability": {
"previous": "In Stock",
"current": "In Stock"
}
}
}
}
```
You can also pass an optional `prompt` to guide the LLM extraction alongside the schema.
JSON mode uses LLM extraction and costs **5 credits per page**. Basic change tracking and `git-diff` mode have no additional cost.
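Unchanged fields are returned too (like `availability` above), so you may want to filter for fields whose values actually differ. A small helper sketch:
```python Python theme={null}
def changed_fields(ct_json: dict) -> dict:
    """Keep only schema fields whose previous and current values differ."""
    return {
        field: values
        for field, values in ct_json.items()
        if values["previous"] != values["current"]
    }

# With the sample response above:
# {"price": {"previous": "$19.99", "current": "$24.99"}}
```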
## Using tags
By default, change tracking compares against the most recent scrape of the same URL scraped by your team. Tags let you maintain **separate tracking histories** for the same URL, which is useful when you monitor the same page at different intervals or in different contexts.
```python Python theme={null}
# Hourly monitoring (compared against last "hourly" scrape)
result = firecrawl.scrape(
"https://example.com/pricing",
formats=[
"markdown",
{ "type": "changeTracking", "tag": "hourly" }
]
)
# Daily summary (compared against last "daily" scrape)
result = firecrawl.scrape(
"https://example.com/pricing",
formats=[
"markdown",
{ "type": "changeTracking", "tag": "daily" }
]
)
```
```js Node theme={null}
// Hourly monitoring (compared against last "hourly" scrape)
const result = await firecrawl.scrape('https://example.com/pricing', {
formats: [
'markdown',
{ type: 'changeTracking', tag: 'hourly' }
]
});
// Daily summary (compared against last "daily" scrape)
const result2 = await firecrawl.scrape('https://example.com/pricing', {
formats: [
'markdown',
{ type: 'changeTracking', tag: 'daily' }
]
});
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"formats": [
"markdown",
{ "type": "changeTracking", "tag": "hourly" }
]
}'
```
## Crawl with change tracking
Add change tracking to crawl operations to monitor an entire site for changes. Pass the `changeTracking` format inside `scrapeOptions`:
```python Python theme={null}
result = firecrawl.crawl(
"https://example.com",
limit=50,
scrape_options={
"formats": ["markdown", "changeTracking"]
}
)
for page in result.data:
status = page.changeTracking.changeStatus
url = page.metadata.url
print(f"{url}: {status}")
```
```js Node theme={null}
const result = await firecrawl.crawl('https://example.com', {
limit: 50,
scrapeOptions: {
formats: ['markdown', 'changeTracking']
}
});
for (const page of result.data) {
console.log(`${page.metadata.url}: ${page.changeTracking.changeStatus}`);
}
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/crawl" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"limit": 50,
"scrapeOptions": {
"formats": ["markdown", "changeTracking"]
}
}'
```
## Batch scrape with change tracking
Use [batch scrape](/features/batch-scrape) to monitor a specific set of URLs:
```python Python theme={null}
result = firecrawl.batch_scrape(
[
"https://example.com/pricing",
"https://example.com/product/widget",
"https://example.com/blog/latest"
],
formats=["markdown", {"type": "changeTracking", "modes": ["git-diff"]}]
)
```
```js Node theme={null}
const result = await firecrawl.batchScrape([
'https://example.com/pricing',
'https://example.com/product/widget',
'https://example.com/blog/latest'
], {
options: {
formats: ['markdown', { type: 'changeTracking', modes: ['git-diff'] }]
}
});
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/batch/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://example.com/pricing",
"https://example.com/product/widget",
"https://example.com/blog/latest"
],
"formats": [
"markdown",
{ "type": "changeTracking", "modes": ["git-diff"] }
]
}'
```
## Scheduling change tracking
Change tracking is most useful when you scrape on a regular schedule. You can automate this with cron, cloud schedulers, or workflow tools.
### Cron job
Create a script that scrapes a URL and alerts on changes:
```bash check-pricing.sh theme={null}
#!/bin/bash
RESPONSE=$(curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://competitor.com/pricing",
"formats": [
"markdown",
{
"type": "changeTracking",
"modes": ["json"],
"schema": {
"type": "object",
"properties": {
"starter_price": { "type": "string" },
"pro_price": { "type": "string" }
}
}
}
]
}')
STATUS=$(echo "$RESPONSE" | jq -r '.data.changeTracking.changeStatus')
if [ "$STATUS" = "changed" ]; then
echo "$RESPONSE" | jq '.data.changeTracking.json'
# Send alert via email, Slack, etc.
fi
```
Schedule it with `crontab -e`:
```bash theme={null}
0 */6 * * * /path/to/check-pricing.sh >> /var/log/price-monitor.log 2>&1
```
| Schedule | Expression |
| ------------------------ | ------------- |
| Every hour | `0 * * * *` |
| Every 6 hours | `0 */6 * * *` |
| Daily at 9 AM | `0 9 * * *` |
| Weekly on Monday at 8 AM | `0 8 * * 1` |
### Cloud and serverless schedulers
* **AWS**: EventBridge rule triggering a Lambda function
* **GCP**: Cloud Scheduler triggering a Cloud Function
* **Vercel / Netlify**: Cron-triggered serverless functions
* **GitHub Actions**: Scheduled workflows using the `schedule` trigger with a `cron` expression
### Workflow automation
No-code platforms like **n8n**, **Zapier**, and **Make** can call the Firecrawl API on a schedule and route results to Slack, email, or databases. See the [workflow automation guides](/developer-guides/workflow-automation/n8n).
## Webhooks
For async operations like crawl and batch scrape, use [webhooks](/webhooks/overview) to receive change tracking results as they arrive instead of polling.
```python Python theme={null}
job = firecrawl.start_crawl(
"https://example.com",
limit=50,
scrape_options={
"formats": [
"markdown",
{"type": "changeTracking", "modes": ["git-diff"]}
]
},
webhook={
"url": "https://your-server.com/firecrawl-webhook",
"headers": {"Authorization": "Bearer your-webhook-secret"},
"events": ["crawl.page", "crawl.completed"]
}
)
```
```js Node theme={null}
const { id } = await firecrawl.startCrawl('https://example.com', {
limit: 50,
scrapeOptions: {
formats: [
'markdown',
{ type: 'changeTracking', modes: ['git-diff'] }
]
},
webhook: {
url: 'https://your-server.com/firecrawl-webhook',
headers: { Authorization: 'Bearer your-webhook-secret' },
events: ['crawl.page', 'crawl.completed']
}
});
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/crawl" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"limit": 50,
"scrapeOptions": {
"formats": [
"markdown",
{ "type": "changeTracking", "modes": ["git-diff"] }
]
},
"webhook": {
"url": "https://your-server.com/firecrawl-webhook",
"headers": { "Authorization": "Bearer your-webhook-secret" },
"events": ["crawl.page", "crawl.completed"]
}
}'
```
The `crawl.page` event payload includes the `changeTracking` object for each page:
```json theme={null}
{
"success": true,
"type": "crawl.page",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [{
"markdown": "# Pricing\n\nStarter: $12/mo...",
"metadata": {
"title": "Pricing",
"url": "https://example.com/pricing",
"statusCode": 200
},
"changeTracking": {
"previousScrapeAt": "2025-06-05T12:00:00.000+00:00",
"changeStatus": "changed",
"visibility": "visible",
"diff": {
"text": "@@ -2,1 +2,1 @@\n-Starter: $9/mo\n+Starter: $12/mo"
}
}
}]
}
```
For webhook configuration details (headers, metadata, events, retries, signature verification), see the [Webhooks documentation](/webhooks/overview).
## Configuration reference
The full set of options available when passing a `changeTracking` format object:
| Parameter | Type | Default | Description |
| --------- | ---------- | ---------- | ----------------------------------------------------------------- |
| `type` | `string` | (required) | Must be `"changeTracking"` |
| `modes` | `string[]` | `[]` | Diff modes to enable: `"git-diff"`, `"json"`, or both |
| `schema` | `object` | (none) | JSON Schema for field-level comparison (required for `json` mode) |
| `prompt` | `string` | (none) | Custom prompt to guide LLM extraction (used with `json` mode) |
| `tag` | `string` | `null` | Separate tracking history identifier |
### Data models
```ts TypeScript theme={null}
interface ChangeTrackingResult {
previousScrapeAt: string | null;
changeStatus: "new" | "same" | "changed" | "removed";
visibility: "visible" | "hidden";
diff?: {
text: string;
json: {
files: Array<{
from: string | null;
to: string | null;
chunks: Array<{
content: string;
changes: Array<{
type: "add" | "del" | "normal";
ln?: number;
ln1?: number;
ln2?: number;
content: string;
}>;
}>;
}>;
};
};
  json?: Record<string, { previous: any; current: any }>;
}
```
```python Python theme={null}
class ChangeTrackingData(BaseModel):
previous_scrape_at: Optional[str] = None
change_status: str # "new" | "same" | "changed" | "removed"
visibility: str # "visible" | "hidden"
diff: Optional[Dict[str, Any]] = None
json: Optional[Dict[str, Any]] = None
```
## Important details
The `markdown` format must always be included alongside `changeTracking`. Change tracking compares pages via their markdown content.
* **Snapshot retention**: Snapshots are stored persistently and do not expire. A scrape performed months after the previous one will still compare correctly against the earlier snapshot.
* **Scoping**: Comparisons are scoped to your team. Your first scrape of any URL returns `"new"`, even if other users have scraped it.
* **URL matching**: Previous scrapes are matched on exact source URL, team ID, `markdown` format, and `tag`. Keep URLs consistent between scrapes.
* **Parameter consistency**: Using different `includeTags`, `excludeTags`, or `onlyMainContent` settings across scrapes of the same URL produces unreliable comparisons.
* **Comparison algorithm**: The algorithm is resistant to whitespace and content order changes. Iframe source URLs are ignored to handle captcha/antibot randomization.
* **Caching**: Requests with `changeTracking` bypass the index cache. The `maxAge` parameter is ignored.
* **Error handling**: Monitor the `warning` field in responses and handle the `changeTracking` object potentially being absent (this can occur if the database lookup for the previous scrape times out).
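A defensive pattern for that last point might look like the sketch below (attribute access mirrors the Python examples earlier on this page):
```python Python theme={null}
result = firecrawl.scrape(
    "https://example.com/pricing",
    formats=["markdown", "changeTracking"],
)
if result.warning:
    print("warning:", result.warning)

ct = result.changeTracking
if ct is None:
    # Previous-scrape lookup timed out; skip the comparison this run
    print("no change tracking data available")
elif ct.changeStatus == "changed":
    print("changed since", ct.previousScrapeAt)
```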
## Billing
| Mode | Cost |
| --------------------- | --------------------------------------- |
| Basic change tracking | No extra cost (standard scrape credits) |
| `git-diff` mode | No extra cost |
| `json` mode | 5 credits per page |
# Crawl
Source: https://docs.firecrawl.dev/features/crawl
Recursively crawl a website and get content from every page
Crawl submits a URL to Firecrawl and recursively discovers and scrapes every reachable subpage. It handles sitemaps, JavaScript rendering, and rate limits automatically, returning clean markdown or structured data for each page.
* Discovers pages via sitemap and recursive link traversal
* Supports path filtering, depth limits, and subdomain/external link control
* Returns results via polling, WebSocket, or webhook
Test crawling in the interactive playground — no code required.
## Installation
```python Python theme={null}
# pip install firecrawl-py
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
```
```js Node theme={null}
// npm install @mendable/firecrawl-js
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
```
```bash CLI theme={null}
# Install globally with npm
npm install -g firecrawl
# Authenticate (one-time setup)
firecrawl login
```
## Basic usage
Submit a crawl job by calling `POST /v2/crawl` with a starting URL. The endpoint returns a job ID that you use to poll for results.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
docs = firecrawl.crawl(url="https://docs.firecrawl.dev", limit=10)
print(docs)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const docs = await firecrawl.crawl('https://docs.firecrawl.dev', { limit: 10 });
console.log(docs);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/crawl" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.firecrawl.dev",
"limit": 10
}'
```
```bash CLI theme={null}
# Start a crawl job (returns job ID)
firecrawl crawl https://firecrawl.dev
# Wait for completion with progress
firecrawl crawl https://firecrawl.dev --wait --progress --limit 100
```
Each page crawled consumes 1 credit. The default crawl `limit` is 10,000 pages. Before starting, the crawl endpoint checks that your remaining credits can cover the `limit` — if not, it returns a **402 (Payment Required)** error. Set a lower `limit` to match your intended crawl size (e.g. `limit: 100`) to avoid this. Additional credits apply for certain options: JSON mode costs 4 additional credits per page, enhanced proxy costs 4 additional credits per page, and PDF parsing costs 1 credit per PDF page.
### Scrape options
All options from the [Scrape endpoint](/api-reference/endpoint/scrape) are available in crawl via `scrapeOptions` (JS) / `scrape_options` (Python). These apply to every page the crawler scrapes, including formats, proxy, caching, actions, location, and tags.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')
# Crawl with scrape options
response = firecrawl.crawl('https://example.com',
limit=100,
scrape_options={
'formats': [
'markdown',
{ 'type': 'json', 'schema': { 'type': 'object', 'properties': { 'title': { 'type': 'string' } } } }
],
'proxy': 'auto',
'max_age': 600000,
'only_main_content': True
}
)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR_API_KEY' });
// Crawl with scrape options
const crawlResponse = await firecrawl.crawl('https://example.com', {
limit: 100,
scrapeOptions: {
formats: [
'markdown',
{
type: 'json',
schema: { type: 'object', properties: { title: { type: 'string' } } },
},
],
proxy: 'auto',
maxAge: 600000,
onlyMainContent: true,
},
});
```
## Checking crawl status
Use the job ID to poll for the crawl status and retrieve results.
```python Python theme={null}
status = firecrawl.get_crawl_status("<job-id>")
print(status)
```
```js Node theme={null}
const status = await firecrawl.getCrawlStatus("<job-id>");
console.log(status);
```
```bash cURL theme={null}
# After starting a crawl, poll status by jobId
curl -s -X GET "https://api.firecrawl.dev/v2/crawl/<job-id>" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY"
```
```bash CLI theme={null}
# Check crawl status using job ID
firecrawl crawl <job-id>
```
Job results are available via the API for 24 hours after completion. After this period, you can still view your crawl history and results in the [activity logs](https://www.firecrawl.dev/app/logs).
Pages in the crawl results `data` array are pages that Firecrawl successfully scraped, even if the target site returned an HTTP error like 404. The `metadata.statusCode` field shows the HTTP status code from the target site. To retrieve pages that Firecrawl itself failed to scrape (e.g. network errors, timeouts, or robots.txt blocks), use the dedicated [Get Crawl Errors](/api-reference/endpoint/crawl-get-errors) endpoint (`GET /crawl/{id}/errors`).
### Response handling
The response varies based on the crawl's status. While a crawl is still running, or when a result set exceeds 10MB, a `next` URL parameter is provided; request that URL to retrieve the next 10MB of data. If the `next` parameter is absent, you have reached the end of the crawl data.
The `skip` and `next` parameters are only relevant when hitting the API directly. If you're using the SDK, pagination is handled automatically and all results are returned at once.
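If you do hit the API directly, a minimal pagination loop might look like this (a sketch; it assumes the job has finished and simply follows `next` links):
```python Python theme={null}
import requests

API_KEY = "fc-YOUR-API-KEY"

def fetch_all_crawl_results(job_id: str) -> list:
    """Collect every document from a finished crawl by following `next` links."""
    url = f"https://api.firecrawl.dev/v2/crawl/{job_id}"
    documents = []
    while url:
        body = requests.get(url, headers={"Authorization": f"Bearer {API_KEY}"}).json()
        documents.extend(body.get("data", []))
        url = body.get("next")  # absent once the last chunk has been returned
    return documents
```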
```json Scraping theme={null}
{
"status": "scraping",
"total": 36,
"completed": 10,
"creditsUsed": 10,
"expiresAt": "2024-00-00T00:00:00.000Z",
"next": "https://api.firecrawl.dev/v2/crawl/123-456-789?skip=10",
"data": [
{
"markdown": "[Firecrawl Docs home page!...",
"html": "...",
"metadata": {
"title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
"language": "en",
"sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
"description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
"ogLocaleAlternate": [],
"statusCode": 200
}
},
...
]
}
```
```json Completed theme={null}
{
"status": "completed",
"total": 36,
"completed": 36,
"creditsUsed": 36,
"expiresAt": "2024-00-00T00:00:00.000Z",
"next": "https://api.firecrawl.dev/v2/crawl/123-456-789?skip=26",
"data": [
{
"markdown": "[Firecrawl Docs home page!...",
"html": "...",
"metadata": {
"title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
"language": "en",
"sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
"description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
"ogLocaleAlternate": [],
"statusCode": 200
}
},
...
]
}
```
## SDK methods
There are two ways to use crawl with the SDK.
### Crawl and wait
The `crawl` method waits for the crawl to complete and returns the full response. It handles pagination automatically. This is recommended for most use cases.
```python Python theme={null}
from firecrawl import Firecrawl
from firecrawl.types import ScrapeOptions
firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")
# Crawl a website:
crawl_status = firecrawl.crawl(
'https://firecrawl.dev',
limit=100,
scrape_options=ScrapeOptions(formats=['markdown', 'html']),
poll_interval=30
)
print(crawl_status)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({apiKey: "fc-YOUR_API_KEY"});
const crawlResponse = await firecrawl.crawl('https://firecrawl.dev', {
limit: 100,
scrapeOptions: {
formats: ['markdown', 'html'],
}
})
console.log(crawlResponse)
```
The response includes the crawl status and all scraped data:
```bash Python theme={null}
success=True
status='completed'
completed=100
total=100
creditsUsed=100
expiresAt=datetime.datetime(2025, 4, 23, 19, 21, 17, tzinfo=TzInfo(UTC))
next=None
data=[
Document(
markdown='[Day 7 - Launch Week III.Integrations DayApril 14th to 20th](...',
metadata={
'title': '15 Python Web Scraping Projects: From Beginner to Advanced',
...
'scrapeId': '97dcf796-c09b-43c9-b4f7-868a7a5af722',
'sourceURL': 'https://www.firecrawl.dev/blog/python-web-scraping-projects',
'url': 'https://www.firecrawl.dev/blog/python-web-scraping-projects',
'statusCode': 200
}
),
...
]
```
```json Node theme={null}
{
  success: true,
  status: "completed",
  completed: 100,
  total: 100,
  creditsUsed: 100,
  expiresAt: "2025-04-23T19:28:45.000Z",
  data: [
    {
      markdown: "[Day 7 - Launch Week III.Integrations DayApril ...",
      html: "...",
      metadata: { ... }
    },
    ...
  ]
}
```
### Start then check status
Alternatively, `startCrawl` / `start_crawl` submits the job and returns immediately; poll the crawl status (as shown above) to retrieve results. The initial response returns the job ID:
```json theme={null}
{
"success": true,
"id": "123-456-789",
"url": "https://api.firecrawl.dev/v2/crawl/123-456-789"
}
```
## Real-time results with WebSocket
The watcher method provides real-time updates as pages are crawled. Start a crawl, then subscribe to events for immediate data processing.
```python Python theme={null}
import asyncio
from firecrawl import AsyncFirecrawl
async def main():
firecrawl = AsyncFirecrawl(api_key="fc-YOUR-API-KEY")
# Start a crawl first
started = await firecrawl.start_crawl("https://firecrawl.dev", limit=5)
# Watch updates (snapshots) until terminal status
async for snapshot in firecrawl.watcher(started.id, kind="crawl", poll_interval=2, timeout=120):
if snapshot.status == "completed":
print("DONE", snapshot.status)
for doc in snapshot.data:
print("DOC", doc.metadata.source_url if doc.metadata else None)
elif snapshot.status == "failed":
print("ERR", snapshot.status)
else:
print("STATUS", snapshot.status, snapshot.completed, "/", snapshot.total)
asyncio.run(main())
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });
// Start a crawl and then watch it
const { id } = await firecrawl.startCrawl('https://mendable.ai', {
excludePaths: ['blog/*'],
limit: 5,
});
const watcher = firecrawl.watcher(id, { kind: 'crawl', pollInterval: 2, timeout: 120 });
watcher.on('document', (doc) => {
console.log('DOC', doc);
});
watcher.on('error', (err) => {
console.error('ERR', err?.error || err);
});
watcher.on('done', (state) => {
console.log('DONE', state.status);
});
// Begin watching (WS with HTTP fallback)
await watcher.start();
```
## Webhooks
You can configure webhooks to receive real-time notifications as your crawl progresses. This allows you to process pages as they are scraped instead of waiting for the entire crawl to complete.
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/crawl \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://docs.firecrawl.dev",
"limit": 100,
"webhook": {
"url": "https://your-domain.com/webhook",
"metadata": {
"any_key": "any_value"
},
"events": ["started", "page", "completed"]
}
}'
```
### Event types
| Event | Description |
| ----------------- | ---------------------------------------- |
| `crawl.started` | Fires when the crawl begins |
| `crawl.page` | Fires for each page successfully scraped |
| `crawl.completed` | Fires when the crawl finishes |
| `crawl.failed` | Fires if the crawl encounters an error |
### Payload
```json theme={null}
{
"success": true,
"type": "crawl.page",
"id": "crawl-job-id",
"data": [...], // Page data for 'page' events
"metadata": {}, // Your custom metadata
"error": null
}
```
### Verifying webhook signatures
Every webhook request from Firecrawl includes an `X-Firecrawl-Signature` header containing an HMAC-SHA256 signature. Always verify this signature to ensure the webhook is authentic and has not been tampered with.
1. Get your webhook secret from the [Advanced tab](https://www.firecrawl.dev/app/settings?tab=advanced) of your account settings
2. Extract the signature from the `X-Firecrawl-Signature` header
3. Compute HMAC-SHA256 of the raw request body using your secret
4. Compare with the signature header using a timing-safe function
Never process a webhook without verifying its signature first. The `X-Firecrawl-Signature` header contains the signature in the format: `sha256=abc123def456...`
For complete implementation examples in JavaScript and Python, see the [Webhook Security documentation](/webhooks/security). For comprehensive webhook documentation including detailed event payloads, payload structure, advanced configuration, and troubleshooting, see the [Webhooks documentation](/webhooks/overview).
## Configuration reference
The full set of parameters available when submitting a crawl job:
| Parameter | Type | Default | Description |
| ----------------------- | ---------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `url` | `string` | (required) | The starting URL to crawl from |
| `limit` | `integer` | `10000` | Maximum number of pages to crawl |
| `maxDiscoveryDepth` | `integer` | (none) | Maximum depth from the root URL based on link-discovery hops, not the number of `/` segments in the URL. Each time a new URL is found on a page, it is assigned a depth one higher than the page it was discovered on. The root site and sitemapped pages have a discovery depth of 0. Pages at the max depth are still scraped, but links on them are not followed. |
| `includePaths` | `string[]` | (none) | URL pathname regex patterns to include. Only matching paths are crawled. |
| `excludePaths` | `string[]` | (none) | URL pathname regex patterns to exclude from the crawl |
| `regexOnFullURL` | `boolean` | `false` | Match `includePaths`/`excludePaths` against the full URL (including query parameters) instead of just the pathname |
| `crawlEntireDomain` | `boolean` | `false` | Follow internal links to sibling or parent URLs, not just child paths |
| `allowSubdomains` | `boolean` | `false` | Follow links to subdomains of the main domain |
| `allowExternalLinks` | `boolean` | `false` | Follow links to external websites |
| `sitemap` | `string` | `"include"` | Sitemap handling: `"include"` (default), `"skip"`, or `"only"` |
| `ignoreQueryParameters` | `boolean` | `false` | Avoid re-scraping the same path with different query parameters |
| `ignoreRobotsTxt` | `boolean` | `false` | Ignore the website's robots.txt rules. **Enterprise only** — contact [support@firecrawl.com](mailto:support@firecrawl.com) to enable. |
| `robotsUserAgent` | `string` | (none) | Custom User-Agent string for robots.txt evaluation. When set, robots.txt is fetched with this User-Agent and rules are matched against it instead of the default. **Enterprise only** — contact [support@firecrawl.com](mailto:support@firecrawl.com) to enable. |
| `delay` | `number` | (none) | Delay in seconds between scrapes to respect rate limits. Setting this forces concurrency to 1. |
| `maxConcurrency` | `integer` | (none) | Maximum concurrent scrapes. Defaults to your team's concurrency limit. |
| `scrapeOptions` | `object` | (none) | Options applied to every scraped page (formats, proxy, caching, actions, etc.) |
| `webhook` | `object` | (none) | Webhook configuration for real-time notifications |
| `prompt` | `string` | (none) | Natural language prompt to generate crawl options. Explicitly set parameters override generated equivalents. |
## Important details
By default, crawl ignores sublinks that are not children of the URL you provide. For example, `website.com/other-parent/blog-1` would not be returned if you crawled `website.com/blogs/`. Use the `crawlEntireDomain` parameter to include sibling and parent paths. To crawl subdomains like `blog.website.com` when crawling `website.com`, use the `allowSubdomains` parameter.
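For instance, broadening a crawl of `website.com/blogs/` might look like this (a sketch; assuming the Python SDK accepts these as snake_case keyword arguments mirroring the camelCase API parameters):
```python Python theme={null}
docs = firecrawl.crawl(
    "https://website.com/blogs/",
    limit=100,
    crawl_entire_domain=True,  # also follow sibling/parent paths like /other-parent/
    allow_subdomains=True,     # also follow blog.website.com, docs.website.com, ...
)
```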
* **Sitemap discovery**: By default, the crawler includes the website's sitemap to discover URLs (`sitemap: "include"`). If you set `sitemap: "skip"`, only pages reachable through HTML links from the root URL are found. Assets like PDFs or deeply nested pages listed in the sitemap but not directly linked from HTML will be missed. For maximum coverage, keep the default setting.
* **Credit usage**: Each page crawled costs 1 credit. JSON mode adds 4 credits per page, enhanced proxy adds 4 credits per page, and PDF parsing costs 1 credit per PDF page.
* **Result expiration**: Job results are available via the API for 24 hours after completion. After that, view results in the [activity logs](https://www.firecrawl.dev/app/logs).
* **Crawl errors**: The `data` array contains pages Firecrawl successfully scraped. Use the [Get Crawl Errors](/api-reference/endpoint/crawl-get-errors) endpoint to retrieve pages that failed due to network errors, timeouts, or robots.txt blocks.
* **Non-deterministic results**: Crawl results may vary between runs of the same configuration. Pages are scraped concurrently, so the order in which links are discovered depends on network timing and which pages finish loading first. This means different branches of a site may be explored to different extents near the depth boundary, especially at higher `maxDiscoveryDepth` values. To get more deterministic results, set `maxConcurrency` to `1` or use `sitemap: "only"` if the site has a comprehensive sitemap.
# Document Parsing
Source: https://docs.firecrawl.dev/features/document-parsing
Learn about document parsing capabilities.
Firecrawl provides powerful document parsing capabilities, allowing you to extract structured content from various document formats. This feature is particularly useful for processing files like spreadsheets, Word documents, and more.
## Supported Document Formats
Firecrawl currently supports the following document formats:
* **Excel Spreadsheets** (`.xlsx`, `.xls`)
* Each worksheet is converted to an HTML table
* Worksheets are separated by H2 headings with the sheet name
* Preserves cell formatting and data types
* **Word Documents** (`.docx`, `.doc`, `.odt`, `.rtf`)
* Extracts text content while preserving document structure
* Maintains headings, paragraphs, lists, and tables
* Preserves basic formatting and styling
* **PDF Documents** (`.pdf`)
* Extracts text content with layout information
* Preserves document structure including sections and paragraphs
* Handles both text-based and scanned PDFs (with OCR support)
* Supports `mode` option to control parsing strategy: `fast` (text-only), `auto` (text with OCR fallback, default), or `ocr` (force OCR)
* Priced at 1 credit per page. See [Pricing](https://firecrawl.dev/pricing) for details.
### PDF Parsing Modes
Use the `parsers` option to control how PDFs are processed:
| Mode | Description |
| ------ | --------------------------------------------------------------------------------------------------------------------- |
| `auto` | Attempts fast text-based extraction first, falls back to OCR if needed. This is the default. |
| `fast` | Text-based parsing only (embedded text). Fastest option, but will not extract text from scanned or image-heavy pages. |
| `ocr` | Forces OCR parsing on every page. Use for scanned documents or when `auto` misclassifies a page. |
```js theme={null}
// Object syntax with mode
parsers: [{ type: "pdf", mode: "ocr", maxPages: 20 }]
// Default (auto mode)
parsers: [{ type: "pdf" }]
```
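In a full scrape call, that might look like the following Python sketch (assuming the SDK passes `parsers` through in the same shape as above; the URL is illustrative):
```python Python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# Force OCR for a scanned PDF and cap parsing at the first 20 pages
doc = firecrawl.scrape(
    "https://example.com/scanned-report.pdf",
    formats=["markdown"],
    parsers=[{"type": "pdf", "mode": "ocr", "maxPages": 20}],
)
print(doc.markdown)
```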
## How to Use Document Parsing
Document parsing in Firecrawl works in two ways:
1. **URL-based parsing (`/v2/scrape`)**: provide a URL that points to a supported document type.
2. **File upload parsing (`/v2/parse`)**: upload file bytes directly with `multipart/form-data`.
For URL-based parsing, Firecrawl detects file type from extension or content type automatically.
### Upload documents with `/v2/parse`
Use `/v2/parse` when the source document is local or not publicly accessible by URL.
```bash cURL theme={null}
curl -X POST "https://api.firecrawl.dev/v2/parse" \
-H "Authorization: Bearer fc-YOUR-API-KEY" \
-F 'options={"formats":["markdown"]}' \
-F "file=@./document.docx;type=application/vnd.openxmlformats-officedocument.wordprocessingml.document"
```
```python Python theme={null}
from firecrawl import Firecrawl
from firecrawl.types import ScrapeOptions

app = Firecrawl(api_key="fc-YOUR-API-KEY")

doc = app.parse(
    data="<html><body><h1>Hello</h1></body></html>",  # inline document content
    filename="upload.html",
    content_type="text/html",
    options=ScrapeOptions(formats=["markdown"]),
)
print(doc.markdown)
```
### Example: Scraping an Excel File
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://example.com/data.xlsx');
console.log(doc.markdown);
```
### Example: Scraping a Word Document
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://example.com/data.docx');
console.log(doc.markdown);
```
## Output Format
All supported document types are converted to clean, structured markdown. For example, an Excel file with multiple sheets might be converted to:
```markdown theme={null}
## Sheet1
| Name | Value |
|-------|-------|
| Item 1 | 100 |
| Item 2 | 200 |
## Sheet2
| Date | Description |
|------------|--------------|
| 2023-01-01 | First quarter|
```
# Enhanced Mode
Source: https://docs.firecrawl.dev/features/enhanced-mode
Use enhanced proxies for reliable scraping on complex sites
Firecrawl provides different proxy types to help you scrape websites with varying levels of complexity. Set the `proxy` parameter to control which proxy strategy is used for a request.
## Proxy types
Firecrawl supports three proxy types:
| Type | Description | Speed | Cost |
| ---------- | ------------------------------------------------------------ | ------ | ----------------------------------------------------------- |
| `basic` | Standard proxies suitable for most sites | Fast | 1 credit |
| `enhanced` | Enhanced proxies for complex sites | Slower | 5 credits per request |
| `auto` | Tries `basic` first, then retries with `enhanced` on failure | Varies | 1 credit if basic succeeds, 5 credits if enhanced is needed |
If you do not specify a proxy, Firecrawl defaults to `auto`.
## Basic usage
Set the `proxy` parameter to choose a proxy strategy. The following example uses `auto`, which lets Firecrawl decide when to escalate to enhanced proxies.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR-API-KEY')
# Choose proxy strategy: 'basic' | 'enhanced' | 'auto'
doc = firecrawl.scrape('https://example.com', formats=['markdown'], proxy='auto')
print(doc.warning or 'ok')
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
// Choose proxy strategy: 'basic' | 'enhanced' | 'auto'
const doc = await firecrawl.scrape('https://example.com', {
formats: ['markdown'],
proxy: 'auto'
});
console.log(doc.warning || 'ok');
```
```bash cURL theme={null}
// Choose proxy strategy: 'basic' | 'enhanced' | 'auto'
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://example.com",
"proxy": "auto"
}'
```
Enhanced proxy requests cost **5 credits per request**. When using `auto`, the 5-credit cost only applies if the basic proxy fails and the enhanced retry succeeds.
# Faster Scraping
Source: https://docs.firecrawl.dev/features/fast-scraping
Speed up your scrapes by 500% with the maxAge parameter
## How It Works
Firecrawl caches previously scraped pages and, by default, returns a recent copy when available.
* **Default freshness**: `maxAge = 172800000` ms (2 days). If the cached copy is newer than this, it’s returned instantly; otherwise, Firecrawl scrapes fresh and updates the cache.
* **Force fresh**: Set `maxAge: 0` to always scrape. Be aware this bypasses the cache entirely: every request goes through the full scraping pipeline, so it takes longer to complete and is more likely to fail. Use a non-zero `maxAge` if you don't need real-time content on every request.
* **Skip caching**: Set `storeInCache: false` if you don’t want to store results for a request.
Get your results **up to 500% faster** when you don’t need the absolute freshest data. Control freshness via `maxAge`, and Firecrawl will:
1. **Return instantly** if we have a recent version of the page
2. **Scrape fresh** only if our version is older than your specified age
3. **Save you time** - results come back in milliseconds instead of seconds
## When to Use This
**Great for:**
* Documentation, articles, product pages
* Bulk processing jobs
* Development and testing
* Building knowledge bases
**Skip for:**
* Real-time data (stock prices, live scores, breaking news)
* Frequently updated content
* Time-sensitive applications
## Usage
Add `maxAge` to your scrape request. Values are in milliseconds (e.g., `3600000` = 1 hour).
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")
# Use cached data if it's less than 1 hour old (3600000 ms)
# This can be 500% faster than a fresh scrape!
scrape_result = firecrawl.scrape(
'https://firecrawl.dev',
formats=['markdown'],
max_age=3600000 # 1 hour in milliseconds
)
print(scrape_result.markdown)
```
```javascript JavaScript theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
// Use cached data if it's less than 1 hour old (3600000 ms)
// This can be 500% faster than a fresh scrape!
const scrapeResult = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['markdown'],
maxAge: 3600000 // 1 hour in milliseconds
});
console.log(scrapeResult.markdown);
```
## Common maxAge values
Here are some helpful reference values:
* **5 minutes**: `300000` - For semi-dynamic content
* **1 hour**: `3600000` - For content that updates hourly
* **1 day**: `86400000` - For daily-updated content
* **1 week**: `604800000` - For relatively static content
## Performance impact
With `maxAge` enabled:
* **500% faster response times** for recent content
* **Instant results** instead of waiting for fresh scrapes
## Important notes
* **Default**: `maxAge` is `172800000` (2 days)
* **Fresh when needed**: If our data is older than `maxAge`, we scrape fresh automatically
* **No stale data**: You'll never get data older than your specified `maxAge`
* **Credits**: Cached results still cost 1 credit per page. Caching improves speed and latency, not credit usage.
### When caching is bypassed
Caching is automatically skipped when your request includes any of the following:
* Custom `headers`
* `actions` (browser automation steps)
* A browser `profile`
* `changeTracking` format
* Custom `screenshot` viewport or quality settings
### Cache hit matching
For a cache hit, these parameters must match exactly between the original and subsequent requests: `url`, `mobile`, `location`, `waitFor`, `blockAds`, `screenshot` (enabled/disabled and full-page), and stealth proxy mode.
You can verify cache behavior by checking `metadata.cacheState` in the response — it will be `"hit"` or `"miss"`.
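For example (a sketch; `metadata.cache_state` assumes the Python SDK snake_cases response metadata, as in the crawl example below):
```python Python theme={null}
doc = firecrawl.scrape(
    "https://firecrawl.dev",
    formats=["markdown"],
    max_age=3600000,  # accept cached copies up to 1 hour old
)
print(doc.metadata.cache_state)  # "hit" if served from cache, "miss" otherwise
```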
## Faster crawling
The same speed benefits apply when crawling multiple pages. Use `maxAge` within `scrapeOptions` to get cached results for pages we’ve seen recently.
```python Python theme={null}
from firecrawl import Firecrawl
from firecrawl.v2.types import ScrapeOptions
firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")
# Crawl with cached scraping - 500% faster for pages we've seen recently
crawl_result = firecrawl.crawl(
'https://firecrawl.dev',
limit=100,
scrape_options=ScrapeOptions(
formats=['markdown'],
max_age=3600000 # Use cached data if less than 1 hour old
)
)
for page in crawl_result.data:
print(f"URL: {page.metadata.source_url}")
print(f"Content: {page.markdown[:200]}...")
```
```javascript JavaScript theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
// Crawl with cached scraping - 500% faster for pages we've seen recently
const crawlResult = await firecrawl.crawl('https://firecrawl.dev', {
limit: 100,
scrapeOptions: {
formats: ['markdown'],
maxAge: 3600000 // Use cached data if less than 1 hour old
}
});
crawlResult.data.forEach(page => {
console.log(`URL: ${page.metadata.sourceURL}`);
console.log(`Content: ${page.markdown.substring(0, 200)}...`);
});
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/crawl \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"limit": 100,
"scrapeOptions": {
"formats": ["markdown"],
"maxAge": 3600000
}
}'
```
When crawling with `maxAge`, each page in your crawl will benefit from the 500% speed improvement if we have recent cached data for that page.
Start using `maxAge` today for dramatically faster scrapes and crawls!
# Interact after scraping
Source: https://docs.firecrawl.dev/features/interact
Interact with a page you fetched by prompting or running code.
Scrape a page to get clean data, then call `/interact` to start taking actions on that page: click buttons, fill forms, extract dynamic content, or navigate deeper. Just describe what you want, or write code if you need full control.
* Describe the action you want to take on the page in natural language
* Interact via code executed securely with Playwright or agent-browser
* Watch or interact with the browser in real time via an embeddable stream
## How It Works
1. **Scrape** a URL with `POST /v2/scrape`. The response includes a `scrapeId` in `data.metadata.scrapeId`. Optionally pass a `profile` to persist browser state across sessions.
2. **Interact** by calling `POST /v2/scrape/{scrapeId}/interact` with a `prompt` or with playwright `code`. On the first call, the scraped session is resumed and you can start interacting with the page.
3. **Stop** the session with `DELETE /v2/scrape/{scrapeId}/interact` when you're done.
## Quick Start
Scrape a page, interact with it, and stop the session:
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR-API-KEY")
# 1. Scrape Amazon's homepage
result = app.scrape("https://www.amazon.com", formats=["markdown"])
scrape_id = result.metadata.scrape_id
# 2. Interact — search for a product and get its price
app.interact(scrape_id, prompt="Search for iPhone 16 Pro Max")
response = app.interact(scrape_id, prompt="Click on the first result and tell me the price")
print(response.output)
# 3. Stop the session
app.stop_interaction(scrape_id)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const app = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });
// 1. Scrape Amazon's homepage
const result = await app.scrape('https://www.amazon.com', { formats: ['markdown'] });
const scrapeId = result.metadata?.scrapeId;
// 2. Interact — search for a product and get its price
await app.interact(scrapeId, { prompt: 'Search for iPhone 16 Pro Max' });
const response = await app.interact(scrapeId, { prompt: 'Click on the first result and tell me the price' });
console.log(response.output);
// 3. Stop the session
await app.stopInteraction(scrapeId);
```
```bash cURL theme={null}
# 1. Scrape Amazon's homepage
RESPONSE=$(curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://www.amazon.com", "formats": ["markdown"]}')
SCRAPE_ID=$(echo "$RESPONSE" | jq -r '.data.metadata.scrapeId')
# 2. Interact — search for a product and get its price
curl -s -X POST "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"prompt": "Search for iPhone 16 Pro Max"}'
curl -s -X POST "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"prompt": "Click on the first result and tell me the price"}'
# 3. Stop the session
curl -s -X DELETE "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY"
```
```bash CLI theme={null}
# 1. Scrape Amazon's homepage (scrape ID is saved automatically)
firecrawl scrape https://www.amazon.com
# 2. Interact — search for a product and get its price
firecrawl interact "Search for iPhone 16 Pro Max"
firecrawl interact "Click on the first result and tell me the price"
# 3. Stop the session
firecrawl interact stop
```
```json Response theme={null}
{
"success": true,
"liveViewUrl": "https://liveview.firecrawl.dev/...",
"interactiveLiveViewUrl": "https://liveview.firecrawl.dev/...",
"output": "The iPhone 16 Pro Max (256GB) is priced at $1,199.00.",
"exitCode": 0,
"killed": false
}
```
## Interact via prompting
The simplest way to interact with a page. Describe what you want in natural language and it will click, type, scroll, and extract data automatically.
```python Python theme={null}
response = app.interact(scrape_id, prompt="What are the customer reviews saying about battery life?")
print(response.output)
```
```js Node theme={null}
const response = await app.interact(scrapeId, {
prompt: 'What are the customer reviews saying about battery life?',
});
console.log(response.output);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "What are the customer reviews saying about battery life?"
}'
```
```bash CLI theme={null}
firecrawl interact "What are the customer reviews saying about battery life?"
```
The response includes an `output` field with the agent's answer:
```json Response theme={null}
{
"success": true,
"liveViewUrl": "https://liveview.firecrawl.dev/...",
"interactiveLiveViewUrl": "https://liveview.firecrawl.dev/...",
"output": "Customers are generally positive about battery life. Most reviewers report 8-10 hours of use on a single charge. A few noted it drains faster with heavy multitasking.",
"stdout": "...",
"result": "...",
"stderr": "",
"exitCode": 0,
"killed": false
}
```
### Keep Prompts Small and Focused
Prompts work best when each one is a **single, clear task**. Instead of asking the agent to do a complex multi-step workflow in one shot, break it into separate interact calls. Each call reuses the same browser session, so state carries over between them.
## Running Code
For full control, you can execute code directly in the browser sandbox. The `page` variable (a Playwright Page object) is available in Node.js and Python. Bash mode has [agent-browser](https://github.com/vercel-labs/agent-browser) pre-installed. You can also take screenshots within the session — use `(await page.screenshot()).toString("base64")` in Node.js, `await page.screenshot(path="/tmp/screenshot.png")` in Python, or `agent-browser screenshot` in Bash.
### Node.js (Playwright)
The default language. Write Playwright code directly — `page` is already connected to the browser.
```python Python theme={null}
response = app.interact(scrape_id, code="""
// Click a button and wait for navigation
await page.click('#next-page');
await page.waitForLoadState('networkidle');
// Extract content from the new page
const title = await page.title();
const content = await page.$eval('.article-body', el => el.textContent);
JSON.stringify({ title, content });
""")
print(response.result)
```
```js Node theme={null}
const response = await app.interact(scrapeId, {
code: `
// Click a button and wait for navigation
await page.click('#next-page');
await page.waitForLoadState('networkidle');
// Extract content from the new page
const title = await page.title();
const content = await page.$eval('.article-body', el => el.textContent);
JSON.stringify({ title, content });
`,
});
console.log(response.result);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"code": "await page.click(\"#next-page\"); await page.waitForLoadState(\"networkidle\"); const title = await page.title(); JSON.stringify({ title });",
"language": "node",
"timeout": 30
}'
```
```bash CLI theme={null}
# Uses the last scrape automatically
firecrawl interact -c "
await page.click('#next-page');
await page.waitForLoadState('networkidle');
const title = await page.title();
const content = await page.\$eval('.article-body', el => el.textContent);
JSON.stringify({ title, content });
"
# Or pass a scrape ID explicitly
# firecrawl interact -c "await page.title()"
```
### Python
Set `language` to `"python"` for Playwright's Python API.
```python Python theme={null}
response = app.interact(
scrape_id,
code="""
import json
await page.click('#load-more')
await page.wait_for_load_state('networkidle')
items = await page.query_selector_all('.item')
data = []
for item in items:
text = await item.text_content()
data.append(text.strip())
print(json.dumps(data))
""",
language="python",
)
print(response.stdout)
```
```js Node theme={null}
const response = await app.interact(scrapeId, {
code: `
import json
await page.click('#load-more')
await page.wait_for_load_state('networkidle')
items = await page.query_selector_all('.item')
data = []
for item in items:
text = await item.text_content()
data.append(text.strip())
print(json.dumps(data))
`,
language: 'python',
});
console.log(response.stdout);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"code": "import json\nawait page.click(\"#load-more\")\nawait page.wait_for_load_state(\"networkidle\")\nitems = await page.query_selector_all(\".item\")\ndata = [await i.text_content() for i in items]\nprint(json.dumps(data))",
"language": "python"
}'
```
```bash CLI theme={null}
firecrawl interact --python -c "
import json
await page.click('#load-more')
await page.wait_for_load_state('networkidle')
items = await page.query_selector_all('.item')
data = [await i.text_content() for i in items]
print(json.dumps(data))
"
```
### Bash (agent-browser)
[agent-browser](https://github.com/vercel-labs/agent-browser) is a CLI pre-installed in the sandbox with 60+ commands. It provides an accessibility tree with element refs (`@e1`, `@e2`, ...) — ideal for LLM-driven automation.
```python Python theme={null}
# Take a snapshot to see interactive elements
snapshot = app.interact(
scrape_id,
code="agent-browser snapshot -i",
language="bash",
)
print(snapshot.stdout)
# Output:
# [document]
# @e1 [input type="text"] "Search..."
# @e2 [button] "Search"
# @e3 [link] "About"
# Interact with elements using @refs
app.interact(
scrape_id,
code='agent-browser fill @e1 "firecrawl" && agent-browser click @e2',
language="bash",
)
```
```js Node theme={null}
// Take a snapshot to see interactive elements
const snapshot = await app.interact(scrapeId, {
code: 'agent-browser snapshot -i',
language: 'bash',
});
console.log(snapshot.stdout);
// Output:
// [document]
// @e1 [input type="text"] "Search..."
// @e2 [button] "Search"
// @e3 [link] "About"
// Interact with elements using @refs
await app.interact(scrapeId, {
code: 'agent-browser fill @e1 "firecrawl" && agent-browser click @e2',
language: 'bash',
});
```
```bash cURL theme={null}
# Take a snapshot to see interactive elements
curl -s -X POST "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"code": "agent-browser snapshot -i", "language": "bash"}'
# Interact with elements using @refs
curl -s -X POST "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"code": "agent-browser fill @e1 \"firecrawl\" && agent-browser click @e2", "language": "bash"}'
```
```bash CLI theme={null}
# Take a snapshot to see interactive elements
firecrawl interact --bash -c "agent-browser snapshot -i"
# Interact with elements using @refs
firecrawl interact --bash -c 'agent-browser fill @e1 "firecrawl" && agent-browser click @e2'
```
Common agent-browser commands (see the sketch after this table):
| Command | Description |
| ------------------------- | ----------------------------------------- |
| `snapshot` | Full accessibility tree with element refs |
| `snapshot -i` | Interactive elements only |
| `click @e1` | Click element by ref |
| `fill @e1 "text"` | Clear field and type text |
| `type @e1 "text"` | Type without clearing |
| `press Enter` | Press a keyboard key |
| `scroll down 500` | Scroll down by pixels |
| `get text @e1` | Get text content |
| `get url` | Get current URL |
| `wait @e1` | Wait for element |
| `wait --load networkidle` | Wait for network idle |
| `find text "X" click` | Find element by text and click |
| `screenshot` | Take a screenshot of the current page |
| `eval "js code"` | Run JavaScript in page |
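Any of these commands can be chained in a single bash interaction. A minimal sketch using the Python SDK, reusing the `app` and `scrape_id` from the examples above:
```python Python theme={null}
# Sketch: chain two table commands in one bash interaction.
# Reuses `app` and `scrape_id` from the earlier examples.
response = app.interact(
    scrape_id,
    code='agent-browser find text "About" click && agent-browser get url',
    language="bash",
)
print(response.stdout)  # the current URL after clicking the "About" link
```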
## Live View
Every interact response returns a `liveViewUrl` that you can embed to watch the browser in real time. Useful for debugging, demos, or building browser-powered UIs.
```json Response theme={null}
{
"success": true,
"liveViewUrl": "https://liveview.firecrawl.dev/...",
"interactiveLiveViewUrl": "https://liveview.firecrawl.dev/...",
"stdout": "",
"result": "...",
"exitCode": 0
}
```
```html theme={null}
<!-- Minimal sketch: embed the liveViewUrl returned by the interact response -->
<iframe src="https://liveview.firecrawl.dev/..." width="1280" height="720"></iframe>
```
### Interactive Live View
The response also includes an `interactiveLiveViewUrl`. Unlike the standard live view which is view-only, the interactive live view allows users to click, type, and interact with the browser session directly through the embedded stream. This is useful for building user-facing browser UIs — such as login flows, or guided workflows where end users need to control the browser.
```html theme={null}
<!-- Minimal sketch: embed the interactiveLiveViewUrl so end users can control the session -->
<iframe src="https://liveview.firecrawl.dev/..." width="1280" height="720"></iframe>
```
## Session Lifecycle
### Creation
The first `POST /v2/scrape/{scrapeId}/interact` continues the scrape session and starts the interaction.
### Reuse
Subsequent interact calls on the same `scrapeId` reuse the existing session. The browser stays open and maintains its state between calls, so you can chain multiple interactions:
```python Python theme={null}
# First call — click a tab
app.interact(scrape_id, code="await page.click('#tab-2')")
# Second call — the tab is still selected, extract its content
result = app.interact(scrape_id, code="await page.$eval('#tab-2-content', el => el.textContent)")
print(result.result)
```
```js Node theme={null}
// First call — click a tab
await app.interact(scrapeId, { code: "await page.click('#tab-2')" });
// Second call — the tab is still selected, extract its content
const result = await app.interact(scrapeId, {
code: "await page.$eval('#tab-2-content', el => el.textContent)",
});
console.log(result.result);
```
```bash CLI theme={null}
# First call — click a tab
firecrawl interact -c "await page.click('#tab-2')"
# Second call — the tab is still selected, extract its content
firecrawl interact -c "await page.\$eval('#tab-2-content', el => el.textContent)"
```
### Cleanup
Stop the session explicitly when done:
```python Python theme={null}
app.stop_interaction(scrape_id)
```
```js Node theme={null}
await app.stopInteraction(scrapeId);
```
```bash cURL theme={null}
curl -s -X DELETE "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY"
```
```bash CLI theme={null}
# Stops the last scrape session
firecrawl interact stop
# Or stop a specific session by ID
# firecrawl interact stop
```
Sessions also expire automatically based on TTL (default: 10 minutes) or inactivity timeout (default: 5 minutes).
Always stop sessions when you're done to avoid unnecessary billing. Credits are prorated by the second.
## Persistent Profiles
By default, each scrape + interact session starts with a clean browser. With `profile`, you can save and reuse browser state (cookies, localStorage, sessions) across scrapes. This is useful for staying logged in and preserving preferences.
Pass the `profile` parameter when calling scrape. Sessions with the same profile name share state.
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR-API-KEY")
# Session 1: Scrape with a profile, log in, then stop (state is saved)
result = app.scrape(
"https://app.example.com/login",
formats=["markdown"],
profile={"name": "my-app", "save_changes": True},
)
scrape_id = result.metadata.scrape_id
app.interact(scrape_id, prompt="Fill in user@example.com and password, then click Login")
app.stop_interaction(scrape_id)
# Session 2: Scrape with the same profile — already logged in
result = app.scrape(
"https://app.example.com/dashboard",
formats=["markdown"],
profile={"name": "my-app", "save_changes": True},
)
scrape_id = result.metadata.scrape_id
response = app.interact(scrape_id, prompt="Extract the dashboard data")
print(response.output)
app.stop_interaction(scrape_id)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const app = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });
// Session 1: Scrape with a profile, log in, then stop (state is saved)
const result1 = await app.scrape('https://app.example.com/login', {
formats: ['markdown'],
profile: { name: 'my-app', saveChanges: true },
});
const scrapeId1 = result1.metadata?.scrapeId;
await app.interact(scrapeId1, { prompt: 'Fill in user@example.com and password, then click Login' });
await app.stopInteraction(scrapeId1);
// Session 2: Scrape with the same profile — already logged in
const result2 = await app.scrape('https://app.example.com/dashboard', {
formats: ['markdown'],
profile: { name: 'my-app', saveChanges: true },
});
const scrapeId2 = result2.metadata?.scrapeId;
const response = await app.interact(scrapeId2, { prompt: 'Extract the dashboard data' });
console.log(response.output);
await app.stopInteraction(scrapeId2);
```
```bash cURL theme={null}
# Session 1: Scrape with a profile
RESPONSE=$(curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://app.example.com/login",
"formats": ["markdown"],
"profile": { "name": "my-app", "saveChanges": true }
}')
SCRAPE_ID=$(echo $RESPONSE | jq -r '.data.metadata.scrapeId')
# Log in via interact
curl -s -X POST "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"prompt": "Fill in user@example.com and password, then click Login"}'
# Stop — state is saved to the profile
curl -s -X DELETE "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY"
# Session 2: Scrape again with the same profile — already logged in
RESPONSE=$(curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://app.example.com/dashboard",
"formats": ["markdown"],
"profile": { "name": "my-app", "saveChanges": true }
}')
SCRAPE_ID=$(echo $RESPONSE | jq -r '.data.metadata.scrapeId')
curl -s -X POST "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"prompt": "Extract the dashboard data"}'
```
```bash CLI theme={null}
# Session 1: Scrape with a profile, log in, then stop (state is saved)
firecrawl scrape https://app.example.com/login --profile my-app
firecrawl interact "Fill in user@example.com and password, then click Login"
firecrawl interact stop
# Session 2: Scrape with the same profile — already logged in
firecrawl scrape https://app.example.com/dashboard --profile my-app
firecrawl interact "Extract the dashboard data"
firecrawl interact stop
# Read-only: load profile state without saving changes back
firecrawl scrape https://app.example.com/dashboard --profile my-app --no-save-changes
```
| Parameter | Default | Description |
| ------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name` | — | A name for the persistent profile. Scrapes with the same name share browser state. |
| `saveChanges` | `true` | When `true`, browser state is saved back to the profile when the interact session stops. Set to `false` to load existing data without writing — useful when you need multiple concurrent readers. |
Only one session can save to a profile at a time. If another session is already saving, you'll get a `409` error. You can still open the same profile with `saveChanges: false`, or try again later.
The browser state is saved when the interact session is stopped. Always stop the session when you're done so the profile can be reused.
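For instance, a concurrent reader can load the profile without writing state back. A minimal sketch using the Python options shown above:
```python Python theme={null}
# Sketch: open a profile read-only so it never conflicts with a writer session.
result = app.scrape(
    "https://app.example.com/dashboard",
    formats=["markdown"],
    profile={"name": "my-app", "save_changes": False},  # load state, never write back
)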
## When to Use What
| Use Case | Recommended | Why |
| -------------------------------- | -------------------------- | ------------------------------- |
| Web search | [Search](/features/search) | Dedicated search endpoint |
| Get clean content from a URL | [Scrape](/features/scrape) | One API call, no session needed |
| Click, type, navigate on a page | **Interact** (prompt) | Just describe it in English |
| Extract data behind interactions | **Interact** (prompt) | No selectors needed |
| Complex scraping logic | **Interact** (code) | Full Playwright control |
**Interact vs Browser Sandbox**: Interact is built on the same infrastructure as [Browser Sandbox](/features/browser) but provides a better interface for the most common pattern — scrape a page, then go deeper. Browser Sandbox is better when you need a standalone browser session that isn't tied to a specific scrape.
## Pricing
* **Code-only** (no `prompt`) — 2 credits per session minute
* **With AI prompts** — 7 credits per session minute
* **Scrape** — billed separately (1 credit per scrape, plus any format-specific costs)
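For example, a 90-second code-only session is billed as 1.5 minutes × 2 = 3 credits, since credits are prorated by the second.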
## API Reference
* [Execute Interact](/api-reference/endpoint/scrape-execute) — `POST /v2/scrape/{scrapeId}/interact`
* [Stop Interact](/api-reference/endpoint/scrape-browser-delete) — `DELETE /v2/scrape/{scrapeId}/interact`
### Request Body (POST)
| Field | Type | Default | Description |
| ---------- | -------- | -------- | ---------------------------------------------------------------------------------------------------- |
| `prompt` | `string` | — | Natural language task for the AI agent. Required if `code` is not set. Max 10,000 characters. |
| `code` | `string` | — | Code to execute (Node.js, Python, or Bash). Required if `prompt` is not set. Max 100,000 characters. |
| `language` | `string` | `"node"` | `"node"`, `"python"`, or `"bash"`. Only used with `code`. |
| `timeout` | `number` | `30` | Timeout in seconds (1–300). |
| `origin` | `string` | — | Caller identifier for activity tracking. |
### Response
| Field | Description |
| ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `success` | `true` if the execution completed without errors |
| `liveViewUrl` | Read-only live view URL for the browser session |
| `interactiveLiveViewUrl` | Interactive live view URL (viewers can control the browser) |
| `output` | The agent's natural language answer to your prompt. Only present when using `prompt`. |
| `stdout` | Standard output from the code execution |
| `result` | Raw return value from the sandbox. For `code`: the last expression evaluated. For `prompt`: the raw page snapshot the agent used to produce `output`. |
| `stderr` | Standard error output |
| `exitCode` | Exit code (`0` = success) |
| `killed` | `true` if the execution was terminated due to timeout |
***
Have feedback or need help? Email [help@firecrawl.com](mailto:help@firecrawl.com) or reach out on [Discord](https://discord.gg/firecrawl).
# JSON mode - Structured result
Source: https://docs.firecrawl.dev/features/llm-extract
Extract structured data from pages via LLMs
**v2 API Change:** JSON schema extraction is fully supported in v2, but the API format has changed. In v2, the schema is embedded directly inside the format object as `formats: [{type: "json", schema: {...}}]`. The v1 `jsonOptions` parameter no longer exists in v2.
For schema validation failures and other extraction errors, see [Errors](/api-reference/errors) — extraction-specific issues typically surface as `400` or `422` responses.
## Scrape and extract structured data with Firecrawl
Firecrawl uses AI to get structured data from web pages in 3 steps:
1. **Set the Schema (optional):**
Define a JSON schema (using OpenAI's format) to specify the data you want, or just provide a `prompt` if you don't need a strict schema, along with the webpage URL.
2. **Make the Request:**
Send your URL and schema to our scrape endpoint using JSON mode. See how here:
[Scrape Endpoint Documentation](https://docs.firecrawl.dev/api-reference/endpoint/scrape)
3. **Get Your Data:**
Get back clean, structured data matching your schema that you can use right away.
This makes getting web data in the format you need quick and easy.
## Extract structured data
### JSON mode via /scrape
Used to extract structured data from scraped pages.
```python Python theme={null}
from firecrawl import Firecrawl
from pydantic import BaseModel
app = Firecrawl(api_key="fc-YOUR-API-KEY")
class CompanyInfo(BaseModel):
company_mission: str
supports_sso: bool
is_open_source: bool
is_in_yc: bool
result = app.scrape(
'https://firecrawl.dev',
formats=[{
"type": "json",
"schema": CompanyInfo.model_json_schema()
}],
only_main_content=False,
timeout=120000
)
print(result)
```
```js Node theme={null}
import Firecrawl from "@mendable/firecrawl-js";
import { z } from "zod";
const app = new Firecrawl({
apiKey: "fc-YOUR_API_KEY"
});
// Define schema to extract contents into
const schema = z.object({
company_mission: z.string(),
supports_sso: z.boolean(),
is_open_source: z.boolean(),
is_in_yc: z.boolean()
});
const result = await app.scrape("https://firecrawl.dev", {
formats: [{
type: "json",
schema: schema
}],
});
console.log(result);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"formats": [ {
"type": "json",
"schema": {
"type": "object",
"properties": {
"company_mission": {
"type": "string"
},
"supports_sso": {
"type": "boolean"
},
"is_open_source": {
"type": "boolean"
},
"is_in_yc": {
"type": "boolean"
}
},
"required": [
"company_mission",
"supports_sso",
"is_open_source",
"is_in_yc"
]
}
} ]
}'
```
Output:
```json JSON theme={null}
{
"success": true,
"data": {
"json": {
"company_mission": "AI-powered web scraping and data extraction",
"supports_sso": true,
"is_open_source": true,
"is_in_yc": true
},
"metadata": {
"title": "Firecrawl",
"description": "AI-powered web scraping and data extraction",
"robots": "follow, index",
"ogTitle": "Firecrawl",
"ogDescription": "AI-powered web scraping and data extraction",
"ogUrl": "https://firecrawl.dev/",
"ogImage": "https://firecrawl.dev/og.png",
"ogLocaleAlternate": [],
"ogSiteName": "Firecrawl",
"sourceURL": "https://firecrawl.dev/"
},
}
}
```
### Structured data without schema
You can also extract without a schema by passing only a `prompt` to the endpoint. The LLM chooses the structure of the data.
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR-API-KEY")
result = app.scrape(
'https://firecrawl.dev',
formats=[{
"type": "json",
"prompt": "Extract the company mission from the page."
}],
only_main_content=False,
timeout=120000
)
print(result)
```
```js Node theme={null}
import Firecrawl from "@mendable/firecrawl-js";
const app = new Firecrawl({
apiKey: "fc-YOUR_API_KEY"
});
const result = await app.scrape("https://firecrawl.dev", {
formats: [{
type: "json",
prompt: "Extract the company mission from the page."
}]
});
console.log(result);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"formats": [{
"type": "json",
"prompt": "Extract the company mission from the page."
}]
}'
```
Output:
```json JSON theme={null}
{
"success": true,
"data": {
"json": {
"company_mission": "AI-powered web scraping and data extraction",
},
"metadata": {
"title": "Firecrawl",
"description": "AI-powered web scraping and data extraction",
"robots": "follow, index",
"ogTitle": "Firecrawl",
"ogDescription": "AI-powered web scraping and data extraction",
"ogUrl": "https://firecrawl.dev/",
"ogImage": "https://firecrawl.dev/og.png",
"ogLocaleAlternate": [],
"ogSiteName": "Firecrawl",
"sourceURL": "https://firecrawl.dev/"
},
}
}
```
### Real-world example: Extracting company information
Here's a comprehensive example extracting structured company information from a website:
```python Python theme={null}
from firecrawl import Firecrawl
from pydantic import BaseModel
app = Firecrawl(api_key="fc-YOUR-API-KEY")
class CompanyInfo(BaseModel):
company_mission: str
supports_sso: bool
is_open_source: bool
is_in_yc: bool
result = app.scrape(
'https://firecrawl.dev/',
formats=[{
"type": "json",
"schema": CompanyInfo.model_json_schema()
}]
)
print(result)
```
```js Node theme={null}
import Firecrawl from "@mendable/firecrawl-js";
import { z } from "zod";
const app = new Firecrawl({
apiKey: "fc-YOUR_API_KEY"
});
const companyInfoSchema = z.object({
company_mission: z.string(),
supports_sso: z.boolean(),
is_open_source: z.boolean(),
is_in_yc: z.boolean()
});
const result = await app.scrape("https://firecrawl.dev/", {
formats: [{
type: "json",
schema: companyInfoSchema
}]
});
console.log(result);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev/",
"formats": [{
"type": "json",
"schema": {
"type": "object",
"properties": {
"company_mission": {
"type": "string"
},
"supports_sso": {
"type": "boolean"
},
"is_open_source": {
"type": "boolean"
},
"is_in_yc": {
"type": "boolean"
}
},
"required": [
"company_mission",
"supports_sso",
"is_open_source",
"is_in_yc"
]
}
}]
}'
```
Output:
```json Output theme={null}
{
"success": true,
"data": {
"json": {
"company_mission": "Turn websites into LLM-ready data",
"supports_sso": true,
"is_open_source": true,
"is_in_yc": true
}
}
}
```
### JSON format options
When using JSON mode in v2, include an object in `formats` with the schema embedded directly:
`formats: [{ type: 'json', schema: { ... }, prompt: '...' }]`
Parameters:
* `schema`: JSON Schema describing the structured output you want (required for schema-based extraction).
* `prompt`: Optional prompt to guide extraction (also used for no-schema extraction).
**Important:** Unlike v1, there is no separate `jsonOptions` parameter in v2. The schema must be included directly inside the format object in the `formats` array.
**HTML attributes are not available in JSON extraction.** JSON extraction works on the markdown conversion of the page, which only preserves visible text content. HTML attributes (e.g., `data-id`, custom attributes on elements) are stripped during conversion and the LLM cannot see them. If you need to extract HTML attribute values, use `rawHtml` format and parse attributes client-side, or use an `executeJavascript` action to inject attribute values into visible text before extraction.
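A minimal sketch of the `executeJavascript` approach, assuming a hypothetical page whose elements carry `data-id` attributes:
```python Python theme={null}
# Sketch: make data-id attributes visible as text, then extract them with JSON mode.
from firecrawl import Firecrawl

app = Firecrawl(api_key="fc-YOUR-API-KEY")

result = app.scrape(
    "https://example.com/products",  # hypothetical page with data-id attributes
    actions=[{
        "type": "executeJavascript",
        # Append each element's data-id as visible text so the LLM can see it
        "script": "document.querySelectorAll('[data-id]').forEach(el => el.append(' [data-id: ' + el.getAttribute('data-id') + ']'))",
    }],
    formats=[{
        "type": "json",
        "prompt": "Extract each product name and the data-id value shown in brackets next to it.",
    }],
)
print(result.json)
```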
## Tips for consistent extraction
If you are seeing inconsistent or incomplete results from JSON extraction, these practices can help:
* **Keep prompts short and focused.** Long prompts with many rules increase variability. Move specific constraints (like allowed values) into the schema instead.
* **Use concise property names.** Avoid embedding instructions or enum lists in property names. Use a short key like `"installation_type"` and put allowed values in an `enum` array.
* **Add `enum` arrays for constrained fields.** When a field has a fixed set of values, list them in `enum` and make sure they match the exact text shown on the page.
* **Include null-handling in field descriptions.** Add `"Return null if not found on the page."` to each field's `description` so the model does not guess missing values.
* **Add location hints.** Tell the model where to find data on the page, e.g. `"Flow rate in GPM from the Specifications table."`.
* **Split large schemas into smaller requests.** Schemas with many fields (e.g. 30+) produce less consistent results. Split them into 2–3 requests of 10–15 fields each.
* **Avoid `minItems`/`maxItems` on arrays.** JSON Schema validation keywords like `minItems` and `maxItems` do not control how much content the scraper collects. Setting `minItems: 20` will not make the LLM return more items — it may instead hallucinate entries to satisfy the constraint. Remove these keywords and use a `prompt` instead (e.g. `"Extract ALL reviews from the page. Do not skip any."`) to guide completeness.
* **Use `"type": "array"` to extract lists of items.** If you need to extract multiple items (e.g. a list of people, products, or reviews), wrap them in an array property with an `items` block. Using `"type": "object"` for a list will return only a single item. See the array schema example below.
**Example of a well-structured schema:**
```json theme={null}
{
"type": "object",
"properties": {
"product_name": {
"type": ["string", "null"],
"description": "Full descriptive product name as shown on the page. Return null if not found."
},
"installation_type": {
"type": ["string", "null"],
"description": "Installation type from the Specifications section. Return null if not found.",
"enum": ["Deck-mount", "Wall-mount", "Countertop", "Drop-in", "Undermount"]
},
"flow_rate_gpm": {
"type": ["string", "null"],
"description": "Flow rate in GPM from the Specifications section. Return null if not found."
}
}
}
```
**Example of extracting a list of items:**
When a page contains multiple items (e.g. team members, products, reviews), use `"type": "array"` with `"items"` to get the full list:
```json theme={null}
{
"type": "object",
"properties": {
"people": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"role": { "type": "string" },
"department": { "type": "string" }
}
}
}
}
}
```
# Lockdown Mode
Source: https://docs.firecrawl.dev/features/lockdown
Cache-only scrape mode for compliance and air-gapped environments. No outbound traffic.
Lockdown mode forces the scrape endpoint to read from Firecrawl's existing index and cache only — it never makes an outbound request to the target URL. It is designed for compliance-constrained and air-gapped environments where the scrape request itself (the URL, headers, and body) could leak sensitive information over the network.
## How it works
When `lockdown: true` is set on a `/v2/scrape` request:
* **No outbound traffic.** Firecrawl never connects to the target URL. All outbound paths (HTTP engines, robots.txt fetching, search-index writes, audio transforms, etc.) are gated off.
* **Cache-only reads.** The request is served from Firecrawl's index if a matching entry exists. The default `maxAge` is bumped to 2 years so existing cached pages are eligible regardless of age.
* **Cache miss returns an error.** If no cached data is available, Firecrawl returns a `404` with error code `SCRAPE_LOCKDOWN_CACHE_MISS`. The URL is never logged on miss.
* **Zero data retention.** Lockdown requests are treated as ZDR: no URL is persisted, no response blob is written to long-term storage, and the scrape job is cleaned up after delivery.
## When to use this
**Great for:**
* Regulated industries (healthcare, finance, legal) where outbound requests require audit or approval
* Air-gapped or compliance-constrained environments where the URL itself is sensitive
* Replaying already-indexed pages without re-hitting origins
**Skip for:**
* Fresh content that has never been scraped before — lockdown mode returns an error on cache miss
* Real-time or time-sensitive data
## Usage
Add `lockdown: true` to your scrape request.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")
# Serve only previously cached results. No outbound request is made.
# Returns SCRAPE_LOCKDOWN_CACHE_MISS if the URL is not in the cache.
scrape_result = firecrawl.scrape(
'https://firecrawl.dev',
formats=['markdown'],
lockdown=True,
)
print(scrape_result.markdown)
```
```javascript JavaScript theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
// Serve only previously cached results. No outbound request is made.
// Returns SCRAPE_LOCKDOWN_CACHE_MISS if the URL is not in the cache.
const scrapeResult = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['markdown'],
lockdown: true,
});
console.log(scrapeResult.markdown);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"formats": ["markdown"],
"lockdown": true
}'
```
```bash CLI theme={null}
# Serve only previously cached results. No outbound request is made.
firecrawl https://firecrawl.dev --lockdown
```
## Cache miss response
If the URL has not been previously scraped and cached, the response is:
```json theme={null}
{
"success": false,
"code": "SCRAPE_LOCKDOWN_CACHE_MISS",
"error": "No cached data is available for this request in lockdown mode. Lockdown mode only serves previously cached responses and never makes outbound requests. To resolve this, either disable lockdown mode to allow a fresh scrape, or try again after the URL has been scraped and cached."
}
```
To seed the cache, perform a normal (non-lockdown) scrape of the URL first. Subsequent lockdown requests will return the cached result.
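For example, a sketch using the Python SDK shown above:
```python Python theme={null}
# Sketch: seed the cache with a normal scrape, then read it back in lockdown mode.
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")

# 1. Normal scrape: makes an outbound request and populates the cache.
firecrawl.scrape("https://firecrawl.dev", formats=["markdown"])

# 2. Lockdown scrape: served from the cache, no outbound traffic.
doc = firecrawl.scrape("https://firecrawl.dev", formats=["markdown"], lockdown=True)
print(doc.markdown)
```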
## Billing
| Outcome | Credits |
| ----------------------------------------- | --------- |
| Cache hit | 5 credits |
| Cache miss (`SCRAPE_LOCKDOWN_CACHE_MISS`) | 1 credit |
Zero Data Retention does not incur an additional charge on lockdown requests — the ZDR cost is waived because lockdown mode is already ZDR by default.
## Cache hit matching
Lockdown uses the same cache-match rules as regular scrapes. For a cache hit, these parameters must match the cached entry: `url`, `mobile`, `location`, `waitFor`, `blockAds`, `screenshot` (enabled/disabled and full-page), and enhanced proxy mode. You can verify behavior via `metadata.cacheState` in the response — it will be `"hit"` on a served response.
## Availability
Lockdown mode is supported on the `/v2/scrape` endpoint and is exposed across all surfaces that call it:
* **SDKs** — Python, Node.js, Go, Rust, Java, .NET, Ruby, PHP, and Elixir (`lockdown: true` on the scrape options).
* **CLI** — pass `--lockdown` to `firecrawl scrape`.
* **MCP server** — include `"lockdown": true` in the `firecrawl_scrape` tool arguments.
It is not available on `crawl`, `map`, `extract`, or `search`.
# Map
Source: https://docs.firecrawl.dev/features/map
Input a website and get all the URLs on the website - extremely fast
## Introducing /map
The easiest way to go from a single URL to a map of the entire website. This is extremely useful when you:
* Need to prompt the end-user to choose which links to scrape
* Need to quickly know the links on a website
* Need to scrape pages of a website that are related to a specific topic (use the `search` parameter)
* Only need to scrape specific pages of a website
Test mapping in the interactive playground — no code required.
## Mapping
### /map endpoint
Used to map a URL and get the URLs of the website. This returns most links present on the website.
URLs are primarily discovered from the website's sitemap, supplemented with SERP (search engine) results and previously crawled pages to improve coverage. You can control sitemap behavior with the `sitemap` parameter.
### Installation
```python Python theme={null}
# pip install firecrawl-py
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
```
```js Node theme={null}
// npm install @mendable/firecrawl-js
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
```
```bash CLI theme={null}
# Install globally with npm
npm install -g firecrawl
# Authenticate (one-time setup)
firecrawl login
```
### Usage
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
res = firecrawl.map(url="https://firecrawl.dev", limit=50, sitemap="include")
print(res)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const res = await firecrawl.map('https://firecrawl.dev', { limit: 50, sitemap: 'include' });
console.log(res);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/map \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev"
}'
```
```bash CLI theme={null}
# Map a website to discover URLs
firecrawl map https://firecrawl.dev
# Output as JSON with limit
firecrawl map https://firecrawl.dev --json --limit 100 --pretty
```
Each map request consumes 1 credit per call, regardless of the number of URLs returned. For example, setting `limit` to 100,000 still uses 1 credit.
### Response
SDKs will return the data object directly. cURL will return the payload exactly as shown below.
```json theme={null}
{
"success": true,
"links": [
{
"url": "https://docs.firecrawl.dev/features/scrape",
"title": "Scrape | Firecrawl",
"description": "Turn any url into clean data"
},
{
"url": "https://www.firecrawl.dev/blog/5_easy_ways_to_access_glm_4_5",
"title": "5 Easy Ways to Access GLM-4.5",
"description": "Discover how to access GLM-4.5 models locally, through chat applications, via the official API, and using the LLM marketplaces API for seamless integration i..."
},
{
"url": "https://www.firecrawl.dev/playground",
"title": "Playground - Firecrawl",
"description": "Preview the API response and get the code snippets for the API"
},
{
"url": "https://www.firecrawl.dev/?testId=2a7e0542-077b-4eff-bec7-0130395570d6",
"title": "Firecrawl - The Web Data API for AI",
"description": "The web crawling, scraping, and search API for AI. Built for scale. Firecrawl delivers the entire internet to AI agents and builders. Clean, structured, and ..."
},
{
"url": "https://www.firecrawl.dev/?testId=af391f07-ca0e-40d3-8ff2-b1ecf2e3fcde",
"title": "Firecrawl - The Web Data API for AI",
"description": "The web crawling, scraping, and search API for AI. Built for scale. Firecrawl delivers the entire internet to AI agents and builders. Clean, structured, and ..."
},
...
]
}
```
Title and description are not always present, as they depend on the website.
#### Map with search
The `search` parameter lets you search for specific URLs within a website.
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/map \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"search": "docs"
}'
```
The response is ordered from most relevant to least relevant.
```json theme={null}
{
"status": "success",
"links": [
{
"url": "https://docs.firecrawl.dev",
"title": "Firecrawl Docs",
"description": "Firecrawl documentation"
},
{
"url": "https://docs.firecrawl.dev/sdks/python",
"title": "Firecrawl Python SDK",
"description": "Firecrawl Python SDK documentation"
},
...
]
}
```
## Location and Language
Specify country and preferred languages to get relevant content based on your target location and language preferences, similar to the scrape endpoint.
### How it works
When you specify the location settings, Firecrawl will use an appropriate proxy if available and emulate the corresponding language and timezone settings. By default, the location is set to 'US' if not specified.
### Usage
To use the location and language settings, include the `location` object in your request body with the following properties:
* `country`: ISO 3166-1 alpha-2 country code (e.g., 'US', 'AU', 'DE', 'JP'). Defaults to 'US'.
* `languages`: An array of preferred languages and locales for the request in order of priority. Defaults to the language of the specified location.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
res = firecrawl.map('https://example.com',
location={
'country': 'US',
'languages': ['en']
}
)
print(res)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const res = await firecrawl.map('https://example.com', {
location: { country: 'US', languages: ['en'] },
});
console.log(res);
```
```bash cURL theme={null}
curl -X POST "https://api.firecrawl.dev/v2/map" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"location": { "country": "US", "languages": ["en"] }
}'
```
For more details about supported locations, refer to the [Proxies documentation](/features/proxies).
## Considerations
This endpoint prioritizes speed, so it may not capture all website links. It primarily relies on the website's sitemap, supplemented by cached crawl data and search engine results. For a more thorough and up-to-date list of URLs, consider using the [/crawl](/features/crawl) endpoint instead.
# Models
Source: https://docs.firecrawl.dev/features/models
Choose the right model for your agent extraction tasks.
Firecrawl Agent offers two models optimized for different use cases. Choose the right model based on your extraction complexity and cost requirements.
## Available Models
| Model | Cost | Accuracy | Best For |
| -------------- | --------------- | -------- | ------------------------------------- |
| `spark-1-mini` | **60% cheaper** | Standard | Most tasks (default) |
| `spark-1-pro` | Standard | Higher | Complex research, critical extraction |
**Start with Spark 1 Mini** (default) — it handles most extraction tasks well at 60% lower cost. Switch to Pro only for complex multi-domain research or when accuracy is critical.
## Spark 1 Mini (Default)
`spark-1-mini` is our efficient model, ideal for straightforward data extraction tasks.
**Use Mini when:**
* Extracting simple data points (contact info, pricing, etc.)
* Working with well-structured websites
* Cost efficiency is a priority
* Running high-volume extraction jobs
**Example use cases:**
* Extracting product prices from e-commerce sites
* Gathering contact information from company pages
* Pulling basic metadata from articles
* Simple data point lookups
## Spark 1 Pro
`spark-1-pro` is our flagship model, designed for maximum accuracy on complex extraction tasks.
**Use Pro when:**
* Performing complex competitive analysis
* Extracting data that requires deep reasoning
* Accuracy is critical for your use case
* Dealing with ambiguous or hard-to-find data
**Example use cases:**
* Multi-domain competitive analysis
* Complex research tasks requiring reasoning
* Extracting nuanced information from multiple sources
* Critical business intelligence gathering
## Specifying a Model
Pass the `model` parameter to select which model to use:
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
# Using Spark 1 Mini (default - can be omitted)
result = app.agent(
prompt="Find the pricing of Firecrawl",
model="spark-1-mini"
)
# Using Spark 1 Pro for complex tasks
result = app.agent(
prompt="Compare all enterprise features and pricing across Firecrawl, Apify, and ScrapingBee",
model="spark-1-pro"
)
print(result.data)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
// Using Spark 1 Mini (default - can be omitted)
const result = await firecrawl.agent({
prompt: "Find the pricing of Firecrawl",
model: "spark-1-mini"
});
// Using Spark 1 Pro for complex tasks
const resultPro = await firecrawl.agent({
prompt: "Compare all enterprise features and pricing across Firecrawl, Apify, and ScrapingBee",
model: "spark-1-pro"
});
console.log(result.data);
```
```bash cURL theme={null}
# Using Spark 1 Mini (default)
curl -X POST "https://api.firecrawl.dev/v2/agent" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Find the pricing of Firecrawl",
"model": "spark-1-mini"
}'
# Using Spark 1 Pro for complex tasks
curl -X POST "https://api.firecrawl.dev/v2/agent" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Compare all enterprise features and pricing across Firecrawl, Apify, and ScrapingBee",
"model": "spark-1-pro"
}'
```
## Model Comparison
| Feature | Spark 1 Mini | Spark 1 Pro |
| ---------------- | ------------ | ------------- |
| **Cost** | 60% cheaper | Standard |
| **Accuracy** | Standard | Higher |
| **Speed** | Fast | Fast |
| **Best for** | Most tasks | Complex tasks |
| **Reasoning** | Standard | Advanced |
| **Multi-domain** | Good | Excellent |
## Pricing by Model
Both models use dynamic, credit-based pricing that scales with task complexity:
* **Spark 1 Mini**: Uses approximately 60% fewer credits than Pro for equivalent tasks
* **Spark 1 Pro**: Standard credit consumption for maximum accuracy
Credit usage varies based on prompt complexity, data processed, and output structure — regardless of model selected.
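For example, a task that consumes 10 credits on Spark 1 Pro would consume roughly 4 credits on Spark 1 Mini.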
## Choosing the Right Model
```
┌─────────────────────────────────┐
│ What type of task? │
└─────────────────────────────────┘
│
┌──────────────┴──────────────┐
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Simple/Direct │ │ Complex/Research│
│ extraction │ │ multi-domain │
└─────────────────┘ └─────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ spark-1-mini │ │ spark-1-pro │
│ (60% cheaper) │ │ (higher acc.) │
└─────────────────┘ └─────────────────┘
```
## API Reference
See the [Agent API Reference](/api-reference/endpoint/agent) for complete parameter documentation.
Have questions about which model to use? Email [help@firecrawl.com](mailto:help@firecrawl.com).
# Parse
Source: https://docs.firecrawl.dev/features/parse
Upload a local or non-public document and convert it into clean, LLM-ready data
## Introducing /parse
The `/parse` endpoint converts local or non-public documents into clean, LLM-ready data. Upload file bytes via `multipart/form-data` and get back Markdown, JSON, HTML, links, images, or a summary — with reading order and tables preserved.
* Turn PDF, DOCX, XLSX, HTML, and more into Markdown or structured JSON
* Up to **5x faster** parsing via a Rust-based engine
* Files up to **50 MB** per request
* Zero Data Retention support
## When to use `/parse`
Use `/parse` when the source document is **a local file** or **not publicly accessible by URL**. If you have a public URL that points to a document, prefer [`/scrape`](/features/scrape) — it auto-detects the file type from the extension or content type and parses it the same way.
| Source | Endpoint |
| ---------------------------------------------------------------- | ------------------------------------------------ |
| Public URL to a document (e.g. `https://example.com/report.pdf`) | [`POST /scrape`](/api-reference/endpoint/scrape) |
| Local file or non-public bytes (PDF, DOCX, XLSX, HTML, …) | [`POST /parse`](/api-reference/endpoint/parse) |
## Parsing
### /parse endpoint
Used to upload a file and receive parsed content. The request is `multipart/form-data` with a required `file` part and an optional `options` JSON part.
**Supported extensions:** `.html`, `.htm`, `.pdf`, `.docx`, `.doc`, `.odt`, `.rtf`, `.xlsx`, `.xls`.
### Usage
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
doc = firecrawl.parse("./report.pdf")
print(doc.markdown)
```
```javascript Node theme={null}
import Firecrawl from "@mendable/firecrawl-js";
import fs from "node:fs";
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.parse({
data: fs.readFileSync("./report.pdf"),
filename: "report.pdf",
});
console.log(doc.markdown);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/parse \
-H 'Authorization: Bearer YOUR_API_KEY' \
-F 'file=@./report.pdf' \
-F 'options={"formats":["markdown"]};type=application/json'
```
### Response
SDKs return the document object directly. cURL returns the JSON payload.
```json theme={null}
{
"success": true,
"data": {
"markdown": "# Annual Report\n\n...",
"metadata": {
"title": "Annual Report",
"numPages": 42,
"sourceFile": "report.pdf"
}
}
}
```
## Options
`/parse` accepts a subset of scrape options under the `options` field. Common settings:
* `formats`: Array of output formats. Defaults to `["markdown"]`. Supported: `markdown`, `html`, `rawHtml`, `links`, `images`, `summary`, and `json` (with a schema or prompt).
* `onlyMainContent`: Only return the main content of the document. Defaults to `true`.
* `includeTags` / `excludeTags`: Tag-level inclusion or exclusion (HTML inputs).
* `timeout`: Request timeout in milliseconds. Defaults to `30000`, max `300000`.
* `parsers`: File-parser controls. For PDFs, set `{ "type": "pdf", "mode": "fast" | "auto" | "ocr", "maxPages": <number> }`.
`/parse` does not support browser-only options like `actions`, `waitFor`, `location`, `mobile`, or change tracking.
### PDF parser modes
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/parse \
-H 'Authorization: Bearer YOUR_API_KEY' \
-F 'file=@./scan.pdf' \
-F 'options={"parsers":[{"type":"pdf","mode":"ocr","maxPages":50}]};type=application/json'
```
* `fast`: text-only extraction, fastest path.
* `auto` (default): text-first with OCR fallback for image-only pages.
* `ocr`: OCR every page — use for scanned documents.
### Structured JSON output
Pass a JSON schema or prompt to extract structured data directly from the document:
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/parse \
-H 'Authorization: Bearer YOUR_API_KEY' \
-F 'file=@./invoice.pdf' \
-F 'options={"formats":[{"type":"json","schema":{"type":"object","properties":{"total":{"type":"number"},"vendor":{"type":"string"}}}}]};type=application/json'
```
## Considerations
* Maximum file size is **50 MB** per request.
* Parsing very large or scanned PDFs in `ocr` mode may take longer — increase `timeout` or use `maxPages` to bound the work.
* For batches of files, call `/parse` per file in parallel; there is no batch upload variant.
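A minimal sketch of parallel parsing with the Python SDK (the file paths are hypothetical):
```python Python theme={null}
# Sketch: parse several local files concurrently; there is no batch upload variant.
from concurrent.futures import ThreadPoolExecutor
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
paths = ["./report.pdf", "./invoice.pdf", "./notes.docx"]  # hypothetical files

with ThreadPoolExecutor(max_workers=4) as pool:
    docs = list(pool.map(firecrawl.parse, paths))

for path, doc in zip(paths, docs):
    print(path, "->", len(doc.markdown or ""), "markdown chars")
```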
# Proxies
Source: https://docs.firecrawl.dev/features/proxies
Learn about proxy types, locations, and how Firecrawl selects proxies for your requests.
Firecrawl provides different proxy types to help you scrape websites with varying levels of complexity. The proxy type can be specified using the `proxy` parameter.
> By default, Firecrawl routes all requests through proxies to help ensure reliability and access, even if you do not specify a proxy type or location.
## Location-Based Proxy Selection
Firecrawl automatically selects the best proxy based on your specified or detected location. This helps optimize scraping performance and reliability. However, not all locations are currently supported. The following locations are available:
| Country Code | Country Name | Basic Proxy Support | Enhanced Proxy Support |
| ------------ | -------------------- | ------------------- | ---------------------- |
| AE | United Arab Emirates | Yes | No |
| AU | Australia | Yes | No |
| BR | Brazil | Yes | No |
| CA | Canada | Yes | No |
| CN | China | Yes | No |
| CZ | Czechia | Yes | No |
| DE | Germany | Yes | No |
| DK | Denmark | Yes | Yes |
| EE | Estonia | Yes | No |
| EG | Egypt | Yes | No |
| ES | Spain | Yes | No |
| FR | France | Yes | No |
| GB | United Kingdom | Yes | No |
| GR | Greece | Yes | No |
| HU | Hungary | Yes | No |
| ID | Indonesia | Yes | No |
| IL | Israel | Yes | No |
| IN | India | Yes | No |
| IT | Italy | Yes | No |
| JP | Japan | Yes | No |
| MY | Malaysia | Yes | No |
| NO | Norway | Yes | No |
| PL | Poland | Yes | No |
| PT | Portugal | Yes | No |
| QA | Qatar | Yes | No |
| SG | Singapore | Yes | No |
| US | United States | Yes | Yes |
| VN | Vietnam | Yes | No |
The list of supported proxy locations will change over time.
If you need proxies in a location not listed above, please [contact us](mailto:help@firecrawl.com) and let us know your requirements.
If you do not specify a proxy or location, Firecrawl will automatically use US proxies.
## How to Specify Proxy Location
You can request a specific proxy location by setting the `location.country` parameter in your request. For example, to use a Brazilian proxy, set `location.country` to `BR`.
For full details, see the [API reference for `location.country`](https://docs.firecrawl.dev/api-reference/endpoint/scrape#body-location).
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
doc = firecrawl.scrape('https://example.com',
formats=['markdown'],
location={
'country': 'US',
'languages': ['en']
}
)
print(doc)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://example.com', {
formats: ['markdown'],
location: { country: 'US', languages: ['en'] },
});
console.log(doc.metadata);
```
```bash cURL theme={null}
curl -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["markdown"],
"location": { "country": "US", "languages": ["en"] }
}'
```
If you request a country where a proxy is not available, Firecrawl will use the closest available region (EU or US) and set the browser location to your requested country.
## Proxy Types
Firecrawl supports three types of proxies:
* **basic**: Proxies for scraping most sites. Fast and usually works.
* **enhanced**: Enhanced proxies for scraping complex sites while maintaining privacy. Slower, but more reliable on certain sites. [Learn more about Enhanced Mode →](/features/enhanced-mode)
* **auto**: Firecrawl will automatically retry scraping with enhanced proxies if the basic proxy fails. If the retry with enhanced is successful, 5 credits will be billed for the scrape. If the first attempt with basic is successful, only the regular cost will be billed.
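For example, a sketch selecting a proxy type on a scrape, assuming the v2 `proxy` option on the SDK (mirroring the `location` examples above):
```python Python theme={null}
# Sketch: let Firecrawl retry with enhanced proxies if the basic attempt fails.
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

doc = firecrawl.scrape(
    "https://example.com",
    formats=["markdown"],
    proxy="auto",  # "basic", "enhanced", or "auto"
)
print(doc.markdown)
```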
***
> **Note:** For detailed information on using enhanced proxies, including credit costs and retry strategies, see the [Enhanced Mode documentation](/features/enhanced-mode).
# Scrape
Source: https://docs.firecrawl.dev/features/scrape
Turn any url into clean data
Firecrawl converts web pages into markdown, ideal for LLM applications.
* It manages complexities: proxies, caching, rate limits, js-blocked content
* Handles dynamic content: dynamic websites, js-rendered sites, PDFs, images
* Outputs clean markdown, structured data, screenshots, or HTML.
For details, see the [Scrape Endpoint API Reference](https://docs.firecrawl.dev/api-reference/endpoint/scrape).
Test scraping in the interactive playground — no code required.
If a request fails, see [Errors](/api-reference/errors) for the full catalog of error codes, causes, remedies, and retry guidance.
## Scraping a URL with Firecrawl
### /scrape endpoint
Used to scrape a URL and get its content.
### Installation
```python Python theme={null}
# pip install firecrawl-py
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
```
```js Node theme={null}
// npm install @mendable/firecrawl-js
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
```
```bash CLI theme={null}
# Install globally with npm
npm install -g firecrawl
# Authenticate (one-time setup)
firecrawl login
```
### Usage
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
# Scrape a website:
doc = firecrawl.scrape("https://firecrawl.dev", formats=["markdown", "html"])
print(doc)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
// Scrape a website:
const doc = await firecrawl.scrape('https://firecrawl.dev', { formats: ['markdown', 'html'] });
console.log(doc);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://firecrawl.dev",
"formats": ["markdown", "html"]
}'
```
```bash CLI theme={null}
# Scrape a URL and get markdown
firecrawl https://firecrawl.dev
# With multiple formats (returns JSON)
firecrawl https://firecrawl.dev --format markdown,html,links --pretty
```
For more details about the parameters, refer to the [API Reference](https://docs.firecrawl.dev/api-reference/endpoint/scrape).
Each scrape consumes 1 credit. Additional credits apply for certain options: JSON mode costs 4 additional credits per page, enhanced proxy costs 4 additional credits per page, PDF parsing costs 1 credit per PDF page, and audio extraction costs 4 additional credits per page.
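For example, a single page scraped with JSON mode through an enhanced proxy costs 1 + 4 + 4 = 9 credits.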
### Response
SDKs will return the data object directly. cURL will return the payload exactly as shown below.
```json theme={null}
{
"success": true,
"data" : {
"markdown": "Launch Week I is here! [See our Day 2 Release 🚀](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)[💥 Get 2 months free...",
"html": "
```python Python theme={null}
from firecrawl import Firecrawl
from pydantic import BaseModel
app = Firecrawl(api_key="fc-YOUR-API-KEY")
class CompanyInfo(BaseModel):
company_mission: str
supports_sso: bool
is_open_source: bool
is_in_yc: bool
result = app.scrape(
'https://firecrawl.dev',
formats=[{
"type": "json",
"schema": CompanyInfo.model_json_schema()
}],
only_main_content=False,
timeout=120000
)
print(result)
```
```js Node theme={null}
import Firecrawl from "@mendable/firecrawl-js";
import { z } from "zod";
const app = new Firecrawl({
apiKey: "fc-YOUR_API_KEY"
});
// Define schema to extract contents into
const schema = z.object({
company_mission: z.string(),
supports_sso: z.boolean(),
is_open_source: z.boolean(),
is_in_yc: z.boolean()
});
const result = await app.scrape("https://firecrawl.dev", {
formats: [{
type: "json",
schema: schema
}],
});
console.log(result);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"formats": [ {
"type": "json",
"schema": {
"type": "object",
"properties": {
"company_mission": {
"type": "string"
},
"supports_sso": {
"type": "boolean"
},
"is_open_source": {
"type": "boolean"
},
"is_in_yc": {
"type": "boolean"
}
},
"required": [
"company_mission",
"supports_sso",
"is_open_source",
"is_in_yc"
]
}
} ]
}'
```
Output:
```json JSON theme={null}
{
"success": true,
"data": {
"json": {
"company_mission": "AI-powered web scraping and data extraction",
"supports_sso": true,
"is_open_source": true,
"is_in_yc": true
},
"metadata": {
"title": "Firecrawl",
"description": "AI-powered web scraping and data extraction",
"robots": "follow, index",
"ogTitle": "Firecrawl",
"ogDescription": "AI-powered web scraping and data extraction",
"ogUrl": "https://firecrawl.dev/",
"ogImage": "https://firecrawl.dev/og.png",
"ogLocaleAlternate": [],
"ogSiteName": "Firecrawl",
"sourceURL": "https://firecrawl.dev/"
},
}
}
```
### Extracting without schema
You can also extract without a schema by passing only a `prompt` to the endpoint. The LLM chooses the structure of the data.
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR-API-KEY")
result = app.scrape(
'https://firecrawl.dev',
formats=[{
"type": "json",
"prompt": "Extract the company mission from the page."
}],
only_main_content=False,
timeout=120000
)
print(result)
```
```js Node theme={null}
import Firecrawl from "@mendable/firecrawl-js";
const app = new Firecrawl({
apiKey: "fc-YOUR_API_KEY"
});
const result = await app.scrape("https://firecrawl.dev", {
formats: [{
type: "json",
prompt: "Extract the company mission from the page."
}]
});
console.log(result);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"formats": [{
"type": "json",
"prompt": "Extract the company mission from the page."
}]
}'
```
Output:
```json JSON theme={null}
{
"success": true,
"data": {
"json": {
"company_mission": "AI-powered web scraping and data extraction",
},
"metadata": {
"title": "Firecrawl",
"description": "AI-powered web scraping and data extraction",
"robots": "follow, index",
"ogTitle": "Firecrawl",
"ogDescription": "AI-powered web scraping and data extraction",
"ogUrl": "https://firecrawl.dev/",
"ogImage": "https://firecrawl.dev/og.png",
"ogLocaleAlternate": [],
"ogSiteName": "Firecrawl",
"sourceURL": "https://firecrawl.dev/"
},
}
}
```
### JSON format options
When using the `json` format, pass an object inside `formats` with the following parameters:
* `schema`: JSON Schema for the structured output.
* `prompt`: Optional prompt to help guide extraction when a schema is present or when you prefer light guidance.
## Extract brand identity
### /scrape (with branding) endpoint
The branding format extracts comprehensive brand identity information from a webpage, including colors, fonts, typography, spacing, UI components, and more. This is useful for design system analysis, brand monitoring, or building tools that need to understand a website's visual identity.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')
result = firecrawl.scrape(
url='https://firecrawl.dev',
formats=['branding']
)
print(result['branding'])
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const result = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['branding']
});
console.log(result.branding);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://firecrawl.dev",
"formats": ["branding"]
}'
```
### Response
The branding format returns a comprehensive `BrandingProfile` object with the following structure:
```json Output theme={null}
{
"success": true,
"data": {
"branding": {
"colorScheme": "dark",
"logo": "https://firecrawl.dev/logo.svg",
"colors": {
"primary": "#FF6B35",
"secondary": "#004E89",
"accent": "#F77F00",
"background": "#1A1A1A",
"textPrimary": "#FFFFFF",
"textSecondary": "#B0B0B0"
},
"fonts": [
{
"family": "Inter"
},
{
"family": "Roboto Mono"
}
],
"typography": {
"fontFamilies": {
"primary": "Inter",
"heading": "Inter",
"code": "Roboto Mono"
},
"fontSizes": {
"h1": "48px",
"h2": "36px",
"h3": "24px",
"body": "16px"
},
"fontWeights": {
"regular": 400,
"medium": 500,
"bold": 700
}
},
"spacing": {
"baseUnit": 8,
"borderRadius": "8px"
},
"components": {
"buttonPrimary": {
"background": "#FF6B35",
"textColor": "#FFFFFF",
"borderRadius": "8px"
},
"buttonSecondary": {
"background": "transparent",
"textColor": "#FF6B35",
"borderColor": "#FF6B35",
"borderRadius": "8px"
}
},
"images": {
"logo": "https://firecrawl.dev/logo.svg",
"favicon": "https://firecrawl.dev/favicon.ico",
"ogImage": "https://firecrawl.dev/og-image.png"
}
}
}
}
```
### Branding Profile Structure
The `branding` object contains the following properties:
* `colorScheme`: The detected color scheme (`"light"` or `"dark"`)
* `logo`: URL of the primary logo
* `colors`: Object containing brand colors:
* `primary`, `secondary`, `accent`: Main brand colors
* `background`, `textPrimary`, `textSecondary`: UI colors
* `link`, `success`, `warning`, `error`: Semantic colors
* `fonts`: Array of font families used on the page
* `typography`: Detailed typography information:
* `fontFamilies`: Primary, heading, and code font families
* `fontSizes`: Size definitions for headings and body text
* `fontWeights`: Weight definitions (light, regular, medium, bold)
* `lineHeights`: Line height values for different text types
* `spacing`: Spacing and layout information:
* `baseUnit`: Base spacing unit in pixels
* `borderRadius`: Default border radius
* `padding`, `margins`: Spacing values
* `components`: UI component styles:
* `buttonPrimary`, `buttonSecondary`: Button styles
* `input`: Input field styles
* `icons`: Icon style information
* `images`: Brand images (logo, favicon, og:image)
* `animations`: Animation and transition settings
* `layout`: Layout configuration (grid, header/footer heights)
* `personality`: Brand personality traits (tone, energy, target audience)
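As a rough sketch, assuming dict-style access matching the sample response above, you can read individual design tokens like this:
```python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')

result = firecrawl.scrape(url='https://firecrawl.dev', formats=['branding'])
branding = result['branding']

# Keys mirror the sample response; any given site may omit some of them
print(branding['colors']['primary'])            # e.g. "#FF6B35"
print(branding['typography']['fontFamilies'])   # primary/heading/code fonts
print(branding['components']['buttonPrimary'])  # primary button style
```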
### Combining with other formats
You can combine the branding format with other formats to get comprehensive page data:
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')
result = firecrawl.scrape(
url='https://firecrawl.dev',
formats=['markdown', 'branding', 'screenshot']
)
print(result['markdown'])
print(result['branding'])
print(result['screenshot'])
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const result = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['markdown', 'branding', 'screenshot']
});
console.log(result.markdown);
console.log(result.branding);
console.log(result.screenshot);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://firecrawl.dev",
"formats": ["markdown", "branding", "screenshot"]
}'
```
## Audio extraction
The `audio` format extracts audio from supported websites (e.g. YouTube) as MP3 files and returns a signed Google Cloud Storage URL. This is useful for building audio processing pipelines, transcription services, or podcast tools.
Audio extraction costs 5 credits per page (1 base + 4 additional).
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
doc = firecrawl.scrape("https://www.youtube.com/watch?v=dQw4w9WgXcQ", formats=["audio"])
print(doc.audio) # Signed GCS URL to the MP3 file
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://www.youtube.com/watch?v=dQw4w9WgXcQ', {
formats: ['audio']
});
console.log(doc.audio); // Signed GCS URL to the MP3 file
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"formats": ["audio"]
}'
```
## Interacting with the page with Actions
Firecrawl allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction.
**We recommend [Interact](/features/interact) over actions: our newer, more powerful way to interact with scraped pages.**
Interact runs as a stateful browser session that stays alive across calls, so you can drive a page turn-by-turn with either:
* **Natural language** for flexible, non-deterministic flows. e.g. *“search for ‘wireless headphones’, filter to 4+ stars under \$200, and return the results”*.
* **Playwright or agent-browser code** for deterministic steps. e.g. `await page.click('#export')`.
Interact also supports profiles, persistent sessions, and a live embeddable browser view (with an interactive mode where end users can drive the browser themselves).
Here is an example of using actions to fill in a login form, submit it, wait for the page to load, and take a screenshot.
In almost all cases, add a `wait` action before or after other actions to give the page enough time to load.
### Example
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
doc = firecrawl.scrape(
url="https://example.com/login",
formats=["markdown"],
actions=[
{"type": "write", "text": "john@example.com"},
{"type": "press", "key": "Tab"},
{"type": "write", "text": "secret"},
{"type": "click", "selector": 'button[type="submit"]'},
{"type": "wait", "milliseconds": 1500},
{"type": "screenshot", "full_page": True},
],
)
print(doc.markdown, doc.screenshot)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://example.com/login', {
formats: ['markdown'],
actions: [
{ type: 'write', text: 'john@example.com' },
{ type: 'press', key: 'Tab' },
{ type: 'write', text: 'secret' },
{ type: 'click', selector: 'button[type="submit"]' },
{ type: 'wait', milliseconds: 1500 },
{ type: 'screenshot', fullPage: true },
],
});
console.log(doc.markdown, doc.screenshot);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://example.com/login",
"formats": ["markdown"],
"actions": [
{ "type": "write", "text": "john@example.com" },
{ "type": "press", "key": "Tab" },
{ "type": "write", "text": "secret" },
{ "type": "click", "selector": "button[type=\"submit\"]" },
{ "type": "wait", "milliseconds": 1500 },
{ "type": "screenshot", "fullPage": true },
],
}'
```
### Output
```json JSON theme={null}
{
"success": true,
"data": {
"markdown": "Our first Launch Week is over! [See the recap 🚀](blog/firecrawl-launch-week-1-recap)...",
"actions": {
"screenshots": [
"https://alttmdsdujxrfnakrkyi.supabase.co/storage/v1/object/public/media/screenshot-75ef2d87-31e0-4349-a478-fb432a29e241.png"
],
"scrapes": [
{
"url": "https://www.firecrawl.dev/",
"html": "
Firecrawl
"
}
]
},
"metadata": {
"title": "Home - Firecrawl",
"description": "Firecrawl crawls and converts any website into clean markdown.",
"language": "en",
"keywords": "Firecrawl,Markdown,Data,Mendable,Langchain",
"robots": "follow, index",
"ogTitle": "Firecrawl",
"ogDescription": "Turn any website into LLM-ready data.",
"ogUrl": "https://www.firecrawl.dev/",
"ogImage": "https://www.firecrawl.dev/og.png?123",
"ogLocaleAlternate": [],
"ogSiteName": "Firecrawl",
"sourceURL": "http://google.com",
"statusCode": 200
}
}
}
```
For workflows that require richer browser control after scraping, such as authenticated sessions, multi-step navigation, or a live view of the page, we recommend [Interact](/features/interact) over extending the actions array.
## Location and Language
Specify country and preferred languages to get relevant content based on your target location and language preferences.
### How it works
When you specify location settings, Firecrawl uses an appropriate proxy if available and emulates the corresponding language and timezone. If not specified, the location defaults to 'US'.
### Usage
To use the location and language settings, include the `location` object in your request body with the following properties:
* `country`: ISO 3166-1 alpha-2 country code (e.g., 'US', 'AU', 'DE', 'JP'). Defaults to 'US'.
* `languages`: An array of preferred languages and locales for the request in order of priority. Defaults to the language of the specified location.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
doc = firecrawl.scrape('https://example.com',
formats=['markdown'],
location={
'country': 'US',
'languages': ['en']
}
)
print(doc)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://example.com', {
formats: ['markdown'],
location: { country: 'US', languages: ['en'] },
});
console.log(doc.metadata);
```
```bash cURL theme={null}
curl -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["markdown"],
"location": { "country": "US", "languages": ["en"] }
}'
```
For more details about supported locations, refer to the [Proxies documentation](/features/proxies).
## Caching and maxAge
To make requests faster, Firecrawl serves results from cache by default when a recent copy is available.
* **Default freshness window**: `maxAge = 172800000` ms (2 days). If a cached page is newer than this, it’s returned instantly; otherwise, the page is scraped and then cached.
* **Performance**: This can speed up scrapes by up to 5x when data doesn’t need to be ultra-fresh.
* **Always fetch fresh**: Set `maxAge` to `0`. Note that this bypasses the cache entirely, so every request goes through the full scraping pipeline, meaning that the request will take longer to complete and is more likely to fail. Use a non-zero `maxAge` if freshness on every request is not critical.
* **Avoid storing**: Set `storeInCache` to `false` if you don’t want Firecrawl to cache/store results for this request.
* **Cache-only lookup**: Set `minAge` to perform a cache-only lookup without triggering a fresh scrape. The value is in milliseconds and specifies the minimum age the cached data must be. If no cached data is found, a `404` with error code `SCRAPE_NO_CACHED_DATA` is returned. Set `minAge` to `1` to accept any cached data regardless of age (see the cache-only example below).
* **Change tracking**: Requests that include `changeTracking` bypass the cache, so `maxAge` is ignored.
* **Credits**: Cached results still cost 1 credit per page. Caching improves speed, not credit usage.
Example (force fresh content):
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')
doc = firecrawl.scrape(url='https://example.com', max_age=0, formats=['markdown'])
print(doc)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://example.com', { maxAge: 0, formats: ['markdown'] });
console.log(doc);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"maxAge": 0,
"formats": ["markdown"]
}'
```
Example (use a 10-minute cache window):
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')
doc = firecrawl.scrape(url='https://example.com', max_age=600000, formats=['markdown', 'html'])
print(doc)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://example.com', { maxAge: 600000, formats: ['markdown', 'html'] });
console.log(doc);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"maxAge": 600000,
"formats": ["markdown", "html"]
}'
```
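Example (cache-only lookup; the Python kwarg name `min_age` is an assumption here, mirroring `max_age`):
```python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')

# Accept any cached copy, however old, but never trigger a fresh scrape.
# If nothing is cached, the API returns a 404 with code SCRAPE_NO_CACHED_DATA.
doc = firecrawl.scrape(url='https://example.com', min_age=1, formats=['markdown'])
print(doc)
```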
## Batch scraping multiple URLs
You can batch scrape multiple URLs at the same time. The endpoint takes a list of URLs and optional parameters; the parameters let you specify additional options for the batch scrape job, such as the output formats.
### How it works
It is very similar to how the `/crawl` endpoint works. It submits a batch scrape job and returns a job ID to check the status of the batch scrape.
The SDK provides two methods, synchronous and asynchronous. The synchronous method returns the results of the batch scrape job, while the asynchronous method returns a job ID that you can use to check the status of the batch scrape.
### Usage
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
job = firecrawl.batch_scrape([
"https://firecrawl.dev",
"https://docs.firecrawl.dev",
], formats=["markdown"], poll_interval=2, wait_timeout=120)
print(job)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const job = await firecrawl.batchScrape([
'https://firecrawl.dev',
'https://docs.firecrawl.dev',
], { options: { formats: ['markdown'] }, pollInterval: 2, timeout: 120 });
console.log(job);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/batch/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://firecrawl.dev", "https://docs.firecrawl.dev"],
"formats": ["markdown"]
}'
```
### Response
The synchronous SDK methods return the results of the batch scrape job directly. Otherwise, you receive a job ID that you can use to check the status of the batch scrape.
#### Synchronous
```json Completed theme={null}
{
"status": "completed",
"total": 36,
"completed": 36,
"creditsUsed": 36,
"expiresAt": "2024-00-00T00:00:00.000Z",
"next": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789?skip=26",
"data": [
{
"markdown": "[Firecrawl Docs home page!...",
"html": "...",
"metadata": {
"title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
"language": "en",
"sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
"description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
"ogLocaleAlternate": [],
"statusCode": 200
}
},
...
]
}
```
#### Asynchronous
You can then use the job ID to check the status of the batch scrape by calling the `/batch/scrape/{id}` endpoint. This endpoint is meant to be used while the job is still running or right after it has completed **as batch scrape jobs expire after 24 hours**.
```json theme={null}
{
"success": true,
"id": "123-456-789",
"url": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789"
}
```
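If you prefer to hit the status endpoint directly rather than through the SDK, here is a minimal sketch with `requests` (the job ID is the placeholder from the response above; the printed fields are the ones shown in the completed response):
```python theme={null}
import requests

job_id = "123-456-789"  # from the batch scrape response
resp = requests.get(
    f"https://api.firecrawl.dev/v2/batch/scrape/{job_id}",
    headers={"Authorization": "Bearer fc-YOUR-API-KEY"},
)
status = resp.json()
print(status["status"], status.get("completed"), "/", status.get("total"))
```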
## Enhanced Mode
For complex websites, Firecrawl offers enhanced mode that provides better success rates while maintaining privacy.
Learn more about [Enhanced Mode](/features/enhanced-mode).
## Zero Data Retention (ZDR)
Firecrawl supports Zero Data Retention (ZDR) for teams with strict data handling requirements. When enabled, Firecrawl will not persist any page content or extracted data beyond the lifetime of the request.
To enable ZDR, set `zeroDataRetention: true` in your request:
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"url": "https://example.com",
"formats": ["markdown"],
"zeroDataRetention": true
}'
```
ZDR is available on Enterprise plans and must be enabled for your team. Visit [firecrawl.dev/enterprise](https://www.firecrawl.dev/enterprise) to get started.
ZDR adds **1 additional credit per page** on top of the base scrape cost.
Screenshots are not available in ZDR mode. Because screenshots require uploading to persistent storage, they are incompatible with the ZDR guarantee. Requests that include both `zeroDataRetention: true` and a `screenshot` format will return an error.
> Are you an AI agent that needs a Firecrawl API key? See [firecrawl.dev/agent-onboarding/SKILL.md](https://www.firecrawl.dev/agent-onboarding/SKILL.md) for automated onboarding instructions.
# Search
Source: https://docs.firecrawl.dev/features/search
Search the web and get full content from results
Search the web and get clean, structured content from every result in a single API call. Pass a query to `/search` and Firecrawl returns titles, descriptions, and URLs. Add `scrapeOptions` to also retrieve full-page markdown, HTML, links, or screenshots for each result.
For the full parameter list, see the [Search Endpoint API Reference](https://docs.firecrawl.dev/api-reference/endpoint/search).
## Performing a Search with Firecrawl
### /search endpoint
Used to perform web searches and optionally retrieve content from the results.
### Installation
```python Python theme={null}
# pip install firecrawl-py
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
```
```js Node theme={null}
// npm install @mendable/firecrawl-js
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
```
```bash CLI theme={null}
# Install globally with npm
npm install -g firecrawl
# Authenticate (one-time setup)
firecrawl login
```
### Basic Usage
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
results = firecrawl.search(
query="firecrawl",
limit=3,
)
print(results)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const results = await firecrawl.search('firecrawl', {
limit: 3,
scrapeOptions: { formats: ['markdown'] }
});
console.log(results);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/search" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "firecrawl",
"limit": 3
}'
```
```bash CLI theme={null}
# Search the web
firecrawl search "firecrawl web scraping" --limit 5 --pretty
```
### Response
SDKs will return the data object directly. cURL will return the complete payload.
```json JSON theme={null}
{
"success": true,
"data": {
"web": [
{
"url": "https://www.firecrawl.dev/",
"title": "Firecrawl - The Web Data API for AI",
"description": "The web crawling, scraping, and search API for AI. Built for scale. Firecrawl delivers the entire internet to AI agents and builders.",
"position": 1
},
{
"url": "https://github.com/firecrawl/firecrawl",
"title": "mendableai/firecrawl: Turn entire websites into LLM-ready ... - GitHub",
"description": "Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data.",
"position": 2
},
...
],
"images": [
{
"title": "Quickstart | Firecrawl",
"imageUrl": "https://mintlify.s3.us-west-1.amazonaws.com/firecrawl/logo/logo.png",
"imageWidth": 5814,
"imageHeight": 1200,
"url": "https://docs.firecrawl.dev/",
"position": 1
},
...
],
"news": [
{
"title": "Y Combinator startup Firecrawl is ready to pay $1M to hire three AI agents as employees",
"url": "https://techcrunch.com/2025/05/17/y-combinator-startup-firecrawl-is-ready-to-pay-1m-to-hire-three-ai-agents-as-employees/",
"snippet": "It's now placed three new ads on YC's job board for “AI agents only” and has set aside a $1 million budget total to make it happen.",
"date": "3 months ago",
"position": 1
},
...
]
}
}
```
## Search result types
In addition to regular web results, Search supports specialized result types via the `sources` parameter:
* `web`: standard web results (default)
* `news`: news-focused results
* `images`: image search results
You can request multiple sources in a single call (e.g., `sources: ["web", "news"]`). When you do, the `limit` parameter applies **per source type** — so `limit: 5` with `sources: ["web", "news"]` returns up to 5 web results and up to 5 news results (10 total). If you need different parameters per source (for example, different `limit` values or different `scrapeOptions`), make separate calls instead.
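For example, requesting web and news together in the Python SDK (assuming the `sources` kwarg passes straight through to the API parameter):
```python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# limit applies per source: up to 5 web results and up to 5 news results
results = firecrawl.search(
    query="openai",
    sources=["web", "news"],
    limit=5,
)
print(len(results.get("web", [])), len(results.get("news", [])))
```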
## Search Categories
Filter search results by specific categories using the `categories` parameter:
* `github`: Search within GitHub repositories, code, issues, and documentation
* `research`: Search academic and research websites (arXiv, Nature, IEEE, PubMed, etc.)
* `pdf`: Search for PDFs
### GitHub Category Search
Search specifically within GitHub repositories:
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "web scraping python",
"categories": ["github"],
"limit": 10
}'
```
### Research Category Search
Search academic and research websites:
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "machine learning transformers",
"categories": ["research"],
"limit": 10
}'
```
### Mixed Category Search
Combine multiple categories in one search:
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "neural networks",
"categories": ["github", "research"],
"limit": 15
}'
```
### Category Response Format
Each search result includes a `category` field indicating its source:
```json theme={null}
{
"success": true,
"data": {
"web": [
{
"url": "https://github.com/example/neural-network",
"title": "Neural Network Implementation",
"description": "A PyTorch implementation of neural networks",
"category": "github"
},
{
"url": "https://arxiv.org/abs/2024.12345",
"title": "Advances in Neural Network Architecture",
"description": "Research paper on neural network improvements",
"category": "research"
}
]
}
}
```
Example requests for the `sources` parameter described above:
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "openai",
"sources": ["news"],
"limit": 5
}'
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "jupiter",
"sources": ["images"],
"limit": 8
}'
```
### HD Image Search with Size Filtering
Use image search operators to find high-resolution images:
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "sunset imagesize:1920x1080",
"sources": ["images"],
"limit": 5
}'
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "mountain wallpaper larger:2560x1440",
"sources": ["images"],
"limit": 8
}'
```
**Common HD resolutions:**
* `imagesize:1920x1080` - Full HD (1080p)
* `imagesize:2560x1440` - QHD (1440p)
* `imagesize:3840x2160` - 4K UHD
* `larger:1920x1080` - HD and above
* `larger:2560x1440` - QHD and above
## Search with Content Scraping
Search and retrieve content from the search results in one operation.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")
# Search and scrape content
results = firecrawl.search(
"firecrawl web scraping",
limit=3,
scrape_options={
"formats": ["markdown", "links"]
}
)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const results = await firecrawl.search('firecrawl', {
limit: 3,
scrapeOptions: { formats: ['markdown'] }
});
console.log(results);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "firecrawl web scraping",
"limit": 3,
"scrapeOptions": {
"formats": ["markdown", "links"]
}
}'
```
```bash CLI theme={null}
# Search and scrape results
firecrawl search "firecrawl" --scrape --scrape-formats markdown --limit 5 --pretty
```
Every scrape endpoint option is supported by the search endpoint through the `scrapeOptions` parameter.
### Response with Scraped Content
```json theme={null}
{
"success": true,
"data": [
{
"title": "Firecrawl - The Ultimate Web Scraping API",
"description": "Firecrawl is a powerful web scraping API that turns any website into clean, structured data for AI and analysis.",
"url": "https://firecrawl.dev/",
"markdown": "# Firecrawl\n\nThe Ultimate Web Scraping API\n\n## Turn any website into clean, structured data\n\nFirecrawl makes it easy to extract data from websites for AI applications, market research, content aggregation, and more...",
"links": [
"https://firecrawl.dev/pricing",
"https://firecrawl.dev/docs",
"https://firecrawl.dev/guides"
],
"metadata": {
"title": "Firecrawl - The Ultimate Web Scraping API",
"description": "Firecrawl is a powerful web scraping API that turns any website into clean, structured data for AI and analysis.",
"sourceURL": "https://firecrawl.dev/",
"statusCode": 200
}
}
]
}
```
## Advanced Search Options
Firecrawl's search API supports various parameters to customize your search:
### Location Customization
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")
# Search with location settings (Germany)
search_result = firecrawl.search(
"web scraping tools",
limit=5,
location="Germany"
)
# Process the results
for result in search_result.data:
print(f"Title: {result['title']}")
print(f"URL: {result['url']}")
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
// Search with location settings (Germany)
const results = await firecrawl.search('web scraping tools', {
limit: 5,
location: "Germany"
});
// Process the results
console.log(results);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "web scraping tools",
"limit": 5,
"location": "Germany"
}'
```
```bash CLI theme={null}
# Search with location
firecrawl search "local restaurants" --location "San Francisco,California,United States" --country US --pretty
```
### Time-Based Search
Use the `tbs` parameter to filter results by time. Note that `tbs` only applies to `web` source results — it does not filter `news` or `images` results. If you need time-filtered news, consider using a `web` source with the `site:` operator to target specific news domains.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
results = firecrawl.search(
query="firecrawl",
limit=5,
tbs="qdr:d",
)
print(len(results.get('web', [])))
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const results = await firecrawl.search('firecrawl', {
limit: 5,
tbs: 'qdr:d', // past day
});
console.log(results.web);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "latest web scraping techniques",
"limit": 5,
"tbs": "qdr:w"
}'
```
```bash CLI theme={null}
# Search with time filter (past week)
firecrawl search "firecrawl updates" --tbs qdr:w --limit 5 --pretty
```
Common `tbs` values:
* `qdr:h` - Past hour
* `qdr:d` - Past 24 hours
* `qdr:w` - Past week
* `qdr:m` - Past month
* `qdr:y` - Past year
* `sbd:1` - Sort by date (newest first)
For more precise time filtering, you can specify exact date ranges using the custom date range format:
```python Python theme={null}
from firecrawl import Firecrawl
# Initialize the client with your API key
firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")
# Search for results from December 2024
search_result = firecrawl.search(
"firecrawl updates",
limit=10,
tbs="cdr:1,cd_min:12/1/2024,cd_max:12/31/2024"
)
```
```js JavaScript theme={null}
import Firecrawl from '@mendable/firecrawl-js';
// Initialize the client with your API key
const firecrawl = new Firecrawl({apiKey: "fc-YOUR_API_KEY"});
// Search for results from December 2024
firecrawl.search("firecrawl updates", {
limit: 10,
tbs: "cdr:1,cd_min:12/1/2024,cd_max:12/31/2024"
})
.then(searchResult => {
console.log(searchResult.data);
});
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "firecrawl updates",
"limit": 10,
"tbs": "cdr:1,cd_min:12/1/2024,cd_max:12/31/2024"
}'
```
You can combine `sbd:1` with time filters to get date-sorted results within a time range. For example, `sbd:1,qdr:w` returns results from the past week sorted newest first, and `sbd:1,cdr:1,cd_min:12/1/2024,cd_max:12/31/2024` returns results from December 2024 sorted by date.
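For example, the past-week, date-sorted variant in the Python SDK:
```python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# Past-week results, sorted newest first
results = firecrawl.search(
    query="firecrawl",
    limit=5,
    tbs="sbd:1,qdr:w",
)
print(results)
```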
### Custom Timeout
Set a custom timeout for search operations:
```python Python theme={null}
from firecrawl import Firecrawl
# Initialize the client with your API key
firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")
# Set a 30-second timeout
search_result = firecrawl.search(
"complex search query",
limit=10,
timeout=30000 # 30 seconds in milliseconds
)
```
```js JavaScript theme={null}
import Firecrawl from '@mendable/firecrawl-js';
// Initialize the client with your API key
const firecrawl = new Firecrawl({apiKey: "fc-YOUR_API_KEY"});
// Set a 30-second timeout
firecrawl.search("complex search query", {
limit: 10,
timeout: 30000 // 30 seconds in milliseconds
})
.then(searchResult => {
// Process results
console.log(searchResult.data);
});
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "complex search query",
"limit": 10,
"timeout": 30000
}'
```
## Zero Data Retention (ZDR)
For teams with strict data handling requirements, Firecrawl offers Zero Data Retention (ZDR) options for the `/search` endpoint via the `enterprise` parameter. ZDR search is available on Enterprise plans — visit [firecrawl.dev/enterprise](https://www.firecrawl.dev/enterprise) to get started.
This is separate from the `zeroDataRetention` scrape option, which controls ZDR for scraping operations. See [Scrape ZDR](/features/scrape#zero-data-retention-zdr) for details. The `enterprise` parameter only applies to the search portion of the request.
### End-to-End ZDR
With end-to-end ZDR, both Firecrawl and our upstream search provider enforce zero data retention. No query or result data is stored at any point in the pipeline.
* **Cost:** 10 credits per 10 results
* **Parameter:** `enterprise: ["zdr"]`
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "sensitive topic",
"limit": 10,
"enterprise": ["zdr"]
}'
```
### Anonymized ZDR
With anonymized ZDR, Firecrawl enforces full zero data retention on our side. Our search provider may cache the query, but it is fully anonymized — no identifying information is attached.
* **Cost:** 2 credits per 10 results
* **Parameter:** `enterprise: ["anon"]`
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "sensitive topic",
"limit": 10,
"enterprise": ["anon"]
}'
```
### Combining Search ZDR with Scrape ZDR
If you are using search with content scraping (`scrapeOptions`), the `enterprise` parameter covers the search portion while `zeroDataRetention` in `scrapeOptions` covers the scraping portion. To get full ZDR across both, set both:
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/search \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"query": "sensitive topic",
"limit": 5,
"enterprise": ["zdr"],
"scrapeOptions": {
"formats": ["markdown"],
"zeroDataRetention": true
}
}'
```
## Cost Implications
The cost of a search is 2 credits per 10 results, rounded up (1–10 results = 2 credits, 11–20 = 4 credits, and so on). If scraping options are enabled, the standard scraping costs apply to each search result:
* **Basic scrape**: 1 credit per webpage
* **PDF parsing**: 1 credit per PDF page
* **Enhanced proxy mode**: 4 additional credits per webpage
* **JSON mode**: 4 additional credits per webpage
To help control costs:
* Set `parsers: []` if PDF parsing isn’t required
* Use `proxy: "basic"` instead of `"enhanced"` when possible, or set it to `"auto"`
* Limit the number of search results with the `limit` parameter
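Putting those together, a cost-conscious search-and-scrape call might look like this (a sketch; `parsers` and `proxy` are assumed to pass through `scrape_options` like any other scrape option):
```python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

results = firecrawl.search(
    "firecrawl web scraping",
    limit=3,                      # fewer results, fewer scrape credits
    scrape_options={
        "formats": ["markdown"],
        "parsers": [],            # skip PDF parsing
        "proxy": "auto",          # avoid forcing enhanced proxy mode
    },
)
print(results)
```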
## Advanced Scraping Options
For more details about the scraping options, refer to the [Scrape Feature documentation](https://docs.firecrawl.dev/features/scrape). Everything except the FIRE-1 Agent and change-tracking features is supported by the Search endpoint.
# Introduction
Source: https://docs.firecrawl.dev/introduction
Search the web, scrape any page, and interact with it — all through one API.
**For AI agents:** Append `.md` to any docs URL for markdown, e.g. [introduction.md](/introduction.md).
## Get started
* Sign up and get your API key to start using Firecrawl
* Test the API instantly without writing any code
### Use Firecrawl with AI agents (recommended)
The Firecrawl skill is the fastest way for agents to discover and use Firecrawl. Without it, your agent will not know Firecrawl is available.
```bash theme={null}
npx -y firecrawl-cli@latest init --all --browser
```
Restart your agent after installing the skill. See [Skill + CLI](/sdks/cli) for the full setup.
Or use the [MCP Server](/mcp-server) to connect Firecrawl directly to Claude, Cursor, Windsurf, VS Code, and other AI tools.
***
## What can Firecrawl do?
* **Search**: search the web and get full page content from results
* **Scrape**: extract content from any URL as markdown, HTML, or structured JSON
* **Interact**: continue working with any scraped page, including clicking, filling forms, and extracting dynamic content
### Why Firecrawl?
* **LLM-ready output**: Clean markdown, structured JSON, screenshots, and more.
* **Handles the hard stuff**: Proxies, anti-bot, JavaScript rendering, and dynamic content.
* **Reliable**: Built for production with high uptime and consistent results.
* **Fast**: Results in seconds, optimized for high throughput.
* **MCP Server**: Connect Firecrawl to any AI tool via the [Model Context Protocol](/mcp-server).
***
## Search
Search the web and get full page content from results in one call. See the [Search feature docs](/features/search) for all options.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
results = firecrawl.search(
query="firecrawl",
limit=3,
)
print(results)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const results = await firecrawl.search('firecrawl', {
limit: 3,
scrapeOptions: { formats: ['markdown'] }
});
console.log(results);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/search" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "firecrawl",
"limit": 3
}'
```
```bash CLI theme={null}
# Search the web
firecrawl search "firecrawl web scraping" --limit 5 --pretty
```
SDKs will return the data object directly. cURL will return the complete payload.
```json JSON theme={null}
{
"success": true,
"data": {
"web": [
{
"url": "https://www.firecrawl.dev/",
"title": "Firecrawl - The Web Data API for AI",
"description": "The web crawling, scraping, and search API for AI. Built for scale. Firecrawl delivers the entire internet to AI agents and builders.",
"position": 1
},
{
"url": "https://github.com/firecrawl/firecrawl",
"title": "mendableai/firecrawl: Turn entire websites into LLM-ready ... - GitHub",
"description": "Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data.",
"position": 2
},
...
],
"images": [
{
"title": "Quickstart | Firecrawl",
"imageUrl": "https://mintlify.s3.us-west-1.amazonaws.com/firecrawl/logo/logo.png",
"imageWidth": 5814,
"imageHeight": 1200,
"url": "https://docs.firecrawl.dev/",
"position": 1
},
...
],
"news": [
{
"title": "Y Combinator startup Firecrawl is ready to pay $1M to hire three AI agents as employees",
"url": "https://techcrunch.com/2025/05/17/y-combinator-startup-firecrawl-is-ready-to-pay-1m-to-hire-three-ai-agents-as-employees/",
"snippet": "It's now placed three new ads on YC's job board for “AI agents only” and has set aside a $1 million budget total to make it happen.",
"date": "3 months ago",
"position": 1
},
...
]
}
}
```
## Scrape
Scrape any URL and get its content in markdown, HTML, or other formats. See the [Scrape feature docs](/features/scrape) for all options.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
# Scrape a website:
doc = firecrawl.scrape("https://firecrawl.dev", formats=["markdown", "html"])
print(doc)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
// Scrape a website:
const doc = await firecrawl.scrape('https://firecrawl.dev', { formats: ['markdown', 'html'] });
console.log(doc);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://firecrawl.dev",
"formats": ["markdown", "html"]
}'
```
```bash CLI theme={null}
# Scrape a URL and get markdown
firecrawl https://firecrawl.dev
# With multiple formats (returns JSON)
firecrawl https://firecrawl.dev --format markdown,html,links --pretty
```
SDKs will return the data object directly. cURL will return the payload exactly as shown below.
```json theme={null}
{
"success": true,
"data" : {
"markdown": "Launch Week I is here! [See our Day 2 Release 🚀](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)[💥 Get 2 months free...",
"html": "
## Interact
Scrape a page, then keep working with it — click buttons, fill forms, extract dynamic content, or navigate deeper. Describe what you want in plain English or write code for full control. See the [Interact feature docs](/features/interact) for all options.
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR-API-KEY")
# 1. Scrape Amazon's homepage
result = app.scrape("https://www.amazon.com", formats=["markdown"])
scrape_id = result.metadata.scrape_id
# 2. Interact — search for a product and get its price
app.interact(scrape_id, prompt="Search for iPhone 16 Pro Max")
response = app.interact(scrape_id, prompt="Click on the first result and tell me the price")
print(response.output)
# 3. Stop the session
app.stop_interaction(scrape_id)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const app = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });
// 1. Scrape Amazon's homepage
const result = await app.scrape('https://www.amazon.com', { formats: ['markdown'] });
const scrapeId = result.metadata?.scrapeId;
// 2. Interact — search for a product and get its price
await app.interact(scrapeId, { prompt: 'Search for iPhone 16 Pro Max' });
const response = await app.interact(scrapeId, { prompt: 'Click on the first result and tell me the price' });
console.log(response.output);
// 3. Stop the session
await app.stopInteraction(scrapeId);
```
```bash cURL theme={null}
# 1. Scrape Amazon's homepage
RESPONSE=$(curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://www.amazon.com", "formats": ["markdown"]}')
SCRAPE_ID=$(echo $RESPONSE | jq -r '.data.metadata.scrapeId')
# 2. Interact — search for a product and get its price
curl -s -X POST "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"prompt": "Search for iPhone 16 Pro Max"}'
curl -s -X POST "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"prompt": "Click on the first result and tell me the price"}'
# 3. Stop the session
curl -s -X DELETE "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY"
```
```bash CLI theme={null}
# 1. Scrape Amazon's homepage (scrape ID is saved automatically)
firecrawl scrape https://www.amazon.com
# 2. Interact — search for a product and get its price
firecrawl interact "Search for iPhone 16 Pro Max"
firecrawl interact "Click on the first result and tell me the price"
# 3. Stop the session
firecrawl interact stop
```
```json Response theme={null}
{
"success": true,
"liveViewUrl": "https://liveview.firecrawl.dev/...",
"interactiveLiveViewUrl": "https://liveview.firecrawl.dev/...",
"output": "The iPhone 16 Pro Max (256GB) is priced at $1,199.00.",
"exitCode": 0,
"killed": false
}
```
***
## More capabilities
* **Agent**: autonomous web data gathering powered by AI
* **Browser sessions**: managed sessions for interactive workflows
* **Map**: discover all URLs on a website
* **Crawl**: recursively gather content from entire sites
***
## Resources
* Complete API documentation with interactive examples
* Python, Node.js, CLI, and community SDKs
* Self-host Firecrawl or contribute to the project
* LangChain, LlamaIndex, OpenAI, and more
# Firecrawl MCP Server
Source: https://docs.firecrawl.dev/mcp-server
Use Firecrawl's API through the Model Context Protocol
A Model Context Protocol (MCP) server implementation that integrates [Firecrawl](https://github.com/firecrawl/firecrawl) for searching, scraping, and interacting with the web. Our MCP server is open-source and available on [GitHub](https://github.com/firecrawl/firecrawl-mcp-server).
## Features
* Search the web and get full page content
* Scrape any URL into clean, structured data
* Interact with pages — click, navigate, and operate
* Deep research with autonomous agent
* Browser session management
* Cloud and self-hosted support
* Streamable HTTP support
## Installation
You can either use our remote hosted URL or run the server locally. Get your API key from [https://firecrawl.dev/app/api-keys](https://www.firecrawl.dev/app/api-keys).
### Remote hosted URL
```bash theme={null}
https://mcp.firecrawl.dev/{FIRECRAWL_API_KEY}/v2/mcp
```
### Running with npx
```bash theme={null}
env FIRECRAWL_API_KEY=fc-YOUR_API_KEY npx -y firecrawl-mcp
```
### Manual Installation
```bash theme={null}
npm install -g firecrawl-mcp
```
### Running on Cursor
#### Manual Installation
Note: Requires Cursor version 0.45.6+
For the most up-to-date configuration instructions, please refer to the official Cursor documentation on configuring MCP servers:
[Cursor MCP Server Configuration Guide](https://docs.cursor.com/context/model-context-protocol#configuring-mcp-servers)
To configure Firecrawl MCP in Cursor **v0.48.6**
1. Open Cursor Settings
2. Go to Features > MCP Servers
3. Click "+ Add new global MCP server"
4. Enter the following code:
```json theme={null}
{
"mcpServers": {
"firecrawl-mcp": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "YOUR-API-KEY"
}
}
}
}
```
To configure Firecrawl MCP in Cursor **v0.45.6**
1. Open Cursor Settings
2. Go to Features > MCP Servers
3. Click "+ Add New MCP Server"
4. Enter the following:
* Name: "firecrawl-mcp" (or your preferred name)
* Type: "command"
* Command: `env FIRECRAWL_API_KEY=your-api-key npx -y firecrawl-mcp`
> If you are using Windows and are running into issues, try `cmd /c "set FIRECRAWL_API_KEY=your-api-key && npx -y firecrawl-mcp"`
Replace `your-api-key` with your Firecrawl API key. If you don't have one yet, you can create an account and get it from [https://www.firecrawl.dev/app/api-keys](https://www.firecrawl.dev/app/api-keys)
After adding, refresh the MCP server list to see the new tools. The Composer Agent will automatically use Firecrawl MCP when appropriate, but you can explicitly request it by describing your web data needs. Access the Composer via Command+L (Mac), select "Agent" next to the submit button, and enter your query.
### Running on Windsurf
Add this to your `./codeium/windsurf/model_config.json`:
```json theme={null}
{
"mcpServers": {
"mcp-server-firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "YOUR_API_KEY"
}
}
}
}
```
### Running with Streamable HTTP Mode
To run the server using streamable HTTP transport locally instead of the default stdio transport:
```bash theme={null}
env HTTP_STREAMABLE_SERVER=true FIRECRAWL_API_KEY=fc-YOUR_API_KEY npx -y firecrawl-mcp
```
Use the URL [http://localhost:3000/v2/mcp](http://localhost:3000/v2/mcp) or `https://mcp.firecrawl.dev/{FIRECRAWL_API_KEY}/v2/mcp`.
### Installing via Smithery (Legacy)
To install Firecrawl for Claude Desktop automatically via [Smithery](https://smithery.ai/server/@mendableai/mcp-server-firecrawl):
```bash theme={null}
npx -y @smithery/cli install @mendableai/mcp-server-firecrawl --client claude
```
### Running on VS Code
For one-click installation, use one of the install buttons below:
[Install in VS Code](https://insiders.vscode.dev/redirect/mcp/install?name=firecrawl&inputs=%5B%7B%22type%22%3A%22promptString%22%2C%22id%22%3A%22apiKey%22%2C%22description%22%3A%22Firecrawl%20API%20Key%22%2C%22password%22%3Atrue%7D%5D&config=%7B%22command%22%3A%22npx%22%2C%22args%22%3A%5B%22-y%22%2C%22firecrawl-mcp%22%5D%2C%22env%22%3A%7B%22FIRECRAWL_API_KEY%22%3A%22%24%7Binput%3AapiKey%7D%22%7D%7D) [Install in VS Code Insiders](https://insiders.vscode.dev/redirect/mcp/install?name=firecrawl&inputs=%5B%7B%22type%22%3A%22promptString%22%2C%22id%22%3A%22apiKey%22%2C%22description%22%3A%22Firecrawl%20API%20Key%22%2C%22password%22%3Atrue%7D%5D&config=%7B%22command%22%3A%22npx%22%2C%22args%22%3A%5B%22-y%22%2C%22firecrawl-mcp%22%5D%2C%22env%22%3A%7B%22FIRECRAWL_API_KEY%22%3A%22%24%7Binput%3AapiKey%7D%22%7D%7D&quality=insiders)
For manual installation, add the following JSON block to your User Settings (JSON) file in VS Code. You can do this by pressing `Ctrl + Shift + P` and typing `Preferences: Open User Settings (JSON)`.
```json theme={null}
{
"mcp": {
"inputs": [
{
"type": "promptString",
"id": "apiKey",
"description": "Firecrawl API Key",
"password": true
}
],
"servers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "${input:apiKey}"
}
}
}
}
}
```
Optionally, you can add it to a file called `.vscode/mcp.json` in your workspace. This will allow you to share the configuration with others:
```json theme={null}
{
"inputs": [
{
"type": "promptString",
"id": "apiKey",
"description": "Firecrawl API Key",
"password": true
}
],
"servers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "${input:apiKey}"
}
}
}
}
```
**Note:** Some users have reported issues when adding the MCP server to VS Code due to how it validates JSON with an outdated schema format ([microsoft/vscode#155379](https://github.com/microsoft/vscode/issues/155379)).
This affects several MCP tools, including Firecrawl.
**Workaround:** Disable JSON validation in VS Code to allow the MCP server to load properly.
See reference: [directus/directus#25906 (comment)](https://github.com/directus/directus/issues/25906#issuecomment-3369169513).
The MCP server still works fine when invoked via other extensions, but the issue occurs specifically when registering it directly in the MCP server list. We plan to add guidance once VS Code updates their schema validation.
### Running on Claude Desktop
Add this to the Claude config file:
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"url": "https://mcp.firecrawl.dev/v2/mcp",
"headers": {
"Authorization": "Bearer YOUR_API_KEY"
}
}
}
}
```
If you get a "Couldn't reach the MCP server" error, your Claude Desktop version may not support streamable HTTP transport. Use the local npx approach instead (requires [Node.js](https://nodejs.org)):
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "YOUR_API_KEY"
}
}
}
}
```
If you see a `spawn npx ENOENT` error, Node.js is not installed or not in your system PATH. Install Node.js from [nodejs.org](https://nodejs.org) (LTS version), then fully restart Claude Desktop. On Windows, you can also run `where npx` in Command Prompt and use the full path (e.g. `C:\\Program Files\\nodejs\\npx.cmd`) as the `command` value.
### Running on Claude Code
Add the Firecrawl MCP server using the Claude Code CLI. You can use the remote hosted URL or run locally:
```bash theme={null}
# Remote hosted URL (recommended)
claude mcp add firecrawl --url https://mcp.firecrawl.dev/your-api-key/v2/mcp
# Or run locally via npx
claude mcp add firecrawl -e FIRECRAWL_API_KEY=your-api-key -- npx -y firecrawl-mcp
```
### Running on Google Antigravity
Google Antigravity allows you to configure MCP servers directly through its Agent interface.
1. Open the Agent sidebar in the Editor or the Agent Manager view
2. Click the "..." (More Actions) menu and select **MCP Servers**
3. Select **View raw config** to open your local `mcp_config.json` file
4. Add the following configuration:
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "YOUR_FIRECRAWL_API_KEY"
}
}
}
}
```
5. Save the file and click **Refresh** in the Antigravity MCP interface to see the new tools
Replace `YOUR_FIRECRAWL_API_KEY` with your API key from [https://firecrawl.dev/app/api-keys](https://www.firecrawl.dev/app/api-keys).
### Running on n8n
To connect the Firecrawl MCP server in n8n:
1. Get your Firecrawl API key from [https://firecrawl.dev/app/api-keys](https://www.firecrawl.dev/app/api-keys)
2. In your n8n workflow, add an **AI Agent** node
3. In the AI Agent configuration, add a new **Tool**
4. Select **MCP Client Tool** as the tool type
5. Enter the MCP server Endpoint (replace `{YOUR_FIRECRAWL_API_KEY}` with your actual API key):
```
https://mcp.firecrawl.dev/{YOUR_FIRECRAWL_API_KEY}/v2/mcp
```
6. Set **Server Transport** to **HTTP Streamable**
7. Set **Authentication** to **None**
8. For **Tools to include**, you can select **All**, **Selected**, or **All Except** - this will expose the Firecrawl tools (scrape, crawl, map, search, extract, etc.)
For self-hosted deployments, run the MCP server with npx and enable HTTP transport mode:
```bash theme={null}
env HTTP_STREAMABLE_SERVER=true \
FIRECRAWL_API_KEY=fc-YOUR_API_KEY \
FIRECRAWL_API_URL=YOUR_FIRECRAWL_INSTANCE \
npx -y firecrawl-mcp
```
This starts the server on `http://localhost:3000/v2/mcp`, which you can use as the Endpoint in your n8n workflow. The `HTTP_STREAMABLE_SERVER=true` environment variable is required since n8n needs HTTP transport.
## Configuration
### Environment Variables
#### Required for Cloud API
* `FIRECRAWL_API_KEY`: Your Firecrawl API key
* Required when using cloud API (default)
* Optional when using self-hosted instance with `FIRECRAWL_API_URL`
* `FIRECRAWL_API_URL` (Optional): Custom API endpoint for self-hosted instances
* Example: `https://firecrawl.your-domain.com`
* If not provided, the cloud API will be used (requires API key)
#### Optional Configuration
##### Retry Configuration
* `FIRECRAWL_RETRY_MAX_ATTEMPTS`: Maximum number of retry attempts (default: 3)
* `FIRECRAWL_RETRY_INITIAL_DELAY`: Initial delay in milliseconds before first retry (default: 1000)
* `FIRECRAWL_RETRY_MAX_DELAY`: Maximum delay in milliseconds between retries (default: 10000)
* `FIRECRAWL_RETRY_BACKOFF_FACTOR`: Exponential backoff multiplier (default: 2)
##### Credit Usage Monitoring
* `FIRECRAWL_CREDIT_WARNING_THRESHOLD`: Credit usage warning threshold (default: 1000)
* `FIRECRAWL_CREDIT_CRITICAL_THRESHOLD`: Credit usage critical threshold (default: 100)
### Configuration Examples
For cloud API usage with custom retry and credit monitoring:
```bash theme={null}
# Required for cloud API
export FIRECRAWL_API_KEY=your-api-key
# Optional retry configuration
export FIRECRAWL_RETRY_MAX_ATTEMPTS=5 # Increase max retry attempts
export FIRECRAWL_RETRY_INITIAL_DELAY=2000 # Start with 2s delay
export FIRECRAWL_RETRY_MAX_DELAY=30000 # Maximum 30s delay
export FIRECRAWL_RETRY_BACKOFF_FACTOR=3 # More aggressive backoff
# Optional credit monitoring
export FIRECRAWL_CREDIT_WARNING_THRESHOLD=2000 # Warning at 2000 credits
export FIRECRAWL_CREDIT_CRITICAL_THRESHOLD=500 # Critical at 500 credits
```
For self-hosted instance:
```bash theme={null}
# Required for self-hosted
export FIRECRAWL_API_URL=https://firecrawl.your-domain.com
# Optional authentication for self-hosted
export FIRECRAWL_API_KEY=your-api-key # If your instance requires auth
# Custom retry configuration
export FIRECRAWL_RETRY_MAX_ATTEMPTS=10
export FIRECRAWL_RETRY_INITIAL_DELAY=500 # Start with faster retries
```
### Custom configuration with Claude Desktop
Add this to your `claude_desktop_config.json`:
```json theme={null}
{
"mcpServers": {
"mcp-server-firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "YOUR_API_KEY_HERE",
"FIRECRAWL_RETRY_MAX_ATTEMPTS": "5",
"FIRECRAWL_RETRY_INITIAL_DELAY": "2000",
"FIRECRAWL_RETRY_MAX_DELAY": "30000",
"FIRECRAWL_RETRY_BACKOFF_FACTOR": "3",
"FIRECRAWL_CREDIT_WARNING_THRESHOLD": "2000",
"FIRECRAWL_CREDIT_CRITICAL_THRESHOLD": "500"
}
}
}
}
```
### System Configuration
The server includes several configurable parameters that can be set via environment variables. Here are the default values if not configured:
```typescript theme={null}
const CONFIG = {
retry: {
maxAttempts: 3, // Number of retry attempts for rate-limited requests
initialDelay: 1000, // Initial delay before first retry (in milliseconds)
maxDelay: 10000, // Maximum delay between retries (in milliseconds)
backoffFactor: 2, // Multiplier for exponential backoff
},
credit: {
warningThreshold: 1000, // Warn when credit usage reaches this level
criticalThreshold: 100, // Critical alert when credit usage reaches this level
},
};
```
These configurations control:
1. **Retry Behavior**
* Automatically retries failed requests due to rate limits
* Uses exponential backoff to avoid overwhelming the API
* Example: With default settings, retries will be attempted at the following delays (see the sketch after this list):
* 1st retry: 1 second delay
* 2nd retry: 2 seconds delay
* 3rd retry: 4 seconds delay (capped at maxDelay)
2. **Credit Usage Monitoring**
* Tracks API credit consumption for cloud API usage
* Provides warnings at specified thresholds
* Helps prevent unexpected service interruption
* Example: With default settings:
* Warning at 1000 credits remaining
* Critical alert at 100 credits remaining
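As a rough sketch of the retry math (mirroring the documented defaults above, not the server's actual source), each retry delay is `initialDelay * backoffFactor^(attempt - 1)`, capped at `maxDelay`:

```typescript theme={null}
// Sketch only: derives the example delays above from the CONFIG defaults.
function retryDelay(
  attempt: number, // 1-based retry number
  initialDelay = 1000,
  backoffFactor = 2,
  maxDelay = 10000,
): number {
  return Math.min(initialDelay * backoffFactor ** (attempt - 1), maxDelay);
}

[1, 2, 3].forEach((n) => console.log(`retry ${n}: ${retryDelay(n)}ms`));
// retry 1: 1000ms, retry 2: 2000ms, retry 3: 4000ms
```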
### Rate Limiting and Batch Processing
The server utilizes Firecrawl's built-in rate limiting and batch processing capabilities:
* Automatic rate limit handling with exponential backoff
* Efficient parallel processing for batch operations
* Smart request queuing and throttling
* Automatic retries for transient errors
## Available Tools
### 1. Scrape Tool (`firecrawl_scrape`)
Scrape content from a single URL with advanced options.
```json theme={null}
{
"name": "firecrawl_scrape",
"arguments": {
"url": "https://example.com",
"formats": ["markdown"],
"onlyMainContent": true,
"waitFor": 1000,
"mobile": false,
"includeTags": ["article", "main"],
"excludeTags": ["nav", "footer"],
"skipTlsVerification": false
}
}
```
### 2. Map Tool (`firecrawl_map`)
Map a website to discover all indexed URLs on the site.
```json theme={null}
{
"name": "firecrawl_map",
"arguments": {
"url": "https://example.com",
"search": "blog",
"sitemap": "include",
"includeSubdomains": false,
"limit": 100,
"ignoreQueryParameters": true
}
}
```
#### Map Tool Options:
* `url`: The base URL of the website to map
* `search`: Optional search term to filter URLs
* `sitemap`: Control sitemap usage - "include", "skip", or "only"
* `includeSubdomains`: Whether to include subdomains in the mapping
* `limit`: Maximum number of URLs to return
* `ignoreQueryParameters`: Whether to ignore query parameters when mapping
**Best for:** Discovering URLs on a website before deciding what to scrape; finding specific sections of a website.
**Returns:** Array of URLs found on the site.
### 3. Search Tool (`firecrawl_search`)
Search the web and optionally extract content from search results.
```json theme={null}
{
"name": "firecrawl_search",
"arguments": {
"query": "your search query",
"limit": 5,
"location": "United States",
"tbs": "qdr:m",
"scrapeOptions": {
"formats": ["markdown"],
"onlyMainContent": true
}
}
}
```
#### Search Tool Options:
* `query`: The search query string (required)
* `limit`: Maximum number of results to return
* `location`: Geographic location for search results
* `tbs`: Time-based search filter (e.g., `qdr:d` for past day, `qdr:w` for past week, `qdr:m` for past month)
* `filter`: Additional search filter
* `sources`: Array of source types to search (`web`, `images`, `news`); see the example after this list
* `scrapeOptions`: Options for scraping search result pages
* `enterprise`: Array of enterprise options (`default`, `anon`, `zdr`)
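For example, a variant of the call above that limits results to news articles from the past week (assuming `sources` takes the plain string values listed above):

```json theme={null}
{
  "name": "firecrawl_search",
  "arguments": {
    "query": "your search query",
    "limit": 5,
    "tbs": "qdr:w",
    "sources": ["news"]
  }
}
```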
### 4. Crawl Tool (`firecrawl_crawl`)
Start an asynchronous crawl with advanced options.
```json theme={null}
{
"name": "firecrawl_crawl",
"arguments": {
"url": "https://example.com",
"maxDiscoveryDepth": 2,
"limit": 100,
"allowExternalLinks": false,
"deduplicateSimilarURLs": true
}
}
```
### 5. Check Crawl Status (`firecrawl_check_crawl_status`)
Check the status of a crawl job.
```json theme={null}
{
"name": "firecrawl_check_crawl_status",
"arguments": {
"id": "550e8400-e29b-41d4-a716-446655440000"
}
}
```
**Returns:** Status and progress of the crawl job, including results if available.
### 6. Extract Tool (`firecrawl_extract`)
Extract structured information from web pages using LLM capabilities. Supports both cloud AI and self-hosted LLM extraction.
```json theme={null}
{
"name": "firecrawl_extract",
"arguments": {
"urls": ["https://example.com/page1", "https://example.com/page2"],
"prompt": "Extract product information including name, price, and description",
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "number" },
"description": { "type": "string" }
},
"required": ["name", "price"]
},
"allowExternalLinks": false,
"enableWebSearch": false,
"includeSubdomains": false
}
}
```
Example response:
```json theme={null}
{
"content": [
{
"type": "text",
"text": {
"name": "Example Product",
"price": 99.99,
"description": "This is an example product description"
}
}
],
"isError": false
}
```
#### Extract Tool Options:
* `urls`: Array of URLs to extract information from
* `prompt`: Custom prompt for the LLM extraction
* `schema`: JSON schema for structured data extraction
* `allowExternalLinks`: Allow extraction from external links
* `enableWebSearch`: Enable web search for additional context
* `includeSubdomains`: Include subdomains in extraction
When using a self-hosted instance, extraction uses your configured LLM. With the cloud API, it uses Firecrawl's managed LLM service.
### 7. Agent Tool (`firecrawl_agent`)
Autonomous web research agent that independently browses the internet, searches for information, navigates through pages, and extracts structured data based on your query. This runs asynchronously -- it returns a job ID immediately, and you poll `firecrawl_agent_status` to check when complete and retrieve results.
```json theme={null}
{
"name": "firecrawl_agent",
"arguments": {
"prompt": "Find the top 5 AI startups founded in 2024 and their funding amounts",
"schema": {
"type": "object",
"properties": {
"startups": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"funding": { "type": "string" },
"founded": { "type": "string" }
}
}
}
}
}
}
}
```
You can also provide specific URLs for the agent to focus on:
```json theme={null}
{
"name": "firecrawl_agent",
"arguments": {
"urls": ["https://docs.firecrawl.dev", "https://firecrawl.dev/pricing"],
"prompt": "Compare the features and pricing information from these pages"
}
}
```
#### Agent Tool Options:
* `prompt`: Natural language description of the data you want (required, max 10,000 characters)
* `urls`: Optional array of URLs to focus the agent on specific pages
* `schema`: Optional JSON schema for structured output
**Best for:** Complex research tasks where you don't know the exact URLs; multi-source data gathering; finding information scattered across the web; extracting data from JavaScript-heavy SPAs that fail with regular scrape.
**Returns:** Job ID for status checking. Use `firecrawl_agent_status` to poll for results.
### 8. Check Agent Status (`firecrawl_agent_status`)
Check the status of an agent job and retrieve results when complete. Poll every 15-30 seconds and keep polling for at least 2-3 minutes before considering the request failed.
```json theme={null}
{
"name": "firecrawl_agent_status",
"arguments": {
"id": "550e8400-e29b-41d4-a716-446655440000"
}
}
```
#### Agent Status Options:
* `id`: The agent job ID returned by `firecrawl_agent` (required)
**Possible statuses:**
* `processing`: Agent is still researching -- keep polling
* `completed`: Research finished -- response includes the extracted data
* `failed`: An error occurred
**Returns:** Status, progress, and results (if completed) of the agent job.
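If you drive these tools programmatically through an MCP client rather than letting a model call them, the polling loop looks roughly like this sketch. The `callTool` shape matches the `@modelcontextprotocol/sdk` `Client`; the string-based status check is illustrative, since the exact response payload may vary:

```typescript theme={null}
// Sketch: poll firecrawl_agent_status until the job leaves "processing".
type McpClient = {
  callTool(req: { name: string; arguments: Record<string, unknown> }): Promise<unknown>;
};

async function waitForAgent(client: McpClient, id: string): Promise<unknown> {
  for (let i = 0; i < 12; i++) { // ~3 minutes at 15 s intervals, per the guidance above
    const result = await client.callTool({
      name: "firecrawl_agent_status",
      arguments: { id },
    });
    const text = JSON.stringify(result);
    if (text.includes('"completed"')) return result; // response includes the extracted data
    if (text.includes('"failed"')) throw new Error("agent job failed");
    await new Promise((r) => setTimeout(r, 15_000)); // still processing: keep polling
  }
  throw new Error("gave up waiting for agent job");
}
```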
### 9. Create Browser Session (`firecrawl_browser_create`)
Create a persistent browser session for code execution via CDP (Chrome DevTools Protocol).
```json theme={null}
{
"name": "firecrawl_browser_create",
"arguments": {
"ttl": 120,
"activityTtl": 60
}
}
```
#### Browser Create Options:
* `ttl`: Total session lifetime in seconds (30-3600, optional)
* `activityTtl`: Idle timeout in seconds (10-3600, optional)
**Best for:** Running code (Python/JS) that interacts with a live browser page, multi-step browser automation, sessions with profiles that survive across multiple tool calls.
**Returns:** Session ID, CDP URL, and live view URL.
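Because the response includes a CDP URL, any CDP-compatible tooling can attach to the session. A minimal sketch using Playwright's `connectOverCDP`, assuming `FIRECRAWL_CDP_URL` holds the CDP URL returned by `firecrawl_browser_create`:

```typescript theme={null}
import { chromium } from "playwright";

// Attach to the Firecrawl browser session over CDP.
const cdpUrl = process.env.FIRECRAWL_CDP_URL!; // assumed: CDP URL from firecrawl_browser_create
const browser = await chromium.connectOverCDP(cdpUrl);
const page = browser.contexts()[0]?.pages()[0]; // reuse the session's open page, if any
console.log(await page?.title());
await browser.close(); // detaches; use firecrawl_browser_delete to destroy the session
```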
### 10. Execute Code in Browser (`firecrawl_browser_execute`)
Execute code in an active browser session. Supports agent-browser commands (bash), Python, or JavaScript.
```json theme={null}
{
"name": "firecrawl_browser_execute",
"arguments": {
"sessionId": "session-id-here",
"code": "agent-browser open https://example.com",
"language": "bash"
}
}
```
Python example with Playwright:
```json theme={null}
{
"name": "firecrawl_browser_execute",
"arguments": {
"sessionId": "session-id-here",
"code": "await page.goto('https://example.com')\ntitle = await page.title()\nprint(title)",
"language": "python"
}
}
```
#### Browser Execute Options:
* `sessionId`: The browser session ID (required)
* `code`: The code to execute (required)
* `language`: `bash`, `python`, or `node` (optional, defaults to `bash`)
**Common agent-browser commands (bash; a combined example follows this list):**
* `agent-browser open <url>` -- Navigate to URL
* `agent-browser snapshot` -- Get accessibility tree with clickable refs
* `agent-browser click @e5` -- Click element by ref from snapshot
* `agent-browser type @e3 "text"` -- Type into element
* `agent-browser screenshot [path]` -- Take screenshot
* `agent-browser scroll down` -- Scroll page
* `agent-browser wait 2000` -- Wait 2 seconds
**Returns:** Execution result including stdout, stderr, and exit code.
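Putting those commands together, a typical session might look like the following; the `@e3`/`@e5` refs are placeholders for whatever your own `snapshot` output reports:

```bash theme={null}
agent-browser open https://example.com   # navigate
agent-browser snapshot                   # prints elements with refs like @e3, @e5
agent-browser type @e3 "search terms"    # placeholder ref from your snapshot
agent-browser click @e5                  # placeholder ref from your snapshot
agent-browser wait 2000
agent-browser screenshot result.png
```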
### 11. Delete Browser Session (`firecrawl_browser_delete`)
Destroy a browser session.
```json theme={null}
{
"name": "firecrawl_browser_delete",
"arguments": {
"sessionId": "session-id-here"
}
}
```
#### Browser Delete Options:
* `sessionId`: The browser session ID to destroy (required)
**Returns:** Success confirmation.
### 12. List Browser Sessions (`firecrawl_browser_list`)
List browser sessions, optionally filtered by status.
```json theme={null}
{
"name": "firecrawl_browser_list",
"arguments": {
"status": "active"
}
}
```
#### Browser List Options:
* `status`: Filter by session status -- `active` or `destroyed` (optional)
**Returns:** Array of browser sessions.
### 13. Interact with Scraped Page (`firecrawl_interact`)
Interact with a previously scraped page in a live browser session. Scrape a page first with `firecrawl_scrape`, then use the returned `scrapeId` (from the scrape response metadata) to click buttons, fill forms, extract dynamic content, or navigate deeper. The response includes a `liveViewUrl` and `interactiveLiveViewUrl` you can open in your browser to watch or control the session in real time.
```json theme={null}
{
"name": "firecrawl_interact",
"arguments": {
"scrapeId": "scrape-id-from-previous-scrape",
"prompt": "Click the Sign In button"
}
}
```
#### Interact Tool Options:
* `scrapeId`: The scrape job ID from a previous `firecrawl_scrape` call (required)
* `prompt`: Natural language instruction describing the action to take (provide `prompt` or `code`)
* `code`: Code to execute in the browser session (provide `code` or `prompt`; see the example below)
* `language`: `bash`, `python`, or `node` (optional, defaults to `node`, only used with `code`)
* `timeout`: Execution timeout in seconds, 1–300 (optional, defaults to 30)
**Best for:** Multi-step workflows on a single page — searching a site, clicking through results, filling forms, extracting data that requires interaction.
**Returns:** Interaction result including `liveViewUrl` and `interactiveLiveViewUrl`.
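The `code` path is useful when a prompt is too ambiguous. The execution environment isn't spelled out here, but assuming the default `node` runtime exposes a Playwright-style `page` handle (as the Python `firecrawl_browser_execute` example above suggests), a call might look like:

```json theme={null}
{
  "name": "firecrawl_interact",
  "arguments": {
    "scrapeId": "scrape-id-from-previous-scrape",
    "code": "const title = await page.title(); console.log(title);",
    "language": "node",
    "timeout": 60
  }
}
```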
### 14. Stop Interact Session (`firecrawl_interact_stop`)
Stop an interact session for a scraped page. Call this when you are done interacting to free resources.
```json theme={null}
{
"name": "firecrawl_interact_stop",
"arguments": {
"scrapeId": "scrape-id-from-previous-scrape"
}
}
```
#### Interact Stop Options:
* `scrapeId`: The scrape ID for the session to stop (required)
**Returns:** Confirmation that the session has been stopped.
## Logging System
The server includes comprehensive logging:
* Operation status and progress
* Performance metrics
* Credit usage monitoring
* Rate limit tracking
* Error conditions
Example log messages:
```
[INFO] Firecrawl MCP Server initialized successfully
[INFO] Starting scrape for URL: https://example.com
[INFO] Starting crawl for URL: https://example.com
[WARNING] Credit usage has reached warning threshold
[ERROR] Rate limit exceeded, retrying in 2s...
```
## Error Handling
The server provides robust error handling:
* Automatic retries for transient errors
* Rate limit handling with backoff
* Detailed error messages
* Credit usage warnings
* Network resilience
Example error response:
```json theme={null}
{
"content": [
{
"type": "text",
"text": "Error: Rate limit exceeded. Retrying in 2 seconds..."
}
],
"isError": true
}
```
## Development
```bash theme={null}
# Install dependencies
npm install
# Build
npm run build
# Run tests
npm test
```
### Contributing
1. Fork the repository
2. Create your feature branch
3. Run tests: `npm test`
4. Submit a pull request
### Thanks to contributors
Thanks to [@vrknetha](https://github.com/vrknetha), [@cawstudios](https://caw.tech) for the initial implementation!
Thanks to MCP.so and Klavis AI for hosting and [@gstarwd](https://github.com/gstarwd), [@xiangkaiz](https://github.com/xiangkaiz) and [@zihaolin96](https://github.com/zihaolin96) for integrating our server.
## License
MIT License - see LICENSE file for details
# ASP.NET Core
Source: https://docs.firecrawl.dev/quickstarts/aspnet-core
Use Firecrawl with ASP.NET Core to search, scrape, and interact with web data using the REST API.
## Prerequisites
* .NET 6.0+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Configuration
Add your API key to `appsettings.json`:
```json theme={null}
{
"Firecrawl": {
"ApiKey": "fc-YOUR-API-KEY",
"BaseUrl": "https://api.firecrawl.dev/v2"
}
}
```
Or use environment variables / user secrets:
```bash theme={null}
export Firecrawl__ApiKey=fc-YOUR-API-KEY
```
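For local development, the standard `dotnet user-secrets` CLI keeps the key out of source control; the key name mirrors the `appsettings.json` structure above:

```bash theme={null}
dotnet user-secrets init
dotnet user-secrets set "Firecrawl:ApiKey" "fc-YOUR-API-KEY"
```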
## Create a service
Create `Services/FirecrawlService.cs`:
```csharp theme={null}
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
public class FirecrawlService
{
private readonly HttpClient _http;
private readonly string _baseUrl;
public FirecrawlService(IConfiguration config)
{
_baseUrl = config["Firecrawl:BaseUrl"] ?? "https://api.firecrawl.dev/v2";
_http = new HttpClient();
_http.DefaultRequestHeaders.Authorization =
new AuthenticationHeaderValue("Bearer", config["Firecrawl:ApiKey"]);
}
public async Task<JsonDocument> SearchAsync(string query, int limit = 5)
{
var content = new StringContent(
JsonSerializer.Serialize(new { query, limit }),
Encoding.UTF8, "application/json");
var response = await _http.PostAsync($"{_baseUrl}/search", content);
response.EnsureSuccessStatusCode();
return JsonDocument.Parse(await response.Content.ReadAsStringAsync());
}
public async Task<JsonDocument> ScrapeAsync(string url)
{
var content = new StringContent(
JsonSerializer.Serialize(new { url }),
Encoding.UTF8, "application/json");
var response = await _http.PostAsync($"{_baseUrl}/scrape", content);
response.EnsureSuccessStatusCode();
return JsonDocument.Parse(await response.Content.ReadAsStringAsync());
}
public async Task<JsonDocument> InteractAsync(string url, string prompt, string? followUp = null)
{
// 1. Scrape to open a browser session
var scrapeContent = new StringContent(
JsonSerializer.Serialize(new { url, formats = new[] { "markdown" } }),
Encoding.UTF8, "application/json");
var scrapeRes = await _http.PostAsync($"{_baseUrl}/scrape", scrapeContent);
scrapeRes.EnsureSuccessStatusCode();
var scrapeDoc = JsonDocument.Parse(await scrapeRes.Content.ReadAsStringAsync());
var scrapeId = scrapeDoc.RootElement
.GetProperty("data").GetProperty("metadata").GetProperty("scrapeId").GetString();
// 2. Send first prompt
var firstPrompt = new StringContent(
JsonSerializer.Serialize(new { prompt }),
Encoding.UTF8, "application/json");
await _http.PostAsync($"{_baseUrl}/scrape/{scrapeId}/interact", firstPrompt);
// 3. Send follow-up prompt
JsonDocument? result = null;
if (followUp != null)
{
var followUpContent = new StringContent(
JsonSerializer.Serialize(new { prompt = followUp }),
Encoding.UTF8, "application/json");
var followUpRes = await _http.PostAsync(
$"{_baseUrl}/scrape/{scrapeId}/interact", followUpContent);
followUpRes.EnsureSuccessStatusCode();
result = JsonDocument.Parse(await followUpRes.Content.ReadAsStringAsync());
}
// 4. Close the session
await _http.DeleteAsync($"{_baseUrl}/scrape/{scrapeId}/interact");
return result ?? scrapeDoc;
}
}
```
## Register and use
In `Program.cs`:
```csharp theme={null}
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddSingleton<FirecrawlService>();
var app = builder.Build();
app.MapPost("/api/search", async (FirecrawlService firecrawl, SearchRequest req) =>
{
var result = await firecrawl.SearchAsync(req.Query, req.Limit);
return Results.Ok(result.RootElement);
});
app.MapPost("/api/scrape", async (FirecrawlService firecrawl, ScrapeRequest req) =>
{
var result = await firecrawl.ScrapeAsync(req.Url);
return Results.Ok(result.RootElement);
});
app.MapPost("/api/interact", async (FirecrawlService firecrawl, InteractRequest req) =>
{
var result = await firecrawl.InteractAsync(req.Url, req.Prompt, req.FollowUp);
return Results.Ok(result.RootElement);
});
app.Run();
record SearchRequest(string Query, int Limit = 5);
record ScrapeRequest(string Url);
record InteractRequest(string Url, string Prompt, string? FollowUp = null);
```
## Run it
```bash theme={null}
dotnet run
```
## Test it
```bash theme={null}
# Search the web
curl -X POST http://localhost:5000/api/search \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping"}'
# Scrape a page
curl -X POST http://localhost:5000/api/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
# Interact with a page
curl -X POST http://localhost:5000/api/interact \
-H "Content-Type: application/json" \
-d '{"url": "https://www.amazon.com", "prompt": "Search for iPhone 16 Pro Max", "followUp": "Click on the first result and tell me the price"}'
```
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Complete REST API documentation
# Astro
Source: https://docs.firecrawl.dev/quickstarts/astro
Use Firecrawl with Astro to scrape, search, and interact with web data in your content-driven site.
## Prerequisites
* Astro project with SSR enabled (`output: "server"` or `"hybrid"`)
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Install the SDK
```bash theme={null}
npm install @mendable/firecrawl-js
```
Add your API key to `.env`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Search the web
Create `src/pages/api/search.ts`:
```typescript theme={null}
import type { APIRoute } from "astro";
import Firecrawl from "@mendable/firecrawl-js";
const firecrawl = new Firecrawl({
apiKey: import.meta.env.FIRECRAWL_API_KEY,
});
export const POST: APIRoute = async ({ request }) => {
const { query } = await request.json();
const results = await firecrawl.search(query, { limit: 5 });
return new Response(JSON.stringify(results), {
headers: { "Content-Type": "application/json" },
});
};
```
Or search at request time in a server-rendered page (`src/pages/search.astro`):
```astro theme={null}
---
import Firecrawl from "@mendable/firecrawl-js";
const firecrawl = new Firecrawl({
apiKey: import.meta.env.FIRECRAWL_API_KEY,
});
const query = Astro.url.searchParams.get("q");
let results = [];
if (query) {
const searchData = await firecrawl.search(query, { limit: 5 });
results = searchData.web || [];
}
---
<h1>Search Results</h1>
<ul>
  {results.map((r) => (
    <li><a href={r.url}>{r.title}</a></li>
  ))}
</ul>
```
## Scrape a page
Create `src/pages/api/scrape.ts`:
```typescript theme={null}
import type { APIRoute } from "astro";
import Firecrawl from "@mendable/firecrawl-js";
const firecrawl = new Firecrawl({
apiKey: import.meta.env.FIRECRAWL_API_KEY,
});
export const POST: APIRoute = async ({ request }) => {
const { url } = await request.json();
const result = await firecrawl.scrape(url);
return new Response(JSON.stringify(result), {
headers: { "Content-Type": "application/json" },
});
};
```
Or scrape at request time in a server-rendered page (`src/pages/scrape.astro`):
```astro theme={null}
---
import Firecrawl from "@mendable/firecrawl-js";
const firecrawl = new Firecrawl({
apiKey: import.meta.env.FIRECRAWL_API_KEY,
});
const target = Astro.url.searchParams.get("url");
let markdown = null;
if (target) {
const result = await firecrawl.scrape(target);
markdown = result.markdown;
}
---
<h1>Scraped Content</h1>
{markdown ? (
  <pre>{markdown}</pre>
) : (
  <p>Pass ?url= to scrape a page</p>
)}
```
## Interact with a page
Create `src/pages/api/interact.ts`:
```typescript theme={null}
import type { APIRoute } from "astro";
import Firecrawl from "@mendable/firecrawl-js";
const firecrawl = new Firecrawl({
apiKey: import.meta.env.FIRECRAWL_API_KEY,
});
export const POST: APIRoute = async () => {
const result = await firecrawl.scrape("https://www.amazon.com", {
formats: ["markdown"],
});
const scrapeId = result.metadata?.scrapeId;
await firecrawl.interact(scrapeId, {
prompt: "Search for iPhone 16 Pro Max",
});
const response = await firecrawl.interact(scrapeId, {
prompt: "Click on the first result and tell me the price",
});
await firecrawl.stopInteraction(scrapeId);
return new Response(JSON.stringify({ output: response.output }), {
headers: { "Content-Type": "application/json" },
});
};
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# AWS Lambda
Source: https://docs.firecrawl.dev/quickstarts/aws-lambda
Use Firecrawl with AWS Lambda to search, scrape, and interact with web data in serverless functions.
## Prerequisites
* AWS account with Lambda access
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Setup
```bash theme={null}
mkdir firecrawl-lambda && cd firecrawl-lambda
npm init -y
npm install @mendable/firecrawl-js
```
## Search the web
Create `index.mjs` with a search handler:
```javascript theme={null}
import Firecrawl from "@mendable/firecrawl-js";
const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
export async function handler(event) {
const body = JSON.parse(event.body || "{}");
if (body.action === "search") {
const results = await firecrawl.search(body.query, { limit: 5 });
return {
statusCode: 200,
body: JSON.stringify(results),
};
}
return { statusCode: 400, body: JSON.stringify({ error: "Unknown action" }) };
}
```
## Scrape a page
Add a `scrape` action to the same handler:
```javascript theme={null}
if (body.action === "scrape") {
const result = await firecrawl.scrape(body.url);
return {
statusCode: 200,
body: JSON.stringify(result),
};
}
```
## Interact with a page
Add an `interact` action to control a live browser session — click buttons, fill forms, and extract dynamic content:
```javascript theme={null}
if (body.action === "interact") {
const result = await firecrawl.scrape("https://www.amazon.com", {
formats: ["markdown"],
});
const scrapeId = result.metadata?.scrapeId;
await firecrawl.interact(scrapeId, {
prompt: "Search for iPhone 16 Pro Max",
});
const response = await firecrawl.interact(scrapeId, {
prompt: "Click on the first result and tell me the price",
});
await firecrawl.stopInteraction(scrapeId);
return {
statusCode: 200,
body: JSON.stringify({ output: response.output }),
};
}
```
## Deploy
Package and deploy with the AWS CLI:
```bash theme={null}
zip -r function.zip index.mjs node_modules/
aws lambda create-function \
--function-name firecrawl-scraper \
--runtime nodejs20.x \
--handler index.handler \
--zip-file fileb://function.zip \
--role arn:aws:iam::YOUR_ACCOUNT:role/lambda-role \
--environment Variables="{FIRECRAWL_API_KEY=fc-YOUR-API-KEY}" \
--timeout 60
```
Set the Lambda timeout to at least 30 seconds. Scraping dynamic pages and interact sessions can take longer than the default 3-second timeout.
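To smoke-test the deployed function from the CLI (the payload's `body` string matches what the handler above parses; `--cli-binary-format raw-in-base64-out` is needed on AWS CLI v2):

```bash theme={null}
aws lambda invoke \
  --function-name firecrawl-scraper \
  --cli-binary-format raw-in-base64-out \
  --payload '{"body": "{\"action\": \"search\", \"query\": \"firecrawl web scraping\"}"}' \
  response.json
cat response.json
```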
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Bun
Source: https://docs.firecrawl.dev/quickstarts/bun
Use Firecrawl with Bun to build fast web scraping and search servers.
## Prerequisites
* Bun 1.0+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Install the SDK
```bash theme={null}
bun add @mendable/firecrawl-js
```
Add your API key to `.env`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Search the web
Bun has a built-in HTTP server. Create `index.ts`:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";
const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
Bun.serve({
port: 3000,
async fetch(req) {
const url = new URL(req.url);
if (req.method === "POST" && url.pathname === "/search") {
const { query } = await req.json();
const results = await firecrawl.search(query, { limit: 5 });
return Response.json(results);
}
return new Response("Not found", { status: 404 });
},
});
console.log("Server running on port 3000");
```
Run it:
```bash theme={null}
bun run index.ts
```
## Scrape a page
Add a `/scrape` route to the same server:
```typescript theme={null}
if (req.method === "POST" && url.pathname === "/scrape") {
const { url: targetUrl } = await req.json();
const result = await firecrawl.scrape(targetUrl);
return Response.json(result);
}
```
## Interact with a page
Use interact to control a live browser session — click buttons, fill forms, and extract dynamic content.
```typescript theme={null}
if (req.method === "POST" && url.pathname === "/interact") {
const { url: targetUrl } = await req.json();
const result = await firecrawl.scrape(targetUrl, { formats: ['markdown'] });
const scrapeId = result.metadata?.scrapeId;
await firecrawl.interact(scrapeId, { prompt: 'Search for iPhone 16 Pro Max' });
const response = await firecrawl.interact(scrapeId, { prompt: 'Click on the first result and tell me the price' });
await firecrawl.stopInteraction(scrapeId);
return Response.json({ output: response.output });
}
```
## Script usage
Use Firecrawl in a standalone Bun script:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";
const app = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
const results = await app.search("firecrawl web scraping", { limit: 5 });
console.log(results);
```
```bash theme={null}
bun run search.ts
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Cloudflare Workers
Source: https://docs.firecrawl.dev/quickstarts/cloudflare-workers
Use Firecrawl with Cloudflare Workers to search, scrape, and interact with web data at the edge.
## Prerequisites
* Wrangler CLI (`npm install -g wrangler`)
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Setup
```bash theme={null}
npm create cloudflare@latest my-scraper
cd my-scraper
npm install @mendable/firecrawl-js
```
Add your API key as a secret:
```bash theme={null}
wrangler secret put FIRECRAWL_API_KEY
```
## Search the web
Create a handler that searches the web and returns results with full page content.
Edit `src/index.ts`:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";
export interface Env {
FIRECRAWL_API_KEY: string;
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const firecrawl = new Firecrawl({ apiKey: env.FIRECRAWL_API_KEY });
const url = new URL(request.url);
if (request.method === "POST" && url.pathname === "/search") {
const { query } = (await request.json()) as { query: string };
const results = await firecrawl.search(query, { limit: 5 });
return Response.json(results);
}
return new Response("Not found", { status: 404 });
},
};
```
## Scrape a page
Add a `/scrape` route to extract clean markdown from any URL.
```typescript theme={null}
if (request.method === "POST" && url.pathname === "/scrape") {
const { url: targetUrl } = (await request.json()) as { url: string };
const result = await firecrawl.scrape(targetUrl);
return Response.json(result);
}
```
## Interact with a page
Add an `/interact` route to control a live browser session — click buttons, fill forms, and extract dynamic content.
```typescript theme={null}
if (request.method === "POST" && url.pathname === "/interact") {
const result = await firecrawl.scrape("https://www.amazon.com", {
formats: ["markdown"],
});
const scrapeId = result.metadata?.scrapeId;
await firecrawl.interact(scrapeId, {
prompt: "Search for iPhone 16 Pro Max",
});
const response = await firecrawl.interact(scrapeId, {
prompt: "Click on the first result and tell me the price",
});
await firecrawl.stopInteraction(scrapeId);
return Response.json({ output: response.output });
}
```
## Deploy
```bash theme={null}
wrangler deploy
```
## Test it
```bash theme={null}
curl -X POST https://my-scraper.<your-subdomain>.workers.dev/search \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping"}'
```
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Deno Deploy
Source: https://docs.firecrawl.dev/quickstarts/deno-deploy
Use Firecrawl with Deno Deploy to search, scrape, and interact with web data at the edge.
## Prerequisites
* Deno 1.40+ or Deno 2
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Setup
Create `main.ts`:
```typescript theme={null}
import Firecrawl from "npm:@mendable/firecrawl-js";
const firecrawl = new Firecrawl({
apiKey: Deno.env.get("FIRECRAWL_API_KEY"),
});
```
## Search the web
Add a `/search` route that searches the web and returns results with full page content.
```typescript theme={null}
Deno.serve(async (req) => {
const url = new URL(req.url);
if (req.method === "POST" && url.pathname === "/search") {
const { query } = await req.json();
const results = await firecrawl.search(query, { limit: 5 });
return Response.json(results);
}
return new Response("Not found", { status: 404 });
});
```
## Scrape a page
Add a `/scrape` route to extract clean markdown from any URL.
```typescript theme={null}
if (req.method === "POST" && url.pathname === "/scrape") {
const { url: targetUrl } = await req.json();
const result = await firecrawl.scrape(targetUrl);
return Response.json(result);
}
```
## Interact with a page
Add an `/interact` route to control a live browser session — click buttons, fill forms, and extract dynamic content.
```typescript theme={null}
if (req.method === "POST" && url.pathname === "/interact") {
const result = await firecrawl.scrape("https://www.amazon.com", {
formats: ["markdown"],
});
const scrapeId = result.metadata?.scrapeId;
await firecrawl.interact(scrapeId, {
prompt: "Search for iPhone 16 Pro Max",
});
const response = await firecrawl.interact(scrapeId, {
prompt: "Click on the first result and tell me the price",
});
console.log(response.output);
await firecrawl.stopInteraction(scrapeId);
return Response.json({ output: response.output });
}
```
## Run locally
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY deno run --allow-net --allow-env main.ts
```
## Deploy
Install the Deno Deploy CLI (`deployctl`) and deploy:
```bash theme={null}
deployctl deploy --project=my-scraper main.ts
```
Set the environment variable in the Deno Deploy dashboard or via CLI:
```bash theme={null}
deployctl env set FIRECRAWL_API_KEY=fc-YOUR-API-KEY --project=my-scraper
```
## Test it
```bash theme={null}
curl -X POST https://my-scraper.deno.dev/search \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping"}'
```
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Django
Source: https://docs.firecrawl.dev/quickstarts/django
Use Firecrawl with Django to scrape, search, and interact with web data in your Python web application.
## Prerequisites
* Django 4+ project
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Install the SDK
```bash theme={null}
pip install firecrawl-py
```
Add your API key to your Django settings or environment:
```bash theme={null}
export FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Create views
Add search, scrape, and interact views to your Django app. In `views.py`:
```python theme={null}
import json
import os
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key=os.environ["FIRECRAWL_API_KEY"])
@csrf_exempt
@require_POST
def search_view(request):
body = json.loads(request.body)
results = firecrawl.search(body["query"], limit=body.get("limit", 5))
return JsonResponse(
[{"title": r.title, "url": r.url} for r in results.web],
safe=False,
)
@csrf_exempt
@require_POST
def scrape_view(request):
body = json.loads(request.body)
result = firecrawl.scrape(body["url"])
return JsonResponse({
"markdown": result.markdown,
"metadata": result.metadata,
})
@csrf_exempt
@require_POST
def interact_start_view(request):
body = json.loads(request.body)
result = firecrawl.scrape(body["url"], formats=["markdown"])
return JsonResponse({"scrape_id": result.metadata.scrape_id})
@csrf_exempt
@require_POST
def interact_view(request):
body = json.loads(request.body)
response = firecrawl.interact(body["scrape_id"], prompt=body["prompt"])
return JsonResponse({"output": response.output})
@csrf_exempt
@require_POST
def interact_stop_view(request):
body = json.loads(request.body)
firecrawl.stop_interaction(body["scrape_id"])
return JsonResponse({"status": "stopped"})
```
## Wire up URLs
In `urls.py`:
```python theme={null}
from django.urls import path
from . import views
urlpatterns = [
path("api/search/", views.search_view),
path("api/scrape/", views.scrape_view),
path("api/interact/start/", views.interact_start_view),
path("api/interact/", views.interact_view),
path("api/interact/stop/", views.interact_stop_view),
]
```
## Test it
```bash theme={null}
python manage.py runserver
# Search the web
curl -X POST http://localhost:8000/api/search/ \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping", "limit": 5}'
# Scrape a page
curl -X POST http://localhost:8000/api/scrape/ \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
# Start an interactive session
curl -X POST http://localhost:8000/api/interact/start/ \
-H "Content-Type: application/json" \
-d '{"url": "https://www.amazon.com"}'
```
## Management command
Use Firecrawl in a Django management command for scripts and data pipelines. Create `management/commands/scrape.py`:
```python theme={null}
import os
from django.core.management.base import BaseCommand
from firecrawl import Firecrawl
class Command(BaseCommand):
help = "Scrape a URL and print the markdown"
def add_arguments(self, parser):
parser.add_argument("url", type=str)
def handle(self, *args, **options):
firecrawl = Firecrawl(api_key=os.environ["FIRECRAWL_API_KEY"])
result = firecrawl.scrape(options["url"])
self.stdout.write(result.markdown)
```
```bash theme={null}
python manage.py scrape https://example.com
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, async, and more
# .NET
Source: https://docs.firecrawl.dev/quickstarts/dotnet
Get started with Firecrawl in .NET. Scrape, search, and interact with web data using the REST API.
## Prerequisites
* .NET 6.0+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Search the web
Firecrawl works with .NET through the REST API using `HttpClient`.
```csharp theme={null}
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
var apiKey = Environment.GetEnvironmentVariable("FIRECRAWL_API_KEY");
var client = new HttpClient();
client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);
var content = new StringContent(
JsonSerializer.Serialize(new { query = "firecrawl web scraping", limit = 5 }),
Encoding.UTF8,
"application/json"
);
var response = await client.PostAsync("https://api.firecrawl.dev/v2/search", content);
var json = await response.Content.ReadAsStringAsync();
Console.WriteLine(json);
```
```json theme={null}
{
"success": true,
"data": {
"web": [
{
"url": "https://docs.firecrawl.dev",
"title": "Firecrawl Documentation",
"markdown": "# Firecrawl\n\nFirecrawl is a web scraping API..."
}
]
}
}
```
## Scrape a page
```csharp theme={null}
var scrapeContent = new StringContent(
JsonSerializer.Serialize(new { url = "https://example.com" }),
Encoding.UTF8,
"application/json"
);
var scrapeResponse = await client.PostAsync("https://api.firecrawl.dev/v2/scrape", scrapeContent);
var scrapeJson = await scrapeResponse.Content.ReadAsStringAsync();
using var doc = JsonDocument.Parse(scrapeJson);
var markdown = doc.RootElement.GetProperty("data").GetProperty("markdown").GetString();
Console.WriteLine(markdown);
```
```json theme={null}
{
"success": true,
"data": {
"markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
"metadata": {
"title": "Example Domain",
"sourceURL": "https://example.com"
}
}
}
```
## Interact with a page
Start a browser session, interact with the page using natural-language prompts, then close the session.
### Step 1 — Scrape to start a session
```csharp theme={null}
var sessionContent = new StringContent(
JsonSerializer.Serialize(new { url = "https://www.amazon.com", formats = new[] { "markdown" } }),
Encoding.UTF8,
"application/json"
);
var sessionResponse = await client.PostAsync("https://api.firecrawl.dev/v2/scrape", sessionContent);
var sessionJson = await sessionResponse.Content.ReadAsStringAsync();
using var sessionDoc = JsonDocument.Parse(sessionJson);
var scrapeId = sessionDoc.RootElement
.GetProperty("data")
.GetProperty("metadata")
.GetProperty("scrapeId")
.GetString();
Console.WriteLine($"scrapeId: {scrapeId}");
```
### Step 2 — Send interactions
```csharp theme={null}
var interactUrl = $"https://api.firecrawl.dev/v2/scrape/{scrapeId}/interact";
// Search for a product
var searchBody = new StringContent(
JsonSerializer.Serialize(new { prompt = "Search for iPhone 16 Pro Max" }),
Encoding.UTF8,
"application/json"
);
var searchResult = await client.PostAsync(interactUrl, searchBody);
Console.WriteLine(await searchResult.Content.ReadAsStringAsync());
// Click on the first result
var clickBody = new StringContent(
JsonSerializer.Serialize(new { prompt = "Click on the first result and tell me the price" }),
Encoding.UTF8,
"application/json"
);
var clickResult = await client.PostAsync(interactUrl, clickBody);
Console.WriteLine(await clickResult.Content.ReadAsStringAsync());
```
### Step 3 — Stop the session
```csharp theme={null}
await client.DeleteAsync(interactUrl);
Console.WriteLine("Session stopped");
```
## Reusable client class
For repeated use, wrap the API in a typed client:
```csharp theme={null}
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
public class FirecrawlClient
{
private readonly HttpClient _http;
private const string BaseUrl = "https://api.firecrawl.dev/v2";
public FirecrawlClient(string apiKey)
{
_http = new HttpClient();
_http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);
}
private async Task<JsonDocument> PostAsync(string endpoint, object payload)
{
var content = new StringContent(
JsonSerializer.Serialize(payload),
Encoding.UTF8,
"application/json"
);
var response = await _http.PostAsync($"{BaseUrl}{endpoint}", content);
response.EnsureSuccessStatusCode();
var json = await response.Content.ReadAsStringAsync();
return JsonDocument.Parse(json);
}
public async Task<JsonDocument> ScrapeAsync(string url)
{
return await PostAsync("/scrape", new { url });
}
public async Task<JsonDocument> SearchAsync(string query, int limit = 5)
{
return await PostAsync("/search", new { query, limit });
}
}
// Usage
var firecrawl = new FirecrawlClient(Environment.GetEnvironmentVariable("FIRECRAWL_API_KEY")!);
var result = await firecrawl.SearchAsync("firecrawl web scraping");
Console.WriteLine(result.RootElement);
```
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Complete REST API documentation
# Elixir
Source: https://docs.firecrawl.dev/quickstarts/elixir
Get started with Firecrawl in Elixir. Search, scrape, and interact with web data using the official SDK.
## Prerequisites
* Elixir 1.14+ and OTP 25+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Install the SDK
Add `firecrawl` to your `mix.exs`:
```elixir theme={null}
defp deps do
[
{:firecrawl, "~> 1.0"}
]
end
```
Configure your API key in `config/config.exs`:
```elixir theme={null}
config :firecrawl, api_key: System.get_env("FIRECRAWL_API_KEY")
```
## Search the web
```elixir theme={null}
{:ok, result} = Firecrawl.search_and_scrape(query: "firecrawl web scraping", limit: 5)
for entry <- result.body["data"]["web"] do
IO.puts("#{entry["title"]} - #{entry["url"]}")
end
```
## Scrape a page
```elixir theme={null}
{:ok, result} = Firecrawl.scrape_and_extract_from_url(url: "https://example.com")
IO.puts(result.body["data"]["markdown"])
```
```json theme={null}
{
"success": true,
"data": {
"markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
"metadata": {
"title": "Example Domain",
"sourceURL": "https://example.com"
}
}
}
```
## Interact with a page
Scrape a page, then keep working with it using the browser session API.
```elixir theme={null}
{:ok, scrape} = Firecrawl.scrape_and_extract_from_url(
url: "https://www.amazon.com",
formats: ["markdown"]
)
scrape_id = get_in(scrape.body, ["data", "metadata", "scrapeId"])
# Use the REST API for interact (prompt-based)
headers = [
{"Authorization", "Bearer #{Application.get_env(:firecrawl, :api_key)}"},
{"Content-Type", "application/json"}
]
{:ok, _} = Req.post(
"https://api.firecrawl.dev/v2/scrape/#{scrape_id}/interact",
json: %{prompt: "Search for iPhone 16 Pro Max"},
headers: headers
)
{:ok, response} = Req.post(
"https://api.firecrawl.dev/v2/scrape/#{scrape_id}/interact",
json: %{prompt: "Click on the first result and tell me the price"},
headers: headers
)
IO.inspect(response.body)
# Stop the session
Req.delete(
"https://api.firecrawl.dev/v2/scrape/#{scrape_id}/interact",
headers: headers
)
```
## Environment variable
Set `FIRECRAWL_API_KEY` instead of hardcoding:
```bash theme={null}
export FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Express
Source: https://docs.firecrawl.dev/quickstarts/express
Use Firecrawl with Express to build web scraping and search APIs.
## Prerequisites
* Node.js 18+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Setup
```bash theme={null}
npm install express @mendable/firecrawl-js
```
Add your API key to `.env`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Search the web
```javascript theme={null}
import express from "express";
import Firecrawl from "@mendable/firecrawl-js";
const app = express();
app.use(express.json());
const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
app.post("/search", async (req, res) => {
try {
const { query } = req.body;
const results = await firecrawl.search(query, { limit: 5 });
res.json(results);
} catch (error) {
res.status(500).json({ error: error.message });
}
});
app.listen(3000, () => console.log("Server running on port 3000"));
```
## Scrape a page
```javascript theme={null}
app.post("/scrape", async (req, res) => {
try {
const { url } = req.body;
const result = await firecrawl.scrape(url);
res.json(result);
} catch (error) {
res.status(500).json({ error: error.message });
}
});
```
## Interact with a page
Use interact to control a live browser session — click buttons, fill forms, and extract dynamic content.
```javascript theme={null}
app.post("/interact", async (req, res) => {
try {
const { url } = req.body;
const result = await firecrawl.scrape(url, { formats: ['markdown'] });
const scrapeId = result.metadata?.scrapeId;
await firecrawl.interact(scrapeId, { prompt: 'Search for iPhone 16 Pro Max' });
const response = await firecrawl.interact(scrapeId, { prompt: 'Click on the first result and tell me the price' });
await firecrawl.stopInteraction(scrapeId);
res.json({ output: response.output });
} catch (error) {
res.status(500).json({ error: error.message });
}
});
```
## Test it
```bash theme={null}
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping"}'
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# FastAPI
Source: https://docs.firecrawl.dev/quickstarts/fastapi
Use Firecrawl with FastAPI to build async web scraping and search APIs in Python.
## Prerequisites
* Python 3.8+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Setup
```bash theme={null}
pip install fastapi uvicorn firecrawl-py
```
Add your API key to `.env`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Create the API
Create `main.py`:
```python theme={null}
import os
from fastapi import FastAPI
from pydantic import BaseModel
from firecrawl import Firecrawl
app = FastAPI()
firecrawl = Firecrawl(api_key=os.environ["FIRECRAWL_API_KEY"])
class SearchRequest(BaseModel):
query: str
limit: int = 5
class ScrapeRequest(BaseModel):
url: str
class InteractRequest(BaseModel):
scrape_id: str
prompt: str
@app.post("/search")
async def search(req: SearchRequest):
results = firecrawl.search(req.query, limit=req.limit)
return [{"title": r.title, "url": r.url} for r in results.web]
@app.post("/scrape")
async def scrape(req: ScrapeRequest):
result = firecrawl.scrape(req.url)
return {"markdown": result.markdown, "metadata": result.metadata}
@app.post("/interact/start")
async def interact_start(req: ScrapeRequest):
result = firecrawl.scrape(req.url, formats=["markdown"])
return {"scrape_id": result.metadata.scrape_id}
@app.post("/interact")
async def interact(req: InteractRequest):
response = firecrawl.interact(req.scrape_id, prompt=req.prompt)
return {"output": response.output}
@app.post("/interact/stop")
async def interact_stop(req: InteractRequest):
firecrawl.stop_interaction(req.scrape_id)
return {"status": "stopped"}
```
## Run it
```bash theme={null}
uvicorn main:app --reload
```
## Test it
```bash theme={null}
# Search the web
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping", "limit": 5}'
# Scrape a page
curl -X POST http://localhost:8000/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
# Start an interactive session, then send prompts
curl -X POST http://localhost:8000/interact/start \
-H "Content-Type: application/json" \
-d '{"url": "https://www.amazon.com"}'
```
FastAPI auto-generates interactive docs at `http://localhost:8000/docs`.
## Async variant
For better concurrency under load, use `AsyncFirecrawl`:
```python theme={null}
from firecrawl import AsyncFirecrawl
async_firecrawl = AsyncFirecrawl(api_key=os.environ["FIRECRAWL_API_KEY"])
@app.post("/scrape-async")
async def scrape_async(req: ScrapeRequest):
result = await async_firecrawl.scrape(req.url)
return {"markdown": result.markdown}
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, async, and more
# Fastify
Source: https://docs.firecrawl.dev/quickstarts/fastify
Use Firecrawl with Fastify to build high-performance web scraping and search APIs.
## Prerequisites
* Node.js 18+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Setup
```bash theme={null}
npm install fastify @mendable/firecrawl-js
```
Add your API key to `.env`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Search the web
```javascript theme={null}
import Fastify from "fastify";
import Firecrawl from "@mendable/firecrawl-js";
const fastify = Fastify({ logger: true });
const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
fastify.post("/search", async (request) => {
const { query } = request.body;
return firecrawl.search(query, { limit: 5 });
});
fastify.listen({ port: 3000 });
```
## Scrape a page
```javascript theme={null}
fastify.post("/scrape", async (request) => {
const { url } = request.body;
return firecrawl.scrape(url);
});
```
## Interact with a page
Use interact to control a live browser session — click buttons, fill forms, and extract dynamic content.
```javascript theme={null}
fastify.post("/interact", async (request) => {
const { url } = request.body;
const result = await firecrawl.scrape(url, { formats: ['markdown'] });
const scrapeId = result.metadata?.scrapeId;
await firecrawl.interact(scrapeId, { prompt: 'Search for iPhone 16 Pro Max' });
const response = await firecrawl.interact(scrapeId, { prompt: 'Click on the first result and tell me the price' });
await firecrawl.stopInteraction(scrapeId);
return { output: response.output };
});
```
## As a Fastify plugin
Encapsulate the client in a plugin for reuse across routes:
```javascript theme={null}
import fp from "fastify-plugin";
import Firecrawl from "@mendable/firecrawl-js";
export default fp(async function firecrawlPlugin(fastify) {
const client = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
fastify.decorate("firecrawl", client);
});
```
Register the plugin, then use `fastify.firecrawl` in any route:
```javascript theme={null}
fastify.register(firecrawlPlugin);
fastify.post("/search", async function (request) {
const { query } = request.body;
return this.firecrawl.search(query, { limit: 5 });
});
```
## Test it
```bash theme={null}
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping"}'
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Flask
Source: https://docs.firecrawl.dev/quickstarts/flask
Use Firecrawl with Flask to build web scraping and search APIs in Python.
## Prerequisites
* Python 3.8+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Setup
```bash theme={null}
pip install flask firecrawl-py
```
Add your API key to `.env`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Create the app
Create `app.py`:
```python theme={null}
import os
from flask import Flask, request, jsonify
from firecrawl import Firecrawl
app = Flask(__name__)
firecrawl = Firecrawl(api_key=os.environ["FIRECRAWL_API_KEY"])
@app.post("/search")
def search():
data = request.get_json()
results = firecrawl.search(data["query"], limit=data.get("limit", 5))
return jsonify([{"title": r.title, "url": r.url} for r in results.web])
@app.post("/scrape")
def scrape():
data = request.get_json()
result = firecrawl.scrape(data["url"])
return jsonify(markdown=result.markdown, metadata=result.metadata)
@app.post("/interact/start")
def interact_start():
data = request.get_json()
result = firecrawl.scrape(data["url"], formats=["markdown"])
return jsonify(scrape_id=result.metadata.scrape_id)
@app.post("/interact")
def interact():
data = request.get_json()
response = firecrawl.interact(data["scrape_id"], prompt=data["prompt"])
return jsonify(output=response.output)
@app.post("/interact/stop")
def interact_stop():
data = request.get_json()
firecrawl.stop_interaction(data["scrape_id"])
return jsonify(status="stopped")
if __name__ == "__main__":
app.run(debug=True)
```
## Run it
```bash theme={null}
flask run
```
## Test it
```bash theme={null}
# Search the web
curl -X POST http://localhost:5000/search \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping", "limit": 5}'
# Scrape a page
curl -X POST http://localhost:5000/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
# Start an interactive session
curl -X POST http://localhost:5000/interact/start \
-H "Content-Type: application/json" \
-d '{"url": "https://www.amazon.com"}'
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, async, and more
# Go
Source: https://docs.firecrawl.dev/quickstarts/go
Get started with Firecrawl in Go. Scrape, search, and interact with web data using the REST API.
## Prerequisites
* Go 1.21+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Search the web
Firecrawl works with Go through the REST API. Use `net/http` to make requests directly.
```go theme={null}
package main
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/http"
"os"
)
func main() {
apiKey := os.Getenv("FIRECRAWL_API_KEY")
body, _ := json.Marshal(map[string]interface{}{
"query": "firecrawl web scraping",
"limit": 5,
})
req, _ := http.NewRequest("POST", "https://api.firecrawl.dev/v2/search", bytes.NewReader(body))
req.Header.Set("Authorization", "Bearer "+apiKey)
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil {
fmt.Fprintf(os.Stderr, "request failed: %v\n", err)
os.Exit(1)
}
defer resp.Body.Close()
result, _ := io.ReadAll(resp.Body)
fmt.Println(string(result))
}
```
```json theme={null}
{
"success": true,
"data": {
"web": [
{
"url": "https://docs.firecrawl.dev",
"title": "Firecrawl Documentation",
"markdown": "# Firecrawl\n\nFirecrawl is a web scraping API..."
}
]
}
}
```
## Scrape a page
```go theme={null}
body, _ := json.Marshal(map[string]string{
"url": "https://example.com",
})
req, _ := http.NewRequest("POST", "https://api.firecrawl.dev/v2/scrape", bytes.NewReader(body))
req.Header.Set("Authorization", "Bearer "+apiKey)
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil {
fmt.Fprintf(os.Stderr, "request failed: %v\n", err)
os.Exit(1)
}
defer resp.Body.Close()
result, _ := io.ReadAll(resp.Body)
fmt.Println(string(result))
```
```json theme={null}
{
"success": true,
"data": {
"markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
"metadata": {
"title": "Example Domain",
"sourceURL": "https://example.com"
}
}
}
```
## Interact with a page
Start a browser session, interact with the page using natural-language prompts, then close the session.
### Step 1 — Scrape to start a session
```go theme={null}
body, _ := json.Marshal(map[string]interface{}{
"url": "https://www.amazon.com",
"formats": []string{"markdown"},
})
req, _ := http.NewRequest("POST", "https://api.firecrawl.dev/v2/scrape", bytes.NewReader(body))
req.Header.Set("Authorization", "Bearer "+apiKey)
req.Header.Set("Content-Type", "application/json")
resp, err := http.DefaultClient.Do(req)
if err != nil {
fmt.Fprintf(os.Stderr, "request failed: %v\n", err)
os.Exit(1)
}
defer resp.Body.Close()
var scrapeResult map[string]interface{}
json.NewDecoder(resp.Body).Decode(&scrapeResult)
data := scrapeResult["data"].(map[string]interface{})
metadata := data["metadata"].(map[string]interface{})
scrapeId := metadata["scrapeId"].(string)
fmt.Println("scrapeId:", scrapeId)
```
### Step 2 — Send interactions
```go theme={null}
// Search for a product
interactBody, _ := json.Marshal(map[string]string{
    "prompt": "Search for iPhone 16 Pro Max",
})
interactURL := fmt.Sprintf("https://api.firecrawl.dev/v2/scrape/%s/interact", scrapeId)
req, _ = http.NewRequest("POST", interactURL, bytes.NewReader(interactBody))
req.Header.Set("Authorization", "Bearer "+apiKey)
req.Header.Set("Content-Type", "application/json")
resp, err = http.DefaultClient.Do(req)
if err != nil {
    fmt.Fprintf(os.Stderr, "interact failed: %v\n", err)
    os.Exit(1)
}
defer resp.Body.Close()
result, _ := io.ReadAll(resp.Body)
fmt.Println(string(result))

// Click on the first result
interactBody, _ = json.Marshal(map[string]string{
    "prompt": "Click on the first result and tell me the price",
})
req, _ = http.NewRequest("POST", interactURL, bytes.NewReader(interactBody))
req.Header.Set("Authorization", "Bearer "+apiKey)
req.Header.Set("Content-Type", "application/json")
resp, err = http.DefaultClient.Do(req)
if err != nil {
    fmt.Fprintf(os.Stderr, "interact failed: %v\n", err)
    os.Exit(1)
}
defer resp.Body.Close()
result, _ = io.ReadAll(resp.Body)
fmt.Println(string(result))
```
### Step 3 — Stop the session
```go theme={null}
req, _ = http.NewRequest("DELETE", interactURL, nil)
req.Header.Set("Authorization", "Bearer "+apiKey)
resp, err = http.DefaultClient.Do(req)
if err != nil {
fmt.Fprintf(os.Stderr, "delete failed: %v\n", err)
os.Exit(1)
}
defer resp.Body.Close()
fmt.Println("Session stopped")
```
## Reusable helper
For repeated use, wrap the API in a small helper:
```go theme={null}
type FirecrawlClient struct {
    APIKey  string
    BaseURL string
    Client  *http.Client
}

func NewFirecrawlClient(apiKey string) *FirecrawlClient {
    return &FirecrawlClient{
        APIKey:  apiKey,
        BaseURL: "https://api.firecrawl.dev/v2",
        Client:  &http.Client{},
    }
}

func (fc *FirecrawlClient) post(endpoint string, payload interface{}) ([]byte, error) {
    body, err := json.Marshal(payload)
    if err != nil {
        return nil, err
    }
    req, err := http.NewRequest("POST", fc.BaseURL+endpoint, bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    req.Header.Set("Authorization", "Bearer "+fc.APIKey)
    req.Header.Set("Content-Type", "application/json")
    resp, err := fc.Client.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return io.ReadAll(resp.Body)
}

func (fc *FirecrawlClient) Scrape(url string) ([]byte, error) {
    return fc.post("/scrape", map[string]string{"url": url})
}

func (fc *FirecrawlClient) Search(query string, limit int) ([]byte, error) {
    return fc.post("/search", map[string]interface{}{"query": query, "limit": limit})
}
```
A [community Go SDK](https://github.com/mendableai/firecrawl-go) is available for the v1 API. See the [Go SDK docs](/sdks/go) for details.
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Complete REST API documentation
# Hono
Source: https://docs.firecrawl.dev/quickstarts/hono
Use Firecrawl with Hono to build lightweight web scraping and search APIs that run anywhere.
## Prerequisites
* Node.js 18+, Bun, or Deno
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Setup
```bash theme={null}
npm install hono @mendable/firecrawl-js
```
Add your API key to `.env`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Search the web
```typescript theme={null}
import { Hono } from "hono";
import Firecrawl from "@mendable/firecrawl-js";
const app = new Hono();
const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
app.post("/search", async (c) => {
const { query } = await c.req.json();
const results = await firecrawl.search(query, { limit: 5 });
return c.json(results);
});
export default app;
```
## Scrape a page
```typescript theme={null}
app.post("/scrape", async (c) => {
const { url } = await c.req.json();
const result = await firecrawl.scrape(url);
return c.json(result);
});
```
## Interact with a page
Use interact to control a live browser session — click buttons, fill forms, and extract dynamic content.
```typescript theme={null}
app.post("/interact", async (c) => {
const { url } = await c.req.json();
const result = await firecrawl.scrape(url, { formats: ['markdown'] });
const scrapeId = result.metadata?.scrapeId;
await firecrawl.interact(scrapeId, { prompt: 'Search for iPhone 16 Pro Max' });
const response = await firecrawl.interact(scrapeId, { prompt: 'Click on the first result and tell me the price' });
await firecrawl.stopInteraction(scrapeId);
return c.json({ output: response.output });
});
```
## Deploy anywhere
Hono runs on multiple runtimes. For Cloudflare Workers, pass the API key from the environment binding:
```typescript theme={null}
import { Hono } from "hono";
import Firecrawl from "@mendable/firecrawl-js";
type Bindings = { FIRECRAWL_API_KEY: string };
const app = new Hono<{ Bindings: Bindings }>();
app.post("/search", async (c) => {
const firecrawl = new Firecrawl({ apiKey: c.env.FIRECRAWL_API_KEY });
const { query } = await c.req.json();
const results = await firecrawl.search(query, { limit: 5 });
return c.json(results);
});
export default app;
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Java
Source: https://docs.firecrawl.dev/quickstarts/java
Get started with Firecrawl in Java. Search, scrape, and interact with web data using the official SDK.
## Prerequisites
* Java 11+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Install the SDK
```kotlin theme={null}
dependencies {
    implementation("com.firecrawl:firecrawl-java:1.2.0")
}
```
```xml theme={null}
<dependency>
    <groupId>com.firecrawl</groupId>
    <artifactId>firecrawl-java</artifactId>
    <version>1.2.0</version>
</dependency>
```
## Search the web
```java theme={null}
import com.firecrawl.client.FirecrawlClient;
import com.firecrawl.models.SearchData;
import com.firecrawl.models.SearchOptions;

public class Main {
    public static void main(String[] args) {
        FirecrawlClient client = FirecrawlClient.builder()
            .apiKey("fc-YOUR-API-KEY")
            .build();

        SearchData results = client.search(
            "firecrawl web scraping",
            SearchOptions.builder().limit(5).build()
        );

        if (results.getWeb() != null) {
            for (var result : results.getWeb()) {
                System.out.println(result.get("title") + " - " + result.get("url"));
            }
        }
    }
}
```
## Scrape a page
```java theme={null}
import com.firecrawl.models.Document;
Document doc = client.scrape("https://example.com");
System.out.println(doc.getMarkdown());
```
```json theme={null}
{
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
  "metadata": {
    "title": "Example Domain",
    "sourceURL": "https://example.com"
  }
}
```
## Interact with a page
Open a browser session, run Playwright code against it, and close it when done:
```java theme={null}
import com.firecrawl.models.ScrapeOptions;
import com.firecrawl.models.BrowserExecuteResponse;
import java.util.List;

Document doc = client.scrape("https://www.amazon.com",
    ScrapeOptions.builder().formats(List.of((Object) "markdown")).build());
String scrapeId = (String) doc.getMetadata().get("scrapeId");

BrowserExecuteResponse run = client.interact(scrapeId,
    "const title = await page.title(); console.log(title);");
System.out.println(run.getStdout());

client.stopInteractiveBrowser(scrapeId);
```
## Environment variable
Instead of passing `apiKey` directly, set the `FIRECRAWL_API_KEY` environment variable:
```bash theme={null}
export FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
```java theme={null}
FirecrawlClient client = FirecrawlClient.fromEnv();
```
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Laravel
Source: https://docs.firecrawl.dev/quickstarts/laravel
Use Firecrawl with Laravel to search, scrape, and interact with web data using the REST API.
## Prerequisites
* Laravel 10+ project
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Configuration
Add your API key to `.env`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
Add the config entry to `config/services.php`:
```php theme={null}
'firecrawl' => [
    'api_key' => env('FIRECRAWL_API_KEY'),
    'base_url' => env('FIRECRAWL_API_URL', 'https://api.firecrawl.dev/v2'),
],
```
## Create a service class
Create `app/Services/FirecrawlService.php`:
```php theme={null}
<?php

namespace App\Services;

use Illuminate\Support\Facades\Http;

class FirecrawlService
{
    protected string $apiKey;
    protected string $baseUrl;

    public function __construct()
    {
        $this->apiKey = config('services.firecrawl.api_key');
        $this->baseUrl = config('services.firecrawl.base_url');
    }

    public function search(string $query, int $limit = 5): array
    {
        $response = Http::withToken($this->apiKey)
            ->post("{$this->baseUrl}/search", [
                'query' => $query,
                'limit' => $limit,
            ]);

        return $response->json();
    }

    public function scrape(string $url, array $options = []): array
    {
        $response = Http::withToken($this->apiKey)
            ->post("{$this->baseUrl}/scrape", array_merge(['url' => $url], $options));

        return $response->json();
    }

    public function interact(string $url, string $prompt, ?string $followUp = null): array
    {
        // 1. Scrape to open a browser session
        $scrapeResult = $this->scrape($url, ['formats' => ['markdown']]);
        $scrapeId = $scrapeResult['data']['metadata']['scrapeId'];

        // 2. Send first prompt
        Http::withToken($this->apiKey)
            ->post("{$this->baseUrl}/scrape/{$scrapeId}/interact", [
                'prompt' => $prompt,
            ]);

        // 3. Send follow-up prompt
        $result = null;
        if ($followUp) {
            $result = Http::withToken($this->apiKey)
                ->post("{$this->baseUrl}/scrape/{$scrapeId}/interact", [
                    'prompt' => $followUp,
                ])->json();
        }

        // 4. Close the session
        Http::withToken($this->apiKey)
            ->delete("{$this->baseUrl}/scrape/{$scrapeId}/interact");

        return $result ?? $scrapeResult;
    }
}
```
## Create a controller
Create `app/Http/Controllers/FirecrawlController.php`:
```php theme={null}
<?php

namespace App\Http\Controllers;

use App\Services\FirecrawlService;
use Illuminate\Http\Request;

class FirecrawlController extends Controller
{
    public function __construct(private FirecrawlService $firecrawl)
    {
    }

    public function search(Request $request)
    {
        $validated = $request->validate(['query' => 'required|string']);

        return response()->json(
            $this->firecrawl->search($validated['query'], $request->input('limit', 5))
        );
    }

    public function scrape(Request $request)
    {
        $validated = $request->validate(['url' => 'required|url']);

        return response()->json($this->firecrawl->scrape($validated['url']));
    }

    public function interact(Request $request)
    {
        $validated = $request->validate([
            'url' => 'required|url',
            'prompt' => 'required|string',
        ]);

        return response()->json(
            $this->firecrawl->interact(
                $validated['url'],
                $validated['prompt'],
                $request->input('followUp')
            )
        );
    }
}
```
## Register routes
In `routes/api.php`:
```php theme={null}
use App\Http\Controllers\FirecrawlController;
Route::post('/search', [FirecrawlController::class, 'search']);
Route::post('/scrape', [FirecrawlController::class, 'scrape']);
Route::post('/interact', [FirecrawlController::class, 'interact']);
```
## Test it
```bash theme={null}
php artisan serve
# Search the web
curl -X POST http://localhost:8000/api/search \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping"}'
# Scrape a page
curl -X POST http://localhost:8000/api/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
# Interact with a page
curl -X POST http://localhost:8000/api/interact \
-H "Content-Type: application/json" \
-d '{"url": "https://www.amazon.com", "prompt": "Search for iPhone 16 Pro Max", "followUp": "Click on the first result and tell me the price"}'
```
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Complete REST API documentation
# Mastra
Source: https://docs.firecrawl.dev/quickstarts/mastra
Wire Firecrawl into Mastra tools so your agents and workflows can search and scrape live web data.
## Prerequisites
* Node.js 22.13+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
* An API key from a supported [Mastra model provider](https://mastra.ai/models)
* An existing Mastra project — follow the [Mastra quickstart](https://mastra.ai/guides/getting-started/quickstart) to set one up
## Install the SDK
```bash theme={null}
npm install @mendable/firecrawl-js
```
## Set your API key
Add your API key to `.env`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Build the Firecrawl tools
Create `src/mastra/tools/firecrawl.ts` to expose search and scrape as Mastra tools:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";
import { createTool } from "@mastra/core/tools";
import { z } from "zod";
const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY! });
export const firecrawlSearch = createTool({
id: "firecrawl-search",
description: "Search the web and return top results.",
inputSchema: z.object({ query: z.string().min(1) }),
outputSchema: z.object({
results: z.array(
z.object({
title: z.string().nullable(),
url: z.string(),
}),
),
}),
execute: async ({ query }) => {
const results = await firecrawl.search(query, { limit: 3 });
return {
results: (results.web ?? []).map((item) => ({
title: item.title ?? null,
url: item.url,
})),
};
},
});
export const firecrawlScrape = createTool({
id: "firecrawl-scrape",
description: "Scrape a URL and return markdown content.",
inputSchema: z.object({ url: z.string().url() }),
outputSchema: z.object({ markdown: z.string() }),
execute: async ({ url }) => {
const result = await firecrawl.scrape(url, {
formats: ["markdown"],
onlyMainContent: true,
});
return { markdown: result.markdown ?? "" };
},
});
```
## Create the agent
Create `src/mastra/agents/web-agent.ts` and give it the Firecrawl tools:
```typescript theme={null}
import { Agent } from "@mastra/core/agent";
import { firecrawlSearch, firecrawlScrape } from "../tools/firecrawl";

export const webAgent = new Agent({
  id: "web-agent",
  name: "Web Agent",
  instructions:
    "Use Firecrawl tools to search and scrape web pages, then summarize the results.",
  model: "openai/gpt-5.4",
  tools: { firecrawlSearch, firecrawlScrape },
});
```
## Register the agent
Register the agent on your Mastra instance in `src/mastra/index.ts`:
```typescript theme={null}
import { Mastra } from "@mastra/core";
import { webAgent } from "./agents/web-agent";

export const mastra = new Mastra({
  agents: { webAgent },
});
```
## Test in Studio
Run the dev server and open [Mastra Studio](https://mastra.ai/docs/studio/overview):
```bash theme={null}
mastra dev
```
Open the **Web Agent** and try prompts like:
* "Find the latest Firecrawl changelog and summarize the last release."
* "Search for Firecrawl pricing and extract the plan tiers."
## Self-hosted Firecrawl
If you run Firecrawl locally, set `FIRECRAWL_API_URL` and pass `apiUrl` to the client:
```typescript theme={null}
const firecrawl = new Firecrawl({
  apiKey: process.env.FIRECRAWL_API_KEY!,
  apiUrl: process.env.FIRECRAWL_API_URL,
});
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Let an agent drive Firecrawl end to end
Full SDK reference with crawl, map, batch scrape, and more
# NestJS
Source: https://docs.firecrawl.dev/quickstarts/nestjs
Use Firecrawl with NestJS to build structured web scraping and search services.
## Prerequisites
* NestJS project (`@nestjs/cli`)
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Install the SDK
```bash theme={null}
npm install @mendable/firecrawl-js
```
Add your API key to `.env`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Create a Firecrawl service
Create `src/firecrawl/firecrawl.service.ts`:
```typescript theme={null}
import { Injectable } from "@nestjs/common";
import Firecrawl from "@mendable/firecrawl-js";

@Injectable()
export class FirecrawlService {
  private readonly client: Firecrawl;

  constructor() {
    this.client = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
  }

  async search(query: string, limit = 5) {
    return this.client.search(query, { limit });
  }

  async scrape(url: string) {
    return this.client.scrape(url);
  }

  async interact(url: string, prompts: string[]) {
    const result = await this.client.scrape(url, { formats: ['markdown'] });
    const scrapeId = result.metadata?.scrapeId;
    const responses = [];
    for (const prompt of prompts) {
      const response = await this.client.interact(scrapeId, { prompt });
      responses.push(response);
    }
    await this.client.stopInteraction(scrapeId);
    return responses;
  }
}
```
## Create a controller
Create `src/firecrawl/firecrawl.controller.ts`:
```typescript theme={null}
import { Body, Controller, Post } from "@nestjs/common";
import { FirecrawlService } from "./firecrawl.service";

@Controller("firecrawl")
export class FirecrawlController {
  constructor(private readonly firecrawlService: FirecrawlService) {}

  @Post("search")
  async search(@Body("query") query: string) {
    return this.firecrawlService.search(query);
  }

  @Post("scrape")
  async scrape(@Body("url") url: string) {
    return this.firecrawlService.scrape(url);
  }

  @Post("interact")
  async interact(@Body("url") url: string, @Body("prompts") prompts: string[]) {
    return this.firecrawlService.interact(url, prompts);
  }
}
```
## Register the module
Create `src/firecrawl/firecrawl.module.ts`:
```typescript theme={null}
import { Module } from "@nestjs/common";
import { FirecrawlService } from "./firecrawl.service";
import { FirecrawlController } from "./firecrawl.controller";

@Module({
  providers: [FirecrawlService],
  controllers: [FirecrawlController],
  exports: [FirecrawlService],
})
export class FirecrawlModule {}
```
Import `FirecrawlModule` in your `AppModule`.
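For reference, a minimal `AppModule` wiring might look like this (assuming the default `src/app.module.ts` layout generated by the Nest CLI):
```typescript theme={null}
import { Module } from "@nestjs/common";
import { FirecrawlModule } from "./firecrawl/firecrawl.module";

@Module({
  imports: [FirecrawlModule],
})
export class AppModule {}
```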
## Test it
```bash theme={null}
curl -X POST http://localhost:3000/firecrawl/search \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping"}'
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Next.js
Source: https://docs.firecrawl.dev/quickstarts/nextjs
Use Firecrawl with Next.js to scrape, search, and interact with web data in your React application.
## Prerequisites
* Next.js 14+ project (App Router)
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Install the SDK
```bash theme={null}
npm install @mendable/firecrawl-js
```
## Set your API key
Add your API key to `.env.local`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Search the web
The SDK should only run server-side since it requires your API key.
### Route Handler
Create `app/api/search/route.ts`:
```typescript theme={null}
import { NextResponse } from "next/server";
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

export async function POST(request: Request) {
  const { query } = await request.json();
  const results = await firecrawl.search(query, { limit: 5 });
  return NextResponse.json(results);
}
```
### Server Action
Create `app/actions.ts` for use from Client Components:
```typescript theme={null}
"use server";
import Firecrawl from "@mendable/firecrawl-js";
const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
export async function searchWeb(query: string) {
const results = await firecrawl.search(query, { limit: 5 });
return (results.web || []).map((r) => ({ title: r.title, url: r.url }));
}
```
## Scrape a page
### Route Handler
Create `app/api/scrape/route.ts`:
```typescript theme={null}
import { NextResponse } from "next/server";
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

export async function POST(request: Request) {
  const { url } = await request.json();
  const result = await firecrawl.scrape(url);
  return NextResponse.json(result);
}
```
### Server Component
Fetch data directly in a Server Component at `app/page.tsx`:
```tsx theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

export default async function Page() {
  const result = await firecrawl.scrape("https://example.com");
  return (
    <main>
      <h1>Scraped Content</h1>
      <pre>{result.markdown}</pre>
    </main>
  );
}
```
## Interact with a page
Use interact to control a live browser session — click buttons, fill forms, and extract dynamic content.
### Route Handler
Create `app/api/interact/route.ts`:
```typescript theme={null}
import { NextResponse } from "next/server";
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

export async function POST(request: Request) {
  const { url, prompts } = await request.json();
  const result = await firecrawl.scrape(url, { formats: ['markdown'] });
  const scrapeId = result.metadata?.scrapeId;
  await firecrawl.interact(scrapeId, { prompt: 'Search for iPhone 16 Pro Max' });
  const response = await firecrawl.interact(scrapeId, { prompt: 'Click on the first result and tell me the price' });
  await firecrawl.stopInteraction(scrapeId);
  return NextResponse.json({ output: response.output });
}
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Node.js
Source: https://docs.firecrawl.dev/quickstarts/nodejs
Get started with Firecrawl in Node.js. Scrape, search, and interact with web data using the official SDK.
## Prerequisites
* Node.js 18+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Install the SDK
```bash theme={null}
npm install @mendable/firecrawl-js
```
## Environment variable
Instead of passing `apiKey` directly, set the `FIRECRAWL_API_KEY` environment variable:
```bash theme={null}
export FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
```javascript theme={null}
const app = new Firecrawl();
```
## Search the web
```javascript theme={null}
import Firecrawl from '@mendable/firecrawl-js';

const app = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });

const results = await app.search("firecrawl web scraping", { limit: 5 });
for (const result of results.web) {
  console.log(result.title, result.url);
}
```
## Scrape a page
```javascript theme={null}
const result = await app.scrape("https://example.com");
console.log(result.markdown);
```
```json theme={null}
{
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
  "metadata": {
    "title": "Example Domain",
    "sourceURL": "https://example.com"
  }
}
```
## Interact with a page
Use interact to control a live browser session — click buttons, fill forms, and extract dynamic content.
```javascript theme={null}
const result = await app.scrape('https://www.amazon.com', { formats: ['markdown'] });
const scrapeId = result.metadata?.scrapeId;
await app.interact(scrapeId, { prompt: 'Search for iPhone 16 Pro Max' });
const response = await app.interact(scrapeId, { prompt: 'Click on the first result and tell me the price' });
console.log(response.output);
await app.stopInteraction(scrapeId);
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Nuxt
Source: https://docs.firecrawl.dev/quickstarts/nuxt
Use Firecrawl with Nuxt to scrape, search, and interact with web data in your Vue application.
## Prerequisites
* Nuxt 3+ project
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Install the SDK
```bash theme={null}
npm install @mendable/firecrawl-js
```
Add your API key to `.env`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Search the web
Create `server/api/search.post.ts`:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({
  apiKey: process.env.FIRECRAWL_API_KEY,
});

export default defineEventHandler(async (event) => {
  const { query } = await readBody(event);
  const results = await firecrawl.search(query, { limit: 5 });
  return results;
});
```
Call it from a Vue component:
```vue theme={null}
<script setup lang="ts">
const query = ref("");
const results = ref();

async function search() {
  results.value = await $fetch("/api/search", {
    method: "POST",
    body: { query: query.value },
  });
}
</script>

<template>
  <form @submit.prevent="search">
    <input v-model="query" placeholder="Search query" />
    <button type="submit">Search</button>
  </form>
</template>
```
## Scrape a page
Create `server/api/scrape.post.ts`:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({
  apiKey: process.env.FIRECRAWL_API_KEY,
});

export default defineEventHandler(async (event) => {
  const { url } = await readBody(event);
  const result = await firecrawl.scrape(url);
  return result;
});
```
Call it from a Vue component:
```vue theme={null}
<script setup lang="ts">
const { data } = await useFetch("/api/scrape", {
  method: "POST",
  body: { url: "https://example.com" },
});
</script>

<template>
  <pre>{{ data.markdown }}</pre>
</template>
```
## Interact with a page
Create `server/api/interact.post.ts`:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({
  apiKey: process.env.FIRECRAWL_API_KEY,
});

export default defineEventHandler(async (event) => {
  const result = await firecrawl.scrape("https://www.amazon.com", {
    formats: ["markdown"],
  });
  const scrapeId = result.metadata?.scrapeId;
  await firecrawl.interact(scrapeId, {
    prompt: "Search for iPhone 16 Pro Max",
  });
  const response = await firecrawl.interact(scrapeId, {
    prompt: "Click on the first result and tell me the price",
  });
  await firecrawl.stopInteraction(scrapeId);
  return { output: response.output };
});
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# PHP
Source: https://docs.firecrawl.dev/quickstarts/php
Get started with Firecrawl in PHP. Scrape, search, and interact with web data using the REST API.
## Prerequisites
* PHP 8.0+ with cURL extension
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Search the web
Firecrawl works with PHP through the REST API using cURL.
```php theme={null}
<?php

$apiKey = getenv('FIRECRAWL_API_KEY');

$ch = curl_init('https://api.firecrawl.dev/v2/search');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        'Authorization: Bearer ' . $apiKey,
        'Content-Type: application/json',
    ],
    CURLOPT_POSTFIELDS => json_encode([
        'query' => 'firecrawl web scraping',
        'limit' => 5,
    ]),
]);
$response = curl_exec($ch);
curl_close($ch);

$results = json_decode($response, true);
foreach ($results['data']['web'] as $result) {
    echo $result['title'] . ' - ' . $result['url'] . "\n";
}
```
```json theme={null}
{
  "success": true,
  "data": {
    "web": [
      {
        "url": "https://docs.firecrawl.dev",
        "title": "Firecrawl Documentation",
        "markdown": "# Firecrawl\n\nFirecrawl is a web scraping API..."
      }
    ]
  }
}
```
## Scrape a page
```php theme={null}
<?php

$apiKey = getenv('FIRECRAWL_API_KEY');

$ch = curl_init('https://api.firecrawl.dev/v2/scrape');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        'Authorization: Bearer ' . $apiKey,
        'Content-Type: application/json',
    ],
    CURLOPT_POSTFIELDS => json_encode([
        'url' => 'https://example.com',
    ]),
]);
$response = curl_exec($ch);
curl_close($ch);

$data = json_decode($response, true);
echo $data['data']['markdown'];
```
```json theme={null}
{
  "success": true,
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "metadata": {
      "title": "Example Domain",
      "sourceURL": "https://example.com"
    }
  }
}
```
## Interact with a page
Start a browser session, interact with the page using natural-language prompts, then close the session.
### Step 1 — Scrape to start a session
```php theme={null}
<?php

$apiKey = getenv('FIRECRAWL_API_KEY');

$ch = curl_init('https://api.firecrawl.dev/v2/scrape');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        'Authorization: Bearer ' . $apiKey,
        'Content-Type: application/json',
    ],
    CURLOPT_POSTFIELDS => json_encode([
        'url' => 'https://www.amazon.com',
        'formats' => ['markdown'],
    ]),
]);
$response = curl_exec($ch);
curl_close($ch);

$scrapeResult = json_decode($response, true);
$scrapeId = $scrapeResult['data']['metadata']['scrapeId'];
echo "scrapeId: $scrapeId\n";
```
### Step 2 — Send interactions
```php theme={null}
$headers = [
    'Authorization: Bearer ' . $apiKey,
    'Content-Type: application/json',
];
$interactUrl = "https://api.firecrawl.dev/v2/scrape/{$scrapeId}/interact";

// Search for a product
$ch = curl_init($interactUrl);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => $headers,
    CURLOPT_POSTFIELDS => json_encode([
        'prompt' => 'Search for iPhone 16 Pro Max',
    ]),
]);
$response = curl_exec($ch);
curl_close($ch);
echo $response . "\n";

// Click on the first result
$ch = curl_init($interactUrl);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => $headers,
    CURLOPT_POSTFIELDS => json_encode([
        'prompt' => 'Click on the first result and tell me the price',
    ]),
]);
$response = curl_exec($ch);
curl_close($ch);
echo $response . "\n";
```
### Step 3 — Stop the session
```php theme={null}
$ch = curl_init($interactUrl);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_CUSTOMREQUEST => 'DELETE',
    CURLOPT_HTTPHEADER => [
        'Authorization: Bearer ' . $apiKey,
    ],
]);
curl_exec($ch);
curl_close($ch);
echo "Session stopped\n";
```
## Reusable helper class
For repeated use, wrap the API in a class:
```php theme={null}
<?php

class Firecrawl
{
    private string $apiKey;
    private string $baseUrl = 'https://api.firecrawl.dev/v2';

    public function __construct(string $apiKey)
    {
        $this->apiKey = $apiKey;
    }

    private function post(string $endpoint, array $payload): array
    {
        $ch = curl_init($this->baseUrl . $endpoint);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_POST => true,
            CURLOPT_HTTPHEADER => [
                'Authorization: Bearer ' . $this->apiKey,
                'Content-Type: application/json',
            ],
            CURLOPT_POSTFIELDS => json_encode($payload),
        ]);
        $response = curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($httpCode >= 400) {
            throw new \RuntimeException("Firecrawl API error: HTTP $httpCode");
        }

        return json_decode($response, true);
    }

    public function scrape(string $url, array $options = []): array
    {
        return $this->post('/scrape', array_merge(['url' => $url], $options));
    }

    public function search(string $query, int $limit = 5): array
    {
        return $this->post('/search', ['query' => $query, 'limit' => $limit]);
    }
}

// Usage
$app = new Firecrawl(getenv('FIRECRAWL_API_KEY'));
$result = $app->scrape('https://example.com');
echo $result['data']['markdown'];
```
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Complete REST API documentation
# Python
Source: https://docs.firecrawl.dev/quickstarts/python
Get started with Firecrawl in Python. Scrape, search, and interact with web data using the official SDK.
## Prerequisites
* Python 3.8+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Install the SDK
```bash theme={null}
pip install firecrawl-py
```
## Search the web
```python theme={null}
from firecrawl import Firecrawl

app = Firecrawl(api_key="fc-YOUR-API-KEY")

results = app.search("firecrawl web scraping", limit=5)
for result in results.web:
    print(result.title, result.url)
```
## Scrape a page
```python theme={null}
result = app.scrape("https://example.com")
print(result.markdown)
```
```json theme={null}
{
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
  "metadata": {
    "title": "Example Domain",
    "sourceURL": "https://example.com"
  }
}
```
## Interact with a page
Use Interact to control a live browser session — click buttons, fill forms, and extract dynamic content.
```python theme={null}
result = app.scrape("https://www.amazon.com", formats=["markdown"])
scrape_id = result.metadata.scrape_id
app.interact(scrape_id, prompt="Search for iPhone 16 Pro Max")
response = app.interact(scrape_id, prompt="Click on the first result and tell me the price")
print(response.output)
app.stop_interaction(scrape_id)
```
## Environment variable
Instead of passing `api_key` directly, set the `FIRECRAWL_API_KEY` environment variable:
```bash theme={null}
export FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
```python theme={null}
app = Firecrawl()
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, async, and more
# Rails
Source: https://docs.firecrawl.dev/quickstarts/rails
Use Firecrawl with Ruby on Rails to search, scrape, and interact with web data using the REST API.
## Prerequisites
* Ruby 3.0+ and Rails 7+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Configuration
Add your API key to your Rails credentials or environment:
```bash theme={null}
export FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Create a service
Create `app/services/firecrawl_service.rb`:
```ruby theme={null}
require "net/http"
require "json"
require "uri"
class FirecrawlService
BASE_URL = "https://api.firecrawl.dev/v2"
def initialize(api_key: ENV.fetch("FIRECRAWL_API_KEY"))
@api_key = api_key
end
def search(query, limit: 5)
post("/search", { query: query, limit: limit })
end
def scrape(url, **options)
post("/scrape", { url: url }.merge(options))
end
def interact(url, prompt, follow_up: nil)
# 1. Scrape to open a browser session
scrape_result = scrape(url, formats: ["markdown"])
scrape_id = scrape_result.dig("data", "metadata", "scrapeId")
# 2. Send first prompt
post("/scrape/#{scrape_id}/interact", { prompt: prompt })
# 3. Send follow-up prompt
result = nil
if follow_up
result = post("/scrape/#{scrape_id}/interact", { prompt: follow_up })
end
# 4. Close the session
delete("/scrape/#{scrape_id}/interact")
result || scrape_result
end
private
def post(endpoint, payload)
uri = URI("#{BASE_URL}#{endpoint}")
request = Net::HTTP::Post.new(uri)
request["Authorization"] = "Bearer #{@api_key}"
request["Content-Type"] = "application/json"
request.body = payload.to_json
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
http.request(request)
end
JSON.parse(response.body)
end
def delete(endpoint)
uri = URI("#{BASE_URL}#{endpoint}")
request = Net::HTTP::Delete.new(uri)
request["Authorization"] = "Bearer #{@api_key}"
Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
http.request(request)
end
end
end
```
## Create a controller
Generate a controller:
```bash theme={null}
rails generate controller Firecrawl search scrape interact --skip-routes
```
Edit `app/controllers/firecrawl_controller.rb`:
```ruby theme={null}
class FirecrawlController < ApplicationController
  skip_before_action :verify_authenticity_token

  def search
    service = FirecrawlService.new
    result = service.search(params.require(:query), limit: params.fetch(:limit, 5).to_i)
    render json: result
  end

  def scrape
    service = FirecrawlService.new
    result = service.scrape(params.require(:url))
    render json: result
  end

  def interact
    service = FirecrawlService.new
    result = service.interact(
      params.require(:url),
      params.require(:prompt),
      follow_up: params[:followUp]
    )
    render json: result
  end
end
```
## Add routes
In `config/routes.rb`:
```ruby theme={null}
Rails.application.routes.draw do
  post "api/search", to: "firecrawl#search"
  post "api/scrape", to: "firecrawl#scrape"
  post "api/interact", to: "firecrawl#interact"
end
```
## Test it
```bash theme={null}
rails server
# Search the web
curl -X POST http://localhost:3000/api/search \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping"}'
# Scrape a page
curl -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
# Interact with a page
curl -X POST http://localhost:3000/api/interact \
-H "Content-Type: application/json" \
-d '{"url": "https://www.amazon.com", "prompt": "Search for iPhone 16 Pro Max", "followUp": "Click on the first result and tell me the price"}'
```
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Complete REST API documentation
# Remix
Source: https://docs.firecrawl.dev/quickstarts/remix
Use Firecrawl with Remix to scrape, search, and interact with web data in your full-stack React app.
## Prerequisites
* Remix project
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Install the SDK
```bash theme={null}
npm install @mendable/firecrawl-js
```
Add your API key to `.env`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Search the web
Use Firecrawl in an `action` to handle form submissions. Create `app/routes/search.tsx`:
```tsx theme={null}
import { json, type ActionFunctionArgs } from "@remix-run/node";
import { Form, useActionData } from "@remix-run/react";
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

export async function action({ request }: ActionFunctionArgs) {
  const formData = await request.formData();
  const query = formData.get("query") as string;
  const results = await firecrawl.search(query, { limit: 5 });
  return json({ results: (results.web || []).map((r) => ({ title: r.title, url: r.url })) });
}

export default function SearchPage() {
  const data = useActionData<typeof action>();
  return (
    <Form method="post">
      <input name="query" placeholder="Search query" />
      <button type="submit">Search</button>
      <ul>
        {data?.results?.map((r) => (
          <li key={r.url}>
            <a href={r.url}>{r.title}</a>
          </li>
        ))}
      </ul>
    </Form>
  );
}
```
## Scrape a page
Use Firecrawl in a `loader` to fetch data at request time. Create `app/routes/scrape.tsx`:
```tsx theme={null}
import { json, type LoaderFunctionArgs } from "@remix-run/node";
import { useLoaderData } from "@remix-run/react";
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

export async function loader({ request }: LoaderFunctionArgs) {
  const url = new URL(request.url);
  const target = url.searchParams.get("url");
  if (!target) return json({ markdown: null });
  const result = await firecrawl.scrape(target);
  return json({ markdown: result.markdown });
}

export default function ScrapePage() {
  const { markdown } = useLoaderData<typeof loader>();
  return (
    <main>
      <h1>Scraped Content</h1>
      {markdown ? (
        <pre>{markdown}</pre>
      ) : (
        <p>Pass ?url= to scrape a page</p>
      )}
    </main>
  );
}
```
## Interact with a page
Use interact to control a live browser session — click buttons, fill forms, and extract dynamic content. Create `app/routes/interact.tsx`:
```tsx theme={null}
import { json, type ActionFunctionArgs } from "@remix-run/node";
import { Form, useActionData } from "@remix-run/react";
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

export async function action({ request }: ActionFunctionArgs) {
  const formData = await request.formData();
  const url = formData.get("url") as string;
  const result = await firecrawl.scrape(url, { formats: ['markdown'] });
  const scrapeId = result.metadata?.scrapeId;
  await firecrawl.interact(scrapeId, { prompt: 'Search for iPhone 16 Pro Max' });
  const response = await firecrawl.interact(scrapeId, { prompt: 'Click on the first result and tell me the price' });
  await firecrawl.stopInteraction(scrapeId);
  return json({ output: response.output });
}

export default function InteractPage() {
  const data = useActionData<typeof action>();
  return (
    <Form method="post">
      <input name="url" placeholder="https://www.amazon.com" />
      <button type="submit">Run</button>
      {data?.output && <pre>{data.output}</pre>}
    </Form>
  );
}
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Ruby
Source: https://docs.firecrawl.dev/quickstarts/ruby
Get started with Firecrawl in Ruby. Search, scrape, and interact with web data using the REST API.
## Prerequisites
* Ruby 3.0+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Search the web
Firecrawl works with Ruby through the REST API using `net/http`.
```ruby theme={null}
require "net/http"
require "json"
require "uri"
api_key = ENV.fetch("FIRECRAWL_API_KEY")
uri = URI("https://api.firecrawl.dev/v2/search")
request = Net::HTTP::Post.new(uri)
request["Authorization"] = "Bearer #{api_key}"
request["Content-Type"] = "application/json"
request.body = { query: "firecrawl web scraping", limit: 5 }.to_json
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }
results = JSON.parse(response.body)
results["data"]["web"].each do |result|
puts "#{result['title']} - #{result['url']}"
end
```
## Scrape a page
```ruby theme={null}
uri = URI("https://api.firecrawl.dev/v2/scrape")
request = Net::HTTP::Post.new(uri)
request["Authorization"] = "Bearer #{api_key}"
request["Content-Type"] = "application/json"
request.body = { url: "https://example.com" }.to_json
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }
data = JSON.parse(response.body)
puts data.dig("data", "markdown")
```
```json theme={null}
{
  "success": true,
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "metadata": {
      "title": "Example Domain",
      "sourceURL": "https://example.com"
    }
  }
}
```
## Interact with a page
Scrape a page, then keep working with it using natural language prompts.
```ruby theme={null}
uri = URI("https://api.firecrawl.dev/v2/scrape")
request = Net::HTTP::Post.new(uri)
request["Authorization"] = "Bearer #{api_key}"
request["Content-Type"] = "application/json"
request.body = { url: "https://www.amazon.com", formats: ["markdown"] }.to_json
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }
scrape_id = JSON.parse(response.body).dig("data", "metadata", "scrapeId")
interact_uri = URI("https://api.firecrawl.dev/v2/scrape/#{scrape_id}/interact")
interact_req = Net::HTTP::Post.new(interact_uri)
interact_req["Authorization"] = "Bearer #{api_key}"
interact_req["Content-Type"] = "application/json"
interact_req.body = { prompt: "Search for iPhone 16 Pro Max" }.to_json
interact_resp = Net::HTTP.start(interact_uri.hostname, interact_uri.port, use_ssl: true) { |http| http.request(interact_req) }
puts JSON.parse(interact_resp.body)
# Stop the session
delete_uri = URI("https://api.firecrawl.dev/v2/scrape/#{scrape_id}/interact")
delete_req = Net::HTTP::Delete.new(delete_uri)
delete_req["Authorization"] = "Bearer #{api_key}"
Net::HTTP.start(delete_uri.hostname, delete_uri.port, use_ssl: true) { |http| http.request(delete_req) }
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Complete REST API documentation
Click, fill forms, and extract dynamic content
# Rust
Source: https://docs.firecrawl.dev/quickstarts/rust
Get started with Firecrawl in Rust. Search, scrape, and interact with web data using the official SDK.
## Prerequisites
* Rust 1.70+ with Cargo
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Install the crate
Add `firecrawl` to your `Cargo.toml`:
```toml theme={null}
[dependencies]
firecrawl = "2"
tokio = { version = "1", features = ["full"] }
serde_json = "1"
```
## Search the web
```rust theme={null}
use firecrawl::{Client, SearchOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new("fc-YOUR-API-KEY")?;

    let results = client.search(
        "firecrawl web scraping",
        SearchOptions { limit: Some(5), ..Default::default() },
    ).await?;

    if let Some(web) = results.data.web {
        for item in web {
            if let firecrawl::SearchResultOrDocument::WebResult(r) = item {
                println!("{} - {}", r.url, r.title.unwrap_or_default());
            }
        }
    }

    Ok(())
}
```
## Scrape a page
```rust theme={null}
let doc = client.scrape("https://example.com", None).await?;
println!("{}", doc.markdown.unwrap_or_default());
```
```json theme={null}
{
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
  "metadata": {
    "title": "Example Domain",
    "sourceURL": "https://example.com"
  }
}
```
## Interact with a page
Scrape a page to get a `scrapeId`, then use the interact API to control the browser session:
```rust theme={null}
use firecrawl::{Client, ScrapeOptions, Format, ScrapeExecuteOptions};

let doc = client.scrape(
    "https://www.amazon.com",
    ScrapeOptions {
        formats: Some(vec![Format::Markdown]),
        ..Default::default()
    },
).await?;

let scrape_id = doc.metadata
    .as_ref()
    .and_then(|m| m.scrape_id.as_deref())
    .expect("scrapeId not found");

// Send a prompt to interact with the page
client.interact(
    scrape_id,
    ScrapeExecuteOptions {
        prompt: Some("Search for iPhone 16 Pro Max".to_string()),
        ..Default::default()
    },
).await?;

let run = client.interact(
    scrape_id,
    ScrapeExecuteOptions {
        prompt: Some("Click on the first result and tell me the price".to_string()),
        ..Default::default()
    },
).await?;
println!("{:?}", run.output);

// Close the session
client.stop_interaction(scrape_id).await?;
```
## Environment variable
Set `FIRECRAWL_API_KEY` instead of passing the key directly:
```bash theme={null}
export FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
```rust theme={null}
let api_key = std::env::var("FIRECRAWL_API_KEY")?;
let client = Client::new(api_key)?;
```
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Spring Boot
Source: https://docs.firecrawl.dev/quickstarts/spring-boot
Use Firecrawl with Spring Boot to search, scrape, and interact with web data using the official Java SDK.
## Prerequisites
* Java 17+ and Spring Boot 3+
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Add the dependency
```kotlin theme={null}
dependencies {
    implementation("com.firecrawl:firecrawl-java:1.2.0")
}
```
```xml theme={null}
<dependency>
    <groupId>com.firecrawl</groupId>
    <artifactId>firecrawl-java</artifactId>
    <version>1.2.0</version>
</dependency>
```
## Configuration
Add your API key to `application.properties`:
```properties theme={null}
firecrawl.api-key=${FIRECRAWL_API_KEY}
```
Or set it as an environment variable:
```bash theme={null}
export FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Create a configuration bean
Create `FirecrawlConfig.java`:
```java theme={null}
import com.firecrawl.client.FirecrawlClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class FirecrawlConfig {

    @Bean
    public FirecrawlClient firecrawlClient(
            @Value("${firecrawl.api-key}") String apiKey) {
        return FirecrawlClient.builder()
            .apiKey(apiKey)
            .build();
    }
}
```
## Create a REST controller
Create `FirecrawlController.java`:
```java theme={null}
import com.firecrawl.client.FirecrawlClient;
import com.firecrawl.models.Document;
import com.firecrawl.models.SearchData;
import com.firecrawl.models.SearchOptions;
import com.firecrawl.models.ScrapeOptions;
import com.firecrawl.models.BrowserExecuteResponse;
import org.springframework.web.bind.annotation.*;
import java.util.List;
import java.util.Map;

@RestController
@RequestMapping("/api")
public class FirecrawlController {

    private final FirecrawlClient firecrawl;

    public FirecrawlController(FirecrawlClient firecrawl) {
        this.firecrawl = firecrawl;
    }

    @PostMapping("/search")
    public SearchData search(@RequestBody Map<String, Object> body) {
        return firecrawl.search(
            (String) body.get("query"),
            SearchOptions.builder()
                .limit((int) body.getOrDefault("limit", 5))
                .build()
        );
    }

    @PostMapping("/scrape")
    public Map<String, Object> scrape(@RequestBody Map<String, String> body) {
        Document doc = firecrawl.scrape(body.get("url"));
        return Map.of(
            "markdown", doc.getMarkdown(),
            "metadata", doc.getMetadata()
        );
    }

    @PostMapping("/interact")
    public Map<String, String> interact(@RequestBody Map<String, String> body) {
        Document doc = firecrawl.scrape(body.get("url"),
            ScrapeOptions.builder().formats(List.of((Object) "markdown")).build());
        String scrapeId = (String) doc.getMetadata().get("scrapeId");
        BrowserExecuteResponse response = firecrawl.interact(scrapeId,
            body.getOrDefault("code", "const title = await page.title(); console.log(title);"));
        firecrawl.stopInteractiveBrowser(scrapeId);
        return Map.of("result", response.getStdout());
    }
}
```
## Run it
```bash theme={null}
./gradlew bootRun
```
## Test it
```bash theme={null}
# Search the web
curl -X POST http://localhost:8080/api/search \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping"}'
# Scrape a page
curl -X POST http://localhost:8080/api/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
# Interact with a page
curl -X POST http://localhost:8080/api/interact \
-H "Content-Type: application/json" \
-d '{"url": "https://www.amazon.com", "code": "const title = await page.title(); console.log(title);"}'
```
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Supabase Edge Functions
Source: https://docs.firecrawl.dev/quickstarts/supabase-edge-functions
Use Firecrawl with Supabase Edge Functions to search, scrape, and interact with web data at the edge.
## Prerequisites
* Supabase project with CLI (`supabase init`)
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Setup
```bash theme={null}
supabase functions new firecrawl-search
supabase functions new firecrawl-scrape
supabase functions new firecrawl-interact
```
Set the secret:
```bash theme={null}
supabase secrets set FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Search the web
Edit `supabase/functions/firecrawl-search/index.ts`:
```typescript theme={null}
import Firecrawl from "npm:@mendable/firecrawl-js";
const firecrawl = new Firecrawl({
apiKey: Deno.env.get("FIRECRAWL_API_KEY"),
});
Deno.serve(async (req) => {
const { query } = await req.json();
const results = await firecrawl.search(query, { limit: 5 });
return new Response(JSON.stringify(results), {
headers: { "Content-Type": "application/json" },
});
});
```
## Scrape a page
Edit `supabase/functions/firecrawl-scrape/index.ts`:
```typescript theme={null}
import Firecrawl from "npm:@mendable/firecrawl-js";
const firecrawl = new Firecrawl({
apiKey: Deno.env.get("FIRECRAWL_API_KEY"),
});
Deno.serve(async (req) => {
const { url } = await req.json();
const result = await firecrawl.scrape(url);
return new Response(JSON.stringify(result), {
headers: { "Content-Type": "application/json" },
});
});
```
## Interact with a page
Edit `supabase/functions/firecrawl-interact/index.ts`:
```typescript theme={null}
import Firecrawl from "npm:@mendable/firecrawl-js";
const firecrawl = new Firecrawl({
apiKey: Deno.env.get("FIRECRAWL_API_KEY"),
});
Deno.serve(async (_req) => {
const result = await firecrawl.scrape("https://www.amazon.com", {
formats: ["markdown"],
});
const scrapeId = result.metadata?.scrapeId;
await firecrawl.interact(scrapeId, {
prompt: "Search for iPhone 16 Pro Max",
});
const response = await firecrawl.interact(scrapeId, {
prompt: "Click on the first result and tell me the price",
});
console.log(response.output);
await firecrawl.stopInteraction(scrapeId);
return new Response(JSON.stringify({ output: response.output }), {
headers: { "Content-Type": "application/json" },
});
});
```
## Deploy
```bash theme={null}
supabase functions deploy firecrawl-search
supabase functions deploy firecrawl-scrape
supabase functions deploy firecrawl-interact
```
## Test it
```bash theme={null}
curl -X POST https://<project-ref>.supabase.co/functions/v1/firecrawl-search \
-H "Authorization: Bearer <anon-key>" \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping"}'
```
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# SvelteKit
Source: https://docs.firecrawl.dev/quickstarts/sveltekit
Use Firecrawl with SvelteKit to scrape, search, and interact with web data in your Svelte application.
## Prerequisites
* SvelteKit project
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Install the SDK
```bash theme={null}
npm install @mendable/firecrawl-js
```
Add your API key to `.env`:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Search the web
Create a form action in `src/routes/search/+page.server.ts`:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";
import { FIRECRAWL_API_KEY } from "$env/static/private";

const firecrawl = new Firecrawl({ apiKey: FIRECRAWL_API_KEY });

export const actions = {
  default: async ({ request }) => {
    const data = await request.formData();
    const query = data.get("query") as string;
    const results = await firecrawl.search(query, { limit: 5 });
    return { results: (results.web || []).map((r) => ({ title: r.title, url: r.url })) };
  },
};
```
Wire it up in `src/routes/search/+page.svelte`:
```svelte theme={null}
<script lang="ts">
  export let form;
</script>

<form method="POST">
  <input name="query" placeholder="Search query" />
  <button type="submit">Search</button>
</form>

{#if form?.results}
  <ul>
    {#each form.results as result}
      <li><a href={result.url}>{result.title}</a></li>
    {/each}
  </ul>
{/if}
```
## Scrape a page
Fetch data in a load function at `src/routes/scrape/+page.server.ts`:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";
import { FIRECRAWL_API_KEY } from "$env/static/private";

const firecrawl = new Firecrawl({ apiKey: FIRECRAWL_API_KEY });

export async function load({ url }) {
  const target = url.searchParams.get("url");
  if (!target) return { markdown: null };
  const result = await firecrawl.scrape(target);
  return { markdown: result.markdown };
}
```
Display it in `src/routes/scrape/+page.svelte`:
```svelte theme={null}
<script lang="ts">
  export let data;
</script>

{#if data.markdown}
  <pre>{data.markdown}</pre>
{:else}
  <p>Pass ?url= to scrape a page</p>
{/if}
```
## Interact with a page
Create a server endpoint at `src/routes/api/interact/+server.ts`:
```typescript theme={null}
import { json } from "@sveltejs/kit";
import Firecrawl from "@mendable/firecrawl-js";
import { FIRECRAWL_API_KEY } from "$env/static/private";

const firecrawl = new Firecrawl({ apiKey: FIRECRAWL_API_KEY });

export async function POST() {
  const result = await firecrawl.scrape("https://www.amazon.com", {
    formats: ["markdown"],
  });
  const scrapeId = result.metadata?.scrapeId;
  await firecrawl.interact(scrapeId, {
    prompt: "Search for iPhone 16 Pro Max",
  });
  const response = await firecrawl.interact(scrapeId, {
    prompt: "Click on the first result and tell me the price",
  });
  await firecrawl.stopInteraction(scrapeId);
  return json({ output: response.output });
}
```
## Next steps
All scrape options including formats, actions, and proxies
Search the web and get full page content
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Vercel Functions
Source: https://docs.firecrawl.dev/quickstarts/vercel-functions
Use Firecrawl with Vercel Functions to search, scrape, and interact with web data in serverless deployments.
## Prerequisites
* Vercel project (Next.js, SvelteKit, Nuxt, or standalone)
* A Firecrawl API key — [get one free](https://www.firecrawl.dev/app/api-keys)
## Setup
```bash theme={null}
npm install @mendable/firecrawl-js
```
Add `FIRECRAWL_API_KEY` as an environment variable in your Vercel project settings, or in `.env.local` for local development:
```bash theme={null}
FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
## Search the web
Create `api/search.ts` (or `app/api/search/route.ts` for Next.js):
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

export async function POST(request: Request) {
  const { query } = await request.json();
  const results = await firecrawl.search(query, { limit: 5 });
  return new Response(JSON.stringify(results), {
    headers: { "Content-Type": "application/json" },
  });
}
```
## Scrape a page
Create `api/scrape.ts` (or `app/api/scrape/route.ts` for Next.js):
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

export async function POST(request: Request) {
  const { url } = await request.json();
  const result = await firecrawl.scrape(url);
  return new Response(JSON.stringify(result), {
    headers: { "Content-Type": "application/json" },
  });
}
```
## Interact with a page
Create `api/interact.ts` (or `app/api/interact/route.ts` for Next.js):
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

export async function POST(request: Request) {
  const result = await firecrawl.scrape("https://www.amazon.com", {
    formats: ["markdown"],
  });
  const scrapeId = result.metadata?.scrapeId;
  await firecrawl.interact(scrapeId, {
    prompt: "Search for iPhone 16 Pro Max",
  });
  const response = await firecrawl.interact(scrapeId, {
    prompt: "Click on the first result and tell me the price",
  });
  await firecrawl.stopInteraction(scrapeId);
  return new Response(JSON.stringify({ output: response.output }), {
    headers: { "Content-Type": "application/json" },
  });
}
```
## Deploy
```bash theme={null}
vercel deploy
```
## Test it
```bash theme={null}
curl -X POST https://your-project.vercel.app/api/search \
-H "Content-Type: application/json" \
-d '{"query": "firecrawl web scraping"}'
```
Vercel Functions have a default timeout of 10 seconds on the Hobby plan and 60 seconds on Pro. For large crawl jobs, use the Firecrawl async API with webhooks instead.
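As a rough sketch, a function can start the crawl and return immediately, letting Firecrawl deliver results to a webhook. The `webhook` option shape and the `/api/firecrawl-webhook` receiver URL below are assumptions to adapt to your project:
```typescript theme={null}
// api/crawl.ts — start a crawl and return right away; results arrive via webhook.
export async function POST(request: Request) {
  const { url } = await request.json();
  const res = await fetch("https://api.firecrawl.dev/v2/crawl", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      limit: 100,
      // Assumed webhook receiver in this project; Firecrawl POSTs crawl events here.
      webhook: { url: "https://your-project.vercel.app/api/firecrawl-webhook" },
    }),
  });
  const { id } = await res.json();
  return Response.json({ jobId: id });
}
```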
## Next steps
Search the web and get full page content
All scrape options including formats, actions, and proxies
Click, fill forms, and extract dynamic content
Full SDK reference with crawl, map, batch scrape, and more
# Rate Limits
Source: https://docs.firecrawl.dev/rate-limits
Rate limits for different pricing plans and API requests
When you exceed a rate or concurrency limit, the API returns a `429` response. See [Errors](/api-reference/errors) for the full error catalog and a retry-with-backoff snippet.
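If you prefer to handle `429`s inline, a minimal exponential-backoff wrapper (illustrative and not Firecrawl-specific) looks like this:
```typescript theme={null}
// Retry a request with exponential backoff whenever the API answers 429.
async function fetchWithBackoff(url: string, init: RequestInit, maxRetries = 5): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429) return res;
    // Wait 1s, 2s, 4s, ... between attempts.
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
  }
  throw new Error("Rate limited: retries exhausted");
}
```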
## Billing Model
Firecrawl uses subscription-based monthly plans. There is no pure pay-as-you-go model, but the **auto-recharge feature** provides flexible scaling. Once you subscribe to a plan, you can automatically purchase additional credits when you dip below a threshold. Larger auto-recharge packs offer better rates. To test before committing to a larger plan, start with the Free or Hobby tier.
Plan downgrades take effect at the next renewal. Unused-time credits are not issued.
## Concurrent Browser Limits
Concurrent browsers control how many pages Firecrawl can process for you in parallel. Your plan sets the ceiling; any jobs beyond it wait in a queue until a browser frees up.
Time spent in the queue counts against the request's [`timeout`](/advanced-scraping-guide#timing-and-cache) parameter, so you can set a lower timeout to fail fast instead of waiting. To see current availability before sending work, call the [Queue Status](/api-reference/endpoint/queue-status) endpoint. Jobs that are waiting in your concurrency queue will time out after a maximum of 48 hours.
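For example, a sketch of failing fast with the Node SDK, assuming the `timeout` scrape option takes milliseconds as described in the scrape options reference:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

// If no browser frees up within 15 seconds (queue time included),
// the request errors instead of waiting in the queue.
const doc = await firecrawl.scrape("https://example.com", { timeout: 15000 });
```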
### Current Plans
| Plan | Concurrent Browsers | Max Queued Jobs |
| ------------------ | ------------------- | --------------- |
| Free | 2 | 50,000 |
| Hobby | 5 | 50,000 |
| Standard | 50 | 100,000 |
| Growth | 100 | 200,000 |
| Scale / Enterprise | 150+ | 300,000+ |
Each team has a maximum number of jobs that can be waiting in the concurrency queue. If you exceed this limit, new jobs will be rejected with a `429` status code until existing jobs complete. For larger plans with custom concurrency limits, the max queued jobs is 2,000 times your concurrency limit, capped at 2,000,000. For example, a custom plan with 500 concurrent browsers allows up to 1,000,000 queued jobs.
If you require higher concurrency limits, [contact us about enterprise plans](https://firecrawl.dev/enterprise).
### Extract Plans (Legacy)
| Plan | Concurrent Browsers | Max Queued Jobs |
| -------- | ------------------- | --------------- |
| Free | 2 | 50,000 |
| Starter | 50 | 100,000 |
| Explorer | 100 | 200,000 |
| Pro | 200 | 400,000 |
## API Rate Limits
Rate limits are measured in requests per minute and exist primarily to prevent abuse; in practice, your real bottleneck will be concurrent browsers, not rate limits. Rate limits are applied per team, so all API keys on the same team share the same rate limit counters.
### Current Plans
| Plan | /scrape | /map | /crawl | /search | /agent | /crawl/status | /agent/status |
| -------- | ------- | ---- | ------ | ------- | ------ | ------------- | ------------- |
| Free | 10 | 10 | 1 | 5 | 10 | 1500 | 500 |
| Hobby | 100 | 100 | 15 | 50 | 100 | 1500 | 25000 |
| Standard | 500 | 500 | 50 | 250 | 500 | 1500 | 25000 |
| Growth | 5000 | 5000 | 250 | 2500 | 1000 | 1500 | 25000 |
| Scale | 7500 | 7500 | 750 | 7500 | 1000 | 25000 | 25000 |
These rate limits are enforced to ensure fair usage and availability of the API for all users. If you require higher limits, please contact us at [help@firecrawl.com](mailto:help@firecrawl.com) to discuss custom plans.
### Extract Endpoints
The extract endpoints share the corresponding `/agent` rate limits.
### Batch Scrape Endpoints
The batch scrape endpoints share the corresponding `/crawl` rate limits.
### Browser Sandbox
The browser sandbox endpoints have per-plan rate limits that scale with your subscription:
| Plan | /browser | /browser/\{id}/execute |
| -------- | -------- | ---------------------- |
| Free | 2 | 10 |
| Hobby | 20 | 100 |
| Standard | 100 | 500 |
| Growth | 1,000 | 5,000 |
| Scale | 1,500 | 7,500 |
In addition, each team's plan determines how many browser sessions can be active simultaneously (see Concurrent Browser Limits above). If you exceed this limit, new session requests will return a `429` status code until existing sessions are destroyed.
### FIRE-1 Agent
Requests involving the FIRE-1 agent have separate rate limits that are counted independently for each endpoint:

| Endpoint | Rate Limit (requests/min) |
| ---------- | ------------------------- |
| `/scrape` | 10 |
| `/extract` | 10 |
### Extract Plans (Legacy)
| Plan | /extract (requests/min) | /extract/status (requests/min) |
| -------- | ----------------------- | ------------------------------ |
| Starter | 100 | 25000 |
| Explorer | 500 | 25000 |
| Pro | 1000 | 25000 |
# Skill + CLI
Source: https://docs.firecrawl.dev/sdks/cli
The Firecrawl Skill is an easy way for AI agents such as Claude Code, Antigravity, and OpenCode to use Firecrawl through the CLI.
Search, scrape, and interact with the web directly from the terminal. The Firecrawl CLI works standalone or as a skill that AI coding agents like Claude Code, Antigravity, and OpenCode can discover and use automatically.
## Installation
If you are using an AI agent like Claude Code, you can install the Firecrawl skill with the command below and the agent will set it up for you.
```bash theme={null}
npx -y firecrawl-cli@latest init --all --browser
```
* `--all` installs the Firecrawl skill to every detected AI coding agent
* `--browser` opens the browser for Firecrawl authentication automatically
After installing the skill, restart your agent for it to discover the new skill.
You can also manually install the Firecrawl CLI globally using npm:
```bash CLI theme={null}
# Install globally with npm
npm install -g firecrawl-cli
```
## Authentication
Before using the CLI, you need to authenticate with your Firecrawl API key.
### Login
```bash CLI theme={null}
# Interactive login (opens browser or prompts for API key)
firecrawl login
# Login with browser authentication (recommended for agents)
firecrawl login --browser
# Login with API key directly
firecrawl login --api-key fc-YOUR-API-KEY
# Or set via environment variable
export FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
### View Configuration
```bash CLI theme={null}
# View current configuration and authentication status
firecrawl view-config
```
### Logout
```bash CLI theme={null}
# Clear stored credentials
firecrawl logout
```
### Self-Hosted / Local Development
For self-hosted Firecrawl instances or local development, use the `--api-url` option:
```bash CLI theme={null}
# Use a local Firecrawl instance (no API key required)
firecrawl --api-url http://localhost:3002 scrape https://example.com
# Or set via environment variable
export FIRECRAWL_API_URL=http://localhost:3002
firecrawl scrape https://example.com
# Configure and persist the custom API URL
firecrawl config --api-url http://localhost:3002
```
When using a custom API URL (anything other than `https://api.firecrawl.dev`), API key authentication is automatically skipped, allowing you to use local instances without an API key.
### Check Status
Verify installation, authentication, and view rate limits:
```bash CLI theme={null}
firecrawl --status
```
Output when ready:
```
🔥 firecrawl cli v1.1.1
● Authenticated via FIRECRAWL_API_KEY
Concurrency: 0/100 jobs (parallel scrape limit)
Credits: 500,000 remaining
```
* **Concurrency**: Maximum parallel jobs. Run parallel operations close to this limit but not above.
* **Credits**: Remaining API credits. Each scrape/crawl consumes credits.
## Commands
### Scrape
Scrape a single URL and extract its content in various formats.
Use `--only-main-content` to get clean output without navigation, footers, and ads. This is recommended for most use cases where you want just the article or main page content.
```bash CLI theme={null}
# Scrape a URL (default: markdown output)
firecrawl https://example.com
# Or use the explicit scrape command
firecrawl scrape https://example.com
# Recommended: use --only-main-content for clean output without nav/footer
firecrawl https://example.com --only-main-content
```
#### Output Formats
```bash CLI theme={null}
# Get HTML output
firecrawl https://example.com --html
# Multiple formats (returns JSON)
firecrawl https://example.com --format markdown,links
# Get images from a page
firecrawl https://example.com --format images
# Get a summary of the page content
firecrawl https://example.com --format summary
# Track changes on a page
firecrawl https://example.com --format changeTracking
# Available formats: markdown, html, rawHtml, links, screenshot, json, images, summary, changeTracking, attributes, branding
```
#### Scrape Options
```bash CLI theme={null}
# Extract only main content (removes navs, footers)
firecrawl https://example.com --only-main-content
# Wait for JavaScript rendering
firecrawl https://example.com --wait-for 3000
# Take a screenshot
firecrawl https://example.com --screenshot
# Include/exclude specific HTML tags
firecrawl https://example.com --include-tags article,main
firecrawl https://example.com --exclude-tags nav,footer
# Save output to file
firecrawl https://example.com -o output.md
# Pretty print JSON output
firecrawl https://example.com --format markdown,links --pretty
# Force JSON output even with single format
firecrawl https://example.com --json
# Show request timing information
firecrawl https://example.com --timing
```
**Available Options:**
| Option | Short | Description |
| ----------------------- | ----- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--url <url>` | `-u` | URL to scrape (alternative to positional argument) |
| `--format <formats>` | `-f` | Output formats (comma-separated): `markdown`, `html`, `rawHtml`, `links`, `screenshot`, `json`, `images`, `summary`, `changeTracking`, `attributes`, `branding` |
| `--html` | `-H` | Shortcut for `--format html` |
| `--only-main-content` | | Extract only main content |
| `--wait-for <ms>` | | Wait time in milliseconds for JS rendering |
| `--screenshot` | | Take a screenshot |
| `--include-tags <tags>` | | HTML tags to include (comma-separated) |
| `--exclude-tags <tags>` | | HTML tags to exclude (comma-separated) |
| `--output <file>` | `-o` | Save output to file |
| `--json` | | Force JSON output even with single format |
| `--pretty` | | Pretty print JSON output |
| `--timing` | | Show request timing and other useful information |
***
### Search
Search the web and optionally scrape the results.
```bash CLI theme={null}
# Search the web
firecrawl search "web scraping tutorials"
# Limit results
firecrawl search "AI news" --limit 10
# Pretty print results
firecrawl search "machine learning" --pretty
```
#### Search Options
```bash CLI theme={null}
# Search specific sources
firecrawl search "AI" --sources web,news,images
# Search with category filters
firecrawl search "react hooks" --categories github
firecrawl search "machine learning" --categories research,pdf
# Time-based filtering
firecrawl search "tech news" --tbs qdr:h # Last hour
firecrawl search "tech news" --tbs qdr:d # Last day
firecrawl search "tech news" --tbs qdr:w # Last week
firecrawl search "tech news" --tbs qdr:m # Last month
firecrawl search "tech news" --tbs qdr:y # Last year
# Location-based search
firecrawl search "restaurants" --location "Berlin,Germany" --country DE
# Search and scrape results
firecrawl search "documentation" --scrape --scrape-formats markdown
# Save to file
firecrawl search "firecrawl" --pretty -o results.json
```
**Available Options:**
| Option | Description |
| ---------------------------- | ------------------------------------------------------------------------------------------- |
| `--limit <number>` | Maximum results (default: 5, max: 100) |
| `--sources <sources>` | Sources to search: `web`, `images`, `news` (comma-separated) |
| `--categories <categories>` | Filter by category: `github`, `research`, `pdf` (comma-separated) |
| `--tbs <filter>` | Time filter: `qdr:h` (hour), `qdr:d` (day), `qdr:w` (week), `qdr:m` (month), `qdr:y` (year) |
| `--location <location>` | Geo-targeting (e.g., "Berlin,Germany") |
| `--country <code>` | ISO country code (default: US) |
| `--timeout <ms>` | Timeout in milliseconds (default: 60000) |
| `--ignore-invalid-urls` | Exclude URLs invalid for other Firecrawl endpoints |
| `--scrape` | Scrape search results |
| `--scrape-formats <formats>` | Formats for scraped content (default: markdown) |
| `--only-main-content` | Include only main content when scraping (default: true) |
| `--json` | Output as JSON |
| `--output <file>` | Save output to file |
| `--pretty` | Pretty print JSON output |
***
### Map
Discover all URLs on a website quickly.
```bash CLI theme={null}
# Discover all URLs on a website
firecrawl map https://example.com
# Output as JSON
firecrawl map https://example.com --json
# Limit number of URLs
firecrawl map https://example.com --limit 500
```
#### Map Options
```bash CLI theme={null}
# Filter URLs by search query
firecrawl map https://example.com --search "blog"
# Include subdomains
firecrawl map https://example.com --include-subdomains
# Control sitemap usage
firecrawl map https://example.com --sitemap include # Use sitemap
firecrawl map https://example.com --sitemap skip # Skip sitemap
firecrawl map https://example.com --sitemap only # Only use sitemap
# Ignore query parameters (dedupe URLs)
firecrawl map https://example.com --ignore-query-parameters
# Wait for map to complete with timeout
firecrawl map https://example.com --wait --timeout 60
# Save to file
firecrawl map https://example.com -o urls.txt
firecrawl map https://example.com --json --pretty -o urls.json
```
**Available Options:**
| Option | Description |
| --------------------------- | ----------------------------------------------- |
| `--url <url>` | URL to map (alternative to positional argument) |
| `--limit <number>` | Maximum URLs to discover |
| `--search <query>` | Filter URLs by search query |
| `--sitemap <mode>` | Sitemap handling: `include`, `skip`, `only` |
| `--include-subdomains` | Include subdomains |
| `--ignore-query-parameters` | Treat URLs with different params as same |
| `--wait` | Wait for map to complete |
| `--timeout <seconds>` | Timeout in seconds |
| `--json` | Output as JSON |
| `--output <file>` | Save output to file |
| `--pretty` | Pretty print JSON output |
***
### Browser
Have your agents interact with the web using a secure browser sandbox.
Launch cloud browser sessions and execute Python, JavaScript, or bash code remotely. Each session runs a full Chromium instance — no local browser install required. Code runs server-side with a pre-configured [Playwright](https://playwright.dev/) `page` object ready to use.
```bash CLI theme={null}
# Launch a cloud browser session
firecrawl browser launch-session
# Execute agent-browser commands (default - "agent-browser" is auto-prefixed)
firecrawl browser execute "open https://example.com"
firecrawl browser execute "snapshot"
firecrawl browser execute "click @e5"
firecrawl browser execute "scrape"
# Execute Playwright Python code
firecrawl browser execute --python 'await page.goto("https://example.com")
print(await page.title())'
# Execute Playwright JavaScript code
firecrawl browser execute --node 'await page.goto("https://example.com"); console.log(await page.title());'
# List all sessions (or: list active / list destroyed)
firecrawl browser list
# Close the active session
firecrawl browser close
```
#### Browser Options
```bash CLI theme={null}
# Launch with custom TTL (10 minutes) and live view
firecrawl browser launch-session --ttl 600 --stream
# Launch with inactivity timeout
firecrawl browser launch-session --ttl 120 --ttl-inactivity 60
# agent-browser commands (default - "agent-browser" is auto-prefixed)
firecrawl browser execute "open https://news.ycombinator.com"
firecrawl browser execute "snapshot"
firecrawl browser execute "click @e3"
firecrawl browser execute "scrape"
# Playwright Python - navigate, interact, extract
firecrawl browser execute --python '
await page.goto("https://news.ycombinator.com")
items = await page.query_selector_all(".titleline > a")
for item in items[:5]:
print(await item.text_content())
'
# Playwright JavaScript - same page object
firecrawl browser execute --node '
await page.goto("https://example.com");
const title = await page.title();
console.log(title);
'
# Explicit bash mode - runs in the sandbox
firecrawl browser execute --bash "agent-browser snapshot"
# Target a specific session
firecrawl browser execute --session <session-id> --python 'print(await page.title())'
# Save output to file
firecrawl browser execute "scrape" -o result.txt
# Close a specific session
firecrawl browser close --session <session-id>
# List sessions (all / active / destroyed)
firecrawl browser list
firecrawl browser list active --json
```
**Subcommands:**
| Subcommand | Description |
| ---------------- | ----------------------------------------------------------------------------------- |
| `launch-session` | Launch a new cloud browser session (returns session ID, CDP URL, and live view URL) |
| `execute <code>` | Execute Playwright Python/JS code or bash commands in a session |
| `list [status]` | List browser sessions (filter by `active` or `destroyed`) |
| `close` | Close a browser session |
**Execute Options:**
| Option | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `--bash` | Execute bash commands remotely in the sandbox (default). [agent-browser](https://github.com/vercel-labs/agent-browser) (40+ commands) is pre-installed and auto-prefixed. `CDP_URL` is auto-injected so agent-browser connects to your session automatically. Best approach for AI agents. |
| `--python` | Execute as Playwright Python code. A Playwright `page` object is available — use `await page.goto()`, `await page.title()`, etc. |
| `--node` | Execute as Playwright JavaScript code. Same `page` object available. |
| `--session <id>` | Target a specific session (default: active session) |
**Launch Options:**
| Option | Description |
| ---------------------------- | ------------------------------------------------------------------- |
| `--ttl <seconds>` | Total session TTL (default: 600, range: 30–3600) |
| `--ttl-inactivity <seconds>` | Auto-close after inactivity (range: 10–3600) |
| `--profile <name>` | Name for a profile (saves and reuses browser state across sessions) |
| `--no-save-changes` | Load existing profile data without saving changes back |
| `--stream` | Enable live view streaming |
**Common Options:**
| Option | Description |
| ----------------- | --------------------- |
| `--output <file>` | Save output to file |
| `--json` | Output as JSON format |
***
### Crawl
Crawl an entire website starting from a URL.
```bash CLI theme={null}
# Start a crawl (returns job ID immediately)
firecrawl crawl https://example.com
# Wait for crawl to complete
firecrawl crawl https://example.com --wait
# Wait with progress indicator
firecrawl crawl https://example.com --wait --progress
```
#### Check Crawl Status
```bash CLI theme={null}
# Check crawl status using job ID
firecrawl crawl <job-id>
# Example with a real job ID
firecrawl crawl 550e8400-e29b-41d4-a716-446655440000
```
#### Crawl Options
```bash CLI theme={null}
# Limit crawl depth and pages
firecrawl crawl https://example.com --limit 100 --max-depth 3 --wait
# Include only specific paths
firecrawl crawl https://example.com --include-paths /blog,/docs --wait
# Exclude specific paths
firecrawl crawl https://example.com --exclude-paths /admin,/login --wait
# Include subdomains
firecrawl crawl https://example.com --allow-subdomains --wait
# Crawl entire domain
firecrawl crawl https://example.com --crawl-entire-domain --wait
# Rate limiting
firecrawl crawl https://example.com --delay 1000 --max-concurrency 2 --wait
# Custom polling interval and timeout
firecrawl crawl https://example.com --wait --poll-interval 10 --timeout 300
# Save results to file
firecrawl crawl https://example.com --wait --pretty -o results.json
```
**Available Options:**
| Option | Description |
| --------------------------- | ------------------------------------------------- |
| `--url <url>` | URL to crawl (alternative to positional argument) |
| `--wait` | Wait for crawl to complete |
| `--progress` | Show progress indicator while waiting |
| `--poll-interval <seconds>` | Polling interval (default: 5) |
| `--timeout <seconds>` | Timeout when waiting |
| `--status` | Check status of existing crawl job |
| `--limit <number>` | Maximum pages to crawl |
| `--max-depth <number>` | Maximum crawl depth |
| `--include-paths <paths>` | Paths to include (comma-separated) |
| `--exclude-paths <paths>` | Paths to exclude (comma-separated) |
| `--sitemap <mode>` | Sitemap handling: `include`, `skip`, `only` |
| `--allow-subdomains` | Include subdomains |
| `--allow-external-links` | Follow external links |
| `--crawl-entire-domain` | Crawl entire domain |
| `--ignore-query-parameters` | Treat URLs with different params as same |
| `--delay <ms>` | Delay between requests |
| `--max-concurrency <number>` | Maximum concurrent requests |
| `--output <file>` | Save output to file |
| `--pretty` | Pretty print JSON output |
***
### Agent
Search and gather data from the web using natural language prompts.
```bash CLI theme={null}
# Basic usage - URLs are optional
firecrawl agent "Find the top 5 AI startups and their funding amounts" --wait
# Focus on specific URLs
firecrawl agent "Compare pricing plans" --urls https://slack.com/pricing,https://teams.microsoft.com/pricing --wait
# Use a schema for structured output
firecrawl agent "Get company information" --urls https://example.com --schema '{"name": "string", "founded": "number"}' --wait
# Use schema from a file
firecrawl agent "Get product details" --urls https://example.com --schema-file schema.json --wait
```
#### Agent Options
```bash CLI theme={null}
# Use Spark 1 Pro for higher accuracy
firecrawl agent "Competitive analysis across multiple domains" --model spark-1-pro --wait
# Set max credits to limit costs
firecrawl agent "Gather contact information from company websites" --max-credits 100 --wait
# Check status of an existing job
firecrawl agent --status <job-id>
# Custom polling interval and timeout
firecrawl agent "Summarize recent blog posts" --wait --poll-interval 10 --timeout 300
# Save output to file
firecrawl agent "Find pricing information" --urls https://example.com --wait -o pricing.json --pretty
```
**Available Options:**
| Option | Description |
| --------------------------- | -------------------------------------------------------------------------------------- |
| `--urls <urls>` | Optional list of URLs to focus the agent on (comma-separated) |
| `--model <model>` | Model to use: `spark-1-mini` (default, 60% cheaper) or `spark-1-pro` (higher accuracy) |
| `--schema <json>` | JSON schema for structured output (inline JSON string) |
| `--schema-file <path>` | Path to JSON schema file for structured output |
| `--max-credits <number>` | Maximum credits to spend (job fails if limit reached) |
| `--status` | Check status of existing agent job |
| `--wait` | Wait for agent to complete before returning results |
| `--poll-interval <seconds>` | Polling interval when waiting (default: 5) |
| `--timeout <seconds>` | Timeout when waiting (default: no timeout) |
| `--output <file>` | Save output to file |
| `--json` | Output as JSON format |
***
### Credit Usage
Check your team's credit balance and usage.
```bash CLI theme={null}
# View credit usage
firecrawl credit-usage
# Output as JSON
firecrawl credit-usage --json --pretty
```
***
### Version
Display the CLI version.
```bash CLI theme={null}
firecrawl version
# or
firecrawl --version
```
## Global Options
These options are available for all commands:
| Option | Short | Description |
| ----------------- | ----- | ------------------------------------------------------ |
| `--status` | | Show version, auth, concurrency, and credits |
| `--api-key <key>` | `-k` | Override stored API key for this command |
| `--api-url <url>` | | Use custom API URL (for self-hosted/local development) |
| `--help` | `-h` | Show help for a command |
| `--version` | `-V` | Show CLI version |
## Output Handling
The CLI outputs to stdout by default, making it easy to pipe or redirect:
```bash CLI theme={null}
# Pipe markdown to another command
firecrawl https://example.com | head -50
# Redirect to a file
firecrawl https://example.com > output.md
# Save JSON with pretty formatting
firecrawl https://example.com --format markdown,links --pretty -o data.json
```
### Format Behavior
* **Single format**: Outputs raw content (markdown text, HTML, etc.)
* **Multiple formats**: Outputs JSON with all requested data
```bash CLI theme={null}
# Raw markdown output
firecrawl https://example.com --format markdown
# JSON output with multiple formats
firecrawl https://example.com --format markdown,links
```
## Examples
### Quick Scrape
```bash CLI theme={null}
# Get markdown content from a URL (use --only-main-content for clean output)
firecrawl https://docs.firecrawl.dev --only-main-content
# Get HTML content
firecrawl https://example.com --html -o page.html
```
### Full Site Crawl
```bash CLI theme={null}
# Crawl a docs site with limits
firecrawl crawl https://docs.example.com --limit 50 --max-depth 2 --wait --progress -o docs.json
```
### Site Discovery
```bash CLI theme={null}
# Find all blog posts
firecrawl map https://example.com --search "blog" -o blog-urls.txt
```
### Research Workflow
```bash CLI theme={null}
# Search and scrape results for research
firecrawl search "machine learning best practices 2024" --scrape --scrape-formats markdown --pretty
```
### Agent
```bash CLI theme={null}
# URLs are optional
firecrawl agent "Find the top 5 AI startups and their funding amounts" --wait
# Focus on specific URLs
firecrawl agent "Compare pricing plans" --urls https://slack.com/pricing,https://teams.microsoft.com/pricing --wait
```
### Browser Automation
```bash CLI theme={null}
# Launch a session, scrape a page, and close
firecrawl browser launch-session
firecrawl browser execute "open https://news.ycombinator.com"
firecrawl browser execute "snapshot"
firecrawl browser execute "scrape"
firecrawl browser close
# Use agent-browser via bash mode (default — recommended for AI agents)
firecrawl browser launch-session
firecrawl browser execute "open https://example.com"
firecrawl browser execute "snapshot"
# snapshot returns @ref IDs — use them to interact
firecrawl browser execute "click @e5"
firecrawl browser execute "fill @e3 'search query'"
firecrawl browser execute "scrape"
# Run --help to see all 40+ commands
firecrawl browser execute --bash "agent-browser --help"
firecrawl browser close
```
### Combine with Other Tools
```bash CLI theme={null}
# Extract URLs from search results
jq -r '.data.web[].url' search-results.json
# Get titles from search results
jq -r '.data.web[] | "\(.title): \(.url)"' search-results.json
# Extract links and process with jq
firecrawl https://example.com --format links | jq '.links[].url'
# Count URLs from map
firecrawl map https://example.com | wc -l
```
## Telemetry
The CLI collects anonymous usage data during authentication to help improve the product:
* CLI version, OS, and Node.js version
* Development tool detection (e.g., Cursor, VS Code, Claude Code)
**No command data, URLs, or file contents are collected via the CLI.**
To disable telemetry, set the environment variable:
```bash CLI theme={null}
export FIRECRAWL_NO_TELEMETRY=1
```
## Open Source
The Firecrawl CLI and Skill are open source and available on GitHub: [firecrawl/cli](https://github.com/firecrawl/cli)
> Are you an AI agent that needs a Firecrawl API key? See [firecrawl.dev/agent-onboarding/SKILL.md](https://www.firecrawl.dev/agent-onboarding/SKILL.md) for automated onboarding instructions.
# AI Platforms
Source: https://docs.firecrawl.dev/use-cases/ai-platforms
Power AI assistants and let customers build AI apps
AI platform builders and teams use Firecrawl to power knowledge bases, chatbots, and enable customers to build AI applications with web data.
## Start with a Template
Instant AI chatbots for websites with web knowledge integration
**Get started with templates and examples.** Build AI-powered applications with web data.
## How It Works
Transform websites into AI-ready data. Power chatbots with real-time web knowledge, build RAG systems with up-to-date documentation, and enable your users to connect their AI applications to web sources.
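As a rough sketch of the ingestion side, you might scrape each source page to markdown and hand it to your pipeline; the ingestion function here is a placeholder for your own chunking and embedding step:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

// Placeholder for your pipeline: chunk, embed, and store in a vector DB.
function ingestIntoKnowledgeBase(markdown: string) {
  console.log(`Ingesting ${markdown.length} characters`);
}

const doc = await firecrawl.scrape("https://docs.example.com/getting-started", {
  formats: ["markdown"],
});
ingestIntoKnowledgeBase(doc.markdown ?? "");
```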
## Why AI Platforms Choose Firecrawl
### Reduce Hallucinations with Real-Time Data
Your AI assistants need current information, not outdated training data. Whether it's domain-specific knowledge, technical documentation, or industry-specific content, Firecrawl ensures your knowledge bases stay synchronized with the latest updates, reducing hallucinations and improving response accuracy.
## Customer Stories
**Replit**
Learn how Replit leverages Firecrawl to keep Replit Agent up-to-date with the latest API documentation and web content.
**Stack AI**
Discover how Stack AI uses Firecrawl to seamlessly feed agentic AI workflows with high-quality web data.
## FAQs
Firecrawl provides simple APIs and SDKs that integrate directly into AI platforms. Whether you're building with LangChain, using no-code tools like n8n, or custom frameworks, Firecrawl delivers clean, structured web data ready for AI consumption.
Yes. Firecrawl is designed for enterprise-scale data extraction, processing millions of pages for AI training datasets. Our infrastructure scales automatically to meet your needs.
Firecrawl delivers data in AI-friendly formats including clean markdown, structured JSON, raw HTML, extracted images, screenshots, and news content. This flexibility ensures compatibility with any AI platform's data ingestion requirements.
Yes! Our API supports real-time data extraction, enabling AI applications to access fresh web data on-demand. This is perfect for AI agents that need current information to make decisions.
## Related Use Cases
* [Deep Research](/use-cases/deep-research) - Advanced research capabilities
* [Content Generation](/use-cases/content-generation) - AI-powered content creation
* [Developers & MCP](/use-cases/developers-mcp) - Developer integrations
# Competitive Intelligence
Source: https://docs.firecrawl.dev/use-cases/competitive-intelligence
Monitor competitor websites and track changes in real-time
Business intelligence teams use Firecrawl to monitor competitors and get alerts on strategic changes.
## Start with a Template
Real-time website monitoring with intelligent alerts
Research and analyze competitor strategies with AI
**Choose from monitoring and research templates.** Track competitors and analyze their strategies.
## How It Works
Set up a pipeline that scrapes competitor sites on a schedule, extracts the data you care about, and alerts your team when something changes (see the sketch after these steps):

1. Crawl or scrape competitor websites with Firecrawl to get clean, structured content from product pages, pricing tables, and blog posts.
2. Pull out specific fields like pricing tiers, feature lists, job openings, or partnership announcements using structured extraction.
3. Store each extraction and diff it against previous snapshots to pinpoint what changed and when.
4. Trigger notifications when you detect important updates such as a new product launch, a pricing shift, or a change in positioning.
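A minimal sketch of this pipeline, using the `json` format from the scrape API and a naive string diff against the previous snapshot; the storage and alert helpers are placeholders:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

// Placeholders for your own storage and alerting.
function loadLastSnapshot(): unknown { return null; }
function notifyTeam(message: string) { console.log(message); }

const doc = await firecrawl.scrape("https://competitor.example.com/pricing", {
  formats: [{
    type: "json",
    prompt: "Extract each pricing tier with its name and monthly price",
  }],
});

// Compare the new extraction against the stored baseline.
const previous = JSON.stringify(loadLastSnapshot());
const current = JSON.stringify(doc.json);
if (previous !== current) {
  notifyTeam(`Competitor pricing changed: ${current}`);
}
```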
## What You Can Track
* **Products**: New launches, features, specs, pricing, documentation
* **Marketing**: Messaging changes, campaigns, case studies, testimonials
* **Business**: Job postings, partnerships, funding, press releases
* **Strategy**: Positioning, target markets, pricing approaches, go-to-market
* **Technical**: API changes, integrations, technology stack updates
## FAQs
Firecrawl extracts current page content whenever called. Build your own monitoring system to check competitors at intervals that match your needs, from hourly for critical updates to daily for routine tracking.
Yes, Firecrawl can access region-specific content. You can monitor different versions of competitor sites across multiple countries and languages.
When building your monitoring system, implement filters to ignore minor changes like timestamps or dynamic content. Compare extracted data over time and use your own logic to determine what constitutes a meaningful change.
Yes. Extract data from competitor press releases, blog posts, and public social media pages. Build systems to analyze announcement patterns, messaging changes, and campaign launches over time.
Extract data from multiple competitor sites using Firecrawl's APIs. Build your own system to organize and compare this data; many users create databases with competitor profiles and custom dashboards for analysis.
## Related Use Cases
* [Product & E-commerce](/use-cases/product-ecommerce) - Track competitor products
* [Investment & Finance](/use-cases/investment-finance) - Market intelligence
* [SEO Platforms](/use-cases/seo-platforms) - Search competitor tracking
# Content Generation
Source: https://docs.firecrawl.dev/use-cases/content-generation
Generate AI content based on website data, images, and news
Turn live website data into presentations, emails, marketing copy, and newsletters. Firecrawl scrapes pages, extracts clean text and images, and feeds structured content to your AI pipeline so every piece you publish is grounded in real, up-to-date sources.
## Start with a Template
Clone and recreate any website as a modern React app
**Get started with the Open Lovable template.** Transform websites into content and applications.
## How It Works
Point Firecrawl at any URL or set of URLs. It returns clean, structured content you can pipe directly into your generation workflow.
### Multiple Output Formats
Extract content as structured HTML, clean Markdown, JSON, or full-page screenshots, whichever format best fits your content pipeline.
### Image and Visual Extraction
Capture images and take screenshots from source pages to enrich presentations, emails, and social posts with real visuals.
### News and Trending Content
Surface relevant news stories as part of your request to keep generated content current and timely.
## What You Can Create
* **Sales Decks**: Custom presentations with prospect data
* **Email Campaigns**: Personalized outreach at scale
* **Marketing Content**: Data-driven blog posts and reports
* **Social Media**: Trending topic and news-driven content generation
* **Documentation**: Auto-updated technical content
* **Newsletters**: Curated updates from industry and competitor news
* **Visual Content**: Posts and reports enriched with extracted images and screenshots
## FAQs
Firecrawl extracts data directly from source websites, preserving the original content structure and context. All extracted data includes source URLs and timestamps for verification.
Firecrawl provides clean markdown, structured JSON, HTML, images, and screenshots from websites. This extracted data serves as the factual foundation for your content generation workflows.
Yes. Firecrawl can extract images, capture screenshots, and pull content from news sites. This enables you to create visually rich content and stay current with industry developments.
Firecrawl excels at extracting from company websites, news sites, blogs, and documentation. Sites with structured HTML and clear content hierarchies yield the cleanest extraction results.
Use Firecrawl's batch scraping and crawl APIs to extract data from multiple websites efficiently. Process hundreds of URLs in parallel to build comprehensive datasets for your content workflows.
## Related Use Cases
* [AI Platforms](/use-cases/ai-platforms) - Build AI-powered content tools
* [Lead Enrichment](/use-cases/lead-enrichment) - Personalize with prospect data
* [SEO Platforms](/use-cases/seo-platforms) - Optimize generated content
# Data Migration
Source: https://docs.firecrawl.dev/use-cases/data-migration
Transfer web data efficiently between platforms and systems
Extract content, structure, and metadata from any website and transform it for import into a new platform. Firecrawl handles the scraping so you can focus on mapping data to your target system.
## Start with a Template
Efficiently migrate data between platforms and systems
**Get started with the Firecrawl Migrator template.** Extract and transform data for platform migrations.
## How It Works
Point Firecrawl at your source website to crawl every page and return clean, structured data. From there, you transform the output to match your target platform's schema and import it using that platform's API or bulk-import tools.
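For instance, a migration script might crawl the source site and reshape each page for the target CMS. A sketch, assuming the crawl response exposes a `data` array of documents; the target fields are illustrative:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

const crawl = await firecrawl.crawl("https://old-site.example.com", { limit: 500 });

// Map each scraped page onto the target platform's content model.
const entries = crawl.data.map((page) => ({
  title: page.metadata?.title,
  slug: new URL(page.metadata?.sourceURL ?? "https://old-site.example.com").pathname,
  body: page.markdown,
}));

// Import `entries` with your target platform's bulk API.
console.log(`Prepared ${entries.length} entries for import`);
```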
## What You Can Migrate
* **Content**: Pages, posts, articles, media files, metadata
* **Structure**: Hierarchies, categories, tags, taxonomies
* **Users**: Profiles and user-related data where publicly accessible
* **Settings**: Configurations, custom fields, workflows
* **E-commerce**: Products, catalogs, inventory, orders
## Common Migration Use Cases
Users build migration tools with Firecrawl to extract data from various platforms:
Extract content from WordPress, Drupal, and Joomla sites or custom CMS platforms. Preserve content structure and metadata, then export for import into new systems like Contentful, Strapi, or Sanity.
Extract product catalogs from Magento and WooCommerce stores including inventory, pricing, descriptions, and specifications. Format data for import into Shopify, BigCommerce, or other platforms.
## FAQs
Our infrastructure scales automatically to handle large migrations. We support incremental processing with batching and parallel extraction, allowing you to migrate millions of pages by breaking them into manageable chunks with progress tracking.
Yes! Extract all SEO metadata including URLs, titles, descriptions, and implement proper redirects. We help maintain your search rankings through the migration.
Firecrawl can extract and catalog all media files. You can download them for re-upload to your new platform or reference them directly if keeping the same CDN.
We provide detailed extraction reports and support comparison tools. You can verify content completeness, check broken links, and validate data integrity.
Yes, you can extract publicly visible user-generated content including comments, reviews, and forum posts. Private user data requires appropriate authentication and permissions.
## Related Use Cases
* [Product & E-commerce](/use-cases/product-ecommerce) - Catalog migrations
* [Content Generation](/use-cases/content-generation) - Content transformation
* [AI Platforms](/use-cases/ai-platforms) - Knowledge base migration
# Deep Research
Source: https://docs.firecrawl.dev/use-cases/deep-research
Build agentic research tools with deep web search capabilities
Build automated research agents that search the web, scrape full-page content, and synthesize findings with an LLM. Firecrawl handles source discovery and content extraction so you can focus on analysis, not parsing HTML.
## Start with a Template
Blazing-fast AI search with real-time citations
Deep research agent with LangGraph and answer validation
Visual AI research assistant for comprehensive analysis
**Choose from multiple research templates.** Clone, configure your API key, and start researching.
## How It Works
Build powerful research tools that transform scattered web data into comprehensive insights. The core pattern is a **search → scrape → analyze → repeat** loop: use Firecrawl's search API to discover relevant sources, scrape each source for full content, then feed the results into an LLM to synthesize findings and identify follow-up queries.
Use the `/search` endpoint to find relevant pages for your research topic.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
results = firecrawl.search(
"recent advances in quantum computing",
limit=5,
scrape_options={"formats": ["markdown", "links"]}
)
```
```js Node.js theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const results = await firecrawl.search(
'recent advances in quantum computing',
{ limit: 5, scrapeOptions: { formats: ['markdown', 'links'] } }
);
```
Extract full content from each result to get detailed information with citations.
```python Python theme={null}
for result in results:
doc = firecrawl.scrape(result["url"], formats=["markdown"])
# Feed doc content into your LLM for analysis
```
```js Node.js theme={null}
for (const result of results) {
const doc = await firecrawl.scrape(result.url, { formats: ['markdown'] });
// Feed doc content into your LLM for analysis
}
```
Use an LLM to synthesize findings, identify gaps, and generate follow-up queries. Repeat the loop until your research question is fully answered.
## Why Researchers Choose Firecrawl
### Accelerate Research from Weeks to Hours
Build automated research systems that discover, read, and synthesize information from across the web. Create tools that deliver comprehensive reports with full citations, eliminating manual searching through hundreds of sources.
### Ensure Research Completeness
Reduce the risk of missing critical information. Build systems that follow citation chains, discover related sources, and surface insights that traditional search methods miss.
## Research Tool Capabilities
* **Iterative Exploration**: Build tools that automatically discover related topics and sources
* **Multi-Source Synthesis**: Combine information from hundreds of websites
* **Citation Preservation**: Maintain full source attribution in your research outputs
* **Intelligent Summarization**: Extract key findings and insights for analysis
* **Trend Detection**: Identify patterns across multiple sources
## FAQs
Use Firecrawl's crawl and search APIs to build iterative research systems. Start with search results, extract content from relevant pages, follow citation links, and aggregate findings. Combine with LLMs to synthesize comprehensive research reports.
Yes. Firecrawl can extract data from open-access research papers, academic websites, and publicly available scientific publications. It preserves formatting, citations, and technical content critical for research work.
Firecrawl maintains source attribution and extracts content exactly as presented on websites. All data includes source URLs and timestamps, ensuring full traceability for research purposes.
Yes. Set up scheduled crawls to track how information changes over time. This is perfect for monitoring trends, policy changes, or any research requiring temporal data analysis.
Our crawling infrastructure scales to handle thousands of sources simultaneously. Whether you're analyzing entire industries or tracking global trends, Firecrawl provides the data pipeline you need.
## Related Use Cases
* [AI Platforms](/use-cases/ai-platforms) - Build AI research assistants
* [Content Generation](/use-cases/content-generation) - Research-based content
* [Competitive Intelligence](/use-cases/competitive-intelligence) - Market research
# Developers & MCP
Source: https://docs.firecrawl.dev/use-cases/developers-mcp
Build powerful integrations with Model Context Protocol support
Give your AI coding assistant the ability to scrape, crawl, and search the web in real time. Firecrawl's MCP server connects to Claude Desktop, Cursor, and other Model Context Protocol clients so your assistant can pull live documentation, discover site structures, and extract structured data on demand.
## Start with a Template
Official MCP server - Add web scraping to Claude Desktop and Cursor
Build complete applications from any website instantly
**Get started with MCP in minutes.** Follow our [setup guide](https://github.com/firecrawl/firecrawl-mcp-server#installation) to integrate Firecrawl into Claude Desktop or Cursor.
## How It Works
Integrate Firecrawl directly into your AI coding workflow through Model Context Protocol. Once configured, your AI assistant gains access to a set of web scraping tools it can call on your behalf:
| Tool | What it does |
| ---------------- | ---------------------------------------------------------- |
| **Scrape** | Extract content or structured data from a single URL |
| **Batch Scrape** | Extract content from multiple known URLs in parallel |
| **Map** | Discover all indexed URLs on a website |
| **Crawl** | Walk a site section and extract content from every page |
| **Search** | Search the web and optionally extract content from results |
Your assistant picks the right tool automatically. Ask it to "read the Next.js docs" and it will scrape. Ask it to "find all blog posts on example.com" and it will map then batch scrape.
## Why Developers Choose Firecrawl MCP
### Build Smarter AI Assistants
Give your AI real-time access to documentation, APIs, and web resources. Reduce outdated information and hallucinations by providing your assistant with the latest data.
### Zero Infrastructure Required
No servers to manage, no crawlers to maintain. Just configure once and your AI assistant can access websites instantly through the Model Context Protocol.
## Customer Stories
**Botpress**
Discover how Botpress uses Firecrawl to streamline knowledge base population and improve developer experience.
**Answer HQ**
Learn how Answer HQ uses Firecrawl to help businesses import website data and build intelligent support assistants.
## FAQs
Currently, Claude Desktop and Cursor have native MCP support. More AI assistants are adding support regularly. You can also use the MCP SDK to build custom integrations.
VS Code and other IDEs can use MCP through community extensions or terminal integrations. Native support varies by IDE. Check our [GitHub repository](https://github.com/firecrawl/firecrawl-mcp-server) for IDE-specific setup guides.
The MCP server automatically caches responses for 15 minutes. You can configure cache duration in your MCP server settings or implement custom caching logic.
MCP requests use your standard Firecrawl API rate limits. We recommend batching related requests and using caching for frequently accessed documentation.
Follow our [setup guide](https://github.com/firecrawl/firecrawl-mcp-server#installation) to configure MCP. You'll need to add your Firecrawl API key to your MCP configuration file. The process takes just a few minutes.
## Related Use Cases
* [AI Platforms](/use-cases/ai-platforms) - Build AI-powered dev tools
* [Deep Research](/use-cases/deep-research) - Complex technical research
* [Content Generation](/use-cases/content-generation) - Generate documentation
# Investment & Finance
Source: https://docs.firecrawl.dev/use-cases/investment-finance
Track companies and extract financial insights from web data
Turn public web data into structured financial signals. Firecrawl scrapes company websites, news sites, job boards, and regulatory filings, then returns clean JSON you can feed directly into due diligence workflows, earnings prep, or ongoing portfolio surveillance.
## Start with a Template
Monitor portfolio companies for material changes and trigger events
## What You Can Track
* **Company Metrics**: growth indicators, team changes, product launches, funding rounds
* **Market Signals**: industry trends, competitor moves, sentiment shifts, regulatory changes
* **Risk Indicators**: leadership changes, legal issues, regulatory mentions, customer complaints
* **Financial Data**: pricing updates, revenue signals, partnership announcements
* **Alternative Data**: job postings, web traffic, social signals, news mentions
## Customer Stories
**Athena Intelligence**
Discover how Athena Intelligence leverages Firecrawl to fuel its AI-native analytics platform for enterprise analysts.
**Cargo**
See how Cargo uses Firecrawl to analyze market data and power revenue intelligence workflows.
## FAQs
Yes, you can monitor publicly available information about private companies from their websites, news mentions, job postings, and social media presence.
Firecrawl extracts data in real-time when called. Build your own monitoring system to fetch data at intervals that match your investment strategy, from minute-by-minute for critical events to daily for routine tracking.
Public web sources such as company websites, news sites, job boards, review sites, forums, social media, government filings, and open-access industry data.
Extract data from company ESG reports, sustainability pages, news mentions of environmental initiatives, and regulatory filings. Build tracking systems to identify changes in sustainability commitments or ESG-related developments.
Yes. Extract recent company updates, product launches, executive changes, and industry trends before earnings calls. Combine with competitor data to anticipate questions and identify key discussion points.
## Related Use Cases
* [Competitive Intelligence](/use-cases/competitive-intelligence) - Track market competitors
* [Deep Research](/use-cases/deep-research) - Comprehensive market analysis
* [Lead Enrichment](/use-cases/lead-enrichment) - B2B investment opportunities
# Lead Enrichment
Source: https://docs.firecrawl.dev/use-cases/lead-enrichment
Extract and filter leads from websites to power your sales pipeline
Scrape business directories for leads, enrich your CRM with live company data, and automate account research. Firecrawl handles the extraction so you can focus on closing deals.
## Start with a Template
AI-powered lead enrichment and data extraction from websites
## How It Works
1. **Point Firecrawl at a source.** Pass a directory URL, company website, or list of domains to the API.
2. **Extract structured data.** Firecrawl scrapes each page and returns fields like company name, contact details, team members, and recent news.
3. **Send the data to your CRM.** Push enriched records into Salesforce, HubSpot, or any other tool through the API or a Zapier integration.
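Step 2 might look like this with the `json` format; a sketch where the prompt and target fields are illustrative:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

// Extract structured lead fields from a prospect's site with an LLM prompt.
const doc = await firecrawl.scrape("https://prospect.example.com/about", {
  formats: [{
    type: "json",
    prompt: "Extract the company name, a one-line description, and any contact emails",
  }],
});

// Push this record into your CRM via its API or a Zapier integration.
console.log(doc.json);
```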
## Why Sales Teams Choose Firecrawl
### Extract leads from any directory
Point Firecrawl at a business directory, trade association, or conference attendee list. You get back structured records with company details and contact information, ready to import into your CRM.
### Keep CRM data fresh automatically
Firecrawl pulls information directly from live company websites instead of a static database. Your team always sees the latest company news, team changes, and growth signals.
## Customer Stories
**Zapier**
Discover how Zapier uses Firecrawl to empower customers with custom knowledge in their chatbots.
**Cargo**
See how Cargo uses Firecrawl to instantly analyze webpage content and power Go-To-Market workflows.
## Lead Sources
### Business Directories
* Industry-specific directories
* Chamber of commerce listings
* Trade association members
* Conference attendee lists
### Company Websites
* About pages and team sections
* Press releases and news
* Job postings for growth signals
* Customer case studies
## FAQs
Firecrawl automatically extracts company information, contact details, product offerings, and recent news from prospect websites. This enriches your CRM with accurate, up-to-date information for better sales outreach.
Yes! Firecrawl extracts publicly available contact information including emails and phone numbers from company websites, team pages, and contact sections.
Since Firecrawl extracts data directly from live websites, you get the most current information available. This is more accurate than static databases that quickly become outdated.
Yes. Use our API or Zapier integration to automatically enrich leads in Salesforce, HubSpot, Pipedrive, and other CRMs. Keep your lead data fresh without manual research.
Extract detailed company information, recent updates, and trigger events from target account websites. This intelligence helps personalize outreach and identify the perfect timing for engagement.
## Related Use Cases
* [AI Platforms](/use-cases/ai-platforms) - Build AI sales assistants
* [Competitive Intelligence](/use-cases/competitive-intelligence) - Track competitors
* [Investment & Finance](/use-cases/investment-finance) - Investment opportunities
# Observability & Monitoring
Source: https://docs.firecrawl.dev/use-cases/observability
Monitor websites, track uptime, and detect changes in real-time
DevOps and SRE teams use Firecrawl to monitor websites, track availability, and detect critical changes across their digital infrastructure.
## Start with a Template
Real-time website monitoring and intelligent change detection
## How It Works
Call Firecrawl's scrape or extract API on a schedule to capture page content, then compare each snapshot against a baseline in your own system. When the extracted data changes or a page fails to load, trigger an alert through your existing tools (PagerDuty, Slack, email, etc.).
Because Firecrawl fully renders JavaScript before extracting, you get the page as users see it, not just the raw HTML. This makes it reliable for monitoring SPAs, dynamic dashboards, and client-rendered content.
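A bare-bones change check might hash the rendered markdown and compare it to the last run; a sketch where the baseline storage and alerting are placeholders:
```typescript theme={null}
import { createHash } from "node:crypto";
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

const doc = await firecrawl.scrape("https://app.example.com/status", {
  formats: ["markdown"],
});

// Hash the fully rendered content and compare against the stored baseline.
const digest = createHash("sha256").update(doc.markdown ?? "").digest("hex");
const baseline = loadBaselineDigest(); // placeholder: read from your datastore

if (digest !== baseline) {
  // Trigger PagerDuty, Slack, email, etc. here.
  console.log("Page content changed");
}

function loadBaselineDigest(): string { return ""; } // placeholder store
```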
## What You Can Monitor
* **Availability**: Uptime, response times, error rates
* **Content**: Text changes, image updates, layout shifts
* **Performance**: Page load times, resource sizes, Core Web Vitals
* **Security**: SSL certificates, security headers, misconfigurations
* **SEO Health**: Meta tags, structured data, sitemap validity
* **User journeys**: Multi-step transaction flows and cross-browser rendering
## FAQs
Firecrawl extracts website content and structure on demand. Build monitoring systems that call Firecrawl's API to check pages, compare extracted data against baselines, and trigger your own alerts when changes occur.
Yes! Firecrawl fully renders JavaScript, making it perfect for monitoring modern SPAs, React apps, and dynamic content. We capture the page as users see it, not just the raw HTML.
Firecrawl extracts data in real-time when called. Build your monitoring system to check sites at whatever frequency you need, from minute-by-minute for critical pages to daily for routine checks.
Yes. Use the extract API to pull specific elements like prices, inventory levels, or critical content. Build validation logic in your monitoring system to verify that important information is present and correct.
Firecrawl provides webhooks that you can use to build integrations with your alerting tools. Send extracted data to PagerDuty, Slack, email, or any monitoring platform by building connectors that process Firecrawl's responses.
## Related Use Cases
* [Competitive Intelligence](/use-cases/competitive-intelligence) - Monitor competitor changes
* [Product & E-commerce](/use-cases/product-ecommerce) - Track inventory and pricing
* [Data Migration](/use-cases/data-migration) - Validate migrations
# Use Cases
Source: https://docs.firecrawl.dev/use-cases/overview
Transform web data into powerful features for your applications
Turn any website into structured data for your product. Whether you are building RAG pipelines, enriching leads, or monitoring competitors, Firecrawl handles the scraping so you can focus on what you do with the data.
Browse the use cases below to find guides, code samples, and architecture patterns for your workflow.
* [AI Platforms](/use-cases/ai-platforms) - Add web knowledge to your RAG chatbots and AI assistants.
* [Lead Enrichment](/use-cases/lead-enrichment) - Extract and filter leads from websites to enrich your sales pipeline.
* [SEO Platforms](/use-cases/seo-platforms) - Monitor search rankings and optimize content strategy.
* [Deep Research](/use-cases/deep-research) - Build agentic research tools with deep web search capabilities.
* [Product & E-commerce](/use-cases/product-ecommerce) - Monitor pricing and track inventory across e-commerce sites.
* [Content Generation](/use-cases/content-generation) - Generate AI content based on website data and structure.
* [Developers & MCP](/use-cases/developers-mcp) - Build powerful integrations with Model Context Protocol support.
* [Investment & Finance](/use-cases/investment-finance) - Track companies and extract financial insights from web data.
* [Competitive Intelligence](/use-cases/competitive-intelligence) - Monitor competitor websites and track changes in real-time.
* [Data Migration](/use-cases/data-migration) - Transfer web data seamlessly between platforms and systems.
* [Observability & Monitoring](/use-cases/observability) - Monitor websites, track uptime, and detect changes in real-time.
# Product & E-commerce
Source: https://docs.firecrawl.dev/use-cases/product-ecommerce
Monitor pricing and track inventory across e-commerce sites
Turn any e-commerce website into structured product data. Firecrawl extracts pricing, inventory, and catalog information so you can monitor competitors, track stock levels, and migrate products between platforms.
## Start with a Template
Migrate product catalogs and e-commerce data between platforms
## What You Can Extract
* **Product Data**: Title, SKU, specs, descriptions, categories
* **Pricing**: Current price, discounts, shipping, tax
* **Inventory**: Stock levels, availability, lead times
* **Reviews**: Ratings, customer feedback, Q\&A sections
## Use Cases in Action
**Price Monitoring**
Track competitor pricing across multiple e-commerce sites, receive alerts on price changes, and optimize your pricing strategy based on real-time market data.
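A sketch of one such check, using a JSON schema with the scrape API; the schema fields are illustrative:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

// Extract a product's price and availability into a fixed shape.
const doc = await firecrawl.scrape("https://shop.example.com/product/123", {
  formats: [{
    type: "json",
    schema: {
      type: "object",
      properties: {
        name: { type: "string" },
        price: { type: "number" },
        inStock: { type: "boolean" },
      },
    },
  }],
});
console.log(doc.json); // e.g. { name: "...", price: 49.99, inStock: true }
```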
**Catalog Migration**
Seamlessly migrate thousands of products between e-commerce platforms, preserving all product data, variants, images, and metadata.
## FAQs
Build a monitoring system using Firecrawl's API to extract prices at regular intervals. Compare extracted data over time to identify pricing trends, promotions, and competitive positioning.
Yes, Firecrawl can extract all product variants including size, color, and other options. Structure the data with custom schemas to capture all variant information.
For dynamic pricing, you can use Firecrawl's JavaScript rendering to capture prices after they load. For user-specific pricing, configure authentication headers in your requests.
Yes. Firecrawl can extract data from any publicly accessible e-commerce website. Users successfully extract from Shopify, WooCommerce, Magento, BigCommerce, and custom-built stores.
Yes. Firecrawl can navigate through paginated product listings and handle infinite scroll mechanisms to extract complete product catalogs, ensuring no products are missed during extraction.
## Related Use Cases
* [Lead Enrichment](/use-cases/lead-enrichment) - Enrich B2B e-commerce leads
* [Competitive Intelligence](/use-cases/competitive-intelligence) - Track competitor strategies
* [Data Migration](/use-cases/data-migration) - Migrate between platforms
# SEO Platforms
Source: https://docs.firecrawl.dev/use-cases/seo-platforms
Optimize websites for AI assistants and search engines
Audit entire websites for SEO and AI readability at scale. Firecrawl extracts meta tags, header structures, internal links, and content across thousands of pages, so you can spot optimization gaps that sampling-based tools miss.
## Start with a Template
GEO-powered SEO monitoring and multi-region rank tracking
**Get started with the FireGEO template.** Optimize for both search engines and AI assistants.
## How It Works
Crawl a site with Firecrawl to get structured markdown and metadata for every page. Use that output to analyze content quality, check technical SEO elements, and evaluate how well each page communicates with AI assistants.
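As a small example of what that enables, you could crawl a site and flag pages with missing or overlong meta descriptions. A sketch, assuming the crawl response exposes each page's `metadata`; the 160-character threshold is a common rule of thumb, not a Firecrawl setting:
```typescript theme={null}
import Firecrawl from "@mendable/firecrawl-js";

const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });

const crawl = await firecrawl.crawl("https://example.com", { limit: 200 });

// Flag pages whose meta description is missing or likely to be truncated.
for (const page of crawl.data) {
  const desc = page.metadata?.description ?? "";
  if (desc.length === 0 || desc.length > 160) {
    console.log(`Check meta description: ${page.metadata?.sourceURL}`);
  }
}
```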
## Why SEO Platforms Choose Firecrawl
### Optimize for AI discovery, not just Google
Tune your clients' content for AI assistant responses alongside traditional search rankings. Firecrawl extracts the structural and semantic signals AI crawlers rely on, so you can audit pages for both Google and the new wave of AI-powered discovery.
### Complete site intelligence at scale
Audit entire websites instead of sampling a handful of pages. Extract meta tags, header structures, internal links, and content across thousands of pages in a single crawl, and surface optimization gaps your competitors miss.
## What You Can Build
* **AI Readability Audit**: Optimize for AI comprehension
* **Content Analysis**: Structure and semantic optimization
* **Technical SEO**: Site performance and crawlability
* **Search Tracking**: Monitor search engine positions
## FAQs
**How does Firecrawl help optimize content for AI assistants?**
Firecrawl helps you structure content for optimal AI comprehension, extract semantic signals, and ensure your site follows best practices for AI discovery. This includes generating experimental formats like llms.txt (an emerging convention for AI crawler guidance).
**What can Firecrawl extract for an SEO audit?**
Firecrawl extracts complete site structures, meta tags, headers, internal links, and content to perform comprehensive SEO audits. Identify optimization opportunities and track improvements over time.
**Can I analyze competitor SEO strategies?**
Yes. Analyze competitor site structures, keyword usage, content strategies, and technical SEO implementations. Understand what's working in your industry to inform your strategy.
**How do I find content gaps?**
Crawl competitor sites to identify topics they cover that you don't. Extract their content categories, blog topics, and page structures to find opportunities for new content.
**Can Firecrawl monitor technical SEO issues?**
Yes. Identify broken links, track redirect chains, extract canonical tags, and monitor meta tag implementation. Regular crawls help identify technical SEO issues across your site.
## Related Use Cases
* [AI Platforms](/use-cases/ai-platforms) - Build AI-powered SEO tools
* [Competitive Intelligence](/use-cases/competitive-intelligence) - Track competitor SEO
* [Content Generation](/use-cases/content-generation) - Create SEO content
# Event Types
Source: https://docs.firecrawl.dev/webhooks/events
Webhook event reference
Firecrawl sends webhook events at each stage of a job's lifecycle, so you can track progress, capture results, and handle failures in real time without polling.
## Quick Reference
| Event | Trigger |
| ------------------------ | ---------------------------------------------------- |
| `crawl.started` | Crawl job begins processing |
| `crawl.page` | A page is scraped during a crawl |
| `crawl.completed` | Crawl job finishes and all pages have been processed |
| `batch_scrape.started` | Batch scrape job begins processing |
| `batch_scrape.page` | A URL is scraped during a batch scrape |
| `batch_scrape.completed` | All URLs in the batch have been processed |
| `extract.started` | Extract job begins processing |
| `extract.completed` | Extraction finishes successfully |
| `extract.failed` | Extraction fails |
| `agent.started` | Agent job begins processing |
| `agent.action` | Agent executes a tool (scrape, search, etc.) |
| `agent.completed` | Agent finishes successfully |
| `agent.failed` | Agent encounters an error |
| `agent.cancelled` | Agent job is cancelled by the user |
## Payload Structure
All webhook events share this structure:
```json theme={null}
{
"success": true,
"type": "crawl.page",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [...],
"metadata": {}
}
```
| Field | Type | Description |
| ---------- | ------- | ----------------------------------------- |
| `success` | boolean | Whether the operation succeeded |
| `type` | string | Event type (e.g. `crawl.page`) |
| `id` | string | Job ID |
| `data` | array | Event-specific data (see examples below) |
| `metadata` | object | Custom metadata from your webhook config |
| `error` | string | Error message (when `success` is `false`) |
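If you handle these payloads in TypeScript, a minimal type along these lines (a sketch derived from the table above, not an official SDK export) keeps handlers honest:
```typescript theme={null}
// Sketch of the shared payload shape described above.
interface FirecrawlWebhookEvent {
  success: boolean;
  type: string; // e.g. "crawl.page", "agent.completed"
  id: string; // job ID
  data: unknown[]; // event-specific entries (see examples below)
  metadata: Record<string, unknown>; // custom metadata from your webhook config
  error?: string; // present when success is false
}
```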
## Crawl Events
### `crawl.started`
Sent when the crawl job begins processing.
```json theme={null}
{
"success": true,
"type": "crawl.started",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [],
"metadata": {}
}
```
### `crawl.page`
Sent for each page scraped. The `data` array contains the page content and metadata.
```json theme={null}
{
"success": true,
"type": "crawl.page",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [
{
"markdown": "# Page content...",
"metadata": {
"title": "Page Title",
"description": "Page description",
"url": "https://example.com/page",
"statusCode": 200,
"contentType": "text/html",
"scrapeId": "550e8400-e29b-41d4-a716-446655440001",
"sourceURL": "https://example.com/page",
"proxyUsed": "basic",
"cacheState": "hit",
"cachedAt": "2025-09-03T21:11:25.636Z",
"creditsUsed": 1
}
}
],
"metadata": {}
}
```
### `crawl.completed`
Sent when the crawl job finishes and all pages have been processed.
```json theme={null}
{
"success": true,
"type": "crawl.completed",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [],
"metadata": {}
}
```
## Batch Scrape Events
### `batch_scrape.started`
Sent when the batch scrape job begins processing.
```json theme={null}
{
"success": true,
"type": "batch_scrape.started",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [],
"metadata": {}
}
```
### `batch_scrape.page`
Sent for each URL scraped. The `data` array contains the page content and metadata.
```json theme={null}
{
"success": true,
"type": "batch_scrape.page",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [
{
"markdown": "# Page content...",
"metadata": {
"title": "Page Title",
"description": "Page description",
"url": "https://example.com",
"statusCode": 200,
"contentType": "text/html",
"scrapeId": "550e8400-e29b-41d4-a716-446655440001",
"sourceURL": "https://example.com",
"proxyUsed": "basic",
"cacheState": "miss",
"cachedAt": "2025-09-03T23:30:53.434Z",
"creditsUsed": 1
}
}
],
"metadata": {}
}
```
### `batch_scrape.completed`
Sent when all URLs in the batch have been processed.
```json theme={null}
{
"success": true,
"type": "batch_scrape.completed",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [],
"metadata": {}
}
```
## Extract Events
### `extract.started`
Sent when the extract job begins processing.
```json theme={null}
{
"success": true,
"type": "extract.started",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [],
"metadata": {}
}
```
### `extract.completed`
Sent when extraction finishes successfully. The `data` array contains the extracted data and usage info.
```json theme={null}
{
"success": true,
"type": "extract.completed",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [
{
"success": true,
"data": { "siteName": "Example Site", "category": "Technology" },
"extractId": "550e8400-e29b-41d4-a716-446655440000",
"llmUsage": 0.0020118,
"totalUrlsScraped": 1,
"sources": {
"siteName": ["https://example.com"],
"category": ["https://example.com"]
}
}
],
"metadata": {}
}
```
### `extract.failed`
Sent when extraction fails. The `error` field contains the failure reason.
```json theme={null}
{
"success": false,
"type": "extract.failed",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [],
"error": "Failed to extract data: timeout exceeded",
"metadata": {}
}
```
## Agent Events
### `agent.started`
Sent when the agent job begins processing.
```json theme={null}
{
"success": true,
"type": "agent.started",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [],
"metadata": {}
}
```
### `agent.action`
Sent after each tool execution (scrape, search, etc.).
```json theme={null}
{
"success": true,
"type": "agent.action",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [
{
"creditsUsed": 5,
"action": "mcp__tools__scrape",
"input": {
"url": "https://example.com"
}
}
],
"metadata": {}
}
```
The `creditsUsed` value in `action` events is an **estimate** of the total
credits used so far. The final accurate credit count is only
available in the `completed`, `failed`, or `cancelled` events.
### `agent.completed`
Sent when the agent finishes successfully. The `data` array contains the extracted data and total credits used.
```json theme={null}
{
"success": true,
"type": "agent.completed",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [
{
"creditsUsed": 15,
"data": {
"company": "Example Corp",
"industry": "Technology",
"founded": 2020
}
}
],
"metadata": {}
}
```
### `agent.failed`
Sent when the agent encounters an error. The `error` field contains the failure reason.
```json theme={null}
{
"success": false,
"type": "agent.failed",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [
{
"creditsUsed": 8
}
],
"error": "Max credits exceeded",
"metadata": {}
}
```
### `agent.cancelled`
Sent when the agent job is cancelled by the user.
```json theme={null}
{
"success": false,
"type": "agent.cancelled",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [
{
"creditsUsed": 3
}
],
"metadata": {}
}
```
## Event Filtering
By default, you receive all events. To subscribe to specific events only, use the `events` array in your webhook config:
```json theme={null}
{
"url": "https://your-app.com/webhook",
"events": ["completed", "failed"]
}
```
This is useful if you only care about job completion and don't need per-page updates.
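Putting the reference together, a handler typically dispatches on `type`. A minimal Express sketch (signature verification is omitted here; see [Security](/webhooks/security)):
```typescript theme={null}
import express from 'express';

const app = express();
app.use(express.json());

app.post('/webhook', (req, res) => {
  const event = req.body; // payload shape shown above

  switch (event.type) {
    case 'crawl.page':
    case 'batch_scrape.page':
      // One document per entry in data
      for (const doc of event.data) {
        console.log('Scraped:', doc.metadata?.sourceURL);
      }
      break;
    case 'crawl.completed':
    case 'batch_scrape.completed':
      console.log('Job finished:', event.id);
      break;
    case 'extract.failed':
    case 'agent.failed':
      console.error('Job failed:', event.id, event.error);
      break;
  }

  res.status(200).send('ok');
});

app.listen(3000);
```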
# Overview
Source: https://docs.firecrawl.dev/webhooks/overview
Real-time notifications for your Firecrawl operations
Get notified the moment a crawl, batch scrape, extract, or agent job starts, progresses, or finishes. Instead of polling for status, you provide an HTTPS endpoint and Firecrawl delivers events to it in real time.
## Supported Operations
| Operation | Events |
| ------------ | ------------------------------------------------------- |
| Crawl | `started`, `page`, `completed` |
| Batch Scrape | `started`, `page`, `completed` |
| Extract | `started`, `completed`, `failed` |
| Agent | `started`, `action`, `completed`, `failed`, `cancelled` |
See [Event Types](/webhooks/events) for full payload details and examples.
## Configuration
Add a `webhook` object to your request:
```json JSON theme={null}
{
"webhook": {
"url": "https://your-domain.com/webhook",
"metadata": {
"any_key": "any_value"
},
"events": ["started", "page", "completed", "failed"]
}
}
```
| Field | Type | Required | Description |
| ---------- | ------ | -------- | ------------------------------------- |
| `url` | string | Yes | Your endpoint URL (HTTPS) |
| `headers` | object | No | Custom headers to include |
| `metadata` | object | No | Custom data included in payloads |
| `events` | array | No | Event types to receive (default: all) |
## Usage
### Crawl with Webhook
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/crawl \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://docs.firecrawl.dev",
"limit": 100,
"webhook": {
"url": "https://your-domain.com/webhook",
"metadata": {
"any_key": "any_value"
},
"events": ["started", "page", "completed"]
}
}'
```
### Batch Scrape with Webhook
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/batch/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"urls": [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
],
"webhook": {
"url": "https://your-domain.com/webhook",
"metadata": {
"any_key": "any_value"
},
"events": ["started", "page", "completed"]
}
}'
```
## Timeouts & Retries
Your endpoint must respond with a `2xx` status within **10 seconds**.
If delivery fails (timeout, non-2xx, or network error), Firecrawl retries automatically:
| Retry | Delay after failure |
| ----- | ------------------- |
| 1st | 1 minute |
| 2nd | 5 minutes |
| 3rd | 15 minutes |
After 3 failed retries, the webhook is marked as failed and no further attempts are made.
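To stay inside the 10-second window, acknowledge first and do slow work afterwards. A minimal sketch of that pattern:
```typescript theme={null}
import express from 'express';

const app = express();
app.use(express.json());

app.post('/webhook', (req, res) => {
  const event = req.body;

  // Acknowledge immediately so Firecrawl doesn't time out and retry,
  res.status(200).send('ok');

  // then do slow work (DB writes, LLM calls) outside the response path.
  setImmediate(async () => {
    try {
      await processEvent(event); // your own processing logic
    } catch (err) {
      console.error('Webhook processing failed:', err);
    }
  });
});

async function processEvent(event: { type: string; data: unknown[] }) {
  // e.g. persist each scraped page
}

app.listen(3000);
```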
# Security
Source: https://docs.firecrawl.dev/webhooks/security
Verify webhook authenticity
Verify that every webhook request actually came from Firecrawl by checking its HMAC-SHA256 signature. This stops attackers from spoofing payloads and lets you trust the data before acting on it.
## Secret Key
Your webhook secret is available in the [Advanced tab](https://www.firecrawl.dev/app/settings?tab=advanced) of your account settings. Each account has a unique secret used to sign all webhook requests.
Keep your webhook secret secure and never expose it publicly. If you believe your secret has been compromised, regenerate it immediately from your account settings.
## Signature Verification
Each webhook request includes an `X-Firecrawl-Signature` header:
```
X-Firecrawl-Signature: sha256=abc123def456...
```
### How to Verify
1. Extract the signature from the `X-Firecrawl-Signature` header
2. Get the raw request body (before parsing)
3. Compute HMAC-SHA256 using your secret key
4. Compare signatures using a timing-safe function
### Implementation
```js Node/Express theme={null}
import crypto from 'crypto';
import express from 'express';
const app = express();
// Use raw body parser for signature verification
app.use('/webhook/firecrawl', express.raw({ type: 'application/json' }));
app.post('/webhook/firecrawl', (req, res) => {
const signature = req.get('X-Firecrawl-Signature');
const webhookSecret = process.env.FIRECRAWL_WEBHOOK_SECRET;
if (!signature || !webhookSecret) {
return res.status(401).send('Unauthorized');
}
// Extract hash from signature header
const [algorithm, hash] = signature.split('=');
if (algorithm !== 'sha256') {
return res.status(401).send('Invalid signature algorithm');
}
// Compute expected signature
const expectedSignature = crypto
.createHmac('sha256', webhookSecret)
.update(req.body)
.digest('hex');
  // Verify signature using a timing-safe comparison.
  // timingSafeEqual throws if the buffers differ in length, so check length first.
  const expected = Buffer.from(expectedSignature, 'hex');
  const received = Buffer.from(hash, 'hex');
  if (received.length !== expected.length || !crypto.timingSafeEqual(received, expected)) {
    return res.status(401).send('Invalid signature');
  }
// Parse and process verified webhook
const event = JSON.parse(req.body);
console.log('Verified Firecrawl webhook:', event);
res.status(200).send('ok');
});
app.listen(3000, () => console.log('Listening on 3000'));
```
```python Python/Flask theme={null}
import hmac
import hashlib
from flask import Flask, request, abort
app = Flask(__name__)
WEBHOOK_SECRET = 'your-webhook-secret-here' # Get from Firecrawl dashboard
@app.post('/webhook/firecrawl')
def webhook():
signature = request.headers.get('X-Firecrawl-Signature')
if not signature:
abort(401, 'Missing signature header')
# Extract hash from signature header
try:
algorithm, hash_value = signature.split('=', 1)
if algorithm != 'sha256':
abort(401, 'Invalid signature algorithm')
except ValueError:
abort(401, 'Invalid signature format')
# Compute expected signature
expected_signature = hmac.new(
WEBHOOK_SECRET.encode('utf-8'),
request.data,
hashlib.sha256
).hexdigest()
# Verify signature using timing-safe comparison
if not hmac.compare_digest(hash_value, expected_signature):
abort(401, 'Invalid signature')
# Parse and process verified webhook
event = request.get_json(force=True)
print('Verified Firecrawl webhook:', event)
return 'ok', 200
if __name__ == '__main__':
app.run(port=3000)
```
## Best Practices
* **Verify every request.** Always check the signature before processing a webhook payload. Reject any request that fails verification with a `401` status.
* **Use timing-safe comparisons.** Standard string comparison can leak timing information. Use `crypto.timingSafeEqual()` in Node.js or `hmac.compare_digest()` in Python.
* **Serve your endpoint over HTTPS.** This ensures webhook payloads are encrypted in transit.
# Testing
Source: https://docs.firecrawl.dev/webhooks/testing
Test and debug webhooks
Verify that your webhook integration works before deploying to production. This page covers how to receive webhooks on your local machine and how to diagnose common delivery and verification failures.
## Local Development
Webhooks require a publicly reachable URL, so you need to expose your local server to the internet during development.
### Using Cloudflare Tunnels
[Cloudflare Tunnels](https://github.com/cloudflare/cloudflared/releases) provide a free way to expose your local server without opening firewall ports:
```bash theme={null}
cloudflared tunnel --url localhost:3000
```
You'll get a public URL like `https://abc123.trycloudflare.com`. Use this in your webhook config:
```json theme={null}
{
"url": "https://abc123.trycloudflare.com/webhook"
}
```
## Troubleshooting
### Webhooks Not Arriving
* **Endpoint not accessible** - Verify your server is publicly reachable and firewalls allow incoming connections
* **Using HTTP** - Webhook URLs must use HTTPS
* **Wrong events** - Check the `events` filter in your webhook config
* **Timeout errors** - Ensure your endpoint responds within 10 seconds
### Signature Verification Failing
The most common cause is using the parsed JSON body instead of the raw request body. A second cause is using the wrong secret, so confirm yours matches the value in your [account settings](https://www.firecrawl.dev/app/settings?tab=advanced).
```javascript theme={null}
// Wrong - using parsed body
const signature = crypto
.createHmac('sha256', secret)
.update(JSON.stringify(req.body))
.digest('hex');
// Correct - using raw body
app.use('/webhook', express.raw({ type: 'application/json' }));
app.post('/webhook', (req, res) => {
const signature = crypto
.createHmac('sha256', secret)
.update(req.body) // Raw buffer
.digest('hex');
});
```
# Advanced Scraping Guide
Source: https://docs.firecrawl.dev/advanced-scraping-guide
Configure scrape options, browser actions, crawl, map, and the agent endpoint with Firecrawl's full API surface.
Reference for every option across Firecrawl's scrape, crawl, map, and agent endpoints.
## Basic scraping
To scrape a single page and get clean markdown content, use the `/scrape` endpoint.
```python Python theme={null}
# pip install firecrawl-py
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
doc = firecrawl.scrape("https://firecrawl.dev")
print(doc.markdown)
```
```js Node theme={null}
// npm install @mendable/firecrawl-js
import { Firecrawl } from 'firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });
const doc = await firecrawl.scrape('https://firecrawl.dev');
console.log(doc.markdown);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://docs.firecrawl.dev"
}'
```
## Scraping PDFs
Firecrawl supports PDFs. Use the `parsers` option (e.g., `parsers: ["pdf"]`) when you want to ensure PDF parsing. You can control the parsing strategy with the `mode` option:
* **`auto`** (default) — attempts fast text-based extraction first, then falls back to OCR if needed.
* **`fast`** — text-based parsing only (embedded text). Fastest, but skips scanned/image-heavy pages.
* **`ocr`** — forces OCR parsing on every page. Use for scanned documents or when `auto` misclassifies a page.
`{ type: "pdf" }` and `"pdf"` both default to `mode: "auto"`.
```json theme={null}
"parsers": [{ "type": "pdf", "mode": "fast", "maxPages": 50 }]
```
## Scrape options
When using the `/scrape` endpoint, you can customize the request with the following options.
### Formats (`formats`)
The `formats` array controls which output types the scraper returns. Default: `["markdown"]`.
**String formats**: pass the name directly (e.g. `"markdown"`).
| Format | Description |
| ---------- | ---------------------------------------------------------------------------- |
| `markdown` | Page content converted to clean Markdown. |
| `html` | Processed HTML with unnecessary elements removed. |
| `rawHtml` | Original HTML exactly as returned by the server. |
| `links` | All links found on the page. |
| `images` | All images found on the page. |
| `summary` | An LLM-generated summary of the page content. |
| `branding` | Extracts brand identity (colors, fonts, typography, spacing, UI components). |
**Object formats**: pass an object with `type` and additional options.
| Format | Options | Description |
| ---------------- | ---------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| `json` | `prompt?: string`, `schema?: object` | Extract structured data using an LLM. Provide a JSON schema and/or a natural-language prompt (max 10,000 characters). |
| `screenshot` | `fullPage?: boolean`, `quality?: number`, `viewport?: { width, height }` | Capture a screenshot. Max one per request. Viewport max resolution is 7680×4320. Screenshot URLs expire after 24 hours. |
| `changeTracking` | `modes?: ("json" \| "git-diff")[]`, `tag?: string`, `schema?: object`, `prompt?: string` | Track changes between scrapes. Requires `"markdown"` to also be in the formats array. |
| `attributes` | `selectors: [{ selector: string, attribute: string }]` | Extract specific HTML attributes from elements matching CSS selectors. |
### Content filtering
These parameters control which parts of the page appear in the output. When `onlyMainContent` is `true` (the default), boilerplate (nav, footer, etc.) is stripped. `includeTags` and `excludeTags` are applied against the original page DOM, not the post-filtered result, so your selectors should target elements as they appear in the source HTML. Set `onlyMainContent: false` to use the full page as the starting point for tag filtering.
| Parameter | Type | Default | Description |
| ----------------- | --------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `onlyMainContent` | `boolean` | `true` | Return only the main content. Set `false` for the full page. |
| `includeTags` | `array` | — | CSS selectors to include — tags, classes, IDs, or attribute selectors (e.g. `["h1", "p", ".main-content", "[data-testid=\"main\"]"]`). |
| `excludeTags` | `array` | — | CSS selectors to exclude — tags, classes, IDs, or attribute selectors (e.g. `["#ad", "#footer", "[role=\"banner\"]"]`). |
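For example, to filter a full page down to a few selectors while dropping ads and the footer (the selectors here are illustrative):
```typescript theme={null}
import { Firecrawl } from 'firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });

// Start from the full page, then filter against the original DOM.
const doc = await firecrawl.scrape('https://example.com', {
  onlyMainContent: false,
  includeTags: ['h1', 'p', '.main-content'],
  excludeTags: ['#ad', '#footer'],
});

console.log(doc.markdown);
```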
### Timing and cache
| Parameter | Type | Default | Description |
| --------- | -------------- | ----------- | ------------------------------------------------------------------------------------------------------ |
| `waitFor` | `integer` (ms) | `0` | Extra wait time before scraping, on top of smart-wait. Use sparingly. |
| `maxAge` | `integer` (ms) | `172800000` | Return a cached version if fresher than this value (default is 2 days). Set `0` to always fetch fresh. |
| `timeout` | `integer` (ms) | `60000` | Max request duration before aborting (default is 60 seconds). Minimum is 1000 (1 second). |
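For instance, to skip the cache entirely and fail fast:
```typescript theme={null}
import { Firecrawl } from 'firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });

const doc = await firecrawl.scrape('https://example.com', {
  maxAge: 0, // never serve a cached version
  timeout: 15000, // abort after 15 seconds (minimum is 1000)
});
```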
### PDF parsing
| Parameter | Type | Default | Description |
| --------- | ------- | --------- | -------------------------------------------------------------------------------- |
| `parsers` | `array` | `["pdf"]` | Controls PDF processing. `[]` to skip parsing and return base64 (1 credit flat). |
```json theme={null}
{ "type": "pdf", "mode": "fast" | "auto" | "ocr", "maxPages": 10 }
```
| Property | Type | Default | Description |
| ---------- | --------------------------- | ------------ | ------------------------------------------------------------------------------------- |
| `type` | `"pdf"` | *(required)* | Parser type. |
| `mode` | `"fast" \| "auto" \| "ocr"` | `"auto"` | `fast`: text-based extraction only. `auto`: fast with OCR fallback. `ocr`: force OCR. |
| `maxPages` | `integer` | — | Cap the number of pages to parse. |
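For example, forcing OCR on a scanned document while capping the page count (the PDF URL is a placeholder):
```typescript theme={null}
import { Firecrawl } from 'firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });

const doc = await firecrawl.scrape('https://example.com/scanned-report.pdf', {
  parsers: [{ type: 'pdf', mode: 'ocr', maxPages: 20 }],
});

console.log(doc.markdown);
```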
### Actions
Run browser actions before scraping. This is useful for dynamic content, navigation, or user-gated pages. You can include up to 50 actions per request, and the combined wait time across all `wait` actions and `waitFor` must not exceed 60 seconds.
| Action | Parameters | Description |
| ------------------- | ------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `wait` | `milliseconds?: number`, `selector?: string` | Wait for a fixed duration **or** until an element is visible (provide one, not both). When using `selector`, times out after 30 seconds. |
| `click` | `selector: string`, `all?: boolean` | Click an element matching the CSS selector. Set `all: true` to click every match. |
| `write` | `text: string` | Type text into the currently focused field. You must focus the element with a `click` action first. |
| `press` | `key: string` | Press a keyboard key (e.g. `"Enter"`, `"Tab"`, `"Escape"`). |
| `scroll` | `direction?: "up" \| "down"`, `selector?: string` | Scroll the page or a specific element. Direction defaults to `"down"`. |
| `screenshot` | `fullPage?: boolean`, `quality?: number`, `viewport?: { width, height }` | Capture a screenshot. Max viewport resolution is 7680×4320. |
| `scrape` | *(none)* | Capture the current page HTML at this point in the action sequence. |
| `executeJavascript` | `script: string` | Run JavaScript code in the page. Return values are available in the `actions.javascriptReturns` array of the response. |
| `pdf` | `format?: string`, `landscape?: boolean`, `scale?: number` | Generate a PDF. Supported formats: `"A0"` through `"A6"`, `"Letter"`, `"Legal"`, `"Tabloid"`, `"Ledger"`. Defaults to `"Letter"`. |
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR-API-KEY')
doc = firecrawl.scrape(
    'https://example.com',
    actions=[
        { 'type': 'wait', 'milliseconds': 1000 },
        { 'type': 'click', 'selector': '#accept' },
        { 'type': 'scroll', 'direction': 'down' },
        { 'type': 'click', 'selector': '#q' },
        { 'type': 'write', 'text': 'firecrawl' },
        { 'type': 'press', 'key': 'Enter' },
        { 'type': 'wait', 'milliseconds': 2000 }
    ],
    formats=['markdown']
)
print(doc.markdown)
```
```js Node theme={null}
import { Firecrawl } from 'firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });
const doc = await firecrawl.scrape('https://example.com', {
actions: [
{ type: 'wait', milliseconds: 1000 },
{ type: 'click', selector: '#accept' },
{ type: 'scroll', direction: 'down' },
{ type: 'click', selector: '#q' },
{ type: 'write', text: 'firecrawl' },
{ type: 'press', key: 'Enter' },
{ type: 'wait', milliseconds: 2000 }
],
formats: ['markdown']
});
console.log(doc.markdown);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://example.com",
"actions": [
{ "type": "wait", "milliseconds": 1000 },
{ "type": "click", "selector": "#accept" },
{ "type": "scroll", "direction": "down" },
{ "type": "click", "selector": "#q" },
{ "type": "write", "text": "firecrawl" },
{ "type": "press", "key": "Enter" },
{ "type": "wait", "milliseconds": 2000 }
],
"formats": ["markdown"]
}'
```
#### Action execution notes
* **Write** requires a preceding `click` to focus the target element.
* **Scroll** accepts an optional `selector` to scroll a specific element instead of the page.
* **Wait** accepts either `milliseconds` (fixed delay) or `selector` (wait until visible).
* Actions run **sequentially**: each step completes before the next begins.
* Actions are **not supported for PDFs**. If the URL resolves to a PDF the request will fail.
#### Advanced action examples
**Taking a screenshot:**
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://example.com",
"actions": [
{ "type": "click", "selector": "#load-more" },
{ "type": "wait", "milliseconds": 1000 },
{ "type": "screenshot", "fullPage": true, "quality": 80 }
]
}'
```
**Clicking multiple elements:**
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://example.com",
"actions": [
{ "type": "click", "selector": ".expand-button", "all": true },
{ "type": "wait", "milliseconds": 500 }
],
"formats": ["markdown"]
}'
```
**Generating a PDF:**
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://example.com",
"actions": [
{ "type": "pdf", "format": "A4", "landscape": false }
]
}'
```
**Executing JavaScript (e.g. extracting embedded page data):**
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://example.com",
"actions": [
{ "type": "executeJavascript", "script": "document.querySelector(\"#__NEXT_DATA__\").textContent" }
],
"formats": ["markdown"]
}'
```
The return value of each `executeJavascript` action is captured in the `actions.javascriptReturns` array of the response.
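A sketch of reading that array back via the Node SDK (the exact entry shape may vary by SDK version):
```typescript theme={null}
import { Firecrawl } from 'firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });

const doc = await firecrawl.scrape('https://example.com', {
  actions: [
    { type: 'executeJavascript', script: 'document.title' },
  ],
  formats: ['markdown'],
});

// Return values appear in action order.
console.log(doc.actions?.javascriptReturns);
```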
### Full scrape example
The following request combines multiple scrape options:
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://docs.firecrawl.dev",
"formats": [
"markdown",
"links",
"html",
"rawHtml",
{ "type": "screenshot", "fullPage": true, "quality": 80 }
],
"includeTags": ["h1", "p", "a", ".main-content"],
"excludeTags": ["#ad", "#footer"],
"onlyMainContent": false,
"waitFor": 1000,
"timeout": 15000,
"parsers": ["pdf"]
}'
```
This request returns markdown, HTML, raw HTML, links, and a full-page screenshot. It scopes content with the `includeTags` and `excludeTags` selectors (applied against the full page, since `onlyMainContent` is `false`), waits an extra second before scraping, aborts after 15 seconds, and parses PDFs.
### Understanding the Frontend
The frontend uses AI Elements components to provide a complete chat interface:
**Key Features:**
* **Conversation Display**: The `Conversation` component automatically handles message scrolling and display
* **Message Rendering**: Each message part is rendered based on its type (text, reasoning, tool calls)
* **Tool Visualization**: Tool calls are displayed with collapsible sections showing inputs and outputs
* **Interactive Controls**: Users can toggle web search, select models, and attach files
* **Message Actions**: Copy and retry actions for assistant messages
To ensure the markdown from the LLM is correctly rendered, add the following import to your `app/globals.css` file:
```css theme={null}
@source "../node_modules/streamdown/dist/index.js";
```
This imports the necessary styles for rendering markdown content in the message responses.
Create the chat API endpoint at `app/api/chat/route.ts`. This route will handle incoming messages and stream responses from the AI.
```typescript theme={null}
import { streamText, UIMessage, convertToModelMessages } from "ai";
import { createOpenAI } from "@ai-sdk/openai";
const openai = createOpenAI({
apiKey: process.env.OPENAI_API_KEY!,
});
// Allow streaming responses up to 5 minutes
export const maxDuration = 300;
export async function POST(req: Request) {
const {
messages,
model,
webSearch,
}: {
messages: UIMessage[];
model: string;
webSearch: boolean;
} = await req.json();
const result = streamText({
model: openai(model),
messages: convertToModelMessages(messages),
system:
"You are a helpful assistant that can answer questions and help with tasks.",
});
// send sources and reasoning back to the client
return result.toUIMessageStreamResponse({
sendSources: true,
sendReasoning: true,
});
}
```
This basic route:
* Receives messages from the frontend
* Uses the OpenAI model selected by the user
* Streams responses back to the client
* Doesn't include tools yet - we'll add those next
Create a `.env.local` file in your project root:
```bash theme={null}
touch .env.local
```
Add your OpenAI API key:
```env theme={null}
OPENAI_API_KEY=sk-your-openai-api-key
```
The `OPENAI_API_KEY` is required for the AI model to function.
Now you can test the AI SDK chatbot without Firecrawl integration. Start the development server:
```bash theme={null}
npm run dev
```
Open [localhost:3000](http://localhost:3000) in your browser and test the basic chat functionality. The assistant should respond to messages, but won't have web scraping or search capabilities yet.
Now let's enhance the assistant with web scraping and search capabilities using Firecrawl.
### Install Firecrawl SDK
Firecrawl converts websites into LLM-ready formats with scraping and search capabilities:
```bash theme={null}
npm i @mendable/firecrawl-js
```
### Create the Tools File
Create a `lib` folder and add a `tools.ts` file inside it:
```bash theme={null}
mkdir lib && touch lib/tools.ts
```
Add the following code to define the web scraping and search tools:
```typescript lib/tools.ts theme={null}
import FirecrawlApp from "@mendable/firecrawl-js";
import { tool } from "ai";
import { z } from "zod";
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
export const scrapeWebsiteTool = tool({
description: 'Scrape content from any website URL',
inputSchema: z.object({
url: z.string().url().describe('The URL to scrape')
}),
execute: async ({ url }) => {
console.log('Scraping:', url);
const result = await firecrawl.scrape(url, {
formats: ['markdown'],
onlyMainContent: true,
timeout: 30000
});
console.log('Scraped content preview:', result.markdown?.slice(0, 200) + '...');
return { content: result.markdown };
}
});
export const searchWebTool = tool({
description: 'Search the web using Firecrawl',
inputSchema: z.object({
query: z.string().describe('The search query'),
limit: z.number().optional().describe('Number of results'),
location: z.string().optional().describe('Location for localized results'),
tbs: z.string().optional().describe('Time filter (qdr:h, qdr:d, qdr:w, qdr:m, qdr:y)'),
sources: z.array(z.enum(['web', 'news', 'images'])).optional().describe('Result types'),
categories: z.array(z.enum(['github', 'research', 'pdf'])).optional().describe('Filter categories'),
}),
execute: async ({ query, limit, location, tbs, sources, categories }) => {
console.log('Searching:', query);
const response = await firecrawl.search(query, {
...(limit && { limit }),
...(location && { location }),
...(tbs && { tbs }),
...(sources && { sources }),
...(categories && { categories }),
}) as { web?: Array<{ title?: string; url?: string; description?: string }> };
const results = (response.web || []).map((item) => ({
title: item.title || item.url || 'Untitled',
url: item.url || '',
description: item.description || '',
}));
console.log('Search results:', results.length);
return { results };
},
});
```
### Understanding the Tools
**Scrape Website Tool:**
* Accepts a URL as input (validated by Zod schema)
* Uses Firecrawl's `scrape` method to fetch the page as markdown
* Extracts only the main content to reduce token usage
* Returns the scraped content for the AI to analyze
**Search Web Tool:**
* Accepts a search query with optional filters
* Uses Firecrawl's `search` method to find relevant web pages
* Supports advanced filters like location, time range, and content categories
* Returns structured results with titles, URLs, and descriptions
Learn more about tools: [ai-sdk.dev/docs/foundations/tools](https://ai-sdk.dev/docs/foundations/tools).
Now update your `app/api/chat/route.ts` to include the Firecrawl tools we just created.
```typescript theme={null}
import { streamText, UIMessage, stepCountIs, convertToModelMessages } from "ai";
import { createOpenAI } from "@ai-sdk/openai";
import { scrapeWebsiteTool, searchWebTool } from "@/lib/tools";
const openai = createOpenAI({
apiKey: process.env.OPENAI_API_KEY!,
});
export const maxDuration = 300;
export async function POST(req: Request) {
const {
messages,
model,
webSearch,
}: {
messages: UIMessage[];
model: string;
webSearch: boolean;
} = await req.json();
const result = streamText({
model: openai(model),
messages: convertToModelMessages(messages),
system:
"You are a helpful assistant that can answer questions and help with tasks.",
// Add the Firecrawl tools here
tools: {
scrapeWebsite: scrapeWebsiteTool,
searchWeb: searchWebTool,
},
stopWhen: stepCountIs(5),
toolChoice: webSearch ? "auto" : "none",
});
return result.toUIMessageStreamResponse({
sendSources: true,
sendReasoning: true,
});
}
```
The key changes from the basic route:
* Import `stepCountIs` from the AI SDK
* Import the Firecrawl tools from `@/lib/tools`
* Add the `tools` object with both `scrapeWebsite` and `searchWeb` tools
* Add `stopWhen: stepCountIs(5)` to limit execution steps
* Set `toolChoice` to "auto" when web search is enabled, "none" otherwise
Learn more about `streamText`: [ai-sdk.dev/docs/reference/ai-sdk-core/stream-text](https://ai-sdk.dev/docs/reference/ai-sdk-core/stream-text).
Update your `.env.local` file to include your Firecrawl API key:
```env theme={null}
OPENAI_API_KEY=sk-your-openai-api-key
FIRECRAWL_API_KEY=fc-your-firecrawl-api-key
```
Get your Firecrawl API key from [firecrawl.dev](https://firecrawl.dev).
Restart your development server:
```bash theme={null}
npm run dev
```
Open [localhost:3000](http://localhost:3000) and test the enhanced assistant:
1. Toggle the "Search" button to enable web search
2. Ask: "What are the latest features from firecrawl.dev?"
3. Watch as the AI calls the `searchWeb` or `scrapeWebsite` tool
4. See the tool execution in the UI with inputs and outputs
5. Read the AI's analysis based on the scraped data
## How It Works
### Message Flow
1. **User sends a message**: The user types a question and clicks submit
2. **Frontend sends request**: `useChat` sends the message to `/api/chat` with the selected model and web search setting
3. **Backend processes message**: The API route receives the message and calls `streamText`
4. **AI decides on tools**: The model analyzes the question and decides whether to use `scrapeWebsite` or `searchWeb` (only if web search is enabled)
5. **Tools execute**: If tools are called, Firecrawl scrapes or searches the web
6. **AI generates response**: The model analyzes tool results and generates a natural language response
7. **Frontend displays results**: The UI shows tool calls and the final response in real-time
### Tool Calling Process
The AI SDK's tool calling system ([ai-sdk.dev/docs/foundations/tools](https://ai-sdk.dev/docs/foundations/tools)) works as follows:
1. The model receives the user's message and available tool descriptions
2. If the model determines a tool is needed, it generates a tool call with parameters
3. The SDK executes the tool function with those parameters
4. The tool result is sent back to the model
5. The model uses the result to generate its final response
This all happens automatically within a single `streamText` call, with results streaming to the frontend in real-time.
## Key Features
### Model Selection
The application supports multiple OpenAI models:
* **GPT-5 Mini (Thinking)**: Recent OpenAI model with advanced reasoning capabilities
* **GPT-4o Mini**: Fast and cost-effective model
Users can switch between models using the dropdown selector.
### Web Search Toggle
The Search button controls whether the AI can use Firecrawl tools:
* **Enabled**: AI can call `scrapeWebsite` and `searchWeb` tools as needed
* **Disabled**: AI responds only with its training knowledge
This gives users control over when to use web data versus the model's built-in knowledge.
## Customization Ideas
### Add More Tools
Extend the assistant with additional tools:
* Database lookups for internal company data
* CRM integration to fetch customer information
* Email sending capabilities
* Document generation
Each tool follows the same pattern: define a schema with Zod, implement the execute function, and register it in the `tools` object.
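For instance, a hypothetical order-status tool following that pattern (the lookup function is a stand-in for your own backend):
```typescript theme={null}
import { tool } from "ai";
import { z } from "zod";

// Hypothetical internal lookup - replace with your own data source.
async function lookupOrder(orderId: string) {
  return { orderId, status: "shipped" };
}

export const orderStatusTool = tool({
  description: "Look up the status of an order by its ID",
  inputSchema: z.object({
    orderId: z.string().describe("The order ID to look up"),
  }),
  execute: async ({ orderId }) => lookupOrder(orderId),
});
```
Register it in the route's `tools` object alongside `scrapeWebsite` and `searchWeb`.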
### Change the AI Model
Swap OpenAI for another provider:
```typescript theme={null}
import { anthropic } from "@ai-sdk/anthropic";
const result = streamText({
model: anthropic("claude-sonnet-4-5"),
// ... rest of config
});
```
The AI SDK supports 20+ providers with the same API. Learn more: [ai-sdk.dev/docs/foundations/providers-and-models](https://ai-sdk.dev/docs/foundations/providers-and-models).
### Customize the UI
AI Elements components are built on shadcn/ui, so you can:
* Modify component styles in the component files
* Add new variants to existing components
* Create custom components that match the design system
## Best Practices
1. **Use appropriate tools**: Choose `searchWeb` to find relevant pages first, `scrapeWebsite` for single pages, or let the AI decide
2. **Monitor API usage**: Track your Firecrawl and OpenAI API usage to avoid unexpected costs
3. **Handle errors gracefully**: The tools include error handling, but consider adding user-facing error messages
4. **Optimize performance**: Use streaming to provide immediate feedback and consider caching frequently accessed content
5. **Set reasonable limits**: The `stopWhen: stepCountIs(5)` prevents excessive tool calls and runaway costs
***
## Related Resources
Explore the AI SDK for building AI-powered applications with streaming, tool
calling, and multi-provider support.
Pre-built UI components for AI applications built on shadcn/ui.
# Building a Brand Style Guide Generator with Firecrawl
Source: https://docs.firecrawl.dev/developer-guides/cookbooks/brand-style-guide-generator-cookbook
Generate professional PDF brand style guides by extracting design systems from any website using Firecrawl's branding format
Build a brand style guide generator that automatically extracts colors, typography, spacing, and visual identity from any website and compiles it into a professional PDF document.
## What You'll Build
A Node.js application that takes any website URL, extracts its complete brand identity using Firecrawl's branding format, and generates a polished PDF style guide with:
* Color palette with hex values
* Typography system (fonts, sizes, weights)
* Spacing and layout specifications
* Logo and brand imagery
* Theme information (light/dark mode)
## Prerequisites
* Node.js 18 or later installed
* A Firecrawl API key from [firecrawl.dev](https://firecrawl.dev)
* Basic knowledge of TypeScript and Node.js
Start by creating a new directory for your project and initializing it:
```bash theme={null}
mkdir brand-style-guide-generator && cd brand-style-guide-generator
npm init -y
```
Update your `package.json` to use ES modules:
```json package.json theme={null}
{
"name": "brand-style-guide-generator",
"version": "1.0.0",
"type": "module",
"scripts": {
"start": "npx tsx index.ts"
}
}
```
Install the required packages for web scraping and PDF generation:
```bash theme={null}
npm i @mendable/firecrawl-js pdfkit
npm i -D typescript tsx @types/node @types/pdfkit
```
These packages provide:
* `@mendable/firecrawl-js`: Firecrawl SDK for extracting brand identity from websites
* `pdfkit`: PDF document generation library
* `tsx`: TypeScript execution for Node.js
Create the main application file at `index.ts`. This script extracts brand identity from a URL and generates a professional PDF style guide.
```typescript index.ts theme={null}
import Firecrawl from "@mendable/firecrawl-js";
import PDFDocument from "pdfkit";
import fs from "fs";
const API_KEY = "fc-YOUR-API-KEY";
const URL = "https://firecrawl.dev";
async function main() {
const fc = new Firecrawl({ apiKey: API_KEY });
const { branding: b } = (await fc.scrape(URL, { formats: ["branding"] })) as any;
const doc = new PDFDocument({ size: "A4", margin: 50 });
doc.pipe(fs.createWriteStream("brand-style-guide.pdf"));
// Fetch logo (PNG/JPG only)
let logoImg: Buffer | null = null;
try {
const logoUrl = b.images?.favicon || b.images?.ogImage;
if (logoUrl?.match(/\.(png|jpg|jpeg)$/i)) {
const res = await fetch(logoUrl);
logoImg = Buffer.from(await res.arrayBuffer());
}
} catch {}
// Header with logo
doc.rect(0, 0, 595, 120).fill(b.colors?.primary || "#333");
const titleX = logoImg ? 130 : 50;
if (logoImg) doc.image(logoImg, 50, 30, { height: 60 });
doc.fontSize(36).fillColor("#fff").text("Brand Style Guide", titleX, 38);
doc.fontSize(14).text(URL, titleX, 80);
// Colors
doc.fontSize(18).fillColor("#333").text("Colors", 50, 160);
const colors = Object.entries(b.colors || {}).filter(([, v]) => typeof v === "string" && (v as string).startsWith("#"));
colors.forEach(([k, v], i) => {
const x = 50 + i * 100;
doc.rect(x, 195, 80, 80).fill(v as string);
doc.fontSize(10).fillColor("#333").text(k, x, 282, { width: 80, align: "center" });
doc.fontSize(9).fillColor("#888").text(v as string, x, 296, { width: 80, align: "center" });
});
// Typography
doc.fontSize(18).fillColor("#333").text("Typography", 50, 340);
doc.fontSize(13).fillColor("#444");
doc.text(`Primary Font: ${b.typography?.fontFamilies?.primary || "—"}`, 50, 370);
doc.text(`Heading Font: ${b.typography?.fontFamilies?.heading || "—"}`, 50, 392);
doc.fontSize(12).fillColor("#666").text("Font Sizes:", 50, 422);
Object.entries(b.typography?.fontSizes || {}).forEach(([k, v], i) => {
doc.text(`${k.toUpperCase()}: ${v}`, 70, 445 + i * 22);
});
// Spacing & Theme
doc.fontSize(18).fillColor("#333").text("Spacing & Theme", 320, 340);
doc.fontSize(13).fillColor("#444");
doc.text(`Base Unit: ${b.spacing?.baseUnit}px`, 320, 370);
doc.text(`Border Radius: ${b.spacing?.borderRadius}`, 320, 392);
doc.text(`Color Scheme: ${b.colorScheme}`, 320, 414);
doc.end();
console.log("Generated: brand-style-guide.pdf");
}
main();
```
For this simple project, the API key is placed directly in the code. If you plan to push this to GitHub or share it with others, move the key to a `.env` file and use `process.env.FIRECRAWL_API_KEY` instead.
Replace `fc-YOUR-API-KEY` with your Firecrawl API key from [firecrawl.dev](https://firecrawl.dev).
### Understanding the Code
**Key Components:**
* **Firecrawl Branding Format**: The `branding` format extracts comprehensive brand identity including colors, typography, spacing, and images
* **PDFKit Document**: Creates a professional A4 PDF with proper margins and sections
* **Color Swatches**: Renders visual color blocks with hex values and semantic names
* **Typography Display**: Shows font families and sizes in an organized layout
* **Spacing & Theme**: Documents the design system's spacing units and color scheme
Run the script to generate a brand style guide:
```bash theme={null}
npm start
```
The script will:
1. Extract the brand identity from the target URL using Firecrawl's branding format
2. Generate a PDF named `brand-style-guide.pdf`
3. Save it in your project directory
To generate a style guide for a different website, simply change the `URL` constant in `index.ts`.
## How It Works
### Extraction Process
1. **URL Input**: The generator receives a target website URL
2. **Firecrawl Scrape**: Calls the `/scrape` endpoint with the `branding` format
3. **Brand Analysis**: Firecrawl analyzes the page's CSS, fonts, and visual elements
4. **Data Return**: Returns a structured `BrandingProfile` object with all design tokens
### PDF Generation Process
1. **Header Creation**: Generates a colored header using the primary brand color
2. **Logo Fetch**: Downloads and embeds the logo or favicon if available
3. **Color Palette**: Renders each color as a visual swatch with metadata
4. **Typography Section**: Documents font families, sizes, and weights
5. **Spacing Info**: Includes base units, border radius, and theme mode
### Branding Profile Structure
The [branding format](https://docs.firecrawl.dev/features/scrape#%2Fscrape-with-branding-endpoint) returns detailed brand information:
```typescript theme={null}
{
colorScheme: "dark" | "light",
logo: "https://example.com/logo.svg",
colors: {
primary: "#FF6B35",
secondary: "#004E89",
accent: "#F77F00",
background: "#1A1A1A",
textPrimary: "#FFFFFF",
textSecondary: "#B0B0B0"
},
typography: {
fontFamilies: { primary: "Inter", heading: "Inter", code: "Roboto Mono" },
fontSizes: { h1: "48px", h2: "36px", body: "16px" },
fontWeights: { regular: 400, medium: 500, bold: 700 }
},
spacing: {
baseUnit: 8,
borderRadius: "8px"
},
images: {
logo: "https://example.com/logo.svg",
favicon: "https://example.com/favicon.ico"
}
}
```
## Key Features
### Automatic Color Extraction
The generator identifies and categorizes all brand colors:
* **Primary & Secondary**: Main brand colors
* **Accent**: Highlight and CTA colors
* **Background & Text**: UI foundation colors
* **Semantic Colors**: Success, warning, error states
### Typography Documentation
Captures the complete type system:
* **Font Families**: Primary, heading, and monospace fonts
* **Size Scale**: All heading and body text sizes
* **Font Weights**: Available weight variations
### Visual Output
The PDF includes:
* Color-coded header matching the brand
* Embedded logo when available
* Professional layout with clear hierarchy
* Metadata footer with generation date
## Customization Ideas
### Add Component Documentation
Extend the generator to include UI component styles:
```typescript theme={null}
// Add after the Spacing & Theme section
if (b.components) {
doc.addPage();
doc.fontSize(20).fillColor("#333").text("UI Components", 50, 50);
// Document button styles
if (b.components.buttonPrimary) {
const btn = b.components.buttonPrimary;
doc.fontSize(14).text("Primary Button", 50, 90);
doc.rect(50, 110, 120, 40).fill(btn.background);
doc.fontSize(12).fillColor(btn.textColor).text("Button", 50, 120, { width: 120, align: "center" });
}
}
```
### Export Multiple Formats
Add JSON export alongside the PDF:
```typescript theme={null}
// Add before doc.end()
fs.writeFileSync("brand-data.json", JSON.stringify(b, null, 2));
```
### Batch Processing
Generate guides for multiple websites:
```typescript theme={null}
const websites = [
"https://stripe.com",
"https://linear.app",
"https://vercel.com"
];
for (const site of websites) {
const { branding } = await fc.scrape(site, { formats: ["branding"] }) as any;
// Generate PDF for each site...
}
```
### Custom PDF Themes
Create different PDF styles based on the extracted theme:
```typescript theme={null}
const isDarkMode = b.colorScheme === "dark";
const headerBg = isDarkMode ? b.colors?.background : b.colors?.primary;
const textColor = isDarkMode ? "#fff" : "#333";
```
## Best Practices
1. **Handle Missing Data**: Not all websites expose complete branding information. Always provide fallback values for missing properties.
2. **Respect Rate Limits**: When batch processing multiple sites, add delays between requests to respect Firecrawl's rate limits (see the sketch after this list).
3. **Cache Results**: Store extracted branding data to avoid repeated API calls for the same site.
4. **Image Format Handling**: Some logos may be in formats PDFKit doesn't support (like SVG). Consider adding format conversion or graceful fallbacks.
5. **Error Handling**: Wrap the generation process in try-catch blocks and provide meaningful error messages.
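A sketch of practice 2, extending the batch-processing snippet above (`fc` and `websites` as defined there):
```typescript theme={null}
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

for (const site of websites) {
  const { branding } = (await fc.scrape(site, { formats: ["branding"] })) as any;
  // ...generate the PDF for this site...
  await sleep(2000); // pause between requests; tune to your plan's rate limits
}
```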
***
## Related Resources
Learn more about the branding format and all available properties you can extract.
Complete API reference for the scrape endpoint with all format options.
Learn more about PDFKit for advanced PDF customization options.
Process multiple URLs efficiently with batch scraping.
# Anthropic
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/anthropic
Use Firecrawl with Claude for web scraping + AI workflows
Integrate Firecrawl with Claude to build AI applications powered by web data.
## Setup
```bash theme={null}
npm install @mendable/firecrawl-js @anthropic-ai/sdk zod zod-to-json-schema
```
Create `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
ANTHROPIC_API_KEY=your_anthropic_key
```
> **Note:** On Node 20.6+, you can load the `.env` file with `node --env-file=.env`. On older versions, install `dotenv` and add `import 'dotenv/config'` to your code.
## Scrape + Summarize
This example demonstrates a simple workflow: scrape a website and summarize the content using Claude.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import Anthropic from '@anthropic-ai/sdk';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const scrapeResult = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const message = await anthropic.messages.create({
model: 'claude-haiku-4-5',
max_tokens: 1024,
messages: [
{ role: 'user', content: `Summarize in 100 words: ${scrapeResult.markdown}` }
]
});
console.log('Response:', message);
```
## Tool Use
This example shows how to use Claude's tool use feature to let the model decide when to scrape websites based on user requests.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { Anthropic } from '@anthropic-ai/sdk';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';
const firecrawl = new FirecrawlApp({
apiKey: process.env.FIRECRAWL_API_KEY
});
const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY
});
const ScrapeArgsSchema = z.object({
url: z.string()
});
console.log("Sending user message to Claude and requesting tool use if necessary...");
const response = await anthropic.messages.create({
model: 'claude-haiku-4-5',
max_tokens: 1024,
tools: [{
name: 'scrape_website',
description: 'Scrape and extract markdown content from a website URL',
input_schema: zodToJsonSchema(ScrapeArgsSchema, 'ScrapeArgsSchema') as any
}],
messages: [{
role: 'user',
content: 'What is Firecrawl? Check firecrawl.dev'
}]
});
const toolUse = response.content.find(block => block.type === 'tool_use');
if (toolUse && toolUse.type === 'tool_use') {
const input = toolUse.input as { url: string };
console.log(`Calling tool: ${toolUse.name} | URL: ${input.url}`);
const result = await firecrawl.scrape(input.url, {
formats: ['markdown']
});
console.log(`Scraped content preview: ${result.markdown?.substring(0, 300)}...`);
// Continue with the conversation or process the scraped content as needed
}
```
## Structured Extraction
This example demonstrates how to use Claude to extract structured data from scraped website content.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import Anthropic from '@anthropic-ai/sdk';
import { z } from 'zod';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const CompanyInfoSchema = z.object({
name: z.string(),
industry: z.string().optional(),
description: z.string().optional()
});
const scrapeResult = await firecrawl.scrape('https://stripe.com', {
formats: ['markdown'],
onlyMainContent: true
});
const prompt = `Extract company information from this website content.
Output ONLY valid JSON in this exact format (no markdown, no explanation):
{
"name": "Company Name",
"industry": "Industry",
"description": "One sentence description"
}
Website content:
${scrapeResult.markdown}`;
const message = await anthropic.messages.create({
model: 'claude-haiku-4-5',
max_tokens: 1024,
messages: [
{ role: 'user', content: prompt },
{ role: 'assistant', content: '{' }
]
});
const textBlock = message.content.find(block => block.type === 'text');
if (textBlock && textBlock.type === 'text') {
const jsonText = '{' + textBlock.text;
const companyInfo = CompanyInfoSchema.parse(JSON.parse(jsonText));
console.log(companyInfo);
}
```
For more examples, check the [Claude documentation](https://docs.anthropic.com/claude/docs).
# ElevenAgents
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/elevenagents
Give ElevenLabs voice and chat agents real-time web access with Firecrawl
Give your [ElevenAgents](https://elevenlabs.io/agents) voice and chat agents the ability to scrape, search, and crawl the web in real time using Firecrawl. This guide covers two integration paths:
1. **MCP server** — connect the hosted Firecrawl MCP server for zero-code setup.
2. **Server webhook tool** — point a custom tool at Firecrawl's REST API for full control over requests.
## Prerequisites
* An [ElevenLabs](https://elevenlabs.io) account with access to ElevenAgents
* A Firecrawl API key from your [firecrawl.dev dashboard](https://firecrawl.dev)
## Option 1: Firecrawl MCP Server
The fastest way to give an agent web access. ElevenAgents supports remote MCP servers, and Firecrawl provides a hosted MCP endpoint.
### Add the MCP server
1. Open the [Integrations page](https://elevenlabs.io/app/agents/integrations) in ElevenLabs and click **+ Add integration**.
2. Select **Custom MCP Server** from the integration library.
3. Fill in the following fields:
| Field | Value |
| --------------- | ------------------------------------------------------------ |
| **Name** | Firecrawl |
| **Description** | Search, scrape, crawl, and extract content from any website. |
| **Server type** | Streamable HTTP |
| **Server URL** | `https://mcp.firecrawl.dev/YOUR_FIRECRAWL_API_KEY/v2/mcp` |
Replace `YOUR_FIRECRAWL_API_KEY` with your actual key. Leave the **Type** dropdown set to **Value**. Treat this URL as a secret — it contains your API key.
You must select **Streamable HTTP** as the server type. The default SSE option does not work with the Firecrawl MCP endpoint.
4. Under **Tool Approval Mode**, choose an approval level:
* **No Approval** — the agent uses tools freely. Fine for read-only scraping.
* **Fine-Grained Tool Approval** — lets you pre-select which tools can run automatically and which require approval. Good for controlling expensive crawl operations.
* **Always Ask** (default) — the agent requests permission before every tool call.
5. Check **I trust this server**, then click **Add Server**.
ElevenLabs will connect to the server and list the available tools (scrape, search, crawl, map, and more).
### Attach it to an agent
1. Create or open an agent in the [ElevenAgents dashboard](https://elevenlabs.io/app/agents/agents).
2. Go to the **Tools** tab, then select the **MCP** sub-tab.
3. Click **Add server** and select the **Firecrawl** integration from the dropdown.
### Update the system prompt
In the **Agent** tab, add instructions to the **System prompt** so the agent knows when to use Firecrawl. For example:
```text theme={null}
You are a helpful research assistant. When the user asks about a website,
a company, or any topic that requires up-to-date information, use the
Firecrawl tools to search the web or scrape the relevant page, then
summarize the results.
```
### Test it
Click **Preview** in the top navigation bar. You can test using the text chat input or by starting a voice call. Try a prompt like:
> "What does firecrawl.dev do? Go to the site and summarize it for me."
The agent will call the Firecrawl MCP `scrape` tool, receive the page markdown, and respond with a summary.
***
## Option 2: Server Webhook Tool
Use this approach when you need precise control over request parameters (formats, headers, timeouts, etc.) or want to call a specific Firecrawl endpoint without exposing the full MCP tool set.
### Scrape tool
Create a tool that scrapes a single URL and returns its content as markdown.
1. Open your agent and go to the **Tools** tab.
2. Click **Add tool** and select **Webhook**.
3. Configure the tool:
| Field | Value |
| --------------- | ---------------------------------------------------------- |
| **Name** | scrape\_website |
| **Description** | Scrape content from a URL and return it as clean markdown. |
| **Method** | POST |
| **URL** | `https://api.firecrawl.dev/v2/scrape` |
The **Method** field defaults to GET — make sure to change it to **POST**.
4. Scroll to the **Headers** section and click **Add header** for authentication:
| Header | Value |
| --------------- | ------------------------------- |
| `Authorization` | `Bearer YOUR_FIRECRAWL_API_KEY` |
Alternatively, if you have workspace auth connections configured, you can use the **Authentication** dropdown instead.
5. Add a **body parameter**:
| Parameter | Type | Description | Required |
| --------- | ------ | ----------------- | -------- |
| `url` | string | The URL to scrape | Yes |
6. Click **Add tool**.
The Firecrawl API returns the page content as markdown by default. The agent receives the JSON response and can use the `markdown` field to answer questions.
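If you want to verify the tool configuration outside ElevenLabs, you can reproduce the same request directly. A minimal Python sketch (the API key and target URL are placeholders):
```python theme={null}
# Reproduces the request the scrape_website webhook tool sends.
# Assumes the `requests` package is installed (pip install requests).
import requests

response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer YOUR_FIRECRAWL_API_KEY"},
    json={"url": "https://firecrawl.dev"},  # same body parameter as the tool
)
response.raise_for_status()
# Preview the `markdown` field described above
print(response.json()["data"]["markdown"][:300])
```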
### Search tool
Create a tool that searches the web and returns results with scraped content.
1. Click **Add tool** → **Webhook** again and configure:
| Field | Value |
| --------------- | ------------------------------------------------------------------------- |
| **Name** | search\_web |
| **Description** | Search the web for a query and return relevant results with page content. |
| **Method** | POST |
| **URL** | `https://api.firecrawl.dev/v2/search` |
2. Add the same `Authorization` header as above.
3. Add **body parameters**:
| Parameter | Type | Description | Required |
| --------- | ------ | ----------------------------------------------- | -------- |
| `query` | string | The search query | Yes |
| `limit` | number | Maximum number of results to return (default 5) | No |
4. Click **Add tool**.
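The search tool issues the equivalent of the following request; a minimal Python sketch with placeholder values:
```python theme={null}
# Reproduces the request the search_web webhook tool sends.
import requests

response = requests.post(
    "https://api.firecrawl.dev/v2/search",
    headers={"Authorization": "Bearer YOUR_FIRECRAWL_API_KEY"},
    json={"query": "firecrawl", "limit": 5},  # same body parameters as the tool
)
response.raise_for_status()
print(response.json())
```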
### Update the system prompt
In the **Agent** tab, update the **System prompt**:
```text theme={null}
You are a knowledgeable assistant with access to web tools.
- Use `scrape_website` when the user gives you a specific URL to read.
- Use `search_web` when the user asks a general question that requires
finding information online.
Always summarize the information concisely and cite the source URL.
```
### Test it
Click **Preview** and try asking:
> "Search for the latest Next.js features and give me a summary."
The agent will call `search_web`, receive results from Firecrawl, and respond with a summary of the findings.
***
## Tips
* **Model selection** — For reliable tool calling, use a high-intelligence model such as GPT-4o, Claude Sonnet 4.5 or later, or Gemini 2.5 Flash. Smaller models may struggle to extract the correct parameters.
* **Keep prompts specific** — Tell the agent exactly when to use each tool. Vague instructions lead to missed or incorrect tool calls.
* **Limit response size** — For voice agents, long scraped pages can overwhelm the LLM context. Use `onlyMainContent: true` in scrape options (or instruct the agent to summarize aggressively) to keep responses concise.
* **Tool call sounds** — In the webhook or MCP tool settings, you can configure a **Tool call sound** to play ambient audio while a tool runs. This signals to the user that the agent is working.
## Resources
* [ElevenAgents documentation](https://elevenlabs.io/docs/eleven-agents/overview)
* [ElevenAgents tools overview](https://elevenlabs.io/docs/eleven-agents/customization/tools)
* [ElevenAgents MCP integration](https://elevenlabs.io/docs/eleven-agents/customization/tools/mcp)
* [Firecrawl API reference](https://docs.firecrawl.dev/api-reference/v2-introduction)
* [Firecrawl MCP server](https://docs.firecrawl.dev/mcp-server)
# Gemini
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/gemini
Use Firecrawl with Google's Gemini AI for web scraping + AI workflows
Integrate Firecrawl with Google's Gemini for AI applications powered by web data.
## Setup
```bash theme={null}
npm install @mendable/firecrawl-js @google/genai
```
Create a `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
GEMINI_API_KEY=your_gemini_key
```
> **Note:** If using Node \< 20, install `dotenv` and add `import 'dotenv/config'` to your code.
## Scrape + Summarize
This example demonstrates a simple workflow: scrape a website and summarize the content using Gemini.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { GoogleGenAI } from '@google/genai';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const scrapeResult = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const response = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: `Summarize: ${scrapeResult.markdown}`,
});
console.log('Summary:', response.text);
```
## Content Analysis
This example shows how to analyze website content using Gemini's multi-turn conversation capabilities.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { GoogleGenAI } from '@google/genai';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const scrapeResult = await firecrawl.scrape('https://news.ycombinator.com/', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const chat = ai.chats.create({
model: 'gemini-2.5-flash'
});
// Ask for the top 3 stories on Hacker News
const result1 = await chat.sendMessage({
message: `Based on this website content from Hacker News, what are the top 3 stories right now?\n\n${scrapeResult.markdown}`
});
console.log('Top 3 Stories:', result1.text);
// Ask for the 4th and 5th stories on Hacker News
const result2 = await chat.sendMessage({
message: `Now, what are the 4th and 5th top stories on Hacker News from the same content?`
});
console.log('4th and 5th Stories:', result2.text);
```
## Structured Extraction
This example demonstrates how to extract structured data using Gemini's JSON mode from scraped website content.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { GoogleGenAI, Type } from '@google/genai';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const scrapeResult = await firecrawl.scrape('https://stripe.com', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const response = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: `Extract company information: ${scrapeResult.markdown}`,
config: {
responseMimeType: 'application/json',
responseSchema: {
type: Type.OBJECT,
properties: {
name: { type: Type.STRING },
industry: { type: Type.STRING },
description: { type: Type.STRING },
products: {
type: Type.ARRAY,
items: { type: Type.STRING }
}
},
propertyOrdering: ['name', 'industry', 'description', 'products']
}
}
});
console.log('Extracted company info:', response?.text);
```
For more examples, check the [Gemini documentation](https://ai.google.dev/docs).
# Agent Development Kit (ADK)
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/google-adk
Integrate Firecrawl with Google's ADK using MCP for advanced agent workflows
Integrate Firecrawl with Google's Agent Development Kit (ADK) to build powerful AI agents with web scraping capabilities through the Model Context Protocol (MCP).
## Overview
Firecrawl provides an MCP server that seamlessly integrates with Google's ADK, enabling your agents to efficiently scrape, crawl, and extract structured data from any website. The integration supports both cloud-based and self-hosted Firecrawl instances with streamable HTTP for optimal performance.
## Features
* Efficient web scraping, crawling, and content discovery from any website
* Advanced search capabilities and intelligent content extraction
* Deep research and high-volume batch scraping
* Flexible deployment (cloud-based or self-hosted)
* Optimized for modern web environments with streamable HTTP support
## Prerequisites
* Obtain an API key for Firecrawl from [firecrawl.dev](https://firecrawl.dev)
* Install Google ADK
## Setup
```python Remote MCP Server theme={null}
from google.adk.agents.llm_agent import Agent
from google.adk.tools.mcp_tool.mcp_session_manager import StreamableHTTPServerParams
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
FIRECRAWL_API_KEY = "YOUR-API-KEY"
root_agent = Agent(
model="gemini-2.5-pro",
name="firecrawl_agent",
description='A helpful assistant for scraping websites with Firecrawl',
instruction='Help the user search for website content',
tools=[
MCPToolset(
connection_params=StreamableHTTPServerParams(
url=f"https://mcp.firecrawl.dev/{FIRECRAWL_API_KEY}/v2/mcp",
),
)
],
)
```
```python Local MCP Server theme={null}
from google.adk.agents.llm_agent import Agent
from google.adk.tools.mcp_tool.mcp_session_manager import StdioConnectionParams
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
from mcp import StdioServerParameters
root_agent = Agent(
model='gemini-2.5-pro',
name='firecrawl_agent',
description='A helpful assistant for scraping websites with Firecrawl',
instruction='Help the user search for website content',
tools=[
MCPToolset(
connection_params=StdioConnectionParams(
server_params=StdioServerParameters(
command='npx',
args=[
"-y",
"firecrawl-mcp",
],
env={
"FIRECRAWL_API_KEY": "YOUR-API-KEY",
}
),
timeout=30,
),
)
],
)
```
## Available Tools
| Tool | Name | Description |
| ------------------ | ------------------------------ | ------------------------------------------------------------------------------------ |
| Scrape Tool | `firecrawl_scrape` | Scrape content from a single URL with advanced options |
| Batch Scrape Tool | `firecrawl_batch_scrape` | Scrape multiple URLs efficiently with built-in rate limiting and parallel processing |
| Check Batch Status | `firecrawl_check_batch_status` | Check the status of a batch operation |
| Map Tool | `firecrawl_map` | Map a website to discover all indexed URLs on the site |
| Search Tool | `firecrawl_search` | Search the web and optionally extract content from search results |
| Crawl Tool | `firecrawl_crawl` | Start an asynchronous crawl with advanced options |
| Check Crawl Status | `firecrawl_check_crawl_status` | Check the status of a crawl job |
| Extract Tool | `firecrawl_extract` | Extract structured information from web pages using LLM capabilities |
## Configuration
### Required Configuration
**FIRECRAWL\_API\_KEY**: Your Firecrawl API key
* Required when using cloud API (default)
* Optional when using self-hosted instance with FIRECRAWL\_API\_URL
### Optional Configuration
**Firecrawl API URL (for self-hosted instances)**:
* `FIRECRAWL_API_URL`: Custom API endpoint
* Example: `https://firecrawl.your-domain.com`
* If not provided, the cloud API will be used
**Retry configuration**:
* `FIRECRAWL_RETRY_MAX_ATTEMPTS`: Maximum retry attempts (default: 3)
* `FIRECRAWL_RETRY_INITIAL_DELAY`: Initial delay in milliseconds (default: 1000)
* `FIRECRAWL_RETRY_MAX_DELAY`: Maximum delay in milliseconds (default: 10000)
* `FIRECRAWL_RETRY_BACKOFF_FACTOR`: Exponential backoff multiplier (default: 2)
**Credit usage monitoring**:
* `FIRECRAWL_CREDIT_WARNING_THRESHOLD`: Warning threshold (default: 1000)
* `FIRECRAWL_CREDIT_CRITICAL_THRESHOLD`: Critical threshold (default: 100)
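When running the local MCP server, these variables go in the `env` dict of `StdioServerParameters`. A minimal sketch, assuming a self-hosted instance (the URL and retry values are illustrative):
```python theme={null}
from mcp import StdioServerParameters

# Hypothetical self-hosted setup with tuned retries; all values are illustrative.
server_params = StdioServerParameters(
    command="npx",
    args=["-y", "firecrawl-mcp"],
    env={
        "FIRECRAWL_API_URL": "https://firecrawl.your-domain.com",  # self-hosted endpoint
        "FIRECRAWL_RETRY_MAX_ATTEMPTS": "5",
        "FIRECRAWL_RETRY_INITIAL_DELAY": "2000",  # milliseconds
        "FIRECRAWL_RETRY_BACKOFF_FACTOR": "2",
    },
)
```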
## Example: Web Research Agent
```python theme={null}
from google.adk.agents.llm_agent import Agent
from google.adk.tools.mcp_tool.mcp_session_manager import StreamableHTTPServerParams
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
FIRECRAWL_API_KEY = "YOUR-API-KEY"
# Create a research agent
research_agent = Agent(
model="gemini-2.5-pro",
name="research_agent",
description='An AI agent that researches topics by scraping and analyzing web content',
instruction='''You are a research assistant. When given a topic or question:
1. Use the search tool to find relevant websites
2. Scrape the most relevant pages for detailed information
3. Extract structured data when needed
4. Provide comprehensive, well-sourced answers''',
tools=[
MCPToolset(
connection_params=StreamableHTTPServerParams(
url=f"https://mcp.firecrawl.dev/{FIRECRAWL_API_KEY}/v2/mcp",
),
)
],
)
# Use the agent via ADK's in-memory runner
import asyncio

from google.adk.runners import InMemoryRunner
from google.genai import types

async def main():
    runner = InMemoryRunner(agent=research_agent)
    session = await runner.session_service.create_session(
        app_name=runner.app_name, user_id="user"
    )
    new_message = types.Content(
        role="user",
        parts=[types.Part(text="What are the latest features in Python 3.13?")],
    )
    async for event in runner.run_async(
        user_id="user", session_id=session.id, new_message=new_message
    ):
        if event.is_final_response() and event.content and event.content.parts:
            print(event.content.parts[0].text)

asyncio.run(main())
```
## Best Practices
1. **Use the right tool for the job**:
* `firecrawl_search` when you need to find relevant pages first
* `firecrawl_scrape` for single pages
* `firecrawl_batch_scrape` for multiple known URLs
* `firecrawl_crawl` for discovering and scraping entire sites
2. **Monitor your usage**: Configure credit thresholds to avoid unexpected usage
3. **Handle errors gracefully**: Configure retry settings based on your use case
4. **Optimize performance**: Use batch operations when scraping multiple URLs
***
## Related Resources
* Build powerful multi-agent AI systems using Google's ADK framework with Firecrawl for web scraping capabilities.
* Learn more about [Firecrawl's MCP server](https://docs.firecrawl.dev/mcp-server) integration and capabilities.
* Explore the official [Google Agent Development Kit documentation](https://google.github.io/adk-docs/) for comprehensive guides and API references.
# LangChain
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/langchain
Use Firecrawl with LangChain for web scraping + AI workflows
Integrate Firecrawl with LangChain to build AI applications powered by web data.
## Setup
```bash theme={null}
npm install @langchain/openai @mendable/firecrawl-js
```
Create a `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
```
> **Note:** If using Node \< 20, install `dotenv` and add `import 'dotenv/config'` to your code.
## Scrape + Chat
This example demonstrates a simple workflow: scrape a website and process the content using LangChain.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { ChatOpenAI } from '@langchain/openai';
import { HumanMessage } from '@langchain/core/messages';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const chat = new ChatOpenAI({
model: 'gpt-5-nano',
apiKey: process.env.OPENAI_API_KEY
});
const scrapeResult = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const response = await chat.invoke([
new HumanMessage(`Summarize: ${scrapeResult.markdown}`)
]);
console.log('Summary:', response.content);
```
## Chains
This example shows how to build a LangChain chain to process and analyze scraped content.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { ChatOpenAI } from '@langchain/openai';
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const model = new ChatOpenAI({
model: 'gpt-5-nano',
apiKey: process.env.OPENAI_API_KEY
});
const scrapeResult = await firecrawl.scrape('https://stripe.com', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
// Create processing chain
const prompt = ChatPromptTemplate.fromMessages([
['system', 'You are an expert at analyzing company websites.'],
['user', 'Extract the company name and main products from: {content}']
]);
const chain = prompt.pipe(model).pipe(new StringOutputParser());
// Execute the chain
const result = await chain.invoke({
content: scrapeResult.markdown
});
console.log('Chain result:', result);
```
## Tool Calling
This example demonstrates how to use LangChain's tool calling feature to let the model decide when to scrape websites.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { ChatOpenAI } from '@langchain/openai';
import { DynamicStructuredTool } from '@langchain/core/tools';
import { z } from 'zod';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
// Create the scraping tool
const scrapeWebsiteTool = new DynamicStructuredTool({
name: 'scrape_website',
description: 'Scrape content from any website URL',
schema: z.object({
url: z.string().url().describe('The URL to scrape')
}),
func: async ({ url }) => {
console.log('Scraping:', url);
const result = await firecrawl.scrape(url, {
formats: ['markdown']
});
console.log('Scraped content preview:', result.markdown?.substring(0, 200) + '...');
return result.markdown || 'No content scraped';
}
});
const model = new ChatOpenAI({
model: 'gpt-5-nano',
apiKey: process.env.OPENAI_API_KEY
}).bindTools([scrapeWebsiteTool]);
const response = await model.invoke('What is Firecrawl? Visit firecrawl.dev and tell me about it.');
console.log('Response:', response.content);
console.log('Tool calls:', response.tool_calls);
```
## Structured Data Extraction
This example shows how to extract structured data using LangChain's structured output feature.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { ChatOpenAI } from '@langchain/openai';
import { z } from 'zod';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const scrapeResult = await firecrawl.scrape('https://stripe.com', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const CompanyInfoSchema = z.object({
name: z.string(),
industry: z.string(),
description: z.string(),
products: z.array(z.string())
});
const model = new ChatOpenAI({
model: 'gpt-5-nano',
apiKey: process.env.OPENAI_API_KEY
}).withStructuredOutput(CompanyInfoSchema);
const companyInfo = await model.invoke([
{
role: 'system',
content: 'Extract company information from website content.'
},
{
role: 'user',
content: `Extract data: ${scrapeResult.markdown}`
}
]);
console.log('Extracted company info:', companyInfo);
```
For more examples, check the [LangChain documentation](https://js.langchain.com/docs).
# LangGraph
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/langgraph
Integrate Firecrawl with LangGraph for building agent workflows
This guide shows how to integrate Firecrawl with LangGraph to build AI agent workflows that can scrape and process web content.
## Setup
```bash theme={null}
npm install @langchain/langgraph @langchain/openai @mendable/firecrawl-js
```
Create a `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
```
> **Note:** If using Node \< 20, install `dotenv` and add `import 'dotenv/config'` to your code.
## Basic Workflow
This example demonstrates a basic LangGraph workflow that scrapes a website and analyzes the content.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { ChatOpenAI } from '@langchain/openai';
import { StateGraph, MessagesAnnotation, START, END } from '@langchain/langgraph';
// Initialize Firecrawl
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
// Initialize LLM
const llm = new ChatOpenAI({
model: "gpt-5-nano",
apiKey: process.env.OPENAI_API_KEY
});
// Define the scrape node
async function scrapeNode(state: typeof MessagesAnnotation.State) {
console.log('Scraping...');
const result = await firecrawl.scrape('https://firecrawl.dev', { formats: ['markdown'] });
return {
messages: [{
role: "system",
content: `Scraped content: ${result.markdown}`
}]
};
}
// Define the analyze node
async function analyzeNode(state: typeof MessagesAnnotation.State) {
console.log('Analyzing...');
const response = await llm.invoke(state.messages);
return { messages: [response] };
}
// Build the graph
const graph = new StateGraph(MessagesAnnotation)
.addNode("scrape", scrapeNode)
.addNode("analyze", analyzeNode)
.addEdge(START, "scrape")
.addEdge("scrape", "analyze")
.addEdge("analyze", END);
// Compile the graph
const app = graph.compile();
// Run the workflow
const result = await app.invoke({
messages: [{ role: "user", content: "Summarize the website" }]
});
console.log(JSON.stringify(result, null, 2));
```
## Multi-Step Workflow
This example demonstrates a more complex workflow that scrapes multiple URLs and processes them.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import { ChatOpenAI } from '@langchain/openai';
import { StateGraph, Annotation, START, END } from '@langchain/langgraph';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const llm = new ChatOpenAI({ model: "gpt-5-nano", apiKey: process.env.OPENAI_API_KEY });
// Define custom state
const WorkflowState = Annotation.Root({
  urls: Annotation<string[]>(),
  scrapedData: Annotation<Array<{ url: string; content: string }>>(),
  summary: Annotation<string>()
});
// Scrape multiple URLs
async function scrapeMultiple(state: typeof WorkflowState.State) {
const scrapedData: { url: string; content: string }[] = [];
for (const url of state.urls) {
const result = await firecrawl.scrape(url, { formats: ['markdown'] });
scrapedData.push({ url, content: result.markdown || '' });
}
return { scrapedData };
}
// Summarize all scraped content
async function summarizeAll(state: typeof WorkflowState.State) {
const combinedContent = state.scrapedData
.map(item => `Content from ${item.url}:\n${item.content}`)
.join('\n\n');
const response = await llm.invoke([
{ role: "user", content: `Summarize these websites:\n${combinedContent}` }
]);
return { summary: response.content as string };
}
// Build the workflow graph
const workflow = new StateGraph(WorkflowState)
.addNode("scrape", scrapeMultiple)
.addNode("summarize", summarizeAll)
.addEdge(START, "scrape")
.addEdge("scrape", "summarize")
.addEdge("summarize", END);
const app = workflow.compile();
// Execute workflow
const result = await app.invoke({
urls: ["https://firecrawl.dev", "https://firecrawl.dev/pricing"]
});
console.log(result.summary);
```
For more examples, check the [LangGraph documentation](https://langchain-ai.github.io/langgraphjs/).
# LlamaIndex
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/llamaindex
Use Firecrawl with LlamaIndex for RAG applications
Integrate Firecrawl with LlamaIndex to build AI applications with vector search and embeddings powered by web content.
## Setup
```bash theme={null}
npm install llamaindex @llamaindex/openai @mendable/firecrawl-js
```
Create a `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
```
> **Note:** If using Node \< 20, install `dotenv` and add `import 'dotenv/config'` to your code.
## RAG with Vector Search
This example demonstrates how to use LlamaIndex with Firecrawl to crawl a website, create embeddings, and query the content using RAG.
```typescript theme={null}
import Firecrawl from '@mendable/firecrawl-js';
import { Document, VectorStoreIndex, Settings } from 'llamaindex';
import { OpenAI, OpenAIEmbedding } from '@llamaindex/openai';
Settings.llm = new OpenAI({ model: "gpt-4o" });
Settings.embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });
const firecrawl = new Firecrawl({ apiKey: process.env.FIRECRAWL_API_KEY });
const crawlResult = await firecrawl.crawl('https://firecrawl.dev', {
limit: 10,
scrapeOptions: { formats: ['markdown'] }
});
console.log(`Crawled ${crawlResult.data.length} pages`);
const documents = crawlResult.data.map((page: any, i: number) =>
new Document({
text: page.markdown,
id_: `page-${i}`,
metadata: { url: page.metadata?.sourceURL }
})
);
const index = await VectorStoreIndex.fromDocuments(documents);
console.log('Vector index created with embeddings');
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query({ query: 'What is Firecrawl and how does it work?' });
console.log('\nAnswer:', response.toString());
```
For more examples, check the [LlamaIndex documentation](https://ts.llamaindex.ai/).
# Mastra
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/mastra
Use Firecrawl with Mastra for building AI workflows
Integrate Firecrawl with Mastra, the TypeScript framework for building AI agents and workflows.
## Setup
```bash theme={null}
npm install @mastra/core @mendable/firecrawl-js zod
```
Create a `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
```
> **Note:** If using Node \< 20, install `dotenv` and add `import 'dotenv/config'` to your code.
## Multi-Step Workflow
This example demonstrates a complete workflow that searches, scrapes, and summarizes documentation using Firecrawl and Mastra.
```typescript theme={null}
import { createWorkflow, createStep } from "@mastra/core/workflows";
import { z } from "zod";
import Firecrawl from "@mendable/firecrawl-js";
import { Agent } from "@mastra/core/agent";
const firecrawl = new Firecrawl({
apiKey: process.env.FIRECRAWL_API_KEY || "fc-YOUR_API_KEY"
});
const agent = new Agent({
name: "summarizer",
instructions: "You are a helpful assistant that creates concise summaries of documentation.",
model: "openai/gpt-5-nano",
});
// Step 1: Search with Firecrawl SDK
const searchStep = createStep({
id: "search",
inputSchema: z.object({
query: z.string(),
}),
outputSchema: z.object({
url: z.string(),
title: z.string(),
}),
execute: async ({ inputData }: { inputData: { query: string } }) => {
console.log(`Searching: ${inputData.query}`);
const searchResults = await firecrawl.search(inputData.query, { limit: 1 });
const webResults = (searchResults as any)?.web;
if (!webResults || !Array.isArray(webResults) || webResults.length === 0) {
throw new Error("No search results found");
}
const firstResult = webResults[0];
console.log(`Found: ${firstResult.title}`);
return {
url: firstResult.url,
title: firstResult.title,
};
},
});
// Step 2: Scrape the URL with Firecrawl SDK
const scrapeStep = createStep({
id: "scrape",
inputSchema: z.object({
url: z.string(),
title: z.string(),
}),
outputSchema: z.object({
markdown: z.string(),
title: z.string(),
}),
execute: async ({ inputData }: { inputData: { url: string; title: string } }) => {
console.log(`Scraping: ${inputData.url}`);
const scrapeResult = await firecrawl.scrape(inputData.url, {
formats: ["markdown"],
});
console.log(`Scraped: ${scrapeResult.markdown?.length || 0} characters`);
return {
markdown: scrapeResult.markdown || "",
title: inputData.title,
};
},
});
// Step 3: Summarize with the agent
const summarizeStep = createStep({
id: "summarize",
inputSchema: z.object({
markdown: z.string(),
title: z.string(),
}),
outputSchema: z.object({
summary: z.string(),
}),
execute: async ({ inputData }: { inputData: { markdown: string; title: string } }) => {
console.log(`Summarizing: ${inputData.title}`);
const prompt = `Summarize the following documentation in 2-3 sentences:\n\nTitle: ${inputData.title}\n\n${inputData.markdown}`;
const result = await agent.generate(prompt);
console.log(`Summary generated`);
return { summary: result.text };
},
});
// Create workflow
export const workflow = createWorkflow({
id: "firecrawl-workflow",
inputSchema: z.object({
query: z.string(),
}),
outputSchema: z.object({
summary: z.string(),
}),
steps: [searchStep, scrapeStep, summarizeStep],
})
.then(searchStep)
.then(scrapeStep)
.then(summarizeStep)
.commit();
async function testWorkflow() {
const run = await workflow.createRunAsync();
const result = await run.start({
inputData: { query: "Firecrawl documentation" }
});
if (result.status === "success") {
const { summarize } = result.steps;
if (summarize.status === "success") {
console.log(`\n${summarize.output.summary}`);
}
} else {
console.error("Workflow failed:", result.status);
}
}
testWorkflow().catch(console.error);
```
For more examples, check the [Mastra documentation](https://mastra.ai/docs).
# OpenAI
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/openai
Use Firecrawl with OpenAI for web scraping + AI workflows
Integrate Firecrawl with OpenAI to build AI applications powered by web data.
## Setup
```bash theme={null}
npm install @mendable/firecrawl-js openai zod
```
Create a `.env` file:
```bash theme={null}
FIRECRAWL_API_KEY=your_firecrawl_key
OPENAI_API_KEY=your_openai_key
```
> **Note:** If using Node \< 20, install `dotenv` and add `import 'dotenv/config'` to your code.
## Scrape + Summarize
This example demonstrates a simple workflow: scrape a website and summarize the content using an OpenAI model.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import OpenAI from 'openai';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
// Scrape the website content
const scrapeResult = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
// Summarize with OpenAI model
const completion = await openai.chat.completions.create({
model: 'gpt-5-nano',
messages: [
{ role: 'user', content: `Summarize: ${scrapeResult.markdown}` }
]
});
console.log('Summary:', completion.choices[0]?.message.content);
```
## Function Calling
This example shows how to use OpenAI's function calling feature to let the model decide when to scrape websites based on user requests.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import OpenAI from 'openai';
import { z } from 'zod';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const ScrapeArgsSchema = z.object({
url: z.string().describe('The URL of the website to scrape')
});
const tools = [{
type: 'function' as const,
function: {
name: 'scrape_website',
description: 'Scrape content from any website URL',
parameters: z.toJSONSchema(ScrapeArgsSchema)
}
}];
const response = await openai.chat.completions.create({
model: 'gpt-5-nano',
messages: [{
role: 'user',
content: 'What is Firecrawl? Visit firecrawl.dev and tell me about it.'
}],
tools
});
const message = response.choices[0]?.message;
if (message?.tool_calls && message.tool_calls.length > 0) {
for (const toolCall of message.tool_calls) {
if (toolCall.type === 'function') {
console.log('Tool called:', toolCall.function.name);
const args = ScrapeArgsSchema.parse(JSON.parse(toolCall.function.arguments));
const result = await firecrawl.scrape(args.url, {
formats: ['markdown'] // Other formats: html, links, etc.
});
console.log('Scraped content:', result.markdown?.substring(0, 200) + '...');
// Send the scraped content back to the model for final response
const finalResponse = await openai.chat.completions.create({
model: 'gpt-5-nano',
messages: [
{
role: 'user',
content: 'What is Firecrawl? Visit firecrawl.dev and tell me about it.'
},
message,
{
role: 'tool',
tool_call_id: toolCall.id,
content: result.markdown || 'No content scraped'
}
],
tools
});
console.log('Final response:', finalResponse.choices[0]?.message?.content);
}
}
} else {
console.log('Direct response:', message?.content);
}
```
## Structured Data Extraction
This example demonstrates how to use OpenAI models with structured outputs to extract specific data from scraped content.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import OpenAI from 'openai';
import { z } from 'zod';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const scrapeResult = await firecrawl.scrape('https://stripe.com', {
formats: ['markdown']
});
console.log('Scraped content length:', scrapeResult.markdown?.length);
const CompanyInfoSchema = z.object({
name: z.string(),
industry: z.string(),
description: z.string(),
products: z.array(z.string())
});
const response = await openai.chat.completions.create({
model: 'gpt-5-nano',
messages: [
{
role: 'system',
content: 'Extract company information from website content.'
},
{
role: 'user',
content: `Extract data: ${scrapeResult.markdown}`
}
],
response_format: {
type: 'json_schema',
json_schema: {
name: 'company_info',
schema: z.toJSONSchema(CompanyInfoSchema),
strict: true
}
}
});
const content = response.choices[0]?.message?.content;
const companyInfo = content ? CompanyInfoSchema.parse(JSON.parse(content)) : null;
console.log('Validated company info:', companyInfo);
```
## Search + Analyze
This example combines Firecrawl's search capabilities with OpenAI model analysis to find and summarize information from multiple sources.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import OpenAI from 'openai';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
// Search for relevant information
const searchResult = await firecrawl.search('Next.js 16 new features', {
limit: 3,
sources: [{ type: 'web' }], // Other sources: { type: 'news' }, { type: 'images' }
scrapeOptions: { formats: ['markdown'] }
});
console.log('Search results:', searchResult.web?.length, 'pages found');
// Analyze and summarize the key features
const analysis = await openai.chat.completions.create({
model: 'gpt-5-nano',
messages: [{
role: 'user',
content: `Summarize the key features: ${JSON.stringify(searchResult)}`
}]
});
console.log('Analysis:', analysis.choices[0]?.message?.content);
```
## Responses API with MCP
This example shows how to use OpenAI's Responses API with Firecrawl configured as an MCP (Model Context Protocol) server.
```typescript theme={null}
import OpenAI from 'openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const response = await openai.responses.create({
model: 'gpt-5-nano',
tools: [
{
type: 'mcp',
server_label: 'firecrawl',
server_description: 'A web search and scraping MCP server to scrape and extract content from websites.',
server_url: `https://mcp.firecrawl.dev/${process.env.FIRECRAWL_API_KEY}/v2/mcp`,
require_approval: 'never'
}
],
input: 'Find out what the top stories on Hacker News are and the latest blog post on OpenAI and summarize them in a bullet point format'
});
console.log('Response:', JSON.stringify(response.output, null, 2));
```
For more examples, check the [OpenAI documentation](https://platform.openai.com/docs).
# Vercel AI SDK
Source: https://docs.firecrawl.dev/developer-guides/llm-sdks-and-frameworks/vercel-ai-sdk
Firecrawl tools for Vercel AI SDK. Web scraping, search, interact, and crawling for AI applications.
Firecrawl tools for the Vercel AI SDK. Search, scrape, interact with pages, and crawl the web in your AI applications.
## Install
```bash theme={null}
npm install firecrawl-aisdk ai
```
Set environment variables:
```bash theme={null}
FIRECRAWL_API_KEY=fc-your-key # https://firecrawl.dev
AI_GATEWAY_API_KEY=your-key # https://vercel.com/ai-gateway
```
These examples use the [Vercel AI Gateway](https://vercel.com/ai-gateway) string model format, but Firecrawl tools work with any AI SDK provider. You can also use provider imports like `anthropic('claude-sonnet-4-5-20250929')` from `@ai-sdk/anthropic`.
## Quick Start
`FirecrawlTools()` bundles `search`, `scrape`, and `interact` by default.
```typescript theme={null}
import { generateText, stepCountIs } from 'ai';
import { FirecrawlTools } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
tools: FirecrawlTools(),
stopWhen: stepCountIs(30),
prompt: `
1. Use interact on Hacker News to identify the top story
2. Search for other perspectives on the same topic
3. Scrape the most relevant pages you found
4. Summarize everything you found
`,
});
```
## FirecrawlTools
`FirecrawlTools()` gives you the default tools plus an auto-generated `systemPrompt` you can pass to `generateText`.
```typescript theme={null}
import { generateText, stepCountIs } from 'ai';
import { FirecrawlTools } from 'firecrawl-aisdk';
const tools = FirecrawlTools();
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
system: `${tools.systemPrompt}\n\nAnswer with citations when possible.`,
tools,
stopWhen: stepCountIs(20),
prompt: 'Find the current Firecrawl pricing page and explain the available plans.',
});
```
You can customize defaults, opt into async tools, or disable individual tools:
```typescript theme={null}
const tools = FirecrawlTools({
search: { limit: 5 },
scrape: { formats: ['markdown'], onlyMainContent: true },
interact: { profile: { name: 'my-session', saveChanges: true } },
crawl: true,
agent: true,
});
```
```typescript theme={null}
// Disable interact, keep search + scrape
FirecrawlTools({ interact: false });
// Opt into deprecated browser compatibility
FirecrawlTools({ browser: {} });
// Include every available tool
FirecrawlTools({ all: true });
```
When scraping to answer a question about a page, prefer query format:
```typescript theme={null}
formats: [{ type: 'query', prompt: 'What does this page say about pricing and rate limits?' }]
```
Use `formats: ['markdown']` only when you need the full page content.
## Individual Tools
Every tool can be used directly or called with options:
```typescript theme={null}
import { generateText } from 'ai';
import { scrape, search } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Search for Firecrawl, then scrape the most relevant result.',
tools: { search, scrape },
});
const customScrape = scrape({ apiKey: 'fc-custom-key', apiUrl: 'https://api.firecrawl.dev' });
```
### Search + Scrape
```typescript theme={null}
import { generateText } from 'ai';
import { search, scrape } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Search for Firecrawl, scrape the top official result, and explain what it does.',
tools: { search, scrape },
});
```
### Map
```typescript theme={null}
import { generateText } from 'ai';
import { map } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Map https://docs.firecrawl.dev and list the main sections.',
tools: { map },
});
```
### Stream
```typescript theme={null}
import { streamText, stepCountIs } from 'ai';
import { scrape } from 'firecrawl-aisdk';
const result = streamText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'What are the first 100 words of firecrawl.dev?',
tools: { scrape },
stopWhen: stepCountIs(3),
});
for await (const chunk of result.textStream) {
process.stdout.write(chunk);
}
await result.fullStream;
```
## Interact
`interact()` creates a scrape-backed interactive session. Call `start(url)` to bootstrap a session and get a live view URL, then let the model reuse that session through the `interact` tool.
```typescript theme={null}
import { generateText, stepCountIs } from 'ai';
import { interact, search } from 'firecrawl-aisdk';
const interactTool = interact();
console.log('Live view:', await interactTool.start('https://news.ycombinator.com'));
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
tools: { interact: interactTool, search },
stopWhen: stepCountIs(25),
prompt: 'Use interact on the current Hacker News session, find the top story, then search for more context.',
});
await interactTool.close();
```
If you need the explicit live view URL after startup, use `interactTool.interactiveLiveViewUrl`.
Reuse browser state across sessions with profiles:
```typescript theme={null}
const interactTool = interact({
profile: { name: 'my-session', saveChanges: true },
});
```
`browser()` is deprecated. Prefer `interact()`.
## Async Tools
Crawl, batch scrape, and agent return a job ID. Pair them with `poll`.
### Crawl
```typescript theme={null}
import { generateText } from 'ai';
import { crawl, poll } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Crawl https://docs.firecrawl.dev (limit 3 pages) and summarize.',
tools: { crawl, poll },
});
```
### Batch Scrape
```typescript theme={null}
import { generateText } from 'ai';
import { batchScrape, poll } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Scrape https://firecrawl.dev and https://docs.firecrawl.dev, then compare them.',
tools: { batchScrape, poll },
});
```
### Agent
Autonomous web data gathering that searches, navigates, and extracts on its own.
```typescript theme={null}
import { generateText, stepCountIs } from 'ai';
import { agent, poll } from 'firecrawl-aisdk';
const { text } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Find the founders of Firecrawl, their roles, and their backgrounds.',
tools: { agent, poll },
stopWhen: stepCountIs(10),
});
```
## Logging
```typescript theme={null}
import { generateText } from 'ai';
import { logStep, scrape, stepLogger } from 'firecrawl-aisdk';
const logger = stepLogger();
const { text, usage } = await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Scrape https://firecrawl.dev and summarize it.',
tools: { scrape },
onStepFinish: logger.onStep,
experimental_onToolCallFinish: logger.onToolCallFinish,
});
logger.close();
logger.summary(usage);
await generateText({
model: 'anthropic/claude-sonnet-4-5',
prompt: 'Scrape https://firecrawl.dev and summarize it again.',
tools: { scrape },
onStepFinish: logStep,
});
```
## All Exports
```typescript theme={null}
import {
// Core tools
search, // Search the web
scrape, // Scrape a single URL
map, // Discover URLs on a site
crawl, // Crawl multiple pages (async, use poll)
batchScrape, // Scrape multiple URLs (async, use poll)
agent, // Autonomous web research (async, use poll)
// Job management
poll, // Poll async jobs for results
status, // Check job status
cancel, // Cancel running jobs
// Browser/session tools
interact, // interact({ profile: { name: '...' } })
browser, // deprecated compatibility export
// All-in-one bundle
FirecrawlTools, // FirecrawlTools({ search, scrape, interact, crawl, agent })
// Helpers
stepLogger, // Token stats per tool call
logStep, // Simple one-liner logging
} from 'firecrawl-aisdk';
```
# Agent
Source: https://docs.firecrawl.dev/features/agent
Gather data wherever it lives on the web.
Firecrawl `/agent` is a magic API that searches, navigates, and gathers data from the widest range of websites, finding data in hard-to-reach places and uncovering it in ways no other API can. It accomplishes in a few minutes what would take a human many hours — end-to-end data collection, without scripts or manual work.
Whether you need one data point or entire datasets at scale, Firecrawl `/agent` works to get your data.
**Think of `/agent` as deep research for data, wherever it is!**
**Research Preview**: Agent is in early access. Expect rough edges. It will get significantly better over time. [Share feedback →](mailto:product@firecrawl.com)
Agent builds on everything great about `/extract` and takes it further:
* **No URLs Required**: Just describe what you need via the `prompt` parameter. URLs are optional
* **Deep Web Search**: Autonomously searches and navigates deep into sites to find your data
* **Reliable and Accurate**: Works with a wide variety of queries and use cases
* **Faster**: Processes multiple sources in parallel for quicker results
Test the agent in the [interactive playground](https://www.firecrawl.dev/app/agent) — no code required.
## Using `/agent`
The only required parameter is `prompt`. Simply describe what data you want to extract. For structured output, provide a JSON schema. The SDKs support Pydantic (Python) and Zod (Node) for type-safe schema definitions:
```python Python theme={null}
from firecrawl import Firecrawl
from pydantic import BaseModel, Field
from typing import List, Optional
app = Firecrawl(api_key="fc-YOUR_API_KEY")
class Founder(BaseModel):
name: str = Field(description="Full name of the founder")
role: Optional[str] = Field(None, description="Role or position")
background: Optional[str] = Field(None, description="Professional background")
class FoundersSchema(BaseModel):
founders: List[Founder] = Field(description="List of founders")
result = app.agent(
prompt="Find the founders of Firecrawl",
schema=FoundersSchema,
model="spark-1-mini",
max_credits=100
)
print(result.data)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
import { z } from 'zod';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
const result = await firecrawl.agent({
prompt: "Find the founders of Firecrawl",
schema: z.object({
founders: z.array(z.object({
name: z.string().describe("Full name of the founder"),
role: z.string().describe("Role or position").optional(),
background: z.string().describe("Professional background").optional()
})).describe("List of founders")
}),
model: "spark-1-mini",
maxCredits: 100
});
console.log(result.data);
```
```bash cURL theme={null}
curl -X POST "https://api.firecrawl.dev/v2/agent" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Find the founders of Firecrawl",
"model": "spark-1-mini",
"maxCredits": 100,
"schema": {
"type": "object",
"properties": {
"founders": {
"type": "array",
"description": "List of founders",
"items": {
"type": "object",
"properties": {
"name": { "type": "string", "description": "Full name" },
"role": { "type": "string", "description": "Role or position" },
"background": { "type": "string", "description": "Professional background" }
},
"required": ["name"]
}
}
},
"required": ["founders"]
}
}'
```
### Response
```json JSON theme={null}
{
"success": true,
"status": "completed",
"data": {
"founders": [
{
"name": "Eric Ciarla",
"role": "Co-founder",
"background": "Previously at Mendable"
},
{
"name": "Nicolas Camara",
"role": "Co-founder",
"background": "Previously at Mendable"
},
{
"name": "Caleb Peffer",
"role": "Co-founder",
"background": "Previously at Mendable"
}
]
},
"expiresAt": "2024-12-15T00:00:00.000Z",
"creditsUsed": 15
}
```
## Providing URLs (Optional)
You can optionally provide URLs to focus the agent on specific pages:
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
result = app.agent(
urls=["https://docs.firecrawl.dev", "https://firecrawl.dev/pricing"],
prompt="Compare the features and pricing information from these pages"
)
print(result.data)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
const result = await firecrawl.agent({
urls: ["https://docs.firecrawl.dev", "https://firecrawl.dev/pricing"],
prompt: "Compare the features and pricing information from these pages"
});
console.log(result.data);
```
```bash cURL theme={null}
curl -X POST "https://api.firecrawl.dev/v2/agent" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://docs.firecrawl.dev",
"https://firecrawl.dev/pricing"
],
"prompt": "Compare the features and pricing information from these pages"
}'
```
## Job Status and Completion
Agent jobs run asynchronously. When you submit a job, you'll receive a Job ID that you can use to check status:
* **Default method**: `agent()` waits and returns final results
* **Start then poll**: Use `start_agent` (Python) or `startAgent` (Node) to get a Job ID immediately, then poll with `get_agent_status` / `getAgentStatus` (a polling sketch follows the examples below)
Job results are available via the API for 24 hours after completion. After this period, you can still view your agent history and results in the [activity logs](https://www.firecrawl.dev/app/logs).
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
# Start an agent job
agent_job = app.start_agent(
prompt="Find the founders of Firecrawl"
)
# Check the status
status = app.get_agent_status(agent_job.id)
print(status)
# Example output:
# status='completed'
# success=True
# data={ ... }
# expires_at=datetime.datetime(...)
# credits_used=15
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
// Start an agent job
const started = await firecrawl.startAgent({
prompt: "Find the founders of Firecrawl"
});
// Check the status
if (started.id) {
const status = await firecrawl.getAgentStatus(started.id);
console.log(status.status, status.data);
}
```
```bash cURL theme={null}
curl -X GET "https://api.firecrawl.dev/v2/agent/YOUR_JOB_ID" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY"
```
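To block until a started job finishes, poll the status until it leaves the `processing` state (see the table below). A minimal Python sketch (the polling interval is illustrative):
```python theme={null}
import time
from firecrawl import Firecrawl

app = Firecrawl(api_key="fc-YOUR_API_KEY")
job = app.start_agent(prompt="Find the founders of Firecrawl")

# Poll until the job leaves the `processing` state
status = app.get_agent_status(job.id)
while status.status == "processing":
    time.sleep(5)  # illustrative interval
    status = app.get_agent_status(job.id)

if status.status == "completed":
    print(status.data)
else:
    print(f"Job ended with status: {status.status}")
```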
### Possible States
| Status | Description |
| ------------ | ------------------------------------------ |
| `processing` | The agent is still working on your request |
| `completed` | Extraction finished successfully |
| `failed` | An error occurred during extraction |
| `cancelled` | The job was cancelled by the user |
#### Pending Example
```json JSON theme={null}
{
"success": true,
"status": "processing",
"expiresAt": "2024-12-15T00:00:00.000Z"
}
```
#### Completed Example
```json JSON theme={null}
{
"success": true,
"status": "completed",
"data": {
"founders": [
{
"name": "Eric Ciarla",
"role": "Co-founder"
},
{
"name": "Nicolas Camara",
"role": "Co-founder"
},
{
"name": "Caleb Peffer",
"role": "Co-founder"
}
]
},
"expiresAt": "2024-12-15T00:00:00.000Z",
"creditsUsed": 15
}
```
## Share agent runs
You can share agent runs directly from the Agent playground. Shared links are public — anyone with the link can view the run output and activity — and you can revoke access at any time to disable the link. Shared pages are not indexed by search engines.
## Model Selection
Firecrawl Agent offers two models. **Spark 1 Mini is 60% cheaper** and is the default — perfect for most use cases. Upgrade to Spark 1 Pro when you need maximum accuracy on complex tasks.
| Model | Cost | Accuracy | Best For |
| -------------- | --------------- | -------- | ------------------------------------- |
| `spark-1-mini` | **60% cheaper** | Standard | Most tasks (default) |
| `spark-1-pro` | Standard | Higher | Complex research, critical extraction |
**Start with Spark 1 Mini** (default) — it handles most extraction tasks well at 60% lower cost. Switch to Pro only for complex multi-domain research or when accuracy is critical.
### Spark 1 Mini (Default)
`spark-1-mini` is our efficient model, ideal for straightforward data extraction tasks.
**Use Mini when:**
* Extracting simple data points (contact info, pricing, etc.)
* Working with well-structured websites
* Cost efficiency is a priority
* Running high-volume extraction jobs
### Spark 1 Pro
`spark-1-pro` is our flagship model, designed for maximum accuracy on complex extraction tasks.
**Use Pro when:**
* Performing complex competitive analysis
* Extracting data that requires deep reasoning
* Accuracy is critical for your use case
* Dealing with ambiguous or hard-to-find data
### Specifying a Model
Pass the `model` parameter to select which model to use:
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
# Using Spark 1 Mini (default - can be omitted)
result = app.agent(
prompt="Find the pricing of Firecrawl",
model="spark-1-mini"
)
# Using Spark 1 Pro for complex tasks
result = app.agent(
prompt="Compare all enterprise features and pricing across Firecrawl, Apify, and ScrapingBee",
model="spark-1-pro"
)
print(result.data)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
// Using Spark 1 Mini (default - can be omitted)
const result = await firecrawl.agent({
prompt: "Find the pricing of Firecrawl",
model: "spark-1-mini"
});
// Using Spark 1 Pro for complex tasks
const resultPro = await firecrawl.agent({
prompt: "Compare all enterprise features and pricing across Firecrawl, Apify, and ScrapingBee",
model: "spark-1-pro"
});
console.log(result.data);
```
```bash cURL theme={null}
# Using Spark 1 Mini (default)
curl -X POST "https://api.firecrawl.dev/v2/agent" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Find the pricing of Firecrawl",
"model": "spark-1-mini"
}'
# Using Spark 1 Pro for complex tasks
curl -X POST "https://api.firecrawl.dev/v2/agent" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Compare all enterprise features and pricing across Firecrawl, Apify, and ScrapingBee",
"model": "spark-1-pro"
}'
```
## Parameters
| Parameter | Type | Required | Description |
| ------------ | ------ | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `prompt` | string | **Yes** | Natural language description of the data you want to extract (max 10,000 characters) |
| `model` | string | No | Model to use: `spark-1-mini` (default) or `spark-1-pro` |
| `urls` | array | No | Optional list of URLs to focus the extraction |
| `schema` | object | No | Optional JSON schema for structured output |
| `maxCredits` | number | No | Maximum number of credits to spend on this agent task. Defaults to **2,500** if not set. The dashboard supports values up to **2,500**; for higher limits, set `maxCredits` via the API (values above 2,500 are always treated as paid requests). If the limit is reached, the job fails and **no data is returned**. Failed runs are not billed: credits used for AI reasoning are never charged on failure, any credits used for tool calls during the run (scraping, search, mapping, etc.) are refunded, and the response reports `creditsUsed: 0`. |
## Agent vs Extract: What's Improved
| Feature | Agent (New) | Extract |
| ----------------- | ----------- | -------- |
| URLs Required | No | Yes |
| Speed | Faster | Standard |
| Cost | Lower | Standard |
| Reliability | Higher | Standard |
| Query Flexibility | High | Moderate |
## Example Use Cases
* **Research**: "Find the top 5 AI startups and their funding amounts"
* **Competitive Analysis**: "Compare pricing plans between Slack and Microsoft Teams"
* **Data Gathering**: "Extract contact information from company websites"
* **Content Summarization**: "Summarize the latest blog posts about web scraping"
## CSV Upload in Agent Playground
The [Agent Playground](https://www.firecrawl.dev/app/agent) supports CSV upload for batch processing. Your CSV can contain one or more columns of input data. For example, a single column of company names, or multiple columns such as company name, product, and website URL. Each row represents one item for the agent to process.
Upload your CSV, then add output columns using the "+" button in the grid header. Each column has its own prompt — click a column header to describe what the agent should find for that field (e.g., "CEO or founder name", "Total funding raised"). Hit Run, and the agent processes each row in parallel, filling in the results.
## API Reference
Check out the [Agent API Reference](/api-reference/endpoint/agent) for more details.
Have feedback or need help? Email [help@firecrawl.com](mailto:help@firecrawl.com).
## Pricing
Firecrawl Agent uses **dynamic billing** that scales with the complexity of your data extraction request. You pay based on the actual work Agent performs, ensuring fair pricing whether you're extracting simple data points or complex structured information from multiple sources.
### How Agent pricing works
Agent pricing is **dynamic and credit-based** during Research Preview:
* **Simple extractions** (like contact info from a single page) typically use fewer credits and cost less
* **Complex research tasks** (like competitive analysis across multiple domains) use more credits but reflect the total effort involved
* **Transparent usage** shows you exactly how many credits each request consumed
* **Credit conversion** automatically converts Agent usage into standard Firecrawl credits for easy billing
Credit usage varies based on the complexity of your prompt, the amount of data processed, and the structure of the output requested. As a rough guide, most agent runs consume **a few hundred credits**, though simpler single-page tasks may use less and complex multi-domain research may use more.
### Parallel Agents Pricing
If you are running multiple agents in parallel with Spark-1 Fast, pricing is much more predictable: a flat 10 credits per cell.
### Getting started
**All users** receive **5 free daily runs**, usable from either the playground or the API, to explore Agent's capabilities at no cost.
Additional usage is billed based on credit consumption.
### Managing costs
Agent can be expensive, but there are several ways to reduce the cost:
* **Start with free runs**: Use your 5 free daily runs to understand pricing
* **Set a `maxCredits` parameter**: Limit your spending by setting a maximum number of credits you're willing to spend. The dashboard caps this at 2,500 credits; to set a higher limit, use the `maxCredits` parameter directly via the API (note: values above 2,500 are always billed as paid requests)
* **Optimize prompts**: More specific prompts often use fewer credits
* **Break large tasks into smaller runs**: A single agent run has an output ceiling based on the underlying model's generation capacity (\~150-200 rows of structured data). For large extraction jobs, split by category, region, or URL batch (3-5 URLs per run) and merge the results, as shown in the sketch after this list. This also keeps each run well under the `maxCredits` limit.
* **Monitor usage**: Track your consumption through the dashboard
* **Set expectations**: Complex multi-domain research will use more credits than simple single-page extractions
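To split a large job, a loop over URL batches is usually enough. A minimal sketch, assuming the Python SDK's `agent` method accepts `urls` and `max_credits` keyword arguments:
```python Python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

all_urls = [...]  # your full URL list
merged = []
for i in range(0, len(all_urls), 4):  # 3-5 URLs per run, per the guidance above
    result = firecrawl.agent(
        prompt="Extract the product name and price from each page",
        urls=all_urls[i : i + 4],
        max_credits=500,  # keep each run well under the cap
    )
    merged.append(result.data)
```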
Try Agent now at [firecrawl.dev/app/agent](https://www.firecrawl.dev/app/agent) to see how credit usage scales with your specific use cases.
Pricing is subject to change as we move from Research Preview to general availability. Current users will receive advance notice of any pricing updates.
> Are you an AI agent that needs a Firecrawl API key? See [firecrawl.dev/agent-onboarding/SKILL.md](https://www.firecrawl.dev/agent-onboarding/SKILL.md) for automated onboarding instructions.
# Batch Scrape
Source: https://docs.firecrawl.dev/features/batch-scrape
Scrape multiple URLs in a single batch job
Batch scrape lets you scrape multiple URLs in a single job. Pass a list of URLs and optional parameters, and Firecrawl processes them concurrently and returns all results together.
* Works like `/crawl` but for an explicit list of URLs
* Synchronous and asynchronous modes
* Supports all scrape options including structured extraction
* Configurable concurrency per job
## How it works
You can run a batch scrape in two ways:
| Mode | SDK method (JS / Python) | Behavior |
| ------------ | ----------------------------------------- | ---------------------------------------------------------------- |
| Synchronous | `batchScrape` / `batch_scrape` | Starts the batch and waits for completion, returning all results |
| Asynchronous | `startBatchScrape` / `start_batch_scrape` | Starts the batch and returns a job ID for polling or webhooks |
## Basic usage
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
start = firecrawl.start_batch_scrape([
"https://firecrawl.dev",
"https://docs.firecrawl.dev",
], formats=["markdown"]) # returns id
job = firecrawl.batch_scrape([
"https://firecrawl.dev",
"https://docs.firecrawl.dev",
], formats=["markdown"], poll_interval=2, wait_timeout=120)
print(job.status, job.completed, job.total)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
// Start a batch scrape job
const { id } = await firecrawl.startBatchScrape([
'https://firecrawl.dev',
'https://docs.firecrawl.dev'
], {
options: { formats: ['markdown'] },
});
// Wait for completion
const job = await firecrawl.batchScrape([
'https://firecrawl.dev',
'https://docs.firecrawl.dev'
], { options: { formats: ['markdown'] }, pollInterval: 2, timeout: 120 });
console.log(job.status, job.completed, job.total);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/batch/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://firecrawl.dev", "https://docs.firecrawl.dev"],
"formats": ["markdown"]
}'
```
### Response
Calling `batchScrape` / `batch_scrape` returns the full results when the batch completes.
```json Completed theme={null}
{
"status": "completed",
"total": 36,
"completed": 36,
"creditsUsed": 36,
"expiresAt": "2024-00-00T00:00:00.000Z",
"next": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789?skip=26",
"data": [
{
"markdown": "[Firecrawl Docs home page!...",
"html": "...",
"metadata": {
"title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
"language": "en",
"sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
"description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
"ogLocaleAlternate": [],
"statusCode": 200
}
},
...
]
}
```
Calling `startBatchScrape` / `start_batch_scrape` returns a job ID you can track via `getBatchScrapeStatus` / `get_batch_scrape_status`, the API endpoint `/batch/scrape/{id}`, or webhooks. Job results are available via the API for 24 hours after completion. After this period, you can still view your batch scrape history and results in the [activity logs](https://www.firecrawl.dev/app/logs).
```json theme={null}
{
"success": true,
"id": "123-456-789",
"url": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789"
}
```
## Concurrency
By default, a batch scrape job uses your team's full concurrent browser limit (see [Rate Limits](/rate-limits)). You can lower this per job with the `maxConcurrency` parameter.
For example, `maxConcurrency: 50` limits that job to 50 simultaneous scrapes. Setting this value too low on large batches will significantly slow down processing, so only reduce it if you need to leave capacity for other concurrent jobs.
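A minimal sketch, assuming the Python SDK exposes the parameter as `max_concurrency`:
```python Python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# Sketch: cap this batch at 50 concurrent scrapes to leave capacity for other jobs.
job = firecrawl.batch_scrape(
    ["https://firecrawl.dev", "https://docs.firecrawl.dev"],
    formats=["markdown"],
    max_concurrency=50,
)
```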
## Structured extraction
You can use batch scrape to extract structured data from every page in the batch. This is useful when you want the same schema applied to a list of URLs.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")
# Scrape multiple websites:
batch_scrape_result = firecrawl.batch_scrape(
['https://docs.firecrawl.dev', 'https://docs.firecrawl.dev/sdks/overview'],
formats=[{
'type': 'json',
'prompt': 'Extract the title and description from the page.',
'schema': {
'type': 'object',
'properties': {
'title': {'type': 'string'},
'description': {'type': 'string'}
},
'required': ['title', 'description']
}
}]
)
print(batch_scrape_result)
# Or, you can use the start method:
batch_scrape_job = firecrawl.start_batch_scrape(
['https://docs.firecrawl.dev', 'https://docs.firecrawl.dev/sdks/overview'],
formats=[{
'type': 'json',
'prompt': 'Extract the title and description from the page.',
'schema': {
'type': 'object',
'properties': {
'title': {'type': 'string'},
'description': {'type': 'string'}
},
'required': ['title', 'description']
}
}]
)
print(batch_scrape_job)
# You can then use the job ID to check the status of the batch scrape:
batch_scrape_status = firecrawl.get_batch_scrape_status(batch_scrape_job.id)
print(batch_scrape_status)
```
```js Node theme={null}
import Firecrawl, { ScrapeResponse } from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({apiKey: "fc-YOUR_API_KEY"});
// Define schema to extract contents into
const schema = {
type: "object",
properties: {
title: { type: "string" },
description: { type: "string" }
},
required: ["title", "description"]
};
// Scrape multiple websites (synchronous):
const batchScrapeResult = await firecrawl.batchScrape(['https://docs.firecrawl.dev', 'https://docs.firecrawl.dev/sdks/overview'], {
formats: [
{
type: "json",
prompt: "Extract the title and description from the page.",
schema: schema
}
]
});
// Output all the results of the batch scrape:
console.log(batchScrapeResult)
// Or, you can use the start method:
const batchScrapeJob = await firecrawl.startBatchScrape(['https://docs.firecrawl.dev', 'https://docs.firecrawl.dev/sdks/overview'], {
formats: [
{
type: "json",
prompt: "Extract the title and description from the page.",
schema: schema
}
]
});
console.log(batchScrapeJob)
// You can then use the job ID to check the status of the batch scrape:
const batchScrapeStatus = await firecrawl.getBatchScrapeStatus(batchScrapeJob.id);
console.log(batchScrapeStatus)
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/batch/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"urls": ["https://docs.firecrawl.dev", "https://docs.firecrawl.dev/sdks/overview"],
"formats" : [{
"type": "json",
"prompt": "Extract the title and description from the page.",
"schema": {
"type": "object",
"properties": {
"title": {
"type": "string"
},
"description": {
"type": "string"
}
},
"required": [
"title",
"description"
]
}
}]
}'
```
### Response
`batchScrape` / `batch_scrape` returns full results:
```json Completed theme={null}
{
"status": "completed",
"total": 36,
"completed": 36,
"creditsUsed": 36,
"expiresAt": "2024-00-00T00:00:00.000Z",
"next": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789?skip=26",
"data": [
{
"json": {
"title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
"description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot."
}
},
...
]
}
```
`startBatchScrape` / `start_batch_scrape` returns a job ID:
```json theme={null}
{
"success": true,
"id": "123-456-789",
"url": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789"
}
```
## Webhooks
You can configure webhooks to receive real-time notifications as each URL in your batch is scraped. This lets you process results immediately instead of waiting for the entire batch to complete.
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/batch/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"urls": [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
],
"webhook": {
"url": "https://your-domain.com/webhook",
"metadata": {
"any_key": "any_value"
},
"events": ["started", "page", "completed"]
}
}'
```
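If you are using the SDK, the webhook configuration can be passed when starting the job. A minimal sketch, assuming the Python SDK accepts a `webhook` keyword argument here as the crawl methods do:
```python Python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# Sketch: start a batch scrape that notifies your server as pages complete.
job = firecrawl.start_batch_scrape(
    ["https://example.com/page1", "https://example.com/page2"],
    formats=["markdown"],
    webhook={
        "url": "https://your-domain.com/webhook",
        "metadata": {"any_key": "any_value"},
        "events": ["started", "page", "completed"],
    },
)
```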
### Event types
| Event | Description |
| ------------------------ | ------------------------------------- |
| `batch_scrape.started` | The batch scrape job has begun |
| `batch_scrape.page` | A single URL was successfully scraped |
| `batch_scrape.completed` | All URLs have been processed |
| `batch_scrape.failed` | The batch scrape encountered an error |
### Payload
Each webhook delivery includes a JSON body with the following structure:
```json theme={null}
{
"success": true,
"type": "batch_scrape.page",
"id": "batch-job-id",
"data": [...],
"metadata": {},
"error": null
}
```
### Verifying webhook signatures
Every webhook request from Firecrawl includes an `X-Firecrawl-Signature` header containing an HMAC-SHA256 signature. Always verify this signature to ensure the webhook is authentic and has not been tampered with.
1. Get your webhook secret from the [Advanced tab](https://www.firecrawl.dev/app/settings?tab=advanced) of your account settings
2. Extract the signature from the `X-Firecrawl-Signature` header
3. Compute HMAC-SHA256 of the raw request body using your secret
4. Compare with the signature header using a timing-safe function
Never process a webhook without verifying its signature first. The `X-Firecrawl-Signature` header contains the signature in the format: `sha256=abc123def456...`
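A minimal sketch of that check in Python (see the linked security docs below for production-ready examples):
```python Python theme={null}
import hashlib
import hmac

def is_valid_signature(raw_body: bytes, signature_header: str, secret: str) -> bool:
    # Recompute HMAC-SHA256 over the raw request body, then compare timing-safely.
    expected = "sha256=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```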
For complete implementation examples in JavaScript and Python, see the [Webhook Security documentation](/webhooks/security).
For comprehensive webhook documentation including detailed event payloads, advanced configuration, and troubleshooting, see the [Webhooks documentation](/webhooks/overview).
> Are you an AI agent that needs a Firecrawl API key? See [firecrawl.dev/agent-onboarding/SKILL.md](https://www.firecrawl.dev/agent-onboarding/SKILL.md) for automated onboarding instructions.
# Change Tracking
Source: https://docs.firecrawl.dev/features/change-tracking
Detect and monitor changes in web content between scrapes
Change tracking compares the current content of a page against the last time you scraped it. Add `changeTracking` to your `formats` array to detect whether a page is new, unchanged, or modified, and optionally get a structured diff of what changed.
* Works with `/scrape`, `/crawl`, and `/batch/scrape`
* Two diff modes: `git-diff` for line-level changes, `json` for field-level comparison
* Scoped to your team, and optionally scoped to a tag that you pass in
## How it works
Every scrape with `changeTracking` enabled stores a snapshot and compares it against the previous snapshot for that URL. Snapshots are stored persistently and do not expire, so comparisons remain accurate regardless of how much time has passed between scrapes.
| Scrape | Result |
| ----------------- | -------------------------------------------------- |
| First time | `changeStatus: "new"` (no previous version exists) |
| Content unchanged | `changeStatus: "same"` |
| Content modified | `changeStatus: "changed"` (diff data available) |
| Page removed | `changeStatus: "removed"` |
The response includes these fields in the `changeTracking` object:
| Field | Type | Description |
| ------------------ | --------------------- | ---------------------------------------------------------------------------------------------- |
| `previousScrapeAt` | `string \| null` | Timestamp of the previous scrape (`null` on first scrape) |
| `changeStatus` | `string` | `"new"`, `"same"`, `"changed"`, or `"removed"` |
| `visibility` | `string` | `"visible"` (discoverable via links/sitemap) or `"hidden"` (URL works but is no longer linked) |
| `diff` | `object \| undefined` | Line-level diff (only present in `git-diff` mode when status is `"changed"`) |
| `json` | `object \| undefined` | Field-level comparison (only present in `json` mode when status is `"changed"`) |
## Basic usage
Include both `markdown` and `changeTracking` in the `formats` array. The `markdown` format is required because change tracking compares pages via their markdown content.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
result = firecrawl.scrape(
"https://example.com/pricing",
formats=["markdown", "changeTracking"]
)
print(result.changeTracking)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const result = await firecrawl.scrape('https://example.com/pricing', {
formats: ['markdown', 'changeTracking']
});
console.log(result.changeTracking);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"formats": ["markdown", "changeTracking"]
}'
```
### Response
On the first scrape, `changeStatus` is `"new"` and `previousScrapeAt` is `null`:
```json theme={null}
{
"success": true,
"data": {
"markdown": "# Pricing\n\nStarter: $9/mo\nPro: $29/mo...",
"changeTracking": {
"previousScrapeAt": null,
"changeStatus": "new",
"visibility": "visible"
}
}
}
```
On subsequent scrapes, `changeStatus` reflects whether content changed:
```json theme={null}
{
"success": true,
"data": {
"markdown": "# Pricing\n\nStarter: $12/mo\nPro: $39/mo...",
"changeTracking": {
"previousScrapeAt": "2025-06-01T10:00:00.000+00:00",
"changeStatus": "changed",
"visibility": "visible"
}
}
}
```
## Git-diff mode
The `git-diff` mode returns line-by-line changes in a format similar to `git diff`. Pass an object in the `formats` array with `modes: ["git-diff"]`:
```python Python theme={null}
result = firecrawl.scrape(
"https://example.com/pricing",
formats=[
"markdown",
{
"type": "changeTracking",
"modes": ["git-diff"]
}
]
)
if result.changeTracking.changeStatus == "changed":
print(result.changeTracking.diff.text)
```
```js Node theme={null}
const result = await firecrawl.scrape('https://example.com/pricing', {
formats: [
'markdown',
{ type: 'changeTracking', modes: ['git-diff'] }
]
});
if (result.changeTracking.changeStatus === 'changed') {
console.log(result.changeTracking.diff.text);
}
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"formats": [
"markdown",
{ "type": "changeTracking", "modes": ["git-diff"] }
]
}'
```
### Response
The `diff` object contains both a plain-text diff and a structured JSON representation:
```json theme={null}
{
"changeTracking": {
"previousScrapeAt": "2025-06-01T10:00:00.000+00:00",
"changeStatus": "changed",
"visibility": "visible",
"diff": {
"text": "@@ -1,3 +1,3 @@\n # Pricing\n-Starter: $9/mo\n-Pro: $29/mo\n+Starter: $12/mo\n+Pro: $39/mo",
"json": {
"files": [{
"chunks": [{
"content": "@@ -1,3 +1,3 @@",
"changes": [
{ "type": "normal", "content": "# Pricing" },
{ "type": "del", "ln": 2, "content": "Starter: $9/mo" },
{ "type": "del", "ln": 3, "content": "Pro: $29/mo" },
{ "type": "add", "ln": 2, "content": "Starter: $12/mo" },
{ "type": "add", "ln": 3, "content": "Pro: $39/mo" }
]
}]
}]
}
}
}
}
```
The structured `diff.json` object contains:
* `files`: array of changed files (typically one for web pages)
* `chunks`: sections of changes within a file
* `changes`: individual line changes with `type` (`"add"`, `"del"`, or `"normal"`), line number (`ln`), and `content`
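A sketch of walking that structure, reusing `result` from the git-diff example above (attribute vs. dictionary access may vary by SDK version):
```python Python theme={null}
# Sketch: print added and removed lines from the structured diff.
diff = result.changeTracking.diff.json
for changed_file in diff["files"]:
    for chunk in changed_file["chunks"]:
        for change in chunk["changes"]:
            if change["type"] == "add":
                print(f"+ {change['content']}")
            elif change["type"] == "del":
                print(f"- {change['content']}")
```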
## JSON mode
The `json` mode extracts specific fields from both the current and previous version of the page using a schema you define. This is useful for tracking changes in structured data like prices, stock levels, or metadata without parsing a full diff.
Pass `modes: ["json"]` with a `schema` defining the fields to extract:
```python Python theme={null}
result = firecrawl.scrape(
"https://example.com/product/widget",
formats=[
"markdown",
{
"type": "changeTracking",
"modes": ["json"],
"schema": {
"type": "object",
"properties": {
"price": { "type": "string" },
"availability": { "type": "string" }
}
}
}
]
)
if result.changeTracking.changeStatus == "changed":
changes = result.changeTracking.json
print(f"Price: {changes['price']['previous']} → {changes['price']['current']}")
```
```js Node theme={null}
const result = await firecrawl.scrape('https://example.com/product/widget', {
formats: [
'markdown',
{
type: 'changeTracking',
modes: ['json'],
schema: {
type: 'object',
properties: {
price: { type: 'string' },
availability: { type: 'string' }
}
}
}
]
});
if (result.changeTracking.changeStatus === 'changed') {
const changes = result.changeTracking.json;
console.log(`Price: ${changes.price.previous} → ${changes.price.current}`);
}
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/product/widget",
"formats": [
"markdown",
{
"type": "changeTracking",
"modes": ["json"],
"schema": {
"type": "object",
"properties": {
"price": { "type": "string" },
"availability": { "type": "string" }
}
}
}
]
}'
```
### Response
Each field in the schema is returned with `previous` and `current` values:
```json theme={null}
{
"changeTracking": {
"previousScrapeAt": "2025-06-05T08:00:00.000+00:00",
"changeStatus": "changed",
"visibility": "visible",
"json": {
"price": {
"previous": "$19.99",
"current": "$24.99"
},
"availability": {
"previous": "In Stock",
"current": "In Stock"
}
}
}
}
```
You can also pass an optional `prompt` to guide the LLM extraction alongside the schema.
JSON mode uses LLM extraction and costs **5 credits per page**. Basic change tracking and `git-diff` mode have no additional cost.
## Using tags
By default, change tracking compares against your team's most recent scrape of the same URL. Tags let you maintain **separate tracking histories** for the same URL, which is useful when you monitor the same page at different intervals or in different contexts.
```python Python theme={null}
# Hourly monitoring (compared against last "hourly" scrape)
result = firecrawl.scrape(
"https://example.com/pricing",
formats=[
"markdown",
{ "type": "changeTracking", "tag": "hourly" }
]
)
# Daily summary (compared against last "daily" scrape)
result = firecrawl.scrape(
"https://example.com/pricing",
formats=[
"markdown",
{ "type": "changeTracking", "tag": "daily" }
]
)
```
```js Node theme={null}
// Hourly monitoring (compared against last "hourly" scrape)
const result = await firecrawl.scrape('https://example.com/pricing', {
formats: [
'markdown',
{ type: 'changeTracking', tag: 'hourly' }
]
});
// Daily summary (compared against last "daily" scrape)
const result2 = await firecrawl.scrape('https://example.com/pricing', {
formats: [
'markdown',
{ type: 'changeTracking', tag: 'daily' }
]
});
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"formats": [
"markdown",
{ "type": "changeTracking", "tag": "hourly" }
]
}'
```
## Crawl with change tracking
Add change tracking to crawl operations to monitor an entire site for changes. Pass the `changeTracking` format inside `scrapeOptions`:
```python Python theme={null}
result = firecrawl.crawl(
"https://example.com",
limit=50,
scrape_options={
"formats": ["markdown", "changeTracking"]
}
)
for page in result.data:
status = page.changeTracking.changeStatus
url = page.metadata.url
print(f"{url}: {status}")
```
```js Node theme={null}
const result = await firecrawl.crawl('https://example.com', {
limit: 50,
scrapeOptions: {
formats: ['markdown', 'changeTracking']
}
});
for (const page of result.data) {
console.log(`${page.metadata.url}: ${page.changeTracking.changeStatus}`);
}
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/crawl" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"limit": 50,
"scrapeOptions": {
"formats": ["markdown", "changeTracking"]
}
}'
```
## Batch scrape with change tracking
Use [batch scrape](/features/batch-scrape) to monitor a specific set of URLs:
```python Python theme={null}
result = firecrawl.batch_scrape(
[
"https://example.com/pricing",
"https://example.com/product/widget",
"https://example.com/blog/latest"
],
formats=["markdown", {"type": "changeTracking", "modes": ["git-diff"]}]
)
```
```js Node theme={null}
const result = await firecrawl.batchScrape([
'https://example.com/pricing',
'https://example.com/product/widget',
'https://example.com/blog/latest'
], {
options: {
formats: ['markdown', { type: 'changeTracking', modes: ['git-diff'] }]
}
});
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/batch/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://example.com/pricing",
"https://example.com/product/widget",
"https://example.com/blog/latest"
],
"formats": [
"markdown",
{ "type": "changeTracking", "modes": ["git-diff"] }
]
}'
```
## Scheduling change tracking
Change tracking is most useful when you scrape on a regular schedule. You can automate this with cron, cloud schedulers, or workflow tools.
### Cron job
Create a script that scrapes a URL and alerts on changes:
```bash check-pricing.sh theme={null}
#!/bin/bash
RESPONSE=$(curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://competitor.com/pricing",
"formats": [
"markdown",
{
"type": "changeTracking",
"modes": ["json"],
"schema": {
"type": "object",
"properties": {
"starter_price": { "type": "string" },
"pro_price": { "type": "string" }
}
}
}
]
}')
STATUS=$(echo "$RESPONSE" | jq -r '.data.changeTracking.changeStatus')
if [ "$STATUS" = "changed" ]; then
echo "$RESPONSE" | jq '.data.changeTracking.json'
# Send alert via email, Slack, etc.
fi
```
Schedule it with `crontab -e`:
```bash theme={null}
0 */6 * * * /path/to/check-pricing.sh >> /var/log/price-monitor.log 2>&1
```
| Schedule | Expression |
| ------------------------ | ------------- |
| Every hour | `0 * * * *` |
| Every 6 hours | `0 */6 * * *` |
| Daily at 9 AM | `0 9 * * *` |
| Weekly on Monday at 8 AM | `0 8 * * 1` |
### Cloud and serverless schedulers
* **AWS**: EventBridge rule triggering a Lambda function
* **GCP**: Cloud Scheduler triggering a Cloud Function
* **Vercel / Netlify**: Cron-triggered serverless functions
* **GitHub Actions**: Scheduled workflows using the `schedule` trigger with a `cron` expression
### Workflow automation
No-code platforms like **n8n**, **Zapier**, and **Make** can call the Firecrawl API on a schedule and route results to Slack, email, or databases. See the [workflow automation guides](/developer-guides/workflow-automation/n8n).
## Webhooks
For async operations like crawl and batch scrape, use [webhooks](/webhooks/overview) to receive change tracking results as they arrive instead of polling.
```python Python theme={null}
job = firecrawl.start_crawl(
"https://example.com",
limit=50,
scrape_options={
"formats": [
"markdown",
{"type": "changeTracking", "modes": ["git-diff"]}
]
},
webhook={
"url": "https://your-server.com/firecrawl-webhook",
"headers": {"Authorization": "Bearer your-webhook-secret"},
"events": ["crawl.page", "crawl.completed"]
}
)
```
```js Node theme={null}
const { id } = await firecrawl.startCrawl('https://example.com', {
limit: 50,
scrapeOptions: {
formats: [
'markdown',
{ type: 'changeTracking', modes: ['git-diff'] }
]
},
webhook: {
url: 'https://your-server.com/firecrawl-webhook',
headers: { Authorization: 'Bearer your-webhook-secret' },
events: ['crawl.page', 'crawl.completed']
}
});
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/crawl" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"limit": 50,
"scrapeOptions": {
"formats": [
"markdown",
{ "type": "changeTracking", "modes": ["git-diff"] }
]
},
"webhook": {
"url": "https://your-server.com/firecrawl-webhook",
"headers": { "Authorization": "Bearer your-webhook-secret" },
"events": ["crawl.page", "crawl.completed"]
}
}'
```
The `crawl.page` event payload includes the `changeTracking` object for each page:
```json theme={null}
{
"success": true,
"type": "crawl.page",
"id": "550e8400-e29b-41d4-a716-446655440000",
"data": [{
"markdown": "# Pricing\n\nStarter: $12/mo...",
"metadata": {
"title": "Pricing",
"url": "https://example.com/pricing",
"statusCode": 200
},
"changeTracking": {
"previousScrapeAt": "2025-06-05T12:00:00.000+00:00",
"changeStatus": "changed",
"visibility": "visible",
"diff": {
"text": "@@ -2,1 +2,1 @@\n-Starter: $9/mo\n+Starter: $12/mo"
}
}
}]
}
```
For webhook configuration details (headers, metadata, events, retries, signature verification), see the [Webhooks documentation](/webhooks/overview).
## Configuration reference
The full set of options available when passing a `changeTracking` format object:
| Parameter | Type | Default | Description |
| --------- | ---------- | ---------- | ----------------------------------------------------------------- |
| `type` | `string` | (required) | Must be `"changeTracking"` |
| `modes` | `string[]` | `[]` | Diff modes to enable: `"git-diff"`, `"json"`, or both |
| `schema` | `object` | (none) | JSON Schema for field-level comparison (required for `json` mode) |
| `prompt` | `string` | (none) | Custom prompt to guide LLM extraction (used with `json` mode) |
| `tag` | `string` | `null` | Separate tracking history identifier |
### Data models
```ts TypeScript theme={null}
interface ChangeTrackingResult {
previousScrapeAt: string | null;
changeStatus: "new" | "same" | "changed" | "removed";
visibility: "visible" | "hidden";
diff?: {
text: string;
json: {
files: Array<{
from: string | null;
to: string | null;
chunks: Array<{
content: string;
changes: Array<{
type: "add" | "del" | "normal";
ln?: number;
ln1?: number;
ln2?: number;
content: string;
}>;
}>;
}>;
};
};
  json?: Record<string, any>;
}
```
```python Python theme={null}
from typing import Any, Dict, Optional

from pydantic import BaseModel

class ChangeTrackingData(BaseModel):
previous_scrape_at: Optional[str] = None
change_status: str # "new" | "same" | "changed" | "removed"
visibility: str # "visible" | "hidden"
diff: Optional[Dict[str, Any]] = None
json: Optional[Dict[str, Any]] = None
```
## Important details
The `markdown` format must always be included alongside `changeTracking`. Change tracking compares pages via their markdown content.
* **Snapshot retention**: Snapshots are stored persistently and do not expire. A scrape performed months after the previous one will still compare correctly against the earlier snapshot.
* **Scoping**: Comparisons are scoped to your team. Your first scrape of any URL returns `"new"`, even if other users have scraped it.
* **URL matching**: Previous scrapes are matched on exact source URL, team ID, `markdown` format, and `tag`. Keep URLs consistent between scrapes.
* **Parameter consistency**: Using different `includeTags`, `excludeTags`, or `onlyMainContent` settings across scrapes of the same URL produces unreliable comparisons.
* **Comparison algorithm**: The algorithm is resistant to whitespace and content order changes. Iframe source URLs are ignored to handle captcha/antibot randomization.
* **Caching**: Requests with `changeTracking` bypass the index cache. The `maxAge` parameter is ignored.
* **Error handling**: Monitor the `warning` field in responses and handle the `changeTracking` object potentially being absent (this can occur if the database lookup for the previous scrape times out).
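For the last point, a defensive sketch:
```python Python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# Sketch: tolerate a missing changeTracking object (e.g. a previous-scrape lookup timeout).
result = firecrawl.scrape("https://example.com/pricing", formats=["markdown", "changeTracking"])

tracking = getattr(result, "changeTracking", None)
if tracking is None:
    print("Change tracking unavailable:", getattr(result, "warning", None))
elif tracking.changeStatus == "changed":
    print("Page changed since", tracking.previousScrapeAt)
```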
## Billing
| Mode | Cost |
| --------------------- | --------------------------------------- |
| Basic change tracking | No extra cost (standard scrape credits) |
| `git-diff` mode | No extra cost |
| `json` mode | 5 credits per page |
> Are you an AI agent that needs a Firecrawl API key? See [firecrawl.dev/agent-onboarding/SKILL.md](https://www.firecrawl.dev/agent-onboarding/SKILL.md) for automated onboarding instructions.
# Crawl
Source: https://docs.firecrawl.dev/features/crawl
Recursively crawl a website and get content from every page
Crawl submits a URL to Firecrawl and recursively discovers and scrapes every reachable subpage. It handles sitemaps, JavaScript rendering, and rate limits automatically, returning clean markdown or structured data for each page.
* Discovers pages via sitemap and recursive link traversal
* Supports path filtering, depth limits, and subdomain/external link control
* Returns results via polling, WebSocket, or webhook
Test crawling in the interactive playground — no code required.
## Installation
```python Python theme={null}
# pip install firecrawl-py
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
```
```js Node theme={null}
// npm install @mendable/firecrawl-js
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
```
```bash CLI theme={null}
# Install globally with npm
npm install -g firecrawl
# Authenticate (one-time setup)
firecrawl login
```
## Basic usage
Submit a crawl job by calling `POST /v2/crawl` with a starting URL. The endpoint returns a job ID that you use to poll for results.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
docs = firecrawl.crawl(url="https://docs.firecrawl.dev", limit=10)
print(docs)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const docs = await firecrawl.crawl('https://docs.firecrawl.dev', { limit: 10 });
console.log(docs);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/crawl" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.firecrawl.dev",
"limit": 10
}'
```
```bash CLI theme={null}
# Start a crawl job (returns job ID)
firecrawl crawl https://firecrawl.dev
# Wait for completion with progress
firecrawl crawl https://firecrawl.dev --wait --progress --limit 100
```
Each page crawled consumes 1 credit. The default crawl `limit` is 10,000 pages. Before starting, the crawl endpoint checks that your remaining credits can cover the `limit` — if not, it returns a **402 (Payment Required)** error. Set a lower `limit` to match your intended crawl size (e.g. `limit: 100`) to avoid this. Additional credits apply for certain options: JSON mode costs 4 additional credits per page, enhanced proxy costs 4 additional credits per page, and PDF parsing costs 1 credit per PDF page.
### Scrape options
All options from the [Scrape endpoint](/api-reference/endpoint/scrape) are available in crawl via `scrapeOptions` (JS) / `scrape_options` (Python). These apply to every page the crawler scrapes, including formats, proxy, caching, actions, location, and tags.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')
# Crawl with scrape options
response = firecrawl.crawl('https://example.com',
limit=100,
scrape_options={
'formats': [
'markdown',
{ 'type': 'json', 'schema': { 'type': 'object', 'properties': { 'title': { 'type': 'string' } } } }
],
'proxy': 'auto',
'max_age': 600000,
'only_main_content': True
}
)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR_API_KEY' });
// Crawl with scrape options
const crawlResponse = await firecrawl.crawl('https://example.com', {
limit: 100,
scrapeOptions: {
formats: [
'markdown',
{
type: 'json',
schema: { type: 'object', properties: { title: { type: 'string' } } },
},
],
proxy: 'auto',
maxAge: 600000,
onlyMainContent: true,
},
});
```
## Checking crawl status
Use the job ID to poll for the crawl status and retrieve results.
```python Python theme={null}
status = firecrawl.get_crawl_status("<crawl-id>")
print(status)
```
```js Node theme={null}
const status = await firecrawl.getCrawlStatus("<crawl-id>");
console.log(status);
```
```bash cURL theme={null}
# After starting a crawl, poll status by jobId
curl -s -X GET "https://api.firecrawl.dev/v2/crawl/<crawl-id>" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY"
```
```bash CLI theme={null}
# Check crawl status using job ID
firecrawl crawl <crawl-id>
```
Job results are available via the API for 24 hours after completion. After this period, you can still view your crawl history and results in the [activity logs](https://www.firecrawl.dev/app/logs).
Pages in the crawl results `data` array are pages that Firecrawl successfully scraped, even if the target site returned an HTTP error like 404. The `metadata.statusCode` field shows the HTTP status code from the target site. To retrieve pages that Firecrawl itself failed to scrape (e.g. network errors, timeouts, or robots.txt blocks), use the dedicated [Get Crawl Errors](/api-reference/endpoint/crawl-get-errors) endpoint (`GET /crawl/{id}/errors`).
### Response handling
The response varies based on the crawl's status. For incomplete or large responses exceeding 10MB, a `next` URL parameter is provided. You must request this URL to retrieve the next 10MB of data. If the `next` parameter is absent, it indicates the end of the crawl data.
The `skip` and `next` parameters are only relevant when hitting the API directly. If you're using the SDK, pagination is handled automatically and all results are returned at once.
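If you are calling the API directly, follow the `next` links until they stop appearing. A minimal sketch using the `requests` library (the job URL is a placeholder):
```python Python theme={null}
import os

import requests

headers = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}
url = "https://api.firecrawl.dev/v2/crawl/<crawl-id>"  # placeholder job URL

documents = []
while url:
    page = requests.get(url, headers=headers).json()
    documents.extend(page.get("data", []))
    url = page.get("next")  # absent once all data has been returned
print(len(documents), "documents retrieved")
```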
```json Scraping theme={null}
{
"status": "scraping",
"total": 36,
"completed": 10,
"creditsUsed": 10,
"expiresAt": "2024-00-00T00:00:00.000Z",
"next": "https://api.firecrawl.dev/v2/crawl/123-456-789?skip=10",
"data": [
{
"markdown": "[Firecrawl Docs home page!...",
"html": "...",
"metadata": {
"title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
"language": "en",
"sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
"description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
"ogLocaleAlternate": [],
"statusCode": 200
}
},
...
]
}
```
```json Completed theme={null}
{
"status": "completed",
"total": 36,
"completed": 36,
"creditsUsed": 36,
"expiresAt": "2024-00-00T00:00:00.000Z",
"next": "https://api.firecrawl.dev/v2/crawl/123-456-789?skip=26",
"data": [
{
"markdown": "[Firecrawl Docs home page!...",
"html": "...",
"metadata": {
"title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
"language": "en",
"sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
"description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
"ogLocaleAlternate": [],
"statusCode": 200
}
},
...
]
}
```
## SDK methods
There are two ways to use crawl with the SDK.
### Crawl and wait
The `crawl` method waits for the crawl to complete and returns the full response. It handles pagination automatically. This is recommended for most use cases.
```python Python theme={null}
from firecrawl import Firecrawl
from firecrawl.types import ScrapeOptions
firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")
# Crawl a website:
crawl_status = firecrawl.crawl(
'https://firecrawl.dev',
limit=100,
scrape_options=ScrapeOptions(formats=['markdown', 'html']),
poll_interval=30
)
print(crawl_status)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({apiKey: "fc-YOUR_API_KEY"});
const crawlResponse = await firecrawl.crawl('https://firecrawl.dev', {
limit: 100,
scrapeOptions: {
formats: ['markdown', 'html'],
}
})
console.log(crawlResponse)
```
The response includes the crawl status and all scraped data:
```bash Python theme={null}
success=True
status='completed'
completed=100
total=100
creditsUsed=100
expiresAt=datetime.datetime(2025, 4, 23, 19, 21, 17, tzinfo=TzInfo(UTC))
next=None
data=[
Document(
markdown='[Day 7 - Launch Week III.Integrations DayApril 14th to 20th](...',
metadata={
'title': '15 Python Web Scraping Projects: From Beginner to Advanced',
...
'scrapeId': '97dcf796-c09b-43c9-b4f7-868a7a5af722',
'sourceURL': 'https://www.firecrawl.dev/blog/python-web-scraping-projects',
'url': 'https://www.firecrawl.dev/blog/python-web-scraping-projects',
'statusCode': 200
}
),
...
]
```
```json Node theme={null}
{
success: true,
status: "completed",
completed: 100,
total: 100,
creditsUsed: 100,
expiresAt: "2025-04-23T19:28:45.000Z",
data: [
{
markdown: "[Day 7 - Launch Week III.Integrations DayApril ...",
      html: "...",
      ...
    },
    ...
  ]
}
```
With `startCrawl` / `start_crawl`, the initial response returns the job ID:
```json theme={null}
{
"success": true,
"id": "123-456-789",
"url": "https://api.firecrawl.dev/v2/crawl/123-456-789"
}
```
## Real-time results with WebSocket
The watcher method provides real-time updates as pages are crawled. Start a crawl, then subscribe to events for immediate data processing.
```python Python theme={null}
import asyncio
from firecrawl import AsyncFirecrawl
async def main():
firecrawl = AsyncFirecrawl(api_key="fc-YOUR-API-KEY")
# Start a crawl first
started = await firecrawl.start_crawl("https://firecrawl.dev", limit=5)
# Watch updates (snapshots) until terminal status
async for snapshot in firecrawl.watcher(started.id, kind="crawl", poll_interval=2, timeout=120):
if snapshot.status == "completed":
print("DONE", snapshot.status)
for doc in snapshot.data:
print("DOC", doc.metadata.source_url if doc.metadata else None)
elif snapshot.status == "failed":
print("ERR", snapshot.status)
else:
print("STATUS", snapshot.status, snapshot.completed, "/", snapshot.total)
asyncio.run(main())
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });
// Start a crawl and then watch it
const { id } = await firecrawl.startCrawl('https://mendable.ai', {
excludePaths: ['blog/*'],
limit: 5,
});
const watcher = firecrawl.watcher(id, { kind: 'crawl', pollInterval: 2, timeout: 120 });
watcher.on('document', (doc) => {
console.log('DOC', doc);
});
watcher.on('error', (err) => {
console.error('ERR', err?.error || err);
});
watcher.on('done', (state) => {
console.log('DONE', state.status);
});
// Begin watching (WS with HTTP fallback)
await watcher.start();
```
## Webhooks
You can configure webhooks to receive real-time notifications as your crawl progresses. This allows you to process pages as they are scraped instead of waiting for the entire crawl to complete.
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/crawl \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://docs.firecrawl.dev",
"limit": 100,
"webhook": {
"url": "https://your-domain.com/webhook",
"metadata": {
"any_key": "any_value"
},
"events": ["started", "page", "completed"]
}
}'
```
### Event types
| Event | Description |
| ----------------- | ---------------------------------------- |
| `crawl.started` | Fires when the crawl begins |
| `crawl.page` | Fires for each page successfully scraped |
| `crawl.completed` | Fires when the crawl finishes |
| `crawl.failed` | Fires if the crawl encounters an error |
### Payload
```json theme={null}
{
"success": true,
"type": "crawl.page",
"id": "crawl-job-id",
"data": [...], // Page data for 'page' events
"metadata": {}, // Your custom metadata
"error": null
}
```
### Verifying webhook signatures
Every webhook request from Firecrawl includes an `X-Firecrawl-Signature` header containing an HMAC-SHA256 signature. Always verify this signature to ensure the webhook is authentic and has not been tampered with.
1. Get your webhook secret from the [Advanced tab](https://www.firecrawl.dev/app/settings?tab=advanced) of your account settings
2. Extract the signature from the `X-Firecrawl-Signature` header
3. Compute HMAC-SHA256 of the raw request body using your secret
4. Compare with the signature header using a timing-safe function
Never process a webhook without verifying its signature first. The `X-Firecrawl-Signature` header contains the signature in the format: `sha256=abc123def456...`
For complete implementation examples in JavaScript and Python, see the [Webhook Security documentation](/webhooks/security). For comprehensive webhook documentation including detailed event payloads, payload structure, advanced configuration, and troubleshooting, see the [Webhooks documentation](/webhooks/overview).
## Configuration reference
The full set of parameters available when submitting a crawl job:
| Parameter | Type | Default | Description |
| ----------------------- | ---------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `url` | `string` | (required) | The starting URL to crawl from |
| `limit` | `integer` | `10000` | Maximum number of pages to crawl |
| `maxDiscoveryDepth` | `integer` | (none) | Maximum depth from the root URL based on link-discovery hops, not the number of `/` segments in the URL. Each time a new URL is found on a page, it is assigned a depth one higher than the page it was discovered on. The root site and sitemapped pages have a discovery depth of 0. Pages at the max depth are still scraped, but links on them are not followed. |
| `includePaths` | `string[]` | (none) | URL pathname regex patterns to include. Only matching paths are crawled. |
| `excludePaths` | `string[]` | (none) | URL pathname regex patterns to exclude from the crawl |
| `regexOnFullURL` | `boolean` | `false` | Match `includePaths`/`excludePaths` against the full URL (including query parameters) instead of just the pathname |
| `crawlEntireDomain` | `boolean` | `false` | Follow internal links to sibling or parent URLs, not just child paths |
| `allowSubdomains` | `boolean` | `false` | Follow links to subdomains of the main domain |
| `allowExternalLinks` | `boolean` | `false` | Follow links to external websites |
| `sitemap` | `string` | `"include"` | Sitemap handling: `"include"` (default), `"skip"`, or `"only"` |
| `ignoreQueryParameters` | `boolean` | `false` | Avoid re-scraping the same path with different query parameters |
| `ignoreRobotsTxt` | `boolean` | `false` | Ignore the website's robots.txt rules. **Enterprise only** — contact [support@firecrawl.com](mailto:support@firecrawl.com) to enable. |
| `robotsUserAgent` | `string` | (none) | Custom User-Agent string for robots.txt evaluation. When set, robots.txt is fetched with this User-Agent and rules are matched against it instead of the default. **Enterprise only** — contact [support@firecrawl.com](mailto:support@firecrawl.com) to enable. |
| `delay` | `number` | (none) | Delay in seconds between scrapes to respect rate limits. Setting this forces concurrency to 1. |
| `maxConcurrency` | `integer` | (none) | Maximum concurrent scrapes. Defaults to your team's concurrency limit. |
| `scrapeOptions` | `object` | (none) | Options applied to every scraped page (formats, proxy, caching, actions, etc.) |
| `webhook` | `object` | (none) | Webhook configuration for real-time notifications |
| `prompt` | `string` | (none) | Natural language prompt to generate crawl options. Explicitly set parameters override generated equivalents. |
## Important details
By default, crawl ignores sublinks that are not children of the URL you provide. For example, `website.com/other-parent/blog-1` would not be returned if you crawled `website.com/blogs/`. Use the `crawlEntireDomain` parameter to include sibling and parent paths. To crawl subdomains like `blog.website.com` when crawling `website.com`, use the `allowSubdomains` parameter.
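For example, a minimal sketch (assuming the Python SDK exposes these parameters as `crawl_entire_domain` and `allow_subdomains`):
```python Python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# Sketch: widen the crawl beyond child paths of the starting URL.
docs = firecrawl.crawl(
    "https://website.com/blogs/",
    limit=100,
    crawl_entire_domain=True,  # also follow sibling/parent paths like /other-parent/blog-1
    allow_subdomains=True,     # also follow links to e.g. blog.website.com
)
```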
* **Sitemap discovery**: By default, the crawler includes the website's sitemap to discover URLs (`sitemap: "include"`). If you set `sitemap: "skip"`, only pages reachable through HTML links from the root URL are found. Assets like PDFs or deeply nested pages listed in the sitemap but not directly linked from HTML will be missed. For maximum coverage, keep the default setting.
* **Credit usage**: Each page crawled costs 1 credit. JSON mode adds 4 credits per page, enhanced proxy adds 4 credits per page, and PDF parsing costs 1 credit per PDF page.
* **Result expiration**: Job results are available via the API for 24 hours after completion. After that, view results in the [activity logs](https://www.firecrawl.dev/app/logs).
* **Crawl errors**: The `data` array contains pages Firecrawl successfully scraped. Use the [Get Crawl Errors](/api-reference/endpoint/crawl-get-errors) endpoint to retrieve pages that failed due to network errors, timeouts, or robots.txt blocks.
* **Non-deterministic results**: Crawl results may vary between runs of the same configuration. Pages are scraped concurrently, so the order in which links are discovered depends on network timing and which pages finish loading first. This means different branches of a site may be explored to different extents near the depth boundary, especially at higher `maxDiscoveryDepth` values. To get more deterministic results, set `maxConcurrency` to `1` or use `sitemap: "only"` if the site has a comprehensive sitemap.
> Are you an AI agent that needs a Firecrawl API key? See [firecrawl.dev/agent-onboarding/SKILL.md](https://www.firecrawl.dev/agent-onboarding/SKILL.md) for automated onboarding instructions.
# Enhanced Mode
Source: https://docs.firecrawl.dev/features/enhanced-mode
Use enhanced proxies for reliable scraping on complex sites
Firecrawl provides different proxy types to help you scrape websites with varying levels of complexity. Set the `proxy` parameter to control which proxy strategy is used for a request.
## Proxy types
Firecrawl supports three proxy types:
| Type | Description | Speed | Cost |
| ---------- | ------------------------------------------------------------ | ------ | ----------------------------------------------------------- |
| `basic` | Standard proxies suitable for most sites | Fast | 1 credit |
| `enhanced` | Enhanced proxies for complex sites | Slower | 5 credits per request |
| `auto` | Tries `basic` first, then retries with `enhanced` on failure | Varies | 1 credit if basic succeeds, 5 credits if enhanced is needed |
If you do not specify a proxy, Firecrawl defaults to `auto`.
## Basic usage
Set the `proxy` parameter to choose a proxy strategy. The following example uses `auto`, which lets Firecrawl decide when to escalate to enhanced proxies.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR-API-KEY')
# Choose proxy strategy: 'basic' | 'enhanced' | 'auto'
doc = firecrawl.scrape('https://example.com', formats=['markdown'], proxy='auto')
print(doc.warning or 'ok')
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
// Choose proxy strategy: 'basic' | 'enhanced' | 'auto'
const doc = await firecrawl.scrape('https://example.com', {
formats: ['markdown'],
proxy: 'auto'
});
console.log(doc.warning || 'ok');
```
```bash cURL theme={null}
# Choose proxy strategy: 'basic' | 'enhanced' | 'auto'
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR-API-KEY' \
-d '{
"url": "https://example.com",
"proxy": "auto"
}'
```
Enhanced proxy requests cost **5 credits per request**. When using `auto`, the 5-credit cost only applies if the basic proxy fails and the enhanced retry succeeds.
> Are you an AI agent that needs a Firecrawl API key? See [firecrawl.dev/agent-onboarding/SKILL.md](https://www.firecrawl.dev/agent-onboarding/SKILL.md) for automated onboarding instructions.
# Faster Scraping
Source: https://docs.firecrawl.dev/features/fast-scraping
Speed up your scrapes by 500% with the maxAge parameter
## How It Works
Firecrawl caches previously scraped pages and, by default, returns a recent copy when available.
* **Default freshness**: `maxAge = 172800000` ms (2 days). If the cached copy is newer than this, it’s returned instantly; otherwise, Firecrawl scrapes fresh and updates the cache.
* **Force fresh**: Set `maxAge: 0` to always scrape. This bypasses the cache entirely: every request goes through the full scraping pipeline, so it takes longer to complete and is more likely to fail. Use a non-zero `maxAge` if you don't need real-time content on every request.
* **Skip caching**: Set `storeInCache: false` if you don’t want to store results for a request.
Get your results **up to 500% faster** when you don’t need the absolute freshest data. Control freshness via `maxAge`:
1. **Return instantly** if we have a recent version of the page
2. **Scrape fresh** only if our version is older than your specified age
3. **Save you time** - results come back in milliseconds instead of seconds
## When to Use This
**Great for:**
* Documentation, articles, product pages
* Bulk processing jobs
* Development and testing
* Building knowledge bases
**Skip for:**
* Real-time data (stock prices, live scores, breaking news)
* Frequently updated content
* Time-sensitive applications
## Usage
Add `maxAge` to your scrape request. Values are in milliseconds (e.g., `3600000` = 1 hour).
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")
# Use cached data if it's less than 1 hour old (3600000 ms)
# This can be 500% faster than a fresh scrape!
scrape_result = firecrawl.scrape(
'https://firecrawl.dev',
formats=['markdown'],
max_age=3600000 # 1 hour in milliseconds
)
print(scrape_result.markdown)
```
```javascript JavaScript theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
// Use cached data if it's less than 1 hour old (3600000 ms)
// This can be 500% faster than a fresh scrape!
const scrapeResult = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['markdown'],
maxAge: 3600000 // 1 hour in milliseconds
});
console.log(scrapeResult.markdown);
```
## Common maxAge values
Here are some helpful reference values:
* **5 minutes**: `300000` - For semi-dynamic content
* **1 hour**: `3600000` - For content that updates hourly
* **1 day**: `86400000` - For daily-updated content
* **1 week**: `604800000` - For relatively static content
## Performance impact
With `maxAge` enabled:
* **500% faster response times** for recent content
* **Instant results** instead of waiting for fresh scrapes
## Important notes
* **Default**: `maxAge` is `172800000` (2 days)
* **Fresh when needed**: If our data is older than `maxAge`, we scrape fresh automatically
* **No stale data**: You'll never get data older than your specified `maxAge`
* **Credits**: Cached results still cost 1 credit per page. Caching improves speed and latency, not credit usage.
### When caching is bypassed
Caching is automatically skipped when your request includes any of the following:
* Custom `headers`
* `actions` (browser automation steps)
* A browser `profile`
* `changeTracking` format
* Custom `screenshot` viewport or quality settings
### Cache hit matching
For a cache hit, these parameters must match exactly between the original and subsequent requests: `url`, `mobile`, `location`, `waitFor`, `blockAds`, `screenshot` (enabled/disabled and full-page), and stealth proxy mode.
You can verify cache behavior by checking `metadata.cacheState` in the response — it will be `"hit"` or `"miss"`.
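For example, a minimal sketch (the raw API field is `metadata.cacheState`; the Python attribute name below is an assumption):
```python Python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")

doc = firecrawl.scrape("https://firecrawl.dev", formats=["markdown"], max_age=3600000)
# "hit" means the result came from cache; "miss" means it was scraped fresh.
print(getattr(doc.metadata, "cache_state", None))  # attribute name assumed
```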
## Faster crawling
The same speed benefits apply when crawling multiple pages. Use `maxAge` within `scrapeOptions` to get cached results for pages we’ve seen recently.
```python Python theme={null}
from firecrawl import Firecrawl
from firecrawl.v2.types import ScrapeOptions
firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")
# Crawl with cached scraping - 500% faster for pages we've seen recently
crawl_result = firecrawl.crawl(
'https://firecrawl.dev',
limit=100,
scrape_options=ScrapeOptions(
formats=['markdown'],
max_age=3600000 # Use cached data if less than 1 hour old
)
)
for page in crawl_result.data:
print(f"URL: {page.metadata.source_url}")
print(f"Content: {page.markdown[:200]}...")
```
```javascript JavaScript theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
// Crawl with cached scraping - 500% faster for pages we've seen recently
const crawlResult = await firecrawl.crawl('https://firecrawl.dev', {
limit: 100,
scrapeOptions: {
formats: ['markdown'],
maxAge: 3600000 // Use cached data if less than 1 hour old
}
});
crawlResult.data.forEach(page => {
console.log(`URL: ${page.metadata.sourceURL}`);
console.log(`Content: ${page.markdown.substring(0, 200)}...`);
});
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/crawl \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"limit": 100,
"scrapeOptions": {
"formats": ["markdown"],
"maxAge": 3600000
}
}'
```
When crawling with `maxAge`, each page in your crawl will benefit from the 500% speed improvement if we have recent cached data for that page.
Start using `maxAge` today for dramatically faster scrapes and crawls!
# JSON mode - Structured result
Source: https://docs.firecrawl.dev/features/llm-extract
Extract structured data from pages via LLMs
**v2 API Change:** JSON schema extraction is fully supported in v2, but the API format has changed. In v2, the schema is embedded directly inside the format object as `formats: [{type: "json", schema: {...}}]`. The v1 `jsonOptions` parameter no longer exists in v2.
For schema validation failures and other extraction errors, see [Errors](/api-reference/errors) — extraction-specific issues typically surface as `400` or `422` responses.
## Scrape and extract structured data with Firecrawl
Firecrawl uses AI to get structured data from web pages in 3 steps:
1. **Set the Schema (optional):**
Define a JSON schema (using OpenAI's format) to specify the data you want, or just provide a `prompt` if you don't need a strict schema, along with the webpage URL.
2. **Make the Request:**
Send your URL and schema to our scrape endpoint using JSON mode. See how here:
[Scrape Endpoint Documentation](https://docs.firecrawl.dev/api-reference/endpoint/scrape)
3. **Get Your Data:**
Get back clean, structured data matching your schema that you can use right away.
This makes getting web data in the format you need quick and easy.
## Extract structured data
### JSON mode via /scrape
Used to extract structured data from scraped pages.
```python Python theme={null}
from firecrawl import Firecrawl
from pydantic import BaseModel
app = Firecrawl(api_key="fc-YOUR-API-KEY")
class CompanyInfo(BaseModel):
company_mission: str
supports_sso: bool
is_open_source: bool
is_in_yc: bool
result = app.scrape(
'https://firecrawl.dev',
formats=[{
"type": "json",
"schema": CompanyInfo.model_json_schema()
}],
only_main_content=False,
timeout=120000
)
print(result)
```
```js Node theme={null}
import Firecrawl from "@mendable/firecrawl-js";
import { z } from "zod";
const app = new Firecrawl({
apiKey: "fc-YOUR_API_KEY"
});
// Define schema to extract contents into
const schema = z.object({
company_mission: z.string(),
supports_sso: z.boolean(),
is_open_source: z.boolean(),
is_in_yc: z.boolean()
});
const result = await app.scrape("https://firecrawl.dev", {
formats: [{
type: "json",
schema: schema
}],
});
console.log(result);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"formats": [ {
"type": "json",
"schema": {
"type": "object",
"properties": {
"company_mission": {
"type": "string"
},
"supports_sso": {
"type": "boolean"
},
"is_open_source": {
"type": "boolean"
},
"is_in_yc": {
"type": "boolean"
}
},
"required": [
"company_mission",
"supports_sso",
"is_open_source",
"is_in_yc"
]
}
} ]
}'
```
Output:
```json JSON theme={null}
{
"success": true,
"data": {
"json": {
"company_mission": "AI-powered web scraping and data extraction",
"supports_sso": true,
"is_open_source": true,
"is_in_yc": true
},
"metadata": {
"title": "Firecrawl",
"description": "AI-powered web scraping and data extraction",
"robots": "follow, index",
"ogTitle": "Firecrawl",
"ogDescription": "AI-powered web scraping and data extraction",
"ogUrl": "https://firecrawl.dev/",
"ogImage": "https://firecrawl.dev/og.png",
"ogLocaleAlternate": [],
"ogSiteName": "Firecrawl",
"sourceURL": "https://firecrawl.dev/"
},
}
}
```
### Structured data without schema
You can also extract without a schema by just passing a `prompt` to the endpoint. The LLM chooses the structure of the data.
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR-API-KEY")
result = app.scrape(
'https://firecrawl.dev',
formats=[{
"type": "json",
"prompt": "Extract the company mission from the page."
}],
only_main_content=False,
timeout=120000
)
print(result)
```
```js Node theme={null}
import Firecrawl from "@mendable/firecrawl-js";
const app = new Firecrawl({
apiKey: "fc-YOUR_API_KEY"
});
const result = await app.scrape("https://firecrawl.dev", {
formats: [{
type: "json",
prompt: "Extract the company mission from the page."
}]
});
console.log(result);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"formats": [{
"type": "json",
"prompt": "Extract the company mission from the page."
}]
}'
```
Output:
```json JSON theme={null}
{
"success": true,
"data": {
"json": {
"company_mission": "AI-powered web scraping and data extraction",
},
"metadata": {
"title": "Firecrawl",
"description": "AI-powered web scraping and data extraction",
"robots": "follow, index",
"ogTitle": "Firecrawl",
"ogDescription": "AI-powered web scraping and data extraction",
"ogUrl": "https://firecrawl.dev/",
"ogImage": "https://firecrawl.dev/og.png",
"ogLocaleAlternate": [],
"ogSiteName": "Firecrawl",
"sourceURL": "https://firecrawl.dev/"
},
}
}
```
### Real-world example: Extracting company information
Here's a comprehensive example extracting structured company information from a website:
```python Python theme={null}
from firecrawl import Firecrawl
from pydantic import BaseModel
app = Firecrawl(api_key="fc-YOUR-API-KEY")
class CompanyInfo(BaseModel):
company_mission: str
supports_sso: bool
is_open_source: bool
is_in_yc: bool
result = app.scrape(
'https://firecrawl.dev/',
formats=[{
"type": "json",
"schema": CompanyInfo.model_json_schema()
}]
)
print(result)
```
```js Node theme={null}
import Firecrawl from "@mendable/firecrawl-js";
import { z } from "zod";
const app = new Firecrawl({
apiKey: "fc-YOUR_API_KEY"
});
const companyInfoSchema = z.object({
company_mission: z.string(),
supports_sso: z.boolean(),
is_open_source: z.boolean(),
is_in_yc: z.boolean()
});
const result = await app.scrape("https://firecrawl.dev/", {
formats: [{
type: "json",
schema: companyInfoSchema
}]
});
console.log(result);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev/",
"formats": [{
"type": "json",
"schema": {
"type": "object",
"properties": {
"company_mission": {
"type": "string"
},
"supports_sso": {
"type": "boolean"
},
"is_open_source": {
"type": "boolean"
},
"is_in_yc": {
"type": "boolean"
}
},
"required": [
"company_mission",
"supports_sso",
"is_open_source",
"is_in_yc"
]
}
}]
}'
```
Output:
```json Output theme={null}
{
"success": true,
"data": {
"json": {
"company_mission": "Turn websites into LLM-ready data",
"supports_sso": true,
"is_open_source": true,
"is_in_yc": true
}
}
}
```
### JSON format options
When using JSON mode in v2, include an object in `formats` with the schema embedded directly:
`formats: [{ type: 'json', schema: { ... }, prompt: '...' }]`
Parameters:
* `schema`: JSON Schema describing the structured output you want (required for schema-based extraction).
* `prompt`: Optional prompt to guide extraction (also used for no-schema extraction).
**Important:** Unlike v1, there is no separate `jsonOptions` parameter in v2. The schema must be included directly inside the format object in the `formats` array.
**HTML attributes are not available in JSON extraction.** JSON extraction works on the markdown conversion of the page, which only preserves visible text content. HTML attributes (e.g., `data-id`, custom attributes on elements) are stripped during conversion and the LLM cannot see them. If you need to extract HTML attribute values, use `rawHtml` format and parse attributes client-side, or use an `executeJavascript` action to inject attribute values into visible text before extraction.
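For example, a minimal sketch of the `rawHtml` fallback, parsing `data-id` attributes client-side with BeautifulSoup (an assumed third-party dependency; the SDK field name for raw HTML is also assumed):
```python Python theme={null}
from bs4 import BeautifulSoup  # pip install beautifulsoup4
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

doc = firecrawl.scrape('https://example.com', formats=['rawHtml'])

# Parse attributes client-side, since the LLM never sees them
soup = BeautifulSoup(doc.raw_html, 'html.parser')  # field name assumed
ids = [el['data-id'] for el in soup.select('[data-id]')]
print(ids)
```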
## Tips for consistent extraction
If you are seeing inconsistent or incomplete results from JSON extraction, these practices can help:
* **Keep prompts short and focused.** Long prompts with many rules increase variability. Move specific constraints (like allowed values) into the schema instead.
* **Use concise property names.** Avoid embedding instructions or enum lists in property names. Use a short key like `"installation_type"` and put allowed values in an `enum` array.
* **Add `enum` arrays for constrained fields.** When a field has a fixed set of values, list them in `enum` and make sure they match the exact text shown on the page.
* **Include null-handling in field descriptions.** Add `"Return null if not found on the page."` to each field's `description` so the model does not guess missing values.
* **Add location hints.** Tell the model where to find data on the page, e.g. `"Flow rate in GPM from the Specifications table."`.
* **Split large schemas into smaller requests.** Schemas with many fields (e.g. 30+) produce less consistent results. Split them into 2–3 requests of 10–15 fields each.
* **Avoid `minItems`/`maxItems` on arrays.** JSON Schema validation keywords like `minItems` and `maxItems` do not control how much content the scraper collects. Setting `minItems: 20` will not make the LLM return more items — it may instead hallucinate entries to satisfy the constraint. Remove these keywords and use a `prompt` instead (e.g. `"Extract ALL reviews from the page. Do not skip any."`) to guide completeness.
* **Use `"type": "array"` to extract lists of items.** If you need to extract multiple items (e.g. a list of people, products, or reviews), wrap them in an array property with an `items` block. Using `"type": "object"` for a list will return only a single item. See the array schema example below.
**Example of a well-structured schema:**
```json theme={null}
{
"type": "object",
"properties": {
"product_name": {
"type": ["string", "null"],
"description": "Full descriptive product name as shown on the page. Return null if not found."
},
"installation_type": {
"type": ["string", "null"],
"description": "Installation type from the Specifications section. Return null if not found.",
"enum": ["Deck-mount", "Wall-mount", "Countertop", "Drop-in", "Undermount"]
},
"flow_rate_gpm": {
"type": ["string", "null"],
"description": "Flow rate in GPM from the Specifications section. Return null if not found."
}
}
}
```
**Example of extracting a list of items:**
When a page contains multiple items (e.g. team members, products, reviews), use `"type": "array"` with `"items"` to get the full list:
```json theme={null}
{
"type": "object",
"properties": {
"people": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"role": { "type": "string" },
"department": { "type": "string" }
}
}
}
}
}
```
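Putting these tips together, here's a minimal sketch pairing an array schema with a completeness prompt (the team-page URL is hypothetical):
```python Python theme={null}
from firecrawl import Firecrawl

app = Firecrawl(api_key="fc-YOUR-API-KEY")

people_schema = {
    "type": "object",
    "properties": {
        "people": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "role": {"type": "string"}
                }
            }
        }
    }
}

# The schema constrains the shape; the prompt guides completeness
result = app.scrape(
    "https://example.com/team",  # hypothetical URL
    formats=[{
        "type": "json",
        "schema": people_schema,
        "prompt": "Extract ALL people listed on the page. Do not skip any."
    }]
)
print(result)
```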
# Models
Source: https://docs.firecrawl.dev/features/models
Choose the right model for your agent extraction tasks.
Firecrawl Agent offers two models optimized for different use cases. Choose the right model based on your extraction complexity and cost requirements.
## Available Models
| Model | Cost | Accuracy | Best For |
| -------------- | --------------- | -------- | ------------------------------------- |
| `spark-1-mini` | **60% cheaper** | Standard | Most tasks (default) |
| `spark-1-pro` | Standard | Higher | Complex research, critical extraction |
**Start with Spark 1 Mini** (default) — it handles most extraction tasks well at 60% lower cost. Switch to Pro only for complex multi-domain research or when accuracy is critical.
## Spark 1 Mini (Default)
`spark-1-mini` is our efficient model, ideal for straightforward data extraction tasks.
**Use Mini when:**
* Extracting simple data points (contact info, pricing, etc.)
* Working with well-structured websites
* Cost efficiency is a priority
* Running high-volume extraction jobs
**Example use cases:**
* Extracting product prices from e-commerce sites
* Gathering contact information from company pages
* Pulling basic metadata from articles
* Simple data point lookups
## Spark 1 Pro
`spark-1-pro` is our flagship model, designed for maximum accuracy on complex extraction tasks.
**Use Pro when:**
* Performing complex competitive analysis
* Extracting data that requires deep reasoning
* Accuracy is critical for your use case
* Dealing with ambiguous or hard-to-find data
**Example use cases:**
* Multi-domain competitive analysis
* Complex research tasks requiring reasoning
* Extracting nuanced information from multiple sources
* Critical business intelligence gathering
## Specifying a Model
Pass the `model` parameter to select which model to use:
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
# Using Spark 1 Mini (default - can be omitted)
result = app.agent(
prompt="Find the pricing of Firecrawl",
model="spark-1-mini"
)
# Using Spark 1 Pro for complex tasks
result = app.agent(
prompt="Compare all enterprise features and pricing across Firecrawl, Apify, and ScrapingBee",
model="spark-1-pro"
)
print(result.data)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR_API_KEY" });
// Using Spark 1 Mini (default - can be omitted)
const result = await firecrawl.agent({
prompt: "Find the pricing of Firecrawl",
model: "spark-1-mini"
});
// Using Spark 1 Pro for complex tasks
const resultPro = await firecrawl.agent({
prompt: "Compare all enterprise features and pricing across Firecrawl, Apify, and ScrapingBee",
model: "spark-1-pro"
});
console.log(result.data);
```
```bash cURL theme={null}
# Using Spark 1 Mini (default)
curl -X POST "https://api.firecrawl.dev/v2/agent" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Find the pricing of Firecrawl",
"model": "spark-1-mini"
}'
# Using Spark 1 Pro for complex tasks
curl -X POST "https://api.firecrawl.dev/v2/agent" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Compare all enterprise features and pricing across Firecrawl, Apify, and ScrapingBee",
"model": "spark-1-pro"
}'
```
## Model Comparison
| Feature | Spark 1 Mini | Spark 1 Pro |
| ---------------- | ------------ | ------------- |
| **Cost** | 60% cheaper | Standard |
| **Accuracy** | Standard | Higher |
| **Speed** | Fast | Fast |
| **Best for** | Most tasks | Complex tasks |
| **Reasoning** | Standard | Advanced |
| **Multi-domain** | Good | Excellent |
## Pricing by Model
Both models use dynamic, credit-based pricing that scales with task complexity:
* **Spark 1 Mini**: Uses approximately 60% fewer credits than Pro for equivalent tasks
* **Spark 1 Pro**: Standard credit consumption for maximum accuracy
Credit usage varies based on prompt complexity, data processed, and output structure — regardless of model selected.
## Choosing the Right Model
```
┌─────────────────────────────────┐
│ What type of task? │
└─────────────────────────────────┘
│
┌──────────────┴──────────────┐
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Simple/Direct │ │ Complex/Research│
│ extraction │ │ multi-domain │
└─────────────────┘ └─────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ spark-1-mini │ │ spark-1-pro │
│ (60% cheaper) │ │ (higher acc.) │
└─────────────────┘ └─────────────────┘
```
## API Reference
See the [Agent API Reference](/api-reference/endpoint/agent) for complete parameter documentation.
Have questions about which model to use? Email [help@firecrawl.com](mailto:help@firecrawl.com).
# Proxies
Source: https://docs.firecrawl.dev/features/proxies
Learn about proxy types, locations, and how Firecrawl selects proxies for your requests.
Firecrawl provides different proxy types to help you scrape websites with varying levels of complexity. The proxy type can be specified using the `proxy` parameter.
> By default, Firecrawl routes all requests through proxies to help ensure reliability and access, even if you do not specify a proxy type or location.
## Location-Based Proxy Selection
Firecrawl automatically selects the best proxy based on your specified or detected location. This helps optimize scraping performance and reliability. However, not all locations are currently supported. The following locations are available:
| Country Code | Country Name | Basic Proxy Support | Enhanced Proxy Support |
| ------------ | -------------------- | ------------------- | ---------------------- |
| AE | United Arab Emirates | Yes | No |
| AU | Australia | Yes | No |
| BR | Brazil | Yes | No |
| CA | Canada | Yes | No |
| CN | China | Yes | No |
| CZ | Czechia | Yes | No |
| DE | Germany | Yes | No |
| DK | Denmark | Yes | Yes |
| EE | Estonia | Yes | No |
| EG | Egypt | Yes | No |
| ES | Spain | Yes | No |
| FR | France | Yes | No |
| GB | United Kingdom | Yes | No |
| GR | Greece | Yes | No |
| HU | Hungary | Yes | No |
| ID | Indonesia | Yes | No |
| IL | Israel | Yes | No |
| IN | India | Yes | No |
| IT | Italy | Yes | No |
| JP | Japan | Yes | No |
| MY | Malaysia | Yes | No |
| NO | Norway | Yes | No |
| PL | Poland | Yes | No |
| PT | Portugal | Yes | No |
| QA | Qatar | Yes | No |
| SG | Singapore | Yes | No |
| US | United States | Yes | Yes |
| VN | Vietnam | Yes | No |
The list of supported proxy locations will change over time.
If you need proxies in a location not listed above, please [contact us](mailto:help@firecrawl.com) and let us know your requirements.
If you do not specify a proxy or location, Firecrawl will automatically use US proxies.
## How to Specify Proxy Location
You can request a specific proxy location by setting the `location.country` parameter in your request. For example, to use a Brazilian proxy, set `location.country` to `BR`.
For full details, see the [API reference for `location.country`](https://docs.firecrawl.dev/api-reference/endpoint/scrape#body-location).
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
doc = firecrawl.scrape('https://example.com',
formats=['markdown'],
location={
'country': 'US',
'languages': ['en']
}
)
print(doc)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://example.com', {
formats: ['markdown'],
location: { country: 'US', languages: ['en'] },
});
console.log(doc.metadata);
```
```bash cURL theme={null}
curl -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["markdown"],
"location": { "country": "US", "languages": ["en"] }
}'
```
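For the Brazilian-proxy case mentioned above, only the `location` object changes (the `pt-BR` language preference here is illustrative):
```python Python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# Route the request through a Brazilian proxy
doc = firecrawl.scrape('https://example.com',
    formats=['markdown'],
    location={
        'country': 'BR',
        'languages': ['pt-BR']  # illustrative language preference
    }
)
print(doc)
```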
If you request a country where a proxy is not available, Firecrawl will use the closest available region (EU or US) and set the browser location to your requested country.
## Proxy Types
Firecrawl supports three types of proxies:
* **basic**: Proxies for scraping most sites. Fast and usually works.
* **enhanced**: Enhanced proxies for scraping complex sites while maintaining privacy. Slower, but more reliable on certain sites. [Learn more about Enhanced Mode →](/features/enhanced-mode)
* **auto**: Firecrawl will automatically retry scraping with enhanced proxies if the basic proxy fails. If the retry with enhanced is successful, 5 credits will be billed for the scrape. If the first attempt with basic is successful, only the regular cost will be billed.
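The proxy type is set at the top level of the request; a minimal sketch (assuming the Python SDK forwards `proxy` as a keyword argument, mirroring the API field):
```python Python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# "auto" retries with enhanced proxies only if the basic attempt fails
doc = firecrawl.scrape('https://example.com', formats=['markdown'], proxy='auto')  # kwarg assumed
print(doc.metadata)
```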
***
> **Note:** For detailed information on using enhanced proxies, including credit costs and retry strategies, see the [Enhanced Mode documentation](/features/enhanced-mode).
# Scrape
Source: https://docs.firecrawl.dev/features/scrape
Turn any URL into clean data
Firecrawl converts web pages into markdown, ideal for LLM applications.
* It manages complexities: proxies, caching, rate limits, JS-blocked content
* Handles dynamic content: dynamic websites, JS-rendered sites, PDFs, images
* Outputs clean markdown, structured data, screenshots, or HTML.
For details, see the [Scrape Endpoint API Reference](https://docs.firecrawl.dev/api-reference/endpoint/scrape).
Test scraping in the interactive playground — no code required.
If a request fails, see [Errors](/api-reference/errors) for the full catalog of error codes, causes, remedies, and retry guidance.
## Scraping a URL with Firecrawl
### /scrape endpoint
Used to scrape a URL and get its content.
### Installation
```python Python theme={null}
# pip install firecrawl-py
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
```
```js Node theme={null}
// npm install @mendable/firecrawl-js
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
```
```bash CLI theme={null}
# Install globally with npm
npm install -g firecrawl
# Authenticate (one-time setup)
firecrawl login
```
### Usage
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
# Scrape a website:
doc = firecrawl.scrape("https://firecrawl.dev", formats=["markdown", "html"])
print(doc)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
// Scrape a website:
const doc = await firecrawl.scrape('https://firecrawl.dev', { formats: ['markdown', 'html'] });
console.log(doc);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://firecrawl.dev",
"formats": ["markdown", "html"]
}'
```
```bash CLI theme={null}
# Scrape a URL and get markdown
firecrawl https://firecrawl.dev
# With multiple formats (returns JSON)
firecrawl https://firecrawl.dev --format markdown,html,links --pretty
```
For more details about the parameters, refer to the [API Reference](https://docs.firecrawl.dev/api-reference/endpoint/scrape).
Each scrape consumes 1 credit. Additional credits apply for certain options: JSON mode costs 4 additional credits per page, enhanced proxy costs 4 additional credits per page, PDF parsing costs 1 credit per PDF page, and audio extraction costs 4 additional credits per page.
### Response
SDKs will return the data object directly. cURL will return the payload exactly as shown below.
```json theme={null}
{
"success": true,
"data" : {
"markdown": "Launch Week I is here! [See our Day 2 Release 🚀](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)[💥 Get 2 months free...",
"html": "
```python Python theme={null}
from firecrawl import Firecrawl
from pydantic import BaseModel
app = Firecrawl(api_key="fc-YOUR-API-KEY")
class CompanyInfo(BaseModel):
company_mission: str
supports_sso: bool
is_open_source: bool
is_in_yc: bool
result = app.scrape(
'https://firecrawl.dev',
formats=[{
"type": "json",
"schema": CompanyInfo.model_json_schema()
}],
only_main_content=False,
timeout=120000
)
print(result)
```
```js Node theme={null}
import Firecrawl from "@mendable/firecrawl-js";
import { z } from "zod";
const app = new Firecrawl({
apiKey: "fc-YOUR_API_KEY"
});
// Define schema to extract contents into
const schema = z.object({
company_mission: z.string(),
supports_sso: z.boolean(),
is_open_source: z.boolean(),
is_in_yc: z.boolean()
});
const result = await app.scrape("https://firecrawl.dev", {
formats: [{
type: "json",
schema: schema
}],
});
console.log(result);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"formats": [ {
"type": "json",
"schema": {
"type": "object",
"properties": {
"company_mission": {
"type": "string"
},
"supports_sso": {
"type": "boolean"
},
"is_open_source": {
"type": "boolean"
},
"is_in_yc": {
"type": "boolean"
}
},
"required": [
"company_mission",
"supports_sso",
"is_open_source",
"is_in_yc"
]
}
} ]
}'
```
Output:
```json JSON theme={null}
{
"success": true,
"data": {
"json": {
"company_mission": "AI-powered web scraping and data extraction",
"supports_sso": true,
"is_open_source": true,
"is_in_yc": true
},
"metadata": {
"title": "Firecrawl",
"description": "AI-powered web scraping and data extraction",
"robots": "follow, index",
"ogTitle": "Firecrawl",
"ogDescription": "AI-powered web scraping and data extraction",
"ogUrl": "https://firecrawl.dev/",
"ogImage": "https://firecrawl.dev/og.png",
"ogLocaleAlternate": [],
"ogSiteName": "Firecrawl",
"sourceURL": "https://firecrawl.dev/"
},
}
}
```
### Extracting without schema
You can also extract without a schema by just passing a `prompt` to the endpoint. The LLM chooses the structure of the data.
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR-API-KEY")
result = app.scrape(
'https://firecrawl.dev',
formats=[{
"type": "json",
"prompt": "Extract the company mission from the page."
}],
only_main_content=False,
timeout=120000
)
print(result)
```
```js Node theme={null}
import Firecrawl from "@mendable/firecrawl-js";
const app = new Firecrawl({
apiKey: "fc-YOUR_API_KEY"
});
const result = await app.scrape("https://firecrawl.dev", {
formats: [{
type: "json",
prompt: "Extract the company mission from the page."
}]
});
console.log(result);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://firecrawl.dev",
"formats": [{
"type": "json",
"prompt": "Extract the company mission from the page."
}]
}'
```
Output:
```json JSON theme={null}
{
"success": true,
"data": {
"json": {
"company_mission": "AI-powered web scraping and data extraction",
},
"metadata": {
"title": "Firecrawl",
"description": "AI-powered web scraping and data extraction",
"robots": "follow, index",
"ogTitle": "Firecrawl",
"ogDescription": "AI-powered web scraping and data extraction",
"ogUrl": "https://firecrawl.dev/",
"ogImage": "https://firecrawl.dev/og.png",
"ogLocaleAlternate": [],
"ogSiteName": "Firecrawl",
"sourceURL": "https://firecrawl.dev/"
},
}
}
```
### JSON format options
When using the `json` format, pass an object inside `formats` with the following parameters:
* `schema`: JSON Schema for the structured output.
* `prompt`: Optional prompt to help guide extraction when a schema is present or when you prefer light guidance.
## Extract brand identity
### /scrape (with branding) endpoint
The branding format extracts comprehensive brand identity information from a webpage, including colors, fonts, typography, spacing, UI components, and more. This is useful for design system analysis, brand monitoring, or building tools that need to understand a website's visual identity.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')
result = firecrawl.scrape(
url='https://firecrawl.dev',
formats=['branding']
)
print(result['branding'])
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const result = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['branding']
});
console.log(result.branding);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://firecrawl.dev",
"formats": ["branding"]
}'
```
### Response
The branding format returns a comprehensive `BrandingProfile` object with the following structure:
```json Output theme={null}
{
"success": true,
"data": {
"branding": {
"colorScheme": "dark",
"logo": "https://firecrawl.dev/logo.svg",
"colors": {
"primary": "#FF6B35",
"secondary": "#004E89",
"accent": "#F77F00",
"background": "#1A1A1A",
"textPrimary": "#FFFFFF",
"textSecondary": "#B0B0B0"
},
"fonts": [
{
"family": "Inter"
},
{
"family": "Roboto Mono"
}
],
"typography": {
"fontFamilies": {
"primary": "Inter",
"heading": "Inter",
"code": "Roboto Mono"
},
"fontSizes": {
"h1": "48px",
"h2": "36px",
"h3": "24px",
"body": "16px"
},
"fontWeights": {
"regular": 400,
"medium": 500,
"bold": 700
}
},
"spacing": {
"baseUnit": 8,
"borderRadius": "8px"
},
"components": {
"buttonPrimary": {
"background": "#FF6B35",
"textColor": "#FFFFFF",
"borderRadius": "8px"
},
"buttonSecondary": {
"background": "transparent",
"textColor": "#FF6B35",
"borderColor": "#FF6B35",
"borderRadius": "8px"
}
},
"images": {
"logo": "https://firecrawl.dev/logo.svg",
"favicon": "https://firecrawl.dev/favicon.ico",
"ogImage": "https://firecrawl.dev/og-image.png"
}
}
}
}
```
### Branding Profile Structure
The `branding` object contains the following properties:
* `colorScheme`: The detected color scheme (`"light"` or `"dark"`)
* `logo`: URL of the primary logo
* `colors`: Object containing brand colors:
* `primary`, `secondary`, `accent`: Main brand colors
* `background`, `textPrimary`, `textSecondary`: UI colors
* `link`, `success`, `warning`, `error`: Semantic colors
* `fonts`: Array of font families used on the page
* `typography`: Detailed typography information:
* `fontFamilies`: Primary, heading, and code font families
* `fontSizes`: Size definitions for headings and body text
* `fontWeights`: Weight definitions (light, regular, medium, bold)
* `lineHeights`: Line height values for different text types
* `spacing`: Spacing and layout information:
* `baseUnit`: Base spacing unit in pixels
* `borderRadius`: Default border radius
* `padding`, `margins`: Spacing values
* `components`: UI component styles:
* `buttonPrimary`, `buttonSecondary`: Button styles
* `input`: Input field styles
* `icons`: Icon style information
* `images`: Brand images (logo, favicon, og:image)
* `animations`: Animation and transition settings
* `layout`: Layout configuration (grid, header/footer heights)
* `personality`: Brand personality traits (tone, energy, target audience)
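For example, pulling a few fields out of the profile (dict-style access shown, matching the examples above; attribute access may also work depending on your SDK version):
```python Python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')

result = firecrawl.scrape(url='https://firecrawl.dev', formats=['branding'])
branding = result['branding']  # dict-style access, matching the example above

print(branding['colors']['primary'])                      # e.g. "#FF6B35"
print(branding['typography']['fontFamilies']['heading'])  # e.g. "Inter"
```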
### Combining with other formats
You can combine the branding format with other formats to get comprehensive page data:
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')
result = firecrawl.scrape(
url='https://firecrawl.dev',
formats=['markdown', 'branding', 'screenshot']
)
print(result['markdown'])
print(result['branding'])
print(result['screenshot'])
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const result = await firecrawl.scrape('https://firecrawl.dev', {
formats: ['markdown', 'branding', 'screenshot']
});
console.log(result.markdown);
console.log(result.branding);
console.log(result.screenshot);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://firecrawl.dev",
"formats": ["markdown", "branding", "screenshot"]
}'
```
## Audio extraction
The `audio` format extracts audio from supported websites (e.g. YouTube) as MP3 files and returns a signed Google Cloud Storage URL. This is useful for building audio processing pipelines, transcription services, or podcast tools.
Audio extraction costs 5 credits per page (1 base + 4 additional).
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
doc = firecrawl.scrape("https://www.youtube.com/watch?v=dQw4w9WgXcQ", formats=["audio"])
print(doc.audio) # Signed GCS URL to the MP3 file
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://www.youtube.com/watch?v=dQw4w9WgXcQ', {
formats: ['audio']
});
console.log(doc.audio); // Signed GCS URL to the MP3 file
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"formats": ["audio"]
}'
```
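Because the returned URL is signed (and typically time-limited), you'll usually want to download the file right away. A minimal sketch using the third-party `requests` library:
```python Python theme={null}
import requests  # pip install requests
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

doc = firecrawl.scrape("https://www.youtube.com/watch?v=dQw4w9WgXcQ", formats=["audio"])

# Download the MP3 while the signed URL is still valid
resp = requests.get(doc.audio)
resp.raise_for_status()
with open("audio.mp3", "wb") as f:
    f.write(resp.content)
```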
## Interacting with the page with Actions
Firecrawl allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction.
**We recommend [Interact](/features/interact) over actions: our newer, more powerful way to interact with scraped pages.**
Interact runs as a stateful browser session that stays alive across calls, so you can drive a page turn-by-turn with either:
* **Natural language** for flexible, non-deterministic flows. e.g. *“search for ‘wireless headphones’, filter to 4+ stars under \$200, and return the results”*.
* **Playwright or agent-browser code** for deterministic steps. e.g. `await page.click('#export')`.
Interact also supports profiles, persistent sessions, and a live embeddable browser view (with an interactive mode where end users can drive the browser themselves).
Here is an example of using actions to fill in a login form, submit it, wait for the page to load, and take a screenshot.
In almost all cases, include a `wait` action before or after other actions to give the page enough time to load.
### Example
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
doc = firecrawl.scrape(
url="https://example.com/login",
formats=["markdown"],
actions=[
{"type": "write", "text": "john@example.com"},
{"type": "press", "key": "Tab"},
{"type": "write", "text": "secret"},
{"type": "click", "selector": 'button[type="submit"]'},
{"type": "wait", "milliseconds": 1500},
{"type": "screenshot", "full_page": True},
],
)
print(doc.markdown, doc.screenshot)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://example.com/login', {
formats: ['markdown'],
actions: [
{ type: 'write', text: 'john@example.com' },
{ type: 'press', key: 'Tab' },
{ type: 'write', text: 'secret' },
{ type: 'click', selector: 'button[type="submit"]' },
{ type: 'wait', milliseconds: 1500 },
{ type: 'screenshot', fullPage: true },
],
});
console.log(doc.markdown, doc.screenshot);
```
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://example.com/login",
"formats": ["markdown"],
"actions": [
{ "type": "write", "text": "john@example.com" },
{ "type": "press", "key": "Tab" },
{ "type": "write", "text": "secret" },
{ "type": "click", "selector": "button[type=\"submit\"]" },
{ "type": "wait", "milliseconds": 1500 },
{ "type": "screenshot", "fullPage": true },
],
}'
```
### Output
```json JSON theme={null}
{
"success": true,
"data": {
"markdown": "Our first Launch Week is over! [See the recap 🚀](blog/firecrawl-launch-week-1-recap)...",
"actions": {
"screenshots": [
"https://alttmdsdujxrfnakrkyi.supabase.co/storage/v1/object/public/media/screenshot-75ef2d87-31e0-4349-a478-fb432a29e241.png"
],
"scrapes": [
{
"url": "https://www.firecrawl.dev/",
"html": "
Firecrawl
"
}
]
},
"metadata": {
"title": "Home - Firecrawl",
"description": "Firecrawl crawls and converts any website into clean markdown.",
"language": "en",
"keywords": "Firecrawl,Markdown,Data,Mendable,Langchain",
"robots": "follow, index",
"ogTitle": "Firecrawl",
"ogDescription": "Turn any website into LLM-ready data.",
"ogUrl": "https://www.firecrawl.dev/",
"ogImage": "https://www.firecrawl.dev/og.png?123",
"ogLocaleAlternate": [],
"ogSiteName": "Firecrawl",
"sourceURL": "http://google.com",
"statusCode": 200
}
}
}
```
For workflows that require richer browser control after scraping, such as authenticated sessions, multi-step navigation, or a live view of the page, we recommend [Interact](/features/interact) over extending the actions array.
## Location and Language
Specify country and preferred languages to get relevant content based on your target location and language preferences.
### How it works
When you specify the location settings, Firecrawl will use an appropriate proxy if available and emulate the corresponding language and timezone settings. By default, the location is set to 'US' if not specified.
### Usage
To use the location and language settings, include the `location` object in your request body with the following properties:
* `country`: ISO 3166-1 alpha-2 country code (e.g., 'US', 'AU', 'DE', 'JP'). Defaults to 'US'.
* `languages`: An array of preferred languages and locales for the request in order of priority. Defaults to the language of the specified location.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
doc = firecrawl.scrape('https://example.com',
formats=['markdown'],
location={
'country': 'US',
'languages': ['en']
}
)
print(doc)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://example.com', {
formats: ['markdown'],
location: { country: 'US', languages: ['en'] },
});
console.log(doc.metadata);
```
```bash cURL theme={null}
curl -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["markdown"],
"location": { "country": "US", "languages": ["en"] }
}'
```
For more details about supported locations, refer to the [Proxies documentation](/features/proxies).
## Caching and maxAge
To make requests faster, Firecrawl serves results from cache by default when a recent copy is available.
* **Default freshness window**: `maxAge = 172800000` ms (2 days). If a cached page is newer than this, it’s returned instantly; otherwise, the page is scraped and then cached.
* **Performance**: This can speed up scrapes by up to 5x when data doesn’t need to be ultra-fresh.
* **Always fetch fresh**: Set `maxAge` to `0`. This bypasses the cache entirely, so every request goes through the full scraping pipeline; requests take longer to complete and are more likely to fail. Use a non-zero `maxAge` unless freshness on every request is critical.
* **Avoid storing**: Set `storeInCache` to `false` if you don’t want Firecrawl to cache/store results for this request.
* **Cache-only lookup**: Set `minAge` to perform a cache-only lookup without triggering a fresh scrape. The value is in milliseconds and specifies the minimum age the cached data must be. If no cached data is found, a `404` with error code `SCRAPE_NO_CACHED_DATA` is returned. Set `minAge` to `1` to accept any cached data regardless of age.
* **Change tracking**: Requests that include `changeTracking` bypass the cache, so `maxAge` is ignored.
* **Credits**: Cached results still cost 1 credit per page. Caching improves speed, not credit usage.
Example (force fresh content):
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')
doc = firecrawl.scrape(url='https://example.com', max_age=0, formats=['markdown'])
print(doc)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://example.com', { maxAge: 0, formats: ['markdown'] });
console.log(doc);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"maxAge": 0,
"formats": ["markdown"]
}'
```
Example (use a 10-minute cache window):
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')
doc = firecrawl.scrape(url='https://example.com', max_age=600000, formats=['markdown', 'html'])
print(doc)
```
```js Node theme={null}
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const doc = await firecrawl.scrape('https://example.com', { maxAge: 600000, formats: ['markdown', 'html'] });
console.log(doc);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"maxAge": 600000,
"formats": ["markdown", "html"]
}'
```
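Example (cache-only lookup with `minAge`; the `min_age` keyword is assumed to mirror `max_age` in the Python SDK):
```python Python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')

# min_age=1 accepts any cached copy regardless of age; if nothing is cached,
# the request fails with a 404 (SCRAPE_NO_CACHED_DATA) instead of scraping fresh
doc = firecrawl.scrape(url='https://example.com', min_age=1, formats=['markdown'])
print(doc)
```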
## Batch scraping multiple URLs
You can batch scrape multiple URLs at the same time. Pass the list of URLs along with optional parameters; the parameters apply to the whole batch scrape job and let you specify options such as the output formats.
### How it works
It is very similar to how the `/crawl` endpoint works. It submits a batch scrape job and returns a job ID to check the status of the batch scrape.
The SDK provides two methods, synchronous and asynchronous. The synchronous method returns the results of the batch scrape job, while the asynchronous method returns a job ID that you can use to check the status of the batch scrape.
### Usage
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
job = firecrawl.batch_scrape([
"https://firecrawl.dev",
"https://docs.firecrawl.dev",
], formats=["markdown"], poll_interval=2, wait_timeout=120)
print(job)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const job = await firecrawl.batchScrape([
'https://firecrawl.dev',
'https://docs.firecrawl.dev',
], { options: { formats: ['markdown'] }, pollInterval: 2, timeout: 120 });
console.log(job);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/batch/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://firecrawl.dev", "https://docs.firecrawl.dev"],
"formats": ["markdown"]
}'
```
### Response
If you're using the sync methods from the SDKs, they return the results of the batch scrape job directly. Otherwise, you get a job ID that you can use to check the status of the batch scrape.
#### Synchronous
```json Completed theme={null}
{
"status": "completed",
"total": 36,
"completed": 36,
"creditsUsed": 36,
"expiresAt": "2024-00-00T00:00:00.000Z",
"next": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789?skip=26",
"data": [
{
"markdown": "[Firecrawl Docs home page!...",
"html": "...",
"metadata": {
"title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
"language": "en",
"sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
"description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
"ogLocaleAlternate": [],
"statusCode": 200
}
},
...
]
}
```
#### Asynchronous
You can then use the job ID to check the status of the batch scrape by calling the `/batch/scrape/{id}` endpoint. This endpoint is meant to be used while the job is still running or right after it has completed **as batch scrape jobs expire after 24 hours**.
```json theme={null}
{
"success": true,
"id": "123-456-789",
"url": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789"
}
```
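A sketch of the asynchronous flow in Python (method names assumed to follow the SDK's start/status naming; check your SDK version):
```python Python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# Submit the job without waiting for results (method name assumed)
job = firecrawl.start_batch_scrape([
    "https://firecrawl.dev",
    "https://docs.firecrawl.dev",
], formats=["markdown"])

# Poll the job by ID until it completes (method name assumed)
status = firecrawl.get_batch_scrape_status(job.id)
print(status.status, status.completed, status.total)
```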
## Enhanced Mode
For complex websites, Firecrawl offers enhanced mode that provides better success rates while maintaining privacy.
Learn more about [Enhanced Mode](/features/enhanced-mode).
## Zero Data Retention (ZDR)
Firecrawl supports Zero Data Retention (ZDR) for teams with strict data handling requirements. When enabled, Firecrawl will not persist any page content or extracted data beyond the lifetime of the request.
To enable ZDR, set `zeroDataRetention: true` in your request:
```bash cURL theme={null}
curl -X POST https://api.firecrawl.dev/v2/scrape \
-H "Content-Type: application/json" \
-H "Authorization: Bearer fc-YOUR_API_KEY" \
-d '{
"url": "https://example.com",
"formats": ["markdown"],
"zeroDataRetention": true
}'
```
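A Python equivalent using the raw HTTP API (shown with the third-party `requests` library, since the API field `zeroDataRetention` is what matters; SDK support for the flag may vary):
```python Python theme={null}
import requests  # pip install requests

resp = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "formats": ["markdown"],
        "zeroDataRetention": True,  # nothing persisted beyond the request
    },
)
resp.raise_for_status()
print(resp.json()["data"]["markdown"])
```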
ZDR is available on Enterprise plans and must be enabled for your team. Visit [firecrawl.dev/enterprise](https://www.firecrawl.dev/enterprise) to get started.
ZDR adds **1 additional credit per page** on top of the base scrape cost.
Screenshots are not available in ZDR mode. Because screenshots require uploading to persistent storage, they are incompatible with the ZDR guarantee. Requests that include both `zeroDataRetention: true` and a `screenshot` format will return an error.
# Introduction
Source: https://docs.firecrawl.dev/introduction
Search the web, scrape any page, and interact with it — all through one API.
**For AI agents:** Append `.md` to any docs URL for markdown, e.g. [introduction.md](/introduction.md).
## Get started
* Sign up and get your API key to start using Firecrawl
* Test the API instantly without writing any code
### Use Firecrawl with AI agents (recommended)
The Firecrawl skill is the fastest way for agents to discover and use Firecrawl. Without it, your agent will not know Firecrawl is available.
```bash theme={null}
npx -y firecrawl-cli@latest init --all --browser
```
Restart your agent after installing the skill. See [Skill + CLI](/sdks/cli) for the full setup.
Or use the [MCP Server](/mcp-server) to connect Firecrawl directly to Claude, Cursor, Windsurf, VS Code, and other AI tools.
***
## What can Firecrawl do?
* Search the web and get full page content from results
* Extract content from any URL as markdown, HTML, or structured JSON
* Continue working with any scraped page — click, fill forms, extract dynamic content
### Why Firecrawl?
* **LLM-ready output**: Clean markdown, structured JSON, screenshots, and more.
* **Handles the hard stuff**: Proxies, anti-bot, JavaScript rendering, and dynamic content.
* **Reliable**: Built for production with high uptime and consistent results.
* **Fast**: Results in seconds, optimized for high throughput.
* **MCP Server**: Connect Firecrawl to any AI tool via the [Model Context Protocol](/mcp-server).
***
## Search
Search the web and get full page content from results in one call. See the [Search feature docs](/features/search) for all options.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
results = firecrawl.search(
query="firecrawl",
limit=3,
)
print(results)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
const results = await firecrawl.search('firecrawl', {
limit: 3,
scrapeOptions: { formats: ['markdown'] }
});
console.log(results);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/search" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "firecrawl",
"limit": 3
}'
```
```bash CLI theme={null}
# Search the web
firecrawl search "firecrawl web scraping" --limit 5 --pretty
```
SDKs will return the data object directly. cURL will return the complete payload.
```json JSON theme={null}
{
"success": true,
"data": {
"web": [
{
"url": "https://www.firecrawl.dev/",
"title": "Firecrawl - The Web Data API for AI",
"description": "The web crawling, scraping, and search API for AI. Built for scale. Firecrawl delivers the entire internet to AI agents and builders.",
"position": 1
},
{
"url": "https://github.com/firecrawl/firecrawl",
"title": "mendableai/firecrawl: Turn entire websites into LLM-ready ... - GitHub",
"description": "Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data.",
"position": 2
},
...
],
"images": [
{
"title": "Quickstart | Firecrawl",
"imageUrl": "https://mintlify.s3.us-west-1.amazonaws.com/firecrawl/logo/logo.png",
"imageWidth": 5814,
"imageHeight": 1200,
"url": "https://docs.firecrawl.dev/",
"position": 1
},
...
],
"news": [
{
"title": "Y Combinator startup Firecrawl is ready to pay $1M to hire three AI agents as employees",
"url": "https://techcrunch.com/2025/05/17/y-combinator-startup-firecrawl-is-ready-to-pay-1m-to-hire-three-ai-agents-as-employees/",
"snippet": "It's now placed three new ads on YC's job board for “AI agents only” and has set aside a $1 million budget total to make it happen.",
"date": "3 months ago",
"position": 1
},
...
]
}
}
```
## Scrape
Scrape any URL and get its content in markdown, HTML, or other formats. See the [Scrape feature docs](/features/scrape) for all options.
```python Python theme={null}
from firecrawl import Firecrawl
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
# Scrape a website:
doc = firecrawl.scrape("https://firecrawl.dev", formats=["markdown", "html"])
print(doc)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });
// Scrape a website:
const doc = await firecrawl.scrape('https://firecrawl.dev', { formats: ['markdown', 'html'] });
console.log(doc);
```
```bash cURL theme={null}
curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://firecrawl.dev",
"formats": ["markdown", "html"]
}'
```
```bash CLI theme={null}
# Scrape a URL and get markdown
firecrawl https://firecrawl.dev
# With multiple formats (returns JSON)
firecrawl https://firecrawl.dev --format markdown,html,links --pretty
```
SDKs will return the data object directly. cURL will return the payload exactly as shown below.
```json theme={null}
{
"success": true,
"data" : {
"markdown": "Launch Week I is here! [See our Day 2 Release 🚀](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)[💥 Get 2 months free...",
"html": "
## Interact
Scrape a page, then keep working with it — click buttons, fill forms, extract dynamic content, or navigate deeper. Describe what you want in plain English or write code for full control. See the [Interact feature docs](/features/interact) for all options.
```python Python theme={null}
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR-API-KEY")
# 1. Scrape Amazon's homepage
result = app.scrape("https://www.amazon.com", formats=["markdown"])
scrape_id = result.metadata.scrape_id
# 2. Interact — search for a product and get its price
app.interact(scrape_id, prompt="Search for iPhone 16 Pro Max")
response = app.interact(scrape_id, prompt="Click on the first result and tell me the price")
print(response.output)
# 3. Stop the session
app.stop_interaction(scrape_id)
```
```js Node theme={null}
import Firecrawl from '@mendable/firecrawl-js';
const app = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });
// 1. Scrape Amazon's homepage
const result = await app.scrape('https://www.amazon.com', { formats: ['markdown'] });
const scrapeId = result.metadata?.scrapeId;
// 2. Interact — search for a product and get its price
await app.interact(scrapeId, { prompt: 'Search for iPhone 16 Pro Max' });
const response = await app.interact(scrapeId, { prompt: 'Click on the first result and tell me the price' });
console.log(response.output);
// 3. Stop the session
await app.stopInteraction(scrapeId);
```
```bash cURL theme={null}
# 1. Scrape Amazon's homepage
RESPONSE=$(curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://www.amazon.com", "formats": ["markdown"]}')
SCRAPE_ID=$(echo $RESPONSE | jq -r '.data.metadata.scrapeId')
# 2. Interact — search for a product and get its price
curl -s -X POST "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"prompt": "Search for iPhone 16 Pro Max"}'
curl -s -X POST "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY" \
-H "Content-Type: application/json" \
-d '{"prompt": "Click on the first result and tell me the price"}'
# 3. Stop the session
curl -s -X DELETE "https://api.firecrawl.dev/v2/scrape/$SCRAPE_ID/interact" \
-H "Authorization: Bearer $FIRECRAWL_API_KEY"
```
```bash CLI theme={null}
# 1. Scrape Amazon's homepage (scrape ID is saved automatically)
firecrawl scrape https://www.amazon.com
# 2. Interact — search for a product and get its price
firecrawl interact "Search for iPhone 16 Pro Max"
firecrawl interact "Click on the first result and tell me the price"
# 3. Stop the session
firecrawl interact stop
```
```json Response theme={null}
{
"success": true,
"liveViewUrl": "https://liveview.firecrawl.dev/...",
"interactiveLiveViewUrl": "https://liveview.firecrawl.dev/...",
"output": "The iPhone 16 Pro Max (256GB) is priced at $1,199.00.",
"exitCode": 0,
"killed": false
}
```
***
## More capabilities
* Autonomous web data gathering powered by AI
* Managed browser sessions for interactive workflows
* Discover all URLs on a website
* Recursively gather content from entire sites
***
## Resources
* Complete API documentation with interactive examples
* Python, Node.js, CLI, and community SDKs
* Self-host Firecrawl or contribute to the project
* LangChain, LlamaIndex, OpenAI, and more
# Firecrawl MCP Server
Source: https://docs.firecrawl.dev/mcp-server
Use Firecrawl's API through the Model Context Protocol
A Model Context Protocol (MCP) server implementation that integrates [Firecrawl](https://github.com/firecrawl/firecrawl) for searching, scraping, and interacting with the web. Our MCP server is open-source and available on [GitHub](https://github.com/firecrawl/firecrawl-mcp-server).
## Features
* Search the web and get full page content
* Scrape any URL into clean, structured data
* Interact with pages — click, navigate, and operate
* Deep research with autonomous agent
* Browser session management
* Cloud and self-hosted support
* Streamable HTTP support
## Installation
You can either use our remote hosted URL or run the server locally. Get your API key from [https://firecrawl.dev/app/api-keys](https://www.firecrawl.dev/app/api-keys)
### Remote hosted URL
```bash theme={null}
https://mcp.firecrawl.dev/{FIRECRAWL_API_KEY}/v2/mcp
```
### Running with npx
```bash theme={null}
env FIRECRAWL_API_KEY=fc-YOUR_API_KEY npx -y firecrawl-mcp
```
### Manual Installation
```bash theme={null}
npm install -g firecrawl-mcp
```
### Running on Cursor
#### Manual Installation
Configuring Cursor 🖥️
Note: Requires Cursor version 0.45.6+
For the most up-to-date configuration instructions, please refer to the official Cursor documentation on configuring MCP servers:
[Cursor MCP Server Configuration Guide](https://docs.cursor.com/context/model-context-protocol#configuring-mcp-servers)
To configure Firecrawl MCP in Cursor **v0.48.6**:
1. Open Cursor Settings
2. Go to Features > MCP Servers
3. Click "+ Add new global MCP server"
4. Enter the following code:
```json theme={null}
{
"mcpServers": {
"firecrawl-mcp": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "YOUR-API-KEY"
}
}
}
}
```
To configure Firecrawl MCP in Cursor **v0.45.6**:
1. Open Cursor Settings
2. Go to Features > MCP Servers
3. Click "+ Add New MCP Server"
4. Enter the following:
* Name: "firecrawl-mcp" (or your preferred name)
* Type: "command"
* Command: `env FIRECRAWL_API_KEY=your-api-key npx -y firecrawl-mcp`
> If you are using Windows and are running into issues, try `cmd /c "set FIRECRAWL_API_KEY=your-api-key && npx -y firecrawl-mcp"`
Replace `your-api-key` with your Firecrawl API key. If you don't have one yet, you can create an account and get it from [https://www.firecrawl.dev/app/api-keys](https://www.firecrawl.dev/app/api-keys)
After adding, refresh the MCP server list to see the new tools. The Composer Agent will automatically use Firecrawl MCP when appropriate, but you can explicitly request it by describing your web data needs. Access the Composer via Command+L (Mac), select "Agent" next to the submit button, and enter your query.
### Running on Windsurf
Add this to your `./codeium/windsurf/model_config.json`:
```json theme={null}
{
"mcpServers": {
"mcp-server-firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "YOUR_API_KEY"
}
}
}
}
```
### Running with Streamable HTTP Mode
To run the server using streamable HTTP transport locally instead of the default stdio transport:
```bash theme={null}
env HTTP_STREAMABLE_SERVER=true FIRECRAWL_API_KEY=fc-YOUR_API_KEY npx -y firecrawl-mcp
```
Then use the URL `http://localhost:3000/v2/mcp`, or the hosted endpoint `https://mcp.firecrawl.dev/{FIRECRAWL_API_KEY}/v2/mcp`.
### Installing via Smithery (Legacy)
To install Firecrawl for Claude Desktop automatically via [Smithery](https://smithery.ai/server/@mendableai/mcp-server-firecrawl):
```bash theme={null}
npx -y @smithery/cli install @mendableai/mcp-server-firecrawl --client claude
```
### Running on VS Code
For one-click installation, click one of the install buttons below:
[Install in VS Code](https://insiders.vscode.dev/redirect/mcp/install?name=firecrawl\&inputs=%5B%7B%22type%22%3A%22promptString%22%2C%22id%22%3A%22apiKey%22%2C%22description%22%3A%22Firecrawl%20API%20Key%22%2C%22password%22%3Atrue%7D%5D\&config=%7B%22command%22%3A%22npx%22%2C%22args%22%3A%5B%22-y%22%2C%22firecrawl-mcp%22%5D%2C%22env%22%3A%7B%22FIRECRAWL_API_KEY%22%3A%22%24%7Binput%3AapiKey%7D%22%7D%7D) [Install in VS Code Insiders](https://insiders.vscode.dev/redirect/mcp/install?name=firecrawl\&inputs=%5B%7B%22type%22%3A%22promptString%22%2C%22id%22%3A%22apiKey%22%2C%22description%22%3A%22Firecrawl%20API%20Key%22%2C%22password%22%3Atrue%7D%5D\&config=%7B%22command%22%3A%22npx%22%2C%22args%22%3A%5B%22-y%22%2C%22firecrawl-mcp%22%5D%2C%22env%22%3A%7B%22FIRECRAWL_API_KEY%22%3A%22%24%7Binput%3AapiKey%7D%22%7D%7D\&quality=insiders)
For manual installation, add the following JSON block to your User Settings (JSON) file in VS Code. You can do this by pressing `Ctrl + Shift + P` and typing `Preferences: Open User Settings (JSON)`.
```json theme={null}
{
"mcp": {
"inputs": [
{
"type": "promptString",
"id": "apiKey",
"description": "Firecrawl API Key",
"password": true
}
],
"servers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "${input:apiKey}"
}
}
}
}
}
```
Optionally, you can add it to a file called `.vscode/mcp.json` in your workspace. This will allow you to share the configuration with others:
```json theme={null}
{
"inputs": [
{
"type": "promptString",
"id": "apiKey",
"description": "Firecrawl API Key",
"password": true
}
],
"servers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "${input:apiKey}"
}
}
}
}
```
**Note:** Some users have reported issues when adding the MCP server to VS Code due to how it validates JSON with an outdated schema format ([microsoft/vscode#155379](https://github.com/microsoft/vscode/issues/155379)).
This affects several MCP tools, including Firecrawl.
**Workaround:** Disable JSON validation in VS Code to allow the MCP server to load properly. See reference: [directus/directus#25906 (comment)](https://github.com/directus/directus/issues/25906#issuecomment-3369169513).
The MCP server still works when invoked via other extensions; the issue occurs specifically when registering it directly in the MCP server list. We plan to add guidance once VS Code updates its schema validation.
### Running on Claude Desktop
Add this to the Claude config file:
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"url": "https://mcp.firecrawl.dev/v2/mcp",
"headers": {
"Authorization": "Bearer YOUR_API_KEY"
}
}
}
}
```
If you get a "Couldn't reach the MCP server" error, your Claude Desktop version may not support streamable HTTP transport. Use the local npx approach instead (requires [Node.js](https://nodejs.org)):
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "YOUR_API_KEY"
}
}
}
}
```
If you see a `spawn npx ENOENT` error, Node.js is not installed or not in your system PATH. Install Node.js from [nodejs.org](https://nodejs.org) (LTS version), then fully restart Claude Desktop. On Windows, you can also run `where npx` in Command Prompt and use the full path (e.g. `C:\\Program Files\\nodejs\\npx.cmd`) as the `command` value.
### Running on Claude Code
Add the Firecrawl MCP server using the Claude Code CLI. You can use the remote hosted URL or run locally:
```bash theme={null}
# Remote hosted URL (recommended)
claude mcp add firecrawl --url https://mcp.firecrawl.dev/your-api-key/v2/mcp
# Or run locally via npx
claude mcp add firecrawl -e FIRECRAWL_API_KEY=your-api-key -- npx -y firecrawl-mcp
```
### Running on Google Antigravity
Google Antigravity allows you to configure MCP servers directly through its Agent interface.
1. Open the Agent sidebar in the Editor or the Agent Manager view
2. Click the "..." (More Actions) menu and select **MCP Servers**
3. Select **View raw config** to open your local `mcp_config.json` file
4. Add the following configuration:
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "YOUR_FIRECRAWL_API_KEY"
}
}
}
}
```
5. Save the file and click **Refresh** in the Antigravity MCP interface to see the new tools
Replace `YOUR_FIRECRAWL_API_KEY` with your API key from [https://firecrawl.dev/app/api-keys](https://www.firecrawl.dev/app/api-keys).
### Running on n8n
To connect the Firecrawl MCP server in n8n:
1. Get your Firecrawl API key from [https://firecrawl.dev/app/api-keys](https://www.firecrawl.dev/app/api-keys)
2. In your n8n workflow, add an **AI Agent** node
3. In the AI Agent configuration, add a new **Tool**
4. Select **MCP Client Tool** as the tool type
5. Enter the MCP server Endpoint (replace `{YOUR_FIRECRAWL_API_KEY}` with your actual API key):
```
https://mcp.firecrawl.dev/{YOUR_FIRECRAWL_API_KEY}/v2/mcp
```
6. Set **Server Transport** to **HTTP Streamable**
7. Set **Authentication** to **None**
8. For **Tools to include**, you can select **All**, **Selected**, or **All Except**; this exposes the Firecrawl tools (scrape, crawl, map, search, extract, etc.)
For self-hosted deployments, run the MCP server with npx and enable HTTP transport mode:
```bash theme={null}
env HTTP_STREAMABLE_SERVER=true \
FIRECRAWL_API_KEY=fc-YOUR_API_KEY \
FIRECRAWL_API_URL=YOUR_FIRECRAWL_INSTANCE \
npx -y firecrawl-mcp
```
This starts the server on `http://localhost:3000/v2/mcp`, which you can use as the Endpoint in your n8n workflow. The `HTTP_STREAMABLE_SERVER=true` environment variable is required because n8n needs HTTP transport.
## Configuration
### Environment Variables
#### Required for Cloud API
* `FIRECRAWL_API_KEY`: Your Firecrawl API key
* Required when using cloud API (default)
* Optional when using self-hosted instance with `FIRECRAWL_API_URL`
* `FIRECRAWL_API_URL` (Optional): Custom API endpoint for self-hosted instances
* Example: `https://firecrawl.your-domain.com`
* If not provided, the cloud API will be used (requires API key)
#### Optional Configuration
##### Retry Configuration
* `FIRECRAWL_RETRY_MAX_ATTEMPTS`: Maximum number of retry attempts (default: 3)
* `FIRECRAWL_RETRY_INITIAL_DELAY`: Initial delay in milliseconds before first retry (default: 1000)
* `FIRECRAWL_RETRY_MAX_DELAY`: Maximum delay in milliseconds between retries (default: 10000)
* `FIRECRAWL_RETRY_BACKOFF_FACTOR`: Exponential backoff multiplier (default: 2)
##### Credit Usage Monitoring
* `FIRECRAWL_CREDIT_WARNING_THRESHOLD`: Credit usage warning threshold (default: 1000)
* `FIRECRAWL_CREDIT_CRITICAL_THRESHOLD`: Credit usage critical threshold (default: 100)
### Configuration Examples
For cloud API usage with custom retry and credit monitoring:
```bash theme={null}
# Required for cloud API
export FIRECRAWL_API_KEY=your-api-key
# Optional retry configuration
export FIRECRAWL_RETRY_MAX_ATTEMPTS=5 # Increase max retry attempts
export FIRECRAWL_RETRY_INITIAL_DELAY=2000 # Start with 2s delay
export FIRECRAWL_RETRY_MAX_DELAY=30000 # Maximum 30s delay
export FIRECRAWL_RETRY_BACKOFF_FACTOR=3 # More aggressive backoff
# Optional credit monitoring
export FIRECRAWL_CREDIT_WARNING_THRESHOLD=2000 # Warning at 2000 credits
export FIRECRAWL_CREDIT_CRITICAL_THRESHOLD=500 # Critical at 500 credits
```
For self-hosted instance:
```bash theme={null}
# Required for self-hosted
export FIRECRAWL_API_URL=https://firecrawl.your-domain.com
# Optional authentication for self-hosted
export FIRECRAWL_API_KEY=your-api-key # If your instance requires auth
# Custom retry configuration
export FIRECRAWL_RETRY_MAX_ATTEMPTS=10
export FIRECRAWL_RETRY_INITIAL_DELAY=500 # Start with faster retries
```
### Custom configuration with Claude Desktop
Add this to your `claude_desktop_config.json`:
```json theme={null}
{
"mcpServers": {
"mcp-server-firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "YOUR_API_KEY_HERE",
"FIRECRAWL_RETRY_MAX_ATTEMPTS": "5",
"FIRECRAWL_RETRY_INITIAL_DELAY": "2000",
"FIRECRAWL_RETRY_MAX_DELAY": "30000",
"FIRECRAWL_RETRY_BACKOFF_FACTOR": "3",
"FIRECRAWL_CREDIT_WARNING_THRESHOLD": "2000",
"FIRECRAWL_CREDIT_CRITICAL_THRESHOLD": "500"
}
}
}
}
```
### System Configuration
The server includes several configurable parameters that can be set via environment variables. Here are the default values if not configured:
```typescript theme={null}
const CONFIG = {
retry: {
maxAttempts: 3, // Number of retry attempts for rate-limited requests
initialDelay: 1000, // Initial delay before first retry (in milliseconds)
maxDelay: 10000, // Maximum delay between retries (in milliseconds)
backoffFactor: 2, // Multiplier for exponential backoff
},
credit: {
warningThreshold: 1000, // Warn when credit usage reaches this level
criticalThreshold: 100, // Critical alert when credit usage reaches this level
},
};
```
These configurations control:
1. **Retry Behavior**
* Automatically retries failed requests due to rate limits
* Uses exponential backoff to avoid overwhelming the API
   * Example: with default settings (see the sketch below), retries are attempted at:
     * 1st retry: 1 second delay
     * 2nd retry: 2 seconds delay
     * 3rd retry: 4 seconds delay (delays are capped at `maxDelay`)
2. **Credit Usage Monitoring**
* Tracks API credit consumption for cloud API usage
* Provides warnings at specified thresholds
* Helps prevent unexpected service interruption
* Example: With default settings:
* Warning at 1000 credits remaining
* Critical alert at 100 credits remaining
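To make the retry schedule concrete, here is an illustrative sketch (not the server's implementation) of how the retry settings map to delays:
```python theme={null}
# Illustrative only: mirrors the documented retry settings, not the server's code.
def retry_delays(max_attempts=3, initial_delay=1.0, max_delay=10.0, backoff_factor=2.0):
    """Yield the delay in seconds applied before each retry attempt."""
    for attempt in range(max_attempts):
        yield min(initial_delay * backoff_factor**attempt, max_delay)

print(list(retry_delays()))        # defaults: [1.0, 2.0, 4.0]
print(list(retry_delays(5, 2.0)))  # 5 attempts, 2s initial delay: [2.0, 4.0, 8.0, 10.0, 10.0]
```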
### Rate Limiting and Batch Processing
The server utilizes Firecrawl's built-in rate limiting and batch processing capabilities:
* Automatic rate limit handling with exponential backoff
* Efficient parallel processing for batch operations
* Smart request queuing and throttling
* Automatic retries for transient errors
## Available Tools
### 1. Scrape Tool (`firecrawl_scrape`)
Scrape content from a single URL with advanced options.
```json theme={null}
{
"name": "firecrawl_scrape",
"arguments": {
"url": "https://example.com",
"formats": ["markdown"],
"onlyMainContent": true,
"waitFor": 1000,
"mobile": false,
"includeTags": ["article", "main"],
"excludeTags": ["nav", "footer"],
"skipTlsVerification": false
}
}
```
### 2. Map Tool (`firecrawl_map`)
Map a website to discover all indexed URLs on the site.
```json theme={null}
{
"name": "firecrawl_map",
"arguments": {
"url": "https://example.com",
"search": "blog",
"sitemap": "include",
"includeSubdomains": false,
"limit": 100,
"ignoreQueryParameters": true
}
}
```
#### Map Tool Options:
* `url`: The base URL of the website to map
* `search`: Optional search term to filter URLs
* `sitemap`: Control sitemap usage (`"include"`, `"skip"`, or `"only"`)
* `includeSubdomains`: Whether to include subdomains in the mapping
* `limit`: Maximum number of URLs to return
* `ignoreQueryParameters`: Whether to ignore query parameters when mapping
**Best for:** Discovering URLs on a website before deciding what to scrape; finding specific sections of a website.
**Returns:** Array of URLs found on the site.
### 3. Search Tool (`firecrawl_search`)
Search the web and optionally extract content from search results.
```json theme={null}
{
"name": "firecrawl_search",
"arguments": {
"query": "your search query",
"limit": 5,
"location": "United States",
"tbs": "qdr:m",
"scrapeOptions": {
"formats": ["markdown"],
"onlyMainContent": true
}
}
}
```
#### Search Tool Options:
* `query`: The search query string (required)
* `limit`: Maximum number of results to return
* `location`: Geographic location for search results
* `tbs`: Time-based search filter (e.g., `qdr:d` for past day, `qdr:w` for past week, `qdr:m` for past month)
* `filter`: Additional search filter
* `sources`: Array of source types to search (`web`, `images`, `news`)
* `scrapeOptions`: Options for scraping search result pages
* `enterprise`: Array of enterprise options (`default`, `anon`, `zdr`)
### 4. Crawl Tool (`firecrawl_crawl`)
Start an asynchronous crawl with advanced options.
```json theme={null}
{
"name": "firecrawl_crawl",
"arguments": {
"url": "https://example.com",
"maxDiscoveryDepth": 2,
"limit": 100,
"allowExternalLinks": false,
"deduplicateSimilarURLs": true
}
}
```
### 5. Check Crawl Status (`firecrawl_check_crawl_status`)
Check the status of a crawl job.
```json theme={null}
{
"name": "firecrawl_check_crawl_status",
"arguments": {
"id": "550e8400-e29b-41d4-a716-446655440000"
}
}
```
**Returns:** Status and progress of the crawl job, including results if available.
### 6. Extract Tool (`firecrawl_extract`)
Extract structured information from web pages using LLM capabilities. Supports both cloud AI and self-hosted LLM extraction.
```json theme={null}
{
"name": "firecrawl_extract",
"arguments": {
"urls": ["https://example.com/page1", "https://example.com/page2"],
"prompt": "Extract product information including name, price, and description",
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "number" },
"description": { "type": "string" }
},
"required": ["name", "price"]
},
"allowExternalLinks": false,
"enableWebSearch": false,
"includeSubdomains": false
}
}
```
Example response:
```json theme={null}
{
"content": [
{
"type": "text",
"text": {
"name": "Example Product",
"price": 99.99,
"description": "This is an example product description"
}
}
],
"isError": false
}
```
#### Extract Tool Options:
* `urls`: Array of URLs to extract information from
* `prompt`: Custom prompt for the LLM extraction
* `schema`: JSON schema for structured data extraction
* `allowExternalLinks`: Allow extraction from external links
* `enableWebSearch`: Enable web search for additional context
* `includeSubdomains`: Include subdomains in extraction
When using a self-hosted instance, the extraction will use your configured LLM. For cloud API, it uses Firecrawl's managed LLM service.
### 7. Agent Tool (`firecrawl_agent`)
Autonomous web research agent that independently browses the internet, searches for information, navigates through pages, and extracts structured data based on your query. This runs asynchronously: it returns a job ID immediately, and you poll `firecrawl_agent_status` to check progress and retrieve results (a polling sketch appears under `firecrawl_agent_status` below).
```json theme={null}
{
"name": "firecrawl_agent",
"arguments": {
"prompt": "Find the top 5 AI startups founded in 2024 and their funding amounts",
"schema": {
"type": "object",
"properties": {
"startups": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"funding": { "type": "string" },
"founded": { "type": "string" }
}
}
}
}
}
}
}
```
You can also provide specific URLs for the agent to focus on:
```json theme={null}
{
"name": "firecrawl_agent",
"arguments": {
"urls": ["https://docs.firecrawl.dev", "https://firecrawl.dev/pricing"],
"prompt": "Compare the features and pricing information from these pages"
}
}
```
#### Agent Tool Options:
* `prompt`: Natural language description of the data you want (required, max 10,000 characters)
* `urls`: Optional array of URLs to focus the agent on specific pages
* `schema`: Optional JSON schema for structured output
**Best for:** Complex research tasks where you don't know the exact URLs; multi-source data gathering; finding information scattered across the web; extracting data from JavaScript-heavy SPAs that a regular scrape can't handle.
**Returns:** Job ID for status checking. Use `firecrawl_agent_status` to poll for results.
### 8. Check Agent Status (`firecrawl_agent_status`)
Check the status of an agent job and retrieve results when complete. Poll every 15-30 seconds and keep polling for at least 2-3 minutes before considering the request failed.
```json theme={null}
{
"name": "firecrawl_agent_status",
"arguments": {
"id": "550e8400-e29b-41d4-a716-446655440000"
}
}
```
#### Agent Status Options:
* `id`: The agent job ID returned by `firecrawl_agent` (required)
**Possible statuses:**
* `processing`: Agent is still researching; keep polling
* `completed`: Research finished; the response includes the extracted data
* `failed`: An error occurred
**Returns:** Status, progress, and results (if completed) of the agent job.
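A rough polling loop might look like the sketch below. Here `call_tool` is a hypothetical stand-in for however your MCP client invokes tools; only the status values and the 15-30 second cadence come from this page:
```python theme={null}
import time

def wait_for_agent(call_tool, job_id: str, poll_interval: float = 20.0, max_wait: float = 300.0) -> dict:
    """Poll firecrawl_agent_status until the job completes, fails, or max_wait elapses.

    call_tool is a hypothetical stand-in for your MCP client's tool invocation.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        result = call_tool("firecrawl_agent_status", {"id": job_id})
        if result["status"] == "completed":
            return result  # response includes the extracted data
        if result["status"] == "failed":
            raise RuntimeError("agent job failed")
        time.sleep(poll_interval)  # docs suggest polling every 15-30 seconds
    raise TimeoutError("agent still processing; keep polling or raise max_wait")
```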
### 9. Create Browser Session (`firecrawl_browser_create`)
Create a persistent browser session for code execution via CDP (Chrome DevTools Protocol).
```json theme={null}
{
"name": "firecrawl_browser_create",
"arguments": {
"ttl": 120,
"activityTtl": 60
}
}
```
#### Browser Create Options:
* `ttl`: Total session lifetime in seconds (30-3600, optional)
* `activityTtl`: Idle timeout in seconds (10-3600, optional)
**Best for:** Running code (Python/JS) that interacts with a live browser page, multi-step browser automation, sessions with profiles that survive across multiple tool calls.
**Returns:** Session ID, CDP URL, and live view URL.
### 10. Execute Code in Browser (`firecrawl_browser_execute`)
Execute code in an active browser session. Supports agent-browser commands (bash), Python, or JavaScript.
```json theme={null}
{
"name": "firecrawl_browser_execute",
"arguments": {
"sessionId": "session-id-here",
"code": "agent-browser open https://example.com",
"language": "bash"
}
}
```
Python example with Playwright:
```json theme={null}
{
"name": "firecrawl_browser_execute",
"arguments": {
"sessionId": "session-id-here",
"code": "await page.goto('https://example.com')\ntitle = await page.title()\nprint(title)",
"language": "python"
}
}
```
#### Browser Execute Options:
* `sessionId`: The browser session ID (required)
* `code`: The code to execute (required)
* `language`: `bash`, `python`, or `node` (optional, defaults to `bash`)
**Common agent-browser commands (bash):**
* `agent-browser open <url>`: Navigate to a URL
* `agent-browser snapshot`: Get the accessibility tree with clickable refs
* `agent-browser click @e5`: Click an element by ref from the snapshot
* `agent-browser type @e3 "text"`: Type into an element
* `agent-browser screenshot [path]`: Take a screenshot
* `agent-browser scroll down`: Scroll the page
* `agent-browser wait 2000`: Wait 2 seconds
**Returns:** Execution result including stdout, stderr, and exit code.
### 11. Delete Browser Session (`firecrawl_browser_delete`)
Destroy a browser session.
```json theme={null}
{
"name": "firecrawl_browser_delete",
"arguments": {
"sessionId": "session-id-here"
}
}
```
#### Browser Delete Options:
* `sessionId`: The browser session ID to destroy (required)
**Returns:** Success confirmation.
### 12. List Browser Sessions (`firecrawl_browser_list`)
List browser sessions, optionally filtered by status.
```json theme={null}
{
"name": "firecrawl_browser_list",
"arguments": {
"status": "active"
}
}
```
#### Browser List Options:
* `status`: Filter by session status, `active` or `destroyed` (optional)
**Returns:** Array of browser sessions.
### 13. Interact with Scraped Page (`firecrawl_interact`)
Interact with a previously scraped page in a live browser session. Scrape a page first with `firecrawl_scrape`, then use the returned `scrapeId` (from the scrape response metadata) to click buttons, fill forms, extract dynamic content, or navigate deeper. The response includes a `liveViewUrl` and `interactiveLiveViewUrl` you can open in your browser to watch or control the session in real time.
```json theme={null}
{
"name": "firecrawl_interact",
"arguments": {
"scrapeId": "scrape-id-from-previous-scrape",
"prompt": "Click the Sign In button"
}
}
```
#### Interact Tool Options:
* `scrapeId`: The scrape job ID from a previous `firecrawl_scrape` call (required)
* `prompt`: Natural language instruction describing the action to take (provide `prompt` or `code`)
* `code`: Code to execute in the browser session (provide `code` or `prompt`)
* `language`: `bash`, `python`, or `node` (optional, defaults to `node`, only used with `code`)
* `timeout`: Execution timeout in seconds, 1–300 (optional, defaults to 30)
**Best for:** Multi-step workflows on a single page — searching a site, clicking through results, filling forms, extracting data that requires interaction.
**Returns:** Interaction result including `liveViewUrl` and `interactiveLiveViewUrl`.
### 14. Stop Interact Session (`firecrawl_interact_stop`)
Stop an interact session for a scraped page. Call this when you are done interacting to free resources.
```json theme={null}
{
"name": "firecrawl_interact_stop",
"arguments": {
"scrapeId": "scrape-id-from-previous-scrape"
}
}
```
#### Interact Stop Options:
* `scrapeId`: The scrape ID for the session to stop (required)
**Returns:** Confirmation that the session has been stopped.
## Logging System
The server includes comprehensive logging:
* Operation status and progress
* Performance metrics
* Credit usage monitoring
* Rate limit tracking
* Error conditions
Example log messages:
```
[INFO] Firecrawl MCP Server initialized successfully
[INFO] Starting scrape for URL: https://example.com
[INFO] Starting crawl for URL: https://example.com
[WARNING] Credit usage has reached warning threshold
[ERROR] Rate limit exceeded, retrying in 2s...
```
## Error Handling
The server provides robust error handling:
* Automatic retries for transient errors
* Rate limit handling with backoff
* Detailed error messages
* Credit usage warnings
* Network resilience
Example error response:
```json theme={null}
{
"content": [
{
"type": "text",
"text": "Error: Rate limit exceeded. Retrying in 2 seconds..."
}
],
"isError": true
}
```
## Development
```bash theme={null}
# Install dependencies
npm install
# Build
npm run build
# Run tests
npm test
```
### Contributing
1. Fork the repository
2. Create your feature branch
3. Run tests: `npm test`
4. Submit a pull request
### Thanks to contributors
Thanks to [@vrknetha](https://github.com/vrknetha) and [@cawstudios](https://caw.tech) for the initial implementation!
Thanks to MCP.so and Klavis AI for hosting and [@gstarwd](https://github.com/gstarwd), [@xiangkaiz](https://github.com/xiangkaiz) and [@zihaolin96](https://github.com/zihaolin96) for integrating our server.
## License
MIT License. See the LICENSE file for details.
# MCP Web Search & Scrape in Amp
Source: https://docs.firecrawl.dev/quickstarts/amp
Add Firecrawl web scraping and search to Sourcegraph Amp
Add Firecrawl's search, scrape, crawl, and browser tools to [Sourcegraph Amp](https://ampcode.com) via MCP.
## Quick Setup
### 1. Get Your API Key
Sign up at [firecrawl.dev/app](https://www.firecrawl.dev/app) and copy your API key.
### 2. Add Firecrawl to Amp
Open Amp settings and add an MCP server. Amp accepts standard MCP config:
```json theme={null}
{
"amp.mcpServers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "fc-YOUR-API-KEY"
}
}
}
}
```
Replace `fc-YOUR-API-KEY` with your Firecrawl API key.
### 3. Reload Amp
Reload the Amp window. Firecrawl tools are now available to the agent.
## Quick Demo
```
Search the web for "Sourcegraph Cody vs Amp" and summarize the differences.
```
```
Scrape https://docs.firecrawl.dev and list the core endpoints.
```
```
Crawl https://example.com and output a site map as JSON.
```
## Remote Hosted URL (no Node.js required)
```json theme={null}
{
"amp.mcpServers": {
"firecrawl": {
"url": "https://mcp.firecrawl.dev/fc-YOUR-API-KEY/v2/mcp"
}
}
}
```
## Troubleshooting
* **Server fails to start** — check Amp's MCP log view for stderr output.
* **Missing `npx`** — install Node.js 18+ or use the remote hosted URL above.
# MCP Web Search & Scrape in Antigravity
Source: https://docs.firecrawl.dev/quickstarts/antigravity
Add Firecrawl web scraping and search to Google Antigravity
Add Firecrawl's search, scrape, crawl, and browser tools to [Google Antigravity](https://antigravity.google/) via MCP.
## Quick Setup
### 1. Get Your API Key
Sign up at [firecrawl.dev/app](https://www.firecrawl.dev/app) and copy your API key.
### 2. Add Firecrawl to Antigravity
Open Antigravity settings (`Cmd/Ctrl + ,`), search for **MCP Servers**, and add a new server with:
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "fc-YOUR-API-KEY"
}
}
}
}
```
Replace `fc-YOUR-API-KEY` with your Firecrawl API key.
### 3. Reload Antigravity
Reload the window (`Cmd/Ctrl + Shift + P` → `Reload Window`). The agent now has Firecrawl's web tools available.
## Quick Demo
In an Antigravity agent chat:
```
Search the web for "Vercel AI SDK v5 release notes" and summarize.
```
```
Scrape https://docs.firecrawl.dev/ai-onboarding and list every linked guide.
```
```
Crawl https://example.com and extract every page title.
```
Antigravity routes those tool calls through Firecrawl MCP automatically.
## Remote Hosted URL (no Node.js required)
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"url": "https://mcp.firecrawl.dev/fc-YOUR-API-KEY/v2/mcp"
}
}
}
```
## Troubleshooting
* **Server shows as "failed"** — check the MCP output panel for stderr. Most failures are a missing API key or `npx` not on `PATH`.
* **Tools not invoked** — explicitly mention Firecrawl (e.g., "Use Firecrawl to scrape…") in your first prompt so the agent picks the right tool.
# AutoGen
Source: https://docs.firecrawl.dev/quickstarts/autogen
Use Firecrawl as a tool inside Microsoft AutoGen multi-agent conversations.
Integrate Firecrawl with [Microsoft AutoGen](https://github.com/microsoft/autogen) to give multi-agent conversations live web search, scrape, and crawl tools.
## Setup
```bash theme={null}
pip install -U "autogen-agentchat" "autogen-ext[openai]" firecrawl-py
```
Set your keys:
```bash theme={null}
export FIRECRAWL_API_KEY=fc-YOUR-API-KEY
export OPENAI_API_KEY=sk-YOUR-OPENAI-KEY
```
## Firecrawl as an AutoGen Tool
This example wraps Firecrawl's `scrape` and `search` as AutoGen function tools, then lets a single `AssistantAgent` use them to answer a question.
```python theme={null}
import asyncio
import os
from firecrawl import FirecrawlApp
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.ui import Console
from autogen_ext.models.openai import OpenAIChatCompletionClient
firecrawl = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
def scrape_url(url: str) -> str:
"""Scrape a URL and return clean markdown."""
result = firecrawl.scrape(url, formats=["markdown"])
return result.markdown or ""
def web_search(query: str, limit: int = 5) -> list[dict]:
"""Search the web and return the top results."""
result = firecrawl.search(query, limit=limit)
return [
{"title": r.title, "url": r.url, "snippet": r.description}
for r in result.web or []
]
async def main() -> None:
model = OpenAIChatCompletionClient(model="gpt-4o-mini")
researcher = AssistantAgent(
name="researcher",
model_client=model,
tools=[scrape_url, web_search],
system_message=(
"You are a web researcher. Use web_search to find candidate sources, "
"then scrape_url to read the most relevant ones. Cite URLs in your answer."
),
)
await Console(
researcher.run_stream(
task="What does Firecrawl's /agent endpoint do? Cite the docs."
)
)
if __name__ == "__main__":
asyncio.run(main())
```
Run it:
```bash theme={null}
python researcher.py
```
## Multi-Agent: Researcher + Writer
Hand Firecrawl output from a researcher agent to a writer agent in a round-robin team.
```python theme={null}
import asyncio
import os
from firecrawl import FirecrawlApp
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.conditions import MaxMessageTermination
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_agentchat.ui import Console
from autogen_ext.models.openai import OpenAIChatCompletionClient
firecrawl = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
def scrape_url(url: str) -> str:
result = firecrawl.scrape(url, formats=["markdown"])
return result.markdown or ""
def web_search(query: str, limit: int = 5) -> list[dict]:
result = firecrawl.search(query, limit=limit)
return [
{"title": r.title, "url": r.url, "snippet": r.description}
for r in result.web or []
]
async def main() -> None:
model = OpenAIChatCompletionClient(model="gpt-4o-mini")
researcher = AssistantAgent(
name="researcher",
model_client=model,
tools=[scrape_url, web_search],
system_message="Gather sources with web_search + scrape_url. Reply with bullet-point findings and URLs.",
)
writer = AssistantAgent(
name="writer",
model_client=model,
system_message="Turn the researcher's findings into a 200-word briefing with inline citations.",
)
team = RoundRobinGroupChat(
[researcher, writer],
termination_condition=MaxMessageTermination(max_messages=6),
)
await Console(team.run_stream(task="Write a briefing on Firecrawl's crawl endpoint."))
if __name__ == "__main__":
asyncio.run(main())
```
## Notes
* Firecrawl's Python SDK is synchronous; AutoGen can call your wrappers inside its event loop without issues for small workloads. For heavy concurrent scraping, move calls off the main thread (see the sketch below) or use [batch scrape](/features/batch-scrape).
* Replace `OpenAIChatCompletionClient` with any AutoGen-supported model client (Azure OpenAI, Anthropic via `autogen-ext`, Ollama, etc.). Firecrawl is model-agnostic.
* See the [AutoGen docs](https://microsoft.github.io/autogen/) for agent patterns beyond round-robin (selector, swarm, nested teams).
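As a sketch of that threading note, an async wrapper (reusing the `firecrawl` client from the examples above) keeps the event loop free; AutoGen function tools can be async callables as well:
```python theme={null}
import asyncio

async def scrape_url_async(url: str) -> str:
    """Sketch: run the synchronous scrape call in a worker thread so it
    doesn't block AutoGen's event loop. Assumes the `firecrawl` client
    defined in the examples above."""
    result = await asyncio.to_thread(firecrawl.scrape, url, formats=["markdown"])
    return result.markdown or ""
```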
# MCP Web Search & Scrape in Claude Code
Source: https://docs.firecrawl.dev/quickstarts/claude-code
Add web scraping and search to Claude Code in 2 minutes
Add web scraping and search capabilities to Claude Code with Firecrawl MCP.
## Quick Setup
### 1. Get Your API Key
Sign up at [firecrawl.dev/app](https://firecrawl.dev/app) and copy your API key.
### 2. Add Firecrawl MCP Server
**Option A: Remote hosted URL (recommended)**
```bash theme={null}
claude mcp add firecrawl --url https://mcp.firecrawl.dev/your-api-key/v2/mcp
```
**Option B: Local (npx)**
```bash theme={null}
claude mcp add firecrawl -e FIRECRAWL_API_KEY=your-api-key -- npx -y firecrawl-mcp
```
Replace `your-api-key` with your actual Firecrawl API key.
Done! You can now search and scrape the web from Claude Code.
## Quick Demo
Try these in Claude Code:
**Search the web:**
```
Search for the latest Next.js 15 features
```
**Scrape a page:**
```
Scrape firecrawl.dev and tell me what it does
```
**Get documentation:**
```
Find and scrape the Stripe API docs for payment intents
```
Claude will automatically use Firecrawl's search and scrape tools to get the information.
# MCP Web Search & Scrape in Codex CLI
Source: https://docs.firecrawl.dev/quickstarts/codex-cli
Add Firecrawl web scraping and search to OpenAI Codex CLI
Add Firecrawl's search, scrape, crawl, and browser tools to [OpenAI Codex CLI](https://github.com/openai/codex) via MCP.
## Quick Setup
### 1. Get Your API Key
Sign up at [firecrawl.dev/app](https://www.firecrawl.dev/app) and copy your API key.
### 2. Add Firecrawl to Codex
Codex reads MCP server config from `~/.codex/config.toml`. Add the Firecrawl server:
```toml theme={null}
[mcp_servers.firecrawl]
command = "npx"
args = ["-y", "firecrawl-mcp"]
[mcp_servers.firecrawl.env]
FIRECRAWL_API_KEY = "fc-YOUR-API-KEY"
```
Replace `fc-YOUR-API-KEY` with your Firecrawl API key.
### 3. Start Codex
```bash theme={null}
codex
```
Codex discovers the Firecrawl tools on launch. Confirm they are loaded:
```bash theme={null}
/mcp
```
You should see `firecrawl` listed with tools like `firecrawl_search`, `firecrawl_scrape`, `firecrawl_crawl`, and `firecrawl_extract`.
## Quick Demo
Try these prompts:
```
Search the web for the latest Next.js App Router release notes and summarize.
```
```
Scrape https://docs.firecrawl.dev and list the top-level sections.
```
```
Crawl https://example.com and save the markdown for every page under /blog.
```
## Remote Hosted URL (no Node.js required)
If you prefer not to run `npx` locally:
```toml theme={null}
[mcp_servers.firecrawl]
url = "https://mcp.firecrawl.dev/fc-YOUR-API-KEY/v2/mcp"
```
## Troubleshooting
* **Codex doesn't see the tools** — run `codex --version` to confirm you're on a version with MCP support, then restart the CLI after editing `config.toml`.
* **`spawn npx ENOENT`** — install Node.js 18+ and ensure `npx` is on your `PATH`, or switch to the remote hosted URL above.
* **401 / invalid key** — regenerate an API key at [firecrawl.dev/app/api-keys](https://www.firecrawl.dev/app/api-keys).
# MCP Web Search & Scrape in Cursor
Source: https://docs.firecrawl.dev/quickstarts/cursor
Add web scraping and search to Cursor in 2 minutes
Add web scraping and search capabilities to Cursor with Firecrawl MCP.
## Quick Setup
### 1. Get Your API Key
Sign up at [firecrawl.dev/app](https://firecrawl.dev/app) and copy your API key.
### 2. Add to Cursor
Open Settings (`Cmd+,`), search for "MCP", and add:
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "your_api_key_here"
}
}
}
}
```
Replace `your_api_key_here` with your actual Firecrawl API key.
### 3. Restart Cursor
Done! You can now search and scrape the web from Cursor.
## Quick Demo
Try these in Cursor Chat (`Cmd+K`):
**Search:**
```
Search for TypeScript best practices 2025
```
**Scrape:**
```
Scrape firecrawl.dev and tell me what it does
```
**Get docs:**
```
Scrape the React hooks documentation and explain useEffect
```
Cursor will automatically use Firecrawl tools.
## Windows Troubleshooting
If you see a `spawn npx ENOENT` or "No server info found" error on Windows, Cursor cannot find `npx` in your PATH. Try one of these fixes:
**Option A: Use the full path to `npx.cmd`**
Run `where npx` in Command Prompt to get the full path, then update your config:
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"command": "C:\\Program Files\\nodejs\\npx.cmd",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "your_api_key_here"
}
}
}
}
```
Replace the `command` path with the output from `where npx`.
**Option B: Use the remote hosted URL (no Node.js required)**
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"url": "https://mcp.firecrawl.dev/YOUR-API-KEY/v2/mcp"
}
}
}
```
Replace `YOUR-API-KEY` with your Firecrawl API key.
# MCP Web Search & Scrape in Gemini CLI
Source: https://docs.firecrawl.dev/quickstarts/gemini-cli
Add Firecrawl web scraping and search to Google Gemini CLI
Add Firecrawl's search, scrape, crawl, and browser tools to [Google Gemini CLI](https://github.com/google-gemini/gemini-cli) via MCP.
## Quick Setup
### 1. Get Your API Key
Sign up at [firecrawl.dev/app](https://www.firecrawl.dev/app) and copy your API key.
### 2. Add Firecrawl to Gemini CLI
Gemini CLI reads MCP config from `~/.gemini/settings.json` (global) or `.gemini/settings.json` in your project. Add:
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "fc-YOUR-API-KEY"
}
}
}
}
```
Replace `fc-YOUR-API-KEY` with your Firecrawl API key.
### 3. Launch Gemini CLI
```bash theme={null}
gemini
```
Confirm the server is loaded:
```
/mcp list
```
You should see `firecrawl` and its tools.
## Quick Demo
```
Use firecrawl to search the web for "Gemini 2.5 context window" and summarize the top 5 results.
```
```
Scrape https://ai.google.dev/gemini-api/docs and outline the sections.
```
```
Crawl https://example.com and extract the product names from /products.
```
## Remote Hosted URL (no Node.js required)
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"url": "https://mcp.firecrawl.dev/fc-YOUR-API-KEY/v2/mcp"
}
}
}
```
## Troubleshooting
* **Tools don't show up** — restart Gemini CLI after editing `settings.json`; MCP servers are loaded at startup.
* **`spawn npx ENOENT`** — install Node.js 18+ or use the remote hosted URL.
* **Rate-limited** — upgrade your Firecrawl plan at [firecrawl.dev/pricing](https://www.firecrawl.dev/pricing).
# Nous Research
Source: https://docs.firecrawl.dev/quickstarts/nous-research
Use Firecrawl as a tool with Nous Research Hermes models.
Pair [Nous Research](https://nousresearch.com) Hermes models with Firecrawl to give Hermes live web search, scrape, and crawl.
Hermes models support OpenAI-compatible tool calls, so Firecrawl plugs in as a function the model can invoke. You can reach Hermes via the Nous Portal API, via [OpenRouter](/quickstarts/openrouter), or through Nous's Forge agent platform.
## Setup
```bash theme={null}
npm install @mendable/firecrawl-js openai zod
```
```bash theme={null}
export FIRECRAWL_API_KEY=fc-YOUR-API-KEY
export NOUS_API_KEY=YOUR-NOUS-PORTAL-KEY
```
## Hermes + Firecrawl Tool Call
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import OpenAI from 'openai';
import { z } from 'zod';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
// Nous Portal is OpenAI-compatible
const nous = new OpenAI({
apiKey: process.env.NOUS_API_KEY,
baseURL: 'https://inference-api.nousresearch.com/v1',
});
const SearchArgs = z.object({
query: z.string().describe('The web search query'),
limit: z.number().int().min(1).max(10).default(5),
});
const ScrapeArgs = z.object({
url: z.string().describe('The URL to scrape'),
});
const tools = [
{
type: 'function' as const,
function: {
name: 'web_search',
description: 'Search the web with Firecrawl and return top results.',
parameters: z.toJSONSchema(SearchArgs),
},
},
{
type: 'function' as const,
function: {
name: 'scrape_website',
description: 'Scrape the markdown content of a URL.',
parameters: z.toJSONSchema(ScrapeArgs),
},
},
];
const response = await nous.chat.completions.create({
model: 'Hermes-4-405B',
tools,
messages: [
{
role: 'user',
content: 'Research Firecrawl\'s /agent endpoint and cite the docs.',
},
],
});
for (const call of response.choices[0]?.message.tool_calls ?? []) {
if (call.function.name === 'web_search') {
const { query, limit } = SearchArgs.parse(JSON.parse(call.function.arguments));
const results = await firecrawl.search(query, { limit });
console.log(results.web);
}
if (call.function.name === 'scrape_website') {
const { url } = ScrapeArgs.parse(JSON.parse(call.function.arguments));
const page = await firecrawl.scrape(url, { formats: ['markdown'] });
console.log(page.markdown);
}
}
```
## Hermes via OpenRouter
Prefer a single gateway for multiple models? Route Hermes through OpenRouter:
```typescript theme={null}
const client = new OpenAI({
apiKey: process.env.OPENROUTER_API_KEY,
baseURL: 'https://openrouter.ai/api/v1',
});
await client.chat.completions.create({
model: 'nousresearch/hermes-4-405b',
// ...same tools, same messages
});
```
See the [OpenRouter guide](/quickstarts/openrouter) for the full pattern.
## Notes
* Hermes is strong at structured tool output — pair with Firecrawl's [JSON format](/features/llm-extract) to chain scrape → extract cleanly.
* For long-running agent loops, stream tool calls and use Firecrawl's [async crawl](/features/crawl) so the model isn't blocked on large scrapes.
* Verify the exact Nous Portal base URL and model slug at [portal.nousresearch.com](https://portal.nousresearch.com) — model names update as new Hermes generations ship.
# OpenClaw
Source: https://docs.firecrawl.dev/quickstarts/openclaw
Use Firecrawl with OpenClaw to give your agents web scraping, search, and browser automation capabilities.
Integrate Firecrawl with OpenClaw to give your agents the ability to scrape, search, crawl, extract, and interact with the web — all through the [Firecrawl CLI](/sdks/cli).
## Why Firecrawl + OpenClaw
* **No local browser needed** — every session runs in a remote, isolated sandbox. No Chromium installs, no driver conflicts, no RAM pressure on your machine.
* **Real parallelism** — run many browser sessions at once without local resource fights. Agents can browse in batches across multiple sites simultaneously.
* **Secure by default** — navigation, DOM evaluation, and extraction all happen inside disposable sandboxes, not on your workstation.
* **Better token economics** — agents get back clean artifacts (snapshots, extracted fields) instead of hauling giant DOMs and driver logs into the context window.
* **Full web toolkit** — scrape, search, and browser automation all through a single CLI that your agent already knows how to use.
## Setup
Tell your agent to install the Firecrawl CLI, authenticate, and initialize the skill with this command:
```bash theme={null}
npx -y firecrawl-cli init --browser --all
```
* `--all` installs the Firecrawl skill to every detected AI coding agent
* `--browser` opens the browser for Firecrawl authentication automatically
Or install everything separately:
```bash theme={null}
npm install -g firecrawl-cli
firecrawl init skills
firecrawl login --browser
# Or, skip the browser and provide your API key directly:
export FIRECRAWL_API_KEY="fc-YOUR-API-KEY"
```
Verify everything is set up correctly:
```bash theme={null}
firecrawl --status
```
Once the skill is installed, your OpenClaw agent automatically discovers and uses Firecrawl commands — no extra configuration needed.
## Scrape
Scrape a single page and extract its content:
```bash theme={null}
firecrawl https://example.com --only-main-content
```
Get specific formats:
```bash theme={null}
firecrawl https://example.com --format markdown,links --pretty
```
## Search
Search the web and optionally scrape the results:
```bash theme={null}
firecrawl search "latest AI funding rounds 2025" --limit 10
# Search and scrape the results
firecrawl search "OpenClaw documentation" --scrape --scrape-formats markdown
```
## Browser
Launch a remote browser session for interactive automation. Each session runs in an isolated sandbox — no local Chromium install required. `agent-browser` is pre-installed with 40+ commands.
```bash theme={null}
# Shorthand: auto-launches a session if none is active
firecrawl browser "open https://news.ycombinator.com"
firecrawl browser "snapshot"
firecrawl browser "scrape"
firecrawl browser close
```
Interact with page elements using refs from the snapshot:
```bash theme={null}
firecrawl browser "open https://example.com"
firecrawl browser "snapshot"
# snapshot returns @ref IDs — use them to interact
firecrawl browser "click @e5"
firecrawl browser "fill @e3 'search query'"
firecrawl browser "scrape"
firecrawl browser close
```
The shorthand form (`firecrawl browser "..."`) sends commands to `agent-browser` automatically and auto-launches a sandbox session if there isn't one active. Your agent issues intent-level actions (`open`, `click`, `fill`, `snapshot`, `scrape`) instead of writing Playwright code.
## Example: tell your agent
Here are some prompts you can give your OpenClaw agent:
* *"Use Firecrawl to scrape [https://example.com](https://example.com) and summarize the main content."*
* *"Search for the latest OpenAI news and give me a summary of the top 5 results."*
* *"Use Firecrawl Browser to open Hacker News, get the top 5 stories, and the first 10 comments on each."*
* *"Crawl the docs at [https://docs.firecrawl.dev](https://docs.firecrawl.dev) and save them to a file."*
## Further reading
* [Firecrawl CLI reference](/sdks/cli)
* [Browser Sandbox docs](/features/browser)
* [Agent docs](/features/agent)
# MCP Web Search & Scrape in OpenCode
Source: https://docs.firecrawl.dev/quickstarts/opencode
Add Firecrawl web scraping and search to OpenCode
Add Firecrawl's search, scrape, crawl, and browser tools to [OpenCode](https://opencode.ai) via MCP.
## Quick Setup
### 1. Get Your API Key
Sign up at [firecrawl.dev/app](https://www.firecrawl.dev/app) and copy your API key.
### 2. Add Firecrawl to OpenCode
OpenCode reads config from `~/.config/opencode/config.json` (global) or `./opencode.json` in your project. Add:
```json theme={null}
{
"$schema": "https://opencode.ai/config.json",
"mcp": {
"firecrawl": {
"type": "local",
"command": ["npx", "-y", "firecrawl-mcp"],
"environment": {
"FIRECRAWL_API_KEY": "fc-YOUR-API-KEY"
}
}
}
}
```
Replace `fc-YOUR-API-KEY` with your Firecrawl API key.
### 3. Start OpenCode
```bash theme={null}
opencode
```
OpenCode loads MCP servers on startup. Confirm Firecrawl is attached:
```
/mcp
```
## Quick Demo
```
Search the web for "Bun 2.0 changelog" and summarize the top results.
```
```
Scrape https://docs.firecrawl.dev/introduction and list the code examples.
```
```
Crawl https://example.com/blog and save each post as markdown.
```
## Remote Hosted URL (no Node.js required)
```json theme={null}
{
"mcp": {
"firecrawl": {
"type": "remote",
"url": "https://mcp.firecrawl.dev/fc-YOUR-API-KEY/v2/mcp"
}
}
}
```
## Troubleshooting
* **Server not attached** — run `opencode doctor` to inspect MCP load errors.
* **Permission denied on `npx`** — install Node.js 18+ and ensure your shell picks up the install (`which npx`).
# OpenRouter
Source: https://docs.firecrawl.dev/quickstarts/openrouter
Use Firecrawl as a tool with any model served by OpenRouter.
Pair [OpenRouter](https://openrouter.ai) — a unified API for hundreds of LLMs — with Firecrawl to give any model live web search, scrape, and crawl.
OpenRouter's API is OpenAI-compatible, so you can use the OpenAI SDK pointed at OpenRouter's base URL plus Firecrawl's Python or JavaScript SDK as the tool.
## Setup
```bash theme={null}
npm install @mendable/firecrawl-js openai zod
```
```bash theme={null}
export FIRECRAWL_API_KEY=fc-YOUR-API-KEY
export OPENROUTER_API_KEY=sk-or-YOUR-OPENROUTER-KEY
```
## Scrape + Summarize with Any OpenRouter Model
This scrapes a page with Firecrawl and summarizes it with whatever model you pick from OpenRouter — here, Claude Haiku 4.5.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import OpenAI from 'openai';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const openrouter = new OpenAI({
apiKey: process.env.OPENROUTER_API_KEY,
baseURL: 'https://openrouter.ai/api/v1',
});
const scraped = await firecrawl.scrape('https://docs.firecrawl.dev', {
formats: ['markdown'],
});
const completion = await openrouter.chat.completions.create({
model: 'anthropic/claude-haiku-4.5',
messages: [
{ role: 'user', content: `Summarize in 5 bullets: ${scraped.markdown}` },
],
});
console.log(completion.choices[0]?.message.content);
```
Switch the `model` string to any [OpenRouter-supported model](https://openrouter.ai/models) — `openai/gpt-5`, `google/gemini-2.5-pro`, `meta-llama/llama-4-maverick`, etc.
## Tool Calling: Model Decides When to Scrape
OpenRouter supports OpenAI-style tool calls, so Firecrawl plugs in as a function the model can invoke.
```typescript theme={null}
import FirecrawlApp from '@mendable/firecrawl-js';
import OpenAI from 'openai';
import { z } from 'zod';
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });
const openrouter = new OpenAI({
apiKey: process.env.OPENROUTER_API_KEY,
baseURL: 'https://openrouter.ai/api/v1',
});
const ScrapeArgs = z.object({
url: z.string().describe('The URL to scrape'),
});
const tools = [
{
type: 'function' as const,
function: {
name: 'scrape_website',
description: 'Scrape the markdown content of any URL via Firecrawl',
parameters: z.toJSONSchema(ScrapeArgs),
},
},
];
const response = await openrouter.chat.completions.create({
model: 'anthropic/claude-haiku-4.5',
tools,
messages: [
{
role: 'user',
content: 'What is Firecrawl? Visit firecrawl.dev and tell me.',
},
],
});
const call = response.choices[0]?.message.tool_calls?.[0];
if (call?.function.name === 'scrape_website') {
const { url } = ScrapeArgs.parse(JSON.parse(call.function.arguments));
const page = await firecrawl.scrape(url, { formats: ['markdown'] });
console.log(page.markdown);
}
```
## Python
```python theme={null}
import os
from firecrawl import FirecrawlApp
from openai import OpenAI
firecrawl = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
openrouter = OpenAI(
api_key=os.environ["OPENROUTER_API_KEY"],
base_url="https://openrouter.ai/api/v1",
)
page = firecrawl.scrape("https://docs.firecrawl.dev", formats=["markdown"])
completion = openrouter.chat.completions.create(
model="anthropic/claude-haiku-4.5",
messages=[{"role": "user", "content": f"Summarize: {page.markdown}"}],
)
print(completion.choices[0].message.content)
```
## Notes
* Firecrawl is fully model-agnostic — pick any OpenRouter model without changing the scrape code.
* Many top OpenRouter apps (Cline, Roo Code, Kilo, Cursor, Continue) are themselves agent harnesses that can use Firecrawl MCP — see [MCP Server](/mcp-server) to wire Firecrawl into those directly.
* For large jobs, use [batch scrape](/features/batch-scrape) to stay within LLM context budgets.
# MCP Web Search & Scrape in Windsurf
Source: https://docs.firecrawl.dev/quickstarts/windsurf
Add web scraping and search to Windsurf in 2 minutes
Add web scraping and search capabilities to Windsurf with Firecrawl MCP.
## Quick Setup
### 1. Get Your API Key
Sign up at [firecrawl.dev/app](https://firecrawl.dev/app) and copy your API key.
### 2. Add to Windsurf
Add this to your `./codeium/windsurf/model_config.json`:
```json theme={null}
{
"mcpServers": {
"firecrawl": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "YOUR_API_KEY"
}
}
}
}
```
Replace `YOUR_API_KEY` with your actual Firecrawl API key.
### 3. Restart Windsurf
Done! Windsurf can now search and scrape the web.
## Quick Demo
Try these in Windsurf:
**Search:**
```
Search for the latest Tailwind CSS features
```
**Scrape:**
```
Scrape firecrawl.dev and explain what it does
```
**Get docs:**
```
Find and scrape the Supabase authentication documentation
```
Windsurf's AI agents will automatically use Firecrawl tools.
# Rate Limits
Source: https://docs.firecrawl.dev/rate-limits
Rate limits for different pricing plans and API requests
When you exceed a rate or concurrency limit, the API returns a `429` response. See [Errors](/api-reference/errors) for the full error catalog and a retry-with-backoff snippet.
## Billing Model
Firecrawl uses subscription-based monthly plans. There is no pure pay-as-you-go model, but the **auto-recharge feature** provides flexible scaling. Once you subscribe to a plan, you can automatically purchase additional credits when you dip below a threshold. Larger auto-recharge packs offer better rates. To test before committing to a larger plan, start with the Free or Hobby tier.
Plan downgrades take effect at the next renewal; no credit is issued for unused time.
## Concurrent Browser Limits
Concurrent browsers control how many pages Firecrawl can process for you in parallel. Your plan sets the ceiling; any jobs beyond it wait in a queue until a browser frees up.
Time spent in the queue counts against the request's [`timeout`](/advanced-scraping-guide#timing-and-cache) parameter, so you can set a lower timeout to fail fast instead of waiting. To see current availability before sending work, call the [Queue Status](/api-reference/endpoint/queue-status) endpoint. Jobs that are waiting in your concurrency queue will time out after a maximum of 48 hours.
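For example, to fail fast instead of waiting in the queue, set a lower `timeout` on the request. A minimal sketch with the Python SDK, assuming the `timeout` scrape option (in milliseconds) described in the advanced scraping guide:
```python theme={null}
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# The timeout budget also covers time spent queued for a free browser,
# so a 15-second budget fails fast instead of waiting for capacity.
doc = firecrawl.scrape("https://example.com", timeout=15000)
```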
### Current Plans
| Plan               | Concurrent Browsers | Max Queued Jobs |
| ------------------ | ------------------- | --------------- |
| Free               | 2                   | 50,000          |
| Hobby              | 5                   | 50,000          |
| Standard           | 50                  | 100,000         |
| Growth             | 100                 | 200,000         |
| Scale / Enterprise | 150+                | 300,000+        |
Each team has a maximum number of jobs that can be waiting in the concurrency queue. If you exceed this limit, new jobs will be rejected with a `429` status code until existing jobs complete. For larger plans with custom concurrency limits, the max queued jobs is 2,000 times your concurrency limit, capped at 2,000,000.
If you require higher concurrency limits, [contact us about enterprise plans](https://firecrawl.dev/enterprise).
### Extract Plans (Legacy)
| Plan     | Concurrent Browsers | Max Queued Jobs |
| -------- | ------------------- | --------------- |
| Free     | 2                   | 50,000          |
| Starter  | 50                  | 100,000         |
| Explorer | 100                 | 200,000         |
| Pro      | 200                 | 400,000         |
## API Rate Limits
Rate limits are measured in requests per minute and are primarily in place to prevent abuse. When configured correctly, your real bottleneck will be concurrent browsers. Rate limits are applied per team, so all API keys on the same team share the same rate limit counters.
### Current Plans
| Plan     | /scrape | /map | /crawl | /search | /agent | /crawl/status | /agent/status |
| -------- | ------- | ---- | ------ | ------- | ------ | ------------- | ------------- |
| Free     | 10      | 10   | 1      | 5       | 10     | 1500          | 500           |
| Hobby    | 100     | 100  | 15     | 50      | 100    | 1500          | 25000         |
| Standard | 500     | 500  | 50     | 250     | 500    | 1500          | 25000         |
| Growth   | 5000    | 5000 | 250    | 2500    | 1000   | 1500          | 25000         |
| Scale    | 7500    | 7500 | 750    | 7500    | 1000   | 25000         | 25000         |
These rate limits are enforced to ensure fair usage and availability of the API for all users. If you require higher limits, please contact us at [help@firecrawl.com](mailto:help@firecrawl.com) to discuss custom plans.
### Extract Endpoints
The extract endpoints share the corresponding `/agent` rate limits.
### Batch Scrape Endpoints
The batch scrape endpoints share the corresponding `/crawl` rate limits.
### Browser Sandbox
The browser sandbox endpoints have per-plan rate limits that scale with your subscription:
| Plan     | /browser | /browser/{id}/execute |
| -------- | -------- | --------------------- |
| Free     | 2        | 10                    |
| Hobby    | 20       | 100                   |
| Standard | 100      | 500                   |
| Growth   | 1,000    | 5,000                 |
| Scale    | 1,500    | 7,500                 |
In addition, each team's plan determines how many browser sessions can be active simultaneously (see Concurrent Browser Limits above). If you exceed this limit, new session requests will return a `429` status code until existing sessions are destroyed.
### FIRE-1 Agent
Requests involving the FIRE-1 agent have separate rate limits, counted independently for each endpoint:
| Endpoint | Rate Limit (requests/min) |
| -------- | ------------------------- |
| /scrape  | 10                        |
| /extract | 10                        |
### Extract Plans (Legacy)
| Plan     | /extract (requests/min) | /extract/status (requests/min) |
| -------- | ----------------------- | ------------------------------ |
| Starter  | 100                     | 25000                          |
| Explorer | 500                     | 25000                          |
| Pro      | 1000                    | 25000                          |
# Skill + CLI
Source: https://docs.firecrawl.dev/sdks/cli
Firecrawl Skill is an easy way for AI agents such as Claude Code, Antigravity, and OpenCode to use Firecrawl through the CLI.
Search, scrape, and interact with the web directly from the terminal. The Firecrawl CLI works standalone or as a skill that AI coding agents like Claude Code, Antigravity, and OpenCode can discover and use automatically.
## Installation
If you are using an AI agent like Claude Code, you can install the Firecrawl skill below and the agent will set it up for you.
```bash theme={null}
npx -y firecrawl-cli@latest init --all --browser
```
* `--all` installs the Firecrawl skill to every detected AI coding agent
* `--browser` opens the browser for Firecrawl authentication automatically
After installing the skill, restart your agent for it to discover the new skill.
You can also manually install the Firecrawl CLI globally using npm:
```bash CLI theme={null}
# Install globally with npm
npm install -g firecrawl-cli
```
## Authentication
Before using the CLI, you need to authenticate with your Firecrawl API key.
### Login
```bash CLI theme={null}
# Interactive login (opens browser or prompts for API key)
firecrawl login
# Login with browser authentication (recommended for agents)
firecrawl login --browser
# Login with API key directly
firecrawl login --api-key fc-YOUR-API-KEY
# Or set via environment variable
export FIRECRAWL_API_KEY=fc-YOUR-API-KEY
```
### View Configuration
```bash CLI theme={null}
# View current configuration and authentication status
firecrawl view-config
```
### Logout
```bash CLI theme={null}
# Clear stored credentials
firecrawl logout
```
### Self-Hosted / Local Development
For self-hosted Firecrawl instances or local development, use the `--api-url` option:
```bash CLI theme={null}
# Use a local Firecrawl instance (no API key required)
firecrawl --api-url http://localhost:3002 scrape https://example.com
# Or set via environment variable
export FIRECRAWL_API_URL=http://localhost:3002
firecrawl scrape https://example.com
# Configure and persist the custom API URL
firecrawl config --api-url http://localhost:3002
```
When using a custom API URL (anything other than `https://api.firecrawl.dev`), API key authentication is automatically skipped, allowing you to use local instances without an API key.
### Check Status
Verify installation, authentication, and view rate limits:
```bash CLI theme={null}
firecrawl --status
```
Output when ready:
```
🔥 firecrawl cli v1.1.1
● Authenticated via FIRECRAWL_API_KEY
Concurrency: 0/100 jobs (parallel scrape limit)
Credits: 500,000 remaining
```
* **Concurrency**: the maximum number of parallel jobs your plan allows. Run parallel operations close to this limit, but not above it (see the sketch after this list).
* **Credits**: Remaining API credits. Each scrape/crawl consumes credits.
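To keep bulk work under that limit, cap how many CLI invocations run at once. A minimal sketch using `xargs`; `urls.txt` is a hypothetical file with one URL per line, and `50` stands in for a value safely below your reported concurrency:
```bash CLI theme={null}
# Run at most 50 scrapes at a time, staying below the plan's parallel
# scrape limit. Output from concurrent jobs may interleave on stdout,
# so redirect per-URL output if you need clean files.
xargs -P 50 -I {} firecrawl scrape {} --only-main-content < urls.txt
```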
## Commands
### Scrape
Scrape a single URL and extract its content in various formats.
Use `--only-main-content` to get clean output without navigation, footers, and ads. This is recommended for most use cases where you want just the article or main page content.
```bash CLI theme={null}
# Scrape a URL (default: markdown output)
firecrawl https://example.com
# Or use the explicit scrape command
firecrawl scrape https://example.com
# Recommended: use --only-main-content for clean output without nav/footer
firecrawl https://example.com --only-main-content
```
#### Output Formats
```bash CLI theme={null}
# Get HTML output
firecrawl https://example.com --html
# Multiple formats (returns JSON)
firecrawl https://example.com --format markdown,links
# Get images from a page
firecrawl https://example.com --format images
# Get a summary of the page content
firecrawl https://example.com --format summary
# Track changes on a page (changeTracking requires markdown in the formats list)
firecrawl https://example.com --format markdown,changeTracking
# Available formats: markdown, html, rawHtml, links, screenshot, json, images, summary, changeTracking, attributes, branding
```
#### Scrape Options
```bash CLI theme={null}
# Extract only main content (removes navs, footers)
firecrawl https://example.com --only-main-content
# Wait for JavaScript rendering
firecrawl https://example.com --wait-for 3000
# Take a screenshot
firecrawl https://example.com --screenshot
# Include/exclude specific HTML tags
firecrawl https://example.com --include-tags article,main
firecrawl https://example.com --exclude-tags nav,footer
# Save output to file
firecrawl https://example.com -o output.md
# Pretty print JSON output
firecrawl https://example.com --format markdown,links --pretty
# Force JSON output even with single format
firecrawl https://example.com --json
# Show request timing information
firecrawl https://example.com --timing
```
**Available Options:**
| Option                  | Short | Description                                                                                                                                                      |
| ----------------------- | ----- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--url <url>`           | `-u`  | URL to scrape (alternative to positional argument)                                                                                                               |
| `--format <formats>`    | `-f`  | Output formats (comma-separated): `markdown`, `html`, `rawHtml`, `links`, `screenshot`, `json`, `images`, `summary`, `changeTracking`, `attributes`, `branding`  |
| `--html`                | `-H`  | Shortcut for `--format html`                                                                                                                                     |
| `--only-main-content`   |       | Extract only main content                                                                                                                                        |
| `--wait-for <ms>`       |       | Wait time in milliseconds for JS rendering                                                                                                                       |
| `--screenshot`          |       | Take a screenshot                                                                                                                                                |
| `--include-tags <tags>` |       | HTML tags to include (comma-separated)                                                                                                                           |
| `--exclude-tags <tags>` |       | HTML tags to exclude (comma-separated)                                                                                                                           |
| `--output <file>`       | `-o`  | Save output to file                                                                                                                                              |
| `--json`                |       | Force JSON output even with single format                                                                                                                        |
| `--pretty`              |       | Pretty print JSON output                                                                                                                                         |
| `--timing`              |       | Show request timing and other useful information                                                                                                                 |
***
### Search
Search the web and optionally scrape the results.
```bash CLI theme={null}
# Search the web
firecrawl search "web scraping tutorials"
# Limit results
firecrawl search "AI news" --limit 10
# Pretty print results
firecrawl search "machine learning" --pretty
```
#### Search Options
```bash CLI theme={null}
# Search specific sources
firecrawl search "AI" --sources web,news,images
# Search with category filters
firecrawl search "react hooks" --categories github
firecrawl search "machine learning" --categories research,pdf
# Time-based filtering
firecrawl search "tech news" --tbs qdr:h # Last hour
firecrawl search "tech news" --tbs qdr:d # Last day
firecrawl search "tech news" --tbs qdr:w # Last week
firecrawl search "tech news" --tbs qdr:m # Last month
firecrawl search "tech news" --tbs qdr:y # Last year
# Location-based search
firecrawl search "restaurants" --location "Berlin,Germany" --country DE
# Search and scrape results
firecrawl search "documentation" --scrape --scrape-formats markdown
# Save to file
firecrawl search "firecrawl" --pretty -o results.json
```
**Available Options:**
| Option                       | Description                                                                                  |
| ---------------------------- | -------------------------------------------------------------------------------------------- |
| `--limit <number>`           | Maximum results (default: 5, max: 100)                                                       |
| `--sources <sources>`        | Sources to search: `web`, `images`, `news` (comma-separated)                                 |
| `--categories <categories>`  | Filter by category: `github`, `research`, `pdf` (comma-separated)                            |
| `--tbs <filter>`             | Time filter: `qdr:h` (hour), `qdr:d` (day), `qdr:w` (week), `qdr:m` (month), `qdr:y` (year)  |
| `--location <location>`      | Geo-targeting (e.g., "Berlin,Germany")                                                       |
| `--country <code>`           | ISO country code (default: US)                                                               |
| `--timeout <ms>`             | Timeout in milliseconds (default: 60000)                                                     |
| `--ignore-invalid-urls`      | Exclude URLs invalid for other Firecrawl endpoints                                           |
| `--scrape`                   | Scrape search results                                                                        |
| `--scrape-formats <formats>` | Formats for scraped content (default: markdown)                                              |
| `--only-main-content`        | Include only main content when scraping (default: true)                                      |
| `--json`                     | Output as JSON                                                                               |
| `--output <file>`            | Save output to file                                                                          |
| `--pretty`                   | Pretty print JSON output                                                                     |
***
### Map
Discover all URLs on a website quickly.
```bash CLI theme={null}
# Discover all URLs on a website
firecrawl map https://example.com
# Output as JSON
firecrawl map https://example.com --json
# Limit number of URLs
firecrawl map https://example.com --limit 500
```
#### Map Options
```bash CLI theme={null}
# Filter URLs by search query
firecrawl map https://example.com --search "blog"
# Include subdomains
firecrawl map https://example.com --include-subdomains
# Control sitemap usage
firecrawl map https://example.com --sitemap include # Use sitemap
firecrawl map https://example.com --sitemap skip # Skip sitemap
firecrawl map https://example.com --sitemap only # Only use sitemap
# Ignore query parameters (dedupe URLs)
firecrawl map https://example.com --ignore-query-parameters
# Wait for map to complete with timeout
firecrawl map https://example.com --wait --timeout 60
# Save to file
firecrawl map https://example.com -o urls.txt
firecrawl map https://example.com --json --pretty -o urls.json
```
**Available Options:**
| Option                      | Description                                                 |
| --------------------------- | ----------------------------------------------------------- |
| `--url <url>`               | URL to map (alternative to positional argument)             |
| `--limit <number>`          | Maximum URLs to discover                                    |
| `--search <query>`          | Filter URLs by search query                                 |
| `--sitemap <mode>`          | Sitemap handling: `include`, `skip`, `only`                 |
| `--include-subdomains`      | Include subdomains                                          |
| `--ignore-query-parameters` | Treat URLs that differ only in query parameters as the same |
| `--wait`                    | Wait for map to complete                                    |
| `--timeout <seconds>`       | Timeout in seconds                                          |
| `--json`                    | Output as JSON                                              |
| `--output <file>`           | Save output to file                                         |
| `--pretty`                  | Pretty print JSON output                                    |
***
### Browser
Have your agents interact with the web using a secure browser sandbox.
Launch cloud browser sessions and execute Python, JavaScript, or bash code remotely. Each session runs a full Chromium instance — no local browser install required. Code runs server-side with a pre-configured [Playwright](https://playwright.dev/) `page` object ready to use.
```bash CLI theme={null}
# Launch a cloud browser session
firecrawl browser launch-session
# Execute agent-browser commands (default - "agent-browser" is auto-prefixed)
firecrawl browser execute "open https://example.com"
firecrawl browser execute "snapshot"
firecrawl browser execute "click @e5"
firecrawl browser execute "scrape"
# Execute Playwright Python code
firecrawl browser execute --python 'await page.goto("https://example.com")
print(await page.title())'
# Execute Playwright JavaScript code
firecrawl browser execute --node 'await page.goto("https://example.com"); console.log(await page.title());'
# List all sessions (or: list active / list destroyed)
firecrawl browser list
# Close the active session
firecrawl browser close
```
#### Browser Options
```bash CLI theme={null}
# Launch with custom TTL (10 minutes) and live view
firecrawl browser launch-session --ttl 600 --stream
# Launch with inactivity timeout
firecrawl browser launch-session --ttl 120 --ttl-inactivity 60
# agent-browser commands (default - "agent-browser" is auto-prefixed)
firecrawl browser execute "open https://news.ycombinator.com"
firecrawl browser execute "snapshot"
firecrawl browser execute "click @e3"
firecrawl browser execute "scrape"
# Playwright Python - navigate, interact, extract
firecrawl browser execute --python '
await page.goto("https://news.ycombinator.com")
items = await page.query_selector_all(".titleline > a")
for item in items[:5]:
print(await item.text_content())
'
# Playwright JavaScript - same page object
firecrawl browser execute --node '
await page.goto("https://example.com");
const title = await page.title();
console.log(title);
'
# Explicit bash mode - runs in the sandbox
firecrawl browser execute --bash "agent-browser snapshot"
# Target a specific session
firecrawl browser execute --session <id> --python 'print(await page.title())'
# Save output to file
firecrawl browser execute "scrape" -o result.txt
# Close a specific session
firecrawl browser close --session <id>
# List sessions (all / active / destroyed)
firecrawl browser list
firecrawl browser list active --json
```
**Subcommands:**
| Subcommand | Description |
| ---------------- | ----------------------------------------------------------------------------------- |
| `launch-session` | Launch a new cloud browser session (returns session ID, CDP URL, and live view URL) |
| `execute ` | Execute Playwright Python/JS code or bash commands in a session |
| `list [status]` | List browser sessions (filter by `active` or `destroyed`) |
| `close` | Close a browser session |
**Execute Options:**
| Option | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `--bash` | Execute bash commands remotely in the sandbox (default). [agent-browser](https://github.com/vercel-labs/agent-browser) (40+ commands) is pre-installed and auto-prefixed. `CDP_URL` is auto-injected so agent-browser connects to your session automatically. Best approach for AI agents. |
| `--python` | Execute as Playwright Python code. A Playwright `page` object is available — use `await page.goto()`, `await page.title()`, etc. |
| `--node` | Execute as Playwright JavaScript code. Same `page` object available. |
| `--session <id>` | Target a specific session (default: active session)                                                                                                                                                                                                                                          |
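Because `CDP_URL` is injected as an environment variable, you can inspect it from bash mode; a minimal check (assumes the variable is visible to shell commands in the sandbox):
```bash CLI theme={null}
# Print the CDP endpoint injected for the active session
firecrawl browser execute --bash 'echo "$CDP_URL"'
```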
**Launch Options:**
| Option | Description |
| ---------------------------- | ------------------------------------------------------------------- |
| `--ttl <seconds>`            | Total session TTL in seconds (default: 600, range: 30–3600)          |
| `--ttl-inactivity <seconds>` | Auto-close after inactivity in seconds (range: 10–3600)              |
| `--profile <name>`           | Name for a profile (saves and reuses browser state across sessions)  |
| `--no-save-changes` | Load existing profile data without saving changes back |
| `--stream` | Enable live view streaming |
**Common Options:**
| Option | Description |
| ----------------- | --------------------- |
| `--output <file>` | Save output to file   |
| `--json` | Output as JSON format |
***
### Crawl
Crawl an entire website starting from a URL.
```bash CLI theme={null}
# Start a crawl (returns job ID immediately)
firecrawl crawl https://example.com
# Wait for crawl to complete
firecrawl crawl https://example.com --wait
# Wait with progress indicator
firecrawl crawl https://example.com --wait --progress
```
#### Check Crawl Status
```bash CLI theme={null}
# Check crawl status using job ID
firecrawl crawl <job-id>
# Example with a real job ID
firecrawl crawl 550e8400-e29b-41d4-a716-446655440000
```
#### Crawl Options
```bash CLI theme={null}
# Limit crawl depth and pages
firecrawl crawl https://example.com --limit 100 --max-depth 3 --wait
# Include only specific paths
firecrawl crawl https://example.com --include-paths /blog,/docs --wait
# Exclude specific paths
firecrawl crawl https://example.com --exclude-paths /admin,/login --wait
# Include subdomains
firecrawl crawl https://example.com --allow-subdomains --wait
# Crawl entire domain
firecrawl crawl https://example.com --crawl-entire-domain --wait
# Rate limiting
firecrawl crawl https://example.com --delay 1000 --max-concurrency 2 --wait
# Custom polling interval and timeout
firecrawl crawl https://example.com --wait --poll-interval 10 --timeout 300
# Save results to file
firecrawl crawl https://example.com --wait --pretty -o results.json
```
**Available Options:**
| Option                      | Description                                                 |
| --------------------------- | ----------------------------------------------------------- |
| `--url <url>`               | URL to crawl (alternative to positional argument)           |
| `--wait`                    | Wait for crawl to complete                                  |
| `--progress`                | Show progress indicator while waiting                       |
| `--poll-interval <seconds>` | Polling interval in seconds (default: 5)                    |
| `--timeout <seconds>`       | Timeout when waiting                                        |
| `--status`                  | Check status of existing crawl job                          |
| `--limit <number>`          | Maximum pages to crawl                                      |
| `--max-depth <number>`      | Maximum crawl depth                                         |
| `--include-paths <paths>`   | Paths to include (comma-separated)                          |
| `--exclude-paths <paths>`   | Paths to exclude (comma-separated)                          |
| `--sitemap <mode>`          | Sitemap handling: `include`, `skip`, `only`                 |
| `--allow-subdomains`        | Include subdomains                                          |
| `--allow-external-links`    | Follow external links                                       |
| `--crawl-entire-domain`     | Crawl entire domain                                         |
| `--ignore-query-parameters` | Treat URLs that differ only in query parameters as the same |
| `--delay <number>`          | Delay between requests                                      |
| `--max-concurrency <number>`| Maximum concurrent requests                                 |
| `--output <file>`           | Save output to file                                         |
| `--pretty`                  | Pretty print JSON output                                    |
***
### Agent
Search and gather data from the web using natural language prompts.
```bash CLI theme={null}
# Basic usage - URLs are optional
firecrawl agent "Find the top 5 AI startups and their funding amounts" --wait
# Focus on specific URLs
firecrawl agent "Compare pricing plans" --urls https://slack.com/pricing,https://teams.microsoft.com/pricing --wait
# Use a schema for structured output
firecrawl agent "Get company information" --urls https://example.com --schema '{"name": "string", "founded": "number"}' --wait
# Use schema from a file
firecrawl agent "Get product details" --urls https://example.com --schema-file schema.json --wait
```
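The file passed to `--schema-file` holds the same JSON you would pass inline with `--schema`. A sketch with a hypothetical `schema.json` (the field names are illustrative):
```bash CLI theme={null}
# Write a hypothetical schema.json, then point --schema-file at it
cat > schema.json <<'EOF'
{
  "name": "string",
  "price": "number"
}
EOF
firecrawl agent "Get product details" --urls https://example.com --schema-file schema.json --wait
```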
#### Agent Options
```bash CLI theme={null}
# Use Spark 1 Pro for higher accuracy
firecrawl agent "Competitive analysis across multiple domains" --model spark-1-pro --wait
# Set max credits to limit costs
firecrawl agent "Gather contact information from company websites" --max-credits 100 --wait
# Check status of an existing job
firecrawl agent --status <job-id>
# Custom polling interval and timeout
firecrawl agent "Summarize recent blog posts" --wait --poll-interval 10 --timeout 300
# Save output to file
firecrawl agent "Find pricing information" --urls https://example.com --wait -o pricing.json --pretty
```
**Available Options:**
| Option                      | Description                                                                             |
| --------------------------- | ---------------------------------------------------------------------------------------- |
| `--urls <urls>`             | Optional list of URLs to focus the agent on (comma-separated)                            |
| `--model <model>`           | Model to use: `spark-1-mini` (default, 60% cheaper) or `spark-1-pro` (higher accuracy)   |
| `--schema <json>`           | JSON schema for structured output (inline JSON string)                                   |
| `--schema-file <path>`      | Path to JSON schema file for structured output                                           |
| `--max-credits <number>`    | Maximum credits to spend (job fails if limit reached)                                    |
| `--status`                  | Check status of existing agent job                                                       |
| `--wait`                    | Wait for agent to complete before returning results                                      |
| `--poll-interval <seconds>` | Polling interval when waiting (default: 5)                                               |
| `--timeout <seconds>`       | Timeout when waiting (default: no timeout)                                               |
| `--output <file>`           | Save output to file                                                                      |
| `--json`                    | Output as JSON format                                                                    |
***
### Credit Usage
Check your team's credit balance and usage.
```bash CLI theme={null}
# View credit usage
firecrawl credit-usage
# Output as JSON
firecrawl credit-usage --json --pretty
```
***
### Version
Display the CLI version.
```bash CLI theme={null}
firecrawl version
# or
firecrawl --version
```
## Global Options
These options are available for all commands:
| Option | Short | Description |
| ----------------- | ----- | ------------------------------------------------------ |
| `--status` | | Show version, auth, concurrency, and credits |
| `--api-key <key>` | `-k`  | Override stored API key for this command               |
| `--api-url <url>` |       | Use custom API URL (for self-hosted/local development) |
| `--help` | `-h` | Show help for a command |
| `--version` | `-V` | Show CLI version |
## Output Handling
The CLI outputs to stdout by default, making it easy to pipe or redirect:
```bash CLI theme={null}
# Pipe markdown to another command
firecrawl https://example.com | head -50
# Redirect to a file
firecrawl https://example.com > output.md
# Save JSON with pretty formatting
firecrawl https://example.com --format markdown,links --pretty -o data.json
```
### Format Behavior
* **Single format**: Outputs raw content (markdown text, HTML, etc.)
* **Multiple formats**: Outputs JSON with all requested data
```bash CLI theme={null}
# Raw markdown output
firecrawl https://example.com --format markdown
# JSON output with multiple formats
firecrawl https://example.com --format markdown,links
```
## Examples
### Quick Scrape
```bash CLI theme={null}
# Get markdown content from a URL (use --only-main-content for clean output)
firecrawl https://docs.firecrawl.dev --only-main-content
# Get HTML content
firecrawl https://example.com --html -o page.html
```
### Full Site Crawl
```bash CLI theme={null}
# Crawl a docs site with limits
firecrawl crawl https://docs.example.com --limit 50 --max-depth 2 --wait --progress -o docs.json
```
### Site Discovery
```bash CLI theme={null}
# Find all blog posts
firecrawl map https://example.com --search "blog" -o blog-urls.txt
```
### Research Workflow
```bash CLI theme={null}
# Search and scrape results for research
firecrawl search "machine learning best practices 2024" --scrape --scrape-formats markdown --pretty
```
### Agent
```bash CLI theme={null}
# URLs are optional
firecrawl agent "Find the top 5 AI startups and their funding amounts" --wait
# Focus on specific URLs
firecrawl agent "Compare pricing plans" --urls https://slack.com/pricing,https://teams.microsoft.com/pricing --wait
```
### Browser Automation
```bash CLI theme={null}
# Launch a session, scrape a page, and close
firecrawl browser launch-session
firecrawl browser execute "open https://news.ycombinator.com"
firecrawl browser execute "snapshot"
firecrawl browser execute "scrape"
firecrawl browser close
# Use agent-browser via bash mode (default — recommended for AI agents)
firecrawl browser launch-session
firecrawl browser execute "open https://example.com"
firecrawl browser execute "snapshot"
# snapshot returns @ref IDs — use them to interact
firecrawl browser execute "click @e5"
firecrawl browser execute "fill @e3 'search query'"
firecrawl browser execute "scrape"
# Run --help to see all 40+ commands
firecrawl browser execute --bash "agent-browser --help"
firecrawl browser close
```
### Combine with Other Tools
```bash CLI theme={null}
# Save search results to a file, then extract URLs with jq
firecrawl search "firecrawl" --json --pretty -o search-results.json
jq -r '.data.web[].url' search-results.json
# Get titles from search results
jq -r '.data.web[] | "\(.title): \(.url)"' search-results.json
# Extract links and process with jq
firecrawl https://example.com --format links | jq '.links[].url'
# Count URLs from map
firecrawl map https://example.com | wc -l
```
## Telemetry
The CLI collects anonymous usage data during authentication to help improve the product:
* CLI version, OS, and Node.js version
* Development tool detection (e.g., Cursor, VS Code, Claude Code)
**No command data, URLs, or file contents are collected via the CLI.**
To disable telemetry, set the environment variable:
```bash CLI theme={null}
export FIRECRAWL_NO_TELEMETRY=1
```
## Open Source
The Firecrawl CLI and Skill are open source and available on GitHub: [firecrawl/cli](https://github.com/firecrawl/cli)
> Are you an AI agent that needs a Firecrawl API key? See [firecrawl.dev/agent-onboarding/SKILL.md](https://www.firecrawl.dev/agent-onboarding/SKILL.md) for automated onboarding instructions.
# .NET
Source: https://docs.firecrawl.dev/sdks/dotnet
Firecrawl .NET SDK is a wrapper around the Firecrawl API to help you easily turn websites into markdown.
## Installation
The official .NET SDK is maintained in the Firecrawl monorepo at [apps/dot-net-sdk](https://github.com/firecrawl/firecrawl/tree/main/apps/dot-net-sdk).
To install the Firecrawl .NET SDK, add the NuGet package:
```bash theme={null}
dotnet add package firecrawl-sdk
```
```powershell theme={null}
Install-Package firecrawl-sdk
```
```xml theme={null}
<!-- Or reference the package directly in your .csproj (pin a specific version as needed) -->
<PackageReference Include="firecrawl-sdk" Version="*" />
```
Requires .NET 8.0 or later.
## Usage
1. Get an API key from [firecrawl.dev](https://firecrawl.dev)
2. Set the API key as an environment variable named `FIRECRAWL_API_KEY`, or pass it to the `FirecrawlClient` constructor
Here is a quick example using the current SDK API surface:
```csharp theme={null}
using Firecrawl;
using Firecrawl.Models;
var client = new FirecrawlClient("fc-your-api-key");
// Scrape a single page
var doc = await client.ScrapeAsync("https://firecrawl.dev",
new ScrapeOptions { Formats = new List