Crawler MCP Tool Reference
The Crawler MCP server exposes three tools. All three honor robots.txt by default and refuse redirects to private addresses unless you opt in.
- crawl_site: walk an entire site
- fetch_page: one-shot page fetch
- discover_links: shallow URL discovery
crawl_site
Crawl an entire website (same host) and return enriched page data for each page, including Markdown, SEO metadata, headings, and links.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | yes | - | Start URL. |
| `max_pages` | integer | no | 50 | Hard cap on pages crawled. Server enforces a tier cap on top of this. |
| `max_depth` | integer | no | 10 | BFS depth limit. |
| `render_js` | string | no | "auto" | JavaScript rendering mode: "auto", "always", or "never". |
| `respect_robots` | boolean | no | true | Honor robots.txt Disallow rules. |
| `concurrency` | integer | no | 5 | Parallel request count. |
| `delay_ms` | integer | no | 50 | Inter-request delay in milliseconds. |
| `user_agent` | string | no | auto | Custom User-Agent string. |
| `check_outgoing_links` | boolean | no | true | Validate outgoing links for 404s and other errors. |
| `allow_private_redirects` | boolean | no | false | Allow redirects to localhost, link-local, and RFC1918 addresses. |
| `extract_content` | boolean | no | true | Return Markdown content for each page. Set to false for faster link discovery only. |
| `include_html` | boolean | no | false | Include the raw (post-JS-render) HTML body of each HTML page as an html field. Significantly increases response size on large crawls. |
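
The parameter table above can be mirrored in a small client-side check before the tool is called. The following is a minimal sketch, not part of the Crawler server: `build_crawl_site_args` and the two constants are hypothetical names, and the server may apply its own validation regardless.

```python
from typing import Any

# Parameter names and expected Python types, copied from the crawl_site table.
CRAWL_SITE_PARAMS: dict[str, type] = {
    "url": str,
    "max_pages": int,
    "max_depth": int,
    "render_js": str,
    "respect_robots": bool,
    "concurrency": int,
    "delay_ms": int,
    "user_agent": str,
    "check_outgoing_links": bool,
    "allow_private_redirects": bool,
    "extract_content": bool,
    "include_html": bool,
}
RENDER_JS_MODES = {"auto", "always", "never"}


def build_crawl_site_args(url: str, **options: Any) -> dict[str, Any]:
    """Hypothetical helper: validate and assemble crawl_site arguments."""
    args: dict[str, Any] = {"url": url, **options}
    for name, value in args.items():
        expected = CRAWL_SITE_PARAMS.get(name)
        if expected is None:
            raise ValueError(f"unknown parameter: {name}")
        # bool is a subclass of int in Python, so reject it for integer fields.
        if expected is int and isinstance(value, bool):
            raise TypeError(f"{name} must be an integer, not a boolean")
        if not isinstance(value, expected):
            raise TypeError(f"{name} must be of type {expected.__name__}")
    if args.get("render_js", "auto") not in RENDER_JS_MODES:
        raise ValueError('render_js must be "auto", "always", or "never"')
    return args


# Only explicitly set options are sent; omitted ones fall back to server defaults.
args = build_crawl_site_args("https://example.com", max_pages=20, delay_ms=500)
```

Leaving unset options out of the payload keeps the defaults listed in the table in the server's hands rather than duplicating them in the client.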
Returns

    {
      "pages": [
        {
          "url": "https://example.com/page",
          "status_code": 200,
          "content_type": "text/html; charset=utf-8",
          "title": "Page Title",
          "meta_description": "A concise summary of this page.",
          "h1_text": "Main Heading",
          "h1_count": 1,
          "word_count": 1200,
          "lang": "en",
          "byline": "Author Name",
          "excerpt": "First paragraph of the article...",
          "site_name": "Example",
          "image": "https://example.com/og-card.png",
          "markdown": "# Main Heading\n\nArticle body in Markdown...",
          "canonical_url": "https://example.com/page",
          "meta_robots": "index, follow",
          "x_robots_tag": "index, follow",
          "keywords": ["keyword1", "keyword2"],
          "published_time": "2026-01-15T10:00:00Z",
          "modified_time": "2026-03-20T14:30:00Z",
          "rel_next": "https://example.com/page?p=2",
          "rel_prev": null,
          "links_found": 42,
          "internal_links": ["/about", "/contact"],
          "outgoing_link_errors": [
            { "url": "https://example.com/broken", "status_code": 404, "error": "Not Found", "link_type": "a" }
          ],
          "redirect_chain": [
            { "url": "http://example.com/page", "status_code": 301, "location": "https://example.com/page" }
          ],
          "redirect_loop_detected": false,
          "response_time_ms": 234,
          "depth": 0,
          "html": null
        }
      ],
      "page_count": 1,
      "cap_applied": 50,
      "cap_max": 50
    }

Example

    {
      "url": "https://example.com",
      "max_pages": 20,
      "delay_ms": 500,
      "check_outgoing_links": false
    }

fetch_page
Fetch a single page and return its full enriched data. Use this for one-shot URL lookups when you do not need to walk the site.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | yes | - | Page URL. |
| `render_js` | string | no | "auto" | JavaScript rendering mode: "auto", "always", or "never". |
| `user_agent` | string | no | auto | Custom User-Agent string. |
| `respect_robots` | boolean | no | true | Honor robots.txt Disallow rules. |
| `allow_private_redirects` | boolean | no | false | Allow redirects to localhost, link-local, and RFC1918 addresses. |
| `extract_content` | boolean | no | true | Return Markdown content. Set to false for metadata-only lookup. |
| `include_html` | boolean | no | false | Include the raw HTML body of the response as an html field. |
Returns
A single page object with the same schema as crawl_site page entries.
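
Because the page object has the same fields in both tools, one helper can inspect either a fetch_page result or an entry from the crawl_site `pages` array. The sketch below is illustrative only: `page_issues` is a hypothetical function, the checks are example heuristics rather than anything the server performs, and it assumes that any field may be missing or null.

```python
from typing import Any


def page_issues(page: dict[str, Any]) -> list[str]:
    """Hypothetical helper: list simple findings for one page object."""
    issues: list[str] = []

    status = page.get("status_code")
    if status is not None and status >= 400:
        issues.append(f"HTTP {status}")

    h1_count = page.get("h1_count")
    if h1_count is not None and h1_count != 1:
        issues.append(f"expected exactly one h1, found {h1_count}")

    if not page.get("meta_description"):
        issues.append("missing meta_description")

    # Robots directives can arrive via the meta tag or the X-Robots-Tag header.
    directives = f"{page.get('meta_robots') or ''},{page.get('x_robots_tag') or ''}"
    if "noindex" in directives.lower():
        issues.append("noindex directive present")

    canonical = page.get("canonical_url")
    if canonical and canonical != page.get("url"):
        issues.append(f"canonical points to {canonical}")

    if page.get("redirect_loop_detected"):
        issues.append("redirect loop detected")

    for err in page.get("outgoing_link_errors") or []:
        issues.append(f"broken link {err.get('url')} ({err.get('status_code')})")

    return issues
```

For a crawl_site response, the same function can be applied to each element of `pages` to build a per-URL report.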
Example

    {
      "url": "https://example.com/page",
      "render_js": "never",
      "extract_content": false
    }

discover_links
Crawl shallowly and return discovered URLs with optional titles. Use this to understand a site’s structure before reading its content.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | yes | - | Start URL. |
| `max_depth` | integer | no | 2 | BFS depth limit. |
| `max_pages` | integer | no | tier cap | Hard cap on pages. |
| `concurrency` | integer | no | 5 | Parallel request count. |
| `delay_ms` | integer | no | 50 | Inter-request delay in milliseconds. |
| `user_agent` | string | no | auto | Custom User-Agent string. |
| `respect_robots` | boolean | no | true | Honor robots.txt Disallow rules. |
| `allow_private_redirects` | boolean | no | false | Allow redirects to localhost, link-local, and RFC1918 addresses. |
| `extract_content` | boolean | no | true | Include page titles in results. Set to false for URL-only discovery. |
Returns

    {
      "links": [
        { "url": "https://example.com/about", "title": "About Us", "depth": 1, "status_code": 200 },
        { "url": "https://example.com/blog", "title": "Blog", "depth": 1, "status_code": 200 }
      ],
      "count": 2
    }

When extract_content is false, the title field is omitted from each link.
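
Since `title` may be absent, code that reads this response should not index it directly. As a minimal sketch, the hypothetical `outline` function below turns a discover_links result into an indented site outline and falls back to the bare URL when no title is present.

```python
from typing import Any


def outline(result: dict[str, Any]) -> list[str]:
    """Hypothetical helper: render discover_links output as indented lines."""
    lines: list[str] = []
    links = sorted(result.get("links", []), key=lambda l: (l["depth"], l["url"]))
    for link in links:
        # title is omitted when the crawl ran with extract_content set to false.
        title = link.get("title")
        label = f"{link['url']} ({title})" if title else link["url"]
        # Two spaces of indentation per BFS depth level.
        lines.append("  " * link["depth"] + label)
    return lines
```

Sorting by depth first keeps pages closest to the start URL at the top of the outline.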
Example

    {
      "url": "https://example.com",
      "max_depth": 1,
      "extract_content": true
    }
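
Each Example object in this reference is the `arguments` payload of a tool call. Under the Model Context Protocol, a client sends it inside a JSON-RPC 2.0 `tools/call` request. The sketch below shows that envelope for readers building raw messages; `tools_call_request` is a hypothetical name, the transport and initialization handshake are not shown, and most MCP client libraries construct this message for you.

```python
import json
from typing import Any


def tools_call_request(request_id: int, tool: str, arguments: dict[str, Any]) -> str:
    """Hypothetical helper: serialize an MCP tools/call request as JSON-RPC 2.0."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Wrap the discover_links example arguments from this section.
request = tools_call_request(
    1,
    "discover_links",
    {"url": "https://example.com", "max_depth": 1, "extract_content": True},
)
```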