Crawler MCP Tool Reference

The Crawler MCP server exposes three tools. All three honor robots.txt by default and refuse redirects to private addresses unless you opt in.

crawl_site: walk an entire site
fetch_page: one-shot page fetch
discover_links: shallow URL discovery

`crawl_site`

Crawl an entire website (same host) and return enriched page data for each page, including Markdown, SEO metadata, headings, and links.

Parameters

Parameter	Type	Required	Default	Description
`url`	string	yes	-	Start URL.
`max_pages`	integer	no	50	Hard cap on pages crawled. Server enforces a tier cap on top of this.
`max_depth`	integer	no	10	BFS depth limit.
`render_js`	string	no	`"auto"`	JavaScript rendering mode: `"auto"`, `"always"`, or `"never"`.
`respect_robots`	boolean	no	`true`	Honor robots.txt Disallow rules.
`concurrency`	integer	no	5	Parallel request count.
`delay_ms`	integer	no	50	Inter-request delay in milliseconds.
`user_agent`	string	no	auto	Custom User-Agent string.
`check_outgoing_links`	boolean	no	`true`	Validate outgoing links for 404s and other errors.
`allow_private_redirects`	boolean	no	`false`	Allow redirects to localhost, link-local, and RFC1918 addresses.
`extract_content`	boolean	no	`true`	Return Markdown content for each page. Set to `false` for faster link discovery only.
`include_html`	boolean	no	`false`	Include the raw (post-JS-render) HTML body of each HTML page as an `html` field. Significantly increases response size on large crawls.

Returns

{
  "pages": [
    {
      "url": "https://example.com/page",
      "status_code": 200,
      "content_type": "text/html; charset=utf-8",
      "title": "Page Title",
      "meta_description": "A concise summary of this page.",
      "h1_text": "Main Heading",
      "h1_count": 1,
      "word_count": 1200,
      "lang": "en",
      "byline": "Author Name",
      "excerpt": "First paragraph of the article...",
      "site_name": "Example",
      "image": "https://example.com/og-card.png",
      "markdown": "# Main Heading\n\nArticle body in Markdown...",
      "canonical_url": "https://example.com/page",
      "meta_robots": "index, follow",
      "x_robots_tag": "index, follow",
      "keywords": ["keyword1", "keyword2"],
      "published_time": "2026-01-15T10:00:00Z",
      "modified_time": "2026-03-20T14:30:00Z",
      "rel_next": "https://example.com/page?p=2",
      "rel_prev": null,
      "links_found": 42,
      "internal_links": ["/about", "/contact"],
      "outgoing_link_errors": [
        {
          "url": "https://example.com/broken",
          "status_code": 404,
          "error": "Not Found",
          "link_type": "a"
        }
      ],
      "redirect_chain": [
        {
          "url": "http://example.com/page",
          "status_code": 301,
          "location": "https://example.com/page"
        }
      ],
      "redirect_loop_detected": false,
      "response_time_ms": 234,
      "depth": 0,
      "html": null
    }
  ],
  "page_count": 1,
  "cap_applied": 50,
  "cap_max": 50
}

Example

{
  "url": "https://example.com",
  "max_pages": 20,
  "delay_ms": 500,
  "check_outgoing_links": false
}

`fetch_page`

Fetch a single page and return its full enriched data. Use this for one-shot URL lookups when you do not need to walk the site.

Parameters

Parameter	Type	Required	Default	Description
`url`	string	yes	-	Page URL.
`render_js`	string	no	`"auto"`	JavaScript rendering mode: `"auto"`, `"always"`, or `"never"`.
`user_agent`	string	no	auto	Custom User-Agent string.
`respect_robots`	boolean	no	`true`	Honor robots.txt Disallow rules.
`allow_private_redirects`	boolean	no	`false`	Allow redirects to localhost, link-local, and RFC1918 addresses.
`extract_content`	boolean	no	`true`	Return Markdown content. Set to `false` for metadata-only lookup.
`include_html`	boolean	no	`false`	Include the raw HTML body of the response as an `html` field.

Returns

A single page object with the same schema as crawl_site page entries.

Example

{
  "url": "https://example.com/page",
  "render_js": "never",
  "extract_content": false
}

`discover_links`

Crawl shallowly and return discovered URLs with optional titles. Use this to understand a site’s structure before reading its content.

Parameters

Parameter	Type	Required	Default	Description
`url`	string	yes	-	Start URL.
`max_depth`	integer	no	2	BFS depth limit.
`max_pages`	integer	no	tier cap	Hard cap on pages.
`concurrency`	integer	no	5	Parallel request count.
`delay_ms`	integer	no	50	Inter-request delay in milliseconds.
`user_agent`	string	no	auto	Custom User-Agent string.
`respect_robots`	boolean	no	`true`	Honor robots.txt Disallow rules.
`allow_private_redirects`	boolean	no	`false`	Allow redirects to localhost, link-local, and RFC1918 addresses.
`extract_content`	boolean	no	`true`	Include page titles in results. Set to `false` for URL-only discovery.

Returns

{
  "links": [
    {
      "url": "https://example.com/about",
      "title": "About Us",
      "depth": 1,
      "status_code": 200
    },
    {
      "url": "https://example.com/blog",
      "title": "Blog",
      "depth": 1,
      "status_code": 200
    }
  ],
  "count": 2
}

When extract_content is false, the title field is omitted from each link.

Example

{
  "url": "https://example.com",
  "max_depth": 1,
  "extract_content": true
}