May 14, 2026

How to Build a RAG Knowledge Base from a Website with MCP

Crawl a website with crawler-mcp and ingest the Markdown into a vector store for semantic search.

Mehmet Kose
2 mins read

Retrieval-Augmented Generation (RAG) systems need clean, structured text. With crawler-mcp, an agent can crawl an entire website, extract Markdown from every page, and feed it into a vector store, all from a single prompt.

This guide shows how to use crawl_site to build a knowledge base.

Step 1: Install crawler-mcp

Run the install script:

curl -fsSL https://install.crawler.sh/install-mcp.sh | sh

This downloads the correct binary for your platform to ~/.crawler/bin/crawler-mcp.

For more detail, see the installation guide.

Step 2: Wire it into your client

Step 3: Crawl the site

Ask the agent to crawl the target site:

Use crawler-sh to crawl https://docs.example.com to depth 3 with max_pages 100. Save the Markdown from each page to a local file.

The agent calls crawl_site, which returns Markdown content for every page. The agent can then write each page to disk.
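If you would rather save the pages with a script than leave it to the agent, the step might look like the sketch below. It assumes each crawled page is available as a dict with `url` and `markdown` keys; that shape, the `crawled` directory name, and the helper names are illustrative assumptions, not the documented output of crawl_site.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def slug_for(url: str) -> str:
    """Turn a page URL into a safe, readable file name (without extension)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return slug or "index"


def save_pages(pages: list[dict], out_dir: str = "crawled") -> list[Path]:
    """Write each page's Markdown to <out_dir>/<slug>.md and return the paths.

    Assumes every page dict has 'url' and 'markdown' keys (hypothetical shape).
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for page in pages:
        path = target / f"{slug_for(page['url'])}.md"
        # Keep the source URL at the top so later answers can point back to it.
        body = f"<!-- source: {page['url']} -->\n\n{page['markdown']}"
        path.write_text(body, encoding="utf-8")
        written.append(path)
    return written
```

Two different URLs can collapse to the same slug (for example `/a-b` and `/a/b`), so a real script may want to append a short hash of the URL to each file name.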

Step 4: Chunk and ingest into a vector store

Once the files are saved, ask the agent to prepare them for ingestion:

Read the crawled Markdown files, split them into 500-word chunks with 50-word overlap, and write each chunk to a separate file in ./chunks/.
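For reference, a script that does what this prompt asks could look roughly like the following. It is a minimal sketch: the `crawled` input directory, the `.txt` output extension, and the function names are assumptions, and the agent may well generate something different.

```python
from pathlib import Path


def chunk_words(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of `size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def chunk_directory(src: str = "crawled", dst: str = "chunks") -> int:
    """Chunk every .md file in `src`, writing one file per chunk into `dst`."""
    out = Path(dst)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for md_file in sorted(Path(src).glob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for i, chunk in enumerate(chunk_words(text)):
            (out / f"{md_file.stem}-{i:04d}.txt").write_text(chunk, encoding="utf-8")
            count += 1
    return count
```

Splitting on whitespace is the simplest reading of "500-word chunks", but it flattens line breaks and ignores headings. Splitting on Markdown headings first and then applying the word window within each section tends to keep related text together.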

After chunking, ingest into your vector store:

Use the chunks in ./chunks/ to create embeddings and store them in the vector database.

The exact commands depend on your vector store (Pinecone, Weaviate, pgvector, etc.). The agent can generate the ingestion script for your chosen platform.
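Whatever the platform, an ingestion script tends to have the same shape: read each chunk, compute a vector, and store the id, text, and vector together. The sketch below shows that shape with two loudly labeled stand-ins: `toy_embed` is a hashed bag-of-words placeholder rather than a real embedding model, and the "vector store" is a local JSON Lines file (`store.jsonl`). All names here are hypothetical.

```python
import hashlib
import json
import math
from pathlib import Path

DIM = 256  # toy dimensionality; real embedding models define their own


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Placeholder embedding: hashed bag-of-words, L2-normalised.

    This is NOT a semantic embedding. Swap in a call to a real embedding
    model before relying on search quality.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def ingest(chunks_dir: str = "chunks", store_path: str = "store.jsonl") -> int:
    """Embed every chunk file and write one JSON record per chunk to the store."""
    chunk_files = sorted(p for p in Path(chunks_dir).iterdir() if p.is_file())
    with open(store_path, "w", encoding="utf-8") as store:
        for chunk_file in chunk_files:
            text = chunk_file.read_text(encoding="utf-8")
            record = {"id": chunk_file.stem, "text": text, "vector": toy_embed(text)}
            store.write(json.dumps(record) + "\n")
    return len(chunk_files)
```

With a real setup, the body of `toy_embed` becomes a call to your embedding provider and the `store.write` line becomes an upsert into your vector database. The read, embed, store loop stays the same.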

Step 5: Query the knowledge base

With the data ingested, ask questions:

Search the knowledge base for “authentication setup” and summarise the steps.

The agent queries the vector store and grounds its answer in the crawled content.
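Under the hood, a vector-store query typically embeds the question and ranks stored chunks by similarity to it. The sketch below illustrates that ranking step with plain cosine similarity over in-memory records. The record shape (dicts with a `vector` key) and the function names are assumptions for illustration; production stores usually use approximate nearest-neighbour indexes instead of a full scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], records: list[dict], k: int = 3) -> list[tuple[float, dict]]:
    """Return the k records most similar to the query vector, best match first."""
    scored = [(cosine(query_vec, r["vector"]), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The text of the top-ranked chunks is what gets passed to the model as context, which is how the answer stays grounded in the crawled pages.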

