What vector stores work best with crawled Markdown?

Any vector store that accepts text chunks works well. Popular choices include Pinecone, Weaviate, pgvector, and ChromaDB. The key is to split the Markdown into chunks of roughly 500 words with a small overlap (for example, 50 words) so that semantic search can still match a passage that straddles a chunk boundary.

Does crawl_site extract images and metadata too?

Yes. Each page result includes title, meta description, word count, headings, and links found. Images are referenced in the Markdown as image links. You can filter these out during chunking if you only want text.
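As a rough illustration of that filtering step, the sketch below removes inline image links of the form ![alt](url) from a Markdown string before chunking. The function name strip_images is made up for this example, and the pattern only covers inline images, not reference-style images or raw HTML img tags.

```python
import re

# Matches inline Markdown images such as ![alt text](https://example.com/a.png "title").
# Assumption: images appear in this inline form; other forms are left untouched.
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def strip_images(markdown: str) -> str:
    """Remove inline image links and collapse the blank lines they leave behind."""
    without_images = IMAGE_LINK.sub("", markdown)
    # Collapse runs of three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", without_images).strip()
```

Running this over each page before splitting keeps image URLs and alt text out of the embedded chunks.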

How many pages should I crawl for a RAG knowledge base?

Start with max_pages 100 to cover the core content. For large documentation sites, 400 pages is usually enough on the free tier; Pro lifts that to 10,000. You can always run a second crawl to fill gaps.

How to Build a RAG Knowledge Base from a Website with MCP

Retrieval-Augmented Generation (RAG) systems need clean, structured text. With crawler-mcp, an agent can crawl an entire website, extract Markdown from every page, and feed it into a vector store, all from a single prompt.

This guide shows how to use crawl_site to build a knowledge base.

Step 1: Install crawler-mcp

Run the install script:

curl -fsSL https://install.crawler.sh/install-mcp.sh | sh

This downloads the correct binary for your platform to ~/.crawler/bin/crawler-mcp.

For more detail, see the installation guide.

Step 2: Wire it into your client

Step 3: Crawl the site

Ask the agent to crawl the target site:

Use crawler-sh to crawl https://docs.example.com to depth 3 with max_pages 100. Save the Markdown from each page to a local file.

The agent calls crawl_site, which returns Markdown content for every page. The agent can then write each page to disk.

Step 4: Chunk and ingest into a vector store

Once the files are saved, ask the agent to prepare them for ingestion:

Read the crawled Markdown files, split them into 500-word chunks with 50-word overlap, and write each chunk to a separate file in ./chunks/.
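For reference, the chunking that prompt describes can be sketched in a few lines of Python. This is a minimal version that splits on whitespace and counts words; the function name chunk_words is illustrative, and the agent may well generate something different, such as splitting on headings first.

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of chunk_size words, each sharing overlap words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

With the defaults, a 1,000-word page yields three chunks that start at words 0, 450, and 900. Note that joining on single spaces flattens line breaks, so Markdown structure such as headings and lists is not preserved inside a chunk.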

After chunking, ingest into your vector store:

Use the chunks in ./chunks/ to create embeddings and store them in the vector database.

The exact commands depend on your vector store (Pinecone, Weaviate, pgvector, etc.). The agent can generate the ingestion script for your chosen platform.

Step 5: Query the knowledge base

With the data ingested, ask questions:

Search the knowledge base for “authentication setup” and summarise the steps.

The agent queries the vector store and grounds its answer in the crawled content.
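To make the retrieval step concrete, here is a toy, in-memory stand-in for a vector store. It is not how Pinecone, Weaviate, or pgvector work internally: instead of learned embeddings it uses plain term-frequency vectors and cosine similarity, which is only enough to show the rank-and-return shape of a query. All function names are invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count word occurrences (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def top_chunks(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query, highest score first."""
    query_vec = tokenize(query)
    scored = [(cosine(query_vec, tokenize(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In a real setup the top-ranked chunks are passed to the model as context, which is what lets the agent ground its summary in the crawled pages rather than in its training data.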