# What is Markdown for LLMs in AI Training

Markdown is a widely preferred text format for feeding content into large language models. Unlike HTML, which wraps every paragraph in tags and nests elements deeply, Markdown uses lightweight punctuation to convey structure. The result is fewer tokens consumed, more room in the context window for actual content, and cleaner output when models generate or summarize text.

Major LLM providers, including OpenAI, Anthropic, and Google, use Markdown heavily in their documentation, prompt examples, and model output. When you ask ChatGPT to format a response, it tends to use Markdown headings, lists, and code blocks, most likely because Markdown is well represented in its training data and because chat interfaces render it.

## Why Markdown beats HTML for LLMs

| Aspect | HTML | Markdown |
|---|---|---|
| Heading | `<h1>Title</h1>` | `# Title` |
| Bold | `<strong>text</strong>` | `**text**` |
| List | `<ul><li>item</li></ul>` | `- item` |
| Link | `<a href="url">text</a>` | `[text](url)` |
| Tokens for same content | Often 2-4x more | Minimal overhead |
| Readability for humans | Noisy with tags | Clean and scannable |
| Model familiarity | Seen in training, but documentation is mostly written in Markdown | Native to most training corpora |

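The token difference is significant. As an illustration, a 1,000-word article in HTML might use 2,500 tokens, while the same article in Markdown uses roughly 1,200 tokens. For a model with a 4,000-token context window, the HTML version takes up more than 60% of the budget before any instructions or output are counted, while the Markdown version takes about 30%. Put differently, three such articles fit as Markdown, but only one fits as HTML.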
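The budget arithmetic can be sketched in a few lines. This is a minimal illustration that reuses the example figures from the paragraph above (2,500 tokens for HTML, 1,200 for Markdown, a 4,000-token window); they are not measurements, and the function names are hypothetical.

```python
# Illustrative figures taken from the paragraph above, not measurements.
CONTEXT_WINDOW = 4_000
HTML_TOKENS = 2_500
MARKDOWN_TOKENS = 1_200


def budget_share(article_tokens: int, window: int) -> float:
    """Fraction of the context window that one article consumes."""
    return article_tokens / window


def articles_that_fit(article_tokens: int, window: int) -> int:
    """Number of whole articles of this size that fit in the window."""
    return window // article_tokens


for label, tokens in [("HTML", HTML_TOKENS), ("Markdown", MARKDOWN_TOKENS)]:
    share = budget_share(tokens, CONTEXT_WINDOW)
    count = articles_that_fit(tokens, CONTEXT_WINDOW)
    print(f"{label}: {share:.1%} of the window, {count} whole article(s) fit")
```

With these numbers, one HTML article fits in the window, while three Markdown articles do. Real counts depend on the tokenizer and on how markup-heavy the page is.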

## Structure preservation

Markdown preserves the semantic structure that LLMs need to understand content hierarchy:

- Headings (`#`, `##`) indicate topic boundaries and document outline
- Lists (`-`, `1.`) show relationships between items
- Code blocks (`` ``` ``) separate executable or quoted text from prose
- Tables (`|`) organize tabular data in a readable grid
- Blockquotes (`>`) distinguish quoted material
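To show how readily this structure can be recovered by a program, here is a minimal sketch that extracts a document outline from Markdown headings. The `outline` helper is hypothetical, and its fence handling is a simplification: it toggles on any fence line and does not implement the full CommonMark rules.

```python
import re

# ATX headings: one to six '#' characters followed by whitespace and a title.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
# Simplified fence detection: a line starting with 3+ backticks or tildes.
FENCE = re.compile(r"^\s*(`{3,}|~{3,})")


def outline(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) for each heading outside fenced code blocks."""
    headings = []
    in_fence = False
    for line in markdown.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence  # entering or leaving a code block
            continue
        if in_fence:
            continue  # '#' inside code is a comment, not a heading
        match = HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


FENCE_MARK = "`" * 3  # built at runtime to keep this example fence-safe
sample = "\n".join([
    "# Guide",
    "",
    "## Install",
    "",
    FENCE_MARK + "bash",
    "# not a heading",
    FENCE_MARK,
    "",
    "## Usage",
])

for level, title in outline(sample):
    print("  " * (level - 1) + title)
```

The comment line inside the fenced block is skipped, so the outline contains only `Guide`, `Install`, and `Usage`.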

This structure survives conversion from HTML far better than plain text. Stripping all tags from HTML leaves a wall of text with no paragraph breaks or emphasis. Converting HTML to Markdown preserves the logical organization.
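The contrast between stripping tags and converting to Markdown can be made concrete with a toy sketch built on Python's standard `html.parser` module. `TextOnly` and `TinyMarkdown` are hypothetical names, and the converter handles only headings, paragraphs, unordered lists, and bold or italic text; a production pipeline would use a full converter.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Drops every tag and keeps only the text nodes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


class TinyMarkdown(HTMLParser):
    """Toy converter: handles only h1-h6, p, ul/li, strong, and em."""

    INLINE = {"strong": "**", "em": "*"}
    HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "ul":
            self.out.append("\n")  # blank line before the list
        elif tag == "li":
            self.out.append("\n- ")
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.out.append(self.INLINE[tag])

    def handle_data(self, data):
        self.out.append(data)


def strip_tags(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def to_markdown(html: str) -> str:
    parser = TinyMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


html = (
    "<h1>Title</h1><p>Some <strong>bold</strong> text.</p>"
    "<ul><li>one</li><li>two</li></ul>"
)
print(repr(strip_tags(html)))
print(to_markdown(html))
```

Stripping tags runs the heading, paragraph, and list items together into a single string, while the converted version keeps the heading, the emphasis, and the list as separate, recognizable units.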

## Markdown in training data

LLMs are trained on massive text corpora, much of it scraped from the web, where content exists originally as HTML. Data pipelines that convert HTML to Markdown before training can give the model better input because:

- The model learns from cleaner, more structured input
- Less token budget is wasted on markup, leaving more room for actual content
- Tables, lists, and code examples remain interpretable rather than becoming garbled tag soup
- The model becomes better at generating Markdown, which is what users typically want
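The point about markup overhead can be illustrated with a rough count. The sketch below uses a crude regex split (words and single punctuation marks) as a stand-in for a tokenizer. That is an assumption for illustration only: real BPE tokenizers split text differently, so the counts are meaningful only relative to each other.

```python
import re

# Crude stand-in for a tokenizer: words and single punctuation marks.
# Real BPE tokenizers behave differently; treat the counts as relative only.
ROUGH_TOKEN = re.compile(r"\w+|[^\w\s]")


def rough_token_count(text: str) -> int:
    """Count word and punctuation pieces as a rough proxy for tokens."""
    return len(ROUGH_TOKEN.findall(text))


# Equivalent snippets in HTML and Markdown.
PAIRS = [
    ("<h1>Title</h1>", "# Title"),
    ("<strong>text</strong>", "**text**"),
    ("<ul><li>item</li></ul>", "- item"),
    ('<a href="url">text</a>', "[text](url)"),
]

for html, markdown in PAIRS:
    print(f"{rough_token_count(html):>3} vs {rough_token_count(markdown):>3}  {markdown}")
```

For every pair, the HTML form produces more pieces than the Markdown form, which is the overhead a conversion step removes before training.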

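OpenAI, Anthropic, and Google all use Markdown in their documentation and prompt examples, and their chat models handle Markdown-formatted prompts well. Guidance differs between providers, so check the current documentation for each one: Anthropic’s prompting guide, for example, also recommends XML tags for delimiting sections of a prompt.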

## Markdown vs plain text

Plain text strips all structure. No headings, no lists, no bold, no code blocks. For simple paragraphs, this is fine. For technical documentation, API references, or academic papers, plain text destroys the information hierarchy that helps both humans and models understand the material.

Markdown sits in the sweet spot: structured enough to convey meaning, lightweight enough to avoid token waste, and widely understood by modern LLMs.

## Common Markdown patterns for LLMs

````markdown
# Document Title

## Section Heading

This is a paragraph with **bold emphasis** and a [link](https://example.com).

### Code Example

```python
def hello():
    return "world"
```

### List of Items

1. First item with details
2. Second item with a sub-point
   - Nested item
3. Third item

### Table

| Feature   | HTML    | Markdown |
|-----------|---------|----------|
| Tokens    | High    | Low      |
| Structure | Verbose | Minimal  |
````

This pattern, when fed to an LLM, gives it clear signals about content hierarchy. The model can summarize by section, extract the code example, or convert the table to natural language.

## How crawler.sh handles Markdown for LLMs

crawler.sh exports entire crawled sites as clean Markdown archives. The extraction engine converts HTML to Markdown while preserving:

- Heading hierarchy (`h1` through `h6` mapped to `#` through `######`)
- Ordered and unordered lists
- Tables (converted to Markdown pipe tables)
- Links with original URLs preserved
- Code blocks with language hints
- Bold and italic emphasis
- Blockquotes and horizontal rules

The output is a ZIP file containing one `.md` file per page, organized by URL path. This archive is ready to be fed directly into:

- RAG pipelines as chunked context
- Fine-tuning datasets as source material
- LLM prompts as structured background knowledge
- Vector databases as semantically searchable documents

Because the Markdown is clean and token-efficient, more content fits into each context window. That reduces the number of chunks needed and can improve retrieval accuracy.

## Related

- [Context window](/glossary/context-window/)
- [Chunking](/glossary/chunking/)
- [Web scraping for LLMs](/glossary/web-scraping-for-llms/)
- [RAG](/glossary/rag/)