# Markdown for LLMs

## What is Markdown for LLMs in AI training?

Markdown is the preferred format for LLM context because it preserves structure while using fewer tokens than HTML.

Unlike HTML, which wraps every paragraph in tags and nests elements deeply, Markdown conveys structure with lightweight punctuation. The result is fewer tokens consumed, more room in the context window for actual content, and cleaner output when models generate or summarize text.

Major LLM providers and tools, including OpenAI, Anthropic, and Google, use Markdown widely in system prompts, tool outputs, and documentation examples. When you ask ChatGPT to format a response, it typically produces Markdown headings, lists, and code blocks, because Markdown is heavily represented in the text it was trained on.

## Why Markdown beats HTML for LLMs

| Aspect | HTML | Markdown |
| --- | --- | --- |
| Heading | `<h1>Title</h1>` | `# Title` |
| Bold | `<strong>text</strong>` | `**text**` |
| List | `<ul><li>item</li></ul>` | `- item` |
| Link | `<a href="url">text</a>` | `[text](url)` |
| Tokens for same content | 2-4x more | Minimal overhead |
| Readability for humans | Noisy with tags | Clean and scannable |
| Model familiarity | Trained on both, but Markdown dominates docs | Native to most training corpora |

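The token difference can be significant. As a rough illustration, a 1,000-word article might take about 2,500 tokens as HTML and roughly 1,200 tokens as Markdown; actual counts depend on the tokenizer and on how heavy the page's markup is. In a 4,000-token context window, the HTML version leaves about 1,500 tokens for instructions and the response, while the Markdown version leaves about 2,800. Once a prompt holds several documents, that gap decides whether everything fits or something has to be truncated.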
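The overhead can be checked roughly by writing the same content in both formats and measuring how much of each string is markup. The sketch below uses character counts as a stand-in for tokens, which is an assumption: real token counts depend on the model's tokenizer. The snippets are illustrative, not taken from a benchmark.

```python
# The same content expressed in HTML and in Markdown (illustrative snippets).
html_doc = (
    "<h1>Title</h1>"
    '<p>Some <strong>bold</strong> text and a <a href="https://example.com">link</a>.</p>'
    "<ul><li>first</li><li>second</li></ul>"
)
markdown_doc = (
    "# Title\n\n"
    "Some **bold** text and a [link](https://example.com).\n\n"
    "- first\n- second\n"
)
# The visible words only, with all markup (and the link URL) removed.
visible_text = "Title Some bold text and a link. first second"


def markup_overhead(doc: str, content: str) -> float:
    """Fraction of the characters in doc that are not visible content."""
    return 1 - len(content) / len(doc)


html_overhead = markup_overhead(html_doc, visible_text)
markdown_overhead = markup_overhead(markdown_doc, visible_text)

print(f"HTML:     {len(html_doc)} chars, {html_overhead:.0%} overhead")
print(f"Markdown: {len(markdown_doc)} chars, {markdown_overhead:.0%} overhead")
```

For real measurements, the character counts would be replaced by the token counts reported by the tokenizer of the model in use.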

## Structure preservation

Markdown preserves the semantic structure that LLMs need to understand content hierarchy:

- Headings (`#`, `##`) indicate topic boundaries and document outline
- Lists (`-`, `1.`) show relationships between items
- Code blocks (`` ``` ``) separate executable or quoted text from prose
- Tables (`|`) organize tabular data in a readable grid
- Blockquotes (`>`) distinguish quoted material

Converting HTML to Markdown preserves this structure far better than converting it to plain text. Stripping all tags from HTML leaves a wall of text with no paragraph breaks or emphasis, whereas the Markdown version keeps the logical organization.
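The difference between the two conversions can be seen in a small sketch. The converter below is a deliberately minimal illustration built on Python's standard `html.parser`; it handles only headings, paragraphs, bold text, links, and flat unordered lists, and the class and function names are made up for this example. It is not how any particular production converter works.

```python
import re
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class MarkdownSketch(HTMLParser):
    """Tiny HTML-to-Markdown converter, for illustration only."""

    def __init__(self):
        super().__init__()
        self.out = []     # pieces of Markdown output, joined at the end
        self.href = None  # URL of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.out.append("#" * int(tag[1]) + " ")  # h2 -> "## "
        elif tag == "strong":
            self.out.append("**")
        elif tag == "li":
            self.out.append("- ")
        elif tag == "a":
            self.href = dict(attrs).get("href", "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in HEADINGS or tag in ("p", "ul"):
            self.out.append("\n\n")  # blank line after each block element
        elif tag == "strong":
            self.out.append("**")
        elif tag == "li":
            self.out.append("\n")
        elif tag == "a":
            self.out.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        self.out.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


def strip_tags(html: str) -> str:
    """Naive plain-text conversion: delete every tag and keep the rest."""
    return re.sub(r"<[^>]+>", "", html)


sample_html = (
    "<h1>Title</h1>"
    '<p>Some <strong>bold</strong> text and a <a href="https://example.com">link</a>.</p>'
    "<ul><li>one</li><li>two</li></ul>"
)
print(strip_tags(sample_html))        # words run together, no structure left
print(html_to_markdown(sample_html))  # heading, emphasis, link, and list survive
```

Nested lists, tables, and code blocks are left out to keep the sketch short; a real pipeline would need to handle them as well.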

## Markdown in training data

LLMs are trained on massive text corpora scraped from the web. Much of that content exists originally as HTML. Data pipelines that convert HTML to Markdown before training tend to produce better training data, and in turn better models, because:

- The model learns from cleaner, more structured input
- Less token budget is wasted on markup, leaving more room for actual content
- Tables, lists, and code examples remain interpretable rather than becoming garbled tag soup
- The model becomes better at generating Markdown, which is what users typically want

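The major providers reflect this in practice: documentation and prompt examples from OpenAI, Anthropic, and Google make frequent use of Markdown, and their chat models commonly produce it in responses. None of them requires it, though. Fine-tuning data for OpenAI models, for instance, is uploaded as JSONL, with any Markdown living inside the message text.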

## Markdown vs plain text

Plain text strips all structure. No headings, no lists, no bold, no code blocks. For simple paragraphs, this is fine. For technical documentation, API references, or academic papers, plain text destroys the information hierarchy that helps both humans and models understand the material.

Markdown sits in the sweet spot: structured enough to convey meaning, lightweight enough to avoid token waste, and universally understood by modern LLMs.

## Common Markdown patterns for LLMs

````markdown
# Document Title

## Section Heading

This is a paragraph with **bold emphasis** and a [link](https://example.com).

### Code Example

```python
def hello():
    return "world"
```

### List of Items

- First item with details
- Second item with a sub-point
  - Nested item
- Third item

### Table

| Feature | HTML | Markdown |
| --- | --- | --- |
| Tokens | High | Low |
| Structure | Verbose | Minimal |
````

This pattern, when fed to an LLM, gives it clear signals about content hierarchy. The model can summarize by section, extract the code example, or convert the table to natural language.
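
The same signals that help a model are easy to pick up in code. As a rough sketch, the functions below recover the outline, the fenced code block, and the table rows from a shortened version of the pattern. The function names are invented for this example, and the parsing is simplified on purpose: it assumes well-formed fences and a single pipe table.

```python
import re

FENCE = "`" * 3  # three backticks, built here so the sample nests cleanly

sample = "\n".join([
    "# Document Title",
    "## Section Heading",
    "A paragraph with **bold emphasis**.",
    "### Code Example",
    FENCE + "python",
    "def hello():",
    '    return "world"',
    FENCE,
    "### Table",
    "| Feature | HTML | Markdown |",
    "| --- | --- | --- |",
    "| Tokens | High | Low |",
])


def outline(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) for each heading outside fenced code."""
    headings, in_code = [], False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_code = not in_code
        elif not in_code:
            match = re.match(r"(#{1,6}) +(.*)", line)
            if match:
                headings.append((len(match.group(1)), match.group(2)))
    return headings


def extract_code_blocks(markdown: str) -> list[tuple[str, str]]:
    """Return (language, code) for each fenced block."""
    blocks, lang, buffer = [], None, []
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            if lang is None:  # opening fence: remember the language hint
                lang, buffer = line[len(FENCE):].strip(), []
            else:             # closing fence: store the collected lines
                blocks.append((lang, "\n".join(buffer)))
                lang = None
        elif lang is not None:
            buffer.append(line)
    return blocks


def parse_table(markdown: str) -> list[dict[str, str]]:
    """Turn the rows of a single pipe table into dictionaries."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.splitlines()
        if line.lstrip().startswith("|")
    ]
    if len(rows) < 3:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the separator row
    return [dict(zip(header, row)) for row in body]


print(outline(sample))
print(extract_code_blocks(sample))
print(parse_table(sample))
```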
## How crawler.sh handles Markdown for LLMs
crawler.sh exports entire crawled sites as clean Markdown archives. The extraction engine converts HTML to Markdown while preserving:
- Heading hierarchy (`h1` through `h6` mapped to `#` through `######`)
- Ordered and unordered lists
- Tables (converted to Markdown pipe tables)
- Links with original URLs preserved
- Code blocks with language hints
- Bold and italic emphasis
- Blockquotes and horizontal rules
The output is a ZIP file containing one `.md` file per page, organized by URL path. This archive is ready to be fed directly into:
- RAG pipelines as chunked context
- Fine-tuning datasets as source material
- LLM prompts as structured background knowledge
- Vector databases as semantically searchable documents
Because the Markdown is clean and token-efficient, more content fits into each context window, which can reduce the number of chunks needed and improve retrieval accuracy.
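
One common way to prepare such pages for a RAG pipeline is to split them at headings and merge small neighbouring sections up to a size budget. The sketch below illustrates that idea under stated assumptions: `chunk_by_heading` and its `max_chars` parameter are hypothetical names, not part of crawler.sh, and the character budget is only a proxy for a token budget.

```python
import re


def chunk_by_heading(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown at headings, then merge neighbouring sections
    while the merged chunk stays within max_chars.

    Simplification: a line starting with '#' inside a fenced code block
    (a Python comment, say) is also treated as a heading.
    """
    # Split just before every line that starts with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    sections = [part.strip() for part in parts if part.strip()]

    chunks: list[str] = []
    for section in sections:
        # +2 accounts for the blank line used to join two sections.
        if chunks and len(chunks[-1]) + len(section) + 2 <= max_chars:
            chunks[-1] += "\n\n" + section
        else:
            chunks.append(section)
    return chunks


page = (
    "# Guide\n\nIntro text.\n\n"
    "## Install\n\nRun the installer.\n\n"
    "## Usage\n\nCall the API.\n"
)

for chunk in chunk_by_heading(page, max_chars=40):
    print(repr(chunk))
```

A section that is longer than the budget on its own is kept whole here; a fuller implementation would split it further, for example by paragraph.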
## Related
- [Context window](/glossary/context-window/)
- [Chunking](/glossary/chunking/)
- [Web scraping for LLMs](/glossary/web-scraping-for-llms/)
- [RAG](/glossary/rag/)