What is Instruction Tuning in LLM Training

Instruction tuning is a training method that teaches language models to follow natural language instructions. Instead of training on raw text sequences, the model sees examples of tasks phrased as commands, questions, or prompts, paired with the correct response. This transforms a base model that predicts the next token into an assistant that understands and executes user requests.

Base models like GPT-3 or Llama can generate coherent text, but they do not naturally follow instructions. Ask a base model “What is the capital of France?” and it might continue with “What is the capital of Germany? What is the capital of Italy?” because its training objective is to continue text, not to answer questions. Instruction tuning fixes this behavior.

How instruction tuning works

The process involves three components:

A base model - Pre-trained on general text corpora and capable of generating language
An instruction dataset - Thousands to millions of examples in the format {instruction, input, output}
Supervised fine-tuning - Training the model to predict the output given the instruction and input

A typical instruction example looks like this:

{
  "instruction": "Summarize the following article in one sentence.",
  "input": "Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed for every task...",
  "output": "Machine learning allows computers to learn patterns from data automatically."
}

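During training, the model receives the instruction and input as the prompt and learns to generate the output token by token. The loss function, usually next-token cross-entropy, penalizes deviations from the expected response; many setups compute it only on the output tokens so the model is not trained to reproduce the prompt.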
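
The idea of training only on the response can be sketched as follows. This is a minimal illustration, assuming a toy whitespace tokenizer and the common convention of labeling ignored positions with -100 (the default ignore_index of PyTorch's cross-entropy loss). The names toy_tokenize, build_example, and IGNORE_INDEX are hypothetical and not taken from any particular framework.

```python
IGNORE_INDEX = -100  # label value that the loss is assumed to skip


def toy_tokenize(text, vocab):
    """Map whitespace-separated words to integer ids, growing vocab as needed."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_example(instruction, input_text, output, vocab):
    """Return (input_ids, labels) with the prompt positions masked out."""
    prompt = f"{instruction}\n{input_text}".strip()
    prompt_ids = toy_tokenize(prompt, vocab)
    output_ids = toy_tokenize(output, vocab)
    input_ids = prompt_ids + output_ids
    # Only the response tokens keep real labels, so only they affect the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + output_ids
    return input_ids, labels


vocab = {}
input_ids, labels = build_example(
    "Rewrite this sentence in the past tense.",
    "She walks to the store.",
    "She walked to the store.",
    vocab,
)
print(input_ids)
print(labels)
```

A real pipeline would use the model's own tokenizer and chat template; the masking step is the part that carries over.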

Common dataset formats

Different fine-tuning frameworks expect different schemas:

Alpaca format

{
  "instruction": "Rewrite this sentence in the past tense.",
  "input": "She walks to the store.",
  "output": "She walked to the store."
}
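
Before training, a record like this is usually rendered into a single prompt string. The sketch below is modeled on the prompt template published with the Stanford Alpaca project; treat the exact wording as an assumption and check the template your framework actually uses. The function name render_prompt is illustrative.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def render_prompt(record):
    """Build the prompt text; the model is trained to continue it with record['output']."""
    if record.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    # Records without an input field use the shorter template.
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


record = {
    "instruction": "Rewrite this sentence in the past tense.",
    "input": "She walks to the store.",
    "output": "She walked to the store.",
}
print(render_prompt(record) + record["output"])
```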

ShareGPT format (conversational)

{
  "conversations": [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "Explain quantum computing simply."},
    {"from": "gpt", "value": "Quantum computing uses quantum mechanics principles..."}
  ]
}

OpenAI fine-tuning format (JSONL, one example per line)

{"messages": [{"role": "system", "content": "You summarize articles."}, {"role": "user", "content": "Summarize: [article text]"}, {"role": "assistant", "content": "[summary]"}]}
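
Moving between schemas is mostly a matter of reshaping fields. The following is a minimal sketch that converts an Alpaca-style record into a role/content messages record and serializes a list of them as JSONL. The helper names alpaca_to_messages and to_jsonl are hypothetical, and joining instruction and input with a blank line is one reasonable choice rather than a standard.

```python
import json


def alpaca_to_messages(record, system_prompt=None):
    """Convert {instruction, input, output} into a chat-style messages record."""
    user_content = record["instruction"]
    if record.get("input"):
        # Assumption: instruction and input are joined with a blank line.
        user_content += "\n\n" + record["input"]
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": record["output"]})
    return {"messages": messages}


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


alpaca_record = {
    "instruction": "Rewrite this sentence in the past tense.",
    "input": "She walks to the store.",
    "output": "She walked to the store.",
}
chat_record = alpaca_to_messages(alpaca_record, "You are a helpful assistant.")
print(to_jsonl([chat_record]))
```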

Instruction diversity matters

A good instruction dataset covers many task types:

Question answering - Factual, open-domain, and multi-hop reasoning
Summarization - Extractive and abstractive, across domains
Translation - Between languages and between styles
Classification - Sentiment analysis, topic categorization, intent detection
Code generation - Writing, explaining, and debugging code
Creative writing - Stories, poems, marketing copy
Reasoning - Math, logic puzzles, and step-by-step problem solving
Refusal training - Learning to decline harmful or impossible requests

If the dataset is too narrow (only math problems, for example), the model may become excellent at math but poor at conversation. Broad task coverage is what gives the model general assistant capabilities.
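
A quick way to spot a narrow dataset is to look at the share of each task type. This is a minimal sketch, assuming each record carries a hypothetical task_type field (the formats shown earlier do not include one, so it would have to be added during dataset construction). The 50% threshold is an arbitrary illustration, not a recommended value.

```python
from collections import Counter


def task_distribution(records):
    """Return the fraction of records belonging to each task type."""
    counts = Counter(r.get("task_type", "unknown") for r in records)
    total = sum(counts.values())
    return {task: n / total for task, n in counts.items()}


def dominant_tasks(records, max_share=0.5):
    """List task types whose share of the dataset exceeds max_share."""
    return [
        task
        for task, share in task_distribution(records).items()
        if share > max_share
    ]


sample = [
    {"task_type": "math"},
    {"task_type": "math"},
    {"task_type": "math"},
    {"task_type": "summarization"},
]
print(task_distribution(sample))
print(dominant_tasks(sample))
```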

Sources of instruction data

Human annotators - Expensive but high quality. Used for initial instruction datasets and RLHF.
Existing NLP datasets - Converting supervised learning datasets into instruction format, for example turning a sentiment classification dataset into the instruction “Classify the sentiment of this review: [review]” with the output “Positive”.
Self-instruct - Using a powerful model (like GPT-4) to generate instruction-output pairs, then filtering for quality. This scales cheaply but may inherit biases from the teacher model.
Real user interactions - Chat logs and support tickets with appropriate privacy filtering
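
Converting an existing labeled dataset can be sketched as below. This is a minimal illustration, assuming a hypothetical sentiment dataset of (text, label) rows with 0 for negative and 1 for positive; the instruction templates and the sample rows are made up for the example. Rotating between several phrasings is one simple way to add variety to the instructions.

```python
import random

# Hypothetical phrasings for the same task.
TEMPLATES = [
    "Classify the sentiment of this review.",
    "Is the following review positive or negative?",
    "Label the review below as Positive or Negative.",
]
LABEL_NAMES = {0: "Negative", 1: "Positive"}


def to_instruction(text, label, rng):
    """Turn one labeled row into an {instruction, input, output} record."""
    return {
        "instruction": rng.choice(TEMPLATES),
        "input": text,
        "output": LABEL_NAMES[label],
    }


# Illustrative rows, not taken from any real dataset.
rows = [
    ("Great battery life and a sharp screen.", 1),
    ("Stopped working after two days.", 0),
]
rng = random.Random(0)  # fixed seed so the template choice is reproducible
examples = [to_instruction(text, label, rng) for text, label in rows]
for example in examples:
    print(example)
```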

Instruction tuning vs pre-training

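Aspect	Pre-training	Instruction tuning
Objective	Predict next token	Predict the response given an instruction
Data	Raw text (books, web, code)	Instruction-response examples
Duration	Weeks to months on thousands of GPUs	Typically hours to days on far fewer GPUs
Cost	Millions of dollars	Often a few thousand dollars
Result	Base model	Instruction-following assistant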

Instruction tuning is far cheaper than pre-training. A 70-billion parameter model can often be instruction-tuned for a few thousand dollars of compute, depending on dataset size and training method, making it accessible to organizations that could never afford full pre-training.

How crawler.sh supports instruction tuning

crawler.sh provides the raw material for building domain-specific instruction datasets:

Source content - Crawl authoritative documentation, blogs, and knowledge bases to use as the input field in instruction pairs
Context windows - Clean Markdown output carries less markup overhead than raw HTML, so more content fits in each example, enabling longer passages for summarization or Q&A tasks
Multiple formats - Export crawled content as structured JSON or Markdown that feeds into dataset generation scripts
Topic coverage - Crawl multiple sites in a domain to ensure diverse examples for the instruction set

For example, a team building a legal assistant might:

# Crawl legal documentation and case summaries
crawler crawl https://legal-docs.example.com --max-pages 5000 --output legal-corpus.zip

# Use the extracted text as source material for instruction pairs:
# "instruction": "Summarize this legal brief"
# "input": [extracted brief text]
# "output": [human-written or model-generated summary]

The quality of the source text directly impacts the quality of the resulting instruction dataset. Clean, well-structured Markdown extraction eliminates the noise that would otherwise require manual cleanup.
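
The step from crawled text to instruction records can be sketched as follows. This is a minimal illustration, assuming the archive has already been unzipped into a local directory of Markdown files; the actual layout of the crawler.sh output is not specified here, and the function names, the source field, and the 2000-character chunk size are all hypothetical. The output field is left empty to be filled by a human or a model afterwards.

```python
import json
import tempfile
from pathlib import Path


def chunk_paragraphs(text, max_chars=2000):
    """Group paragraphs into chunks that stay within max_chars where possible."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the current chunk and start a new one
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_records(corpus_dir, instruction, max_chars=2000):
    """Create one instruction record per chunk of every Markdown file found."""
    records = []
    for path in sorted(Path(corpus_dir).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        for chunk in chunk_paragraphs(text, max_chars):
            records.append({
                "instruction": instruction,
                "input": chunk,
                "output": "",  # to be written by a human or generated by a model
                "source": path.name,
            })
    return records


# Demo on a throwaway directory standing in for the unzipped crawl output.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "brief.md").write_text(
        "# Brief\n\nFirst paragraph.\n\nSecond paragraph.", encoding="utf-8"
    )
    demo_records = build_records(tmp, "Summarize this legal brief")
    for record in demo_records:
        print(json.dumps(record, ensure_ascii=False))
```

Keeping the source file name on each record makes it easier to trace a poor example back to the page it came from.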