LLM Distillation Explained

The Weird Thing About 7B Models Right Now

A year ago, running a 7B LLM on your home GPU felt like hiring a forklift to move a couch: technically it works, but the results are hilariously mediocre. Today? A distilled 7B model like Phi-3-small or Qwen 2.5 7B can outperform older models three times its size. That’s not a hardware breakthrough. That’s not better silicon. That’s knowledge distillation: the art of teaching a tiny student what a massive teacher knows, minus all the fat.

Here’s the surprising part: distilled models don’t lose nearly as much as you’d expect. Qwen 2.5 32B distills to 7B and keeps about 75–85% of its reasoning ability. Llama 3.2’s 3B model, distilled from the 8B and 70B Llama 3.1 teachers, loses maybe 15–20% on benchmarks most people never look at. The difference in real-world chat? You might not notice.

If you’re running LLMs locally on a 16GB GPU or less, you’re probably already using distilled models and didn’t even know it. Let’s talk about why these tiny things work so well, how they’re made, and when you should actually care about picking one.

What Knowledge Distillation Actually Is

The name sounds fancy, but the concept is embarrassingly simple. You have:

A teacher model: Large (70B), slow, expensive, genuinely smart
A student model: Small (7B, 3B, even 1B), fast, cheap, but dumb without training
A goal: Make the student mimic what the teacher knows

The trick is the student isn’t memorizing facts. It’s learning the patterns of how the teacher thinks. When the teacher sees “what is 2+2?”, it outputs “4” with 99.7% confidence. The student sees that probability distribution and learns not just the answer, but how certain the teacher is. That’s the magic.

Think of it like this: You’re teaching someone to cook pasta. You could give them a physics textbook on starch gelatinization, or you could just show them what done pasta looks like, feels like, tastes like. Distillation is the second one. You’re showing the student the output distribution, not the internal reasoning.

Without distillation, a 7B model trained from scratch is mediocre because it lacks the depth of knowledge baked into those massive parameters. With distillation, that 7B model inherits the teacher’s understanding in compressed form. It’s like copying an expert’s muscle memory instead of learning from first principles.

Hard Labels vs. Soft Labels vs. Features

Distillation comes in flavors, and they’re not all created equal.

Hard-Label Distillation

The naive approach: Train the student to predict the same final answer as the teacher.

Teacher → outputs “Paris” for “capital of France”
Student → learns to output “Paris”

This works, but it’s information-starved. If the teacher assigns 95% confidence to Paris and spreads the remaining 5% across other cities, the hard label throws away that 5%. It’s like learning to cook by only being told “done” or “not done,” without any sense of how close you are.

Soft-Label Distillation (KL Divergence Loss)

The standard approach: Train the student to match the teacher’s probability distribution.

Teacher → Paris (0.95), Lyon (0.03), Marseille (0.02)
Student → learns to output similar probabilities

Now the student learns not just the answer, but the teacher’s uncertainty and secondary options. This is why distilled models often feel more nuanced — they inherited the teacher’s hesitation about ambiguous cases. The loss function is usually KL divergence (Kullback-Leibler), which measures how different two probability distributions are.

The math is messier, but the intuition is clean: Minimize the gap between student and teacher probability distributions. Temperature scaling (often T=3 or T=4) divides the logits by T before the softmax, which softens both distributions so the small probabilities carry a usable learning signal.
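To make the soft-label idea concrete, here is a minimal NumPy sketch of a temperature-scaled KL loss. The function names and the toy logits are mine, not taken from any particular training framework, and the T² factor follows the common convention from the original distillation paper rather than any specific vendor's recipe.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens the distribution."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), summed over the vocabulary axis."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def distillation_loss(teacher_logits, student_logits, temperature=3.0):
    """Soft-label loss: KL between the temperature-softened teacher and student."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Multiplying by T^2 keeps gradient magnitudes comparable across temperatures.
    return float(np.mean(kl_divergence(p_teacher, p_student)) * temperature**2)

# Toy three-token vocabulary (Paris, Lyon, Marseille); the logits are made up.
teacher_logits = np.array([[4.0, 0.5, 0.1]])
student_logits = np.array([[2.0, 1.0, 0.8]])
print(distillation_loss(teacher_logits, student_logits))
```

The loss is zero only when the student reproduces the teacher's whole distribution, not just its top answer. In practice this term is usually mixed with ordinary cross-entropy on the true labels.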

Feature Distillation

The advanced approach: Match intermediate layers, not just final output.

The teacher’s middle layers encode rich representations — ways of structuring concepts that the student could steal. By forcing the student’s layers to align with the teacher’s layers, you’re transferring more than just answers. You’re transferring the teacher’s internal compass.

One common form is hidden-state alignment: the student’s hidden states or attention patterns are trained to match the teacher’s at selected layers, usually through a small learned projection because the two models have different widths. Labs rarely publish the exact recipe, so which production models use feature matching (as opposed to logits or teacher-generated text) is often not documented.

This is computationally heavier (you need the teacher’s intermediate activations for every batch, not just its final outputs), but it can produce very strong students for their size.
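A rough sketch of what a feature-matching loss can look like, assuming the student and teacher layers have already been paired up and that a projection matrix bridges the two hidden sizes. The function name, the shapes, and the per-layer weights are illustrative assumptions, not any lab's published pipeline.

```python
import numpy as np

def feature_distillation_loss(student_states, teacher_states, projections, layer_weights=None):
    """Weighted mean-squared error between projected student states and teacher states.

    student_states: list of arrays, each (tokens, d_student)
    teacher_states: list of arrays, each (tokens, d_teacher), paired with the student layers
    projections:    list of arrays, each (d_student, d_teacher); learned in real training
    """
    if layer_weights is None:
        layer_weights = [1.0] * len(student_states)
    total = 0.0
    for s, t, w_proj, weight in zip(student_states, teacher_states, projections, layer_weights):
        projected = s @ w_proj  # lift student features into the teacher's width
        total += weight * np.mean((projected - t) ** 2)
    return total / sum(layer_weights)

# Random stand-ins for real activations: 2 layers, 5 tokens, widths 8 (student) and 16 (teacher).
rng = np.random.default_rng(0)
student = [rng.normal(size=(5, 8)) for _ in range(2)]
proj = [rng.normal(size=(8, 16)) for _ in range(2)]
teacher = [rng.normal(size=(5, 16)) for _ in range(2)]
print(feature_distillation_loss(student, teacher, proj, layer_weights=[1.0, 2.0]))
```

The optional `layer_weights` argument is one simple way to let some layers count more than others. A real setup would train the projections jointly with the student.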

Synthetic Data and Fine-Tuning Pipelines

Distillation alone isn’t magic. The real power comes from what you distill on.

Step 1: Synthetic Data Generation

You need a dataset of inputs where you know what the teacher would output. Options:

Existing benchmarks: Run the teacher on GSM8K, MMLU, HumanEval, get its outputs, use those as ground truth
New synthetic data: Use the teacher to generate QA pairs, coding problems, reasoning chains from scratch
Mixture of both: Combine public benchmarks with teacher-generated synthetic data

Example: Qwen 2.5 7B was distilled from 32B using a mix of coding problems, math reasoning, and instruction-following tasks. The teacher generated most of the synthetic examples.

DeepSeek’s distillation (DeepSeek-R1-Distill-Qwen-7B, from the 671B DeepSeek-R1) was trained on roughly 800K examples generated by the teacher. The teacher spent a lot of compute once; the student learns that forever.
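The data-generation step can be sketched in a few lines. This is a simplified outline under my own assumptions: `teacher` is any callable that maps a prompt to text, and `toy_teacher` is a hypothetical stand-in so the example runs without a model. A real pipeline would add quality filtering and decontamination.

```python
import json
from typing import Callable, Iterable

def build_distillation_set(prompts: Iterable[str], teacher: Callable[[str], str],
                           min_chars: int = 1) -> list[dict]:
    """Run each prompt through the teacher and keep non-empty, de-duplicated pairs."""
    seen = set()
    records = []
    for prompt in prompts:
        prompt = prompt.strip()
        if not prompt or prompt in seen:
            continue  # skip blanks and repeated prompts
        seen.add(prompt)
        response = teacher(prompt).strip()
        if len(response) < min_chars:
            continue  # drop empty or too-short teacher outputs
        records.append({"prompt": prompt, "response": response})
    return records

def to_jsonl(records: list[dict]) -> str:
    """Serialize records one JSON object per line, a common format for SFT data."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def toy_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a real teacher call; answers one hard-coded question."""
    answers = {"What is 2+2?": "4"}
    return answers.get(prompt, "")

dataset = build_distillation_set(["What is 2+2?", "What is 2+2?", "Unknown question"], toy_teacher)
print(to_jsonl(dataset))
```

Swapping `toy_teacher` for a function that calls a local or hosted teacher model is the only change needed to turn this into a real, if naive, generator.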

Step 2: DPO / SFT Fine-Tuning

Once the data exists, the student is trained in up to two stages:

SFT (Supervised Fine-Tuning): Standard next-token prediction on the synthetic dataset. This is the core distillation step.
DPO (Direct Preference Optimization): Optional second pass where the student learns to rank responses by quality, not just match the teacher’s outputs.

Why DPO? Matching the teacher token by token also copies its overconfidence and its blind spots. DPO instead trains on pairs of responses, one preferred and one rejected, and pushes the student to rank the preferred one higher relative to a frozen reference copy of itself. It’s especially useful for instruction-following and safety.
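For reference, the DPO objective for a single preference pair is short enough to write out. This sketch assumes you already have the summed log-probabilities of each response under the student (policy) and under a frozen reference model. The `beta=0.1` default is a commonly used value, not one the models above are confirmed to use.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed log-probabilities of each response."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp        # how much more the policy likes the good answer
    rejected_margin = policy_rejected_logp - ref_rejected_logp  # how much more it likes the bad answer
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: the loss sits at log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: the loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss only falls when the student raises the preferred response relative to the rejected one, which is why it can correct behavior that plain imitation would copy.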

Llama 3.2 1B and 3B include DPO tuning after distillation from the Llama 3.1 8B and 70B teachers. That’s why they punch above their weight despite being tiny.

Real Examples: Who’s Distilling and How Well

Phi-3 Family (Microsoft)

The gold standard. Phi-3-mini (3.8B), trained on heavily filtered web data plus LLM-generated synthetic data, beats Mistral 7B on reasoning benchmarks. Phi-3-small (7B) competes with Llama 2 13B.

Method: Data-centric distillation. Rather than matching logits or hidden states, the Phi recipe leans on curated and synthetic “textbook-quality” training data produced with larger models.

Benchmark: MMLU 7B → 3.8B loses maybe 3–5 percentage points. HumanEval (coding) stays within 10%.

Qwen 2.5 Series

Qwen 2.5 32B distilled down to 7B and 1.5B. The 7B model hits 75–78% on MMLU (the 32B hits 84%). Real-world, it’s nearly indistinguishable for chat.

Method: Hidden-state alignment on transformer layers + DPO tuning on preference data.

Trick: Qwen uses a non-uniform loss — certain layers matter more for distillation than others. Attention heads get higher weight than value projections.

Llama 3.1 / 3.2 Distill

Meta released Llama 3.2 1B and 3B as distilled versions of Llama 3.1 (8B and 70B teachers). The lightweight text models were produced by pruning larger models then recovering performance via distillation — not separate training runs from scratch.

Benchmark: GSM8K (math): the 70B teacher hits 86%, while the 3B student lands around 58–63%. Real-world chat? Surprisingly close for most everyday tasks.

Why it works: Llama’s architecture is good enough that even compressed, it keeps the reasoning patterns. The strong 8B and 70B baselines provide rich signal to distill from.

DeepSeek-7B Distilled

DeepSeek distilled its 671B model (yes, you read that right) into a 7B student using roughly 800K teacher-generated examples. The tiny version is shockingly capable.

Method: Teacher-generated QA, code, and reasoning chains. No public benchmarks used — purely synthetic.

Trade-off: Incredible on the synthetic distribution, slightly weaker on out-of-distribution benchmarks.

The Math: Why Distilled Models Don’t Collapse

This is the part that surprised me. Benchmarks predict maybe 10–20% loss when going from 70B → 8B. But qualitatively, you lose way less.

Three reasons:

1. Benchmark Saturation

MMLU (multiple choice) and HumanEval (coding) are close to saturated for strong models. Say the teacher hits 90% and the student hits 80%. That looks like a 10-point loss, but both models get the large majority of items right, and on any single question you usually can’t tell them apart.

The real gaps appear on reasoning chains, edge cases, and long-tail questions — exactly the stuff benchmarks don’t measure well.

2. Calibration Inheritance

The student inherits the teacher’s probability calibration. When the student doesn’t know something, it tends to hedge like the teacher does. It’s less likely to hallucinate confidently and more likely to say “I’m not sure.” That feels like intelligence because it matches human uncertainty patterns.

A random 7B model hallucinates confidently on bad questions. A distilled 7B says “I don’t have enough information”, which feels smarter, even if it’s not technically more correct.

3. Architecture Efficiency

Modern 7–8B models (Llama 3.1 8B, Qwen 2.5 7B) use better architectures than 70B models from 3 years ago. More efficient attention, better normalization, smarter activation functions. A distilled 7B with 2025 architecture competes with a non-distilled 13B with 2023 architecture.

That’s not distillation magic; that’s just architectural progress. But it compounds with distillation.

What Distilled Models Lose (and It Matters)

Here’s the honest part: Distilled models don’t magically become as smart as teachers. You lose on:

Long-Context Reasoning

The teacher (70B) can hold 20 facts in context and reason over all of them. The student (7B) starts forgetting fact 15 by the time it gets to fact 20. This isn’t distillation’s fault — it’s a parameter count problem. Fewer parameters = shorter effective context.

Workaround: Use RAG (retrieval-augmented generation) to inject context gradually instead of all at once.

Novel Problem-Solving

The teacher’s size gives it redundancy. Multiple internal pathways to solve a problem. The student is tighter, more efficient. On novel problems the training distribution didn’t cover, the teacher explores; the student gets stuck.

Workaround: Use larger models for novel problems, distilled models for standard queries.

Reasoning Chains Longer Than Training Data

If the teacher was trained on short reasoning chains, the distilled student won’t suddenly generate 10-step proofs it never saw. Knowledge distillation compresses; it doesn’t extrapolate beyond the teacher.

The Hardware Angle: When to Pick Distilled

This is where it matters for home labs.

Distilled 7B on 16GB GPU

Run Qwen 2.5 7B quantized to 4-bit (GGUF Q4_K_M):

VRAM: ~6–7 GB (leaving plenty for context)
Speed: 30–50 tokens/sec on consumer GPU (RTX 4060, 3060 Ti)
Quality: Handles chat, coding, summarization. You won’t miss a 70B on daily tasks.

Compare to a non-distilled random 7B:

Same VRAM, but half the capability
You’d want 13–20B to get similar results
That’s 12–15 GB VRAM, no longer fits in 16GB comfortably
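The VRAM figures above can be sanity-checked with back-of-the-envelope arithmetic. In this sketch the bits-per-weight value for Q4_K_M (about 4.8) and the layer and head counts for a Qwen 2.5 7B-class model are my approximations. Check the model's own config before relying on them.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size: keys and values for every layer and every token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

# Assumed figures for a 7B-class model at roughly 4.8 bits per weight and a 4096-token context.
weights = weight_memory_gb(params_billion=7.6, bits_per_weight=4.8)
cache = kv_cache_gb(n_layers=28, n_kv_heads=4, head_dim=128, context_tokens=4096)
print(f"weights ~{weights:.2f} GB, KV cache ~{cache:.2f} GB, total ~{weights + cache:.2f} GB")
```

The runtime adds its own overhead (compute buffers, the CUDA context), which is why observed usage lands above the raw total.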

Distilled 3B on 6GB GPU

Phi-3 3.8B (quantized):

VRAM: ~2–3 GB
Speed: 70–100 tokens/sec
Quality: Surprisingly good for summaries, coding, Q&A. Not for deep reasoning.

Non-distilled 3B is a paperweight. You need at least 7B for usable results.

The Quantization Question

Distilled models also quantize better. A distilled 7B → 4-bit loses maybe 2–3% capability. A non-distilled 7B → 4-bit loses 5–10%. The compressed knowledge is denser, so rounding doesn’t hurt as much.
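To see what “rounding” means here, this is a deliberately simplified 4-bit quantizer: one scale per block, round to the nearest of 15 integer levels. Real GGUF formats such as Q4_K_M are more elaborate, so treat this as an illustration of the idea rather than how llama.cpp does it.

```python
import numpy as np

def quantize_4bit(weights):
    """Symmetric round-to-nearest quantization of one block to integers in [-7, 7]."""
    weights = np.asarray(weights, dtype=np.float64)
    scale = np.abs(weights).max() / 7.0
    if scale == 0.0:
        return np.zeros(weights.shape, dtype=np.int8), 1.0  # all-zero block: any scale works
    q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return q.astype(np.float64) * scale

# Made-up weights for one small block.
weights = np.array([0.90, -0.31, 0.02, 0.47, -0.88])
q, scale = quantize_4bit(weights)
error = np.abs(dequantize(q, scale) - weights)
print(q, scale, error.max())
```

Each weight moves by at most half a quantization step, so the damage depends on how sensitive the model is to small perturbations of that size.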

Using Distilled Models in Ollama / llama.cpp

Pull and Run in Ollama

ollama pull qwen2.5:7b
ollama run qwen2.5:7b

# Set parameters at the interactive prompt
# >>> /set parameter temperature 0.7
# >>> /set parameter num_predict 512

Qwen 2.5 7B is a solid default. It’s small, smart, and available in multiple quantizations.

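Other options to try (not all of them distilled):

ollama pull phi3:latest          # Phi-3-mini, good for coding
ollama pull llama2:7b            # Older non-distilled baseline, useful for comparison
ollama pull neural-chat:7b       # Fine-tuned for chat
ollama pull deepseek-coder:6.7b  # Coding focus
ollama pull deepseek-r1:7b       # DeepSeek-R1 distilled into a Qwen 7B base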

Custom Modelfile for Qwen Distilled

If you want to tune temperature and context for your use case:

FROM qwen2.5:7b

PARAMETER temperature 0.6
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1

SYSTEM """You are a helpful, friendly AI assistant. Be concise but thorough."""

Build and run:

ollama create my-qwen -f ./Modelfile
ollama run my-qwen

llama.cpp Inference Server

If you’re running inference server mode for an app:

# Download quantized Qwen 2.5 7B (Q4_K_M)
wget https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q4_K_M.gguf

# Run inference server
./llama-server -m Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8000 \
  --n-gpu-layers 33 \
  --ctx-size 4096

# Test it
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain distillation in one sentence:", "max_tokens": 100}'

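The --n-gpu-layers 33 flag sets how many layers are offloaded to the GPU (adjust for your hardware); a value at or above the model’s layer count offloads everything. If you run short on VRAM, for example a 7B model on an 8GB card with a large context, try 20–28 layers and let the rest run on the CPU.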
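If your app is written in Python, the same request as the curl test can be sent with the standard library. This sketch assumes the server was started with the host and port shown above and that it returns an OpenAI-style response with the text under `choices[0]["text"]`. The helper names are mine.

```python
import json
import urllib.request

SERVER_URL = "http://127.0.0.1:8000/v1/completions"  # matches --host and --port above

def build_completion_request(prompt, max_tokens=100, temperature=0.7, url=SERVER_URL):
    """Build the POST request without sending it, so it can be inspected or tested."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, **kwargs):
    """Send the request and return the generated text (requires the server to be running)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server up, `complete("Explain distillation in one sentence:")` should return the model's answer as a string.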

Distilled vs. Quantized: The Decision Tree

Should you pick a distilled model or just quantize a bigger one?

Distilled Model If:

You have <= 16GB VRAM
You want the best quality within a size budget (7B slot)
You run a lot of inference (tokens/sec matters)
You care about both speed and capability

Quantized Bigger Model If:

You have 24GB+ VRAM
You can afford longer inference time
You need maximum capability (long reasoning, novel problems)
You’re willing to trade speed for quality

Middle Ground (Recommended for Home Labs):

Use a distilled 7B for daily chat and coding (fast, fits anywhere)
Keep a quantized 13B around for harder problems (better reasoning, longer context)
Both fit in 16GB VRAM if you’re strategic with quantization
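The rules of thumb above collapse into a tiny helper if you want to script the choice. The thresholds come straight from the lists in this section. The function name and return strings are hypothetical.

```python
def recommend_setup(vram_gb: float, needs_deep_reasoning: bool = False) -> str:
    """Map the rules of thumb from this section onto a single recommendation."""
    if needs_deep_reasoning and vram_gb >= 24:
        return "quantized larger model"
    if needs_deep_reasoning and vram_gb >= 16:
        return "quantized 13B alongside a distilled 7B"
    # Everything else: limited VRAM, or everyday chat and coding workloads.
    return "distilled 7B"

for vram, hard in [(8, False), (16, True), (24, True)]:
    print(vram, hard, "->", recommend_setup(vram, hard))
```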

When Distilled Models Aren’t Enough

Be honest about what you need:

Long-form reports or analysis: Use 13B or bigger. Distilled 7B gets muddled after 3–4 paragraphs.
Novel problem-solving: Use 13B+. Distilled models regurgitate patterns, don’t create new ones.
Math or formal reasoning: Use 13B+. Distilled models inherit the teacher’s reasoning but lose the robustness.
Casual chat, summaries, Q&A, coding small functions: Distilled 7B is better than you’d expect. Do it.

The Bottom Line

Knowledge distillation is one of the best things that happened to local AI. A year ago, “running LLMs at home” meant accepting garbage output or buying expensive GPUs. Today, a distilled 7B model on a 16GB GPU is genuinely useful.

The trick is understanding what you’re getting: Not a smaller version of the same thing. A student that learned by watching a teacher, so it thinks in similar patterns but can’t do everything the teacher can.

For home lab tinkerers, the distilled models (Qwen 2.5, Phi-3, Llama 3.2 3B) are the best bang for the buck right now. They’re fast, they’re smart enough for real tasks, and they fit in hardware you probably already have.

If you’ve been running a non-distilled 7B locally and thinking “this is kinda mid,” try Qwen 2.5 7B. You’ll be surprised how much better it is. That’s not magic. That’s knowledge distillation doing exactly what it was designed for.

Your 2 AM self debugging a deployment will appreciate it.