Give Your AI Agent a Cheap Intern

Full example: Grab the subagent definition and delegation script at github.com/KingPin/sumguy-examples/tree/main/llm/cheap-workhorse-model-ai-coding/

You Don’t Need Your Best Model for Everything

It’s 2 AM. Your AI coding agent just burned through $4 in tokens renaming 40 functions to a new convention and converting a YAML config into TOML. You watched it read every file twice, generate a massive diff, re-read the files to verify, then produce a summary of the summary.

That’s like hiring a senior architect to alphabetize your imports. Technically it works. The results are fine. But somewhere in the billing dashboard, a little icon of a burning money pile appeared.

Most of what an AI coding agent actually does in a session isn’t hard. It’s repetitive, mechanical, and boring, exactly the kind of work you’d hand off to an eager-but-cheap intern. The expensive model is burning its expensive context window on function renames, format conversions, boilerplate generation, and file reads when it could be doing the interesting part: understanding what you actually want and reviewing whether the output is right.

I built a solution for this. Then the pricing rug-pulled me. Now I’m going to explain why the pattern is worth rebuilding anyway.

The Pattern: Overseer Scopes, Workhorse Executes

The core idea is simple. You have two roles:

The Overseer (your expensive model, Claude Sonnet, Opus, whatever you’re using as the “main” agent) does three things:

Understands the request and scopes the task precisely
Hands the scoped task to the workhorse with no ambiguity
Reviews the workhorse’s output before accepting it

The Workhorse (a cheap or free model) does one thing: executes the task exactly as specified, then reports back.

The intern analogy is pretty good here. You (the senior dev) hand the intern a tightly-scoped task: “Rename every occurrence of user_id to userId across these five files. Don’t touch anything else. Show me the diff when you’re done.” The intern does it. You check it. Done.

You would never hand the intern the PRD and say “implement this feature.” That’s not what interns are for. The overseer handles all of that, the understanding, the architectural decisions, the final verification. The intern just executes.

What This Actually Saves (Hint: It’s Not Just Price Per Token)

People hear “cheap model” and think about price per million tokens. That’s real, but it’s the smaller win. The bigger one is something most articles don’t explain clearly.

Win 1: Price per token. Cheap models run anywhere from a few times to dozens of times cheaper per token than frontier models, Haiku is several times cheaper than Sonnet, while a free-tier Gemini Flash against a frontier model is effectively free. Gemini Flash, Haiku, and DeepSeek all sit in the sub-cent-per-million-token range for input. If the workhorse reads 20 files and produces a diff, you’ve spent fractions of a cent instead of a few dollars.

Win 2: The overseer’s context stays clean. This is the one people miss.

When you delegate to a subagent, that subagent runs in its own separate context window. All the grunt-work churn, the full file dumps, the retry after the first attempt was slightly off, the intermediate diffs, the tool call outputs, none of that ends up in the overseer’s context. The expensive model’s context stays small and focused on what matters.

In a long agentic session, context pollution is a serious problem. The overseer gets buried under thousands of tokens of boilerplate it generated two tasks ago, and starts making worse decisions because it’s trying to process too much. Delegating to a subagent is essentially a context hygiene strategy. The workhorse does the dirty work in its own room. The overseer only sees the clean summary at the end.

Both wins compound. You pay less per token and you keep the expensive model’s context clean so it reasons better. That’s the real upside.

Which Workhorse? The Three Tiers

There’s no universal right answer, it depends on your budget, your hardware, and how much you care about your code going to a third party.

Tier 1, Local and Free (Ollama + a Small Coding Model)

Run a model locally. Zero marginal cost per token. Your code never leaves your machine.

The leading option right now for coding tasks is Qwen2.5-Coder (the 7B or 14B variant). It punches above its weight on mechanical tasks: reformatting, renaming, boilerplate generation, simple refactors. Not great at architecture or reasoning about complex bugs, which is fine, because the workhorse isn’t supposed to do that.

# Pull and run Qwen2.5-Coder via Ollama
ollama pull qwen2.5-coder:14b
ollama serve

The trade-off: you need hardware (14B needs ~10GB VRAM for GPU inference, or it runs slowly on CPU), and inference is slower than a cloud API. If you’re on a beefy machine or don’t mind waiting a few extra seconds, this is the ideal setup, zero recurring cost, fully private.

Full setup for running Ollama as a workhorse in Docker with the right API shape for tool use is its own topic, the companion guide covers it: Self-Host a Local AI Coding Workhorse.

Tier 2, Free-Tier / Dirt-Cheap Cloud

If you don’t have the hardware for Tier 1, several cloud options are functionally free at moderate usage:

Gemini Flash (Google AI Studio): generous free tier, fast, solid at code tasks
Groq: extremely fast inference (they have hardware optimized for it), free tier with rate limits
DeepSeek API: very cheap paid tier, comparable to Haiku pricing, strong at coding

Honest trade-offs: free tiers are rate-limited, and some of these providers may train on your data (check their terms, Google’s free tier historically has, so read the current TOS; Groq’s usage policy is worth a read too). If you’re sending proprietary code through a free-tier API, understand what you’re agreeing to.

For hobby projects and open-source work? Gemini Flash’s free tier is genuinely useful and hard to argue with at that price point. For work code, pay for the API or go local.

Tier 3, Cheap Cloud in One Ecosystem (Claude Haiku)

This is the laziest option and sometimes the right call. If you’re already in Claude Code using Sonnet or Opus as the overseer, you can make Claude Haiku the workhorse. Same API key, same tool definitions, no second provider to configure.

In a Claude Code subagent definition, it’s literally one line:

---
name: workhorse
model: claude-haiku-4-5
---

Haiku is significantly cheaper than Sonnet, fast, and capable enough for mechanical tasks. It’s not free, but the friction of setup is near zero. If you’re already paying for Claude and just want to start saving tokens today without any infrastructure work, start here.

Wiring It Up in Claude Code

There are two approaches. The subagent approach is cleaner; the shell-out approach is more flexible.

Approach 1: The Subagent Definition (Recommended)

Create a subagent definition in .claude/agents/. The overseer dispatches tasks to it via the Agent tool, the workhorse runs in its own context window, keeping the grunt-work churn isolated.

Here’s a definition based on the rules I landed on after running this for months, the ones that actually worked:

---
name: workhorse
model: claude-haiku-4-5
---

You are the Workhorse Executor. You accept atomic, mechanical coding tasks and execute them silently.

## Rules

- Accept ONLY tasks that are self-contained and clearly specified. Max ~50 lines of change, ideally single-file.
- Execute without clarifying questions. If the task is ambiguous, report back that you need clarification — don't guess.
- If the task is too large or vague, respond: "Task too large. Break it into smaller chunks."
- Do NOT do code review. Do NOT suggest improvements. Execute only.
- On completion, report ONLY: `git diff --stat` output. Not the full diff.
- No architectural decisions. No refactoring beyond what was asked.
- Assume a 10-minute timeout. If it's taking longer, something is wrong with the task scope.

The overseer uses it like this (in a slash command or CLAUDE.md workflow):

1. Identify the scope: which files, what change, what success looks like.
2. Tell the user: "Sending these N files to workhorse: [list]"
3. Dispatch to workhorse via Agent tool with the scoped task.
4. MANDATORY: Read the affected files after workhorse completes.
5. Verify the output matches intent before reporting back.

Step 4 is not optional. The workhorse is fast and cheap, not infallible. The overseer’s review is what makes the pattern safe.

Approach 2: The Shell-Out (Slash Command Style)

This is the shell-out version I originally built, the overseer shells out to an external CLI that calls a cheap model. It’s more flexible because you can point it at any endpoint: a local Ollama server, a cloud API, whatever.

# Pattern: overseer shells out to a cheap model CLI
# What I actually ran was GitHub's `copilot` CLI pointed at a cheap model
copilot --model gpt-5-mini --yolo --silent -p "{{scoped_task_prompt}}"

That specific tool isn’t the point, anything that takes a prompt and returns a result works. The companion guide swaps this for a small local script that hits a self-hosted model, so the workhorse costs nothing and your code never leaves the box.

The Claude Code slash command that wrapped this:

Delegate a mechanical coding task to the cheap workhorse model.

Steps:
1. Identify the scope. Tell the user which files are being sent.
2. Run: copilot --model <cheap-model> --yolo --silent -p "<scoped task>"
3. Read ALL affected files after the command completes.
4. Report back — do NOT skip the file review step.

The mandatory review step in step 3 can’t be skipped. If you let the shell-out result fly without verification, you’ve introduced a blind spot. The workhorse will occasionally do something subtly wrong, especially if the task description had any ambiguity.

The Non-Negotiable Rules

This pattern is genuinely useful, but it has failure modes. Here’s where people get burned:

Scope tightly or don’t scope at all. Cheap models are great at mechanical, well-specified work. They’re bad at ambiguity. The overseer’s entire job in the delegation step is to remove every shred of ambiguity before handing off. If you can’t explain the task in two sentences with clear success criteria, don’t delegate it.

Always review the output. I mentioned this twice already. Here it is a third time. The overseer reads the changed files or the diff. Always. A workhorse that silently breaks something is worse than no workhorse, at least if you’d done it yourself, you’d know what happened.

Know the break-even point. If a task is so small that scoping it, dispatching it, and reviewing the output costs more overseer tokens than just doing it inline, do it inline. Renaming one variable? Don’t spin up a subagent for that. Renaming a variable across 40 files with a consistent pattern? That’s what the intern is for.

Don’t hard-couple to one model. This is the lesson I learned the hard way. My original setup used GPT-5-mini as the workhorse, included free with a coding tool I already paid for. Then that “included” usage got moved behind monthly-expiring prepaid credits, the small models stopped being free, and overnight my free workhorse was metered. The tool stopped being cost-effective and the setup died. The architecture was right. The coupling to a specific model was the fragile part. Keep the worker model as a single config value, one line to swap it out when the pricing changes again. Because it will.

Local = slower. Free tiers = rate limits and possible data training. Name the trade-offs before you pick a tier, not after. Don’t find out that Gemini’s free tier trained on your client’s proprietary code by reading the ToS after you’ve been using it for a month.

When This Is Worth It

This pattern pays off when:

You have long agentic sessions with a lot of mechanical work: bulk refactors, format conversions, boilerplate generation, test stub creation
You’re using an expensive frontier model (Opus, Sonnet, GPT-5) as the overseer and can clearly identify the dumb parts of its workload
You care about context window hygiene: keeping the overseer’s context clean so it reasons well throughout a long session
You’re building a CLI tool or harness on top of Claude Code and want to route tasks programmatically

Skip it when:

The task requires judgment, architectural thinking, or understanding of business logic: that stays with the overseer
The task is so small the overhead isn’t worth it
You’re using an already-cheap model as the overseer (if you’re already on Haiku or Flash, the delta isn’t there)

The Pricing Will Change. The Architecture Won’t.

The specific model names in this article will be wrong in six months. Gemini Flash will get a new version. Haiku pricing will shift. Some new model will come out of nowhere and be 10x cheaper for coding tasks for a quarter until everyone notices and the pricing adjusts.

That’s fine. The architecture is the stable part: overseer scopes and reviews, workhorse executes in isolation, outputs are always verified. Which model plays each role is a one-line config change.

Build the pattern once. Swap the models whenever the economics shift. Your token bill will appreciate it, and so will your 2 AM self when the context window isn’t clogged with 40 iterations of a file rename.

Extending Claude Code with external tools and MCP servers follows a similar composability pattern, see MCP Servers: Extending LLMs Beyond Chat for the bigger picture. If you’re using a local Ollama instance as the workhorse and want to also give Claude Code a local search tool, Claude Code + SearXNG covers the search side.

Give Your AI Agent a Cheap Intern

You Don’t Need Your Best Model for Everything

The Pattern: Overseer Scopes, Workhorse Executes

What This Actually Saves (Hint: It’s Not Just Price Per Token)

Which Workhorse? The Three Tiers

Tier 1, Local and Free (Ollama + a Small Coding Model)

Tier 2, Free-Tier / Dirt-Cheap Cloud

Tier 3, Cheap Cloud in One Ecosystem (Claude Haiku)

Wiring It Up in Claude Code

Approach 1: The Subagent Definition (Recommended)

Approach 2: The Shell-Out (Slash Command Style)

The Non-Negotiable Rules

When This Is Worth It

The Pricing Will Change. The Architecture Won’t.

Responses from around the web

Discussion

Related Posts

Karakeep: Self-Hosted Bookmarks With AI Tagging

Stop Feeding the AI Your Whole Repo

RTK vs snip vs lean-ctx: Token Killers

Claude Code in a Homelab Workflow