Skip to content
Go back

Give Your AI Agent a Cheap Intern

By SumGuy 12 min read
Give Your AI Agent a Cheap Intern

Full example: Grab the subagent definition and delegation script at github.com/KingPin/sumguy-examples/tree/main/llm/cheap-workhorse-model-ai-coding/

You Don’t Need Your Best Model for Everything

It’s 2 AM. Your AI coding agent just burned through $4 in tokens renaming 40 functions to a new convention and converting a YAML config into TOML. You watched it read every file twice, generate a massive diff, re-read the files to verify, then produce a summary of the summary.

That’s like hiring a senior architect to alphabetize your imports. Technically it works. The results are fine. But somewhere in the billing dashboard, a little icon of a burning money pile appeared.

Here’s the thing: most of what an AI coding agent actually does in a session isn’t hard. It’s repetitive, mechanical, and boring — exactly the kind of work you’d hand off to an eager-but-cheap intern. The expensive model is burning its expensive context window on function renames, format conversions, boilerplate generation, and file reads when it could be doing the interesting part: understanding what you actually want and reviewing whether the output is right.

I built a solution for this. Then the pricing rug-pulled me. Now I’m going to explain why the pattern is worth rebuilding anyway.

The Pattern: Overseer Scopes, Workhorse Executes

The core idea is simple. You have two roles:

The Overseer (your expensive model — Claude Sonnet, Opus, whatever you’re using as the “main” agent) does three things:

  1. Understands the request and scopes the task precisely
  2. Hands the scoped task to the workhorse with no ambiguity
  3. Reviews the workhorse’s output before accepting it

The Workhorse (a cheap or free model) does one thing: executes the task exactly as specified, then reports back.

The intern analogy is pretty good here. You (the senior dev) hand the intern a tightly-scoped task: “Rename every occurrence of user_id to userId across these five files. Don’t touch anything else. Show me the diff when you’re done.” The intern does it. You check it. Done.

You would never hand the intern the PRD and say “implement this feature.” That’s not what interns are for. The overseer handles all of that — the understanding, the architectural decisions, the final verification. The intern just executes.

What This Actually Saves (Hint: It’s Not Just Price Per Token)

People hear “cheap model” and think about price per million tokens. That’s real, but it’s the smaller win. The bigger one is something most articles don’t explain clearly.

Win 1: Price per token. Cheap models run anywhere from a few times to dozens of times cheaper per token than frontier models — Haiku is several times cheaper than Sonnet, while a free-tier Gemini Flash against a frontier model is effectively free. Gemini Flash, Haiku, and DeepSeek all sit in the sub-cent-per-million-token range for input. If the workhorse reads 20 files and produces a diff, you’ve spent fractions of a cent instead of a few dollars.

Win 2: The overseer’s context stays clean. This is the one people miss.

When you delegate to a subagent, that subagent runs in its own separate context window. All the grunt-work churn — the full file dumps, the retry after the first attempt was slightly off, the intermediate diffs, the tool call outputs — none of that ends up in the overseer’s context. The expensive model’s context stays small and focused on what matters.

In a long agentic session, context pollution is a serious problem. The overseer gets buried under thousands of tokens of boilerplate it generated two tasks ago, and starts making worse decisions because it’s trying to process too much. Delegating to a subagent is essentially a context hygiene strategy. The workhorse does the dirty work in its own room. The overseer only sees the clean summary at the end.

Both wins compound. You pay less per token and you keep the expensive model’s context clean so it reasons better. That’s the real upside.

Which Workhorse? The Three Tiers

There’s no universal right answer — it depends on your budget, your hardware, and how much you care about your code going to a third party.

Tier 1 — Local and Free (Ollama + a Small Coding Model)

Run a model locally. Zero marginal cost per token. Your code never leaves your machine.

The leading option right now for coding tasks is Qwen2.5-Coder (the 7B or 14B variant). It punches above its weight on mechanical tasks: reformatting, renaming, boilerplate generation, simple refactors. Not great at architecture or reasoning about complex bugs — which is fine, because the workhorse isn’t supposed to do that.

Terminal window
# Pull and run Qwen2.5-Coder via Ollama
ollama pull qwen2.5-coder:14b
ollama serve

The trade-off: you need hardware (14B needs ~10GB VRAM for GPU inference, or it runs slowly on CPU), and inference is slower than a cloud API. If you’re on a beefy machine or don’t mind waiting a few extra seconds, this is the ideal setup — zero recurring cost, fully private.

Full setup for running Ollama as a workhorse in Docker with the right API shape for tool use is its own topic — the companion guide covers it: Self-Host a Local AI Coding Workhorse.

Tier 2 — Free-Tier / Dirt-Cheap Cloud

If you don’t have the hardware for Tier 1, several cloud options are functionally free at moderate usage:

Honest trade-offs: free tiers are rate-limited, and some of these providers may train on your data (check their terms — Google’s free tier historically has, so read the current TOS; Groq’s usage policy is worth a read too). If you’re sending proprietary code through a free-tier API, understand what you’re agreeing to.

For hobby projects and open-source work? Gemini Flash’s free tier is genuinely useful and hard to argue with at that price point. For work code, pay for the API or go local.

Tier 3 — Cheap Cloud in One Ecosystem (Claude Haiku)

This is the laziest option and sometimes the right call. If you’re already in Claude Code using Sonnet or Opus as the overseer, you can make Claude Haiku the workhorse. Same API key, same tool definitions, no second provider to configure.

In a Claude Code subagent definition, it’s literally one line:

.claude/agents/workhorse.md
---
name: workhorse
model: claude-haiku-4-5
---

Haiku is significantly cheaper than Sonnet, fast, and capable enough for mechanical tasks. It’s not free, but the friction of setup is near zero. If you’re already paying for Claude and just want to start saving tokens today without any infrastructure work, start here.

Wiring It Up in Claude Code

There are two approaches. The subagent approach is cleaner; the shell-out approach is more flexible.

Create a subagent definition in .claude/agents/. The overseer dispatches tasks to it via the Agent tool — the workhorse runs in its own context window, keeping the grunt-work churn isolated.

Here’s a definition based on the rules I landed on after running this for months — the ones that actually worked:

.claude/agents/workhorse.md
---
name: workhorse
model: claude-haiku-4-5
---
You are the Workhorse Executor. You accept atomic, mechanical coding tasks and execute them silently.
## Rules
- Accept ONLY tasks that are self-contained and clearly specified. Max ~50 lines of change, ideally single-file.
- Execute without clarifying questions. If the task is ambiguous, report back that you need clarification — don't guess.
- If the task is too large or vague, respond: "Task too large. Break it into smaller chunks."
- Do NOT do code review. Do NOT suggest improvements. Execute only.
- On completion, report ONLY: `git diff --stat` output. Not the full diff.
- No architectural decisions. No refactoring beyond what was asked.
- Assume a 10-minute timeout. If it's taking longer, something is wrong with the task scope.

The overseer uses it like this (in a slash command or CLAUDE.md workflow):

Overseer delegation pattern
1. Identify the scope: which files, what change, what success looks like.
2. Tell the user: "Sending these N files to workhorse: [list]"
3. Dispatch to workhorse via Agent tool with the scoped task.
4. MANDATORY: Read the affected files after workhorse completes.
5. Verify the output matches intent before reporting back.

Step 4 is not optional. The workhorse is fast and cheap, not infallible. The overseer’s review is what makes the pattern safe.

Approach 2: The Shell-Out (Slash Command Style)

This is the shell-out version I originally built — the overseer shells out to an external CLI that calls a cheap model. It’s more flexible because you can point it at any endpoint: a local Ollama server, a cloud API, whatever.

Terminal window
# Pattern: overseer shells out to a cheap model CLI
# What I actually ran was GitHub's `copilot` CLI pointed at a cheap model
copilot --model gpt-5-mini --yolo --silent -p "{{scoped_task_prompt}}"

That specific tool isn’t the point — anything that takes a prompt and returns a result works. The companion guide swaps this for a small local script that hits a self-hosted model, so the workhorse costs nothing and your code never leaves the box.

The Claude Code slash command that wrapped this:

.claude/commands/delegate.md
Delegate a mechanical coding task to the cheap workhorse model.
Steps:
1. Identify the scope. Tell the user which files are being sent.
2. Run: copilot --model <cheap-model> --yolo --silent -p "<scoped task>"
3. Read ALL affected files after the command completes.
4. Report back — do NOT skip the file review step.

The mandatory review step in step 3 can’t be skipped. If you let the shell-out result fly without verification, you’ve introduced a blind spot. The workhorse will occasionally do something subtly wrong, especially if the task description had any ambiguity.

The Non-Negotiable Rules

This pattern is genuinely useful, but it has failure modes. Here’s where people get burned:

Scope tightly or don’t scope at all. Cheap models are great at mechanical, well-specified work. They’re bad at ambiguity. The overseer’s entire job in the delegation step is to remove every shred of ambiguity before handing off. If you can’t explain the task in two sentences with clear success criteria, don’t delegate it.

Always review the output. I mentioned this twice already. Here it is a third time. The overseer reads the changed files or the diff. Always. A workhorse that silently breaks something is worse than no workhorse — at least if you’d done it yourself, you’d know what happened.

Know the break-even point. If a task is so small that scoping it, dispatching it, and reviewing the output costs more overseer tokens than just doing it inline — do it inline. Renaming one variable? Don’t spin up a subagent for that. Renaming a variable across 40 files with a consistent pattern? That’s what the intern is for.

Don’t hard-couple to one model. This is the lesson I learned the hard way. My original setup used GPT-5-mini as the workhorse, included free with a coding tool I already paid for. Then that “included” usage got moved behind monthly-expiring prepaid credits, the small models stopped being free, and overnight my free workhorse was metered. The tool stopped being cost-effective and the setup died. The architecture was right. The coupling to a specific model was the fragile part. Keep the worker model as a single config value — one line to swap it out when the pricing changes again. Because it will.

Local = slower. Free tiers = rate limits and possible data training. Name the trade-offs before you pick a tier, not after. Don’t find out that Gemini’s free tier trained on your client’s proprietary code by reading the ToS after you’ve been using it for a month.

When This Is Worth It

This pattern pays off when:

Skip it when:

The Pricing Will Change. The Architecture Won’t.

The specific model names in this article will be wrong in six months. Gemini Flash will get a new version. Haiku pricing will shift. Some new model will come out of nowhere and be 10x cheaper for coding tasks for a quarter until everyone notices and the pricing adjusts.

That’s fine. The architecture is the stable part: overseer scopes and reviews, workhorse executes in isolation, outputs are always verified. Which model plays each role is a one-line config change.

Build the pattern once. Swap the models whenever the economics shift. Your token bill will appreciate it, and so will your 2 AM self when the context window isn’t clogged with 40 iterations of a file rename.


Extending Claude Code with external tools and MCP servers follows a similar composability pattern — see MCP Servers: Extending LLMs Beyond Chat for the bigger picture. If you’re using a local Ollama instance as the workhorse and want to also give Claude Code a local search tool, Claude Code + SearXNG covers the search side.


Share this post on:

Send a Webmention

Written about this post on your own site? Send a webmention and it'll show up above once verified.


Previous Post
WebAuthn & Passkeys for Sysadmins
Next Post
Borg vs Duplicacy: Dedup Backup Wars

Discussion

Powered by Garrul . Sign in with GitHub or Google, or post anonymously.

Related Posts