Self-Host a Local AI Coding Workhorse

By SumGuy 14 min read
If you read the pattern piece, you know the idea: let a cheap workhorse model handle the mechanical grunt work — boilerplate, refactors, search-and-replace — while an expensive overseer model like Claude scopes and reviews. The whole point is to stop feeding your API bill things that don’t need a frontier model.

That piece covered three workhorse tiers. This is the deep-dive on Tier 1: local and free. We’re going to self-host a small coding model in Docker, expose an OpenAI-compatible endpoint, and wire it up so Claude can hand off grunt work to it — with your code never leaving the machine.

Full example: Clone the working files (compose stacks + delegate script) at github.com/KingPin/sumguy-examples/tree/main/llm/local-workhorse-ollama-docker-claude/


Why Local, Why Now

Here’s the honest pitch. A local workhorse is:

The trade-off is real (we’ll get to that in the reality-check section), but if you’re doing a lot of mechanical coding work and you have a machine with 8–16 GB of VRAM sitting around, this is worth the hour to set up.


Pick Your Backend: Ollama or llama.cpp

There’s already a full backend comparison on this site if you want the deep breakdown. For this specific use case — running a small coding model as a local API endpoint — here’s the short version:

Ollama is the easy choice. Built-in model management (ollama pull, ollama list), a clean REST API, OpenAI-compatible at /v1, and a Docker image that Just Works. The DX is good. If you’ve never run a local model before, start here.

llama.cpp (llama-server) is for the tinkerers. Leaner binary, more control over quantization and thread counts, same OpenAI-compatible API. If you want to run a specific GGUF you downloaded, or you care about squeezing every token/sec out of your hardware, this is your path. More setup friction, more knobs to turn.

My recommendation: Ollama for most people, llama.cpp if you already know what GGUF means and you’re annoyed that I explained it.

Both sections are below. Pick one, skip the other.


Pick a Model: Small but Decent at Code

For mechanical coding tasks, you want something in the Qwen2.5-Coder family. It’s purpose-trained for code, genuinely good at mechanical tasks, and comes in sizes that are honest about what they need.

| Model | VRAM (GPU) | RAM (CPU-only) | Sweet spot |
| --- | --- | --- | --- |
| qwen2.5-coder:3b | ~3 GB | ~4–5 GB | Modest hardware, quick tasks |
| qwen2.5-coder:7b | ~6–8 GB | ~10–12 GB | Better quality, most desktop GPUs |
| qwen2.5-coder:14b | ~12–16 GB | ~20+ GB | High-end workstation |

The 7B is the sweet spot for most home lab setups. If you have an older GPU with less VRAM, the 3B is not embarrassing — it’s still solid for refactors and boilerplate.
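If you'd rather encode the table than eyeball it, here is a minimal sketch that picks the largest model that should fit on a given card. The thresholds are the upper ends of the approximate VRAM ranges above, and `pick_model` is a hypothetical helper for illustration, not part of the example repo. Real memory use also depends on quantization and context size, so treat the result as a starting point.

```python
from typing import Optional

# Rough VRAM needs in GB, taken from the upper end of each range in the table.
# These are approximations, not measured values.
MODEL_VRAM_GB = [
    ("qwen2.5-coder:14b", 16),
    ("qwen2.5-coder:7b", 8),
    ("qwen2.5-coder:3b", 3),
]


def pick_model(vram_gb: float) -> Optional[str]:
    """Return the largest model whose rough VRAM need fits, or None."""
    for name, needed_gb in MODEL_VRAM_GB:
        if vram_gb >= needed_gb:
            return name
    return None  # nothing fits on the GPU; consider CPU-only with the 3B
```

Because the thresholds are conservative, a 12 GB card lands on the 7B here even though the 14B might squeeze in with a tighter quantization.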

Other options worth knowing about: deepseek-coder-v2:16b if you have the VRAM, codellama:7b as a fallback, starcoder2:7b for multilingual work. And don’t sleep on general-purpose models — when I tested this setup (more on that below), I pointed it at a 12B Gemma-class model that isn’t code-specialized, and it handled the mechanical tasks fine. A code-tuned model is the safer default, but if you’ve already got a solid general model loaded, try it before downloading another 8 GB. Honestly, though, just start with Qwen2.5-Coder — it benchmarks well on exactly the kind of tasks we’re using it for.

Important framing: this model does not need to understand your architecture. It doesn’t need to reason about system design. It needs to rename all instances of UserManager to UserService across 12 files, or generate a CRUD handler from a schema you paste in. Small is fine for that.


Option A: Ollama via Docker

This is the recommended path. One compose file, pull your model, you’re done.

docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # Remove the 'deploy' block entirely if you have no NVIDIA GPU
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:

CPU-only users: Delete the deploy: block entirely. It’ll work, just slower. We’ll talk about what “slower” means in the reality check.

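AMD GPU users: Use the ollama/ollama:rocm image tag instead, delete the NVIDIA deploy: block, and pass the /dev/kfd and /dev/dri devices through to the container. The host needs a working AMD GPU driver from the ROCm stack.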

Start it up:

Terminal window
docker compose up -d
# Pull your model (the ~4-5 GB download happens once, stored in the volume)
docker exec ollama ollama pull qwen2.5-coder:7b
# Verify it loaded
docker exec ollama ollama list

Test the endpoint:

Terminal window
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:7b",
    "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}]
  }'

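If you get a JSON response with a code block in it, you’re set. Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1, which means most tools that speak the OpenAI chat API can point at it. Compatibility covers the common endpoints rather than every OpenAI feature, so test anything exotic before you rely on it.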
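The same smoke test works from Python with nothing but the standard library. This is a minimal sketch assuming the Ollama defaults from the compose file (port 11434, model qwen2.5-coder:7b). The function names are mine, not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # assumption: Ollama default port
MODEL = "qwen2.5-coder:7b"


def build_chat_request(prompt: str, model: str = MODEL,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(payload: dict) -> str:
    """Pull the assistant text out of a chat completions response."""
    return payload["choices"][0]["message"]["content"]


def ask(prompt: str) -> str:
    """Send one prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        return extract_reply(json.load(resp))
```

Calling `ask("Write a Python function that reverses a string.")` should return the same content the curl command shows, provided the container is up and the model has been pulled.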


Option B: llama.cpp via Docker

If you want leaner footprint or you’ve got a specific GGUF downloaded, llama-server from the llama.cpp project exposes the same OpenAI-compatible API.

docker-compose.yml
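services:
  llama-server:
    # The plain :server tag is the CPU build. For GPU offload, use a GPU build
    # (e.g. the :server-cuda tag for NVIDIA) and give the container GPU access
    # with a deploy: block like the one in the Ollama example.
    image: ghcr.io/ggml-org/llama.cpp:server
    container_name: llama-server
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
    command: >
      -m /models/qwen2.5-coder-7b-instruct-q4_k_m.gguf
      --host 0.0.0.0
      --port 8080
      --n-gpu-layers 99
      --ctx-size 8192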

You’ll need to download the GGUF manually and drop it in a ./models/ directory alongside your compose file. The Hugging Face repo for Qwen2.5-Coder has the quantized files — grab the q4_k_m variant, it’s the best quality-to-size ratio for most tasks.

The endpoint is http://localhost:8080/v1 for OpenAI-compatible requests.

--n-gpu-layers 99 tells llama.cpp to offload everything onto the GPU — it clamps to the model’s actual layer count, so a deliberately high number just means “put it all on the card.” Set to 0 for CPU-only, or dial it down if the model doesn’t fit in VRAM (you’ll see out-of-memory errors at load time if it doesn’t).
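Before wiring anything else up, it helps to confirm the server is reachable and which model it reports. Here is a small sketch, assuming the server implements the OpenAI-style GET /v1/models route. Both llama-server and Ollama document one, but check your version. `list_models` and `parse_model_ids` are hypothetical helper names.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str) -> list[str]:
    """Ask the server which models it exposes, e.g. base_url='http://localhost:8080/v1'."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return parse_model_ids(json.load(resp))
```

If `list_models("http://localhost:8080/v1")` raises a connection error, the container is not listening yet. A model that fails to load usually shows up in `docker logs llama-server` first.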


Wiring It to Claude (The Overseer)

This is the part that actually matters. Claude Code runs Claude models for its subagents — there’s no built-in way to reroute subagent calls to a local model. What you’re doing instead is giving the overseer a tool it can invoke to delegate specific tasks to the local worker.

The simplest version: a delegation script that takes a task description and optional file contents, fires them at the local OpenAI endpoint, and returns the result.

The Delegation Script

delegate.py
#!/usr/bin/env python3
"""
Local workhorse delegate: sends a task to a local Ollama/llama.cpp endpoint.
Usage: python3 delegate.py "your task description" [file1.py file2.py ...]
"""
import sys
import os
from pathlib import Path

from openai import OpenAI

# Point at your local model server
# Change port to 8080 if using llama.cpp
BASE_URL = os.getenv("WORKHORSE_URL", "http://localhost:11434/v1")
MODEL = os.getenv("WORKHORSE_MODEL", "qwen2.5-coder:7b")

client = OpenAI(api_key="ollama", base_url=BASE_URL)


def build_context(file_paths: list[str]) -> str:
    parts = []
    for path in file_paths:
        p = Path(path)
        if p.exists():
            parts.append(f"### {path}\n```\n{p.read_text()}\n```")
        else:
            print(f"Warning: {path} not found, skipping", file=sys.stderr)
    return "\n\n".join(parts)


def main():
    if len(sys.argv) < 2:
        print("Usage: delegate.py <task> [file1 file2 ...]", file=sys.stderr)
        sys.exit(1)

    task = sys.argv[1]
    files = sys.argv[2:]

    messages = [
        {
            "role": "system",
            "content": (
                "You are a precise code assistant. When asked to modify code, "
                "output ONLY the complete updated file contents with no explanation. "
                "When asked a question, answer concisely."
            ),
        }
    ]

    if files:
        context = build_context(files)
        messages.append({
            "role": "user",
            "content": f"{task}\n\n{context}"
        })
    else:
        messages.append({"role": "user", "content": task})

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        temperature=0.1,  # low temp: we want deterministic edits, not creativity
        # If your model has a "thinking"/reasoning mode, turn it OFF for grunt work.
        # This was an 18x speedup in testing (38s -> 2s). This particular knob is
        # honored by llama.cpp's server; Ollama silently ignores unknown extra_body
        # fields (harmless), and the flag name varies by model -- so test it.
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )
    print(response.choices[0].message.content)


if __name__ == "__main__":
    main()

Install the dependency once: pip install openai (it’s the official client, works with any OpenAI-compatible endpoint).
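The system prompt asks for bare file contents, but smaller models sometimes wrap their reply in a Markdown fence anyway. A defensive sketch for that case follows. `strip_code_fences` is a hypothetical helper, not part of the script above, and it only handles a single fence wrapping the whole reply.

```python
FENCE = "`" * 3  # a Markdown code fence marker


def strip_code_fences(text: str) -> str:
    """Remove one wrapping Markdown code fence, if the reply has one."""
    lines = text.strip().splitlines()
    wrapped = (
        len(lines) >= 2
        and lines[0].startswith(FENCE)   # opening fence, optionally with a language tag
        and lines[-1].strip() == FENCE   # closing fence
    )
    if wrapped:
        return "\n".join(lines[1:-1])
    return text.strip()
```

In delegate.py this would wrap `response.choices[0].message.content` before printing, so the overseer always receives plain code.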

Calling It From Claude Code

Now tell Claude about the tool. Add this to your project’s .claude/commands/ directory:

.claude/commands/delegate.md
Run a mechanical coding task using the local workhorse model.
Usage: /delegate <task description> [file paths]
This invokes the local model (Ollama/llama.cpp) for grunt work: refactoring,
boilerplate generation, conversions, renames. You MUST review the output before
accepting — check the diff, verify it compiles, make sure it didn't break
anything obvious.
Example:
$ARGUMENTS
Steps:
1. Run: python3 /path/to/delegate.py $ARGUMENTS
2. Review the output carefully — read every line the local model changed
3. If acceptable, apply the changes; if not, note what it got wrong and retry with clearer scope

Now in a Claude Code session, you can type /delegate "rename UserManager to UserService in" src/auth/manager.py and Claude will invoke the script, receive the output, and (critically) review it before touching anything.
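The quoting in that command matters, because delegate.py treats the first argument as the task and everything after it as file paths. Here is a sketch of how the split works, using Python's shlex to mimic POSIX shell word splitting. It assumes the argument string reaches the shell unchanged, and `split_delegate_args` is a hypothetical helper for illustration only.

```python
import shlex


def split_delegate_args(arguments: str) -> tuple[str, list[str]]:
    """Split a /delegate argument string into (task, file paths)."""
    parts = shlex.split(arguments)
    if not parts:
        raise ValueError("expected a task description")
    return parts[0], parts[1:]


task, files = split_delegate_args(
    '"rename UserManager to UserService in" src/auth/manager.py'
)
print(task)   # rename UserManager to UserService in
print(files)  # ['src/auth/manager.py']
```

Drop the quotes and the task becomes just `rename`, with every remaining word treated as a file path that the script will warn about and skip.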

Or Use an Existing OpenAI-Compatible CLI

If you’d rather not write a script, several tools speak OpenAI-compatible endpoints out of the box:

aider works directly with Ollama:

Terminal window
# Aider with Ollama backend — point at a non-default host via env var
OLLAMA_API_BASE=http://localhost:11434 aider --model ollama/qwen2.5-coder:7b
# Or with llama.cpp
OPENAI_API_BASE=http://localhost:8080/v1 OPENAI_API_KEY=none \
aider --openai-api-key none --model openai/qwen2.5-coder:7b

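Many OpenAI-compatible CLIs can be pointed at the local endpoint via environment variables. The variable name depends on the tool: OPENAI_API_BASE is common, while the official OpenAI SDKs read OPENAI_BASE_URL, so check your tool’s docs if the first one has no effect: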

Terminal window
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=ollama # value doesn't matter, Ollama ignores it

The overseer (Claude) still reviews the output either way. That part is non-negotiable.

The Review Step Is Not Optional

Worth being explicit about this: the local model is fast, free, and dumber than frontier Claude. It will occasionally produce something that looks correct but is subtly wrong: rename the right thing in the wrong context, miss an edge case, or generate boilerplate with a bug you didn’t catch. That’s not a failure; it’s what the design expects. The entire point of the pattern is that Claude reviews the diff before it lands.

The delegate script only prints the proposed file contents; it doesn’t write anything itself. Once those changes have been applied to the working tree, have Claude run a git diff. Read it. If it looks right, accept it. If it’s off, revert, scope the task more narrowly, and retry. You’re trading some review overhead for free tokens on the mechanical parts.
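Since delegate.py prints a full replacement file rather than a patch, one way to review before anything touches disk is to diff the proposal against the original in memory. This is a minimal sketch using the standard library's difflib. `review_diff` is a hypothetical helper, and the a/ and b/ prefixes just imitate git's output.

```python
import difflib


def review_diff(original: str, proposed: str, path: str) -> str:
    """Return a unified diff of the proposed contents, or '' if nothing changed."""
    diff = difflib.unified_diff(
        original.splitlines(),
        proposed.splitlines(),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        lineterm="",  # lines were split without newlines, so don't add any
    )
    return "\n".join(diff)
```

An empty string means the model returned the file unchanged, which is worth flagging on its own: the task was probably scoped too vaguely.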

What Happened When I Actually Wired This Up

I didn’t want to ship this on theory, so I pointed Claude at a real local model — a general-purpose 12B (Gemma-class) running under llama-server on a box on my network — and handed it a grunt task through the exact delegate.py above: add type hints and a one-line docstring to each function in this file.

The output was genuinely clean. Correct type hints (id: int, -> None), a docstring per function, it even added the from typing import Any import on its own — and thanks to the system prompt, it came back as bare code with no markdown fences to strip. A general model that isn’t even code-tuned nailed a mechanical code task, which is the whole point: this work doesn’t need a genius.

But the first run took 38 seconds for a six-line file, and the reported usage was absurd — nearly 1,900 tokens for an 80-token answer. That’s the tell. This is a reasoning model, and it was burning a giant invisible chain-of-thought before writing a single line of output. For grunt work, that’s pure latency tax — you do not want the model “thinking hard” about how to add a type hint.

The fix is one request parameter. On llama.cpp’s OpenAI-compatible endpoint (llama-server, which is what I was testing against), disable the thinking phase:

the one line that mattered
# passed via the OpenAI client's extra_body
extra_body={"chat_template_kwargs": {"enable_thinking": False}}

Same task, thinking off: 2.1 seconds, 93 output tokens, output just as clean. That’s an 18x speedup from a single flag. The exact knob varies by model — some want /no_think in the prompt, some a reasoning_effort field — and for this build only enable_thinking: false actually did anything; the others I tried were quietly ignored and stayed slow. So test it.
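The tell described above (far more reported tokens than visible output) is easy to check automatically. Here is a rough sketch, assuming about 4 characters per token, which is a common rule of thumb and not an exact tokenizer count. The function names and the 5x threshold are my own choices for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def reasoning_overhead(completion_tokens: int, visible_text: str) -> float:
    """Ratio of reported completion tokens to the tokens you can actually see."""
    return completion_tokens / estimate_tokens(visible_text)


def looks_like_hidden_reasoning(completion_tokens: int, visible_text: str,
                                threshold: float = 5.0) -> bool:
    """Flag replies whose reported usage dwarfs the visible output."""
    return reasoning_overhead(completion_tokens, visible_text) >= threshold
```

With the OpenAI client, the inputs would be `response.usage.completion_tokens` and `response.choices[0].message.content`. The first run above (roughly 1,900 tokens for an 80-token answer) trips the check; the second (93 tokens) does not.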

The lesson: if your local model has a reasoning mode, turn it off for the workhorse. Reasoning belongs to the overseer; the workhorse just types fast. With it off, a local 12B round-trips a small edit in ~2 seconds — fast enough that the break-even math tilts a lot further toward “just delegate it.”


Reality Check: Is This Worth It?

Local is not always the right call. Let’s be honest.

The bad news:

The good news:

The gut-check question: Are you doing 50+ delegated tasks a week? Do you have a GPU with 8+ GB VRAM, or are you okay with CPU latency? Does code privacy matter more than convenience? If yes to two or more: set it up. Otherwise, Haiku at a fraction of a cent per task is the better answer and you should go read the tier 3 section of the pattern article instead.
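The gut check reduces to counting yes answers across the three questions. Here it is as a tiny sketch. The thresholds come straight from the paragraph above; the function and parameter names are hypothetical.

```python
def should_self_host(tasks_per_week: int, vram_gb: float,
                     cpu_latency_ok: bool, privacy_first: bool) -> bool:
    """Return True when at least two of the three gut-check answers are yes."""
    answers = [
        tasks_per_week >= 50,            # enough volume to matter
        vram_gb >= 8 or cpu_latency_ok,  # hardware, or patience for CPU
        privacy_first,                   # code privacy beats convenience
    ]
    return sum(answers) >= 2


print(should_self_host(60, 8, False, False))  # True: volume and hardware
print(should_self_host(10, 0, False, True))   # False: only privacy
```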


You’ve Got a Free Intern

Here’s where you are now: you have a local model server running in Docker, an OpenAI-compatible endpoint at localhost:11434/v1 (or :8080/v1), and a delegation script that Claude can invoke via a slash command.

Your overseer (Claude) scopes the task, hands it to the workhorse via the delegate script, reads the diff, and signs off. The workhorse does the boring parts for free. Your code never leaves the machine.

It’s like having an intern who can type really fast and never complains about refactoring work. You still have to review everything they touch, but at least you’re not paying per keystroke.

Go back to the pattern article if you want to see how Tier 1 (local, this article), Tier 2 (free cloud), and Tier 3 (cheap cloud) fit together — and how to decide which one to reach for on any given task.

Your 2 AM refactor just got a lot cheaper to run.

