The Leap from Chatbot to Agent is Just Structured Tool Use
Here’s the thing: the jump from “GPT that answers questions” to “AI agent that actually gets stuff done” isn’t some magical leap in model intelligence. It’s a much simpler party trick — structured tool calling. The model stops spitting out prose and starts emitting JSON that says “hey, I need to call the weather API with these parameters” or “run this bash command” or “look this up in the database.”
For a long time, this was closed off to local models. You had to ship your request to OpenAI, Anthropic, or whichever hosted provider was in favor that month, let their API handle the tool-calling format, and hope the model didn’t hallucinate function names. But in 2026, running a capable function-calling model locally is absolutely doable, and honestly more reliable than you’d expect. Models like Llama 3.3, Qwen 2.5, and Hermes 3 can do this. Ollama makes it smooth. llama.cpp gives you the fine-grained control. And the patterns? They’re not even that weird once you understand what’s actually happening.
This article walks through what function calling actually is, which models do it well, the tooling (Ollama, llama.cpp, grammar-constrained generation), and a real working example: a local agent that queries a weather API and calls a calculator. We’ll also dig into what breaks, why, and how to fix it.
What Actually Happens When a Model “Calls a Function”
Function calling isn’t the model reaching into your filesystem and executing code. What’s happening is this:
- You give the model a list of available tools in a structured format (JSON schema, usually).
- You ask the model a question that requires using one or more of those tools.
- The model, trained to recognize the pattern, emits a structured response instead of freeform text. That response says “use tool X with arguments Y.”
- Your code parses that structured response, calls the actual tool, and feeds the result back to the model.
- The model, now armed with the tool output, answers the original question.
The model isn’t “calling” anything — your orchestration layer is. The model is just predicting what should be called next, and it’s doing so in a format you can parse reliably.
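The five steps above can be sketched without any real model at all. In this minimal sketch, `fake_model` is a hypothetical stand-in that requests one tool call and then answers; the `tool_call` message shape is an assumption made up for illustration, not any library's wire format.

```python
def get_weather(location):
    # Stand-in for a real API call.
    return f"62F and cloudy in {location}"

# Step 1: the tools your orchestration layer is willing to run.
TOOLS = {"get_weather": get_weather}

def fake_model(messages):
    """Hypothetical LLM stand-in: asks for a tool once, then answers."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"role": "assistant", "content": f"Here is what I found: {last['content']}"}
    return {
        "role": "assistant",
        "content": "",
        "tool_call": {"name": "get_weather", "arguments": {"location": "Portland, OR"}},
    }

def run_once(question):
    messages = [{"role": "user", "content": question}]   # step 2: the user's question
    reply = fake_model(messages)                          # step 3: structured request
    while "tool_call" in reply:
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])  # step 4: YOUR code runs the tool
        messages.append(reply)
        messages.append({"role": "tool", "content": result})
        reply = fake_model(messages)                      # step 5: model answers with the result
    return reply["content"]

print(run_once("What's the weather in Portland?"))
```

Notice that the only line that executes anything is the dictionary lookup in `run_once`. Swap `fake_model` for a real client call and the shape of the loop stays the same.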
This is why the OpenAI tools format became so influential: it standardizes how you describe tools and how the model responds. But local models have options now, and some are arguably better for constrained generation.
Function Calling Formats: OpenAI, Ollama, and Native Llama 3.1+
OpenAI Tools Format
The OpenAI format is the lingua franca. You define tools like this:
```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get the current weather in a location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "The city and state, e.g. San Francisco, CA"
        },
        "unit": {
          "type": "string",
          "enum": ["celsius", "fahrenheit"],
          "description": "Temperature unit"
        }
      },
      "required": ["location"]
    }
  }
}
```
The model sees this, understands the schema, and responds with something like:
```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "arguments": "{\"location\": \"Seattle, WA\", \"unit\": \"fahrenheit\"}"
  }
}
```
Clean. Parseable. Standard across most LLM APIs. Ollama’s API mirrors this almost exactly when you set up tools.
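One wrinkle worth handling up front: in the response above, `arguments` is a JSON string, while some runtimes hand you an already-parsed object. A small normalizer keeps the rest of your loop from caring. This is a sketch; `normalize_tool_call` is a hypothetical helper name, and it assumes the call has the `function.name` / `function.arguments` shape shown above.

```python
import json

def normalize_tool_call(call):
    """Return (name, args_dict) whether arguments arrive as a JSON string or a dict."""
    fn = call["function"]
    args = fn.get("arguments", {})
    if isinstance(args, str):
        # OpenAI-style responses encode the arguments as a JSON string.
        args = json.loads(args) if args.strip() else {}
    if not isinstance(args, dict):
        raise ValueError(f"arguments for {fn['name']} must be an object, got {type(args).__name__}")
    return fn["name"], args

call = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "arguments": "{\"location\": \"Seattle, WA\", \"unit\": \"fahrenheit\"}",
    },
}
print(normalize_tool_call(call))
```

A `json.JSONDecodeError` here is useful signal in its own right: it means the model produced malformed arguments, and you can feed that error back instead of crashing.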
Ollama’s Tool Support
Ollama supports an OpenAI-style tools format natively (added in mid-2024). You pass tools as part of the request, and Ollama handles routing the model’s output through the tool-calling path. The upside: it’s familiar. The downside: Ollama’s implementation is thin. It relies on the model’s training to follow the schema, with no constraint enforcement on the arguments that come back.
Llama 3.1+ Native Format
Llama 3.1 introduced an alternative: a built-in tool-use format baked into the model’s chat template and training. Alongside plain JSON, the model can use special tokens (such as <|python_tag|>) to mark tool calls. This is theoretically more robust because the model was trained on exactly this format, though nothing enforces it during sampling. In practice, most local inference engines (Ollama, llama.cpp) convert the output back to JSON for easy consumption. You usually don’t notice the difference.
llama.cpp’s Grammar-Constrained Generation (GBNF)
Here’s where it gets interesting: llama.cpp supports GBNF (EBNF-style grammars) to force the model to emit JSON that matches your schema. No hallucinated argument names. No malformed JSON. The model’s sampling is constrained to only tokens that are valid according to your grammar.
```shell
./llama-cli -m model.gguf -p "What's the weather?" \
  --grammar-file schema.gbnf
```
(Older llama.cpp builds name the binary ./main.) This is powerful for reliability, especially when you’re running a smaller or less-trained model locally.
Models That Actually Do This Well in 2026
Not every model is trained for function calling. You need one where the developer intentionally included tool-use examples in the training data.
Llama 3.3 (70B), with Llama 3.1 8B as the small option: The gold standard. Excellent function calling, native tool-use tokens, solid reasoning. If you’re running local models for serious work, start here.
Qwen 2.5 (72B, 7B): Strong tool use, competitive with Llama 3.3, slightly more forgiving with schema variations.
Hermes 3 (8B, 70B, 405B): The 405B is overkill for most local setups, but if you have the VRAM, it’s phenomenally reliable at tool calling. The 8B and 70B variants are the realistic local picks.
Mistral Nemo (12B): Surprisingly capable for its size. Not as robust as Llama 3.3, but gets the job done in tight memory budgets.
Functionary (7B, based on Mistral): Purpose-built for tool calling. If you’re serious about function-calling agents, this is worth trying.
Avoid: Older models (Llama 2, Mistral 7B base, anything before 2024). They weren’t trained on tool examples and will hallucinate.
The Practical Setup: Ollama + Python
Let’s build something real. A local agent that:
- Takes a question like “What’s the weather in Portland and add 5 to 32?”
- Calls a mock weather API.
- Calls a calculator.
- Synthesizes the answer.
First, start Ollama with a capable model:
```shell
ollama serve               # in one terminal; skip if Ollama already runs as a service
ollama pull llama3.3:70b   # in another terminal
```
Then, Python code to orchestrate the function calling:
```python
import json

from ollama import Client

MODEL = "llama3.3:70b"  # must match the tag you pulled

# Initialize Ollama client
client = Client(host="http://localhost:11434")

# Define tools schema (OpenAI format)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a specified location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name or city, state"},
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit",
                    },
                },
                "required": ["location"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Perform basic arithmetic operations",
            "parameters": {
                "type": "object",
                "properties": {
                    "operation": {
                        "type": "string",
                        "enum": ["add", "subtract", "multiply", "divide"],
                        "description": "The arithmetic operation",
                    },
                    "a": {"type": "number", "description": "First number"},
                    "b": {"type": "number", "description": "Second number"},
                },
                "required": ["operation", "a", "b"],
            },
        },
    },
]


# Fake tool implementations
def get_weather(location, unit="fahrenheit"):
    """Mock weather API (the numbers are canned; no unit conversion is done)."""
    weather_db = {
        "seattle, wa": {"temp": 58, "condition": "rainy"},
        "portland, or": {"temp": 62, "condition": "cloudy"},
        "san francisco, ca": {"temp": 72, "condition": "sunny"},
    }
    data = weather_db.get(location.lower(), {"temp": 70, "condition": "unknown"})
    return f"The weather in {location} is {data['condition']}, {data['temp']}°{unit[0].upper()}."


def calculator(operation, a, b):
    """Simple calculator"""
    if operation == "divide" and b == 0:
        return "Error: division by zero"
    ops = {
        "add": lambda: a + b,
        "subtract": lambda: a - b,
        "multiply": lambda: a * b,
        "divide": lambda: a / b,
    }
    if operation not in ops:
        return f"Error: unknown operation '{operation}'"
    return f"{a} {operation} {b} = {ops[operation]()}"


# Tool execution dispatcher
def execute_tool(tool_name, args):
    """Call the actual tool based on name and args"""
    handlers = {"get_weather": get_weather, "calculator": calculator}
    if tool_name not in handlers:
        return f"Unknown tool: {tool_name}"
    try:
        return handlers[tool_name](**args)
    except TypeError as exc:
        # Hallucinated or missing arguments end up here; report instead of crashing.
        return f"Error: bad arguments for {tool_name}: {exc}"


# Main agent loop
def run_agent(user_query, max_iterations=5):
    """Run the agent with function calling"""
    messages = [{"role": "user", "content": user_query}]
    print(f"\n[User] {user_query}")

    for _ in range(max_iterations):
        # Call the model with tools
        response = client.chat(model=MODEL, messages=messages, tools=tools, stream=False)
        assistant_message = response["message"]

        # No tool calls means the model gave a direct answer
        if not assistant_message.get("tool_calls"):
            print(f"[Agent] {assistant_message['content']}")
            return assistant_message["content"]

        # Keep the assistant turn in the history, then append each tool result
        messages.append(assistant_message)
        for tool_call in assistant_message["tool_calls"]:
            tool_name = tool_call["function"]["name"]
            tool_args = tool_call["function"]["arguments"]

            # Parse arguments (handle both string and dict formats)
            if isinstance(tool_args, str):
                tool_args = json.loads(tool_args)

            print(f"[Tool Call] {tool_name}({tool_args})")
            tool_result = execute_tool(tool_name, tool_args)
            print(f"[Tool Result] {tool_result}")

            messages.append({"role": "tool", "tool_name": tool_name, "content": tool_result})

    return "Max iterations reached without answer"


# Test it
if __name__ == "__main__":
    result = run_agent("What's the weather in Portland, OR? Then add 5 to that temperature.")
    print(f"\n[Final Answer] {result}")
```
Run it:
```text
$ python agent.py
[User] What's the weather in Portland, OR? Then add 5 to that temperature.
[Tool Call] get_weather({'location': 'Portland, OR', 'unit': 'fahrenheit'})
[Tool Result] The weather in Portland, OR is cloudy, 62°F.
[Tool Call] calculator({'operation': 'add', 'a': 62, 'b': 5})
[Tool Result] 62 add 5 = 67
[Agent] The weather in Portland, OR is cloudy with a temperature of 62°F. Adding 5 to that would be 67°F.
```
Works. Actually works.
Why Models Mess Up: Hallucinated Args, Infinite Loops, Parallel Calls
Function calling isn’t magic. Models still hallucinate. Here’s what breaks:
Hallucinated argument names. The model sees your schema and invents a parameter that doesn’t exist. You ask for weather with location but the model emits {"city": "Portland"} instead. Worse at the 8B scale, better with Llama 3.3 70B.
Malformed JSON. The model emits close-enough JSON that regular parsers choke on. Missing quotes, stray commas, incomplete structures. Grammar-constrained generation in llama.cpp eliminates this entirely.
Infinite tool loops. The model calls a tool, gets the result, and decides to call the same tool again in a loop, never reaching a conclusion. Usually happens when the system prompt is unclear about when to stop using tools.
Parallel tool calls. Some models (especially ones trained on concurrent tool execution, like GPT-4 Turbo) emit multiple tool calls in one response. Most local models don’t, but if yours does and your orchestration expects one at a time, you’ll silently drop calls or stall waiting on results that never arrive.
Ignoring tool results. The model calls a tool, you feed it the result, and the model acts like the result doesn’t exist. Often a sign of a weak model or a system prompt that didn’t clearly explain the loop.
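For the infinite-loop case in particular, a cheap guard is to remember which (tool, arguments) pairs you have already executed and refuse to run an exact repeat. The sketch below is one way to do that; `RepeatGuard` is a hypothetical helper, not part of Ollama or llama.cpp, and it assumes the arguments are JSON-serializable.

```python
import json

class RepeatGuard:
    """Tracks executed tool calls and flags exact repeats."""

    def __init__(self, max_repeats=1):
        self.max_repeats = max_repeats
        self.counts = {}

    def allow(self, name, args):
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} count as the same call.
        key = (name, json.dumps(args, sort_keys=True))
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.max_repeats

guard = RepeatGuard()
print(guard.allow("get_weather", {"location": "Portland, OR"}))  # first time: allowed
print(guard.allow("get_weather", {"location": "Portland, OR"}))  # exact repeat: refused
```

When `allow` returns False, one option is to skip the call and send the model a tool message along the lines of "you already have this result, answer the question now". That usually breaks the loop without burning the rest of your iteration budget.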
Mitigations:
- Use a model explicitly trained for tool use (Llama 3.3, Qwen 2.5, Hermes 3).
- Use grammar constraints (llama.cpp + GBNF) to force valid JSON.
- Write clear system prompts that specify exactly when to stop using tools:
```text
You are a helpful assistant. You have access to the following tools.
Call a tool only when necessary. Once you have all the information needed
to answer the user's question, provide the final answer directly. Do not
call a tool more than once for the same purpose. Do not call a tool and
then immediately call it again.
```
- Validate arguments before calling tools. If the model emits a parameter that doesn’t exist, reject it politely and ask the model to try again.
- Set iteration limits to prevent infinite loops (like the max_iterations=5 in the code above).
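The "validate arguments" mitigation can be done with a few lines against the same JSON schema you already send to the model. This is a deliberately small sketch: `validate_args` is a hypothetical helper that checks only unknown keys, missing required keys, and enum values, not full JSON Schema.

```python
def validate_args(schema, args):
    """Return a list of problems; an empty list means the args look usable."""
    props = schema.get("properties", {})
    problems = []
    for key in args:
        if key not in props:
            problems.append(f"unknown argument '{key}'; expected one of {sorted(props)}")
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required argument '{key}'")
    for key, value in args.items():
        allowed = props.get(key, {}).get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"'{key}' must be one of {allowed}, got {value!r}")
    return problems

weather_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

# The classic hallucination: "city" instead of "location".
for problem in validate_args(weather_schema, {"city": "Portland"}):
    print(problem)
```

Rejecting politely then just means sending the joined problem list back as the tool result, so the model sees exactly which name it got wrong and can retry.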
Grammar-Constrained Generation: llama.cpp’s GBNF Superpower
If you’re serious about reliability, use llama.cpp directly with GBNF. You write a grammar that describes the exact JSON structure your tools accept, and the sampler refuses to emit anything that violates it.
Example grammar for a simple tool call:
```text
root    ::= "{" ws "\"name\"" ws ":" ws string ws "," ws "\"args\"" ws ":" ws object ws "}"
object  ::= "{" (pair ("," pair)*)? "}"
pair    ::= string ws ":" ws value
value   ::= string | number | boolean
string  ::= "\"" ([^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]))* "\""
number  ::= "-"? ("0" | [1-9] [0-9]*) ("." [0-9]+)? ([eE] [-+]? [0-9]+)?
boolean ::= "true" | "false"
ws      ::= ([ \t\n] ws)?
```
Then:
```shell
./llama-cli -m model.gguf \
  -p "What's the weather in Portland?" \
  --grammar-file grammar.gbnf
```
The model cannot emit JSON that violates the grammar. It’s constrained at the token level. This shifts reliability dramatically, especially for smaller models.
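The grammar above still accepts any string as the tool name, so a hallucinated tool would pass. One way to tighten it is to generate a rule that lists only your real tool names and use it in place of `string` after `"name"` in the root rule. The sketch below only builds the grammar text; `tool_name_rule` and the rule name `toolname` are made up for illustration.

```python
import json

def tool_name_rule(tool_names):
    """Build a GBNF rule matching exactly the given tool names as JSON strings."""
    if not tool_names:
        raise ValueError("need at least one tool name")
    # json.dumps(name) gives the JSON string the model must emit, quotes included.
    # Dumping it a second time escapes those quotes, yielding a valid GBNF literal.
    alternatives = [json.dumps(json.dumps(name)) for name in tool_names]
    return "toolname ::= " + " | ".join(alternatives)

print(tool_name_rule(["get_weather", "calculator"]))
```

Append the printed rule to your grammar file and swap `string` for `toolname` in the root rule. The same trick extends to argument names if you generate one object rule per tool, at the cost of a longer grammar.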
Function Calling vs. MCP: Different Layers
There’s confusion here, so let’s clear it up.
Function calling is a protocol: your orchestration layer describes tools as JSON schemas, the model emits structured tool calls, and you execute them. It’s what we just built above.
MCP (Model Context Protocol) is a standardized spec for discovering and invoking tools. Instead of hardcoding tool definitions in your prompt, MCP lets you connect to a standard tool server that advertises what it can do. Your host application (not the model itself) talks to the MCP server, which handles tool management and execution.
Function calling is the low-level protocol. MCP standardizes the tool layer on top. You can use function calling without MCP (like we did) and you can use MCP with function calling (MCP servers expose their tools as function schemas that feed into your calling loop).
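To make that layering concrete: an MCP server describes each tool with a `name`, a `description`, and an `inputSchema` (a JSON Schema object), which maps almost one-to-one onto the OpenAI-style function schema. A minimal sketch of the conversion follows. It assumes you already have the tool list as plain dicts and leaves out the MCP client connection entirely; `mcp_tool_to_function_schema` is a hypothetical helper name.

```python
def mcp_tool_to_function_schema(tool):
    """Convert one MCP tool description (a dict) to an OpenAI-style tool entry."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            # MCP calls the argument schema "inputSchema"; function calling calls it "parameters".
            "parameters": tool.get("inputSchema", {"type": "object", "properties": {}}),
        },
    }

mcp_tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather in a location",
        "inputSchema": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    }
]

tools = [mcp_tool_to_function_schema(t) for t in mcp_tools]
print(tools[0]["function"]["name"])
```

The resulting `tools` list can be passed to the same calling loop as hand-written schemas. The only other change is that tool execution forwards the call to the MCP server instead of a local Python function.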
For local models, you usually don’t need MCP unless you’re building something complex. Function calling directly is simpler and faster.
Putting It Together: A Real Local Agent
Here’s the full flow for something serious:
- Pick a model: Llama 3.3 70B if you have the VRAM, Qwen 2.5 otherwise.
- Define your tools: JSON schemas that are clear and specific.
- Run inference: Ollama for simplicity, llama.cpp for fine-grained control + GBNF.
- Orchestrate: Python with a loop that handles tool calls and feeds results back.
- Add error handling: Validate arguments, set iteration limits, catch hallucinations.
- Test edge cases: Ask the model to do weird stuff and watch where it breaks.
The tooling is there. The models are capable. The patterns are solid. In 2026, running a function-calling agent locally isn’t aspirational — it’s standard practice for anyone serious about AI at the edge. Your laptop can be the orchestrator. Your network can stay closed. And your models can actually do things instead of just chatting about them.
That’s the shift. Build like it.