Tag: llm

All the articles with the tag "llm".

1-Bit LLMs: The Quantization Endgame

2 Apr, 2026

So-called 1-bit models store each weight as -1, 0, or 1 (about 1.58 bits). That sounds insane until you see one run a 100B-parameter model on a laptop CPU. Here's what's actually happening.

AMD Lemonade: Local LLM Serving for AMD GPUs

1 Apr, 2026

AMD finally has a fast, open-source local LLM server that uses both the GPU and the NPU. If you've been jealous of Nvidia users, Lemonade is worth your time.

When to Use Structured Output (JSON Mode) in LLMs

1 Apr, 2026

JSON mode forces models to output valid JSON. When it's a lifesaver vs. when it's overkill and makes the model worse.

LLM Temperature and top_p Explained Without the Math

25 Mar, 2026

Temperature and top_p control randomness in LLMs. No probability theory needed. Just practical intuition and how to tune them.

LLM Backends: vLLM vs llama.cpp vs Ollama

8 Mar, 2026

vLLM, llama.cpp, and Ollama all run local LLMs — compare throughput, memory use, GPU support, and which fits your hardware.

RAG Chunking: Why Chunk Size Is Everything

7 Mar, 2026

RAG breaks documents into chunks. But what chunk size? Too small and context is lost. Too large and semantic search fails. Here's how to pick.

System Prompts: The LLM Feature Most People Ignore

25 Feb, 2026

System prompts are your secret weapon. How they work, why they matter more than you think, and 5 patterns that actually change model behavior.

LLM Quantization: Q4_K_M Isn't Always the Best Choice

15 Feb, 2026

Q4_K_M is the default, but it's not magic. When Q3, Q5, or Q6 makes sense. How to benchmark quantization tradeoffs on your hardware.

Running Multiple Ollama Models Without Running Out of RAM

9 Feb, 2026

On limited hardware, Ollama may only fit one model in memory at a time. How to switch between models, offload layers to the CPU, and manage VRAM intelligently.

Context Window vs Token Limit: Not the Same Thing

3 Feb, 2026

What's the actual difference between context window and token limit? Why one model says 8K and another says 128K. A practical breakdown.

RAG on a Budget: Building a Knowledge Base with Ollama & ChromaDB

18 Jan, 2026

Learn how to build a local RAG system using Ollama and ChromaDB for free. Step-by-step guide with Docker Compose, Python code, chunking strategies, and real-world examples.

n8n + LLM: Building Automations That Actually Think

6 Jan, 2026

Connect n8n to Ollama or any local LLM to build smart automations that classify, summarize, and triage — not just shuffle data around blindly.