LiteLLM & vLLM: One API to Rule All Your Models
LiteLLM proxies every LLM — local or cloud — behind one OpenAI-compatible endpoint. Pair it with vLLM for GPU-backed serving and ditch the SDK sprawl.
All the articles with the tag "llm".
Pixtral, Qwen3-VL, and Gemma 4 compared for local multimodal use in 2026. LLaVA is dead; here's what to run in Ollama for OCR, screenshots, and vision tasks.
Master Ollama with Modelfiles, GPU tuning, API usage, and performance tricks. Stop running 70B models on 8GB VRAM and wondering why everything is slow.
Ollama keeps each model loaded in VRAM for five minutes after a request by default. Control GPU usage with keep_alive, force-unload models via the API, and check memory to stop the reload cycle.
Self-hosting a ChatGPT alternative? Open WebUI owns local Ollama models; LibreChat handles Claude, GPT, Gemini, and more. Setup, RAG, and trade-offs compared.
Write prompts that get useful results — role prompting, few-shot examples, chain-of-thought, and the patterns that work across any LLM.
Text Generation Web UI vs KoboldCpp: setup, model formats, samplers, APIs, and performance compared so you can pick the right local LLM frontend fast.
Two ways to route LLM traffic across providers — OpenRouter as a hosted gateway, LiteLLM as a self-hosted proxy. Which one fits your home lab in 2026?
Local LLMs can call tools, query APIs, and run code if you set them up right. Function calling on Ollama and llama.cpp explained — patterns that actually work.
Gemma 4 vs Qwen3.6: sizes, reasoning, coding benchmarks, and which model you should actually pull for your home lab rig.
AnythingLLM is the closest thing to a real private NotebookLM you can self-host. Workspaces, RAG, agents, document chat — running locally on Ollama in 20 minutes.
Model Context Protocol turns your LLM into a tool-using agent — file access, APIs, your home lab. Build your first MCP server in under 50 lines of Python.