Stop Renting Intelligence You Can Run Yourself
Running a local LLM is genuinely useful. Not “cool party trick” useful — actually useful. Offline inference, no rate limits, no API bill creeping up on you at 2 AM when you left a script running. But there’s one number that determines everything: VRAM.
CPU offloading exists. It’s also approximately as fun as watching paint dry at 3 tokens per second. If you want a responsive model, the weights need to fit in GPU memory. Full stop.
Here’s the actual math for models you probably care about in 2026:
| Model | Quant | VRAM Needed |
|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~5 GB |
| Mistral 24B | Q4_K_M | ~15 GB |
| Llama 3.1 70B | Q4_K_M | ~40 GB |
| Llama 3.1 70B | Q8 | ~75 GB |
| Qwen2.5 72B | Q4_K_M | ~42 GB |
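If your model isn't in the table, you can ballpark it yourself. This is a minimal sketch, assuming weight size is simply parameter count times bits per weight. The `estimate_weights_gb` helper is hypothetical, and the 4.85 and 8.5 bits-per-weight figures for Q4_K_M and Q8 are approximations I'm assuming here, not exact values. KV cache and runtime overhead come on top of whatever it prints.

```shell
# Rough weight-only VRAM estimate in GB:
#   params (billions) * bits per weight / 8
# The bits-per-weight inputs are approximations (assumption), and this
# ignores KV cache and runtime overhead entirely.
estimate_weights_gb() {
  local params_b="$1" bits_per_weight="$2"
  awk -v p="$params_b" -v b="$bits_per_weight" \
    'BEGIN { printf "%.0f\n", p * b / 8 }'
}

estimate_weights_gb 8 4.85    # 8B at ~Q4_K_M, prints 5
estimate_weights_gb 70 4.85   # 70B at ~Q4_K_M, prints 42
estimate_weights_gb 70 8.5    # 70B at ~Q8, prints 74
```

The results land in the same ballpark as the table, which is all a sizing estimate needs to do. Treat it as a floor and leave room for context.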
So if you want to run the big boys without selling a kidney, you need either one card with a lot of VRAM or two cards that can pool it. This guide is about how to get there without getting burned on the used market.
The Tiers (Actual Useful Breakdown)
$200–$400: The Starter Slot
RTX 3060 12GB — The Home-Lab Sweet Spot
Honestly, this card is criminally good for the price. The 3060 12GB has more VRAM than the 8GB 3060 Ti and 3070. Its 192-bit memory bus meant the practical choices were 6GB or 12GB, NVIDIA went with 12, and ended up accidentally creating the best entry-level inference card in the used market.
- VRAM: 12 GB GDDR6
- Bandwidth: 360 GB/s
- TDP: 170W
- What it runs: Llama 3.1 8B at full Q8, Mistral 7B comfortably, anything under ~11GB fits cleanly
- fp16 throughput: ~12.7 TFLOPS
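A 13B model at Q4_K_M is roughly 8 GB, so it fits with room for a modest context. What won't fit is the 24B class at ~15 GB. For 7B-class models it's excellent, and two of them in a bifurcated x8/x8 setup give you 24GB total if your motherboard cooperates. The 3060 has no NVLink, so that is two separate 12GB pools with the runtime splitting layers across the cards over PCIe, not one 24GB device.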
Price around $200-250 used in 2026. If you see one for under $190, buy it.
Tesla P40 24GB — Cheap VRAM, Expensive Patience
The P40 is a datacenter card from 2016. People buy it because 24GB for $150 sounds like a steal. It is not a steal.
Here’s what you’re actually getting:
- No usable fp16: native fp16 runs at 1/64 of the fp32 rate, so software falls back to fp32 math, which means inference is roughly half the speed you'd expect from a card with this much memory
- No display outputs: it's a headless compute card, so you need integrated graphics or a second GPU for video out. It does have NVENC, so Plex or Jellyfin transcoding on this box is possible
- Power connector: a single 8-pin, but wired as EPS (CPU-style) rather than PCIe, so a consumer PSU needs an adapter cable. It draws 250W and runs hot as a pizza oven
- Cooling: passive heatsink designed for a server chassis with serious airflow — in an open-air desktop case it will throttle aggressively
If you have a 4U server with proper airflow and you only care about VRAM capacity (loading huge models at low throughput), the P40 is workable. For actual interactive inference, you’ll be frustrated within a week. The context-length performance on long prompts is genuinely bad.
$400–$700: The Sweet Zone
RTX 3090 24GB — Still the King for $/VRAM
In 2026 the 3090 is the card I’d actually recommend to most home-lab LLM people. The used price has settled into the $450-550 range and nothing in that bracket touches it for inference.
- VRAM: 24 GB GDDR6X
- Bandwidth: 936 GB/s (this matters enormously for inference)
- TDP: 350W (transient spikes to ~600W — more on this below)
- fp16: ~35.6 TFLOPS
- What it runs: Mistral 24B Q4 fits with room to spare, Llama 3.1 70B Q4 needs a second card or significant CPU offload
The memory bandwidth is what makes it fast. Token generation is bandwidth-bound, not compute-bound. A 4090 at 1008 GB/s outperforms a 3090 at 936 GB/s by less than you'd think, while the 3090 costs roughly half as much used.
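To put a number on the bandwidth argument, here is a back-of-envelope sketch. It assumes each generated token has to stream every weight out of VRAM once, so bandwidth divided by model size is a hard ceiling on tokens per second. The `max_tokens_per_sec` helper is hypothetical, and real throughput will land below the ceiling.

```shell
# Upper bound on decode speed when generation is memory-bandwidth-bound:
#   tokens/s <= memory bandwidth (GB/s) / model size in VRAM (GB)
# This is a ceiling under a simplifying assumption, not a benchmark.
max_tokens_per_sec() {
  local bandwidth_gbs="$1" model_gb="$2"
  awk -v bw="$bandwidth_gbs" -v m="$model_gb" \
    'BEGIN { printf "%.0f\n", bw / m }'
}

max_tokens_per_sec 936 15    # 3090, ~15 GB model: prints 62
max_tokens_per_sec 1008 15   # 4090, same model:    prints 67
```

The two ceilings differ by single digits of tokens per second, which is why the extra spend buys so little for this workload.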
Two 3090s with an NVLink bridge give you 48GB across the pair, and that's a Llama 3.1 70B Q4 setup that actually rips.
```shell
# Check your 3090 on arrival
nvidia-smi --query-gpu=name,memory.total,memory.free,temperature.gpu,power.draw --format=csv,noheader,nounits
```
Sample output from a healthy card:
```
NVIDIA GeForce RTX 3090, 24576, 24200, 32, 15
```
If memory.total shows anything less than 24576 MiB, the card has a failed module. Walk away.
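If you're testing more than one card, the memory.total check is easy to script. This sketch assumes the CSV field order from the query above (name first, memory.total second, values in MiB). `check_vram_total` is a hypothetical helper name, not part of nvidia-smi.

```shell
# Compare memory.total (2nd CSV field, MiB) against the expected capacity.
# Input: one line in the format produced by the nvidia-smi query above.
# Exit status 0 means the card reports at least the expected amount.
check_vram_total() {
  local csv_line="$1" expected_mib="${2:-24576}"
  printf '%s\n' "$csv_line" | awk -F', *' -v want="$expected_mib" '
    { if ($2 + 0 >= want) { print "OK: " $1 " reports " $2 " MiB"; exit 0 }
      else { print "FAIL: " $1 " reports " $2 " MiB, expected " want; exit 1 } }'
}

# Usage on a live system (needs an NVIDIA GPU and driver):
#   check_vram_total "$(nvidia-smi --query-gpu=name,memory.total,memory.free,temperature.gpu,power.draw --format=csv,noheader,nounits | head -n 1)"
```

The second argument lets you reuse it for other cards, for example `12288` for a 3060 12GB.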
AMD Mi50 32GB — ROCm Cliff Edition
The Mi50 shows up on eBay for $300-400 and 32GB sounds amazing. The catch: ROCm support for Mi50 is officially deprecated as of ROCm 6.x. You’re pinned to older ROCm releases, which means you’re fighting software compatibility every time you update anything.
llama.cpp has HIP support and it mostly works on Mi50 with ROCm 5.7, but you’ll spend more time debugging the stack than running models. Unless you enjoy that kind of thing (some of us do, no judgment), stick to NVIDIA.
$700–$1200: Serious Inference
RTX 4090 24GB — Used Prices Finally Moving
The 4090 used market started cracking in late 2025 as supply loosened. You can find them for $900-1100 now instead of the $1400+ they held for way too long.
- VRAM: 24 GB GDDR6X
- Bandwidth: 1008 GB/s
- TDP: 450W
- fp16: ~82.6 TFLOPS
Honest take: for pure LLM inference, the 4090 is about 10-15% faster than a 3090 at the same VRAM. That’s real, but not $400-500 real unless you’re also gaming or doing heavy video work. If inference is 90% of your use case, a 3090 is better value.
Where the 4090 wins: quantized generation throughput on smaller models. If you’re running Llama 3.1 8B and want it fast — 80+ tokens/second — the 4090 is noticeably better.
RTX A5000 24GB — Workstation Calm
The A5000 is the professional variant: blower cooler, certified drivers, runs at 230W instead of 350W. Same 24GB as the 3090 but at significantly lower power draw and with proper ECC support.
- Used price: $750-950
- Blower cooler means it works in a 4U chassis without custom airflow engineering
- NVENC and NVDEC are both present, so it can double as a transcode card
- Slightly less raw bandwidth than 3090 (768 GB/s) but the stability is worth it for 24/7 inference workloads
If your home lab runs in a server chassis and you care about reliability over peak throughput, the A5000 is underrated.
$1500+: The “I Have a Problem” Tier
2x RTX 3090 with NVLink — 48GB Pooled
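This is the build for running Llama 3.1 70B at Q4 comfortably. NVLink on 3090s does not merge the two cards into a single 48GB device: each GPU still holds its own 24GB, and the inference runtime splits the model's layers or tensors across them. What the bridge adds is a much faster card-to-card link than PCIe, which matters most for tensor-parallel splits.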
Requirements that will bite you:
- NVLink bridge: about $80 used, but cards need to be the same model and close enough in slot spacing
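- PCIe slots: you need two physical x16 slots, running x16/x16 or x8/x8 (the latter often needs bifurcation enabled in BIOS). NVLink traffic goes over the bridge, not the slot, but an x4 slot still slows model loading and any transfer that falls back to PCIe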
- PSU: Two 3090s plus a system can hit 900W under load. Get a 1200W or 1600W PSU. No exceptions.
```shell
# Verify NVLink is detected
nvidia-smi nvlink --status -i 0
```
RTX 6000 Ada 48GB / A6000 48GB — One Card, No Drama
The A6000 (Ampere) and RTX 6000 Ada (Ada Lovelace) both have 48GB on a single card. No NVLink fiddling, no dual-slot PCIe drama.
- A6000 (Ampere): ~$2000-2500 used. 48GB, 768 GB/s bandwidth, blower, 300W.
- RTX 6000 Ada: ~$3500-4500 used. 48GB, 960 GB/s bandwidth, price still coming down.
If you’re buying new-to-you at this tier, run the VRAM math against your target models. 48GB runs Llama 3.1 70B at Q4 with room for context. Qwen2.5 72B fits. These cards earn their price if 70B-class inference is your primary workload.
The Pain Points Nobody Mentions Until You’re Ankle-Deep
PCIe Lanes and Bifurcation
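A GPU in an x8 slot has half the link bandwidth of x16, but for single-GPU inference that costs little once the weights are in VRAM: expect slower model loads, not slower generation. Annoying but livable. An x4 slot is a problem: model loading drags, and multi-GPU setups that pass activations between cards over PCIe lose throughput.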
Before buying a second GPU, check your motherboard’s bifurcation support. Some boards can split a single x16 slot into two x8s for a PCIe riser — some can’t. Check the manual, not the spec sheet.
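Once the card is installed, it's worth confirming what link it actually negotiated rather than trusting the slot label. A small sketch: `check_pcie_width` is a hypothetical helper that parses one "gen, width" CSV line, which is the format I'd expect from querying `pcie.link.gen.current` and `pcie.link.width.current` with nvidia-smi. Cards often drop the link generation at idle to save power, so check under load.

```shell
# Warn when the negotiated PCIe link is narrower than expected.
# Input: one CSV line "gen, width" (e.g. "4, 16").
# Exit status 0 means the width is at least the minimum.
check_pcie_width() {
  local csv_line="$1" min_width="${2:-8}"
  printf '%s\n' "$csv_line" | awk -F', *' -v min="$min_width" '
    { if ($2 + 0 >= min) { print "OK: Gen" $1 " x" $2; exit 0 }
      else { print "WARN: Gen" $1 " x" $2 " is below x" min; exit 1 } }'
}

# Usage on a live system (needs an NVIDIA GPU and driver):
#   check_pcie_width "$(nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv,noheader | head -n 1)"
```

If a card in a physical x16 slot reports x4, the usual suspects are a shared-lane slot, a riser, or a BIOS bifurcation setting.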
Power Supply: The 3090’s Dirty Secret
The RTX 3090 is rated at 350W TDP. What that doesn’t tell you is that transient power spikes during heavy compute can hit 580-620W on a single card. An 850W PSU powering a 3090 plus a modern CPU plus NVMe drives is cutting it uncomfortably close.
Rule of thumb: budget 400W for the 3090, 150W for a mid-tier CPU, 100W for everything else. A 1000W PSU is the minimum I’d recommend for a 3090 system. For dual 3090, go 1600W.
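The rule of thumb is easy to turn into arithmetic. This sketch sums the sustained budget from the figures above and then multiplies by 1.5 for transient headroom. That 1.5x factor is my assumption, picked because it roughly reproduces the 1000W and 1600W recommendations, and `psu_budget_watts` is a hypothetical helper.

```shell
# PSU sizing from the rule of thumb: per-GPU budget + CPU + everything else,
# then a 1.5x multiplier for transient spikes (the multiplier is an assumption).
psu_budget_watts() {
  local gpu_count="$1" gpu_w="${2:-400}" cpu_w="${3:-150}" other_w="${4:-100}"
  awk -v n="$gpu_count" -v g="$gpu_w" -v c="$cpu_w" -v o="$other_w" 'BEGIN {
    sustained = n * g + c + o
    printf "sustained=%d recommended=%d\n", sustained, sustained * 1.5
  }'
}

psu_budget_watts 1   # single 3090: prints sustained=650 recommended=975
psu_budget_watts 2   # dual 3090:   prints sustained=1050 recommended=1575
```

Round the recommended figure up to the next PSU size you can actually buy, which gets you to 1000W and 1600W.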
```shell
# Monitor power draw in real-time during a stress test
watch -n 1 nvidia-smi --query-gpu=power.draw,temperature.gpu,utilization.gpu --format=csv,noheader
```
Blower vs. Open-Air in a Server Chassis
Open-air (triple-fan) coolers are designed for ATX tower cases with front-to-back airflow. In a 4U rack server, they exhaust hot air sideways into adjacent components. That’s a thermal catastrophe.
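If you're racking this, get a card built for chassis airflow: a blower like the A5000, A6000, or Quadro RTX 5000/6000, which exhausts out the rear bracket, or a passive Tesla-series card (P40, T4) that relies on the chassis fans pushing air front to back. Either way, that's what your rack is designed for.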
Consumer cards in a rack are a 2 AM pager event waiting to happen.
Used Market Scams to Know About
Mining cards: Years of continuous 80%+ load accelerate fan bearing wear. A card that benches fine for 10 minutes can fail at hour 3 under sustained inference load. Ask the seller what the card was used for. If they’re evasive, pass.
“Repaired package” listings: This is when someone has reballed the GPU die to fix a solder joint failure. It can last years or fail in weeks. These show up in bulk lots on AliExpress and get resold individually. Signs: unusually cheap price, suspiciously pristine PCB but worn fans, inconsistent thermal paste application.
Memory module failures: A 24GB card with one marginal GDDR6X module sometimes still reports the full 24GB but throws memory errors or crashes under load. The burn test below catches this.
Test-on-Arrival Checklist
Run these within your return window. No exceptions.
```shell
# 1. Basic info: verify specs match what you bought
nvidia-smi -q | grep -E "Product Name|Total|Driver"
# 2. GPU burn test: catches memory errors, thermal throttling, bad solder
# Install: https://github.com/wilicc/gpu-burn
./gpu_burn 300 # 5 minutes, watch for errors
# 3. Stress + power monitoring combo (CPU load stops itself after 5 minutes)
stress-ng --cpu 4 --timeout 300s &
watch -n 2 'nvidia-smi --query-gpu=name,temperature.gpu,power.draw,clocks.sm,clocks.mem --format=csv,noheader'
```
If gpu_burn reports an error count above zero, or its final summary marks the GPU as FAULTY, the card has a memory problem. Return it.
If clock speeds drop below 1500MHz on a 3090 under sustained load, it’s throttling — could be thermal paste dried out, could be a power delivery issue on the board.
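Staring at `watch` output for an hour gets old. A sketch of the same check done on a log: it assumes you have saved samples in the CSV format of the monitoring command above, where the SM clock is the fourth field and looks like `1695 MHz`. `count_throttled_samples` is a hypothetical helper, and the 1500 default simply mirrors the 3090 threshold mentioned above.

```shell
# Count logged samples whose SM clock (4th CSV field) is below a threshold.
# Reads CSV lines on stdin; prints the number of samples under the limit.
count_throttled_samples() {
  local min_mhz="${1:-1500}"
  awk -F', *' -v min="$min_mhz" '$4 + 0 < min { n++ } END { print n + 0 }'
}

# Usage on a live system (needs an NVIDIA GPU and driver):
#   nvidia-smi --query-gpu=name,temperature.gpu,power.draw,clocks.sm,clocks.mem \
#     --format=csv,noheader -l 2 > gpu.log    # stop with Ctrl-C after the burn
#   count_throttled_samples 1500 < gpu.log
```

A handful of low samples at the start of a run is normal while the card ramps up. A count that keeps climbing through the run is the signal to look for.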
Running Your First Model
Once you’ve got a card that passes burn testing, here’s a real invocation that works:
```shell
# llama.cpp: Mistral 24B Q4 on a single 3090
./llama-cli \
  -m mistral-24b-instruct-q4_k_m.gguf \
  -n 512 \
  --n-gpu-layers 999 \
  --ctx-size 4096 \
  -p "Explain NVLink in one paragraph."
```
The --n-gpu-layers 999 pushes all layers to GPU. If the model doesn't fit, llama.cpp fails at load time with a CUDA out-of-memory error. Lower --n-gpu-layers until it loads, and the remaining layers stay in system RAM.
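```shell
# vLLM: Llama 3.1 8B (4-bit AWQ) on a 3060 12GB
# The model repo below is one example of an AWQ checkpoint; substitute your own.
python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq \
  --dtype float16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90
```
The fp16 weights of an 8B model are about 16GB, so the unquantized checkpoint will not load on a 12GB card. A 4-bit AWQ or GPTQ build is what fits. vLLM's --gpu-memory-utilization 0.90 lets vLLM claim 90% of VRAM for weights, activations, and KV cache, leaving 10% for the driver and anything else on the GPU. Drop to 0.85 if you see OOM errors at startup or under load.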
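For a sense of how much VRAM the KV cache takes at a given context length, here is a sketch of the standard size formula: two tensors (K and V) per layer, each holding kv_heads times head_dim values per token. The shape numbers in the example (32 layers, 8 KV heads, head dimension 128) are what I understand the Llama 3.1 8B config to be, so treat them as an assumption and check your model's config file. `kv_cache_mib` is a hypothetical helper.

```shell
# KV cache size in MiB for one sequence:
#   2 (K and V) * layers * kv_heads * head_dim * bytes per element * tokens
# bytes_per_elem defaults to 2 (fp16 cache).
kv_cache_mib() {
  local layers="$1" kv_heads="$2" head_dim="$3" tokens="$4" bytes_per_elem="${5:-2}"
  awk -v l="$layers" -v h="$kv_heads" -v d="$head_dim" -v t="$tokens" -v b="$bytes_per_elem" \
    'BEGIN { printf "%d\n", 2 * l * h * d * b * t / (1024 * 1024) }'
}

kv_cache_mib 32 8 128 4096   # assumed Llama 3.1 8B shape, 4096 tokens: prints 512
```

That figure is per sequence. A server handling several requests at once needs a multiple of it, which is where the headroom goes on a 12GB card.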
The Bottom Line
Here’s the actual recommendation based on what you’re trying to run:
Running 7B-8B models, tight budget: RTX 3060 12GB at $200-250. Best $/VRAM in this bracket, good fp16, runs cool enough for an ATX case.
Running 13B-24B models, balanced budget: RTX 3090 24GB at $450-550. This is the 2026 home-lab inference card. Nothing else at this price has the bandwidth or VRAM to compete.
Running 70B models: Two RTX 3090s with NVLink at $900-1100 total for 48GB pooled, or an A6000 48GB if you want one card and no drama.
Rack deployment: RTX A5000 (24GB, blower, 230W) or A6000 (48GB, blower) — consumer cards are a bad time in a 4U.
Avoid: Tesla P40 unless you have a proper server chassis, infinite patience, and don’t care about throughput. AMD Mi50 unless you enjoy debugging ROCm compatibility on a Saturday afternoon.
The used GPU market rewards people who know what they’re actually buying. Run the VRAM math for your target models first, pick the tier that covers it with ~20% headroom, and test immediately on arrival. Everything else is negotiable.
Your 2 AM inference server thanks you for buying the right card the first time.