Why Stretching a Pixel Doesn’t Just Make It Bigger
You know that feeling when you find a perfect image online but it’s 640×480 and you need it for a poster? Your browser’s got a zoom button. Your photo app’s got a scale tool. Both will make it bigger. Both will make it look like you photographed a blurry potato.
For years, traditional upscaling was like trying to reverse-engineer a painting by squinting at a thumbnail. Bicubic, Lanczos, nearest-neighbor — they’re all fancy ways of guessing what pixels should exist between the ones you’ve got. They work. They’re fast. They’re also, well, blurry.
Then generative AI showed up and said: “What if we trained a neural network to hallucinate realistic pixels instead?” Enter Real-ESRGAN, BSRGAN, and friends — upscalers that use deep learning to make your low-res images actually look like high-res images. Not scaled. Restored.
This isn’t theoretical. If you’re running ComfyUI, generating images with Stable Diffusion, or working with old screenshots, you need to understand what upscalers do and which one fits your job.
How Upscalers Actually Work (No Math Required)
There are three flavors:
Traditional Upscaling (Bicubic/Lanczos)
You’ve been using this your whole life. Your browser does it. Your phone does it. It looks at neighboring pixels, averages them, fills in gaps. Fast. Free computationally. Terrible at inventing detail.
Think of it like photocopying a photocopy. The machine can’t add information that isn’t there. It just stretches what you’ve got.
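To make the "can't add information" point concrete, here is a minimal sketch of interpolation-based upscaling in plain Python. It uses bilinear interpolation (a simpler cousin of bicubic and Lanczos, swapped in here to keep the code short) on a tiny grayscale grid; the function name and the sample data are made up for illustration.

```python
def upscale_bilinear(pixels, factor):
    """Upscale a 2D grid of grayscale values by an integer factor."""
    src_h, src_w = len(pixels), len(pixels[0])
    dst_h, dst_w = src_h * factor, src_w * factor
    out = []
    for y in range(dst_h):
        # Map the output row back to a fractional source row, clamped to the image.
        sy = min(max((y + 0.5) / factor - 0.5, 0.0), src_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        fy = sy - y0
        row = []
        for x in range(dst_w):
            sx = min(max((x + 0.5) / factor - 0.5, 0.0), src_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            fx = sx - x0
            # Weighted average of the four nearest source pixels.
            top = pixels[y0][x0] * (1 - fx) + pixels[y0][x1] * fx
            bottom = pixels[y1][x0] * (1 - fx) + pixels[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


tiny = [[0, 100], [100, 200]]
big = upscale_bilinear(tiny, 2)  # 2x2 grid becomes 4x4
```

Every output value is a weighted average of existing pixels, so nothing in the result can be brighter, darker, or more detailed than what the source already contained.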
GAN-Based Upscaling (Real-ESRGAN, BSRGAN)
GAN = Generative Adversarial Network. Two neural networks fight it out: one generates fake high-res pixels, the other tries to catch the fakes. After thousands of rounds, the generator learns to hallucinate realistic detail.
Real-ESRGAN trains on high-resolution images that have been deliberately degraded (blur, noise, resizing, JPEG compression), learns how texture, noise, and edges look before and after, and applies that learned knowledge to fill in the gaps. A 512×512 image becomes 2048×2048 (4x upscale) with plausible detail synthesized in, not just stretched.
The catch? It doesn’t always guess right. If your image has weird artifacts, Real-ESRGAN might enhance those artifacts too. Garbage in, slightly less blurry garbage out.
Diffusion-Based Upscaling (Stable Diffusion x4 Upscaler)
Newer kid on the block. Diffusion models (the same tech behind Stable Diffusion image generation) can also upscale by iteratively refining a low-res image into a high-res one. Slower, but sometimes better at preserving original intent over hallucinating detail.
Real talk: For most cases, GAN-based Real-ESRGAN is the sweet spot. Diffusion is overkill unless you’re willing to wait 30+ seconds per image.
The Real-ESRGAN Model Family
Real-ESRGAN isn’t one model. It’s a whole lineup. Choosing the right one matters.
General-Purpose Models
- RealESRGAN_x4plus_anime_6B — for anime, drawn art, manga. Preserves linework. Use this for fan art.
- RealESRGAN_x2plus — 2x upscale instead of 4x. Faster, less hallucination, good for photos that just need a nudge.
- RealESRGAN_x4plus — vanilla 4x upscale. Works on photos, paintings, screenshots. Jack-of-all-trades.
- RealESRGAN_x4plus_v3 — newer, trained on diverse datasets (photos, art, text), more robust than earlier versions.
Specialized Models
- NMKD_Siax — stylized art, game assets, maintains artistic intent better than generic models.
- SwinIR — transformer-based upscaler, excellent for photos but slower (requires more VRAM).
- 4x-UltraSharp — tuned for sharpness, good for detailed photos where you want edges crisp and clear.
My take: Start with RealESRGAN_x4plus_v3. It’s the reliable sedan of upscalers. Don’t use the anime_6B model unless your input is actually anime. Specialist models are premature optimization.
Beyond the Model: Real-ESRGAN in ComfyUI
Most people run Real-ESRGAN inside ComfyUI, not as a standalone CLI. Here’s a typical workflow node chain:
Load Image → Real-ESRGAN Upscale (x4) → Face Restoration (CodeFormer) → KSampler (optional refine)
The Real-ESRGAN Upscale node takes your image and the model name. Pick your model, set your scale factor (2x or 4x), and hit run.
Node inputs:
- image: your input
- upscale_model: "RealESRGAN_x4plus_v3.pth"
- scale: 4
Output: upscaled image, ready for the next step.
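If you'd rather script this than click through the graph, the same chain can be written as a ComfyUI API-format workflow. This is a sketch under assumptions: the node class names (LoadImage, UpscaleModelLoader, ImageUpscaleWithModel, SaveImage) are what I believe the stock nodes are called, so check them against your install; the file names are placeholders. As far as I know the stock upscale node has no separate scale input, because the factor comes from the model you load.

```python
import json


def build_upscale_workflow(image_name, model_name):
    """Return an API-format graph: load image -> upscale with model -> save."""
    return {
        "1": {"class_type": "LoadImage", "inputs": {"image": image_name}},
        "2": {
            "class_type": "UpscaleModelLoader",
            "inputs": {"model_name": model_name},
        },
        "3": {
            "class_type": "ImageUpscaleWithModel",
            # ["2", 0] means "output slot 0 of node 2".
            "inputs": {"upscale_model": ["2", 0], "image": ["1", 0]},
        },
        "4": {
            "class_type": "SaveImage",
            "inputs": {"images": ["3", 0], "filename_prefix": "upscaled"},
        },
    }


# Placeholder file names; the model file must exist in ComfyUI's upscale models folder.
workflow = build_upscale_workflow("input.png", "RealESRGAN_x4plus_v3.pth")
payload = json.dumps({"prompt": workflow}, indent=2)
```

The resulting payload is the kind of JSON body a running ComfyUI server accepts when you queue a job over HTTP, which is what makes batch scripting possible.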
VRAM-friendly upscaling: If you’ve got a 4GB GPU and a 2048×2048 input, Real-ESRGAN might OOM. Solution: tile-based upscaling. Many ComfyUI upscaler nodes have a tile_size parameter. Set it to 512 or 1024. The node upscales the image in overlapping tiles, blends them, and you get a seam-free result without the VRAM spike.
tile_size: 512 ← splits into smaller chunks
overlap: 32 ← blend region to hide seams
Takes longer (2-3x slower), but your 4GB card can now handle 4000×4000+ images.
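For a feel of what tiling does under the hood, here is a rough sketch of how an image might be carved into overlapping tiles. The function name and the edge-handling strategy are my own assumptions; real nodes differ in the details, but the tile_size and overlap arithmetic is the same idea.

```python
def tile_boxes(width, height, tile_size=512, overlap=32):
    """Return (left, top, right, bottom) boxes that cover the image with overlap."""
    if overlap >= tile_size:
        raise ValueError("overlap must be smaller than tile_size")
    step = tile_size - overlap

    def starts(length):
        # One tile is enough when the image is smaller than a tile.
        if length <= tile_size:
            return [0]
        positions = list(range(0, length - tile_size, step))
        positions.append(length - tile_size)  # last tile sits flush with the edge
        return positions

    return [
        (x, y, min(x + tile_size, width), min(y + tile_size, height))
        for y in starts(height)
        for x in starts(width)
    ]


boxes = tile_boxes(2048, 2048, tile_size=512, overlap=32)
```

With these settings a 2048×2048 input is split into a 5×5 grid of 512×512 tiles. Each tile is upscaled on its own, which keeps peak VRAM tied to the tile size instead of the full image, and the shared strips are where the blending happens.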
Chaining Upscalers: Low-Res Gen → 2x → 4x
Here’s a trick that sounds dumb but works: generate small, upscale twice.
KSampler (512×512) → Real-ESRGAN x2 (1024×1024) → Real-ESRGAN x4 (4096×4096)
Why? Each upscaler pass adds detail at a different scale. The first 2x pass refines micro-detail. The second 4x pass adds macro-structure. Chaining them (8x in total) often looks better than one monster pass straight to the final size.
It’s slower (obviously), but for showcase work, it’s worth the extra compute.
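The size arithmetic for a chain is worth checking before you commit the compute, since the factors multiply. A small sketch (the helper name is made up for illustration):

```python
def chain_sizes(width, height, factors):
    """Return the image size after each upscale pass in the chain."""
    sizes = [(width, height)]
    for factor in factors:
        width, height = width * factor, height * factor
        sizes.append((width, height))
    return sizes


sizes = chain_sizes(512, 512, [2, 4])
total_factor = sizes[-1][0] // sizes[0][0]
pixel_growth = (sizes[-1][0] * sizes[-1][1]) // (sizes[0][0] * sizes[0][1])
```

A 2x pass followed by a 4x pass is 8x per side and 64x the pixel count, which is why the second pass is where tiling usually starts to matter.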
Face Restoration: The Sidekick
Real-ESRGAN is great at textures and edges. Faces? Faces are weird. Human perception is absurdly good at spotting wrong eyes.
CodeFormer and GFPGAN are face-specific restoration models that run after upscaling. They detect faces, restore them separately, and blend them back in.
ComfyUI workflow:
Real-ESRGAN → CodeFormer Face Restore → Output
CodeFormer’s advantage: it handles large upscales (4x+) better and doesn’t over-smooth. GFPGAN is older but still solid.
Use them when: your image has faces and they look mushy after upscaling. Use them liberally on AI-generated portraits (they fix a lot of common artifacts).
Skip them when: the original has no faces or faces look fine already. Face restoration can over-process and introduce plastic-surgery vibes.
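The "blend them back in" step is what keeps a restored face from looking pasted on. Here is a toy sketch of feathered blending on a grayscale grid. It is not how CodeFormer or GFPGAN do it internally (I'm assuming a simple linear feather, and the function names are hypothetical), but it shows the idea: full strength in the middle of the face crop, fading to the original image at the edges.

```python
def feather_weight(x, y, w, h, border):
    """Weight is 1.0 in the patch interior and falls linearly to 0 at its edge."""
    edge_distance = min(x, y, w - 1 - x, h - 1 - y)
    return min(1.0, edge_distance / border)


def blend_patch(image, patch, left, top, border=2):
    """Blend `patch` into a copy of `image` at (left, top) with feathered edges."""
    out = [row[:] for row in image]
    h, w = len(patch), len(patch[0])
    for y in range(h):
        for x in range(w):
            a = feather_weight(x, y, w, h, border)
            original = image[top + y][left + x]
            out[top + y][left + x] = a * patch[y][x] + (1 - a) * original
    return out


background = [[0] * 8 for _ in range(8)]      # stand-in for the upscaled image
restored_face = [[100] * 6 for _ in range(6)]  # stand-in for the restored crop
result = blend_patch(background, restored_face, left=1, top=1, border=2)
```

The plastic-surgery look tends to show up when the restored crop differs a lot from its surroundings, and a wider feather can only hide so much of that.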
Upscaling for Video: A Teaser
Video upscaling is Real-ESRGAN’s evil twin. Each frame needs upscaling, but your video is 30 fps × 90 seconds = 2700 frames. Running each through ComfyUI takes hours.
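A quick back-of-the-envelope estimate shows how fast this adds up. The per-frame times below are borrowed from the rough GPU and CPU figures later in this post (about 1 second and about 20 seconds per image), so treat the results as ballpark numbers that ignore decoding, encoding, and disk I/O.

```python
def video_upscale_estimate(fps, duration_s, seconds_per_frame):
    """Return (frame count, estimated hours) for upscaling every frame."""
    frames = int(fps * duration_s)
    total_seconds = frames * seconds_per_frame
    return frames, total_seconds / 3600


frames, gpu_hours = video_upscale_estimate(30, 90, seconds_per_frame=1)
_, cpu_hours = video_upscale_estimate(30, 90, seconds_per_frame=20)
```

For the 90-second clip that works out to 2700 frames, roughly 45 minutes of pure GPU compute, or about 15 hours on CPU, before any workflow overhead.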
Tools like Topaz Video Enhance AI batch-process this. Real-ESRGAN can run on video too, frame by frame. Frame interpolation (filling in new frames between originals) is a separate step, and a whole rabbit hole of its own.
Quick version: upscale a key frame or two in ComfyUI to settle on a model and settings, then batch the full video with those settings. If you also want a smoother frame rate, run a frame interpolator like DAIN or RIFE afterwards; those add frames, not resolution. Practical for archival footage, speedruns, old gameplay captures.
Real-ESRGAN vs. BSRGAN vs. SwinIR: Which Is Fastest?
| Model | Speed | Quality | VRAM |
|---|---|---|---|
| Real-ESRGAN x4plus_v3 | Fast (~1s for 1024×1024) | 8/10 | 2-3 GB |
| BSRGAN | Slower (~3s) | 9/10 (faces) | 3-4 GB |
| SwinIR | Slowest (~5-10s) | 9/10 (details) | 4-6 GB |
| 4x-UltraSharp | Fast (~1s) | 8.5/10 (sharp) | 2-3 GB |
Real-ESRGAN dominates for speed-to-quality. SwinIR is overkill unless you’re upscaling museum-quality artwork. BSRGAN is great if you’ve got the VRAM and patience.
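If you want to turn the table into a quick filter, something like the sketch below works. The numbers are the upper ends of the ranges in the table above, so they are rough guides and not benchmarks, and the function name is made up.

```python
MODELS = [
    # (name, approx seconds per 1024x1024 image, approx peak VRAM in GB)
    ("Real-ESRGAN x4plus_v3", 1.0, 3.0),
    ("BSRGAN", 3.0, 4.0),
    ("SwinIR", 10.0, 6.0),
    ("4x-UltraSharp", 1.0, 3.0),
]


def models_that_fit(vram_gb):
    """Return model names that fit in the VRAM budget, fastest first."""
    fitting = [m for m in MODELS if m[2] <= vram_gb]
    # sorted() is stable, so ties keep their table order.
    return [name for name, _secs, _vram in sorted(fitting, key=lambda m: m[1])]


on_4gb_card = models_that_fit(4)
on_8gb_card = models_that_fit(8)
```

On a 4GB budget SwinIR drops out and the two fast models come first, which matches the verdict above.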
When 4x Upscale Is Overkill (And When It’s Not)
Use 4x upscale when:
- Original is 256×256 or smaller (screenshot, old photo, thumbnail)
- You need wall-art resolution (2560×1440 minimum)
- You’re printing (or rendering for video)
- You want maximum detail reconstruction
Use 2x upscale when:
- Original is already 800×600+ (minor cleanup)
- You’re upscaling meme text (4x adds too much smoothing)
- VRAM is tight
- Speed matters (2x outputs a quarter as many pixels as 4x, so it runs several times faster)
Use no upscaling when:
- Original is already high-res
- You’re upscaling JPG soup (compression artifacts will be “enhanced”)
- It’s a screenshot of text (Real-ESRGAN doesn’t help text sharpness much)
Free vs. Paid: Topaz Reality Check
Paid option: Topaz Video Enhance AI ($200 one-time)
- GUI. One-click video upscaling. Batching built-in. Face detection.
- Worth it if you’re doing video work. Not worth it for stills (ComfyUI is free).
Free option: ComfyUI + Real-ESRGAN
- Node-graph GUI in the browser, not one-click (ComfyUI Manager helps with installing custom nodes). Steeper learning curve.
- Unlimited stills. Slower video batch (but doable with scripting).
- GPU strongly recommended (it can run on CPU, but expect 20-30 seconds per image).
Verdict: If you’re upscaling images for your blog or self-hosting project, Real-ESRGAN + ComfyUI is objectively the right call. Video is the one case where Topaz’s convenience wins.
GPU vs. CPU: Runtime Expectations
GPU (RTX 3060, 12GB VRAM):
- 1024×1024 image, 4x upscale: ~1 second
- 4000×4000 (tiled): ~5 seconds
- Batch of 100: ~2-3 minutes
CPU (Ryzen 7 5700X):
- Same image: ~20 seconds
- Batch of 100: ~30+ minutes
- Not practical for anything but single images
Real-ESRGAN is built for GPU. If you don’t have one, you’ve got a problem.
Picking an Upscaler
Here’s the decision tree:
Q: Is your input anime or drawn art?
- Yes → RealESRGAN_x4plus_anime_6B
- No → next
Q: Is your input already 800×600 or larger?
- Yes → RealESRGAN_x2plus (faster, less hallucination)
- No → next
Q: Do you have a good GPU (6GB+ VRAM)?
- Yes → RealESRGAN_x4plus_v3 (solid middle ground)
- No → RealESRGAN_x2plus (lighter footprint)
Q: Does the result have faces?
- Yes → chain with CodeFormer after
- No → you’re done
Q: Are you printing or going huge (4K+)?
- Yes → consider SwinIR for another pass
- No → stop, you’re overthinking it
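The same tree as a function, in case you want to drop it into a batch script. One assumption on my part: the first three questions pick the base model (first "yes" wins), while the faces and print questions are add-ons that apply whichever model you landed on. The function and parameter names are made up for illustration.

```python
def pick_upscaler(is_anime, width, height, vram_gb, has_faces, target_4k_plus):
    """Walk the questions in order and return the list of steps to run."""
    if is_anime:
        steps = ["RealESRGAN_x4plus_anime_6B"]
    elif width >= 800 and height >= 600:
        steps = ["RealESRGAN_x2plus"]      # already large: faster, less hallucination
    elif vram_gb >= 6:
        steps = ["RealESRGAN_x4plus_v3"]   # solid middle ground
    else:
        steps = ["RealESRGAN_x2plus"]      # lighter footprint
    if has_faces:
        steps.append("CodeFormer")         # face restoration after upscaling
    if target_4k_plus:
        steps.append("SwinIR")             # optional extra pass for print or 4K+
    return steps


old_portrait = pick_upscaler(False, 320, 240, 8, has_faces=True, target_4k_plus=False)
fan_art = pick_upscaler(True, 512, 512, 4, has_faces=False, target_4k_plus=False)
```

Treat the thresholds as the post's rules of thumb, not hard limits, and still test on one image first.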
The Real Talk
Real-ESRGAN isn’t magic. A 320×240 screenshot will never look like a native 1280×960 capture. But it’s the closest thing we’ve got to magic without training a new model from scratch.
Use it on old photos, small game assets, screenshots that matter (text-heavy ones excepted). Chain it for showcase work. Pair it with face restoration for portraits. And for God’s sake, test it on one image before batch-processing 500.
Your 2 AM self, staring at upscaled pixel mush, will appreciate the five minutes you spent testing.