Real-ESRGAN & Upscaling Tools

Why Stretching a Pixel Doesn’t Just Make It Bigger

You know that feeling when you find a perfect image online but it’s 640×480 and you need it for a poster? Your browser’s got a zoom button. Your photo app’s got a scale tool. Both will make it bigger. Both will make it look like you photographed a blurry potato.

For years, traditional upscaling was like trying to reverse-engineer a painting by squinting at a thumbnail. Bicubic, Lanczos, nearest-neighbor: they’re all fancy ways of guessing what pixels should exist between the ones you’ve got. They work. They’re fast. They’re also, well, blurry (or blocky, in nearest-neighbor’s case).

Then generative AI showed up and said: “What if we trained a neural network to hallucinate realistic pixels instead?” Enter Real-ESRGAN, BSRGAN, and friends, upscalers that use deep learning to make your low-res images actually look like high-res images. Not scaled. Restored.

This isn’t theoretical. If you’re running ComfyUI, generating images with Stable Diffusion, or working with old screenshots, you need to understand what upscalers do and which one fits your job.

How Upscalers Actually Work (No Math Required)

There are three flavors:

Traditional Upscaling (Bicubic/Lanczos)

You’ve been using this your whole life. Your browser does it. Your phone does it. It looks at neighboring pixels, averages them, fills in gaps. Fast. Computationally almost free. Terrible at inventing detail.

Think of it like photocopying a photocopy. The machine can’t add information that isn’t there. It just stretches what you’ve got.
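To make "averages them, fills in gaps" concrete, here is a minimal sketch of bilinear interpolation, the simpler cousin of bicubic and Lanczos. It is not what any particular browser or photo app runs; the function name and the tiny test grid are made up for illustration.

```python
def bilinear_upscale(pixels, factor):
    """Upscale a 2D grid of grayscale values by an integer factor."""
    src_h, src_w = len(pixels), len(pixels[0])
    out = []
    for y in range(src_h * factor):
        # Map the output row back to a fractional source row.
        sy = min(y / factor, src_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        fy = sy - y0
        row = []
        for x in range(src_w * factor):
            sx = min(x / factor, src_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            fx = sx - x0
            # Weighted average of the four nearest source pixels.
            top = pixels[y0][x0] * (1 - fx) + pixels[y0][x1] * fx
            bottom = pixels[y1][x0] * (1 - fx) + pixels[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


tiny = [[0, 100],
        [100, 200]]
big = bilinear_upscale(tiny, 2)  # 2x2 grid becomes 4x4
```

Every output value is a blend of values that were already in the input, so nothing brighter, darker, or sharper than the source can ever appear. That is the photocopy problem in code form.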

GAN-Based Upscaling (Real-ESRGAN, BSRGAN)

GAN = Generative Adversarial Network. Two neural networks fight it out: one generates fake high-res pixels, the other tries to catch the fakes. After thousands of rounds, the generator learns to hallucinate realistic detail.

Real-ESRGAN trains on high-resolution photos paired with synthetically degraded copies (blurred, noised, downsampled, JPEG-compressed), learns patterns in texture, noise, and edges, and applies that learned knowledge to fill in the gaps. A 512×512 image becomes 2048×2048 (4x upscale) by predicting plausible detail, not by recovering what was actually there.

The catch? It doesn’t always guess right. If your image has weird artifacts, Real-ESRGAN might enhance those artifacts too. Garbage in, slightly less blurry garbage out.

Diffusion-Based Upscaling (Stable Diffusion x4 Upscaler)

Newer kid on the block. Diffusion models (the same tech behind Stable Diffusion image generation) can also upscale by iteratively refining a low-res image into a high-res one. Slower, but sometimes better at preserving original intent over hallucinating detail.

Real talk: For most cases, GAN-based Real-ESRGAN is the sweet spot. Diffusion is overkill unless you’re willing to wait 30+ seconds per image.

The Real-ESRGAN Model Family

Real-ESRGAN isn’t one model. It’s a whole lineup. Choosing the right one matters.

General-Purpose Models

RealESRGAN_x4plus_anime_6B: for anime, drawn art, manga. Preserves linework. Use this for fan art.
RealESRGAN_x2plus: 2x upscale instead of 4x. Faster, less hallucination, good for photos that just need a nudge.
RealESRGAN_x4plus: vanilla 4x upscale. Works on photos, paintings, screenshots. Jack-of-all-trades.
realesr-general-x4v3: newer general-purpose model, trained on diverse datasets, more consistent than earlier versions. Also supports a -dn denoise-strength flag so you can dial back over-smoothing.

Specialized Models

These aren’t official Real-ESRGAN releases. They’re separate models that typically load through the same upscaler nodes.
NMKD_Siax: stylized art, game assets, maintains artistic intent better than generic models.
SwinIR: transformer-based upscaler, excellent for photos but slower (requires more VRAM).
4x-UltraSharp: tuned for sharpness, good for detailed photos where you want edges crisp and clear.

My take: Start with realesr-general-x4v3. It’s the reliable sedan of upscalers. Don’t use RealESRGAN_x4plus_anime_6B unless your input is actually anime. Specialist models are premature optimization.
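If you do want to try the standalone route, the sketch below builds a command line for the upstream repository's inference_realesrgan.py script. The flags (-n, -i, -o, -s, -dn, -t) are the ones I understand that script to accept; check them against its --help before relying on this. The helper name and file paths are hypothetical.

```python
def realesrgan_command(input_path, output_dir,
                       model_name="realesr-general-x4v3",
                       outscale=4, denoise_strength=None, tile=0):
    """Build (but do not run) an argv list for inference_realesrgan.py."""
    cmd = ["python", "inference_realesrgan.py",
           "-n", model_name,
           "-i", input_path,
           "-o", output_dir,
           "-s", str(outscale)]
    if denoise_strength is not None:
        # Assumption: -dn only applies to realesr-general-x4v3.
        if model_name != "realesr-general-x4v3":
            raise ValueError("-dn is only meaningful for realesr-general-x4v3")
        cmd += ["-dn", str(denoise_strength)]
    if tile > 0:
        # Tile size in pixels; 0 means no tiling.
        cmd += ["-t", str(tile)]
    return cmd


cmd = realesrgan_command("in.png", "out", denoise_strength=0.5)
```

Pass the resulting list to subprocess.run from inside a checkout of the repository once the model weights are in place.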

Beyond the Model: Real-ESRGAN in ComfyUI

Most people run Real-ESRGAN inside ComfyUI, not as a standalone CLI. Here’s a typical workflow node chain:

Load Image → Real-ESRGAN Upscale (x4) → Face Restoration (CodeFormer) → KSampler (optional refine)

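The Real-ESRGAN Upscale node takes your image and the model name. Pick your model and hit run. In stock ComfyUI the scale factor comes from the model file itself (an x4 model always outputs 4x), so choosing 2x or 4x really means choosing a 2x or 4x model. Some custom nodes add a separate scale setting that resizes the result afterwards.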

Node inputs:
  - image: your input
  - upscale_model: "realesr-general-x4v3.pth"
  - scale: 4

Output: upscaled image, ready for the next step.
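For scripting, the same chain can be written as an API-format workflow. This sketch assumes the stock ComfyUI node classes LoadImage, UpscaleModelLoader, ImageUpscaleWithModel, and SaveImage, and a local server accepting the JSON at its /prompt endpoint. The file names are placeholders, and custom upscaler nodes will use different class and input names.

```python
import json


def build_upscale_workflow(image_name, model_name):
    """Return an API-format ComfyUI graph: load image -> upscale -> save."""
    return {
        "1": {"class_type": "LoadImage",
              "inputs": {"image": image_name}},
        "2": {"class_type": "UpscaleModelLoader",
              "inputs": {"model_name": model_name}},
        # Links are [source_node_id, output_index].
        "3": {"class_type": "ImageUpscaleWithModel",
              "inputs": {"upscale_model": ["2", 0], "image": ["1", 0]}},
        "4": {"class_type": "SaveImage",
              "inputs": {"images": ["3", 0], "filename_prefix": "upscaled"}},
    }


workflow = build_upscale_workflow("input.png", "realesr-general-x4v3.pth")
payload = json.dumps({"prompt": workflow})  # body you would POST to /prompt
```

There is no scale input in this graph because, with these stock nodes, the factor is fixed by the model file you load.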

VRAM-friendly upscaling: If you’ve got a 4GB GPU and a 2048×2048 input, Real-ESRGAN might OOM. Solution: tile-based upscaling. Many ComfyUI upscaler nodes have a tile_size parameter. Set it to 512 or 1024. The node upscales the image in overlapping tiles, blends them, and you get a seam-free result without the VRAM spike.

tile_size: 512  ← splits into smaller chunks
overlap: 32     ← blend region to hide seams

Takes longer (2-3x slower), but your 4GB card can now handle 4000×4000+ images.
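Here is roughly how the tile split works, as a sketch. It only computes the overlapping boxes. A real node also upscales each tile and feathers the overlaps together, and the function name here is hypothetical.

```python
def tile_boxes(width, height, tile_size=512, overlap=32):
    """Return (left, top, right, bottom) boxes covering the image with overlap."""
    if overlap >= tile_size:
        raise ValueError("overlap must be smaller than tile_size")
    step = tile_size - overlap

    def starts(length):
        # One tile is enough when the image fits inside it.
        if length <= tile_size:
            return [0]
        positions = list(range(0, length - tile_size, step))
        positions.append(length - tile_size)  # last tile sits flush with the edge
        return positions

    return [(x, y, min(x + tile_size, width), min(y + tile_size, height))
            for y in starts(height)
            for x in starts(width)]


boxes = tile_boxes(2048, 2048, tile_size=512, overlap=32)
```

With these settings a 2048×2048 input splits into a 5×5 grid of 512-pixel tiles, so the GPU only ever holds one 512×512 tile and its upscaled output at a time.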

Chaining Upscalers: Low-Res Gen → 2x → 4x

Here’s a trick that sounds dumb but works: generate small, upscale twice.

KSampler (512×512) → Real-ESRGAN x2 (1024×1024) → Real-ESRGAN x4 (4096×4096)

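Why? Each upscaler pass adds detail at a different scale. The first 2x pass refines micro-detail. The second 4x pass adds macro-structure. Note that the chain is 8x overall (2 × 4), more than any single model above delivers, and it often looks better than stretching one 4x result the rest of the way.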

It’s slower (obviously), but for showcase work, it’s worth the extra compute.
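The size arithmetic for a chain is worth checking before you queue it, because the factors multiply. A small sketch (the helper name is made up):

```python
def chain_sizes(width, height, factors):
    """Return the image size after each successive upscale pass."""
    sizes = [(width, height)]
    for factor in factors:
        width, height = width * factor, height * factor
        sizes.append((width, height))
    return sizes


sizes = chain_sizes(512, 512, [2, 4])
total_factor = sizes[-1][0] // sizes[0][0]
pixel_ratio = (sizes[-1][0] * sizes[-1][1]) // (sizes[0][0] * sizes[0][1])
```

For the 512×512 example the chain ends at 4096×4096: 8x per side and 64x the pixel count, which is where the extra compute goes.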

Face Restoration: The Sidekick

Real-ESRGAN is great at textures and edges. Faces? Faces are weird. Human perception is absurdly good at spotting wrong eyes.

CodeFormer and GFPGAN are face-specific restoration models that run after upscaling. They detect faces, restore them separately, and blend them back in.

ComfyUI workflow:

Real-ESRGAN → CodeFormer Face Restore → Output

CodeFormer’s advantage: it handles large upscales (4x+) better and doesn’t over-smooth. GFPGAN is older but still solid.

Use them when: your image has faces and they look mushy after upscaling. Use them liberally on AI-generated portraits (they fix a lot of common artifacts).

Skip them when: the original has no faces or faces look fine already. Face restoration can over-process and introduce plastic-surgery vibes.

Upscaling for Video: A Teaser

Video upscaling is Real-ESRGAN’s evil twin. Each frame needs upscaling, but your video is 30 fps × 90 seconds = 2700 frames. Running each through ComfyUI takes hours.

Tools like Topaz Video Enhance AI batch-process this. Real-ESRGAN can also run on video, frame by frame. Frame interpolation (filling in new frames between originals) is a separate step handled by other tools, and that’s a whole rabbit hole.

Quick version: upscale a key frame or two in ComfyUI to settle on a model and settings, batch the rest of the frames with the same settings, and, if you also want a higher frame rate, run a frame interpolator like DAIN or RIFE afterwards. Those tools create in-between frames; they don’t upscale. Practical for archival footage, speedruns, old gameplay captures.

Real-ESRGAN vs. BSRGAN vs. SwinIR: Which Is Fastest?

Model	Speed (1024×1024 input)	Quality (subjective)	VRAM
realesr-general-x4v3	Fast (~1 s)	8/10	2-3 GB
BSRGAN	Slower (~3 s)	9/10 (faces)	3-4 GB
SwinIR	Slowest (~5-10 s)	9/10 (details)	4-6 GB
4x-UltraSharp	Fast (~1 s)	8.5/10 (sharp)	2-3 GB

Real-ESRGAN dominates for speed-to-quality. SwinIR is overkill unless you’re upscaling museum-quality artwork. BSRGAN is great if you’ve got the VRAM and patience.

When 4x Upscale Is Overkill (And When It’s Not)

Use 4x upscale when:

Original is 256×256 or smaller (screenshot, old photo, thumbnail)
You need wall-art resolution (2560×1440 minimum)
You’re printing (or rendering for video)
You want maximum detail reconstruction

Use 2x upscale when:

Original is already 800×600+ (minor cleanup)
You’re upscaling meme text (4x adds too much smoothing)
VRAM is tight
Speed matters (a 2x pass writes a quarter as many output pixels as a 4x pass, so it is usually several times faster)

Use no upscaling when:

Original is already high-res
You’re upscaling JPG soup (compression artifacts will be “enhanced”)
It’s a screenshot of text (Real-ESRGAN doesn’t help text sharpness much)
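If you have a target resolution in mind, the 2x-or-4x question is mostly arithmetic. A minimal sketch, assuming only 2x and 4x models are on hand (the function name and the None-means-chain convention are mine):

```python
def pick_scale(src_w, src_h, target_w, target_h, available=(2, 4)):
    """Return the smallest factor that reaches the target, 1 if none is needed,
    or None when a single pass cannot get there."""
    needed = max(target_w / src_w, target_h / src_h)
    if needed <= 1:
        return 1  # already big enough, skip upscaling
    for factor in sorted(available):
        if factor >= needed:
            return factor
    return None  # would need a chain of passes


poster = pick_scale(640, 480, 2560, 1440)
```

A 640×480 source needs the full 4x to reach 2560×1440, a 1280×720 source gets there with 2x, and a 256×256 thumbnail cannot reach it in one pass at all.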

Free vs. Paid: Topaz Reality Check

Paid option: Topaz Video Enhance AI, since renamed Topaz Video AI (about $200 one-time when this was written; check current pricing)

GUI. One-click video upscaling. Batching built-in. Face detection.
Worth it if you’re doing video work. Not worth it for stills (ComfyUI is free).

Free option: ComfyUI + Real-ESRGAN

Node-graph GUI in the browser, not a one-click app (ComfyUI Manager helps with installing custom nodes). Steeper learning curve.
Unlimited stills. Slower video batch (but doable with scripting).
GPU strongly recommended (it can run on CPU, but expect 20-30 seconds per image).

Verdict: If you’re upscaling images for your blog or self-hosting project, Real-ESRGAN + ComfyUI is objectively the right call. Video is the one case where Topaz’s convenience wins.

GPU vs. CPU: Runtime Expectations

GPU (RTX 3060, 12GB VRAM):

1024×1024 image, 4x upscale: ~1 second
4000×4000 (tiled): ~5 seconds
Batch of 100: ~2-3 minutes

CPU (Ryzen 7 5700X):

Same image: ~20 seconds
Batch of 100: ~30+ minutes
Not practical for anything but single images

Real-ESRGAN is built for GPU. If you don’t have one, you’ve got a problem.
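You can sanity-check a batch before starting it with back-of-the-envelope math. This sketch takes the per-image figures above at face value and ignores model loading and disk I/O, so treat the results as a lower bound.

```python
def batch_minutes(image_count, seconds_per_image):
    """Estimate pure compute time for a batch, in minutes."""
    return image_count * seconds_per_image / 60


gpu_batch = batch_minutes(100, 1)    # 100 images at ~1 s each
cpu_batch = batch_minutes(100, 20)   # 100 images at ~20 s each
video_frames = 30 * 90               # the 90-second, 30 fps clip from earlier
gpu_video = batch_minutes(video_frames, 1)
cpu_video_hours = batch_minutes(video_frames, 20) / 60
```

The GPU batch works out to a bit under 2 minutes of compute and the CPU batch to about 33, in line with the figures above once overhead is added. The 2700-frame clip is about 45 minutes on the GPU and about 15 hours on the CPU, assuming each frame costs the same as a 1024×1024 still.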

Picking an Upscaler

Here’s the decision tree:

Q: Is your input anime or drawn art?

Yes → RealESRGAN_x4plus_anime_6B
No → next

Q: Is your input already 800×600 or larger?

Yes → RealESRGAN_x2plus (faster, less hallucination)
No → next

Q: Do you have a good GPU (6GB+ VRAM)?

Yes → realesr-general-x4v3 (solid middle ground)
No → RealESRGAN_x2plus (lighter footprint)

Q: Does the result have faces?

Yes → chain with CodeFormer after
No → you’re done

Q: Are you printing or going huge (4K+)?

Yes → consider SwinIR for another pass
No → stop, you’re overthinking it
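The same tree, written as a function you could drop into a batch script. This is a literal transcription of the questions above, not a tuned heuristic. The parameter names are mine, and "800×600 or larger" is read as both dimensions meeting the threshold.

```python
def pick_upscaler(is_anime, width, height, vram_gb, has_faces, printing_or_4k):
    """Return the ordered list of models to run, following the decision tree."""
    if is_anime:
        model = "RealESRGAN_x4plus_anime_6B"
    elif width >= 800 and height >= 600:
        model = "RealESRGAN_x2plus"       # faster, less hallucination
    elif vram_gb >= 6:
        model = "realesr-general-x4v3"    # solid middle ground
    else:
        model = "RealESRGAN_x2plus"       # fallback for small GPUs
    steps = [model]
    if has_faces:
        steps.append("CodeFormer")        # face restoration after upscaling
    if printing_or_4k:
        steps.append("SwinIR")            # optional extra pass
    return steps


plan = pick_upscaler(is_anime=False, width=320, height=240,
                     vram_gb=8, has_faces=True, printing_or_4k=False)
```

For a 320×240 photo with faces on an 8 GB card, this returns realesr-general-x4v3 followed by CodeFormer.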

The Real Talk

Real-ESRGAN isn’t magic. A 320×240 screenshot will never look like a native 1280×960 capture. But it’s the closest thing we’ve got to magic without training a new model from scratch.

Use it on old photos, small game assets, low-res screenshots. Chain it for showcase work. Pair it with face restoration for portraits. And for God’s sake, test it on one image before batch-processing 500.

Your 2 AM self, staring at upscaled pixel mush, will appreciate the five minutes you spent testing.
