ControlNet & LoRA: Advanced Image Control

By SumGuy 13 min read
Stable Diffusion Without Guardrails Is Just Slot Machines

Text-to-image is powerful, right up until you need the character to actually look like the reference, or the composition to match your sketch, or the lighting to not look like it was designed by someone’s angry fever dream at 3 AM. Then you’re rerolling the slot machine fifty times with microscopically tweaked prompts, which is honestly the opposite of control.

That’s where ControlNet and LoRA come in. They’re not magic bullet fixes — they’re precision tools that transform image generation from “hope the dice gods smile” to “I know what I’m getting.” And yeah, stacking them together is where things get genuinely fun.

Let’s talk about what they actually do, why they work, and how to wire them into ComfyUI without melting your brain (or your VRAM).

ControlNet: Structural Control

ControlNet is a mechanism that adds spatial conditioning to Stable Diffusion. Instead of just text guiding the generation, you feed in a control map — a preprocessed image that encodes specific structural information — and the model learns to respect it.

Think of it like this: text is your high-level direction (“make a cyberpunk mercenary in tactical gear”), and ControlNet is your architectural blueprint. Text says what, ControlNet says where.

Canny Edge Detection

Canny finds edges in an image — clean, hard lines where pixel intensity changes sharply. Feed a sketch or a reference photo to Canny preprocessing, and you get a monochrome outline. Pipe that into a ControlNet and Stable Diffusion will generate new content that follows those exact edges.

When to use: You have a sketch, a rough layout, or a reference photo and you want the generated image to match the composition exactly. Gesture drawings, wireframes, architectural sketches — all solid Canny targets.

The gotcha: Canny is sensitive. Too low a threshold and you get noise instead of edges. Too high and fine details vanish. Most workflows let you tune the thresholds — start conservative and dial up if the output lacks definition.
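To see why both thresholds matter, here is a toy sketch of the double-threshold (hysteresis) step that Canny applies to gradient magnitudes. It is not a full Canny implementation (there is no blurring, gradient computation, or non-maximum suppression), and the grid values and function names are made up for illustration.

```python
from collections import deque

def hysteresis(magnitude, low, high):
    """Keep strong pixels (>= high) plus weak pixels (>= low) connected to them."""
    rows, cols = len(magnitude), len(magnitude[0])
    keep = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed the search with every strong pixel.
    for r in range(rows):
        for c in range(cols):
            if magnitude[r][c] >= high:
                keep[r][c] = True
                queue.append((r, c))
    # Grow outward through 8-connected weak pixels.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and not keep[nr][nc] and magnitude[nr][nc] >= low:
                    keep[nr][nc] = True
                    queue.append((nr, nc))
    return keep

# Hypothetical gradient magnitudes: an edge fading out along row 1,
# plus one isolated noise speck in the bottom-right corner.
GRADIENT = [
    [0,   0,   0,  0,  0],
    [200, 150, 90, 60, 0],
    [0,   0,   0,  0,  0],
    [0,   0,   0,  0, 70],
]

def count_edges(low, high):
    return sum(cell for row in hysteresis(GRADIENT, low, high) for cell in row)

print(count_edges(low=50, high=100))   # 4: whole edge kept, speck rejected
print(count_edges(low=100, high=180))  # 2: the faint end of the edge vanishes
print(count_edges(low=50, high=60))    # 5: the noise speck now counts as an edge
```

The three calls mirror the failure modes above: thresholds set too high drop fine detail, thresholds set too low let noise through.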

Depth Maps (MiDaS & ZoeDepth)

Depth preprocessing estimates how far away each pixel is from the camera. MiDaS is the older, faster option. ZoeDepth (from Intel) is newer, more accurate, slightly slower. Either way, you get a grayscale map where bright = close, dark = far. Feed that to ControlNet and Stable Diffusion will respect the spatial depth cues.

When to use: You have a reference photo and want the 3D layout preserved (camera angle, foreground/background separation, object positioning). Portrait with specific lighting depth, landscape with depth layering, architectural renders where perspective matters.

The gotcha: Depth estimation from 2D photos is inherently guesswork. A room that’s actually a hallway might estimate as a wide space. JPEG compression destroys depth cues. Feed it clean, well-lit source images or expect weird results.
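The bright = close convention is easy to get backwards, so here is a minimal sketch of turning a depth estimate into a grayscale control map. It assumes the input is metric depth in meters and rescales linearly; real estimators such as MiDaS output relative (inverse) depth and the preprocessor handles the scaling for you, so treat the function name and numbers as illustrative only.

```python
def depth_to_control_map(depth_m):
    """Map depth values to 0-255 gray where the nearest pixel is 255 and the farthest is 0."""
    flat = [d for row in depth_m for d in row]
    near, far = min(flat), max(flat)
    if far == near:
        # A perfectly flat scene carries no depth cue; return a uniform map.
        return [[255 for _ in row] for row in depth_m]
    return [[round(255 * (far - d) / (far - near)) for d in row] for row in depth_m]

# Hypothetical 2x2 depth estimate in meters.
depth = [[1.0, 2.0],
         [4.0, 9.0]]
print(depth_to_control_map(depth))  # [[255, 223], [159, 0]]
```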

OpenPose (Skeleton Pose)

OpenPose extracts the human (or animal) skeleton from an image — joints, limbs, spine. You get a stick figure essentially. Feed that skeleton to ControlNet and Stable Diffusion will generate a new character in exactly that pose.

When to use: You need a specific pose — someone reaching, dancing, sitting, crouching. A pose reference from a photo, your own gesture drawing, or a stick figure you sketched yourself. Game character in-betweening, action shot composition, full-body character consistency.

The gotcha: OpenPose struggles with foreshortening and arms crossing the body (ambiguous which arm is which). Occlusion breaks it. But for clean, mostly-visible poses it’s rock solid. Also, the original OpenPose code is licensed for non-commercial use only, so some ComfyUI distributions leave the preprocessor out. Verify your install has it.

Scribble / Lineart

Scribble takes any hand-drawn sketch (even rough, messy stuff) and cleans it into crisp line art. Lineart is similar but optimized for illustrations with strong black lines on white. Both feed into ControlNet to make Stable Diffusion respect your drawn composition.

When to use: You sketched something in Krita or Photoshop and want Stable Diffusion to fill in details, colors, and shading while keeping your lines intact. Concept artists who want model assistance without losing the hand-drawn feel. Faster than painting the full image yourself.

The gotcha: Really messy sketches still look messy after preprocessing. You need enough line definition for the ControlNet to latch onto. Also, the model still interprets what you drew based on text prompt — a dog-shaped scribble + “spaceship” might still generate a spaceship because the prompt is strong.

Tile / Tiling

Tile processing splits an image into overlapping patches and processes them. Feed that into ControlNet and you get localized, consistent detail generation. Useful for upscaling without seams, fixing a small region, or ensuring texture continuity.

When to use: Inpainting (fixing a small area), upscaling a low-res image while keeping composition, texture expansion (make a 512×512 seamless tileable texture from a 256×256), consistency-preserving edits.

The gotcha: Tile ControlNet is less forgiving of mismatches between patch boundaries. Also, it can feel overfit to local details — sometimes you lose overall coherence if you’re too aggressive.
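For a feel of what "overlapping patches" means in practice, here is a sketch of how a tiled workflow might compute its patch boxes. The tile size, overlap, and function name are assumptions for illustration, not what any particular ComfyUI node does internally.

```python
def tile_boxes(width, height, tile=512, overlap=64):
    """Return (left, top, right, bottom) boxes that cover the image with overlapping tiles."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    stride = tile - overlap

    def starts(size):
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)  # last tile sits flush with the far edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]

for box in tile_boxes(960, 512):
    print(box)
# (0, 0, 512, 512)
# (448, 0, 960, 512)
```

The 64-pixel band shared by the two boxes is where neighboring patches get blended, which is what hides the seams.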

Inpaint / Inpaint Model

Inpaint ControlNet is specifically for masked edits. You provide an image, a mask (white = change, black = keep), and a prompt. The model respects the mask while generating new content in the masked region.

When to use: “I like 90% of this image but that person’s hand is a clawed disaster, let me fix just the hand.” Removing objects, replacing faces, fixing AI artifacts, adding details to a specific region.

The gotcha: If your mask edge is too soft, the boundary between kept and generated content looks obvious. Hard mask edges help, but even then there can be color/lighting discontinuity. Keep the masked region reasonably similar in content to the surrounding area or you’ll get jarring seams.
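The mask convention (white = change, black = keep) boils down to a per-pixel blend. The sketch below shows only that final compositing step on a single row of made-up gray values. Actual inpainting does its work in latent space during sampling, so this is an illustration of how mask edges affect the boundary, not a description of the model.

```python
def composite(original, generated, mask):
    """Blend per pixel: mask 1.0 takes the generated value, 0.0 keeps the original."""
    return [
        [round(m * g + (1.0 - m) * o) for o, g, m in zip(orow, grow, mrow)]
        for orow, grow, mrow in zip(original, generated, mask)
    ]

# Hypothetical one-row images: kept content is gray 100, new content is gray 200.
original = [[100, 100, 100, 100]]
generated = [[200, 200, 200, 200]]
hard_mask = [[0.0, 0.0, 1.0, 1.0]]
soft_mask = [[0.0, 0.25, 0.75, 1.0]]

print(composite(original, generated, hard_mask))  # [[100, 100, 200, 200]]
print(composite(original, generated, soft_mask))  # [[100, 125, 175, 200]]
```

A hard mask switches abruptly between the two sources, while a soft mask produces in-between pixels that belong fully to neither.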


LoRA: Style, Character, and Concept Fine-Tunes

LoRA stands for “Low-Rank Adaptation.” It’s a way to encode stylistic or conceptual knowledge into tiny, mergeable weight updates — usually 10–200 MB instead of the 4 GB base model.

Think of the base Stable Diffusion model as a generalist painter. A LoRA is like handing that painter a style guide: “paint like anime,” “paint like oil on canvas,” “paint my character OC,” “paint in the style of artist X.”

How LoRA Works (Brief Version)

The math: LoRA represents each weight change as the product of two low-rank matrices (hence “low-rank”). Instead of fine-tuning all of the roughly 860M parameters in the SD1.5 UNet (expensive, and it produces a full-size checkpoint), you freeze the base weights and train only a small pair of matrices for each adapted layer. That is why the resulting file is megabytes rather than gigabytes.

The practical bit: You apply a LoRA by telling ComfyUI to load it and blend it into the model at inference time. You can stack multiple LoRAs (style + character, aesthetic + subject, etc.) and control the blend strength of each.
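The blend itself is simple arithmetic. Here is a minimal sketch on a toy 3×3 weight matrix with a rank-1 update; the matrices and the apply_lora name are made up, and real implementations typically also scale the update by a stored alpha / rank factor, which is omitted here.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(weight, down, up, strength=1.0):
    """Return weight + strength * (up @ down) without modifying the inputs."""
    delta = matmul(up, down)
    return [
        [w + strength * d for w, d in zip(wrow, drow)]
        for wrow, drow in zip(weight, delta)
    ]

# Toy 3x3 base weight and a rank-1 LoRA (up is 3x1, down is 1x3).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
up = [[1.0], [2.0], [0.0]]
down = [[0.5, 0.0, 1.0]]

print(apply_lora(W, down, up, strength=0.5))
# [[1.25, 0.0, 0.5], [0.5, 1.0, 1.0], [0.0, 0.0, 1.0]]
```

Stacking LoRAs is the same operation applied repeatedly: each one adds its own scaled update on top of the previous result.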

Where to Find LoRAs

CivitAI is the primary source — civitai.com is the de facto hub. Search by category (character, aesthetic, model tweaks, style, pose), check ratings, read descriptions, check the preview images, download.

The CivitAI caveats: Not all LoRAs are legal or ethical. NSFW filters exist but aren’t perfect. Some LoRAs are trained on copyrighted art (style clones of specific artists). Some are trained on character likenesses without permission. Some are broken and don’t work. Spend 30 seconds reading the description and preview images before downloading. Bad LoRAs waste time.

Other sources: Hugging Face has community LoRAs. Some are experimental and excellent, some are garbage. Use judgment.

LoRA Rank and What It Means

LoRA rank is typically 4, 8, 16, 32, or 64. Lower rank = smaller file, less capacity, more “directional” influence. Higher rank = larger file, more nuanced control, more risk of overfit or poisoning the prompt.

Rule of thumb: Start with rank 8 or 16. If you’re applying a LoRA at strength 0.7+ and it’s still not “strong enough,” you probably want a higher-rank LoRA. If you apply it at strength 0.3 and it’s already overpowering, you want lower rank or just a different LoRA.
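Rank translates directly into parameter count, which is why file size tracks it. A quick sketch for one hypothetical 768×768 layer (the layer size is an assumption chosen for round numbers, not a claim about any specific model layer):

```python
def lora_params(d_out, d_in, rank):
    """Parameters in one LoRA pair: up is d_out x rank, down is rank x d_in."""
    return rank * (d_out + d_in)

D = 768  # hypothetical square layer
full = D * D

for rank in (4, 8, 16, 32, 64):
    n = lora_params(D, D, rank)
    print(f"rank {rank:>2}: {n:>6} params ({100 * n / full:.1f}% of the full layer)")
# rank  4:   6144 params (1.0% of the full layer)
# rank  8:  12288 params (2.1% of the full layer)
# rank 16:  24576 params (4.2% of the full layer)
# rank 32:  49152 params (8.3% of the full layer)
# rank 64:  98304 params (16.7% of the full layer)
```

Doubling the rank doubles the trainable parameters per layer, so capacity and file size scale together.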

LoRA Weight / Strength

Most workflows let you control LoRA blend strength (0.0 to 1.0, sometimes beyond). 1.0 = full strength, 0.5 = halfway blend with base model, 0.0 = disabled.

Starting values: character LoRAs usually go 0.8–1.0. Style LoRAs often live at 0.5–0.8 (full strength can look artificial). Pose LoRAs vary wildly.


Stacking ControlNet + LoRA: Structure + Style

Here’s where it gets good: you can use ControlNet for spatial control and LoRA for stylistic/conceptual control at the same time.

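Example: “I want a cyberpunk soldier in a specific pose, drawn in anime style.” Extract the pose skeleton with OpenPose and feed it to ControlNet, load an anime style LoRA, and describe the soldier in the prompt.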

The model respects both the skeleton and the style guide simultaneously. You get precise pose and consistent aesthetic. It’s the difference between “I hope this looks right” and “I know exactly what I’m getting.”

Stacking Multiple ControlNets

You can even stack multiple ControlNets. Canny + Depth is common (composition + spatial depth). OpenPose + Scribble works if you’re overly ambitious and patient.

The catch: every ControlNet adds memory pressure and inference latency. Two ControlNets are fine. Three starts to get dicey on consumer VRAM. Four requires actual GPU memory and patience.
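In ComfyUI, stacking usually means chaining: the conditioning that comes out of one ControlNetApply node feeds the next one. Below is a sketch that generates those nodes in API format. The node ids, the depth filename, and the image links are placeholders, and the input names reflect my understanding of the built-in ControlNetLoader and ControlNetApply nodes.

```python
def chain_controlnets(base_conditioning, controls):
    """Build API-format nodes where each ControlNetApply consumes the previous conditioning.

    controls is a list of (controlnet_filename, image_link, strength) tuples.
    Returns (nodes, final_conditioning_link).
    """
    nodes = {}
    conditioning = base_conditioning
    for i, (filename, image_link, strength) in enumerate(controls):
        loader_id, apply_id = f"cn_loader_{i}", f"cn_apply_{i}"
        nodes[loader_id] = {
            "class_type": "ControlNetLoader",
            "inputs": {"control_net_name": filename},
        }
        nodes[apply_id] = {
            "class_type": "ControlNetApply",
            "inputs": {
                "conditioning": conditioning,
                "control_net": [loader_id, 0],
                "image": image_link,
                "strength": strength,
            },
        }
        # The next ControlNet (or the sampler) reads from this node's output.
        conditioning = [apply_id, 0]
    return nodes, conditioning

# Hypothetical Canny + Depth stack hanging off a "positive" prompt node.
nodes, final = chain_controlnets(
    ["positive", 0],
    [
        ("control_canny-fp16.safetensors", ["canny_image", 0], 1.0),
        ("control_depth-fp16.safetensors", ["depth_image", 0], 0.8),
    ],
)
print(final)  # ['cn_apply_1', 0]
```

Each extra entry in the list adds one more loaded ControlNet model, which is where the additional VRAM and latency come from.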


ComfyUI Node Graph Examples

ComfyUI’s strength is node-based workflows — you wire preprocessing, ControlNet, LoRA loading, and sampling together. Here’s what a practical setup looks like (textual description):

Basic ControlNet workflow:

  1. Load Canny preprocessor node
  2. Feed reference image → Canny
  3. Load base model (Stable Diffusion 1.5 or SDXL)
  4. Load ControlNet model (controlnet-canny for SD, etc.)
  5. Create sampler node (KSampler, with a sampler_name such as euler or dpmpp_2m)
  6. Connect: canny output + controlnet → ControlNetApply on the prompt conditioning, then that conditioning + base model → sampler
  7. Decode output latent to image
  8. View result

ControlNet + LoRA stack:

  1. Same as above, but:
    • Load LoRA node
    • Load base model → LoRA loader node at strength 0.7
    • LoRA output → sampler’s model input
  2. Sampler now gets a LoRA-blended model and ControlNet conditioning

Minimal JSON snippet (LoRA loading in ComfyUI):

{
  "1": {
    "inputs": {
      "ckpt_name": "model.safetensors"
    },
    "class_type": "CheckpointLoaderSimple"
  },
  "2": {
    "inputs": {
      "lora_name": "my_lora.safetensors",
      "strength_model": 0.7,
      "strength_clip": 0.7,
      "model": ["1", 0],
      "clip": ["1", 1]
    },
    "class_type": "LoraLoader"
  }
}

The strength_model and strength_clip control blend strength (0.0 to 1.0). Tweak these to dial in the LoRA influence.

For ControlNet, add a ControlNetLoader and ControlNetApply node:

{
  "controlnet_loader": {
    "inputs": {
      "control_net_name": "control_canny-fp16.safetensors"
    },
    "class_type": "ControlNetLoader"
  },
  "controlnet_apply": {
    "inputs": {
      "strength": 1.0,
      "conditioning": ["conditioning", 0],
      "control_net": ["controlnet_loader", 0],
      "image": ["load_image", 0]
    },
    "class_type": "ControlNetApply"
  }
}

Real workflows are bigger and more connected, but the principle is the same: load → apply → condition → sample.
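To show how the pieces connect end to end, here is a sketch that assembles a complete ControlNet + LoRA graph as a Python dict and checks that every link points at a node that exists. The filenames, prompts, and node ids are placeholders, the control image is assumed to be an already-preprocessed Canny map, and the input names are the ones I understand ComfyUI's built-in nodes to use. Treat it as a starting point to compare against a workflow exported from your own install.

```python
import json

workflow = {
    "ckpt": {"class_type": "CheckpointLoaderSimple",
             "inputs": {"ckpt_name": "model.safetensors"}},
    # LoRA patches both the model (output 0) and the text encoder (output 1).
    "lora": {"class_type": "LoraLoader",
             "inputs": {"lora_name": "my_lora.safetensors",
                        "strength_model": 0.7, "strength_clip": 0.7,
                        "model": ["ckpt", 0], "clip": ["ckpt", 1]}},
    "positive": {"class_type": "CLIPTextEncode",
                 "inputs": {"text": "cyberpunk soldier, anime style", "clip": ["lora", 1]}},
    "negative": {"class_type": "CLIPTextEncode",
                 "inputs": {"text": "blurry, low quality", "clip": ["lora", 1]}},
    "load_image": {"class_type": "LoadImage",
                   "inputs": {"image": "canny_map.png"}},
    "controlnet_loader": {"class_type": "ControlNetLoader",
                          "inputs": {"control_net_name": "control_canny-fp16.safetensors"}},
    # ControlNet attaches to the positive conditioning, not to the model.
    "controlnet_apply": {"class_type": "ControlNetApply",
                         "inputs": {"strength": 1.0,
                                    "conditioning": ["positive", 0],
                                    "control_net": ["controlnet_loader", 0],
                                    "image": ["load_image", 0]}},
    "latent": {"class_type": "EmptyLatentImage",
               "inputs": {"width": 512, "height": 512, "batch_size": 1}},
    "sampler": {"class_type": "KSampler",
                "inputs": {"model": ["lora", 0],
                           "positive": ["controlnet_apply", 0],
                           "negative": ["negative", 0],
                           "latent_image": ["latent", 0],
                           "seed": 0, "steps": 20, "cfg": 7.0,
                           "sampler_name": "euler", "scheduler": "normal",
                           "denoise": 1.0}},
    "decode": {"class_type": "VAEDecode",
               "inputs": {"samples": ["sampler", 0], "vae": ["ckpt", 2]}},
    "save": {"class_type": "SaveImage",
             "inputs": {"images": ["decode", 0], "filename_prefix": "controlnet_lora"}},
}

def dangling_links(graph):
    """List (node, input, target) for every link whose target node is missing."""
    missing = []
    for node_id, node in graph.items():
        for name, value in node["inputs"].items():
            is_link = isinstance(value, list) and len(value) == 2
            if is_link and value[0] not in graph:
                missing.append((node_id, name, value[0]))
    return missing

print(dangling_links(workflow))          # [] means every connection resolves
print(len(json.dumps(workflow)) > 0)     # the graph serializes to JSON
```

Note where the two tools plug in: the LoRA sits between the checkpoint and everything that uses the model or CLIP, while the ControlNet sits between the positive prompt and the sampler.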


SDXL vs. SD1.5: ControlNet and LoRA Differences

SD1.5 (2022):

SDXL (2023):

Practical pick: If you’re running on consumer hardware (RTX 3060 Ti, RTX 4070, RTX 4080), start with SD1.5. Faster iteration, lower frustration. If you have the VRAM and want photorealism, SDXL is worth it.

ControlNets and LoRAs are not cross-compatible: an SD1.5 LoRA won’t work on SDXL, and neither will an SD1.5 ControlNet. Make sure both match your base model.
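If a download page doesn't say which base model a LoRA targets, the file itself sometimes does. A .safetensors file starts with an 8-byte little-endian header length followed by a JSON header, which may contain a free-form __metadata__ block. The sketch below reads it using only the standard library. The ss_base_model_version key is an assumption: some training scripts write it, many files have no metadata at all, so "unknown" is a normal answer. The demo writes a fake header-only file rather than touching a real LoRA.

```python
import json
import struct
import tempfile
from pathlib import Path

def read_safetensors_metadata(path):
    """Return the optional __metadata__ dict from a .safetensors header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian length
        header = json.loads(f.read(header_len))
    return header.get("__metadata__") or {}

def guess_base_model(path, key="ss_base_model_version"):
    """Look up a trainer-written base-model hint; the key name is an assumption."""
    return read_safetensors_metadata(path).get(key, "unknown")

def write_fake_lora(path, metadata):
    """Write a header-only file for the demo (no tensors, not a usable LoRA)."""
    header = json.dumps({"__metadata__": metadata}).encode("utf-8")
    Path(path).write_bytes(struct.pack("<Q", len(header)) + header)

with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "my_lora.safetensors"
    write_fake_lora(demo, {"ss_base_model_version": "sdxl_base_v1-0"})
    detected = guess_base_model(demo)

    bare = Path(tmp) / "no_metadata.safetensors"
    write_fake_lora(bare, {})
    undetected = guess_base_model(bare)

print(detected)    # sdxl_base_v1-0
print(undetected)  # unknown
```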


VRAM and Performance Expectations

Memory estimates (rough):

Practical targets:

Optimization tricks: enable memory-efficient attention, use FP16 precision instead of FP32, enable VAE tiling if you run out of memory during decode. ComfyUI handles most of this automatically if you let it.
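The FP16 tip is just arithmetic on bytes per parameter. A rough sketch using the roughly 860M-parameter UNet figure from earlier; it covers weights only, so activations, the VAE, the text encoder, and any ControlNets come on top.

```python
def weight_gb(params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9

UNET_PARAMS = 860_000_000  # approximate SD1.5 UNet size

print(f"FP32: {weight_gb(UNET_PARAMS, 4):.2f} GB")  # FP32: 3.44 GB
print(f"FP16: {weight_gb(UNET_PARAMS, 2):.2f} GB")  # FP16: 1.72 GB
```

Halving the precision halves the weight footprint, which is often the difference between fitting a second ControlNet and running out of memory.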


Common Gotchas and How to Avoid Them

Color shifting: Sometimes the generated image ignores the color palette of your control image. This usually means the prompt is too strong in the opposite direction. Dial the prompt back or boost the ControlNet strength (up to 1.5 sometimes helps).

ControlNet weight tuning: Default strength is 1.0. For Canny and Depth, this is usually right. For Pose and Scribble, 0.7–0.9 often looks better (less rigid). Tile ControlNet needs finesse — experiment from 0.5 upward.

Conflicting LoRAs: Stacking two character LoRAs or two art styles usually ends badly (character morphing, style contamination). Stick to one character LoRA + one or two style/aesthetic LoRAs max.

LoRA not working: Make sure it’s in the right folder (models/loras/ in ComfyUI), verify the filename exactly (including .safetensors suffix), and check that you’re loading the right base model (SD1.5 LoRA on SDXL = nothing happens).

Blurry output: Common when ControlNet strength is too high and overrides the model’s learned detail. Try 0.8–1.0 instead of 1.5. Also, bad reference images → bad control maps. Clean, clear source = better results.


When to Reach for What

The honest truth: ControlNet and LoRA are multipliers on your intent. If your prompt sucks, they’ll make a sharper version of a bad idea. But if you know what you want — a pose, a style, a composition — they’ll let you get there predictably instead of rolling the slot machine fifty times.

That’s control.

