Simpsonization — Identity-Preserving Face Stylization

Problem

Off-the-shelf "cartoonify" filters tend to fail in two distinct ways: either they erase the subject's identity, or the flat 2D style clashes badly with the photographic 3D scene around it. The goal of this project was to bridge that gap — apply a strongly stylized look without losing identity, and without leaving the cartoon "floating" on top of the photo.

Pipeline

Four components built around a Stable Diffusion inpainting backbone:

LoRA fine-tuning. A lightweight LoRA (rank 16) was trained on a curated Simpson-style dataset to capture the look — yellow skin, claymation-style volume, characteristic features — without touching the base weights. This avoids catastrophic forgetting and was tractable under a tight compute budget.
9-channel inpainting UNet. The UNet input is concatenated as [noisy latents (4ch) | masked-image latents (4ch) | binary mask (1ch)]. Giving the mask explicitly — rather than letting the model infer it — provides a strong locality signal that the masked region is what to fill, and yields faster convergence.
ControlNet (Canny edges). A Canny-conditioned ControlNet locks in jawline, eyes, and mouth shape. Canny was chosen over Depth or OpenPose because it preserves fine facial structure on stylized faces, where pose estimators tend to fail.
Repaint-style background loop. At each denoising step, known background latents are re-injected outside the mask, so the non-face region stays locked to the input photo and the composite is seamless.
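The channel layout and the background re-injection described above can be sketched in a few lines. This is a minimal illustration rather than the project's actual code: NumPy arrays stand in for latent tensors, the function names `build_unet_input` and `reinject_background` are hypothetical, and re-noising the known background with a cumulative noise-schedule value `alpha_bar_t` is an assumption about how a Repaint-style step is usually written.

```python
import numpy as np

def build_unet_input(noisy_latents, masked_image_latents, mask):
    """Concatenate along the channel axis: 4 + 4 + 1 = 9 channels.

    All arrays are (batch, channels, height, width) at latent resolution;
    mask is 1 inside the region to repaint and 0 elsewhere.
    """
    assert noisy_latents.shape[1] == 4 and masked_image_latents.shape[1] == 4
    assert mask.shape[1] == 1
    return np.concatenate([noisy_latents, masked_image_latents, mask], axis=1)

def reinject_background(denoised_latents, clean_latents, mask, alpha_bar_t, rng):
    """Keep the model's output inside the mask and overwrite everything else
    with the known background, noised to the current step (assumed schedule)."""
    noise = rng.standard_normal(clean_latents.shape)
    known_t = np.sqrt(alpha_bar_t) * clean_latents + np.sqrt(1.0 - alpha_bar_t) * noise
    return mask * denoised_latents + (1.0 - mask) * known_t

rng = np.random.default_rng(0)
clean = rng.standard_normal((1, 4, 64, 64))      # stand-in for encoded photo latents
mask = np.zeros((1, 1, 64, 64))
mask[:, :, 16:48, 16:48] = 1.0                   # square "face" region
noisy = rng.standard_normal((1, 4, 64, 64))      # stand-in for current latents

# Illustrative only: a real pipeline would encode the masked image with the VAE.
unet_in = build_unet_input(noisy, clean * (1.0 - mask), mask)
blended = reinject_background(noisy, clean, mask, alpha_bar_t=1.0, rng=rng)
```

With `alpha_bar_t=1.0` no noise is added, so the background latents are copied through unchanged; in a real sampling loop the value would come from the scheduler at the current timestep.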

LoRA fine-tuning step. LoRA adapters are injected into the K and V projections of the UNet's cross-attention; only those low-rank matrices are trained, on a curated Simpson-aesthetic image set, under a masked denoising loss.
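To make the low-rank update concrete, here is a minimal NumPy sketch of a rank-16 adapter wrapped around a frozen projection, plus one plausible reading of a masked denoising loss. The class `LoRALinear`, the helper `masked_mse`, the `alpha` scaling, and the 768-to-320 projection size are all illustrative assumptions; the write-up does not specify them, and a real implementation would use a deep-learning framework with gradients.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=16, alpha=16.0, seed=0):
        out_features, in_features = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                                       # frozen base projection
        self.A = rng.standard_normal((rank, in_features)) * 0.01   # trainable
        self.B = np.zeros((out_features, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the scaled low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

def masked_mse(pred_noise, true_noise, mask):
    """Squared error averaged over masked positions only (assumes a non-empty mask)."""
    weights = np.broadcast_to(mask, pred_noise.shape)
    return float(((pred_noise - true_noise) ** 2 * weights).sum() / weights.sum())

rng = np.random.default_rng(1)
W_k = rng.standard_normal((320, 768))         # hypothetical key projection weights
to_k = LoRALinear(W_k, rank=16)
text_emb = rng.standard_normal((2, 77, 768))  # hypothetical text-encoder output
keys = to_k(text_emb)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen projection exactly, so training begins from the base model's behavior. In this example the adapter adds 16 × 768 + 320 × 16 = 17,408 trainable values next to 245,760 frozen ones.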

Dataset

Validation used a "Face Detection in the Wild" set with diverse poses, lighting, and occlusions. Style training used a curated Simpson-aesthetic image set paired with descriptive prompts so the LoRA learned to associate trigger tokens with the target style.

Results

The pipeline produces results that look like the subject as a Simpson, rather than a generic Simpson stamped onto the subject. It is robust across challenging cases — side profiles, expressive faces, dramatic lighting — because the diffusion backbone preserves scene-consistent shading while ControlNet preserves the underlying face.

Sample 1: input · mask · stylized. Selected samples: each row shows input photograph · facial mask · Simpsonized output. Background lighting and identity cues survive the stylization.

Sample 2: input · mask · stylized.

Limitations: extreme profile views can struggle to map "Simpson" features onto a flat human profile; large facial occlusions can confuse the Canny conditioning; eyeglasses are sometimes blended unpredictably.

Takeaways

Gained practical experience composing LoRA, inpainting, and ControlNet conditioning into a single coherent pipeline
Designed and ablated the channel composition of the UNet input; an explicit mask channel outperformed leaving the mask implicit
Engineered the trade-off between style strength and identity preservation through structural conditioning