What is Real-ESRGAN?

Real-ESRGAN (Real-world Enhanced Super-Resolution Generative Adversarial Network) is a practical image and video upscaling model developed by Xintao Wang et al. It extends the original ESRGAN architecture with a focus on handling real-world degradations (noise, compression artifacts, blur, and sensor limitations) that training on simple bicubic-downsampled images cannot capture. Its training pairs are still synthetic; what changes is that they are produced by a far more realistic degradation process.

Architecture and Training

Real-ESRGAN uses an RRDB (Residual-in-Residual Dense Block) generator network paired with a U-Net discriminator. What sets it apart from earlier super-resolution models is its high-order degradation pipeline for training data. Instead of simple bicubic downsampling, it applies sequences of blur, resize, noise, and JPEG compression to simulate how images degrade in real-world conditions.

Generator — 23 RRDB blocks that learn to map low-resolution features to high-resolution output
Discriminator — U-Net architecture that provides per-pixel realness feedback
Degradation Model — Second-order degradation process including blur kernels, resize, noise injection, and JPEG compression
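The second-order idea above can be sketched in a few lines of NumPy. This is a simplified illustration, not the Real-ESRGAN training code: it works on a single grayscale image with values in [0, 1], uses fixed parameters where the real pipeline samples them randomly, and leaves JPEG compression as an optional hook because that step needs an image codec. The function names (`degrade_once`, `second_order_degrade`) and the `jpeg` parameter are assumptions made for this example.

```python
import numpy as np

def gaussian_kernel(sigma, radius=3):
    # 1-D Gaussian kernel, normalized to sum to 1
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(img, sigma):
    # Separable Gaussian blur: filter rows, then columns
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(img, pad, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def downsample(img, factor):
    # Area-average resize by an integer factor
    h, w = img.shape
    h2, w2 = h - h % factor, w - w % factor
    blocks = img[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor)
    return blocks.mean(axis=(1, 3))

def degrade_once(img, rng, factor=2, blur_sigma=1.0, noise_sigma=0.02, jpeg=None):
    # One degradation round: blur -> resize -> noise -> (optional) JPEG
    out = blur(img, blur_sigma)
    out = downsample(out, factor)
    out = np.clip(out + rng.normal(0.0, noise_sigma, out.shape), 0.0, 1.0)
    if jpeg is not None:  # hypothetical hook: any callable that re-encodes the image
        out = jpeg(out)
    return out

def second_order_degrade(img, rng, **kwargs):
    # "Second-order" means the whole round is applied twice in sequence
    return degrade_once(degrade_once(img, rng, **kwargs), rng, **kwargs)

rng = np.random.default_rng(0)
hr = rng.random((64, 64))            # stand-in for a high-resolution image
lr = second_order_degrade(hr, rng)   # two rounds of 2x -> 16x16 low-resolution image
```

Each (hr, lr) pair produced this way would serve as one training example: the generator sees `lr` and is trained to reconstruct `hr`.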

Models and Variants

Several model weights are available: RealESRGAN_x4plus for general content at 4x scale, RealESRGAN_x2plus for 2x with less hallucination, realesr-animevideov3 optimized for anime content, and the compact realesr-general-x4v3 for faster inference. Each variant is trained on different data distributions to excel at specific content types.
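As a rough illustration of how these variants might be chosen in practice, the sketch below maps a few content properties to the weight names listed above. The function name and its parameters (`content`, `scale`, `prefer_speed`) are hypothetical and exist only for this example; the selection rules simply restate the descriptions in the paragraph.

```python
def pick_weights(content: str, scale: int = 4, prefer_speed: bool = False) -> str:
    """Return a Real-ESRGAN weight name for the given (hypothetical) content hints."""
    if content == "anime":
        return "realesr-animevideov3"   # optimized for anime content
    if prefer_speed:
        return "realesr-general-x4v3"   # compact model, faster inference
    if scale == 2:
        return "RealESRGAN_x2plus"      # 2x scale, less hallucination
    return "RealESRGAN_x4plus"          # general content at 4x

print(pick_weights("photo"))            # RealESRGAN_x4plus
print(pick_weights("anime"))            # realesr-animevideov3
```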

Performance Characteristics

On an NVIDIA RTX 3070, Real-ESRGAN upscales a 480p frame by 4x (to 1920 lines, roughly 4K) in approximately 300ms. For video, this means a 10-minute clip at 30fps, which contains 18,000 frames, requires about 90 minutes of processing. VRAM usage scales with resolution: 720p input requires roughly 4GB VRAM, while 1080p input needs 6-8GB for 4x upscaling.
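The video estimate follows directly from the per-frame figure. The short calculation below reproduces it and can be reused with other clip lengths or frame rates; the function name is an assumption for this example, and the 300ms figure is the one quoted above for an RTX 3070.

```python
def video_processing_minutes(clip_minutes, fps, ms_per_frame):
    # Total frames in the clip, then total time at a fixed per-frame cost
    frames = clip_minutes * 60 * fps
    total_ms = frames * ms_per_frame
    return frames, total_ms / 60_000  # 60,000 ms per minute

frames, minutes = video_processing_minutes(10, 30, 300)
print(frames, minutes)  # 18000 90.0
```

The estimate assumes a constant per-frame time and ignores decoding, encoding, and any face-restoration pass, so real runs are likely to take somewhat longer.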

Limitations

Real-ESRGAN can over-sharpen flat areas, creating unnatural texture where none existed. Faces are a particular challenge — the model may hallucinate incorrect facial features, which is why GFPGAN is typically used as a second pass for face regions. Temporal consistency across video frames also requires additional post-processing.

Real-ESRGAN in Clareon

Clareon integrates Real-ESRGAN as its primary upscaling engine, with automatic model selection based on content type detection. It combines Real-ESRGAN output with GFPGAN face restoration and applies temporal smoothing to prevent frame-to-frame flickering. Users can also train custom models on their own footage for specialized domains like surveillance, medical imaging, or archival restoration.
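To make the idea of temporal smoothing concrete, here is a minimal sketch that blends each upscaled frame with a running average of the previous ones (an exponential moving average). This is one simple, generic technique and is not a description of Clareon's actual method, which the text above does not specify; the function name and the `alpha` parameter are assumptions for this example.

```python
import numpy as np

def temporal_smooth(frames, alpha=0.8):
    """Blend each frame with the smoothed history; alpha=1.0 disables smoothing."""
    smoothed = []
    state = None
    for frame in frames:
        f = np.asarray(frame, dtype=np.float64)
        # The first frame passes through; later frames are mixed with the running state
        state = f if state is None else alpha * f + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed

# A flat region that flickers between two brightness levels from frame to frame
flicker = [np.full((4, 4), v) for v in (0.4, 0.6, 0.4, 0.6)]
steady = temporal_smooth(flicker)
```

After smoothing, the brightness jump between consecutive frames is smaller than in the input. A plain moving average like this also smears genuine motion, so practical systems usually combine it with motion compensation or apply it only where frames are nearly static.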
