Top 5 Super Resolution Datasets (LR-HR Pairs) Every Computer Vision Researcher Must Know
A Complete Guide to Low-Resolution and High-Resolution Paired Datasets, Degradation Types, Metrics, Models, and Implementation — With Final Year Project Angles for Researchers, PhD, M.Tech, and Final Year Students

Introduction
What Is Image Super Resolution
Image Super Resolution (SR) is the computational task of reconstructing a high-resolution image from one or more low-resolution observations. The goal is to recover fine spatial detail — sharp edges, clear textures, readable text, precise structural boundaries — that was lost during the downsampling, compression, blurring, or degradation process that produced the low-resolution input. Super resolution is fundamentally an ill-posed inverse problem: infinitely many high-resolution images could plausibly have produced any given low-resolution observation, and the model must learn which high-resolution reconstruction is most probable given the visual content and the degradation process.
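The ill-posedness described above can be made concrete with a toy example. The sketch below is illustrative only: it assumes a simple box-average downsampler (the helper `box_downsample` is introduced here, not taken from any dataset) and shows two different high-resolution patches that collapse to the same low-resolution observation.

```python
import numpy as np


def box_downsample(hr: np.ndarray, scale: int) -> np.ndarray:
    """Average each scale x scale block of a 2-D image into one LR pixel."""
    h, w = hr.shape
    return hr.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))


# Two clearly different 4x4 HR patches: a checkerboard and a flat gray patch.
hr_checker = np.array(
    [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0]],
    dtype=float,
)
hr_flat = np.full((4, 4), 0.5)

lr_a = box_downsample(hr_checker, 2)
lr_b = box_downsample(hr_flat, 2)

# Both HR patches produce the same 2x2 LR image, so the LR image alone
# cannot tell the model which HR patch it came from.
print(np.array_equal(lr_a, lr_b))  # prints True
```

A trained SR model resolves this ambiguity only through the priors it has learned from its training pairs, which is why the choice of dataset matters so much.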
Deep learning has transformed super resolution from a classical signal processing problem into one of the most active areas of computer vision research. Where traditional methods like bicubic interpolation simply resample pixels using fixed mathematical kernels, modern deep neural networks (from the original SRCNN in 2014 through residual networks, attention mechanisms, generative adversarial networks, vision transformers, and diffusion models) learn to hallucinate fine image detail from statistical patterns observed across millions of training image pairs. The quality gains these methods achieve over classical interpolation are dramatic and have continued to grow with each generation of architecture.
The applications of super resolution are both numerous and practically important. In satellite and remote sensing imagery, super resolution enables analysts to extract more precise geographic information from imagery collected by sensors with physical resolution limits. In medical imaging, it enhances MRI, CT, and microscopy images to reveal diagnostic detail that would otherwise require more expensive or more invasive imaging procedures. In surveillance and security, it recovers readable faces and license plates from low-resolution video frames. In consumer photography, it powers the computational zoom features of modern smartphones. In scientific imaging, it enables researchers to study fine structural details in biological, materials, and astronomical imagery that their imaging equipment cannot directly resolve.
Why LR-HR Pairs Matter More Than the Algorithm
Every supervised super resolution model requires paired training data: low-resolution images paired with their corresponding high-resolution ground truth originals. The quality, diversity, scale factor coverage, and degradation realism of this paired data fundamentally determine what the model can learn and how well it generalizes to real-world inputs. A model trained on bicubic-downsampled synthetic LR images will learn to invert bicubic downsampling specifically, and will perform poorly when deployed on real camera images, scanned photographs, or compressed video frames where the actual degradation is complex and not cleanly invertible.
The choice of dataset also determines which evaluation benchmarks are relevant, which baseline models are appropriate for comparison, and what scientific claims can validly be made from the results. For researchers and students entering the super resolution field, understanding the available datasets — their origins, structures, strengths, limitations, and appropriate use cases — is as important as understanding the architectures of the models that learn from them.
How This Article Is Structured
This article covers five of the most important and widely used super resolution datasets for LR-HR pair training and evaluation. Each dataset is covered in depth across twelve subsections — from origin and statistics through structure, degradation methodology, recommended models, download instructions, evaluation metrics, and a concrete final year project angle. Following the dataset coverage, the article provides a complete evaluation metrics guide, a side-by-side comparison table, guidance on choosing the right dataset, a review of SR models benchmarked on these datasets, practical data preparation guidance, a research gap radar, implementation roadmap, tools and frameworks reference, common mistakes guide, and conclusion with actionable next steps.
What Makes a Great Super Resolution Dataset
Not all image datasets are suitable for super resolution research, and not all super resolution datasets are suitable for all SR tasks. Before examining the five datasets in detail, it is worth establishing what distinguishes a great LR-HR paired dataset from an adequate one.
Scale Factor Coverage
Super resolution models are typically trained and evaluated at specific scale factors — ×2, ×3, and ×4 being the most common, with ×8 appearing in some remote sensing and medical imaging contexts. A great dataset provides LR-HR pairs at multiple scale factors, enabling evaluation of model performance across a range of upsampling challenges. Datasets that only provide a single scale factor limit the scope of experiments and comparisons that can be conducted.
Degradation Realism
The way LR images are generated from HR originals — the degradation model — is one of the most consequential design choices in an SR dataset. Simple bicubic downsampling produces LR images with specific frequency characteristics that models can learn to invert reliably, but real-world LR images are produced by far more complex processes involving optical blur, sensor noise, compression artifacts, and atmospheric effects. A dataset whose degradation model matches the target deployment scenario produces models that generalize to real images. A dataset with unrealistic degradation produces models that work only on synthetic benchmarks.
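The degradation process described above is commonly summarized in the SR literature with a blur, downsample, and noise model. The form below is that general textbook formulation, given here as a reading aid; the symbols are introduced for illustration and are not taken from any specific dataset in this article.

```latex
% x : high-resolution image        k : blur kernel (e.g. an optical PSF)
% \downarrow_s : downsampling by scale factor s
% n : additive noise               y : observed low-resolution image
y = (x \ast k)\,\downarrow_{s} \; + \; n
```

Under this view, plain bicubic downsampling corresponds to one fixed, known kernel with no noise term, while realistic degradations vary the kernel, add noise, and may apply compression afterwards.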
Image Diversity and Domain Coverage
A great SR dataset covers a wide range of visual content — diverse scene types, textures, structural patterns, lighting conditions, and object categories — to ensure that trained models develop robust priors rather than overfitting to a narrow visual distribution. Domain-specific datasets like remote sensing or face images serve specialized applications but should not be used as the sole evaluation benchmark for general-purpose SR claims.
Resolution Quality
The high-resolution ground truth images in an LR-HR paired dataset must themselves be of sufficient quality — high resolution, no visible compression artifacts, no blurring or noise — to serve as meaningful reconstruction targets. If the HR images have quality limitations, the model is supervised toward a degraded target, and evaluation against that target overstates real-world quality.
License and Accessibility
A dataset that requires institutional access, commercial licensing, or complex data use agreements creates barriers to reproducibility and limits adoption in the research community. The best SR datasets are openly licensed, stably hosted, and citeable with permanent DOIs — allowing any researcher anywhere to download, use, and reproduce results without administrative overhead.
Dataset 1: UC Merced Super Resolution Dataset (LR-HR Pairs — Blur and Downsample)
Overview and Origin
The UC Merced Super Resolution Dataset (LR-HR Pairs — Blur and Downsample) is a purpose-built super resolution training and evaluation dataset published in April 2026 on Zenodo by Kannan Wisen (Subash Kannan) under the Creative Commons Attribution 4.0 International license. The dataset is derived from the original UC Merced Land Use dataset — one of the most widely used remote sensing image collections in computer vision research — and extends it with systematically generated paired low-resolution images specifically designed for deep learning super resolution model training and benchmarking.
The UC Merced Land Use dataset was originally assembled by Yang and Newsam (2010) from aerial imagery of the United States Geological Survey (USGS) National Map Urban Area Imagery collection. It contains 2,100 images across 21 land-use scene categories (airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, tennis court, and agricultural), each at 256×256 pixel resolution with a spatial resolution of about 0.3 meters (one foot) per pixel. This dataset has been extensively used in remote sensing scene classification and super resolution research for over a decade.
The Zenodo release packages the UC Merced imagery with pre-generated LR-HR paired data using a blur-and-downsample degradation pipeline — representing a more challenging and realistic degradation scenario than simple bicubic downsampling alone. This makes the dataset particularly valuable for researchers developing SR models that must perform well under real satellite imaging conditions where optical blur is present alongside sensor-level downsampling.
Official Download: https://zenodo.org/records/19712660
DOI: 10.5281/zenodo.19712660
Published: April 23, 2026 | Version v1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Author: Kannan Wisen (Subash Kannan)
File: Dataset_UCMerced_LandUse_Blur_Downsample_SR.zip (108.4 MB)
Dataset Statistics
Source dataset: UC Merced Land Use dataset (Yang and Newsam, 2010)
Total images: 2,100 HR images across 21 land-use scene categories
Images per category: 100 images per class
HR image resolution: 256×256 pixels at 0.3 m/pixel spatial resolution
Scale factor: ×4 (LR images are 64×64 pixels)
Degradation type: Blur (Gaussian) followed by downsampling
Dataset structure: Training, validation, and testing subsets
File size: 108.4 MB compressed (216.8 MB total data volume)
Format: ZIP archive containing LR/HR image pairs
Programming language: Python
Dataset Structure and File Organization
The dataset is distributed as a single ZIP archive (Dataset_UCMerced_LandUse_Blur_Downsample_SR.zip) that unpacks into a structured directory hierarchy organized for direct use in standard deep learning training pipelines. The dataset is pre-split into training, validation, and testing subsets — a critical feature that ensures reproducible and standardized benchmarking across different research groups and model comparisons.
Each subset contains paired LR and HR image directories with corresponding filenames, enabling straightforward DataLoader construction in PyTorch or TensorFlow. The LR images (64×64 pixels at ×4 scale factor) and HR images (256×256 pixels) share identical filenames within their respective directories, allowing pair-wise loading without any additional file mapping or annotation processing.
The 21-category structure of UC Merced is preserved in the dataset organization, enabling category-stratified evaluation that measures SR model performance on specific land-use types. This is particularly valuable for remote sensing applications where performance on specific scene categories — for example, SR quality on water bodies versus urban structures — is more informative than aggregate metrics across all categories.
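Category-stratified evaluation can be sketched in a few lines. The example below is a minimal illustration that assumes filenames follow a `<category><index>.<ext>` pattern such as `harbor07.png`; that pattern, the helper names, and the PSNR numbers are all hypothetical and should be adapted to the layout of the extracted archive.

```python
import re
from collections import defaultdict
from statistics import mean


def category_of(filename: str) -> str:
    """Drop the extension and trailing digits: 'harbor07.png' -> 'harbor'."""
    stem = filename.rsplit(".", 1)[0]
    return re.sub(r"\d+$", "", stem)


def per_category_mean(scores: dict[str, float]) -> dict[str, float]:
    """Average per-image scores within each land-use category."""
    grouped = defaultdict(list)
    for filename, value in scores.items():
        grouped[category_of(filename)].append(value)
    return {cat: mean(vals) for cat, vals in sorted(grouped.items())}


# Hypothetical per-image PSNR values in dB, for illustration only.
scores = {
    "harbor07.png": 27.0,
    "harbor12.png": 29.0,
    "forest03.png": 31.0,
    "forest44.png": 30.0,
}
category_psnr = per_category_mean(scores)
print(category_psnr)  # {'forest': 30.5, 'harbor': 28.0}
```

Reporting the per-category table alongside the overall mean makes it visible when a model does well on smooth scenes but poorly on dense urban structure.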
Degradation Methodology — Blur and Downsample
The defining characteristic of this dataset compared to simpler LR-HR paired datasets is its blur-and-downsample degradation pipeline. This two-stage process more accurately models real-world satellite and aerial imaging degradation than simple bicubic downsampling alone.
In the first stage, the original HR images are convolved with a Gaussian blur kernel that simulates the point spread function (PSF) of an optical imaging system. The PSF of a real camera lens is often approximated as Gaussian: light from a point source is spread across a small region of the sensor rather than falling precisely on a single pixel. Blurring the HR image before downsampling ensures that the resulting LR image contains the anti-aliasing blur that real optical systems introduce, rather than the aliasing artifacts that would result from direct downsampling without blur.
In the second stage, the blurred HR images are downsampled by the specified scale factor (×4) to produce LR images at 64×64 pixel resolution. The downsampling keeps one sample in four along each spatial dimension, so only 1/16 of the original pixels remain, producing the low-resolution observation that the SR model must learn to invert.
This blur-then-downsample degradation pipeline is consistent with the physical model used in multi-frame super resolution research — where multiple slightly offset LR frames are combined to reconstruct a single HR image — and is substantially more challenging for SR models than clean bicubic downsampling. A model trained on this dataset learns to simultaneously deblur and upsample, which is a more practically relevant skill than pure upsampling.
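The two-stage pipeline described above can be sketched with NumPy. This is a minimal illustration, not the dataset's actual generation script: the kernel size (7), the standard deviation (1.6), the reflect padding, and the use of simple strided decimation are all assumptions chosen for the example, since the exact parameters are not stated here.

```python
import numpy as np


def gaussian_kernel(size: int = 7, sigma: float = 1.6) -> np.ndarray:
    """Normalized 2-D Gaussian kernel (size must be odd). Values are assumed."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()


def blur_downsample(hr: np.ndarray, scale: int = 4,
                    size: int = 7, sigma: float = 1.6) -> np.ndarray:
    """Stage 1: Gaussian blur. Stage 2: keep every `scale`-th pixel.

    `hr` is a 2-D (grayscale) float array; apply per channel for RGB.
    """
    kernel = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(hr, pad, mode="reflect")
    blurred = np.zeros(hr.shape, dtype=np.float64)
    # Direct sliding-window sum; the kernel is symmetric, so correlation
    # and convolution give the same result.
    for dy in range(size):
        for dx in range(size):
            window = padded[dy:dy + hr.shape[0], dx:dx + hr.shape[1]]
            blurred += kernel[dy, dx] * window
    return blurred[::scale, ::scale]


hr = np.random.default_rng(0).random((256, 256))  # stand-in for an HR image
lr = blur_downsample(hr)
print(hr.shape, "->", lr.shape)  # (256, 256) -> (64, 64)
```

Widening `sigma` makes the LR image smoother and the inverse problem harder, which is one way to probe how sensitive a trained model is to the assumed blur.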
Scale Factor and Spatial Resolution
The dataset provides LR-HR pairs at a ×4 scale factor — one of the standard and most challenging super resolution benchmarks. At ×4, each LR pixel must contribute to 16 HR pixels, requiring the model to hallucinate significant high-frequency detail that is absent in the low-resolution input. This is substantially harder than ×2 (where 1 LR pixel contributes to 4 HR pixels) and represents the boundary between upsampling tasks that are practically useful (×4 resolution enhancement makes a meaningful difference to remote sensing analysis) and tasks that are speculative (×8 and above require increasingly uncertain hallucination).
In spatial resolution terms, the ×4 scale factor means that each LR pixel covers 1.2 meters on the ground (4 × 0.3 m/pixel), while each HR pixel covers 0.3 meters. At LR resolution, features like individual vehicles, small buildings, and road markings that are visible in the HR image become unresolvable blobs. The SR model must reconstruct these fine structures from the coarse LR representation — a task that requires the model to develop strong priors about remote sensing scene structure.
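The arithmetic in this section is easy to check programmatically. The helper below is a small illustrative sketch (the function name and dictionary keys are made up for this example) that derives the LR image size, the number of HR pixels per LR pixel, and the LR ground sampling distance from the HR values quoted above.

```python
def sr_scale_summary(hr_size: int, hr_gsd_m: float, scale: int) -> dict:
    """Derive LR-side quantities from the HR size, HR ground sampling
    distance (meters per pixel), and the scale factor."""
    return {
        "lr_size": hr_size // scale,              # pixels per side
        "hr_pixels_per_lr_pixel": scale ** 2,     # area ratio
        "lr_gsd_m": hr_gsd_m * scale,             # meters per LR pixel
        "fraction_of_pixels_kept": 1 / scale ** 2,
    }


summary_x4 = sr_scale_summary(hr_size=256, hr_gsd_m=0.3, scale=4)
summary_x2 = sr_scale_summary(hr_size=256, hr_gsd_m=0.3, scale=2)
print(summary_x4)
print(summary_x2)
```

For ×4 this gives a 64-pixel side, 16 HR pixels per LR pixel, and roughly 1.2 m per LR pixel, matching the figures in the text; for ×2 the corresponding values are 128, 4, and roughly 0.6 m.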
Recommended Super Resolution Models
The dataset documentation specifically recommends three SR models as training and evaluation targets:
SRGAN (Super Resolution Generative Adversarial Network): The first deep learning model to demonstrate photorealistic ×4 super resolution using perceptual loss and adversarial training. SRGAN introduced the use of a discriminator network that trains the generator to produce outputs indistinguishable from real HR images, enabling the synthesis of fine texture detail that pixel-level loss functions alone cannot produce. Foundational model for GAN-based SR research.
ESRGAN (Enhanced Super Resolution GAN): An improved version of SRGAN using Residual-in-Residual Dense Blocks (RRDB) instead of residual blocks, removing batch normalization, and using a relativistic discriminator. ESRGAN achieves substantially better perceptual quality than SRGAN on complex textures and structural detail. The RRDB backbone (RRDBNet) is one of the most widely used SR architectures even in non-GAN settings. Winner of the PIRM2018 SR challenge.
SwinIR (Image Restoration Using Swin Transformer): A vision transformer-based image restoration model that uses Swin Transformer blocks to capture long-range dependencies across image patches. SwinIR achieves state-of-the-art results on multiple SR benchmarks by combining the local attention efficiency of shifted window attention with the global context modeling of transformer architectures. Particularly effective on structured scenes with long-range spatial dependencies — making it well-suited for remote sensing SR where scene layout follows predictable spatial patterns.
Strengths and Limitations
Strengths:
Pre-generated LR-HR pairs with a physically motivated blur-and-downsample degradation pipeline — more realistic than simple bicubic downsampling
Pre-configured train/validation/test splits enabling standardized benchmarking
21 diverse land-use scene categories providing good coverage of remote sensing content types
Open access on Zenodo with a stable DOI and CC BY 4.0 license — freely usable in research and publications with citation
Compact size (108.4 MB) makes it accessible in resource-constrained research environments
Specifically designed for deep learning SR models with Python-based toolchain
Links to related image processing and deep learning final year project resources
Limitations:
Single scale factor (×4) — does not provide ×2 or ×3 pairs for multi-scale evaluation
Relatively small at 2,100 HR images — insufficient for training very large SR models from scratch without data augmentation or pre-training transfer
Single domain (remote sensing / satellite imagery) limits generalization claims beyond the remote sensing application domain
Fixed degradation type (blur + downsample) — does not cover noise, JPEG compression, or real-world mixed degradation scenarios
256×256 pixel HR resolution is relatively low — not suitable as a high-resolution benchmark for models targeting 4K or higher output quality
How to Download and Use
The dataset is freely available for download from Zenodo at https://zenodo.org/records/19712660. Click the download link for Dataset_UCMerced_LandUse_Blur_Downsample_SR.zip (108.4 MB). No registration or license agreement is required beyond citing the dataset in any resulting publications as required by the CC BY 4.0 license.
After extraction, use the pre-defined train/validation/test splits for all experiments. In PyTorch, implement a Dataset class that reads paired filenames from the split directories, loads LR and HR image pairs using PIL or torchvision.io, and applies standard augmentations (random crop, horizontal flip, rotation) during training. The LR images serve as model inputs and the HR images serve as reconstruction targets for pixel loss, perceptual loss, and adversarial training objectives.
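The pairing and augmentation steps described above can be sketched without committing to a specific framework. The example below is a hedged illustration: the subfolder names `LR` and `HR`, the helper names, and the demo filenames are assumptions, and the demo runs on empty placeholder files in a temporary directory rather than on the real archive. The two functions could be called from the `__getitem__` method of a PyTorch Dataset.

```python
import random
import tempfile
from pathlib import Path

import numpy as np


def list_pairs(split_dir, lr_name="LR", hr_name="HR"):
    """Return sorted (lr_path, hr_path) tuples matched by identical filename.

    The folder names "LR" and "HR" are assumptions; adjust them to the
    names found in the extracted archive.
    """
    lr_dir = Path(split_dir) / lr_name
    hr_dir = Path(split_dir) / hr_name
    lr_files = {p.name: p for p in lr_dir.iterdir() if p.is_file()}
    hr_files = {p.name: p for p in hr_dir.iterdir() if p.is_file()}
    unpaired = set(lr_files) ^ set(hr_files)
    if unpaired:
        raise ValueError(f"Files without a partner: {sorted(unpaired)}")
    return [(lr_files[name], hr_files[name]) for name in sorted(lr_files)]


def paired_random_crop(lr, hr, lr_patch, scale, rng):
    """Crop the same region from both images so the pair stays aligned."""
    top = rng.randrange(lr.shape[0] - lr_patch + 1)
    left = rng.randrange(lr.shape[1] - lr_patch + 1)
    lr_crop = lr[top:top + lr_patch, left:left + lr_patch]
    hr_crop = hr[top * scale:(top + lr_patch) * scale,
                 left * scale:(left + lr_patch) * scale]
    return lr_crop, hr_crop


# Demo on a throwaway directory containing empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("LR", "HR"):
        (Path(tmp) / "train" / sub).mkdir(parents=True)
        for name in ("forest00.png", "harbor00.png"):
            (Path(tmp) / "train" / sub / name).touch()
    pairs = list_pairs(Path(tmp) / "train")
    pair_names = [(lr.name, hr.name) for lr, hr in pairs]

# Stand-in arrays with the LR/HR sizes quoted for this dataset.
lr_img = np.zeros((64, 64, 3), dtype=np.uint8)
hr_img = np.zeros((256, 256, 3), dtype=np.uint8)
lr_crop, hr_crop = paired_random_crop(lr_img, hr_img, lr_patch=16, scale=4,
                                      rng=random.Random(0))
print(pair_names, lr_crop.shape, hr_crop.shape)
```

The key design point is that every geometric augmentation (crop, flip, rotation) is applied with the same parameters to both images; cropping LR and HR independently would silently misalign the training targets.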
Evaluation Metrics Commonly Used With This Dataset
PSNR — standard pixel-level reconstruction accuracy metric. Typically reported in dB; higher is better. The primary benchmark metric for remote sensing SR.
SSIM — structural similarity index measuring luminance, contrast, and structure consistency between reconstructed and HR reference images. Higher is better.
LPIPS — learned perceptual image patch similarity using deep network features. Better aligned with human perceptual quality than PSNR/SSIM. Lower is better.
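Of the three metrics listed, PSNR is simple enough to write out directly. The sketch below is a plain NumPy version of the standard definition, 10·log10(MAX² / MSE), assuming 8-bit images with a peak value of 255; SSIM and LPIPS are better taken from an established library than reimplemented.

```python
import numpy as np


def psnr(reference, estimate, data_range: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))


hr = np.full((8, 8), 100.0)
sr_off_by_one = hr + 1.0  # every pixel wrong by one gray level

print(psnr(hr, hr))             # inf
print(psnr(hr, sr_off_by_one))  # about 48.13 dB
```

Converting to float before subtracting matters: subtracting two `uint8` arrays directly would wrap around and corrupt the error.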
Research Papers That Reference This Dataset
The UC Merced Land Use dataset has been used extensively in remote sensing SR research for over a decade. Papers including "Super-Resolution for Remote Sensing Imagery via the Coupling of a Variational Model and Deep Learning" (arXiv:2412.09841), "Wider Channel Attention Network for Remote Sensing Image Super-resolution" (arXiv:1812.05329), "Heterogeneous Mixture of Experts for Remote Sensing Image Super-Resolution" (arXiv:2502.09654), and "Multi-granularity Backprojection Transformer for Remote Sensing Image Super-Resolution" (arXiv:2310.12507) all use UC Merced as their primary evaluation benchmark. The Zenodo release packages this dataset with the specific blur-and-downsample degradation protocol used in the variational SR paper (arXiv:2412.09841), enabling direct reproduction of those results.
Final Year Project Angle
The UC Merced SR dataset is ideal for researchers and students working on remote sensing, satellite image analysis, geospatial AI, or domain-specific super resolution. A compelling final year or MTech project could train and systematically compare SRGAN, ESRGAN, and SwinIR on this dataset, providing a benchmark of GAN-based versus transformer-based SR methods specifically on remote sensing imagery — a comparison not yet thoroughly covered in the literature for this specific degradation type. A PhD research angle could investigate whether SR models trained on the blur-and-downsample degradation generalize to other degradation types, or whether the blur-deconvolution component of the learned mapping transfers across scale factors. An interdisciplinary project combining SR with downstream remote sensing analysis tasks — measuring whether SR preprocessing improves land use classification accuracy, object detection recall, or change detection performance — would be a strong application-focused contribution.
Dataset 2: DIV2K — Diverse 2K Resolution Dataset
Overview and Origin
DIV2K (Diverse 2K) is the most widely used high-resolution super resolution training and evaluation dataset in the deep learning era. Created for the NTIRE (New Trends in Image Restoration and Enhancement) Challenge series, a prestigious annual competition associated with CVPR, DIV2K was introduced by Agustsson and Timofte in 2017 and has since become the standard benchmark for single-image super resolution research. Its combination of high image quality, visual diversity, and open accessibility has made it the dataset that essentially every modern SR paper reports results on.
The dataset was specifically designed to address a critical limitation of earlier SR benchmarks (Set5, Set14, BSD100) — their small size, low visual diversity, and use of low-resolution or heavily compressed source images. DIV2K provides 1,000 images at genuine 2K resolution (2,040×1,404 pixels on average), spanning a carefully curated collection of natural landscapes, cityscapes, architecture, portraits, wildlife, food, cultural artifacts, and abstract textures.
Official Download: https://data.vision.ee.ethz.ch/cvl/DIV2K/
License: Research and educational use
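Citation: Agustsson and Timofte, "NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study," CVPRW 2017; Timofte et al., "NTIRE 2017 Challenge on Single Image Super-Resolution: Methods and Results," CVPRW 2017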
Dataset Statistics
Total images: 1,000 (800 training + 100 validation + 100 test)
HR resolution: ~2,040×1,404 average (2K) — many images exceed 2,000 pixels on the longer axis
Scale factors provided: ×2, ×3, ×4 (bicubic and unknown tracks), ×8 (bicubic, added for NTIRE 2018)
Degradation tracks: Bicubic (standard), unknown degradation, and the NTIRE 2018 realistic tracks (mild, difficult, wild)
Image format: PNG (lossless)
Visual domains: Natural scenes, cityscapes, architecture, portraits, wildlife, food, abstract textures
Challenge versions: NTIRE 2017, 2018, 2019, 2020, 2021 (different degradation tracks per year)
Dataset Structure and File Organization
DIV2K is organized into separate directories for high-resolution images and low-resolution counterparts at each scale factor and degradation track. The HR directory contains the original 2K resolution images. The LR directories contain downsampled versions at each scale factor using each degradation method. All images use consistent naming conventions that match LR and HR pairs by filename. The standard 800/100/100 train/validation/test split is universally adopted across the SR literature.
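Matching an HR file to its LR counterpart can be done with a small path helper. The sketch below assumes the folder and file naming commonly seen in the public bicubic and unknown-track archives (for example `DIV2K_train_HR/0001.png` alongside `DIV2K_train_LR_bicubic/X4/0001x4.png`); treat that layout and the helper name as assumptions and verify them against your extracted copy, since other tracks may be organized differently.

```python
from pathlib import Path


def div2k_lr_path(root, hr_path, scale, track="bicubic", split="train") -> Path:
    """Build the LR path that corresponds to an HR file.

    Assumed layout (verify locally):
      <root>/DIV2K_<split>_HR/<id>.png
      <root>/DIV2K_<split>_LR_<track>/X<scale>/<id>x<scale>.png
    """
    hr_path = Path(hr_path)
    lr_dir = Path(root) / f"DIV2K_{split}_LR_{track}" / f"X{scale}"
    return lr_dir / f"{hr_path.stem}x{scale}{hr_path.suffix}"


lr_x4 = div2k_lr_path("data", "data/DIV2K_train_HR/0001.png", scale=4)
lr_val = div2k_lr_path("data", "data/DIV2K_valid_HR/0801.png", scale=2,
                       track="unknown", split="valid")
print(lr_x4.as_posix())
print(lr_val.as_posix())
```

Deriving the LR path from the HR path, rather than listing both folders and zipping them, avoids silent mispairing if one folder is incomplete.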
Degradation Methodology
DIV2K offers multiple degradation tracks serving different research purposes. The bicubic degradation track is the standard benchmark: LR images are generated by bicubic downsampling of HR images, the classic SR problem formulation. The mild and wild realistic degradation tracks (introduced in NTIRE 2018, together with an intermediate difficult track) add combinations of blur, noise, and JPEG compression to simulate more realistic degradation scenarios. The unknown degradation track, which was already part of the original NTIRE 2017 challenge alongside the bicubic track, withholds the specific degradation applied, requiring models to perform blind SR without degradation knowledge.
Strengths and Limitations
Strengths:
Universal benchmark — virtually every SR paper reports results on DIV2K, enabling direct comparison with published state-of-the-art methods
Genuine 2K resolution source images enable evaluation of fine-grained detail reconstruction at practical output resolutions
Multiple scale factors (×2, ×3, ×4, ×8) and multiple degradation tracks in a single dataset
Lossless PNG format eliminates compression artifacts from evaluation
Exceptional visual diversity across dozens of content categories
Active NTIRE challenge history ensures the dataset remains relevant and well-maintained
Limitations:
Only 800 training images — insufficient for training very large models from scratch without pre-training on larger datasets like Flickr2K or DF2K
Bicubic degradation track does not reflect real-world imaging degradation
No semantic annotations — cannot support semantically guided SR evaluation
Test set ground truth images are not publicly released — limits test set evaluation to challenge submission only
How to Download and Use
DIV2K is available for direct download at https://data.vision.ee.ethz.ch/cvl/DIV2K/. Download the HR training and validation archives plus the LR archives for your target scale factor and degradation track. The dataset is also available via HuggingFace Datasets with a simple one-line loader. BasicSR framework provides native DIV2K dataset handling.
Evaluation Metrics Commonly Used
PSNR — primary DIV2K metric, typically computed on Y-channel of YCbCr color space
SSIM — structural similarity on Y-channel
LPIPS — perceptual quality for GAN-based and diffusion-based SR models
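Y-channel PSNR can be sketched as follows. The luma conversion uses the ITU-R BT.601 coefficients in the studio-range form (Y between 16 and 235 for 8-bit input), which is the convention many SR codebases follow; the optional `border` argument reflects the common practice of excluding a few edge pixels, and its default value here is an assumption. Check the exact protocol of the paper you are comparing against.

```python
import numpy as np


def rgb_to_y(rgb) -> np.ndarray:
    """Luma channel of an 8-bit RGB image (BT.601, range 16 to 235)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    return 16.0 + (65.481 * rgb[..., 0]
                   + 128.553 * rgb[..., 1]
                   + 24.966 * rgb[..., 2]) / 255.0


def psnr_y(hr_rgb, sr_rgb, border: int = 0) -> float:
    """PSNR in dB computed on the Y channel only, optionally ignoring a
    `border`-pixel frame around the image."""
    y_hr, y_sr = rgb_to_y(hr_rgb), rgb_to_y(sr_rgb)
    if border > 0:
        y_hr = y_hr[border:-border, border:-border]
        y_sr = y_sr[border:-border, border:-border]
    mse = np.mean((y_hr - y_sr) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(255.0 ** 2 / mse))


hr_rgb = np.zeros((8, 8, 3), dtype=np.uint8)
sr_rgb = np.full((8, 8, 3), 10, dtype=np.uint8)  # uniformly 10 levels too bright
y_white = rgb_to_y(np.full((1, 1, 3), 255, dtype=np.uint8))[0, 0]
score = psnr_y(hr_rgb, sr_rgb, border=2)
print(round(y_white, 3), round(score, 2))
```

Because Y-channel and RGB PSNR differ by a noticeable margin on the same images, numbers are only comparable across papers when the color space and border handling match.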
Research Papers That Use This Dataset
DIV2K is the primary benchmark in essentially every landmark SR paper of the past decade: EDSR, RCAN, SAN, HAN, SwinIR, HAT, Real-ESRGAN (fine-tuning), StableSR, SeeSR, and hundreds of others. Any SR paper that does not include DIV2K results faces immediate scrutiny from reviewers.
Final Year Project Angle
DIV2K is the right dataset for any project that needs direct comparability with published state-of-the-art results. A strong M.Tech or PhD project could benchmark a new SR architecture modification on DIV2K across all four scale factors to show consistent improvement. Another angle is a study of training data efficiency — measuring how SR model performance scales with the fraction of DIV2K training data used, identifying the minimum training set size for competitive results. For students interested in perceptual quality, a project comparing pixel loss-trained models vs. GAN-trained models vs. diffusion model-based SR on DIV2K — using both PSNR/SSIM and LPIPS/FID as metrics — demonstrates the fundamental tradeoff between pixel fidelity and perceptual quality.
Dataset 3: Set5 — Classic Super Resolution Benchmark
Overview and Origin
Set5 is a five-image super resolution benchmark dataset that has been used in essentially every SR paper published since its introduction by Bevilacqua et al. in 2012. Despite its tiny size — just five images — Set5 remains one of the most widely reported benchmarks in the SR literature, providing a universally agreed-upon reference point for algorithm comparison. The five images — baby, bird, butterfly, head, and woman — were selected to represent a range of content types including fine textures (butterfly wing patterns), natural structures (bird plumage), smooth gradient regions (baby skin), and portrait content (head, woman).
Set5 was originally proposed alongside a low-complexity SR algorithm, but its value proved durable far beyond that original paper. Its small size makes it computationally inexpensive to evaluate, its content diversity is sufficient to reveal qualitative differences between algorithms, and its universality means that results on Set5 are immediately comparable across essentially the entire SR literature.
Official Download: Available via multiple mirrors — the most accessible source is the BasicSR framework repository at https://github.com/XPixelGroup/BasicSR and the MMagic dataset collection.
License: Research use
Citation: Bevilacqua et al., "Low-Complexity Single-Image Super-Resolution based on Nonnegative Neighbor Embedding," BMVC 2012
Dataset Statistics
Total images: 5 high-resolution images
Images: baby, bird, butterfly, head, woman
Scale factors: ×2, ×3, ×4 (LR images generated by bicubic downsampling)
HR resolution: Variable per image (roughly 228×344 for woman up to 512×512 for baby)
Degradation: Bicubic downsampling (standard)
Format: PNG
Dataset Structure
Set5 is organized into HR and LR directories, with LR subdirectories for each scale factor (X2, X3, X4). Each LR directory contains the bicubic-downsampled versions of all five HR images. The tiny scale of the dataset means evaluation on Set5 completes in seconds on any GPU, making it suitable for rapid development iteration where you need to check model performance after each training epoch without waiting for a full DIV2K evaluation pass.
Degradation Methodology
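Set5 exclusively uses bicubic downsampling for LR image generation. This is the simplest and most widely used SR degradation model: each image dimension is reduced by the scale factor (the pixel count per dimension is multiplied by 1/scale_factor) using bicubic interpolation. While not realistic for real-world SR applications, bicubic downsampling defines the standard benchmark problem that enables cross-paper comparison. Bicubic degradation is deterministic, so two researchers applying the same implementation to the same HR image obtain identical LR images. Implementations differ, however: MATLAB's imresize, the long-standing convention in the SR literature, applies antialiasing when downscaling, whereas bicubic resizing in OpenCV does not by default, so the two produce slightly different LR images and PSNR values. Reproducible comparison therefore requires stating which implementation generated the LR images.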
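One practical detail of bicubic benchmarks is that HR image sides are not always divisible by the scale factor. A common preprocessing step, often called mod-crop, trims the HR image so that it is. The sketch below shows that step in isolation; the function name and the example image size are chosen for illustration.

```python
import numpy as np


def mod_crop(img: np.ndarray, scale: int) -> np.ndarray:
    """Trim rows and columns so height and width are multiples of `scale`."""
    h, w = img.shape[:2]
    return img[: h - h % scale, : w - w % scale]


hr = np.zeros((281, 500, 3), dtype=np.uint8)  # hypothetical HR image size

cropped_x3 = mod_crop(hr, 3)
cropped_x4 = mod_crop(hr, 4)
print(cropped_x3.shape, cropped_x4.shape)  # (279, 498, 3) (280, 500, 3)
```

The cropped HR image is then both the source for downsampling and the reference for evaluation, so the SR output and the ground truth have exactly matching sizes at every scale.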
Strengths and Limitations
Strengths:
Universal benchmark — every SR paper reports Set5 results, enabling immediate comparison with the entire literature
Fast evaluation — 5 images complete in seconds, enabling rapid development iteration
Content diversity across the 5 images tests different SR challenges (textures, gradients, portraits)
Available from multiple stable sources as part of standard SR toolkits
More than a decade of published results as reference points
Limitations:
Only 5 images — statistical variance in PSNR scores is high; small algorithmic differences may appear significant due to overfitting to these specific images
Exclusively bicubic degradation — not suitable for evaluating real-world or blind SR models
Images are small and relatively simple compared to modern high-resolution photography
Over a decade of optimization toward Set5 — some modern architectures may be implicitly tuned for these specific images
How to Download and Use
Set5 is bundled with most SR framework repositories. The most reliable access is through BasicSR at https://github.com/XPixelGroup/BasicSR which provides download scripts for all standard SR benchmarks including Set5, Set14, BSD100, Urban100, and Manga109. It can also be downloaded from the MMagic repository and multiple academic lab pages.
Evaluation Metrics Commonly Used
PSNR (Y-channel) — PSNR computed on the luminance channel of YCbCr space. The standard comparison metric across all Set5 results in the literature.
SSIM (Y-channel) — structural similarity on the luminance channel.
Research Papers That Use This Dataset
Set5 results appear in virtually every deep learning SR paper, including SRCNN, DRRN, EDSR, RCAN, SAN, HAN, SwinIR, HAT, IPT, EDT, and many others. It is the universal comparative reference for PSNR-oriented SR research.
Final Year Project Angle
Set5 should be included in any SR project as a secondary benchmark alongside DIV2K or domain-specific datasets. Its value for final year projects is primarily comparative — reporting Set5 PSNR/SSIM numbers alongside your novel method allows direct comparison with every published baseline without any additional effort from reviewers. A focused project could investigate Set5's limitations as a benchmark — measuring how model rankings on Set5 (5 images) differ from model rankings on DIV2K validation (100 images), providing evidence for or against Set5's continued relevance as a primary evaluation dataset in modern SR research.
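The ranking comparison suggested above can be quantified with a rank correlation. The sketch below computes Kendall's tau (the simple tau-a variant, where tied pairs count as neither concordant nor discordant) between model scores on two benchmarks. The model names and PSNR values are hypothetical placeholders, not published results.

```python
from itertools import combinations


def kendall_tau(scores_a: dict, scores_b: dict) -> float:
    """Kendall tau-a between two score tables keyed by model name.

    +1 means both benchmarks rank the models identically,
    -1 means the rankings are exactly reversed.
    """
    models = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical mean PSNR values (dB), for illustration only.
set5_psnr = {"model_a": 32.0, "model_b": 32.3, "model_c": 32.5, "model_d": 32.4}
div2k_psnr = {"model_a": 30.1, "model_b": 30.4, "model_c": 30.5, "model_d": 30.7}

tau = kendall_tau(set5_psnr, div2k_psnr)
print(round(tau, 3))  # 0.667: one of six model pairs swaps order
```

A tau well below 1 across a realistic set of models would be evidence that Set5 rankings do not transfer cleanly to the larger validation set.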
Dataset 4: BSD100 — Berkeley Segmentation Dataset
Overview and Origin
BSD100 (Berkeley Segmentation Dataset 100) is a 100-image super resolution benchmark drawn from the Berkeley Segmentation Dataset (BSDS300), originally assembled for image segmentation research by Martin et al. (2001) at UC Berkeley; the 100 images are the BSDS300 test set. The 100-image SR test split was established by Yang et al. (2010) as a more statistically robust alternative to the tiny Set5 and Set14 benchmarks, providing enough images to reliably measure small but genuine performance differences between SR algorithms.
BSD100 images are natural photographs spanning a wide range of content — landscapes, animals, architecture, portraits, and abstract scenes — photographed at relatively modest resolution by modern standards but sufficient for evaluating ×2, ×3, and ×4 SR methods. The dataset became a standard benchmark alongside Set5 and Set14, and remains widely reported in the SR literature for its statistical robustness relative to the smaller classic benchmarks.
Official Download: Available through BasicSR at https://github.com/XPixelGroup/BasicSR and through the original BSD download page.
License: Research use
Citation: Martin et al., "A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics," ICCV 2001
Dataset Statistics
Total test images: 100 natural photographs
Scale factors: ×2, ×3, ×4
HR resolution: Typically 481×321 pixels (landscape) or 321×481 pixels (portrait), given as width × height
Degradation: Bicubic downsampling
Format: PNG/JPEG
Visual content: Diverse natural photographs including animals, landscapes, architecture, portraits
Dataset Structure
BSD100 is organized identically to other classic SR benchmarks — HR and LR directories with scale factor subdirectories. The 100 images provide a statistically meaningful evaluation set where PSNR differences of 0.1 dB or more can be considered reliable indicators of genuine performance differences rather than measurement noise.
Degradation Methodology
Like Set5 and Set14, BSD100 uses bicubic downsampling exclusively. The statistical robustness of 100 images over Set5's 5 images makes BSD100 the preferred benchmark for detecting small but genuine performance improvements — particularly important when comparing methods that differ by less than 0.5 dB PSNR, where Set5 results have high variance.
Strengths and Limitations
Strengths:
100 images provide statistically robust evaluation, so small genuine performance differences are detectable
Widely reported in the SR literature enabling direct comparison
Diverse natural image content spanning multiple scene types
Fast evaluation on GPU — 100 images complete in under a minute for most architectures
Limitations:
Bicubic degradation only — same limitation as Set5
Relatively low source resolution limits applicability as a high-resolution SR benchmark
Derived from a segmentation dataset — not specifically designed for SR challenges
JPEG compression artifacts present in some original images can confound SR evaluation
How to Download and Use
BSD100 is bundled with BasicSR at https://github.com/XPixelGroup/BasicSR alongside Set5, Set14, Urban100, and Manga109. Download scripts are provided in the repository. Evaluation scripts for PSNR/SSIM on all classic benchmarks are also included.
Evaluation Metrics Commonly Used
PSNR (Y-channel) — same protocol as Set5, computed on YCbCr luminance channel
SSIM (Y-channel) — structural similarity on luminance
Research Papers That Use This Dataset
BSD100 results are reported alongside Set5 and Set14 in essentially every CNN-based and transformer-based SR paper — SRCNN, VDSR, EDSR, RCAN, RRDB, SwinIR, HAT, and many others. It is the most statistically reliable of the three classic small-scale SR benchmarks.
Final Year Project Angle
BSD100 is valuable for any SR project that needs to make statistical claims about performance differences. A project comparing a novel SR method against multiple baselines should include BSD100 as the primary statistical benchmark — the 100-image scale enables computation of mean and standard deviation of PSNR across images, supporting statistical significance testing (paired t-test or Wilcoxon signed-rank test) that cannot be meaningfully applied to Set5's 5-image results. Students interested in experimental methodology could conduct a meta-analysis of published SR results on BSD100 vs. DIV2K, examining whether model rankings are consistent across benchmarks or whether some methods are disproportionately optimized for specific benchmark characteristics.
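The paired tests mentioned above can be sketched as follows. This is a minimal illustration, assuming you already have one PSNR value per image for each of two methods, in the same image order. The function name `compare_methods` is hypothetical, and the scores in the usage example are synthetic placeholders, not published results.

```python
import numpy as np
from scipy import stats


def compare_methods(psnr_a, psnr_b):
    """Paired comparison of per-image PSNR scores from two SR methods.

    Both inputs must list scores for the same images in the same order.
    """
    a = np.asarray(psnr_a, dtype=np.float64)
    b = np.asarray(psnr_b, dtype=np.float64)
    if a.shape != b.shape or a.ndim != 1:
        raise ValueError("Expected two 1-D score arrays of equal length")

    diff = a - b
    # Paired t-test: assumes the per-image differences are roughly normal.
    _, t_p = stats.ttest_rel(a, b)
    # Wilcoxon signed-rank test: non-parametric alternative on the same pairs.
    _, w_p = stats.wilcoxon(a, b)
    return {
        "mean_diff": float(diff.mean()),
        "std_diff": float(diff.std(ddof=1)),
        "t_p": float(t_p),
        "wilcoxon_p": float(w_p),
    }


# Usage with synthetic per-image scores (100 images, as in BSD100).
rng = np.random.default_rng(0)
baseline = rng.normal(27.5, 2.0, size=100)
proposed = baseline + 0.1 + rng.normal(0.0, 0.05, size=100)
result = compare_methods(proposed, baseline)
print(f"mean gain: {result['mean_diff']:.3f} dB, "
      f"t-test p: {result['t_p']:.2e}, Wilcoxon p: {result['wilcoxon_p']:.2e}")
```

Reporting the mean difference together with both p-values makes it clear whether a small average gain is consistent across images or driven by a few outliers.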
Dataset 5: Urban100 — Urban Building Super Resolution Benchmark
Overview and Origin
Urban100 is a 100-image super resolution benchmark specifically selected for urban architecture and building photography, introduced by Huang et al. (2015) alongside the self-example SR method. Its defining characteristic is a heavy emphasis on images containing strong structural regularity — building facades, windows, grids, patterns, and geometric architectural features that provide challenging tests of structural fidelity in super resolution reconstruction.
Urban100 was designed to challenge SR models on their ability to reconstruct regular, repeating fine structural patterns, the type of content where early deep learning SR methods were weakest. Regular patterns like window grids, brick textures, fence lines, and architectural details require the SR model to maintain strict geometric regularity across the reconstructed image, which is harder than reconstructing irregular natural textures. As a result, Urban100 produces much lower PSNR scores than Set5 for essentially every SR method, and at larger scale factors many methods also score lower on it than on Set14 or BSD100, which is why it is widely regarded as the most challenging of the standard SR benchmarks.
Official Download: Available through BasicSR at https://github.com/XPixelGroup/BasicSR.
License: Research use
Citation: Huang et al., "Single Image Super-Resolution From Transformed Self-Exemplars," CVPR 2015
Dataset Statistics
Total images: 100 urban architectural photographs
Scale factors: ×2, ×3, ×4
HR resolution: Variable — high resolution urban photography
Degradation: Bicubic downsampling
Visual content: Urban buildings, architectural facades, windows, grids, geometric patterns, cityscapes
Format: PNG
Dataset Structure
Urban100 follows the same organizational structure as other classic SR benchmarks — HR and LR image directories with scale factor subdirectories. The 100 images are numbered with consistent filenames (img_001 through img_100) that are referenced in the SR literature for qualitative comparisons — specific images like img_004 (regular window grid), img_012 (fence pattern), and img_062 (building with fine structural detail) are commonly used as qualitative examples in published papers.
Degradation Methodology
Urban100 uses bicubic downsampling, consistent with other classic benchmarks. The challenge in Urban100 comes not from the degradation type but from the content — regular structural patterns are severely aliased by bicubic downsampling, creating moiré patterns and ringing artifacts in the LR images that SR models must resolve into clean structural edges in the HR reconstruction.
Strengths and Limitations
Strengths:
The most challenging standard SR benchmark — models that perform well on Urban100 are genuinely handling difficult structural reconstruction tasks
Urban architecture content is practically important for geospatial AI, urban planning, and building analysis applications
100 images provide statistically robust evaluation similar to BSD100
Specifically tests SR quality on the type of content (regular structures, edges, patterns) where SR quality most affects practical visual utility
Widely reported enabling direct comparison with published results
Limitations:
Single domain (urban architecture) — does not evaluate SR quality on natural textures, faces, or other non-architectural content
Bicubic degradation only
Some images may be less representative of modern urban photography
How to Download and Use
Urban100 is bundled with BasicSR at https://github.com/XPixelGroup/BasicSR. Download instructions and evaluation scripts are provided in the repository. Urban100 evaluation is typically included alongside Set5, Set14, and BSD100 in a comprehensive SR evaluation pipeline.
Evaluation Metrics Commonly Used
PSNR (Y-channel) — standard luminance-channel PSNR. Urban100 scores are typically several dB lower than Set5 scores for the same method; the gap to BSD100 is smaller and depends on the scale factor and the method.
SSIM (Y-channel) — structural similarity is particularly meaningful on Urban100 given its emphasis on structural regularity.
LPIPS — perceptual quality increasingly reported for GAN-based methods on Urban100.
Research Papers That Use This Dataset
Urban100 appears as a standard benchmark in all major SR papers — SwinIR, HAT, EDSR, RCAN, and others consistently report Urban100 results. It is the benchmark where performance differences between SR methods are most clearly visible qualitatively — the structural regularity of urban content makes reconstruction artifacts immediately visible to human observers in a way that irregular natural textures do not.
Final Year Project Angle
Urban100 is the ideal dataset for final year and research projects focused on structural super resolution, geospatial AI, smart city applications, or building analysis. A compelling project could investigate whether SR models fine-tuned on urban architecture data outperform general SR models on architectural content, testing the domain-specific fine-tuning hypothesis. Because Urban100 is an evaluation benchmark, any Urban100 images used for fine-tuning must be excluded from the reported test results (for example, fine-tune on one subset and test on the rest, or fine-tune on separately collected urban imagery). Another strong angle is a study of SR quality on different architectural feature types: measuring whether SR performance varies significantly between facade grids, roof structures, window patterns, and road markings, and proposing a feature-aware SR approach that handles each category with specialized processing. Combined with the UC Merced dataset (Dataset 1), Urban100 enables a comparison of SR performance on ground-level urban photography versus aerial satellite imagery of the same urban scene types.
Super Resolution Evaluation Metrics Explained
Evaluating super resolution models requires careful attention to metric selection. Different metrics capture different aspects of reconstruction quality, and the choice of metric can significantly affect which models appear superior. A comprehensive SR evaluation should include at minimum a distortion metric (PSNR or SSIM) and a perceptual quality metric (LPIPS or FID).
PSNR — Peak Signal-to-Noise Ratio
PSNR is the most widely reported SR metric. It measures the ratio between the squared maximum possible pixel value and the mean squared error between the reconstructed and ground truth HR images: PSNR = 10 · log₁₀(MAX²/MSE), where MAX is 255 for 8-bit images. It is reported in dB; higher is better. As a rough guide, PSNR values above 40 dB indicate near-perfect reconstruction, values between 30 and 40 dB indicate high quality, and values below 25 dB indicate visible degradation. PSNR is computed on the Y-channel (luminance) of YCbCr color space in most SR papers to focus evaluation on structural detail rather than color accuracy.
PSNR's limitation is well-documented: it measures pixel-level accuracy rather than perceptual quality. A reconstruction with slight global blurring may have higher PSNR than a reconstruction that is locally sharp but spatially misregistered by a few pixels. Human observers generally prefer the sharper reconstruction, but PSNR would rank it lower. This is why GAN-based SR methods — which sacrifice PSNR for perceptual sharpness — require separate evaluation with perceptual metrics.
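A minimal sketch of the Y-channel PSNR protocol described above, assuming 8-bit RGB images held as NumPy arrays. The function names `rgb_to_y` and `psnr_y` are hypothetical. The luma weights are the BT.601 coefficients commonly used for this conversion in SR evaluation code, and cropping a border equal to the scale factor is one common convention, so check the protocol of the papers you compare against.

```python
import numpy as np


def rgb_to_y(img_uint8: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 RGB image to BT.601 luma (Y), range 16 to 235."""
    img = img_uint8.astype(np.float64) / 255.0
    return img @ np.array([65.481, 128.553, 24.966]) + 16.0


def psnr_y(sr: np.ndarray, hr: np.ndarray, scale: int) -> float:
    """PSNR in dB on the Y channel after cropping `scale` border pixels."""
    if sr.shape != hr.shape:
        raise ValueError("SR and HR images must have the same shape")
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    if scale > 0:
        # Border pixels are excluded because models see less context there.
        y_sr = y_sr[scale:-scale, scale:-scale]
        y_hr = y_hr[scale:-scale, scale:-scale]
    mse = np.mean((y_sr - y_hr) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(255.0 ** 2 / mse))


# Usage on a synthetic pair: an "HR" image and a slightly noisy copy.
rng = np.random.default_rng(0)
hr = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
noise = rng.normal(0.0, 2.0, size=hr.shape)
sr = np.clip(hr.astype(np.float64) + noise, 0, 255).round().astype(np.uint8)
print(f"Y-channel PSNR: {psnr_y(sr, hr, scale=4):.2f} dB")
```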
SSIM — Structural Similarity Index
SSIM measures structural similarity by comparing local luminance, contrast, and structural patterns between reconstructed and reference images: SSIM(x,y) = [l(x,y)]^α · [c(x,y)]^β · [s(x,y)]^γ. Ranges from -1 to 1; higher is better, with 1 indicating perfect structural identity. SSIM is more correlated with human perceptual quality than PSNR because it explicitly models the structure of image content. It is particularly sensitive to edge sharpness and texture regularity, making it a meaningful metric for Urban100 (structural content) and UC Merced (geometric remote sensing patterns).
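For reference, the three components in the SSIM product can be written out as below. This follows the standard definition, with local means μ, standard deviations σ, cross-covariance σ_xy, and small stabilizing constants C₁, C₂, C₃. The simplified single-fraction form assumes the common choices α = β = γ = 1 and C₃ = C₂/2; specific implementations may differ in window type and constants.

```latex
\begin{aligned}
l(x,y) &= \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \\
c(x,y) &= \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \\
s(x,y) &= \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}, \\
\mathrm{SSIM}(x,y) &= \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}
                          {(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
\qquad (\alpha = \beta = \gamma = 1,\; C_3 = C_2/2).
\end{aligned}
```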
LPIPS — Learned Perceptual Image Patch Similarity
LPIPS measures perceptual similarity using deep network feature representations rather than pixel values. It computes the L2 distance between VGG or AlexNet feature activations of the reconstructed and reference images at multiple network layers. Lower LPIPS indicates higher perceptual similarity. LPIPS is substantially better correlated with human quality judgments than PSNR or SSIM, particularly for GAN-based and diffusion-based SR models, whose sharp, textured outputs score below PSNR-optimized models on PSNR but are preferred by human observers. LPIPS should be included in any SR evaluation that uses perceptual loss or adversarial training.
DISTS — Deep Image Structure and Texture Similarity
DISTS is a recently proposed perceptual metric that is specifically designed to be invariant to texture resampling — measuring structural similarity without penalizing models for generating different but equally valid texture realizations of the same structure. This property makes DISTS particularly appropriate for evaluating diffusion-based SR models which generate diverse but equally plausible texture completions of a given LR input.
NIQE — Natural Image Quality Evaluator
NIQE is a blind image quality metric that does not require a ground truth HR reference. It models the statistical properties of natural images using a multivariate Gaussian fitted on natural scene statistics and measures how much a test image deviates from this natural image model. Lower NIQE scores indicate more natural-looking images. NIQE is valuable for evaluating SR methods in scenarios where HR ground truth is not available — for example, SR applied to genuinely low-resolution images where no HR reference exists.
BRISQUE — Blind/Referenceless Image Spatial Quality Evaluator
BRISQUE is another no-reference image quality metric based on natural scene statistics, specifically designed to detect distortions like blur, noise, and compression artifacts. It is complementary to NIQE and useful for evaluating SR outputs for the presence of reconstruction artifacts without requiring ground truth references.
Which Metric to Use for Which Dataset
| Dataset | Primary Metric | Secondary Metric | When to Add LPIPS |
|---|---|---|---|
| UC Merced SR (Dataset 1) | PSNR (Y-channel) | SSIM | For GAN/diffusion SR models |
| DIV2K | PSNR (Y-channel) | SSIM, LPIPS | Always — most papers report both |
| Set5 | PSNR (Y-channel) | SSIM | Optional for GAN comparison |
| BSD100 | PSNR (Y-channel) | SSIM | For statistical robustness analysis |
| Urban100 | PSNR (Y-channel) | SSIM, LPIPS | For structural quality evaluation |
Comparison Table
| Attribute | UC Merced SR | DIV2K | Set5 | BSD100 | Urban100 |
|---|---|---|---|---|---|
| Total HR Images | 2,100 | 1,000 | 5 | 100 | 100 |
| HR Resolution | 256×256 | ~2K (2040×1404) | 288–512 px | 321×481 px | High urban photo |
| Scale Factors | ×4 only | ×2, ×3, ×4, ×8 | ×2, ×3, ×4 | ×2, ×3, ×4 | ×2, ×3, ×4 |
| Degradation | Blur + Downsample | Bicubic + Realistic | Bicubic | Bicubic | Bicubic |
| Domain | Remote Sensing | Diverse natural | Mixed (5 images) | Natural photos | Urban architecture |
| Pre-split | Yes (train/val/test) | Yes (800/100/100) | Test only | Test only | Test only |
| LR Images Included | Yes (pre-generated) | Yes | Yes | Yes | Yes |
| License | CC BY 4.0 | Research use | Research use | Research use | Research use |
| Download Size | 108.4 MB | ~7 GB (HR+LR) | <10 MB | ~30 MB | ~50 MB |
| Primary Use | Training + Eval | Training + Eval | Benchmark eval | Benchmark eval | Benchmark eval |
| Challenge History | New (2026) | NTIRE 2017–2021 | Since 2012 | Since 2010 | Since 2015 |
| FYP Suitability | High (remote sensing) | Very High (general) | High (comparison) | High (statistics) | High (urban AI) |
How to Choose the Right Dataset for Your Project
Choose UC Merced SR Dataset (Dataset 1) if: Your project involves remote sensing, satellite image analysis, geospatial AI, or any domain where aerial imagery is the target application. It is also the best choice if you need a training dataset with pre-generated LR-HR pairs under blur-and-downsample degradation — saving preprocessing time.
Choose DIV2K (Dataset 2) if: You need your results to be directly comparable with published state-of-the-art. DIV2K is the universal training and evaluation benchmark. If your paper does not include DIV2K results, reviewers will ask for them. Use DIV2K for any general-purpose SR architecture development or comparison study.
Choose Set5 (Dataset 3) if: You need a lightweight benchmark for rapid development iteration or want to include universal comparison numbers alongside a larger evaluation. Set5 should rarely be used as a primary benchmark — its 5-image scale produces unreliable statistical comparisons — but is essential as a secondary reporting standard.
Choose BSD100 (Dataset 4) if: You need statistical robustness in your evaluation — comparing methods where performance differences are small (under 0.5 dB PSNR) and you need enough images to make the comparison statistically meaningful. BSD100 is the best choice when statistical rigor matters more than domain specificity.
Choose Urban100 (Dataset 5) if: Your project involves urban scene analysis, smart city AI, architectural imaging, or any application where structural regularity and geometric precision in SR output are the primary quality criteria. Urban100 is the most challenging standard SR benchmark and the best test of structural fidelity.
Common Super Resolution Models Benchmarked on These Datasets
SRCNN (Super Resolution Convolutional Neural Network): The foundational deep learning SR model (Dong et al., 2014). A three-layer CNN that maps a bicubic-upsampled LR image to an HR estimate through patch extraction, non-linear mapping, and reconstruction layers. Historical baseline showing how far the field has progressed.
EDSR (Enhanced Deep Super Resolution Network): Residual network SR model that removes batch normalization from ResNet and scales residual connections, achieving strong PSNR performance. Winner of NTIRE 2017. GitHub: sanghyun-son/EDSR-PyTorch.
RCAN (Residual Channel Attention Networks): Introduces channel attention into deep SR networks to selectively emphasize informative features. PSNR state-of-the-art at time of publication (2018). Widely used as baseline in subsequent papers.
Real-ESRGAN: GAN-based SR model designed for real-world degradation beyond simple bicubic downsampling. Uses complex degradation simulation during training to handle noise, blur, and JPEG compression simultaneously. Excellent for practical deployment. GitHub: xinntao/Real-ESRGAN.
SwinIR (Image Restoration Using Swin Transformer): Transformer-based SR model using Swin Transformer blocks. Achieves PSNR state-of-the-art on all standard benchmarks at publication. The standard transformer baseline for comparison. GitHub: JingyunLiang/SwinIR.
HAT (Hybrid Attention Transformer): Extends the SwinIR design by combining channel attention with window-based self-attention and adding an overlapping cross-attention module, achieving further PSNR improvements. Currently among the highest-PSNR transformer SR models. GitHub: XPixelGroup/HAT.
StableSR / SeeSR: Diffusion model-based SR approaches that use Stable Diffusion as a generative prior for realistic texture synthesis. They trade PSNR for perceptual quality: their LPIPS is typically much better than that of PSNR-optimized models, which places them at the frontier of perceptual SR research.
How to Prepare LR-HR Pairs for Training
Degradation Pipeline
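For any dataset copy that lacks pre-generated LR images (as can happen with Set5, BSD100, and Urban100 depending on the download source), generate LR images programmatically. The standard bicubic downsampling pipeline uses PIL's Image.resize() with BICUBIC resampling or torchvision's transforms.Resize() with InterpolationMode.BICUBIC. Published benchmark LR images are conventionally produced with MATLAB's imresize, whose bicubic resampling differs slightly from PIL and torchvision, so PSNR measured on self-generated LR images may not match published numbers exactly. For blur-and-downsample degradation consistent with UC Merced SR Dataset, apply scipy.ndimage.gaussian_filter() with sigma=1.0 before resizing. Always generate and save LR images before training, because regenerating them on the fly introduces variance across experiments if random elements are involved.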
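The two degradation options above can be sketched as follows, using PIL for bicubic resizing and scipy.ndimage.gaussian_filter for the optional blur. The function name `make_lr_pair` and the choice to crop the HR image to a multiple of the scale factor are assumptions made for illustration. This is not the exact pipeline used to build any of the published datasets.

```python
from typing import Optional, Tuple

import numpy as np
from PIL import Image
from scipy.ndimage import gaussian_filter


def make_lr_pair(
    hr: Image.Image, scale: int, blur_sigma: Optional[float] = None
) -> Tuple[Image.Image, Image.Image]:
    """Return (cropped HR, LR) where LR is HR downsampled by `scale`.

    If `blur_sigma` is given, a Gaussian blur is applied before resizing
    (blur-and-downsample); otherwise plain bicubic downsampling is used.
    """
    hr = hr.convert("RGB")
    # Crop so both sides divide evenly by the scale factor.
    w, h = hr.size
    w, h = w - w % scale, h - h % scale
    hr = hr.crop((0, 0, w, h))

    source = hr
    if blur_sigma is not None:
        arr = np.asarray(hr, dtype=np.float64)
        # Blur the two spatial axes only; sigma 0 leaves channels unmixed.
        arr = gaussian_filter(arr, sigma=(blur_sigma, blur_sigma, 0))
        source = Image.fromarray(np.clip(arr.round(), 0, 255).astype(np.uint8))

    lr = source.resize((w // scale, h // scale), Image.BICUBIC)
    return hr, lr


# Usage on a synthetic image; replace with Image.open(path) for real data.
rng = np.random.default_rng(0)
demo = Image.fromarray(rng.integers(0, 256, size=(50, 67, 3), dtype=np.uint8))
hr_img, lr_img = make_lr_pair(demo, scale=4, blur_sigma=1.0)
print(hr_img.size, lr_img.size)
```

Returning the cropped HR image alongside the LR image keeps the pair exactly aligned, which matters later for patch extraction and for PSNR computation.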
Patch Extraction
Most SR models train on image patches rather than full images for memory efficiency and data augmentation. Standard practice extracts 48×48 LR patches (corresponding to 192×192 HR patches at ×4) with stride 24, producing hundreds to thousands of patches per training image depending on its resolution. Larger patches (64×64 LR / 256×256 HR) often produce better models but require more GPU memory. Always extract paired patches: the HR patch must cover exactly the same image region as the LR patch, with its coordinates and size multiplied by the scale factor.
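A minimal sketch of paired patch extraction on a regular grid, assuming the LR and HR images are NumPy arrays whose sizes differ by exactly the scale factor. The function name `extract_paired_patches` is hypothetical, and the defaults mirror the 48×48 patch size and stride 24 mentioned above.

```python
import numpy as np


def extract_paired_patches(lr, hr, scale, lr_patch=48, stride=24):
    """Slide a window over the LR image and cut the matching HR region.

    Returns a list of (lr_patch, hr_patch) array pairs.
    """
    lh, lw = lr.shape[:2]
    if hr.shape[0] != lh * scale or hr.shape[1] != lw * scale:
        raise ValueError("HR size must equal LR size times the scale factor")

    pairs = []
    for top in range(0, lh - lr_patch + 1, stride):
        for left in range(0, lw - lr_patch + 1, stride):
            lr_p = lr[top:top + lr_patch, left:left + lr_patch]
            # HR coordinates are the LR coordinates multiplied by the scale.
            hr_p = hr[top * scale:(top + lr_patch) * scale,
                      left * scale:(left + lr_patch) * scale]
            pairs.append((lr_p, hr_p))
    return pairs


# Usage on synthetic arrays standing in for one LR-HR training pair at x4.
lr_demo = np.zeros((96, 120, 3), dtype=np.uint8)
hr_demo = np.zeros((384, 480, 3), dtype=np.uint8)
patches = extract_paired_patches(lr_demo, hr_demo, scale=4)
print(len(patches), patches[0][0].shape, patches[0][1].shape)
```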
Data Augmentation for SR
Standard SR augmentations include random horizontal flipping, random vertical flipping, and random rotation by 90°/180°/270°. Color jitter is generally not used because color transformations can create inconsistency between LR and HR pairs. Random cropping is used to sample training patches, but the crop must cover the same region in both the LR and HR images, with HR coordinates multiplied by the scale factor. Mixup and CutMix augmentations are not standard for SR due to the paired nature of the training data.
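The flip and rotation augmentations above only work if the LR and HR patches receive the same transform. A minimal sketch, assuming HxWxC NumPy patches and a NumPy random generator; the function name `augment_pair` is hypothetical.

```python
import numpy as np


def augment_pair(lr, hr, rng):
    """Apply one random flip/rotation to an LR-HR pair, identically to both."""
    if rng.random() < 0.5:
        # Horizontal flip: reverse the width axis.
        lr, hr = lr[:, ::-1], hr[:, ::-1]
    if rng.random() < 0.5:
        # Vertical flip: reverse the height axis.
        lr, hr = lr[::-1], hr[::-1]
    # Rotate by 0, 90, 180, or 270 degrees in the image plane.
    k = int(rng.integers(0, 4))
    lr, hr = np.rot90(lr, k), np.rot90(hr, k)
    # Copy to contiguous memory so the result can be passed to torch.from_numpy.
    return np.ascontiguousarray(lr), np.ascontiguousarray(hr)


# Usage: the HR patch stays four times the LR patch size after augmentation.
rng = np.random.default_rng(0)
lr_patch = np.zeros((48, 48, 3), dtype=np.uint8)
hr_patch = np.zeros((192, 192, 3), dtype=np.uint8)
aug_lr, aug_hr = augment_pair(lr_patch, hr_patch, rng)
print(aug_lr.shape, aug_hr.shape)
```

Drawing the random decisions once and applying them to both arrays is what keeps the pair aligned; sampling separately for LR and HR would silently break the correspondence.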
Train-Val-Test Split Best Practices
For datasets with pre-defined splits (UC Merced SR, DIV2K), always use the official splits. Never include validation or test images in training. For datasets used only as benchmarks (Set5, BSD100, Urban100), there are no training images — these datasets are evaluation-only. For custom LR-HR generation from a source dataset, establish the split before generating LR images and document the split for reproducibility.
Research Gap Radar
Gap 1 — Real-World Remote Sensing SR Datasets: The UC Merced dataset uses synthetically generated LR images. No large-scale publicly available dataset provides real paired LR-HR remote sensing images captured at different altitudes or sensor resolutions of the same geographic areas. Such a dataset would enable realistic blind SR evaluation for satellite imagery applications.
Gap 2 — Medical Imaging SR Benchmarks: Medical SR — enhancing MRI, CT, histology, and microscopy images — has no universal benchmark dataset comparable to DIV2K in the natural image domain. Multiple institution-specific datasets exist but none has achieved the standardization needed for cross-paper comparison.
Gap 3 — Video Super Resolution Standardization: Video SR is a major research direction but lacks the benchmarking infrastructure of single-image SR. REDS and Vimeo-90K are commonly used but do not have the universal adoption that DIV2K has in the image domain. Real-world video SR datasets with authentic degradation (not synthetic bicubic) are particularly scarce.
Gap 4 — Multi-Degradation Paired Benchmarks: Most SR benchmarks use a single degradation type — either bicubic or a specific blur model. Real-world images suffer from combinations of blur, noise, compression, and atmospheric effects simultaneously. A large-scale benchmark with diverse mixed degradations paired with HR ground truth would better represent practical SR deployment scenarios.
Gap 5 — Domain-Specific SR Beyond Remote Sensing and Faces: Specialized SR domains including document imaging, microscopy, infrared imagery, and depth sensor data each have distinct degradation characteristics but lack community-standard benchmark datasets. Each represents an open opportunity for a dataset contribution paper.
Implementation Roadmap
Step 1 — Choose Your Dataset and Task (Week 1): Use the How to Choose section to select your primary and secondary benchmark datasets. Define whether your task is PSNR-oriented reconstruction quality, perceptual quality, or real-world degradation handling.
Step 2 — Download and Verify (Week 1–2): Download your chosen datasets using the official links. Verify LR-HR pair alignment by visualizing upsampled LR alongside HR and confirming spatial correspondence. Measure the actual PSNR of the bicubic upsampled LR as your lower-bound baseline.
Step 3 — Set Up Your Baseline Model (Week 2–3): Install BasicSR or MMagic and run EDSR or SwinIR as your first baseline. This establishes both a performance reference and confirms your evaluation pipeline produces correct PSNR/SSIM numbers before you implement any novel method.
Step 4 — Generate or Verify LR-HR Pairs (Week 3): For datasets without pre-generated pairs, generate LR images using your chosen degradation pipeline and verify by visual inspection and PSNR measurement. Save generated LR images to disk and use fixed random seeds for any stochastic degradation components.
Step 5 — Implement Your Method (Week 4–7): Implement your core contribution — a new architecture module, a novel loss function, a new degradation simulation strategy, or a domain adaptation method. Start with the minimum viable modification to the baseline before adding complexity.
Step 6 — Evaluate Comprehensively (Week 7–9): Report PSNR and SSIM on all datasets. Add LPIPS if your method uses perceptual or adversarial training. Include qualitative visual comparisons — crop a consistent 64×64 region from the same image across all compared methods to enable direct visual comparison.
Step 7 — Ablation and Analysis (Week 9–11): Remove each component of your method and measure the PSNR impact. Analyze where your method improves most — which image types, which scale factors, which content categories benefit most from your contribution.
Step 8 — Write Up and Release (Week 11–12): Document your implementation, dataset preprocessing, and evaluation protocol completely. Release your code with the evaluation scripts so results are reproducible.
Tools and Frameworks
BasicSR: The most comprehensive SR toolbox available — covers SRCNN, ESRGAN, EDSR, SwinIR, HAT, RealESRGAN, and many more with standardized training, evaluation, and dataset handling. Includes download scripts for all standard SR benchmarks. https://github.com/XPixelGroup/BasicSR
MMagic (formerly MMEditing): OpenMMLab's image and video editing framework with comprehensive SR model implementations, standardized benchmark evaluation, and active maintenance. https://github.com/open-mmlab/mmagic
Real-ESRGAN: The reference implementation for practical blind SR with complex real-world degradation simulation. Essential for any project targeting real-world image enhancement rather than synthetic benchmark performance. https://github.com/xinntao/Real-ESRGAN
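HuggingFace Diffusers: Provides Stable Diffusion pipelines and components, including diffusion-based upscaling pipelines, that are useful for any project investigating diffusion model SR approaches. StableSR and SeeSR are released as their own research repositories rather than as built-in Diffusers pipelines. https://huggingface.co/docs/diffusers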
torchvision.transforms: Standard PyTorch image transformation library. Provides bicubic interpolation, random crop, flip, and rotation operations needed for SR data pipeline construction.
LPIPS library: Official Python implementation of the LPIPS perceptual metric. Essential for any SR evaluation that goes beyond PSNR/SSIM. pip install lpips
IQA-PyTorch: Comprehensive image quality assessment library covering PSNR, SSIM, LPIPS, DISTS, NIQE, BRISQUE, and many other metrics in a unified interface. https://github.com/chaofengc/IQA-PyTorch
Common Mistakes When Using SR Datasets
Mistake 1 — Mismatched degradation between training and evaluation: Training on bicubic-degraded LR images and evaluating on blur-and-downsample LR images (or vice versa) produces misleading results. Always ensure training and evaluation use the same degradation protocol, or explicitly document and justify any mismatch.
Mistake 2 — Computing PSNR in RGB instead of Y-channel: The standard SR literature computes PSNR and SSIM on the Y-channel (luminance) of YCbCr color space, not on RGB. Computing on RGB produces different numbers that are not comparable with published results. Always convert to YCbCr and compute on Y-channel only.
Mistake 3 — Not cropping borders before PSNR computation: SR models typically produce lower-quality output near image boundaries due to limited context at the edges. The standard protocol crops a border of size equal to the scale factor (or 4 pixels at ×4) before computing PSNR/SSIM. Omitting this step produces slightly different numbers that are not directly comparable with papers that follow the standard protocol.
Mistake 4 — Using Set5 as the sole benchmark: Set5's 5-image scale is statistically insufficient for reliable performance comparisons. Always include BSD100 or DIV2K as primary benchmarks alongside Set5. Performance differences on Set5 alone should not be used to claim a method is superior to another.
Mistake 5 — Ignoring the perceptual quality vs. distortion tradeoff: GAN-based and diffusion-based SR methods achieve better perceptual quality (LPIPS, FID) but lower PSNR than reconstruction-loss models. Evaluating a GAN-based SR method only with PSNR and concluding it is inferior misrepresents its actual quality. Always evaluate with both distortion metrics (PSNR/SSIM) and perceptual metrics (LPIPS) and report both.
Mistake 6 — Not fixing random seeds for data generation: Any stochastic element in LR image generation — random noise, random blur kernel selection — must use fixed seeds for reproducibility. Different random seeds produce different LR images, different PSNR numbers, and non-reproducible results.
Mistake 7 — Reporting test set results during model development: Use validation set metrics for architecture decisions and hyperparameter tuning. Report final numbers on the test set exactly once. Repeatedly checking test set performance during development inflates reported results through implicit test set overfitting.
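For Mistake 6, one way to make stochastic degradation reproducible is to derive a separate random generator for each image from a fixed base seed and the image index. The sketch below assumes additive Gaussian noise as the stochastic component; the function name `degrade_with_noise` and its parameters are hypothetical.

```python
import numpy as np


def degrade_with_noise(lr, sigma, base_seed, index):
    """Add Gaussian noise to an 8-bit LR image, reproducibly.

    The generator is seeded from (base_seed, index), so the same image
    always gets the same noise regardless of processing order.
    """
    rng = np.random.default_rng([base_seed, index])
    noisy = lr.astype(np.float64) + rng.normal(0.0, sigma, size=lr.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


# Usage: rerunning with the same seed and index reproduces the LR image.
image = np.full((16, 16, 3), 128, dtype=np.uint8)
first = degrade_with_noise(image, sigma=5.0, base_seed=1234, index=0)
second = degrade_with_noise(image, sigma=5.0, base_seed=1234, index=0)
print(np.array_equal(first, second))
```

Seeding per image rather than once per run means the result does not change if images are processed in a different order or in parallel.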
Your Next Steps
Select your primary dataset using the How to Choose section — start with UC Merced SR at https://zenodo.org/records/19712660 for remote sensing projects, or DIV2K at https://data.vision.ee.ethz.ch/cvl/DIV2K/ for general SR work
Download BasicSR at github.com/XPixelGroup/BasicSR — it provides EDSR, SwinIR, download scripts for all classic benchmarks, and standardized evaluation metrics in one repository
Run bicubic upsampling on your test set first — this establishes your lower bound and confirms your PSNR evaluation pipeline is correct before you implement any neural network
Establish your evaluation protocol — Y-channel PSNR and SSIM with border cropping — and apply it consistently across all compared methods
Identify one gap from the Research Gap Radar that your project can address — even a partial contribution toward a missing domain-specific SR benchmark is a genuine research contribution
Document your LR generation pipeline with fixed random seeds and version-controlled code from day one
Conclusion
Super resolution is one of the most mature and well-benchmarked tasks in computer vision, yet it remains an active area of research with genuine open problems in real-world degradation handling, domain-specific applications, and perceptual quality optimization. The five datasets covered in this article span the main tiers of SR evaluation: from the newly released domain-specific UC Merced SR Dataset with its physically motivated blur-and-downsample degradation, through the universal high-resolution DIV2K training benchmark, to the classic lightweight benchmarks Set5, BSD100, and Urban100 that have defined SR comparison standards for over a decade.
The UC Merced Super Resolution Dataset, available at https://zenodo.org/records/19712660, is a valuable addition to this landscape specifically for remote sensing and satellite image super resolution research. Its pre-generated LR-HR pairs with blur-and-downsample degradation, pre-configured train/validation/test splits, and open CC BY 4.0 license make it immediately usable for deep learning SR training and evaluation without preprocessing overhead. For researchers and students working in geospatial AI, remote sensing analysis, or domain-specific super resolution, this dataset provides a solid foundation for reproducible, benchmarkable work.
For researchers and students who are new to super resolution and want to build a strong conceptual foundation before working with these datasets — understanding the progression from classical interpolation methods to CNN-based approaches, GAN-based perceptual SR, and modern diffusion model-based systems — the Image Inpainting and Super Resolution research guides on Scientias AI Labs provide clear technical coverage of the architectures and training strategies used across all the models benchmarked on these five datasets.
The resolution is yours to enhance.




