Top 5 Image Inpainting Datasets Every Computer Vision Researcher Must Know
A Complete Guide to Datasets, Metrics, Models, and Implementation — With Research Angles for Every Dataset

Introduction
What Is Image Inpainting
Image inpainting is the computational task of reconstructing missing, corrupted, or deliberately removed regions of a digital image in a way that is visually plausible, semantically consistent, and perceptually indistinguishable from the surrounding content. The term originates from the art restoration practice of filling in damaged sections of paintings — and the computational problem carries the same fundamental challenge: the model must understand not just the local pixel neighborhood around a missing region, but the global semantic context of the entire image, and use that understanding to hallucinate content that never existed in the original data.
Modern deep learning-based inpainting systems are capable of remarkable feats. They can remove an entire person from a crowded photograph and fill the background seamlessly. They can restore century-old damaged photographs to near-original quality. They can remove unwanted watermarks, logos, and text from images without any visible trace. They can extend the boundaries of a photograph beyond its original frame. They can replace a sky, remove a building, erase a car, or fill a face, in many cases from a single forward pass through a neural network (diffusion-based systems instead run a series of denoising steps).
The applications of image inpainting span virtually every domain where images matter. In film and television post-production, inpainting powers wire removal, object erasure, and background replacement. In medical imaging, it enables restoration of corrupted MRI or CT scan regions. In satellite and remote sensing imagery, it fills gaps caused by cloud cover or sensor failure. In e-commerce, it removes product backgrounds and fills in missing inventory images. In forensics and security, it both enables and detects image manipulation. In mobile photography, it powers the "magic eraser" features in consumer smartphones.
Why Datasets Matter More Than Algorithms
A fundamental truth in modern deep learning research is that the quality of a model is bounded by the quality of the data it learns from. This is particularly true for image inpainting, where the model must learn complex statistical relationships between visible and missing image regions across an enormous diversity of visual content, mask shapes, and semantic contexts. The dataset determines what the model can learn, how well it generalizes to unseen content, how it handles edge cases, and how reliably its performance can be compared against competing approaches.
The history of image inpainting research is closely tied to the history of its benchmark datasets. Early work used simple synthetic datasets with rectangular masks on limited image collections. The introduction of large-scale diverse datasets like MS-COCO enabled the training of powerful generative models that could handle complex semantic inpainting. The introduction of irregular mask datasets enabled models to move beyond toy settings to real-world restoration scenarios. The introduction of domain-specific datasets like FFHQ enabled the development of face-specialized inpainting systems of extraordinary quality. Each dataset advance unlocked a corresponding model capability advance.
For final-year students and researchers entering the field, choosing the right dataset is not a secondary concern. It is a primary design decision that determines the scope, relevance, and reproducibility of your work.
How This Article Is Structured
This article covers five of the most important image inpainting datasets available to researchers in 2025–2026, including a newly released large-scale urban dataset specifically designed for modern diffusion model training. Each dataset is covered in depth: its origin, statistics, structure, strengths, limitations, download instructions, associated evaluation metrics, related research papers, and a concrete project angle for final-year students and researchers. Following the dataset coverage, the article provides a comprehensive evaluation metrics guide, a side-by-side comparison table, guidance on choosing the right dataset for your specific project, a review of the inpainting models commonly benchmarked on these datasets, practical data preparation guidance, a research gap radar, an implementation roadmap, a tools and frameworks reference, common mistakes to avoid, and a conclusion with actionable next steps.
What Makes a Great Image Inpainting Dataset
Not all image datasets are suitable for inpainting research, and not all inpainting datasets are suitable for all inpainting tasks. Before examining the five datasets in detail, it is worth establishing the criteria that distinguish a great inpainting dataset from an adequate one.
Image Diversity and Scale
A great inpainting dataset contains images that span a wide range of visual categories, scene types, lighting conditions, resolutions, and content complexities. A model trained only on landscape photographs will fail on portraits. A model trained only on indoor scenes will struggle with outdoor environments. Scale matters because inpainting models — particularly modern diffusion-based systems — require tens of thousands to millions of training examples to learn robust priors. Datasets with fewer than 10,000 images are generally insufficient for training state-of-the-art models from scratch, though they may be adequate for fine-tuning.
Mask Type and Complexity
The mask defines the inpainting problem. A dataset with only rectangular masks trains models that fail on irregular real-world damage patterns. A dataset with only small masks trains models that struggle with large missing regions. A great inpainting dataset provides masks that reflect the target use case — irregular free-form masks for general restoration, semantic object masks for object removal, center masks for context understanding evaluation — with sufficient variety in mask size, shape, and placement to prevent the model from learning mask-specific shortcuts.
Ground Truth Quality
Inpainting evaluation requires clean, high-quality original images against which reconstructed outputs can be compared. Datasets with low-resolution originals, heavy compression artifacts, or inconsistent quality produce unreliable evaluation scores. High-resolution, professionally captured or carefully curated images produce evaluation benchmarks that genuinely reflect model capability rather than dataset noise.
Domain Coverage
Domain coverage determines which inpainting applications the dataset can support. A general-purpose dataset like MS-COCO covers diverse object categories but may be weak for specific domains like faces, medical images, or satellite imagery. Domain-specific datasets like FFHQ (faces) or DIV2K (high-resolution natural scenes) enable specialized model development but may not generalize well to other domains. The best research uses a combination of general and domain-specific datasets.
License and Accessibility
A dataset that is not freely accessible cannot serve the research community. The best inpainting datasets are released under open licenses (Creative Commons, MIT, or equivalent) with clear attribution requirements, stable download infrastructure, and version-controlled releases. Datasets hosted on reliable platforms like Zenodo, GitHub, or institutional servers with DOIs are preferable to those distributed via personal Google Drive links or without version tracking.
Dataset 1: ParisStreetView-RandomMasks — Large-Scale Urban Image Inpainting Dataset with Random Irregular Masks
Overview and Origin
ParisStreetView-RandomMasks is a newly released large-scale urban image inpainting dataset published in May 2026 on Zenodo under an open access license. The dataset was created by Wisen IT Solutions and is specifically designed for computer vision researchers, generative AI developers, and deep learning practitioners working on image restoration and scene completion tasks in urban environments.
The dataset builds upon the original Paris StreetView dataset — a collection of urban street-level photographs of Paris that has been used in computer vision research for scene understanding, place recognition, and image generation tasks. The ParisStreetView-RandomMasks release extends this foundation by adding programmatically generated irregular random masks and corresponding corrupted images, transforming the base street-view photography collection into a complete supervised training and evaluation resource specifically designed for modern inpainting research.
The timing of this release is significant. As diffusion-based inpainting models have become the dominant approach — replacing GAN-based methods that dominated the field from 2018 to 2022 — the community has needed large-scale datasets with diverse, realistic irregular masks that challenge these powerful generative models appropriately. The ParisStreetView-RandomMasks dataset addresses this need with a carefully structured collection of 22,601 image triplets, each containing the original image, a synthetically generated irregular mask, and the corresponding corrupted image ready for supervised inpainting training.
Official Download and DOI: https://zenodo.org/records/20233925
DOI: 10.5281/zenodo.20233925
Version: v1.0 — Published May 16, 2026
License: Open Access — comply with original Paris StreetView dataset license
Dataset Statistics
Total images: 22,601 urban street-view photographs
Total size: 12.7 GB
Image domain: Urban street-level photography — Paris, France
Mask type: Irregular free-form random stroke masks
Mask variation: Varying thickness, curvature, and region complexity
Format: images/, masks/, corrupted/ directories with train.txt, val.txt split files and annotations.csv metadata
Split support: Training and validation split pre-configured
Dataset Structure and File Organization
The dataset is organized into a clean, reproducible directory structure that maps directly onto a standard PyTorch Dataset and DataLoader pipeline:
Dataset/
├── images/ — original clean street-view photographs
├── masks/ — synthetically generated binary irregular masks
├── corrupted/ — masked versions of original images (input to inpainting model)
├── train.txt — list of training sample filenames
├── val.txt — list of validation sample filenames
└── annotations.csv — metadata annotations per sample
Each sample in the dataset is a triplet: the original clean image (ground truth target), the binary mask (1 = missing region, 0 = visible region), and the corrupted image (original with masked region set to zero or noise). This three-component structure directly supports the standard supervised training loop used by all major inpainting models — the corrupted image and mask serve as inputs, and the original image serves as the reconstruction target.
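The relationship between the three components can be sketched in a few lines of NumPy. This is an illustrative sketch, not the dataset's own generation script: the function name apply_mask and the constant fill value are assumptions, and only the mask convention (1 = missing, 0 = visible) comes from the description above.

```python
import numpy as np

def apply_mask(image: np.ndarray, mask: np.ndarray, fill: float = 0.0) -> np.ndarray:
    """Return a corrupted copy of `image` in which pixels with mask == 1 are set to `fill`."""
    if image.shape[:2] != mask.shape:
        raise ValueError("image and mask must share height and width")
    hole = mask.astype(bool)      # True where the region is missing
    corrupted = image.copy()      # leave the ground-truth image untouched
    corrupted[hole] = fill        # applied to every channel of the masked pixels
    return corrupted

# Tiny synthetic example: a uniform grey image with a 2x2 hole in the middle.
image = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
corrupted = apply_mask(image, mask)
```

During training, the pair (corrupted, mask) is the model input and image is the reconstruction target.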
The annotations.csv file provides per-sample metadata that supports dataset analysis and stratified evaluation — including information about mask coverage percentage, image scene type annotations, and sample identifiers that map between the three component directories.
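Stratified evaluation by mask coverage might look like the sketch below. The column names sample_id and mask_coverage, the sample rows, and the bucket edges are all hypothetical; check the header of the real annotations.csv before adapting it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt; the real annotations.csv may use different column names.
SAMPLE_CSV = """sample_id,mask_coverage
000001.png,0.08
000002.png,0.27
000003.png,0.45
000004.png,0.15
"""

def bucket_by_coverage(csv_file, edges=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Group sample ids into difficulty buckets keyed by the upper coverage edge."""
    buckets = defaultdict(list)
    for row in csv.DictReader(csv_file):
        coverage = float(row["mask_coverage"])
        # Pick the first upper edge that the coverage does not exceed.
        label = next((f"<={e:.0%}" for e in edges if coverage <= e),
                     f">{edges[-1]:.0%}")
        buckets[label].append(row["sample_id"])
    return dict(buckets)

buckets = bucket_by_coverage(io.StringIO(SAMPLE_CSV))
```

Reporting metrics per bucket rather than as one average makes it visible whether a model degrades on large holes.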
Mask Generation Methodology
The masks in ParisStreetView-RandomMasks were generated using irregular free-form random stroke simulation, an approach popularized by the free-form mask algorithm in the DeepFill v2 paper (Yu et al., 2019) and building on the irregular-mask setting introduced with Partial Convolution (Liu et al., 2018). The methodology simulates realistic missing region patterns that occur in practice due to physical damage, occlusion removal, or user-guided erasing.
The simulation process works by generating a series of random walk paths across the image canvas. Each path begins at a random starting point and proceeds through a series of direction changes governed by random angular perturbations bounded by a maximum turning angle. The path is rendered as a filled stroke with a randomly sampled width drawn from a range spanning thin scratches to broad erasure regions. Multiple strokes are generated per image with random starting points, producing masks that vary significantly in total covered area, spatial distribution, and geometric complexity.
The key parameters varied during mask generation include stroke width (controlling whether the mask simulates a fine scratch or a broad erasure region), the number of strokes per image (controlling total masked area), maximum turning angle (controlling stroke curvature from nearly straight to highly irregular), and total mask coverage ratio (controlling the difficulty of the inpainting task from easy small-region fills to challenging large-area reconstructions).
This variability is critical for training robust inpainting models. A model trained only on small-area masks learns to perform local texture blending but fails on large structural reconstructions. A model trained only on large-area masks may learn global semantic generation but lose fine-grained local texture fidelity. The ParisStreetView-RandomMasks dataset's coverage of a wide range of mask complexities — from thin isolated strokes to large interconnected irregular regions — produces models that generalize across real-world inpainting scenarios.
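The random-walk stroke procedure described above can be approximated with the following NumPy sketch. It is not the generator used to build the dataset: the function name, every default parameter range, and the disc-stamping renderer are assumptions chosen for illustration.

```python
import numpy as np

def random_stroke_mask(height, width, rng, num_strokes=(1, 5), num_steps=(4, 12),
                       step_len=(10, 40), brush_width=(4, 20), max_angle=np.pi / 4):
    """Binary free-form mask (1 = missing, 0 = visible) built from random-walk strokes."""
    mask = np.zeros((height, width), dtype=np.uint8)
    yy, xx = np.mgrid[0:height, 0:width]
    for _ in range(rng.integers(num_strokes[0], num_strokes[1] + 1)):
        x, y = rng.uniform(0, width), rng.uniform(0, height)   # random start point
        angle = rng.uniform(0, 2 * np.pi)                      # random initial heading
        radius = rng.uniform(*brush_width) / 2                 # one width per stroke
        for _ in range(rng.integers(num_steps[0], num_steps[1] + 1)):
            angle += rng.uniform(-max_angle, max_angle)        # bounded turning angle
            length = rng.uniform(*step_len)
            nx = np.clip(x + length * np.cos(angle), 0, width - 1)
            ny = np.clip(y + length * np.sin(angle), 0, height - 1)
            # Stamp discs along the segment to render it as a thick stroke.
            for t in np.linspace(0.0, 1.0, int(length) + 1):
                cx, cy = x + t * (nx - x), y + t * (ny - y)
                mask[(xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2] = 1
            x, y = nx, ny
    return mask

mask = random_stroke_mask(64, 64, np.random.default_rng(0))
coverage = mask.mean()  # fraction of pixels marked as missing
```

Raising num_strokes or brush_width increases coverage, while max_angle controls how straight or curly each stroke is. The per-pixel disc stamping is simple but slow on large images; a drawing library would be the practical choice there.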
Recommended Inpainting Models
The dataset documentation recommends five inpainting models as primary baselines and training architectures:
LaMa (Large Mask inpainting): A feed-forward inpainting model built on fast Fourier convolutions and designed for large irregular masks. The Fourier convolutions give LaMa an image-wide receptive field, enabling it to use global context for large-area inpainting. A strong, widely used baseline for irregular mask scenarios.
Partial Convolution: An early architectural innovation (Liu et al., 2018) in which each convolution is computed only over valid (unmasked) pixels and renormalized, with the mask updated layer by layer, so that hole pixels do not contaminate feature extraction. Foundational method in the irregular mask inpainting paradigm.
Stable Diffusion Inpainting: A latent diffusion model fine-tuned specifically for inpainting. Produces high semantic diversity and photorealism for large missing regions. Requires more compute than convolutional methods but often produces more photorealistic completions on large holes.
EdgeConnect: A two-stage inpainting model that first predicts edge maps for the missing region, then generates image content guided by the predicted edges. Strong structural consistency for scenes with clear geometric features like urban architecture.
Context Encoder: The foundational deep learning inpainting model (Pathak et al., 2016) that established the encoder-decoder paradigm for inpainting. Primarily of historical and educational interest at this point but useful as a lower-bound baseline.
Strengths and Limitations
Strengths:
The three-component triplet structure (original + mask + corrupted) makes it immediately ready for supervised training without any preprocessing
Irregular free-form masks reflect real-world inpainting scenarios better than simple rectangular masks
Urban street-view domain provides rich structural content — buildings, roads, signage, pedestrians — that challenges both local texture and global structural inpainting capabilities
Pre-configured train/val splits with annotations.csv metadata support reproducible benchmarking
Open access on Zenodo with a stable DOI ensures long-term availability and citability
12.7 GB size is manageable for research environments without requiring cluster-scale storage
Limitations:
Single-city, single-domain coverage (Paris street-level photography) limits generalization evaluation across diverse visual domains
The dataset is a single version release (v1.0) without the multi-year community benchmark history that MS-COCO or FFHQ carry
No semantic object masks — only irregular free-form masks — limiting its use for object removal evaluation
No high-resolution variants above standard street-view resolution
How to Download and Use
The dataset is freely available for download from Zenodo at https://zenodo.org/records/20233925. Individual files (annotations.csv, image archives) can be downloaded separately or as a complete archive. The total download size is 12.7 GB.
For PyTorch integration, a standard DataLoader can be constructed by reading the train.txt or val.txt split files to obtain sample identifiers, then loading the corresponding triplets from the images/, masks/, and corrupted/ directories. The annotations.csv file provides additional metadata for stratified sampling or difficulty-controlled evaluation.
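A minimal, dependency-free sketch of the split-file lookup is shown below. It assumes that each line of train.txt or val.txt is a filename that exists under all three directories with the same name; that layout is an assumption, and the class name TripletIndex is hypothetical. Image decoding is deliberately left out.

```python
from pathlib import Path

class TripletIndex:
    """Map-style index over (image, mask, corrupted) file paths for one split."""

    def __init__(self, root, split="train"):
        self.root = Path(root)
        lines = (self.root / f"{split}.txt").read_text().splitlines()
        # Skip blank lines and stray whitespace in the split file.
        self.names = [line.strip() for line in lines if line.strip()]

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        paths = {key: self.root / key / name
                 for key in ("images", "masks", "corrupted")}
        missing = [str(p) for p in paths.values() if not p.exists()]
        if missing:
            raise FileNotFoundError(f"missing triplet files: {missing}")
        return paths
```

To use this with a PyTorch DataLoader, a Dataset subclass would follow the same __len__/__getitem__ pattern but decode the three files into tensors inside __getitem__, so that batches can be collated.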
Evaluation Metrics Commonly Used With This Dataset
PSNR — measures pixel-level reconstruction accuracy against the original image
SSIM — measures structural similarity between reconstructed and original images
LPIPS — measures perceptual similarity using deep network features; more aligned with human visual judgment than PSNR/SSIM
FID — measures distributional realism of generated content across the test set
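Of the metrics listed above, PSNR is the simplest to compute by hand: PSNR = 10 · log10(MAX² / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between the two images. A minimal NumPy sketch, assuming 8-bit images (max_value=255) and whole-image evaluation rather than hole-only evaluation:

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

reference = np.full((8, 8, 3), 100, dtype=np.uint8)
reconstruction = reference.copy()
reconstruction[:4] += 2  # error of 2 on half the pixels, so MSE = 2
score = psnr(reference, reconstruction)
```

Converting to float64 before subtracting matters: subtracting uint8 arrays directly would wrap around on negative differences.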
Research Papers That Use or Reference This Dataset
As a May 2026 release, ParisStreetView-RandomMasks is a new addition to the inpainting dataset ecosystem. It builds on the original Paris StreetView dataset, which has been referenced in scene completion and image generation research for over a decade. The new release is designed to support training and evaluation of modern inpainting models, including the Fourier-convolution-based LaMa and diffusion-based systems such as Stable Diffusion Inpainting and BrushNet variants, which dominate the 2025–2026 research landscape.
Final Year Project Angle
ParisStreetView-RandomMasks is an excellent dataset for final year projects in image restoration, scene completion, or generative AI for urban environments. A strong project could train and compare multiple inpainting architectures (LaMa vs. Stable Diffusion Inpainting vs. EdgeConnect) on this dataset, providing a systematic benchmark of modern methods on urban street-level imagery. Another compelling angle is domain adaptation — fine-tuning a model pre-trained on MS-COCO on this urban-specific dataset and measuring whether the domain-specific fine-tuning improves performance on street-scene inpainting tasks. Students interested in data contribution could extend the dataset by generating additional mask types (semantic object masks, text masks) using automated pipelines and contributing them back to the Zenodo record.
Dataset 2: MS-COCO — Microsoft Common Objects in Context
Overview and Origin
MS-COCO (Microsoft Common Objects in Context) is one of the most widely used benchmark datasets in computer vision research, and by extension a common choice in image inpainting research. Originally developed by Microsoft Research and published in 2014 by Lin et al., COCO was designed as a large-scale dataset for object detection, segmentation, and captioning. Its scale, diversity, and high-quality annotations made it a de facto standard for evaluating many computer vision tasks that involve real-world image understanding, including image inpainting.
COCO was not designed specifically for inpainting. Its value for inpainting research comes from its enormous scale (over 328,000 images), its extraordinary visual diversity (80 object categories across thousands of real-world scene types), its high-quality segmentation masks (which can be directly repurposed as inpainting masks for semantic object removal), and the fact that many published inpainting and inpainting-detection methods evaluate on it or on data derived from it. This breadth of use makes COCO one of the most useful benchmark datasets for comparing inpainting methods across publications.
Official Download: https://cocodataset.org/#download
License: Annotations are released under the Creative Commons Attribution 4.0 License; the images are sourced from Flickr and remain subject to the Flickr Terms of Use
Citation: Lin et al., "Microsoft COCO: Common Objects in Context," ECCV 2014
Dataset Statistics
Total images: 328,000+
Total labeled instances: 2.5 million
Object categories: 80
Stuff categories: 91 (background, sky, grass, etc.)
Training set: ~118,000 images
Validation set: ~5,000 images
Test set: ~41,000 images (no public annotations)
Image resolution: Variable; most images are at most 640 pixels on the longer side (commonly 640×480)
Annotation types: Bounding boxes, segmentation masks, keypoints, captions, panoptic labels
Dataset Structure and Annotation Format
COCO data is organized into image directories and JSON annotation files following the COCO API format. The annotation JSON files contain image metadata, object bounding boxes, segmentation polygon coordinates (which can be rendered as binary masks), and caption text. The COCO Python API (pycocotools) provides utilities for loading annotations, rendering masks, and computing standard evaluation metrics.
For inpainting research, the segmentation annotations are the most valuable component. Each annotated instance includes a polygon or run-length encoded (RLE) mask that precisely delineates the object boundary. These segmentation masks can be directly used as inpainting masks — the annotated object region is treated as the missing area to be reconstructed, and the model must fill it in convincingly given the surrounding context.
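In practice, pycocotools handles this conversion (its COCO.annToMask method returns a binary mask for an annotation). To show what the conversion involves, the sketch below rasterizes a single COCO-style polygon with plain NumPy using the even-odd rule. The function name and the pixel-centre sampling convention are assumptions, so edge pixels may differ slightly from the pycocotools output.

```python
import numpy as np

def polygon_to_mask(polygon, height, width):
    """Rasterize one flat polygon [x1, y1, x2, y2, ...] into a binary mask (1 = inside)."""
    pts = np.asarray(polygon, dtype=np.float64).reshape(-1, 2)
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = xs + 0.5, ys + 0.5              # sample each pixel at its centre
    inside = np.zeros((height, width), dtype=bool)
    # Walk every edge (last vertex connects back to the first).
    for (x0, y0), (x1, y1) in zip(pts, np.roll(pts, -1, axis=0)):
        if y0 == y1:
            continue                          # horizontal edges never cross the ray
        crosses = (y0 > py) != (y1 > py)      # edge spans this pixel row
        x_at_py = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
        inside ^= crosses & (px < x_at_py)    # toggle on each crossing to the right
    return inside.astype(np.uint8)

# A 4x4 square object inside an 8x8 image.
object_mask = polygon_to_mask([2, 2, 6, 2, 6, 6, 2, 6], height=8, width=8)
```

The resulting array follows the inpainting convention used throughout this article: 1 marks the object region to be removed and reconstructed.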
How COCO Is Adapted for Inpainting Tasks
Several strategies are used to adapt COCO for inpainting research:
Object removal inpainting: A specific object category (person, car, animal) is selected, its segmentation mask is used as the inpainting mask, and the model is trained or evaluated on filling in the removed object's region with plausible background content. This directly simulates real-world object removal applications.
Random mask inpainting: Synthetic irregular masks (generated independently using random stroke simulation) are overlaid on COCO images regardless of semantic content. The model is trained on a combination of COCO's visual diversity and programmatically generated mask diversity. This is the most common approach in general-purpose inpainting benchmarks.
Text-guided inpainting evaluation: COCO's image captions are used as text prompts for text-guided inpainting models (e.g., Stable Diffusion Inpainting, BrushNet, PowerPaint), where the model is asked to fill a masked region with content described by the text prompt. COCO's diverse captions make this a comprehensive text-image alignment benchmark.
COCO-derived inpainting datasets: Multiple purpose-built inpainting datasets are constructed from COCO as their source. COCOGlide uses COCO validation images inpainted with GLIDE. COCO-Inpaint (2025) builds a comprehensive inpainting detection benchmark from COCO with multiple modern inpainting models. SAGI uses COCO as one of three source datasets for its 95,839-image inpainting detection benchmark.
Mask Types Used With COCO
COCO supports the full spectrum of mask types used in inpainting research. Object segmentation masks from COCO annotations provide semantic, object-shaped masks for object removal evaluation. Random irregular masks generated synthetically and overlaid on COCO images provide the irregular mask benchmark used by most general-purpose inpainting papers. Center masks applied to COCO images provide context understanding evaluation. Text-based region masks guided by COCO captions enable text-guided inpainting evaluation.
Strengths and Limitations
Strengths:
Widely used benchmark: many inpainting and inpainting-detection papers evaluate on COCO or COCO-derived data, enabling comparison across the literature
Exceptional visual diversity across 80 object categories and thousands of scene types
High-quality semantic segmentation annotations directly usable as inpainting masks for object removal tasks
Large scale (118K training images) sufficient for training large diffusion and GAN models
Active maintenance, multiple versions, and a robust ecosystem of tools (pycocotools, HuggingFace datasets integration)
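Annotations released under Creative Commons Attribution 4.0; image use is governed by the Flickr Terms of Use, so check individual image licenses before commercial use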
Limitations:
Low image resolution: most COCO images are around 640×480 or smaller, which is insufficient for high-resolution inpainting research
Not designed for inpainting — requires additional preprocessing to generate corrupted images and masks
Object category distribution is uneven — some categories (person) are heavily over-represented, which can bias model training and evaluation
Background content in segmented images is sometimes too simple (plain walls, open sky) to provide meaningful inpainting evaluation challenge
How to Download and Use
COCO is available for direct download at https://cocodataset.org/#download. The dataset is split into separate archives for images (train2017, val2017, test2017) and annotations (instances, captions, keypoints). The pycocotools Python library provides the official API for loading and working with COCO annotations. Community-maintained copies of COCO are also hosted on the Hugging Face Hub and can be loaded through the datasets library.
Evaluation Metrics Commonly Used With This Dataset
FID (Fréchet Inception Distance) — standard realism metric for generated content on COCO
PSNR and SSIM — pixel and structural reconstruction quality
LPIPS — perceptual quality using VGG or AlexNet features
CLIP Score — text-image alignment for text-guided inpainting on COCO captions
mIoU — for inpainting detection tasks measuring mask localization accuracy
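For reference, FID fits a Gaussian to Inception network features of the real images and another to features of the generated images, then measures the Fréchet distance between the two. In the formula below, the symbols are notation chosen here: \mu_r, \Sigma_r are the feature mean and covariance of the real set, and \mu_g, \Sigma_g those of the generated set.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower is better, and the value is zero only when the two fitted Gaussians coincide. Because the statistics are estimated from samples, FID values are comparable only when computed on the same number of images with the same feature extractor.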
Research Papers That Use This Dataset
MS-COCO is referenced across a large share of the inpainting literature of the past decade. Papers commonly cited alongside COCO evaluations include LaMa (Resolution-robust Large Mask inpainting), MAT (Mask-Aware Transformer), DeepFill v2, BrushNet, PowerPaint, Stable Diffusion Inpainting, DALL-E 2 inpainting evaluation, and the comprehensive COCO-Inpaint detection benchmark (ACM Multimedia 2025). Note that several of these, including LaMa, MAT, and DeepFill v2, report their main results on Places2 and CelebA-HQ rather than COCO, so check each paper's evaluation protocol before comparing numbers. Additionally, the SAGI (Semantically Aligned and Uncertainty Guided AI Image Inpainting) dataset uses COCO as one of three source datasets for its 95,839-image inpainting detection collection.
Final Year Project Angle
MS-COCO is the right dataset for any final year project that needs to be directly comparable with published state-of-the-art results. A project evaluating a novel inpainting architecture modification on COCO produces numbers that reviewers and readers can immediately contextualize against the existing literature. Strong project angles include: semantic object removal and background inpainting using COCO segmentation masks as a real-world object removal pipeline; text-guided inpainting evaluation using COCO captions as prompts for diffusion-based models; or an inpainting detection study building on the COCO-Inpaint benchmark to evaluate whether modern inpainting artifacts are detectable by existing forensics models. COCO's scale also makes it suitable for studying the effect of training data volume on inpainting quality — a systematic study training models on 10%, 25%, 50%, and 100% of COCO training data and measuring quality degradation at each scale.
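For the data-volume study, one reasonable design (an assumption here, not a prescribed protocol) is to make the subsets nested, so that the 10% subset is contained in the 25% subset and so on. Differences between runs then reflect added data rather than a different random draw. The function name and seed below are hypothetical.

```python
import random

def nested_subsets(image_ids, fractions=(0.10, 0.25, 0.50, 1.00), seed=0):
    """Return {fraction: ids} where each smaller subset is a prefix of the larger ones."""
    shuffled = sorted(image_ids)               # fixed starting order for reproducibility
    random.Random(seed).shuffle(shuffled)      # one shuffle shared by every subset
    return {frac: shuffled[: round(len(shuffled) * frac)] for frac in fractions}

# Stand-in ids; in a real study these would be COCO train2017 image ids.
subsets = nested_subsets(range(1000))
```

Saving the chosen ids alongside the results makes the experiment reproducible by others.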
Dataset 3: Inpaint32K — High-Quality Multi-Method Inpainting Benchmark Dataset
Overview and Origin
Inpaint32K is a purpose-built image inpainting dataset released in 2024, designed specifically to support research in image inpainting detection and localization — the forensic task of identifying which regions of an image have been artificially inpainted and by which method. Unlike MS-COCO and FFHQ which are used primarily for training and evaluating inpainting generation quality, Inpaint32K is constructed to serve as a comprehensive benchmark for inpainting detection research — a growing field driven by concerns about AI-generated image manipulation and digital forensics applications.
The dataset was constructed with careful attention to methodological diversity. Rather than using a single inpainting algorithm to generate all tampered images, Inpaint32K spans four distinct categories of inpainting technology — traditional methods, CNN-based deep learning methods, GAN-based methods, and diffusion model-based methods — with 8,000 images per category. This design reflects the real-world forensic challenge that inpainting detection systems must contend with: the same scene may be manipulated using any of dozens of different inpainting tools, and a detection system trained only on GAN artifacts will fail against diffusion model outputs and vice versa.
Citation: Hao, 2024 — "Inpaint32K: A High-Quality Dataset for Image Inpainting Detection"
Access: Publicly accessible for research purposes — referenced in InpDiffusion (arXiv:2501.02816) and related forensics papers
Dataset Statistics
Total tampered images: 32,000
Images per technique category: 8,000
Inpainting technique categories: 4 (traditional, CNN-based, GAN-based, diffusion model-based)
Tampering types: 3 (replacement, filling, removal)
Design goal: Comprehensive coverage of real-world inpainting manipulation scenarios
Quality standard: High-quality — carefully crafted to reflect real-world manipulation challenges
Dataset Structure
Inpaint32K organizes its 32,000 tampered images by inpainting technique category, enabling controlled experiments that measure detection performance against specific methods. Each image is paired with its original unmanipulated version and a ground-truth binary localization mask indicating exactly which pixels were inpainted. This three-component structure — original, tampered, mask — supports both binary classification (inpainted vs. not inpainted) and pixel-level localization (which specific regions were inpainted) evaluation protocols.
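Pixel-level localization is typically scored by comparing a predicted binary mask against the ground-truth mask. The sketch below computes IoU and F1 for one image; the function name and the choice to score an empty prediction against an empty ground truth as 1.0 are assumptions made for illustration.

```python
import numpy as np

def localization_scores(pred_mask, true_mask):
    """IoU and F1 between two binary masks (nonzero = inpainted)."""
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    tp = np.logical_and(pred, true).sum()     # inpainted pixels correctly flagged
    fp = np.logical_and(pred, ~true).sum()    # authentic pixels wrongly flagged
    fn = np.logical_and(~pred, true).sum()    # inpainted pixels missed
    union = tp + fp + fn
    iou = tp / union if union else 1.0        # both masks empty: treat as perfect
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    return {"iou": float(iou), "f1": float(f1)}

true_mask = np.zeros((4, 4), dtype=np.uint8)
true_mask[:2, :] = 1      # rows 0-1 are inpainted
pred_mask = np.zeros((4, 4), dtype=np.uint8)
pred_mask[1:3, :] = 1     # detector flags rows 1-2
scores = localization_scores(pred_mask, true_mask)
```

Averaging the per-image IoU over a test set gives the mean IoU figure reported in detection papers.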
The three tampering type categories (replacement, filling, removal) represent distinct semantic inpainting scenarios. Replacement inpainting changes the content of a region while preserving the overall image structure — for example, changing the text on a sign or the expression on a face. Filling inpainting adds new content to a previously empty or simple region — for example, adding an object to a plain background. Removal inpainting erases an existing object and fills the region with background content — the most common consumer application of inpainting technology.
Four Inpainting Technique Categories Covered
Traditional Methods: Non-learning approaches including patch-based synthesis (exemplar-based inpainting), PDE-based diffusion propagation (unrelated to the diffusion generative models in the fourth category), and texture synthesis methods. These methods produce characteristic artifacts including visible seams, repeated texture patterns, and structural inconsistencies at region boundaries. Traditional method detection is generally easier for modern forensic models.
CNN-Based Methods: Early deep learning inpainting using convolutional encoder-decoder architectures trained on paired (corrupted, original) image pairs. Key methods in this category include Context Encoder, Partial Convolution, and DeepFill v1. CNN artifacts include blurring at region boundaries, checkerboard artifacts from transposed convolutions, and color inconsistencies.
GAN-Based Methods: Inpainting models using adversarial training to produce photorealistic completions. Key methods include DeepFill v2, EdgeConnect, MAT, and Co-Modulated GAN. GAN artifacts are harder to detect than CNN artifacts — hallucinated textures may be locally plausible but globally inconsistent, and adversarial training explicitly optimizes against discriminator-based detection.
Diffusion Model-Based Methods: The most challenging category for detection. Methods including Stable Diffusion Inpainting, SDXL Inpainting, DALL-E 2 Inpainting, BrushNet, and PowerPaint produce extremely high-quality completions that are often indistinguishable from authentic image regions to human observers. The forensic challenge is that diffusion models generate diverse, semantically coherent content that leaves few systematic artifacts.
Three Tampering Types Explained
The replacement tampering type covers scenarios where existing image content is replaced with different content of the same semantic category — an important real-world manipulation scenario in disinformation, evidence tampering, and document forgery contexts. Replacement manipulation is particularly challenging to detect because the inpainted region may be semantically consistent with its surroundings even though the content has been changed.
The filling tampering type covers scenarios where empty, damaged, or missing regions are filled with new content. This is the classic image restoration application — filling in damaged areas of historical photographs, restoring corrupted image regions, or completing partially occluded objects. Filling manipulations are somewhat easier to detect because the boundary between original and generated content often crosses structural image features.
The removal tampering type covers scenarios where an existing object is erased and the region is filled with background content. This is currently the most popular consumer application of image inpainting — the "remove object" feature in Google Photos, Samsung's Object Eraser, and Apple's Clean Up tool all implement removal inpainting. Removal artifacts can be detected by structural inconsistencies (shadows without objects, reflections without sources) or by statistical distribution differences between generated background and authentic background regions.
Strengths and Limitations
Strengths:
The only large-scale dataset specifically designed for inpainting detection research spanning all four major inpainting technology categories
Balanced design with 8,000 images per technique category prevents bias toward any single method
Three tampering type categories reflect diverse real-world manipulation scenarios
High-quality construction with carefully crafted images reflecting real-world challenges
Pixel-level ground truth masks enable localization evaluation beyond binary detection
Limitations:
Focused exclusively on inpainting detection — not suitable for training generation quality models
No public stable download link confirmed at time of writing — access via research community channels or paper supplementary materials
Newer diffusion models released after the dataset construction may not be represented in the diffusion category
Limited documentation on the specific methods used within each category
How to Download and Use
Inpaint32K is referenced in the InpDiffusion paper (arXiv:2501.02816) and related forensics research as publicly accessible. Contact the authors via the paper correspondence or search for the dataset on GitHub/HuggingFace using the dataset name. The dataset is designed for standard binary classification and segmentation evaluation pipelines — load paired (original, tampered, mask) triplets and evaluate detection models using standard forensic metrics.
Evaluation Metrics Commonly Used With This Dataset
AUC (Area Under ROC Curve) — standard binary classification metric for detection tasks
F1 Score — harmonic mean of precision and recall for tampered region detection
IoU (Intersection over Union) — pixel-level localization accuracy
mAP (mean Average Precision) — detection performance across multiple IoU thresholds
Pixel-level Accuracy — fraction of pixels correctly classified as inpainted or authentic
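A minimal sketch of how the metrics above could be computed with NumPy, assuming a detector that outputs a per-pixel tamper probability map and a per-image score. The function names `localization_metrics` and `detection_auc` and the 0.5 threshold are illustrative choices, not part of any Inpaint32K tooling; mAP is omitted for brevity.

```python
import numpy as np

def localization_metrics(pred_prob, gt_mask, threshold=0.5):
    """Pixel-level F1, IoU, and accuracy for one predicted tamper map.

    pred_prob: float array (H, W) of tamper probabilities.
    gt_mask:   array (H, W), nonzero where the pixel was inpainted.
    """
    pred = pred_prob >= threshold
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    eps = 1e-8  # guards against division by zero on empty masks
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "f1": 2 * precision * recall / (precision + recall + eps),
        "iou": tp / (tp + fp + fn + eps),
        "pixel_acc": (tp + tn) / pred.size,
    }

def detection_auc(scores, labels):
    """Image-level AUC: probability that a tampered image receives a
    higher score than an authentic one (ties count as half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of images, which is fine for a quick check; for a full benchmark run a library implementation would be the more practical choice.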
Research Papers That Use This Dataset
Inpaint32K is used as a primary evaluation benchmark in InpDiffusion (Conditional Diffusion Models for Image Inpainting Localization, arXiv:2501.02816), which proposes a diffusion-based model for detecting and localizing inpainted regions. It is also referenced alongside the DID dataset and AutoSplice dataset in comparative evaluations of inpainting forensics methods. The dataset fills a gap left by earlier forensics datasets such as DID (10 methods, 1,000 images each) and NIST16 (564 images, multiple manipulation types), which predate modern GAN and diffusion inpainting methods.
Final Year Project Angle
Inpaint32K is the ideal dataset for final year projects at the intersection of computer vision and digital forensics, AI safety, or media integrity. A compelling project could train an inpainting detection model that generalizes across all four technique categories — testing whether a detector trained on CNN and GAN artifacts can detect diffusion model artifacts without retraining (a critical real-world requirement given the rapid pace of inpainting model development). Another strong angle is a comparative study of detection difficulty across the four categories — quantifying exactly how much harder diffusion model inpainting is to detect than traditional method inpainting, and identifying which visual features most reliably distinguish each category. Students interested in the societal implications of AI could frame a project around the "inpainting arms race" — evaluating whether improvements in inpainting quality inevitably outpace improvements in detection capability.
Dataset 4: FFHQ — Flickr-Faces-HQ
Overview and Origin
Flickr-Faces-HQ (FFHQ) is a high-quality human face image dataset created by NVIDIA Research and released alongside the StyleGAN paper (Karras et al., 2019). Originally designed as a training dataset for generative adversarial network-based face synthesis research, FFHQ rapidly became the standard benchmark dataset for all tasks involving human face image generation, restoration, and inpainting. Its combination of high resolution (1,024×1,024 pixels), large scale (70,000 images), and exceptional visual diversity has made it the definitive face dataset for deep learning research.
FFHQ was assembled by crawling Flickr for photographs licensed under permissive Creative Commons licenses and filtering for images containing detectable human faces. The images underwent automatic alignment and cropping to center and standardize face positions across the dataset. The resulting collection spans an extraordinary range of human diversity — covering multiple ethnicities, age groups from infants to elderly individuals, genders, face shapes, skin tones, and accessories including glasses, hats, masks, and jewelry. Lighting conditions range from professional studio photography to natural outdoor illumination to challenging artificial light sources.
Official Download: https://github.com/NVlabs/ffhq-dataset
License: Creative Commons BY-NC-SA 4.0 (non-commercial research use)
Citation: Karras et al., "A Style-Based Generator Architecture for Generative Adversarial Networks," CVPR 2019
Dataset Statistics
Total images: 70,000 high-quality PNG face photographs
Resolution: 1,024×1,024 pixels (also available in 128×128 and 256×256 downsampled versions)
Format: PNG (lossless) and TFRECORDS versions available
Face alignment: All faces automatically aligned and cropped to standard position
Visual diversity: Multiple ethnicities, ages, genders, accessories, and lighting conditions
Source: Flickr photographs under Creative Commons licenses
Standard split: 60,000 training / 10,000 validation
Dataset Structure
FFHQ images are organized into subdirectories by index range, with each subdirectory containing 1,000 PNG images. The GitHub repository provides download scripts for both the full 1,024×1,024 resolution dataset and the downsampled variants. Metadata JSON files provide per-image information including the original Flickr source URL, license type, face detection bounding box coordinates, and image quality indicators.
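The index-range layout described above can be turned into a small path helper. This is a sketch under the assumption that a local copy keeps zero-padded five-digit folder and file names (for example `12000/12345.png`); the function names are hypothetical, and the split helper simply encodes the 60,000 / 10,000 convention listed in the statistics.

```python
from pathlib import Path

def ffhq_image_path(root, index):
    """Path of image `index` under a 1,000-images-per-folder layout.

    Assumes folders named after the first index they contain
    (00000, 01000, ..., 69000) and files named by zero-padded index.
    """
    if not 0 <= index < 70_000:
        raise ValueError("FFHQ indices run from 0 to 69999")
    folder = f"{(index // 1000) * 1000:05d}"
    return Path(root) / folder / f"{index:05d}.png"

def ffhq_split(index):
    """Standard split: first 60,000 images train, last 10,000 validation."""
    return "train" if index < 60_000 else "val"
```

For example, `ffhq_image_path("images1024x1024", 12345)` resolves to `images1024x1024/12000/12345.png`.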
For inpainting research, FFHQ images are used directly as the original ground truth targets. Masks are generated synthetically and applied to the aligned face images to create corrupted inputs. The face alignment ensures that masks applied to a consistent facial grid — for example, an eye region mask or mouth region mask — consistently target the same semantic face regions across all images in the dataset, enabling semantically meaningful face component inpainting experiments.
Why Face Data Needs a Dedicated Inpainting Dataset
Human faces present unique challenges for image inpainting that general-purpose datasets like MS-COCO cannot adequately address. The human visual system is extraordinarily sensitive to face appearance — far more sensitive than to most other object categories. Slight errors in facial geometry, skin texture, eye symmetry, or expression are immediately perceptible to human observers, even when equivalent errors in background textures or objects would pass unnoticed. This high perceptual sensitivity means that face inpainting requires a fundamentally different quality standard than general image inpainting.
Faces also have strong structural priors: the spatial relationship between eyes, nose, mouth, and jaw follows tight statistical constraints, and human observers have a lifetime of visual experience judging them. A model that fails to respect these constraints produces uncanny valley effects that are immediately noticeable. General inpainting models trained on diverse image collections often fail on faces because faces make up only a small fraction of their training data, so the structural constraints are not sufficiently reinforced.
FFHQ provides the scale, resolution, and diversity needed to train face-specific inpainting models that respect these structural constraints and produce reconstructions that pass human perceptual scrutiny. Models trained on FFHQ develop robust face priors — understanding of facial geometry, skin texture variation, lighting interaction with facial features, and the spatial relationships between face components — that generalize reliably across the diverse face appearances seen in real-world applications.
Mask Strategies for Face Inpainting
Face inpainting research uses several distinct mask strategies depending on the application:
Eye region masks cover one or both eyes along with the surrounding region and are used for eye inpainting and restoration tasks. They are challenging because the model must reconstruct the precise geometry and appearance of a specific person's eyes from the visible face context.
Mouth and lower face masks simulate face mask removal scenarios (relevant post-COVID) and are used for facial expression completion tasks. The model must infer the hidden lower face from the visible upper face and hair context.
Irregular face region masks simulate damage to face photographs, such as scratches, tears, and water damage, and are the primary benchmark for face photo restoration applications. They use the same irregular free-form stroke generation methodology as general inpainting datasets.
Large center masks covering 25%–50% of the face test global face structure understanding: the model must reconstruct a large contiguous face region from only the peripheral face and hair context. This is among the hardest face inpainting tasks.
Accessory removal masks cover glasses, hats, or other accessories following their semantic boundaries, simulating accessory removal from face photographs.
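Because FFHQ faces are aligned, the region and center masks above can be approximated with fixed boxes on the image grid. The sketch below assumes a mask convention of 1 = missing pixel; the box fractions in `REGION_BOXES` are illustrative guesses for an aligned crop, not published FFHQ landmark coordinates, and all names are hypothetical.

```python
import numpy as np

# Hypothetical region boxes as fractions of image size:
# (top, bottom, left, right). Tune these against real aligned faces.
REGION_BOXES = {
    "eyes": (0.35, 0.50, 0.20, 0.80),
    "mouth": (0.65, 0.85, 0.30, 0.70),
}

def region_mask(size, region):
    """Rectangular mask over a named face region (1 = missing)."""
    top, bottom, left, right = REGION_BOXES[region]
    mask = np.zeros((size, size), dtype=np.uint8)
    mask[int(top * size):int(bottom * size),
         int(left * size):int(right * size)] = 1
    return mask

def center_mask(size, area_fraction):
    """Centered square hole covering `area_fraction` of the image."""
    side = int(round(size * np.sqrt(area_fraction)))
    start = (size - side) // 2
    mask = np.zeros((size, size), dtype=np.uint8)
    mask[start:start + side, start:start + side] = 1
    return mask

def apply_mask(image, mask):
    """Zero out masked pixels of an (H, W, C) image."""
    return image * (1 - mask)[..., None]
```

Accessory removal masks would need a segmentation model or manual annotation rather than fixed boxes, so they are not covered here.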
Strengths and Limitations
Strengths:
1,024×1,024 resolution is the highest among standard face inpainting benchmarks, enabling evaluation of fine-grained detail reconstruction
70,000 images provides sufficient scale for training large face-specialized generative models
Exceptional visual diversity ensures models trained on FFHQ generalize across human appearance variation
Face alignment standardization enables semantically consistent mask application across the dataset
Universal benchmark status in face synthesis and restoration research enables direct comparison with published results
Lossless PNG format preserves full image quality for high-fidelity reconstruction evaluation
Limitations:
Non-commercial license (CC BY-NC-SA 4.0) restricts use in commercial applications
Single domain — human faces only — makes FFHQ unsuitable as a standalone dataset for general inpainting evaluation
Face alignment and cropping means all images have a similar composition, which may make some inpainting challenges artificially easy (the model always knows where the face is) or artificially hard (no background context variety)
Privacy considerations around face datasets are increasingly significant — some research communities are moving toward synthetic face datasets to avoid privacy concerns
How to Download and Use
FFHQ is available for download from the official GitHub repository at https://github.com/NVlabs/ffhq-dataset. The repository provides Python download scripts that fetch images directly from Google Drive or from individual Flickr sources. The full 1,024×1,024 dataset is approximately 89 GB. The 128×128 thumbnails (about 1.95 GB per the repository listing) and 256×256 downsampled copies (about 5.4 GB) are available for experiments with limited storage or compute. HuggingFace Datasets also hosts FFHQ with a simple loading interface.
Evaluation Metrics Commonly Used With This Dataset
FID: standard Inception-feature FID computed against the FFHQ validation set distribution
PSNR: pixel reconstruction accuracy for masked region evaluation
SSIM: structural similarity with emphasis on facial geometry preservation
LPIPS: perceptual similarity computed from pretrained deep network features (typically VGG or AlexNet)
Identity Similarity: uses a face recognition model to measure whether the reconstructed face preserves the identity of the original, which is specifically relevant for face restoration applications
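Identity similarity is commonly reported as the cosine similarity between face recognition embeddings of the original and reconstructed images. The sketch below shows only that final comparison step; it assumes the embeddings were already produced by some face recognition model, which is deliberately left unspecified, and the function name is hypothetical.

```python
import numpy as np

def identity_similarity(emb_original, emb_reconstructed):
    """Cosine similarity between two face embeddings.

    Values near 1.0 suggest the reconstruction preserved identity;
    the embedding model itself is an external assumption.
    """
    a = np.asarray(emb_original, dtype=float)
    b = np.asarray(emb_reconstructed, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

What counts as "same identity" depends on the recognition model, so any decision threshold should be calibrated on that model rather than copied from another paper.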
Research Papers That Use This Dataset
FFHQ is used as a primary benchmark in virtually every face image synthesis and face inpainting paper. Key inpainting papers using FFHQ include: MAT (Mask-Aware Transformer for Large Hole Image Inpainting), Co-Modulated GAN, LaMa (face experiments), Stable Diffusion face inpainting evaluations, and probabilistic inpainting frameworks including the diverse inference paper referenced in the Frontiers AI 2025 survey. StyleGAN, StyleGAN2, and StyleGAN3 — all trained on FFHQ — serve as generator backbones in multiple inpainting frameworks that leverage StyleGAN's face prior for reconstruction.
Final Year Project Angle
FFHQ is the ideal dataset for final year projects involving face restoration, face editing, identity-preserving inpainting, or privacy-aware face manipulation detection. A compelling project could build a face occlusion removal system — training a model to reconstruct faces partially covered by physical objects (hands, masks, glasses) using FFHQ with semantic occlusion masks. Identity-preserving face inpainting is another strong angle — training a model that reconstructs missing face regions while maintaining the identity of the original person as measured by face recognition metrics. Students in security and privacy could build an inpainting detection system specifically tuned for FFHQ-style face manipulations, evaluating whether face-specific forensic features (eye geometry consistency, skin texture statistics, illumination coherence) outperform general inpainting detection methods on face manipulation detection tasks.
Dataset 5: DIV2K — Diverse 2K Resolution Image Dataset
Overview and Origin
DIV2K (Diverse 2K) is a high-resolution image dataset originally created for the NTIRE (New Trends in Image Restoration and Enhancement) Challenge, a workshop associated with CVPR that focuses on image super-resolution, denoising, and restoration tasks. Published by Timofte et al. in 2017, DIV2K was designed to address a critical limitation of earlier image restoration benchmarks — their use of low-resolution or heavily compressed source images that limited the evaluation of fine-grained texture reconstruction.
The "2K" in DIV2K refers to the dataset's defining characteristic: all images are at 2K resolution (approximately 2,040×1,404 pixels on average, with many images exceeding 2,000 pixels on the longer dimension). This high resolution makes DIV2K uniquely valuable for evaluating inpainting models on tasks that require fine texture detail reconstruction — where the difference between a mediocre and an excellent inpainting result is visible only at full resolution, not in downsampled previews.
DIV2K's visual diversity is exceptional for its size. The 1,000 images span natural landscapes, cityscapes, architectural photography, portraits, wildlife, food photography, abstract textures, and cultural artifacts. The curators deliberately selected images with high visual complexity — rich textures, detailed structures, and diverse color distributions — specifically to challenge image restoration models that tend to produce overly smooth reconstructions when operating on complex fine-grained content.
Official Download: https://data.vision.ee.ethz.ch/cvl/DIV2K/
License: Research and educational use
Citation: Timofte et al., "NTIRE 2017 Challenge on Single Image Super-Resolution," CVPRW 2017
Dataset Statistics
Total images: 1,000 (800 training + 100 validation + 100 test)
Resolution: 2K — approximately 2,040×1,404 average (many images larger)
Visual diversity: Natural scenes, cityscapes, architecture, portraits, wildlife, textures, food, cultural subjects
Format: PNG (lossless, no compression artifacts)
Degradation tracks: Multiple (bicubic downscaling, realistic degradation, blur, noise, compression)
NTIRE Challenge versions: 2017, 2018, 2019, 2020, 2021 with different degradation tracks per year
Dataset Structure
DIV2K is organized into high-resolution (HR) source images and low-resolution (LR) degraded counterparts generated using different degradation tracks for super-resolution research. For inpainting research, the HR images serve as the ground truth targets, and synthetic inpainting masks are applied to generate corrupted inputs. The 800/100/100 train/validation/test split is standard and widely used in the literature.
Multiple degradation variants of the dataset exist: bicubic downscaling (the most common), mild and wild realistic degradation including blur, noise, and JPEG compression, and unknown degradation for blind restoration evaluation. For inpainting specifically, the clean HR images are used as ground truth without applying the super-resolution degradation tracks.
Why High Resolution Matters for Inpainting
The importance of high-resolution data for inpainting research is not immediately obvious but is significant in practice. When an inpainting model is evaluated at 256×256 resolution (the native resolution of many earlier benchmark datasets), even moderate blurring or texture smoothing in the reconstructed region may not be detectable by standard metrics. At 2K resolution, the same quality difference is clearly visible and measurable — fine-grained texture inconsistencies, frequency domain artifacts, and structural detail loss are all exposed at high resolution in ways that low-resolution evaluation cannot reveal.
This matters for practical applications. Real-world inpainting use cases — professional photography restoration, medical image enhancement, satellite imagery completion, product photography editing — require high-resolution outputs. A model that achieves strong PSNR scores on 256×256 COCO images may produce noticeably blurry or artifact-laden results when applied to 2K resolution inputs. DIV2K provides the resolution benchmark needed to evaluate and develop models that genuinely perform well at the resolutions required by practical applications.
The texture richness of DIV2K images amplifies this effect further. A complex natural texture — forest foliage, water reflections, architectural stonework, fabric patterns — requires the model to generate fine-grained stochastic patterns that are statistically consistent with the surrounding texture while being geometrically seamless at region boundaries. This is qualitatively harder than filling in smooth backgrounds or simple object surfaces, and DIV2K's deliberate selection of visually complex images ensures that inpainting models are challenged on this dimension specifically.
Degradation Types and Mask Strategies
For inpainting research, DIV2K images are used with synthetically generated masks rather than the super-resolution degradation tracks. The mask strategies used with DIV2K typically include irregular free-form masks (consistent with the general inpainting benchmark approach), large center masks (to evaluate global context understanding at high resolution), and texture-aware masks that specifically target high-frequency texture regions identified by edge detection or frequency analysis.
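Two of the mask strategies above, sketched with NumPy. `free_form_mask` draws random-walk brush strokes, and `texture_aware_mask` selects the image blocks with the highest mean gradient magnitude. Both are simplified stand-ins under assumed parameters (stroke count, brush size, block size), not the mask generator of any specific paper; masks use 1 = missing.

```python
import numpy as np

def free_form_mask(height, width, strokes=4, max_steps=40, brush=12, seed=0):
    """Irregular mask built from random-walk square brush strokes."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((height, width), dtype=np.uint8)
    r = brush // 2
    for _ in range(strokes):
        y, x = int(rng.integers(0, height)), int(rng.integers(0, width))
        for _ in range(int(rng.integers(10, max_steps))):
            # Stamp the brush, then take a random step inside the image.
            mask[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1] = 1
            y = int(np.clip(y + rng.integers(-brush, brush + 1), 0, height - 1))
            x = int(np.clip(x + rng.integers(-brush, brush + 1), 0, width - 1))
    return mask

def texture_aware_mask(gray, block=32, top_fraction=0.1):
    """Mask the blocks whose mean gradient magnitude is highest."""
    gy, gx = np.gradient(gray.astype(float))
    energy = np.hypot(gx, gy)
    h, w = gray.shape
    bh, bw = h // block, w // block
    # Average gradient energy per block (ragged edges are ignored).
    block_energy = (energy[:bh * block, :bw * block]
                    .reshape(bh, block, bw, block).mean(axis=(1, 3)))
    k = max(1, int(round(top_fraction * bh * bw)))
    threshold = np.sort(block_energy.ravel())[-k]
    selected = block_energy >= threshold  # ties may select more than k
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[:bh * block, :bw * block] = np.kron(
        selected, np.ones((block, block))).astype(np.uint8)
    return mask
```

Fixing the `seed` and publishing the generator parameters keeps masks reproducible, which helps with the cross-paper variability that arises when mask protocols differ.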
Some researchers have combined DIV2K with mixed degradation — applying both inpainting masks and realistic degradations (blur, noise, compression) simultaneously to evaluate models on the combined restoration task of inpainting under degraded conditions. This multi-task restoration scenario is particularly relevant for historical photograph restoration applications where images may be both damaged (requiring inpainting) and degraded (requiring denoising and sharpening).
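The combined setting can be simulated by degrading an image first and masking it afterwards. This is a rough sketch assuming a single-channel float image in [0, 255], a box blur as a stand-in for realistic blur, and additive Gaussian noise; the function name and default parameters are illustrative only.

```python
import numpy as np

def degrade_and_mask(image, mask, noise_sigma=10.0, blur_radius=1, seed=0):
    """Box-blur, add Gaussian noise, then zero out the masked region.

    image: float array (H, W) in [0, 255]; mask: 1 where pixels are missing.
    """
    rng = np.random.default_rng(seed)
    k = 2 * blur_radius + 1
    h, w = image.shape
    padded = np.pad(image.astype(float), blur_radius, mode="edge")
    blurred = np.zeros((h, w), dtype=float)
    # Sum the k*k shifted copies of the image, then average.
    for dy in range(k):
        for dx in range(k):
            blurred += padded[dy:dy + h, dx:dx + w]
    blurred /= k * k
    noisy = np.clip(blurred + rng.normal(0.0, noise_sigma, (h, w)), 0, 255)
    return noisy * (1 - mask)
```

The training target in this setup is still the clean high-resolution image, so the model has to denoise and deblur the visible pixels while filling the hole.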
Strengths and Limitations
Strengths:
The only standard inpainting benchmark at 2K resolution — essential for evaluating high-resolution reconstruction quality
Lossless PNG format eliminates compression artifacts that would confound inpainting quality evaluation
Exceptional texture diversity specifically selected to challenge fine-grained reconstruction models
Well-established benchmark with NTIRE challenge history dating back to 2017 and thousands of citing papers
Reasonable dataset size (1,000 images total) makes it computationally tractable for evaluation even without GPU clusters
Limitations:
Only 800 training images — insufficient for training large inpainting models from scratch without augmentation or pre-training on larger datasets
No built-in inpainting masks — must be generated separately, which can introduce variability across papers if mask generation protocols differ
No semantic annotations — cannot be used for object removal or semantically guided inpainting evaluation
The original NTIRE super-resolution focus means many papers use DIV2K only for combined super-resolution + inpainting experiments, limiting the number of pure inpainting baselines available for comparison
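One common way to work around the small training set is to sample many random patches from each 2K image. The sketch below assumes images are already loaded as NumPy arrays; the patch size, patch count, and random horizontal flip are illustrative choices rather than a protocol prescribed by DIV2K.

```python
import numpy as np

def random_patches(image, patch=256, count=8, seed=0):
    """Sample `count` random square crops, each flipped with probability 0.5."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    patches = []
    for _ in range(count):
        top = int(rng.integers(0, h - patch + 1))
        left = int(rng.integers(0, w - patch + 1))
        crop = image[top:top + patch, left:left + patch]
        if rng.random() < 0.5:
            crop = crop[:, ::-1]  # horizontal flip
        patches.append(crop)
    return np.stack(patches)
```

Patches should be drawn from the 800 training images only, so that validation and test images stay unseen.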
How to Download and Use
DIV2K is available for direct download from the ETH Zürich Computer Vision Lab at https://data.vision.ee.ethz.ch/cvl/DIV2K/. The dataset is organized by degradation track — download the high-resolution training and validation image archives for inpainting research. Individual archives are 3–5 GB each. The dataset is also available on HuggingFace Datasets. No registration or license agreement is required for research use.
Evaluation Metrics Commonly Used With This Dataset
PSNR: the primary metric for DIV2K, inherited from its super-resolution benchmark origins. Measured in dB; higher is better.
SSIM: structural similarity measurement particularly sensitive to texture and edge fidelity at high resolution
LPIPS: perceptual quality metric especially important for evaluating fine texture reconstruction at 2K resolution
NIQE (Natural Image Quality Evaluator): no-reference quality metric that does not require a ground truth reference; useful for scoring the naturalness of outputs from probabilistic inpainting models, where no single reference completion exists
Research Papers That Use This Dataset
DIV2K appears as a benchmark in a wide range of image restoration papers that include inpainting as one of multiple restoration tasks. Key papers using DIV2K for inpainting evaluation include the probabilistic inpainting framework published in Frontiers in AI (2025), which specifically chose DIV2K for its ability to evaluate high-resolution reconstruction fidelity. LaMa and its variants include DIV2K in their multi-dataset evaluation suite. Papers combining super-resolution and inpainting — a growing research direction — use DIV2K as their primary benchmark because it supports both tasks from a single dataset. The NTIRE 2021 and 2022 challenge tracks include inpainting-related restoration tasks evaluated on DIV2K and its derivatives.
Final Year Project Angle
DIV2K is the right dataset for final year projects focused on high-resolution image quality, texture synthesis, or the combined restoration problem. A compelling project could investigate the resolution generalization of modern inpainting models — training on lower-resolution data (MS-COCO at 640×480) and evaluating on DIV2K at 2K resolution, measuring the performance gap and proposing frequency-domain or progressive upsampling strategies to bridge it. Another strong angle is texture-specific inpainting — using DIV2K's rich texture content to evaluate and compare texture synthesis components of different inpainting architectures, measuring whether Fourier-based approaches (LaMa) outperform spatial attention approaches (MAT) on high-frequency texture reconstruction specifically. Students in medical imaging or satellite remote sensing could adapt DIV2K's high-resolution framework to domain-specific high-resolution inpainting tasks, using DIV2K as a pre-training source before fine-tuning on domain data.
Image Inpainting Metrics Explained
Evaluating image inpainting models is not straightforward. Exact recovery of the original pixels cannot be expected: the model is generating content for regions where no ground truth pixel information exists in the input. The "correct" answer is one of many possible plausible completions, and different metrics capture different aspects of what "good" means in this context. Understanding the strengths and limitations of each metric is essential for interpreting research results and designing your own evaluation protocol.
PSNR — Peak Signal-to-Noise Ratio
PSNR measures the ratio between the maximum possible pixel value and the mean squared error between the reconstructed and original image. It is computed as PSNR = 10 · log₁₀(MAX²/MSE), where MAX is the maximum pixel value (255 for 8-bit images) and MSE is the mean squared error per pixel. Higher PSNR values indicate more accurate pixel-level reconstruction. PSNR is simple, fast to compute, and universally reported, making it the standard for direct cross-paper comparison. Its limitation is that it measures pixel-level accuracy rather than perceptual quality — a reconstruction that is slightly blurred across the entire missing region may have better PSNR than one that is sharp but slightly spatially misregistered, even though human observers would prefer the sharp reconstruction.
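The formula above translates directly into a few lines of NumPy. This sketch assumes 8-bit images (MAX = 255) unless `max_value` is overridden; the function name is illustrative.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """PSNR in dB: 10 * log10(MAX^2 / MSE)."""
    diff = reference.astype(float) - reconstruction.astype(float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

As a worked example, a uniform error of 5 gray levels gives MSE = 25, so PSNR = 10 · log₁₀(65025 / 25) ≈ 34.15 dB.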
SSIM — Structural Similarity Index
SSIM measures the structural similarity between two images by comparing local luminance, contrast, and structure patterns. It is computed as SSIM(x,y) = [l(x,y)]^α · [c(x,y)]^β · [s(x,y)]^γ where l, c, and s measure luminance, contrast, and structural similarity respectively. SSIM ranges from -1 to 1, with 1 indicating perfect structural similarity. SSIM is more aligned with human visual perception than PSNR because it explicitly models the structure of image content rather than treating all pixels as independent measurements. It is particularly sensitive to edge sharpness, texture regularity, and structural geometry — making it a better metric for evaluating inpainting in regions with clear structural content like urban architecture or facial features.
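With α = β = γ = 1 and the usual stabilizing constants, the three terms collapse into a single expression. The sketch below evaluates that expression once over the whole image, which is a deliberate simplification: standard SSIM implementations average it over local sliding windows. The constants K1 = 0.01 and K2 = 0.03 are the commonly used defaults, and the function name is hypothetical.

```python
import numpy as np

def global_ssim(x, y, max_value=255.0):
    """Single-window SSIM with alpha = beta = gamma = 1.

    Simplification: statistics are computed over the whole image
    instead of being averaged over local windows.
    """
    x = x.astype(float)
    y = y.astype(float)
    c1 = (0.01 * max_value) ** 2  # stabilizes the luminance term
    c2 = (0.03 * max_value) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)
```

An image compared with itself scores 1, while a high-contrast image compared with its inverted copy scores below 0, which illustrates the lower part of the range.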
LPIPS — Learned Perceptual Image Patch Similarity
LPIPS measures perceptual similarity between image patches using feature representations extracted from a pretrained deep network (typically VGG or AlexNet). Rather than comparing pixel values directly, LPIPS compares the deep feature activations produced by both the reconstructed and original image at multiple network layers, measuring how similar they are in the learned feature space. LPIPS is substantially better correlated with human perceptual judgments than PSNR or SSIM — particularly for evaluating generative models that produce plausible but not pixel-exact reconstructions. Lower LPIPS values indicate higher perceptual similarity. LPIPS is increasingly the preferred quality metric in modern inpainting papers and should be included in any serious evaluation protocol.
FID — Fréchet Inception Distance
FID measures the statistical distance between the distribution of generated image features and the distribution of real image features using multivariate Gaussian approximations in Inception network feature space. FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)), where μ and Σ are the mean and covariance of the real (r) and generated (g) feature distributions. Lower FID indicates that generated images are drawn from a distribution more similar to the real image distribution. FID is a dataset-level metric — it cannot be computed for a single image pair but requires a large set of generated samples. It is the primary metric for evaluating the realism and diversity of generative models at a population level, capturing whether the model's outputs are plausible images drawn from the correct distribution rather than just measuring reconstruction accuracy for specific training examples.
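Given two sets of feature vectors, the FID formula can be evaluated with NumPy alone. The sketch assumes the features were already extracted (in practice from an Inception network, which is not included here) and uses the fact that, for positive semi-definite covariances, Tr((Σ_r Σ_g)^(1/2)) equals the sum of the square roots of the eigenvalues of Σ_r Σ_g. The function name is illustrative.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_real, feats_gen: arrays of shape (num_samples, feature_dim).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Trace of the matrix square root via eigenvalues of the product;
    # tiny negative or imaginary parts are numerical noise and are dropped.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(sigma_r) + np.trace(sigma_g)
                 - 2.0 * tr_sqrt)
```

Shifting every feature vector by a constant leaves the covariance unchanged, so the distance reduces to the squared length of the shift; this makes a convenient sanity check before running on real features.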
Masked LPIPS for Inpainting
For inpainting specifically, LPIPS is often computed only over the masked region rather than the full image — focusing the perceptual quality measurement on the inpainted content rather than the unchanged visible regions. This masked LPIPS variant better captures the quality of the model's generative output in the region that actually matters for evaluation.
NIQE — Natural Image Quality Evaluator
NIQE is a no-reference (blind) image quality metric that does not require a clean reference image for comparison. It models the statistical properties of natural images using a multivariate Gaussian model fitted on a corpus of natural scene patches, and measures how much a test image deviates from this natural image model. Lower NIQE scores indicate more natural-looking images. NIQE is valuable for evaluating probabilistic inpainting models that generate diverse completions — where multiple plausible outputs exist and there is no single "correct" reconstruction to compare against.
Which Metric to Use for Which Dataset
| Dataset | Primary Metric | Secondary Metrics | Special Metric |
|---|---|---|---|
| ParisStreetView-RandomMasks | LPIPS | PSNR, SSIM | FID (scene distribution) |
| MS-COCO | FID | LPIPS, PSNR | CLIP Score (text-guided) |
| Inpaint32K | AUC (detection) | F1, IoU | Pixel Accuracy |
| FFHQ | FID | LPIPS, SSIM | Identity Similarity |
| DIV2K | PSNR | SSIM, LPIPS | NIQE (no-reference) |
Comparison Table
| Attribute | ParisStreetView-RandomMasks | MS-COCO | Inpaint32K | FFHQ | DIV2K |
|---|---|---|---|---|---|
| Total Images | 22,601 | 328,000+ | 32,000 | 70,000 | 1,000 |
| Resolution | Street-view standard | Variable (typically ≤720p) | Variable | 1,024×1,024 | 2K (~2040×1404) |
| Domain | Urban street scenes | General objects/scenes | General (multi-method) | Human faces | Diverse natural scenes |
| Mask Type | Irregular free-form | Custom/semantic/irregular | Region masks | Custom/semantic | Custom synthetic |
| Primary Use | Generation/training | Generation/benchmark | Detection/forensics | Face generation/restoration | High-res generation |
| Masks Included | Yes (pre-generated) | No (generate separately) | Yes (ground truth) | No (generate separately) | No (generate separately) |
| Corrupted Images | Yes (included) | No | Yes (tampered) | No | No |
| Semantic Annotations | No | Yes (segmentation, captions) | Partial | No | No |
| License | Open (Zenodo) | CC BY 4.0 | Research | CC BY-NC-SA 4.0 | Research/educational |
| Download Size | 12.7 GB | ~25 GB (train+val) | Not specified | ~89 GB (full res) | ~7 GB (HR train+val) |
| FYP Suitability | High | Very High | High (forensics) | High (face tasks) | Moderate–High |
How to Choose the Right Dataset for Your Project
Choose ParisStreetView-RandomMasks if: Your project involves urban scene completion, diffusion model training on domain-specific data, or irregular mask inpainting evaluation. It is the only dataset in this list that comes with pre-generated masks and corrupted images, making it the fastest to set up for supervised training experiments.
Choose MS-COCO if: You need your results to be directly comparable with published state-of-the-art methods, or if your project involves semantic object removal, text-guided inpainting, or diverse scene understanding. COCO is the universal benchmark — results on COCO are interpretable by every reviewer in the field.
Choose Inpaint32K if: Your project is at the intersection of inpainting and forensics, digital security, AI safety, or media integrity. This is the only dataset specifically designed for inpainting detection research across multiple technique categories.
Choose FFHQ if: Your project involves face restoration, face editing, identity-preserving reconstruction, or face manipulation detection. No other dataset provides FFHQ's combination of scale, resolution, and face diversity.
Choose DIV2K if: Your project requires high-resolution evaluation, texture synthesis quality assessment, or involves combined restoration tasks (inpainting + super-resolution or inpainting + denoising). DIV2K is essential when resolution and texture fidelity are the primary evaluation dimensions.
Common Inpainting Models Benchmarked on These Datasets
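LaMa (Large Mask inpainting): Uses fast Fourier convolutions, which give the network an image-wide receptive field from its early layers and help it handle large irregular masks. A strong, widely used baseline for large-mask inpainting. The original paper reports results on Places and CelebA-HQ, so check which of the five datasets here have published LaMa numbers before comparing against them. GitHub: saic-mdal/lama (now hosted at advimman/lama).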
Stable Diffusion Inpainting: Latent diffusion model fine-tuned for inpainting via masked latent reconstruction. Produces high semantic diversity and photorealism. Primarily benchmarked on MS-COCO and FFHQ. Available on HuggingFace.
MAT (Mask-Aware Transformer): Transformer-based inpainting with mask-aware attention that treats masked and unmasked tokens differently. Strong on FFHQ and MS-COCO. Accepted at CVPR 2022.
EdgeConnect: Two-stage model that first predicts edges in the missing region and then generates image content guided by those edges. Strong on structured scenes. The original paper evaluates on CelebA, Places2, and Paris StreetView, which makes it well-suited for ParisStreetView-RandomMasks urban structure evaluation.
DeepFill v2 (Free-Form Image Inpainting): Gated convolution model designed for free-form, user-guided inpainting with irregular masks. A foundational method for irregular mask inpainting. The original paper evaluates on Places2 and CelebA-HQ.
Partial Convolution (PConv): Convolution operation that excludes invalid (missing) pixels from each window during feature computation and renormalizes by the number of valid pixels. Foundational architectural innovation for irregular mask inpainting. Introduced by NVIDIA Research; the original paper evaluates on ImageNet, Places2, and CelebA-HQ.
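To make the partial convolution idea concrete, here is a minimal single-channel NumPy sketch. It is an illustration under simplifying assumptions (one channel, stride 1, no padding, explicit loops), not NVIDIA's implementation; the function name `partial_conv2d` is hypothetical.

```python
import numpy as np

def partial_conv2d(x, mask, weight, bias=0.0):
    """Single-channel partial convolution sketch (stride 1, no padding).

    x:      (H, W) float image; values under the hole can be arbitrary.
    mask:   (H, W) array, 1 = valid pixel, 0 = missing pixel.
    weight: (k, k) kernel.
    Returns (output, updated_mask).
    """
    k = weight.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    new_mask = np.zeros((out_h, out_w), dtype=np.float64)
    window_size = float(k * k)
    for i in range(out_h):
        for j in range(out_w):
            m = mask[i:i + k, j:j + k]
            valid = m.sum()
            if valid > 0:
                patch = x[i:i + k, j:j + k]
                # Use valid pixels only, then rescale by window size / valid count
                out[i, j] = (weight * patch * m).sum() * (window_size / valid) + bias
                # A window with at least one valid pixel becomes valid downstream
                new_mask[i, j] = 1.0
    return out, new_mask
```

Because the mask is updated after every layer, holes shrink as features pass through a stack of such layers, which is why the operation suits irregular masks.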
How to Prepare These Datasets for Training
Mask Generation Strategies
For datasets without pre-generated masks (MS-COCO, FFHQ, DIV2K), you must generate masks programmatically. The standard approach uses the irregular free-form mask generation algorithm from DeepFill v2 — a random walk stroke simulation with configurable stroke width, turning angle, and total coverage ratio. The nvidia/partialconv repository provides a reference implementation. For object removal experiments on COCO, use pycocotools to render segmentation masks for specific object categories directly from COCO annotations.
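A random-walk stroke generator of the kind described above can be sketched in a few lines of NumPy. This is a simplified approximation rather than the DeepFill v2 reference code; the function names and default parameters are assumptions chosen for illustration.

```python
import numpy as np

def _stamp_disc(mask, cy, cx, radius):
    """Set pixels within `radius` of (cy, cx) to 1, clipped to the image."""
    h, w = mask.shape
    y0, y1 = max(0, int(cy - radius)), min(h, int(cy + radius) + 1)
    x0, x1 = max(0, int(cx - radius)), min(w, int(cx + radius) + 1)
    if y0 >= y1 or x0 >= x1:
        return
    yy, xx = np.mgrid[y0:y1, x0:x1]
    inside = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    mask[y0:y1, x0:x1][inside] = 1

def free_form_mask(height, width, rng, max_strokes=8, max_steps=12,
                   max_width=20, max_step_len=40):
    """Return a (height, width) uint8 mask where 1 marks missing pixels."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for _ in range(rng.integers(1, max_strokes + 1)):
        y, x = int(rng.integers(0, height)), int(rng.integers(0, width))
        angle = rng.uniform(0.0, 2.0 * np.pi)
        radius = int(rng.integers(3, max_width // 2 + 1))
        for _ in range(rng.integers(1, max_steps + 1)):
            # Turn by a bounded random angle, then walk a random distance
            angle += rng.uniform(-np.pi / 4, np.pi / 4)
            length = int(rng.integers(5, max_step_len + 1))
            for t in np.linspace(0.0, 1.0, num=length):
                _stamp_disc(mask, y + t * length * np.sin(angle),
                            x + t * length * np.cos(angle), radius)
            y = int(np.clip(y + length * np.sin(angle), 0, height - 1))
            x = int(np.clip(x + length * np.cos(angle), 0, width - 1))
    return mask
```

Passing a seeded generator, for example `np.random.default_rng(0)`, makes the masks reproducible, so the same masks can be saved to disk and reused for every method you compare.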
Data Augmentation for Inpainting
Standard augmentations for inpainting training include random horizontal flipping, random cropping to a fixed training resolution (typically 256×256 for initial training, 512×512 for high-resolution fine-tuning), and random rotation in small angular ranges (±5°). Color jitter is often avoided; if you use it, apply it to the clean image before masking so that the corrupted input and the ground-truth target remain consistent. For mask augmentation specifically, randomly varying the mask coverage ratio between 10% and 60% during training produces models that generalize across a wide range of inpainting difficulty levels.
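When masks are pre-generated (as with ParisStreetView-RandomMasks), geometric augmentations have to be applied identically to the image and its mask. The sketch below shows one way to do that with NumPy; the function names are hypothetical, and it assumes `mask == 1` marks missing pixels.

```python
import numpy as np

def paired_crop_flip(image, mask, crop_size, rng):
    """Apply the same random crop and horizontal flip to image and mask.

    image: (H, W, C) array; mask: (H, W) array with 1 = missing pixel.
    """
    h, w = mask.shape
    if h < crop_size or w < crop_size:
        raise ValueError("image is smaller than the requested crop size")
    top = int(rng.integers(0, h - crop_size + 1))
    left = int(rng.integers(0, w - crop_size + 1))
    image = image[top:top + crop_size, left:left + crop_size]
    mask = mask[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        # Flip both arrays along the width axis so they stay aligned
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)

def corrupt(image, mask):
    """Build the corrupted network input by zeroing the masked pixels."""
    return image * (1 - mask)[..., None]
```

If you generate masks on the fly instead, it is usually simpler to augment the clean image first and draw the mask afterwards at the final crop resolution.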
Train/Val/Test Split Best Practices
For datasets with pre-defined splits (DIV2K: 800/100/100, COCO 2017: roughly 118K/5K/41K train/val/test), use the official splits to ensure comparability with published results. For ParisStreetView-RandomMasks, the provided train.txt and val.txt files define the official split. For FFHQ, the 60,000/10,000 train/val split is the common convention. Never include validation or test images in your training set. This mistake is easy to make when loading data through HuggingFace Datasets without explicitly checking which split you requested.
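A cheap safeguard is to check that the splits are disjoint before any training run. The helper below is a minimal sketch using only the standard library; the name `check_disjoint_splits` is hypothetical, and it assumes file names are unique across the whole dataset.

```python
from pathlib import Path

def check_disjoint_splits(splits):
    """Raise ValueError if any two splits share a file name.

    splits: dict mapping split name to an iterable of file paths.
    Returns a dict of split sizes when all splits are disjoint.
    """
    # Compare by base name (assumes names are unique within the dataset)
    keyed = {name: {Path(p).name for p in paths} for name, paths in splits.items()}
    names = list(keyed)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = keyed[a] & keyed[b]
            if overlap:
                raise ValueError(
                    f"{len(overlap)} item(s) shared by {a!r} and {b!r}: "
                    f"{sorted(overlap)[:5]}"
                )
    return {name: len(ids) for name, ids in keyed.items()}
```

For example, reading the lines of train.txt and val.txt into two lists and passing them as `{"train": ..., "val": ...}` either returns the split sizes or fails loudly with the first few offending file names.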
Handling Class Imbalance in Tampered Datasets
For Inpaint32K specifically, the balanced design (8,000 images per technique category) eliminates class imbalance across techniques. However, if you combine Inpaint32K with authentic image datasets for detection research, the ratio of tampered to authentic images must be carefully managed, typically 1:1 or 1:2 (tampered:authentic), so that the model cannot score well by trivially predicting the majority class for every image.
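One simple way to enforce such a ratio is to subsample the authentic pool. The following sketch uses only the standard library; the function name, the label convention (1 = tampered, 0 = authentic), and the default seed are assumptions for illustration.

```python
import random

def balance_authentic(tampered, authentic, authentic_per_tampered=1.0, seed=0):
    """Subsample authentic items to a fixed tampered:authentic ratio.

    Returns a shuffled list of (item, label) pairs, label 1 = tampered.
    """
    target = int(round(len(tampered) * authentic_per_tampered))
    if target > len(authentic):
        raise ValueError("not enough authentic images for the requested ratio")
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    chosen = rng.sample(list(authentic), target)
    labeled = [(p, 1) for p in tampered] + [(p, 0) for p in chosen]
    rng.shuffle(labeled)
    return labeled
```

Setting `authentic_per_tampered=2.0` gives the 1:2 configuration; whichever ratio you pick, report it alongside your detection metrics so others can reproduce the class prior.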
Research Gap Radar
Gap 1 — No Multi-Domain Dataset with Pre-Generated Masks: ParisStreetView-RandomMasks provides pre-generated masks but only for urban scenes. MS-COCO provides visual diversity but no masks. There is no large-scale, visually diverse dataset that combines both — requiring researchers to either accept domain-limited pre-generated masks or generate their own masks with non-standardized protocols.
Gap 2 — Missing Medical Imaging Inpainting Benchmarks: Medical image inpainting (MRI artifact removal, CT scan gap filling, retinal image restoration) is a high-value application with no standard benchmark dataset. Medical inpainting research relies on small institution-specific datasets that prevent cross-paper comparison.
Gap 3 — No Standardized Video Inpainting Dataset at Scale: All five datasets reviewed are for static images. Video inpainting — removing objects from video sequences with temporal consistency — lacks a universally adopted large-scale benchmark with standardized evaluation protocols comparable to the role MS-COCO plays for image inpainting.
Gap 4 — Inpainting Detection Datasets Lag Behind Generation Methods: Inpaint32K covers methods up to its 2024 construction date. The most capable inpainting models of 2025–2026 (SDXL Inpainting, Flux.1-Fill, BrushNet v2) may not be represented, meaning detection models trained on Inpaint32K may not generalize to the latest generation tools.
Gap 5 — No High-Resolution Dataset with Semantic Annotations: DIV2K provides high resolution. MS-COCO provides semantic annotations. No dataset provides both — limiting the development of high-resolution semantic inpainting models that could precisely control what content is generated in removed object regions.
Implementation Roadmap
Step 1 — Choose Your Dataset and Task (Week 1): Use the How to Choose section to select your dataset. Define your specific task: generation quality evaluation, object removal, detection/forensics, face restoration, or high-resolution reconstruction.
Step 2 — Download and Inspect (Week 1–2): Download your chosen dataset using the links in this article. Inspect 50–100 random samples visually before writing any training code. Verify that image quality, diversity, and mask characteristics match your expectations.
Step 3 — Set Up Baselines (Week 2–3): Install and run at least one baseline model on your dataset. LaMa (github.com/saic-mdal/lama) is the recommended first baseline for general inpainting. Run evaluation metrics (PSNR, SSIM, LPIPS) on the baseline to establish your performance floor.
Step 4 — Implement Your Method (Week 3–6): Implement the core architectural contribution of your project. Start with the minimum viable change — one new module, one new loss term, or one new data processing strategy — rather than rewriting the entire baseline system.
Step 5 — Evaluate and Compare (Week 6–8): Run full evaluation on your test set. Compute all standard metrics for your dataset (see the Metrics section per dataset). Compare against your baseline and at least one published method using the same evaluation protocol.
Step 6 — Ablation Study (Week 8–9): Remove each component of your method one at a time and measure the performance impact. This is the core evidence that your contributions are genuine rather than coincidental.
Step 7 — Write Up and Release (Week 9–12): Write your report with clear methodology, quantitative results, and qualitative visual examples. Release your code and evaluation scripts for reproducibility.
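For Step 6, it helps to enumerate the ablation runs programmatically so that no component is forgotten. The sketch below assumes your method's components are toggled by boolean flags in a config dictionary; the function and flag names are hypothetical.

```python
def ablation_variants(full_config):
    """List the full model plus one variant per enabled component removed.

    full_config: dict mapping component name to a bool flag.
    Returns a list of (run_name, config) pairs.
    """
    variants = [("full", dict(full_config))]
    for name, enabled in full_config.items():
        if enabled:
            cfg = dict(full_config)
            cfg[name] = False  # switch off exactly one component
            variants.append((f"without_{name}", cfg))
    return variants

# Hypothetical component flags for an inpainting model
runs = ablation_variants({"perceptual_loss": True, "mask_attention": True})
for run_name, cfg in runs:
    print(run_name, cfg)
```

Each variant should be trained and evaluated with the same masks, seed, and metric code as the full model so that differences can be attributed to the removed component.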
Tools and Frameworks
PyTorch + torchvision: Core deep learning framework for all inpainting model implementations. Standard starting point for any inpainting project.
BasicSR: A comprehensive image restoration toolbox built on PyTorch, providing implementations of SRGAN, ESRGAN, EDSR, and multiple inpainting-adjacent restoration models with standardized training and evaluation pipelines. github.com/xinntao/BasicSR
MMEditing (now MMagic): OpenMMLab's image and video editing toolbox covering inpainting, super-resolution, video enhancement, and generation with standardized benchmarks. github.com/open-mmlab/mmagic
HuggingFace Diffusers: The standard library for diffusion-based inpainting models. Provides ready-to-use pipelines for Stable Diffusion Inpainting, SDXL Inpainting, and custom diffusion model training. huggingface.co/docs/diffusers
LaMa Cleaner (since renamed IOPaint): A production-ready inpainting tool built on LaMa with support for multiple backends including LaMa, Stable Diffusion, and ZITS. Excellent for building inpainting application prototypes. github.com/Sanster/lama-cleaner
pycocotools: Official Python API for MS-COCO annotations — essential for rendering segmentation masks for object removal inpainting experiments on COCO. Available via pip install pycocotools.
LPIPS library: Official Python implementation of the LPIPS perceptual similarity metric. Available via pip install lpips. Essential for any inpainting evaluation pipeline.
Common Mistakes When Using Inpainting Datasets
Mistake 1 — Evaluating on the training set: Always use a held-out test set that was never seen during training. This mistake is surprisingly common when using HuggingFace Dataset loaders without explicitly verifying which split is being used.
Mistake 2 — Using different mask generation protocols across experiments: If you generate masks differently for your method than for the baselines you compare against, your evaluation is invalid. Always use the same mask generation code and random seed for all compared methods. Ideally, save the generated masks to disk and reuse them for all experiments.
Mistake 3 — Computing PSNR/SSIM over the entire image instead of only the masked region: Since the unmasked regions are identical between input and output, full-image PSNR is inflated by the large proportion of unchanged pixels. Always report masked-region PSNR and SSIM for meaningful inpainting-specific quality measurement.
Mistake 4 — Ignoring image normalization consistency: Some inpainting models expect images normalized to [-1, 1], others to [0, 1], and others to ImageNet mean/std. Mixing normalization conventions between your data pipeline and the model's expected input range produces corrupted outputs that appear as model failures but are actually data pipeline bugs.
Mistake 5 — Not validating mask coverage distribution: If your mask generation produces masks that are systematically too small or too large compared to the benchmark protocol, your results will not be comparable with published numbers even if everything else is correct. Always plot and report the distribution of mask coverage percentages for your evaluation set.
Mistake 6 — Selecting the best checkpoint based on test set metrics: Use validation set metrics for checkpoint selection and report final numbers on the test set exactly once. Selecting checkpoints based on test performance is a subtle form of overfitting to the benchmark.
Mistake 7 — Not reporting standard deviations: Inpainting models can have high variance in results across different mask placements and image content. Always report mean and standard deviation of your metrics over the full test set, not just the mean.
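Mistakes 3, 5, and 7 can be addressed with a few small helpers. The NumPy sketch below computes PSNR over the masked region only, reports mask coverage, and summarizes a metric as mean and standard deviation. The function names are hypothetical, images are assumed to be floats in [0, max_value], and `mask == 1` is assumed to mark the inpainted region. Masked SSIM is not shown because it depends on a windowing choice.

```python
import numpy as np

def masked_psnr(pred, target, mask, max_value=1.0):
    """PSNR computed only over pixels where mask == 1.

    pred, target: (H, W, C) arrays; mask: (H, W) array.
    """
    region = mask.astype(bool)
    if not region.any():
        raise ValueError("mask selects no pixels")
    diff = pred[region].astype(np.float64) - target[region].astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def full_psnr(pred, target, max_value=1.0):
    """Whole-image PSNR, shown only for comparison with masked_psnr."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_value ** 2 / mse))

def mask_coverage(mask):
    """Fraction of pixels marked as missing."""
    return float(mask.astype(bool).mean())

def summarize(values):
    """Mean and sample standard deviation of per-image metric values."""
    arr = np.asarray(values, dtype=np.float64)
    return float(arr.mean()), float(arr.std(ddof=1))
```

As a sanity check, take an all-zero target with a 25% mask and a prediction that is off by 0.1 inside the mask: masked PSNR is about 20 dB, while full-image PSNR is about 26 dB because the untouched pixels dilute the error.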
Your Next Steps as a Final Year Student or Researcher
Select one dataset from this article that matches your project domain using the How to Choose section
Download the dataset using the official links provided — start with ParisStreetView-RandomMasks at zenodo.org/records/20233925 if you want the fastest setup (masks and corrupted images already included)
Set up LaMa as your first baseline — it is the most reproducible, well-documented, and widely used inpainting baseline available
Establish your evaluation pipeline computing PSNR, SSIM, and LPIPS on masked regions before implementing any novel method
Identify one open problem from the Research Gap Radar that your project can partially address
Design your ablation study before implementing your method — knowing what you will remove helps clarify what is genuinely novel in your approach
Release your code and evaluation scripts — reproducibility is increasingly a standard expectation in the field
Conclusion
Image inpainting is one of the most practically impactful and technically challenging tasks in computer vision. The five datasets reviewed in this article represent the current best resources for training, evaluating, and benchmarking inpainting models across a wide range of use cases — from general-purpose scene completion and object removal to face restoration, high-resolution texture reconstruction, and inpainting forensics detection.
ParisStreetView-RandomMasks, available at zenodo.org/records/20233925, represents the newest addition to this ecosystem — a purpose-built, ready-to-use inpainting dataset with pre-generated irregular masks and corrupted images for urban street-scene research and diffusion model training. MS-COCO remains the universal benchmark dataset whose results are interpretable across the entire research community. Inpaint32K addresses the growing importance of inpainting detection and forensics research in an era of increasingly capable AI image manipulation tools. FFHQ provides the high-resolution, high-diversity face dataset essential for developing and evaluating face-specific inpainting systems. And DIV2K brings the high-resolution evaluation standard needed to assess inpainting quality at the resolutions demanded by practical real-world applications.
Each dataset has a distinct role in the inpainting research ecosystem. The most rigorous research uses multiple datasets — a general benchmark like MS-COCO for universal comparability, a domain-specific dataset like FFHQ or ParisStreetView-RandomMasks for targeted application evaluation, and a quality-specific benchmark like DIV2K for high-resolution assessment. Understanding which dataset to use for which purpose — and why — is one of the most valuable skills you can develop as a computer vision researcher.
The field is moving fast. New inpainting models are published weekly, new benchmark datasets are created to challenge them, and the line between inpainting, generation, and editing is increasingly blurred by unified generative frameworks. The datasets reviewed in this article are your entry points into this landscape — use them well, evaluate rigorously, and contribute to the growing body of reproducible, comparable inpainting research.




