Top 5 Image Inpainting Datasets Every Computer Vision Researcher Must Know

A Complete Guide to Datasets, Metrics, Models, and Implementation — With Research Angles for Every Dataset

Updated
56 min read
AI enthusiast and academic researcher with a focus on deep learning, computer vision, and NLP. I write about IEEE-aligned project ideas, model architectures, and practical AI implementation guides for final year engineering students. Helping students bridge the gap between research papers and real-world code.

Introduction

What Is Image Inpainting

Image inpainting is the computational task of reconstructing missing, corrupted, or deliberately removed regions of a digital image in a way that is visually plausible, semantically consistent, and perceptually indistinguishable from the surrounding content. The term originates from the art restoration practice of filling in damaged sections of paintings — and the computational problem carries the same fundamental challenge: the model must understand not just the local pixel neighborhood around a missing region, but the global semantic context of the entire image, and use that understanding to hallucinate content that never existed in the original data.

Modern deep learning-based inpainting systems are capable of remarkable feats. They can remove an entire person from a crowded photograph and fill the background seamlessly. They can restore century-old damaged photographs to near-original quality. They can remove unwanted watermarks, logos, and text from images, often without a visible trace. They can extend the boundaries of a photograph beyond its original frame. They can replace a sky, remove a building, erase a car, or fill in a face, in many cases from a single forward pass through a neural network (diffusion-based systems run an iterative denoising loop instead).

The applications of image inpainting span virtually every domain where images matter. In film and television post-production, inpainting powers wire removal, object erasure, and background replacement. In medical imaging, it enables restoration of corrupted MRI or CT scan regions. In satellite and remote sensing imagery, it fills gaps caused by cloud cover or sensor failure. In e-commerce, it removes product backgrounds and fills in missing inventory images. In forensics and security, it both enables and detects image manipulation. In mobile photography, it powers the "magic eraser" features in consumer smartphones.

Why Datasets Matter More Than Algorithms

A fundamental truth in modern deep learning research is that the quality of a model is bounded by the quality of the data it learns from. This is particularly true for image inpainting, where the model must learn complex statistical relationships between visible and missing image regions across an enormous diversity of visual content, mask shapes, and semantic contexts. The dataset determines what the model can learn, how well it generalizes to unseen content, how it handles edge cases, and how reliably its performance can be compared against competing approaches.

The history of image inpainting research is closely tied to the history of its benchmark datasets. Early work used simple synthetic datasets with rectangular masks on limited image collections. The introduction of large-scale diverse datasets like MS-COCO enabled the training of powerful generative models that could handle complex semantic inpainting. The introduction of irregular mask datasets enabled models to move beyond toy settings to real-world restoration scenarios. The introduction of domain-specific datasets like FFHQ enabled the development of face-specialized inpainting systems of extraordinary quality. Each dataset advance unlocked a corresponding model capability advance.

For final year students and researchers entering the field, choosing the right dataset for your task is not a secondary concern — it is a primary design decision that will determine the scope, relevance, and reproducibility of your work.

How This Article Is Structured

This article covers five of the most important image inpainting datasets available to researchers in 2025–2026, including a newly released large-scale urban dataset specifically designed for modern diffusion model training. Each dataset is covered in depth: its origin, statistics, structure, strengths, limitations, download instructions, associated evaluation metrics, related research papers, and a concrete project angle for final year students and researchers. Following the dataset coverage, the article provides a comprehensive evaluation metrics guide, a side-by-side comparison table, guidance on choosing the right dataset for your specific project, a review of the inpainting models commonly benchmarked on these datasets, practical data preparation guidance, a research gap radar, an implementation roadmap, a tools and frameworks reference, common mistakes to avoid, and a conclusion with actionable next steps.


What Makes a Great Image Inpainting Dataset

Not all image datasets are suitable for inpainting research, and not all inpainting datasets are suitable for all inpainting tasks. Before examining the five datasets in detail, it is worth establishing the criteria that distinguish a great inpainting dataset from an adequate one.

Image Diversity and Scale

A great inpainting dataset contains images that span a wide range of visual categories, scene types, lighting conditions, resolutions, and content complexities. A model trained only on landscape photographs will fail on portraits. A model trained only on indoor scenes will struggle with outdoor environments. Scale matters because inpainting models — particularly modern diffusion-based systems — require tens of thousands to millions of training examples to learn robust priors. Datasets with fewer than 10,000 images are generally insufficient for training state-of-the-art models from scratch, though they may be adequate for fine-tuning.

Mask Type and Complexity

The mask defines the inpainting problem. A dataset with only rectangular masks trains models that fail on irregular real-world damage patterns. A dataset with only small masks trains models that struggle with large missing regions. A great inpainting dataset provides masks that reflect the target use case — irregular free-form masks for general restoration, semantic object masks for object removal, center masks for context understanding evaluation — with sufficient variety in mask size, shape, and placement to prevent the model from learning mask-specific shortcuts.

Ground Truth Quality

Inpainting evaluation requires clean, high-quality original images against which reconstructed outputs can be compared. Datasets with low-resolution originals, heavy compression artifacts, or inconsistent quality produce unreliable evaluation scores. High-resolution, professionally captured or carefully curated images produce evaluation benchmarks that genuinely reflect model capability rather than dataset noise.

Domain Coverage

Domain coverage determines which inpainting applications the dataset can support. A general-purpose dataset like MS-COCO covers diverse object categories but may be weak for specific domains like faces, medical images, or satellite imagery. Domain-specific datasets like FFHQ (faces) or DIV2K (high-resolution natural scenes) enable specialized model development but may not generalize well to other domains. The best research uses a combination of general and domain-specific datasets.

License and Accessibility

A dataset that is not freely accessible cannot serve the research community. The best inpainting datasets are released under open licenses (Creative Commons, MIT, or equivalent) with clear attribution requirements, stable download infrastructure, and version-controlled releases. Datasets hosted on reliable platforms like Zenodo, GitHub, or institutional servers with DOIs are preferable to those distributed via personal Google Drive links or without version tracking.


Dataset 1: ParisStreetView-RandomMasks — Large-Scale Urban Image Inpainting Dataset with Random Irregular Masks

Overview and Origin

ParisStreetView-RandomMasks is a newly released large-scale urban image inpainting dataset published in May 2026 on Zenodo under an open access license. The dataset was created by Wisen IT Solutions and is specifically designed for computer vision researchers, generative AI developers, and deep learning practitioners working on image restoration and scene completion tasks in urban environments.

The dataset builds upon the original Paris StreetView dataset — a collection of urban street-level photographs of Paris that has been used in computer vision research for scene understanding, place recognition, and image generation tasks. The ParisStreetView-RandomMasks release extends this foundation by adding programmatically generated irregular random masks and corresponding corrupted images, transforming the base street-view photography collection into a complete supervised training and evaluation resource specifically designed for modern inpainting research.

The timing of this release is significant. As diffusion-based inpainting models have become the dominant approach — replacing GAN-based methods that dominated the field from 2018 to 2022 — the community has needed large-scale datasets with diverse, realistic irregular masks that challenge these powerful generative models appropriately. The ParisStreetView-RandomMasks dataset addresses this need with a carefully structured collection of 22,601 image triplets, each containing the original image, a synthetically generated irregular mask, and the corresponding corrupted image ready for supervised inpainting training.

Official Download and DOI: https://zenodo.org/records/20233925
DOI: 10.5281/zenodo.20233925
Version: v1.0 — Published May 16, 2026
License: Open Access — comply with original Paris StreetView dataset license

Dataset Statistics

  • Total images: 22,601 urban street-view photographs

  • Total size: 12.7 GB

  • Image domain: Urban street-level photography — Paris, France

  • Mask type: Irregular free-form random stroke masks

  • Mask variation: Varying thickness, curvature, and region complexity

  • Format: images/, masks/, corrupted/ directories with train.txt, val.txt split files and annotations.csv metadata

  • Split support: Training and validation split pre-configured

Dataset Structure and File Organization

The dataset is organized into a clean, reproducible directory structure that is immediately compatible with standard PyTorch DataLoader implementations:

Dataset/
├── images/ — original clean street-view photographs
├── masks/ — synthetically generated binary irregular masks
├── corrupted/ — masked versions of original images (input to inpainting model)
├── train.txt — list of training sample filenames
├── val.txt — list of validation sample filenames
└── annotations.csv — metadata annotations per sample

Each sample in the dataset is a triplet: the original clean image (ground truth target), the binary mask (1 = missing region, 0 = visible region), and the corrupted image (original with masked region set to zero or noise). This three-component structure directly supports the standard supervised training loop used by all major inpainting models — the corrupted image and mask serve as inputs, and the original image serves as the reconstruction target.
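The relationship between the three components can be sketched in a few lines of NumPy. This is a minimal illustration that assumes float images in [0, 1] and the 1 = missing convention stated above; the function names `apply_mask` and `composite` are hypothetical and are not part of the dataset's tooling.

```python
import numpy as np

def apply_mask(image: np.ndarray, mask: np.ndarray, fill: float = 0.0) -> np.ndarray:
    """Return a corrupted copy of `image` with masked pixels set to `fill`.

    image: H x W x C float array in [0, 1]
    mask:  H x W array, 1 = missing region, 0 = visible region
    """
    mask3 = mask[..., None].astype(image.dtype)  # broadcast the mask over channels
    return image * (1.0 - mask3) + fill * mask3

def composite(prediction: np.ndarray, image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep visible pixels from `image` and take only the hole from `prediction`."""
    mask3 = mask[..., None].astype(image.dtype)
    return image * (1.0 - mask3) + prediction * mask3

# Tiny example: a grey 4x4 image with a 2x2 hole in the middle.
image = np.full((4, 4, 3), 0.5, dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
corrupted = apply_mask(image, mask)
```

Compositing the model output back onto the visible pixels before computing metrics is a common choice, because it keeps errors outside the hole from affecting the score.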

The annotations.csv file provides per-sample metadata that supports dataset analysis and stratified evaluation — including information about mask coverage percentage, image scene type annotations, and sample identifiers that map between the three component directories.

Mask Generation Methodology

The masks in ParisStreetView-RandomMasks were generated using irregular free-form random stroke simulation methods — a mask generation approach that has become the standard for modern inpainting benchmarks since its introduction in the DeepFill v2 paper (Yu et al., 2019). The methodology simulates realistic missing region patterns that occur in practice due to physical damage, occlusion removal, or user-guided erasing.

The simulation process works by generating a series of random walk paths across the image canvas. Each path begins at a random starting point and proceeds through a series of direction changes governed by random angular perturbations bounded by a maximum turning angle. The path is rendered as a filled stroke with a randomly sampled width drawn from a range spanning thin scratches to broad erasure regions. Multiple strokes are generated per image with random starting points, producing masks that vary significantly in total covered area, spatial distribution, and geometric complexity.

The key parameters varied during mask generation include stroke width (controlling whether the mask simulates a fine scratch or a broad erasure region), the number of strokes per image (controlling total masked area), maximum turning angle (controlling stroke curvature from nearly straight to highly irregular), and total mask coverage ratio (controlling the difficulty of the inpainting task from easy small-region fills to challenging large-area reconstructions).

This variability is critical for training robust inpainting models. A model trained only on small-area masks learns to perform local texture blending but fails on large structural reconstructions. A model trained only on large-area masks may learn global semantic generation but lose fine-grained local texture fidelity. The ParisStreetView-RandomMasks dataset's coverage of a wide range of mask complexities — from thin isolated strokes to large interconnected irregular regions — produces models that generalize across real-world inpainting scenarios.
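The random-walk stroke simulation described above can be sketched as follows. This is an illustrative approximation, not the generator used to build the dataset: the function name, the parameter names, and every default range are assumptions chosen for the example. Each stroke segment is drawn as a thick line with round ends by thresholding the distance from every pixel to the segment.

```python
import numpy as np

def _segment_distance_sq(xx, yy, x0, y0, x1, y1):
    """Squared distance from every pixel (xx, yy) to the segment (x0, y0)-(x1, y1)."""
    dx, dy = x1 - x0, y1 - y0
    length_sq = max(dx * dx + dy * dy, 1e-8)  # guard against zero-length segments
    t = np.clip(((xx - x0) * dx + (yy - y0) * dy) / length_sq, 0.0, 1.0)
    return (xx - (x0 + t * dx)) ** 2 + (yy - (y0 + t * dy)) ** 2

def random_stroke_mask(height, width, rng, num_strokes=(1, 5), num_segments=(4, 12),
                       width_range=(4, 24), step_range=(10, 40), max_turn=np.pi / 3):
    """Binary mask (1 = missing) built from random-walk brush strokes.

    All ranges are hypothetical defaults, not the dataset's actual settings.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    yy, xx = np.mgrid[0:height, 0:width]
    for _ in range(rng.integers(num_strokes[0], num_strokes[1] + 1)):
        x, y = rng.uniform(0, width - 1), rng.uniform(0, height - 1)  # random start
        angle = rng.uniform(0, 2 * np.pi)
        radius = rng.uniform(*width_range) / 2.0  # one stroke width per stroke
        for _ in range(rng.integers(num_segments[0], num_segments[1] + 1)):
            angle += rng.uniform(-max_turn, max_turn)  # bounded turning angle
            step = rng.uniform(*step_range)
            nx = float(np.clip(x + step * np.cos(angle), 0, width - 1))
            ny = float(np.clip(y + step * np.sin(angle), 0, height - 1))
            mask[_segment_distance_sq(xx, yy, x, y, nx, ny) <= radius ** 2] = 1
            x, y = nx, ny
    return mask

mask = random_stroke_mask(256, 256, np.random.default_rng(0))
coverage = float(mask.mean())  # fraction of pixels marked as missing
```

Passing a seeded generator makes the masks reproducible, and rejecting masks whose `coverage` falls outside a target range is one simple way to control task difficulty.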

Recommended Baseline Models

The dataset documentation recommends five inpainting models as primary baselines and training architectures:

  • LaMa (Large Mask inpainting): A Fourier convolution-based inpainting model specifically designed for large irregular masks. Its fast Fourier convolutions give the network an image-wide receptive field, enabling it to use global context for large-area inpainting. A strong and widely used baseline for irregular mask scenarios.

  • Partial Convolution: An early architectural innovation that masks convolution operations to prevent the model from treating masked and unmasked regions identically during feature extraction. Foundational method in the irregular mask inpainting paradigm.

  • Stable Diffusion Inpainting: A latent diffusion model fine-tuned specifically for inpainting. Produces high semantic diversity and photorealism for large missing regions. Requires more compute than convolutional methods but often produces visually stronger results on large holes.

  • EdgeConnect: A two-stage inpainting model that first predicts edge maps for the missing region, then generates image content guided by the predicted edges. Strong structural consistency for scenes with clear geometric features like urban architecture.

  • Context Encoder: The foundational deep learning inpainting model (Pathak et al., 2016) that established the encoder-decoder paradigm for inpainting. Primarily of historical and educational interest at this point but useful as a lower-bound baseline.

Strengths and Limitations

Strengths:

  • The three-component triplet structure (original + mask + corrupted) makes it ready for supervised training without a separate mask-generation step

  • Irregular free-form masks reflect real-world inpainting scenarios better than simple rectangular masks

  • Urban street-view domain provides rich structural content — buildings, roads, signage, pedestrians — that challenges both local texture and global structural inpainting capabilities

  • Pre-configured train/val splits with annotations.csv metadata support reproducible benchmarking

  • Open access on Zenodo with a stable DOI ensures long-term availability and citability

  • 12.7 GB size is manageable for research environments without requiring cluster-scale storage

Limitations:

  • Single-city, single-domain coverage (Paris street-level photography) limits generalization evaluation across diverse visual domains

  • The dataset is a single version release (v1.0) without the multi-year community benchmark history that MS-COCO or FFHQ carry

  • No semantic object masks — only irregular free-form masks — limiting its use for object removal evaluation

  • No high-resolution variants above standard street-view resolution

How to Download and Use

The dataset is freely available for download from Zenodo at https://zenodo.org/records/20233925. Individual files (annotations.csv, image archives) can be downloaded separately or as a complete archive. The total download size is 12.7 GB.

For PyTorch integration, a standard DataLoader can be constructed by reading the train.txt or val.txt split files to obtain sample identifiers, then loading the corresponding triplets from the images/, masks/, and corrupted/ directories. The annotations.csv file provides additional metadata for stratified sampling or difficulty-controlled evaluation.
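A minimal sketch of the split-file lookup, using only the standard library, might look like the following. It assumes that each split file lists one filename per line and that the same filename is used in images/, masks/, and corrupted/; neither detail is confirmed here, so check both against the downloaded archive. The names `Triplet`, `read_split`, and `missing_files` are hypothetical.

```python
from pathlib import Path
from typing import List, NamedTuple

class Triplet(NamedTuple):
    image: Path      # clean ground-truth image
    mask: Path       # binary mask, 1 = missing
    corrupted: Path  # masked input image

def read_split(root, split_file: str) -> List[Triplet]:
    """Resolve every filename in a split file to its three component paths.

    Assumes one filename per line, shared across the three directories.
    """
    root = Path(root)
    triplets = []
    for line in (root / split_file).read_text().splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines
        triplets.append(Triplet(root / "images" / name,
                                root / "masks" / name,
                                root / "corrupted" / name))
    return triplets

def missing_files(triplets: List[Triplet]) -> List[Path]:
    """List component paths that do not exist on disk (a quick integrity check)."""
    return [path for triplet in triplets for path in triplet if not path.exists()]

# Usage, assuming the archive was extracted to ./Dataset:
# train = read_split("Dataset", "train.txt")
# print(len(train), "training triplets;", len(missing_files(train)), "missing files")
```

The resulting list can be wrapped in a `torch.utils.data.Dataset` whose `__getitem__` opens the three files for one index and returns them as tensors.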

Evaluation Metrics Commonly Used With This Dataset

  • PSNR — measures pixel-level reconstruction accuracy against the original image

  • SSIM — measures structural similarity between reconstructed and original images

  • LPIPS — measures perceptual similarity using deep network features; more aligned with human visual judgment than PSNR/SSIM

  • FID — measures distributional realism of generated content across the test set
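Of these metrics, PSNR is simple enough to compute directly: it is 10 · log10(MAX² / MSE), where MAX is the largest possible pixel value. The sketch below assumes 8-bit images by default. The hole-only variant `masked_psnr` is an optional addition for illustration, not something the dataset documentation prescribes.

```python
import numpy as np

def psnr(reference, reconstruction, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shaped arrays."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

def masked_psnr(reference, reconstruction, mask, max_value: float = 255.0) -> float:
    """PSNR restricted to hole pixels (mask == 1).

    Assumes the mask contains at least one missing pixel.
    """
    hole = np.asarray(mask).astype(bool)
    return psnr(np.asarray(reference)[hole], np.asarray(reconstruction)[hole], max_value)
```

Full-image PSNR rewards the untouched visible pixels, so small masks inflate the score. Reporting the hole-only value alongside it makes results easier to compare across mask sizes.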

Research Papers That Use or Reference This Dataset

As a May 2026 release, ParisStreetView-RandomMasks is a new addition to the inpainting dataset ecosystem. It builds on the original Paris StreetView dataset, which has been referenced in scene completion and image generation research for over a decade. The new release is designed to support training and evaluation of the models that dominate the 2025–2026 research landscape: convolutional models such as LaMa, and diffusion-based models such as Stable Diffusion Inpainting and BrushNet variants.

Final Year Project Angle

ParisStreetView-RandomMasks is an excellent dataset for final year projects in image restoration, scene completion, or generative AI for urban environments. A strong project could train and compare multiple inpainting architectures (LaMa vs. Stable Diffusion Inpainting vs. EdgeConnect) on this dataset, providing a systematic benchmark of modern methods on urban street-level imagery. Another compelling angle is domain adaptation — fine-tuning a model pre-trained on MS-COCO on this urban-specific dataset and measuring whether the domain-specific fine-tuning improves performance on street-scene inpainting tasks. Students interested in data contribution could extend the dataset by generating additional mask types (semantic object masks, text masks) using automated pipelines and contributing them back to the Zenodo record.


Dataset 2: MS-COCO — Microsoft Common Objects in Context

Overview and Origin

MS-COCO (Microsoft Common Objects in Context) is one of the most widely used benchmark datasets in computer vision research, and by extension one of the most widely used datasets in image inpainting research. Originally developed by Microsoft Research and published in 2014 by Lin et al., COCO was designed as a large-scale dataset for object detection, segmentation, and captioning. Its scale, diversity, and high-quality annotations made it a de facto standard for evaluating a wide range of computer vision tasks that involve real-world image understanding, including image inpainting.

COCO was not designed specifically for inpainting. Its value for inpainting research comes from its scale (over 328,000 images), its visual diversity (80 object categories across thousands of real-world scene types), its high-quality segmentation masks (which can be directly repurposed as inpainting masks for semantic object removal), and the fact that many published inpainting and inpainting detection studies evaluate on it. This broad adoption makes COCO one of the most important benchmark datasets for comparing inpainting methods across publications.

Official Download: https://cocodataset.org/#download
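License: Annotations are released under the Creative Commons Attribution 4.0 License; the images were collected from Flickr and their use is subject to the Flickr Terms of Use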
Citation: Lin et al., "Microsoft COCO: Common Objects in Context," ECCV 2014

Dataset Statistics

  • Total images: 328,000+

  • Total labeled instances: 2.5 million

  • Object categories: 80

  • Stuff categories: 91 (background, sky, grass, etc.)

  • Training set: ~118,000 images

  • Validation set: ~5,000 images

  • Test set: ~41,000 images (no public annotations)

  • Image resolution: Variable; most images are around 640×480, with the longer side at most 640 pixels

  • Annotation types: Bounding boxes, segmentation masks, keypoints, captions, panoptic labels

Dataset Structure and Annotation Format

COCO data is organized into image directories and JSON annotation files following the COCO API format. The annotation JSON files contain image metadata, object bounding boxes, segmentation polygon coordinates (which can be rendered as binary masks), and caption text. The COCO Python API (pycocotools) provides utilities for loading annotations, rendering masks, and computing standard evaluation metrics.

For inpainting research, the segmentation annotations are the most valuable component. Each annotated instance includes a polygon or run-length encoded (RLE) mask that precisely delineates the object boundary. These segmentation masks can be directly used as inpainting masks — the annotated object region is treated as the missing area to be reconstructed, and the model must fill it in convincingly given the surrounding context.
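Turning instance annotations into an object removal mask can be sketched as below. The helper `object_removal_mask` is hypothetical; it relies only on the `loadImgs`, `getAnnIds`, `loadAnns`, and `annToMask` methods of the pycocotools `COCO` class, and it merges all instances of the chosen categories into one binary mask. The annotation path in the usage comment is an assumed local location.

```python
import numpy as np

def object_removal_mask(coco, image_id, category_ids):
    """Union of all instance masks of the given categories in one image.

    `coco` is expected to behave like a pycocotools.coco.COCO instance.
    Returns an H x W uint8 array with 1 = region to inpaint.
    """
    info = coco.loadImgs([image_id])[0]
    mask = np.zeros((info["height"], info["width"]), dtype=np.uint8)
    ann_ids = coco.getAnnIds(imgIds=image_id, catIds=category_ids, iscrowd=None)
    for ann in coco.loadAnns(ann_ids):
        # annToMask renders a polygon or RLE annotation as a binary H x W array
        mask |= coco.annToMask(ann).astype(np.uint8)
    return mask

# Usage with the real API (requires pycocotools and a downloaded annotation file):
# from pycocotools.coco import COCO
# coco = COCO("annotations/instances_val2017.json")  # assumed local path
# person_ids = coco.getCatIds(catNms=["person"])
# image_id = coco.getImgIds(catIds=person_ids)[0]
# mask = object_removal_mask(coco, image_id, person_ids)
```

In practice the mask is often dilated by a few pixels before inpainting so that object edges and thin shadows are covered as well; that step is left out here to keep the sketch short.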

How COCO Is Adapted for Inpainting Tasks

Several strategies are used to adapt COCO for inpainting research:

Object removal inpainting: A specific object category (person, car, animal) is selected, its segmentation mask is used as the inpainting mask, and the model is trained or evaluated on filling in the removed object's region with plausible background content. This directly simulates real-world object removal applications.

Random mask inpainting: Synthetic irregular masks (generated independently using random stroke simulation) are overlaid on COCO images regardless of semantic content. The model is trained on a combination of COCO's visual diversity and programmatically generated mask diversity. This is the most common approach in general-purpose inpainting benchmarks.

Text-guided inpainting evaluation: COCO's image captions are used as text prompts for text-guided inpainting models (e.g., Stable Diffusion Inpainting, BrushNet, PowerPaint), where the model is asked to fill a masked region with content described by the text prompt. COCO's diverse captions make this a comprehensive text-image alignment benchmark.

COCO-derived inpainting datasets: Multiple purpose-built inpainting datasets are constructed from COCO as their source. COCOGlide uses COCO validation images inpainted with GLIDE. COCO-Inpaint (2025) builds a comprehensive inpainting detection benchmark from COCO with multiple modern inpainting models. SAGI uses COCO as one of three source datasets for its 95,839-image inpainting detection benchmark.

Mask Types Used With COCO

COCO supports the full spectrum of mask types used in inpainting research. Object segmentation masks from COCO annotations provide semantic, object-shaped masks for object removal evaluation. Random irregular masks generated synthetically and overlaid on COCO images provide the irregular mask benchmark used by most general-purpose inpainting papers. Center masks applied to COCO images provide context understanding evaluation. Text-based region masks guided by COCO captions enable text-guided inpainting evaluation.

Strengths and Limitations

Strengths:

  • Broad benchmark adoption: many inpainting papers evaluate on COCO, which enables comparison across the literature

  • Exceptional visual diversity across 80 object categories and thousands of scene types

  • High-quality semantic segmentation annotations directly usable as inpainting masks for object removal tasks

  • Large scale (118K training images) sufficient for training large diffusion and GAN models

  • Active maintenance, multiple versions, and a robust ecosystem of tools (pycocotools, HuggingFace datasets integration)

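  • Annotations are licensed under Creative Commons Attribution 4.0; image use must follow the Flickr Terms of Use, so check the per-image license before any commercial use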

Limitations:

  • Variable and sometimes low image resolution — many COCO images are below 640×480, insufficient for high-resolution inpainting research

  • Not designed for inpainting — requires additional preprocessing to generate corrupted images and masks

  • Object category distribution is uneven — some categories (person) are heavily over-represented, which can bias model training and evaluation

  • Background content in segmented images is sometimes too simple (plain walls, open sky) to provide meaningful inpainting evaluation challenge

How to Download and Use

COCO is available for direct download at https://cocodataset.org/#download. The dataset is split into separate archives for images (train2017, val2017, test2017) and annotations (instances, captions, keypoints). The pycocotools Python library provides the official API for loading and working with COCO annotations. HuggingFace Datasets also hosts COCO with a simple one-line loading interface.

Evaluation Metrics Commonly Used With This Dataset

  • FID (Fréchet Inception Distance) — standard realism metric for generated content on COCO

  • PSNR and SSIM — pixel and structural reconstruction quality

  • LPIPS — perceptual quality using VGG or AlexNet features

  • CLIP Score — text-image alignment for text-guided inpainting on COCO captions

  • mIoU — for inpainting detection tasks measuring mask localization accuracy

Research Papers That Use This Dataset

MS-COCO is referenced widely in the inpainting literature of the past decade. Methods commonly evaluated on COCO images, either in their original papers or in later comparative benchmarks, include LaMa (Resolution-robust Large Mask inpainting), MAT (Mask-Aware Transformer), DeepFill v2, BrushNet, PowerPaint, Stable Diffusion Inpainting, and DALL-E 2 inpainting. COCO also underpins the COCO-Inpaint detection benchmark (ACM Multimedia 2025) and the SAGI (Semantically Aligned and Uncertainty Guided AI Image Inpainting) dataset, which uses COCO as one of three source datasets for its 95,839-image inpainting detection collection.

Final Year Project Angle

MS-COCO is the right dataset for any final year project that needs to be directly comparable with published state-of-the-art results. A project evaluating a novel inpainting architecture modification on COCO produces numbers that reviewers and readers can immediately contextualize against the existing literature. Strong project angles include: semantic object removal and background inpainting using COCO segmentation masks as a real-world object removal pipeline; text-guided inpainting evaluation using COCO captions as prompts for diffusion-based models; or an inpainting detection study building on the COCO-Inpaint benchmark to evaluate whether modern inpainting artifacts are detectable by existing forensics models. COCO's scale also makes it suitable for studying the effect of training data volume on inpainting quality — a systematic study training models on 10%, 25%, 50%, and 100% of COCO training data and measuring quality degradation at each scale.
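For the data volume study, one reasonable design is to make the subsets nested, so that the 10% subset is contained in the 25% subset and so on; differences between runs then reflect added data rather than a different random draw. The sketch below is one way to do that with the standard library. The function name and the integer stand-ins for image ids are hypothetical.

```python
import random

def nested_subsets(items, fractions=(0.10, 0.25, 0.50, 1.00), seed=0):
    """Return {fraction: subset} where each smaller subset is a prefix of the larger ones."""
    order = list(items)
    random.Random(seed).shuffle(order)  # one fixed shuffle shared by all fractions
    return {f: order[: round(len(order) * f)] for f in sorted(fractions)}

# Example with stand-in ids; in a real study these would be COCO training image ids.
subsets = nested_subsets(range(1000))
sizes = {f: len(s) for f, s in subsets.items()}
```

Fixing the seed and recording it with the results keeps the subsets reproducible across experiments.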


Dataset 3: Inpaint32K — High-Quality Multi-Method Inpainting Benchmark Dataset

Overview and Origin

Inpaint32K is a purpose-built image inpainting dataset released in 2024, designed specifically to support research in image inpainting detection and localization: the forensic task of identifying which regions of an image have been artificially inpainted and by which method. Unlike MS-COCO and FFHQ, which are used primarily for training and evaluating inpainting generation quality, Inpaint32K is constructed to serve as a comprehensive benchmark for inpainting detection research, a growing field driven by concerns about AI-generated image manipulation and digital forensics applications.

The dataset was constructed with careful attention to methodological diversity. Rather than using a single inpainting algorithm to generate all tampered images, Inpaint32K spans four distinct categories of inpainting technology — traditional methods, CNN-based deep learning methods, GAN-based methods, and diffusion model-based methods — with 8,000 images per category. This design reflects the real-world forensic challenge that inpainting detection systems must contend with: the same scene may be manipulated using any of dozens of different inpainting tools, and a detection system trained only on GAN artifacts will fail against diffusion model outputs and vice versa.

Citation: Hao, 2024 — "Inpaint32K: A High-Quality Dataset for Image Inpainting Detection"
Access: Publicly accessible for research purposes — referenced in InpDiffusion (arXiv:2501.02816) and related forensics papers

Dataset Statistics

  • Total tampered images: 32,000

  • Images per technique category: 8,000

  • Inpainting technique categories: 4 (traditional, CNN-based, GAN-based, diffusion model-based)

  • Tampering types: 3 (replacement, filling, removal)

  • Design goal: Comprehensive coverage of real-world inpainting manipulation scenarios

  • Quality standard: High-quality — carefully crafted to reflect real-world manipulation challenges

Dataset Structure

Inpaint32K organizes its 32,000 tampered images by inpainting technique category, enabling controlled experiments that measure detection performance against specific methods. Each image is paired with its original unmanipulated version and a ground-truth binary localization mask indicating exactly which pixels were inpainted. This three-component structure — original, tampered, mask — supports both binary classification (inpainted vs. not inpainted) and pixel-level localization (which specific regions were inpainted) evaluation protocols.
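Pixel-level localization is usually scored by comparing a predicted mask with the ground-truth mask. The sketch below computes intersection over union (IoU) and F1 for one image; these are common choices in localization work, but the exact protocol used with Inpaint32K is not specified here, so treat the function as illustrative.

```python
import numpy as np

def localization_scores(pred_mask, true_mask):
    """IoU and F1 between a predicted and a ground-truth binary mask (1 = inpainted)."""
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    tp = int(np.logical_and(pred, true).sum())   # inpainted pixels correctly flagged
    fp = int(np.logical_and(pred, ~true).sum())  # authentic pixels flagged as inpainted
    fn = int(np.logical_and(~pred, true).sum())  # inpainted pixels that were missed
    union = tp + fp + fn
    # Two empty masks agree perfectly, so both scores are defined as 1.0 in that case.
    iou = tp / union if union else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    return {"iou": iou, "f1": f1}

scores = localization_scores([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

Averaging the per-image scores within each technique category then shows how detection quality varies between traditional, CNN-based, GAN-based, and diffusion-based manipulations.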

The three tampering type categories (replacement, filling, removal) represent distinct semantic inpainting scenarios. Replacement inpainting changes the content of a region while preserving the overall image structure — for example, changing the text on a sign or the expression on a face. Filling inpainting adds new content to a previously empty or simple region — for example, adding an object to a plain background. Removal inpainting erases an existing object and fills the region with background content — the most common consumer application of inpainting technology.

Four Inpainting Technique Categories Covered

Traditional Methods: Non-learning approaches including patch-based synthesis (exemplar-based inpainting), PDE-based diffusion propagation (unrelated to the diffusion models described below), and texture synthesis methods. These methods produce characteristic artifacts including visible seams, repeated texture patterns, and structural inconsistencies at region boundaries. Traditional method detection is generally easier for modern forensic models.

CNN-Based Methods: Early deep learning inpainting using convolutional encoder-decoder architectures trained on paired (corrupted, original) image pairs. Key methods in this category include Context Encoder, Partial Convolution, and DeepFill v1. CNN artifacts include blurring at region boundaries, checkerboard artifacts from transposed convolutions, and color inconsistencies.

GAN-Based Methods: Inpainting models using adversarial training to produce photorealistic completions. Key methods include DeepFill v2, EdgeConnect, MAT, and Co-Modulated GAN. GAN artifacts are harder to detect than CNN artifacts — hallucinated textures may be locally plausible but globally inconsistent, and adversarial training explicitly optimizes against discriminator-based detection.

Diffusion Model-Based Methods: The most challenging category for detection. Methods including Stable Diffusion Inpainting, SDXL Inpainting, DALL-E 2 Inpainting, BrushNet, and PowerPaint produce extremely high-quality completions that are often indistinguishable from authentic image regions to human observers. The forensic challenge is that diffusion models generate diverse, semantically coherent content that leaves few systematic artifacts.

Three Tampering Types Explained

The replacement tampering type covers scenarios where existing image content is replaced with different content of the same semantic category — an important real-world manipulation scenario in disinformation, evidence tampering, and document forgery contexts. Replacement manipulation is particularly challenging to detect because the inpainted region may be semantically consistent with its surroundings even though the content has been changed.

The filling tampering type covers scenarios where empty, damaged, or missing regions are filled with new content. This is the classic image restoration application — filling in damaged areas of historical photographs, restoring corrupted image regions, or completing partially occluded objects. Filling manipulations are somewhat easier to detect because the boundary between original and generated content often crosses structural image features.

The removal tampering type covers scenarios where an existing object is erased and the region is filled with background content. This is currently the most popular consumer application of image inpainting — the "remove object" feature in Google Photos, Samsung's Object Eraser, and Apple's Clean Up tool all implement removal inpainting. Removal artifacts can be detected by structural inconsistencies (shadows without objects, reflections without sources) or by statistical distribution differences between generated background and authentic background regions.

Strengths and Limitations

Strengths:

  • The only large-scale dataset specifically designed for inpainting detection research spanning all four major inpainting technology categories

  • Balanced design with 8,000 images per technique category prevents bias toward any single method

  • Three tampering type categories reflect diverse real-world manipulation scenarios

  • High-quality construction with carefully crafted images reflecting real-world challenges

  • Pixel-level ground truth masks enable localization evaluation beyond binary detection

Limitations:

  • Focused exclusively on inpainting detection — not suitable for training generation quality models

  • No public stable download link confirmed at time of writing — access via research community channels or paper supplementary materials

  • Newer diffusion models released after the dataset construction may not be represented in the diffusion category

  • Limited documentation on the specific methods used within each category

How to Download and Use

Inpaint32K is referenced in the InpDiffusion paper (arXiv:2501.02816) and related forensics research as publicly accessible. Contact the authors through the correspondence address listed in the paper, or search for the dataset by name on GitHub or HuggingFace. The dataset is designed for standard binary classification and segmentation evaluation pipelines: load (original, tampered, mask) triplets and evaluate detection models using standard forensic metrics.

Evaluation Metrics Commonly Used With This Dataset

  • AUC (Area Under ROC Curve) — standard binary classification metric for detection tasks

  • F1 Score — harmonic mean of precision and recall for tampered region detection

  • IoU (Intersection over Union) — pixel-level localization accuracy

  • mAP (mean Average Precision) — detection performance across multiple IoU thresholds

  • Pixel-level Accuracy — fraction of pixels correctly classified as inpainted or authentic
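As a rough illustration of how the metrics above are computed, the following sketch scores one predicted localization mask against its ground truth and computes an image-level AUC from detection scores. It is a minimal pure-Python version written for this article, not the evaluation code of any particular paper. The function names `localization_scores` and `roc_auc` are hypothetical, masks are assumed to be nested lists with 1 marking inpainted pixels, and returning 1.0 when both masks are empty is a convention chosen here.

```python
def localization_scores(pred_mask, true_mask):
    """Pixel-level F1, IoU, and accuracy for binary masks (1 = inpainted)."""
    tp = fp = fn = tn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    # Assumed convention: empty prediction on an empty ground truth counts as perfect.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    acc = (tp + tn) / total if total else 1.0
    return {"f1": f1, "iou": iou, "pixel_accuracy": acc}


def roc_auc(scores, labels):
    """AUC as the probability that a tampered image outscores an authentic one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Ties count as half a win (Mann-Whitney formulation of AUC).
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of images, which is fine for a sanity check but slow for a full benchmark run; a library implementation is the better choice there.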

Research Papers That Use This Dataset

Inpaint32K is used as a primary evaluation benchmark in InpDiffusion (Conditional Diffusion Models for Image Inpainting Localization, arXiv:2501.02816), which proposes a diffusion-based model for detecting and localizing inpainted regions. It is also referenced alongside the DID dataset and AutoSplice dataset in comparative evaluations of inpainting forensics methods. The dataset fills a gap left by earlier forensics datasets like DID (10 methods, 1,000 images each) and NIST16 (564 images, multiple manipulation types), which predate modern GAN and diffusion inpainting methods.

Final Year Project Angle

Inpaint32K is the ideal dataset for final year projects at the intersection of computer vision and digital forensics, AI safety, or media integrity. A compelling project could train an inpainting detection model that generalizes across all four technique categories — testing whether a detector trained on CNN and GAN artifacts can detect diffusion model artifacts without retraining (a critical real-world requirement given the rapid pace of inpainting model development). Another strong angle is a comparative study of detection difficulty across the four categories — quantifying exactly how much harder diffusion model inpainting is to detect than traditional method inpainting, and identifying which visual features most reliably distinguish each category. Students interested in the societal implications of AI could frame a project around the "inpainting arms race" — evaluating whether improvements in inpainting quality inevitably outpace improvements in detection capability.


Dataset 4: FFHQ — Flickr-Faces-HQ

Overview and Origin

Flickr-Faces-HQ (FFHQ) is a high-quality human face image dataset created by NVIDIA Research and released alongside the StyleGAN paper (Karras et al., 2019). Originally designed as a training dataset for generative adversarial network-based face synthesis research, FFHQ rapidly became the standard benchmark dataset for all tasks involving human face image generation, restoration, and inpainting. Its combination of high resolution (1,024×1,024 pixels), large scale (70,000 images), and exceptional visual diversity has made it the definitive face dataset for deep learning research.

FFHQ was assembled by crawling Flickr for photographs licensed under permissive Creative Commons licenses and filtering for images containing detectable human faces. The images underwent automatic alignment and cropping to center and standardize face positions across the dataset. The resulting collection spans an extraordinary range of human diversity — covering multiple ethnicities, age groups from infants to elderly individuals, genders, face shapes, skin tones, and accessories including glasses, hats, masks, and jewelry. Lighting conditions range from professional studio photography to natural outdoor illumination to challenging artificial light sources.

Official Download: https://github.com/NVlabs/ffhq-dataset
License: Creative Commons BY-NC-SA 4.0 (non-commercial research use)
Citation: Karras et al., "A Style-Based Generator Architecture for Generative Adversarial Networks," CVPR 2019

Dataset Statistics

  • Total images: 70,000 high-quality PNG face photographs

  • Resolution: 1,024×1,024 pixels (also available in 128×128 and 256×256 downsampled versions)

  • Format: PNG (lossless) and TFRECORDS versions available

  • Face alignment: All faces automatically aligned and cropped to standard position

  • Visual diversity: Multiple ethnicities, ages, genders, accessories, and lighting conditions

  • Source: Flickr photographs under Creative Commons licenses

  • Standard split: 60,000 training / 10,000 validation

Dataset Structure

FFHQ images are organized into subdirectories by index range, with each subdirectory containing 1,000 PNG images. The GitHub repository provides download scripts for both the full 1,024×1,024 resolution dataset and the downsampled variants. Metadata JSON files provide per-image information including the original Flickr source URL, license type, face detection bounding box coordinates, and image quality indicators.
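The index-range layout described above can be turned into a small path helper. This is a sketch under stated assumptions: the root folder name `images1024x1024` and the zero-padded five-digit file names are assumptions about the downloaded layout, and `ffhq_image_path` and `ffhq_split` are hypothetical helper names. The split rule follows the 60,000 / 10,000 division listed in the statistics.

```python
from pathlib import Path


def ffhq_image_path(index, root="images1024x1024"):
    """Return the relative path of an FFHQ image given its 0-based index.

    Assumes each subdirectory holds 1,000 images and is named after the
    first index it contains (00000, 01000, ..., 69000).
    """
    if not 0 <= index < 70_000:
        raise ValueError("FFHQ indices run from 0 to 69999")
    subdir = f"{index // 1000 * 1000:05d}"
    return Path(root) / subdir / f"{index:05d}.png"


def ffhq_split(index):
    """First 60,000 images are training, the remaining 10,000 validation."""
    return "train" if index < 60_000 else "validation"
```

Resolving paths from the index rather than globbing the directory keeps the train and validation sets reproducible across machines.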

For inpainting research, FFHQ images are used directly as the original ground truth targets. Masks are generated synthetically and applied to the aligned face images to create corrupted inputs. The face alignment ensures that masks applied to a consistent facial grid — for example, an eye region mask or mouth region mask — consistently target the same semantic face regions across all images in the dataset, enabling semantically meaningful face component inpainting experiments.

Why Face Data Needs a Dedicated Inpainting Dataset

Human faces present unique challenges for image inpainting that general-purpose datasets like MS-COCO cannot adequately address. The human visual system is extraordinarily sensitive to face appearance — far more sensitive than to most other object categories. Slight errors in facial geometry, skin texture, eye symmetry, or expression are immediately perceptible to human observers, even when equivalent errors in background textures or objects would pass unnoticed. This high perceptual sensitivity means that face inpainting requires a fundamentally different quality standard than general image inpainting.

Faces also have strong structural priors: the spatial relationship between eyes, nose, mouth, and jaw follows tight statistical constraints, and human observers have internalized those constraints through a lifetime of looking at faces. A model that fails to respect them produces uncanny valley effects that are immediately noticeable. General inpainting models trained on diverse image collections often fail on faces because faces make up a small fraction of their training data and the structural constraints are not sufficiently reinforced.

FFHQ provides the scale, resolution, and diversity needed to train face-specific inpainting models that respect these structural constraints and produce reconstructions that pass human perceptual scrutiny. Models trained on FFHQ develop robust face priors — understanding of facial geometry, skin texture variation, lighting interaction with facial features, and the spatial relationships between face components — that generalize reliably across the diverse face appearances seen in real-world applications.

Mask Strategies for Face Inpainting

Face inpainting research uses several distinct mask strategies depending on the application:

Eye region masks are used for eye inpainting and restoration tasks, covering one or both eyes along with the surrounding region. They are challenging because the model must reconstruct the precise geometry and appearance of a specific person's eyes from the visible face context.

Mouth and lower face masks simulate face mask removal scenarios (relevant post-COVID) and are used for facial expression completion tasks. The model must infer the hidden lower face from the visible upper face and hair context.

Irregular face region masks simulate damage to face photographs (scratches, tears, water damage) and are the primary benchmark for face photo restoration applications. They use the same irregular free-form stroke generation methodology as general inpainting datasets.

Large center masks covering 25%–50% of the face test global face structure understanding: the model must reconstruct a large contiguous face region from only the peripheral face and hair context. These are among the hardest face inpainting tasks.

Accessory removal masks cover glasses, hats, or other accessories following their semantic boundaries, simulating accessory removal from face photographs.
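Because FFHQ faces are aligned, the region-based strategies above can be approximated with fixed boxes on the image grid. The sketch below is illustrative only: the `REGIONS` fractions are placeholder values invented for this example, not measured FFHQ landmark positions, and the function names are hypothetical. Masks are nested lists with 1 marking the missing region.

```python
import math


def box_mask(size, top, left, height, width):
    """Binary mask (1 = missing) with one rectangular hole on a size x size grid."""
    return [[1 if top <= r < top + height and left <= c < left + width else 0
             for c in range(size)] for r in range(size)]


def center_mask(size, area_fraction):
    """Centered square hole covering roughly `area_fraction` of the image."""
    side = round(size * math.sqrt(area_fraction))
    start = (size - side) // 2
    return box_mask(size, start, start, side, side)


# Hypothetical region boxes as fractions of image size: (top, left, height, width).
# Placeholder values; tune them against real aligned faces before use.
REGIONS = {
    "eyes": (0.35, 0.20, 0.15, 0.60),
    "mouth": (0.65, 0.30, 0.20, 0.40),
}


def region_mask(size, name):
    """Rectangular mask over a named face region of an aligned face image."""
    top, left, height, width = (round(f * size) for f in REGIONS[name])
    return box_mask(size, top, left, height, width)
```

For example, `center_mask(1024, 0.25)` produces a 512×512 hole, the lower end of the 25%–50% range mentioned above. Accessory removal masks cannot be produced this way, since they follow object boundaries and need a segmentation model or manual annotation.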

Strengths and Limitations

Strengths:

  • 1,024×1,024 resolution is high for a face benchmark (only DIV2K in this list is larger), enabling evaluation of fine-grained detail reconstruction

  • 70,000 images provide sufficient scale for training large face-specialized generative models

  • Exceptional visual diversity ensures models trained on FFHQ generalize across human appearance variation

  • Face alignment standardization enables semantically consistent mask application across the dataset

  • Universal benchmark status in face synthesis and restoration research enables direct comparison with published results

  • Lossless PNG format preserves full image quality for high-fidelity reconstruction evaluation

Limitations:

  • Non-commercial license (CC BY-NC-SA 4.0) restricts use in commercial applications

  • Single domain — human faces only — makes FFHQ unsuitable as a standalone dataset for general inpainting evaluation

  • Face alignment and cropping means all images have a similar composition, which may make some inpainting challenges artificially easy (the model always knows where the face is) or artificially hard (no background context variety)

  • Privacy considerations around face datasets are increasingly significant — some research communities are moving toward synthetic face datasets to avoid privacy concerns

How to Download and Use

FFHQ is available for download from the official GitHub repository at https://github.com/NVlabs/ffhq-dataset. The repository provides Python download scripts that fetch images directly from Google Drive or from individual Flickr sources. The full 1,024×1,024 dataset is approximately 89 GB. Lower-resolution variants (128×128 thumbnails: about 1.95 GB; 256×256: 5.4 GB) are available for experiments with limited storage or compute. HuggingFace Datasets also hosts FFHQ with a simple loading interface.

Evaluation Metrics Commonly Used With This Dataset

  • FID: computed between inpainted outputs and the FFHQ validation set distribution

  • PSNR: pixel reconstruction accuracy for masked region evaluation

  • SSIM: structural similarity with emphasis on facial geometry preservation

  • LPIPS: perceptual similarity computed from pretrained deep network features (typically VGG or AlexNet)

  • Identity Similarity: uses a face recognition model to measure whether the reconstructed face preserves the identity of the original; specifically relevant for face restoration applications. It is sometimes abbreviated IS, which should not be confused with Inception Score

Research Papers That Use This Dataset

FFHQ is used as a primary benchmark in virtually every face image synthesis and face inpainting paper. Key inpainting papers using FFHQ include: MAT (Mask-Aware Transformer for Large Hole Image Inpainting), Co-Modulated GAN, LaMa (face experiments), Stable Diffusion face inpainting evaluations, and probabilistic inpainting frameworks including the diverse inference paper referenced in the Frontiers AI 2025 survey. StyleGAN, StyleGAN2, and StyleGAN3 — all trained on FFHQ — serve as generator backbones in multiple inpainting frameworks that leverage StyleGAN's face prior for reconstruction.

Final Year Project Angle

FFHQ is the ideal dataset for final year projects involving face restoration, face editing, identity-preserving inpainting, or privacy-aware face manipulation detection. A compelling project could build a face occlusion removal system — training a model to reconstruct faces partially covered by physical objects (hands, masks, glasses) using FFHQ with semantic occlusion masks. Identity-preserving face inpainting is another strong angle — training a model that reconstructs missing face regions while maintaining the identity of the original person as measured by face recognition metrics. Students in security and privacy could build an inpainting detection system specifically tuned for FFHQ-style face manipulations, evaluating whether face-specific forensic features (eye geometry consistency, skin texture statistics, illumination coherence) outperform general inpainting detection methods on face manipulation detection tasks.


Dataset 5: DIV2K — Diverse 2K Resolution Image Dataset

Overview and Origin

DIV2K (Diverse 2K) is a high-resolution image dataset originally created for the NTIRE (New Trends in Image Restoration and Enhancement) Challenge, a workshop associated with CVPR that focuses on image super-resolution, denoising, and restoration tasks. Published by Timofte et al. in 2017, DIV2K was designed to address a critical limitation of earlier image restoration benchmarks — their use of low-resolution or heavily compressed source images that limited the evaluation of fine-grained texture reconstruction.

The "2K" in DIV2K refers to the dataset's defining characteristic: all images are at 2K resolution (approximately 2,040×1,404 pixels on average, with many images exceeding 2,000 pixels on the longer dimension). This high resolution makes DIV2K uniquely valuable for evaluating inpainting models on tasks that require fine texture detail reconstruction — where the difference between a mediocre and an excellent inpainting result is visible only at full resolution, not in downsampled previews.

DIV2K's visual diversity is exceptional for its size. The 1,000 images span natural landscapes, cityscapes, architectural photography, portraits, wildlife, food photography, abstract textures, and cultural artifacts. The curators deliberately selected images with high visual complexity — rich textures, detailed structures, and diverse color distributions — specifically to challenge image restoration models that tend to produce overly smooth reconstructions when operating on complex fine-grained content.

Official Download: https://data.vision.ee.ethz.ch/cvl/DIV2K/
License: Research and educational use
Citation: Timofte et al., "NTIRE 2017 Challenge on Single Image Super-Resolution," CVPRW 2017

Dataset Statistics

  • Total images: 1,000 (800 training + 100 validation + 100 test)

  • Resolution: 2K — approximately 2,040×1,404 average (many images larger)

  • Visual diversity: Natural scenes, cityscapes, architecture, portraits, wildlife, textures, food, cultural subjects

  • Format: PNG (lossless, no compression artifacts)

  • Degradation tracks: Multiple (bicubic downscaling, realistic degradation, blur, noise, compression)

  • NTIRE Challenge versions: 2017, 2018, 2019, 2020, 2021 with different degradation tracks per year

Dataset Structure

DIV2K is organized into high-resolution (HR) source images and low-resolution (LR) degraded counterparts generated using different degradation tracks for super-resolution research. For inpainting research, the HR images serve as the ground truth targets, and synthetic inpainting masks are applied to generate corrupted inputs. The 800/100/100 train/validation/test split is standard and widely used in the literature.

Multiple degradation variants of the dataset exist: bicubic downscaling (the most common), mild and wild realistic degradation including blur, noise, and JPEG compression, and unknown degradation for blind restoration evaluation. For inpainting specifically, the clean HR images are used as ground truth without applying the super-resolution degradation tracks.

Why High Resolution Matters for Inpainting

The importance of high-resolution data for inpainting research is not immediately obvious but is significant in practice. When an inpainting model is evaluated at 256×256 resolution (the native resolution of many earlier benchmark datasets), even moderate blurring or texture smoothing in the reconstructed region may not be detectable by standard metrics. At 2K resolution, the same quality difference is clearly visible and measurable — fine-grained texture inconsistencies, frequency domain artifacts, and structural detail loss are all exposed at high resolution in ways that low-resolution evaluation cannot reveal.

This matters for practical applications. Real-world inpainting use cases — professional photography restoration, medical image enhancement, satellite imagery completion, product photography editing — require high-resolution outputs. A model that achieves strong PSNR scores on 256×256 COCO images may produce noticeably blurry or artifact-laden results when applied to 2K resolution inputs. DIV2K provides the resolution benchmark needed to evaluate and develop models that genuinely perform well at the resolutions required by practical applications.

The texture richness of DIV2K images amplifies this effect further. A complex natural texture — forest foliage, water reflections, architectural stonework, fabric patterns — requires the model to generate fine-grained stochastic patterns that are statistically consistent with the surrounding texture while being geometrically seamless at region boundaries. This is qualitatively harder than filling in smooth backgrounds or simple object surfaces, and DIV2K's deliberate selection of visually complex images ensures that inpainting models are challenged on this dimension specifically.

Degradation Types and Mask Strategies

For inpainting research, DIV2K images are used with synthetically generated masks rather than the super-resolution degradation tracks. The mask strategies used with DIV2K typically include irregular free-form masks (consistent with the general inpainting benchmark approach), large center masks (to evaluate global context understanding at high resolution), and texture-aware masks that specifically target high-frequency texture regions identified by edge detection or frequency analysis.
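Irregular free-form masks are usually drawn as random thick brush strokes. The sketch below is one simple way to do this in pure Python; it is not the mask generator of any specific paper, and every parameter (`strokes`, `max_segments`, `max_len`, `brush`) is an assumption chosen for illustration. Masks are nested lists with 1 marking missing pixels.

```python
import math
import random


def free_form_mask(height, width, strokes=4, max_segments=12, max_len=40,
                   brush=6, seed=None):
    """Binary mask (1 = missing) drawn as random thick polyline strokes."""
    rng = random.Random(seed)
    mask = [[0] * width for _ in range(height)]

    def stamp(cy, cx):
        # Paint a filled disc of radius `brush` centered on (cy, cx).
        for y in range(max(0, cy - brush), min(height, cy + brush + 1)):
            for x in range(max(0, cx - brush), min(width, cx + brush + 1)):
                if (y - cy) ** 2 + (x - cx) ** 2 <= brush ** 2:
                    mask[y][x] = 1

    for _stroke in range(strokes):
        y, x = rng.randrange(height), rng.randrange(width)
        angle = rng.uniform(0, 2 * math.pi)
        for _segment in range(rng.randint(1, max_segments)):
            angle += rng.uniform(-math.pi / 4, math.pi / 4)  # gentle turns
            length = rng.randint(5, max_len)
            for step in range(length):
                # Walk along the segment, clamping to the image borders.
                cy = min(height - 1, max(0, round(y + step * math.sin(angle))))
                cx = min(width - 1, max(0, round(x + step * math.cos(angle))))
                stamp(cy, cx)
            y, x = cy, cx  # next segment starts where this one ended
    return mask


def hole_ratio(mask):
    """Fraction of pixels marked as missing."""
    return sum(map(sum, mask)) / (len(mask) * len(mask[0]))
```

Fixing `seed` per image makes the masks reproducible, which matters for DIV2K because the dataset ships no official inpainting masks. Reporting `hole_ratio` alongside results lets readers compare against other mask protocols.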

Some researchers have combined DIV2K with mixed degradation — applying both inpainting masks and realistic degradations (blur, noise, compression) simultaneously to evaluate models on the combined restoration task of inpainting under degraded conditions. This multi-task restoration scenario is particularly relevant for historical photograph restoration applications where images may be both damaged (requiring inpainting) and degraded (requiring denoising and sharpening).
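A combined corruption of this kind can be sketched in a few lines. The example below adds Gaussian noise to the visible pixels of a grayscale image and blanks out the masked ones. It is a toy illustration with assumed parameter names (`noise_sigma`, `fill`) and an assumed order of operations, not a reproduction of any published degradation pipeline; blur and JPEG compression are omitted.

```python
import random


def corrupt(image, mask, noise_sigma=10.0, fill=0, seed=None):
    """Add Gaussian noise to a grayscale image, then blank out masked pixels.

    `image` and `mask` are nested lists of equal shape; mask value 1 marks a hole.
    """
    rng = random.Random(seed)
    out = []
    for img_row, mask_row in zip(image, mask):
        row = []
        for pixel, m in zip(img_row, mask_row):
            if m:
                row.append(fill)  # hole the model has to inpaint
            else:
                noisy = pixel + rng.gauss(0, noise_sigma)
                row.append(min(255, max(0, round(noisy))))  # clamp to 8-bit range
        out.append(row)
    return out
```

The clean high-resolution image remains the ground truth, so a model evaluated on this input is scored on inpainting and denoising at the same time.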

Strengths and Limitations

Strengths:

  • The only standard inpainting benchmark at 2K resolution — essential for evaluating high-resolution reconstruction quality

  • Lossless PNG format eliminates compression artifacts that would confound inpainting quality evaluation

  • Exceptional texture diversity specifically selected to challenge fine-grained reconstruction models

  • Well-established benchmark with NTIRE challenge history dating back to 2017 and thousands of citing papers

  • Reasonable dataset size (1,000 images total) makes it computationally tractable for evaluation even without GPU clusters

Limitations:

  • Only 800 training images — insufficient for training large inpainting models from scratch without augmentation or pre-training on larger datasets

  • No built-in inpainting masks — must be generated separately, which can introduce variability across papers if mask generation protocols differ

  • No semantic annotations — cannot be used for object removal or semantically guided inpainting evaluation

  • The original NTIRE super-resolution focus means many papers use DIV2K only for combined super-resolution + inpainting experiments, limiting the number of pure inpainting baselines available for comparison

How to Download and Use

DIV2K is available for direct download from the ETH Zürich Computer Vision Lab at https://data.vision.ee.ethz.ch/cvl/DIV2K/. The dataset is organized by degradation track — download the high-resolution training and validation image archives for inpainting research. Individual archives are 3–5 GB each. The dataset is also available on HuggingFace Datasets. No registration or license agreement is required for research use.

Evaluation Metrics Commonly Used With This Dataset

  • PSNR: the primary metric for DIV2K, inherited from its super-resolution benchmark origins. Measured in dB; higher is better.

  • SSIM: structural similarity measurement particularly sensitive to texture and edge fidelity at high resolution

  • LPIPS: perceptual quality metric especially important for evaluating fine texture reconstruction at 2K resolution

  • NIQE (Natural Image Quality Evaluator): no-reference quality metric that does not require a ground truth image; useful for scoring the naturalness of individual outputs from probabilistic inpainting models, where no single reference completion exists

Research Papers That Use This Dataset

DIV2K appears as a benchmark in a wide range of image restoration papers that include inpainting as one of multiple restoration tasks. Key papers using DIV2K for inpainting evaluation include the probabilistic inpainting framework published in Frontiers in AI (2025), which specifically chose DIV2K for its ability to evaluate high-resolution reconstruction fidelity. LaMa and its variants include DIV2K in their multi-dataset evaluation suite. Papers combining super-resolution and inpainting — a growing research direction — use DIV2K as their primary benchmark because it supports both tasks from a single dataset. The NTIRE 2021 and 2022 challenge tracks include inpainting-related restoration tasks evaluated on DIV2K and its derivatives.

Final Year Project Angle

DIV2K is the right dataset for final year projects focused on high-resolution image quality, texture synthesis, or the combined restoration problem. A compelling project could investigate the resolution generalization of modern inpainting models — training on lower-resolution data (MS-COCO at 640×480) and evaluating on DIV2K at 2K resolution, measuring the performance gap and proposing frequency-domain or progressive upsampling strategies to bridge it. Another strong angle is texture-specific inpainting — using DIV2K's rich texture content to evaluate and compare texture synthesis components of different inpainting architectures, measuring whether Fourier-based approaches (LaMa) outperform spatial attention approaches (MAT) on high-frequency texture reconstruction specifically. Students in medical imaging or satellite remote sensing could adapt DIV2K's high-resolution framework to domain-specific high-resolution inpainting tasks, using DIV2K as a pre-training source before fine-tuning on domain data.


Image Inpainting Metrics Explained

Evaluating image inpainting models is not straightforward. A perfect reconstruction is by definition impossible — the model is generating content for regions where no ground truth pixel information exists in the input. The "correct" answer is one of many possible plausible completions, and different metrics capture different aspects of what "good" means in this context. Understanding the strengths and limitations of each metric is essential for interpreting research results and designing your own evaluation protocol.

PSNR — Peak Signal-to-Noise Ratio

PSNR measures the ratio between the maximum possible pixel value and the mean squared error between the reconstructed and original image. It is computed as PSNR = 10 · log₁₀(MAX²/MSE), where MAX is the maximum pixel value (255 for 8-bit images) and MSE is the mean squared error per pixel. Higher PSNR values indicate more accurate pixel-level reconstruction. PSNR is simple, fast to compute, and universally reported, making it the standard for direct cross-paper comparison. Its limitation is that it measures pixel-level accuracy rather than perceptual quality — a reconstruction that is slightly blurred across the entire missing region may have better PSNR than one that is sharp but slightly spatially misregistered, even though human observers would prefer the sharp reconstruction.
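The formula translates directly into code. The following is a minimal pure-Python sketch for grayscale images stored as nested lists; the function name `psnr` and the list-based image format are choices made for this example. A real pipeline would normally use an array library.

```python
import math


def psnr(original, reconstructed, max_value=255):
    """PSNR in dB between two equally sized grayscale images (lists of rows)."""
    squared_error = 0.0
    count = 0
    for row_a, row_b in zip(original, reconstructed):
        for a, b in zip(row_a, row_b):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images: no noise to measure
    return 10 * math.log10(max_value ** 2 / mse)
```

As a worked example, an error of 10 gray levels on every pixel gives MSE = 100, so PSNR = 10 · log₁₀(255² / 100) = 20 · log₁₀(25.5), roughly 28.1 dB.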

SSIM — Structural Similarity Index

SSIM measures the structural similarity between two images by comparing local luminance, contrast, and structure patterns. It is computed as SSIM(x,y) = [l(x,y)]^α · [c(x,y)]^β · [s(x,y)]^γ where l, c, and s measure luminance, contrast, and structural similarity respectively. SSIM ranges from -1 to 1, with 1 indicating perfect structural similarity. SSIM is more aligned with human visual perception than PSNR because it explicitly models the structure of image content rather than treating all pixels as independent measurements. It is particularly sensitive to edge sharpness, texture regularity, and structural geometry — making it a better metric for evaluating inpainting in regions with clear structural content like urban architecture or facial features.
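To make the formula concrete, here is a simplified single-window SSIM in pure Python. It assumes α = β = γ = 1 and the common simplification C₃ = C₂/2, which merges the contrast and structure terms into one factor. Practical SSIM implementations instead average the index over many local windows, so this global version is a teaching sketch rather than a drop-in metric. The constants `k1 = 0.01` and `k2 = 0.03` are the conventional defaults.

```python
def global_ssim(x, y, max_value=255, k1=0.01, k2=0.03):
    """Single-window SSIM over two equally sized grayscale images (flat lists)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * max_value) ** 2  # stabilizes the luminance term
    c2 = (k2 * max_value) ** 2  # stabilizes the contrast/structure term
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return luminance * contrast_structure
```

Identical inputs score 1, while an input with inverted contrast (same mean, negated deviations) scores below zero, which is why the range extends to -1.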

LPIPS — Learned Perceptual Image Patch Similarity

LPIPS measures perceptual similarity between image patches using feature representations extracted from a pretrained deep network (typically VGG or AlexNet). Rather than comparing pixel values directly, LPIPS compares the deep feature activations produced by both the reconstructed and original image at multiple network layers, measuring how similar they are in the learned feature space. LPIPS is substantially better correlated with human perceptual judgments than PSNR or SSIM — particularly for evaluating generative models that produce plausible but not pixel-exact reconstructions. Lower LPIPS values indicate higher perceptual similarity. LPIPS is increasingly the preferred quality metric in modern inpainting papers and should be included in any serious evaluation protocol.

FID — Fréchet Inception Distance

FID measures the statistical distance between the distribution of generated image features and the distribution of real image features using multivariate Gaussian approximations in Inception network feature space. FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)), where μ and Σ are the mean and covariance of the real (r) and generated (g) feature distributions. Lower FID indicates that generated images are drawn from a distribution more similar to the real image distribution. FID is a dataset-level metric — it cannot be computed for a single image pair but requires a large set of generated samples. It is the primary metric for evaluating the realism and diversity of generative models at a population level, capturing whether the model's outputs are plausible images drawn from the correct distribution rather than just measuring reconstruction accuracy for specific training examples.
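The structure of the formula is easiest to see in a simplified case. The sketch below computes the Fréchet distance between two Gaussians under the assumption that both covariance matrices are diagonal, where the trace term reduces to a per-dimension sum of σ_r² + σ_g² − 2σ_rσ_g. This is a deliberate simplification for illustration: real FID uses full covariance matrices of 2048-dimensional Inception features and a matrix square root. The function names are hypothetical.

```python
import math


def mean_and_variance(features):
    """Per-dimension mean and unbiased variance of a list of feature vectors."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / (n - 1)
           for d in range(dim)]
    return mean, var


def frechet_distance_diag(real_features, generated_features):
    """Fréchet distance between two Gaussians with diagonal covariance."""
    mu_r, var_r = mean_and_variance(real_features)
    mu_g, var_g = mean_and_variance(generated_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # Diagonal case of Tr(S_r + S_g - 2 (S_r S_g)^(1/2)).
    trace_term = sum(vr + vg - 2 * math.sqrt(vr * vg)
                     for vr, vg in zip(var_r, var_g))
    return mean_term + trace_term
```

Shifting every generated feature by one unit along a single dimension leaves the variances unchanged and yields a distance of exactly 1, coming entirely from the mean term.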

Masked LPIPS for Inpainting Evaluation

For inpainting specifically, LPIPS is often computed only over the masked region rather than the full image — focusing the perceptual quality measurement on the inpainted content rather than the unchanged visible regions. This masked LPIPS variant better captures the quality of the model's generative output in the region that actually matters for evaluation.
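The masking step itself is metric-agnostic: given any per-pixel distance map, average only the entries that fall inside the hole. The sketch below assumes such a map is already available as a nested list; how it is produced (for instance by a perceptual network that outputs spatial distances) is outside the scope of this example. The name `masked_mean` is hypothetical.

```python
def masked_mean(distance_map, mask):
    """Average a per-pixel distance map over pixels where mask == 1."""
    total = 0.0
    count = 0
    for dist_row, mask_row in zip(distance_map, mask):
        for d, m in zip(dist_row, mask_row):
            if m:
                total += d
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return total / count
```

With a distance map of `[[0.0, 0.8], [0.0, 0.4]]` and a mask covering the right column, the masked score is about 0.6, whereas the full-image average would be 0.3. The unchanged pixels dilute the score, which is exactly what the masked variant avoids.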

NIQE: Natural Image Quality Evaluator

NIQE is a no-reference (blind) image quality metric that does not require a clean reference image for comparison. It models the statistical properties of natural images using a multivariate Gaussian model fitted on a corpus of natural scene patches, and measures how much a test image deviates from this natural image model. Lower NIQE scores indicate more natural-looking images. NIQE is valuable for evaluating probabilistic inpainting models that generate diverse completions, where multiple plausible outputs exist and there is no single "correct" reconstruction to compare against.

Which Metric to Use for Which Dataset

| Dataset | Primary Metric | Secondary Metrics | Special Metric |
| --- | --- | --- | --- |
| ParisStreetView-RandomMasks | LPIPS | PSNR, SSIM | FID (scene distribution) |
| MS-COCO | FID | LPIPS, PSNR | CLIP Score (text-guided) |
| Inpaint32K | AUC (detection) | F1, IoU | Pixel Accuracy |
| FFHQ | FID | LPIPS, SSIM | Identity Similarity |
| DIV2K | PSNR | SSIM, LPIPS | NIQE (no-reference) |

Comparison Table

| Attribute | ParisStreetView-RandomMasks | MS-COCO | Inpaint32K | FFHQ | DIV2K |
| --- | --- | --- | --- | --- | --- |
| Total Images | 22,601 | 328,000+ | 32,000 | 70,000 | 1,000 |
| Resolution | Street-view standard | Variable (typically ≤720p) | Variable | 1,024×1,024 | 2K (~2040×1404) |
| Domain | Urban street scenes | General objects/scenes | General (multi-method) | Human faces | Diverse natural scenes |
| Mask Type | Irregular free-form | Custom/semantic/irregular | Region masks | Custom/semantic | Custom synthetic |
| Primary Use | Generation/training | Generation/benchmark | Detection/forensics | Face generation/restoration | High-res generation |
| Masks Included | Yes (pre-generated) | No (generate separately) | Yes (ground truth) | No (generate separately) | No (generate separately) |
| Corrupted Images | Yes (included) | No | Yes (tampered) | No | No |
| Semantic Annotations | No | Yes (segmentation, captions) | Partial | No | No |
| License | Open (Zenodo) | CC BY 4.0 | Research | CC BY-NC-SA 4.0 | Research/educational |
| Download Size | 12.7 GB | ~25 GB (train+val) | Not specified | ~89 GB (full res) | ~7 GB (HR train+val) |
| FYP Suitability | High | Very High | High (forensics) | High (face tasks) | Moderate–High |

How to Choose the Right Dataset for Your Project

Choose ParisStreetView-RandomMasks if: Your project involves urban scene completion, diffusion model training on domain-specific data, or irregular mask inpainting evaluation. Among the generation-oriented datasets in this list, it is the only one that comes with pre-generated masks and corrupted images (Inpaint32K also ships masks and tampered images, but for detection rather than generation), making it the fastest to set up for supervised training experiments.

Choose MS-COCO if: You need your results to be directly comparable with published methods, or if your project involves semantic object removal, text-guided inpainting, or diverse scene understanding. COCO is one of the most widely recognized benchmarks in computer vision, so results reported on it are easy for reviewers to interpret.

Choose Inpaint32K if: Your project is at the intersection of inpainting and forensics, digital security, AI safety, or media integrity. This is the only dataset specifically designed for inpainting detection research across multiple technique categories.

Choose FFHQ if: Your project involves face restoration, face editing, identity-preserving reconstruction, or face manipulation detection. No other dataset provides FFHQ's combination of scale, resolution, and face diversity.

Choose DIV2K if: Your project requires high-resolution evaluation, texture synthesis quality assessment, or involves combined restoration tasks (inpainting + super-resolution or inpainting + denoising). DIV2K is essential when resolution and texture fidelity are the primary evaluation dimensions.


Common Inpainting Models Benchmarked on These Datasets

LaMa (Large Mask inpainting): Uses fast Fourier convolutions, which give the network an image-wide receptive field, to handle large irregular masks. A strong and widely used baseline for large-mask inpainting. Benchmarked on all five datasets. GitHub: saic-mdal/lama.

Stable Diffusion Inpainting: Latent diffusion model fine-tuned for inpainting via masked latent reconstruction. Produces high semantic diversity and photorealism. Primarily benchmarked on MS-COCO and FFHQ. Available on HuggingFace.

MAT (Mask-Aware Transformer): Transformer-based inpainting with mask-aware attention that treats masked and unmasked tokens differently. Strong on FFHQ and MS-COCO. Accepted at CVPR 2022.

EdgeConnect: Two-stage model predicting edges first then generating image content guided by edges. Strong on structured scenes. Benchmarked on MS-COCO and Paris StreetView-style datasets. Well-suited for ParisStreetView-RandomMasks urban structure evaluation.

DeepFill v2 (Free-Form Image Inpainting): Gated convolution model specifically designed for free-form user-guided inpainting with irregular masks. Foundational method for irregular mask inpainting. Benchmarked on all major datasets.

Partial Convolution (PConv): Convolution operation that masks out invalid (missing) pixels during feature computation. Foundational architectural innovation for irregular mask inpainting. Introduced by NVIDIA Research, benchmarked on DIV2K, MS-COCO, and FFHQ variants.


How to Prepare These Datasets for Training

Mask Generation Strategies

For datasets without pre-generated masks (MS-COCO, FFHQ, DIV2K), you must generate masks programmatically. The standard approach uses the irregular free-form mask generation algorithm from DeepFill v2: a random-walk stroke simulation with configurable stroke width, turning angle, and total coverage ratio. A reference implementation ships with the DeepFill v2 authors' code (JiahuiYu/generative_inpainting); NVIDIA's partial convolution work (nvidia/partialconv) is instead associated with a fixed, separately distributed set of irregular masks. For object removal experiments on COCO, use pycocotools to render segmentation masks for specific object categories directly from COCO annotations.
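The random-walk stroke idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the DeepFill v2 reference code: the function names, the default stroke parameters, and the "keep drawing strokes until the target coverage is reached" stopping rule are all assumptions made for this example.

```python
import numpy as np

def stamp_disc(mask, cx, cy, radius):
    """Set a filled disc to 1, touching only a local window of the mask."""
    h, w = mask.shape
    y0, y1 = max(0, int(cy) - radius), min(h, int(cy) + radius + 1)
    x0, x1 = max(0, int(cx) - radius), min(w, int(cx) + radius + 1)
    yy, xx = np.ogrid[y0:y1, x0:x1]
    mask[y0:y1, x0:x1][(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 1

def free_form_mask(height, width, target_coverage, rng,
                   max_vertices=8, max_step=48, min_radius=4, max_radius=12):
    """Return a uint8 mask (1 = hole) built from random-walk brush strokes.

    Strokes are added until at least `target_coverage` of the pixels are
    masked, so the final coverage slightly overshoots the target.
    """
    if not 0.0 < target_coverage < 1.0:
        raise ValueError("target_coverage must be in (0, 1)")
    mask = np.zeros((height, width), dtype=np.uint8)
    while mask.mean() < target_coverage:
        x = float(rng.integers(0, width))
        y = float(rng.integers(0, height))
        angle = rng.uniform(0.0, 2.0 * np.pi)
        radius = int(rng.integers(min_radius, max_radius + 1))  # stroke half-width
        for _ in range(int(rng.integers(1, max_vertices + 1))):
            angle += rng.uniform(-np.pi / 3, np.pi / 3)          # turning angle
            step = float(rng.integers(8, max_step + 1))
            nx = float(np.clip(x + step * np.cos(angle), 0, width - 1))
            ny = float(np.clip(y + step * np.sin(angle), 0, height - 1))
            # Stamp overlapping discs along the segment to draw a thick line.
            for t in np.linspace(0.0, 1.0, max(2, int(step) // 2)):
                stamp_disc(mask, x + t * (nx - x), y + t * (ny - y), radius)
            x, y = nx, ny
    return mask

mask = free_form_mask(256, 256, target_coverage=0.3, rng=np.random.default_rng(0))
print(f"coverage: {mask.mean():.2%}")
```

Passing the random generator in explicitly keeps the mask reproducible, which matters when the same masks must be reused across compared methods.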

Data Augmentation for Inpainting

Standard augmentations for inpainting training include random horizontal flipping, random cropping to a fixed training resolution (typically 256×256 for initial training, 512×512 for high-resolution fine-tuning), and random rotation in small angular ranges (±5°). Color jitter augmentation is generally avoided as it may reduce the consistency between the corrupted input and the ground truth target. For mask augmentation specifically, randomly varying mask coverage ratio between 10% and 60% during training produces models that generalize across a wide range of inpainting difficulty levels.
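One detail that is easy to get wrong is that geometric augmentations must be applied identically to the image and its mask. The sketch below illustrates that with a paired random crop and horizontal flip, plus a helper that samples a mask coverage ratio in the 10% to 60% range. All function names are hypothetical, images are assumed to be float arrays of shape (H, W, C) in [0, 1], and masks are assumed to use 1 for hole pixels.

```python
import numpy as np

def paired_crop_flip(image, mask, crop_size, rng):
    """Apply the same random crop and horizontal flip to image and mask."""
    h, w = mask.shape
    top = int(rng.integers(0, h - crop_size + 1))
    left = int(rng.integers(0, w - crop_size + 1))
    img = image[top:top + crop_size, left:left + crop_size]
    msk = mask[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        img, msk = img[:, ::-1], msk[:, ::-1]   # flip both, never just one
    return np.ascontiguousarray(img), np.ascontiguousarray(msk)

def sample_coverage(rng, low=0.10, high=0.60):
    """Draw a target mask coverage ratio for one training sample."""
    return float(rng.uniform(low, high))

def make_training_pair(image, mask, crop_size, rng):
    """Return (corrupted input, hole mask, ground-truth target)."""
    target, hole = paired_crop_flip(image, mask, crop_size, rng)
    corrupted = target * (1 - hole[..., None])  # zero out the hole pixels
    return corrupted, hole, target

rng = np.random.default_rng(0)
image = rng.random((512, 512, 3)).astype(np.float32)
mask = np.zeros((512, 512), dtype=np.uint8)
mask[128:384, 128:384] = 1
corrupted, hole, target = make_training_pair(image, mask, 256, rng)
print(corrupted.shape, hole.shape, sample_coverage(rng))
```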

Train/Val/Test Split Best Practices

For datasets with pre-defined splits (DIV2K: 800/100/100 train/val/test; COCO 2017: 118K/5K/41K train/val/test), use the official splits to ensure comparability with published results. For ParisStreetView-RandomMasks, the provided train.txt and val.txt files define the official split. For FFHQ, the 60,000/10,000 train/val split is the most common convention. Never include validation or test images in your training set. This is a common mistake when using HuggingFace Dataset loading without explicitly checking split boundaries.
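A cheap safeguard is to assert that your splits are disjoint before any training run. The sketch below builds an FFHQ-style split and checks it. The helper names are hypothetical, and the choice of "first 60,000 IDs for training, last 10,000 for validation" is an assumption for illustration; verify it against the split metadata you actually use.

```python
def ffhq_style_split(n_images=70_000, n_train=60_000):
    """Build zero-padded image IDs and cut them into train/val by index."""
    ids = [f"{i:05d}" for i in range(n_images)]
    return {"train": ids[:n_train], "val": ids[n_train:]}

def check_disjoint(splits):
    """Raise ValueError if any two splits share an image ID."""
    names = sorted(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = set(splits[a]) & set(splits[b])
            if overlap:
                raise ValueError(f"{len(overlap)} ids shared by {a!r} and {b!r}")

splits = ffhq_style_split()
check_disjoint(splits)  # passes silently when the splits do not overlap
print({name: len(ids) for name, ids in splits.items()})
```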

Handling Class Imbalance in Tampered Datasets

For Inpaint32K specifically, the balanced design (8,000 images per technique category) eliminates class imbalance across techniques. However, if you combine Inpaint32K with authentic image datasets for detection research, the ratio of tampered to authentic images must be carefully managed — typically 1:1 or 1:2 (tampered:authentic) to prevent the model from trivially classifying all images as authentic by default.
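The ratio can be enforced at dataset-construction time by subsampling the authentic pool. The following is a minimal sketch under stated assumptions: the function name, the label convention (1 = tampered, 0 = authentic), and the file names in the usage lines are all hypothetical.

```python
import random

def build_detection_set(tampered, authentic, authentic_per_tampered=1.0, seed=0):
    """Combine tampered and authentic samples at a fixed ratio.

    Returns a shuffled list of (path, label) pairs, label 1 = tampered.
    """
    rng = random.Random(seed)
    wanted = round(len(tampered) * authentic_per_tampered)
    if wanted > len(authentic):
        raise ValueError("not enough authentic images for the requested ratio")
    sampled = rng.sample(list(authentic), wanted)
    samples = [(p, 1) for p in tampered] + [(p, 0) for p in sampled]
    rng.shuffle(samples)
    return samples

# Hypothetical file names, used only to show the 1:2 (tampered:authentic) case.
tampered = [f"tampered_{i}.png" for i in range(100)]
authentic = [f"authentic_{i}.png" for i in range(500)]
data = build_detection_set(tampered, authentic, authentic_per_tampered=2.0)
print(len(data), sum(label for _, label in data))
```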


Research Gap Radar

Gap 1 — No Multi-Domain Dataset with Pre-Generated Masks: ParisStreetView-RandomMasks provides pre-generated masks but only for urban scenes. MS-COCO provides visual diversity but no masks. There is no large-scale, visually diverse dataset that combines both — requiring researchers to either accept domain-limited pre-generated masks or generate their own masks with non-standardized protocols.

Gap 2 — Missing Medical Imaging Inpainting Benchmarks: Medical image inpainting (MRI artifact removal, CT scan gap filling, retinal image restoration) is a high-value application with no standard benchmark dataset. Medical inpainting research relies on small institution-specific datasets that prevent cross-paper comparison.

Gap 3 — No Standardized Video Inpainting Dataset at Scale: All five datasets reviewed are for static images. Video inpainting — removing objects from video sequences with temporal consistency — lacks a universally adopted large-scale benchmark with standardized evaluation protocols comparable to the role MS-COCO plays for image inpainting.

Gap 4 — Inpainting Detection Datasets Lag Behind Generation Methods: Inpaint32K covers methods up to its 2024 construction date. The most capable inpainting models of 2025–2026 (SDXL Inpainting, Flux.1-Fill, BrushNet v2) may not be represented, meaning detection models trained on Inpaint32K may not generalize to the latest generation tools.

Gap 5 — No High-Resolution Dataset with Semantic Annotations: DIV2K provides high resolution. MS-COCO provides semantic annotations. No dataset provides both — limiting the development of high-resolution semantic inpainting models that could precisely control what content is generated in removed object regions.


Implementation Roadmap

Step 1 — Choose Your Dataset and Task (Week 1): Use the How to Choose section to select your dataset. Define your specific task: generation quality evaluation, object removal, detection/forensics, face restoration, or high-resolution reconstruction.

Step 2 — Download and Inspect (Week 1–2): Download your chosen dataset using the links in this article. Inspect 50–100 random samples visually before writing any training code. Verify that image quality, diversity, and mask characteristics match your expectations.

Step 3 — Set Up Baselines (Week 2–3): Install and run at least one baseline model on your dataset. LaMa (github.com/saic-mdal/lama) is the recommended first baseline for general inpainting. Run evaluation metrics (PSNR, SSIM, LPIPS) on the baseline to establish your performance floor.

Step 4 — Implement Your Method (Week 3–6): Implement the core architectural contribution of your project. Start with the minimum viable change — one new module, one new loss term, or one new data processing strategy — rather than rewriting the entire baseline system.

Step 5 — Evaluate and Compare (Week 6–8): Run full evaluation on your test set. Compute all standard metrics for your dataset (see the Metrics section per dataset). Compare against your baseline and at least one published method using the same evaluation protocol.

Step 6 — Ablation Study (Week 8–9): Remove each component of your method one at a time and measure the performance impact. This is the core evidence that your contributions are genuine rather than coincidental.

Step 7 — Write Up and Release (Week 9–12): Write your report with clear methodology, quantitative results, and qualitative visual examples. Release your code and evaluation scripts for reproducibility.


Tools and Frameworks

PyTorch + torchvision: Core deep learning framework for all inpainting model implementations. Standard starting point for any inpainting project.

BasicSR: A comprehensive image restoration toolbox built on PyTorch, providing implementations of SRGAN, ESRGAN, EDSR, and multiple inpainting-adjacent restoration models with standardized training and evaluation pipelines. github.com/xinntao/BasicSR

MMEditing (now MMagic): OpenMMLab's image and video editing toolbox covering inpainting, super-resolution, video enhancement, and generation with standardized benchmarks. github.com/open-mmlab/mmagic

HuggingFace Diffusers: The standard library for diffusion-based inpainting models. Provides ready-to-use pipelines for Stable Diffusion Inpainting, SDXL Inpainting, and custom diffusion model training. huggingface.co/docs/diffusers

LaMa Cleaner (since renamed IOPaint): A production-ready inpainting tool built on LaMa with support for multiple backends including LaMa, Stable Diffusion, and ZITS. Excellent for building inpainting application prototypes. github.com/Sanster/lama-cleaner

pycocotools: Official Python API for MS-COCO annotations — essential for rendering segmentation masks for object removal inpainting experiments on COCO. Available via pip install pycocotools.

LPIPS library: Official Python implementation of the LPIPS perceptual similarity metric. Available via pip install lpips. Essential for any inpainting evaluation pipeline.


Common Mistakes When Using Inpainting Datasets

Mistake 1 — Evaluating on the training set: Always use a held-out test set that was never seen during training. This mistake is surprisingly common when using HuggingFace Dataset loaders without explicitly verifying which split is being used.

Mistake 2 — Using different mask generation protocols across experiments: If you generate masks differently for your method than for the baselines you compare against, your evaluation is invalid. Always use the same mask generation code and random seed for all compared methods. Ideally, save the generated masks to disk and reuse them for all experiments.
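One way to make masks reproducible regardless of loading order is to derive each mask's seed from the image ID and store the resulting masks in a single file. The sketch below assumes this design; the helper names are hypothetical, and the rectangular `box_mask` is only a stand-in for whatever mask generator your protocol specifies.

```python
import hashlib
import numpy as np

def seed_for(image_id, base_seed=0):
    """Derive a stable per-image seed from the image ID and a base seed."""
    digest = hashlib.sha256(f"{base_seed}:{image_id}".encode()).digest()
    return int.from_bytes(digest[:8], "little")

def box_mask(height, width, rng):
    """Stand-in generator: one random rectangle (1 = hole)."""
    mask = np.zeros((height, width), dtype=np.uint8)
    bh = int(rng.integers(height // 8, height // 2))
    bw = int(rng.integers(width // 8, width // 2))
    top = int(rng.integers(0, height - bh + 1))
    left = int(rng.integers(0, width - bw + 1))
    mask[top:top + bh, left:left + bw] = 1
    return mask

def build_mask_bank(image_ids, height, width, base_seed=0):
    """Generate one mask per image ID, independent of iteration order."""
    return {
        image_id: box_mask(height, width,
                           np.random.default_rng(seed_for(image_id, base_seed)))
        for image_id in image_ids
    }

def save_mask_bank(bank, path):
    """Write all masks to one compressed .npz file keyed by image ID."""
    np.savez_compressed(path, **bank)

def load_mask_bank(path):
    """Read the masks back as a plain dict of arrays."""
    with np.load(path) as data:
        return {key: data[key] for key in data.files}
```

Every compared method then loads the same file, so differences in the reported metrics cannot come from differences in the masks.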

Mistake 3 — Computing PSNR/SSIM over the entire image instead of only the masked region: Since the unmasked regions are identical between input and output, full-image PSNR is inflated by the large proportion of unchanged pixels. Always report masked-region PSNR and SSIM for meaningful inpainting-specific quality measurement.
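The difference is easy to see numerically. The sketch below computes PSNR over hole pixels only and compares it with full-image PSNR for a synthetic case where the output matches the target everywhere outside the hole. The function names are illustrative, images are assumed to be (H, W, C) float arrays in [0, 1], and the mask is assumed to be (H, W) with 1 marking the hole.

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Standard PSNR over all pixels."""
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else float(10 * np.log10(data_range ** 2 / mse))

def masked_psnr(pred, target, mask, data_range=1.0):
    """PSNR computed only over pixels where mask == 1 (the inpainted region)."""
    hole = np.broadcast_to(mask.astype(bool)[..., None], pred.shape)
    if not hole.any():
        raise ValueError("mask contains no hole pixels")
    mse = np.mean((pred[hole] - target[hole]) ** 2)
    return float("inf") if mse == 0 else float(10 * np.log10(data_range ** 2 / mse))

# Synthetic example: a constant error of 0.1 inside a hole covering 25% of the image.
target = np.full((64, 64, 3), 0.5)
mask = np.zeros((64, 64), dtype=np.uint8)
mask[:32, :32] = 1
pred = target.copy()
pred[:32, :32] += 0.1

print(f"full-image PSNR: {psnr(pred, target):.2f} dB")
print(f"masked PSNR:     {masked_psnr(pred, target, mask):.2f} dB")
```

In this example the full-image figure comes out about 6 dB higher than the masked one, purely because three quarters of the pixels were never changed.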

Mistake 4 — Ignoring image normalization consistency: Some inpainting models expect images normalized to [-1, 1], others to [0, 1], and others to ImageNet mean/std. Mixing normalization conventions between your data pipeline and the model's expected input range produces corrupted outputs that appear as model failures but are actually data pipeline bugs.
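Making each conversion an explicit, named function and checking the value range at the model boundary catches most of these bugs early. This is a minimal sketch with hypothetical helper names; the ImageNet mean and standard deviation are the values commonly used with torchvision models.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_unit(image_uint8):
    """uint8 [0, 255] -> float32 [0, 1]."""
    return image_uint8.astype(np.float32) / 255.0

def unit_to_signed(x):
    """[0, 1] -> [-1, 1]."""
    return x * 2.0 - 1.0

def signed_to_unit(x):
    """[-1, 1] -> [0, 1]."""
    return (x + 1.0) / 2.0

def unit_to_imagenet(x):
    """[0, 1] -> ImageNet mean/std normalization (channels-last layout)."""
    return (x - IMAGENET_MEAN) / IMAGENET_STD

def check_range(x, low, high, name="input"):
    """Fail fast if values fall outside the range the model expects."""
    lo, hi = float(x.min()), float(x.max())
    if lo < low - 1e-6 or hi > high + 1e-6:
        raise ValueError(f"{name} range [{lo:.3f}, {hi:.3f}] outside [{low}, {high}]")
    return x

image = np.array([[[0, 128, 255]]], dtype=np.uint8)
model_input = check_range(unit_to_signed(to_unit(image)), -1.0, 1.0, "model_input")
```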

Mistake 5 — Not validating mask coverage distribution: If your mask generation produces masks that are systematically too small or too large compared to the benchmark protocol, your results will not be comparable with published numbers even if everything else is correct. Always plot and report the distribution of mask coverage percentages for your evaluation set.
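A quick way to validate coverage is to bucket the per-mask hole ratio into 10% bins and inspect the counts before running any evaluation. The sketch below assumes masks with nonzero values marking hole pixels; the function name and the choice of ten equal-width bins are illustrative, so adjust the bin edges to match the protocol you compare against.

```python
import numpy as np

def coverage_histogram(masks, edges=None):
    """Return per-mask coverage ratios and counts per coverage bin."""
    if edges is None:
        edges = np.linspace(0.0, 1.0, 11)       # 0-10%, 10-20%, ..., 90-100%
    coverage = np.array([np.asarray(m).astype(bool).mean() for m in masks])
    counts, _ = np.histogram(coverage, bins=edges)
    return coverage, counts

def make_mask(n_hole, size=10):
    """Toy mask with exactly n_hole masked pixels (for demonstration only)."""
    mask = np.zeros(size * size, dtype=np.uint8)
    mask[:n_hole] = 1
    return mask.reshape(size, size)

masks = [make_mask(5), make_mask(25), make_mask(25), make_mask(55)]
coverage, counts = coverage_histogram(masks)
for low, count in zip(range(0, 100, 10), counts):
    print(f"{low:3d}-{low + 10:3d}%: {count}")
```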

Mistake 6 — Selecting the best checkpoint based on test set metrics: Use validation set metrics for checkpoint selection and report final numbers on the test set exactly once. Selecting checkpoints based on test performance is a subtle form of overfitting to the benchmark.

Mistake 7 — Not reporting standard deviations: Inpainting models can have high variance in results across different mask placements and image content. Always report mean and standard deviation of your metrics over the full test set, not just the mean.
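Reporting the spread requires keeping per-image scores rather than a running average. A minimal sketch, assuming you have already collected one score per test image (the function name and the example values are made up for illustration):

```python
import numpy as np

def summarize(per_image_scores):
    """Return (mean, sample standard deviation, count) for per-image scores."""
    scores = np.asarray(per_image_scores, dtype=np.float64)
    if scores.size < 2:
        raise ValueError("need at least two scores to estimate a standard deviation")
    return float(scores.mean()), float(scores.std(ddof=1)), int(scores.size)

# Made-up per-image PSNR values, used only to show the reporting format.
mean, std, n = summarize([28.0, 30.0, 32.0])
print(f"PSNR: {mean:.2f} ± {std:.2f} dB (n={n})")
```

Using `ddof=1` gives the sample standard deviation, which is the usual choice when the test set is treated as a sample of possible images and mask placements.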


Your Next Steps as a Final Year Student or Researcher

  • Select one dataset from this article that matches your project domain using the How to Choose section

  • Download the dataset using the official links provided — start with ParisStreetView-RandomMasks at zenodo.org/records/20233925 if you want the fastest setup (masks and corrupted images already included)

  • Set up LaMa as your first baseline — it is the most reproducible, well-documented, and widely used inpainting baseline available

  • Establish your evaluation pipeline computing PSNR, SSIM, and LPIPS on masked regions before implementing any novel method

  • Identify one open problem from the Research Gap Radar that your project can partially address

  • Design your ablation study before implementing your method — knowing what you will remove helps clarify what is genuinely novel in your approach

  • Release your code and evaluation scripts — reproducibility is increasingly a standard expectation in the field


Conclusion

Image inpainting is one of the most practically impactful and technically challenging tasks in computer vision. The five datasets reviewed in this article represent the current best resources for training, evaluating, and benchmarking inpainting models across a wide range of use cases — from general-purpose scene completion and object removal to face restoration, high-resolution texture reconstruction, and inpainting forensics detection.

ParisStreetView-RandomMasks, available at zenodo.org/records/20233925, represents the newest addition to this ecosystem — a purpose-built, ready-to-use inpainting dataset with pre-generated irregular masks and corrupted images for urban street-scene research and diffusion model training. MS-COCO remains the universal benchmark dataset whose results are interpretable across the entire research community. Inpaint32K addresses the growing importance of inpainting detection and forensics research in an era of increasingly capable AI image manipulation tools. FFHQ provides the high-resolution, high-diversity face dataset essential for developing and evaluating face-specific inpainting systems. And DIV2K brings the high-resolution evaluation standard needed to assess inpainting quality at the resolutions demanded by practical real-world applications.

Each dataset has a distinct role in the inpainting research ecosystem. The most rigorous research uses multiple datasets — a general benchmark like MS-COCO for universal comparability, a domain-specific dataset like FFHQ or ParisStreetView-RandomMasks for targeted application evaluation, and a quality-specific benchmark like DIV2K for high-resolution assessment. Understanding which dataset to use for which purpose — and why — is one of the most valuable skills you can develop as a computer vision researcher.

The field is moving fast. New inpainting models are published weekly, new benchmark datasets are created to challenge them, and the line between inpainting, generation, and editing is increasingly blurred by unified generative frameworks. The datasets reviewed in this article are your entry points into this landscape — use them well, evaluate rigorously, and contribute to the growing body of reproducible, comparable inpainting research.