Generating Synthetic Data via Augmentations for Improved Facial Resemblance in DreamBooth and InstantID



Pipeline for generating personalized portraits using synthetic images enhanced through classical and generative augmentations to improve identity resemblance in DreamBooth and InstantID outputs.

Abstract

Personalizing Stable Diffusion for professional portrait generation from amateur photos faces challenges in maintaining facial resemblance. This paper evaluates the impact of augmentation strategies on two personalization methods: DreamBooth and InstantID.

We compare classical augmentations (flipping, cropping, color adjustments) with generative augmentation using InstantID's synthetic images to enrich training data. Using SDXL and a new FaceDistance metric based on FaceNet, we quantitatively assess facial similarity.

Results show classical augmentations can cause artifacts harming identity retention, while InstantID improves fidelity when balanced with real images to avoid overfitting. A user study with 97 participants confirms high photorealism and preferences for InstantID's polished look versus DreamBooth's identity accuracy.

Our findings inform effective augmentation strategies for personalized text-to-image generation.

Augmentation Strategy Results

We evaluate augmentation strategies on DreamBooth and InstantID personalization methods using the SDXL model and a new FaceDistance metric to measure facial similarity.

Key Findings

  • Classical Augmentations (DreamBooth): Often introduce artifacts harming identity retention. Techniques like random flips and background replacements can slow training or degrade image quality. Gray backgrounds (especially light gray) work best. Resizing images to ~1MP aligns well with SDXL training, while ESR-GAN upscaling introduces artifacts.
  • Generative Augmentations (InstantID): Enhance DreamBooth training by producing diverse, realistic synthetic images that improve facial similarity. Maintaining a balance between real and synthetic images is critical to avoid overfitting. Though effective, the 2-step generation method is computationally costly.
  • InstantID’s Behavior: Rotational/shape augmentations and background replacements degrade similarity. Upscaling with traditional methods works better than neural upscaling. Using multiple reference images significantly improves consistency. Face replacement offers better pose control and faster generation, but requires well-posed reference photos.

Overall Insights

  • DreamBooth excels in facial similarity, while InstantID yields a more professional, "Photoshopped" look favored by some users.
  • FaceDistance effectively ranks facial similarity but has limited sensitivity for fine distinctions and holistic personalization.
  • Datasets with very few images \((\leq 3)\) can yield poor subject representation even when outputs look accurate to observers unfamiliar with the subject.

FaceDistance Metric

To quantify facial similarity in generated images, we employ the FaceDistance metric based on FaceNet embeddings [Schroff et al., 2015]. FaceNet projects facial images into a 128-dimensional hyperspherical embedding space where spatial proximity reflects facial similarity.

Definition: Given batches of generated images \( G = \{G_i\}_{i=1}^m \) and real images \( R = \{R_j\}_{j=1}^n \), the FaceDistance is defined as:

\[ \bigl[\operatorname{FaceDistance}(G, R)\bigr]_i := \frac{1}{n} \sum_{j=1}^n \delta^{[0,2]}_{\cos}\bigl(f(G_i), f(R_j)\bigr), \quad i=1, \dots, m \]

where

  • \( f(\cdot) := \operatorname{FaceNet}\bigl(\operatorname{MTCNN}(\cdot)\bigr) \)
  • \(\delta^{[0,2]}_{\cos}(\mathbf{x}, \mathbf{y}) := \operatorname{clip}_{[0,2]} \left( 1 - \frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|} \right)\)

The clipping operator \(\operatorname{clip}_{[0,2]}\) guards against floating-point values drifting outside \([0,2]\), the natural range of the cosine distance.
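The definition above translates directly into a few lines of NumPy. In this sketch, `gen_embs` and `real_embs` stand in for FaceNet embeddings of MTCNN-cropped faces (the function names and shapes are our illustrative assumptions, not code from the paper):

```python
import numpy as np

def face_distance(gen_embs: np.ndarray, real_embs: np.ndarray) -> np.ndarray:
    """Per-generated-image mean clipped cosine distance to all real embeddings.

    gen_embs:  (m, d) embeddings f(G_i) of generated faces
    real_embs: (n, d) embeddings f(R_j) of real reference faces
    Returns a length-m vector: [FaceDistance(G, R)]_i for each generated image.
    """
    g = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    r = real_embs / np.linalg.norm(real_embs, axis=1, keepdims=True)
    cos_sim = g @ r.T                         # (m, n) cosine similarities
    dist = np.clip(1.0 - cos_sim, 0.0, 2.0)  # delta_cos clipped to [0, 2]
    return dist.mean(axis=1)                  # average over the n real images
```

Identical embeddings give distance 0, diametrically opposed ones give 2, matching the clipped range in the definition.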


FaceDistance allows us to:

  • Rank generated images by similarity (lower distance = better match)
  • Discard the \(k\%\) most distant embeddings to improve personalization quality (e.g., \(k=15\%\) for datasets with \(n \geq 8\))
  • Identify failure cases such as off-subject images or artifacts

In our paper, we explore these use cases in detail.
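The ranking and discard use cases reduce to a sort over the distance vector. The helper below is a hypothetical sketch (its name and signature are ours); `k=0.15` mirrors the suggested setting for datasets with \(n \geq 8\):

```python
import numpy as np

def discard_most_distant(distances: np.ndarray, k: float = 0.15) -> np.ndarray:
    """Indices of generated images kept after dropping the fraction k
    with the largest FaceDistance (lower distance = better match)."""
    m = len(distances)
    n_drop = int(np.floor(k * m))
    order = np.argsort(distances)           # ascending: best matches first
    return np.sort(order[: m - n_drop])     # keep the closest (1 - k) fraction
```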


Visualization of ranking using the FaceDistance metric for qualitative facial similarity assessment.

InstantID Pipelines

We analyze two distinct InstantID pipeline approaches for generating personalized portraits, each offering different trade-offs between facial similarity and compositional control.

2-Step Generation

We collect subject reference images \( s_1, \dots, s_n \) and a separate image \( s_{\text{kpts}} \) representing the desired pose and composition. These serve as the reference images and the keypoints image, respectively. While the resulting output is generally satisfactory, using facial landmarks from one person to generate another reduces facial similarity due to structural differences in the five keypoints (eyes, nose, mouth corners). We hypothesize this stems from imbalanced conditioning weights. Performance improves when \( s_{\text{kpts}} \) is replaced with a previously generated image of the subject, yielding better facial similarity while maintaining compositional control.
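The control flow of the 2-step scheme can be sketched as follows. Here `generate` is a stand-in for an InstantID call (a hypothetical signature, not the library's actual API): step 1 uses an off-subject keypoints image, and step 2 reuses the step-1 output as the keypoints image, so the landmarks now belong to the subject:

```python
def two_step_generate(generate, reference_images, keypoints_image, prompt):
    """Two-step InstantID sketch.

    generate(reference_images, keypoints_image, prompt) -> image
    is assumed to wrap a single InstantID inference call.
    """
    draft = generate(reference_images, keypoints_image, prompt)  # step 1: off-subject keypoints
    final = generate(reference_images, draft, prompt)            # step 2: subject's own landmarks
    return final
```

The second call is what recovers facial similarity, at the cost of roughly doubling generation time, which is the computational overhead noted above.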


Two-Step Generation Pipeline. Initial outputs use a keypoints image \( s_{\text{kpts}} \) from another identity, often reducing facial similarity. Replacing \( s_{\text{kpts}} \) with a prior output of the subject improves identity preservation while retaining pose. Using four reference images offers a good trade-off, as demonstrated in the appendix. Despite the ease-of-use in downstream applications, this limitation motivates our face replacement method for greater control.

Face Replacement

Users interact with a simple tool to manipulate (move/rotate/resize) their cropped face on a canvas matching the diffusion model's output dimensions. This approach eliminates the similarity issues caused by using another person's facial landmarks. However, the method performs poorly when none of the reference images show the subject facing the camera (deviations >30°). User satisfaction was higher with this approach compared to 2-step generation, which we attribute to increased interactivity and faster generation times.
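The move/rotate/resize interaction amounts to an affine placement of the face crop on the output canvas. The helper below is a hypothetical geometry sketch (names and parameters are ours): it scales the crop, rotates it about its centre, translates it to the chosen position, and reports whether it stays inside the canvas:

```python
import math

def place_face(canvas_w, canvas_h, face_w, face_h, cx, cy,
               scale=1.0, angle_deg=0.0):
    """Corners of the user-manipulated face crop on the output canvas.

    (cx, cy) is the crop centre after the user's move; scale and angle_deg
    model the resize and rotate handles of the tool.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    hw, hh = scale * face_w / 2, scale * face_h / 2
    corners = []
    for dx, dy in [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]:
        # rotate the corner offset, then translate to the chosen centre
        x = cx + dx * cos_a - dy * sin_a
        y = cy + dx * sin_a + dy * cos_a
        corners.append((x, y))
    inside = all(0 <= x <= canvas_w and 0 <= y <= canvas_h for x, y in corners)
    return corners, inside
```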

User Study Results

Our survey evaluated the professional viability of AI-generated portraits, comparing DreamBooth and InstantID headshot generators with 97 white-collar professionals and students. Participants, aged 18 to 74, were predominantly White (81%) and mostly over 45 (74.2%), with balanced gender representation. They evaluated image quality, facial similarity, and realism.

Key Findings

  • Performance: DreamBooth and InstantID delivered similar quality and detail; DreamBooth showed superior facial similarity.
  • Preferences: Slight preference (\(\Delta=4\%\)) for InstantID’s consistent professional style; DreamBooth favored for realistic identity preservation.
  • AI Detection: Most participants struggled to identify AI portraits; experienced AI users more often detected DreamBooth images.

These findings highlight the strengths and user perceptions of AI portrait generators in professional settings. We thank all participants for their time and contributions.

Flickr-Suits-XL & Flickr-Portraits-XL Datasets

We introduce Flickr-Suits-XL (FSXL) and Flickr-Portraits-XL (FPXL), two high-quality datasets for AI-generated professional portraits.

FSXL includes 1,208 high-resolution images of people in formal attire, sourced from Flickr under permissive licenses. Images were filtered using MTCNN (single face), aligned, and cropped to match SDXL training resolution, with faces centered horizontally and positioned one-third from the top.
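The alignment rule above (face centred horizontally, one third from the top) determines the crop window directly from the detected face centre. This is an illustrative sketch with assumed names; the 1024x1024 output size reflects SDXL's training resolution:

```python
def crop_box(img_w, img_h, face_cx, face_cy, out_w=1024, out_h=1024):
    """Crop window placing the face centre at the horizontal midpoint
    and one third from the top of the output, clamped to the source image.

    (face_cx, face_cy) would come from an MTCNN detection on the photo.
    """
    left = face_cx - out_w // 2   # centre the face horizontally
    top = face_cy - out_h // 3    # face one third from the top edge
    # clamp so the window stays inside the source image
    left = max(0, min(left, img_w - out_w))
    top = max(0, min(top, img_h - out_h))
    return left, top, left + out_w, top + out_h
```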

FPXL follows the same process using raw "in-the-wild" FFHQ images. We thank the authors of the FFHQ dataset. Both datasets inherit the demographic biases of their source material.

Citation (BibTeX)

@inproceedings{Ulusan2025SynData4CV,  
  author        = {Ulusan, Koray and Kiefer, Benjamin},
  title         = {{Generating Synthetic Data via Augmentations for Improved Facial Resemblance in DreamBooth and InstantID}},
  booktitle     = {Proceedings of the CVPR 2025 Workshop on Synthetic Data for Computer Vision (SynData4CV)},
  year          = {2025},
  month         = {May},
  url           = {https://openreview.net/forum?id=2o0RxrcV23},
  note          = {Accepted to the CVPR 2025 SynData4CV Workshop},
  eprint        = {2505.03557},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CV},
  doi           = {10.48550/arXiv.2505.03557}
}