Beyond FID: Human Perceptual Judgments Reveal Systematic Blind Spots in GAN Face Evaluation

Contributors: Nierula, Birgit; Melnik, Anna; Barthel, Florian; Brama, Aileen; Hilsmann, Anna; Eisert, Peter; Nikulin, Vadim V.; Gaebler, Michael; Klotzsche, Felix; Chen, Yonghao; Stephani, Tilman; Bosse, Sebastian; Musialski, Przemyslaw; Lim, Isaak
Date: 2026-04-20
ISBN: 978-3-03868-299-8
ISSN: 2309-5059
URL: https://diglib.eg.org/handle/10.2312/egs20261007
DOI: 10.2312/egs.20261007
License: CC-BY-4.0
Subjects: Models of computation; Interactive computation; Computer Graphics

Abstract

Generative Adversarial Networks (GANs) can synthesize highly realistic facial images from random noise vectors. The Fréchet Inception Distance (FID) is widely used as a standard metric to automatically evaluate the quality of GAN-generated images. However, it remains unclear to what extent this statistical measure reflects human perceptual judgments, which ultimately define image realism in practical applications. To address this, we conducted a psychophysical study in which participants (n = 20) performed a two-alternative forced-choice task, assessing actual photographs and GAN-generated images as real or fake. We show that while FID provides a reliable global ordering of image quality, it systematically fails for localized semantic artifacts (e.g., eyewear and skin texture) that disproportionately affect human realness judgments. This demonstrates that FID and human perception are not merely noisy versions of the same signal, but that FID has systematic blind spots for localized semantic artifacts that disproportionately drive human realism judgments.
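
For readers unfamiliar with the metric named in the abstract: FID is the Fréchet distance between two Gaussians fitted to image features, one for real images and one for generated images. The sketch below is a simplified illustration, not the authors' evaluation code. It assumes diagonal covariances so that it runs with the Python standard library alone, whereas the standard FID uses full covariance matrices over Inception-v3 features. The function name `frechet_distance_diagonal` and the toy feature vectors are hypothetical.

```python
import math
from statistics import fmean, variance


def frechet_distance_diagonal(feats_a, feats_b):
    """Fréchet distance between two Gaussians fitted per feature dimension.

    Simplifying assumption: covariances are diagonal, so the trace term
    Tr(S_a + S_b - 2 (S_a S_b)^(1/2)) reduces to a sum of
    (sqrt(var_a) - sqrt(var_b))^2 over dimensions.
    Each input is a list of equal-length feature vectors (at least two).
    """
    cols_a = list(zip(*feats_a))  # transpose: one tuple per feature dimension
    cols_b = list(zip(*feats_b))
    if len(cols_a) != len(cols_b):
        raise ValueError("feature dimensionality must match")

    total = 0.0
    for col_a, col_b in zip(cols_a, cols_b):
        mu_a, mu_b = fmean(col_a), fmean(col_b)
        var_a = variance(col_a, mu_a)  # sample variance (N - 1 denominator)
        var_b = variance(col_b, mu_b)
        total += (mu_a - mu_b) ** 2
        total += (math.sqrt(var_a) - math.sqrt(var_b)) ** 2
    return total


# Toy two-dimensional "features" standing in for network activations.
real_feats = [[0.0, 0.0], [2.0, 2.0]]
fake_feats = [[3.0, 0.0], [5.0, 2.0]]
score = frechet_distance_diagonal(real_feats, fake_feats)
```

The score is computed from the statistics of two whole image sets rather than from any single image, which is the sense in which the abstract calls FID a statistical measure.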