S3-Face: SSS-Compliant Facial Reflectance Estimation via Diffusion Priors

Xingyu Ren1     Jiankang Deng2     Yuhao Cheng1     Wenhan Zhu3
Yichao Yan1     Xiaokang Yang1     Stefanos Zafeiriou2     Chao Ma1†
1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
2Imperial College London    3Xueshen AI
†Corresponding author
CVPR 2025

We introduce S3-Face, a comprehensive method for high-quality facial reflectance estimation. Our approach leverages a pre-trained diffusion model as a reflectance prior, robustly generating subsurface scattering (SSS)-compliant facial reflectance, particularly hemoglobin and melanin maps. S3-Face handles diverse input subjects and formats, delivering photorealistic rendering results.

Abstract

Recent 3D face reconstruction methods have made remarkable advancements, yet achieving high-quality facial reflectance from monocular input remains challenging. Existing methods rely on light-stage-captured data to learn facial reflectance models. However, the limited subject diversity of these datasets makes good generalization and broad applicability difficult to achieve. This motivates us to explore whether the extensive priors captured in recent generative diffusion models (e.g., Stable Diffusion) can enable more generalizable facial reflectance estimation, as these models have been pre-trained on large-scale internet image collections containing rich visual patterns. In this paper, we introduce the use of Stable Diffusion as a prior for facial reflectance estimation, achieving robust results with minimal captured data for fine-tuning. We present S3-Face, a comprehensive framework capable of producing SSS-compliant skin reflectance from in-the-wild images. Our method adopts a two-stage training approach: in the first stage, DSN-Net is trained to predict diffuse albedo, specular albedo, and normal maps from in-the-wild images using a novel joint reflectance attention module. In the second stage, HM-Net is trained to generate hemoglobin and melanin maps based on the diffuse albedo predicted in the first stage, yielding SSS-compliant and detailed reflectance maps. Extensive experiments demonstrate that our method achieves strong generalization and produces high-fidelity, SSS-compliant facial reflectance.
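The two-stage flow described above can be sketched as a simple cascade: stage 1 maps an input image to diffuse albedo, specular albedo, and normal maps, and stage 2 consumes only the stage-1 diffuse albedo to derive hemoglobin and melanin maps. The sketch below illustrates this data flow only; the network bodies are hypothetical stand-ins, not the paper's actual models.

```python
import numpy as np

def dsn_net(image):
    """Stage 1 (hypothetical stand-in): predict diffuse albedo, specular
    albedo, and normal maps from an in-the-wild image (H, W, 3 in [0, 1])."""
    h, w, _ = image.shape
    diffuse = np.clip(image * 0.8, 0.0, 1.0)    # placeholder prediction
    specular = np.full((h, w, 1), 0.04)         # placeholder prediction
    normal = np.dstack([np.zeros((h, w, 2)), np.ones((h, w, 1))])
    return {"diffuse": diffuse, "specular": specular, "normal": normal}

def hm_net(diffuse):
    """Stage 2 (hypothetical stand-in): derive hemoglobin and melanin maps
    from the stage-1 diffuse albedo alone."""
    gray = diffuse.mean(axis=-1, keepdims=True)
    return {"hemoglobin": 1.0 - gray, "melanin": gray}  # placeholder maps

def s3_face(image):
    """Cascade: stage 2 sees only stage 1's diffuse albedo, as in the paper."""
    maps = dsn_net(image)
    maps.update(hm_net(maps["diffuse"]))
    return maps

maps = s3_face(np.full((4, 4, 3), 0.5))
assert set(maps) == {"diffuse", "specular", "normal", "hemoglobin", "melanin"}
```

The key design point the sketch captures is that HM-Net is conditioned on the predicted diffuse albedo rather than on the raw image, so the SSS-related maps stay consistent with the stage-1 reflectance.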

Pipeline

Overview of the DSN-Net. Given paired inputs, we first extract the corresponding latent codes through a frozen encoder. We then jointly train the reflectance model using two alternating modes: 1) "Single image-multiple reflectances" mode (left): the switcher selects a reflectance type (diffuse, specular, or normal) to concatenate with the color latent. 2) "Multiple images-single diffuse reflectance" mode (right): the switcher selects color latents under different lighting conditions, each concatenated with the diffuse latent. These combined inputs are then fed into the network for noise-prediction training.
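The input-assembly step of the two alternating modes can be sketched as follows. This is a hedged illustration of how the switcher might pair latents before noise-prediction training; the encoder, latent shape, and function names are assumptions, not the paper's implementation.

```python
import numpy as np

# Assumed latent shape (channels, height, width) from a frozen encoder.
LATENT_SHAPE = (4, 8, 8)

def encode(name):
    """Stand-in for the frozen encoder: a fixed random latent per input name."""
    rng = np.random.default_rng(sum(name.encode()))
    return rng.standard_normal(LATENT_SHAPE)

def build_training_input(mode, color_latents, reflectance_latents, step):
    """Assemble channel-concatenated latent pairs for the denoiser.

    "single_image": one color latent paired with one reflectance latent
                    (diffuse, specular, or normal), cycled by `step`.
    "multi_image":  several color latents under different lighting, each
                    paired with the same diffuse latent.
    """
    if mode == "single_image":
        kinds = ["diffuse", "specular", "normal"]
        kind = kinds[step % len(kinds)]  # the "switcher" picks the type
        pair = np.concatenate([color_latents[0], reflectance_latents[kind]], axis=0)
        return [pair], kind
    if mode == "multi_image":
        pairs = [np.concatenate([c, reflectance_latents["diffuse"]], axis=0)
                 for c in color_latents]
        return pairs, "diffuse"
    raise ValueError(f"unknown mode: {mode}")

# Usage: two color images of the same subject under different lighting.
colors = [encode("img_light_a"), encode("img_light_b")]
refl = {k: encode(k) for k in ("diffuse", "specular", "normal")}

pairs, kind = build_training_input("single_image", colors, refl, step=1)
assert pairs[0].shape == (8, 8, 8) and kind == "specular"

pairs, kind = build_training_input("multi_image", colors, refl, step=0)
assert len(pairs) == 2 and all(p.shape == (8, 8, 8) for p in pairs)
```

Alternating the two modes during training lets a single network learn both the image-to-reflectance mapping and lighting-invariant diffuse prediction from the same latent-concatenation interface.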