S3-Face: SSS-Compliant Facial Reflectance Estimation via Diffusion Priors

Xingyu Ren1     Jiankang Deng2     Yuhao Cheng1     Wenhan Zhu3
Yichao Yan1     Xiaokang Yang1     Stefanos Zafeiriou2     Chao Ma1†
1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
2Imperial College London    3Xueshen AI
†Corresponding author
CVPR 2025

We introduce S3-Face, a comprehensive method for high-quality facial reflectance estimation. Our approach leverages a pre-trained diffusion model as a reflectance prior, robustly generating subsurface scattering (SSS)-compliant facial reflectance, particularly hemoglobin and melanin maps. S3-Face handles diverse input subjects and formats, delivering photorealistic rendering results.

Abstract

Recent 3D face reconstruction methods have made remarkable advancements, yet achieving high-quality facial reflectance from monocular input remains challenging. Existing methods rely on light-stage-captured data to learn facial reflectance models. However, the limited subject diversity of these datasets makes good generalization and broad applicability difficult to achieve. This motivates us to explore whether the extensive priors captured in recent generative diffusion models (e.g., Stable Diffusion) can enable more generalizable facial reflectance estimation, as these models have been pre-trained on large-scale internet image collections containing rich visual patterns. In this paper, we introduce the use of Stable Diffusion as a prior for facial reflectance estimation, achieving robust results with minimal captured data for fine-tuning. We present S3-Face, a comprehensive framework capable of producing SSS-compliant skin reflectance from in-the-wild images. Our method adopts a two-stage training approach: in the first stage, DSN-Net is trained to predict diffuse albedo, specular albedo, and normal maps from in-the-wild images using a novel joint reflectance attention module. In the second stage, HM-Net is trained to generate hemoglobin and melanin maps based on the diffuse albedo predicted in the first stage, yielding SSS-compliant and detailed reflectance maps. Extensive experiments demonstrate that our method achieves strong generalization and produces high-fidelity, SSS-compliant facial reflectance.
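The two-stage flow described above can be sketched as a simple cascade: stage 1 maps an input image to diffuse albedo, specular albedo, and normal maps, and stage 2 consumes only the stage-1 diffuse albedo to derive hemoglobin and melanin maps. The sketch below illustrates this data flow only; the network bodies are hypothetical stand-ins, not the paper's actual models.

```python
import numpy as np

def dsn_net(image):
    """Stage 1 (hypothetical stand-in): predict diffuse albedo, specular
    albedo, and normal maps from an in-the-wild image (H, W, 3 in [0, 1])."""
    h, w, _ = image.shape
    diffuse = np.clip(image * 0.8, 0.0, 1.0)    # placeholder prediction
    specular = np.full((h, w, 1), 0.04)         # placeholder prediction
    normal = np.dstack([np.zeros((h, w, 2)), np.ones((h, w, 1))])
    return {"diffuse": diffuse, "specular": specular, "normal": normal}

def hm_net(diffuse):
    """Stage 2 (hypothetical stand-in): derive hemoglobin and melanin maps
    from the stage-1 diffuse albedo alone."""
    gray = diffuse.mean(axis=-1, keepdims=True)
    return {"hemoglobin": 1.0 - gray, "melanin": gray}  # placeholder maps

def s3_face(image):
    """Cascade: stage 2 sees only stage 1's diffuse albedo, as in the paper."""
    maps = dsn_net(image)
    maps.update(hm_net(maps["diffuse"]))
    return maps

maps = s3_face(np.full((4, 4, 3), 0.5))
assert set(maps) == {"diffuse", "specular", "normal", "hemoglobin", "melanin"}
```

The key design point the sketch captures is that HM-Net is conditioned on the predicted diffuse albedo rather than on the raw image, so the SSS-related maps stay consistent with the stage-1 reflectance.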

Pipeline

Overview of the DSN-Net. Given paired inputs, we first extract the corresponding latent codes through a frozen encoder. We then jointly train the reflectance model using two alternating modes: 1) "Single image-multiple reflectances" mode (left): the switcher selects a reflectance type (diffuse, specular, or normal) to concatenate with the color latent. 2) "Multiple images-single diffuse reflectance" mode (right): the switcher selects color latents under different lighting conditions, each concatenated with the diffuse latent. These combined inputs are then fed into the network for noise-prediction training.
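The input-assembly step of the two alternating modes can be sketched as follows. This is a hedged illustration of how the switcher might pair latents before noise-prediction training; the encoder, latent shape, and function names are assumptions, not the paper's implementation.

```python
import numpy as np

# Assumed latent shape (channels, height, width) from a frozen encoder.
LATENT_SHAPE = (4, 8, 8)

def encode(name):
    """Stand-in for the frozen encoder: a fixed random latent per input name."""
    rng = np.random.default_rng(sum(name.encode()))
    return rng.standard_normal(LATENT_SHAPE)

def build_training_input(mode, color_latents, reflectance_latents, step):
    """Assemble channel-concatenated latent pairs for the denoiser.

    "single_image": one color latent paired with one reflectance latent
                    (diffuse, specular, or normal), cycled by `step`.
    "multi_image":  several color latents under different lighting, each
                    paired with the same diffuse latent.
    """
    if mode == "single_image":
        kinds = ["diffuse", "specular", "normal"]
        kind = kinds[step % len(kinds)]  # the "switcher" picks the type
        pair = np.concatenate([color_latents[0], reflectance_latents[kind]], axis=0)
        return [pair], kind
    if mode == "multi_image":
        pairs = [np.concatenate([c, reflectance_latents["diffuse"]], axis=0)
                 for c in color_latents]
        return pairs, "diffuse"
    raise ValueError(f"unknown mode: {mode}")

# Usage: two color images of the same subject under different lighting.
colors = [encode("img_light_a"), encode("img_light_b")]
refl = {k: encode(k) for k in ("diffuse", "specular", "normal")}

pairs, kind = build_training_input("single_image", colors, refl, step=1)
assert pairs[0].shape == (8, 8, 8) and kind == "specular"

pairs, kind = build_training_input("multi_image", colors, refl, step=0)
assert len(pairs) == 2 and all(p.shape == (8, 8, 8) for p in pairs)
```

Alternating the two modes during training lets a single network learn both the image-to-reflectance mapping and lighting-invariant diffuse prediction from the same latent-concatenation interface.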