ShellNeRF: Learning a controllable high-resolution model of the Eye and Periocular region

Eurographics 2024 paper 1013

Abstract

Eye gaze and expressions are crucial non-verbal signals in face-to-face communication. Visual effects and telepresence demand significant improvements in personalized tracking, animation, and synthesis of the eye region to achieve true immersion. Morphable face models, in combination with coordinate-based neural volumetric representations, show promise in solving the difficult problem of reconstructing intricate geometry (eyelashes) and synthesizing photorealistic appearance variations (wrinkles and specularities) of eye performances. We propose a novel hybrid representation - ShellNeRF - that builds a discretized volume around a 3DMM face mesh using concentric surfaces to model the deformable `periocular' region. We define a canonical space using the UV layout of the shells that constrains the space of dense correspondence search. Combined with an explicit eyeball mesh for modeling corneal light-transport, our model allows for animatable photorealistic 3D synthesis of the whole eye region. Using multi-view video input, we demonstrate significant improvements over state-of-the-art in expression re-enactment and transfer for high-resolution close-up views of the eye region.
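As a rough illustration of the shell construction described above, the following sketch offsets a base 3DMM mesh along its vertex normals to produce concentric shell layers. The function name, default offsets, and layer count are illustrative assumptions (the ablations below vary 2–12 layers and 6–20mm outer shells), not the paper's exact implementation:

```python
import numpy as np

def build_shells(vertices, normals, inner=0.0, outer=12.0, n_layers=6):
    """Offset a base mesh along vertex normals to form concentric shells.

    vertices: (V, 3) base 3DMM mesh vertices (mm)
    normals:  (V, 3) unit vertex normals
    inner, outer: signed offsets (mm) of the innermost/outermost shell
    Returns: (n_layers, V, 3) array of shell vertex positions.
    """
    offsets = np.linspace(inner, outer, n_layers)           # one offset per layer
    return vertices[None] + offsets[:, None, None] * normals[None]

# Toy example: a single vertex at the origin with normal +z.
verts = np.zeros((1, 3))
norms = np.array([[0.0, 0.0, 1.0]])
shells = build_shells(verts, norms, inner=0.0, outer=10.0, n_layers=6)
print(shells.shape)      # (6, 1, 3)
print(shells[-1, 0, 2])  # 10.0 -- outermost shell sits 10 mm above the surface
```

Because every shell shares the base mesh's UV layout, a 3D sample between shells can be indexed by (u, v, layer), which is the canonical space in which dense correspondences are constrained.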

* We recommend using the Chrome or Safari browser to correctly display all visuals.


Novel View Synthesis

We can render the same scene from continuous camera views. Unlike other methods, we ensure multi-view consistency and do not "hide" wrinkles and shadows beneath the skin surface. Nerface suffers from high instability and often diverges, as seen on the second subject.

Videos for both subjects: Ours, MVP, Nerface + EyeNeRF, 3DMM cond. EyeNeRF, MonoAvatar

Decomposition

We can also decompose each video into albedo, diffuse shading and specular shading.
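As a rough illustration of how such a decomposition recombines into a final image, the sketch below assumes the common image-formation split in which albedo is modulated by diffuse shading and specular highlights are added on top; the paper's exact formulation may differ:

```python
import numpy as np

def recompose(albedo, diffuse, specular):
    """Recombine decomposed layers: albedo modulated by diffuse
    shading, with specular highlights added on top.
    All inputs are (H, W, 3) float arrays in [0, 1]."""
    return np.clip(albedo * diffuse + specular, 0.0, 1.0)

albedo   = np.full((2, 2, 3), 0.8)   # flat skin-tone placeholder
diffuse  = np.full((2, 2, 3), 0.5)   # uniform shading
specular = np.zeros((2, 2, 3))
specular[0, 0] = 1.0                 # one bright highlight pixel
img = recompose(albedo, diffuse, specular)
print(img[0, 0, 0])  # 1.0 (highlight saturates after clipping)
print(img[1, 1, 0])  # 0.4 (albedo * diffuse only)
```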

Videos for both subjects: Albedo, Diffuse, Specular

Regazing

Our method can control eye gaze and synthesize novel gaze directions. Note how our method is significantly sharper and moves more smoothly than all competing methods. Mixture of Volumetric Primitives (MVP) requires a projected texture as input, which is not available for such manipulations; we therefore use a neutral texture as a placeholder. MVP is unable to regaze properly and instead blends from one eye pose to the next. EyeNeRF, on the other hand, lacks overall quality, which prevents the eyeball from being learned correctly for one pose of the first subject and generally for the second subject.

Videos for both subjects: Ours, MVP, Nerface + EyeNeRF, 3DMM cond. EyeNeRF, MonoAvatar

Regazing with a Moving Camera

We can do the same with a moving camera.

Videos for both subjects: Ours, MVP, Nerface + EyeNeRF, 3DMM cond. EyeNeRF, MonoAvatar

Interpolating Expressions

We interpolate between 13 expressions. Related works struggle to render convincing expressions. For our method, the periocular region smoothly adapts and shows detailed, naturally deforming wrinkles and highly detailed reflections on the eyeball. Again, for MVP we use the neutral texture as a placeholder.
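The interpolation can be sketched as a linear blend between 3DMM expression coefficient vectors, with each intermediate vector driving the model; the function name and coefficient dimension here are illustrative assumptions:

```python
import numpy as np

def interpolate_expressions(expr_a, expr_b, n_steps=5):
    """Linearly blend between two 3DMM expression coefficient vectors.

    expr_a, expr_b: (D,) expression parameter vectors
    Returns: (n_steps, D) sequence from expr_a to expr_b inclusive.
    """
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * expr_a[None] + t * expr_b[None]

# Toy 2D coefficient vectors standing in for real 3DMM parameters.
a = np.array([0.0, 1.0])
b = np.array([1.0, 0.0])
seq = interpolate_expressions(a, b, n_steps=3)
print(seq[1])  # [0.5 0.5] -- the midpoint expression
```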

Videos for both subjects: Ours, MVP, Nerface + EyeNeRF, 3DMM cond. EyeNeRF, MonoAvatar

Interpolating Expressions with a Moving Camera

We can also perform expression interpolation while moving the camera, maintaining 3D consistency throughout the motion. State-of-the-art methods struggle with gaze and expression changes and produce significant floaters at novel camera viewpoints.

Videos for both subjects: Ours, MVP, Nerface + EyeNeRF, 3DMM cond. EyeNeRF, MonoAvatar

3DMM Expressions

Our shell-based formulation enables fine-grained control via 3DMM parameters. We show slow-motion renderings of a highly complex expression: a closing eyelid. Only our method and MonoAvatar are capable of this.

Videos for both subjects: Ours, MVP, Nerface + EyeNeRF, 3DMM cond. EyeNeRF, MonoAvatar

Reenactment

We show results on expressions unseen during training. In these examples, we extract 3DMM coefficients from the Expression Target (left) and apply them to our target subject on the right. Note how our method faithfully applies the desired expressions. Our 3DMM fitting does not enforce temporal smoothness, resulting in jitter; since we crop each image based on the estimated eye pose, the ground-truth video jitters as well. Although MonoAvatar performs comparably in most situations, it is unable to handle certain expressions it did not directly observe, such as the half-open eye in the first expression, resulting in strong artifacts.

Videos for both subjects: Expression Target, Ours, MVP, Nerface + EyeNeRF, 3DMM cond. EyeNeRF, MonoAvatar

Ablations

We show results on expressions unseen during training and interpolated expressions with camera orbits for all our ablation settings.

Videos for both subjects: No Warp, No Shading, No Warp or Shading, No Eyelid Regularization, No Split Training, No Geometry Optimization, 2 Shell Layers, 6 Shell Layers, 12 Shell Layers, 20mm Outer Shell, 6mm Outer Shell, 3mm Inner Shell