FSRT: Facial Scene Representation Transformer
for Face Reenactment from Factorized Appearance,
Head-pose, and Facial Expression Features

Autonomous Intelligent Systems - Computer Science Institute VI and Center for Robotics, University of Bonn, Germany
Lamarr Institute for Machine Learning and Artificial Intelligence, Germany
CVPR 2024

Abstract

The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image, which may be of a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame, warp the source accordingly, and inpaint and refine the result to produce the output animation. We propose a transformer-based encoder for computing a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder, which is conditioned on keypoints and a facial expression vector extracted from the driving frame. The latent representations of the source person are learned in a self-supervised manner and factorize appearance, head pose, and facial expressions. Thus, they are perfectly suited for cross-reenactment. In contrast to most related work, our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and to support the generalizability of the learned representations. We evaluated our approach in a randomized user study. The results indicate superior performance compared to the state of the art in terms of motion transfer quality and temporal consistency.

Method Overview

Overview of our method with relative motion transfer. The source image(s), together with keypoints capturing head pose and with facial expression vectors, are encoded into a set-latent representation of the source person. For each query pixel, the decoder attends to this representation, conditioned on keypoints and a facial expression vector extracted from the driving frame. Images and videos are from the VoxCeleb test set.
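To illustrate the idea of relative motion transfer, the sketch below applies the keypoint offset between the current and the initial driving frame to the source keypoints. This is a minimal sketch under the assumption of a simple additive transfer scheme; the function and array names are illustrative, not the authors' implementation.

import numpy as np

def relative_keypoint_transfer(kp_source, kp_driving, kp_driving_initial):
    """Hypothetical helper: move the source keypoints by the offset the
    driving keypoints have accumulated since the initial driving frame."""
    # kp_*: arrays of shape (num_keypoints, 2) with image coordinates
    delta = kp_driving - kp_driving_initial   # motion of the driving face
    return kp_source + delta                  # same motion applied to the source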

Generalizing to CelebA-HQ



Our model generalizes to source images from the CelebA-HQ dataset (top rows) and driving videos from the official VoxCeleb2 test set (left).

Architecture

Given the driving frame and the source image(s), we extract facial keypoints and latent expression vectors. The extracted source information is used to generate the input representation of the Patch CNN. The encoder infers the set-latent source face representation from the patch embeddings. The decoder is applied to each query pixel individually and is conditioned on the driving keypoints and the latent driving expression vector.
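The following is a minimal PyTorch sketch of this layout. All module choices, dimensions, and the way source keypoints and the expression vector enter the Patch CNN (here simply as extra input channels) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class FSRTSketch(nn.Module):
    """Sketch of the described encoder/decoder pipeline (illustrative only)."""

    def __init__(self, num_kp=10, expr_dim=128, d_model=256, patch=8):
        super().__init__()
        # Patch CNN: embeds the source image together with source keypoint
        # heatmaps and a broadcast expression vector (extra input channels).
        self.patch_cnn = nn.Conv2d(3 + num_kp + expr_dim, d_model,
                                   kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Decoder: each query pixel cross-attends to the set-latent tokens,
        # so pixels are decoded independently of one another.
        self.query_mlp = nn.Linear(2 + 2 * num_kp + expr_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.to_rgb = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                    nn.Linear(d_model, 3))

    def forward(self, src_img, src_cond, drv_kp, drv_expr, query_xy):
        # src_img: (B, 3, H, W), src_cond: (B, num_kp + expr_dim, H, W)
        # drv_kp: (B, num_kp, 2), drv_expr: (B, expr_dim), query_xy: (B, Q, 2)
        tokens = self.patch_cnn(torch.cat([src_img, src_cond], dim=1))
        memory = self.encoder(tokens.flatten(2).transpose(1, 2))  # set-latent representation
        drv = torch.cat([drv_kp.flatten(1), drv_expr], dim=-1)
        drv = drv.unsqueeze(1).expand(-1, query_xy.shape[1], -1)
        queries = self.query_mlp(torch.cat([query_xy, drv], dim=-1))
        attended, _ = self.cross_attn(queries, memory, memory)
        return self.to_rgb(attended)  # (B, Q, 3) predicted pixel colors

In such a layout, additional source images would simply contribute further tokens to the set-latent memory, which is how the approach extends naturally to multiple source images.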

BibTeX

@inproceedings{rochow2024fsrt,
  title={{FSRT}: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features},
  author={Rochow, Andre and Schwarz, Max and Behnke, Sven},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={7716--7726},
  year={2024}
}