FSRT: Facial Scene Representation Transformer
for Face Reenactment from Factorized Appearance,
Head-pose, and Facial Expression Features

Autonomous Intelligent Systems - Computer Science Institute VI and Center for Robotics, University of Bonn, Germany
Lamarr Institute for Machine Learning and Artificial Intelligence, Germany
CVPR 2024

Abstract

The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image, which may be of a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame, warp the source accordingly, and inpaint and refine the result to produce the output animation. We propose a transformer-based encoder for computing a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder, which is conditioned on keypoints and a facial expression vector extracted from the driving frame. The latent representations of the source person are learned in a self-supervised manner and factorize appearance, head pose, and facial expressions. Thus, they are perfectly suited for cross-reenactment. In contrast to most related work, our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and to support the generalizability of the learned representations. We evaluated our approach in a randomized user study. The results indicate superior performance compared to the state of the art in terms of motion transfer quality and temporal consistency.

Method Overview

Overview of our method with relative motion transfer. The source image(s), together with keypoints capturing head pose and with facial expression vectors, are encoded into a set-latent representation of the source person. For each query pixel, the decoder attends to this representation, conditioned on keypoints and a facial expression vector extracted from the driving frame. Images and videos are from the VoxCeleb test set.
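To illustrate the idea of relative motion transfer, the sketch below applies the keypoint offset between the current and the initial driving frame to the source keypoints. This is a minimal sketch under the assumption of a simple additive transfer scheme; the function and array names are illustrative, not the authors' implementation.

import numpy as np

def relative_keypoint_transfer(kp_source, kp_driving, kp_driving_initial):
    """Hypothetical helper: move the source keypoints by the offset the
    driving keypoints have accumulated since the initial driving frame."""
    # kp_*: arrays of shape (num_keypoints, 2) with image coordinates
    delta = kp_driving - kp_driving_initial   # motion of the driving face
    return kp_source + delta                  # same motion applied to the source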

Generalizing to CelebA-HQ



Our model generalizes to source images from the CelebA-HQ dataset (top rows) and driving videos from the official VoxCeleb2 test set (left).

Architecture

Given the driving frame and the source image(s), we extract facial keypoints and latent expression vectors. The extracted source information is used to generate the input representation of the Patch CNN. The encoder infers the set-latent source face representation from the patch embeddings. The decoder is applied to each query pixel individually and is conditioned on the driving keypoints and the latent driving expression vector.
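The following is a minimal PyTorch sketch of this layout. All module choices, dimensions, and the way source keypoints and the expression vector enter the Patch CNN (here simply as extra input channels) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class FSRTSketch(nn.Module):
    """Sketch of the described encoder/decoder pipeline (illustrative only)."""

    def __init__(self, num_kp=10, expr_dim=128, d_model=256, patch=8):
        super().__init__()
        # Patch CNN: embeds the source image together with source keypoint
        # heatmaps and a broadcast expression vector (extra input channels).
        self.patch_cnn = nn.Conv2d(3 + num_kp + expr_dim, d_model,
                                   kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Decoder: each query pixel cross-attends to the set-latent tokens,
        # so pixels are decoded independently of one another.
        self.query_mlp = nn.Linear(2 + 2 * num_kp + expr_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.to_rgb = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                    nn.Linear(d_model, 3))

    def forward(self, src_img, src_cond, drv_kp, drv_expr, query_xy):
        # src_img: (B, 3, H, W), src_cond: (B, num_kp + expr_dim, H, W)
        # drv_kp: (B, num_kp, 2), drv_expr: (B, expr_dim), query_xy: (B, Q, 2)
        tokens = self.patch_cnn(torch.cat([src_img, src_cond], dim=1))
        memory = self.encoder(tokens.flatten(2).transpose(1, 2))  # set-latent representation
        drv = torch.cat([drv_kp.flatten(1), drv_expr], dim=-1)
        drv = drv.unsqueeze(1).expand(-1, query_xy.shape[1], -1)
        queries = self.query_mlp(torch.cat([query_xy, drv], dim=-1))
        attended, _ = self.cross_attn(queries, memory, memory)
        return self.to_rgb(attended)  # (B, Q, 3) predicted pixel colors

In such a layout, additional source images would simply contribute further tokens to the set-latent memory, which is how the approach extends naturally to multiple source images.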

BibTeX

@inproceedings{rochow2024fsrt,
  title={{FSRT}: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features},
  author={Rochow, Andre and Schwarz, Max and Behnke, Sven},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={7716--7726},
  year={2024}
}