Given the visual face information from the source video (upper left) and the driving audio (upper right), our method renders more realistic, high-fidelity, and lip-synchronized videos (lower). As shown in the zoomed-in patches, our method recovers fine details such as teeth.
Abstract
Talking face generation has a wide range of potential applications in the field of virtual digital humans. However, rendering high-fidelity facial video while ensuring lip synchronization remains a challenge for existing audio-driven talking face generation approaches. To address this issue, we propose HyperLips, a two-stage framework consisting of a hypernetwork for controlling lips and a high-resolution decoder for rendering high-fidelity faces. In the first stage, we construct a base face generation network that uses the hypernetwork to control the encoded latent code of the visual face information conditioned on audio. First, FaceEncoder extracts features from the visual face information taken from the source video frames containing the face and produces a latent code. Then, HyperConv, whose weight parameters are updated by HyperNet with the audio features as input, modifies the latent code to synchronize the lip movements with the audio. Finally, FaceDecoder decodes the modified and synchronized latent code into visual face content. In the second stage, we obtain higher-quality face videos through a high-resolution decoder. To further improve the quality of face generation, we train a high-resolution decoder, HRDecoder, using face images generated in the first stage and their detected sketches as input. Extensive quantitative and qualitative experiments show that our method outperforms state-of-the-art work, producing more realistic, high-fidelity, and better lip-synchronized results.
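The core idea of the first stage can be pictured as a convolution whose parameters are not learned directly but are produced by a hypernetwork conditioned on audio. The sketch below is a minimal PyTorch illustration of that idea only; the module names (HyperNet, HyperConv), layer sizes, and feature dimensions are our own assumptions, not the released implementation.

```python
# Minimal sketch of an audio-conditioned "hyper" convolution.
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperNet(nn.Module):
    """Maps an audio feature vector to the parameters of a 1x1 convolution."""
    def __init__(self, audio_dim, channels):
        super().__init__()
        self.channels = channels
        self.fc = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(),
            nn.Linear(256, channels * channels + channels),  # weights + biases
        )

    def forward(self, audio_feat):                       # audio_feat: (B, audio_dim)
        params = self.fc(audio_feat)
        w = params[:, : self.channels * self.channels]
        b = params[:, self.channels * self.channels :]
        return w.view(-1, self.channels, self.channels, 1, 1), b

class HyperConv(nn.Module):
    """1x1 convolution whose weights come from HyperNet, per sample."""
    def __init__(self, audio_dim, channels):
        super().__init__()
        self.hypernet = HyperNet(audio_dim, channels)

    def forward(self, latent, audio_feat):               # latent: (B, C, H, W)
        weights, biases = self.hypernet(audio_feat)
        out = []
        for z, w, b in zip(latent, weights, biases):      # per-sample parameters
            out.append(F.conv2d(z.unsqueeze(0), w, b))
        return torch.cat(out, dim=0)                      # audio-modulated latent
```

In the full pipeline, the modulated latent would then be passed to FaceDecoder to render the lip-synchronized base face.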
Approach
The overview of our framework is shown above. Given an audio and video sequence, we aim to generate a high-fidelity talking face video with synchronized lip movements by inpainting the occluded lower half of the face in the input video frame by frame. Our proposed method consists of two stages: Base Face Generation and High-Fidelity Rendering. In Base Face Generation, we design a hypernetwork that takes audio features as input to control the encoding and decoding of visual information and obtain base face images. In High-Fidelity Rendering, we train an HRDecoder network using face data produced by the network trained in the first stage, together with the corresponding face sketches, to enhance the base faces.
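The second stage can be read as an image-to-image refinement step: the base face and its detected facial sketch are concatenated channel-wise and decoded into a sharper face. The following is a hedged, minimal sketch of that data flow; the layer layout and channel counts are assumptions and are far simpler than the actual HRDecoder.

```python
# Minimal sketch of the high-fidelity rendering stage (assumed layer layout).
import torch
import torch.nn as nn

class HRDecoder(nn.Module):
    """Refines a base face using a channel-wise concatenated face sketch."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1, 64, 3, padding=1), nn.ReLU(),  # RGB face + 1-ch sketch
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),   # enhanced RGB face
        )

    def forward(self, base_face, sketch):
        # base_face: (B, 3, H, W), sketch: (B, 1, H, W), both in [0, 1]
        return self.net(torch.cat([base_face, sketch], dim=1))

# Usage: enhanced = HRDecoder()(base_face, sketch)
```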
Visual Comparison
Quantitative Comparison
Tables 1 and 2 show the quantitative comparisons on the LRS2 and MEAD-Neutral datasets, respectively. The results show that both HyperLips-Base and HyperLips-HR generate faces that are significantly better than those of other methods in terms of PSNR, SSIM, and LMD. HyperLips-HR is significantly better than HyperLips-Base in terms of PSNR and SSIM, which shows that our HRDecoder enhances high-fidelity face rendering. However, there is no significant improvement in LMD, which indicates that HRDecoder does not help improve lip synchronization. For LSE-C and LSE-D, Wav2Lip performs better and even outperforms the ground truth. This only proves that its lip-sync results are nearly comparable to the ground truth, not better. Although our method does not achieve the best LSE-C and LSE-D scores, it performs better on LMD, another synchronization metric that measures correspondence in the visual domain.
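For reference, two of the per-frame metrics discussed above can be computed roughly as follows. This is a hedged numpy sketch: PSNR from mean squared error, and LMD as the mean Euclidean distance between corresponding (mouth) landmarks; the landmark detector itself is assumed to be available externally and is not shown.

```python
# Rough numpy sketch of two of the reported metrics (illustrative only).
import numpy as np

def psnr(generated, reference, max_val=255.0):
    """Peak signal-to-noise ratio between two frames of the same shape."""
    mse = np.mean((generated.astype(np.float64) - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def lmd(pred_landmarks, gt_landmarks):
    """Landmark distance: mean Euclidean distance over corresponding landmarks.

    pred_landmarks, gt_landmarks: arrays of shape (N, 2), e.g. from an
    external face-landmark detector (assumed, not shown here).
    """
    return float(np.mean(np.linalg.norm(pred_landmarks - gt_landmarks, axis=1)))
```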
User study
User study. As can be seen, our results stand out from the other methods in terms of both video quality and lip synchronization.
Citation
@inproceedings{chen2023HyperLips,
  title     = {HyperLips: Hyper Control Lips with High Resolution Decoder for Talking Face Generation},
  author    = {Yaosen Chen and Yu Yao and Zhiqiang Li and Wei Wang and Yanru Zhang and Han Yang and Xuming Wen},
  year      = {2023},
  booktitle = {arXiv}
}