DiffTalk

The PyTorch implementation of our CVPR 2023 paper "DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation".

[Project] [Paper] [Video Demo]

Requirements

python 3.7.0
pytorch 1.10.0
pytorch-lightning 1.2.5
torchvision 0.11.0

For more details, see requirements.txt. We ran the experiments on 8 NVIDIA 3090Ti GPUs.

Place the first-stage model in ./models.

Dataset

Please download the HDTF dataset for training and testing, and process it as follows.

Data Preprocessing:

Set all videos to 25 fps.
Extract the audio signals and facial landmarks.
Put the processed data in ./data/HDTF, and organize the data directory as follows.
Construct data_train.txt and data_test.txt as follows.
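
For the first two steps (frame-rate conversion and audio extraction), a minimal sketch using ffmpeg is given here. It only builds the command lines. The helper names, the file paths, and the 16 kHz mono audio format are assumptions for illustration and are not taken from this repository; landmark extraction is not covered.

```python
from pathlib import Path
from typing import List


def fps_command(src: Path, dst: Path, fps: int = 25) -> List[str]:
    """Build an ffmpeg command that re-encodes `src` at a constant frame rate."""
    # -y overwrites the output file, -r sets the output frame rate.
    return ["ffmpeg", "-y", "-i", str(src), "-r", str(fps), str(dst)]


def audio_command(src: Path, dst: Path, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that extracts the audio track of `src`."""
    # -vn drops the video stream, -ac 1 downmixes to mono, -ar sets the sample rate.
    # The 16 kHz mono format is an assumption, not a documented requirement.
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-vn", "-ac", "1", "-ar", str(sample_rate), str(dst),
    ]
```

Each command can be executed with subprocess.run(cmd, check=True) once ffmpeg is installed.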

./data/HDTF:

|——data/HDTF
   |——images
      |——0_0.jpg
      |——0_1.jpg
      |——...
      |——N_M.jpg
   |——landmarks
      |——0_0.lmd
      |——0_1.lmd
      |——...
      |——N_M.lmd
   |——audio_smooth
      |——0_0.npy
      |——0_1.npy
      |——...
      |——N_M.npy
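
Before training, it may help to check that every sample has all three files. The following is a minimal sketch, assuming the sub-directory names and the .jpg, .lmd, and .npy extensions shown in the listing; the function names are hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import Dict, Set

# Sub-directory -> file extension, following the directory listing.
LAYOUT = {"images": ".jpg", "landmarks": ".lmd", "audio_smooth": ".npy"}


def sample_ids(root: Path) -> Dict[str, Set[str]]:
    """Collect the sample IDs (file stems such as '0_0') found in each sub-directory."""
    return {
        sub: {p.stem for p in (root / sub).glob("*" + ext)}
        for sub, ext in LAYOUT.items()
    }


def incomplete_samples(root: Path) -> Set[str]:
    """Return the IDs that are missing from at least one sub-directory."""
    ids = sample_ids(root)
    everything = set.union(*ids.values())
    complete = set.intersection(*ids.values())
    return everything - complete
```

For example, incomplete_samples(Path("./data/HDTF")) returns an empty set when the three sub-directories contain matching files.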

./data/data_train(test).txt:

0_0
0_1
0_2
...
N_M

N is the total number of classes, and M is the class size.
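
The two list files can be generated from the image directory. The sketch below is one possible approach, not the authors' script: it assumes the sample IDs are the image file stems and that whole classes (the first index) are held out for testing. The function names and the test_classes parameter are hypothetical.

```python
from pathlib import Path
from typing import Set, Tuple


def sort_key(sample_id: str) -> Tuple[int, int]:
    """Order IDs numerically, so that '10_0' comes after '9_0'."""
    n, m = sample_id.split("_")
    return int(n), int(m)


def write_lists(image_dir: Path, out_dir: Path, test_classes: Set[int]) -> Tuple[int, int]:
    """Write data_train.txt and data_test.txt; return the (train, test) sample counts."""
    ids = sorted((p.stem for p in image_dir.glob("*.jpg")), key=sort_key)
    # Assumption: a sample belongs to the test split when its class index is held out.
    test = [i for i in ids if sort_key(i)[0] in test_classes]
    train = [i for i in ids if sort_key(i)[0] not in test_classes]
    (out_dir / "data_train.txt").write_text("".join(i + "\n" for i in train))
    (out_dir / "data_test.txt").write_text("".join(i + "\n" for i in test))
    return len(train), len(test)
```

A call such as write_lists(Path("./data/HDTF/images"), Path("./data"), test_classes={0}) would hold out class 0; which classes to hold out is left to the user.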

Training

sh run.sh

Test

sh inference.sh

Weakness

DiffTalk models talking head generation as an iterative denoising process, which takes longer to synthesize a frame than most GAN-based approaches. This is a common problem of LDM-based methods.
The model is trained on the HDTF dataset and sometimes fails on identities from other datasets.
When a portrait is driven with more challenging cross-identity audio, the audio-lip synchronization of the synthesized video is slightly inferior to that in the self-driven setting.
During inference, the network is also sensitive to the mask shape in z_T: the mask must cover the mouth region completely, and its shape must not leak any lip-shape information.

Acknowledgement

This code is built upon the publicly available latent-diffusion code. We thank the authors of latent-diffusion for making their excellent work and code publicly available.

Citation

Please cite the following paper if you use this repository in your research.

@inproceedings{shen2023difftalk,
   author={Shen, Shuai and Zhao, Wenliang and Meng, Zibin and Li, Wanhua and Zhu, Zheng and Zhou, Jie and Lu, Jiwen},
   title={DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation},
   booktitle={CVPR},
   year={2023}
}
