
emogen's Introduction

Emotionally Enhanced Talking Face Generation


Results video: Results.mp4

This repository is the official PyTorch implementation of our paper Emotionally Enhanced Talking Face Generation, accepted at the ACM MM 2023 McGE Workshop and the ICCV 2023 CVEU Workshop. We introduce a multimodal framework to generate lip-synced videos agnostic to any identity, language, and emotion. Our framework is equipped with a user-friendly web interface that provides a real-time experience for talking face generation with emotions.

Model architecture (figure)

  • 📑 Original Paper: Paper
  • 📰 Project Page: Project Page
  • 🌀 Demo: Demo Video
  • ⚡ Live Testing: Interactive Demo

Note: Currently, our web interface uses the CPU to generate results.

Disclaimer

All results from this open-source code or our demo website should be used for research/academic/personal purposes only.

Prerequisites

  • ffmpeg: sudo apt-get install ffmpeg
  • Install necessary packages using pip install -r requirements.txt.
  • Download the pre-trained face-detection model to face_detection/detection/sfd/s3fd.pth. Use the alternative link if the primary one does not work.

Preparing CREMA-D for training

Download data

Download the data from this repo.

Convert videos to 25 fps

python convertFPS.py -i <raw_video_folder> -o <folder_to_save_25fps_videos>
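convertFPS.py resamples every clip in the input folder. To spot-check a single video first, a plain ffmpeg call gives an equivalent 25 fps conversion (roughly what the script does per file; the exact encoder settings are an assumption):

ffmpeg -i <input_video> -r 25 <output_25fps.mp4>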

Preprocess dataset

python preprocess_crema-d.py --data_root <folder_of_25fps_videos> --preprocessed_root preprocessed_dataset/

Train!

There are three major steps: (i) train the expert lip-sync discriminator, (ii) train the emotion discriminator, and (iii) train the EmoGen model.

Training the expert discriminator

python color_syncnet_train.py --data_root preprocessed_dataset/ --checkpoint_dir <folder_to_save_checkpoints>
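Following the Wav2Lip-style setup this code builds on, the expert discriminator scores how well a window of lower-face frames matches the corresponding mel-spectrogram chunk using a cosine-similarity BCE objective. Below is a minimal sketch of that loss, with illustrative names that are not copied from color_syncnet_train.py:

import torch.nn.functional as F

def cosine_sync_loss(audio_emb, face_emb, is_synced):
    # audio_emb, face_emb: (batch, dim) embeddings of a mel chunk and a face window
    # is_synced: (batch, 1) targets, 1.0 for in-sync pairs and 0.0 for off-sync pairs
    sim = F.cosine_similarity(audio_emb, face_emb)      # (batch,)
    sim = sim.clamp(1e-7, 1.0 - 1e-7)                   # keep BCE numerically valid
    return F.binary_cross_entropy(sim.unsqueeze(1), is_synced)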

Training the emotion discriminator

python emotion_disc_train.py -i preprocessed_dataset/ -o <folder_to_save_checkpoints>
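The emotion discriminator is trained to recognize the categorical emotion of a clip; the inference section below uses the six CREMA-D classes [HAP, SAD, FEA, ANG, DIS, NEU]. A minimal sketch of such a classification objective, with illustrative names that are not copied from emotion_disc_train.py:

import torch.nn as nn

NUM_EMOTIONS = 6                      # HAP, SAD, FEA, ANG, DIS, NEU
emotion_criterion = nn.CrossEntropyLoss()

def emotion_disc_loss(logits, labels):
    # logits: (batch, NUM_EMOTIONS) raw scores from the discriminator
    # labels: (batch,) integer emotion ids in [0, NUM_EMOTIONS)
    return emotion_criterion(logits, labels)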

Training the main model

python train.py --data_root preprocessed_dataset/ --checkpoint_dir <folder_to_save_checkpoints> --syncnet_checkpoint_path <path_to_expert_disc_checkpoint> --emotion_disc_path <path_to_emotion_disc_checkpoint>

You can also set additional less commonly used hyper-parameters at the bottom of the hparams.py file.

Note: To keep the code simple, we use torch.utils.data.random_split in the training scripts to split the CREMA-D dataset into training and testing sets. CREMA-D has no official train-test split; ideally, you should follow this evaluation protocol for splitting.
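For reference, here is a minimal sketch of such a random split; the dummy dataset stands in for the real CREMA-D dataset object built from preprocessed_dataset/, and the 90/10 ratio is illustrative:

import torch
from torch.utils.data import TensorDataset, random_split

full_dataset = TensorDataset(torch.randn(100, 3))     # placeholder for the CREMA-D dataset object
test_size = int(0.1 * len(full_dataset))              # illustrative 90/10 split
train_size = len(full_dataset) - test_size
generator = torch.Generator().manual_seed(0)          # fixed seed so the split is reproducible
train_set, test_set = random_split(full_dataset, [train_size, test_size], generator=generator)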

Inference

Comment out these code lines before running inference: line1 and line2.

python inference.py --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source> --emotion <categorical emotion>

The result is saved (by default) as results/{emotion}.mp4. You can change this via an argument, as with several other available options. The audio source can be any file containing audio data that FFmpeg supports: *.wav, *.mp3, or even a video file, from which the code automatically extracts the audio. Choose the categorical emotion from this list: [HAP, SAD, FEA, ANG, DIS, NEU].
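For example, with a trained checkpoint the full command could look like this (the checkpoint and sample paths are placeholders):

python inference.py --checkpoint_path checkpoints/emogen.pth --face samples/speaker.mp4 --audio samples/speech.wav --emotion HAP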

Tips for better results (a combined example follows this list):

  • Experiment with the --pads argument to adjust the detected face bounding box; this often improves results. You might need to increase the bottom padding to include the chin region, e.g. --pads 0 20 0 0.
  • If the mouth looks dislocated or you see weird artifacts such as two mouths, the face detections may be over-smoothed. Try again with the --nosmooth argument.
  • Experiment with the --resize_factor argument to generate a lower-resolution video. The models are trained on lower-resolution faces, so you may get better, more visually pleasing results for 720p videos than for 1080p videos (in many cases, the latter works well too).
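Combining these tips, a retry on a high-resolution input could look like the following (the checkpoint and sample paths are placeholders):

python inference.py --checkpoint_path checkpoints/emogen.pth --face samples/speaker_1080p.mp4 --audio samples/speech.wav --emotion HAP --pads 0 20 0 0 --resize_factor 2 --nosmooth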

Evaluation

Please check the evaluation/ folder for the instructions.

Future Plans

  • Train the model on the MEAD dataset.
  • Develop a metric to evaluate the video quality in case of emotion incorporation.
  • Improve the demo website based on the user study in the paper.

Citation

This repository can only be used for personal/research/non-commercial purposes. Please cite the following paper if you use this repository:

@misc{goyal2023emotionally,
      title={Emotionally Enhanced Talking Face Generation}, 
      author={Sahil Goyal and Shagun Uppal and Sarthak Bhagat and Yi Yu and Yifang Yin and Rajiv Ratn Shah},
      year={2023},
      eprint={2303.11548},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

Copyright (c) 2023 Sahil Goyal, Shagun Uppal, Sarthak Bhagat, Yi Yu, Yifang Yin, Rajiv Ratn Shah

For license information, see the license.

Acknowledgments

The code structure is inspired by Wav2Lip, and we thank the authors for their wonderful code. The face-detection code is taken from the face_alignment repository; we thank the authors for releasing their code and models. The demo website was developed by @ddhroov10 and @SakshatMali.

emogen's People

Contributors

sahilg06, sarthak268


emogen's Issues

Model Availability for Inference

Hi Team,
That's a really good research avenue you have there. Would it be possible for your team to share any of the pretrained models that we can use directly to test and see the inference? It would be quite useful.

Or
Would we have to train the models ourselves?

Thanks. Look forward to your response.

Pretrained checkpoint

Hi,

Your work is very interesting. I would like to test the model with my own videos and audio, but the link to the demo is not working. Could you share the pretrained checkpoint so I can run the inference command?

Thank you in advance!

lip expert discriminator

Hello, when training the lip expert discriminator, the epoch stays stuck at 0, yet training the emotion discriminator makes progress. Why is this?

Cannot train with color_syncnet_train.py

When I run:
python color_syncnet_train.py --data_root preprocessed_dataset/ --checkpoint_dir folder_to_save_checkpoints/
It only displays the following and nothing happens:
Epoch: 0
0it [00:00, ?it/s]
Could anyone help me?

KeyError: 'state_dict' when running inference.py

F:\DaChuang\EmoGen>python inference.py --checkpoint_path face_detection/detection/sfd/s3fd.pth --face data/0.mp4 --audio data/1.mp3 --emotion HAP --resize_factor 2

Using cuda for inference.
Reading video frames...
Number of frames available for inference: 1997
Extracting raw audio...

Traceback (most recent call last):
  File "inference.py", line 315, in <module>
    main()
  File "inference.py", line 282, in main
    model = load_model(args.checkpoint_path)
  File "inference.py", line 184, in load_model
    s = checkpoint["state_dict"]
KeyError: 'state_dict'

I prepared 0.mp4 and 1.mp3 and tried to run inference on a laptop with a 1650 Ti, but something went wrong.
Could you please tell me how to resolve this? Thank you very much.

load model error

Thank you for the great work!
The folder with your model's parameters is not provided; should I train the model on the dataset before running inference?

KeyError: 'state_dict' in s = checkpoint["state_dict"] inference.py

The uploaded checkpoint "s3fd.pth" has no key called "state_dict". I understood from previous issues that this is the face-detection model, not the whole model, so are the weights for the whole model available yet? I have an .mp4 video and a .wav audio file and want to try your model.

color_syncnet_train.py training not working

After executing the command "python color_syncnet_train.py --data_root preprocessed_dataset/ --checkpoint_dir C:\EmoGen\checkpoints\syncnet", my CPU usage stays at 100% for several hours, but it seems that the training has not started yet.
I'm not sure at which stage I'm stuck.

(screenshot attached)

train model

I encountered an issue when running the training code. The code requires the LRS2 training dataset as input, but there are problems when loading the CREMA-D dataset.

color_syncnet_train.py training not running

Can you advise on a workaround for this? I have waited for color_syncnet_train.py to run for several hours, numerous times, and nothing loads or runs. You mentioned internal changes around tqdm, but what specifically could the issue be with it?

I could run emotion_disc_train.py training, but not the color_syncnet_train.py training

Command entered on the command line:
python color_syncnet_train.py --data_root ./preprocessed_dataset/ --checkpoint_dir ./checkpoint/

I was running the code on google Colab and I got this error


Using cuda for inference.
Reading video frames...
Number of frames available for inference: 211
/content/EmoGen/audio.py:100: FutureWarning: Pass sr=16000, n_fft=800 as keyword args. From version 0.10 passing these as positional arguments will result in an error
return librosa.filters.mel(hp.sample_rate, hp.n_fft, n_mels=hp.num_mels,
(80, 2372)
Length of mel chunks: 885
0% 0/4 [00:00<?, ?it/s]
0% 0/14 [00:00<?, ?it/s]
7% 1/14 [00:12<02:41, 12.42s/it]
14% 2/14 [00:13<01:09, 5.82s/it]
21% 3/14 [00:15<00:42, 3.83s/it]
29% 4/14 [00:16<00:27, 2.79s/it]
36% 5/14 [00:18<00:21, 2.44s/it]
43% 6/14 [00:19<00:16, 2.11s/it]
50% 7/14 [00:20<00:12, 1.81s/it]
57% 8/14 [00:21<00:09, 1.62s/it]
64% 9/14 [00:23<00:07, 1.49s/it]
71% 10/14 [00:24<00:05, 1.40s/it]
79% 11/14 [00:25<00:03, 1.33s/it]
86% 12/14 [00:26<00:02, 1.29s/it]
93% 13/14 [00:27<00:01, 1.26s/it]
100% 14/14 [00:30<00:00, 2.18s/it]
Load checkpoint from: /content/EmoGen/checkpoint/pre.pth
0% 0/4 [00:31<?, ?it/s]
Traceback (most recent call last):
  File "/content/EmoGen/inference.py", line 298, in <module>
    main()
  File "/content/EmoGen/inference.py", line 265, in main
    model = load_model(args.checkpoint_path)
  File "/content/EmoGen/inference.py", line 180, in load_model
    s = checkpoint["state_dict"]
KeyError: 'state_dict'

Error while running inference file

Traceback (most recent call last):
  File "/workspace/Wav2Expression/tafreen/EmoGen-master/inference.py", line 298, in <module>
    main()
  File "/workspace/Wav2Expression/tafreen/EmoGen-master/inference.py", line 265, in main
    model = load_model(args.checkpoint_path)
  File "/workspace/Wav2Expression/tafreen/EmoGen-master/inference.py", line 180, in load_model
    s = checkpoint["state_dict"]
KeyError: 'state_dict'

I have trained for 1500 epochs, but at test time it throws this error even though I passed the checkpoint path of the main model itself.
Could you please tell me how to resolve this?

Thanks and regards

inference.py problem

I was running the inference.py and I got this error

Traceback (most recent call last):
  File "/sharefiles/zhaodan/projects/EmoGen-master/inference.py", line 298, in <module>
    main()
  File "/sharefiles/zhaodan/projects/EmoGen-master/inference.py", line 233, in main
    mel = audio.melspectrogram(wav)
  File "/sharefiles/zhaodan/projects/EmoGen-master/audio.py", line 47, in melspectrogram
    S = _amp_to_db(_linear_to_mel(np.abs(D))) - hp.ref_level_db
  File "/sharefiles/zhaodan/projects/EmoGen-master/audio.py", line 95, in _linear_to_mel
    _mel_basis = _build_mel_basis()
  File "/sharefiles/zhaodan/projects/EmoGen-master/audio.py", line 100, in _build_mel_basis
    return librosa.filters.mel(hp.sample_rate, hp.n_fft, n_mels=hp.num_mels,
TypeError: mel() takes 0 positional arguments but 2 positional arguments (and 3 keyword-only arguments) were given

Thanks and regards
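This TypeError matches the keyword-only signature that librosa.filters.mel gained in librosa 0.10. Assuming that is the cause, passing the arguments by keyword in audio.py's _build_mel_basis avoids it; the sketch below is illustrative (hp stands for the hparams object already used there, and any remaining keyword arguments from the original call should be kept unchanged):

import librosa

def _build_mel_basis(hp):
    # assumed fix for librosa >= 0.10: sr and n_fft must be passed as keywords
    return librosa.filters.mel(sr=hp.sample_rate, n_fft=hp.n_fft, n_mels=hp.num_mels)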

Pretrained inference model

Hey, hats off to the wonderful research and work. I wanted to ask whether you can share the pretrained model, as the online link https://midas.iiitd.edu.in/emo/ is down and we cannot test it. If you release the pretrained model, it would be great to use as a benchmark model for my research.

Thanks and regards

Is this method real time?

Thank you for sharing your work. Is this method real-time, and how long does it take to generate one frame?

save_sample_images in train.py

Hello, I am running train.py and I am curious about the meaning of each part of the saved sample images.

After training, I got results like the one below.
Can you explain what each part of the image shows?

Thank you.
(screenshot attached)
