
talking-face_pc-avs's People

Contributors

hangz-nju-cuhk, sunyasheng


talking-face_pc-avs's Issues

Distorted Output

Hey @Hangz-nju-cuhk
Thanks for sharing such nice work. I tried replicating it, however I got really distorted results, as shown below.
The way I created my own dataset was:

  • Cropped image to 256x256
  • Ran the script scripts/prepare_testing_files.py using the above source image and driving video.
  • Then copied the generated csv into demo folder
  • Ran bash experiments/demo_vox.sh

Let me know if I am doing something wrong

concat.mp4

nice work, about clip_len

Hello, I'd like to ask how long the video clips mentioned in the paper are. During training, do you read batch_size video-frame clips per iteration? I noticed the select_frames function in your code, with the related parameters clip_len and generate_interval, but I don't quite understand them. How should these two parameters be set during training? Looking forward to your reply, thanks.

Face alignment fails?

Hello, when I run face alignment I get the following error. How can I fix it? Thanks.

python scripts/align_68.py --folder_path ./misc/Input/517600055
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.87it/s]
TypeError: expected dtype object, got 'numpy.dtype[float32]'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "scripts/align_68.py", line 110, in
main()
File "scripts/align_68.py", line 106, in main
align_folder(args.folder_path, save_img_path)
File "scripts/align_68.py", line 50, in align_folder
preds = fa.get_landmarks_from_directory(folder_path)
File "/home/ruanjiyang/anaconda3/lib/python3.7/site-packages/face_alignment/api.py", line 238, in get_landmarks_from_directory
preds = self.get_landmarks_from_image(image, bounding_boxes)
File "/home/ruanjiyang/anaconda3/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/home/ruanjiyang/anaconda3/lib/python3.7/site-packages/face_alignment/api.py", line 154, in get_landmarks_from_image
pts, pts_img = get_preds_fromhm(out, center.numpy(), scale)
File "/home/ruanjiyang/anaconda3/lib/python3.7/site-packages/face_alignment/utils.py", line 199, in get_preds_fromhm
preds, preds_orig = _get_preds_fromhm(hm, idx, center, scale)
SystemError: CPUDispatcher(<function _get_preds_fromhm at 0x7f5d8140b050>) returned a result with an error set
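
For reference, this TypeError from face_alignment's numba-compiled _get_preds_fromhm is a known incompatibility between older numba releases and numpy >= 1.20. One possible workaround (version pins are indicative only):

    # Option 1: pin numpy below 1.20 so the installed numba can handle it.
    pip install "numpy<1.20"
    # Option 2: upgrade numba (and face_alignment) to versions that support newer numpy.
    pip install --upgrade numba face_alignment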

About results on the test set

Currently, I randomly selected part of the test set. Overall, the generated videos are less sharp than the input images; some generated videos are fairly clear (e.g. the second to last), but most are somewhat blurry. I am not sure whether this is a normal test result (I have not changed any parameters in the inference setup given on GitHub). Looking forward to your reply, thanks.

concat.mp4
concat.mp4
concat.mp4
concat.mp4
concat.mp4

can you share the details of the augmentation generating non-id space

I noticed that you mentioned the paper "Neural Head Reenactment with Latent Pose Descriptors" in an issue about the non-id space. I tried the augmentation used in their code, but it seems different from yours, with fewer changes.
This part seems vital, since the non-id space is what allows the model to disentangle the identity feature from the pose feature, so I would like to know the augmentation details in your paper. Could you share them with us?

Training code

When will the training code be open-sourced?

Question on pose space training

Thanks so much for the great work and codes. When I read the paper and codes, I get confused about the pose space learning part.

As described in the training strategy in the paper, the identity encoder and speech content space are first pre-trained and then loaded into the overall framework to train the generator and learn the pose space. I can follow the training procedure; however, for learning the pose space I am confused about whether you use the loss (compute_diff_loss)

def compute_diff_loss(self, input_img, pose_feature, pose_feature_audio, G_losses):

when training the whole generator. If so, that would be consistent with the code, which computes the L1 loss between the pose differences and pose_feature_audio.
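
A minimal sketch of one way such a difference loss could be computed, purely to illustrate the question; the repository's actual compute_diff_loss may differ, and the shapes and names below are assumptions:

    import torch
    import torch.nn.functional as F

    def compute_diff_loss_sketch(pose_feature, pose_feature_audio):
        # Assumed shapes: (batch, num_frames, feature_dim).
        # Penalize the L1 distance between frame-to-frame differences of the visual
        # pose features and the audio-derived pose features.
        diff_visual = pose_feature[:, 1:] - pose_feature[:, :-1]
        diff_audio = pose_feature_audio[:, 1:] - pose_feature_audio[:, :-1]
        return F.l1_loss(diff_visual, diff_audio)

    loss = compute_diff_loss_sketch(torch.randn(2, 5, 12), torch.randn(2, 5, 12))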

Last but not least, congratulations on the research progress, I think it is a great breakthrough to disentangle the sync information and head pose in the feature representation. Looking forward to your reply!

Preprocess video

Hello and thank you very much for sharing your great work!
I was wondering how to preprocess a pose video.
I understand how to preprocess the input image according to the README, but I am not sure how to do the same with the video for setting the pose.
Thank you very much in advance!
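
For what it's worth, one plausible route is to treat the pose video as a folder of frames and run the same alignment script used elsewhere in these issues; the paths below are illustrative:

    # Extract the pose video into numbered frames, then align them like the input image.
    mkdir -p misc/Pose_Source/my_pose
    ffmpeg -i pose_video.mp4 misc/Pose_Source/my_pose/%06d.jpg
    python scripts/align_68.py --folder_path misc/Pose_Source/my_pose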

result does not look like src

the source has 700 aligned frames, so:

the csv:
/content/Talking-Face_PC-AVS/misc/Input/faf/ 700 /content/Talking-Face_PC-AVS/misc/Pose_Source/517600078 160 /content/Talking-Face_PC-AVS/misc/Audio_Source/00015.mp3 /content/Talking-Face_PC-AVS/misc/Mouth_Source/681600002 363 dummy

how to improve the result?

avconcat.6.mp4

Exception('None Image') while using custom audio source

Thank you author for your great work.
I have encountered problems while using a custom audio source.
I used cropped images from the audio source and set the mouth source path to the cropped image files.
For example, when I generated the demo.csv, I saw that the mouth source contains 208 frames. But when I run !bash experiments/demo_vox.sh, the image path indicates a number beyond 208 and causes Exception('None Image').
However, there was no problem while using the audio sources you provided.
Do you have any idea about this problem?
Many thanks!

About contrastive learning

Thanks for sharing your work! Where is the script for the contrastive learning between image features and audio features? I found a class in "models\networks\loss.py": SoftmaxContrastiveLoss; is this the implementation of the contrastive learning?
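
For context, a generic InfoNCE-style softmax contrastive loss between paired image and audio features looks roughly like the sketch below; this is only an illustration of the idea, not necessarily what SoftmaxContrastiveLoss in the repository implements:

    import torch
    import torch.nn.functional as F

    def softmax_contrastive_sketch(img_feat, aud_feat, temperature=0.07):
        # img_feat, aud_feat: (batch, dim); matching rows are treated as positive pairs.
        img_feat = F.normalize(img_feat, dim=-1)
        aud_feat = F.normalize(aud_feat, dim=-1)
        logits = img_feat @ aud_feat.t() / temperature      # (batch, batch) similarity matrix
        targets = torch.arange(img_feat.size(0), device=img_feat.device)
        # Each image feature should be most similar to its own audio feature.
        return F.cross_entropy(logits, targets)

    loss = softmax_contrastive_sketch(torch.randn(8, 512), torch.randn(8, 512))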

TypeError: exceptions must derive from BaseException

Thanks for your great job.
I ran bash experiments/demo_vox.sh and got an error:

Network [ResNeXtEncoder] was created. Total number of parameters: 38.0 million. To see the architecture, do print(network).
./checkpoints/demo/latest_net_G.pth not exists yet!
Traceback (most recent call last):
File "inference.py", line 117, in
main()
File "inference.py", line 97, in main
model = create_model(opt).cuda()
File "/content/Talking-Face_PC-AVS/models/init.py", line 33, in create_model
instance = model(opt)
File "/content/Talking-Face_PC-AVS/models/av_model.py", line 24, in init
self.initialize_networks(opt)
File "/content/Talking-Face_PC-AVS/models/av_model.py", line 272, in initialize_networks
self.load_network(netG, 'G', opt.which_epoch)
File "/content/Talking-Face_PC-AVS/models/av_model.py", line 773, in load_network
raise ('Generator must exist!')
TypeError: exceptions must derive from BaseException.

Thanks
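
For context, the log shows the generator checkpoint ./checkpoints/demo/latest_net_G.pth is missing (the pretrained weights need to be downloaded and placed there), and the TypeError itself comes from raising a plain string in load_network. A minimal sketch of a raise that would surface the real problem (the variable name save_path is an assumption):

    import os

    def check_checkpoint(save_path):
        # Raising a bare string triggers "exceptions must derive from BaseException";
        # raising a real exception reports the actual issue: the missing checkpoint.
        if not os.path.exists(save_path):
            raise FileNotFoundError('Generator checkpoint not found: ' + save_path)
        return save_path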

Error related to "align_68.py" file

Thank you for your excellent work!
After running the "align_68.py" file, I first encountered an error [ZeroDivisionError: division by zero], after fixing it. The code is executed but no file is generated in the p_cropped folder.

python scripts/align_68.py --folder_path scripts/p

0it [00:00, ?it/s]
cropped files saved at scripts/p_cropped

Thank you very much for your guidance!

How was the identity encoder trained?

Hi, your paper mentions that you trained the identity encoder on the VoxCeleb2 dataset and used it to extract pose information, with three kinds of data augmentation during training. Could you describe in detail how it was trained? How is the loss function defined?

Eyes cannot move?

Hi, thanks for your great work! When I test with my own video, I found that the eyes in the generated video don't move with the driving video. Is that normal?

About non-id space training.

Thanks for sharing your work. Could you tell me how you train the non-id space and what kind of loss you use? I find that no loss for non-id space training seems to be mentioned in the paper. Many thanks.

Training code will be released?

Hi,

Congratulations on this great work and thanks for releasing the codes! The results are mighty impressive! Any plans on releasing the training code too?

Questions about the demo video in project page

In the demo video on the project page, there are generated faces of Obama and Biden at timestamp 2:26.
image
It seems that there is a slight identity mismatch, which may be a potential direction for improvement.
Could you please tell us which source videos you used in this case, or their location in VoxCeleb2 (if applicable), for better reference?
Thanks in advance and appreciate the great work PC-AVS!

Windows install

Thanks for sharing. Could you explain how to install on Windows with Anaconda? Thank you!

RuntimeError: mat1 dim 1 must match mat2 dim 0

Hi Hangz_Zhou and team,

I've been struggling to get the demo experiment to work. When I run the code, I get the following Runtime error:

Network [ModulateGenerator] was created. Total number of parameters: 90.1 million. To see the architecture, do print(network).
Embedding size is 512, encoder SAP.
Network [ResSESyncEncoder] was created. Total number of parameters: 10.4 million. To see the architecture, do print(network).
Network [FanEncoder] was created. Total number of parameters: 14.3 million. To see the architecture, do print(network).
Network [ResNeXtEncoder] was created. Total number of parameters: 38.0 million. To see the architecture, do print(network).
Pretrained network G has fewer layers; The following are not initialized:
['conv1', 'convs', 'style', 'to_rgb1', 'to_rgbs']
model [AvModel] was created
working
dataset [VOXTestDataset] of size 361 was created
  0%|          | 0/181 [00:00<?, ?it/s]C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\functional.py:3328: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\functional.py:3458: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  "See the documentation of nn.Upsample for details.".format(mode)
  0%|          | 0/181 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "C:/Users/Admin/Documents/Github/Talking-Face_PC-AVS/app/inference.py", line 107, in main
    inference_single_audio(opt, path_label, model)
  File "C:/Users/Admin/Documents/Github/Talking-Face_PC-AVS/app/inference.py", line 66, in inference_single_audio
    fake_image_original_pose_a, fake_image_driven_pose_a = model.forward(data_i, mode='inference')
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\av_model.py", line 94, in forward
    driving_pose_frames)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\av_model.py", line 484, in inference
    fake_image_ref_pose_a, _ = self.generate_fake(sel_id_feature, ref_merge_feature_a)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\av_model.py", line 448, in generate_fake
    fake_image, style_rgb = self.netG(style)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\networks\generator.py", line 583, in forward
    out = self.conv1(out, latent[:, 0], noise=noise[0])
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\networks\generator.py", line 392, in forward
    out, _ = self.conv(input, style)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\networks\generator.py", line 295, in forward
    style = self.modulation(style).view(batch, 1, in_channel, 1, 1)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\networks\generator.py", line 214, in forward
    input, self.weight * self.scale, bias=self.bias * self.lr_mul
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 dim 1 must match mat2 dim 0
misc/Input/517600055 1 misc/Pose_Source/517600078 160 misc/Audio_Source/681600002.mp3 misc/Mouth_Source/681600002 363 dummy

mat1 dim 1 must match mat2 dim 0

Process finished with exit code 0

The error occurs with these variables, although I'm not sure this is telling you much:
image

I'm currently running the code with PyTorch 1.8.1 (and Python 3.6) as I haven't managed to get PyTorch 1.3.0 working, due to CUDA 10 not supporting my GPU. What would you recommend as a next step? Your help is much appreciated. Keep up the good work!

Inconsistent duration between mp3 and mp4 in LRW

Thanks for your great work. I used your prepared videos to test the model and there was no problem. However, when I test on videos from LRW, it does not work. I then found that the original mp4 duration is 1.16 s, but it changes to 1.21 s after converting to mp3 with ffmpeg. Can you give me some advice?
Thanks in advance.
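
As a side note, MP3 encoding adds a small amount of encoder delay and padding, which typically accounts for the extra ~0.05 s. One possible workaround (file names are illustrative) is to trim the extracted audio to the video duration:

    ffmpeg -i ABOUT_00001.mp4 -vn -t 1.16 -acodec libmp3lame ABOUT_00001.mp3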

support for non-human faces

Hi,

First of all, thanks for sharing the implementation of your amazing work. I was wondering if it supports non-human faces?

Thanks

Support for Chinese

Hello, I watched the demo videos you provided, and the model also handles Chinese quite well. Did you train with Chinese data? Also, may I ask why you did not use audio features such as MFCC or fbank? Have you tried other audio features, and does the current audio feature give the best results?

SyntaxError: invalid syntax

Hello, with bash experiments/demo_vox.sh I got an error:

~/Talking-Face_PC-AVS-main$ bash experiments/demo_vox.sh
Traceback (most recent call last):
File "inference.py", line 4, in
from options.test_options import TestOptions
File "/home/andrea/Talking-Face_PC-AVS-main/options/test_options.py", line 1, in
from .base_options import BaseOptions
File "/home/andrea/Talking-Face_PC-AVS-main/options/base_options.py", line 5, in
from util import util
File "/home/andrea/Talking-Face_PC-AVS-main/util/util.py", line 47
imgs = np.concatenate([imgs, np.zeros((rowPadding, *imgs.shape[1:]), dtype=imgs.dtype)], axis=0)
^
SyntaxError: invalid syntax
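
As a note, the flagged line uses iterable unpacking (*imgs.shape[1:]) inside a call, which requires Python 3.5 or later, so this SyntaxError usually means an older interpreter picked up the script. A quick check:

    # If this reports Python 2.x (or anything below 3.5), rerun the demo from a Python 3 environment.
    python --version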

Cannot find paper link

Could you please update the "MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement" paper link?

How to expand output image?

In inference.py,
I've seen that it only outputs the cropped face image, but I want to expand the top part of the image.

Running code on own image?

Hello,
I have configured the project as suggested here. I am able to run it on the demo images placed inside ./misc/input/some_id
My question is how can I run the project on my own image.

For example, when I create a folder inside ./misc/input/ named 123456 and place my own 224x224 image inside it with the name 000000.jpg (the complete path is ./misc/input/123456/000000.jpg), and then change the demo.csv file and run the code, I do not get the desired results.

Please help me.
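
For concreteness, the flow other issues in this thread use for custom inputs is roughly the following; the paths are illustrative, and the raw image normally has to be aligned first rather than only resized:

    # Align the raw image so it matches the expected crop (saves to a *_cropped folder).
    python scripts/align_68.py --folder_path ./misc/input/123456
    # Regenerate the csv to point at the aligned folder (see the script/README for its flags),
    # then run the demo.
    python scripts/prepare_testing_files.py
    bash experiments/demo_vox.sh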

Questions for align_68.py

I really appreciate the authors for the great work! The demo looks super good and the models are robust.

I have two questions about some details in align_68.py.
(1) I'm just curious what's the purpose of adding the bias here
(2) Why do you calculate the average for the three points here. Wouldn't it be better to store the three points for each frame and calculate the M for each of them?
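
Not an answer to the questions, but as background on what the M in (2) usually refers to, here is a generic alignment sketch; the coordinates and sizes are made up, and align_68.py may compute M differently (e.g. per frame versus from averaged points):

    import cv2
    import numpy as np

    # "M" is typically a 2x3 affine matrix mapping a few facial anchor points
    # (e.g. eye centers and a mouth point) onto fixed template positions.
    image = np.zeros((256, 256, 3), dtype=np.uint8)             # stand-in for a frame
    src_pts = np.float32([[80, 105], [175, 105], [128, 190]])   # hypothetical detected points
    dst_pts = np.float32([[70, 95], [185, 95], [128, 185]])     # hypothetical template points
    M = cv2.getAffineTransform(src_pts, dst_pts)
    aligned = cv2.warpAffine(image, M, (256, 256))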

Thanks for the great work again!

LRW + VoxCeleb

Hi Hang, did you combine the two datasets for training the lip sync and test separately on each, or did you separate the training datasets? I wonder whether VoxCeleb2 alone can achieve good performance in terms of lip sync.
Thanks.

Talking Face with just audio input

Hi, thank you for your amazing work.

I am just wondering if it's possible to render without a mouth frame, based on the audio only (similar to what Wav2Lip does).

If so, can you tell me how to do it? I've been trying to figure out whether it's possible, but I keep running into the Exception: None Image error when I set the mouth frame paths to None and the number of frames to 0 in demo.csv.

Is it possible to do lip sync without audio?

I'm just curious whether it's possible to generate a video from just a mouth_source and an input, with the audio source ignored (set to None), similar to how lip sync works. I tested it on my side and it didn't work.

Sorry to bother you again. I tried to figure it out by looking through the code, but that didn't help.

About the training set

Hello, for the vox2 dataset, how many speakers does the training set contain? Did you select them randomly?

About pretrained speech content space

Thank you for the great work.
In Equation 2, you use F_c^v and F_c^a to calculate the loss function L_c^v2a. However, in the code "av_model.py", when mode == 'sync', you use the function sync(self, augmented, spectrogram) to train the speech content space.
If I am not wrong, in that function you use F_n of the non-identity space and F_c^a to calculate the loss function L_c^v2a.

Does it play the same role as equation 2?
Looking for your reply and best wishes!
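
For readers without the paper open, a contrastive visual-to-audio loss of this kind is typically InfoNCE-shaped, along the lines of the generic form below (a sketch only; Equation 2 in the paper should be consulted for the exact definition):

    \mathcal{L}_c^{v2a} = -\log \frac{\exp\big(\mathrm{sim}(F_c^v, F_c^{a})/\tau\big)}{\sum_{j}\exp\big(\mathrm{sim}(F_c^v, F_{c,j}^{a})/\tau\big)}

where sim is a cosine similarity, tau is a temperature, and the sum runs over the audio features in the batch, the matching one being the positive.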

How can I train on my own dataset?

I have video data that can be converted into metadata using the prepare_testing_files.py and scripts/align_68.py scripts. But I cannot find a training script, so how can I train on my own dataset? I hope you can add example training code or instructions. Looking forward to your reply, thanks.

How to improve output sharpness

Hello, I tested your model as well as models like Wav2Lip; the lip sync and robustness are both good, but the sharpness of the output is relatively low. How can I improve the sharpness? Would increasing the sharpness of the training data be feasible? Can this kind of talking-face project reach the image quality of https://github.com/akanimax/msg-stylegan-tf ?

Question about eye blinking

Can the generated faces blink? Is this difficult to implement, and how could it be done?

Generation quality

Hello, I followed the steps to generate a demo on my own dataset, but the generated face doesn't look quite right; it looks somewhat Western. Is this because of the StyleGAN2 training set? For now my pose source video is just MakeItTalk output used as a stand-in.
image
