
talking-face_pc-avs's People

Contributors

hangz-nju-cuhk, sunyasheng


talking-face_pc-avs's Issues

Distorted Output

Hey @Hangz-nju-cuhk
Thanks for sharing such nice work. I tried replicating it, however I got really distorted results, as shown below.
The way I created my own dataset was:

  • Cropped image to 256x256
  • Ran the script scripts/prepare_testing_files.py using the above source image and driving video.
  • Then copied the generated csv into demo folder
  • Ran bash experiments/demo_vox.sh

Let me know if I am doing something wrong

concat.mp4

nice work, about clip_len

Hello, I'd like to ask how long the video clips mentioned in the paper are. During training, do you read batch_size video-frame clips per iteration? I noticed the select_frames function in your code, with the related parameters clip_len and generate_interval, but I don't quite understand them. How should these two parameters be set during training? Looking forward to your reply, thanks.

Face alignment fails?

Hello, when I run face alignment I get the following error. How can I fix it? Thanks.

python scripts/align_68.py --folder_path ./misc/Input/517600055
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.87it/s]
TypeError: expected dtype object, got 'numpy.dtype[float32]'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "scripts/align_68.py", line 110, in
main()
File "scripts/align_68.py", line 106, in main
align_folder(args.folder_path, save_img_path)
File "scripts/align_68.py", line 50, in align_folder
preds = fa.get_landmarks_from_directory(folder_path)
File "/home/ruanjiyang/anaconda3/lib/python3.7/site-packages/face_alignment/api.py", line 238, in get_landmarks_from_directory
preds = self.get_landmarks_from_image(image, bounding_boxes)
File "/home/ruanjiyang/anaconda3/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/home/ruanjiyang/anaconda3/lib/python3.7/site-packages/face_alignment/api.py", line 154, in get_landmarks_from_image
pts, pts_img = get_preds_fromhm(out, center.numpy(), scale)
File "/home/ruanjiyang/anaconda3/lib/python3.7/site-packages/face_alignment/utils.py", line 199, in get_preds_fromhm
preds, preds_orig = _get_preds_fromhm(hm, idx, center, scale)
SystemError: CPUDispatcher(<function _get_preds_fromhm at 0x7f5d8140b050>) returned a result with an error set
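
For reference, this TypeError from face_alignment's numba-compiled _get_preds_fromhm is a known incompatibility between older numba releases and numpy >= 1.20. One possible workaround (version pins are indicative only):

    # Option 1: pin numpy below 1.20 so the installed numba can handle it.
    pip install "numpy<1.20"
    # Option 2: upgrade numba (and face_alignment) to versions that support newer numpy.
    pip install --upgrade numba face_alignment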

About results on the test set

Currently, I randomly selected part of the test set. Overall, the generated videos are less sharp than the input images; some generated videos are fairly clear (e.g. the second to last), but most are somewhat blurry. I am not sure whether this is a normal test result (I have not changed any parameters in the inference setup given on GitHub). Looking forward to your reply, thanks.

concat.mp4
concat.mp4
concat.mp4
concat.mp4
concat.mp4

can you share the details of the augmentation generating non-id space

I noticed that you mentioned the paper "Neural Head Reenactment with Latent Pose Descriptors" in an issue about the non-id space. I tried the augmentation used in their code, but it seems different from yours, with fewer changes.
This part seems vital, since the non-id space is what allows the model to disentangle the identity feature from the pose feature, so I would like to know the augmentation details in your paper. Could you share them with us?

Training code

When will the training code be open-sourced?

Question on pose space training

Thanks so much for the great work and codes. When I read the paper and codes, I get confused about the pose space learning part.

As described in the training strategy in the paper, the identity encoder and speech content space are first pre-trained and then loaded into the overall framework to train the generator and learn the pose space. I can follow the training procedure; however, for learning the pose space I am confused about whether you use the loss (compute_diff_loss)

def compute_diff_loss(self, input_img, pose_feature, pose_feature_audio, G_losses):

when training the whole generator. If so, that would be consistent with the code, which computes the L1 loss between the pose differences and pose_feature_audio.
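
A minimal sketch of one way such a difference loss could be computed, purely to illustrate the question; the repository's actual compute_diff_loss may differ, and the shapes and names below are assumptions:

    import torch
    import torch.nn.functional as F

    def compute_diff_loss_sketch(pose_feature, pose_feature_audio):
        # Assumed shapes: (batch, num_frames, feature_dim).
        # Penalize the L1 distance between frame-to-frame differences of the visual
        # pose features and the audio-derived pose features.
        diff_visual = pose_feature[:, 1:] - pose_feature[:, :-1]
        diff_audio = pose_feature_audio[:, 1:] - pose_feature_audio[:, :-1]
        return F.l1_loss(diff_visual, diff_audio)

    loss = compute_diff_loss_sketch(torch.randn(2, 5, 12), torch.randn(2, 5, 12))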

Last but not least, congratulations on the research progress, I think it is a great breakthrough to disentangle the sync information and head pose in the feature representation. Looking forward to your reply!

Preprocess video

Hello and thank you very much for sharing your great work!
I was wondering how to preprocess a pose video.
I understand how to preprocess the input image according to the README, but I am not sure how to do the same with the video for setting the pose.
Thank you very much in advance!
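
For what it's worth, one plausible route is to treat the pose video as a folder of frames and run the same alignment script used elsewhere in these issues; the paths below are illustrative:

    # Extract the pose video into numbered frames, then align them like the input image.
    mkdir -p misc/Pose_Source/my_pose
    ffmpeg -i pose_video.mp4 misc/Pose_Source/my_pose/%06d.jpg
    python scripts/align_68.py --folder_path misc/Pose_Source/my_pose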

result does not look like src

the source has 700 aligned frames, so:

the csv:
/content/Talking-Face_PC-AVS/misc/Input/faf/ 700 /content/Talking-Face_PC-AVS/misc/Pose_Source/517600078 160 /content/Talking-Face_PC-AVS/misc/Audio_Source/00015.mp3 /content/Talking-Face_PC-AVS/misc/Mouth_Source/681600002 363 dummy

how to improve the result?

avconcat.6.mp4

Exception('None Image') while using custom audio source

Thank you author for your great work.
I have encountered problems while using a custom audio source.
I used cropped images from the audio source and set the mouth source path to the cropped image files.
For example, when I generated the demo.csv, I saw that the mouth source contains 208 frames. But when I run !bash experiments/demo_vox.sh, the image path indicates a number beyond 208 and causes Exception('None Image').
However, there was no problem while using the audio sources you provided.
Do you have any idea about this problem?
Many thanks!

About contrastive learning

Thanks for sharing your work! Where is the script for the contrastive learning between image features and audio features? I found a class in "models\networks\loss.py": SoftmaxContrastiveLoss; is this the implementation of the contrastive learning?
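
For context, a generic InfoNCE-style softmax contrastive loss between paired image and audio features looks roughly like the sketch below; this is only an illustration of the idea, not necessarily what SoftmaxContrastiveLoss in the repository implements:

    import torch
    import torch.nn.functional as F

    def softmax_contrastive_sketch(img_feat, aud_feat, temperature=0.07):
        # img_feat, aud_feat: (batch, dim); matching rows are treated as positive pairs.
        img_feat = F.normalize(img_feat, dim=-1)
        aud_feat = F.normalize(aud_feat, dim=-1)
        logits = img_feat @ aud_feat.t() / temperature      # (batch, batch) similarity matrix
        targets = torch.arange(img_feat.size(0), device=img_feat.device)
        # Each image feature should be most similar to its own audio feature.
        return F.cross_entropy(logits, targets)

    loss = softmax_contrastive_sketch(torch.randn(8, 512), torch.randn(8, 512))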

TypeError: exceptions must derive from BaseException

Thanks for your great job.
I ran bash experiments/demo_vox.sh and got an error:

Network [ResNeXtEncoder] was created. Total number of parameters: 38.0 million. To see the architecture, do print(network).
./checkpoints/demo/latest_net_G.pth not exists yet!
Traceback (most recent call last):
File "inference.py", line 117, in
main()
File "inference.py", line 97, in main
model = create_model(opt).cuda()
File "/content/Talking-Face_PC-AVS/models/init.py", line 33, in create_model
instance = model(opt)
File "/content/Talking-Face_PC-AVS/models/av_model.py", line 24, in init
self.initialize_networks(opt)
File "/content/Talking-Face_PC-AVS/models/av_model.py", line 272, in initialize_networks
self.load_network(netG, 'G', opt.which_epoch)
File "/content/Talking-Face_PC-AVS/models/av_model.py", line 773, in load_network
raise ('Generator must exist!')
TypeError: exceptions must derive from BaseException.

Thanks
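
For context, the log shows the generator checkpoint ./checkpoints/demo/latest_net_G.pth is missing (the pretrained weights need to be downloaded and placed there), and the TypeError itself comes from raising a plain string in load_network. A minimal sketch of a raise that would surface the real problem (the variable name save_path is an assumption):

    import os

    def check_checkpoint(save_path):
        # Raising a bare string triggers "exceptions must derive from BaseException";
        # raising a real exception reports the actual issue: the missing checkpoint.
        if not os.path.exists(save_path):
            raise FileNotFoundError('Generator checkpoint not found: ' + save_path)
        return save_path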

Error related to "align_68.py" file

Thank you for your excellent work!
After running the "align_68.py" file, I first encountered an error [ZeroDivisionError: division by zero], after fixing it. The code is executed but no file is generated in the p_cropped folder.

python scripts/align_68.py --folder_path scripts/p

0it [00:00, ?it/s]
cropped files saved at scripts/p_cropped

Thank you very much for your guidance!

How was the identity encoder trained?

Hi, your paper mentions that you trained the identity encoder on the VoxCeleb2 dataset and used it to extract pose information, with three kinds of data augmentation during training. Could you describe in detail how it was trained? How is the loss function defined?

Eyes cannot move?

Hi, thanks for your great work! When I test with my own video, I found that the eyes in the generated video don't move with the driving video. Is that normal?

About non-id space training.

Thanks for sharing your work. Could you tell me how you train the non-id space and what kind of loss you use? I find that no loss for non-id space training seems to be mentioned in the paper. Many thanks.

Training code will be released?

Hi,

Congratulations on this great work and thanks for releasing the codes! The results are mighty impressive! Any plans on releasing the training code too?

Questions about the demo video in project page

In the demo video on the project page, there are generated faces of Obama and Biden at timestamp 2:26.
image
It seems that there is a slight identity mismatch, which may be a potential direction for improvement.
Could you please tell us which source videos you used in this case, or their location in VoxCeleb2 (if applicable), for better reference?
Thanks in advance and appreciate the great work PC-AVS!

Windows install

Thanks for sharing. Could you explain how to install on Windows with Anaconda? Thank you!

RuntimeError: mat1 dim 1 must match mat2 dim 0

Hi Hangz_Zhou and team,

I've been struggling to get the demo experiment to work. When I run the code, I get the following Runtime error:

Network [ModulateGenerator] was created. Total number of parameters: 90.1 million. To see the architecture, do print(network).
Embedding size is 512, encoder SAP.
Network [ResSESyncEncoder] was created. Total number of parameters: 10.4 million. To see the architecture, do print(network).
Network [FanEncoder] was created. Total number of parameters: 14.3 million. To see the architecture, do print(network).
Network [ResNeXtEncoder] was created. Total number of parameters: 38.0 million. To see the architecture, do print(network).
Pretrained network G has fewer layers; The following are not initialized:
['conv1', 'convs', 'style', 'to_rgb1', 'to_rgbs']
model [AvModel] was created
working
dataset [VOXTestDataset] of size 361 was created
  0%|          | 0/181 [00:00<?, ?it/s]C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\functional.py:3328: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\functional.py:3458: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  "See the documentation of nn.Upsample for details.".format(mode)
  0%|          | 0/181 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "C:/Users/Admin/Documents/Github/Talking-Face_PC-AVS/app/inference.py", line 107, in main
    inference_single_audio(opt, path_label, model)
  File "C:/Users/Admin/Documents/Github/Talking-Face_PC-AVS/app/inference.py", line 66, in inference_single_audio
    fake_image_original_pose_a, fake_image_driven_pose_a = model.forward(data_i, mode='inference')
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\av_model.py", line 94, in forward
    driving_pose_frames)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\av_model.py", line 484, in inference
    fake_image_ref_pose_a, _ = self.generate_fake(sel_id_feature, ref_merge_feature_a)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\av_model.py", line 448, in generate_fake
    fake_image, style_rgb = self.netG(style)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\networks\generator.py", line 583, in forward
    out = self.conv1(out, latent[:, 0], noise=noise[0])
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\networks\generator.py", line 392, in forward
    out, _ = self.conv(input, style)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\networks\generator.py", line 295, in forward
    style = self.modulation(style).view(batch, 1, in_channel, 1, 1)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\models\networks\generator.py", line 214, in forward
    input, self.weight * self.scale, bias=self.bias * self.lr_mul
  File "C:\Users\Admin\Documents\Github\Talking-Face_PC-AVS\app\venv\lib\site-packages\torch\nn\functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 dim 1 must match mat2 dim 0
misc/Input/517600055 1 misc/Pose_Source/517600078 160 misc/Audio_Source/681600002.mp3 misc/Mouth_Source/681600002 363 dummy

mat1 dim 1 must match mat2 dim 0

Process finished with exit code 0

The error occurs with these variables, although I'm not sure this is telling you much:
image

I'm currently running the code with PyTorch 1.8.1 (and Python 3.6) as I haven't managed to get PyTorch 1.3.0 working, due to CUDA 10 not supporting my GPU. What would you recommend as a next step? Your help is much appreciated. Keep up the good work!

Inconsistent duration between mp3 and mp4 in LRW

Thanks for your great work. I used your prepared videos to test the model and there was no problem. However, when I test on videos from LRW, it does not work. I then found that the original mp4 duration is 1.16 s, but it changes to 1.21 s after converting to mp3 with ffmpeg. Can you give me some advice?
Thanks in advance.
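
As a side note, MP3 encoding adds a small amount of encoder delay and padding, which typically accounts for the extra ~0.05 s. One possible workaround (file names are illustrative) is to trim the extracted audio to the video duration:

    ffmpeg -i ABOUT_00001.mp4 -vn -t 1.16 -acodec libmp3lame ABOUT_00001.mp3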

support for non-human faces

Hi,

First of all, thanks for sharing the implementation of your amazing work. I was wondering if it supports non-human faces?

Thanks

Support for Chinese

Hello, I watched the demo videos you provided, and the model also handles Chinese quite well. Did you train with Chinese data? Also, may I ask why you did not use audio features such as MFCC or fbank? Have you tried other audio features, and does the current audio feature give the best results?

SyntaxError: invalid syntax

Hello, with bash experiments/demo_vox.sh I got an error:

~/Talking-Face_PC-AVS-main$ bash experiments/demo_vox.sh
Traceback (most recent call last):
File "inference.py", line 4, in
from options.test_options import TestOptions
File "/home/andrea/Talking-Face_PC-AVS-main/options/test_options.py", line 1, in
from .base_options import BaseOptions
File "/home/andrea/Talking-Face_PC-AVS-main/options/base_options.py", line 5, in
from util import util
File "/home/andrea/Talking-Face_PC-AVS-main/util/util.py", line 47
imgs = np.concatenate([imgs, np.zeros((rowPadding, *imgs.shape[1:]), dtype=imgs.dtype)], axis=0)
^
SyntaxError: invalid syntax
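
As a note, the flagged line uses iterable unpacking (*imgs.shape[1:]) inside a call, which requires Python 3.5 or later, so this SyntaxError usually means an older interpreter picked up the script. A quick check:

    # If this reports Python 2.x (or anything below 3.5), rerun the demo from a Python 3 environment.
    python --version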

Cannot find paper link

Could you please update the "MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement" paper link?

How to expand output image?

In inference.py,
I've seen that it only outputs the cropped face image, but I want to expand the top part of the image.

Running code on own image?

Hello,
I have configured the project as suggested here. I am able to run it on the demo images placed inside ./misc/input/some_id
My question is how can I run the project on my own image.

For example, when I create a folder inside ./misc/input/ named 123456 and place my own 224x224 image inside it with the name 000000.jpg (the complete path is ./misc/input/123456/000000.jpg), and then change the demo.csv file and run the code, I do not get the desired results.

Please help me.
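
For concreteness, the flow other issues in this thread use for custom inputs is roughly the following; the paths are illustrative, and the raw image normally has to be aligned first rather than only resized:

    # Align the raw image so it matches the expected crop (saves to a *_cropped folder).
    python scripts/align_68.py --folder_path ./misc/input/123456
    # Regenerate the csv to point at the aligned folder (see the script/README for its flags),
    # then run the demo.
    python scripts/prepare_testing_files.py
    bash experiments/demo_vox.sh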

Questions for align_68.py

I really appreciate the authors for the great work! The demo looks super good and the models are robust.

I have two questions about some details in align_68.py.
(1) I'm just curious what's the purpose of adding the bias here
(2) Why do you calculate the average for the three points here. Wouldn't it be better to store the three points for each frame and calculate the M for each of them?
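
Not an answer to the questions, but as background on what the M in (2) usually refers to, here is a generic alignment sketch; the coordinates and sizes are made up, and align_68.py may compute M differently (e.g. per frame versus from averaged points):

    import cv2
    import numpy as np

    # "M" is typically a 2x3 affine matrix mapping a few facial anchor points
    # (e.g. eye centers and a mouth point) onto fixed template positions.
    image = np.zeros((256, 256, 3), dtype=np.uint8)             # stand-in for a frame
    src_pts = np.float32([[80, 105], [175, 105], [128, 190]])   # hypothetical detected points
    dst_pts = np.float32([[70, 95], [185, 95], [128, 185]])     # hypothetical template points
    M = cv2.getAffineTransform(src_pts, dst_pts)
    aligned = cv2.warpAffine(image, M, (256, 256))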

Thanks for the great work again!

LRW + VoxCeleb

Hi Hang, did you combine the two datasets for training the lip sync and test separately on each, or did you separate the training datasets? I wonder whether VoxCeleb2 alone can achieve good performance in terms of lip sync.
Thanks.

Talking Face with just audio input

Hi, thank you for your amazing work.

I am just wondering if it's possible to render without a mouth frame, based on the audio only (similar to what Wav2Lip does).

If so, can you tell me how to do it? I've been trying to figure out whether it's possible, but I keep running into the Exception: None Image error when I set the mouth frame paths to None and the number of frames to 0 in demo.csv.

Is it possible to do lip sync without audio?

I'm just curious whether it's possible to generate a video from just a mouth_source and an input, with the audio source ignored (set to None), similar to how lip sync works. I tested it on my side and it didn't work.

Sorry to bother you again. I tried to figure it out by looking through the code, but that didn't help.

About the training set

Hello, for the vox2 dataset, how many speakers does the training set contain? Did you select them randomly?

About pretrained speech content space

Thank you for the great work.
In Equation 2, you use F_c^v and F_c^a to calculate the loss function L_c^v2a. However, in the code "av_model.py", when mode == 'sync', you use the function sync(self, augmented, spectrogram) to train the speech content space.
If I am not wrong, in that function you use F_n of the non-identity space and F_c^a to calculate the loss function L_c^v2a.

Does it play the same role as equation 2?
Looking for your reply and best wishes!
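
For readers without the paper open, a contrastive visual-to-audio loss of this kind is typically InfoNCE-shaped, along the lines of the generic form below (a sketch only; Equation 2 in the paper should be consulted for the exact definition):

    \mathcal{L}_c^{v2a} = -\log \frac{\exp\big(\mathrm{sim}(F_c^v, F_c^{a})/\tau\big)}{\sum_{j}\exp\big(\mathrm{sim}(F_c^v, F_{c,j}^{a})/\tau\big)}

where sim is a cosine similarity, tau is a temperature, and the sum runs over the audio features in the batch, the matching one being the positive.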

How can I train on my own dataset?

I have video data that can be converted into metadata using the prepare_testing_files.py and scripts/align_68.py scripts. But I cannot find a training script, so how can I train on my own dataset? I hope you can add example training code or instructions. Looking forward to your reply, thanks.

How to improve output sharpness

Hello, I tested your model as well as models like Wav2Lip; the lip sync and robustness are both good, but the sharpness of the output is relatively low. How can I improve the sharpness? Would increasing the sharpness of the training data be feasible? Can this kind of talking-face project reach the image quality of https://github.com/akanimax/msg-stylegan-tf ?

Question about eye blinking

Can the generated faces blink? Is this difficult to implement, and how could it be done?

Generation quality

Hello, I followed the steps to generate a demo on my own dataset, but the generated face doesn't look quite right; it looks somewhat Western. Is this because of the StyleGAN2 training set? For now my pose source video is just MakeItTalk output used as a stand-in.
image
