
atvgnet's Introduction

Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss (ATVGnet)

By Lele Chen, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu.

University of Rochester.

Table of Contents

  1. Introduction
  2. Citation
  3. Running
  4. Model
  5. Results
  6. Disclaimer and known issues

Introduction

This repository contains the original models (AT-net, VG-net) described in the paper Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss. The demo video is available at https://youtu.be/eH7h_bDRX2Q. This code can be applied directly to LRW and GRID. The outputs from the model are visualized below: the first is the synthesized landmark from ATnet; the rest are the attention map, motion map, and final results from VGnet.

(demo output GIFs)

Citation

If you use any code, models, or ideas from this repo in your research, please cite:

@inproceedings{chen2019hierarchical,
  title={Hierarchical cross-modal talking face generation with dynamic pixel-wise loss},
  author={Chen, Lele and Maddox, Ross K and Duan, Zhiyao and Xu, Chenliang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={7832--7841},
  year={2019}
}

Running

  1. This code is tested under Python 2.7. The model we provide is trained on LRW; however, it also works on GRID, VoxCeleb, and other datasets. You can directly compare this model with your own model on other datasets; we treat this as a fair comparison.

  2. PyTorch environment: PyTorch 0.4.1 (conda install pytorch=0.4.1 torchvision cuda90 -c pytorch).

  3. Install the Python dependencies (pip install -r requirement.txt).

  4. Download the pretrained ATnet and VGnet weights from Google Drive and put them under the model folder.

  5. Run the demo code: python demo.py (an example invocation is shown after this list)

    • -device_ids: gpu id
    • -cuda: use cuda or not
    • -vg_model: pretrained VGnet weight
    • -at_model: pretrained ATnet weight
    • -lstm: use lstm or not
    • -p: input example image
    • -i: input audio file
    • -sample_dir: folder to save the outputs
    • ...
  6. Download and unzip the training data from LRW

  7. Preprocess the data (extract landmarks and crop the images with dlib; a rough sketch is shown after this list).

  8. Train the ATnet model: python atnet.py

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_dir: folder to save weights
    • -lstm: use lstm or not
    • -sample_dir: folder to save visualized images during training
    • ...
  9. Test the model: python atnet_test.py

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_name: pretrained weights
    • -sample_dir: folder to save the outputs
    • -lstm: use lstm or not
    • ...
  10. Train the VGnet: python vgnet.py

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_dir: folder to save weights
    • -sample_dir: folder to save visualized images during training
    • ...
  11. Test the VGnet: python vgnet_test.py

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_name: pretrained weights
    • -sample_dir: folder to save the outputs
    • ...
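
  For reference, a typical demo invocation (step 5) might look like the command below. The weight file names match the pretrained models mentioned elsewhere on this page, while the image, audio, and output paths and the values passed to --cuda and --lstm are placeholders to adapt to your setup:

  python demo.py --device_ids 0 --cuda True --vg_model ../model/generator_23.pth --at_model ../model/atnet_lstm_18.pth --lstm True --p ../image/musk1_region.jpg -i ../audio/obama.wav --sample_dir ../results/

  A training run (step 8) can be launched in the same style, e.g. python atnet.py --device_ids 0 --batch_size 16 --lstm True --model_dir ../model/atnet/ --sample_dir ../sample/atnet/, again with illustrative values.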

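As a rough illustration of the preprocessing in step 7 (not the authors' released pipeline), the sketch below uses dlib to detect the face, extract the 68 landmarks, and crop/resize the face region. The predictor file name, the crop taken directly from the detector rectangle, and the 128 x 128 output size are assumptions.

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# standard dlib 68-point model (assumed file name)
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

def landmarks_and_crop(image_path, size=128):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        return None, None
    rect = faces[0]
    shape = predictor(gray, rect)
    # (68, 2) array of landmark coordinates
    lmark = np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)
    # naive crop around the detected rectangle, clipped to the image bounds
    top, bottom = max(rect.top(), 0), min(rect.bottom(), img.shape[0])
    left, right = max(rect.left(), 0), min(rect.right(), img.shape[1])
    crop = cv2.resize(img[top:bottom, left:right], (size, size))
    return crop, lmark
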
Model

  1. Overall ATVGnet model

  2. Regression-based discriminator network

    (figure)

Results

  1. Result visualization on different datasets:

    (figure)

  2. Results compared with other SOTA methods:

    (figure)

  3. Studies on image robustness with respect to landmark accuracy:

    (figure)

  4. Quantitative results:

    (figure)

Disclaimer and known issues

  1. These codes are implemented in PyTorch.
  2. In this paper, we train on LRW and GRID separately.
  3. The models are sensitive to the input images. Please use the correct preprocessing code.
  4. I haven't finished the data processing code yet; I will release it soon. In the meantime, you can try the model with your own images.
  5. If you want to train these models with this version of PyTorch without modifications, please note that:
    • You need at least 12 GB of GPU memory.
    • There might be some other untested issues.
  6. There is another interesting and useful research project on audio-to-landmark generation. Please check it out at https://github.com/eeskimez/Talking-Face-Landmarks-from-Speech.

Todos

  • Release training data

License

MIT


atvgnet's Issues

Basic information about Chinese audio

Thanks for sharing your code. I ran a Chinese audio file with your demo, and the lips were not well synchronized. Is there any solution? Do you plan to train your model on a Chinese lip dataset? Thanks.

Training vgnet generator loss keeps increasing

I am training VGnet on the GRID dataset using the code in vgnet.py. The inputs to the network are the dlib landmarks after Procrustes alignment, and the ground-truth image labels are the cropped face images after warping to the neutral head pose. While the discriminator losses for real and fake images keep decreasing to a very low value, the generator loss keeps increasing throughout training. The quality of the generated images keeps decreasing after a few epochs, i.e., the GAN training does not converge. Could you please suggest what the issue could be?
(plots: discriminator loss on fake images, discriminator loss on real images, generator loss)

local variable 'lmark' referenced before assignment

Am I doing something wrong?
Thanks

$ python demo.py --device_ids 0 --cuda CUDA --vg_model ../model/generator_23.pth --at_model ../model/atnet_lstm_18.pth --lstm LSTM --p ../image/musk1_region.jpg -i ../audio/obama.wav --sample_dir ../output

Traceback (most recent call last):
File "demo.py", line 271, in
test()
File "demo.py", line 192, in test
example_image, example_landmark = generator_demo_example_lips( config.person)
File "demo.py", line 164, in generator_demo_example_lips
return dst, lmark
UnboundLocalError: local variable 'lmark' referenced before assignment

Preprocess the data

Hi, thanks for your nice work!
Could you tell me how to preprocess the data (extract landmarks and crop the images with dlib)?

I checked other issues about this, like #34, which suggest using demo.py, but I think that only handles a single image. How should a video be processed?
Waiting for your reply, thank you again.

RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected

Hi Lele,

This may be an environment issue with my machine, but I'm unable to run the demo with a GPU virtual machine running Linux.

Do I need to use CUDA 9, or will CUDA 10 work?

Thank you

This is the traceback:

$ python demo.py
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=74 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
File "demo.py", line 271, in
test()
File "demo.py", line 173, in test
pca = torch.FloatTensor( np.load('../basics/U_lrw1.npy')[:,:6]).cuda()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:74

This works fine though:

import torch
print(torch.cuda.device_count())
1

My nvidia-smi:
Sun May 26 05:32:33 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 42C P0 74W / 149W | 0MiB / 11441MiB | 100% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

pickle file of dataset

hello,

Would you please share the code used to generate the pickle file "new_16_full_gt_train.pkl"?

How to run on our own image?

Could you please describe the steps to run the demo on our own image?
How do we get the output from ATnet before passing it to VGnet?

About audio2lmark_24.pth

There is an error: No such file or directory: '../model/atnet/audio2lmark_24.pth', and I cannot find this file on Google Drive.

Generator loss keeps increasing while discriminator loss keeps decreasing

When training on my own dataset, the generator and discriminator do not show a clear adversarial relationship. I tried adjusting the learning rate; apart from the discriminator converging a bit more slowly at the start, once its loss starts dropping it never comes back up. It looks as if the discriminator learns features the generator is unaware of, and the generator loss then keeps increasing. How can I make the generator and discriminator compete with each other properly, and what kind of training curves indicate a well-behaved GAN?

Confused about "normLmarks" function

Many thanks for this repo. I am trying to reimplement your training process, but I am stuck on the data preprocessing.

Actually, I am confused about the "normLmarks" function.

  1. I wonder whether, when there is only one face in a frame ( len(lmarks.shape) == 2 ), "normLmarks" will always output the same result? I marked the related lines in your code with "#". It seems @ssinha89 also found this issue: #17 (comment).

  2. Could you tell me more about the meaning of "init_params", "params" and "predicted"? What do "S" and "SK" mean here? I know you use "procrustes" to align each set of landmarks to the mean face, but I am confused about the process after that. Could you point me to related papers on how this is done?

def normLmarks(lmarks):
    norm_list = []
    idx = -1
    max_openness = 0.2
    mouthParams = np.zeros((1, 100))
    mouthParams[:, 1] = -0.06
    tmp = deepcopy(MSK)
    tmp[:, 48*2:] += np.dot(mouthParams, SK)[0, :, 48*2:]
    open_mouth_params = np.reshape(np.dot(S, tmp[0, :] - MSK[0, :]), (1, 100))

    if len(lmarks.shape) == 2:
        lmarks = lmarks.reshape(1,68,2)
    for i in range(lmarks.shape[0]):
        mtx1, mtx2, disparity = procrustes(ms_img, lmarks[i, :, :])
        mtx1 = np.reshape(mtx1, [1, 136])
        mtx2 = np.reshape(mtx2, [1, 136])
        norm_list.append(mtx2[0, :])
    pred_seq = []
    init_params = np.reshape(np.dot(S, norm_list[idx] - mtx1[0, :]), (1, 100))
    for i in range(lmarks.shape[0]):
        params = np.reshape(np.dot(S, norm_list[i] - mtx1[0, :]), (1, 100)) - init_params - open_mouth_params
######## "params" will always be equal to  (-open_mouth_params) ######## 
        predicted = np.dot(params, SK)[0, :, :] + MSK
        pred_seq.append(predicted[0, :])
    return np.array(pred_seq), np.array(norm_list), 1

How to extract landmarks PCA components for my own data?

Thank you for your contribution with this fabulous work. While studying your code, I became confused about the following questions and hope you can give some hints.

  1. How do I extract PCA components for my own data? Suppose I have (68, 2) landmarks.
  2. What are lip_mean_*, mean_*, and U_* used for, and how are they obtained? They are loaded in the code, but more explanation would help understanding.
    Thanks again. I'm looking forward to your reply!
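
(For readers with the same question: below is a rough, unofficial sketch of how a landmark PCA basis and mean of this kind could be computed with NumPy. It mirrors the mean-subtract / matrix-multiply pattern in demo.py and keeps 6 components to match U_lrw1.npy[:, :6], but the input file name and the whole snippet are illustrative assumptions, not the authors' pipeline.)

import numpy as np

landmarks = np.load('my_landmarks.npy')        # hypothetical stack of normalized landmarks, shape (N, 68, 2)
X = landmarks.reshape(len(landmarks), 136)     # flatten each frame into a 136-d vector
mean = X.mean(axis=0)                          # analogous to the mean_* files
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:6].T                          # (136, 6), analogous to U_lrw1.npy[:, :6]

coeffs = (X - mean).dot(components)            # project landmarks onto the PCA basis
recon = coeffs.dot(components.T) + mean        # reconstruct back to (N, 136)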

VGNet training generates the same face for each identity

First, many thanks for releasing the code.
I used the project to train VGnet and ran into the issue that the generator loss keeps increasing and the VGnet model generates the same face as the I_p of each identity. Could you suggest what the reason might be?

about the brightness of the generated video

The brightness of the generated video frames differs from that of the cropped region; the generated frames look noticeably brighter. Is this a network problem or a generation-parameter problem? How do we ensure that the brightness of the generated video matches the cropped region? Looking forward to your reply. Thanks.

The use of mean_shape_norm.npy & S.npy & S_3d.npy?

Hello,
I learned about ATVGNet at the CVPR 2019 site, and this is very interesting work!
But when I read the code afterwards, some places confused me.
1. What are mean_shape_norm.npy, S.npy, and S_3d.npy used for? I know they are used to normalize landmarks, but what is the individual function of each of these parts? Knowing this would give me a better understanding of the code.

2. How are mean_shape_norm.npy, S.npy, and S_3d.npy generated?

Looking forward to your reply.

Saving the model

Hi, I'm trying to save the model after loading it.
It says it requires 3 parameters, but only two are given.
Could you please tell me the third parameter to be inserted in the parentheses?
Thanks in advance.

Is it possible to generate the landmark from the VG output?

Hi,
I'm trying to get the landmarks from the output facial images. Since the VG output images are cropped, the landmarks from dlib are not very stable. Is it possible to generate the landmarks directly from VGnet, since we already have a landmark input for VGnet?
Thanks

training vgnet.py

While training vgnet.py, the model could not find the file "new_img_full_gt_train.pkl". What should this file contain, and how do I create it?
Could anyone who has worked on this help?

thank you in advance.

State_dict

I'm trying to run this script.

import torchvision
import torch
from somefile import modelarchitecture
model = modelarchitecture()
model.load_state_dict(torch.load(???))
model.eval()

Could you please guide me on what to assign to "model" and which path to pass to torch.load?

How is head pose taken into account for VGnet

In LRW dataset, speakers move significantly when they are talking, so unless you did frontalization for each frame (which I assume you didn't?), your frames (ground truth of VGnet) should include various head poses.
However, for the input of VGNet, I don't see any of which contains head pose information. If I understand correctly, VGNet has 3 inputs: example_frame, example_landmark and fake_landmarks. Both example_landmark and fake_landmarks are normalized so that they don't include either speaker's identity information or his/her head pose information; example_frame is a still frame, which cannot explain the head poses for a whole sequence either. As none of the VGNet input contains head pose info, I don't understand why VGNet fits so well on moving heads in LRW dataset. Can you explain why that works? Thanks

Question about AT-net labels

Hello, and thank you for your excellent work.
I have two questions:
1. I noticed that in the AT-net dataset processing, the MFCC feature vectors are stacked into 16 concatenated copies. What is the rationale for stacking 16 of them? (We noticed a similar practice in DeepSpeech.)
"t_mfcc =mfcc[(r + ind - 3)*4: (r + ind + 4)*4, 1:]"
This step already selects the feature vectors of the 3 frames before and after the current one, about 280 ms in total, so why concatenate them 16 times?
2. "landmark =lmark[r+1 : r + 17,:]" Normally the label would be centered on the current frame; why are the landmarks of the 16 frames after the current frame used as the label here?

Looking forward to your reply, many thanks!

Problem in training VGNet

Thank you for making your implementation publicly available.
I have trained ATnet on LRW without problems, but in training VGnet I have some issues:

In the forward function of the VG_net class, the shape of image is (16, 6) (16 is the batch size).
When it reaches self.image_encoder1(image) and ReflectionPad2d(3), it raises this error:

NotImplementedError: Only 3D, 4D, 5D padding with non-constant padding are supported for now

As a possible fix, I changed the image value with:
image = image.unsqueeze(1).unsqueeze(1)
just before self.image_encoder1(image), but it raises another error:
RuntimeError: invalid argument 6: Padding size should be less than the corresponding input dimension, but got: padding (3, 3) at dimension 2 of input [16 x 1 x 1 x 6] at /pytorch/aten/src/THCUNN/generic/SpatialReflectionPadding.cu:41

Could you please help me with these errors?

what does lmark_train.pkl include?

In dataset.py, the values of lmark_path and mfcc_path are as follows; both come from the same pickle file, and I'm a little confused about this. Comparing with the data processing in demo.py, I think lmark in dataset.py corresponds to the return value of generator_demo_example_lips in demo.py. Was my original understanding wrong?

self.lmark_root_path = '../dataset/landmark1d'
if self.train=='train':
_file = open(os.path.join(dataset_dir, "lmark_train.pkl"), "rb")
self.train_data = pickle.load(_file)
lmark_path = os.path.join(self.lmark_root_path , self.train_data[index][0] , self.train_data[index][1],self.train_data[index][2], self.train_data[index][2] + '.npy')
mfcc_path = os.path.join('../dataset/mfcc/', self.train_data[index][0], self.train_data[index][1], self.train_data[index][2] + '.npy')

Error when installing opencv

opencv-python

The above line in requirements.txt throws a pretty long error that is not worth reproducing in its entirety:

Getting requirements to build wheel ... error
  ERROR: Command errored out with exit status 1:
   command: /home/arta/anaconda3/envs/py2/bin/python /home/arta/anaconda3/envs/py2/lib/python2.7/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmp88lNX5
       cwd: /tmp/pip-install-hhbrzT/opencv-python

I found a website that, when translated to English, describes a workaround: specify a version of opencv-python that is compatible with Python 2.7:

python -m pip install opencv-python==4.2.0.32

I hope this helps.

Wrong images generated

Hi, I cloned the code and ran demo.py with no errors, but the generated video and images are completely wrong, like this.

(attached images: output frame, motion map, attention map)

By the way, I ran it in CPU mode.
Thanks in advance.

no requirements.txt and other issues

There is no requirements.txt.
Also, model.py has a bug; these lines should be
model += [nn.Conv2d(ngf // 2, output_nc, kernel_size=7, padding=3)]
and
model += [nn.Conv2d(ngf // 2, 1, kernel_size=7, padding=3)]
to avoid a floating point error.
In addition, with torch=0.4.1 there is a CUDA issue:
RuntimeError: CuDNN error: CUDNN_STATUS_SUCCESS
and upgrading to 1.1.0 causes this issue:
ImportError: cannot import name 'rnnFusedPointwise'

Question about PCA preprocessing

In demo.py, you multiply the example_landmark by 5 before applying PCA,

example_landmark = example_landmark * 5.0
example_landmark = example_landmark - mean.expand_as(example_landmark)
example_landmark = torch.mm(example_landmark, pca)

And for fake_landmarks, you multiply 2 times 1.1~1.5 before applying PCA

fake_lmark[:, 1:6] *= 2 * torch.FloatTensor(np.array([1.1, 1.2, 1.3, 1.4, 1.5])).cuda()
fake_lmark = torch.mm(fake_lmark, pca.t())
fake_lmark = fake_lmark + mean.expand_as(fake_lmark)

So (1) do you apply different scaling parameters to example_landmark and fake_landmarks?
(2) How are those scaling parameters (5 vs 2.2~3.0) selected?

Model

Can you provide a link to a trained model?

How to use evaluation_matrix.py?

I want to use the evaluation metrics in evaluation_matrix.py to run a comparison experiment. Could the author provide a script or instructions describing how to use these metrics?

bigger?

1. With tf.warp, part of the face is cut off. Can you provide a larger landmark region?
2. The 128 x 128 resolution is very small. Can it be enlarged?

Comparing on GRID dataset

Dear Authors,
Thanks for the awesome release of the paper and code.

I was trying to compare our results with yours on the GRID dataset for the LMD metric. Regarding the paper, can you please tell me:

  1. Which subject IDs of GRID did you use for testing?

  2. How many keypoints did you use for each subject? I usually use a dlib detector, which gives me 68 keypoints.

  3. Do you perform any normalization of the keypoints (after getting raw pixel coordinates from the dlib detector) to remove scale effects before calculating the difference between real and synthetic faces?

  4. Lastly, when you report SSIM and PSNR, do you calculate those metrics on the entire frame or just on the cropped face regions?
    I just want to make sure that we compare fairly with you, so I am keenly looking forward to your kind reply.

Thanks,
Avisek Lahiri

What is the performance on Chinese dataset?

Thanks for your great work.
I want to know how well your model performs on Chinese datasets. I tried inference with Chinese audio, and the result does not seem very good.
I would like to retrain this model on a Chinese dataset. Is there any related dataset or processing method for it?

Thanks again. Waiting for your reply.

error when run demo.py

Traceback (most recent call last):
File "demo.py", line 271, in
test()
File "demo.py", line 178, in test
encoder = encoder.cuda()
File "/home/iie/.conda/envs/s2v/lib/python2.7/site-packages/torch/nn/modules/module.py", line 258, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/iie/.conda/envs/s2v/lib/python2.7/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/home/iie/.conda/envs/s2v/lib/python2.7/site-packages/torch/nn/modules/rnn.py", line 112, in _apply
self.flatten_parameters()
File "/home/iie/.conda/envs/s2v/lib/python2.7/site-packages/torch/nn/modules/rnn.py", line 105, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: CuDNN error: CUDNN_STATUS_SUCCESS

error running demo.py cv2

Traceback (most recent call last):
File "demo.py", line 486, in
test()
File "demo.py", line 465, in test
fake_store = restore_image(orgImage,rect,fake_store,indx)
File "demo.py", line 196, in restore_image
cv2.normalize(img, img, 0, 255, cv2.NORM_MINMAX)
cv2.error: OpenCV(4.5.5) :-1: error: (-5:Bad argument) in function 'normalize'

Overload resolution failed:

  • Layout of the output array dst is incompatible with cv::Mat
  • Expected Ptr<cv::UMat> for argument 'dst'

Training on other data

Hi. I appreciate you providing the ATVGnet code.
I want to train ATVGnet, and I found that atnet.py and vgnet.py serve as the training code for other data.
However, I only have the VoxCeleb2 (or VoxCeleb1) dataset, which is not LRW or GRID, so I am confused about how to preprocess the VoxCeleb data. Do you have any files for preprocessing other data? I would like to download your preprocessing code.

How to get LRW dataset

I am interested in your work and am trying to get the LRW dataset, but I failed at the agreement step. I don't know how to obtain the agreement with BBC Research & Development. Could you show me the details of how to get the dataset? Thanks a lot!

Using Pre-trained Model on GRID dataset

Dear Authors,
Thanks for sharing the code. I just wanted to know whether the pre-trained model you have released can be used on the GRID dataset or not; I am only interested in running your demo.py.
For example, I want to

  1. give a frame from GRID dataset as a target frame
  2. provide a .wav audio file from GRID

Do you think that should work, or do I need to take any special care for a GRID dataset demo?

Where is the network architecture?

We are trying to deploy this project in an Android application. To do so, we need to convert the pretrained PyTorch models (atnet_lstm_18.pth and generator_23.pth) into TensorFlow, but it shows a 'state_dict' error.
When I load the pre-trained models, I only get the weights, not the architecture.
Can you guide me to where I can find the model architecture?
And how can I convert it to TensorFlow?
