karanvivekbhargava / obamanet
ObamaNet : Photo-realistic lip-sync from audio (Unofficial port)
License: MIT License
Hi,
I see that the mouth hasn't been mapped well by pix2pix and is pretty low res.
Cheers
Hello there! I am not able to find vid2wav.py and tools/process.py here. Is there a master branch somewhere else?
The tools are found in the pix2pix repository.
In the readme the command for converting videos to images is as follows:
ffmpeg -i 00001.mp4 -r 1/5 -vf scale=-1:720 images/00001-$filename%05d.bmp
This will only extract an image every 5 seconds (per -r 1/5). Is this how it should be done, and if so, why? Is it arbitrary? I see in processing.py that only 1467 images from 20 videos were used to generate the image_kp_raw files, which suggests this is indeed how it was done.
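For reference, here is a minimal sketch of driving the same extraction from Python; the paths are placeholders, and the 1/5 rate (one frame every 5 seconds) simply mirrors the README command rather than anything verified against processing.py:

```python
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, rate: str = "1/5") -> None:
    """Extract frames from a video with ffmpeg.

    rate="1/5" asks ffmpeg for one frame every 5 seconds;
    rate="30" would instead sample 30 frames per second.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    stem = Path(video_path).stem  # e.g. "00001"
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-r", rate,                     # output frame rate
            "-vf", "scale=-1:720",          # resize to 720p height, keep aspect
            f"{out_dir}/{stem}-%05d.bmp",   # numbered frames, as in the README
        ],
        check=True,
    )

extract_frames("videos/00001.mp4", "images")
```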
I am using this model on my own videos. pix2pix.py training has completed, which generates the .data, .index and .meta model files, but I cannot run train.py for the Keras models because the .pkl files are missing.
Kindly help me generate the data/audio_kp/audio_kp1467_mel.pickle, data/pca/pkp1467.pickle and data/pca/pca1467.pickle files used in train.py.
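For what it's worth, here is a minimal sketch of how such PCA pickles could be produced with scikit-learn; the input array, its shape, and the component count are assumptions based on the 68-keypoint pipeline, not the repo's exact code:

```python
import pickle
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical input: one row of flattened keypoint coordinates per frame.
keypoints = np.load("data/keypoints_raw.npy")  # shape (n_frames, n_coords), assumed

pca = PCA(n_components=8)           # small basis for the keypoint space
reduced = pca.fit_transform(keypoints)

# Dump both the fitted PCA model and the projected keypoints, mirroring
# the pca1467.pickle / pkp1467.pickle pair that train.py expects.
with open("data/pca/pca1467.pickle", "wb") as f:
    pickle.dump(pca, f)
with open("data/pca/pkp1467.pickle", "wb") as f:
    pickle.dump(reduced, f)
```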
I am interested in your project and have read your paper. In the paper, you process the keypoints to be independent of face location, face size, and in-plane and out-of-plane face rotation: for example, you mean-normalize the 68 keypoints, project them onto a horizontal axis, and divide them by the norm of the 68 vectors. This processing is important, but the explanation in the paper is brief. Could you explain the process, including the formulas and methods, in detail?
Thanks a lot
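A rough sketch of one plausible reading of that normalization (mean-centering, rotating the eye line to horizontal, and scaling by the overall norm); this is an interpretation of the paper's description, not the authors' verified formulation:

```python
import numpy as np

def normalize_keypoints(kp: np.ndarray) -> np.ndarray:
    """Normalize 68 dlib keypoints, shape (68, 2).

    1. Mean-center to remove face location.
    2. Rotate so the eye-to-eye line is horizontal (in-plane rotation).
    3. Divide by the Frobenius norm to remove face size.
    """
    kp = kp - kp.mean(axis=0)                     # 1. translation invariance

    # dlib indices 36 and 45 are the outer eye corners (assumed convention).
    dx, dy = kp[45] - kp[36]
    theta = -np.arctan2(dy, dx)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    kp = kp @ rot.T                               # 2. in-plane rotation

    return kp / np.linalg.norm(kp)                # 3. scale invariance
```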
Is there a way to improve the result at https://github.com/karanvivekbhargava/obamanet/blob/master/results/key2im.gif so the mouth area is high resolution like the rest of the picture?
Hello, I'm trying to retrain the network from scratch and I want to use processing.py from the data folder downloaded from the link.
However, at line 205:
numOfFiles = 1467 # First 20 videos
I want to know how this number was calculated, in order to create a correct pickle file for train.py to train with.
Thank you!
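One plausible way to reproduce that count from your own extracted frames; the glob pattern and directory here are assumptions based on the README's naming scheme:

```python
import glob

# Count frames extracted from the first 20 videos (00001..00020),
# assuming the README's images/<video>-<frame>.bmp naming.
num_files = sum(
    len(glob.glob(f"images/{i:05d}-*.bmp")) for i in range(1, 21)
)
print(num_files)  # the original data presumably yields 1467 here
```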
python3: can't open file 'tools/process.py': [Errno 2] No such file or directory
And what is dir c supposed to contain?
How did you get a2key_data/images and kp_test.pickle?
I want to create the images and kp_test.pickle from my own dataset.
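A minimal sketch of how such a keypoint pickle could be built with dlib; the predictor file and output layout are assumptions, and the repo may store keypoints in a different structure:

```python
import glob
import pickle
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# Standard 68-landmark model, downloadable from dlib.net (assumed here).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

keypoints = {}
for path in sorted(glob.glob("a2key_data/images/*.bmp")):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        continue  # skip frames where no face is detected
    shape = predictor(gray, faces[0])
    keypoints[path] = [(p.x, p.y) for p in shape.parts()]

with open("kp_test.pickle", "wb") as f:
    pickle.dump(keypoints, f)
```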
I am able to generate the lip sync points for my input. How do I remove the black box and lip animation to obtain the original face with lip sync?
Copy the patched images into folder a and the cropped images to folder b
==> Could you release the code for obtaining the cropped images?
Thank you!
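In case it helps, a rough sketch of cropping a face region from detected landmarks; this is purely illustrative, as the repo's actual crop parameters are unknown:

```python
import numpy as np

def crop_around_keypoints(img, kp, margin=20):
    """Crop a rectangle around landmark points with a pixel margin.

    kp: array of (x, y) landmark coordinates, e.g. the 68 dlib points.
    """
    kp = np.asarray(kp)
    x0, y0 = kp.min(axis=0) - margin
    x1, y1 = kp.max(axis=0) + margin
    h, w = img.shape[:2]
    x0, y0 = max(int(x0), 0), max(int(y0), 0)
    x1, y1 = min(int(x1), w), min(int(y1), h)
    return img[y0:y1, x0:x1]
```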
After all the processing, my training has started, but only on the CPU; I want to train it on multiple GPUs. Pull requests for this would be appreciated.
Regards.
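A minimal sketch of one way to get multi-GPU training in TensorFlow 2 / Keras with MirroredStrategy; the model here is a placeholder with assumed shapes, not what train.py actually builds, and the repo's older Keras version may instead need keras.utils.multi_gpu_model:

```python
import tensorflow as tf

# MirroredStrategy replicates the model across all visible GPUs and
# averages gradients; build and compile the model inside its scope.
strategy = tf.distribute.MirroredStrategy()
print("Devices in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, input_shape=(100, 26)),  # placeholder shapes
        tf.keras.layers.Dense(16),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then distributes batches across the GPUs automatically.
```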
Hi @karanvivekbhargava -- thanks for a really great implementation of the ObamaNet framework; this has been a real joy to work with. I'm wondering if you or any others have run into any snafus with training the audio-to-keypoint LSTM implemented in train.py. After training it for about 50 epochs, the LSTM predicts the same "open mouth" keypoint vectors for every audio timestep, like so:
Some more details if they're useful:
- I am using the logfbank feature representation of the audio.
- The keypoints (dlib-extracted and normalized, etc.) match up perfectly to the original images ... so it's not an issue with the keypoint extraction/dumping code.
- The a2kp model works moderately well (at least it does not predict the same static output).
- The audio is split into per-utterance files (kia.wav, `00002-007.wav`, etc.).
TL;DR: Does this seem like it might just be an issue of not training for enough epochs? Or might it be a bug, i.e. related to the audio timesteps not being correctly broken up in predict.py, or the PCA upsampling not working well for predicted keypoints?
Would appreciate any quick insights or hunches anyone might have on this or if folks could just verify that they've gotten replicable results simply from using the exact code in this repo. Thanks so much!
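For context, a minimal sketch of the logfbank extraction mentioned above, using the python_speech_features package; the window/step values and the clip name are assumptions, not necessarily what this repo uses:

```python
import scipy.io.wavfile as wav
from python_speech_features import logfbank

rate, signal = wav.read("audio/00001-000.wav")  # hypothetical clip name

# 26 log filterbank energies per 25 ms window, hopped every 10 ms.
features = logfbank(signal, samplerate=rate, winlen=0.025,
                    winstep=0.01, nfilt=26)
print(features.shape)  # (n_timesteps, 26), fed to the LSTM one window at a time
```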
Any help or hints would be greatly appreciated.
Friends working on the same research can add v64053493 to join a group chat and exchange ideas.
I'm probably just doing something wrong but if somebody could help me that would be greatly appreciated.
The work is amazing, but when I try to test it, something confuses me:
As described in the paper, the keypoints are normalized to be invariant to face location and to in-plane and out-of-plane face rotation.
But when I test, I find that in the test dataset the keypoints rotate with the face and are consecutive across frames, whereas, as described above, the keypoints generated from the audio are invariant to face location and to in-plane and out-of-plane rotation.
Therefore, I want to know how the keypoints are kept consistent with faces of different poses and sizes?
Thanks a lot!
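One plausible answer is that the normalization is inverted at test time using the target frame's own face parameters. A rough sketch of that idea, complementary to the normalization sketch above; again an interpretation, not the repo's verified code:

```python
import numpy as np

def denormalize_keypoints(kp_norm, center, scale, theta):
    """Map normalized keypoints back into a target video frame.

    center, scale, theta are measured from the target face in that frame,
    so the predicted mouth follows the face's location, size, and rotation.
    """
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return (kp_norm @ rot.T) * scale + center
```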
Will this work on another person's photo, not Obama's?
I read the paper, and the author says he used Char2wav to train the audio so that Obama's voice could be used to speak other words, but I didn't find the part that trains text2voice on Obama's voice. Could you help me find that?
I trained on my own data. When I used a single sample, the LSTM got good results, but when I used 50 samples, the results were poor. Sorry to bother you.
How can I get the video dataset and text transcripts?
How do I create the pickle files, i.e. audio_kp, video_kp and pca? Are these automatically generated, or do we have to create them ourselves? If they have to be created, then how?
Would the resolution be better if pix2pixHD were adopted? Just a suggestion; thanks for the great implementation of the paper!
Hi~
Thanks for your code, but I found that the code is incomplete. When will the full code be released? And where can I find the c/train data?
/obamanet/data/obama_addresses.txt and the entire data folder seem to be missing.