rudrabha / lip2wav

This repository contains the code for our CVPR 2020 paper titled "Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis".

License: MIT License

Python 99.72% Shell 0.28%

lip2wav's Issues

About SV2TTS concat

Hi, I am interested in your project, but I'm confused about the use of SV2TTS. SV2TTS embeds one person's voice into a vector; however, your project trains a separate model for every speaker, and the code you released concatenates a zero matrix with the encoder output. Why do you use SV2TTS?
Thanks a lot!

Have you considered using a newer speech synthesis model?

Unable to reproduce the score claimed in paper

Hi, thanks for the great work.
I'm able to reproduce the score claimed in the paper using your pre-trained model weights. However, when I trained Lip2Wav on the chem speaker without the pre-trained weights, the scores were noticeably worse. Here are the results I got:

# Speaker: chem
# Lip2wav - using pre-trained weights 
Mean PESQ: 1.2984
Mean STOI: 0.4285
Mean ESTOI: 0.3204

# Lip2wav - checkpoint on steps 230k 
Mean PESQ: 1.1618
Mean STOI: 0.3245
Mean ESTOI: 0.1539

[Tensorboard screenshot]

Both scores were obtained in the same training environment (ffmpeg version 2.8.17) with the same codebase. What could be causing this?

When you can't detect a face

Hi, I'm trying to port your code to PyTorch, and I wonder why your model can still generate mel spectrograms even when no face is detected. Thanks a lot!

ModuleNotFoundError: No module named 'numpy'

When running the code, I get: ModuleNotFoundError: No module named 'numpy'

Fix: pip install numpy

how to run inference?

When I run python complete_test_generate.py -d Dataset/chem -r Dataset/chem/test_results --preset synthesizer/presets/chem.json --checkpoint checkpoint2/tacotron_model.ckpt-324000.data-00000-of-00001

got this message:
DataLossError (see above for traceback): Unable to open table file checkpoint2/tacotron_model.ckpt-324000.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
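
One likely cause of this error is that `--checkpoint` was given one of the files that make up the checkpoint (`...data-00000-of-00001`) rather than the checkpoint prefix that TensorFlow expects when restoring. The sketch below shows how the prefix could be derived from such a path; the helper name `checkpoint_prefix` is hypothetical and not part of the repository.

```python
import re

# File suffixes TensorFlow appends to a checkpoint prefix when saving.
_CHECKPOINT_FILE_SUFFIX = re.compile(r"\.(data-\d{5}-of-\d{5}|index|meta)$")


def checkpoint_prefix(path: str) -> str:
    """Return the checkpoint prefix for a path that may name one of its files.

    Hypothetical helper: paths that are already a prefix are returned unchanged.
    """
    return _CHECKPOINT_FILE_SUFFIX.sub("", path)


print(checkpoint_prefix("checkpoint2/tacotron_model.ckpt-324000.data-00000-of-00001"))
```

Under that assumption, the command above would pass `--checkpoint checkpoint2/tacotron_model.ckpt-324000`, with the `.index` and `.data-*` files sitting side by side in `checkpoint2/`.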

Help for the installation

I've been trying to install it in ubuntu but I get this error:

ERROR: Could not find a version that satisfies the requirement opencv-python==4.1.1.26 (from -r requirements.txt (line 11)) (from versions: 3.4.0.14, 3.4.8.29, 3.4.9.31, 3.4.9.33, 3.4.10.35, 3.4.10.37, 3.4.11.39, 3.4.11.41, 3.4.11.43, 3.4.11.45, 3.4.13.47, 3.4.14.51, 3.4.14.53, 3.4.15.55, 3.4.16.57, 3.4.16.59, 3.4.17.61, 3.4.17.63, 3.4.18.65, 4.1.2.30, 4.2.0.32, 4.2.0.34, 4.3.0.36, 4.3.0.38, 4.4.0.40, 4.4.0.42, 4.4.0.44, 4.4.0.46, 4.5.1.48, 4.5.2.52, 4.5.2.54, 4.5.3.56, 4.5.4.58, 4.5.4.60, 4.5.5.62, 4.5.5.64, 4.6.0.66) ERROR: No matching distribution found for opencv-python==4.1.1.26 (from -r requirements.txt (line 11))

So I tried to install it manually like this: pip install opencv-python==4.1.1.26 but I still get this error:

ERROR: Could not find a version that satisfies the requirement opencv-python==4.1.1.26 (from versions: 3.4.0.14, 3.4.8.29, 3.4.9.31, 3.4.9.33, 3.4.10.35, 3.4.10.37, 3.4.11.39, 3.4.11.41, 3.4.11.43, 3.4.11.45, 3.4.13.47, 3.4.14.51, 3.4.14.53, 3.4.15.55, 3.4.16.57, 3.4.16.59, 3.4.17.61, 3.4.17.63, 3.4.18.65, 4.1.2.30, 4.2.0.32, 4.2.0.34, 4.3.0.36, 4.3.0.38, 4.4.0.40, 4.4.0.42, 4.4.0.44, 4.4.0.46, 4.5.1.48, 4.5.2.52, 4.5.2.54, 4.5.3.56, 4.5.4.58, 4.5.4.60, 4.5.5.62, 4.5.5.64, 4.6.0.66) ERROR: No matching distribution found for opencv-python==4.1.1.26

So I noticed that the pinned version is outdated; the newest version is 4.6.0.66.

I didn't know what to do, so I tried editing the requirements.txt file to opencv-python==4.6.0.66. It got a bit further, but then I got this error:

ERROR: Command errored out with exit status 1: command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-vrstclp5/pesq/setup.py'"'"'; __file__='"'"'/tmp/pip-install-vrstclp5/pesq/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-vrstclp5/pesq/pip-egg-info cwd: /tmp/pip-install-vrstclp5/pesq/ Complete output (5 lines): Traceback (most recent call last): File "<string>", line 1, in <module> File "/tmp/pip-install-vrstclp5/pesq/setup.py", line 10, in <module> from Cython.Build import cythonize, build_ext ModuleNotFoundError: No module named 'Cython' ---------------------------------------- ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

But of course I don't know if what I'm doing is even correct.

Hopefully someone can help me.
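
Two things stand out in the logs above. The pinned `opencv-python==4.1.1.26` is no longer offered for this interpreter, and the second traceback shows that the `pesq` build imports `Cython` in its `setup.py`, so installing Cython before the remaining requirements is probably needed. For the first point, a small sketch of picking the closest release at or above the pin from the versions pip listed; the helper names are hypothetical, and whether a neighbouring release is compatible with the code is an assumption that would need testing.

```python
def parse(version: str) -> tuple:
    """Turn '4.1.2.30' into (4, 1, 2, 30) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def nearest_at_or_above(pinned: str, available: list):
    """Return the smallest available version that is >= the pinned one, or None."""
    candidates = [v for v in available if parse(v) >= parse(pinned)]
    return min(candidates, key=parse) if candidates else None


# Excerpt of the versions reported in the pip error message above.
available = ["3.4.17.63", "3.4.18.65", "4.1.2.30", "4.2.0.32", "4.5.5.64", "4.6.0.66"]
print(nearest_at_or_above("4.1.1.26", available))
```

Choosing the nearest release rather than the newest keeps the change as small as possible, which tends to reduce the risk of API differences in an older codebase.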

Dataset Downloading Error

Hi All,
Thank you very much for this project!
I tried to run sh download_speaker.sh Dataset/chem but it seems there is a problem downloading the videos from youtube.
This is the error I'm getting:

Downloading Train set of chem
[youtube] rvPHWWNYC0I: Downloading webpage
[youtube] rvPHWWNYC0I: Downloading video info webpage
ERROR: rvPHWWNYC0I: YouTube said: This video is unavailable.

This stops the training set download process, and the same happens for the val/test sets. I checked whether the video can be watched directly on YouTube's website, and it can (so the video exists).

Thank you in advance.

The pre-trained model does not seem to work well

I tried to use the pre-trained model to generate audio on the test set of the DL speaker, and got the following results:

dl_test_results.zip

The results do not sound very good. I got the following scores:

Mean PESQ: 1.2502756974913858 
Mean STOI: 0.051719609840522554
Mean ESTOI: 0.011818173468155018

These seem worse than the reported results.

I used this pre-trained model and the dl.json as config. What could be the possible problem?

Training error with "ValueError: all input arrays must have the same shape"

I downloaded the dataset. Processed it. Then trained the model with the following command:
python train.py first_run --data_root Dataset/chem/ --preset synthesizer/presets/chem.json
And on step 8, I got the following error:
ValueError: all input arrays must have the same shape
What did I do wrong here? Thanks!

Arguments:
    name:                   first_run
    data_root:              Dataset/chem/
    preset:                 synthesizer/presets/chem.json
    models_dir:             synthesizer/saved_models/
    mode:                   synthesis
    GTA:                    True
    restore:                True
    summary_interval:       2500
    embedding_interval:     1000000000
    checkpoint_interval:    1000
    eval_interval:          1000
    tacotron_train_steps:   2000000
    tf_log_level:           1

Training on 12.369824074074074 hours
Validating on 0.7361574074074074 hours
...
Instructions for updating:
Use tf.cast instead.
Loss is added.....
Optimizer is added....
Feeder is initialized....
Ready to train....
Step       1 [62.693 sec/step, loss=18.06177, avg_loss=18.06177]
Step       2 [32.221 sec/step, loss=11.06506, avg_loss=14.56342]
Step       3 [22.066 sec/step, loss=8.36187, avg_loss=12.49623]
Step       4 [16.988 sec/step, loss=9.19182, avg_loss=11.67013]
Step       5 [22.275 sec/step, loss=8.69534, avg_loss=11.07517]
Step       6 [18.854 sec/step, loss=10.56172, avg_loss=10.98960]
Step       7 [16.410 sec/step, loss=8.13013, avg_loss=10.58110]
Step       8 [14.579 sec/step, loss=7.24404, avg_loss=10.16397]
Exception in thread background:
Traceback (most recent call last):
  File "/home/kavinvin/miniconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/kavinvin/miniconda3/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/kavinvin/projects/sandbox/Lip2Wav/synthesizer/feeder.py", line 147, in _enqueue_next_train_group
    feed_dict = dict(zip(self._placeholders, self._prepare_batch(batch, r)))
  File "/home/kavinvin/projects/sandbox/Lip2Wav/synthesizer/feeder.py", line 212, in _prepare_batch
    input_cur_device, input_max_len = self._prepare_inputs([x[0] for x in batch])
  File "/home/kavinvin/projects/sandbox/Lip2Wav/synthesizer/feeder.py", line 238, in _prepare_inputs
    return np.stack([self._pad_input(x, max_len) for x in inputs]), max_len
  File "/home/kavinvin/miniconda3/lib/python3.7/site-packages/numpy/core/shape_base.py", line 416, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
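
The traceback shows `np.stack` rejecting a batch because at least one padded input has a different shape from the others. A minimal debugging sketch, assuming the shapes of the padded inputs can be collected just before the `np.stack` call; the function name and the example shapes are hypothetical, not values taken from the repository.

```python
from collections import Counter


def find_shape_mismatches(shapes):
    """Return the most common shape and the (index, shape) pairs that differ from it."""
    shapes = [tuple(s) for s in shapes]
    if not shapes:
        return None, []
    expected, _count = Counter(shapes).most_common(1)[0]
    mismatches = [(i, s) for i, s in enumerate(shapes) if s != expected]
    return expected, mismatches


# Made-up shapes: two well-formed windows and one that lost its channel axis.
window_shapes = [(90, 96, 96, 3), (90, 96, 96, 3), (90, 96, 96)]
expected, mismatches = find_shape_mismatches(window_shapes)
print("expected:", expected, "mismatched:", mismatches)
```

In the feeder this could be fed with something like `[x.shape for x in padded_inputs]`, which would point at the sample (and therefore the clip on disk) that breaks the batch.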

Difference between mel.npz and ref.npz

Thanks for such great work!
I am wondering why you use the mel spectrogram encoded by a pretrained model (ref.npz) rather than using mel.npz directly.
Is that because ref.npz contains more speaker info?
Thank you!

Teacher forcing on TIMIT and GRID dataset

Hi, I want to know how to set up teacher forcing for the GRID and TCD-TIMIT datasets. Is it the same as for the Lip2Wav dataset, with teacher forcing decaying from 29000 steps?

Pre-processing and training not working on custom dataset?

Whenever I preprocess the custom dataset, this is the output:

`C:\Users\Graham\Desktop\Lip2Wav-master>python preprocess.py --speaker_root Dataset/larry --speaker larry
C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Started processing for Dataset/larrywith 1 GPUs
0it [00:00, ?it/s]

C:\Users\Graham\Desktop\Lip2Wav-master>`

but there is no new output, and when I try and train, this outputs:

`C:\Users\Graham\Desktop\Lip2Wav-master>python train.py first_run --data_root Dataset/larry/ --preset synthesizer/presets/larry.json
Arguments:
name: first_run
data_root: Dataset/larry/
preset: synthesizer/presets/larry.json
models_dir: synthesizer/saved_models/
mode: synthesis
GTA: True
restore: True
summary_interval: 2500
embedding_interval: 1000000000
checkpoint_interval: 1000
eval_interval: 1000
tacotron_train_steps: 2000000
tf_log_level: 1

Traceback (most recent call last):
File "train.py", line 61, in <module>
log_dir, hparams = prepare_run(args)
File "train.py", line 21, in prepare_run
hparams.add_hparam('all_images', all_images)
File "C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\contrib\training\python\training\hparam.py", line 485, in add_hparam
'Multi-valued hyperparameters cannot be empty: %s' % name)
ValueError: Multi-valued hyperparameters cannot be empty: all_images

C:\Users\Graham\Desktop\Lip2Wav-master>`

How do you properly use a custom dataset with this project? Thank you.
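
The `0it` progress line in the preprocessing log and the empty `all_images` hyperparameter both suggest that no frames were written, so training has nothing to list. A quick check along these lines could confirm that before launching `train.py`. The sketch assumes frames are stored as image files somewhere under `<data_root>/preprocessed` (a layout inferred from paths that appear in other reports here); the function name and the set of suffixes are hypothetical.

```python
from pathlib import Path

# Assumed frame formats; adjust if preprocessing writes something else.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_frames(data_root) -> int:
    """Count image files below <data_root>/preprocessed (0 if the folder is missing)."""
    root = Path(data_root) / "preprocessed"
    if not root.is_dir():
        return 0
    return sum(1 for p in root.rglob("*") if p.suffix.lower() in IMAGE_SUFFIXES)


if __name__ == "__main__":
    n = count_frames("Dataset/larry")
    print(f"{n} frames found" if n else "no frames found: preprocessing produced nothing")
```

If the count is zero, the problem lies in where `preprocess.py` looks for the input videos rather than in training itself.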

Bug in multispeaker branch

Hi,

I came across a bug when preprocessing LRW, where ffmpeg fails silently. I'm pretty sure this line

Lip2Wav/preprocess.py

Line 63 in a5835ff

command = template2.format(vfile, hp.sample_rate, wavpath)

should be changed to command = template2.format(vfile, wavpath) . This is what worked for me.
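
The reason this fails silently is that `str.format` ignores surplus positional arguments instead of raising an error. A small illustration, using a made-up two-placeholder template (the real `template2` in `preprocess.py` may differ):

```python
# Hypothetical template with two placeholders: input video, output wav.
template2 = "ffmpeg -loglevel panic -y -i {} -strict -2 {}"

vfile, sample_rate, wavpath = "clip.mp4", 16000, "clip/audio.wav"

# Three arguments for two placeholders: the sample rate lands in the output
# slot and wavpath is dropped without any exception.
buggy = template2.format(vfile, sample_rate, wavpath)

# Two arguments for two placeholders: the output path is where it belongs.
fixed = template2.format(vfile, wavpath)

print(buggy)
print(fixed)
```

With the extra argument, ffmpeg would be asked to write to a file named after the sample rate, and a quiet log level hides the resulting failure.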

By the way, are you planning to release test samples/checkpoints for GRID/TCD-TIMIT? That would be great, so that we can compare accurately with your work.

Thanks a lot in advance!

Questions about data preprocessing

Hello, thanks for the Lip2Wav dataset you kindly provided. I noticed that there are several scenes in the dataset where there is no face on the screen, and wondered how you solved this problem. Did you filter out these data when training the model? Or did you just ignore them and still get good results?

Is this model on google colab?

Has this model been ported to Google Colab?

Question about hyper parameters: 'mel_overlap'

Hi! Thanks for this great work! I'm trying to train this model on other datasets, and I can't find 'mel_overlap' in the training code. Does this hyperparameter only apply in the inference phase? And if I train this model on other datasets, how should I set hparams like 'mel_overlap'?

consistency of input frames length and output waveform length

Great work! I wonder how to ensure that the input frame length and the output waveform length are consistent. When I train and test on the GRID dataset with the following hyperparameters:
T = 40
overlap = 10
mel_step_size = 160
mel_overlap = 40
img_size = 96
fps = 25,
Test results show that the ground truth is 3 seconds long while the generated waveforms are 7 seconds long. How can I solve this problem? Looking forward to your reply!
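
One thing worth checking is whether the video window (`T`, `overlap`) and the mel window (`mel_step_size`, `mel_overlap`) cover the same duration. The sketch below converts a number of video frames into a number of mel frames. The default `sample_rate=16000` and `hop_size=200` are assumptions about the audio front end, not values taken from this report, and should be replaced with the ones in the hparams actually used.

```python
def mel_steps(num_frames: int, fps: float, sample_rate: int = 16000, hop_size: int = 200) -> int:
    """Number of mel frames that span the same time as `num_frames` video frames.

    duration      = num_frames / fps          (seconds)
    mel frames/s  = sample_rate / hop_size
    """
    return round(num_frames * sample_rate / (fps * hop_size))


# GRID settings from the question: T = 40, overlap = 10, fps = 25.
print(mel_steps(40, 25), mel_steps(10, 25))                              # assumed hop_size = 200
print(mel_steps(40, 25, hop_size=160), mel_steps(10, 25, hop_size=160))  # assumed hop_size = 160
```

Under these assumptions, `mel_step_size = 160` and `mel_overlap = 40` are consistent with `T = 40` at 25 fps only for a 160-sample hop at 16 kHz; with a 200-sample hop the matching values would be 128 and 32. A mismatch of this kind would make the generated audio cover a different duration than the input frames.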

How many steps needed for multi-speaker model

What's the expected number of steps to achieve a similar result to the pre-trained model of multi-speaker settings (LRW)? The default steps in train.py is 2,000,000.

Pytorch implementation for lip2wav-multispeaker

Hi, thank you for sharing the code for this project. Just would like to ask if there is a PyTorch implementation of this model or any tips on making conversions from the TensorFlow architecture to the PyTorch one. Currently working on a project on integrating lip2wav and wav2lip together in order to try and achieve a cycle consistency model.

Dataset preprocessing

Hi,

I am wondering about the pre-processing techniques that you use on your dataset. From what I could see, you save the frames where a face is detected, but you don't clip the audio based on whether a face is detected or not. In that case, how do you ensure time alignment of the audio and video streams for training?

If I am wrong about my assumption, could you please point me to the correct pre-processing code.

Thanks!
Nabarun

How to save frames without faces during preprocessing?

Hi, this is great work. I am trying to use my own dataset to reconstruct speech. The dataset consists of videos of medical images of the vocal organs, without human faces. Can you tell me how to save these frames without faces? Thanks a lot!

Error while training the model on any data

Hello all,
I'm trying to get the model to train on the same data provided by the authors. I was able to reproduce the results using the pre-trained weights, and everything worked fine. I am using Windows and running TensorFlow on GPU.
The output I am getting is as follows-

Exception in thread background:
Traceback (most recent call last):
File "C:\Users\admin\Documents\temp\lib\threading.py", line 917, in _bootstrap_inner
self.run()
File "C:\Users\admin\Documents\temp\lib\threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\admin\Desktop\Lip2Wav-master\Lip2Wav-master\synthesizer\feeder.py", line 139, in _enqueue_next_train_group
examples = [self._get_next_example() for i in range(n * _batches_per_group)]
File "C:\Users\admin\Desktop\Lip2Wav-master\Lip2Wav-master\synthesizer\feeder.py", line 139, in <listcomp>
examples = [self._get_next_example() for i in range(n * _batches_per_group)]
File "C:\Users\admin\Desktop\Lip2Wav-master\Lip2Wav-master\synthesizer\feeder.py", line 194, in _get_next_example
input_data, mel_target = self.getitem()
File "C:\Users\admin\Desktop\Lip2Wav-master\Lip2Wav-master\synthesizer\feeder.py", line 172, in getitem
mel = np.load(os.path.join(os.path.dirname(img_name), 'mels.npz'))['spec'].T
File "C:\Users\admin\Documents\temp\lib\site-packages\numpy\lib\npyio.py", line 416, in load
fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'Dataset/chem//preprocessed/46\mels.npz'

The directory names are numbers, so "46" points to the directory that holds all the images along with the original .wav file.
Please help! I have been stuck on this issue for a week now :(
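
According to the traceback, the feeder loads `mels.npz` from the same directory as the frame it picked, so any frame directory without that file will trigger this error. A small sketch for listing such directories; the function name is hypothetical, and the `.jpg` extension for frames is an assumption.

```python
import os


def dirs_missing_mels(preprocessed_root, mel_name="mels.npz"):
    """Return directories that contain frame images but no mel file."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(preprocessed_root):
        has_frames = any(name.lower().endswith(".jpg") for name in filenames)
        if has_frames and mel_name not in filenames:
            missing.append(dirpath)
    return sorted(missing)


if __name__ == "__main__":
    for path in dirs_missing_mels("Dataset/chem/preprocessed"):
        print("no mels.npz in", path)
```

If every directory is reported, the step that writes the mel spectrograms probably did not run or failed; if only a few are, those clips could be re-processed or excluded.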

training and test split for GRID and TCD-TIMIT

Nice work!
Could you please share the train, validation and unseen test splits for GRID and TCD-TIMIT used in your paper?
The unseen means unseen speakers or unseen sentences?
Do you also train one model for each speaker?

Thanks!

Results of training on my own dataset

Hi, when I train on my own dataset, the result is as shown in the following picture. Can you tell me where the problem is? Maybe the amount of data is too small? (My dataset is about 20 minutes long.) Looking forward to your reply.

How to avoid “CUDA out of memory” in PyTorch

Hey there.

Preprocessing runs, but it has only generated 3.9 GB of audio files.
I think there should be images as well (I tried on a Mac without a GPU, changing device-name:'cuda' to device-name:'cpu', and it generates images). On my desktop (with an NVIDIA GPU) it only gives me .wav audio files, and I get a warning like:

“CUDA out of memory” in PyTorch Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 4.29 GiB already allocated; 10.12 MiB free; 4.46 GiB reserved in total by PyTorch)

can someone help me fix the issue?
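
A common way to work around this kind of error is to process fewer frames per batch. The sketch below retries a failed batch with half the batch size whenever a `RuntimeError` mentioning "out of memory" is raised, which is how PyTorch reports CUDA OOM. `process_batch` stands in for whatever GPU call is failing (for example the face detector) and is a hypothetical callable, not a function from the repository.

```python
def process_with_backoff(items, process_batch, batch_size=32):
    """Run process_batch over items, halving the batch size on out-of-memory errors.

    Returns the collected results and the batch size that finally worked.
    """
    results = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(process_batch(batch))
        except RuntimeError as err:
            # Re-raise anything that is not an OOM, or if we cannot shrink further.
            if "out of memory" not in str(err) or batch_size == 1:
                raise
            batch_size //= 2  # retry the same position with a smaller batch
            # With PyTorch, calling torch.cuda.empty_cache() here may also help.
            continue
        start += len(batch)
    return results, batch_size


def fake_detector(batch):
    """Stand-in for a GPU call that only fits four items at a time."""
    if len(batch) > 4:
        raise RuntimeError("CUDA out of memory")
    return list(batch)


print(process_with_backoff(list(range(10)), fake_detector))
```

It is also worth checking that no other process is holding GPU memory, since the message above reports only about 10 MiB free on a 10.76 GiB card.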

Real time speech generation

Hi. Would it be possible to produce speech from a real-time live video?

Can't resume training from checkpoint

Whenever I try to resume training from the checkpoint I get this error.

c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.NotFoundError: FindFirstFile failed for: synthesizer/saved_models/logs-final/taco_pretrained : The system cannot find the path specified. ; No such process

how to generate voice from video?

I have an mp4 video of a person speaking with choppy sound.
Can you tell me where to put the mp4 and which script generates sound for the silent parts?
Thanks in advance.

Great project!

Excellent work and I love that you used a synthesized voice for the project video. I spent a lot of time on Corentin's Real Time Voice Cloning project and noticed you used some of his code as your base.

Just wondering what you plan on doing from this point with the project. Are you going to continue working on it or have you already moved on to new projects?

About teacher forcing

Hello, NVIDIA's Tacotron 2 uses teacher forcing. I was wondering where it is in your code? Can you help me?

Encountered ValueError during training

Hi!
It's really a nice work! But I'm facing a problem while training:

When running train.py with the command python train.py chem --data_root Dataset/chem/ --preset synthesizer/presets/chem.json, I get the ValueError 'Multi-valued hyperparameters cannot be empty: all_test_images'. I can't find a hyperparameter named 'all_test_images' anywhere in the code.
Am I assigning the wrong value to the run name? Is it a directory like '/Dataset/chem' or an arbitrary name of our choosing, such as 'chem'? I have tried training with different 'name' values, such as 'chem', '/Dataset/chem' and '/Dataset/chem/preprocessed/', but none of them worked; the same error is still reported.
Looking forward to your answer! Thanks a lot!

Training on LRS3 or custom data

Hi All,
How are you?
Thank you for your wonderful and interesting work!
I'm trying to train Lip2Wav on LRS3 and wonder what steps I should follow for doing that. (what is the correct dir structure, etc.)

Cheers,

about time cost

I have a question. You trained 200k iterations with bs=32. How much time did you spend on it? My training is quite slow!

Train/test split for GRID and TIMIT and WER measurement.

Thanks for your great work.

Would you like to share the GRID and TIMIT train and test split and the ASR model that calculates the WER for these two datasets for fair comparisons of future works?

Lip2Wav dataset not available

Hi,
I am really happy to see your pioneering work! I am wondering whether I can download the Lip2Wav dataset. (The project website does not offer the URL.)
Really appreciate it, thank you~

How Can I Get Generated Text?

First of all thank you for the project.

As I understand it, the project first performs lip reading to create text and then runs text-to-speech with Tacotron.
I'm trying to get the text generated by the lip reading step. Is that possible?

Also, do I need transcripts of the speech in the videos to train on my own data?

Thank you.

How much time did it take for you guys to train your model?

Hi, we are trying to replicate your work for a coursework project and wanted to know how long it took you to:

Train (e.g. on the chess dataset)
Preprocess (e.g. on the chess dataset)

WER

Hi, thanks for the great work.

When I test the pre-trained multi-speaker model on the LRW test set, I get STOI and ESTOI values similar to those quoted in the paper, but the best WER I can achieve is 79.6%, compared to the 34.2% in the paper.

Could you specify the steps you used to achieve 34.2% WER with Google ASR? Do you crop the synthesised word and use a specific Google ASR model/configuration? Do you use the entire LRW test dataset or just a subset?

It would be great to know for fair comparison of future research.

Thanks
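
For reference when comparing numbers, WER is usually defined as the word-level edit distance between the reference and the ASR hypothesis, divided by the number of reference words. The sketch below implements that standard definition; it is not the scoring procedure used in the paper, and the lower-casing and whitespace tokenisation are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(word_error_rate("about", "about"))
print(word_error_rate("about", "a bout"))
```

Because each LRW target is a single word, any extra words the recogniser emits count as insertions against a one-word reference, so the way the synthesised audio is cropped before recognition can change the reported WER considerably.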