rudrabha / lip2wav
This is the repository containing codes for our CVPR, 2020 paper titled "Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis"
License: MIT License
Hi, I am interested in your project, but I'm confused about the use of SV2TTS. SV2TTS embeds a speaker's voice into a vector, whereas your project trains a separate model for every speaker. Also, the code you released concatenates a zero matrix with the encoder output. Could you explain why you use SV2TTS?
Thanks a lot!
Hi, thanks for the great work.
I'm able to reproduce the scores claimed in the paper using your pre-trained model weights. However, when I trained Lip2Wav on the chem speaker without the pre-trained weights, the scores were noticeably worse. Here are the results I got:
# Speaker: chem
# Lip2wav - using pre-trained weights
Mean PESQ: 1.2984
Mean STOI: 0.4285
Mean ESTOI: 0.3204
# Lip2wav - checkpoint on steps 230k
Mean PESQ: 1.1618
Mean STOI: 0.3245
Mean ESTOI: 0.1539
Both sets of scores were obtained in the same environment (ffmpeg version 2.8.17) with the same codebase. What could be the cause?
Hi, I'm trying to port your code to PyTorch, and I wonder why your model can generate a mel spectrogram even when no face is detected. Thanks a lot!
When running the code, I get: ModuleNotFoundError: No module named 'numpy'
Fix: pip install numpy
When I run python complete_test_generate.py -d Dataset/chem -r Dataset/chem/test_results --preset synthesizer/presets/chem.json --checkpoint checkpoint2/tacotron_model.ckpt-324000.data-00000-of-00001
I got this message:
DataLossError (see above for traceback): Unable to open table file checkpoint2/tacotron_model.ckpt-324000.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
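A possible cause, stated as an assumption rather than a confirmed diagnosis: TensorFlow 1.x restores from a checkpoint prefix (for example `tacotron_model.ckpt-324000`), not from one of the `.data-00000-of-00001`, `.index`, or `.meta` files written next to it, and passing a shard file commonly produces this "not an sstable" message. The helper below is a hypothetical utility, not part of the repository, that strips such a suffix before the path is passed to `--checkpoint`.

```python
import re

# Suffixes TensorFlow appends to a checkpoint prefix when it writes files to disk.
_SHARD_SUFFIX = re.compile(r"\.(index|meta|data-\d{5}-of-\d{5})$")


def checkpoint_prefix(path):
    """Strip a trailing .index/.meta/.data-XXXXX-of-XXXXX suffix, if present.

    Hypothetical helper for illustration; paths without such a suffix are
    returned unchanged.
    """
    return _SHARD_SUFFIX.sub("", path)


# Path taken from the command in the post above.
print(checkpoint_prefix("checkpoint2/tacotron_model.ckpt-324000.data-00000-of-00001"))
```

If this is indeed the problem, rerunning the same command with `--checkpoint checkpoint2/tacotron_model.ckpt-324000` would be the thing to try.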
I've been trying to install it on Ubuntu, but I get this error:
ERROR: Could not find a version that satisfies the requirement opencv-python==4.1.1.26 (from -r requirements.txt (line 11)) (from versions: 3.4.0.14, 3.4.8.29, 3.4.9.31, 3.4.9.33, 3.4.10.35, 3.4.10.37, 3.4.11.39, 3.4.11.41, 3.4.11.43, 3.4.11.45, 3.4.13.47, 3.4.14.51, 3.4.14.53, 3.4.15.55, 3.4.16.57, 3.4.16.59, 3.4.17.61, 3.4.17.63, 3.4.18.65, 4.1.2.30, 4.2.0.32, 4.2.0.34, 4.3.0.36, 4.3.0.38, 4.4.0.40, 4.4.0.42, 4.4.0.44, 4.4.0.46, 4.5.1.48, 4.5.2.52, 4.5.2.54, 4.5.3.56, 4.5.4.58, 4.5.4.60, 4.5.5.62, 4.5.5.64, 4.6.0.66) ERROR: No matching distribution found for opencv-python==4.1.1.26 (from -r requirements.txt (line 11))
So I tried to install it manually like this: pip install opencv-python==4.1.1.26
but I still get this error:
ERROR: Could not find a version that satisfies the requirement opencv-python==4.1.1.26 (from versions: 3.4.0.14, 3.4.8.29, 3.4.9.31, 3.4.9.33, 3.4.10.35, 3.4.10.37, 3.4.11.39, 3.4.11.41, 3.4.11.43, 3.4.11.45, 3.4.13.47, 3.4.14.51, 3.4.14.53, 3.4.15.55, 3.4.16.57, 3.4.16.59, 3.4.17.61, 3.4.17.63, 3.4.18.65, 4.1.2.30, 4.2.0.32, 4.2.0.34, 4.3.0.36, 4.3.0.38, 4.4.0.40, 4.4.0.42, 4.4.0.44, 4.4.0.46, 4.5.1.48, 4.5.2.52, 4.5.2.54, 4.5.3.56, 4.5.4.58, 4.5.4.60, 4.5.5.62, 4.5.5.64, 4.6.0.66) ERROR: No matching distribution found for opencv-python==4.1.1.26
So I noticed that the pinned version is no longer available; the newest version is 4.6.0.66.
I didn't know what to do, so I tried editing the requirements.txt file to opencv-python==4.6.0.66. The install got a bit further, but then I got this error:
ERROR: Command errored out with exit status 1: command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-vrstclp5/pesq/setup.py'"'"'; __file__='"'"'/tmp/pip-install-vrstclp5/pesq/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-vrstclp5/pesq/pip-egg-info cwd: /tmp/pip-install-vrstclp5/pesq/ Complete output (5 lines): Traceback (most recent call last): File "<string>", line 1, in <module> File "/tmp/pip-install-vrstclp5/pesq/setup.py", line 10, in <module> from Cython.Build import cythonize, build_ext ModuleNotFoundError: No module named 'Cython' ---------------------------------------- ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
But of course I don't know if what I'm doing is even correct.
Hopefully someone can help me.
Hi All,
Thank you very much for this project!
I tried to run sh download_speaker.sh Dataset/chem
but there seems to be a problem downloading the videos from YouTube.
This is the error I'm getting:
Downloading Train set of chem
[youtube] rvPHWWNYC0I: Downloading webpage
[youtube] rvPHWWNYC0I: Downloading video info webpage
ERROR: rvPHWWNYC0I: YouTube said: This video is unavailable.
This stops the training set download, and the same happens for the val and test sets. I checked whether the video can be viewed directly on YouTube's website, and it can (so the video exists).
Thank you in advance.
I used the pre-trained model to generate audio on the test set of the DL speaker. The results do not sound very good, and I got these scores:
Mean PESQ: 1.2502756974913858
Mean STOI: 0.051719609840522554
Mean ESTOI: 0.011818173468155018
These seem poor.
I used this pre-trained model with dl.json as the config. What could be the problem?
I downloaded the dataset, processed it, and then trained the model with the following command:
python train.py first_run --data_root Dataset/chem/ --preset synthesizer/presets/chem.json
At step 8, I got the following error:
ValueError: all input arrays must have the same shape
What did I do wrong here? Thanks!
Arguments:
name: first_run
data_root: Dataset/chem/
preset: synthesizer/presets/chem.json
models_dir: synthesizer/saved_models/
mode: synthesis
GTA: True
restore: True
summary_interval: 2500
embedding_interval: 1000000000
checkpoint_interval: 1000
eval_interval: 1000
tacotron_train_steps: 2000000
tf_log_level: 1
Training on 12.369824074074074 hours
Validating on 0.7361574074074074 hours
...
Instructions for updating:
Use tf.cast instead.
Loss is added.....
Optimizer is added....
Feeder is initialized....
Ready to train....
Step 1 [62.693 sec/step, loss=18.06177, avg_loss=18.06177]
Step 2 [32.221 sec/step, loss=11.06506, avg_loss=14.56342]
Step 3 [22.066 sec/step, loss=8.36187, avg_loss=12.49623]
Step 4 [16.988 sec/step, loss=9.19182, avg_loss=11.67013]
Step 5 [22.275 sec/step, loss=8.69534, avg_loss=11.07517]
Step 6 [18.854 sec/step, loss=10.56172, avg_loss=10.98960]
Step 7 [16.410 sec/step, loss=8.13013, avg_loss=10.58110]
Step 8 [14.579 sec/step, loss=7.24404, avg_loss=10.16397]
Exception in thread background:
Traceback (most recent call last):
File "/home/kavinvin/miniconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/kavinvin/miniconda3/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/kavinvin/projects/sandbox/Lip2Wav/synthesizer/feeder.py", line 147, in _enqueue_next_train_group
feed_dict = dict(zip(self._placeholders, self._prepare_batch(batch, r)))
File "/home/kavinvin/projects/sandbox/Lip2Wav/synthesizer/feeder.py", line 212, in _prepare_batch
input_cur_device, input_max_len = self._prepare_inputs([x[0] for x in batch])
File "/home/kavinvin/projects/sandbox/Lip2Wav/synthesizer/feeder.py", line 238, in _prepare_inputs
return np.stack([self._pad_input(x, max_len) for x in inputs]), max_len
File "/home/kavinvin/miniconda3/lib/python3.7/site-packages/numpy/core/shape_base.py", line 416, in stack
raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
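`np.stack` requires every array in the list to have exactly the same shape, so at least one example in the failing batch differs from the rest. What makes it differ in this dataset is not known from the traceback. A minimal sketch for locating the odd one out, assuming you can collect `x.shape` for each input just before the `np.stack` call; the function name and the sample shapes are hypothetical:

```python
from collections import Counter


def find_shape_outliers(shapes):
    """Return (most_common_shape, indices of entries whose shape differs)."""
    expected, _ = Counter(shapes).most_common(1)[0]
    outliers = [i for i, s in enumerate(shapes) if s != expected]
    return expected, outliers


# Hypothetical batch: each tuple stands for the shape of one input window.
# In practice this list would be [x.shape for x in inputs].
shapes = [(90, 96, 96, 3), (90, 96, 96, 3), (90, 96, 96), (90, 96, 96, 3)]
expected, bad = find_shape_outliers(shapes)
print("expected shape:", expected)
print("indices with a different shape:", bad)
```

Logging the file names of the outlier indices should point at the samples worth inspecting.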
Thanks for such a great work!
I am wondering why you use the encoded mel spectrogram (ref.npz) produced by the pretrained model rather than using mel.npz directly.
Is that because ref.npz contains more speaker information?
Thank you!
Hi, I want to know how to set teacher forcing for the GRID and TCD-TIMIT datasets. Is it the same as for the Lip2Wav dataset, with teacher forcing decaying from 29000 steps?
Whenever I preprocess the custom dataset, this is the output:
C:\Users\Graham\Desktop\Lip2Wav-master>python preprocess.py --speaker_root Dataset/larry --speaker larry
C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Started processing for Dataset/larrywith 1 GPUs
0it [00:00, ?it/s]
C:\Users\Graham\Desktop\Lip2Wav-master>
but no new output is produced, and when I try to train, I get this:
C:\Users\Graham\Desktop\Lip2Wav-master>python train.py first_run --data_root Dataset/larry/ --preset synthesizer/presets/larry.json
Arguments:
name: first_run
data_root: Dataset/larry/
preset: synthesizer/presets/larry.json
models_dir: synthesizer/saved_models/
mode: synthesis
GTA: True
restore: True
summary_interval: 2500
embedding_interval: 1000000000
checkpoint_interval: 1000
eval_interval: 1000
tacotron_train_steps: 2000000
tf_log_level: 1
Traceback (most recent call last):
File "train.py", line 61, in <module>
log_dir, hparams = prepare_run(args)
File "train.py", line 21, in prepare_run
hparams.add_hparam('all_images', all_images)
File "C:\Users\Graham\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\contrib\training\python\training\hparam.py", line 485, in add_hparam
'Multi-valued hyperparameters cannot be empty: %s' % name)
ValueError: Multi-valued hyperparameters cannot be empty: all_images
C:\Users\Graham\Desktop\Lip2Wav-master>
How do you properly use a custom dataset with this project? Thank you.
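The `all_images` error indicates that the list of training images came back empty, which is consistent with the `0it` line in the preprocessing output: no videos were processed, so no frames were written. A quick way to confirm this before training is to count the extracted frames. The sketch below assumes frames are stored as `.jpg` files somewhere under `<data_root>/preprocessed`; the directory name is taken from paths reported in other posts here, while the `.jpg` extension and the function name are assumptions.

```python
from pathlib import Path


def count_frames(data_root):
    """Count .jpg files anywhere under <data_root>/preprocessed (0 if absent)."""
    pre = Path(data_root) / "preprocessed"
    if not pre.is_dir():
        return 0
    return sum(1 for _ in pre.rglob("*.jpg"))


# Data root from the post above; prints 0 when nothing has been preprocessed.
print(count_frames("Dataset/larry"))
```

A count of zero means the problem is in the preprocessing step (for example, the input videos not being where the script looks for them), not in `train.py`.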
Hi,
I came across a bug when preprocessing LRW, where ffmpeg fails silently. I'm pretty sure it is this line (line 63 in a5835ff):
command = template2.format(vfile, wavpath)
This is what worked for me.
By the way, are you planning to release test samples/checkpoints for GRID/TCD-TIMIT? This would be great, so that we can compare accurately with your work.
Thanks a lot in advance!
Hello, thanks for the Lip2Wav dataset you kindly provided. I noticed that there are several scenes in the dataset where there is no face on the screen, and wondered how you solved this problem. Did you filter out these data when training the model? Or did you just ignore them and still get good results?
Has this model been ported to Google Colab?
Hi! Thanks for this great work! I'm trying to train this model on other datasets, and I can't find 'mel_overlap' in the training code. Is this hyperparameter only used in the inference phase? And when training on other datasets, how should I set hparams like 'mel_overlap'?
Great work! How can I keep the input frame length and the output waveform length consistent? I trained and tested on the GRID dataset with the following hyperparameters:
T = 40
overlap = 10
mel_step_size = 160
mel_overlap = 40
img_size = 96
fps = 25
The test results show that the ground truth is 3 seconds long while the generated waveforms are 7 seconds long. How can I solve this problem? Looking forward to your reply!
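One thing worth checking is whether the mel-side values cover the same time span as the video-side values. The sketch below converts a number of video frames into a number of mel frames. The sample rate (16000 Hz) and hop size (200 samples) are assumptions, not values given in the post, so substitute the ones from your own hparams; the function name is hypothetical.

```python
def mel_frames(video_frames, fps, sample_rate=16000, hop_size=200):
    """Mel frames covering the same time span as `video_frames` video frames.

    sample_rate and hop_size are ASSUMED values; read the real ones from
    your hparams before relying on the result.
    """
    return round(video_frames * sample_rate / (fps * hop_size))


T, overlap, fps = 40, 10, 25  # values from the post above
print("mel_step_size:", mel_frames(T, fps))
print("mel_overlap:", mel_frames(overlap, fps))
```

Under these assumptions the matching values are 128 and 32, whereas the posted 160 and 40 correspond to 2.0 s and 0.5 s of audio for windows that span 1.6 s and 0.4 s of video. That mismatch may contribute to the longer output, though it may not be the only cause.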
What's the expected number of steps to achieve a result similar to the pre-trained model in the multi-speaker setting (LRW)? The default number of steps in train.py is 2,000,000.
Hi, thank you for sharing the code for this project. I would like to ask whether there is a PyTorch implementation of this model, or any tips on converting the TensorFlow architecture to PyTorch. I am currently working on a project that integrates Lip2Wav and Wav2Lip to try to achieve a cycle-consistency model.
Hi,
I am wondering about the pre-processing techniques that you use on your dataset. From what I could see, you save the frames where a face is detected, but you don't clip the audio based on whether a face is detected. In that case, how do you ensure time alignment of the audio and video streams for training?
If my assumption is wrong, could you please point me to the correct pre-processing code?
Thanks!
Nabarun
Hi, this is great work. I am trying to use my own dataset to reconstruct speech. The dataset consists of videos of medical images of the vocal organs, without human faces. Can you tell me how to save these frames without faces? Thanks a lot!
Hello all,
I'm trying to get the model to train on the same data provided by the authors. I was able to reproduce the results using the pre-trained weights, and everything worked fine. I am using Windows and running TensorFlow on a GPU.
The output I am getting is as follows:
Exception in thread background:
Traceback (most recent call last):
File "C:\Users\admin\Documents\temp\lib\threading.py", line 917, in _bootstrap_inner
self.run()
File "C:\Users\admin\Documents\temp\lib\threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\admin\Desktop\Lip2Wav-master\Lip2Wav-master\synthesizer\feeder.py", line 139, in _enqueue_next_train_group
examples = [self._get_next_example() for i in range(n * _batches_per_group)]
File "C:\Users\admin\Desktop\Lip2Wav-master\Lip2Wav-master\synthesizer\feeder.py", line 139, in <listcomp>
examples = [self._get_next_example() for i in range(n * _batches_per_group)]
File "C:\Users\admin\Desktop\Lip2Wav-master\Lip2Wav-master\synthesizer\feeder.py", line 194, in _get_next_example
input_data, mel_target = self.getitem()
File "C:\Users\admin\Desktop\Lip2Wav-master\Lip2Wav-master\synthesizer\feeder.py", line 172, in getitem
mel = np.load(os.path.join(os.path.dirname(img_name), 'mels.npz'))['spec'].T
File "C:\Users\admin\Documents\temp\lib\site-packages\numpy\lib\npyio.py", line 416, in load
fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'Dataset/chem//preprocessed/46\mels.npz'
The directories are named with numbers, so "46" points to the directory that holds all the images together with the original .wav file.
Please help! I have been stuck on this issue for a week now. :(
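The traceback shows the feeder loading `mels.npz` from the same directory as each image, so any frame directory without that file will fail in this way. A minimal sketch for listing such directories is below. It assumes frames are `.jpg` files under `<data_root>/preprocessed`; the `.jpg` extension and the function name are assumptions, while `mels.npz` and `preprocessed` come from the error message.

```python
from pathlib import Path


def dirs_missing_mels(data_root, mel_name="mels.npz"):
    """List directories that contain .jpg frames but no mel file."""
    pre = Path(data_root) / "preprocessed"
    if not pre.is_dir():
        return []
    # Every directory that directly holds at least one frame.
    frame_dirs = {p.parent for p in pre.rglob("*.jpg")}
    return sorted(str(d) for d in frame_dirs if not (d / mel_name).exists())


# Data root from the error message above.
for d in dirs_missing_mels("Dataset/chem"):
    print("missing mels.npz:", d)
```

If every directory is listed, the step that writes the mel spectrograms probably never ran or failed; if only a few are listed, rerunning preprocessing for just those videos may be enough.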
Nice work!
Could you please share the train, validation and unseen test splits for GRID and TCD-TIMIT used in your paper?
Does "unseen" mean unseen speakers or unseen sentences?
Do you also train one model for each speaker?
Thanks!
Hey there.
The preprocessing is working, but it has only generated 3.9 GB of audio files.
I think there should be images as well. (I tried on a Mac without a GPU, changing the device name from 'cuda' to 'cpu', and it does generate images.) On my desktop with an NVIDIA GPU, it only gives me .wav audio files, and I get a warning like:
“CUDA out of memory” in PyTorch Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 4.29 GiB already allocated; 10.12 MiB free; 4.46 GiB reserved in total by PyTorch)
Can someone help me fix the issue?
Hi. Would it be possible to produce speech from a real-time live video?
Whenever I try to resume training from the checkpoint I get this error.
c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.NotFoundError: FindFirstFile failed for: synthesizer/saved_models/logs-final/taco_pretrained : The system cannot find the path specified. ; No such process
I have an mp4 video of a person speaking with choppy sound.
Can you tell me where to put the mp4 and which script generates sound for the silent parts?
Thanks in advance.
Excellent work and I love that you used a synthesized voice for the project video. I spent a lot of time on Corentin's Real Time Voice Cloning project and noticed you used some of his code as your base.
Just wondering what you plan on doing from this point with the project. Are you going to continue working on it or have you already moved on to new projects?
Hello, NVIDIA's Tacotron 2 uses teacher forcing. I was wondering where it is in your code. Can you help me?
Hi!
It's really nice work! But I'm facing a problem while training:
When I run train.py with the command python train.py chem --data_root Dataset/chem/ --preset synthesizer/presets/chem.json, I get the ValueError 'Multi-valued hyperparameters cannot be empty: all_test_images.' I can't find a hyperparameter named 'all_test_images' anywhere in the code.
Am I assigning the wrong value to the run name? Should it be a directory like '/Dataset/chem', or an arbitrary name that we choose, such as 'chem'? I have tried training with different name values, such as 'chem', '/Dataset/chem', and '/Dataset/chem/preprocessed/', but none of them worked; they all report the same error.
Looking forward to your answer! Thanks a lot!
Hi All,
How are you?
Thank you for your wonderful and interesting work!
I'm trying to train Lip2Wav on LRS3 and wonder what steps I should follow (the correct directory structure, etc.).
Cheers,
I have a question. You trained 200k iterations with bs=32. How much time did you spend on it? My training is quite slow!
Thanks for your great work.
Could you share the GRID and TIMIT train and test splits, and the ASR model used to calculate the WER for these two datasets, so that future work can make fair comparisons?
Hi,
I am really happy to see your pioneering work! I am wondering whether I can download the Lip2Wav dataset. (The project website does not offer the URL.)
Really appreciate it, thank you~
First of all thank you for the project.
As I understand it, the project first does lip reading to produce text and then does text-to-speech with Tacotron.
I'm trying to get the text generated by the lip reading step. Is that possible?
Also, do I need transcripts of the speech in the videos to train on my own data?
Thank you.
Hi we are trying to replicate your work for a coursework project, wanted to know how long did it take you guys to
Hi, thanks for the great work.
When I test the pre-trained multi-speaker model on the LRW test set, I get STOI and ESTOI values similar to those quoted in the paper, but the best WER I can achieve is 79.6%, compared to the 34.2% in the paper.
Could you specify the steps you used to achieve 34.2% WER with Google ASR? Do you crop the synthesised word and use a specific Google ASR model/configuration? Do you use the entire LRW test dataset or just a subset?
It would be great to know for fair comparison of future research.
Thanks