
visualvoice's Introduction

VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

This repository contains the code for VisualVoice. [Project Page]


VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency
Ruohan Gao1,2 and Kristen Grauman1,2
1UT Austin, 2Facebook AI Research
In CVPR, 2021


If you find our data or project useful in your research, please cite:

@inproceedings{gao2021VisualVoice,
  title = {VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency},
  author = {Gao, Ruohan and Grauman, Kristen},
  booktitle = {CVPR},
  year = {2021}
}

Demo with the pre-trained models

  1. Download the pre-trained models (the demo command in step 3 loads them from a pretrained_models/ directory):
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/facial_best.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/lipreading_best.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/unet_best.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/av-speech-separation-model/vocal_best.pth
  2. Preprocess the demo video using the following commands, which convert the video to 25 fps, resample the audio to 16 kHz, and track the speakers with a simple implementation based on a face detector. Using a more advanced face tracker of your choice can lead to better separation results. A quick check of the preprocessed audio is sketched after the commands.
ffmpeg -i ./test_videos/interview.mp4 -filter:v fps=fps=25 ./test_videos/interview25fps.mp4
mv ./test_videos/interview25fps.mp4 ./test_videos/interview.mp4
python ./utils/detectFaces.py --video_input_path ./test_videos/interview.mp4 --output_path ./test_videos/interview/ --number_of_speakers 2 --scalar_face_detection 1.5 --detect_every_N_frame 8
ffmpeg -i ./test_videos/interview.mp4 -vn -ar 16000 -ac 1 -ab 192k -f wav ./test_videos/interview/interview.wav
python ./utils/crop_mouth_from_video.py --video-direc ./test_videos/interview/faces/ --landmark-direc ./test_videos/interview/landmark/ --save-direc ./test_videos/interview/mouthroi/ --convert-gray --filename-path ./test_videos/interview/filename_input/interview.csv
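Before running the separation step, you may want to confirm that the preprocessing produced what the model expects. A minimal check of the resampled audio, assuming the paths above and that SciPy is installed (this helper is illustrative and not part of the repository):
# sanity-check the preprocessed demo audio (illustrative, not part of this repo)
from scipy.io import wavfile

sr, audio = wavfile.read('./test_videos/interview/interview.wav')
print('sample rate:', sr)                                      # should be 16000 after the ffmpeg resampling step
print('channels:', 1 if audio.ndim == 1 else audio.shape[1])   # should be mono (-ac 1)
print('duration (s):', len(audio) / sr)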
  3. Use the downloaded pre-trained models to test on the demo video:
python testRealVideo.py \
--mouthroi_root ./test_videos/interview/mouthroi/ \
--facetrack_root ./test_videos/interview/faces/ \
--audio_path ./test_videos/interview/interview.wav \
--weights_lipreadingnet pretrained_models/lipreading_best.pth \
--weights_facial pretrained_models/facial_best.pth \
--weights_unet pretrained_models/unet_best.pth \
--weights_vocal pretrained_models/vocal_best.pth \
--lipreading_config_path configs/lrw_snv1x_tcn2x.json \
--num_frames 64 \
--audio_length 2.55 \
--hop_size 160 \
--window_size 400 \
--n_fft 512 \
--unet_output_nc 2 \
--normalization \
--visual_feature_type both \
--identity_feature_dim 128 \
--audioVisual_feature_dim 1152 \
--visual_pool maxpool \
--audio_pool maxpool \
--compression_type none \
--reliable_face \
--audio_normalization \
--desired_rms 0.7 \
--number_of_speakers 2 \
--mask_clip_threshold 5 \
--hop_length 2.55 \
--lipreading_extract_feature \
--number_of_identity_frames 1 \
--output_dir_root ./test_videos/interview/
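For reference, the audio and visual settings above are tied to each other: at a 16 kHz sampling rate and 25 fps video, a 2.55 s audio segment pairs with 64 mouth ROI frames, and the STFT settings determine the spectrogram size fed into the U-Net. A rough arithmetic sanity check (my own illustration, not code from the repository):
# how the flag values relate, assuming 16 kHz audio and 25 fps video
audio_sr, video_fps = 16000, 25
audio_length, hop_size, n_fft, num_frames = 2.55, 160, 512, 64

samples = int(audio_length * audio_sr)     # 40800 audio samples per segment
stft_frames = samples // hop_size + 1      # ~256 STFT time frames with a 10 ms hop
freq_bins = n_fft // 2 + 1                 # 257 frequency bins
print(samples, stft_frames, freq_bins)
print(num_frames / video_fps)              # 2.56 s of mouth ROIs, roughly matching the 2.55 s audio segment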

Dataset preparation for VoxCeleb2

  1. Download the VoxCeleb2 dataset. The pre-processed mouth ROIs can be downloaded as follows (a quick way to inspect the .h5 files is sketched after the directory layout below):
# mouth ROIs for VoxCeleb2 (train: 1T; val: 20G; seen_heard_test: 88G; unseen_unheard_test: 20G)
wget http://dl.fbaipublicfiles.com/VisualVoice/mouth_roi_train.tar.gz
wget http://dl.fbaipublicfiles.com/VisualVoice/mouth_roi_val.tar.gz
wget http://dl.fbaipublicfiles.com/VisualVoice/mouth_roi_seen_heard_test.tar.gz
wget http://dl.fbaipublicfiles.com/VisualVoice/mouth_roi_unseen_unheard_test.tar.gz

# Directory structure of the dataset:
#    ├── VoxCeleb2                          
#    │       └── [mp4]               (contains the face tracks in .mp4)
#    │                └── [train]
#    │                └── [val]
#    │                └── [seen_heard_test]
#    │                └── [unseen_unheard_test]
#    │       └── [audio]             (contains the audio files in .wav)
#    │                └── [train]
#    │                └── [val]
#    │                └── [seen_heard_test]
#    │                └── [unseen_unheard_test]
#    │       └── [mouth_roi]         (contains the mouth ROIs in .h5)
#    │                └── [train]
#    │                └── [val]
#    │                └── [seen_heard_test]
#    │                └── [unseen_unheard_test]
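If you are unsure what the mouth ROI .h5 files contain, a quick way to inspect one is to list its datasets rather than assuming a particular key name (the path below is a hypothetical example):
# list the datasets stored in a mouth ROI .h5 file (illustrative helper)
import h5py

with h5py.File('mouth_roi/val/SOME_ID/SOME_VIDEO/00001.h5', 'r') as f:   # hypothetical path
    f.visititems(lambda name, obj: print(name, getattr(obj, 'shape', None), getattr(obj, 'dtype', None)))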
  2. Download the hdf5 files that contain the data paths, and then edit each file so that the stored paths use the correct root prefix of your own setup (a minimal sketch follows the download commands):
wget http://dl.fbaipublicfiles.com/VisualVoice/hdf5/VoxCeleb2/train.h5
wget http://dl.fbaipublicfiles.com/VisualVoice/hdf5/VoxCeleb2/val.h5
wget http://dl.fbaipublicfiles.com/VisualVoice/hdf5/VoxCeleb2/seen_heard_test.h5
wget http://dl.fbaipublicfiles.com/VisualVoice/hdf5/VoxCeleb2/unseen_unheard_test.h5
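The downloaded .h5 files store paths rooted in the authors' environment, so the prefix needs to be swapped for your own. A minimal sketch, assuming each file holds datasets of byte-string paths (verify the key names first, e.g. with the inspection snippet above); OLD_PREFIX and NEW_PREFIX are placeholders for your own roots:
# rewrite the root prefix stored in a data-path .h5 file (sketch, not code from the repo)
import h5py

OLD_PREFIX = '/original/author/root/'   # placeholder: whatever prefix the downloaded file uses
NEW_PREFIX = '/YOUR_DATASET_PATH/'      # placeholder: your own dataset root

with h5py.File('hdf5/VoxCeleb2/train.h5', 'r+') as f:
    for key in list(f.keys()):
        paths = [p.decode() if isinstance(p, bytes) else str(p) for p in f[key][:]]
        fixed = [p.replace(OLD_PREFIX, NEW_PREFIX) for p in paths]
        del f[key]
        f.create_dataset(key, data=fixed)   # h5py stores a list of str as variable-length strings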

Training and Testing

(The code has been tested under the following system environment: Ubuntu 18.04.3 LTS, CUDA 10.0, Python 3.7.3, PyTorch 1.3.0, torchvision 0.4.1, face-alignment 1.2.0, librosa 0.7.0, av 8.0.3)

  1. Download the pre-trained cross-modal matching models as initialization (the training command below loads them from ./pretrained_models/cross-modal-pretraining/):
wget http://dl.fbaipublicfiles.com/VisualVoice/cross-modal-pretraining/facial.pth
wget http://dl.fbaipublicfiles.com/VisualVoice/cross-modal-pretraining/vocal.pth
  2. Use the following command to train the VisualVoice speech separation model:
python train.py \
--name exp \
--gpu_ids 0,1,2,3,4,5,6,7 \
--batchSize 128 \
--nThreads 32 \
--display_freq 10 \
--save_latest_freq 500 \
--niter 1 \
--validation_on True \
--validation_freq 200 \
--validation_batches 30 \
--num_batch 50000 \
--lr_steps 30000 40000 \
--coseparation_loss_weight 0.01 \
--mixandseparate_loss_weight 1 \
--crossmodal_loss_weight 0.01 \
--lr_lipreading 0.0001 \
--lr_facial_attributes 0.00001 \
--lr_unet 0.0001 \
--lr_vocal_attributes 0.00001 \
--num_frames 64 \
--audio_length 2.55 \
--hop_size 160 \
--window_size 400 \
--n_fft 512 \
--margin 0.5 \
--weighted_loss \
--visual_pool maxpool \
--audio_pool maxpool \
--optimizer adam \
--normalization \
--tensorboard True \
--mask_loss_type L2 \
--visual_feature_type both \
--unet_input_nc 2 \
--unet_output_nc 2 \
--compression_type none \
--mask_clip_threshold 5 \
--audioVisual_feature_dim 1152 \
--identity_feature_dim 128 \
--audio_normalization \
--lipreading_extract_feature \
--weights_facial ./pretrained_models/cross-modal-pretraining/facial.pth \
--weights_vocal ./pretrained_models/cross-modal-pretraining/vocal.pth \
--lipreading_config_path configs/lrw_snv1x_tcn2x.json \
--data_path hdf5/VoxCeleb2/ \
|& tee logs.txt
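The three *_loss_weight flags above balance the mask-prediction (mix-and-separate) loss against the cross-modal and co-separation consistency losses. A rough sketch of how the weighted objective is formed (variable names are illustrative, not taken from train.py):
# illustrative combination of the loss terms using the weights passed above
mixandseparate_loss_weight = 1.0    # --mixandseparate_loss_weight 1
coseparation_loss_weight = 0.01     # --coseparation_loss_weight 0.01
crossmodal_loss_weight = 0.01       # --crossmodal_loss_weight 0.01

def total_loss(mask_loss, coseparation_loss, crossmodal_loss):
    # weighted sum of the separation loss and the two triplet-based consistency losses
    return (mixandseparate_loss_weight * mask_loss
            + coseparation_loss_weight * coseparation_loss
            + crossmodal_loss_weight * crossmodal_loss)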
  3. Use the following command to test on a synthetic mixture:
python test.py \
--audio1_path /YOUR_DATASET_PATH/VoxCeleb2/audio/seen_heard_test/id06688/akPwstwDxjE/00023.wav \
--audio2_path /YOUR_DATASET_PATH/VoxCeleb2/audio/seen_heard_test/id08606/0o-ZBLLLjXE/00002.wav \
--mouthroi1_path /YOUR_DATASET_PATH/VoxCeleb2/mouth_roi/seen_heard_test/id06688/akPwstwDxjE/00023.h5 \
--mouthroi2_path /YOUR_DATASET_PATH/VoxCeleb2/mouth_roi/seen_heard_test/id08606/0o-ZBLLLjXE/00002.h5 \
--video1_path /YOUR_DATASET_PATH/VoxCeleb2/mp4/seen_heard_test/id06688/akPwstwDxjE/00023.mp4 \
--video2_path /YOUR_DATASET_PATH/VoxCeleb2/mp4/seen_heard_test/id08606/0o-ZBLLLjXE/00002.mp4 \
--num_frames 64 \
--audio_length 2.55 \
--hop_size 160 \
--window_size 400 \
--n_fft 512 \
--weights_lipreadingnet pretrained_models/lipreading_best.pth \
--weights_facial pretrained_models/facial_best.pth \
--weights_unet pretrained_models/unet_best.pth \
--weights_vocal pretrained_models/vocal_best.pth \
--lipreading_config_path configs/lrw_snv1x_tcn2x.json \
--unet_output_nc 2 \
--normalization \
--mask_to_use pred \
--visual_feature_type both \
--identity_feature_dim 128 \
--audioVisual_feature_dim 1152 \
--visual_pool maxpool \
--audio_pool maxpool \
--compression_type none \
--mask_clip_threshold 5 \
--hop_length 2.55 \
--audio_normalization \
--lipreading_extract_feature \
--number_of_identity_frames 1 \
--output_dir_root test 
  4. Evaluate the separation performance:
python evaluateSeparation.py --results_dir test/id06688_akPwstwDxjE_00023VSid08606_0o-ZBLLLjXE_00002
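evaluateSeparation.py reports the standard blind-source-separation metrics for the saved results. If you want a quick standalone check on a pair of reference and estimated tracks, here is a minimal sketch using mir_eval (assuming 16 kHz mono wav files of equal length; the file names are placeholders):
# standalone SDR/SIR/SAR check with mir_eval (sketch; evaluateSeparation.py is the official script)
import numpy as np
from scipy.io import wavfile
import mir_eval

def load(path):
    _, x = wavfile.read(path)
    return x.astype(np.float64)

refs = np.stack([load('gt_speaker1.wav'), load('gt_speaker2.wav')])   # placeholder reference files
ests = np.stack([load('speaker1.wav'), load('speaker2.wav')])         # placeholder separated outputs
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(refs, ests)
print('SDR:', sdr, 'SIR:', sir, 'SAR:', sar)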

Model Variants

  1. Audio-Visual speech separation model tailored to 2 speakers (with context): see subdirectory av-separation-with-context/.

  2. Audio-Visual speech enhancement code: see subdirectory av-enhancement.

Acknowledgements

Some of the code is borrowed or adapted from Co-Separation. The code for the lip analysis network is adapted from Lipreading using Temporal Convolutional Networks.

License

The majority of VisualVoice is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: license information for Lipreading using Temporal Convolutional Networks is available at https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks/blob/master/LICENSE.


visualvoice's Issues

Data structure

Hi @rhgao,
I'm trying to train it on a new dataset,
but I'm wondering what the structure of the h5 files is.
Thanks a lot!

error extracting mouth_roi_train.tar.gz

Hi @rhgao,

Thank you for releasing the code. When I perform "tar zxvf mouth_roi_train.tar.gz", it shows
"tar: Skipping to next header
tar: A lone zero block at 3218058299
tar: Exiting with failure status due to previous errors". The file size is 1155287823247 bytes. Is the file broken? Thanks.

Questions about testRealVideo.py

RuntimeError: Given groups=1, weight of size [512, 1152, 3, 3], expected input[1, 1792, 2, 64] to have 1152 channels, but got 1792 channels instead

Can you release speech enhancement models?

Hello!
Thanks for sharing this code with us.

When testing your two-speaker speech separation pre-trained models, I found that the model's performance deteriorates when extracting a specific single speaker. Only when I feed both speakers' mouth ROIs and faces into the model at the same time can I get a satisfactory separation result. I think this deterioration is caused by the separation model, not the enhancement model.

In a real scene, the number of speakers is unknown, and extracting only one specific person is needed. So can you provide a speech enhancement model for testing? Such as model structure or pre-trained model.

We would appreciate it if you could provide one.

Thanks again for your contribution.

num_classes

Hi, thank you for releasing the code. I wonder why you set the value of num_classes to 500? I read the code, paper, and supplementary file carefully, but I couldn't figure it out. Please advise. Thank you so much!

The pre-processed mouth ROIs

Hello, I would like to ask a question.

Regarding the mouth data in the dataset, it is stored as an h5 file.

Could you please explain how it was generated? Is there a pre-trained model available?

If I want to replace VoxCeleb2 with a different dataset, how can I generate the mouth h5 files?

Looking forward to your answer! Thank you very much!!

How to define the weight coefficient in mask loss?

Hello, could you please explain the meaning of the weights here?

This coefficient is not included in the paper, and I have found that this weight is not calculated in test.py.

    # calculate loss weighting coefficient
    if self.opt.weighted_loss:
        weight1 = torch.log1p(torch.norm(audio_mix_spec1[:,:,:-1,:], p=2, dim=1)).unsqueeze(1).repeat(1,2,1,1)
        weight1 = torch.clamp(weight1, 1e-3, 10)
        weight2 = torch.log1p(torch.norm(audio_mix_spec2[:,:,:-1,:], p=2, dim=1)).unsqueeze(1).repeat(1,2,1,1)
        weight2 = torch.clamp(weight2, 1e-3, 10)
    else:
        weight1 = None
        weight2 = None

Where can I get vocal_best.pth and facial_best.pth for the speech enhancement model?

I recently tried av-enhancement and found that the provided pretrained models only include the classifier, identity, unet, and lipreading_best.pth models.
I could not find the vocal_best.pth and facial_best.pth pretrained models, so I tried to use the ones from the original repository, but the result was not as good as what the demo video shows.
Could you please add both pretrained models, or tell me how to solve my problem?
Thank you so much for your help.

Provide face landmarks

Hi, thanks for your great work.
Could you please release the face landmarks for VoxCeleb2? The mouth ROI files are too large to download. If the landmarks are provided, we can crop the mouth area on our own machines and save a lot of time.
Thank you!

Data set classification

Hello, I recently wanted to try running your code, but after finding and downloading the VoxCeleb2 dataset, the dataset comes as one whole. How should I match the dataset with the pre-processed mouth ROIs you provide?

How to separate the audio and mp4 directories

The mouth_roi dataset is split into train, val, seen_heard_test, and unseen_unheard_test directories, but the raw audio and mp4 datasets only contain train and test directories. How can I extract the seen and unseen splits from the raw dataset? Thanks very much.

Why is num_frames 64 and not 75 or some other number?

Thanks for your great work.
I was curious about the num_frames parameter. Why do we only take 64 frames of mouth ROI for 2.55 seconds? Are the last ~10 frames discarded? I can't figure it out. Thanks again.

Variants of Pretrained Model

Nice work and impressive results. From the ablation study in your paper, I saw some variants of your model that take only static faces (identity features) or only lip motion features as the visual signal.

Personally, I am interested in those ablation models. I wonder whether I could ask for the pre-trained weights of the static-face version. I assume the weights you currently provide are for the full model; if I change the configuration to use only identity features as input, the current U-Net weights do not support that feature input.
When I try to zero out the lip motion feature and run a demo video, i.e. visual_feature = torch.cat((identity_feature, lipreading_feature * 0), dim=1), it does not work.

Question about the Audio-Only [79] baseline in Table 1 of "VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency"

Dear Dr. Gao:
Hi!
Thank you for your excellent work on the paper "VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency". I would like to ask about the experimental section. Do you have the code for Audio-Only [79] in Table 1? When I remove the visual information from the VisualVoice code and separate with audio only, I cannot reach the Audio-Only [79] result in Table 1. If you have the Audio-Only [79] code, could you send it to me so I can learn from it? I would appreciate any help, thanks!

About audio files

Hi, I downloaded the dataset and all the audio files are in m4a format.
But the code in audioVisual_dataset uses "wavfile.read()" directly.
Does that mean I have to convert the audio files from m4a to wav myself?
Or am I missing something important?

Why can't the SDR performance reported in the paper be reproduced?

I've trained the model for up to 200 thousand epochs, but the SDR is only 7.4 on the unseen_unheard_test set. I wonder how long the model was trained for the paper. Due to the limited number of GPUs, the paper's configuration can't be used. Any advice would help me.

My training config is below:
--gpu_ids 0,1 \
--batchSize 10 \
--nThreads 16 \
--decay_factor 0.5 \
--num_batch 400000 \
--lr_steps 40000 80000 120000 160000 200000 \
--coseparation_loss_weight 0.01 \
--mixandseparate_loss_weight 1 \
--crossmodal_loss_weight 0.01 \
--lr_lipreading 0.0001 \
--lr_facial_attributes 0.00001 \
--lr_unet 0.0001 \
--lr_vocal_attributes 0.00001 \
--

landmarks

Hi, thanks for your great work. I am confused about which face alignment algorithm you used and how many landmarks it outputs.

About 1/10 of the data in mouth_roi_train is missing

I think the numbers of files in mp4, audio, and mouth_roi for the train split should be identical.
When I checked, there are 1092003 videos in mp4/train and audio/train, but 84730 of them are missing from mouth_roi/train. There is no .h5 file with the same name in mouth_roi/train, for example:

video_path: mp4/train/id04262/96JSsr9Q00k/00009.mp4
mouthroi_path: mouth_roi/train/id04262/96JSsr9Q00k/00009.h5
audio_path: audio/train/id04262/96JSsr9Q00k/00009.wav

video_path: mp4/train/id04262/PX8fGdzDlEs/00011.mp4
mouthroi_path: mouth_roi/train/id04262/PX8fGdzDlEs/00011.h5
audio_path: audio/train/id04262/PX8fGdzDlEs/00011.wav

Questions about the network structure

I found from the code that you use a visual network and an audio network with shared parameters for the visual and auditory features of the two speakers. But I'm not sure whether my finding is correct, as it doesn't seem to be stated in the paper.

mouth_roi_train

Hi,
after I downloaded mouth_roi_train.tar.gz, I encountered a problem while decompressing it. I'm not sure whether it is a data problem or an operation problem. The other files can be decompressed correctly. Could you please take a look? Thank you.

$ tar -xf mouth_roi_train.tar.gz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
$ tar -zxf mouth_roi_train.tar.gz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
$ ll -h
total 1.1T
-rw-rw-r-- 1 **** **** 1.1T Jan 20 11:59 mouth_roi_train.tar.gz
$ file mouth_roi_train.tar.gz
mouth_roi_train.tar.gz: data

About the crossmodal and coseparation loss functions

Thank you very much for your excellent work.
One problem that confuses me is the definition of the crossmodal and coseparation loss functions. In train.py, why are a random number and opt.gt_percentage used to select which audio feature (audio_embedding_A1_pred or audio_embedding_A1_gt) is used? According to the method in the paper, shouldn't the predicted features be used?

def get_coseparation_loss(output, opt, loss_triplet):
    if random.random() > opt.gt_percentage:
        audio_embeddings_A1 = output['audio_embedding_A1_pred']
        audio_embeddings_A2 = output['audio_embedding_A2_pred']
        audio_embeddings_B1 = output['audio_embedding_B1_pred']
        audio_embeddings_B2 = output['audio_embedding_B2_pred']
    else:
        audio_embeddings_A1 = output['audio_embedding_A1_gt']
        audio_embeddings_A2 = output['audio_embedding_A2_gt']
        audio_embeddings_B1 = output['audio_embedding_B_gt']
        audio_embeddings_B2 = output['audio_embedding_B_gt']

    coseparation_loss = loss_triplet(audio_embeddings_A1, audio_embeddings_A2, audio_embeddings_B1) + loss_triplet(audio_embeddings_A1, audio_embeddings_A2, audio_embeddings_B2)
    return coseparation_loss

def get_crossmodal_loss(output, opt, loss_triplet):
    identity_feature_A = output['identity_feature_A']
    identity_feature_B = output['identity_feature_B']
    if random.random() > opt.gt_percentage:
        audio_embeddings_A1 = output['audio_embedding_A1_pred']
        audio_embeddings_A2 = output['audio_embedding_A2_pred']
        audio_embeddings_B1 = output['audio_embedding_B1_pred']
        audio_embeddings_B2 = output['audio_embedding_B2_pred']
    else:
        audio_embeddings_A1 = output['audio_embedding_A1_gt']
        audio_embeddings_A2 = output['audio_embedding_A2_gt']
        audio_embeddings_B1 = output['audio_embedding_B_gt']
        audio_embeddings_B2 = output['audio_embedding_B_gt']

    crossmodal_loss = loss_triplet(audio_embeddings_A1, identity_feature_A, identity_feature_B) + loss_triplet(audio_embeddings_A2, identity_feature_A, identity_feature_B) + loss_triplet(audio_embeddings_B1, identity_feature_B, identity_feature_A) + loss_triplet(audio_embeddings_B2, identity_feature_B, identity_feature_A)
    return crossmodal_loss

Speech enhancement evaluation

Hello, thanks for your great work.

I've been trying to reproduce the enhancement performance on the VoxCeleb2 test set, but the performance of the given pre-trained model was much lower than in the paper.
(I used evaluateSeparation.py from the main directory to evaluate the metrics.)

And when I tried test_synthetic_script.sh, the outputs sounded bad to my ears.
The off-screen noise in the mixture (audio_mixed.wav) sounded much louder than the voices, so I felt that the enhancement would be too difficult for the model.

I have 2 questions regarding this.

  1. Is the pre-trained model in the av-enhancement directory your best model for speech enhancement, not separation?
  2. Is your evaluation done on a mixture of two speech signals and off-screen noise with weight 1?
    Isn't it too difficult for the model to separate and enhance at the same time?

Thanks in advance.

Ask for help about the net_vocal

Hello, I have been investigating the effect of net_vocal_attributes in the whole model framework.

At present, with embeddings extracted from the predicted sounds, the distance of the negative sample pair (audio_embedding_A1_pred and audio_embedding_B1_pred) can reach 2, and the distance of the positive sample pair (audio_embedding_A1_pred and audio_embedding_A2_pred) can reach about 0.

But after I changed the input of net_vocal to the pure ground-truth sound, the distance between negative sample pairs (audio_embedding_A1_gt and audio_embedding_B_gt) can only reach 1. That is to say, the sound feature extraction is not good when I train net_vocal alone.

It stands to reason that features should be easier to extract from pure ground-truth voices than from predicted voices. I modified the training parameters (batch size, learning rate, etc.), but none of that solved the problem. May I know what the reason is?

Looking forward to your reply!

Difficulty getting the code to work

I'm having difficulty getting the code to work and need assistance.
Can anyone provide a guide for installation, including a reliable requirements file?

Thank you very much!

Error when running inference on the test demo video

When I run inference on the test demo video, I get the error below,
and I traced it to "audioVisual_feature = torch.cat((visual_feat, audio_conv8feature), dim=1)":

RuntimeError: Given groups=1, weight of size [512, 1152, 3, 3], expected input[1, 1792, 2, 64] to have 1152 channels, but got 1792 channels instead

Multi-channel source separation

Hello,

I am interested in using your model for a project focused on multi-channel source separation. I was wondering if you could provide any guidance, best practices, or documentation that would help me get started effectively.

Requirements File

Hello,
thanks for your work! I'm facing issues because of Python library versions.
Could you please share a requirements.txt file? Thanks!
Best, Yos
