
lipreading_using_temporal_convolutional_networks's Introduction

Lipreading using Temporal Convolutional Networks


Authors

Pingchuan Ma, Brais Martinez, Yujiang Wang, Stavros Petridis, Jie Shen, Maja Pantic.

Update

2022-09-09: We have released our DC-TCN models, see here.

2021-06-09: We have released our official training code, see here.

2020-12-08: We have released our audio-only models, see here.

Content

Deep Lipreading

Model Zoo

Citation

License

Contact

Deep Lipreading

Introduction

This is the repository for Training Strategies For Improved Lip-reading, Towards Practical Lipreading with Distilled and Efficient Models and Lipreading using Temporal Convolutional Networks. In this repository, we provide training code, pre-trained models, and network settings for end-to-end visual speech recognition (lipreading). We trained our model on LRW. The network architecture is based on 3D convolution, ResNet-18, and MS-TCN.

By using this repository, you can achieve a performance of 89.6% on the LRW dataset. This repository also provides a script for feature extraction.
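To make the pipeline shape concrete, the following is a schematic sketch rather than the repository's exact model: a 3D convolutional front-end over grayscale mouth-ROI clips, a per-frame ResNet-18 trunk, and a small dilated 1D convolution stack standing in for the MS-TCN head. The layer sizes, the torchvision trunk, and the pooling over time are illustrative assumptions only.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class LipreadingSketch(nn.Module):
    def __init__(self, num_classes=500):
        super().__init__()
        # 3D convolution over (batch, 1, time, 96, 96) grayscale mouth-ROI clips
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        trunk = resnet18(weights=None)  # torchvision >= 0.13; older versions use pretrained=False
        # drop the 2D stem (replaced by the 3D front-end) and the ImageNet classifier
        self.trunk = nn.Sequential(*list(trunk.children())[4:-1])
        # stand-in temporal model: two dilated 1D convolutions instead of the real MS-TCN
        self.temporal = nn.Sequential(
            nn.Conv1d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(512, 512, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):                          # x: (B, 1, T, 96, 96)
        x = self.frontend(x)                       # (B, 64, T, 24, 24)
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.trunk(x).flatten(1)               # (B*T, 512) per-frame features
        x = x.reshape(b, t, -1).transpose(1, 2)    # (B, 512, T)
        x = self.temporal(x).mean(dim=2)           # average over the time axis
        return self.classifier(x)                  # (B, num_classes) word logits

logits = LipreadingSketch()(torch.randn(2, 1, 29, 96, 96))  # 29 frames per LRW clip
print(logits.shape)                                         # torch.Size([2, 500])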

Preprocessing

As described in our paper, each video sequence from the LRW dataset is processed by 1) performing face detection and face alignment, 2) aligning each frame to a reference mean face shape, 3) cropping a fixed 96 × 96 pixel ROI from the aligned face image so that the mouth region is always roughly centred in the crop, and 4) converting the cropped image to grayscale.

You can run the pre-processing script provided in the preprocessing folder to extract the mouth ROIs.

[Pipeline illustration: 0. Original → 1. Detection → 2. Transformation → 3. Mouth ROIs]
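The per-frame crop can be sketched with the helpers from preprocessing/transform.py, which also appear in the webcam example near the end of this page (warp_img, cut_patch, and the 20words_mean_face.npy reference shape). The input frame and its 68-point landmarks below are placeholders; the batch script above handles whole videos and their pre-computed landmark files.

import cv2
import numpy as np
from preprocessing.transform import warp_img, cut_patch

STD_SIZE = (256, 256)
STABLE_PNTS_IDS = [33, 36, 39, 42, 45]   # stable points used for alignment to the mean face
START_IDX, STOP_IDX = 48, 68             # mouth landmarks in the 68-point annotation
CROP_WIDTH = CROP_HEIGHT = 96

mean_face_landmarks = np.load('preprocessing/20words_mean_face.npy')
frame = cv2.imread('example_frame.png')          # placeholder: one video frame (BGR)
landmarks = np.load('example_landmarks.npy')     # placeholder: 68x2 landmarks for that frame

# 2) align the frame to the reference mean face shape
warped_frame, trans = warp_img(landmarks[STABLE_PNTS_IDS, :],
                               mean_face_landmarks[STABLE_PNTS_IDS, :],
                               frame, STD_SIZE)
warped_landmarks = trans(landmarks)
# 3) crop a fixed 96 x 96 ROI roughly centred on the mouth
roi = cut_patch(warped_frame, warped_landmarks[START_IDX:STOP_IDX],
                CROP_HEIGHT // 2, CROP_WIDTH // 2)
# 4) convert the crop to grayscale
roi_gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)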

How to install environment

  1. Clone the repository into a directory. We refer to that directory as TCN_LIPREADING_ROOT.
git clone --recursive https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks.git
  2. Install all required packages.
pip install -r requirements.txt

How to prepare dataset

  1. Download a pre-trained model from Model Zoo and put the model into the $TCN_LIPREADING_ROOT/models/ folder.

  2. For audio-only experiments, please pre-process audio waveforms using the script extract_audio_from_video.py in the preprocessing folder and save them to $TCN_LIPREADING_ROOT/datasets/audio_data/.

  3. For VSR benchmarks reported in Table 1, please download our pre-computed landmarks from GoogleDrive or BaiduDrive (key: m00k) and unzip them to the $TCN_LIPREADING_ROOT/landmarks/ folder. Then pre-process the mouth ROIs using the script crop_mouth_from_video.py in the preprocessing folder and save them to $TCN_LIPREADING_ROOT/datasets/visual_data/.

  4. For VSR benchmarks reported in Table 2, please download our pre-computed landmarks from GoogleDrive or BaiduDrive (key: kumy) and unzip them to the $TCN_LIPREADING_ROOT/landmarks/ folder. Then pre-process the mouth ROIs using the script crop_mouth_from_video.py in the legacy_preprocessing folder and save them to $TCN_LIPREADING_ROOT/datasets/visual_data/ (a quick layout sanity check is sketched below).
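As a quick sanity check after preprocessing, the sketch below (my own helper, not part of the repository) walks $TCN_LIPREADING_ROOT/datasets/visual_data/ and counts the .npz files per word and partition, matching the WORD/{train,val,test}/WORD_XXXXX.npz layout shown in the issues further down this page.

from pathlib import Path

data_dir = Path('datasets/visual_data')   # i.e. $TCN_LIPREADING_ROOT/datasets/visual_data/
for word_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
    counts = {split: len(list((word_dir / split).glob('*.npz'))) if (word_dir / split).is_dir() else 0
              for split in ('train', 'val', 'test')}
    print(word_dir.name, counts)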

How to train

  1. Train a visual-only model.
CUDA_VISIBLE_DEVICES=0 python main.py --modality video \
                                      --config-path <MODEL-JSON-PATH> \
                                      --annonation-direc <ANNONATION-DIRECTORY> \
                                      --data-dir <MOUTH-ROIS-DIRECTORY>
  2. Train an audio-only model.
CUDA_VISIBLE_DEVICES=0 python main.py --modality audio \
                                      --config-path <MODEL-JSON-PATH> \
                                      --annonation-direc <ANNONATION-DIRECTORY> \
                                      --data-dir <AUDIO-WAVEFORMS-DIRECTORY>

We refer to the original LRW directory, which contains the timestamp (.txt) files, as <ANNONATION-DIRECTORY>.

  3. Resume from the last checkpoint.

You can pass the checkpoint path (.pth or .pth.tar) <CHECKPOINT-PATH> to the argument --model-path, and set --init-epoch to 1 to resume training.
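For reference, here is a small sketch of what is inside a saved checkpoint. The 'model_state_dict' key is the one the webcam example near the end of this page reads from the released .pth.tar files; any other keys are assumptions and may differ.

import torch

checkpoint = torch.load('models/lrw_resnet18_mstcn_adamw_s3.pth.tar', map_location='cpu')
print(list(checkpoint.keys()))               # expect at least 'model_state_dict'
state_dict = checkpoint['model_state_dict']  # the weights restored when you pass --model-path
print(len(state_dict), 'parameter tensors saved')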

How to test

You need to specify <ANNONATION-DIRECTORY> if you use a model that utilises word boundary indicators.

  1. Evaluate the visual-only performance (lipreading).
CUDA_VISIBLE_DEVICES=0 python main.py --modality video \
                                      --config-path <MODEL-JSON-PATH> \
                                      --model-path <MODEL-PATH> \
                                      --data-dir <MOUTH-ROIS-DIRECTORY> \
                                      --test
  2. Evaluate the audio-only performance.
CUDA_VISIBLE_DEVICES=0 python main.py --modality audio \
                                      --config-path <MODEL-JSON-PATH> \
                                      --model-path <MODEL-PATH> \
                                      --data-dir <AUDIO-WAVEFORMS-DIRECTORY> \
                                      --test

How to extract embeddings

We assume you have cropped the mouth patches and put them into <MOUTH-PATCH-PATH>. The mouth embeddings will be saved in .npz format; a short loading sketch follows the command below.

  • To extract 512-D feature embeddings from the top of ResNet-18:
CUDA_VISIBLE_DEVICES=0 python main.py --modality video \
                                      --extract-feats \
                                      --config-path <MODEL-JSON-PATH> \
                                      --model-path <MODEL-PATH> \
                                      --mouth-patch-path <MOUTH-PATCH-PATH> \
                                      --mouth-embedding-out-path <OUTPUT-PATH>
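A small sketch (my own, with a placeholder file name) for inspecting the saved embeddings; since the array keys inside the .npz files are not documented here, they are simply enumerated.

import numpy as np

with np.load('<OUTPUT-PATH>/ABOUT_00001.npz') as data:   # placeholder output file
    for key in data.files:
        print(key, data[key].shape)   # expect 512-D ResNet-18 features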

Model Zoo

Table 1. Results of the audio-only and visual-only models on LRW. Mouth patches and audio waveforms are extracted with the scripts in the preprocessing folder.

Architecture                    Acc.   URL                                      Size (MB)
Audio-only
resnet18_dctcn_audio_boundary   99.2   GoogleDrive or BaiduDrive (key: w3jh)    173
resnet18_dctcn_audio            99.1   GoogleDrive or BaiduDrive (key: hw8e)    173
resnet18_mstcn_audio            98.9   GoogleDrive or BaiduDrive (key: bnhd)    111
Visual-only
resnet18_dctcn_video_boundary   92.1   GoogleDrive or BaiduDrive (key: jb7l)    201
resnet18_dctcn_video            89.6   GoogleDrive or BaiduDrive (key: f3hd)    201
resnet18_mstcn_video            88.9   GoogleDrive or BaiduDrive (key: 0l63)    139
Table 2. Results of the visual-only models on LRW. Mouth patches are extracted with the scripts in the legacy_preprocessing folder.

Architecture      Acc.   URL                                      Size (MB)
Visual-only
snv1x_dsmstcn3x   85.3   GoogleDrive or BaiduDrive (key: 86s4)    36
snv1x_tcn2x       84.6   GoogleDrive or BaiduDrive (key: f79d)    35
snv1x_tcn1x       82.7   GoogleDrive or BaiduDrive (key: 3caa)    15
snv05x_tcn2x      82.5   GoogleDrive or BaiduDrive (key: ej9e)    32
snv05x_tcn1x      79.9   GoogleDrive or BaiduDrive (key: devg)    11

Citation

If you find this code useful in your research, please consider citing the following papers:

@INPROCEEDINGS{ma2022training,
  author={Ma, Pingchuan and Wang, Yujiang and Petridis, Stavros and Shen, Jie and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Training Strategies for Improved Lip-Reading},
  year={2022},
  pages={8472-8476},
  doi={10.1109/ICASSP43922.2022.9746706}
}

@INPROCEEDINGS{ma2021lip,
  title={Lip-reading with densely connected temporal convolutional networks},
  author={Ma, Pingchuan and Wang, Yujiang and Shen, Jie and Petridis, Stavros and Pantic, Maja},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  pages={2857-2866},
  year={2021},
  doi={10.1109/WACV48630.2021.00290}
}

@INPROCEEDINGS{ma2020towards,
  author={Ma, Pingchuan and Martinez, Brais and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Towards Practical Lipreading with Distilled and Efficient Models},
  year={2021},
  pages={7608-7612},
  doi={10.1109/ICASSP39728.2021.9415063}
}

@INPROCEEDINGS{martinez2020lipreading,
  author={Martinez, Brais and Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Lipreading Using Temporal Convolutional Networks},
  year={2020},
  pages={6319-6323},
  doi={10.1109/ICASSP40776.2020.9053841}
}

License

It is noted that the code can only be used for comparative or benchmarking purposes. The code is supplied under the License for non-commercial purposes only.

Contact

Pingchuan Ma: pingchuan.ma16[at]imperial.ac.uk

lipreading_using_temporal_convolutional_networks's People

Contributors

mapleandfire, mpc001


lipreading_using_temporal_convolutional_networks's Issues

DC-TCN number of parameters and Hardest words list

Hi,
I have questions about the DC-TCN and MS-TCN papers.

  1. Could you provide the number of parameters for the four settings in Table 2 of the DC-TCN paper?
    If possible, please let me know the FLOPs as well.

  2. In the second-page footnote of the "LIPREADING USING TEMPORAL CONVOLUTIONAL NETWORKS" paper, it is mentioned that the list of “hardest words” is obtained from [10]. However, I couldn't find the list on GitHub or in the paper. Could you provide the list of the 50 hardest classes that your paper mentions?

Thank you.

Testing on Video (.mp4) file

Hi,

May I ask how I can test the pretrained model (resnet18_mstcn(adamw)) on a video file? I have attempted to do so, however the predictions are quite bad; I think I might be doing it wrongly. Any advice would be appreciated, thank you! :)

What is the use of annonation?

When I want to train on my own data, an error occurs saying that I need to provide the annotation directory,
but I think the LRW annotations just include some information that doesn't help training.
Must I have the annotations? Can I skip this step?
Thank you very much.

Test error

Hi, thanks for the code!

I want to do a test with your pretrained model on the LRW dataset.

Firstly, I ran crop_mouth_from_video.py and got many .npz files under “$TCN_LIPREADING_ROOT/datasets/visual_data/”.
Secondly, I ran:

"""
CUDA_VISIBLE_DEVICES=0 python main.py
--config-path “$TCN_LIPREADING_ROOT/configs/lrw_resnet18_mstcn.json”
--model-path “$TCN_LIPREADING_ROOT/models/lrw_resnet18_mstcn_adamw_s3.pth.tar”
--data-dir “$TCN_LIPREADING_ROOT/datasets/visual_data/”
--test
"""

But I got an error in

$TCN_LIPREADING_ROOT/lipreading/preprocess.py, line 87: frames is actually of shape (29, 96, 96, 3).

What is wrong in my procedure?

With the same data, why are the results so different for MS-TCN and DC-TCN?

Hello, building on your code, I collected new data for training (there are 13 classes, and each class has dozens of samples).
With MS-TCN I got a nice accuracy, but when I tried your newest model (DC-TCN), an issue happened.
I find that with the same training data, training also looks fine, with accuracy up to 90%+. But when I test the model, whatever input I give it, I get the same output and even the same confidence.
Can you help me?
Thank you.

Testing on personal dataset

Hi, how can I test the model on personal data? For example, I have a video and I want to use the model to run inference on it.

UnboundLocalError: local variable 'tcn_options' referenced before assignment

While evaluating the module the code always crashes due to this error,
UnboundLocalError: local variable 'tcn_options' referenced before assignment
as it requires the tcn_options variable to be global.
How can i fix this please ?
def get_model():
    if os.path.exists(args.config_path):
        args_loaded = load_json(args.config_path)
        args.backbone_type = args_loaded['backbone_type']
        args.width_mult = args_loaded['width_mult']
        args.relu_type = args_loaded['relu_type']
        tcn_options = {
            'num_layers': args_loaded['tcn_num_layers'],
            'kernel_size': args_loaded['tcn_kernel_size'],
            'dropout': args_loaded['tcn_dropout'],
            'dwpw': args_loaded['tcn_dwpw'],
            'width_mult': args_loaded['tcn_width_mult'],
        }
    return Lipreading(
        num_classes=args.num_classes,
        tcn_options=tcn_options,
        backbone_type=args.backbone_type,
        relu_type=args.relu_type,
        width_mult=args.width_mult,
        extract_feats=args.extract_feats).cuda()

Can we do training using CPU?

Hi, thanks for the training code.
Can we train the model using the CPU only?
What changes do we need to make in that case?

Test pretrained models

Hi, may I ask if I am able to test the pre-trained models available in the Model Zoo? And if so, could you advise where I am able to obtain the model in JSON format as input to ? Thank you!

Error in pre-processing - help

!python crop_mouth_from_video.py --video-direc "../data/news.mp4" \
    --landmark-direc "../landmarks/LRW_landmarks" \
    --save-direc "../datasets" \
    --convert-gray \
    --testset-only

idx: 0 Processing. ABOUT/test/ABOUT_00001
Traceback (most recent call last):
File "crop_mouth_from_video.py", line 157, in
assert sequence is not None, "cannot crop from {}.".format(filename)
AssertionError: cannot crop from ABOUT/test/ABOUT_00001.

How to run this model on 2 GPUs?

Hi, thank you for your great work on lipreading.
I want to run this model on 2 GPUs, and I used nn.DataParallel to do it. But there are some problems in the function _average_batch() in the model.py file; the error is as follows:
IndexError: index 16 is out of bounds for dimension 0 with size 16.
Do you meet this problem?
Thank you very much! @mpc001

Extract Embedding

CUDA_VISIBLE_DEVICES=0 python main.py --extract-feats
--config-path
--model-path
--mouth-patch-path
--mouth-embedding-out-path
I tried jpg, npy, and npz for the image but failed to extract embeddings of the mouth patches. May I know what type of file it expects in order to extract the 512-D embedding?

ShuffleNet's Parameter

Hi, thanks for your work.
In 'shufflenetv2.py', I see that 'Input size needs to be divisible by 32', such as 96... so do we do this only to make sure that nn.AvgPool2d(int(input_size/32)) works?
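For context, a tiny illustration of the arithmetic behind this question (my own, with an assumed final channel count): a 96 x 96 input divided by the network's overall stride of 32 leaves a 3 x 3 feature map, which nn.AvgPool2d(int(input_size / 32)) collapses to 1 x 1 before the temporal model.

import torch
import torch.nn as nn

input_size = 96
feature_map = torch.randn(1, 1024, input_size // 32, input_size // 32)   # 1024 channels assumed
pooled = nn.AvgPool2d(int(input_size / 32))(feature_map)
print(pooled.shape)   # torch.Size([1, 1024, 1, 1])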

Extract landmarks on videos outside the LRW dataset

Hi,

Thanks for providing the code and pre-trained models. Currently, I would like to evaluate the model on some videos in the wild, instead of the testing set provided by the LRW dataset. Would you please tell me how to extract landmarks from my own videos? Thanks.

May I know how to do sentence-level lip reading?

Hi,

Thank you very much for sharing your great work! Currently, the pretrained model does lipreading at the word level. If I am given a sentence, how do I do lipreading? How can I segment the sentence into words? Thank you very much!

Extract mean face on my own dataset

Hi, thanks for your code! After reading the paper, I am not quite sure about the meaning of the mean face parameter, and if I want to use my own data, how do I extract the mean face parameter before training?

I don't have a json model

Hello,
probably a stupid question, but in the main function I want to use the pretrained model (.pth.tar format) since I don't have a model in JSON. Do I necessarily have to use a JSON config in order to start the training? If not, do you know any way to fix it?
Thank you

line 169, in get_model_from_json
assert args.config_path.endswith('.json') and os.path.isfile(args.config_path),
AttributeError: 'NoneType' object has no attribute 'endswith'
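For what it's worth, the keys a config JSON needs can be read off the load_model() helper in the webcam example near the end of this page; the sketch below writes such a file with placeholder values only, and the released configs under configs/ (e.g. configs/lrw_resnet18_mstcn.json) are the ones to actually use.

import json

config = {
    'backbone_type': 'resnet',   # placeholder values throughout; only the key names
    'width_mult': 1.0,           # are taken from load_model() in the webcam example
    'relu_type': 'relu',
    'tcn_num_layers': 4,
    'tcn_kernel_size': [3, 5, 7],
    'tcn_dropout': 0.2,
    'tcn_dwpw': False,
    'tcn_width_mult': 1,
}
with open('my_config.json', 'w') as fp:
    json.dump(config, fp, indent=4)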

Path problem during preprocessing

I'm typing the following at the terminal:
(abc) E:\TCN_LIPREADING_ROOT\TCN_LIPREADING_ROOT\preprocessing>
python crop_mouth_from_video.py --video-direc E:/TCN_LIPREADING_ROOT/TCN_LIPREADING_ROOT/landmarks/LRW_landmarks/ --landmark-direc E:/TCN_LIPREADING_ROOT/TCN_LIPREADING_ROOT/landmarks/ --save-direc E:/TCN_LIPREADING_ROOT/TCN_LIPREADING_ROOT/datasets/visual_data/
idx: 0 Processing. ABOUT/test/ABOUT_00001
Traceback (most recent call last):
File "crop_mouth_from_video.py", line 139, in
assert os.path.isfile(video_pathname), "File does not exist. Path input: {}".format(video_pathname)
AssertionError: File does not exist. Path input: E:/TCN_LIPREADING_ROOT/TCN_LIPREADING_ROOT/landmarks/LRW_landmarks/ABOUT/test/ABOUT_00001.mp4
I don't know why

IndexError: list index out of range

Hi,
I'm trying to run main.py and I get an IndexError.
My datasets folder contains only the ABOUT folder (npz format) due to memory restrictions, and my labels folder contains only the provided txt file with only the word ABOUT inside. When I run it I get the following error. I tried to figure it out with debugging but no luck. Have you faced an issue like this one before?

Thank you

To be more specific :

Model and log being saved in: D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\train_logs\tcn
Traceback (most recent call last):
File "D:/new_lipreading/Lipreading_using_Temporal_Convolutional_Networks/main.py", line 260, in
main()
File "D:/new_lipreading/Lipreading_using_Temporal_Convolutional_Networks/main.py", line 202, in main
dset_loaders = get_data_loaders(args)
File "D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\lipreading\dataloaders.py", line 52, in get_data_loaders
) for partition in ['train', 'val', 'test']}
File "D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\lipreading\dataloaders.py", line 52, in
) for partition in ['train', 'val', 'test']}
File "D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\lipreading\dataset.py", line 30, in init
self.load_dataset()
File "D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\lipreading\dataset.py", line 39, in load_dataset
self._get_files_for_partition()
File "D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\lipreading\dataset.py", line 76, in _get_files_for_partition
self._data_files = [ f for f in self._data_files if f.split('/')[self.label_idx] in self._labels ]
File "D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\lipreading\dataset.py", line 76, in
self._data_files = [ f for f in self._data_files if f.split('/')[self.label_idx] in self._labels ]
IndexError: list index out of range

Process finished with exit code 1

Reproduce on LRW1000: only get 38.6, how can I get 41.4?

Hi, I have reproduced the results on LRW1000.
When I reproduce with batch size = 32, lr = 1e-3, and the same optimizer:
for BiGRU I can get 38.4
for MS-TCN I can get 38.6

How can I get 41.4? Can you share your training parameters?

When reproducing the other paper "LEARN AN EFFECTIVE LIP READING MODEL WITHOUT PAINS":
for BiGRU I can get 57.68 (the paper result is 55.7)
for MS-TCN I can get 55.49

About variable length augmentation

Hello, when I read your paper, I noticed that you use variable-length augmentation in this model.

But what part of the code implements variable-length augmentation?
If I want it to be more variable (not just deleting 0-5 frames, but adding or deleting dozens of frames), do you think that is feasible in your code?

Thank you very much.

How to train the model on audiovisual mode

Hi,
I saw this question in an open thread but I didn't see a response, so I'm opening one so others can search for it in the future. First of all, thank you for providing such a detailed readme! I'm interested in doing audiovisual lipreading, but I was wondering how to do that with the existing code, since your documentation only mentions audio-only and visual-only.

Error in extract_audio_from_video

Hi,
I'm having an issue when processing the audio and I get an error. I don't know what's wrong, since the code sees the dataset but says it's in an unknown format. Any ideas?
Thank you

RuntimeError: Error opening 'D:\Lipreading_using_Temporal_Convolutional_Networks\lipread_mp4\ABOUT\test\ABOUT_00001.mp4': File contains data in an unknown format.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:/Lipreading_using_Temporal_Convolutional_Networks/preprocessing/extract_audio_from_video.py", line 43, in
data = librosa.load(video_pathname, sr=16000)[0][-19456:]
File "C:\Users\KATE\Anaconda3\lib\site-packages\librosa\core\audio.py", line 166, in load
y, sr_native = __audioread_load(path, offset, duration, dtype)
File "C:\Users\KATE\Anaconda3\lib\site-packages\librosa\core\audio.py", line 190, in _audioread_load
with audioread.audio_open(path) as input_file:
File "C:\Users\KATE\Anaconda3\lib\site-packages\audioread_init.py", line 116, in audio_open
raise NoBackendError()
audioread.exceptions.NoBackendError

Training Code

I am working on a small project and the training code would be helpful.
Please post it if available; if not, please give me a hint for how the training loop should be done.

Can't open model archives

Hello,

I've downloaded resnet18_mstcn(adamw) and resnet18_mstcn(adamw_s3) using the GDrive links, however I can't open the archives using the 7-Zip software.

Could you please check the archives? If they are valid, could you please explain how to extract the models from them?

Thank you.

How to train a model on 2 GPUs

I'm trying to train a model on two GPUs but I get the following error.
IndexError: index 16 is out of bounds for dimension 0 with size 16.
How can I solve this problem?
Looking forward to your answer!

OuluVS2 dataset

Hello, friend.

I want to use the OuluVS2 dataset, but I don't think its homepage is operating anymore.

So if you don't mind, could you share the OuluVS2 dataset with me?

[email protected]

How to load audio data?

Thank you for sharing such nice code.

My question is: how do I extract the audio data from an MP4 file in the preprocessing stage? In the preprocessing folder, I just found the scripts related to the images.

Thanks again. Looking forward to your early reply.

Question about test error

Hi, thanks for sharing your code! While evaluating the code with the pretrained model, the following error was raised:
Traceback (most recent call last):
File "main.py", line 261, in
main()
File "main.py", line 203, in main
dset_loaders = get_data_loaders(args)
File "/Lipreading_using_TCN/lipreading/dataloaders.py", line 53, in get_data_loaders
dset_loaders = {x: torch.utils.data.DataLoader(
File "/Lipreading_using_TCN/lipreading/dataloaders.py", line 53, in
dset_loaders = {x: torch.utils.data.DataLoader(
File "/python3.8/site-packages/torch/utils/data/dataloader.py", line 268, in init
sampler = RandomSampler(dataset, generator=generator)
File "/python3.8/site-packages/torch/utils/data/sampler.py", line 102, in init
raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0

In question #3, you recommended checking whether <DATA-DIRECTORY> has the following structure:

DATA-DIRECTORY
└───ABOUT
│   └───train
│   └───val
│   └───test
│       │   ABOUT_00001.npz
│       │   ABOUT_00002.npz
│       │   ...
└───ABSOLUTELY
│   └───train
│   ...

I checked mine and the structure looks the same. But I still can't figure out how to solve the test error.
I would appreciate it if you could give me some advice : )

ValueError: too many values to unpack (expected 3)

While running the main.py file for visual model training, there is an error regarding the unpacking of the frames array. On printing the shape of the frames array, there are 4 values in the tuple rather than 3. I am passing the LRW directory (with .mp4 and .txt files) as a value to the argument --annotation-direc. Kindly help asap!


Ask a question about non-causal TCN

Thank you for providing your lipreading research and code.
I have a question after reading the code and paper you provided. According to the paper, the TCN is designed as non-causal, but the code implements a causal TCN. A non-causal TCN can be obtained through some modifications, so there is no problem with that.
Is the performance presented in the paper from the causal or the non-causal TCN?
Thank you!
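As a small illustration of the difference this question is about (my own example, not the repository's TCN code): a causal temporal convolution pads only on the left so each output step never sees future frames, while a non-causal one pads symmetrically and uses both past and future context.

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 29)      # (batch, channels, time), e.g. 29 LRW frames
k, d = 3, 2                     # kernel size and dilation
conv = nn.Conv1d(64, 64, kernel_size=k, dilation=d)

causal_out = conv(F.pad(x, ((k - 1) * d, 0)))                           # left padding only
noncausal_out = conv(F.pad(x, ((k - 1) * d // 2, (k - 1) * d // 2)))    # symmetric padding
print(causal_out.shape, noncausal_out.shape)                            # both keep the 29-frame length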

Preprocessing Issue

Hi, Thx for the code!

I found out that for the 68-point landmark files shown in my screenshots, the files are empty, so the preprocessing code didn't work.

I'm wondering if this is an issue on my side (such as an unzip issue) or not.

Must convert gray?

Hi,
I have a question: must 'convert-gray' be set to 'true' to get frames of shape frames * h * w, without a channel dimension such as RGB?
Another query: must we preprocess the videos into .npz files?
Is it only once we do this that we can train the model?
Looking forward to your reply.

Can we do Sentence Prediction for the model?

Hello @mpc001 , thank you for your wonderful code with step by step guide to train and test the model.
I have tried the prediction code provided in issue #10. It provides only one word, and the prediction is not correct, with really low confidence.

Can we predict a sentence (the LRW dataset has multiple words in the video), not just a single word, from the trained model?
Kindly share your valuable ideas. Thanks.

I said "about", but it's saying "needs" :(

Processing webcam stream: optimal lengths and other clarifications

I think I managed to connect your project to a stream from a webcam, and I got it reasonably correct: it works on my machine and seems to produce outputs that somewhat resemble the words that I'm pronouncing.

I'm not sure about some details though. Would you be able to clarify them?

  1. I maintain a queue of frames from the webcam, and I pass the last 30 frames into the network (see the model_input = ... line). From what I understand, this matches what is used for the LRW dataset. Is this correct? Is there a better value for the queue length?
  2. What is the minimum value for the queue length that will work? Is it 5 because of the kernel size of the initial 3D convolution?
  3. I'm not sure what lengths means (a parameter expected by model.forward()). In main.py, the extract_feats() function sets lengths to a singleton list with the number of frames, but surely that can't be its sole purpose? There is also some weird averaging going on in _average_batch() that I don't understand. What is the optimal value of lengths for a stream from a webcam?
  4. Is it correct that the model outputs logits, and to obtain probabilities I need to apply softmax?

Here is my implementation. It is self-contained and should work if you put it in the root of the repository. The only library dependency is face-alignment (pip install --user face-alignment) that I used for extracting keypoints instead of dlib. The most interesting part is between the BEGIN PROCESSING / END PROCESSING comments.

import argparse
import json
from collections import deque
from contextlib import contextmanager
from pathlib import Path

import cv2
import face_alignment
import numpy as np
import torch
from torchvision.transforms.functional import to_tensor

from lipreading.model import Lipreading
from preprocessing.transform import warp_img, cut_patch

STD_SIZE = (256, 256)
STABLE_PNTS_IDS = [33, 36, 39, 42, 45]
START_IDX = 48
STOP_IDX = 68
CROP_WIDTH = CROP_HEIGHT = 96


@contextmanager
def VideoCapture(*args, **kwargs):
    cap = cv2.VideoCapture(*args, **kwargs)
    try:
        yield cap
    finally:
        cap.release()


def load_model(config_path: Path):
    with config_path.open() as fp:
        config = json.load(fp)
    tcn_options = {
        'num_layers': config['tcn_num_layers'],
        'kernel_size': config['tcn_kernel_size'],
        'dropout': config['tcn_dropout'],
        'dwpw': config['tcn_dwpw'],
        'width_mult': config['tcn_width_mult'],
    }
    return Lipreading(
        num_classes=500,
        tcn_options=tcn_options,
        backbone_type=config['backbone_type'],
        relu_type=config['relu_type'],
        width_mult=config['width_mult'],
        extract_feats=False,
    )


def visualize_probs(vocab, probs, col_width=4, col_height=300):
    num_classes = len(probs)
    out = np.zeros((col_height, num_classes * col_width + (num_classes - 1), 3), dtype=np.uint8)
    for i, p in enumerate(probs):
        x = (col_width + 1) * i
        cv2.rectangle(out, (x, 0), (x + col_width - 1, round(p * col_height)), (255, 255, 255), 1)
    top = np.argmax(probs)
    cv2.addText(out, f'Prediction: {vocab[top]}', (10, out.shape[0] - 30), 'Arial', color=(255, 255, 255))
    cv2.addText(out, f'Confidence: {probs[top]:.3f}', (10, out.shape[0] - 10), 'Arial', color=(255, 255, 255))
    return out


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--config-path', type=Path, default=Path('configs/lrw_resnet18_mstcn.json'))
    parser.add_argument('--model-path', type=Path, default=Path('models/lrw_resnet18_mstcn_adamw_s3.pth.tar'))
    parser.add_argument('--device', type=str, default='cuda')
    parser.add_argument('--queue-length', type=int, default=30)
    args = parser.parse_args()

    fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, device=args.device)
    model = load_model(args.config_path)
    model.load_state_dict(torch.load(Path(args.model_path), map_location=args.device)['model_state_dict'])
    model = model.to(args.device)

    mean_face_landmarks = np.load(Path('preprocessing/20words_mean_face.npy'))

    with Path('labels/500WordsSortedList.txt').open() as fp:
        vocab = fp.readlines()
    assert len(vocab) == 500

    queue = deque(maxlen=args.queue_length)

    with VideoCapture(0) as cap:
        while True:
            ret, image_np = cap.read()
            if not ret:
                break
            image_np = cv2.cvtColor(image_np, cv2.COLOR_BGR2RGB)

            all_landmarks = fa.get_landmarks(image_np)
            if all_landmarks:
                landmarks = all_landmarks[0]

                # BEGIN PROCESSING

                trans_frame, trans = warp_img(
                    landmarks[STABLE_PNTS_IDS, :], mean_face_landmarks[STABLE_PNTS_IDS, :], image_np, STD_SIZE)
                trans_landmarks = trans(landmarks)
                patch = cut_patch(
                    trans_frame, trans_landmarks[START_IDX:STOP_IDX], CROP_HEIGHT // 2, CROP_WIDTH // 2)

                cv2.imshow('patch', cv2.cvtColor(patch, cv2.COLOR_RGB2BGR))

                patch_torch = to_tensor(cv2.cvtColor(patch, cv2.COLOR_RGB2GRAY)).to(args.device)
                queue.append(patch_torch)

                if len(queue) >= args.queue_length:
                    with torch.no_grad():
                        model_input = torch.stack(list(queue), dim=1).unsqueeze(0)
                        logits = model(model_input, lengths=[args.queue_length])
                        probs = torch.nn.functional.softmax(logits, dim=-1)
                        probs = probs[0].detach().cpu().numpy()

                    vis = visualize_probs(vocab, probs)
                    cv2.imshow('probs', vis)

                # END PROCESSING

                for x, y in landmarks:
                    cv2.circle(image_np, (int(x), int(y)), 2, (0, 0, 255))

            cv2.imshow('camera', cv2.cvtColor(image_np, cv2.COLOR_RGB2BGR))

            key = cv2.waitKey(1)
            if key in {27, ord('q')}:  # 27 is Esc
                break
            elif key == ord(' '):
                cv2.waitKey(0)

    cv2.destroyAllWindows()


if __name__ == '__main__':
    main()
