ICASSP'22 Training Strategies for Improved Lip-Reading; ICASSP'21 Towards Practical Lipreading with Distilled and Efficient Models; ICASSP'20 Lipreading using Temporal Convolutional Networks

License: Other

Python 100.00%

lipreading_using_temporal_convolutional_networks's Issues

Test pretrained models

Hi, may I ask whether I can test the pre-trained models available in the Model Zoo? If so, could you advise where I can obtain the model in JSON format as input to ? Thank you!

The code for the Knowledge Distillation loss

Hi.

Did you publish the code for the Knowledge Distillation loss? I couldn't find it in the code.
If it is not there, could you please publish the code?

Thanks

Test error

Hi, thanks for the code!

I want to do a test with your pretrained model on the LRW dataset.

First, I ran crop_mouth_from_video.py and got many .npz files under "$TCN_LIPREADING_ROOT/datasets/visual_data/".
Second, I ran

"""
CUDA_VISIBLE_DEVICES=0 python main.py \
--config-path "$TCN_LIPREADING_ROOT/configs/lrw_resnet18_mstcn.json" \
--model-path "$TCN_LIPREADING_ROOT/models/lrw_resnet18_mstcn_adamw_s3.pth.tar" \
--data-dir "$TCN_LIPREADING_ROOT/datasets/visual_data/" \
--test
"""

But I got an error at

"$TCN_LIPREADING_ROOT/lipreading/preprocess.py", L87: frames actually has shape (29, 96, 96, 3).

What is wrong in my procedure?

How to train a model on 2 GPUs

I'm trying to train a model on two GPUs but I get the following error.
IndexError: index 16 is out of bounds for dimension 0 with size 16.
How can I solve this problem?
Looking forward to your answer!

May I know how to do sentence-level lip reading?

Hi,

Thank you very much for sharing your great work! Currently, the pretrained model performs lipreading at the word level. Given a sentence, how would I do lipreading? How can I segment the sentence into words? Thank you very much!
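
One common workaround, sketched here under assumptions rather than taken from the repository, is to slide a fixed-length window over the longer clip and classify each window as a single word. The 29-frame window matches the clip length that appears in another issue on this page; the stride of 5 and the function name `sliding_windows` are hypothetical choices for illustration.

```python
from typing import List, Tuple


def sliding_windows(num_frames: int, window: int = 29, stride: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) frame-index pairs that cover a clip with fixed-size windows."""
    if num_frames < window:
        return []  # clip is shorter than one window, so there is nothing to classify
    return [(start, start + window) for start in range(0, num_frames - window + 1, stride)]


# Each (start, end) pair would select frames[start:end] as one word-level input.
windows = sliding_windows(num_frames=60)
print(windows[:3])
```

Overlapping windows produce overlapping predictions, so some post-processing (for example, merging consecutive identical words) would likely still be needed.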

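How many images are fed to the model?

Is a fixed number of 10 images fed to the model, or is the number of lip images variable? If it is variable, how is such input handled?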

Testing on personal dataset

Hi, how can I test the model on personal data? For example, I have a video and I want to run inference on it with the model.

Regarding "3 weeks to 1 week GPU-time", specify hardware

Can you give some details of the hardware required to train the model?

Can you also clarify the measure "1 week GPU time"?

About variable length augmentation

Hello, when I read your paper, I noticed that this model uses variable-length augmentation.

But which part of the code implements variable-length augmentation?
If I want more variation (not just deleting 0-5 frames, but adding or deleting dozens of frames), do you think that is feasible in your code?

Thank you very much.
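
For readers with the same question, here is a minimal sketch of one way such an augmentation could work: it drops a random number of frames from each end of a clip. This is not the repository's implementation; `random_temporal_crop` and `max_drop` are hypothetical names, and the default of 5 only mirrors the "0-5" figure mentioned above.

```python
import random
from typing import List, Optional, Sequence


def random_temporal_crop(frames: Sequence, max_drop: int = 5,
                         rng: Optional[random.Random] = None) -> List:
    """Drop between 0 and max_drop frames from each end, keeping at least one frame."""
    rng = rng or random.Random()
    n = len(frames)
    if n == 0:
        return []
    start = min(rng.randint(0, max_drop), n - 1)        # frames removed at the front
    end = max(n - rng.randint(0, max_drop), start + 1)  # frames removed at the back
    return list(frames[start:end])


clip = list(range(29))  # stand-in for 29 frames
cropped = random_temporal_crop(clip, max_drop=5, rng=random.Random(0))
print(len(cropped))
```

Raising `max_drop` would remove more frames per clip, but whether accuracy holds up with much shorter clips would have to be checked experimentally.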

Fine-tuning and Training on LRW

How can I fine-tune or train from scratch instead of just evaluating the pre-trained models?

How can I compute the landmarks using my own dataset?

Hi, in the instructions on preparing the dataset, you say to download your pre-computed landmarks.
But when I want to train on my own dataset, how can I get the landmarks?
Thank you.

Any plan to release the code of "Lip-reading with Densely Connected Temporal Convolutional Networks"?

Hi @mpc001. Thanks for the great work! Do you have any plans to release the code of DC-TCN? I try to reimplement DC-TCN based on this repo, but I can only reach 88.6 on LRW.

Ask a question about Model configuration

Sorry to disturb you. I don't know how to obtain the JSON file required for the parameters of your model config. Looking forward to your reply, thank you.

With the same data, why is the result so different on MS-TCN and DC-TCN?

Hello, using your code as a basis, I collected new data for training (there are 13 classes, each with dozens of samples).
On MS-TCN I got good accuracy, but when I tried your newest model (DC-TCN), a problem appeared.
With the same training data, training also looks fine, with accuracy above 90%. But when I test the model, whatever input I give it, I get the same output and even the same confidence.
Can you help me?
Thank you.

How to run this model on 2 GPUs?

Hi, thank you for your great work on lipreading.
I want to run this model on 2 GPUs, and I used nn.DataParallel to do it. But there is a problem in the function _average_batch() in the model.py file. The error is as follows:
IndexError: index 16 is out of bounds for dimension 0 with size 16.
Have you encountered this problem?
Thank you very much! @mpc001
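
A possible explanation, offered as an assumption rather than a confirmed diagnosis: nn.DataParallel splits tensor inputs across GPUs but replicates plain Python lists of integers, so each replica may receive 16 samples together with a 32-entry `lengths` list. The pure-Python sketch below mimics that mismatch and shows that slicing `lengths` together with the batch avoids it. `average_batch` is a simplified stand-in, not the repository's `_average_batch()`.

```python
from typing import List


def average_batch(batch: List[List[float]], lengths: List[int]) -> List[float]:
    """Average the first lengths[i] time steps of sample i."""
    return [sum(batch[i][:n]) / n for i, n in enumerate(lengths)]


full_batch = [[1.0] * 29 for _ in range(32)]  # 32 samples, 29 time steps each
full_lengths = [29] * 32

# Simulated replica: half of the samples, but the full (unsplit) lengths list.
replica_batch = full_batch[:16]
try:
    average_batch(replica_batch, full_lengths)
    mismatch_error = None
except IndexError as exc:
    mismatch_error = exc  # index 16 does not exist in a 16-sample batch

# Keeping lengths aligned with the samples removes the error.
replica_means = average_batch(replica_batch, full_lengths[:16])
print(type(mismatch_error).__name__, len(replica_means))
```

In PyTorch terms, passing `lengths` as a tensor instead of a list might let DataParallel split it along the batch dimension; that variant is untested here.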

Can we do training using CPU?

Hi, thanks for the training code.
Can we train the model using the CPU only?
What changes do we need to make in that case?

Path problem during preprocessing

I'm typing this in at the terminal:
(abc) E:\TCN_LIPREADING_ROOT\TCN_LIPREADING_ROOT\preprocessing>
python crop_mouth_from_video.py --video-direc E:/TCN_LIPREADING_ROOT/TCN_LIPREADING_ROOT/landmarks/LRW_landmarks/ --landmark-direc E:/TCN_LIPREADING_ROOT/TCN_LIPREADING_ROOT/landmarks/ --save-direc E:/TCN_LIPREADING_ROOT/TCN_LIPREADING_ROOT/datasets/visual_data/
idx: 0 Processing. ABOUT/test/ABOUT_00001
Traceback (most recent call last):
File "crop_mouth_from_video.py", line 139, in <module>
assert os.path.isfile(video_pathname), "File does not exist. Path input: {}".format(video_pathname)
AssertionError: File does not exist. Path input: E:/TCN_LIPREADING_ROOT/TCN_LIPREADING_ROOT/landmarks/LRW_landmarks/ABOUT/test/ABOUT_00001.mp4
I don't know why.

Training code on LRW/LRW1000?

Hi, this project seems to contain only test code. Is the training code available?

Does the code on GitHub include the data augmentation part?

Hello, I noticed that your paper mentions many data augmentation methods. Is data augmentation included in the uploaded code? Thank you.

.

I don't have a json model

Hello,
Probably a stupid question, but in the main function I want to use the pretrained model (.pth.tar format), since I don't have a model in JSON. Do I necessarily have to use a JSON model in order to start the training? If not, do you know any way to fix it?
Thank you

line 169, in get_model_from_json
assert args.config_path.endswith('.json') and os.path.isfile(args.config_path),
AttributeError: 'NoneType' object has no attribute 'endswith'
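
The AttributeError suggests that `args.config_path` is `None`, which is what argparse yields when an optional flag is not passed. As a sketch only, the parser below marks the flag as required so that a missing config fails with a usage message instead of a later AttributeError. `build_parser` is a hypothetical stand-in, not the parser in main.py; the config filename is the one used in other issues on this page.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the script's argument parser."""
    parser = argparse.ArgumentParser(description="config-path demo")
    # required=True makes argparse exit with a usage error when the flag is missing
    parser.add_argument("--config-path", type=str, required=True,
                        help="path to the model JSON config")
    return parser


args = build_parser().parse_args(["--config-path", "configs/lrw_resnet18_mstcn.json"])
print(args.config_path)
```

The pretrained .pth.tar file holds weights, while the JSON file describes the architecture, so both would normally be passed together.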

ValueError: too many values to unpack (expected 3)

While running the main.py file for visual model training, there is an error regarding the unpacking of the frames array. On printing the shape of the frames array, there are 4 values in the tuple rather than 3. I am passing the LRW directory (with .mp4 and .txt files) as a value to the argument --annotation-direc. Kindly help asap!
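
A four-value shape may mean the frames still carry a color channel, as in the (29, 96, 96, 3) shape reported in another issue on this page, for example if the mouth crops were saved without grayscale conversion. That cause is an assumption. The helper below, with the hypothetical name `describe_frames_shape`, only illustrates how the two layouts can be told apart before unpacking.

```python
from typing import Tuple


def describe_frames_shape(shape: Tuple[int, ...]) -> str:
    """Classify a frame array shape as grayscale (T, H, W) or color (T, H, W, C)."""
    if len(shape) == 3:
        return "grayscale (T, H, W)"
    if len(shape) == 4:
        return f"{shape[-1]}-channel (T, H, W, C)"
    return "unexpected"


print(describe_frames_shape((29, 96, 96, 3)))
print(describe_frames_shape((29, 96, 96)))
```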

What is the use of 20words_mean_face.npy?

Hi, I am a beginner, and I would like to know what 20words_mean_face.npy is used for.
Thank you very much.

Error in pre-processing - help

!python crop_mouth_from_video.py --video-direc "../data/news.mp4" \
--landmark-direc "../landmarks/LRW_landmarks" \
--save-direc "../datasets" \
--convert-gray \
--testset-only

idx: 0 Processing. ABOUT/test/ABOUT_00001
Traceback (most recent call last):
File "crop_mouth_from_video.py", line 157, in <module>
assert sequence is not None, "cannot crop from {}.".format(filename)
AssertionError: cannot crop from ABOUT/test/ABOUT_00001.

Can we do Sentence Prediction for the model?

Hello @mpc001, thank you for your wonderful code and the step-by-step guide to training and testing the model.
I have tried the prediction code provided in issue #10. It outputs only one word, and the prediction is incorrect, with really low confidence.

Can we predict a sentence (the LRW dataset has multiple words in each video), not just a single word, with the trained model?
Kindly share your valuable ideas. Thanks.

I said "about", but it says "needs" :(

Training Code

I am working on a small project, and the training code would be helpful.
Please post it if it is available; if not, please give me a hint about how the training loop should be written.

What is the use of the annotation?

When I try to train on my own data, an error occurs saying that I need to provide annotations.
But I think the LRW annotations only include some information that does not help training.
Must I have the annotations? Can I skip this step?
Thank you very much.

Extract mean face on my own dataset

Hi, thanks for your code! After reading the paper, I am not quite sure what the mean face parameter means. If I want to use the code with my own data, how do I extract the mean face parameter before training?

Testing on Video (.mp4) file

Hi,

May I ask how do I test the pretrained model (resnet18_mstcn(adamw)) on a video file? I have attempted to do so however the predictions are quite bad, I think I might be doing it wrongly. Any advice would be appreciated thank you! :)

Question about test error

Hi, thanks for sharing your code! While evaluating the code with the pretrained model, the following error was raised:
Traceback (most recent call last):
File "main.py", line 261, in <module>
main()
File "main.py", line 203, in main
dset_loaders = get_data_loaders(args)
File "/Lipreading_using_TCN/lipreading/dataloaders.py", line 53, in get_data_loaders
dset_loaders = {x: torch.utils.data.DataLoader(
File "/Lipreading_using_TCN/lipreading/dataloaders.py", line 53, in <dictcomp>
dset_loaders = {x: torch.utils.data.DataLoader(
File "/python3.8/site-packages/torch/utils/data/dataloader.py", line 268, in __init__
sampler = RandomSampler(dataset, generator=generator)
File "/python3.8/site-packages/torch/utils/data/sampler.py", line 102, in __init__
raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0

In issue #3, you recommended checking whether DATA-DIRECTORY has the following structure:
DATA-DIRECTORY
│
└───ABOUT
│   │   train
│   │   val
│   │
│   └───test
│       │   ABOUT_00001.npz
│       │   ABOUT_00002.npz
│       │   ...
│
└───ABSOLUTELY
    │   train
    │   ...

I checked mine and the structure looks the same. But I still can't figure out how to solve the test error.
I would appreciate it if you could give me some advice : )
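
Since `num_samples=0` means the loader found no files, a quick sanity check is to count the .npz files per partition under the data directory. The sketch below assumes the `<WORD>/<partition>/*.npz` layout shown above; `count_npz` is a hypothetical helper, and the temporary directory only stands in for a real dataset.

```python
import tempfile
from pathlib import Path
from typing import Dict, Sequence


def count_npz(data_dir: str, partitions: Sequence[str] = ("train", "val", "test")) -> Dict[str, int]:
    """Count <data_dir>/<WORD>/<partition>/*.npz files for each partition."""
    root = Path(data_dir)
    return {p: len(list(root.glob(f"*/{p}/*.npz"))) for p in partitions}


# Demo on a throwaway directory containing a single test sample.
with tempfile.TemporaryDirectory() as tmp:
    test_dir = Path(tmp) / "ABOUT" / "test"
    test_dir.mkdir(parents=True)
    (test_dir / "ABOUT_00001.npz").touch()
    counts = count_npz(tmp)

print(counts)
```

A count of zero for a partition would point to a wrong `--data-dir`, an extra directory level, or files with a different extension.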

IndexError: list index out of range

Hi,
I'm trying to run main.py and I get an IndexError.
Due to memory restrictions, my datasets folder contains only the ABOUT folder (npz format), and my labels folder contains only the provided txt file with just the word ABOUT inside. When I run it, I get the following error. I tried to figure it out by debugging, but had no luck. Have you faced an issue like this one before?

Thank you

To be more specific :

Model and log being saved in: D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\train_logs\tcn
Traceback (most recent call last):
File "D:/new_lipreading/Lipreading_using_Temporal_Convolutional_Networks/main.py", line 260, in <module>
main()
File "D:/new_lipreading/Lipreading_using_Temporal_Convolutional_Networks/main.py", line 202, in main
dset_loaders = get_data_loaders(args)
File "D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\lipreading\dataloaders.py", line 52, in get_data_loaders
) for partition in ['train', 'val', 'test']}
File "D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\lipreading\dataloaders.py", line 52, in <dictcomp>
) for partition in ['train', 'val', 'test']}
File "D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\lipreading\dataset.py", line 30, in __init__
self.load_dataset()
File "D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\lipreading\dataset.py", line 39, in load_dataset
self._get_files_for_partition()
File "D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\lipreading\dataset.py", line 76, in _get_files_for_partition
self._data_files = [ f for f in self._data_files if f.split('/')[self.label_idx] in self._labels ]
File "D:\new_lipreading\Lipreading_using_Temporal_Convolutional_Networks\lipreading\dataset.py", line 76, in <listcomp>
self._data_files = [ f for f in self._data_files if f.split('/')[self.label_idx] in self._labels ]
IndexError: list index out of range

Process finished with exit code 1
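
The traceback shows the loader splitting paths on '/', while the paths in this report are Windows-style. If a file path contains only backslashes, `split('/')` returns a single element, so a negative index such as -3 would be out of range. That the label index is -3 is an assumption, and the path below is made up for illustration. A separator-agnostic way to pick the label folder could look like this:

```python
from pathlib import PureWindowsPath


def label_from_path(path: str, label_idx: int = -3) -> str:
    """Pick the label folder from a path, accepting both '\\' and '/' separators."""
    return PureWindowsPath(path).parts[label_idx]


example = r"D:\datasets\visual_data\ABOUT\test\ABOUT_00001.npz"  # hypothetical path
parts_by_slash = example.split("/")  # one element: nothing to split on
label = label_from_path(example)
print(len(parts_by_slash), label)
```

Passing the data directory with forward slashes might also work around the problem, though that has not been verified against the repository.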

Error in extract_audio_from_video

Hi,
I get an error when processing the audio. I don't know what's wrong, since the code finds the dataset but says it is in an unknown format. Any ideas?
Thank you

RuntimeError: Error opening 'D:\Lipreading_using_Temporal_Convolutional_Networks\lipread_mp4\ABOUT\test\ABOUT_00001.mp4': File contains data in an unknown format.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:/Lipreading_using_Temporal_Convolutional_Networks/preprocessing/extract_audio_from_video.py", line 43, in <module>
data = librosa.load(video_pathname, sr=16000)[0][-19456:]
File "C:\Users\KATE\Anaconda3\lib\site-packages\librosa\core\audio.py", line 166, in load
y, sr_native = __audioread_load(path, offset, duration, dtype)
File "C:\Users\KATE\Anaconda3\lib\site-packages\librosa\core\audio.py", line 190, in __audioread_load
with audioread.audio_open(path) as input_file:
File "C:\Users\KATE\Anaconda3\lib\site-packages\audioread\__init__.py", line 116, in audio_open
raise NoBackendError()
audioread.exceptions.NoBackendError
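
`NoBackendError` usually indicates that audioread found no decoder able to open the file, which on Windows often means FFmpeg is not installed or not on PATH. As a hedged workaround sketch, the audio track could be extracted to WAV with the `ffmpeg` command line first and then loaded from the WAV file. The helper names are hypothetical, and the 16 kHz rate mirrors the `sr=16000` in the traceback.

```python
import shutil
import subprocess
from typing import List


def ffmpeg_extract_cmd(video_path: str, wav_path: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that writes the audio track as mono WAV."""
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn",                    # drop the video stream
            "-ac", "1",               # mono
            "-ar", str(sample_rate),  # resample
            wav_path]


def extract_audio(video_path: str, wav_path: str) -> None:
    """Run ffmpeg, failing early with a clear message if it is not installed."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(ffmpeg_extract_cmd(video_path, wav_path), check=True)


cmd = ffmpeg_extract_cmd("ABOUT_00001.mp4", "ABOUT_00001.wav")
print(" ".join(cmd))
```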

Can't open model archives

Hello,

I've downloaded resnet18_mstcn(adamw) and resnet18_mstcn(adamw_s3) using GDrive links, however I can't open archives using 7zip software.

Could you please check the archives? If they are valid, could you please explain how to extract models from them?

Thank you.

How to train the model on audiovisual mode

Hi,
I saw this question in an open thread but didn't see a response, so I'm opening a new one so that others can search for it in the future. First of all, thank you for providing such a detailed README! I'm interested in audiovisual lip reading, but I was wondering how to do that with the existing code, since your documentation mentions only audio-only and visual-only modes.

UnboundLocalError: local variable 'tcn_options' referenced before assignment

While evaluating the model, the code always crashes with this error:
UnboundLocalError: local variable 'tcn_options' referenced before assignment
as it requires the tcn_options variable to be global.
How can I fix this, please?

def get_model():
    if os.path.exists(args.config_path):
        args_loaded = load_json(args.config_path)
        args.backbone_type = args_loaded['backbone_type']
        args.width_mult = args_loaded['width_mult']
        args.relu_type = args_loaded['relu_type']
        tcn_options = {
            'num_layers': args_loaded['tcn_num_layers'],
            'kernel_size': args_loaded['tcn_kernel_size'],
            'dropout': args_loaded['tcn_dropout'],
            'dwpw': args_loaded['tcn_dwpw'],
            'width_mult': args_loaded['tcn_width_mult'],
        }
    return Lipreading(
        num_classes=args.num_classes,
        tcn_options=tcn_options,
        backbone_type=args.backbone_type,
        relu_type=args.relu_type,
        width_mult=args.width_mult,
        extract_feats=args.extract_feats).cuda()
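
The error is consistent with `tcn_options` being assigned only when the config file exists: if `args.config_path` does not point to a real file, the assignment is skipped and the later use fails. Making the variable global would not help. A sketch of a fail-fast alternative follows; `load_tcn_options` is a hypothetical helper, and the demo values are placeholders rather than values from the repository's configs.

```python
import json
import os
import tempfile


def load_tcn_options(config_path):
    """Build the TCN options dict, failing early if the config file is missing."""
    if not config_path or not os.path.isfile(config_path):
        raise FileNotFoundError(f"Model config not found: {config_path!r}")
    with open(config_path) as fp:
        cfg = json.load(fp)
    return {
        "num_layers": cfg["tcn_num_layers"],
        "kernel_size": cfg["tcn_kernel_size"],
        "dropout": cfg["tcn_dropout"],
        "dwpw": cfg["tcn_dwpw"],
        "width_mult": cfg["tcn_width_mult"],
    }


# Demo with placeholder values written to a temporary config file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_config.json")
    with open(demo_path, "w") as fp:
        json.dump({"tcn_num_layers": 4, "tcn_kernel_size": [3, 5, 7],
                   "tcn_dropout": 0.2, "tcn_dwpw": False, "tcn_width_mult": 1}, fp)
    options = load_tcn_options(demo_path)

print(options["num_layers"])
```

With this shape, a wrong `--config-path` surfaces as a clear FileNotFoundError at startup.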

Reproducing on LRW1000, I only get 38.6. How can I get 41.4?

Hi, I have reproduced the results on LRW1000.
With batch size 32, lr=1e-3, and the same optimizer:
BiGRU reaches 38.4
multi-scale TCN reaches 38.6

How can I get 41.4? Can you share your training parameters?

When reproducing the other paper, "LEARN AN EFFECTIVE LIP READING MODEL WITHOUT PAINS":
BiGRU reaches 57.68 (the paper reports 55.7)
multi-scale TCN reaches 55.49

Extract landmarks on videos outside the LRW dataset

Hi,

Thanks for providing the code and pre-trained models. I would like to evaluate the model on some in-the-wild videos instead of the test set provided by the LRW dataset. Could you please tell me how to extract landmarks from my own videos? Thanks.

OuluVS2 dataset

Hello, friend.

I want to use the OuluVS2 dataset, but I don't think the homepage is operating anymore.

So if you don't mind, could you share the OuluVS2 dataset with me?

Question regarding annotation directory

Hi, thanks for the code!

I'm wondering where I can get the txt files that should be included in annotation-direc.

thx!

Why does training data need babble noise?

Hi,

I'm a beginner in this field. Why do you need babble noise during training? That's what's happening here.

Thank you very much.

Must convert gray?

Hi,
I have a question about 'convert-gray': must it be 'true' so that the frames have shape frames * h * w, with no channel dimension such as RGB?
Another question: must we preprocess the videos into .npz files?
Can we train the model only after doing this?
Looking forward to your reply.

Pretrained models are all corrupted

All pretrained models downloaded from google drive failed to extract, please check.

Processing webcam stream: optimal lengths and other clarifications

I think I managed to connect your project to a stream from a webcam, and I got it reasonably correct: it works on my machine and seems to produce outputs that somewhat resemble the words that I'm pronouncing.

I'm not sure about some details though. Would you be able to clarify them?

I maintain a queue of frames from the webcam, and I pass the last 30 frames into the network (see the model_input = ... line). From what I understand, that is what the LRW dataset uses. Is this correct? Is there a better value for the queue length?
What is the minimum value for the queue length that will work? Is it 5 because of the kernel size of the initial 3D convolution?
I'm not sure what lengths means (a parameter expected by model.forward()). In main.py, the extract_feats() function sets lengths to a singleton list with the number of frames, but surely that can't be its sole purpose? There is also some weird averaging going on in _average_batch() that I don't understand. What is the optimal value of lengths for a stream from a webcam?
Is it correct that the model outputs logits, and to obtain probabilities I need to apply softmax?
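
On the last question: if the network output is indeed unnormalized logits (an assumption here, not something confirmed by the maintainers), softmax is the standard way to turn them into probabilities. A minimal, numerically stable pure-Python version for illustration:

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities; subtracting the max avoids overflow in exp()."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

The predicted class is the same before and after softmax, so the normalization matters only when a confidence value is needed.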

Here is my implementation. It is self-contained and should work if you put it in the root of the repository. The only library dependency is face-alignment (pip install --user face-alignment) that I used for extracting keypoints instead of dlib. The most interesting part is between the BEGIN PROCESSING / END PROCESSING comments.

import argparse
import json
from collections import deque
from contextlib import contextmanager
from pathlib import Path

import cv2
import face_alignment
import numpy as np
import torch
from torchvision.transforms.functional import to_tensor

from lipreading.model import Lipreading
from preprocessing.transform import warp_img, cut_patch

STD_SIZE = (256, 256)
STABLE_PNTS_IDS = [33, 36, 39, 42, 45]
START_IDX = 48
STOP_IDX = 68
CROP_WIDTH = CROP_HEIGHT = 96


@contextmanager
def VideoCapture(*args, **kwargs):
    cap = cv2.VideoCapture(*args, **kwargs)
    try:
        yield cap
    finally:
        cap.release()


def load_model(config_path: Path):
    with config_path.open() as fp:
        config = json.load(fp)
    tcn_options = {
        'num_layers': config['tcn_num_layers'],
        'kernel_size': config['tcn_kernel_size'],
        'dropout': config['tcn_dropout'],
        'dwpw': config['tcn_dwpw'],
        'width_mult': config['tcn_width_mult'],
    }
    return Lipreading(
        num_classes=500,
        tcn_options=tcn_options,
        backbone_type=config['backbone_type'],
        relu_type=config['relu_type'],
        width_mult=config['width_mult'],
        extract_feats=False,
    )


def visualize_probs(vocab, probs, col_width=4, col_height=300):
    num_classes = len(probs)
    out = np.zeros((col_height, num_classes * col_width + (num_classes - 1), 3), dtype=np.uint8)
    for i, p in enumerate(probs):
        x = (col_width + 1) * i
        cv2.rectangle(out, (x, 0), (x + col_width - 1, round(p * col_height)), (255, 255, 255), 1)
    top = np.argmax(probs)
    cv2.addText(out, f'Prediction: {vocab[top]}', (10, out.shape[0] - 30), 'Arial', color=(255, 255, 255))
    cv2.addText(out, f'Confidence: {probs[top]:.3f}', (10, out.shape[0] - 10), 'Arial', color=(255, 255, 255))
    return out


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--config-path', type=Path, default=Path('configs/lrw_resnet18_mstcn.json'))
    parser.add_argument('--model-path', type=Path, default=Path('models/lrw_resnet18_mstcn_adamw_s3.pth.tar'))
    parser.add_argument('--device', type=str, default='cuda')
    parser.add_argument('--queue-length', type=int, default=30)
    args = parser.parse_args()

    fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, device=args.device)
    model = load_model(args.config_path)
    model.load_state_dict(torch.load(Path(args.model_path), map_location=args.device)['model_state_dict'])
    model = model.to(args.device)
    model.eval()  # inference mode: use running BatchNorm statistics and disable dropout

    mean_face_landmarks = np.load(Path('preprocessing/20words_mean_face.npy'))

    with Path('labels/500WordsSortedList.txt').open() as fp:
        vocab = [line.strip() for line in fp]  # drop trailing newlines before display
    assert len(vocab) == 500

    queue = deque(maxlen=args.queue_length)

    with VideoCapture(0) as cap:
        while True:
            ret, image_np = cap.read()
            if not ret:
                break
            image_np = cv2.cvtColor(image_np, cv2.COLOR_BGR2RGB)

            all_landmarks = fa.get_landmarks(image_np)
            if all_landmarks:
                landmarks = all_landmarks[0]

                # BEGIN PROCESSING

                trans_frame, trans = warp_img(
                    landmarks[STABLE_PNTS_IDS, :], mean_face_landmarks[STABLE_PNTS_IDS, :], image_np, STD_SIZE)
                trans_landmarks = trans(landmarks)
                patch = cut_patch(
                    trans_frame, trans_landmarks[START_IDX:STOP_IDX], CROP_HEIGHT // 2, CROP_WIDTH // 2)

                cv2.imshow('patch', cv2.cvtColor(patch, cv2.COLOR_RGB2BGR))

                patch_torch = to_tensor(cv2.cvtColor(patch, cv2.COLOR_RGB2GRAY)).to(args.device)
                queue.append(patch_torch)

                if len(queue) >= args.queue_length:
                    with torch.no_grad():
                        model_input = torch.stack(list(queue), dim=1).unsqueeze(0)
                        logits = model(model_input, lengths=[args.queue_length])
                        probs = torch.nn.functional.softmax(logits, dim=-1)
                        probs = probs[0].detach().cpu().numpy()

                    vis = visualize_probs(vocab, probs)
                    cv2.imshow('probs', vis)

                # END PROCESSING

                for x, y in landmarks:
                    cv2.circle(image_np, (int(x), int(y)), 2, (0, 0, 255))

            cv2.imshow('camera', cv2.cvtColor(image_np, cv2.COLOR_RGB2BGR))

            key = cv2.waitKey(1)
            if key in {27, ord('q')}:  # 27 is Esc
                break
            elif key == ord(' '):
                cv2.waitKey(0)

    cv2.destroyAllWindows()


if __name__ == '__main__':
    main()

DC-TCN number of parameters and Hardest words list

Hi,
I have questions about DC-TCN and MS-TCN papers.

Could you provide me the number of parameters for the four settings in Table 2 of the DC-TCN paper?
If possible, please let me know FLOPs as well.
In second-page footnote of "LIPREADING USING TEMPORAL CONVOLUTIONAL NETWORKS" paper, it is mentioned that the list of “hardest words” is obtained from [10]. However, I couldn't get the list from the github and the paper. Could you provide the hardest 50 classes list your paper mentioned?

Thank you.

A question about non-causal TCN

Thank you for providing your lip reading research and code.
I have a question after reading the code and paper you provided. According to the paper, the TCN is designed as non-causal, but the code implements a causal TCN. A non-causal version can be obtained through some modifications, so there is no problem.
Is the performance presented in the paper for the causal or the non-causal TCN?
Thank you!
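
For context, the difference between the two variants comes down to where the zero-padding goes. For a stride-1 dilated convolution, a causal layer puts all of the padding before the sequence, while a non-causal layer splits it evenly on both sides. The sketch below is generic TCN arithmetic, not code from this repository, and `tcn_padding` is a hypothetical name.

```python
from typing import Tuple


def tcn_padding(kernel_size: int, dilation: int, causal: bool) -> Tuple[int, int]:
    """Return (left, right) padding that preserves sequence length for a stride-1 dilated conv."""
    total = (kernel_size - 1) * dilation
    if causal:
        return total, 0  # each output sees only current and past frames
    if total % 2:
        raise ValueError("symmetric padding needs an even (kernel_size - 1) * dilation")
    return total // 2, total // 2  # each output sees past and future frames


print(tcn_padding(3, 2, causal=True))
print(tcn_padding(3, 2, causal=False))
```

Many causal implementations pad both sides and then trim the right end of the output, which has the same effect as left-only padding.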

ShuffleNet's Parameter

Hi, thanks for your work.
In 'shufflenetv2.py', I see the note 'Input size needs to be divisible by 32' (for example, 96). Is this only to make sure that 'nn.AvgPool2d(int(input_size/32))' works?
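
The arithmetic behind that constraint can be sketched as follows, assuming the network reduces the spatial size by a total factor of 32 (as in the standard ShuffleNetV2 design with five stride-2 stages) before the final average pooling. `final_feature_size` is a hypothetical helper for illustration.

```python
def final_feature_size(input_size: int, total_stride: int = 32) -> int:
    """Spatial size of the last feature map, which the average pool must cover exactly."""
    if input_size % total_stride != 0:
        raise ValueError(f"input_size {input_size} is not divisible by {total_stride}")
    return input_size // total_stride


print(final_feature_size(96))
```

With a 96-pixel input, the last feature map would be 3x3, matching `int(96/32)` as the pooling kernel size.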

How to load audio data?

Thank you for sharing such nice code.

My question is: how to extract audio data from an MP4 file in the preprocessing stage? In this folder, I just found the ones related to the images.

Thanks again. Looking forward to your early reply.

Extract Embedding

CUDA_VISIBLE_DEVICES=0 python main.py --extract-feats
--config-path
--model-path
--mouth-patch-path
--mouth-embedding-out-path
I tried jpg, npy, and npz for the images but failed to extract embeddings of the mouth patches. May I know what file type it expects in order to extract the 512-D embedding?

Preprocessing Issue

Hi, Thx for the code!

I found out that when it comes to 68 landmark files below, the file is empty so preprocessing code didn't work.

I'm wondering if this is my own issue (such as unzip issue..) or not.

mpc001 / lipreading_using_temporal_convolutional_networks
