MuAViC

https://arxiv.org/abs/2303.00628

A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation.

Overview

MuAViC provides:

1200 hours of transcribed audio-visual speech for 9 languages (English, Arabic, German, Greek, Spanish, French, Italian, Portuguese and Russian)
text translations for 6 English-to-X directions and 6 X-to-English directions (X = Greek, Spanish, French, Italian, Portuguese or Russian)

The raw data is collected from TED/TEDx talk recordings.

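Detailed statistics

In the tables below, "H+P" gives training hours with human-labeled targets plus training hours with pseudo-labeled targets.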

Audio-Visual Speech Recognition

Language	Code	Train Hours (H+P)	Train Speakers
English	En	436 + 0	4.7K
Arabic	Ar	16 + 0	95
German	De	10 + 0	53
Greek	El	25 + 0	113
Spanish	Es	178 + 0	987
French	Fr	176 + 0	948
Italian	It	101 + 0	487
Portuguese	Pt	153 + 0	810
Russian	Ru	49 + 0	238

Audio-Visual En-X Speech-to-Text Translation

Direction	Code	Train Hours (H+P)	Train Speakers
English-Greek	En-El	17 + 420	4.7K
English-Spanish	En-Es	21 + 416	4.7K
English-French	En-Fr	21 + 416	4.7K
English-Italian	En-It	20 + 417	4.7K
English-Portuguese	En-Pt	18 + 419	4.7K
English-Russian	En-Ru	20 + 417	4.7K

Audio-Visual X-En Speech-to-Text Translation

Direction	Code	Train Hours (H+P)	Train Speakers
Greek-English	El-En	8 + 17	113
Spanish-English	Es-En	64 + 114	987
French-English	Fr-En	45 + 131	948
Italian-English	It-En	48 + 53	487
Portuguese-English	Pt-En	53 + 100	810
Russian-English	Ru-En	8 + 41	238
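
As a quick consistency check on the three tables above, the following Python sketch parses the "H + P" cells and compares the totals. The numbers are copied from the tables; the helper name `total_hours` is illustrative and not part of the MuAViC code.

```python
# Training hours copied from the tables above, as "H + P" strings.
AVSR = {"en": "436 + 0", "ar": "16 + 0", "de": "10 + 0", "el": "25 + 0",
        "es": "178 + 0", "fr": "176 + 0", "it": "101 + 0", "pt": "153 + 0",
        "ru": "49 + 0"}
EN_X = {"en-el": "17 + 420", "en-es": "21 + 416", "en-fr": "21 + 416",
        "en-it": "20 + 417", "en-pt": "18 + 419", "en-ru": "20 + 417"}
X_EN = {"el-en": "8 + 17", "es-en": "64 + 114", "fr-en": "45 + 131",
        "it-en": "48 + 53", "pt-en": "53 + 100", "ru-en": "8 + 41"}


def total_hours(cell: str) -> int:
    """Sum the two parts of an 'H + P' cell (hypothetical helper)."""
    h, p = (int(part) for part in cell.split("+"))
    return h + p


# Total AVSR training hours across the nine languages.
avsr_total = sum(total_hours(cell) for cell in AVSR.values())
print(avsr_total)  # 1144

# Each X-En direction covers the same training hours as AVSR in language X.
for direction, cell in X_EN.items():
    src = direction.split("-")[0]
    assert total_hours(cell) == total_hours(AVSR[src])

# Each En-X direction sums to the same total.
en_x_totals = {d: total_hours(cell) for d, cell in EN_X.items()}
print(set(en_x_totals.values()))  # {437}
```

The tables list training hours only, so the 1144-hour total need not match the 1200-hour figure in the overview. Each En-X direction sums to 437 hours, one hour more than the 436 listed for English AVSR, which is consistent with rounding.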

Getting Data

We provide scripts to generate the audio/video data and AV-HuBERT training manifests for MuAViC.

First, clone this repo to get the scripts:

git clone https://github.com/facebookresearch/muavic.git

Install required packages:

conda install -c conda-forge ffmpeg==4.2.2
conda install -c conda-forge sox
pip install -r requirements.txt

Then generate the audio-visual speech recognition and translation data with:

python get_data.py --root-path ${ROOT} --src-lang ${SRC_LANG}

where the speech language ${SRC_LANG} is one of en, ar, de, el, es, fr, it, pt and ru.
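
To process several languages in one go, the command can be generated per language. The Python sketch below only builds and prints the commands (a dry run); it uses the two flags shown above and nothing else. The helper `get_data_command` and the `/data` root are placeholders, not part of the repository.

```python
import shlex

# The nine speech languages listed above.
SRC_LANGS = ("en", "ar", "de", "el", "es", "fr", "it", "pt", "ru")


def get_data_command(root: str, src_lang: str) -> list:
    """Build the get_data.py invocation for one language (hypothetical helper)."""
    if src_lang not in SRC_LANGS:
        raise ValueError(f"unsupported --src-lang: {src_lang!r}")
    return ["python", "get_data.py", "--root-path", root, "--src-lang", src_lang]


# Dry run: print one command per language instead of executing it.
# "/data" is a placeholder root directory.
for lang in SRC_LANGS:
    print(shlex.join(get_data_command("/data", lang)))
```

To actually fetch the data, each command list could be passed to `subprocess.run(cmd, check=True)` from inside the cloned `muavic` directory.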

Generated data will be saved to ${ROOT}/muavic:

${ROOT}/muavic/${SRC_LANG}/audio for processed audio files
${ROOT}/muavic/${SRC_LANG}/video for processed video files
${ROOT}/muavic/${SRC_LANG}/*.tsv for AV-HuBERT AVSR training manifests
${ROOT}/muavic/${SRC_LANG}/${TGT_LANG}/*.tsv for AV-HuBERT AVST training manifests
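
The layout above can be expressed as a small lookup helper, which is handy for checking that a run produced what you expect. This is a minimal sketch under the directory structure described here; `muavic_layout` is a hypothetical name and `/data` is a placeholder root.

```python
from pathlib import Path
from typing import Optional


def muavic_layout(root: str, src_lang: str, tgt_lang: Optional[str] = None) -> dict:
    """Return the expected output locations for one language (hypothetical helper)."""
    base = Path(root) / "muavic" / src_lang
    layout = {
        "audio": base / "audio",                       # processed audio files
        "video": base / "video",                       # processed video files
        "avsr_manifests": sorted(base.glob("*.tsv")),  # AVSR training manifests
    }
    if tgt_lang is not None:
        # AVST manifests live in a per-target-language subdirectory.
        layout["avst_manifests"] = sorted((base / tgt_lang).glob("*.tsv"))
    return layout


# Example with a placeholder root; the manifest lists stay empty
# until get_data.py has generated the data.
layout = muavic_layout("/data", "en", "fr")
print(layout["audio"])
print(len(layout["avsr_manifests"]), len(layout["avst_manifests"]))
```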

Models

In the following table, we provide all end-to-end trained models mentioned in our paper:

Task	Languages	Best Checkpoint	Dictionary	Tokenizer
AVSR	ar	best_ckpt.pt	dict	tokenizer
	de	best_ckpt.pt	dict	tokenizer
	el	best_ckpt.pt	dict	tokenizer
	en	best_ckpt.pt	dict	tokenizer
	es	best_ckpt.pt	dict	tokenizer
	fr	best_ckpt.pt	dict	tokenizer
	it	best_ckpt.pt	dict	tokenizer
	pt	best_ckpt.pt	dict	tokenizer
	ru	best_ckpt.pt	dict	tokenizer
	ar,de,el,es,fr,it,pt,ru	best_ckpt.pt	dict	tokenizer
AVST	en-el	best_ckpt.pt	dict	tokenizer
	en-es	best_ckpt.pt	dict	tokenizer
	en-fr	best_ckpt.pt	dict	tokenizer
	en-it	best_ckpt.pt	dict	tokenizer
	en-pt	best_ckpt.pt	dict	tokenizer
	en-ru	best_ckpt.pt	dict	tokenizer
	el-en	best_ckpt.pt	dict	tokenizer
	es-en	best_ckpt.pt	dict	tokenizer
	fr-en	best_ckpt.pt	dict	tokenizer
	it-en	best_ckpt.pt	dict	tokenizer
	pt-en	best_ckpt.pt	dict	tokenizer
	ru-en	best_ckpt.pt	dict	tokenizer
	{el,es,fr,it,pt,ru}-en	best_ckpt.pt	dict	tokenizer

Demo

To try out our state-of-the-art audio-visual models with different audio and video inputs, including a video recorded through the webcam or an uploaded video, check out our demo.

You can read more about our model in the README file in the demo folder.

Training

To train audio-visual models, we use the AV-HuBERT framework.

Clone and install AV-HuBERT in the root directory:

$ # Clone the "muavic" branch of av_hubert's repo
$ git clone -b muavic https://github.com/facebookresearch/av_hubert.git
$ # Set the fairseq version
$ cd av_hubert
$ git submodule init
$ git submodule update
$ # Install av-hubert's requirements
$ pip install -r requirements.txt
$ # Install fairseq
$ cd fairseq
$ pip install --editable ./

Download an AV-HuBERT pre-trained model from here.

Open the training script (scripts/train.sh) and replace these variables:

# language direction (e.g. "en" or "en-fr")
LANG=

# path where output trained models will be located
OUT_PATH=

# path to the downloaded pre-trained model
PRETRAINED_MODEL_PATH=

Run the training script:
```
$ bash scripts/train.sh
```

Note:
All audio-visual models found here used the large_vox_iter5.pt pre-trained model.

Decoding/Evaluating

To evaluate your trained model (or our trained models) against MuAViC, follow these steps:

Open the decoding script (scripts/decode.sh) and replace these variables:

# language direction (e.g. "en" or "en-fr")
LANG=???

# data split (e.g. "test" or "valid")
GROUP=???

# inference modality (choices: "audio", "video", "audio,video")
MODALITIES=???

# path to the trained model
MODEL_PATH=???

# path where decoding results and scores will be located
OUT_PATH=???
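
Before launching a long decoding run, it can help to sanity-check the values you plan to put into the script. The Python sketch below checks `LANG` and `MODALITIES` against the languages, directions, and modality choices listed in this README. It is an illustration only: the function names are hypothetical, and whether `scripts/decode.sh` performs any validation of its own is not stated here.

```python
# Values taken from the statistics tables and the script comments above.
AVSR_LANGS = {"en", "ar", "de", "el", "es", "fr", "it", "pt", "ru"}
AVST_X = {"el", "es", "fr", "it", "pt", "ru"}
MODALITY_CHOICES = {"audio", "video", "audio,video"}


def is_valid_lang(lang: str) -> bool:
    """True for an AVSR language ("en") or an AVST direction ("en-fr", "fr-en")."""
    if "-" not in lang:
        return lang in AVSR_LANGS
    src, _, tgt = lang.partition("-")
    return (src == "en" and tgt in AVST_X) or (src in AVST_X and tgt == "en")


def check_decode_settings(lang: str, modalities: str) -> None:
    """Raise ValueError if a setting is outside the documented choices."""
    if not is_valid_lang(lang):
        raise ValueError(f"unknown LANG: {lang!r}")
    if modalities not in MODALITY_CHOICES:
        raise ValueError(f"unknown MODALITIES: {modalities!r}")


check_decode_settings("en-fr", "audio,video")  # passes silently
print(is_valid_lang("de-en"))  # False: German is listed for AVSR only
```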

Run the decoding script:
```
$ bash scripts/decode.sh
```

License

CC-BY-NC 4.0

Citation

@article{anwar2023muavic,
  title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
  author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
  journal={arXiv preprint arXiv:2303.00628},
  year={2023}
}
