
MuAViC

A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation.

Paper

Overview

MuAViC provides:

  • 1200 hours of transcribed audio-visual speech for 9 languages (English, Arabic, German, Greek, Spanish, French, Italian, Portuguese and Russian)
  • text translations for 6 English-to-X directions and 6 X-to-English directions (X = Greek, Spanish, French, Italian, Portuguese or Russian)
MuAViC data statistics

The raw data is collected from TED/TEDx talk recordings.

Detailed statistics

Audio-Visual Speech Recognition

| Language | Code | Train Hours (H+P) | Train Speakers |
| --- | --- | --- | --- |
| English | En | 436 + 0 | 4.7K |
| Arabic | Ar | 16 + 0 | 95 |
| German | De | 10 + 0 | 53 |
| Greek | El | 25 + 0 | 113 |
| Spanish | Es | 178 + 0 | 987 |
| French | Fr | 176 + 0 | 948 |
| Italian | It | 101 + 0 | 487 |
| Portuguese | Pt | 153 + 0 | 810 |
| Russian | Ru | 49 + 0 | 238 |

Audio-Visual En-X Speech-to-Text Translation

| Direction | Code | Train Hours (H+P) | Train Speakers |
| --- | --- | --- | --- |
| English-Greek | En-El | 17 + 420 | 4.7K |
| English-Spanish | En-Es | 21 + 416 | 4.7K |
| English-French | En-Fr | 21 + 416 | 4.7K |
| English-Italian | En-It | 20 + 417 | 4.7K |
| English-Portuguese | En-Pt | 18 + 419 | 4.7K |
| English-Russian | En-Ru | 20 + 417 | 4.7K |

Audio-Visual X-En Speech-to-Text Translation

| Direction | Code | Train Hours (H+P) | Train Speakers |
| --- | --- | --- | --- |
| Greek-English | El-En | 8 + 17 | 113 |
| Spanish-English | Es-En | 64 + 114 | 987 |
| French-English | Fr-En | 45 + 131 | 948 |
| Italian-English | It-En | 48 + 53 | 487 |
| Portuguese-English | Pt-En | 53 + 100 | 810 |
| Russian-English | Ru-En | 8 + 41 | 238 |

Getting Data

We provide scripts to generate the audio/video data and AV-HuBERT training manifests for MuAViC.

First, clone this repo for the scripts:

git clone https://github.com/facebookresearch/muavic.git

Install required packages:

conda install -c conda-forge ffmpeg==4.2.2
conda install -c conda-forge sox
pip install -r requirements.txt

Then get the audio-visual speech recognition and translation data via:

python get_data.py --root-path ${ROOT} --src-lang ${SRC_LANG}

where the speech language ${SRC_LANG} is one of en, ar, de, el, es, fr, it, pt, or ru.

Generated data will be saved to ${ROOT}/muavic:

  • ${ROOT}/muavic/${SRC_LANG}/audio for processed audio files
  • ${ROOT}/muavic/${SRC_LANG}/video for processed video files
  • ${ROOT}/muavic/${SRC_LANG}/*.tsv for AV-HuBERT AVSR training manifests
  • ${ROOT}/muavic/${SRC_LANG}/${TGT_LANG}/*.tsv for AV-HuBERT AVST training manifests
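For a quick look at one of the generated manifests, here is a minimal Python sketch. It assumes the standard AV-HuBERT manifest layout (first line is the data root, then tab-separated columns: utterance id, video path, audio path, video frames, audio frames); the file path is just an example.

from pathlib import Path

def read_manifest(tsv_path):
    # first line is the data root; the rest are tab-separated entries
    lines = Path(tsv_path).read_text().splitlines()
    root = lines[0]
    entries = [line.split("\t") for line in lines[1:]]
    return root, entries

root, entries = read_manifest("muavic/en/train.tsv")  # example path
print(f"root={root}, {len(entries)} utterances")
print(entries[0])  # [id, video_path, audio_path, video_frames, audio_frames]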

Models

In the following table, we provide all end-to-end trained models mentioned in our paper:

| Task | Languages | Best Checkpoint | Dictionary | Tokenizer |
| --- | --- | --- | --- | --- |
| AVSR | ar | best_ckpt.pt | dict | tokenizer |
| AVSR | de | best_ckpt.pt | dict | tokenizer |
| AVSR | el | best_ckpt.pt | dict | tokenizer |
| AVSR | en | best_ckpt.pt | dict | tokenizer |
| AVSR | es | best_ckpt.pt | dict | tokenizer |
| AVSR | fr | best_ckpt.pt | dict | tokenizer |
| AVSR | it | best_ckpt.pt | dict | tokenizer |
| AVSR | pt | best_ckpt.pt | dict | tokenizer |
| AVSR | ru | best_ckpt.pt | dict | tokenizer |
| AVSR | ar,de,el,es,fr,it,pt,ru | best_ckpt.pt | dict | tokenizer |
| AVST | en-el | best_ckpt.pt | dict | tokenizer |
| AVST | en-es | best_ckpt.pt | dict | tokenizer |
| AVST | en-fr | best_ckpt.pt | dict | tokenizer |
| AVST | en-it | best_ckpt.pt | dict | tokenizer |
| AVST | en-pt | best_ckpt.pt | dict | tokenizer |
| AVST | en-ru | best_ckpt.pt | dict | tokenizer |
| AVST | el-en | best_ckpt.pt | dict | tokenizer |
| AVST | es-en | best_ckpt.pt | dict | tokenizer |
| AVST | fr-en | best_ckpt.pt | dict | tokenizer |
| AVST | it-en | best_ckpt.pt | dict | tokenizer |
| AVST | pt-en | best_ckpt.pt | dict | tokenizer |
| AVST | ru-en | best_ckpt.pt | dict | tokenizer |
| AVST | {el,es,fr,it,pt,ru}-en | best_ckpt.pt | dict | tokenizer |
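To poke at one of the released checkpoints, something like the following should work. This is a sketch, not an official recipe; it assumes the av_hubert checkout described in the Training section below, since fairseq needs the avhubert user dir to register the task and model classes.

from argparse import Namespace

from fairseq import checkpoint_utils, utils as fairseq_utils

# register AV-HuBERT's task/model with fairseq (path is an assumption)
fairseq_utils.import_user_module(Namespace(user_dir="av_hubert/avhubert"))

models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["best_ckpt.pt"]  # path to a downloaded checkpoint
)
print(type(models[0]).__name__)  # e.g. AVHubertSeq2Seq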

Demo

To try out our state-of-the-art audio-visual models with different audio and video inputs, including video recorded through the webcam or an uploaded video, check out our demo:

demo.mp4

You can read more about our model in the README file in the demo folder.

Training

For training audio-visual models, we use the AV-HuBERT framework.

  1. Clone and install AV-HuBERT in the root directory:

    $ # Clone the "muavic" branch of av_hubert's repo
    $ git clone -b muavic https://github.com/facebookresearch/av_hubert.git
    $ # Set the fairseq version
    $ cd av_hubert
    $ git submodule init
    $ git submodule update
    $ # Install av-hubert's requirements
    $ pip install -r requirements.txt
    $ # Install fairseq
    $ cd fairseq
    $ pip install --editable ./
  2. Download an AV-HuBERT pre-trained model from here.

  3. Open the training script (scripts/train.sh) and replace these variables:

    # language direction (e.g "en" or "en-fr")
    LANG=
    
    # path where output trained models will be located
    OUT_PATH= 
    
    # path to the downloaded pre-trained model
    PRETRAINED_MODEL_PATH=
  4. Run the training script:

    $ bash scripts/train.sh

Note:
All audio-visual models found here used the large_vox_iter5.pt pre-trained model.
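As a quick sanity check that the pre-trained checkpoint downloaded intact, you can load it with plain PyTorch. This sketch assumes the usual fairseq checkpoint layout (a dict with a "model" state dict), which may differ for a given release.

import torch

ckpt = torch.load("large_vox_iter5.pt", map_location="cpu")
print(sorted(ckpt.keys()))  # typically includes "model", "cfg", ...
n_params = sum(p.numel() for p in ckpt["model"].values())
print(f"{n_params / 1e6:.1f}M parameters")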

Decoding/Evaluating

To evaluate your trained model (or our trained models) against MuAViC, follow these steps:

  1. Open the decoding script (scripts/decode.sh) and replace these variables:

    # language direction (e.g "en" or "en-fr")
    LANG=???
    
    # data split (e.g "test" or "valid")
    GROUP=???
    
    # inference modality (choices: "audio", "video", "audio,video")
    MODALITIES=???
    
    # path to the trained model
    MODEL_PATH=???
    
    # path where decoding results and scores will be located
    OUT_PATH=???
  2. Run the decoding script:

    $ bash scripts/decode.sh
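If you want to re-score the decoded output yourself, a simple word error rate over hypothesis/reference text files looks like the sketch below. The file names are placeholders; adapt them to whatever decode.sh wrote under ${OUT_PATH}.

import editdistance  # pip install editdistance

def wer(refs, hyps):
    errors, total = 0, 0
    for ref, hyp in zip(refs, hyps):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += editdistance.eval(ref_words, hyp_words)
        total += len(ref_words)
    return 100.0 * errors / total

refs = open("ref.txt").read().splitlines()   # placeholder path
hyps = open("hypo.txt").read().splitlines()  # placeholder path
print(f"WER: {wer(refs, hyps):.2f}%")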

License

CC-BY-NC 4.0

Citation

@article{anwar2023muavic,
  title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
  author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
  journal={arXiv preprint arXiv:2303.00628},
  year={2023}
}


muavic's Issues

Multilingual AVSR model decoding and training

I downloaded the multilingual AVSR model (x_avsr) and tried to use the decoding script.
First, I ran into this error:

Traceback (most recent call last):   
  File "/usr/users/roudi/muavic/av_hubert/avhubert/infer_s2s.py", line 311, in hydra_main                                                         
    distributed_utils.call_main(cfg, main)
  File "/data/sls/u/meng/roudi/muavic/av_hubert/fairseq/fairseq/distributed/utils.py", line 369, in call_main                                     
    main(cfg, **kwargs)
  File "/usr/users/roudi/muavic/av_hubert/avhubert/infer_s2s.py", line 96, in main
    return _main(cfg, h)                                                                                                                          
  File "/usr/users/roudi/muavic/av_hubert/avhubert/infer_s2s.py", line 118, in _main                                                              
    models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([cfg.common_eval.path])
  File "/data/sls/u/meng/roudi/muavic/av_hubert/fairseq/fairseq/checkpoint_utils.py", line 432, in load_model_ensemble_and_task                   
    task = tasks.setup_task(cfg.task)                                                                                                             
  File "/data/sls/u/meng/roudi/muavic/av_hubert/fairseq/fairseq/tasks/__init__.py", line 39, in setup_task
    cfg = merge_with_parent(dc(), cfg)
  File "/data/sls/u/meng/roudi/muavic/av_hubert/fairseq/fairseq/dataclass/utils.py", line 490, in merge_with_parent                               
    merged_cfg = OmegaConf.merge(dc, cfg)                                                                                                         
omegaconf.errors.ConfigKeyError: Key 'add_eos' not in 'AVHubertPretrainingConfig'
        full_key: add_eos
        reference_type=Optional[AVHubertPretrainingConfig]                                                                                        
        object_type=AVHubertPretrainingConfig  

I fixed this by adding add_eos: bool = field(default=False, metadata={"help": "hack: make the multilingual model work"}) to this line: https://github.com/facebookresearch/av_hubert/blob/e8a6d4202c208f1ec10f5d41a66a61f96d1c442f/avhubert/hubert_pretraining.py#L161

I ran decoding on a few languages and noticed that the model outputs a language tag in the hypothesis (examples: <fr> (Applaudissements), <es> (Aplausos)), while the reference doesn't contain the tag.
My WERs were quite different from what's reported in the paper, but adding the language tag to the reference sentences made them comparable (removing the tag from the hypothesis instead gave worse WERs than reported). Did you use the language tag in the references for evaluation in the multilingual setting?
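For reference, the tag handling described above amounts to something like this (a sketch; the tag pattern is inferred from the examples above):

import re

LANG_TAG = re.compile(r"^<[a-z]{2}>\s*")

def strip_tag(line):
    return LANG_TAG.sub("", line)

def add_tag(line, lang):
    return f"<{lang}> {line}"

assert strip_tag("<fr> (Applaudissements)") == "(Applaudissements)"
assert add_tag("(Aplausos)", "es") == "<es> (Aplausos)"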

The model sometimes outputs the text in the wrong language (as well as the incorrect language tag). Is there a way to force output text in a certain language?

I was also wondering how to train the multilingual model (the current training script seems to handle audio in one language only). Specifically, should I add the language tag at the beginning of all the sentences, and how do you balance samples from different languages?

Empty X -> EN translations

For the X -> EN task, I noticed there are some blank/empty translations even though the source-language text has a valid sentence.

Here are the numbers of blank translations per language; the problem is mainly in a few of the test sets. How does the BLEU score computation handle this? (A counting sketch follows the list below.)

  • el train 0
  • el valid 0
  • el test 3
  • es train 0
  • es valid 0
  • es test 11
  • fr train 0
  • fr valid 0
  • fr test 1
  • it train 0
  • it valid 0
  • it test 0
  • pt train 0
  • pt valid 0
  • pt test 1
  • ru train 0
  • ru valid 0
  • ru test 8
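The counts above can be reproduced with a small helper; this sketch assumes the target translations sit one sentence per line in a plain-text file aligned with the source (the path is an example):

def count_blank(path):
    with open(path) as f:
        return sum(1 for line in f if not line.strip())

print(count_blank("muavic/es/en/test.en"))  # example path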

Questions towards hyper-parameters and the token post-processing

Dear authors,
thanks for the great work. I have two questions about the paper.
In Section 4.1 about the experimental setup, it's written:
For both AVSR and AVST, we use an English AV-HuBERT large pre-trained model [3], which is trained on the combination of LRS3-TED [8] and the English portion of VoxCeleb2 [27]. We follow [3] for fine-tuning hyper-parameters, except that we fine-tune our bilingual models to 30K updates and our multilingual AVSR model to 90K updates.

I would like to ask: how many warmup_steps, hold_steps, and decay_steps did you use, and what did you set freeze_finetune_updates to? The original configuration file for the large model has 60k updates, so these hyperparameters may need to change if max_updates is reduced to 30k.
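For what it's worth, one common approach when shrinking max_updates is to rescale the schedule phases proportionally; the phase lengths below are placeholders, not the paper's settings:

def rescale_tristage(phases, old_total=60_000, new_total=30_000):
    # scale each phase length by new_total / old_total
    return {name: int(steps * new_total / old_total) for name, steps in phases.items()}

# placeholder phase lengths, NOT the values used in the paper
print(rescale_tristage({"warmup_steps": 20_000, "hold_steps": 20_000, "decay_steps": 20_000}))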

The second question is about punctuation removal and lowercasing before calculating WER. I also observed some special tokens in the dictionary, e.g. the music token ♪. Which tokens did you remove, and how?
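For context, a typical WER text normalization (not necessarily what the authors used) looks like:

import re
import string

def normalize(text):
    text = text.lower().replace("♪", " ")  # drop the music token
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Hello, World! ♪"))  # -> "hello world"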

I'm looking forward to your reply and thank you in advance :)

Best regards,
Zhengyang

RuntimeError: Error(s) in loading state_dict for AVHubertSeq2Seq

Traceback (most recent call last):
  File "/home/lpl/muavic/demo/run_demo.py", line 220, in <module>
    AV_RESOURCES = load_av_models(args.av_models_path)
  File "/home/lpl/muavic/demo/demo_utils.py", line 65, in load_av_models
    models, _, task = checkpoint_utils.load_model_ensemble_and_task(
  File "/home/lpl/av_hubert/fairseq/fairseq/checkpoint_utils.py", line 447, in load_model_ensemble_and_task
    model.load_state_dict(
  File "/home/lpl/av_hubert/fairseq/fairseq/models/fairseq_model.py", line 125, in load_state_dict
    return super().load_state_dict(new_state_dict, strict)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for AVHubertSeq2Seq:
    size mismatch for decoder.layers.0.encoder_attn.k_proj.weight: copying a param with shape torch.Size([768, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for decoder.layers.0.encoder_attn.v_proj.weight: copying a param with shape torch.Size([768, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for decoder.layers.1.encoder_attn.k_proj.weight: copying a param with shape torch.Size([768, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for decoder.layers.1.encoder_attn.v_proj.weight: copying a param with shape torch.Size([768, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for decoder.layers.2.encoder_attn.k_proj.weight: copying a param with shape torch.Size([768, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for decoder.layers.2.encoder_attn.v_proj.weight: copying a param with shape torch.Size([768, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for decoder.layers.3.encoder_attn.k_proj.weight: copying a param with shape torch.Size([768, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for decoder.layers.3.encoder_attn.v_proj.weight: copying a param with shape torch.Size([768, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for decoder.layers.4.encoder_attn.k_proj.weight: copying a param with shape torch.Size([768, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for decoder.layers.4.encoder_attn.v_proj.weight: copying a param with shape torch.Size([768, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for decoder.layers.5.encoder_attn.k_proj.weight: copying a param with shape torch.Size([768, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
    size mismatch for decoder.layers.5.encoder_attn.v_proj.weight: copying a param with shape torch.Size([768, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).

I'm having this issue; is there any solution, please?

Only audio files could be downloaded

Dear authors,
I have downloaded the tgz files myself, but only txt, vtt, and wav files can be found in the directory. How can I download the video files for visual speech recognition? Thanks!

VSR performance lower on MuAViC version of LRS3 (En)

Hi, thanks for your nice work! I preprocessed the MuAViC dataset according to the instructions. I already had LRS3 processed according to the AV-HuBERT instructions, so I wanted to test if a pre-trained model would get the same performance on both the AV-HuBERT dataset version and the MuAViC version of LRS3.

I first tried ckpt=large_noise_pt_noise_ft_433h.pt from AV-HuBERT, and ran this command:

python -B infer_s2s.py --config-dir ./conf/ --config-name s2s_decode.yaml \
  dataset.gen_subset=test common_eval.path=${ckpts_dir}/${ckpt} \
  common_eval.results_path=${exp_dir}/av-hubert/decode/s2s/test \
  override.modalities=['audio', 'video'] override.data=${lrs3_dir}/30h_data override.label_dir=${lrs3_dir}/30h_data common.user_dir=`pwd`

Using the AV-HuBERT version of LRS3:

  • 433h audio-visual: 1.486
  • 433h audio-only: 1.951
  • 433h video-only: 34.135

Using the MuAViC version of LRS3:

  • 433h audio-visual: 1.496 (slightly worse)
  • 433h audio-only: 1.951 (the same)
  • 433h video-only: 35.995 (noticeably worse)

It seems that the AV-HuBERT checkpoint got worse performance on the MuAViC data versions whenever video is involved.

I also tried running the MuAViC decoding script using the MuAViC English checkpoint on the MuAViC version of LRS3 and got the following performance:

  • 433h audio-visual: 2.1941
  • 433h audio-only: 3.22
  • 433h video-only: 35.995

Then I tried the MuAViC decoding script, MuAViC English checkpoint, and the AV-HuBERT LRS3 dataset version:

  • 433h audio-visual: 2.153 (slightly better)
  • 433h audio-only: 3.225 (the same)
  • 433h video-only: 34.459 (noticeably better).

The MuAViC checkpoint also gets better performance on the AV-HuBERT version of LRS3, which is somewhat surprising. In both cases (AV-HuBERT checkpoint or MuAViC checkpoint), the audio-only performance stays identical.
I have also tried this with the other AV-HuBERT checkpoints and the conclusion is the same (the gap was more noticeable for the base models).
I wonder if MuAViC processed the LRS3 video differently than AV-HuBERT, which leads to the different performance? (See the probing sketch below.)
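One way to probe this is to compare per-utterance video lengths between the two manifest versions; this sketch assumes the AV-HuBERT TSV layout (id, video path, audio path, video frames, audio frames) and uses placeholder paths:

def video_frames(tsv_path):
    frames = {}
    for line in open(tsv_path).read().splitlines()[1:]:  # skip the root line
        cols = line.split("\t")
        frames[cols[0]] = int(cols[3])
    return frames

a = video_frames("lrs3_avhubert/test.tsv")  # placeholder path
b = video_frames("muavic/en/test.tsv")      # placeholder path
diff = {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}
print(f"{len(diff)} shared utterances differ in video length")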

Unable to download corpora other than English

Downloading mtedx_el.tgz from https://www.openslr.org/resources/100/mtedx_el.tgz
Traceback (most recent call last):
  File "/home/w30043779/miniconda3/lib/python3.10/urllib/request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/home/w30043779/miniconda3/lib/python3.10/http/client.py", line 1282, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/w30043779/miniconda3/lib/python3.10/http/client.py", line 1328, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/w30043779/miniconda3/lib/python3.10/http/client.py", line 1277, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/w30043779/miniconda3/lib/python3.10/http/client.py", line 1037, in _send_output
    self.send(msg)
  File "/home/w30043779/miniconda3/lib/python3.10/http/client.py", line 975, in send
    self.connect()
  File "/home/w30043779/miniconda3/lib/python3.10/http/client.py", line 1454, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "/home/w30043779/miniconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/home/w30043779/miniconda3/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/home/w30043779/miniconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:997)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/w30043779/code1/muavic-main/utils.py", line 62, in download_file
    wget.download(url, out=str(download_path / filename), bar=custom_bar)
  File "/home/w30043779/miniconda3/lib/python3.10/site-packages/wget.py", line 526, in download
    (tmpfile, headers) = ulib.urlretrieve(binurl, tmpfile, callback)
  File "/home/w30043779/miniconda3/lib/python3.10/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/home/w30043779/miniconda3/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/home/w30043779/miniconda3/lib/python3.10/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/home/w30043779/miniconda3/lib/python3.10/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/home/w30043779/miniconda3/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/home/w30043779/miniconda3/lib/python3.10/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/home/w30043779/miniconda3/lib/python3.10/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:997)>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/w30043779/code1/muavic-main/get_data.py", line 115, in <module>
    main(args)
  File "/home/w30043779/code1/muavic-main/get_data.py", line 84, in main
    prepare_mtedx(args)
  File "/home/w30043779/code1/muavic-main/get_data.py", line 15, in prepare_mtedx
    download_mtedx_data(args["mtedx"], args["src_lang"], args["src_lang"])
  File "/home/w30043779/code1/muavic-main/mtedx_utils.py", line 27, in download_mtedx_data
    download_extract_file_if_not(
  File "/home/w30043779/code1/muavic-main/utils.py", line 89, in download_extract_file_if_not
    download_file(url, download_path)
  File "/home/w30043779/code1/muavic-main/utils.py", line 65, in download_file
    raise HTTPError(e.url, e.code, message, e.hdrs, e.fp)
AttributeError: 'URLError' object has no attribute 'url'
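The final AttributeError comes from re-wrapping a URLError as an HTTPError in utils.py's download_file; URLError has no .url/.code/.hdrs/.fp attributes. A hypothetical defensive variant (a sketch, not the repo's actual code) would re-raise non-HTTP errors as-is:

from pathlib import Path
from urllib.error import HTTPError, URLError

import wget  # pip install wget

def download_file(url: str, download_path: Path) -> None:
    filename = url.rsplit("/", 1)[-1]
    try:
        wget.download(url, out=str(download_path / filename))
    except HTTPError as e:
        # only HTTPError carries url/code/hdrs/fp
        raise HTTPError(e.url, e.code, "download failed", e.hdrs, e.fp)
    except URLError:
        raise  # e.g. SSL certificate verification failures land here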

TEDx Talk with ID=D4TE28-L7FI is not available anymore

I was downloading the MuAViC database for Spanish when an error message appeared while segmenting videos: it seems the video with ID=D4TE28-L7FI is not available anymore. Do you have a backup of the database for these cases? In addition, the script was interrupted; I think it should handle missing videos without aborting.

Best regards,

David.

Error running the data prep script

First, I downloaded lrs3_pretrain.zip, lrs3_test_v0.4.zip, and lrs_3_v0.4_txt.zip, and made sure the checksums matched. Unzipping them gave me three folders: pretrain, lrs3_v0.4, and test. I copied out lrs3_v0.4/trainval and placed it in the root folder beside pretrain.

Next, I ran the command:
python3 get_data.py --root-path . --src-lang en

I got an error during "Creating AVSR manifests for en":

KeyError: 'iW4fCwfw1vg/00033'

Can you please let me know what I'm doing wrong?

Problems when Downloading the Italian Dataset

Hi,

I ran the following command to download the Italian dataset from MuAViC:

python get_data.py --root-path ./esperanza/ --src-lang it

However, at some point during the run the script was interrupted. Please find attached the full error trace:

Traceback (most recent call last):
  File "/home/dgimeno/phd/muavic/utils.py", line 62, in download_file
    wget.download(url, out=str(download_path / filename), bar=custom_bar)
  File "/home/dgimeno/anaconda3/envs/muavic/lib/python3.8/site-packages/wget.py", line 506, in download
    (fd, tmpfile) = tempfile.mkstemp(".tmp", prefix=prefix, dir=".")
  File "/home/dgimeno/anaconda3/envs/muavic/lib/python3.8/tempfile.py", line 331, in mkstemp
    return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/home/dgimeno/anaconda3/envs/muavic/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: './esperanza/metadata/it_metadata.tgz88g65ab3.tmp'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "get_data.py", line 115, in <module>
    main(args)
  File "get_data.py", line 84, in main
    prepare_mtedx(args)
  File "get_data.py", line 26, in prepare_mtedx
    preprocess_mtedx_video(
  File "/home/dgimeno/phd/muavic/mtedx_utils.py", line 220, in preprocess_mtedx_video
    video_metadata = load_video_metadata(
  File "/home/dgimeno/phd/muavic/utils.py", line 110, in load_video_metadata
    download_extract_file_if_not(
  File "/home/dgimeno/phd/muavic/utils.py", line 89, in download_extract_file_if_not
    download_file(url, download_path)
  File "/home/dgimeno/phd/muavic/utils.py", line 65, in download_file
    raise HTTPError(e.url, e.code, message, e.hdrs, e.fp)
AttributeError: 'FileNotFoundError' object has no attribute 'url'

Error when generating the manifest for AVSR

Dear authors,

Thanks a lot for your work. When generating manifests for AVSR, I meet the following error, and the script cannot restart from the break point:

Creating fr/train manifest: 26%|██▌ | 30189/116045 [10:24:50<40:25:00, 1.69s/it]
Creating fr/train manifest: 26%|██▌ | 30195/116045 [10:24:53<25:27:27, 1.07s/it]
Creating fr/train manifest: 26%|██▌ | 30197/116045 [10:24:58<33:41:41, 1.41s/it]
Creating fr/train manifest: 26%|██▌ | 30199/116045 [10:25:00<30:51:53, 1.29s/it]
Creating fr/train manifest: 26%|██▌ | 30202/116045 [10:25:03<29:36:35, 1.24s/it]
Exception in thread QueueManagerThread:
Traceback (most recent call last):
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/process.py", line 394, in _queue_management_worker
    work_item.future.set_exception(bpe)
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/_base.py", line 547, in set_exception
    raise InvalidStateError('{}: {!r}'.format(self._state, self))
concurrent.futures._base.InvalidStateError: CANCELLED: <Future at 0x7fc3fa091ca0 state=cancelled>

concurrent.futures.process._RemoteTraceback:
'''
Traceback (most recent call last):
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/process.py", line 368, in _queue_management_worker
    result_item = result_reader.recv()
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
TypeError: __init__() missing 2 required positional arguments: 'stdout' and 'stderr'
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "get_data.py", line 115, in <module>
    main(args)
  File "get_data.py", line 84, in main
    prepare_mtedx(args)
  File "get_data.py", line 31, in prepare_mtedx
    prepare_mtedx_avsr_manifests(args["mtedx"], args["src_lang"], args["muavic"])
  File "/beegfs/work/zhengyangli/muavic/mtedx_utils.py", line 268, in prepare_mtedx_avsr_manifests
    process_map(
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 105, in process_map
    return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
slurmstepd: error: *** JOB 536548 ON gpu01 CANCELLED AT 2023-04-21T10:16:11 ***

Do you have any clue how to solve this problem?

Best regards,
Zhengyang

Problem met when downloading German data

Hi,
I ran the following command to download the German dataset from MuAViC:
python get_data.py --root-path ./muavic_project --src-lang de
and hit the error below while segmenting (at 21% of "Segmenting de videos files (It takes a few hours to complete)").

  File "get_data.py", line 115, in <module>
    main(args)
  File "get_data.py", line 84, in main
    prepare_mtedx(args)
  File "get_data.py", line 26, in prepare_mtedx
    preprocess_mtedx_video(
  File "/mnt/ceph_rbd/muavic_project/muavic/mtedx_utils.py", line 236, in preprocess_mtedx_video
    process_map(
  File "/mnt/ceph_rbd/applications/anaconda3/envs/avhubert/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 105, in process_map
    return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/mnt/ceph_rbd/applications/anaconda3/envs/avhubert/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/mnt/ceph_rbd/applications/anaconda3/envs/avhubert/lib/python3.8/site-packages/tqdm/std.py", line 1170, in __iter__
    for obj in iterable:
  File "/mnt/ceph_rbd/applications/anaconda3/envs/avhubert/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists    for element in iterable:
  File "/mnt/ceph_rbd/applications/anaconda3/envs/avhubert/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/mnt/ceph_rbd/applications/anaconda3/envs/avhubert/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/mnt/ceph_rbd/applications/anaconda3/envs/avhubert/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

I'm not very familiar with process_map. Do you have any idea what might cause this error, and any suggestions for solving it?
Many thanks.

Noise parameters for decoding and training

I am trying to figure out the noise parameters for the decoding and training scripts to reproduce the results in the paper.
For decoding, I originally tried adding babble noise from MUSAN:

override.noise_wav=/path-to-musan/musan/tsv/babble \
override.noise_prob=1 \
override.noise_snr=0

I found that the average performance of the monolingual and multilingual models in the noisy condition was noticeably better than reported in the paper (while obtaining results similar to the paper's in clean conditions).
I also tried using the babble noise from LRS3 (override.noise_wav=/path-to-lrs3/noise/babble), and the average performance was closer to what was reported in the paper.
Which noise should be used?

For training, are these the right parameters to add?

override.noise_wav=/path-to-musan/musan/tsv/all \
override.noise_prob=0.25 \
override.noise_snr=0

Also, for the pre-trained model ("All models FT from strongest large_vox_iter5.pt"), is this the noisy or the clean pre-trained checkpoint? I assume it's the noisy one, but just double-checking.

Thanks for the help!

Error when preprocessing the video data

Dear authors,

thanks a lot for this contribution to multilingual AV-ASR! I get an error when preprocessing the video data. It happens in:

muavic/mtedx_utils.py

Lines 190 to 201 in 122ef0c

process_map(
    partial(
        segment_normalize_video_file,
        mean_face_metadata,
        metadata_path / src_lang / split,
        video_dir_path,
        out_path,
    ),
    video_segments.items(),
    max_workers=os.cpu_count(),
    chunksize=1,
)

The error is as follows. Did you also encounter it, or do you have any clues on how to solve it?

  0%|          | 0/95 [00:00<?, ?it/s]
  0%|          | 0/95 [00:05<?, ?it/s]
concurrent.futures.process._RemoteTraceback: 
'''
Traceback (most recent call last):
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/process.py", line 368, in _queue_management_worker
    result_item = result_reader.recv()
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
TypeError: __init__() missing 2 required positional arguments: 'stdout' and 'stderr'
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "get_data.py", line 77, in <module>
    main(args)
  File "get_data.py", line 59, in main
    prepare_mtedx(args)
  File "get_data.py", line 22, in prepare_mtedx
    preprocess_mtedx_video(
  File "/beegfs/work/zhengyangli/muavic/mtedx_utils.py", line 190, in preprocess_mtedx_video
    process_map(
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 105, in process_map
    return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/home/zhengyangli/anaconda3/envs/muavic/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

download_ted2020() error

It seems that you try to parse a zip file using GzipFile?
Here is the traceback:
Downloading el-en.txt.zip from https://opus.nlpl.eu/download.php?f=TED2020/v1/moses/el-en.txt.zip
Traceback (most recent call last):
  File "get_data.py", line 107, in <module>
    main(args)
  File "get_data.py", line 73, in main
    prepare_lrs3(args)
  File "get_data.py", line 59, in prepare_lrs3
    download_ted2020(args["ted2020"])
  File "/mnt/pfs/wanghe/corpus/muavic/muavic/lrs3_utils.py", line 345, in download_ted2020
    extract_ted2020_data(str(tgz_filepath), "en", lang, ted2020_path)
  File "/mnt/pfs/wanghe/corpus/muavic/muavic/lrs3_utils.py", line 308, in extract_ted2020_data
    tmx_dict = xmltodict.parse(GzipFile(tgz_filepath))
  File "/opt/conda/envs/oslasr/lib/python3.8/site-packages/xmltodict.py", line 372, in parse
    parser.ParseFile(xml_input)
  File "/opt/conda/envs/oslasr/lib/python3.8/gzip.py", line 292, in read
    return self._buffer.read(size)
  File "/opt/conda/envs/oslasr/lib/python3.8/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/opt/conda/envs/oslasr/lib/python3.8/gzip.py", line 479, in read
    if not self._read_gzip_header():
  File "/opt/conda/envs/oslasr/lib/python3.8/gzip.py", line 427, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'PK')
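A possible fix, given the b'PK' magic in the error: the TED2020 "moses" downloads are zip archives, so they should be opened with ZipFile rather than GzipFile. A sketch (the member names are assumptions):

from zipfile import ZipFile

with ZipFile("el-en.txt.zip") as zf:
    print(zf.namelist())  # e.g. TED2020.el-en.el, TED2020.el-en.en
    with zf.open(zf.namelist()[0]) as member:
        print(member.readline().decode("utf-8"))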

A small bug during audio pre-processing

Hi,
I just found an error when pre-processing the audio data in muavic/mtedx_utils.py, lines 77 to 102 in 122ef0c:

for split in ["train", "valid", "test"]:
    # create directory for segmented & normalized audio
    out_path = muavic_path / src_lang / "audio" / split
    out_path.mkdir(parents=True, exist_ok=True)
    if not is_empty(out_path):
        if split == "train":
            print(f"\nSegmenting {src_lang} audio files")
        # collect needed info from segment file
        segments_info = []
        split_dir_path = mtedx_path / f"{src_lang}-{src_lang}" / "data" / split
        wav_dir_path = split_dir_path / "wav"
        segment_file = split_dir_path / "txt" / "segments"
        for line in read_txt_file(segment_file):
            seg_id, fid, start, end = line.strip().split(' ')
            segments_info.append(
                (wav_dir_path/(fid+".flac"), fid, seg_id, float(start), float(end))
            )
        # preprocess audio files
        process_map(
            partial(segment_normalize_audio_file, out_path),
            segments_info,
            max_workers=os.cpu_count(),
            desc=f"Preprocessing {src_lang}/{split} Audios",
            chunksize=1,
        )

There is additional indentation on lines 84 to 102. I corrected it to the following:

def preprocess_mtedx_audio(mtedx_path, src_lang, muavic_path):
    # get files id per split
    for split in ["train", "valid", "test"]:
        # create directory for segmented & normalized audio
        out_path = muavic_path / src_lang / "audio" / split
        out_path.mkdir(parents=True, exist_ok=True)
        if not is_empty(out_path):
            if split == "train":
                print(f"\nSegmenting {src_lang} audio files")
        # collect needed info from segment file
        segments_info = []
        split_dir_path = mtedx_path / f"{src_lang}-{src_lang}" / "data" / split
        wav_dir_path = split_dir_path / "wav"
        segment_file = split_dir_path / "txt" / "segments"
            
        for line in read_txt_file(segment_file):
            seg_id, fid, start, end = line.strip().split(' ')
            segments_info.append(
                (wav_dir_path/(fid+".flac"), fid, seg_id, float(start), float(end))
            )
        # preprocess audio files
        process_map(
            partial(segment_normalize_audio_file, out_path),
            segments_info,
            max_workers=os.cpu_count(),
            desc=f"Preprocessing {src_lang}/{split} Audios",
            chunksize=1,
        )

Then the code can work ;)

Got error when preparing LRS3

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "get_data.py", line 107, in <module>
    main(args)
  File "get_data.py", line 73, in main
    prepare_lrs3(args)
  File "get_data.py", line 53, in prepare_lrs3
    process_lrs3_videos(args["lrs3"], args["metadata"], args["muavic"])
  File "/mnt/pfs/wanghe/corpus/muavic/muavic/lrs3_utils.py", line 239, in process_lrs3_videos
    process_map(
  File "/opt/conda/envs/oslasr/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 130, in process_map
    return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/opt/conda/envs/oslasr/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 76, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, **map_args), **kwargs))
  File "/opt/conda/envs/oslasr/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/opt/conda/envs/oslasr/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/opt/conda/envs/oslasr/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/opt/conda/envs/oslasr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/opt/conda/envs/oslasr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
AssertionError: /mnt/pfs/wanghe/corpus/muavic/metadata/en/train/FPhZGDS6kVQ.pkl should've been downloaded!
