CRESS: Understanding and Bridging the Modality Gap for Speech Translation

Qingkai Fang, Yang Feng* | Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)

This is a PyTorch implementation of the ACL 2023 main conference paper Understanding and Bridging the Modality Gap for Speech Translation.

🙌 We provide our code, ST/MT model weights, and processed ST/MT data in this repository.

👀 Also see our other works dedicated to bridging the modality gap for speech translation (ST):

Release

We have released the following assets for all 8 translation directions of MuST-C:

Processed ST data in .tsv format
Processed external MT data in fairseq binary format
SentencePiece vocabulary
Pretrained MT models in both base and expand settings
Pretrained CRESS models in both base and expand settings

	Link	Password
Processed ST Data	https://pan.baidu.com/s/1J7BgcbSNwma4SdJfHENRdg	94wu
Processed MT Data	https://pan.baidu.com/s/1gDMOU35_pug73y0kd-F3vw	6tbk
Vocabulary	https://pan.baidu.com/s/13ucCEVzAdxRu99bdZ2oIdw	nph3
MT Model (base)	https://pan.baidu.com/s/1xm6myQfY-wYS4D0_rMBT_g	tm6k
MT Model (expand)	https://pan.baidu.com/s/1byufAhoYQmgA8DCf9WUZQg	61g4
CRESS Model (base)	https://pan.baidu.com/s/1_KCS_-a_Ss4Bm40dTQc6Vw	ra8j
CRESS Model (expand)	https://pan.baidu.com/s/1zGJKmJf8TEnwBLzpOmfGYQ	ctyf

Environment Configuration

Clone this repository:

git clone git@github.com:ictnlp/CRESS.git
cd CRESS/

Install fairseq:

cd fairseq/
pip install --editable ./
python setup.py build develop

We organize our implementation as fairseq plug-ins in the cress directory:

.
├── criterions
│   ├── __init__.py
│   ├── speech_and_text_translation_criterion.py
│   ├── speech_and_text_translation_with_oracle_reg_adaptive_criterion.py
│   └── speech_and_text_translation_with_oracle_reg_criterion.py
├── datasets
│   ├── audio_utils.py
│   ├── __init__.py
│   ├── speech_and_text_translation_dataset.py
│   └── speech_to_text_dataset.py
├── __init__.py
├── models
│   ├── hubert_transformer.py
│   └── __init__.py
├── tasks
│   ├── __init__.py
│   ├── speech_and_text_translation.py
│   └── speech_to_text_modified.py
├── test_scripts
│   ├── avg_epoch.sh
│   ├── test.en-x.mt.sh
│   └── test.en-x.st.sh
└── train_scripts
    ├── train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.mustc_st+mt.cress_adaptive.sh
    ├── train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.mustc_st+mt.cress.sh
    ├── train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.mustc_st+mt.sh
    ├── train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.sh
    └── train.en-x.postln.wmt_pretrain.sh

You can import our implementation with --user-dir cress in fairseq.

Data Preparation

Make directories to store ST (MuST-C) and MT (WMT) datasets. Please specify the target language via $TGT_LANG:

TGT_LANG=de
MUSTC_ROOT=data/mustc
WMT_ROOT=data/wmt
mkdir -p $MUSTC_ROOT $WMT_ROOT

Download the MuST-C v1.0 archive to the $MUSTC_ROOT directory and uncompress it:

cd $MUSTC_ROOT
tar -xzvf MUSTC_v1.0_en-${TGT_LANG}.tar.gz

We provide the processed ST data and the SentencePiece vocabulary files. You can download them via the Baidu Netdisk:

	Link	Password
Processed ST Data	https://pan.baidu.com/s/1J7BgcbSNwma4SdJfHENRdg	94wu
Vocabulary	https://pan.baidu.com/s/13ucCEVzAdxRu99bdZ2oIdw	nph3

Put the downloaded files in the $MUSTC_ROOT/en-${TGT_LANG}/ directory. It should look like this:

.
├── binary
├── config.yaml
├── data
├── dev.tsv
├── docs
├── spm_unigram10000.model
├── spm_unigram10000.txt
├── spm_unigram10000.vocab
├── train.tsv
└── tst-COMMON.tsv
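
A quick way to confirm the layout above before training is to check that each expected file is present. The following is a minimal sketch, not part of the repository: `check_mustc_dir` is a hypothetical helper name, and it only checks the regular files listed above (it assumes `binary`, `data`, and `docs` are directories and does not inspect them).

```shell
# Hypothetical helper (not part of CRESS): report files from the listing
# above that are missing in a MuST-C language directory.
check_mustc_dir() {
    dir=$1
    missing=0
    for f in config.yaml dev.tsv train.tsv tst-COMMON.tsv \
             spm_unigram10000.model spm_unigram10000.txt spm_unigram10000.vocab; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    # Exit status 0 means every listed file was found.
    return $missing
}

# Usage: check_mustc_dir "$MUSTC_ROOT/en-${TGT_LANG}"
```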

For MT pretraining, we need additional MT datasets. We provide the processed MT data in the fairseq binary format. You can download them via the Baidu Netdisk:

	Link	Password
Processed MT Data	https://pan.baidu.com/s/1gDMOU35_pug73y0kd-F3vw	6tbk

Put the downloaded files in the $WMT_ROOT/en-${TGT_LANG} directory.

Model Training

Model training consists of two steps: MT pretraining and ST finetuning.

In the base setting, we pretrain the model with <transcription, translation> pairs from the MuST-C dataset.
In the expand setting, we first pretrain the model with external MT datasets, and then pretrain the model with <transcription, translation> pairs from MuST-C.

All the training scripts below are configured to run on 4 GPUs. If you have a different number of GPUs, adjust --update-freq accordingly.
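
In fairseq, the effective batch size scales with the number of GPUs multiplied by --update-freq, so one reasonable way to adjust is to keep that product constant. The sketch below assumes this goal. `scaled_update_freq` is a hypothetical helper, and the reference value of 2 in the example is a placeholder; read the actual --update-freq from the training script you are running.

```shell
# Hypothetical helper: compute an --update-freq that keeps
# (number of GPUs) x (update-freq) at least as large as in the reference setup.
# Arguments: REF_GPUS REF_UPDATE_FREQ NUM_GPUS
scaled_update_freq() {
    ref_gpus=$1
    ref_freq=$2
    num_gpus=$3
    total=$((ref_gpus * ref_freq))
    # Integer ceiling division, so the effective batch size never shrinks.
    echo $(( (total + num_gpus - 1) / num_gpus ))
}

# Example with a placeholder reference of 4 GPUs and --update-freq 2,
# run on a single GPU:
scaled_update_freq 4 2 1   # prints 8
```

When the GPU count does not divide the reference product evenly, the result is rounded up, which makes the effective batch slightly larger than in the reference setup.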

Before training, please download the HuBERT-Base model and save it as checkpoints/hubert_base_ls960.pt.

MT Pretraining

(Optional) Pretrain the model with the external MT dataset. Please run the script:

sh cress/train_scripts/train.en-x.postln.wmt_pretrain.sh $TGT_LANG

You should adjust the maximum training steps (--max-update) based on the size of the training data.

After training, please average the last 5 checkpoints:

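# Checkpoint directory name; $TGT_LANG is the target language set during data preparation
CKPT=en-${TGT_LANG}.postln.wmt_pretrain
python scripts/average_checkpoints.py \
    --inputs checkpoints/$CKPT \
    --num-epoch-checkpoints 5 \
    --output checkpoints/$CKPT/avg_last_5_epoch.pt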

Pretrain the model with <transcription, translation> pairs from the MuST-C dataset. Please run the script:

sh cress/train_scripts/train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.sh $TGT_LANG

After training, please average the last 10 checkpoints. You can use the script cress/test_scripts/avg_epoch.sh. The averaged checkpoint will be used to initialize the ST model.
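
Before averaging, it can be useful to see which epoch checkpoints would be included. The sketch below lists the last N files named checkpoint<epoch>.pt in a directory, which is the naming fairseq uses for per-epoch checkpoints. `last_epoch_checkpoints` is a hypothetical helper and does not reproduce the contents of avg_epoch.sh.

```shell
# Hypothetical helper: print the paths of the last N per-epoch checkpoints
# (checkpoint<epoch>.pt) in a directory, ordered by epoch number.
# Arguments: DIR N
last_epoch_checkpoints() {
    dir=$1
    n=$2
    for f in "$dir"/checkpoint[0-9]*.pt; do
        [ -e "$f" ] || continue            # glob matched nothing
        name=${f##*/}                      # strip the directory part
        epoch=${name#checkpoint}
        epoch=${epoch%.pt}
        case $epoch in
            *[!0-9]*) continue ;;          # skip names that are not purely an epoch number
        esac
        echo "$epoch"
    done | sort -n | tail -n "$n" | while read -r e; do
        echo "$dir/checkpoint$e.pt"
    done
}

# Usage: last_epoch_checkpoints checkpoints/$CKPT 10
```

Files such as checkpoint_last.pt and checkpoint_best.pt are ignored because they do not carry an epoch number.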

To ensure consistent performance, we have released our checkpoints of pretrained MT models in both base and expand settings. You can download them via the Baidu Netdisk.

	Link	Password
MT (base)	https://pan.baidu.com/s/1xm6myQfY-wYS4D0_rMBT_g	tm6k
MT (expand)	https://pan.baidu.com/s/1byufAhoYQmgA8DCf9WUZQg	61g4

Multitask Learning

For multitask learning (the MTL baseline in the paper), please run the script:

sh cress/train_scripts/train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.mustc_st+mt.sh $TGT_LANG

Cross-modal Regularization with Scheduled Sampling (CRESS)

For the CRESS training, please first run the script below. Note that token-level adaptive training is not used for the first 20 epochs of training.

sh cress/train_scripts/train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.mustc_st+mt.cress.sh $TGT_LANG

For the subsequent training epochs, token-level adaptive training will be used. Please run the script:

sh cress/train_scripts/train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.mustc_st+mt.cress_adaptive.sh $TGT_LANG
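
The two-stage schedule can be made explicit with a small helper that picks the script from the number of epochs already completed. This is only an illustrative sketch: `cress_script_for_epoch` is a hypothetical name, the 20-epoch boundary is taken from the description above, and how the second script picks up the first stage's checkpoints is defined by the scripts themselves, not by this helper.

```shell
# Hypothetical helper: choose the CRESS training script for the current stage.
# Argument: number of epochs already completed.
cress_script_for_epoch() {
    done_epochs=$1
    prefix=cress/train_scripts/train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.mustc_st+mt
    if [ "$done_epochs" -lt 20 ]; then
        # First 20 epochs: CRESS without token-level adaptive training.
        echo "$prefix.cress.sh"
    else
        # Afterwards: CRESS with token-level adaptive training.
        echo "$prefix.cress_adaptive.sh"
    fi
}

# Usage: sh "$(cress_script_for_epoch 0)" $TGT_LANG
```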

We also released checkpoints of CRESS. You can download and evaluate them.

	Link	Password
CRESS (base)	https://pan.baidu.com/s/1_KCS_-a_Ss4Bm40dTQc6Vw	ra8j
CRESS (expand)	https://pan.baidu.com/s/1zGJKmJf8TEnwBLzpOmfGYQ	ctyf

Evaluation

For evaluation, please first average the last 10 checkpoints using the cress/test_scripts/avg_epoch.sh script. Next, please use the scripts below to evaluate the ST/MT performance of the averaged checkpoint.

The values of --lenpen vary across different target languages as follows:

TGT_LANG	De	Fr	Es	Ro	Ru	It	Pt	Nl
`--lenpen`	1.2	1.8	0.6	1.4	0.8	1.0	1.4	1.0
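
To avoid looking up the table by hand, the values can be wrapped in a small lookup function. The sketch below is a convenience only: `lenpen_for` is a hypothetical helper, and it assumes lowercase language codes as used for $TGT_LANG during data preparation.

```shell
# Hypothetical helper: map a target language code to the --lenpen value
# from the table above.
lenpen_for() {
    case $1 in
        de) echo 1.2 ;;
        fr) echo 1.8 ;;
        es) echo 0.6 ;;
        ro) echo 1.4 ;;
        ru) echo 0.8 ;;
        it) echo 1.0 ;;
        pt) echo 1.4 ;;
        nl) echo 1.0 ;;
        *)  echo "unknown target language: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   LENPEN=$(lenpen_for "$TGT_LANG")
```

An unknown code prints an error to stderr and returns a non-zero status, so a typo in $TGT_LANG does not silently fall back to a default.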

ST Evaluation

To evaluate the ST performance of the model, please use the cress/test_scripts/test.en-x.st.sh script:

sh cress/test_scripts/test.en-x.st.sh $CKPT $TGT_LANG $LENPEN

MT Evaluation

To evaluate the MT performance of the model, please use the cress/test_scripts/test.en-x.mt.sh script:

sh cress/test_scripts/test.en-x.mt.sh $CKPT $TGT_LANG $LENPEN

Citation

If this repository is useful for you, please cite as:

@inproceedings{fang-and-feng-2023-understanding,
	title = {Understanding and Bridging the Modality Gap for Speech Translation},
	author = {Fang, Qingkai and Feng, Yang},
	booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics},
	year = {2023},
}

The program got stuck while executing the script test.en-x.st.sh

2023-07-12 20:20:51 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': False, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': True}
2023-07-12 20:20:56 | INFO | cress.tasks.speech_to_text_modified | pre-tokenizer: {'tokenizer': None}
2023-07-12 20:20:56 | INFO | cress.tasks.speech_to_text_modified | tokenizer: {'bpe': 'sentencepiece', 'sentencepiece_model': 'xxxx/data/mustc/en-de/spm_unigram10000.model'}
2023-07-12 20:20:56 | INFO | cress.datasets.speech_to_text_dataset | 'tst-COMMON' has 0.00% OOV
2023-07-12 20:20:56 | INFO | cress.datasets.speech_to_text_dataset | SpeechToTextDataset(split="tst-COMMON", n_samples=2_641, prepend_tgt_lang_tag=False, shuffle=False, transforms=None, n_frames_per_step=1
2023-07-12 20:20:59 | INFO | fairseq.tasks.fairseq_task | can_reuse_epoch_itr = True
2023-07-12 20:20:59 | INFO | fairseq.tasks.fairseq_task | reuse_dataloader = True
2023-07-12 20:20:59 | INFO | fairseq.tasks.fairseq_task | rebuild_batches = False
2023-07-12 20:20:59 | INFO | fairseq.tasks.fairseq_task | creating new batches for epoch 1
0%| | 0/154 [00:00<?, ?it/s]2023-07-12 20:20:59 | INFO | cress.tasks.speech_to_text_modified | pre-tokenizer: {'tokenizer': None}
2023-07-12 20:20:59 | INFO | cress.tasks.speech_to_text_modified | tokenizer: {'bpe': 'sentencepiece', 'sentencepiece_model': 'xxxx/data/mustc/en-de/spm_unigram10000.model'}

The program hangs at this point. Is there a problem in speech_to_text_modified.py?
