cress's Introduction

CRESS: Understanding and Bridging the Modality Gap for Speech Translation

Qingkai Fang, Yang Feng* | Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)

This is a PyTorch implementation of the ACL 2023 main conference paper Understanding and Bridging the Modality Gap for Speech Translation.

🙌 We provide our code, ST/MT model weights, and processed ST/MT data in this repository.

👀 Also see our other works dedicated to bridging the modality gap for speech translation (ST):

Release

We have released the following assets for all 8 translation directions of MuST-C:

  • Processed ST data in .tsv format
  • Processed external MT data in fairseq binary format
  • SentencePiece vocabulary
  • Pretrained MT models in both base and expand settings
  • Pretrained CRESS models in both base and expand settings

Asset                 Link                                             Password
Processed ST Data     https://pan.baidu.com/s/1J7BgcbSNwma4SdJfHENRdg  94wu
Processed MT Data     https://pan.baidu.com/s/1gDMOU35_pug73y0kd-F3vw  6tbk
Vocabulary            https://pan.baidu.com/s/13ucCEVzAdxRu99bdZ2oIdw  nph3
MT Model (base)       https://pan.baidu.com/s/1xm6myQfY-wYS4D0_rMBT_g  tm6k
MT Model (expand)     https://pan.baidu.com/s/1byufAhoYQmgA8DCf9WUZQg  61g4
CRESS Model (base)    https://pan.baidu.com/s/1_KCS_-a_Ss4Bm40dTQc6Vw  ra8j
CRESS Model (expand)  https://pan.baidu.com/s/1zGJKmJf8TEnwBLzpOmfGYQ  ctyf

Environment Configuration

  1. Clone this repository:
git clone git@github.com:ictnlp/CRESS.git
cd CRESS/
  2. Install fairseq:
cd fairseq/
pip install --editable ./
python setup.py build develop
  3. We organize our implementation as fairseq plug-ins in the cress directory:
.
├── criterions
│   ├── __init__.py
│   ├── speech_and_text_translation_criterion.py
│   ├── speech_and_text_translation_with_oracle_reg_adaptive_criterion.py
│   └── speech_and_text_translation_with_oracle_reg_criterion.py
├── datasets
│   ├── audio_utils.py
│   ├── __init__.py
│   ├── speech_and_text_translation_dataset.py
│   └── speech_to_text_dataset.py
├── __init__.py
├── models
│   ├── hubert_transformer.py
│   └── __init__.py
├── tasks
│   ├── __init__.py
│   ├── speech_and_text_translation.py
│   └── speech_to_text_modified.py
├── test_scripts
│   ├── avg_epoch.sh
│   ├── test.en-x.mt.sh
│   └── test.en-x.st.sh
└── train_scripts
    ├── train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.mustc_st+mt.cress_adaptive.sh
    ├── train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.mustc_st+mt.cress.sh
    ├── train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.mustc_st+mt.sh
    ├── train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.sh
    └── train.en-x.postln.wmt_pretrain.sh

You can import our implementation with --user-dir cress in fairseq.

Data Preparation

  1. Make directories to store ST (MuST-C) and MT (WMT) datasets. Please specify the target language via $TGT_LANG:
TGT_LANG=de
MUSTC_ROOT=data/mustc
WMT_ROOT=data/wmt
mkdir -p $MUSTC_ROOT $WMT_ROOT
  2. Download the MuST-C v1.0 archive to the $MUSTC_ROOT directory and uncompress it:
cd $MUSTC_ROOT
tar -xzvf MUSTC_v1.0_en-${TGT_LANG}.tar.gz
  3. We provide the processed ST data and the SentencePiece vocabulary files. You can download them from Baidu Netdisk:
Asset              Link                                             Password
Processed ST Data  https://pan.baidu.com/s/1J7BgcbSNwma4SdJfHENRdg  94wu
Vocabulary         https://pan.baidu.com/s/13ucCEVzAdxRu99bdZ2oIdw  nph3

Put the downloaded files in the $MUSTC_ROOT/en-${TGT_LANG}/ directory. It should look like this:

.
├── binary
├── config.yaml
├── data
├── dev.tsv
├── docs
├── spm_unigram10000.model
├── spm_unigram10000.txt
├── spm_unigram10000.vocab
├── train.tsv
└── tst-COMMON.tsv
  4. For MT pretraining, we need additional MT datasets. We provide the processed MT data in the fairseq binary format. You can download them from Baidu Netdisk:
Asset              Link                                             Password
Processed MT Data  https://pan.baidu.com/s/1gDMOU35_pug73y0kd-F3vw  6tbk

Put the downloaded files in the $WMT_ROOT/en-${TGT_LANG} directory.
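
Before moving on to training, it can help to confirm that the ST files ended up where the listing above says they should be. The following is a minimal sketch, not part of this repository: the helper name missing_entries is hypothetical, and the expected names are simply copied from the directory listing in step 3.

```python
from pathlib import Path
from typing import List

# Entries copied from the directory listing above; adjust if your layout differs.
EXPECTED_ENTRIES = [
    "binary",
    "config.yaml",
    "data",
    "dev.tsv",
    "docs",
    "spm_unigram10000.model",
    "spm_unigram10000.txt",
    "spm_unigram10000.vocab",
    "train.tsv",
    "tst-COMMON.tsv",
]


def missing_entries(mustc_root: str, tgt_lang: str) -> List[str]:
    """Return the expected entries absent from <mustc_root>/en-<tgt_lang>/."""
    lang_dir = Path(mustc_root) / f"en-{tgt_lang}"
    return [name for name in EXPECTED_ENTRIES if not (lang_dir / name).exists()]


if __name__ == "__main__":
    # Uses the MUSTC_ROOT and TGT_LANG values from step 1.
    missing = missing_entries("data/mustc", "de")
    print("missing:", missing if missing else "nothing")
```

An empty result only means the names are present; it does not validate their contents.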

Model Training

Model training consists of two steps: MT pretraining and ST finetuning.

  • In the base setting, we pretrain the model with <transcription, translation> pairs from the MuST-C dataset.
  • In the expand setting, we first pretrain the model with external MT datasets, and then pretrain the model with <transcription, translation> pairs from MuST-C.

All the training scripts below are configured to run on 4 GPUs. You can adjust --update-freq depending on the number of GPUs you have available.
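
In fairseq, gradients are accumulated over --update-freq batches on each GPU, so the effective batch size scales with (number of GPUs) × (update frequency). The sketch below shows one way to pick a new value, assuming you want to keep that product at least as large as in the 4-GPU configuration. The reference --update-freq of each script is not stated in this README, so reference_update_freq is a placeholder that you should read from the script you run; scaled_update_freq is a hypothetical helper.

```python
def scaled_update_freq(
    reference_update_freq: int,
    available_gpus: int,
    reference_gpus: int = 4,
) -> int:
    """Return an --update-freq that preserves reference_gpus * reference_update_freq.

    When the product is not divisible by available_gpus, the result is rounded
    up, so the effective batch size is slightly larger than the reference.
    """
    if available_gpus <= 0 or reference_update_freq <= 0 or reference_gpus <= 0:
        raise ValueError("GPU counts and update frequency must be positive")
    total = reference_gpus * reference_update_freq
    # Ceiling division without floating point.
    return -(-total // available_gpus)


if __name__ == "__main__":
    # Placeholder value: suppose a script sets --update-freq 2 for 4 GPUs.
    print(scaled_update_freq(reference_update_freq=2, available_gpus=1))  # 8
```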

Before training, please download the HuBERT-Base model and place it in the checkpoints/hubert_base_ls960.pt path.
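
The path checkpoints/hubert_base_ls960.pt is relative, so it is resolved against the directory from which the training or test script is launched. A small pre-flight check along the following lines fails early and reports the absolute path it looked for. This is a sketch only: require_checkpoint is a hypothetical helper, not part of this repository, and the hint about the repository root assumes the scripts are launched from there, as the commands in this README suggest.

```python
from pathlib import Path

# Path named in this README; it is relative to the current working directory.
HUBERT_PATH = "checkpoints/hubert_base_ls960.pt"


def require_checkpoint(path: str = HUBERT_PATH) -> Path:
    """Return the absolute checkpoint path, or raise if the file is missing."""
    resolved = Path(path).resolve()
    if not resolved.is_file():
        raise FileNotFoundError(
            f"Expected the HuBERT-Base checkpoint at {resolved}. "
            "Download it, or launch the script from the repository root."
        )
    return resolved


if __name__ == "__main__":
    print(require_checkpoint())
```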

MT Pretraining

  1. (Optional) Pretrain the model with the external MT dataset. Please run the script:
sh cress/train_scripts/train.en-x.postln.wmt_pretrain.sh $TGT_LANG

You should adjust the maximum training steps (--max-update) based on the size of the training data.

After training, please average the last 5 checkpoints:

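# Assumes the experiment directory is named after the target language and that
# the averaged checkpoint is written next to the inputs.
ckpt=en-${TGT_LANG}.postln.wmt_pretrain
python scripts/average_checkpoints.py \
    --inputs checkpoints/$ckpt \
    --num-epoch-checkpoints 5 \
    --output checkpoints/$ckpt/avg_last_5_epoch.pt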
  2. Pretrain the model with <transcription, translation> pairs from the MuST-C dataset. Please run the script:
sh cress/train_scripts/train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.sh $TGT_LANG

After training, please average the last 10 checkpoints. You can use the script cress/test_scripts/avg_epoch.sh. The averaged checkpoint will be used to initialize the ST model.
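
Checkpoint averaging takes the element-wise mean of each parameter across the selected checkpoints. The sketch below illustrates the idea in plain Python on toy data. It is not the code in scripts/average_checkpoints.py, which operates on PyTorch tensors loaded from checkpoint files; the function name and the toy parameter name "w" are made up for illustration.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def average_params(checkpoints: List[Params]) -> Params:
    """Element-wise mean of same-shaped parameters across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != names:
            raise KeyError("all checkpoints must share the same parameter names")
    count = len(checkpoints)
    averaged: Params = {}
    for name in names:
        # zip(*...) groups the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged


if __name__ == "__main__":
    last_two = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    print(average_params(last_two))  # {'w': [2.0, 3.0]}
```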

To ensure consistent performance, we have released our checkpoints of pretrained MT models in both base and expand settings. You can download them via the Baidu Netdisk.

Asset        Link                                             Password
MT (base)    https://pan.baidu.com/s/1xm6myQfY-wYS4D0_rMBT_g  tm6k
MT (expand)  https://pan.baidu.com/s/1byufAhoYQmgA8DCf9WUZQg  61g4

Multitask Learning

  1. For multitask learning (the MTL baseline in the paper), please run the script:
sh cress/train_scripts/train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.mustc_st+mt.sh $TGT_LANG

Cross-modal Regularization with Scheduled Sampling (CRESS)

  1. For the CRESS training, please first run the script below. Note that token-level adaptive training is not used for the first 20 epochs of training.
sh cress/train_scripts/train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.mustc_st+mt.cress.sh $TGT_LANG
  2. For the subsequent training epochs, token-level adaptive training will be used. Please run the script:
sh cress/train_scripts/train.en-x.postln.wmt_pretrain.mustc_mt_pretrain.mustc_st+mt.cress_adaptive.sh $TGT_LANG
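
The training stages described so far can be summarized as an ordered list of scripts. The sketch below is only one reading of the steps in this README: it assumes the base setting simply skips the optional external-MT pretraining step, and training_scripts is a hypothetical helper rather than something shipped in the repository. Checkpoint averaging between stages is not shown.

```python
from typing import List

SCRIPT_DIR = "cress/train_scripts"
PREFIX = "train.en-x.postln.wmt_pretrain"


def training_scripts(setting: str) -> List[str]:
    """Return the CRESS training scripts in run order for 'base' or 'expand'."""
    if setting not in ("base", "expand"):
        raise ValueError("setting must be 'base' or 'expand'")
    names = []
    if setting == "expand":
        # Optional step: pretrain on the external MT data first.
        names.append(f"{PREFIX}.sh")
    names += [
        # MT pretraining on MuST-C <transcription, translation> pairs.
        f"{PREFIX}.mustc_mt_pretrain.sh",
        # CRESS training without token-level adaptive training.
        f"{PREFIX}.mustc_mt_pretrain.mustc_st+mt.cress.sh",
        # CRESS training with token-level adaptive training.
        f"{PREFIX}.mustc_mt_pretrain.mustc_st+mt.cress_adaptive.sh",
    ]
    return [f"{SCRIPT_DIR}/{name}" for name in names]


if __name__ == "__main__":
    for script in training_scripts("expand"):
        print("sh", script, "$TGT_LANG")
```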

We also released checkpoints of CRESS. You can download and evaluate them.

Asset           Link                                             Password
CRESS (base)    https://pan.baidu.com/s/1_KCS_-a_Ss4Bm40dTQc6Vw  ra8j
CRESS (expand)  https://pan.baidu.com/s/1zGJKmJf8TEnwBLzpOmfGYQ  ctyf

Evaluation

For evaluation, please first average the last 10 checkpoints using the cress/test_scripts/avg_epoch.sh script. Next, please use the scripts below to evaluate the ST/MT performance of the averaged checkpoint.

The values of --lenpen vary across different target languages as follows:

TGT_LANG  De   Fr   Es   Ro   Ru   It   Pt   Nl
--lenpen  1.2  1.8  0.6  1.4  0.8  1.0  1.4  1.0

ST Evaluation

To evaluate the ST performance of the model, please use the cress/test_scripts/test.en-x.st.sh script:

sh cress/test_scripts/test.en-x.st.sh $CKPT $TGT_LANG $LENPEN

MT Evaluation

To evaluate the MT performance of the model, please use the cress/test_scripts/test.en-x.mt.sh script:

sh cress/test_scripts/test.en-x.mt.sh $CKPT $TGT_LANG $LENPEN
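
To avoid looking up the length penalty by hand, the --lenpen table above can be kept as a small lookup. The sketch below builds the argument list for either test script; eval_command is a hypothetical helper, the values are copied from the table in this README, and the checkpoint argument in the example is a placeholder.

```python
from typing import List

# --lenpen per target language, copied from the table above.
LENPEN = {
    "de": 1.2, "fr": 1.8, "es": 0.6, "ro": 1.4,
    "ru": 0.8, "it": 1.0, "pt": 1.4, "nl": 1.0,
}


def eval_command(ckpt: str, tgt_lang: str, task: str = "st") -> List[str]:
    """Build the argument list for test.en-x.st.sh or test.en-x.mt.sh."""
    if task not in ("st", "mt"):
        raise ValueError("task must be 'st' or 'mt'")
    lang = tgt_lang.lower()
    if lang not in LENPEN:
        raise KeyError(f"no --lenpen listed for target language {tgt_lang!r}")
    script = f"cress/test_scripts/test.en-x.{task}.sh"
    return ["sh", script, ckpt, lang, str(LENPEN[lang])]


if __name__ == "__main__":
    # "my_checkpoint" stands in for whatever you pass as $CKPT.
    print(" ".join(eval_command("my_checkpoint", "de")))
```

The returned list can be passed to subprocess.run if you prefer to launch the evaluation from Python.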

Citation

If this repository is useful for you, please cite as:

@inproceedings{fang-and-feng-2023-understanding,
	title = {Understanding and Bridging the Modality Gap for Speech Translation},
	author = {Fang, Qingkai and Feng, Yang},
	booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics},
	year = {2023},
}

cress's Issues

The program got stuck while executing the script test.en-x.st.sh

2023-07-12 20:20:51 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': False, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': True}
2023-07-12 20:20:56 | INFO | cress.tasks.speech_to_text_modified | pre-tokenizer: {'tokenizer': None}
2023-07-12 20:20:56 | INFO | cress.tasks.speech_to_text_modified | tokenizer: {'bpe': 'sentencepiece', 'sentencepiece_model': 'xxxx/data/mustc/en-de/spm_unigram10000.model'}
2023-07-12 20:20:56 | INFO | cress.datasets.speech_to_text_dataset | 'tst-COMMON' has 0.00% OOV
2023-07-12 20:20:56 | INFO | cress.datasets.speech_to_text_dataset | SpeechToTextDataset(split="tst-COMMON", n_samples=2_641, prepend_tgt_lang_tag=False, shuffle=False, transforms=None, n_frames_per_step=1
2023-07-12 20:20:59 | INFO | fairseq.tasks.fairseq_task | can_reuse_epoch_itr = True
2023-07-12 20:20:59 | INFO | fairseq.tasks.fairseq_task | reuse_dataloader = True
2023-07-12 20:20:59 | INFO | fairseq.tasks.fairseq_task | rebuild_batches = False
2023-07-12 20:20:59 | INFO | fairseq.tasks.fairseq_task | creating new batches for epoch 1
0%| | 0/154 [00:00<?, ?it/s]2023-07-12 20:20:59 | INFO | cress.tasks.speech_to_text_modified | pre-tokenizer: {'tokenizer': None}
2023-07-12 20:20:59 | INFO | cress.tasks.speech_to_text_modified | tokenizer: {'bpe': 'sentencepiece', 'sentencepiece_model': 'xxxx/data/mustc/en-de/spm_unigram10000.model'}

The program hangs at this point. Is there a problem in speech_to_text_modified.py?

An error is reported when executing the cress/test_scripts/test.en-x.st.sh script

Traceback (most recent call last):
  File "/home/xxxx/anaconda3/envs/CRESS/bin/fairseq-generate", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-generate')())
  File "/home/xxxx/projects/CRESS/fairseq/fairseq_cli/generate.py", line 413, in cli_main
    main(args)
  File "/home/xxxx/projects/CRESS/fairseq/fairseq_cli/generate.py", line 50, in main
    return _main(cfg, sys.stdout)
  File "/home/xxxx/projects/CRESS/fairseq/fairseq_cli/generate.py", line 96, in _main
    models, saved_cfg = checkpoint_utils.load_model_ensemble(
  File "/home/xxxx/projects/CRESS/fairseq/fairseq/checkpoint_utils.py", line 374, in load_model_ensemble
    ensemble, args, _task = load_model_ensemble_and_task(
  File "/home/xxxx/projects/CRESS/fairseq/fairseq/checkpoint_utils.py", line 484, in load_model_ensemble_and_task
    model = task.build_model(cfg.model, from_checkpoint=True)
  File "/home/xxxx/projects/CRESS/cress/tasks/speech_to_text_modified.py", line 129, in build_model
    return super(SpeechToTextTaskModified, self).build_model(args, from_checkpoint)
  File "/home/xxxx/projects/CRESS/fairseq/fairseq/tasks/fairseq_task.py", line 691, in build_model
    model = models.build_model(args, self, from_checkpoint)
  File "/home/xxxx/projects/CRESS/fairseq/fairseq/models/__init__.py", line 106, in build_model
    return model.build_model(cfg, task)
  File "/home/xxxx/projects/CRESS/cress/models/hubert_transformer.py", line 233, in build_model
    encoder = cls.build_encoder(args, task, encoder_embed_tokens)
  File "/home/xxxx/projects/CRESS/cress/models/hubert_transformer.py", line 211, in build_encoder
    return HubertTransformerEncoder(args, task.target_dictionary, embed_tokens)
  File "/home/xxxx/projects/CRESS/cress/models/hubert_transformer.py", line 300, in __init__
    ckpt = checkpoint_utils.load_checkpoint_to_cpu(self.hubert_model_path)
  File "/home/xxxx/projects/CRESS/fairseq/fairseq/checkpoint_utils.py", line 321, in load_checkpoint_to_cpu
    with open(local_path, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/hubert_base_ls960.pt'

about 'expanded setting'

Could you explain the 'expand' setting in the paper in more detail? Does it mean training with additional MT datasets rather than MuST-C only? Thanks!
