
DPHuBERT

This repo contains the code and models for our paper:

Yifan Peng, Yui Sudo, Shakeel Muhammad, and Shinji Watanabe, “DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models,” in Proc. INTERSPEECH, 2023. (to appear)

Overview

DPHuBERT is a task-agnostic compression method based on joint distillation and structured pruning. It outperforms pure distillation methods on most SUPERB tasks and also performs well with limited training data. Our method can be directly applied to various speech SSL models such as HuBERT (either Base or Large) and WavLM.

The training procedure is illustrated in the figure below:

[Figure: Training procedure of DPHuBERT]


The main results are summarized in this table:

[Figure: DPHuBERT results]


Our models are also shown in the SUPERB leaderboard. Here are the results sorted by Rank and Score, respectively.

[Figure: SUPERB leaderboard sorted by Rank]


[Figure: SUPERB leaderboard sorted by Score]

Requirements

Our code is based on PyTorch, TorchAudio, and PyTorch Lightning. Please install these required packages from their official sources; the latest versions should work. We include our versions below for reference, followed by a quick snippet for checking the installed versions.

# Main packages for training
pytorch=1.13.1
cuda=11.6.2
pytorch-lightning=1.8.1
torchaudio=0.13.1

# Other packages for obtaining pre-trained SSL
fairseq=0.12.2
transformers=4.24.0
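
As a quick sanity check (this is only a convenience snippet, not part of the training pipeline), the installed versions can be printed from Python:

# Print the versions of the main training dependencies for reference.
import torch
import torchaudio
import pytorch_lightning as pl

print(f"torch: {torch.__version__}")
print(f"torchaudio: {torchaudio.__version__}")
print(f"pytorch-lightning: {pl.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")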

Usage

Please follow these steps to train DPHuBERT.

1. Download and prepare audio data

The following script creates file lists for LibriSpeech in TSV format. LibriSpeech_PATH is the path to the downloaded raw data.

python prepare_data.py --data LibriSpeech_PATH --out data/librispeech

The output directory has this structure:

data
└── librispeech
    ├── train100.tsv
    ├── train960.tsv
    └── valid.tsv
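
A minimal sketch of reading such a file list is shown below. It assumes the common TorchAudio-style layout in which the first line is the dataset root and each subsequent line holds a relative audio path and its number of samples; the exact columns written by prepare_data.py may differ.

# Hypothetical reader for a TorchAudio-style TSV file list (assumed format:
# first line = dataset root, then "<relative_path>\t<num_samples>" per line).
from pathlib import Path

def read_tsv(tsv_path: str):
    with open(tsv_path) as f:
        root = Path(f.readline().strip())
        entries = []
        for line in f:
            rel_path, num_samples = line.strip().split("\t")
            entries.append((root / rel_path, int(num_samples)))
    return entries

entries = read_tsv("data/librispeech/valid.tsv")
print(f"{len(entries)} utterances, first entry: {entries[0]}")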

2. Download pre-trained SSL (e.g., HuBERT Base) and convert it to our format

We need to download pre-trained SSL checkpoints from fairseq or Hugging Face and then convert them to our own format. These models will be used as the teacher for compression. For example, we can obtain HuBERT Base by executing:

mkdir -p pretrained
python convert_hubert_from_hf.py

The converted checkpoint will be saved as pretrained/hubert-base-ls960.hf.pth. The output path can be changed in the Python script.
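
To verify the conversion, the checkpoint can be inspected from Python. The snippet below assumes it stores a model config and a state dict, mirroring the compressed-model checkpoints shown later; the exact keys may differ.

# Inspect the converted teacher checkpoint (assumed to contain a model config
# and a state dict; adjust the keys if the actual layout differs).
import torch

ckpt = torch.load("pretrained/hubert-base-ls960.hf.pth", map_location="cpu")
print(list(ckpt.keys()))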

3. Start training

After preparing the data and the pre-trained model, we can start training by sequentially executing four Python scripts: distill.py, prune.py, final_distill.py, and save_final_ckpt.py. We provide a shell script, run.sh, to record the hyper-parameters. By default, it requests 4 NVIDIA A100 (40GB) GPUs via the SLURM job scheduler, and compressing HuBERT Base takes around 6 hours. Please modify the hyper-parameters if your environment is different; for example, you can reduce the number of GPUs and enable gradient accumulation to keep the total batch size in a similar range (see the sketch after the command below).

sbatch run.sh
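
As an illustration of the gradient-accumulation option mentioned above, a PyTorch Lightning Trainer can trade GPUs for accumulation steps roughly as follows; the actual flag names used by distill.py and run.sh may differ.

# Illustration only: keeping the effective batch size roughly constant when
# moving from 4 GPUs to 1 GPU by accumulating gradients over 4 steps.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,                  # instead of 4 A100 GPUs
    accumulate_grad_batches=4,  # 1 GPU x 4 accumulation steps ~= 4 GPUs x 1 step
)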

After training, the compressed model parameters and configurations will be saved in the corresponding experiment directory. We can easily load a compressed model as follows:

import torch
from wav2vec2.model import wav2vec2_model

ckpt_path = "path/to/ckpt"
ckpt = torch.load(ckpt_path)
model = wav2vec2_model(**ckpt["config"])
result = model.load_state_dict(ckpt["state_dict"], strict=False)
print(f"missing: {result.missing_keys}, unexpected: {result.unexpected_keys}")
print(f"{sum(p.numel() for p in model.parameters())} params")

Pre-trained models

We also provide some pre-trained models.

| Name     | Teacher     | Sparsity | Params     | Link         |
| -------- | ----------- | -------- | ---------- | ------------ |
| DPHuBERT | HuBERT Base | 0.75     | 23,585,946 | Hugging Face |
| DPWavLM  | WavLM Base+ | 0.75     | 23,586,325 | Hugging Face |
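
The released checkpoints can be fetched with the huggingface_hub client and then loaded as shown above. The repository and file names below are placeholders; please substitute the actual values from the table's links.

# Download a released checkpoint from Hugging Face (repo_id and filename are
# placeholders; use the values from the links in the table above).
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="<user>/DPHuBERT", filename="<checkpoint>.pth")
ckpt = torch.load(ckpt_path, map_location="cpu")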

Citation

Please cite our paper if you use DPHuBERT.

@inproceedings{dphubert,
    title={{DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models}},
    author={Yifan Peng and Yui Sudo and Shakeel Muhammad and Shinji Watanabe},
    booktitle={Proceedings of the 24th Annual Conference of the International Speech Communication Association (INTERSPEECH)},
    year={2023},
}

Acknowledgments

We thank the authors of the following projects for open-sourcing their code:

  • TorchAudio: Our speech SSL models and training pipelines are based on TorchAudio.
  • FLOP: Our implementation of the Hard Concrete Distribution is from FLOP.
  • CoFiPruning: Some of our training hyper-parameters follow CoFiPruning.

Our method is also inspired by prior studies on distillation and structured pruning of self-supervised speech models.
