
MM-ALT: A Multimodal Automatic Lyric Transcription System

This is the authors' official PyTorch implementation of MM-ALT. This repo contains the code for the experiments in the ACM MM 2022 (Oral) paper:

MM-ALT: A Multimodal Automatic Lyric Transcription System

This repo also covers the ALT experiments in our journal extension paper: Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing

Project Description

Automatic lyric transcription (ALT) is a nascent field of study attracting increasing interest from both the speech and music information retrieval communities, given its significant application potential. However, ALT with audio data alone is a notoriously difficult task: instrumental accompaniment and musical constraints degrade both the phonetic cues and the intelligibility of sung lyrics. To tackle this challenge, we propose the MultiModal Automatic Lyric Transcription system (MM-ALT), together with a new dataset, N20EM, which consists of audio recordings, videos of lip movements, and inertial measurement unit (IMU) data from an earbud worn by the performing singer.

Method Overview

Installation

Environment

Install Anaconda and create an environment with Python 3.8.12, PyTorch 1.9.1, and CUDA 11.1:

conda create -n mmalt python=3.8.12
conda activate mmalt
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
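
After installation, a quick sanity check (a minimal sketch; run it inside the activated mmalt environment):

import torch

print(torch.__version__)          # expect 1.9.1+cu111
print(torch.cuda.is_available())  # should be True on a machine with CUDA 11.1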

SpeechBrain

We run experiments based on the SpeechBrain toolkit. For simplicity, we removed the original recipes. To install SpeechBrain, run the following commands:

cd MM_ALT
pip install -r requirements.txt
pip install --editable .

Transformers and other packages are also required:

pip install transformers
pip install datasets
pip install scikit-learn

AV-Hubert

We adapt AV-HuBERT (Audio-Visual Hidden Unit BERT) in our experiments. To enable the usage of AV-HuBERT, run the following commands:

cd ..
git clone https://github.com/facebookresearch/av_hubert.git
cd av_hubert
git submodule init
git submodule update

Fairseq and dependencies are also required:

pip install -r requirements.txt
cd fairseq
pip install --editable ./
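
As a minimal sanity check that both editable installs are importable (assuming the same environment; the version attributes are standard for these packages):

import fairseq
import speechbrain

print(fairseq.__version__)      # installed in editable mode from av_hubert/fairseq
print(speechbrain.__version__)  # installed in editable mode from MM_ALT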

Datasets

DSing Dataset

The DSing dataset is one of the most popular singing datasets. To download and prepare it, follow the instructions in its GitHub repository: https://github.com/groadabike/Kaldi-Dsing-task.

The resulting folder should be organized as:

/path/to/DSing
├── dev
├── test
├── train1
├── train3
└── train30

N20EM Dataset

The N20EM dataset was curated by us for our multimodal ALT task. The dataset is released here: https://zenodo.org/record/6905332.

The resulting folder should be organized as:

/path/to/N20EM
├── data
│   ├── id1
│   │   ├── downsample_audio.wav
│   │   ├── downsample_accomp.wav
│   │   ├── video.mp4
│   │   └── imu.csv
│   ├── id2
│   └── ...
├── metadata_split_by_song.json
└── README.txt
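
For illustration, a minimal sketch of iterating over this layout. The JSON schema of metadata_split_by_song.json is not documented here, so we only load it; the per-file roles in the comments are inferred from the file names:

import json
from pathlib import Path

root = Path("/path/to/N20EM")

# Song-level split assignments (exact schema not shown here)
with open(root / "metadata_split_by_song.json") as f:
    metadata = json.load(f)

for utt_dir in sorted((root / "data").iterdir()):
    audio  = utt_dir / "downsample_audio.wav"   # singing voice, 16 kHz mono
    accomp = utt_dir / "downsample_accomp.wav"  # accompaniment track
    video  = utt_dir / "video.mp4"              # lip-movement video, 25 fps
    imu    = utt_dir / "imu.csv"                # earbud IMU recordings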

NOTE: Please make sure the audio input to the model is mono and sampled at 16 kHz, and that the video input is 25 fps.
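
If you prepare your own inputs, here is a hedged sketch of checking these constraints with torchaudio and OpenCV (our package choices, not mandated by this repo):

import torchaudio
import cv2

# Audio must be mono and sampled at 16 kHz
info = torchaudio.info("downsample_audio.wav")
assert info.sample_rate == 16000, "resample the audio to 16 kHz"
assert info.num_channels == 1, "downmix the audio to mono"

# Video must be 25 fps
cap = cv2.VideoCapture("video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()
assert abs(fps - 25.0) < 0.1, "re-encode the video at 25 fps"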

Training and Evaluation

We follow the internal logic of SpeechBrain; you can run experiments as follows:

cd <dataset>/<task>
python experiment.py params.yaml

You may need to create CSV files according to our guidance in <dataset>/<task>. The results will be saved in the output_folder specified in the YAML file. Both detailed logs and experiment outputs are saved there, and less verbose logs are written to stdout.
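
As an illustration only, a minimal sketch of writing such a manifest. The actual column names are dictated by each recipe's data-preparation script and params.yaml; the fields below (ID, duration, wav, wrd) are typical SpeechBrain columns, not a guarantee:

import csv

# Hypothetical utterances: (id, duration in seconds, audio path, lyrics)
rows = [
    ("utt0001", 4.20, "/path/to/DSing/dev/utt0001.wav", "twinkle twinkle little star"),
    ("utt0002", 3.85, "/path/to/DSing/dev/utt0002.wav", "how i wonder what you are"),
]

with open("dev.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "duration", "wav", "wrd"])  # typical SpeechBrain fields
    writer.writerows(rows)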

Citation

If you use MM-ALT or this codebase in your own work, please cite our papers:

@inproceedings{gu2022mm,
  title={{MM-ALT}: A Multimodal Automatic Lyric Transcription System},
  author={Gu, Xiangming and Ou, Longshen and Ong, Danielle and Wang, Ye},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  pages={3328--3337},
  year={2022}
}

@article{gu2024automatic,
  title={Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing}, 
  author={Gu, Xiangming and Ou, Longshen and Zeng, Wei and Zhang, Jianan and Wong, Nicholas and Wang, Ye},
  journal={ACM Transactions on Multimedia Computing, Communications and Applications},
  publisher={ACM New York, NY},
  year={2024}
}

We borrow code from SpeechBrain, Fairseq, and AV-HuBERT; please also consider citing their work.

Also Check Our Relevant Work

Transfer Learning of wav2vec 2.0 for Automatic Lyric Transcription
Longshen Ou*, Xiangming Gu*, Ye Wang
International Society for Music Information Retrieval Conference (ISMIR), 2022
[paper][code]

Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing
Xiangming Gu, Longshen Ou, Wei Zeng, Jianan Zhang, Nicholas Wong, Ye Wang
ACM Transactions on Multimedia Computing, Communications and Applications (TOMM), 2024
[paper][code]

License

MM-ALT is released under the Apache License, version 2.0.
