
espnet_onnx's Introduction

ESPnet: end-to-end speech processing toolkit

Tested with PyTorch 1.12.1, 1.13.1, 2.0.1, and 2.1.0. CI runs on Ubuntu (Python 3.7-3.10, pip), Debian 11 (Python 3.10, conda), Windows (Python 3.10, pip), and macOS (Python 3.10, pip and conda).

ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and more. ESPnet uses PyTorch as its deep learning engine and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.

Tutorial Series

Key Features

Kaldi-style complete recipe

  • Supports numerous ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, Gigaspeech, etc.)
  • Supports numerous TTS recipes in a similar manner to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
  • Supports numerous ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
  • Supports numerous MT recipes (IWSLT'14, IWSLT'16, the above ST recipes, etc.)
  • Supports numerous SLU recipes (CATSLU-MAPS, FSC, Grabo, IEMOCAP, JDCINAL, SNIPS, SLURP, SWBD-DA, etc.)
  • Supports numerous SE/SS recipes (DNS-IS2020, LibriMix, SMS-WSJ, VCTK-noisyreverb, WHAM!, WHAMR!, WSJ-2mix, etc.)
  • Supports a voice conversion recipe (VCC2020 baseline)
  • Supports speaker diarization recipes (mini_librispeech, librimix)
  • Supports singing voice synthesis recipes (ofuton_p_utagoe_db, opencpop, m4singer, etc.)

ASR: Automatic Speech Recognition

  • State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
  • Hybrid CTC/attention based end-to-end ASR
    • Fast/accurate training with CTC/attention multitask training
    • CTC/attention joint decoding to boost monotonic alignment decoding
    • Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU), Transformer, Conformer, Branchformer, or E-Branchformer
    • Decoder: RNN (LSTM/GRU), Transformer, or S4
  • Attention: Dot product, location-aware attention, variants of multi-head
  • Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
  • Batch GPU decoding
  • Data augmentation
  • Transducer based end-to-end ASR
    • Architecture:
      • Custom encoder supporting RNNs, Conformer, Branchformer (w/ variants), 1D Conv / TDNN.
      • Decoder w/ parameters shared across blocks supporting RNN, stateless w/ 1D Conv, MEGA, and RWKV.
      • Pre-encoder: VGG2L or Conv2D available.
    • Search algorithms:
    • Features:
      • Unified interface for offline and streaming speech recognition.
      • Multi-task learning with various auxiliary losses:
        • Encoder: CTC, auxiliary Transducer and symmetric KL divergence.
        • Decoder: cross-entropy w/ label smoothing.
      • Transfer learning with an acoustic model and/or language model.
      • Training with FastEmit regularization method [Yu et al., 2021].

    Please refer to the tutorial page for complete documentation.

  • CTC segmentation
  • Non-autoregressive model based on Mask-CTC
  • ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
  • Wav2Vec2.0 pre-trained model as Encoder, imported from FairSeq.
  • Self-supervised learning representations as features, using upstream models in S3PRL in frontend.
    • Set frontend to s3prl
    • Select any upstream model by setting the frontend_conf to the corresponding name.
  • Transfer Learning :
  • Streaming Transformer/Conformer ASR with blockwise synchronous beam search.
  • Restricted Self-Attention based on Longformer as an encoder for long sequences
  • OpenAI Whisper model, robust ASR based on large-scale, weakly-supervised multitask learning

Demonstration

TTS: Text-to-speech

  • Architecture
    • Tacotron2
    • Transformer-TTS
    • FastSpeech
    • FastSpeech2
    • Conformer FastSpeech & FastSpeech2
    • VITS
    • JETS
  • Multi-speaker & multi-language extension
    • Pre-trained speaker embedding (e.g., X-vector)
    • Speaker ID embedding
    • Language ID embedding
    • Global style token (GST) embedding
    • Mix of the above embeddings
  • End-to-end training
    • End-to-end text-to-wav model (e.g., VITS, JETS, etc.)
    • Joint training of text2mel and vocoder
  • Various language support
    • En / Jp / Zh / De / Ru / and more...
  • Integration with neural vocoders
    • Parallel WaveGAN
    • MelGAN
    • Multi-band MelGAN
    • HiFiGAN
    • StyleMelGAN
    • Mix of the above models

Demonstration

To train the neural vocoder, please check the following repositories:

SE: Speech enhancement (and separation)

  • Single-speaker speech enhancement
  • Multi-speaker speech separation
  • Unified encoder-separator-decoder structure for time-domain and frequency-domain models
  • Flexible ASR integration: working as an individual task or as the ASR frontend
  • Easy to import pre-trained models from Asteroid
    • Both the pre-trained models from Asteroid and the specific configuration are supported.
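
As a rough illustration of running an enhancement model as an individual task, here is a hedged Python sketch; the SeparateSpeech class comes from espnet2.bin.enh_inference, but the constructor argument names and the placeholder paths below are assumptions and may differ per model.

# Hedged sketch: run a trained ESPnet-SE model on a mixture (paths are placeholders;
# the train_config/model_file argument names are assumptions for this illustration).
import soundfile
from espnet2.bin.enh_inference import SeparateSpeech

enh = SeparateSpeech(
    train_config="exp/enh_train/config.yaml",         # placeholder path
    model_file="exp/enh_train/valid.loss.best.pth",   # placeholder path
)
mixture, fs = soundfile.read("mixture.wav")
# Returns a list with one enhanced/separated waveform per output speaker/channel.
waves = enh(mixture[None, :], fs=fs)
soundfile.write("enhanced_spk1.wav", waves[0].squeeze(), fs)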

Demonstration

  • Interactive SE demo with ESPnet2 Open In Colab
  • Streaming SE demo with ESPnet2 Open In Colab

ST: Speech Translation & MT: Machine Translation

  • State-of-the-art performance in several ST benchmarks (comparable/superior to cascaded ASR and MT)
  • Transformer-based end-to-end ST (new!)
  • Transformer-based end-to-end MT (new!)

VC: Voice conversion

  • Transformer and Tacotron2-based parallel VC using Mel spectrogram
  • End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!)

SLU: Spoken Language Understanding

  • Architecture
    • Transformer-based Encoder
    • Conformer-based Encoder
    • Branchformer-based Encoder
    • E-Branchformer-based Encoder
    • RNN-based Decoder
    • Transformer-based Decoder
  • Support Multitasking with ASR
    • Predict both intent and ASR transcript
  • Support Multitasking with NLU
    • Deliberation-encoder-based two-pass model
  • Support using pre-trained ASR models
    • Hubert
    • Wav2vec2
    • VQ-APC
    • TERA and more ...
  • Support using pre-trained NLP models
    • BERT
    • MPNet And more...
  • Various language support
    • En / Jp / Zh / Nl / and more...
  • Supports using context from previous utterances
  • Supports using other tasks like SE in a pipeline manner
  • Supports two-pass SLU that combines audio and the ASR transcript

Demonstration

  • Performing noisy spoken language understanding using a speech enhancement model followed by a spoken language understanding model. Open In Colab
  • Performing two-pass spoken language understanding where the second pass model attends to both acoustic and semantic information. Open In Colab
  • Integrated to Hugging Face Spaces with Gradio. See SLU demo on multiple languages: Hugging Face Spaces

SUM: Speech Summarization

  • End to End Speech Summarization Recipe for Instructional Videos using Restricted Self-Attention [Sharma et al., 2022]

SVS: Singing Voice Synthesis

  • Framework merge from Muskits
  • Architecture
    • RNN-based non-autoregressive model
    • Xiaoice
    • Tacotron-singing
    • DiffSinger (in progress)
    • VISinger
    • VISinger 2 (and its variants with different vocoder architectures)
  • Support multi-speaker & multilingual singing synthesis
    • Speaker ID embedding
    • Language ID embedding
  • Various language support
    • Jp / En / Kr / Zh
  • Tight integration with neural vocoders (the same as TTS)

SSL: Self-supervised Learning

UASR: Unsupervised ASR (EURO: ESPnet Unsupervised Recognition - Open-source)

  • Architecture
    • wav2vec-U (with different self-supervised models)
    • wav2vec-U 2.0 (in progress)
  • Support PrefixBeamSearch and K2-based WFST decoding

S2T: Speech-to-text with Whisper-style multilingual multitask models

  • Reproduces Whisper-style training from scratch using public data: OWSM
  • Supports multiple tasks in a single model
    • Multilingual speech recognition
    • Any-to-any speech translation
    • Language identification
    • Utterance-level timestamp prediction (segmentation)

DNN Framework

  • Flexible network architecture thanks to Chainer and PyTorch
  • Flexible front-end processing thanks to kaldiio and HDF5 support
  • Tensorboard-based monitoring

ESPnet2

See ESPnet2.

  • Independent of Kaldi/Chainer, unlike ESPnet1
  • On-the-fly feature extraction and text processing during training
  • Supports both DistributedDataParallel and DataParallel
  • Supports multi-node training, integrated with Slurm or MPI
  • Supports sharded training provided by fairscale
  • A template recipe that can be applied to all corpora
  • Can train on corpora of any size without CPU memory errors
  • ESPnet Model Zoo
  • Integrated with wandb

Installation

  • If you intend to do full experiments, including DNN training, then see Installation.

  • If you only need the Python module:

    # We recommend you install PyTorch before installing espnet following https://pytorch.org/get-started/locally/
    pip install espnet
    # To install the latest
    # pip install git+https://github.com/espnet/espnet
    # To install additional packages
    # pip install "espnet[all]"
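
    A quick sanity check that the module imports correctly (a minimal sketch, assuming the pip installation above succeeded):

    # Confirm the installed versions of PyTorch and ESPnet.
    import torch
    import espnet
    print(torch.__version__, espnet.__version__)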

    If you use ESPnet1, please install chainer and cupy.

    pip install chainer==6.0.0 cupy==6.0.0    # [Option]

    You might need to install some packages depending on each task. We prepared various installation scripts at tools/installers.

  • (ESPnet2) Once installed, run wandb login and set --use_wandb true to enable tracking runs using W&B.

Docker Container

Go to docker/ and follow the instructions there.

Contribution

Thank you for taking the time to contribute to ESPnet! Any contributions to ESPnet are welcome; feel free to ask questions or make requests in the issues. If this is your first ESPnet contribution, please follow the contribution guide.

ASR results


We list the character error rate (CER) and word error rate (WER) of major ASR tasks.

Task CER (%) WER (%) Pre-trained model
Aishell dev/test 4.6/5.1 N/A link
ESPnet2 Aishell dev/test 4.1/4.4 N/A link
Common Voice dev/test 1.7/1.8 2.2/2.3 link
CSJ eval1/eval2/eval3 5.7/3.8/4.2 N/A link
ESPnet2 CSJ eval1/eval2/eval3 4.5/3.3/3.6 N/A link
ESPnet2 GigaSpeech dev/test N/A 10.6/10.5 link
HKUST dev 23.5 N/A link
ESPnet2 HKUST dev 21.2 N/A link
Librispeech dev_clean/dev_other/test_clean/test_other N/A 1.9/4.9/2.1/4.9 link
ESPnet2 Librispeech dev_clean/dev_other/test_clean/test_other 0.6/1.5/0.6/1.4 1.7/3.4/1.8/3.6 link
Switchboard (eval2000) callhm/swbd N/A 14.0/6.8 link
ESPnet2 Switchboard (eval2000) callhm/swbd N/A 13.4/7.3 link
TEDLIUM2 dev/test N/A 8.6/7.2 link
ESPnet2 TEDLIUM2 dev/test N/A 7.3/7.1 link
TEDLIUM3 dev/test N/A 9.6/7.6 link
WSJ dev93/eval92 3.2/2.1 7.0/4.7 N/A
ESPnet2 WSJ dev93/eval92 1.1/0.8 2.8/1.8 link

Note that the performance of the CSJ, HKUST, and Librispeech tasks was significantly improved by using the wide network (#units = 1024) and large subword units, if necessary, as reported by RWTH.

If you want to check the results of the other recipes, please check egs/<name_of_recipe>/asr1/RESULTS.md.

ASR demo


You can recognize speech in a WAV file using pre-trained models. Go to a recipe directory and run utils/recog_wav.sh as follows:

# go to the recipe directory and source path of espnet tools
cd egs/tedlium2/asr1 && . ./path.sh
# let's recognize speech!
recog_wav.sh --models tedlium2.transformer.v1 example.wav

where example.wav is a WAV file to be recognized. The sampling rate must be consistent with that of the data used in training.

Available pre-trained models in the demo script are listed below.

Model Notes
tedlium2.rnn.v1 Streaming decoding based on CTC-based VAD
tedlium2.rnn.v2 Streaming decoding based on CTC-based VAD (batch decoding)
tedlium2.transformer.v1 Joint-CTC attention Transformer trained on Tedlium 2
tedlium3.transformer.v1 Joint-CTC attention Transformer trained on Tedlium 3
librispeech.transformer.v1 Joint-CTC attention Transformer trained on Librispeech
commonvoice.transformer.v1 Joint-CTC attention Transformer trained on CommonVoice
csj.transformer.v1 Joint-CTC attention Transformer trained on CSJ
csj.rnn.v1 Joint-CTC attention VGGBLSTM trained on CSJ
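
For ESPnet2 models, you can also run recognition directly from Python through espnet_model_zoo. A minimal hedged sketch (it reuses the kamo-naoyuki/wsj tag from the CTC segmentation example later on this page; the WAV path is a placeholder):

# Hedged sketch: decode a WAV file with a pre-trained ESPnet2 ASR model.
# Assumes espnet_model_zoo and soundfile are installed; "example.wav" is a placeholder.
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

d = ModelDownloader(cachedir="./modelcache")
speech2text = Speech2Text(**d.download_and_unpack("kamo-naoyuki/wsj"))
speech, rate = soundfile.read("example.wav")  # sampling rate must match the training data
nbests = speech2text(speech)
text, tokens, token_ids, hyp = nbests[0]
print(text)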

SE results


We list results from three different models on WSJ0-2mix, which is one of the most widely used benchmark datasets for speech separation.

Model STOI SAR SDR SIR
TF Masking 0.89 11.40 10.24 18.04
Conv-Tasnet 0.95 16.62 15.94 25.90
DPRNN-Tasnet 0.96 18.82 18.29 28.92

SE demos

You can try the interactive demo with Google Colab. Please click the following button to get access to the demos.

Open In Colab

It is based on ESPnet2. Pre-trained models are available for both speech enhancement and speech separation tasks.

Speech separation streaming demos:

Open In Colab

ST results


We list 4-gram BLEU of major ST tasks.

end-to-end system

Task BLEU Pre-trained model
Fisher-CallHome Spanish fisher_test (Es->En) 51.03 link
Fisher-CallHome Spanish callhome_evltest (Es->En) 20.44 link
Libri-trans test (En->Fr) 16.70 link
How2 dev5 (En->Pt) 45.68 link
Must-C tst-COMMON (En->De) 22.91 link
Mboshi-French dev (Fr->Mboshi) 6.18 N/A

cascaded system

Task BLEU Pre-trained model
Fisher-CallHome Spanish fisher_test (Es->En) 42.16 N/A
Fisher-CallHome Spanish callhome_evltest (Es->En) 19.82 N/A
Libri-trans test (En->Fr) 16.96 N/A
How2 dev5 (En->Pt) 44.90 N/A
Must-C tst-COMMON (En->De) 23.65 N/A

If you want to check the results of the other recipes, please check egs/<name_of_recipe>/st1/RESULTS.md.

ST demo


(New!) We made a new real-time E2E-ST + TTS demonstration in Google Colab. Please access the notebook from the following button and enjoy the real-time speech-to-speech translation!

Open In Colab


You can translate speech in a WAV file using pre-trained models. Go to a recipe directory and run utils/translate_wav.sh as follows:

# Go to recipe directory and source path of espnet tools
cd egs/fisher_callhome_spanish/st1 && . ./path.sh
# download example wav file
wget -O - https://github.com/espnet/espnet/files/4100928/test.wav.tar.gz | tar zxvf -
# let's translate speech!
translate_wav.sh --models fisher_callhome_spanish.transformer.v1.es-en test.wav

where test.wav is a WAV file to be translated. The sampling rate must be consistent with that of the data used in training.

Available pre-trained models in the demo script are listed below.

Model Notes
fisher_callhome_spanish.transformer.v1 Transformer-ST trained on Fisher-CallHome Spanish Es->En

MT results

Task BLEU Pre-trained model
Fisher-CallHome Spanish fisher_test (Es->En) 61.45 link
Fisher-CallHome Spanish callhome_evltest (Es->En) 29.86 link
Libri-trans test (En->Fr) 18.09 link
How2 dev5 (En->Pt) 58.61 link
Must-C tst-COMMON (En->De) 27.63 link
IWSLT'14 test2014 (En->De) 24.70 link
IWSLT'14 test2014 (De->En) 29.22 link
IWSLT'14 test2014 (De->En) 32.2 link
IWSLT'16 test2014 (En->De) 24.05 link
IWSLT'16 test2014 (De->En) 29.13 link

TTS results

ESPnet2

You can listen to the generated samples in the following URL.

Note that in the generation, we use Griffin-Lim (wav/) and Parallel WaveGAN (wav_pwg/).

You can download pre-trained models via espnet_model_zoo.

You can download pre-trained vocoders via kan-bayashi/ParallelWaveGAN.
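
A hedged Python sketch of this workflow (the model tag below is only an example; any TTS tag from the model zoo should work, and from_pretrained downloads it automatically):

# Hedged sketch: synthesize speech with a pre-trained ESPnet2 TTS model.
# The model tag is an example/assumption; substitute a tag from espnet_model_zoo.
import soundfile
from espnet2.bin.tts_inference import Text2Speech

text2speech = Text2Speech.from_pretrained(model_tag="kan-bayashi/ljspeech_vits")
out = text2speech("This is a test of ESPnet text to speech.")
# out["wav"] is a torch tensor; text2speech.fs is the model's sampling rate.
soundfile.write("out.wav", out["wav"].numpy(), text2speech.fs)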

ESPnet1

NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest results in the ESPnet2 results above.

You can listen to our samples on the demo homepage, espnet-tts-sample. Here we list some notable ones:

You can download all of the pre-trained models and generated samples:

Note that in the generated samples, we use the following vocoders: Griffin-Lim (GL), WaveNet vocoder (WaveNet), Parallel WaveGAN (ParallelWaveGAN), and MelGAN (MelGAN). The neural vocoders are based on the following repositories.

If you want to build your own neural vocoder, please check the above repositories. kan-bayashi/ParallelWaveGAN provides a manual on how to decode ESPnet-TTS models' features with neural vocoders. Please check it.
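
For illustration, here is a hedged sketch of decoding mel features with a pre-trained vocoder through the parallel_wavegan package; the helper names and the model tag below are assumptions based on that repository's manual, so please check it for the exact API:

# Hedged sketch: vocode ESPnet-TTS mel features with a pre-trained Parallel WaveGAN.
# Assumptions: the parallel_wavegan package is installed, the helpers and the tag below
# exist as documented in kan-bayashi/ParallelWaveGAN, and the mel features were
# generated with settings matching the vocoder (see the table below).
import torch
from parallel_wavegan.utils import download_pretrained_model, load_model

vocoder = load_model(download_pretrained_model("ljspeech_parallel_wavegan.v1"))
vocoder.remove_weight_norm()
vocoder = vocoder.eval()

mel = torch.randn(100, 80)  # placeholder; use features produced by an ESPnet-TTS model
with torch.no_grad():
    wav = vocoder.inference(mel)  # returns the generated waveform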

Here we list all of the pre-trained neural vocoders. Please download and enjoy the generation of high-quality speech!

Model Lang Fs [Hz] Mel range [Hz] FFT / Shift / Win [pt] Model type
ljspeech.wavenet.softmax.ns.v1 EN 22.05k None 1024 / 256 / None Softmax WaveNet
ljspeech.wavenet.mol.v1 EN 22.05k None 1024 / 256 / None MoL WaveNet
ljspeech.parallel_wavegan.v1 EN 22.05k None 1024 / 256 / None Parallel WaveGAN
ljspeech.wavenet.mol.v2 EN 22.05k 80-7600 1024 / 256 / None MoL WaveNet
ljspeech.parallel_wavegan.v2 EN 22.05k 80-7600 1024 / 256 / None Parallel WaveGAN
ljspeech.melgan.v1 EN 22.05k 80-7600 1024 / 256 / None MelGAN
ljspeech.melgan.v3 EN 22.05k 80-7600 1024 / 256 / None MelGAN
libritts.wavenet.mol.v1 EN 24k None 1024 / 256 / None MoL WaveNet
jsut.wavenet.mol.v1 JP 24k 80-7600 2048 / 300 / 1200 MoL WaveNet
jsut.parallel_wavegan.v1 JP 24k 80-7600 2048 / 300 / 1200 Parallel WaveGAN
csmsc.wavenet.mol.v1 ZH 24k 80-7600 2048 / 300 / 1200 MoL WaveNet
csmsc.parallel_wavegan.v1 ZH 24k 80-7600 2048 / 300 / 1200 Parallel WaveGAN

If you want to use the above pre-trained vocoders, please exactly match the feature setting with them.

TTS demo

ESPnet2

You can try the real-time demo in Google Colab. Please access the notebook from the following button and enjoy the real-time synthesis!

  • Real-time TTS demo with ESPnet2 Open In Colab

English, Japanese, and Mandarin models are available in the demo.

ESPnet1

NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest demo in the ESPnet2 demo above.

You can try the real-time demo in Google Colab. Please access the notebook from the following button and enjoy the real-time synthesis.

  • Real-time TTS demo with ESPnet1 Open In Colab

We also provide a shell script to perform synthesis. Go to a recipe directory and run utils/synth_wav.sh as follows:

# Go to recipe directory and source path of espnet tools
cd egs/ljspeech/tts1 && . ./path.sh
# We use an upper-case char sequence for the default model.
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example.txt
# let's synthesize speech!
synth_wav.sh example.txt

# Also, you can use multiple sentences
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example_multi.txt
echo "TEXT TO SPEECH IS A TECHNIQUE TO CONVERT TEXT INTO SPEECH." >> example_multi.txt
synth_wav.sh example_multi.txt

You can change the pre-trained model as follows:

synth_wav.sh --models ljspeech.fastspeech.v1 example.txt

Waveform synthesis is performed with the Griffin-Lim algorithm and neural vocoders (WaveNet and ParallelWaveGAN). You can change the pre-trained vocoder model as follows:

synth_wav.sh --vocoder_models ljspeech.wavenet.mol.v1 example.txt

The WaveNet vocoder provides very high-quality speech, but generation takes time.

See more details or available models via --help.

synth_wav.sh --help

VC results

  • Transformer and Tacotron2-based VC

You can listen to some samples on the demo webpage.

  • Cascade ASR+TTS as one of the baseline systems of VCC2020

The Voice Conversion Challenge 2020 (VCC2020) adopts ESPnet to build an end-to-end based baseline system. In VCC2020, the objective is intra/cross-lingual nonparallel VC. You can download converted samples of the cascade ASR+TTS baseline system here.

SLU results


We list the performance on various SLU tasks and datasets using the metric reported in the original dataset paper.

Task Dataset Metric Result Pre-trained Model
Intent Classification SLURP Acc 86.3 link
Intent Classification FSC Acc 99.6 link
Intent Classification FSC Unseen Speaker Set Acc 98.6 link
Intent Classification FSC Unseen Utterance Set Acc 86.4 link
Intent Classification FSC Challenge Speaker Set Acc 97.5 link
Intent Classification FSC Challenge Utterance Set Acc 78.5 link
Intent Classification SNIPS F1 91.7 link
Intent Classification Grabo (Nl) Acc 97.2 link
Intent Classification CAT SLU MAP (Zh) Acc 78.9 link
Intent Classification Google Speech Commands Acc 98.4 link
Slot Filling SLURP SLU-F1 71.9 link
Dialogue Act Classification Switchboard Acc 67.5 link
Dialogue Act Classification Jdcinal (Jp) Acc 67.4 link
Emotion Recognition IEMOCAP Acc 69.4 link
Emotion Recognition swbd_sentiment Macro F1 61.4 link
Emotion Recognition slue_voxceleb Macro F1 44.0 link

If you want to check the results of the other recipes, please check egs2/<name_of_recipe>/asr1/RESULTS.md.

CTC Segmentation demo

ESPnet1

CTC segmentation determines utterance segments within audio files. Aligned utterance segments constitute the labels of speech datasets.

As a demo, we align the start and end of utterances within the audio file ctc_align_test.wav, using the example script utils/asr_align_wav.sh. For preparation, set up a data directory:

cd egs/tedlium2/align1/
# data directory
align_dir=data/demo
mkdir -p ${align_dir}
# wav file
base=ctc_align_test
wav=../../../test_utils/${base}.wav
# recipe files
echo "batchsize: 0" > ${align_dir}/align.yaml

cat << EOF > ${align_dir}/utt_text
${base} THE SALE OF THE HOTELS
${base} IS PART OF HOLIDAY'S STRATEGY
${base} TO SELL OFF ASSETS
${base} AND CONCENTRATE
${base} ON PROPERTY MANAGEMENT
EOF

Here, utt_text is the file containing the list of utterances. Choose a pre-trained ASR model that includes a CTC layer to find utterance segments:

# pre-trained ASR model
model=wsj.transformer_small.v1
mkdir ./conf && cp ../../wsj/asr1/conf/no_preprocess.yaml ./conf

../../../utils/asr_align_wav.sh \
    --models ${model} \
    --align_dir ${align_dir} \
    --align_config ${align_dir}/align.yaml \
    ${wav} ${align_dir}/utt_text

Segments are written to aligned_segments as a list of file/utterance names, utterance start and end times in seconds, and a confidence score. The confidence score is a probability in log space that indicates how well the utterance was aligned. If needed, remove bad utterances:

min_confidence_score=-5
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${align_dir}/aligned_segments

The demo script utils/ctc_align_wav.sh uses an already pre-trained ASR model (see the list above for more models). It is recommended to use models with RNN-based encoders (such as BLSTMP) for aligning large audio files, rather than Transformer models, which have high memory consumption on longer audio data. The sample rate of the audio must be consistent with that of the data used in training; adjust with sox if needed. A full example recipe is in egs/tedlium2/align1/.

ESPnet2

CTC segmentation determines utterance segments within audio files. Aligned utterance segments constitute the labels of speech datasets.

As a demo, we align the start and end of utterances within the audio file ctc_align_test.wav. This can be done either directly from the Python command line or using the script espnet2/bin/asr_align.py.

From the Python command line interface:

# load a model with character tokens
from espnet_model_zoo.downloader import ModelDownloader
d = ModelDownloader(cachedir="./modelcache")
wsjmodel = d.download_and_unpack("kamo-naoyuki/wsj")
# load the example file included in the ESPnet repository
import soundfile
speech, rate = soundfile.read("./test_utils/ctc_align_test.wav")
# CTC segmentation
from espnet2.bin.asr_align import CTCSegmentation
aligner = CTCSegmentation( **wsjmodel , fs=rate )
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""
segments = aligner(speech, text)
print(segments)
# utt1 utt 0.26 1.73 -0.0154 THE SALE OF THE HOTELS
# utt2 utt 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S STRATEGY
# utt3 utt 3.19 4.20 -0.7433 TO SELL OFF ASSETS
# utt4 utt 4.20 6.10 -0.4899 AND CONCENTRATE ON PROPERTY MANAGEMENT

Aligning also works with fragments of the text. For this, set the gratis_blank option that allows skipping unrelated audio sections without penalty. It's also possible to omit the utterance names at the beginning of each line by setting kaldi_style_text to False.

aligner.set_config( gratis_blank=True, kaldi_style_text=False )
text = ["SALE OF THE HOTELS", "PROPERTY MANAGEMENT"]
segments = aligner(speech, text)
print(segments)
# utt_0000 utt 0.37 1.72 -2.0651 SALE OF THE HOTELS
# utt_0001 utt 4.70 6.10 -5.0566 PROPERTY MANAGEMENT

The script espnet2/bin/asr_align.py uses a similar interface. To align utterances:

# ASR model and config files from pre-trained model (e.g., from cachedir):
asr_config=<path-to-model>/config.yaml
asr_model=<path-to-model>/valid.*best.pth
# prepare the text file
wav="test_utils/ctc_align_test.wav"
text="test_utils/ctc_align_text.txt"
cat << EOF > ${text}
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE
utt5 ON PROPERTY MANAGEMENT
EOF
# obtain alignments:
python espnet2/bin/asr_align.py --asr_train_config ${asr_config} --asr_model_file ${asr_model} --audio ${wav} --text ${text}
# utt1 ctc_align_test 0.26 1.73 -0.0154 THE SALE OF THE HOTELS
# utt2 ctc_align_test 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S STRATEGY
# utt3 ctc_align_test 3.19 4.20 -0.7433 TO SELL OFF ASSETS
# utt4 ctc_align_test 4.20 4.97 -0.6017 AND CONCENTRATE
# utt5 ctc_align_test 4.97 6.10 -0.3477 ON PROPERTY MANAGEMENT

The output of the script can be redirected to a segments file by adding the argument --output segments. Each line contains the file/utterance name, utterance start and end times in seconds, and a confidence score; optionally also the utterance text. The confidence score is a probability in log space that indicates how well the utterance was aligned. If needed, remove bad utterances:

min_confidence_score=-7
# here, we assume that the output was written to the file `segments`
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' segments

See the module documentation for more information. It is recommended to use models with RNN-based encoders (such as BLSTMP) for aligning large audio files, rather than Transformer models, which have high memory consumption on longer audio data. The sample rate of the audio must be consistent with that of the data used in training; adjust with sox if needed.

Also, we can use this tool to provide token-level segmentation information if we prepare a list of tokens instead of a list of utterances in the text file. See the discussion in #4278 (comment).
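
A hedged sketch of this token-level usage, reusing the aligner and speech objects from the ESPnet2 Python example above (the token list is illustrative):

# Token-level alignment sketch: pass each token as its own text entry so that one
# start/end time and confidence score is produced per token.
aligner.set_config(gratis_blank=True, kaldi_style_text=False)
tokens = ["THE", "SALE", "OF", "THE", "HOTELS"]  # illustrative token list
segments = aligner(speech, tokens)
print(segments)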

Citations

@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
@inproceedings{hayashi2020espnet,
  title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
  booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7654--7658},
  year={2020},
  organization={IEEE}
}
@inproceedings{inaguma-etal-2020-espnet,
    title = "{ESP}net-{ST}: All-in-One Speech Translation Toolkit",
    author = "Inaguma, Hirofumi  and
      Kiyono, Shun  and
      Duh, Kevin  and
      Karita, Shigeki  and
      Yalta, Nelson  and
      Hayashi, Tomoki  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.34",
    pages = "302--311",
}
@article{hayashi2021espnet2,
  title={{ESP}net2-{TTS}: Extending the edge of {TTS} research},
  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Yoshimura, Takenori and Wu, Peter and Shi, Jiatong and Saeki, Takaaki and Ju, Yooncheol and Yasuda, Yusuke and Takamichi, Shinnosuke and Watanabe, Shinji},
  journal={arXiv preprint arXiv:2110.07840},
  year={2021}
}
@inproceedings{li2020espnet,
  title={{ESPnet-SE}: End-to-End Speech Enhancement and Separation Toolkit Designed for {ASR} Integration},
  author={Chenda Li and Jing Shi and Wangyou Zhang and Aswin Shanmugam Subramanian and Xuankai Chang and Naoyuki Kamo and Moto Hira and Tomoki Hayashi and Christoph Boeddeker and Zhuo Chen and Shinji Watanabe},
  booktitle={Proceedings of IEEE Spoken Language Technology Workshop (SLT)},
  pages={785--792},
  year={2021},
  organization={IEEE},
}
@inproceedings{arora2021espnet,
  title={{ESPnet-SLU}: Advancing Spoken Language Understanding through ESPnet},
  author={Arora, Siddhant and Dalmia, Siddharth and Denisov, Pavel and Chang, Xuankai and Ueda, Yushi and Peng, Yifan and Zhang, Yuekai and Kumar, Sujay and Ganesan, Karthik and Yan, Brian and others},
  booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7167--7171},
  year={2022},
  organization={IEEE}
}
@inproceedings{shi2022muskits,
  author={Shi, Jiatong and Guo, Shuai and Qian, Tao and Huo, Nan and Hayashi, Tomoki and Wu, Yuning and Xu, Frank and Chang, Xuankai and Li, Huazhe and Wu, Peter and Watanabe, Shinji and Jin, Qin},
  title={{Muskits}: an End-to-End Music Processing Toolkit for Singing Voice Synthesis},
  year={2022},
  booktitle={Proceedings of Interspeech},
  pages={4277-4281},
  url={https://www.isca-speech.org/archive/pdfs/interspeech_2022/shi22d_interspeech.pdf}
}
@inproceedings{lu22c_interspeech,
  author={Yen-Ju Lu and Xuankai Chang and Chenda Li and Wangyou Zhang and Samuele Cornell and Zhaoheng Ni and Yoshiki Masuyama and Brian Yan and Robin Scheibler and Zhong-Qiu Wang and Yu Tsao and Yanmin Qian and Shinji Watanabe},
  title={{ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={5458--5462},
}
@inproceedings{gao2023euro,
  title={{EURO: ESP}net unsupervised {ASR} open-source toolkit},
  author={Gao, Dongji and Shi, Jiatong and Chuang, Shun-Po and Garcia, Leibny Paola and Lee, Hung-yi and Watanabe, Shinji and Khudanpur, Sanjeev},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
@inproceedings{peng2023reproducing,
  title={Reproducing {W}hisper-style training using an open-source toolkit and publicly available data},
  author={Peng, Yifan and Tian, Jinchuan and Yan, Brian and Berrebbi, Dan and Chang, Xuankai and Li, Xinjian and Shi, Jiatong and Arora, Siddhant and Chen, William and Sharma, Roshan and others},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  pages={1--8},
  year={2023},
  organization={IEEE}
}
@inproceedings{sharma2023espnet,
  title={ESPnet-{SUMM}: Introducing a novel large dataset, toolkit, and a cross-corpora evaluation of speech summarization systems},
  author={Sharma, Roshan and Chen, William and Kano, Takatomo and Sharma, Ruchira and Arora, Siddhant and Watanabe, Shinji and Ogawa, Atsunori and Delcroix, Marc and Singh, Rita and Raj, Bhiksha},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  pages={1--8},
  year={2023},
  organization={IEEE}
}
@article{jung2024espnet,
  title={{ESPnet-SPK}: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models},
  author={Jung, Jee-weon and Zhang, Wangyou and Shi, Jiatong and Aldeneh, Zakaria and Higuchi, Takuya and Theobald, Barry-John and Abdelaziz, Ahmed Hussen and Watanabe, Shinji},
  journal={Proc. Interspeech 2024},
  year={2024}
}
@inproceedings{yan-etal-2023-espnet,
    title = "{ESP}net-{ST}-v2: Multipurpose Spoken Language Translation Toolkit",
    author = "Yan, Brian  and
      Shi, Jiatong  and
      Tang, Yun  and
      Inaguma, Hirofumi  and
      Peng, Yifan  and
      Dalmia, Siddharth  and
      Pol{\'a}k, Peter  and
      Fernandes, Patrick  and
      Berrebbi, Dan  and
      Hayashi, Tomoki  and
      Zhang, Xiaohui  and
      Ni, Zhaoheng  and
      Hira, Moto  and
      Maiti, Soumi  and
      Pino, Juan  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    year = "2023",
    publisher = "Association for Computational Linguistics",
    pages = "400--411",
}


espnet_onnx's Issues

Support for transducer

Thanks for creating this and simplifying onnx export.

  1. Do you have a timeline for the transducer ASR model support?
from espnet2.bin.asr_inference import Speech2Text
import os
import yaml
from espnet_onnx.export import ModelExport
from pathlib import Path

m = ModelExport()

os.chdir("egs2/librispeech_100/asr1/")


transducer_conf = yaml.safe_load(Path('conf/decode_rnnt_conformer.yaml').read_text())

speech2text = Speech2Text(asr_train_config="exp/asr_train_rnnt_conformer_raw_en_bpe5000_sp/config.yaml",
                          asr_model_file="exp/asr_train_rnnt_conformer_raw_en_bpe5000_sp/latest.pth",
                          transducer_conf=transducer_conf["transducer_conf"],
                          lm_weight=0.0)

m.export(speech2text, 'onnx_export', quantize=True)

^ I was trying to export a sample transducer checkpoint like this and ran into attribute errors in the _create_config function of espnet_onnx/export/asr/export_asr.py.

  2. Can you comment on whether the above snippet is correct? If I just provide None for attributes that don't exist (e.g., model.beam_search), would the conversion still work?

AttributeError: 'ContextualBlockXformerEncoder' object has no attribute 'get_frontend_config'

Hi,

First, thanks a lot for your work on onnx conversion of espnet models!

I am trying to convert a streaming Conformer-Transformer model (encoder: contextual_block_conformer, decoder: transformer) from espnet2 to onnx format.
It is not a pretrained espnet_zoo model but I trained it on my own data.

My call is this:

import sys
from espnet_onnx.export import ASRModelExport

model="./asr_train_asr_streaming_conformer7_n_fft256_hop_length128_conv2d_n_mels40_medium_raw_de_bpe5000_sp_valid.acc.ave.zip"
tag_name="streaming_conformer-transformer"

m = ASRModelExport()
m.export_from_zip(
  model,
  tag_name,
  optimize=True,
  quantize=True
)

I am getting the following error:

  m.export_from_zip(
  File "/home/espnetUser/scm/external/espnet_onnx/espnet_onnx/export/asr/export_asr.py", line 191, in export_from_zip
    self.export(model, tag_name, quantize, optimize)
  File "/home/espnetUser/scm/external/espnet_onnx/espnet_onnx/export/asr/export_asr.py", line 92, in export
    model_config.update(encoder=enc_model.get_model_config(
  File "/home/espnetUser/scm/external/espnet_onnx/espnet_onnx/export/asr/models/encoders/contextual_block_xformer.py", line 188, in get_model_config
    frontend=self.get_frontend_config(asr_model.frontend),
  File "/home/espnetUser/scm/external/espnet/tools/anaconda/envs/espnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'ContextualBlockXformerEncoder' object has no attribute 'get_frontend_config'

I also tried converting two pretrained espnet models:

Non-streaming model:

kamo-naoyuki/librispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000_scheduler_confwarmup_steps40000_optim_conflr0.0025_sp_valid.acc.ave.zip

This worked fine for me.

Streaming model:

Emiru_Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave.zip

Here I am getting the same AttributeError: 'ContextualBlockXformerEncoder' object has no attribute 'get_frontend_config' error as for my own streaming model.

Any suggestions on what is going wrong? Do I need a specific espnet/Python version, etc., for the streaming models?

Thanks!

Infer rnnt onnx wrong

The pretrained model was obtained from
https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1
Conformer-RNN Transducer
Environments
date: Wed Apr 27 09:30:57 EDT 2022
python version: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
espnet version: espnet 0.10.7a1
pytorch version: pytorch 1.8.1+cu102
Git hash: 21d19be00089678ca27f7fce474ef8d787689512
Commit date: Wed Mar 16 08:06:52 2022 -0400
ASR config: conf/tuning/transducer/train_conformer-rnn_transducer.yaml
Decode config: conf/tuning/transducer/decode.yaml
Pretrained model: https://huggingface.co/espnet/chai_librispeech_asr_train_conformer-rnn_transducer_raw_en_bpe5000_sp

export onnx script:
from espnet2.bin.asr_inference import Speech2Text
import os
import yaml
from espnet_onnx.export import ASRModelExport
from pathlib import Path
m = ASRModelExport()
m.set_export_config(max_seq_len=5000)
transducer_conf = yaml.safe_load(Path('espnet/egs2/librispeech/asr1/conf/decode_rnnt_conformer.yaml').read_text())
speech2text = Speech2Text(
    asr_train_config="test/espnet_onnx2/rnnt/exp/asr_train_rnnt_conformer_ngpu4_raw_en_bpe5000_sp/config.yaml",
    asr_model_file="espnet_onnx2/rnnt/exp/asr_train_rnnt_conformer_ngpu4_raw_en_bpe5000_sp/valid.loss.ave_10best.pth",
    transducer_conf=transducer_conf["transducer_conf"],
    lm_train_config="test/espnet_onnx2/rnnt/lm_config/config.yaml",
    lm_file="test/espnet_onnx2/rnnt/lm_config/17epoch.pth",
    lm_weight=0.0,
)

m.export(speech2text, 'onnx_export', quantize=False)

And I got 4 ONNX files.

The inference script is:

import librosa
from espnet_onnx import Speech2Text
# from pyacl.acl_infer import init_acl, release_acl
from tqdm import tqdm
import os, re

def findAllFile(base):
    for root, ds, fs in os.walk(base):
        for f in fs:
            if re.match(r'.\d.', f) and f.endswith("flac"):
                fullname = os.path.join(root, f)
                yield fullname

# init_acl(0)
# speech2text = Speech2Text(tag_name='')
speech2text = Speech2Text(model_dir='/root/.cache/espnet_onnx/onnx_export/')
path = "espnet/egs2/librispeech/asr1/downloads/LibriSpeech/test-clean/"
j = 0
with open("test2.txt", 'w') as fout:
    for i in findAllFile(path):
        y, sr = librosa.load(i, sr=16000)
        nbest = speech2text(y)
        res = ""
        res = "".join(nbest[0][1])
        fout.write('{} {}\n'.format(i.split('/')[-1].split('.')[0], res))

The error is:

root@ubuntu:/home/test/espnet_onnx2/rnnt# python3 infer.py
Traceback (most recent call last):
  File "infer.py", line 24, in <module>
    nbest = speech2text(y)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/espnet_onnx/asr/asr_model.py", line 84, in __call__
    nbest_hyps = self.beam_search(enc[0])[:1]
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/espnet_onnx/asr/beam_search/beam_search_transducer.py", line 111, in __call__
    nbest_hyps = self.search_algorithm(enc_out)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/espnet_onnx/asr/beam_search/beam_search_transducer.py", line 238, in default_beam_search
    lm_tokens, max_hyp.lm_state, None
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/espnet_onnx/asr/model/lms/transformer_lm.py", line 62, in score
    k: v for k, v in zip(self.enc_in_cache_names, state)
TypeError: zip argument #2 must support iteration

I found that the script runs the score function in espnet_onnx/asr/model/lms/transformer_lm.py, but state is None, which may be the reason for this error.

Decoding speed and accuracy on the transformed onnx model

Hi, thanks for sharing the espnet_onnx system!

I ran into two problems when I tried to run inference with your code. My acoustic model was trained by myself on our own dataset. The AM architecture is a typical Conformer. I downloaded this code in June.

First, the decoding speed is too slow. When using torch to decode, the RTF is around 2.32; however, it becomes around 20 when using the converted ONNX model.

Second, the CER calculated with the torch version is 7.8%, while for the ONNX model it becomes 10.6%. I think something is probably wrong.

I'm giving some configs here:

export.py

import sys
sys.path.append('espnet-master')
sys.path.append('espnet-master/espnet_tts_frontend-master')
sys.path.append('espnet_onnx-master/espnet_onnx/export/asr')
import torch

from export_asr import ModelExport
from espnet2.bin.asr_inference import Speech2Text

if __name__ == '__main__':
    m = ModelExport(cache_dir = sys.argv[5])

    # export from trained model
    speech2text=Speech2Text(
            asr_train_config = sys.argv[1],
            asr_model_file=sys.argv[2],
            lm_train_config=sys.argv[3],
            lm_file=sys.argv[4],
            )

    m.export(model = speech2text, tag_name = 'speech2text', quantize=True)

And I get an onnx dir structured like:

asr/onnx/speech2text/
      config.yaml
      feats_stats.npz
      full/
      quantize/

The test set is a file list of WAVs, structured as:

bigfar_001_000001 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000001.wav
bigfar_001_000002 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000002.wav
bigfar_001_000003 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000003.wav
bigfar_001_000004 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000004.wav
bigfar_001_000005 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000005.wav
bigfar_001_000006 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000006.wav
...

The decoding process is:

decode.py

import sys
sys.path.append('espnet_onnx-master/espnet_onnx/asr')

import time
import threading
import librosa
import os
from tqdm import tqdm
from asr_model import Speech2Text

if __name__ == '__main__':
    """ step1: load onnx file """
    speech2text = Speech2Text(tag_name='speech2text', model_dir=sys.argv[3])

    """ step2: ASR """
    f = open(sys.argv[1])
    lines = f.readlines()
    for line in tqdm(lines):
        with open(os.path.join(sys.argv[2], 'hyp_flush_1process.trn'), 'a') as fout:
            wav_name = line.split(' ')[0].strip()
            processing_wav = line.split(' ')[1].strip()

            start = time.time()
            y, sr = librosa.load(processing_wav, sr=16000)
            nbest = speech2text(y)
            asr_result = nbest[0][0]
            end = time.time()

            for j in range(len(asr_result)):
                fout.write(asr_result[j])
                if j != len(asr_result) - 1:
                    fout.write(' ')
            fout.write('\t')
            fout.write('(')
            fout.write(wav_name)
            fout.write('-')
            fout.write(wav_name)
            fout.write(')')
            fout.write('\n')

            print('processing:  ', processing_wav)
            print('Result:         ', asr_result)
            print('Time:           ', end-start, 's')

Furthermore, I noticed that you mentioned in a recent issue that there may be some problems with Conformer AMs for ASR; has this been fixed?

Looking forward to your reply!

Using by huggingface model

Hello, this project looks very useful to me.

Is it possible to use this project with an ASR model (VGGRNN) registered on Hugging Face?
If so, how should I set tag_name or model_dir?

I have a .pth file trained with ESPnet; can I convert the .pth model to ONNX using this project?

Conformer-RNN Transducer model export onnx error

When I use the Conformer-RNN Transducer model from
https://github.com/espnet/espnet/tree/v.202205/egs2/librispeech/asr1
date: Fri Mar 25 04:35:42 EDT 2022
python version: 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0]
espnet version: espnet 0.10.7a1
pytorch version: pytorch 1.8.1+cu111
Git hash: 21d19be00089678ca27f7fce474ef8d787689512
Commit date: Wed Mar 16 08:06:52 2022 -0400
ASR config: conf/train_rnnt_conformer.yaml
Pretrained model: https://huggingface.co/espnet/chai_librispeech_asr_train_rnnt_conformer_raw_en_bpe5000_sp

to export to ONNX, I ran into the following error:
Traceback (most recent call last):
  File "export.py", line 30, in <module>
    m.export(speech2text, 'onnx_export', quantize=False)
  File "/home/p00510131/espnet_onnx/espnet_onnx/export/asr/export_asr.py", line 110, in export
    ctc_model = CTC(model.asr_model.ctc.ctc_lo)
AttributeError: 'NoneType' object has no attribute 'ctc_lo'

The export script is:

from espnet2.bin.asr_inference import Speech2Text
import os
import yaml
from pathlib import Path  # needed for Path below
from espnet_onnx.export import ASRModelExport

m = ASRModelExport()
transducer_conf = yaml.safe_load(Path('espnet/egs2/librispeech/asr1/conf/decode_rnnt_conformer.yaml').read_text())
speech2text = Speech2Text(
    asr_train_config="exp/asr_train_rnnt_conformer_ngpu4_raw_en_bpe5000_sp/config.yaml",
    asr_model_file="exp/asr_train_rnnt_conformer_ngpu4_raw_en_bpe5000_sp/valid.loss.ave_10best.pth",
    transducer_conf=transducer_conf["transducer_conf"],
    lm_weight=0.0,
)

m.export(speech2text, 'onnx_export', quantize=False)

[Feature request] ESPnet2 standalone Transducer

Hi Masao!

I was wondering if you could consider adding support for the standalone version of ESPnet2 Transducer? See the doc.
I'm quite interested in ONNX, but I have too much on my hands right now to work on this project... I can help with code understanding or similar things, though.

Module not found error

Hi, I am getting the following error:

from espnet_onnx.espnet_onnx.export import TTSModelExport


ONNX Model Fail to run

Hi, I have exported an ESPnet model trained on my custom dataset using espnet_onnx. The model fails to work properly on some audio files. Below is the error I am getting:

Non-zero status code returned while running Add node. Name:'/encoders/encoders.0/self_attn/Add' Status Message: /encoders/encoders.0/self_attn/Add: right operand cannot broadcast on dim 3 LeftShape: {1,8,171,171}, RightShape: {1,1,1,127}

Any idea what could be the issue here? I have run inference with the model on 1500 audio clips and I am getting exactly the same error on around 400 of them.

xvector-vits conversion error

Hi @Masao-Someki, have you tried converting xvector-VITS to ONNX? When I try to do this, I run into this problem:

/root/anaconda3/envs/espnet/lib/python3.7/site-packages/torch/onnx/symbolic_helper.py:719: UserWarning: allowzero=0 by default. In order to honor zero value in shape use allowzero=1
warnings.warn("allowzero=0 by default. In order to honor zero value in shape use allowzero=1")
WARNING: The shape inference of prim::Constant type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Floating point exception(core dump)

converting to triton model

Hi, thanks to this repository I exported the ONNX model that I trained, and I am curious whether the model runs on CPU or GPU.
Now I'm trying to convert the espnet_onnx model to a Triton server model.
How can I do this? Could you give me a hint?
Thank you.

Unknown model file format version

  1. Exported the model successfully:
/home/miniconda3/envs/espnet/lib/python3.8/site-packages/espnet_onnx-0.1.9-py3.8.egg/espnet_onnx/export/asr/models/language_models/embed.py:360: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
/home/miniconda3/envs/espnet/lib/python3.8/site-packages/espnet_onnx-0.1.9-py3.8.egg/espnet_onnx/export/asr/models/multihead_att.py:116: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
/home/miniconda3/envs/espnet/lib/python3.8/site-packages/torch/onnx/symbolic_helper.py:719: UserWarning: allowzero=0 by default. In order to honor zero value in shape use allowzero=1
  warnings.warn("allowzero=0 by default. In order to honor zero value in shape use allowzero=1")
  2. Decoding with the exported model fails:
Traceback (most recent call last):
  File "test_text_e2e_espnet.py", line 20, in <module>
    speech2text = Speech2Text("wenet")
  File "/home/miniconda3/envs/espnet/lib/python3.8/site-packages/espnet_onnx-0.1.9-py3.8.egg/espnet_onnx/asr/asr_model.py", line 43, in __init__
  File "/home/miniconda3/envs/espnet/lib/python3.8/site-packages/espnet_onnx-0.1.9-py3.8.egg/espnet_onnx/asr/abs_asr_model.py", line 53, in _build_model
  File "/home/miniconda3/envs/espnet/lib/python3.8/site-packages/espnet_onnx-0.1.9-py3.8.egg/espnet_onnx/asr/model/encoder.py", line 11, in get_encoder
  File "/home/miniconda3/envs/espnet/lib/python3.8/site-packages/espnet_onnx-0.1.9-py3.8.egg/espnet_onnx/asr/model/encoders/encoder.py", line 35, in __init__
  File "/home/miniconda3/envs/espnet/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 280, in __init__
    self._create_inference_session(providers, provider_options)
  File "/home/miniconda3/envs/espnet/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 307, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from espnet_model_onnx/wenet/full/xformer_encoder.onnx failed:/onnxruntime_src/onnxruntime/core/graph/model.cc:111 onnxruntime::Model::Model(onnx::ModelProto&&, const PathString&, const IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&) Unknown model file format version.

The bug caused by torch version

@Masao-Someki Hi Masao, I found a bug: when I use torch version 1.11.0 to export the VITS model, everything works fine, but when I upgrade torch to 1.12.1 (required by torchaudio), I get a core dump when exporting VITS.
Can you reproduce this bug? BTW, everything works fine when exporting the JETS model with torch 1.12.1.

Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Warning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied.
Floating point exception(core dump)

IndexError: list index out of range

Hi,

Could you please help me with the issue below when I try to export a TTS model using export_from_zip? Thanks in advance!
Traceback (most recent call last):
  File "convert_to_onnx.py", line 18, in <module>
    m.export_from_zip(
  File "/home/jetson/.virtualenvs/espnet2/lib/python3.8/site-packages/espnet_onnx/export/tts/export_tts.py", line 119, in export_from_zip
    self.export(model, tag_name, quantize)
  File "/home/jetson/.virtualenvs/espnet2/lib/python3.8/site-packages/espnet_onnx/export/tts/export_tts.py", line 60, in export
    model_config = self._create_config(model, export_dir)
  File "/home/jetson/.virtualenvs/espnet2/lib/python3.8/site-packages/espnet_onnx/export/tts/export_tts.py", line 127, in _create_config
    ret.update(get_preprocess_config(model.preprocess_fn, path))
  File "/home/jetson/.virtualenvs/espnet2/lib/python3.8/site-packages/espnet_onnx/export/tts/get_config.py", line 47, in get_preprocess_config
    'cleaner_types': model.text_cleaner.cleaner_types[0]
IndexError: list index out of range

wav quality drop

Hi, I initialized Text2Speech with my own AM model and vocoder and exported the ONNX model, but the sound quality drops significantly. I only modified the HiFi-GAN inference code in https://github.com/Masao-Someki/espnet_onnx/blob/feature/add_PWGVocoder/espnet_onnx/export/tts/models/vocoders/parallel_wavegan.py because the HiFi-GAN code in the ParallelWaveGAN repo does not support the parameter x.
I also checked the ESPnet AM and vocoder against the ONNX AM and vocoder, and they look the same.
Could you please offer some advice?

"indices element out of data bounds" while inferencing

Hi,
I try to run inference on a WAV file as in the documentation:

import librosa
from espnet_onnx import Speech2Text
speech2text = Speech2Text(tag_name='')
y, sr = librosa.load('sample.wav', sr=16000)
nbest = speech2text(y)

but when running nbest = speech2text(y), an error like this occurs:

onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running Gather node. Name:'Gather_4' Status Message: indices element out of data bounds, idx=525 must be within the inclusive range [-512,511]

My feats_lengths is 526; why does this error occur?

a small bug

@Masao-Someki Hi, thank you very much for completing the ONNX export of VITS. When testing my own trained model with cleaner=none, there was a small problem; after solving it, everything works fine!
File "/work/ysj/espnet_onnx/espnet_onnx/export/tts/get_config.py", line 46, in get_preprocess_config 'cleaner_types': model.text_cleaner.cleaner_types[0] IndexError: list index out of range

For streaming model result first word is often missing when displaying the best hypo

Hi @Masao-Someki,

I am testing my streaming Conformer-Transformer model after converting it from espnet2 to ONNX format. I noticed that the first word in the best hypothesis is often missing, although it was recognized correctly (the debug log shows it via the "(batch_beam_search:271) DEBUG: best hypo: ..." message).

I found some suspicious code in this line https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/asr/asr_streaming.py#164 where the _get_result() function always removes the first two tokens:

def _get_result(self, hyp: Hypothesis):
        token_int = hyp.yseq[2:-1].astype(np.int64).tolist()

I think _get_result() should just remove the first & last token (assumed to be "<sos/eos>") like so:

def _get_result(self, hyp: Hypothesis):
        token_int = hyp.yseq[1:-1].astype(np.int64).tolist()

After this change, my issue with the missing word at the beginning was resolved. So this looks like a bug to me.

Could you please double-check and let me know if I should create a PR?

Thanks!

max_seq_length not working for fastspeech2

@Masao-Someki I tried changing the max_seq_length for fastspeech2, but it gives an error.

[screenshot of the error]

I used the following command while exporting and only changed the tag name:

m.set_export_config(
    max_seq_len=3000
)

If I use the same command with VITS, it works well. Can you please check?

about JETS

espnet2-tts already supports the end-to-end model JETS. Do you have any plans to support exporting JETS to ONNX? I think with the existing experience with FastSpeech2 and HiFiGAN, this could be done very quickly.

vits inference bug

@Masao-Someki I encountered an error when running VITS inference with the latest version:
Traceback (most recent call last):
  File "export2onnx.py", line 21, in <module>
    output_dict = tts("hello how are you")
  File "/work/espnet_onnx/espnet_onnx/tts/tts_model.py", line 86, in __call__
    output_dict = self.tts_model(text, **options)
  File "/work/espnet_onnx/espnet_onnx/tts/model/tts_models/vits.py", line 58, in __call__
    wav, att_w, dur = self.model.run(output_names, input_dict)
  File "/root/anaconda3/envs/espnet/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 188, in run
    raise ValueError("Model requires {} inputs. Input Feed contains {}".format(num_required_inputs, num_inputs))
ValueError: Model requires 2 inputs. Input Feed contains 1

A difference in accuracy between an ONNX model and a torch model.

Thanks for developing this repo.

I tested espnet_onnx using the CSJ corpus.
There was a difference in accuracy between the ONNX model and the torch model.

Here is how I tested the ONNX model on the CSJ corpus.
I converted the CSJ torch model (https://zenodo.org/record/4037458#.YsVJT-zP30o) to ONNX using the script shown below.

from espnet2.bin.asr_inference import Speech2Text
from espnet_onnx.export import ASRModelExport

m = ASRModelExport()
m.set_export_config(max_seq_len=3000)
m.export_from_pretrained('kan-bayashi/csj_asr_train_asr_transformer_raw_char_sp_valid.acc.ave', quantize=False)

I transcribed the CSJ eval sets with the ONNX model using the script shown below.

import librosa
import sys
from espnet_onnx import Speech2Text

speech2text = Speech2Text(tag_name='kan-bayashi/csj_asr_train_asr_transformer_raw_char_sp_valid.acc.ave')

target = sys.argv[1]
# /opt/csj_eval_dump/eval[1-3] are copied from dump/raw/eval[1-3] generated by egs2/csj/asr1/run.sh
root_path = "/opt/csj_eval_dump/"
with open(f"{root_path}/{target}/wav.scp") as f:
    for line in f:
        t = line.strip().split(" ")
        utterance_id = t[0]
        filepath = t[1].split("dump/raw/")[1]
        y, sr = librosa.load(f"{root_path}/{filepath}", sr=16000)
        nbest = speech2text(y)[0][0]
        print(f"{utterance_id} {nbest}")

An example of transcribed texts is shown below.

A01M0097_0000211_0001695 えー内容としましては
A01M0097_0002761_0009322 生成過程のモデルモデルに基づくF0パターンの分析各音節の分節的特徴と
A01M0097_0009867_0020254 えー音調指令との間の相対的な時間関係の規則化え音調指令の大きさの比の規則化えー音調指令とフレーズ指令の大きさの量子化
A01M0097_0020642_0026743 規則によって生成されたF0パターンを持つ合成音声のデモンストレーションの順でお話ししたいと思います
A01M0097_0033771_0045423 えーこの図は日本語音声についてのとてもよく知られた生成過程モデルまー藤崎モデルなんですがえーこのモデルはフレーズ指令と正のアクセント指令によって実測のF0パターンをよく近似できることが

Transcribed texts are copied to egs2/csj/asr1/exp/[...]/decode_asr_lm_lm_train_lm_jp_char_valid.loss.ave_asr_model_valid.acc.ave/eval[1-3]/text and are scored by local/score.sh.
I obtained results shown below.

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_lm_lm_train_lm_jp_char_valid.loss.ave_asr_model_valid.acc.ave/eval1|1272|43897|95.0|2.9|2.1|1.0|6.0|60.4|
|decode_asr_lm_lm_train_lm_jp_char_valid.loss.ave_asr_model_valid.acc.ave/eval2|1292|43623|95.8|2.5|1.7|0.7|4.9|61.8|
|decode_asr_lm_lm_train_lm_jp_char_valid.loss.ave_asr_model_valid.acc.ave/eval3|1385|28225|95.4|2.5|2.1|1.0|5.6|43.5|

According to https://zenodo.org/record/4037458#.YsVJT-zP30o the torch model can obtain results shown below.

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_lm_lm_train_lm_char_valid.loss.ave_asr_model_valid.acc.ave/eval1|1272|43897|95.9|2.5|1.6|0.8|4.9|50.5|
|decode_asr_lm_lm_train_lm_char_valid.loss.ave_asr_model_valid.acc.ave/eval2|1292|43623|96.9|2.0|1.0|0.6|3.7|50.3|
|decode_asr_lm_lm_train_lm_char_valid.loss.ave_asr_model_valid.acc.ave/eval3|1385|28225|96.8|2.1|1.1|0.7|3.9|32.6|

My environment is as follows.

Python 3.7.13

torch==1.11.0+cpu
torch-complex==0.4.3
torch-optimizer==0.3.0
torchaudio==0.11.0+cpu
espnet-model-zoo==0.1.7
espnet-onnx==0.1.9
espnet-tts-frontend==0.0.3
-e git+https://github.com/espnet/espnet.git@a2abaf11c81e58653263d6cc8f957c0dfd9677e7#egg=espnet

Could you give me any hints on how to improve the accuracy?
Thanks for your help.

Cannot import "TTSModelExport"

Hello, I followed the tutorial for installation on Colab,
but I cannot import TTSModelExport successfully.
The example I ran is shown below:
[screenshot of the failing import]

Speech2Text.__init__() got an unexpected keyword argument 'train_config' when downloading model

Hi there.

I am using this code, straight from the usage section:

from espnet2.bin.asr_inference import Speech2Text
from espnet_onnx.export import ASRModelExport

m = ASRModelExport()

# download with espnet_model_zoo and export from pretrained model
m.export_from_pretrained('espnet/kan-bayashi_vctk_full_band_multi_spk_vits', quantize=True)

And getting this result:

Downloading (…)de1ff/.gitattributes: 100%|████████████████████████████████████████████████| 1.18k/1.18k [00:00<?, ?B/s]
Downloading (…)7a4efde1ff/README.md: 100%|████████████████████████████████████████████████| 1.91k/1.91k [00:00<?, ?B/s]
Downloading (…)rg/tr_no_dev/spk2sid: 100%|████████████████████████████████████████████████████| 872/872 [00:00<?, ?B/s]
Downloading (…)no_space/config.yaml: 100%|████████████████████████████████████████████████| 8.23k/8.23k [00:00<?, ?B/s]
Downloading (…)or_backward_time.png: 100%|█████████████████████████████████████████| 33.1k/33.1k [00:00<00:00, 353kB/s]
Downloading (…)scriminator_loss.png: 100%|█████████████████████████████████████████| 45.5k/45.5k [00:00<00:00, 486kB/s]
Downloading (…)tor_forward_time.png: 100%|█████████████████████████████████████████| 32.4k/32.4k [00:00<00:00, 346kB/s]
Downloading (…)inator_fake_loss.png: 100%|█████████████████████████████████████████| 59.9k/59.9k [00:00<00:00, 640kB/s]
Downloading (…)_optim_step_time.png: 100%|█████████████████████████████████████████| 59.7k/59.7k [00:00<00:00, 637kB/s]
Downloading (…)nator_train_time.png: 100%|█████████████████████████████████████████| 28.7k/28.7k [00:00<00:00, 307kB/s]
Downloading (…)nerator_adv_loss.png: 100%|█████████████████████████████████████████| 59.5k/59.5k [00:00<00:00, 476kB/s]
Downloading (…)inator_real_loss.png: 100%|█████████████████████████████████████████| 60.4k/60.4k [00:00<00:00, 322kB/s]
Downloading (…)or_backward_time.png: 100%|█████████████████████████████████████████| 30.1k/30.1k [00:00<00:00, 321kB/s]
Downloading (…)nerator_dur_loss.png: 100%|█████████████████████████████████████████| 42.3k/42.3k [00:00<00:00, 451kB/s]
Downloading (…)tor_forward_time.png: 100%|█████████████████████████████████████████| 54.0k/54.0k [00:00<00:00, 494kB/s]
Downloading (…)_feat_match_loss.png: 100%|█████████████████████████████████████████| 40.1k/40.1k [00:00<00:00, 427kB/s]
Downloading (…)rator_train_time.png: 100%|█████████████████████████████████████████| 31.0k/31.0k [00:00<00:00, 283kB/s]
Downloading (…)ax_cached_mem_GB.png: 100%|█████████████████████████████████████████| 28.9k/28.9k [00:00<00:00, 231kB/s]
Downloading (…)images/iter_time.png: 100%|█████████████████████████████████████████| 57.1k/57.1k [00:00<00:00, 406kB/s]
Downloading (…)s/generator_loss.png: 100%|█████████████████████████████████████████| 28.7k/28.7k [00:00<00:00, 102kB/s]
Downloading (…)_optim_step_time.png: 100%|█████████████████████████████████████████| 56.8k/56.8k [00:00<00:00, 298kB/s]
Downloading (…)nerator_mel_loss.png: 100%|█████████████████████████████████████████| 30.9k/30.9k [00:00<00:00, 165kB/s]
Downloading (…)enerator_kl_loss.png: 100%|████████████████████████████████████████| 23.4k/23.4k [00:00<00:00, 83.1kB/s]
Downloading (…)mages/optim0_lr0.png: 100%|█████████████████████████████████████████| 25.2k/25.2k [00:00<00:00, 269kB/s]
Downloading (…)7a4efde1ff/meta.yaml: 100%|████████████████████████████████████████████████████| 360/360 [00:00<?, ?B/s]
Downloading (…)mages/train_time.png: 100%|█████████████████████████████████████████| 29.1k/29.1k [00:00<00:00, 310kB/s]
Downloading (…)mages/optim1_lr0.png: 100%|█████████████████████████████████████████| 25.0k/25.0k [00:00<00:00, 266kB/s]
Downloading (…)count.ave_10best.pth: 100%|██████████████████████████████████████████| 387M/387M [01:50<00:00, 3.49MB/s]
Fetching 28 files: 100%|███████████████████████████████████████████████████████████████| 28/28 [01:54<00:00,  4.10s/it]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Enrique.Navarro\Desktop\Pruebas_TTS\espnet_onnx\env_espnet\lib\site-packages\espnet_onnx\export\asr\export_asr.py", line 176, in export_from_pretrained
    model = Speech2Text.from_pretrained(tag_name, **pretrained_config)
  File "C:\Users\Enrique.Navarro\Desktop\Pruebas_TTS\espnet_onnx\env_espnet\lib\site-packages\espnet2\bin\asr_inference.py", line 516, in from_pretrained
    return Speech2Text(**kwargs)
TypeError: Speech2Text.__init__() got an unexpected keyword argument 'train_config'
>>> speech2text = Speech2Text(args)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'args' is not defined

Failed to convert Espnet2 model using external (Fbank+Pitch) features to ONNX format

I simply tested the espnet_onnx export by training an ESPnet2 model with the "librispeech_100" recipe.
The ONNX conversion works for a model trained with the default features (fbank), which is good.

But when I change the feature_type from default to "fbank_pitch", which pre-generates fbank+pitch features with the Kaldi extractor and uses them for the subsequent training and decoding, an error pops up when I use espnet_onnx to convert the trained model.

Can you suggest what the problem is and how to fix it?

The command I am using is below:
python3 -m espnet_onnx.export --model_type asr --tag conformer_ext_feature --input asr_conformer_lr2e-3_warmup15k_amp_nondeterministic_valid.acc.ave.zip

And here is the error:
File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/espnet_onnx/export/__main__.py", line 91, in <module> m.export_from_zip( File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/espnet_onnx/export/asr/export_asr.py", line 191, in export_from_zip self.export(model, tag_name, quantize, optimize) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/espnet_onnx/export/asr/export_asr.py", line 91, in export self._export_encoder(enc_model, export_dir, verbose) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/espnet_onnx/export/asr/export_asr.py", line 246, in _export_encoder self._export_model(model, verbose, path) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/espnet_onnx/export/asr/export_asr.py", line 226, in _export_model torch.onnx.export( File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/onnx/__init__.py", line 350, in export return utils.export( File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/onnx/utils.py", line 163, in export _export( File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/onnx/utils.py", line 1074, in _export graph, params_dict, torch_out = _model_to_graph( File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/onnx/utils.py", line 727, in _model_to_graph graph, params, torch_out, module = _create_jit_graph(model, args) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/onnx/utils.py", line 602, in _create_jit_graph graph, torch_out = _trace_and_get_graph_from_model(model, args) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/onnx/utils.py", line 517, in _trace_and_get_graph_from_model trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph( File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/jit/_trace.py", line 1175, in _get_trace_graph outs = ONNXTracedModule(f, strict, _force_outplace, return_inputs, _return_inputs_states)(*args, **kwargs) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/jit/_trace.py", line 127, in forward graph, out = torch._C._create_graph_by_tracing( File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/jit/_trace.py", line 118, in wrapper outs.append(self.inner(*trace_inputs)) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1118, in _slow_forward result = self.forward(*input, **kwargs) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/espnet_onnx/export/asr/models/encoders/conformer.py", line 104, in forward xs_pad, mask = self.embed(feats, mask) File 
"/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1118, in _slow_forward result = self.forward(*input, **kwargs) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/espnet_onnx/export/asr/models/language_models/embed.py", line 74, in forward return self.model(x, mask) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1118, in _slow_forward result = self.forward(*input, **kwargs) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/espnet_onnx/export/asr/models/language_models/subsampling.py", line 48, in forward x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f)) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1118, in _slow_forward result = self.forward(*input, **kwargs) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward input = module(input) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1118, in _slow_forward result = self.forward(*input, **kwargs) File "/home/cli/miniconda3/envs/espnet-curr-test/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward return F.linear(input, self.weight, self.bias) RuntimeError: mat1 and mat2 shapes cannot be multiplied (24x4864 and 5120x256)

IndexError: list index out of range

Export code:

from espnet2.bin.tts_inference import Text2Speech
from espnet_onnx.export import TTSModelExport

m = TTSModelExport()

# download with espnet_model_zoo and export from pretrained model
m.export_from_pretrained('kan-bayashi/csmsc_tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone_train.loss.ave', quantize=True)

# export from trained model
text2speech = Text2Speech(args)
m.export(text2speech, 'kan-bayashi/csmsc_tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone_train.loss.ave', quantize=True)

Error:

WARNING:root:Fallback to conformer_pos_enc_layer_type = 'legacy_rel_pos' due to the compatibility. If you want to use the new one, please use conformer_pos_enc_layer_type = 'latest'.
WARNING:root:Fallback to conformer_self_attn_layer_type = 'legacy_rel_selfattn' due to the compatibility. If you want  to use the new one, please use conformer_pos_enc_layer_type = 'latest'.
Traceback (most recent call last):
  File "export_espnet_tts.py", line 7, in <module>
    m.export_from_pretrained('kan-bayashi/csmsc_tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone_train.loss.ave', quantize=True)
  File "/home/ybZhang/miniconda3/envs/espnet/lib/python3.8/site-packages/espnet_onnx-0.1.10-py3.8.egg/espnet_onnx/export/tts/export_tts.py", line 114, in export_
  File "/home/ybZhang/miniconda3/envs/espnet/lib/python3.8/site-packages/espnet_onnx-0.1.10-py3.8.egg/espnet_onnx/export/tts/export_tts.py", line 127, in _create_config
  File "/home/ybZhang/miniconda3/envs/espnet/lib/python3.8/site-packages/espnet_onnx-0.1.10-py3.8.egg/espnet_onnx/export/tts/get_co

Hardcoded `tag_config_path`

Error encountered when trying to export a model to a custom output directory:

Traceback (most recent call last):
  File "xxx.py", line 59, in <module>
    main()
  File "xxx.py", line 33, in main
    m.export(
  File "E:\repos\on-device-speech-translation\onnx_utils\export_st.py", line 141, in export
    update_model_path(tag_name, base_dir)
  File "D:\Anaconda\Lib\site-packages\espnet_onnx\utils\config.py", line 55, in update_model_path
    save_config(config, tag_config_path)
  File "D:\Anaconda\Lib\site-packages\espnet_onnx\utils\config.py", line 32, in save_config
    with open(path, "w", encoding="utf-8") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\xxx\\.cache\\espnet_onnx\\tag_config.yaml'

It seems the path is hardcoded to the home cache directory here.

HiFiGAN vocoder support in TTS

Hi,

The supported vocoders listed in https://github.com/espnet/espnet_onnx/blob/master/docs/TTSSupported.md include HiFiGAN, but in the code HiFiGAN does not seem to be handled by the vocoder loader function below.

Any idea how to use the HiFiGAN vocoder? I am also attaching the config.yaml file that TTSModelExport created -> config.yaml

def _build_vocoder(self, providers, use_quantized):
    self.vocoder = None
    if self.config.vocoder.vocoder_type == 'not_used':
        logging.info('Vocoder is not used.')
    elif self.config.vocoder.vocoder_type == 'Spectrogram2Waveform':
        self.vocoder = Spectrogram2Waveform(self.config.vocoder)
    elif self.config.vocoder.vocoder_type == 'PretrainedPWGVocoder':
        raise RuntimeError('Currently, PWGVocoder is not supported.')
    elif self.config.vocoder.vocoder_type == 'OnnxVocoder':
        self.vocoder = Vocoder(self.config.vocoder, providers, use_quantized)
    else:
        raise RuntimeError(f'vocoder type {self.config.vocoder_type} is not supported.')

Question on stream_asr.end() function for streaming asr

Hi @Masao-Someki,

In the readme the example for streaming asr shows the use of start() and end() methods:

from espnet_onnx import StreamingSpeech2Text

stream_asr = StreamingSpeech2Text(tag_name)

# start streaming asr
stream_asr.start()
while streaming:
  wav = <some code to get wav>
  assert len(wav) == stream_asr.hop_size
  stream_text = stream_asr(wav)[0][0]

# You can get non-streaming asr result with end function
nbest = stream_asr.end()

In a real streaming scenario, should the start() and end() methods be called whenever the microphone is opened and closed (see the sketch below)?
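
For reference, a minimal sketch of the microphone loop I have in mind; it assumes the third-party sounddevice package, tag_name and the recording flag are placeholders, and the placement of start()/end() around the open microphone is exactly what I am asking about:

import sounddevice as sd
from espnet_onnx import StreamingSpeech2Text

stream_asr = StreamingSpeech2Text(tag_name)
hop = stream_asr.hop_size  # number of samples expected per call

stream_asr.start()  # microphone opened
with sd.InputStream(samplerate=16000, channels=1, dtype='float32', blocksize=hop) as mic:
    while recording:  # e.g. until the user stops recording
        chunk, _ = mic.read(hop)   # chunk has shape (hop, 1)
        wav = chunk[:, 0]          # flatten so that len(wav) == stream_asr.hop_size
        stream_text = stream_asr(wav)[0][0]
nbest = stream_asr.end()  # microphone closed, get the non-streaming result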

I am asking because I noticed that the end() function in https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/asr/asr_streaming.py#151 calls the self.batch_beam_search() function, which restarts decoding from position 0 again and causes a rather large delay for longer speech inputs. If I change https://github.com/espnet/espnet_onnx/blob/master/espnet_onnx/asr/asr_streaming.py#151 to use the self.beam_search() method instead, it avoids decoding the entire utterance again at the end and thus the delay.

Could you please clarify why self.batch_beam_search() is used in stream_asr.end() function?

Thanks!

ModuleNotFoundError: No module named 'espnet_model_zoo.downloader'; 'espnet_model_zoo' is not a package

Trying to download the pretrained model using the following code:

from espnet2.bin.asr_inference import Speech2Text
from espnet_onnx.export import ASRModelExport

m = ASRModelExport()

tag_name = 'asr_train_asr_transformer2_raw_zh_char_batch_bins20000000_ctc_confignore_nan_gradtrue_sp_valid.acc.ave'

# download with espnet_model_zoo and export from pretrained model
m.export_from_pretrained(tag_name, quantize=True)

# export from trained model
speech2text = Speech2Text(args)
m.export(speech2text, '', quantize=True)

which ends up with this error:

Traceback (most recent call last):
  File "espnet_onnx_export.py", line 7, in <module>
    from espnet_onnx.export import ASRModelExport
  File "/home/neso/espnet_onnx/espnet_onnx/export/__init__.py", line 1, in <module>
    from .asr.export_asr import ASRModelExport
  File "/home/neso/espnet_onnx/espnet_onnx/export/asr/export_asr.py", line 14, in <module>
    from espnet_model_zoo.downloader import ModelDownloader
  File "/home/neso/espnet_onnx/espnet_model_zoo.py", line 8, in <module>
    speech2text = Speech2Text.from_pretrained(
  File "/home/neso/anaconda3/envs/espnet_onnx/lib/python3.8/site-packages/espnet2/bin/asr_inference.py", line 556, in from_pretrained
    from espnet_model_zoo.downloader import ModelDownloader
ModuleNotFoundError: No module named 'espnet_model_zoo.downloader'; 'espnet_model_zoo' is not a package

onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 211 by 421

When I try to run the inference demo, I get an error:
import librosa
from espnet_onnx import Speech2Text

speech2text = Speech2Text(model_dir='/root/autodl-tmp/.cache/espnet_onnx/librispeech_100-asr-conformer-aed')

wav_file = '../wav_test/121-121726-0000.wav'
y, sr = librosa.load(wav_file, sr=16000)
nbest = speech2text(y)

Error info:
/root/miniconda3/envs/espnet-onnx/lib/python3.8/site-packages/espnet_onnx/utils/abs_model.py:63: UserWarning: Inference will be executed on the CPU. Please provide gpu providers. Read How to use GPU on espnet_onnx in readme in detail.
warnings.warn(
2023-04-15 19:34:26.289677552 [E:onnxruntime:, sequential_executor.cc:368 Execute] Non-zero status code returned while running Add node. Name:'Add_398' Status Message: /hdd/doc/onnxruntime/onnxruntime/core/providers/cpu/math/element_wise_ops.h:523 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 211 by 421

Traceback (most recent call last):
  File "inference.py", line 9, in <module>
    nbest = speech2text(y)
  File "/root/miniconda3/envs/espnet-onnx/lib/python3.8/site-packages/espnet_onnx/asr/asr_model.py", line 79, in __call__
    enc, _ = self.encoder(speech=speech, speech_length=lengths)
  File "/root/miniconda3/envs/espnet-onnx/lib/python3.8/site-packages/espnet_onnx/asr/model/encoders/encoder.py", line 70, in __call__
    self.forward_encoder(feats, feat_length)
  File "/root/miniconda3/envs/espnet-onnx/lib/python3.8/site-packages/espnet_onnx/asr/model/encoders/encoder.py", line 87, in forward_encoder
    self.encoder.run(["encoder_out", "encoder_out_lens"], {
  File "/root/miniconda3/envs/espnet-onnx/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 192, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'Add_398' Status Message: /hdd/doc/onnxruntime/onnxruntime/core/providers/cpu/math/element_wise_ops.h:523 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 211 by 421

Please help me. Thank you very much!

v0.1.9 bug

When I tested the LJSpeech demo, I encountered the following problem:

Traceback (most recent call last):
  File "export2onnx.py", line 6, in <module>
    m.export_from_pretrained(tag_name)
  File "/work/ysj/espnet_onnx/espnet_onnx/export/tts/export_tts.py", line 109, in export_from_pretrained
    self.export(model, tag_name, quantize)
  File "/work/ysj/espnet_onnx/espnet_onnx/export/tts/export_tts.py", line 64, in export
    self._export_tts(tts_model, export_dir, verbose)
  File "/work/ysj/espnet_onnx/espnet_onnx/export/tts/export_tts.py", line 174, in _export_tts
    self._export_model(model, verbose, path)
  File "/work/ysj/espnet_onnx/espnet_onnx/export/tts/export_tts.py", line 163, in _export_model
    dynamic_axes=model.get_dynamic_axes(),
  File "/root/anaconda3/envs/espnet/lib/python3.7/site-packages/torch/onnx/__init__.py", line 320, in export
    custom_opsets, enable_onnx_checker, use_external_data_format)
  File "/root/anaconda3/envs/espnet/lib/python3.7/site-packages/torch/onnx/utils.py", line 111, in export
    custom_opsets=custom_opsets, use_external_data_format=use_external_data_format)
  File "/root/anaconda3/envs/espnet/lib/python3.7/site-packages/torch/onnx/utils.py", line 707, in _export
    _set_opset_version(opset_version)
  File "/root/anaconda3/envs/espnet/lib/python3.7/site-packages/torch/onnx/symbolic_helper.py", line 849, in _set_opset_version
    raise ValueError("Unsupported ONNX opset version: " + str(opset_version))
ValueError: Unsupported ONNX opset version: 15

My PyTorch version is 1.10.1; is it related to this?

Quantize model is slower than raw model

I tested espnet_onnx with a Conformer model. I evaluated 100 wav files 10 times and calculated the RTF (forward time only). The results are:

| |cpu|gpu|
|---|---|---|
|fp32|0.0180668|0.00263397|
|quantize|0.0172804|0.0124609|

The quantized model is much slower than the fp32 model on GPU and only a little bit faster on CPU.

System information:
torch /cuda / GPU: 11.0 / 11.6 / A100
cpu: AMD EPYC 7402 24-Core Processor
onnx: 1.10.1
onnxruntime-gpu : 1.13.1
espnet_onnx: 0.1.9

Have you tested the speed of the quantized model on GPU?

A difference in accuracy between an ONNX Conformer model and a torch Conformer model.

Sorry to bother you again.

I tested the Conformer model of espnet_onnx using the CSJ corpus.
There was a difference in accuracy between the ONNX Conformer model and the torch Conformer model.

Here is how I tested the Conformer model of espnet_onnx on the CSJ corpus.
I downloaded the CSJ torch Conformer model (https://zenodo.org/record/4065140#.YsurAezP30o) and converted it to ONNX using the script shown below.

from espnet2.bin.asr_inference import Speech2Text
from espnet_onnx.export import ASRModelExport

m = ASRModelExport()

m.set_export_config(max_seq_len=3000)
speech2text = Speech2Text.from_pretrained('asr_train_asr_conformer_raw_char_sp_valid.acc.ave.zip')
m.export(speech2text, 'kan-bayashi/csj_asr_train_asr_conformer_raw_char_sp_valid.acc.ave', quantize=False)

I modified ~/.cache/espnet_onnx/kan-bayashi/csj_asr_train_asr_conformer_raw_char_sp_valid.acc.ave/config.yaml as shown below.

weights:
  ctc: 0.3
  decoder: 0.7
  length_bonus: 0.0
  lm: 0.3
  ngram: 0.0

I transcribed the CSJ eval sets with the ONNX Conformer model in the same manner as https://github.com/Masao-Someki/espnet_onnx/issues/36#issue-1295560021.

I obtained results shown below.

### CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_lm_lm_train_lm_jp_char_valid.loss.ave_asr_model_valid.acc.ave/eval1|1272|43897|95.0|3.3|1.7|1.2|6.2|56.3|
|decode_asr_lm_lm_train_lm_jp_char_valid.loss.ave_asr_model_valid.acc.ave/eval2|1292|43623|96.5|2.3|1.2|0.8|4.3|53.3|
|decode_asr_lm_lm_train_lm_jp_char_valid.loss.ave_asr_model_valid.acc.ave/eval3|1385|28225|96.0|2.6|1.4|1.0|5.0|38.1|

I transcribed the CSJ eval sets with the torch Conformer model (https://zenodo.org/record/4065140#.YsurAezP30o).
I obtained the results shown below.

### CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_lm_lm_train_lm_jp_char_valid.loss.ave_asr_model_valid.acc.ave/eval1|1272|43897|96.3|2.3|1.5|0.7|4.5|49.4|
|decode_asr_lm_lm_train_lm_jp_char_valid.loss.ave_asr_model_valid.acc.ave/eval2|1292|43623|97.2|1.7|1.0|0.5|3.3|47.6|
|decode_asr_lm_lm_train_lm_jp_char_valid.loss.ave_asr_model_valid.acc.ave/eval3|1385|28225|97.2|1.8|1.0|0.7|3.6|32.5|

Could you give me any hints on how to improve the accuracy?
Thanks for your help.

testing onnx module encoder

I was testing the encoder module with
https://github.com/espnet/espnet_onnx/blob/18eb341d44ccf83c3ab35bc040b4102a73602922/tests/unit_tests/test_inference_asr.py and got:

    raise ValueError("Model requires {} inputs. Input Feed contains {}".format(num_required_inputs, num_inputs))
ValueError: Model requires 9 inputs. Input Feed contains 1.

The dummy input covers only the feats. Do we need to create dummy inputs for the rest of the inputs (see the sketch below)?
{'xs_pad',
 'mask',
 'buffer_before_downsampling',
 'buffer_after_downsampling',
 'prev_addin',
 'pos_enc_xs',
 'pos_enc_addin',
 'past_encoder_ctx',
 'indicies'}
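
For what it's worth, here is a minimal sketch of how dummy feeds for all of the session's inputs could be built; it only uses the generic onnxruntime API, and replacing every dynamic dimension with 1 is an assumption that may not match what the streaming encoder actually expects:

import numpy as np
import onnxruntime as ort

# "encoder.onnx" is a hypothetical path to the exported streaming encoder model.
sess = ort.InferenceSession("encoder.onnx")

dummy_feed = {}
for inp in sess.get_inputs():
    # Dynamic dimensions are reported as None or symbolic strings; substitute 1 for them.
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    dtype = np.float32 if "float" in inp.type else np.int64
    dummy_feed[inp.name] = np.zeros(shape, dtype=dtype)

# Run the encoder once with the zero-valued dummy inputs.
outputs = sess.run(None, dummy_feed)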

BTW the decoder module works fine.

Decoding speed difference between before and after conversion

Thank you for your always kind reply. Also, your research has really helped me a lot.

In general, the purpose of ONNX conversion is to make the model lighter and faster. Have you ever checked the difference in decoding speed before and after ONNX conversion in the ASR task? I would also like to know if there is a difference in accuracy. It doesn't matter which model you've been running.
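
For what it's worth, a minimal sketch of such a before/after comparison might look like the following; the tag name and wav file are placeholders, and it only measures wall-clock decoding time, not accuracy:

import time

import librosa
from espnet2.bin.asr_inference import Speech2Text as TorchSpeech2Text
from espnet_onnx import Speech2Text as OnnxSpeech2Text

tag = '<your model tag>'  # placeholder: use the same pretrained tag for both backends
y, sr = librosa.load('sample.wav', sr=16000)

torch_model = TorchSpeech2Text.from_pretrained(tag)
onnx_model = OnnxSpeech2Text(tag_name=tag)

# Time one decoding pass per backend on the same audio.
for name, model in [('torch', torch_model), ('onnx', onnx_model)]:
    start = time.time()
    nbest = model(y)
    print(f"{name}: {time.time() - start:.3f} s")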

How to include stats.h5 of PWG Vocoder during ONNX conversion for TTS

Hi,
I am trying to convert a pretrained LJSpeech TTS model based on kan-bayashi/ljspeech_fastspeech2 and parallel_wavegan/ljspeech_parallel_wavegan.v1 using the code below:

########################### ONNX Conversion ############################

from espnet2.bin.tts_inference import Text2Speech
from espnet_onnx.export import TTSModelExport

m = TTSModelExport()

tag_exp = "exp/tts_train_fastspeech2_raw_phn_tacotron_g2p_en_no_space/train.loss.ave_5best.pth"
train_config="exp/tts_train_fastspeech2_raw_phn_tacotron_g2p_en_no_space/config.yaml"

vocoder_tag = 'parallel_wavegan.v1/checkpoint-400000steps.pkl'
vocoder_config= 'parallel_wavegan.v1/config.yml'

text2speech = Text2Speech.from_pretrained(
    train_config=train_config,
    model_file=tag_exp,
    vocoder_file=vocoder_tag,
    vocoder_config=vocoder_config,
    speed_control_alpha=1.0,
    always_fix_seed=False,
)

tag_name = 'ljspeech_pretrained'
m.export(text2speech, tag_name, quantize=True)

########################### Inference ############################

from espnet_onnx import Text2Speech
import soundfile
import numpy as np
import time

text2speech = Text2Speech(tag_name)

text = 'hello world!'
output_dict = text2speech(text)  # run inference with the exported ONNX model
wav = output_dict['wav']

soundfile.write("ljspeech_pretrained_test.wav", wav, 22050, "PCM_16")

######################################################################

On synthesizing, the audio quality is very low.
I realized that the converted ONNX folder does not contain the stats.h5 file from the PWG vocoder folder.
~/.cache/espnet_onnx/ljspeesch_pretrained/: config.yaml feats_stats.npz full quantize

Can anyone please explain how to include stats.h5 during inference with espnet_onnx?

quantized model takes a very long time to synthesize

I've tested a sentence with these three models; the quantized model seems abnormal.

|model|infer time (s)|RTF|
|---|---|---|
|pytorch vits|10.691850900650024|0.349893|
|onnx vits|4.495715618133545|0.146124|
|onnx quantize vits|83.49901461601257|2.672614|

(bug)[ONNXRuntimeError] : 2 : INVALID_ARGUMENT

Error:

[ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running ScatterND node. Name:'ScatterND_70' Status Message: invalid indice found, indice = -1
File "D:\recognition\espnet\esp2onnx\espnet_onnx\asr\model\decoders\xformer.py", line 84, in batch_score
input_dict
File "D:\recognition\espnet\esp2onnx\espnet_onnx\asr\beam_search\batch_beam_search.py", line 136, in score_full
scores[k], states[k] = d.batch_score(hyp.yseq, hyp.states[k], x)
File "D:\recognition\espnet\esp2onnx\espnet_onnx\asr\beam_search\batch_beam_search.py", line 195, in search
[x for _ in range(n_batch)]).reshape(n_batch, *x.shape))
File "D:\recognition\espnet\esp2onnx\espnet_onnx\asr\beam_search\beam_search.py", line 334, in call
best = self.search(running_hyps, x)
File "D:\recognition\espnet\esp2onnx\espnet_onnx\asr\asr_model.py", line 84, in call
nbest_hyps = self.beam_search(enc[0])[:1]
File "D:\recognition\espnet\esp2onnx\demo.py", line 12, in
res = speech2text(speech=data)

Code:

from espnet_onnx import Speech2Text
import librosa
import numpy as np

if __name__ == "__main__":
    tag = 'espnet/Shinji_Watanabe_laborotv_asr_train_asr_conformer2_latest33_raw_char_sp_valid.acc.ave'

    speech2text = Speech2Text(tag)

    data, _ = librosa.load('1.wav',sr=16000)

    res = speech2text(speech=data)
    print(res)

Environment:
win11, python 3.7.7, onnx==1.12.0, onnxruntime==1.10.0, espnet==202205

tts onnx gpu inference time problems

Hi, I ran into a problem with ONNX inference on GPU:

  1. ONNX inference on GPU is much slower than ONNX inference on CPU, and only sometimes faster than GPU PyTorch inference (about 2x acceleration).
  2. When I run inference on the same text twice or more, it reaches about 2x acceleration compared to GPU PyTorch inference (see the warm-up sketch below).

Any advice? Thanks!
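
If only the first call is slow, a common workaround (an assumption on my part, not something confirmed in this thread) is to run a throwaway warm-up inference before timing, since the first call pays one-time graph-optimization and GPU initialization costs:

from espnet_onnx import Text2Speech

text2speech = Text2Speech('<your model tag>')  # placeholder tag name

# Warm-up call: excluded from any timing measurements.
_ = text2speech('warm up')

# Subsequent calls should reflect the steady-state GPU inference time.
output_dict = text2speech('hello world')
wav = output_dict['wav']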

ASR onnx export accuracy drop

Hi, I was wondering if you have tested the accuracy of espnet_onnx. I have encountered an accuracy drop from ~5% WER to 12% WER after switching to ONNX. Just wondering if anyone has encountered the same issue.

Dynamic input for TTS

Hi, thanks for your work on this great project!

I tried exporting as specified in the example:

from espnet_onnx.export import TTSModelExport

m = TTSModelExport()

tag_name = 'kan-bayashi/ljspeech_vits'
# download with espnet_model_zoo and export from pretrained model
m.export_from_pretrained(tag_name, quantize=True)

Then I tried to use model:

from espnet_onnx import Text2Speech
import IPython

tag_name = 'kan-bayashi/ljspeech_vits'
text2speech = Text2Speech(tag_name, use_quantized=True)

text = 'This model is so small it can be run on a smartwatch! This model is so small it can be run on a smartwatch!'
output_dict = text2speech(text) # inference with onnx model.
wav = output_dict['wav']
IPython.display.Audio(data=wav, rate=22050)

But I get the following message:

2022-10-01 12:31:15.826208114 [E:onnxruntime:, sequential_executor.cc:364 Execute] Non-zero status code returned while running Gather node. Name:'Gather_3861' Status Message: indices element out of data bounds, idx=712 must be within the inclusive range [-512,511]
Traceback (most recent call last):
  File "<path>/onnx-tts/use_onnx.py", line 8, in <module>
    output_dict = text2speech(text) # inference with onnx model.
  File "<path>/onnx-tts/.venv/lib/python3.9/site-packages/espnet_onnx/tts/tts_model.py", line 86, in __call__
    output_dict = self.tts_model(text, **options)
  File "<path>/onnx-tts/.venv/lib/python3.9/site-packages/espnet_onnx/tts/model/tts_models/vits.py", line 58, in __call__
    wav, att_w, dur = self.model.run(output_names, input_dict)
  File "<path>/onnx-tts/.venv/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 192, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running Gather node. Name:'Gather_3861' Status Message: indices element out of data bounds, idx=712 must be within the inclusive range [-512,511]

I believe it's because the input length is fixed at export time.
I can try to fix this, but maybe you already know how to handle it?
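
One possible workaround, assuming the error comes from the default maximum sequence length of 512 used at export time, is to re-export with a larger max_seq_len via set_export_config, as other reports on this page do for ASR and FastSpeech2; a minimal sketch:

from espnet_onnx.export import TTSModelExport

m = TTSModelExport()
# Raise the maximum input length baked into the exported model.
# 3000 is an arbitrary upper bound; adjust it to the longest text you expect.
m.set_export_config(max_seq_len=3000)
m.export_from_pretrained('kan-bayashi/ljspeech_vits', quantize=True)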

Long sentence cause onnxruntime error

First of all, thank you for your great work. I've converted the model to ONNX successfully, but I found an issue: when I synthesize a long sentence, it causes an onnxruntime error.

Traceback (most recent call last):
  File "infer_onnx.py", line 39, in <module>
    wav = text2speech(text)["wav"]
  File "/opt/conda/lib/python3.8/site-packages/espnet_onnx/tts/tts_model.py", line 79, in __call__
    output_dict = self.tts_model(text, **options)
  File "/opt/conda/lib/python3.8/site-packages/espnet_onnx/tts/model/tts_models/vits.py", line 59, in __call__
    wav, att_w, dur = self.model.run(output_names, input_dict)
  File "/opt/conda/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 192, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running Gather node. Name:'Gather_3704' Status Message: indices element out of data bounds, idx=2595 must be within the inclusive range [-512,511]
