
End-to-end Automatic Speech Recognition Systems - PyTorch Implementation

This is an open source project (formerly named Listen, Attend and Spell - PyTorch Implementation) for end-to-end ASR by Tzu-Wei Sung and me. The implementation was mostly done with PyTorch, the well-known deep learning toolkit.

The end-to-end ASR is based on Listen, Attend and Spell1. Several recently proposed techniques are also implemented, serving as additional plug-ins for better performance. For the list of techniques implemented, please refer to the highlights, configuration and references.

Feel free to use/modify them; any bug report or improvement suggestion is appreciated. If you find this project helpful for your research, please consider citing our paper, thanks!

Highlights

  • Feature Extraction

    • On-the-fly feature extraction using torchaudio as backend
    • Character/subword2/word encoding of text
  • Training End-to-end ASR

    • Seq2seq ASR with different types of encoder/attention3
    • CTC-based ASR4, which can also be hybrid5 with the former
    • YAML-styled model construction and hyperparameter setting
    • Training process visualization with TensorBoard, including attention alignment
  • Speech Recognition with End-to-end ASR (i.e. Decoding)

    • Beam search decoding
    • RNN language model training and joint decoding for ASR6
    • Joint CTC-attention based decoding6
    • Greedy decoding & CTC beam search contributed by Heng-Jui (Harry) Chang

You may check out some example log files with TensorBoard by downloading them from: coming soon

Dependencies

  • Python 3
  • Computing power (a high-end GPU) and memory space (both RAM and GPU RAM) are extremely important if you'd like to train your own model.
  • Required packages and their uses are listed in requirements.txt.

Instructions

Step 0. Preprocessing - Generate Text Encoder

You may use the text encoders provided in tests/sample_data/ and skip this step.

The subword model is trained with sentencepiece. For the character/word models, you have to generate a vocabulary file listing the vocabulary line by line. You may also use util/generate_vocab_file.py, so that you only have to prepare a text file containing all the text you want to use for generating the vocabulary file or subword model. Please update the data.text.* fields in the config file if you want to change the mode or vocabulary file. For the subword model, use the file ending with .model as vocab_file (see the config sketch at the end of this step).

python3 util/generate_vocab_file.py --input_file TEXT_FILE \
                                    --output_file OUTPUT_FILE \
                                    --vocab_size VOCAB_SIZE \
                                    --mode MODE

For more details, please refer to python3 util/generate_vocab_file.py -h.
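
For reference, below is a minimal sketch of the relevant config fields; the exact key names are assumptions modelled on the provided example configs, so check those for the authoritative format.

data:
  text:
    mode: 'subword'                                   # 'character' / 'word' / 'subword'
    vocab_file: 'tests/sample_data/subword-16k.model' # for subword mode, point to the .model file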

Step 1. Configuring - Model Design & Hyperparameter Setup

All parameters related to training/decoding are stored in a YAML file, which makes hyperparameter tuning and experiment management easy. See the documentation and examples for the exact format. Note that the provided example configs were not fine-tuned; you may want to write your own config for best performance.
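
Roughly speaking, a config covers the data pipeline, the training hyperparameters and the model architecture. The outline below is only a sketch; the section names are assumptions modelled on the example configs, not an exhaustive listing.

data:     # corpus paths, feature extraction, text encoding (see Step 0)
  ...
hparas:   # optimizer, learning-rate schedule, total training steps
  ...
model:    # encoder / attention / decoder architecture and sizes
  ...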

Step 2. Training (End-to-end ASR or RNN-LM)

Once the config file is ready, run the following command to train the end-to-end ASR (or the language model):

python3 main.py --config <path of config file> 

For example, train an ASR on LibriSpeech and watch the log with

# Checkout options available
python3 main.py -h
# Start training with specific config
python3 main.py --config config/libri/asr_example.yaml
# Open TensorBoard to see log
tensorboard --logdir log/
# Train an external language model
python3 main.py --config config/libri/lm_example.yaml --lm

All settings are parsed from the config file automatically to start training, and the logs can be accessed through TensorBoard. Please note that the error rate reported on TensorBoard is biased (see issue #10); you should run the testing phase to get the true performance of the model. Options available in this phase include the following:

  • config: Path of the config file.
  • seed: Random seed; note that this option affects the result.
  • name: Experiment name for logging and saving the model. By default it is <name of config file>_<random seed>.
  • logdir: Path to store training logs (TensorBoard log files), default log/.
  • ckpdir: Directory to store model checkpoints, default ckpt/.
  • njobs: Number of workers used for the data loader; consider increasing this if data preprocessing takes most of your training time, default 6.
  • no-pin: Disable the pin-memory option of the PyTorch DataLoader.
  • cpu: CPU-only mode; not recommended, use it for debugging.
  • no-msg: Hide all messages from stdout.
  • lm: Switch to RNN-LM training mode.
  • test: Switch to decoding mode (do not use during the training phase).
  • cudnn-ctc: Use CuDNN as the backend of PyTorch CTC. Unstable (see the linked issue); not sure if it is solved in the latest PyTorch with cuDNN version > 7.6.

Step 3. Speech Recognition & Performance Evaluation

To test a model, run the following command

python3 main.py --config <path of config file> --test --njobs <int>

Please note that decoding is performed without batch processing; use more workers to speed it up at the cost of more RAM. By default, recognition results are stored at result/<name>/ as two CSV files, auto-named according to the decoding config file. output.csv stores the best hypothesis produced by the ASR, and beam.csv records the top hypotheses from beam search. The result file can be evaluated with eval.py. For example, test the example ASR trained on LibriSpeech and check its performance with

python3 main.py --config config/libri/decode_example.yaml --test --njobs 8
# Check WER/CER
python3 eval.py --file result/asr_example_sd0_dev_output.csv

Most of the options work similarly to the training phase, except the following:

  • test: Must be enabled.
  • config: Path to the decoding config file.
  • outdir: Path to store decoding results.
  • njobs: Number of threads used for decoding; very important for efficiency. Larger values decode faster at the cost of more RAM/GPU RAM.

Troubleshooting

  • Loss becomes nan right after training begins

    For CTC, len(pred) > len(label) is necessary. Also consider setting zero_infinity=True for torch.nn.CTCLoss.
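
    A minimal standalone sketch of the suggested setting (shapes are illustrative only, not tied to this repository):

    import torch

    # zero_infinity=True replaces infinite losses (which arise when the label is
    # longer than the prediction) with 0, preventing NaN gradients early in training.
    ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

    log_probs = torch.randn(50, 16, 20).log_softmax(2)           # (T, batch, n_class)
    targets = torch.randint(1, 20, (16, 30), dtype=torch.long)   # (batch, max_label_len)
    input_lengths = torch.full((16,), 50, dtype=torch.long)
    target_lengths = torch.randint(10, 30, (16,), dtype=torch.long)

    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)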

ToDo

  • Provide examples
  • Pure CTC training / CTC beam decode bug (out-of-candidate)
  • Greedy decoding
  • Customized dataset
  • Util. scripts
  • Finish CLM migration and reference
  • Store preprocessed dataset on RAM

Acknowledgements

  • Parts of the implementation refer to ESPnet, a great end-to-end speech processing toolkit by Watanabe et al.
  • Special thanks to William Chan, the first author of LAS, for answering my questions during implementation.
  • Thanks to xiaoming, Odie Ko, b-etienne, Jinserk Baik and Zhong-Yi Li for identifying several issues in our implementation.

Reference

  1. Listen, Attend and Spell, W Chan et al.
  2. Neural Machine Translation of Rare Words with Subword Units, R Sennrich et al.
  3. Attention-Based Models for Speech Recognition, J Chorowski et al.
  4. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, A Graves et al.
  5. Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning, S Kim et al.
  6. Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM, T Hori et al.

Citation

@inproceedings{liu2019adversarial,
  title={Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model},
  author={Liu, Alexander and Lee, Hung-yi and Lee, Lin-shan},
  booktitle={Acoustics, Speech and Signal Processing (ICASSP)},
  year={2019},
  organization={IEEE}
}

@misc{alex2019sequencetosequence,
    title={Sequence-to-sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding},
    author={Alexander H. Liu and Tzu-Wei Sung and Shun-Po Chuang and Hung-yi Lee and Lin-shan Lee},
    year={2019},
    eprint={1910.12740},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}


end-to-end-asr-pytorch's Issues

libgomp: Thread creation failed: Resource temporarily unavailable

Hi,
I've run python3 main.py --config config/libri/asr_example.yaml --njobs=12
Everything seems normal until libgomp: Thread creation failed: Resource temporarily unavailable pops up. I've been stuck here for quite a while.
Has anyone else encountered this issue?

Facing missing positional argument: 'init_adadelta'

While testing the model on the 'test-clean' dataset, I hit an issue at "self.model = ASR(self.feat_dim, self.vocab_size, **self.config['model'])"; the error occurs because one required positional argument is missing: 'init_adadelta'.

I only changed max_steps, tf_steps and valid_step in the asr_example.yaml config file.

Step 0: text encoding

Can we use the subword models in the sample data folder for most datasets?
Also, if we train the ASR, do we need to train the LM?

Listener #layers

Do you have any comment on using 2 encoder (listener) layers (as in your code, 1 BLSTM + 1 pBLSTM) versus 4 encoder layers (1 BLSTM + 3 pBLSTM, as given in the LAS paper)?

Prediction has a problem

The prediction only outputs one character, for example:

pred:尤, truth:白云区钟落潭竹一村民白云区钟落潭竹一村的村民跟记者报料, er:27.000000, len:27

What does "subword" mean?

Hi,

In the text mode, I know what "character" stands for, but what does "subword" stand for?
And what is the difference between "subword.model", "subword-16k.model" and "subword-460.model"?

Thank you for your help

Padding target strings with the SOS token

I noticed that you use the same key, 0, both for indicating start of sentence (SOS)
table = {'<sos>':0,'<eos>':1}

and for padding the targets

new_y = np.zeros((len(y),max_len),dtype=int)
for idx,label_seq in enumerate(y):
     new_y[idx,:len(label_seq)] = np.array(label_seq)

Meaning that a short sentence in a batch, e.g. "hello, world" could be encoded as

<sos>, 11, 8, 15, 15, 18, 48, 49, 24, 18, 20, 15,  7, <eos>, <sos>, <sos>, <sos>, <sos>, <sos>

depending on the mapping and the maximum length of the batch.

I haven't really thought through the consequences of this, but wouldn't it be better to add a dedicated pad token for the targets?
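
A hypothetical sketch of what that could look like; this is not the repository's code, it just reserves a separate <pad> index so padding no longer collides with <sos>:

import numpy as np

# Shift the special symbols so index 0 is reserved for padding instead of <sos>.
table = {'<pad>': 0, '<sos>': 1, '<eos>': 2}

def pad_targets(y, max_len, pad_idx=table['<pad>']):
    new_y = np.full((len(y), max_len), pad_idx, dtype=int)   # fill with <pad>, not <sos>
    for idx, label_seq in enumerate(y):
        new_y[idx, :len(label_seq)] = np.array(label_seq)
    return new_y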

Error in CharacterTextEncoder

Hi,
The decode method in CharacterTextEncoder has an error.
Line 65:
if idx == self.pad_idx or (ignore_repeat and t > 0 and idx == vocabs[t-1]):

I think it should be:
if idx == self.pad_idx or (ignore_repeat and t > 0 and idx == idxs[t-1]):

Question for obtaining context in Attention class

Hi @Xenderliu,

First of all, thanks for sharing this great project. I have a question about obtaining the "context" in the Attention class. According to eq. (11) in the paper, the context ci is a vector at time step i, where i is the decoder timestep. So in my opinion, the context should have dimension (batch_size, decoder_time_length, listener_feature_dim). In that case, context = torch.bmm(attention_score, listener_feature) should be enough, shouldn't it? Please let me know if my understanding is wrong.
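
For reference, a minimal shape sketch of the bmm in question (variable names and sizes are illustrative, not the repository's):

import torch

B, T_dec, T_enc, D = 8, 1, 120, 512
attention_score = torch.softmax(torch.randn(B, T_dec, T_enc), dim=-1)  # (batch, decoder steps, encoder steps)
listener_feature = torch.randn(B, T_enc, D)                            # (batch, encoder steps, feature dim)
context = torch.bmm(attention_score, listener_feature)                 # (batch, decoder steps, feature dim)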

Thank you!
Jinserk

Single audio inference

How does your code perform inference? That is, I pass in an audio file, and the code outputs the recognition result.

Has anybody used custom data with librosa features for training LAS?

Hi. I want to change the dataset and recognize Korean.
I use librosa mel-spectrograms as the extracted features.
I didn't modify the model or training parameters, but the model predicts almost one single class when I give it some samples.

When I fed a sample generated with torch.randn, the results again covered almost just 2 classes. Is the problem in the audio preprocessing (e.g. MFCC, mel spectrogram) or in the model?

Does anybody know?

Multihead attention

I think there is an attribute error in /src/asr.py line 437: self.preprocess_mlp_dim. Should it be self.num_head? Thanks for your great and clear implementation.

Config file for Librispeech 960h

Hi,

Does anyone have a config file that works well for training the ASR model on librispeech 960h?
I can't seem to reach the ~4% WER reported by many research papers. My best so far is above 10%.
Surely, with the tools provided by this repository, there must be a way to reach that WER.

Inference decoding procedure is biased

I am not sure the decoding procedure is right at test time. Indeed, you use
max_label_len = min([batch_label.size()[1],kwargs['max_label_len']]) to tell the decoder when to stop. At test time you don't feed the decoder with labels, so you shouldn't use the label lengths either, right?
In my case, I decode until all the items in the batch have emitted the EOS symbol, which means the decoded sequences can be shorter or longer than the target sequences.
As a consequence, you shouldn't be able to compute the cross-entropy loss on the test set...
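
For illustration, a hypothetical sketch of stopping on EOS at test time instead of relying on label lengths; decoder_step and the surrounding setup are assumptions, not this repository's API:

import torch

def decode_until_eos(decoder_step, start_tokens, eos_idx, max_decode_step):
    # start_tokens: (batch,) tensor of <sos> indices
    finished = torch.zeros_like(start_tokens, dtype=torch.bool)
    prev_token, hyps = start_tokens, []
    for _ in range(max_decode_step):          # hard cap so decoding always terminates
        logits = decoder_step(prev_token)     # (batch, vocab)
        prev_token = logits.argmax(dim=-1)    # greedy pick; beam search would branch here
        hyps.append(prev_token)
        finished |= prev_token.eq(eos_idx)
        if finished.all():                    # stop once every item has emitted <eos>
            break
    return torch.stack(hyps, dim=1)           # (batch, decoded length)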

How to get TIMIT Working

Hello,

I got the TIMIT data from academic torrents.

It is not working with the code.

Preprocessing training data...
Done
Preprocessing testing data...
Done
Preprocessing completed.

Collected 0 training instances from ../data/timit/train (should be 4620 in complete TIMIT )
Collected 0 testing instances from ../data/timit/test (should be 1680 in complete TIMIT )
Spliting 0 out of 0 (5.0%) training data as validation set.

Normalizing data to let mean=0, sd=1 for each channel.
[]
Traceback (most recent call last):
File "timit_preprocess.py", line 181, in
mean_val, std_val, _ = calc_norm_param(X_train)
File "timit_preprocess.py", line 74, in calc_norm_param
mean_val = np.zeros(X[0].shape[1])
IndexError: list index out of range

How to make the limit example work?

I want to upgrade the repo to Python 3 as well. Could you help by giving details on how to make the limit librispeech and wsj examples work with the LAS model?

error running example code line

Hello! I tried to run the test example and got this error:

~/End-to-end-ASR-Pytorch$ python3.7 main.py --config config/libri_example.yaml
main.py:28: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(open(paras.config,'r'))
[INFO] Loading data from data/libri_fbank80_subword5000
<src.dataset.LibriDataset object at 0x7f7d895c9080>
Traceback (most recent call last):
  File "main.py", line 47, in <module>
    solver.load_data()
  File "/home/karina/End-to-end-ASR-Pytorch/src/solver.py", line 75, in load_data
    setattr(self,'train_set',LoadDataset('train',text_only=False,use_gpu=self.paras.gpu,**self.config['solver']))
  File "/home/karina/End-to-end-ASR-Pytorch/src/dataset.py", line 155, in LoadDataset
    return  DataLoader(ds, batch_size=1,shuffle=shuffle,drop_last=False,num_workers=n_jobs,pin_memory=use_gpu)
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 176, in __init__
    sampler = RandomSampler(dataset)
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 66, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

Do you know how to solve it, or what the problem is?

Single file Inference

Hi team,
I want to do single audio file inference. Can anyone please help me with this?

Pretrained model

Hello! I want to know how I can use the pretrained model to recognize an audio file. Can you show an example?

Questions regarding ctc prefix score algorithm

prev_blank = np.full((self.odim),r_prev[t-1,1],dtype=np.float32)
prev_blank[last_char] = self.logzero
# prev_nonblank
prev_nonblank = np.full((self.odim),r_prev[t-1,0],dtype=np.float32)
phi = np.logaddexp(prev_nonblank, prev_blank)
# P(h|current step is non-blank) = [ P(prev. step = y) + P()]*P(c)
r[t,0,:] = np.logaddexp(r[t-1,0,:],phi) + self.x[t,:]
# P(h|current step is blank) = [P(prev. step is blank) + P(prev. step is non-blank)]*P(now=blank)
r[t,1,:] = np.logaddexp(r[t-1,1,:],r[t-1,0,:]) + self.x[t,self.blank]
psi = np.logaddexp(psi,phi+self.x[t,:])

Hi there! Thanks for implementing such a great SoTA end-to-end ASR toolkit!
I really appreciate the complicated joint decoding algorithm part.
I'm a little bit confused about the implementation of the CTC prefix score decoding.
In ctc.py, line 51, I'm not sure whether the last_char dimension of prev_blank[last_char] should be assigned logzero or not.
Would it make a bit more sense if prev_nonblank[last_char] were assigned logzero instead?
[screenshot: the prefix-search decoding algorithm from Alex Graves' thesis]
Graves' PhD thesis states that, in line 17, the non-blank part of newLabelProb is (log)zero if p* ends in k, which might correspond to the last_char dimension of prev_nonblank?

Thanks again for all the hard work!

Librispeech results

Hi,
I tried to run this code on the Librispeech 100h data, but the NN does not converge; the CER stays high on the development set. Have you ever trained the NN on the Librispeech corpus?

Thanks

Inference is extremely slow

During training, validation runs at about 3-5 iterations per second (batch size = 16), but inference is extremely slow, about 4 minutes per example, which makes it completely impractical.

I'm using GeForce RTX 2080 Ti, beam size = 20, no language model, only seq2seq.

Encoding Target

In Step 0, you mentioned we can use one of the following options: phoneme/char/subword/word. But when I choose "word" instead of "subword", the encoding doesn't recognize it. The error is in line 134 of preprocess_librispeech.py; it occurs in the function read_text().

Can we apply the same encoding as subword to "word"?

Also, for the subword option, the bpe.vocab file is missing (in the case of LibriSpeech). Do we have to generate it ourselves? If yes, then how?

no text encoder training in timit preprocess

Hi, when representing the text as a 'subword' feature, we need to pretrain a text encoder, just like in preprocess_libri.py. But this training step is missing in preprocess_timit.py, so an error ("Key error: ") occurs when we represent text as a 'subword' feature for the TIMIT dataset. @Alexander-H-Liu, can you fix this error if you have time? Thank you very much!

Multi-GPU implementation

Hi, do you plan to implement a multi-GPU version of the training code?
On a single GPU, even training on the 100-hour Librispeech subset takes an enormous amount of time (about 4 days for me), so training on the whole Librispeech will take a month or more...

Maybe you can provide some instructions so I can do it myself? If it works properly, I would make a pull request with the corresponding changes.

Thanks

Is this repo still maintained?

Hello @Alexander-H-Liu ,

Great work you have done here. I have really enjoyed working with this repository.
I would like to know if this repository is still maintained, and if you have any upgrades in mind that you'd like to implement.

Thanks !

FileNotFoundError: [Errno 2] No such file or directory: 'result/libri_example_sd0/asr'

Hello,
After training the End-to-end ASR with the following command:
python3 main.py --config config/libri_example.yaml
I got the following error:


It seems that I don't have the ASR model in the mentioned directory; I checked and there is no model file there. Shouldn't it have been created during the training steps? Could anyone help me solve this problem?

Which state vector of multiple Speller layers is suggested for attention calculation?

Hi there,

I can tell from your code that you use the state vector of the first Speller layer to calculate the attention score.

attention_score,context = self.attention(self.decoder.state_list[0],encode_feature,encode_len)

Did you compare it with other possible choices, such as

  1. The state vector at the last layer of Speller?
  2. The concatenation of all state vectors from all layers?

I am interested in advice on the optimal choice; the original LAS paper did not mention which one is best.
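
For concreteness, a small sketch of the alternatives listed above, under the assumption that state_list holds one hidden state per Speller layer, each of shape (batch, hidden_dim):

import torch

# Hypothetical 2-layer Speller states, each (batch, hidden_dim)
state_list = [torch.randn(8, 512), torch.randn(8, 512)]

query_first  = state_list[0]                  # current behaviour: first layer's state
query_last   = state_list[-1]                 # alternative 1: last layer's state
query_concat = torch.cat(state_list, dim=-1)  # alternative 2: concatenation of all layers' states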

Thanks

Xuesong

Could you tell me the WER on the Librispeech datasets? Thanks

Hi, I have seen that you share some example log files with TensorBoard: log/log_url.txt.


So, could you tell me what error rate/dev and error rate/train are? Are they WER (word error rate)? Which sub-dataset are they evaluated on (dev-clean, dev-other, test-clean, test-other, or all of them)?

If it is not WER, could you tell me the WER on the Librispeech datasets, like this:

Dev        4.3%
DevOther   13.0%
Test       4.5%
TestOther  13.2%

Thank you a lot!!

About loading pretrained models

Hey.

Is there any good reason for not implementing loading of a pre-trained model (reference)?

Afaik, loading a pre-trained model is as easy as self.asr = torch.load('./results/<exp_name>/asr'). Is that assumption wrong?

Very different loss on validation

Just for testing, I'm trying to overfit a very small dataset, and I've set the validation dataset to be the same as the training one, but I get very different loss progressions between the two stages. On the training set the loss decreases constantly, but on validation it starts to increase after a few epochs. I do not use dropout. Shouldn't they be roughly the same?

pretrained model

Hi @Alexander-H-Liu, this is great work and I am doing some work based on yours; I will cite your paper and code. The training is very time-consuming, so could you provide the pretrained models on librispeech and TIMIT, if available? Thank you very much!
