WER are we? An attempt at tracking states of the art(s) and recent results on speech recognition. Feel free to correct! (Inspired by Are we there yet?)
To be updated with Interspeech 2015...
(Possibly trained on more data than LibriSpeech.)
WER test-clean | WER test-other | Paper | Notes |
---|---|---|---|
4.83% | | A time delay neural network architecture for efficient modeling of long temporal contexts | TDNN + iVectors |
5.51% | 13.97% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | HMM-DNN + pNorm* |
8.01% | 22.49% | same, Kaldi | HMM-(SAT)GMM |
| 12.51% | Audio Augmentation for Speech Recognition | TDNN + pNorm + speed up/down speech |
(Possibly trained on more data than WSJ.)
WER eval'92 | WER eval'93 | Paper | Notes |
---|---|---|---|
3.63% | 5.66% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | test-set on open vocabulary (i.e. harder), model = HMM-DNN + pNorm* |
5.6% | | Convolutional Neural Networks-based Continuous Speech Recognition using Raw Speech Signal | CNN over RAW speech (wav) |
(Possibly trained on more data than SWB, but test set = full Hub5'00.)
WER (SWB) | WER (full=SWB+CH) | Paper | Notes |
---|---|---|---|
8% | 14.1% | The IBM 2015 English Conversational Telephone Speech Recognition System | CNN+RNN (lattice-based MBR loss) with maxout + annealed dropout trained on SWB+Fisher+CH. NNLM scoring. |
12.6% | 16% | Deep Speech: Scaling up end-to-end speech recognition | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB |
12.6% | 18.4% | Sequence-discriminative training of deep neural networks | HMM-DNN + sMBR |
12.9% | 19.3% | Audio Augmentation for Speech Recognition | TDNN + pNorm + speed up/down speech |
15% | 19.1% | Building DNN Acoustic Models for Large Vocabulary Speech Recognition | DNN + Dropout |
10.4% | | Joint Training of Convolutional and Non-Convolutional Neural Networks | CNN on MFSC/fbanks + 1 non-conv layer for FMLLR/I-Vectors concatenated in a DNN |
11.5% | | Deep Convolutional Neural Networks for LVCSR | CNN |
(So far, all results trained on TIMIT and tested on the standard test set.)
PER | Paper | Notes |
---|---|---|
16.7% | Combining Time- and Frequency-Domain Convolution in Convolutional Neural Network-Based Phone Recognition | CNN in time and frequency + dropout, 17.6% w/o dropout |
17.6% | Attention-Based Models for Speech Recognition | Bi-RNN + Attention |
17.7% | Speech Recognition with Deep Recurrent Neural Networks | Bi-LSTM + skip connections w/ CTC |
23% | Deep Belief Networks for Phone Recognition | (first, modern) HMM-DBN |
TODO
TODO
TODO?
- WER: word error rate
- PER: phone error rate
- LM: language model
- HMM: hidden Markov model
- GMM: Gaussian mixture model
- DNN: deep neural network
- CNN: convolutional neural network
- DBN: deep belief network (RBM-based DNN)
- RNN: recurrent neural network
- LSTM: long short-term memory
- CTC: connectionist temporal classification
- MMI: maximum mutual information
- MPE: minimum phone error
- sMBR: state-level minimum Bayes risk
- SAT: speaker adaptive training
- MLLR: maximum likelihood linear regression
- LDA: (in this context) linear discriminant analysis
- MFCC: Mel-frequency cepstral coefficients
- FB/FBANKS/MFSC: Mel-frequency spectral coefficients (Mel filterbank energies)
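
For readers new to the metrics in the tables, here is a minimal sketch of how WER and PER are usually computed: the edit (Levenshtein) distance between reference and hypothesis token sequences, divided by the number of reference tokens. The function names (`edit_distance`, `error_rate`, `wer`) are made up for this illustration, and the whitespace tokenization is an assumption. The papers listed above rely on their own scoring tools and text normalization, so this will not reproduce their numbers exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (lists of words or phones)."""
    # prev[j] = distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion: ref token missing from hyp
                curr[j - 1] + 1,          # insertion: extra token in hyp
                prev[j - 1] + (r != h),   # substitution (cost 1) or match (cost 0)
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """(substitutions + deletions + insertions) / number of reference tokens."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate of two plain-text strings, split on whitespace (an assumption)."""
    return error_rate(reference.split(), hypothesis.split())


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    # PER works the same way on phone sequences.
    print(error_rate(["sil", "k", "ae", "t"], ["sil", "k", "eh", "t"]))
```

Because insertions count as errors, the rate can exceed 100%. Corpus-level figures are normally total errors over total reference tokens across all utterances, not the mean of per-utterance rates.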