uhh-lt / kaldi-tuda-de Goto Github PK

Scripts for training general-purpose large vocabulary German acoustic models for ASR with Kaldi.

License: Apache License 2.0

Shell 62.32% Python 37.68%

kaldi-tuda-de's Introduction

Open source speech recognition recipe and corpus for building German acoustic models with Kaldi

Open source speech recognition recipe and corpus for building German acoustic models with Kaldi
- News
- Pretrained models
Training your own models
References

This recipe and collection of scripts enables you to train large vocabulary German acoustic models for speaker-independent automatic speech recognition (ASR) with Kaldi. The scripts currently use three freely available German speech corpora: The Tuda-De corpus is recorded with a Microsoft Kinect and two other microphones in parallel at Technische Universität Darmstadt and has been released under a permissive license (CC-BY 4.0). This corpus compromises ~31h of training data per microphone and ~5h separated into development and test partitions. We also make use of the German subset from the Spoken Wikipedia Corpora (SWC), containing about 285h of additional data and the German subset of m-ailabs read speech data corpus (mirror) (237h). Recently we also added the German Commonvoice corpus from Mozilla (https://commonvoice.mozilla.org/de) with 370h of data. We use the test/dev sets from Tuda-De for WER evaluations.

The newest recipe (s5_r2) trains and tests on data from multiple microphones by default (all but Realtek - about 127h of audio in total). By editing run.sh you can also restrict it to a single microphone (e.g. only Kinect). It also trains on SWC data and M-ailabs by default, too, resulting in 630h of speech data in total after cleaning. See our paper for more information and WER results. More recent results are in the table in the pretrained models section.

The old s5 recipe used in our previous paper is also still available and trained only on the beamformed data of the Kinect microphone, checkout the README.md in the s5 directory if you want to reproduce the results of our old paper.

The scripts will ask you where to place larger files and can download all necessary files (speech corpus, German texts, phoneme dictionaries) to train the acoustic and language models. You can also download these resources manually, see Section "Getting data files separately" down below.

If you use our data, models or scripts in your academic work please cite our paper!

News

19 April 2023

As an alternative to the German Kaldi models in this repository, checkout speechcatcher. Speechcatcher models are trained end-to-end with punctuation, support streaming and run fast on CPU as well.

19 April 2022

We updated the LM (v6) and recrawled recent text data and extended the vocabulary (now 900k). Our newest best result (with 140 million sentences) is 6.19% WER on Tuda-De Dev and 6.93% WER on Tuda-De Test. See our newest pre-trained models here: pretrained models. If you need an overall faster model, you can also replace the default HCLG with this much smaller one: HCLG_s.

4 April 2022

We added the newest Common voice data (version 8), bringing the total training amount to 1700 hours! We also added a const arpa and rnn lm model for this new model. Our best result with rescoring is now a WER of 6.51 on Tuda-De dev and 7.43 on Tuda-De test! Currently, you can find the training scripts for this new model in the CV7 branch: https://github.com/uhh-lt/kaldi-tuda-de/tree/CV7 We'll soon merge them to the master branch.
The Tuda-De dataset also got updated to version 4, this release includes several fixed utterances. Thank you again, Sven Hartrumpf!

6 April 2021

We have added const arpa language models for rescoring (trained on 100 million sentences). These did reduce error rates further, our best result on tuda-de test is 11.85% WER now. A pre-trained RNN-LM will soon be available as well.

2 July 2020

We have added two new pretrained models: tuda_swc_mailabs_cv_voc683k and tuda_swc_mailabs_cv_voc683k_smaller_fst, both trained on 1000h of speech data and with our new LM.
The new model has a 13% lower WER on tuda-test. It also contains many more new and uptodate words and a better phoneme lexicon. See pretrained models for more details and download links.
You can also check out kaldi-model-server, our PyKaldi based solution to easily load our Kaldi models.

12 June 2020

We have added the Common Voice (de) dataset, the total amount of training data is over 1000h now!
We added a new language model (LM) trained on 100 million normalized German sentences, with recent data as well
We now ship a pre-trained ARPA for the LM, but you can also crawl and normalize your own data with the steps detailed in https://github.com/bmilde/german-asr-lm-tools/
Some errors in the phoneme inventory have been corrected. You will need to train the new model from scratch, as the phoneme inventories are incompatible.
A new manual lexicon resource has been added to kaldi-tuda-de, with recent words as well. Adds 13K+ manually verified lexicon words in X-SAMPA-DE format. See https://github.com/uhh-lt/kaldi-tuda-de/blob/master/s5_r2/local/de_extra_lexicon.txt
We created a lexicon editor to add and verify manual phoneme entries with active learning: https://github.com/uhh-lt/speech-lex-edit
New pre-trained ASR models will follow shortly

5 March 2019

A new pretrained model with a vocabulary of 400 thousand words is available: download
We added more aligned speech data (630h total now), thanks to the m-ailabs speech data corpus (mirror). We also thank Pavel Denisov for sending us a Kaldi data preparation script for this new open source corpus.

21 August 2018

A new pretrained model with a vocabulary of 350 thousand words is available: download
This model is also the best performing one in our paper.
This model has also been succesfully tested in the popular Kaldi Gstreamer Server software. The paths in this package are organized according to the Kaldi Gstreamer examples, a matching kaldi_tuda_de_nnet3_chain.yaml configuration file is included. A worker startup script is also included (run_tuda_de.sh), but you will probably need to change paths. See also the Kaldi + Gstreamer Server Software installation guide here.

15 August 2018

We thank Sven Hartrumpf for fixing xml files with incorrect transcriptions in the Tuda corpus! A new release of the corpus data will soon be available.

26 July 2018

Our paper "Open Source Automatic Speech Recognition for German" is accepted at ITG2018 (10.-12. October 2018, Oldenburg, Germany)! A preprint of the paper is available here.

26 June 2018

We moved the repository from tudarmstadt-lt to language technologies new uhh-lt github repository.
The ivector extractor had been missing from the acoustic model binary archive. You can download it separately from https://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/tdnn_chain_cleaned_tuda_swc_voc126k_ivector_extractor.tar.bz2 or redownload the full archive.

31 May 2018

A pre-trained TDNN-HMM chain model for German can now be downloaded from this address: https://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/tdnn_chain_cleaned_tuda_swc_voc126k.tar.bz2
Vocabulary: 126794 words, trained on tuda-de and SWC (268h combined). Contains all the content of the exp/chain_cleaned directory, also decoding lattices of tuda test/dev. See https://github.com/alumae/kaldi-gstreamer-server and https://github.com/jcsilva/docker-kaldi-gstreamer-server for a nice full-duplex server API for Kaldi models, you should be able to use our models there, too.

30 May 2018

We have added the option to train with additional data from the SWC corpus. See https://nats.gitlab.io/swc/ for more information on this dataset. The combined amount of training data is now around 268 hours.

02 May 2018

A recipe to train TDNN-HMM chain acoustic models is now available, similar to TED-LIUMs egs. See: https://github.com/tudarmstadt-lt/kaldi-tuda-de/blob/master/s5_r2/local/run_tdnn_1f.sh

25 April 2018

New s5_r2 recipe adapted from swbd s5c (GMM-HMM at the moment, TDNN recipe coming soon)!
s5_r2 local scripts are now compatible with Python3
Training on all microphones data is now possible and also the default
Instead of MARYs phonemizer for OOV words, sequitur G2P is now used
Updated Kaldi install instructions

Newest pretrained models

Acoustic model + FST	Training data	Tuda dev WER	Tuda test WER
tuda_swc_mailabs_cv8_voc900k (FST)	1700h (tuda+SWC+m-ailabs+cv8)	9.30	10.17
+ lm_v6_voc900k const arpa rescoring	140 million sentences	7.23	7.96
+ rnn_lmv6_lstm4x_voc900k rnnlm rescoring	140 million sentences	6.19	6.93

All results above are with number reformating, e.g. drei und sechzig -> dreiundsechzig. If you need an overall faster model, you can also replace the default HCLG with this much smaller one: HCLG_s. Together with RNNLM rescoring, the WER result will only be minimally bigger.

Previous pretrained models

Acoustic model + FST	Training data	Tuda dev WER (FST)	Tuda test WER (FST)
tuda_swc_voc126k / mirror	375h (tuda+SWC)	20.30	21.43
tuda_swc_voc350k / mirror	375h (tuda+SWC)	15.32	16.49
tuda_swc_mailabs_voc400k / mirror	630h (tuda+SWC+m-ailabs)	14.78	15.87
tuda_swc_mailabs_cv_voc683k_smaller_fst	1000h (tuda+SWC+m-ailabs+cv)	12.69	14.29
+ lm_v5_voc683k_smaller_fst const arpa rescoring	100 million sentences	10.92	12.37
+ reformat numbers	e.g. drei und sechzig -> dreiundsechzig	8.94	10.26
tuda_swc_mailabs_cv3_voc683k	1000h (tuda+SWC+m-ailabs+cv3)	12.26	13.79
+ lm_v5_voc683k const arpa rescoring	100 million sentences	10.47	11.85
+ reformat numbers	e.g. drei und sechzig -> dreiundsechzig	8.61	9.85
tuda_swc_mailabs_cv8_voc722k	1700h (tuda+SWC+m-ailabs+cv8)	10.94	12.09
+ lm_v5_voc722k const arpa rescoring	100 million sentences	9.25	10.17
+ reformat numbers	e.g. drei und sechzig -> dreiundsechzig	7.51	8.53
+ rnn_lm_lstm2x_voc722k rnnlm rescoring	100 million sentences	6.51	7.43

New: We have now added results for rescoring as well, which improves the FST decoding results further as expected. This includes both const arpa models as well as RNN LMs, see also our paper.

We have developed a PyKaldi based solution to use the models with either a local microphone or network streaming in real time: https://github.com/uhh-lt/kaldi-model-server

For batch decoding of media files, we developped subtitle2go to automatically generate German subtitles.

Another option to use the models is the Kaldi gstreamer server project. You can either stream audio and do online (real-time) recogniton with it or send wav files via http and get a JSON result back. See also the Kaldi + Gstreamer Server Software installation guide here. There is a run_tuda_de.sh in the package that starts Kaldi gstreamer workers for tuda_de. You will need to modify the KALDI_ROOT variable in the script so that it finds your Kaldi installation properly.

Training your own models

If you want to adapt our models (add training data, augment training data, change vocabulary, ...), you will need to retrain our models. A workstation or server with more than 64GB memory might be needed, having access to a lot of CPU cores is recommended and a recent Nvidia GPU is needed to train neural models such as the TDNN-HMM.

Prerequisites

Clone the repository with the submodule:

git clone --recurse-submodules https://github.com/uhh-lt/kaldi-tuda-de

The scripts are only tested under Linux (Ubuntu 16.04 - 20.04). Install at first some mandatory packages:

sudo apt install sox libsox-fmt-all

Download and install Kaldi and follow the installation instructions. You can download a recent version using git:

 git clone https://github.com/kaldi-asr/kaldi.git kaldi-trunk --origin golden

In Kaldi trunk:

go to tools/ and follow INSTALL instructions there.
Install a BLAS library. This can be Intel MKL, OpenBLAS or Atlas.

If you have an Intel CPU the easist and now recommended library is to install Intel MKL. You can install it easily on Debian/Ubuntu by running extras/install_mkl.sh. You can then skip the rest of this section and go to section 3.

Cross platform solution: Download and install OpenBLAS, build a non-multithreading (important!) library with:

make USE_THREAD=0 USE_LOCKING=1 FC=gfortran

Now follow the displayed instructions to install OpenBLAS headers and libs to a new and empty directory.

Warning! It is imperative to build a single threaded OpenBLAS library, otherwise you will encounter hard to debug problems with Kaldi as Kaldis parallelization interferes with the OpenBLAS one.

go to src/ and follow INSTALL instructions there. Intel MKL is found automatically, if you installed with OpenBLAS point the configure script to your OpenBLAS installation (see ./configure --help).

Our scripts are meant to be placed into its own directory in KALDIs egs/ directory. This is also where all the other recipes reside in. If you want to build DNN models, you probably want to enable CUDA in KALDI with the configure script in src/. You should have a relatively recent Nvidia GPU, at least one with the Kepler architecture.

You also need Sequitur G2P (https://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html, https://github.com/sequitur-g2p/sequitur-g2p). Download the package and run make, then edit the sequitur_g2p variable in s5_r2/cmd.sh to point to the g2p.py script.

You will also need a recent version of Python 3. Package requirements are:

pip3 install beautifulsoup4 lxml spacy && python -m spacy download de_core_news_lg

Additinally, the requests package was previously used to communicate with MaryTTS to generate phonemizations, however you won't need it if you run the standard setup.

Get LM text data

See https://github.com/bmilde/german-asr-lm-tools/ for instructions on getting recent German text data normalized. Place the resulting gzipped file in ${lm_dir}/cleaned_lm_text.gz, with the defaults: data/local/lm_std_big_v6/cleaned_lm_text.gz If you forget this step, the run.sh script will complain about a missing LM text file. Warning: The default vocabulary file local/voc_800k.txt may give suboptimal WER results, if you pair it with your own crawled data, so make sure to replace it with your own vocabulary file.

Building the acoustic models

After you have installed the prerequisites, edit cmd.sh in the s5_r2/ directory of this distribution to adjust for the number of processors you have locally (change nJobs and nDecodeJobs accordingly). You could probably also uncomment the cluster configuration and run the scripts on a cluster, but this is untested and may require some tinkering to get it running.

Then, simply run ./run.sh in s5_r2/ to build the acoustic and language models. The script will ask you where to place larger files (feature vectors and KALDI models) and automatically build appropriate symlinks. Kaldi_lm is automatically downloaded and compiled if it is not found on your system and standard Kneser-Ney is used for a 4-gram LM.

Getting data files separately

You can of course also use and download our data resources separately.

Speech corpus

The corpus can be downloaded here. The license is CC-BY 4.0. The run.sh script expects to find the corpus data extracted in data/wav/ and will download it for you automatically, if it does not find the data.

Newer recipes also make use of SWC data.

German language texts

See https://github.com/bmilde/german-asr-lm-tools/ for new instructions to obtain a large amount of German LM text data.

German phoneme dictionary

The phoneme dictionary is currently not supplied with this distribution, but the scripts to generate them are. DFKIs MARY includes a nice LGPL German phoneme dictionary with ~26k entries. Other sources for phoneme dictionary entries can be found at BAS. Our parser understands the different formats of VM.German.Wordforms, RVG1_read.lex, RVG1_trl.lex and LEXICON.TBL. The final dictionary covers ~44.8k unique German words with 70k entries total (pronunciation variants). Since the licensing of the BAS dictionaries is unclear, they are not included into the phoneme dictionary by default. You can however enable them by editing the header of run.sh and setting use_BAS_dictionaries to true.

build_big_lexicon.py can import many dictionaries in the BasSAMPA format and merge them into a single dictionary. Its parser understand many variants and dialects of BasSAMPA and the adhoc dictionary formats. To support new variants you'll have to edit def guessImportFunc(filename). The output is a serialised python object.

export_lexicon.py will export such a serialised python dictionary into KALDIs lexion_p.txt format (this allows to model different phonetic realisations of the same word with probabilities). Stress markers in the phoneme set are grouped with their unstressed equivalents in KALDI using the extra_questions.txt file. It is also possible to generate a CMU Sphinx formated dictionary with the same data using the -spx option. The Sphinx format also allows pronunciation variants, but cannot model probabilities for these variants.

References

If you use our scripts and/or data in your academic work please cite:

@InProceedings{milde-koehn-18-german-asr,
author = {Benjamin Milde and Arne K{\"o}hn},
title = {Open Source Automatic Speech Recognition for {German}},
booktitle = {Proceedings of ITG 2018},
year = {2018},
address = {Oldenburg, Germany},
pages = {251--255}
}

and

@inproceedings{geislinger-etal-2022-improved,
    title = "Improved Open Source Automatic Subtitling for Lecture Videos",
    author = "Geislinger, Robert  and
      Milde, Benjamin  and
      Biemann, Chris",
    booktitle = "Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)",
    month = "12--15 " # sep,
    year = "2022",
    address = "Potsdam, Germany",
    publisher = "KONVENS 2022 Organizers",
    url = "https://aclanthology.org/2022.konvens-1.11",
    pages = "98--103",
}

An open access Arxiv preprint of "Open Source Automatic Speech Recognition for German" is also available here: https://arxiv.org/abs/1807.10311 (same content as the ITG version).

We have also previously published on German open source ASR in 2015:

Open Source German Distant Speech Recognition: Corpus and Acoustic Model:

@InProceedings{Radeck-Arneth2015,
author = {Radeck-Arneth, Stephan and Milde, Benjamin and Lange, Arvid and Gouvea, Evandro and Radomski, Stefan and M{\"{u}}hlh{\"{a}}user, Max and Biemann, Chris},
booktitle = {Proceedings Text, Speech and Dialogue (TSD)},
title = {{Open Source German Distant Speech Recognition: Corpus and Acoustic Model}},
year = {2015},
address = {Pilsen, Czech Republic},
pages = {480--488}
}

If you use our training scripts or models commercially, please mention this repository in your about section, documentation or similar.

kaldi-tuda-de's People

Contributors

Stargazers

Watchers

kaldi-tuda-de's Issues

m_ailab speech dataset's website

I couldn't open the m_ailab's website, could you please check it?
Thanks

SyntaxError: Missing parentheses in call to 'print'

After ~3 days of training the script crashed because of missing parantheses in choose_utt_to_combine.py. The last log was

...
utils/validate_data_dir.sh: Successfully validated data-directory data/train_cleaned_sp_hires
utils/data/get_utt2dur.sh: data/train_cleaned_sp_hires/utt2dur already exists with the expected length.  We won't recompute it.
choose_utts_to_combine.py: combined 1081479 utterances to 1020406 utterances while respecting speaker boundaries, and then to 1020308 utterances with merging across speaker boundaries.
  File "<string>", line 25
    print uniq, uniq2orig_uniq[uniq]
             ^
SyntaxError: Missing parentheses in call to 'print'

Are the scripts supposed to be ran with python 2 rather then 3?

New words?

Hello,
I would like to add a few new words to a trained instance of Kaldi (trained with your scripts). Could you point me to a resource where I can find how to do it?

I did this before for a model trained with the ASPIRE recipe , but this strategy did not work here

Thanks a lot and kind regards
Ernst

missing ivector extractor from prebuilt model archive

Hi and thanks for sharing your model.

I test your model but it seems that the ivector extractor is missing from the archive in this adress:
http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/tdnn_chain_cleaned_tuda_swc_voc126k.tar.bz2

SWC Alignments

I used the segment file for SWC found in http://speech.tools/kaldi_tuda_de/swc_train_v2.tar.gz to segment the wav files for my ASR. However when checking the data, the audio alignments were very faulty. Did you use the data like that or some correction was performed to make the alignments more accurate? Would it harm the model if I used these faulty segmentations?

Trying to split data while taking duration into account leads to a severe imbalance in splits.

Hi,

When I run the training script at some point I always get this message just before the script stops.

...
fix_data_dir.sh: old files are kept in data/train_100k_nodup/.backup
+ utils/data/remove_dup_utts.sh 1000 data/train_nodev data/train_nodup
Reduced number of utterances from 341385 to 341385
Using fix_data_dir.sh to reconcile the other files.
fix_data_dir.sh: kept all 341385 utterances.
fix_data_dir.sh: old files are kept in data/train_nodup/.backup
+ '[' '!' -d data/lang_std_nosp_std ']'
+ echo 'Copying data/lang_std to data/lang_std_nosp_std...'
Copying data/lang_std to data/lang_std_nosp_std...
+ cp -R data/lang_std data/lang_std_nosp_std
+ '[' 0 -le 10 ']'
+ steps/train_mono.sh --nj 12 --cmd utils/run.pl data/train_30kshort data/lang_std_nosp_std exp/mono
steps/train_mono.sh --nj 12 --cmd utils/run.pl data/train_30kshort data/lang_std_nosp_std exp/mono
Trying to split data while taking duration into account leads to a severe imbalance in splits. This happens when there is a lot more data for some speakers than for others.
You should use utils/data/modify_speaker_duration.sh to fix that.

Do you have any idea what could cause this problem?
Kind regards

feats.scp no such file or directory

Hi, using the tuda scripts and Im currently trying to add more data.
To do so, I have generated xml- and wav files of the form:
2017-01-01-0-0-23.xml and 2017-01-01-0-0-23_Kinect-Beam.wav
I have quite many files, around 90k...

Now, when I add just a few files, everything works just fine, but once I add more than around 40k
I get an error somewhere around steps/compute_cmvn_stats.sh because feats.scp is not created for some reason.

ERROR (apply-cmvn[5.1.80~1-e5275]:SequentialTableReader():util/kaldi-table-inl.h:876) Error constructing TableReader: rspecifier is scp:data/train/split8/1/feats.scp

Do you have hint on why this happens?

Thanks already for your help.
Viele Grüße aus Erlangen, Julian

ERROR: FstHeader::Read: Bad FST header: standard input

I have run into the following error after running ./run.sh from scratch (i.e., I deleted all of the data and started from the beginning). Attached is the output from as far back as the console would show. For the most part, it looks like everything checks out. It even does some training for tri1, but quits after that. I know that my Kaldi version is compiled properly and working as I've used it to train other models. Any ideas?

steps/train_deltas.sh: Done training system with delta+delta-delta features in exp/tri1
fstdeterminizestar --use-log=true
fstminimizeencoded
fsttablecompose data/lang_test/L_disambig.fst data/lang_test/G.fst
std::bad_allocERROR: FstHeader::Read: Bad FST header: -
ERROR (fstminimizeencoded:ReadFstKaldi():fstext/fstext-utils-inl.h:1300) Reading FST: error reading FST header from standard input
ERROR (fstminimizeencoded:ReadFstKaldi():fstext/fstext-utils-inl.h:1300) Reading FST: error reading FST header from standard input

[stack trace: ]
kaldi::KaldiGetStackTrace()
kaldi::KaldiErrorMessage::~KaldiErrorMessage()
fst::ReadFstKaldi(std::string)
fstminimizeencoded(main+0x1c0) [0x80b191b]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0xb7289a83]
fstminimizeencoded() [0x80b1681]

ERROR: FstHeader::Read: Bad FST header: standard input

kaldierror2.txt

Additional German speech training data

Hey,
I'd like to bring to your attention the German corpus from our paper "CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition". It consists of audiobooks from Librivox and a re-aligned spoken wikipedia corpus.
Sources and data an be found here:
https://github.com/lumaku/german-corpus-aligned

Tuda-DE and this dataset have a certain overlap.
Both contain the SWC. M-ailabs also contains a audiobooks from librivox.
Feel free to merge etc.!

General questions about project

Hi @milde

I am delighted what you and your team have done with Kaldi-tuda-de (https://github.com/uhh-lt/kaldi-tuda-de). It’s just brilliant! I am your fun! I’d like to say thank you (Especially for the latest version with 683k words! Wow!) I am curious to know about future plans

As you might notice, I’ve already created a few tickets in this repository while trying to run it. And you’ve helped me a lot.
I have some general questions to you (sorry for writing it here):

Do you have a roadmap?
What are your plans in general?
Would it be possible to ask you to consult my team?
Are you interested in investments?
Do you need more resources?

I would highly appreciate it if you find the time for me.

Would be some kind of you if you can provide other option where we can discuss it privately (whatsApp, Skype, phone, email, etc).

Feel free to write me:
Email: [email protected]
Skype: prokopchuk.vitaliy.
Telegram: @prvit

Thank you very much for your work and your time!

Looking forward to hearing from you!

Thank you!
Vitalii

Add Mozilla Common Voice corpus

Hi,
Is there a special reason why the Mozilla common voice is not part of this recipe ? Maybe licensing issues ?
If I prepared a pull request adding Mozilla common voice corpus to this recipe, would you accept it ?

DNN models

Has anybody succeeded in building DNN models with tuda-de?
What performance (WER) is achievable?

Where is cuda_cmd used

Hi,

you set cuda_cmd in the cmd.sh, but inside run.sh there is no use of this variable, is it used inside some subscript or do I have to change anything to use my gpu?

Thanks in advance, and thanks for this whole repo!

The amount of the data

Hi,
from your readme, the data is 600+ hours, but after I run the recipe, there are only 400h data for chain model training, so someone can give me some suggestions?
Thanks

RROR (fstisstochastic[5.5.1077~1-ed31c6]:ReadFstKaldiGeneric():kaldi-fst-io.cc:59) Reading FST: error reading FST header from

I have this issue
Hi.
While running run.sh script, stuck with an error:

`tree-info exp/tri1/tree
tree-info exp/tri1/tree
fstpushspecial
fstdeterminizestar --use-log=true
fstminimizeencoded
fsttablecompose data/lang_test/L_disambig.fst data/lang_test/G.fst
fstisstochastic data/lang_test/tmp/LG.fst
-0.203813 -0.204425
[info]: LG not stochastic.
fstcomposecontext --context-size=3 --central-position=1 --read-disambig-syms=data/lang_test/phones/disambig.int --write-disambig-syms=data/lang_test/tmp/disambig_ilabels_3_1.int data/lang_test/tmp/ilabels_3_1.697494 data/lang_test/tmp/LG.fst
ERROR: FstHeader::Read: Bad FST header: standard input
mv: cannot stat 'data/lang_test/tmp/ilabels_3_1.697494': No such file or directory
fstisstochastic data/lang_test/tmp/CLG_3_1.fst
ERROR: FstHeader::Read: Bad FST header: data/lang_test/tmp/CLG_3_1.fst
ERROR (fstisstochastic[5.5.1077~1-ed31c6]:ReadFstKaldiGeneric():kaldi-fst-io.cc:59) Reading FST: error reading FST header from data/lang_test/tmp/CLG_3_1.fst

[ Stack-Trace: ]
//home/taqneen/Desktop/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x70c) [0x7effe7a1f1ce]
fstisstochastic(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x25) [0x557d5fd9edd1]
//home/taqneen/Desktop/kaldi/src/lib/libkaldi-fstext.so(fst::ReadFstKaldiGeneric(std::__cxx11::basic_string<char, std::char_traits, std::allocator >, bool)+0x1c5) [0x7effe7a82bd1]
fstisstochastic(main+0x18d) [0x557d5fd9dd16]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7effe7029d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7effe7029e40]
fstisstochastic(_start+0x25) [0x557d5fd9dac5]

kaldi::KaldiFatalError[info]: CLG not stochastic.
make-h-transducer --disambig-syms-out=exp/tri1/graph/disambig_tid.int --transition-scale=1.0 data/lang_test/tmp/ilabels_3_1 exp/tri1/tree exp/tri1/final.mdl
ERROR (make-h-transducer[5.5.1077~1-ed31c6]:Input():kaldi-io.cc:756) Error opening input stream data/lang_test/tmp/ilabels_3_1

[ Stack-Trace: ]
//home/taqneen/Desktop/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x70c) [0x7fe6b12ed1ce]
make-h-transducer(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x25) [0x5641485b810d]
//home/taqneen/Desktop/kaldi/src/lib/libkaldi-util.so(kaldi::Input::Input(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, bool*)+0xbe) [0x7fe6b13bc892]
make-h-transducer(main+0x262) [0x5641485b794b]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fe6b0c29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fe6b0c29e40]
make-h-transducer(_start+0x25) [0x5641485b7625]
`

I have on my device (42 GB memory and 270 GB swap memory)

Any ideas?

Ivector optiomization issue

I'm getting this problem randomly with the most recent version of Kaldi (this problem seems to happen with TEDLIUM 3 too)

WARNING (ivector-extract-online2[5.5.643~1-ab82de]:LinearCgd():optimization.cc:549) Doing linear CGD in dimension 100, after 15 iterations the squared residual has got worse, 5.59071 > 4.92847.  Will do a
n exact optimization.                                                                                                                                                                                       
LOG (ivector-extract-online2[5.5.643~1-ab82de]:SolveQuadraticProblem<double>():sp-matrix.cc:686) Solving quadratic problem for called-from-linearCGD: floored 3 eigenvalues.                                
WARNING (ivector-extract-online2[5.5.643~1-ab82de]:LinearCgd():optimization.cc:549) Doing linear CGD in dimension 100, after 15 iterations the squared residual has got worse, 14.5627 > 7.93962.  Will do a
n exact optimization.                                                                                                                                                                                       
LOG (ivector-extract-online2[5.5.643~1-ab82de]:SolveQuadraticProblem<double>():sp-matrix.cc:686) Solving quadratic problem for called-from-linearCGD: floored 3 eigenvalues.                                
WARNING (ivector-extract-online2[5.5.643~1-ab82de]:LinearCgd():optimization.cc:549) Doing linear CGD in dimension 100, after 15 iterations the squared residual has got worse, 216.241 > 6.63521.  Will do a
n exact optimization.                                                                                                                                                                                       
LOG (ivector-extract-online2[5.5.643~1-ab82de]:SolveQuadraticProblem<double>():sp-matrix.cc:686) Solving quadratic problem for called-from-linearCGD: floored 2 eigenvalues.                                
WARNING (ivector-extract-online2[5.5.643~1-ab82de]:LinearCgd():optimization.cc:549) Doing linear CGD in dimension 100, after 15 iterations the squared residual has got worse, 28.3819 > 15.9563.  Will do a
n exact optimization.                                                                                                                                                                                       
LOG (ivector-extract-online2[5.5.643~1-ab82de]:SolveQuadraticProblem<double>():sp-matrix.cc:686) Solving quadratic problem for called-from-linearCGD: floored 3 eigenvalues.                                
WARNING (ivector-extract-online2[5.5.643~1-ab82de]:LinearCgd():optimization.cc:549) Doing linear CGD in dimension 100, after 15 iterations the squared residual has got worse, 843.787 > 30.8869.  Will do a
n exact optimization.                                                                                                                                                                                       
LOG (ivector-extract-online2[5.5.643~1-ab82de]:SolveQuadraticProblem<double>():sp-matrix.cc:686) Solving quadratic problem for called-from-linearCGD: floored 3 eigenvalues.                                
WARNING (ivector-extract-online2[5.5.643~1-ab82de]:LinearCgd():optimization.cc:549) Doing linear CGD in dimension 100, after 15 iterations the squared residual has got worse, 6840.46 > 34.1176.  Will do a
n exact optimization.                                                                                                                                                                                       
ASSERTION_FAILED (ivector-extract-online2[5.5.643~1-ab82de]:SymPosSemiDefEig():sp-matrix.cc:62) Assertion failed: (-min <= tolerance * max)                                                                 
                                                                                                                                                                                                            
[ Stack-Trace: ]                                                                                                                                                                                            
ivector-extract-online2(kaldi::MessageLogger::LogMessage() const+0xb42) [0x55be19470674]                                                                                                                    
ivector-extract-online2(kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)+0x6e) [0x55be19471370]                                                                                       
ivector-extract-online2(kaldi::SpMatrix<double>::MaxAbsEig() const+0) [0x55be19445962]                                                                                                                      
ivector-extract-online2(double kaldi::SolveQuadraticProblem<double>(kaldi::SpMatrix<double> const&, kaldi::VectorBase<double> const&, kaldi::SolverOptions const&, kaldi::VectorBase<double>*)+0x578) [0x55b
e1943fec3]                                                                                                                                                                                                  
ivector-extract-online2(int kaldi::LinearCgd<double>(kaldi::LinearCgdOptions const&, kaldi::SpMatrix<double> const&, kaldi::VectorBase<double> const&, kaldi::VectorBase<double>*)+0xb6c) [0x55be1946ef14]
ivector-extract-online2(kaldi::OnlineIvectorEstimationStats::GetIvector(int, kaldi::VectorBase<double>*) const+0x8c) [0x55be1929b678]
ivector-extract-online2(kaldi::OnlineIvectorFeature::UpdateStatsUntilFrame(int)+0x13b) [0x55be19293d29]
ivector-extract-online2(kaldi::OnlineIvectorFeature::GetFrame(int, kaldi::VectorBase<float>*)+0x4e) [0x55be1929421a]
ivector-extract-online2(main+0xd15) [0x55be1927b29e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f5b9c286b97]
ivector-extract-online2(_start+0x2a) [0x55be1927a47a]

LOG (copy-feats[5.5.643~1-ab82de]:main():copy-feats.cc:143) Copied 5 feature matrices.
# Accounting: time=2 threads=1
# Ended (code 0) at Tue Feb 18 10:18:27 CET 2020, elapsed time 2 seconds

sox FAIL formats: sox not able to handle common_voice mp3. Is there any solution?

I am facing error.

steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
run.pl: 28 / 28 failed, log is in exp/make_mfcc/commonvoice_train/make_mfcc_commonvoice_train.*.log

and inside the log is


# compute-mfcc-feats --write-utt2dur=ark,t:exp/make_mfcc/commonvoice_train/utt2dur.1 --verbose=2 --config=conf/mfcc.conf scp,p:exp/make_mfcc/commonvoice_train/wav_commonvoice_train.1.scp ark:- | copy-feats --write-num-frames=ark,t:exp/make_mfcc/commonvoice_train/utt2num_frames.1 --compress=true ark:- ark,scp:/home/desktop/Desktop/research_and_development/lab_work/workshop/speech_lab/kaldi/egs/csj/s5/mfcc/raw_mfcc_commonvoice_train.1.ark,/home/desktop/Desktop/research_and_development/lab_work/workshop/speech_lab/kaldi/egs/csj/s5/mfcc/raw_mfcc_commonvoice_train.1.scp 
# Started at Wed Mar 31 18:00:26 CEST 2021
#
copy-feats --write-num-frames=ark,t:exp/make_mfcc/commonvoice_train/utt2num_frames.1 --compress=true ark:- ark,scp:/home/desktop/Desktop/research_and_development/lab_work/workshop/speech_lab/kaldi/egs/csj/s5/mfcc/raw_mfcc_commonvoice_train.1.ark,/home/desktop/Desktop/research_and_development/lab_work/workshop/speech_lab/kaldi/egs/csj/s5/mfcc/raw_mfcc_commonvoice_train.1.scp 
compute-mfcc-feats --write-utt2dur=ark,t:exp/make_mfcc/commonvoice_train/utt2dur.1 --verbose=2 --config=conf/mfcc.conf scp,p:exp/make_mfcc/commonvoice_train/wav_commonvoice_train.1.scp ark:- 
sox FAIL formats: can't open input file `data/wav/cv/clips/common_voice_de_17298952.mp3': No such file or directory
ERROR (compute-mfcc-feats[5.5.899~1-3d0e4313]:Read4ByteTag():wave-reader.cc:56) WaveData: expected 4-byte chunk-name, got read error

while I tried to change at prepare_commonvoice.py in local and replaced sox with lame but didn't help.

If I can use ffmpeg? and where is required/ if you can guide me.

Rerunning kaldi-tuda-de on a single wav file

I am tracking an error in my setup of kaldi. How can I rerun kaldi with a given setting (e.g. the one corresponding to exp/sgmm_5a_mmi_b0.1/decode4/wer_13) on a single wav file? Thanks for your time and effort!

How to run/compile MARY 5.1.1?

The documentation on running MARY 5.1.1 version is missing on the official page. Could you please point out, how to run it?

missing dictionary files

There seem to be missing information in data/local/dict. Could you please provide the following files:

silence_phones.txt
nonsilence_phones.txt
optional_silence.txt

I might have missed them somewhere but I don't think so.

data hosted at http://dialogplus.lt.informatik.tu-darmstadt.de/ not accessible

Hi,

I want to train an acoustic model on german speech, but I'm unable to download any data from http://dialogplus.lt.informatik.tu-darmstadt.de/downloads/speechdata/ .

I managed to get the wavs, but I don't find the source sentence archive for LM building . Do you have a mirror ?

Thanks !

ERROR: SymbolTable::ReadText: Can't open file data/lang_test/words.txt

I downloaded the corpus separately and put the data into the proper folder. I then ran ./run.sh which worked for a while (building LMs and doing data prep, it appeared) until it stopped on the above error. It is probably the case that the root cause came earlier, as the above-mentioned file doesn't exist at all. Below is the output from the script from program invocation all the way until the error.

kaldierror.txt

expected file data/train_cleaned/feats.scp to exist

Hi, I am trying to reproduce the model myself so that I can play around with more data afterwards. Unfortunately the training always fails at a certain point because data/train_cleaned/feats.scp is not being generated. A few lines before this happens this error is getting logged (if this helps pinpoint the problem):

ERROR: SymbolTable::ReadText: Can't open file data/lang_300k4_test_pron/words.txt
ERROR: FstHeader::Read: Bad FST header: standard input
remove_oovs.pl: removed 0 lines.
ERROR: FstHeader::Read: Bad FST header: standard input
fstisstochastic data/lang_300k4_test_pron/G.fst 
ERROR: FstHeader::Read: Bad FST header: data/lang_300k4_test_pron/G.fst
ERROR (fstisstochastic[5.5.0~1-317c]:ReadFstKaldiGeneric():kaldi-fst-io.cc:53) Reading FST: error reading FST header from data/lang_300k4_test_pron/G.fst

[ Stack-Trace: ]
/opt/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x82c) [0x7f865ff282aa]
fstisstochastic(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x40e83d]
/opt/kaldi/src/lib/libkaldi-fstext.so(fst::ReadFstKaldiGeneric(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool)+0x196) [0x7f8660381c89]
fstisstochastic(main+0x227) [0x40d72d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f865eade830]
fstisstochastic(_start+0x29) [0x40d439]

kaldi::KaldiFatalErrorDone. New lang dir in data/lang_300k4_test_pron
run.pl: job failed, log is in exp/tri3/graph_pron/mkgraph.log

...later then

steps/decode_fmllr.sh --nj 12 --num-threads 4 --cmd utils/run.pl --num-threads 4 exp/tri4_cleaned/graph data/test exp/tri4_cleaned/decode_test
cat: exp/tri4_cleaned/graph/phones/silence.csl: No such file or directory
Now running TDNN chain data preparation, i-vector training and TDNN-HMM training
./local/run_tdnn_1f.sh --lang_dir data/lang_300k4
local/run_ivector_common.sh: expected file data/train_cleaned/feats.scp to exist

To run the script I am using kaldi's docker container with gpu support to be able to easily reproduce the training. To install sequitur I am using the install_sequitur.sh skript under /opt/kaldi/tools/extras/ and adjusting the path in cmd.sh afterwards which seems to work perfectly fine. If you are interested I can post the Dockerfile here as well.

Expected WERs?

Hi. Thanks for the excellent scripts. Worked very nicely with current kaldi. But I am unsure about the word error rates: the best line in RESULTS... is:
%WER 19.47 [ 3428 / 17605, 850 ins, 295 del, 2283 sub ] exp/sgmm_5a_mmi_b0.1/decode4/wer_13
Should the expected WERs be added to the REAME.md or similar?
Ciao
Sven

Worker fails on start

I've tried to run the provided pre-trained model de_350k_nnet3chain_tdnn1f_1024_sp_bi.tar.bz2 on a working installation of Kaldi (i.e., the tedlium worker works with Gstreamer plugin successfully). I've copied the files and changed the paths in yaml accordingly but when starting the worker, the following error occurs. Obviously, self.asr is None. Any hints?

a@a:~/kaldi/tools/kaldi-gstreamer-server$ ./run_tuda_de.sh 
   DEBUG 2018-12-21 14:14:47,472 Starting up worker 
2018-12-21 14:14:47 -    INFO:   decoder2: Creating decoder using conf: {'post-processor': "perl -npe 'BEGIN {use IO::Handle; STDOUT->autoflush(1);} s/(.*)/\\1./;'", 'logging': {'version': 1, 'root': {'level': 'DEBUG', 'handlers': ['console']}, 'formatters': {'simpleFormater': {'datefmt': '%Y-%m-%d %H:%M:%S', 'format': '%(asctime)s - %(levelname)7s: %(name)10s: %(message)s'}}, 'disable_existing_loggers': False, 'handlers': {'console': {'formatter': 'simpleFormater', 'class': 'logging.StreamHandler', 'level': 'DEBUG'}}}, 'use-vad': False, 'decoder': {'ivector-extraction-config': 'test/models/german/de_350k_nnet3chain_tdnn1f_1024_sp_bi/ivector_extractor/ivector_extractor.conf', 'lattice-beam': 5.0, 'acoustic-scale': 1.0, 'do-endpointing': True, 'beam': 5.0, 'mfcc-config': 'test/models/german/de_350k_nnet3chain_tdnn1f_1024_sp_bi/conf/mfcc_hires.conf', 'traceback-period-in-secs': 0.25, 'nnet-mode': 3, 'endpoint-silence-phones': '1:2:3:4:5:6', 'word-syms': 'test/models/german/de_350k_nnet3chain_tdnn1f_1024_sp_bi/words.txt', 'num-nbest': 10, 'frame-subsampling-factor': 3, 'phone-syms': 'test/models/german/de_350k_nnet3chain_tdnn1f_1024_sp_bi/phones.txt', 'max-active': 10000, 'fst': 'test/models/german/de_350k_nnet3chain_tdnn1f_1024_sp_bi/HCLG.fst', 'use-threaded-decoder': True, 'model': 'test/models/german/de_350k_nnet3chain_tdnn1f_1024_sp_bi/final.mdl', 'chunk-length-in-secs': 0.25}, 'silence-timeout': 15, 'out-dir': 'tmp', 'use-nnet2': True}
Traceback (most recent call last):
  File "kaldigstserver/worker.py", line 419, in <module>
    main()
  File "kaldigstserver/worker.py", line 409, in main
    decoder_pipeline = DecoderPipeline2(conf)
  File "/home/xxx/kaldi/tools/kaldi-gstreamer-server/kaldigstserver/decoder2.py", line 25, in __init__
    self.create_pipeline(conf)
  File "/home/xxx/kaldi/tools/kaldi-gstreamer-server/kaldigstserver/decoder2.py", line 55, in create_pipeline
    self.asr.set_property("use-threaded-decoder", conf["decoder"]["use-threaded-decoder"])
AttributeError: 'NoneType' object has no attribute 'set_property'

run.sh failing: text contains 103359 lines with non-printable characters

I am trying to run ./run.sh script to build a model. I didn't make any changes to any of the scripts or data.
On stage 8 during making mfcc features steps/make_mfcc.sh --cmd utils/run.pl --nj 28 data/swc_train exp/make_mfcc/swc_train mfcc is called (line 399).
Then inside steps/make_mfcc.sh on line 76 utils/validate_data_dir.sh is called.
And it exits with code 1 at line 131 because it found 103359 lines with-non printable characters:
utils/validate_data_dir.sh: text contains 103359 lines with non-printable characters

How do I fix or avoid this non-printable characters to continue building a model? There is a boolean variable that is false by defaultnon_print=false but I don't know what it would influence on if change it to true.

Corpus for LM

I found this file: all_corpora_filtered_maryfied.txt.gz .
Does a version before applying "mary" (and/or filtering) exist somewhere?

oov words (after g2p)

I've tried to train the new bigger models but ran into some issues.
For some reason there appeared about 1000 oov words like juba (even after the g2p step). This lead to this initial error that broke the training eventually:

...
Output lang directory is: data/lang_std_big_v5_test                                                                      [876/1982]
arpa2fst - 
LOG (arpa2fst[5.5.0~1-d774]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.0~1-d774]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.0~1-d774]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
LOG (arpa2fst[5.5.0~1-d774]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
LOG (arpa2fst[5.5.0~1-d774]:Read():arpa-file-parser.cc:149) Reading \4-grams: section.
FATAL: FstCompiler: Symbol "juba" is not mapped to any integer arc ilabel, symbol table = data/lang_std_big_v5_test/words.txt, sou$
ce = standard input, line = 187592
ERROR: FstHeader::Read: Bad FST header: standard input
ERROR: FstHeader::Read: Bad FST header: standard input
fstisstochastic data/lang_std_big_v5_test/G.fst 
ERROR: FstHeader::Read: Bad FST header: data/lang_std_big_v5_test/G.fst
ERROR (fstisstochastic[5.5.0~1-d774]:ReadFstKaldiGeneric():kaldi-fst-io.cc:53) Reading FST: error reading FST header from data/lan$
_std_big_v5_test/G.fst
...

I extracted all missing words with arpa2fst and added them manually to data/local/de_extra_lexicon.txt and hope the training runs through now. I didn't have these issues with the 600h model however. Are there any LM corpora added after the g2p step maybe?

Kind regards

Issues executing run.sh

Greetings,

I've installed kaldi using their INSTALL in /tools and /src.
I was using make with USE_THREAD=0 as described and correctly installed OpenBLAS to be used from a shared directory.
I also successfully installed Sequitur G2P and all other requirements.
First, I had an error that 'spacy' module not found after the first run of ./run.sh.
I installed it using pip3 install -U spacy and then downloaded de model by python3 -m spacy download de.
But now I am getting the following error :

+ python3 local/export_lexicon.py -f data/local/combined.dict -o data/local/dict_std_big_v5/_lexiconp.txt
Load  data/local/combined.dict
Succesfully opened pickle file, now exporting to: data/local/dict_std_big_v5/_lexiconp.txt
Warning!: No % entry found! Will add <UNK> -> usb mapping manually.
Warning!: No <Lachen> entry found! Will add <Lachen> -> lau mapping manually.
+ g2p_model=data/local/g2p_std_big_v5/de_g2p_model
+ final_g2p_model=data/local/g2p_std_big_v5/de_g2p_model-6
+ mkdir -p data/local/g2p_std_big_v5/
+ '[' 0 -le 5 ']'
+ train_file=data/local/g2p_std_big_v5/lexicon.txt
+ cut '-d ' -f 1,3- data/local/dict_std_big_v5/_lexiconp.txt
+ cut '-d ' -f 1 data/local/dict_std_big_v5/_lexiconp.txt
+ '[' '!' -f data/local/g2p_std_big_v5/de_g2p_model-6 ']'
+ mkdir -p data/local/g2p_std_big_v5/
+ /usr/local/bin/g2p.py -e utf8 --train data/local/g2p_std_big_v5/lexicon.txt --devel 3% --write-model data/local/g2p_std_big_v5/de_g2p_model-1
  File "/usr/local/bin/g2p.py", line 184
    except translator.TranslationFailure, exc:
                                        ^
SyntaxError: invalid syntax

My google search didn't help me to find out how to fix the problem. I tried to reinstall gg2p but it didn't help.
Could you help me with this one, please?
I can provide full console output if needed.

The wav files have bad headers

Running wav-to-duration fails with <wavfile> has no duration in header. Running it with read_entire_file=true still results in an error: WaveData: expected 4-byte chunk-name, got read error

I'm a bit perplexed why this happening, soxi and ffprobe work on the tuda wav files, but despite wav-to-duration working for tons of other wav files that I have they only fail on most of the tuda ones?

Data license

Hi,
Where can I find the license of the data: http://speech.tools/kaldi_tuda_de/german-speechdata-package-v2.tar.gz?
BTW, from your readme, the data is 600+ hours, but after I run the recipe, there are only 400h data for chain model training, so does anyone have the same condition?
Thanks

Incorrect training data

Hi,
so I trained Kaldi using your (old) s5 script, and as a sanity check I tried to decode the training data.
When I compared the text file from the training data to my results, I noticed that there seem to be quite a number of errors in the texts.
I checked the audio and xml files and saw that the sentences were wrong.

I added a screenshot of a partial vimdiff of the text and my results.

Appears to be a mixup, as those sentences do exist, but in other audio files

No pretrained tri3 graph

I would like to decode my own audio files, but I found no pretrained tri3 in exp/

Wrong timeout data type?

Tried several times to get this to run. Followed the steps within the README.md , but every time all training data is dropped because either it is not complete or an error like this happens:

. Error in file, omitting data/wav/german-speechdata-package-v2/dev/2015-01-27-11-31-32
Timeout value connect was (10.0, 10.0), but it must be an int or float.

Tried with Xubuntu 14.04.

The rest of the output(omitted thousands of similar messages at [...]):
outs_low_redundancy.txt

EDIT:
Changed local/maryclient.py line 63 from:
r = self.connection_pool[connection_pool_num].post('http://'+self.host+':'+str(self.port)+'/process',data=params,timeout=(10.0, 10.0))
to:
r = self.connection_pool[connection_pool_num].post('http://'+self.host+':'+str(self.port)+'/process',data=params,timeout=10.0)

Installation check and better error handling

Currently, the run.sh scripts runs through even though errors occur in the data prepartions scripts. This could easily be fixed with appropriate exit checks in run.sh, so that they stop at the correct point.
Also it would be a good idea to check the installation with an additional script, e.g. that Kaldi, Mary and additional Python modules are correctly installed.

build_lm.sh needs option -a for grep

Recent grep versions (like on Ubuntu 16.04) fail for non-ascii characters. This leads in build_lm.sh to a broken file - later breaking the build process. The fix is easy: please add -a in grep call of file s5/local/build_lm.sh

Similarly, 3 grep calls in s5/local/format_data.sh need -a, too.

CIao
Sven

G.fst error , bad FST header

hello , when i run script that build my own language model i had this error but i don't understand it can you help me

ERROR: FstHeader::Read: Bad FST header: standard input
arpa2fst -
LOG (arpa2fst[5.5.6032-4bbac]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.6032-4bbac]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
WARNING (arpa2fst[5.5.6032-4bbac]:Read():arpa-file-parser.cc:139) Zero ngram count in ngram order 2(look for 'ngram 2=0' in the \data\ section). There is possibly a problem with the file.
LOG (arpa2fst[5.5.6032-4bbac]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
WARNING (arpa2fst[5.5.6032-4bbac]:Read():arpa-file-parser.cc:139) Zero ngram count in ngram order 3(look for 'ngram 3=0' in the \data\ section). There is possibly a problem with the file.
LOG (arpa2fst[5.5.6032-4bbac]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
remove_oovs.pl: removed 0 lines.
fstisstochastic /Users/saramuneef/kaldi/egs/maghnaAsr/data/local/lang/G.fst
ERROR: FstHeader::Read: Bad FST header: /Users/saramuneef/kaldi/egs/maghnaAsr/data/local/lang/G.fst
ERROR (fstisstochastic[5.5.603~2-4bbac]:ReadFstKaldiGeneric():kaldi-fst-io.cc:53) Reading FST: error reading FST header from /Users/saramuneef/kaldi/egs/maghnaAsr/data/local/lang/G.fst

[ Stack-Trace: ]
0 libkaldi-base.dylib 0x0000000106fa892f kaldi::KaldiGetStackTrace() + 63
1 libkaldi-base.dylib 0x0000000106fa86a2 kaldi::MessageLogger::LogMessage() const + 354
2 libkaldi-fstext.dylib 0x0000000106b95798 kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&) + 24
3 libkaldi-fstext.dylib 0x0000000106b95c1f fst::ReadFstKaldiGeneric(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator >, bool) + 879
4 fstisstochastic 0x0000000105df9bad main + 285
5 libdyld.dylib 0x00007fff6d69e7fd start + 1
6 ??? 0x0000000000000002 0x0 + 2

the errors when run the eg——yesno

**Creating data/local/dict/lexiconp.txt from data/local/dict/lexicon.txt
utils/prepare_lang.sh: line 493: fstaddselfloops: command not found
ERROR: FstHeader::Read: Bad FST header: standard input
Preparing language models for test
local/prepare_lm.sh: line 13: arpa2fst: command not found
local/prepare_lm.sh: line 15: fstisstochastic: command not found
ERROR: ReadFst: Can't open file: data/lang_test_tg/G.fst
ERROR: FstHeader::Read: Bad FST header: tmpdir.g/empty_words.fst
Succeeded in formatting data.
steps/make_mfcc.sh --nj 1 data/train_yesno exp/make_mfcc/train_yesno mfcc
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea.
Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html
for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/train_yesno
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
run.pl: job failed, log is in exp/make_mfcc/train_yesno/make_mfcc_train_yesno.1.log
steps/compute_cmvn_stats.sh data/train_yesno exp/make_mfcc/train_yesno mfcc
make_cmvn.sh: no such file data/train_yesno/feats.scp
fix_data_dir.sh: kept all 31 utterances.
fix_data_dir.sh: old files are kept in data/train_yesno/.backup
steps/make_mfcc.sh --nj 1 data/test_yesno exp/make_mfcc/test_yesno mfcc
utils/validate_data_dir.sh: WARNING: you have only one speaker. This probably a bad idea.
Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html
for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/test_yesno
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
run.pl: job failed, log is in exp/make_mfcc/test_yesno/make_mfcc_test_yesno.1.log
steps/compute_cmvn_stats.sh data/test_yesno exp/make_mfcc/test_yesno mfcc
make_cmvn.sh: no such file data/test_yesno/feats.scp
fix_data_dir.sh: kept all 31 utterances.
fix_data_dir.sh: old files are kept in data/test_yesno/.backup
steps/train_mono.sh --nj 1 --cmd utils/run.pl --totgauss 400 data/train_yesno data/lang exp/mono0a
steps/train_mono.sh: Initializing monophone system.
steps/train_mono.sh: line 69: feat-to-dim: command not found
error getting feature dimension
mkgraph.sh: expected data/lang_test_tg/G.fst to exist
steps/decode.sh --nj 1 --cmd utils/run.pl exp/mono0a/graph_tgpr data/test_yesno exp/mono0a/decode_test_yesno
decode.sh: no such file data/test_yesno/split1/1/feats.scp
grep: exp/mono0a/decode_test_yesno/wer_*: 没有那个文件或目录

local/run_ivector_common.sh: expected file data/train_cleaned/feats.scp to exist

After downloading the data, when I try to use the cuda with kaldi. it failed.
It throws error
local/run_ivector_common.sh: expected file data/train_cleaned/feats.scp to exist

while.

s5_r2/run.sh while running on cpu only works without any error.

Can you guide the proper run for cuda utilization which also after prepares the data makes all process on cuda for kaldi?
Your kind help is highly appreciated.

language model issue. kaldi_lm is not in path

I am now facing error.

local/build_lm.sh --srcdir data/local/lang_std_big_v5 --dir data/local/lm_std_big_v5 --lmstage 2
Not installing the kaldi_lm toolkit since it is already there.
You need to have kaldi_lm on your path

Can you guide me which path is it.
I have already set the path as

s5/path.sh
export KALDI_LM=$KALDI_ROOT/tools/kaldi_lm

empty wav on m_ailab dataset

when I run this recipe, I found that there are some wavs are empty like:data/wav/german-speechdata-package-v2/train/2014-03-24-13-39-24_Kinect-RAW.wav, so there are some errors when get the duration of the wav files, and I removed it, but could you please still check it?
Thanks,

Spacy v3 does not support loading from alias any more

When trying to build the models, run.sh failed because of the following error message:

OSError: [E941] Can't find model 'de'. It looks like you're trying to load a model from a shortcut, which is obsolete as of spaCy v3.0. To load the model, use its full name instead:

nlp = spacy.load("de_core_news_sm")

Simply replacing the line affected in prepare_commonvoice_data.py seems to have fixed this.

Edit: This also applies to renormalize_datadir_text.py.

Common Voice 7.0?

Hello!
I would like to train your tuda-de model with the current common voice dataset 7.0 with 965 hrs of German text. Can I bring it in the kaldi formats (scp,...), split it in train, test, dev and add it to the corresponding files, or did you do some cleansing / preprocessing?

Thanks for your feedback and kind regards,
Ernst

Tree file for pre-trained models?

Would it be possible to include the tree with the pretrained models? -- Thanks!

Error when I try to run run.sh However I can't find any variable called sequitur_g2p in run.sh,

Could not find g2p.py
Please edit run.sh and point sequitur_g2p to the g2p.py python script of your Sequitur G2P installation.
Sequitur G2P can be downloaded from https://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html
E.g. wget https://www-i6.informatik.rwth-aachen.de/web/Software/g2p-r1668-r3.tar.gz

Kaldi data preprocessing

Dear Friends.

I am trying to preprocess the kaldi data SWC

but, I failed several times.
I have error
Invalid data set of type kaldi: files wav.scp text not found at data/swc

Can you guide me how to process / preprocess it to get those required files? any script available?

New German model is arrived (Kaldi_tuda_de)

Hi Alphacep,

I am currently using https://alphacephei.com/vosk/models.html you German model (49 Mb Lightweight wideband model for Android and RPi). It works. Accuracy is ok, but I want to know,

Do we have source of the scripts how you built up this model? (I mean the mobile/short version).
As I understood you built this mobile model up based on http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/de_400k_nnet3chain_tdnn1f_2048_sp_bi.tar.bz2
Kaldi_tuda_de has been provided the new extremely large model (http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/de_683k_nnet3chain_tdnn1f_2048_sp_bi.tar.bz2) - would it be possible me make mobile/short version for this model?

Number in Train set

In the Tuda-De train set, there is a single sentence with a number:

Das Europäische Parlament konnte sein ambitioniertes Langfristziel gegenüber Rat und Kommission durchsetzen demzufolge bis zum Jahr zweitausendzwanzig der CO2 Ausstoß von Kleinsttransportern auf maximal hundertsiebenundvierzig Gramm pro gefahrenem Kilometer begrenzt wird

No other numbers are in the train / dev / test sets.
This should probably be replaced with its spoken equivalent, e.g. "C O zwei"...

tree for pretrained chain model

Hi I'm trying to create a new HCLG.fst with my own language model, I was wondering if you could upload tree for de_400k_nnet3chain_tdnn1f_2048_sp_bi? Thanks!