I am trying to train a model on the AMI dataset and I get the following error:
- Reading config file......OK!
- Chunk creation......OK!
------------------------------ Epoch 000 / 023 ------------------------------
Training AMI_tr chunk = 1 / 50
ERROR: training epoch 0, chunk 0 not done! File exp/AMI_MLP_fbank/exp_files/train_AMI_tr_ep000_ck00.info does not exist.
See exp/AMI_MLP_fbank/log.log
##########################################################
exp/AMI_MLP_fbank/log.log:
add-deltas --delta-order=0 ark:- ark:-
apply-cmvn --utt2spk=ark:/disk/scratch1/s1569548/software/kaldi/egs/ami/s5b/data/ihm/train_cleaned/utt2spk scp:/disk/scratch1/s1569548/software/kaldi/egs/ami/s5b/data-fbank/ihm/train_cleaned/cmvn.scp ark:- ark:-
copy-feats scp:exp/AMI_MLP_fbank/exp_files/train_AMI_tr_ep000_ck00_fbank.lst ark:-
LOG (copy-feats[5.5.1051-d3379]:main():copy-feats.cc:143) Copied 1969 feature matrices.
LOG (apply-cmvn[5.5.1051-d3379]:main():apply-cmvn.cc:162) Applied cepstral mean normalization to 1969 utterances, errors on 0
ali-to-phones --per-frame=true /disk/scratch1/s1569548/software/pytorch-kaldi/kaldi/exp/ihm/tri3_cleaned_ali_train_cleaned/final.mdl ark:- ark:-
LOG (ali-to-phones[5.5.1051-d3379]:main():ali-to-phones.cc:134) Done 98455 utterances.
copy-feats scp:exp/AMI_MLP_fbank/exp_files/train_AMI_tr_ep000_ck00_fbank.lst ark:-
apply-cmvn --utt2spk=ark:/disk/scratch1/s1569548/software/kaldi/egs/ami/s5b/data/ihm/train_cleaned/utt2spk scp:/disk/scratch1/s1569548/software/kaldi/egs/ami/s5b/data-fbank/ihm/train_cleaned/cmvn.scp ark:- ark:-
add-deltas --delta-order=0 ark:- ark:-
LOG (copy-feats[5.5.1051-d3379]:main():copy-feats.cc:143) Copied 1969 feature matrices.
LOG (apply-cmvn[5.5.1051-d3379]:main():apply-cmvn.cc:162) Applied cepstral mean normalization to 1969 utterances, errors on 0
ali-to-pdf /disk/scratch1/s1569548/software/pytorch-kaldi/kaldi/exp/ihm/tri3_cleaned_ali_train_cleaned/final.mdl ark:- ark:-
LOG (ali-to-pdf[5.5.1051-d3379]:main():ali-to-pdf.cc:68) Converted 98455 alignments to pdf sequences.
Traceback (most recent call last):
File "run_nn.py", line 207, in <module>
outs_dict=forward_model(fea_dict,lab_dict,arch_dict,model,nns,costs,inp,inp_out_dict,max_len,batch_size,to_do,forward_outs)
File "/disk/scratch1/s1569548/software/pytorch-kaldi/utils.py", line 1630, in forward_model
lab_dnn=lab_dnn.view(-1).long()
RuntimeError: CUDA error: device-side assert triggered
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [6,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [8,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [9,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [11,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [12,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [13,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [14,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [15,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [17,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [18,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [19,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [20,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [21,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [26,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [29,0,0] Assertion t >= 0 && t < n_classes
failed.
/opt/conda/conda-bld/pytorch_1544081127912/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [30,0,0] Assertion t >= 0 && t < n_classes
failed.
##########################################################
So it looks like something is wrong with the targets: the assertion says that some label t falls outside the range [0, n_classes). To check this, I added these lines in load_chunk() in data_io.py:
print("Min label value of this chunk: ", min(data_lab))
print("Max label value of this chunk: ", max(data_lab))
and got:
Min label value of this chunk: 1
Max label value of this chunk: 175
Min label value of this chunk: 0
Max label value of this chunk: 3983
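A slightly more explicit version of this check could compare the labels directly against the number of classes the loss expects. This is only a sketch: check_label_range is a hypothetical helper, not part of pytorch-kaldi, and n_classes has to be whatever output size the corresponding network head actually uses.

```python
def check_label_range(labels, n_classes):
    """Return (min_label, max_label, n_out_of_range) for a sequence of labels.

    Hypothetical helper for debugging: a label is out of range when it
    violates the condition t >= 0 and t < n_classes from the assertion.
    """
    labels = [int(x) for x in labels]
    lowest, highest = min(labels), max(labels)
    n_bad = sum(1 for x in labels if x < 0 or x >= n_classes)
    return lowest, highest, n_bad


# Example with made-up labels; in load_chunk() one would pass data_lab instead.
print(check_label_range([1, 42, 175], n_classes=176))
```

A non-zero third value would mean that at least one label in the chunk triggers the assertion for that n_classes.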
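I didn't modify the default architecture, so it is a monophone + context-dependent (cd) model. The first min/max pair above is for the monophone labels and the second is for the cd labels.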
From gmm-info on the triphone model used for alignments, I get the following:
number of phones 176
number of pdfs 3984
number of transition-ids 27460
number of transition-states 13650
feature dimension 40
number of gaussians 80060
So for the pdfs, the labels (0 to 3983) satisfy t >= 0 && t < n_classes, assuming n_classes = 3984.
For the monophone part, the labels start at 1 instead of 0, so the silence label seems to be missing. I'm not sure why...
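One possible explanation for the missing 0, as far as I understand Kaldi: phone ids in phones.txt start at 1 because id 0 is reserved for the epsilon symbol, so the phone labels would run from 1 to 176 rather than 0 to 175. If the monophone output layer has 176 units, a label of 176 would then fail the assertion even though this chunk only goes up to 175. A minimal sketch of shifting such labels to zero-based class indices (to_zero_based is a hypothetical helper, and whether the toolkit already applies a shift like this is something I have not verified):

```python
def to_zero_based(phone_ids, n_phones):
    """Map 1-based phone ids (1..n_phones) to class indices (0..n_phones-1).

    Hypothetical helper: raises ValueError if an id is outside 1..n_phones,
    which would indicate that the labels are not 1-based phone ids after all.
    """
    shifted = []
    for p in phone_ids:
        p = int(p)
        if not 1 <= p <= n_phones:
            raise ValueError("phone id %d is outside 1..%d" % (p, n_phones))
        shifted.append(p - 1)
    return shifted


# Example with made-up ids; 176 is the phone count reported by gmm-info.
print(to_zero_based([1, 175, 176], n_phones=176))
```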
##########################################################
Do you know what the issue could be here?
I also tried training with only monophone targets and with only cd targets, but I got the same error in both cases.
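One more thing that might help narrow this down: the traceback points at lab_dnn.view(-1).long(), while the assertion itself comes from ClassNLLCriterion. CUDA kernels run asynchronously, so the Python line in the traceback is not necessarily the one that failed. Setting the CUDA_LAUNCH_BLOCKING environment variable to 1 before CUDA is initialized should make the error surface at the failing call. A small sketch (enable_sync_cuda_errors is a hypothetical helper name; it would need to run at the very top of run_nn.py, before torch is imported):

```python
import os


def enable_sync_cuda_errors(env=os.environ):
    """Request synchronous CUDA kernel launches via CUDA_LAUNCH_BLOCKING=1.

    Hypothetical helper: it only sets the environment variable, so it has
    to be called before the CUDA runtime is initialized to have any effect.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


enable_sync_cuda_errors()
```

The same effect can be had by exporting the variable in the shell before launching the training script.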