zzw922cn / automatic_speech_recognition

End-to-end Automatic Speech Recognition for Mandarin and English in TensorFlow

License: MIT License

Python 98.71% Shell 1.02% Dockerfile 0.27%

automatic-speech-recognition tensorflow timit-dataset feature-vector phonemes data-preprocessing rnn audio deep-learning lstm

automatic_speech_recognition's Issues

How to evaluate?

Is there a cli command for evaluating an utterance using one of the trained models?

The size of a model

Hi, I am wondering about the size of the model.
Is it hundreds of MB, or GB?
Can you tell me?
Thanks very much!

BN is suggested to be applied immediately before RELU, not after.

Automatic_Speech_Recognition/models/deepSpeech2.py

Line 73 in 545a198

 layer4 = tf.nn.dynamic_rnn(layer4_cell, layer3, sequence_length=seqLengths, time_major=True) 

As in (Laurent et al., 2015), there are two ways of applying
BatchNorm to the recurrent operation. A natural extension
is to insert a BatchNorm transformation, B(), immediately
before every non-linearity as follows:
h[l, t] = f(B(W[l]*h[l-1, t] + U[l]*h[l, t-1]))

In this case the mean and variance statistics are accumulated
over a single time-step of the minibatch. We did not
find this to be effective.
An alternative (sequence-wise normalization) is to batch
normalize only the vertical connections. The recurrent
computation is given by
h[l, t] = f(B(W[l]*h[l-1, t]) + U[l]*h[l, t-1])

So should we set the activation of rnn_cell to None and move the ReLU activation immediately after BN?
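The sequence-wise variant quoted above, h[l, t] = f(B(W[l]*h[l-1, t]) + U[l]*h[l, t-1]), can be sketched in plain Python. This is only an illustration under simplifying assumptions: one scalar feature per time step, scalar weights `w` and `u`, ReLU as f, and no learnable scale or shift in B. The function names are hypothetical and do not come from the repository.

```python
import math


def relu(x):
    return max(0.0, x)


def sequence_wise_bn(values, eps=1e-5):
    """Normalize a flat list of pre-activations to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def rnn_layer_seq_bn(inputs, w, u):
    """inputs[b][t] is the scalar output of the layer below (sample b, time t).

    Computes h[t] = relu(B(w * x[t]) + u * h[t-1]), where B collects its
    statistics over every (sample, time step) pair in the minibatch.
    """
    batch, steps = len(inputs), len(inputs[0])
    # Only the vertical (input) connection is normalized, over batch and time.
    flat = [w * inputs[b][t] for b in range(batch) for t in range(steps)]
    normed = sequence_wise_bn(flat)
    outputs = []
    for b in range(batch):
        h_prev, row = 0.0, []
        for t in range(steps):
            # The recurrent term is added after normalization, before the ReLU.
            h_prev = relu(normed[b * steps + t] + u * h_prev)
            row.append(h_prev)
        outputs.append(row)
    return outputs
```

The point of the sketch is the ordering: normalization touches only the input projection, and the non-linearity is applied once, after the recurrent term has been added.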

Deep Speech 2 build_graph not called

Hello,

In the Deep Speech model defined in "models/deepSpeech2.py", the function build_graph is never called, which results in the following error:
"AttributeError: 'DeepSpeech2' object has no attribute 'var_trainable_op'"
Comparing with "dynamic_brnn.py", you should add "self.build_graph(args, maxTimeSteps)" to the class initialiser.

Thanks,

timit_preprocess.py

I can't read TIMIT; the error is "ValueError: File format 'NIST'... not understood". Can you help me?

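Error when running on the libri corpus

Hello!
While running the code from your GitHub repository, I ran libri_preprocess.py successfully, but the training script then failed with the following error: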
[gpu3@localhost main]$ python libri_train.py
Initializing
Epoch 1 ...
Traceback (most recent call last):
File "libri_train.py", line 281, in
runner.run()
File "libri_train.py", line 216, in run
feed_dict=feedDict)
File "/home/gpu3/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
run_metadata_ptr)
File "/home/gpu3/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 961, in _run
% (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1731, 64, 39) for Tensor u'Placeholder:0', which has shape '(1731, 64, 60)
Could you tell me how to solve this?

Training on TIMIT Corpus

Hello

I am trying to use this library to train on TIMIT corpus for phoneme classification. I am facing multiple issues while running the preprocessing script and the train script. It would be great if you could provide a step by step guide on how to run it. The main problem seems to be inability to correctly import packages, also the preprocessing script throws error that input directory doesn't exist while it certainly does.

Thank you

Dead links in project page

A lot of the links on the first page of this repo return a 404 error.
Nearly all of the links pointing inside the project are dead.
The links pointing to external resources are OK.

How many epochs is appropriate for librispeech?

I ran an experiment on the LibriSpeech dataset with 15 training epochs, but got a CER of 0.46, which does not seem good. Could you tell me how many epochs are appropriate for LibriSpeech?
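For context on the number being reported: character error rate is commonly defined as the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below uses that standard definition; it is not taken from the repository, whose exact metric code is not shown here, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / float(len(ref))
```

Under this definition a CER of 0.46 means roughly 46 character edits per 100 reference characters.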

ImportError: No module named dynamic_brnn

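Running into many problems

I am a beginner, so I may run into many problems that I cannot solve. So far I have encountered:
1. In the import section, core_rnn and impl could not be imported; I could only continue after commenting out all of those statements.
2. return data_lists_to_batches([np.load(os.path.join(mfccPath, fn)) for fn in os.listdir(mfccPath)],
OSError: [Errno 2] No such file or directory: '/home/pony/github/data/timit/phn/train/mfcc'
I am stuck here and do not know what to do.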

Is there any way I can print the phonemes

I wanted to print the phonemes instead of English language. How can I do it?

License?

Hi I was wondering if you could specify a license for usage?

timit

SyntaxError: Missing parentheses in call to 'print'

Hi, after I followed the instructions and installed the packages, I got the error below when running the train command. Any idea? Thanks.

File "/usr/local/lib/python3.5/dist-packages/SpeechValley-1.0.0-py3.5.egg/speechvalley/utils/visualization.py", line 24
print 'Just mono files'
^
SyntaxError: Missing parentheses in call to 'print'
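The traceback shows a Python 2 print statement being parsed by Python 3.5. A minimal sketch of the Python 3 form follows; the wrapper function `report_mono_only` is hypothetical and only there to make the example self-contained.

```python
def report_mono_only():
    # Python 3 requires the function-call form; the Python 2 statement
    # `print 'Just mono files'` is a SyntaxError there.
    message = 'Just mono files'
    print(message)
    return message
```

The same call form also works on Python 2 when the file starts with `from __future__ import print_function`.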

Please specify the research paper

on which your work is based, so that other developers can follow what you are doing.
Best of luck!

What should I do after running the program successfully?

I don't know how to draw the PER chart the way the author did. I also want to use the trained model to recognize my own voice data, and I don't know how to do that. Many thanks!

I don't see any RELU activation after Conv and BN layers, is this implementation correct?

Automatic_Speech_Recognition/models/deepSpeech2.py

Line 68 in 545a198

 layer3 = tf.contrib.layers.dropout(layer3, keep_prob=args.keep_prob[2], is_training=args.is_training) 

error when running libri_train

Hello, when I run the libri_train code on the preprocessed LibriSpeech dataset, the following error appears:

Initializing
Epoch 1 ...
Traceback (most recent call last):
File "libri_train.py", line 281, in
runner.run()
File "libri_train.py", line 216, in run
feed_dict=feedDict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 778, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 961, in _run
% (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1720, 64, 39) for Tensor u'Placeholder:0', which has shape '(1720, 64, 60)'

It seems to be a phoneme mapping problem. How can I solve it? Thanks for answering.
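Reading the error message, the two shapes differ only on the last axis (39 versus 60), which suggests the per-frame feature length used during preprocessing does not match the one the training graph was built with. A small hypothetical helper (not part of the repository) that reports such mismatches before the data is fed might look like this:

```python
def feature_dim_mismatch(data_shape, placeholder_shape):
    """Return None if the shapes are compatible, else a message naming the
    axes that differ. A None entry in placeholder_shape matches any size."""
    if len(data_shape) != len(placeholder_shape):
        return "rank differs: data has %d axes, placeholder expects %d" % (
            len(data_shape), len(placeholder_shape))
    bad = [(axis, d, p)
           for axis, (d, p) in enumerate(zip(data_shape, placeholder_shape))
           if p is not None and d != p]
    if not bad:
        return None
    return "; ".join("axis %d: data has %d, placeholder expects %d" % t
                     for t in bad)
```

For the shapes in the traceback this yields "axis 2: data has 39, placeholder expects 60", which points at the feature-length setting rather than at the labels.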

Which language is it for?

Sorry, I am new to speech recognition.
I saw the note "timit phonemes, it is 62; if timit characters, it is 29".
Is this for English?
For Chinese, how many phonemes and characters should there be?
Thanks a lot!

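Unable to reproduce the experimental results

Hello, I ran experiments on TIMIT with this code, but the error rate on the test set is about 0.35, which is far from the results charted on the project home page. Also, the horizontal axis of the training and test error curves on the home page is not clearly labeled: is it epochs or something else? I am using 2 BLSTM layers and a learning rate of 0.0001, which are the default model parameters in the script. How can I reproduce the results shown in the figure? Were other tricks or different model parameters used?
Many thanks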

Pretrained Model

I was wondering if a pretrained model could be made available. Specifically, I was hoping for the model used to generate the librispeech examples in the readme.

Thanks!

Have a Problem in timit_train.py

Hi,
When I try to run the command:
python3 timit_train.py --mode train --level cha --batch_size 8

it produces the following error:

2018-01-14 14:23:26.356409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2018-01-14 14:23:26.356440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:82:00.0, compute capability: 6.1)
Epoch 1 ...
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value capsule_cnn_layer_2/conv_kernel
[[Node: capsule_cnn_layer_2/conv_kernel/read = IdentityT=DT_FLOAT, _class=["loc:@capsule_cnn_layer_2/conv_kernel"], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[Node: Mean/_17 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_958_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "timit_train.py", line 255, in
runner.run()
File "timit_train.py", line 183, in run
feed_dict=feedDict)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value capsule_cnn_layer_2/conv_kernel
[[Node: capsule_cnn_layer_2/conv_kernel/read = IdentityT=DT_FLOAT, _class=["loc:@capsule_cnn_layer_2/conv_kernel"], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[Node: Mean/_17 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_958_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'capsule_cnn_layer_2/conv_kernel/read', defined at:
File "timit_train.py", line 255, in
runner.run()
File "timit_train.py", line 139, in run
model = model_fn(args, maxTimeSteps)
File "/home/lab/Automatic_Speech_Recognition-master/speechvalley/models/capsuleNetwork.py", line 114, in init
self.build_graph(self.args, self.maxTimeSteps)
File "/home/lab/Automatic_Speech_Recognition-master/speechvalley/models/capsuleNetwork.py", line 153, in build_graph
output = capLayer(output, [2, 2], (1,1,1,1), args.num_iter)
File "/home/lab/Automatic_Speech_Recognition-master/speechvalley/models/capsuleNetwork.py", line 87, in call
self._num_channels*self._num_capsules*self._output_vector_len], dtype=tf.float32)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 1203, in get_variable
constraint=constraint)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 1092, in get_variable
constraint=constraint)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 425, in get_variable
constraint=constraint)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 394, in _true_getter
use_resource=use_resource, constraint=constraint)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 805, in _get_single_variable
constraint=constraint)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variables.py", line 213, in init
constraint=constraint)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variables.py", line 356, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_ops.py", line 125, in identity
return gen_array_ops.identity(input, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2071, in identity
"Identity", input=input, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1470, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

FailedPreconditionError (see above for traceback): Attempting to use uninitialized value capsule_cnn_layer_2/conv_kernel
[[Node: capsule_cnn_layer_2/conv_kernel/read = IdentityT=DT_FLOAT, _class=["loc:@capsule_cnn_layer_2/conv_kernel"], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[Node: Mean/_17 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_958_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Is there any other parameter that i should set?

Thanks!

Chinese speech dataset

Hi,
Could you recommend the Chinese speech dataset you used?

About the audiolab in this project

Sorry to bother you again. But I really would like to know how important is this external lib: scikits.audiolab to this project.

I want to use Python 3, and I see that your code would not be hard to adapt to Python 3. However, scikits.audiolab only supports Python 2, hence my question: is there a replacement for it? And where (in which functions) is the library used in your project?

Thank you very much!

Training on Isolated words

Hi,
I have been trying to train on isolated words such as command, backspace, one, two, etc.
Preprocessing and training went well,
but testing produces what I suppose is a long sequence of phonemes.
Any suggestions for correcting the output?
I am attaching the output I obtained:
cha_result.txt

No module named utils

On running run_timit.sh I get the following error:

saurabh@saurabh-Inspiron-5559:~/saurabh/asr_new/main$ sudo pip install utils
The directory '/home/saurabh/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/saurabh/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Requirement already satisfied: utils in /usr/local/lib/python2.7/dist-packages
saurabh@saurabh-Inspiron-5559:~/saurabh/asr_new/main$ ./run_timit.sh
loop index: 2
Traceback (most recent call last):
  File "timit_train.py", line 27, in <module>
    from utils.utils import load_batched_data
ImportError: No module named utils
saurabh@saurabh-Inspiron-5559:~/saurabh/asr_new/main$
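The import `from utils.utils import load_batched_data` appears to expect the project's own `utils` package rather than a package installed with pip, so installing `utils` from PyPI would not be expected to help. One common workaround, sketched here under the assumption that the training script sits one directory below the repository root, is to put that root on `sys.path` before the import. The helper name is hypothetical.

```python
import os
import sys


def add_repo_root_to_path(script_path, levels_up=1):
    """Prepend the directory `levels_up` above the script's folder to sys.path
    and return it, so that sibling packages such as `utils` can be imported."""
    root = os.path.dirname(os.path.abspath(script_path))
    for _ in range(levels_up):
        root = os.path.dirname(root)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

Calling `add_repo_root_to_path(__file__)` at the top of the training script, or running it with `PYTHONPATH` set to the repository root, are two ways to get the same effect.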

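Chinese speech recognition

@zzw922cn I read the code and found that the timit files decompose English into phoneme or character level and train it together with the speech feature vectors, while the libri files convert English into vectors through the corresponding numeric encoding. What about Chinese, then? My understanding is that the speech feature vectors are trained directly against Chinese words. Perhaps I have not fully understood the code, but could the author clear up this confusion?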

New tf.Session() for each subdir?

Looking at the code, a new tf.Session() is initialized for every subdirectory (4000 samples). This doesn't make sense (at least to me). Also, there seems to be no relation between the models of the different subdirectories. I am referring to the ./Automatic_Speech_Recognition/main/libri_train.py file. Please explain.

TED-LIUM?

It would be great to have TED-LIUM dataset support, and then to compare against Mozilla DeepSpeech and DeepSpeech PyTorch.

TRAIN CONFIGS

Thanks a lot for your amazing ASR repository, the best implementation of DeepSpeech there is. I will really appreciate it, if you could help me with one issue.

I am trying to train Librispeech with your repo but my results are far from your results.

Could you please inform me with what configs you got those results? ( #layers, lr, activation, rnncell, etc.)

why is "if er / batch_size == 1.0:" necessary?

Hi! I found that "if er / batch_size == 1.0:" is necessary in the *_train.py files. If I delete it, the errorRate stays at 1.0 forever. I am very confused. Why does this happen? Thanks~

Pretrained Model

Hi,

Is it possible to share your pretrained models (Checkpoint, meta files) ? We can evaluate the performance without training the model by ourselves.

Thanks,

deepSpeech2 model ValueError: Shape must be rank 4 but is rank 3 for 'Conv2D' (op: 'Conv2D') with input shapes:

Traceback (most recent call last):
File "main/timit_train.py", line 255, in
runner.run()
File "main/timit_train.py", line 144, in run
model = model_fn(args, maxTimeSteps)
File "/mnt/Automatic_Speech_Recognition-master/models/deepSpeech2.py", line 120, in init
self.build_graph(args, maxTimeSteps)
File "/mnt/Automatic_Speech_Recognition-master/utils/utils.py", line 32, in wrapper
result = func(*args, **kwargs)
File "/mnt/Automatic_Speech_Recognition-master/models/deepSpeech2.py", line 149, in build_graph
output_fc = build_deepSpeech2(self.args, maxTimeSteps, self.inputX, self.cell_fn, self.seqLengths)
File "/mnt/Automatic_Speech_Recognition-master/models/deepSpeech2.py", line 58, in build_deepSpeech2
layer1 = tf.nn.conv2d(inputX, layer1_filter, layer1_stride, padding='SAME')
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 403, in conv2d
data_format=data_format, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2338, in create_op
set_shapes_for_outputs(ret)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1719, in set_shapes_for_outputs
shapes = shape_func(op)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1669, in call_with_requiring
return call_cpp_shape_fn(op, require_shape_fn=True)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 610, in call_cpp_shape_fn
debug_python_shape_fn, require_shape_fn)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 676, in _call_cpp_shape_fn_impl
raise ValueError(err.message)
ValueError: Shape must be rank 4 but is rank 3 for 'Conv2D' (op: 'Conv2D') with input shapes: [778,32,39], [41,11,1,32].

need to preprocess data?

unable to import utils

Hi, this is an awesome project! Thx a lot!
I use the most recent version but still encounter the problem of "No module named utils.utils". I saw it is mentioned in a closed issue as well. Any idea to solve this?

I have a problem with the mfcc

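Today I happened to notice that the data in the .npy files produced by the feature extraction in your program is not 39×1 but 39×n (n varies with the utterance: 292, 370, and so on). I had assumed that your preprocessing produced a single feature vector of length 39 per utterance, since the other speech recognition feature extraction I have seen yields a 39-dimensional feature vector. Why is the extracted matrix so large? Is there a later step that converts it into a feature vector of length 39? I could not find one in your code. I would appreciate your guidance. Thank you very much.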
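A 39×n matrix is what frame-based feature extraction normally yields: one 39-dimensional vector (for example 13 MFCCs plus their deltas and delta-deltas) per analysis frame, with n growing with the length of the utterance. The sketch below estimates n under assumed settings of a 25 ms window and a 10 ms step, using a common framing convention; the repository's actual window parameters are not shown in this thread, and the function names are hypothetical.

```python
import math


def num_frames(num_samples, sample_rate, win_length=0.025, win_step=0.01):
    """Number of analysis frames for a signal, padding the last partial frame."""
    frame_len = int(round(win_length * sample_rate))
    frame_step = int(round(win_step * sample_rate))
    if num_samples <= frame_len:
        return 1
    return 1 + int(math.ceil((num_samples - frame_len) / float(frame_step)))


def feature_matrix_shape(num_samples, sample_rate, num_cepstra=13):
    """Shape (features, frames): static + delta + delta-delta coefficients."""
    return (3 * num_cepstra, num_frames(num_samples, sample_rate))
```

With these assumed settings, one second of 16 kHz audio gives 99 frames, so values of n such as 292 or 370 would correspond to utterances of roughly three to four seconds.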

How to use GPU when training?

What settings should I use if I want to train on libri with GPUs? Any suggestions would be appreciated. Thanks.

are there pretrained models?

I have a question about running the lstm+ctc example, thanks

When I run the example that uses TIMIT to build an lstm+ctc system, I get the following error:

Can you suggest how to deal with it? Thank you very much.

Incorrect __init__.py

The file Automatic_Speech_Recognition/speechvalley/feature/libri/__init__.py contains an invalid module name:

from speechvalley.feature.libri.libri_proprecess import preprocess, wav2feature

should be

from speechvalley.feature.libri.libri_preprocess import preprocess, wav2feature

Any plan to support TensorFlow r1.3?

Hi,

It seems TensorFlow moved around the RNN modules in r1.3.
Are you planning to support the latest version?

Many thanks,

build_deepSpeech2 function bug?

The first three conv layers expect the input to be shaped like [batch, freq_bin, time_len, in_channels]:

''' Parameters:

          maxTimeSteps: maximum time steps of input spectrogram power
          inputX: spectrogram power of audios, [batch, freq_bin, time_len, in_channels]
          seqLengths: lengths of samples in a mini-batch
   '''
# 3 2-D convolution layers
    layer1_filter = tf.get_variable('layer1_filter', shape=(41, 11, 1, 32), dtype=tf.float32)
    layer1_stride = [1, 2, 2, 1]
    layer2_filter = tf.get_variable('layer2_filter', shape=(21, 11, 32, 32), dtype=tf.float32)
    layer2_stride = [1, 2, 1, 1]
    layer3_filter = tf.get_variable('layer3_filter', shape=(21, 11, 32, 96), dtype=tf.float32)
    layer3_stride = [1, 2, 1, 1]
    layer1 = tf.nn.conv2d(inputX, layer1_filter, layer1_stride, padding='SAME')
    layer1 = tf.layers.batch_normalization(layer1, training=args.is_training)
    layer1 = tf.contrib.layers.dropout(layer1, keep_prob=args.keep_prob[0], is_training=args.is_training)

    layer2 = tf.nn.conv2d(layer1, layer2_filter, layer2_stride, padding='SAME')
    layer2 = tf.layers.batch_normalization(layer2, training=args.isTraining)
    layer2 = tf.contrib.layers.dropout(layer2, keep_prob=args.keep_prob[1], is_training=args.is_training)

    layer3 = tf.nn.conv2d(layer2, layer3_filter, layer3_stride, padding='SAME')
    layer3 = tf.layers.batch_normalization(layer3, training=args.isTraining)
    layer3 = tf.contrib.layers.dropout(layer3, keep_prob=args.keep_prob[2], is_training=args.is_training)

However, the RNN layers expect the input to be shaped like [max_time, batch_size, ...]:

    # 4 recurrent layers
    # inputs must be [max_time, batch_size ,...]
    layer4_cell = cell_fn(args.num_hidden, activation=args.activation)
    layer4 = tf.nn.dynamic_rnn(layer4_cell, layer3, sequence_length=seqLengths, time_major=True) 
    layer4 = tf.layers.batch_normalization(layer4, training=args.isTraining)
    layer4 = tf.contrib.layers.dropout(layer4, keep_prob=args.keep_prob[3], is_training=args.is_training)

I don't see any transpose that moves time to the first axis, so after the conv layers the tensor is still organized as [batch, freq_bin, time_len, 96], with time as the third axis. Is this a bug?
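The shape bookkeeping behind this question can be traced without TensorFlow. With padding='SAME', each convolution output dimension is ceil(input / stride), and the output channel count is the last entry of the filter shape. The sketch below applies that rule to the three strides and filter shapes quoted above, then shows the time-major shape a transpose plus reshape would give. The helper names and the example input size are assumptions for illustration only.

```python
import math


def same_conv_out(size, stride):
    """Output size of one dimension for a 'SAME'-padded convolution."""
    return int(math.ceil(size / float(stride)))


def trace_conv_stack(input_shape, strides, out_channels):
    """Follow [batch, freq_bin, time_len, channels] through the conv layers."""
    shape = tuple(input_shape)
    for stride, channels in zip(strides, out_channels):
        spatial = tuple(same_conv_out(d, s)
                        for d, s in zip(shape[:3], stride[:3]))
        shape = spatial + (channels,)
    return shape


def to_time_major(shape):
    """Shape after transposing to [time, batch, freq, channels] (in TensorFlow,
    tf.transpose with perm=[2, 0, 1, 3]) and merging the last two axes with
    tf.reshape, which is the layout time_major=True expects."""
    batch, freq, time, channels = shape
    return (time, batch, freq * channels)
```

For a hypothetical input of shape (8, 161, 400, 1) the conv stack yields (8, 21, 200, 96), and the time-major form is (200, 8, 2016). Because the first layer uses stride 2 along time, the sequence lengths passed to the RNN would presumably need the same ceil(length / 2) adjustment.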

NoneType' and 'int'

Traceback (most recent call last):
File "/administrator/PycharmProjects/Automatic_Speech_Recognition/feature/libri/libri_preprocess.py", line 176, in
mode=mode, feature_len=feature_len, seq2seq=seq2seq, save=True)
File "/administrator/PycharmProjects/Automatic_Speech_Recognition/feature/libri/libri_preprocess.py", line 90, in wav2feature
feat = calcfeat_delta_delta(sig,rate,win_length=win_len,win_step=win_step,mode=mode,feature_len=feature_len)
File "/administrator/PycharmProjects/Automatic_Speech_Recognition/feature/core/calcmfcc.py", line 68, in calcfeat_delta_delta
feat = calcMFCC(signal,samplerate,win_length,win_step,feature_len,filters_num,NFFT,low_freq,high_freq,pre_emphasis_coeff,cep_lifter,appendEnergy,mode=mode) # first get the 13 standard MFCC coefficients
File "/administrator/PycharmProjects/Automatic_Speech_Recognition/feature/core/calcmfcc.py", line 118, in calcMFCC
feat,energy=fbank(signal,samplerate,win_length,win_step,filters_num,NFFT,low_freq,high_freq,pre_emphasis_coeff)
File "/administrator/PycharmProjects/Automatic_Speech_Recognition/feature/core/calcmfcc.py", line 151, in fbank
high_freq=high_freq or samplerate/2
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

When I run libri_preprocess.py, the above error occurs. How can I solve it?
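The failing expression is `high_freq or samplerate/2`, and the TypeError indicates that `samplerate` was None when it ran, which suggests the audio reader did not return a sample rate for the file. A defensive version of that default could be sketched as below; `resolve_high_freq` is a hypothetical helper, not a function from calcmfcc.py.

```python
def resolve_high_freq(samplerate, high_freq=None):
    """Return the upper band edge, defaulting to the Nyquist frequency."""
    if samplerate is None:
        # Fail early with a readable message instead of a TypeError deep
        # inside the filterbank code.
        raise ValueError("samplerate is None: check that the audio file "
                         "was read correctly")
    if high_freq is None:
        return samplerate / 2.0
    return high_freq
```

Checking what the audio loading step returns for the offending file is probably the quickest way to confirm where the None comes from.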

out_channels of the third conv layer in deep_speech2 is 96, but Baidu uses 32. Is it a typo?

Automatic_Speech_Recognition/models/deepSpeech2.py

Line 56 in 545a198

 layer3_filter = tf.get_variable('layer3_filter', shape=(21, 11, 32, 96), dtype=tf.float32) 

see https://github.com/PaddlePaddle/models/blob/develop/deep_speech_2/layer.py, conv_group

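Question about reproducing the TIMIT experiment

Hello, while reproducing the DBRNN TIMIT experiment, the program fails to run. I am using tensorflow==1.1.0 and tensorflow-gpu==1.1.0, and feature extraction has already completed. How can I solve this problem?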
(tensorflow35)jtang@nelslip-k40-server219:~/tfcode/Automatic_Speech_Recognition/speechvalley/main$ ./run_timit.sh
loop index: 2
test mode...
load_data...
load_data in 2.1332616806030273 s
build_graph...
Traceback (most recent call last):
File "timit_train.py", line 248, in
runner.run()
File "timit_train.py", line 136, in run
model = model_fn(args, maxTimeSteps)
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/SpeechValley-1.0.0-py3.5.egg/speechvalley/models/dynamic_brnn.py", line 87, in init
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/SpeechValley-1.0.0-py3.5.egg/speechvalley/utils/utils.py", line 25, in wrapper
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/SpeechValley-1.0.0-py3.5.egg/speechvalley/models/dynamic_brnn.py", line 118, in build_graph
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/ops/random_ops.py", line 167, in truncated_normal
shape_tensor = _ShapeTensor(shape)
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/ops/random_ops.py", line 42, in _ShapeTensor
return ops.convert_to_tensor(shape, dtype=dtype, name="shape")
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 639, in convert_to_tensor
as_ref=False)
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 704, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 113, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 102, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 444, in make_tensor_proto
tensor_proto.string_val.extend([compat.as_bytes(x) for x in proto_values])
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 444, in
tensor_proto.string_val.extend([compat.as_bytes(x) for x in proto_values])
File "/home/jtang/anaconda2/envs/tensorflow35/lib/python3.5/site-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got 128

package/lib requirements

I checked the requirements.txt, and it states that this project requires the following libraries or frameworks:

tabulate==0.7.7
theano==0.9.0
xlwt==1.2.0

Is theano really required, what is that for? It will not be supported in the future. Therefore, I would like to confirm this.

Also, the other two packages look weird to me, what are they used for: creating tabular data, and excel file? Are they essential to this project?

Small data

I am wondering whether this also works well on small data. Say I have only a few hundred training instances and 20 to 30 test instances, because my task is specific to one person only.

Question: TensorFlow devices are created at every step. Isn't it inefficient?

I found that training our model with a Tesla P100 GPU is not any faster than training it with an ordinary CPU.

According to these console outputs, it seems TensorFlow devices are created at every step.
(I am using 24 GPUs and global_step increases by 24 for each step.)

Is it necessary to do so?
Is it costly to create a Tensorflow device?
Who is in charge of the device creation, TensorFlow or us?

2017-06-15 15:54:40.224138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:22:00.0)
15:54:36 phn mode, global_step:34755.0,total:4620,batch:40/144,epoch:11/200,train loss=82.963,mean train PER=0.013
Model has been saved in /home/chenjiasheng/log/timit/phn/save
2017-06-15 15:54:46.530416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:22:00.0)
15:54:42 phn mode, global_step:34793.0,total:4620,batch:41/144,epoch:11/200,train loss=104.673,mean train PER=0.016
2017-06-15 15:54:50.600103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:22:00.0)
15:54:46 phn mode, global_step:34818.0,total:4620,batch:42/144,epoch:11/200,train loss=59.382,mean train PER=0.009

Not getting the specified accuracy

I do not get the accuracy you mention under "LibriSpeech recognition result without LM". Can you state which parameters you used, and the exact names of the training sets you used from the LibriSpeech corpus at http://www.openslr.org/12/ ?
