
chinese-automatic-speech-recognition's Introduction

Chinese Automatic Speech Recognition

by

   _____ _                    __  __ _                  _                   
  / ____| |                  |  \/  (_)                (_)                  
 | |    | |__   ___ _ __     | \  / |_ _ __   __ ___  ___  __ _ _ __   __ _ 
 | |    | '_ \ / _ \ '_ \    | |\/| | | '_ \ / _` \ \/ / |/ _` | '_ \ / _` |
 | |____| | | |  __/ | | |_  | |  | | | | | | (_| |>  <| | (_| | | | | (_| |
  \_____|_| |_|\___|_| |_( ) |_|  |_|_|_| |_|\__, /_/\_\_|\__,_|_| |_|\__, |
                         |/                   __/ |                    __/ |
                                             |___/                    |___/ 

Email: [email protected]

Update 2020.06.23

Some users have mentioned that, inside mainland China, converting the pinyin output to Chinese characters through the Google Pinyin input method is inconvenient. Following a suggestion from wyf19941128, I have added an alternative that converts pinyin to characters via Google Translate (reachable from mainland China without a VPN). The code is in the ./alternative folder. A conversion example:

# requires the code under ./alternative
from Pinyin2SimplifiedChinese import *

t = translator()
print(t.translate("jin tian tian qi zhen bu cuo"))  # prints "今天天气真不错"

Model Overview

The model takes a speech clip no longer than 10 seconds as input and outputs the corresponding pinyin labels. The project is written mainly in Python 3.6.

The model is based on Baidu's Deep Speech 2: http://proceedings.mlr.press/v48/amodei16.pdf

It uses a CNN + GRU + CTC-loss structure, sketched below.
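The README does not spell the architecture out, so here is a minimal sketch of a CNN + GRU + CTC acoustic model in TensorFlow 1.x (which the inference code below also targets). The layer counts follow model 901 from the table in the training section; the filter sizes, feature dimension, and function names are illustrative assumptions, not the repository's exact code.

import tensorflow as tf

def build_ctc_model(num_classes=409, num_features=161, gru_units=256, num_gru=3):
    # inputs: [batch, time, feature] spectrogram features (as in step 3 below)
    xs = tf.placeholder(tf.float32, [None, None, num_features])
    labels = tf.sparse_placeholder(tf.int32)    # target pinyin label indices
    seq_len = tf.placeholder(tf.int32, [None])  # frame counts after the CNN's time stride

    # two 2-D convolutions over the time-frequency plane
    h = tf.expand_dims(xs, -1)
    h = tf.layers.conv2d(h, 32, (11, 41), strides=(2, 2),
                         padding="same", activation=tf.nn.relu)
    h = tf.layers.conv2d(h, 32, (11, 21), strides=(1, 2),
                         padding="same", activation=tf.nn.relu)

    # flatten the frequency axis so each time step is one feature vector
    freq = h.get_shape().as_list()[2] * h.get_shape().as_list()[3]
    h = tf.reshape(h, [tf.shape(h)[0], -1, freq])

    # stacked GRU layers
    cells = tf.nn.rnn_cell.MultiRNNCell(
        [tf.nn.rnn_cell.GRUCell(gru_units) for _ in range(num_gru)])
    h, _ = tf.nn.dynamic_rnn(cells, h, dtype=tf.float32)

    # frame-wise logits over the pinyin classes plus the CTC blank
    logits = tf.transpose(tf.layers.dense(h, num_classes + 1), [1, 0, 2])  # time-major
    loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits, seq_len))
    decoded, _ = tf.nn.ctc_greedy_decoder(logits, seq_len)
    return xs, labels, seq_len, loss, decoded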

Training Data

The training data consists of two parts:

  1. The AISHELL-1 speech dataset

AISHELL-ASR0009-OS1 contains 178 hours of recordings, roughly 140,000 utterances. Download: http://www.aishelltech.com/kysjcp

  2. YouTube videos with their subtitle files

MP4 videos were downloaded from YouTube and converted to WAV audio, with the matching .srt subtitle files used as targets: roughly 120 hours in total, about 200,000 utterances. Because the data is large and its copyright status is unclear, no public download is provided for now. A sketch of this pipeline is shown below.
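The preparation pipeline itself is not included in the repository. The following is a rough sketch of the idea, assuming ffmpeg is on the PATH; file names and helper names here are hypothetical:

import re
import subprocess

def mp4_to_wav(mp4_path, wav_path):
    # 16 kHz mono WAV, matching the short clips used for training
    subprocess.run(["ffmpeg", "-y", "-i", mp4_path, "-ac", "1",
                    "-ar", "16000", wav_path], check=True)

def parse_srt(srt_path):
    """Yield (start_sec, end_sec, text) for each subtitle block."""
    def to_sec(ts):                           # "HH:MM:SS,mmm" -> seconds
        h, m, rest = ts.strip().split(":")
        s, ms = re.split("[,.]", rest)
        return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0
    with open(srt_path, encoding="utf-8") as f:
        for block in f.read().strip().split("\n\n"):
            lines = block.splitlines()
            start, end = (to_sec(t) for t in lines[1].split("-->"))
            yield start, end, " ".join(lines[2:])

def cut_clips(wav_path, srt_path, out_prefix):
    # one (clip, transcript) training pair per subtitle entry
    for i, (start, end, text) in enumerate(parse_srt(srt_path)):
        out = "%s_%06d.wav" % (out_prefix, i)
        subprocess.run(["ffmpeg", "-y", "-i", wav_path, "-ss", str(start),
                        "-to", str(end), out], check=True)
        # `text` is the target transcript for this clip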

Usage

1. Training the model

Choose the model to train and tune according to your needs and hardware; the models differ as listed below. On a machine with a GPU, simply run train901.py, train902.py, or train903.py. To train on CPU, run train901_cpu.py, train902_cpu.py, or train903_cpu.py instead.

Model  CNN layers  GRU layers  GRU units  Training time
901    2           3           256        ~30 hours
902    2           5           256        ~55 hours
903    2           5           1024       ~130 hours

The training times are rough estimates; training was done on a single Tesla V100.

model 903 link: https://pan.baidu.com/s/1NcTN8gojuIBaIFT9FB3EJw password: 261u

model 902 link: https://pan.baidu.com/s/1do7C6Egj6sJO7kn1yHPzBg password: 9o87

model 901 link: https://pan.baidu.com/s/1utz-1Vv4IO9D-3awj3x1QQ password: pv08

After downloading, place the files under the models folder (the path the loading code below expects).

2. Recognizing audio

  1. Initialize the model and load the required tools
import os
import time
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    import tensorflow as tf
import numpy as np
from urllib.request import urlopen

from lib.tools_batch import *
from lib.tools_math import *
from lib.tools_sparse import *
from lib.tools_audio import *
from lib.contrib.audio_featurizer import AudioFeaturizer
from lib.contrib.audio import AudioSegment

# change these two lines to match the model you use
from model903 import *
model_name = "v903"

pyParser = pinyinParser("lib/pinyinDictNoTone.pickle")  # maps label indices to pinyin
af = AudioFeaturizer()                                  # spectrogram feature extractor
model = model(409)                                      # 409 output classes (the pinyin label set)
  2. Start a session and restore the trained model
sess = tf.Session()
saver = tf.train.Saver()
saver.restore(sess, "models/"+model_name+"/"+model_name+"_0.ckpt")
  3. Read the audio and convert its format
rate, data = read_wav("data/test.wav")
data = mergeChannels(data)             # downmix to mono
data = zero_padding_1d(data, 160240)   # pad to 160240 samples (~10 s at 16 kHz)
a_seg = AudioSegment(data, rate)
xs = np.transpose(np.array([af.featurize(a_seg)]), [0,2,1])  # [batch, time, feature]
  4. Predict and convert to pinyin
pred = model.predict(sess, xs)[0]
pred_dense = sparseTuples2dense(pred)  # sparse CTC output -> dense index matrix
detected_line = [idx for idx in pred_dense[0] if idx != -1]  # drop the -1 padding
pinyin = pyParser.decodeIndices(detected_line, useUnderline=False)
  5. Convert to Chinese characters
from urllib.parse import quote

# URL-encode the pinyin (it may contain spaces) before querying Google input tools
response = urlopen("https://www.google.com/inputtools/request?ime=pinyin&ie=utf-8&oe=utf-8&app=translate&num=10&text=" + quote(pinyin))
html = response.read()
result = html.decode('utf8').split(",")[2][2:-1]  # extract the top candidate
print(result)

This last step uses the Google Pinyin input tools service. If needed, you can substitute a custom vocabulary, a Markov chain, or a seq2seq model instead; for a vocabulary-based input method, see my other project: https://github.com/chenmingxiang110/SimpleChinese2 A rough illustration of the vocabulary route follows.
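The sketch below is a hypothetical greedy longest-match over a toy pinyin-to-word table, not the author's implementation; a real system would need a full dictionary and some form of language-model scoring to rank candidates.

# Hypothetical greedy longest-match over a tiny pinyin -> word table.
TABLE = {
    "jin tian": "今天", "tian qi": "天气", "zhen": "真", "bu cuo": "不错",
}

def pinyin_to_hanzi(pinyin):
    syllables = pinyin.split()
    out, i = [], 0
    while i < len(syllables):
        # try the longest candidate first, then fall back to shorter ones
        for j in range(len(syllables), i, -1):
            key = " ".join(syllables[i:j])
            if key in TABLE:
                out.append(TABLE[key])
                i = j
                break
        else:
            out.append(syllables[i])  # keep unknown syllables as-is
            i += 1
    return "".join(out)

print(pinyin_to_hanzi("jin tian tian qi zhen bu cuo"))  # 今天天气真不错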

Results and Demo

ASR has many application scenarios. As a demo, I built automatic subtitle generation; the code is in subtitle_demo.ipynb, a sketch of the idea is below, and the results follow.
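The real code is in subtitle_demo.ipynb; this sketch only illustrates the idea under stated assumptions: slice the audio into windows no longer than 10 seconds, recognize each window with a hypothetical recognize() wrapper around steps 3-5 above, and write the results as SRT entries.

def sec_to_srt(t):
    # seconds -> "HH:MM:SS,mmm"
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return "%02d:%02d:%02d,%03d" % (h, m, s, int((t % 1) * 1000))

def write_srt(data, rate, recognize, out_path, window_sec=10):
    # recognize(chunk, rate) is a hypothetical wrapper around steps 3-5 above
    step = rate * window_sec
    with open(out_path, "w", encoding="utf-8") as f:
        for i, start in enumerate(range(0, len(data), step)):
            text = recognize(data[start:start + step], rate)
            t0 = start / rate
            t1 = min(len(data), start + step) / rate
            f.write("%d\n%s --> %s\n%s\n\n"
                    % (i + 1, sec_to_srt(t0), sec_to_srt(t1), text))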

  1. Video 1, URL: https://www.youtube.com/watch?v=t5cPgIGNosc

Left: automatically generated subtitles; right: subtitles added manually by the YouTuber


  2. Video 2, URL: https://www.youtube.com/watch?v=HLJJlQkY6ro

Left: automatically generated subtitles; right: subtitles added manually by the YouTuber


The full original subtitle files and the predicted results can be found in the data folder.

  _____ _              _    __   __          ___         __      __    _      _    _           
 |_   _| |_  __ _ _ _ | |__ \ \ / /__ _  _  | __|__ _ _  \ \    / /_ _| |_ __| |_ (_)_ _  __ _ 
   | | | ' \/ _` | ' \| / /  \ V / _ \ || | | _/ _ \ '_|  \ \/\/ / _` |  _/ _| ' \| | ' \/ _` |
   |_| |_||_\__,_|_||_|_\_\   |_|\___/\_,_| |_|\___/_|     \_/\_/\__,_|\__\__|_||_|_|_||_\__, |
                                                                                         |___/ 
                                              _..  
                                          .qd$$$$bp.
                                        .q$$$$$$$$$$m.
                                       .$$$$$$$$$$$$$$
                                     .q$$$$$$$$$$$$$$$$
                                    .$$$$$$$$$$$$P\$$$$;
                                  .q$$$$$$$$$P^"_.`;$$$$
                                 q$$$$$$$P;\   ,  /$$$$P
                               .$$$P^::Y$/`  _  .:.$$$/
                              .P.:..    \ `._.-:.. \$P
                              $':.  __.. :   :..    :'
                             /:_..::.   `. .:.    .'|
                           _::..          T:..   /  :
                        .::..             J:..  :  :
                     .::..          7:..   F:.. :  ;
                 _.::..             |:..   J:.. `./
            _..:::..               /J:..    F:.  : 
          .::::..                .T  \:..   J:.  /
         /:::...               .' `.  \:..   F_o'
        .:::...              .'     \  \:..  J ;
        ::::...           .-'`.    _.`._\:..  \'
        ':::...         .'  `._7.-'_.-  `\:.   \
         \:::...   _..-'__.._/_.--' ,:.   b:.   \._ 
          `::::..-"_.'-"_..--"      :..   /):.   `.\   
            `-:/"-7.--""            _::.-'P::..    \} 
 _....------""""""            _..--".-'   \::..     `. 
(::..              _...----"""  _.-'       `---:..    `-.
 \::..      _.-""""   `""""---""                `::...___)
  `\:._.-"""


chinese-automatic-speech-recognition's Issues

Getting MP4s from YouTube

MP4 videos were downloaded from YouTube and converted to WAV audio, with the matching .srt subtitle files used as targets: roughly 120 hours in total, about 200,000 utterances. Because the data is large and its copyright status is unclear, no public download is provided for now.

--> I'd like to know how you did this. I want to do something similar for TTS training.

About the model size

Hello, I downloaded your model3 and found it is over 1 GB, while the model3 I trained myself is only about 300 MB. What could be the reason?

could not find pinyinDict?

if pinyin in self.pinyinDict:
    raise ValueError("Could not find "+pinyin+" in the dictionary.")

How should pinyinDictNoToneInv and pinyinDictNoTone be generated?

Link provided in description no longer valid

Hello sir, I hope you are doing well. The links provided in the description are not opening:

not working:
https://pan.baidu.com/s/1NcTN8gojuIBaIFT9FB3EJw
not working:
https://pan.baidu.com/s/1do7C6Egj6sJO7kn1yHPzBg
not working:
https://pan.baidu.com/s/1utz-1Vv4IO9D-3awj3x1QQ

I can't train on data myself, and I want to watch a tutorial that is in Chinese, so I need to generate subtitles and then translate them to English. Could you please upload a good model and provide a link?
Thank you.

package error

When I run this project, I get an error: module 'tensorflow.keras' has no attribute 'rnn'. I think my environment may not match, so could you tell me which environment you used to run this project? Thanks so much!

How to perform inference?

Hi, I am a bit confused about how to perform inference with your pretrained model.
Can you please provide the steps?

ValueError

Hello, sorry to bother you. When I switch to a different audio file, I get this error:
Traceback (most recent call last):
  File "C:/Users/lihao/Desktop/Chinese-automatic-speech-recognition-master/test.py", line 46, in <module>
    data = zero_padding_1d(data, 160240)
  File "C:\Users\lihao\Desktop\Chinese-automatic-speech-recognition-master\lib\tools_math.py", line 46, in zero_padding_1d
    result = np.concatenate([vec, np.zeros(obj_length-len(vec))])
ValueError: negative dimensions are not allowed

Is it because there is a limit on the audio's sample rate? If so, is there a good workaround? Many thanks.

tensorflow.python.framework.errors_impl.NotFoundError: FindFirstFile failed for: /models/v903 : 系统找不到指定的路径。 (The system cannot find the path specified.) ; No such process

The full error:
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Traceback (most recent call last):
  File "J:/PyCharm项目/项目/语音命令库/expand/Chinese-speech-recognition/user.py", line 38, in <module>
    saver.restore(sess, "/models/"+model_name+"/"+model_name+"_0.ckpt")
  File "C:\Users\lenovo\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\training\saver.py", line 1266, in restore
    if not checkpoint_management.checkpoint_exists(compat.as_text(save_path)):
  File "C:\Users\lenovo\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\util\deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "C:\Users\lenovo\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\training\checkpoint_management.py", line 372, in checkpoint_exists
    if file_io.get_matching_files(pathname):
  File "C:\Users\lenovo\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\lib\io\file_io.py", line 361, in get_matching_files
    return get_matching_files_v2(filename)
  File "C:\Users\lenovo\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\lib\io\file_io.py", line 389, in get_matching_files_v2
    for single_filename in pattern
  File "C:\Users\lenovo\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: FindFirstFile failed for: /models/v903 : 系统找不到指定的路径。 (The system cannot find the path specified.)
; No such process

CER is too high when testing on another corpus

Hello. I just tested with the first utterance of the ths30 corpus, and the acoustic model's CER (on the pinyin output) is around 30%.

Ground truth:
lv_shi_yang_chun_yan_jing_da_kuai_wen_zhang_di_di_se_si_yve_de_lin_luan_geng_shi_lv_de_xian_huo_xiu_mei_shi_yi_ang_ran
Recognized:
lv_shen_yang_che_ye_jie_da_po_wen_zhang_de_di_se_si_yue_de_li_lun_geng_shi_lv_de_xian_huo_xiu_mei_shi_yi_er_ran

Yet even the simplest CNN+CTC model, trained only on aishell (dual 1080Ti, 2 hours) and validated on ths30, can reach about 20% CER.
Does this project's model have a generalization problem?
According to the Deep Speech publicity material, the baseline should start at around 10% CER, shouldn't it?
