netease-youdao / EmotiVoice
EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine
License: Apache License 2.0
Suggestion: accept the text, speaker_id, and prompt parameters via GET or POST, and return a file URL or raw audio data.
For efficiency you could add a caching layer: take the MD5 of the request parameters and use it as the file name — caching the generated files this way works quite well.
Hoping the maintainers can weigh in.
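The caching idea above can be sketched in a few lines. This is a minimal illustration, not part of EmotiVoice; the CACHE_DIR location and function names are made up for the example:

```python
import hashlib
import os

CACHE_DIR = "tts_cache"  # illustrative location, not part of EmotiVoice

def cache_key(text, speaker_id, prompt):
    # MD5 of the request parameters, used as the cached file name
    raw = f"{text}|{speaker_id}|{prompt}".encode("utf-8")
    return hashlib.md5(raw).hexdigest()

def cached_path(text, speaker_id, prompt):
    # Returns the wav path if this exact request was synthesized before,
    # otherwise None (the caller would then run TTS and write the file).
    path = os.path.join(CACHE_DIR, cache_key(text, speaker_id, prompt) + ".wav")
    return path if os.path.exists(path) else None
```

An API handler would call cached_path first and only invoke the model on a miss, which makes repeated requests for the same text/speaker/prompt essentially free.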
It's a great project! Is there any plan to support an API interface?
Can this approach do voice cloning?
If not, are there any ideas for modifying it to support that?
I'd like to know roughly how much GPU memory is needed for inference.
EmotiVoice/inference_am_vocoder_joint.py", line 66, in main
style_encoder.load_state_dict(model_ckpt)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for StyleEncoder:
Unexpected key(s) in state_dict: "bert.embeddings.position_ids".
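A common workaround for this error: the checkpoint was saved with an older transformers version that still registered the position_ids buffer, so the saved state_dict carries a key the current StyleEncoder no longer expects. You can either drop the stale key before loading, or load non-strictly. A minimal sketch (the helper name is illustrative):

```python
def strip_stale_keys(state_dict, stale=("bert.embeddings.position_ids",)):
    # Drop checkpoint entries the current model no longer expects; the
    # remaining keys load normally. Equivalent in spirit to
    # style_encoder.load_state_dict(model_ckpt, strict=False),
    # but explicit about which key is being discarded.
    return {k: v for k, v in state_dict.items() if k not in stale}
```

Usage would then be style_encoder.load_state_dict(strip_stale_keys(model_ckpt)).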
The pretrained models cannot be downloaded, is there another link?
To improve audio quality, how can I increase the sampling rate?
I tried modifying sampling_rate = 16_000 in config.py, but when I change the value to 24_000 the output audio plays back much too fast.
So my question is: how can I raise the sampling rate while keeping the playback speed normal?
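Since the released model and vocoder were trained at 16 kHz, changing the config value only changes how the same samples are interpreted (hence the speed-up). If the goal is just a 24 kHz file, one workaround is to synthesize at 16 kHz and resample the waveform afterwards, which raises the output rate without changing duration. A naive pure-Python sketch, assuming wav is a sequence of float samples:

```python
def resample_linear(wav, sr_in=16_000, sr_out=24_000):
    # Naive linear-interpolation resampler: keeps duration correct when
    # raising the rate after 16 kHz synthesis. For production quality,
    # prefer a polyphase resampler (librosa.resample, torchaudio, sox).
    n_out = len(wav) * sr_out // sr_in
    out = []
    for i in range(n_out):
        pos = i * sr_in / sr_out            # fractional index into wav
        j = int(pos)
        frac = pos - j
        nxt = wav[min(j + 1, len(wav) - 1)]  # clamp at the last sample
        out.append(wav[j] + (nxt - wav[j]) * frac)
    return out
```

Getting genuinely higher-fidelity 24 kHz output, rather than an upsampled 16 kHz signal, would require retraining or swapping in a vocoder trained at 24 kHz.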
Currently https://huggingface.co seems very hard to access from mainland China. Please support domestic developers and testers by providing a convenient in-country download link for this project's models and other large files. Thanks in advance.
Line 28:
with open(file_path, encoding = "UTF-8") as f:
python inference_am_vocoder_joint.py \
  --logdir prompt_tts_open_source_joint \
  --config_folder config/joint \
  --checkpoint g_00140000 \
  --test_file $TEXT
With speaker 1028 everything generates fine, no problems; with speaker 3095 nothing can be generated.
The following line works:
1028|普通|<sos/eos> n i3 sp1 k e3 sp0 i3 sp1 b a3 sp1 zh e4 sp1 d ang4 sp0 z uo4 sp1 sh iii4 sp1 x ie2 sp0 p o4 sp3 b u4 sp0 g uo4 sp3 n i3 sp1 ie3 sp1 ing1 sp0 g ai1 sp1 q ing1 sp0 ch u3 sp3 x ian4 sp0 sh iii2 sp1 j iou4 sp0 sh iii4 sp1 zh e4 sp0 iang4 sp3 m ei2 sp0 iou3 sp1 sh en2 sp0 m e5 sp1 sh iii4 sp0 sh iii4 sp1 j ve2 sp0 d uei4 sp1 d e5 sp1 g ong1 sp0 p ing2 sp3 s uei1 sp0 r an2 sp1 b ing4 sp1 b u4 sp0 x iang3 sp1 b iao3 sp0 d a2 sp1 sh en2 sp0 m e5 sp3 k e3 sp1 n i3 sp1 ie3 sp1 q ing1 sp0 ch u3 sp1 n i3 sp1 v3 sp1 uo3 sp1 zh iii1 sp0 j ian1 sp1 d e5 sp1 ch a1 sp0 j v4 sp3 uo3 sp0 m en5 sp3 j i1 sp0 b en3 sp1 m ei2 sp0 sh en2 sp0 m e5 sp1 x i1 sp0 uang4 <sos/eos>|你可以把这当做是胁迫,不过,你也应该清楚,现实就是这样,没有什么事是绝对的公平,虽然并不想表达什么,可你也清楚你与我之间的差距,我们,基本没什么希望
The following line does not work:
3095|普通|<sos/eos> n i3 sp1 k e3 sp0 i3 sp1 b a3 sp1 zh e4 sp1 d ang4 sp0 z uo4 sp1 sh iii4 sp1 x ie2 sp0 p o4 sp3 b u4 sp0 g uo4 sp3 n i3 sp1 ie3 sp1 ing1 sp0 g ai1 sp1 q ing1 sp0 ch u3 sp3 x ian4 sp0 sh iii2 sp1 j iou4 sp0 sh iii4 sp1 zh e4 sp0 iang4 sp3 m ei2 sp0 iou3 sp1 sh en2 sp0 m e5 sp1 sh iii4 sp0 sh iii4 sp1 j ve2 sp0 d uei4 sp1 d e5 sp1 g ong1 sp0 p ing2 sp3 s uei1 sp0 r an2 sp1 b ing4 sp1 b u4 sp0 x iang3 sp1 b iao3 sp0 d a2 sp1 sh en2 sp0 m e5 sp3 k e3 sp1 n i3 sp1 ie3 sp1 q ing1 sp0 ch u3 sp1 n i3 sp1 v3 sp1 uo3 sp1 zh iii1 sp0 j ian1 sp1 d e5 sp1 ch a1 sp0 j v4 sp3 uo3 sp0 m en5 sp3 j i1 sp0 b en3 sp1 m ei2 sp0 sh en2 sp0 m e5 sp1 x i1 sp0 uang4 <sos/eos>|你可以把这当做是胁迫,不过,你也应该清楚,现实就是这样,没有什么事是绝对的公平,虽然并不想表达什么,可你也清楚你与我之间的差距,我们,基本没什么希望
The generated phoneme text does not include the speaker, emotion, and original content, so direct inference then splits the line and ends in an IndexError.
Either provide a script that generates audio directly from the txt, or have the two-step flow generate everything needed — the logic of the two stages should not be inconsistent.
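For anyone hitting the same IndexError: the working samples in this thread use four '|'-separated fields per inference line. A tiny helper (the function name is illustrative, not EmotiVoice's API) that builds a line in that shape:

```python
def make_inference_line(speaker, emotion, phonemes, text):
    # speaker|emotion|<sos/eos> phones <sos/eos>|original text — the
    # 4-field format used by the working 1028 sample in this thread.
    return f"{speaker}|{emotion}|<sos/eos> {phonemes} <sos/eos>|{text}"
```

If the frontend script emits only the phoneme field, wrapping its output with a helper like this restores the fields the inference-time split expects.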
This is a great project! Is there any plan to support streaming TTS?
Hello, and awesome work!
I would like support for other languages like Spanish.
For example:
哎,今天天气好
爱,是不是哟
when?
In the sample file data/inference/text, the first part of each line specifies the speaker. How can I view all of the available speakers?
Are these 12 speakers already the complete set?
Is there any technical documentation?
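One way to answer this yourself, assuming the speaker list is the file that config/joint/config.py reads via speaker2id_path (one speaker name per line, as the config snippet quoted elsewhere in this thread suggests):

```python
def load_speakers(path):
    # Mirrors the config.py reading logic: one speaker name per line,
    # UTF-8 encoded; blank lines are skipped.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

Printing load_speakers(<your speaker2id_path>) should enumerate every speaker ID the checkpoint accepts.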
Hi,
Can I put the checkpoint files (checkpoint_163431, g_00140000, do_00140000) on Hugging Face so they are more easily accessible than via Google Drive?
Thank you 😀
When starting with 'streamlit run demo_page.py', you may encounter the following error: "UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 2: illegal multibyte sequence".
To resolve this issue, it is recommended to change the encoding when opening the file. You can do this by modifying your code as follows:
the file path: EmotiVoice/config/joint/config.py
#### Speaker ####
with open(speaker2id_path, encoding='utf-8') as f:
speakers = [t.strip() for t in f.readlines()]
speaker_n_labels = len(speakers)
Suggested change:
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
How to obtain the pretrained model
Thanks for sharing this amazing work! Will Japanese be supported in the future?
Speaker - Maria_Kasper
Text - "Emoti Voice is a powerful and modern open-source text-to-speech engine. Emoti Voice speaks both English and Chinese, and with over two thousand different voices. The most prominent feature is emotional synthesis, allowing you to create speech with a wide range of emotions, including happy, excited, sad, angry and others"
Emotion Prompts Tried - Happy / Sad / Excited / Angry / Whisper / Shout
Generated Audios - https://drive.google.com/drive/folders/1JqWnVFSiu5DMyZhGt7XyGXhrlB6eCvPR?usp=sharing
Generated Using the Demo UI
Can someone please help if I am missing something here?
While trying it out, I noticed that speech generated with the numerically numbered speakers all has a similar background noise. Is there a way to fix this?
Could someone add me to the group chat? My join request isn't getting through.
As the title says.
In the phoneme sequence fed to inference, what are sp0 and sp1? Are they pause markers? How are they obtained at inference time?
When I run inference, I keep hitting this problem:
config.py", line 40, in Config
emotions = [t.strip() for t in f.readlines()]
UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 2: illegal multibyte sequence
What could be the cause, and how can I fix it? Thanks.
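This is the same gbk-vs-UTF-8 issue described in the demo_page.py report earlier in this thread: on a Chinese-locale Windows machine, open() without an encoding argument defaults to gbk, while the config files are UTF-8. Either pass encoding="utf-8" at the failing open() in config.py, or enable Python's UTF-8 mode (PEP 540, e.g. set PYTHONUTF8=1 before launching) so every open() defaults to UTF-8. A sketch of the explicit fix, with the reading logic adapted from the traceback:

```python
def read_emotions(path):
    # The emotion file is UTF-8; decoding it as gbk (the Chinese-locale
    # Windows default) fails on bytes such as 0xae. An explicit encoding
    # makes the read platform-independent.
    with open(path, encoding="utf-8") as f:
        return [t.strip() for t in f.readlines()]
```

The same encoding argument should be added to every open() in config.py that reads a project text file.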
EmotiVoice/frontend.py", line 26, in split_py
if py[-1] == 'r':
IndexError: string index out of range
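The traceback shows py[-1] indexing an empty string, so an empty pinyin token is reaching split_py. A defensive sketch of the guard (the function name and return shape are illustrative, not EmotiVoice's actual API; I'm assuming the 'r' check detects an erhua suffix):

```python
def split_erhua_safe(py):
    # Illustrative guard: bail out on empty tokens before indexing
    # py[-1], which is what raises IndexError in frontend.py.
    if not py:
        return py, False
    if len(py) > 1 and py[-1] == 'r':
        return py[:-1], True   # strip the trailing 'r' suffix
    return py, False
```

The equivalent one-line fix in frontend.py would be changing the condition to `if py and py[-1] == 'r':` — and it may also be worth finding out why the frontend produces an empty token in the first place.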
Test inputs:
抱歉刚刚的回答可能让你感到不满意了。作为一个大语言模型,我并不具备情感和自主意识,我的回答是基于大量的数据和算法生成的。如果我的回答有不准确或者不恰当的地方,还请您多多包涵和指教。
我是由百川智能的工程师们开发和维护的。他们是一群富有创造力和激情的人,致力于为我提供更好的服务和功能。
测试一下中英混合文本,hello,你好啊。Hello, this is the best test for now。我们很期待您的到来,希望你在这次盛会中得到你想要的结果。
Maria_Kasper|哭唧唧|<sos/eos> uo3 sp1 l ai2 sp0 d ao4 sp1 b ei3 sp0 j ing1 sp3 q ing1 sp0 h ua2 sp0 d a4 sp0 x ve2 <sos/eos>|我来到北京,清华大学
Maria_Kasper|非常开心|<sos/eos> uo3 sp1 l ai2 sp0 d ao4 sp1 b ei3 sp0 j ing1 sp3 q ing1 sp0 h ua2 sp0 d a4 sp0 x ve2 <sos/eos>|我来到北京,清华大学
Of the two inference texts above, one comes from the README sample and one from data/inference/text, yet I cannot hear any difference in the generated audio; the three speaking-rate settings also show no perceptible difference. The style_embedding values do differ, but the actual effect is nearly identical.
Will Apple (macOS) environments be supported later?
Can this project run without an NVIDIA GPU? I don't mean running it in Docker — I mean, for example, running it locally while making modifications.
I tried changing the sampling rate in the config file to 24k, but the output is clearly wrong. Does the open-source model support 24 kHz synthesis, and how should it be modified?
Can a Huggingface Space be made for this project ?
Looking at the speaker list, they all seem to be foreign speakers? Are there any ** speakers, or do I need to download some somewhere and import them into some location? Thanks.
Hello @netease-youdao,
Is there any way to support zero-shot voice cloning from a voice sample?
Thank you!