myshell-ai / openvoice
Instant voice cloning by MyShell.
Home Page: https://research.myshell.ai/open-voice
License: MIT License
As the title says, thanks! Any plan to open-source the training code?
Dear OpenVoice Contributors,
First and foremost, I would like to extend my sincerest commendations for the remarkable work you have accomplished with OpenVoice. The technology's ability to clone voice tones accurately and facilitate flexible voice style control is nothing short of revolutionary. Moreover, the zero-shot cross-lingual voice cloning feature is a testament to the innovative strides you are making in the field of speech synthesis.
Having perused your paper and explored the OpenVoice demos, I am thoroughly impressed by the system's capabilities. However, I would like to propose an enhancement that could potentially augment the versatility of OpenVoice, particularly in handling diverse linguistic contexts.
Issue: Expanding Linguistic Adaptability for Underrepresented Languages
While OpenVoice performs admirably with languages and accents present in the massive-speaker multi-lingual training dataset, there is an opportunity to extend its adaptability to underrepresented languages that are often not included in global datasets. These languages, which may have unique phonetic and prosodic characteristics, present a challenge for any voice cloning technology.
Proposed Enhancement:
Incorporating a Broader Range of Phonetic and Prosodic Features: By expanding the dataset to include a wider array of phonetic and prosodic features from underrepresented languages, OpenVoice could potentially improve its cloning accuracy for these languages.
Developing a Framework for Community-Driven Dataset Expansion: Establishing a platform where native speakers of underrepresented languages can contribute voice samples could enrich the training dataset and enhance the model's performance across a broader linguistic spectrum.
Integrating Adaptive Algorithms for Phonetic Variation: Implementing machine learning algorithms that can adapt to the phonetic variations of new languages could make OpenVoice more robust in handling the nuances of different linguistic contexts.
I believe these enhancements could not only refine the performance of OpenVoice but also contribute to the preservation and representation of linguistic diversity in the digital realm.
Thank you for considering my proposal. I eagerly await your thoughts on this matter and am keen to contribute further to this discussion.
Best regards,
yihong1120
The demo site is not accepting any new signups. Please fix this; it is disappointing.
Hello,
I've been reading your paper and am very interested in your project. I noticed that the paper mentions the use of an MSML (massive-speaker multi-lingual) dataset for training the model, but it doesn't specify the exact dataset used. I'm particularly interested in the emotion-style speech data that you've collected. This is a unique and valuable resource, and I'd love to learn more about it. Could you grant us access to your training dataset?
Thank you for your work on this project.
Error reported during installation:
ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'
After I install all the requirements and start openvoice_app.py, it throws an error saying "[Errno 2] No such file or directory: 'checkpoints/base_speakers/EN/config.json'".
Going into the web UI, ticking agree, and pressing submit, I get "[ERROR] Get target tone color error {str(e)}".
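(A side note: the literal "{str(e)}" in that message suggests the app prints the format placeholder itself rather than the exception, i.e. an f-string missing its f prefix. A hypothetical illustration:)

```python
# Hypothetical illustration of why the braces appear verbatim in the UI:
try:
    raise ValueError("boom")
except Exception as e:
    print("[ERROR] Get target tone color error {str(e)}")   # missing f: braces printed literally
    print(f"[ERROR] Get target tone color error {str(e)}")  # interpolates the actual error text
```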
Hi,
Thanks again for your nice work. Here are two more questions:
(1) There are two stages for training the convertor; how long should each stage be trained?
(2) How long (or how many iterations) does it take to get a model that performs like the one you provided?
I have trained for nearly 300k iterations, but the result is still quite bad.
Thanks again, and looking forward to your reply ^_^
From your README, you state:
This is an open-source implementation that approximates the performance of the internal voice clone technology of myshell.ai.
The "non-commercial" clause makes this project not open source, in the common usage of the term "open source".
"Open Source" has a generally accepted meaning of being able to use the digital artifacts for commercial purposes. The OSI and Wikipedia's entry on open-source licensing both articulate that commercial re-use is a (generally accepted) requirement of an "open source" license.
If the intent of the project is to create open-source voice cloning software, could you change the licensing terms, in the README.md, the headers of the source files, and the data licensing terms, to reflect this intent? For example, dropping the non-commercial clause would make this project open source.
If the intent is to keep the non-commercial clause of the license, indications of the project being "open source" should be removed as it isn't open source and can cause confusion for people wishing to use your code and data. A commonly accepted term is "source available" rather than "open source" to indicate that you've made the source available to view but not use commercially.
Installed locally on Manjaro Linux with NVIDIA drivers; no problems during installation.
I get this error in the info box every time:
[ERROR] Get target tone color error cuFFT error: CUFFT_INTERNAL_ERROR
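(For context: CUFFT_INTERNAL_ERROR is often reported when a PyTorch build compiled against CUDA 11.7 runs on newer RTX 40-series GPUs; a commonly suggested workaround is installing a PyTorch build compiled against CUDA 11.8 or later, though I cannot confirm that is the cause here.)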
Hi myshell team,
I'm VB, I lead the developer advocacy efforts for Audio at Hugging Face. Congratulations on releasing such a brilliant checkpoint.
It'd also be nice to upload the model weights to the Hub. This would increase the visibility of the model checkpoint and drive adoption.
The process to do so is quite simple:
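Roughly, with the huggingface_hub library (a sketch; the repo ID below is only illustrative):

```python
# Sketch of uploading the checkpoints to the Hugging Face Hub with
# huggingface_hub (pip install huggingface_hub). The repo ID is illustrative.
from huggingface_hub import HfApi

api = HfApi()  # assumes you are logged in, e.g. via `huggingface-cli login`
api.create_repo(repo_id="myshell-ai/OpenVoice", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="checkpoints",      # local directory with the model weights
    repo_id="myshell-ai/OpenVoice",
    repo_type="model",
)
```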
Of course, I am happy to help/ guide you if you have any questions.
Cheers,
VB
I've tried the demo on Hugging Face. The voice is similar, but the naturalness is still not very good. Is it because of the TTS model? If we replace it with a better TTS model, can we expect a better result?
Please research and, if possible, add the following: the ability to blend multiple reference voices into a single cloned voice.
A good test while working on this would be to combine a reference from an American speaker, and a reference from a British speaker, and see if OpenVoice can blend them to create a convincing "mid-Atlantic" hybrid.
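If the tone color embedding space is roughly linear, one (untested) way to prototype this would be to average the two extracted embeddings before conversion. A minimal sketch, assuming the se_extractor API from demo_part1.ipynb and an already-loaded tone_color_converter; file names are placeholders:

```python
# Untested sketch: blend two reference accents by averaging their tone color
# embeddings. Assumes the se_extractor API from demo_part1.ipynb; the audio
# file names and the loaded tone_color_converter are placeholders.
import se_extractor

se_us, _ = se_extractor.get_se('american_ref.mp3', tone_color_converter,
                               target_dir='processed', vad=True)
se_uk, _ = se_extractor.get_se('british_ref.mp3', tone_color_converter,
                               target_dir='processed', vad=True)
mid_atlantic_se = 0.5 * se_us + 0.5 * se_uk  # naive linear blend of embeddings
```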
This would be an absolutely killer feature and I've seen no other software able to do it. Many thanks for considering this suggestion!
Hello,
Your model and paper look great.
I am deeply impressed by your model's ability to mimic tone color. I tested it on English and the results were very good.
However, when I tested the ToneColorConverter on Vietnamese, the results were not as good as for English. It even mistakes a female tone for a male tone. An explanation could be that the ToneColorConverter was not trained on Vietnamese datasets, so it may not capture features specific to Vietnamese.
Could you please suggest some measures to solve this problem?
I also look forward to your plans for training the base speaker model and tone color converter model on custom datasets.
Thank you.
Thank you very much for your contributions. This is a very useful and clear open-source project.
I noticed that it performed surprisingly well on English tasks. However, there seem to be some potential areas for improvement in the Chinese task, such as the lack of a style model that can adjust emotions.
Do you plan to release the relevant training process in the future, such as data and code? I want to try to fine-tune the Chinese base model and train the style-model. Thanks a lot.
Will this support Linux at some point? If yes, when?
How can one make adjustments for other languages such as Japanese, e.g. emotions, accents, rhythm, pauses, and intonation?
Can you provide instructions for Windows users? Some of the dependencies require different Python versions.
WARNING: A conda environment already exists at 'c:\Users\vovap\miniconda3\envs\openvoice'
Remove existing environment (y/[n])? y
Channels:
- defaults
Platform: win-64
Collecting package metadata (repodata.json): done
Solving environment: done
## Package Plan ##
environment location: c:\Users\vovap\miniconda3\envs\openvoice
added / updated specs:
- python=3.9
The following NEW packages will be INSTALLED:
ca-certificates pkgs/main/win-64::ca-certificates-2023.12.12-haa95532_0
openssl pkgs/main/win-64::openssl-3.0.12-h2bbff1b_0
pip pkgs/main/win-64::pip-23.3.1-py39haa95532_0
python pkgs/main/win-64::python-3.9.18-h1aa4202_0
setuptools pkgs/main/win-64::setuptools-68.2.2-py39haa95532_0
sqlite pkgs/main/win-64::sqlite-3.41.2-h2bbff1b_0
tzdata pkgs/main/noarch::tzdata-2023d-h04d1e81_0
vc pkgs/main/win-64::vc-14.2-h21ff451_1
vs2015_runtime pkgs/main/win-64::vs2015_runtime-14.27.29016-h5e58377_2
wheel pkgs/main/win-64::wheel-0.41.2-py39haa95532_0
Proceed ([y]/n)? y
Downloading and Extracting Packages:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate openvoice
#
# To deactivate an active environment, use
#
# $ conda deactivate
E:\AI\OpenVoice\OpenVoice>conda activate openvoice
CondaError: Run 'conda init' before 'conda activate'
E:\AI\OpenVoice\OpenVoice>conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
Channels:
- pytorch
- nvidia
- defaults
Platform: win-64
Collecting package metadata (repodata.json): done
Solving environment: \ warning libmamba Added empty dependency for problem type SOLVER_RULE_UPDATE
failed
LibMambaUnsatisfiableError: Encountered problems while solving:
- package torchvision-0.14.1-py310_cpu requires python >=3.10,<3.11.0a0, but none of the providers can be installed
Could not solve for environment specs
The following packages are incompatible
├─ pin-1 is installable and it requires
│ └─ python 3.11.* , which can be installed;
└─ torchvision 0.14.1 is not installable because there are no viable options
├─ torchvision 0.14.1 would require
│ └─ python >=3.10,<3.11.0a0 , which conflicts with any installable versions previously reported;
├─ torchvision 0.14.1 would require
│ └─ python >=3.7,<3.8.0a0 , which conflicts with any installable versions previously reported;
├─ torchvision 0.14.1 would require
│ └─ python >=3.8,<3.9.0a0 , which conflicts with any installable versions previously reported;
└─ torchvision 0.14.1 would require
└─ python >=3.9,<3.10.0a0 , which conflicts with any installable versions previously reported.
Pins seem to be involved in the conflict. Currently pinned specs:
- python 3.11.* (labeled as 'pin-1')
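(Note: the solver output above shows the environment was pinned to Python 3.11, while torch 1.13.1 / torchvision 0.14.1 only provide builds for Python 3.7 through 3.10; recreating the environment with python=3.9, as in the first log, should let this solve succeed.)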
E:\AI\OpenVoice\OpenVoice>pip install -r requirements.txt
Collecting librosa==0.9.1 (from -r requirements.txt (line 1))
Downloading librosa-0.9.1-py3-none-any.whl (213 kB)
---------------------------------------- 213.1/213.1 kB 541.0 kB/s eta 0:00:00
Collecting faster-whisper==0.9.0 (from -r requirements.txt (line 2))
Downloading faster_whisper-0.9.0-py3-none-any.whl.metadata (11 kB)
Collecting pydub==0.25.1 (from -r requirements.txt (line 3))
Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Collecting wavmark==0.0.2 (from -r requirements.txt (line 4))
Downloading wavmark-0.0.2-py3-none-any.whl.metadata (5.0 kB)
Collecting numpy==1.22.0 (from -r requirements.txt (line 5))
Downloading numpy-1.22.0.zip (11.3 MB)
---------------------------------------- 11.3/11.3 MB 16.8 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting eng_to_ipa==0.0.2 (from -r requirements.txt (line 6))
Downloading eng_to_ipa-0.0.2.tar.gz (2.8 MB)
---------------------------------------- 2.8/2.8 MB 174.9 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting inflect==7.0.0 (from -r requirements.txt (line 7))
Downloading inflect-7.0.0-py3-none-any.whl.metadata (21 kB)
Collecting unidecode==1.3.7 (from -r requirements.txt (line 8))
Downloading Unidecode-1.3.7-py3-none-any.whl.metadata (13 kB)
Collecting whisper-timestamped==1.14.2 (from -r requirements.txt (line 9))
Downloading whisper_timestamped-1.14.2-py3-none-any.whl.metadata (1.2 kB)
Collecting openai (from -r requirements.txt (line 10))
Downloading openai-1.6.1-py3-none-any.whl.metadata (17 kB)
Collecting python-dotenv (from -r requirements.txt (line 11))
Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting pypinyin==0.50.0 (from -r requirements.txt (line 12))
Downloading pypinyin-0.50.0-py2.py3-none-any.whl.metadata (12 kB)
Collecting cn2an==0.5.22 (from -r requirements.txt (line 13))
Downloading cn2an-0.5.22-py3-none-any.whl.metadata (10 kB)
Collecting jieba==0.42.1 (from -r requirements.txt (line 14))
Downloading jieba-0.42.1.tar.gz (19.2 MB)
---------------------------------------- 19.2/19.2 MB 5.2 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting gradio==3.48.0 (from -r requirements.txt (line 15))
Downloading gradio-3.48.0-py3-none-any.whl.metadata (17 kB)
Collecting langid==1.1.6 (from -r requirements.txt (line 16))
Downloading langid-1.1.6.tar.gz (1.9 MB)
---------------------------------------- 1.9/1.9 MB 127.7 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting audioread>=2.1.5 (from librosa==0.9.1->-r requirements.txt (line 1))
Downloading audioread-3.0.1-py3-none-any.whl.metadata (8.4 kB)
Collecting scipy>=1.2.0 (from librosa==0.9.1->-r requirements.txt (line 1))
Downloading scipy-1.11.4-cp311-cp311-win_amd64.whl.metadata (60 kB)
---------------------------------------- 60.4/60.4 kB 3.3 MB/s eta 0:00:00
Collecting scikit-learn>=0.19.1 (from librosa==0.9.1->-r requirements.txt (line 1))
Downloading scikit_learn-1.3.2-cp311-cp311-win_amd64.whl.metadata (11 kB)
Collecting joblib>=0.14 (from librosa==0.9.1->-r requirements.txt (line 1))
Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting decorator>=4.0.10 (from librosa==0.9.1->-r requirements.txt (line 1))
Downloading decorator-5.1.1-py3-none-any.whl (9.1 kB)
Collecting resampy>=0.2.2 (from librosa==0.9.1->-r requirements.txt (line 1))
Downloading resampy-0.4.2-py3-none-any.whl (3.1 MB)
---------------------------------------- 3.1/3.1 MB 98.9 MB/s eta 0:00:00
Collecting numba>=0.45.1 (from librosa==0.9.1->-r requirements.txt (line 1))
Downloading numba-0.58.1-cp311-cp311-win_amd64.whl.metadata (2.8 kB)
Collecting soundfile>=0.10.2 (from librosa==0.9.1->-r requirements.txt (line 1))
Downloading soundfile-0.12.1-py2.py3-none-win_amd64.whl (1.0 MB)
---------------------------------------- 1.0/1.0 MB 62.4 MB/s eta 0:00:00
Collecting pooch>=1.0 (from librosa==0.9.1->-r requirements.txt (line 1))
Downloading pooch-1.8.0-py3-none-any.whl.metadata (9.9 kB)
Requirement already satisfied: packaging>=20.0 in c:\users\vovap\miniconda3\lib\site-packages (from librosa==0.9.1->-r requirements.txt (line 1)) (23.1)
Collecting av==10.* (from faster-whisper==0.9.0->-r requirements.txt (line 2))
Downloading av-10.0.0-cp311-cp311-win_amd64.whl (25.3 MB)
---------------------------------------- 25.3/25.3 MB 12.1 MB/s eta 0:00:00
Collecting ctranslate2<4,>=3.17 (from faster-whisper==0.9.0->-r requirements.txt (line 2))
Downloading ctranslate2-3.23.0-cp311-cp311-win_amd64.whl.metadata (10 kB)
Collecting huggingface-hub>=0.13 (from faster-whisper==0.9.0->-r requirements.txt (line 2))
Downloading huggingface_hub-0.20.2-py3-none-any.whl.metadata (12 kB)
Collecting tokenizers<0.15,>=0.13 (from faster-whisper==0.9.0->-r requirements.txt (line 2))
Downloading tokenizers-0.14.1-cp311-none-win_amd64.whl.metadata (6.8 kB)
Collecting onnxruntime<2,>=1.14 (from faster-whisper==0.9.0->-r requirements.txt (line 2))
Downloading onnxruntime-1.16.3-cp311-cp311-win_amd64.whl.metadata (4.5 kB)
INFO: pip is looking at multiple versions of wavmark to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following versions that require a different python version:
    0.52.0 Requires-Python >=3.6,<3.9
    0.52.0rc3 Requires-Python >=3.6,<3.9
    0.53.0 Requires-Python >=3.6,<3.10
    0.53.0rc1.post1 Requires-Python >=3.6,<3.10
    0.53.0rc2 Requires-Python >=3.6,<3.10
    0.53.0rc3 Requires-Python >=3.6,<3.10
    0.53.1 Requires-Python >=3.6,<3.10
    0.54.0 Requires-Python >=3.7,<3.10
    0.54.0rc2 Requires-Python >=3.7,<3.10
    0.54.0rc3 Requires-Python >=3.7,<3.10
    0.54.1 Requires-Python >=3.7,<3.10
    0.55.0 Requires-Python >=3.7,<3.11
    0.55.0rc1 Requires-Python >=3.7,<3.11
    0.55.1 Requires-Python >=3.7,<3.11
    0.55.2 Requires-Python >=3.7,<3.11
    1.21.2 Requires-Python >=3.7,<3.11
    1.21.3 Requires-Python >=3.7,<3.11
    1.21.4 Requires-Python >=3.7,<3.11
    1.21.5 Requires-Python >=3.7,<3.11
    1.21.6 Requires-Python >=3.7,<3.11
    1.6.2 Requires-Python >=3.7,<3.10
    1.6.3 Requires-Python >=3.7,<3.10
    1.7.0 Requires-Python >=3.7,<3.10
    1.7.1 Requires-Python >=3.7,<3.10
    1.7.2 Requires-Python >=3.7,<3.11
    1.7.3 Requires-Python >=3.7,<3.11
    1.8.0 Requires-Python >=3.8,<3.11
    1.8.0rc1 Requires-Python >=3.8,<3.11
    1.8.0rc2 Requires-Python >=3.8,<3.11
    1.8.0rc3 Requires-Python >=3.8,<3.11
    1.8.0rc4 Requires-Python >=3.8,<3.11
    1.8.1 Requires-Python >=3.8,<3.11
ERROR: Could not find a version that satisfies the requirement torch<2.0 (from wavmark) (from versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2)
ERROR: No matching distribution found for torch<2.0
I am getting this error in the info box every time, but there were no problems during installation:
[ERROR] Get target tone color error cuFFT error: CUFFT_INTERNAL_ERROR
Hello, the Chinese tones in the demo don't sound quite right: 早 should be third tone, but it isn't read as third tone. Are there plans to make Chinese tone support more complete?
Use case:
Meeting with foreigners
Hi, Thanks for this great repository. It is amazing work.
My problem is that when I initialize OpenVoice's BaseSpeakerTTS, it uses ~3 GiB of memory and ~1 GiB of video RAM. I think it consumes too many resources. Do you have any ideas for optimizing it?
Really awesome-looking paper and samples; I was anxious to try it, but it seems the repo is empty. Or is there a different repo we should be tracking?
Installed it, and I get this error when running generation. It looks like a silero version needs to be specified in the code.
[ERROR] Get target tone color error Problem when installing silero with version None. Check versions here: https://github.com/snakers4/silero-vad/wiki/Version-history-and-Available-Models
how to collect reddit
Are there any plans to support the Ukrainian and Russian languages? Great product; I would like to try it.
I have successfully executed your project in its entirety and extend my heartfelt congratulations on the achievements you've made.
The default TTS conversion does not support Chinese very well; there are some tonal issues, and some words sound like they have a Guangxi accent 😂. I have replaced the TTS with real human recordings, which perform much better. Even when converting from male to female voices, the results are quite impressive.
If there could be support for real-time voice conversion in the future, the potential for this project would significantly expand. I believe there would be considerable attention in the broader entertainment broadcasting market in China or in-game voice communication scenarios. Even if there were delays of a few seconds, it would still be acceptable.
In conclusion, I hope this project can genuinely be implemented commercially.
Hi,
Saw the paper go out and was wondering if you're going to release the code as well.
Thanks!
The latest version of requirements doesn't contain pypinyin, and it is not a dependency for other packages, so
pip install -r requirements.txt
does not install it. As a result, demo_part1.ipynb gives an error:
ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 4
2 import torch
3 import se_extractor
----> 4 from api import BaseSpeakerTTS, ToneColorConverter
File OpenVoice/api.py:9
7 import os
8 import librosa
----> 9 from text import text_to_sequence
10 from mel_processing import spectrogram_torch
11 from models import SynthesizerTrn
File OpenVoice/text/__init__.py:2
1 """ from https://github.com/keithito/tacotron """
----> 2 from text import cleaners
3 from text.symbols import symbols
6 # Mappings from symbol to numeric ID and vice versa:
File OpenVoice/text/cleaners.py:3
1 import re
2 from text.english import english_to_lazy_ipa, english_to_ipa2, english_to_lazy_ipa2
----> 3 from text.mandarin import number_to_chinese, chinese_to_bopomofo, latin_to_bopomofo, chinese_to_romaji, chinese_to_lazy_ipa, chinese_to_ipa, chinese_to_ipa2
5 def cjke_cleaners2(text):
6 text = re.sub(r'\[ZH\](.*?)\[ZH\]',
7 lambda x: chinese_to_ipa(x.group(1))+' ', text)
File OpenVoice/text/mandarin.py:4
2 import sys
3 import re
----> 4 from pypinyin import lazy_pinyin, BOPOMOFO
5 import jieba
6 import cn2an
ModuleNotFoundError: No module named 'pypinyin'
I have not checked any other missing requirements.
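A likely fix, assuming the version pinned in the earlier install log still applies, is to install the missing package manually: pip install pypinyin==0.50.0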
Thank you very much for open-sourcing such a great voice tone cloning project.
I set up and ran the project locally, and I noticed:
1. The Chinese results are not very good.
2. It seems impossible to specify the tone or pronunciation of individual characters.
3. In sentences mixing Chinese and English, English words are split into single characters and read out letter by letter (this is probably a TTS issue).
4. I see that the voice cloning pipeline is: extract the tone color features from a template audio and fuse them into another speech clip (obtained from text-to-speech).
So I used a different TTS to generate the audio and then fused in the tone color; this works much better and sounds fairly natural (when the tone colors are somewhat similar).
I saw that you are planning another project for Chinese scenarios. Do you have a timeline? Once it is released, I will try it again!
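For reference, the replace-the-TTS workflow described above can be sketched with the API used in demo_part1.ipynb (paths are illustrative; tone_color_converter is assumed to be loaded as in the demo):

```python
# Sketch of the workflow described above: run any third-party TTS, then fuse
# in the target tone color. Based on the API used in demo_part1.ipynb; paths
# are illustrative and tone_color_converter is assumed to be loaded already.
import se_extractor

src_path = 'external_tts_output.wav'           # audio produced by any TTS
reference = 'resources/example_reference.mp3'  # template voice to clone

src_se, _ = se_extractor.get_se(src_path, tone_color_converter,
                                target_dir='processed', vad=True)
tgt_se, _ = se_extractor.get_se(reference, tone_color_converter,
                                target_dir='processed', vad=True)

tone_color_converter.convert(
    audio_src_path=src_path,  # speech whose tone color will be replaced
    src_se=src_se,
    tgt_se=tgt_se,
    output_path='output_cloned.wav',
)
```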
Not sure what's happening here. I managed to spin this up in the local Gradio app and recorded my own voice, but inference gave me an American-sounding output. I'm British; is that expected?
Thanks!
OK, after running python -m openvoice_app --share, I got this:
Loaded checkpoint 'checkpoints/base_speakers/EN/checkpoint.pth'
missing/unexpected keys: [] []
Loaded checkpoint 'checkpoints/base_speakers/ZH/checkpoint.pth'
missing/unexpected keys: [] []
Loaded checkpoint 'checkpoints/converter/checkpoint.pth'
missing/unexpected keys: [] []
F:\OpenVoice\installer_files\env\lib\site-packages\gradio\components\dropdown.py:90: UserWarning: The `max_choices` parameter is ignored when `multiselect` is False.
  warnings.warn(
Traceback (most recent call last):
File "F:\OpenVoice\installer_files\env\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "F:\OpenVoice\installer_files\env\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "F:\OpenVoice\openvoice_app.py", line 267, in
ref_gr = gr.Audio(
File "F:\OpenVoice\installer_files\env\lib\site-packages\gradio\component_meta.py", line 157, in wrapper
return fn(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'info'
Has anyone compared the results to other state-of-the-art methods?
It seems text/cleaner's function chinese_to_ipa is not declared?
Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
For anyone who gets the error that says something about "color" and "tone" and not being able to find a file (I didn't save the exact error message): I looked it up, saw a barely related post about FFmpeg, and decided it was worth a shot. I installed it with Chocolatey into the OpenVoice folder, and it shockingly fixed the issue.
Hi @Zengyi-Qin
The paper looks great. Unfortunately, the pre-trained model only works with English, although the examples contain other languages as well, which is misleading.
I tried adding a new language by modifying the code (adding tags and a converter to phonemes) and even managed to synthesize audio, but unfortunately it only sounds a little like the prompt.
Are you planning to open access (add an example) to train a custom model, so that the community can add their own languages and train the model on their own datasets?
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
Channels:
LibMambaUnsatisfiableError: Encountered problems while solving:
Could not solve for environment specs
The following package could not be installed
└─ pytorch-cuda 11.7** is not installable because it requires
└─ cuda 11.7.* , which does not exist (perhaps a missing channel).
What is this file that appears after launching the Gradio demo?
\venv\Lib\site-packages\gradio\frpc_windows_amd64_v0.2
It is detected as PUA:Win32/FRProxy by Microsoft Defender Antivirus (https://www.microsoft.com/en-us/windows/windows-defender?ocid=cx-wdsi-ency).
Please explain!
Hi,
Thanks for the great idea behind your work! Here is a question.
When training the convertor, the paper shows that lots of audio was collected. So for a sample like (text_1, audio_1), what is the input audio of the convertor (encoder)? audio_1, or an audio_x generated by the base TTS from text_1?
* If audio_1, it seems that the input equals the output?
* If audio_x, wouldn't the convertor be too strongly tied to the base TTS (the generated voice)?
Looking forward to your reply. Thanks again.
The demo on Spaces is awesome! It would also be great to have the Gradio demo available locally. This could help the community easily clone and test the model on their local hardware.
demo_part1.ipynb
reference_speaker = 'resources/example_reference.mp3'
target_se, audio_name = se_extractor.get_se(reference_speaker, tone_color_converter, target_dir='processed', vad=True)
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
File ~/miniconda3/envs/openvoice/lib/python3.9/site-packages/whisper_timestamped/transcribe.py:1885, in get_vad_segments(audio, output_sample, min_speech_duration, min_silence_duration, dilatation, method)
1884 try:
-> 1885 _silero_vad_model, utils = torch.hub.load(repo_or_dir=repo_or_dir, model="silero_vad", onnx=onnx, source=source)
1886 except ImportError as err:
File ~/miniconda3/envs/openvoice/lib/python3.9/site-packages/torch/hub.py:539, in load(repo_or_dir, model, source, trust_repo, force_reload, verbose, skip_validation, *args, **kwargs)
538 if source == 'github':
--> 539 repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",
540 verbose=verbose, skip_validation=skip_validation)
542 model = _load_local(repo_or_dir, model, *args, **kwargs)
File ~/miniconda3/envs/openvoice/lib/python3.9/site-packages/torch/hub.py:203, in _get_cache_or_reload(github, force_reload, trust_repo, calling_fn, verbose, skip_validation)
202 if not skip_validation:
--> 203 _validate_not_a_forked_repo(repo_owner, repo_name, ref)
205 cached_file = os.path.join(hub_dir, normalized_br + '.zip')
File ~/miniconda3/envs/openvoice/lib/python3.9/site-packages/torch/hub.py:162, in _validate_not_a_forked_repo(repo_owner, repo_name, ref)
161 url = f'{url_prefix}?per_page=100&page={page}'
--> 162 response = json.loads(_read_url(Request(url, headers=headers)))
163 # Empty response means no more data to process
File ~/miniconda3/envs/openvoice/lib/python3.9/site-packages/torch/hub.py:145, in _read_url(url)
144 def _read_url(url):
...
-> 1889 raise RuntimeError(f"Problem when installing silero with version {version}. Check versions here: https://github.com/snakers4/silero-vad/wiki/Version-history-and-Available-Models") from err
1890 finally:
1891 if need_folder_hack:
RuntimeError: Problem when installing silero with version None. Check versions here: https://github.com/snakers4/silero-vad/wiki/Version-history-and-Available-Models
Hi, I have some questions as below: