152334h / tortoise-tts-fast Goto Github PK

View Code? Open in Web Editor NEW

This project forked from neonbjb/tortoise-tts

772.0 27.0 179.0 105.1 MB

Fast TorToiSe inference (5x or your money back!)

License: GNU Affero General Public License v3.0

Python 9.83% HTML 2.14% Jupyter Notebook 87.98% Shell 0.06%

tortoise-tts-fast's Introduction

this repo is now maintenance only; please develop a fork || use the mrq repo if you have large features to submit

recent updates

BigVGAN-base is now used in place of Univnet by default. (thank you to @deviandice for the example implementation)
--sampler dpm++2m is now fixed, and actually uses dpm++2m. see here for more discussion
--kv_cache is now fixed, and produces outputs identical to the original tortoise repo. It is also enabled by default now because of this.
new: ✨ streamlit webui by @Ryu
Want better voice cloning? We now have tortoise fine-tuning; load fine-tuned GPT models with --ar-checkpoint!
added voicefixer

click me to skip to installation && usage!

Speeding up TorToiSe inference 5x

This is a working project to drastically boost the performance of TorToiSe, without modifying the base models. Expect speedups of 5~10x, and hopefully 20x or larger when this project is complete.

This repo adds the following config options for TorToiSe for faster inference:

(--kv_cache) enabling of KV cache for MUCH faster GPT sampling
(--half) half precision inference where possible
(--sampler dpm++2m) DPM-Solver samplers for better diffusion
(disable with --low_vram) option to toggle cpu offloading, for high vram users

All changes in this fork are licensed under the AGPL. For avoidance beyond all doubt, the following statement is added as a comment to all changed code files:

AGPL: a notification must be added stating that changes have been made to that file.

Current results

All results listed were generated with a slightly undervolted RTX 3090 on Ubuntu 22.04, with the following base command:

./script/tortoise-tts.py --voice emma --seed 42 --text "$TEXT"

NOTE: samples here are somewhat old; they don't have `voicefixer` applied.

Original TorToiSe repo:

speed (B)	speed (A)	preset	sample
112.81s	14.94s	ultra_fast	here

New repo, with --preset ultra_fast:

speed (B)	speed (A)	GPT kv-cache	sampler	steps	cond-free diffusion	autocast to fp16	samples (vs orig repo)
118.61	11.20	❌	DDIM	30	❌	❌	identical
9.98	4.17	✅	DDIM	30	❌	❌	identical
14.32	5.58	✅	DPM++2M	30	✅	❌	best
7.51	3.26	✅	DDIM	10	✅	❌	~identical
7.12	3.30	✅	DDIM	10	✅	✅	okayish
7.21	3.27	✅	DDIM	10	❌	✅	bad

Results measure the time taken to run tts.tts_with_preset(...) using the CLI.

The example texts used were:

A (70 characters)

I'm looking for contributors who can do optimizations better than me.

B (188 characters)

Then took the other, as just as fair,

And having perhaps the better claim,

Because it was grassy and wanted wear;

Though as for that the passing there

Had worn them really about the same,

Half precision currently significantly worsens outputs, so I do not recommend enabling it unless you are happy with the samples linked. Using cond_free with half precision seems to produce decent outputs.

Installation

AMD INSTALLATION IS NOT SUPPORTED, please don't try it

There are two methods for installation.

pure python install

The installation process is identical to the original tortoise-tts repo.

git clone https://github.com/152334H/tortoise-tts-fast
cd tortoise-tts-fast
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117
python3 -m pip install -e .
pip3 install git+https://github.com/152334H/BigVGAN.git

Note that if you have the original tortoise installed,

You will need to uninstall it (pip uninstall tortoise)
You will need to install the new requirements (pip install -r requirements.txt)
You may want to install this repository as a symbolic link (pip install -e .), as this repository will be updated frequently

poetry install

First, install Poetry. Then, run:

poetry install
poetry shell

pytorch issues

If you are experiencing errors related to GPU usage (or lackthereof), please see the instructions on the pytorch website to install pytorch with proper GPU support.

CLI Usage

For maximum speed (and worst quality), you can try:

./script/tortoise-tts.py --half --no_cond_free --preset ultra_fast #...
# or, to only generate 1 sample:
./script/tortoise-tts.py --half --no_cond_free --preset single_sample --candidates 1 #...

But in most cases, these settings should perform decently && fast:

./script/tortoise-tts.py --preset ultra_fast # ...

For better quality, you might want the very_fast preset:

./script/tortoise-tts.py --preset very_fast # ...

You can obtain outputs 100% identical to the original tortoise repo with the following command:

./script/tortoise-tts.py --preset ultra_fast_old --original_tortoise #...

If you want to load a fine-tuned autoregressive model, use the --ar-checkpoint argument:

./script/tortoise-tts.py --preset very_fast --ar-checkpoint /path/to/checkpoint.pth #...

Webui

An experimental Streamlit web UI is now available. To access, run:

$ streamlit run script/app.py

Future plans

Optimization related:

add more k-diffusion samplers; optimize diffusion step count
add TensorRT model. 90% of inference time is spent in the GPT model; compiling it should produce great speedups, but it requires:
- a less hacky transformers model definition (see GPT2InferenceModel)
- an ORTModelForCausalLM implementation for tortoise
- tensorRT runtime
try half precision in the vocoder + diffuser

QoL related:

display samples on github pages, where you can do audio embeddings
refactor api & CLI args with saner defaults and names
improved webui integration

Motivation

As stated by an 11Labs developer:

Original README description:

TorToiSe

Tortoise is a text-to-speech program built with the following priorities:

Strong multi-voice capabilities.
Highly realistic prosody and intonation.

This repo contains all the code needed to run Tortoise TTS in inference mode.

A (very) rough draft of the Tortoise paper is now available in doc format. I would definitely appreciate any comments, suggestions or reviews: https://docs.google.com/document/d/13O_eyY65i6AkNrN_LdPhpUjGhyTNKYHvDrIvHnHe1GA

Version history

v2.4; 2022/5/17

Removed CVVP model. Found that it does not, in fact, make an appreciable difference in the output.
Add better debugging support; existing tools now spit out debug files which can be used to reproduce bad runs.

v2.3; 2022/5/12

New CLVP-large model for further improved decoding guidance.
Improvements to read.py and do_tts.py (new options)

v2.2; 2022/5/5

Added several new voices from the training set.
Automated redaction. Wrap the text you want to use to prompt the model but not be spoken in brackets.
Bug fixes

v2.1; 2022/5/2

Added ability to produce totally random voices.
Added ability to download voice conditioning latent via a script, and then use a user-provided conditioning latent.
Added ability to use your own pretrained models.
Refactored directory structures.
Performance improvements & bug fixes.

What's in a name?

I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue in cheek: this model is insanely slow. It leverages both an autoregressive decoder and a diffusion decoder; both known for their low sampling rates. On a K80, expect to generate a medium sized sentence every 2 minutes.

Demos

See this page for a large list of example outputs.

Cool application of Tortoise+GPT-3 (not by me): https://twitter.com/lexman_ai

Usage guide

Colab

Colab is the easiest way to try this out. I've put together a notebook you can use here: https://colab.research.google.com/github/152334H/tortoise-tts-fast/blob/main/tortoise_tts.ipynb

Local Installation

If you want to use this on your own computer, you must have an NVIDIA GPU.

First, install pytorch using these instructions: https://pytorch.org/get-started/locally/. On Windows, I highly recommend using the Conda installation path. I have been told that if you do not do this, you will spend a lot of time chasing dependency problems.

Next, install TorToiSe and it's dependencies:

git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python -m pip install -r ./requirements.txt
python setup.py install

If you are on windows, you will also need to install pysoundfile: conda install -c conda-forge pysoundfile

tortoise-tts.py

This script allows you to speak a single phrase with one or more voices.

./script/tortoise-tts.py --text "I'm going to speak this" --voice random --preset fast

For reading large amounts of text:

./script/tortoise-tts.py --voice random --preset fast < textfile.txt

This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and output that as well.

Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running read.py with the --regenerate argument.

API

Tortoise can be used programmatically, like so:

reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')

Voice customization guide

Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips.

These reference clips are recordings of a speaker that you provide to guide speech generation. These clips are used to determine many properties of the output, such as the pitch and tone of the voice, speaking speed, and even speaking defects like a lisp or stuttering. The reference clip is also used to determine non-voice related aspects of the audio output like volume, background noise, recording quality and reverb.

Random voice

I've included a feature which randomly generates a voice. These voices don't actually exist and will be random every time you run it. The results are quite fascinating and I recommend you play around with it!

You can use the random voice by passing in 'random' as the voice name. Tortoise will take care of the rest.

For the those in the ML space: this is created by projecting a random vector onto the voice conditioning latent space.

Provided voices

This repo comes with several pre-packaged voices. Voices prepended with "train_" came from the training set and perform far better than the others. If your goal is high quality speech, I recommend you pick one of them. If you want to see what Tortoise can do for zero-shot mimicking, take a look at the others.

Adding a new voice

To add new voices to Tortoise, you will need to do the following:

Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks or podcasts. Guidelines for good clips are in the next section.
Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.
Save the clips as a WAV file with floating point format and a 22,050 sample rate.
Create a subdirectory in voices/
Put your clips in that subdirectory.
Run tortoise utilities with --voice=<your_subdirectory_name>.

Picking good reference clips

As mentioned above, your reference clips have a profound impact on the output of Tortoise. Following are some tips for picking good clips:

Avoid clips with background music, noise or reverb. These clips were removed from the training dataset. Tortoise is unlikely to do well with them.
Avoid speeches. These generally have distortion caused by the amplification system.
Avoid clips from phone calls.
Avoid clips that have excessive stuttering, stammering or words like "uh" or "like" in them.
Try to find clips that are spoken in such a way as you wish your output to sound like. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
The text being spoken in the clips does not matter, but diverse text does seem to perform better.

Advanced Usage

Generation settings

Tortoise is primarily an autoregressive decoder model combined with a diffusion model. Both of these have a lot of knobs that can be turned that I've abstracted away for the sake of ease of use. I did this by generating thousands of clips using various permutations of the settings and using a metric for voice realism and intelligibility to measure their effects. I've set the defaults to the best overall settings I was able to find. For specific use-cases, it might be effective to play with these settings (and it's very likely that I missed something!)

These settings are not available in the normal scripts packaged with Tortoise. They are available, however, in the API. See api.tts for a full list.

Prompt engineering

Some people have discovered that it is possible to do prompt engineering with Tortoise! For example, you can evoke emotion by including things like "I am really sad," before your text. I've built an automated redaction system that you can use to take advantage of this. It works by attempting to redact any text in the prompt surrounded by brackets. For example, the prompt "[I am really sad,] Please feed me." will only speak the words "Please feed me" (with a sad tonality).

Playing with the voice latent

Tortoise ingests reference clips by feeding them through individually through a small submodel that produces a point latent, then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.

This lends itself to some neat tricks. For example, you can combine feed two different voices to tortoise and it will output what it thinks the "average" of those two voices sounds like.

Generating conditioning latents from voices

Use the script get_conditioning_latents.py to extract conditioning latents for a voice you have installed. This script will dump the latents to a .pth pickle file. The file will contain a single tuple, (autoregressive_latent, diffusion_latent).

Alternatively, use the api.TextToSpeech.get_conditioning_latents() to fetch the latents.

Using raw conditioning latents to generate speech

After you've played with them, you can use them to generate speech by creating a subdirectory in voices/ with a single ".pth" file containing the pickled conditioning latents as a tuple (autoregressive_latent, diffusion_latent).

Send me feedback!

Probabilistic models like Tortoise are best thought of as an "augmented search" - in this case, through the space of possible utterances of a specific string of text. The impact of community involvement in perusing these spaces (such as is being done with GPT-3 or CLIP) has really surprised me. If you find something neat that you can do with Tortoise that isn't documented here, please report it to me! I would be glad to publish it to this page.

Tortoise-detect

Out of concerns that this model might be misused, I've built a classifier that tells the likelihood that an audio clip came from Tortoise.

This classifier can be run on any computer, usage is as follows:

python tortoise/is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>

This model has 100% accuracy on the contents of the results/ and voices/ folders in this repo. Still, treat this classifier as a "strong signal". Classifiers can be fooled and it is likewise not impossible for this classifier to exhibit false positives.

Model architecture

Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate models that work together. I've assembled a write-up of the system architecture here: https://nonint.com/2022/04/25/tortoise-architectural-design-doc/

Training

These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of ~50k hours of speech data, most of which was transcribed by ocotillo. Training was done on my own DLAS trainer.

I currently do not have plans to release the training configurations or methodology. See the next section..

Ethical Considerations

Tortoise v2 works considerably better than I had planned. When I began hearing some of the outputs of the last few versions, I began wondering whether or not I had an ethically unsound project on my hands. The ways in which a voice-cloning text-to-speech system could be misused are many. It doesn't take much creativity to think up how.

After some thought, I have decided to go forward with releasing this. Following are the reasons for this choice:

It is primarily good at reading books and speaking poetry. Other forms of speech do not work well.
It was trained on a dataset which does not have the voices of public figures. While it will attempt to mimic these voices if they are provided as references, it does not do so in such a way that most humans would be fooled.
The above points could likely be resolved by scaling up the model and the dataset. For this reason, I am currently withholding details on how I trained the model, pending community feedback.
I am releasing a separate classifier model which will tell you whether a given audio clip was generated by Tortoise or not. See tortoise-detect above.
If I, a tinkerer with a BS in computer science with a ~$15k computer can build this, then any motivated corporation or state can as well. I would prefer that it be in the open and everyone know the kinds of things ML can do.

Diversity

The diversity expressed by ML models is strongly tied to the datasets they were trained on.

Tortoise was trained primarily on a dataset consisting of audiobooks. I made no effort to balance diversity in this dataset. For this reason, Tortoise will be particularly poor at generating the voices of minorities or of people who speak with strong accents.

Looking forward

Tortoise v2 is about as good as I think I can do in the TTS world with the resources I have access to. A phenomenon that happens when training very large models is that as parameter count increases, the communication bandwidth needed to support distributed training of the model increases multiplicatively. On enterprise-grade hardware, this is not an issue: GPUs are attached together with exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck.

I want to mention here that I think Tortoise could do be a lot better. The three major components of Tortoise are either vanilla Transformer Encoder stacks or Decoder stacks. Both of these types of models have a rich experimental history with scaling in the NLP realm. I see no reason to believe that the same is not true of TTS.

The largest model in Tortoise v2 is considerably smaller than GPT-2 large. It is 20x smaller that the original DALLE transformer. Imagine what a TTS model trained at or near GPT-3 or DALLE scale could achieve.

If you are an ethical organization with computational resources to spare interested in seeing what this model could do if properly scaled out, please reach out to me! I would love to collaborate on this.

Acknowledgements

This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to credit a few of the amazing folks in the community that have helped make this happen:

Hugging Face, who wrote the GPT model and the generate API used by Tortoise, and who hosts the model weights.
Ramesh et al who authored the DALLE paper, which is the inspiration behind Tortoise.
Nichol and Dhariwal who authored the (revision of) the code that drives the diffusion model.
Jang et al who developed and open-sourced univnet, the vocoder this repo uses.
Kim and Jung who implemented univnet pytorch model.
lucidrains who writes awesome open source pytorch models, many of which are used here.
Patrick von Platen whose guides on setting up wav2vec were invaluable to building my dataset.

Notice

Tortoise was built entirely by me using my own hardware. My employer was not involved in any facet of Tortoise's development.

If you use this repo or the ideas therein for your research, please cite it! A bibtex entree can be found in the right pane on GitHub.

tortoise-tts-fast's People

Contributors

Stargazers

Watchers

Forkers

ryu1845 hesz94 jdv90 station384 mrc2023 rikabi89 slexiz therocketsloth evolf35 ernest-gray andrewkuo invisiblestrangler greydel stephanabs sraisolutions broken-shotgun markomanninen qlee01 bookbot-hive calv-io-n jonathanfly cian0 talipturkmen davvo keyboardcartel sageawesomegoanipicscope9865 ziyaad30 crizconzeta rogeriochaves absane clcarwin noahlessard saucekid opcecco laburn blockcat edgeoutreach m4cs ryokoasakura opencoca rpfilomeno louispaulet webi-ai tiptopfunk a-one-fan tortoise17 thisserand xylvier escottgoodwin smithart centaurioun androiddrew joannot msgpo ybombale sbusso kevinorozcomc bizrockman lonestriker acephalia pop2pop3 arnik-cube josephhampton blueredblueyellow darkwebarebears69 chugarah luzoji optigear g8nz laperiut buphnezz ericjmcdermott bux10 sfarpak furkangozukara varunlohade raminia joey00072 oceans0423 ken2190 distinctvision zodyking dmx974 evandcoleman 0xymoro loversama contesini timber8205 aixingxy agenthesh samuelwinterborn jasonfina27 afiyetolsun praveendhaked2 h4rk8s ochafik wolfiejp manmay-nakhashi drcrallen karusb

tortoise-tts-fast's Issues

ImportError and llvmlite

Why does this happen when I try to generate?

Traceback (most recent call last):
File "E:\tortoise-tts-fast\tortoise\do_tts.py", line 9, in
from api import TextToSpeech
File "E:\tortoise-tts-fast\tortoise\api.py", line 14, in
from tortoise.models.arch_util import TorchMelSpectrogram
File "E:\Anaconda3\lib\site-packages\tortoise\models.py", line 25, in
from tortoise import connections
ImportError: cannot import name 'connections' from 'tortoise' (unknown location)

And why does this happen when I install requirements.txt and pip install -e .

ERROR: Cannot uninstall 'llvmlite'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

sauceVR

Could u fix sauceVR?

tortoise_tts.py: error: one of the arguments -l, --list-voices -P, --play -o, --output -O, --output-dir is required

I'm unable to run via the tortoise_tts.py script. However, I could run do_tts.py as well as Streamlit. Not sure what I am doing wrong. I would appreciate any guidance you can give me.

Apostrophes

It seems to convert apostrophes into a combination of characters: "youâ€™ll" instead of "you'll".

Then it reads those characters (as best it can). Is there a way to avoid this, other than editing every contraction in the file? Or do you just have to remove every contraction?

Variables in do_tts.py

I was using the below arguments in the normal TTS can you add these to the do_tts? I am getting syntax error when I try do it my self, am a bit of rookie so am not sure why.

parser = argparse.ArgumentParser() parser.add_argument('--text', type=str, help='Text to speak.', default="The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them.") parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) ' 'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='random') parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='fast') parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/') parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this' 'should only be specified if you have custom checkpoints.', default=MODELS_DIR) parser.add_argument('--candidates', type=int, help='How many output candidates to produce per-voice.', default=3) parser.add_argument('--seed', type=int, help='Random seed which can be used to reproduce results.', default=None) parser.add_argument('--produce_debug_state', type=bool, help='Whether or not to produce debug_state.pth, which can aid in reproducing problems. Defaults to true.', default=True) parser.add_argument('--cvvp_amount', type=float, help='How much the CVVP model should influence the output.' 'Increasing this can in some cases reduce the likelyhood of multiple speakers. Defaults to 0 (disabled)', default=.0) parser.add_argument('--top-p', type=float, default=None, help='P value used in nucleus sampling. 0 to 1. Lower values mean the decoder produces more "likely" (aka boring) outputs.') parser.add_argument('--temperature', type=float, default=None, help='The softmax temperature of the autoregressive model.') parser.add_argument('--cond-free', type=bool, default=None, help='Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for ' 'each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output ' 'of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and ' 'dramatically improves realism.') parser.add_argument('--diffusion-iterations', type=int, default=None,help='Number of diffusion steps to perform. More steps means the network has more chances to iteratively' 'refine the output, which should theoretically mean a higher quality output. ' 'Generally a value above 250 is not noticeably better, however.') parser.add_argument('--diffusion-temperature', type=float, default=None, help='Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0 ' 'are the "mean" prediction of the diffusion network and will sound bland and smeared. ') parser.add_argument('--num-autoregressive-samples', type=int, default=None, help='Number of samples taken from the autoregressive model, all of which are filtered using CLVP.' 'As TorToiSe is a probabilistic model, more samples means a higher probability of creating something "great".')

streamlit memleak

There's a memory leak somewhere, my computer with 32GB of RAM has ended up dying from swap thrashing twice now.

diffusion model usage is broken

Either that or the DDPM model export is wrong, haven't figured out which

Training another language

Hello,
I have a large collection of voices for the Polish language - will it be possible to train / finetune my own model in the future?
The original Tortoise repository does not contain the code for training

autocast

fp16 in the autoregressive model seems to degrade performance significantly; are there specific layers that can be forced to fp32 for better results?
Less important, but quantisation could potentially be applied anywhere else in the model as well

A better way to deal with package management?

Currently, it looks like the chosen mechanism is to lock the version to a requirements-version.txt file.
I think adding support for a proper dependency management tool like poetry or pyflow could be nice. Or maybe at least use something like https://github.com/jazzband/pip-tools/ to correctly lock them using hashes.
P.S. I'd be happy to add one myself 😄

Crackling sound when using dpm++2m

Compare :
ddim : https://vocaroo.com/1ccLP3IZFW5G
dpm++2m : https://vocaroo.com/19E6tT0itbIQ
Both on : ultra_fast_old.

This did not happen prior to update in which we have "latent averaging mode" to select on the gui. Or at least I noticed this started to happen since then.

I have tested different voices and it always the same crackling sound. I haven't changed any of my voices either. Again not a big issue but I wonder if anyone else has noticed this?

disable CLVP on ultra_fast

Seems pointless if only one AR sample is being generated?

GPT optimisation with TensorRT/FasterTransformer/Triton/???

This issue is likely to take a substantial amount of effort to solve.

Primary problem: GPT2InferenceModel is uniquely subclassed from 🤗's transformers.GPT2Model. The architecture is significantly different from a basic GPT2, and a substantial amount of code needs to be written to make an optimized version of the model work with the usual huggingface generation function.

Unify inference scripts

There's currently:

the API (tortoise/api.py)
the tortoise/do_tts.py script
the tortoise/read.py script
the scripts/tortoise_tts.py script
the app.py webui
the untouched colab .ipynb

This is too many scripts. Worse still, many of these scripts contain duplicated code to, e.g. split voices or write files.

I want:

a meta-inference script that handles all the shared functionality for running tortoise (currently: scripts/inference.py)
a single CLI script (most likely cut from tortoise_tts.py)
a single webui, that the colab notebook also uses

Everything else should either be scrapped or converted to thin wrappers around the main scripts.

Nothing happens

./scripts/tortoise_tts.py --preset ultra_fast --text "testing this out"       
reading text from stdin!

and then just hangs, sits there doing nothing

Docoupling voice model generation from text generation.

The issue.

If I understand it right tortoise does this:

takes generic model
finetunes it on .wav files
generate voice from text based on that finetuned model

Which means each time to produce one sentence it does each time finetuning.

The solution

Decouple voice finetuning with .wav files from generation of voice based on text.
Make script to finetune model with .wavs and save it for future use without generation part.
Provide a console script to generate voice from text based on finetuned model previously without finetuning it again.

RuntimeError

I stopped the voicefixer download accidentally and now this happens. I tried what you said last time but it didn't help (created a new environment and deleted the voicefixer model files.) I'm using Anaconda3

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ E:\tortoise-tts-fast\tortoise\do_tts.py:11 in │
│ │
│ 8 import torch │
│ 9 from api import TextToSpeech │
│ 10 from base_argparser import ap, nullable_kwargs │
│ ❱ 11 from inference import save_gen_with_voicefix │
│ 12 from utils.audio import load_voices │
│ 13 │
│ 14 │
│ │
│ E:\tortoise-tts-fast\tortoise\inference.py:167 in │
│ │
│ 164 │
│ 165 from voicefixer import VoiceFixer │
│ 166 │
│ ❱ 167 vfixer = VoiceFixer() │
│ 168 │
│ 169 │
│ 170 def save_gen_with_voicefix(g, fpath, squeeze=True, voicefixer=True): │
│ │
│ E:\Anaconda3\envs\Fuck\lib\site-packages\voicefixer\base.py:24 in init │
│ │
│ 21 │ │ │ │ │ │ │ │ By default the checkpoint should be download automatical │
│ 22 │ │ │ │ │ │ │ │ But don't worry! Alternatively you can download it direc │
│ 23 │ │ self._model.load_state_dict( │
│ ❱ 24 │ │ │ torch.load( │
│ 25 │ │ │ │ self.analysis_module_ckpt │
│ 26 │ │ │ ) │
│ 27 │ │ ) │
│ │
│ E:\Anaconda3\envs\Fuck\lib\site-packages\torch\serialization.py:777 in load │
│ │
│ 774 │ │ │ # If we want to actually tail call to torch.jit.load, we need to │
│ 775 │ │ │ # reset back to the original position. │
│ 776 │ │ │ orig_position = opened_file.tell() │
│ ❱ 777 │ │ │ with _open_zipfile_reader(opened_file) as opened_zipfile: │
│ 778 │ │ │ │ if _is_torchscript_zip(opened_zipfile): │
│ 779 │ │ │ │ │ warnings.warn("'torch.load' received a zip file that looks like a To │
│ 780 │ │ │ │ │ │ │ │ " dispatching to 'torch.jit.load' (call 'torch.jit.loa │
│ │
│ E:\Anaconda3\envs\Fuck\lib\site-packages\torch\serialization.py:282 in init │
│ │
│ 279 │
│ 280 class _open_zipfile_reader(_opener): │
│ 281 │ def init(self, name_or_buffer) -> None: │
│ ❱ 282 │ │ super(_open_zipfile_reader, self).init(torch._C.PyTorchFileReader(name_or_bu │
│ 283 │
│ 284 │
│ 285 class _open_zipfile_writer_file(_opener): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

prunedgpt2 - a question

I've looked through the PrunedGPT2 class in the autoregressive.py side by side, and I did not see too much difference than in embedding creation. I wonder, do you have a pruned model somewhere available? If not what's the strategy for pruning?

amd user, my last pip install -r requirements broke my system

just a warning

I pin it on the nvidia named packages, although im not sure what and how exactly broke
I git-pull-ed after a week of not having done so

pytorch no longer detected my gpu card, not even for other apps like stable diffusion
trying to fix it I made it worse somehow and I broke everything

can't get conditioning latents?

This function works for me on the official tortoise-tts repo, but on this fast repo it fails with
"IndexError: too many indices for tensor of dimension 1" on line 138 of tortoise/api.py

This is the function I'm referring to.

auto_conditioning, diffusion_conditioning, _, _ = tts.get_conditioning_latents(
                voice_samples=reference_clips,
                return_mels=True,
            )

It also fails if I set original_tortoise=True.
Also, if I use the tts function with reference clips, it fails with the same error.

However, if I replace the entire get_conditioning_latents function with the official tortoise-tts repo's function it works.

Convert models to safetensors

As shown here, switch to safetensors would speed up model loading, which is a small but non-negligible gain.

changing the vocoder

Have you tried changing the vocoder from Waveglow to HiFi-GAN? HiFi-GAN is faster and requires less VRAM. Alternatively, you could try adding a different vocoder.

RuntimeError: CUDA driver error: invalid argument - when setting voice arg

./scripts/tortoise_tts.py --voice emma --preset ultra_fast "testing"

Suggestion: Colab Notebook

Could you set something up for the webui to work in colab?

where are the gpt and diffusion checkpoints?

Didn't see anything in the read me about them. Thank you.

Local install instructions are wrong

ModuleNotFoundError: No module named 'k_diffusion' error when running command for output

Have installed k_diffusion https://github.com/crowsonkb/k-diffusion/ and tested several times with its contents in a different location/in path for reference from command, yet result is the same

Is there really speed up?

Hello,
thank you for trying to optimize the tortoise library, I am trying to compare the speed between the two implementations, but so far I am getting very similar results in both quality and speed. I use NVIDIA 3060 with 12GB VRAM.

Running the script bellow takes about 2m and 14s.

python scripts/tortoise_tts.py -p ultra_fast -O results/best_short_15/ultra_fast -v best_short_15 <text_short.txt --sampler dpm++2m --diffusion_iterations 30 --vocoder Univnet

Using the same setting, but using original repo tag in CLI takes about 2m and 2s.

python scripts/tortoise_tts.py -p ultra_fast -O results/best_short_15/ultra_fast_original -v best_short_15 <text_short.txt --original_tortoise

Am I missing something? Probably some tags I should add to speed up the generation?

requirements error - 'gdown'

Collecting gdown
ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==. These do not:
gdown from https://files.pythonhosted.org/packages/bc/c2/bf15a3e9a5551bc13d9fce0377376e6801fa4bd9fca964dc1ee093da2559/gdown-4.6.4-py3-none-any.whl (from -r ./requirements.txt (line 1430))

Docs are out of date and examples are mostly broken

./script is now ./scripts

./script/tortoise-tts.py is now ./scripts/tortoise_tts.py (underscore)

CUDA out of memory

I know I've been a little bit obnoxious with the issues but here's another one (which will hopefully be my last)

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ E:\tortoise-tts-fast\tortoise\do_tts.py:42 in │
│ │
│ 39 │ kwargs = nullable_kwargs(args) │
│ 40 │ os.makedirs(args.output_path, exist_ok=True) │
│ 41 │ │
│ ❱ 42 │ tts = TextToSpeech( │
│ 43 │ │ models_dir=args.model_dir, │
│ 44 │ │ high_vram=args.high_vram, │
│ 45 │ │ kv_cache=args.kv_cache, │
│ │
│ E:\tortoise-tts-fast\tortoise\api.py:388 in init │
│ │
│ 385 │ │ if high_vram: │
│ 386 │ │ │ self.autoregressive = self.autoregressive.to(self.device) │
│ 387 │ │ │ self.diffusion = self.diffusion.to(self.device) │
│ ❱ 388 │ │ │ self.clvp = self.clvp.to(self.device) │
│ 389 │ │ │ self.vocoder = self.vocoder.to(self.device) │
│ 390 │ │ self.high_vram = high_vram │
│ 391 │
│ │
│ E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:989 in to │
│ │
│ 986 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_format) │
│ 987 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.is_complex() else No │
│ 988 │ │ │
│ ❱ 989 │ │ return self._apply(convert) │
│ 990 │ │
│ 991 │ def register_backward_hook( │
│ 992 │ │ self, hook: Callable[['Module', _grad_t, _grad_t], Union[None, Tensor]] │
│ │
│ E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:641 in _apply │
│ │
│ 638 │ │
│ 639 │ def _apply(self, fn): │
│ 640 │ │ for module in self.children(): │
│ ❱ 641 │ │ │ module._apply(fn) │
│ 642 │ │ │
│ 643 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 644 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │
│ │
│ E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:641 in _apply │
│ │
│ 638 │ │
│ 639 │ def _apply(self, fn): │
│ 640 │ │ for module in self.children(): │
│ ❱ 641 │ │ │ module._apply(fn) │
│ 642 │ │ │
│ 643 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 644 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │
│ │
│ E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:641 in _apply │
│ │
│ 638 │ │
│ 639 │ def _apply(self, fn): │
│ 640 │ │ for module in self.children(): │
│ ❱ 641 │ │ │ module._apply(fn) │
│ 642 │ │ │
│ 643 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 644 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │
│ │
│ E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:641 in _apply │
│ │
│ 638 │ │
│ 639 │ def _apply(self, fn): │
│ 640 │ │ for module in self.children(): │
│ ❱ 641 │ │ │ module._apply(fn) │
│ 642 │ │ │
│ 643 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 644 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │
│ │
│ E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:641 in _apply │
│ │
│ 638 │ │
│ 639 │ def _apply(self, fn): │
│ 640 │ │ for module in self.children(): │
│ ❱ 641 │ │ │ module._apply(fn) │
│ 642 │ │ │
│ 643 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 644 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │
│ │
│ E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:641 in _apply │
│ │
│ 638 │ │
│ 639 │ def _apply(self, fn): │
│ 640 │ │ for module in self.children(): │
│ ❱ 641 │ │ │ module._apply(fn) │
│ 642 │ │ │
│ 643 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 644 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │
│ │
│ E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:641 in _apply │
│ │
│ 638 │ │
│ 639 │ def _apply(self, fn): │
│ 640 │ │ for module in self.children(): │
│ ❱ 641 │ │ │ module._apply(fn) │
│ 642 │ │ │
│ 643 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 644 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │
│ │
│ E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:641 in _apply │
│ │
│ 638 │ │
│ 639 │ def _apply(self, fn): │
│ 640 │ │ for module in self.children(): │
│ ❱ 641 │ │ │ module._apply(fn) │
│ 642 │ │ │
│ 643 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 644 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │
│ │
│ E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:641 in _apply │
│ │
│ 638 │ │
│ 639 │ def _apply(self, fn): │
│ 640 │ │ for module in self.children(): │
│ ❱ 641 │ │ │ module._apply(fn) │
│ 642 │ │ │
│ 643 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 644 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │
│ │
│ E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:641 in _apply │
│ │
│ 638 │ │
│ 639 │ def _apply(self, fn): │
│ 640 │ │ for module in self.children(): │
│ ❱ 641 │ │ │ module._apply(fn) │
│ 642 │ │ │
│ 643 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 644 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor_applied): │
│ │
│ E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:664 in _apply │
│ │
│ 661 │ │ │ # track autograd history of param_applied, so we have to use │
│ 662 │ │ │ # with torch.no_grad(): │
│ 663 │ │ │ with torch.no_grad(): │
│ ❱ 664 │ │ │ │ param_applied = fn(param) │
│ 665 │ │ │ should_use_set_data = compute_should_use_set_data(param, param_applied) │
│ 666 │ │ │ if should_use_set_data: │
│ 667 │ │ │ │ param.data = param_applied │
│ │
│ E:\Anaconda3\lib\site-packages\torch\nn\modules\module.py:987 in convert │
│ │
│ 984 │ │ │ if convert_to_format is not None and t.dim() in (4, 5): │
│ 985 │ │ │ │ return t.to(device, dtype if t.is_floating_point() or t.is_complex() els │
│ 986 │ │ │ │ │ │ │ non_blocking, memory_format=convert_to_format) │
│ ❱ 987 │ │ │ return t.to(device, dtype if t.is_floating_point() or t.is_complex() else No │
│ 988 │ │ │
│ 989 │ │ return self._apply(convert) │
│ 990 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 3.45 GiB already
allocated; 0 bytes free; 3.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting
max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

read.py in Streamlit

I know this is very new but would be great to have support for large form text input from a file in the UI

'streamlit' is not recognized as an internal or external command

after fresh install, that's the error

Download models once

First, thank you for working on this. I believe its awesome you are trying to keep this open and available locally for users who don´t want to depend on cloud, third party services.

My request/questions: can you follow a similar path to Automatic1111 SD´s where models are downloaded and added to a folder once, manually, vs the models being downloaded every time TTS runs? I also like the webui idea, I just would prefer if running tts-fast could be done without internet connection.

Thanks!

What happened to read.py in this fork?

It seems read.py was working for a while, there were several issues referencing it and fixes to it. Then it seems to have been summarily deleted with a "remove read" comment on Feb 26. Is that long-text-file-with-custom-separator functionality here already somewhere? Was there something really broken about it? I would like to use it along with the added speed and other functionality here -- I'm rendering whole Wikipedia articles, see here point 2 for an RSS feed of them -- and want to avoid re-inventing any wheels.

Wait 1 week to see if anyone complains about voicefixer

I'm pretty sure no one is going to do that, but just in case.

Suggestion: AMD GPU?

I just want to know if it will ever come.

half precision on diffuser - results

Diffuser can supposedly be switched to half precision when it's being built (line 239 in api.py) tried this on a 3090ti and got marginal (5% ish iirc) speed improvement - presumably this switch isn't properly implemented and it switches back to full precision. Will have to do a proper trace of the diffuser run to see exactly what's going on.

read.py fails at torchaudio tensor dimension check [Expected 2D Tensor, got 3D.]

a simple "python tortoise/read.py --voice random --preset ultra_fast" will fail as following.

Running on an RTX 3080

google colab

it was working fine but after this update not anymore

/content/tortoise-tts-fast/tortoise/models/vocoder.py:10 in │
│ │
│ 7 from typing import Optional, Callable │
│ 8 from dataclasses import dataclass │
│ 9 try: │
│ ❱ 10 │ from BigVGAN.models import BigVGAN as BVGModel │
│ 11 │ from BigVGAN.env import AttrDict │
│ 12 except ImportError: │
│ 13 │ raise ImportError( │
│ │
│ /content/BigVGAN/models.py:14 in │
│ │
│ 11 from torch.nn import Conv1d, ConvTranspose1d, Conv2d │
│ 12 from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm │
│ 13 │
│ ❱ 14 import activations │
│ 15 from utils import init_weights, get_padding │
│ 16 from alias_free_torch import * │
│ 17 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'activations'

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /usr/local/lib/python3.9/dist-packages/IPython/core/interactiveshell.py:3326 in run_code │
│ │
│ 3323 │ │ │ │ elif async_ : │
│ 3324 │ │ │ │ │ await eval(code_obj, self.user_global_ns, self.user_ns) │
│ 3325 │ │ │ │ else: │
│ ❱ 3326 │ │ │ │ │ exec(code_obj, self.user_global_ns, self.user_ns) │
│ 3327 │ │ │ finally: │
│ 3328 │ │ │ │ # Reset our crash handler in place │
│ 3329 │ │ │ │ sys.excepthook = old_excepthook │
│ in │
│ │
│ /content/tortoise-tts-fast/tortoise/api.py:19 in │
│ │
│ 16 from tortoise.models.cvvp import CVVP │
│ 17 from tortoise.models.diffusion_decoder import DiffusionTts │
│ 18 from tortoise.models.random_latent_generator import RandomLatentConverter │
│ ❱ 19 from tortoise.models.vocoder import VocConf │
│ 20 from tortoise.utils.audio import denormalize_tacotron_mel, wav_to_univnet_mel │
│ 21 from tortoise.utils.diffusion import ( │
│ 22 │ SpacedDiffusion, │
│ │
│ /content/tortoise-tts-fast/tortoise/models/vocoder.py:13 in │
│ │
│ 10 │ from BigVGAN.models import BigVGAN as BVGModel │
│ 11 │ from BigVGAN.env import AttrDict │
│ 12 except ImportError: │
│ ❱ 13 │ raise ImportError( │
│ 14 │ │ "BigVGAN not installed, can't use BigVGAN vocoder\n" │
│ 15 │ │ "Please see the installation instructions on README." │
│ 16 │ ) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ImportError: BigVGAN not installed, can't use BigVGAN vocoder
Please see the installation instructions on README.

BigVGAN error - clean install

I get the error below. Totally clean install 3.10.2 of python on a VM. I followed the install instructions. It seems it cannot reference the BigGAN lib even though its in the correct folder,¨
`─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/tortoise-tts-fast/tortoise/models/vocoder.py:10 in │
│ │
│ 7 from typing import Optional, Callable │
│ 8 from dataclasses import dataclass │
│ 9 try: │
│ ❱ 10 │ from BigVGAN.models import BigVGAN as BVGModel │
│ 11 │ from BigVGAN.env import AttrDict │
│ 12 except ImportError: │
│ 13 │ raise ImportError( │
│ │
│ /root/tortoise-tts-fast/BigVGAN/models.py:14 in │
│ │
│ 11 from torch.nn import Conv1d, ConvTranspose1d, Conv2d │
│ 12 from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm │
│ 13 │
│ ❱ 14 import activations │
│ 15 from utils import init_weights, get_padding │
│ 16 from alias_free_torch import * │
│ 17 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'activations'

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/tortoise-tts-fast/scripts/tortoise_tts.py:15 in │
│ │
│ 12 import torchaudio │
│ 13 from simple_parsing import ArgumentParser, field │
│ 14 │
│ ❱ 15 from tortoise.api import MODELS_DIR, TextToSpeech │
│ 16 from tortoise.utils.audio import load_audio │
│ 17 from tortoise.utils.diffusion import SAMPLERS │
│ 18 from tortoise.models.vocoder import VocConf │
│ │
│ /root/tortoise-tts-fast/tortoise/api.py:19 in │
│ │
│ 16 from tortoise.models.cvvp import CVVP │
│ 17 from tortoise.models.diffusion_decoder import DiffusionTts │
│ 18 from tortoise.models.random_latent_generator import RandomLatentConverter │
│ ❱ 19 from tortoise.models.vocoder import VocConf │
│ 20 from tortoise.utils.audio import denormalize_tacotron_mel, wav_to_univnet_mel │
│ 21 from tortoise.utils.diffusion import ( │
│ 22 │ SpacedDiffusion, │
│ │
│ /root/tortoise-tts-fast/tortoise/models/vocoder.py:13 in │
│ │
│ 10 │ from BigVGAN.models import BigVGAN as BVGModel │
│ 11 │ from BigVGAN.env import AttrDict │
│ 12 except ImportError: │
│ ❱ 13 │ raise ImportError( │
│ 14 │ │ "BigVGAN not installed, can't use BigVGAN vocoder\n" │
│ 15 │ │ "Please see the installation instructions on README." │
│ 16 │ ) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ImportError: BigVGAN not installed, can't use BigVGAN vocoder
Please see the installation instructions on README.`

Issues trying to load voicefixer

Hey guys I'm trying to run the code on my machine and I'm having trouble loading the checkpoint using torch:
warn(f"Failed to load image Python extension: {e}")
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ X:\tortoise-tts-faster2\scripts\tortoise_tts.py:223 in │
│ │
│ 220 │ # app = import_module("app") │
│ 221 │ # sys.exit(app.main()) │
│ 222 │ │
│ ❱ 223 │ from tortoise.inference import ( │
│ 224 │ │ check_pydub, │
│ 225 │ │ get_all_voices, │
│ 226 │ │ get_seed, │
│ │
│ X:\tortoise-tts-faster2\tortoise\inference.py:167 in │
│ │
│ 164 │
│ 165 from voicefixer import VoiceFixer │
│ 166 │
│ ❱ 167 vfixer = VoiceFixer() │
│ 168 │
│ 169 │
│ 170 def save_gen_with_voicefix(g, fpath, squeeze=True, voicefixer=True): │
│ │
│ C:\Users*\AppData\Local\Programs\Python\Python310\lib\site-packages\voicefixer\base. │
│ py:13 in init │
│ │
│ 10 class VoiceFixer(nn.Module): │
│ 11 │ def init(self): │
│ 12 │ │ super(VoiceFixer, self).init() │
│ ❱ 13 │ │ self._model = voicefixer_fe(channels=2, sample_rate=44100) │
│ 14 │ │ # print(os.path.join(os.path.expanduser(''), ".cache/voicefixer/analysis_module │
│ 15 │ │ self.analysis_module_ckpt = os.path.join( │
│ 16 │ │ │ │ │ os.path.expanduser(""), │
│ │
│ C:\Users*r\AppData\Local\Programs\Python\Python310\lib\site-packages\voicefixer\resto │
│ rer\model.py:180 in init │
│ │
│ 177 │ │ # self.am = AudioMetrics() │
│ 178 │ │ # self.im = ImgMetrics() │
│ 179 │ │ │
│ ❱ 180 │ │ self.vocoder = Vocoder(sample_rate=44100) │
│ 181 │ │ │
│ 182 │ │ self.valid = None │
│ 183 │ │ self.fake = None │
│ │
│ C:\Users*\AppData\Local\Programs\Python\Python310\lib\site-packages\voicefixer\vocod │
│ er\base.py:19 in init │
│ │
│ 16 │ │ │ raise RuntimeError("Error 1: The checkpoint for synthesis module / vocoder ( │
│ 17 │ │ │ │ │ │ │ │ By default the checkpoint should be download automatical │
│ 18 │ │ │ │ │ │ │ │ But don't worry! Alternatively you can download it direc │
│ ❱ 19 │ │ self._load_pretrain(Config.ckpt) │
│ 20 │ │ self.weight_torch = Config.get_mel_weight_torch(percent=1.0)[ │
│ 21 │ │ │ None, None, None, ... │
│ 22 │ │ ] │
│ │
│ C:\Users*\AppData\Local\Programs\Python\Python310\lib\site-packages\voicefixer\vocod │
│ er\base.py:26 in _load_pretrain │
│ │
│ 23 │ │
│ 24 │ def _load_pretrain(self, pth): │
│ 25 │ │ self.model = Generator(Config.cin_channels) │
│ ❱ 26 │ │ checkpoint = load_checkpoint(pth, torch.device("cpu")) │
│ 27 │ │ load_try(checkpoint["generator"], self.model) │
│ 28 │ │ self.model.eval() │
│ 29 │ │ self.model.remove_weight_norm() │
│ │
│ C:\Users*\AppData\Local\Programs\Python\Python310\lib\site-packages\voicefixer\vocod │
│ er\model\util.py:111 in load_checkpoint │
│ │
│ 108 │
│ 109 │
│ 110 def load_checkpoint(checkpoint_path, device): │
│ ❱ 111 │ checkpoint = torch.load(checkpoint_path, map_location=device) │
│ 112 │ return checkpoint │
│ 113 │
│ 114 │
│ │
│ C:\Users*\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\serializat │
│ ion.py:777 in load │
│ │
│ 774 │ │ │ # If we want to actually tail call to torch.jit.load, we need to │
│ 775 │ │ │ # reset back to the original position. │
│ 776 │ │ │ orig_position = opened_file.tell() │
│ ❱ 777 │ │ │ with _open_zipfile_reader(opened_file) as opened_zipfile: │
│ 778 │ │ │ │ if _is_torchscript_zip(opened_zipfile): │
│ 779 │ │ │ │ │ warnings.warn("'torch.load' received a zip file that looks like a To │
│ 780 │ │ │ │ │ │ │ │ " dispatching to 'torch.jit.load' (call 'torch.jit.loa │
│ │
│ C:\Users***\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\serializat │
│ ion.py:282 in init │
│ │
│ 279 │
│ 280 class _open_zipfile_reader(_opener): │
│ 281 │ def init(self, name_or_buffer) -> None: │
│ ❱ 282 │ │ super(_open_zipfile_reader, self).init(torch._C.PyTorchFileReader(name_or_bu │
│ 283 │
│ 284 │
│ 285 class _open_zipfile_writer_file(_opener): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

I thought maybe the file was imcomplete and tried to delete it and dowload it manually through the browser but I'm still having the same issue. Thanks for any help!

multi-gpu support

Running the streamlit UI and the program is only using 1 GPU, any way to fix this?

SUGGESTION: M1 Support?

Is it possible to have m1 chip support ofr this repository? I'm not very familiar with tensorflow so that why I'm asking since in the docs it says only nvidiga gpu. Is it even possible?

Trying to use the CLI again but have I got the syntax correct?

So I've been using streamlit webui a while with no issues. I need to go back to CLI as I need to work on a large amount of text. Anyway I've been trying to synthesis text with or without the model checkpoint but I get "reading text from stdin!" and then nothing happens. I know this is under maintence but I appreciate if you could guide me here.

Below is the log.

(tts-fast2) H:\tortoise-tts-fast>python scripts/tortoise_tts.py --original_tortoise  --voice trump --preset ultra_fast --text "Well, you know, for starters, sponges are real organisms, right?"
reading text from stdin!

American accent for an English accent speaker?

Is there a way to amend this?

weird input voice sample treatment

According to the readme the voice samples for voice cloning HAVE TO be at 22.05kHz, but then the first thing done to them after loading is resampling them to 24kHz - should prolly switch up to inputting non-default sampling rate if we're resampling anyway
After reading the voice samples and resampling them, only first 102400 samples (~4.27s at 24kHz fs) are used to generate the latents - as far as I'm aware nothing stops us from either taking more samples, or looping the process over existing samples and averaging latents for the specific voice. Tried both approaches locally and they seemed to have increased voice stability (which makes sense since such averaged latents should be more representative of a voice profile).

UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling

I have 3080 and installed it as per git readme.

While running:
python tortoise/do_tts.py --kv_cache --half --no_cond_free --preset single_sample --candidates 1 --text "I have a problem Huston, i repeat !" --voice train-grace

This is the full ouput from console:

C:\Users\Perkel\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\amp\autocast_mode.py:202: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:31<00:00,  3.89s/it]
Computing best candidates using CLVP
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:10<00:00,  1.32s/it]
Transforming autoregressive outputs into audio..
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00,  2.18it/s]
Generating 1 candidates for voice train_grace (seed=None) took 51.85 seconds

From what I read about this kind of issue it seems to be problem with dependency.

Broken paths in test.sh

python: can't open file 'tortoise/do_tts.py': [Errno 2] No such file or directory

RIP

Implement samplers correctly

rudimentary dpm++2m implementation
explore other DPM-Solver samplers
figure out if k-diffusion is still possible
UniPC

Is longform conversion from a text file working properly via read.py? (--kv_cache unrecognized)

I'm getting an "unrecognized arguments" error when I try to run read.py with --kv_cache.