Comments (4)
1 - See issue 237 in the original tortoise repo, there are params you can try (I haven't had time to experiment yet)
2 - I've also noticed that, seems to be a failing of tortoise generally, not aware of any possible fixes
I'd be interested to hear what you've done to get awesome results -- what dataset size did you have, how many epochs, other hyperparameters. I have not yet managed to get awesome results.
I have been looking at the AI-Voice-Cloning setup. There is a setting in there called "pause time" which gets rid of the clipped last word. It seems to be at a much earlier stage of development and is very slow, but it's coming along nicely. I'm still getting much better results with DLAS and Ozen.
The test I did was trying to clone Vincent Price's voice. I used three different audiobook readings he did. They are fairly clean and his speech is consistent. That yielded about 500 clips using Ozen to create the dataset. Then I did 200 steps in DLAS and clicked the Auto Settings button. I have a separate set of clips I made for the voices folder that I can interchange to get different types of readings (specific emotions, rasp, voice pitch). It doesn't always work, but most of the time I get great results. Going to try 300 steps and see if the quality is any cleaner.
I've done 5 other tests with similarly recognizable voices (Walken, Jeff Goldblum, Louise from Bob's Burgers...) with equivalent results. The cadence isn't always right, but the tone, pronunciation etc. are great. Instantly recognizable. Now I'm trying to combine voices to create specific sounds. I'm using this to do some preproduction for a film proof of concept, and it's working nicely. Sort of like a digital table read.
from dl-art-school.
The best way I have found to fix the clipping is just to add a space and then a single character to the end of the phrase. Then edit that final character out if it is pronounced. As for the doubling of the final line...breaking the text into shorter phrases fixes this. Shorter phrases also yield better "performances" overall.
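For anyone who wants to script that workaround, here is a minimal sketch in plain Python. It only prepares the text and does not call tortoise at all. The function names and the default padding character `"a"` are my own assumptions, not anything from tortoise or DLAS, and the sentence splitting is a simple regex that will not handle abbreviations well.

```python
import re


def split_into_phrases(text):
    """Split text into sentence-sized phrases at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def pad_phrase(phrase, pad_char="a"):
    """Append a space and a single throwaway character (pad_char is an assumed choice)."""
    return f"{phrase} {pad_char}"


def prepare_for_tts(text, pad_char="a"):
    """Return short, padded phrases to feed to the TTS one at a time."""
    return [pad_phrase(p, pad_char) for p in split_into_phrases(text)]


if __name__ == "__main__":
    for phrase in prepare_for_tts("Hello there. How are you?"):
        print(phrase)
```

Each returned phrase would be generated separately, and the padding character still has to be trimmed from the audio by hand if it ends up being pronounced.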
I found hyperparameters that worked better for me, see #1. Mostly reducing lr for smaller datasets / single speaker.
For repeats, I experimented running each of length_penalty and repetition_penalty up to 1024, zero difference (super-helpful to have those exposed as script parameters in this repo).
It is oddly regular in that it always seems to affect the last element in a list, with text of the form "blah blah, X, Y, and Z" being rendered as "blah blah, X, Y, and Z, and Z". If anyone has thoughts on what to experiment with to try to eliminate that, I'm open to ideas.
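One way to make experiments on this less tedious would be to flag the repeat automatically. The sketch below is a plain string check, assuming you already have a transcript of the generated audio from some speech-to-text step (that step is not shown and is an assumption on my part). The function name `repeated_tail` and the `max_words` limit are hypothetical.

```python
import re


def repeated_tail(text, max_words=5):
    """Return the word sequence that is doubled at the end of text, or None.

    Compares the last n words with the n words just before them,
    trying the longest candidate first. Case and punctuation are ignored.
    """
    words = re.findall(r"[\w']+", text.lower())
    longest = min(max_words, len(words) // 2)
    for n in range(longest, 0, -1):
        if words[-n:] == words[-2 * n:-n]:
            return " ".join(words[-n:])
    return None


if __name__ == "__main__":
    print(repeated_tail("blah blah, X, Y, and Z, and Z"))
    print(repeated_tail("blah blah, X, Y, and Z"))
```

Running this over a batch of transcripts would at least show whether a parameter change reduces how often the "and Z, and Z" pattern appears.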