I'm having awesome results with fine-tuning datasets, but I am running into a couple i

Clipped ending or doubled ending about dl-art-school HOT 4 CLOSED

152334h commented on September 25, 2024

Clipped ending or doubled ending

from dl-art-school.

Comments (4)

demonauthor commented on September 25, 2024 1

> 1 - See issue 237 in the original tortoise repo, there's params you can try (I haven't had time to experiment yet)
> 2 - I've also noticed that, seems to be a failing of tortoise generally, not aware of any possible fixes
>
> I'd be interested to hear what you've done to get awesome results -- what dataset size did you have, how many epochs, other hyperparameters. I have not yet managed to get awesome results.

I have been looking at the AI-Voice-Cloning setup. There is a setting in there called "pause time" which gets rid of the clipped last word. It seems to be at a much earlier stage of development and is very slow, but it's coming along nicely. I'm still getting much better results with DLAS and Ozen.

The test I did was trying to clone Vincent Price's voice. I used three different audiobook readings he did. They are fairly clean and his speech is consistent. That yielded about 500 clips using Ozen to create the dataset. Then I did 200 steps in DLAS and clicked the Auto Settings button. I have a separate set of clips I made for the voices folder that I can interchange to get different types of readings (specific emotions, rasp, voice pitch). It doesn't always work, but most of the time I get great results. I'm going to try 300 steps and see if the quality is any cleaner.

I've done 5 other tests with similarly recognizable voices (Walken, Jeff Goldblum, Louise from Bob's Burgers...) with equivalent results. The cadence isn't always right, but the tone, pronunciation, etc. are great. Instantly recognizable. Now I'm trying to combine voices to create specific sounds. I'm using this to do some preproduction for a film proof of concept, and it's working nicely. Sort of like a digital table read.

demonauthor commented on September 25, 2024 1

The best way I have found to fix the clipping is just to add a space and then a single character to the end of the phrase. Then edit that final character out if it is pronounced. As for the doubling of the final line...breaking the text into shorter phrases fixes this. Shorter phrases also yield better "performances" overall.
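For anyone who wants to script that workaround, here is a minimal sketch in plain Python. The helper names (`split_into_phrases`, `pad_ending`, `prepare_phrases`), the 120-character limit, and the choice of `"a"` as the throwaway character are all assumptions for illustration, not anything from DLAS or tortoise. It only prepares the text; you would still feed each phrase to your TTS script and trim the extra character from the audio if it gets spoken.

```python
import re


def split_into_phrases(text, max_chars=120):
    """Split text at sentence boundaries, then pack whole sentences into
    chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk
    rather than being cut mid-sentence.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def pad_ending(phrase, pad_char="a"):
    """Append a space and one throwaway character (assumed: "a") so that any
    clipping hits the throwaway character instead of the real last word."""
    return f"{phrase} {pad_char}"


def prepare_phrases(text, max_chars=120, pad_char="a"):
    """Split text into short phrases and pad the end of each one."""
    return [pad_ending(p, pad_char) for p in split_into_phrases(text, max_chars)]


if __name__ == "__main__":
    for phrase in prepare_phrases("Hello there. How are you?", max_chars=15):
        print(phrase)
```

Each printed phrase can then be generated separately and the clips joined afterwards. The character limit is a knob to experiment with, since "shorter" is not quantified above.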

xenotropic commented on September 25, 2024

1 - See issue 237 in the original tortoise repo, there's params you can try (I haven't had time to experiment yet)
2 - I've also noticed that, seems to be a failing of tortoise generally, not aware of any possible fixes

I'd be interested to hear what you've done to get awesome results -- what dataset size did you have, how many epochs, other hyperparameters. I have not yet managed to get awesome results.

xenotropic commented on September 25, 2024

I found hyperparameters that worked better for me, see #1. Mostly reducing the learning rate (lr) for smaller datasets / single speaker.

For repeats, I experimented with running each of length_penalty and repetition_penalty up to 1024 and saw zero difference (it's super helpful to have those exposed as script parameters in this repo).

It is oddly regular in that it always seems to affect elements in a list, with text of the form "blah blah, X, Y, and Z" being rendered as "blah blah, X, Y, and Z, and Z". If anyone has thoughts on what to experiment with to try to eliminate that, I'm open to ideas.
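To make that pattern easier to spot across many generations, here is a rough sketch of a check for a doubled ending. It assumes you already have a text transcript of the generated audio (typed by hand or from whatever speech-to-text tool you use; that step is not shown). The function name `find_doubled_ending` and the `max_words` limit are hypothetical. It only flags the repeat, it does not fix it.

```python
import re


def find_doubled_ending(transcript, max_words=5):
    """Return the trailing phrase if the transcript ends with the same
    run of words twice in a row, otherwise None.

    Punctuation and case are ignored, and the longest repeated tail
    (up to max_words words) is reported.
    """
    words = re.findall(r"[\w']+", transcript.lower())
    for n in range(min(max_words, len(words) // 2), 0, -1):
        # Compare the last n words with the n words right before them.
        if words[-n:] == words[-2 * n:-n]:
            return " ".join(words[-n:])
    return None


if __name__ == "__main__":
    print(find_doubled_ending("blah blah, X, Y, and Z, and Z"))
    print(find_doubled_ending("blah blah, X, Y, and Z"))
```

Text that legitimately ends on a repeated word would also be flagged, so treat a hit as a prompt to listen to the clip rather than as proof of the bug.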
