Comments (4)
This setting needs to be consistent with the feature extractor. I guess you specifically mean the vocos.yaml config file, which takes mel-spectrograms as inputs. Note that the input mel-spectrograms are "center" padded, so to compensate for this we can use torch.istft with center=True, which trims the corresponding samples.
However, for features like EnCodec tokens that are not padded in that specific way, using torch.istft with center=True would trim too many samples. In the vocos-encodec.yaml config file, you'll find padding: same in the ISTFTHead.
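To make the sample-count argument concrete, here is a minimal sketch of the length arithmetic in plain Python. The helper names, the hop of 256 for the mel case, and the clip lengths are illustrative assumptions, not values read from the config files. The "same" trimming of (win_length - hop_length) // 2 per side is also an assumption about how such a padding mode typically works.

```python
def stft_frames_centered(num_samples: int, hop_length: int) -> int:
    """Frame count of an STFT with center=True and an even n_fft."""
    return 1 + num_samples // hop_length


def istft_len_centered(num_frames: int, hop_length: int) -> int:
    """Length after an inverse STFT with center=True (n_fft // 2 trimmed per side)."""
    return hop_length * (num_frames - 1)


def istft_len_same(num_frames: int, hop_length: int, win_length: int) -> int:
    """Length under the assumed 'same' scheme: trim (win_length - hop_length) // 2 per side."""
    full = (num_frames - 1) * hop_length + win_length
    pad = (win_length - hop_length) // 2
    return full - 2 * pad


# Mel features: center padding adds one extra frame, and center=True removes it again.
mel_frames = stft_frames_centered(25600, hop_length=256)   # 101 frames
mel_out = istft_len_centered(mel_frames, hop_length=256)   # 25600 samples

# EnCodec-style features: exactly one frame per 320 samples, no extra frame.
tok_frames = 24000 // 320                                  # 75 frames
tok_centered = istft_len_centered(tok_frames, 320)         # 23680, one hop short
tok_same = istft_len_same(tok_frames, 320, 1280)           # 24000 samples
```

With center-padded mel frames the round trip returns the original length, while the same trimming applied to 75 token frames loses one hop (320 samples). Under the assumed "same" scheme the output is exactly frames times hop_length.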
It would certainly be simpler if we could use torch.istft with center=False and slice the output audio ourselves. However, PyTorch does not allow this (for specific windows) because of how the NOLA (Nonzero Overlap Add) constraint is checked. You might want to check out this issue: pytorch/pytorch#91309
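A rough illustration of why a Hann window runs into trouble here, in plain Python. This is only a sketch of the quantity a NOLA-style check looks at (the overlap-added squared window), not PyTorch's actual code; the frame count of 10 and the helper names are made up for the example.

```python
import math


def hann(n: int) -> list:
    """Periodic Hann window; its first sample is exactly zero."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def window_envelope(window: list, hop_length: int, num_frames: int) -> list:
    """Overlap-add of the squared window across all frames."""
    n = len(window)
    env = [0.0] * ((num_frames - 1) * hop_length + n)
    for f in range(num_frames):
        for i, w in enumerate(window):
            env[f * hop_length + i] += w * w
    return env


n_fft, hop = 1280, 320
env = window_envelope(hann(n_fft), hop, num_frames=10)

# Keeping every sample (as with center=False) includes the very first one,
# where only the zero-valued start of the first window contributes.
edge_min = min(env)

# Dropping n_fft // 2 samples per side (as with center=True) keeps only
# positions that are covered by non-zero window values.
inner = env[n_fft // 2 : len(env) - n_fft // 2]
inner_min = min(inner)
```

Here edge_min is zero while inner_min stays well above zero, so a strict "envelope must be non-zero everywhere" test fails on the untrimmed signal even though only the edges are affected.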
Hope it helps!
from vocos.
That's helpful~ Thanks!
Hi, I noticed that n_fft in the head is set to 1280. Why didn't you use 1024?
The key parameter here is hop_length. It should align with the resolution of your input features. Since EnCodec tokens are downsampled by a factor of 320, we've set the hop_length to 320.
Now, when you use a Hann window in the iSTFT, it's common to have a 75% overlap. This means our window_len should be four times the hop_length. That's why we set window_len (and n_fft) to 1280.
If you want to dive a bit deeper, I'd recommend looking into the constant overlap-add (COLA) constraint. There's a helpful discussion on this topic here: https://dsp.stackexchange.com/a/33615.
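As a quick numerical check of the COLA idea, the sketch below (plain Python, with hypothetical helper names) overlap-adds a periodic Hann window at a hop of 320 and measures how far the sum deviates from a constant. It compares a window length of 1280 (75% overlap) with 1024, the value asked about above.

```python
import math


def hann(n: int) -> list:
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def cola_spread(window: list, hop_length: int) -> float:
    """Max minus min of the overlap-added window in steady state.

    A sample at offset r within a hop is covered by window[r],
    window[r + hop], window[r + 2 * hop], ... so summing those
    gives the overlap-add value at that offset.
    """
    sums = [
        sum(window[i] for i in range(r, len(window), hop_length))
        for r in range(hop_length)
    ]
    return max(sums) - min(sums)


hop_length = 320
spread_1280 = cola_spread(hann(4 * hop_length), hop_length)  # window_len = 4 * hop
spread_1024 = cola_spread(hann(1024), hop_length)            # 1024 / 320 is not an integer
```

With window_len = 1280 the spread is zero up to floating-point rounding, so the windows sum to a constant. With 1024 the sum ripples from sample to sample, because the hop no longer divides the window length evenly.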