
vocos's Introduction

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Audio samples | Paper [abs] [pdf]

Vocos is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. Trained using a Generative Adversarial Network (GAN) objective, Vocos can generate waveforms in a single forward pass. Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain. Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through inverse Fourier transform.
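
To make the Fourier-based synthesis concrete, here is a minimal sketch with illustrative values (this is not Vocos' internal code): given a predicted complex spectrogram, the waveform is recovered with a single inverse STFT call.

import torch

# Stand-in for a model's predicted STFT coefficients (values are illustrative)
n_fft, hop_length, num_frames = 1024, 256, 100
spec = torch.randn(1, n_fft // 2 + 1, num_frames, dtype=torch.cfloat)

# One inverse STFT call turns the coefficients into a waveform
audio = torch.istft(spec, n_fft=n_fft, hop_length=hop_length, window=torch.hann_window(n_fft))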

Installation

To use Vocos only in inference mode, install it using:

pip install vocos

If you wish to train the model, install it with additional dependencies:

pip install vocos[train]

Usage

Reconstruct audio from mel-spectrogram

import torch

from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

mel = torch.randn(1, 100, 256)  # B, C, T
audio = vocos.decode(mel)

Copy-synthesis from a file:

import torchaudio

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
y_hat = vocos(y)

Reconstruct audio from EnCodec tokens

Additionally, you need to provide a bandwidth_id, the index of the target bandwidth (in kbps) in the list [1.5, 3.0, 6.0, 12.0].

vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

audio_tokens = torch.randint(low=0, high=1024, size=(8, 200))  # 8 codebooks, 200 frames
features = vocos.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([2])  # 6 kbps

audio = vocos.decode(features, bandwidth_id=bandwidth_id)

Copy-synthesis from a file: this extracts and quantizes features with EnCodec, then reconstructs them with Vocos in a single forward pass.

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)

y_hat = vocos(y, bandwidth_id=bandwidth_id)

Integrate with 🐶 Bark text-to-audio model

See example notebook.

Pre-trained models

Model Name                     Dataset         Training Iterations   Parameters
charactr/vocos-mel-24khz       LibriTTS        1M                    13.5M
charactr/vocos-encodec-24khz   DNS Challenge   2M                    7.9M

Training

Prepare a filelist of audio files for the training and validation set:

find $TRAIN_DATASET_DIR -name '*.wav' > filelist.train
find $VAL_DATASET_DIR -name '*.wav' > filelist.val

Fill a config file, e.g. vocos.yaml, with your filelist paths and start training with:

python train.py -c configs/vocos.yaml

Refer to the PyTorch Lightning documentation for details about customizing the training pipeline.

Citation

If this code contributes to your research, please cite our work:

@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}

License

The code in this repository is released under the MIT license as found in the LICENSE file.

vocos's People

Contributors

alealv, hubertsiuzdak, roebel


vocos's Issues

One click installer + usage question

Hi, thank you for the great project you have made available!

I added it to my one-click installer package of AI-based audio generators. Link

Here's the notebook I quickly created:
https://github.com/rsxdalv/tts-generation-webui/blob/main/notebooks/vocos.ipynb

I wonder whether using this in a pipeline with SunoAI/Bark has a different impact than with something else. I couldn't manage to hook up the raw EnCodec codes, so I used the final wav files.
I saw the best result when using the 12 kbps bandwidth, although if I remember correctly the Bark model runs at 6 kbps.
With my small sample size I didn't see an unsupervised improvement, although I found an example where it gives more "quality" to a sound sample (I included it next to the notebook).

I would love to see how it would go if I could hook it up with the EnCodec tokens from Bark, and how best to go about using it.

Windows incompatibility?

RuntimeError: torchaudio.sox_effects.sox_effects.apply_effects_tensor requires sox extension, but TorchAudio is not compiled with it. Please build TorchAudio with libsox support.

How can we fix this error on Windows?
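
One possible workaround, sketched below under the assumption that the failing sox call is a gain/normalization effect in the training data pipeline (this is an illustration, not the project's official fix): apply the gain directly as a tensor operation, which needs no sox extension.

import torch

def apply_gain_db(y: torch.Tensor, gain_db: float) -> torch.Tensor:
    # Multiply the waveform by 10^(dB/20); equivalent to a sox "gain" effect
    # and usable on Windows builds of torchaudio without libsox.
    return y * (10.0 ** (gain_db / 20.0))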

Feature maps from 1st layer of each discriminator not included

I noticed that when saving feature maps for the GAN loss there is the condition if i > 0, which means the feature maps of the first convolutional layer are not considered. Is this an optimization trick? Does training work better this way?

TypeError: 'type' object is not subscriptable

Thanks for open-sourcing this. When I try to run the model, I face the following problem:
File "/mnt/nfs/dev-aigc-0/data1/xiuyuanqin/work/vocos/vocos/modules.py", line 109, in ResBlock1
dilation: tuple[int] = (1, 3, 5),
TypeError: 'type' object is not subscriptable
Could you help me with this problem?
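
For reference, a hedged sketch of the usual fix: the built-in generic tuple[int] is only subscriptable on Python 3.9+, so on older interpreters either upgrade Python or switch the annotation to typing.Tuple, for example:

from typing import Tuple

# Works on Python < 3.9, where `tuple[int]` raises
# "TypeError: 'type' object is not subscriptable"
dilation: Tuple[int, int, int] = (1, 3, 5)
print(dilation)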

combine with superresolution

Thank you for open sourcing this great work!

One of the great advantages I see in vocoders operating in the time domain is how easy it is to combine the vocoding task with super-resolution: you just upsample a bit more and simply use audio with a higher sampling rate as the target signal. Is the same somehow possible with Vocos? Could I train a model that uses 16 kHz spectrograms as the input but produces a 24 kHz wave?

error

AttributeError: 'WeightNorm' object has no attribute 'name'. Did you mean: 'ne'?

ISTFT head uses center padding rather than same padding

Hi,
this is great work. I noticed that, according to your config file, the ISTFT head does not use the "same" padding you implemented, but rather the "center" padding provided by the torch API.
For neural vocoding, the spectrogram frames and samples should be time-aligned (num_frames * hop_len = samples), so why didn't you use the "same" padding you implemented, which has that property?
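
A small numeric sketch of the alignment in question, using assumed values (n_fft=1024, hop_length=256, a 25600-sample clip) and the usual conventions for the two padding modes:

n_fft, hop_length, num_samples = 1024, 256, 25600

# torch-style "center" padding yields one extra frame
frames_center = num_samples // hop_length + 1

# "same"-style padding (pad (n_fft - hop_length) // 2 on each side, center=False)
# keeps frames * hop_length equal to the number of samples
frames_same = num_samples // hop_length

print(frames_center, frames_same)  # 101 100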

Config parameters for training vocos with 22.05 kHz

Hi!

If it is possible to train Vocos at 22.05 kHz, I would like to see working config parameters, because my attempt has failed.

I have the following config:

# pytorch_lightning==1.8.6
seed_everything: 4444

data:
  class_path: vocos.dataset.VocosDataModule
  init_args:
    train_params:
      filelist_path: /home/yehor/Work/github/vocos/tetiana-dataset/filelist.train
      sampling_rate: 22050
      num_samples: 15053
      batch_size: 16
      num_workers: 8

    val_params:
      filelist_path: /home/yehor/Work/github/vocos/tetiana-dataset/filelist.val
      sampling_rate: 22050
      num_samples: 44453
      batch_size: 16
      num_workers: 8

model:
  class_path: vocos.experiment.VocosExp
  init_args:
    sample_rate: 22050
    initial_learning_rate: 2e-4
    mel_loss_coeff: 45
    mrd_loss_coeff: 0.1
    num_warmup_steps: 0 # Optimizers warmup steps
    pretrain_mel_steps: 0  # 0 means GAN objective from the first iteration

    # automatic evaluation
    evaluate_utmos: false
    evaluate_pesq: false
    evaluate_periodicty: false

    feature_extractor:
      class_path: vocos.feature_extractors.MelSpectrogramFeatures
      init_args:
        sample_rate: 22050
        n_fft: 1024
        hop_length: 256
        n_mels: 80
        padding: center

    backbone:
      class_path: vocos.models.VocosBackbone
      init_args:
        input_channels: 80
        dim: 512
        intermediate_dim: 1536
        num_layers: 8

    head:
      class_path: vocos.heads.ISTFTHead
      init_args:
        dim: 512
        n_fft: 1024
        hop_length: 256
        padding: center

trainer:
  logger:
    class_path: pytorch_lightning.loggers.TensorBoardLogger
    init_args:
      save_dir: logs/
  callbacks:
    - class_path: pytorch_lightning.callbacks.LearningRateMonitor
    - class_path: pytorch_lightning.callbacks.ModelSummary
      init_args:
        max_depth: 2
    - class_path: pytorch_lightning.callbacks.ModelCheckpoint
      init_args:
        monitor: val_loss
        filename: vocos_checkpoint_{epoch}_{step}_{val_loss:.4f}
        save_top_k: 3
        save_last: true
    - class_path: vocos.helpers.GradNormCallback

  # Lightning calculates max_steps across all optimizer steps (rather than number of batches)
  # This equals to 1M steps per generator and 1M per discriminator
  max_steps: 2000000
  # You might want to limit val batches when evaluating all the metrics, as they are time-consuming
  limit_val_batches: 100
  accelerator: gpu
  strategy: ddp
  devices: [0]
  log_every_n_steps: 100

Fails with the following error:

  File "/home/yehor/Work/github/vocos/vocos/experiment.py", line 142, in training_step
    loss_fm_mp = self.feat_matching_loss(fmap_r=fmap_rs_mp, fmap_g=fmap_gs_mp) / len(fmap_rs_mp)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yehor/Tools/anaconda3/envs/vocos/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yehor/Work/github/vocos/vocos/loss.py", line 112, in forward
    loss += torch.mean(torch.abs(rl - gl))
                                 ~~~^~~~
RuntimeError: The size of tensor a (837) must match the size of tensor b (825) at non-singleton dimension 2

Debug in vscode

I can't debug in VS Code. I put breakpoints in the feature extractor, but execution never stops at them during training. I tried pdb as well and it still doesn't work. Any ideas how to do this? Or is this a problem with Lightning? I can't find any documentation there.

Any help is appreciated.

Export to ONNX

Hi,
Any plans to enable ONNX export for Vocos?
I developed a script to do it, but it has some issues with some pytorch operators that Vocos uses.

Click to expand code
# coding: utf-8

import argparse
import logging
import os
import random
from pathlib import Path

import numpy as np
import torch
import yaml
from torch import nn


from vocos.pretrained import Vocos


DEFAULT_OPSET_VERSION = 18
_LOGGER = logging.getLogger("export_onnx")


class VocosGen(nn.Module):
    def __init__(self, vocos):
        super().__init__()
        self.vocos = vocos

    def forward(self, mels):
        x = self.vocos.backbone(mels)
        audio_output = self.vocos.head(x)
        return audio_output


def export_generator(config_path, checkpoint_path, output_dir, opset_version):

    with open(config_path, "r") as f:
        config = yaml.safe_load(f)

    class_module, class_name = config["model"]["class_path"].rsplit(".", 1)
    module = __import__(class_module, fromlist=[class_name])
    vocos_cls = getattr(module, class_name)

    components = Vocos.from_hparams(config_path)
    params = config["model"]["init_args"]
    
    vocos = vocos_cls(
        feature_extractor=components.feature_extractor,
        backbone=components.backbone,
        head=components.head,
        sample_rate=params["sample_rate"],
        initial_learning_rate=params["initial_learning_rate"],
        num_warmup_steps=params["num_warmup_steps"],
        mel_loss_coeff=params["mel_loss_coeff"],
        mrd_loss_coeff=params["mrd_loss_coeff"],
    )

    model = VocosGen(vocos)
    model.eval()

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    epoch = 200
    global_step = 1000000
    onnx_filename = f"vocos-epoch={epoch}.step={global_step}.onnx"
    onnx_path = os.path.join(output_dir, onnx_filename)

    dummy_input = torch.rand(1, vocos.backbone.input_channels, 64)
    dynamic_axes = {
        "mels": {0: "batch_size", 2: "time"},
        "audio": {0: "batch_size", 1: "time"},
    }

    # Conventional ONNX export
    # torch.onnx.export(
    #     model=model,
    #     args=dummy_input,
    #     f=onnx_path,
    #     input_names=["mels"],
    #     output_names=["audio"],
    #     dynamic_axes=dynamic_axes,
    #     opset_version=opset_version,
    #     export_params=True,
    #     do_constant_folding=True,
    # )

    # Using the new dynamo export
    export_output = torch.onnx.dynamo_export(model, dummy_input)
    export_output.save(onnx_path)
    return onnx_path


def main():
    logging.basicConfig(level=logging.DEBUG)

    parser = argparse.ArgumentParser(
        prog="export_onnx",
        description="Export a vocos checkpoint to onnx",
    )

    parser.add_argument("--config", type=str, required=True)
    parser.add_argument("--checkpoint", type=str, required=True)
    parser.add_argument("--output-dir", type=str, required=True)
    parser.add_argument("--seed", type=int, default=1234, help="random seed")
    parser.add_argument("--opset", type=int, default=DEFAULT_OPSET_VERSION)

    args = parser.parse_args()

    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.cuda.manual_seed(args.seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    _LOGGER.info("Exporting model to ONNX")
    _LOGGER.info(f"Config path: `{args.config}`")
    _LOGGER.info(f"Using checkpoint: `{args.checkpoint}`")
    onnx_path = export_generator(
        config_path=args.config,
        checkpoint_path=args.checkpoint,
        output_dir=args.output_dir,
        opset_version=args.opset
    )
    _LOGGER.info(f"Exported ONNX model to: `{onnx_path}`")


if __name__ == '__main__':
    main()

Training vocos on a single speaker dataset

Hi,

I'm looking to train on a single-speaker dataset similar to LJSpeech, and I'm looking for guidance. I have a few questions.

Has any experimentation been done on single-speaker datasets such as LJSpeech with Vocos, and if so, what were the metrics at convergence? How many steps should I train for on a single-speaker dataset? Also, which metrics should I focus on to tell whether the model has converged?

Any help regarding this would be very valuable to me.

Thanks!

Multiple GPU

Hello, does the repository support multi-GPU inference with DataParallel? Or some other method?

Primarily, I'm looking to encode with EnCodec and run that on multiple GPUs (using codes and the feature extractor, etc.). Multi-GPU decoding via Vocos would also be great.

training error.

hi, I met this error:
I filled in the train and validation paths in configs/vocos.yaml and ran "python train.py -c configs/vocos.yaml"; this error happens: "train.py: error: Parser key "data": 'type' object is not subscriptable".

How should I solve it? Many thanks.
@hubertsiuzdak

Noob questions on use and optimizing

I want to fix some old recordings so they sound crisp. Can someone give me a dumbed-down walkthrough of how to install this and create a Vocos .pth with my own 48 kHz dataset? Recommended training settings would help as well.

Stripes in melspectrogram.

Hello, I'm trying to train Vocos to generate 32 kHz waveforms.
I simply changed the mel loss and the head's parameters to hop_size=1600, n_fft=400, sample_rate=32000, mel_channels=120, segment_length=81, which fits my mel spectrogram format.
But although the Vocos model converged well after 420k training steps, I can see some stripes in the 28-32 kHz frequency range.
Is there something else I have to change, like the VocosBackbone module?

Any advice or help would be appreciated.

Thank you.

MPS support

I tried to run it on MPS and hit an issue in decode: in 1j * y, one operand is complex and the other isn't, which triggered an assert in the MPS backend (binary GEMM). Any ideas how I can solve this?

how to use custom trained checkpoint

Thanks for sharing this project!
I've followed the instructions to train a custom model. TensorBoard is showing decent progress and the audio predictions are starting to sound good, but I am unable to load the custom model checkpoint for inference. Can you share how to use custom-trained checkpoints for inference?
Thanks,
Emmanuel
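
For reference, a hedged sketch of one way to run inference from a Lightning checkpoint (assuming the generator submodules are named feature_extractor, backbone, and head in the checkpoint's state dict; this is not an official API):

import torch
from vocos import Vocos

# Build the model from the training config, then load only the generator weights
model = Vocos.from_hparams("configs/vocos.yaml")
ckpt = torch.load("path/to/vocos_checkpoint.ckpt", map_location="cpu")

prefixes = ("feature_extractor.", "backbone.", "head.")
generator_state = {k: v for k, v in ckpt["state_dict"].items() if k.startswith(prefixes)}

model.load_state_dict(generator_state, strict=False)
model.eval()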

about the install problems

I followed the README to install Vocos, but when testing the samples there is a problem:
LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.
I also downloaded the PyTorch weights file, but how do I make Vocos.from_pretrained() load local files?
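
A hedged sketch for offline loading, assuming you have already downloaded config.yaml and pytorch_model.bin from the Hugging Face repository to a local folder (this is not an official API, just one possible workaround):

import torch
from vocos import Vocos

# Build the model from the local config, then load the local weights
vocos = Vocos.from_hparams("path/to/config.yaml")
state_dict = torch.load("path/to/pytorch_model.bin", map_location="cpu")
vocos.load_state_dict(state_dict)
vocos.eval()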

Import error when loading vocos

I'm trying to import the vocos module, but I'm getting the traceback below.
Can anyone help me solve this issue?
FYI, all dependencies are installed and I just want to try inference with the pretrained models.
Thanks in advance.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 1
----> 1 from vocos import Vocos

File ~/conda/codec/lib/python3.8/site-packages/vocos/__init__.py:1
----> 1 from vocos.pretrained import Vocos
      4 __version__ = "0.0.3"

File ~/conda/codec/lib/python3.8/site-packages/vocos/pretrained.py:7
      5 from huggingface_hub import hf_hub_download
      6 from torch import nn
----> 7 from vocos.feature_extractors import FeatureExtractor, EncodecFeatures
      8 from vocos.heads import FourierHead
      9 from vocos.models import Backbone

File ~/conda/codec/lib/python3.8/site-packages/vocos/feature_extractors.py:8
      5 from encodec import EncodecModel
      6 from torch import nn
----> 8 from vocos.modules import safe_log
     11 class FeatureExtractor(nn.Module):
     12     """Base class for feature extractors."""

File ~/conda/codec/lib/python3.8/site-packages/vocos/modules.py:89
     85         x = x * scale + shift
...
    112     ):
    113         super().__init__()
    114         self.lrelu_slope = lrelu_slope

TypeError: 'type' object is not subscriptable

Bark+Vocos.ipynb fails on saving mp3 files with error about FFmpeg backend

Thanks a lot for this repository! This is very useful, and thanks a lot for the great notebook about Bark+Vocos integration!

Error

I tried to follow the notebook Bark+Vocos.ipynb but encountered the following error:

Traceback (most recent call last):
  File ".../bark_vocos_usage.py", line 66, in <module>
    torchaudio.save("encodec.mp3", encodec_output[None, :], 44100, compression=128)
  File ".../venv/lib/python3.10/site-packages/torchaudio/_backend/utils.py", line 312, in save
    return backend.save(
  File ".../venv/lib/python3.10/site-packages/torchaudio/_backend/ffmpeg.py", line 351, in save
    raise ValueError(
ValueError: ('FFmpeg backend expects non-`None` value for argument `compression` to be of ', "type `torchaudio.io.CodecConfig`, but received value of type <class 'int'>")

Workaround

For me it works if I replace these last two lines:

torchaudio.save("encodec.mp3", encodec_output[None, :], 44100, compression=128)
torchaudio.save("vocos.mp3", vocos_output, 44100, compression=128)

with these:

torchaudio.save("encodec.mp3", encodec_output[None, :], 44100, compression=torchaudio.io.CodecConfig(bit_rate=320))
torchaudio.save("vocos.mp3", vocos_output, 44100, compression=torchaudio.io.CodecConfig(bit_rate=320))

pip freeze

Just in case this will help somebody someday - for me it works after installing torch with pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 and IPython with pip install ipython. And these packages were finally installed:

annotated-types==0.6.0
asttokens==2.4.1
audioread==3.0.1
boto3==1.29.3
botocore==1.32.3
certifi==2022.12.7
cffi==1.16.0
charset-normalizer==2.1.1
cmake==3.25.0
decorator==5.1.1
einops==0.7.0
encodec==0.1.1
exceptiongroup==1.1.3
executing==2.0.1
filelock==3.9.0
fsspec==2023.10.0
funcy==2.0
huggingface-hub==0.19.4
idna==3.4
inflect==7.0.0
ipython==8.17.2
jedi==0.19.1
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.3.2
lazy_loader==0.3
librosa==0.10.1
lit==15.0.7
llvmlite==0.41.1
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
mpmath==1.3.0
msgpack==1.0.7
networkx==3.0
numba==0.58.1
numpy==1.24.1
packaging==23.2
parso==0.8.3
pexpect==4.8.0
Pillow==9.3.0
platformdirs==4.0.0
pooch==1.8.0
progressbar==2.5
prompt-toolkit==3.0.41
ptyprocess==0.7.0
pure-eval==0.2.2
pycparser==2.21
pydantic==2.5.1
pydantic_core==2.14.3
Pygments==2.17.1
python-dateutil==2.8.2
PyYAML==6.0.1
regex==2023.10.3
requests==2.28.1
rotary-embedding-torch==0.3.5
s3transfer==0.7.0
safetensors==0.4.0
scikit-learn==1.3.2
scipy==1.11.4
six==1.16.0
soundfile==0.12.1
soxr==0.3.7
stack-data==0.6.3
suno-bark @ git+https://github.com/suno-ai/bark.git@773624d26db84278a55aacae9a16d7b25fbccab8
sympy==1.12
threadpoolctl==3.2.0
tokenizers==0.13.3
torch==2.1.1+cu118
torchaudio==2.1.1+cu118
torchvision==0.16.1+cu118
tortoise-tts @ git+https://github.com/neonbjb/tortoise-tts@80f89987a5abda5e2b082618cd74f9c7411141dc
tqdm==4.66.1
traitlets==5.13.0
transformers==4.31.0
triton==2.1.0
typing_extensions==4.8.0
Unidecode==1.3.7
urllib3==1.26.13
vocos==0.1.0
wcwidth==0.2.10

About the VISQOL

@hubertsiuzdak

Thanks for the nice work!

I have a question about VISQOL.

For the evaluation, you used the audio mode of VISQOL.

However, the input waveform must be at a 48 kHz sampling rate for this. I would like to know how you upsampled the ground-truth samples and the generated samples, respectively.

Thanks!

COLA == Training Instability?

I'm training a Vocos decoder for my DAC autoencoder. When I set hop length = 256 and n_fft = 1024 in the iSTFT head the discriminators quickly win within 1000 steps. However, this doesn't happen when I set n_fft = 512, 768, or 1026. Do you know why this is happening and whether using 1026 would affect quality? I don't completely understand the COLA property.
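
For background, a short summary of the standard overlap-add conditions (definitions only, not a diagnosis of the instability above): with window w of length N and hop H,

\sum_{m \in \mathbb{Z}} w[n - mH] = C \quad \text{for all } n \qquad \text{(COLA)}

\sum_{m \in \mathbb{Z}} w^{2}[n - mH] > 0 \quad \text{for all } n \qquad \text{(NOLA, required by inverse-STFT implementations)}

A Hann window satisfies COLA at hops H = N/2 and H = N/4, so hop_length = 256 with n_fft = 1024 is a standard COLA configuration.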

Compatibility with Matcha TTS

Hi

The issue

I trained a model based on Matcha TTS, and I tried to use Vocos with it. Unfortunately, vocoding using a checkpoint trained with the default config of Vocos gives a robotic output with very low volume.

The only config values I changed are sample_rate (=22050) and n_mels (=80).

I assumed that there is a mismatch between the Matcha TTS-generated mel spectrogram and the mel spectrogram Vocos expects, in terms of parameters.

A new feature extractor

I wrote a feature extractor class to generate the mel spectrogram using the same parameters as Matcha TTS. Most of the code is copied directly from Matcha's source code.

Click to expand: MatchaMelSpectrogramFeatures
import numpy as np
import torch
from librosa.filters import mel as librosa_mel_fn

from vocos.feature_extractors import FeatureExtractor


class MatchaMelSpectrogramFeatures(FeatureExtractor):
    """
    Generate MelSpectrogram from audio using same params
    as Matcha TTS (https://github.com/shivammehta25/Matcha-TTS)
    This is also useful with Tacotron, WaveGlow, etc.
    """

    def __init__(
        self,
        *,
        mel_mean,
        mel_std,
        sample_rate=22050,
        n_fft=1024,
        win_length=1024,
        n_mels=80,
        hop_length=256,
        center=False,
        f_min=0,
        f_max=8000,
    ):
        super().__init__()
        self.sample_rate = sample_rate
        self.n_mels = n_mels
        self.n_fft = n_fft
        self.win_length = win_length
        self.hop_length = hop_length
        self.center = center
        self.f_min = f_min
        self.f_max = f_max
        # Data-dependent
        self.mel_mean = mel_mean
        self.mel_std = mel_std
        # Cache
        self._mel_basis = {}
        self._hann_window = {}

    def forward(self, audio: torch.Tensor, **kwargs) -> torch.Tensor:
        mel = self.mel_spectrogram(audio).squeeze()
        mel = normalize(mel, self.mel_mean, self.mel_std)
        return mel.unsqueeze(0)

    def mel_spectrogram(self, y):
        mel_basis_key = str(self.f_max) + "_" + str(y.device)
        han_window_key = str(y.device)
        if mel_basis_key not in self._mel_basis:
            mel = librosa_mel_fn(
                sr=self.sample_rate,
                n_fft=self.n_fft,
                n_mels=self.n_mels,
                fmin=self.f_min,
                fmax=self.f_max
            )
            self._mel_basis[mel_basis_key] = torch.from_numpy(mel).float().to(y.device)
            self._hann_window[han_window_key] = torch.hann_window(self.win_length).to(y.device)
        pad_vals = (
            (self.n_fft - self.hop_length) // 2,
            (self.n_fft - self.hop_length) // 2,
        )
        y = torch.nn.functional.pad(
            y.unsqueeze(1),
            pad_vals,
            mode="reflect"
        )
        y = y.squeeze(1)
        spec = torch.stft(
            y,
            self.n_fft,
            hop_length=self.hop_length,
            win_length=self.win_length,
            window=self._hann_window[han_window_key],
            center=self.center,
            pad_mode="reflect",
            normalized=False,
            onesided=True,
            return_complex=True,
        )
        spec = torch.view_as_real(spec)
        spec = torch.sqrt(spec.pow(2).sum(-1) + (1e-9))
        spec = torch.matmul(self._mel_basis[mel_basis_key], spec)
        spec = spectral_normalize_torch(spec)
        return spec


def spectral_normalize_torch(magnitudes):
    output = dynamic_range_compression_torch(magnitudes)
    return output


def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
    return torch.log(torch.clamp(x, min=clip_val) * C)

def normalize(data, mu, std):
    if not isinstance(mu, (float, int)):
        if isinstance(mu, list):
            mu = torch.tensor(mu, dtype=data.dtype, device=data.device)
        elif isinstance(mu, torch.Tensor):
            mu = mu.to(data.device)
        elif isinstance(mu, np.ndarray):
            mu = torch.from_numpy(mu).to(data.device)
        mu = mu.unsqueeze(-1)

    if not isinstance(std, (float, int)):
        if isinstance(std, list):
            std = torch.tensor(std, dtype=data.dtype, device=data.device)
        elif isinstance(std, torch.Tensor):
            std = std.to(data.device)
        elif isinstance(std, np.ndarray):
            std = torch.from_numpy(std).to(data.device)
        std = std.unsqueeze(-1)

    return (data - mu) / std

And I used it with the following config:

Click to expand config: vocos-matcha.yaml
# pytorch_lightning==1.8.6
seed_everything: 4444

data:
  class_path: vocos.dataset.VocosDataModule
  init_args:
    train_params:
      filelist_path: ./datasets/train.txt
      sampling_rate: 22050
      num_samples: 16384
      batch_size: 16
      num_workers: 4

    val_params:
      filelist_path: ./datasets/val.txt
      sampling_rate: 22050
      num_samples: 48384
      batch_size: 16
      num_workers: 4

model:
  class_path: vocos.experiment.VocosExp
  init_args:
    sample_rate: 22050
    initial_learning_rate: 5e-4
    mel_loss_coeff: 45
    mrd_loss_coeff: 0.1
    num_warmup_steps: 0 # Optimizers warmup steps
    pretrain_mel_steps: 0  # 0 means GAN objective from the first iteration
    # automatic evaluation
    evaluate_utmos: true
    evaluate_pesq: true
    evaluate_periodicty: true

    feature_extractor:
      class_path: matcha_feature_extractor.MatchaMelSpectrogramFeatures
      init_args:
        sample_rate: 22050
        n_fft: 1024
        n_mels: 80
        hop_length: 256
        win_length: 1024
        f_min: 0
        f_max: 8000
        center: False
        mel_mean: -6.38385
        mel_std: 2.541796

    backbone:
      class_path: vocos.models.VocosBackbone
      init_args:
        input_channels: 80
        dim: 512
        intermediate_dim: 1536
        num_layers: 8

    head:
      class_path: vocos.heads.ISTFTHead
      init_args:
        dim: 512
        n_fft: 1024
        hop_length: 256
        padding: same

trainer:
  logger:
    class_path: pytorch_lightning.loggers.TensorBoardLogger
    init_args:
      save_dir: /content/drive/MyDrive/vocos/logs
  callbacks:
    - class_path: pytorch_lightning.callbacks.LearningRateMonitor
    - class_path: pytorch_lightning.callbacks.ModelSummary
      init_args:
        max_depth: 2
    - class_path: pytorch_lightning.callbacks.ModelCheckpoint
      init_args:
        monitor: val_loss
        filename: vocos_checkpoint_{epoch}_{step}_{val_loss:.4f}
        save_top_k: 2
        save_last: true
    - class_path: vocos.helpers.GradNormCallback

  # Lightning calculates max_steps across all optimizer steps (rather than number of batches)
  # This equals to 1M steps per generator and 1M per discriminator
  max_steps: 2000000
  # You might want to limit val batches when evaluating all the metrics, as they are time-consuming
  limit_val_batches: 128
  accelerator: gpu
  strategy: ddp
  devices: [0]
  log_every_n_steps: 100

Results

I trained Vocos using the above feature extractor and config, but this also fails with even worse vocoding quality and even lower volume.

Questions

  • Did I miss something in the above feature extractor?
  • Does the default Vocos head expect mel spectrograms generated with certain parameters?
  • Any suggestions to resolve this?

Additional notes

I believe many open-source TTS models use the same code to extract mel spectrograms, so resolving this will help with training Vocos for use with those TTS models.
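
For reference, a hedged sketch of how one might compare the two extractors' outputs on the same clip (module path and parameters are taken from the configs above; the random audio is a stand-in):

import torch
from vocos.feature_extractors import MelSpectrogramFeatures
from matcha_feature_extractor import MatchaMelSpectrogramFeatures

y = torch.randn(1, 22050)  # one second of stand-in audio at 22.05 kHz

vocos_fe = MelSpectrogramFeatures(sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80, padding="center")
matcha_fe = MatchaMelSpectrogramFeatures(mel_mean=-6.38385, mel_std=2.541796)

mel_a = vocos_fe(y)
mel_b = matcha_fe(y)

# Differences in shape, scale, or offset point to the parameter mismatch
print(mel_a.shape, mel_b.shape, mel_a.mean().item(), mel_b.mean().item())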

Best

"error: No module named 'encodec'" while training a vocos

I was trying to train Vocos and ran the command python train.py -c configs/vocos.yaml, but it showed this error:
(autodl_test) root@autodl-container-da7148a975-fb23289d:~/autodl-tmp/PycharmProjects/vocos-main# python train.py -c configs/vocos.yaml usage: train.py [-h] [-c CONFIG] [--print_config[=flags]] [--seed_everything SEED_EVERYTHING] [--trainer CONFIG] [--trainer.logger.help CLASS_PATH_OR_NAME] [--trainer.logger LOGGER] [--trainer.enable_checkpointing {true,false}] [--trainer.callbacks.help CLASS_PATH_OR_NAME] [--trainer.callbacks CALLBACKS] [--trainer.default_root_dir DEFAULT_ROOT_DIR] [--trainer.gradient_clip_val GRADIENT_CLIP_VAL] [--trainer.gradient_clip_algorithm GRADIENT_CLIP_ALGORITHM] [--trainer.num_nodes NUM_NODES] [--trainer.num_processes NUM_PROCESSES] [--trainer.devices DEVICES] [--trainer.gpus GPUS] [--trainer.auto_select_gpus {true,false}] [--trainer.tpu_cores TPU_CORES] [--trainer.ipus IPUS] [--trainer.enable_progress_bar {true,false}] [--trainer.overfit_batches OVERFIT_BATCHES] [--trainer.track_grad_norm TRACK_GRAD_NORM] [--trainer.check_val_every_n_epoch CHECK_VAL_EVERY_N_EPOCH] [--trainer.fast_dev_run FAST_DEV_RUN] [--trainer.accumulate_grad_batches ACCUMULATE_GRAD_BATCHES] [--trainer.max_epochs MAX_EPOCHS] [--trainer.min_epochs MIN_EPOCHS] [--trainer.max_steps MAX_STEPS] [--trainer.min_steps MIN_STEPS] [--trainer.max_time MAX_TIME] [--trainer.limit_train_batches LIMIT_TRAIN_BATCHES] [--trainer.limit_val_batches LIMIT_VAL_BATCHES] [--trainer.limit_test_batches LIMIT_TEST_BATCHES] [--trainer.limit_predict_batches LIMIT_PREDICT_BATCHES] [--trainer.val_check_interval VAL_CHECK_INTERVAL] [--trainer.log_every_n_steps LOG_EVERY_N_STEPS] [--trainer.accelerator.help CLASS_PATH_OR_NAME] [--trainer.accelerator ACCELERATOR] [--trainer.strategy.help CLASS_PATH_OR_NAME] [--trainer.strategy STRATEGY] [--trainer.sync_batchnorm {true,false}] [--trainer.precision PRECISION] [--trainer.enable_model_summary {true,false}] [--trainer.num_sanity_val_steps NUM_SANITY_VAL_STEPS] [--trainer.resume_from_checkpoint RESUME_FROM_CHECKPOINT] [--trainer.profiler.help CLASS_PATH_OR_NAME] [--trainer.profiler PROFILER] [--trainer.benchmark {true,false,null}] [--trainer.deterministic DETERMINISTIC] [--trainer.reload_dataloaders_every_n_epochs RELOAD_DATALOADERS_EVERY_N_EPOCHS] [--trainer.auto_lr_find AUTO_LR_FIND] [--trainer.replace_sampler_ddp {true,false}] [--trainer.detect_anomaly {true,false}] [--trainer.auto_scale_batch_size AUTO_SCALE_BATCH_SIZE] [--trainer.plugins.help CLASS_PATH_OR_NAME] [--trainer.plugins PLUGINS] [--trainer.amp_backend AMP_BACKEND] [--trainer.amp_level AMP_LEVEL] [--trainer.move_metrics_to_cpu {true,false}] [--trainer.multiple_trainloader_mode MULTIPLE_TRAINLOADER_MODE] [--trainer.inference_mode {true,false}] [--model.help CLASS_PATH_OR_NAME] --model CONFIG | CLASS_PATH_OR_NAME | .INIT_ARG_NAME VALUE [--data.help CLASS_PATH_OR_NAME] [--data CONFIG | CLASS_PATH_OR_NAME | .INIT_ARG_NAME VALUE] [--optimizer.help CLASS_PATH_OR_NAME] [--optimizer CONFIG | CLASS_PATH_OR_NAME | .INIT_ARG_NAME VALUE] [--lr_scheduler.help CLASS_PATH_OR_NAME] [--lr_scheduler CONFIG | CLASS_PATH_OR_NAME | .INIT_ARG_NAME VALUE] error: Parser key "data": Problem with given class_path 'vocos.dataset.VocosDataModule': No module named 'encodec'
I am new to model training. Has anyone had the same problem?

Error Training VOCOS (UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte)

I've been trying to train my own model with Vocos using Google Colab, but came across the following error when I ran the train.py file:

'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

The full log is below, but as far as I know I have everything configured correctly according to the README.md. I'm a bit new to audio synthesis, so any help is great!

Full Log:

Global seed set to 4444
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Global seed set to 4444
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
2023-07-09 17:26:37.331298: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

   | Name                           | Type                         | Params
---------------------------------------------------------------------------------
0  | feature_extractor              | MelSpectrogramFeatures       | 0     
1  | feature_extractor.mel_spec     | MelSpectrogram               | 0     
2  | backbone                       | VocosBackbone                | 13.0 M
3  | backbone.embed                 | Conv1d                       | 358 K 
4  | backbone.norm                  | LayerNorm                    | 1.0 K 
5  | backbone.convnext              | ModuleList                   | 12.6 M
6  | backbone.final_layer_norm      | LayerNorm                    | 1.0 K 
7  | head                           | ISTFTHead                    | 526 K 
8  | head.out                       | Linear                       | 526 K 
9  | head.istft                     | ISTFT                        | 0     
10 | multiperioddisc                | MultiPeriodDiscriminator     | 41.1 M
11 | multiperioddisc.discriminators | ModuleList                   | 41.1 M
12 | multiresddisc                  | MultiResolutionDiscriminator | 600 K 
13 | multiresddisc.discriminators   | ModuleList                   | 600 K 
14 | disc_loss                      | DiscriminatorLoss            | 0     
15 | gen_loss                       | GeneratorLoss                | 0     
16 | feat_matching_loss             | FeatureMatchingLoss          | 0     
17 | melspec_loss                   | MelSpecReconstructionLoss    | 0     
18 | melspec_loss.mel_spec          | MelSpectrogram               | 0     
---------------------------------------------------------------------------------
55.2 M    Trainable params
0         Non-trainable params
55.2 M    Total params
220.950   Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/content/drive/MyDrive/Colab Notebooks/train.py", line 9, in <module>
    cli.trainer.fit(model=cli.model, datamodule=cli.datamodule)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1190, in _run_train
    self._run_sanity_check()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1255, in _run_sanity_check
    val_loop._reload_evaluation_dataloaders()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 234, in _reload_evaluation_dataloaders
    self.trainer.reset_val_dataloader()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1635, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 357, in _reset_eval_dataloader
    dataloaders = self._request_dataloader(mode)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 446, in _request_dataloader
    dataloader = source.dataloader()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 524, in dataloader
    return method()
  File "/usr/local/lib/python3.10/dist-packages/vocos/dataset.py", line 38, in val_dataloader
    return self._get_dataloder(self.val_config, train=False)
  File "/usr/local/lib/python3.10/dist-packages/vocos/dataset.py", line 28, in _get_dataloder
    dataset = VocosDataset(cfg, train=train)
  File "/usr/local/lib/python3.10/dist-packages/vocos/dataset.py", line 44, in __init__
    self.filelist = f.read().splitlines()
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
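
A hedged sanity check (illustrative file name): the filelist is expected to be plain UTF-8 text with one audio path per line, and a leading 0xff byte usually means a binary file (for example a .wav) was passed as the filelist path by mistake.

# Peek at the first bytes of the file configured as filelist_path
with open("filelist.val", "rb") as f:
    head = f.read(16)

print(head)  # should look like the start of a file path, not b'\xff...'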

Origin of the model name 'Vocos'

Summary

Is there any ancestry or inspiration behind the model name 'Vocos'?

Detail

I successfully trained Vocos on other datasets (e.g. Japanese), and the model is very good.
First of all, thanks so much for your great model and its open-sourced code/weights.

Just out of curiosity, why is the model named 'Vocos'?
The name reminds me of "vocoder" or "cosine", but at least in the paper there seems to be no description of the naming.

Training error, help needed!

@hubertsiuzdak Hi, thanks for your great work! But I hit an error while trying to train a mel model.
Here's the error log:

Epoch 0:   0%|                                                                                                                                                                           | 0/853 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/zj/workspace/TTS/vocos/train.py", line 6, in <module>
    cli.trainer.fit(model=cli.model, datamodule=cli.datamodule)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 281, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 121, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/torch/optim/adamw.py", line 161, in step
    loss = closure()
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 107, in _wrap_closure
    closure_result = closure()
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
    step_output = self._step_fn()
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 352, in training_step
    return self.model(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
    output = self._forward_module.training_step(*inputs, **kwargs)
  File "/home/zj/workspace/TTS/vocos/vocos/experiment.py", line 142, in training_step
    loss_fm_mp = self.feat_matching_loss(fmap_r=fmap_rs_mp, fmap_g=fmap_gs_mp) / len(fmap_rs_mp)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zj/anaconda3/envs/vocos/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zj/workspace/TTS/vocos/vocos/loss.py", line 112, in forward
    loss += torch.mean(torch.abs(rl - gl))
RuntimeError: The size of tensor a (669) must match the size of tensor b (667) at non-singleton dimension 2

And this is my config (I am using vocos-imdct.yaml):

# pytorch_lightning==1.8.6
seed_everything: 4444

data:
  class_path: vocos.dataset.VocosDataModule
  init_args:
    train_params:
      filelist_path: /home/zj/workspace/TTS/vocos/filelist.train
      sampling_rate: 16000
      num_samples: 12041
      batch_size: 16
      num_workers: 8

    val_params:
      filelist_path: /home/zj/workspace/TTS/vocos/filelist.val
      sampling_rate: 16000
      num_samples: 4201
      batch_size: 16
      num_workers: 8

model:
  class_path: vocos.experiment.VocosExp
  init_args:
    sample_rate: 16000
    initial_learning_rate: 5e-4
    mel_loss_coeff: 45
    mrd_loss_coeff: 0.1
    num_warmup_steps: 0 # Optimizers warmup steps
    pretrain_mel_steps: 0  # 0 means GAN objective from the first iteration

    # automatic evaluation
    evaluate_utmos: true
    evaluate_pesq: true
    evaluate_periodicty: true

    feature_extractor:
      class_path: vocos.feature_extractors.MelSpectrogramFeatures
      init_args:
        sample_rate: 16000
        n_fft: 2048
        hop_length: 200
        n_mels: 80
        padding: center

    backbone:
      class_path: vocos.models.VocosBackbone
      init_args:
        input_channels: 80
        dim: 400
        intermediate_dim: 2448
        num_layers: 8

    head:
      class_path: vocos.heads.IMDCTCosHead
      init_args:
        dim: 400
        mdct_frame_len: 400  # mel-spec hop_length * 2
        padding: center

trainer:
  logger:
    class_path: pytorch_lightning.loggers.TensorBoardLogger
    init_args:
      save_dir: logs/
  callbacks:
    - class_path: pytorch_lightning.callbacks.LearningRateMonitor
    - class_path: pytorch_lightning.callbacks.ModelSummary
      init_args:
        max_depth: 2
    - class_path: pytorch_lightning.callbacks.ModelCheckpoint
      init_args:
        monitor: val_loss
        filename: vocos_checkpoint_{epoch}_{step}_{val_loss:.4f}
        save_top_k: 3
        save_last: true
    - class_path: vocos.helpers.GradNormCallback

  # Lightning calculates max_steps across all optimizer steps (rather than number of batches)
  # This equals to 1M steps per generator and 1M per discriminator
  max_steps: 2000000
  # You might want to limit val batches when evaluating all the metrics, as they are time-consuming
  limit_val_batches: 100
  accelerator: gpu
  strategy: ddp
  devices: [3]
  log_every_n_steps: 100

Speech vibration artifacts

Has anyone observed vibration artifacts when reconstructing speech? I can reproduce this with the provided pretrained model by converting the original audio into a mel spectrogram and using Vocos to reconstruct it (no TTS in between). Example (pay attention to the word "carnival"): Original vs Generated

I trained my own model on my own dataset, which contains the exact speech in question, but still observed the artifacts. Does anyone have suggestions on how to fix the artifact?
