
facodec's Introduction

FAcodec

This project is supported by Amphion.

PyTorch implementation for the training of FAcodec, which was proposed in the paper NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.

This implementation makes some key improvements to the training pipeline, eliminating the need for any form of annotation, including transcripts, phoneme alignments, and speaker labels. All you need is raw speech files.
With the new training pipeline, it is possible to train the model on more languages with more diverse timbre distributions.
We release the code for training and inference, including a pretrained checkpoint trained on 50k hours of speech data covering over 1 million speakers.

Requirements

  • Python 3.10

Installation

git clone https://github.com/Plachtaa/FAcodec.git
cd FAcodec
pip install -r requirements.txt

If you want to train the model yourself, also install the following packages:

pip install nemo_toolkit['all']
pip install descript-audio-codec

Model storage

We provide pretrained checkpoints trained on 50k hours of speech data.

Model type        | Link
FAcodec           | Hugging Face
FAcodec redecoder | Hugging Face

Demo

Try our model on Hugging Face!

Training

accelerate launch train.py --config ./configs/config.yaml

Before you run the command above, replace the PseudoDataset class in meldataset.py with your own dataset class that loads your raw wave files in the same format (see the sketch below).
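A minimal sketch of what such a dataset might look like, assuming all it needs to do is return raw waveforms at the training sample rate; the class name MyWaveDataset, the directory layout, and the 24 kHz rate are illustrative, so adapt them to the interface PseudoDataset actually exposes:

import glob
import librosa
import torch
from torch.utils.data import Dataset

class MyWaveDataset(Dataset):  # hypothetical drop-in for PseudoDataset
    def __init__(self, data_dir, sr=24000):
        self.files = sorted(glob.glob(f"{data_dir}/**/*.wav", recursive=True))
        self.sr = sr

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Raw speech only: no transcripts, alignments, or speaker labels required.
        wave, _ = librosa.load(self.files[idx], sr=self.sr)
        return torch.from_numpy(wave).float()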
To train the redecoder (the voice conversion model), run:

accelerate launch train_redecoder.py --config ./configs/config_redecoder.yaml

Remember to fill in the checkpoint path of a pretrained FAcodec model in the config file.

Usage

Encode & reconstruct

python reconstruct.py --source <source_wav> --ckpt-path <ckpt_path> --config-path <config_path>

If no --ckpt-path or --config-path is specified, model weights will be downloaded automatically from Hugging Face.
For users in mainland China, set an additional environment variable to point to a Hugging Face mirror endpoint:

HF_ENDPOINT=https://hf-mirror.com python reconstruct.py --source <source_wav>

Extracting representations

import yaml
from modules.commons import build_model, recursive_munch
from hf_utils import load_custom_model_from_hf
import torch
import torchaudio
import librosa

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ckpt_path, config_path = load_custom_model_from_hf("Plachta/FAcodec")
model = build_model(yaml.safe_load(open(config_path))['model_params'])
ckpt_params = torch.load(ckpt_path, map_location="cpu")

for key in ckpt_params:
    model[key].load_state_dict(ckpt_params[key])

_ = [model[key].eval() for key in model]
_ = [model[key].to(device) for key in model]

with torch.no_grad():
    source = "path/to/source.wav"
    # load and resample the waveform to 24 kHz: shape (T,) -> (1, T) on the device
    source_audio = librosa.load(source, sr=24000)[0]
    source_audio = torch.tensor(source_audio).unsqueeze(0).float().to(device)
    # the encoder takes a (batch, channel, time) tensor
    z = model.encoder(source_audio[None, ...])
    # factorized quantization returns the timbre vector and the per-factor codes
    z, quantized, _, _, timbre, codes = model.quantizer(z, source_audio[None, ...], return_codes=True)

where:

  • timbre is the timbre representation, a single vector for each utterance;
  • codes[0] is the prosody representation;
  • codes[1] is the content representation.
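Continuing the snippet above, a quick sanity check of what comes back (assuming the returned objects are tensors; the exact shapes depend on the model config, and the output file name is just an example):

print("timbre:", timbre.shape)            # one global vector per utterance
print("prosody codes:", codes[0].shape)   # frame-level prosody token ids
print("content codes:", codes[1].shape)   # frame-level content token ids

# The discrete codes and the timbre vector can be cached for later use.
torch.save({"prosody": codes[0].cpu(),
            "content": codes[1].cpu(),
            "timbre": timbre.cpu()}, "facodec_tokens.pt")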

Zero-shot voice conversion

python reconstruct_redecoder.py \
    --source <source_wav> \
    --target <target_wav> \
    --codec-ckpt-path <codec_ckpt_path> \
    --redecoder-ckpt-path <redecoder_ckpt_path> \
    --codec-config-path <codec_config_path> \
    --redecoder-config-path <redecoder_config_path>

Same as above: if no checkpoint path or config path is specified, model weights will be downloaded automatically from Hugging Face.

Real-time voice conversion

This codec is fully causal, so it can be used for real-time voice conversion.
The real-time conversion scripts are still under development and will be released soon.

facodec's People

Contributors

plachtaa

facodec's Issues

Possible bug

In meldataset.py, line 68: is clamp meant to be clip?
Also, at line 84,
max_wave_length = max([b[0].size(0) for b in batch])
raises TypeError: 'int' object is not callable.
Should it use shape[0], like the line above it?
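For readers hitting the same error: my reading (an assumption, not confirmed by the maintainer) is that the batch items are NumPy arrays rather than torch tensors, so .size is an integer attribute rather than a method, which is why only shape[0] works for both; likewise NumPy arrays expose clip rather than clamp. A minimal illustration:

import numpy as np
import torch

wave_np = np.zeros(16000)
wave_pt = torch.zeros(16000)

print(wave_pt.size(0))                    # OK: torch tensors expose size() as a method
print(wave_np.shape[0])                   # works for NumPy arrays and torch tensors alike
# wave_np.size(0)                         # TypeError: 'int' object is not callable
print(np.clip(wave_np, -1.0, 1.0).max())  # NumPy uses clip; torch tensors use clamp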

Dataset of the pre-trained checkpoint

Hi, I have a question about the dataset.

As far as I know, the official FACodec checkpoint was trained on Libri-Light.
Was this version of the checkpoint also trained on Libri-Light?
The README only says 50k hours of training data and mentions the possibility of multiple languages.
I'm confused because Libri-Light is known to contain 60k hours, while Libriheavy contains 50k hours.
I wonder about the details of the training data.

Thanks.

Do the prosody codes (codes[0]) work?

I tried testing the codes, specifically for prosody, but it seems the prosody is entangled with the content in codes[1]?

Running train.py raises AttributeError: 'EncDecSpeakerLabelModel' object has no attribute 'infer_segment'

Hello, when running train.py I encounter the following error:

Traceback (most recent call last):
  File "/home/tts/ref/ns3/train.py", line 496, in <module>
    main(args)
  File "/home/tts/ref/ns3/train.py", line 342, in main
    spk_logits = torch.cat([speaker_model.infer_segment(w16.cpu()[..., :wl])[1] for w16, wl in zip(waves_16k, wave_lengths)], dim=0)
  File "/home/tts/ref/ns3/train.py", line 342, in <listcomp>
    spk_logits = torch.cat([speaker_model.infer_segment(w16.cpu()[..., :wl])[1] for w16, wl in zip(waves_16k, wave_lengths)], dim=0)
  File "/opt/conda/envs/amphion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'EncDecSpeakerLabelModel' object has no attribute 'infer_segment'

The nemo-toolkit version is 1.21.0.
I searched NeMo's issues for this error but could not find anything related.

A question about the checkpoints

I noticed that the pretrained checkpoints you provide appear to be weight-only .bin files, while the checkpoints produced by train.py in this repository are .pth files; the file sizes alone differ by about 2.5 GB.
Since I can reach neither Hugging Face nor the HF mirror, I tried using my own trained checkpoint instead: I renamed it to pytorch_model.bin and put it, together with the config, into the checkpoints folder.
I then found that the trained model cannot be used for speech reconstruction, because at reconstruct time the model's keys are:
dict_keys(['encoder', 'quantizer', 'decoder', 'discriminator', 'fa_predictors'])
while the checkpoint's keys are:
Keys in ckpt_params: dict_keys(['net', 'optimizer', 'scheduler', 'iters', 'epoch'])
Is this by design, or am I using it incorrectly?
Finally, how do you manage to disentangle the timbre, content, and pitch of an utterance without any labels or annotations? Which function in which file does this?
Thanks for any answers.
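For anyone stuck on the same key mismatch, a plausible workaround (inferred only from the key names reported above, not a confirmed repo convention) is that the training checkpoint nests the per-module weights under 'net', so a weight-only file can be extracted like this:

import torch

ckpt = torch.load("your_training_checkpoint.pth", map_location="cpu")  # hypothetical path
# The training checkpoint appears to wrap the model weights under 'net',
# alongside optimizer/scheduler state; the pretrained .bin holds only the weights.
state = ckpt["net"] if "net" in ckpt else ckpt
torch.save(state, "pytorch_model.bin")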

Are the requirements incomplete?

For example, TensorBoard is not listed, and at runtime I get an error that audiotools is required, yet it does not appear in the requirements either. Should the requirements file be completed?
The error log also mentions that some modules were compiled against an older NumPy (the 1.x series) while the current environment uses NumPy 2.0.0, which may cause crashes; supporting both NumPy 1.x and 2.x would require recompiling these modules against NumPy 2.0. Does this need an update?
AttributeError: _ARRAY_API not found. May I ask what that API is?

Question about the model

May I ask whether you have actually run the code end-to-end and verified the results?
I ask because many of the weight settings appear inconsistent with the paper.

What do the loss curves look like during your successful training?

Hello,

I've attempted to train FAcodec using my own dataset. However, whether I start from scratch or fine-tune your provided checkpoint, the reconstructed audio clips are just noise. I fine-tuned the model using around 128 hours of Common Voice 18 ZH-TW data. After approximately 20k steps, the loss seemed to converge. Some losses, like feature loss, decreased successfully, while others, such as mel loss and waveform loss, were oscillating.

Do all losses decrease during your training process?

Question about a code detail

Hello, in FAcodec/modules/quantize.py, the forward_v2 function of FApredictors has the line
spk_pred = self.timbre_predictor(timbre)[0]
commented out, so the timbre prediction is None. This later causes

     spk_pred_logits = preds['timbre']
     spk_loss = F.cross_entropy(spk_pred_logits, spk_labels)

to fail because spk_pred_logits is None. Is this a bug?
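If anyone needs an interim workaround, one option (my own guess at a fix, not the author's intended behavior; preds and spk_labels follow the snippet above) is to skip the speaker loss whenever the timbre prediction is None:

import torch
import torch.nn.functional as F

def speaker_loss(preds, spk_labels):
    # Fall back to a zero speaker loss when the timbre predictor is disabled.
    spk_pred_logits = preds['timbre']
    if spk_pred_logits is None:
        return torch.zeros((), device=spk_labels.device)
    return F.cross_entropy(spk_pred_logits, spk_labels)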
