
facodec's Introduction

FAcodec

This project is supported by Amphion.

PyTorch implementation for the training of FAcodec, which was proposed in the paper NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models.

This implementation makes some key improvements to the training pipeline, eliminating the need for any form of annotation, including transcripts, phoneme alignments, and speaker labels. All you need is raw speech files.
With the new training pipeline, it is possible to train the model on more languages with more diverse timbre distributions.
We release the code for training and inference, including a pretrained checkpoint trained on 50k hours of speech data covering over 1 million speakers.

Requirements

  • Python 3.10

Installation

git clone https://github.com/Plachtaa/FAcodec.git
cd FAcodec
pip install -r requirements.txt

If you want to train the model yourself, also install the following packages:

pip install nemo_toolkit['all']
pip install descript-audio-codec

Model storage

We provide pretrained checkpoints trained on 50k hours of speech data.

Model type        | Link
FAcodec           | Hugging Face
FAcodec redecoder | Hugging Face

Demo

Try our model on Hugging Face!

Training

accelerate launch train.py --config ./configs/config.yaml

Before you run the command above, replace the PseudoDataset class in meldataset.py with your own dataset class that loads your raw wave files in the same format (see the sketch below).
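A minimal sketch of what such a dataset might look like, assuming all it needs to do is return raw waveforms at the training sample rate; the class name MyWaveDataset, the directory layout, and the 24 kHz rate are illustrative, so adapt them to the interface PseudoDataset actually exposes:

import glob
import librosa
import torch
from torch.utils.data import Dataset

class MyWaveDataset(Dataset):  # hypothetical drop-in for PseudoDataset
    def __init__(self, data_dir, sr=24000):
        self.files = sorted(glob.glob(f"{data_dir}/**/*.wav", recursive=True))
        self.sr = sr

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Raw speech only: no transcripts, alignments, or speaker labels required.
        wave, _ = librosa.load(self.files[idx], sr=self.sr)
        return torch.from_numpy(wave).float()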
To train the redecoder (the voice conversion model), run:

accelerate launch train_redecoder.py --config ./configs/config_redecoder.yaml

Remember to fill in the checkpoint path of a pretrained FAcodec model in the config file.

Usage

Encode & reconstruct

python reconstruct.py --source <source_wav> --ckpt-path <ckpt_path> --config-path <config_path>

If no --ckpt-path or --config-path is specified, model weights will be downloaded automatically from Hugging Face.
For users in mainland China, set an additional environment variable to point to a Hugging Face mirror endpoint:

HF_ENDPOINT=https://hf-mirror.com python reconstruct.py --source <source_wav>

Extracting representations

import yaml
from modules.commons import build_model, recursive_munch
from hf_utils import load_custom_model_from_hf
import torch
import torchaudio
import librosa

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ckpt_path, config_path = load_custom_model_from_hf("Plachta/FAcodec")
model = build_model(yaml.safe_load(open(config_path))['model_params'])
ckpt_params = torch.load(ckpt_path, map_location="cpu")

for key in ckpt_params:
    model[key].load_state_dict(ckpt_params[key])

_ = [model[key].eval() for key in model]
_ = [model[key].to(device) for key in model]

with torch.no_grad():
    source = "path/to/source.wav"
    # load and resample the waveform to 24 kHz: shape (T,) -> (1, T) on the device
    source_audio = librosa.load(source, sr=24000)[0]
    source_audio = torch.tensor(source_audio).unsqueeze(0).float().to(device)
    # the encoder takes a (batch, channel, time) tensor
    z = model.encoder(source_audio[None, ...])
    # factorized quantization returns the timbre vector and the per-factor codes
    z, quantized, _, _, timbre, codes = model.quantizer(z, source_audio[None, ...], return_codes=True)

where:

  • timbre is the timbre representation, a single vector for each utterance;
  • codes[0] is the prosody representation;
  • codes[1] is the content representation.
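Continuing the snippet above, a quick sanity check of what comes back (assuming the returned objects are tensors; the exact shapes depend on the model config, and the output file name is just an example):

print("timbre:", timbre.shape)            # one global vector per utterance
print("prosody codes:", codes[0].shape)   # frame-level prosody token ids
print("content codes:", codes[1].shape)   # frame-level content token ids

# The discrete codes and the timbre vector can be cached for later use.
torch.save({"prosody": codes[0].cpu(),
            "content": codes[1].cpu(),
            "timbre": timbre.cpu()}, "facodec_tokens.pt")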

Zero-shot voice conversion

python reconstruct_redecoder.py \
    --source <source_wav> \
    --target <target_wav> \
    --codec-ckpt-path <codec_ckpt_path> \
    --redecoder-ckpt-path <redecoder_ckpt_path> \
    --codec-config-path <codec_config_path> \
    --redecoder-config-path <redecoder_config_path>

Same as above: if no checkpoint path or config path is specified, model weights will be downloaded automatically from Hugging Face.

Real-time voice conversion

This codec is fully causal, so it can be used for real-time voice conversion.
The real-time conversion scripts are still under development and will be released soon.

facodec's People

Contributors

plachtaa

facodec's Issues

Possible bug

In meldataset.py, line 68: is clamp meant to be clip?
Also, at line 84,
max_wave_length = max([b[0].size(0) for b in batch])
raises TypeError: 'int' object is not callable.
Should it use shape[0], like the line above it?
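For readers hitting the same error: my reading (an assumption, not confirmed by the maintainer) is that the batch items are NumPy arrays rather than torch tensors, so .size is an integer attribute rather than a method, which is why only shape[0] works for both; likewise NumPy arrays expose clip rather than clamp. A minimal illustration:

import numpy as np
import torch

wave_np = np.zeros(16000)
wave_pt = torch.zeros(16000)

print(wave_pt.size(0))                    # OK: torch tensors expose size() as a method
print(wave_np.shape[0])                   # works for NumPy arrays and torch tensors alike
# wave_np.size(0)                         # TypeError: 'int' object is not callable
print(np.clip(wave_np, -1.0, 1.0).max())  # NumPy uses clip; torch tensors use clamp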

Dataset of the pre-trained checkpoint

Hi, I have a question about the dataset.

As far as I know, the official FACodec checkpoint was trained on Libri-Light.
Was this version of the checkpoint also trained on Libri-Light?
The README only says 50k hours of training data and mentions the possibility of multiple languages.
I'm confused because Libri-Light is known to contain 60k hours, while Libriheavy contains 50k hours.
I wonder about the details of the training data.

Thanks.

Do the prosody codes (codes[0]) work?

I tried testing the codes, specifically for prosody, but it seems the prosody is entangled with the content in codes[1]?

Running train.py raises AttributeError: 'EncDecSpeakerLabelModel' object has no attribute 'infer_segment'

Hello, when running train.py I encounter the following error:

Traceback (most recent call last):
  File "/home/tts/ref/ns3/train.py", line 496, in <module>
    main(args)
  File "/home/tts/ref/ns3/train.py", line 342, in main
    spk_logits = torch.cat([speaker_model.infer_segment(w16.cpu()[..., :wl])[1] for w16, wl in zip(waves_16k, wave_lengths)], dim=0)
  File "/home/tts/ref/ns3/train.py", line 342, in <listcomp>
    spk_logits = torch.cat([speaker_model.infer_segment(w16.cpu()[..., :wl])[1] for w16, wl in zip(waves_16k, wave_lengths)], dim=0)
  File "/opt/conda/envs/amphion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'EncDecSpeakerLabelModel' object has no attribute 'infer_segment'

The nemo-toolkit version is 1.21.0.
I searched NeMo's issues for this error but could not find anything related.

A question about the checkpoints

I noticed that the pretrained checkpoints you provide appear to be weight-only .bin files, while the checkpoints produced by train.py in this repository are .pth files; the file sizes alone differ by about 2.5 GB.
Since I can reach neither Hugging Face nor the HF mirror, I tried using my own trained checkpoint instead: I renamed it to pytorch_model.bin and put it, together with the config, into the checkpoints folder.
I then found that the trained model cannot be used for speech reconstruction, because at reconstruct time the model's keys are:
dict_keys(['encoder', 'quantizer', 'decoder', 'discriminator', 'fa_predictors'])
while the checkpoint's keys are:
Keys in ckpt_params: dict_keys(['net', 'optimizer', 'scheduler', 'iters', 'epoch'])
Is this by design, or am I using it incorrectly?
Finally, how do you manage to disentangle the timbre, content, and pitch of an utterance without any labels or annotations? Which function in which file does this?
Thanks for any answers.
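For anyone stuck on the same key mismatch, a plausible workaround (inferred only from the key names reported above, not a confirmed repo convention) is that the training checkpoint nests the per-module weights under 'net', so a weight-only file can be extracted like this:

import torch

ckpt = torch.load("your_training_checkpoint.pth", map_location="cpu")  # hypothetical path
# The training checkpoint appears to wrap the model weights under 'net',
# alongside optimizer/scheduler state; the pretrained .bin holds only the weights.
state = ckpt["net"] if "net" in ckpt else ckpt
torch.save(state, "pytorch_model.bin")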

Are the requirements incomplete?

For example, TensorBoard is not listed, and at runtime I get an error that audiotools is required, yet it does not appear in the requirements either. Should the requirements file be completed?
The error log also mentions that some modules were compiled against an older NumPy (the 1.x series) while the current environment uses NumPy 2.0.0, which may cause crashes; supporting both NumPy 1.x and 2.x would require recompiling these modules against NumPy 2.0. Does this need an update?
AttributeError: _ARRAY_API not found. May I ask what that API is?

Question about the model

May I ask whether you have actually run the code end-to-end and verified the results?
I ask because many of the weight settings appear inconsistent with the paper.

What do the loss curves look like during your successful training?

Hello,

I've attempted to train FAcodec using my own dataset. However, whether I start from scratch or fine-tune your provided checkpoint, the reconstructed audio clips are just noise. I fine-tuned the model using around 128 hours of Common Voice 18 ZH-TW data. After approximately 20k steps, the loss seemed to converge. Some losses, like feature loss, decreased successfully, while others, such as mel loss and waveform loss, were oscillating.

Do all losses decrease during your training process?

Question about a code detail

Hello, in FAcodec/modules/quantize.py, the forward_v2 function of FApredictors has the line
spk_pred = self.timbre_predictor(timbre)[0]
commented out, so the timbre prediction is None. This later causes

     spk_pred_logits = preds['timbre']
     spk_loss = F.cross_entropy(spk_pred_logits, spk_labels)

to fail because spk_pred_logits is None. Is this a bug?
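If anyone needs an interim workaround, one option (my own guess at a fix, not the author's intended behavior; preds and spk_labels follow the snippet above) is to skip the speaker loss whenever the timbre prediction is None:

import torch
import torch.nn.functional as F

def speaker_loss(preds, spk_labels):
    # Fall back to a zero speaker loss when the timbre predictor is disabled.
    spk_pred_logits = preds['timbre']
    if spk_pred_logits is None:
        return torch.zeros((), device=spk_labels.device)
    return F.cross_entropy(spk_pred_logits, spk_labels)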
