
audiodec's Introduction

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec

Highlights

  • Streamable high-fidelity audio codec for 48 kHz mono speech with 12.8 kbps bitrate.
  • Very low decoding latency on GPU (~6 ms) and CPU (~10 ms) with 4 threads.
  • Efficient two-stage training (with the pre-trained models, training an encoder for a new application takes only a few hours).

Abstract

A good audio codec for live applications such as telecommunication is characterized by three key properties: (1) compression, i.e. the bitrate that is required to transmit the signal should be as low as possible; (2) latency, i.e. encoding and decoding the signal needs to be fast enough to enable communication without or with only minimal noticeable delay; and (3) reconstruction quality of the signal. In this work, we propose an open-source, streamable, and real-time neural audio codec that achieves strong performance along all three axes: it can reconstruct highly natural sounding 48 kHz speech signals while operating at only 12 kbps and running with less than 6 ms (GPU)/10 ms (CPU) latency. An efficient training paradigm is also demonstrated for developing such neural audio codecs for real-world scenarios. [paper] [demo]

Two modes of AudioDec

  1. AutoEncoder (symmetric AudioDec, symAD)
    1-1. Train an AutoEncoder-based codec model from scratch with only metric loss(es) for the first 200k iterations.
    1-2. Fix the encoder, projector, quantizer, and codebook, and train the decoder with the discriminators for the following 500k iterations.
  2. AutoEncoder + Vocoder (AD v0,1,2) (recommended!)
    2-1. Extract the statistics (global mean and variance) of the codes produced by the trained encoder.
    2-2. Train the vocoder with the trained Encoder and stats for 500k iterations.

NEWS

  • 2024/01/03: Update pre-trained models (issue9 and issue11)
  • 2023/05/17: Upload the demo sounds on the demo page
  • 2023/05/13: 1st version is released

Requirements

This repository is tested on Ubuntu 20.04 using a V100 and the following settings.

  • Python 3.8+
  • Cuda 11.0+
  • PyTorch 1.10+

Folder architecture

  • bin: The folder of training, stats extraction, testing, and streaming templates.
  • config: The folder of config files (.yaml).
  • dataloader: The source code for data loading.
  • exp: The folder for saving models.
  • layers: The source code of basic neural layers.
  • losses: The source code of losses.
  • models: The source code of models.
  • slurmlogs: The folder for saving slurm logs.
  • stats: The folder for saving stats.
  • trainer: The source code of trainers.
  • utils: The source code of utilities for the demos.

Run real-time streaming encoding/decoding demo

  1. Please download the whole exp folder and put it in the AudioDec project directory.
  2. Get the list of all I/O devices
$ python -m sounddevice
  3. Run the demo
# The LibriTTS model is recommended for arbitrary microphones because of its robustness to microphone channel mismatches.
# Set up the I/O devices according to the device list obtained above

# w/ GPU
$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model libritts_v1

# w/ CPU
$ python demoStream.py --tx_cuda -1 --rx_cuda -1 --input_device 1 --output_device 4 --model libritts_sym

# The input and output audio will be dumped to input.wav and output.wav
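
If it is easier to pick the device indices programmatically, the same list printed by python -m sounddevice can be queried from Python with the sounddevice package. The following helper is only a sketch and is not part of the repository.

# list_devices.py -- small helper sketch (not part of the repository).
# Prints the device table and the default input/output indices to pass to
# demoStream.py via --input_device / --output_device.
import sounddevice as sd

def main():
    print(sd.query_devices())                  # full table of audio I/O devices
    default_in, default_out = sd.default.device
    print(f"default input device index : {default_in}")
    print(f"default output device index: {default_out}")

if __name__ == "__main__":
    main()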

Run codec demo with files

  1. Please download the whole exp folder and put it in the AudioDec project directory.
  2. Run the demo
## VCTK 48000Hz models
$ python demoFile.py --model vctk_v1 -i xxx.wav -o ooo.wav

## LibriTTS 24000Hz model
$ python demoFile.py --model libritts_v1 -i xxx.wav -o ooo.wav
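
To codec many files at once, demoFile.py can simply be called in a loop. The sketch below is not part of the repository; it assumes it is run from the project root and that the hypothetical folders wavs/ and wavs_codec/ hold the input and output files.

# batch_codec.py -- hypothetical batch wrapper around demoFile.py (sketch only).
import pathlib
import subprocess

MODEL = "vctk_v1"                        # use "libritts_v1" for 24 kHz material
IN_DIR = pathlib.Path("wavs")            # assumed input folder
OUT_DIR = pathlib.Path("wavs_codec")     # assumed output folder
OUT_DIR.mkdir(exist_ok=True)

for wav in sorted(IN_DIR.glob("*.wav")):
    out = OUT_DIR / wav.name
    # Same interface as the commands above: -i input, -o output, --model name.
    subprocess.run(
        ["python", "demoFile.py", "--model", MODEL, "-i", str(wav), "-o", str(out)],
        check=True,
    )
    print(f"{wav} -> {out}")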

Training and testing the whole AudioDec pipeline

  1. Prepare the training/validation/test utterances and put them in three different folders
    ex: corpus/train, corpus/dev, and corpus/test
  2. Modify the paths (ex: /mnt/home/xxx/datasets) in
    submit_codec_vctk.sh
    config/autoencoder/symAD_vctk_48000_hop300.yaml
    config/statistic/symAD_vctk_48000_hop300_clean.yaml
    config/vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean.yaml
  3. Assign the corresponding analyzer and stats in
    config/statistic/symAD_vctk_48000_hop300_clean.yaml
    config/vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean.yaml
  4. Follow the usage instructions in submit_codec_vctk.sh to run the training and testing
# stage 0: training autoencoder from scratch
# stage 1: extracting statistics
# stage 2: training vocoder from scratch
# stage 3: testing (symAE)
# stage 4: testing (AE + Vocoder)

# Run stages 0-4
$ bash submit_codec_vctk.sh --start 0 --stop 4 \
--autoencoder "autoencoder/symAD_vctk_48000_hop300" \
--statistic "statistic/symAD_vctk_48000_hop300_clean" \
--vocoder "vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean"

Training and testing only the AutoEncoder

  1. Prepare the training/validation/test utterances and modify the paths
  2. Follow the usage instructions in submit_autoencoder.sh to run the training and testing
# Train AutoEncoder from scratch
$ bash submit_autoencoder.sh --stage 0 \
--tag_name "autoencoder/symAD_vctk_48000_hop300"

# Resume AutoEncoder from previous iterations
$ bash submit_autoencoder.sh --stage 1 \
--tag_name "autoencoder/symAD_vctk_48000_hop300" \
--resumepoint 200000

# Test AutoEncoder
$ bash submit_autoencoder.sh --stage 2 \
--tag_name "autoencoder/symAD_vctk_48000_hop300" \
--subset "clean_test"

Pre-trained Models

All pre-trained models can be accessed via exp (only the generators are provided).

AutoEncoder | Corpus   | Fs     | Bitrate   | Path
symAD       | VCTK     | 48 kHz | 24 kbps   | exp/autoencoder/symAD_c16_vctk_48000_hop320
symAAD      | VCTK     | 48 kHz | 12.8 kbps | exp/autoencoder/symAAD_vctk_48000_hop300
symAD       | VCTK     | 48 kHz | 12.8 kbps | exp/autoencoder/symAD_vctk_48000_hop300
symAD_univ  | VCTK     | 48 kHz | 12.8 kbps | exp/autoencoder/symADuniv_vctk_48000_hop300
symAD       | LibriTTS | 24 kHz | 6.4 kbps  | exp/autoencoder/symAD_libritts_24000_hop300

Vocoder | Corpus   | Fs     | Path
AD v0   | VCTK     | 48 kHz | exp/vocoder/AudioDec_v0_symAD_vctk_48000_hop300_clean
AD v1   | VCTK     | 48 kHz | exp/vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean
AD v2   | VCTK     | 48 kHz | exp/vocoder/AudioDec_v2_symAD_vctk_48000_hop300_clean
AD_univ | VCTK     | 48 kHz | exp/vocoder/AudioDec_v3_symADuniv_vctk_48000_hop300_clean
AD v1   | LibriTTS | 24 kHz | exp/vocoder/AudioDec_v1_symAD_libritts_24000_hop300_clean

Bonus Track: Denoising

  1. It is easy to perform denoising by simply updating the encoder with noisy-clean pairs while keeping the decoder/vocoder unchanged (a conceptual sketch follows the commands below).
  2. Prepare the noisy-clean corpus and follow the usage instructions in submit_denoise.sh to run the training and testing
# Update the Encoder for denoising
$ bash submit_autoencoder.sh --stage 0 \
--tag_name "denoise/symAD_vctk_48000_hop300"

# Denoise
$ bash submit_autoencoder.sh --stage 2 \
--encoder "denoise/symAD_vctk_48000_hop300" \
--decoder "vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean" \
--encoder_checkpoint 200000 \
--decoder_checkpoint 500000 \
--subset "noisy_test"

# Stream demo w/ GPU
$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model vctk_denoise

# Codec demo w/ files
$ python demoFile.py -i xxx.wav -o ooo.wav --model vctk_denoise
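
Conceptually, the denoising recipe above amounts to fine-tuning only the encoder on noisy-clean pairs while the decoder (or vocoder) stays frozen. The sketch below illustrates that idea with placeholder modules and an L1 loss standing in for the metric loss(es); it is not the repository's trainer.

# Conceptual sketch of encoder-only denoising fine-tuning (placeholder modules).
import torch
import torch.nn as nn

class Encoder(nn.Module):                 # placeholder, not the AudioDec encoder
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 64, kernel_size=7, padding=3)
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):                 # placeholder, not the AudioDec decoder
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(64, 1, kernel_size=7, padding=3)
    def forward(self, z):
        return self.net(z)

encoder, decoder = Encoder(), Decoder()
for p in decoder.parameters():            # keep the decoder/vocoder fixed
    p.requires_grad_(False)

optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
metric_loss = nn.L1Loss()                 # stand-in for the mel/metric loss(es)

noisy = torch.randn(4, 1, 48000)          # noisy inputs (dummy batch)
clean = torch.randn(4, 1, 48000)          # paired clean targets (dummy batch)

recon = decoder(encoder(noisy))           # encode noisy audio, decode with frozen decoder
loss = metric_loss(recon, clean)          # push the reconstruction toward the clean target
optimizer.zero_grad()
loss.backward()
optimizer.step()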

Citation

If you find the code helpful, please cite the following article.

@INPROCEEDINGS{10096509,
  author={Wu, Yi-Chiao and Gebru, Israel D. and Marković, Dejan and Richard, Alexander},
  booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={{A}udio{D}ec: An Open-Source Streaming High-Fidelity Neural Audio Codec}, 
  year={2023},
  doi={10.1109/ICASSP49357.2023.10096509}}

References

The AudioDec repository is developed based on the following repositories:

  • https://github.com/kan-bayashi/ParallelWaveGAN
  • https://github.com/lucidrains/vector-quantize-pytorch
  • https://github.com/jik876/hifi-gan
  • https://github.com/r9y9/wavenet_vocoder
  • https://github.com/chomeyama/SiFiGAN

License

The majority of "AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec" is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: https://github.com/kan-bayashi/ParallelWaveGAN, https://github.com/lucidrains/vector-quantize-pytorch, https://github.com/jik876/hifi-gan, https://github.com/r9y9/wavenet_vocoder, and https://github.com/chomeyama/SiFiGAN are licensed under the MIT license.

FAQ

  1. Have you compared AudioDec with Encodec?
    Please refer to the discussion.
  2. Have you compared AudioDec with other non-neural-network codecs such as Opus?
    Since this paper focuses on providing a well-developed streamable neural codec implementation with an efficient training paradigm and modularized architecture, we only compared AudioDec with SoundStream.
  3. Can you also release the pre-trained discriminators?
    For many applications such as denoising, updating only the encoder achieves almost the same performance as updating the whole model. For applications that involve updating the decoder, such as binaural rendering, it might be better to design discriminators specific to that application. Therefore, we release only the generators.
  4. Can AudioDec encode/decode multi-channel signals?
    Yes, you can train a MIMO model by changing input_channels and output_channels in the config. One lesson learned from training a MIMO model: although the generator is MIMO, reshaping the generator's output signal to mono for the following discriminator markedly improves the MIMO audio quality (see the sketch below).
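
A minimal sketch of that reshaping trick, assuming "reshaping to mono" means folding the channel axis into the batch axis before a single-channel discriminator (the author's exact MIMO setup may differ):

import torch

def to_mono_batch(y: torch.Tensor) -> torch.Tensor:
    # Fold channels into the batch axis: (B, C, T) -> (B*C, 1, T).
    b, c, t = y.shape
    return y.reshape(b * c, 1, t)

y_multi = torch.randn(8, 2, 9600)    # stereo generator output (batch, channels, samples)
y_mono = to_mono_batch(y_multi)      # shape (16, 1, 9600)
# p_fake = discriminator(y_mono)     # the discriminator still sees mono waveforms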

audiodec's People

Contributors

bigpon


audiodec's Issues

Sample rate's effect on hyperparameters

Hi Authors,

Great work, I'm sure to use this going forward. I have a question, however: given the encoder and vocoder, which hyperparameters are affected when the sample rate is changed? I'm looking to use 16 kHz audio instead of 48 kHz. For the sake of simplicity, let us use the configuration files described in the readme.

Thank you, congratulations on the release!

The trained models seem overfitted to their training sets?

I compressed a few examples with the 24 kHz libritts_v1 model and they sounded great at very low bitrates, but when I tried downsampling something from VCTK to 24 kHz and pushing it through the same model, the quality suffered a lot. I've seen the same problem when testing it on some clean speech extracted from a YouTube video.

Since LibriTTS and VCTK are pretty small datasets, is it possible that the pretrained models are overfitted to them a bit too much?

Increasing batch length

Thank you for sharing the code.

If we increase the batch_length (9600) and adv_batch_length (9600) to a longer duration, e.g., 48000, do we need to change anything else to transmit data at 12 kbps? Do we need to modify code_dim?
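
For reference, a back-of-the-envelope check (my own sketch, not an answer from the authors): the bitrate is set by the hop size, sampling rate, and residual-VQ settings, while batch_length only controls the length of the training segments.

# Approximate bitrate of the residual-VQ codec (sketch, not repository code).
# Each frame is coded by codebook_num indices of log2(codebook_size) bits,
# and one frame is produced every hop_size samples.
import math

sampling_rate = 48000
hop_size = 300                 # e.g., symAD_vctk_48000_hop300
codebook_num = 8
codebook_size = 1024

frame_rate = sampling_rate / hop_size                      # 160 frames per second
bits_per_frame = codebook_num * math.log2(codebook_size)   # 8 * 10 = 80 bits
print(frame_rate * bits_per_frame / 1000, "kbps")          # 12.8 kbps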

What data was used by the pretrained model of LibriTTS?

Hello, thank you for your excellent open-source work.
Could you please provide information about the subsets that were used to train the 24 kHz model?
For instance, were subsets like train-clean-100, train-clean-360, or other subsets used?

Multi-GPU training

Hello, thank you for your open-source work. I have a question about this code: can it be trained using multiple GPUs? There seems to be no option to select the GPU in the code.

VCTK dataset

Excuse me, may I ask how the training set and validation set of VCTK are divided?

vq_loss increases and does not converge

I am working on my own dataset, which has 2 channels and a sampling rate of 16000. I paste my config file below; the major changes I made are: 1) sampling rate, 2) data path, 3) input/output channels.

sampling_rate: &sampling_rate 16000
data:
    path: "../ABCS/Audio"
    subset:
        train: "train"
        valid: "dev"
        test:  "test"

###########################################################
#                   MODEL SETTING                         #
###########################################################
model_type: symAudioDec
train_mode: autoencoder
paradigm: efficient

generator_params:
    input_channels: 2
    output_channels: 2 
    encode_channels: 32
    decode_channels: 32
    code_dim: 64
    codebook_num: 8
    codebook_size: 1024
    bias: true
    enc_ratios: [2, 4, 8, 16]
    dec_ratios: [16, 8, 4, 2]
    enc_strides: [3, 4, 5, 5]
    dec_strides: [5, 5, 4, 3]
    mode: 'causal'
    codec: 'audiodec'
    projector: 'conv1d'
    quantier: 'residual_vq'

discriminator_params:
    scales: 3                              # Number of multi-scale discriminator.
    scale_downsample_pooling: "AvgPool1d"  # Pooling operation for scale discriminator.
    scale_downsample_pooling_params:
        kernel_size: 4                     # Pooling kernel size.
        stride: 2                          # Pooling stride.
        padding: 2                         # Padding size.
    scale_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [15, 41, 5, 3]       # List of kernel sizes.
        channels: 128                      # Initial number of channels.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        max_groups: 16                     # Maximum number of groups in downsampling conv layers.
        bias: true
        downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
        nonlinear_activation: "LeakyReLU"  # Nonlinear activation.
        nonlinear_activation_params:
            negative_slope: 0.1
    follow_official_norm: true             # Whether to follow the official norm setting.
    periods: [2, 3, 5, 7, 11]              # List of period for multi-period discriminator.
    period_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [5, 3]               # List of kernel sizes.
        channels: 32                       # Initial number of channels.
        downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        bias: true                         # Whether to use bias parameter in conv layer.
        nonlinear_activation: "LeakyReLU"  # Nonlinear activation.
        nonlinear_activation_params:       # Nonlinear activation parameters.
            negative_slope: 0.1
        use_weight_norm: true              # Whether to apply weight normalization.
        use_spectral_norm: false           # Whether to apply spectral normalization.

###########################################################
#                 METRIC LOSS SETTING                     #
###########################################################
use_mel_loss: true                   # Whether to use Mel-spectrogram loss.
mel_loss_params:
    fs: *sampling_rate
    fft_sizes: [2048]
    hop_sizes: [300]
    win_lengths: [2048]
    window: "hann_window"
    num_mels: 80
    fmin: 0
    fmax: 12000
    log_base: null

use_stft_loss: false                 # Whether to use multi-resolution STFT loss.
stft_loss_params:
    fft_sizes: [1024, 2048, 512]     # List of FFT size for STFT-based loss.
    hop_sizes: [120, 240, 50]        # List of hop size for STFT-based loss
    win_lengths: [600, 1200, 240]    # List of window length for STFT-based loss.
    window: "hann_window"            # Window function for STFT-based loss

use_shape_loss: false                # Whether to use waveform shape loss.
shape_loss_params:
    winlen: [300]

###########################################################
#                  ADV LOSS SETTING                       #
###########################################################
generator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.

discriminator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.

use_feat_match_loss: true
feat_match_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
    average_by_layers: false         # Whether to average loss by #layers in each discriminator.
    include_final_outputs: false     # Whether to include final outputs in feat match loss calculation.

###########################################################
#                  LOSS WEIGHT SETTING                    #
###########################################################
lambda_adv: 0.1          # Loss weight of adversarial loss.
lambda_feat_match: 2.0   # Loss weight of feat match loss.
lambda_vq_loss: 1.0      # Loss weight of vector quantize loss.
lambda_mel_loss: 45.0    # Loss weight of mel-spectrogram loss.
lambda_stft_loss: 45.0   # Loss weight of multi-resolution stft loss.
lambda_shape_loss: 45.0  # Loss weight of multi-window shape loss.
      
###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 64              # Batch size.
batch_length: 9600          # Length of each audio in batch (training w/o adv). Make sure it is divisible by hop_size.
adv_batch_length: 9600      # Length of each audio in batch (training w/ adv). Make sure it is divisible by hop_size.
pin_memory: true            # Whether to pin memory in Pytorch DataLoader.
num_workers: 8              # Number of workers in Pytorch DataLoader.

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
generator_optimizer_type: Adam
generator_optimizer_params:
    lr: 1.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
generator_scheduler_type: StepLR
generator_scheduler_params:
    step_size: 200000      # Generator's scheduler step size.
    gamma: 1.0
generator_grad_norm: -1
discriminator_optimizer_type: Adam
discriminator_optimizer_params:
    lr: 2.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
discriminator_scheduler_type: MultiStepLR
discriminator_scheduler_params:
    gamma: 0.5
    milestones:
        - 200000
        - 400000
        - 600000
        - 800000
discriminator_grad_norm: -1

###########################################################
#                    INTERVAL SETTING                     #
###########################################################
start_steps:                       # Number of steps to start training
    generator: 0
    discriminator: 500000 
train_max_steps: 500000            # Number of training steps. (w/o adv)
adv_train_max_steps: 1000000       # Number of training steps. (w/ adv)
save_interval_steps: 100000        # Interval steps to save checkpoint.
eval_interval_steps: 1000          # Interval steps to evaluate the network.
log_interval_steps: 100            # Interval steps to record the training log.
  1. In the stage 1 training (<500k), the mel_loss seems reasonable, but the vq_loss gets larger and larger, which seems weird.
  2. In the stage 2 training, my mel loss goes much higher. Is the reason 1) that I set the wrong lambda_adv, or 2) a problem caused by the bad vq_loss? What is the recommended way to work on it?

Thank you in advance!

Permissive license + patents

We all really appreciate you publishing this for research.

In terms of using the work, patent rights are not licensed under the current license.
Did you apply for one or more patents before releasing this?

Could you license this under https://opensource.org/license/bsdpluspatent/ ?
It would be great to have a license that does not just cover part of the IP rights, i.e., the copyright.

Thanks in advance!

The test results are different from those in the paper

I tested clean_testset_wav audio in the Valentini dataset according to the description in your paper. The model used is vctk_v1, and the test results are as follows
[test_result screenshot]
The test results are quite different from those in your paper. Can you release the calculation code?

Fix the encoder

first: model_type: symAudioDec

  1. start_steps:
    generator: 0
    discriminator: 0
    train_max_steps: 500000
    adv_train_max_steps: 500000

  2. start_steps:
    generator: 0
    discriminator: 500000
    train_max_steps: 500000
    adv_train_max_steps: 1000000

  3. Which configuration would be faster or more efficient?

  4. My opinion is that both configurations train 500k steps with the discriminator, but if you change the discriminator type, there is a difference between the two:

  • With the first configuration, the 500k steps (with discriminator) need to be trained from the beginning.
  • With the second configuration, you can load the 500k-step checkpoint (trained without the discriminator) and then continue for another 500k steps (with the discriminator).
  • In the end, both train 500k steps with the discriminator, but I do not know which one is better in terms of quality or training time.

Baseline?

AudioDec/codecTrain.py

Lines 22 to 25 in 5ec3ab9

from models.autoencoder.AudioDec import Generator as generator_audiodec
from models.vocoder.HiFiGAN import Generator as generator_hifigan
from models.vocoder.HiFiGAN import Discriminator as discriminator_hifigan
from models.vocoder.UnivNet import Discriminator as discriminator_univnet

Hi @bigpon
It is mentioned in the paper that the baseline is SoundStream, but there is no wav_discriminator or stft_discriminator of SoundStream in your code, only the mel_loss.
Is keeping only the mel_loss better than SoundStream's adv_loss, feat_loss, and com_loss? That would be a great discovery.

download link. needs. more. juice.

Hey,

Looks great. Unfortunately, the download link is already down... and you're only at 185 stars. I don't see this getting any better :)

Any chance you can put the exp.zip on an s3 bucket or somewhere that can handle the downloads?

Thanks!

wget https://ghe.oculus-rep.com/yichiaowu/AudioDec/releases/download/pretrain_models/exp.zip
--2023-05-19 02:49:34--  https://ghe.oculus-rep.com/yichiaowu/AudioDec/releases/download/pretrain_models/exp.zip
Resolving ghe.oculus-rep.com (ghe.oculus-rep.com)... 54.193.191.239, 2600:1f1c:aad:f800:8c1d:af6a:f7b6:ead3
Connecting to ghe.oculus-rep.com (ghe.oculus-rep.com)|54.193.191.239|:443... failed: Connection timed out.
Connecting to ghe.oculus-rep.com (ghe.oculus-rep.com)|2600:1f1c:aad:f800:8c1d:af6a:f7b6:ead3|:443... failed: Network is unreachable.

Are some activation functions missing between some layers?

Thanks for your work. I have trained the model on my own dataset and ran into the same problem as ISSUE7. When I checked the model, I found some differences in the AutoEncoder:

  • Before the encoder output is fed into the Projector, is an activation function needed?
  • Before ConvTranspose1d, is an activation function needed?
  • Should a tanh activation function be added to the final decoder output?

In other popular implementations, they all add those, so I added them:

When I added those and trained again, I got some improvement on unseen datasets compared with your baseline, in the setting where I only trained the AutoEncoder with discriminators and did not fine-tune it with AudioDec.
BTW, I trained the model only on LibriSpeech and AISHELL at a 16 kHz sampling rate and tested it on another clean TTS dataset after 160k training steps. When my model is finished (800k steps in total), I will compare the final results, upload some demos, and share my training config.
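
For concreteness, here is a minimal sketch of the three modifications described above: an activation before the projector, an activation before each ConvTranspose1d, and a tanh on the final decoder output. The module choices and channel sizes are arbitrary placeholders, not the repository's classes.

import torch
import torch.nn as nn

pre_projector_act = nn.ELU()     # activation applied to the encoder output before the projector

# Hypothetical decoder tail illustrating the other two changes.
decoder_tail = nn.Sequential(
    nn.ELU(),                                             # activation before ConvTranspose1d
    nn.ConvTranspose1d(64, 32, kernel_size=10, stride=5),
    nn.ELU(),
    nn.Conv1d(32, 1, kernel_size=7, padding=3),
    nn.Tanh(),                                            # bound the final waveform to [-1, 1]
)

z = torch.randn(2, 64, 100)
print(decoder_tail(z).shape)                              # torch.Size([2, 1, 505])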

Perplexity blows up at start of training

Hi authors,

I am training the AudioDec model from scratch, with a 16 kHz dataset; each file in the dataset is around 20 seconds. I modified the hyper params as mentioned in this thread. As I start training, I observe that the perplexity starts increasing almost immediately. The vqloss is also steadily increasing, as can be seen from the following logs.

AutoEncoder Training
Configuration file=config/autoencoder/symAD_vctk_48000_hop300.yaml
2023-06-28 10:54:59,796 (train:47) INFO: device: gpu
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] sampling_rate = 16000
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] data = {'path': '/mnt/resource_nvme/segmented_20s', 'subset': {'train': 'clean_trainset_84spk_wav', 'valid': 'clean_validset_84spk_wav', 'test': 'clean_testset_wav'}}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] model_type = symAudioDec
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] train_mode = autoencoder
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] paradigm = efficient
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] generator_params = {'input_channels': 1, 'output_channels': 1, 'encode_channels': 32, 'decode_channels': 32, 'code_dim': 64, 'codebook_num': 8, 'codebook_size': 1024, 'bias': True, 'enc_ratios': [2, 4, 8, 16], 'dec_ratios': [16, 8, 4, 2], 'enc_strides': [2, 4, 5, 5], 'dec_strides': [5, 5, 4, 2], 'mode': 'causal', 'codec': 'audiodec', 'projector': 'conv1d', 'quantier': 'residual_vq'}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] discriminator_params = {'scales': 3, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'stride': 2, 'padding': 2}, 'scale_discriminator_params': {'in_channels': 1, 'out_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'channels': 128, 'max_downsample_channels': 1024, 'max_groups': 16, 'bias': True, 'downsample_scales': [4, 4, 4, 4, 1], 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}}, 'follow_official_norm': True, 'periods': [2, 3, 5, 7, 11], 'period_discriminator_params': {'in_channels': 1, 'out_channels': 1, 'kernel_sizes': [5, 3], 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'max_downsample_channels': 1024, 'bias': True, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'use_weight_norm': True, 'use_spectral_norm': False}}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] use_mel_loss = True
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] mel_loss_params = {'fs': 16000, 'fft_sizes': [2048], 'hop_sizes': [200], 'win_lengths': [2048], 'window': 'hann_window', 'num_mels': 80, 'fmin': 0, 'fmax': 7600, 'log_base': None}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] use_stft_loss = False
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] stft_loss_params = {'fft_sizes': [1024, 2048, 512], 'hop_sizes': [120, 240, 50], 'win_lengths': [600, 1200, 240], 'window': 'hann_window'}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] use_shape_loss = False
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] shape_loss_params = {'winlen': [300]}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] generator_adv_loss_params = {'average_by_discriminators': False}
2023-06-28 10:54:59,836 (train:66) INFO: [TrainGAN] discriminator_adv_loss_params = {'average_by_discriminators': False}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] use_feat_match_loss = True
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] feat_match_loss_params = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': False}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] lambda_adv = 1.0
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] lambda_feat_match = 2.0
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] lambda_vq_loss = 1.0
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] lambda_mel_loss = 45.0
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] lambda_stft_loss = 45.0
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] lambda_shape_loss = 45.0
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] batch_size = 2
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] batch_length = 64000
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] adv_batch_length = 9600
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] pin_memory = True
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] num_workers = 96
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] generator_optimizer_type = Adam
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] generator_optimizer_params = {'lr': 0.0001, 'betas': [0.5, 0.9], 'weight_decay': 0.0}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] generator_scheduler_type = StepLR
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] generator_scheduler_params = {'step_size': 200000, 'gamma': 1.0}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] generator_grad_norm = -1
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] discriminator_optimizer_type = Adam
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] discriminator_optimizer_params = {'lr': 0.0002, 'betas': [0.5, 0.9], 'weight_decay': 0.0}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] discriminator_scheduler_type = MultiStepLR
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] discriminator_scheduler_params = {'gamma': 0.5, 'milestones': [200000, 400000, 600000, 800000]}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] discriminator_grad_norm = -1
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] start_steps = {'generator': 0, 'discriminator': 200000}
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] train_max_steps = 200000
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] adv_train_max_steps = 700000
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] save_interval_steps = 100000
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] eval_interval_steps = 1000
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] log_interval_steps = 100
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] config = config/autoencoder/symAD_vctk_48000_hop300.yaml
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] tag = autoencoder/symAD_vctk_48000_hop300
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] exp_root = exp
2023-06-28 10:54:59,837 (train:66) INFO: [TrainGAN] resume = 
2023-06-28 10:54:59,838 (train:66) INFO: [TrainGAN] seed = 1337
2023-06-28 10:54:59,838 (train:66) INFO: [TrainGAN] disable_cudnn = False
2023-06-28 10:54:59,838 (train:66) INFO: [TrainGAN] outdir = exp/autoencoder/symAD_vctk_48000_hop300
2023-06-28 10:54:59,838 (codecTrain:49) INFO: Loading datasets... (batch_lenght: 64000)
2023-06-28 10:54:59,860 (codecTrain:62) INFO: The number of training files = 5638.
2023-06-28 10:54:59,860 (codecTrain:63) INFO: The number of validation files = 5638.
2023-06-28 10:55:02,164 (codecTrain:249) INFO: Train from scratch
2023-06-28 10:55:02,164 (train:108) INFO: The current training step: 0
[train]:   0%|                                                                                                  | 97/200000 [00:08<2:08:56, 25.84it/s]2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_0 = 1.0378.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_1 = 1.3916.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_2 = 2.0831.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_3 = 3.9071.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_4 = 5.6672.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_5 = 5.7495.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_6 = 6.8262.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/ppl_7 = 6.5784.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/train/vqloss = 1.1645.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/mel_loss = 69.3261.
2023-06-28 10:55:11,056 (trainerGAN:333) INFO: (Steps: 100) train/generator_loss = 70.4907.
[train]:   0%|                                                                                                 | 197/200000 [00:11<1:39:00, 33.63it/s]2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_0 = 12.0950.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_1 = 71.2452.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_2 = 117.6926.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_3 = 93.9345.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_4 = 72.0587.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_5 = 60.1578.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_6 = 57.6502.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/ppl_7 = 45.2416.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/train/vqloss = 0.0397.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/mel_loss = 60.2521.
2023-06-28 10:55:14,023 (trainerGAN:333) INFO: (Steps: 200) train/generator_loss = 60.2918.
[train]:   0%|▏                                                                                                | 297/200000 [00:14<1:37:34, 34.11it/s]2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_0 = 42.3088.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_1 = 101.7872.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_2 = 115.1603.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_3 = 39.3077.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_4 = 29.4465.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_5 = 25.6839.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_6 = 25.1760.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/ppl_7 = 23.8064.
2023-06-28 10:55:16,952 (trainerGAN:333) INFO: (Steps: 300) train/train/vqloss = 0.0471.
2023-06-28 10:55:16,953 (trainerGAN:333) INFO: (Steps: 300) train/mel_loss = 50.7779.
2023-06-28 10:55:16,953 (trainerGAN:333) INFO: (Steps: 300) train/generator_loss = 50.8251.
[train]:   0%|▏                                                                                                | 397/200000 [00:17<1:37:39, 34.07it/s]2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_0 = 65.0666.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_1 = 96.9628.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_2 = 102.8317.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_3 = 31.8549.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_4 = 29.1392.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_5 = 26.5810.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_6 = 24.8492.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/ppl_7 = 22.6607.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/train/vqloss = 0.0356.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/mel_loss = 48.2285.
2023-06-28 10:55:19,917 (trainerGAN:333) INFO: (Steps: 400) train/generator_loss = 48.2641.
[train]:   0%|▏                                                                                                | 497/200000 [00:20<1:38:37, 33.71it/s]2023-06-28 10:55:22,886 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_0 = 90.1403.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_1 = 111.8928.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_2 = 109.8676.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_3 = 31.8954.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_4 = 30.4388.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_5 = 26.5104.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_6 = 24.2304.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/ppl_7 = 22.0437.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/train/vqloss = 0.0557.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/mel_loss = 42.6582.
2023-06-28 10:55:22,887 (trainerGAN:333) INFO: (Steps: 500) train/generator_loss = 42.7138.

I tried reducing the batch size and learning rate, but that did not help. Do you have any idea why this may be happening?

LOSS?


data: LibriTTS, type: Autoencoder

Hi @bigpon,
Excuse me, is it normal for the generator loss to increase all the time?
What should the correct loss curves for the generator and the discriminator look like?
Thanks.

LibriTTS dataset

Hi @bigpon
For your libritts_sym model, does the training set include train-clean-100, train-clean-360, and train-other-500, or just part of them?
This is related to the training time.

Some questions about CausalConvTranspose1d in conv_layer.py

Hello, thanks for your useful code. I cannot figure out the CausalConvTranspose1d class. Why is nn.ReplicationPad1d selected for the streaming pad, unlike CausalConv1d, which pads with constant 0? In CausalConvTranspose1d, I found that self.pad_length equals 1 no matter how kernel_size changes, but in CausalConv1d, self.pad_length depends on kernel_size.
Does self.pad_length have no link to kernel_size in CausalConvTranspose1d? So do we never need to change self.pad_length in CausalConvTranspose1d when we change any of its parameters?
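
For comparison, here is a generic sketch of the causal padding used in a plain Conv1d, where the left pad length does depend on kernel_size; it is not the repository's CausalConvTranspose1d, only the baseline idea the question contrasts it with.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCausalConv1d(nn.Module):
    # Generic causal Conv1d: left-pad (kernel_size - 1) * dilation zeros so each
    # output sample depends only on current and past input samples.
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad_length = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):
        x = F.pad(x, (self.pad_length, 0))   # zero-pad on the left only
        return self.conv(x)

x = torch.randn(1, 1, 300)
print(SimpleCausalConv1d(1, 8, kernel_size=7)(x).shape)   # torch.Size([1, 8, 300])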

projector(e)?

self.projector = Projector(
    input_channels=self.encoder.out_channels,
    code_dim=code_dim,
    kernel_size=3,
    stride=1,
    bias=False,
    mode=self.mode,
    model=projector,
)

with torch.no_grad():
    e = self.model["analyzer"].encoder(x)
    z = self.model["analyzer"].projector(e)
    zq, _, _ = self.model["analyzer"].quantizer(z)
y_ = self.model["generator"](zq)
p = self.model["discriminator"](x)
p_ = self.model["discriminator"](y_.detach())

Hi, @bigpon
Can I ask you a few questions?

  1. z = self.model["analyzer"].projector(e): does this step belong to the VQ?
  2. z [16, 64, 32] is quantized to zq [16, 64, 32], with code_dim=64 and codebook_size=1024. Is that equivalent to finding 32 red balls out of 1024 balls? Is the codebook_size too large? 1024/32 = 32, so does it take 32 batches to form a complete initial codebook?
  3. What is the relationship between enc_ratios: [2, 4, 8, 16], encode_channels: 32, and input_channels: 1? I can see that out_channels = 512; how did that come about?
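
On question 3, a plausible reading of the quoted numbers (my assumption, not a confirmed description of the code): each encoder block widens the channels by multiplying encode_channels with the corresponding entry of enc_ratios, so the final encoder width is 32 * 16 = 512, which the projector then maps down to code_dim = 64 before the residual VQ.

# Hypothetical channel bookkeeping consistent with the config values quoted above.
encode_channels = 32
enc_ratios = [2, 4, 8, 16]

channels = [encode_channels * r for r in enc_ratios]
print(channels)          # [64, 128, 256, 512] -> encoder out_channels = 512
code_dim = 64            # the projector maps 512 -> 64, and the VQ operates on code_dim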
