
encodec's Issues

Invalid file: WindowsPath error in encodec on Windows.

πŸ› Bug Report

When I tried to encode from WAV to an ECDC file, Python gave me an "Invalid file" error on Windows. The same operation works fine on Linux; only Windows is affected.

To Reproduce

  1. Install encodec using pip install .
  2. Install pysoundfile using pip
  3. Encode from uncompressed WAV to encodec (.ecdc)

Expected behavior

Encoding from WAV to ECDC succeeds.

Actual Behavior

An "Invalid file" error is raised when trying to open the WAV file.

Your Environment

The error occurs on Windows only:

encodec -b 24 -r MSESTONIA.wav MSESTONIA.ecdc
Traceback (most recent call last):
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\Scripts\encodec.exe\__main__.py", line 7, in <module>
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\site-packages\encodec\__main__.py", line 109, in main
    wav, sr = torchaudio.load(args.input)
  File "C:\Users\marti\AppData\Roaming\Python\Python310\site-packages\torchaudio\backend\soundfile_backend.py", line 205, in load
    with soundfile.SoundFile(filepath, "r") as file_:
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\site-packages\soundfile.py", line 740, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\site-packages\soundfile.py", line 1263, in _open
    raise TypeError("Invalid file: {0!r}".format(self.name))
TypeError: Invalid file: WindowsPath('MSESTONIA.wav')
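The failing call in the traceback is torchaudio.load(args.input) receiving a pathlib.WindowsPath that soundfile rejects. A minimal workaround sketch, assuming the soundfile backend simply needs a plain string path (this is not a confirmed fix from the maintainers):

import torchaudio

input_path = "MSESTONIA.wav"                 # the same file as in the command above
wav, sr = torchaudio.load(str(input_path))   # pass a str instead of a WindowsPath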
  • Python and PyTorch version: Python 3.10.8 and Pytorch 1.13.0+cu117

  • Operating system and version (desktop or mobile): Windows 11 (desktop)

  • Hardware (gpu or cpu, amount of RAM etc.): NVIDIA RTX 3060 Laptop GPU, 16 GB RAM.

  • Martin Eesmaa

SoundStream improved reimplementation

Thanks for publishing this! In the encodec paper you write

For fair evaluation, we also compare EnCodec to our reimplementation of
SoundStream (Zeghidour et al., 2021). [...] Finally, we compare EnCodec to the
SoundStream model from the official implementation available in Lyra 2 at 3.2 kbps and 6 kbps on audio
upsampled to 32 kHz. We also reproduced a version of SoundStream (Zeghidour et al., 2021) with minor
improvements.
Namely, we use the relative feature loss introduced in Section 3.4, and layer normalization
(applied separately for each time step) in the discriminators, except for the first and last layer, which improved
the audio quality during our preliminary studies.

And on https://ai.honu.io/papers/encodec/samples.html you show samples of this reimplementation.
Could you share the source code of your SoundStream reimplementation so this work can be reproduced?

Non-free LICENSE

Non-commercial (NC) clauses are non-free: they are not free software according to the FSF, not open source according to the OSI, and not free culture according to Freedom Defined. I would recommend using CC-BY-SA-4.0, CC-BY-4.0, or CC0-1.0 instead, which are free-culture Creative Commons licenses.

Inference using a GPU

❓ Questions

Hello, my question is the following: is it possible to use CUDA acceleration for compressing and decompressing with EnCodec? From what I've been reading, I couldn't find anything that mentions GPU inference, so I wanted to know whether using the GPU is possible; that would be even more productive and useful. Awaiting an answer, thank you in advance. Att., Lucas.
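Since EnCodec is a regular PyTorch model, a minimal sketch of GPU inference would simply move the model and the input tensor to CUDA. This is an assumption based on standard PyTorch usage, not an officially documented EnCodec feature, and "example.wav" is a hypothetical input file:

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

device = "cuda" if torch.cuda.is_available() else "cpu"
model = EncodecModel.encodec_model_24khz().to(device)
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("example.wav")  # hypothetical input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav.to(device))  # runs the encoder on the GPU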

VQ code vectors

❓ Questions

Thanks for the nice paper/work!

I have a question:
How do I print the VQ code vectors of EnCodec?
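A minimal sketch of how one might inspect them, assuming the discrete codes returned by model.encode() (as in the README snippet quoted in other issues here) are the VQ indices in question; "example.wav" is a hypothetical input file:

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("example.wav")  # hypothetical input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1)  # [B, n_q, T]
print(codes.shape)
print(codes[0, 0, :20])  # first 20 indices of the first codebook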

Real-world Balancer usage question

❓ Questions

Hi, first of all, thanks for sharing the code! It's quite well-written.

I've been trying to understand the practical use of the Balancer class.

In pseudo-code, what I understood from the documentation is that my use of it in a real training loop should look something like this:

z = encoder(input)
reconstruction = decoder(z)
loss_l1 = F.l1_loss(reconstruction, input)
loss_l2 = F.mse_loss(reconstruction, input)
balancer.backward({'l1' : loss_l1, 'l2' : loss_l2}, reconstruction)

So far so good?

Assuming I understood correctly, I still run into an issue once I add a loss term on the latent representation (z) in the above. e.g.

z = encoder(input)
reconstruction = decoder(z)
loss_l1 = F.l1_loss(reconstruction, input)
loss_l2 = F.mse_loss(reconstruction, input)
loss_latent = some_regularization_loss(z)
balancer.backward({'l1': loss_l1, 'l2': loss_l2, 'z': loss_latent}, reconstruction)

Now, the loss on z does not depend on the reconstruction, so this won't work. An alternative is to pass z to balancer.backward(); however, that comes with the cost of backpropagating through the decoder multiple times. What did you do regarding the quantization loss?
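One possible workaround, sketched here purely as an illustration of the question (not the authors' confirmed recipe), continuing the snippet above: backpropagate the latent regularization separately and keep the Balancer for the reconstruction losses only.

# Hedged sketch: handle the z-dependent loss outside the Balancer.
loss_latent.backward(retain_graph=True)  # gradients for the encoder via z
balancer.backward({'l1': loss_l1, 'l2': loss_l2}, reconstruction)  # reconstruction losses only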

Finetuning possible?

❓ Questions

Is it possible to fine-tune the model? If so, how?

Or is only the compression/decompression model provided?

Memory leak in decoding process

πŸ› Bug Report

I'm trying to evaluate EnCodec for streaming audio use cases; however, I noticed that the decoding step seems to accumulate memory very quickly over time. If I turn off decoding, memory usage stays constant. I looked at the code, though I'm not sure why this happens.

To Reproduce

from encodec import EncodecModel
from encodec.utils import convert_audio

import torchaudio
import torch

# Instantiate a pretrained EnCodec model
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load and pre-process the audio waveform
wav, sr = torchaudio.load("CantinaBand60.wav")
wav = wav.unsqueeze(0)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

# Extract discrete codes from EnCodec
frames = []
for offset in range(0, 1440000, 480):
    print(offset)

    encoded_frames = model._encode_frame(wav[:, :, offset: offset + 480])

    frames.append(model._decode_frame(encoded_frames))

# merge the decoded frames
decoded = torch.cat(frames, dim=2)

# save the decoded audio
torchaudio.save("decoded.wav", decoded, model.sample_rate)

Expected behavior

Memory shouldn't persist across frames.

Actual Behavior

Memory usage grows over time eventually resulting in OOM.
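One hedged observation (not a confirmed diagnosis): the reproduction loop above runs without torch.no_grad(), so every decoded frame keeps its autograd graph alive while sitting in the frames list, which on its own can make memory grow. A minimal variant of the same loop with gradients disabled, reusing model and wav from the snippet above:

frames = []
with torch.no_grad():
    for offset in range(0, 1440000, 480):
        encoded_frames = model._encode_frame(wav[:, :, offset: offset + 480])
        frames.append(model._decode_frame(encoded_frames))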

Do we have training examples?

❓ Questions

Hi encodec team,

great job!

We want to reproduce the results, train a new model on a wider range of sound datasets, and add some denoising/dereverberation functionality. Could you please add some training examples?

Wrong device in average_metrics function

πŸ› Bug Report

Since the average_metrics function is called from backward, which should be called on every GPU, the device created here should match the current rank; otherwise torch.distributed.all_reduce will be stuck forever.
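A minimal sketch of the suggested fix, with illustrative names only (the repo's actual variable names may differ): build the tensor used for the reduction on the current rank's CUDA device rather than a fixed one.

import torch
import torch.distributed as dist

values = [0.0, 0.0]  # illustrative metric values gathered on this rank
device = torch.device("cuda", torch.cuda.current_device())  # device of the current rank
tensor = torch.tensor(values, device=device, dtype=torch.float32)
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)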

Wrong bar chart on announcement

This report is regarding the announcement of this project in the blog post. I know this is the wrong place to report it, but I don't know where the right place would be.

The chart on the page is misleading and wrong. Its x-axis indicates that the 10× gain is for MP3. Either the bar lengths have to be swapped or the caption has to change.


Channel-mismatch `RuntimeError` when extracting embedding with 24kHz model

πŸ› Bug Report

Following the "Extracting discrete representations" section in README, I tried to extract the encoded embedding myself. However, running the exact code snippet gave me an error: RuntimeError: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 144006] to have 1 channels, but got 2 channels instead.

To Reproduce

from encodec import EncodecModel
from encodec.utils import convert_audio

import torchaudio
import torch

# Instantiate a pretrained EnCodec model
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load and pre-process the audio waveform
wav, sr = torchaudio.load("test.wav")
wav = wav.unsqueeze(0)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

# Extract discrete codes from EnCodec
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1)  # [B, n_q, T]

where test.wav is any WAV file. I tried with one on the sample page.

Expected behavior

I should be able to get the representation in [B, n_q, T] as described in the code itself.

Actual Behavior

Full traceback:

Traceback (most recent call last):
  File "/home/ubuntu/test.py", line 18, in <module>
    encoded_frames = model.encode(wav)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/model.py", line 144, in encode
    encoded_frames.append(self._encode_frame(frame))
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/model.py", line 161, in _encode_frame
    emb = self.encoder(x)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/modules/seanet.py", line 144, in forward
    return self.model(x)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/modules/conv.py", line 210, in forward
    return self.conv(x)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/modules/conv.py", line 120, in forward
    x = self.conv(x)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 144006] to have 1 channels, but got 2 channels instead

Your Environment

  • Python and PyTorch version: Python 3.10.8 (conda 22.11.1), PyTorch 1.13.0 with CUDA 11.7
  • Operating system and version (desktop or mobile): Ubuntu 22.04.1 on WSL2 (Windows 10 Build 19045)
  • Hardware (gpu or cpu, amount of RAM etc.): RTX 2070 SUPER with 32 GB RAM

incorrect time to expire codes

πŸ› Bug Report

In the current implementation the following incorrect behavior occurs:
An embedding that was not chosen in the previous steps will be expired here, even though it was chosen on that very step a couple of lines above; since self.cluster_size is only updated in the following lines, this embedding gets expired when it should not be.

This can be easily checked by adding these two lines

expired_but_taken = (self.cluster_size < self.threshold_ema_dead_code) & (embed_onehot.sum(0) > 0)
assert torch.any(expired_but_taken)

somewhere here.

I think the correct code should be like in the vector-quantize-pytorch repo, namely this line.
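A hedged sketch of one way the check could be written so that codes selected in the current batch are never expired (an illustrative fragment in the same style as the check above, not the actual fix referenced in vector-quantize-pytorch):

# Only expire codes that are below the dead-code threshold AND unused in this batch.
expired = (self.cluster_size < self.threshold_ema_dead_code) & (embed_onehot.sum(0) == 0)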

Question about bitrate choices

❓ Questions

Hello! Do you intend to train models with a target bitrate of above 24kbps? I didn't see anything in the paper, but maybe I missed it.

I'd be curious to see how 48 and 96kbps models compare to mp3s at higher bitrates.

Thanks for the great work!

Can the decoder models run on Android and iOS?

❓ Questions

Google's Lyra provides support for Android via TFLite but not for iOS (yet). Does Facebook's EnCodec provide models that can run on edge devices for both Android and iOS? If not, are there any estimated timelines for when this could be available?

About audio quality evaluation

❓ Questions

Thank you for the nice work.

I have some questions about the objective evaluation metrics.

  1. Are these metrics (SI-SNR and ViSQOL) perceptually consistent with audio quality?

I know that it is very difficult to evaluate audio quality, so I'm curious how you evaluate the model during ablation studies or during training.

  2. MS-STFT discriminator (complex) vs. MS-STFT discriminator (real)

How is the quality of the model with an MS-STFT discriminator that uses only real values? It would be appreciated if you could share such information.

Thank you!

A question about adversarial loss

❓ Questions

In the paper 3.4 "Discriminative Loss" section, adversarial loss is constructed as $l_g(\hat{x})=\mathbb{E}[max(0,1-D_k(\hat{x}))]$, but in the original hinge loss paper, adversarial loss is constructed as $-\mathbb{E}[D(\hat{x})]$.

So I want to know: why is the adversarial loss in this paper different from the original hinge loss?

Number of codebooks and calculation of bitrates

❓ Questions

I don't understand how you come to the smallest bitrate of 1.5 kbps for the 24kHz model:

If I understand correctly, we use a number of codebooks that is a multiple of 4 (4, 8, 12, ..., so 4 would be the minimum), each codebook carries 10 bits (2^10 = 1024 entries), and the 24 kHz model produces 75 latent codes per second, giving the smallest possible bitrate:
4 * 10 bits * 75 1/s = 3 kbps

However, both the paper and the README state that the lowest bitrate is 1.5 kbps.
Looking at the bitrate progression (1.5, 3, 6, 12, 24), which doubles at each step, wouldn't that rather correspond to 2, 4, 8, 16, 32 codebooks being used? Maybe I am just misinterpreting or missing something; could you please clarify this point?
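For reference, the arithmetic above generalizes to $\text{bitrate} = n_q \cdot \log_2(\text{bins}) \cdot f_r$; with the 1024-entry codebooks and $f_r = 75$ frames per second assumed in the question, $n_q = 2$ gives $2 \cdot 10 \cdot 75 = 1500$ bps $= 1.5$ kbps, $n_q = 4$ gives 3 kbps, and $n_q = 32$ gives 24 kbps.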

Using the causal model with 24 kHz results in the process being "Killed"

πŸ› Bug Report

Using the causal model at 24 kHz results in the process being killed after ~10 seconds. It doesn't matter whether the language model (--lm) is used or not. The 48 kHz model works as expected.

To Reproduce

I've used the following command and got these results:
python3 -m encodec -r -b 1.5 --lm '/home/user/Downloads/Audiofile.wav' '/home/user/Downloads/Audiofile_EnCodec.wav';

Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th" to /home/user/checkpoints/encodec_24khz-d7cc33bc.th 100.0%

Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_lm_24khz-1608e3c0.th" to /home/user/checkpoints/encodec_lm_24khz-1608e3c0.th 100.0%

Killed

user@user:~$

I even deleted the checkpoints folder to make sure that wasn't the issue. I've tested with 16- and 24-bit PCM mono WAV files. Using --hq and --lm together works as expected, so I believe the 24 kHz model is the culprit.

Expected behavior

It should not kill the process and result in a properly encoded and decoded file.

Actual Behavior

It kills the process.

Your Environment

  • Python and PyTorch version: Python 3.11.0, PyTorch: 1.13.0
  • Operating system and version (desktop or mobile): Ubuntu 24.04.01 LTS
  • Hardware (gpu or cpu, amount of RAM etc.): Virtual machine (AMD 5900X, RTX 3080, 16 GB RAM assigned to the VM)

Low quality audio in the demo clip

πŸ™‹β€β™‚οΈ Suggestion

I noticed that the audio in the linked demo (final.mp4) sounds very compressed, even the original part. Even on a phone speaker I can hear a difference compared to test_48k.wav, not to mention with headphones.

I'm aware this is NOT a finished product, but maybe something above 128 kbps AAC would be better for comparing audio compression.

what is the ECDC File?

❓ Questions

Hey, thank you for sharing your work. Sorry for asking what seems like a simple question, but could you correct the two statements below if they are wrong?

  1. An .ecdc file can't be played directly and must be decompressed first in order to output a WAV file that can be played?
  2. An .ecdc file is like a ZIP file: it holds your file, but you can't actually use it until you decompress it?

Are the above two statements correct?

the definition of loss_l

❓ Questions

It seems the definition of 'loss_l' in Figure 1, which connects the VQ and the transformer in the quantizer block, is not described in the paper.

Is there some description of it? Thanks.

How backward balancer when using huggingface accelerate

❓ Questions

When training EnCodec using Hugging Face's accelerate package, is it not possible to use the Balancer?

This is part of my training script:

            self.balancer._set_losses_and_input(
                losses={'t': recon_loss, 'f': m_recon_loss, 'g': ads_loss, 'feat': rfm_loss},
                input=output
            )
            # self.balancer.backward()
            self.accelerator.backward(self.balancer)

and I changed the Balancer a little:

    def __mul__(self, other):
        for name, loss in self.losses.items():
            self.losses[name] = loss * other
        return self  # return the Balancer so accelerate's scaling keeps a handle on it

    def __truediv__(self, other):
        for name, loss in self.losses.items():
            self.losses[name] = loss / other
        return self


    def _set_losses_and_input(self, losses: tp.Dict[str, torch.Tensor], input: torch.Tensor):
        self.losses = losses
        self.input = input
        
    @property
    def metrics(self):
        return self._metrics

    def backward(self):
        losses = self.losses
        input = self.input
        
        norms = {}
        grads = {}
        for name, loss in losses.items():
            grad, = autograd.grad(loss, [input], retain_graph=True)
            if self.per_batch_item:
                dims = tuple(range(1, grad.dim()))
                norm = grad.norm(dim=dims).mean()
            else:
                norm = grad.norm()
            norms[name] = norm
            grads[name] = grad

        count = 1
        if self.per_batch_item:
            count = len(grad)
        avg_norms = average_metrics(self.averager(norms), count)
        total = sum(avg_norms.values())

        self._metrics = {}
        if self.monitor:
            for k, v in avg_norms.items():
                self._metrics[f'ratio_{k}'] = v / total

        total_weights = sum([self.weights[k] for k in avg_norms])
        ratios = {k: w / total_weights for k, w in self.weights.items()}

        out_grad: tp.Any = 0
        for name, avg_norm in avg_norms.items():
            if self.recale_grads:
                scale = ratios[name] * self.total_norm / (self.epsilon + avg_norm)
                grad = grads[name] * scale
            else:
                grad = self.weights[name] * grads[name]
            out_grad += grad

        input.backward(out_grad)

Real-time usage example, and permissive licensing question

❓ Questions

Thanks for releasing this - very exciting work! I have two questions:

  • Do you have examples for real-time usage, or is it currently only set up for conversion of pre-recorded audio files at this time?
  • Lyra V2 is permissively licensed (Apache-2.0), and I'm in the process of getting an open-source demo of it working on the web so that people can use it in their web applications. Would you consider using a permissive license (e.g. CC-BY/MIT/Apache) so that your work can more broadly benefit the open source ecosystem? I'd love to create a JS package for this that anyone can use in their web app.

Reconstruction Loss

❓ Questions

Hello, my question is about the reconstruction loss in the frequency domain. In paragraph 3.4 it is stated that you use a "mel-spectrogram using a normalized STFT"; what type of normalization is meant here? Is it sufficient to use the normalized flag of
torchaudio.transforms.MelSpectrogram, which normalizes "by magnitude after stft"?
Also, in practice the STFT loss is sometimes computed on a log mel-spectrogram for better convergence, so I want to clarify: in your implementation, is S_i from formula 1 a mel-spectrogram or a log mel-spectrogram?
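For concreteness, this is the torchaudio flag referred to above; the parameter values are illustrative placeholders, not the paper's configuration:

from torchaudio.transforms import MelSpectrogram

# normalized=True normalizes the underlying STFT by magnitude, as quoted above.
melspec = MelSpectrogram(sample_rate=24000, n_fft=1024, hop_length=256,
                         n_mels=64, normalized=True)
# spec = melspec(waveform)  # waveform: a [C, T] audio tensor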

EOFError encountered for some special audio lengths

πŸ› Bug Report

An EOFError is encountered for some special audio lengths.
The reason is that when calculating the number of decompressed frames, the wrong order of operations results in a floating-point error. E.g.:

math.ceil(53760 / 24000 * 75) evals to 169
math.ceil(53760 * 75 / 24000) evals to 168

frame_length = int(math.ceil(this_segment_length / model.sample_rate * model.frame_rate))

This line of code should be changed to
frame_length = int(math.ceil(this_segment_length * model.frame_rate / model.sample_rate))

To Reproduce

Create a .wav file with duration 2.24s and content all zeros.

import numpy as np
from scipy.io import wavfile
wavfile.write('bug.wav', 24000, data=np.zeros(53760, dtype=np.int16))

Run encodec compress and decompress

import subprocess
subprocess.run('encodec bug.wav bug.encodec.wav'.split())

gives error message:

Traceback (most recent call last):
  File "/home/chenjiasheng/.local/bin/encodec", line 33, in <module>
    sys.exit(load_entry_point('encodec', 'console_scripts', 'encodec')())
  File "/mnt/d/code/encodec/encodec/__main__.py", line 117, in main
    out, out_sample_rate = decompress(compressed)
  File "/mnt/d/code/encodec/encodec/compress.py", line 185, in decompress
    return decompress_from_file(fo, device=device)
  File "/mnt/d/code/encodec/encodec/compress.py", line 147, in decompress_from_file
    raise EOFError("The stream ended sooner than expected.")
EOFError: The stream ended sooner than expected.

Your Environment

repo version:

commit c79ba28c9199494d106d2c7f56006260528d7b16 (HEAD -> main, origin/main, origin/HEAD)
Author: Alexandre Défossez <[email protected]>
Date:   Tue Jan 24 14:07:56 2023 +0100

get_bandwidth_per_quantizer is incorrect.

πŸ› Bug Report

    def get_bandwidth_per_quantizer(self, sample_rate: int):
        """Return bandwidth per quantizer for a given input sample rate.
        """
        return math.log2(self.bins) * sample_rate / 1000

it should be

        return math.log2(self.bins) * sample_rate / 1000 / self.hop_length
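A quick sanity check of why the division matters, assuming the 24 kHz model's 1024-entry codebooks and a total encoder stride (hop_length) of 320 samples, i.e. 75 frames per second (these figures are my assumptions, consistent with the 75 codes/second mentioned in other issues here): without dividing by hop_length the formula gives $\log_2(1024) \cdot 24000 / 1000 = 240$ kbps per quantizer, whereas with the division it gives $10 \cdot 75 / 1000 = 0.75$ kbps per quantizer, which lines up with bitrates such as 1.5 kbps for two quantizers.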

During Training model's loss doesn't converge

❓ Questions

I have written my custom training loop with a reconstruction loss function, but the loss doesn't converge and bounces around between 600 and 700 (in the case of my loss function) even after 3 to 4 epochs. Can you please explain why that happens?

I first wanted to check the loss convergence of the model only for reconstruction loss without the discriminators.

Loss function code (from the SoundStream paper):

import torch
from torchaudio.transforms import MelSpectrogram

def L_G_rec(x, G_x, eps=1e-4):
    L = 0
    sr = 16000
    for i in range(6, 12):
        s = 2 ** i
        alpha_s = (s / 2) ** 0.5
        melspec = MelSpectrogram(sample_rate=sr, n_fft=s, hop_length=s//4, center=True, pad_mode="reflect", power=2.0, norm="slaney", onesided=True, n_mels=8, mel_scale="htk")
        S_x = melspec(x)
        S_G_x = melspec(G_x)
        loss1 = (S_x - S_G_x).abs().mean()
        loss2 = alpha_s * (((torch.log(S_x.abs() + eps) - torch.log(S_G_x.abs() + eps)) ** 2).sum(dim=-2).add(eps).abs() ** 0.5).mean()
        loss = loss1 + loss2
        L = L + loss
    return L

Training loop code :

model = EncodecModel._get_model(24)
. . . 
for epoch in range(0, EPOCH):
  print("EPOCH number = %d" % epoch)
  for i, input in enumerate(iter(dataloader)):
    with torch.autograd.set_detect_anomaly(True):

      recon_output = model(input)

      gen_loss = L_G_rec(input, recon_output)

      generator_optim.zero_grad()
      gen_loss.backward()
      generator_optim.step()
      if i % 50 == 0:
        print("Training generator loss = %d" % gen_loss)
      save_model(model, generator_optim)

What is the problem with the training loop? Do I need to include a discriminator for convergence?

Changing existing models and training them

❓ Questions

Hello! Could you tell me how I can change your models (24 kHz and 48 kHz) to 16 kHz or 8 kHz? Sometimes we don't use those sample rates (24 kHz and 48 kHz).
Also, how can I train the models?

Entropy coding

❓ Questions

Hi, thank you for the great work.

(I)
I could not figure out the necessity of predicting codebook logits via a transformer.
Why couldn't we use the empirical distribution of codebook usage (frequencies) over a validation set?
I feel like I am missing something here.

(II)
Also, [3] and your work show that entropy coding does not improve much, while [2] demonstrates significantly better results (excluding latency, most likely). Most of the literature on neural image codecs also shows that most of the gains come from entropy coding. Any comments on that? Do you think a different implementation is required to obtain higher compression, or is it just not that important for the audio domain?

(III)
Also, I am writing on behalf of a small, non-commercial, independent research group. We are interested in working further in this direction; if there are possibilities to collaborate, that would be wonderful. We have some ideas and the resources to test them. Your expertise could help us save some time.

@adefossez.
Actually, I am sort of a fan of yours since the speech de-noising paper [1] and later DiffQ. Not surprised you used it in the VQ-VAE. RNNs within auto-encoders look like a signature move, not sure if it is your idea though. ))

[1] Real Time Speech Enhancement in the Waveform Domain
[2] LMCodec: A Low Bitrate Speech Codec with Causal Transformer Models
[3] SoundStream: An End-to-End Neural Audio Codec

Question about better quality models for 48 kHz

Do you have better quality models for 48 kHz?

It would be helpful if you added 32 kbps, 48 kbps, 64 kbps and 96 kbps models to compare against MP3 / AAC / Windows Media Audio.

Thanks for the support!

Pre-trained Discriminator ??

❓ Questions

Do I need to use a pre-trained discriminator, or can I use an untrained discriminator to calculate the adversarial losses?

About the stability of the VQ based approach for codec

❓ Questions

Thanks for sharing the amazing speech codec. Since EnCodec and SoundStream use RVQ to quantize the latent representations, I'm worried about the stability of (R)VQ.

I evaluated Lyra2 at 9.2 kbps and EnCodec at 12 kbps with high-quality data and found that irregular harmonics appear (an example). I guess this is caused by the VQ process; do you have any view on this?

Could this be used to compare audio similarity?

❓ Questions

I'm curious how to extract embeddings, whether that's the output of the compress function / command-line tool, and whether it could be used to compare, via cosine similarity, how similar two audio files are.
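A hedged sketch of one way this could be tried, using the continuous encoder output rather than the compressed bitstream (this is an illustration, not an officially endorsed similarity measure; "a.wav" and "b.wav" are hypothetical files):

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()

def embed(path):
    wav, sr = torchaudio.load(path)
    wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)
    with torch.no_grad():
        emb = model.encoder(wav)          # [1, dim, T] continuous latents
    return emb.mean(dim=-1).squeeze(0)    # time-averaged embedding

sim = torch.nn.functional.cosine_similarity(embed("a.wav"), embed("b.wav"), dim=0)
print(sim.item())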

Questions about dataset mixture strategy

❓ Questions

Thanks for your amazing work!
I have a question about your dataset mixing strategy. Does mixture here refer to:

  1. a combination of audios from different datasets so no change per training audio, or
  2. a superposition of different audio tracks, so each training audio contains multiple tracks from multiple datasets?


Some details about RVQ code

❓ Questions

Hi, when trying to reproduce the training code based on the parts you released, I ran into a question about multi-GPU training: I find that https://github.com/facebookresearch/encodec/blob/main/encodec/quantization/core_vq.py#L150 and https://github.com/facebookresearch/encodec/blob/main/encodec/quantization/core_vq.py#L168 cause DDP training to stop; the problem is that this code makes the multiple GPUs wait for each other. So I deleted this line of code, and now it can be trained with torch DDP. But I don't know whether removing this line will influence performance. Can you give me some advice on whether it can be deleted?

Motivation behind `layer_state` in `StreamingTransformerEncoder`

❓ Questions

I would like to hear some more about your motivation behind your usage of layer_state in the StreamingTransformerEncoder.

My understanding so far is that, for each layer, the previous input x_past is concatenated with x for the keys and values, but not for the queries. This effectively means that the matmul between queries and keys is not just attending to the current input but also to part of the previous input x_past.

I'm not entirely sure how to interpret this and this maybe due to me not being able to introspect your training strategy. To my understanding, x and x_past should be independent token sequences, in this case it seems strange to allow the transformer to attend to a concatenation of these sequences. Alternatively, x and x_past originate from the same audio clip, in this case I don't understand why you wouldn't just increase the context length explicitly.

I tried to find other transformer implementations that do something similar, and the only thing that comes close is Transformer-XL. There is a major difference, however: they propagate the output of the transformer layer stack to the next step, whereas your implementation propagates the input.

I may be missing something entirely, so please excuse my ignorance in that case; nonetheless, I would really appreciate it if you could shed some light on this 😇
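To make the pattern described above concrete, here is a hedged sketch of the mechanism as I understand it (illustrative PyTorch, not the repo's actual StreamingTransformerEncoder code):

import torch
import torch.nn as nn

def streaming_layer_step(attn: nn.MultiheadAttention, x: torch.Tensor, x_past: torch.Tensor):
    # x and x_past have shape [T, B, C] (batch_first=False).
    # Keys/values are built from the cached previous input plus the current one;
    # queries come only from the current input.
    kv = torch.cat([x_past, x], dim=0)
    out, _ = attn(x, kv, kv)
    # The new cache for this layer is its current input, not its output.
    return out, x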

Trying to compress, what's the command?

❓ Questions

Sitting on Linux Mint with Python 3.8.
In the terminal, inside the EnCodec folder, I have an audio file sound.mp3.

What is the terminal command to get it to work?
I can't figure it out...
Can you help?
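For reference, the command-line invocations quoted in other issues on this page look like the following; whether these exact flags apply here is not confirmed, and since the tool is invoked on WAV files elsewhere, sound.mp3 may first need converting to WAV (e.g. with ffmpeg):

# WAV -> .ecdc, mirroring the invocation in the Windows issue above:
encodec -b 24 -r sound.wav sound.ecdc

# WAV -> WAV round trip through the codec, mirroring the 24 kHz "Killed" issue above:
python3 -m encodec -r -b 1.5 --lm sound.wav sound_decoded.wav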

Questions about the pre-trained language model

❓ Questions

Thanks for the great work and the shared code! I have some questions about the pre-trained transformer language model:

Could you explain in more detail the supervision for training the transformer (shown as L_l in Fig. 1 of your paper)? My understanding is that you use a pre-trained language model and train some linear layers to model the distribution of codewords for each frame, but is there any other supervision for modeling the distribution, or is the transformer also jointly optimized with the whole encoder and decoder?

Looking forward to your reply!

Could this be used for a better quality audiolm

❓ Questions

Within the SoundStream paper, the Google team used the residual vector quantization model to generate music.

I was wondering, since the architecture is very similar, whether you have thought about using it to generate music.

I've tried OpenAI's Jukebox and the output is fairly noisy.

Kudos on this great work!
