facebookresearch / encodec
State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio.
License: MIT License
How to use on iOS and Android
When I tried to encode a WAV file to an ECDC file on Windows, Python raised an "invalid file" error. The same command works well on Linux; only Windows fails.
pip install .
Encoding from WAV to ECDC should succeed.
Instead, it reports an invalid file when trying to open the WAV file.
The error occurs on Windows only:
encodec -b 24 -r MSESTONIA.wav MSESTONIA.ecdc
Traceback (most recent call last):
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\Scripts\encodec.exe\__main__.py", line 7, in <module>
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\site-packages\encodec\__main__.py", line 109, in main
    wav, sr = torchaudio.load(args.input)
  File "C:\Users\marti\AppData\Roaming\Python\Python310\site-packages\torchaudio\backend\soundfile_backend.py", line 205, in load
    with soundfile.SoundFile(filepath, "r") as file_:
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\site-packages\soundfile.py", line 740, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "C:\Users\marti\AppData\Local\Programs\Python\Python310\lib\site-packages\soundfile.py", line 1263, in _open
    raise TypeError("Invalid file: {0!r}".format(self.name))
TypeError: Invalid file: WindowsPath('MSESTONIA.wav')
Python and PyTorch version: Python 3.10.8 and PyTorch 1.13.0+cu117
Operating system and version (desktop or mobile): Windows 11 (desktop)
Hardware (gpu or cpu, amount of RAM etc.): NVIDIA RTX 3060 Laptop GPU, 16 GB RAM.
Martin Eesmaa
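The traceback above ends with soundfile rejecting a WindowsPath object, which suggests that the installed soundfile version accepts plain strings but not path-like objects. A possible workaround, sketched here under that assumption, is to convert the path to a string before it reaches torchaudio.load. The helper name to_str_path is hypothetical and is not part of EnCodec, torchaudio or soundfile.

```python
import os
from pathlib import Path


def to_str_path(path):
    """Return the plain-string form of a str or path-like object.

    Hypothetical helper: the assumption is that the audio backend
    only understands str, so pathlib objects are converted first.
    """
    return os.fspath(path)


wav_path = Path("MSESTONIA.wav")
converted = to_str_path(wav_path)
print(type(converted).__name__, converted)

# Usage sketch (needs torchaudio, so it is left as a comment here):
# wav, sr = torchaudio.load(to_str_path(wav_path))
```

Upgrading soundfile may also help if newer releases accept path-like objects, but that is not verified here.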
Please see the attached screenshot for information. I'm not sure if this requires any more context. I installed the latest version of PyTorch (using the configuration in the screenshot below from https://pytorch.org) and have Python 3.10.9.
Thanks for publishing this! In the EnCodec paper you write:
For fair evaluation, we also compare EnCodec to our reimplementation of
SoundStream (Zeghidour et al., 2021). [...] Finally, we compare EnCodec to the
SoundStream model from the official implementation available in Lyra 2 at 3.2 kbps and 6 kbps on audio
upsampled to 32 kHz. We also reproduced a version of SoundStream (Zeghidour et al., 2021) with minor
improvements. Namely, we use the relative feature loss introduce in Section 3.4, and layer normalization
(applied separately for each time step) in the discriminators, except for the first and last layer, which improved
the audio quality during our preliminary studies.
And on https://ai.honu.io/papers/encodec/samples.html you show samples of this reimplementation.
Could you share the source code of your SoundStream reimplementation so this work can be reproduced?
Non-commercial (NC) clauses are non-free: they do not qualify as free software according to the FSF, open source according to the OSI, or free culture according to Freedom Defined. I would recommend using CC-BY-SA-4.0, CC-BY-4.0, or CC0-1.0 instead, which are free-culture Creative Commons licenses.
Hello, my question is the following: is it possible in any way to use CUDA acceleration for compressing and decompressing with EnCodec? From what I've been reading, I couldn't find anything that mentions GPU inference, so I wanted to know whether it's possible to use the GPU. That would be even more productive and useful. Awaiting an answer, thank you in advance. Regards, Lucas.
Thanks for the nice paper/work!
I've a question:
How do I print the VQ code vectors of EnCodec?
Hi, first of all, thanks for sharing the code! It's quite well-written.
I've been trying to understand the practical use of the Balancer class.
In pseudo-code, what I understood from the documentation is that my use of it in a real training loop should look something like this:
z = encoder(input)
reconstruction = decoder(z)
loss_l1 = F.l1_loss(reconstruction, input)
loss_l2 = F.mse_loss(reconstruction, input)
balancer.backward({'l1': loss_l1, 'l2': loss_l2}, reconstruction)
So far so good?
Assuming I understood correctly, I still run into an issue once I add a loss term on the latent representation (z) in the above, e.g.:
z = encoder(input)
reconstruction = decoder(z)
loss_l1 = F.l1_loss(reconstruction, input)
loss_l2 = F.mse_loss(reconstruction, input)
loss_latent = some_regularization_loss(z)
balancer.backward({'l1': loss_l1, 'l2': loss_l2, 'z': loss_latent}, reconstruction)
Now, the loss on z does not depend on the reconstruction, so this won't work. An alternative is to pass z to balancer.backward(); however, that comes at the cost of backpropagating through the decoder multiple times. What did you do regarding the quantization loss?
Is it possible to fine-tune the model? If so, how?
Or is only the compression/decompression model provided?
I'm trying to evaluate EnCodec for streaming audio use cases; however, I noticed that the decoding step seems to accumulate memory very quickly over time. If I turn off decoding, memory usage stays constant. I looked at the code, but I'm not sure why.
from encodec import EncodecModel
from encodec.utils import convert_audio
import torchaudio
import torch
# Instantiate a pretrained EnCodec model
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)
# Load and pre-process the audio waveform
wav, sr = torchaudio.load("CantinaBand60.wav")
wav = wav.unsqueeze(0)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
# Extract discrete codes from EnCodec
frames = []
for offset in range(0, 1440000, 480):
    print(offset)
    encoded_frames = model._encode_frame(wav[:, :, offset: offset + 480])
    frames.append(model._decode_frame(encoded_frames))
# merge the decoded frames
decoded = torch.cat(frames, dim=2)
# save the decoded audio
torchaudio.save("decoded.wav", decoded, model.sample_rate)
Memory shouldn't be persisted across frames.
Memory usage grows over time eventually resulting in OOM.
Are you offering a commercial-friendly license elsewhere? Startups could really use it.
Hi EnCodec team,
great job!
We want to reproduce the results, train a new model on a wider range of sound datasets, and add denoising/dereverberation functionality. Could you please add some training examples?
This report is regarding the announcement of this project on the blog-post. I know this is the wrong place to report it, but I don't know where the right place would be.
The chart on the page is misleading and wrong. It shows an x-axis that indicates that the 10× gain is for mp3. Either the bar lengths have to be swapped or the caption has to change.
Following the "Extracting discrete representations" section in README, I tried to extract the encoded embedding myself. However, running the exact code snippet gave me an error: RuntimeError: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 144006] to have 1 channels, but got 2 channels instead.
from encodec import EncodecModel
from encodec.utils import convert_audio
import torchaudio
import torch
# Instantiate a pretrained EnCodec model
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)
# Load and pre-process the audio waveform
wav, sr = torchaudio.load("test.wav")
wav = wav.unsqueeze(0)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
# Extract discrete codes from EnCodec
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1)  # [B, n_q, T]
where test.wav is any WAV file. I tried with one on the sample page.
I should be able to get the representation in [B, n_q, T] as described in the code itself.
Full traceback:
Traceback (most recent call last):
  File "/home/ubuntu/test.py", line 18, in <module>
    encoded_frames = model.encode(wav)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/model.py", line 144, in encode
    encoded_frames.append(self._encode_frame(frame))
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/model.py", line 161, in _encode_frame
    emb = self.encoder(x)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/modules/seanet.py", line 144, in forward
    return self.model(x)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/modules/conv.py", line 210, in forward
    return self.conv(x)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/encodec/modules/conv.py", line 120, in forward
    x = self.conv(x)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ubuntu/miniconda3/envs/dev/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [32, 1, 7], expected input[1, 2, 144006] to have 1 channels, but got 2 channels instead
In the current implementation the following incorrect behavior occurs:
An embedding that was not chosen in previous steps is expired here, even if it was chosen on this very step a couple of lines above. Because self.cluster_size is only updated in the lines that follow, the embedding is expired although it should not be.
This can easily be checked by adding these two lines somewhere here:
expired_but_taken = (self.cluster_size < self.threshold_ema_dead_code) & (embed_onehot.sum(0) > 0)
assert torch.any(expired_but_taken)
I think the correct code should be like in vector-quantize-pytorch repo, namely this line.
Hello! Do you intend to train models with a target bitrate of above 24kbps? I didn't see anything in the paper, but maybe I missed it.
I'd be curious to see how 48 and 96kbps models compare to mp3s at higher bitrates.
Thanks for the great work!
Google's Lyra provides support for Android via TFLite but not for iOS (yet). Does Facebook's EnCodec provide models that can run on edge devices for both Android and iOS? If not, are there any estimated timelines for when this could be available?
Thank you for the nice work.
I have some questions about objective evaluation metrics.
I know that it is very difficult to evaluate audio quality, so I'm curious how you evaluated the model during ablation studies and during training.
How is the quality of the model when the MS-STFT discriminator uses only real values? It would be appreciated if you could share such information.
Thank you!
In Section 3.4 ("Discriminative Loss") of the paper, the adversarial loss is constructed as [equation image missing].
So I want to know: why is the adversarial loss in this paper different from the original hinge loss?
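For reference, the standard hinge formulation for GANs and the generator term that, as far as I recall, Section 3.4 uses can be put side by side. The notation is an assumption of this sketch: D_k is the k-th discriminator, K the number of discriminators, x the real audio and \hat{x} the reconstruction. The second formula in particular should be checked against the paper before relying on it.

```latex
% Standard hinge GAN losses for a single discriminator D
\mathcal{L}_D = \mathbb{E}_{x}\big[\max(0,\, 1 - D(x))\big]
              + \mathbb{E}_{\hat{x}}\big[\max(0,\, 1 + D(\hat{x}))\big],
\qquad
\mathcal{L}_G = -\,\mathbb{E}_{\hat{x}}\big[D(\hat{x})\big]

% Generator term as recalled from Section 3.4 (to be verified)
\ell_g(\hat{x}) = \frac{1}{K} \sum_{k=1}^{K} \max\big(0,\, 1 - D_k(\hat{x})\big)
```

If that recollection is right, the difference the question points at is in the generator term: the standard form keeps rewarding larger D(\hat{x}) without bound, while the clipped form stops contributing gradient once D_k(\hat{x}) \ge 1.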
I don't understand how you come to the smallest bitrate of 1.5 kbps for the 24kHz model:
If I understand correctly, we take a multiple of 4 number of codebooks (4, 8, 12, ... so 4 would be the minimum), and we have 10 bits per codebook (2^10 = 1024 entries), and for the 24kHz model 75 latent codes per second, giving us the smallest possible bit rate:
4 * 10 bits * 75 1/s = 3kbps
However, both the paper and the README state that the lowest bitrate is 1.5kbps.
Looking at the bitrate progression (1.5, 3, 6, 12, 24), which doubles at each step, wouldn't that rather correspond to 2, 4, 8, 16, 32 codebooks being used? Maybe I am just misinterpreting or missing something, could you please clarify this point?
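The arithmetic in this question can be checked with a short sketch. It assumes, as stated above, 10 bits per codebook (1024 entries) and 75 latent frames per second for the 24 kHz model; the function names are made up for illustration. Under these assumptions the listed bitrates come out at 2, 4, 8, 16 and 32 codebooks, which matches the doubling noted in the question rather than multiples of 4.

```python
import math

FRAME_RATE_HZ = 75     # latent frames per second (24 kHz model, per the question)
CODEBOOK_SIZE = 1024   # entries per codebook, i.e. 10 bits per index


def bitrate_kbps(num_codebooks):
    """Bitrate in kbps when every frame sends one index per codebook."""
    bits_per_index = math.log2(CODEBOOK_SIZE)
    return num_codebooks * bits_per_index * FRAME_RATE_HZ / 1000


def codebooks_for(target_kbps):
    """Number of codebooks that yields the target bitrate."""
    return target_kbps / bitrate_kbps(1)


for kbps in (1.5, 3, 6, 12, 24):
    print(f"{kbps:>4} kbps -> {codebooks_for(kbps):g} codebooks")
```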
I saw this video on the readme page, and I'm wondering how it was made. I'm curious about it and would like to learn how to make it.
https://ai.honu.io/papers/encodec/final.mp4
Zero grad in second residual vq as mentioned here (lucidrains/vector-quantize-pytorch#33)
encodec/encodec/quantization/core_vq.py
Line 336 in 1943298
The fix link is lucidrains/vector-quantize-pytorch@ecf2f7c
Using the causal model at 24 kHz results in the process being killed after ~10 seconds. It doesn't matter whether the language model (--lm) is used or not. The 48 kHz mode works as expected.
I've used the following command and got these results:
python3 -m encodec -r -b 1.5 --lm '/home/user/Downloads/Audiofile.wav' '/home/user/Downloads/Audiofile_EnCodec.wav';
Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th" to /home/user/checkpoints/encodec_24khz-d7cc33bc.th 100.0%
Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_lm_24khz-1608e3c0.th" to /home/user/checkpoints/encodec_lm_24khz-1608e3c0.th 100.0%
Killed
user@user:~$
I even deleted the checkpoints folder to make sure that this isn't the issue. I've tested with 16-bit and 24-bit mono PCM WAV files. Using --hq and --lm together works as expected, so I believe the 24 kHz model is the culprit.
It should not kill the process, and it should produce a properly encoded and decoded file.
It kills the process.
I noticed that the audio in the linked demo (final.mp4) sounds very compressed, even the original part. Even on a phone speaker I can hear a difference from test_48k.wav, not to mention with headphones.
I'm aware this is NOT a finished product, but maybe something above 128 kbps AAC would be better for comparing audio compression.
Hey, thank you for sharing your work. Sorry for asking what seems like a simple question; can you correct the two statements below if they are wrong?
An ecdc file can't be played directly and must first be decompressed into a WAV file that can be played?
ecdc is like a zip file: it holds your file, but you can't actually use it until you decompress it?
Are the above two statements correct?
It seems that 'loss_l' in Figure 1, which connects the VQ and the transformer in the quantizer block, is not defined in the paper.
Is there a description of it somewhere? Thanks.
When training EnCodec with Hugging Face's accelerate package, is it not possible to use the Balancer?
This is part of my training script:
self.balancer._set_losses_and_input(
    losses={'t': recon_loss, 'f': m_recon_loss, 'g': ads_loss, 'feat': rfm_loss},
    input=output
)
# self.balancer.backward()
self.accelerator.backward(self.balancer)
and I changed Balancer a little:
def __mul__(self, other):
    for name, loss in self.losses.items():
        self.losses[name] = loss * other

def __truediv__(self, other):
    for name, loss in self.losses.items():
        self.losses[name] = loss / other

def _set_losses_and_input(self, losses: tp.Dict[str, torch.Tensor], input: torch.Tensor):
    self.losses = losses
    self.input = input

@property
def metrics(self):
    return self._metrics

def backward(self):
    losses = self.losses
    input = self.input
    norms = {}
    grads = {}
    for name, loss in losses.items():
        grad, = autograd.grad(loss, [input], retain_graph=True)
        if self.per_batch_item:
            dims = tuple(range(1, grad.dim()))
            norm = grad.norm(dim=dims).mean()
        else:
            norm = grad.norm()
        norms[name] = norm
        grads[name] = grad
    count = 1
    if self.per_batch_item:
        count = len(grad)
    avg_norms = average_metrics(self.averager(norms), count)
    total = sum(avg_norms.values())
    self._metrics = {}
    if self.monitor:
        for k, v in avg_norms.items():
            self._metrics[f'ratio_{k}'] = v / total
    total_weights = sum([self.weights[k] for k in avg_norms])
    ratios = {k: w / total_weights for k, w in self.weights.items()}
    out_grad: tp.Any = 0
    for name, avg_norm in avg_norms.items():
        if self.rescale_grads:
            scale = ratios[name] * self.total_norm / (self.epsilon + avg_norm)
            grad = grads[name] * scale
        else:
            grad = self.weights[name] * grads[name]
        out_grad += grad
    input.backward(out_grad)
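For readers trying to follow what the rescaling branch of backward does, here is a framework-free numeric sketch of the combination step only. It is a simplification under stated assumptions: gradients are plain Python lists, the moving average of the norms is skipped (the current norms are used directly), and combine_gradients is a made-up name rather than part of the Balancer API.

```python
import math


def l2_norm(vector):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vector))


def combine_gradients(grads, weights, total_norm=1.0, epsilon=1e-12):
    """Rescale each loss gradient to its share of total_norm, then sum.

    grads:   {loss_name: gradient w.r.t. the shared model output}
    weights: {loss_name: relative weight of that loss}
    """
    total_weight = sum(weights[name] for name in grads)
    combined = [0.0] * len(next(iter(grads.values())))
    for name, grad in grads.items():
        ratio = weights[name] / total_weight
        scale = ratio * total_norm / (epsilon + l2_norm(grad))
        for i, g in enumerate(grad):
            combined[i] += g * scale
    return combined


# Two losses whose raw gradients differ in magnitude by a factor of 1000.
grads = {"l1": [3.0, 4.0], "l2": [0.0, 0.005]}
weights = {"l1": 1.0, "l2": 1.0}
print(combine_gradients(grads, weights))
```

After rescaling, each loss contributes a gradient whose norm is its weight ratio times total_norm, regardless of the raw scale of that loss.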
Thanks for releasing this - very exciting work! I have two questions:
How does it work when resampling a 16 kHz audio file to 24 kHz, or when directly extracting the codec result from a 16 kHz WAV?
Hello, my question is about the reconstruction loss in the frequency domain. Section 3.4 states that you use a "mel-spectrogram using a normalized STFT"; what type of normalization is meant here? Is it sufficient to use the normalized flag of torchaudio.transforms.MelSpectrogram, which normalizes "by magnitude after stft"?
Also, in practice the STFT loss is sometimes computed on the log mel-spectrogram for better convergence, so I want to clarify: in your implementation, is S_i from formula 1 a mel-spectrogram or a log mel-spectrogram?
EOFError encountered for some special audio lengths.
The reason is that when calculating the number of decompressed frames, the wrong order of operations resulted in floating point errors. E.g.:
math.ceil(53760 / 24000 * 75) evals to 169
math.ceil(53760 * 75 / 24000) evals to 168
Line 120 in c79ba28
frame_length = int(math.ceil(this_segment_length * model.frame_rate / model.sample_rate ))
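The two evaluations above differ because 53760 / 24000 is not exactly representable in binary floating point, so multiplying afterwards lands just above 168. One way to sidestep the problem entirely, assuming the segment length, frame rate and sample rate are all integers, is to do the ceiling division in integer arithmetic. ceil_div is a hypothetical helper, not something from the EnCodec code base.

```python
import math


def ceil_div(numerator, denominator):
    """Ceiling of numerator / denominator using only integer arithmetic."""
    return -(-numerator // denominator)


segment_length = 53760   # samples, from the example above
sample_rate = 24000      # Hz
frame_rate = 75          # frames per second

divide_first = math.ceil(segment_length / sample_rate * frame_rate)
multiply_first = math.ceil(segment_length * frame_rate / sample_rate)
integer_only = ceil_div(segment_length * frame_rate, sample_rate)

print(divide_first, multiply_first, integer_only)
```

The integer version does not depend on the order of floating point operations, which is why it may be the more robust choice when both sides (compression and decompression) must agree on the frame count.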
Create a .wav file with duration 2.24s and content all zeros.
import numpy as np
from scipy.io import wavfile
wavfile.write('bug.wav', 24000, data=np.zeros(53760, dtype=np.int16))
Run encodec compress and decompress
import subprocess
subprocess.run('encodec bug.wav bug.encodec.wav'.split())
gives error message:
Traceback (most recent call last):
  File "/home/chenjiasheng/.local/bin/encodec", line 33, in <module>
    sys.exit(load_entry_point('encodec', 'console_scripts', 'encodec')())
  File "/mnt/d/code/encodec/encodec/__main__.py", line 117, in main
    out, out_sample_rate = decompress(compressed)
  File "/mnt/d/code/encodec/encodec/compress.py", line 185, in decompress
    return decompress_from_file(fo, device=device)
  File "/mnt/d/code/encodec/encodec/compress.py", line 147, in decompress_from_file
    raise EOFError("The stream ended sooner than expected.")
EOFError: The stream ended sooner than expected.
repo version:
commit c79ba28c9199494d106d2c7f56006260528d7b16 (HEAD -> main, origin/main, origin/HEAD)
Author: Alexandre Défossez
Date: Tue Jan 24 14:07:56 2023 +0100
def get_bandwidth_per_quantizer(self, sample_rate: int):
    """Return bandwidth per quantizer for a given input sample rate.
    """
    return math.log2(self.bins) * sample_rate / 1000
it should be
return math.log2(self.bins) * sample_rate / 1000 / self.hop_length
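To make the unit question concrete: the bandwidth of one quantizer is the number of bits per index times the number of indices per second. Whether the proposed change is needed depends on what the caller passes in; if the argument is already the frame rate, no division by the hop length is required. The sketch below assumes a 24 kHz model with 1024 bins and a hop length of 320 samples (so 75 frames per second); these values and both function names are illustrative assumptions.

```python
import math


def kbps_from_frame_rate(bins, frame_rate):
    """Bandwidth of one quantizer in kbps, given frames per second."""
    return math.log2(bins) * frame_rate / 1000


def kbps_from_sample_rate(bins, sample_rate, hop_length):
    """Same quantity, given audio samples per second and the hop length."""
    return math.log2(bins) * sample_rate / 1000 / hop_length


bins = 1024          # assumed codebook size
sample_rate = 24000  # assumed audio sample rate in Hz
hop_length = 320     # assumed encoder stride in samples

print(kbps_from_frame_rate(bins, sample_rate // hop_length))
print(kbps_from_sample_rate(bins, sample_rate, hop_length))
# Passing the audio sample rate to the frame-rate form gives a far larger value.
print(kbps_from_frame_rate(bins, sample_rate))
```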
Thank you for the excellent experimental detail in the paper. Some model architecture analysis is reported in Table A.3; I'm curious what the actual audio sounds like, since that would be more intuitive than the metrics in the table.
I have written a custom training loop with a reconstruction loss function, but the loss doesn't converge and bounces between 600 and 700 (with my loss function) even after 3 to 4 epochs. Can you please explain why that happens?
I first wanted to check the convergence of the model with the reconstruction loss only, without the discriminators.
Loss function code (from the SoundStream paper):
import torch
from torchaudio.transforms import MelSpectrogram

def L_G_rec(x, G_x, eps=1e-4):
    L = 0
    sr = 16000
    for i in range(6, 12):
        s = 2 ** i
        alpha_s = (s / 2) ** 0.5
        melspec = MelSpectrogram(sample_rate=sr, n_fft=s, hop_length=s//4, center=True, pad_mode="reflect", power=2.0, norm="slaney", onesided=True, n_mels=8, mel_scale="htk")
        S_x = melspec(x)
        S_G_x = melspec(G_x)
        loss1 = (S_x - S_G_x).abs().mean()
        loss2 = alpha_s * (((torch.log(S_x.abs() + eps) - torch.log(S_G_x.abs() + eps)) ** 2).sum(dim=-2).add(eps).abs() ** 0.5).mean()
        loss = loss1 + loss2
        L = L + loss
    return L
Training loop code :
model = EncodecModel._get_model(24)
. . .
for epoch in range(0, EPOCH):
    print("EPOCH number = %d" % epoch)
    for i, input in enumerate(iter(dataloader)):
        with torch.autograd.set_detect_anomaly(True):
            recon_output = model(input)
            gen_loss = L_G_rec(input, recon_output)
            generator_optim.zero_grad()
            gen_loss.backward()
            generator_optim.step()
        if i % 50 == 0:
            print("Training generator loss = %d" % gen_loss)
    save_model(model, generator_optim)
What is the problem with the training loop? Do I need to include a discriminator for convergence?
Hello! Could you tell me how I can adapt your models (24 kHz and 48 kHz) to 16 kHz or 8 kHz? We don't always work with audio at 24 kHz or 48 kHz.
Also, how can I train models?
Most speech datasets are 16 kHz; could you provide a well-trained 16 kHz model?
What do you think?
Hi, thank you for the great work.
(I)
I could not figure out the necessity of predicting codebook logits with a transformer.
Why couldn't we use the empirical distribution of codebook usage (frequencies) over a validation set?
I feel like I am missing something here.
(II)
Also, [3] and your work show that Entropy coding does not improve much. While [2] demonstrates significantly better results (excluding latency most likely). Also, most of the literature on Neural Image codecs shows that most of the gains are achieved from Entropy coding. Any comments on that? Do you think different implementation is required to obtain higher compression or it is not that important for Audio domain?
(III)
Also, I am writing on behalf of a small non commercial, independent research group. We are interested to work in this direction further, if there are some possibilities to collaborate that would be wonderful. We have some ideas and resources to test them. Your expertise could help us to save some time.
@adefossez.
Actually, I am sort of a fan of yours since speech de-noising paper [1] and later DiffQ. Not surprised you used it in VQ-VAE. RNNs within auto-encoders look like a signature move, not sure if it is your idea though. ))
[1] Real Time Speech Enhancement in the Waveform Domain
[2] LMCODEC: A LOW BITRATE SPEECH CODEC WITH CAUSAL TRANSFORMER MODELS
[3] SoundStream: An End-to-End Neural Audio Codec
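On question (I) above, one way to see why a context model can help is to compare the cost of coding a sequence with one fixed frequency table against a model that conditions on what came before. The toy sketch below is not the paper's transformer: it uses a first-order (previous symbol) table estimated on the same sequence it scores, purely as an illustration, and all names are made up. Real code sequences are far less predictable than this example.

```python
import math
from collections import Counter, defaultdict


def static_bits(symbols):
    """Average bits/symbol using one global frequency table."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def context_bits(symbols):
    """Average bits/symbol when conditioning on the previous symbol."""
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(symbols, symbols[1:]):
        pair_counts[prev][cur] += 1
    bits = 0.0
    for prev, cur in zip(symbols, symbols[1:]):
        p = pair_counts[prev][cur] / sum(pair_counts[prev].values())
        bits -= math.log2(p)
    return bits / (len(symbols) - 1)


# A toy code sequence with strong temporal structure.
codes = [0, 1, 2, 3] * 64
print(f"static:  {static_bits(codes):.3f} bits/symbol")
print(f"context: {context_bits(codes):.3f} bits/symbol")
```

A static table can only exploit how often each code occurs overall, whereas a conditional model can also exploit ordering; how much that buys on real audio codes is exactly what question (II) asks about.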
Do you have better quality models for 48 kHz?
It would be helpful if you added 32 kbps, 48 kbps, 64 kbps and 96 kbps models to compare with MP3 / AAC / Windows Media Audio.
Thanks for the support!
Do I need to use a pre-trained discriminator, or can I use an untrained discriminator to calculate the adversarial losses?
Thanks for sharing the amazing speech codec. Since EnCodec and SoundStream use RVQ to quantize the latent representations, I'm worried about the stability of (R)VQ.
I evaluated Lyra2 at 9.2 kbps and EnCodec at 12 kbps with high-quality data and found irregular harmonics (an example). I guess they are caused by the VQ process; do you have any view on this?
I'm curious how to extract embeddings, and if that's the output of the compress function / command line tool, and whether that could be used to compare, via cosine similarity, how similar 2 audio files are?
Thanks for your amazing work!
I have a question about your dataset mixing strategy. Does mixture here refer to:
Hi, while trying to reproduce the training code based on the part you released, I ran into a problem with multi-GPU training. I found that https://github.com/facebookresearch/encodec/blob/main/encodec/quantization/core_vq.py#L150 and https://github.com/facebookresearch/encodec/blob/main/encodec/quantization/core_vq.py#L168 cause DDP training to stall: this code makes the GPUs wait for each other. I therefore deleted this line, and now the model can be trained with torch DDP. However, I do not know whether removing this line affects performance. Can you advise whether it can be deleted?
I would like to hear more about the motivation behind your use of layer_state in the StreamingTransformerEncoder.
My understanding so far is that, for each layer, the previous input x_past is concatenated with x for the keys and values, but not for the queries. This effectively means that the matmul between queries and keys attends not just to the current input but also to part of the previous input x_past.
I'm not entirely sure how to interpret this, which may be due to my not being able to inspect your training strategy. To my understanding, x and x_past should be independent token sequences, in which case it seems strange to allow the transformer to attend to a concatenation of these sequences. Alternatively, x and x_past originate from the same audio clip, in which case I don't understand why you wouldn't just increase the context length explicitly.
I tried to find other transformer implementations that do something similar, and the only thing that came close is Transformer-XL. There is a major difference, however: they propagate the output of the transformer layer stack to the next step, whereas your implementation propagates the input.
I may be missing something entirely, so please excuse my ignorance in that case; nonetheless, I would really appreciate it if you could shed some light on this.
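One possible reading of layer_state, not confirmed by the authors, is that it acts as a cache of a layer's past inputs: if x_past and x come from the same stream, attending over their concatenation chunk by chunk gives the same result as a single causal pass over the whole sequence, while only computing queries for the new positions. The toy below checks that equivalence for a single attention layer with identity projections in plain Python; attend, full_causal and streaming are illustrative names, not the library's API.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context, offset):
    """Causal single-head attention with identity projections.

    queries[i] sits at absolute position offset + i and may only
    look at context[: offset + i + 1].
    """
    outputs = []
    for i, q in enumerate(queries):
        visible = context[: offset + i + 1]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in visible]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, visible))
                        for d in range(len(q))])
    return outputs


def full_causal(x):
    """One pass over the whole sequence with a causal mask."""
    return attend(x, x, 0)


def streaming(x, chunk):
    """Process x in chunks, caching the layer inputs seen so far."""
    past, outputs = [], []
    for start in range(0, len(x), chunk):
        current = x[start:start + chunk]
        context = past + current          # keys/values: past and current
        outputs.extend(attend(current, context, len(past)))
        past = context                    # cached for the next chunk
    return outputs


x = [[0.1, 0.2], [0.4, -0.3], [-0.5, 0.9], [0.7, 0.1], [0.0, -0.6]]
full = full_causal(x)
chunked = streaming(x, chunk=2)
print(max(abs(a - b) for f, c in zip(full, chunked) for a, b in zip(f, c)))
```

Under this reading, the cached input is not a second, independent token sequence; it simply avoids recomputing earlier positions at each streaming step. In a multi-layer stack each layer would keep its own cache of inputs.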
I'm on Linux Mint with Python 3.8.
In the terminal, inside the EnCodec folder, I have an audio file sound.mp3.
What is the terminal command to get it working?
I can't figure that out...
Can you help?
Thanks for the great work and the shared code! I have some questions about the pre-trained transformer language model:
Could you explain in more detail the supervision used for training the transformer (shown as L_l in Fig. 1 of your paper)? My understanding is that you use a pre-trained language model and train some linear layers to model the distribution of codewords for each frame, but is there any other supervision for modeling the distribution, or is the transformer also jointly optimized with the whole encoder and decoder?
In the SoundStream paper, the Google team used the residual vector quantization model to generate music.
Since the architecture is very similar, I was wondering whether you have thought about using it to generate music.
I've tried OpenAI's Jukebox and the output is fairly noisy.
Kudos on this great work!