haoheliu / audioldm-training-finetuning

AudioLDM training, finetuning, evaluation and inference.

Home Page: https://audioldm.github.io/audioldm2/

License: MIT License

Python 99.95% Shell 0.05%

audiogeneration diffusion-models

audioldm-training-finetuning's Introduction

Warmest greetings from Haohe. PRs are most welcome for my repos.

"What good is a newborn baby?" -Franklin

audioldm-training-finetuning's Issues

Unable to run inference with the latest release. Please help!

I am using the latest inference code, but inference fails with every pretrained model. I suspect the ckpt files are not compatible with my execution environment, but I am not sure.
CUDA_VISIBLE_DEVICES=1 python3 audioldm_train/infer.py --config_yaml audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml --list_inference tests/captionlist/inference_test.lst
Logs:
SEED EVERYTHING TO 0
Global seed set to 0
Add-ons: []
Dataset initialize finished
Reload ckpt specified in the config file audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml
LatentDiffusion: Running in eps-prediction mode
/home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torchlibrosa/stft.py:193: FutureWarning: Pass size=1024 as keyword args. From version 0.10 passing these as positional arguments will result in an error
fft_window = librosa.util.pad_center(fft_window, n_fft)
/home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias']

This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Use extra condition on UNet channel using Film. Extra condition dimension is 512.
DiffusionWrapper has 185.04 M params.
Keeping EMAs of 692.
making attention of type 'vanilla' with 512 in_channels
making attention of type 'vanilla' with 512 in_channels
/home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=VGG16_Weights.IMAGENET1K_V1. You can also use weights=VGG16_Weights.DEFAULT to get the most up-to-date weights.
warnings.warn(msg)
loaded pretrained LPIPS loss from taming/modules/autoencoder/lpips/vgg.pth
Removing weight norm...
Initial learning rate 1e-05
--> Reload weight of autoencoder from data/checkpoints/vae_mel_16k_64bins.ckpt
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias']

This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
File "/home/datnt114/Videos/AudioLDM-training-finetuning/audioldm_train/infer.py", line 128, in <module>
infer(dataset_json, config_yaml, config_yaml_path, exp_group_name, exp_name)
File "/home/datnt114/Videos/AudioLDM-training-finetuning/audioldm_train/infer.py", line 80, in infer
checkpoint = torch.load(resume_from_checkpoint)
File "/home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torch/serialization.py", line 795, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torch/serialization.py", line 1002, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'm'.
Time: 0h:00m:21s
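
An `invalid load key` error from torch.load generally means the file does not start with the bytes of a pickle or zip archive, which often points to an incomplete download or a text/HTML file saved under a .ckpt name. A minimal sketch for inspecting the file header before loading, using only the standard library; the helper name `sniff_checkpoint` and the example path are hypothetical and not part of this repo:

```python
import zipfile


def sniff_checkpoint(path):
    """Guess what kind of file sits at `path` from its first bytes.

    Hypothetical debugging helper: it never unpickles anything.
    """
    with open(path, "rb") as f:
        head = f.read(64)
    if not head:
        return "empty file"
    # Current torch.save output is a zip archive.
    if zipfile.is_zipfile(path):
        return "zip archive (current torch.save format)"
    # Pickle protocol 2 and later start with the byte 0x80.
    if head[:1] == b"\x80":
        return "raw pickle (legacy torch.save format)"
    if head.startswith(b"version https://git-lfs"):
        return "git-lfs pointer, not the real weights"
    if head.lstrip()[:1] == b"<":
        return "looks like HTML/XML, probably a failed download"
    return f"unknown format, first bytes: {head[:16]!r}"


if __name__ == "__main__":
    import sys

    # Example path is an assumption; pass the checkpoint named in your config.
    target = sys.argv[1] if len(sys.argv) > 1 else "data/checkpoints/example.ckpt"
    try:
        print(sniff_checkpoint(target))
    except FileNotFoundError:
        print(f"no such file: {target}")
```

If the result is anything other than a zip archive or a raw pickle, downloading the checkpoint again is likely a better next step than changing the environment.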

Using a model pretrained on a 32 kHz dataset to fine-tune on a 16 kHz dataset

Dear @haoheliu,

First, I want to thank you for your great repo!
I have a naive question: can I take a model pretrained on a 32 kHz dataset and fine-tune it on a 16 kHz dataset?
I hope to hear from you!

Best regards,
Khanh

Training Code for AudioLDM2

This is really a great implementation. Are the AudioLDM2 training code and config already available?

How do I run inference with the model?

About configuration of released code and model

Thank you very much for this great work.

I found that the latent shape printed when I loaded the AE with the given config is inconsistent with the sample shape printed during inference with the generate_sample() function. What could be causing this?
When initializing the AE:
Working with z of shape (1, 8, 64, 64) = 32768 dimensions.
When calling the generate_sample() function:
Data shape for DDIM sampling is (3, 8, 256, 16), eta 1.0

I also found that the samples I get from generate_sample() are quite different from the demo on Hugging Face. Is calling generate_sample() directly not the complete inference process? What other configuration is needed?

Looking forward to it and thank you for your kind reply!
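
One possible reading of the two shapes above, offered as a sketch rather than a confirmed answer: the autoencoder may print a nominal square latent size when it is constructed, while the sampler derives its shape from the actual mel-spectrogram. The values below (16 kHz sampling rate, hop size 160, 64 mel bins, 10.24 s clips, 4x downsampling per axis in the VAE) are assumptions based on the usual AudioLDM setup and should be checked against the YAML config; `latent_shape` is a hypothetical helper, not a function from this repo:

```python
def latent_shape(batch, duration_s=10.24, sample_rate=16000, hop=160,
                 n_mels=64, z_channels=8, downsample=4):
    """Estimate the latent tensor shape for a batch of clips.

    All defaults are assumed values, not read from the released config.
    """
    # Number of mel-spectrogram frames along the time axis.
    frames = round(duration_s * sample_rate / hop)
    # The VAE is assumed to shrink both the time and frequency axes.
    return (batch, z_channels, frames // downsample, n_mels // downsample)


print(latent_shape(3))  # (3, 8, 256, 16)
```

Under these assumptions the DDIM shape (3, 8, 256, 16) corresponds to a batch of three 10.24 s clips, so the mismatch with the printed (1, 8, 64, 64) would not by itself indicate a misconfiguration.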

A question about training with transcription

I appreciate your wonderful work and the open-source code. It helps me a lot! I have a question: does the code provide a way to fine-tune the model with transcriptions? If not, I'd like to implement that part myself, but I am not sure whether I missed it when reading the code.

Looking forward to your reply!

Embed mode for AudioLDM model

It seems that the model is conditioned on the text embedding in the config, while the paper concludes that it is better to use the audio embedding. Which one is better?

Inference is very slow

It takes about 2 minutes to generate a 10-second audio clip.

Question about the metadata

Hi there! Firstly, thank you for providing the code for training AudioLDM.

I have a question regarding the AudioCaps dataset. In the *_label.json files, each data entry contains a "seg_label" key. The README mentions that pre-segmentation of audio files isn't necessary, but I'm curious about the purpose of this "seg_label" key.

Could you clarify whether the "seg_label" field is simply a path for saving preprocessed .npy files during training, or does it contain preprocessed data that requires specific steps before use? If the latter is true, could you guide me on how to process the WAV files into .npy format?

Thank you very much for your help!

Inference code

Thanks for your great work. How do I run inference with a new text caption after training?

Training on our own dataset.

Thank you @haoheliu for the amazing work! I have been looking forward to this for a while.
I am a beginner in machine learning and was wondering how easy it is to modify the network architecture to suit our own dataset.
Any input from you would be greatly appreciated.

GPT2 training

This is excellent work!
However, I am wondering whether GPT2 (the AudioMAE feature generating model) is pretrained before joint training, because I cannot find a training script for the GPT2 module.
By the way, I am unable to download the example "AudioCaps Dataset" because of its large size and my poor network connection. Would it be possible to provide a description of the dataset format so that I can preprocess my local dataset directly?
Thanks very much!

How to set the duration of the generated audio

Thank you for your excellent work!

I have a question about how to set the exact duration of the audio at inference time. For example, in the sample code for AudioLDM2 on Hugging Face, I see the parameter audio_length_in_s.

clap_htsat_tiny.pt is not available in checkpoint.tar

Support LoRA finetuning in the future?

Thanks for your great work!
I have built a personal dataset for fine-tuning, but I am afraid that such a small dataset will make the model worse. LoRA fine-tuning is widely used with diffusion models and tends to work better than fine-tuning the whole UNet directly. Will this repo support LoRA fine-tuning in the future?
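
For context, LoRA freezes a pretrained weight matrix of shape (d_out, d_in) and trains only a low-rank update of rank r, so the trainable parameters per layer drop from d_out * d_in to r * (d_in + d_out). A rough sketch of that arithmetic; the layer size and rank are illustrative assumptions and are not taken from this repo's UNet, and `lora_trainable_params` is a hypothetical helper:

```python
def lora_trainable_params(d_in, d_out, rank):
    """Return (full fine-tuning params, LoRA params) for one linear layer."""
    full = d_in * d_out            # every entry of W is trained
    lora = rank * (d_in + d_out)   # only A (rank x d_in) and B (d_out x rank)
    return full, lora


# Illustrative 512 -> 512 projection with rank 8 (assumed values).
full, lora = lora_trainable_params(512, 512, 8)
print(full, lora)  # 262144 8192
```

With far fewer trainable weights per layer, a small dataset has less capacity to pull the model away from its pretrained behavior, which is the usual argument for LoRA in this situation.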
