audioldm-training-finetuning's Introduction

Warmest greetings from Haohe. PRs are most welcome for my repos.

"What good is a newborn baby?" -Franklin

audioldm-training-finetuning's People

Contributors

haoheliu, yyua8222

audioldm-training-finetuning's Issues

Unable to run inference with the latest release. Please help!

I am using the latest inference code, but inference fails for every pretrained model. I suspect the ckpt files are not compatible with my execution environment, but I am not sure. The command I ran:
CUDA_VISIBLE_DEVICES=1 python3 audioldm_train/infer.py --config_yaml audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml --list_inference tests/captionlist/inference_test.lst
logs:
SEED EVERYTHING TO 0
Global seed set to 0
Add-ons: []
Dataset initialize finished
Reload ckpt specified in the config file audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml
LatentDiffusion: Running in eps-prediction mode
/home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torchlibrosa/stft.py:193: FutureWarning: Pass size=1024 as keyword args. From version 0.10 passing these as positional arguments will result in an error
fft_window = librosa.util.pad_center(fft_window, n_fft)
/home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias']

  • This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  • Use extra condition on UNet channel using Film. Extra condition dimension is 512.
    DiffusionWrapper has 185.04 M params.
    Keeping EMAs of 692.
    making attention of type 'vanilla' with 512 in_channels
    making attention of type 'vanilla' with 512 in_channels
    /home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
    warnings.warn(
    /home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=VGG16_Weights.IMAGENET1K_V1. You can also use weights=VGG16_Weights.DEFAULT to get the most up-to-date weights.
    warnings.warn(msg)
    loaded pretrained LPIPS loss from taming/modules/autoencoder/lpips/vgg.pth
    Removing weight norm...
    Initial learning rate 1e-05
    --> Reload weight of autoencoder from data/checkpoints/vae_mel_16k_64bins.ckpt
    Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias']
  • This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/home/datnt114/Videos/AudioLDM-training-finetuning/audioldm_train/infer.py", line 128, in <module>
    infer(dataset_json, config_yaml, config_yaml_path, exp_group_name, exp_name)
  File "/home/datnt114/Videos/AudioLDM-training-finetuning/audioldm_train/infer.py", line 80, in infer
    checkpoint = torch.load(resume_from_checkpoint)
  File "/home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torch/serialization.py", line 795, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torch/serialization.py", line 1002, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'm'.
Time: 0h:00m:21s
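
The traceback shows torch.load falling back to _legacy_load and then failing on the very first pickle opcode, which suggests the file on disk may not be a valid checkpoint at all (for example an incomplete download or a text placeholder) rather than an environment mismatch. A minimal diagnostic sketch for checking this is below. The function name sniff_checkpoint and its labels are hypothetical, it is not part of this repository, and it only inspects the leading bytes of the file.

```python
import os
import tempfile


def sniff_checkpoint(path, n_bytes=64):
    """Return a rough label for the file type based on its leading bytes."""
    with open(path, "rb") as f:
        head = f.read(n_bytes)
    if head.startswith(b"PK\x03\x04"):
        return "zip"  # zip-based format written by recent torch.save
    if head.startswith(b"\x80"):
        return "pickle"  # binary pickle stream, as in the legacy torch.save format
    if head.startswith(b"version https://git-lfs"):
        return "git-lfs-pointer"  # small text pointer, not the real weights
    if head.lstrip().lower().startswith((b"<!doctype", b"<html")):
        return "html"  # a web page was saved instead of the file
    return "unknown"


if __name__ == "__main__":
    # Demonstration on a throwaway file; point the function at the real .ckpt instead.
    with tempfile.TemporaryDirectory() as tmp:
        fake = os.path.join(tmp, "fake.ckpt")
        with open(fake, "wb") as f:
            f.write(b"not a checkpoint")
        print(sniff_checkpoint(fake), os.path.getsize(fake), "bytes")
```

If the result is anything other than "zip" or "pickle", or the file is far smaller than expected, re-downloading the checkpoint is probably the first thing to try.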

About configuration of released code and model

Thank you very much for this great work.

I found that the latent shape printed when the autoencoder is loaded with the given config is inconsistent with the sample shape printed during inference with generate_sample(). What could be causing this?
When initializing the autoencoder:
Working with z of shape (1, 8, 64, 64) = 32768 dimensions.
When calling generate_sample():
Data shape for DDIM sampling is (3, 8, 256, 16), eta 1.0

I also found that the samples I get from generate_sample() sound very different from the demo on Hugging Face. Is calling generate_sample() directly not the complete inference process? What other configuration is needed?

Looking forward to your reply, and thank you!
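
One possible reading of the two shapes, offered as a sketch rather than a confirmed answer: in the latent-diffusion autoencoder code that this kind of model usually derives from, the "Working with z of shape" message is computed from the nominal square resolution in the autoencoder config, while the DDIM shape reflects the actual mel spectrogram being generated. The values below (1024 mel frames, a nominal resolution of 256, and a downsampling factor of 4) are assumptions chosen to illustrate the arithmetic; only the 64 mel bins are hinted at by the checkpoint name vae_mel_16k_64bins.

```python
def latent_shape(batch, z_channels, height, width, downsample=4):
    """Shape of the latent for an input of (height, width) after spatial downsampling."""
    if height % downsample or width % downsample:
        raise ValueError("input size must be divisible by the downsampling factor")
    return (batch, z_channels, height // downsample, width // downsample)


# Assumed actual input: 1024 mel frames x 64 mel bins, batch of 3 samples.
print(latent_shape(3, 8, 1024, 64))  # (3, 8, 256, 16)

# Assumed nominal config resolution: 256 x 256, batch of 1.
print(latent_shape(1, 8, 256, 256))  # (1, 8, 64, 64)
```

Under these assumptions the two printed shapes are not in conflict: they describe different input sizes passed through the same downsampling factor.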

A question about training with transcription

I appreciate your wonderful work and the open-source code. It helps me a lot! I have a question: does the code provide a way to fine-tune the model with transcriptions? If not, I'd like to implement that part myself, but I am not sure whether I missed it when reading the code.

Looking forward to your reply!

Embed mode for AudioLDM model

It seems that the model is conditioned on text embeddings in the config, while the paper concludes that it is better to use audio embeddings. Which one is better?

question about the metadata

Hi there! Firstly, thank you for providing the code for training AudioLDM.

I have a question regarding the AudioCaps dataset. In the *_label.json files, each data entry contains a "seg_label" key. The README mentions that pre-segmentation of audio files isn't necessary, but I'm curious about the purpose of this "seg_label" key.

Could you clarify whether the "seg_label" field is simply a path for saving preprocessed .npy files during training, or does it contain preprocessed data that requires specific steps before use? If the latter is true, could you guide me on how to process the WAV files into .npy format?

Thank you very much for your help!

Inference code

Thanks for your great work. How do I run inference with new text captions after training?
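
The inference command quoted in the first issue above passes a caption list through --list_inference, so one plausible workflow is to write the new captions to a list file and reuse that command. The sketch below assumes the .lst file holds one caption per line, which is not confirmed by this page; the helper names are hypothetical, and only the script path and the two flags are taken from the quoted command.

```python
import os
import shlex
import tempfile


def write_caption_list(captions, path):
    """Write one caption per line (assumed .lst layout) and return the path."""
    cleaned = [c.strip() for c in captions if c.strip()]
    if not cleaned:
        raise ValueError("no non-empty captions given")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(cleaned) + "\n")
    return path


def build_infer_command(config_yaml, list_path):
    """Assemble the argument list using the flags from the quoted command."""
    return [
        "python3", "audioldm_train/infer.py",
        "--config_yaml", config_yaml,
        "--list_inference", list_path,
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lst = write_caption_list(
            ["A dog barking in the distance", "Rain falling on a tin roof"],
            os.path.join(tmp, "my_captions.lst"),
        )
        cmd = build_infer_command(
            "audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml",
            lst,
        )
        print(shlex.join(cmd))  # paste into a shell, or run with subprocess.run(cmd)
```

The config passed to --config_yaml would presumably need to point at the newly trained checkpoint, since the log above reports reloading the checkpoint specified in the config file.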

Training on our own dataset.

Thank you @haoheliu for the amazing work! I have been looking forward to this for a while!
I am a beginner in machine learning and was wondering how easy it is to modify the network architecture to suit our own dataset.
Any input from you would be greatly appreciated.

GPT2 training

This is excellent work!
However, I am wondering whether GPT2 (the model that generates AudioMAE features) is pretrained before joint training, as I cannot find a training script for the GPT2 module.
By the way, I am unable to download the example "AudioCaps Dataset" because of its large size and my slow network. Would it be possible to provide a description of the dataset format so that I can preprocess my local dataset directly?
Thanks very much!

How to define time of an audio

Thank you for your excellent work!

I have a question about how to set the exact duration of the generated audio at inference. For example, in the AudioLDM 2 sample code on Hugging Face, I see the parameter audio_length_in_s.
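
For a mel-spectrogram latent diffusion model, the audio duration is tied to the number of spectrogram frames and therefore to the latent length. The sketch below shows only that arithmetic. The 16 kHz sampling rate is suggested by the checkpoint name vae_mel_16k_64bins in the log above; the hop size of 160 samples and the downsampling factor of 4 are assumptions, and the function names are hypothetical rather than taken from this repository.

```python
def mel_frames_for_duration(seconds, sampling_rate=16000, hop_size=160):
    """Number of mel frames covering `seconds` of audio (assumed hop size)."""
    samples = round(seconds * sampling_rate)
    return samples // hop_size


def latent_time_size(seconds, sampling_rate=16000, hop_size=160, downsample=4):
    """Length of the latent time axis after the autoencoder's downsampling."""
    return mel_frames_for_duration(seconds, sampling_rate, hop_size) // downsample


print(mel_frames_for_duration(10.24))  # 1024
print(latent_time_size(10.24))         # 256
print(latent_time_size(5.12))          # 128
```

Under these assumptions, changing the target duration means changing the latent time size consistently wherever it is configured.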

Support LoRA finetuning in the future?

Thanks for your great work!
I have built a personal dataset for finetuning, but I am concerned that such a small dataset will make the model worse. LoRA finetuning is widely used with diffusion models and is often preferable to directly finetuning the whole UNet. Will this repo support LoRA finetuning in the future?
