Warmest greetings from Haohe. PRs are most welcome for my repos.
"What good is a newborn baby?" -Franklin
AudioLDM training, finetuning, evaluation and inference.
Home Page: https://audioldm.github.io/audioldm2/
License: MIT License
I am using the latest inference code but cannot run inference with all of the pretrained models. I suspect the ckpt files are not compatible with my execution environment, but I am not sure about this.
CUDA_VISIBLE_DEVICES=1 python3 audioldm_train/infer.py --config_yaml audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml --list_inference tests/captionlist/inference_test.lst
logs:
SEED EVERYTHING TO 0
Global seed set to 0
Add-ons: []
Dataset initialize finished
Reload ckpt specified in the config file audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml
LatentDiffusion: Running in eps-prediction mode
/home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torchlibrosa/stft.py:193: FutureWarning: Pass size=1024 as keyword args. From version 0.10 passing these as positional arguments will result in an error
fft_window = librosa.util.pad_center(fft_window, n_fft)
/home/datnt114/anaconda3/envs/audioldm_train/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias']
UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=VGG16_Weights.IMAGENET1K_V1. You can also use weights=VGG16_Weights.DEFAULT to get the most up-to-date weights.

Dear @haoheliu,
First, I want to thank you for your great repo!
I have a naive question: can I use a model pretrained on a 32 kHz dataset to finetune on a 16 kHz dataset?
Hoping for your reply!
Best regards,
Khanh
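As a general note on the sampling-rate question above (not specific to this repo): finetuning a 32 kHz checkpoint on 16 kHz data usually means resampling the audio so the mel frontend sees the rate it was configured for. A minimal numpy sketch of 2x decimation; a real pipeline should use a proper resampler such as torchaudio.transforms.Resample or librosa.resample:

```python
import numpy as np

def decimate_2x(x: np.ndarray) -> np.ndarray:
    """Naive 32 kHz -> 16 kHz conversion by averaging sample pairs.
    Illustration only: a real pipeline should apply a proper low-pass
    filter (e.g. torchaudio.transforms.Resample) before downsampling."""
    n = len(x) - len(x) % 2          # drop a trailing odd sample
    return x[:n].reshape(-1, 2).mean(axis=1)

wav_32k = np.random.randn(320_000)   # 10 s of fake audio at 32 kHz
wav_16k = decimate_2x(wav_32k)
print(len(wav_16k))                  # 160000 samples = 10 s at 16 kHz
```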
This is really a great implementation. Are the AudioLDM2 code and config available yet?
Thank you very much for this great work.
I found that the latent shape printed when I load the AE with the given config is inconsistent with the sample shape printed when inferring with the generate_sample() func. What could be causing this?
When initializing the AE:
Working with z of shape (1, 8, 64, 64) = 32768 dimensions.
When calling the generate_sample() func:
Data shape for DDIM sampling is (3, 8, 256, 16), eta 1.0
I also found that the sample generated through the generate_sample() func is very different from the demo results on Hugging Face. Is calling the generate_sample() func directly not the final inference process? What other configuration is needed?
Looking forward to it and thank you for your kind reply!
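For what it's worth, the two shapes above are not necessarily in conflict: (1, 8, 64, 64) looks like a dummy input printed at AE init time, while DDIM sampling uses the shape implied by the target mel spectrogram. A small sketch of the arithmetic, assuming a 1024-frame x 64-bin mel and a VAE spatial downsampling factor of 4 (assumptions for illustration, not values read from the repo's config):

```python
mel_frames, mel_bins = 1024, 64   # assumed mel-spectrogram size
down = 4                          # assumed VAE downsampling factor
batch, latent_ch = 3, 8

latent_shape = (batch, latent_ch, mel_frames // down, mel_bins // down)
print(latent_shape)               # (3, 8, 256, 16): matches the DDIM log line
```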
I appreciate your wonderful work and the open-sourced code. It helps me a lot! I have a question: does the code provide a way to fine-tune the model with transcriptions? If not, I'd like to implement that part myself, but I am not sure whether I missed it when reading the code.
Looking forward to your reply!
It seems that the model is conditioned on text embeddings in the config, while the paper concludes that it is better to use audio embeddings, so which one is better?
Hi there! Firstly, thank you for providing the code for training AudioLDM.
I have a question regarding the AudioCaps dataset. In the *_label.json files, each data entry contains a "seg_label" key. The README mentions that pre-segmentation of audio files isn't necessary, but I'm curious about the purpose of this "seg_label" key.
Could you clarify whether the "seg_label" field is simply a path for saving preprocessed .npy files during training, or does it contain preprocessed data that requires specific steps before use? If the latter is true, could you guide me on how to process the WAV files into .npy format?
Thank you very much for your help!
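In case it helps while waiting for an answer: if the .npy files referenced by "seg_label" are just cached features, a generic way to precompute and save one array per WAV looks like the sketch below. The magnitude-STFT here is purely illustrative; the actual features this repo expects (mel parameters, normalization, file naming) would have to match its config.

```python
import os
import tempfile
import numpy as np

def magnitude_stft(wav: np.ndarray, n_fft: int = 1024, hop: int = 160) -> np.ndarray:
    """Toy magnitude STFT; stands in for whatever feature the repo caches."""
    window = np.hanning(n_fft)
    frames = [wav[i:i + n_fft] * window
              for i in range(0, len(wav) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

wav = np.random.randn(16_000)                          # 1 s of fake 16 kHz audio
feat = magnitude_stft(wav)
path = os.path.join(tempfile.gettempdir(), "example_feature.npy")
np.save(path, feat)                                    # cache the feature as .npy
print(feat.shape)                                      # (94, 513)
```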
Thanks for your great work. How do I run inference with a new text caption after training?
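Based on the inference command shown earlier on this page, a plausible workflow is to put the new captions (one per line) into a .lst file and point --list_inference at it. A sketch, with a hypothetical file name:

```shell
# Write new captions, one per line, into a caption list file
mkdir -p tests/captionlist
printf '%s\n' \
  "A dog barking in the distance" \
  "Rain falling on a tin roof" > tests/captionlist/my_captions.lst

# Then run inference against the trained config (same pattern as the
# repo's example command; commented out here since it needs the checkpoint):
# CUDA_VISIBLE_DEVICES=0 python3 audioldm_train/infer.py \
#   --config_yaml audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml \
#   --list_inference tests/captionlist/my_captions.lst
```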
Thank you @haoheliu for the amazing work! I was looking forward to this for awhile!.
I am a beginner in machine learning and was wondering how difficult it is to modify the network architecture to suit our own dataset.
Any input from you would be greatly appreciated.
This is such an excellent work!
However, I am wondering whether GPT-2 (the AudioMAE-feature-generating model) is pretrained before joint training, since I cannot find a training script for the GPT-2 module.
By the way, the example AudioCaps dataset cannot be fetched due to its large size and the limitations of my poor network. Is it possible to provide an explanation of the dataset format so that I can preprocess my local dataset directly?
Thanks very much!
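While waiting for an official description of the dataset format: many text-to-audio training repos describe each dataset with a JSON metadata file listing wav paths and captions. The structure below is a guess for illustration, not this repo's confirmed schema, so check it against the provided AudioCaps label files before relying on it.

```python
import json

# Hypothetical metadata layout (keys are assumptions, not the repo's schema):
# a top-level "data" list with one entry per audio clip.
metadata = {
    "data": [
        {"wav": "audioset/sample_1.wav", "caption": "A dog barks while birds chirp"},
        {"wav": "audioset/sample_2.wav", "caption": "Heavy rain with distant thunder"},
    ]
}

with open("my_dataset_label.json", "w") as f:
    json.dump(metadata, f, indent=2)

print(len(metadata["data"]))  # 2
```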
Thank you for your excellent work!
I have a question about how to define the exact duration of the audio when inferring. For example, in the sample code of AudioLDM2 on Hugging Face, I see the parameter audio_length_in_s.
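The relation between an audio_length_in_s-style parameter and the model's internal sizes is simple frame arithmetic. Assuming the usual AudioLDM frontend settings of a 16 kHz sampling rate and a 160-sample mel hop (assumptions; confirm against the config YAML):

```python
sampling_rate = 16_000   # assumed sampling rate (Hz)
hop_size = 160           # assumed mel hop length (samples)

def mel_frames_for(duration_s: float) -> int:
    """Number of mel frames the model must generate for a given duration."""
    return int(duration_s * sampling_rate / hop_size)

print(mel_frames_for(10.24))  # 1024 frames
```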
Thanks for your great work!
I have made a personal dataset for finetuning, but I think my small dataset will make the model worse. I find that the LoRA finetuning method is commonly applied to diffusion models and often works better than directly finetuning the whole UNet. Will this repo support LoRA finetuning in the future?
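As background on the request above: the idea of LoRA is to freeze a pretrained weight matrix W and learn only a low-rank update B @ A, scaled by alpha / r. A minimal numpy sketch of the forward computation (illustration of the technique only, not this repo's API):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = W x + (alpha / r) * B A x; only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, LoRA starts as an exact no-op on the base model:
print(np.allclose(lora_forward(x), W @ x))  # True
```

Because B starts at zero, finetuning begins from exactly the pretrained behavior, which is part of why LoRA is gentle on small datasets.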