
Conditional Discrete Contrastive Diffusion (CDCD)

This is the implementation of the Conditional Discrete Contrastive Diffusion (CDCD) approach for cross-modal and conditional generation (ICLR 2023).

Ye Zhu, Yu Wu, Kyle Olszewski, Jian Ren, Sergey Tulyakov, Yan Yan

Paper | Project Page

Updates:

  • (04/2023) I am also super excited to introduce my new series of work in #AI4Science and #ML4Astrophysics, where we apply diffusion generative models to real scientific problems in astronomy, specifically star formation in the universe. We adapt the DDPM formulation to predict the density of molecular clouds and achieve an order-of-magnitude improvement in prediction accuracy over existing physical methods. The full paper is accepted to The Astrophysical Journal (a top journal in astronomy); a short version can also be found in the ICLR23 Physics4ML Workshop.

  • If you are interested in applying pre-trained, frozen diffusion models to downstream applications such as image editing, please also check my recent work Boundary Guided Mixing Trajectory for Semantic Control with Diffusion Models, which introduces a non-learning-based (and thus very lightweight) method for semantic image control and manipulation via denoising-trajectory guidance. Thanks! (Code and project page will be released soon, stay tuned!)

1. Project Overview

In this work, we introduce our Conditional Discrete Contrastive Diffusion (CDCD) approach to enhance the input-output connections in cross-modal and conditional generation. Specifically, we tackle the problem by maximizing the mutual information between the given input and the generated output via contrastive learning. We demonstrate the efficacy of the proposed approach in evaluations with three diverse, multimodal conditional synthesis tasks on five datasets: dance-to-music generation on AIST++ and TikTok Dance-Music, text-to-image synthesis on CUB200 and MSCOCO, and class-conditioned image synthesis on ImageNet.
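
For intuition only, below is a minimal, schematic PyTorch sketch (not the repository's actual training code) of the general idea: an InfoNCE-style contrastive term over negative conditioning pairs is added to the usual diffusion loss, using the per-pair diffusion losses as (negated) logits. All names, the negative-sampling setup, and the weight lambda_c are illustrative assumptions.

import torch
import torch.nn.functional as F

def contrastive_diffusion_loss(diff_loss_pos, diff_losses_neg, lambda_c=1.0):
    # diff_loss_pos: scalar diffusion loss for the true (condition, output) pair.
    # diff_losses_neg: tensor of shape [K], diffusion losses for K negative pairs.
    # A lower diffusion loss roughly means a higher conditional likelihood,
    # so the negated losses act as similarity logits for an InfoNCE-style objective.
    logits = -torch.cat([diff_loss_pos.view(1), diff_losses_neg]).unsqueeze(0)  # [1, K+1]
    labels = torch.zeros(1, dtype=torch.long)  # the positive pair sits at index 0
    contrastive = F.cross_entropy(logits, labels)  # lower bound on mutual information
    return diff_loss_pos + lambda_c * contrastive

# Toy usage with made-up loss values:
pos = torch.tensor(0.8)
negs = torch.tensor([1.5, 1.2, 2.0])
print(contrastive_diffusion_loss(pos, negs))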

2. Environment Setup

The environment can be set up following the instructions below. The dance-to-music task requires the pre-trained JukeBox model, and the text-to-image task loads the pre-trained DALL-E model.

conda create --name cdcd python=3.8
source activate cdcd
conda install mpi4py==3.0.3
# Install the PyTorch and cudatoolkit versions appropriate for your machine.
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
git clone https://github.com/L-YeZhu/CDCD.git
cd CDCD
pip install -r requirements.txt
cd ./synthesis/modeling/models/
pip install -e .
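
After installation, a quick sanity check (a generic snippet, not part of the repository) confirms that the intended PyTorch build sees your GPU:

import torch
print(torch.__version__)          # expect 1.8.0 per the install command above
print(torch.cuda.is_available())  # should print True on a CUDA 11.1 machine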

3. Dataset

We conduct experiments on three diverse cross-modal and conditional generation tasks across five datasets: dance-to-music generation, text-to-image synthesis, and class-conditioned image synthesis.

3.1 AIST++ Dataset for Dance-to-Music

The AIST++ dataset is a subset of the AIST dataset and can be downloaded from here. We use the cross-modality data split for training and testing.

3.2 TikTok Dance-Music Dataset for Dance-to-Music

The TikTok Dance-Music dataset includes dance videos with paired music collected from in-the-wild environments, and can be downloaded from here.

3.3 Text-to-Image Datasets

We follow dataset preparations similar to those in VQ-Diffusion for CUB200, MSCOCO, and ImageNet.

4. Training

Please check and modify all the paths in the config files to match your machine before running experiments.

Here are some pre-trained models for use: AIST++, TikTok, CUB200.
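
If you would like to inspect a downloaded checkpoint before plugging its path into a config file, a generic PyTorch snippet such as the one below works; the file name here is only a placeholder.

import torch

ckpt = torch.load("pretrained_aistpp.pth", map_location="cpu")  # placeholder path
# Checkpoints are typically dicts of weights and training state; list the top-level keys.
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])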

4.1 Default training

To perform dance-to-music generation on AIST++, run the command below. This default setting trains the contrastive diffusion model with 80 diffusion steps, step-wise parallel contrastive diffusion, and intra-negative music samples.

CUDA_VISIBLE_DEVICES=#IDS python running_command/run_train_aist.py 

To perform dance-to-music generation on TikTok Dance-Music, run the command below. This default setting trains the contrastive diffusion model with 80 diffusion steps, step-wise parallel contrastive diffusion, and intra-negative music samples.

CUDA_VISIBLE_DEVICES=#IDS python running_command/run_train_tiktok.py 

To perform text-to-image synthesis on CUB200, run the command below. This default setting trains the contrastive diffusion model with 80 diffusion steps, sample-wise auxiliary contrastive diffusion, and inter-negative image samples.

CUDA_VISIBLE_DEVICES=#IDS python running_command/run_train_cub.py 

To perform text-to-image synthesis on MSCOCO, run the command below. This default setting trains the contrastive diffusion model with 80 diffusion steps, step-wise parallel contrastive diffusion, and intra-negative image samples.

CUDA_VISIBLE_DEVICES=#IDS python running_command/run_train_coco.py 

To perform class-conditioned image synthesis on ImageNet, run the command below. This default setting trains the contrastive diffusion model with 80 diffusion steps, step-wise parallel contrastive diffusion, and intra-negative image samples.

CUDA_VISIBLE_DEVICES=#IDS python running_command/run_train_imgnet.py 

4.2 Options for different contrastive diffusion settings

As described in our paper, there are several possible combinations of contrastive diffusion modes and negative sampling methods. In addition to the default training settings, you can select and experiment with these settings by making the following modifications.

To switch between step-wise parallel and sample-wise auxiliary contrastive diffusion, modify the weights of the parameters --contrastive_intra_loss_weight and --contrastive_extra_loss_weight, respectively. You can also run the vanilla version by setting both to 0, or include both terms. Note that to run sample-wise auxiliary contrastive diffusion with inter-negative sampling, we provide extra files of negative samples; these must be specified in the config file via --negative_sample_path. We provide some of these files in the ./data folder, but you can also prepare your own following the requirements described in the paper. Other parameters, such as the number of diffusion steps, can also be changed in the config file. A schematic of how these options combine is sketched below.
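
As a rough illustration of how these weights combine (schematic Python, not the actual config or training loop; the loss terms here are placeholders):

import torch

# Placeholder loss terms; in the real training loop these come from the model.
diffusion_loss = torch.tensor(1.0)
intra_contrastive_loss = torch.tensor(0.2)   # step-wise parallel term
extra_contrastive_loss = torch.tensor(0.3)   # sample-wise auxiliary term

# Assumed semantics of the two flags:
#   both 0              -> vanilla (non-contrastive) diffusion
#   intra > 0, extra 0  -> step-wise parallel contrastive diffusion
#   intra 0, extra > 0  -> sample-wise auxiliary contrastive diffusion
#                          (requires --negative_sample_path for inter-negative samples)
#   both > 0            -> combine both contrastive terms
contrastive_intra_loss_weight = 1.0
contrastive_extra_loss_weight = 0.0

total_loss = (diffusion_loss
              + contrastive_intra_loss_weight * intra_contrastive_loss
              + contrastive_extra_loss_weight * extra_contrastive_loss)
print(total_loss)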

5. Inference

Use the following commands to run inference.

python inference/inference_aist.py
python inference/inference_tiktok.py
python inference/inference_cub.py
python inference/inference_coco.py
python inference/inference_imgnet.py

6. Citation

If you find our work interesting and useful, please consider citing it.

@inproceedings{zhu2022discrete,
  title={Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation},
  author={Zhu, Ye and Wu, Yu and Olszewski, Kyle and Ren, Jian and Tulyakov, Sergey and Yan, Yan},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2023}
}

7. Acknowledgement

We would like to thank the authors of previous related projects for generously sharing their code and insights: VQ-Diffusion, Taming Transformer, CLIP, JukeBox, DALL-E model, and D2M-GAN.


cdcd's Issues

requirements.txt file missing for synthesis/modeling/models

Hi Ye,

Thank you for this great work.

When I followed the installation instructions and tried to set up the environment with

cd ./synthesis/modeling/models/
pip install -e .

I got the following error:
[screenshot from 2022-10-14 12-18-55]
But I can't find the requirements.txt file at that path. Did I miss anything?
Thanks.

Hello Ms. Zhu, I was wondering if you have released the implementation of the genre accuracy metric.

Dear Ms. Zhu,
I hope this email finds you well. My name is Zhaoyang Zhang, and I am a fellow researcher in the field of generating music from dance, much like yourself. I recently came across your paper "Quantized GAN for Complex Music Generation from Dance Videos" and was particularly intrigued by the evaluation metric you proposed, specifically the genre accuracy metric.
I am currently conducting experiments with my own models, and I believe that testing the code implementation of your genre accuracy metric would greatly benefit my research. I am writing to kindly request access to the code used to compute this metric in your paper, as it would help ensure the robustness and accuracy of my own experiments.
I understand that sharing code can be sensitive, but I assure you that I will use it solely for academic purposes and will not distribute it without your explicit permission. Any assistance you could provide would be immensely appreciated and duly acknowledged in my work.

Thank you very much for considering my request. I look forward to hearing from you at your earliest convenience.

Warm regards,
Zhaoyang Zhang CUC
[email protected]

Beat Coverage Rate

Hi, I have some questions about the beat coverage rate. As mentioned in the D2M-GAN paper, the beat coverage rate is computed as the number of generated beats divided by the number of original beats. From my point of view, this discourages the generated music from including too many beat keypoints (which would yield a high beat hit rate but poor overall quality). With these preliminaries, I have two main questions.

  1. The beat coverage rate can easily exceed 1 if the generated music has more beat keypoints than the original music, yet I cannot find any beat coverage rate greater than 1 in either the D2M-GAN or CDCD papers, and I wonder why. In this case, is performance considered better when the beat coverage rate is closer to 1, rather than simply larger?
  2. It seems that the final beat coverage rate is computed as the average over all samples. However, suppose a test set has two samples with beat coverage rates of 0.5 and 1.5, respectively; the final beat coverage rate is then 1, which looks perfect under this metric even though each individual sample performs unsatisfactorily. How can such a case be prevented when using this metric? (A small numerical sketch of this concern follows below.)

I would appreciate it if you could clear up my misunderstanding regarding the beat coverage rate, and I would like to have a further discussion about the evaluation of music beats/rhythms.
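
For reference, here is a tiny numerical sketch of the per-sample computation described above and of the averaging concern in question 2 (assuming the metric is simply the generated beat count divided by the ground-truth beat count):

def beat_coverage(num_generated_beats, num_original_beats):
    # Ratio of generated to ground-truth beat counts; note it can exceed 1.
    return num_generated_beats / num_original_beats

# Two hypothetical test samples, matching the numbers in question 2:
per_sample = [beat_coverage(5, 10), beat_coverage(15, 10)]  # 0.5 and 1.5
print(sum(per_sample) / len(per_sample))  # averages to 1.0 even though both samples are off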

aist_s6 data

Hi,
Thanks for your excellent work. I wonder how you processed the aist_s6 data. Do you plan to release it?
Thanks!

Question about Beat Detection

Hi, I am curious about the beat detection procedure. It seems that the beats are computed by extracting the local maxima of the onset envelopes via librosa, which from my perspective is more accurately regarded as auditory rhythm. Since librosa also includes an official beat-detection implementation (librosa.beat.beat_track), which picks peaks in onset strength approximately consistent with the estimated tempo, I wonder whether the beat-detection method in the paper has a particular rationale for the dance-to-music scenario, or whether computing the hit rate of onset maxima reflects music-generation performance more precisely. Thanks.
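
For context, here is a minimal librosa sketch of the two approaches being compared, onset-envelope peak picking versus librosa.beat.beat_track; the audio path and the peak-picking parameters are placeholders, not the paper's actual settings.

import librosa

y, sr = librosa.load("sample_music.wav")  # placeholder path
onset_env = librosa.onset.onset_strength(y=y, sr=sr)

# Approach 1: local maxima of the onset envelope (closer to "auditory rhythm").
peaks = librosa.util.peak_pick(onset_env, pre_max=3, post_max=3,
                               pre_avg=3, post_avg=5, delta=0.5, wait=10)
rhythm_times = librosa.frames_to_time(peaks, sr=sr)

# Approach 2: tempo-consistent beat tracking.
tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

print(len(rhythm_times), len(beat_times), tempo)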

Genre accuracy metric

Hi, I was wondering if you have released the implementation for the genre accuracy metric.
