

Video editing with a foundational Stable Diffusion model

Stable Diffusion can generate images from input prompts, which makes a pretrained diffusion model a viable tool for editing images under given conditions. This pipeline is adapted from Tune-A-Video (github, paper), which is built on the open-source Hugging Face Diffusers library and its pretrained checkpoints.

Results

Input video prompt: "A woman is talking"

Output video prompts:

  • "A woman, wearing Superman clothes, is talking"
  • "A woman, wearing Batman's mask, is talking"
  • "A Wonder Woman is talking, cartoon style"

The pipeline

Setup

Conda

In a conda env, I installed Python 3.11 (conda install python=3.11) and the following:

pip install -r requirements.txt

Docker Container

You can also build a Docker container to use GPUs for training and inference (postprocessing not included). Build the Docker image (from the current folder):

docker build -t image_name -f ./docker/Dockerfile .

Launch the Docker image with the following command (if needed, add --network your_docker_network_name to specify a network). You will get a running shell with access to NVIDIA GPUs. Then follow the instructions in the next sections.

docker run --gpus all -it image_name

Training:

The input video is decomposed into frame images. The prompt and the images (in batches) are embedded into latent vectors. The model is trained to semantically match these latent vectors via a cross-attention U-Net architecture.
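The training step described above can be sketched numerically: frames and the prompt become latent vectors, noise is added, and the U-Net is trained to predict that noise. The snippet below is a one-dimensional toy version of this objective; the schedule values, the oracle predictor, and all names are illustrative stand-ins, not the pipeline's real components.

```python
import math
import random

# Toy cumulative-alpha noise schedule -- made-up values, not the real scheduler's
ALPHAS_CUMPROD = [0.999, 0.95, 0.85, 0.7, 0.5, 0.3]

def add_noise(x0, eps, t):
    """Forward diffusion on a (scalar) latent: x_t = sqrt(a_t)*x0 + sqrt(1-a_t)*eps."""
    a = ALPHAS_CUMPROD[t]
    return math.sqrt(a) * x0 + math.sqrt(1 - a) * eps

def training_loss(x0, t, predict_eps):
    """One training step: noise the latent, predict the noise, score with MSE."""
    eps = random.gauss(0.0, 1.0)      # the true injected noise
    x_t = add_noise(x0, eps, t)
    eps_hat = predict_eps(x_t, t)     # stand-in for the cross-attention U-Net
    return (eps_hat - eps) ** 2

def oracle(x_t, t, x0=0.8):
    """A 'perfect' predictor that recovers the injected noise exactly."""
    a = ALPHAS_CUMPROD[t]
    return (x_t - math.sqrt(a) * x0) / math.sqrt(1 - a)
```

An oracle predictor drives the loss to (numerically) zero, e.g. training_loss(0.8, 3, oracle); training nudges the real U-Net toward that behavior for the given video/prompt pair.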

  1. Download the Stable Diffusion model and the pretrained weights.

    ./download_models.sh
    
  2. I strongly suggest launching accelerate jobs from the terminal. First, configure Accelerate for distributed (or non-distributed) training.

    accelerate config
    
  3. Launch a training job.

    accelerate launch train_tuneavideo.py --config='./configs/woman-talking.yaml'
    

Some notes: I have tried different image aspect ratios and resolutions, and I think the best is (512, 512), which is the default image size of the pretrained model. GPU memory is the bottleneck for training (even with an A100 40GB) since the model itself is quite large. Due to resource limitations, I was only able to train on videos with up to 16 frames in total.
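For reference, a training config along these lines might look like the sketch below. The keys follow Tune-A-Video's config style, but the exact names and values here are assumptions; check ./configs/woman-talking.yaml in the repo for the real fields.

```yaml
pretrained_model_path: "./checkpoints/stable-diffusion"  # assumed path from download_models.sh
output_dir: "./outputs/woman-talking"
train_data:
  video_path: "data/woman-talking.mp4"   # hypothetical input video
  prompt: "A woman is talking"
  n_sample_frames: 16    # total frames -- limited by GPU memory
  width: 512             # (512, 512) worked best in my runs
  height: 512
train_batch_size: 1
max_train_steps: 500
```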

Inferencing:

Once the training is done (modify inference.py if needed), do:

python inference.py

In this process,

  1. New prompts are embedded. The new latent vectors are initialized through DDIM inversion, providing structure guidance for sampling.
  2. The new latent vectors are decoded into frames (with the same dimensions as the input video) by a VAE decoder.
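DDIM inversion in step 1 can be illustrated with a toy one-dimensional example: because the DDIM update is deterministic, running it in the "noising" direction maps a clean latent to a noise latent that the usual sampling direction maps back. Everything below (the schedule values and the constant noise predictor) is a made-up stand-in for the trained U-Net, purely to show the mechanics.

```python
import math

# Toy cumulative-alpha schedule: index 0 is (almost) clean, index 5 is very noisy.
# These values, and the constant noise "predictor", are made up for illustration.
ALPHAS_CUMPROD = [0.9999, 0.98, 0.9, 0.7, 0.4, 0.1]

def eps_pred(x, t):
    """Stand-in for the trained U-Net's noise prediction."""
    return 0.5

def ddim_step(x, t_from, t_to):
    """One deterministic DDIM transition between noise levels t_from and t_to."""
    a_from, a_to = ALPHAS_CUMPROD[t_from], ALPHAS_CUMPROD[t_to]
    eps = eps_pred(x, t_from)
    # Predict the clean latent, then re-noise it to the target level
    x0_pred = (x - math.sqrt(1 - a_from) * eps) / math.sqrt(a_from)
    return math.sqrt(a_to) * x0_pred + math.sqrt(1 - a_to) * eps

def ddim_invert(x0, steps):
    """Run the update in the noising direction: clean latent -> noise latent."""
    x = x0
    for t in range(steps):
        x = ddim_step(x, t, t + 1)
    return x

def ddim_sample(xT, steps):
    """Usual sampling direction: noise latent -> clean latent."""
    x = xT
    for t in range(steps, 0, -1):
        x = ddim_step(x, t, t - 1)
    return x
```

Here the round trip ddim_sample(ddim_invert(x, 5), 5) recovers x exactly, since the toy predictor ignores its input; with a real U-Net it is only approximate. In the real pipeline, the inverted latent of the source video is the starting point for sampling with the edited prompt, which is what carries over the original structure.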

Postprocessing:

It contains a few functionalities using the moviepy module. See postprocess.ipynb.

  1. The audio track is extracted from the original video.
  2. A new video is made by combining that audio with the newly generated video of the same duration.
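The two steps above can be sketched with moviepy (assuming its 1.x API, where clips expose .audio and set_audio); the file names are placeholders, and postprocess.ipynb remains the authoritative version:

```python
def mux_audio(original_video, edited_video, out_path):
    """Extract the audio track from the original clip and attach it to the edited clip."""
    # moviepy import kept inside the function so the sketch is importable without moviepy
    from moviepy.editor import VideoFileClip

    source = VideoFileClip(original_video)   # e.g. "woman-talking.mp4" (placeholder)
    edited = VideoFileClip(edited_video)     # frames generated by inference.py

    # Step 1: take the audio from the original video
    audio = source.audio

    # Step 2: combine it with the new video of the same duration
    edited.set_audio(audio).write_videofile(out_path)
```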

Contributors

zhuowenzhao
