
pcavs-windows

A Windows 10 guide to installing and running PC-AVS.

The code and guides are slightly modified from the original repository.

Installation Tutorial

Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation (CVPR 2021)

Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu.

We propose Pose-Controllable Audio-Visual System (PC-AVS), which achieves free pose control when driving arbitrary talking faces with audios. Instead of learning pose motions from audios, we leverage another pose source video to compensate only for head motions. The key is to devise an implicit low-dimension pose code that is free of mouth shape or identity information. In this way, audio-visual representations are modularized into spaces of three key factors: speech content, head pose, and identity information.

Requirements

The Anaconda3 Prompt is used throughout this guide.

You can get Anaconda here

Quick Start: Generate Demo Results

Clone the repository, open the Anaconda3 Prompt, and run each of these commands to set up the environment:

cd *Path_To_PCAVS*
conda create -n PCAVS python=3.6
conda activate PCAVS
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch
# or, for NVIDIA 30-series GPUs:
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch

pip install -r requirements.txt
pip install lws
conda install -c conda-forge ffmpeg
pip install face-alignment
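
To confirm the environment works before moving on, you can check that PyTorch sees your GPU (a quick sanity check, not part of the original guide; it should print the version and True):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"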
  • Download the pre-trained checkpoints.

  • Create the default folder ./checkpoints and unzip demo.zip into ./checkpoints/demo. There should be five .pth files inside.

  • Unzip all *.zip files within the misc folder.

  • Run the demo:

python inference.py --name demo --meta_path_vox misc/demo.csv --dataset_mode voxtest --netG modulate --netA resseaudio --netA_sync ressesync --netD multiscale --netV resnext --netE fan --model av --gpu_ids 0 --clip_len 1 --batchSize 16 --style_dim 2560 --nThreads 4 --input_id_feature --generate_interval 1 --style_feature_loss --use_audio 1 --noise_pose --driving_pose --gen_video --generate_from_audio_only

From left to right are the reference input, the generated results, the pose source video and the synced original video with the driving audio.

Prepare Testing Your Own Images/Videos

Check out this tutorial video for a clear demonstration

All of these steps can be done with prepare_testing_files.py. However, that script does not run reliably on Windows, so this guide sets everything up manually. If you want to run prepare_testing_files.py instead, refer to the original branch.

Example input combinations:

| Description | Files |
| --- | --- |
| Make one video talk, with head movements and an extra mouth reference | Input: x.mp4, Audio Source: y.mp4, Pose Source: z.mp4 |
| Make one image talk, with head movements and an extra mouth reference | Input: x.jpg, Audio Source: y.mp4, Pose Source: z.mp4 |
| Make one image talk, no head movements, with an extra mouth reference | Input: x.jpg, Audio Source: y.mp4, Pose Source: x.jpg |
| Make one image talk, no head movements | Input: x.jpg, Audio Source: y.mp3, Pose Source: x.jpg |

If the Audio Source is an mp4, extract the mp3 from it so that you have two files (see the ffmpeg sketch below).
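
A minimal extraction sketch, assuming the mp4 is named y.mp4 and sits in misc/Audio_Source (the names are placeholders):

ffmpeg -i misc/Audio_Source/y.mp4 -vn -acodec libmp3lame misc/Audio_Source/y.mp3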

All the following steps assume every mp4 is 30 fps.
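
If one of your videos is not 30 fps, it can be re-encoded first (a sketch; z.mp4 is a placeholder name):

ffmpeg -i z.mp4 -filter:v fps=30 z_30fps.mp4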

Set up Audio_Source

  • Drag and drop your mp3 into misc/Audio_Source

Set up Mouth_Source (skip this step if your Audio Source is an mp3)

  • Drag and drop your mp4 into misc/Mouth_Source if your Audio Source was an mp4, and create a folder in misc/Mouth_Source with the name of your mp4 file (a mkdir sketch follows the command below)
  • Change the x (2 occurrences) to your mp4's name in the command below, and run it
ffmpeg -i misc/Mouth_Source/x.mp4 -vf fps=30 misc/Mouth_Source/x/%06d.jpg
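
If you prefer the prompt over Explorer, the folder can be created with mkdir before running the ffmpeg command above (x is a placeholder; the same pattern applies to the Input and Pose_Source folders below):

mkdir misc\Mouth_Source\x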

Set up Input

  • Drag and drop your input image/video into misc/Input, and create a folder in misc/Input with the name of your mp4/jpg file
  • If your input is a video, change the y (2 occurrences) to your mp4's name in the command below, and run it
ffmpeg -i misc/Input/y.mp4 -vf fps=30 misc/Input/y/%06d.jpg
  • If your input is an image, drag and drop it into the folder and rename it to 000000.jpg (see the sketch below)
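
For an image input, the whole step from the prompt might look like this (y.jpg is a placeholder name):

mkdir misc\Input\y
copy y.jpg misc\Input\y\000000.jpg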

Set up Pose_Source

  • Drag and drop your pose image/video into misc/Pose_Source, and create a folder in misc/Pose_Source with the name of your mp4/jpg file
  • If your pose source is a video, change the z (2 occurrences) to your mp4's name in the command below, and run it
ffmpeg -i misc/Pose_Source/z.mp4 -vf fps=30 misc/Pose_Source/z/%06d.jpg
  • If your pose source is an image, drag and drop it into the folder and rename it to 000000.jpg

Align Frames

After the drag-and-drop steps, run each of these commands (replace x, y, and z with the right names):

python scripts/align_68.py --folder_path misc/Mouth_Source/x
python scripts/align_68.py --folder_path misc/Input/y
python scripts/align_68.py --folder_path misc/Pose_Source/z

If you get the error preprocessing failed, it means faces could not be detected in some frames. One way to fix it is to trim the video down to the segment where the face stays visible.
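
A trimming sketch using ffmpeg stream copy; the timestamps are placeholders for a segment where the face is visible:

ffmpeg -ss 00:00:01 -i misc/Input/y.mp4 -t 10 -c copy misc/Input/y_trimmed.mp4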

New folders called x_cropped, y_cropped, z_cropped will be created.


Conditional: Stabilize aligned videos

If your aligned faces are really shaky when put back into a video, you can stabilize them. First, assemble the cropped frames into a video:

Mouth_Source (replace the x with the corresponding name)

ffmpeg -i misc/Mouth_Source/x_cropped/%06d.jpg -vf fps=30 misc/Mouth_Source/x_aligned_stabled.mp4

Input (replace the y with the corresponding name)

ffmpeg -i misc/Input/y_cropped/%06d.jpg -vf fps=30 misc/Input/y_aligned_stabled.mp4

Pose_Source (replace the z with the corresponding name)

ffmpeg -i misc/Pose_Source/z_cropped/%06d.jpg -vf fps=30 misc/Pose_Source/z_aligned_stabled.mp4

Stabilize the videos with a video stabilization tool of your choice.
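
If you don't have a dedicated tool, ffmpeg's built-in deshake filter can serve as a rough substitute (a sketch; results vary, and the filenames are placeholders):

ffmpeg -i misc/Pose_Source/z_aligned_stabled.mp4 -vf deshake misc/Pose_Source/z_deshaked.mp4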

After stabilizing them, replace the corresponding mp4s (e.g. replace z_aligned_stabled.mp4 with the actual stabilized mp4).

Mouth_Source

ffmpeg -i misc/Mouth_Source/x_aligned_stabled.mp4 -vf fps=30 misc/Mouth_Source/x_aligned_stabled/%06d.jpg

Input

ffmpeg -i misc/Input/y_aligned_stabled.mp4 -vf fps=30 misc/Input/y_aligned_stabled/%06d.jpg

Pose_Source

ffmpeg -i misc/Pose_Source/z_aligned_stabled.mp4 -vf fps=30 misc/Pose_Source/z_aligned_stabled/%06d.jpg

Set up demo.csv

  • Open the file demo.csv with Notepad
  • Count the number of frames in each x_cropped, y_cropped, and z_cropped folder (a counting command is shown after the table)
  • Change the content according to this table:
| Description | Files | demo.csv line |
| --- | --- | --- |
| Make one video talk, with head movements and an extra mouth reference | Input: x.mp4, Audio Source: y.mp4, Pose Source: z.mp4 | misc/Input/x_cropped _x_NUMBER_OF_FRAMES_HERE_ misc/Pose_Source/z_cropped _z_NUMBER_OF_FRAMES_HERE_ misc/Audio_Source/y.mp3 misc/Mouth_Source/y_cropped _y_NUMBER_OF_FRAMES_HERE_ dummy |
| Make one image talk, with head movements and an extra mouth reference | Input: x.jpg, Audio Source: y.mp4, Pose Source: z.mp4 | misc/Input/x_cropped 1 misc/Pose_Source/z_cropped _z_NUMBER_OF_FRAMES_HERE_ misc/Audio_Source/y.mp3 misc/Mouth_Source/y_cropped _y_NUMBER_OF_FRAMES_HERE_ dummy |
| Make one image talk, no head movements, with an extra mouth reference | Input: x.jpg, Audio Source: y.mp4, Pose Source: x.jpg | misc/Input/x_cropped 1 misc/Input/x_cropped 1 misc/Audio_Source/y.mp3 misc/Mouth_Source/y_cropped _y_NUMBER_OF_FRAMES_HERE_ dummy |
| Make one image talk, no head movements | Input: x.jpg, Audio Source: y.mp3, Pose Source: x.jpg | misc/Input/x_cropped 1 misc/Input/x_cropped 1 misc/Audio_Source/y.mp3 None 0 dummy |
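
Counting frames by hand is error-prone; from the Anaconda Prompt (cmd), a one-liner can count the jpgs in a cropped folder (the path is a placeholder):

dir /b misc\Input\x_cropped\*.jpg | find /c ".jpg"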

Run

After demo.csv is set up, run the model with this command:

python inference.py --name demo --meta_path_vox misc/demo.csv --netG modulate --netA resseaudio --netA_sync ressesync --netD multiscale --netV resnext --netE fan --model av --gpu_ids 0 --clip_len 1 --batchSize 16 --style_dim 2560 --nThreads 4 --input_id_feature --generate_interval 1 --style_feature_loss --use_audio 1 --noise_pose --gen_video --driving_pose --generate_from_audio_only

The results will be in the results folder.

License and Citation

The usage of this software is under CC-BY-4.0.

@InProceedings{zhou2021pose,
  author    = {Zhou, Hang and Sun, Yasheng and Wu, Wayne and Loy, Chen Change and Wang, Xiaogang and Liu, Ziwei},
  title     = {Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2021}
}

Acknowledgement
