
spectre's Introduction

SPECTRE: Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos

Paper   Project WebPage   Youtube Video

Our method performs visual-speech aware 3D reconstruction so that speech perception from the original footage is preserved in the reconstructed talking head. On the left we include the word/phrase being said for each example.

This is the official Pytorch implementation of the paper:

Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos
Panagiotis P. Filntisis, George Retsinas, Foivos Paraperas-Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos
arXiv 2022

Installation

Clone the repo and its submodules:

git clone --recurse-submodules -j4 https://github.com/filby89/spectre
cd spectre

You need a working installation of PyTorch (Python 3.6 or higher) and PyTorch3D. You can use the following commands to create a working installation:

conda create -n "spectre" python=3.8
conda install -c pytorch pytorch=1.11.0 torchvision torchaudio # you might need to select cudatoolkit version here by adding e.g. cudatoolkit=11.3
conda install -c conda-forge -c fvcore fvcore iopath 
conda install pytorch3d -c pytorch3d
pip install -r requirements.txt # install the rest of the requirements

Installing a working setup of PyTorch3D with PyTorch can be a bit tricky. For development we used PyTorch3D 0.6.1 with PyTorch 1.10.0; PyTorch3D 0.6.2 with PyTorch 1.11.0 is also compatible.
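
A quick way to confirm that the installed versions import together (run inside the spectre environment):

import torch
import pytorch3d

# Confirm the PyTorch / PyTorch3D / CUDA combination before moving on.
print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("pytorch3d:", pytorch3d.__version__)
print("cuda available:", torch.cuda.is_available())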

Install the face_alignment and face_detection packages:

cd external/face_alignment
pip install -e .
cd ../face_detection
git lfs pull
pip install -e .
cd ../..

You may need to install git-lfs to run the above commands; on Debian/Ubuntu it can be installed with:

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs

Download the FLAME model and the pretrained SPECTRE model:

pip install gdown
bash quick_install.sh

Demo

Samples are included in the samples folder. You can run the demo with:

python demo.py --input samples/LRS3/0Fi83BHQsMA_00002.mp4 --audio

The --audio flag extracts the audio from the input video and adds it to the output shape video for visualization purposes (ffmpeg is required for video creation).
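
If you ever need to mux the audio back in yourself, a minimal sketch using ffmpeg through Python could look like the following (the file names below are placeholders, not the demo's actual output paths):

import subprocess

rendered = "outputs/0Fi83BHQsMA_00002_shape.mp4"   # placeholder: the rendered (silent) shape video
original = "samples/LRS3/0Fi83BHQsMA_00002.mp4"    # the original input video with audio

# Take the video stream from the render and the audio stream from the original.
subprocess.run([
    "ffmpeg", "-y",
    "-i", rendered, "-i", original,
    "-map", "0:v", "-map", "1:a",
    "-c:v", "copy", "-c:a", "aac", "-shortest",
    "outputs/0Fi83BHQsMA_00002_with_audio.mp4",
], check=True)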

Training and Testing

In order to train the model you need to download the trainval and test sets of the LRS3 dataset. After downloading the dataset, run the following command to extract frames and audio from the videos (audio is not needed for training but it is nice for visualizing the result):

python utils/extract_frames_and_audio.py --dataset_path ./data/LRS3

After downloading and preprocessing the dataset, download the remaining required assets:

bash get_training_data.sh

This command downloads the original DECA pretrained model, the ResNet50 emotion recognition model provided by EMOCA, the pretrained lipreading model and detected landmarks for the videos of the LRS3 dataset provided by Visual_Speech_Recognition_for_Multiple_Languages.

Finally, you need to create a texture model using the BFM_to_FLAME repository. Due to licensing reasons we are not allowed to share it.

Now, you can run the following command to train the model:

python main.py --output_dir logs --landmark 50 --relative_landmark 25 --lipread 2 --expression 0.5 --epochs 6 --LRS3_path data/LRS3 --LRS3_landmarks_path data/LRS3_landmarks
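
For reference, the numeric values passed to --landmark, --relative_landmark, --lipread and --expression appear to act as relative weights on the corresponding loss terms. A rough, hypothetical sketch of how such a weighted sum combines (dummy tensors, not the actual training code):

import torch

# Dummy stand-ins; during training each term is computed from the model outputs.
landmark_loss = torch.tensor(0.01, requires_grad=True)
relative_landmark_loss = torch.tensor(0.02, requires_grad=True)
lipread_loss = torch.tensor(1.5, requires_grad=True)
expression_loss = torch.tensor(0.8, requires_grad=True)

total_loss = (50.0 * landmark_loss             # --landmark 50
              + 25.0 * relative_landmark_loss  # --relative_landmark 25
              + 2.0 * lipread_loss             # --lipread 2
              + 0.5 * expression_loss)         # --expression 0.5
total_loss.backward()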

You can then test the trained model on the LRS3 test set:

python main.py --test --output_dir logs --model_path logs/model.tar --LRS3_path data/LRS3 --LRS3_landmarks_path data/LRS3_landmarks

and run lipreading with AV-HuBERT:

python utils/run_av_hubert.py --videos "logs/test_videos_000000/*_mouth.avi" --LRS3_path data/LRS3

Acknowledgements

This repo has been heavily based on the original implementation of DECA. We also acknowledge the following repositories, from which we have benefited greatly:

Citation

If your research benefits from this repository, consider citing the following:

@misc{filntisis2022visual,
  title = {Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos},
  author = {Filntisis, Panagiotis P. and Retsinas, George and Paraperas-Papantoniou, Foivos and Katsamanis, Athanasios and Roussos, Anastasios and Maragos, Petros},
  publisher = {arXiv},
  year = {2022},
}

spectre's People

Contributors

filby89, iamgmujtaba, radekd91


spectre's Issues

Type of image input

Hello, from your repo it looks like the image inputs take values in [0, 255], but in DECA they divide by 255 during dataset preprocessing, right?
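
For context, DECA-style preprocessing divides by 255 so the network receives float inputs in [0, 1]; a minimal sketch of that convention (the frame array below is only a placeholder):

import numpy as np
import torch

frame = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder for a cropped face image in [0, 255]
img = torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0  # (3, 224, 224), values in [0, 1]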

Reconstructed mesh is not stable

test.mp4

Hey, I used your demo.py script to run my video. Although the expressiveness and lip movement look perfect, the reconstructed face seems unstable. Is there any way to solve this problem? Much appreciated!

About landmarks

Hi, thanks for the nice work!
I notice the landmarks come from the mesh vertices, but I wonder how the landmark indices in FLAME were obtained? And what are the differences between 'dynamic/static/full_lmk_faces_idx' and 'dynamic/static/full_lmk_bary_coords'?

Thanks

Question about saving the mesh

I tried to save the mesh using 'vert' together with the face indices (f) of 'head_template.obj', but my output is not the shape of a human head. Where am I going wrong? The 'vert' I use comes from the return value of the 'decode()' function in 'spectre.py'.
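
As an illustrative (not official) way to export a single frame's mesh, assuming verts is one frame's (5023, 3) FLAME vertex tensor taken from the decoder output and that the template lives at data/head_template.obj (trimesh is an extra dependency, not part of this repo's requirements):

import numpy as np
import torch
import trimesh

verts = torch.zeros(5023, 3)  # placeholder: one frame's FLAME vertices, e.g. a single frame from decode()
template = trimesh.load("data/head_template.obj", process=False)  # assumed asset path

mesh = trimesh.Trimesh(vertices=verts.detach().cpu().numpy(),
                       faces=np.asarray(template.faces),
                       process=False)
mesh.export("frame_0000.obj")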

About the Pre-trained model

Hi, your work is awesome!

And I would like to know which pre-trained model you use when utilizing the ExpressionLossNet.

self.emotion_checkpoint = torch.load("data/ResNet50/checkpoints/deca-epoch=01-val_loss_total/dataloader_idx_0=1.27607644.ckpt")['state_dict']

I tried to find it in EMOCA but couldn't locate it. Could you provide a specific download link?

Can not find the run_av_hubert.sh file

Thanks for your great work. At the end of the README, you point out the test command for the LRS3 dataset, but I cannot find the script named run_av_hubert.sh. So how do I verify the lipreading performance of the model?

dataset link invalid

Thank you for your excellent work! The link to the LRS3 dataset on the official website is no longer valid. Could you provide a link to your own copy of the dataset?

Is it possible to use this work for single image reconstruction?

Hi, nice work!

I wanted to use this work for single-image reconstruction but found that the provided model is designed for video sequences.
I wonder if it is possible for the single-image case? For example, if I concatenate a single frame multiple times and feed the sequence into the model, does it produce reasonable results?
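
For illustration only, such a repeated-frame window could be built as below (the crop tensor is a placeholder and the real face cropping/normalization is omitted); whether the temporal layers behave sensibly on a static sequence is exactly the open question here:

import torch

crop = torch.zeros(3, 224, 224)  # placeholder for a preprocessed face crop
seq = crop.unsqueeze(0).repeat(5, 1, 1, 1).unsqueeze(0)  # (1, 5, 3, 224, 224): one 5-frame window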

I also wonder how the 3D reconstruction accuracy of this work compares to EMOCA. The original paper did not report this kind of metric.

Thanks in advance!

Unable to perceive improvement.

Thanks for sharing this work, good insight and inspiring.
But I'm unable to perceive an improvement from the pretrained model.
My inference with E_expression:
For images: inputs are just the same image concatenated, (1, 5, 224, 224); output is FLAME parameters, (5, 53); I chose the center frame's parameters, output[2, :].
For videos: inputs are just the same frame concatenated, (1, 5, 224, 224); output is FLAME parameters, (5, 53); I chose the center frame's parameters, output[2, :]. Or maybe I should try concatenating 5 continuous frames for the E_expression input?
My mean_shape for alignment is consistent with the author's.
Comparison of the results (talking-head videos and single-image reconstruction) between E_flame_without_E_expression and E_flame_with_E_expression:

E_flame_without_E_expression:

talkinghead_E_flame_without_E_expression.mp4

msk_E_flame_without_E_expression

obm_E_flame_without_E_expression

E_flame_with_expression:

talkinghead_E_flame_with_E_expression.mp4

msk_E_flame_with_E_expression

obm_E_flame_with_E_expression

Sorry, my test may not be sufficient, and my preprocessing may not be accurate.

Expression Retargeting

Is it possible to retarget/transfer the expressions from one face to another while keeping the identity and pose of the head?
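
Since FLAME-based models such as DECA and SPECTRE keep identity (shape), expression, and pose in separate parameter vectors, one could in principle swap the expression and jaw-pose codes between two reconstructions. A hedged sketch with hypothetical code dictionaries and DECA-style dimensions:

import copy
import torch

# Hypothetical encoder outputs: source = expression donor, target = identity/pose keeper.
codedict_src = {"exp": torch.randn(1, 50), "pose": torch.randn(1, 6)}
codedict_tgt = {"exp": torch.randn(1, 50), "pose": torch.randn(1, 6), "shape": torch.randn(1, 100)}

retargeted = copy.deepcopy(codedict_tgt)
retargeted["exp"] = codedict_src["exp"]                      # expression coefficients
retargeted["pose"][..., 3:] = codedict_src["pose"][..., 3:]  # jaw pose (last 3 of the 6-dim pose)
# verts = spectre.decode(retargeted)  # hypothetical: re-decode with target identity, source expression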

Cannot clone the submodule external/Visual_Speech_Recognition_for_Multiple_Languages

Hi,

This repo contains a submodule with the URL https://github.com/filby89/Visual_Speech_Recognition_for_Multiple_Languages.git, but there is no public repository at that URL.

So the command git clone --recurse-submodules -j4 https://github.com/filby89/spectre fails and it is not possible to run the demo, because it fails when trying to import FaceTracker from external.Visual_Speech_Recognition_for_Multiple_Languages.tracker.face_tracker.

Could you please look into that?

Problem with face_detection

Hello, I'm working on reproducing your great paper. However, I'm stuck at the git lfs pull step for face_detection.
I ran git lfs pull and got this error:

Error downloading object: ibug/face_detection/retina_face/weights/Resnet50_Final.pth (6d1de9c): 
Smudge error: Error downloading ibug/face_detection/retina_face/weights/Resnet50_Final.pth 
(6d1de9c2944f2ccddca5f5e010ea5ae64a39845a86311af6fdf30841b0a5a16d): batch response: 
This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Could you offer me a Google Drive link to the pretrained weights for face_detection, or any other alternative? Thank you very much~

How to change training datasets?

Hi, I am wondering how to switch to a dataset without landmarks: 1) At present, your code is trained on the LRS3 dataset, which has ground-truth landmarks. 2) If I replace the dataset, can I remove some of the landmark constraints? 3) Are there any special requirements, such as video length? If possible, I want to create my own dataset with cameras.

Best,
YangXin
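
As an illustrative note (not an official answer): 68-point landmarks for a custom dataset can be generated with the bundled face_alignment package; a hedged sketch (the LandmarksType enum name varies slightly across face_alignment versions):

import numpy as np
import face_alignment

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder for an RGB video frame

# Older releases use LandmarksType._2D; newer ones spell it LandmarksType.TWO_D.
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, device="cpu")
landmarks = fa.get_landmarks(frame)  # list of (68, 2) arrays per detected face, or None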

Unable to download the pretrained SPECTRE model

Hello, I just ran the code you provided to download the pretrained SPECTRE model, but got a permission denied error, as follows. Could you please provide a new link?

gdown --id 1vmWX6QmXGPnXTXWFgj67oHzOoOmxBh6B
Permission denied: https://drive.google.com/uc?id=1vmWX6QmXGPnXTXWFgj67oHzOoOmxBh6B
Maybe you need to change permission over 'Anyone with the link'

script demo.py fails to start

python demo.py --input samples/LRS3/0Fi83BHQsMA_00002.mp4 --audio

Downloading: "https://github.com/pytorch/vision/archive/v0.8.1.zip" to /home/user/.cache/torch/hub/v0.8.1.zip
/home/user/miniconda3/envs/spectre/lib/python3.8/site-packages/chumpy/__init__.py:11: FutureWarning: In the future np.bool will be defined as the corresponding NumPy scalar.
from numpy import bool, int, float, complex, object, unicode, str, nan, inf
/home/user/miniconda3/envs/spectre/lib/python3.8/site-packages/chumpy/__init__.py:11: FutureWarning: In the future np.object will be defined as the corresponding NumPy scalar.
from numpy import bool, int, float, complex, object, unicode, str, nan, inf
/home/user/miniconda3/envs/spectre/lib/python3.8/site-packages/chumpy/__init__.py:11: FutureWarning: In the future np.str will be defined as the corresponding NumPy scalar.
from numpy import bool, int, float, complex, object, unicode, str, nan, inf
Traceback (most recent call last):
File "demo.py", line 231, in <module>
main(parser.parse_args())
File "demo.py", line 91, in main
spectre = SPECTRE(spectre_cfg, args.device)
File "/home/user/dev/face_stuff/spectre/src/spectre.py", line 38, in __init__
self._create_model(self.cfg.model)
File "/home/user/dev/face_stuff/spectre/src/spectre.py", line 73, in _create_model
self.flame = FLAME(model_cfg).to(self.device)
File "/home/user/dev/face_stuff/spectre/src/models/FLAME.py", line 47, in __init__
ss = pickle.load(f, encoding='latin1')
File "/home/user/miniconda3/envs/spectre/lib/python3.8/site-packages/chumpy/__init__.py", line 11, in <module>
from numpy import bool, int, float, complex, object, unicode, str, nan, inf
ImportError: cannot import name 'bool' from 'numpy' (/home/user/miniconda3/envs/spectre/lib/python3.8/site-packages/numpy/__init__.py)

couldn't download the spectre_model.tar

Hey @filby89,
I was trying to download the model but couldn't.
I'm getting the error below:

requests.exceptions.MissingSchema: Invalid URL '': No schema supplied. Perhaps you meant http://?
mv: cannot stat 'spectre_model.tar': No such file or directory

Can you please share the link here?
Thank you.

Magnitude of expression loss while training

Hi, thanks for your wonderful work!
Recently, we have been trying to train SPECTRE on our own dataset, but we find that the expression loss is hard to converge. To locate the problem, we also tried the official SPECTRE training setting and again found the expression loss hard to converge.

So I wonder what the magnitude of the expression loss should be before and after training. Also, could you give us some advice on locating this problem?

Thanks!

Screenshot 2023-06-06 10 22 33
