2D3MF: Deepfake Detection using Multi Modal Middle Fusion

Caution

This repo is under active development. No hyperparameter tuning has been performed yet, so the current architecture is not optimal for deepfake detection.

This repo is the implementation for the paper 2D3MF: Deepfake Detection using Multi Modal Middle Fusion.

Repository Structure

.
├── assets                # Images for README.md
├── LICENSE
├── README.md
├── MODEL_ZOO.md
├── CITATION.cff
├── .gitignore
├── .github

# below is for the PyPI package marlin-pytorch
├── src                   # Source code for marlin-pytorch and audio feature extractors
├── tests                 # Unittest
├── requirements.lib.txt
├── setup.py
├── init.py
├── version.txt

# below is for the paper implementation
├── configs              # Configs for experiments settings
├── TD3MF                # 2D3MF model code
├── preprocess           # Preprocessing scripts
├── dataset              # Dataloaders
├── utils                # Utility functions
├── train.py             # Training script
├── evaluate.py          # Evaluation script
├── requirements.txt

Installing and running our model

Feature Extraction - 2D3MF

Install 2D3MF from PyPI:

pip install 2D3MF

Sample code snippet for feature extraction

from TD3MF.classifier import TD3MF
ckpt = "ckpt/celebvhq_marlin_deepfake_ft/last-v72.ckpt"
model = TD3MF.load_from_checkpoint(ckpt)
features = model.feature_extraction("2D3MF_Datasets/test/SampleVideo_1280x720_1mb.mp4")
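If you need features for a whole folder of clips, here is a minimal sketch that loops over a directory; the input directory is illustrative, and feature_extraction is the same call shown above:

from pathlib import Path

from TD3MF.classifier import TD3MF

# Checkpoint path from the snippet above; the input directory below is illustrative.
ckpt = "ckpt/celebvhq_marlin_deepfake_ft/last-v72.ckpt"
model = TD3MF.load_from_checkpoint(ckpt)

features = {}
for video in sorted(Path("2D3MF_Datasets/test").glob("*.mp4")):
    # Extract 2D3MF features for each clip in the folder.
    features[video.name] = model.feature_extraction(str(video))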

We have some pretrained marlin checkpoints and configurations here

Paper Implementation

Requirements:

  • Python >= 3.7, < 3.12
  • PyTorch ~= 1.11
  • Torchvision ~= 0.12
  • ffmpeg

Installation

Install PyTorch from the official website

Clone the repo and install the requirements:

git clone https://github.com/aiden200/2D3MF
cd 2D3MF
pip install -e .

Training

1. Download Datasets

Forensics++

We cannot include the download script directly in our repository due to the dataset's terms of use. Please follow the instructions on the [Forensics++](https://github.com/ondyari/FaceForensics?tab=readme-ov-file) page to obtain the download script.

Storage

- FaceForensics++
    - The original source videos downloaded from YouTube: 38.5 GB
    - All H.264-compressed videos, by compression rate factor:
        - raw/0: ~500 GB
        - 23: ~10 GB (which we use)

Downloading the data

Please download the Forensics++ dataset. We used all of the lightly compressed (c23) original and altered videos from the three manipulation methods. Use the download script from the Forensics++ repository with an invocation that ends with: <output path> -d all -c c23 -t videos

The script offers two servers, which can be selected by adding --server <EU or CA>. If the EU server is not working for you, try EU2, which has been reported to work in those cases.
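Putting the flags together, a full invocation might look like the following; download_script.py stands in for whatever filename the Forensics++ maintainers provide, and the output path is illustrative:

python download_script.py /path/to/Parent_dir -d all -c c23 -t videos --server EU2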

Audio download

Once the first two steps are complete, you should have the following structure:

-- Parent_dir
|-- manipulated_sequences
|-- original_sequences

Since the Forensics++ dataset doesn't provide audio data, we need to obtain it ourselves. Please run the script from the Forensics++ repository with an invocation that ends with: <Parent_dir from last step> -d original_youtube_videos_info

Now you should have a directory with the following structure:

-- Parent_dir
|-- manipulated_sequences
|-- original_sequences
|-- downloaded_videos_info

Please run the script from our repository: python3 preprocess/faceforensics_scripts/extract_audio.py --dir [Parent_dir]

After this, you should have a directory with the following structure:

-- Parent_dir
|-- manipulated_sequences
|-- original_sequences
|-- downloaded_videos_info
|-- audio_clips

References

  • Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, Matthias Nießner. "FaceForensics++: Learning to Detect Manipulated Facial Images." In International Conference on Computer Vision (ICCV), 2019.
DFDC

Kaggle provides a nice and easy way to download the [DFDC dataset](https://www.kaggle.com/c/deepfake-detection-challenge/data).

DeepFakeTIMIT

We recommend downloading the data from the [DeepfakeTIMIT Zenodo record](https://zenodo.org/records/4068245).

FakeAVCeleb

We recommend requesting access to FakeAVCeleb via their [repo README](https://github.com/DASH-Lab/FakeAVCeleb).

RAVDESS

We recommend downloading the data from the [RAVDESS Zenodo record](https://zenodo.org/records/1188976).

2. Preprocess the dataset

We recommend using the following unified dataset structure

2D3MF_Dataset/
├── DeepfakeTIMIT
│   ├── audio/*.wav
│   └── video/*.mp4
├── DFDC
│   ├── audio/*.wav
│   └── video/*.mp4
├── FakeAVCeleb
│   ├── audio/*.wav
│   └── video/*.mp4
├── Forensics++
│   ├── audio/*.wav
│   └── video/*.mp4
└── RAVDESS
    ├── audio/*.wav
    └── video/*.mp4
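To create this skeleton in one go (bash brace expansion; dataset names are taken from the tree above):

mkdir -p 2D3MF_Dataset/{DeepfakeTIMIT,DFDC,FakeAVCeleb,Forensics++,RAVDESS}/{audio,video}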

Crop the face region from the raw video. Run:

python3 preprocess/preprocess_clips.py --data_dir [Dataset_Dir]

3. Extract features from pretrained models

EfficientFace

Download the pre-trained EfficientFace from here, under 'Pre-trained models'. In our experiments, we use the model pre-trained on AffectNet7, i.e., EfficientFace_Trained_on_AffectNet7.pth.tar. Please place it under the pretrained directory.

Run:

python preprocess/extract_features.py --data_dir /path/to/data --video_backbone [VIDEO_BACKBONE] --audio_backbone [AUDIO_BACKBONE]

[VIDEO_BACKBONE] can be replaced with one of the following:

  • marlin_vit_small_ytf
  • marlin_vit_base_ytf
  • marlin_vit_large_ytf
  • efficientface

[AUDIO_BACKBONE] can be replaced with one of the following:

  • MFCC
  • xvectors
  • resnet
  • emotion2vec
  • eat

Optionally, add the --Forensics flag at the end if Forensics++ is the dataset being processed.

In our paper, we found that eat works best as the audio backbone.
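For example, to extract MARLIN video features and EAT audio features (the data path is illustrative; add --Forensics when pointing at Forensics++):

python preprocess/extract_features.py --data_dir 2D3MF_Datasets/DFDC --video_backbone marlin_vit_small_ytf --audio_backbone eat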

Split the data into train, validation, and test sets. Run:

python preprocess/gen_split.py --data_dir /path/to/data --test 0.1 --val 0.1 --feat_type [AUDIO_BACKBONE]

Note that the pre-trained video_backbone and audio_backbone checkpoints can be downloaded from MODEL_ZOO.md.

4. Train and evaluate

Train and evaluate the 2D3MF model.

Please use the configs in config/*.yaml as the config file.

python evaluate.py \
    --config /path/to/config \
    --data_path /path/to/CelebV-HQ \
    --num_workers 4 \
    --batch_size 16


python evaluate.py \
    --config /path/to/config  \
    --data_path /path/to/dataset \
    --num_workers 4 \
    --batch_size 8 \
    --marlin_ckpt pretrained/marlin_vit_base_ytf.encoder.pt \
    --epochs 300


python evaluate.py \
    --config config/celebvhq_marlin_deepfake_ft.yaml \
    --data_path 2D3MF_Datasets \
    --num_workers 4 \
    --batch_size 1 \
    --marlin_ckpt pretrained/marlin_vit_small_ytf.encoder.pt \
    --epochs 300

Optionally, add

--skip_train --resume /path/to/checkpoint

to skip training and evaluate directly from an existing checkpoint.

5. Configuration File

Set up a configuration file based on your hyperparameters and backbones. You can find an example config file under config/; a rough sketch is also shown after the field list below.

Explanation:

  • training_datasets - list; one or more of "DeepfakeTIMIT", "RAVDESS", "Forensics++", "DFDC", "FakeAVCeleb"
  • eval_datasets - list; one or more of the same dataset names
  • learning_rate - float, e.g. 1.00e-3
  • num_heads - int, number of attention heads
  • fusion - str, fusion type: "mf" for middle fusion, "lf" for late fusion
  • audio_positional_encoding - bool, whether to add audio positional encoding
  • hidden_layers - int, number of hidden layers
  • lp_only - bool, if true, inference is performed from the video features only
  • audio_backbone - str, one of "MFCC", "eat", "xvectors", "resnet", "emotion2vec"
  • middle_fusion_type - str, one of "default", "audio_refuse", "video_refuse", "self_attention", "self_cross_attention"
  • modality_dropout - float, modality dropout rate
  • video_backbone - str, one of "efficientface", "marlin"
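For orientation only, here is a hedged sketch of what such a config might contain. The key names follow the list above, the values are illustrative, and the exact schema (including any additional required fields) is defined by the files shipped in config/:

training_datasets: ["Forensics++", "DFDC"]
eval_datasets: ["DeepfakeTIMIT"]
learning_rate: 1.0e-3
num_heads: 4
fusion: "mf"
audio_positional_encoding: true
hidden_layers: 2
lp_only: false
audio_backbone: "eat"
middle_fusion_type: "default"
modality_dropout: 0.1
video_backbone: "marlin"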

6. Performing Grid Search

  • Adjust the search space in config/grid_search_config.py
  • Launch with the --grid_search flag (see the sketch below)
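As a usage sketch only, assuming the flag is consumed by the same evaluate.py entry point used in step 4 (config and data paths are illustrative):

python evaluate.py --config config/celebvhq_marlin_deepfake_ft.yaml --data_path 2D3MF_Datasets --grid_search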

7. Monitoring Performance

Run

tensorboard --logdir=lightning_logs/

The TensorBoard dashboard should be served at http://localhost:6006/.

License

This project is under the CC BY-NC 4.0 license. See LICENSE for details.

References

Please cite our work!

Acknowledgements

Some of the model code is based on ControlNet/MARLIN. The code related to middle fusion is adapted from "Self-attention fusion for audiovisual emotion recognition with incomplete data".

Our audio feature extraction models: MFCC, x-vectors, ResNet, emotion2vec, and EAT.

Our video feature extraction models: MARLIN and EfficientFace.

Contributors

adriansroman, aiden200, aromanusc, controlnet, cy3021561, hermes7308, hyunkeup


2d3mf's Issues

General Preprocessing

We need a general preprocessing pipeline that crops the videos and tags them as real (-0) or fake (-1), then generates the .npy files based on a backbone. Args: data dir, backbone type (small, base, large).

Please specify a required structure for input. Example:

Data dir
├── Video
│   ├── Real
│   └── Fake
└── Audio
    ├── Real
    └── Fake

Middle Fusion Ablations

We want to try out different middle fusion structures. Please implement the following ablations for middle fusions using the files
model/multi_modal_middle_fusion.py, model/transformer_blocks.py, and model/classifier.py.

Unlike the diagram, the values are currently multiplied at the end. Here are the ablations:

[Attached diagram: Middle_fusion_substitutions]

Unify requirements.txt

Currently we have two requirements.txt files. This was done as part of the development process. Now that the repo is stable and the major development work is finished, we should unify the requirements into a single file.

Grid Search

To support AutoML-style tuning, implement grid search.

Pre-train Resnet for emotion detection

The task is to pre-train a ResNet model on a simple emotion detection classification task. The input to the ResNet should be MFCCs computed from 1 second of audio with a sampling rate of sr=44100 Hz and n_mfcc=10, i.e., you can use librosa.feature.mfcc(y=y_1sec_audio, sr=44100, n_mfcc=10).

The network can be trained with audio clips from the RAVDESS dataset. There are 8 labels to predict:
01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised

This pre-trained network will then be used as a feature extractor within our 2D3MF pipeline.
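For clarity, a minimal sketch of the intended MFCC computation; the file path is illustrative, and the librosa call is the one quoted above:

import librosa

# Load exactly one second of audio at the 44.1 kHz sampling rate described above.
y_1sec_audio, sr = librosa.load("RAVDESS/audio/example.wav", sr=44100, duration=1.0)

# 10 MFCC coefficients per frame, as specified for the ResNet input.
mfcc = librosa.feature.mfcc(y=y_1sec_audio, sr=44100, n_mfcc=10)
print(mfcc.shape)  # (10, num_frames)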

Hidden Dimension Change

Under model/classifier.py, self.hidden_layers needs to be able to change dynamically without throwing an error. It is currently set to 128.

Classes that are affected by self.hidden_layers:
AudioCNNPool, VideoCnnPool, AttentionBlock
