


Unsupervised Video Domain Adaptation for Action Recognition:
A Disentanglement Perspective

Pengfei Wei¹   Lingdong Kong¹,²   Xinghua Qu¹   Yi Ren¹   Zhiqiang Xu³   Jing Jiang⁴   Xiang Yin¹
¹ByteDance AI Lab   ²National University of Singapore   ³MBZUAI   ⁴University of Technology Sydney

NeurIPS 2023

About

TranSVAE is a disentanglement framework for unsupervised video domain adaptation. It aims to disentangle domain information from the data during adaptation. We model the generation of cross-domain videos with two sets of latent factors: one encoding static, domain-related information and the other encoding temporal, semantic-related information. Objectives are imposed on these latent factors to achieve domain disentanglement and transfer.
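
As a hedged illustration of the two sets of latent factors, a generative story of this kind is often written, in sequential-VAE style, as the factorization below. The autoregressive prior over the dynamic factors and the exact conditioning are assumptions made here for illustration, not details stated in this README; please see the paper for the authors' formulation.

```latex
p\big(\mathbf{x}_{1:T}^{\mathcal{D}},\, \mathbf{z}_d^{\mathcal{D}},\, \mathbf{z}_{1:T}^{\mathcal{D}}\big)
  = p\big(\mathbf{z}_d^{\mathcal{D}}\big)
    \prod_{t=1}^{T}
    p\big(\mathbf{z}_t^{\mathcal{D}} \mid \mathbf{z}_{<t}^{\mathcal{D}}\big)\,
    p\big(\mathbf{x}_t^{\mathcal{D}} \mid \mathbf{z}_d^{\mathcal{D}},\, \mathbf{z}_t^{\mathcal{D}}\big)
```

Here $\mathbf{z}_d^{\mathcal{D}}$ is the single static, domain-related factor of a video from domain $\mathcal{D}$, and $\mathbf{z}_1^{\mathcal{D}},...,\mathbf{z}_T^{\mathcal{D}}$ are the per-frame temporal and semantic factors.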



Col1: Original sequences ("Human" $\mathcal{D}=\mathbf{P}_1$ and "Alien" $\mathcal{D}=\mathbf{P}_2$); Col2: Sequence reconstructions; Col3: Reconstructed sequences using $z_1^{\mathcal{D}},...,z_T^{\mathcal{D}}$; Col4: Domain transferred sequences with exchanged $z_d^{\mathcal{D}}$.


Visit our project page to explore more details. 🐾

Updates

  • [2023.10] - We provide our extracted I3D features. Kindly refer to this page for more details.
  • [2023.09] - TranSVAE was accepted to NeurIPS 2023! 🎉
  • [2022.08] - TranSVAE achieves 1st place on the UDA leaderboards of UCF-HMDB, Jester, and Epic-Kitchens on Papers with Code.
  • [2022.08] - Try a Gradio demo for domain disentanglement in TranSVAE at Hugging Face Spaces! 🤗
  • [2022.08] - Our paper is available on arXiv. Click here to check it out!

Highlights

Conceptual Comparison
Graphical Model
Framework Overview

Installation

Please refer to INSTALL.md for the installation details.

Data Preparation

Please refer to DATA_PREPARE.md for details on preparing the UCF101, HMDB51, Jester, Epic-Kitchens, and Sprites datasets.

Getting Started

Please refer to GET_STARTED.md to learn more about using this codebase.

Main Results

UCF101 - HMDB51


| Method | Backbone | U101 → H51 | H51 → U101 | Average |
|---|---|---|---|---|
| DANN (JMLR'16) | ResNet-101 | 75.28 | 76.36 | 75.82 |
| JAN (ICML'17) | ResNet-101 | 74.72 | 76.69 | 75.71 |
| AdaBN (PR'18) | ResNet-101 | 72.22 | 77.41 | 74.82 |
| MCD (CVPR'18) | ResNet-101 | 73.89 | 79.34 | 76.62 |
| TA3N (ICCV'19) | ResNet-101 | 78.33 | 81.79 | 80.06 |
| ABG (MM'20) | ResNet-101 | 79.17 | 85.11 | 82.14 |
| TCoN (AAAI'20) | ResNet-101 | 87.22 | 89.14 | 88.18 |
| MA2L-TD (WACV'22) | ResNet-101 | 85.00 | 86.59 | 85.80 |
| Source-only | I3D | 80.27 | 88.79 | 84.53 |
| DANN (JMLR'16) | I3D | 80.83 | 88.09 | 84.46 |
| ADDA (CVPR'17) | I3D | 79.17 | 88.44 | 83.81 |
| TA3N (ICCV'19) | I3D | 81.38 | 90.54 | 85.96 |
| SAVA (ECCV'20) | I3D | 82.22 | 91.24 | 86.73 |
| CoMix (NeurIPS'21) | I3D | 86.66 | 93.87 | 90.22 |
| CO2A (WACV'22) | I3D | 87.78 | 95.79 | 91.79 |
| TranSVAE (Ours) | I3D | 87.78 | 98.95 | 93.37 |
| Oracle | I3D | 95.00 | 96.85 | 95.93 |

Jester


| Task | Source-only | DANN | ADDA | TA3N | CoMix | TranSVAE (Ours) | Oracle |
|---|---|---|---|---|---|---|---|
| JS → JT | 51.5 | 55.4 | 52.3 | 55.5 | 64.7 | 66.1 | 95.6 |

Epic-Kitchens


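| Task | Source-only | DANN | ADDA | TA3N | CoMix | TranSVAE (Ours) | Oracle |
|---|---|---|---|---|---|---|---|
| D1 → D2 | 32.8 | 37.7 | 35.4 | 34.2 | 42.9 | 50.5 | 64.0 |
| D1 → D3 | 34.1 | 36.6 | 34.9 | 37.4 | 40.9 | 50.3 | 63.7 |
| D2 → D1 | 35.4 | 38.3 | 36.3 | 40.9 | 38.6 | 50.3 | 57.0 |
| D2 → D3 | 39.1 | 41.9 | 40.8 | 42.8 | 45.2 | 58.6 | 63.7 |
| D3 → D1 | 34.6 | 38.8 | 36.1 | 39.9 | 42.3 | 48.0 | 57.0 |
| D3 → D2 | 35.8 | 42.1 | 41.4 | 44.2 | 49.2 | 58.0 | 64.0 |
| Average | 35.3 | 39.2 | 37.4 | 39.9 | 43.2 | 52.6 | 61.5 |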
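
For readers who want to recompute summary numbers, the Average row can be approximated as the mean over the six transfer tasks. The short sketch below uses only values copied from the Epic-Kitchens table; the helper name `average` is ours, not part of the codebase. Because the per-task entries are already rounded, a mean recomputed this way may differ from a reported average by 0.1 for some columns.

```python
# Per-task accuracies (%) copied from the Epic-Kitchens table above,
# in the row order D1->D2, D1->D3, D2->D1, D2->D3, D3->D1, D3->D2.
epic_kitchens = {
    "Source-only": [32.8, 34.1, 35.4, 39.1, 34.6, 35.8],
    "CoMix": [42.9, 40.9, 38.6, 45.2, 42.3, 49.2],
    "TranSVAE": [50.5, 50.3, 50.3, 58.6, 48.0, 58.0],
}


def average(scores):
    """Mean over the transfer tasks, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


averages = {method: average(scores) for method, scores in epic_kitchens.items()}

# Absolute gain of TranSVAE over CoMix in average accuracy (percentage points).
gain_over_comix = round(averages["TranSVAE"] - averages["CoMix"], 1)

for method, value in averages.items():
    print(f"{method}: {value}")
print(f"Gain over CoMix: {gain_over_comix}")
```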

Ablation Study

UCF101 → HMDB51

HMDB51 → UCF101

Domain Transfer Example

| Source (Original) | Target (Original) | Source (Original) | Target (Original) |
|---|---|---|---|
| src_original | tar_original | src_original | tar_original |
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{T}}$) |
| src_recon | tar_recon | src_recon | tar_recon |
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{0}$) |
| recon_srcZf | recon_tarZf | recon_srcZf | recon_tarZf |
| Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{T}}$) |
| recon_srcZt | recon_tarZt | recon_srcZt | recon_tarZt |
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{S}}$) |
| recon_srcZf_tarZt | recon_tarZf_srcZt | recon_srcZf_tarZt | recon_tarZf_srcZt |
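
The rows above combine a static factor and dynamic factors in four ways: both from the same video, static only, dynamic only, and static from one domain with dynamic from the other. The toy sketch below mirrors that bookkeeping. It is an illustration only: `Latents`, `decode`, and `transfer_rows` are hypothetical names, and the element-wise sum is a stand-in for the real decoder network, which this README does not describe.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]


@dataclass
class Latents:
    """Hypothetical container: one static factor plus one dynamic factor per frame."""
    z_d: Vector
    z_t: List[Vector]


def decode(z_d: Vector, z_t: List[Vector]) -> List[Vector]:
    """Toy decoder stand-in: each output frame is z_d + z_t[i], element-wise."""
    return [[d + t for d, t in zip(z_d, frame)] for frame in z_t]


def transfer_rows(src: Latents, tgt: Latents) -> Dict[str, List[Vector]]:
    """Build the source-side variants shown in the table (keys follow its image names)."""
    zero_d = [0.0] * len(src.z_d)
    zero_t = [[0.0] * len(frame) for frame in src.z_t]
    return {
        "src_recon": decode(src.z_d, src.z_t),          # z_d^S + z_t^S
        "recon_srcZf": decode(src.z_d, zero_t),         # z_d^S + 0
        "recon_srcZt": decode(zero_d, src.z_t),         # 0 + z_t^S
        "recon_srcZf_tarZt": decode(src.z_d, tgt.z_t),  # z_d^S + z_t^T
    }


# Two-frame toy videos: dimension 0 carries "domain", dimension 1 carries "motion".
src = Latents(z_d=[1.0, 0.0], z_t=[[0.0, 1.0], [0.0, 2.0]])
tgt = Latents(z_d=[5.0, 0.0], z_t=[[0.0, 3.0], [0.0, 4.0]])
rows = transfer_rows(src, tgt)
```

In this toy setting, the swapped sequence keeps the source's static component while following the target's per-frame dynamics, which is the behavior the last table row is meant to visualize.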

TODO List

  • Initial release. 🚀
  • Add license. See here for more details.
  • Add demo at Hugging Face Spaces.
  • Add installation details.
  • Add data preparation details.
  • Add evaluation details.
  • Add training details.

License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Acknowledgement

We acknowledge the use of the following public resources during the course of this work: UCF101, HMDB51, Jester, Epic-Kitchens, Sprites, I3D, and TRN.

Citation

If you find this work helpful, please kindly consider citing our paper:

@inproceedings{wei2023transvae,
  title = {Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective},
  author = {Wei, Pengfei and Kong, Lingdong and Qu, Xinghua and Ren, Yi and Xu, Zhiqiang and Jiang, Jing and Yin, Xiang},
  booktitle = {Advances in Neural Information Processing Systems}, 
  year = {2023},
}


transvae's Issues

Questions about the paper "Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective"

Dear Sir, greetings. First, congratulations on the publication of your paper at NeurIPS. I have a minor question about its content that I would like to ask you.

In the description of the Domain Specificity & Static Consistency module, the paper states: "The static latent factors disentangled from the original sequence and the shuffled sequence should be ideally equal or be very close at least. This motivates us to minimize the distance between these two static factors. Meanwhile, to further enhance the domain specificity, we enforce the dynamic latent factors from different domains to have a large distance."

This is the part I find perplexing. Shouldn't it be the static latent factors from different domains whose distance is maximized? I ask because of your earlier statement that "the dynamic latent factors are enforced to be domain-invariant."
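
For context, the pull/push objective in the quoted passage resembles a triplet-style margin loss: one pair of factors is pulled together while another pair is pushed apart. The sketch below is a generic illustration under assumptions (Euclidean distance, a margin of 1.0, hypothetical variable names), not the repository's implementation. Which factor plays the `negative` role is exactly what the question asks, so the function leaves it as a plain argument.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical factors: the anchor and positive come from the original and the
# frame-shuffled version of the same video; the negative comes from the other domain.
z_original = [1.0, 0.0]
z_shuffled = [1.0, 0.0]
z_other_far = [4.0, 4.0]
z_other_near = [1.0, 0.5]

loss_far = triplet_margin_loss(z_original, z_shuffled, z_other_far)
loss_near = triplet_margin_loss(z_original, z_shuffled, z_other_near)
```

With these toy values, the loss vanishes once the negative is farther from the anchor than the positive by at least the margin, and stays positive when the negative sits too close.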
