
HumanTOMATO's Introduction

HumanTOMATO: Text-aligned Whole-body Motion Generation

Shunlin Lu🍅 2, 3, Ling-Hao Chen🍅 1, 2, Ailing Zeng2, Jing Lin1, 2, Ruimao Zhang3, Lei Zhang2, and Heung-Yeung Shum1, 2

🍅Co-first author. Listing order is random.

1Tsinghua University, 2International Digital Economy Academy (IDEA), 3School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-SZ)

🤩 Abstract

This work targets a novel text-driven whole-body motion generation task, which takes a given textual description as input and aims at generating high-quality, diverse, and coherent facial expressions, hand gestures, and body motions simultaneously. Previous works on text-driven motion generation tasks have two main limitations: they ignore the key role of fine-grained hand and face control in vivid whole-body motion generation, and they lack good alignment between text and motion. To address these limitations, we propose a Text-aligned whOle-body Motion generATiOn framework, named HumanTOMATO, which, to our knowledge, is the first attempt at applicable holistic motion generation in this research area. To tackle this challenging task, our solution includes two key designs: (1) a Holistic Hierarchical VQ-VAE (aka H²VQ) and a Hierarchical-GPT for fine-grained body and hand motion reconstruction and generation with two structured codebooks; and (2) a pre-trained text-motion-alignment model to help the generated motion align with the input textual description explicitly. Comprehensive experiments verify that our model has significant advantages in both the quality of generated motions and their alignment with text.

📢 News

  • [2024/05/13] Release OpenTMA project. It is exactly the text-motion alignment used in HumanTOMATO.
  • [2024/05/02] HumanTOMATO is accepted by ICML-2024. See you in Vienna!
  • [2023/11/15] Publish HumanTOMATO Motion Representation (tomato representation) processing code.
  • [2023/10/22] Publish project!

🎬 Highlight Whole-body Motions

The proposed HumanTOMATO model can generate text-aligned whole-body motions with vivid and harmonious face, hand, and body movement. We show two qualitative generation results below.

🔍 System Overview

The framework overview of the proposed text-driven whole-body motion generation. (a) Holistic Hierarchical Vector Quantization (H²VQ) to compress fine-grained body-hand motion into two discrete codebooks with hierarchical structure relations. (b) Hierarchical-GPT using motion-aware textual embedding as the input to hierarchically generate body-hand motions. (c) Facial text-conditional VAE (cVAE) to generate the corresponding facial motions. The outputs of body, hand, and face motions comprise a vivid and text-aligned whole-body motion.
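The sketch below is a minimal, simplified illustration of the two-codebook hierarchical quantization idea in (a), written in PyTorch. All module names, dimensions, and the plain straight-through quantizer are assumptions for illustration and are not the repository's actual H²VQ implementation; the point it tries to capture is only that hand latents are quantized conditioned on the already-quantized body codes, so the two codebooks are structured hierarchically.

# Minimal illustrative sketch of a two-codebook hierarchical VQ (hypothetical
# names and dimensions; NOT the repository's actual H^2VQ implementation).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=512, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                  # z: (B, T, dim)
        dist = torch.cdist(z, self.codebook.weight)        # (B, T, num_codes)
        idx = dist.argmin(dim=-1)                          # discrete code indices
        z_q = self.codebook(idx)                           # quantized latents
        z_q = z + (z_q - z).detach()                       # straight-through estimator
        return z_q, idx

class HierarchicalBodyHandVQ(nn.Module):
    """Hand latents are quantized conditioned on the already-quantized body codes."""
    def __init__(self, body_dim=212, hand_dim=180, latent=128):
        super().__init__()
        self.body_enc = nn.Linear(body_dim, latent)
        self.hand_enc = nn.Linear(hand_dim + latent, latent)   # hand conditioned on body
        self.body_vq = VectorQuantizer(dim=latent)
        self.hand_vq = VectorQuantizer(dim=latent)
        self.decoder = nn.Linear(2 * latent, body_dim + hand_dim)

    def forward(self, body, hand):   # body: (B, T, body_dim), hand: (B, T, hand_dim)
        z_body, body_idx = self.body_vq(self.body_enc(body))
        z_hand, hand_idx = self.hand_vq(self.hand_enc(torch.cat([hand, z_body], dim=-1)))
        recon = self.decoder(torch.cat([z_body, z_hand], dim=-1))
        return recon, (body_idx, hand_idx)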

🚀 Quick Start

🚅 Model Training

📸 Visualization

🤝🏼 Citation

If you find the code useful in your research, please cite us:

@article{humantomato,
  title={HumanTOMATO: Text-aligned Whole-body Motion Generation},
  author={Lu, Shunlin and Chen, Ling-Hao and Zeng, Ailing and Lin, Jing and Zhang, Ruimao and Zhang, Lei and Shum, Heung-Yeung},
  journal={arXiv preprint arXiv:2310.12978},
  year={2023}
}

📚 License

This code is distributed under an IDEA LICENSE. Note that our code depends on other libraries and datasets which each have their own respective licenses that must also be followed.

💋 Acknowledgement

The code builds on TMR, MLD, T2M-GPT, and HumanML3D. Thanks to all contributors!

🌟 Star History

Star History Chart

If you have any questions, please contact: shunlinlu0803 [AT] gmail [DOT] com and thu [DOT] lhchen [AT] gmail [DOT] com.

HumanTOMATO's People

Contributors

linghaochan, savicktso, shunlinlu


HumanTOMATO's Issues

body-only feature representation

Hi! What is the difference between the tomato representation and HumanML3D for the body-only part? From the supplementary material, I understand the difference is a rotation regularization, but I cannot find it in your code.
Also, when will the pretrained TMR model be released?

t2m_hand_raw_offsets error?

Hi! I have done something similar for my own motion processing, also using 000021 as the example motion, as in HumanML3D. The raw offset is the general direction of a joint from its parent. However, if you visualize the motion (shown below), you can see that the relative offset of each first finger joint is at -1 on the y-axis from the wrist, consistent with lines 94-99 of t2m_raw_body_offsets, while finger joints 2 and 3 lie on the x-axis. However, your raw offsets place all finger joints on the x-axis, implying the fingers point along the x-axis for 000021. Could you recheck this?
(rendered example motion: example_hml_render)

This is what I have:

p - pinky, r - ring, m - middle, i - index, t - thumb
left hand
[0, -1, 0], # lp1
[-1, 0, 0], # lp2
[-1, 0, 0], # lp3
[0, -1, 0], # lr1
[-1, 0, 0], # lr2
[-1, 0, 0], # lr3
[0, -1, 0], # lm1
[-1, 0, 0], # lm2
[-1, 0, 0], # lm3
[0, -1, 0], # li1
[-1, 0, 0], # li2
[-1, 0, 0], # li3
[0, -1, 0], # lt1
[0, -1, 0], # lt2
[0, -1, 0], # lt3
right hand
[0, -1, 0], # rp1
[1, 0, 0], # rp2
[1, 0, 0], # rp3
[0, -1, 0], # rr1
[1, 0, 0], # rr2
[1, 0, 0], # rr3
[0, -1, 0], # rm1
[1, 0, 0], # rm2
[1, 0, 0], # rm3
[0, -1, 0], # ri1
[1, 0, 0], # ri2
[1, 0, 0], # ri3
[0, -1, 0], # rt1
[0, -1, 0], # rt2
[0, -1, 0], # rt3
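For reference, a small hedged sketch of how such raw offsets can be derived from a rest pose: take the parent-to-child direction of each joint and keep only the sign of its dominant axis. The function and variable names below are illustrative, not the repository's code.

import numpy as np

def raw_offsets_from_rest_pose(joints, parents):
    """joints: (J, 3) float rest-pose joint positions; parents: list of parent indices.
    Returns a (J, 3) array of unit offsets along each joint's dominant axis."""
    offsets = np.zeros_like(joints)
    for j, p in enumerate(parents):
        if p < 0:                      # root joint has no parent
            continue
        d = joints[j] - joints[p]      # parent-to-child direction
        axis = np.argmax(np.abs(d))    # dominant axis (x=0, y=1, z=2)
        offsets[j, axis] = np.sign(d[axis])
    return offsets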

2D

If my dataset stores frame-by-frame images in each folder, where the images contain only upper-body movements and each folder represents an action sequence with a corresponding text description, how should I use your code for such a dataset?
Also, which part of your code is related to text-motion alignment? I could not find it; please point it out. Thank you.
Finally, if I have only extracted 2D skeletal keypoints from the human images, how do I use your code?

Question about recovering XZ rotation of root from the motion representation

Hi,
In the appendix, I see the motion representation is defined as follows:

Specifically, the i-th frame pose is defined by a tuple of root angular velocity (ṙᵃ ∈ ℝ) along the Y-axis, root linear velocities (ṙˣ, ṙᶻ ∈ ℝ) on the XZ-plane, root height (rʸ ∈ ℝ), local joint positions (jᵖ ∈ ℝ^(3(N−1))), and velocities (jᵛ ∈ ℝ^(3N)), where N denotes the number of joints.

Obviously, we can recover the 3D global position of the root from the XZ-plane velocities and the root height, and we can recover the Y rotation of the root by integrating the Y angular velocity. My question is how the X-rotation and Z-rotation of the root can be recovered from the representation defined above.
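For context, note that the tuple above contains no explicit X/Z root-rotation terms; only the yaw is stored, as an angular velocity. Below is a hedged sketch of how the stored root quantities are typically integrated back into a global trajectory, in the spirit of HumanML3D-style recovery code; the names and sign conventions are illustrative, not the repository's implementation.

import numpy as np

def recover_root_trajectory(rot_vel_y, lin_vel_xz, height_y):
    """rot_vel_y: (T,) per-frame yaw angular velocity;
    lin_vel_xz: (T, 2) root velocity in the yaw-aligned frame;
    height_y: (T,) root height. Returns yaw (T,) and root positions (T, 3)."""
    yaw = np.cumsum(np.concatenate(([0.0], rot_vel_y[:-1])))   # integrate yaw
    pos = np.zeros((len(rot_vel_y), 3))
    for t in range(1, len(rot_vel_y)):
        c, s = np.cos(yaw[t - 1]), np.sin(yaw[t - 1])
        dx, dz = lin_vel_xz[t - 1]
        # rotate the local XZ velocity into the world frame and integrate
        pos[t, 0] = pos[t - 1, 0] + c * dx + s * dz
        pos[t, 2] = pos[t - 1, 2] - s * dx + c * dz
    pos[:, 1] = height_y
    return yaw, pos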

Questions about the Pretrained Checkpoints Used in the Evaluation of OpenTMA

Hi, I have some questions about the pretrained evaluation checkpoints of OpenTMA. I would be very grateful if you could provide some help.

  • OpenTMA itself should be the motion encoder and text encoder optimized by contrastive learning. Why are these pretrained weights loaded for OpenTMA evaluation, and what role do they play in the evaluation?
  • Additionally, I would like to confirm: is the "smpl212" in OpenTMA the 322-dimensional SMPL-X of Motion-X with the 100-dimensional face_shape and 10-dimensional betas removed?
  • Finally, the currently provided pretrained evaluation checkpoints of OpenTMA do not seem to include weights for the smpl212 format. Will these be provided later, or will the corresponding training code be released?
    Looking forward to your reply.
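If it helps, here is a hedged guess at the smpl212 slicing, assuming the commonly documented Motion-X 322-dim layout (root_orient 3, body pose 63, hand pose 90, jaw 3, face expression 50, face shape 100, translation 3, betas 10). The indices below are my assumption and should be verified against the Motion-X documentation.

import numpy as np

def smplx322_to_smpl212(motion):
    """motion: (T, 322) Motion-X vector. Drops face_shape (100) and betas (10),
    keeping 322 - 110 = 212 dims. The index layout is an assumption, not verified."""
    keep = np.r_[0:209, 309:312]   # root/body/hand/jaw/expr (0:209) + translation (309:312)
    return motion[:, keep]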

Inverting the new_joint_vecs?

This is great work! I have a quick question. After processing the motions as described, I have three folders: joint, new_joints, and new_joint_vecs. When training a generative model, as described in your paper, you would use the full 623-dimensional vector in new_joint_vecs.

If you wanted to extract the rotations in new_joint_vecs to apply to an FBX rig, how would you do this?

I mean that the joint order seems to have changed, there is no jaw joint, and directly taking the continuous 6D rotations and converting them to quaternions does not yield the desired effect.

I'm curious whether you have any insights here.
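In case it helps other readers, here is a hedged sketch of slicing the 623-dim vector, assuming a HumanML3D-style layout with N = 52 joints (body + hands, no jaw): root data (4), local joint positions (3·(N−1) = 153), local 6D rotations (6·(N−1) = 306), joint velocities (3·N = 156), and foot contacts (4), which sums to 623. These offsets are an assumption based on the paper's description, not confirmed from the code.

import numpy as np

def split_tomato_vec(vec, n_joints=52):
    """vec: (T, 623) whole-body (no face) feature vector. Layout is assumed,
    in HumanML3D order: [root(4), ric(153), rot6d(306), vel(156), contact(4)]."""
    o = 0
    root = vec[:, o:o + 4];                   o += 4
    ric = vec[:, o:o + 3 * (n_joints - 1)];   o += 3 * (n_joints - 1)
    rot6d = vec[:, o:o + 6 * (n_joints - 1)]; o += 6 * (n_joints - 1)
    vel = vec[:, o:o + 3 * n_joints];         o += 3 * n_joints
    contact = vec[:, o:o + 4]
    return root, ric, rot6d, vel, contact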

Estimated code release date

Thanks for the awesome work! Do you have an estimated release date for when the code and models (including MLD and T2M-GPT) will be released?

Question of Fig2(b)

Hi, I am reading the original paper, and I notice that in Fig. 2(b) the text-motion alignment appears to be performed during inference, since it is marked with a dotted line. Why is that?

Will the code be open-sourced?

Hello, thank you for releasing such excellent work. May I ask whether you plan to open-source the code for the paper? Looking forward to your reply!

OpenTMA: Motion-X checkpoints dimension

Awesome work!

But I ran into a problem:

When I use the Motion-X checkpoints (lines 249 and 251 in 'OpenTMA/tma/models/modeltype/temos.py'), I get: "size mismatch for main.0.weight: copying a param with shape torch.Size([512, 309, 4]) from checkpoint, the shape in current model is torch.Size([512, 619, 4])."

I don't know if it's a problem with my Motion-X data processing or with the checkpoint.
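A small hedged snippet for narrowing this down: compare the parameter shapes stored in the checkpoint with those of the model you instantiate before loading. The path and key names here are illustrative.

import torch

ckpt = torch.load("motionx_checkpoint.ckpt", map_location="cpu")
state = ckpt.get("state_dict", ckpt)          # Lightning checkpoints nest under "state_dict"
for name, tensor in state.items():
    if "main.0.weight" in name:
        print(name, tuple(tensor.shape))      # shape saved in the released checkpoint
# Compare against the model you constructed; a 309-vs-619 mismatch likely points to
# the configured input feature dimension rather than the data preprocessing itself.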

VRAM Need

How much VRAM do I need to run the pre-trained model after release? Thank you!

plot_feature.py missing

Thanks for the great work. The data preparation instructions say to run plot_feature.py to check the 623-dim visualization, but this file seems to be missing from the repository. Could you please share this code? Thanks.
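Until the file is added, here is a minimal stand-in sketch (not the authors' plot_feature.py) that visualizes a processed motion as a heatmap, assuming new_joint_vecs stores (T, 623) arrays; the file path is illustrative.

import numpy as np
import matplotlib.pyplot as plt

motion = np.load("new_joint_vecs/000021.npy")   # (T, 623), path is illustrative
plt.figure(figsize=(12, 6))
plt.imshow(motion.T, aspect="auto", origin="lower", cmap="viridis")
plt.xlabel("frame")
plt.ylabel("feature dimension (0-622)")
plt.colorbar(label="feature value")
plt.savefig("feature_plot.png", dpi=150)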
