ViT Pose

ViTPose is a 2D human pose estimation model based on the Vision Transformer (ViT) architecture. The official repo is [1]. The goal here is to provide a version of ViTPose without the framework code (mmpose/mmcv), so that it is easier to understand and hack on. Only inference is supported.

1. Execution

Download the model weights from [1]: ViTPose-B, single-task training, classic decoder.

pip install -r requirements.txt

python main.py

2. Details

0. Pretraining

The ViT backbone is pretrained using the Masked Autoencoder (MAE) approach. This was validated with three pretraining datasets: ImageNet, COCO, and COCO + AIC. Pretraining on COCO + AIC gave similar performance (AP/AR) to ImageNet, although COCO + AIC is about an order of magnitude smaller than ImageNet. So less pretraining data is needed when it is similar to the data that will be used for training downstream tasks.

The sequence of steps is as follows:

Image => preprocess => model => postprocess => keypoints

a. Preprocess

  • calculate center/scale, do affine_transform
    • (x, y, w, h) - bounding box of detected person in the image that is output by an object detector (e.g. YOLO or EfficientDet)
    • center = (x + w/2, y + h/2)
    • adjust (w, h) to match the aspect ratio of the model input (192 x 256). scale = ((w, h) / 200) * padding (200 is used to normalize the scale)
    • Affine transform
  • convert to a tensor and divide by 255
  • normalize the tensor
  • tensor shape is [(1, 3, 256, 192)]
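The center/scale step above can be sketched in plain Python as follows. This is a minimal illustration, not the repo's code: the function name `box_to_center_scale` is hypothetical, and the default `padding=1.25` is an assumption (the text only says "padding"). The input size 192 x 256 and the normalizer 200 come from the description above.

```python
def box_to_center_scale(x, y, w, h, input_w=192, input_h=256,
                        padding=1.25, pixel_std=200.0):
    """Turn an (x, y, w, h) person box into a (center, scale) pair.

    padding=1.25 is an assumed default; pixel_std=200 is the
    normalizer mentioned in the text.
    """
    aspect_ratio = input_w / input_h  # 0.75 for a 192 x 256 input
    center = (x + w * 0.5, y + h * 0.5)

    # Grow the shorter side so the box has the model input's aspect ratio.
    if w > aspect_ratio * h:
        h = w / aspect_ratio
    elif w < aspect_ratio * h:
        w = h * aspect_ratio

    scale = (w / pixel_std * padding, h / pixel_std * padding)
    return center, scale


if __name__ == "__main__":
    center, scale = box_to_center_scale(100, 50, 100, 300)
    print(center, scale)
```

The returned center and scale then define the affine transform that crops and resizes the person region to the model input, and the same pair is reused in postprocessing to map heatmap locations back to the image.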

b. Model

  • Backbone - Patch Embedding + Pos. Embedding + Encoder blocks
    • Patch embedding is implemented with a Conv2D layer whose kernel size and stride both equal the patch size (16) and whose output channels equal the embedding dimension (768). Output shape is [(1, 768, 16, 12)], which is flattened and transposed to [(1, 192, 768)].
    • Position embedding is added to the output of the patch embedding.
    • This embedding output is fed through a stack of encoder blocks. The output shape [(1, 192, 768)] is the same as the input shape.
    • output is reshaped back to [(1, 768, 16, 12)]
  • Decoder or Head - outputs one heatmap of size (64 x 48) per keypoint
    • The encoder output is fed to a decoder consisting of 2 layers of ConvTranspose2D + BN + ReLU (output [(1, 256, 64, 48)]) followed by a final Conv2D layer with a (1x1) kernel and 17 output channels (output [(1, 17, 64, 48)]).
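The shapes listed above can be checked with the standard output-size formulas for convolution and transposed convolution. The sketch below is shape arithmetic only and does not build the model. The helper names are hypothetical, and the transposed-convolution settings (kernel 4, stride 2, padding 1, i.e. each layer doubles the spatial size) are an assumption, since the text does not state them; zero padding in the patch embedding is likewise a simplification.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding=0, output_padding=0):
    """Output length of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def vitpose_b_shapes(height=256, width=192, patch=16, embed_dim=768,
                     deconv_channels=256, num_keypoints=17):
    """Trace tensor shapes through backbone and decoder (batch size 1)."""
    gh = conv_out(height, patch, patch)  # patch grid height
    gw = conv_out(width, patch, patch)   # patch grid width
    shapes = {
        "input": (1, 3, height, width),
        "patch_embed": (1, embed_dim, gh, gw),
        "tokens": (1, gh * gw, embed_dim),  # flattened and transposed
    }
    h, w = gh, gw
    for _ in range(2):  # two ConvTranspose2D + BN + ReLU layers
        # Assumed settings: kernel 4, stride 2, padding 1 (doubles h and w).
        h, w = deconv_out(h, 4, 2, 1), deconv_out(w, 4, 2, 1)
    shapes["deconv"] = (1, deconv_channels, h, w)
    shapes["heatmaps"] = (1, num_keypoints, h, w)  # 1x1 conv keeps h and w
    return shapes


if __name__ == "__main__":
    for name, shape in vitpose_b_shapes().items():
        print(name, shape)
```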

c. Postprocess

  • Heatmaps to keypoints
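    • For each heatmap, find the location of the maximum value
    • Shift each location by 0.25 pixel (+ or -) along x and y, toward the neighboring pixel with the higher value, for higher accuracy
    • scale = scale * 200 (undo the normalization). Transform back to image coordinates -> location * (scale / heatmap size) + center - 0.5 * scale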
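A minimal NumPy sketch of the postprocessing steps above, under stated assumptions: the function name `heatmaps_to_keypoints` is hypothetical, the direction of the 0.25 shift is taken from the sign of the difference between the two neighboring heatmap values, and the heatmap location is divided by the heatmap size before being scaled to the box size in pixels.

```python
import numpy as np


def heatmaps_to_keypoints(heatmaps, center, scale):
    """Convert (K, H, W) heatmaps to image-space keypoints.

    center, scale: values produced during preprocessing
    (scale is in units of 200 pixels).
    Returns (keypoints of shape (K, 2) as (x, y), scores of shape (K,)).
    """
    num_kpts, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_kpts, -1)
    idx = flat.argmax(axis=1)
    scores = flat.max(axis=1)

    # Location of the maximum as (x, y) in heatmap coordinates.
    coords = np.stack([idx % width, idx // width], axis=1).astype(np.float64)

    # Quarter-pixel shift toward the higher-valued neighbor.
    for k in range(num_kpts):
        px, py = int(coords[k, 0]), int(coords[k, 1])
        if 0 < px < width - 1 and 0 < py < height - 1:
            dx = heatmaps[k, py, px + 1] - heatmaps[k, py, px - 1]
            dy = heatmaps[k, py + 1, px] - heatmaps[k, py - 1, px]
            coords[k] += 0.25 * np.sign([dx, dy])

    # Undo the /200 normalization, then map heatmap coords to the image.
    size = np.asarray(scale, dtype=np.float64) * 200.0
    step = size / np.array([width, height], dtype=np.float64)
    keypoints = coords * step + np.asarray(center, dtype=np.float64) - 0.5 * size
    return keypoints, scores


if __name__ == "__main__":
    hm = np.zeros((1, 64, 48))
    hm[0, 10, 20] = 1.0   # peak at x=20, y=10
    hm[0, 10, 21] = 0.5   # right neighbor is higher than left
    kpts, conf = heatmaps_to_keypoints(hm, center=(150.0, 200.0),
                                       scale=(1.40625, 1.875))
    print(kpts, conf)
```

The maximum heatmap value is returned as a per-keypoint score, which is a common way to filter out low-confidence keypoints before drawing.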

3. Adapted from:

  1. ViTPose
  2. ViTPose-Pytorch

Contributors

mkmohangb
