
Comments (3)

themattinthehatt commented on June 15, 2024

Thanks for the question @Wulin-Tan - let me know if this clears things up:

temporal context frames
Temporal context frames refer to the frames that come directly before/after a given labeled frame. Specifically, we consider the 2 frames before and 2 frames after each labeled frame, and the 5-frame chunk is processed all at once to produce pose estimates for the central, labeled frame. Importantly, the temporal context frames do not require labels; we only need labels for the central frame.

To utilize temporal context frames you must set the model.model_type parameter in the config file to heatmap_mhcrnn.
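As a minimal sketch, the config change might look like the fragment below (only the `model.model_type` field is prescribed above; the surrounding structure follows the lightning-pose config layout):

```yaml
model:
  model_type: heatmap_mhcrnn  # enables 5-frame temporal context processing
```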

unsupervised losses
Using the unsupervised losses is a second, complementary way to introduce unlabeled frames into the training process. Unsupervised losses are computed on chunks of unlabeled videos. These videos can be the same ones that the labeled frames come from; they can also be completely different videos (as long as the experimental setup is the same) or a mix between the two.

To utilize the unsupervised losses you must set the model.losses_to_use parameter in the config file to be a non-empty list (more info here). The size of the unlabeled batches is governed by dali.base.train.sequence_length.
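A sketch of those two settings together (the loss name "temporal" is one example of an entry in the list; the sequence-length value is a placeholder):

```yaml
model:
  losses_to_use: [temporal]  # any non-empty list enables unsupervised losses
dali:
  base:
    train:
      sequence_length: 16    # frames per unlabeled chunk (placeholder value)
```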

combining the two
Because the temporal context frames and the unsupervised losses act independently (one requires a modification of the network architecture, the other requires a modification of the loss), the two can be combined seamlessly. This means that, for labeled data, chunks of 5 frames are used to predict the pose of the central frame, which can then be compared to the ground truth. At the same time, the unlabeled video data also uses 5-frame chunks to create a prediction for the central frame. In this case we do not have ground truth data, and therefore the unsupervised losses are computed on those predictions.

As a concrete example, imagine you set the unlabeled batch size to 10 (using dali.context.train.batch_size). If we index the frames starting from zero, we have frames {0, ..., 9}. The model cannot form a prediction for the first two or last two frames. We cannot make a prediction on frame 0 because that requires frames -1 and -2, which we do not have access to in the batch (as well as frames 1 and 2, which we do have access to). However, we can form a prediction on frame 2, which requires frames {0, 1, 2, 3, 4}. Likewise, we can form predictions for frames 3-7. We cannot form predictions for frames 8 and 9, which require later frames we do not have access to in the batch. [Therefore the minimum accepted value for dali.context.train.batch_size is 5 - this will create a single prediction, on the central frame.]
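The indexing logic above can be sketched as a small helper (hypothetical, not part of the lightning-pose API):

```python
def predictable_frames(batch_size: int, context: int = 2) -> list[int]:
    """Return indices of frames in a batch that can serve as the central
    frame of a (2*context + 1)-frame chunk, given that frames outside the
    batch are unavailable."""
    return list(range(context, batch_size - context))

# With an unlabeled batch of 10 frames {0, ..., 9} and 2 context frames
# on each side, only frames 2-7 receive predictions:
print(predictable_frames(10))  # [2, 3, 4, 5, 6, 7]
# The minimum batch size that yields any prediction is 5:
print(predictable_frames(5))   # [2]
```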

from lightning-pose.

Wulin-Tan commented on June 15, 2024

Hi, @themattinthehatt
Thank you for your detailed explanation! It really helps me understand LP.
My understanding is as follows (please correct me if I am wrong):

  1. In DLC, smoothing (interpolation or imputation) of unconfident predictions is usually done after predicting on new videos. In LP, this smoothing is incorporated into training itself.
  2. Regarding prediction, most algorithms, including DLC, predict frame by frame. LP instead constructs a kind of 'sliding window' network, called the TCN in LP, to include neighboring-frame information in the prediction.
  3. In LP, both the base and temporal models make use of unsupervised losses / unlabeled frames during training for smoothing, but at prediction time only the temporal models use 2+1+2 frame information.
  4. By the way, Figure 2 of your preprint ('Exploiting unlabeled videos in pose estimation model training'), together with your explanation, helped me a lot.


themattinthehatt commented on June 15, 2024

Yes, that all sounds correct! Just to be a bit more precise on your point 3: both the base and TCN models can use unsupervised losses, but they don't have to (remember that the model architecture - base or TCN - is controlled in the config file by model.model_type, whereas the unsupervised losses are controlled by model.losses_to_use). Each of these contributes to smoother predictions in its own way. The TCN makes for smoother predictions because it can look forward and backward in time to better localize keypoints. The unsupervised loss (specifically the "temporal" loss) penalizes large jumps in the predictions. I should also mention that we experimented with the temporal loss quite a bit, and it does not actually work well if you set small values for the allowed temporal jumps (in losses.temporal.epsilon). The temporal loss is much more effective when epsilon is large (we use 20 pixels for 400x400 frames, for example, but the right value will depend on your experimental setup). In this case the loss penalizes really large jumps, but it won't remove small jitters in the predictions (see the EKS comment below for a better way to deal with small jitter).
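The epsilon-insensitive temporal penalty described above can be sketched as follows (a rough illustration of the idea, not lightning-pose's actual implementation; the array shape is an assumption):

```python
import numpy as np

def temporal_loss_sketch(preds: np.ndarray, epsilon: float = 20.0) -> float:
    """Penalize frame-to-frame keypoint jumps larger than epsilon pixels.

    preds: array of shape (T, K, 2) -- T frames, K keypoints, (x, y) coords.
    Jumps smaller than epsilon incur zero loss, so small jitter is ignored.
    """
    jumps = np.linalg.norm(np.diff(preds, axis=0), axis=-1)  # (T-1, K)
    return float(np.maximum(jumps - epsilon, 0.0).mean())

# Small per-frame jitter (< epsilon) is not penalized; a large jump is:
smooth = np.zeros((5, 1, 2))
smooth[:, 0, 0] = [0, 1, 2, 3, 4]   # 1-pixel steps between frames
jumpy = smooth.copy()
jumpy[3:, 0, 0] += 50               # sudden 50-pixel jump at frame 3
print(temporal_loss_sketch(smooth))      # 0.0
print(temporal_loss_sketch(jumpy) > 0)   # True
```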

And then yes, at prediction time the base model only processes one frame at a time, whereas the TCN model will make use of 2+1+2 frame information.

I would also recommend checking out the Ensemble Kalman Smoother (EKS) package when you get a chance - this allows you to actually enforce temporal smoothness on the predictions, and we have found it to work very well across a lot of datasets. This post-processor will definitely do a good job of removing jitter in your predictions.

