
Comments (3)

themattinthehatt commented on June 15, 2024

Thanks for the question @Wulin-Tan - let me know if this clears things up:

temporal context frames
Temporal context frames refer to the frames that come directly before/after a given labeled frame. Specifically, we consider the 2 frames before and 2 frames after each labeled frame, and the 5-frame chunk is processed all at once to produce pose estimates for the central, labeled frame. Importantly, the temporal context frames do not require labels; we only need labels for the central frame.

To utilize temporal context frames you must set the model.model_type parameter in the config file to heatmap_mhcrnn.
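As a minimal sketch, the config change might look like the fragment below (only the `model.model_type` field is prescribed above; the surrounding structure follows the lightning-pose config layout):

```yaml
model:
  model_type: heatmap_mhcrnn  # enables 5-frame temporal context processing
```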

unsupervised losses
Using the unsupervised losses is a second, complementary way to introduce unlabeled frames into the training process. Unsupervised losses are computed on chunks of unlabeled videos. These videos can be the same ones that the labeled frames come from; they can also be completely different videos (as long as the experimental setup is the same) or a mix between the two.

To utilize the unsupervised losses you must set the model.losses_to_use parameter in the config file to be a non-empty list (more info here). The size of the unlabeled batches is governed by dali.base.train.sequence_length.
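A sketch of those two settings together (the loss name "temporal" is one example of an entry in the list; the sequence-length value is a placeholder):

```yaml
model:
  losses_to_use: [temporal]  # any non-empty list enables unsupervised losses
dali:
  base:
    train:
      sequence_length: 16    # frames per unlabeled chunk (placeholder value)
```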

combining the two
Because the temporal context frames and the unsupervised losses act independently (one requires a modification of the network architecture, the other requires a modification of the loss), the two can be combined seamlessly. This means that, for labeled data, chunks of 5 frames are used to predict the pose of the central frame, which can then be compared to the ground truth. At the same time, the unlabeled video data also uses 5-frame chunks to create a prediction for the central frame. In this case we do not have ground truth data, and therefore the unsupervised losses are computed on those predictions.

As a concrete example, imagine you set the unlabeled batch size to 10 (using dali.context.train.batch_size). If we index the frames starting from zero, we have frames {0, ..., 9}. The model cannot form a prediction for the first two or last two frames. We cannot make a prediction on frame 0 because that requires frames -1 and -2, which we do not have access to in the batch (as well as frames 1 and 2, which we do have access to). However, we can form a prediction on frame 2, which requires frames {0, 1, 2, 3, 4}. Likewise, we can form predictions for frames 3-7. We cannot form predictions for frames 8 and 9, which require later frames we do not have access to in the batch. [Therefore the minimum accepted value for dali.context.train.batch_size is 5 - this will create a single prediction, on the central frame.]
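The indexing logic above can be sketched as a small helper (hypothetical, not part of the lightning-pose API):

```python
def predictable_frames(batch_size: int, context: int = 2) -> list[int]:
    """Return indices of frames in a batch that can serve as the central
    frame of a (2*context + 1)-frame chunk, given that frames outside the
    batch are unavailable."""
    return list(range(context, batch_size - context))

# With an unlabeled batch of 10 frames {0, ..., 9} and 2 context frames
# on each side, only frames 2-7 receive predictions:
print(predictable_frames(10))  # [2, 3, 4, 5, 6, 7]
# The minimum batch size that yields any prediction is 5:
print(predictable_frames(5))   # [2]
```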

from lightning-pose.

Wulin-Tan commented on June 15, 2024

Hi, @themattinthehatt
Thank you for your detailed explanation! It really helps me understand LP.
My understanding is as follows (please correct me if I am wrong):

  1. In DLC, smoothing (interpolation or imputation) of unconfident predictions is usually done after predicting on new videos. In LP, this smoothing is incorporated into training itself.
  2. Regarding prediction, most algorithms, including DLC, predict frame by frame. LP instead constructs a kind of 'sliding window' network, called the TCN in LP, to include neighboring-frame information in the prediction.
  3. In LP, both the base and temporal models make use of unsupervised losses / unlabeled frames during training for smoothing, but at prediction time only the temporal models use 2+1+2 frame information.
  4. By the way, Figure 2 of your preprint ('Exploiting unlabeled videos in pose estimation model training'), together with your explanation, helped me a lot.


themattinthehatt commented on June 15, 2024

Yes, that all sounds correct! Just to be a bit more precise on your point 3: both the base and TCN models can use unsupervised losses, but they don't have to (remember that the model architecture - base or TCN - is controlled in the config file by model.model_type, whereas the unsupervised losses are controlled by model.losses_to_use). Each of these contributes to smoother predictions in its own way. The TCN makes for smoother predictions because it can look forward and backward in time to better localize keypoints. The unsupervised loss (specifically the "temporal" loss) penalizes large jumps in the predictions. I should also mention that we experimented with the temporal loss quite a bit, and it does not actually work well if you set small values for the allowed temporal jumps (in losses.temporal.epsilon). The temporal loss is much more effective when epsilon is large (we use 20 pixels for 400x400 frames, for example, but the right value will depend on your experimental setup). In this case the loss penalizes really large jumps, but it won't remove small jitters in the predictions (see the EKS comment below for a better way to deal with small jitter).
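The epsilon-insensitive temporal penalty described above can be sketched as follows (a rough illustration of the idea, not lightning-pose's actual implementation; the array shape is an assumption):

```python
import numpy as np

def temporal_loss_sketch(preds: np.ndarray, epsilon: float = 20.0) -> float:
    """Penalize frame-to-frame keypoint jumps larger than epsilon pixels.

    preds: array of shape (T, K, 2) -- T frames, K keypoints, (x, y) coords.
    Jumps smaller than epsilon incur zero loss, so small jitter is ignored.
    """
    jumps = np.linalg.norm(np.diff(preds, axis=0), axis=-1)  # (T-1, K)
    return float(np.maximum(jumps - epsilon, 0.0).mean())

# Small per-frame jitter (< epsilon) is not penalized; a large jump is:
smooth = np.zeros((5, 1, 2))
smooth[:, 0, 0] = [0, 1, 2, 3, 4]   # 1-pixel steps between frames
jumpy = smooth.copy()
jumpy[3:, 0, 0] += 50               # sudden 50-pixel jump at frame 3
print(temporal_loss_sketch(smooth))      # 0.0
print(temporal_loss_sketch(jumpy) > 0)   # True
```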

And then yes, at prediction time the base model only processes one frame at a time, whereas the TCN model will make use of 2+1+2 frame information.

I would also recommend checking out the Ensemble Kalman Smoother (EKS) package when you get a chance - this allows you to actually enforce temporal smoothness on the predictions, and we have found it to work very well across a lot of datasets. This post-processor will definitely do a good job of removing jitter in your predictions.

