This repository provides the implementation of Bounded Future MS-TCN++ for surgical gesture recognition.
In recent years there has been growing development of video-based applications for surgical purposes. Some of these applications can work offline, after the procedure ends; others must react immediately. However, there are also cases where the response should occur during the procedure, but some delay is acceptable. The online-offline performance gap is well known in the literature. Our goal in this study was to characterize the performance-delay trade-off and design an MS-TCN++-based algorithm that can utilize it. To this end, we used our open surgery simulation dataset, containing 96 videos of 24 participants performing a suturing task on a variable-tissue simulator. In this study we used video data captured from the side view. The networks were trained to identify the performed surgical gestures. The naive approach is to reduce the depth of MS-TCN++; as a result, the receptive field shrinks and the number of required future frames is reduced as well. We showed that this method is sub-optimal, mainly for small delays. The second method limits the accessible future in each temporal convolution. This gives flexibility in the network design and, as a result, achieves significantly better performance than the naive approach.
This code converts an MS-TCN++ model into a model that works online.
Requirements: Python 3, PyTorch
To run inference with the online model, use:
from project import run

runner = run(frame_gen, model, extractor, normalize, val_augmentation, use_extractions=True, shape=shape)
# runner is a generator
for t, output in enumerate(runner):
    # output is the output of the original model for time t:
    # a list of tensors, one per task, each the prediction of that task.
    pass
- frame_gen - an object that returns frames via the frame_gen.next() command; returns None when the video ends.
- model - the MS-TCN++ model to recreate online.
- extractor - a model that takes a frame (as a tensor) and converts it to an embedding.
- normalize - the normalization applied to the frame tensor after conversion to a tensor. If None, the identity is used.
- val_augmentation - the augmentations applied to the frame before conversion to a tensor. If None, the identity is used.
- use_extractions - if True, converts the extractor to a oneDNN model using example inputs. This makes the code faster but requires the shape of the frames. Defaults to True.
- shape - the shape of the frames after the augmentations. Defaults to None. If use_extractions is True, shape must be provided (shape is not None).
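The frame_gen argument only needs to expose a next() method with the behavior described above. As a minimal sketch (the class name and internals are illustrative, not part of the repository), assuming frames are available as any iterable of tensors:

```python
# Hypothetical sketch of a frame generator satisfying the interface above;
# this class is illustrative and not part of the repository.
class FrameGen:
    def __init__(self, frames):
        self._it = iter(frames)  # any iterable of frames (e.g. tensors)

    def next(self):
        # Return the next frame, or None when the video ends.
        try:
            return next(self._it)
        except StopIteration:
            return None
```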
To run the example:

python run_example.py

In run_example.py we load the extractor. It is too big to upload to git; it is available at
https://drive.google.com/file/d/1V7vmPiteR2rZ_mGS5CtzUwluMFvhSqwq/view?usp=sharing
Demo video: bf-ms-tcnpp.mp4
Figure: a refinement layer in the Bounded Future MS-TCN++.
To make the model online, we created a data structure that keeps the vectors already calculated in each layer. This way we avoid computing the same features multiple times, which saves time.
To do that, we create a queue of vectors for each layer.
To use convolutions and other matrix operations easily, each layer's queue is a matrix whose columns are the vectors, with the newest vector inserted on the right.
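As a minimal sketch of this idea (the class name and fixed width are illustrative, not the repository's implementation), a per-layer queue can be a (channels x width) tensor where the newest column enters on the right and the oldest is dropped once the needed window is full:

```python
import torch

# Illustrative per-layer queue: a (channels x width) matrix whose columns are
# feature vectors; the newest vector is appended on the right.
class LayerQueue:
    def __init__(self, channels, width):
        self.width = width  # how many vectors this layer must keep
        self.buf = torch.zeros(channels, 0)

    def push(self, v):
        # v: a (channels,) feature vector for the newest frame
        self.buf = torch.cat([self.buf, v.reshape(-1, 1)], dim=1)
        if self.buf.shape[1] > self.width:
            self.buf = self.buf[:, -self.width:]  # drop the oldest column
        return self.buf
```

Keeping the queue as a matrix means dilated convolutions can be applied to it directly, without re-stacking vectors each frame.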
Defining the Needed Indices in a Layer

Using the definition of the Direct Future Window (DFW) for a Bounded Future (BF) Refinement (R) stage, we can compute, for each layer l, how many future vectors (the look-ahead) its queue must hold; the amounts follow directly from the model's definition. In the initialization of the queue of each layer l, we calculate its first vectors from the queue of the layer below. For example, in the figure above, notice that the non-padding part is the result of the residual convolution (including the padding) of the layer below. Since the left-padding size of each layer is known, given the queue of the layer below we can calculate the queue of this layer. After initializing the first layer, the queue of each layer contains the vectors marked in the figure above.

Recall that in the Prediction Generation stage there are two convolutions: one whose dilation increases with depth and one whose dilation decreases. As in the refinement stage, we need to calculate how many vectors each layer must look ahead; the definitions are the same except for the dilation pattern, so initialization proceeds similarly to the refinement stage.
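The exact Direct Future Window formulas are those of the paper; purely as an illustrative sketch, assuming each dilated layer l (dilation 2**l, centered kernel of size 3) contributes at most min(2**l, w) future frames, where w is the per-layer future-window bound, the cumulative look-ahead of a stage could be computed as:

```python
# Illustrative only: cumulative look-ahead (number of future frames needed)
# of a stage, under the assumption stated above. The exact Direct Future
# Window definitions are given in the paper, not here.
def stage_lookahead(num_layers, w):
    return sum(min(2 ** l, w) for l in range(num_layers))
```

With w = 0 the stage is fully causal (look-ahead 0), and increasing w trades delay for accuracy, which is the trade-off the paper studies.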
Since we only generate one frame's features at a time, calling nn.Conv1d on the full tensor might be costly (we know the output will have only one column). So, we implemented the convolution ourselves:
def conv_func(x, conv):
    W = conv.weight              # the weights of the convolution, shape (out_channels, in_channels, kernel_size)
    b = conv.bias                # the bias of the convolution
    dilation = conv.dilation[0]  # the dilation of the convolution
    x = x[:, :, ::dilation]      # keep only the columns the dilated kernel touches
    # elementwise multiply, sum over channels and kernel positions,
    # then reshape to (1 batch, out_channels, 1 column)
    return (torch.sum(W * x, (1, 2)) + b).reshape(1, -1, 1)
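A quick sanity check that the hand-rolled convolution matches nn.Conv1d on a single-column output (illustrative usage; the function is repeated here only so the snippet is self-contained):

```python
import torch
import torch.nn as nn

# Same function as above, repeated so this snippet runs on its own.
def conv_func(x, conv):
    W = conv.weight
    b = conv.bias
    dilation = conv.dilation[0]
    x = x[:, :, ::dilation]
    return (torch.sum(W * x, (1, 2)) + b).reshape(1, -1, 1)

conv = nn.Conv1d(4, 8, kernel_size=3, dilation=2)
x = torch.randn(1, 4, 5)  # width = (kernel_size - 1) * dilation + 1 = 5
out = conv_func(x, conv)
# The output should match PyTorch's own convolution on the same input.
assert torch.allclose(out, conv(x), atol=1e-6)
```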
If you use this code, please cite:
@inproceedings{goldbraikh2023bounded,
title={Bounded Future MS-TCN++ for surgical gesture recognition},
author={Goldbraikh, Adam and Avisdris, Netanell and Pugh, Carla M and Laufer, Shlomi},
booktitle={Computer Vision--ECCV 2022 Workshops: Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part III},
pages={406--421},
year={2023},
organization={Springer}
}