
tdn's Introduction

TDN: Temporal Difference Networks for Efficient Action Recognition (CVPR 2021)


News

[Mar 24, 2022] We present VideoMAE, a new SOTA on Kinetics, Something-Something, and AVA.
[Dec 1, 2021] We update the model zoo with TDN-ResNet101 results on Something-Something V2.
[Mar 5, 2021] TDN has been accepted by CVPR 2021.
[Dec 26, 2020] We have released the PyTorch code of TDN.

Overview

We release the PyTorch code of TDN (Temporal Difference Networks). This code is based on the TSN and TSM codebases. The core implementation of the Temporal Difference Module lives in ops/base_module.py and ops/tdn_net.py.

TL;DR. We generalize the idea of RGB difference to devise an efficient temporal difference module (TDM) for motion modeling in videos, and provide an alternative to 3D convolutions through a principled and detailed module design.
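To make the idea concrete, here is a minimal, illustrative sketch of the adjacent-frame difference computation that the short-term TDM builds on. The class and variable names are made up for this example; the actual (multi-scale) module lives in ops/base_module.py and ops/tdn_net.py.

    import torch
    import torch.nn as nn

    class RGBDiffSketch(nn.Module):
        """Illustrative only: turn 5 consecutive RGB frames into motion features
        by stacking pairwise frame differences and applying a small conv head."""

        def __init__(self, out_channels=64):
            super().__init__()
            # 4 pairwise differences x 3 channels = 12 input channels
            self.conv = nn.Conv2d(12, out_channels, kernel_size=3, padding=1)

        def forward(self, frames):
            # frames: (N, 5, 3, H, W) -- five consecutive RGB frames per sample
            x1, x2, x3, x4, x5 = frames.unbind(dim=1)
            diffs = torch.cat([x2 - x1, x3 - x2, x4 - x3, x5 - x4], dim=1)  # (N, 12, H, W)
            return self.conv(diffs)

In the released code the differences are additionally downsampled and processed at multiple scales before being fused with the backbone features; this sketch only shows the core difference operation.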

Prerequisites

The code is built with the following libraries:

Data Preparation

We have successfully trained TDN on Kinetics-400, UCF101, HMDB51, and Something-Something V1 & V2 with this codebase.

  • The processing of Something-Something-V1 & V2 can be summarized into 3 steps:

    1. Extract frames from the videos (you can use ffmpeg to extract frames from each video).
    2. Generate the annotations needed by the dataloader (each line is "<path_to_frames> <frames_num> <video_class>"); a generation sketch is given after this list. The annotations usually include train.txt and val.txt. The format of each *.txt file is:
      dataset_root/frames/video_1 num_frames label_1
      dataset_root/frames/video_2 num_frames label_2
      dataset_root/frames/video_3 num_frames label_3
      ...
      dataset_root/frames/video_N num_frames label_N
      
    3. Add the information to ops/dataset_configs.py.
  • The processing of Kinetics400 can be summarized into 3 steps:

    1. We preprocess the data by resizing the short edge of each video to 320px. You can refer to the MMAction2 Data Benchmark for TSN and SlowOnly.
    2. Generate the annotations needed by the dataloader (each line is "<path_to_video> <video_class>"). The annotations usually include train.txt and val.txt. The format of each *.txt file is:
      dataset_root/video_1.mp4  label_1
      dataset_root/video_2.mp4  label_2
      dataset_root/video_3.mp4  label_3
      ...
      dataset_root/video_N.mp4  label_N
      
    3. Add the information to ops/dataset_configs.py.

    Note: We use decord to decode the Kinetics videos on the fly.
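For convenience, the snippet below is one possible way to generate the frame-based annotation files described above. The directory layout and the label_map are assumptions (hypothetical names), so adapt them to your own setup.

    import os

    def write_annotation(frames_root, label_map, out_file):
        """Write '<path_to_frames> <frames_num> <video_class>' lines.

        Assumes frames_root/<video_id>/ holds the extracted JPEG frames and
        label_map maps <video_id> -> integer class id (both are hypothetical).
        """
        with open(out_file, "w") as f:
            for video_id in sorted(os.listdir(frames_root)):
                video_dir = os.path.join(frames_root, video_id)
                if not os.path.isdir(video_dir):
                    continue
                num_frames = len([x for x in os.listdir(video_dir) if x.endswith(".jpg")])
                if num_frames == 0 or video_id not in label_map:
                    continue
                f.write(f"{video_dir} {num_frames} {label_map[video_id]}\n")

    # Example (paths and labels are placeholders):
    # write_annotation("dataset_root/frames", {"video_1": 0, "video_2": 1}, "train.txt")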

Model Zoo

Here we provide some off-the-shelf pretrained models. The accuracy may vary slightly from the paper, since the raw Kinetics videos downloaded by different users may differ.

Something-Something-V1

Model          Frames x Crops x Clips  Top-1  Top-5  Checkpoint
TDN-ResNet50   8x1x1                   52.3%  80.6%  link
TDN-ResNet50   16x1x1                  53.9%  82.1%  link

Something-Something-V2

Model           Frames x Crops x Clips  Top-1  Top-5  Checkpoint
TDN-ResNet50    8x1x1                   64.0%  88.8%  link
TDN-ResNet50    16x1x1                  65.3%  89.7%  link
TDN-ResNet101   8x1x1                   65.8%  90.2%  link
TDN-ResNet101   8x3x1                   67.1%  90.5%  -
TDN-ResNet101   16x1x1                  66.9%  90.9%  link
TDN-ResNet101   16x3x1                  68.2%  91.6%  -
TDN-ResNet101   (8+16)x1x1              68.2%  91.6%  -
TDN-ResNet101   (8+16)x3x1              69.6%  92.2%  -

Kinetics400

Model           Frames x Crops x Clips  Top-1 (30 views)  Top-5 (30 views)  Checkpoint
TDN-ResNet50    8x3x10                  76.6%             92.8%             link
TDN-ResNet50    16x3x10                 77.5%             93.2%             link
TDN-ResNet101   8x3x10                  77.5%             93.6%             link
TDN-ResNet101   16x3x10                 78.5%             93.9%             link

Testing

  • For single-clip, center-crop testing, the procedure can be summarized in 2 steps:
    1. Run the following testing script:
      CUDA_VISIBLE_DEVICES=0 python3 test_models_center_crop.py something \
      --archs='resnet50' --weights <your_checkpoint_path>  --test_segments=8  \
      --test_crops=1 --batch_size=16  --gpus 0 --output_dir <your_pkl_path> -j 4 --clip_index=0
      
    2. Run the following script to get the results from the raw scores:
      python3 pkl_to_results.py --num_clips 1 --test_crops 1 --output_dir <your_pkl_path>  
      
  • For 3-crop, 10-clip testing, the procedure can be summarized in 2 steps:
    1. Run the following testing script 10 times (with clip_index from 0 to 9; a driver sketch is given after this list):
      CUDA_VISIBLE_DEVICES=0 python3 test_models_three_crops.py  kinetics \
      --archs='resnet50' --weights <your_checkpoint_path>  --test_segments=8 \
      --test_crops=3 --batch_size=16 --full_res --gpus 0 --output_dir <your_pkl_path>  \
      -j 4 --clip_index <your_clip_index>
      
    2. Run the following script to ensemble the raw scores of the 30 views:
      python pkl_to_results.py --num_clips 10 --test_crops 3 --output_dir <your_pkl_path> 
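Because the 30-view protocol requires one run of the test script per clip, a small driver like the sketch below can launch all ten runs and then the ensembling step. The checkpoint and output paths are placeholders, so treat this as a convenience sketch rather than part of the official scripts.

    import os
    import subprocess

    WEIGHTS = "<your_checkpoint_path>"   # placeholder
    OUTPUT_DIR = "<your_pkl_path>"       # placeholder
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": "0"}

    # One run of the test script per clip (clip_index 0..9).
    for clip_index in range(10):
        subprocess.run(
            ["python3", "test_models_three_crops.py", "kinetics",
             "--archs=resnet50", "--weights", WEIGHTS, "--test_segments=8",
             "--test_crops=3", "--batch_size=16", "--full_res", "--gpus", "0",
             "--output_dir", OUTPUT_DIR, "-j", "4", "--clip_index", str(clip_index)],
            env=env, check=True)

    # Ensemble the raw scores of the 30 views.
    subprocess.run(
        ["python", "pkl_to_results.py", "--num_clips", "10",
         "--test_crops", "3", "--output_dir", OUTPUT_DIR],
        env=env, check=True)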
      

Training

This implementation supports multi-GPU DistributedDataParallel training, which is faster and simpler.

  • For example, to train TDN-ResNet50 on Something-Something V1 with 8 GPUs, you can run:
    python -m torch.distributed.launch --master_port 12347 --nproc_per_node=8 \
                main.py  something  RGB --arch resnet50 --num_segments 8 --gd 20 --lr 0.01 \
                --lr_scheduler step --lr_steps  30 45 55 --epochs 60 --batch-size 8 \
                --wd 5e-4 --dropout 0.5 --consensus_type=avg --eval-freq=1 -j 4 --npb 
    
  • For example, to train TDN-ResNet50 on Kinetics-400 with 8 GPUs, you can run:
    python -m torch.distributed.launch --master_port 12347 --nproc_per_node=8 \
            main.py  kinetics RGB --arch resnet50 --num_segments 8 --gd 20 --lr 0.02 \
            --lr_scheduler step  --lr_steps 50 75 90 --epochs 100 --batch-size 16 \
            --wd 1e-4 --dropout 0.5 --consensus_type=avg --eval-freq=1 -j 4 --npb 
    

Contact

[email protected]

Acknowledgements

We especially thank the contributors of the TSN and TSM codebase for providing helpful code.

License

This repository is released under the Apache-2.0 license, as found in the LICENSE file.

Citation

If you find our work useful, please feel free to cite our paper 😆:

@InProceedings{Wang_2021_CVPR,
    author    = {Wang, Limin and Tong, Zhan and Ji, Bin and Wu, Gangshan},
    title     = {TDN: Temporal Difference Networks for Efficient Action Recognition},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {1895-1904}
}


tdn's Issues

GradCAM code

Hi, thanks for the code.
Could you share your Grad-CAM implementation? I know there are Grad-CAM implementations available on the Internet, but it would be nice if you could also publish yours in this repo.

One question

Hi!
I tested sthv1 on two GTX Titan X as follows:
python -m torch.distributed.launch --master_port 12347 --nproc_per_node=2 \
    main.py something RGB --arch resnet50 --num_segments 8 --gd 20 --lr 0.005 \
    --lr_scheduler step --lr_steps 30 45 55 --epochs 60 --batch-size 16 \
    --wd 5e-4 --dropout 0.5 --consensus_type=avg --eval-freq=1 -j 4 --npb
But I could not obtain the accuracy reported in the model zoo.
In addition, I tried to implement the core code of TDN on top of TSM in mmaction2, and only achieved 48.9%.
What details did I miss?
Looking forward to your reply!

Question about processing of HMDB51 dataset

Hi, I'm trying to fine-tune on HMDB51, but I have no idea how you handle this .avi dataset and its corresponding annotation format. Can you help with this? Thanks a lot.

How should the batch size be set on a single card?

My test result is 6% lower than yours. How is your batch size set across 8 cards? And how do you think the batch size should be set on a single card? Thanks for your reply!

Question about pretrained kinetic400 model accuracy?

Hi, thanks for your awesome work in video recognition and also the release.

Recently, I tested the pretrained Kinetics-400 checkpoint on 19799 videos by executing the provided test command, and only got 53% top-1 precision, which is far from the reported 76.6%. It bothered me a lot, and the dataset on my drive seems fine.
Here are some entries from my test dataset file:

sword_fighting/_kcVbo4E2JQ_000101_000111 300 345 
bowling/eU7Ht6zJcyk_000003_000013 142 31   
celebrating/bUidM8i-buc_000012_000022 300 51    
pumping_fist/OjowI-z7-oo_000006_000016 270 256

Some test logs:

video 928 done, total 928/19799, average 0.534 sec/video, moving Prec@1 52.396 Prec@5 73.958
video 960 done, total 960/19799, average 0.577 sec/video, moving Prec@1 52.823 Prec@5 74.294
video 992 done, total 992/19799, average 0.562 sec/video, moving Prec@1 52.539 Prec@5 74.414
video 1024 done, total 1024/19799, average 0.547 sec/video, moving Prec@1 52.557 Prec@5 74.527
video 1056 done, total 1056/19799, average 0.532 sec/video, moving Prec@1 52.941 Prec@5 74.816
video 1088 done, total 1088/19799, average 0.567 sec/video, moving Prec@1 52.857 Prec@5 74.643
video 1120 done, total 1120/19799, average 0.554 sec/video, moving Prec@1 53.212 Prec@5 74.740
video 1152 done, total 1152/19799, average 0.541 sec/video, moving Prec@1 53.209 Prec@5 74.662
video 1184 done, total 1184/19799, average 0.530 sec/video, moving Prec@1 52.878 Prec@5 74.753
video 1216 done, total 1216/19799, average 0.550 sec/video, moving Prec@1 53.125 Prec@5 75.080

So could you please help me figure it out? Thanks.
(P.S. I used the original data; it was not resized to 320px.)

Backward and forward in L-TDM module

Hello!
Nice work you guys have done!
I have a simple question after reading the paper and code carefully:
What is the meaning of "backward" and "forward" in the L-TDM module?

decord._ffi.base.DECORDError: [16:56:24] /github/workspace/src/video/video_reader.cc:151: Check failed: st_nb >= 0 (-1381258232 vs. 0) ERROR cannot find video stream with wanted index: -1

When I trained on Kinetics-400, this happened:

`=> base model: resnet50
kinetics: 400 classes
[06/22 16:52:42 TDN]: storing name: TDN__kinetics_RGB_resnet50_avg_segment8_e100

Initializing TSN with base model: resnet50.
TSN Configurations:
input_modality: RGB
num_segments: 8
new_length: 1
consensus_module: avg
dropout_ratio: 0.5
img_feature_dim: 256
=> base model: resnet50
[06/22 16:52:43 TDN]: [TDN-resnet50]group: first_conv_weight has 1 params, lr_mult: 1, decay_mult: 1
[06/22 16:52:43 TDN]: [TDN-resnet50]group: first_conv_bias has 1 params, lr_mult: 2, decay_mult: 0
[06/22 16:52:43 TDN]: [TDN-resnet50]group: normal_weight has 143 params, lr_mult: 1, decay_mult: 1
[06/22 16:52:43 TDN]: [TDN-resnet50]group: normal_bias has 64 params, lr_mult: 2, decay_mult: 0
[06/22 16:52:43 TDN]: [TDN-resnet50]group: BN scale/shift has 232 params, lr_mult: 1, decay_mult: 0
[06/22 16:52:43 TDN]: [TDN-resnet50]group: custom_ops has 0 params, lr_mult: 1, decay_mult: 1
video number:234619
video number:19760
video number:234619
video number:19760
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
[06/22 16:52:54 TDN]: Epoch: [0][0/14663], lr: 0.02000 Time 6.965 (6.965) Data 3.851 (3.851) Loss 5.9819 (5.9819) Prec@1 0.000 (0.000) Prec@5 0.000 (0.000)
[06/22 16:53:05 TDN]: Epoch: [0][20/14663], lr: 0.02000 Time 0.951 (0.853) Data 0.000 (0.183) Loss 6.1689 (6.5195) Prec@1 0.000 (0.000) Prec@5 0.000 (0.000)
[06/22 16:53:19 TDN]: Epoch: [0][40/14663], lr: 0.02000 Time 0.555 (0.782) Data 0.000 (0.094) Loss 6.0868 (6.3538) Prec@1 0.000 (0.305) Prec@5 0.000 (0.305)
[06/22 16:53:34 TDN]: Epoch: [0][60/14663], lr: 0.02000 Time 0.476 (0.761) Data 0.000 (0.063) Loss 6.1435 (6.2322) Prec@1 0.000 (0.205) Prec@5 0.000 (0.820)
[06/22 16:53:47 TDN]: Epoch: [0][80/14663], lr: 0.02000 Time 0.908 (0.742) Data 0.000 (0.048) Loss 5.7867 (6.1691) Prec@1 0.000 (0.309) Prec@5 0.000 (0.772)
[06/22 16:54:01 TDN]: Epoch: [0][100/14663], lr: 0.02000 Time 0.554 (0.729) Data 0.000 (0.038) Loss 5.7885 (6.1329) Prec@1 0.000 (0.371) Prec@5 0.000 (1.114)
[06/22 16:54:14 TDN]: Epoch: [0][120/14663], lr: 0.02000 Time 0.680 (0.719) Data 0.000 (0.032) Loss 6.0279 (6.1143) Prec@1 0.000 (0.310) Prec@5 0.000 (0.930)
[06/22 16:54:27 TDN]: Epoch: [0][140/14663], lr: 0.02000 Time 0.491 (0.708) Data 0.000 (0.027) Loss 5.8971 (6.0965) Prec@1 0.000 (0.266) Prec@5 0.000 (1.064)
[06/22 16:54:41 TDN]: Epoch: [0][160/14663], lr: 0.02000 Time 0.519 (0.704) Data 0.000 (0.024) Loss 5.8185 (6.0764) Prec@1 0.000 (0.311) Prec@5 12.500 (1.242)
[06/22 16:54:54 TDN]: Epoch: [0][180/14663], lr: 0.02000 Time 0.613 (0.702) Data 0.000 (0.021) Loss 5.8592 (6.0648) Prec@1 0.000 (0.345) Prec@5 0.000 (1.312)
[06/22 16:55:08 TDN]: Epoch: [0][200/14663], lr: 0.02000 Time 0.519 (0.699) Data 0.000 (0.019) Loss 5.9776 (6.0537) Prec@1 0.000 (0.435) Prec@5 0.000 (1.368)
[06/22 16:55:21 TDN]: Epoch: [0][220/14663], lr: 0.02000 Time 0.500 (0.697) Data 0.000 (0.018) Loss 6.0370 (6.0481) Prec@1 0.000 (0.396) Prec@5 0.000 (1.527)
[06/22 16:55:35 TDN]: Epoch: [0][240/14663], lr: 0.02000 Time 0.714 (0.698) Data 0.000 (0.016) Loss 5.8629 (6.0379) Prec@1 0.000 (0.363) Prec@5 12.500 (1.556)
[06/22 16:55:49 TDN]: Epoch: [0][260/14663], lr: 0.02000 Time 0.565 (0.695) Data 0.000 (0.015) Loss 5.8003 (6.0317) Prec@1 0.000 (0.431) Prec@5 0.000 (1.628)
[06/22 16:56:13 TDN]: Epoch: [0][280/14663], lr: 0.02000 Time 0.572 (0.733) Data 0.000 (0.014) Loss 5.8998 (6.0280) Prec@1 0.000 (0.445) Prec@5 0.000 (1.601)
[06/22 16:56:27 TDN]: Epoch: [0][300/14663], lr: 0.02000 Time 0.524 (0.731) Data 0.000 (0.013) Loss 5.8075 (6.0214) Prec@1 0.000 (0.415) Prec@5 0.000 (1.620)
Traceback (most recent call last):
File "main.py", line 361, in
main()
File "main.py", line 211, in main
train_loss, train_top1, train_top5 = train(train_loader, model, criterion, optimizer, epoch=epoch, logger=logger, scheduler=scheduler)
File "main.py", line 260, in train
for i, (input, target) in enumerate(train_loader):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 363, in next
data = self._next_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 971, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
data.reraise()
File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
decord._ffi.base.DECORDError: Caught DECORDError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/workspace/mnt/storage/kanghaidong/new_video_project/video_project/TDN/ops/dataset.py", line 166, in getitem
video_list = decord.VideoReader(video_path)
File "/usr/local/lib/python3.6/dist-packages/decord/video_reader.py", line 55, in init
uri, ctx.device_type, ctx.device_id, width, height, num_threads, 0, fault_tol)
File "/usr/local/lib/python3.6/dist-packages/decord/_ffi/_ctypes/function.py", line 175, in call
ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
File "/usr/local/lib/python3.6/dist-packages/decord/_ffi/base.py", line 78, in check_call
raise DECORDError(err_str)
decord._ffi.base.DECORDError: [16:56:24] /github/workspace/src/video/video_reader.cc:151: Check failed: st_nb >= 0 (-1381258232 vs. 0) ERROR cannot find video stream with wanted index: -1`
How can I solve this?

TypeError: 'Compose' object is not iterable

I want to test the return value of the dataset; the code is as follows:

    test_dataset = TSNDataSet(
        dataset="something",
        root_path="F:/Dataset/temp/video_001",
        list_file="../annotations",
        num_segments=8,
        modality="RGB",
        image_tmpl='{:02d}.jpg',
        transform=torchvision.transforms.Compose([train_augmentation,
                                                  Stack(roll=(arch in ['BNInception', 'InceptionV3'])),
                                                  ToTorchFormatTensor(
                                                      div=(arch not in ['BNInception', 'InceptionV3'])),
                                                  normalize, ]),
    )
    for data, label in test_dataset:
        pass

But the exception "TypeError: 'Compose' object is not iterable" occurred. Why?
The video_001 directory contains all the extracted images, and the content of the annotations file is: "F:/Dataset/temp/video_001 50 0".

some confusions

Hello:

Lines 103 and 104 of ops/base_module.py are as follows:
103 y_forward_smallscale4 = self.bn3_smallscale4(self.conv3_smallscale4(y_forward_smallscale4))
104 y_backward_smallscale4 = self.bn3_smallscale4(self.conv3_smallscale4(y_forward_smallscale4))
Is the input variable y_forward_smallscale4 on line 104 a mistake? Should it be y_backward_smallscale4?

one problem

Hi, if you load a model with --resume from an epoch before the warm-up epoch, the lr scheduler is correct within the warm-up epochs, but when it enters the after-scheduler, the lr update is wrong: the branch at line 29 of lr_scheduler.py is never entered. Can you fix it? Thank you.

Error when running Testing Kinetics: no is_shift defined in TSN()

Thanks for sharing the codes.

I tried to evaluate TDN-ResNet50-8x3x10 in your model zoo on Kinetics following your examples:

CUDA_VISIBLE_DEVICES=0 python3 test_models_three_crops.py kinetics \
    --archs='resnet50' --weights <your_checkpoint_path> --test_segments=8 \
    --test_crops=3 --batch_size=16 --full_res --gpus --output_dir <your_pkl_path> \
    -j 4 --clip_index <your_clip_index>

Error occurs when constructing TSN (Line 124 in test_models_three_crops.py):

net = TSN(num_class, this_test_segments, modality,
          base_model=this_arch,
          consensus_type=args.crop_fusion_type,
          img_feature_dim=args.img_feature_dim,
          pretrain=args.pretrain,
          is_shift=is_shift, shift_div=shift_div, shift_place=shift_place
          )

There are no is_shift, shift_div, or shift_place arguments defined for TSN(nn.Module) in ops/models.py.
Maybe you need to update ops/models.py?

Question of setting of 'new_length' in dataset.py

Hello, I find that 'new_length' is always set to 5 in dataset.py. The following code in dataset.py confused me, as average_duration will give the same result in both branches. Should new_length in the else branch be set to 1? Hoping for your kind reply. Thank you.

        if not self.I3D_sample : # TSN uniformly sampling for TDN
            if((len(video_list) - self.new_length + 1) < self.num_segments):
                average_duration = (len(video_list) - 5 + 1) // (self.num_segments)
            else:
                average_duration = (len(video_list) - self.new_length + 1) // (self.num_segments)

Pretrained models

Thanks for sharing the code. After downloading the pretrained models, I found that the files are damaged and cannot be used. Please check and provide working models. Thanks!

where is the short term TDM part?

Hi, thank you for sharing the code.
I am reading your code, and I only find the long-term TDM mentioned in the paper in base_module.py; the short-term TDM is not there. So where is it?

Thank you very much.

What are the meanings of Frames, Crops, and Clips, respectively?

Hi, I am currently applying your method to very large-scale micro-video classification.
However, I am a little confused about the Frames, Crops, and Clips in your paper, since I am a beginner.
May I confirm their meanings with you?

  • Frame: the number of segments sampled from a video, where a segment may contain many frames?
  • Crop: the number of cropped images taken from a single frame?
  • Clip: I don't know. It cannot be the number of frames in a segment, because that is fixed to 5?

Visualization wtih Grad-CAM

Hi! Thank you for your wonderful project! I've tried visualization with Grad-CAM, and the results are as follows.

[Grad-CAM result images attached]

There seems to be something wrong with my results. I was going to input one frame [3, 224, 224] to the method from https://github.com/jacobgil/pytorch-grad-cam, but it failed with the error "shape [-1,15,224,224] is invalid for input of size 150528", because the input tensor should be [-1, 3*5, 224, 224] in your model.

So I sampled the frames according to the strategy in your code for visualization. In each segment, 5 consecutive frames [X1, X2, X3, X4, X5] were concatenated, giving a tensor of shape [3x5, 224, 224]. Then I repeated it 8 times to obtain the input tensor [3x5x8, 224, 224]. Finally, I got the CAM and multiplied it onto X3.
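(For reference, the tensor construction described above looks roughly like the sketch below; the function name is illustrative and not from the repo.)

    import torch

    def build_tdn_input(frames, num_segments=8):
        """Illustrative only: stack 5 consecutive RGB frames and tile them across
        segments to mimic the [num_segments * 5 * 3, 224, 224] input shape.

        frames: list of five tensors, each of shape (3, H, W).
        """
        segment = torch.cat(frames, dim=0)         # (15, H, W)
        clip = segment.repeat(num_segments, 1, 1)  # (num_segments * 15, H, W)
        return clip.unsqueeze(0)                   # (1, num_segments * 15, H, W)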

Why is there an obvious CAM in each result frame? It seems to be different from the picture in your paper. Could you tell me what's wrong with my visualization?

Can't reproduce kinetic400 result

Hello, I've trained on Kinetics-400 with the command:

python -m torch.distributed.launch --master_port 12347 --nproc_per_node=5 \
        main.py  kinetics RGB --arch resnet50 --num_segments 8 --gd 20 --lr 0.02 \
        --lr_scheduler step  --lr_steps 20 45 60 --epochs 70 --batch-size 24 \
        --wd 1e-4 --dropout 0.5 --consensus_type=avg --eval-freq=1 -j 10 --npb 

and got Prec@1 73.77 and Prec@5 91.1. I've also used num_segments=16 and got Prec@1 75.37878, which still can't reproduce the Kinetics-400 results (76.6% / 77.5%).
It seems issue #33 got the same accuracy.
Maybe my training data is inconsistent with yours; can you share your training data file or log?
Or is there a problem with my training?
Hoping for your kind reply. Thank you.

questions about ShiftModule used in BottleneckShift

Thanks for sharing the code !

I see that the ShiftModule uses a conv1d to shift, like TSM. In your BottleneckShift, you add an mSEModule and a ShiftModule, making it different from the normal Bottleneck.

Does the mSEModule refer to the long-term TDM in your paper? And does the ShiftModule refer to the shift module of TSM?

'''
out = self.mse(out)
out = self.shift(out)
'''
In your code, you apply the mse and shift modules together. Does this mean TDM uses both the L-TDM and the TSM shift at the same time? Or do you just set ShiftModule(mode='fixed') when running TDM?

Confusion about frames sampling

Hi, your paper shows that you set the frame sampling number T = 8, 16, and 8+16, respectively. What does "8+16" mean? Is it an ensemble of two models, or simply a setting of T = 24? I have some trouble understanding "8+16" and don't know how to derive this result.

Training with my own datasets

I followed the steps and made my own dataset in the Kinetics-400 format. With one GPU, I set batch_size=16, but I hit an error when training. This is the error output:
train_loader: <torch.utils.data.dataloader.DataLoader object at 0x7f90f6298460>
Traceback (most recent call last):
File "main.py", line 361, in
main()
File "main.py", line 211, in main
train_loss, train_top1, train_top5 = train(train_loader, model, criterion, optimizer, epoch=epoch, logger=logger, scheduler=scheduler)
File "main.py", line 261, in train
for i, (input, target) in enumerate(train_loader):
File "/home/shtf/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in next
data = self._next_data()
File "/home/shtf/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
return self._process_data(data)
File "/home/shtf/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
data.reraise()
File "/home/shtf/anaconda3/envs/py38/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
UnboundLocalError: Caught UnboundLocalError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/shtf/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
data = fetcher.fetch(index)
File "/home/shtf/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/shtf/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/shtf/ysir/pycharmprojects/TDN-main/ops/dataset.py", line 162, in getitem
video_list = decord.VideoReader(video_path)
UnboundLocalError: local variable 'video_path' referenced before assignment

Traceback (most recent call last):
File "/home/shtf/anaconda3/envs/py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/shtf/anaconda3/envs/py38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/shtf/anaconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 261, in
main()
File "/home/shtf/anaconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 256, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/shtf/anaconda3/envs/py38/bin/python', '-u', 'main.py', '--local_rank=0', 'kinetics', 'RGB', '--arch', 'resnet50', '--num_segments', '8', '--gd', '20', '--lr', '0.02', '--lr_scheduler', 'step', '--lr_steps', '50', '75', '90', '--epochs', '100', '--batch-size', '16', '--wd', '1e-4', '--dropout', '0.5', '--consensus_type=avg', '--eval-freq=1', '-j', '4', '--npb']' returned non-zero exit status 1
I have no idea about this. Thank you.

questions about batchnorm in DistributedDataParallel

Hi, thanks for sharing the code!
You have BatchNorm in your ResNet model and also use DistributedDataParallel. I want to know whether it is necessary to use SyncBatchNorm to get a more accurate result. I'm always confused about this kind of detail.

To replace BatchNorm with SyncBatchNorm under DDP, only two lines of code need to change:
"model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)" and the isinstance check in get_optim_policies (so that it also accepts SyncBatchNorm). However, applying SyncBatchNorm may lead to a slight decrease in speed.

One question

Hi, I ran the following testing script:
CUDA_VISIBLE_DEVICES=1 python test_models_center_crop.py something --archs='resnet50' --weights checkpoints/best.pth.tar --test_segments=8 --test_crops=1 --batch_size=16 --gpus 0 --output_dir output/8f -j 4 --clip_index=0
where best.pth.tar is downloaded from your sthv1 model zoo, but I got Overall Prec@1 0.22% and Prec@5 2.40%.
I wonder what mistake I might have made...

some confusions

hi,
in tdn_net.py,

  1. Line 84: model = TDN_Net(resnet_model,resnet_model1,apha=0.5,belta=0.5)
    Why do you use two models with the same structure? I am very confused.

  2. Line 3: self.conv1_temp1 = list(resnet_model1.children())[0]. The variable self.conv1_temp1 does not seem to appear in the following code, so what is this variable for?
    Thank you.

Some confusion

Please, how do I implement the model ensemble? I did not find the relevant information in the code.

The size of input data is not the same

I trained the model with RGB modality, num_segments=16, and batch_size=16, so the input shape is [16, 16*3, 224, 224]. TSM and TSN convert this into [-1, 3, 224, 224], but your method uses base_output with shape [-1, 3*5, 224, 224]. How can I change the input size?

Error:

 x_c5 = self.conv1_5(self.avg_diff(torch.cat([x2-x1,x3-x2,x4-x3,x5-x4],1).view(-1,12,x2.size()[2],x2.size()[3])))
RuntimeError: The size of tensor a (0) must match the size of tensor b (3) at non-singleton dimension 1

How to train the HMDB51 dataset to get the results of the paper?

python -m torch.distributed.launch --master_port 12347 --nproc_per_node=3 \
    main.py hmdb51 RGB --arch resnet50 --num_segments 16 --gd 20 --lr 0.001 \
    --lr_scheduler step --lr_steps 10 20 --epochs 25 --batch-size 4 \
    --wd 5e-4 --dropout 0.8 --consensus_type=avg --eval-freq=1 -j 4 \
    --tune_from='checkpoint/best.pth.tar'

Using this command, I did not get results close to those in the paper. Can you share how you did it? Thank you very much.

Problems related to model performance

Hi~
I have trained TDN (ResNet50, 8 frames) on Something-V1 as well as Something-V2. However, there is a gap between our results and the reported ones. On Something-V2, the best Prec@1 is only 61.567.
We follow the command and process you mention in README.md. I wonder whether some hyper-parameter settings in our experiments differ from yours.
[Screenshot of training results, 2022-01-19, attached]

How does clip_index affect the sampling results?

When testing Kinetics, there is an argument clip_index, but in dataset.py I find that clip_index doesn't make any difference. So how do you control the distribution of input clips? Or did I overlook something?

Confusion about the baseline

Hi, I'm very interested in your project, so I'm coming back! In your ablation study, you mention a baseline without S-TDM or L-TDM, with only the shift operation in the later stages. So I reimplemented it, and surprisingly the top-1 accuracy reached 47.2% (45.2% in the paper). I don't know what's wrong with the code. I would appreciate it if you could provide the code of the baseline! ([email protected])

I made only two changes to your code:
1. base_module.py [screenshot attached]
2. tdn_net.py [screenshot attached]

A question about testing

Hi! The best model I got after training reaches 73.13/91.16, but the test result is always close to 0. I don't know why and have tried many things; can you point out the reason? Thanks very much!

I use this command: "CUDA_VISIBLE_DEVICES=0 python3 test_models_three_crops.py kinetics --archs='resnet50' --weights /data1/Masters/TDN/best.pth.tar --test_segments=8 --test_crops=3 --batch_size=32 --full_res --gpus 0 --output_dir ./result -j 4 --clip_index 0"

Learning rate

Hi, your paper says "...the initial learning rate is 0.02... The learning rate will be divided by a factor of 10 when the performance on the validation set saturates." What arguments should I pass during training to achieve this kind of learning-rate decay? Thank you very much!

Fine-tuning error when loading something_resnet50_segment8/best.pth.tar

Hello, when fine-tuning to evaluate TSM and TDN, loading the pretrained model in this project reports the error "Notice: keys that failed to load: {'module.base_model.layer2_bak.1.bn1.running_var', 'base_model.layer4_bak.0.bn1.weight', 'module.new_fc.weight', 'base_model.layer1_bak.1.conv2.bias', 'module.base_model.layer2_bak.1.bn3.running_var', ......". How can this be solved? Thanks!
Also, when resuming training, modifying the arguments of the loading function allows training to run normally.

Confused about arguments

I'm currently running TDN on a personal dataset. Everything runs well. Initially I set --num_segments=32 because I'm dealing with very fine-grained actions. However, this means that I need to set --batch_size=4, even when working on 2 Tesla V100 GPUs.

I also ran a different experiment where I activated --dense_sample. What I cannot derive from the paper is whether --num_segments still matters when using "dense sampling" as I3D does, or whether I can set it to a very small number and get the same results.

Lower performance on Kinetics using model zoo TDN-ResNet50 checkpoint

I followed your testing instructions (3 crops, 10 clips) and evaluated the model zoo TDN-ResNet50 checkpoint on Kinetics.
The only difference is that I use extracted frames (as in your Something-V1 & V2 processing) instead of the raw video inputs you use for Kinetics.
I ran the testing twice and got 75.05/91.95 and 75.10/91.97, respectively.
I know there are some dataset differences, but they should not cause such a big performance gap.
Would you kindly check whether the testing setting is the one with which you obtained the reported performance?

train on kinetics

Hi, thanks for sharing this code.
I want to ask whether anyone has trained this code on the Kinetics dataset. I have generated annotations following the instructions but got some errors, and I found that the generated files (train.txt and val.txt) don't match the code in dataset.py.

dataset.py 'new_length'

What does "new_length" mean in dataset.py, and what is the shape of a sample? Sorry, I'm a little confused reading the code.

Question about TSM accuracy in ablation study

Hi, thanks for your awesome work in video recognition and also the release.

Recently, I tested the pretrained sthv1 8x1x1 checkpoint by executing the provided test command, and only got 49.9% top-1 precision, which is far from the reported 52.3%. It bothered me a lot, and the dataset on my drive seems fine.
So could you please help me figure it out? By the way, I am using torch 1.9.

Besides, the ablation studies in your paper show that TSM with 8 RGB frames (center crop x 1 clip) achieves 47.1% top-1 precision, which is higher than in the original paper (and also higher than their released pretrained model). Did you train this model yourself? Could you please give more specific details? Many thanks!
