mcg-nju / memotr

[ICCV 2023] MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking

Home Page: https://arxiv.org/abs/2307.15700

License: MIT License

Python 74.69% Shell 0.06% C++ 0.64% Cuda 6.43% Jupyter Notebook 18.19%

computer-vision deep-learning multi-object-tracking tracking

memotr's Issues

ImportError: No module named torch when running sh make.sh

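After creating the virtual environment and installing the required components as described in the setup instructions, I cd'd into the relevant directory and ran sh make.sh, which produced this problem. The full error is:
Traceback (most recent call last):
File "setup.py", line 11, in <module>
import torch
ImportError: No module named torch
I tried pip install torch and ran it again, but got the same result.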
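
The wording of the message (a plain ImportError with no quotes around the module name) is the form Python 2 prints, so one possibility is that the `python` invoked by make.sh is not the interpreter of the virtual environment. A minimal sketch for checking this, to be run with the same `python` command the script uses; the function name `describe_interpreter` is hypothetical and not part of the repository:

```python
import importlib.util
import sys


def describe_interpreter(module_name="torch"):
    """Report which interpreter is running and whether it can find a module."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,       # path of the interpreter in use
        "version": sys.version_info[:3],    # e.g. (3, 10, 12)
        "module_found": spec is not None,   # True if the module is importable
    }


if __name__ == "__main__":
    print(describe_interpreter("torch"))
```

If the reported executable is outside the virtual environment, installing torch with that interpreter's own pip (or activating the environment before running the script) would be the next thing to try.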

About BDD100K

Congratulations on the achievement. When will the BDD100K model and training method be released?

Input format: training on one frame of the video clip?

Can you please elaborate on "The batch size is set to 1 per GPU, and each batch contains a video clip with multiple frames. Within each clip, video frames are sampled with random intervals from 1 to 10."

Does this mean the actual model is trained on one frame at a time randomly selected from the clip? I am trying to understand the actual input to the transformer encoder and decoder.

Also, what is the role of no_grad_frames?

Thank you!!
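
One reading of the quoted sentence is that each training sample is a short sequence of frames (not a single frame), where the gap between consecutive sampled frames is drawn at random from 1 to 10. The sketch below illustrates that reading only. It is not the repository's sampler; the function name, its parameters, and the choice to draw a fresh gap at every step are assumptions.

```python
import random


def sample_clip_indices(num_frames, clip_len, max_interval=10, rng=None):
    """Pick clip_len increasing frame indices with random gaps of 1..max_interval."""
    rng = rng or random.Random()
    if clip_len < 1 or num_frames < clip_len:
        raise ValueError("video is shorter than the requested clip length")
    if clip_len == 1:
        return [rng.randrange(num_frames)]
    # Cap the gap so that the whole clip always fits inside the video.
    cap = min(max_interval, (num_frames - 1) // (clip_len - 1))
    gaps = [rng.randint(1, cap) for _ in range(clip_len - 1)]
    start = rng.randrange(num_frames - sum(gaps))
    indices = [start]
    for gap in gaps:
        indices.append(indices[-1] + gap)
    return indices
```

Under this reading, the frames of a clip are processed in order, so the model input for one iteration is the whole sampled sequence.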

Request for dataset organization code

Hi, thanks for your excellent work.
Could you upload the pre-processing code used to organize the datasets into the following layout?

DATADIR/
  ├── DanceTrack/
  │ ├── train/
  │ ├── val/
  │ ├── test/
  │ ├── train_seqmap.txt
  │ ├── val_seqmap.txt
  │ └── test_seqmap.txt
  ├── MOT17/
  │ ├── images/
  │ │ ├── train/
  │ │ └── test/
  │ └── gts/
  │   └── train/
  └── CrowdHuman/
    ├── images/
    │ ├── train/
    │ └── val/
    └── gts/
      ├── train/
      └── val/
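
While waiting for an official script, the layout above can at least be verified mechanically. The following is a small sketch that lists which expected entries are missing under a data root. The path list is copied from the tree above; the names `EXPECTED_PATHS` and `missing_entries` are hypothetical.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_PATHS = (
    "DanceTrack/train",
    "DanceTrack/val",
    "DanceTrack/test",
    "DanceTrack/train_seqmap.txt",
    "DanceTrack/val_seqmap.txt",
    "DanceTrack/test_seqmap.txt",
    "MOT17/images/train",
    "MOT17/images/test",
    "MOT17/gts/train",
    "CrowdHuman/images/train",
    "CrowdHuman/images/val",
    "CrowdHuman/gts/train",
    "CrowdHuman/gts/val",
)


def missing_entries(data_root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_entries("DATADIR")` returns an empty list when every directory and seqmap file is in place.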

Distributed operation

Hello, thank you for your excellent work.
When reproducing the code, I used the following command: python -m torch.distributed.run --nproc_per_node=8 main.py --mode submit --config-path /home/sunzhaojie/memot/outputs/memot_mot17/train/config.yaml --submit-dir /home/sunzhaojie/MeMOTR/outputs/memotr_mot17/ --submit-model dab_deformable_detr.pth --use-distributed --data-root /home/sunzhaojie/MeMOTR/dataset/MOT17
The following error occurred while running the code:

Traceback (most recent call last):
File "/home/sunzhaojie/MeMOTR/main.py", line 120, in <module>
main(config=merged_config)
File "/home/sunzhaojie/MeMOTR/main.py", line 97, in main
torch.cuda.set_device(distributed_rank())
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/cuda/__init__.py", line 326, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3949687 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3949688 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 3949689) of binary: /home/sunzhaojie/.conda/envs/mot13/bin/python
Traceback (most recent call last):
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 766, in <module>
main()
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/sunzhaojie/.conda/envs/mot13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
time : 2023-12-15_17:11:52
host : ubuntu-Precision-7920-Tower
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 3949690)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-12-15_17:11:52
host : ubuntu-Precision-7920-Tower
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 3949691)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-12-15_17:11:52
host : ubuntu-Precision-7920-Tower
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 3949692)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2023-12-15_17:11:52
host : ubuntu-Precision-7920-Tower
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 3949693)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2023-12-15_17:11:52
host : ubuntu-Precision-7920-Tower
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 3949694)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-12-15_17:11:52
host : ubuntu-Precision-7920-Tower
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 3949689)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

How can I solve this problem? I hope to hear from you!
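
In the failure summary, local ranks 2 through 7 exit with "invalid device ordinal" while the first two processes only receive SIGTERM. That pattern would be consistent with launching 8 processes on a machine where only two GPUs are visible, although the log alone does not prove it. The sketch below shows the check in plain Python; the function names are hypothetical, and in practice the device count would come from `torch.cuda.device_count()`.

```python
def invalid_ranks(nproc_per_node, visible_gpus):
    """Local ranks that would have no GPU to bind to."""
    return [rank for rank in range(nproc_per_node) if rank >= visible_gpus]


def check_local_rank(local_rank, visible_gpus):
    """Fail early with a readable message instead of a CUDA error."""
    if visible_gpus <= 0:
        raise RuntimeError("no CUDA devices are visible to this process")
    if not 0 <= local_rank < visible_gpus:
        raise RuntimeError(
            f"local rank {local_rank} has no device: only {visible_gpus} "
            "GPU(s) visible; lower --nproc_per_node or check CUDA_VISIBLE_DEVICES"
        )
    return local_rank
```

With 8 processes and 2 visible GPUs, `invalid_ranks(8, 2)` gives exactly the ranks reported as failed above, so setting `--nproc_per_node` to the number of visible GPUs is worth trying first.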

Error while training in distributed mode

Hello, I ran into an error while training in distributed mode. The error is as follows: "torch.distributed.elastic.multiprocessing.errors.ChildFailedError: main.py FAILED"

Any idea?

Thanks!

How to fine tune a custom VOC Dataset?

Thanks for your amazing work. I have a dataset labeled in VOC format. It has the following directory structure:

Dataset/
  ├── video1/
  │ ├── img1
  │ └── img1.xml
  └── video2/
    ├── img1
    └── img1.xml

I have written a custom script to convert it into a COCO video dataset.
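
For reference, the box conversion at the core of such a script can be sketched as follows: VOC stores corners (xmin, ymin, xmax, ymax), while COCO stores [x, y, width, height]. This is an illustrative sketch, not code from the repository, and the function name is hypothetical. Note that plain VOC annotations carry no track identity, so a tracking dataset would additionally need a consistent object ID across frames, which this sketch does not invent.

```python
import xml.etree.ElementTree as ET


def voc_objects_to_coco(xml_text):
    """Convert the <object> entries of one VOC XML file to COCO-style boxes."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        xmin = float(bndbox.findtext("xmin"))
        ymin = float(bndbox.findtext("ymin"))
        xmax = float(bndbox.findtext("xmax"))
        ymax = float(bndbox.findtext("ymax"))
        boxes.append({
            "category": obj.findtext("name"),
            # COCO format: top-left corner plus width and height.
            "bbox": [xmin, ymin, xmax - xmin, ymax - ymin],
        })
    return boxes
```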

Question about short_memory

Hello, When I read your paper and reproduced the code, I had a question. You mentioned in your paper: we fuse the outputs from two adjacent frames with an adaptive aggregation algorithm. As shown in the red box below:

The implementation of this part in the code is as follows:

My question is as follows: last_output_embed represents the output of the previous frame, so why is it tracks[b].last_output rather than tracks[b-1].last_output?
I'm sorry to bother you again. If I have misunderstood anything, please let me know.

Pytorch environment issues

May I ask which version of PyTorch should be installed when the CUDA version is 11.3?

1. I have tried torch=1.12.1+cu113, torchvision=0.13.1+cu113, and torchaudio=0.12.1+cu113. With these versions I cannot run the /Deformable DETR/models/ops/test.py file; it fails with the error nvrtc: error: invalid value for --gpu-architecture (-arch).

2. I have tried torch=1.11.0+cu113, torchvision=0.12.0+cu113, and torchaudio=0.11.0. With these versions the /Deformable DETR/models/ops/test.py file runs normally. However, running the training script main.py produces the error shown in the following figure.

I am looking forward to your reply, thanks!

Question about --use-checkpoint

When I do not use checkpointing, training runs but is rather slow.
When I use checkpointing with CHECKPOINT_LEVEL=2, I get the following error:
ValueError: Unexpected keyword arguments: use_reentrant
features, pos = checkpoint(self.backbone, frame, use_reentrant=False)
Looking into def checkpoint(function, *args, **kwargs):, the checkpoint function has no use_reentrant parameter.
When I remove use_reentrant=False, or use CHECKPOINT_LEVEL=3, I get the following error instead:

  File "main.py", line 120, in <module>
    main(config=merged_config)
  File "main.py", line 103, in main
    train(config=config)
  File "/cver/tcying/ytc/MeMOTR/train_engine.py", line 126, in train
    train_one_epoch(
  File "/cver/tcying/ytc/MeMOTR/train_engine.py", line 238, in train_one_epoch
    loss.backward()
  File "/cver/tcying/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/cver/tcying/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
  File "/cver/tcying/lib/python3.8/site-packages/torch/autograd/function.py", line 87, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore[attr-defined]
  File "/cver/tcying/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 138, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/cver/tcying/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 271 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.

How can I solve this?
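
The first error suggests the installed PyTorch predates the `use_reentrant` keyword of `torch.utils.checkpoint.checkpoint`. To my knowledge that keyword was added in PyTorch 1.11, but treat the cutoff as an assumption to verify against the documentation of your installed version. A sketch of a version gate in plain Python follows; the helper names are hypothetical, and the version string would normally be `torch.__version__`.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from strings such as '1.9.1+cu111'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def checkpoint_kwargs(torch_version, minimum=(1, 11)):
    """Keyword arguments to pass to checkpoint() for this PyTorch version.

    The (1, 11) cutoff is an assumption about when use_reentrant appeared.
    """
    if parse_major_minor(torch_version) >= minimum:
        return {"use_reentrant": False}
    return {}
```

Dropping the keyword on an older version only avoids the ValueError. As the second message itself states, reentrant checkpointing combined with DDP can mark parameters ready twice, so upgrading PyTorch so that `use_reentrant=False` is accepted may be the more robust route.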

Can you please point to the code that tracks during inference?

I am confused about how tracking is performed during inference for videos longer than the sample length. Which part of the code connects those shorter tracks?

MOT 17 Training

Can you please share details about the validation set you used to validate your method for the MOT17 dataset before submitting to the test server?

An error occurred while replicating the code

Thank you for your excellent work. When I was training on the DanceTrack dataset, I got an error: RuntimeError: mat1 and mat2 shapes cannot be multiplied (300x256 and 512x256). This error seems to say that the two matrices have incompatible shapes, but I can't find the problematic block of code. Could you give me some guidance, please? Thank you very much.
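
As a reading aid for the message: a matrix product needs the column count of mat1 to equal the row count of mat2. When the error comes from a linear layer, mat2 is usually the transposed weight, so 512x256 would correspond to a layer expecting 512 input features while the input has 256 per row. Which layer that is in this codebase is not something the message reveals. The shape rule itself can be sketched in plain Python (hypothetical helper name):

```python
def matmul_shape(mat1_shape, mat2_shape):
    """Result shape of mat1 @ mat2, or raise if the inner sizes differ."""
    rows, inner1 = mat1_shape
    inner2, cols = mat2_shape
    if inner1 != inner2:
        raise ValueError(
            "mat1 and mat2 shapes cannot be multiplied "
            f"({rows}x{inner1} and {inner2}x{cols})"
        )
    return rows, cols
```

Searching the model and config for a dimension of 512 (for example, a layer fed with concatenated 256-dimensional embeddings) may help locate the mismatch.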

undefined symbol: _ZNK2at6Tensor7optionsEv

Hello, when I use an environment with torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 and CUDA=11.3, the following problem occurs:

ImportError: /cver/tcying/lib/python3.8/site-packages/MultiScaleDeformableAttention-1.0-py3.8-linux-x86_64.egg/MultiScaleDeformableAttention.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor7optionsEv

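This looks like a problem with compiling DETR, because I get the same error when running test.py. Switching to the latest PyTorch version still gives the same problem.

However, when I switch the environment to torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 and CUDA=11.1, DETR compiles successfully, but the following error occurs at runtime: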

Traceback (most recent call last):
  File "main.py", line 120, in <module>
    main(config=merged_config)
  File "main.py", line 99, in main
    from train_engine import train
  File "/cver/tcying/ytc/MeMOTR/train_engine.py", line 12, in <module>
    from models import build_model
  File "/cver/tcying/ytc/MeMOTR/models/__init__.py", line 6, in <module>
    from .memotr import build as build_memotr
  File "/cver/tcying/ytc/MeMOTR/models/memotr.py", line 13, in <module>
    from .backbone import BackboneWithPE
  File "/cver/tcying/ytc/MeMOTR/models/backbone.py", line 8, in <module>
    from torchvision.models import resnet50, ResNet50_Weights
ImportError: cannot import name 'ResNet50_Weights' from 'torchvision.models' (/cver/tcying/lib/python3.8/site-packages/torchvision/models/__init__.py)

This looks similar to issue #6, but I don't know how to solve it.
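
The second error has a direct explanation: the `ResNet50_Weights` enum was introduced in torchvision 0.13, so torchvision 0.10.1 cannot provide it. The undefined-symbol error in the first environment is often a sign that the compiled extension was built against a different PyTorch than the one loaded at runtime, though that is a guess from the symptom. A small sketch for checking the installed torchvision version follows; the helper names are hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Version string of an installed distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def has_weights_enum(torchvision_version):
    """True if this torchvision version ships ResNet50_Weights (0.13+)."""
    major, minor = (int(part) for part in torchvision_version.split(".")[:2])
    return (major, minor) >= (0, 13)
```

If `has_weights_enum(installed_version("torchvision"))` is False, the options are to upgrade torchvision together with a matching torch build, then rebuild the deformable attention ops in that same environment.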

Training instructions for MOT 17

Hello! Can you please upload instructions for using the training script for MOT17? Thank you!!

How to get bbox for occluded frames using motion params?

Hi, my tracker is performing well: it keeps track of an object even if it is occluded for 10 to 15 frames. Now, how can I get the bounding box for these occluded frames? I have seen motion parameters in the runtime tracker. How can I make use of those parameters?

Performance Reproduction

Hi, thank you very much for providing this well-structured codebase!

I tried training MeMOTR (with DAB-DETR) on DanceTrack and ran into performance issues. In particular, using the provided config file and pretrained checkpoint I only obtain:

HOTA DetA AssA
62.481 74.141 52.901

In particular, the association accuracy lags > 2 points behind the performance reported in the paper. Was anyone able to reproduce the original performance? Is there anything I'm missing? @HELLORPG have you tried training this model with the current codebase and config file? Thanks in advance for your help!
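
For anyone comparing numbers: at a single localization threshold, HOTA is the geometric mean of DetA and AssA, and the reported scores are averages over thresholds, so the relation holds only approximately for the final figures. The sketch below applies it to the values above; the function name is hypothetical.

```python
import math


def hota_from_components(det_a, ass_a):
    """Geometric mean of DetA and AssA (exact only per threshold)."""
    return math.sqrt(det_a * ass_a)


if __name__ == "__main__":
    approx = hota_from_components(74.141, 52.901)
    print(f"approximate HOTA from DetA and AssA: {approx:.3f}")
```

Because of the square root, a gap in AssA at fixed DetA shows up in HOTA at roughly half its relative size, which helps when judging whether a HOTA difference is explained by association alone.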
