yolos's Issues

AMP Support?

Thanks for your great work and releasing the code!

I see that the AMP-related code is commented out, and I am wondering whether I can use AMP in this project. Would it speed up training, and would it hurt performance?

Where are the pre-trained models?

In your paper, you say that your pre-trained models are in this repo, but I can't find them! I have searched everywhere and can't find them anywhere else. I have no time to train them myself, so I need your help!
If you can give them to me privately, you can send them to [email protected].

How is the performance on Pascal VOC?

Hi, I tested YOLOS on Pascal VOC 2007 with the default parameters, but I can't get a satisfactory result. Here is my result:

Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.276
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.497
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.274
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.006
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.085
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.390
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.297
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.433
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.490
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.053
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.289
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.628

I wonder whether something went wrong. Could you give me some advice?

Error of the size mismatch for pos_embed

We are loading our pretrained ViT-Base model, trained with the MAE method, and we hit a size mismatch for pos_embed. Is there a solution to this problem?

RuntimeError: Error(s) in loading state_dict for VisionTransformer: size mismatch for pos_embed: copying a param with shape torch.Size([1, 785, 768]) from checkpoint, the shape in current model is torch.Size([1, 578, 768]).
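
The two sequence lengths in the error can be read as "prefix tokens plus a square patch grid". The sketch below is only arithmetic on the reported shapes; the helper name `split_pos_embed_length` is hypothetical, and the interpretation (one class token for the checkpoint, two prefix tokens for the current model) is an assumption inferred from the numbers, not something confirmed in this thread.

```python
import math


def split_pos_embed_length(seq_len, max_prefix_tokens=2):
    """List (prefix_tokens, grid_side) pairs with seq_len == prefix + side**2.

    Hypothetical helper: it only checks which small prefix counts leave a
    perfect-square number of patch positions.
    """
    candidates = []
    for prefix in range(max_prefix_tokens + 1):
        patches = seq_len - prefix
        if patches <= 0:
            continue
        side = math.isqrt(patches)
        if side * side == patches:
            candidates.append((prefix, side))
    return candidates


# Lengths taken from the error message above.
print(split_pos_embed_length(785))  # [(1, 28)]
print(split_pos_embed_length(578))  # [(2, 24)]
```

If that reading is right, the checkpoint holds a 28 x 28 grid and the model expects a 24 x 24 grid with one extra prefix token, so the weights cannot be copied directly. A common remedy in ViT code bases is to resize the patch part of the positional embedding (for example by bicubic interpolation) before loading; whether that fits this model is something to verify.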

CUDA Out of Memory Errors w Batch Size of 1 on 16GB V100

Using the default FeatureExtractor settings for the HuggingFace port of YOLOS, I am consistently running into CUDA OOM errors on a 16GB V100 (even with a training batch size of 1).

I would like to train YOLOS on publaynet and ideally use 4-8 V100s.

Is there a way to lower the CUDA memory usage while training YOLOS besides reducing the batch size (whilst preserving the accuracy and leveraging the pretrained models)?

I see that other models (e.g. DiT) use image sizes of 224x224. However, is it fair to assume that such a small image size would not be appropriate for object detection because too much information is lost? In the DiT case, document image classification was the objective.
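
To make the trade-off concrete: self-attention stores a score matrix whose size grows with the square of the token count, so input resolution affects memory far more than linearly. The sketch below is a rough estimate only. The function name, the 16-pixel patch size, the 101 extra tokens (one class token plus 100 detection tokens), the 12 heads, and fp32 storage are all assumptions for illustration, and it ignores every other activation and the weights.

```python
def attention_scores_estimate(height, width, patch_size=16, extra_tokens=101,
                              heads=12, bytes_per_value=4):
    """Return (tokens, bytes) for one layer's attention score matrix, one image.

    All defaults are illustrative assumptions, not values read from the
    HuggingFace port.
    """
    tokens = (height // patch_size) * (width // patch_size) + extra_tokens
    return tokens, heads * tokens * tokens * bytes_per_value


small_tokens, small_bytes = attention_scores_estimate(224, 224)
large_tokens, large_bytes = attention_scores_estimate(800, 1344)

print(small_tokens, large_tokens)          # 297 4301
print(round(large_bytes / small_bytes))    # 210
```

Under these assumptions a detection-sized input needs roughly 200 times more memory for the attention scores than a 224 x 224 input, which is why lowering the input resolution, gradient checkpointing, or mixed precision are the usual levers besides batch size.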

confusion about pe


Hi, I want to know what eval_size, init_pe_size and mid_pe_size mean in the code. Thanks for your answer.

Implementation queries


Hi, thanks for open-sourcing the code base; it is a good way to learn about transformers. I have a few questions:

  1. Which function loads the dataset? "ConvertCocoPolysToMask" does not appear to be called explicitly anywhere.
  2. Each epoch, train_one_epoch() runs over the whole dataset and computes the losses internally, and the result is then evaluated. This is repeated for all 300 epochs. What is the idea behind this training procedure?
  3. Does YOLOS also provide panoptic segmentation? If so, can we get a pretrained model for it?

Thanks in advance

Small learning rate value


Thank you for your great work examining transformers for object detection. My question is: why do we start with a very small learning rate of 2.5 * 10e-5? The paper gives no explanation. My first guess is that you inherited the settings from the DETR framework.

Have you tried larger learning rates? To speed up training with more GPUs, is there a rule for scaling up the learning rate for YOLOS without losing performance?
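
One widely used heuristic for this, which is not confirmed for YOLOS anywhere in this thread, is the linear scaling rule: grow the learning rate by the same factor as the total batch size. A minimal sketch, where `scale_lr` and the batch sizes are hypothetical and the base value 1e-4 is only an example:

```python
def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling heuristic: keep lr / total_batch_size constant."""
    return base_lr * new_batch_size / base_batch_size


# Example: moving from 8 GPUs x 1 image to 16 GPUs x 1 image.
base_lr = 1e-4
print(scale_lr(base_lr, base_batch_size=8, new_batch_size=16))
```

The rule tends to need a warm-up phase at large batch sizes, and it is a starting point to validate rather than a guarantee that accuracy is preserved.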

Many thanks.

Train with custom dataset


To train on a custom dataset, is it enough to change the path from COCO to my own dataset?
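
Changing the path can only work if the custom dataset follows the COCO annotation layout that a COCO-style loader reads. The sketch below builds a minimal annotation file with the three standard COCO detection sections ("images", "annotations", "categories"). The helper name and the sample values are hypothetical, and whether the training code also needs its class count adjusted is something to check in the repository.

```python
import json


def make_coco_annotations(images, boxes, class_names):
    """Build a minimal COCO-style detection dict.

    images: list of (file_name, width, height)
    boxes:  list of (image_index, class_index, x, y, box_w, box_h),
            with (x, y) the top-left corner in pixels
    """
    return {
        "images": [
            {"id": i + 1, "file_name": name, "width": img_w, "height": img_h}
            for i, (name, img_w, img_h) in enumerate(images)
        ],
        "annotations": [
            {"id": k + 1, "image_id": img + 1, "category_id": cls + 1,
             "bbox": [x, y, box_w, box_h], "area": box_w * box_h, "iscrowd": 0}
            for k, (img, cls, x, y, box_w, box_h) in enumerate(boxes)
        ],
        "categories": [
            {"id": c + 1, "name": name} for c, name in enumerate(class_names)
        ],
    }


coco = make_coco_annotations(
    images=[("page_0001.jpg", 640, 480)],
    boxes=[(0, 1, 10, 20, 100, 50)],
    class_names=["text", "table"],
)
print(json.dumps(coco, indent=2))
```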

Limit on output bounding boxes

Hi, is there a limit somewhere on how many bounding boxes are predicted? I have a very crowded image, and when I run the models they consistently predict only 100 bounding boxes. Is this a limit that I can change, or is it something else that I am not noticing?

ImportError: cannot import name 'container_abcs' from 'torch._six'

models\layers\, line 6, import error in my env:
torch 1.9.0+cu111
torch-tb-profiler 0.2.0
torchaudio 0.9.0
torchvision 0.10.0+cu111
error msg like:
from torch._six import container_abcs
ImportError: cannot import name 'container_abcs' from 'torch._six' (C:\python39\lib\site-packages\

This link fixes it.
Thanks for your code.
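
A commonly used workaround for this error, assuming the import is only needed for abstract base classes such as Iterable, is to fall back to the standard library's collections.abc when torch._six no longer provides container_abcs. The `to_2tuple` helper below is a hypothetical example of typical usage, not code copied from this repository.

```python
from itertools import repeat

try:
    # Available in older PyTorch releases only.
    from torch._six import container_abcs
except ImportError:
    # Standard-library equivalent; also used when torch is not installed.
    import collections.abc as container_abcs


def to_2tuple(x):
    """Return x as a tuple if it is iterable, else repeat it twice."""
    if isinstance(x, container_abcs.Iterable):
        return tuple(x)
    return tuple(repeat(x, 2))


print(to_2tuple(16))        # (16, 16)
print(to_2tuple((8, 4)))    # (8, 4)
```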

Anyone else getting memory issues?


Hello! I wonder if anyone else is getting GPU memory errors even with the small model (yolos_small)?

Additional context

I am on a 4-GPU node with GeForce GTX 1080 Ti cards, 11 GB of memory each. I use batch size 1 as recommended. Both the distributed and non-distributed versions throw the same error.

The tiny model trains smoothly without any trouble.

If there are any tips to reduce memory usage that would be awesome as well!

Object Detection LB


Congratulations on publishing good work.
How does the performance compare with YOLOv5 and the other YOLO-series models, and where does it stand on the object detection leaderboard?

About Learning Rate Scheduler


Why is the learning rate scheduler stepped after each epoch instead of after each batch?

Won't the learning rate change too slowly (and behave inconsistently across dataset sizes)?
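
The difference between the two stepping policies can be sketched with a plain cosine schedule. This is not the repository's scheduler: the function names, the 100 steps per epoch, and the use of 1e-4, 1e-7 and 150 epochs as example values are assumptions for illustration.

```python
import math

EPOCHS = 150           # example value
STEPS_PER_EPOCH = 100  # hypothetical dataset size / batch size


def cosine_lr(progress, base_lr=1e-4, min_lr=1e-7):
    """Cosine decay from base_lr to min_lr for progress in [0, 1]."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


def lr_stepped_per_epoch(epoch, step):
    # The rate changes only at epoch boundaries; `step` is ignored.
    return cosine_lr(epoch / EPOCHS)


def lr_stepped_per_iteration(epoch, step):
    # The rate changes on every batch.
    done = epoch * STEPS_PER_EPOCH + step
    return cosine_lr(done / (EPOCHS * STEPS_PER_EPOCH))


for step in (0, 50, 99):
    print(step, lr_stepped_per_epoch(0, step), lr_stepped_per_iteration(0, step))
```

With many epochs the two curves stay close, because each per-epoch jump is small; the gap matters more for short schedules or for a warm-up that should finish within the first epoch.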

[URGENT] Eval results are much lower than what's reported

Hi, thanks for the excellent work!

I followed the instructions in the README to evaluate the models provided in your repo. However, the APs I got for yolos_ti.pth, yolos_s_200_pre.pth, yolos_s_300_pre.pth, yolos_s_dWr.pth, and yolos_base.pth are 28.7, 12.5, 12.7, 13.2, and 13.8, respectively. While yolos_ti.pth matches the performance in your paper and log, the other four models are significantly lower than expected.
Any idea why this would happen? Thanks in advance!

For example, when evaluating the base model, I ran

python  -m torch.distributed.launch --nproc_per_node=8 --use_env --coco_path ../data/coco --batch_size 2 --backbone_name base --eval --eval_size 800 --init_pe_size 800 1344 --mid_pe_size 800 1344 --resume ../trained_weights/yolos/yolos_base.pth

and expected to obtain 42.0 AP, as shown in your paper and log. However, the result is only 13.8 AP.

The complete evaluation output is shown below.

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
| distributed init (rank 0): env://
| distributed init (rank 2): env://
| distributed init (rank 3): env://
| distributed init (rank 1): env://
| distributed init (rank 6): env://
| distributed init (rank 5): env://
| distributed init (rank 7): env://
| distributed init (rank 4): env://
Namespace(backbone_name='base', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='../data/coco', dataset_file='coco', decay_rate=0.1, det_token_num=100, device='cuda', dice_loss_coef=1, dist_backend='nccl', dist_url='env://', distributed=True, eos_coef=0.1, epochs=150, eval=True, eval_size=800, giou_loss_coef=2, gpu=0, init_pe_size=[800, 1344], lr=0.0001, lr_backbone=1e-05, lr_drop=100, mid_pe_size=[800, 1344], min_lr=1e-07, num_workers=2, output_dir='', pre_trained='', rank=0, remove_difficult=False, resume='../trained_weights/yolos/yolos_base.pth', sched='warmupcos', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, use_checkpoint=False, warmup_epochs=0, warmup_lr=1e-06, weight_decay=0.0001, world_size=8)
Has mid pe
number of params: 127798368
loading annotations into memory...
Done (t=23.52s)
creating index...
index created!
loading annotations into memory...
Done (t=3.00s)
creating index...
index created!
Test:  [  0/313]  eta: 0:39:39  class_error: 29.21  loss: 2.1542 (2.1542)  loss_bbox: 0.4245 (0.4245)  loss_ce: 0.7761 (0.7761)  loss_giou: 0.9535 (0.9535)  cardinality_error_unscaled: 5.3750 (5.3750)  class_error_unscaled: 29.2100 (29.2100)  loss_bbox_unscaled: 0.0849 (0.0849)  loss_ce_unscaled: 0.7761 (0.7761)  loss_giou_unscaled: 0.4768 (0.4768)  time: 7.6030  data: 0.5298  max mem: 3963
Test:  [256/313]  eta: 0:00:26  class_error: 17.22  loss: 2.5668 (2.6435)  loss_bbox: 0.5639 (0.5792)  loss_ce: 0.8598 (0.8386)  loss_giou: 1.1904 (1.2257)  cardinality_error_unscaled: 3.8750 (4.2398)  class_error_unscaled: 28.7817 (28.6160)  loss_bbox_unscaled: 0.1128 (0.1158)  loss_ce_unscaled: 0.8598 (0.8386)  loss_giou_unscaled: 0.5952 (0.6129)  time: 0.4406  data: 0.0137  max mem: 10417
Test:  [312/313]  eta: 0:00:00  class_error: 16.29  loss: 2.8745 (2.6626)  loss_bbox: 0.5974 (0.5833)  loss_ce: 0.8791 (0.8461)  loss_giou: 1.3012 (1.2332)  cardinality_error_unscaled: 3.8750 (4.2370)  class_error_unscaled: 26.2946 (28.7748)  loss_bbox_unscaled: 0.1195 (0.1167)  loss_ce_unscaled: 0.8791 (0.8461)  loss_giou_unscaled: 0.6506 (0.6166)  time: 0.4251  data: 0.0134  max mem: 10417
Test: Total time: 0:02:25 (0.4663 s / it)
Averaged stats: class_error: 16.29  loss: 2.8745 (2.6626)  loss_bbox: 0.5974 (0.5833)  loss_ce: 0.8791 (0.8461)  loss_giou: 1.3012 (1.2332)  cardinality_error_unscaled: 3.8750 (4.2370)  class_error_unscaled: 26.2946 (28.7748)  loss_bbox_unscaled: 0.1195 (0.1167)  loss_ce_unscaled: 0.8791 (0.8461)  loss_giou_unscaled: 0.6506 (0.6166)
Accumulating evaluation results...
DONE (t=15.78s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.13810
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.26766
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.11832
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.05146
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.13066
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.23324
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.18115
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.29001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.31740
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.12520
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.31154
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.49446

Where're the pre-trained models?

In your paper, you say that your pre-trained models are in this repo, but I can't find them! I have looked everywhere and can't find them anywhere else, and I have no time to train them myself.
If you can give them to me privately, you can send them to [email protected].

Input size can not be dynamic?

I tried something like this:

 python --resume weights/yolos_s_dWr.pth --data_file ../yolov7/images/COCO_val2014_000000001856.jpg --mid_pe_size 800 864 --init_pe_size 800 864
Not using distributed mode
Namespace(backbone_name='small_dWr', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path=None, data_file='../yolo/images/COCO_val2014_000000001856.jpg', dataset_file='coco', decay_rate=0.1, det_token_num=100, device='cuda', dice_loss_coef=1, dist_url='env://', distributed=False, eos_coef=0.1, epochs=150, eval=False, eval_size=800, giou_loss_coef=2, init_pe_size=[800, 864], lr=0.0001, lr_backbone=1e-05, lr_drop=100, mid_pe_size=[800, 864], min_lr=1e-07, num_workers=2, output_dir='', pre_trained='', remove_difficult=False, resume='weights/yolos_s_dWr.pth', sched='warmupcos', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, use_checkpoint=False, warmup_epochs=0, warmup_lr=1e-06, weight_decay=0.0001, world_size=1)


torch1.8/lib/python3.8/site-packages/torch/nn/modules/", line 1223, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Detector:
	size mismatch for backbone.pos_embed: copying a param with shape torch.Size([1, 1829, 330]) from checkpoint, the shape in current model is torch.Size([1, 2801, 330]).
	size mismatch for backbone.mid_pos_embed: copying a param with shape torch.Size([13, 1, 1829, 330]) from checkpoint, the shape in current model is torch.Size([13, 1, 2801, 330]).
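
The two lengths in this error are consistent with a positional embedding made of one class token, one position per 16 x 16 patch, and det_token_num=100 detection tokens. That layout is an assumption inferred from the numbers (the helper name is hypothetical), but under it the arithmetic works out exactly:

```python
def yolos_pos_embed_length(height, width, patch_size=16,
                           det_token_num=100, cls_tokens=1):
    """Expected pos_embed length under the assumed token layout."""
    patches = (height // patch_size) * (width // patch_size)
    return cls_tokens + patches + det_token_num


# Size passed on the command line above: matches the current model.
print(yolos_pos_embed_length(800, 864))  # 2801
# A 512 x 864 size reproduces the length stored in the checkpoint.
print(yolos_pos_embed_length(512, 864))  # 1829
```

If the assumption holds, the checkpoint was saved with a 512 x 864 positional-embedding size, so passing `--init_pe_size 512 864` (and a matching `--mid_pe_size`) may let the weights load; this is a suggestion to try, not a confirmed fix.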

Train problem with VOC


I converted the PASCAL VOC dataset to COCO format, but when I trained yolos-tiny for 150 epochs with the pre-trained weights, the results were very bad. I have no idea why.

Additional context

Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.039
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.084
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.032
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.005
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.059
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.124
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.234
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.283
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.004
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.066
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.402
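
One thing worth ruling out after a VOC-to-COCO conversion is the box convention: VOC stores corner coordinates (xmin, ymin, xmax, ymax), while COCO stores the top-left corner plus width and height. Writing corners into a COCO "bbox" field yields oversized boxes and very low AP. This is a possible cause to check, not a diagnosis of the run above, and the function name is hypothetical.

```python
def voc_to_coco_bbox(xmin, ymin, xmax, ymax):
    """Convert VOC corner coordinates to COCO [x, y, width, height]."""
    if xmax <= xmin or ymax <= ymin:
        raise ValueError("degenerate box: max must exceed min")
    return [xmin, ymin, xmax - xmin, ymax - ymin]


print(voc_to_coco_bbox(48, 240, 195, 371))  # [48, 240, 147, 131]
```

The category ids written to the annotation file should also match the ids the training code expects.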

About learning rate scheduler


From your code, I only see the learning rate updated after every epoch:


Line 217 in 2e10dc4


Looking at your logs, it also seems to confirm that.
Did you use a warm-up learning rate scheduler for the first few iterations?

ONNX Export


Can we export YOLOS models to ONNX format?

Additional context

I want to deploy a YOLOS model on ONNX Runtime to reduce deployment cost and run it via Docker on the NVIDIA Jetson series.

Adding YOLOS to HuggingFace Transformers

Hi YOLOS team :)

I've implemented YOLOS as a fork of 🤗 HuggingFace Transformers, and I'm going to add it soon to the library (see huggingface/transformers#16848). Here's a notebook that illustrates inference with it:

The reason I'm adding YOLOS is because I really like the simplicity of it, compared to very complex frameworks such as Faster R-CNN and Mask R-CNN. I've added DETR previously also because it simplifies the task of object detection a lot.

As you may or may not know, any model on the HuggingFace hub has its own Github repository. E.g. the YOLOS-small checkpoint can be found here: If you check the "files and versions" tab, it includes the weights. The model hub uses git-LFS (large file storage) to use Git with large files such as model weights. This means that any model has its own Git commit history!

A model card can also be added to the repo, which is just a README.

Are you interested in creating an organization on the hub, such that we can store all model checkpoints there (rather than under my user name)?

Let me know!

Kind regards,

ML Engineer @ HuggingFace

Control the patches


Hello, thank you for this great contribution. I would like to know whether your architecture lets us control which patches are fed to YOLOS. I already have the RoI (region of interest) for each image in my dataset, and I want to train the model only on these regions. Can we do this by controlling the patches?

