a-pytorch-tutorial-to-object-detection's Issues

cxcy_to_gcxgcy function returns nan values

I traced the nan in the loss function back to its source: it comes from the cxcy_to_gcxgcy function.
The division there seems to produce the nan values. Please suggest a fix.
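
One possible cause, offered as an assumption rather than a confirmed diagnosis: if a ground-truth box has zero width or height, the logarithm of the size ratio in the encoding is -inf, which can surface as nan or inf in the loss. A minimal sketch for spotting such boxes before training follows. `find_degenerate_boxes` and `min_size` are hypothetical names, and boxes are assumed to be (xmin, ymin, xmax, ymax) tuples.

```python
def find_degenerate_boxes(boxes, min_size=1e-6):
    """Return indices of boxes whose width or height is not positive.

    boxes: iterable of (xmin, ymin, xmax, ymax) tuples (assumed format).
    min_size: hypothetical threshold below which a side counts as empty.
    """
    bad = []
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        if (xmax - xmin) < min_size or (ymax - ymin) < min_size:
            bad.append(i)
    return bad


# Toy data: the second box has zero width, the third has zero height.
boxes = [(10, 20, 110, 220), (50, 60, 50, 90), (5, 5, 30, 5)]
print(find_degenerate_boxes(boxes))  # [1, 2]
```

Running a check like this over the annotation files would show whether any image needs to be filtered out or fixed.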

RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405

When I ran
python3 train

I got the errors below:
Loaded base model.

/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument
Traceback (most recent call last):
  File "train.py", line 234, in <module>
    main()
  File "train.py", line 101, in main
    epoch=epoch)
  File "train.py", line 151, in train
    predicted_locs, predicted_scores = model(images) # (N, 8732, 4), (N, 8732, n_classes)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/gaoya/disk/Downloads/a-PyTorch-Tutorial-to-Object-Detection-master/model.py", line 353, in forward
    conv4_3_feats, conv7_feats = self.base(image) # (N, 512, 38, 38), (N, 1024, 19, 19)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/gaoya/disk/Downloads/a-PyTorch-Tutorial-to-Object-Detection-master/model.py", line 58, in forward
    out = F.relu(self.conv1_1(image)) # (N, 64, 300, 300)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405

CUDA: 9.0
PyTorch: 1.0.1
OS: Ubuntu 18.04

PyTorch works with the GPU in other simple examples on this machine. What is going wrong here?

MultiBox loss goes to infinity

Hi,
I hope you can help me solve the following issue.
During the first training steps (the first half of the batch) the loss was computed correctly, but then the MultiBox loss suddenly became infinite (I get inf).
Does anyone know what causes this?
Thank you so much.

Bounding box explanation

The tutorial says: "But pixel values are next to useless if we don't know the actual dimensions of the image."

Pixel values and their representation as fractions of the image's dimensions are equivalent; they carry the same amount of information.
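
To make the relationship between the two representations concrete, here is a small sketch of the conversion in both directions. The function names are hypothetical and the image size is an illustrative assumption. Note that each direction takes the image width and height as an input.

```python
def to_fractional(box, width, height):
    """Convert a pixel box (xmin, ymin, xmax, ymax) to fractions of the image size."""
    xmin, ymin, xmax, ymax = box
    return (xmin / width, ymin / height, xmax / width, ymax / height)


def to_pixels(box, width, height):
    """Convert a fractional box back to pixel coordinates."""
    xmin, ymin, xmax, ymax = box
    return (xmin * width, ymin * height, xmax * width, ymax * height)


# Illustrative image size and box (assumed values).
width, height = 500, 375
pixel_box = (50, 75, 250, 300)

fractional_box = to_fractional(pixel_box, width, height)
restored_box = to_pixels(fractional_box, width, height)
print(fractional_box)
print(restored_box)
```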

Calculate jaccard overlap for two same sets in model.py?

In model.py, Line 478

overlap = find_jaccard_overlap(class_decoded_locs, class_decoded_locs) # (n_qualified, n_min_score)

Aren't these the same set passed twice? I thought the purpose of the function is to calculate the overlap between two different sets, with shapes (n_qualified, 4) and (n_min_score, 4).
Am I missing something here?
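
One plausible reading, stated as an assumption: in non-maximum suppression every candidate box is compared with every other candidate of the same class, so passing one set twice yields the pairwise overlap matrix that the suppression step needs. The sketch below illustrates that idea in plain Python. `iou` and `nms` are illustrative helpers, not the tutorial's functions, and the boxes are assumed to be sorted by descending score.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, max_overlap=0.5):
    """Return indices of boxes kept after suppression (boxes sorted by score)."""
    n = len(boxes)
    # Overlap of the set with itself: an (n, n) matrix of pairwise IoUs.
    overlap = [[iou(boxes[i], boxes[j]) for j in range(n)] for i in range(n)]
    suppress = [False] * n
    for i in range(n):
        if suppress[i]:
            continue
        for j in range(n):
            if j != i and overlap[i][j] > max_overlap:
                suppress[j] = True
    return [i for i in range(n) if not suppress[i]]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(nms(boxes))  # [0, 2]
```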

Loss calculations

Dear Sir, I am a bit confused about the loss calculations. These two snippets from the train log would help me better explain the confusion.

Epoch: [7][0/250] Batch Time 1.087 (1.087) Data Time 0.641 (0.641) Loss 4.4972 (4.4972)
Epoch: [7][200/250] Batch Time 0.347 (0.349) Data Time 0.000 (0.003) Loss 4.4606 (4.2053)
[0/313] Batch Time 0.727 (0.727) Loss 4.6890 (4.6890)
[200/313] Batch Time 0.137 (0.140) Loss 5.0137 (5.0349)

  • LOSS - 5.032

Epoch: [138][0/250] Batch Time 0.931 (0.931) Data Time 0.560 (0.560) Loss 0.1863 (0.1863)
Epoch: [138][200/250] Batch Time 0.344 (0.349) Data Time 0.000 (0.003) Loss 0.1613 (0.1790)
[0/313] Batch Time 0.697 (0.697) Loss 10.3283 (10.3283)
[200/313] Batch Time 0.137 (0.141) Loss 11.1063 (10.4203)

  • LOSS - 10.459

At epoch 7, the loss shown beside 'Data Time' is about 4, while the average loss is about 5.
At epoch 138, the loss shown beside 'Data Time' is about 0.2, while the average loss is about 10.
Judging by the loss shown beside 'Data Time', my model is learning well, but judging by the average loss, the model seems to be diverging.
Could you please advise?
Thanks
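
The log lines appear to print each metric as `current (running average)`, since at batch 0 the two numbers are identical. Under that reading (an assumption), the lines starting with `Epoch:` are training batches and the others are validation batches, so a training loss near 0.18 next to a validation loss near 10.4 would point to overfitting rather than to two different ways of computing one loss. A minimal sketch of such a running-average meter follows; the class name `AverageMeter` is illustrative.

```python
class AverageMeter:
    """Track the most recent value and the running average of a metric."""

    def __init__(self):
        self.val = 0.0    # most recent value
        self.sum = 0.0    # weighted sum of all values seen
        self.count = 0    # total weight (e.g. number of samples)
        self.avg = 0.0    # running average

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


# Toy values, not taken from the log above.
meter = AverageMeter()
for batch_loss in (4.0, 5.0, 6.0):
    meter.update(batch_loss)
    print("Loss {:.4f} ({:.4f})".format(meter.val, meter.avg))
```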

about calculate_mAP

calculate_mAP gives a misleading result in some special cases,
so I would like to suggest a small addition to the calculate_mAP function.

For example, when running this code

import torch
from utils import calculate_mAP  # calculate_mAP is defined in the tutorial's utils.py

det_boxes = [torch.tensor([[50.8456, 11.0575, 497.9483, 319.0857]])]
det_labels = [torch.tensor([1])]
det_scores = [torch.tensor([1])]
true_boxes = [torch.tensor([[67.9887, 155.5200, 276.2039, 240.4080],
                            [11.3314, 7.7760, 498.5836, 322.7040]])]
true_labels = [torch.tensor([5, 1])]
true_difficulties = [torch.tensor([0, 0])]

calculate_mAP(det_boxes, det_labels, det_scores, true_boxes, true_labels, true_difficulties)

we get the following output:

({'person': 1.0,
  'bird': 0.0,
  'cat': 0.0,
  'cow': 0.0,
  'dog': 0.0,
  'horse': 0.0,
  'sheep': 0.0,
  'aeroplane': 0.0,
  'bicycle': 0.0,
  'boat': 0.0,
  'bus': 0.0,
  'car': 0.0,
  'motorbike': 0.0,
  'train': 0.0,
  'bottle': 0.0,
  'chair': 0.0,
  'diningtable': 0.0,
  'pottedplant': 0.0,
  'sofa': 0.0,
  'tvmonitor': 0.0},
 0.05000000074505806)

The mAP is 0.05 here, but I think it should be 0.5.

I would like to suggest adding this code

true_labels = torch.tensor(true_labels, dtype=torch.long)
average_precisions = average_precisions[true_labels.unique()-1]

at
https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/utils.py#L271
so that it reads:

# Calculate Mean Average Precision (mAP)
true_labels = torch.tensor(true_labels, dtype=torch.long)
average_precisions = average_precisions[true_labels.unique()-1]
mean_average_precision = average_precisions.mean().item()

Then the output is
({'aeroplane': 1.0, 'bicycle': 0.0}, 0.5)

With this change, the function also gives a sensible result when we evaluate on a small dataset that does not contain every class.
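
The proposal amounts to averaging the per-class AP only over classes that occur in the ground truth. A minimal sketch of that idea in plain Python follows. The helper name is hypothetical, and class ids are assumed to run from 1 to 20 as in the example above. Whether this matches the intended evaluation protocol is a separate question, since a full test set normally contains every class.

```python
def mean_ap_over_present_classes(ap_per_class, true_labels):
    """Average AP over the classes that appear in the ground-truth labels.

    ap_per_class: dict mapping class id -> average precision.
    true_labels: flat list of ground-truth class ids.
    """
    present = sorted(set(true_labels))
    selected = [ap_per_class[c] for c in present]
    return sum(selected) / len(selected)


# Mirror of the example: 20 classes, only class 1 is detected correctly.
ap_per_class = {c: 0.0 for c in range(1, 21)}
ap_per_class[1] = 1.0
true_labels = [5, 1]

print(sum(ap_per_class.values()) / len(ap_per_class))            # 0.05
print(mean_ap_over_present_classes(ap_per_class, true_labels))   # 0.5
```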

CUDA memory increasing when running eval.py

Batch size = 2
13667*2 images

D:\Anaconda3\envs\pytorch_envs\python.exe E:/NNDL_pytorch/SSD/eval.py
Evaluating: 1%| | 121/13667 [06:05<18:15:33, 4.85s/it]Traceback (most recent call last):
File "E:/NNDL_pytorch/SSD/eval.py", line 117, in
evaluate(test_loader, model)
File "E:/NNDL_pytorch/SSD/eval.py", line 92, in evaluate
top_k=200)
File "E:\NNDL_pytorch\SSD\detect_modules.py", line 74, in detect_objects
overlap = find_jaccard_overlap(class_decoded_locs, class_decoded_locs) # (n_qualified, n_qualified)
File "E:\NNDL_pytorch\SSD\utils.py", line 697, in find_jaccard_overlap
intersection = find_intersection(set_1, set_2) # (n1, n2)
File "E:\NNDL_pytorch\SSD\utils.py", line 681, in find_intersection
intersection_dims = torch.clamp(upper_bounds - lower_bounds, min=0) # (n1, n2, 2)
RuntimeError: CUDA out of memory. Tried to allocate 776.00 MiB (GPU 0; 8.00 GiB total capacity; 2.90 GiB already allocated; 32.01 MiB free; 604.85 MiB cached)
Evaluating: 1%| | 121/13667 [06:07<11:25:33, 3.04s/it]
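
The traceback fails while building the (n1, n2, 2) intermediate in find_intersection, and both inputs are the same set of candidate boxes, so this allocation grows quadratically with the number of candidates. A rough size estimate is sketched below. The float32 element size is an assumption, the helper name is hypothetical, and 8732 is used only as an illustrative upper bound taken from the prior count quoted in the shape comments elsewhere on this page.

```python
def pairwise_tensor_mib(n1, n2, last_dim=2, bytes_per_element=4):
    """Approximate size in MiB of an (n1, n2, last_dim) tensor."""
    return n1 * n2 * last_dim * bytes_per_element / (1024 ** 2)


for n in (1000, 4000, 8732):
    print(n, round(pairwise_tensor_mib(n, n), 1), "MiB")
```

If the estimate is in the right range, reducing the number of candidates that reach this step (for example with a higher minimum score) would shrink the allocation considerably.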

Expected object of scalar type Byte but got scalar type Bool for argument #2 'other'

@sgrvinod
I tried to run the code as instructed in the discussion. Training completed, but inference raises the error below. I didn't change anything in the code and simply ran:

python detect.py

Does anyone have a suggestion? I am using torchvision 0.4.0.

This particular line (model.py file) is raising an error.

suppress = torch.max(suppress, overlap[box] > max_overlap)

error

Expected object of scalar type Byte but got scalar type Bool for argument #2 'other'
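
The message suggests that `suppress` is a uint8 (Byte) tensor while the comparison `overlap[box] > max_overlap` produces a bool tensor in newer PyTorch versions. A minimal sketch of one way to keep both sides bool follows. The overlap values are toy data, and the loop is an illustration of the suppression step rather than the tutorial's exact code.

```python
import torch

max_overlap = 0.5

# Toy overlap matrix for 3 boxes sorted by descending score (illustrative values).
overlap = torch.tensor([[1.0, 0.7, 0.1],
                        [0.7, 1.0, 0.2],
                        [0.1, 0.2, 1.0]])

# Keep the suppression mask as bool so it matches the dtype of the comparison.
suppress = torch.zeros(overlap.size(0), dtype=torch.bool)

for box in range(overlap.size(0)):
    if suppress[box]:
        continue
    # Logical OR takes the place of torch.max on mixed uint8/bool tensors.
    suppress = suppress | (overlap[box] > max_overlap)
    # A box never suppresses itself.
    suppress[box] = False

# With a bool mask, invert with ~ instead of computing 1 - suppress.
keep = ~suppress
print(keep.tolist())  # [True, False, True]
```

If the surrounding code indexes with `1 - suppress`, that expression would also need to become `~suppress` once the mask is bool.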

ModuleNotFoundError: No module named 'datasets'

OS: Windows 10, 64-bit
Python: 3.6.6
PyTorch: 1.0.1

I have created the PASCAL VOC 2007 and 2012 JSON file list. When I ran train.py, I got the errors below:
553433881it [19:06, 482647.72it/s]

Loaded base model.

C:\Applications\WPy-3661\python-3.6.6.amd64\lib\site-packages\torch\nn\_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Applications\WPy-3661\python-3.6.6.amd64\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Applications\WPy-3661\python-3.6.6.amd64\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
ModuleNotFoundError: No module named 'datasets'
Traceback (most recent call last):
  File "D:\Files\python\MachineLearning\pytorch\a-PyTorch-Tutorial-to-Object-Detection-master\train.py", line 234, in <module>
    main()
  File "D:\Files\python\MachineLearning\pytorch\a-PyTorch-Tutorial-to-Object-Detection-master\train.py", line 101, in main
    epoch=epoch)
  File "D:\Files\python\MachineLearning\pytorch\a-PyTorch-Tutorial-to-Object-Detection-master\train.py", line 142, in train
    for i, (images, boxes, labels, _) in enumerate(train_loader):
  File "C:\Applications\WPy-3661\python-3.6.6.amd64\lib\site-packages\torch\utils\data\dataloader.py", line 819, in __iter__
    return _DataLoaderIter(self)
  File "C:\Applications\WPy-3661\python-3.6.6.amd64\lib\site-packages\torch\utils\data\dataloader.py", line 560, in __init__
    w.start()
  File "C:\Applications\WPy-3661\python-3.6.6.amd64\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Applications\WPy-3661\python-3.6.6.amd64\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Applications\WPy-3661\python-3.6.6.amd64\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Applications\WPy-3661\python-3.6.6.amd64\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Applications\WPy-3661\python-3.6.6.amd64\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe
Can anyone figure out what's wrong?
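
Both tracebacks pass through the multiprocessing spawn machinery that Windows uses to start DataLoader worker processes. A commonly reported workaround, given here as an assumption rather than a confirmed fix for this case, is to load data in the main process on Windows by passing `num_workers=0` to the DataLoader, and to keep the script's entry point under `if __name__ == '__main__':`. The helper below is a hypothetical sketch of the first part.

```python
import sys


def pick_num_workers(requested, platform=None):
    """Return 0 workers on Windows, otherwise the requested number.

    platform: optional override for sys.platform, useful for testing.
    """
    platform = sys.platform if platform is None else platform
    return 0 if platform.startswith("win") else requested


print(pick_num_workers(4, platform="win32"))  # 0
print(pick_num_workers(4, platform="linux"))  # 4
```

The returned value would then be passed as the `num_workers` argument when the DataLoader is created.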

train.py just stops working

Hi,
Is there any reason why the train.py script just dies when I use CUDA as the device, and only then? With the CPU as the device it works (but slowly, of course). I have two GPUs and I've tried several approaches, such as specifying the exact CUDA device. The strange thing is that train.py doesn't throw any error or warning message; it just dies.

json File

There are no _images.json files for the dataset class.

Batch size in train.py

Hello, first of all I want to say this is the most useful tutorial about object detection with SSD frameworks I've read so far.
In line 20 of train.py, batch_size = 8 is defined and then passed to the DataLoader in line 71. But later, in lines 78 and 79, a batch size of 32 is hardcoded in the formula. Shouldn't the variable batch_size be used here, or am I missing something?
Thanks in advance!
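
To see how much the choice matters, the sketch below converts an iteration budget into epochs for both batch sizes. All numbers are illustrative assumptions (dataset size, iteration count), and `iterations_to_epochs` is a hypothetical helper rather than the tutorial's code. One possible reading, not confirmed in this thread, is that 32 is the batch size the iteration schedule was originally defined for.

```python
def iterations_to_epochs(iterations, dataset_size, batch_size):
    """Number of whole epochs needed to cover a given number of iterations."""
    batches_per_epoch = dataset_size // batch_size
    return iterations // batches_per_epoch


dataset_size = 16551   # illustrative number of training images (assumption)
iterations = 120000    # illustrative iteration budget (assumption)

print(iterations_to_epochs(iterations, dataset_size, 32))
print(iterations_to_epochs(iterations, dataset_size, 8))
```

With these numbers the two choices differ by about a factor of four in the resulting epoch count, which is why the question is worth settling.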

Wrong scale values

These are the values used in

obj_scales = {'conv4_3': 0.1,
                      'conv7': 0.2,
                      'conv8_2': 0.375,
                      'conv9_2': 0.55,
                      'conv10_2': 0.725,
                      'conv11_2': 0.9}

The paper uses these values:
[image]
The proper values are:
0.2, 0.34, 0.48, 0.62, 0.76, 0.9
Code to check:

import numpy as np

m = 6  # number of feature maps
k = np.arange(1, m + 1, 1)
scales = 0.2 + (0.9 - 0.2) / (m - 1) * (k - 1)
print(scales)

L1Loss or SmoothL1Loss?

Hi,
Thanks for your SSD tutorial; as far as I know, this is the most detailed SSD implementation.
However, I just noticed that you used L1Loss instead of SmoothL1Loss. Is there any special reason?
Cheers,
P
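
For reference, the two losses differ only for small errors. The sketch below implements both element-wise in plain Python. The `beta` threshold of 1.0 is an assumed default, and the functions are illustrations rather than the library implementations.

```python
def l1(x):
    """Absolute error."""
    return abs(x)


def smooth_l1(x, beta=1.0):
    """Quadratic below beta, linear above it."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta


for err in (0.1, 0.5, 2.0):
    print(err, l1(err), smooth_l1(err))
```

For errors above the threshold the two differ by a constant, so their gradients agree there; below it, the smooth variant has a gradient that shrinks toward zero.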

BrokenPipeError: [Errno 32] Broken pipe

When I try to run either train.py or eval.py, I get this error, and I couldn't find an answer to it. Any help?

runfile('D:/my college books/SEMESTERS/8 SENIOR2/Spring20/a-PyTorch-Tutorial-to-Object-Detection-master/eval.py', wdir='D:/my college books/SEMESTERS/8 SENIOR2/Spring20/a-PyTorch-Tutorial-to-Object-Detection-master')
Reloaded modules: utils, model
Evaluating: 0%| | 0/78 [00:00<?, ?it/s]Traceback (most recent call last):

File "D:\my college books\SEMESTERS\8 SENIOR2\Spring20\a-PyTorch-Tutorial-to-Object-Detection-master\eval.py", line 88, in
evaluate(test_loader, model)

File "D:\my college books\SEMESTERS\8 SENIOR2\Spring20\a-PyTorch-Tutorial-to-Object-Detection-master\eval.py", line 54, in evaluate
for i, (images, boxes, labels, difficulties) in enumerate(tqdm(test_loader, desc='Evaluating')):

File "D:\Programs\anaconda_trial2\lib\site-packages\tqdm\std.py", line 1107, in iter
for obj in iterable:

File "D:\Programs\anaconda_trial2\lib\site-packages\torch\utils\data\dataloader.py", line 279, in iter
return _MultiProcessingDataLoaderIter(self)

File "D:\Programs\anaconda_trial2\lib\site-packages\torch\utils\data\dataloader.py", line 719, in init
w.start()

File "D:\Programs\anaconda_trial2\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)

File "D:\Programs\anaconda_trial2\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)

File "D:\Programs\anaconda_trial2\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)

File "D:\Programs\anaconda_trial2\lib\multiprocessing\popen_spawn_win32.py", line 89, in init
reduction.dump(process_obj, to_child)

File "D:\Programs\anaconda_trial2\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)

BrokenPipeError: [Errno 32] Broken pipe

Tensor mismatch error

I tried to run on my custom dataset (2 classes). After a few epochs it started giving this error:

RuntimeError: The size of tensor a (0) must match the size of tensor b (4) at non-singleton dimension 1

Error and warning when running eval.py and detect.py

Error when running eval.py and detect.py:
File "D:\NNDL_pytorch\a-PyTorch-Tutorial-to-Object-Detection-master\model.py", line 499, in detect_objects
suppress = torch.max(suppress, overlap[box] > max_overlap)
RuntimeError: Expected object of scalar type Byte but got scalar type Bool for argument #2 'other'

Is there a problem with the use of torch.max()?

Warning when running eval.py and detect.py:
G:\NNDL\Anaconda3\envs\SSD\lib\site-packages\torch\serialization.py:453: SourceChangeWarning: source code of class 'model.SSD300' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
G:\NNDL\Anaconda3\envs\SSD\lib\site-packages\torch\serialization.py:453: SourceChangeWarning: source code of class 'model.VGGBase' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)

Can someone help me solve this?
I use Python 3.7.6 and PyTorch 1.2.0.
Thanks a lot!

val and test data

Hi,
Thanks for this code. I noticed that the result can be overfitted if the val and test sets are the same data. I ran 200 epochs and the best epoch was number 146. I then used the best epoch's checkpoint to evaluate and found that the mAP is very close to the one in your instructions. The point is that, since I am using the same data for testing and validation, the mAP is not reliable because it might reflect overfitting.

Thanks!

Files are missing

Please upload all the required files or mention where to download and where to place.

RuntimeError: expected device cpu but got device cuda:0 when I run train.py

Hi everyone !!!

I'm trying to run this code to train on my own dataset but I'm having this issue.

Here is my problem :
Traceback (most recent call last):
File "train.py", line 232, in
main()
File "train.py", line 101, in main
epoch=epoch)
File "train.py", line 153, in train
loss = criterion(predicted_locs, predicted_scores, boxes, labels) # scalar
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in call
result = self.forward(*input, **kwargs)
File "/content/drive/My Drive/Model/SSD/model.py", line 593, in forward
true_locs[i] = cxcy_to_gcxgcy(xy_to_cxcy(boxes[i][object_for_each_prior]), self.priors_cxcy) # (8732, 4)
File "/content/drive/My Drive/Model/SSD/utils.py", line 312, in cxcy_to_gcxgcy
return torch.cat([(cxcy[:, :2] - priors_cxcy[:, :2]) / (priors_cxcy[:, 2:] / 10), # g_c_x, g_c_y
RuntimeError: expected device cuda:0 but got device cpu

So I suppose one of the two elements (cxcy or priors_cxcy) is not on the CUDA device. I looked through the code for those elements, but it seems that every element uses .cuda().

If someone has already had this problem, please help me. Any thoughts are welcome!

Facing error in eval.py

The execution of the following line in eval.py:
predicted_locs, predicted_scores = model(images)
throws an error:

AttributeError: 'Conv2d' object has no attribute 'padding_mode'

Any help please?

IndexError: too many indices for tensor of dimension 1

File "/home/arsene/Project/mini-project2/Projekt/utils.py", line 350, in find_intersection
lower_bounds = torch.max(set_1[:, :2].unsqueeze(1), set_2[:, :2].unsqueeze(0)) # (n1, n2, 2)
IndexError: too many indices for tensor of dimension 1.

This happens when I try to load my own dataset, which uses the same format.

good detection but poor eval (on the same dataset)

I trained the SSD on a custom image dataset (1 object per image and 3 different classes).

I evaluated it on similar (but new) images: mAP = 0.96,
and on really different images: mAP = 0.31.

The problem appears when I use detect.py on both eval datasets:
both give me really good visual results (with almost no errors).

Here are the 3 different poses I'm able to detect and classify (with top_k = 1):
[image: poses_detection]

This is the evaluation in the different cases:

easy dataset, top_k = 200 (AP = 0.9):
[image: PR_lying_down_easy_dataset_topk200]

easy dataset, top_k = 1 (AP = 0.9):
[image: PR_lying_down_easy_dataset_topk1]

hard dataset, top_k = 200 (AP = 0.2):
[image: PR_lying_down_hard_dataset_topk200]

hard dataset, top_k = 1 (AP = 0.2):
[image: PR_lying_down_hard_dataset_topk1]

In all these cases, when I visualize images with top_k = 1, I see zero errors (visual AP = 1).

With the debugger, I figured out that the "lying_down" class produces a lot of false positives, which explains the low mAP.
So it comes from a poor Jaccard overlap (< 0.5), but I don't understand why, because when I run detect.py the predicted bounding boxes are visually really accurate.
So I kept only the top_k = 1 object per image in eval.py, but nothing changed: still perfect visual results and still a poor mAP.
I don't understand why setting top_k = 1 in eval.py doesn't remove the false positives. Tuning these parameters changes nothing.

loc_normalized_std and loc_normalized_mean problem

Hi, I'm reading your code and I'm a little confused by the cxcy_to_gcxgcy and gcxgcy_to_cxcy functions in utils.py. I think there is a problem in gcxgcy_to_cxcy.

cxcy_to_gcxgcy should normalize target_loc according to std and mean. For example, for cx: new_target_cx = (target_cx - mean) / std.

gcxgcy_to_cxcy is used in inference. The predicted result is new_target_cx, not target_cx, so we should first convert new_target_cx to target_cx and then get the box's cx, cy, w, h:
target_cx = new_target_cx * std + mean, cx = d_cx + d_w(target_x).
But in gcxgcy_to_cxcy you compute cx = d_cx + d_w * (new_target_cx / std); it should be new_target_cx * std.
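
A round-trip check is one way to test whether the two functions are consistent. The sketch below is a plain-Python illustration, not the tutorial's code. The center encoding follows the expression visible in the traceback earlier on this page, `(cxcy - prior_cxcy) / (prior_wh / 10)`, while the log-size term with a factor of 5 is an assumption. Under this reading, dividing by `p_w / 10` in the encoder equals dividing by `p_w` and by a std of 0.1, so multiplying by `p_w / 10` in the decoder equals multiplying by that std.

```python
import math


def cxcy_to_gcxgcy(box, prior):
    """Encode a (cx, cy, w, h) box relative to a (cx, cy, w, h) prior."""
    cx, cy, w, h = box
    p_cx, p_cy, p_w, p_h = prior
    return ((cx - p_cx) / (p_w / 10),
            (cy - p_cy) / (p_h / 10),
            math.log(w / p_w) * 5,    # assumed size encoding
            math.log(h / p_h) * 5)


def gcxgcy_to_cxcy(encoded, prior):
    """Invert the encoding above."""
    g_cx, g_cy, g_w, g_h = encoded
    p_cx, p_cy, p_w, p_h = prior
    return (g_cx * p_w / 10 + p_cx,
            g_cy * p_h / 10 + p_cy,
            math.exp(g_w / 5) * p_w,
            math.exp(g_h / 5) * p_h)


# Illustrative box and prior in fractional coordinates (assumed values).
box = (0.5, 0.5, 0.2, 0.4)
prior = (0.45, 0.55, 0.25, 0.3)

decoded = gcxgcy_to_cxcy(cxcy_to_gcxgcy(box, prior), prior)
print(decoded)
```

If the decoded box matches the input up to floating-point error, the pair is a consistent encode/decode under these assumptions.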

backprop through loss

Hello. My apologies for such a (probably) stupid question, but I cannot understand how backpropagation works through the label-matching operation based on the Jaccard index comparison. Could you please help me find an explanation?
Thanks in advance.

Always the same bounding box predicted

I modified your code to use 2-channel depth-thermal images.
I have 800 training images of different body positions.
But after a decent amount of training, the predicted bounding boxes are the same no matter the body position in the picture: exactly the same coordinates and exactly the same scores. (I use top_k = 1 since I have only 1 object per image.)
[image: screenshot 2019-12-12 14-40-56]
Same with top_k = 200:
[image: screenshot 2019-12-12 14-47-11]
[image: screenshot 2019-12-12 14-41-49]

I do not use data augmentation, but I think that is a separate problem.
It's as if the loss were trapped in a local minimum, but I don't know.
Did you have the same problem?

EDIT: I figured out that my predictions depend on the number of images of each body position:
left, dataset: 56 standing, 53 sitting, 73 lying down
right, dataset: 56 standing, 108 sitting, 73 lying down
[image: detection_thermique]

This is just a small dataset for running tests, but I don't understand why the loss gets stuck here.

Why subtract 1 in flip?

Why is 1 subtracted when adjusting the box coordinates after horizontal flipping? What does the 1 mean? I thought the condition was already satisfied without subtracting 1. Am I wrong?
The code is:
new_boxes[:, 0] = image.width - boxes[:, 0] - 1
new_boxes[:, 2] = image.width - boxes[:, 2] - 1
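
One way to read the `- 1`, under the assumption that coordinates are 0-indexed pixel columns: an image of width W has columns 0 to W - 1, so the mirror image of column x is W - 1 - x. Without the subtraction, column 0 would map to W, which lies outside the image. The sketch below illustrates this with a hypothetical helper that also swaps the two x values so that xmin stays smaller than xmax.

```python
def flip_boxes_horizontally(boxes, image_width):
    """Mirror (xmin, ymin, xmax, ymax) boxes for 0-indexed pixel columns."""
    flipped = []
    for xmin, ymin, xmax, ymax in boxes:
        new_xmin = image_width - 1 - xmax   # old right edge becomes the left edge
        new_xmax = image_width - 1 - xmin   # old left edge becomes the right edge
        flipped.append((new_xmin, ymin, new_xmax, ymax))
    return flipped


boxes = [(0, 2, 3, 5)]
flipped = flip_boxes_horizontally(boxes, image_width=10)
print(flipped)  # [(6, 2, 9, 5)]
```

Flipping twice returns the original box, which is a quick sanity check for the convention.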

VOC dataset link is invalid

I tried clicking the dataset hyperlinks, but all three links seem to be invalid at the moment.
Could you please fix them?

model.detect_objects take so much time

When I try the code, the model.detect_objects part takes much more time than the training part. Is that normal? I am using 4 GPUs, but model.detect_objects actually uses only one (model.module.detect_objects), so it's very slow. Is there any way to improve this?

Problem about the training result

I used the same parameters as you did and tried to train this model on the PASCAL datasets, but I got a validation loss of about 3.5. Are the parameters in the code optimal?

something wrong with loss and mAP

Why can't I reach the mAP given in the README? The loss for each epoch is also too high: from the second epoch on, the loss is around 5. How can I reduce it?

detect.py can not display the picture

My local machine runs Windows 10. When I run the detect.py file on the server side, the picture cannot be displayed, and I don't know what to modify.
After I ran [ sudo apt-get install imagemagick ], the error changed to:
"display: unable to open X server `' @ error/display.c/DisplayImageCommand/426."

If you know the answer, please tell me. Thanks.

silly issue

I am very new to machine vision, but this project does exactly what I need as a first step. Thanks for this huge effort.
My issue is that I don't know how to run it for the first time. Is there any manual on what to run first, what to run second, and so on?
