
Comments (18)

pinpom commented on July 23, 2024

@joddiy: FYR as below. I ran it on panda13, where GPU memory was enough for the model:
$ cd singa
$ python examples/onnx/training/train.py --model resnet152v1
Error:
thao@panda13:/hdd2/thao/singa$ python examples/onnx/training/train.py --model resnet152v1

2020-10-13 20:09:27,800 Downloading https://s3.amazonaws.com/onnx-model-zoo/resnet/resnet152v1/resnet152v1.tar.gz
Traceback (most recent call last):
File "examples/onnx/training/train.py", line 352, in <module>
args.data, sgd, args.graph, args.verbosity)
File "examples/onnx/training/train.py", line 216, in run
model.compile([tx], is_train=True, use_graph=graph, sequential=sequential)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/model.py", line 177, in compile
self.forward(*inputs)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py", line 61, in wrapper
return func(self, *args, **kwargs)
File "examples/onnx/training/train.py", line 119, in forward
y = self.linear(y)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py", line 108, in __call__
return self.forward(*args, **kwargs)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py", line 59, in wrapper
self.initialize(*args, **kwargs)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py", line 43, in wrapper
'initialize function expects PlaceHolders or Tensors')
AssertionError: initialize function expects PlaceHolders or Tensors

env:

  • python 3.7
  • singa 3.1.0.rc1 (conda)
  • singa git - dev branch

joddiy commented on July 23, 2024

Thanks for the report, let me check.

pinpom commented on July 23, 2024

@joddiy I got the same issue when doing training for the examples/onnx models. Only the default model ('resnet18v1') runs; all the others fail for the reason mentioned above. I think the model URLs might need to be updated too, since some are outdated ('vgg19' and 'vgg19bn', for example).

joddiy commented on July 23, 2024

AssertionError with the onnx testcase: https://github.com/apache/singa/blob/master/examples/onnx/training/train.py

$ cd examples/onnx
$ python3 training/train.py --model vgg16

Then I get the following error message:

File "training/train.py", line 437, in <module>
    args.onnx_model_path, args.data, sgd, args.graph, args.verbosity)
  File "training/train.py", line 295, in run
    model.compile([tx], is_train=True, use_graph=graph, sequential=sequential)
  File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/model.py", line 177, in compile
    self.forward(*inputs)
  File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 63, in wrapper
    return func(self, *args, **kwargs)
  File "training/train.py", line 191, in forward
    y = self.linear(y)
  File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 110, in __call__
    return self.forward(*args, **kwargs)
  File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 61, in wrapper
    self.initialize(*args, **kwargs)
  File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 45, in wrapper
    'initialize function expects PlaceHolders or Tensors')
AssertionError: initialize function expects PlaceHolders or Tensors

Something may be wrong with the layer initialization?

singa version: 3100 (the latest build from the master branch source code)
Python version: 3.5.2
ONNX version: 1.5.0

Hi @lijiansong, I cannot reproduce that error; instead I see another error like this:

WARNING: Logging before InitGoogleLogging() is written to STDERR
W1013 10:55:05.660770  6279 convolution.cc:560] The required memory for workspace (2333081604) is larger than the expected Bytes (1073741824)
F1013 10:55:05.660809  6279 device.cc:88] Check failed: size >= 0 (-1961885692 vs. 0) size is negative, could be caused by the type cast from size_t to int. In that case, the size is too large.
*** Check failure stack trace: ***
Aborted (core dumped)

The full log is:

root@567b66a2525c:/singa# cd examples/onnx/
root@567b66a2525c:/singa/examples/onnx# python3 training/train.py --model vgg16
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553437328
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553437328
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553437328
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553438994
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553438996
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1013 11:00:16.066620  6308 convolution.cc:560] The required memory for workspace (2333081604) is larger than the expected Bytes (1073741824)
F1013 11:00:16.066661  6308 device.cc:88] Check failed: size >= 0 (-1961885692 vs. 0) size is negative, could be caused by the type cast from size_t to int. In that case, the size is too large.
*** Check failure stack trace: ***
Aborted (core dumped)
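By the way, the negative size in the Check failed line is just the 32-bit wrap-around of the requested workspace. A quick sanity check in Python (a minimal sketch; the numbers are copied from the log above):

requested = 2333081604       # workspace bytes requested by the convolution (from the log)
wrapped = requested - 2**32  # what the value becomes after a size_t -> int32 cast
print(wrapped)               # -1961885692, the negative size reported by device.cc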

joddiy commented on July 23, 2024

Hi @pinpom, can you reproduce the same error, or an error like the one I commented above?

lijiansong commented on July 23, 2024

@pinpom If you print the assertion expression at line 43 of /hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py,

assert len(args) > 0 and isinstance(args[0], Tensor), (
                    'initialize function expects PlaceHolders or Tensors')

you may find that args[0] here is not an instance of singa.tensor.Tensor. Could someone help fix this bug?
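For illustration only, a hypothetical temporary debug print (not a fix) placed just above that assert in the installed layer.py shows what initialize actually receives:

# temporary, hypothetical debug print added just above the existing assert in singa/layer.py
print('initialize received:', [type(a) for a in args])
assert len(args) > 0 and isinstance(args[0], Tensor), (
    'initialize function expects PlaceHolders or Tensors')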

joddiy commented on July 23, 2024

Thanks for the reply, I'm checking it.

joddiy commented on July 23, 2024

@lijiansong @pinpom
I guess the problem is here:

# if you change to other models, please update the output name here
y = super(MyModel, self).forward(*x, aux_output=['flatten_170'])[1]

Each model actually has a different output operator name; previously I assumed that if a user wants to train another model, they would update this name first. Let me think about how to optimize it.
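As a rough sketch of one direction (using the standard onnx package; the file path and the Gemm heuristic below are only assumptions for illustration), the output name could be looked up from the graph instead of being hard-coded:

import onnx

# hypothetical path to the downloaded ONNX file
model = onnx.load('resnet18v1/resnet18v1.onnx')

# list every node's op type and outputs so users can find the right aux_output name
for node in model.graph.node:
    print(node.op_type, list(node.output))

# a possible heuristic for these classification models: the input of the last Gemm
# node is the flattened feature tensor (e.g. the hard-coded 'flatten_170' above)
gemm_nodes = [n for n in model.graph.node if n.op_type == 'Gemm']
if gemm_nodes:
    print('candidate aux_output:', gemm_nodes[-1].input[0])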

joddiy commented on July 23, 2024

@lijiansong @pinpom
It should be fixed by this PR: #808

lijiansong commented on July 23, 2024

@joddiy Thanks for your patch PR at #808; another failure occurs (as you mentioned above):

$ cd examples/onnx
$ python3 training/train.py --model vgg16 --data cifar10 --bs 1

The full log is:

[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553512191
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553512191
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553512191
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553514489
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553514491
0%|          | 0/50000 [00:00<?, ?it/s]WARNING: Logging before InitGoogleLogging() is written to STDERR
F1014 15:29:10.526221 23807 cuda_gpu.cc:207] Check failed: error == cudaSuccess (700 vs. 0)  an illegal memory access was encountered
*** Check failure stack trace: ***

env:
singa version: 3100 (the latest build from the master branch source code)
Python version: 3.5.2
ONNX version: 1.5.0

joddiy commented on July 23, 2024

I found this issue yesterday; it is caused by GPU memory. @dcslin, do you have any idea?

lijiansong commented on July 23, 2024

In the Singa internal source code there are four block types of interest: kInput, kParam, kInter, and kEnd. kInput, kParam, and kInter are easy to follow: kInput is the input data of a DNN workload, kParam holds the weight parameters, and kInter holds the intermediate results computed during the workload. But what does kEnd mean here?

enum BlockType { kUnknow, kInput, kParam, kInter, kEnd };

@joddiy

joddiy commented on July 23, 2024

Sorry, I have no idea about the C++ code.

Hi, @chrishkchris, can you help check it?

chrishkchris commented on July 23, 2024

All the blocks are used to construct the computational graph. I think kEnd means the end nodes of the graph, like in this example: https://stackoverflow.com/questions/57678534/find-end-node-in-directed-graph

@XJDKC The code was written by you, so you may know it better. Did I describe it correctly?

XJDKC commented on July 23, 2024

@chrishkchris @lijiansong It's correct. Take the computational graph below for example:

[figure: example computational graph]

The type of the pink block in the picture is kEnd, which means this block is not used by any other operators in the graph. This kind of block is considered an endpoint of the graph. I distinguish it from the other types to better optimize the memory footprint of model training.
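For intuition only, here is a tiny Python sketch (not Singa's actual implementation) of how such end blocks can be identified: any block that is produced by some operator but never consumed by another one is an endpoint of the graph, so its memory can be handled differently.

# toy graph: each op maps input block ids to output block ids (illustration only)
ops = [
    {'in': ['x', 'w1'], 'out': ['h1']},
    {'in': ['h1', 'w2'], 'out': ['y']},
]

consumed = {b for op in ops for b in op['in']}
produced = {b for op in ops for b in op['out']}

# blocks produced but never consumed by another op play the role of kEnd blocks
end_blocks = produced - consumed
print(end_blocks)  # {'y'}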

lijiansong commented on July 23, 2024

@XJDKC @chrishkchris @joddiy Got it, thanks.

XJDKC commented on July 23, 2024

Welcome!

delphieritas commented on July 23, 2024

Hi Singa team, I also encountered this error:
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0217 18:35:53.217267 6606 convolution.cc:560] The required memory for workspace (1192230916) is larger than the expected Bytes (1073741824)
W0217 18:35:53.218562 6606 convolution.cc:560] The required memory for workspace (2333081604) is larger than the expected Bytes (1073741824)
F0217 18:35:53.218595 6606 device.cc:88] Check failed: size >= 0 (-1961885692 vs. 0) size is negative, could be caused by the type cast from size_t to int. In that case, the size is too large.
*** Check failure stack trace: ***
Aborted (core dumped)

My Singa version is 3.1.0 (3.1.0...master).

I saw the related pull request #808 (https://github.com/apache/singa/pull/808/files), but that PR does not seem to address this bug (if it is the same one). Could you reopen this issue so it can be solved?
