Comments (18)
@joddiy: FYR as below. I ran this on panda13, when GPU memory was enough for the model:
$ cd singa
$ python examples/onnx/training/train.py --model resnet152v1
Error:
thao@panda13:/hdd2/thao/singa$ python examples/onnx/training/train.py --model resnet152v1
2020-10-13 20:09:27,800 Downloading https://s3.amazonaws.com/onnx-model-zoo/resnet/resnet152v1/resnet152v1.tar.gz
Traceback (most recent call last):
File "examples/onnx/training/train.py", line 352, in
args.data, sgd, args.graph, args.verbosity)
File "examples/onnx/training/train.py", line 216, in run
model.compile([tx], is_train=True, use_graph=graph, sequential=sequential)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/model.py", line 177, in compile
self.forward(*inputs)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py", line 61, in wrapper
return func(self, *args, **kwargs)
File "examples/onnx/training/train.py", line 119, in forward
y = self.linear(y)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py", line 108, in call
return self.forward(*args, **kwargs)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py", line 59, in wrapper
self.initialize(*args, **kwargs)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py", line 43, in wrapper
'initialize function expects PlaceHolders or Tensors')
AssertionError: initialize function expects PlaceHolders or Tensors
env:
- python 3.7
- singa 3.1.0.rc1 (conda)
- singa git - dev branch
Thanks for the report, let me check.
@joddiy I got the same issue when doing training with the examples/onnx models. Only the default model ('resnet18v1') runs; all the others fail for the reason mentioned. I think the model URLs might need to be updated too, since some are out-dated ('vgg19' & 'vgg19bn', for example).
AssertionError with the onnx testcase: https://github.com/apache/singa/blob/master/examples/onnx/training/train.py
$ cd examples/onnx
$ python3 training/train.py --model vgg16
Then I get the following error msg:
File "training/train.py", line 437, in <module> args.onnx_model_path, args.data, sgd, args.graph, args.verbosity) File "training/train.py", line 295, in run model.compile([tx], is_train=True, use_graph=graph, sequential=sequential) File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/model.py", line 177, in compile self.forward(*inputs) File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 63, in wrapper return func(self, *args, **kwargs) File "training/train.py", line 191, in forward y = self.linear(y) File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 110, in __call__ return self.forward(*args, **kwargs) File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 61, in wrapper self.initialize(*args, **kwargs) File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 45, in wrapper 'initialize function expects PlaceHolders or Tensors') AssertionError: initialize function expects PlaceHolders or Tensors
Something may be wrong with the layer initialization?
singa version: 3100 (the latest build from the source code of the master branch)
Python version: 3.5.2
ONNX version: 1.5.0
Hi @lijiansong, I cannot reproduce that error; instead I see a different error, like this:
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1013 10:55:05.660770 6279 convolution.cc:560] The required memory for workspace (2333081604) is larger than the expected Bytes (1073741824)
F1013 10:55:05.660809 6279 device.cc:88] Check failed: size >= 0 (-1961885692 vs. 0) size is negative, could be caused by the type cast from size_t to int. In that case, the size is too large.
*** Check failure stack trace: ***
Aborted (core dumped)
The full log is:
root@567b66a2525c:/singa# cd examples/onnx/
root@567b66a2525c:/singa/examples/onnx# python3 training/train.py --model vgg16
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553437328
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553437328
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553437328
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553438994
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553438996
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1013 11:00:16.066620 6308 convolution.cc:560] The required memory for workspace (2333081604) is larger than the expected Bytes (1073741824)
F1013 11:00:16.066661 6308 device.cc:88] Check failed: size >= 0 (-1961885692 vs. 0) size is negative, could be caused by the type cast from size_t to int. In that case, the size is too large.
*** Check failure stack trace: ***
Aborted (core dumped)
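As a side note, the negative size in the check failure looks like a plain 32-bit overflow; a quick check in plain Python (nothing SINGA-specific) reproduces the number from the log:

# Workspace size requested by cudnn, taken from the log above (in bytes).
requested = 2333081604

# Reinterpret it as a signed 32-bit integer, as the size_t -> int cast would.
as_int32 = requested - 2**32 if requested >= 2**31 else requested

print(as_int32)   # -1961885692, exactly the value in the log
print(2**31 - 1)  # 2147483647: the largest value a signed 32-bit int can hold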
Hi, @pinpom, can you reproduce the same error, or an error like the one I commented above?
@pinpom If you print the assertion expression at /hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py, line 43:
assert len(args) > 0 and isinstance(args[0], Tensor), (
    'initialize function expects PlaceHolders or Tensors')
you may find that args[0] here is not an instance of singa.tensor.Tensor. Could someone else help to fix this bug?
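For illustration, a minimal sketch that trips the same assert (the numpy array is a hypothetical stand-in for whatever non-Tensor value actually reaches the layer in train.py):

import numpy as np
from singa import tensor

# A hypothetical non-Tensor input, standing in for whatever is actually
# passed to the layer's initialize() in train.py.
args = (np.ones((1, 3, 224, 224), dtype=np.float32),)

# The same check as singa/layer.py line 43: raises AssertionError here too.
assert len(args) > 0 and isinstance(args[0], tensor.Tensor), (
    'initialize function expects PlaceHolders or Tensors')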
Thanks for the reply, I'm checking it.
@lijiansong @pinpom
I guess the problem is here:
singa/examples/onnx/training/train.py
Lines 117 to 118 in 3654b91
Each model actually has a different operator name, so I assumed that a user who wants to train another model would update this name first. Let me think about how to optimize this.
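One possible direction (a sketch only; reading the head from the ONNX graph is my assumption here, not necessarily what the actual fix does):

import onnx

# Instead of hard-coding the operator name per model, read it from the
# ONNX graph itself: the last node's output feeds the classification head.
model = onnx.load('vgg16.onnx')    # hypothetical local path to the model
last_node = model.graph.node[-1]   # final operator in topological order
print(last_node.op_type, last_node.output[0])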
@lijiansong @pinpom
it should be fixed at this PR: #808
@joddiy Thanks for your patch PR #808. Another failure occurs (as you mentioned above):
$ cd examples/onnx
$ python3 training/train.py --model vgg16 --data cifar10 --bs 1
the full log is:
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream:: SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553512191
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream:: SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553512191
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream:: SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553512191
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream:: SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553514489
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream:: SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553514491
0%|          | 0/50000 [00:00<?, ?it/s]
WARNING: Logging before InitGoogleLogging() is written to STDERR
F1014 15:29:10.526221 23807 cuda_gpu.cc:207] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
env:
singa version: 3100 (the latest build from the source code of the master branch)
Python version: 3.5.2
ONNX version: 1.5.0
I found this issue yesterday; however, it is caused by GPU memory. @dcslin, do you have any idea?
In the Singa internal source code, there are 4 enumeration values for the Block type: kInput, kParam, kInter, and kEnd. Here kInput, kParam, and kInter are easy to follow: kInput is the input data of DNN workloads, kParam is the weight parameters, and kInter is the intermediate results during DNN workloads. But what does kEnd mean here?
singa/include/singa/core/scheduler.h
Line 55 in f04d197
Sorry, I have no idea about the C++ code.
Hi, @chrishkchris, can you help check it?
All the blocks are used to construct the computational graph. I think kEnd means the end nodes of the graph,
like in this example: https://stackoverflow.com/questions/57678534/find-end-node-in-directed-graph
@XJDKC The code was written by you, so you may know it better. Did I describe it correctly?
@chrishkchris @lijiansong It's correct. Take the computational graph below for example:
[image: example computational graph]
The type of the pink block in the picture is kEnd, which means the block is not used by any other operators in the graph. This kind of block is considered the endpoint of the graph. I distinguish it from the other types to better optimize the memory footprint of model training.
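A toy sketch of the idea (illustrative only, not SINGA code): a kEnd block is simply one that no operator in the graph consumes:

# Hypothetical block graph: each block maps to the operators that read it.
consumers = {
    'input':  ['conv'],   # kInput: input data fed into the graph
    'weight': ['conv'],   # kParam: model parameter
    'conv':   ['loss'],   # kInter: intermediate result
    'loss':   [],         # read by nothing -> kEnd, an endpoint of the graph
}

end_blocks = [name for name, readers in consumers.items() if not readers]
print(end_blocks)  # ['loss']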
@XJDKC @chrishkchris @joddiy Got it, thanks.
Welcome!
Hi Singa team, I also encountered this error:
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0217 18:35:53.217267 6606 convolution.cc:560] The required memory for workspace (1192230916) is larger than the expected Bytes (1073741824)
W0217 18:35:53.218562 6606 convolution.cc:560] The required memory for workspace (2333081604) is larger than the expected Bytes (1073741824)
F0217 18:35:53.218595 6606 device.cc:88] Check failed: size >= 0 (-1961885692 vs. 0) size is negative, could be caused by the type cast from size_t to int. In that case, the size is too large.
*** Check failure stack trace: ***
Aborted (core dumped)
My Singa version is 3.1.0 (3.1.0...master).
I saw the related pull request #808 (https://github.com/apache/singa/pull/808/files), but that PR does not seem to address this bug (if it is the same one). Could you reopen this issue?