
Comments (18)

pinpom commented on July 23, 2024

@joddiy: FYR as below. I ran it on panda13, where GPU memory was enough for the model:
$ cd singa
$ python examples/onnx/training/train.py --model resnet152v1
Error:
thao@panda13:/hdd2/thao/singa$ python examples/onnx/training/train.py --model resnet152v1

2020-10-13 20:09:27,800 Downloading https://s3.amazonaws.com/onnx-model-zoo/resnet/resnet152v1/resnet152v1.tar.gz
Traceback (most recent call last):
File "examples/onnx/training/train.py", line 352, in <module>
args.data, sgd, args.graph, args.verbosity)
File "examples/onnx/training/train.py", line 216, in run
model.compile([tx], is_train=True, use_graph=graph, sequential=sequential)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/model.py", line 177, in compile
self.forward(*inputs)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py", line 61, in wrapper
return func(self, *args, **kwargs)
File "examples/onnx/training/train.py", line 119, in forward
y = self.linear(y)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py", line 108, in __call__
return self.forward(*args, **kwargs)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py", line 59, in wrapper
self.initialize(*args, **kwargs)
File "/hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py", line 43, in wrapper
'initialize function expects PlaceHolders or Tensors')
AssertionError: initialize function expects PlaceHolders or Tensors

env:

  • python 3.7
  • singa 3.1.0.rc1 (conda)
  • singa git - dev branch

joddiy commented on July 23, 2024

Thanks for the report, let me check.

pinpom commented on July 23, 2024

@joddiy I got the same issue when doing training for the examples/onnx models. Only the default model ('resnet18v1') runs; all the others fail for the reason mentioned above. I think the model URLs might need to be updated too, since some are outdated ('vgg19' and 'vgg19bn', for example).

joddiy commented on July 23, 2024

AssertionError with the onnx testcase: https://github.com/apache/singa/blob/master/examples/onnx/training/train.py

$ cd examples/onnx
$ python3 training/train.py --model vgg16

Then I get the following error message:

File "training/train.py", line 437, in <module>
    args.onnx_model_path, args.data, sgd, args.graph, args.verbosity)
  File "training/train.py", line 295, in run
    model.compile([tx], is_train=True, use_graph=graph, sequential=sequential)
  File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/model.py", line 177, in compile
    self.forward(*inputs)
  File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 63, in wrapper
    return func(self, *args, **kwargs)
  File "training/train.py", line 191, in forward
    y = self.linear(y)
  File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 110, in __call__
    return self.forward(*args, **kwargs)
  File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 61, in wrapper
    self.initialize(*args, **kwargs)
  File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 45, in wrapper
    'initialize function expects PlaceHolders or Tensors')
AssertionError: initialize function expects PlaceHolders or Tensors

Something may be wrong with the layer initialization?

singa version: 3100 (the latest build from the master branch source code)
Python version: 3.5.2
ONNX version: 1.5.0

Hi @lijiansong, I cannot reproduce that error; instead I see another error like this:

WARNING: Logging before InitGoogleLogging() is written to STDERR
W1013 10:55:05.660770  6279 convolution.cc:560] The required memory for workspace (2333081604) is larger than the expected Bytes (1073741824)
F1013 10:55:05.660809  6279 device.cc:88] Check failed: size >= 0 (-1961885692 vs. 0) size is negative, could be caused by the type cast from size_t to int. In that case, the size is too large.
*** Check failure stack trace: ***
Aborted (core dumped)

The full log is:

root@567b66a2525c:/singa# cd examples/onnx/
root@567b66a2525c:/singa/examples/onnx# python3 training/train.py --model vgg16
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553437328
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553437328
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553437328
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553438994
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553438996
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1013 11:00:16.066620  6308 convolution.cc:560] The required memory for workspace (2333081604) is larger than the expected Bytes (1073741824)
F1013 11:00:16.066661  6308 device.cc:88] Check failed: size >= 0 (-1961885692 vs. 0) size is negative, could be caused by the type cast from size_t to int. In that case, the size is too large.
*** Check failure stack trace: ***
Aborted (core dumped)
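By the way, the negative size in the Check failed line is just the 32-bit wrap-around of the requested workspace. A quick sanity check in Python (a minimal sketch; the numbers are copied from the log above):

requested = 2333081604       # workspace bytes requested by the convolution (from the log)
wrapped = requested - 2**32  # what the value becomes after a size_t -> int32 cast
print(wrapped)               # -1961885692, the negative size reported by device.cc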

joddiy commented on July 23, 2024

Hi @pinpom, can you reproduce the same error, or an error like the one I commented above?

lijiansong commented on July 23, 2024

@pinpom If you print the assertion expression at line 43 of /hdd2/thao/conda/miniconda3/envs/sing/lib/python3.7/site-packages/singa/layer.py,

assert len(args) > 0 and isinstance(args[0], Tensor), (
                    'initialize function expects PlaceHolders or Tensors')

you may find that args[0] here is not an instance of singa.tensor.Tensor. Could someone help fix this bug?
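For illustration only, a hypothetical temporary debug print (not a fix) placed just above that assert in the installed layer.py shows what initialize actually receives:

# temporary, hypothetical debug print added just above the existing assert in singa/layer.py
print('initialize received:', [type(a) for a in args])
assert len(args) > 0 and isinstance(args[0], Tensor), (
    'initialize function expects PlaceHolders or Tensors')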

joddiy commented on July 23, 2024

Thanks for the reply, I'm checking it.

joddiy commented on July 23, 2024

@lijiansong @pinpom
I guess the problem is here:

# if you change to other models, please update the output name here
y = super(MyModel, self).forward(*x, aux_output=['flatten_170'])[1]

Each model actually has a different output operator name; previously I assumed that if a user wants to train another model, they would update this name first. Let me think about how to optimize it.
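As a rough sketch of one direction (using the standard onnx package; the file path and the Gemm heuristic below are only assumptions for illustration), the output name could be looked up from the graph instead of being hard-coded:

import onnx

# hypothetical path to the downloaded ONNX file
model = onnx.load('resnet18v1/resnet18v1.onnx')

# list every node's op type and outputs so users can find the right aux_output name
for node in model.graph.node:
    print(node.op_type, list(node.output))

# a possible heuristic for these classification models: the input of the last Gemm
# node is the flattened feature tensor (e.g. the hard-coded 'flatten_170' above)
gemm_nodes = [n for n in model.graph.node if n.op_type == 'Gemm']
if gemm_nodes:
    print('candidate aux_output:', gemm_nodes[-1].input[0])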

joddiy commented on July 23, 2024

@lijiansong @pinpom
It should be fixed by this PR: #808

lijiansong commented on July 23, 2024

@joddiy Thanks for your patch PR at #808; another failure occurs (as you mentioned above):

$ cd examples/onnx
$ python3 training/train.py --model vgg16 --data cifar10 --bs 1

The full log is:

[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553512191
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553512191
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553512191
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553514489
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553514491
0%|          | 0/50000 [00:00<?, ?it/s]WARNING: Logging before InitGoogleLogging() is written to STDERR
F1014 15:29:10.526221 23807 cuda_gpu.cc:207] Check failed: error == cudaSuccess (700 vs. 0)  an illegal memory access was encountered
*** Check failure stack trace: ***

env:
singa version: 3100 (the latest build from the master branch source code)
Python version: 3.5.2
ONNX version: 1.5.0

joddiy commented on July 23, 2024

I found this issue yesterday; it is caused by GPU memory. @dcslin, do you have any idea?

lijiansong commented on July 23, 2024

In the Singa internal source code there are four block types of interest: kInput, kParam, kInter, and kEnd. kInput, kParam, and kInter are easy to follow: kInput is the input data of a DNN workload, kParam holds the weight parameters, and kInter holds the intermediate results computed during the workload. But what does kEnd mean here?

enum BlockType { kUnknow, kInput, kParam, kInter, kEnd };

@joddiy

joddiy commented on July 23, 2024

Sorry, I have no idea about the C++ code.

Hi, @chrishkchris, can you help check it?

chrishkchris commented on July 23, 2024

All the blocks are used to construct the computational graph. I think kEnd means the end nodes of the graph, like in this example: https://stackoverflow.com/questions/57678534/find-end-node-in-directed-graph

@XJDKC The code was written by you, so you may know it better. Did I describe it correctly?

XJDKC commented on July 23, 2024

@chrishkchris @lijiansong It's correct. Take the computational graph below for example:

[figure: example computational graph]

The type of the pink block in the picture is kEnd, which means this block is not used by any other operators in the graph. This kind of block is considered an endpoint of the graph. I distinguish it from the other types to better optimize the memory footprint of model training.
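For intuition only, here is a tiny Python sketch (not Singa's actual implementation) of how such end blocks can be identified: any block that is produced by some operator but never consumed by another one is an endpoint of the graph, so its memory can be handled differently.

# toy graph: each op maps input block ids to output block ids (illustration only)
ops = [
    {'in': ['x', 'w1'], 'out': ['h1']},
    {'in': ['h1', 'w2'], 'out': ['y']},
]

consumed = {b for op in ops for b in op['in']}
produced = {b for op in ops for b in op['out']}

# blocks produced but never consumed by another op play the role of kEnd blocks
end_blocks = produced - consumed
print(end_blocks)  # {'y'}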

lijiansong commented on July 23, 2024

@XJDKC @chrishkchris @joddiy Got it, thanks.

XJDKC commented on July 23, 2024

Welcome!

delphieritas commented on July 23, 2024

Hi Singa team, I also encountered this error:
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0217 18:35:53.217267 6606 convolution.cc:560] The required memory for workspace (1192230916) is larger than the expected Bytes (1073741824)
W0217 18:35:53.218562 6606 convolution.cc:560] The required memory for workspace (2333081604) is larger than the expected Bytes (1073741824)
F0217 18:35:53.218595 6606 device.cc:88] Check failed: size >= 0 (-1961885692 vs. 0) size is negative, could be caused by the type cast from size_t to int. In that case, the size is too large.
*** Check failure stack trace: ***
Aborted (core dumped)

My Singa version is 3.1.0 (3.1.0...master).

I saw the related pull request #808 (https://github.com/apache/singa/pull/808/files), but that PR does not seem to address this bug (if it is the same one). Could you reopen this issue so it can be solved?
