Comments (12)
I also have this problem. I changed the code to let the program continue when it occurs.
from caffe.
Thanks for sharing. Do you know whether it was later able to write a snapshot?
Otherwise it will be useless.
I will check other options too.
Sergio
On Jan 18, 2014 7:30 PM, "Feiteng Li" [email protected] wrote:
I also have this problem. I change the code, let the program continue if
this problem occurs.
I changed the code to let the program keep running, but afterwards it cannot snapshot the network again, so the run becomes useless: I can never save the parameters.
For now I am training for 10000 iterations, making 2 snapshots, and then resuming from there.
The program can work normally after ...(NO, cannot work normally!)
I changed the function like this:
void WriteProtoToBinaryFile(const Message& proto, const char* filename) {
  fstream output(filename, ios::out | ios::trunc | ios::binary);
  // CHECK(proto.SerializeToOstream(&output));
  if (!proto.SerializeToOstream(&output)) {  // added by LiFT
    // Log the failure to a side file instead of aborting the program.
    fstream out("SerializeToOstream_Error.txt", fstream::out | fstream::app);
    out << "---- SerializeToOstream Error: file " << filename << " ----\n";
    out.close();
  }
  output.close();  // added by LiFT
}
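One way to narrow down a failure like this is to check whether the output file opened at all before blaming serialization. The sketch below is not Caffe code: `WriteBytesChecked` is a hypothetical helper that writes a plain byte string instead of a protobuf message, and it assumes the platform sets `errno` when the underlying open or write fails, which is common but not guaranteed by the C++ standard.

```cpp
#include <cerrno>
#include <cstring>
#include <fstream>
#include <string>

// Hypothetical helper, not part of Caffe: writes `payload` to `filename`.
// Returns an empty string on success, or a description of the failing step.
std::string WriteBytesChecked(const std::string& payload, const char* filename) {
  errno = 0;
  std::ofstream output(filename,
                       std::ios::out | std::ios::trunc | std::ios::binary);
  if (!output.is_open()) {
    // On most platforms errno now explains the failure, for example
    // EMFILE ("Too many open files") when the descriptor limit is reached.
    return std::string("open failed: ") + std::strerror(errno);
  }
  output.write(payload.data(), static_cast<std::streamsize>(payload.size()));
  output.flush();
  if (!output.good()) {
    return std::string("write failed: ") + std::strerror(errno);
  }
  return std::string();
}
```

If the returned message starts with "open failed", the stream was never usable, so any serialization call on it would fail regardless of the message contents.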
@lifeiteng thanks for sharing your code. I tried something similar on my own and was able to keep the code running. The problem is that after the first failure in WriteProtoToBinaryFile, all later attempts fail too, so I can never get a snapshot of the network for later use.
What I did was change the snapshot_prefix parameter, and then the code started working again. I don't yet know why it was failing in the first place; I cannot think of any explanation for why it fails only sometimes.
What is the cause of the problem?
It always happens when I use the code for acoustic modeling (I have changed the code to make it work for acoustic modeling).
I haven't found the reason yet, but I just changed the snapshot_prefix and
it worked. My only explanation is that maybe there were too many snapshots
with that name on the disk already.
Sergio
2014-02-07 Feiteng Li [email protected]:
What is the cause of the problem?
It always happens when I use the code to do Acoustic Modeling(I have
changed the code to make it OK for Acoustic Modeling).
The snapshot name is stored in a string value.
template <typename Dtype>
void Solver<Dtype>::Snapshot() {
  NetParameter net_param;
  // For intermediate results, we will also dump the gradient values.
  net_->ToProto(&net_param, param_.snapshot_diff());
  string filename(param_.snapshot_prefix());
  char iter_str_buffer[20];
  sprintf(iter_str_buffer, "_iter_%d", iter_);
  filename += iter_str_buffer;
  LOG(INFO) << "Snapshotting to " << filename;
  WriteProtoToBinaryFile(net_param, filename.c_str());  // the write error occurs here
  SolverState state;
  SnapshotSolverState(&state);
  state.set_iter(iter_);
  state.set_learned_net(filename);
  filename += ".solverstate";
  LOG(INFO) << "Snapshotting solver state to " << filename;
  WriteProtoToBinaryFile(state, filename.c_str());
}

void WriteProtoToBinaryFile(const Message& proto, const char* filename) {
  fstream output(filename, ios::out | ios::trunc | ios::binary);
  CHECK(proto.SerializeToOstream(&output));
}
error information:
I0210 14:28:27.514936 22349 solver.cpp:126] Snapshotting to cnn_iter_10000
F0210 14:28:27.960814 22349 io.cpp:69] Check failed: proto.SerializeToOstream(&output)
*** Check failure stack trace: ***
@ 0x7f5a486c9b7d (unknown)
@ 0x7f5a486cbc7f (unknown)
@ 0x7f5a486c976c (unknown)
@ 0x7f5a486cc51d (unknown)
@ 0x41fbfd (unknown)
@ 0x4212e8 (unknown)
@ 0x4251db (unknown)
@ 0x40f3be (unknown)
@ 0x7f5a474cb76d (unknown)
@ 0x4109ad (unknown)
If I restart training from the latest cnn_xx_xx.solverstate, this problem occurs every 10 snapshots.
@sguada Sergio, I was visiting a friend who uses caffe in his work, and he pointed me to his solution: it turns out that leveldb opens too many files for caching. Its default is 1000, and the Ubuntu default open file limit is 1024. That is dangerously near the limit, so you see random crashes from SerializeToOstream().
You could try either reducing the leveldb cache size (see #13) or increasing the open file limit:
http://posidev.com/blog/2009/06/04/set-ulimit-parameters-on-ubuntu/
Let me know if it works :)
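For checking the limit from inside a program rather than from the shell, the POSIX calls `getrlimit` and `setrlimit` with `RLIMIT_NOFILE` can be used. The sketch below assumes a POSIX system; the helper names `OpenFileSoftLimit` and `RaiseOpenFileSoftLimit` are made up for illustration and are not part of Caffe or leveldb.

```cpp
#include <sys/resource.h>

// Returns the current soft limit on open file descriptors,
// or 0 if the limit cannot be queried.
rlim_t OpenFileSoftLimit() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return 0;
  return rl.rlim_cur;
}

// Tries to raise the soft limit to `wanted`, capped at the hard limit.
// Returns true if the soft limit is at least the capped target afterwards.
bool RaiseOpenFileSoftLimit(rlim_t wanted) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return false;
  rlim_t target = wanted;
  if (rl.rlim_max != RLIM_INFINITY && target > rl.rlim_max) {
    target = rl.rlim_max;  // an unprivileged process cannot exceed the hard limit
  }
  if (target <= rl.rlim_cur) return true;  // already high enough
  rl.rlim_cur = target;
  return setrlimit(RLIMIT_NOFILE, &rl) == 0;
}
```

A call such as `RaiseOpenFileSoftLimit(4096)` early in `main` would only affect the current process; a limit configured system-wide, as in the linked post, applies to every new session instead.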
Thanks @Yangqing, that probably explains why the error was somewhat random at times. I think we should make leveldb options.max_open_files = 10 the default, since we are reading in sequence and having multiple open files will not help. I guess more open files would be useful for random access.
Symptom of the same leveldb number of open files issue as #13.
Solution is to modify src/caffe/layers/data_layer.cpp by setting options.max_open_files = 100 (or any number significantly lower than 1000) as discovered and confirmed by @reedscot, @Yangqing and @sguada.
Fixed by #154.
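As a rough illustration of why lowering the value helps, the sketch below compares the descriptors left over under a 1024 limit for the default of 1000 and for 100. `OptionsStandIn` is a stand-in so the example compiles without LevelDB installed; its two fields mirror real members of `leveldb::Options`, while `MakeSequentialReadOptions` and `DescriptorHeadroom` are hypothetical helpers, not Caffe or LevelDB functions.

```cpp
// Stand-in for leveldb::Options (assumption: only the two fields used here).
// In the real library, max_open_files also defaults to 1000.
struct OptionsStandIn {
  bool create_if_missing = false;
  int max_open_files = 1000;
};

// Hypothetical helper: options for a reader that scans the database in order.
OptionsStandIn MakeSequentialReadOptions(int max_open_files) {
  OptionsStandIn options;
  options.create_if_missing = false;        // the training database must already exist
  options.max_open_files = max_open_files;  // e.g. 100, well below the default
  return options;
}

// Descriptors left for logs, snapshots and everything else if the
// database used its entire file cache.
long DescriptorHeadroom(long open_file_limit, const OptionsStandIn& options) {
  return open_file_limit - options.max_open_files;
}
```

With the real library, the same assignment would be made on the `leveldb::Options` object before it is passed to `leveldb::DB::Open`; where exactly that happens in data_layer.cpp depends on the Caffe revision.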
Solution is to modify src/caffe/layers/data_layer.cpp by setting options.max_open_files = 100 (or any number significantly lower than 1000)
Can you give some examples? Where is the code inserted?