
Comments (12)

lifeiteng commented on April 27, 2024

I also have this problem. I changed the code to let the program continue if this problem occurs.

from caffe.

sguada commented on April 27, 2024

Thanks for sharing. Do you know if it was later able to write a snapshot?
Otherwise it will be useless.
I will check other options too.

Sergio
On Jan 18, 2014 7:30 PM, "Feiteng Li" [email protected] wrote:

I also have this problem. I change the code, let the program continue if
this problem occurs.



sguada commented on April 27, 2024

I have changed the code to let the program keep running, but afterwards it is not able to snapshot the network again and therefore becomes useless, since I can never save the parameters.
For now, what I'm doing is training for 10000 iterations, making 2 snapshots, and then resuming from there.


lifeiteng commented on April 27, 2024

The program can work normally after ... (NO, it cannot work normally!)
I changed the function like this:

void WriteProtoToBinaryFile(const Message& proto, const char* filename) {
  fstream output(filename, ios::out | ios::trunc | ios::binary);
  // CHECK(proto.SerializeToOstream(&output));
  if (!proto.SerializeToOstream(&output)) {  // added by LiFT
    // Log the failure to a side file instead of aborting.
    fstream out("SerializeToOstream_Error.txt", fstream::out | fstream::app);
    out << "---- SerializeToOstream Error: file " << filename << " ----\n";
    out.close();
  }
  output.close();  // added by LiFT
}
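
One thing the patched function does not show is whether the output file could be opened at all; if the open fails, every later write to that stream fails as well. Below is a minimal sketch of a non-fatal writer that reports why the open failed, using only the standard library. `WriteBytesToFile` is a hypothetical helper, not part of Caffe, and the `std::string` payload is an assumption standing in for an already serialized message.

```cpp
#include <cerrno>
#include <cstring>
#include <fstream>
#include <string>

// Hypothetical helper, not part of Caffe: writes `data` to `filename` and
// returns false with a message in `*error` instead of aborting the program.
bool WriteBytesToFile(const std::string& data, const char* filename,
                      std::string* error) {
  errno = 0;
  std::ofstream output(filename,
                       std::ios::out | std::ios::trunc | std::ios::binary);
  if (!output.is_open()) {
    // On POSIX systems errno is typically set by the failed open call,
    // for example EMFILE when the process has no file descriptors left.
    *error = std::string("cannot open ") + filename + ": " +
             std::strerror(errno);
    return false;
  }
  output.write(data.data(), static_cast<std::streamsize>(data.size()));
  output.close();
  if (output.fail()) {
    *error = std::string("write failed for ") + filename;
    return false;
  }
  return true;
}
```

Checking `is_open()` before writing separates "the file could not be opened" from "the data could not be written", which the single `CHECK` on the serialize call cannot do.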


sguada commented on April 27, 2024

@lifeiteng thanks for sharing your code. I tried something similar on my own and was able to keep the code running. But the problem is that after the first failure in WriteProtoToBinaryFile, all later attempts fail too, so I can never get a snapshot of the network for later use.

What I did was change the snapshot_prefix: parameter, and then the code started working again. I don't know yet why it was failing in the first case; I cannot think of any explanation for why it fails only sometimes.


lifeiteng commented on April 27, 2024

What is the cause of the problem?
It always happens when I use the code to do acoustic modeling (I have changed the code to make it work for acoustic modeling).


sguada commented on April 27, 2024

I haven't found the reason yet, but I just changed the snapshot_prefix and
it worked. My only explanation is that maybe there were too many snapshots
with that name in the disk already.

Sergio

2014-02-07 Feiteng Li [email protected]:

What is the cause of the problem?
It always happens when I use the code to do Acoustic Modeling(I have
changed the code to make it OK for Acoustic Modeling).



lifeiteng commented on April 27, 2024

The snapshot name is stored in a string value.

template <typename Dtype>
void Solver<Dtype>::Snapshot() {
  NetParameter net_param;
  // For intermediate results, we will also dump the gradient values.
  net_->ToProto(&net_param, param_.snapshot_diff());
  string filename(param_.snapshot_prefix());
  char iter_str_buffer[20];
  sprintf(iter_str_buffer, "_iter_%d", iter_);
  filename += iter_str_buffer;
  LOG(INFO) << "Snapshotting to " << filename;
  WriteProtoToBinaryFile(net_param, filename.c_str());  // the write error happens here
  SolverState state;
  SnapshotSolverState(&state);
  state.set_iter(iter_);
  state.set_learned_net(filename);
  filename += ".solverstate";
  LOG(INFO) << "Snapshotting solver state to " << filename;
  WriteProtoToBinaryFile(state, filename.c_str());
}

void WriteProtoToBinaryFile(const Message& proto, const char* filename) {
  fstream output(filename, ios::out | ios::trunc | ios::binary);
  CHECK(proto.SerializeToOstream(&output));
}

Error information:
I0210 14:28:27.514936 22349 solver.cpp:126] Snapshotting to cnn_iter_10000
F0210 14:28:27.960814 22349 io.cpp:69] Check failed: proto.SerializeToOstream(&output)
*** Check failure stack trace: ***
@ 0x7f5a486c9b7d (unknown)
@ 0x7f5a486cbc7f (unknown)
@ 0x7f5a486c976c (unknown)
@ 0x7f5a486cc51d (unknown)
@ 0x41fbfd (unknown)
@ 0x4212e8 (unknown)
@ 0x4251db (unknown)
@ 0x40f3be (unknown)
@ 0x7f5a474cb76d (unknown)
@ 0x4109ad (unknown)


If I restart the training from the latest cnn_xx_xx.solverstate, this problem occurs every 10 snapshots.


Yangqing commented on April 27, 2024

@sguada Sergio, I was visiting a friend who was using caffe in their work, and he pointed me to his solution: it turns out that leveldb is opening too many files for caching. The default is 1000, and the Ubuntu default open file limit is 1024. This is dangerously near the limit, so you are seeing random crashes from SerializeToOstream().

You could try either reducing the leveldb cache size (see #13) or increasing the open file limit:

http://posidev.com/blog/2009/06/04/set-ulimit-parameters-on-ubuntu/

Let me know if it works :)
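
To see how close a process is to the limit, the per-process open file limits can be read (and the soft limit raised) from code. This is a rough sketch that assumes a POSIX system such as Linux; `GetOpenFileLimit` and `RaiseOpenFileLimitToHard` are hypothetical helper names, not part of Caffe or leveldb.

```cpp
#include <sys/resource.h>

// Reads the per-process open file limits. The soft limit is the value
// that `ulimit -n` reports by default. Returns false if the call fails.
bool GetOpenFileLimit(rlim_t* soft, rlim_t* hard) {
  struct rlimit lim;
  if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return false;
  *soft = lim.rlim_cur;
  *hard = lim.rlim_max;
  return true;
}

// Raises the soft limit to the hard limit, for the current process only.
// Raising the hard limit itself needs extra privileges or a system
// configuration change, which this sketch does not attempt.
bool RaiseOpenFileLimitToHard() {
  struct rlimit lim;
  if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return false;
  lim.rlim_cur = lim.rlim_max;
  return setrlimit(RLIMIT_NOFILE, &lim) == 0;
}
```

Calling `RaiseOpenFileLimitToHard()` early in the program only helps when the hard limit is above the soft limit; otherwise the limit has to be changed in the system configuration, as the linked post describes.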


sguada commented on April 27, 2024

Thanks @Yangqing, that probably explains why the error was a bit random sometimes. I think we should make leveldb options.max_open_files = 10 the default, since we are reading in sequence and having multiple open files will not help. I guess that would be useful for random access.
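
The failure mode can be reproduced without Caffe or leveldb. The sketch below assumes a POSIX system with `/dev/null`: it temporarily lowers the soft open file limit, uses up the remaining descriptors, and then checks that opening an output stream fails. `OpenFailsWhenDescriptorsExhausted` is a hypothetical name chosen for this illustration.

```cpp
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>
#include <cstddef>
#include <fstream>
#include <vector>

// Returns true if opening `path` for writing fails once the process has
// used up its file descriptors. Restores the original limit before returning.
bool OpenFailsWhenDescriptorsExhausted(const char* path) {
  struct rlimit saved;
  if (getrlimit(RLIMIT_NOFILE, &saved) != 0) return false;
  struct rlimit lowered = saved;
  if (lowered.rlim_cur > 64) lowered.rlim_cur = 64;  // keep the demo cheap
  if (setrlimit(RLIMIT_NOFILE, &lowered) != 0) return false;

  // Use up every remaining descriptor, as a large file cache might.
  std::vector<int> fds;
  for (;;) {
    int fd = open("/dev/null", O_RDONLY);
    if (fd < 0) break;
    fds.push_back(fd);
  }

  // With no descriptor left, the stream cannot be opened, so any
  // later write or serialize call on it would fail too.
  std::ofstream output(path,
                       std::ios::out | std::ios::trunc | std::ios::binary);
  bool open_failed = !output.is_open();
  output.close();

  for (std::size_t i = 0; i < fds.size(); ++i) close(fds[i]);
  setrlimit(RLIMIT_NOFILE, &saved);
  return open_failed;
}
```

In a real run the descriptors are taken gradually, so whether the next snapshot succeeds depends on how many files happen to be open at that moment, which would match the error looking random.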


shelhamer commented on April 27, 2024

Symptom of the same leveldb number of open files issue as #13.

Solution is to modify src/caffe/layers/data_layer.cpp by setting options.max_open_files = 100 (or any number significantly lower than 1000) as discovered and confirmed by @reedscot, @Yangqing and @sguada.

Fixed by #154.


xiadaoxun commented on April 27, 2024

Solution is to modify src/caffe/layers/data_layer.cpp by setting options.max_open_files = 100 (or any number significantly lower than 1000)

Can you give some examples? Where is the code inserted?
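
As a rough illustration of the setting mentioned above: the option belongs on the `leveldb::Options` object that is passed to `leveldb::DB::Open`, so in Caffe it would go wherever the data layer opens its leveldb source. The fragment below is only a sketch and needs the leveldb headers and library to build; `OpenWithFewFiles` is a hypothetical helper, and the exact surrounding code in data_layer.cpp is not reproduced here.

```cpp
#include <string>
#include "leveldb/db.h"

// Hypothetical helper: opens an existing database while letting leveldb
// keep at most 100 files open (the default discussed in this thread is 1000).
leveldb::DB* OpenWithFewFiles(const std::string& path) {
  leveldb::Options options;
  options.create_if_missing = false;
  options.max_open_files = 100;  // any number well below the OS limit
  leveldb::DB* db = nullptr;
  leveldb::Status status = leveldb::DB::Open(options, path, &db);
  if (!status.ok()) return nullptr;
  return db;  // the caller owns the pointer and must delete it
}
```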
