I ran the IAM demo and it crashes part way through epoch 2 right after loading train.2


mdlstm IAM demo crashes after loading train.2.h5 (returnn, closed)

rwth-i6 commented on July 17, 2024

mdlstm IAM demo crashes after loading train.2.h5

from returnn.

Comments (10)

doetsch commented on July 17, 2024

Unfortunately I am not able to reproduce the error. Are you using the most recent commit? You can also try to deactivate caching by setting cache_size to "8G".


pvoigtlaender commented on July 17, 2024

I was able to reproduce it:

ValueError: could not broadcast input array from shape (151211,1) into shape (107131,1)

KeyboardInterrupt
train epoch 2, batch 191, cost:output 2.81097157796, elapsed 0:07:17, exp. remaining 1:44:52, complete 6.49%
1:44:52 [||||||||||||| 6.49%

But I don't know yet what the problem is.
Edit: setting cache_size to "8G" did not help, and it loads the data anyway:
1:47:03 [|||||||||| 4.81% ]running 2 sequence slices (473110 nts) of batch 141 on device gpu0
train epoch 2, batch 141, cost:output 3.09054326076, elapsed 0:05:26, exp. remaining 1:47:47, complete 4.81%
1:47:47 [|||||||||| 4.81% ]loading file features/raw/train.2.h5
running 2 sequence slices (463386 nts) of batch 142 on device gpu0
loading file features/raw/train.1.h5
TaskThread train failed
Unhandled exception <type 'exceptions.AssertionError'> in thread <TrainTaskThread(TaskThread train, started daemon 140624219232000)>, proc 23277.
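For readers unfamiliar with the ValueError above: NumPy raises it when an array is assigned into a destination whose shape it cannot be broadcast to. The snippet below is only a minimal sketch of that failure mode, assuming a buffer sized for one chunk of data receives data of a different length. The names `cache` and `incoming` are hypothetical and do not come from the returnn code; the shapes are the ones from the traceback.

```python
import numpy as np

# Hypothetical buffer allocated for 107131 frames (shape taken from the traceback).
cache = np.zeros((107131, 1), dtype="float32")
# Hypothetical incoming data with 151211 frames, e.g. from a different file.
incoming = np.zeros((151211, 1), dtype="float32")

message = ""
try:
    # The first dimensions differ and neither is 1, so broadcasting is impossible.
    cache[:] = incoming
except ValueError as exc:
    message = str(exc)

print(message)
```

This only shows how the message arises in general; it says nothing about where in returnn the mismatched sizes come from.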


pvoigtlaender commented on July 17, 2024

For a quick fix, you could try putting all the data into one file instead of two, although this does not solve the actual issue, of course.
You can also try an older commit. The demo used to work in earlier commits. If you can find out which commit broke it, it might be easy to fix.


cwig commented on July 17, 2024

Thanks for looking into this. I did try putting all the training data in one file and I still had the issue. I modified lines 203-209 of create_IAM_dataset.py to do that.

I'll try an older commit.


cwig commented on July 17, 2024

This doesn't solve the actual problem, but it worked when I reverted to commit 82be088.


pvoigtlaender commented on July 17, 2024

Is this the last commit which works? It would be very helpful to find it, so we can see which change was the problem.


cwig commented on July 17, 2024

I'm not sure. I have only tried two so far. a925c7a did not work so it is somewhere between a925c7a and 82be088.
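Narrowing down a range like this is a binary search over the commit history, which `git bisect` automates. The sketch below shows only the search logic, assuming the commits are ordered oldest to newest, the first is known good, and the last is known bad. The intermediate commit names and the `is_good` check are placeholders, not real commits or a real test of the demo.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the first failing commit.

    Assumes commits[0] is known good, commits[-1] is known bad,
    and that once a commit is bad all later ones are bad too.
    """
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Placeholder history: only the two endpoints come from this thread.
history = ["82be088", "placeholder-1", "placeholder-2", "placeholder-3", "a925c7a"]


def demo_runs(commit: str) -> bool:
    # Stand-in for "check out the commit and run the IAM demo".
    # Here we pretend the breakage was introduced in placeholder-2.
    return history.index(commit) < 2


culprit = first_bad_commit(history, demo_runs)
print(culprit)
```

Each step halves the remaining range, so even a long history needs only a handful of demo runs.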


doetsch commented on July 17, 2024

Could you try the most recent commit? There seems to be an issue with the cache size calculation on a few machines, and it took me a while to reproduce it. Hard-coding it to 16GB in config_real, as done in commit 2d1744c, resolved the issue for me on this machine.


pvoigtlaender commented on July 17, 2024

With cache_size set to 256G (as in the latest version in the repository), it works with the latest commit now. There seems to be a problem with the size calculation for two-dimensional data. So for now, just set cache_size really high.
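To give a sense of scale for the values mentioned in this thread, here is a small sketch that converts size strings such as "8G" or "256G" into bytes. It assumes binary units (1G = 1024**3 bytes); `parse_size` is a hypothetical helper written for illustration and is not returnn's own parsing code, which may interpret the suffixes differently.

```python
# Assumed binary multipliers; returnn may use a different convention.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> int:
    """Convert a size string like '8G' into a number of bytes (hypothetical helper)."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)  # plain number of bytes, no suffix


for value in ("8G", "16G", "256G"):
    print(value, parse_size(value))
```

Under this assumption, 256G is 32 times larger than 8G, so the workaround amounts to making the cache limit far larger than the data set could ever need.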


doetsch commented on July 17, 2024

The fix has been confirmed on three independent machines. Therefore I am closing this issue.

