Comments (10)
Unfortunately I am not able to reproduce the error. Are you using the most recent commit? You can also try to deactivate caching by setting cache_size to "8G".
from returnn.
I was able to reproduce it:
ValueError: could not broadcast input array from shape (151211,1) into shape (107131,1)
KeyboardInterrupt
train epoch 2, batch 191, cost:output 2.81097157796, elapsed 0:07:17, exp. remaining 1:44:52, complete 6.49%
1:44:52 [||||||||||||| 6.49%
But I don't know yet what the problem is.
edit: setting cache_size to "8G" did not help, and it loads the data anyway:
1:47:03 [|||||||||| 4.81% ]running 2 sequence slices (473110 nts) of batch 141 on device gpu0
train epoch 2, batch 141, cost:output 3.09054326076, elapsed 0:05:26, exp. remaining 1:47:47, complete 4.81%
1:47:47 [|||||||||| 4.81% ]loading file features/raw/train.2.h5
running 2 sequence slices (463386 nts) of batch 142 on device gpu0
loading file features/raw/train.1.h5
TaskThread train failed
Unhandled exception <type 'exceptions.AssertionError'> in thread <TrainTaskThread(TaskThread train, started daemon 140624219232000)>, proc 23277.
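For readers unfamiliar with the ValueError quoted in this comment: it is the message NumPy raises when an array is assigned into a destination of a different, non-broadcastable shape. The following is a minimal sketch of that failure mode, assuming a preallocated buffer that is smaller than the data being copied into it. The function name `copy_into_buffer` and the scenario are hypothetical illustrations, not RETURNN's actual code.

```python
import numpy as np


def copy_into_buffer(buffer, data):
    """Copy `data` into a preallocated `buffer` (hypothetical helper)."""
    # Slice assignment requires the shapes to match or be broadcastable;
    # otherwise NumPy raises ValueError.
    buffer[:] = data
    return buffer


# Assumed scenario: the buffer was sized for 107131 frames,
# but the sequence being loaded has 151211 frames.
buffer = np.zeros((107131, 1), dtype="float32")
data = np.ones((151211, 1), dtype="float32")

try:
    copy_into_buffer(buffer, data)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If the real cause is similar, the mismatch would come from whatever computes the destination size, which fits the cache size discussion later in this thread.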
from returnn.
As a quick fix, you could try putting all the data into one file instead of two, although this of course does not solve the actual issue.
You can also try an older commit. The demo used to work in earlier commits. If you can find out which commit broke it, it might be easy to fix.
from returnn.
Thanks for looking into this. I did try putting all the training data in one file and I still had the issue. I modified lines 203-209 of create_IAM_dataset.py to do this.
I'll try an older commit.
from returnn.
This didn't solve the actual problem, but it worked when I reverted to commit 82be088.
from returnn.
Is this the last commit which works? It would be very helpful to find it, so we can see which change was the problem.
from returnn.
I'm not sure. I have only tried two so far. a925c7a did not work, so the breaking change is somewhere between a925c7a and 82be088.
from returnn.
Could you try the most recent commit? There seems to be an issue with the cache size calculation on a few machines, and it took me a while to reproduce it. Hard-coding it to 16GB in config_real, as done by commit 2d1744c, resolved the issue for me on this machine.
from returnn.
With cache_size set to 256G (as in the latest version in the repository), it now works with the latest commit. There seems to be a problem with the size calculation for two-dimensional data, so for now just set cache_size really high.
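To make the size discussion concrete, here is a rough sketch of how a cache_size string such as "8G" or "256G" could be turned into bytes and compared with the memory a two-dimensional array needs. The helpers `parse_size` and `fits_in_cache` are hypothetical, the binary units (1G = 1024^3 bytes) are an assumption, and none of this is taken from RETURNN's implementation.

```python
def parse_size(value):
    """Convert a size string like "8G" to bytes (hypothetical helper).

    Assumes binary units: K = 1024, M = 1024**2, G = 1024**3, T = 1024**4.
    A plain number is interpreted as a byte count.
    """
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}
    text = str(value).strip().upper()
    if text and text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)


def fits_in_cache(num_frames, feature_dim, bytes_per_value, cache_size):
    """Check whether a (num_frames, feature_dim) array fits into the cache."""
    # For two-dimensional data both dimensions must enter the estimate;
    # dropping one of them would underestimate the required memory.
    needed = num_frames * feature_dim * bytes_per_value
    return needed <= parse_size(cache_size)


print(parse_size("8G"))
print(fits_in_cache(151211, 1, 4, "256G"))
```

An estimate that leaves out one dimension would reserve too little space, which is one way a shape mismatch like the one reported earlier could arise. Whether that is what happened here is not confirmed in this thread.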
from returnn.
The fix has been confirmed on three independent machines. Therefore I am closing this issue.
from returnn.