Comments (10)
(What I previously wrote was wrong, I updated my answer)
Hi,
The message about dotparser is not important; the real error (or at least one of them) is this:
ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
Not able to select available GPU from 4 cards (all CUDA-capable devices are busy or unavailable).
This seems to be an important hint:
UserWarning: Theano flag device=gpu* (old gpu back-end) only support floatX=float32. You have floatX=float64. Use the new gpu back-end with device=cuda* for that value of floatX.
Please change your .theanorc so that you have device=cpu and floatX=float32 and try again.
from returnn.
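For reference, a minimal .theanorc matching the suggestion above might look like the sketch below; device and floatX are the two flags named in the warning, and any other options are left at their defaults.

```ini
[global]
device = cpu
floatX = float32
```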
Thanks. After installing Theano and h5py for Python 3 and setting up .theanorc, the above error is gone.
from returnn.
Hi,
So far I have done the following:
- Downloaded the toolkit.
- Installed Theano and h5py.
- Created .theanorc.
- Provided paths in .bashrc.
Does the RETURNN toolkit need to be compiled?
I am getting an error that cuDNN is not available (cudnn.h: no such file or directory), even though I have provided the path for cudnn.h in my .bashrc. I am getting the following error:
Your job 4817836 ("go.sh") has been submitted
('converting IAM_lines to', 'features/raw/demo.h5')
features/raw/demo.h5
(0, '/', 3)
RETURNN starting up, version 20171127.183959--git-94c0542-dirty, pid 134196, cwd /export/b18/aarora/returnn_IAM/demos/mdlstm/IAM
RETURNN command line options: ['config_demo']
Theano: 0.9.0 (<site-package> in /home/aaror/.local/lib/python3.4/site-packages/theano)
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
Device gpuX proc starting up, pid 134214
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True'
Using gpu device 0: GeForce GTX 1080 Ti (CNMeM is enabled with initial size: 87.6% of memory, cuDNN not available)
Device train-network: Used data keys: ['classes', 'data', 'sizes']
using adam with nag and momentum schedule
Device gpuX proc exception: ('The following error happened while compiling the node', CuDNNConvHWBCOp{border_mode='valid'}(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0), '\n', "We can't determine the cudnn version as it is not available", 'Can not compile with cuDNN. We got this error:\nb"nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).\n/tmp/4817836.1.g.q/try_flags_i7yj23xl.c:5:19: fatal error: cudnn.h: No such file or directory\n #include <cudnn.h>\n
from returnn.
Hi,
what exactly did you do for
"provided paths in .bashrc"?
For me it looks like
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/voigtlaender/cudnn/
export LIBRARY_PATH=$LIBRARY_PATH:/home/voigtlaender/cudnn/
export CPATH=$CPATH:/home/voigtlaender/cudnn/
and I linked all .h and .so files of cudnn into this folder.
from returnn.
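To make the three exports above concrete, they can be written with a single variable for the cuDNN directory. The /opt/cudnn path here is a placeholder; substitute wherever your cudnn.h and libcudnn.so actually live:

```shell
# Placeholder cuDNN location -- replace with your actual install path.
CUDNN_DIR=/opt/cudnn

# The compiler searches CPATH for headers (cudnn.h), the linker searches
# LIBRARY_PATH for libraries at build time, and the dynamic loader searches
# LD_LIBRARY_PATH for libcudnn.so at run time.
export CPATH="$CUDNN_DIR:$CPATH"
export LIBRARY_PATH="$CUDNN_DIR:$LIBRARY_PATH"
export LD_LIBRARY_PATH="$CUDNN_DIR:$LD_LIBRARY_PATH"
```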
Thank you. Currently my .bashrc looks like:
CUDA_HOME=/usr/local/cuda
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/export/a11/hlyu/cudnn/include"
PATH="${CUDA_HOME}/bin:${PATH}"
PATH="$PATH:$HOME/.local/bin"
CLASSPATH=$PATH:"."
export PATH
export CLASSPATH
LD_LIBRARY_PATH=/home/dpovey/libs/
LD_LIBRARY_PATH=/home/gkumar/.local/include/:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=/export/a11/hlyu/cudnn/lib64:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=/export/a11/hlyu/cudnn/include/:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
I will also add other paths now.
from returnn.
I am getting a type error, "concatenation of dict with a list". I haven't changed anything in the code. Can you please help me debug this error? I am currently working with the demo config (3 images) for IAM.
('converting IAM_lines to', 'features/raw/demo.h5')
features/raw/demo.h5
(0, '/', 3)
RETURNN starting up, version 20171127.183959--git-94c0542-dirty, pid 28332, cwd /export/b18/aarora/returnn_IAM/demos/mdlstm/IAM
RETURNN command line options: ['config_demo']
Theano: 0.9.0 (<site-package> in /home/aaror/.local/lib/python3.4/site-packages/theano)
pynvml not available, memory information missing
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
Using gpu device 3: Tesla K10.G2.8GB (CNMeM is enabled with initial size: 87.6% of memory, cuDNN 5103)
Device gpuX proc starting up, pid 29255
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True'
Device train-network: Used data keys: ['classes', 'data', 'sizes']
using adam with nag and momentum schedule
Device gpu3 proc, pid 29255 is ready for commands.
Devices: Used in multiprocessing mode.
loading file features/raw/demo.h5
cached 3 seqs 0.00854282453656 GB (fully loaded, 65.9914571758 GB left over)
Train data:
input: 1 x 1
output: {'data': [1, 2], 'sizes': [2, 1], 'classes': [79, 1]}
HDF dataset, sequences: 3, frames: 764399
Devices:
gpu3: Tesla K10.G2.8GB (units: 1000 clock: 1.00Ghz memory: 2.0GB) working on 1 batch (update on device)
Learning-rate-control: no file specified, not saving history (no proper restart possible)
using adam with nag and momentum schedule
Network layer topology:
input #: 1
hidden 1Dto2D '1Dto2D' #: 1
hidden source 'classes_source' #: 2
hidden conv2 'conv0' #: 15
hidden conv2 'conv1' #: 45
hidden conv2 'conv2' #: 75
hidden conv2 'conv3' #: 105
hidden conv2 'conv4' #: 105
hidden mdlstm 'mdlstm0' #: 30
hidden mdlstm 'mdlstm1' #: 60
hidden mdlstm 'mdlstm2' #: 90
hidden mdlstm 'mdlstm3' #: 120
hidden mdlstm 'mdlstm4' #: 120
output softmax 'output' #: 80
net params #: 2627660
net trainable params: [W_conv0, b_conv0, W_conv1, b_conv1, W_conv2, b_conv2, W_conv3, b_conv3, W_conv4, b_conv4, U1_mdlstm0, U2_mdlstm0, U3_mdlstm0, U4_mdlstm0, V1_mdlstm0, V2_mdlstm0, V3_mdlstm0, V4_mdlstm0, W1_mdlstm0, W2_mdlstm0, W3_mdlstm0, W4_mdlstm0, b1_mdlstm0, b2_mdlstm0, b3_mdlstm0, b4_mdlstm0, U1_mdlstm1, U2_mdlstm1, U3_mdlstm1, U4_mdlstm1, V1_mdlstm1, V2_mdlstm1, V3_mdlstm1, V4_mdlstm1, W1_mdlstm1, W2_mdlstm1, W3_mdlstm1, W4_mdlstm1, b1_mdlstm1, b2_mdlstm1, b3_mdlstm1, b4_mdlstm1, U1_mdlstm2, U2_mdlstm2, U3_mdlstm2, U4_mdlstm2, V1_mdlstm2, V2_mdlstm2, V3_mdlstm2, V4_mdlstm2, W1_mdlstm2, W2_mdlstm2, W3_mdlstm2, W4_mdlstm2, b1_mdlstm2, b2_mdlstm2, b3_mdlstm2, b4_mdlstm2, U1_mdlstm3, U2_mdlstm3, U3_mdlstm3, U4_mdlstm3, V1_mdlstm3, V2_mdlstm3, V3_mdlstm3, V4_mdlstm3, W1_mdlstm3, W2_mdlstm3, W3_mdlstm3, W4_mdlstm3, b1_mdlstm3, b2_mdlstm3, b3_mdlstm3, b4_mdlstm3, U1_mdlstm4, U2_mdlstm4, U3_mdlstm4, U4_mdlstm4, V1_mdlstm4, V2_mdlstm4, V3_mdlstm4, V4_mdlstm4, W1_mdlstm4, W2_mdlstm4, W3_mdlstm4, W4_mdlstm4, b1_mdlstm4, b2_mdlstm4, b3_mdlstm4, b4_mdlstm4, W_in_mdlstm4_output, b_output]
start training at epoch 1 and batch 0
using batch size: 600000, max seqs: 10
learning rate control: ConstantLearningRate(defaultLearningRate=0.0005, minLearningRate=0.0, defaultLearningRates={1: 0.0005, 25: 0.0003, 35: 0.0001}, errorMeasureKey=None, relativeErrorAlsoRelativeToLearningRate=False, minNumEpochsPerNewLearningRate=0, filename=None), epoch data: 1: EpochData(learningRate=0.0005, error={}), 25: EpochData(learningRate=0.0003, error={}), 35: EpochData(learningRate=0.0001, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu3
EXCEPTION
Traceback (most recent call last):
File "/export/b18/aarora/returnn_IAM/EngineTask.py", line 284, in run
line: self.finish()
locals:
self = <local> <DeviceBatchRun(DeviceThread gpu3, started daemon 47895785236224)>
self.finish = <local> <bound method DeviceBatchRun.finish of <DeviceBatchRun(DeviceThread gpu3, started daemon 47895785236224)>>
File "/export/b18/aarora/returnn_IAM/EngineTask.py", line 270, in finish
line: self.eval_info = self.parent.evaluate(**self.result)
locals:
self = <local> <DeviceBatchRun(DeviceThread gpu3, started daemon 47895785236224)>
self.eval_info = <local> None
self.parent = <local> <TrainTaskThread(TaskThread train, started daemon 47895616210688)>
self.parent.evaluate = <local> <bound method TrainTaskThread.evaluate of <TrainTaskThread(TaskThread train, started daemon 47895616210688)>>
self.result = <local> {'batchess': [[<Batch start_seq:2, #seqs:1>]], 'result_format': ['cost:output', 'ctc_priors'], 'num_frames': NumbersDict(numbers_dict={'data': 485990, 'classes': 77, 'sizes': 4}, broadcast_value=0), 'results': [[array(1527.3756103515625, dtype=float32), array([ 0. , 19.13910294, 6.3...
File "/export/b18/aarora/returnn_IAM/EngineTask.py", line 141, in evaluate
line: target = self._get_target_for_key(key)
locals:
target = <not found>
self = <local> <TrainTaskThread(TaskThread train, started daemon 47895616210688)>
self._get_target_for_key = <local> <bound method TrainTaskThread._get_target_for_key of <TrainTaskThread(TaskThread train, started daemon 47895616210688)>>
key = <local> 'cost:output', len = 11
File "/export/b18/aarora/returnn_IAM/EngineTask.py", line 168, in _get_target_for_key
line: available_data_keys = self.data.get_data_keys()
locals:
available_data_keys = <not found>
self = <local> <TrainTaskThread(TaskThread train, started daemon 47895616210688)>
self.data = <local> <HDFDataset 'train'>
self.data.get_data_keys = <local> <bound method HDFDataset.get_data_keys of <HDFDataset 'train'>>
File "/export/b18/aarora/returnn_IAM/Dataset.py", line 417, in get_data_keys
line: return ["data"] + self.get_target_list()
locals:
self = <local> <HDFDataset 'train'>
self.get_target_list = <local> <bound method HDFDataset.get_target_list of <HDFDataset 'train'>>
TypeError: can only concatenate list (not "dict_keys") to list
Device gpuX proc, pid 29255: Parent seem to have died: recv_bytes EOFError:
from returnn.
Please check out the newest version. This is a compatibility problem with Python 3 which should have been fixed by a commit 7 days ago:
c0fec6e
from returnn.
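For anyone hitting the same traceback on an older checkout: under Python 2, dict.keys() returns a list, so concatenating it onto a list works; under Python 3 it returns a dict_keys view, which cannot be added to a list. A minimal reproduction, with a hypothetical targets dict standing in for the dataset's target map:

```python
# Hypothetical stand-in for the dataset's target map (not RETURNN's real data).
targets = {"classes": 79, "sizes": 2}

# Under Python 3, dict.keys() returns a dict_keys view, not a list,
# so concatenating it onto a list raises TypeError.
try:
    keys = ["data"] + targets.keys()
except TypeError as err:
    print(err)  # can only concatenate list (not "dict_keys") to list

# One standard fix: wrap the view in list() before concatenating.
keys = ["data"] + list(targets.keys())
print(keys)  # ['data', 'classes', 'sizes']
```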
Thank you so much for the help. I am now able to run it for the demo dataset. I am now making modifications to run it with the usual splits. I am getting the following logs on a small dataset.
('converting IAM_lines to', 'features/raw/demo.h5')
features/raw/demo.h5
(0, '/', 3)
Couldn't import dot_parser, loading of dot files will not be possible.
RETURNN starting up, version 20171212.115418--git-95f0a14-dirty, pid 33372, cwd /export/b01/aarora8/returnn/demos/mdlstm/IAM
RETURNN command line options: ['config_demo']
Theano: 0.9.0 (<site-package> in /home/aaror/.local/lib/python2.7/site-packages/theano)
Couldn't import dot_parser, loading of dot files will not be possible.
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
Using gpu device 2: GeForce GTX 1080 Ti (CNMeM is enabled with initial size: 87.6% of memory, cuDNN 5103)
Device gpuX proc starting up, pid 33455
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True'
Device train-network: Used data keys: ['classes', 'data', 'sizes']
using adam with nag and momentum schedule
Device gpu2 proc, pid 33455 is ready for commands.
Devices: Used in multiprocessing mode.
loading file features/raw/demo.h5
cached 3 seqs 0.00854282453656 GB (fully loaded, 14.6581238424 GB left over)
Train data:
input: 1 x 1
output: {u'classes': [79, 1], 'data': [1, 2], u'sizes': [2, 1]}
HDF dataset, sequences: 3, frames: 764399
Devices:
gpu2: GeForce GTX 1080 Ti (units: 3584 clock: 1.45Ghz memory: 11.0GB) working on 1 batch (update on device)
Learning-rate-control: no file specified, not saving history (no proper restart possible)
using adam with nag and momentum schedule
Network layer topology:
input #: 1
hidden 1Dto2D '1Dto2D' #: 1
hidden source 'classes_source' #: 2
hidden conv2 'conv0' #: 15
hidden conv2 'conv1' #: 45
hidden conv2 'conv2' #: 75
hidden conv2 'conv3' #: 105
hidden conv2 'conv4' #: 105
hidden mdlstm 'mdlstm0' #: 30
hidden mdlstm 'mdlstm1' #: 60
hidden mdlstm 'mdlstm2' #: 90
hidden mdlstm 'mdlstm3' #: 120
hidden mdlstm 'mdlstm4' #: 120
output softmax 'output' #: 80
net params #: 2627660
net trainable params: [W_conv0, b_conv0, W_conv1, b_conv1, W_conv2, b_conv2, W_conv3, b_conv3, W_conv4, b_conv4, U1_mdlstm0, U2_mdlstm0, U3_mdlstm0, U4_mdlstm0, V1_mdlstm0, V2_mdlstm0, V3_mdlstm0, V4_mdlstm0, W1_mdlstm0, W2_mdlstm0, W3_mdlstm0, W4_mdlstm0, b1_mdlstm0, b2_mdlstm0, b3_mdlstm0, b4_mdlstm0, U1_mdlstm1, U2_mdlstm1, U3_mdlstm1, U4_mdlstm1, V1_mdlstm1, V2_mdlstm1, V3_mdlstm1, V4_mdlstm1, W1_mdlstm1, W2_mdlstm1, W3_mdlstm1, W4_mdlstm1, b1_mdlstm1, b2_mdlstm1, b3_mdlstm1, b4_mdlstm1, U1_mdlstm2, U2_mdlstm2, U3_mdlstm2, U4_mdlstm2, V1_mdlstm2, V2_mdlstm2, V3_mdlstm2, V4_mdlstm2, W1_mdlstm2, W2_mdlstm2, W3_mdlstm2, W4_mdlstm2, b1_mdlstm2, b2_mdlstm2, b3_mdlstm2, b4_mdlstm2, U1_mdlstm3, U2_mdlstm3, U3_mdlstm3, U4_mdlstm3, V1_mdlstm3, V2_mdlstm3, V3_mdlstm3, V4_mdlstm3, W1_mdlstm3, W2_mdlstm3, W3_mdlstm3, W4_mdlstm3, b1_mdlstm3, b2_mdlstm3, b3_mdlstm3, b4_mdlstm3, U1_mdlstm4, U2_mdlstm4, U3_mdlstm4, U4_mdlstm4, V1_mdlstm4, V2_mdlstm4, V3_mdlstm4, V4_mdlstm4, W1_mdlstm4, W2_mdlstm4, W3_mdlstm4, W4_mdlstm4, b1_mdlstm4, b2_mdlstm4, b3_mdlstm4, b4_mdlstm4, W_in_mdlstm4_output, b_output]
start training at epoch 1 and batch 0
using batch size: 600000, max seqs: 10
learning rate control: ConstantLearningRate(defaultLearningRate=0.0005, minLearningRate=0.0, defaultLearningRates={1: 0.0005, 25: 0.0003, 35: 0.0001}, errorMeasureKey=None, relativeErrorAlsoRelativeToLearningRate=False, minNumEpochsPerNewLearningRate=0, filename=None), epoch data: 1: EpochData(learningRate=0.0005, error={}), 25: EpochData(learningRate=0.0003, error={}), 35: EpochData(learningRate=0.0001, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu2
train epoch 1, batch 0, cost:output 19.8360468877, elapsed 0:00:01, exp. remaining 0:00:00, complete 100.00%
running 1 sequence slices (278409 nts) of batch 1 on device gpu2
train epoch 1, batch 1, cost:output 16.7365370009, elapsed 0:00:02, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:02, 96.82% computing, 0.18% updating data
Save model from epoch 1 under models/mdlstm_demo.001
Learning-rate-control: error key 'train_score' from {'train_score': 18.692785044185452}
epoch 1 score: 18.6927850442 elapsed: 0:00:02
start epoch 2 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu2
train epoch 2, batch 0, cost:output 19.1958261465, elapsed 0:00:01, exp. remaining 0:00:00, complete 100.00%
running 1 sequence slices (278409 nts) of batch 1 on device gpu2
train epoch 2, batch 1, cost:output 15.5196750217, elapsed 0:00:02, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:02, 96.81% computing, 0.08% updating data
Save model from epoch 2 under models/mdlstm_demo.002
epoch 2 score: 17.8398687644 elapsed: 0:00:02
start epoch 3 with learning rate 0.0005 ...
starting task train
running 1 sequence slices (278409 nts) of batch 0 on device gpu2
train epoch 3, batch 0, cost:output 12.6697686089, elapsed 0:00:01, exp. remaining 0:00:00, complete 100.00%
running 2 sequence slices (569522 nts) of batch 1 on device gpu2
train epoch 3, batch 1, cost:output 10.2576254312, elapsed 0:00:02, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:02, 96.56% computing, 0.09% updating data
Save model from epoch 3 under models/mdlstm_demo.003
epoch 3 score: 11.1473503738 elapsed: 0:00:02
start epoch 4 with learning rate 0.0005 ...
starting task train
running 1 sequence slices (278409 nts) of batch 0 on device gpu2
train epoch 4, batch 0, cost:output 11.5054497613, elapsed 0:00:01, exp. remaining 0:00:00, complete 100.00%
running 2 sequence slices (569522 nts) of batch 1 on device gpu2
train epoch 4, batch 1, cost:output 7.69990281935, elapsed 0:00:02, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:02, 96.72% computing, 0.13% updating data
Save model from epoch 4 under models/mdlstm_demo.004
epoch 4 score: 9.10358816678 elapsed: 0:00:02
start epoch 5 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu2
train epoch 5, batch 0, cost:output 6.47300938198, elapsed 0:00:01, exp. remaining 0:00:00, complete 100.00%
running 1 sequence slices (278409 nts) of batch 1 on device gpu2
train epoch 5, batch 1, cost:output 4.01199374729, elapsed 0:00:02, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:02, 96.68% computing, 0.09% updating data
Save model from epoch 5 under models/mdlstm_demo.005
epoch 5 score: 5.56525771344 elapsed: 0:00:02
start epoch 6 with learning rate 0.0005 ...
from returnn.
Thank you for the help. I am able to run the multidimensional LSTM code on the usual splits. It is on the 5th epoch, so I guess it will run fine.
from returnn.
The issue seems to be resolved, so I'll close it for now.
from returnn.