Comments (10)
(What I previously wrote was wrong, I updated my answer)
Hi,
The message about dotparser is not important; the real error (or at least one of them) is this:
ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
Not able to select available GPU from 4 cards (all CUDA-capable devices are busy or unavailable).
This seems to be an important hint:
UserWarning: Theano flag device=gpu* (old gpu back-end) only support floatX=float32. You have floatX=float64. Use the new gpu back-end with device=cuda* for that value of floatX.
Please change your .theanorc so that you have device=cpu and floatX=float32 and try again.
from returnn.
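For reference, a minimal .theanorc matching the suggestion above might look like the sketch below; device and floatX are the two flags named in the warning, and any other options are left at their defaults.

```ini
[global]
device = cpu
floatX = float32
```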
Thanks. After installing Theano and h5py for Python 3 and setting up .theanorc, the above error is gone.
from returnn.
Hi,
So far I have done the following:
- Downloaded the toolkit.
- Installed Theano and h5py.
- Created .theanorc.
- Provided paths in .bashrc.
Does the RETURNN toolkit need to be compiled?
I am getting an error that cuDNN is not available (cudnn.h: no such file or directory), even though I have provided the path for cudnn.h in my .bashrc. I am getting the following error:
Your job 4817836 ("go.sh") has been submitted
('converting IAM_lines to', 'features/raw/demo.h5')
features/raw/demo.h5
(0, '/', 3)
RETURNN starting up, version 20171127.183959--git-94c0542-dirty, pid 134196, cwd /export/b18/aarora/returnn_IAM/demos/mdlstm/IAM
RETURNN command line options: ['config_demo']
Theano: 0.9.0 (<site-package> in /home/aaror/.local/lib/python3.4/site-packages/theano)
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
Device gpuX proc starting up, pid 134214
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True'
Using gpu device 0: GeForce GTX 1080 Ti (CNMeM is enabled with initial size: 87.6% of memory, cuDNN not available)
Device train-network: Used data keys: ['classes', 'data', 'sizes']
using adam with nag and momentum schedule
Device gpuX proc exception: ('The following error happened while compiling the node', CuDNNConvHWBCOp{border_mode='valid'}(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0), '\n', "We can't determine the cudnn version as it is not available", 'Can not compile with cuDNN. We got this error:\nb"nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).\n/tmp/4817836.1.g.q/try_flags_i7yj23xl.c:5:19: fatal error: cudnn.h: No such file or directory\n #include <cudnn.h>\n
from returnn.
Hi,
what exactly did you do for
"provided paths in .bashrc"?
For me it looks like
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/voigtlaender/cudnn/
export LIBRARY_PATH=$LIBRARY_PATH:/home/voigtlaender/cudnn/
export CPATH=$CPATH:/home/voigtlaender/cudnn/
and I linked all .h and .so files of cudnn into this folder.
from returnn.
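To make the three exports above concrete, they can be written with a single variable for the cuDNN directory. The /opt/cudnn path here is a placeholder; substitute wherever your cudnn.h and libcudnn.so actually live:

```shell
# Placeholder cuDNN location -- replace with your actual install path.
CUDNN_DIR=/opt/cudnn

# The compiler searches CPATH for headers (cudnn.h), the linker searches
# LIBRARY_PATH for libraries at build time, and the dynamic loader searches
# LD_LIBRARY_PATH for libcudnn.so at run time.
export CPATH="$CUDNN_DIR:$CPATH"
export LIBRARY_PATH="$CUDNN_DIR:$LIBRARY_PATH"
export LD_LIBRARY_PATH="$CUDNN_DIR:$LD_LIBRARY_PATH"
```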
Thank you. Currently my .bashrc looks like:
CUDA_HOME=/usr/local/cuda
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/export/a11/hlyu/cudnn/include"
PATH="${CUDA_HOME}/bin:${PATH}"
PATH="$PATH:$HOME/.local/bin"
CLASSPATH=$PATH:"."
export PATH
export CLASSPATH
LD_LIBRARY_PATH=/home/dpovey/libs/
LD_LIBRARY_PATH=/home/gkumar/.local/include/:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=/export/a11/hlyu/cudnn/lib64:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=/export/a11/hlyu/cudnn/include/:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
I will also add other paths now.
from returnn.
I am getting a type error, "concatenation of dict with a list". I haven't changed anything in the code. Can you please help me debug this error? I am currently working with the demo config (3 images) for IAM.
('converting IAM_lines to', 'features/raw/demo.h5')
features/raw/demo.h5
(0, '/', 3)
RETURNN starting up, version 20171127.183959--git-94c0542-dirty, pid 28332, cwd /export/b18/aarora/returnn_IAM/demos/mdlstm/IAM
RETURNN command line options: ['config_demo']
Theano: 0.9.0 (<site-package> in /home/aaror/.local/lib/python3.4/site-packages/theano)
pynvml not available, memory information missing
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
Using gpu device 3: Tesla K10.G2.8GB (CNMeM is enabled with initial size: 87.6% of memory, cuDNN 5103)
Device gpuX proc starting up, pid 29255
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True'
Device train-network: Used data keys: ['classes', 'data', 'sizes']
using adam with nag and momentum schedule
Device gpu3 proc, pid 29255 is ready for commands.
Devices: Used in multiprocessing mode.
loading file features/raw/demo.h5
cached 3 seqs 0.00854282453656 GB (fully loaded, 65.9914571758 GB left over)
Train data:
input: 1 x 1
output: {'data': [1, 2], 'sizes': [2, 1], 'classes': [79, 1]}
HDF dataset, sequences: 3, frames: 764399
Devices:
gpu3: Tesla K10.G2.8GB (units: 1000 clock: 1.00Ghz memory: 2.0GB) working on 1 batch (update on device)
Learning-rate-control: no file specified, not saving history (no proper restart possible)
using adam with nag and momentum schedule
Network layer topology:
input #: 1
hidden 1Dto2D '1Dto2D' #: 1
hidden source 'classes_source' #: 2
hidden conv2 'conv0' #: 15
hidden conv2 'conv1' #: 45
hidden conv2 'conv2' #: 75
hidden conv2 'conv3' #: 105
hidden conv2 'conv4' #: 105
hidden mdlstm 'mdlstm0' #: 30
hidden mdlstm 'mdlstm1' #: 60
hidden mdlstm 'mdlstm2' #: 90
hidden mdlstm 'mdlstm3' #: 120
hidden mdlstm 'mdlstm4' #: 120
output softmax 'output' #: 80
net params #: 2627660
net trainable params: [W_conv0, b_conv0, W_conv1, b_conv1, W_conv2, b_conv2, W_conv3, b_conv3, W_conv4, b_conv4, U1_mdlstm0, U2_mdlstm0, U3_mdlstm0, U4_mdlstm0, V1_mdlstm0, V2_mdlstm0, V3_mdlstm0, V4_mdlstm0, W1_mdlstm0, W2_mdlstm0, W3_mdlstm0, W4_mdlstm0, b1_mdlstm0, b2_mdlstm0, b3_mdlstm0, b4_mdlstm0, U1_mdlstm1, U2_mdlstm1, U3_mdlstm1, U4_mdlstm1, V1_mdlstm1, V2_mdlstm1, V3_mdlstm1, V4_mdlstm1, W1_mdlstm1, W2_mdlstm1, W3_mdlstm1, W4_mdlstm1, b1_mdlstm1, b2_mdlstm1, b3_mdlstm1, b4_mdlstm1, U1_mdlstm2, U2_mdlstm2, U3_mdlstm2, U4_mdlstm2, V1_mdlstm2, V2_mdlstm2, V3_mdlstm2, V4_mdlstm2, W1_mdlstm2, W2_mdlstm2, W3_mdlstm2, W4_mdlstm2, b1_mdlstm2, b2_mdlstm2, b3_mdlstm2, b4_mdlstm2, U1_mdlstm3, U2_mdlstm3, U3_mdlstm3, U4_mdlstm3, V1_mdlstm3, V2_mdlstm3, V3_mdlstm3, V4_mdlstm3, W1_mdlstm3, W2_mdlstm3, W3_mdlstm3, W4_mdlstm3, b1_mdlstm3, b2_mdlstm3, b3_mdlstm3, b4_mdlstm3, U1_mdlstm4, U2_mdlstm4, U3_mdlstm4, U4_mdlstm4, V1_mdlstm4, V2_mdlstm4, V3_mdlstm4, V4_mdlstm4, W1_mdlstm4, W2_mdlstm4, W3_mdlstm4, W4_mdlstm4, b1_mdlstm4, b2_mdlstm4, b3_mdlstm4, b4_mdlstm4, W_in_mdlstm4_output, b_output]
start training at epoch 1 and batch 0
using batch size: 600000, max seqs: 10
learning rate control: ConstantLearningRate(defaultLearningRate=0.0005, minLearningRate=0.0, defaultLearningRates={1: 0.0005, 25: 0.0003, 35: 0.0001}, errorMeasureKey=None, relativeErrorAlsoRelativeToLearningRate=False, minNumEpochsPerNewLearningRate=0, filename=None), epoch data: 1: EpochData(learningRate=0.0005, error={}), 25: EpochData(learningRate=0.0003, error={}), 35: EpochData(learningRate=0.0001, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu3
EXCEPTION
Traceback (most recent call last):
File "/export/b18/aarora/returnn_IAM/EngineTask.py", line 284, in run
line: self.finish()
locals:
self = <local> <DeviceBatchRun(DeviceThread gpu3, started daemon 47895785236224)>
self.finish = <local> <bound method DeviceBatchRun.finish of <DeviceBatchRun(DeviceThread gpu3, started daemon 47895785236224)>>
File "/export/b18/aarora/returnn_IAM/EngineTask.py", line 270, in finish
line: self.eval_info = self.parent.evaluate(**self.result)
locals:
self = <local> <DeviceBatchRun(DeviceThread gpu3, started daemon 47895785236224)>
self.eval_info = <local> None
self.parent = <local> <TrainTaskThread(TaskThread train, started daemon 47895616210688)>
self.parent.evaluate = <local> <bound method TrainTaskThread.evaluate of <TrainTaskThread(TaskThread train, started daemon 47895616210688)>>
self.result = <local> {'batchess': [[<Batch start_seq:2, #seqs:1>]], 'result_format': ['cost:output', 'ctc_priors'], 'num_frames': NumbersDict(numbers_dict={'data': 485990, 'classes': 77, 'sizes': 4}, broadcast_value=0), 'results': [[array(1527.3756103515625, dtype=float32), array([ 0. , 19.13910294, 6.3...
File "/export/b18/aarora/returnn_IAM/EngineTask.py", line 141, in evaluate
line: target = self._get_target_for_key(key)
locals:
target = <not found>
self = <local> <TrainTaskThread(TaskThread train, started daemon 47895616210688)>
self._get_target_for_key = <local> <bound method TrainTaskThread._get_target_for_key of <TrainTaskThread(TaskThread train, started daemon 47895616210688)>>
key = <local> 'cost:output', len = 11
File "/export/b18/aarora/returnn_IAM/EngineTask.py", line 168, in _get_target_for_key
line: available_data_keys = self.data.get_data_keys()
locals:
available_data_keys = <not found>
self = <local> <TrainTaskThread(TaskThread train, started daemon 47895616210688)>
self.data = <local> <HDFDataset 'train'>
self.data.get_data_keys = <local> <bound method HDFDataset.get_data_keys of <HDFDataset 'train'>>
File "/export/b18/aarora/returnn_IAM/Dataset.py", line 417, in get_data_keys
line: return ["data"] + self.get_target_list()
locals:
self = <local> <HDFDataset 'train'>
self.get_target_list = <local> <bound method HDFDataset.get_target_list of <HDFDataset 'train'>>
TypeError: can only concatenate list (not "dict_keys") to list
Device gpuX proc, pid 29255: Parent seem to have died: recv_bytes EOFError:
from returnn.
Please check out the newest version. This is a compatibility problem with Python 3 which should have been fixed by a commit 7 days ago:
c0fec6e
from returnn.
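For anyone hitting the same traceback on an older checkout: under Python 2, dict.keys() returns a list, so concatenating it onto a list works; under Python 3 it returns a dict_keys view, which cannot be added to a list. A minimal reproduction, with a hypothetical targets dict standing in for the dataset's target map:

```python
# Hypothetical stand-in for the dataset's target map (not RETURNN's real data).
targets = {"classes": 79, "sizes": 2}

# Under Python 3, dict.keys() returns a dict_keys view, not a list,
# so concatenating it onto a list raises TypeError.
try:
    keys = ["data"] + targets.keys()
except TypeError as err:
    print(err)  # can only concatenate list (not "dict_keys") to list

# One standard fix: wrap the view in list() before concatenating.
keys = ["data"] + list(targets.keys())
print(keys)  # ['data', 'classes', 'sizes']
```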
Thank you so much for the help. I am now able to run it for the demo dataset. I am now making modifications to run it with the usual splits. I am getting the following logs on a small dataset.
('converting IAM_lines to', 'features/raw/demo.h5')
features/raw/demo.h5
(0, '/', 3)
Couldn't import dot_parser, loading of dot files will not be possible.
RETURNN starting up, version 20171212.115418--git-95f0a14-dirty, pid 33372, cwd /export/b01/aarora8/returnn/demos/mdlstm/IAM
RETURNN command line options: ['config_demo']
Theano: 0.9.0 (<site-package> in /home/aaror/.local/lib/python2.7/site-packages/theano)
Couldn't import dot_parser, loading of dot files will not be possible.
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
Using gpu device 2: GeForce GTX 1080 Ti (CNMeM is enabled with initial size: 87.6% of memory, cuDNN 5103)
Device gpuX proc starting up, pid 33455
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True'
Device train-network: Used data keys: ['classes', 'data', 'sizes']
using adam with nag and momentum schedule
Device gpu2 proc, pid 33455 is ready for commands.
Devices: Used in multiprocessing mode.
loading file features/raw/demo.h5
cached 3 seqs 0.00854282453656 GB (fully loaded, 14.6581238424 GB left over)
Train data:
input: 1 x 1
output: {u'classes': [79, 1], 'data': [1, 2], u'sizes': [2, 1]}
HDF dataset, sequences: 3, frames: 764399
Devices:
gpu2: GeForce GTX 1080 Ti (units: 3584 clock: 1.45Ghz memory: 11.0GB) working on 1 batch (update on device)
Learning-rate-control: no file specified, not saving history (no proper restart possible)
using adam with nag and momentum schedule
Network layer topology:
input #: 1
hidden 1Dto2D '1Dto2D' #: 1
hidden source 'classes_source' #: 2
hidden conv2 'conv0' #: 15
hidden conv2 'conv1' #: 45
hidden conv2 'conv2' #: 75
hidden conv2 'conv3' #: 105
hidden conv2 'conv4' #: 105
hidden mdlstm 'mdlstm0' #: 30
hidden mdlstm 'mdlstm1' #: 60
hidden mdlstm 'mdlstm2' #: 90
hidden mdlstm 'mdlstm3' #: 120
hidden mdlstm 'mdlstm4' #: 120
output softmax 'output' #: 80
net params #: 2627660
net trainable params: [W_conv0, b_conv0, W_conv1, b_conv1, W_conv2, b_conv2, W_conv3, b_conv3, W_conv4, b_conv4, U1_mdlstm0, U2_mdlstm0, U3_mdlstm0, U4_mdlstm0, V1_mdlstm0, V2_mdlstm0, V3_mdlstm0, V4_mdlstm0, W1_mdlstm0, W2_mdlstm0, W3_mdlstm0, W4_mdlstm0, b1_mdlstm0, b2_mdlstm0, b3_mdlstm0, b4_mdlstm0, U1_mdlstm1, U2_mdlstm1, U3_mdlstm1, U4_mdlstm1, V1_mdlstm1, V2_mdlstm1, V3_mdlstm1, V4_mdlstm1, W1_mdlstm1, W2_mdlstm1, W3_mdlstm1, W4_mdlstm1, b1_mdlstm1, b2_mdlstm1, b3_mdlstm1, b4_mdlstm1, U1_mdlstm2, U2_mdlstm2, U3_mdlstm2, U4_mdlstm2, V1_mdlstm2, V2_mdlstm2, V3_mdlstm2, V4_mdlstm2, W1_mdlstm2, W2_mdlstm2, W3_mdlstm2, W4_mdlstm2, b1_mdlstm2, b2_mdlstm2, b3_mdlstm2, b4_mdlstm2, U1_mdlstm3, U2_mdlstm3, U3_mdlstm3, U4_mdlstm3, V1_mdlstm3, V2_mdlstm3, V3_mdlstm3, V4_mdlstm3, W1_mdlstm3, W2_mdlstm3, W3_mdlstm3, W4_mdlstm3, b1_mdlstm3, b2_mdlstm3, b3_mdlstm3, b4_mdlstm3, U1_mdlstm4, U2_mdlstm4, U3_mdlstm4, U4_mdlstm4, V1_mdlstm4, V2_mdlstm4, V3_mdlstm4, V4_mdlstm4, W1_mdlstm4, W2_mdlstm4, W3_mdlstm4, W4_mdlstm4, b1_mdlstm4, b2_mdlstm4, b3_mdlstm4, b4_mdlstm4, W_in_mdlstm4_output, b_output]
start training at epoch 1 and batch 0
using batch size: 600000, max seqs: 10
learning rate control: ConstantLearningRate(defaultLearningRate=0.0005, minLearningRate=0.0, defaultLearningRates={1: 0.0005, 25: 0.0003, 35: 0.0001}, errorMeasureKey=None, relativeErrorAlsoRelativeToLearningRate=False, minNumEpochsPerNewLearningRate=0, filename=None), epoch data: 1: EpochData(learningRate=0.0005, error={}), 25: EpochData(learningRate=0.0003, error={}), 35: EpochData(learningRate=0.0001, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu2
train epoch 1, batch 0, cost:output 19.8360468877, elapsed 0:00:01, exp. remaining 0:00:00, complete 100.00%
running 1 sequence slices (278409 nts) of batch 1 on device gpu2
train epoch 1, batch 1, cost:output 16.7365370009, elapsed 0:00:02, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:02, 96.82% computing, 0.18% updating data
Save model from epoch 1 under models/mdlstm_demo.001
Learning-rate-control: error key 'train_score' from {'train_score': 18.692785044185452}
epoch 1 score: 18.6927850442 elapsed: 0:00:02
start epoch 2 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu2
train epoch 2, batch 0, cost:output 19.1958261465, elapsed 0:00:01, exp. remaining 0:00:00, complete 100.00%
running 1 sequence slices (278409 nts) of batch 1 on device gpu2
train epoch 2, batch 1, cost:output 15.5196750217, elapsed 0:00:02, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:02, 96.81% computing, 0.08% updating data
Save model from epoch 2 under models/mdlstm_demo.002
epoch 2 score: 17.8398687644 elapsed: 0:00:02
start epoch 3 with learning rate 0.0005 ...
starting task train
running 1 sequence slices (278409 nts) of batch 0 on device gpu2
train epoch 3, batch 0, cost:output 12.6697686089, elapsed 0:00:01, exp. remaining 0:00:00, complete 100.00%
running 2 sequence slices (569522 nts) of batch 1 on device gpu2
train epoch 3, batch 1, cost:output 10.2576254312, elapsed 0:00:02, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:02, 96.56% computing, 0.09% updating data
Save model from epoch 3 under models/mdlstm_demo.003
epoch 3 score: 11.1473503738 elapsed: 0:00:02
start epoch 4 with learning rate 0.0005 ...
starting task train
running 1 sequence slices (278409 nts) of batch 0 on device gpu2
train epoch 4, batch 0, cost:output 11.5054497613, elapsed 0:00:01, exp. remaining 0:00:00, complete 100.00%
running 2 sequence slices (569522 nts) of batch 1 on device gpu2
train epoch 4, batch 1, cost:output 7.69990281935, elapsed 0:00:02, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:02, 96.72% computing, 0.13% updating data
Save model from epoch 4 under models/mdlstm_demo.004
epoch 4 score: 9.10358816678 elapsed: 0:00:02
start epoch 5 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu2
train epoch 5, batch 0, cost:output 6.47300938198, elapsed 0:00:01, exp. remaining 0:00:00, complete 100.00%
running 1 sequence slices (278409 nts) of batch 1 on device gpu2
train epoch 5, batch 1, cost:output 4.01199374729, elapsed 0:00:02, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:02, 96.68% computing, 0.09% updating data
Save model from epoch 5 under models/mdlstm_demo.005
epoch 5 score: 5.56525771344 elapsed: 0:00:02
start epoch 6 with learning rate 0.0005 ...
from returnn.
Thank you for the help. I am able to run the multidimensional LSTM code on the usual splits. It is on the 5th epoch, so I guess it will run fine.
from returnn.
The issue seems to be resolved, so I'll close it for now.
from returnn.