Comments (7)
Hi,
your log says:
"Mixed dnn version. The header is from one version, but we link with a different version (5110, 6021)"
Please check your cuDNN versions: maybe remove all installed versions, download a fresh one, and try again.
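If it helps, a minimal sketch (assuming libcudnn.so is on your library path) to query which cuDNN library the dynamic loader actually resolves:

    import ctypes

    # Load whatever libcudnn the dynamic loader finds first.
    libcudnn = ctypes.cdll.LoadLibrary("libcudnn.so")
    # cudnnGetVersion() returns the version of the linked library, e.g. 5110.
    print("linked cuDNN version:", libcudnn.cudnnGetVersion())

Compare that number against the CUDNN_VERSION define in the cudnn.h you compile against; the two numbers in the error above are such a header/library pair.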
from returnn.
Hi, I resolved the issue of the multiple cuDNN versions, but the code is still not working because the old backend is not supported. Any suggestions for a workaround?
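One direction worth checking (a sketch, not a confirmed fix): with Theano 0.9 the old CUDA backend is still available, and the backend is chosen by the device flag; device=gpu selects the old theano.sandbox.cuda backend that this demo's ops target, while device=cuda selects the new gpuarray backend.

    import os

    # Set the flags before theano is imported:
    # device=gpu  -> theano.sandbox.cuda (old backend, used by the demo's CUDA ops)
    # device=cuda -> theano.gpuarray (new backend)
    os.environ["THEANO_FLAGS"] = "device=gpu,force_device=True"

    import theano  # imported after setting the flags on purpose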
from returnn.
I am working with Theano 0.9.0 and pygpu 0.7.3. It was showing problems with the sandbox CUDA backend. I made the changes that seemed feasible, but it still raises one error after another:
File "returnn_IAM/cuda_implementation/CuDNNConvHWBCOp.py", line 13, in
class CuDNNConvHWBCOpGrad(theano.gpuarray.GpuOp):
AttributeError: 'module' object has no attribute 'GpuOp'
from returnn.
I wonder a bit about the base class theano.gpuarray.GpuOp, because theano.gpuarray is the new gpuarray backend, while the code actually uses the old CUDA backend, so it should be theano.sandbox.cuda.GpuOp. Can you try replacing that? Maybe you also need to add an import.
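A minimal sketch of that change (assuming the rest of the file stays as it is):

    import theano.sandbox.cuda  # old CUDA backend; this module provides GpuOp

    # Derive from the old backend's GpuOp. theano.gpuarray has no GpuOp
    # attribute, which is exactly what the AttributeError above complains about.
    class CuDNNConvHWBCOpGrad(theano.sandbox.cuda.GpuOp):
        ...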
from returnn.
Hi,
I just tried to reproduce the problem, but for me the demo is working. Please make sure that you are using the newest version of returnn.
My output looks as follows:
voigtlaender@helios:/work/voigtlaender/returnn/demos/mdlstm/IAM$ python ../../../rnn.py config_demo
CRNN starting up, version 20170929.103426--git-875161a-dirty, pid 10527, cwd /work/voigtlaender/returnn/demos/mdlstm/IAM
CRNN command line options: ['config_demo']
Theano: 0.9.0 (<site-package> in /home/voigtlaender/python2_new/local/lib/python2.7/site-packages/theano)
faulthandler import error. No module named faulthandler
pynvml not available, memory information missing
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
Using gpu device 0: Graphics Device (CNMeM is enabled with initial size: 10.0% of memory, cuDNN 5105)
Device gpuX proc starting up, pid 10558
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True'
faulthandler import error. No module named faulthandler
Device train-network: Used data keys: ['classes', 'data', 'sizes']
using adam with nag and momentum schedule
Device gpu0 proc, pid 10558 is ready for commands.
Devices: Used in multiprocessing mode.
loading file features/raw/demo.h5
cached 3 seqs 0.00854282453656 GB (fully loaded, 14.3247905085 GB left over)
Train data:
input: 1 x 1
output: {u'classes': [79, 1], 'data': [1, 2], u'sizes': [2, 1]}
HDF dataset, sequences: 3, frames: 764399
Devices:
gpu0: Geforce GTX TITAN X (units: 3072 clock: 0.98Ghz memory: 12.0GB) working on 1 batch (update on device)
warning: there is an existing model: (2, 'models/mdlstm_demo.002')
Learning-rate-control: no file specified, not saving history (no proper restart possible)
using adam with nag and momentum schedule
Network layer topology:
input #: 1
hidden 1Dto2D '1Dto2D' #: 1
hidden source 'classes_source' #: 2
hidden conv2 'conv0' #: 15
hidden conv2 'conv1' #: 45
hidden conv2 'conv2' #: 75
hidden conv2 'conv3' #: 105
hidden conv2 'conv4' #: 105
hidden mdlstm 'mdlstm0' #: 30
hidden mdlstm 'mdlstm1' #: 60
hidden mdlstm 'mdlstm2' #: 90
hidden mdlstm 'mdlstm3' #: 120
hidden mdlstm 'mdlstm4' #: 120
output softmax 'output' #: 80
net params #: 2627660
net trainable params: [W_conv0, b_conv0, W_conv1, b_conv1, W_conv2, b_conv2, W_conv3, b_conv3, W_conv4, b_conv4, U1_mdlstm0, U2_mdlstm0, U3_mdlstm0, U4_mdlstm0, V1_mdlstm0, V2_mdlstm0, V3_mdlstm0, V4_mdlstm0, W1_mdlstm0, W2_mdlstm0, W3_mdlstm0, W4_mdlstm0, b1_mdlstm0, b2_mdlstm0, b3_mdlstm0, b4_mdlstm0, U1_mdlstm1, U2_mdlstm1, U3_mdlstm1, U4_mdlstm1, V1_mdlstm1, V2_mdlstm1, V3_mdlstm1, V4_mdlstm1, W1_mdlstm1, W2_mdlstm1, W3_mdlstm1, W4_mdlstm1, b1_mdlstm1, b2_mdlstm1, b3_mdlstm1, b4_mdlstm1, U1_mdlstm2, U2_mdlstm2, U3_mdlstm2, U4_mdlstm2, V1_mdlstm2, V2_mdlstm2, V3_mdlstm2, V4_mdlstm2, W1_mdlstm2, W2_mdlstm2, W3_mdlstm2, W4_mdlstm2, b1_mdlstm2, b2_mdlstm2, b3_mdlstm2, b4_mdlstm2, U1_mdlstm3, U2_mdlstm3, U3_mdlstm3, U4_mdlstm3, V1_mdlstm3, V2_mdlstm3, V3_mdlstm3, V4_mdlstm3, W1_mdlstm3, W2_mdlstm3, W3_mdlstm3, W4_mdlstm3, b1_mdlstm3, b2_mdlstm3, b3_mdlstm3, b4_mdlstm3, U1_mdlstm4, U2_mdlstm4, U3_mdlstm4, U4_mdlstm4, V1_mdlstm4, V2_mdlstm4, V3_mdlstm4, V4_mdlstm4, W1_mdlstm4, W2_mdlstm4, W3_mdlstm4, W4_mdlstm4, b1_mdlstm4, b2_mdlstm4, b3_mdlstm4, b4_mdlstm4, W_in_mdlstm4_output, b_output]
start training at epoch 1 and batch 0
using batch size: 600000, max seqs: 10
learning rate control: ConstantLearningRate(defaultLearningRate=0.0005, minLearningRate=0.0, defaultLearningRates={1: 0.0005, 25: 0.0003, 35: 0.0001}, errorMeasureKey=None, relativeErrorAlsoRelativeToLearningRate=False, minNumEpochsPerNewLearningRate=0, filename=None), epoch data: 1: EpochData(learningRate=0.0005, error={}), 25: EpochData(learningRate=0.0003, error={}), 35: EpochData(learningRate=0.0001, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu0
train epoch 1, batch 0, cost:output 19.8360468877, elapsed 0:00:08, exp. remaining 0:00:00, complete 100.00%
running 1 sequence slices (278409 nts) of batch 1 on device gpu0
train epoch 1, batch 1, cost:output 16.7365681966, elapsed 0:00:16, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:16, 99.53% computing, 0.01% updating data
Save model from epoch 1 under models/mdlstm_demo.001
Learning-rate-control: error key 'train_score' from {'train_score': 18.69279655081327}
epoch 1 score: 18.6927965508 elapsed: 0:00:16
start epoch 2 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu0
train epoch 2, batch 0, cost:output 19.1958800477, elapsed 0:00:08, exp. remaining 0:00:00, complete 100.00%
running 1 sequence slices (278409 nts) of batch 1 on device gpu0
train epoch 2, batch 1, cost:output 15.5198174371, elapsed 0:00:16, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:16, 99.64% computing, 0.01% updating data
Save model from epoch 2 under models/mdlstm_demo.002
epoch 2 score: 17.8399553143 elapsed: 0:00:16
start epoch 3 with learning rate 0.0005 ...
starting task train
running 1 sequence slices (278409 nts) of batch 0 on device gpu0
train epoch 3, batch 0, cost:output 12.670304362, elapsed 0:00:08, exp. remaining 0:00:00, complete 100.00%
running 2 sequence slices (569522 nts) of batch 1 on device gpu0
from returnn.
Hello, thanks!
I am able to run the training now. It was an issue with the Theano installation together with pygpu.
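For anyone hitting the same problem, a quick sanity check along these lines may help (a sketch, assuming your Theano version still ships the old backend):

    import theano
    print("Theano:", theano.__version__)  # the demo above ran with 0.9.0

    # The demo's custom CUDA ops need the old backend; this fails early
    # and loudly if the installation is broken.
    import theano.sandbox.cuda
    theano.sandbox.cuda.use("gpu0")  # initialize the old CUDA backend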
from returnn.