
Comments (7)

pvoigtlaender commented on July 17, 2024

Hi,

in your log it says:
"Mixed dnn version. The header is from one version, but we link with a different version (5110, 6021)"
Please check your cuDNN versions; maybe remove all installed versions, download a fresh one, and try again.
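
If it helps, one way to see which cuDNN library is actually picked up at runtime is to query it directly. A minimal sketch (my own addition, not from the thread), assuming a libcudnn.so is on the loader path:

import ctypes

# Load whatever libcudnn.so the dynamic linker resolves first; this is the
# library Theano links against, so its version should match the headers.
libcudnn = ctypes.cdll.LoadLibrary("libcudnn.so")
print(libcudnn.cudnnGetVersion())  # e.g. 5110 for cuDNN 5.1.10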


sharmaannapurna commented on July 17, 2024

Hi, I resolved the issue with the multiple cuDNN versions, but the code still does not work because the old backend is not supported. Any suggestions for a workaround?


albertz commented on July 17, 2024


sharmaannapurna commented on July 17, 2024

I am working with Theano 0.9.0 and pygpu 0.7.3. It was showing problems with theano.sandbox.cuda. I made the feasible changes, but it still shows one error after another:

File "returnn_IAM/cuda_implementation/CuDNNConvHWBCOp.py", line 13, in <module>
    class CuDNNConvHWBCOpGrad(theano.gpuarray.GpuOp):
AttributeError: 'module' object has no attribute 'GpuOp'


albertz commented on July 17, 2024

I wonder a bit about the base class theano.gpuarray.GpuOp: that is the base class of the new gpuarray backend, but the code actually uses the old CUDA backend, so it should be theano.sandbox.cuda.GpuOp. Can you try replacing that? You may also need to add an import.
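
For illustration, a hedged sketch of that edit in CuDNNConvHWBCOp.py (my reading of the suggestion; the op body itself is omitted):

# old CUDA backend; deprecated in Theano 0.9 and removed in 0.10
import theano.sandbox.cuda

# before (fails, since theano.gpuarray has no attribute GpuOp):
#   class CuDNNConvHWBCOpGrad(theano.gpuarray.GpuOp):

# after: derive from the old backend's GpuOp instead
class CuDNNConvHWBCOpGrad(theano.sandbox.cuda.GpuOp):
    ...  # op implementation unchanged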


pvoigtlaender commented on July 17, 2024

Hi,

I just tried to reproduce the problem, but for me the demo works. Please make sure that you are using the newest version of returnn.
My output looks as follows:

voigtlaender@helios:/work/voigtlaender/returnn/demos/mdlstm/IAM$ python ../../../rnn.py config_demo
CRNN starting up, version 20170929.103426--git-875161a-dirty, pid 10527, cwd /work/voigtlaender/returnn/demos/mdlstm/IAM
CRNN command line options: ['config_demo']
Theano: 0.9.0 (<site-package> in /home/voigtlaender/python2_new/local/lib/python2.7/site-packages/theano)
faulthandler import error. No module named faulthandler
pynvml not available, memory information missing
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: Graphics Device (CNMeM is enabled with initial size: 10.0% of memory, cuDNN 5105)
Device gpuX proc starting up, pid 10558
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True'
faulthandler import error. No module named faulthandler
Device train-network: Used data keys: ['classes', 'data', 'sizes']
using adam with nag and momentum schedule                                                                                                                                                                                              
Device gpu0 proc, pid 10558 is ready for commands.
Devices: Used in multiprocessing mode.
loading file features/raw/demo.h5
cached 3 seqs 0.00854282453656 GB (fully loaded, 14.3247905085 GB left over)
Train data:
  input: 1 x 1
  output: {u'classes': [79, 1], 'data': [1, 2], u'sizes': [2, 1]}
  HDF dataset, sequences: 3, frames: 764399
Devices:
  gpu0: Geforce GTX TITAN X (units: 3072 clock: 0.98Ghz memory: 12.0GB) working on 1 batch (update on device)
warning: there is an existing model: (2, 'models/mdlstm_demo.002')
Learning-rate-control: no file specified, not saving history (no proper restart possible)
using adam with nag and momentum schedule
Network layer topology:
  input #: 1
  hidden 1Dto2D '1Dto2D' #: 1
  hidden source 'classes_source' #: 2
  hidden conv2 'conv0' #: 15
  hidden conv2 'conv1' #: 45
  hidden conv2 'conv2' #: 75
  hidden conv2 'conv3' #: 105
  hidden conv2 'conv4' #: 105
  hidden mdlstm 'mdlstm0' #: 30
  hidden mdlstm 'mdlstm1' #: 60
  hidden mdlstm 'mdlstm2' #: 90
  hidden mdlstm 'mdlstm3' #: 120
  hidden mdlstm 'mdlstm4' #: 120
  output softmax 'output' #: 80
net params #: 2627660
net trainable params: [W_conv0, b_conv0, W_conv1, b_conv1, W_conv2, b_conv2, W_conv3, b_conv3, W_conv4, b_conv4, U1_mdlstm0, U2_mdlstm0, U3_mdlstm0, U4_mdlstm0, V1_mdlstm0, V2_mdlstm0, V3_mdlstm0, V4_mdlstm0, W1_mdlstm0, W2_mdlstm0, W3_mdlstm0, W4_mdlstm0, b1_mdlstm0, b2_mdlstm0, b3_mdlstm0, b4_mdlstm0, U1_mdlstm1, U2_mdlstm1, U3_mdlstm1, U4_mdlstm1, V1_mdlstm1, V2_mdlstm1, V3_mdlstm1, V4_mdlstm1, W1_mdlstm1, W2_mdlstm1, W3_mdlstm1, W4_mdlstm1, b1_mdlstm1, b2_mdlstm1, b3_mdlstm1, b4_mdlstm1, U1_mdlstm2, U2_mdlstm2, U3_mdlstm2, U4_mdlstm2, V1_mdlstm2, V2_mdlstm2, V3_mdlstm2, V4_mdlstm2, W1_mdlstm2, W2_mdlstm2, W3_mdlstm2, W4_mdlstm2, b1_mdlstm2, b2_mdlstm2, b3_mdlstm2, b4_mdlstm2, U1_mdlstm3, U2_mdlstm3, U3_mdlstm3, U4_mdlstm3, V1_mdlstm3, V2_mdlstm3, V3_mdlstm3, V4_mdlstm3, W1_mdlstm3, W2_mdlstm3, W3_mdlstm3, W4_mdlstm3, b1_mdlstm3, b2_mdlstm3, b3_mdlstm3, b4_mdlstm3, U1_mdlstm4, U2_mdlstm4, U3_mdlstm4, U4_mdlstm4, V1_mdlstm4, V2_mdlstm4, V3_mdlstm4, V4_mdlstm4, W1_mdlstm4, W2_mdlstm4, W3_mdlstm4, W4_mdlstm4, b1_mdlstm4, b2_mdlstm4, b3_mdlstm4, b4_mdlstm4, W_in_mdlstm4_output, b_output]
start training at epoch 1 and batch 0
using batch size: 600000, max seqs: 10
learning rate control: ConstantLearningRate(defaultLearningRate=0.0005, minLearningRate=0.0, defaultLearningRates={1: 0.0005, 25: 0.0003, 35: 0.0001}, errorMeasureKey=None, relativeErrorAlsoRelativeToLearningRate=False, minNumEpochsPerNewLearningRate=0, filename=None), epoch data: 1: EpochData(learningRate=0.0005, error={}), 25: EpochData(learningRate=0.0003, error={}), 35: EpochData(learningRate=0.0001, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu0
train epoch 1, batch 0, cost:output 19.8360468877, elapsed 0:00:08, exp. remaining 0:00:00, complete 100.00%
running 1 sequence slices (278409 nts) of batch 1 on device gpu0                                                                                                                                                                       
train epoch 1, batch 1, cost:output 16.7365681966, elapsed 0:00:16, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:16, 99.53% computing, 0.01% updating data                                                                                                                                                
Save model from epoch 1 under models/mdlstm_demo.001                                                                                                                                                                                   
Learning-rate-control: error key 'train_score' from {'train_score': 18.69279655081327}
epoch 1 score: 18.6927965508 elapsed: 0:00:16 
start epoch 2 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu0
train epoch 2, batch 0, cost:output 19.1958800477, elapsed 0:00:08, exp. remaining 0:00:00, complete 100.00%
running 1 sequence slices (278409 nts) of batch 1 on device gpu0                                                                                                                                                                       
train epoch 2, batch 1, cost:output 15.5198174371, elapsed 0:00:16, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:16, 99.64% computing, 0.01% updating data                                                                                                                                                
Save model from epoch 2 under models/mdlstm_demo.002                                                                                                                                                                                   
epoch 2 score: 17.8399553143 elapsed: 0:00:16 
start epoch 3 with learning rate 0.0005 ...
starting task train
running 1 sequence slices (278409 nts) of batch 0 on device gpu0
train epoch 3, batch 0, cost:output 12.670304362, elapsed 0:00:08, exp. remaining 0:00:00, complete 100.00%
running 2 sequence slices (569522 nts) of batch 1 on device gpu0                                              


sharmaannapurna commented on July 17, 2024

Hello, thanks!
I am able to run the training now. It was an issue with the Theano installation together with pygpu.
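
As a quick sanity check that the reinstalled stack is the one being imported (my own suggestion, not from the thread):

import theano
import pygpu

print(theano.__version__)  # expected: 0.9.0
print(pygpu.__file__)      # shows which pygpu installation is in use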

