Comments (7)
Hi,
your log says:
"Mixed dnn version. The header is from one version, but we link with a different version (5110, 6021)"
Please check your cuDNN versions: maybe remove all installed versions, download a fresh one, and try again.
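If it helps, a minimal sketch (assuming libcudnn.so is on your library path) to query which cuDNN library the dynamic loader actually resolves:

    import ctypes

    # Load whatever libcudnn the dynamic loader finds first.
    libcudnn = ctypes.cdll.LoadLibrary("libcudnn.so")
    # cudnnGetVersion() returns the version of the linked library, e.g. 5110.
    print("linked cuDNN version:", libcudnn.cudnnGetVersion())

Compare that number against the CUDNN_VERSION define in the cudnn.h you compile against; the two numbers in the error above are such a header/library pair.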
from returnn.
Hi, I resolved the issue of the multiple cuDNN versions, but the code is still not working because the old backend is not supported. Any suggestions for a workaround?
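One direction worth checking (a sketch, not a confirmed fix): with Theano 0.9 the old CUDA backend is still available, and the backend is chosen by the device flag; device=gpu selects the old theano.sandbox.cuda backend that this demo's ops target, while device=cuda selects the new gpuarray backend.

    import os

    # Set the flags before theano is imported:
    # device=gpu  -> theano.sandbox.cuda (old backend, used by the demo's CUDA ops)
    # device=cuda -> theano.gpuarray (new backend)
    os.environ["THEANO_FLAGS"] = "device=gpu,force_device=True"

    import theano  # imported after setting the flags on purpose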
from returnn.
I am working with Theano 0.9.0 and pygpu 0.7.3. It was showing problems with the sandbox CUDA backend. I made the changes that seemed feasible, but it still raises one error after another:
File "returnn_IAM/cuda_implementation/CuDNNConvHWBCOp.py", line 13, in
class CuDNNConvHWBCOpGrad(theano.gpuarray.GpuOp):
AttributeError: 'module' object has no attribute 'GpuOp'
from returnn.
I wonder a bit about the base class theano.gpuarray.GpuOp, because theano.gpuarray is the new gpuarray backend, while the code actually uses the old CUDA backend, so it should be theano.sandbox.cuda.GpuOp. Can you try replacing that? Maybe you also need to add an import.
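A minimal sketch of that change (assuming the rest of the file stays as it is):

    import theano.sandbox.cuda  # old CUDA backend; this module provides GpuOp

    # Derive from the old backend's GpuOp. theano.gpuarray has no GpuOp
    # attribute, which is exactly what the AttributeError above complains about.
    class CuDNNConvHWBCOpGrad(theano.sandbox.cuda.GpuOp):
        ...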
from returnn.
Hi,
I just tried to reproduce the problem, but for me the demo is working. Please make sure that you are using the newest version of returnn.
My output looks as follows:
voigtlaender@helios:/work/voigtlaender/returnn/demos/mdlstm/IAM$ python ../../../rnn.py config_demo
CRNN starting up, version 20170929.103426--git-875161a-dirty, pid 10527, cwd /work/voigtlaender/returnn/demos/mdlstm/IAM
CRNN command line options: ['config_demo']
Theano: 0.9.0 (<site-package> in /home/voigtlaender/python2_new/local/lib/python2.7/site-packages/theano)
faulthandler import error. No module named faulthandler
pynvml not available, memory information missing
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
Using gpu device 0: Graphics Device (CNMeM is enabled with initial size: 10.0% of memory, cuDNN 5105)
Device gpuX proc starting up, pid 10558
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True'
faulthandler import error. No module named faulthandler
Device train-network: Used data keys: ['classes', 'data', 'sizes']
using adam with nag and momentum schedule
Device gpu0 proc, pid 10558 is ready for commands.
Devices: Used in multiprocessing mode.
loading file features/raw/demo.h5
cached 3 seqs 0.00854282453656 GB (fully loaded, 14.3247905085 GB left over)
Train data:
input: 1 x 1
output: {u'classes': [79, 1], 'data': [1, 2], u'sizes': [2, 1]}
HDF dataset, sequences: 3, frames: 764399
Devices:
gpu0: Geforce GTX TITAN X (units: 3072 clock: 0.98Ghz memory: 12.0GB) working on 1 batch (update on device)
warning: there is an existing model: (2, 'models/mdlstm_demo.002')
Learning-rate-control: no file specified, not saving history (no proper restart possible)
using adam with nag and momentum schedule
Network layer topology:
input #: 1
hidden 1Dto2D '1Dto2D' #: 1
hidden source 'classes_source' #: 2
hidden conv2 'conv0' #: 15
hidden conv2 'conv1' #: 45
hidden conv2 'conv2' #: 75
hidden conv2 'conv3' #: 105
hidden conv2 'conv4' #: 105
hidden mdlstm 'mdlstm0' #: 30
hidden mdlstm 'mdlstm1' #: 60
hidden mdlstm 'mdlstm2' #: 90
hidden mdlstm 'mdlstm3' #: 120
hidden mdlstm 'mdlstm4' #: 120
output softmax 'output' #: 80
net params #: 2627660
net trainable params: [W_conv0, b_conv0, W_conv1, b_conv1, W_conv2, b_conv2, W_conv3, b_conv3, W_conv4, b_conv4, U1_mdlstm0, U2_mdlstm0, U3_mdlstm0, U4_mdlstm0, V1_mdlstm0, V2_mdlstm0, V3_mdlstm0, V4_mdlstm0, W1_mdlstm0, W2_mdlstm0, W3_mdlstm0, W4_mdlstm0, b1_mdlstm0, b2_mdlstm0, b3_mdlstm0, b4_mdlstm0, U1_mdlstm1, U2_mdlstm1, U3_mdlstm1, U4_mdlstm1, V1_mdlstm1, V2_mdlstm1, V3_mdlstm1, V4_mdlstm1, W1_mdlstm1, W2_mdlstm1, W3_mdlstm1, W4_mdlstm1, b1_mdlstm1, b2_mdlstm1, b3_mdlstm1, b4_mdlstm1, U1_mdlstm2, U2_mdlstm2, U3_mdlstm2, U4_mdlstm2, V1_mdlstm2, V2_mdlstm2, V3_mdlstm2, V4_mdlstm2, W1_mdlstm2, W2_mdlstm2, W3_mdlstm2, W4_mdlstm2, b1_mdlstm2, b2_mdlstm2, b3_mdlstm2, b4_mdlstm2, U1_mdlstm3, U2_mdlstm3, U3_mdlstm3, U4_mdlstm3, V1_mdlstm3, V2_mdlstm3, V3_mdlstm3, V4_mdlstm3, W1_mdlstm3, W2_mdlstm3, W3_mdlstm3, W4_mdlstm3, b1_mdlstm3, b2_mdlstm3, b3_mdlstm3, b4_mdlstm3, U1_mdlstm4, U2_mdlstm4, U3_mdlstm4, U4_mdlstm4, V1_mdlstm4, V2_mdlstm4, V3_mdlstm4, V4_mdlstm4, W1_mdlstm4, W2_mdlstm4, W3_mdlstm4, W4_mdlstm4, b1_mdlstm4, b2_mdlstm4, b3_mdlstm4, b4_mdlstm4, W_in_mdlstm4_output, b_output]
start training at epoch 1 and batch 0
using batch size: 600000, max seqs: 10
learning rate control: ConstantLearningRate(defaultLearningRate=0.0005, minLearningRate=0.0, defaultLearningRates={1: 0.0005, 25: 0.0003, 35: 0.0001}, errorMeasureKey=None, relativeErrorAlsoRelativeToLearningRate=False, minNumEpochsPerNewLearningRate=0, filename=None), epoch data: 1: EpochData(learningRate=0.0005, error={}), 25: EpochData(learningRate=0.0003, error={}), 35: EpochData(learningRate=0.0001, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu0
train epoch 1, batch 0, cost:output 19.8360468877, elapsed 0:00:08, exp. remaining 0:00:00, complete 100.00%
running 1 sequence slices (278409 nts) of batch 1 on device gpu0
train epoch 1, batch 1, cost:output 16.7365681966, elapsed 0:00:16, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:16, 99.53% computing, 0.01% updating data
Save model from epoch 1 under models/mdlstm_demo.001
Learning-rate-control: error key 'train_score' from {'train_score': 18.69279655081327}
epoch 1 score: 18.6927965508 elapsed: 0:00:16
start epoch 2 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (569522 nts) of batch 0 on device gpu0
train epoch 2, batch 0, cost:output 19.1958800477, elapsed 0:00:08, exp. remaining 0:00:00, complete 100.00%
running 1 sequence slices (278409 nts) of batch 1 on device gpu0
train epoch 2, batch 1, cost:output 15.5198174371, elapsed 0:00:16, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 0:00:16, 99.64% computing, 0.01% updating data
Save model from epoch 2 under models/mdlstm_demo.002
epoch 2 score: 17.8399553143 elapsed: 0:00:16
start epoch 3 with learning rate 0.0005 ...
starting task train
running 1 sequence slices (278409 nts) of batch 0 on device gpu0
train epoch 3, batch 0, cost:output 12.670304362, elapsed 0:00:08, exp. remaining 0:00:00, complete 100.00%
running 2 sequence slices (569522 nts) of batch 1 on device gpu0
from returnn.
Hello, thanks!
I am able to run the training now. It was an issue with the Theano installation together with pygpu.
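For anyone hitting the same problem, a quick sanity check along these lines may help (a sketch, assuming your Theano version still ships the old backend):

    import theano
    print("Theano:", theano.__version__)  # the demo above ran with 0.9.0

    # The demo's custom CUDA ops need the old backend; this fails early
    # and loudly if the installation is broken.
    import theano.sandbox.cuda
    theano.sandbox.cuda.use("gpu0")  # initialize the old CUDA backend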
from returnn.