
returnn's Introduction

Welcome to RETURNN

GitHub repository. RETURNN paper 2016, RETURNN paper 2018.

RETURNN (RWTH extensible training framework for universal recurrent neural networks) is a Theano/TensorFlow-based implementation of modern recurrent neural network architectures. It is optimized for fast and reliable training of recurrent neural networks in a multi-GPU environment.

The high-level features and goals of RETURNN are:

  • Simplicity
    • Writing config / code is simple and straightforward (setting up experiment, defining model)
    • Debugging in case of problems is simple
    • Reading config / code is simple (defined model, training, decoding all becomes clear)
  • Flexibility
    • Allow for many different kinds of experiments / models
  • Efficiency
    • Training speed
    • Decoding speed

All items are important for research; decoding speed is especially important for production.

See our Interspeech 2020 tutorial "Efficient and Flexible Implementation of Machine Learning for ASR and MT" (video, slides) for an introduction to the core concepts.

More specific features include:

  • Mini-batch training of feed-forward neural networks
  • Sequence-chunking based batch training for recurrent neural networks
  • Long short-term memory recurrent neural networks including our own fast CUDA kernel
  • Multidimensional LSTM (GPU only, there is no CPU version)
  • Memory management for large data sets
  • Work distribution across multiple devices
  • Flexible and fast architecture which allows all kinds of encoder-attention-decoder models
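
To illustrate what sequence chunking means in general, here is a minimal sketch that splits one long sequence into overlapping chunks. The function name and the parameters chunk_size and chunk_step are assumptions made for this illustration; this is not RETURNN's implementation.

```python
def chunk_sequence(seq, chunk_size, chunk_step):
    """Split one sequence into chunks of at most chunk_size frames.

    Consecutive chunks start chunk_step frames apart, so they overlap
    whenever chunk_step < chunk_size.
    """
    chunks = []
    for start in range(0, len(seq), chunk_step):
        chunks.append(seq[start:start + chunk_size])
    return chunks


# A toy sequence of 10 frames, chunked with size 4 and step 2.
frames = list(range(10))
chunks = chunk_sequence(frames, chunk_size=4, chunk_step=2)
for chunk in chunks:
    print(chunk)
```

Chunks of bounded length can then be grouped into mini-batches, which keeps the memory use per batch predictable even when the original sequences are very long.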

See documentation. See basic usage and technological overview.

Here is the video recording of a RETURNN overview talk (slides, exercise sheet; hosted by eBay).

There are many example demos which work on artificially generated data, so they should work as-is.

There are some real-world examples such as setups for speech recognition on the Switchboard or LibriSpeech corpus.

Some benchmark setups against other frameworks can be found here. The results are in the RETURNN paper 2016. Performance benchmarks of our LSTM kernel vs CuDNN and other TensorFlow kernels are in TensorFlow LSTM benchmark.

There is also a wiki. Questions can also be asked on StackOverflow using the RETURNN tag.

returnn's People

Contributors

albertz, arnenx, atticus1806, christophmluscher, curufinwe, dengliuhui, dependabot[bot], dewi2a1, doetsch, doetsch-apptek, dthulke, e-matusov, icemole, jacktemaki, kazuki-irie, michelwi, mmz33, moothiringote, neolegends, nikita6187, patrick-wilken, pavelgolik, pvoigtlaender, robin-p-schmitt, spotlight0xff, squarenabla, uralik, vieting, zettelkasten, zhouw321


returnn's Issues

Training, validation, and testing of models, and Predicting using a model

I have a few questions about training, validating, and testing models in the mdlstm IAM demo.

TL;DR: what are the steps in training, validating, and testing a model? How do I use a model to predict?

  1. In training (I used config_real for this), where do I find the training error? The same question applies to validation and testing.
  2. I already have models generated by training. I also tried validating one of them using config_fwd and got a validated model from that. How do I test this model now?
  3. Say I got a model from testing. How do I use that model?

CachedDataset2 memory issue

Hey

In CachedDataset2

The command self.load_seqs(self.expected_load_seq_start, sorted_seq_idx + 1) in get_seq_length(...) loads the sequences from self.expected_load_seq_start to sorted_seq_idx + 1 into the cache, right?

So let's assume the dataset has just been created, so self.expected_load_seq_start = 0, and you want to get the length of every single sequence (as you would e.g. for calling Dataset.get_seq_order_for_epoch(...), which I would call during init_seq_order(...)). For this I would use get_seq_length(...).

This would result in iteratively calling get_seq_length(...) for all sequence indices, but with a fixed self.expected_load_seq_start = 0, so the cache would end up storing all data in self.added_data, right?

The problem is that by doing this I run out of memory.

Am I using any of these functions in a way they were not intended? Please let me know if I am doing anything wrong here.

The obvious fix for me would be to override get_seq_length(...) in my dataset, but I wanted to check whether this is a general bug worth fixing in CachedDataset2.

Thanks

@albertz @doetsch
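
The pattern described in this issue, and the proposed override, can be sketched with a toy class. Everything below is hypothetical: ToyCachedDataset only imitates the behaviour described above and is not RETURNN's CachedDataset2, and it assumes the sequence lengths are available from cheap metadata.

```python
class ToyCachedDataset:
    """Hypothetical stand-in for a cached dataset (not RETURNN's class)."""

    def __init__(self, seq_lens):
        self._seq_lens = seq_lens        # lengths known from cheap metadata
        self.expected_load_seq_start = 0
        self.added_data = {}             # seq_idx -> fully loaded sequence

    def _read_seq(self, seq_idx):
        # Pretend to read the complete sequence from disk.
        return [0.0] * self._seq_lens[seq_idx]

    def load_seqs(self, start, end):
        for seq_idx in range(start, end):
            if seq_idx not in self.added_data:
                self.added_data[seq_idx] = self._read_seq(seq_idx)

    def get_seq_length(self, seq_idx):
        # Imitates the described behaviour: load everything up to seq_idx.
        self.load_seqs(self.expected_load_seq_start, seq_idx + 1)
        return len(self.added_data[seq_idx])


class MetadataLengthDataset(ToyCachedDataset):
    def get_seq_length(self, seq_idx):
        # Override: answer from metadata and leave the cache untouched.
        return self._seq_lens[seq_idx]


seq_lens = [5, 3, 8, 2]
naive = ToyCachedDataset(seq_lens)
lean = MetadataLengthDataset(seq_lens)
naive_lens = [naive.get_seq_length(i) for i in range(len(seq_lens))]
lean_lens = [lean.get_seq_length(i) for i in range(len(seq_lens))]
print(len(naive.added_data), len(lean.added_data))
```

In the toy version, asking for every length fills the cache of the naive class with all sequences, while the subclass keeps the cache empty.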

Error hdf5 in training

Hi all,
I have adapted this code to recognize my own dataset, but I get an error during training (the HDF5 files in the feature folder were generated successfully).
I changed the images (extension: .tif).
Example of ground truth:
AHTD3A0058_Para1_6 err 154 10 0 0 1826 80 comA|naBseMraM
targets code : AHTD3A0058_Para1_6 25 110 89 77 86 81 110

Thank you in advance



    locals:
      self = <local> <HDFDataset 'dev'>
      self._init_start_cache = <local> <bound method HDFDataset._init_start_cache of <HDFDataset 'dev'>>
  File "/home/scr/person/invite/chiekahm/home/sanaw/returnn/CachedDataset.py", line 120, in _init_start_cache
    line: self.load_seqs(0, num_cached, with_cache=False)
    locals:
      self = <local> <HDFDataset 'dev'>
      self.load_seqs = <local> <bound method HDFDataset.load_seqs of <HDFDataset 'dev'>>
      num_cached = <local> 852
      with_cache = <not found>
  File "/home/scr/person/invite/chiekahm/home/sanaw/returnn/CachedDataset.py", line 143, in load_seqs
    line: super(CachedDataset, self).load_seqs(start, end)
    locals:
      super = <builtin> <type 'super'>
      CachedDataset = <global> <class 'CachedDataset.CachedDataset'>
      self = <local> <HDFDataset 'dev'>
      load_seqs = <not found>
      start = <local> 0
      end = <local> 852
  File "/home/scr/person/invite/chiekahm/home/sanaw/returnn/Dataset.py", line 189, in load_seqs
    line: self._load_seqs(start, end)
    locals:
      self = <local> <HDFDataset 'dev'>
      self._load_seqs = <local> <bound method HDFDataset._load_seqs of <HDFDataset 'dev'>>
      start = <local> 0
      end = <local> 852
  File "/home/scr/person/invite/chiekahm/home/sanaw/returnn/HDFDataset.py", line 153, in _load_seqs
    line: self.targets[k][self.get_seq_start(idc)[ldx]:self.get_seq_start(idc)[ldx] + l[ldx]] = fin['targets/data/' + k][p[ldx] : p[ldx] + l[ldx]] #[...]
    locals:
      self = <local> <HDFDataset 'dev'>
      self.targets = <local> {u'classes': array([-1., -1., -1., ..., -1., -1., -1.], dtype=float32), u'sizes': array([-1., -1., -1., ..., -1., -1., -1.], dtype=float32)}
      k = <local> u'classes', len = 7
      self.get_seq_start = <local> <bound method HDFDataset.get_seq_start of <HDFDataset 'dev'>>
      idc = <local> 0
      ldx = <local> 1
      l = <local> array([40950,     9,     2], dtype=int32), len = 3
      fin = <local> <HDF5 file "train_valid.h5" (mode r)>, len = 5
      p = <local> array([0, 0, 0]), len = 3
ValueError: could not broadcast input array from shape (0) into shape (9)
Device gpuX proc, pid 26111: Parent seem to have died: recv_bytes EOFError:
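
The locals in the traceback show the failing assignment expecting 9 entries while the slice read from the file has shape (0). One possible reading, which is an assumption and not confirmed in this issue, is that the targets dataset in the HDF5 file holds fewer entries than the length metadata claims. The NumPy sketch below reproduces the same kind of error and shows a hypothetical helper, copy_targets, that reports the mismatch more explicitly.

```python
import numpy as np


def copy_targets(dst, src, dst_start, src_start, length):
    """Copy `length` entries from src into dst, checking availability first.

    Hypothetical helper for illustration; it is not part of RETURNN.
    """
    chunk = src[src_start:src_start + length]
    if len(chunk) != length:
        raise ValueError(
            "expected %d target entries at offset %d, found %d"
            % (length, src_start, len(chunk)))
    dst[dst_start:dst_start + length] = chunk


dst = np.full(9, -1.0, dtype="float32")
empty_src = np.zeros(0, dtype="float32")  # stands in for an empty dataset

# Slicing past the end of an array silently yields fewer elements,
# so the error only appears at assignment time.
try:
    dst[0:9] = empty_src[0:9]
    raw_error = None
except ValueError as exc:
    raw_error = str(exc)
print(raw_error)

try:
    copy_targets(dst, empty_src, 0, 0, 9)
    checked_error = None
except ValueError as exc:
    checked_error = str(exc)
print(checked_error)
```

Checking the stored length of the targets in the generated file against the sequence lengths would be a reasonable first diagnostic step under this assumption.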

What is the Kaldi, RASR decoder to decode the posterior file in demo/mdlstm/IAM?

In #7, @pvoigtlaender described the test stage in demo/mdlstm/IAM:

"
first you create a hdf 5 file containing the data for the image. For this, you can adapt the script https://github.com/rwth-i6/returnn/blob/master/demos/mdlstm/IAM/create_IAM_dataset.py or you might also start from here https://github.com/rwth-i6/returnn/blob/master/demos/mdlstm/artificial/create_test_h5.py and replace the artificial data with your image.

Afterwards you forward your trained model on this data, which gives you a hdf5 file with posteriors for the image.

Then you run a decoder (e.g. Kaldi, RASR).

It might also be possible in a easier way to run the best path decoder using "task": "daemon". For a rough explanation, please view this issue: #3
"

What do you mean by using a decoder such as Kaldi or RASR? Is there any example code to decode the posterior file test.hd5?

Thank you very much!
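
As a rough idea of what a "best path" decoder does with a matrix of posteriors, here is a minimal sketch. It assumes CTC-style outputs with a dedicated blank label; the function name and the blank index are assumptions for illustration, and this is neither the Kaldi nor the RASR decoder, which typically also apply a lexicon and a language model.

```python
def best_path_decode(posteriors, blank_index):
    """Greedy decoding: take the argmax per frame, merge repeats, drop blanks."""
    labels = []
    prev = None
    for frame in posteriors:
        best = max(range(len(frame)), key=lambda i: frame[i])
        if best != prev and best != blank_index:
            labels.append(best)
        prev = best
    return labels


# Toy posteriors over 3 labels (index 0 is assumed to be the blank).
posteriors = [
    [0.10, 0.80, 0.10],  # label 1
    [0.10, 0.70, 0.20],  # label 1 again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.70, 0.10],  # label 1, counted again because a blank came between
    [0.10, 0.20, 0.70],  # label 2
]
decoded = best_path_decode(posteriors, blank_index=0)
print(decoded)
```

The resulting label indices would still have to be mapped back to characters using the label inventory from training.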

Failed Compile in Windows Environment

EXCEPTION
Traceback (most recent call last):
  File "rnn.py", line 524, in <module>
    line: main(sys.argv)
    locals:
      main = <local> <function main at 0x0000007F9A9C0EA0>
      sys = <local> <module 'sys' (built-in)>
      sys.argv = <local> ['rnn.py', 'demos\\demo-tf-vanilla-lstm.12ax.config'], _[0]: {len = 6}
  File "rnn.py", line 511, in main
    line: init(commandLineOptions=argv[1:])
    locals:
      init = <global> <function init at 0x0000007F9A9C0BF8>
      commandLineOptions = <not found>
      argv = <local> ['rnn.py', 'demos\\demo-tf-vanilla-lstm.12ax.config'], _[0]: {len = 6}
  File "rnn.py", line 338, in init
    line: initFaulthandler()
    locals:
      initFaulthandler = <global> <function initFaulthandler at 0x0000007F9A641AE8>
  File "C:\Dev\returnn\Debug.py", line 194, in initFaulthandler
    line: if install_signal_handler_if_default(signal.SIGUSR1):
    locals:
      install_signal_handler_if_default = <global> <function install_signal_handler_if_default at 0x0000007F9A641950>
      signal = <global> <module 'signal' from 'C:\\AppData\\Local\\conda\\conda\\envs\\tensorflow\\lib\\signal.py'>
      signal.SIGUSR1 = <global> !AttributeError: module 'signal' has no attribute 'SIGUSR1'
AttributeError: module 'signal' has no attribute 'SIGUSR1'

As the Windows environment doesn't support the SIGUSR1 signal, this error occurs.

Any help will be appreciated.
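
A common way to handle a platform-specific signal such as SIGUSR1 is to look it up by name and skip the installation when it is missing. The sketch below uses only the standard signal module; the helper name install_handler_if_available and the handler dump_state are made up for this example and are not taken from RETURNN's Debug.py.

```python
import signal


def install_handler_if_available(signal_name, handler):
    """Install handler for the named signal.

    Returns False without raising if the platform does not define the signal.
    """
    signum = getattr(signal, signal_name, None)
    if signum is None:
        return False
    signal.signal(signum, handler)
    return True


def dump_state(signum, frame):
    # Placeholder handler; a real one might print stack traces.
    print("received signal", signum)


installed = install_handler_if_available("SIGUSR1", dump_state)
missing = install_handler_if_available("SIGDOESNOTEXIST", dump_state)
print(installed, missing)
```

On Windows, where the signal module has no SIGUSR1 attribute, the first call would return False and no AttributeError would be raised.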

Problems running a simple example

I tried to run the example in the demos/mdlstm/IAM folder. Here is the terminal output when I run the go.sh file:

deep@deep-System-Product-Name ~/Documents/Deep Learning Libraries/returnn/demos/mdlstm/IAM $ ./go.sh
  File "./create_IAM_dataset.py", line 96
    print i, "/", len(file_list)
          ^
SyntaxError: invalid syntax
CRNN starting up, version 20171010.153820--git-6a5bd96, pid 23775, cwd /home/deep/Documents/Deep Learning Libraries/returnn/demos/mdlstm/IAM
CRNN command line options: ['config_demo']
Theano: 0.9.0.dev-c697eeab84... (<site-package> in /home/deep/anaconda3/lib/python3.6/site-packages/theano)
pynvml not available, memory information missing
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: TITAN Xp (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 5110)
Device gpuX proc starting up, pid 23803
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True'
Device gpuX proc exception: ('Shape must be integers', MakeVector{dtype='float32'}.0, 'float32')
Unhandled exception <class 'TypeError'> in thread <_MainThread(MainThread, started 140551095064320)>, proc 23803.

Thread current, main, <_MainThread(MainThread, started 140551095064320)>:
(Excluded thread.)

That were all threads.
EXCEPTION
Traceback (most recent call last):
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Device.py", line 977, in process
    line: self.process_inner(device, config, self.update_specs, asyncTask)
    locals:
      self = <local> <Device.Device object at 0x7fd49507d860>
      self.process_inner = <local> <bound method Device.process_inner of <Device.Device object at 0x7fd49507d860>>
      device = <local> 'gpuX'
      config = <local> <Config.Config object at 0x7fd45d8594a8>
      self.update_specs = <local> {'update_rule': 'global', 'update_params': {}, 'layers': [], 'block_size': 0}
      asyncTask = <local> <TaskSystem.AsyncTask object at 0x7fd480055ef0>
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Device.py", line 1030, in process_inner
    line: self.initialize(config, update_specs=update_specs)
    locals:
      self = <local> <Device.Device object at 0x7fd49507d860>
      self.initialize = <local> <bound method Device.initialize of <Device.Device object at 0x7fd49507d860>>
      config = <local> <Config.Config object at 0x7fd45d8594a8>
      update_specs = <local> {'update_rule': 'global', 'update_params': {}, 'layers': [], 'block_size': 0}
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Device.py", line 439, in initialize
    line: self.trainnet = LayerNetwork.from_config_topology(config, train_flag=True, eval_flag=False)
    locals:
      self = <local> <Device.Device object at 0x7fd49507d860>
      self.trainnet = <local> !AttributeError: 'Device' object has no attribute 'trainnet'
      LayerNetwork = <global> <class 'Network.LayerNetwork'>
      LayerNetwork.from_config_topology = <global> <bound method LayerNetwork.from_config_topology of <class 'Network.LayerNetwork'>>
      config = <local> <Config.Config object at 0x7fd45d8594a8>
      train_flag = <not found>
      eval_flag = <local> False
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 119, in from_config_topology
    line: return cls.from_json_and_config(json_content, config, mask=mask, **kwargs)
    locals:
      cls = <local> <class 'Network.LayerNetwork'>
      cls.from_json_and_config = <local> <bound method LayerNetwork.from_json_and_config of <class 'Network.LayerNetwork'>>
      json_content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      config = <local> <Config.Config object at 0x7fd45d8594a8>
      mask = <local> None
      kwargs = <local> {'train_flag': True, 'eval_flag': False}
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 204, in from_json_and_config
    line: network = cls.from_json(json_content, **dict_joined(kwargs, cls.init_args_from_config(config)))
    locals:
      network = <not found>
      cls = <local> <class 'Network.LayerNetwork'>
      cls.from_json = <local> <bound method LayerNetwork.from_json of <class 'Network.LayerNetwork'>>
      json_content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      dict_joined = <global> <function dict_joined at 0x7fd46403ce18>
      kwargs = <local> {'mask': None, 'train_flag': True, 'eval_flag': False}
      cls.init_args_from_config = <local> <bound method LayerNetwork.init_args_from_config of <class 'Network.LayerNetwork'>>
      config = <local> <Config.Config object at 0x7fd45d8594a8>
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 422, in from_json
    line: traverse(json_content, layer_name, trg, index)
    locals:
      traverse = <local> <function LayerNetwork.from_json.<locals>.traverse at 0x7fd45d2d0d08>
      json_content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      layer_name = <local> 'output', len = 6
      trg = <local> 'classes', len = 7
      index = <local> j_classes
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 350, in traverse
    line: index = traverse(content, prev, target, index)
    locals:
      index = <local> j_classes
      traverse = <local> <function LayerNetwork.from_json.<locals>.traverse at 0x7fd45d2d0d08>
      content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      prev = <local> 'mdlstm4', len = 7
      target = <local> 'classes', len = 7
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 350, in traverse
    line: index = traverse(content, prev, target, index)
    locals:
      index = <local> j_classes
      traverse = <local> <function LayerNetwork.from_json.<locals>.traverse at 0x7fd45d2d0d08>
      content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      prev = <local> 'conv4'
      target = <local> 'classes', len = 7
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 350, in traverse
    line: index = traverse(content, prev, target, index)
    locals:
      index = <local> j_classes
      traverse = <local> <function LayerNetwork.from_json.<locals>.traverse at 0x7fd45d2d0d08>
      content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      prev = <local> 'mdlstm3', len = 7
      target = <local> 'classes', len = 7
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 350, in traverse
    line: index = traverse(content, prev, target, index)
    locals:
      index = <local> j_classes
      traverse = <local> <function LayerNetwork.from_json.<locals>.traverse at 0x7fd45d2d0d08>
      content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      prev = <local> 'conv3'
      target = <local> 'classes', len = 7
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 350, in traverse
    line: index = traverse(content, prev, target, index)
    locals:
      index = <local> j_classes
      traverse = <local> <function LayerNetwork.from_json.<locals>.traverse at 0x7fd45d2d0d08>
      content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      prev = <local> 'mdlstm2', len = 7
      target = <local> 'classes', len = 7
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 350, in traverse
    line: index = traverse(content, prev, target, index)
    locals:
      index = <local> j_classes
      traverse = <local> <function LayerNetwork.from_json.<locals>.traverse at 0x7fd45d2d0d08>
      content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      prev = <local> 'conv2'
      target = <local> 'classes', len = 7
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 350, in traverse
    line: index = traverse(content, prev, target, index)
    locals:
      index = <local> j_classes
      traverse = <local> <function LayerNetwork.from_json.<locals>.traverse at 0x7fd45d2d0d08>
      content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      prev = <local> 'mdlstm1', len = 7
      target = <local> 'classes', len = 7
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 350, in traverse
    line: index = traverse(content, prev, target, index)
    locals:
      index = <local> j_classes
      traverse = <local> <function LayerNetwork.from_json.<locals>.traverse at 0x7fd45d2d0d08>
      content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      prev = <local> 'conv1'
      target = <local> 'classes', len = 7
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 350, in traverse
    line: index = traverse(content, prev, target, index)
    locals:
      index = <local> j_classes
      traverse = <local> <function LayerNetwork.from_json.<locals>.traverse at 0x7fd45d2d0d08>
      content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      prev = <local> 'mdlstm0', len = 7
      target = <local> 'classes', len = 7
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 350, in traverse
    line: index = traverse(content, prev, target, index)
    locals:
      index = <local> j_classes
      traverse = <local> <function LayerNetwork.from_json.<locals>.traverse at 0x7fd45d2d0d08>
      content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      prev = <local> 'conv0'
      target = <local> 'classes', len = 7
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 350, in traverse
    line: index = traverse(content, prev, target, index)
    locals:
      index = <local> j_classes
      traverse = <local> <function LayerNetwork.from_json.<locals>.traverse at 0x7fd45d2d0d08>
      content = <local> {'classes_source': {'class': 'source', 'data_key': 'sizes', 'from': [], 'n_out': 2, 'dtype': 'int32'}, '1Dto2D': {'class': '1Dto2D', 'from': ['data', 'classes_source']}, 'conv0': {'class': 'conv2', 'n_features': 15, 'filter': [3, 3], 'pool_size': [2, 2], 'from': ['1Dto2D']}, 'mdlstm0': {'class': ..., len = 13
      prev = <local> '1Dto2D', len = 6
      target = <local> 'classes', len = 7
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Network.py", line 408, in traverse
    line: return network.add_layer(layer_class(**params)).index
    locals:
      network = <local> <Network.LayerNetwork object at 0x7fd45d2c1fd0>
      network.add_layer = <local> <bound method LayerNetwork.add_layer of <Network.LayerNetwork object at 0x7fd45d2c1fd0>>
      layer_class = <local> <class 'NetworkTwoDLayer.OneDToTwoDLayer'>
      params = <local> {'sources': [<<class 'NetworkBaseLayer.SourceLayer'> class:source name:data>, <<class 'NetworkBaseLayer.SourceLayer'> class:source name:classes_source>], 'dropout': 0.0, 'name': '1Dto2D', 'train_flag': True, 'eval_flag': False, 'network': <Network.LayerNetwork object at 0x7fd45d2c1fd0>, 'mask': N..., len = 9
      index = <local> j_sizes
  File "/home/deep/Documents/Deep Learning Libraries/returnn/NetworkTwoDLayer.py", line 51, in __init__
    line: sizes = sizes.reshape((2, sizes.size / 2)).dimshuffle(1, 0)
    locals:
      sizes = <local> Elemwise{Cast{float32}}.0
      sizes.reshape = <local> <bound method _tensor_py_operators.reshape of Elemwise{Cast{float32}}.0>
      sizes.size = <local> Prod{axis=None, dtype='int64', acc_dtype='int64'}.0
      dimshuffle = <not found>
  File "/home/deep/anaconda3/lib/python3.6/site-packages/theano/tensor/var.py", line 321, in reshape
    line: return theano.tensor.basic.reshape(self, shape, ndim=ndim)
    locals:
      theano = <global> <module 'theano' from '/home/deep/anaconda3/lib/python3.6/site-packages/theano/__init__.py'>
      theano.tensor = <global> <module 'theano.tensor' from '/home/deep/anaconda3/lib/python3.6/site-packages/theano/tensor/__init__.py'>
      theano.tensor.basic = <global> <module 'theano.tensor.basic' from '/home/deep/anaconda3/lib/python3.6/site-packages/theano/tensor/basic.py'>
      theano.tensor.basic.reshape = <global> <function reshape at 0x7fd4731cca60>
      self = <local> Elemwise{Cast{float32}}.0
      shape = <local> (2, Elemwise{true_div,no_inplace}.0)
      ndim = <local> None
  File "/home/deep/anaconda3/lib/python3.6/site-packages/theano/tensor/basic.py", line 4910, in reshape
    line: rval = op(x, newshape)
    locals:
      rval = <not found>
      op = <local> <theano.tensor.basic.Reshape object at 0x7fd45d131358>
      x = <local> Elemwise{Cast{float32}}.0
      newshape = <local> MakeVector{dtype='float32'}.0
  File "/home/deep/anaconda3/lib/python3.6/site-packages/theano/gof/op.py", line 615, in __call__
    line: node = self.make_node(*inputs, **kwargs)
    locals:
      node = <not found>
      self = <local> <theano.tensor.basic.Reshape object at 0x7fd45d131358>
      self.make_node = <local> <bound method Reshape.make_node of <theano.tensor.basic.Reshape object at 0x7fd45d131358>>
      inputs = <local> (Elemwise{Cast{float32}}.0, MakeVector{dtype='float32'}.0)
      kwargs = <local> {}
  File "/home/deep/anaconda3/lib/python3.6/site-packages/theano/tensor/basic.py", line 4748, in make_node
    line: raise TypeError("Shape must be integers", shp, shp.dtype)
    locals:
      TypeError = <builtin> <class 'TypeError'>
      shp = <local> MakeVector{dtype='float32'}.0
      shp.dtype = <local> 'float32', len = 7
TypeError: ('Shape must be integers', MakeVector{dtype='float32'}.0, 'float32')
Device proc gpuX (gpuZ) died: ProcConnectionDied('recv_bytes EOFError: ',)
Theano flags: compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True
EXCEPTION
Traceback (most recent call last):
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Device.py", line 329, in startProc
    line: self._startProc(*args, **kwargs)
    locals:
      self = <local> <Device.Device object at 0x7f35d859d4e0>
      self._startProc = <local> <bound method Device._startProc of <Device.Device object at 0x7f35d859d4e0>>
      args = <local> ('gpuZ',)
      kwargs = <local> {}
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Device.py", line 383, in _startProc
    line: interrupt_main()
    locals:
      interrupt_main = <global> <function interrupt_main at 0x7f35dd813ae8>
  File "/home/deep/Documents/Deep Learning Libraries/returnn/Util.py", line 607, in interrupt_main
    line: sys.exit(1)  # And exit the thread.
    locals:
      sys = <global> <module 'sys' (built-in)>
      sys.exit = <global> <built-in function exit>
SystemExit: 1
KeyboardInterrupt
EXCEPTION
Traceback (most recent call last):
  File "../../../rnn.py", line 520, in main
    line: init(commandLineOptions=argv[1:])
    locals:
      init = <global> <function init at 0x7f35d85b1950>
      commandLineOptions = <not found>
      argv = <local> ['../../../rnn.py', 'config_demo'], _[0]: {len = 15}
  File "../../../rnn.py", line 336, in init
    line: devices = initDevices()
    locals:
      devices = <not found>
      initDevices = <global> <function initDevices at 0x7f35d85b1488>
  File "../../../rnn.py", line 152, in initDevices
    line: time.sleep(0.25)
    locals:
      time = <global> <module 'time' (built-in)>
      time.sleep = <global> <built-in function sleep>
KeyboardInterrupt
Quitting

Running on Ubuntu 16.04, CUDA 8.0, Python 3.6, Theano 0.9.0, Titan Xp.
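
Both errors in the output above look consistent with Python 2 code being run under Python 3.6: the print statement in create_IAM_dataset.py is a SyntaxError in Python 3, and the shape built from sizes.size / 2 appears in the traceback as a true_div result, which is a float. The sketch below shows the two language differences in isolation; it does not touch Theano, and whether these are the only changes the demo needs is not verified here.

```python
file_list = ["a.png", "b.png", "c.png"]

# Python 3: print is a function, so the arguments go in parentheses.
for i, _ in enumerate(file_list):
    print(i, "/", len(file_list))

size = 8
true_div = size / 2    # "/" is true division in Python 3 and yields a float
floor_div = size // 2  # "//" is floor division and keeps the result an int

print(type(true_div).__name__, type(floor_div).__name__)

# A float is rejected wherever an integer dimension is required.
try:
    range(true_div)
    range_error = None
except TypeError as exc:
    range_error = str(exc)
print(range_error)
```

Under this reading, running the demo with Python 2 or switching the division to // in the reshape call would be the obvious things to try.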

How to specify training data?

In the doc, it says:

train / dev
    The datasets. This can be a filename to a hdf-file.
    Or it can be a dict with an entry ``class`` where you can choose a from a variety
    of other dataset implementations, including many synthetic generated data.

It is unclear how the hdf-file or the "class" should be created. Suppose I have a batch data generator like this:

import numpy as np

def get_batches(num_batches, batch_size, sequence_length, input_dimension):
    """Yields batches of X and Y.
        X has shape (batch_size, sequence_length, input_dimension)
        Y has shape (batch_size, input_dimension)
    """
    for i in range(num_batches):
        # Random inputs; the target is the last frame of each sequence.
        x = np.random.rand(batch_size, sequence_length, input_dimension)
        y = x[:, -1, :]
        yield x, y

What's next?
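
One possible first step, given that the dataset methods mentioned in the CachedDataset2 issue above (load_seqs, get_seq_length) work on individual sequences, is to flatten the batches into single (x, y) pairs before writing them to a file or serving them from a dataset class. The helper below is a hypothetical sketch of only that step; it does not show RETURNN's HDF layout or dataset API.

```python
def iter_sequences(batches):
    """Flatten an iterable of (x_batch, y_batch) pairs into (x, y) pairs."""
    for x_batch, y_batch in batches:
        for x, y in zip(x_batch, y_batch):
            yield x, y


# Two toy batches with two sequences each. Nested lists are used here,
# but NumPy arrays with a leading batch axis iterate the same way.
toy_batches = [
    ([[[1.0], [2.0]], [[3.0], [4.0]]], [[2.0], [4.0]]),
    ([[[5.0], [6.0]], [[7.0], [8.0]]], [[6.0], [8.0]]),
]
sequences = list(iter_sequences(toy_batches))
print(len(sequences))
```

Each resulting pair is one sequence of shape (sequence_length, input_dimension) with its target of shape (input_dimension,).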

Error running demo of IAM script

I attempted to run the example in the demos/mdlstm/IAM folder. Here is the terminal output when I run the go.sh file:

CRNN starting up, version 20171107.125510--git-ef4f6a7-dirty, pid 21149, cwd /home/scr/person/invite/chiekahm/home/sanaw/returnn/demos/mdlstm/IAM
CRNN command line options: ['config_demo']
Theano: 0.9.0 (<site-package> in /home/scr/person/invite/chiekahm/.local/lib/python2.7/site-packages/theano)
pynvml not available, memory information missing
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN not available)
Device gpuX proc starting up, pid 21224
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,floatX=float32,force_device=True'
Device train-network: Used data keys: ['classes', 'data', 'sizes']

ectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).\n/usr/bin/ld\xc2\xa0: ne peut trouver -lcudnn\ncollect2: error: ld returned 1 exit status\n")
Device proc gpuX (gpuZ) died: ProcConnectionDied('recv_bytes EOFError: ',)
Theano flags: compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,floatX=float32,force_device=True
EXCEPTION
Traceback (most recent call last):
  File "/home/scr/person/invite/chiekahm/home/sanaw/returnn/Device.py", line 332, in startProc
    line: self._startProc(*args, **kwargs)
    locals:
      self = <local> <Device.Device object at 0x7fbba87b1a10>
      self._startProc = <local> <bound method Device._startProc of <Device.Device object at 0x7fbba87b1a10>>
      args = <local> ('gpuZ',)
      kwargs = <local> {}
  File "/home/scr/person/invite/chiekahm/home/sanaw/returnn/Device.py", line 386, in _startProc
    line: interrupt_main()
    locals:
      interrupt_main = <global> <function interrupt_main at 0x7fbba9310500>
  File "/home/scr/person/invite/chiekahm/home/sanaw/returnn/Util.py", line 625, in interrupt_main
    line: sys.exit(1)  # And exit the thread.
    locals:
      sys = <global> <module 'sys' (built-in)>
      sys.exit = <global> <built-in function exit>
SystemExit: 1
KeyboardInterrupt
EXCEPTION
Traceback (most recent call last):
  File "../../../rnn.py", line 519, in main
    line: init(commandLineOptions=argv[1:])
    locals:
      init = <global> <function init at 0x7fbba87ac8c0>
      commandLineOptions = <not found>
      argv = <local> ['../../../rnn.py', 'config_demo'], _[0]: {len = 15}
  File "../../../rnn.py", line 335, in init
    line: devices = initDevices()
    locals:
      devices = <not found>
      initDevices = <global> <function initDevices at 0x7fbba87ac500>
  File "../../../rnn.py", line 151, in initDevices
    line: sleep(0.25)
    locals:
      sleep = <global> <built-in function sleep>
KeyboardInterrupt
Quitting
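The decisive line in this log appears to be the linker message "ne peut trouver -lcudnn" (French for "cannot find -lcudnn"), which fits the earlier "cuDNN not available" notice. One quick way to check whether a cuDNN shared library is visible on the machine is the standard-library function ctypes.util.find_library. This is only a sketch: find_cudnn is a hypothetical helper name, and the lookup follows the system's library search rules, so a hit does not guarantee that the linker invoked by Theano searches the same directories.

```python
import ctypes.util


def find_cudnn():
    """Return the name or path of the cuDNN library if it can be located, else None."""
    return ctypes.util.find_library("cudnn")


name = find_cudnn()
if name is None:
    print("libcudnn was not found on the default library search path")
else:
    print("found cuDNN library:", name)
```

If the library lives in a non-standard directory, that directory usually has to be made known to the linker before the compilation step can succeed.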

Create HDF5 without targets

Hi! I've been using RETURNN's forward mode regularly, based on the IAM recipe, but now I'm wondering how I could run this forward task without having to pass the target information to the HDF5 file. I'm using the method from create_IAM_dataset.py:


def write_to_hdf(file_list, transcription_list, charlist, n_labels, out_file_name, dataset_prefix, pad_y=15, pad_x=15, compress=True):
  with h5py.File(out_file_name, "w") as f:
    f.attrs["inputPattSize"] = 1
    f.attrs["numDims"] = 1
    f.attrs["numSeqs"] = len(file_list)
    classes = charlist

    inputs = []
    sizes = []
    seq_lengths = []
    targets = []
    for i, (img_name, transcription) in enumerate(zip(file_list, transcription_list)):
      targets += transcription
      img = imread(img_name)
      img = 255 - img
      img = numpy.pad(img, ((pad_y, pad_y), (pad_x, pad_x)), 'constant')
      sizes.append(img.shape)
      img = img.reshape(img.size, 1)
      inputs.append(img)
      seq_lengths.append([[img.size, len(transcription), 2]])
      if i % 100 == 0:
        print(i, "/", len(file_list))

    inputs = numpy.concatenate(inputs, axis=0)
    sizes = numpy.concatenate(numpy.array(sizes, dtype="int32"), axis=0)
    seq_lengths = numpy.concatenate(numpy.array(seq_lengths, dtype="int32"), axis=0)
    targets = numpy.array(targets, dtype="int32")
    
    f.attrs["numTimesteps"] = inputs.shape[0]

    if compress:
      f.create_dataset("inputs", compression="gzip", data=inputs.astype("float32") / 255.0)
    else:
      f["inputs"] = inputs.astype("float32") / 255.0
    hdf5_strings(f, "labels", classes)
    f["seqLengths"] = seq_lengths
    seq_tags = [dataset_prefix + "/" + tag.split("/")[-1].split(".png")[0] for tag in file_list]
    hdf5_strings(f, "seqTags", seq_tags)

    f["targets/data/classes"] = targets
    f["targets/data/sizes"] = sizes
    hdf5_strings(f, "targets/labels/classes", classes)
    hdf5_strings(f, "targets/labels/sizes", ["foo"]) #TODO, can we just omit it?
    g = f.create_group("targets/size")
    g.attrs["classes"] = len(classes)
    g.attrs["sizes"] = 2

I tried omitting the line f["targets/data/classes"] = targets, but it didn't work.

Thanks in advance,
Dayvid Castro.

GpuArrayException: cuMemcpyDtoHAsync: CUDA_ERROR_INVALID_VALUE: invalid argument

Hi, I tried running training on the IAM handwriting dataset. Training crashed after half an hour or so with the following messages:

Using cuDNN version 5110 on context None
Mapped name None to device cuda0: GeForce GTX 1080 (0000:01:00.0)
CRNN starting up, version 20170921.193339--git-9f9dd5d-dirty, pid 12964, cwd /home/mpl2/Annapurna/returnn_IAM/demos/mdlstm/IAM
CRNN command line options: ['config_real']
Theano: 0.9.0 (<site-package> in /home/mpl2/anaconda2/envs/theano/lib/python2.7/site-packages/theano)
faulthandler import error. No module named faulthandler
pynvml not available, memory information missing
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1080 (CNMeM is enabled with initial size: 20.0% of memory, cuDNN 5110)
Device gpuX proc starting up, pid 13015
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True'
faulthandler import error. No module named faulthandler
Device train-network: Used data keys: ['classes', 'data', 'sizes']
using adam with nag and momentum schedule                                                                                                                                                                   
Device gpu0 proc, pid 13015 is ready for commands.
Devices: Used in multiprocessing mode.
loading file features/raw/train_valid.h5
cached 616 seqs 1.88295629248 GB (fully loaded, 40.7837103745 GB left over)
loading file features/raw/train.1.h5
loading file features/raw/train.2.h5
cached 5545 seqs 16.9867935553 GB (fully loaded, 39.2744432362 GB left over)
Train data:
  input: 1 x 1
  output: {u'classes': [79, 1], 'data': [1, 2], u'sizes': [2, 1]}
  HDF dataset, sequences: 5545, frames: 1519952558
Dev data:
  HDF dataset, sequences: 616, frames: 168484077
Devices:
  gpu0: GeForce GTX 1080 (units: 1000 clock: 1.00Ghz memory: 2.0GB) working on 1 batch (update on device)
Learning-rate-control: no file specified, not saving history (no proper restart possible)
using adam with nag and momentum schedule
CUDA.use gpu in proc 12964
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1080 (CNMeM is enabled with initial size: 20.0% of memory, cuDNN 5110)
Network layer topology:
  input #: 1
  hidden 1Dto2D '1Dto2D' #: 1
  hidden source 'classes_source' #: 2
  hidden conv2 'conv0' #: 15
  hidden conv2 'conv1' #: 45
  hidden conv2 'conv2' #: 75
  hidden conv2 'conv3' #: 105
  hidden conv2 'conv4' #: 105
  hidden mdlstm 'mdlstm0' #: 30
  hidden mdlstm 'mdlstm1' #: 60
  hidden mdlstm 'mdlstm2' #: 90
  hidden mdlstm 'mdlstm3' #: 120
  hidden mdlstm 'mdlstm4' #: 120
  output softmax 'output' #: 80
net params #: 2627660
net trainable params: [W_conv0, b_conv0, W_conv1, b_conv1, W_conv2, b_conv2, W_conv3, b_conv3, W_conv4, b_conv4, U1_mdlstm0, U2_mdlstm0, U3_mdlstm0, U4_mdlstm0, V1_mdlstm0, V2_mdlstm0, V3_mdlstm0, V4_mdlstm0, W1_mdlstm0, W2_mdlstm0, W3_mdlstm0, W4_mdlstm0, b1_mdlstm0, b2_mdlstm0, b3_mdlstm0, b4_mdlstm0, U1_mdlstm1, U2_mdlstm1, U3_mdlstm1, U4_mdlstm1, V1_mdlstm1, V2_mdlstm1, V3_mdlstm1, V4_mdlstm1, W1_mdlstm1, W2_mdlstm1, W3_mdlstm1, W4_mdlstm1, b1_mdlstm1, b2_mdlstm1, b3_mdlstm1, b4_mdlstm1, U1_mdlstm2, U2_mdlstm2, U3_mdlstm2, U4_mdlstm2, V1_mdlstm2, V2_mdlstm2, V3_mdlstm2, V4_mdlstm2, W1_mdlstm2, W2_mdlstm2, W3_mdlstm2, W4_mdlstm2, b1_mdlstm2, b2_mdlstm2, b3_mdlstm2, b4_mdlstm2, U1_mdlstm3, U2_mdlstm3, U3_mdlstm3, U4_mdlstm3, V1_mdlstm3, V2_mdlstm3, V3_mdlstm3, V4_mdlstm3, W1_mdlstm3, W2_mdlstm3, W3_mdlstm3, W4_mdlstm3, b1_mdlstm3, b2_mdlstm3, b3_mdlstm3, b4_mdlstm3, U1_mdlstm4, U2_mdlstm4, U3_mdlstm4, U4_mdlstm4, V1_mdlstm4, V2_mdlstm4, V3_mdlstm4, V4_mdlstm4, W1_mdlstm4, W2_mdlstm4, W3_mdlstm4, W4_mdlstm4, b1_mdlstm4, b2_mdlstm4, b3_mdlstm4, b4_mdlstm4, W_in_mdlstm4_output, b_output]
start training at epoch 1 and batch 0
using batch size: 600000, max seqs: 10
learning rate control: ConstantLearningRate(defaultLearningRate=0.0005, minLearningRate=0.0, defaultLearningRates={1: 0.0005, 25: 0.0003, 35: 0.0001}, errorMeasureKey=None, relativeErrorAlsoRelativeToLearningRate=False, minNumEpochsPerNewLearningRate=0, filename=None), epoch data: 1: EpochData(learningRate=0.0005, error={}), 25: EpochData(learningRate=0.0003, error={}), 35: EpochData(learningRate=0.0001, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 0.0005 ...
TaskThread train failed
EXCEPTION
Traceback (most recent call last):
  File "/home/mpl2/Annapurna/returnn_IAM/EngineTask.py", line 376, in run
    line: self.run_inner()
    locals:
      self = <local> <TrainTaskThread(TaskThread train, started daemon 140227476027136)>
      self.run_inner = <local> <bound method TrainTaskThread.run_inner of <TrainTaskThread(TaskThread train, started daemon 140227476027136)>>
  File "/home/mpl2/Annapurna/returnn_IAM/EngineTask.py", line 401, in run_inner
    line: device.prepare(epoch=self.epoch, **self.get_device_prepare_args())
    locals:
      device = <local> <Device.Device object at 0x7f8940e7a8d0>
      device.prepare = <local> <bound method Device.prepare of <Device.Device object at 0x7f8940e7a8d0>>
      epoch = <not found>
      self = <local> <TrainTaskThread(TaskThread train, started daemon 140227476027136)>
      self.epoch = <local> 1
      self.get_device_prepare_args = <local> <bound method TrainTaskThread.get_device_prepare_args of <TrainTaskThread(TaskThread train, started daemon 140227476027136)>>
  File "/home/mpl2/Annapurna/returnn_IAM/Device.py", line 1374, in prepare
    line: self.set_net_params(network)
    locals:
      self = <local> <Device.Device object at 0x7f8940e7a8d0>
      self.set_net_params = <local> <bound method Device.set_net_params of <Device.Device object at 0x7f8940e7a8d0>>
      network = <local> <Network.LayerNetwork object at 0x7f89572804d0>
  File "/home/mpl2/Annapurna/returnn_IAM/Device.py", line 1193, in set_net_params
    line: self.set_net_encoded_params([
            numpy.asarray(p.get_value()) for p in network.get_all_params_vars()])
    locals:
      self = <local> <Device.Device object at 0x7f8940e7a8d0>
      self.set_net_encoded_params = <local> <bound method Device.set_net_encoded_params of <Device.Device object at 0x7f8940e7a8d0>>
      numpy = <global> <module 'numpy' from '/home/mpl2/anaconda2/envs/theano/lib/python2.7/site-packages/numpy/__init__.pyc'>
      numpy.asarray = <global> <function asarray at 0x7f89691f37d0>
      p = <local> W_conv0
      p.get_value = <local> <bound method GpuArraySharedVariable.get_value of W_conv0>
      network = <local> <Network.LayerNetwork object at 0x7f89572804d0>
      network.get_all_params_vars = <local> <bound method LayerNetwork.get_all_params_vars of <Network.LayerNetwork object at 0x7f89572804d0>>
  File "/home/mpl2/anaconda2/envs/theano/lib/python2.7/site-packages/theano/gpuarray/type.py", line 602, in get_value
    line: return np.asarray(self.container.value)
    locals:
      np = <global> <module 'numpy' from '/home/mpl2/anaconda2/envs/theano/lib/python2.7/site-packages/numpy/__init__.pyc'>
      np.asarray = <global> <function asarray at 0x7f89691f37d0>
      self = <local> W_conv0
      self.container = <local> <gpuarray.array(<content not available>)>
      self.container.value = <local> gpuarray.array(<content not available>), len = 15, _[0]: {len = 1, _[0]: {len = 3, _[0]: {len = 3}}}
  File "/home/mpl2/anaconda2/envs/theano/lib/python2.7/site-packages/numpy/core/numeric.py", line 531, in asarray
    line: return array(a, dtype, copy=False, order=order)
    locals:
      array = <global> <built-in function array>
      a = <local> gpuarray.array(<content not available>), len = 15, _[0]: {len = 1, _[0]: {len = 3, _[0]: {len = 3}}}
      dtype = <local> None
      copy = <not found>
      order = <local> None
  File "pygpu/gpuarray.pyx", line 1734, in pygpu.gpuarray.GpuArray.__array__
    -- code not available --
  File "pygpu/gpuarray.pyx", line 1407, in pygpu.gpuarray._pygpu_as_ndarray
    -- code not available --
  File "pygpu/gpuarray.pyx", line 394, in pygpu.gpuarray.array_read
    -- code not available --
GpuArrayException: cuMemcpyDtoHAsync(dst, src->ptr + srcoff, sz, ctx->mem_s): CUDA_ERROR_INVALID_VALUE: invalid argument

KeyboardInterrupt
EXCEPTION
Traceback (most recent call last):
  File "../../../rnn.py", line 521, in main
    line: executeMainTask()
    locals:
      executeMainTask = <global> <function executeMainTask at 0x7f8940e76ed8>
  File "../../../rnn.py", line 375, in executeMainTask
    line: engine.train()
    locals:
      engine = <global> <Engine.Engine instance at 0x7f8940e93560>
      engine.train = <global> <bound method Engine.train of <Engine.Engine instance at 0x7f8940e93560>>
  File "/home/mpl2/Annapurna/returnn_IAM/Engine.py", line 378, in train
    line: self.train_epoch()
    locals:
      self = <local> <Engine.Engine instance at 0x7f8940e93560>
      self.train_epoch = <local> <bound method Engine.train_epoch of <Engine.Engine instance at 0x7f8940e93560>>
  File "/home/mpl2/Annapurna/returnn_IAM/Engine.py", line 494, in train_epoch
    line: trainer.join()
    locals:
      trainer = <local> <TrainTaskThread(TaskThread train, stopped daemon 140227476027136)>
      trainer.join = <local> <bound method TrainTaskThread.join_hacked of <TrainTaskThread(TaskThread train, stopped daemon 140227476027136)>>
  File "/home/mpl2/Annapurna/returnn_IAM/Util.py", line 559, in join_hacked
    line: join_orig(threadObj, timeout=0.1)
    locals:
      join_orig = <local> <unbound method Thread.join>
      threadObj = <local> <TrainTaskThread(TaskThread train, stopped daemon 140227476027136)>
      timeout = <local> None
  File "/home/mpl2/anaconda2/envs/theano/lib/python2.7/threading.py", line 951, in join
    line: self.__block.wait(delay)
    locals:
      self = <local> <TrainTaskThread(TaskThread train, stopped daemon 140227476027136)>
      self.__block = <local> !AttributeError: 'TrainTaskThread' object has no attribute '__block'
      self.__block.wait = <local> !AttributeError: 'TrainTaskThread' object has no attribute '__block'
      delay = <local> 0.09999895095825195
  File "/home/mpl2/Annapurna/returnn_IAM/Util.py", line 572, in cond_wait_hacked
    line: cond_wait_orig(cond, timeout=timeout)
    locals:
      cond_wait_orig = <local> <unbound method _Condition.wait>
      cond = <local> <Condition(<thread.lock object at 0x7f8957563dd0>, 0)>
      timeout = <local> 0.09999895095825195
  File "/home/mpl2/anaconda2/envs/theano/lib/python2.7/threading.py", line 359, in wait
    line: _sleep(delay)
    locals:
      _sleep = <global> <built-in function sleep>
      delay = <local> 0.032
KeyboardInterrupt
Quitting

Bug in running full IAM dataset

Hey guys. When I try to run ./go.sh for the full IAM data set using config_real, it gets stuck after loading all the images. I have copied the terminal output below; it gives me no error whatsoever:

sadeghi@kiew:~/returnn/demos/mdlstm/IAM> ./go.sh
/usr/lib64/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
('converting IAM_lines to', 'features/raw/train.1.h5', 'and', 'features/raw/train.2.h5')
features/raw/train.1.h5
(0, '/', 2772)
(100, '/', 2772)
(200, '/', 2772)
(300, '/', 2772)
(400, '/', 2772)
(500, '/', 2772)
(600, '/', 2772)
(700, '/', 2772)
(800, '/', 2772)
(900, '/', 2772)
(1000, '/', 2772)
(1100, '/', 2772)
(1200, '/', 2772)
(1300, '/', 2772)
(1400, '/', 2772)
(1500, '/', 2772)
(1600, '/', 2772)
(1700, '/', 2772)
(1800, '/', 2772)
(1900, '/', 2772)
(2000, '/', 2772)
(2100, '/', 2772)
(2200, '/', 2772)
(2300, '/', 2772)
(2400, '/', 2772)
(2500, '/', 2772)
(2600, '/', 2772)
(2700, '/', 2772)
features/raw/train.2.h5
(0, '/', 2773)
(100, '/', 2773)
(200, '/', 2773)
(300, '/', 2773)
(400, '/', 2773)
(500, '/', 2773)
(600, '/', 2773)
(700, '/', 2773)
(800, '/', 2773)
(900, '/', 2773)
(1000, '/', 2773)
(1100, '/', 2773)
(1200, '/', 2773)
(1300, '/', 2773)
(1400, '/', 2773)
(1500, '/', 2773)
(1600, '/', 2773)
(1700, '/', 2773)
(1800, '/', 2773)
(1900, '/', 2773)
(2000, '/', 2773)
(2100, '/', 2773)
(2200, '/', 2773)
(2300, '/', 2773)
(2400, '/', 2773)
(2500, '/', 2773)
(2600, '/', 2773)
(2700, '/', 2773)
features/raw/train_valid.h5
(0, '/', 616)
(100, '/', 616)
(200, '/', 616)
(300, '/', 616)
(400, '/', 616)
(500, '/', 616)
(600, '/', 616)
('converting IAM_lines to', 'features/raw/valid.h5')
(0, '/', 976)
(100, '/', 976)
(200, '/', 976)
(300, '/', 976)
(400, '/', 976)
(500, '/', 976)
(600, '/', 976)
(700, '/', 976)
(800, '/', 976)
(900, '/', 976)
('converting IAM_lines to', 'features/raw/test.h5')
(0, '/', 2915)
(100, '/', 2915)
(200, '/', 2915)
(300, '/', 2915)
(400, '/', 2915)
(500, '/', 2915)
(600, '/', 2915)
(700, '/', 2915)
(800, '/', 2915)
(900, '/', 2915)
(1000, '/', 2915)
(1100, '/', 2915)
(1200, '/', 2915)
(1300, '/', 2915)
(1400, '/', 2915)
(1500, '/', 2915)
(1600, '/', 2915)
(1700, '/', 2915)
(1800, '/', 2915)
(1900, '/', 2915)
(2000, '/', 2915)
(2100, '/', 2915)
(2200, '/', 2915)
(2300, '/', 2915)
(2400, '/', 2915)
(2500, '/', 2915)
(2600, '/', 2915)
(2700, '/', 2915)
(2800, '/', 2915)
(2900, '/', 2915)


  • hwloc has encountered what looks like an error from the operating system.
  • L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
  • Error occurred in topology.c line 942
  • The following FAQ entry in a recent hwloc documentation may help:
  • What should I do when hwloc reports "operating system" warnings?
  • Otherwise please report this error message to the hwloc user's mailing list,
  • along with the output+tarball generated by the hwloc-gather-topology script.

[kiew:09731] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_btl_usnic: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[kiew:09731] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_psm: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[kiew:09731] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
/usr/lib64/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
RETURNN starting up, version 20180626.085545--git-e21a4b53-dirty, date/time 2018-06-26-16-54-22 (UTC+0200), pid 9731, cwd /home/sadeghi/returnn/demos/mdlstm/IAM, Python /usr/bin/python
RETURNN command line options: ['config_demo']
Hostname: kiew
faulthandler import error. No module named faulthandler
Theano: 0.8.2 (<site-package> in /usr/lib/python2.7/site-packages/theano)
pynvml not available, memory information missing


  • hwloc has encountered what looks like an error from the operating system.
  • L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
  • Error occurred in topology.c line 942
  • The following FAQ entry in a recent hwloc documentation may help:
  • What should I do when hwloc reports "operating system" warnings?
  • Otherwise please report this error message to the hwloc user's mailing list,
  • along with the output+tarball generated by the hwloc-gather-topology script.

[kiew:09765] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_btl_usnic: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[kiew:09765] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_psm: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[kiew:09765] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
Using gpu device 0: GeForce GTX 1080 (CNMeM is disabled, cuDNN 5105)
/usr/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
warnings.warn(warn)
/usr/lib64/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Device gpuX proc starting up, pid 9765
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True'
faulthandler import error. No module named faulthandler
Device train-network: Used data keys: ['classes', 'data', 'sizes']
using adam with nag and momentum schedule

Running Tests with models

How do you run tests with the models generated after training? Is "forwarding" the same as testing? I am using the IAM demos by the way.

mdlstm IAM demo crashes after loading train.2.h5

I ran the IAM demo and it crashes part way through epoch 2 right after loading train.2.h5. All of the other demos worked correctly. Here is a script of exactly what I was running: https://gist.github.com/cwig/315d212964542f7f1797d5fdd122891e

Let me know if I need to run anything differently. Thank you.

This is the traceback.

train epoch 2, batch 190, cost:output 2.56628417969, elapsed 0:04:28, exp. remaining 1:04:21, complete 6.49%
1:04:21 [|||| 6.49% ]running 2 sequence slices (442764 nts) of batch 191 on device gpu0
loading file features/raw/train.2.h5
TaskThread train failed
Unhandled exception <type 'exceptions.ValueError'> in thread <TrainTaskThread(TaskThread train, started daemon 140405735151360)>, proc 2207.
EXCEPTION
Traceback (most recent call last):
  File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 373, in run
    line: self.run_inner()
    locals:
      self = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
      self.run_inner = <bound method TrainTaskThread.run_inner of <TrainTaskThread(TaskThread train, started daemon 140405735151360)>>
  File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 466, in run_inner
    line: deviceRuns[i].allocate()
    locals:
      deviceRuns = [<DeviceBatchRun(DeviceThread gpu0, started daemon 140405385864960)>]
      i = 0
      allocate =
  File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 218, in allocate
    line: self.devices_batches = self.parent.allocate_devices(self.alloc_devices)
    locals:
      self = <DeviceBatchRun(DeviceThread gpu0, started daemon 140405385864960)>
      self.devices_batches = [[<Batch start_seq:3846, #seqs:2>]]
      self.parent = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
      self.parent.allocate_devices = <bound method TrainTaskThread.allocate_devices of <TrainTaskThread(TaskThread train, started daemon 140405735151360)>>
      self.alloc_devices = [<Device.Device object at 0x7fb2c2ba19d0>]
  File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 84, in allocate_devices
    line: success, batch_adv_idx = self.assign_dev_data(device, batches)
    locals:
      success =
      batch_adv_idx =
      self = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
      self.assign_dev_data = <bound method TrainTaskThread.assign_dev_data of <TrainTaskThread(TaskThread train, started daemon 140405735151360)>>
      device = <Device.Device object at 0x7fb2c2ba19d0>
      batches = [<Batch start_seq:3349, #seqs:2>]
  File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 54, in assign_dev_data
    line: return assign_dev_data(device, self.data, batches)
    locals:
      assign_dev_data = <function assign_dev_data at 0x7fb2ca2ab230>
      device = <Device.Device object at 0x7fb2c2ba19d0>
      self = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
      self.data = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
      batches = [<Batch start_seq:3349, #seqs:2>]
  File "/mnt/3TB_A/workspace/returnn_IAM/EngineUtil.py", line 23, in assign_dev_data
    line: if load_seqs: dataset.load_seqs(batch.start_seq, batch.end_seq)
    locals:
      load_seqs = True
      dataset = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
      dataset.load_seqs = <bound method HDFDataset.load_seqs of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
      batch = <Batch start_seq:3349, #seqs:2>
      batch.start_seq = 3349
      batch.end_seq = 3351
  File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 140, in load_seqs
    line: self._load_seqs_with_cache(start, end)
    locals:
      self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
      self._load_seqs_with_cache = <bound method HDFDataset._load_seqs_with_cache of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
      start = 3349
      end = 3351
  File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 168, in _load_seqs_with_cache
    line: self.load_seqs(start, end, with_cache=False)
    locals:
      self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
      self.load_seqs = <bound method HDFDataset.load_seqs of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
      start = 3349
      end = 3359
      with_cache =
  File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 143, in load_seqs
    line: super(CachedDataset, self).load_seqs(start, end)
    locals:
      super = <type 'super'>
      CachedDataset = <class 'CachedDataset.CachedDataset'>
      self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
      load_seqs =
      start = 3349
      end = 3359
  File "/mnt/3TB_A/workspace/returnn_IAM/Dataset.py", line 159, in load_seqs
    line: self._load_seqs(start, end)
    locals:
      self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
      self._load_seqs = <bound method HDFDataset._load_seqs of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
      start = 3349
      end = 3359
  File "/mnt/3TB_A/workspace/returnn_IAM/HDFDataset.py", line 152, in _load_seqs
    line: self._set_alloc_intervals_data(idc, data=fin['inputs'][p[0] : p[0] + l[0]][...])
    locals:
      self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
      self._set_alloc_intervals_data = <bound method HDFDataset._set_alloc_intervals_data of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
      idc = 3358
      data =
      fin = <HDF5 file "train.2.h5" (mode r)>, len = 5
      p = array([110348430, 25353, 1172]), len = 3
      l = array([151211, 40, 2], dtype=int32), len = 3
  File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 220, in _set_alloc_intervals_data
    line: self.alloc_intervals[idi][2][o:o + l] = x
    locals:
      self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
      self.alloc_intervals = [(0, 1321, array([[ 0.],
       [ 0.],
       [ 0.],
       ...,
       [ 0.],
       [ 0.],
       [ 0.]], dtype=float32)), (1329, 1333, array([[ 0.],
       [ 0.],
       [ 0.],
       ...,
       [ 0.],
       [ 0.],
       [ 0.]], dtype=float32)), (1417, 1420, array([[ 0.],
       [ 0.],
       ..., len = 52
      idi = 48
      o = 1824769
      l = 151211
      x = array([[ 0.],
       [ 0.],
       [ 0.],
       ...,
       [ 0.],
       [ 0.],
       [ 0.]], dtype=float32), len = 151211, _[0]: {len = 1}
ValueError: could not broadcast input array from shape (151211,1) into shape (107131,1)
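
The numbers in the last frame already hint at why the assignment fails. The slice [o:o + l] asks for 151211 rows, but the error reports a destination of only 107131 rows, which is what happens when a slice is clipped at the end of the buffer. The sketch below redoes that arithmetic in plain Python. slice_rows is a hypothetical helper, not part of RETURNN, and the buffer size is inferred from the error message rather than read from the dataset.

```python
def slice_rows(buffer_rows, offset, length):
    """Number of rows that buffer[offset:offset + length] really covers.

    Python and NumPy clip a slice at the end of the buffer instead of raising.
    """
    start = min(offset, buffer_rows)
    stop = min(offset + length, buffer_rows)
    return stop - start


# Values copied from the traceback.
o = 1824769          # write offset into the cache buffer
l = 151211           # number of frames in the sequence being loaded
dest_rows = 107131   # destination shape reported by the ValueError

# Inferred, not logged: the buffer must end where the clipped slice ends.
buffer_rows = o + dest_rows
missing_rows = l - slice_rows(buffer_rows, o, l)

print(buffer_rows, missing_rows)
```

Under this reading the buffer holds 1931900 rows and is 44080 rows too short for the sequence, which suggests that the cache interval was allocated for less data than is being written into it.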

Documentation for implementing new RETURNN layers

Hello,
I'm currently in the process of implementing a new RETURNN TF layer. I noticed that quite a lot of features are well documented, though I recommend adding a better description/tutorial for the following aspects (even if some of the info can be gathered by going through the code & comments):

  • General confusion avoidance: In some locations the internal placeholder of an object can be easily confused with a tf.placeholder. Adding a small remark where the tf.placeholder is used should be enough.
  • General confusion avoidance: Adding a central list of project specific conventions and best practices (e.g. batch or time major, shape of targets etc.)
  • In TFNetworkLayer.py:LayerBase: The sources parameter isn't completely described. Especially interesting is in which order the layers are listed.
  • In TFNetworkLayer.py:LayerBase: self.output really needs to have an exact description.
  • In TFNetworkLayer.py:LayerBase:transform_config_dict: The description is confusing as to when it needs to be used.
  • In TFNetworkLayer.py:_ConcatInputLayer: A description of how the input_data parameter exactly is managed would be helpful.
  • In TFNetworkLayer.py:Loss: A general quick intro as to how the loss is constructed and interacts with the general system.

I believe these are the main points concerning the implementation of new layers in RETURNN.

Error running IAM demo

I attempted to run the example in the demos/mdlstm/IAM folder. Here is the terminal output when I run the go.sh file:

WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1080 Ti (CNMeM is disabled, cuDNN not available)
RETURNN starting up, version 20160714.042013--git-5a40490-dirty, pid 12085, cwd /home/kartik/new_returnn/demos/mdlstm/IAM
RETURNN command line options: ['config_fwd_kd']
Theano: 0.9.0 (<site-package> in /home/kartik/.local/lib/python2.7/site-packages/theano)
CUDA already initialized in proc 12085
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
Not able to select available GPU from 1 cards (all CUDA-capable devices are busy or unavailable).
EXCEPTION
Traceback (most recent call last):
File "/home/kartik/new_returnn/TaskSystem.py", line 1381, in
line: ExecingProcess.checkExec() # Never returns if this proc is called via ExecingProcess.
locals:
ExecingProcess = <class main.ExecingProcess at 0x7f7f3c140d50>
ExecingProcess.checkExec = <function checkExec at 0x7f7f3c171488>
File "/home/kartik/new_returnn/TaskSystem.py", line 1001, in checkExec
line: args = unpickler.load()
locals:
args =
unpickler = <pickle.Unpickler instance at 0x7f7f3c16de60>
unpickler.load = <bound method Unpickler.load of <pickle.Unpickler instance at 0x7f7f3c16de60>>
File "/usr/lib/python2.7/pickle.py", line 864, in load
line: dispatchkey
locals:
dispatch = {'': <function load_eof at 0x7f7fa3ca50c8>, '\x80': <function load_proto at 0x7f7fa3ca5140>, '\x83': <function load_ext2 at 0x7f7fa3ca6230>, '\x82': <function load_ext1 at 0x7f7fa3ca61b8>, '\x85': <function load_tuple1 at 0x7f7fa3ca5b90>, '\x84': <function load_ext4 at 0x7f7fa3ca62a8>, '\x87': <f..., len = 54
key = 'R'
self = <pickle.Unpickler instance at 0x7f7f3c16de60>
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
line: value = func(*args)
locals:
value =
func = <function getModuleDict at 0x7f7f9f4686e0>
args = ('Device', ['/home/kartik/new_returnn', '/home/kartik/.local/lib/python2.7/site-packages/PyTorch-4.1.1_SNAPSHOT-py2.7-linux-x86_64.egg', '/home/kartik/.local/lib/python2.7/site-packages/voc_utils-0.0-py2.7.egg', '/home/kartik/.local/lib/python2.7/site-packages/more_itertools-3.1.0-py2.7.egg', '/h..., [0]: {len = 6}
File "/home/kartik/new_returnn/TaskSystem.py", line 552, in getModuleDict
line: mod = import_module(modname)
locals:
mod =
import_module = <function import_module at 0x7f7fa3ca6f50>
modname = 'Device', len = 6
File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
line: __import__(name)
locals:
__import__ =
name = 'Device', len = 6
File "/home/kartik/new_returnn/Device.py", line 5, in
line: from Updater import Updater
locals:
Updater =
File "/home/kartik/new_returnn/Updater.py", line 4, in
line: import theano
locals:
theano =
File "/home/kartik/.local/lib/python2.7/site-packages/theano/init.py", line 108, in
line: import theano.sandbox.cuda
locals:
theano = <module 'theano' from '/home/kartik/.local/lib/python2.7/site-packages/theano/init.pyc'>
theano.sandbox = <module 'theano.sandbox' from '/home/kartik/.local/lib/python2.7/site-packages/theano/sandbox/init.pyc'>
theano.sandbox.cuda = !AttributeError: 'module' object has no attribute 'cuda'
File "/home/kartik/.local/lib/python2.7/site-packages/theano/sandbox/cuda/init.py", line 728, in
line: use(device=config.device, force=config.force_device, test_driver=False)
locals:
use = None
device =
config = None
config.device = !AttributeError: 'NoneType' object has no attribute 'device'
force =
config.force_device = !AttributeError: 'NoneType' object has no attribute 'force_device'
test_driver =
File "/home/kartik/.local/lib/python2.7/site-packages/theano/sandbox/cuda/init.py", line 586, in use
line: cuda_ndarray.cuda_ndarray.select_a_gpu()
locals:
cuda_ndarray = None
cuda_ndarray.cuda_ndarray = !AttributeError: 'NoneType' object has no attribute 'cuda_ndarray'
cuda_ndarray.cuda_ndarray.select_a_gpu = !AttributeError: 'NoneType' object has no attribute 'cuda_ndarray'
RuntimeError: ('Not able to select available GPU from 1 cards (all CUDA-capable devices are busy or unavailable).', 'You asked to force this device and it failed. No fallback to the cpu or other gpu device.')
Device proc gpuX (gpuZ) died: ProcConnectionDied('recv_bytes EOFError: ',)
Theano flags: compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True
EXCEPTION
Traceback (most recent call last):
File "/home/kartik/new_returnn/Device.py", line 332, in startProc
line: self._startProc(*args, **kwargs)
locals:
self = <Device.Device object at 0x7f703bc37210>
self._startProc = <bound method Device._startProc of <Device.Device object at 0x7f703bc37210>>
args = ('gpuZ',)
kwargs = {}
File "/home/kartik/new_returnn/Device.py", line 386, in _startProc
line: interrupt_main()
locals:
interrupt_main = <function interrupt_main at 0x7f70405c08c0>
File "/home/kartik/new_returnn/Util.py", line 637, in interrupt_main
line: sys.exit(1) # And exit the thread.
locals:
sys = <module 'sys' (built-in)>
sys.exit =
SystemExit: 1
ERR!\nKeyboardInterrupt
EXCEPTION
Traceback (most recent call last):
File "../../../rnn.py", line 532, in main
line: init(commandLineOptions=argv[1:])
locals:
init = <function init at 0x7f703bc2cc08>
commandLineOptions =
argv = ['../../../rnn.py', 'config_fwd_kd'], _[0]: {len = 15}
File "../../../rnn.py", line 345, in init
line: devices = initDevices()
locals:
devices =
initDevices = <function initDevices at 0x7f703bc2c848>
File "../../../rnn.py", line 158, in initDevices
line: time.sleep(0.25)
locals:
time = <module 'time' (built-in)>
time.sleep =
KeyboardInterrupt
Quitting

Results using config_real structure IAM

Hello, thanks for RETURNN Framework, it is really useful.

I trained the network defined in https://github.com/rwth-i6/returnn/blob/master/demos/mdlstm/IAM/config_real on the train and dev datasets generated by create_IAM_dataset.py. According to the log file, it reaches a LER of 5.4% on the validation set. On the test set, with the output computed by config_fwd and then decoded and evaluated externally, the LER is 8.8%.
However, the results shown in https://www.vision.rwth-aachen.de/media/papers/MDLSTM_final.pdf indicate a CER of 2.4% and 3.5% for the dev and eval sets, respectively.
Did you get these results with this code directly, or did you use the word-based trigram and character language model described in Section III of that paper (which I guess is not implemented in the code)?
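
For reference, an error rate of the kind compared above is usually the edit distance between hypothesis and reference, summed over all lines and divided by the total reference length. The following is a minimal sketch under that assumption, using plain Python sequences; the function names are illustrative and not part of RETURNN, and the external evaluation mentioned above may differ in details such as normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # delete a reference symbol
                           cur[j - 1] + 1,            # insert a hypothesis symbol
                           prev[j - 1] + (r != h)))   # substitute, or match at cost 0
        prev = cur
    return prev[-1]


def error_rate(refs, hyps):
    """Total edit distance divided by total reference length, as a fraction."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_len = sum(len(r) for r in refs)
    return total_errors / float(total_len)
```

Passing strings gives a character error rate; passing lists of label indices or words gives the corresponding label or word error rate.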

Thank you in advance.

Error in IAM Demo: Couldn't import dot_parser, loading of dot files will not be possible.

I installed h5py and Theano and tried running go.sh in returnn_IAM/demos/mdlstm/IAM, but I get the following error:

('converting IAM_lines to', 'features/raw/demo.h5')
features/raw/demo.h5
(0, '/', 3)
Couldn't import dot_parser, loading of dot files will not be possible.
RETURNN starting up, version 20171127.183959--git-94c0542-dirty, pid 183326, cwd /export/b18/aarora/returnn_IAM/demos/mdlstm/IAM
RETURNN command line options: ['config_demo']
Theano: 0.9.0 ( in /home/aaror/.local/lib/python2.7/site-packages/theano)
faulthandler import error. No module named faulthandler
Couldn't import dot_parser, loading of dot files will not be possible.
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

/home/aaror/.local/lib/python2.7/site-packages/theano/sandbox/cuda/init.py:556: UserWarning: Theano flag device=gpu* (old gpu back-end) only support floatX=float32. You have floatX=float64. Use the new gpu back-end with device=cuda* for that value of floatX.
warnings.warn(msg)
ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
Not able to select available GPU from 4 cards (all CUDA-capable devices are busy or unavailable).
EXCEPTION
Traceback (most recent call last):
File "/export/b18/aarora/returnn_IAM/TaskSystem.py", line 1381, in
line: ExecingProcess.checkExec() # Never returns if this proc is called via ExecingProcess.
locals:
ExecingProcess = <class main.ExecingProcess at 0x7fe9e6578738>
ExecingProcess.checkExec = <function checkExec at 0x7fe9e658ea28>
File "/export/b18/aarora/returnn_IAM/TaskSystem.py", line 1001, in checkExec
line: args = unpickler.load()
locals:
args =
unpickler = <pickle.Unpickler instance at 0x7fe9e65921b8>
unpickler.load = <bound method Unpickler.load of <pickle.Unpickler instance at 0x7fe9e65921b8>>
File "/usr/lib/python2.7/pickle.py", line 858, in load
line: dispatch[key](self)
locals:
dispatch = {'': <function load_eof at 0x7fe9eb0c4140>, '\x80': <function load_proto at 0x7fe9eb0c41b8>, '\x83': <function load_ext2 at 0x7fe9eb0c32a8>, '\x82': <function load_ext1 at 0x7fe9eb0c3230>, '\x85': <function load_tuple1 at 0x7fe9eb0c4c08>, '\x84': <function load_ext4 at 0x7fe9eb0c3320>, '\x87': <f..., len = 54
key = 'R'
self = <pickle.Unpickler instance at 0x7fe9e65921b8>
File "/usr/lib/python2.7/pickle.py", line 1133, in load_reduce
line: value = func(*args)
locals:
value =
func = <function getModuleDict at 0x7fe9e6337c80>
args = ('Device', ['/export/b18/aarora/returnn_IAM', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/home/aaror/.local/lib/python2.7/site-packages', '/home/aaror/.local/lib/python2.7/site-pac..., [0]: {len = 6}
File "/export/b18/aarora/returnn_IAM/TaskSystem.py", line 552, in getModuleDict
line: mod = import_module(modname)
locals:
mod =
import_module = <function import_module at 0x7fe9eb0c7050>
modname = 'Device', len = 6
File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
line: __import__(name)
locals:
__import__ =
name = 'Device', len = 6
File "/export/b18/aarora/returnn_IAM/Device.py", line 5, in
line: from Updater import Updater
locals:
Updater =
File "/export/b18/aarora/returnn_IAM/Updater.py", line 4, in
line: import theano
locals:
theano =
File "/home/aaror/.local/lib/python2.7/site-packages/theano/init.py", line 108, in
line: import theano.sandbox.cuda
locals:
theano = <module 'theano' from '/home/aaror/.local/lib/python2.7/site-packages/theano/init.pyc'>
theano.sandbox = <module 'theano.sandbox' from '/home/aaror/.local/lib/python2.7/site-packages/theano/sandbox/init.pyc'>
theano.sandbox.cuda = !AttributeError: 'module' object has no attribute 'cuda'
File "/home/aaror/.local/lib/python2.7/site-packages/theano/sandbox/cuda/init.py", line 728, in
line: use(device=config.device, force=config.force_device, test_driver=False)
locals:
use = None
device =
config = None
config.device = !AttributeError: 'NoneType' object has no attribute 'device'
force =
config.force_device = !AttributeError: 'NoneType' object has no attribute 'force_device'
test_driver =
File "/home/aaror/.local/lib/python2.7/site-packages/theano/sandbox/cuda/init.py", line 586, in use
line: cuda_ndarray.cuda_ndarray.select_a_gpu()
locals:
cuda_ndarray = None
cuda_ndarray.cuda_ndarray = !AttributeError: 'NoneType' object has no attribute 'cuda_ndarray'
cuda_ndarray.cuda_ndarray.select_a_gpu = !AttributeError: 'NoneType' object has no attribute 'cuda_ndarray'
RuntimeError: ('Not able to select available GPU from 4 cards (all CUDA-capable devices are busy or unavailable).', 'You asked to force this device and it failed. No fallback to the cpu or other gpu device.')
Device proc gpuX (gpuZ) died: ProcConnectionDied('recv_bytes EOFError: ',)
Theano flags: compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True
EXCEPTION
Traceback (most recent call last):
File "/export/b18/aarora/returnn_IAM/Device.py", line 332, in startProc
line: self._startProc(*args, **kwargs)
locals:
self = <Device.Device object at 0x7fd3342190d0>
self._startProc = <bound method Device._startProc of <Device.Device object at 0x7fd3342190d0>>
args = ('gpuZ',)
kwargs = {}
File "/export/b18/aarora/returnn_IAM/Device.py", line 386, in _startProc
line: interrupt_main()
locals:
interrupt_main = <function interrupt_main at 0x7fd334d56578>
File "/export/b18/aarora/returnn_IAM/Util.py", line 625, in interrupt_main
line: sys.exit(1) # And exit the thread.
locals:
sys = <module 'sys' (built-in)>
sys.exit =
SystemExit: 1
ERR!\nKeyboardInterrupt
EXCEPTION
Traceback (most recent call last):
File "../../../rnn.py", line 532, in main
line: init(commandLineOptions=argv[1:])
locals:
init = <function init at 0x7fd33420d398>
commandLineOptions =
argv = ['../../../rnn.py', 'config_demo'], _[0]: {len = 15}
File "../../../rnn.py", line 345, in init
line: devices = initDevices()
locals:
devices =
initDevices = <function initDevices at 0x7fd33420cf50>
File "../../../rnn.py", line 158, in initDevices
line: time.sleep(0.25)
locals:
time = <module 'time' (built-in)>
time.sleep =
KeyboardInterrupt
Quitting

I have the following installed on the grid:

pydot (1.0.32)
pydot2 (1.0.33)
pyparsing (2.1.10)
Theano (0.9.0)
h5py (2.7.1)

Pad a sparse output layer

Hi,

I want to turn a sparse output layer over a vocabulary of size 10017 (let's call it output1) into an output layer over a vocabulary of size 10020 (let's call it output2). I tried using the PadLayer class to pad zeros to the right of output1, but that does not seem to work. Unfortunately, I cannot change the n_out of the output1 layer. Is there any other way of doing this? Can it be done with PadLayer itself?
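
To make the intended operation concrete, independent of any RETURNN layer: the padding in question appends zeros to the right of each output vector so that its length grows from 10017 to 10020. A minimal sketch on a single vector in plain Python follows; the function name `pad_right` and the toy values are only illustrative, and this is not how PadLayer is implemented.

```python
def pad_right(vector, new_size, pad_value=0.0):
    """Return a copy of vector extended on the right with pad_value up to new_size."""
    if new_size < len(vector):
        raise ValueError("new_size must not be smaller than the current size")
    return list(vector) + [pad_value] * (new_size - len(vector))


# Toy stand-in for one frame of output1: a distribution over 10017 classes.
output1_frame = [1.0 / 10017] * 10017
output2_frame = pad_right(output1_frame, 10020)
print(len(output2_frame))
```

The three new classes get probability zero, so the padded vector still sums to the same value as the original.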

Thanks

Commercial usage

I read the LICENSE, but I'm still not sure:
is it OK to use RETURNN for commercial purposes?

Error or Bug when i run config_fwd on my own created Test Set

Hi everyone

I took some images of handwritten scripts that I prepared myself. I then adapted the code from create_IAM_dataset.py to convert my images to HDF5 format, and the conversion succeeded. Now, when I try to run config_fwd on my own test set, I get an error. Do I need to change anything in config_fwd other than the eval path? Here is the error:

RETURNN starting up, version 20180405.225130--git-538ed96-dirty, date/time 2018-09-05-10-04-58 (UTC+0200), pid 13139, cwd /home/arman/returnn/demos/mdlstm/IAM, Python /usr/bin/python
RETURNN command line options: ['config_fwd']
faulthandler import error. No module named faulthandler
Theano: 0.9.0 ( in /usr/local/lib/python2.7/dist-packages/Theano-0.9.0-py2.7.egg/theano)
pynvml not available, memory information missing
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 960M (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5110)
Device train-network: Used data keys: ['classes', 'data', 'sizes']
Devices: Used in blocking / single proc mode.
parsing file features/raw/test.h5
loading file 1/1 features/raw/test.h5
cached 3 seqs 0.005933091044425964 GB (fully loaded, 1.6607335759326816 GB left over)
Unhandled exception <type 'exceptions.AssertionError'> in thread <_MainThread(MainThread, started 139787278063360)>, proc 13139.

Thread current, main, <_MainThread(MainThread, started 139787278063360)>:
(Excluded thread.)

That were all threads.
EXCEPTION
Traceback (most recent call last):
File "../../../rnn.py", line 559, in
line: main(sys.argv)
locals:
main = <function main at 0x7f2287e4b410>
sys = <module 'sys' (built-in)>
sys.argv = ['../../../rnn.py', 'config_fwd'], _[0]: {len = 15}
File "../../../rnn.py", line 546, in main
line: init(commandLineOptions=argv[1:])
locals:
init = <function init at 0x7f2287e4b1b8>
commandLineOptions =
argv = ['../../../rnn.py', 'config_fwd'], _[0]: {len = 15}
File "../../../rnn.py", line 345, in init
line: initData()
locals:
initData = <function initData at 0x7f2287e4aed8>
File "../../../rnn.py", line 242, in initData
line: train_data, extra_train = load_data(config, train_cache_bytes, 'train')
locals:
train_data = None
extra_train =
load_data = <function load_data at 0x7f2287e4ae60>
config = <Config.Config instance at 0x7f2287eb8c68>
train_cache_bytes = 7151908220
File "../../../rnn.py", line 216, in load_data
line: data = init_dataset_via_str(config_str, config=config, cache_byte_size=cache_byte_size, **kwargs)
locals:
data =
init_dataset_via_str = <function init_dataset_via_str at 0x7f22881bca28>
config_str = 'features/raw/train.1.h5', len = 23
config = <Config.Config instance at 0x7f2287eb8c68>
cache_byte_size = 7151908220
kwargs = {'name': 'train'}
File "/home/arman/returnn/Dataset.py", line 856, in init_dataset_via_str
line: assert os.path.exists(f)
locals:
os = <module 'os' from '/usr/lib/python2.7/os.pyc'>
os.path = <module 'posixpath' from '/usr/lib/python2.7/posixpath.pyc'>
os.path.exists = <function exists at 0x7f22c305ced8>
f = 'features/raw/train.1.h5', len = 23
AssertionError
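
The assertion at the end of the traceback fires in load_data for the 'train' dataset because 'features/raw/train.1.h5' does not exist relative to the working directory, which suggests that the config still lists a train file even though only the test set was created. A small pre-flight check like the one below can surface this before RETURNN starts. It is only a sketch: the helper name `missing_dataset_files` and the dict layout are hypothetical, and it does not parse a real RETURNN config.

```python
import os


def missing_dataset_files(datasets, base_dir="."):
    """Return {key: path} for every dataset path that does not exist under base_dir."""
    missing = {}
    for key, path in datasets.items():
        if not os.path.exists(os.path.join(base_dir, path)):
            missing[key] = path
    return missing


# Paths as they appear in the log above; adjust to the entries of your own config.
datasets = {"train": "features/raw/train.1.h5", "eval": "features/raw/test.h5"}
for key, path in sorted(missing_dataset_files(datasets).items()):
    print("dataset %r points to a missing file: %s" % (key, path))
```

Run from the demo directory, this lists every dataset entry whose file is absent.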

Possible bug in Theano MDLSTM backend C++/cuda implementation ?

Hi,

I have been looking at your theano MDLSTM cuda backend c++ code for a while.

Here is a piece of code that looks very much like a read of uninitialized memory.

File : returnn/cuda_implementation/MultiDirectionalTwoDLSTMOp.py

Function : "c_code" of class MultiDirectionalTwoDLSTMOp(theano.sandbox.cuda.GpuOp)

Lines : 437 ~ 516

Suspicious logic: %(Y1)s, ..., %(Y4)s are allocated as CUDA arrays but never initialized before being used.

Details :

 line 437 -445,     %(Y1)s = (CudaNdarray*) MyCudaNdarray_NewDims(4, Y_dim);

 line 505 - 516,  
           affine_y_x_batched_multidir(0, -1,
                       %(Y1)s, %(Y2)s, %(Y3)s, %(Y4)s,
                       %(V_h1)s, %(V_h2)s, %(V_h3)s, %(V_h4)s,
                       %(H1)s, %(H2)s, %(H3)s, %(H4)s,
                      ys_h, xs_h, ptr_storage, height, width);
     which is effectively the matrix operation "H += Y * V_h"

My speculation:
In lines 505-516, the order of "%(Y)s" and "%(H)s" should be swapped.
Otherwise, if we are lucky, %(Y)s == 0, these 2 function calls are no-ops, and %(H)s does not change.

Please take a look !

Thanks,

Roger

RETURNN wont create priors

Hey all

After running config_real, I stopped the training at epoch 72, by which point the error had dropped to 0.0564. I then ran config_forward, and all I got was mdlstm_real_valid.h5; my priors folder is still empty. Can someone please help me with this issue? Also, how can I run testing on my own dataset to calculate the accuracy?

Thank you

PicklingError on Windows

Using the latest download, I am trying to run the IAM example from the demos with config_real, but I get the error below, which seems to be related to pickling/multiprocessing:

(Pdb) WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce 940MX (CNMeM is enabled with initial size: 70.0% of memory, cuDNN 5110)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\ProgramData\Anaconda2\lib\multiprocessing\forking.py", line 381, in main
    self = load(from_parent)
  File "C:\ProgramData\Anaconda2\lib\pickle.py", line 1384, in load
    return Unpickler(file).load()
  File "C:\ProgramData\Anaconda2\lib\pickle.py", line 864, in load
    dispatch[key](self)
  File "C:\ProgramData\Anaconda2\lib\pickle.py", line 886, in load_eof
    raise EOFError
EOFError

System configuration:
Using gpu device 0: GeForce 940MX 2GB (CNMeM is enabled with initial size: 70.0% of memory, cuDNN 5110)

Windows 10 64bit
Intel Core i7 7th generation
GeForce 940MX 2GB
8GB RAM

How to set a specific GPU device

python ../../../rnn.py config_demo

I'm on a cluster. On the node where my job gets scheduled, I have to select a free GPU. I can find the free GPU, but I am not sure how to pass it to the code for the multidimensional LSTM (IAM demo).

Inside go.sh, should I do the following?

./create_IAM_dataset.py
mkdir -p models log priors
CUDA_VISIBLE_DEVICES=$(free-gpu) /home/aaror/miniconda2/bin/python ../../../rnn.py config_demo
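
The same idea can be expressed from Python. CUDA_VISIBLE_DEVICES is read when the CUDA runtime initializes, so it has to be set in the environment of the process that runs rnn.py. The following is a minimal sketch, assuming the free GPU index has already been found by some site-specific tool such as the free-gpu command above; `build_launch` is a hypothetical helper, not part of RETURNN.

```python
import os
import subprocess
import sys


def build_launch(gpu_index, config="config_demo", script="../../../rnn.py"):
    """Return (command, environment) exposing only the given GPU to the child process."""
    env = dict(os.environ)
    # With a single visible device, the child process sees it as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    cmd = [sys.executable, script, config]
    return cmd, env


# Usage (not executed here):
#   cmd, env = build_launch(2)
#   exit_code = subprocess.call(cmd, env=env)
```

Because only one device is visible to the child, a plain device=gpu setting then refers to the selected card.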

multi-threaded TensorFlow

The situation is that I create multiple Engine objects in each thread (8 threads currently), and each one spawns its own tf.Session and its own tf.Graph and does its own computation.

On stdout, on MacOSX, I see a lot of these:

2017-12-08 15:03:50.040 system_profiler[93128:19224180] Error -536870187 (e00002d5) making IOI2CSendRequest
2017-12-08 15:03:50.041 system_profiler[93126:19224165] Error -536870187 (e00002d5) making IOI2CSendRequest
2017-12-08 15:03:50.045 system_profiler[93127:19224172] Error -536870187 (e00002d5) making IOI2CSendRequest
2017-12-08 15:03:50.046 system_profiler[93129:19224188] Error -536870187 (e00002d5) making IOI2CSendRequest
2017-12-08 15:03:50.075 system_profiler[93129:19224188] Error -536870187 (e00002d5) making IOI2CSendRequest
2017-12-08 15:03:50.075 system_profiler[93128:19224180] Error -536870187 (e00002d5) making IOI2CSendRequest
2017-12-08 15:03:50.075 system_profiler[93126:19224165] Error -536870187 (e00002d5) making IOI2CSendRequest
2017-12-08 15:03:50.092 system_profiler[93127:19224172] Error -536870187 (e00002d5) making IOI2CSendRequest
2017-12-08 15:03:50.094 system_profiler[93131:19224192] Error -536870187 (e00002d5) making IOI2CSendRequest

I'm just posting this here because a Google search for it doesn't return many results. Anyone else who encounters this should now be able to find it.

Not sure where the problem is. Might be a TensorFlow upstream bug or something else. Maybe I'm doing something wrong.

adapt model to single writer

Is there a way to adapt a model (that has been trained on IAM) to the handwriting of a single writer?
Would that make a substantial difference to the recognition accuracy (for that writer)?

Number of Epochs

Hi guys,
I have trained the model on the full IAM dataset without any changes to the config files (hyperparameters). I am running on a GTX 1080 Ti, and after 30 epochs the output cost is close to 0. Is this normal? Some people have mentioned that they trained for 100 epochs and still had a high cost. Also, please let me know how I can test the accuracy of the model, if anyone knows.

Thanks

Using returnn on Multi-gpu

I am trying to use the RETURNN framework for speech recognition. I have multiple GPUs available but cannot get the code to use them. I am running the returnn-experiments/2018-asr-attention example on LibriSpeech data.
Do you have any suggestions?

Experiments with IAM Database

Hi, I'm trying to reproduce the experiments with the IAM database, but I'm facing some problems. First, I followed the instructions in a comment on #5 in order to work with the most recent version of RETURNN, but when executing config_real, I got the following output:

WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1060 6GB (CNMeM is enabled with initial size: 90.0% of memory, cuDNN 5110)
CRNN starting up, version 20170510.185843--git-e8453fc-dirty, pid 5181
CRNN command line options: ['config_real']
Theano: 0.9.0.dev-c697eeab84... ( in /home/dayvidwelles/anaconda2/lib/python2.7/site-packages/theano)
Device train-network: Used data keys: ['classes', 'data', u'sizes']
using adam with nag and momentum schedule
Devices: Used in blocking / single proc mode.
loading file features/raw/train_valid.h5
cached 616 seqs 1.88295629248 GB (fully loaded, 40.7837103745 GB left over)
loading file features/raw/train.1.h5
loading file features/raw/train.2.h5
cached 5545 seqs 16.9867935553 GB (fully loaded, 39.2744432362 GB left over)
Train data:
input: 1 x 1
output: {u'classes': [79, 1], 'data': [1, 2], u'sizes': [2, 1]}
HDF dataset, sequences: 5545, frames: 1519952558
Dev data:
HDF dataset, sequences: 616, frames: 168484077
Devices:
gpu0: GeForce GTX 1060 6GB (units: 1280 clock: 1.58Ghz memory: 6.0GB) working on 1 batch (update on device)
Learning-rate-control: no file specified, not saving history (no proper restart possible)
using adam with nag and momentum schedule
Network layer topology:
input #: 1
hidden 1Dto2D '1Dto2D' #: 1
hidden source 'classes_source' #: 2
hidden conv2 'conv0' #: 15
hidden conv2 'conv1' #: 45
hidden conv2 'conv2' #: 75
hidden conv2 'conv3' #: 105
hidden conv2 'conv4' #: 105
hidden mdlstm 'mdlstm0' #: 30
hidden mdlstm 'mdlstm1' #: 60
hidden mdlstm 'mdlstm2' #: 90
hidden mdlstm 'mdlstm3' #: 120
hidden mdlstm 'mdlstm4' #: 120
output softmax 'output' #: 82
net params #: 2627902
net trainable params: [W_conv0, b_conv0, W_conv1, b_conv1, W_conv2, b_conv2, W_conv3, b_conv3, W_conv4, b_conv4, U1_mdlstm0, U2_mdlstm0, U3_mdlstm0, U4_mdlstm0, V1_mdlstm0, V2_mdlstm0, V3_mdlstm0, V4_mdlstm0, W1_mdlstm0, W2_mdlstm0, W3_mdlstm0, W4_mdlstm0, b1_mdlstm0, b2_mdlstm0, b3_mdlstm0, b4_mdlstm0, U1_mdlstm1, U2_mdlstm1, U3_mdlstm1, U4_mdlstm1, V1_mdlstm1, V2_mdlstm1, V3_mdlstm1, V4_mdlstm1, W1_mdlstm1, W2_mdlstm1, W3_mdlstm1, W4_mdlstm1, b1_mdlstm1, b2_mdlstm1, b3_mdlstm1, b4_mdlstm1, U1_mdlstm2, U2_mdlstm2, U3_mdlstm2, U4_mdlstm2, V1_mdlstm2, V2_mdlstm2, V3_mdlstm2, V4_mdlstm2, W1_mdlstm2, W2_mdlstm2, W3_mdlstm2, W4_mdlstm2, b1_mdlstm2, b2_mdlstm2, b3_mdlstm2, b4_mdlstm2, U1_mdlstm3, U2_mdlstm3, U3_mdlstm3, U4_mdlstm3, V1_mdlstm3, V2_mdlstm3, V3_mdlstm3, V4_mdlstm3, W1_mdlstm3, W2_mdlstm3, W3_mdlstm3, W4_mdlstm3, b1_mdlstm3, b2_mdlstm3, b3_mdlstm3, b4_mdlstm3, U1_mdlstm4, U2_mdlstm4, U3_mdlstm4, U4_mdlstm4, V1_mdlstm4, V2_mdlstm4, V3_mdlstm4, V4_mdlstm4, W1_mdlstm4, W2_mdlstm4, W3_mdlstm4, W4_mdlstm4, b1_mdlstm4, b2_mdlstm4, b3_mdlstm4, b4_mdlstm4, W_in_mdlstm4_output, b_output]
start training at epoch 1 and batch 0
using batch size: 1, max seqs: 10
learning rate control: ConstantLearningRate(defaultLearningRate=0.0005, minLearningRate=0.0, defaultLearningRates={1: 0.0005, 25: 0.0003, 35: 0.0001}, errorMeasureKey=None, relativeErrorAlsoRelativeToLearningRate=False, minNumEpochsPerNewLearningRate=0, filename=None), epoch data: 1: EpochData(learningRate=0.0005, error={}), 25: EpochData(learningRate=0.0003, error={}), 35: EpochData(learningRate=0.0001, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 0.0005 ...
TaskThread train failed
Unhandled exception <type 'exceptions.AssertionError'> in thread <TrainTaskThread(TaskThread train, started daemon 140063963887360)>, proc 5181.
EXCEPTION
Traceback (most recent call last):
File "/home/dayvidwelles/MasterResearch/main/returnn/EngineTask.py", line 376, in run
line: self.run_inner()
locals:
self = <TrainTaskThread(TaskThread train, started daemon 140063963887360)>
self.run_inner = <bound method TrainTaskThread.run_inner of <TrainTaskThread(TaskThread train, started daemon 140063963887360)>>
File "/home/dayvidwelles/MasterResearch/main/returnn/EngineTask.py", line 401, in run_inner
line: device.prepare(epoch=self.epoch, **self.get_device_prepare_args())
locals:
device = <Device.Device object at 0x7f63a5393650>
device.prepare = <bound method Device.prepare of <Device.Device object at 0x7f63a5393650>>
epoch =
self = <TrainTaskThread(TaskThread train, started daemon 140063963887360)>
self.epoch = 1
self.get_device_prepare_args = <bound method TrainTaskThread.get_device_prepare_args of <TrainTaskThread(TaskThread train, started daemon 140063963887360)>>
File "/home/dayvidwelles/MasterResearch/main/returnn/Device.py", line 1359, in prepare
line: self.set_net_params(network)
locals:
self = <Device.Device object at 0x7f63a5393650>
self.set_net_params = <bound method Device.set_net_params of <Device.Device object at 0x7f63a5393650>>
network = <Network.LayerNetwork object at 0x7f6353c79290>
File "/home/dayvidwelles/MasterResearch/main/returnn/Device.py", line 1172, in set_net_params
line: self.trainnet.set_params_by_dict(network.get_params_dict())
locals:
self = <Device.Device object at 0x7f63a5393650>
self.trainnet = <Network.LayerNetwork object at 0x7f63a5354bd0>
self.trainnet.set_params_by_dict = <bound method LayerNetwork.set_params_by_dict of <Network.LayerNetwork object at 0x7f63a5354bd0>>
network = <Network.LayerNetwork object at 0x7f6353c79290>
network.get_params_dict = <bound method LayerNetwork.get_params_dict of <Network.LayerNetwork object at 0x7f6353c79290>>
File "/home/dayvidwelles/MasterResearch/main/returnn/Network.py", line 680, in set_params_by_dict
line: self.output[name].set_params_by_dict(params[name])
locals:
self = <Network.LayerNetwork object at 0x7f63a5354bd0>
self.output = {'output': <<class 'NetworkOutputLayer.SequenceOutputLayer'> class:softmax name:output>}
name = 'output', len = 6
set_params_by_dict =
params = {'1Dto2D': {}, 'mdlstm1': {'b4_mdlstm1': CudaNdarray([ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. ..., len = 13
File "/home/dayvidwelles/MasterResearch/main/returnn/NetworkBaseLayer.py", line 148, in set_params_by_dict
line: (self, p, self_param_shape, v.shape)
locals:
self = <<class 'NetworkOutputLayer.SequenceOutputLayer'> class:softmax name:output>
p = 'W_in_mdlstm4_output', len = 19
self_param_shape = (120, 80)
v = CudaNdarray([[ 0.1616703 0.27070677 0.27121079 ..., 0.36024842 0.06087828
0.27304971]
[-0.29783311 -0.72121382 0.00289656 ..., 0.37427774 0.14804624
0.28115278]
[-0.03678599 0.25347066 -0.31765896 ..., -0.37810767 -0.14737074
0.09291939]
...,
[-0.50088167 -0.01235303 -0.043..., len = 120, _[0]: {len = 82, _[0]: {len = 0}}
v.shape = (120, 82)
AssertionError: In <<class 'NetworkOutputLayer.SequenceOutputLayer'> class:softmax name:output>, param W_in_mdlstm4_output shape does not match. Expected (120, 80), got (120, 82).

KeyboardInterrupt
Quitting

UPDATE
The same happens with older commits such as bdbeb33.

Passing multiple data to the daemon

Hello!

I'm working with the IAM dataset. I've been playing with the daemon task (loading a trained model), and I've understood how to create the JSON structure it expects (data, sizes, classes). It works well when I send one image, but when I try to send multiple images (even only 2), either the library crashes or I get a result that, when decoded, is just gibberish.

The way I create the JSON structure is based on how the IAM h5 file is created in the function write_to_hdf in demos/mdlstm/IAM/create_IAM_dataset.py. Basically, with multiple images, I concatenate them and put the sequence in the "data" key and, if I understood correctly, the "sizes" key contains a 1D array with the sizes (height, width) of each of the images passed. Something like this (case with 2 images):

{ "data": [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.21291029], [0.121221], [0.3233232], [0.0434343], [0.278821], ....], "sizes": [89, 1667, 23, 2280], "classes": [79,1] }

Questions:

  • Is it possible to pass multiple data to the daemon?
  • If so, is that the correct JSON structure? Am I missing something? I have the feeling that maybe I'm not concatenating the data correctly, because it works for one image, but not for two or more.

Thanks.

Inf value in the logs

Hi, I have a problem with the log probability values: some of them are -inf:
0. -inf -37.70607758 -inf -42.47481155 -42.84217453 -49.46448135 -56.07925797 -49.56125641 -55.94313812 -89.72915649 -54.34759903 -85.81500244 -48.88659668 -62.43969727 -68.71505737 -69.98902893 -54.09999084 -60.66642761 -51.04163361 -59.69820404 -42.64873886 -80.87241364 -64.62988281 -inf -54.81242371 -51.49855042 -inf -46.05450821 -53.26513672 -50.18508148 -50.80073547 -45.60667801 -inf -52.66472626 -53.40474701 -93.70770264 -61.22911835 -46.88584137 -81.49685669 -61.51108932 -53.74775696 -67.20650482 -56.51394653 -73.68787384 -68.04111481 -56.98586273 -49.08909607 -54.01488495 -40.78330231 -59.44810486 -79.71131897 -58.36343384 -67.72359467 -52.00548553 -60.10074234 -59.3542099 -56.50511932 -66.27165985 -47.93336487 -63.80725861 -60.54309082 -51.09378815 -47.2504158 -39.75489044 -46.88323975 -58.90422058 -53.51332092 -54.76487732 -60.65594482 -66.46124268 -62.25313187 -56.2291832 -59.68401337 -56.63359833 -63.68535995 -54.58312225 -45.54991531 -40.62715149 -69.29733276 -101.33302307 -101.33302307 -37.78124619 -63.92712784 -56.14258575 -53.96005249 -65.82307434 -43.64934921 -62.65488052 -48.00870514 -37.70607758 -97.93659973 -53.32273865 -49.87110519 -60.78359604 -37.59594727 -53.01013184 -50.76114655 -61.31332397 -46.94801331 -61.3396225 -48.61423492 -55.86868668 -41.1423378 -69.7824707 -63.35431671 -46.79589462 -34.70685959 -50.86413574 -54.89645386 -64.22349548 -66.80497742
0. -inf -37.06203079 -inf -41.81446075 -42.29504395 -48.91819 -55.56117249 -48.94821548 -55.40008545 -88.66357422 -53.74860382 -85.07017517 -48.56520844 -61.7991333 -68.09744263 -69.22842407 -53.49817657 -60.03842545 -50.34869003 -59.15971375 -42.09218597 -79.88864899 -63.96572876 -inf -54.11365509 -50.80646515 -inf -45.58577347 -52.69251633 -49.72389221 -50.13254166 -45.21523285 -inf -51.83375168 -52.87867737 -92.65901947 -60.6215477 -46.29016495 -80.65167236 -60.96844482 -53.31570816 -66.54362488 -55.80446625 -73.08409119 -67.3098526 -56.49045563 -48.65434265 -53.53612518 -40.14875793 -58.88114166 -78.95735168 -57.90089798 -67.029953 -51.45124435 -59.59119034 -58.70852661 -55.87550735 -65.81600952 -47.53460312 -63.26594543 -60.17967606 -50.50235748 -46.84017181 -39.35032654 -46.44360733 -58.34305954 -53.04711533 -54.28547287 -60.13535309 -65.94954681 -61.74348831 -55.77182388 -59.18076324 -56.1773262 -63.20414734 -54.0113945 -45.07807159 -40.16356277 -68.50437164 -100.33448792 -100.2831955 -37.35686493 -63.43578339 -55.49005127 -53.27494049 -65.23242188 -43.31959534 -62.12603378 -47.57835388 -37.06203079 -96.95278168 -52.66784668 -49.3058548 -60.25371552 -37.12116241 -52.5087738 -50.32568359 -60.61753464 -46.45907593 -60.88881302 -48.21120453 -55.24966812 -40.67898941 -69.07620239 -62.84642029 -46.31671906 -34.32710648 -50.34968567 -54.31123352 -63.60016632 -66.16163635
0. -inf -36.02425766 -inf -40.82946

How can I replace this value or otherwise resolve this?

Error while running IAM demo

Hi guys,
I have no experience using RETURNN. While trying to run go.sh in the IAM folder, I get some errors, and I am not sure why they occur. I am using Theano 0.9 and cuDNN 5. It would be great if someone could list the exact versions of the dependencies that worked for them. Here is the error; I would appreciate any guidance.

arman@arman-N551JW:~/returnn/demos/mdlstm/IAM$ ./go.sh
('converting IAM_lines to', 'features/raw/demo.h5')
features/raw/demo.h5
(0, '/', 3)
RETURNN starting up, version 20180405.225130--git-538ed96-dirty, date/time 2018-04-09-11-35-51 (UTC+0200), pid 11244, cwd /home/arman/returnn/demos/mdlstm/IAM, Python /usr/bin/python
RETURNN command line options: ['config_demo']
faulthandler import error. No module named faulthandler
Theano: 0.9.0 ( in /usr/local/lib/python2.7/dist-packages/theano)
pynvml not available, memory information missing
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/init.py:556: UserWarning: Theano flag device=gpu* (old gpu back-end) only support floatX=float32. You have floatX=float64. Use the new gpu back-end with device=cuda* for that value of floatX.
warnings.warn(msg)
Using gpu device 0: GeForce GTX 960M (CNMeM is disabled, cuDNN Mixed dnn version. The header is from one version, but we link with a different version (5110, 6021))
Device gpuX proc starting up, pid 11270
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True'
faulthandler import error. No module named faulthandler
Device gpuX proc exception: CudaNdarrayType only supports dtype float32 for now. Tried using dtype float64 for variable None
Unhandled exception <type 'exceptions.TypeError'> in thread <_MainThread(MainThread, started 140321675474688)>, proc 11270.

Thread current, main, <_MainThread(MainThread, started 140321675474688)>:
(Excluded thread.)

That were all threads.
EXCEPTION
Traceback (most recent call last):
File "/home/arman/returnn/Device.py", line 1027, in process
line: self.process_inner(device, config, self.update_specs, asyncTask)
locals:
self = <Device.Device object at 0x7f9f002200d0>
self.process_inner = <bound method Device.process_inner of <Device.Device object at 0x7f9f002200d0>>
device = 'gpuX'
config = <Config.Config instance at 0x7f9f00228128>
self.update_specs = {'layers': [], 'block_size': 0, 'update_params': {}, 'update_rule': 'global'}
asyncTask = <TaskSystem.AsyncTask instance at 0x7f9f2b563680>
File "/home/arman/returnn/Device.py", line 1080, in process_inner
line: self.initialize(config, update_specs=update_specs)
locals:
self = <Device.Device object at 0x7f9f002200d0>
self.initialize = <bound method Device.initialize of <Device.Device object at 0x7f9f002200d0>>
config = <Config.Config instance at 0x7f9f00228128>
update_specs = {'layers': [], 'block_size': 0, 'update_params': {}, 'update_rule': 'global'}
File "/home/arman/returnn/Device.py", line 457, in initialize
line: self.trainnet = LayerNetwork.from_config_topology(config, train_flag=True, eval_flag=False)
locals:
self = <Device.Device object at 0x7f9f002200d0>
self.trainnet = !AttributeError: 'Device' object has no attribute 'trainnet'
LayerNetwork = <class 'Network.LayerNetwork'>
LayerNetwork.from_config_topology = <bound method type.from_config_topology of <class 'Network.LayerNetwork'>>
config = <Config.Config instance at 0x7f9f00228128>
train_flag =
eval_flag = False
File "/home/arman/returnn/Network.py", line 119, in from_config_topology
line: return cls.from_json_and_config(json_content, config, mask=mask, **kwargs)
locals:
cls = <class 'Network.LayerNetwork'>
cls.from_json_and_config = <bound method type.from_json_and_config of <class 'Network.LayerNetwork'>>
json_content = {'mdlstm0': {'dropout': 0.25, 'from': ['conv0'], 'class': 'mdlstm', 'n_out': 30}, 'conv2': {'from': ['mdlstm1'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_features': 75, 'class': 'conv2'}, 'conv1': {'from': ['mdlstm0'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_f..., len = 13
config = <Config.Config instance at 0x7f9f00228128>
mask = None
kwargs = {'train_flag': True, 'eval_flag': False}
File "/home/arman/returnn/Network.py", line 204, in from_json_and_config
line: network = cls.from_json(json_content, **dict_joined(kwargs, cls.init_args_from_config(config)))
locals:
network =
cls = <class 'Network.LayerNetwork'>
cls.from_json = <bound method type.from_json of <class 'Network.LayerNetwork'>>
json_content = {'mdlstm0': {'dropout': 0.25, 'from': ['conv0'], 'class': 'mdlstm', 'n_out': 30}, 'conv2': {'from': ['mdlstm1'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_features': 75, 'class': 'conv2'}, 'conv1': {'from': ['mdlstm0'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_f..., len = 13
dict_joined = <function dict_joined at 0x7f9f0802a0c8>
kwargs = {'train_flag': True, 'mask': None, 'eval_flag': False}
cls.init_args_from_config = <bound method type.init_args_from_config of <class 'Network.LayerNetwork'>>
config = <Config.Config instance at 0x7f9f00228128>
File "/home/arman/returnn/Network.py", line 422, in from_json
line: traverse(json_content, layer_name, trg, index)
locals:
traverse = <function traverse at 0x7f9f00109410>
json_content = {'mdlstm0': {'dropout': 0.25, 'from': ['conv0'], 'class': 'mdlstm', 'n_out': 30}, 'conv2': {'from': ['mdlstm1'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_features': 75, 'class': 'conv2'}, 'conv1': {'from': ['mdlstm0'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_f..., len = 13
layer_name = 'output', len = 6
trg = 'classes', len = 7
index = j_classes
File "/home/arman/returnn/Network.py", line 350, in traverse
line: index = traverse(content, prev, target, index)
locals:
index = j_classes
traverse = <function traverse at 0x7f9f00109410>
content = {'mdlstm0': {'dropout': 0.25, 'from': ['conv0'], 'class': 'mdlstm', 'n_out': 30}, 'conv2': {'from': ['mdlstm1'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_features': 75, 'class': 'conv2'}, 'conv1': {'from': ['mdlstm0'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_f..., len = 13
prev = 'mdlstm4', len = 7
target = 'classes', len = 7
File "/home/arman/returnn/Network.py", line 350, in traverse
line: index = traverse(content, prev, target, index)
locals:
index = j_classes
traverse = <function traverse at 0x7f9f00109410>
content = {'mdlstm0': {'dropout': 0.25, 'from': ['conv0'], 'class': 'mdlstm', 'n_out': 30}, 'conv2': {'from': ['mdlstm1'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_features': 75, 'class': 'conv2'}, 'conv1': {'from': ['mdlstm0'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_f..., len = 13
prev = 'conv4'
target = 'classes', len = 7
File "/home/arman/returnn/Network.py", line 350, in traverse
line: index = traverse(content, prev, target, index)
locals:
index = j_classes
traverse = <function traverse at 0x7f9f00109410>
content = {'mdlstm0': {'dropout': 0.25, 'from': ['conv0'], 'class': 'mdlstm', 'n_out': 30}, 'conv2': {'from': ['mdlstm1'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_features': 75, 'class': 'conv2'}, 'conv1': {'from': ['mdlstm0'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_f..., len = 13
prev = 'mdlstm3', len = 7
target = 'classes', len = 7
File "/home/arman/returnn/Network.py", line 350, in traverse
line: index = traverse(content, prev, target, index)
locals:
index = j_classes
traverse = <function traverse at 0x7f9f00109410>
content = {'mdlstm0': {'dropout': 0.25, 'from': ['conv0'], 'class': 'mdlstm', 'n_out': 30}, 'conv2': {'from': ['mdlstm1'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_features': 75, 'class': 'conv2'}, 'conv1': {'from': ['mdlstm0'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_f..., len = 13
prev = 'conv3'
target = 'classes', len = 7
File "/home/arman/returnn/Network.py", line 350, in traverse
line: index = traverse(content, prev, target, index)
locals:
index = j_classes
traverse = <function traverse at 0x7f9f00109410>
content = {'mdlstm0': {'dropout': 0.25, 'from': ['conv0'], 'class': 'mdlstm', 'n_out': 30}, 'conv2': {'from': ['mdlstm1'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_features': 75, 'class': 'conv2'}, 'conv1': {'from': ['mdlstm0'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_f..., len = 13
prev = 'mdlstm2', len = 7
target = 'classes', len = 7
File "/home/arman/returnn/Network.py", line 350, in traverse
line: index = traverse(content, prev, target, index)
locals:
index = j_classes
traverse = <function traverse at 0x7f9f00109410>
content = {'mdlstm0': {'dropout': 0.25, 'from': ['conv0'], 'class': 'mdlstm', 'n_out': 30}, 'conv2': {'from': ['mdlstm1'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_features': 75, 'class': 'conv2'}, 'conv1': {'from': ['mdlstm0'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_f..., len = 13
prev = 'conv2'
target = 'classes', len = 7
File "/home/arman/returnn/Network.py", line 350, in traverse
line: index = traverse(content, prev, target, index)
locals:
index = j_classes
traverse = <function traverse at 0x7f9f00109410>
content = {'mdlstm0': {'dropout': 0.25, 'from': ['conv0'], 'class': 'mdlstm', 'n_out': 30}, 'conv2': {'from': ['mdlstm1'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_features': 75, 'class': 'conv2'}, 'conv1': {'from': ['mdlstm0'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_f..., len = 13
prev = 'mdlstm1', len = 7
target = 'classes', len = 7
File "/home/arman/returnn/Network.py", line 350, in traverse
line: index = traverse(content, prev, target, index)
locals:
index = j_classes
traverse = <function traverse at 0x7f9f00109410>
content = {'mdlstm0': {'dropout': 0.25, 'from': ['conv0'], 'class': 'mdlstm', 'n_out': 30}, 'conv2': {'from': ['mdlstm1'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_features': 75, 'class': 'conv2'}, 'conv1': {'from': ['mdlstm0'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_f..., len = 13
prev = 'conv1'
target = 'classes', len = 7
File "/home/arman/returnn/Network.py", line 350, in traverse
line: index = traverse(content, prev, target, index)
locals:
index = j_classes
traverse = <function traverse at 0x7f9f00109410>
content = {'mdlstm0': {'dropout': 0.25, 'from': ['conv0'], 'class': 'mdlstm', 'n_out': 30}, 'conv2': {'from': ['mdlstm1'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_features': 75, 'class': 'conv2'}, 'conv1': {'from': ['mdlstm0'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_f..., len = 13
prev = 'mdlstm0', len = 7
target = 'classes', len = 7
File "/home/arman/returnn/Network.py", line 350, in traverse
line: index = traverse(content, prev, target, index)
locals:
index = j_classes
traverse = <function traverse at 0x7f9f00109410>
content = {'mdlstm0': {'dropout': 0.25, 'from': ['conv0'], 'class': 'mdlstm', 'n_out': 30}, 'conv2': {'from': ['mdlstm1'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_features': 75, 'class': 'conv2'}, 'conv1': {'from': ['mdlstm0'], 'dropout': 0.25, 'pool_size': [2, 2], 'filter': [3, 3], 'n_f..., len = 13
prev = 'conv0'
target = 'classes', len = 7
File "/home/arman/returnn/Network.py", line 408, in traverse
line: return network.add_layer(layer_class(**params)).index
locals:
network = <Network.LayerNetwork object at 0x7f9f0017b310>
network.add_layer = <bound method LayerNetwork.add_layer of <Network.LayerNetwork object at 0x7f9f0017b310>>
layer_class = <class 'NetworkTwoDLayer.ConvPoolLayer2'>
params = {'index': j_sizes, 'network': <Network.LayerNetwork object at 0x7f9f0017b310>, 'dropout': 0.0, 'mask': None, 'pool_size': [2, 2], 'filter': [3, 3], 'sources': [<<class 'NetworkTwoDLayer.OneDToTwoDLayer'> class:1Dto2D name:1Dto2D>], 'y_in': {'classes': y_classes, 'data': x, 'sizes': y_sizes}, 'tra..., len = 12
index = j_sizes
File "/home/arman/returnn/NetworkTwoDLayer.py", line 460, in init
line: Z = conv_crop_pool_op(self.X, sizes, self.output_sizes, self.W, self.b, self.n_in, self.n_features, self.filter_height,
self.filter_width, filter_dilation, pool_size)
locals:
Z =
conv_crop_pool_op = <function conv_crop_pool_op at 0x7f9f002de9b0>
self = <<class 'NetworkTwoDLayer.ConvPoolLayer2'> class:conv2 name:conv0>
self.X = if{}.0
sizes = if{}.0
self.output_sizes = Join.0
self.W = W_conv0
self.b = b_conv0
self.n_in = 1
self.n_features = 15
self.filter_height = 3
self.filter_width = 3
filter_dilation = [1, 1]
pool_size = [2, 2]
File "/home/arman/returnn/NetworkTwoDLayer.py", line 358, in conv_crop_pool_op
line: conv_out = conv_op(X, W, b) if filter_height * filter_width > 0 else X
locals:
conv_out =
conv_op = <cuda_implementation.CuDNNConvHWBCOp.CuDNNConvHWBCOp object at 0x7f9f002d7f50>
X = if{}.0
W = W_conv0
b = b_conv0
filter_height = 3
filter_width = 3
File "/usr/local/lib/python2.7/dist-packages/theano/gof/op.py", line 615, in call
line: node = self.make_node(*inputs, **kwargs)
locals:
node =
self = <cuda_implementation.CuDNNConvHWBCOp.CuDNNConvHWBCOp object at 0x7f9f002d7f50>
self.make_node = <bound method CuDNNConvHWBCOp.make_node of <cuda_implementation.CuDNNConvHWBCOp.CuDNNConvHWBCOp object at 0x7f9f002d7f50>>
inputs = (if{}.0, W_conv0, b_conv0)
kwargs = {}
File "/home/arman/returnn/cuda_implementation/CuDNNConvHWBCOp.py", line 217, in make_node
line: W = gpu_contiguous(as_cuda_ndarray_variable(W))
locals:
W = W_conv0
gpu_contiguous = <theano.sandbox.cuda.basic_ops.GpuContiguous object at 0x7f9f098dced0>
as_cuda_ndarray_variable = <function as_cuda_ndarray_variable at 0x7f9f0d943848>
File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/basic_ops.py", line 46, in as_cuda_ndarray_variable
line: return gpu_from_host(tensor_x)
locals:
gpu_from_host = <theano.sandbox.cuda.basic_ops.GpuFromHost object at 0x7f9f098dc690>
tensor_x = W_conv0
File "/usr/local/lib/python2.7/dist-packages/theano/gof/op.py", line 615, in call
line: node = self.make_node(*inputs, **kwargs)
locals:
node =
self = <theano.sandbox.cuda.basic_ops.GpuFromHost object at 0x7f9f098dc690>
self.make_node = <bound method GpuFromHost.make_node of <theano.sandbox.cuda.basic_ops.GpuFromHost object at 0x7f9f098dc690>>
inputs = (W_conv0,)
kwargs = {}
File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/basic_ops.py", line 132, in make_node
line: return Apply(self, [x], [CudaNdarrayType(broadcastable=x.broadcastable,
dtype=x.dtype)()])
locals:
Apply = <class 'theano.gof.graph.Apply'>
self = <theano.sandbox.cuda.basic_ops.GpuFromHost object at 0x7f9f098dc690>
x = W_conv0
CudaNdarrayType = <class 'theano.sandbox.cuda.type.CudaNdarrayType'>
broadcastable =
x.broadcastable = (False, False, False, False)
dtype =
x.dtype = 'float64', len = 7
File "/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda/type.py", line 95, in init
line: raise TypeError('%s only supports dtype float32 for now. Tried '
'using dtype %s for variable %s' %
(self.class.name, dtype, name))
locals:
TypeError = <type 'exceptions.TypeError'>
self = !AttributeError: 'CudaNdarrayType' object has no attribute 'name'
self.class = <class 'theano.sandbox.cuda.type.CudaNdarrayType'>
self.class.name = 'CudaNdarrayType', len = 15
dtype = 'float64', len = 7
name = None
TypeError: CudaNdarrayType only supports dtype float32 for now. Tried using dtype float64 for variable None
Device proc gpuX (gpuZ) died: ProcConnectionDied('recv_bytes EOFError: ',)
Theano flags: compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,force_device=True
EXCEPTION
Traceback (most recent call last):
File "/home/arman/returnn/Device.py", line 347, in startProc
line: self._startProc(*args, **kwargs)
locals:
self = <Device.Device object at 0x7f7a861e7b10>
self._startProc = <bound method Device._startProc of <Device.Device object at 0x7f7a861e7b10>>
args = ('gpuZ',)
kwargs = {}
File "/home/arman/returnn/Device.py", line 401, in _startProc
line: interrupt_main()
locals:
interrupt_main = <function interrupt_main at 0x7f7a86e5a5f0>
File "/home/arman/returnn/Util.py", line 665, in interrupt_main
line: sys.exit(1) # And exit the thread.
locals:
sys = <module 'sys' (built-in)>
sys.exit =
SystemExit: 1
KeyboardInterrupt
EXCEPTION
Traceback (most recent call last):
File "../../../rnn.py", line 546, in main
line: init(commandLineOptions=argv[1:])
locals:
init = <function init at 0x7f7a861dda28>
commandLineOptions =
argv = ['../../../rnn.py', 'config_demo'], _[0]: {len = 15}
File "../../../rnn.py", line 343, in init
line: devices = initDevices()
locals:
devices =
initDevices = <function initDevices at 0x7f7a861dd668>
File "../../../rnn.py", line 154, in initDevices
line: time.sleep(0.25)
locals:
time = <module 'time' (built-in)>
time.sleep =
KeyboardInterrupt
Quitting
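
The startup log above reports floatX=float64 together with the old device=gpu back-end, which accepts only float32, and the traceback ends in the matching TypeError. A minimal sketch of one way to force float32 when launching the demo, assuming the launch command `python ../../../rnn.py config_demo` seen in the traceback; the helper name `float32_env` is hypothetical and not part of RETURNN:

```python
import os


def float32_env(base_env=None):
    """Return a copy of the environment with floatX=float32 in THEANO_FLAGS.

    Hypothetical helper for illustration; it only edits the flag string.
    """
    env = dict(os.environ if base_env is None else base_env)
    # THEANO_FLAGS is a comma-separated list of key=value pairs.
    flags = [f for f in env.get("THEANO_FLAGS", "").split(",") if f]
    # Drop any existing floatX setting, then append the float32 one.
    flags = [f for f in flags if not f.startswith("floatX=")]
    flags.append("floatX=float32")
    env["THEANO_FLAGS"] = ",".join(flags)
    return env


if __name__ == "__main__":
    env = float32_env({"THEANO_FLAGS": "device=gpu,floatX=float64"})
    print(env["THEANO_FLAGS"])
    # To launch the demo with these flags (not executed here):
    # import subprocess
    # subprocess.call(["python", "../../../rnn.py", "config_demo"], env=float32_env())
```

Setting floatX = float32 in .theanorc, or prefixing the command with THEANO_FLAGS='floatX=float32' in the shell, should have the same effect.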

Index out of range error

Hi, I tried running the training on a different dataset, as you suggested.
I am getting an "index out of range" error after every batch. Please help me figure out what has gone wrong. The log follows:

`
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
index out of range4: 182 / 183, 0 / 1, 81 / 80
train epoch 1, batch 12, cost:output 4.52913375128, elapsed 0:00:44, exp. remaining 0:20:27, complete 3.50%
0:20:27 [|||||| 3.50% ]running 1 sequence slices (382273 nts) of batch 13 on device gpu0
index out of range4: 183 / 184, 0 / 1, 81 / 80

`
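
Reading each `a / b` pair in the repeated log line as `index / size` (an assumption; the log itself does not label the fields), only the third pair has an index that exceeds its size. A minimal sketch that picks out such pairs; `find_out_of_range` is a hypothetical helper, not a RETURNN function:

```python
import re


def find_out_of_range(line):
    """Return (position, index, size) for every 'index / size' pair with index >= size."""
    pairs = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\s*/\s*(\d+)", line)]
    return [(pos, idx, size) for pos, (idx, size) in enumerate(pairs) if idx >= size]


if __name__ == "__main__":
    log_line = "index out of range4: 182 / 183, 0 / 1, 81 / 80"
    print(find_out_of_range(log_line))  # [(2, 81, 80)]
```

If the third field is a class label and 80 is the size of the output layer, this would point to labels in the new dataset that fall outside the configured number of classes. That reading is a guess and should be checked against the dataset.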

Error when using config_real in mdlstm/IAM

screenshot 658

When I'm using the config_demo for training/validation/testing, I don't seem to encounter this problem. I followed the instructions in the readme file. Has anyone else encountered this error?

Running model outside RETURNN

Hi!

I've trained an MDLSTM model using RETURNN and I would like to export the model in order to use it outside RETURNN. Is that possible? Can I serialize an object that allows running predictions (forwarding) on a given image without installing the whole environment?

Thanks in advance.

Asking about num_outputs in config

Hi, I would like to ask about "num_outputs": {"data": [1,2], "classes": [79,1], "sizes": [2,1]}, which is in the file config_real.
For my own dataset, I have 107 units (as in the charlist: aaA, baB, ...).
Is it then correct to write this:
"num_outputs": {"data": [1,2], "classes": [107,1], "sizes": [2,1]},
And what do "data": [1,2] and "sizes": [2,1] mean?
Thanks in advance.
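
A minimal sketch of how the entry could be derived from a character list, assuming each value is a [dimension, number of axes] pair in which a second value of 1 marks a sparse label sequence and 2 a dense frame sequence. That reading is an assumption to check against the RETURNN documentation; the helper name and the toy charlist are hypothetical:

```python
def make_num_outputs(charlist, input_dim=1):
    """Build a num_outputs dict shaped like the one in config_real."""
    return {
        "data": [input_dim, 2],         # assumed: dense input, kept as in config_real
        "classes": [len(charlist), 1],  # assumed: one sparse label id per position
        "sizes": [2, 1],                # kept unchanged from config_real
    }


if __name__ == "__main__":
    charlist = ["unit%d" % i for i in range(107)]  # placeholder for the 107 real units
    print(make_num_outputs(charlist))
```

Under that assumption only the "classes" entry depends on the size of the charlist, which matches the change proposed in the question.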

Not able to run IAM training

Hi,
I tried to train the network, and the following is the error log:

`
Theano: 0.9.0 ( in /usr/local/lib/python2.7/dist-packages/theano)
faulthandler import error. No module named faulthandler
pynvml not available, memory information missing
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 1: GeForce GTX 1080 (CNMeM is enabled with initial size: 95.0% of memory, cuDNN Mixed dnn version. The header is from one version, but we link with a different version (5110, 6021))
Device gpu1 proc starting up, pid 27377
Device gpu1 proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpu1,device=gpu1,force_device=True'
faulthandler import error. No module named faulthandler
Device train-network: Used data keys: ['classes', 'data', 'sizes']
using adam with nag and momentum schedule
Device gpu1 proc, pid 27377 is ready for commands.
Devices: Used in multiprocessing mode.
loading file features/raw/train_valid.h5
cached 616 seqs 1.88295629248 GB (fully loaded, 40.7837103745 GB left over)
loading file features/raw/train.1.h5
loading file features/raw/train.2.h5
cached 5545 seqs 16.9867935553 GB (fully loaded, 39.2744432362 GB left over)
Train data:
input: 1 x 1
output: {u'classes': [79, 1], 'data': [1, 2], u'sizes': [2, 1]}
HDF dataset, sequences: 5545, frames: 1519952558
Dev data:
HDF dataset, sequences: 616, frames: 168484077
Devices:
gpu1: GeForce GTX 1080 (units: 1000 clock: 1.00Ghz memory: 2.0GB) working on 1 batch (update on device)
Learning-rate-control: no file specified, not saving history (no proper restart possible)
using adam with nag and momentum schedule
Network layer topology:
input #: 1
hidden 1Dto2D '1Dto2D' #: 1
hidden source 'classes_source' #: 2
hidden conv2 'conv0' #: 15
hidden conv2 'conv1' #: 45
hidden conv2 'conv2' #: 75
hidden conv2 'conv3' #: 105
hidden conv2 'conv4' #: 105
hidden mdlstm 'mdlstm0' #: 30
hidden mdlstm 'mdlstm1' #: 60
hidden mdlstm 'mdlstm2' #: 90
hidden mdlstm 'mdlstm3' #: 120
hidden mdlstm 'mdlstm4' #: 120
output softmax 'output' #: 80
net params #: 2627660
net trainable params: [W_conv0, b_conv0, W_conv1, b_conv1, W_conv2, b_conv2, W_conv3, b_conv3, W_conv4, b_conv4, U1_mdlstm0, U2_mdlstm0, U3_mdlstm0, U4_mdlstm0, V1_mdlstm0, V2_mdlstm0, V3_mdlstm0, V4_mdlstm0, W1_mdlstm0, W2_mdlstm0, W3_mdlstm0, W4_mdlstm0, b1_mdlstm0, b2_mdlstm0, b3_mdlstm0, b4_mdlstm0, U1_mdlstm1, U2_mdlstm1, U3_mdlstm1, U4_mdlstm1, V1_mdlstm1, V2_mdlstm1, V3_mdlstm1, V4_mdlstm1, W1_mdlstm1, W2_mdlstm1, W3_mdlstm1, W4_mdlstm1, b1_mdlstm1, b2_mdlstm1, b3_mdlstm1, b4_mdlstm1, U1_mdlstm2, U2_mdlstm2, U3_mdlstm2, U4_mdlstm2, V1_mdlstm2, V2_mdlstm2, V3_mdlstm2, V4_mdlstm2, W1_mdlstm2, W2_mdlstm2, W3_mdlstm2, W4_mdlstm2, b1_mdlstm2, b2_mdlstm2, b3_mdlstm2, b4_mdlstm2, U1_mdlstm3, U2_mdlstm3, U3_mdlstm3, U4_mdlstm3, V1_mdlstm3, V2_mdlstm3, V3_mdlstm3, V4_mdlstm3, W1_mdlstm3, W2_mdlstm3, W3_mdlstm3, W4_mdlstm3, b1_mdlstm3, b2_mdlstm3, b3_mdlstm3, b4_mdlstm3, U1_mdlstm4, U2_mdlstm4, U3_mdlstm4, U4_mdlstm4, V1_mdlstm4, V2_mdlstm4, V3_mdlstm4, V4_mdlstm4, W1_mdlstm4, W2_mdlstm4, W3_mdlstm4, W4_mdlstm4, b1_mdlstm4, b2_mdlstm4, b3_mdlstm4, b4_mdlstm4, W_in_mdlstm4_output, b_output]
start training at epoch 1 and batch 0
using batch size: 600000, max seqs: 10
learning rate control: ConstantLearningRate(defaultLearningRate=0.0005, minLearningRate=0.0, defaultLearningRates={1: 0.0005, 25: 0.0003, 35: 0.0001}, errorMeasureKey=None, relativeErrorAlsoRelativeToLearningRate=False, minNumEpochsPerNewLearningRate=0, filename=None), epoch data: 1: EpochData(learningRate=0.0005, error={}), 25: EpochData(learningRate=0.0003, error={}), 35: EpochData(learningRate=0.0001, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (477432 nts) of batch 0 on device gpu1
CUDNN failure\nError: CUDNN_STATUS_BAD_PARAM\nmod.cu:641\nAborting...\nDev gpu1 proc died: recv_bytes EOFError:
device crashed on batch 0
Save model from epoch 0 under models/mdlstm_real.001.crash_0

`
Is anyone facing the same issue, or can anyone suggest what is wrong?

Scores in log file for training

Hi,
I would like to understand the meaning of these two lines in the log file.
Is the training going OK?

Learning-rate-control: error key 'dev_score' from {'dev_error': 0.99728335186974681, 'dev_score': 3.6586107690151706}
epoch 1 score: 3.96068657248 elapsed: 3:19:03 dev: score 3.65861076902 error 0.99728335187
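
The two values in the first line can be pulled out of the log programmatically, which helps when comparing epochs. A minimal sketch, assuming the line format shown above; reading dev_score as the loss on the dev set and dev_error as the fraction of wrong predictions is an assumption, and `parse_lr_control_line` is a hypothetical helper:

```python
import ast
import re


def parse_lr_control_line(line):
    """Extract the error key and the metrics dict from a Learning-rate-control log line."""
    match = re.search(r"error key '(\w+)' from (\{.*\})", line)
    if match is None:
        raise ValueError("unexpected log line format")
    key = match.group(1)
    metrics = ast.literal_eval(match.group(2))  # safe parsing of the printed dict
    return key, metrics


if __name__ == "__main__":
    line = ("Learning-rate-control: error key 'dev_score' from "
            "{'dev_error': 0.99728335186974681, 'dev_score': 3.6586107690151706}")
    key, metrics = parse_lr_control_line(line)
    print(key, metrics[key])
    print("dev error rate: %.2f%%" % (100 * metrics["dev_error"]))
```

Under that reading, a dev error close to 1.0 after the first epoch means almost no predictions are correct yet, so the trend over later epochs says more than this single value.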

IAM data testset processing and path for rnn.py

text = re.sub("([^|])'" , "\g<1>|'", text)

The program (create_IAM_dataset.py) processes the training images (padding, dividing pixel values by 255) and the training transcriptions (the regex substitution above), but it does not process the test transcriptions (no regex operation). It would be very helpful if you could answer the following:

Is there any processing done for the test transcriptions?
Does rnn.py train only on the training images (train.1.h5, train.2.h5, train_valid.h5)?
Where and how are valid.h5 and test.h5 used?
Thanks.
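
The substitution quoted at the top of this post can be tried in isolation. A minimal sketch, assuming '|' is the separator used in the transcriptions (suggested by the pattern, not confirmed by the post); the function name and the sample strings are made up:

```python
import re


def mark_apostrophes(text):
    """Insert '|' between a non-'|' character and a following apostrophe."""
    return re.sub(r"([^|])'", r"\g<1>|'", text)


if __name__ == "__main__":
    print(mark_apostrophes("don't|stop"))  # don|'t|stop
    print(mark_apostrophes("it|'s"))       # it|'s (already separated, unchanged)
```

If the test transcriptions skip this step, any apostrophe in them stays attached to the preceding character, so the label sequences of training and test text would be tokenized differently.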

Error while running demo

Hey guys. I am trying to run the demo on a server to which I have SSH access. While running it, I encountered an error related to Theano's optimizer flag. I have changed the optimizer in .theanorc several times (to fast_compile and None), but neither works. I have copied the error below; maybe you can help me with it:
at64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
('converting IAM_lines to', 'features/raw/demo.h5')
features/raw/demo.h5
(0, '/', 3)
[koeln:15698] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_btl_usnic: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[koeln:15698] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[koeln:15698] mca: base: component_find: unable to open /usr/lib64/mpi/gcc/openmpi/lib64/openmpi/mca_mtl_psm: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
initCnmem: cnmemInit call failed! Reason=CNMEM_STATUS_OUT_OF_MEMORY. numdev=1

Traceback (most recent call last):
File "../../../rnn.py", line 27, in
from Device import Device, TheanoFlags, getDevicesInitArgs
File "/home/sadeghi/returnn/Device.py", line 5, in
from Updater import Updater
File "/home/sadeghi/returnn/Updater.py", line 4, in
import theano
File "/usr/lib/python2.7/site-packages/theano/init.py", line 111, in
theano.sandbox.cuda.tests.test_driver.test_nvidia_driver1()
File "/usr/lib/python2.7/site-packages/theano/sandbox/cuda/tests/test_driver.py", line 38, in test_nvidia_driver1
if not numpy.allclose(f(), a.sum()):
File "/usr/lib/python2.7/site-packages/theano/compile/function_module.py", line 871, in call
storage_map=getattr(self.fn, 'storage_map', None))
File "/usr/lib/python2.7/site-packages/theano/gof/link.py", line 314, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/usr/lib/python2.7/site-packages/theano/compile/function_module.py", line 859, in call
outputs = self.fn()
RuntimeError: Cuda error: kernel_reduce_ccontig_node_97496c4d3cf9a06dc4082cc141f918d2_0: out of memory. (grid: 1 x 1; block: 256 x 1 x 1)

Apply node that caused the error: GpuCAReduce{add}{1}(<CudaNdarrayType(float32, vector)>)
Toposort index: 0
Inputs types: [CudaNdarrayType(float32, vector)]
Inputs shapes: [(10000,)]
Inputs strides: [(1,)]
Inputs values: ['not shown']
Outputs clients: [[HostFromGpu(GpuCAReduce{add}{1}.0)]]

Debugprint of the apply node:
GpuCAReduce{add}{1} [id A] <CudaNdarrayType(float32, scalar)> ''
|<CudaNdarrayType(float32, vector)> [id B] <CudaNdarrayType(float32, vector)>

Storage map footprint:

  • <CudaNdarrayType(float32, vector)>, Shared Input, Shape: (10000,), ElemSize: 4 Byte(s), TotalSize: 40000 Byte(s)
    TotalSize: 40000 Byte(s) 0.000 GB
    TotalSize inputs: 40000 Byte(s) 0.000 GB

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
sadeghi@koeln:~/returnn/demos/mdlstm/IAM>

Device gpuX proc exception: ('The following error happened while compiling the node'

Tesla K80, theano 0.9

"THEANO_FLAGS='floatX=float32' python ../../../rnn.py config_demo"

It seems to me:

Device.py: 347
self._startProc(*args, **kwargs) /// try to compile with a task using nvcc

Device.py: 355
def _startProc(self, device_tag):

Device.py: 385 -391
self.proc = AsyncTask(...
...
self.input_queue = self.output_queue = self.proc.conn
/// created connection with the task

Device.py : 393 -
try:
// then try to read results, but got an exception
self.id = self.output_queue.recv(); """ :type: int """
self.device_name = self.output_queue.recv(); """ :type: str """
self.num_train_params = self.output_queue.recv(); """ :type: int """ # = len(trainnet.gparams)
self.sync_used_targets()
except ProcConnectionDied as e:

Somehow the "nvcc" process died, so its result cannot be read.
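
The failure mode described above, a parent waiting in recv() while the other end of the connection is already gone, can be reproduced with the standard library alone. A minimal sketch, assuming RETURNN's connection behaves like a multiprocessing pipe here (the `recv_bytes EOFError` in the log suggests this, but it is not confirmed); `recv_or_report` is a hypothetical name:

```python
import multiprocessing


def recv_or_report(conn):
    """Try to read one message; report a closed peer instead of raising."""
    try:
        return conn.recv()
    except EOFError:
        # Raised when nothing is left to read and the other end was closed.
        return "peer died"


if __name__ == "__main__":
    parent_end, child_end = multiprocessing.Pipe()
    child_end.send(7)   # the "child" reports one value ...
    child_end.close()   # ... and then goes away
    print(recv_or_report(parent_end))  # 7
    print(recv_or_report(parent_end))  # peer died
```

The EOFError on the parent side only says that the child is gone; the actual cause has to be looked up in the child's own output, such as the exception printed by the device process.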


Device gpuX proc starting up, pid 7360
Device gpuX proc: THEANO_FLAGS = 'compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,floatX=float32,force_device=True'
Device train-network: Used data keys: ['classes', 'data', 'sizes']
using adam with nag and momentum schedule
Device gpuX proc exception: ('The following error happened while compiling the node', GpuDnnSoftmax{tensor_format='bc01', mode='channel', algo='accurate'}(GpuContiguous.0), '\n', 'nvcc return status', 1, 'for cmd', 'nvcc -shared -O3 -Xlinker -rpath,/usr/local/cuda/lib64 -arch=sm_37 -m64 -Xcompiler -fno-math-errno,-Wno-unused-label,-Wno-unused-variable,-Wno-write-strings,-DCUDA_NDARRAY_CUH=mc72d035fdf91890f3b36710688069b2e,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC,-fvisibility=hidden -Xlinker -rpath,/home/ubuntu/.theano/compiledir_Linux-4.4.0-1052-aws-x86_64-with-debian-stretch-sid-x86_64-3.6.3-64--dev-gpuZ/cuda_ndarray -I/home/ubuntu/.theano/compiledir_Linux-4.4.0-1052-aws-x86_64-with-debian-stretch-sid-x86_64-3.6.3-64--dev-gpuZ/cuda_ndarray -I/usr/local/cuda/include -I/home/ubuntu/anaconda3/lib/python3.6/site-packages/theano/sandbox/cuda -I/home/ubuntu/anaconda3/lib/python3.6/site-packages/numpy/core/include -I/home/ubuntu/anaconda3/include/python3.6m -I/home/ubuntu/anaconda3/lib/python3.6/site-packages/theano/gof -L/home/ubuntu/.theano/compiledir_Linux-4.4.0-1052-aws-x86_64-with-debian-stretch-sid-x86_64-3.6.3-64--dev-gpuZ/cuda_ndarray -L/home/ubuntu/anaconda3/lib -o /home/ubuntu/.theano/compiledir_Linux-4.4.0-1052-aws-x86_64-with-debian-stretch-sid-x86_64-3.6.3-64--dev-gpuZ/tmp52h_i3it/m3084ad5093769045f45143c157756096.so mod.cu -lcudart -lcublas -lcuda_ndarray -lcudnn -lpython3.6m', "[GpuDnnSoftmax{tensor_format='bc01', mode='channel', algo='accurate'}(<CudaNdarrayType(float32, (False, False, True, True))>)]")
Device proc gpuX (gpuZ) died: ProcConnectionDied('recv_bytes EOFError: ',)
Theano flags: compiledir_format=compiledir_%(platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s--dev-gpuZ,device=gpu,floatX=float32,force_device=True


Keyword Spotting in RETURNN

First, I want to thank you for the RETURNN framework.

My application is to use the trained model for Keyword Spotting in text images.

Does this framework support such an application? If not, is there any way to implement it?

Thank you in advance.

Profiling tensorflow network

Hey!

I am currently struggling with profiling a network.
It is possible to profile a TensorFlow network using TensorBoard via RunMetadata (see here).
I searched RETURNN for this feature but couldn't find it.
Is there a similar way to profile individual components (e.g. layers) of a RETURNN TensorFlow network?

Best,
Peter

Lexicon and Rescoring

Using RETURNN, is it possible to perform decoding with a lexicon?
And can we use an n-gram LM for rescoring n-best hypotheses?
Thanks
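
For readers unfamiliar with the term, n-best rescoring in general can be sketched in a few lines of plain Python. This is only an illustration of the idea and does not use or describe any RETURNN API: the function names, the toy bigram probabilities, the fallback value for unseen bigrams, and the `lm_scale` weight are all assumptions made up for the example.

```python
import math


def rescore_nbest(nbest, lm_logprob, lm_scale=0.5):
    """Re-rank (words, model_log_score) pairs using an external LM.

    The combined score is: model_log_score + lm_scale * lm_logprob(words).
    Returns the hypotheses sorted from best to worst combined score.
    """
    rescored = [
        (words, model_score + lm_scale * lm_logprob(words))
        for words, model_score in nbest
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Toy bigram LM with hypothetical probabilities (illustration only).
BIGRAM_LOGPROBS = {
    ("<s>", "recognize"): math.log(0.4),
    ("recognize", "speech"): math.log(0.5),
    ("<s>", "wreck"): math.log(0.1),
    ("wreck", "a"): math.log(0.3),
    ("a", "nice"): math.log(0.2),
    ("nice", "beach"): math.log(0.1),
}
UNSEEN_LOGPROB = math.log(1e-4)  # assumed fallback for unseen bigrams


def bigram_logprob(words):
    """Sum bigram log-probabilities over the sentence, starting from <s>."""
    history = ["<s>"] + list(words[:-1])
    return sum(
        BIGRAM_LOGPROBS.get((prev, cur), UNSEEN_LOGPROB)
        for prev, cur in zip(history, words)
    )


# Two hypotheses with made-up model log scores; the first is ranked
# higher by the model alone, the second wins after adding the LM score.
nbest = [
    (["wreck", "a", "nice", "beach"], -4.0),
    (["recognize", "speech"], -4.5),
]
best_words, best_score = rescore_nbest(nbest, bigram_logprob)[0]
print(" ".join(best_words), best_score)
```

In practice the LM scale is usually tuned on a development set, and the n-gram LM would come from a dedicated toolkit rather than a hand-written dictionary.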

Cost going negative after epoch 12

Hi,
I am trying to train the model on a different dataset. After epoch 12 the cost becomes negative and keeps decreasing afterwards, while the accuracy increases. I am not sure whether this is fine. Do we need to reduce the learning rate to avoid this behavior, or should we continue with the training?
