
opennmt-py's Issues

How to train without a GPU?

I get:

$ python train.py -data data/demo -save_model demo-model
...
Loading train dataset from data/demo.train.1.pt, number of examples: 10000
Traceback (most recent call last):
  File "train.py", line 41, in <module>
    main(opt)
  File "train.py", line 28, in main
    single_main(opt)
  File "/toknas/hugh/git/OpenNMT-Ubiqus/train_single.py", line 120, in main
    opt.valid_steps)
  File "/toknas/hugh/git/OpenNMT-Ubiqus/onmt/trainer.py", line 142, in train
    if (i % self.n_gpu == self.gpu_rank):
ZeroDivisionError: integer division or modulo by zero
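
For reference, the ZeroDivisionError comes from the quoted sharding test `i % self.n_gpu == self.gpu_rank`, which cannot work when no GPU is configured. Below is a minimal standalone sketch (not the actual OpenNMT-py code) of how that test could be guarded, assuming n_gpu is 0 for CPU-only training:

def batch_is_mine(step_index, n_gpu, gpu_rank):
    """Hypothetical helper mirroring the sharding test in onmt/trainer.py.

    Assumption: n_gpu is 0 for CPU-only training, in which case the single
    worker should process every batch instead of computing `step_index % 0`.
    """
    if n_gpu == 0:
        return True
    return step_index % n_gpu == gpu_rank

# CPU-only: one worker sees every batch.
assert all(batch_is_mine(i, n_gpu=0, gpu_rank=0) for i in range(4))
# Two GPUs: batches alternate between rank 0 and rank 1.
assert [batch_is_mine(i, n_gpu=2, gpu_rank=0) for i in range(4)] == [True, False, True, False]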

Bug: coverage_attn in multi-GPU mode

Could you please help me with how to use the new GPU options? I used the flags -gpuid 0 1 -gpu_verbose 0 -gpu_rank 0 for the training script, which resulted in the following error:

Traceback (most recent call last):
  File "/data/projects/opennmt-ubiqus/train_multi.py", line 43, in run
    single_main(opt)
  File "/data/projects/opennmt-ubiqus/train_single.py", line 120, in main
    opt.valid_steps)
  File "/data/projects/opennmt-ubiqus/onmt/trainer.py", line 143, in train
    if self.gpu_verbose > 1:
TypeError: unorderable types: list() > int()

Here's the output from nvidia-smi, just in case.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000A20E:00:00.0 Off |                    0 |
| N/A   70C    P0    65W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000C0B5:00:00.0 Off |                    0 |
| N/A   38C    P0    72W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
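
The TypeError suggests that the value of -gpu_verbose reached trainer.py as a Python list rather than a single int, so the comparison `self.gpu_verbose > 1` fails. As a purely illustrative sketch (the option handling here is an assumption, not the fork's actual code), one way to normalize such an argparse value before the comparison:

def as_scalar(opt_value):
    """Collapse a one-element list (as argparse produces with nargs) into an int."""
    if isinstance(opt_value, (list, tuple)):
        if len(opt_value) != 1:
            raise ValueError("expected a single value, got %r" % (opt_value,))
        return int(opt_value[0])
    return int(opt_value)

print(as_scalar([0]) > 1)  # False -- the comparison is now well defined
print(as_scalar(2) > 1)    # True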

Training crashes at end

I set up a conda environment with PyTorch 0.4 and installed this fork of OpenNMT-py, as I mentioned in the original issue I had posted here: OpenNMT#743

I then ran:

python preprocess.py -train_src C:\src\torchevere-offensive-classifier\training\character\train_src.txt -train_tgt C:\src\torchevere-offensive-classifier\training\character\train_dst.txt -valid_src C:\src\torchevere-offensive-classifier\training\character\val_src.txt -valid_tgt C:\src\torchevere-offensive-classifier\training\character\val_dst.txt -save_data data/character/tc-offense-classifier-character_v3 -src_seq_length 5000 -tgt_seq_length 5000

python train.py -data data/character/tc-offense-classifier-character_v3 -save_model tc-offense-classifier-character_v3 -gpuid 0 -layers 3 -learning_rate_decay 0.99 -train_steps 10000 -rnn_size 500

Everything ran smoothly until it crashed at the end.
Here's the output:

(pyTorchOffensive) C:\src\pyopennmt\ubiqus\OpenNMT-py>python train.py -data data/character/tc-offense-classifier-character_v3 -save_model tc-offense-classifier-character_v3 -gpuid 0 -layers 3 -learning_rate_decay 0.99 -train_steps 10000 -rnn_size 500
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		 * vocabulary size. source = 165; target = 6
		Building model...
		Intializing model parameters.
		NMTModel(
		  (encoder): RNNEncoder(
		    (embeddings): Embeddings(
		      (make_embedding): Sequential(
		        (emb_luts): Elementwise(
		          (0): Embedding(165, 500, padding_idx=1)
		        )
		      )
		    )
		    (rnn): LSTM(500, 500, num_layers=3, dropout=0.3)
		  )
		  (decoder): InputFeedRNNDecoder(
		    (embeddings): Embeddings(
		      (make_embedding): Sequential(
		        (emb_luts): Elementwise(
		          (0): Embedding(6, 500, padding_idx=1)
		        )
		      )
		    )
		    (dropout): Dropout(p=0.3)
		    (rnn): StackedLSTM(
		      (dropout): Dropout(p=0.3)
		      (layers): ModuleList(
		        (0): LSTMCell(1000, 500)
		        (1): LSTMCell(500, 500)
		        (2): LSTMCell(500, 500)
		      )
		    )
		    (attn): GlobalAttention(
		      (linear_in): Linear(in_features=500, out_features=500, bias=False)
		      (linear_out): Linear(in_features=1000, out_features=500, bias=False)
		      (softmax): Softmax()
		      (tanh): Tanh()
		    )
		  )
		  (generator): Sequential(
		    (0): Linear(in_features=500, out_features=6, bias=True)
		    (1): LogSoftmax()
		  )
		)
		* number of parameters: 13862506
		encoder:  6094500
		decoder:  7768006
		Making optimizer for training.
		
		Start training...
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		Step 50, 10000; acc:  27.34; ppl:  16.07; xent:   2.78; lr: 1.00000; 14099 / 3200 tok/s;      4 sec
		GPU 0: for information we completed an epoch at step 54
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		Step 100, 10000; acc:  74.22; ppl:   1.45; xent:   0.37; lr: 1.00000; 26876 / 2614 tok/s;      6 sec
		GPU 0: for information we completed an epoch at step 107
		
		. . . 
		
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		Step 9950, 10000; acc: 100.00; ppl:   1.00; xent:   0.00; lr: 1.00000; 16583 / 2462 tok/s;    616 sec
		GPU 0: for information we completed an epoch at step 9965
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		Step 10000, 10000; acc: 100.00; ppl:   1.00; xent:   0.00; lr: 1.00000; 19345 / 2416 tok/s;    619 sec
		Loading valid dataset from data/character\tc-offense-classifier-character_v3.valid.1.pt, number of examples: 376
		Traceback (most recent call last):
		  File "train.py", line 41, in <module>
		    main(opt)
		  File "train.py", line 28, in main
		    single_main(opt)
		  File "C:\src\pyopennmt\ubiqus\OpenNMT-py\train_single.py", line 120, in main
		    opt.valid_steps)
		  File "C:\src\pyopennmt\ubiqus\OpenNMT-py\onmt\trainer.py", line 176, in train
		    valid_stats = self.validate(valid_iter)
		  File "C:\src\pyopennmt\ubiqus\OpenNMT-py\onmt\trainer.py", line 208, in validate
		    for batch in valid_iter:
		  File "C:\src\pyopennmt\ubiqus\OpenNMT-py\onmt\inputters\inputter.py", line 423, in __iter__
		    for batch in self.cur_iter:
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\iterator.py", line 151, in __iter__
		    self.train)
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\batch.py", line 27, in __init__
		    setattr(self, name, field.process(batch, device=device, train=train))
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\field.py", line 188, in process
		    tensor = self.numericalize(padded, device=device, train=train)
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\field.py", line 287, in numericalize
		    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\field.py", line 287, in <listcomp>
		    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\field.py", line 287, in <listcomp>
		    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
		KeyError: '๐Ÿป'

Luckily, a model checkpoint had still been saved partway through training (at the halfway point of the specified number of training steps), and I was able to use it with translate.py. Thankfully, that also resolved the issue of blank lines in the output file that I mentioned in the original issue (OpenNMT#743).

However, having training crash at the very end of the process is still a problem, so I'm reporting this bug.
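
For context, the KeyError is raised while numericalizing a validation batch: a character that occurs in the validation file but never in the training file has no entry in the vocabulary. A minimal workaround sketch (not part of OpenNMT-py; the file names are placeholders for the -train_src/-valid_src files used above) that strips such characters from the validation data before preprocessing:

def build_charset(path):
    """Collect the set of characters seen in a character-level data file."""
    with open(path, encoding="utf-8") as f:
        return set(f.read())

def filter_unknown_chars(train_path, valid_path, out_path):
    """Write a copy of valid_path containing only characters seen in train_path."""
    known = build_charset(train_path)
    with open(valid_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write("".join(ch for ch in line if ch in known or ch == "\n"))

filter_unknown_chars("train_src.txt", "val_src.txt", "val_src.filtered.txt")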

'ZeroDivisionError: division by zero' while training

Hello everyone,
I am very new to opennmt-py. I am trying to run this model for a summarization task.
I am training the model on Google Colab, using cnndm as the dataset.
The training dataset has 50,000 lines, and the validation and test datasets have 5,000 lines each.
The number of train_steps is 12,000.
When training reached step 10,000, I got the following error.

[2018-10-10 13:20:14,520 INFO] Step 10000/12000; acc: 99.76; ppl: 1.00; xent: 0.00; lr: 0.15000; 45/ 40 tok/s; 37616 sec
[2018-10-10 13:20:16,528 INFO] Loading valid dataset from data/cnndm/CNNDM.valid.1.pt, number of examples: 0
Traceback (most recent call last):
  File "train.py", line 40, in <module>
    main(opt)
  File "train.py", line 27, in main
    single_main(opt)
  File "/content/gdrive/My Drive/OpenNMT-py/onmt/train_single.py", line 133, in main
    opt.valid_steps)
  File "/content/gdrive/My Drive/OpenNMT-py/onmt/trainer.py", line 196, in train
    step, valid_stats=valid_stats)
  File "/content/gdrive/My Drive/OpenNMT-py/onmt/trainer.py", line 356, in _report_step
    valid_stats=valid_stats)
  File "/content/gdrive/My Drive/OpenNMT-py/onmt/utils/report_manager.py", line 91, in report_step
    lr, step, train_stats=train_stats, valid_stats=valid_stats)
  File "/content/gdrive/My Drive/OpenNMT-py/onmt/utils/report_manager.py", line 147, in _report_step
    self.log('Validation perplexity: %g' % valid_stats.ppl())
  File "/content/gdrive/My Drive/OpenNMT-py/onmt/utils/statistics.py", line 97, in ppl
    return math.exp(min(self.loss / self.n_words, 100))
ZeroDivisionError: division by zero

Can anyone please help me figure out what I am doing wrong and why I am getting this error? Do I need a larger training dataset?
Looking forward to your help.

Thanks in advance.
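
One thing worth noting from the log above: the validation shard is loaded with "number of examples: 0", and with no validation tokens Statistics.n_words is 0, which is exactly the division that fails in ppl(). A small diagnostic sketch (assuming torch is installed and the script is run from the OpenNMT-py checkout, so the pickled dataset classes can be resolved) to confirm the shard really is empty:

import torch

# Assumption: run from the OpenNMT-py directory so torch.load can unpickle the
# onmt.inputters dataset classes stored in the .pt shard.
valid_shard = torch.load("data/cnndm/CNNDM.valid.1.pt")
print(len(valid_shard.examples))
# 0 means preprocessing kept no validation examples (for instance because the
# -src_seq_length / -tgt_seq_length limits filtered every pair), so the fix is
# in the preprocessing step rather than in training.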

TypeError: __init__() got an unexpected keyword argument 'dtype'

I have already installed the dependencies properly, but when I run preprocess.py it returns an error:

Traceback (most recent call last):
  File "preprocess.py", line 218, in <module>
    main()
  File "preprocess.py", line 205, in main
    fields = inputters.get_fields(opt.data_type, src_nfeats, tgt_nfeats)
  File "/opt/conda/lib/python3.6/site-packages/OpenNMT_py-0.4-py3.6.egg/onmt/inputters/inputter.py", line 46, in get_fields
    return TextDataset.get_fields(n_src_features, n_tgt_features)
  File "/opt/conda/lib/python3.6/site-packages/OpenNMT_py-0.4-py3.6.egg/onmt/inputters/text_dataset.py", line 244, in get_fields
    postprocessing=make_src, sequential=False)
TypeError: __init__() got an unexpected keyword argument 'dtype'

But everything works fine when I run the OpenNMT master branch on the same dataset in another Docker container.
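
For what it's worth, this TypeError usually points at a torchtext version mismatch: the dtype keyword on Field only exists in newer torchtext releases (the exact version boundary is my assumption, not something stated by the error). A quick check sketch to run in the container that fails:

import inspect
import torchtext

print(torchtext.__version__)
# True means this torchtext accepts Field(dtype=...); False reproduces the error above.
print("dtype" in inspect.signature(torchtext.data.Field.__init__).parameters)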

Still getting blanks in translate.py output

This is a carry-over from: OpenNMT#743

When I run translate.py on a model that was trained to completion via the steps in #4, I'm getting blanks in the output even for small input files, but only some of the predictions are blank. I also tried running:

python translate.py -model C:\src\pyopennmt\ubiqus\OpenNMT-py\tc-offense-classifier-character_v5_step_1000000.pt -src C:\src\torchevere-offensive-classifier\test2.txt -v

so that I could see each prediction in the console, and many of the predictions are still blank. Any idea what could be causing this?
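
In case it helps narrow things down, here is a small diagnostic sketch that lists which input lines come back empty (file names are the source file from the command above plus pred.txt, which I believe is translate.py's default -output; adjust if yours differs):

# Pair each source line with its prediction and report the empty ones.
with open("test2.txt", encoding="utf-8") as src, \
        open("pred.txt", encoding="utf-8") as hyp:
    for i, (s, h) in enumerate(zip(src, hyp), start=1):
        if not h.strip():
            print("line %d produced an empty prediction: %r" % (i, s.strip()))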
