opennmt-py's People

Contributors

adamlerer, adrianjav, apaszke, bmccann, bpopeters, chenbeh, colesbury, da03, francoishernandez, guillaumekln, gwenniger, helson73, henry-e, jianyuzhan, jsenellart, justinchiu, mattiadg, meocong, pltrdy, scarletpan, sebastiangehrmann, soumith, srush, taolei87, tayciryahmed, thammegowda, vince62s, waino, wjbianjason, xutaima

opennmt-py's Issues

'ZeroDivisionError: division by zero' while training

Hello everyone,
I am very new to opennmt-py. I am trying to run this model for a summarization task.
I am training the model on Google Colab, using CNNDM as the dataset.
I use 50000 lines for the training dataset and 5000 for the validation and testing datasets.
The number of train_steps is 12000.
When training reached step 10000, I got the following error.

[2018-10-10 13:20:14,520 INFO] Step 10000/12000; acc: 99.76; ppl: 1.00; xent: 0.00; lr: 0.15000; 45/ 40 tok/s; 37616 sec
[2018-10-10 13:20:16,528 INFO] Loading valid dataset from data/cnndm/CNNDM.valid.1.pt, number of examples: 0
Traceback (most recent call last):
  File "train.py", line 40, in <module>
    main(opt)
  File "train.py", line 27, in main
    single_main(opt)
  File "/content/gdrive/My Drive/OpenNMT-py/onmt/train_single.py", line 133, in main
    opt.valid_steps)
  File "/content/gdrive/My Drive/OpenNMT-py/onmt/trainer.py", line 196, in train
    step, valid_stats=valid_stats)
  File "/content/gdrive/My Drive/OpenNMT-py/onmt/trainer.py", line 356, in _report_step
    valid_stats=valid_stats)
  File "/content/gdrive/My Drive/OpenNMT-py/onmt/utils/report_manager.py", line 91, in report_step
    lr, step, train_stats=train_stats, valid_stats=valid_stats)
  File "/content/gdrive/My Drive/OpenNMT-py/onmt/utils/report_manager.py", line 147, in _report_step
    self.log('Validation perplexity: %g' % valid_stats.ppl())
  File "/content/gdrive/My Drive/OpenNMT-py/onmt/utils/statistics.py", line 97, in ppl
    return math.exp(min(self.loss / self.n_words, 100))
ZeroDivisionError: division by zero

Can anyone please help me understand what I am doing wrong and why I am getting this error? Do I need to increase the size of my training dataset?
Looking forward to your help.

Thanks in advance
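
The log above reports "number of examples: 0" for the validation shard, so ppl() is reached with n_words equal to 0 and the division in statistics.py fails. A minimal sketch of a guarded perplexity calculation follows. safe_ppl is a hypothetical helper, not part of OpenNMT-py, and returning NaN for an empty validation set is an assumption about the desired behavior. The formula itself is copied from the traceback.

```python
import math


def safe_ppl(loss: float, n_words: int) -> float:
    """Perplexity as in the traceback, but tolerant of an empty dataset.

    Hypothetical helper for illustration only; not OpenNMT-py's API.
    """
    if n_words == 0:
        # No target words were scored, so perplexity is undefined.
        return float("nan")
    # Same expression as statistics.py line 97 in the traceback.
    return math.exp(min(loss / n_words, 100))
```

A guard like this only hides the symptom. The empty validation shard itself would still need to be investigated, for example by checking the validation files passed to preprocess.py.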

Training crashes at end

I set up a conda environment with PyTorch 0.4 and installed this fork of OpenNMT-py, as I mentioned in the original issue I posted here: OpenNMT#743

I then ran:

python preprocess.py -train_src C:\src\torchevere-offensive-classifier\training\character\train_src.txt -train_tgt C:\src\torchevere-offensive-classifier\training\character\train_dst.txt -valid_src C:\src\torchevere-offensive-classifier\training\character\val_src.txt -valid_tgt C:\src\torchevere-offensive-classifier\training\character\val_dst.txt -save_data data/character/tc-offense-classifier-character_v3 -src_seq_length 5000 -tgt_seq_length 5000

python train.py -data data/character/tc-offense-classifier-character_v3 -save_model tc-offense-classifier-character_v3 -gpuid 0 -layers 3 -learning_rate_decay 0.99 -train_steps 10000 -rnn_size 500

Everything ran smoothly until it crashed at the end.
Here's the output:

(pyTorchOffensive) C:\src\pyopennmt\ubiqus\OpenNMT-py>python train.py -data data/character/tc-offense-classifier-character_v3 -save_model tc-offense-classifier-character_v3 -gpuid 0 -layers 3 -learning_rate_decay 0.99 -train_steps 10000 -rnn_size 500
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		 * vocabulary size. source = 165; target = 6
		Building model...
		Intializing model parameters.
		NMTModel(
		  (encoder): RNNEncoder(
		    (embeddings): Embeddings(
		      (make_embedding): Sequential(
		        (emb_luts): Elementwise(
		          (0): Embedding(165, 500, padding_idx=1)
		        )
		      )
		    )
		    (rnn): LSTM(500, 500, num_layers=3, dropout=0.3)
		  )
		  (decoder): InputFeedRNNDecoder(
		    (embeddings): Embeddings(
		      (make_embedding): Sequential(
		        (emb_luts): Elementwise(
		          (0): Embedding(6, 500, padding_idx=1)
		        )
		      )
		    )
		    (dropout): Dropout(p=0.3)
		    (rnn): StackedLSTM(
		      (dropout): Dropout(p=0.3)
		      (layers): ModuleList(
		        (0): LSTMCell(1000, 500)
		        (1): LSTMCell(500, 500)
		        (2): LSTMCell(500, 500)
		      )
		    )
		    (attn): GlobalAttention(
		      (linear_in): Linear(in_features=500, out_features=500, bias=False)
		      (linear_out): Linear(in_features=1000, out_features=500, bias=False)
		      (softmax): Softmax()
		      (tanh): Tanh()
		    )
		  )
		  (generator): Sequential(
		    (0): Linear(in_features=500, out_features=6, bias=True)
		    (1): LogSoftmax()
		  )
		)
		* number of parameters: 13862506
		encoder:  6094500
		decoder:  7768006
		Making optimizer for training.
		
		Start training...
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		Step 50, 10000; acc:  27.34; ppl:  16.07; xent:   2.78; lr: 1.00000; 14099 / 3200 tok/s;      4 sec
		GPU 0: for information we completed an epoch at step 54
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		Step 100, 10000; acc:  74.22; ppl:   1.45; xent:   0.37; lr: 1.00000; 26876 / 2614 tok/s;      6 sec
		GPU 0: for information we completed an epoch at step 107
		
		. . . 
		
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		Step 9950, 10000; acc: 100.00; ppl:   1.00; xent:   0.00; lr: 1.00000; 16583 / 2462 tok/s;    616 sec
		GPU 0: for information we completed an epoch at step 9965
		Loading train dataset from data/character\tc-offense-classifier-character_v3.train.1.pt, number of examples: 3384
		Step 10000, 10000; acc: 100.00; ppl:   1.00; xent:   0.00; lr: 1.00000; 19345 / 2416 tok/s;    619 sec
		Loading valid dataset from data/character\tc-offense-classifier-character_v3.valid.1.pt, number of examples: 376
		Traceback (most recent call last):
		  File "train.py", line 41, in <module>
		    main(opt)
		  File "train.py", line 28, in main
		    single_main(opt)
		  File "C:\src\pyopennmt\ubiqus\OpenNMT-py\train_single.py", line 120, in main
		    opt.valid_steps)
		  File "C:\src\pyopennmt\ubiqus\OpenNMT-py\onmt\trainer.py", line 176, in train
		    valid_stats = self.validate(valid_iter)
		  File "C:\src\pyopennmt\ubiqus\OpenNMT-py\onmt\trainer.py", line 208, in validate
		    for batch in valid_iter:
		  File "C:\src\pyopennmt\ubiqus\OpenNMT-py\onmt\inputters\inputter.py", line 423, in __iter__
		    for batch in self.cur_iter:
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\iterator.py", line 151, in __iter__
		    self.train)
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\batch.py", line 27, in __init__
		    setattr(self, name, field.process(batch, device=device, train=train))
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\field.py", line 188, in process
		    tensor = self.numericalize(padded, device=device, train=train)
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\field.py", line 287, in numericalize
		    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\field.py", line 287, in <listcomp>
		    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
		  File "C:\Anaconda3\envs\pyTorchOffensive\lib\site-packages\torchtext\data\field.py", line 287, in <listcomp>
		    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
		KeyError: '๐Ÿป'

Luckily, a model file had been saved midway through training (halfway through the specified number of training steps), and I was able to use it with translate.py. Thankfully, it solved the issue of blanks in the output file that I mentioned in the original issue (OpenNMT#743).

However, it's still an issue to have the training crash at the end of the process, so I'm reporting this bug.
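
The traceback ends in a plain lookup, self.vocab.stoi[x], on a validation token that the lookup table does not contain. One way to spot such tokens before training is to compare the training and validation files directly. The sketch below assumes whitespace-separated tokens. find_unseen_tokens is a hypothetical helper, not part of OpenNMT-py or torchtext.

```python
def find_unseen_tokens(train_lines, valid_lines):
    """Return validation tokens that never occur in the training lines.

    Hypothetical helper; assumes tokens are separated by whitespace.
    """
    train_vocab = {tok for line in train_lines for tok in line.split()}
    valid_vocab = {tok for line in valid_lines for tok in line.split()}
    # Sorted so the report is stable from run to run.
    return sorted(valid_vocab - train_vocab)


if __name__ == "__main__":
    # Tiny in-memory example; real use would read the src/tgt text files.
    print(find_unseen_tokens(["a b c"], ["a d", "e b"]))
```

Any token it reports is a candidate for the KeyError seen during validation.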

bug coverage_attn in multi-gpu mode

Could you please help me with how to use the new GPU options? I used the flags -gpuid 0 1 -gpu_verbose 0 -gpu_rank 0 for the training script, which resulted in the following error:

Traceback (most recent call last):
  File "/data/projects/opennmt-ubiqus/train_multi.py", line 43, in run
    single_main(opt)
  File "/data/projects/opennmt-ubiqus/train_single.py", line 120, in main
    opt.valid_steps)
  File "/data/projects/opennmt-ubiqus/onmt/trainer.py", line 143, in train
    if self.gpu_verbose > 1:
TypeError: unorderable types: list() > int()

Here's the output from nvidia-smi, just in case.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000A20E:00:00.0 Off |                    0 |
| N/A   70C    P0    65W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000C0B5:00:00.0 Off |                    0 |
| N/A   38C    P0    72W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
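
The TypeError indicates that self.gpu_verbose holds a list when trainer.py compares it with an integer, and Python 3 defines no ordering between a list and an int. The traceback does not show why the option arrives as a list. As a stopgap sketch, a value of either shape could be coerced to a single int before the comparison. as_verbosity_level is a hypothetical helper, and taking the first list element is an assumption.

```python
def as_verbosity_level(value):
    """Coerce an option value that may be a list into one int.

    Hypothetical helper; not part of OpenNMT-py.
    """
    if isinstance(value, (list, tuple)):
        # Assumption: the first element carries the intended level,
        # and an empty list means "not verbose".
        value = value[0] if value else 0
    return int(value)
```

With this in place, a check such as as_verbosity_level(opt_value) > 1 would compare two integers regardless of how the flag was parsed.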

Still getting blanks in translate.py output

This is a carry-over from: OpenNMT#743

When I run translate.py on a model that was trained to completion via the steps here: #4

I'm getting blanks in the output even for small input files, but only some of the predictions are blank. I also tried running:

python translate.py -model C:\src\pyopennmt\ubiqus\OpenNMT-py\tc-offense-classifier-character_v5_step_1000000.pt -src C:\src\torchevere-offensive-classifier\test2.txt -v

so that I could see each prediction in the console, and many of the prediction values are still blank. Any idea what could be causing this?

How to train without gpu?

I get:

$ python train.py -data data/demo -save_model demo-model
...
Loading train dataset from data/demo.train.1.pt, number of examples: 10000
Traceback (most recent call last):
  File "train.py", line 41, in <module>
    main(opt)
  File "train.py", line 28, in main
    single_main(opt)
  File "/toknas/hugh/git/OpenNMT-Ubiqus/train_single.py", line 120, in main
    opt.valid_steps)
  File "/toknas/hugh/git/OpenNMT-Ubiqus/onmt/trainer.py", line 142, in train
    if (i % self.n_gpu == self.gpu_rank):
ZeroDivisionError: integer division or modulo by zero
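
The failing line computes i % self.n_gpu. With no GPU flag on the command line, n_gpu is presumably 0, which makes the modulo undefined. A minimal sketch of a guard follows. handles_batch is a hypothetical helper, not part of the trainer, and treating n_gpu == 0 as "this single process handles every batch" is an assumption.

```python
def handles_batch(i: int, n_gpu: int, gpu_rank: int) -> bool:
    """Decide whether this process should handle batch number i.

    Hypothetical helper; mirrors the condition on trainer.py line 142
    from the traceback, with a guard for CPU-only runs.
    """
    if n_gpu == 0:
        # Assumption: CPU-only training processes every batch itself.
        return True
    return i % n_gpu == gpu_rank
```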

TypeError: __init__() got an unexpected keyword argument 'dtype'

I have already installed the dependencies properly. But when I run preprocess.py, it returns an error:

Traceback (most recent call last):
  File "preprocess.py", line 218, in <module>
    main()
  File "preprocess.py", line 205, in main
    fields = inputters.get_fields(opt.data_type, src_nfeats, tgt_nfeats)
  File "/opt/conda/lib/python3.6/site-packages/OpenNMT_py-0.4-py3.6.egg/onmt/inputters/inputter.py", line 46, in get_fields
    return TextDataset.get_fields(n_src_features, n_tgt_features)
  File "/opt/conda/lib/python3.6/site-packages/OpenNMT_py-0.4-py3.6.egg/onmt/inputters/text_dataset.py", line 244, in get_fields
    postprocessing=make_src, sequential=False)
TypeError: __init__() got an unexpected keyword argument 'dtype'

But everything works fine when I run OpenNMT master on the same dataset in another Docker container.
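
An "unexpected keyword argument" error of this kind usually means the installed version of a dependency has a constructor that does not know the keyword the caller passes. The traceback does not name the class whose __init__ rejected 'dtype', so the sketch below is generic. accepts_keyword is a hypothetical helper, and OldStyle and NewStyle are stand-in classes, not real library types.

```python
import inspect


def accepts_keyword(func, name: str) -> bool:
    """Report whether a callable can be called with keyword `name`.

    Hypothetical diagnostic helper built on inspect.signature.
    """
    params = inspect.signature(func).parameters
    if name in params:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


class OldStyle:
    # Stand-in for a constructor that predates the 'dtype' keyword.
    def __init__(self, tensor_type=None):
        self.tensor_type = tensor_type


class NewStyle:
    # Stand-in for a constructor that accepts 'dtype'.
    def __init__(self, dtype=None):
        self.dtype = dtype


if __name__ == "__main__":
    print(accepts_keyword(OldStyle, "dtype"), accepts_keyword(NewStyle, "dtype"))
```

Running the same check against the real constructor in both Docker environments would show whether the two installs differ.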
