
Comments (11)

orhanf commented on September 28, 2024

I guess you are trying English to French, is there a chance that the data and dictionaries are not matching?

from dl4mt-tutorial.

arvind2505 commented on September 28, 2024

I'm trying French to English. In preprocess.sh, I set S=fr and T=en.
And in nmt.py and train_nmt.py, I used the following:

datasets=[
    '../data/all_fr-en.en.tok',
    '../data/all_fr-en.fr.tok'
],
valid_datasets=[
    '../data/test2011/newstest2011.en',
    '../data/test2011/newstest2011.fr'
],
dictionaries=[
    '../data/all_fr-en.en.tok.bpe.pkl',
    '../data/all_fr-en.fr.tok.bpe.pkl'
]

I couldn't find .tok files for the test/validation set. Am I doing something wrong here?


chrishokamp commented on September 28, 2024

Are your datasets preprocessed with BPE subword encoding? From your file extensions it looks like your vocabulary uses BPE, but your datasets haven't been encoded.

orhanf commented on September 28, 2024

@chrishokamp I think you're right.

@arvind2505 your datasets should be:

datasets=[
    '../data/all_fr-en.en.tok.bpe.shuf',
    '../data/all_fr-en.fr.tok.bpe.shuf'
],

Also, from your setup, it seems like you're doing English to French.

For the validation datasets, please apply these steps from preprocess.sh:

  1. Tokenize
  2. Apply learned bpe
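Those two steps can be sketched as a toy Python pipeline. This is only an illustration of what preprocess.sh does with the Moses tokenizer.perl and subword-nmt's apply_bpe.py; the regex tokenizer and the merge list below are made-up stand-ins, not the real scripts or learned codes:

```python
import re

def tokenize(line):
    """Crude stand-in for Moses tokenizer.perl: split words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", line)

def bpe_segment(word, merges):
    """Greedily apply learned BPE merges in order; the '@@ ' suffix
    marks a subword that is not word-final (dl4mt convention)."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # merge the adjacent pair in place
            else:
                i += 1
    return "@@ ".join(symbols)

merges = [("t", "h"), ("th", "e")]           # hypothetical learned codes
line = "The answer, then."
print(" ".join(bpe_segment(t, merges) for t in tokenize(line.lower())))
```

A validation file would be processed line by line the same way, so its segmentation matches the training data and the .bpe.pkl dictionaries.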


arvind2505 commented on September 28, 2024

Thanks guys!
Here is the output after I changed the dataset paths, ran the tokenizer, and applied BPE on the newstest data.
Does this look OK?
{'use-dropout': [False], 'dim': [1024], 'optimizer': ['adadelta'], 'dim_word': [512], 'reload': [True], 'clip-c': [1.0], 'n-words': [30000], 'model': ['model_hal.npz'], 'learning-rate': [0.0001], 'decay-c': [0.0]}
Reloading model options
Loading data
Building model
Reloading model parameters
/share/apps/python/lib/python2.7/site-packages/theano/scan_module/scan.py:1019: Warning: In the strict mode, all neccessary shared variables must be passed as a part of non_sequences
'must be passed as a part of non_sequences', Warning)
Building sampler
Building f_init... Done
Building f_next.. Done
Building f_log_probs... Done
Building f_cost... Done
Computing gradient... Done
Building optimizers... Done
Optimization
Minibatch with zero sample under length 50
Epoch 0 Update 510 Cost 76.1549987793 UD 0.353935003281
Minibatch with zero sample under length 50
Minibatch with zero sample under length 50
Epoch 0 Update 520 Cost 160.68145752 UD 0.732856035233
Epoch 0 Update 530 Cost 86.3781051636 UD 0.623296022415
Minibatch with zero sample under length 50
Minibatch with zero sample under length 50
Minibatch with zero sample under length 50
Minibatch with zero sample under length 50
Epoch 0 Update 540 Cost 173.231338501 UD 0.775022983551
Minibatch with zero sample under length 50
Minibatch with zero sample under length 50
Epoch 0 Update 550 Cost 74.6405639648 UD 0.485258817673
Minibatch with zero sample under length 50
Epoch 0 Update 560 Cost 165.346435547 UD 0.714576005936
Minibatch with zero sample under length 50
Epoch 0 Update 570 Cost 142.101211548 UD 0.74903011322
Minibatch with zero sample under length 50
Minibatch with zero sample under length 50
Epoch 0 Update 580 Cost 171.009277344 UD 0.793890953064
Epoch 0 Update 590 Cost 140.376556396 UD 0.761693000793
Epoch 0 Update 600 Cost 152.642288208 UD 0.790089130402
Saving the best model... Done
Saving the model at iteration 600... Done
Source 0 : L@@ i@@ b@@ e@@ r@@ a@@ l@@ i@@ s@@ a@@ t@@ i@@ o@@ n p@@ r@@ e@@ s@@ e@@ n@@ t@@ s o@@ t@@ h@@ e@@ r d@@ i@@ f@@ f@@ i@@ c@@ u@@ l@@ t@@ i@@ e@@ s .
Truth 0 : L@@ a l@@ i@@ b@@ é@@ r@@ a@@ l@@ i@@ s@@ a@@ t@@ i@@ o@@ n p@@ r@@ é@@ s@@ e@@ n@@ t@@ e d &@@ a@@ p@@ o@@ s@@ ; a@@ u@@ t@@ r@@ e@@ s d@@ i@@ f@@ f@@ i@@ c@@ u@@ l@@ t@@ é@@ s .
Sample 0 : u@@ e a@@ u@@ r@@ o@@ f@@ r@@ r@@ e@@ e v@@ c@@ u@@ UNK i@@ i@@ r@@ d@@ i@@ e@@ &@@ r n &@@ o@@ u@@ t@@ a@@ e
Source 1 : T@@ h@@ e@@ s@@ e a@@ r@@ e t@@ h@@ e c@@ o@@ n@@ c@@ r@@ e@@ t@@ e a@@ n@@ s@@ w@@ e@@ r@@ s t@@ o t@@ h@@ e c@@ o@@ n@@ c@@ r@@ e@@ t@@ e q@@ u@@ e@@ s@@ t@@ i@@ o@@ n@@ s .
Truth 1 : V@@ o@@ i@@ l@@ à l@@ e@@ s r@@ é@@ p@@ o@@ n@@ s@@ e@@ s c@@ o@@ n@@ c@@ r@@ è@@ t@@ e@@ s a@@ u@@ x q@@ u@@ e@@ s@@ t@@ i@@ o@@ n@@ s c@@ o@@ n@@ c@@ r@@ è@@ t@@ e@@ s .
Sample 1 : r@@ s v@@ e a@@ d@@ v@@ u@@ n@@ I@@ e d@@ s@@ a@@ s l@@ m@@ s@@ i@@ i@@ s p@@ a@@ s r@@ e I@@ o@@ é@@ a@@
Source 2 : M@@ r C@@ a@@ s@@ a@@ c@@ a h@@ a@@ s t@@ h@@ e f@@ l@@ o@@ o@@ r f@@ o@@ r a s@@ u@@ p@@ p@@ l@@ e@@ m@@ e@@ n@@ t@@ a@@ r@@ y q@@ u@@ e@@ s@@ t@@ i@@ o@@ n .
Truth 2 : M@@ . C@@ a@@ s@@ a@@ c@@ a a l@@ a p@@ a@@ r@@ o@@ l@@ e p@@ o@@ u@@ r u@@ n@@ e q@@ u@@ e@@ s@@ t@@ i@@ o@@ n c@@ o@@ m@@ p@@ l@@ é@@ m@@ e@@ n@@ t@@ a@@ i@@ r@@ e .
Sample 2 : s@@ l@@ v@@ s UNK : p@@ r@@ e@@ l@@ s@@ ê@@ r@@ r@@
Source 3 : I w@@ e@@ l@@ c@@ o@@ m@@ e y@@ o@@ u h@@ e@@ r@@ e t@@ o P@@ a@@ r@@ l@@ i@@ a@@ m@@ e@@ n@@ t !
Truth 3 : N@@ o@@ u@@ s l@@ u@@ i s@@ o@@ u@@ h@@ a@@ i@@ t@@ o@@ n@@ s l@@ a b@@ i@@ e@@ n@@ v@@ e@@ n@@ u@@ e d@@ a@@ n@@ s c@@ e@@ t@@ t@@ e e@@ n@@ c@@ e@@ i@@ n@@ t@@ e .
Sample 3 : v@@ i@@ à - UNK o@@ p@@ r@@ a@@ i@@ o@@ e s@@ p@@
Source 4 : T@@ h@@ a@@ t i@@ s w@@ h@@ y t@@ h@@ i@@ s l@@ e@@ g@@ i@@ s@@ l@@ a@@ t@@ i@@ o@@ n h@@ a@@ s b@@ e@@ e@@ n p@@ u@@ t i@@ n p@@ l@@ a@@ c@@ e .
Truth 4 : C ’ e@@ s@@ t p@@ o@@ u@@ r@@ q@@ u@@ o@@ i c@@ e@@ t@@ t@@ e l@@ é@@ g@@ i@@ s@@ l@@ a@@ t@@ i@@ o@@ n a é@@ t@@ é m@@ i@@ s@@ e e@@ n p@@ l@@ a@@ c@@ e .
Sample 4 : m@@ s a@@ a@@ t@@ o@@ a@@ s@@ o@@ s@@ a@@ e .
32 samples computed
64 samples computed
[...]
464 samples computed
Valid 172.487


rsennrich commented on September 28, 2024

This looks like the BPE codes file is empty, giving you essentially a character-level segmentation. Check if you executed the learn_bpe.py script with a proper training set.
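The symptom is easy to reproduce: with an empty merge list, BPE segmentation degenerates to one symbol per character, which is exactly the p@@ r@@ e@@ s@@ ... pattern in the log above. A toy illustration, not the real apply_bpe.py:

```python
def bpe_segment(word, merges):
    """Greedily apply BPE merges; with no merges every character
    stays separate, i.e. character-level segmentation."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return "@@ ".join(symbols)

# Empty codes file -> character-level output, as in the log above.
print(bpe_segment("presents", []))
# With some (hypothetical) learned merges, real subwords appear.
print(bpe_segment("presents", [("p", "r"), ("pr", "e"), ("e", "n")]))
```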


arvind2505 commented on September 28, 2024

I see. My goal is to get a reasonable system for English to French. I removed the BPE part and ran the code directly on word tokens, and the output looks good now. Do you know if the BPE part affects the performance significantly? The output now looks like:
Using gpu device 0: Graphics Device (CNMeM is disabled)
{'use-dropout': [False], 'dim': [1024], 'optimizer': ['adadelta'], 'dim_word': [512], 'reload': [True], 'clip-c': [1.0], 'n-words': [30000], 'model': ['model_hal.npz'], 'learning-rate': [0.0001], 'decay-c': [0.0]}
Loading data
Building model
/share/apps/python/lib/python2.7/site-packages/theano/scan_module/scan.py:1019: Warning: In the strict mode, all neccessary shared variables must be passed as a part of non_sequences
'must be passed as a part of non_sequences', Warning)
Building sampler
Building f_init... Done
Building f_next.. Done
Building f_log_probs... Done
Building f_cost... Done
Computing gradient... Done
Building optimizers... Done
Optimization
Epoch 0 Update 10 Cost 242.930175781 UD 0.504160881042
Minibatch with zero sample under length 50
Epoch 0 Update 20 Cost 693.750854492 UD 0.700622081757
Epoch 0 Update 30 Cost 157.208435059 UD 0.45711684227
Minibatch with zero sample under length 50
Epoch 0 Update 40 Cost 290.630859375 UD 0.703430175781
Epoch 0 Update 50 Cost 123.121292114 UD 0.405668020248
Minibatch with zero sample under length 50
Epoch 0 Update 60 Cost 245.284179688 UD 0.621522903442
Epoch 0 Update 70 Cost 102.520629883 UD 0.365594148636
Minibatch with zero sample under length 50
Epoch 0 Update 80 Cost 214.226745605 UD 0.587069988251
Epoch 0 Update 90 Cost 59.2604370117 UD 0.292782068253
Minibatch with zero sample under length 50
Epoch 0 Update 100 Cost 177.20161438 UD 0.50200510025
Saving the best model... Done
Saving the model at iteration 100... Done
Source 0 : I voted against this proposal at first reading , but the governments of the Member States have now improved it .
Truth 0 : . - J ’ avais voté contre cette proposition en première lecture , mais les gouvernements des États membres l ’ ont à présent améliorée .
Sample 0 : l un , je en de hostilités ’ en
Source 1 : I am very pleased that many new members of the committee are here tonight to support me and to contribute to the debate .
Truth 1 : Je suis ravi de voir que bon nombre des nouveaux membres de la commission sont présents ce soir pour me soutenir et contribuer au débat .
Sample 1 : qu' Bien clos qu' aisance , , l' importante ,
Source 2 : Within many of the Member States , my own included , there has been a similar process , whereby elected representatives have given away their powers .
Truth 2 : Un processus similaire a eu lieu au sein de nombreux Etats membres , y compris le mien : les représentants élus ont cédé leurs pouvoirs .
Sample 2 : concède entend-elle silence essaient de a ) le nous le peut Mais ne , Présidente votés - les Bombay nous entre Parlement à droit sur positif comme , , politiques
Source 3 : If we are to restore consumer confidence , it is essential that we do all we possibly can at all levels .
Truth 3 : Afin de pouvoir restaurer la confiance des consommateurs , il est indispensable que nous fassions à tous les niveaux tout ce qui est humainement possible .
Sample 3 : pense des déjà le ne insoluble , existants à a En résumé niveau incapacités indulgent énergie un de M. européenne yougoslaves est que
Source 4 : But children are very capable of understanding what is positive content and what is UNK content .
Truth 4 : Mais les enfants sont tout à fait capables de comprendre la différence entre un contenu positif et un autre qui ne l' est pas .
Sample 4 : reposant la périmètre attaquions sonder , dit d' et , , est dans crise
32 samples computed
64 samples computed
96 samples computed
128 samples computed
[...]
2864 samples computed
Valid 191.142
Does this look OK?


rsennrich commented on September 28, 2024

With a basic word-level system (and without a backoff model), you won't get good translations of OOV source words, and you won't be able to produce any target words that are not in your vocabulary.

How much this affects your translation quality depends on the language pair and vocabulary size, but you'll definitely want some way of translating rare words, since they are often names and other content words that carry a lot of information.
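A tiny sketch of why this happens: once the dictionary is truncated to the n-words most frequent tokens, every rare source word collapses to the same UNK id. The build_vocab helper and the corpus here are illustrative, not code from the repo:

```python
from collections import Counter

def build_vocab(corpus, n_words):
    """Keep the n_words most frequent tokens; ids 0/1 are reserved
    for end-of-sentence and UNK, as in the dl4mt dictionaries."""
    counts = Counter(tok for line in corpus for tok in line.split())
    vocab = {"<eos>": 0, "UNK": 1}
    for tok, _ in counts.most_common(n_words - len(vocab)):
        vocab[tok] = len(vocab)
    return vocab

corpus = ["the cat sat", "the cat ran", "the cat naps"]
vocab = build_vocab(corpus, n_words=4)

def encode(line):
    return [vocab.get(tok, vocab["UNK"]) for tok in line.split()]

# "aardvark" and the now-truncated "sat" both collapse to UNK.
print(encode("the aardvark sat"))
```

BPE sidesteps this by decomposing rare words into smaller subword units that do fit in the vocabulary.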


arvind2505 commented on September 28, 2024

Cool, I got it running with sub-word units too. Thanks everyone! This was extremely helpful.


jli05 commented on September 28, 2024

Sorry, how did you run it on sub-word units? By tokenising on letters? Could we train on both word and sub-word units?


arvind2505 commented on September 28, 2024

I was making a mistake while running the apply_bpe.py code. I just followed the steps given by @orhanf in this thread.

