americasnlp2021's People

Contributors

abteen, americasnlpws, aoncevay, ftyers, pywirrarika, rolandocoto

americasnlp2021's Issues

Extraneous file?

There appears to be a macOS metadata file in the repo: ../data/quechua-spanish/.DS_Store
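
A quick way to check whether other such files are in the tree is to search for them by name. This is only a sketch; the function name `find_ds_store` is made up for illustration and is not part of the repo.

```python
from pathlib import Path


def find_ds_store(root):
    """Return every .DS_Store file under root, sorted for stable output."""
    return sorted(Path(root).rglob(".DS_Store"))
```

Calling `find_ds_store("../data")` from `baseline_system` would list each match. Adding `.DS_Store` to `.gitignore` keeps such files from being committed again.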

Missing dev data?

After following the README for the Spanish-Nahuatl baseline system, I get:

fran@ipek:~/source/americasnlp2021/baseline_system$ ./run_baseline_system.sh nah ../data/nahuatl-spanish/ . 5
################ Training SentencePiece tokenizer ################
sentencepiece_trainer.cc(75) LOG(INFO) Starts training with : 
trainer_spec {
  input: ../data/nahuatl-spanish//train.es
  input: ../data/nahuatl-spanish//train.nah
  input_format: 
  model_prefix: ./models/nah_es/sentencepiece.bpe
  model_type: BPE
  vocab_size: 3557
  accept_language: es
  accept_language: nah
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv: 
}
denormalizer_spec {}
trainer_interface.cc(330) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(185) LOG(INFO) Loading corpus: ../data/nahuatl-spanish//train.es
trainer_interface.cc(357) LOG(WARNING) Found too long line (5137 > 4192).
trainer_interface.cc(359) LOG(WARNING) Too long lines are skipped in the training.
trainer_interface.cc(360) LOG(WARNING) The maximum length can be changed with --max_sentence_length=<size> flag.
trainer_interface.cc(185) LOG(INFO) Loading corpus: ../data/nahuatl-spanish//train.nah
trainer_interface.cc(386) LOG(INFO) Loaded all 32105 sentences
trainer_interface.cc(392) LOG(INFO) Skipped 20 too long sentences.
trainer_interface.cc(401) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(401) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(401) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(406) LOG(INFO) Normalizing sentences...
trainer_interface.cc(467) LOG(INFO) all chars count=4552492
trainer_interface.cc(488) LOG(INFO) Alphabet size=142
trainer_interface.cc(489) LOG(INFO) Final character coverage=1
trainer_interface.cc(521) LOG(INFO) Done! preprocessed 32105 sentences.
trainer_interface.cc(527) LOG(INFO) Tokenizing input sentences with whitespace: 32105
trainer_interface.cc(537) LOG(INFO) Done! 79331
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=77400 min_freq=33
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=29907 size=20 all=3171 active=2084 piece=ch
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=17897 size=40 all=4381 active=3294 piece=as
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=10792 size=60 all=6028 active=4941 piece=▁la
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=8091 size=80 all=7365 active=6278 piece=▁los
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=5905 size=100 all=8967 active=7880 piece=ua
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=5795 min_freq=470
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=4562 size=120 all=10507 active=2407 piece=▁ma
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=3632 size=140 all=11823 active=3723 piece=▁me
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=3174 size=160 all=12935 active=4835 piece=▁del
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=2682 size=180 all=14391 active=6291 piece=ton
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=2306 size=200 all=15737 active=7637 piece=dad
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=2296 min_freq=397
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=2032 size=220 all=16802 active=2023 piece=▁man
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=1856 size=240 all=17915 active=3136 piece=pil
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=1699 size=260 all=18899 active=4120 piece=▁cas
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=1556 size=280 all=19919 active=5140 piece=mp
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=1430 size=300 all=20990 active=6211 piece=▁na
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=1429 min_freq=328
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=1305 size=320 all=22042 active=2055 piece=▁ten
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=1227 size=340 all=22990 active=3003 piece=▁nos
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=1166 size=360 all=24022 active=4035 piece=yotl
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=1079 size=380 all=24898 active=4911 piece=tza
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=1042 size=400 all=26219 active=6232 piece=teca
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=1039 min_freq=240
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=991 size=420 all=27252 active=2248 piece=meh
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=923 size=440 all=27973 active=2969 piece=miqui
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=862 size=460 all=28625 active=3621 piece=▁tlaca
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=819 size=480 all=29636 active=4632 piece=▁san
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=787 size=500 all=30328 active=5324 piece=▁mar
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=785 min_freq=188
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=736 size=520 all=31500 active=2658 piece=ones
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=707 size=540 all=32429 active=3587 piece=▁inic
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=688 size=560 all=33116 active=4274 piece=ul
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=652 size=580 all=33882 active=5040 piece=▁Ca
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=636 size=600 all=34473 active=5631 piece=patl
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=634 min_freq=157
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=608 size=620 all=35588 active=2830 piece=ún
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=589 size=640 all=36410 active=3652 piece=▁fueron
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=571 size=660 all=36838 active=4080 piece=▁todo
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=550 size=680 all=37521 active=4763 piece=tis
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=535 size=700 all=38156 active=5398 piece=ko
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=535 min_freq=132
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=516 size=720 all=38808 active=2440 piece=ido
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=502 size=740 all=39613 active=3245 piece=▁za
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=494 size=760 all=40100 active=3732 piece=▁Ama
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=480 size=780 all=40724 active=4356 piece=▁160
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=456 size=800 all=41144 active=4776 piece=ni
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=455 min_freq=118
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=438 size=820 all=41773 active=2612 piece=ño
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=424 size=840 all=42313 active=3152 piece=▁vis
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=412 size=860 all=42825 active=3664 piece=▁tona
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=397 size=880 all=43537 active=4376 piece=▁día
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=388 size=900 all=44204 active=5043 piece=ini
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=386 min_freq=103
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=374 size=920 all=44770 active=2694 piece=tiaya
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=363 size=940 all=45507 active=3431 piece=▁dicha
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=353 size=960 all=46190 active=4114 piece=▁Los
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=343 size=980 all=46668 active=4591 piece=tlahtoca
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=334 size=1000 all=47460 active=5383 piece=▁prim
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=333 min_freq=92
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=328 size=1020 all=47982 active=2894 piece=▁flores
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=318 size=1040 all=48652 active=3564 piece=▁ba
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=312 size=1060 all=49311 active=4223 piece=tque
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=306 size=1080 all=49623 active=4535 piece=gún
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=300 size=1100 all=50133 active=5045 piece=elig
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=300 min_freq=83
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=294 size=1120 all=50587 active=2954 piece=oy
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=289 size=1140 all=50836 active=3203 piece=tecas
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=284 size=1160 all=51160 active=3527 piece=▁corazón
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=279 size=1180 all=51814 active=4181 piece=ío
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=274 size=1200 all=52138 active=4505 piece=pacho
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=274 min_freq=77
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=268 size=1220 all=52547 active=2940 piece=cuil
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=263 size=1240 all=53012 active=3405 piece=▁yh
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=258 size=1260 all=53432 active=3825 piece=▁tú
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=254 size=1280 all=53802 active=4195 piece=pohualxihuitl
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=248 size=1300 all=54188 active=4581 piece=▁nepa
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=248 min_freq=72
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=242 size=1320 all=54590 active=3104 piece=▁om
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=237 size=1340 all=55078 active=3592 piece=▁1608
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=234 size=1360 all=55562 active=4076 piece=pohualli
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=229 size=1380 all=56044 active=4558 piece=▁Des
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=226 size=1400 all=56548 active=5062 piece=▁tech
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=226 min_freq=66
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=221 size=1420 all=57062 active=3285 piece=▁governador
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=215 size=1440 all=57436 active=3659 piece=cua
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=212 size=1460 all=58059 active=4282 piece=▁aquel
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=207 size=1480 all=58632 active=4855 piece=coli
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=202 size=1500 all=58951 active=5174 piece=zó
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=202 min_freq=61
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=200 size=1520 all=59465 active=3444 piece=culo
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=197 size=1540 all=59856 active=3835 piece=▁mos
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=194 size=1560 all=60173 active=4152 piece=▁159
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=192 size=1580 all=60458 active=4437 piece=▁tlamantli
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=188 size=1600 all=60831 active=4810 piece=▁Per
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=188 min_freq=57
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=184 size=1620 all=61171 active=3369 piece=jar
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=182 size=1640 all=61565 active=3763 piece=▁quer
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=179 size=1660 all=61836 active=4034 piece=huicac
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=176 size=1680 all=62188 active=4386 piece=▁pasado
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=173 size=1700 all=62686 active=4884 piece=ep
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=173 min_freq=54
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=171 size=1720 all=63186 active=3622 piece=maron
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=169 size=1740 all=63486 active=3922 piece=▁personas
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=166 size=1760 all=63825 active=4261 piece=ñore
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=165 size=1780 all=64180 active=4616 piece=tlalis
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=163 size=1800 all=64622 active=5058 piece=▁doc
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=163 min_freq=51
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=161 size=1820 all=64812 active=3415 piece=quiza
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=159 size=1840 all=65057 active=3660 piece=cuilo
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=156 size=1860 all=65391 active=3994 piece=idente
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=154 size=1880 all=65614 active=4217 piece=guna
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=151 size=1900 all=65854 active=4457 piece=rad
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=151 min_freq=49
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=150 size=1920 all=66238 active=3645 piece=▁puede
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=148 size=1940 all=66556 active=3963 piece=kat
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=146 size=1960 all=66874 active=4281 piece=▁oquichiuh
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=144 size=1980 all=67308 active=4715 piece=yolo
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=142 size=2000 all=67814 active=5221 piece=cado
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=142 min_freq=47
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=141 size=2020 all=68188 active=3744 piece=▁hombre
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=139 size=2040 all=68609 active=4165 piece=▁ello
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=137 size=2060 all=69018 active=4574 piece=▁kikua
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=135 size=2080 all=69279 active=4835 piece=macaz
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=133 size=2100 all=69485 active=5041 piece=▁cho
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=133 min_freq=44
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=131 size=2120 all=69803 active=3777 piece=▁Tlil
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=129 size=2140 all=69959 active=3933 piece=ger
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=127 size=2160 all=70310 active=4284 piece=idor
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=126 size=2180 all=70452 active=4426 piece=▁quimon
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=124 size=2200 all=70765 active=4739 piece=▁VI
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=124 min_freq=42
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=123 size=2220 all=71091 active=3861 piece=▁Las
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=122 size=2240 all=71397 active=4167 piece=▁Quen
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=121 size=2260 all=71611 active=4381 piece=▁vein
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=120 size=2280 all=71798 active=4568 piece=▁quienes
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=118 size=2300 all=72114 active=4884 piece=tolo
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=118 min_freq=40
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=117 size=2320 all=72241 active=3706 piece=uaya
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=116 size=2340 all=72636 active=4101 piece=▁clas
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=115 size=2360 all=72833 active=4298 piece=▁Chimal
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=113 size=2380 all=73121 active=4586 piece=erto
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=112 size=2400 all=73434 active=4899 piece=▁sep
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=112 min_freq=39
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=111 size=2420 all=73577 active=3811 piece=▁febrero
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=109 size=2440 all=73938 active=4172 piece=tlahuac
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=108 size=2460 all=74288 active=4522 piece=▁colo
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=107 size=2480 all=74548 active=4782 piece=▁águ
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=106 size=2500 all=74814 active=5048 piece=▁Cuix
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=106 min_freq=37
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=105 size=2520 all=75168 active=4093 piece=▁Real
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=104 size=2540 all=75275 active=4200 piece=▁Testigo
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=103 size=2560 all=75615 active=4540 piece=▁recib
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=102 size=2580 all=75930 active=4855 piece=▁tepetl
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=100 size=2600 all=76192 active=5117 piece=popo
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=100 min_freq=36
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=99 size=2620 all=76519 active=4068 piece=▁wa
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=98 size=2640 all=76744 active=4293 piece=ptla
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=97 size=2660 all=76962 active=4511 piece=iga
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=97 size=2680 all=77203 active=4752 piece=▁Señora
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=96 size=2700 all=77419 active=4968 piece=tilique
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=96 min_freq=35
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=95 size=2720 all=77700 active=4120 piece=▁tendr
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=94 size=2740 all=78014 active=4434 piece=▁ancho
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=93 size=2760 all=78218 active=4638 piece=▁estr
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=92 size=2780 all=78534 active=4954 piece=▁tzo
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=91 size=2800 all=78808 active=5228 piece=▁hacen
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=91 min_freq=34
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=90 size=2820 all=79018 active=4150 piece=onotza
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=89 size=2840 all=79263 active=4395 piece=▁hemos
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=88 size=2860 all=79494 active=4626 piece=mimil
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=87 size=2880 all=79700 active=4832 piece=eno
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=87 size=2900 all=79953 active=5085 piece=▁sequin
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=87 min_freq=32
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=86 size=2920 all=80142 active=4186 piece=▁ihquac
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=85 size=2940 all=80349 active=4393 piece=▁infor
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=84 size=2960 all=80522 active=4566 piece=▁cebol
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=83 size=2980 all=80720 active=4764 piece=yllo
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=83 size=3000 all=80905 active=4949 piece=▁huecauh
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=83 min_freq=31
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=82 size=3020 all=81081 active=4220 piece=icanos
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=81 size=3040 all=81257 active=4396 piece=▁tia
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=81 size=3060 all=81424 active=4563 piece=▁gobernando
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=80 size=3080 all=81636 active=4775 piece=▁pluma
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=79 size=3100 all=81687 active=4826 piece=▁rev
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=79 min_freq=30
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=78 size=3120 all=81881 active=4270 piece=miz
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=78 size=3140 all=82253 active=4642 piece=tequiuh
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=77 size=3160 all=82465 active=4854 piece=coton
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=76 size=3180 all=82706 active=5095 piece=huah
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=76 size=3200 all=82902 active=5291 piece=▁general
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=76 min_freq=29
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=75 size=3220 all=83198 active=4442 piece=ándo
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=75 size=3240 all=83314 active=4558 piece=▁tomaron
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=74 size=3260 all=83458 active=4702 piece=▁llamó
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=73 size=3280 all=83586 active=4830 piece=▁kua
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=73 size=3300 all=83692 active=4936 piece=palnemo
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=73 min_freq=28
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=72 size=3320 all=83880 active=4362 piece=▁Tia
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=72 size=3340 all=84051 active=4533 piece=▁capitán
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=71 size=3360 all=84189 active=4671 piece=▁peda
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=71 size=3380 all=84330 active=4812 piece=▁nochipa
bpe_model_trainer.cc(257) LOG(INFO) Added: freq=70 size=3400 all=84642 active=5124 piece=▁Ahui
bpe_model_trainer.cc(166) LOG(INFO) Updating active symbols. max_freq=70 min_freq=28
trainer_interface.cc(615) LOG(INFO) Saving model: ./models/nah_es/sentencepiece.bpe.model
trainer_interface.cc(626) LOG(INFO) Saving vocabs: ./models/nah_es/sentencepiece.bpe.vocab
################ Done training ################
################ Tokenizing data ################
Encode error: [Errno 2] No such file or directory: '../data/nahuatl-spanish//dev.es'
Encode error: [Errno 2] No such file or directory: '../data/nahuatl-spanish//dev.nah'
Encode error: [Errno 2] No such file or directory: '../data/nahuatl-spanish//test.es'
Encode error: [Errno 2] No such file or directory: '../data/nahuatl-spanish//test.nah'
################ Done tokenizing ################
################ Encoding Data ################
2021-01-02 23:18:19 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='./data_out/nah_es', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, profile=False, quantization_config_path=None, reset_logging=True, scoring='bleu', seed=1, source_lang='es', srcdict='./models/nah_es/fairseq.dict', target_lang='nah', task='translation', tensorboard_logdir=None, testpref=None, tgtdict='./models/nah_es/fairseq.dict', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref='./data_out/nah_es/train.bpe', user_dir=None, validpref='./data_out/nah_es/dev.bpe', wandb_project=None, workers=4)
2021-01-02 23:18:19 | INFO | fairseq_cli.preprocess | [es] Dictionary: 3558 types
2021-01-02 23:18:23 | INFO | fairseq_cli.preprocess | [es] ./data_out/nah_es/train.bpe.es: 16145 sents, 717620 tokens, 0.0% replaced by <unk>
2021-01-02 23:18:23 | INFO | fairseq_cli.preprocess | [es] Dictionary: 3558 types
Traceback (most recent call last):
  File "/home/fran/.local/bin/fairseq-preprocess", line 8, in <module>
    sys.exit(cli_main())
  File "/home/fran/.local/lib/python3.8/site-packages/fairseq_cli/preprocess.py", line 394, in cli_main
    main(args)
  File "/home/fran/.local/lib/python3.8/site-packages/fairseq_cli/preprocess.py", line 284, in main
    make_all(args.source_lang, src_dict)
  File "/home/fran/.local/lib/python3.8/site-packages/fairseq_cli/preprocess.py", line 256, in make_all
    make_dataset(
  File "/home/fran/.local/lib/python3.8/site-packages/fairseq_cli/preprocess.py", line 248, in make_dataset
    make_binary_dataset(vocab, input_prefix, output_prefix, lang, num_workers)
  File "/home/fran/.local/lib/python3.8/site-packages/fairseq_cli/preprocess.py", line 133, in make_binary_dataset
    offsets = Binarizer.find_offsets(input_file, num_workers)
  File "/home/fran/.local/lib/python3.8/site-packages/fairseq/binarizer.py", line 103, in find_offsets
    with open(PathManager.get_local_path(filename), "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: './data_out/nah_es/dev.bpe.es'
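
The failure above appears to trace back to the four "Encode error" lines: dev.es, dev.nah, test.es and test.nah are not in the data directory, so no dev.bpe.* files are written and fairseq-preprocess cannot open its validpref. A minimal pre-flight check, sketched below, would report this before any training starts. The function name `missing_split_files` and the `<split>.<lang>` naming are assumptions inferred from the paths in the log, not taken from the baseline scripts.

```python
from pathlib import Path


def missing_split_files(data_dir, langs=("es", "nah"), splits=("train", "dev", "test")):
    """Return the expected <split>.<lang> files that are absent from data_dir."""
    root = Path(data_dir)
    return [
        root / f"{split}.{lang}"
        for split in splits
        for lang in langs
        if not (root / f"{split}.{lang}").is_file()
    ]
```

Running `missing_split_files("../data/nahuatl-spanish/")` before `run_baseline_system.sh` would list the absent files up front, rather than leaving the problem to surface as a traceback inside fairseq.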

Language name problem

My understanding is that the language is generally called "Shipibo-Konibo", not "Shipibo-knoibo".

Variant and orthography of test sets

Hi, would it be possible to find out which variant and which orthography will be used for the test sets? For example, the Nahuatl data contains siua, sihua and cihua, but no siwa.
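
To see which spellings a given file actually contains, one option is to count them. The sketch below counts case-insensitive substring matches, since these forms usually occur inside longer words; the function name `count_variants` is made up for illustration, and the variants are expected in lowercase.

```python
def count_variants(lines, variants):
    """Count case-insensitive substring occurrences of each (lowercase) variant."""
    counts = {v: 0 for v in variants}
    for line in lines:
        lowered = line.lower()
        for v in variants:
            counts[v] += lowered.count(v)
    return counts
```

Passing the lines of train.nah together with `["siua", "sihua", "cihua", "siwa"]` gives one count per spelling. None of these four strings contains another, so the counts do not overlap.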

ISO language codes seem wrong

Would it be possible to change the language codes to the ISO-standard codes? It might help people find resources and avoid people confusing different languages. These are the ones I have in mind:
wix → hch
tnh → thh
shi → shp

If you don't have time, perhaps the issue could be left as a reference for others?
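
Until the codes are changed, a small lookup can serve as that reference. This sketch reads each of the three rows listed in this issue as the three-letter code used in the repo followed by the proposed ISO 639-3 code; that reading is an assumption, and the names `REPO_TO_ISO` and `to_iso` are hypothetical.

```python
# Pairs from this issue, read as (code used in the repo -> ISO 639-3 code).
# This reading of the rows is an assumption.
REPO_TO_ISO = {"wix": "hch", "tnh": "thh", "shi": "shp"}


def to_iso(code):
    """Return the ISO 639-3 code for a repo code, or the code unchanged if unmapped."""
    return REPO_TO_ISO.get(code, code)
```

Codes that are not in the table, such as `nah`, pass through unchanged.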

Will there be devset data for Hñähñu and Shipibo-Konibo?

Hey there!

Dev sets are provided for the other language pairs, but not for the hñähñu-spanish and shipibo_konibo-spanish pairs. Is that intended, or should we expect them to be provided later?

Thanks a bunch!
