
masakhane-mt's Issues

Have some en-ha and en-ig parallel data from Gourmet and ParaCrawl

Hi! In a collaboration between https://gourmet-project.eu/ and https://paracrawl.eu/, we have some parallel corpora. They're so new we haven't linked to them from the website yet.

The raw data comes from Internet Archive WIDE0006, Internet Archive WIDE00015, and our own crawl. Our own crawl targeted sites in CommonCrawl that had enough content in at least two EU languages; we then crawled each whole domain.

Text:
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ha.txt.gz
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ig.txt.gz

The same in TMX:
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ha.tmx.gz
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ig.tmx.gz

Starter Notebook -- AttributeError: module 'heapq' has no attribute '_heappushpop_max'

The Reverse Training Starter Notebook runs fine up to 4000 episodes and then gives the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/joeynmt/joeynmt/__main__.py", line 48, in <module>
    main()
  File "/content/joeynmt/joeynmt/__main__.py", line 35, in main
    train(cfg_file=args.config_path, skip_test=args.skip_test)
  File "/content/joeynmt/joeynmt/training.py", line 843, in train
    trainer.train_and_validate(train_data=train_data, valid_data=dev_data)
  File "/content/joeynmt/joeynmt/training.py", line 494, in train_and_validate
    valid_duration = self._validate(valid_data, epoch_no)
  File "/content/joeynmt/joeynmt/training.py", line 626, in _validate
    self._save_checkpoint(new_best, ckpt_score)
  File "/content/joeynmt/joeynmt/training.py", line 283, in _save_checkpoint
    to_delete = heapq._heappushpop_max(self.ckpt_queue,
AttributeError: module 'heapq' has no attribute '_heappushpop_max'
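The failing call uses a private heapq function that newer Python versions no longer expose. Until the notebook pins a compatible Python/joeynmt combination, the bounded "keep the k best checkpoints" queue can be sketched with only public heapq functions. This is an illustrative reimplementation, not joeynmt's actual code, and it assumes higher scores are better:

```python
import heapq

class CheckpointQueue:
    # Keep the k best (score, path) checkpoints using only the public
    # heapq API, avoiding the private heapq._heappushpop_max. The item
    # at index 0 of the min-heap is the worst checkpoint we still keep.
    def __init__(self, k):
        self.k = k
        self._heap = []

    def push(self, score, path):
        """Record a checkpoint; return the path to delete, if any."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (score, path))
            return None
        if score > self._heap[0][0]:
            # New checkpoint beats the current worst: evict the worst.
            _, evicted = heapq.heapreplace(self._heap, (score, path))
            return evicted
        return path  # new checkpoint is worse than everything kept
```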

Cannot load the trained model

I am getting this error when trying to copy the trained models from the notebook storage to Google Drive for persistent storage:

cp: cannot stat 'joeynmt/models/xhen_reverse_transformer/*': No such file or directory

Ingest Mozilla localization corpus

Tip: Mozilla hosts its localization repositories at https://github.com/mozilla-l10n/

There seem to be a bunch of repos for different products with different formats. Here's an example of Standard Moroccan Tamazight:

https://github.com/mozilla-l10n/firefoxios-l10n/blob/master/zgh/firefox-ios.xliff

Or here's some Acholi:

https://github.com/mozilla-l10n/sumo-l10n/blob/master/ach/LC_MESSAGES/buddyup.po

Trying to find a nice place to just grab everything from; I've e-mailed @Delphine asking for a TMX.
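For the .po files, here is a minimal sketch of pulling out translation pairs. It handles only the simple single-line entries; real .po files also contain multi-line strings, plurals, and escapes, for which a proper gettext parser (e.g. polib) would be safer.

```python
import re

# Matches only the simple single-line case: a msgid line immediately
# followed by a msgstr line. Multi-line strings, plurals, and escape
# sequences are deliberately out of scope for this sketch.
PO_PAIR = re.compile(r'msgid "(.*)"\nmsgstr "(.*)"')

def extract_po_pairs(po_text):
    # Return (source, translation) pairs, skipping the header entry
    # (empty msgid) and untranslated entries (empty msgstr).
    return [(s, t) for s, t in PO_PAIR.findall(po_text) if s and t]
```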

corpus_bleu() got an unexpected keyword argument 'sys_stream'

Got this error when trying to run an English-to-Afrikaans training on Google Colab. I used an exact copy of the Colab notebook provided in the README, only changing the target language to Afrikaans. The model starts training, but before it finishes one epoch I get the error. I did some searching online but wasn't able to find anything about it. This is a screenshot of the full error I am getting:
Screenshot_1
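The message suggests the installed sacrebleu version no longer accepts the sys_stream keyword that this joeynmt version passes; pinning sacrebleu to the version joeynmt expects is the clean fix. As a hedged illustration, a shim can dispatch on the installed signature (the keyword names here are inferred from the error message, not checked against documentation):

```python
import inspect

def call_corpus_bleu(corpus_bleu, hypotheses, references):
    # Dispatch on the installed function's actual signature: if it
    # still exposes the old sys_stream/ref_streams keywords, use them;
    # otherwise fall back to positional arguments.
    params = inspect.signature(corpus_bleu).parameters
    if "sys_stream" in params:
        return corpus_bleu(sys_stream=hypotheses, ref_streams=references)
    return corpus_bleu(hypotheses, references)
```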

Update notebooks to no longer rely on JW300

Edit: see #200, maybe we should leave the old JW300 notebooks up, and instead create new ones

The problem

JW300 has been taken down for copyright reasons. At least the following notebooks all rely on it:

https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_from_English_training.ipynb
https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_gdrive_from_English.ipynb
https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_into_English_training.ipynb

A solution (but see #200)

They need to be fixed to no longer use this dataset. Perhaps we could use Tatoeba or FLORES-101? Or one of the other machine translation datasets on https://huggingface.co/datasets?task_ids=task_ids:machine-translation&sort=downloads

Some issues with the "into English" MT starter notebook

https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_into_English_training.ipynb

When training a Twi-to-English model, I ran into a couple of challenges. Though these errors may have been due to something I overlooked, I thought it best to post here in case these small updates could help future users:

In the third from last code block, the current code is:
!cp -r joeynmt/models/${tgt}${src}_reverse_transformer/* "$gdrive_path/models/${src}${tgt}_reverse_transformer/"

  • the tgt and src variables are in different orders
  • the directory has not been made yet, so there is an error when trying to copy something into it

This can be fixed simply with these lines:

!mkdir -p "$gdrive_path/models/${tgt}${src}_reverse_transformer/"
!cp -r joeynmt/models/${tgt}${src}_reverse_transformer/* "$gdrive_path/models/${tgt}${src}_reverse_transformer/"

When trying to test my model in the last code block, I got an error because a checkpoint could not be found. I fixed this by using the checkpoint flag:

original line: ! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${tgt}${src}_reverse_transformer/config.yaml"

suggested update: ! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${tgt}${src}_reverse_transformer/config.yaml" --ckpt "$gdrive_path/models/${tgt}${src}_reverse_transformer/$filename.ckpt"

Masakhane reverse machine translation starter notebook training issue

I am trying to run the starter reverse training notebook and run into the following issue:

Everything seems to be working just fine, right up to when I want to train the model (the fourth cell from the bottom of the notebook), when I get the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/joeynmt/joeynmt/__main__.py", line 3, in <module>
    from joeynmt.training import train
  File "/content/joeynmt/joeynmt/training.py", line 22, in <module>
    from torchtext.legacy.data import Dataset
  File "/usr/local/lib/python3.7/dist-packages/torchtext/__init__.py", line 5, in <module>
    from . import vocab
  File "/usr/local/lib/python3.7/dist-packages/torchtext/vocab.py", line 13, in <module>
    from torchtext._torchtext import (
ImportError: /usr/local/lib/python3.7/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZN2at6detail10noopDeleteEPv

Package Version Dependency Issues in Starter Reverse Notebook

When running the following command in the starter notebook:

# Install JoeyNMT
! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

There is the following dependency clash:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.6.0 requires numpy~=1.19.2, but you have numpy 1.21.2 which is incompatible.
tensorflow 2.6.0 requires six~=1.15.0, but you have six 1.12.0 which is incompatible.
tensorflow 2.6.0 requires wrapt~=1.12.1, but you have wrapt 1.11.1 which is incompatible.
google-colab 1.0.0 requires six~=1.15.0, but you have six 1.12.0 which is incompatible.
google-api-python-client 1.12.8 requires six<2dev,>=1.13.0, but you have six 1.12.0 which is incompatible.
google-api-core 1.26.3 requires six>=1.13.0, but you have six 1.12.0 which is incompatible.

Manually pip installing the required versions does not seem to help either. Could you please give some advice on what to do?

Unable to join the slack channel

Hey everyone, I just came across the amazing work you are all doing here and would love to contribute. However, I am unable to join the Slack channel, as the link is apparently broken.

Is there something I'm missing? How can I join the chat?


Custom Data Notebook: Spaces in file paths can cause issues with bash commands

For example, /content/drive/My Drive/masakhane/$src-$tgt-$tag can cause issues, but also the following situation caused an error for me:

source_file = f"/content/drive/My Drive/Research/Hani Machine Translation/hni_story_corpus/v2/hani_story_corpus_train.{source_language}"
target_file = f"/content/drive/My Drive/Research/Hani Machine Translation/hni_story_corpus/v2/hani_story_corpus_train.{target_language}"

# They should both have the same length.
! wc -l $source_file
! wc -l $target_file

Possible mitigations:

"MyDrive" instead of "My Drive" helps

Actually, it seems you can just change from using My Drive to MyDrive paths, which helps a lot so long as there aren't spaces elsewhere in the path, e.g. in my case where Hani Machine Translation was in the path to train.eng and train.hni

Add quotes around bash variables

For example
! wc -l "$source_file" instead of wc -l $source_file

and

! head "$source_file"* instead of ! head $source_file*

but this doesn't completely solve it, and can get complicated when we've got some of the more complex cases later in the notebook, like

!cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"

or within the yaml file:

#load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint

Warn the user about whitespace

Add a section that checks all the paths for whitespace and warns the user that it might be easier to just remove it.
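Such a check could be a few lines of Python at the top of the notebook. A sketch (not code from the notebook):

```python
def paths_with_whitespace(paths):
    # Return the subset of paths containing whitespace, which unquoted
    # bash interpolation in notebook cells would split into words.
    flagged = [p for p in paths if any(ch.isspace() for ch in p)]
    for p in flagged:
        print(f"WARNING: whitespace in path may break shell commands: {p!r}")
    return flagged
```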

Do all our file manipulations with Python

We could rewrite a lot of these to use pathlib

See also pjreddie/darknet#1672 and https://stackoverflow.com/questions/56640534/cannot-open-train-txt-with-white-space-my-drivehe

Originally posted this on masakhane-io/masakhane-community#25, whoops.
