masakhane-io / masakhane-mt
Machine Translation for Africa
License: MIT License
Hi! In a collaboration between https://gourmet-project.eu/ and https://paracrawl.eu/, we have some parallel corpora. It's so new that we haven't linked to it from the website yet.
The raw data comes from Internet Archive WIDE0006, Internet Archive WIDE00015, and our own crawl. Our own crawl was targeted at sites in CommonCrawl that had enough of at least two EU languages but then we crawled the whole domain.
Text:
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ha.txt.gz
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ig.txt.gz
The same in TMX:
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ha.tmx.gz
https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ig.tmx.gz
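For anyone who wants to load the TMX files into a notebook, here is a minimal sketch that streams sentence pairs out of a gzipped TMX file using only the standard library. It relies on the standard TMX layout (`tu` units containing `tuv` variants with an `xml:lang` attribute and a `seg` child). The function name `read_tmx_pairs` and the two-sentence sample are made up for illustration, and the language codes inside the real files have not been checked, so adjust `src_lang` and `tgt_lang` as needed.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree reports it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(fileobj, src_lang="en", tgt_lang="ha"):
    """Yield (source, target) segment pairs from an open TMX stream."""
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang.lower()] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release the finished unit so large files stay cheap


# Made-up sample standing in for a downloaded en-ha.tmx.gz file.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Good morning</seg></tuv>
<tuv xml:lang="ha"><seg>Ina kwana</seg></tuv></tu>
</body></tmx>"""

with gzip.open(io.BytesIO(gzip.compress(sample))) as handle:
    pairs = list(read_tmx_pairs(handle))

print(pairs)
```

For a real download, `gzip.open("en-ha.tmx.gz", "rb")` would replace the in-memory sample.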
The Reverse Training Starter Notebook runs fine up to 4000 episodes and then gives the following error:
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/joeynmt/joeynmt/__main__.py", line 48, in <module>
    main()
  File "/content/joeynmt/joeynmt/__main__.py", line 35, in main
    train(cfg_file=args.config_path, skip_test=args.skip_test)
  File "/content/joeynmt/joeynmt/training.py", line 843, in train
    trainer.train_and_validate(train_data=train_data, valid_data=dev_data)
  File "/content/joeynmt/joeynmt/training.py", line 494, in train_and_validate
    valid_duration = self._validate(valid_data, epoch_no)
  File "/content/joeynmt/joeynmt/training.py", line 626, in _validate
    self._save_checkpoint(new_best, ckpt_score)
  File "/content/joeynmt/joeynmt/training.py", line 283, in _save_checkpoint
    to_delete = heapq._heappushpop_max(self.ckpt_queue,
AttributeError: module 'heapq' has no attribute '_heappushpop_max'
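The traceback shows the training code calling a private `heapq` helper (note the leading underscore) that the Python 3.7 runtime does not provide. As an illustration only, and not JoeyNMT's actual implementation, here is a sketch of how a "keep the k best checkpoints" queue can be written with the public `heapq` API alone. The function name `push_checkpoint`, its parameters, and the scores are hypothetical.

```python
import heapq


def push_checkpoint(queue, score, path, max_size, minimize=False):
    """Add (score, path) to the queue and return the path to delete, if any.

    The queue is a min-heap on a key where larger is always better, so the
    worst checkpoint sits at the root and is the one evicted.
    """
    key = -score if minimize else score  # flip sign when lower scores are better
    if len(queue) < max_size:
        heapq.heappush(queue, (key, path))
        return None
    # Pushes the new entry, then pops the worst one. If the new entry is the
    # worst, it is returned immediately and never kept.
    _evicted_key, evicted_path = heapq.heappushpop(queue, (key, path))
    return evicted_path


# Hypothetical validation scores where higher is better (e.g. BLEU).
queue = []
deleted = [
    push_checkpoint(queue, 0.30, "1000.ckpt", max_size=2),
    push_checkpoint(queue, 0.35, "2000.ckpt", max_size=2),
    push_checkpoint(queue, 0.33, "3000.ckpt", max_size=2),
    push_checkpoint(queue, 0.10, "4000.ckpt", max_size=2),
]
print(deleted)
```

Negating the score for minimized metrics keeps a single code path instead of reaching for the private max-heap helpers.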
Following #157, check what languages are not covered in https://github.com/juliakreutzer/masakhane/tree/master/jw300_utils/test, and create custom test sets for those. @juliakreutzer I think I can give this a go, but do I need to do a pull request to... your forked version of masakhane-mt?
Alternate language code list, looks the same: https://opus.nlpl.eu/opusapi/?languages=True&corpus=JW300
Project if somebody is interested in taking it up. https://khamenei.ir/ used to be localized in ha, sw, de, ja, tr, and id.
https://web.archive.org/web/20160531131338/http://hausa.khamenei.ir/
https://web.archive.org/web/20160524162105/http://swahili.khamenei.ir/
It's currently in ru, fr, es, en, ur, ar, fa, and fa for kids.
I am getting this error when trying to copy the created models from the notebook storage to Google Drive for persistent storage:
cp: cannot stat 'joeynmt/models/xhen_reverse_transformer/*': No such file or directory
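`cp: cannot stat` with a `*` pattern means the pattern matched nothing, so the source directory is either missing or empty (for example, because the model was saved under a different src/tgt order). A minimal Python sketch of a copy step that reports this explicitly and creates the destination first is shown below. The helper name `copy_model_dir` and the demo directory layout are assumptions for illustration, not part of the notebook.

```python
import shutil
import tempfile
from pathlib import Path


def copy_model_dir(src_dir, dst_dir):
    """Copy every file under src_dir into dst_dir, creating dst_dir if needed."""
    src, dst = Path(src_dir), Path(dst_dir)
    if not src.is_dir():
        raise FileNotFoundError(
            f"No model directory at {src}; check the src/tgt order in its name"
        )
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in sorted(src.rglob("*")):
        if item.is_file():
            target = dst / item.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(item, target)
            copied.append(target)
    return copied


# Demo in a temporary directory standing in for Colab storage and Drive.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "joeynmt" / "models" / "xhen_reverse_transformer"
    src.mkdir(parents=True)
    (src / "config.yaml").write_text("name: demo\n")
    dst = Path(tmp) / "drive" / "models" / "xhen_reverse_transformer"
    copied_names = [p.name for p in copy_model_dir(src, dst)]

print(copied_names)
```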
The Google Drive link to the en_ha models is deprecated/wrong.
Tip: Mozilla has localization at https://github.com/mozilla-l10n/
There seem to be a bunch of repos for different products with different formats. Here's an example of Standard Moroccan Tamazight:
https://github.com/mozilla-l10n/firefoxios-l10n/blob/master/zgh/firefox-ios.xliff
Or here's some Acholi:
https://github.com/mozilla-l10n/sumo-l10n/blob/master/ach/LC_MESSAGES/buddyup.po
Trying to find a nice place to just grab everything from; I've e-mailed @Delphine asking for a TMX.
Hi, I'm trying to use your script in https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook.ipynb to create a translator for the Sango language, but it fails when trying to download the global test set: the file doesn't exist for "test.en-any.en". I think this happens because my src is "sg" and my target is "en"; I tested the inverse (English to Sango) and it worked.
Regards.
I got this error when trying to run an English to Afrikaans training on Google Colab. I used an exact copy of the Colab provided in the readme, changing only the target language to Afrikaans. The model starts training, but before it finishes one epoch I get the error. I did some searching online but wasn't able to find anything about this error. This is a screenshot of the full error I am getting:
Slack discussion: https://masakhane-nlp.slack.com/archives/C01JAP67HRV/p1634844082006400
https://github.com/joeynmt/joeynmt/blob/master/joey_demo.ipynb is the Tatoeba example.
Hi,
Thanks for this repository. I tried to run it but I am getting an issue at:
I don't know what 'fr_test_sents' is, is this maybe a typo?
https://colab.research.google.com/drive/1n_lRn6zWXDDor7scBUTnCk4x5j90LVum#scrollTo=FG5qoTCkdyCy makes it easy to generate new test sets.
Amazing! Thank you! Mind if I ask you to add the tensorboard code to the starter_notebook? I'm sure many people would benefit from it!
Would you mind adding a readme like this one?
Originally posted by @jaderabbit in #32 (comment)
JW300 has been taken down for copyright reasons. At least the following notebooks all rely on it:
https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_from_English_training.ipynb
https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_gdrive_from_English.ipynb
https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_into_English_training.ipynb
They need to be fixed to no longer use this dataset. Perhaps we could use Tatoeba or FloRES 101? Or one of the other machine translation datasets on https://huggingface.co/datasets?task_ids=task_ids:machine-translation&sort=downloads
When training a Twi-to-English model, I ran into a couple of challenges. Though these errors may have been due to something I overlooked, I thought it best to post here in case these small updates could help future users:
In the third from last code block, the current code is:
!cp -r joeynmt/models/${tgt}${src}_reverse_transformer/* "$gdrive_path/models/${src}${tgt}_reverse_transformer/"
This can be fixed simply with these lines:
!mkdir -p "$gdrive_path/models/${tgt}${src}_reverse_transformer/"
!cp -r joeynmt/models/${tgt}${src}_reverse_transformer/* "$gdrive_path/models/${tgt}${src}_reverse_transformer/"
When trying to test my model in the last code block, I got an error because a checkpoint could not be found. I fixed this by using the checkpoint flag:
original line: ! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${tgt}${src}_reverse_transformer/config.yaml"
suggested update: ! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${tgt}${src}_reverse_transformer/config.yaml" --ckpt "$gdrive_path/models/${tgt}${src}_reverse_transformer/$filename.ckpt"
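To avoid typing the checkpoint name by hand, the lookup can be sketched in Python as below. This assumes that checkpoints are named by step count (such as `10000.ckpt`) and that a `best.ckpt` may exist in the model directory; both are assumptions about the directory contents, so verify them against your own run. The helper name `pick_checkpoint` is hypothetical. Only the `--ckpt` flag comes from the suggested update above.

```python
import tempfile
from pathlib import Path


def pick_checkpoint(model_dir):
    """Return best.ckpt if present, else the highest-numbered *.ckpt, else None."""
    model_dir = Path(model_dir)
    best = model_dir / "best.ckpt"  # assumed name, check your model directory
    if best.exists():
        return best
    # Compare step counts numerically so 10000.ckpt beats 2000.ckpt.
    numbered = [p for p in model_dir.glob("*.ckpt") if p.stem.isdigit()]
    return max(numbered, key=lambda p: int(p.stem)) if numbered else None


# Demo with empty placeholder files instead of real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    for name in ("2000.ckpt", "10000.ckpt", "config.yaml"):
        (model_dir / name).touch()
    ckpt = pick_checkpoint(model_dir)
    chosen_name = ckpt.name
    # Argument list for the test command, suitable for subprocess.run(...).
    command = ["python3", "-m", "joeynmt", "test",
               str(model_dir / "config.yaml"), "--ckpt", str(ckpt)]

print(chosen_name)
```

Passing the command as a list also sidesteps shell quoting if the Drive path contains spaces.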
I am trying to run the starter reverse training notebook and run into the following issue:
Everything seems to be working just fine, right up to when I want to train the model (the fourth cell from the bottom of the notebook), when I get the following error:
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/joeynmt/joeynmt/__main__.py", line 3, in <module>
    from joeynmt.training import train
  File "/content/joeynmt/joeynmt/training.py", line 22, in <module>
    from torchtext.legacy.data import Dataset
  File "/usr/local/lib/python3.7/dist-packages/torchtext/__init__.py", line 5, in <module>
    from . import vocab
  File "/usr/local/lib/python3.7/dist-packages/torchtext/vocab.py", line 13, in <module>
    from torchtext._torchtext import (
ImportError: /usr/local/lib/python3.7/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZN2at6detail10noopDeleteEPv
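The issue does not state a cause, but an undefined symbol in a compiled extension such as `_torchtext.so` usually means the extension was built against a different `torch` release than the one installed. Assuming that is what happened here, a quick diagnostic is to compare the two installed versions. The pairing table below is an assumption to verify against the torchtext release notes, and the helper names are hypothetical. Note that `importlib.metadata` needs Python 3.8 or newer.

```python
from importlib import metadata  # standard library from Python 3.8

# Assumed torchtext -> torch minor-version pairs; verify before relying on them.
EXPECTED_TORCH = {"0.9": "1.8", "0.10": "1.9", "0.11": "1.10"}


def minor(version):
    """Reduce a version such as '1.9.0+cu111' to '1.9'."""
    return ".".join(version.split("+")[0].split(".")[:2])


def installed(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pair(torch_version, torchtext_version):
    """Describe whether the two versions look like a matching pair."""
    if torch_version is None or torchtext_version is None:
        return "torch or torchtext is not installed"
    expected = EXPECTED_TORCH.get(minor(torchtext_version))
    if expected is None:
        return f"no entry for torchtext {torchtext_version}; check its release notes"
    if expected == minor(torch_version):
        return "versions look compatible"
    return (f"torchtext {torchtext_version} expects torch {expected}.x, "
            f"found {torch_version}")


print(check_pair(installed("torch"), installed("torchtext")))
```

If the check reports a mismatch, reinstalling a matching pair in the same `pip install` command is the usual remedy.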
Hi @kevindegila, I noticed that the config.yaml for your en-fon benchmark model is empty. Could you please upload it?
Thanks!
When running the following command in the starter notebook:
# Install JoeyNMT
! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .
There is the following dependency clash:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.6.0 requires numpy~=1.19.2, but you have numpy 1.21.2 which is incompatible.
tensorflow 2.6.0 requires six~=1.15.0, but you have six 1.12.0 which is incompatible.
tensorflow 2.6.0 requires wrapt~=1.12.1, but you have wrapt 1.11.1 which is incompatible.
google-colab 1.0.0 requires six~=1.15.0, but you have six 1.12.0 which is incompatible.
google-api-python-client 1.12.8 requires six<2dev,>=1.13.0, but you have six 1.12.0 which is incompatible.
google-api-core 1.26.3 requires six>=1.13.0, but you have six 1.12.0 which is incompatible.
Manually pip installing the required versions does not seem to help either. Could you please give some advice on what to do?
For example, /content/drive/My Drive/masakhane/$src-$tgt-$tag can cause issues, but the following situation also caused an error for me:
source_file = f"/content/drive/My Drive/Research/Hani Machine Translation/hni_story_corpus/v2/hani_story_corpus_train.{source_language}"
target_file = f"/content/drive/My Drive/Research/Hani MachineTranslation/hni_story_corpus/v2/hani_story_corpus_train.{target_language}"
# They should both have the same length.
! wc -l $source_file
! wc -l $target_file
Mitigations we could do:
Actually, it seems you can just change from using My Drive to MyDrive paths, which helps a lot so long as there aren't spaces elsewhere in the path, e.g. in my case, where Hani Machine Translation was in the path to train.eng and train.hni.
Quote the path variables, for example
! wc -l "$source_file"
instead of ! wc -l $source_file
and
! head "$source_file"* instead of ! head $source_file*
but this doesn't completely solve it, and can get complicated when we've got some of the more complex cases later in the notebook, like
!cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"
or within the yaml file:
#load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
Add a section that checks all the paths for whitespace and warns the user that it may be easier to just remove it.
We could rewrite a lot of these to use pathlib
See also pjreddie/darknet#1672 and https://stackoverflow.com/questions/56640534/cannot-open-train-txt-with-white-space-my-drivehe
Originally posted this on masakhane-io/masakhane-community#25, whoops.
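The whitespace check and quoting mitigations listed above could look roughly like this. It is only a sketch: the helper names `find_whitespace_paths` and `shell_quote` are hypothetical, and the example paths are modeled on the ones in this issue.

```python
import shlex
from pathlib import Path


def find_whitespace_paths(paths):
    """Return, as strings, the paths that contain any whitespace character."""
    return [str(p) for p in paths if any(ch.isspace() for ch in str(p))]


def shell_quote(path):
    """Quote a path so a POSIX shell treats it as one argument."""
    return shlex.quote(str(path))


# Example paths; only the first contains a space.
candidates = [
    Path("/content/drive/My Drive/masakhane/train.en"),
    Path("/content/drive/MyDrive/masakhane/train.en"),
]

flagged = find_whitespace_paths(candidates)
for path in flagged:
    print(f"Warning: path contains whitespace: {path}")

quoted = shell_quote(candidates[0])
print(quoted)
```

Where a notebook cell can call `subprocess.run(["wc", "-l", str(path)])` with an argument list instead of a shell string, no quoting is needed at all.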
Hi, I would like to contribute a Pulaar translator model, but I need to be pointed to the sentence pairs. Can anyone help me out?