janetzki / guide
Create semantic domain dictionaries for low-resource languages
License: MIT License
We use at least one metric to evaluate the GNNs used for dictionary creation (DC).
Motivation: build GNN -> increase DC precision and recall > 0.30
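For reference, a minimal sketch of how precision, recall, and F1 over predicted word-domain links could be computed with scikit-learn (the labels below are made up):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical gold and predicted labels: 1 = word belongs to the
# semantic domain, 0 = it does not (one entry per candidate link).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of both

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```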
I want to project semantic domain questions as labels from eng + fra to deu + gej verses, for at least 10 verses.
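A minimal sketch of this projection, assuming token-level word alignments between a source verse and a target verse are already available (the alignment pairs and labels here are made up for illustration):

```python
# Project domain labels across word alignments from an English verse onto
# a target-language verse. Alignments are (src_idx, tgt_idx) pairs.
eng_tokens = ["the", "town", "was", "large"]
tgt_tokens = ["taun", "i", "bikpela"]
alignments = [(1, 0), (3, 2)]  # "town" -> "taun", "large" -> "bikpela"
eng_labels = {1: "town domain", 3: "size domain"}  # token index -> label

tgt_labels: dict[int, set[str]] = {}
for src_idx, tgt_idx in alignments:
    if src_idx in eng_labels:
        tgt_labels.setdefault(tgt_idx, set()).add(eng_labels[src_idx])

print(tgt_labels)  # {0: {'town domain'}, 2: {'size domain'}}
```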
We compute word embeddings for English words, for fra/tpi/meu words, and for semantic domains (by averaging the embeddings of their English words) to link the fra/tpi/meu words to the semantic domains.
(This issue is not part of the proposal --> optional.)
Motivation: fewer FPs, fewer FNs --> higher precision and recall --> F1 > 0.30
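A sketch of the idea with made-up vectors: the domain embedding is the mean of its English words' embeddings, and a target word is linked to the domain whose embedding is most similar. In practice the embeddings would come from a cross-lingual model so that all languages share one space.

```python
import numpy as np

# Hypothetical pre-computed word embeddings in a shared space.
embeddings = {
    "eng:water": np.array([0.9, 0.1, 0.0]),
    "eng:river": np.array([0.8, 0.2, 0.1]),
    "fra:eau":   np.array([0.85, 0.15, 0.05]),
}

# A semantic domain is embedded as the average of its English words.
domain_words = {"Water": ["eng:water", "eng:river"]}
domain_embeddings = {
    domain: np.mean([embeddings[w] for w in words], axis=0)
    for domain, words in domain_words.items()
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Link a fra word to the most similar domain.
scores = {d: cosine(embeddings["fra:eau"], e) for d, e in domain_embeddings.items()}
print(max(scores, key=scores.get))  # 'Water'
```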
Instead of contracted nodes, there are links between words that look like they belong together (as sketched below the examples).
French examples:
ignoriez: {'ignore', 'ignorer', 'ignoriez'}
illuminé: {'illuminé', 'illuminés'}
image: {'images', 'image'}
Motivation: build GNN
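One simple way to find such candidate links, sketched here with a purely heuristic rule (a shared prefix of a minimum length; the actual criterion in the repository may differ):

```python
from os.path import commonprefix
from itertools import combinations

words = ["ignore", "ignorer", "ignoriez", "illuminé", "illuminés", "image", "images"]

def look_alike(a: str, b: str, min_prefix: int = 5) -> bool:
    # Heuristic: two words "belong together" if they share a long prefix
    # or one word is a prefix of the other.
    prefix = commonprefix([a, b])
    return len(prefix) >= min_prefix or prefix == min(a, b, key=len)

links = [(a, b) for a, b in combinations(words, 2) if look_alike(a, b)]
print(links)
# [('ignore', 'ignorer'), ('ignore', 'ignoriez'), ('ignorer', 'ignoriez'),
#  ('illuminé', 'illuminés'), ('image', 'images')]
```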
As a developer, I want to automatically tune the model's hyperparameters by using wandb Sweeps.
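A minimal wandb sweep setup, assuming a train() function that reads its trial's hyperparameters from wandb.config (the parameter names, ranges, and project name here are placeholders):

```python
import wandb

def train():
    # wandb.agent calls this once per trial; run.config holds the
    # hyperparameters sampled for that trial.
    with wandb.init() as run:
        lr = run.config.learning_rate
        # ... build and train the model with lr, then log the result ...
        run.log({"f1": 0.0})  # placeholder metric

sweep_config = {
    "method": "bayes",  # or "grid" / "random"
    "metric": {"name": "f1", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-4, "max": 1e-1},
    },
}

sweep_id = wandb.sweep(sweep_config, project="dictionary_creator")
wandb.agent(sweep_id, function=train, count=10)
```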
A person who did not develop the code used it to create a dictionary.
Motivation: get feedback
As a developer, I want to automate the testing of the code on GitHub to make sure that the code coverage stays greater than 97% without the need for manual testing.
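Locally, the same threshold can be enforced with pytest-cov, which a CI workflow (e.g., GitHub Actions) would then run on every push; the package name below is assumed:

```python
import sys
import pytest

# Fail the run (and thus the CI job) if coverage drops below 97 %.
# "dictionary_creator" is the assumed package under test.
sys.exit(pytest.main(["--cov=dictionary_creator", "--cov-fail-under=97"]))
```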
(https://huggingface.co/datasets/sil-ai/bloom-lm/viewer/tpi/train)
As a developer, I want to use books from the Bloom library as supplementary training data to improve the word alignment's quality. This would in turn increase the dictionary creator's precision.
Motivation: ("More data beats more clever algorithms.") more parallel data -> improved alignment -> fewer FPs -> higher DC precision
The Story of Jonah
eng: In those days there was a very large town where many people lived. The town's name was Nineveh.
tpi: Long dispela taim i gat wanpela bikpela taun i gat planti manmeri. Nem bilong dispela taun em Nineveh.
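Loading the Tok Pisin portion of that dataset with the Hugging Face datasets library (dataset name and config taken from the URL above):

```python
from datasets import load_dataset

# Tok Pisin (tpi) texts from the Bloom library dataset on the Hugging Face Hub.
bloom_tpi = load_dataset("sil-ai/bloom-lm", "tpi", split="train")
print(bloom_tpi[0])  # inspect the first record
```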
As a developer, I want to see how well the lemmatization of LRL words works so that I can explain and improve the lemmatization approach.
Motivation: improve lemma groups -> reduce false positives -> increase precision
We want to create a GraphQL graph database that includes the linguistic and geographical hierarchy of all spoken languages. The goal is to leverage this graph as input to GNNs. The language nodes link their respective monolingual texts (i.e., Bible translations) and semantic domain dictionaries.
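In place of an actual graph database, a sketch of the intended structure with networkx (the hierarchy shown is a tiny, hand-picked excerpt; the file names are hypothetical):

```python
import networkx as nx

g = nx.DiGraph()

# Linguistic hierarchy (excerpt): family -> subgroup -> language.
g.add_edge("Indo-European", "Germanic", relation="linguistic")
g.add_edge("Germanic", "eng", relation="linguistic")
g.add_edge("Austronesian", "Oceanic", relation="linguistic")
g.add_edge("Oceanic", "meu", relation="linguistic")

# Each language node links its monolingual resources.
g.nodes["eng"]["bible"] = "eng-web.txt"           # hypothetical file names
g.nodes["eng"]["dictionary"] = "eng_semdoms.json"
g.nodes["meu"]["bible"] = "meu.txt"

print(list(g.successors("Germanic")))  # ['eng']
```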
We want to build a simple model for semantic domain identification (SDI) that serves as a baseline.
Motivation: SDI with F1 > 0.30 for 1 tpi/meu
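One plausible shape for such a baseline (not necessarily what the repository implements): a bag-of-words classifier that predicts a semantic domain from a verse's tokens. The training examples below are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: verses labeled with a semantic domain.
verses = ["there was a very large town", "the fish swam in the sea"]
domains = ["City", "Sea"]

baseline = make_pipeline(CountVectorizer(), LogisticRegression())
baseline.fit(verses, domains)

print(baseline.predict(["a town with many people"]))  # likely ['City']
```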
todos in comments
(https://towardsdatascience.com/graph-neural-networks-with-pyg-on-node-classification-link-prediction-and-anomaly-detection-14aa38fe1275)
There is a GNN model that links LRL words with semantic domains.
Motivation: build GNN -> increase DC precision and recall > 0.30
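Following the tutorial linked above, a heavily condensed edge-prediction sketch in PyTorch Geometric; the graph, features, and sizes are placeholders, not the repository's actual model:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class LinkPredictor(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)

    def encode(self, x, edge_index):
        return self.conv2(F.relu(self.conv1(x, edge_index)), edge_index)

    def decode(self, z, edge_pairs):
        # Score a candidate word-domain edge by the dot product of its
        # two node embeddings.
        return (z[edge_pairs[0]] * z[edge_pairs[1]]).sum(dim=-1)

# Toy graph: nodes 0-2 are LRL words, node 3 is a semantic domain.
x = torch.randn(4, 16)                             # node features
edge_index = torch.tensor([[0, 1, 2], [3, 3, 3]])  # known word-domain links

model = LinkPredictor(16, 32)
z = model.encode(x, edge_index)
candidate = torch.tensor([[2], [3]])  # does word 2 link to domain 3?
print(torch.sigmoid(model.decode(z, candidate)))
```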
There are no more todos in the code. It's more consistent to have them here on GitHub.
Motivation: see the bigger picture more clearly
Make at least one dictionary creator model accessible to at least one other person.
Motivation: get feedback
As a software developer, I want to try out replacing fast_align with the AWESoME word aligner to see if this improves the dictionary creator's F1 score.
Motivation: improve alignment -> reduce FPs -> increase DC's precision
Update the dictionary_creator to use eflomal instead of fast_align.

| Language pair | fast_align (baseline) | mBERT | fine-tuned by AWESoME | Eflomal |
|---|---|---|---|---|
| eng-eng | 0.25, 0.40 | 0.26, 0.38 | 0.27, 0.38 | 0.23, 0.39 |
| eng-fra | 0.24, 0.37 | 0.27, 0.38 | 0.27, 0.36 | 0.26, 0.41 |
| eng-tpi | n/a, 0.28 | n/a, 0.19 | n/a, 0.20 | n/a, 0.31 |
| eng-meu | n/a, 0.23 | n/a, 0.11 | n/a, 0.11 | n/a, 0.21 |
There are ~8000 more nodes in the graph: each is a semantic domain (question), with links to each of its words in different languages. Creating dictionaries means finding these links.
Motivation: build GNN -> increase DC precision and recall > 0.30
Example: link eng:polluted by adding the two connections to 4.9.5.6.
For each target bible, we automatically choose an English bible that aligns well to make the alignment more meaningful.
Motivation: improve alignment -> reduce FPs -> increase DC precision > 0.30
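A sketch of one way to pick the best-aligning English bible; the candidate names and scores are made up, and in practice the score could be, e.g., the aligner's average alignment probability on a sample of verses:

```python
english_bibles = ["eng-kjv", "eng-web", "eng-asv"]

def alignment_score(eng_bible: str, target_bible: str) -> float:
    # Placeholder scores; the real implementation would run the word aligner.
    return {"eng-kjv": 0.61, "eng-web": 0.74, "eng-asv": 0.69}[eng_bible]

def choose_english_bible(target_bible: str) -> str:
    # Pick the English bible with the highest alignment quality.
    return max(english_bibles, key=lambda b: alignment_score(b, target_bible))

print(choose_english_bible("tpi"))  # 'eng-web'
```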
We can evaluate the dictionary creation on at least one LRL (and not only on HRL/MRL).
Motivation: evaluate DC -> create 1 LRL dict with F1 > 30%
todos in comments

| metric | value | note |
|---|---|---|
| MRR for tpi | 0.281 | 642 / 5059 tpi questions selected |
| MRR for meu | 0.230 | 842 / 5052 meu questions selected |
| MRR for swp (Daui) | 0.171 | 1272 / 4518 swp questions selected |
The MRR decreases as the number of selected questions increases. Apparently, getting more questions right is "more difficult".
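For reference, MRR (mean reciprocal rank) averages 1/rank of the first correct answer over all queries; a toy computation:

```python
# Ranks at which the correct semantic domain appeared for three
# hypothetical questions (rank 1 = top prediction was correct).
ranks = [1, 4, 2]

mrr = sum(1 / r for r in ranks) / len(ranks)
print(round(mrr, 3))  # (1.0 + 0.25 + 0.5) / 3 = 0.583
```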
We have one dictionary each for Tok Pisin (tpi) and Suau/Daui (swp) that we can use to evaluate the dictionary creator on these two LRLs (and not only on HRLs/MRLs).
(tabula-py lib)
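tabula-py reads tables out of PDFs; if the swp dictionary ships as a PDF (an assumption here), extraction could look like this (the file name is hypothetical):

```python
import tabula

# Extract all tables from the dictionary PDF; each becomes a pandas DataFrame.
tables = tabula.read_pdf("swp_dictionary.pdf", pages="all")
print(len(tables), "tables extracted")
print(tables[0].head())
```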
The dictionary_creator loads consistent progress (i.e., it continues with the same data that has been saved).
The dictionary_creator loads incomplete progress (e.g., word_graph).
A self.progress_log = [] stores all completed steps.
todos in comments
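A sketch of how such resumable progress could work, using pickle for persistence; apart from progress_log and word_graph, the names here are assumptions:

```python
import os
import pickle

class DictionaryCreator:
    def __init__(self, state_path="state.pkl"):
        self.state_path = state_path
        self.progress_log = []  # names of all completed steps, in order
        self.word_graph = None  # may still be missing in incomplete progress

    def save_progress(self):
        with open(self.state_path, "wb") as f:
            pickle.dump({"progress_log": self.progress_log,
                         "word_graph": self.word_graph}, f)

    def load_progress(self):
        # Loading works even if some artifacts (e.g., word_graph) are absent.
        if os.path.exists(self.state_path):
            with open(self.state_path, "rb") as f:
                self.__dict__.update(pickle.load(f))

    def execute(self, steps):
        self.load_progress()
        for name, step in steps:
            if name not in self.progress_log:  # skip completed steps
                step()
                self.progress_log.append(name)
                self.save_progress()

dc = DictionaryCreator()
dc.execute([("build_word_graph", lambda: None)])  # placeholder step
print(dc.progress_log)  # ['build_word_graph']
```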
(https://www.youtube.com/watch?v=1gHUiNLYa20)
As a developer, I want to try out wandb to understand my model and make it more explainable.
Result: I have an overview of wandb's features (e.g., the Keras integration).
Motivation: build GNN
As a developer, I want to lemmatize LRL words by segmenting them (similar to the approach in this paper).
Motivation: improve lemma groups -> reduce false positives -> increase dictionary creation precision
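The paper's exact method isn't reproduced here; as an illustration, Morfessor performs unsupervised segmentation that could serve as a crude lemmatizer for LRLs (the training file name is hypothetical):

```python
import morfessor

# Train an unsupervised segmentation model on raw LRL text
# (hypothetical corpus file, one sentence or word per line).
io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file("tpi_corpus.txt"))

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()

# Segment a word into morphs; the first morph can act as a crude lemma key.
morphs, _cost = model.viterbi_segment("bikpela")
print(morphs)  # e.g., ['bik', 'pela']
```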
User Story: As a developer, I want to have only a single node for all lemmas of the same word in one language so that we collect more information per node, which should lead to a higher F1 score.
As a developer, I want to remove stop words and sentence tokens from the ground-truth semantic domain data to make the matches of words with semantic domains more meaningful.
(I.e., we "sacrifice the minority for the majority.")
Motivation: improve alignment -> reduce false positives -> increase dictionary creation precision
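A minimal version of this cleanup with NLTK's English stopword list; the sentence-token set here is a guess at what those tokens look like:

```python
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
sentence_tokens = {".", ",", ";", "?", "!"}  # assumed punctuation tokens

domain_words = ["the", "town", "of", "nineveh", ","]
cleaned = [w for w in domain_words if w not in stop_words | sentence_tokens]
print(cleaned)  # ['town', 'nineveh']
```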
We want to build a GNN-based edge prediction BOW model for SDI. We hypothesize that it outperforms the simple baseline model.
Motivation: SDI with F1 > 0.30 for 1 tpi/meu
We want to use a tree-structured graph that contains languages and their linguistic and geographic relations. We hypothesize that this improves the performance of the GNN-based edge prediction model for dictionary creation.
As a developer, I want to acquire the data from MARBLE to use their mappings from words to lexical domains. These mappings indicate to which semantic domain a verse (or word or phrase) belongs (for Hebrew and Greek).
Motivation: disambiguate words -> reduce FNs -> increase DC recall > 0.30
'macula-greek/sources/MARBLE/SDBG/'
(general information)
Additional visualizations explain how the GNN works.
Motivation: explain GNN