guide's Issues

Evaluate GNN for dictionary creation

Goal

We use at least one metric to evaluate the GNNs used for dictionary creation.
Motivation: build GNN -> increase DC precision and recall > 0.30

Tasks

  • #22
  • Evaluate LRL

Notes

  • error analysis (examples where the model is failing)
    • error analysis is to ML what debugging is to conventional programming
  • look at learning curves
  • visualize learned representations (e.g., embeddings, nearest neighbors; see the sketch after this list)
  • simplify the problem/model
  • Make model explainable
    • apply data visualization
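
A minimal sketch for the "visualize learned representations" note, assuming a hypothetical dict embeddings that maps node names to vectors taken from the trained model:

    import numpy as np

    # Inspect a word's nearest neighbors in the learned embedding space.
    # `embeddings`: {node_name: np.ndarray} (assumed, not the project's actual API)
    def nearest_neighbors(word, embeddings, k=5):
        query = embeddings[word]
        sims = {
            other: float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
            for other, vec in embeddings.items() if other != word
        }
        return sorted(sims, key=sims.get, reverse=True)[:k]

If the neighbors of, say, fra:image are unrelated words, that points to a representation problem rather than a decoding problem.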

Reproduce Annotation Projection from Imani et al.

Goal

I want to project semantic domain questions as labels from eng and fra verses to deu and gej verses, for at least 10 verses.

Tasks

  • Align single verses instead of whole corpus
    • Save them as .conll files (see the sketch after this list)
  • Add all SemDom questions as labels
  • Run GNN with this data
  • Run on remote server
  • Run with 10 verses
  • Run with 4 languages
  • (try out XLM-R / static embeddings)
  • (try out multiple labels per token)
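
A minimal sketch for the .conll task, assuming each verse is a list of (token, projected_label) pairs; the two-column tab-separated layout and the example labels are assumptions, not a fixed format:

    # Write one token per line with its projected semantic domain label,
    # separated by a blank line between verses (CoNLL-style).
    def write_conll(path, verses):
        with open(path, "w", encoding="utf-8") as f:
            for tokens in verses:  # tokens: list of (word, label) pairs
                for word, label in tokens:
                    f.write(f"{word}\t{label}\n")
                f.write("\n")  # verse boundary

    write_conll("deu_projected.conll", [[("Haus", "6.5"), ("bauen", "6.5")]])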

Use word embeddings for English and LRL

Goal

We compute word embeddings for English words, fra/tpi/meu words, and semantic domains (by averaging the English words) to link the fra/tpi/meu words to the semantic domains.
(This issue is not part of the proposal. --> optional)
Motivation: fewer FPs and fewer FNs --> higher precision and recall --> F1 > 0.30
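
A minimal sketch of the linking step, using gensim's downloadable GloVe vectors as a stand-in for the actual English embeddings; sd_words (SD -> English word list) and the aligned English translations per LRL word are hypothetical inputs:

    import numpy as np
    import gensim.downloader as api

    eng_vectors = api.load("glove-wiki-gigaword-100")  # pre-trained English embeddings

    def average_vector(words):
        # Average the embeddings of all words that have one.
        vecs = [eng_vectors[w] for w in words if w in eng_vectors]
        return np.mean(vecs, axis=0) if vecs else None

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # One averaged vector per semantic domain (sd_words is assumed to exist).
    sd_vectors = {sd: v for sd, ws in sd_words.items()
                  if (v := average_vector(ws)) is not None}

    def link_to_domains(aligned_eng_words, top_k=3):
        # Represent an LRL word by its aligned English words, then rank SDs.
        wv = average_vector(aligned_eng_words)
        if wv is None:
            return []
        return sorted(sd_vectors, key=lambda sd: cosine(wv, sd_vectors[sd]),
                      reverse=True)[:top_k]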

Tasks

  • clarify the goal
  • Does the aligner (e.g., a fine-tuned AWESoME model) already use word embeddings?
  • clarify how this is different from node embeddings if words are nodes
  • look at Neural Cross-Lingual NER with Minimal Resources
    • try out MUSE
  • get word embeddings for English words (e.g., word2vec/Glove)
  • compute word embeddings for semantic domains
  • compute word embeddings for fra/LRL words
  • link fra/tpi/meu words to semantic domains
  • evaluate created dictionaries

Add links between similar words

Goal

Instead of contracted nodes, there are links between words that look like they belong together.
French examples:

ignoriez: {'ignore', 'ignorer', 'ignoriez'}
illuminé: {'illuminé', 'illuminés'}
image: {'images', 'image'}

Motivation: build GNN

Tasks

  • use the link contraction metric to add weighted links without contracting nodes (see the sketch below)
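
A minimal sketch of such weighted links, using NetworkX and a stdlib string-similarity measure as a stand-in for the project's link contraction metric; G and lemma_groups are assumed inputs:

    import networkx as nx
    from difflib import SequenceMatcher

    # Instead of contracting lemma candidates into one node, connect them with
    # weighted edges. `lemma_groups` maps a base form to its variants, e.g.
    # 'image' -> {'image', 'images'}.
    def add_similarity_links(G, lemma_groups, threshold=0.7):
        for base, variants in lemma_groups.items():
            for variant in variants:
                if variant == base:
                    continue
                weight = SequenceMatcher(None, base, variant).ratio()  # in [0, 1]
                if weight >= threshold:
                    G.add_edge(base, variant, weight=weight, kind="lemma")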

Acquire parallel data from Bloom Library

[Screenshot: Bloom Library dataset viewer, https://huggingface.co/datasets/sil-ai/bloom-lm/viewer/tpi/train]

Goal

As a developer, I want to use books from the Bloom Library as supplementary training data to improve the word alignment's quality. This would in turn increase the dictionary creator's precision.
Motivation: (More data beats more clever algorithms.) more parallel data -> improve alignment -> less FPs -> higher DC precision

Example

The Story of Jonah
eng: In those days there was a very large town where many people lived. The town's name was Nineveh.
tpi: Long dispela taim i gat wanpela bikpela taun i gat planti manmeri. Nem bilong dispela taun em Nineveh.
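
A minimal sketch for loading the Tok Pisin books with the Hugging Face datasets library, assuming the config name 'tpi' and split 'train' shown in the viewer URL above; the record fields are not assumed:

    from datasets import load_dataset

    # Load the Tok Pisin portion of the Bloom Library language-model dataset.
    bloom_tpi = load_dataset("sil-ai/bloom-lm", "tpi", split="train")
    print(len(bloom_tpi))
    print(bloom_tpi[0])  # inspect one record to see the available fields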

Tasks

Evaluate semantic domain identification

Goal

We use the mappings between words and SDs to calculate the SDI's F1 score for at least one language.
(SDI = Given a phrase, which SDs does it belong to?)
Motivation: See if SDI already has F1 > 0.30.
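
A minimal sketch of the evaluation, treating SDI as a multi-label classification problem; the gold and predicted mappings are toy inputs:

    from sklearn.metrics import f1_score
    from sklearn.preprocessing import MultiLabelBinarizer

    # phrase -> set of semantic domain codes (toy data)
    gold      = {"p1": {"4.9.5.6"}, "p2": {"1.1", "1.1.1"}}
    predicted = {"p1": {"4.9.5.6", "2.3"}, "p2": {"1.1"}}

    phrases = sorted(gold)
    mlb = MultiLabelBinarizer().fit([gold[p] | predicted[p] for p in phrases])
    y_true = mlb.transform([gold[p] for p in phrases])
    y_pred = mlb.transform([predicted[p] for p in phrases])
    print(f1_score(y_true, y_pred, average="micro"))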

Tasks

Evaluate LRL lemmatization quality

Goal

As a developer, I want to see how well the lemmatization of LRL works so that I can explain and improve the lemmatization approach.

Motivation: improve lemma groups -> reduce false positives -> increase precision

Tasks

  • Scatter plot word pairs by resource allocation index and Levenshtein distance (see the sketch after this list)
    • see clusters --> discover appropriate thresholds
    • evaluate on English --> show true and false connections in green and red
  • use English lemmas from WordNet as ground truth
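
A minimal sketch of the scatter plot, using NetworkX's resource allocation index and a stdlib ratio as a stand-in for normalized Levenshtein similarity; G (the word graph), pairs (candidate word pairs), and is_true_pair (WordNet-based ground truth) are assumed inputs:

    import matplotlib.pyplot as plt
    import networkx as nx
    from difflib import SequenceMatcher

    # Resource allocation index for each candidate pair in the word graph.
    rai = {(u, v): r for u, v, r in nx.resource_allocation_index(G, pairs)}

    xs = [rai[p] for p in pairs]
    ys = [SequenceMatcher(None, *p).ratio() for p in pairs]  # string similarity
    colors = ["green" if is_true_pair(p) else "red" for p in pairs]

    plt.scatter(xs, ys, c=colors, s=10)
    plt.xlabel("resource allocation index")
    plt.ylabel("string similarity")
    plt.show()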

Build graph database

Goal

We want to create a graph database (exposed via a GraphQL API) that includes the linguistic and geographical hierarchy of all spoken languages. The goal is to leverage this graph as input to GNNs. Each language node links to its monolingual texts (i.e., Bible translations) and semantic domain dictionaries.
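
A minimal sketch of one possible schema, using the graphene library to expose the graph via GraphQL; all type and field names are assumptions:

    import graphene

    # A language node that links its texts and dictionaries (hypothetical schema).
    class Language(graphene.ObjectType):
        code = graphene.String()                  # e.g., ISO 639-3
        family = graphene.String()                # linguistic hierarchy parent
        region = graphene.String()                # geographical hierarchy parent
        bible_translations = graphene.List(graphene.String)
        semantic_domain_dictionaries = graphene.List(graphene.String)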

Tasks

  • see A.2 in proposal
  • Fill up GraphQL database with enough data to train GNN models

Build simple model for Semantic Domain Identification

Goal

We want to build a simple model for SDI that serves as a baseline.
Motivation: SDI with F1 > 0.30 for one LRL (tpi/meu)

Tasks

  • Acquire mappings from verses to semantic domains #1
  • look up each word in a dictionary (see proposal: A.3.2.1 Dictionary Lookup; sketched after this list)
  • Add a test for semantic domain identifier
  • Check that test coverage is >= 98%
  • Look at todos in comments
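
A minimal sketch of the dictionary-lookup baseline; sd_dictionary (word -> set of SD codes) is a toy input:

    # Assign a phrase every semantic domain that any of its words maps to.
    def identify_semantic_domains(phrase, sd_dictionary):
        domains = set()
        for word in phrase.lower().split():
            domains |= sd_dictionary.get(word, set())
        return domains

    sd_dictionary = {"fire": {"5.5"}, "water": {"1.3"}}  # toy dictionary
    print(identify_semantic_domains("Fire and water", sd_dictionary))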

Build a simple GNN for dictionary creation

[Figure: GNN-based node classification and link prediction, from https://towardsdatascience.com/graph-neural-networks-with-pyg-on-node-classification-link-prediction-and-anomaly-detection-14aa38fe1275]

Goal

There is a GNN model that links LRL words with semantic domains.
Motivation: build GNN -> increase DC precision and recall > 0.30
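
A minimal sketch of such a model with PyTorch Geometric, in the spirit of the article linked above: word and SD nodes share one graph, and predicting a word--SD edge amounts to creating a dictionary entry. Layer sizes and the dot-product decoder are assumptions:

    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import GCNConv

    class LinkPredictor(torch.nn.Module):
        def __init__(self, in_dim, hidden_dim=64):
            super().__init__()
            self.conv1 = GCNConv(in_dim, hidden_dim)
            self.conv2 = GCNConv(hidden_dim, hidden_dim)

        def encode(self, x, edge_index):
            h = F.relu(self.conv1(x, edge_index))
            return self.conv2(h, edge_index)

        def decode(self, z, edge_pairs):
            # Score a candidate word--SD edge by the dot product of its endpoints.
            return (z[edge_pairs[0]] * z[edge_pairs[1]]).sum(dim=-1)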

Tasks

Notes

Create issues from todos in the code

Goal

There are no more todos in the code. It's more consistent to have them here on GitHub.
Motivation: see the bigger picture more clearly

Tasks

  • implement easy todos
  • create issues from all todos in the code or remove them

Try out AWESoME/eflomal word aligner


Goal

As a software developer, I want to try out replacing fast_align with the AWESoME word aligner to see if this improves the dictionary creator's F1 score.
Motivation: improve alignment -> reduce FPs -> increase DC's precision

Tasks

  • Install awesome-align (see the sketch after this list)
  • Adapt dictionary_creator to use eflomal instead of fast_align
  • Evaluate DC for eng-fra, eng-tpi without fine-tuning
  • Evaluate DC for eng-meu without fine-tuning
  • Fine-tune DC for eng-tpi with an NVIDIA/CUDA GPU or cluster (takes ~70 h on a notebook CPU)
  • Fine-tune DC for eng-meu
  • Evaluate DC for eng-fra, eng-tpi, eng-meu with fine-tuning
  • (Create follow-up issue: Try out Wada/SimAlign/eflomal)
  • Try out Eflomal word aligner
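
A minimal sketch of running the aligner, following the invocation in the awesome-align README; file paths are assumptions:

    import subprocess

    # Input: one 'source ||| target' sentence pair per line.
    # Output: Pharaoh-format word alignments (i-j index pairs).
    subprocess.run([
        "awesome-align",
        "--model_name_or_path", "bert-base-multilingual-cased",
        "--data_file", "eng_tpi.src-tgt",
        "--output_file", "eng_tpi.align",
        "--extraction", "softmax",
        "--batch_size", "32",
    ], check=True)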

F1* / MRR

Language pair | fast_align (baseline) | mBERT      | mBERT fine-tuned by AWESoME | eflomal
eng-eng       | 0.25, 0.40            | 0.26, 0.38 | 0.27, 0.38                  | 0.23, 0.39
eng-fra       | 0.24, 0.37            | 0.27, 0.38 | 0.27, 0.36                  | 0.26, 0.41
eng-tpi       | n/a, 0.28             | n/a, 0.19  | n/a, 0.20                   | n/a, 0.31
eng-meu       | n/a, 0.23             | n/a, 0.11  | n/a, 0.11                   | n/a, 0.21

Include semantic domains in graph


Goal

There are ~8000 more nodes in the graph: Each is a semantic domain (question), with links to each of its words in different languages. Creating dictionaries means finding these links.
Motivation: build GNN -> increase DC precision and recall > 0.30

Tasks

  • Move the whole graph structure to NetworkX.
  • Add semantic domain questions.
  • Add semantic domains.
  • (Without GNN: For each target word, add the semantic domains of all its translation candidates and select the most likely ones; see the sketch after this list.)
  • Compare both approaches.
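
A minimal sketch of the non-GNN baseline from the task above, assuming a NetworkX graph G that already holds weighted word-alignment edges and eng-word--SD edges, with SD nodes marked by a 'kind' attribute (names are assumptions):

    # For a target word, rank the semantic domains of its translation
    # candidates by accumulated alignment weight.
    def candidate_domains(G, target_word, top_k=3):
        scores = {}
        for eng_word, edge_data in G[target_word].items():  # translation candidates
            for neighbor in G[eng_word]:
                if G.nodes[neighbor].get("kind") == "semantic_domain":
                    scores[neighbor] = scores.get(neighbor, 0.0) + edge_data.get("weight", 1.0)
        return sorted(scores, key=scores.get, reverse=True)[:top_k]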

Notes

  • give up 1:1 idea
  • idea shift: map a group of words in one lang to a group of words in another language?
  • For example, we could recognize eng:polluted by adding the two connections to 4.9.5.6.

Automatically choose English Bible that aligns well

Goal

For each target Bible, we automatically choose an English Bible that aligns well to make the alignment more meaningful.
Motivation: improve alignment -> reduce FPs -> increase DC precision > 0.30

Tasks

  • add code to try out all available English translations for a given LRL translation and find the minimum perplexity / cross-entropy (see the sketch after this list)
    • does it contain many verses (also apocrypha)?
    • look if alignments have a low perplexity or low cross entropy
      • e.g., does it mention eng:Euphrates in the same verses in which tpi:Yufretis appears?
        • known seed of words
  • evaluate if these metrics correlate with F1 score
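
A minimal sketch of the seed-word check from the sub-task above; eng_bibles (translation id -> verse_id -> text), lrl_bible, and seed_pairs are assumed inputs:

    # Score each English translation by how often known seed pairs
    # (e.g., eng:Euphrates ~ tpi:Yufretis) co-occur in the same verse.
    def cooccurrence_score(eng_bible, lrl_bible, seed_pairs):
        hits = 0
        for verse_id, eng_text in eng_bible.items():
            lrl_text = lrl_bible.get(verse_id, "")
            for eng_word, lrl_word in seed_pairs:
                if eng_word in eng_text and lrl_word in lrl_text:
                    hits += 1
        return hits

    best = max(eng_bibles,
               key=lambda t: cooccurrence_score(eng_bibles[t], lrl_bible, seed_pairs))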

Evaluate dictionary creation on LRL

Goal

We can evaluate the dictionary creation on at least one LRL (and not only on HRL/MRL).
Motivation: evaluate DC -> create 1 LRL dict with F1 > 30%

Tasks

  • get LRL dictionaries (e.g., #8)
  • compute MRR for tpi (see the MRR sketch after this list)
  • compute MRR for meu
  • compute MRR for Daui
  • Add a test
  • Check that test coverage is >= 98%
  • Look at todos in comments
  • Review all changes and merge
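
A minimal sketch of the MRR computation; predictions (question -> ranked word list) and gold (question -> set of ground-truth words) are assumed inputs:

    # Mean reciprocal rank over the selected semantic domain questions.
    def mean_reciprocal_rank(predictions, gold):
        reciprocal_ranks = []
        for question, ranked_words in predictions.items():
            rr = 0.0
            for rank, word in enumerate(ranked_words, start=1):
                if word in gold[question]:
                    rr = 1.0 / rank
                    break
            reciprocal_ranks.append(rr)
        return sum(reciprocal_ranks) / len(reciprocal_ranks)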

Results

metric             | value | note
MRR for tpi        | 0.281 | 642 / 5059 tpi questions selected
MRR for meu        | 0.230 | 842 / 5052 meu questions selected
MRR for swp (Daui) | 0.171 | 1272 / 4518 swp questions selected

Observation

The MRR decreases as the number of selected questions increases. Apparently, getting more questions right is "more difficult".

Get Tok Pisin and Suau/Daui dictionaries

Goal

We have one dictionary each for Tok Pisin (tpi) and Suau/Daui (swp) that we can use to evaluate the dictionary creator on these two LRLs (and not only on HRLs/MRLs).

Tasks

Notes

  • Why did we choose Tok Pisin and Daui?
    • a) We have aligned Bibles in these languages.
    • b) SIL has language projects (and experts) for these languages.
  • Suau ~ Daui

Fix inconsistent loading in dictionary creator

Expected behavior

The dictionary_creator loads consistent progress (i.e., it continues with the same data that it saved).

Actual behavior

The dictionary_creator loads incomplete progress (e.g., word_graph).

Tasks

  • Implement self.progress_log = [] that stores all completed steps (see the sketch after this list).
  • Assert consistent loading
  • Add a test for inconsistent loading
  • Check that test coverage is >= 98%
  • Look at todos in comments
  • Review all changes and merge
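
A minimal sketch of the progress log from the first task; the artifact check is an assumption about how consistency would be asserted:

    class DictionaryCreator:
        def __init__(self):
            self.progress_log = []  # names of completed steps, in order

        def complete_step(self, step_name):
            self.progress_log.append(step_name)

        def assert_consistent_loading(self, loaded_artifacts):
            # Every logged step must have its saved artifact present,
            # so stale partial state (e.g., word_graph) is rejected.
            for step in self.progress_log:
                assert step in loaded_artifacts, f"missing artifact for step: {step}"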

Try out word segmentation approach for Greek words on LRL words

Goal

As a developer, I want to lemmatize LRL words by segmenting them (similar to the approach in this paper).
Motivation: improve lemma groups -> reduce false positives -> increase dictionary creation precision

Tasks

  • read approach of paper
  • e.g., use Byte-Pair Encoding (BPE)
    • use tokenizer in more fine-granular configuration?
  • try out https://github.com/google/sentencepiece (see the sketch after this list)
    • treats space as character
    • works for languages with no spaces
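
A minimal sketch of unsupervised segmentation with SentencePiece; the corpus file, vocabulary size, and model prefix are assumptions:

    import sentencepiece as spm

    # Train a BPE segmenter on the LRL side of the Bible corpus and use the
    # resulting subword pieces as lemma/morpheme candidates.
    spm.SentencePieceTrainer.train(
        input="tpi_bible.txt", model_prefix="tpi_bpe",
        vocab_size=8000, model_type="bpe",
    )
    sp = spm.SentencePieceProcessor(model_file="tpi_bpe.model")
    print(sp.encode("wanpela bikpela taun", out_type=str))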

Contract lemmatized LRL nodes

User Story: As a developer, I want to have only a single node for all lemmas of the same word in one language so that we collect more information per node, which should lead to a higher F1 score.
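
A minimal sketch using NetworkX's node contraction; G and the lemma groups are assumed inputs:

    import networkx as nx

    # Merge all lemma variants of a word into a single node so that the
    # alignment evidence accumulates on one node.
    def contract_lemma_group(G, base, variants):
        for variant in variants:
            if variant != base and variant in G:
                G = nx.contracted_nodes(G, base, variant, self_loops=False)
        return G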

Remove stop words and sentence tokens from GT SD data

Goal

As a developer, I want to remove stop words and sentence tokens from the ground-truth semantic domain data to make the matches of words with semantic domains more meaningful.
(I.e., we “sacrifice the minority for the majority.”)
Motivation: improve alignment -> reduce false positives -> increase dictionary creation precision
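
A minimal sketch of the cleaning step with NLTK's stop word list (requires nltk.download('stopwords') once); the sentence-token set is an assumption:

    from nltk.corpus import stopwords

    STOP_WORDS = set(stopwords.words("english"))
    SENTENCE_TOKENS = {".", ",", ";", ":", "?", "!"}

    # Drop stop words and sentence tokens from a ground-truth SD word list.
    def clean(words):
        return [w for w in words
                if w.lower() not in STOP_WORDS and w not in SENTENCE_TOKENS]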

Tasks

Build BOW model for Semantic Domain Identification

Goal

We want to build a bag-of-words (BOW) model for SDI. We hypothesize that it performs better than the simple baseline model.
Motivation: SDI with F1 > 0.30 for one LRL (tpi/meu)

Tasks

  • Acquire refined mappings from verses to semantic domains #1
  • use refined mappings from words in verses to SDs to assign SDs to words in verses from LRL
    • simply assign SDs in eng to each aligned word in LRL
    • if many false positive mappings (i.e., low precision): refine assignments with generated SD dicts for LRL (set intersection)
  • collect a BOW for every word with an assigned SD (the 2 words before and after the word in the middle)
  • aggregate BOWs by SD
  • perform SDI by extracting the BOW for every candidate word in the input sentence and computing the cosine distance to the aggregated BOW (see the sketch after this list)
  • try out baseline: look up each word in a dictionary
  • consider usefulness of WSD (word sense disambiguation) with pywsd or different tool: Eng verse → WordNet → SD (see Jonathan’s 2nd mail)
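
A minimal sketch of the BOW approach from the tasks above; labeled_verses (a list of (tokens, per-token SD label sets) pairs) is an assumed input:

    import math
    from collections import Counter, defaultdict

    def window(tokens, i, size=2):
        # The 2 words before and after the word in the middle.
        return tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]

    # Aggregate a BOW profile per semantic domain.
    sd_profiles = defaultdict(Counter)
    for tokens, labels in labeled_verses:
        for i, sds in enumerate(labels):
            for sd in sds:
                sd_profiles[sd].update(window(tokens, i))

    def cosine(c1, c2):
        num = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
        denom = math.sqrt(sum(v * v for v in c1.values())) \
              * math.sqrt(sum(v * v for v in c2.values()))
        return num / denom if denom else 0.0

    def identify(tokens, i):
        # Compare the candidate word's context BOW to every SD profile.
        bow = Counter(window(tokens, i))
        return max(sd_profiles, key=lambda sd: cosine(bow, sd_profiles[sd]))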

Build language information graph

Goal

We want to use a tree-structured graph that contains languages and their linguistic and geographic relations. We hypothesize that this improves the performance of the GNN-based edge prediction model for dictionary creation.
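
A minimal sketch of the graph structure; the nodes and edges below are placeholders, not actual Ethnologue/WALS classifications:

    import networkx as nx

    # Tree-structured graph: linguistic hierarchy plus geographic relations.
    lang_graph = nx.DiGraph()
    lang_graph.add_edge("family:F", "branch:F1", relation="linguistic")
    lang_graph.add_edge("branch:F1", "lang:xyz", relation="linguistic")
    lang_graph.add_edge("region:Oceania", "lang:xyz", relation="geographic")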

Tasks

  • see A.1.4 in proposal
  • acquire and parse language information in Ethnologue
  • acquire and parse language information in WALS

Acquire MARBLE Data

Goal

As a developer, I want to acquire the data from MARBLE to use their mappings from words to lexical domains. These mappings indicate to which lexical domain a verse (or word or phrase) belongs (for Hebrew and Greek).
Motivation: disambiguate words -> reduce FNs -> increase DC recall > 0.30

Tasks

  • get lexical domain mappings for the Old Testament
  • get lexical domain mappings for the New Testament
  • create a list of all data that we already can access
    • --> 2 parquet files
  • Parse mappings from words to lexical domains
  • think about how we can map lexical domains to SDs
  • see A.1.3 in proposal
  • try it out
  • Look at what we already have
    • a) mappings from words in verses to MARBLE domains
    • b) mappings from MARBLE domains to SDs
    • c) mappings from words in verses to SDs
  • refine c) with a) and b) (set intersection; see the sketch after this list)
  • (alternative idea: add lexical domains to the graph (#GNN))
  • (disambiguate example verse with lexical domains)
  • (create more labels by trying out the semi-automatic labeling function from Refinery (https://github.com/code-kern-ai/refinery), an NLP IDE.)
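
A minimal sketch of the set-intersection refinement from the task above; all three mappings are assumed to be keyed by (verse_id, word):

    # Keep a word->SD mapping from c) only if it is also reachable via the
    # MARBLE route a) + b).
    def refine(c_word_to_sds, a_word_to_marble, b_marble_to_sds):
        refined = {}
        for key, sds in c_word_to_sds.items():
            marble_domains = a_word_to_marble.get(key, set())
            marble_sds = set().union(
                *(b_marble_to_sds.get(d, set()) for d in marble_domains))
            refined[key] = sds & marble_sds
        return refined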

Notes

  • Lexical domains show, for example, whether a verse belongs to the lexical domain "sky".
  • The lexical domains are useful for disambiguating words.
    • because we know to which lexical domain a word in a verse belongs
    • e.g., "fire" can refer to burning in one verse and to moving quickly in another verse
  • Idea: Language-agnostic annotations are our starting point.
  • Lexical domains do not match semantic domains
  • Resources
