guide's Issues

Evaluate GNN for dictionary creation

Goal

We use at least one metric to evaluate the GNNs used for dictionary creation.
Motivation: build GNN -> increase DC precision and recall > 0.30

Tasks

  • #22
  • Evaluate LRL

Notes

  • error analysis (examples where the model is failing)
    • error analysis is to ML what debugging is to conventional programming
  • look at learning curves
  • visualize learned representations (e.g., embeddings, nearest neighbors; see the sketch after this list)
  • simplify the problem/model
  • Make model explainable
    • apply data visualization
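
A minimal sketch for the "visualize learned representations" note, assuming a hypothetical dict embeddings that maps node names to vectors taken from the trained model:

    import numpy as np

    # Inspect a word's nearest neighbors in the learned embedding space.
    # `embeddings`: {node_name: np.ndarray} (assumed, not the project's actual API)
    def nearest_neighbors(word, embeddings, k=5):
        query = embeddings[word]
        sims = {
            other: float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
            for other, vec in embeddings.items() if other != word
        }
        return sorted(sims, key=sims.get, reverse=True)[:k]

If the neighbors of, say, fra:image are unrelated words, that points to a representation problem rather than a decoding problem.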

Reproduce Annotation Projection from Imani et al.

Goal

I want to project semantic domain questions as labels from eng and fra verses to deu and gej verses, for at least 10 verses.

Tasks

  • Align single verses instead of whole corpus
    • Save them as .conll files (see the sketch after this list)
  • Add all SemDom questions as labels
  • Run GNN with this data
  • Run on remote server
  • Run with 10 verses
  • Run with 4 languages
  • (try out XLM-R / static embeddings)
  • (try out multiple labels per token)
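
A minimal sketch for the .conll task, assuming each verse is a list of (token, projected_label) pairs; the two-column tab-separated layout and the example labels are assumptions, not a fixed format:

    # Write one token per line with its projected semantic domain label,
    # separated by a blank line between verses (CoNLL-style).
    def write_conll(path, verses):
        with open(path, "w", encoding="utf-8") as f:
            for tokens in verses:  # tokens: list of (word, label) pairs
                for word, label in tokens:
                    f.write(f"{word}\t{label}\n")
                f.write("\n")  # verse boundary

    write_conll("deu_projected.conll", [[("Haus", "6.5"), ("bauen", "6.5")]])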

Use word embeddings for English and LRL

Goal

We compute word embeddings for English words, fra/tpi/meu words, and semantic domains (by averaging the English words) to link the fra/tpi/meu words to the semantic domains.
(This issue is not part of the proposal. --> optional)
Motivation: fewer FPs and fewer FNs --> higher precision and recall --> F1 > 0.30
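
A minimal sketch of the linking step, using gensim's downloadable GloVe vectors as a stand-in for the actual English embeddings; sd_words (SD -> English word list) and the aligned English translations per LRL word are hypothetical inputs:

    import numpy as np
    import gensim.downloader as api

    eng_vectors = api.load("glove-wiki-gigaword-100")  # pre-trained English embeddings

    def average_vector(words):
        # Average the embeddings of all words that have one.
        vecs = [eng_vectors[w] for w in words if w in eng_vectors]
        return np.mean(vecs, axis=0) if vecs else None

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # One averaged vector per semantic domain (sd_words is assumed to exist).
    sd_vectors = {sd: v for sd, ws in sd_words.items()
                  if (v := average_vector(ws)) is not None}

    def link_to_domains(aligned_eng_words, top_k=3):
        # Represent an LRL word by its aligned English words, then rank SDs.
        wv = average_vector(aligned_eng_words)
        if wv is None:
            return []
        return sorted(sd_vectors, key=lambda sd: cosine(wv, sd_vectors[sd]),
                      reverse=True)[:top_k]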

Tasks

  • clarify the goal
  • Does the aligner (e.g., a fine-tuned AWESoME model) already use word embeddings?
  • clarify how this is different from node embeddings if words are nodes
  • look at Neural Cross-Lingual NER with Minimal Resources
    • try out MUSE
  • get word embeddings for English words (e.g., word2vec/Glove)
  • compute word embeddings for semantic domains
  • compute word embeddings for fra/LRL words
  • link fra/tpi/meu words to semantic domains
  • evaluate created dictionaries

Add links between similar words

Goal

Instead of contracted nodes, there are links between words that look like they belong together.
French examples:

ignoriez: {'ignore', 'ignorer', 'ignoriez'}
illuminé: {'illuminé', 'illuminés'}
image: {'images', 'image'}

Motivation: build GNN

Tasks

  • use the link contraction metric to add weighted links without contracting nodes (see the sketch below)
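
A minimal sketch of such weighted links, using NetworkX and a stdlib string-similarity measure as a stand-in for the project's link contraction metric; G and lemma_groups are assumed inputs:

    import networkx as nx
    from difflib import SequenceMatcher

    # Instead of contracting lemma candidates into one node, connect them with
    # weighted edges. `lemma_groups` maps a base form to its variants, e.g.
    # 'image' -> {'image', 'images'}.
    def add_similarity_links(G, lemma_groups, threshold=0.7):
        for base, variants in lemma_groups.items():
            for variant in variants:
                if variant == base:
                    continue
                weight = SequenceMatcher(None, base, variant).ratio()  # in [0, 1]
                if weight >= threshold:
                    G.add_edge(base, variant, weight=weight, kind="lemma")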

Acquire parallel data from Bloom Library

[Screenshot: Bloom Library dataset viewer, https://huggingface.co/datasets/sil-ai/bloom-lm/viewer/tpi/train]

Goal

As a developer, I want to use books from the Bloom Library as supplementary training data to improve the word alignment's quality. This would in turn increase the dictionary creator's precision.
Motivation: (More data beats more clever algorithms.) more parallel data -> improve alignment -> less FPs -> higher DC precision

Example

The Story of Jonah
eng: In those days there was a very large town where many people lived. The town's name was Nineveh.
tpi: Long dispela taim i gat wanpela bikpela taun i gat planti manmeri. Nem bilong dispela taun em Nineveh.
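
A minimal sketch for loading the Tok Pisin books with the Hugging Face datasets library, assuming the config name 'tpi' and split 'train' shown in the viewer URL above; the record fields are not assumed:

    from datasets import load_dataset

    # Load the Tok Pisin portion of the Bloom Library language-model dataset.
    bloom_tpi = load_dataset("sil-ai/bloom-lm", "tpi", split="train")
    print(len(bloom_tpi))
    print(bloom_tpi[0])  # inspect one record to see the available fields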

Tasks

Evaluate semantic domain identification

Goal

We use the mappings between words and SDs to calculate the SDI's F1 score for at least one language.
(SDI = Given a phrase, which SDs does it belong to?)
Motivation: See if SDI already has F1 > 0.30.
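
A minimal sketch of the evaluation, treating SDI as a multi-label classification problem; the gold and predicted mappings are toy inputs:

    from sklearn.metrics import f1_score
    from sklearn.preprocessing import MultiLabelBinarizer

    # phrase -> set of semantic domain codes (toy data)
    gold      = {"p1": {"4.9.5.6"}, "p2": {"1.1", "1.1.1"}}
    predicted = {"p1": {"4.9.5.6", "2.3"}, "p2": {"1.1"}}

    phrases = sorted(gold)
    mlb = MultiLabelBinarizer().fit([gold[p] | predicted[p] for p in phrases])
    y_true = mlb.transform([gold[p] for p in phrases])
    y_pred = mlb.transform([predicted[p] for p in phrases])
    print(f1_score(y_true, y_pred, average="micro"))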

Tasks

Evaluate LRL lemmatization quality

Goal

As a developer, I want to see how well the lemmatization of LRL works so that I can explain and improve the lemmatization approach.

Motivation: improve lemma groups -> reduce false positives -> increase precision

Tasks

  • Scatter plot word pairs by resource allocation index and Levenshtein distance (see the sketch after this list)
    • see clusters --> discover appropriate thresholds
    • evaluate on English --> show true and false connections in green and red
  • use English lemmas from WordNet as ground truth
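
A minimal sketch of the scatter plot, using NetworkX's resource allocation index and a stdlib ratio as a stand-in for normalized Levenshtein similarity; G (the word graph), pairs (candidate word pairs), and is_true_pair (WordNet-based ground truth) are assumed inputs:

    import matplotlib.pyplot as plt
    import networkx as nx
    from difflib import SequenceMatcher

    # Resource allocation index for each candidate pair in the word graph.
    rai = {(u, v): r for u, v, r in nx.resource_allocation_index(G, pairs)}

    xs = [rai[p] for p in pairs]
    ys = [SequenceMatcher(None, *p).ratio() for p in pairs]  # string similarity
    colors = ["green" if is_true_pair(p) else "red" for p in pairs]

    plt.scatter(xs, ys, c=colors, s=10)
    plt.xlabel("resource allocation index")
    plt.ylabel("string similarity")
    plt.show()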

Build graph database

Goal

We want to create a graph database (exposed via a GraphQL API) that includes the linguistic and geographical hierarchy of all spoken languages. The goal is to leverage this graph as input to GNNs. Each language node links to its monolingual texts (i.e., Bible translations) and semantic domain dictionaries.
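
A minimal sketch of one possible schema, using the graphene library to expose the graph via GraphQL; all type and field names are assumptions:

    import graphene

    # A language node that links its texts and dictionaries (hypothetical schema).
    class Language(graphene.ObjectType):
        code = graphene.String()                  # e.g., ISO 639-3
        family = graphene.String()                # linguistic hierarchy parent
        region = graphene.String()                # geographical hierarchy parent
        bible_translations = graphene.List(graphene.String)
        semantic_domain_dictionaries = graphene.List(graphene.String)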

Tasks

  • see A.2 in proposal
  • Fill up GraphQL database with enough data to train GNN models

Build simple model for Semantic Domain Identification

Goal

We want to build a simple model for SDI that serves as a baseline.
Motivation: SDI with F1 > 0.30 for one LRL (tpi/meu)

Tasks

  • Acquire mappings from verses to semantic domains #1
  • look up each word in a dictionary (see proposal: A.3.2.1 Dictionary Lookup; sketched after this list)
  • Add a test for semantic domain identifier
  • Check that test coverage is >= 98%
  • Look at todos in comments
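
A minimal sketch of the dictionary-lookup baseline; sd_dictionary (word -> set of SD codes) is a toy input:

    # Assign a phrase every semantic domain that any of its words maps to.
    def identify_semantic_domains(phrase, sd_dictionary):
        domains = set()
        for word in phrase.lower().split():
            domains |= sd_dictionary.get(word, set())
        return domains

    sd_dictionary = {"fire": {"5.5"}, "water": {"1.3"}}  # toy dictionary
    print(identify_semantic_domains("Fire and water", sd_dictionary))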

Build a simple GNN for dictionary creation

[Figure: GNN-based node classification and link prediction, from https://towardsdatascience.com/graph-neural-networks-with-pyg-on-node-classification-link-prediction-and-anomaly-detection-14aa38fe1275]

Goal

There is a GNN model that links LRL words with semantic domains.
Motivation: build GNN -> increase DC precision and recall > 0.30
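
A minimal sketch of such a model with PyTorch Geometric, in the spirit of the article linked above: word and SD nodes share one graph, and predicting a word--SD edge amounts to creating a dictionary entry. Layer sizes and the dot-product decoder are assumptions:

    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import GCNConv

    class LinkPredictor(torch.nn.Module):
        def __init__(self, in_dim, hidden_dim=64):
            super().__init__()
            self.conv1 = GCNConv(in_dim, hidden_dim)
            self.conv2 = GCNConv(hidden_dim, hidden_dim)

        def encode(self, x, edge_index):
            h = F.relu(self.conv1(x, edge_index))
            return self.conv2(h, edge_index)

        def decode(self, z, edge_pairs):
            # Score a candidate word--SD edge by the dot product of its endpoints.
            return (z[edge_pairs[0]] * z[edge_pairs[1]]).sum(dim=-1)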

Tasks

Notes

Create issues from todos in the code

Goal

There are no more todos in the code. It's more consistent to have them here on GitHub.
Motivation: see the bigger picture more clearly

Tasks

  • implement easy todos
  • create issues from all todos in the code or remove them

Try out AWESoME/eflomal word aligner


Goal

As a software developer, I want to try out replacing fast_align with the AWESoME word aligner to see if this improves the dictionary creator's F1 score.
Motivation: improve alignment -> reduce FPs -> increase DC's precision

Tasks

  • Install awesome-align (see the sketch after this list)
  • Adapt dictionary_creator to use eflomal instead of fast_align
  • Evaluate DC for eng-fra, eng-tpi without fine-tuning
  • Evaluate DC for eng-meu without fine-tuning
  • Fine-tune DC for eng-tpi with an NVIDIA/CUDA GPU or cluster (takes ~70 h on a notebook CPU)
  • Fine-tune DC for eng-meu
  • Evaluate DC for eng-fra, eng-tpi, eng-meu with fine-tuning
  • (Create follow-up issue: Try out Wada/SimAlign/eflomal)
  • Try out Eflomal word aligner
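
A minimal sketch of running the aligner, following the invocation in the awesome-align README; file paths are assumptions:

    import subprocess

    # Input: one 'source ||| target' sentence pair per line.
    # Output: Pharaoh-format word alignments (i-j index pairs).
    subprocess.run([
        "awesome-align",
        "--model_name_or_path", "bert-base-multilingual-cased",
        "--data_file", "eng_tpi.src-tgt",
        "--output_file", "eng_tpi.align",
        "--extraction", "softmax",
        "--batch_size", "32",
    ], check=True)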

F1* / MRR

Language pair | fast_align (baseline) | mBERT      | mBERT fine-tuned by AWESoME | eflomal
eng-eng       | 0.25, 0.40            | 0.26, 0.38 | 0.27, 0.38                  | 0.23, 0.39
eng-fra       | 0.24, 0.37            | 0.27, 0.38 | 0.27, 0.36                  | 0.26, 0.41
eng-tpi       | n/a, 0.28             | n/a, 0.19  | n/a, 0.20                   | n/a, 0.31
eng-meu       | n/a, 0.23             | n/a, 0.11  | n/a, 0.11                   | n/a, 0.21

Include semantic domains in graph


Goal

There are ~8000 more nodes in the graph: Each is a semantic domain (question), with links to each of its words in different languages. Creating dictionaries means finding these links.
Motivation: build GNN -> increase DC precision and recall > 0.30

Tasks

  • Move the whole graph structure to NetworkX.
  • Add semantic domain questions.
  • Add semantic domains.
  • (Without GNN: For each target word, add the semantic domains of all its translation candidates and select the most likely ones; see the sketch after this list.)
  • Compare both approaches.
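
A minimal sketch of the non-GNN baseline from the task above, assuming a NetworkX graph G that already holds weighted word-alignment edges and eng-word--SD edges, with SD nodes marked by a 'kind' attribute (names are assumptions):

    # For a target word, rank the semantic domains of its translation
    # candidates by accumulated alignment weight.
    def candidate_domains(G, target_word, top_k=3):
        scores = {}
        for eng_word, edge_data in G[target_word].items():  # translation candidates
            for neighbor in G[eng_word]:
                if G.nodes[neighbor].get("kind") == "semantic_domain":
                    scores[neighbor] = scores.get(neighbor, 0.0) + edge_data.get("weight", 1.0)
        return sorted(scores, key=scores.get, reverse=True)[:top_k]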

Notes

  • give up 1:1 idea
  • idea shift: map a group of words in one lang to a group of words in another language?
  • For example, we could recognize eng:polluted by adding the two connections to 4.9.5.6.

Automatically choose English Bible that aligns well

Goal

For each target Bible, we automatically choose an English Bible that aligns well to make the alignment more meaningful.
Motivation: improve alignment -> reduce FPs -> increase DC precision > 0.30

Tasks

  • add code to try out all available English translations for a given LRL translation and find the minimum perplexity / cross-entropy (see the sketch after this list)
    • does it contain many verses (also apocrypha)?
    • look if alignments have a low perplexity or low cross entropy
      • e.g., does it mention eng:Euphrates in the same verses in which tpi:Yufretis appears?
        • known seed of words
  • evaluate if these metrics correlate with F1 score
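
A minimal sketch of the seed-word check from the sub-task above; eng_bibles (translation id -> verse_id -> text), lrl_bible, and seed_pairs are assumed inputs:

    # Score each English translation by how often known seed pairs
    # (e.g., eng:Euphrates ~ tpi:Yufretis) co-occur in the same verse.
    def cooccurrence_score(eng_bible, lrl_bible, seed_pairs):
        hits = 0
        for verse_id, eng_text in eng_bible.items():
            lrl_text = lrl_bible.get(verse_id, "")
            for eng_word, lrl_word in seed_pairs:
                if eng_word in eng_text and lrl_word in lrl_text:
                    hits += 1
        return hits

    best = max(eng_bibles,
               key=lambda t: cooccurrence_score(eng_bibles[t], lrl_bible, seed_pairs))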

Evaluate dictionary creation on LRL

Goal

We can evaluate the dictionary creation on at least one LRL (and not only on HRL/MRL).
Motivation: evaluate DC -> create 1 LRL dict with F1 > 30%

Tasks

  • get LRL dictionaries (e.g., #8)
  • compute MRR for tpi (see the MRR sketch after this list)
  • compute MRR for meu
  • compute MRR for Daui
  • Add a test
  • Check that test coverage is >= 98%
  • Look at todos in comments
  • Review all changes and merge
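
A minimal sketch of the MRR computation; predictions (question -> ranked word list) and gold (question -> set of ground-truth words) are assumed inputs:

    # Mean reciprocal rank over the selected semantic domain questions.
    def mean_reciprocal_rank(predictions, gold):
        reciprocal_ranks = []
        for question, ranked_words in predictions.items():
            rr = 0.0
            for rank, word in enumerate(ranked_words, start=1):
                if word in gold[question]:
                    rr = 1.0 / rank
                    break
            reciprocal_ranks.append(rr)
        return sum(reciprocal_ranks) / len(reciprocal_ranks)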

Results

metric             | value | note
MRR for tpi        | 0.281 | 642 / 5059 tpi questions selected
MRR for meu        | 0.230 | 842 / 5052 meu questions selected
MRR for swp (Daui) | 0.171 | 1272 / 4518 swp questions selected

Observation

The MRR decreases as the number of selected questions increases. Apparently, getting more questions right is "more difficult".

Get Tok Pisin and Suau/Daui dictionaries

Goal

We have one dictionary each for Tok Pisin (tpi) and Suau/Daui (swp) that we can use to evaluate the dictionary creator on these two LRLs (and not only on HRLs/MRLs).

Tasks

Notes

  • Why did we choose Tok Pisin and Daui?
    • a) We have aligned Bibles in these languages.
    • b) SIL has language projects (and experts) for these languages.
  • Suau ~ Daui

Fix inconsistent loading in dictionary creator

Expected behavior

The dictionary_creator loads consistent progress (i.e., it continues with the same data that it saved).

Actual behavior

The dictionary_creator loads incomplete progress (e.g., word_graph).

Tasks

  • Implement self.progress_log = [] that stores all completed steps (see the sketch after this list).
  • Assert consistent loading
  • Add a test for inconsistent loading
  • Check that test coverage is >= 98%
  • Look at todos in comments
  • Review all changes and merge
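
A minimal sketch of the progress log from the first task; the artifact check is an assumption about how consistency would be asserted:

    class DictionaryCreator:
        def __init__(self):
            self.progress_log = []  # names of completed steps, in order

        def complete_step(self, step_name):
            self.progress_log.append(step_name)

        def assert_consistent_loading(self, loaded_artifacts):
            # Every logged step must have its saved artifact present,
            # so stale partial state (e.g., word_graph) is rejected.
            for step in self.progress_log:
                assert step in loaded_artifacts, f"missing artifact for step: {step}"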

Try out word segmentation approach for Greek words on LRL words

Goal

As a developer, I want to lemmatize LRL words by segmenting them (similar to the approach in this paper).
Motivation: improve lemma groups -> reduce false positives -> increase dictionary creation precision

Tasks

  • read approach of paper
  • e.g., use Byte-Pair Encoding (BPE)
    • use tokenizer in more fine-granular configuration?
  • try out https://github.com/google/sentencepiece (see the sketch after this list)
    • treats space as character
    • works for languages with no spaces
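
A minimal sketch of unsupervised segmentation with SentencePiece; the corpus file, vocabulary size, and model prefix are assumptions:

    import sentencepiece as spm

    # Train a BPE segmenter on the LRL side of the Bible corpus and use the
    # resulting subword pieces as lemma/morpheme candidates.
    spm.SentencePieceTrainer.train(
        input="tpi_bible.txt", model_prefix="tpi_bpe",
        vocab_size=8000, model_type="bpe",
    )
    sp = spm.SentencePieceProcessor(model_file="tpi_bpe.model")
    print(sp.encode("wanpela bikpela taun", out_type=str))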

Contract lemmatized LRL nodes

User Story: As a developer, I want to have only a single node for all lemmas of the same word in one language so that we collect more information per node, which should lead to a higher F1 score.
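
A minimal sketch using NetworkX's node contraction; G and the lemma groups are assumed inputs:

    import networkx as nx

    # Merge all lemma variants of a word into a single node so that the
    # alignment evidence accumulates on one node.
    def contract_lemma_group(G, base, variants):
        for variant in variants:
            if variant != base and variant in G:
                G = nx.contracted_nodes(G, base, variant, self_loops=False)
        return G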

Remove stop words and sentence tokens from GT SD data

Goal

As a developer, I want to remove stop words and sentence tokens from the ground-truth semantic domain data to make the matches of words with semantic domains more meaningful.
(I.e., we “sacrifice the minority for the majority.”)
Motivation: improve alignment -> reduce false positives -> increase dictionary creation precision
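
A minimal sketch of the cleaning step with NLTK's stop word list (requires nltk.download('stopwords') once); the sentence-token set is an assumption:

    from nltk.corpus import stopwords

    STOP_WORDS = set(stopwords.words("english"))
    SENTENCE_TOKENS = {".", ",", ";", ":", "?", "!"}

    # Drop stop words and sentence tokens from a ground-truth SD word list.
    def clean(words):
        return [w for w in words
                if w.lower() not in STOP_WORDS and w not in SENTENCE_TOKENS]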

Tasks

Build BOW model for Semantic Domain Identification

Goal

We want to build a bag-of-words (BOW) model for SDI. We hypothesize that it performs better than the simple baseline model.
Motivation: SDI with F1 > 0.30 for one LRL (tpi/meu)

Tasks

  • Acquire refined mappings from verses to semantic domains #1
  • use refined mappings from words in verses to SDs to assign SDs to words in verses from LRL
    • simply assign SDs in eng to each aligned word in LRL
    • if many false positive mappings (i.e., low precision): refine assignments with generated SD dicts for LRL (set intersection)
  • collect a BOW for every word with an assigned SD (the 2 words before and after the word in the middle)
  • aggregate BOWs by SD
  • perform SDI by extracting the BOW for every candidate word in the input sentence and computing the cosine distance to the aggregated BOW (see the sketch after this list)
  • try out baseline: look up each word in a dictionary
  • consider usefulness of WSD (word sense disambiguation) with pywsd or different tool: Eng verse → WordNet → SD (see Jonathan’s 2nd mail)
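
A minimal sketch of the BOW approach from the tasks above; labeled_verses (a list of (tokens, per-token SD label sets) pairs) is an assumed input:

    import math
    from collections import Counter, defaultdict

    def window(tokens, i, size=2):
        # The 2 words before and after the word in the middle.
        return tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]

    # Aggregate a BOW profile per semantic domain.
    sd_profiles = defaultdict(Counter)
    for tokens, labels in labeled_verses:
        for i, sds in enumerate(labels):
            for sd in sds:
                sd_profiles[sd].update(window(tokens, i))

    def cosine(c1, c2):
        num = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
        denom = math.sqrt(sum(v * v for v in c1.values())) \
              * math.sqrt(sum(v * v for v in c2.values()))
        return num / denom if denom else 0.0

    def identify(tokens, i):
        # Compare the candidate word's context BOW to every SD profile.
        bow = Counter(window(tokens, i))
        return max(sd_profiles, key=lambda sd: cosine(bow, sd_profiles[sd]))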

Build language information graph

Goal

We want to use a tree-structured graph that contains languages and their linguistic and geographic relations. We hypothesize that this improves the performance of the GNN-based edge prediction model for dictionary creation.
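
A minimal sketch of the graph structure; the nodes and edges below are placeholders, not actual Ethnologue/WALS classifications:

    import networkx as nx

    # Tree-structured graph: linguistic hierarchy plus geographic relations.
    lang_graph = nx.DiGraph()
    lang_graph.add_edge("family:F", "branch:F1", relation="linguistic")
    lang_graph.add_edge("branch:F1", "lang:xyz", relation="linguistic")
    lang_graph.add_edge("region:Oceania", "lang:xyz", relation="geographic")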

Tasks

  • see A.1.4 in proposal
  • acquire and parse language information in Ethnologue
  • acquire and parse language information in WALS

Acquire MARBLE Data

Goal

As a developer, I want to acquire the data from MARBLE to use their mappings from words to lexical domains. These mappings indicate to which lexical domain a verse (or word or phrase) belongs (for Hebrew and Greek).
Motivation: disambiguate words -> reduce FNs -> increase DC recall > 0.30

Tasks

  • get lexical domain mappings for the Old Testament
  • get lexical domain mappings for the New Testament
  • create a list of all data that we already can access
    • --> 2 parquet files
  • Parse mappings from words to lexical domains
  • think about how we can map lexical domains to SDs
  • see A.1.3 in proposal
  • try it out
  • Look at what we already have
    • a) mappings from words in verses to MARBLE domains
    • b) mappings from MARBLE domains to SDs
    • c) mappings from words in verses to SDs
  • refine c) with a) and b) (set intersection; see the sketch after this list)
  • (alternative idea: add lexical domains to the graph (#GNN))
  • (disambiguate example verse with lexical domains)
  • (create more labels by trying out the semi-automatic labeling function from Refinery (https://github.com/code-kern-ai/refinery), an NLP IDE.)
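
A minimal sketch of the set-intersection refinement from the task above; all three mappings are assumed to be keyed by (verse_id, word):

    # Keep a word->SD mapping from c) only if it is also reachable via the
    # MARBLE route a) + b).
    def refine(c_word_to_sds, a_word_to_marble, b_marble_to_sds):
        refined = {}
        for key, sds in c_word_to_sds.items():
            marble_domains = a_word_to_marble.get(key, set())
            marble_sds = set().union(
                *(b_marble_to_sds.get(d, set()) for d in marble_domains))
            refined[key] = sds & marble_sds
        return refined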

Notes

  • Lexical domains show, for example, whether a verse belongs to the lexical domain "sky".
  • The lexical domains are useful for disambiguating words.
    • because we know to which lexical domain a word in a verse belongs
    • e.g., "fire" can refer to burning in one verse and to moving quickly in another verse
  • Idea: Language-agnostic annotations are our starting point.
  • Lexical domains do not match semantic domains
  • Resources
