attilanagy234 / treeswap
Complementary code for our paper TreeSwap: Data Augmentation for Machine Translation via Dependency Subtree Swapping (RANLP 2023)
If it's okay with you too, replace the saving part of full_train.sh with the 8_save_history.sh script.
At each commit to the master branch there is a chance that the tests are broken by that commit, if whoever committed forgot to run the tests locally. In that case it's hard to see which commit broke the tests.
Set up a CI solution that is free to use and monitors for broken tests.
Run the tests when:
We need to run lots of experiments continuously, and it will be very hard to keep track of the results. We should come up with an easy-to-use solution to track experiments across the different augmentation methods. Nothing too complex: a lightweight tracking tool like MLflow or a well-crafted Google Sheet would do.
The networkx graph construction from the dependency trees and the emtsv parsing for Hungarian are currently not tested.
The (source_node, target_node, edge) triples are currently stored as plain tuples; it would be nicer to use namedtuples.
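A minimal sketch of what the change could look like (`DepEdge` and the field values are hypothetical names, not taken from the codebase):

```python
from collections import namedtuple

# Hypothetical named replacement for the plain (source_node, target_node, edge) tuples.
DepEdge = namedtuple('DepEdge', ['source_node', 'target_node', 'edge'])

e = DepEdge(source_node='eats', target_node='apple', edge='obj')
assert e.edge == 'obj'                 # fields are now accessible by name
assert e == ('eats', 'apple', 'obj')   # still behaves like a plain tuple
```

Because namedtuples compare equal to ordinary tuples, existing code that unpacks or compares the triples keeps working unchanged.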
Thank you very much for releasing the source code of your great work!
I am interested in your paper, so I want to reproduce the reported results. Could you please provide all the preprocessed datasets used in your paper? For example, the datasets introduced in Table 1, including En-De, En-He, En-Vi and En-Hu.
Many thanks for your help!
Best regards,
Tran.
English dependency parsing is currently 5 times slower than Hungarian parsing. This is due to the difference between the output formats of Stanza and spaCy: there is an unnecessary use of Pandas DataFrames during dependency-parse serialization here.
A Python module should:
We should be able to run the module from a shell script.
This is an essential component for #16.
Several similarity metrics could be explored:
Currently, running the subject-object augmentator on a smaller part of the Hunglish 2 dataset causes it to consume a huge amount of memory. Even when spawning a process with 36 GB of RAM, the augmentator runs out of memory before it even finishes reading the English dataset, causing the process to crash.
When reading in the data, we should filter on-the-fly to lower memory usage.
If this doesn't lower memory usage enough, we need to come up with a more sophisticated approach.
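One way to sketch the on-the-fly filtering is a generator that yields sentences one at a time instead of materializing the whole corpus; the length filter below is illustrative only, the real criteria live in the project:

```python
def read_filtered(path, max_len=40):
    """Stream sentences from a corpus file, dropping empty and overly long
    ones on the fly instead of loading everything into memory first.
    (Illustrative sketch; the actual filter conditions are project-specific.)"""
    with open(path, encoding='utf-8') as f:
        for line in f:
            sent = line.rstrip('\n')
            if 0 < len(sent.split()) <= max_len:
                yield sent
```

Downstream code can then consume the generator lazily (e.g. zip the two language streams), so peak memory stays proportional to one batch rather than the whole dataset.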
There has been some work on dependency-parsing-based augmentation for NLP, similar to ours. We need to review these works and get familiar with the basic concepts.
Some of these works:
It would also be worth doing some googling to see what kind of data augmentation has been done for NMT, and whether there are other relevant works on dependency-parsing-based augmentation for other NLP tasks.
The findings should be presented in a short PPT to the others.
At the moment, dependency parsing for Hungarian is done by manually running a Docker container, like:
cat input.txt | docker run -i mtaril/emtsv tok-dep > output.txt
It would be nice to integrate this into the augmentator pipeline, either by running emtsv from code, or launching it as a web server and making API calls.
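For the "running emtsv from code" option, a hedged sketch is to wrap the same Docker invocation with `subprocess` (the command is taken from the issue above; the injectable `cmd` parameter is added here purely for testability):

```python
import subprocess

def emtsv_parse(text, cmd=('docker', 'run', '-i', 'mtaril/emtsv', 'tok-dep')):
    """Run the emtsv tok-dep pipeline by piping text through its Docker image,
    mirroring `cat input.txt | docker run -i mtaril/emtsv tok-dep > output.txt`.
    `cmd` can be overridden, e.g. for testing without Docker."""
    result = subprocess.run(
        list(cmd), input=text, capture_output=True, text=True, check=True)
    return result.stdout
```

The web-server alternative would replace this with HTTP calls; the subprocess route keeps the pipeline self-contained but pays the container startup cost per invocation, so batching inputs into one call matters.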
Serialization of dependency trees before augmentation is fairly slow (tens of hours), because it uses only one core. Multiprocessing could help.
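A minimal parallel skeleton for the serialization step could look like the following; `parse_batch` is a stand-in for the real per-batch parsing and serialization work, which is not shown here:

```python
from multiprocessing import Pool

def parse_batch(sentences):
    # Placeholder for the real dependency parsing + serialization of one batch;
    # uppercasing just keeps this sketch runnable.
    return [s.upper() for s in sentences]

def parallel_serialize(batches, workers=4):
    """Fan batches of sentences out across worker processes instead of
    processing everything on a single core."""
    with Pool(processes=workers) as pool:
        return pool.map(parse_batch, batches)
```

Note that the worker function must be picklable (defined at module level), which is likely also relevant to the `TypeError: can't pickle _thread.lock objects` issue reported below.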
Currently the 1_build_vocab.sh script doesn't work if we want to build a shared vocabulary.
The script needs to be updated to handle this option.
The basic logic of swapping dependency subtrees should be tested to ensure no major mistakes.
After setting up the CI tooling, it is visible that some tests are failing.
Conference website: http://statmt.org/wmt21/
If we want to submit a paper, we likely need to work on datasets released alongside the workshop / conference.
These datasets and the tasks should be investigated and we need to find out the effort needed to make the data augmentator support other languages. This mostly boils down to creating the dependency parse trees for the given language-pair.
Currently multiprocessing is broken, as discussed in issue #62.
In order to develop new code faster without introducing unknown breaking changes, we should increase test coverage.
We should perform an exhaustive analysis of the augmented samples for each data augmentation method. This is very important for discovering bugs, or constraints we could apply to make the augmentations more refined.
Dep parsing in particular
The dependency graph model and the API of the subtree swapping should be improved, so that experimenting with the package in a Jupyter notebook is more comfortable.
Creating such a script would greatly improve the speed at which we can try out new things.
Hi, I've been looking at some dependency parsing code recently, and I ran into a problem when trying to reproduce your code.
code:
src/hu_nmt/data_augmentator/entrypoints/precompute_en_dependency_trees.py
def main(data_input_path, dep_tree_output_path, file_batch_size):
    eng_dep_parser = EnglishDependencyParser()
    eng_dep_parser.file_to_serialized_dep_graph_files(data_input_path, dep_tree_output_path, int(file_batch_size))

main(data_input_path=en_path, dep_tree_output_path=out_path, file_batch_size='10000')
problem:
TypeError: can't pickle _thread.lock objects
and the log points here:
src/hu_nmt/data_augmentator/base/depedency_parser_base.py#L146
Can you help me with this? Looking forward to your reply~
Right now the sampling of sentence pairs for subtree swapping is implemented as such:
This is not correct, because it does not make the generated sentences diverse enough.
The sentence pairs selected for augmentation should be matched randomly. One possible solution is to generate two vectors of dimension X containing indices that point to sentences, and create the sentence pairings accordingly. Cases where the indices match need to be dropped (it does not make sense to augment a sentence with itself), resampling until a sufficient number of samples is selected.
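The scheme above can be sketched as follows; this samples index pairs one at a time rather than as two whole vectors, but implements the same drop-and-resample rule for self-pairings:

```python
import random

def sample_pairs(n_sentences, n_pairs, seed=0):
    """Randomly match sentence indices for subtree swapping, dropping any
    pair that would match a sentence with itself and resampling instead
    (sketch of the sampling scheme described above)."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_pairs:
        i = rng.randrange(n_sentences)
        j = rng.randrange(n_sentences)
        if i != j:   # never augment a sentence with itself
            pairs.append((i, j))
    return pairs
```

Fixing the seed keeps the pairing reproducible across experiment reruns, which matters when comparing augmentation ratios.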
Preliminary step to #52
Depends on #11
Need to set baseline scores with the proper sampling method, for this we need to rerun the following experiments (augmentation ratio = 0.5):
Some of the discussed directions build upon filtering the augmented sentences using a baseline model. For this we need to load the trained model in Python, preferably with a simple-to-use API such as:
model = load_model(path)
sent_in_lang_A = model.predict(sent_in_lang_B)
Previously, the Hungarian -> English translation had a BLEU score of 34.2, and now it scores 32.7. That's a huge difference considering that the seed used during training should be fixed.
Actions to take:
This class needs to be modified.
Create shell script and write augmented results to tsvs
Right now only English-Hungarian and Hungarian-English are supported by the framework. It would be desirable to support multiple language pairs, especially common ones published at conferences, e.g. German, French.
str.replace('\xad', '-')
We apply constraints when deriving sentences that we consider eligible for augmentation:
This logic keeps only about 2% of sentences. We should explore which constraint drops what portion of the data, and optionally experiment with other constraints.
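A simple way to measure which constraint drops what portion of the data is to check the constraints in a fixed order and attribute each dropped sentence to the first failing one; the constraint names and predicates below are purely illustrative:

```python
from collections import Counter

def constraint_report(sentences, constraints):
    """Count how many sentences each named constraint rejects, attributing
    each drop to the first constraint that fails.
    `constraints` is a list of (name, predicate) pairs (illustrative)."""
    drops = Counter()
    kept = 0
    for sent in sentences:
        for name, ok in constraints:
            if not ok(sent):
                drops[name] += 1
                break
        else:
            kept += 1   # survived every constraint
    return kept, drops
```

Attributing to the first failure means the reported counts depend on the ordering; running the report with each constraint alone gives the complementary per-constraint view.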
Reference: filtering logic
The module should read precomputed dependency graphs from files both in Hungarian and English
Filtering should be done based on:
There is an overlap between the OPUS corpus and Hunglish, so we need to deduplicate (e.g. with a Bloom filter or LSH).
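As a baseline before reaching for a Bloom filter or LSH, exact-match dedup with hashed, normalized sentences is a few lines (the normalization here is a guess; LSH would additionally catch near-duplicates that differ by a few tokens):

```python
import hashlib

def dedup(sentences):
    """Drop exact duplicates by hashing a normalized form of each sentence.
    A Bloom filter would trade a small false-positive rate for much lower
    memory at corpus scale; LSH would also catch near-duplicates."""
    seen = set()
    for sent in sentences:
        h = hashlib.sha1(sent.strip().lower().encode('utf-8')).digest()
        if h not in seen:
            seen.add(h)
            yield sent
```

Storing 20-byte digests instead of the sentences themselves keeps the `seen` set small enough for millions of lines.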
This is a bug, because sentences containing '-' break during preprocessing.
Also need to reprocess dependency graphs.