attilanagy234 / treeswap
Complementary code for our paper TreeSwap: Data Augmentation for Machine Translation via Dependency Subtree Swapping (RANLP 2023)
If it's okay with you too, replace the saving part of full_train.sh with the 8_save_history.sh script.
At each commit to the master branch there is a chance that the tests are broken by that commit, if whoever committed forgot to run the tests locally. In that case it's hard to see which commit broke the tests.
Set up a CI solution that is free to use and monitors for broken tests.
Run the tests when:
We need to run lots of experiments continuously, and it will be very hard to keep track of the results. We should come up with an easy-to-use solution to track experiments across the different augmentation methods. Nothing too complex: a lightweight tracking tool like MLflow or a well-crafted Google Sheet would do.
The networkx graph construction from the dependency trees and the emtsv parsing for Hungarian are currently not tested.
The (source_node, target_node, edge) triples are currently stored as plain tuples; it would be nicer to use namedtuples.
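A minimal sketch of what the change could look like (`DepEdge` and the field values are hypothetical names, not taken from the codebase):

```python
from collections import namedtuple

# Hypothetical named replacement for the plain (source_node, target_node, edge) tuples.
DepEdge = namedtuple('DepEdge', ['source_node', 'target_node', 'edge'])

e = DepEdge(source_node='eats', target_node='apple', edge='obj')
assert e.edge == 'obj'                 # fields are now accessible by name
assert e == ('eats', 'apple', 'obj')   # still behaves like a plain tuple
```

Because namedtuples compare equal to ordinary tuples, existing code that unpacks or compares the triples keeps working unchanged.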
Thank you very much for releasing the source code of your great work!
I am interested in your paper, so I want to reproduce the reported results. Could you please provide all the preprocessed datasets used in your paper? For example, the datasets introduced in Table 1, including En-De, En-He, En-Vi and En-Hu.
Many thanks for your help!
Best regards,
Tran.
English dependency parsing is currently 5 times slower than Hungarian parsing. This is due to the difference between the output formats of Stanza and spaCy: there is an unnecessary use of Pandas DataFrames during dependency-parse serialization here.
A Python module should:
We should be able to run the module from a shell script.
This is an essential component for #16.
Several similarity metrics could be explored:
Currently, running the subject-object augmentator on a smaller part of the Hunglish 2 dataset causes it to consume a huge amount of memory. Even when spawning a process with 36 GB of RAM, the augmentator runs out of memory before it even finishes reading the English dataset, causing the process to crash.
When reading in the data, we should filter on-the-fly to lower memory usage.
If this doesn't lower memory usage enough, we need to come up with a more sophisticated approach.
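One way to sketch the on-the-fly filtering is a generator that yields sentences one at a time instead of materializing the whole corpus; the length filter below is illustrative only, the real criteria live in the project:

```python
def read_filtered(path, max_len=40):
    """Stream sentences from a corpus file, dropping empty and overly long
    ones on the fly instead of loading everything into memory first.
    (Illustrative sketch; the actual filter conditions are project-specific.)"""
    with open(path, encoding='utf-8') as f:
        for line in f:
            sent = line.rstrip('\n')
            if 0 < len(sent.split()) <= max_len:
                yield sent
```

Downstream code can then consume the generator lazily (e.g. zip the two language streams), so peak memory stays proportional to one batch rather than the whole dataset.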
There has been some work on dependency-parsing-based augmentation for NLP, similar to ours. We need to review these works and get familiar with the basic concepts.
Some of these works:
It would also be worth doing some googling to see what kind of data augmentation has been done for NMT, and whether there are other relevant works on dependency-parsing-based augmentation for other NLP tasks.
The findings should be presented in a short PPT to the others.
At the moment, dependency parsing for Hungarian is done by manually running a Docker container, like:
cat input.txt | docker run -i mtaril/emtsv tok-dep > output.txt
It would be nice to integrate this into the augmentator pipeline, either by running emtsv from code, or launching it as a web server and making API calls.
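For the "running emtsv from code" option, a hedged sketch is to wrap the same Docker invocation with `subprocess` (the command is taken from the issue above; the injectable `cmd` parameter is added here purely for testability):

```python
import subprocess

def emtsv_parse(text, cmd=('docker', 'run', '-i', 'mtaril/emtsv', 'tok-dep')):
    """Run the emtsv tok-dep pipeline by piping text through its Docker image,
    mirroring `cat input.txt | docker run -i mtaril/emtsv tok-dep > output.txt`.
    `cmd` can be overridden, e.g. for testing without Docker."""
    result = subprocess.run(
        list(cmd), input=text, capture_output=True, text=True, check=True)
    return result.stdout
```

The web-server alternative would replace this with HTTP calls; the subprocess route keeps the pipeline self-contained but pays the container startup cost per invocation, so batching inputs into one call matters.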
Serialization of dependency trees before augmentation is fairly slow (tens of hours), because it uses only one core. Multiprocessing could help.
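A minimal parallel skeleton for the serialization step could look like the following; `parse_batch` is a stand-in for the real per-batch parsing and serialization work, which is not shown here:

```python
from multiprocessing import Pool

def parse_batch(sentences):
    # Placeholder for the real dependency parsing + serialization of one batch;
    # uppercasing just keeps this sketch runnable.
    return [s.upper() for s in sentences]

def parallel_serialize(batches, workers=4):
    """Fan batches of sentences out across worker processes instead of
    processing everything on a single core."""
    with Pool(processes=workers) as pool:
        return pool.map(parse_batch, batches)
```

Note that the worker function must be picklable (defined at module level), which is likely also relevant to the `TypeError: can't pickle _thread.lock objects` issue reported below.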
Currently the 1_build_vocab.sh script doesn't work if we want to build a shared vocabulary.
The script needs to be updated to handle this option.
The basic logic of swapping dependency subtrees should be tested to ensure no major mistakes.
After setting up the CI tooling, it is visible that some tests are failing.
Conference website: http://statmt.org/wmt21/
If we want to submit a paper, we likely need to work on datasets released alongside the workshop / conference.
These datasets and the tasks should be investigated and we need to find out the effort needed to make the data augmentator support other languages. This mostly boils down to creating the dependency parse trees for the given language-pair.
Currently multiprocessing is broken, as discussed in issue #62.
In order to develop new code faster without introducing unknown breaking changes, we should increase test coverage.
We should perform an exhaustive analysis of the augmented samples for each data augmentation method. This is very important for discovering bugs, or constraints we could apply to make the augmentations more refined.
Dep parsing in particular
The dependency graph model and the API of the subtree swapping should be improved, so that experimenting with the package in a Jupyter notebook is more comfortable.
Creating such a script would greatly improve the speed at which we can try out new things.
Hi, I've been looking at some dependency parsing code recently, and I ran into a problem when trying to reproduce your code.
code:
src/hu_nmt/data_augmentator/entrypoints/precompute_en_dependency_trees.py
def main(data_input_path, dep_tree_output_path, file_batch_size):
    eng_dep_parser = EnglishDependencyParser()
    eng_dep_parser.file_to_serialized_dep_graph_files(data_input_path, dep_tree_output_path, int(file_batch_size))

main(data_input_path=en_path, dep_tree_output_path=out_path, file_batch_size='10000')
problem:
TypeError: can't pickle _thread.lock objects
and the log points here:
src/hu_nmt/data_augmentator/base/depedency_parser_base.py#L146
Can you help me with this? Looking forward to your reply~
Right now the sampling of sentence pairs for subtree swapping is implemented as such:
This is not correct, because it does not make the generated sentences diverse enough.
The sentence pairs selected for augmentation should be matched randomly. One possible solution is to generate two vectors of dimension X containing indices that point to sentences, and create the sentence pairings accordingly. Cases where the indices match need to be dropped (it does not make sense to augment a sentence with itself), resampling until a sufficient number of samples is selected.
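The scheme above can be sketched as follows; this samples index pairs one at a time rather than as two whole vectors, but implements the same drop-and-resample rule for self-pairings:

```python
import random

def sample_pairs(n_sentences, n_pairs, seed=0):
    """Randomly match sentence indices for subtree swapping, dropping any
    pair that would match a sentence with itself and resampling instead
    (sketch of the sampling scheme described above)."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_pairs:
        i = rng.randrange(n_sentences)
        j = rng.randrange(n_sentences)
        if i != j:   # never augment a sentence with itself
            pairs.append((i, j))
    return pairs
```

Fixing the seed keeps the pairing reproducible across experiment reruns, which matters when comparing augmentation ratios.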
Preliminary step to #52
Depends on #11
Need to set baseline scores with the proper sampling method, for this we need to rerun the following experiments (augmentation ratio = 0.5):
Some of the discussed directions build upon filtering the augmented sentences using a baseline model. For this we need to load the trained model in Python, preferably with a simple-to-use API such as:
model = load_model(path)
sent_in_lang_A = model.predict(sent_in_lang_B)
Previously, the Hungarian -> English translation had a BLEU score of 34.2, and now it scores 32.7. That's a huge difference considering that the seed used during training should be fixed.
Actions to take:
This class needs to be modified.
Create shell script and write augmented results to tsvs
Right now only English-Hungarian and Hungarian-English are supported by the framework. It would be desirable to support multiple language pairs, especially common ones published at conferences, e.g. German, French.
str.replace('\xad', '-')
We apply constraints when deriving sentences that we consider eligible for augmentation:
This logic keeps only about 2% of sentences. We should explore which constraint drops what portion of the data, and optionally experiment with other constraints.
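A simple way to measure which constraint drops what portion of the data is to check the constraints in a fixed order and attribute each dropped sentence to the first failing one; the constraint names and predicates below are purely illustrative:

```python
from collections import Counter

def constraint_report(sentences, constraints):
    """Count how many sentences each named constraint rejects, attributing
    each drop to the first constraint that fails.
    `constraints` is a list of (name, predicate) pairs (illustrative)."""
    drops = Counter()
    kept = 0
    for sent in sentences:
        for name, ok in constraints:
            if not ok(sent):
                drops[name] += 1
                break
        else:
            kept += 1   # survived every constraint
    return kept, drops
```

Attributing to the first failure means the reported counts depend on the ordering; running the report with each constraint alone gives the complementary per-constraint view.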
Reference: filtering logic
The module should read precomputed dependency graphs from files both in Hungarian and English
Filtering should be done based on:
There is an overlap between the OPUS corpus and Hunglish, so we need to deduplicate (e.g. with a Bloom filter or LSH).
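As a baseline before reaching for a Bloom filter or LSH, exact-match dedup with hashed, normalized sentences is a few lines (the normalization here is a guess; LSH would additionally catch near-duplicates that differ by a few tokens):

```python
import hashlib

def dedup(sentences):
    """Drop exact duplicates by hashing a normalized form of each sentence.
    A Bloom filter would trade a small false-positive rate for much lower
    memory at corpus scale; LSH would also catch near-duplicates."""
    seen = set()
    for sent in sentences:
        h = hashlib.sha1(sent.strip().lower().encode('utf-8')).digest()
        if h not in seen:
            seen.add(h)
            yield sent
```

Storing 20-byte digests instead of the sentences themselves keeps the `seen` set small enough for millions of lines.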
This is a bug, because sentences containing '-' break during preprocessing.
Also need to reprocess dependency graphs.