
treeswap's People

Contributors

attilanagy234, bana513, botondbarta, dependabot[bot], dorinapetra, patrick-nanys


treeswap's Issues

Save history

If it's okay with you too, replace the saving part of full_train.sh with the 8_save_history.sh script.
commit

Set up CI tooling to monitor breaking tests

At each commit to the master branch there is a chance that the tests are broken by that commit, if the author forgot to run the tests locally. In that case it is hard to see which commit broke the tests.

Set up a CI solution that is free to use and monitors for broken tests.
Run the tests when:

  • committing to the main branch
  • creating a PR

Design an experiment tracking solution

We need to run lots of experiments continuously, and it will be very hard to keep track of the results. We should come up with an easy-to-use solution to track experiments on the different augmentation methods. Nothing too complex: a lightweight tracking tool like MLflow or a well-crafted Google Sheet could do.
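
As a rough illustration, a minimal MLflow-based sketch could look like the following (the experiment name, parameters and metric value are purely illustrative assumptions, not something defined in this repo):

import mlflow

mlflow.set_experiment("treeswap-augmentation")  # illustrative experiment name

with mlflow.start_run(run_name="subject-swap-ratio-0.5"):
    # Log the knobs of the augmentation run and the resulting translation score.
    mlflow.log_param("augmentation_method", "subject_subtree_swapping")
    mlflow.log_param("augmentation_ratio", 0.5)
    mlflow.log_metric("bleu", 34.2)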

Please provide preprocessed dataset in your paper

Hi @attilanagy234

Thank you very much for releasing the source code of your great work!
I am interested in your paper, so I want to reproduce the reported results. Could you please provide all the preprocessed datasets used in your paper? For example, the datasets introduced in Table 1, including En-De, En-He, En-Vi and En-Hu.
Many thanks for your help!

Best regards,
Tran.

Speedup on English dependency parsing

English dependency parsing is currently 5 times slower than Hungarian. This is due to the difference in output format between Stanza and spaCy. There is an unnecessary usage of Pandas DataFrames during the dep parsing serialization here.
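
As a hedged sketch of what DataFrame-free serialization might look like with spaCy (the field order and tab-separated layout below are assumptions for illustration, not the repo's actual on-disk format):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline

def serialize_sentence(sentence):
    # Emit one token per line (index, word, lemma, POS, head index, dependency label)
    # straight from the spaCy doc, with no intermediate DataFrame.
    doc = nlp(sentence)
    return "\n".join(
        f"{token.i}\t{token.text}\t{token.lemma_}\t{token.pos_}\t{token.head.i}\t{token.dep_}"
        for token in doc
    )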

Make the subject object augmentator use less memory

Currently, running the subject object augmentator on a smaller part of the Hunglish 2 dataset causes it to consume a huge amount of memory. When spawning a process with 36 GB of RAM, the augmentator runs out of memory before it has even finished reading just the English side of the data, causing the process to crash.

When reading in the data we should constantly filter on-the-fly to lower memory usage.

If this method doesn't lower the memory usage enough, we need to come up with a more sophisticated approach.
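
A minimal sketch of the on-the-fly filtering idea, assuming a hypothetical is_eligible() predicate that stands in for whatever constraint check the augmentator applies:

def read_eligible_sentences(path, is_eligible):
    # Stream the corpus line by line and keep only eligible sentences,
    # so the whole file never has to be held in memory at once.
    with open(path, encoding="utf-8") as corpus:
        for line in corpus:
            sentence = line.strip()
            if is_eligible(sentence):
                yield sentence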

Read papers on dependency parsing based augmentation for NLP

There has been some work done around dependency parsing based augmentation for NLP, similar to ours. We need to review these works and get familiar with the basic concepts.
Some of these works:

It would be nice to additionally do some googling to see what kind of data augmentation has been done in NMT or if there are other relevant works on dependency parsing based augmentation for other NLP tasks.

The findings should be presented in a short PPT to the others.

Emtsv dependency parsing done from Docker

At the moment dep parsing for Hungarian is done by manually running a docker container, like:

cat input.txt | docker run -i mtaril/emtsv tok-dep > output.txt

It would be nice to integrate this into the augmentator pipeline, either by running emtsv from code, or launching it as a web server and making API calls.
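
One possible sketch, simply wrapping the Docker command above in a subprocess call from Python (the function name and the batching of sentences are assumptions):

import subprocess

def emtsv_tok_dep(sentences):
    # Pipe the raw Hungarian text into the emtsv container and return its output,
    # mirroring `cat input.txt | docker run -i mtaril/emtsv tok-dep`.
    result = subprocess.run(
        ["docker", "run", "-i", "mtaril/emtsv", "tok-dep"],
        input="\n".join(sentences),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout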

Fix failing tests

After setting up the CI tooling it became visible that some tests are failing.

  • Find out why the tests are failing and fix them
  • Completed when there is a PR with the tests passing

Investigate the datasets released alongside WMT21

Conference website: http://statmt.org/wmt21/

If we want to submit a paper, we likely need to work on datasets released alongside the workshop / conference.
These datasets and the tasks should be investigated and we need to find out the effort needed to make the data augmentator support other languages. This mostly boils down to creating the dependency parse trees for the given language-pair.

Fix multiprocessing

Currently multiprocessing is broken, as discussed in issue #62.

  • fix multiprocessing
  • create tests to validate the fix

Increase test coverage for faster development

In order to develop new code faster without introducing unknown breaking changes, we should increase the test coverage.

  • cover the most crucial parts of the code that are currently in use (and not yet covered)
    • entrypoints
    • dependency parser factory
    • subject object augmentator

Data analysis on the augmented samples

We should perform an exhaustive analysis on the augmented samples for each data augmentation method. This is very important to discover bugs or some constraints that we could apply to make the augmentations more refined.

problem about multiprocessing on jupyter notebook

Hi, I've been looking at some dependency parsing code recently and I ran into a problem when trying to reproduce your code.

code:
src/hu_nmt/data_augmentator/entrypoints/precompute_en_dependency_trees.py
def main(data_input_path, dep_tree_output_path, file_batch_size):
    eng_dep_parser = EnglishDependencyParser()
    eng_dep_parser.file_to_serialized_dep_graph_files(data_input_path, dep_tree_output_path, int(file_batch_size))

main(data_input_path=en_path, dep_tree_output_path=out_path, file_batch_size='10000')

problem:
TypeError: can't pickle _thread.lock objects

and the error log points here:
src/hu_nmt/data_augmentator/base/depedency_parser_base.py#L146
Could you help me with this? Looking forward to your reply.

Implement proper sampling of sentence-pairs for subj-obj augmentations

Right now the sampling of sentence pairs for subtree swapping is implemented as follows:

This is not correct, because it does not make the generated sentences diverse enough.

The sentence-pairs selected for augmentation should be matched randomly. One possible solution is to generate two index vectors of dimension X that point to sentences and create the sentence pairings accordingly. Cases where the two indices match must be dropped (it does not make sense to augment a sentence with itself), and we should resample until a sufficient number of samples is selected (see the sketch below).
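
A minimal sketch of this random pairing (the seed and the per-round vector size are arbitrary assumptions):

import numpy as np

def sample_sentence_pairs(num_sentences, num_pairs, seed=42):
    # Draw two index vectors, drop self-pairings and resample until enough
    # distinct sentence pairs have been collected.
    rng = np.random.default_rng(seed)
    pairs = []
    while len(pairs) < num_pairs:
        first = rng.integers(0, num_sentences, size=num_pairs)
        second = rng.integers(0, num_sentences, size=num_pairs)
        for i, j in zip(first, second):
            if i != j and len(pairs) < num_pairs:
                pairs.append((int(i), int(j)))
    return pairs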

Rerun every experiment with proper sampling

Depends on #11

We need to set baseline scores with the proper sampling method; for this we need to rerun the following experiments (augmentation ratio = 0.5):

  • subject based subtree swapping
  • object based subtree swapping
  • subject based subtree swapping with same predicate lemma
  • object based subtree swapping with same predicate lemma
  • predicate swapping

Create Python class to load trained model and perform inference

Some of the discussed directions build upon filtering the augmented sentences using a baseline model. For this we need to load the trained model in Python, preferably with a simple-to-use API such as:

model = load_model(path)
sent_in_lang_A = model.predict(sent_in_lang_B)
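
As an illustration of how such a model could then be used for filtering, a hedged sketch (the pair format, the threshold and the use of sacreBLEU as the similarity measure are all assumptions, not the project's chosen approach):

import sacrebleu

def filter_augmented_pairs(pairs, model, min_bleu=20.0):
    # Keep an augmented (source, target) pair only if the baseline model's
    # translation of the source is close enough to the augmented target.
    kept = []
    for source, target in pairs:
        hypothesis = model.predict(source)
        if sacrebleu.sentence_bleu(hypothesis, [target]).score >= min_bleu:
            kept.append((source, target))
    return kept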

Running the basic Hungarian -> English translation performs worse after updates

Previously the Hungarian -> English translation had a BLEU score of 34.2; now it scores 32.7. That is a huge difference considering that the seed used during training should be fixed.

Actions to take:

  1. Rerun the training and translation again to see if it gives yet another result (if it does, the seed is not fixed somewhere)
  2. Search through all the changes made to find which one caused the regression

Add dependency parsing support for more languages

Right now only English-Hungarian and Hungarian-English are supported by the framework. It would be desirable to have multiple language pairs supported, especially common ones published at conferences, e.g. German and French.
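
A hypothetical sketch of how new languages could be wired into a parser factory using off-the-shelf spaCy pipelines (the mapping and function name below are assumptions, not the framework's current factory):

import spacy

# Standard spaCy pipeline names for a few candidate languages.
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
}

def get_dependency_parser(lang):
    # Return a loaded spaCy pipeline for the requested language,
    # failing loudly for languages that are not configured yet.
    if lang not in SPACY_MODELS:
        raise ValueError(f"No dependency parser configured for language: {lang}")
    return spacy.load(SPACY_MODELS[lang])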

Analyze the constraints during the filtering of eligible sentence-pairs

We apply constraints when deriving sentences that we consider eligible for augmentation:

  • every sentence should contain one subject and one object
  • the ancestor of the subject and object should be the same
  • the subtree corresponding to the subject and object should contain consecutive words from the original sentence

This logic keeps about 2% of sentences. We should explore which constraint drops what portion of the data and optionally experiment with other constraints.

Reference: filtering logic
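
A minimal sketch of how the per-constraint drop rates could be measured (the constraint predicates are hypothetical placeholders for the checks listed above):

from collections import Counter

def constraint_report(sentence_graphs, constraints):
    # `constraints` maps a constraint name to a predicate over a dependency graph.
    # Count how many sentences each constraint rejects and how many pass them all.
    dropped = Counter()
    kept = 0
    for graph in sentence_graphs:
        failed = [name for name, check in constraints.items() if not check(graph)]
        if failed:
            dropped.update(failed)
        else:
            kept += 1
    return kept, dropped

The constraints dict would then map names like "one_subject_one_object", "same_ancestor" and "consecutive_subtree" to the corresponding checks, so the report shows which filter is responsible for most of the dropped data.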
