
text-analytics-project's Introduction

Automatic Complexity Assessment of German Sentences

Team Members

Leo Nguyen
Raoul Berger
Konrad Straube
Till Nocher

Mail Addresses

[email protected]
[email protected]
[email protected]

Pretrained models:

pretrained BERT from Deepset AI
pretrained word2vec from NLPL repository (model ID: 45)
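
A minimal sketch of how these pretrained models might be loaded with the listed libraries (the Hugging Face model name "bert-base-german-cased" and the local path to the unpacked NLPL word2vec file are assumptions, not taken from the repository code):

from gensim.models import KeyedVectors
from transformers import AutoModel, AutoTokenizer

# Deepset AI's German BERT; the hub name below is an assumption.
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
bert = AutoModel.from_pretrained("bert-base-german-cased")

# NLPL repository model 45 (word2vec); the file path is a placeholder
# for wherever the downloaded archive was extracted.
word2vec = KeyedVectors.load_word2vec_format("model.bin", binary=True)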

Additional Corpora Used

  • TextComplexityDE19: 7 levels of difficulty

    1100 Wikipedia articles, 100 of them in Simple German

  • Deutsche Welle - Deutsch Lernen: 2 levels of difficulty

  • WEEBIT English news corpus: 5 levels of difficulty, 625 documents each

Utilized libraries

antlr4-python3-runtime 4.8, appdirs 1.4.4, beautifulsoup4 4.9.3, black 20.8b1, blis 0.7.4, bs4 0.0.1, bz2file 0.98, cached-property 1.5.2, catalogue 2.0.1, certifi 2020.12.5, cffi 1.14.5, cfgv 3.2.0, chardet 4.0.0, click 7.1.2, cycler 0.10.0, cymem 2.0.5, Cython 0.29.21, dataclasses 0.6, distlib 0.3.1, fairseq 0.10.2, fastBPE 0.1.0, filelock 3.0.12, gensim 3.8.3, gitdb 4.0.5, GitPython 3.1.13, google-trans-new 1.1.9, h5py 3.1.0, hydra-core 1.0.6, identify 1.5.13, idna 2.10, importlib-metadata 3.4.0, importlib-resources 5.1.0, Jinja2 2.11.3, joblib 1.0.1, kiwisolver 1.3.1, langdetect 1.0.8, lxml 4.6.2, MarkupSafe 1.1.1, matplotlib 3.3.4, murmurhash 1.0.5, mypy-extensions 0.4.3, nlpaug 1.1.2, nltk 3.5, nodeenv 1.5.0, numexpr 2.7.2, numpy 1.20.1, omegaconf 2.0.6, packaging 20.9, pandas 1.2.2, pathspec 0.8.1, pathy 0.4.0, Pillow 8.1.0, plac 1.1.3, pluggy 0.13.1, portalocker 2.2.1, pre-commit 2.10.1, preshed 3.0.5, py 1.10.0, pycparser 2.20, pydantic 1.7.3, pygit 0.1, pyparsing 2.4.7, Pyphen 0.10.0, python-dateutil 2.8.1, pytz 2021.1, PyYAML 5.4.1, regex 2020.11.13, requests 2.25.1, sacrebleu 1.5.0, sacremoses 0.0.43, scikit-learn 0.24.1, scipy 1.6.1, six 1.15.0, sklearn 0.0, smart-open 3.0.0, smmap 3.0.5, soupsieve 2.2, spacy 3.0.3, spacy-legacy 3.0.1, srsly 2.4.0, stop-words 2018.7.23, tables 3.6.1, textstat 0.7.0, thinc 8.0.1, threadpoolctl 2.1.0, tokenizers 0.10.1, toml 0.10.2, torch 1.7.1, tox 3.22.0, tqdm 4.57.0, transformers 4.3.2, translate 3.5.0, typed-ast 1.4.2, typer 0.3.2, typing-extensions 3.7.4.3, urllib3 1.26.3, virtualenv 20.4.2, wasabi 0.8.2, zipp 3.4.0

Setup

Install dependencies

Install all necessary dependencies with:

pipenv install

Download datasets:

pipenv run main --download all

To download a specific dataset, replace 'all' with one of 'TextComplexityDE19', 'Weebit' or 'dw'.
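
For example, to download only the TextComplexityDE19 corpus:

pipenv run main --download TextComplexityDE19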

Preprocessing and Augmentation

Run preprocessing and augmentation on datasets and save results in h5 file:

pipenv run main --create_h5 --filename example.h5

Additional tags:

  • --dset with argument 0 = 'TextComplexityDE19', 1 = 'Weebit', 2 = 'dw'. Example: --dset 012 for all datasets.
  • --lemmatization
  • --stemming
  • --random_swap
  • --random_deletion

Example: apply lemmatization

pipenv run main --create_h5 --filename example.h5 --lemmatization

Note: basic preprocessing will always be applied
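
The tags can be combined. For example, to build an h5 file from TextComplexityDE19 and dw with lemmatization and random swap applied:

pipenv run main --create_h5 --filename example.h5 --dset 02 --lemmatization --random_swap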

Usage

Run experiment for specific vectorizer and regression method:

pipenv run main --experiment evaluate --filename example.h5 --vectorizer option --method option

Additional tag: --engineered_features (concatenate engineered features to the sentence vector)

Options:

  • vectorizer: 'tfidf', 'count', 'hash', 'word2vec', 'pretrained_word2vec'
  • method: 'linear', 'lasso', 'ridge', 'elastic-net', 'random-forest'
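
For example, to evaluate a TF-IDF vectorizer with ridge regression and the engineered features concatenated:

pipenv run main --experiment evaluate --filename example.h5 --vectorizer tfidf --method ridge --engineered_features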

Run all combinations of vectorizers and regression methods, with and without engineered features:

pipenv run main --experiment compare_all --filename example.h5

Run pretrained BERT + 3-layer regression network:

pipenv run main --experiment train_net --filename example.h5

Additional tag:

  • --save_name name (name to save the trained model under; used for training multiple models without overwriting the previous one. Default: the name specified with --filename)
  • --engineered_features (concatenate engineered features to sentence vector)

If multiple datasets were used, you have to specify conditional training by providing the tag --multiple_datasets.

The tag --pretask [pretask_epoch, pretask_file] will overwrite the --multiple_datasets tag. In that case, instead of conditional training, the model is first trained on a pretask (on the provided pretask_file for pretask_epoch epochs) and then fine-tuned on the dataset provided by --filename. Note that the first layer of the model will be frozen after the pretask. To allow fine-tuning of the first layer, use the tag --no_freeze.
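
For example, assuming the two --pretask values are passed in the documented order (the epoch count and the pretask h5 file name below are placeholders), to pretrain for 10 epochs on pretask.h5 and then fine-tune on example.h5 without freezing the first layer:

pipenv run main --experiment train_net --filename example.h5 --pretask 10 pretask.h5 --no_freeze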

Hyperparameter tuning for word2vec: linear search along a hyperparameter (generates plots; results are saved to a txt file):

pipenv run main --search [hyperparameter, start, end, step, model, filename]

  • hyperparameter: 'feature', 'window', 'count', 'epochs', 'lr' or 'min_lr'
  • start: start value of linear search
  • end: end value of linear search
  • step: step size of linear search
  • model: only option so far 'word2vec'
  • filename: h5 filename to load data from
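
For example, assuming the arguments are passed in the order listed above, to sweep the word2vec window size from 1 to 10 in steps of 1 on example.h5:

pipenv run main --search window 1 10 1 word2vec example.h5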

Note: experiment results are saved in the folders 'result', 'figures' and 'models'.

License

License: MIT

text-analytics-project's People

Contributors

thegialeo, raoulber, kostraub, jomazi, w4c, bockstaller

Stargazers

Dennis Aumiller

Watchers

James Cloos, Dennis Aumiller

text-analytics-project's Issues

Linting fail and workflow scripts

Regarding the linting failure Konrad mentioned: I also get "linting / black (3.7), Linting: all jobs have failed". It is caused by the scripts the tutors put in the workflow folder: https://github.com/Xenovortex/text-analytics-project/blob/master/.github/workflows/lint.yml and https://github.com/Xenovortex/text-analytics-project/blob/master/.github/workflows/test.yml. I had a look at them and apparently https://pypi.org/project/black/ is used to "unify" the code. However, it does not run through but hangs somewhere. Someone needs to take a closer look at this.

basic_stats() not working

Raoul, the basic_stats() function you added to visualize_data.py is not running:

line 112: plt.title("Mean word count per entry across datasets")

Because of this I cannot test my functions, since I cannot run the script at all. I made a copy named visualizer.py, in which I commented out your function so that I can keep working on my part. Once you fix the bug, please copy your function back to visualizer.py and delete the file visualize_data.py.

Augmentation

@RaoulBer put all augmentations in if statements, e.g. if backtranslation: (i.e. pass backtranslation as an argument to the augmentation function, with backtranslation=True or False) -> this gives the option to turn certain augmentations on and off and to compare performance.
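
A minimal sketch of the proposed interface (function and helper names are made up for illustration, not the repository's actual API):

def augment(sentences, backtranslation=False, random_swap=False, random_deletion=False):
    """Apply only the augmentations whose flags are set, so each technique
    can be switched on or off and its effect on performance compared."""
    augmented = list(sentences)
    if backtranslation:
        augmented = [backtranslate(s) for s in augmented]         # hypothetical helper
    if random_swap:
        augmented = [swap_random_words(s) for s in augmented]     # hypothetical helper
    if random_deletion:
        augmented = [delete_random_words(s) for s in augmented]   # hypothetical helper
    return augmented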

import of normalization in sentencestats.py

@konrad-straube please fix the imports and please test the code before pushing. Yesterday evening Raoul and I debugged from 4 pm until midnight so that everything runs. For everything added after 11 pm yesterday, please test and debug it yourself.

For certain settings, word2vec does not run through

Since our main dataset only consists of 1000 sentences (800 train, 200 test), there are configurations in which a sentence from the test set contains only words that the word2vec model has not seen during training. In that case, the array containing the word vectors of each word in the sentence will be empty (since we simply ignore words that the word2vec model cannot vectorize, and word2vec is only trained on the train set). However, taking the mean of all word vectors in the sentence, in this case of an empty array, yields nan, which terminates the code at a later point in the pipeline. This is not a code problem, but a design problem. If this case arises, either choose another vectorizer (count, tfidf) or choose another h5 file (i.e. change preprocessing and augmentation).
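
A short illustration of the failure mode (standalone example, not the repository's code):

import numpy as np

# Word vectors of a test sentence in which every word is out of vocabulary:
# no vector can be looked up, so the array of word vectors stays empty.
word_vectors = np.empty((0, 100))

# Averaging over the empty array yields nan (with a RuntimeWarning),
# and the nan sentence vector breaks later stages of the pipeline.
sentence_vector = word_vectors.mean(axis=0)
print(np.isnan(sentence_vector).all())  # True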

dataset location

I think it is better to have a separate folder for the dataset; the src folder should only contain the source code. This would make it easier to just run pdoc on the src folder and get the documentation from the source code only. But that is something we can discuss in a meeting. For the time being, we can leave it as it is. I will create a Trello card for it under Meeting Topics.
