Incremental Parsing Representations

A high-accuracy incremental parser implemented in Python. It is based on "Learned Incremental Representations for Parsing" (ACL 2022), which builds on "Constituency Parsing with a Self-Attentive Encoder" (ACL 2018), with additional changes described in "Multilingual Constituency Parsing with Self-Attention and Pre-Training".

Notebooks

Open in Colab: this notebook gives an example of inference using the model we trained for our paper.

Contents

  1. Training
  2. Reproducing Experiments
  3. Citation
  4. Credits

If you are primarily interested in training your own parsing models, skip to the Training section of this README.

Training

Training requires cloning this repository from GitHub. While the model code in src/benepar is based on the benepar package on PyPI, much of the related code and training and evaluation scripts directly under src/ are not.

Software Requirements for Training

  • Python 3.7 or higher.
  • PyTorch 1.6.0, or any compatible version.
  • All dependencies required by the benepar package, including: NLTK 3.2, torch-struct 0.4, transformers 4.3.0, or compatible.
  • pytokenizations 0.7.2 or compatible.
  • EVALB. Before starting, run make inside the EVALB/ directory to compile an evalb executable. This will be called from Python for evaluation. If training on the SPMRL datasets, you will need to run make inside the EVALB_SPMRL/ directory instead.
  • clusopt
  • torch_struct
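Before training, it can help to confirm which of the packages above are actually installed. The following is a minimal sketch, not part of this repository: the helper name `installed_versions` is hypothetical, and the distribution names in the usage note are assumptions about how each package is published. It relies on `importlib.metadata`, which is in the standard library from Python 3.8 (on Python 3.7 the `importlib_metadata` backport would be needed instead).

```python
from importlib import metadata


def installed_versions(distribution_names):
    """Map each distribution name to its installed version, or None if missing.

    `distribution_names` are the names used by the package installer, which
    may differ from the names used in `import` statements.
    """
    versions = {}
    for name in distribution_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

A possible call would be `installed_versions(["torch", "nltk", "transformers", "torch-struct", "pytokenizations", "clusopt"])`; any entry mapped to `None` still needs to be installed. Compare the reported versions against the list above by hand, since "compatible" is not defined precisely enough to check automatically.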

Training Instructions

A new model can be trained using the command python src/main.py train .... Some of the available arguments are:

| Argument | Description | Default |
| --- | --- | --- |
| --model-path-base | Path base to use for saving models | N/A |
| --evalb-dir | Path to EVALB directory | EVALB/ |
| --train-path | Path to training trees | data/wsj/train_02-21.LDC99T42 |
| --train-path-text | Optional non-destructive tokenization of the training data | Guess raw text; see --text-processing |
| --dev-path | Path to development trees | data/wsj/dev_22.LDC99T42 |
| --dev-path-text | Optional non-destructive tokenization of the development data | Guess raw text; see --text-processing |
| --text-processing | Heuristics for guessing raw text from destructively tokenized tree files. See load_trees() in src/treebanks.py | Default rules for languages other than Arabic, Chinese, and Hebrew |
| --subbatch-max-tokens | Maximum number of tokens to process in parallel while training (a full batch may not fit in GPU memory) | 2000 |
| --parallelize | Distribute pre-trained model (e.g. T5) layers across multiple GPUs | Use at most one GPU |
| --batch-size | Number of examples per training update | 32 |
| --checks-per-epoch | Number of development evaluations per epoch | 4 |
| --numpy-seed | NumPy random seed | Random |
| --use-pretrained | Use pre-trained encoder | Do not use pre-trained encoder |
| --pretrained-model | Model to use if --use-pretrained is passed. May be a path or a model id from the HuggingFace Model Hub | bert-base-uncased |
| --predict-tags | Adds a part-of-speech tagging component and auxiliary loss to the parser | Do not predict tags |
| --use-chars-lstm | Use learned CharLSTM word representations | Do not use CharLSTM |
| --use-encoder | Use learned transformer layers on top of pre-trained model or CharLSTM | Do not use extra transformer layers |
| --num-layers | Number of transformer layers to use if --use-encoder is passed | 8 |
| --encoder-max-len | Maximum sentence length (in words) allowed for extra transformer layers | 512 |
| --use-vq | Use vector quantization to compress word representations into a smaller set of discrete categories | Do not use vector quantization |
| --discrete-cats | The number of distinct categories used by the read-out network to produce a tree | 0 (Do not use categories) |
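To illustrate what --subbatch-max-tokens controls, here is a rough sketch of splitting one training batch into sub-batches that each stay within a token budget. It is an illustration only and not the splitting logic used in src/: the function name `split_into_subbatches` is hypothetical, and the sketch assumes the budget is applied to the plain sum of sentence lengths, with an over-long sentence placed in a sub-batch of its own.

```python
def split_into_subbatches(sentence_lengths, subbatch_max_tokens):
    """Greedily group example indices so each group's total length fits the budget.

    Returns a list of index lists. A single sentence longer than the budget
    still gets its own group rather than being dropped.
    """
    subbatches = []
    current = []
    current_tokens = 0
    for index, length in enumerate(sentence_lengths):
        # Close the current group if adding this sentence would exceed the budget.
        if current and current_tokens + length > subbatch_max_tokens:
            subbatches.append(current)
            current = []
            current_tokens = 0
        current.append(index)
        current_tokens += length
    if current:
        subbatches.append(current)
    return subbatches
```

Under this scheme, gradients from all sub-batches of a batch would be accumulated before a single parameter update, so --batch-size still sets the number of examples per update while --subbatch-max-tokens only bounds peak memory use.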

Additional arguments are available for other hyperparameters; see make_hparams() in src/main.py. These can be specified on the command line, such as --num-layers 2 (for numerical parameters), --predict-tags (for boolean parameters that default to False), or --no-XXX (for boolean parameters that default to True).
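The flag conventions above can be sketched with `argparse`. This is only an illustration of the convention, not the actual make_hparams() code: the function name `build_parser` and the option `example_on_by_default` are hypothetical, and the sketch assumes every default is an int, float, str, or bool.

```python
import argparse


def build_parser(defaults):
    """Create a parser with one command-line flag per hyperparameter default."""
    parser = argparse.ArgumentParser()
    for name, default in defaults.items():
        flag = name.replace("_", "-")
        if isinstance(default, bool):
            if default:
                # Defaults to True: expose only a --no-XXX switch to turn it off.
                parser.add_argument(f"--no-{flag}", dest=name, action="store_false")
            else:
                # Defaults to False: expose a plain --XXX switch to turn it on.
                parser.add_argument(f"--{flag}", dest=name, action="store_true")
        else:
            # Non-boolean parameters take a value of the same type as the default.
            parser.add_argument(f"--{flag}", dest=name, type=type(default), default=default)
    return parser
```

For example, with `defaults = {"num_layers": 8, "predict_tags": False, "example_on_by_default": True}`, parsing `["--num-layers", "2", "--predict-tags", "--no-example-on-by-default"]` sets `num_layers` to 2, `predict_tags` to True, and `example_on_by_default` to False.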

For each development evaluation, the F-score on the development set is computed and compared to the previous best. If the current model is better, the previous model will be deleted and the current model will be saved. The new filename will be derived from the provided model path base and the development F-score.
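The keep-only-the-best rule can be sketched as follows. This is a minimal illustration rather than the code in src/main.py: the class name `BestCheckpointKeeper`, the `save_fn` callback, and the two-decimal `_dev=` filename format are all assumptions made for the example.

```python
import os


class BestCheckpointKeeper:
    """Track the best development F-score and keep only that checkpoint on disk."""

    def __init__(self, model_path_base):
        self.model_path_base = model_path_base
        self.best_fscore = float("-inf")
        self.best_path = None

    def update(self, dev_fscore, save_fn):
        """Save via save_fn(path) if dev_fscore improves; return the new path or None."""
        if dev_fscore <= self.best_fscore:
            return None
        # Assumed naming scheme: path base plus the development F-score.
        new_path = f"{self.model_path_base}_dev={dev_fscore:.2f}.pt"
        save_fn(new_path)
        # Remove the previous best only after the new checkpoint has been written.
        previous = self.best_path
        if previous is not None and previous != new_path and os.path.exists(previous):
            os.remove(previous)
        self.best_fscore = dev_fscore
        self.best_path = new_path
        return new_path
```

Writing the new file before deleting the old one means an interrupted save never leaves the directory without a checkpoint.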

Prior to training the parser, you will first need to obtain appropriate training data. We provide instructions on how to process standard datasets like PTB, CTB, and the SPMRL 2013/2014 Shared Task data. After following the instructions for the English WSJ data, you can use the following command to train an English parser using the default hyperparameters:

python src/main.py train --use-pretrained --model-path-base models/en_bert_base

See EXPERIMENTS.md for more examples of good hyperparameter choices.

Evaluation Instructions

A saved model can be evaluated on a test corpus using the command python src/main.py test ... with the following arguments:

| Argument | Description | Default |
| --- | --- | --- |
| --model-path | Path of saved model | N/A |
| --evalb-dir | Path to EVALB directory | EVALB/ |
| --test-path | Path to test trees | data/23.auto.clean |
| --test-path-text | Optional non-destructive tokenization of the test data | Guess raw text; see --text-processing |
| --text-processing | Heuristics for guessing raw text from destructively tokenized tree files. See load_trees() in src/treebanks.py | Default rules for languages other than Arabic, Chinese, and Hebrew |
| --test-path-raw | Alternative path to test trees that is used for evalb only (used to double-check that evaluation against pre-processed trees does not contain any bugs) | Compare to trees from --test-path |
| --subbatch-max-tokens | Maximum number of tokens to process in parallel (a GPU does not have enough memory to process the full dataset in one batch) | 500 |
| --parallelize | Distribute pre-trained model (e.g. T5) layers across multiple GPUs | Use at most one GPU |
| --output-path | Path to write predicted trees to (use "-" for stdout) | Do not save predicted trees |
| --no-predict-tags | Use gold part-of-speech tags when running EVALB. This is the standard for publications, and omitting this flag may give erroneously high F1 scores. | Use predicted part-of-speech tags for EVALB, if available |

As an example, you can evaluate a trained model using the following command:

python src/main.py test --model-path models/en_bert_base_dev=*.pt
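The `*` in the command above is expanded by the shell, so it only works when exactly one matching checkpoint exists. When launching evaluation from a script, the same lookup can be done explicitly. The sketch below is an assumption-laden illustration: the helper name `build_test_command` is hypothetical, and it assumes checkpoints are named `<model path base>_dev=<score>.pt` as the glob in the command suggests.

```python
import glob


def build_test_command(model_path_base, extra_args=()):
    """Find the single saved checkpoint for a path base and build the test command."""
    # glob.escape protects any special characters in the path base itself.
    pattern = glob.escape(model_path_base) + "_dev=*.pt"
    matches = sorted(glob.glob(pattern))
    if len(matches) != 1:
        raise RuntimeError(
            f"expected exactly one checkpoint matching {pattern!r}, found {len(matches)}"
        )
    return ["python", "src/main.py", "test", "--model-path", matches[0], *extra_args]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root; options such as `--test-path` can be supplied through `extra_args`.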

Reproducing Experiments

See EXPERIMENTS.md for instructions on how to reproduce experiments reported in our ACL 2022 paper.

Citation

Our paper can be cited as follows:

@inproceedings{kitaev-etal-2022-learned,
    title = "Learned Incremental Representations for Parsing",
    author = "Kitaev, Nikita  and
      Lu, Thomas  and
      Klein, Dan",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.220",
    pages = "3086--3095",
    abstract = "We present an incremental syntactic representation that consists of assigning a single discrete label to each word in a sentence, where the label is predicted using strictly incremental processing of a prefix of the sentence, and the sequence of labels for a sentence fully determines a parse tree. Our goal is to induce a syntactic representation that commits to syntactic choices only as they are incrementally revealed by the input, in contrast with standard representations that must make output choices such as attachments speculatively and later throw out conflicting analyses. Our learned representations achieve 93.72 F1 on the Penn Treebank with as few as 5 bits per word, and at 8 bits per word they achieve 94.97 F1, which is comparable with other state of the art parsing models when using the same pre-trained embeddings. We also provide an analysis of the representations learned by our system, investigating properties such as the interpretable syntactic features captured by the system and mechanisms for deferred resolution of syntactic ambiguities.",
}

Credits

The code in this repository and portions of this README are based on https://github.com/nikitakit/self-attentive-parser
