Giter VIP home page Giter VIP logo

pyconll's Introduction

Build Status Coverage Status Documentation Status

pyconll

Easily work with CoNLL files using the familiar syntax of python.

Links

Installation

As with most python packages, simply use pip to install from PyPi.

pip install pyconll

pyconll is also available as a conda package on the pyconll channel. Only package 2.2.0 and newer are available for conda at the moment.

conda install -c pyconll pyconll

This package is designed for, and only tested with python 3.4 and up and will not be backported to python 2.x.

Motivation

This tool is intended to be a minimal, low level, and functional library in a widely used programming language. pyconll creates a thin API on top of raw CoNLL annotations that is simple and intuitive in a popular programming language.

In my work with the Universal Dependencies project, I saw a dissapointing lack of low level APIs for working with the CoNLL-U format. Most tooling focuses on graph transformations and DSLs for terse, automated changes. Tools such as Grew and Treex are very powerful and productive, but have a learning curve and are limited the scope of their DSLs. CL-CoNLLU is simple and low level, but Common Lisp is not widely used in NLP, and difficult to pickup for beginners. UDAPI is in python but it is very large and has little guidance. pyconll attempts to fill the gaps between what other projects have accomplished.

Other useful tools can be found on the Universal Dependencies website.

Hopefully, individual researchers find pyconll useful, and will use it as a building block for their tools and projects. pyconll affords a standardized and complete base for building larger projects without worrying about CoNLL annotation and output.

Code Snippet

# This snippet finds what lemmas are marked as AUX which is a closed class POS in UD
import pyconll

UD_ENGLISH_TRAIN = './ud/train.conll'

train = pyconll.load_from_file(UD_ENGLISH_TRAIN)

aux_lemmas = set()
for sentence in train:
    for token in sentence:
        if token.upos == 'AUX':
            aux_lemmas.add(token.lemma)

Uses and Limitations

This package edits CoNLL-U annotations. This does not include the annotated text itself. Word forms on Tokens are not editable and Sentence Tokens cannot be reassigned or reordered. pyconll focuses on editing CoNLL-U annotation rather than creating it or changing the underlying text that is annotated. If there is interest in this functionality area, please create a github issue for more visibility.

This package also is only validated against the CoNLL-U format. The CoNLL and CoNLL-X format are not supported, but are very similar. I originally intended to support these formats as well, but their format is not as well defined as CoNLL-U so they are not included. Please create an issue for visibility if this feature interests you.

Lastly, linguistic data can often be very large and this package attempts to keep that in mind. pyconll provides methods for creating in memory conll objects along with an iterate only version in case a corpus is too large to store in memory (the size of the memory structure is several times larger than the actual corpus file). The iterate only version can parse upwards of 100,000 words per second on a 16gb ram machine, so for most datasets to be used on a dev machine, this package will perform well. The 2.2.0 release also improves parse time and memory footprint by about 25%!

Contributing

Contributions to this project are welcome and encouraged! If you are unsure how to contribute, here is a guide from Github explaining the basic workflow. After cloning this repo, please run make hooks and pip install -r requirements.txt to properly setup locally. make hooks setups up a pre-push hook to validate that code matches the default YAPF style. While this is technically optional, it is highly encouraged. pip install -r requirements.txt sets up environment dependencies like yapf, twine, sphinx, etc.

For packaging new versions, please use setuptools version 24.2.0 or greater for creating the appropriate packaging that recognizes the python_requires metadata.

README and CHANGELOG

When changing either of these files, please change the Markdown version and run make docs so that the other versions stay in sync.

Code Formatting

Code formatting is done automatically on push if githooks are setup properly. The code formatter is YAPF, and using this ensures that coding style stays consistent over time and between authors.

pyconll's People

Contributors

bittlingmayer avatar matgrioni avatar zmeigorynych avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.