Giter VIP home page Giter VIP logo

tools-11's Introduction

This repository contains various scripts in Perl and Python that can be used as tools for Universal Dependencies.



==============================
validate.py
==============================

Reads a CoNLL-U file and verifies that it complies with the UD specification. It must be run with the language /
treebank code and there must exist corresponding lists of treebank-specific features and dependency relations in order
to check that they are valid, too.

  cat la_proiel-ud-train.conllu | validate.py --lang la_proiel



==============================
conllu-stats.py
conllu-stats.pl
==============================

Reads a CoNLL-U file, collects various statistics and prints them. These two scripts, one in Python and the other in
Perl, are independent of each other. The statistics they collect overlap but are not the same. The Perl script
(conllu-stats.pl) was used to generate the stats.xml files in each data repository.



==============================
overlap.py
==============================

Compares two CoNLL-U files and searches for sentences that occur in both (verbose duplicates of token sequences). Some
treebanks, especially those where the original text had been acquired from the web, contained duplicate documents that
were found at different addresses and downloaded twice. This tool helps to find out whether one of the duplicates fell
in the training data and the other in development or test. The output has to be verified manually, as some “duplicates”
are repetitions that occur naturally in the language (in particular short sentences such as “Thank you.”)

The script can also help to figure out whether training-dev-test data split has been changed between two releases so
that a previously training sentence is now in test or vice versa. That is something we want to avoid.



==============================
conllu_to_conllx.pl
==============================

Converts a file in the CoNLL-U format to the old CoNLL-X format. Useful with old tools (e.g. parsers) that require
CoNLL-X as their input. Usage:

  perl conllu_to_conllx.pl < file.conllu > file.conll



==============================
conll_convert_tags_to_uposf.pl
==============================

This script takes the CoNLL columns CPOS, POS and FEAT and converts their combined values to the universal POS tag and
features.

You need Perl. On Linux, you probably already have it; on Windows, you may have to download and install Strawberry Perl.
You also need the Interset libraries. Once you have Perl, it is easy to get them via the following (call "cpan" instead
of "cpanm" if you do not have cpanm).

  cpanm Lingua::Interset

Then use the script like this:

  perl conll_convert_tags_to_uposf.pl -f source_tagset < input.conll > output.conll

The source tagset is the identifier of the tagset used in your data and known to Interset. Typically it is the language
code followed by two colons and "conll", e.g. "sl::conll" for the Slovenian data of CoNLL 2006. See the tagset conversion
tables at http://universaldependencies.github.io/docs/tagset-conversion/index.html for more tagset codes.

It assumes the CoNLL-X (2006 and 2007) file format. If your data is in another format (including CoNLL 2008/2009, which
is not identical to 2006/2007), you have to modify the data or the script.



==============================
check_files.pl
==============================

This script must be run in a folder where all the data repositories (UD_*) are stored as subfolders. It checks the
contents of the data repositories for various issues that we want to solve before a new release of UD is published.

tools-11's People

Contributors

alexpopov23 avatar dan-zeman avatar fginter avatar jipatsaa avatar kajad avatar manning avatar msimi avatar prokopidis avatar ryanmcd-goog avatar spyysalo avatar tlynn747 avatar tomazerjavec avatar verginicabm avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.