ud-compatibility

Utility for converting Universal Dependencies–annotated corpora to UniMorph

The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of a language. Each project also provides corpora of annotated text in many languages: UD at the token level and UniMorph at the type level. Because each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. To ease interoperability between the two projects, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema.

Prerequisites

  • termcolor: pip install termcolor
  • Python 3.5 or later; Anaconda is a simple way to install it.

Usage

The driver is the file marry.py, which marries a UD dataset to its corresponding UniMorph data.

Conversion

To convert one file to UniMorph, give its path with --ud (and optionally select a specific language converter with -l).

python marry.py convert --ud my/ud/path/rw-ud-dev.conllu
python marry.py convert --ud my/ud/path/da-ud-dev.conllu -l da

To convert entire UD datasets to UniMorph, list the languages you'd like to convert with --langs:

python marry.py convert --langs he ro de it no_bokmaal 

(You'll need to update the paths in paths.py to reflect where your UD (and UniMorph, if evaluating) data are stored.)

When the input looks like this:

# sent_id = es-train-001-s21
# text = Tiene 2 madres.
1	Tiene	tener	VERB	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
2	2	2	NUM	_	NumType=Card	3	nummod	_	_
3	madres	madre	NOUN	_	Gender=Masc|Number=Plur	1	obj	_	SpaceAfter=No
4	.	.	PUNCT	_	_	1	punct	_	_

The output will look like this:

# sent_id = es-train-001-s21
# text = Tiene 2 madres.
1	Tiene	tener	VERB	_	PRS;V;FIN;3;IND;SG	0	root	_	_
2	2	2	NUM	_	NUM	3	nummod	_	_
3	madres	madre	NOUN	_	N;PL;MASC	1	obj	_	SpaceAfter=No
4	.	.	PUNCT	_	_	1	punct	_	_
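The rewrite of the sixth column (FEATS) shown above can be sketched in a few lines of Python. This is a minimal illustration, not the project's Translator code: the lookup tables UD_TO_UM and POS_TO_UM and the function convert_line are hypothetical names, and they cover only the features that occur in the example sentence. The order of the emitted tags follows the input order here and may differ from what marry.py prints.

```python
# Hypothetical, partial lookup tables covering only the example sentence.
# The real converter handles many more features and language-specific cases.
UD_TO_UM = {
    "Mood=Ind": "IND",
    "Number=Sing": "SG",
    "Number=Plur": "PL",
    "Person=3": "3",
    "Tense=Pres": "PRS",
    "VerbForm=Fin": "FIN",
    "Gender=Masc": "MASC",
}
POS_TO_UM = {"VERB": "V", "NOUN": "N", "NUM": "NUM"}


def convert_line(line):
    """Rewrite the FEATS column (6th field) of one CoNLL-U token line."""
    # Comment lines and blank lines pass through unchanged.
    if not line.strip() or line.startswith("#"):
        return line
    cols = line.split("\t")
    if len(cols) != 10:
        return line  # not a well-formed token line; leave it alone
    upos, feats = cols[3], cols[5]
    tags = []
    if upos in POS_TO_UM:
        tags.append(POS_TO_UM[upos])
    if feats != "_":
        for feat in feats.split("|"):
            tag = UD_TO_UM.get(feat)  # unknown features are dropped in this sketch
            if tag is not None:
                tags.append(tag)
    cols[5] = ";".join(tags) if tags else "_"
    return "\t".join(cols)
```

Applied to the first token line of the example, this sketch yields the same set of tags as the output above (V, IND, SG, 3, PRS, FIN), though not necessarily in the same order.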

Evaluation

To assess a conversion (whether by one of the included Translator objects or your own), the syntax is similar:

python marry.py evaluate --langs he ro de it no_bokmaal 

(As with conversion, you'll need to update the paths in paths.py to reflect where your UD and UniMorph data are stored.)

Replication

To replicate the experiments from the paper, use:

python marry.py replicate 

Data

The individual datasets for Universal Dependencies v2 and UniMorph can be downloaded from their respective projects on GitHub.

Contributing

You're welcome to submit a pull request that harmonizes your UD dataset with the corresponding UniMorph data.

  1. Write your own Translator subclass.
  2. Register it in the languages.py list.
  3. Submit the pull request.

License

This project is licensed under the GNU GPL v3 license; see the LICENSE.md file for details.

Citation

@InProceedings{mccarthy2018udw,
  author = 	"McCarthy, Arya D.
		and Silfverberg, Miikka
		and Cotterell, Ryan
		and Hulden, Mans
		and Yarowsky, David",
  title = 	"Marrying {U}niversal {D}ependencies and {U}niversal {M}orphology",
  booktitle = 	"Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"91--101",
  location = 	"Brussels, Belgium",
  url = 	"http://aclweb.org/anthology/W18-6011"
}

ud-compatibility's Issues

Enforcing POS comes first

Hi!

As far as I can tell, the UniMorph convention for attribute ordering is that the POS is the first entry, before the first ;. Nothing in the code I could see enforces this, and indeed when operating on a conllu file a lot of the POS tags appear at arbitrary positions in the string (when they appear at all), making the tag difficult to process downstream. Can this be fixed?

Sample Input (en_gum treebank, l.152):
9 years year NOUN NNS Number=Plur 7 nmod 7:nmod:to Entity=23)|SpaceAfter=No

Output:
9 years year NOUN NNS PL;N 7 nmod 7:nmod:to Entity=23)|SpaceAfter=No

Thanks,
Yuval
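One way to work around this downstream is to post-process each tag string so that the part of speech comes first. The sketch below is an assumption-laden illustration, not a patch to marry.py: the name pos_first is hypothetical, and UM_POS_TAGS is a partial set of UniMorph part-of-speech labels written from memory of the schema, so it should be checked against the schema before use.

```python
# Assumed, partial set of UniMorph part-of-speech labels (verify against the schema).
UM_POS_TAGS = {
    "N", "PROPN", "ADJ", "PRO", "DET", "V", "ADV", "AUX",
    "ADP", "CONJ", "NUM", "PART", "INTJ",
}


def pos_first(tag):
    """Return the tag string with any POS label moved to the front.

    The relative order of the remaining labels is preserved.
    """
    if tag == "_":
        return tag  # empty feature column; nothing to reorder
    parts = tag.split(";")
    pos = [p for p in parts if p in UM_POS_TAGS]
    rest = [p for p in parts if p not in UM_POS_TAGS]
    return ";".join(pos + rest)
```

For the sample above, pos_first("PL;N") returns "N;PL". Tags with no POS label at all are returned unchanged, so the missing-POS case still needs a fix in the converter itself.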

Labeling differences between UD and UM for Hungarian

It seems that for some languages, UD uses slightly different labels than UM for what is ultimately the same category.

For Hungarian, the UD dative case corresponds to the UM genitive case, and the UD imperative mood corresponds to the UM subjunctive mood.

PS: I am just leaving it as a note. If someone works on Hungarian, it could be useful. So you can close the issue.
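A language-specific correspondence like this one could be expressed as an override table consulted before a generic mapping. The following is only a sketch of that idea: GENERIC, HU_OVERRIDES, and translate are hypothetical names, the generic table is deliberately tiny, and nothing here reflects how the project's Translator classes are actually organized.

```python
# Hypothetical generic mapping (tiny subset, for illustration only).
GENERIC = {
    "Case=Dat": "DAT",
    "Mood=Imp": "IMP",
    "Number=Sing": "SG",
}

# Hypothetical Hungarian overrides, following the note above:
# UD dative -> UM genitive, UD imperative -> UM subjunctive.
HU_OVERRIDES = {
    "Case=Dat": "GEN",
    "Mood=Imp": "SBJV",
}


def translate(feature, overrides=None):
    """Map one UD feature to a UniMorph label, preferring language overrides."""
    if overrides and feature in overrides:
        return overrides[feature]
    return GENERIC.get(feature)  # None when the feature is unknown
```

With these tables, translate("Case=Dat") gives "DAT", while translate("Case=Dat", HU_OVERRIDES) gives "GEN"; features without an override fall through to the generic table.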

Annoying print for python interface

When using the package through Python (admittedly something we jerry-rigged it to do), there's an annoying print(ud2um) that pops up every time any code that imports unimorph is used. I suggest switching most of the package to use logging.
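The suggested change might look roughly like the sketch below, which uses Python's standard logging module in place of a bare print. The logger name and the load_mapping function are hypothetical stand-ins; the actual location of the print call in the package is not shown here.

```python
import logging

# Hypothetical module-level logger; a library attaches a NullHandler so that
# importing it produces no output unless the application configures logging.
logger = logging.getLogger("ud2um_sketch")
logger.addHandler(logging.NullHandler())


def load_mapping(ud2um):
    """Return the mapping, reporting its size at DEBUG level instead of printing it."""
    logger.debug("Loaded UD-to-UniMorph mapping with %d entries", len(ud2um))
    return ud2um
```

A user who wants to see the message can opt in with logging.basicConfig(level=logging.DEBUG); everyone else gets a silent import.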

UD/UM ordering mismatch

Hi all,

I noticed that marry.py emits the converted features in the order they appear in UD, as opposed to the order used in UniMorph. Is there any way to remedy this?

For example, the following line from UD (as output by marry.py):
17 ist sein AUX VAFIN FIN;V;IND;3;SG;PRS 27 cop _ _
should have the following order:
V;IND;PRS;3;SG
where FIN is optional.

I'm putting these into a system which is sensitive to order, so random ordering is breaking my application.

Thanks in advance!
--David
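Until the converter enforces an order, a downstream workaround is to sort the labels of each tag by a fixed rank. The sketch below assumes a small, hypothetical ranking (LABEL_ORDER) chosen to reproduce the order requested above; it is not the official UniMorph dimension order, and reorder is a hypothetical helper rather than part of marry.py.

```python
# Hypothetical ranking, chosen to yield POS, mood, tense, person, number, finiteness.
# Extend it with the labels your data actually contains.
LABEL_ORDER = [
    "V", "N", "AUX",          # part of speech
    "IND", "SBJV", "IMP",     # mood
    "PRS", "PST", "FUT",      # tense
    "1", "2", "3",            # person
    "SG", "PL",               # number
    "FIN", "NFIN",            # finiteness
]
RANK = {label: i for i, label in enumerate(LABEL_ORDER)}


def reorder(tag):
    """Sort the labels of a ;-separated tag by RANK.

    Labels missing from RANK sort last; sorted() is stable, so they keep
    their original relative order.
    """
    if tag == "_":
        return tag
    parts = tag.split(";")
    return ";".join(sorted(parts, key=lambda p: RANK.get(p, len(RANK))))
```

With this ranking, reorder("FIN;V;IND;3;SG;PRS") returns "V;IND;PRS;3;SG;FIN", which matches the requested order with the optional FIN at the end.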
