
Universal Segmentations

This repository serves as the main storage for the scripts and documentation of the Universal Segmentations project, which aims to convert existing resources for morphological segmentation into a common format and, in the future, to develop new resources.

Currently, we focus on extracting information from existing resources and putting it into a file format specifically designed for the segmentation task.

API

A Python API for working with morphological segmentation is available in src/useg/seg_lex.py. For a simple usage example, see src/import_derinet.py, which converts segmentation present in DeriNet-2.0-formatted files to our format.

Basic usage

  1. Create a lexicon. The lexicon is the main object you’ll be interacting with.

    from useg import SegLex
    lexicon = SegLex()
  2. Create any number of lexemes using the lexicon.add_lexeme() method, which requires a word form, a lemma and a part-of-speech tag and returns the ID of the created lexeme. This ID is used to work with the lexeme through the other methods.

    The last argument is the (optional) features of the lexeme. Pass a dict with the required items, such as {"gender": "Masc"}. The keys are not yet standardized; this remains to be discussed.

    The POS tag can be replaced by an empty string if you don’t have it. The form should always be filled in – it is expected that the lemma is one of the word forms, so if you only have a lemma, specify it both as the lemma and as the form.

    Try to use Universal POS tags for the part of speech, if possible.

    lex_id = lexicon.add_lexeme("counterexamples", "counterexample", "NOUN")
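    # Hedged variant: assuming the optional features dict is passed as the
    # fourth argument, as described above (the feature key is illustrative).
    lex_id_feat = lexicon.add_lexeme("oxen", "ox", "NOUN", {"number": "Plur"})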
  3. Add segmentation by using the lexicon.add_morpheme() method, or one of the convenience methods lexicon.add_contiguous_morpheme() or lexicon.add_morphemes_from_list() (see below).

    The lexicon.add_morphemes_from_list() method is the simplest one to use, but it requires that the morphs of the word, when concatenated, correspond exactly to the word form. Therefore, it doesn’t work with allomorphs and doesn’t allow you to add further annotation to the morphemes. Simply pass a list of morphs as the argument.

    lexicon.add_morphemes_from_list(lex_id, "test_1", ["count", "er", "example", "s"])

    The lexicon.add_contiguous_morpheme() method allows you to add morphemes one by one, manually resolving allomorphy and annotating the individual morphemes. Pass the start and end indices of the morpheme as if they were slice indices that select the morph from the form. (In the example below, the morph count is annotated as an allomorph of the underlying morpheme contra.)

    lexicon.add_contiguous_morpheme(lex_id, "test_2", 0, 5, {"type": "root", "morpheme": "contra"})
    lexicon.add_contiguous_morpheme(lex_id, "test_2", 5, 7, {"type": "suffix", "morpheme": "er"})

    The lexicon.add_morpheme() method is the lowest-level one. It requires you to specify manually which positions in the form the morph corresponds to, using a list of integer indices, which allows you to mark non-contiguous morphemes such as circumfixes (see the sketch after this list).

    lexicon.add_morpheme(lex_id, "test_2", [7, 8, 9, 10, 11, 12, 13], {"type": "root", "morpheme": "example"})
    lexicon.add_morpheme(lex_id, "test_2", [14], {"type": "suffix", "morpheme": "PLURAL"})
  4. When you’re done, you can save the lexicon to a file:

    lexicon.save("output.useg")  # Or e.g. lexicon.save(sys.stdout)

Loading and using pre-created lexicons

In addition to creating the lexicon from scratch, it is also possible to load an existing resource from a file using the lexicon.load() method, analogously to saving:

lexicon = SegLex()
lexicon.load("input.useg")  # Or e.g. lexicon.load(sys.stdin)

The lexicon now contains lexemes and their segmentation. The lexemes can be enumerated using the lexicon.iter_lexemes() method. When passed no arguments, it iterates over the whole lexicon, returning the IDs of all lexemes. When passed the string keyword arguments form, lemma or pos, it iterates only over the IDs of lexemes with the specified properties.

The methods lexicon.form(), .lemma() and .pos() can be used to obtain these strings, and lexicon.features() retrieves the features dict. The returned dictionary is shared with the API, so any edits will be visible in future calls.

# Print lemmas and features of all nouns.
for lex_id in lexicon.iter_lexemes(pos="NOUN"):
    print(lexicon.lemma(lex_id), lexicon.features(lex_id))
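
The same mechanisms can be combined to look lexemes up and edit their features in place. A short hedged sketch, relying only on the behaviour described above (the feature key "checked" is illustrative):

# Find lexemes by form and annotate them; the features dict is shared
# with the API, so the edit is visible in later calls.
for lex_id in lexicon.iter_lexemes(form="counterexamples"):
    lexicon.features(lex_id)["checked"] = "yes"
    print(lexicon.form(lex_id), lexicon.pos(lex_id), lexicon.features(lex_id))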

Troubleshooting

To make sure that Python sees the API sources and is able to import them, add the src/ directory to PYTHONPATH, either by setting the corresponding environment variable, or by modifying sys.path before importing from useg. That is, either:

PYTHONPATH=/path/to/universal-segmentations/src python3 script.py

or:

import os
import sys

# Configure the path to the sources before importing from useg.
sys.path.insert(0, os.path.abspath('../../src'))

from useg import SegLex

TODO

Add documentation.

  • ✓ Document setting the PYTHONPATH

  • ✓ Document importing and creating the lexicon

  • ✓ Document creating new lexemes

  • ❏ Document working with lexemes

    • ✓ Getting form, lemma, POS tag, other morphological info

    • ❏ Deleting lexemes

    • ✓ Changing morphological info

    • ❏ What to do when we need to change the lemma or POS (A: Create a new lexeme instead and delete the old one.)

    • ✓ Getting lexemes by lemma, form, pos etc.

  • ❏ Document basic concepts of morphemes

    • ❏ They are bound to the form of the lexeme

    • ❏ Spans index where in the form the morph of the morpheme is located

    • ❏ Ideally, each lexeme has its form perfectly subdivided between morphemes, with no gaps or overlaps

      • ❏ But you don’t need to fully segment the word if you don’t know all the morphemes

      • ❏ And edge cases (overlaps – “morpheme nodes”) are supported too

    • ❏ Each lexeme can have multiple alternative segmentations

      • ❏ These are distinguished using annotation layer names

  • ✓ Document creating new morphemes (contiguous or not)

  • ❏ Document deleting morphemes (not implemented yet)

  • ❏ Document morpheme properties; what you are expected to fill out

    • ❏ And how to query and obtain the properties

File format

Note: documentation of the file format is still TODO.
