
turkish-morphology's Introduction

Turkish Morphology

A two-level morphological analyzer for Turkish.

This is not an official Google product.

Components

This implementation is composed of three layers:

  • Lexicons:

    This layer includes wide-coverage Turkish lexicons which are manually annotated and validated for part-of-speech and morphophonemic irregularities. They are intended to be used in building Turkish natural language processing tools, such as morphological analyzers. The set of base lexicons that we provide includes annotated lexical items for 47,202 words. The tagsets and the annotation scheme are described in the lexicon annotation guidelines.

  • Morphotactics:

    This layer includes a set of FST definitions implemented in a custom format similar to the AT&T FSM format (the only difference being that state names and the input/output labels of each transition can be strings instead of integers). Each of these FSTs defines the suffixation patterns and the morpheme inventories, together with their corresponding output morphological feature category-value pairs, for a given part-of-speech. The overall morphotactic model and the morphological feature category-value tagsets are described in the morphotactic model guidelines.

  • Morphophonemics:

    This layer includes a set of Thrax grammars, each of which implements a standalone morphophonemic process (such as vowel harmony, vowel drop, consonant voicing and consonant drop). The composition of the FSTs exported from these Thrax grammars yields the morphophonemic model of Turkish.

The first level of the morphological analysis is implemented by the morphophonemic model, which takes a Turkish word and transforms it into the intermediate representation. The output of the first level is the set of all possible hypotheses, each consisting of a word stem annotated with its morphophonemic irregularities, followed by the meta-morphemes that correspond to the suffixes realized in the surface form.

Input: affında
Output: af"+SH+NDA

Lexicon entries and morphotactic FST definitions are composed and compiled into a single FST, which acts as the second level of the morphological analysis, namely the morphotactic model. The morphotactic model takes the intermediate tape as input and transforms it into all possible human-readable morphological analyses that can be generated from the hypotheses produced by the first level.

Input: af"+SH+NDA
Output: (af[NN]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=False]
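
To make the intermediate representation more concrete, the following minimal Python sketch splits an intermediate string into its annotated stem and its meta-morphemes. The helper name split_intermediate is hypothetical and not part of this repository, and the sketch assumes that + is the only separator on the intermediate tape, as in the example above.

```python
from typing import List, Tuple


def split_intermediate(intermediate: str) -> Tuple[str, List[str]]:
    """Splits an intermediate string into (annotated stem, meta-morphemes).

    Hypothetical helper for illustration; assumes '+' is the only separator.
    """
    stem, *meta_morphemes = intermediate.split("+")
    return stem, meta_morphemes


stem, meta_morphemes = split_intermediate('af"+SH+NDA')
print(stem)            # af"
print(meta_morphemes)  # ['SH', 'NDA']
```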

See the Interpreting Human-Readable Morphological Analysis section for a description of these human-readable morphological analyses.

How to Parse Words

To morphologically parse a word, run the following command from the project root directory:

bazel run -c opt scripts:print_analyses -- --word=[WORD_TO_PARSE]

This will morphologically parse the input word with the two-level morphological analyzer and output a set of human-readable morphological analyses, for example:

bazel run -c opt scripts:print_analyses -- --word=geldiğinde
> Morphological analyses for the word 'geldiğinde':
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3pl])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3pl])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3sg])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3sg])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3pl])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3pl])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3sg])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3sg])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=True]

If the input string is not accepted as a Turkish word, the morphological analyzer outputs an empty result:

bazel run -c opt scripts:print_analyses -- --word=foo
> 'foo' is not accepted as a Turkish word
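
In the listing for geldiğinde, every analysis appears twice, once with Proper=False and once with Proper=True. As a rough sketch of how such pairs could be collapsed in post-processing, the Python snippet below strips the proper noun feature and removes the resulting duplicates. The function name unique_without_proper and the regular expression are assumptions made for illustration; they are not part of this repository.

```python
import re
from typing import Iterable, List

# Matches a proper noun feature such as "+[Proper=False]".
_PROPER = re.compile(r"\+\[Proper=(?:True|False)\]")


def unique_without_proper(analyses: Iterable[str]) -> List[str]:
    """Drops Proper features and removes the resulting duplicates, keeping order."""
    stripped = (_PROPER.sub("", analysis) for analysis in analyses)
    return list(dict.fromkeys(stripped))


# The last four analyses from the geldiğinde listing above.
sample = [
    "(gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc])+[Proper=False]",
    "(gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc])+[Proper=True]",
    "(gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=False]",
    "(gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=True]",
]

for analysis in unique_without_proper(sample):
    print(analysis)
```

Applied to these four lines, the sketch leaves two analyses, one for the P2sg reading and one for the P3sg reading.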

Interpreting Human-Readable Morphological Analysis

An example human-readable morphological analysis is as follows.

Input Word (evlerindekilerin = of those that are in their homes):

bazel run -c opt scripts:print_analyses -- --word=evlerindekilerin

Sample Output Morphological Analysis String:

(ev[NN]+[PersonNumber=A3sg]+lArH[Possessive=P3pl]+NDA[Case=Loc])([PRF]-ki[Derivation=Pron]+lAr[PersonNumber=A3pl]+[Possessive=Pnon]+NHn[Case=Gen])+[Proper=False]

Human-readable morphological analyses can be decomposed into parts:

  • Inflectional groups:

    Each human-readable morphological analysis is composed of inflectional groups. An inflectional group is a sub-word span, and it is created by affixation of a derivational morpheme. Inflectional group analyses are enclosed in parentheses. The above example contains two inflectional groups:

    • (ev[NN]+[PersonNumber=A3sg]+lArH[Possessive=P3pl]+NDA[Case=Loc])
    • ([PRF]-ki[Derivation=Pron]+lAr[PersonNumber=A3pl]+[Possessive=Pnon]+NHn[Case=Gen])
  • Word stem:

    The first inflectional group contains the word stem (e.g. ev is the root form of the above example input word evlerindekilerin).

  • Analysis of morphemes:

    Within each inflectional group, meta-morphemes and their corresponding morphological feature category-value tags are separated with either the + or the - delimiter (e.g. +[PersonNumber=A3sg], +lArH[Possessive=P3pl], -ki[Derivation=Pron]). The strings that immediately follow the delimiters + or - are the meta-morphemes (e.g. NDA is the meta-morpheme in the morpheme analysis +NDA[Case=Loc]). Morphological feature category-value tags are enclosed in brackets right after the meta-morphemes (e.g. Case is the feature category and Loc is the feature value in the morpheme analysis +NDA[Case=Loc]).

  • Part-of-speech:

    The part-of-speech tag of each inflectional group is the first bracketed tag of that group (e.g. NN is the part-of-speech of the first inflectional group and PRF is that of the second).

  • Inflectional vs. Derivational morphemes:

    Meta-morphemes that are separated with the + delimiter do not create a new inflectional group; they are inflectional morphemes (e.g. +[PersonNumber=A3sg], +NDA[Case=Loc], +[Possessive=Pnon], etc.). Meta-morphemes that are separated with the - delimiter create a new inflectional group; they are derivational morphemes (e.g. -ki[Derivation=Pron]). Therefore, the first meta-morpheme of a derived (non-initial) inflectional group always follows the delimiter -, never +.

  • Surface realization of inflections:

    Some meta-morphemes are not realized in the surface form. These meta-morphemes do not correspond to a span of characters in the input word. For them we do not output any meta-morpheme in the morpheme analysis (e.g. +lArH[Possessive=P3pl] and +NDA[Case=Loc] are realized in the surface form, thus they have explicit meta-morphemes lArH and NDA in their morpheme analysis. However, +[PersonNumber=A3sg] and +[Possessive=Pnon] are not realized in the surface form, therefore only morphological feature category-value tags are output for them in their morpheme analysis).

  • Surface realization of derivations:

    Derivational morphemes must always be realized in the surface form, so they always correspond to a span of characters in the input word. Therefore, we always output non-empty meta-morphemes in the morpheme analyses of derivational morphemes; that is, no zero-derivations are allowed in the morphotactic model.

  • Proper noun analysis:

    An optional proper noun feature analysis is output at the end of each inflectional group (e.g. +[Proper=False], which follows the second inflectional group). The proper noun feature category can take two values, True or False. If it is specified as True, the inflectional group that it follows is considered to be part of a proper noun. This feature is used to capture the internal structure of proper nouns that are composed of multiple words (e.g. for multi-word movie names, the true part-of-speech and morphological features of the component words can be annotated, while this feature marks the fact that they are part of a proper noun).

    The proper noun feature analysis is omitted for some inflectional groups to keep the representation compact and to minimize the number of morphological analyses generated by the morphological analyzer. In such cases, the proper noun feature analysis of an inflectional group applies to all preceding inflectional groups that do not have one (e.g. the first inflectional group of the above example inherits its proper noun feature analysis Proper=False from the second inflectional group).
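
The decomposition described above can be sketched with a few regular expressions. The Python snippet below is a minimal illustration only, not the parser shipped with this repository: the function name parse_analysis, the dictionary layout and the regular expressions are all assumptions, and the sketch handles only analyses that look like the examples in this section.

```python
import re
from typing import Any, Dict, List

# One inflectional group in parentheses, optionally followed by a Proper feature.
_GROUP = re.compile(r"\(([^()]*)\)(?:\+\[Proper=(True|False)\])?")
# The part-of-speech tag: the first bracketed tag that has no '=' in it.
_POS = re.compile(r"\[(\w+)\]")
# One morpheme analysis: delimiter, optional meta-morpheme, [Category=Value].
_MORPHEME = re.compile(r"([+-])([^\[\]+-]*)\[(\w+)=(\w+)\]")


def parse_analysis(analysis: str) -> List[Dict[str, Any]]:
    """Parses a human-readable analysis into a list of inflectional groups."""
    groups: List[Dict[str, Any]] = []
    for body, proper in _GROUP.findall(analysis):
        pos = _POS.search(body)
        groups.append({
            # Only the first group has a stem in front of its part-of-speech tag.
            "stem": body[:pos.start()] if pos and pos.start() > 0 else None,
            "pos": pos.group(1) if pos else None,
            # None when the group carries no Proper feature of its own.
            "proper": {"True": True, "False": False}.get(proper),
            "morphemes": [
                {
                    "derivational": delimiter == "-",
                    # None when the morpheme is not realized in the surface form.
                    "meta_morpheme": meta or None,
                    "category": category,
                    "value": value,
                }
                for delimiter, meta, category, value in _MORPHEME.findall(body)
            ],
        })
    # A group without a Proper feature inherits it from the next group that has one.
    inherited = None
    for group in reversed(groups):
        if group["proper"] is None:
            group["proper"] = inherited
        else:
            inherited = group["proper"]
    return groups


example = (
    "(ev[NN]+[PersonNumber=A3sg]+lArH[Possessive=P3pl]+NDA[Case=Loc])"
    "([PRF]-ki[Derivation=Pron]+lAr[PersonNumber=A3pl]+[Possessive=Pnon]+NHn[Case=Gen])"
    "+[Proper=False]"
)
for group in parse_analysis(example):
    print(group["pos"], group["stem"], group["proper"],
          [m["meta_morpheme"] for m in group["morphemes"]])
```

Walking the groups in reverse is one simple way to implement the inheritance rule: each group without its own Proper feature picks up the value of the nearest following group that has one.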

Python API

We also provide a Python API that can be used to morphologically analyze Turkish words, generate Turkish word forms from morphological analyses, parse human-readable morphological analyses into protobuf messages, validate their structural well-formedness, and generate human-readable analyses from them. You can see some example use cases in //examples.
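
As a hedged sketch of what calling the API might look like, the snippet below wraps the import in a guard so that it degrades gracefully when the package or its native OpenFst bindings cannot be loaded. The wrapper name analyze_word is hypothetical, and the call analyze.surface_form is an assumption about the API's entry point; the scripts in //examples are the authoritative reference.

```python
from typing import List, Optional


def analyze_word(word: str) -> Optional[List[str]]:
    """Returns human-readable analyses for `word`, or None if the API is unavailable."""
    try:
        # Assumed entry point of the Python API; see //examples for the
        # authoritative usage.
        from turkish_morphology import analyze
    except ImportError:
        return None
    return list(analyze.surface_form(word))


analyses = analyze_word("geldiğinde")
if analyses is None:
    print("turkish_morphology could not be imported")
else:
    for analysis in analyses:
        print(analysis)
```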

If you are using Bazel, you can depend on this repository as an external dependency of your project by adding the following to your WORKSPACE file:

git_repository(
  name = "google_research_turkish_morphology",
  remote = "https://github.com/google-research/turkish-morphology.git",
  tag = "{version-tag}",
)

Then, you can use @google_research_turkish_morphology//turkish_morphology:analyze (or other modules of the API) as a dependency of your relevant py_library or py_binary BUILD targets.

The API is also available on PyPI. To install the latest release from PyPI, run:

python3 -m pip install turkish-morphology

To install from source, run the following from the project root directory (preferably within a Python virtual environment):

bazel build //...
bazel-bin/setup install

Requirements

To build and run the morphological analyzer, install Bazel version 5.0.0 and Python 3.9. All other dependencies will be imported and built by Bazel according to the WORKSPACE setup during the first invocation of the morphological analyzer runtime. If you are installing from PyPI, you need pip.

Citing

If you use or discuss the code, data or tools from this repository in your work, please cite:

Öztürel, A., Kayadelen, T. & Demirşahin, I. (2019, September). A syntactically expressive morphological analyzer for Turkish. In Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing (pp. 65-75).

@inproceedings{
    title = "A Syntactically Expressive Morphological Analyzer for Turkish",
    author = "\"{O}zt\"{u}rel, Adnan and Kayadelen, Tolga and Demir\c{s}ahin,
        I\c{s}{\i}n",
    booktitle = "Proceedings of the 14th International Conference on Finite-State
        Methods and Natural Language Processing",
    month = "23--25" # sep,
    year = "2019",
    address = "Dresden, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-3110",
    pages = "65--75",
}

License

Unless otherwise noted, all original files are licensed under an Apache License, Version 2.0.

turkish-morphology's People

Contributors

ozturel, tkayadelen

turkish-morphology's Issues

Visibility-related error during initial build

I get the following error message during the initial build operation (on Lubuntu 19.10):

"//scripts/BUILD.bazel:36:1: in cc_binary rule //scripts:print_analyses: target '//src/analyzer:build_fst' is not visible from target '//scripts:print_analyses'. Check the visibility declaration of the former target if you think the dependency is legitimate."

The problem can be solved by changing the relevant part of Line 15 in src/analyzer/BUILD.bazel from "//public:pkg" to "//visibility:public".

Uzbek Morph Analyzer: Asking for Help

Hello, dear developers!

My name is Mukhammadsaid; I am a third-year Computer Science B.Sc. student at Inha University in Tashkent. This current issue bears little relevance to this repository, but I have to ask you for help.

As you know, Uzbek is a resource-poor language, although it is the second most spoken Turkic language after Turkish (approximately 34-35 million speakers). That fact prompted me to create at least something for Uzbek, so I chose a spell-checker. Later I discovered that Hunspell and other technologies might not work for Uzbek and that the right way was to develop it using FSTs. So, I started collecting all the morphotactic rules from various sources and built an MVP of a morphological analyzer, UzMorph (in foma). I hired two linguists who annotated the lexicon for the analyzer. Following Oflazer, I found many similarities between Turkish and Uzbek, although Uzbek is detached from other Turkic languages due to the Persian influence. I can send you the paper draft that explains the morphological analyzer if you want.

Then, I made a website and mobile keyboard Tahrirchi (Uzbek, "editor") for Uzbek that fully works on FSTs.

However, for further research and development the project needs funding. The current morphological analyzer recognizes around 97-98% of words, so there is room for improvement. Also, if we could build a treebank using the morphological analyzer, we could create many other tools for Uzbek, such as a tagger.

The thing is, it seems implausible that I can get any funding from the government. I have tried every way of proposing the project to the government, but my efforts bore no fruit. I wonder if there is a tiny chance that Google Research might be interested in the Uzbek language. I had to open an issue here since I couldn't find the right email address for such inquiries to Google Research. I would be more than grateful if you could help me with this matter. Thank you for understanding! Teşekkür ederim!

Morphological segmentation of Turkish words

I need to segment Turkish words. What I need is the morphemes that correspond to the surface form of an orthographic word. How can I use this morphological analyzer for this purpose?

Derivational Affixes Problem

Hello, Dear Developers!

I can't understand why some derivational affixes are universally appended to certain categories, whether or not the new complex word makes sense.

Examples: berber + lAn + mAk = berberlenmek.

What is your motivation for allowing it?

Should a morphological analyzer be concerned with the underlying semantics, or is it the job of another application to decide whether a word makes sense? Isn't generalizing such affixes the same as allowing words like "untake" or "unpay" in English?

Windows

Hello, while reviewing your library I noticed that it apparently cannot be used in a Windows environment. Among the libraries I have looked into, this one also seems to give the best results. I need a morphology library for an artificial intelligence project. I would appreciate it if you could help. In version 1.2.4, the import "from external.org_openfst import pywrapfst" in fts.py raises an error.

Python 3.7.9 and openfst

I installed the package via pip in Python 3.7.9 and wanted to test the analyzer.

Trying to import analyze with:
from turkish_morphology import analyze
results in
ImportError: cannot import name 'pywrapfst' from 'external.org_openfst' (unknown location)
for Windows,

and
ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/external/org_openfst/pywrapfst.so, 2): no suitable image found. Did find: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/external/org_openfst/pywrapfst.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x03 /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/external/org_openfst/pywrapfst.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x03

for Mac even though I installed openfst via brew.

Installing openfst via pip fails on both operating systems.

So my question is: is Python 3.9 required to run turkish-morphology, especially considering these errors related to openfst?
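
A note on the macOS error above: the quoted bytes begin with 0x7F 0x45 0x4C 0x46, which is the ELF magic number used by Linux shared objects, whereas the macOS loader expects a Mach-O binary. This suggests that the installed pywrapfst.so was built for Linux. The Python sketch below shows one way to check which format a file header belongs to; the function name identify_binary_format is hypothetical and the list of magic numbers is deliberately minimal.

```python
# Leading bytes of Mach-O files (32/64-bit, both byte orders) and universal binaries.
_MACH_O_MAGICS = {
    b"\xfe\xed\xfa\xce",
    b"\xfe\xed\xfa\xcf",
    b"\xce\xfa\xed\xfe",
    b"\xcf\xfa\xed\xfe",
    b"\xca\xfe\xba\xbe",
}


def identify_binary_format(header: bytes) -> str:
    """Guesses an executable or shared-library format from its leading bytes."""
    if header.startswith(b"\x7fELF"):
        return "ELF"
    if header[:4] in _MACH_O_MAGICS:
        return "Mach-O"
    if header.startswith(b"MZ"):
        return "PE"
    return "unknown"


# The eight bytes quoted in the dlopen error message above.
quoted = bytes.fromhex("7F 45 4C 46 02 01 01 03")
print(identify_binary_format(quoted))  # ELF
```

To inspect a file on disk, the first few bytes can be read with open(path, "rb").read(8) and passed to the same function.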

Please don't retag!

Please do not delete and re-add git tags! When tags get pushed publicly that starts a chain reaction where releases start getting pushed into distros. This process usually involves things like checksums to make sure the releases don't change since what was approved by the distro. You screw this all up when you re-post tags. If something goes wrong with a release cycle add a new patch release tag by bumping the number. If you urgently need to (e.g. for major security flaws or releases that would brick a system if used) you can pull a tag, but do not repost the same tag later!

(v1.2.5 just got reposted an hour or so after first release and it's already in some Arch package repos, now broken.)

pip installation doesn't work

Hi,
I installed the package via pip, and the installation succeeded.
Then, when I tried to import it with "from turkish_morphology import analyze", it didn't work. The error is below.

/usr/local/lib/python3.7/dist-packages/turkish_morphology/fst.py in ()
20 from typing import Generator, Iterable, List, Optional
21
---> 22 from external.openfst import pywrapfst
23
24 _Arc = pywrapfst.Arc

ImportError: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /usr/local/lib/python3.7/dist-packages/external/openfst/pywrapfst.so)

How can I solve this problem?
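
The error above indicates that pywrapfst.so requires glibc 2.29 or newer, while the system provides an older version. As a small diagnostic sketch, the Python snippet below compares the glibc version reported by the standard library against that requirement. The helper names parse_version and glibc_satisfies are hypothetical, and the default of 2.29 is taken from the error message rather than from the project's documentation.

```python
import platform
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turns a dotted version string such as '2.29' into a comparable tuple."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def glibc_satisfies(required: str = "2.29") -> bool:
    """Returns True if the running interpreter is linked against glibc >= required."""
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        # Not a glibc system (e.g. macOS, Windows or musl), or version unknown.
        return False
    return parse_version(version) >= parse_version(required)


print(platform.libc_ver())
print(glibc_satisfies())
```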

Lexical error in annotation

I am very disappointed in your work. While examining the code I noticed several errors; please fix them. Specifically, the post-merge state is incorrect when it is lexed next to a past participle.
