
LookupAnalyzerDisambiguator

Introduction

A tool for Turkish language processing that takes Turkish tokens as input (text tokenized into units such as words and punctuation marks) and outputs their disambiguated morphological analyses. Since Turkish is morphologically rich, morphological analysis is required for many tasks, including POS tagging and dependency parsing.

In our solution, we implement a simple morphological analyzer based on stem and suffix dictionaries. Using this simple morphological analyzer, all possible analyses of each token are generated. A neural network (specifically, a bidirectional character-based LSTM) is implemented with the DyNet library and trained to select the correct morphological analysis among all candidates, based on the context in which each word appears. The network architecture is similar to the one used in Shen et al.'s study.
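The lookup step can be sketched as follows. This is a minimal illustration only: the stems, suffixes, and tag strings below are toy examples chosen to mirror the responses shown later, not the actual dictionaries shipped with the repo.

```python
# Toy stem + suffix lookup: generate all candidate analyses of a token.
# The dictionaries below are illustrative; the real tool uses much larger
# stem and suffix lexicons.

STEMS = {
    "al": "al+Verb",      # "take"
    "alın": "alın+Noun",  # "forehead"
    "yazı": "yazı+Noun",  # "writing"
}

SUFFIXES = {
    "": ["+A3sg+Pnon+Nom"],    # bare noun reading
    "ın": ["+Pos+Imp+A2pl"],   # e.g. second-person-plural imperative
    "sı": ["+A3sg+P3sg+Nom"],  # third-person-singular possessive
}

def candidate_analyses(token):
    """Split the token at every position into a (stem, suffix) pair and
    emit one analysis string per pair found in both dictionaries."""
    analyses = []
    for i in range(1, len(token) + 1):
        stem, suffix = token[:i], token[i:]
        if stem in STEMS and suffix in SUFFIXES:
            for tags in SUFFIXES[suffix]:
                analyses.append(STEMS[stem] + tags)
    return analyses
```

For example, `candidate_analyses("alın")` yields both the noun reading (`alın+Noun+A3sg+Pnon+Nom`) and the verb reading (`al+Verb+Pos+Imp+A2pl`); the trained network then picks one of the candidates in context, as the example requests below show.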

Performance

Although we do not use a complex morphological analyzer, as most studies do, our results are competitive with state-of-the-art morphological disambiguators (96–97% accuracy).

We will report comprehensive evaluation results soon.

Usage

Just build a docker image using Dockerfile in the repo:

docker build --tag turkish-tagger .

Then start a Docker container from the image built in the previous step:

docker run -p 8081:8081 -d turkish-tagger

If everything goes well, the container starts serving a web application. You can then send a POST request to localhost:8081/analyze to morphologically analyze a Turkish sentence.

Example requests and responses

Request 1:

curl --request POST \
  --url http://localhost:8081/analyze \
  --header 'content-type: application/json' \
  --data '{
    "tokens" : [
        "alın",
        "yazısı"
    ]}'

Response 1:

[
    "alın+Noun+A3sg+Pnon+Nom",
    "yazı+Noun+A3sg+P3sg+Nom"
]

Request 2:

curl --request POST \
  --url http://localhost:8081/analyze \
  --header 'content-type: application/json' \
  --data '{
    "tokens" : [
        "gelirken",
        "ekmek",
        "alın",
        "."
     ]}'

Response 2:

[
    "gelir+Noun+A3sg+Pnon+Nom^DB+Verb+Zero^DB+Adverb+While",
    "ekmek+Noun+A3sg+Pnon+Nom",
    "al+Verb+Pos+Imp+A2pl",
    ".+Punc"
]
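The same requests can be sent from Python. Here is a minimal client sketch using only the standard library; it assumes the container from the Usage section is running on localhost:8081 (the function name `analyze` is our own, not part of the repo).

```python
import json
from urllib import request

def analyze(tokens, url="http://localhost:8081/analyze"):
    """POST a token list to the /analyze endpoint and return the
    disambiguated analyses (a list of strings, one per token)."""
    body = json.dumps({"tokens": tokens}).encode("utf-8")
    req = request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires the server to be up):
# analyze(["gelirken", "ekmek", "alın", "."])
```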

Notes

Please email me to ask for permission to use this tool. Also note that this is not a release version and may contain bugs. Every contribution is welcome.

My advisor and I are still working on this as part of my PhD thesis. Expect better accuracies soon :)

Contributors

erayyildiz

Issues

Parameters are not matching

Hi, I've tried to run turkish-tagger on Windows and Ubuntu. The first output was something like this:

e7d45b3b4fae652e5a88fd91a88e1f155520da4075de1fc71655c91028510f11

The Docker container exited after this output.

When I changed the run command by removing the -d flag and adding -it, I got the following error:

[dynet] random seed: 403473904
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
2019-07-16 08:07:56,997 - /usr/src/app/src/models.py - INFO - 112} - Loading Pre-Trained Model
Traceback (most recent call last):
  File "src/run.py", line 25, in <module>
    class APIHandler(BaseHandler):
  File "src/run.py", line 27, in APIHandler
    _morph_anlyzer = AnalysisScorerModel.create_from_existed_model("lookup_disambiguator_wo_suffix")
  File "/usr/src/app/src/models.py", line 326, in create_from_existed_model
    return AnalysisScorerModel(train_from_scratch=False, model_file_name=model_name)
  File "/usr/src/app/src/models.py", line 114, in __init__
    self.load_model(model_file_name, char_representation_len, word_lstm_rep_len)
  File "/usr/src/app/src/models.py", line 322, in load_model
    self.model.populate("resources/models/" + model_name + ".model")
  File "_dynet.pyx", line 1022, in _dynet.ParameterCollection.populate
  File "_dynet.pyx", line 1077, in _dynet.ParameterCollection.populate_from_textfile
RuntimeError: Dimensions of parameter /vanilla-lstm-builder/_0 looked up from file ({512,128}) do not match parameters to be populated ({512,64})

How can I solve this problem?
