Giter VIP home page Giter VIP logo

kurdishlid's Introduction

Language Identification of Kurdish & Zaza-Gorani Languages

Kurdish alphabets

Language identification or detection is the task of detecting the language in which a sentence is written. This repository provides models for language identificaiton of Kurdish and Zaza-Gorani languages with their Kurdified Perso-Arabic and Latin scripts. Our models can predict the following languages and scripts:

  • Northern Kurdish / کورمانجی (Kurmanji, kmr) - both scripts with kuarab & kulatn labels
  • Central Kurdish / سۆرانی (Sorani, ckb) - both scripts with ckbarab & ckblatn labels
  • Southern Kurdish / کوردیی خوارین (sdh)
  • Gorani / گۆرانی (Hawrami, hac)
  • Zazaki / Zazakî / (zza) - both scripts with zza for Bedirxan and zzawiki for the script used on Zazaki Wikipedia
  • Arabic / اَلْعَرَبِيَّةُ (ar)
  • Persian / فارسی (fa)
  • Turkish / Türkçe / (tr)

How to use?

Our models are trained using fastText. You can run the models in Python or on command-line by installing the fastTextlibrary as described at https://fasttext.cc/docs/en/support.html.

Two models are provided:

  • models/KLID_model.ftz: use this if you don't mind about detecting the script of the language. This predicts language codes only.
  • models/KLID_model_scr.ftz: use this if you want the script label in addition to the language code. This predicts language and script.

Here is an example in Python:

>>> import fasttext
>>> model = fasttext.load_model("models/KLID_model.ftz")

# Central Kurdish
>>> model.predict("لەزۆربەی یارییەکان گوڵ تۆمار دەکات") 
(('__label__ckb',), array([1.00002003]))
>>> model.predict("لەزۆربەی یارییەکان گوڵ تۆمار دەکات", k=5)
(('__label__ckb', '__label__ku'), array([1.00002003e+00, 1.00000989e-05]))
>>> model.predict("لەزۆربەی یارییەکان گوڵ تۆمار دەکات")
(('__label__ckb',), array([1.00002003]))
>>> model.predict("باڵیۆزی عێراق")
(('__label__ckb',), array([1.00001979]))

# Southern Kurdish
>>> model.predict("چەس ئمڕوو چە قوومیاس؟!!") 
(('__label__sdh',), array([1.00003743]))

# Gorani
>>> model.predict("داستانێ فرەتەر و درێژتەرەنه و دەسی سەر پەی") 
(('__label__hac',), array([0.99998134]))

# Kurmanji
>>> model.predict("ئەگەر بێژم ئەز فەرهادم") 
(('__label__ku',), array([0.93445575]))

# Zazaki
>>> model.predict("Seba naye zî ganî ma rayîr û metodanê xo xurtêr bikerê.") 
(('__label__zza',), array([1.00003004]))

# Northern Kurdish
>>> model.predict("Amerîkayîyan di sala 2004 de zîndana Ebû Xerîb girtin.") 
(('__label__ku',), array([0.99766862]))

# Central Kurdish
>>> model.predict("Emin filsêkim le kitêban dest nekewtbû bełam") 
(('__label__ckb',), array([1.00001991]))

# Central Kurdish
>>> model.predict("گەرەکمە پێی بێژم نامگەرەکە") 
(('__label__ku',), array([0.99485904])) 
>>> model.predict("جا ئەتوو وەرە دەگەڵ وی ڕێک کەوە")
(('__label__sdh',), array([0.84034669])) 

# English
>>> model.predict("To be, or not to be") 
(('__label__zza',), array([1.00003004]))

If you would like to train your own models, you can use the datasets provided in the datasets folder. All the datasets are merged into train and train_scr; these two files refer to the instances tagged without and with their scripts, respectively.

Cite this corpus

If you're using the models, please cite the project along with the following paper (bib file | PDF).

@inproceedings{ahmadi2023fieldmatters,
  title = "Approaches to Corpus Creation for Low-Resource Language Technology: the Case of {Southern Kurdish and Laki}",
  author = "Ahmadi, Sina and Azin, Zahra and Belelli, Sara and Anastasopoulos, Antonios",
  booktitle = "Proceedings of the second workshop on NLP applications to field linguistics",
  month = may,
  year = "2023",
  address = "Dubrovnik, Croatia",
  publisher = "The 17th Conference of the European Chapter of the Association for Computational Linguistics"
}

License

MIT

kurdishlid's People

Contributors

sinaahmadi avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.