Giter VIP home page Giter VIP logo

fitrat's Introduction

fitrat

Abdurauf Fitrat

An NLP library for Uzbek. It includes morphological analysis, transliterators, language identifiers, tokenizers and many more.

It is named after historian and linguist Abdurauf Fitrat, who was one of the creators of Modern Uzbek as well as the first Uzbek professor.


Usage

Installation

pip install fitrat

Transliteration

We used hfst library for creating transliterators. This library provides finite-state transducers, a finite-state machines that come very handy for efficient mapping one text to another.

from fitrat import Transliterator, WritingType

t = Transliterator(to=WritingType.LAT)
result = t.convert("Кеча циркка бордим.")
print(result)
# Kecha sirkka bordim.

t2 = Transliterator(to=WritingType.CYR)
result = t2.convert("Kecha sirkka bordim.")
print(result)
# Кеча циркка бордим.

While Cyrillic-Latin conversion is rule-based and simple, the converse is not true. We included special pre-compiled exceptions transducer for Latin-Cyrillic that handles all (to our knowledge) exceptions. We'll continue working on improving on our exceptions list.

If you want to compile the transliterators from source, you have to use hfst-dev or hfst library. The package uses only pre-compiled binaries and hfstol library for efficient lookup.

Language Identification

We can recognize Uzbek text, both Latin or Cyrillic. Additionally, we can recognize other major languages, such as Russian, English, Arabic and etc.

from fitrat import LanguageDetector

lang_detector = LanguageDetector()

print(lang_detector.is_uzbek("bu o'zbekchada yozilgan matn"))
# True

print(lang_detector.is_uzbek("бу нотугри йозилган булсаям, лекин узбекча матн"))
# True

print(lang_detector.is_uzbek("Текст на русском языке"))
# False

Tokenization

from fitrat import word_tokenize

s = "Bugun o'zbekchada gapirishga qaror qildim!"
print(word_tokenize(s))
# ['Bugun', "o'zbekchada", 'gapirishga', 'qaror', 'qildim', '!']

Authors

  • Mukhammadsaid Mamasaidov
  • Jasur Yusupov

fitrat's People

Contributors

mukhammadsaid19 avatar goodeejay avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.