language-detector's Introduction

language-detector

language-detector detects the language of text

Installation

pip install language-detector

Python Version

Works with both Python 2 and 3

Use

from language_detector import detect_language

text = "I arrived in that city on January 4, 1937"
language = detect_language(text)
print(language)  # prints English

Features

Languages Supported
Arabic
English
Farsi
French
German
Khmer
Kurmanci (Kurdish)
Mandarin
Russian
Sorani (Kurdish)
Spanish
Turkish

Testing

To test the package, run:

python -m unittest language_detector.tests.test

Comparison

The test compares how accurately and how quickly language-detector and langid identify languages in the data sources.

package                      language-detector   langid
test-duration (in seconds)   0.10                3.83
accuracy                     96.77%              67.74%

Excluding Languages

If you don't want language-detector to look for certain languages, you can monkey-patch the code. For example, in order to exclude English:

import language_detector

# Keep every saved (character, language, ...) entry except the English ones.
language_detector.char_language = [
    cl for cl in language_detector.char_language if cl[1] != "English"
]

# proceed as normal

Datasets

The following is a list of datasets used for each language:

Language             Datasets
Arabic               UN Corpora
English              UN Corpora
Farsi                BBC News Persian
French               UN Corpora
German               Deutsche Welle
Khmer                Cambodia Daily
Kurmanci (Kurdish)   Rudaw
Mandarin             UN Corpora
Russian              UN Corpora
Sorani (Kurdish)     Rudaw
Spanish              UN Corpora
Turkish              BBC News Türkçe

How Does It Work?

When training the model, we scan all the data sources and compute how frequently each character appears in each language. We also compute how frequently each character appears across all of the data sources for all the languages. For each language, we then calculate a score for each character as frequency_in_language / frequency_in_all_languages, so a character that is common in one language but rare overall scores highly for that language. We then save the ten highest-scoring characters for each language.
When detecting a language, we simply iterate through the saved characters (ten for each language) and add their scores as weighted votes for each language. Whichever language has the highest total score is selected as the winner.
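
The two steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the package's source: the train and detect names, the tiny corpora, and the choice to weight each vote by how often the character occurs in the input are all assumptions. The (character, language, score) entry layout is borrowed from the monkey-patching example, where index 1 holds the language name.

```python
from collections import Counter


def train(corpora, top_n=10):
    """Build a list of (character, language, score) entries.

    corpora is a hypothetical dict mapping a language name to training text.
    """
    per_language = {}
    overall = Counter()
    for language, text in corpora.items():
        counts = Counter(ch for ch in text if not ch.isspace())
        per_language[language] = counts
        overall.update(counts)
    total_all = sum(overall.values())

    model = []
    for language, counts in per_language.items():
        total = sum(counts.values())
        scored = []
        for ch, n in counts.items():
            frequency_in_language = n / total
            frequency_in_all_languages = overall[ch] / total_all
            scored.append((frequency_in_language / frequency_in_all_languages, ch))
        scored.sort(reverse=True)  # highest score first
        for score, ch in scored[:top_n]:
            model.append((ch, language, score))
    return model


def detect(text, model):
    """Return the language with the highest weighted vote, or None if no votes."""
    votes = Counter()
    for ch, language, score in model:
        # Assumption: each occurrence of a saved character adds its score.
        votes[language] += score * text.count(ch)
    if not votes or max(votes.values()) <= 0:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical toy corpora; real training would use the datasets listed above.
corpora = {
    "English": "tea at ten",
    "Russian": "мама мыла раму",
}
model = train(corpora)
print(detect("neat", model))
print(detect("мыло", model))
```

Because only a handful of characters per language are saved, detection is a short loop over the model rather than a pass over a large feature table, which is consistent with the short test duration reported in the comparison.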

Contributing

If you'd like to contribute a new language, please consult CONTRIBUTING.md.

Support

Contact the package author, Daniel J. Dufour, at [email protected]

language-detector's People

Contributors

danieljdufour, ftkurt, viymak

language-detector's Issues

German Support

Hi,

It would be nice if you could add German as a detection language. Thanks!

Japanese text being identified as Kurmanji

It's probably because of this line.

– Kurmanci 7.2974490546991655

See the following texts:

୨୧譲渡交換୨୧ ツイステ 色紙コレクション vol.1 vol.2 譲┊︎デューストレイケイト ジャミルオルトシルバー 求┊︎同異種リドル or 定価(+送料) 郵送 or 都内手渡し可能 ⿻ 各1BOX予約済みです。 ⿻…

東映HP更新✨ 来週はガルザとクランチュラがジャメンタルを研究🔍録りおろしナレーションたっぷりでお届けします! そしてHPで #キラトーーク 延長戦!? 魔進の声を演じるキャストのテンションMAX!なコメントを掲載しております✨ #キラ…

「DXヒューマギアプログライズキーセット」はご予約受付中!シェスタ、腹筋崩壊太郎、マモル、一貫ニギローのデータを宿したプログライズキーのセットです✨ 別売りのDXなりきりシリーズとも連動します。 URL…

Use Freq

Is your feature request related to a problem? Please describe.
N/A

Describe the solution you'd like
Use Freq to reduce lines of code

Describe alternatives you've considered
N/A

Additional context
https://github.com/danieljdufour/freq
