languagedetector's Introduction

LanguageDetector

A PHP class to detect the language of any free text.

It follows the approach described in the paper: a given text is tokenized into N-Grams (whitespace is cleaned up before this step). The tokens are then sorted and compared against a language model.
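As a rough illustration of the tokenizing and ranking steps, here is a minimal, self-contained sketch. It does not use the library's classes: `extractNgrams` and `rankNgrams` are hypothetical helper names, the 1-to-3 character N-Gram range and the 300-entry limit are assumptions, and ranking is done by plain frequency rather than by the library's default sorter.

```php
<?php
// Hypothetical helpers for illustration; not part of LanguageDetector's API.

/**
 * Clean up whitespace, then count every N-Gram of $min..$max characters.
 */
function extractNgrams(string $text, int $min = 1, int $max = 3): array
{
    // Whitespace cleanup: trim and collapse runs into a single space.
    $clean = preg_replace('/\s+/u', ' ', trim($text));
    // Split into UTF-8 characters so multi-byte letters stay intact.
    $chars = preg_split('//u', $clean, -1, PREG_SPLIT_NO_EMPTY);
    $length = count($chars);

    $counts = [];
    for ($n = $min; $n <= $max; $n++) {
        for ($i = 0; $i + $n <= $length; $i++) {
            $gram = implode('', array_slice($chars, $i, $n));
            $counts[$gram] = ($counts[$gram] ?? 0) + 1;
        }
    }
    return $counts;
}

/**
 * Return the N-Grams ordered from most to least frequent.
 */
function rankNgrams(array $counts, int $limit = 300): array
{
    // PHP turns numeric string keys into integers, so cast them back.
    $grams = array_map('strval', array_keys($counts));
    usort($grams, function ($a, $b) use ($counts) {
        // Higher counts first; ties are broken alphabetically for determinism.
        return ($counts[$b] <=> $counts[$a]) ?: strcmp($a, $b);
    });
    return array_slice($grams, 0, $limit);
}

// Build a ranked profile for a short text.
$profile = rankNgrams(extractNgrams("hola   mundo"));
```

A ranked list like `$profile` is the kind of structure that would then be compared against each language's profile in the model.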

How it works

The first thing we need is a language model (which looks like this file) that texts are compared against at classification time. This model must be generated before anything else, and it can be built with a script similar to this file.

// register the autoloader
require 'lib/LanguageDetector/autoload.php';

// it could use a little bit of memory, but it's fine
// because this process runs once.
ini_set('memory_limit', '1G');

// we load the configuration (which will be serialized
// later into our language model file)
$config = new LanguageDetector\Config;

$c = new LanguageDetector\Learn($config);
foreach (glob(__DIR__ . '/samples/*') as $file) { 
    // feed with examples ('language', 'text');
    $c->addSample(basename($file), file_get_contents($file));
}

// some callback so we know where the process is 
$c->addStepCallback(function($lang, $status) {
    echo "Learning {$lang}: $status\n";
});

// save it in `datafile`.
// we currently support the `php` serialization, but it's trivial
// to add other formats: just extend `\LanguageDetector\Format\AbstractFormat`.
// You can see an example at https://github.com/crodas/LanguageDetector/blob/master/lib/LanguageDetector/Format/PHP.php
$c->save(AbstractFormat::initFormatByPath('language.php'));

Once we have our language model file (in this case language.php) we're ready to classify texts by their language.

// register the autoloader
require 'lib/LanguageDetector/autoload.php';

// we load the language model; it will create
// the $config object for us.
$detect = LanguageDetector\Detect::initByPath('language.php');

$lang = $detect->detect("Agricultura (-ae, f.), sensu latissimo, 
est summa omnium artium et scientiarum et technologiarum quae de 
terris colendis et animalibus creandis curant, ut poma, frumenta, 
charas, carnes, textilia, et aliae res e terra bene producantur. 
Specialius, agronomia est ars et scientia quae terris colendis student, 
agricultio autem animalibus creandis.");

var_dump($lang);

And that's it.

Algorithms

The project is designed to work with modules, which means you can provide your own algorithms for sorting and comparing the N-Grams. By default the library uses PageRank as the sorting algorithm and out-of-place (described in the paper) as the comparison measure.
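To make the out-of-place comparison concrete, here is a minimal sketch of the measure. The function names `outOfPlace` and `closestLanguage` are hypothetical, the penalty for a missing N-Gram (the length of the language profile) is an assumption, and both inputs are assumed to be lists already ranked from most to least important. The PageRank sorting step is not shown.

```php
<?php
// Hypothetical helpers for illustration; not the library's own comparer.

/**
 * Out-of-place distance between two ranked N-Gram lists.
 * Lower values mean the two profiles are more alike.
 */
function outOfPlace(array $textProfile, array $languageProfile): int
{
    // Map each N-Gram in the language profile to its 0-based rank.
    $languageRanks = array_flip($languageProfile);
    // Assumed penalty for an N-Gram the language profile does not contain.
    $maxPenalty = count($languageProfile);

    $distance = 0;
    foreach (array_values($textProfile) as $rank => $gram) {
        $distance += isset($languageRanks[$gram])
            ? abs($rank - $languageRanks[$gram]) // how far out of place it is
            : $maxPenalty;
    }
    return $distance;
}

/**
 * Pick the language whose profile has the smallest distance to the text.
 */
function closestLanguage(array $textProfile, array $languageProfiles): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($languageProfiles as $language => $profile) {
        $distance = outOfPlace($textProfile, $profile);
        if ($distance < $bestDistance) {
            $best = $language;
            $bestDistance = $distance;
        }
    }
    return $best;
}
```

Each N-Gram contributes the gap between its rank in the text and its rank in the language profile, so the language with the lowest total is taken as the best match.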

In order to supply your own algorithms, you must change the $config at the learning stage to load your own classes (which, by the way, should implement some interfaces).

languagedetector's People

Contributors

adam-lynch, crodas, mente, pborreli, sasezaki

languagedetector's Issues

More details about achieving a good quality ratio

First of all, thanks for the great work and for sharing it on GitHub.
I've installed and played around with it, first using the sample language files, then adding a couple of books by language to the learn system.

Though the success ratio is quite poor... It would be nice to have some explanations and use cases showing how much text, and what kind, should be given to the system for it to learn and achieve a good success ratio.

Swedish sample wrong

I noticed that the Swedish sample is completely wrong.

What corpus did you use for the samples?
I can provide a fix but I thought it would be best to use the same corpus for all samples.

It breaks with some special characters

Excellent work! Thank you.
Just an observation: if you feed it with things like:
x
:)
es genial¡¡¡¡¡¡

It throws an error like this:

exception 'RuntimeException' with message 'Invalid or missing outlinks' in C:\Users\personal\GoogleDrive\202_Librerias\LanguageDetector-master\lib\LanguageDetector\Sort\PageRank.php:152
Stack trace:
#0 C:\Users\personal\GoogleDrive\202_Librerias\LanguageDetector-master\lib\LanguageDetector\Detect.php(83): LanguageDetector\Sort\PageRank->sort(Array)
#1 C:\Users\personal\GoogleDrive\202_Librerias\LanguageDetector-master\lib\LanguageDetector\Detect.php(122): LanguageDetector\Detect->detectChunk('!')
#2 C:\Users\personal\GoogleDrive\202_Librerias\LanguageDetector-master\example\detectaIdiomaALista.php(24): LanguageDetector\Detect->detect('!')

Just saying, so it can be more robust 👍

Maybe it could give "ascii art" as language : P
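
As a possible caller-side workaround for inputs like the ones above, detection could be skipped when a string has too few letters. This is only a sketch: `hasEnoughLetters` is a hypothetical helper, the threshold of three letters is an arbitrary assumption, and nothing here guarantees that every input passing the check avoids the exception.

```php
<?php
// Hypothetical guard for illustration; not part of LanguageDetector.

/**
 * Return true when $text contains at least $minLetters Unicode letters.
 */
function hasEnoughLetters(string $text, int $minLetters = 3): bool
{
    // \p{L} matches any Unicode letter; punctuation and emoticons are ignored.
    $letters = preg_match_all('/\p{L}/u', $text);
    // preg_match_all() returns false on failure, e.g. for invalid UTF-8.
    return $letters !== false && $letters >= $minLetters;
}

// Intended use (needs the library and a language model, so it is commented out):
// $lang = hasEnoughLetters($input) ? $detect->detect($input) : null;
```

With this check, "x", ":)" and "!" would be skipped, while "es genial¡¡¡¡¡¡" would still be passed to the detector.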
