
elasticsearch-langdetect's Introduction

A langdetect plugin for Elasticsearch


Tower of Babel

This is a plugin for Elasticsearch based on Nakatani Shuyo's language detector implementation.

It uses character 3-grams and a Bayesian filter with various normalizations and feature sampling. The precision is over 99% for 53 languages.
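
As a rough illustration of that approach, here is a toy sketch of character 3-gram extraction and naive Bayes scoring. This is not the plugin's actual code; the profiles and frequencies are invented for illustration only.

```python
import math
from collections import Counter

# Toy 3-gram profiles; the real plugin ships trained profiles for 53
# languages. These frequencies are invented for illustration only.
PROFILES = {
    "en": Counter({"the": 12, "he ": 9, " th": 8, "and": 6}),
    "de": Counter({"ein": 10, "der": 9, "ich": 7, "sch": 6}),
}

def trigrams(text):
    text = " " + text.lower() + " "
    return [text[i:i + 3] for i in range(len(text) - 2)]

def detect(text, smoothing=0.5):
    scores = {}
    for lang, profile in PROFILES.items():
        total = sum(profile.values())
        score = 0.0
        for gram in trigrams(text):
            # Additive smoothing keeps unseen 3-grams from zeroing
            # out the (log-)product of per-gram probabilities.
            p = (profile.get(gram, 0) + smoothing) / (total + smoothing)
            score += math.log(p)
        scores[lang] = score
    return max(scores, key=scores.get)
```

With profiles for many languages, the normalization and feature-sampling steps dominate the engineering effort; the sketch only shows the scoring core.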

The plugin offers a mapping type to specify fields where you want to enable language detection. Detected languages are indexed into a subfield of the field named 'lang', as you can see in the example. The field can be queried for language codes.

You can use the multi_field mapping type to combine this plugin with the attachment mapper plugin to enable language detection in base64-encoded binary data. Currently, only UTF-8 texts are supported.

The plugin also offers a REST endpoint to which a short UTF-8 text can be posted; the plugin responds with a list of recognized languages.

Here is the list of language codes recognized:

Table 1. Languages

Code    Description
af      Afrikaans
ar      Arabic
bg      Bulgarian
bn      Bengali
cs      Czech
da      Danish
de      German
el      Greek
en      English
es      Spanish
et      Estonian
fa      Farsi
fi      Finnish
fr      French
gu      Gujarati
he      Hebrew
hi      Hindi
hr      Croatian
hu      Hungarian
id      Indonesian
it      Italian
ja      Japanese
kn      Kannada
ko      Korean
lt      Lithuanian
lv      Latvian
mk      Macedonian
ml      Malayalam
mr      Marathi
ne      Nepali
nl      Dutch
no      Norwegian
pa      Eastern Punjabi
pl      Polish
pt      Portuguese
ro      Romanian
ru      Russian
sk      Slovak
sl      Slovene
so      Somali
sq      Albanian
sv      Swedish
sw      Swahili
ta      Tamil
te      Telugu
th      Thai
tl      Tagalog
tr      Turkish
uk      Ukrainian
ur      Urdu
vi      Vietnamese
zh-cn   Chinese
zh-tw   Traditional Chinese characters (Taiwan, Hong Kong, Macau)

Table 2. Compatibility matrix

Plugin version   Elasticsearch version   Release date
5.4.0.2          5.4.0                   Jun 8, 2017
5.4.0.1          5.4.0                   May 30, 2017
5.4.0.0          5.4.0                   May 10, 2017
5.3.2.0          5.3.2                   Apr 30, 2017
5.3.1.0          5.3.1                   Apr 30, 2017
5.3.0.2          5.3.0                   Apr 3, 2017
5.3.0.1          5.3.0                   Apr 1, 2017
5.3.0.0          5.3.0                   Mar 30, 2017
5.2.2.0          5.2.2                   Mar 2, 2017
5.2.1.0          5.2.1                   Mar 2, 2017
5.1.2.0          5.1.2                   Jan 26, 2017
2.4.4.1          2.4.4                   Jan 25, 2017
2.3.3.0          2.3.3                   Jun 11, 2016
2.3.2.0          2.3.2                   Jun 11, 2016
2.3.1.0          2.3.1                   Apr 11, 2016
2.2.1.0          2.2.1                   Apr 11, 2016
2.2.0.2          2.2.0                   Mar 25, 2016
2.2.0.1          2.2.0                   Mar 6, 2016
2.1.1.0          2.1.1                   Dec 20, 2015
2.1.0.0          2.1.0                   Dec 15, 2015
2.0.1.0          2.0.1                   Dec 15, 2015
2.0.0.0          2.0.0                   Nov 12, 2015
1.6.0.0          1.6.0                   Jul 1, 2015
1.4.4.2          1.4.4                   Apr 3, 2015
1.4.4.1          1.4.4                   Mar 4, 2015
1.4.0.2          1.4.0                   Nov 26, 2014
1.4.0.1          1.4.0                   Nov 20, 2014
1.4.0.0          1.4.0                   Nov 14, 2014
1.3.1.0          1.3.0                   Jul 30, 2014
1.2.1.1          1.2.1                   Jun 18, 2014

Installation

Elasticsearch 5.x

./bin/elasticsearch-plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/5.4.0.2/elasticsearch-langdetect-5.4.0.2-plugin.zip

Elasticsearch 2.x

./bin/plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/2.4.4.1/elasticsearch-langdetect-2.4.4.1-plugin.zip

Elasticsearch 1.x

./bin/plugin -install langdetect -url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/1.6.0.0/elasticsearch-langdetect-1.6.0.0-plugin.zip

Do not forget to restart the node after installing.

Examples

Note
The examples are written for Elasticsearch 5.x and need to be adapted to earlier versions of Elasticsearch.

A simple language detection example

In this example, we create a simple detector field, and write text to it for detection.

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "langdetect",
               "languages" : [ "en", "de", "fr" ]
            }
         }
      }
   }
}

PUT /test/docs/1
{
      "text" : "Oh, say can you see by the dawn's early light, What so proudly we hailed at the twilight's last gleaming?"
}

PUT /test/docs/2
{
      "text" : "Einigkeit und Recht und Freiheit für das deutsche Vaterland!"
}

PUT /test/docs/3
{
      "text" : "Allons enfants de la Patrie, Le jour de gloire est arrivé!"
}

POST /test/_search
{
       "query" : {
           "term" : {
                "text" : "en"
           }
       }
}

POST /test/_search
{
       "query" : {
           "term" : {
                "text" : "de"
           }
       }
}

POST /test/_search
{
       "query" : {
           "term" : {
                "text" : "fr"
           }
       }
}

Indexing language-detected text alongside the language code

In most cases, just indexing the language code is not enough: the detected text should also be passed to an analyzer that applies language-specific analysis. This plugin supports that via the language_to parameter.

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "langdetect",
               "languages": [
                  "de",
                  "en",
                  "fr",
                  "nl",
                  "it"
               ],
               "language_to": {
                  "de": "german_field",
                  "en": "english_field"
               }
            },
            "german_field": {
               "analyzer": "german",
               "type": "string"
            },
            "english_field": {
               "analyzer": "english",
               "type": "string"
            }
         }
      }
   }
}

PUT /test/docs/1
{
  "text" : "Oh, say can you see by the dawn's early light, What so proudly we hailed at the twilight's last gleaming?"
}

POST /test/_search
{
   "query" : {
       "match" : {
            "english_field" : "light"
       }
   }
}

Language code and multi_field

Using multi-fields, it is possible to store the text alongside the detected language(s). Here, we use another (short nonsense) example text for demonstration, which yields more than one detected language code.

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "fields": {
                  "language": {
                     "type": "langdetect",
                     "languages": [
                        "de",
                        "en",
                        "fr",
                        "nl",
                        "it"
                     ],
                     "store": true
                  }
               }
            }
         }
      }
   }
}

PUT /test/docs/1
{
    "text" : "Oh, say can you see by the dawn's early light, What so proudly we hailed at the twilight's last gleaming?"
}

POST /test/_search
{
   "query" : {
       "match" : {
            "text" : "light"
       }
   }
}

POST /test/_search
{
   "query" : {
       "match" : {
            "text.language" : "en"
       }
   }
}

Language detection in a binary field with the attachment mapper plugin

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type" : "attachment",
               "fields" : {
                  "content" : {
                     "type" : "text",
                     "fields" : {
                        "language" : {
                           "type" : "langdetect",
                           "binary" : true
                        }
                     }
                  }
               }
            }
         }
      }
   }
}

In a shell, enter the following commands:

rm -f index.tmp
echo -n '{"content":"' >> index.tmp
echo "This is a very simple text in plain english" | base64 >> index.tmp
echo -n '"}' >> index.tmp
curl -XPOST --data-binary "@index.tmp" 'localhost:9200/test/docs/1'
rm -f index.tmp
curl -XPOST 'localhost:9200/test/_refresh'
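
The same payload can also be built programmatically, avoiding shell quoting pitfalls. A minimal Python sketch; the field name content matches the mapping above, and the target URL is an assumption of this example:

```python
import base64
import json

def binary_doc(text: str) -> str:
    """Build the {"content": "<base64>"} payload the shell commands above produce."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return json.dumps({"content": encoded})

payload = binary_doc("This is a very simple text in plain english")
# POST the payload to localhost:9200/test/docs/1 with any HTTP client,
# then refresh the index as shown above.
```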

POST /test/_search
{
   "query" : {
       "match" : {
            "content" : "very simple"
       }
   }
}

POST /test/_search
{
   "query" : {
       "match" : {
            "content.language" : "en"
       }
   }
}

Language detection REST API Example

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'This is a test'
{
  "languages" : [
    {
      "language" : "en",
      "probability" : 0.9999972283490304
    }
  ]
}
curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'Das ist ein Test'
{
  "languages" : [
    {
      "language" : "de",
      "probability" : 0.9999985460514316
    }
  ]
}
curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'Datt isse ne test'
{
  "languages" : [
    {
      "language" : "no",
      "probability" : 0.5714275763833249
    },
    {
      "language" : "nl",
      "probability" : 0.28571402563882925
    },
    {
      "language" : "de",
      "probability" : 0.14285660343967294
    }
  ]
}

Use _langdetect endpoint from Sense

GET _langdetect
{
   "text": "das ist ein test"
}

Change profile of language detection

There is a "short text" profile which is better at detecting languages from only a few words.

curl -XPOST 'localhost:9200/_langdetect?pretty&profile=short-text' -d 'Das ist ein Test'
{
  "profile" : "/langdetect/short-text/",
  "languages" : [ {
    "language" : "de",
    "probability" : 0.9999993070517024
  } ]
}

Settings

These settings can be used in elasticsearch.yml to modify language detection.

Use with caution. You normally do not need to modify these settings; this list is included only for completeness. Before modifying the model parameters, you should study the source code and be familiar with probabilistic matching using naive Bayes over character n-grams. See also Ted Dunning, Statistical Identification of Language, 1994.

Name               Description
languages          a comma-separated list of language codes (e.g. de,en,fr,…) used to restrict (and speed up) the detection process
map.<code>         a substitution code for a language code
number_of_trials   number of trials; affects CPU usage (default: 7)
alpha              additional smoothing parameter (default: 0.5)
alpha_width        the width of smoothing (default: 0.05)
iteration_limit    safeguard to break the detection loop (default: 10000)
prob_threshold     probability threshold (default: 0.1)
conv_threshold     detection is terminated when the normalized probability exceeds this threshold (default: 0.99999)
base_freq          base frequency (default: 10000)
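
To give a feel for how these parameters interact, here is a hypothetical sketch of the detection trial loop. The profiles are invented toys and the math is simplified; this is not the plugin's real implementation.

```python
import random

random.seed(42)  # deterministic for the example

# Invented toy profiles; the plugin loads real per-language n-gram data.
PROFILES = {
    "en": {"the": 0.6, "he ": 0.4},
    "de": {"ein": 0.7, "der": 0.3},
}

def detect(ngrams, number_of_trials=7, alpha=0.5, alpha_width=0.05,
           iteration_limit=10000, conv_threshold=0.99999, base_freq=10000):
    langs = list(PROFILES)
    total = [0.0] * len(langs)
    for _ in range(number_of_trials):
        # Each trial jitters the smoothing parameter alpha a little.
        a = alpha + random.gauss(0.0, 1.0) * alpha_width
        prob = [1.0 / len(langs)] * len(langs)
        for i in range(iteration_limit):
            gram = random.choice(ngrams)  # feature sampling
            for j, lang in enumerate(langs):
                # Smoothed n-gram probability.
                prob[j] *= PROFILES[lang].get(gram, 0.0) + a / base_freq
            if i % 5 == 0:
                s = sum(prob) or 1.0
                prob = [p / s for p in prob]    # renormalize
                if max(prob) > conv_threshold:  # early termination
                    break
        s = sum(prob) or 1.0
        total = [t + p / s / number_of_trials for t, p in zip(total, prob)]
    return sorted(zip(langs, total), key=lambda pair: -pair[1])

ranked = detect(["the", "he ", "the"])
```

The sketch shows why more trials cost CPU, why alpha and alpha_width blur the model, and how conv_threshold ends a trial early once one language dominates.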

Issues

All feedback is welcome! If you find issues, please post them on GitHub.

Credits

Thanks to Alexander Reelsen for his OpenNLP plugin, from where I have copied and adapted the mapping type code.

License

elasticsearch-langdetect - a language detection plugin for Elasticsearch

Derived work of language-detection by Nakatani Shuyo http://code.google.com/p/language-detection/

Copyright © 2012 Jörg Prante

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


elasticsearch-langdetect's People

Contributors

jprante, juliendangers, marbleman, stambizzle, xyu, yanirs


elasticsearch-langdetect's Issues

Not working for nested objects

It seems that the language detection does not work for the fields of nested objects.
Here's a sample mapping:

{
  mappings: {
    document: {
      properties: {
        title: {
          type: "string",
          copy_to: "l1"
        },
        l1: {
          type: "langdetect",
          store: true
        },
        chunks: {
          type: "nested",
          properties: {
            text: {
              type: "string",
              copy_to: "chunks.l2"
            },
            l2: {
              type: "langdetect",
              store: true
            }
          }
        }
      }
    }
  }
}

and the doc:

{
  title: "hello, world",
  chunks: [
    {
      text: "au revoir"
    }
  ]
}

It works for "l1" field, but it doesn't work for "l2" field. I tried the mapping without "copy_to" (just using those fields directly), to simplify the use case, but to no avail.

Getting "action [langdetect] is unauthorized for user" when using Shield

Hi,

I am using langdetect plugin on ES 2.2.1 with Shield. The tests work correctly before Shield is installed, but after Shield is installed, I am seeing the following error:

curl -XPOST -u es_admin 'http://localhost:9200/_langdetect?pretty' -d 'This is a test'
Enter host password for user 'es_admin':
{
  "error" : {
    "root_cause" : [ {
      "type" : "security_exception",
      "reason" : "action [langdetect] is unauthorized for user [es_admin]"
    } ],
    "type" : "security_exception",
    "reason" : "action [langdetect] is unauthorized for user [es_admin]"
  },
  "status" : 403
}

I am using a default admin user with admin role

bin/shield/esusers useradd es_admin -r admin

with admin role

admin:
  cluster: all
  indices:
    '*':
      privileges: all

Is there any additional configuration required for Shield?

Accuracy problem

Hi,

I get some strange results when I use it on French text:

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'je vend ma chemise verte'
{
  "ok" : true,
  "languages" : [ {
    "language" : "nl",
    "probability" : 0.9999951375010268
  } ]
}

It's French, but I get "nl". Is something wrong?

Installation problems with ES 0.90

I can't install the plugin by version number:
I tried bin/plugin -install jprante/elasticsearch-langdetect/1.0.0, and it returned "failed to download out of all possible locations".

I tried with the master, but the issue is: Plugin installation assumed to be site plugin, but contains source code, aborting installation

Could you write in the installation section the way to compile it?

Plugin is not compatible with ES 2.1.0

After upgrading ES2.0 to 2.1.0 it does not start with the error message:

Plugin [langdetect] is incompatible with Elasticsearch [2.1.0]. Was designed for version [2.0.0]

Should not fail and throw an exception when given punctuation, UTF-8 chars/symbols, empty text

There are a number of different cases where the language detector will fail:

  • Leading punctuation. Something like "----------ROMA.....sexy ragazza orientale 3888669169---------- ' Perché correre affannosamente qua e là senza motivo? Tu sei ciò che l'esistenza vuole che tu sia.'" will fail. Generally, any text that leads off with a number of punctuation characters fails.
  • Unicode symbols, emoticons, etc. anywhere in the text cause failures: U+2000-U+2BFF (symbols), U+1F000-U+1FFFF (symbols, emoticons), probably others
  • Any characters in the U+1780-U+17FF range (Khmer script) cause failures
  • Any text that contains no Unicode letters (\p{L} in PCRE) fails

This is probably an issue with the underlying library, but if so, then would be nice for this wrapper to run some checks. Currently I have the following checks implemented in my client: https://gist.github.com/gibrown/7122061

Running about a million lang detect API calls a day, and I think this catches almost all failures.
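
Along the lines of the checks described above, a hedged client-side pre-validation sketch. The ranges and rules come from this report, not from the plugin itself, and the function name is illustrative:

```python
import re

# Ranges reported as problematic: general symbols, emoticons, Khmer script.
SYMBOL_RANGES = re.compile("[\u2000-\u2bff\u1780-\u17ff\U0001f000-\U0001ffff]")

def safe_for_langdetect(text: str) -> bool:
    text = text.strip()
    if not text:
        return False                    # empty text fails
    if SYMBOL_RANGES.search(text):
        return False                    # symbols/emoticons/Khmer fail
    if not any(ch.isalpha() for ch in text):
        return False                    # no Unicode letters (\p{L}) fails
    return True
```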

Apply a filter on text before n-gram detection

I ran into an issue which could be solved by running some custom filters (I do not mean Lucene filters, but predefined filters such as lowercase, uppercase, ...):

I got the following French tweet, all in uppercase:

COMMENT DES GENS PEUVENT TROUVER DES CÉLÉBRITÉS DANS LES MAGASINS JE PEUX MÊME PAS TROUVER MA MÈRE
which is detected as English:

{
    "language": "en",
    "probability": 0.9999937971825049
}

But when I ask for the exact same text lowercased,

comment des gens peuvent trouver des célébrités dans les magasins je peux même pas trouver ma mère

{
    "language": "fr",
    "probability": 0.9999970343219597
}

French is now detected.

Should accept empty value when indexing

Currently, if the data is empty, langdetect throws an exception and stops indexing data (if it is in a bulk process).
So I think the plugin should accept empty/null values and return prob = 0, or have an option to set a default language in case the data is empty/null.
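
Until the plugin handles this, a hypothetical client-side guard can keep empty values out of a bulk run. Field names here are examples, not the plugin's API:

```python
def prepare_doc(doc, text_field="text", default_lang=None):
    """Drop an empty langdetect field (or substitute a default language)
    so bulk indexing is not aborted by empty/null input."""
    text = (doc.get(text_field) or "").strip()
    if text:
        return doc
    cleaned = dict(doc)
    cleaned.pop(text_field, None)       # remove the empty field entirely
    if default_lang is not None:
        cleaned["lang"] = default_lang  # optional fallback language
    return cleaned
```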

Re-using the bayesian filter

Is there any plan to "extract" the Bayesian filter to use it with other types of data, to filter our spam content for example?

URL often generates lang:en on short text

On short texts containing a URL, English is almost always detected.

Example :

an arabic tweet with an url :

POST _langdetect?pretty
{
  "query_string": "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\" https://www.facebook.com/dralqarnee/posts/675689432512881"
}

Produces :

{
   "languages": [
      {
         "language": "en",
         "probability": 0.857138346512083
      },
      {
         "language": "ar",
         "probability": 0.14285639031760403
      }
   ]
}

English is detected with a greater probability...

Without any url :

POST _langdetect?pretty
{
  "query_string": "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\""
}

Produces :

{
   "languages": [
      {
         "language": "ar",
         "probability": 0.5714272046098048
      },
      {
         "language": "so",
         "probability": 0.42857034099037317
      }
   ]
}

English is not even detected!

I can submit a pull request, I've already done the changes on my own.
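
The fix can be sketched as a preprocessing step; this is a hypothetical client-side version, and the author's actual change may differ: strip URLs, and for tweets also @mentions and the RT marker, before calling _langdetect.

```python
import re

# Strip URLs and tweet markup before language detection; illustrative
# regexes, not the plugin's implementation.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+:?")

def clean_for_langdetect(text: str) -> str:
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = re.sub(r"\bRT\b", " ", text)   # retweet marker
    return " ".join(text.split())
```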

Elasticsearch 2.0.1 not compatible with plugin 2.0.0

Looks like version 2.0.0 of langdetect is not compatible with ES 2.0.1

After installing I got this error:
ERROR: Plugin [langdetect] is incompatible with Elasticsearch [2.0.1]. Was designed for version [2.0.0]

ES 2.4.0

Could you help me create a plugin build for ES 2.4.0? It would allow me to complete my task. I use your plugin in conjunction with another one, so I cannot upgrade ES to version 5 right now.
Thanks in advance

ES 5.3

Please create a new release!

Update for ES 2.2.1

Could you provide an update for ES 2.2.1 as last release is not compatible with it?

Is there an "all" option for language detection?

It seems that we have to always specify a value for "languages" in order to achieve language detection.

...
         "properties": {
            "text": {
               "type": "langdetect",
               "languages" : [ "en", "de", "fr" ]
            }
         }
...

We have a very varied data set in many languages and indexing is also not time sensitive, assuming this would be a performance issue. We'd like to know if there is an "all" or similar option for language detection, so we don't have to specify the complete list of languages here.

ES 1.4.4 plugin returns file content instead of lang value

Just installed the plugin and execute some search queries.

MAPPING
Copy/Paste from the index page of the plugin.

PUT
{
  "content": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}

SEARCH

{
  "fields" : "content.language.lang",
  "query" : {
    "match_all" : {}
  }
}

RESULT

.................

"fields": {
  "content.language.lang": [
    "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
  ]
}

As we can see the content.language.lang is not as expected.

Any ideas?

debug information

Is it possible to see debug logs for this plugin? I am seeing intermittent timeouts (longer than 5 seconds with small documents) and I'm trying to narrow down what could be causing it. I don't see anything langdetect-related in the elasticsearch log.

FWIW, right after the timeout I can send it hundreds of requests and see it perform normally.

Compatibility with ES2.2

ERROR: Plugin [langdetect] is incompatible with Elasticsearch [2.2.0]. Was designed for version [2.1.1]

Could you provide an update, pls?

unrecognized parameter: [profile]

Example request in readme
curl -XPOST 'localhost:9200/_langdetect?pretty&profile=short-text' -d 'Das ist ein Test'
return

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "request [/_langdetect] contains unrecognized parameter: [profile]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "request [/_langdetect] contains unrecognized parameter: [profile]"
  },
  "status" : 400
}

elastic version

"version" : {
    "number" : "5.1.2",
    "build_hash" : "c8c4c16",
    "build_date" : "2017-01-11T20:18:39.146Z",
    "build_snapshot" : false,
    "lucene_version" : "6.3.0"
}

Installation of previous version on ES2.0

Hi Joerg,
I tried to install langdetect on ES 2.0, but I'm getting "'plugin-descriptor.properties' not found in plugin.zip", which is not in the latest build of the langdetect plugin (1.6.0, for ES 1.6).
I also tried to install langdetect-beta-2.0.0, but it's not compatible with ES 2.0.
So, is there any way to install the plugin for ES 2.0, or should I wait for the new release of this plugin?
Thanks

Issue with Aggregations on language field

The content.lang field does not give proper results when used in terms aggregations with the languages zh-cn and zh-tw.
I am using aggregations on the content.lang field as below:

{
  "aggs" : {
    "tags" : {
      "terms" : {
        "field" : "content.lang"
      }
    }
  }
}

the result is

"aggregations": {
  "tags": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 5,
    "buckets": [
      { "key": "zh", "doc_count": 6 },
      { "key": "cn", "doc_count": 4 },
      { "key": "en", "doc_count": 4 },
      { "key": "ar", "doc_count": 2 },
      { "key": "de", "doc_count": 2 },
      { "key": "es", "doc_count": 2 },
      { "key": "fr", "doc_count": 2 },
      { "key": "ja", "doc_count": 2 },
      { "key": "ko", "doc_count": 2 },
      { "key": "no", "doc_count": 2 }
    ]
  }
}

I have documents with the zh-cn and zh-tw languages. When using aggregations, the value is split at "-" into two different terms; see the output above: zh=6 and cn=4, but actually this is the single language "zh-cn".

Is this a bug, or do I have to set up anything else to keep the value as a whole word?

Seems like langdetect 5.3 does not work, or the documentation has incorrect examples

My config
ES:

$ curl -XGET http://127.0.0.1:9200
{
  "name" : "D8Tv5qq",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "56FxolywSQW5Xx4WzxI0mg",
  "version" : {
    "number" : "5.3.0",
    "build_hash" : "3adb13b",
    "build_date" : "2017-03-23T03:31:50.652Z",
    "build_snapshot" : false,
    "lucene_version" : "6.4.1"
  },
  "tagline" : "You Know, for Search"
}

Kibana 5.3
and plugins

GET _cat/plugins
D8Tv5qq analysis-icu        5.3.0
D8Tv5qq analysis-morphology 5.3.0
D8Tv5qq langdetect          5.3.0.0

My attempts:

  • from Kibana
GET or POST _langdetect 
{
  "text": "das ist ein test"
}

{
  "error": {
    "root_cause": [
      {
        "type": "json_generation_exception",
        "reason": "Can not write a field name, expecting a value"
      }
    ],
    "type": "json_generation_exception",
    "reason": "Can not write a field name, expecting a value"
  },
  "status": 500
}
  • from cURL
$ curl -XPOST http://127.0.0.1:9200/_langdetect -d '{"text":"some text"}'
{"error":{"root_cause":[{"type":"json_generation_exception","reason":"Can not write a field name, expecting a value"}],"type":"json_generation_exception","reason":"Can not write a field name, expecting a value"},"status":500}

$ curl -XGET http://127.0.0.1:9200/_langdetect -d '{"text":"some text"}'
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"No endpoint or operation is available at [_langdetect]"}],"type":"illegal_argument_exception","reason":"No endpoint or operation is available at [_langdetect]"},"status":400}

$ curl -XGET http://127.0.0.1:9200/_langdetect -d 'some text'
{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}

$curl -XPOST http://127.0.0.1:9200/_langdetect -d 'some text'
{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}

So maybe I'm doing something wrong?
Please help.

regards
Alex

Unable to use plugin from TransportClient in java

I am getting the following exception :
"org.elasticsearch.transport.ActionNotFoundTransportException: No handler for action [langdetect]"
while trying to use langdetect from java using TransportClient.
I am using Elasticsearch v1.4.3 and langdetect v1.4.4.2
I am not sure what is causing this failure. Please help.

Which langdetect version is bundled along with Elasticsearch-plugin-bundle 5.1.1

Hi JPrante,

The Elasticsearch-plugin-bundle for ES 5.1.1 comes along with a lang-detect plugin. I am interested only in the lang-detect plugin. However the lang-detect repo only have version up to ES 2.4.

So, could we use the lang-detect plugin for ES 2.4 with ES 5.1.1?
If not, is there a way to get only the lang-detect plugin from the Elasticsearch-plugin-bundle for ES 5.1.1?

Thank You

Is it possible to return the detected language from the _langdetect endpoint with an HTTP request?

Hi, and thanks for your plugin and your help.
I am using the langdetect plugin on ES 2.3.3. Is it possible to return the detected language using the _langdetect endpoint with an HTTP request?

I saw this example in Sense, and it is excellent, but I need to request it from my app.

GET _langdetect
{
"text": "das ist ein test"
}

I need this because I have two indexes, one for Spanish and one for English, and I want to know the language of my query (phrase) before performing the search in the index for that language.
for the moment i am using python for my request.

response = requests.get('http://127.0.0.1:9200/_langdetect?pretty=myquery')
print (response.text)

I appreciate any help you can give me; please excuse my bad English.

Plans for ES2.0?

Are you planning support for ES 2.0? I've tried updating it myself, but it is too complicated for me because there are many changes...

Is it possible to return the detected language in the Elasticsearch API

In all the examples, the detected language is only queried, never returned. In my use-case I would like to classify documents as english or german, persist them, later another job works over the data and filters the data into language specific elasticsearch databases.

First Question: Is there always one specific language assigned, or is it possible that multiple languages are assigned?
Second Question: Can I somehow persist the detected language in the documents or return it when fetching the documents from elasticsearch?

If everything else fails, I could use the language analyzer api your script provides and send the text , or use 2 queries to fetch the documents (one for en one for de) , but I wanted to know beforehand if I am doing something wrong or misunderstanding anything.

Can't use langdetect mapping

Hi,
I've just started using your plugin (very impressive!) and I'm running into a small issue with version 2.0.
When creating a mapping having a field of type langdetect, I get an error:

e.g.

curl -XPOST localhost:9200/test/article/_mapping -d '
{
  "article" : {
    "properties" : {
      "content" : { "type" : "langdetect" }
    }
  }
}'
I receive an exception and the mapping is not created.
[2013-10-24 11:39:57,096][WARN ][transport.netty ] [Aardwolf] Message not fully read (response) for [673] handl
er org.elasticsearch.action.support.master.TransportMasterNodeOperationAction$4@4dc51e57, error [true], resetting

However, when running the same command using the prior plugin version I get a successful response.

ElasticSearch 1.2.1 crashing with langdetect installed

Seen this on both a Windows and Linux server running ES 1.2.1 with the latest langdetect plugin installed...

First I get a slew of warnings like this:

[2014-06-18 11:48:26,535][WARN ][transport ] [Alfie O'Meggan] Registered two transport handlers for action langdetect, handlers: org.elasticsearch.action.support.single.custom.TransportSingleCustomOperationAction$TransportHandler@3872bb09, org.elasticsearch.action.support.single.custom.TransportSingleCustomOperationAction$TransportHandler@f505228

Then it crashes outright with the following message:

{1.2.1}: Initialization Failed ...
 1) NoClassDefFoundError[com/fasterxml/jackson/core/Versioned]
        ClassNotFoundException[com.fasterxml.jackson.core.Versioned]2) NoClassDefFoundError[com/fasterxml/jackson/databind/ObjectMapper]

FYI: I checked the plugin directory and the jackson-databind-2.3.3.jar file is there. It's happening on brand new servers with the latest versions of Java and ElasticSearch.

Any ideas?

support for multi_field

Adding langdetect to an existing multi_field text field like this:

            "text": {
                "type": "multi_field",
                "fields": {
                    "text": {
                        "index": "analyzed",
                        "store": "yes",
                        "term_vector": "with_positions_offsets",
                        "type": "string"
                    },
                    "cleaned": {
                        "index": "analyzed",
                        "analyzer": "ocranalyzer",
                        "store": "yes",
                        "term_vector": "with_positions_offsets",
                        "type": "string"
                    },
                    "language": {
                        "type": "langdetect"
                    }
                }
            },

results in

    java.lang.ClassCastException: org.xbib.elasticsearch.index.mapper.langdetect.LangdetectMapper$Builder cannot be cast to org.elasticsearch.index.mapper.core.AbstractFieldMapper$Builder

Could not find plugin descriptor 'plugin-descriptor.properties'

Hi.

When I try to install this plugin on ES2.3.3, It complains. :(

$ ./bin/plugin install https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz
-> Installing from https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz...
Trying https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz ...
Downloading ...................................................DONE
Verifying https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz checksums if available ...
NOTE: Unable to verify checksum for downloaded plugin (unable to find .sha1 or .md5 file to verify)
ERROR: Could not find plugin descriptor 'plugin-descriptor.properties' in plugin zip

So, I extracted the tar and found plugin-descriptor.properties. It exists.

$ wget https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz
$ tar zxvf 2.3.3.0.tar.gz 
$ find ./ -name plugin-descriptor.properties
./src/main/templates/plugin-descriptor.properties

Do I miss something?

Initialization Failed in ElasticSearch 1.3

langdetect-1.2.1.1 on 1.3.0

[2014-07-24 12:15:06,533][WARN ][plugins                  ] [Stonecutter] plugin langdetect-1.2.1.1-f1082e1, failed to invoke custom onModule method
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.elasticsearch.plugins.PluginsService.processModule(PluginsService.java:198)
    at org.elasticsearch.plugins.PluginsModule.processModule(PluginsModule.java:61)
    at org.elasticsearch.common.inject.Modules.processModules(Modules.java:64)
    at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:58)
    at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:192)
    at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:70)
    at org.elasticsearch.bootstrap.Bootstrap.main(Bootstrap.java:203)
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:32)
Caused by: java.lang.VerifyError: class org.xbib.elasticsearch.rest.action.langdetect.RestLangdetectAction overrides final method handleRequest.(Lorg/elasticsearch/rest/RestRequest;Lorg/elasticsearch/rest/RestChannel;)V
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:455)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:367)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.xbib.elasticsearch.plugin.langdetect.LangdetectPlugin.onModule(LangdetectPlugin.java:33)
    ... 13 more
[2014-07-24 12:15:09,500][ERROR][bootstrap                ] {1.3.0}: Initialization Failed ...

Mapper not working in ES 2.3.3 and 2.3.1

I've tried to define a mapping field, but it is not working, although it does work as a REST endpoint. It seems to be caused by the LangdetectService initialization: the line LangdetectService service = new LangdetectService(settingsBuilder.build()); creates the service with empty settings.
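For context, the kind of mapping being tried here looks roughly as follows (the field name is illustrative; only "type": "langdetect" is taken from the plugin's documented mapping type):

```json
{
  "properties": {
    "content": {
      "type": "langdetect"
    }
  }
}
```

With working settings, the detected language would then be queryable via the field's 'lang' subfield, per the plugin documentation.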

Index custom analyzed data

A "language_analyzer" field, which indexes analyzed terms instead of the ISO code. E.g.:

"fields": {
  "language": {
    "type": "langdetect",
    "languages": [ "af", "ar", "bg", "bn", "cs", "da", "de", "el", "en", "es", "et", "fa", "fi", "fr", "gu", "he", "hi", "hr", "hu", "id", "it", "ja", "kn", "ko", "lt", "lv", "mk", "ml", "mr", "ne", "nl", "no", "pa", "pl", "pt", "ro", "ru", "sk", "sl", "so", "sq", "sv", "sw", "ta", "te", "th", "tl", "tr", "uk", "ur", "vi", "zh-cn", "zh-tw" ],
    "language_analyzer": {
      "ar": "arabic",
      "bg": "bulgarian",
      "cs": "czech",
      "da": "danish",
      "de": "german",
      "el": "greek",
      "en": "english",
      "es": "spanish",
      ...
    }
  }
}

Allow formatting the detected language string in the output

Use case:

I am using langdetect plugin to dynamically assign the analyzer at index time.

POST test/article/_mapping
{
  "article" : {
    "_analyzer" : {
      "path" : "description.lang"
    },
    "properties" : {
      "description" : { "type" : "langdetect" }
    }
  }
}

The langdetect plugin detects the language as 'en', 'fr', 'de', and so on, so the analyzers have to be named 'en', etc. This makes them less descriptive, and the context of the analyzer is lost. Is it possible to derive a more descriptive name, such that _analyzer resolves to 'en_icu_analyzer' instead of just 'en'?

Something like the following (this does not work; it is just what I want to achieve):

"article" : {
  "_analyzer" : {
    "path" : "description.lang" + "_icu_analyzer"
  }
}
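Until something like this is supported in the mapping, one workaround is to expand the detected code on the client side before indexing, writing the descriptive name into the field that _analyzer points at. A minimal sketch, where the '_icu_analyzer' suffix convention and the fallback name are assumptions:

```shell
# Hypothetical client-side helper: turn a detected ISO language
# code into a descriptive analyzer name. The suffix and the
# "default" fallback are illustrative, not plugin behavior.
analyzer_name() {
  case "$1" in
    en|fr|de) echo "${1}_icu_analyzer" ;;
    *)        echo "default" ;;   # unmapped codes fall back
  esac
}

analyzer_name en
analyzer_name zh-cn
```

The drawback is that the expansion happens outside Elasticsearch, so every indexing client has to apply the same table.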
