
elasticsearch-langdetect's Introduction

A langdetect plugin for Elasticsearch


Tower of Babel

This is a plugin for Elasticsearch based on Nakatani Shuyo's language detector implementation.

It uses character 3-grams and a Bayesian filter with various normalizations and feature sampling. The precision is over 99% for 53 languages.
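
As a rough illustration of that approach, here is a toy sketch of character 3-gram extraction and naive Bayes scoring. This is not the plugin's actual code; the profiles and frequencies are invented for illustration only.

```python
import math
from collections import Counter

# Toy 3-gram profiles; the real plugin ships trained profiles for 53
# languages. These frequencies are invented for illustration only.
PROFILES = {
    "en": Counter({"the": 12, "he ": 9, " th": 8, "and": 6}),
    "de": Counter({"ein": 10, "der": 9, "ich": 7, "sch": 6}),
}

def trigrams(text):
    text = " " + text.lower() + " "
    return [text[i:i + 3] for i in range(len(text) - 2)]

def detect(text, smoothing=0.5):
    scores = {}
    for lang, profile in PROFILES.items():
        total = sum(profile.values())
        score = 0.0
        for gram in trigrams(text):
            # Additive smoothing keeps unseen 3-grams from zeroing
            # out the (log-)product of per-gram probabilities.
            p = (profile.get(gram, 0) + smoothing) / (total + smoothing)
            score += math.log(p)
        scores[lang] = score
    return max(scores, key=scores.get)
```

With profiles for many languages, the normalization and feature-sampling steps dominate the engineering effort; the sketch only shows the scoring core.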

The plugin offers a mapping type to specify fields where you want to enable language detection. Detected languages are indexed into a subfield of the field named 'lang', as you can see in the example. The field can be queried for language codes.

You can use the multi_field mapping type to combine this plugin with the attachment mapper plugin to enable language detection in base64-encoded binary data. Currently, only UTF-8 texts are supported.

The plugin also offers a REST endpoint to which a short UTF-8 text can be posted; the plugin responds with a list of recognized languages.

Here is the list of language codes recognized:

Table 1. Languages

Code    Description
af      Afrikaans
ar      Arabic
bg      Bulgarian
bn      Bengali
cs      Czech
da      Danish
de      German
el      Greek
en      English
es      Spanish
et      Estonian
fa      Farsi
fi      Finnish
fr      French
gu      Gujarati
he      Hebrew
hi      Hindi
hr      Croatian
hu      Hungarian
id      Indonesian
it      Italian
ja      Japanese
kn      Kannada
ko      Korean
lt      Lithuanian
lv      Latvian
mk      Macedonian
ml      Malayalam
mr      Marathi
ne      Nepali
nl      Dutch
no      Norwegian
pa      Eastern Punjabi
pl      Polish
pt      Portuguese
ro      Romanian
ru      Russian
sk      Slovak
sl      Slovene
so      Somali
sq      Albanian
sv      Swedish
sw      Swahili
ta      Tamil
te      Telugu
th      Thai
tl      Tagalog
tr      Turkish
uk      Ukrainian
ur      Urdu
vi      Vietnamese
zh-cn   Chinese
zh-tw   Traditional Chinese characters (Taiwan, Hong Kong, Macau)

Table 2. Compatibility matrix

Plugin version   Elasticsearch version   Release date
5.4.0.2          5.4.0                   Jun 8, 2017
5.4.0.1          5.4.0                   May 30, 2017
5.4.0.0          5.4.0                   May 10, 2017
5.3.2.0          5.3.2                   Apr 30, 2017
5.3.1.0          5.3.1                   Apr 30, 2017
5.3.0.2          5.3.0                   Apr 3, 2017
5.3.0.1          5.3.0                   Apr 1, 2017
5.3.0.0          5.3.0                   Mar 30, 2017
5.2.2.0          5.2.2                   Mar 2, 2017
5.2.1.0          5.2.1                   Mar 2, 2017
5.1.2.0          5.1.2                   Jan 26, 2017
2.4.4.1          2.4.4                   Jan 25, 2017
2.3.3.0          2.3.3                   Jun 11, 2016
2.3.2.0          2.3.2                   Jun 11, 2016
2.3.1.0          2.3.1                   Apr 11, 2016
2.2.1.0          2.2.1                   Apr 11, 2016
2.2.0.2          2.2.0                   Mar 25, 2016
2.2.0.1          2.2.0                   Mar 6, 2016
2.1.1.0          2.1.1                   Dec 20, 2015
2.1.0.0          2.1.0                   Dec 15, 2015
2.0.1.0          2.0.1                   Dec 15, 2015
2.0.0.0          2.0.0                   Nov 12, 2015
1.6.0.0          1.6.0                   Jul 1, 2015
1.4.4.2          1.4.4                   Apr 3, 2015
1.4.4.1          1.4.4                   Mar 4, 2015
1.4.0.2          1.4.0                   Nov 26, 2014
1.4.0.1          1.4.0                   Nov 20, 2014
1.4.0.0          1.4.0                   Nov 14, 2014
1.3.1.0          1.3.0                   Jul 30, 2014
1.2.1.1          1.2.1                   Jun 18, 2014

Installation

Elasticsearch 5.x

./bin/elasticsearch-plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/5.4.0.2/elasticsearch-langdetect-5.4.0.2-plugin.zip

Elasticsearch 2.x

./bin/plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/2.4.4.1/elasticsearch-langdetect-2.4.4.1-plugin.zip

Elasticsearch 1.x

./bin/plugin -install langdetect -url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/1.6.0.0/elasticsearch-langdetect-1.6.0.0-plugin.zip

Do not forget to restart the node after installing.

Examples

Note
The examples are written for Elasticsearch 5.x and need to be adapted to earlier versions of Elasticsearch.

A simple language detection example

In this example, we create a simple detector field, and write text to it for detection.

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "langdetect",
               "languages" : [ "en", "de", "fr" ]
            }
         }
      }
   }
}

PUT /test/docs/1
{
      "text" : "Oh, say can you see by the dawn's early light, What so proudly we hailed at the twilight's last gleaming?"
}

PUT /test/docs/2
{
      "text" : "Einigkeit und Recht und Freiheit für das deutsche Vaterland!"
}

PUT /test/docs/3
{
      "text" : "Allons enfants de la Patrie, Le jour de gloire est arrivé!"
}

POST /test/_search
{
       "query" : {
           "term" : {
                "text" : "en"
           }
       }
}

POST /test/_search
{
       "query" : {
           "term" : {
                "text" : "de"
           }
       }
}

POST /test/_search
{
       "query" : {
           "term" : {
                "text" : "fr"
           }
       }
}

Indexing language-detected text alongside the language code

In most cases, just indexing the language code is not enough: the detected text should also be passed to an analyzer that applies language-specific analysis. This plugin supports that via the language_to parameter.

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "langdetect",
               "languages": [
                  "de",
                  "en",
                  "fr",
                  "nl",
                  "it"
               ],
               "language_to": {
                  "de": "german_field",
                  "en": "english_field"
               }
            },
            "german_field": {
               "analyzer": "german",
               "type": "string"
            },
            "english_field": {
               "analyzer": "english",
               "type": "string"
            }
         }
      }
   }
}

PUT /test/docs/1
{
  "text" : "Oh, say can you see by the dawn's early light, What so proudly we hailed at the twilight's last gleaming?"
}

POST /test/_search
{
   "query" : {
       "match" : {
            "english_field" : "light"
       }
   }
}

Language code and multi_field

Using multi-fields, it is possible to store the text alongside the detected language(s). Here, we use another (short nonsense) example text for demonstration, which yields more than one detected language code.

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "fields": {
                  "language": {
                     "type": "langdetect",
                     "languages": [
                        "de",
                        "en",
                        "fr",
                        "nl",
                        "it"
                     ],
                     "store": true
                  }
               }
            }
         }
      }
   }
}

PUT /test/docs/1
{
    "text" : "Oh, say can you see by the dawn's early light, What so proudly we hailed at the twilight's last gleaming?"
}

POST /test/_search
{
   "query" : {
       "match" : {
            "text" : "light"
       }
   }
}

POST /test/_search
{
   "query" : {
       "match" : {
            "text.language" : "en"
       }
   }
}

Language detection in a binary field with the attachment mapper plugin

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type" : "attachment",
               "fields" : {
                  "content" : {
                     "type" : "text",
                     "fields" : {
                        "language" : {
                           "type" : "langdetect",
                           "binary" : true
                        }
                     }
                  }
               }
            }
         }
      }
   }
}

In a shell, enter the following commands:

rm -f index.tmp
echo -n '{"content":"' >> index.tmp
echo "This is a very simple text in plain english" | base64 >> index.tmp
echo -n '"}' >> index.tmp
curl -XPOST --data-binary "@index.tmp" 'localhost:9200/test/docs/1'
rm -f index.tmp
curl -XPOST 'localhost:9200/test/_refresh'
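
The same payload can also be built programmatically, avoiding shell quoting pitfalls. A minimal Python sketch; the field name content matches the mapping above, and the target URL is an assumption of this example:

```python
import base64
import json

def binary_doc(text: str) -> str:
    """Build the {"content": "<base64>"} payload the shell commands above produce."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return json.dumps({"content": encoded})

payload = binary_doc("This is a very simple text in plain english")
# POST the payload to localhost:9200/test/docs/1 with any HTTP client,
# then refresh the index as shown above.
```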

POST /test/_search
{
   "query" : {
       "match" : {
            "content" : "very simple"
       }
   }
}

POST /test/_search
{
   "query" : {
       "match" : {
            "content.language" : "en"
       }
   }
}

Language detection REST API Example

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'This is a test'
{
  "languages" : [
    {
      "language" : "en",
      "probability" : 0.9999972283490304
    }
  ]
}
curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'Das ist ein Test'
{
  "languages" : [
    {
      "language" : "de",
      "probability" : 0.9999985460514316
    }
  ]
}
curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'Datt isse ne test'
{
  "languages" : [
    {
      "language" : "no",
      "probability" : 0.5714275763833249
    },
    {
      "language" : "nl",
      "probability" : 0.28571402563882925
    },
    {
      "language" : "de",
      "probability" : 0.14285660343967294
    }
  ]
}

Use _langdetect endpoint from Sense

GET _langdetect
{
   "text": "das ist ein test"
}

Change profile of language detection

There is a "short text" profile which is better at detecting languages from only a few words.

curl -XPOST 'localhost:9200/_langdetect?pretty&profile=short-text' -d 'Das ist ein Test'
{
  "profile" : "/langdetect/short-text/",
  "languages" : [ {
    "language" : "de",
    "probability" : 0.9999993070517024
  } ]
}

Settings

These settings can be used in elasticsearch.yml to modify language detection.

Use with caution. You normally do not need to modify these settings; this list is included only for completeness. Before modifying the model parameters, you should study the source code and be familiar with probabilistic matching using naive Bayes over character n-grams. See also Ted Dunning, Statistical Identification of Language, 1994.

Name               Description
languages          a comma-separated list of language codes (e.g. de,en,fr,…) used to restrict (and speed up) the detection process
map.<code>         a substitution code for a language code
number_of_trials   number of trials; affects CPU usage (default: 7)
alpha              additional smoothing parameter (default: 0.5)
alpha_width        the width of smoothing (default: 0.05)
iteration_limit    safeguard to break the detection loop (default: 10000)
prob_threshold     probability threshold (default: 0.1)
conv_threshold     detection is terminated when the normalized probability exceeds this threshold (default: 0.99999)
base_freq          base frequency (default: 10000)
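
To give a feel for how these parameters interact, here is a hypothetical sketch of the detection trial loop. The profiles are invented toys and the math is simplified; this is not the plugin's real implementation.

```python
import random

random.seed(42)  # deterministic for the example

# Invented toy profiles; the plugin loads real per-language n-gram data.
PROFILES = {
    "en": {"the": 0.6, "he ": 0.4},
    "de": {"ein": 0.7, "der": 0.3},
}

def detect(ngrams, number_of_trials=7, alpha=0.5, alpha_width=0.05,
           iteration_limit=10000, conv_threshold=0.99999, base_freq=10000):
    langs = list(PROFILES)
    total = [0.0] * len(langs)
    for _ in range(number_of_trials):
        # Each trial jitters the smoothing parameter alpha a little.
        a = alpha + random.gauss(0.0, 1.0) * alpha_width
        prob = [1.0 / len(langs)] * len(langs)
        for i in range(iteration_limit):
            gram = random.choice(ngrams)  # feature sampling
            for j, lang in enumerate(langs):
                # Smoothed n-gram probability.
                prob[j] *= PROFILES[lang].get(gram, 0.0) + a / base_freq
            if i % 5 == 0:
                s = sum(prob) or 1.0
                prob = [p / s for p in prob]    # renormalize
                if max(prob) > conv_threshold:  # early termination
                    break
        s = sum(prob) or 1.0
        total = [t + p / s / number_of_trials for t, p in zip(total, prob)]
    return sorted(zip(langs, total), key=lambda pair: -pair[1])

ranked = detect(["the", "he ", "the"])
```

The sketch shows why more trials cost CPU, why alpha and alpha_width blur the model, and how conv_threshold ends a trial early once one language dominates.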

Issues

All feedback is welcome! If you find issues, please post them on GitHub.

Credits

Thanks to Alexander Reelsen for his OpenNLP plugin, from where I have copied and adapted the mapping type code.

License

elasticsearch-langdetect - a language detection plugin for Elasticsearch

Derived work of language-detection by Nakatani Shuyo http://code.google.com/p/language-detection/

Copyright © 2012 Jörg Prante

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


elasticsearch-langdetect's People

Contributors

jprante, juliendangers, marbleman, stambizzle, xyu, yanirs


elasticsearch-langdetect's Issues

Not working for nested objects

It seems that the language detection does not work for the fields of nested objects.
Here's a sample mapping:

{
  mappings: {
    document: {
      properties: {
        title: {
          type: "string",
          copy_to: "l1"
        },
        l1: {
          type: "langdetect",
          store: true
        },
        chunks: {
          type: "nested",
          properties: {
            text: {
              type: "string",
              copy_to: "chunks.l2"
            },
            l2: {
              type: "langdetect",
              store: true
            }
          }
        }
      }
    }
  }
}

and the doc:

{
  title: "hello, world",
  chunks: [
    {
      text: "au revoir"
    }
  ]
}

It works for "l1" field, but it doesn't work for "l2" field. I tried the mapping without "copy_to" (just using those fields directly), to simplify the use case, but to no avail.

Getting "action [langdetect] is unauthorized for user" when using Shield

Hi,

I am using langdetect plugin on ES 2.2.1 with Shield. The tests work correctly before Shield is installed, but after Shield is installed, I am seeing the following error:

curl -XPOST -u es_admin 'http://localhost:9200/_langdetect?pretty' -d 'This is a test'
Enter host password for user 'es_admin':
{
  "error" : {
    "root_cause" : [ {
      "type" : "security_exception",
      "reason" : "action [langdetect] is unauthorized for user [es_admin]"
    } ],
    "type" : "security_exception",
    "reason" : "action [langdetect] is unauthorized for user [es_admin]"
  },
  "status" : 403
}

I am using a default admin user with admin role

bin/shield/esusers useradd es_admin -r admin

with admin role

admin:
  cluster: all
  indices:
    '*':
      privileges: all

Is there any additional configuration required for Shield?

Accuracy problem

Hi,

I get some strange results when I use it on French text:

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'je vend ma chemise verte'
{
  "ok" : true,
  "languages" : [ {
    "language" : "nl",
    "probability" : 0.9999951375010268
  } ]
}

It's French, but I get "nl". Is something wrong?

Installation problems with ES 0.90

I can't install the plugin by version number:
I tried bin/plugin -install jprante/elasticsearch-langdetect/1.0.0, and it returned "failed to download out of all possible locations".

I tried with the master, but the issue is: Plugin installation assumed to be site plugin, but contains source code, aborting installation

Could you write in the installation section the way to compile it?

Plugin is not compatible with ES 2.1.0

After upgrading ES2.0 to 2.1.0 it does not start with the error message:

Plugin [langdetect] is incompatible with Elasticsearch [2.1.0]. Was designed for version [2.0.0]

Should not fail and throw an exception when given punctuation, UTF-8 chars/symbols, empty text

There are a number of different cases where the language detector will fail:

  • Leading punctuation. Something like "----------ROMA.....sexy ragazza orientale 3888669169---------- ' Perché correre affannosamente qua e là senza motivo? Tu sei ciò che l'esistenza vuole che tu sia.'" will fail. Generally, any text that leads off with a number of punctuation characters fails.
  • Unicode symbols, emoticons, etc. anywhere in the text cause failures: U+2000-U+2BFF (symbols), U+1F000-U+1FFFF (symbols, emoticons), probably others
  • Any characters in the U+1780-U+17FF range (Khmer script) cause failures
  • Any text that contains no Unicode letters (\p{L} in PCRE) fails

This is probably an issue with the underlying library, but if so, then would be nice for this wrapper to run some checks. Currently I have the following checks implemented in my client: https://gist.github.com/gibrown/7122061

Running about a million lang detect API calls a day, and I think this catches almost all failures.
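
Along the lines of the checks described above, a hedged client-side pre-validation sketch. The ranges and rules come from this report, not from the plugin itself, and the function name is illustrative:

```python
import re

# Ranges reported as problematic: general symbols, emoticons, Khmer script.
SYMBOL_RANGES = re.compile("[\u2000-\u2bff\u1780-\u17ff\U0001f000-\U0001ffff]")

def safe_for_langdetect(text: str) -> bool:
    text = text.strip()
    if not text:
        return False                    # empty text fails
    if SYMBOL_RANGES.search(text):
        return False                    # symbols/emoticons/Khmer fail
    if not any(ch.isalpha() for ch in text):
        return False                    # no Unicode letters (\p{L}) fails
    return True
```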

Apply a filter on text before n-gram detection

I ran into an issue which could be solved by running some custom filters (I do not mean Lucene filters, but predefined filters such as lowercase, uppercase, ...):

I got the following French tweet, all in uppercase:

COMMENT DES GENS PEUVENT TROUVER DES CÉLÉBRITÉS DANS LES MAGASINS JE PEUX MÊME PAS TROUVER MA MÈRE
which is detected as English:

{
    "language": "en",
    "probability": 0.9999937971825049
}

But when I ask for the exact same text lowercased,

comment des gens peuvent trouver des célébrités dans les magasins je peux même pas trouver ma mère

{
    "language": "fr",
    "probability": 0.9999970343219597
}

French is now detected.

Should accept empty value when indexing

Currently, if the data is empty, langdetect throws an exception and stops indexing data (if it is in a bulk process).
So I think the plugin should accept empty/null values and return prob = 0, or have an option to set a default language in case the data is empty/null.
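
Until the plugin handles this, a hypothetical client-side guard can keep empty values out of a bulk run. Field names here are examples, not the plugin's API:

```python
def prepare_doc(doc, text_field="text", default_lang=None):
    """Drop an empty langdetect field (or substitute a default language)
    so bulk indexing is not aborted by empty/null input."""
    text = (doc.get(text_field) or "").strip()
    if text:
        return doc
    cleaned = dict(doc)
    cleaned.pop(text_field, None)       # remove the empty field entirely
    if default_lang is not None:
        cleaned["lang"] = default_lang  # optional fallback language
    return cleaned
```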

Re-using the bayesian filter

Is there any plan to "extract" the Bayesian filter to use it with other types of data, to filter our spam content for example?

URL often generates lang:en on short text

On short texts containing a URL, English is almost always detected.

Example :

an arabic tweet with an url :

POST _langdetect?pretty
{
  "query_string": "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\" https://www.facebook.com/dralqarnee/posts/675689432512881"
}

Produces :

{
   "languages": [
      {
         "language": "en",
         "probability": 0.857138346512083
      },
      {
         "language": "ar",
         "probability": 0.14285639031760403
      }
   ]
}

English is detected with a greater probability...

Without any url :

POST _langdetect?pretty
{
  "query_string": "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\""
}

Produces :

{
   "languages": [
      {
         "language": "ar",
         "probability": 0.5714272046098048
      },
      {
         "language": "so",
         "probability": 0.42857034099037317
      }
   ]
}

English is not even detected!

I can submit a pull request, I've already done the changes on my own.
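
The fix can be sketched as a preprocessing step; this is a hypothetical client-side version, and the author's actual change may differ: strip URLs, and for tweets also @mentions and the RT marker, before calling _langdetect.

```python
import re

# Strip URLs and tweet markup before language detection; illustrative
# regexes, not the plugin's implementation.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+:?")

def clean_for_langdetect(text: str) -> str:
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = re.sub(r"\bRT\b", " ", text)   # retweet marker
    return " ".join(text.split())
```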

Elasticsearch 2.0.1 not compatible with plugin 2.0.0

Looks like version 2.0.0 of langdetect is not compatible with ES 2.0.1

After installing I got this error:
ERROR: Plugin [langdetect] is incompatible with Elasticsearch [2.0.1]. Was designed for version [2.0.0]

ES 2.4.0

Could you help me create a plugin build for ES 2.4.0? It would allow me to complete my task. I use your plugin in conjunction with another one, so I cannot upgrade ES to version 5 right now.
Thanks in advance

ES 5.3

Please create a new release!

Update for ES 2.2.1

Could you provide an update for ES 2.2.1 as last release is not compatible with it?

Is there an "all" option for language detection?

It seems that we have to always specify a value for "languages" in order to achieve language detection.

...
         "properties": {
            "text": {
               "type": "langdetect",
               "languages" : [ "en", "de", "fr" ]
            }
         }
...

We have a very varied data set in many languages and indexing is also not time sensitive, assuming this would be a performance issue. We'd like to know if there is an "all" or similar option for language detection, so we don't have to specify the complete list of languages here.

ES 1.4.4 plugin returns file content instead of lang value

Just installed the plugin and execute some search queries.

MAPPING
Copy/Paste from the index page of the plugin.

PUT
{
  "content": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}

SEARCH

{
  "fields" : "content.language.lang",
  "query" : {
    "match_all" : {}
  }
}

RESULT

.................

"fields": {
  "content.language.lang": [
    "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
  ]
}

As we can see the content.language.lang is not as expected.

Any ideas?

debug information

Is it possible to see debug logs for this plugin? I am seeing intermittent timeouts (longer than 5 seconds with small documents) and I'm trying to narrow down what could be causing it. I don't see anything langdetect-related in the elasticsearch log.

FWIW, right after the timeout I can send it hundreds of requests and see it perform normally.

Compatibility with ES2.2

ERROR: Plugin [langdetect] is incompatible with Elasticsearch [2.2.0]. Was designed for version [2.1.1]

Could you provide an update, pls?

unrecognized parameter: [profile]

Example request in readme
curl -XPOST 'localhost:9200/_langdetect?pretty&profile=short-text' -d 'Das ist ein Test'
return

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "request [/_langdetect] contains unrecognized parameter: [profile]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "request [/_langdetect] contains unrecognized parameter: [profile]"
  },
  "status" : 400
}

elastic version

"version" : {
    "number" : "5.1.2",
    "build_hash" : "c8c4c16",
    "build_date" : "2017-01-11T20:18:39.146Z",
    "build_snapshot" : false,
    "lucene_version" : "6.3.0"
}

Installation of previous version on ES2.0

Hi Joerg,
I tried to install langdetect on ES 2.0, but I'm getting "'plugin-descriptor.properties' not found in plugin.zip", which is not in the latest build of the langdetect plugin (1.6.0, for ES 1.6).
I also tried to install langdetect-beta-2.0.0, but it's not compatible with ES 2.0.
So, is there any way to install the plugin for ES 2.0, or should I wait for the new release of this plugin?
Thanks

Issue with Aggregations on language field

The content.lang field does not give proper results when used in terms aggregations with the languages zh-cn and zh-tw.
I am using aggregations on the content.lang field as below:

{
  "aggs" : {
    "tags" : {
      "terms" : {
        "field" : "content.lang"
      }
    }
  }
}

the result is

"aggregations": {
  "tags": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 5,
    "buckets": [
      { "key": "zh", "doc_count": 6 },
      { "key": "cn", "doc_count": 4 },
      { "key": "en", "doc_count": 4 },
      { "key": "ar", "doc_count": 2 },
      { "key": "de", "doc_count": 2 },
      { "key": "es", "doc_count": 2 },
      { "key": "fr", "doc_count": 2 },
      { "key": "ja", "doc_count": 2 },
      { "key": "ko", "doc_count": 2 },
      { "key": "no", "doc_count": 2 }
    ]
  }
}

I have documents with the zh-cn and zh-tw languages. When using aggregations, the value is split at "-" into two different terms; see the output above: zh=6 and cn=4, but actually this is the single language "zh-cn".

Is this a bug, or do I have to set up anything else to keep the value as a whole word?

Seems like langdetect 5.3 does not work, or the documentation has incorrect examples

My config
ES:

$ curl -XGET http://127.0.0.1:9200
{
  "name" : "D8Tv5qq",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "56FxolywSQW5Xx4WzxI0mg",
  "version" : {
    "number" : "5.3.0",
    "build_hash" : "3adb13b",
    "build_date" : "2017-03-23T03:31:50.652Z",
    "build_snapshot" : false,
    "lucene_version" : "6.4.1"
  },
  "tagline" : "You Know, for Search"
}

Kibana 5.3
and plugins

GET _cat/plugins
D8Tv5qq analysis-icu        5.3.0
D8Tv5qq analysis-morphology 5.3.0
D8Tv5qq langdetect          5.3.0.0

My attempts:

  • from Kibana
GET or POST _langdetect 
{
  "text": "das ist ein test"
}

{
  "error": {
    "root_cause": [
      {
        "type": "json_generation_exception",
        "reason": "Can not write a field name, expecting a value"
      }
    ],
    "type": "json_generation_exception",
    "reason": "Can not write a field name, expecting a value"
  },
  "status": 500
}
  • from cURL
$ curl -XPOST http://127.0.0.1:9200/_langdetect -d '{"text":"some text"}'
{"error":{"root_cause":[{"type":"json_generation_exception","reason":"Can not write a field name, expecting a value"}],"type":"json_generation_exception","reason":"Can not write a field name, expecting a value"},"status":500}

$ curl -XGET http://127.0.0.1:9200/_langdetect -d '{"text":"some text"}'
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"No endpoint or operation is available at [_langdetect]"}],"type":"illegal_argument_exception","reason":"No endpoint or operation is available at [_langdetect]"},"status":400}

$ curl -XGET http://127.0.0.1:9200/_langdetect -d 'some text'
{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}

$curl -XPOST http://127.0.0.1:9200/_langdetect -d 'some text'
{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}

So maybe I'm doing something wrong?
Please help.

regards
Alex

Unable to use plugin from TransportClient in java

I am getting the following exception :
"org.elasticsearch.transport.ActionNotFoundTransportException: No handler for action [langdetect]"
while trying to use langdetect from java using TransportClient.
I am using Elasticsearch v1.4.3 and langdetect v1.4.4.2
I am not sure what is causing this failure. Please help.

Which langdetect version is bundled along with Elasticsearch-plugin-bundle 5.1.1

Hi JPrante,

The Elasticsearch-plugin-bundle for ES 5.1.1 comes along with a lang-detect plugin. I am interested only in the lang-detect plugin. However the lang-detect repo only have version up to ES 2.4.

So, could we use the lang-detect plugin for ES 2.4 with ES 5.1.1?
If not, is there a way to get only the lang-detect plugin from the Elasticsearch-plugin-bundle for ES 5.1.1?

Thank You

Is it possible to return the detected language from the _langdetect endpoint with an HTTP request?

Hi, and thanks for your plugin and your help.
I am using the langdetect plugin on ES 2.3.3. Is it possible to return the detected language using the _langdetect endpoint with an HTTP request?

I saw this example in Sense, and it is excellent, but I need to request it from my app.

GET _langdetect
{
"text": "das ist ein test"
}

I need this because I have two indexes, one for Spanish and one for English, and I want to know the language of my query (phrase) before performing the search in the index for that language.
for the moment i am using python for my request.

response = requests.get('http://127.0.0.1:9200/_langdetect?pretty=myquery')
print (response.text)

I appreciate any help you can give me; please excuse my bad English.

Plans for ES2.0?

Are you planning support for ES 2.0? I've tried updating it myself, but it is too complicated for me because there are many changes...

Is it possible to return the detected language in the Elasticsearch API

In all the examples, the detected language is only queried, never returned. In my use-case I would like to classify documents as english or german, persist them, later another job works over the data and filters the data into language specific elasticsearch databases.

First Question: Is there always one specific language assigned, or is it possible that multiple languages are assigned?
Second Question: Can I somehow persist the detected language in the documents or return it when fetching the documents from elasticsearch?

If everything else fails, I could use the language analyzer api your script provides and send the text , or use 2 queries to fetch the documents (one for en one for de) , but I wanted to know beforehand if I am doing something wrong or misunderstanding anything.

Can't use langdetect mapping

Hi,
I've just started using your plugin (very impressive!) and I'm running into a small issue with version 2.0.
When creating a mapping having a field of type langdetect, I get an error:

e.g.

curl -XPOST localhost:9200/test/article/_mapping -d '
{
  "article" : {
    "properties" : {
      "content" : { "type" : "langdetect" }
    }
  }
}'
I receive an exception and the mapping is not created.
[2013-10-24 11:39:57,096][WARN ][transport.netty ] [Aardwolf] Message not fully read (response) for [673] handl
er org.elasticsearch.action.support.master.TransportMasterNodeOperationAction$4@4dc51e57, error [true], resetting

However, when running the same command using the prior plugin version I get a successful response.

ElasticSearch 1.2.1 crashing with langdetect installed

Seen this on both a Windows and Linux server running ES 1.2.1 with the latest langdetect plugin installed...

First I get a slew of warnings like this:

[2014-06-18 11:48:26,535][WARN ][transport ] [Alfie O'Meggan] Registered two transport handlers for action langdetect, handlers: org.elasticsearch.action.support.single.custom.TransportSingleCustomOperationAction$TransportHandler@3872bb09, org.elasticsearch.action.support.single.custom.TransportSingleCustomOperationAction$TransportHandler@f505228

Then it crashes outright with the following message:

{1.2.1}: Initialization Failed ...
 1) NoClassDefFoundError[com/fasterxml/jackson/core/Versioned]
        ClassNotFoundException[com.fasterxml.jackson.core.Versioned]2) NoClassDefFoundError[com/fasterxml/jackson/databind/ObjectMapper]

FYI: I checked the plugin directory and the jackson-databind-2.3.3.jar file is there. It's happening on brand new servers with the latest versions of Java and ElasticSearch.

Any ideas?

support for multi_field

Adding langdetect to an existing multi_field text field like this:

            "text": {
                "type": "multi_field",
                "fields": {
                    "text": {
                        "index": "analyzed",
                        "store": "yes",
                        "term_vector": "with_positions_offsets",
                        "type": "string"
                    },
                    "cleaned": {
                        "index": "analyzed",
                        "analyzer": "ocranalyzer",
                        "store": "yes",
                        "term_vector": "with_positions_offsets",
                        "type": "string"
                    },
                    "language": {
                        "type": "langdetect"
                    }
                }
            },

results in

    java.lang.ClassCastException: org.xbib.elasticsearch.index.mapper.langdetect.LangdetectMapper$Builder cannot be cast to org.elasticsearch.index.mapper.core.AbstractFieldMapper$Builder

Could not find plugin descriptor 'plugin-descriptor.properties'

Hi.

When I try to install this plugin on ES2.3.3, It complains. :(

$ ./bin/plugin install https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz
-> Installing from https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz...
Trying https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz ...
Downloading ...................................................DONE
Verifying https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz checksums if available ...
NOTE: Unable to verify checksum for downloaded plugin (unable to find .sha1 or .md5 file to verify)
ERROR: Could not find plugin descriptor 'plugin-descriptor.properties' in plugin zip

So, I extracted the tar and found plugin-descriptor.properties. It exists.

$ wget https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz
$ tar zxvf 2.3.3.0.tar.gz 
$ find ./ -name plugin-descriptor.properties
./src/main/templates/plugin-descriptor.properties

Do I miss something?

Initialization Failed in ElasticSearch 1.3

langdetect-1.2.1.1 on 1.3.0

[2014-07-24 12:15:06,533][WARN ][plugins                  ] [Stonecutter] plugin langdetect-1.2.1.1-f1082e1, failed to invoke custom onModule method
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.elasticsearch.plugins.PluginsService.processModule(PluginsService.java:198)
    at org.elasticsearch.plugins.PluginsModule.processModule(PluginsModule.java:61)
    at org.elasticsearch.common.inject.Modules.processModules(Modules.java:64)
    at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:58)
    at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:192)
    at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:70)
    at org.elasticsearch.bootstrap.Bootstrap.main(Bootstrap.java:203)
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:32)
Caused by: java.lang.VerifyError: class org.xbib.elasticsearch.rest.action.langdetect.RestLangdetectAction overrides final method handleRequest.(Lorg/elasticsearch/rest/RestRequest;Lorg/elasticsearch/rest/RestChannel;)V
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:455)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:367)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.xbib.elasticsearch.plugin.langdetect.LangdetectPlugin.onModule(LangdetectPlugin.java:33)
    ... 13 more
[2014-07-24 12:15:09,500][ERROR][bootstrap                ] {1.3.0}: Initialization Failed ...

Mapper not working in ES 2.3.3 and 2.3.1

I've tried to define a mapping field, but it is not working, although it does work as a REST endpoint. It seems to be caused by the LangdetectService initialization: the line LangdetectService service = new LangdetectService(settingsBuilder.build()); creates the service with empty settings.
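For context, the kind of mapping being tried here looks roughly as follows (the field name is illustrative; only "type": "langdetect" is taken from the plugin's documented mapping type):

```json
{
  "properties": {
    "content": {
      "type": "langdetect"
    }
  }
}
```

With working settings, the detected language would then be queryable via the field's 'lang' subfield, per the plugin documentation.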

Index custom analyzed data

A "language_analyzer" field, which indexes analyzed terms instead of the ISO code. E.g.:

"fields": {
  "language": {
    "type": "langdetect",
    "languages": [ "af", "ar", "bg", "bn", "cs", "da", "de", "el", "en", "es", "et", "fa", "fi", "fr", "gu", "he", "hi", "hr", "hu", "id", "it", "ja", "kn", "ko", "lt", "lv", "mk", "ml", "mr", "ne", "nl", "no", "pa", "pl", "pt", "ro", "ru", "sk", "sl", "so", "sq", "sv", "sw", "ta", "te", "th", "tl", "tr", "uk", "ur", "vi", "zh-cn", "zh-tw" ],
    "language_analyzer": {
      "ar": "arabic",
      "bg": "bulgarian",
      "cs": "czech",
      "da": "danish",
      "de": "german",
      "el": "greek",
      "en": "english",
      "es": "spanish",
      ...
    }
  }
}

Allow formatting the detected language string in the output

Use case:

I am using langdetect plugin to dynamically assign the analyzer at index time.

POST test/article/_mapping
{
  "article" : {
    "_analyzer" : {
      "path" : "description.lang"
    },
    "properties" : {
      "description" : { "type" : "langdetect" }
    }
  }
}

The langdetect plugin detects the language as 'en', 'fr', 'de', and so on, so the analyzers have to be named 'en', etc. This makes them less descriptive, and the context of the analyzer is lost. Is it possible to derive a more descriptive name, such that _analyzer resolves to 'en_icu_analyzer' instead of just 'en'?

Something like the following (this does not work; it is just what I want to achieve):

"article" : {
  "_analyzer" : {
    "path" : "description.lang" + "_icu_analyzer"
  }
}
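Until something like this is supported in the mapping, one workaround is to expand the detected code on the client side before indexing, writing the descriptive name into the field that _analyzer points at. A minimal sketch, where the '_icu_analyzer' suffix convention and the fallback name are assumptions:

```shell
# Hypothetical client-side helper: turn a detected ISO language
# code into a descriptive analyzer name. The suffix and the
# "default" fallback are illustrative, not plugin behavior.
analyzer_name() {
  case "$1" in
    en|fr|de) echo "${1}_icu_analyzer" ;;
    *)        echo "default" ;;   # unmapped codes fall back
  esac
}

analyzer_name en
analyzer_name zh-cn
```

The drawback is that the expansion happens outside Elasticsearch, so every indexing client has to apply the same table.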
