
elasticsearch-langdetect's Issues

Installation of previous version on ES2.0

Hi Joerg,
I tried to install langdetect on ES 2.0, but I'm getting "'plugin-descriptor.properties' not found in plugin.zip". That file is not included in the latest build of the langdetect plugin (1.6.0), which targets ES 1.6.
I also tried installing langdetect-beta-2.0.0, but it is not compatible with ES 2.0.
So, is there any way to install the plugin on ES 2.0, or should I wait for a new release of this plugin?
Thanks

Unable to use plugin from TransportClient in java

I am getting the following exception:
"org.elasticsearch.transport.ActionNotFoundTransportException: No handler for action [langdetect]"
while trying to use langdetect from Java using a TransportClient.
I am using Elasticsearch v1.4.3 and langdetect v1.4.4.2.
I am not sure what is causing this failure. Please help.

support for multi_field

Adding langdetect to an existing multi_field text field, as in the following mapping,

            "text": {
                  "type": "multi_field",
            "fields":{
                "text":{
                    "index": "analyzed",
                    "store": "yes",
                    "term_vector": "with_positions_offsets",
                    "type": "string"
                },
                "cleaned":{
                    "index": "analyzed",
                    "analyzer":"ocranalyzer",
                    "store": "yes",
                    "term_vector": "with_positions_offsets",
                    "type": "string"

                },
                "language":{
                    "type": "langdetect"
                }
            }
        },

results in

    java.lang.ClassCastException: org.xbib.elasticsearch.index.mapper.langdetect.LangdetectMapper$Builder cannot be cast to org.elasticsearch.index.mapper.core.AbstractFieldMapper$Builder

ES 5.3

Please create a new release!

Issue with Aggregations on language field

The content.lang field does not give the expected result when used in a terms aggregation for documents whose language is zh-cn or zh-tw.
I am using an aggregation on the content.lang field as below:

{
  "aggs": {
    "tags": {
      "terms": {
        "field": "content.lang"
      }
    }
  }
}

the result is

"aggregations": {
"tags": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 5,
"buckets": [
{
"key": "zh",
"doc_count": 6
},
{
"key": "cn",
"doc_count": 4
},
{
"key": "en",
"doc_count": 4
},
{
"key": "ar",
"doc_count": 2
},
{
"key": "de",
"doc_count": 2
},
{
"key": "es",
"doc_count": 2
},
{
"key": "fr",
"doc_count": 2
},
{
"key": "ja",
"doc_count": 2
},
{
"key": "ko",
"doc_count": 2
},
{
"key": "no",
"doc_count": 2
}
]
}
}

I have documents in the zh-cn and zh-tw languages. When aggregating, the value is split on "-" into two separate terms; see the output above, where zh=6 and cn=4, even though "zh-cn" is a single language.

Is this a bug, or do I have to set something up so that the whole value is kept as a single term?
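
In stock Elasticsearch 2.x this symptom usually means the aggregated field went through the standard analyzer, which tokenizes "zh-cn" into the separate terms "zh" and "cn". Below is a minimal sketch of the kind of mapping a terms aggregation needs, assuming the detected code can also be kept in a not_analyzed string field; the field name lang_raw is hypothetical, and whether the langdetect mapper offers such an option should be checked against the plugin README.

    "lang_raw": {
        "type": "string",
        "index": "not_analyzed"
    }

With a not_analyzed field, the same terms aggregation pointed at content.lang_raw would return "zh-cn" and "zh-tw" as single buckets.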

Could not find plugin descriptor 'plugin-descriptor.properties'

Hi.

When I try to install this plugin on ES 2.3.3, it complains. :(

$ ./bin/plugin install https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz
-> Installing from https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz...
Trying https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz ...
Downloading ...................................................DONE
Verifying https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz checksums if available ...
NOTE: Unable to verify checksum for downloaded plugin (unable to find .sha1 or .md5 file to verify)
ERROR: Could not find plugin descriptor 'plugin-descriptor.properties' in plugin zip

So I extracted the tar and looked for plugin-descriptor.properties. It exists:

$ wget https://github.com/jprante/elasticsearch-langdetect/archive/2.3.3.0.tar.gz
$ tar zxvf 2.3.3.0.tar.gz 
$ find ./ -name plugin-descriptor.properties
./src/main/templates/plugin-descriptor.properties

Am I missing something?
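
For what it's worth, a GitHub /archive/ URL points at the source tree, where plugin-descriptor.properties only exists as an unfilled template under src/main/templates; the plugin installer expects a built plugin zip with the final descriptor at its root. A sketch of installing from a locally built or downloaded plugin zip instead (the path is hypothetical and depends on how that branch packages its distribution):

$ ./bin/plugin install file:///path/to/elasticsearch-langdetect-2.3.3.0-plugin.zip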

Not working for nested objects

It seems that the language detection does not work for the fields of nested objects.
Here's a sample mapping:

{
  "mappings": {
    "document": {
      "properties": {
        "title": {
          "type": "string",
          "copy_to": "l1"
        },
        "l1": {
          "type": "langdetect",
          "store": true
        },
        "chunks": {
          "type": "nested",
          "properties": {
            "text": {
              "type": "string",
              "copy_to": "chunks.l2"
            },
            "l2": {
              "type": "langdetect",
              "store": true
            }
          }
        }
      }
    }
  }
}

and the doc:

{
  "title": "hello, world",
  "chunks": [
    {
      "text": "au revoir"
    }
  ]
}

It works for "l1" field, but it doesn't work for "l2" field. I tried the mapping without "copy_to" (just using those fields directly), to simplify the use case, but to no avail.

debug information

Is it possible to see debug logs for this plugin? I am seeing intermittent timeouts (longer than 5 seconds, even with small documents) and I'm trying to narrow down what could be causing them. I don't see anything langdetect-related in the Elasticsearch log.

FWIW, right after a timeout I can send it hundreds of requests and see it perform normally.
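
A sketch for turning up the log level, assuming the plugin logs through the standard Elasticsearch logging configuration under its org.xbib package names (worth verifying against the plugin source):

# config/logging.yml — hypothetical addition
logger:
  org.xbib: DEBUG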

Index custom analyzed data

A "language_analyzer" option that indexes analyzed terms rather than the ISO code. E.g.:

"fields": {
  "language": {
    "type": "langdetect",
    "languages": [ "af", "ar", "bg", "bn", "cs", "da", "de", "el", "en", "es", "et", "fa", "fi", "fr", "gu", "he", "hi", "hr", "hu", "id", "it", "ja", "kn", "ko", "lt", "lv", "mk", "ml", "mr", "ne", "nl", "no", "pa", "pl", "pt", "ro", "ru", "sk", "sl", "so", "sq", "sv", "sw", "ta", "te", "th", "tl", "tr", "uk", "ur", "vi", "zh-cn", "zh-tw" ],
    "language_analyzer": {
      "ar": "arabic",
      "bg": "bulgarian",
      "cs": "czech",
      "da": "danish",
      "de": "german",
      "el": "greek",
      "en": "english",
      "es": "spanish",
      ...
    }
  }
}

Plans for ES2.0?

Are you planning to support ES 2.0? I've tried updating it myself, but it is too complicated for me because there are many changes ...

Is there an "all" option for language detection?

It seems that we have to always specify a value for "languages" in order to achieve language detection.

...
         "properties": {
            "text": {
               "type": "langdetect",
               "languages" : [ "en", "de", "fr" ]
            }
         }
...

We have a very varied data set in many languages, and indexing is not time-sensitive, in case this would otherwise be a performance issue. We'd like to know whether there is an "all" or similar option for language detection, so we don't have to specify the complete list of languages here.
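
If the "languages" parameter is in fact optional (an assumption here, worth verifying against the plugin README), the mapping could simply omit it and rely on the plugin's default set of language profiles:

         "properties": {
            "text": {
               "type": "langdetect"
            }
         }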

Not working in mapper es 2.3.3 and 2.3.1

I've tried to define a mapping field, but it is not working; it does work as an endpoint. It seems to be because of the LangdetectService init: the line LangdetectService service = new LangdetectService(settingsBuilder.build()); creates empty settings.

Allow formatting the detected language output string

Use case:

I am using langdetect plugin to dynamically assign the analyzer at index time.

POST test/article/_mapping
{
  "article" : {
    "_analyzer" : {
      "path" : "description.lang"
    },
    "properties" : {
      "description" : { "type" : "langdetect" }
    }
  }
}

The langdetect plugin detects the language as 'en', 'fr', 'de', and so on, so the analyzers have to be named 'en', etc. This makes them less descriptive, and the context of the analyzer is lost. Is it possible to derive a more descriptive name, such that _analyzer resolves to 'en_icu_analyzer' instead of just 'en'?

Something like the following (this does not work; it is just what I want to achieve):

"article" : {
  "_analyzer" : {
    "path" : "description.lang" + "_icu_analyzer"
  }
}

Is it possible to return the detected Language in the Elasticsearch API

In all the examples, the detected language is only queried, never returned. In my use case I would like to classify documents as English or German and persist them; later, another job works over the data and filters it into language-specific Elasticsearch indices.

First question: Is there always exactly one language assigned, or can multiple languages be assigned?
Second question: Can I somehow persist the detected language in the documents, or return it when fetching the documents from Elasticsearch?

If everything else fails, I could use the language analyzer API your plugin provides and send the text, or use two queries to fetch the documents (one for en, one for de), but I wanted to know beforehand whether I am doing something wrong or misunderstanding anything.
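
A sketch of one approach, based on the mapping and stored-fields examples elsewhere on this page: map the field with store enabled and request it via the fields parameter at search time. Field names here are illustrative, and whether the detected code is exposed directly on the field or under a sub-field such as .lang should be checked against the plugin README.

{
  "mappings": {
    "doc": {
      "properties": {
        "content": { "type": "langdetect", "store": true }
      }
    }
  }
}

and at search time:

{
  "fields": [ "content" ],
  "query": { "match_all": {} }
}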

Re-using the bayesian filter

Is there any plan to "extract" the Bayesian filter so it can be used with other types of data, for example to filter out spam content?

Compatibility with ES2.2

ERROR: Plugin [langdetect] is incompatible with Elasticsearch [2.2.0]. Was designed for version [2.1.1]

Could you provide an update, please?

Getting "action [langdetect] is unauthorized for user" when using Shield

Hi,

I am using langdetect plugin on ES 2.2.1 with Shield. The tests work correctly before Shield is installed, but after Shield is installed, I am seeing the following error:

curl -XPOST -u es_admin 'http://localhost:9200/_langdetect?pretty' -d 'This is a test'
Enter host password for user 'es_admin':
{
  "error" : {
    "root_cause" : [ {
      "type" : "security_exception",
      "reason" : "action [langdetect] is unauthorized for user [es_admin]"
    } ],
    "type" : "security_exception",
    "reason" : "action [langdetect] is unauthorized for user [es_admin]"
  },
  "status" : 403
}

I am using a default admin user created with

bin/shield/esusers useradd es_admin -r admin

and the following admin role:

admin:
  cluster: all
  indices:
    '*':
      privileges: all

Is there any additional configuration required for Shield?

Accuracy problem

Hi,

I get some strange results when I use it on French text:

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'je vend ma chemise verte'
{
  "ok" : true,
  "languages" : [ {
    "language" : "nl",
    "probability" : 0.9999951375010268
  } ]
}

It's French, but I get "nl". Is something wrong?

url often generates lang:en on small text

On small texts containing a URL, English is almost always detected.

Example:

An Arabic tweet with a URL:

POST _langdetect?pretty
{
  "query_string": "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\" https://www.facebook.com/dralqarnee/posts/675689432512881"
}

Produces:

{
   "languages": [
      {
         "language": "en",
         "probability": 0.857138346512083
      },
      {
         "language": "ar",
         "probability": 0.14285639031760403
      }
   ]
}

English is detected with a greater probability...

Without the URL:

POST _langdetect?pretty
{
  "query_string": "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\""
}

Produces:

{
   "languages": [
      {
         "language": "ar",
         "probability": 0.5714272046098048
      },
      {
         "language": "so",
         "probability": 0.42857034099037317
      }
   ]
}

English is not even detected!

I can submit a pull request; I've already made the changes on my own.
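
Until such a change is merged, one client-side workaround is to strip URLs before calling _langdetect. A minimal shell sketch (the request body format mirrors the examples above and may differ between plugin versions; the inner quotes of the tweet are dropped to keep the shell/JSON quoting simple):

# sample tweet, URL included
text='RT @Dr_alqarnee: رمضان شهر الرحمة بالمسلمين https://www.facebook.com/dralqarnee/posts/675689432512881'
# remove URLs before detection
cleaned=$(printf '%s' "$text" | sed -E 's#https?://[^[:space:]]+##g')
curl -XPOST 'localhost:9200/_langdetect?pretty' -d "{\"query_string\": \"$cleaned\"}"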

ES 2.4.0

Could you help me create a plugin build for ES 2.4.0? It would allow me to complete my task. I ask because I use your plugin in conjunction with another one, and I cannot upgrade ES to version 5 at the moment.
Thanks in advance

Update for ES 2.2.1

Could you provide an update for ES 2.2.1, as the last release is not compatible with it?

Which langdetect version is bundled along with Elasticsearch-plugin-bundle 5.1.1

Hi JPrante,

The Elasticsearch-plugin-bundle for ES 5.1.1 comes with a langdetect plugin. I am interested only in the langdetect plugin; however, the langdetect repo only has versions up to ES 2.4.

So could we use the langdetect plugin for ES 2.4 with ES 5.1.1?
If not, is there a way to get only the langdetect plugin from the Elasticsearch-plugin-bundle for ES 5.1.1?

Thank You

Installation problems with ES 0.90

I can't install the plugin using the version number:
I tried bin/plugin -install jprante/elasticsearch-langdetect/1.0.0 and it returns: failed to download out of all possible locations.

I tried with master, but then the issue is: Plugin installation assumed to be site plugin, but contains source code, aborting installation

Could you describe in the installation section how to compile it?
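
For reference, a rough sketch of building from source and installing the resulting zip, assuming a Maven build on that branch (the artifact path below is hypothetical and may differ):

$ git clone https://github.com/jprante/elasticsearch-langdetect.git
$ cd elasticsearch-langdetect
$ mvn clean package
$ bin/plugin -url file:target/releases/elasticsearch-langdetect-1.0.0.zip -install langdetect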

Is it possible to return the detected language from the _langdetect endpoint with an HTTP request?

Hi, and thanks for your plugin and your help.
I am using the langdetect plugin on ES 2.3.3. Is it possible to return the detected language using the _langdetect endpoint with an HTTP request?

I saw this example in Sense and it is excellent, but I need to make the request from my app.

GET _langdetect
{
"text": "das ist ein test"
}

I need this because I have two indices, one for Spanish and one for English, and I want to know the language of my query phrase before performing the search in the index for that language.
For the moment I am using Python for my request:

response = requests.get('http://127.0.0.1:9200/_langdetect?pretty=myquery')
print (response.text)

I appreciate any help you can give me, and please excuse my bad English.
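
For what it's worth, the request above passes the query text as the value of the pretty URL parameter; the text has to go into the request body instead, as in the Sense example. A curl sketch of the same call (whether the 2.3.3 endpoint expects a {"text": ...} JSON body or raw text is worth checking against the README for that version):

curl -XPOST 'http://127.0.0.1:9200/_langdetect?pretty' -d '{"text": "das ist ein test"}'

The Python requests equivalent would pass the same JSON string as the request body (the data argument) rather than as a query parameter.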

Should not fail and throw an exception when given punctuation, UTF-8 chars/symbols, empty text

There are a number of different cases where the language detector will fail:

  • Leading punctuation. Something like "----------ROMA.....sexy ragazza orientale 3888669169---------- ' Perch correre affannosamente qua e l senza motivo? Tu sei ci che l esistenza vuole che tu sia.'" will fail. Generally, any text that leads off with a number of punctuation characters fails.
  • Unicode symbols, emoticons, etc. anywhere in the text cause failures: U+2000-U+2BFF (symbols), U+1F000-U+1FFFF (symbols, emoticons), probably others.
  • Having any characters in the U+1780-U+17FF range (Khmer script) fails.
  • Any text that contains no Unicode letter characters (\p{L} in PCRE) fails.

This is probably an issue with the underlying library, but if so, it would be nice for this wrapper to run some checks. Currently I have the following checks implemented in my client: https://gist.github.com/gibrown/7122061

I run about a million language-detection API calls a day, and I think this catches almost all of the failures.

apply filter on text before ngram detection

I ran into an issue which could be solved by running some custom filters on the text (I do not mean Lucene filters, but predefined operations like lowercasing, uppercasing, ...):

I get the following French tweet in all-uppercase text:

COMMENT DES GENS PEUVENT TROUVER DES CÉLÉBRITÉS DANS LES MAGASINS JE PEUX MÊME PAS TROUVER MA MÈRE
which is detected as English:

{
    "language": "en",
    "probability": 0.9999937971825049
}

But when I ask for the exact same text lowercased,

comment des gens peuvent trouver des célébrités dans les magasins je peux même pas trouver ma mère

{
    "language": "fr",
    "probability": 0.9999970343219597
}

French is now detected.
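
Until the plugin offers such filters, one client-side workaround is to lowercase the text before sending it. A rough shell sketch; GNU awk in a UTF-8 locale is assumed for lowercasing accented characters such as É, and other awk implementations may only handle ASCII, in which case lowercasing in the client application is safer:

text='COMMENT DES GENS PEUVENT TROUVER DES CÉLÉBRITÉS DANS LES MAGASINS JE PEUX MÊME PAS TROUVER MA MÈRE'
# lowercase before detection (relies on a multibyte-aware awk)
lowered=$(printf '%s' "$text" | awk '{ print tolower($0) }')
curl -XPOST 'localhost:9200/_langdetect?pretty' -d "$lowered"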

Elasticsearch 2.0.1 not compatible with plugin 2.0.0

Looks like version 2.0.0 of langdetect is not compatible with ES 2.0.1.

After installing it, I got this error:
ERROR: Plugin [langdetect] is incompatible with Elasticsearch [2.0.1]. Was designed for version [2.0.0]

Should accept empty value when indexing

Currently, if the data is empty, langdetect throws an exception and stops indexing (if it is part of a bulk process).
So I think the plugin should accept empty/null values and return a probability of 0, or have an option to set a default language for the case where the data is empty/null.

Initialization Failed in ElasticSearch 1.3

langdetect-1.2.1.1 on 1.3.0

[2014-07-24 12:15:06,533][WARN ][plugins                  ] [Stonecutter] plugin langdetect-1.2.1.1-f1082e1, failed to invoke custom onModule method
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.elasticsearch.plugins.PluginsService.processModule(PluginsService.java:198)
    at org.elasticsearch.plugins.PluginsModule.processModule(PluginsModule.java:61)
    at org.elasticsearch.common.inject.Modules.processModules(Modules.java:64)
    at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:58)
    at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:192)
    at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:70)
    at org.elasticsearch.bootstrap.Bootstrap.main(Bootstrap.java:203)
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:32)
Caused by: java.lang.VerifyError: class org.xbib.elasticsearch.rest.action.langdetect.RestLangdetectAction overrides final method handleRequest.(Lorg/elasticsearch/rest/RestRequest;Lorg/elasticsearch/rest/RestChannel;)V
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:455)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:367)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.xbib.elasticsearch.plugin.langdetect.LangdetectPlugin.onModule(LangdetectPlugin.java:33)
    ... 13 more
[2014-07-24 12:15:09,500][ERROR][bootstrap                ] {1.3.0}: Initialization Failed ...

unrecognized parameter: [profile]

The example request from the README,
curl -XPOST 'localhost:9200/_langdetect?pretty&profile=short-text' -d 'Das ist ein Test'
returns

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "request [/_langdetect] contains unrecognized parameter: [profile]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "request [/_langdetect] contains unrecognized parameter: [profile]"
  },
  "status" : 400
}

Elasticsearch version:

"version" : {
    "number" : "5.1.2",
    "build_hash" : "c8c4c16",
    "build_date" : "2017-01-11T20:18:39.146Z",
    "build_snapshot" : false,
    "lucene_version" : "6.3.0"
}

ES 1.4.4 plugin returns file content instead of lang value

I just installed the plugin and executed some search queries.

MAPPING
Copied and pasted from the plugin's index page.

PUT
{
"content": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}

SEARCH

{
  "fields" : "content.language.lang",
  "query" : {
    "match_all" : {}
  }
}

RESULT

.................

"fields": {
"content.language.lang": [
"IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
]
}

As you can see, content.language.lang does not contain the expected language code.

Any ideas?

Can't use langdetect mapping

Hi,
I've just started using your plugin (very impressive!) and I'm running into a small issue with version 2.0.
When creating a mapping with a type=langdetect field I get an error, e.g.:
curl -XPOST localhost:9200/test/article/_mapping -d '
{
  "article" : {
    "properties" : {
      "content" : { "type" : "langdetect" }
    }
  }
}'
I receive an exception and the mapping is not created.
[2013-10-24 11:39:57,096][WARN ][transport.netty ] [Aardwolf] Message not fully read (response) for [673] handl
er org.elasticsearch.action.support.master.TransportMasterNodeOperationAction$4@4dc51e57, error [true], resetting

However, when running the same command using the prior plugin version I get a successful response.

Plugin is not compatible with ES 2.1.0

After upgrading ES 2.0 to 2.1.0, it does not start, with the error message:

Plugin [langdetect] is incompatible with Elasticsearch [2.1.0]. Was designed for version [2.0.0]

ElasticSearch 1.2.1 crashing with langdetect installed

I've seen this on both a Windows and a Linux server running ES 1.2.1 with the latest langdetect plugin installed...

First I get a slew of warnings like this:

[2014-06-18 11:48:26,535][WARN ][transport ] [Alfie O'Meggan] Registered two transport handlers for action langdetect, handlers: org.elasticsearch.action.support.single.custom.TransportSingleCustomOperationAction$TransportHandler@3872bb09, org.elasticsearch.action.support.single.custom.TransportSingleCustomOperationAction$TransportHandler@f505228

Then it crashes outright with the following message:

{1.2.1}: Initialization Failed ...
 1) NoClassDefFoundError[com/fasterxml/jackson/core/Versioned]
        ClassNotFoundException[com.fasterxml.jackson.core.Versioned]
 2) NoClassDefFoundError[com/fasterxml/jackson/databind/ObjectMapper]

FYI: I checked the plugin directory and the jackson-databind-2.3.3.jar file is there. It's happening on brand-new servers with the latest versions of Java and Elasticsearch.

Any ideas?

Seems like langdetect 5.3 does not work or the documentation has incorrect examples

My config
ES:

$ curl -XGET http://127.0.0.1:9200
{
  "name" : "D8Tv5qq",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "56FxolywSQW5Xx4WzxI0mg",
  "version" : {
    "number" : "5.3.0",
    "build_hash" : "3adb13b",
    "build_date" : "2017-03-23T03:31:50.652Z",
    "build_snapshot" : false,
    "lucene_version" : "6.4.1"
  },
  "tagline" : "You Know, for Search"
}

Kibana 5.3
and the plugins:

GET _cat/plugins
D8Tv5qq analysis-icu        5.3.0
D8Tv5qq analysis-morphology 5.3.0
D8Tv5qq langdetect          5.3.0.0

My attempts:

  • from Kibana
GET or POST _langdetect 
{
  "text": "das ist ein test"
}

{
  "error": {
    "root_cause": [
      {
        "type": "json_generation_exception",
        "reason": "Can not write a field name, expecting a value"
      }
    ],
    "type": "json_generation_exception",
    "reason": "Can not write a field name, expecting a value"
  },
  "status": 500
}
  • from cURL
$ curl -XPOST http://127.0.0.1:9200/_langdetect -d '{"text":"some text"}'
{"error":{"root_cause":[{"type":"json_generation_exception","reason":"Can not write a field name, expecting a value"}],"type":"json_generation_exception","reason":"Can not write a field name, expecting a value"},"status":500}

$ curl -XGET http://127.0.0.1:9200/_langdetect -d '{"text":"some text"}'
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"No endpoint or operation is available at [_langdetect]"}],"type":"illegal_argument_exception","reason":"No endpoint or operation is available at [_langdetect]"},"status":400}

$ curl -XGET http://127.0.0.1:9200/_langdetect -d 'some text'
{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}

$curl -XPOST http://127.0.0.1:9200/_langdetect -d 'some text'
{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}

So maybe I'm doing something wrong?
Please help.

regards
Alex
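
For what it's worth, the 406 responses above come from curl's default Content-Type for -d bodies (application/x-www-form-urlencoded); sending an explicit JSON Content-Type should at least get past that check, as in this sketch (it does not address the json_generation_exception, which looks like a server-side problem in the plugin):

$ curl -XPOST -H 'Content-Type: application/json' http://127.0.0.1:9200/_langdetect -d '{"text":"some text"}'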
