
elasticsearch-plugin-bundle's Introduction

A plugin bundle for Elasticsearch


Important
Because this Elasticsearch plugin is licensed under the AGPL 3.0, which is not compatible with the SSPL or the Elastic License 2.0, the last supported version is Elasticsearch 7.10.2. Sorry for the inconvenience, and thank you for your understanding.

This plugin is the combination of the following plugins:

  • elasticsearch-analysis-autophrase

  • elasticsearch-analysis-baseform

  • elasticsearch-analysis-concat

  • elasticsearch-analysis-decompound

  • elasticsearch-analysis-german

  • elasticsearch-analysis-hyphen

  • elasticsearch-analysis-icu

  • elasticsearch-analysis-naturalsort

  • elasticsearch-analysis-reference

  • elasticsearch-analysis-sortform

  • elasticsearch-analysis-standardnumber

  • elasticsearch-analysis-symbolname

  • elasticsearch-analysis-worddelimiter

  • elasticsearch-analysis-year

  • elasticsearch-mapper-crypt

  • elasticsearch-mapper-langdetect

The code of each individual plugin is equivalent to the code in this combined bundle plugin.

Table 1. Compatibility matrix

| Plugin version | Elasticsearch version | Release date |
|----------------|-----------------------|--------------|
| 6.3.2.2 | 6.3.2 | Oct 2, 2018 |
| 5.4.1.0 | 5.4.0 | Jun 1, 2017 |
| 5.4.0.1 | 5.4.0 | May 12, 2017 |
| 5.4.0.0 | 5.4.0 | May 4, 2017 |
| 5.3.1.0 | 5.3.1 | Apr 25, 2017 |
| 5.3.0.0 | 5.3.0 | Apr 4, 2017 |
| 5.2.2.0 | 5.2.2 | Mar 2, 2017 |
| 5.2.1.0 | 5.2.1 | Feb 27, 2017 |
| 5.1.1.2 | 5.1.1 | Feb 27, 2017 |
| 5.1.1.0 | 5.1.1 | Dec 31, 2016 |
| 2.3.4.0 | 2.3.4 | Jul 30, 2016 |
| 2.3.3.0 | 2.3.3 | May 23, 2016 |
| 2.3.2.0 | 2.3.2 | May 11, 2016 |
| 2.2.0.6 | 2.2.0 | Mar 25, 2016 |
| 2.2.0.3 | 2.2.0 | Mar 6, 2016 |
| 2.2.0.2 | 2.2.0 | Mar 3, 2016 |
| 2.2.0.1 | 2.2.0 | Feb 22, 2016 |
| 2.2.0.0 | 2.2.0 | Feb 8, 2016 |
| 2.1.1.2 | 2.1.1 | Dec 30, 2015 |
| 2.1.1.0 | 2.1.1 | Dec 21, 2015 |
| 2.1.0.0 | 2.1.0 | Nov 27, 2015 |
| 2.0.0.0 | 2.0.0 | Oct 28, 2015 |
| 1.6.0.0 | 1.6.0 | Jun 30, 2015 |
| 1.5.2.1 | 1.5.2 | Jun 30, 2015 |
| 1.5.2.0 | 1.5.2 | Apr 27, 2015 |
| 1.5.1.0 | 1.5.1 | Apr 23, 2015 |
| 1.5.0.0 | 1.5.0 | Mar 31, 2015 |
| 1.4.4.0 | 1.4.4 | Apr 26, 2015 |
| 1.4.0.6 | 1.4.0 | Feb 23, 2015 |
| 1.4.0.5 | 1.4.0 | Jan 28, 2015 |
| 1.4.0.4 | 1.4.0 | Jan 19, 2015 |
| 1.4.0.3 | 1.4.0 | Dec 16, 2014 |
| 1.4.0.1 | 1.4.0 | Nov 10, 2014 |

Installation

Elasticsearch 5.x

./bin/elasticsearch-plugin install http://search.maven.org/remotecontent?filepath=org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/5.4.0.0/elasticsearch-plugin-bundle-5.4.0.0-plugin.zip

or

./bin/elasticsearch-plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/5.3.0.0/elasticsearch-plugin-bundle-5.3.0.0-plugin.zip

Do not forget to restart the node after installing.
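After restarting, you can verify that the bundle was picked up with the standard Elasticsearch 5.x plugin command (this is the stock plugin CLI, not specific to this bundle):

./bin/elasticsearch-plugin list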

Elasticsearch 2.x

./bin/plugin install 'http://search.maven.org/remotecontent?filepath=org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/2.3.3.0/elasticsearch-plugin-bundle-2.3.3.0-plugin.zip'

or

./bin/plugin install 'http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/2.3.4.0/elasticsearch-plugin-bundle-2.3.4.0-plugin.zip'

Do not forget to restart the node after installing.

Elasticsearch 1.x

./bin/plugin -install bundle -url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/1.6.0.0/elasticsearch-plugin-bundle-1.6.0.0-plugin.zip

Do not forget to restart the node after installing.

Documentation

Hyphen analyzer

ICU

Langdetect

Standardnumber

More to come.

Examples

German normalizer

The german_normalize token filter is equivalent to the Elasticsearch german_normalization filter. It performs umlaut treatment with vowel expansion, which is typical for the German language.
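A quick way to try the filter is the _analyze API, in the same style as the other Try it out examples in this document (a minimal sketch with an inline filter definition, which requires Elasticsearch 5.x or later):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "german_normalize"
    }
  ],
  "text": "Ein schöner Tag in Köln im Café an der Straßenecke"
}

The full index example below wires the same filter into a custom analyzer together with lowercase.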

PUT /test
{
   "settings": {
      "index": {
         "analysis": {
            "filter": {
               "umlaut": {
                  "type": "german_normalize"
               }
            },
            "analyzer": {
               "umlaut": {
                  "type": "custom",
                  "tokenizer": "standard",
                  "filter": [
                     "umlaut",
                     "lowercase"
                  ]
               }
            }
         }
      }
   },
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "analyzer": "umlaut"
            }
         }
      }
   }
}

GET /test/docs/_mapping

PUT /test/docs/1
{
    "text" : "Jörg Prante"
}

POST /test/docs/_search?explain
{
    "query": {
        "match": {
           "text": "Jörg"
        }
    }
}

POST /test/docs/_search?explain
{
    "query": {
        "match": {
           "text": "joerg"
        }
    }
}

POST /test/docs/_search?explain
{
    "query": {
        "match": {
           "text": "jorg"
        }
    }
}

International components for Unicode

The plugin contains an extended version of the Lucene ICU functionality with a dependency on ICU 58.2.

Available are icu_collation, icu_folding, icu_tokenizer, icu_numberformat, and icu_transform.

icu_collation

The icu_collation analyzer sorts field values according to ICU collation; the collation can be tailored with custom rules and with the language, country, strength, and alternate settings, as shown below.

PUT /test
{
   "settings": {
      "index": {
         "analysis": {
            "analyzer": {
               "icu_german_collate": {
                  "type": "icu_collation",
                  "language": "de",
                  "country": "DE",
                  "strength": "primary",
                  "rules": "& ae , ä & AE , Ä& oe , ö & OE , Ö& ue , ü & UE , ü"
               },
               "icu_german_collate_without_punct": {
                  "type": "icu_collation",
                  "language": "de",
                  "country": "DE",
                  "strength": "quaternary",
                  "alternate": "shifted",
                  "rules": "& ae , ä & AE , Ä& oe , ö & OE , Ö& ue , ü & UE , ü"
               }
            }
         }
      }
   },
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "fielddata" : true,
               "analyzer": "icu_german_collate"
            },
            "catalog_text" : {
               "type": "text",
               "fielddata" : true,
               "analyzer": "icu_german_collate_without_punct"
            }
         }
      }
   }
}

GET /test/docs/_mapping

PUT /test/docs/1
{
    "text" : "Göbel",
    "catalog_text" : "Göbel"
}

PUT /test/docs/2
{
    "text" : "Goethe",
    "catalog_text" : "G-oethe"
}

PUT /test/docs/3
{
    "text" : "Goldmann",
    "catalog_text" : "Gold*mann"
}

PUT /test/docs/4
{
    "text" : "Göthe",
    "catalog_text" : "Göthe"
}

PUT /test/docs/5
{
    "text" : "Götz",
    "catalog_text" : "Götz"
}


POST /test/docs/_search
{
    "query": {
        "match_all": {
        }
    },
    "sort" : {
        "text" : { "order" : "asc" }
    }
}

POST /test/docs/_search
{
    "query": {
        "match_all": {
        }
    },
    "sort" : {
        "catalog_text" : { "order" : "asc" }
    }
}

icu_folding

The icu_folding filter folds characters in strings according to Unicode folding rules (UTR #30). UTR #30 has been retracted, but it is still used here.
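Like the other filters, icu_folding can be tried directly with the _analyze API (a minimal sketch using the filter with its default settings):

GET _analyze
{
  "tokenizer": "icu_tokenizer",
  "filter": [
    {
      "type": "icu_folding"
    }
  ],
  "text": "Jörg Prante"
}

The index example below additionally defines a folding filter whose scope is restricted with a unicodeSetFilter.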

PUT /test
{
   "settings": {
          "index":{
        "analysis":{
            "char_filter" : {
                "my_icu_folder" : {
                   "type" : "icu_folding"
                }
            },
            "tokenizer" : {
                "my_icu_tokenizer" : {
                    "type" : "icu_tokenizer"
                }
            },
            "filter" : {
                "my_icu_folder_filter" : {
                    "type" : "icu_folding"
                },
                "my_icu_folder_filter_with_exceptions" : {
                    "type" : "icu_folding",
                    "name" : "utr30",
                    "unicodeSetFilter" : "[^åäöÅÄÖ]"
                }
            },
            "analyzer" : {
                "my_icu_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "my_icu_tokenizer",
                    "filter" : [ "my_icu_folder_filter" ]
                },
                "my_icu_analyzer_with_exceptions" : {
                    "type" : "custom",
                    "tokenizer" : "my_icu_tokenizer",
                    "filter" : [ "my_icu_folder_filter_with_exceptions" ]
                }
            }
        }
    }
   },
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "fielddata" : true,
               "analyzer": "my_icu_analyzer"
            },
            "text2" : {
               "type": "text",
               "fielddata" : true,
               "analyzer": "my_icu_analyzer_with_exceptions"
            }
         }
      }
   }
}

GET /test/docs/_mapping

PUT /test/docs/1
{
    "text" : "Jörg Prante",
    "text2" : "Jörg Prante"
}

POST /test/docs/_search
{
    "query": {
        "match": {
            "text" : "jörg"
        }
    }
}

POST /test/docs/_search
{
    "query": {
        "match": {
            "text" : "jorg"
        }
    }
}

POST /test/docs/_search
{
    "query": {
        "match": {
            "text2" : "jörg"
        }
    }
}

// the next query yields no hit, because ö is excluded from folding in text2

POST /test/docs/_search
{
    "query": {
        "match": {
            "text2" : "jorg"
        }
    }
}

icu_tokenizer

The icu_tokenizer can use rules from a file. Here, we set up rules to prevent tokenization of hyphenated words.
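For comparison, the default icu_tokenizer (without any rule files) can be tried via the _analyze API; it breaks "do-not-break" into separate tokens (a minimal sketch):

GET _analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "we do-not-break on hyphens"
}

The index example below installs a rule file for Latin script so that hyphenated words are kept together.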

PUT /test
{
   "settings": {
      "index": {
         "analysis": {
            "tokenizer": {
               "my_hyphen_icu_tokenizer": {
                  "type": "icu_tokenizer",
                  "rulefiles": "Latn:icu/Latin-dont-break-on-hyphens.rbbi"
               }
            },
            "analyzer" : {
               "my_icu_analyzer" : {
                   "type" : "custom",
                   "tokenizer" : "my_hyphen_icu_tokenizer"
               }
            }
         }
      }
   },
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "analyzer": "my_icu_analyzer"
            }
         }
      }
   }
}

GET /test/docs/_mapping

PUT /test/docs/1
{
    "text" : "we do-not-break on hyphens"
}

POST /test/docs/_search?explain
{
    "query": {
        "term": {
            "text" : "do-not-break"
        }
    }
}

icu_numberformat

With the icu_numberformat filter, you can index numbers as they are spelled out in a language.
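A minimal sketch with the _analyze API, using an inline filter definition in the same style as the other Try it out examples in this document:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "icu_numberformat",
      "locale": "de",
      "format": "spellout"
    }
  ],
  "text": "Das sind 1000 Bücher"
}

The full index example below uses the same filter in a custom analyzer.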

PUT /test
{
   "settings": {
       "index":{
        "analysis":{
            "filter" : {
                "spellout_de" : {
                  "type" : "icu_numberformat",
                  "locale" : "de",
                  "format" : "spellout"
                }
            },
            "analyzer" : {
               "my_icu_analyzer" : {
                   "type" : "custom",
                   "tokenizer" : "standard",
                   "filter" : [ "spellout_de" ]
               }
            }
         }
      }
   },
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "analyzer": "my_icu_analyzer"
            }
         }
      }
   }
}

GET /test/docs/_mapping

PUT /test/docs/1
{
    "text" : "Das sind 1000 Bücher"
}

POST /test/docs/_search?explain
{
    "query": {
        "match": {
            "text" : "eintausend"
        }
    }
}

Baseform

The baseform token filter also indexes base forms of words, in addition to the original forms (available for German and English).

Try it out

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "baseform",
      "language": "de"
    }
  ],
  "text": "Ich gehe dahin"
}

In index settings, the filter can be combined with the unique filter in a custom analyzer:

{
 "index":{
    "analysis":{
        "filter":{
            "baseform":{
                "type" : "baseform",
                "language" : "de"
            }
        },
        "analyzer" : {
            "baseform" : {
               "tokenizer" : "standard",
               "filter" : [ "baseform", "unique" ]
            }
        }
    }
 }
}

WordDelimiterFilter2

The worddelimiter2 token filter is taken from Lucene's word delimiter filter; its options are shown in the settings example below.

Try it out

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "worddelimiter2"
    }
  ],
  "text": "PowerShot Wi-Fi SD500"
}

In index settings, the filter can be configured with these options:

{
    "index":{
        "analysis":{
            "filter" : {
                "wd" : {
                   "type" : "worddelimiter2",
                   "generate_word_parts" : true,
                   "generate_number_parts" : true,
                   "catenate_all" : true,
                   "split_on_case_change" : true,
                   "split_on_numerics" : true,
                   "stem_english_possessive" : true
                }
            }
        }
    }
}

Decompound

This is an implementation of a word decompounder plugin for Elasticsearch.

Compounding several words into one word is a property not all languages share. Compounding is used in German, the Scandinavian languages, Finnish, and Korean.

This code is a reworked implementation of the Baseforms Tool found in the ASV toolbox of Chris Biemann, Automatische Sprachverarbeitung of Leipzig University.

Lucene comes with two compound word token filters, a dictionary-based and a hyphenation-based variant. Both of them have a disadvantage: they require loading a word list into memory before they run. This decompounder does not require word lists; it can process German-language text out of the box. The decompounder uses prebuilt Compact Patricia Tries for efficient word segmentation, provided by the ASV toolbox.

Decompound examples

Try it out

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "decompound"
    }
  ],
  "text": "PowerShot Donaudampfschiff"
}
In the index settings, use a token filter of type "decompound":

{
    "index":{
        "analysis":{
            "filter":{
                "decomp":{
                    "type" : "decompound"
                }
            },
            "analyzer" : {
                "decomp" : {
                    "tokenizer" : "standard",
                    "filter" : [ "decomp" ]
                }
            }
        }
    }
}

"Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet" will be tokenized into "Die", "Die", "Jahresfeier", "Jahr", "feier", "der", "der", "Rechtsanwaltskanzleien", "Recht", "anwalt", "kanzlei", "auf", "auf", "dem", "dem", "Donaudampfschiff", "Donau", "dampf", "schiff", "hat", "hat", "viel", "viel", "Ökosteuer", "Ökosteuer", "gekostet", "gekosten"

It is recommended to add the unique token filter (http://www.elasticsearch.org/guide/reference/index-modules/analysis/unique-tokenfilter.html) to skip tokens that occur more than once.

The Lucene German normalization token filter is also provided:

{
    "index":{
        "analysis":{
            "filter":{
                "umlaut":{
                    "type" : "german_normalize"
                }
            },
            "analyzer" : {
                "umlaut" : {
                    "tokenizer" : "standard",
                    "filter" : [ "umlaut" ]
                }
            }
        }
    }
}

The input "Ein schöner Tag in Köln im Café an der Straßenecke" will be tokenized into "Ein", "schoner", "Tag", "in", "Koln", "im", "Café", "an", "der", "Strassenecke".

Threshold

The decomposing algorithm uses a threshold to decide whether a word has been decomposed successfully or not. If the threshold is too low, words can silently disappear from the index. In this case, you have to adapt the threshold so that words no longer disappear.

The default threshold value is 0.51. You can modify it in the settings:

{ "index" : { "analysis" : { "filter" : { "decomp" : { "type" : "decompound", "threshold" : 0.51 } }, "tokenizer" : { "decomp" : { "type" : "standard", "filter" : [ "decomp" ] } } } } }

Subwords

Sometimes only the decomposed subwords should be indexed. For this, set the parameter "subwords_only" to true:

{
   "index" : {
      "analysis" : {
          "filter" : {
              "decomp" : {
                  "type" : "decompound",
                  "subwords_only" : true
              }
          },
          "tokenizer" : {
              "decomp" : {
                 "type" : "standard",
                 "filter" : [ "decomp" ]
              }
          }
      }
   }
}

Caching

The time consumed by the decompound computation may increase your overall indexing time drastically when applied to billions of tokens. You can configure a least-frequently-used cache that maps a token to its decompounded tokens with the following settings:

  • use_cache: true - enables caching

  • cache_size - sets the cache size, default: 100000

  • cache_eviction_factor - sets the cache eviction factor; valid values are between 0.00 and 1.00, default: 0.90

{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
        "filter": {
          "decomp":{
            "type" : "decompound",
            "use_payload": true,
            "use_cache": true
          }
        },
        "analyzer": {
          "decomp": {
            "type": "custom",
            "tokenizer" : "standard",
            "filter" : [
              "decomp",
              "lowercase"
            ]
          },
          "lowercase": {
            "type": "custom",
            "tokenizer" : "standard",
            "filter" : [
              "lowercase"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "decomp",
          "search_analyzer": "lowercase"
        }
      }
    }
  }
}

Exact phrase matches

The usage of decompounds can lead to undesired results in phrase queries. After indexing, decompounded tokens cannot be distinguished from the original tokens. The outcome of a phrase query for "Deutsche Bank" could be "Deutsche Spielbankgesellschaft", which is clearly an unexpected result. To enable "exact" phrase queries, each decompounded token is tagged with additional payload data.

To evaluate this payload data, you can use the exact_phrase query as a wrapper around a query containing your phrase queries.

use_payload - if set to true, enables payload creation. Default: false
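For the payloads to be available at query time, the decompound filter used at index time must have use_payload enabled, as in the caching example above. A minimal filter definition:

{
   "index" : {
      "analysis" : {
          "filter" : {
              "decomp" : {
                  "type" : "decompound",
                  "use_payload" : true
              }
          }
      }
   }
}

The exact_phrase wrapper around a query_string phrase query then looks like this: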

{
  "query": {
    "exact_phrase": {
      "query": {
        "query_string": {
          "query": "\"deutsche bank\"",
          "fields": [
            "message"
          ]
        }
      }
    }
  }
}
# Langdetect

    curl -XDELETE 'localhost:9200/test'

    curl -XPUT 'localhost:9200/test'

    curl -XPOST 'localhost:9200/test/article/_mapping' -d '
    {
      "article" : {
        "properties" : {
           "content" : { "type" : "langdetect" }
        }
      }
    }
    '

    curl -XPUT 'localhost:9200/test/article/1' -d '
    {
      "title" : "Some title",
      "content" : "Oh, say can you see by the dawn`s early light, What so proudly we hailed at the twilight`s last gleaming?"
    }
    '

    curl -XPUT 'localhost:9200/test/article/2' -d '
    {
      "title" : "Ein Titel",
      "content" : "Einigkeit und Recht und Freiheit für das deutsche Vaterland!"
    }
    '

    curl -XPUT 'localhost:9200/test/article/3' -d '
    {
      "title" : "Un titre",
      "content" : "Allons enfants de la Patrie, Le jour de gloire est arrivé!"
    }
    '

    curl -XGET 'localhost:9200/test/_refresh'

    curl -XPOST 'localhost:9200/test/_search' -d '
    {
       "query" : {
           "term" : {
                "content" : "en"
           }
       }
    }
    '

    curl -XPOST 'localhost:9200/test/_search' -d '
    {
       "query" : {
           "term" : {
                "content" : "de"
           }
       }
    }
    '

    curl -XPOST 'localhost:9200/test/_search' -d '
    {
       "query" : {
           "term" : {
                "content" : "fr"
           }
       }
    }
    '

# Standardnumber

Try it out

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "standardnumber"
    }
  ],
  "text": "Die ISBN von Elasticsearch in Action lautet 9781617291623"
}

In index settings, the filter can be combined with the unique filter:

    {
       "index" : {
          "analysis" : {
              "filter" : {
                  "standardnumber" : {
                      "type" : "standardnumber"
                  }
              },
              "analyzer" : {
                  "standardnumber" : {
                      "tokenizer" : "whitespace",
                      "filter" : [ "standardnumber", "unique" ]
                  }
              }
          }
       }
    }


- WordDelimiterFilter2: taken from Lucene

- baseform: also index base forms of words (German, English)

- decompound: decompose words if possible (German)

- langdetect: find language codes of detected languages

- standardnumber: standard number entity recognition

- hyphen: token filter for shingling and combining hyphenated words (German: Bindestrichwörter), the opposite of the decompound token filter; see the sketch after this list

- sortform: process string forms for bibliographical sorting, taking non-sort areas into account

- year: token filter for 4-digit sequences

- reference:
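As a quick example of the hyphen token filter mentioned above, here is a minimal _analyze sketch. It assumes the filter is registered under the type name hyphen, analogous to the other filters in this bundle; see the Hyphen analyzer documentation referenced above for the exact options. The whitespace tokenizer is used so that the hyphenated word reaches the filter as a single token.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "hyphen"
    }
  ],
  "text": "Bindestrich-Wort"
}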


## Crypt mapper

    {
        "someType" : {
            "_source" : {
                "enabled": false
            },
            "properties" : {
                "someField":{ "type" : "crypt", "algo": "SHA-512" }
            }
        }
    }
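A minimal sketch of putting this mapping in place and indexing a document, in the same curl style as the langdetect example above (someType and someField are the placeholder names from the snippet, and the index name test is likewise only an example):

    curl -XPUT 'localhost:9200/test'

    curl -XPOST 'localhost:9200/test/someType/_mapping' -d '
    {
        "someType" : {
            "_source" : {
                "enabled": false
            },
            "properties" : {
                "someField" : { "type" : "crypt", "algo" : "SHA-512" }
            }
        }
    }
    '

    curl -XPUT 'localhost:9200/test/someType/1' -d '
    {
        "someField" : "a secret value"
    }
    '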

## Issues

All feedback is welcome! If you find issues, please post them at [Github](https://github.com/jprante/elasticsearch-plugin-bundle/issues)

# References

The decompounder is a derived work of the ASV toolbox (http://asv.informatik.uni-leipzig.de/asv/methoden).

Copyright (C) 2005 Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig

The Compact Patricia Trie data structure can be found in

*Morrison, D.: Patricia - practical algorithm to retrieve information coded in alphanumeric. Journal of ACM, 1968, 15(4):514–534*

The compound splitter used for generating features for document classification is described in

*Witschel, F., Biemann, C.: Rigorous dimensionality reduction through linguistically motivated feature selection for text categorization. Proceedings of NODALIDA 2005, Joensuu, Finland*

The base form reduction step (for Norwegian) is described in

*Eiken, U.C., Liseth, A.T., Richter, M., Witschel, F. and Biemann, C.: Ord i Dag: Mining Norwegian Daily Newswire. Proceedings of FinTAL, Turku, 2006, Finland*



# License

elasticsearch-plugin-bundle - a compilation of useful plugins for Elasticsearch

Copyright (C) 2014 Jörg Prante

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

elasticsearch-plugin-bundle's People

Contributors

jprante, thadafinser


elasticsearch-plugin-bundle's Issues

baseform:illegal_argument_exception

Hi,
With the baseform plugin from version 2.3.4.0 of the bundle (and ES 2.3.4), I got the error message
"reason":{"type":"illegal_argument_exception","reason":"Less than 2 subSpans.size():1"}.

The query string I used is
{ "query": { "function_score": { "query": { "bool": { "must": [ { "multi_match": { "query": "all company", "fields": [ "title.en", "content.en^0.1" ] } } ] } }, "functions": [ { "script_score": { "script": "(1)" } } ], "score_mode": "multiply" } }, "highlight": { "fields": { "content.en": { "pre_tags": [ "<span class=\"highlight\">" ], "post_tags": [ "</span>" ], "fragment_size": 200, "number_of_fragments": 1 }, "title.en": { "pre_tags": [ "<span class=\"highlight\">" ], "post_tags": [ "</span>" ], "fragment_size": 200, "number_of_fragments": 1 } } } }
And the index setting and mapping string is
{ "settings":{ "number_of_shards": 3, "number_of_replicas": 1, "analysis": { "analyzer": { "english": { "tokenizer": "standard", "filter": [ "lowercase", "synonym_en", "english_stop", "trim", "baseform", "english_possessive_stemmer", "english_stemmer" ], "char_filter": [ "html_strip" ] }, "ngram_analyzer": { "tokenizer": "ngram_tokenizer" } }, "filter": { "synonym_en": { "type": "synonym", "ignore_case": true, "tokenizer": "standard", "synonyms_path": "analysis/synonym_en.txt" }, "english_stop": { "type": "stop", "stopwords_path": "stopwords/english.txt" }, "english_stemmer": { "type": "stemmer", "language": "english" }, "english_possessive_stemmer": { "type": "stemmer", "language": "possessive_english" }, "baseform" : { "type" : "baseform", "language" : "en" }, "remove_empty": { "type" : "stop", "stopwords" : [ " ", " " ] } } } }, "document": { "_source": { "enabled": true }, "_all": { "enabled": true }, "dynamic": false, "properties": { "id": { "type": "string", "index": "not_analyzed" }, "title": { "type": "string", "index": "analyzed", "fields": { "en": { "type": "string", "analyzer": "english" } } }, "content": { "type": "string", "index": "analyzed", "store": "true", "norms": { "enabled": false }, "fields": { "en": { "type": "string", "analyzer": "english", "norms": { "enabled": false } } } } } } }'

Searching with this query string always reported the error message, unless I removed the 'highlight' part from the query string or reindexed the documents without the 'baseform' filter. By the way, the search keywords I entered were 'all companies', and the word 'all' is a stop word.
So, can you help me resolve the issue?
Thank you very much.

Installation link 404

I'm currently trying to install this plugin bundle and the installation link returns a 404.

Baseform: memory optimization

When I use the baseform plugin with a large number (> 1,000,000) of documents, I get this error:

[2017-04-06T07:28:07,712][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ultimate-1] fatal error in thread [elasticsearch[ultimate-1][clusterService#updateTask][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3236) ~[?:1.8.0_121]
	at org.xbib.elasticsearch.common.fsa.FSABuilder.expandBuffers(FSABuilder.java:468) ~[?:?]
	at org.xbib.elasticsearch.common.fsa.FSABuilder.serialize(FSABuilder.java:418) ~[?:?]
	at org.xbib.elasticsearch.common.fsa.FSABuilder.freezeState(FSABuilder.java:352) ~[?:?]
	at org.xbib.elasticsearch.common.fsa.FSABuilder.add(FSABuilder.java:204) ~[?:?]
	at org.xbib.elasticsearch.common.fsa.Dictionary.loadLines(Dictionary.java:43) ~[?:?]
	at org.xbib.elasticsearch.index.analysis.baseform.BaseformTokenFilterFactory.createDictionary(BaseformTokenFilterFactory.java:39) ~[?:?]
	at org.xbib.elasticsearch.index.analysis.baseform.BaseformTokenFilterFactory.<init>(BaseformTokenFilterFactory.java:27) ~[?:?]
	at org.xbib.elasticsearch.plugin.bundle.BundlePlugin$$Lambda$379/386311625.get(Unknown Source) ~[?:?]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:361) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenFilterFactories(AnalysisRegistry.java:171) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:155) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.index.IndexService.<init>(IndexService.java:145) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:363) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:427) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:392) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$1.execute(MetaDataCreateIndexService.java:364) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService.executeTasks(ClusterService.java:679) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService.calculateTaskOutputs(ClusterService.java:658) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:617) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1117) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:544) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.3.0.jar:5.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.3.0.jar:5.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]

Settings definition different between orginal langdetect and bundle

I had a hard time today figuring out why my application slowed down by around a factor of 20. After a lot of profiling I found langdetect to be the issue. Finally, I compared the original langdetect plugin and the plugin-bundle and wrote a unit test to measure execution time.

The reason is quite simple: the original langdetect plugin expects settings as
langdetect.languages = en,de,fr
while the plugin-bundle wants to see
languages = en,de,fr
in elasticsearch.yml

This applies to all settings (compare src\main\java\org\xbib\elasticsearch\module\langdetect\LangdetectService.java for details)

Is this intended? If yes, I will push an update to the docs...

BTW: I also tried the parameter ?profile=/langdetect/short-text/ since it appeared to me it could speed up detection (probably at the cost of accuracy). But in all my tries I always got "profile": "/langdetect/" returned.

ES 5.1 already usable?

First off thank you for this bundle - looks really promising.

In the README the compatibility matrix says it's already compatible with ES 5.1.

When I browse to some of the plugins, they seem to be a bit older, like https://github.com/jprante/elasticsearch-analysis-german

I also don't find the 5.1.1.0 release here https://github.com/jprante/elasticsearch-plugin-bundle/releases
or here https://repo1.maven.org/maven2/org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/
as mentioned in the docs.

generating baseform inside decompounder vs. standalone baseform: redundant

I have skimmed through the code of the decompounder plugin and noticed that, in addition to doing the decompounding itself, it generates the baseform of the last word. While this is good per se, the implementation of baseform generation inside the decompounder differs from that of the separate baseform plugin: in the decompounder it is a heuristic algorithm (Patricia trie?), and in the baseform plugin it is a mere list-based mapping.

Would it be possible to unify the approach to baseform generation? I suggest combining both approaches into a single algorithm:

  1. try the mapping-based approach
  2. and if it fails, use heuristics (Patricia trie)

There are a couple of issues that need to be addressed in the combined approach, namely:

  1. The general baseform generator handles any part of speech, while the decompounder needs to handle nouns only (or mostly nouns, as people may want to decompound adjectives like computergesteuert as well). That said, two mappings could be made available, one for words coming from the decompounder and the other for all other words. The general baseform generator should use both resources, the decompounder only one.

  2. The general baseform generator is currently case-sensitive. The mapping contains entries given in the correct, dictionary, case. However, when a word comes from the decompounder, its letter case is different. Therefore, the baseform generation inside the decompounder should rather be case-insensitive.

Does it make sense?

langdetect returns ISO-639-2/B ("ger", "eng", "fre"...) instead of ISO-639-1 ("de", "en", "fr"...)

In contrast to the documentation of the standalone langdetect plugin [https://github.com/jprante/elasticsearch-langdetect], this version returns ISO-639-2/B language codes instead of ISO-639-1, due to a different language.json file. This is a little confusing: does this affect elasticsearch.yml as well? Is it
langdetect.languages = de,en,fr
or
langdetect.languages = ger,eng,fre
in elasticsearch.yml?

I assume I cannot install the native langdetect plugin on top of the bundle to return to ISO-639-1, right?

"langdetect" mapping issue language code not retrievable

Hello,

I am trying this plugin out to handle documents with mixed languages. Unfortunately the type "langdetect" is causing some issues for me.

Here is some info that may be useful:
ES version 5.1.1
This bundle plugin version 5.1.1.0
smart_cn analysis plugin - latest
kuromoji analysis plugin - latest

Then I did this (following the example):

curl -XDELETE 'localhost:9200/test'
curl -XPUT 'localhost:9200/test'
curl -XPOST 'localhost:9200/test/article/_mapping' -d '
{
  "article" : {
    "properties" : {
      "content" : { "type" : "langdetect" }
    }
  }
}
'

curl -XPUT 'localhost:9200/test/article/1' -d '
{
  "title" : "Some title",
  "content" : "Oh, say can you see by the dawns early light, What so proudly we hailed at the twilights last gleaming?"
}
'

Finally I did the search after calling refresh

curl -XPOST 'localhost:9200/test/_search' -d '
{
  "query" : {
    "term" : {
      "content" : "en"
    }
  }
}
'
However, the search above returns 0 hits.

I double-checked the mapping, and "content" now shows up like this:

curl -XGET "localhost:9200/test/_mappings?pretty"

"content" : {
"type" : "langdetect",
"analyzer" : "_keyword",
"include_in_all" : false
}

calling curl -XGET 'localhost:9200/test/_search'
shows this

"_source" : {
"content" : "Oh, say can you see by the dawns early light, What so proudly we hailed at the twilights last gleaming?"
}

Based on the examples and the result I was getting, I don't think this is the intended behavior. How should I retrieve the detected language code?

Thank You!

auto_phrase null pointer

I tried to get an example working, but I am receiving this error:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "auto_phrase",
      "ignoreCase": false,
      "includeTokens": false,
      "replaceWhitespaceWith": "-"
    }
  ],
  "text": "Mein Text ist gut"
}
{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[ultimate-1][10.58.9.194:9300][indices:admin/analyze[s]]"
      }
    ],
    "type": "null_pointer_exception",
    "reason": null
  },
  "status": 500
}

installing for elasticsearch 6.x

Hey, thanks for this bundle, it looks great! I'm a little confused about how to install it for Elasticsearch version 6.x though...

I see the latest version tagged here on github is 6.3.2.2, but when I run

./bin/elasticsearch-plugin install https://github.com/jprante/elasticsearch-plugin-bundle/archive/6.3.2.2.zip

I get errors like

$ sudo bin/elasticsearch-plugin install https://github.com/jprante/elasticsearch-plugin-bundle/archive/6.3.2.2.zip
-> Downloading https://github.com/jprante/elasticsearch-plugin-bundle/archive/6.3.2.2.zip
Exception in thread "main" java.nio.file.NoSuchFileException: /usr/share/elasticsearch/plugins/.installing-8072144053884413337/plugin-descriptor.properties
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
	at java.nio.file.Files.newByteChannel(Files.java:361)
	at java.nio.file.Files.newByteChannel(Files.java:407)
	at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
	at java.nio.file.Files.newInputStream(Files.java:152)
	at org.elasticsearch.plugins.PluginInfo.readFromProperties(PluginInfo.java:162)
	at org.elasticsearch.plugins.InstallPluginCommand.loadPluginInfo(InstallPluginCommand.java:713)
	at org.elasticsearch.plugins.InstallPluginCommand.installPlugin(InstallPluginCommand.java:792)
	at org.elasticsearch.plugins.InstallPluginCommand.install(InstallPluginCommand.java:775)
	at org.elasticsearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:231)
	at org.elasticsearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:216)
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
	at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:77)
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
	at org.elasticsearch.cli.Command.main(Command.java:90)
	at org.elasticsearch.plugins.PluginCli.main(PluginCli.java:47)

I noticed that in the readme you have instructions for older versions that use links to other sites like http://search.maven.org... Is there something different about those bundles? Or maybe I'm missing something more basic – I'm fairly new to Elasticsearch and this is my first time setting it up from scratch.

WordDelimiter2 not working with ES 1.5.1

JVM: 1.8.0_45 ES: 1.5.1:

[Elastica\Exception\ResponseException]
  IndexCreationException[[shop_de] failed to create index]; nested: ElasticsearchIllegalArgumentException[failed to find analyzer type [worddelimiter2
  ] or tokenizer for [svb_wordDelimiter]]; nested: NoClassSettingsException[Failed to load class setting [type] with value [worddelimiter2]]; nested:
  ClassNotFoundException[org.elasticsearch.index.analysis.worddelimiter2.Worddelimiter2AnalyzerProvider];

I used exactly the filter definition from the examples in README.md (only changed name):

"svb_wordDelimiter" => array(
                            "type" => "worddelimiter2",
                            "generate_word_parts" => true,
                            "generate_number_parts" => true,
                            "catenate_all" => true,
                            "split_on_case_change" => true,
                            "split_on_numerics" => true,
                            "stem_english_possessive" => true
                        )

EDIT: Same error with baseform filter?

Docs: searching for example

Hello,

i tried now to complete the examples for Kibana, see
https://gist.github.com/ThaDafinser/d27b4fa9d144b0083ee7dad37484fdd8

For the example i've gone through the complete plugin-list
https://github.com/jprante/elasticsearch-plugin-bundle#a-plugin-bundle-for-elastisearch

For those plugins i couldn't find docs ( @jprante could cou help me here pls?)

  • elasticsearch-analysis-autophrase
  • elasticsearch-analysis-concat (update: found a small example, but dunno the options)
  • elasticsearch-analysis-sortform
  • elasticsearch-analysis-symbolname (update: found a small example, but dunno the options)
  • elasticsearch-analysis-year (update: found a small example, but dunno the options)

Other missing examples for now (could not create a "live" example yet)

  • could not create over _analyze API for icu_collation
  • elasticsearch-analysis-naturalsort (one example added)
  • elasticsearch-analysis-reference (@todo could not create a working example with ES 5.1.2)
  • elasticsearch-mapper-crypt (one example added)
  • elasticsearch-mapper-langdetect (one example added)

Are there any other things missing? When they are finished: Do you want them in README or in a seperate file?

_langdetect endpoint missing in ES 2.2?

With ES 1.7 we used the _langdetect endpoint to verify the language of a document prior to indexing it, according to the examples from https://github.com/jprante/elasticsearch-langdetect.

Trying the same with ES 2.2 and bundle 2.2.0.1, the example query now returns

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'Das ist ein Test'
{
  "error" : {
    "root_cause" : [ {
      "type" : "invalid_index_name_exception",
      "reason" : "Invalid index name [_langdetect], must not start with '_'",
      "index" : "_langdetect"
    } ],
    "type" : "invalid_index_name_exception",
    "reason" : "Invalid index name [_langdetect], must not start with '_'",
    "index" : "_langdetect"
  },
  "status" : 400
}

Is the endpoint still available somewhere?

ERROR: `elasticsearch` directory is missing in the plugin zip

I'm running the line from the install instructions (using ES 5.4.0):

$ /usr/share/elasticsearch/bin/elasticsearch-plugin install http://search.maven.org/remotecontent?filepath=org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/5.4.0.0/elasticsearch-plugin-bundle-5.4.0-plugin.zip
-> Downloading http://search.maven.org/remotecontent?filepath=org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/5.4.0.0/elasticsearch-plugin-bundle-5.4.0-plugin.zip
[=================================================] 100%   
ERROR: `elasticsearch` directory is missing in the plugin zip

I've had success before installing this bundle using http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-plugin-bundle/5.4.0.0/elasticsearch-plugin-bundle-5.4.0.0-plugin.zip -- but unfortunately xbib.org is down

I know it's been a while since this release, but I need it for an older app. Any suggestions on where I can get a plugin bundle which will install successfully?

baseform: StackOverflowError in Dictionary.lookup

With the baseform plugin from version 1.4.0.5 of the bundle (and ES 1.4.2), I still get StackOverflowErrors for some strings, like "ist" and "eine" (or longer strings containing them). Yes, these are typical stopwords, but since a lot of people will probably want to use the cutoff_frequency feature of ES instead of fixed lists of stopwords, this is still very relevant (apart from the fact that a token filter just should not throw an exception during normal use).

Following your feedback on the other issue I reported on this (on the baseforms plugin itself), I wrote a small script that goes through the de-lemma-utf8.txt file line by line and checks if the left- and right-hand token are the same when compared case-insensitively. You can take a look at the "script" I used here: https://gist.github.com/dklotz/cf0906d0ff68d9578f8e

Interestingly, that script finds 4937 lines where the two words are identical (apart from case), but "ist" or "eine" are NOT among the words found. So there is probably another error apart from the circular entries. Maybe the parsing logic should also be made robust enough that even circular entries would not produce an exception...

This is an example of the stack trace I'm seeing:

Exception in thread "main" org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution
    at org.elasticsearch.action.support.AdapterActionFuture.rethrowExecutionException(AdapterActionFuture.java:92)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:79)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:61)
    at com.fileee.search.impl.DefaultSearchClient.analyze(DefaultSearchClient.java:385)
    at com.fileee.search.impl.DefaultSearchClientTest.main(DefaultSearchClientTest.java:870)
Caused by: java.util.concurrent.ExecutionException: java.lang.StackOverflowError
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:288)
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:261)
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:92)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:72)
    ... 3 more
Caused by: java.lang.StackOverflowError
    at java.nio.charset.CharsetDecoder.replaceWith(CharsetDecoder.java:303)
    at java.nio.charset.CharsetDecoder.<init>(CharsetDecoder.java:207)
    at java.nio.charset.CharsetDecoder.<init>(CharsetDecoder.java:233)
    at sun.nio.cs.UTF_8$Decoder.<init>(UTF_8.java:84)
    at sun.nio.cs.UTF_8$Decoder.<init>(UTF_8.java:81)
    at sun.nio.cs.UTF_8.newDecoder(UTF_8.java:68)
    at java.lang.StringCoding.decode(StringCoding.java:213)
    at java.lang.String.<init>(String.java:451)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:58)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:59)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:59)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:59)
...

plugin settings

Hi
I am trying to add some custom settings for some of the plugins in the bundle, but can't understand how settings work.
For example, to add a custom setting to the decompounder, I go to DecompoundTokenFilterFactory and try to read my custom setting in createDecompounder(Settings settings).
But 'settings' always comes up empty!?
What am I getting wrong?
Also, if I add an existing property, e.g. subwords_only: true, it seems to be read by the plugin, but still, in the method createDecompounder the settings object is empty (I'm checking with settings.toDelimitedString).

Thanks for any hints!

baseform: fewer word forms returned than defined in the resource

Situation: The baseform resource de-lemma-utf8.txt defines various outcomes for one input word, for example,

Zuschlage	Zuschlag
Zuschlage	zuschlagen

I would expect that all outcomes will be returned, as the correct baseform depends on the part of speech.

If the resource is used case-insensitively, the number of such collisions will increase, now comprising cases like:

Gefahren	Gefahr
gefahren	fahren

Would it be possible to fix the plugin to return all entries given in the resource?

Thanx

Support for ES 7.16.3

I wanted to contribute it and started fixing compilation errors (e.g. Builder is not a generic anymore, no BuilderContext but ContentPath in 7.11.2, etc.) but found that the last version of org.codelibs.elasticsearch.module/analysis-common is 7.10.2 as you mentioned here: elastic/elasticsearch#27527 (comment)
If you have time it will be perfect if you add support for ES 7.16.3.

Thanks

Bug with elasticsearch 1.4.4?

When installing the plugin v.1.4.6.0 on ES 1.4.4, I was not able to recreate my index:

[Elastica\Exception\Connection\HttpException]
Operation timed out

I updated to Elasticsearch 1.5 and reinstalled recent bundle plugin version => Error is gone

Can't set "languages" for fields of type langdetect when profile is shorttext

There is a check (

) and when the profile is shorttext then languages_short_text is used but it is never set in builder/parser: (only languages are set)

decompound filter returns non-compound words twice

First of all: Thanks for creating this enormously helpful bundle! While fine-tuning it for our application, I've stumbled upon the following problem: The decompound filter correctly returns the subwords of compound words, but returns every word that's not a compound word twice (i.e. it treats the non-compound word as a single subword of itself).

This is the simplified version of my index settings to reproduce the problem:

settings:
    index:
        analysis:
            analyzer:
                german_analyzer:
                    type: custom
                    tokenizer: standard
                    filter: [decompounder]
            filter:
                decompounder:
                    type: decompound

Querying /_analyze with the text Grundbuchamt Anwältin returns:

tokens:
- token: "Grundbuchamt"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "Grund"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "buch"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "amt"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "Anwältin"
  start_offset: 13
  end_offset: 21
  type: "<ALPHANUM>"
  position: 1
- token: "Anwältin"
  start_offset: 13
  end_offset: 21
  type: "<ALPHANUM>"
  position: 1

As you can see, the token Anwältin is returned twice with the same offset and position.

(Setting subwords_only to true eliminates the duplicates by the way.)

Do you have an idea how we might fix this behaviour?

_langdetect REST Endpoint

Hi Jörg,

after a long time staying with ES 2.3, we decided to move on to ES 5.3 now. All the plugins work fine except the _langdetect REST endpoint, and I figured that even the parts of the documentation I added to your langdetect repo quite a while ago seem to be obsolete now:

GET _langdetect
{
   "text": "das ist ein test"
}

gives me

{
   "error": {
      "root_cause": [
         {
            "type": "json_generation_exception",
            "reason": "Can not write a field name, expecting a value"
         }
      ],
      "type": "json_generation_exception",
      "reason": "Can not write a field name, expecting a value"
   },
   "status": 500
}

whereas
curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'This is a test'

returns

{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}

Unfortunately our code relies pretty much on the REST endpoint for checking the language before a document is indexed.

The Handler for _langdetect still exists and is accessible. So what am I missing? Can you give me a hint, please?

langdetect error (500): duplicate of the same language profile, using REST endpoint

I have noticed a strange error caused by langdetect that I haven't seen on my old 1.7 setup before.
I am using the PHP Elasticsearch\Client, which uses Guzzle for the HTTP connection (which may or may not be part of the problem):

Everything is fine if I just have one active thread on the PHP server talking to the ES cluster. When I open a second thread, I randomly see exceptions in ES like

[2016-03-25 01:21:23,599][ERROR][org.xbib.elasticsearch.module.langdetect.LangdetectService] duplicate of the same language profile: en java.io.IOException: duplicate of the same language profile: en at org.xbib.elasticsearch.module.langdetect.LangdetectService.addProfile(LangdetectService.java:205) at org.xbib.elasticsearch.module.langdetect.LangdetectService.loadProfileFromResource(LangdetectService.java:199) at org.xbib.elasticsearch.module.langdetect.LangdetectService.load(LangdetectService.java:148) at org.xbib.elasticsearch.module.langdetect.LangdetectService.setProfile(LangdetectService.java:223) at org.xbib.elasticsearch.action.langdetect.TransportLangdetectAction.doExecute(TransportLangdetectAction.java:32) at org.xbib.elasticsearch.action.langdetect.TransportLangdetectAction.doExecute(TransportLangdetectAction.java:16) at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:70) at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:58) at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:351) at org.elasticsearch.client.FilterClient.doExecute(FilterClient.java:52) at org.elasticsearch.rest.BaseRestHandler$HeadersAndContextCopyClient.doExecute(BaseRestHandler.java:83) at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:351) at org.xbib.elasticsearch.rest.action.langdetect.RestLangdetectAction.handleRequest(RestLangdetectAction.java:30) at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:54) at org.elasticsearch.rest.RestController.executeHandler(RestController.java:207)

The language is different in each log entry, and each log entry seems to relate to a different request.
I am using the REST endpoint and I have limited the languages in elasticsearch.yml to about 10 languages.
Before I drill deeper, experimenting with combinations of settings and all that time-consuming stuff, I hope you can give me a hint about the best starting point for the investigation...

Thx in advance!

How to integrate the plugin?

I'm trying to set up a new search server with the ability to index German documents.
Therefore I discovered your plugin bundle, which seems to cover my needs.
Unfortunately I'm not able to integrate the plugin properly.
An example of how I have tried it:
PUT http://huclmaid01:9200/movies
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "umlaut": {
            "type": "german_normalize"
          }
        },
        "tokenizer": {
          "umlaut": {
            "type": "standard",
            "filter": "umlaut"
          }
        }
      }
    }
  }
}

The tokens still contain the umlauts:
POST http://huclmaid01:9200/movies/_analyze?tokenizer=umlaut
{
"text": "Die Jahresfeier der Rechtsanwaltskanzleien auf dem Rhein in der Nähe von Köln hat viel Ökosteuer gekostet"
}

What am I doing wrong?

How to use the decomposer?

I am looking for an example mapping file that shows how to use the decompounder. I am familiar with the ES dictionary_decompounder, but if I understand correctly, the plugin provides a decompounder that does not require a word list. My question is: what is the syntax to be used in the mapping file (filter, analyzer, tokenizer) so that the decompounder is used during analysis?

Hope you can help
Marc

Plugin langdetect not working under ES 2.0

Trying the langdetect example from README.md with ES 2.0.0-Beta1, I get no hits, but also no error message.

When I change the command for creating the mapping to:

curl -XPOST 'localhost:9200/test/article/_mapping' -d '
{
  "article" : {
    "properties" : {
      "content" : { "type" : "langdetect", "store" : "yes" }
    }
  }
}
'
I get the following error

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"Mapping definition for [content] has unsupported parameters: [store : yes]"}],"type":"mapper_parsing_exception","reason":"Mapping definition for [content] has unsupported parameters: [store : yes]"},"status":400}

Do you have an idea what could be the problem?

langdetect REST endpoint & test fixes for 5.3.3

I've forked this project and fixed some issues: https://github.com/edudev/elasticsearch-plugin-bundle/tree/5.3

The fixes are as follows:

  • FstDecompounder::createGlueMorphemes made changes to the passed parameter, which may be FstDecompounder::morphemes. This results in the morphemes sometimes being [e, n, ne, re, s, se, sn, sne] and sometimes being [e, en, ens, er, es, n, ns, s]. Making changes to a passed parameter is mostly not a good idea so my change simply works with a copy of this glue string array;
  • The documentation for the langdetect REST API was incorrect. The API was changed when the plugin was updated to 5.3.0, but no changes were made to the docs;
  • With the update to 5.3.0 the ActionResponse::toXContent method was updated accordingly, but the changes were not reflected in LangdetectResponse::toXContent;
  • With the update to 5.3.0, endpoints for /_isbn/{value} and /_langdetect/{profile} were added, but they simply can't be reached. If a request is made to one of them, Elasticsearch simply treats _isbn or _langdetect as an index;
  • I've added tests for the above bugfixes - the tests pass with my changes but fail without them;

I've applied the above fixes for Elasticsearch 5.3.3, but don't have the resources (time) to apply these bugfixes to the other versions, nor do I know if other versions are affected by these bugs.
I can't submit a pull request as there is no branch for version 5.3.X of the plugin. My changes are on top of tag 5.3.1.0.

P.S. I'd be glad to make a pull request if @jprante can make a 5.3 branch of tag 5.3.1.0.

How can I use 'combo'?

Hello,

I upgraded my Elasticsearch from 1.4.2 to 2.2.0. I was using the plugin https://github.com/jprante/elasticsearch-analysis-german and it includes 'elasticsearch-analysis-combo'. In the readme it is written that you also use combo (...combo: apply more than one analyzer on a field...), but when I use type: 'combo' in my code I get this error:

Elasticsearch::Transport::Transport::Errors::BadRequest: [400] {"error":{"root_cause":[{"type":"index_creation_exception","reason":"failed to create index"}],"type":"illegal_argument_exception","reason":"Unknown Analyzer type [combo] for [default]"},"status":400}

Here is the setting:

...
analyzer: {
  german:      {
    filter:    %w(lowercase trim icu_folding icu_normalizer german_snow german_stop),
    type:      'custom',
    tokenizer: 'icu_tokenizer',
  },
  default: {
    sub_analyzers: %w(standard german),
    type:          'combo'
  }
}
...

I have the plugin installed in ES and the server was reloaded many times :)

Do you have any idea what is wrong in my code?

Thank you

Idea to improve decompounding

In my current "analysis of analyzers" it turned out that lots of times when searches fail, it is due to wrongly decompounded words.

A lot of those words are on the list of the baseform dictionary already.

Maybe it's a no-brainer, but did you ever consider adding dictionary functionality to the decompounder and feeding the baseform dictionary to it, to exclude those words from decompounding?

From my current point of view, this could boost results pretty much:

Examples that must not be decompounded: loskaufen, loslassen, hochziehen, hochdrücken...

Of course it could also be helpful to decompound these words, but in many cases they get decompounded in the wrong way, since the prefix syllable is not detected as such by the decompounder. Finally, it is a pity when the baseform filter adds the baseform and the decompounder ruins it in the next step...
Another way could be to tag words to exclude them from further processing within the analyzer. I thought I had seen something like this somewhere but I cannot find it anymore.

Probably these are all thoughts that you already had before... So what is your opinion?
