Comments (5)

jprante commented on August 20, 2024

I can implement the following.

The scenario is like this: first, list the languages you want to detect in the languages parameter of the mapping. Then, in the language_to parameter, name the fields that should receive the text for each successfully detected language.

{
  "someType" : {
    "properties" : {
      "someField":{
        "type" : "langdetect",
        "languages" : [ "de", "en", "fr", "nl", "it" ],
        "language_to" : {
          "de": "german_field",
          "en": "english_field"
        }
      },
      "german_field" : {
        "analyzer" : "german",
        "type": "string"
      },
      "english_field" : {
        "analyzer" : "english",
        "type" : "string"
      }
    }
  }
}

In this example, submitting the text "This is a small example of english text" to someField will index the value en into someField, which has the langdetect type, and will also pass the text to the field english_field. German text would be indexed into german_field, which uses a different analyzer.
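The routing described above can be sketched in a few lines of Python. This is only an illustration of the behaviour, not the plugin's code: the function name route_text is hypothetical, and the detected language is passed in as an argument because the detection itself happens inside the plugin.

```python
from typing import Dict


def route_text(text: str, detected: str, language_to: Dict[str, str],
               source_field: str = "someField") -> Dict[str, str]:
    """Return the field -> value pairs that would end up in the index.

    Hypothetical helper: mirrors the language_to idea, nothing more.
    """
    # The langdetect field itself stores the detected language code.
    indexed = {source_field: detected}
    # If a target field is configured for that language, the text goes there too.
    target = language_to.get(detected)
    if target is not None:
        indexed[target] = text
    return indexed


# Same language_to block as in the mapping above.
language_to = {"de": "german_field", "en": "english_field"}
print(route_text("This is a small example of english text", "en", language_to))
```

A language that is detected but has no entry in language_to (for example fr in the mapping above) only produces the language code in someField; the text is not copied anywhere.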

It is up to the user to configure the field analyzers and the language_to mapping appropriately. Some detectable languages do not have a Lucene language analyzer, so it is not possible to implement a fully automatic scenario that covers every detectable language and every Lucene language analyzer.

Another issue is indexing multilingual text into a single field. Here I recommend the ICU analyzer. ICU can apply normalization, folding, and tokenization based on Unicode scripts, which is the best method for searching multilingual text in a single field. Stemming is not applied.
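As a rough sketch of such a setup, the index settings below define a custom analyzer built from the icu_tokenizer and the icu_folding token filter. It assumes the analysis-icu plugin is installed; the analyzer name multilang_icu is made up for this example. The settings are written as a Python dict so they can be printed as JSON and sent with whatever client is in use.

```python
import json

# Assumption: the analysis-icu plugin is installed, which provides
# "icu_tokenizer" and "icu_folding". "multilang_icu" is a hypothetical name.
icu_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "multilang_icu": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    # Folding only; no stemmer filter is configured.
                    "filter": ["icu_folding"],
                }
            }
        }
    }
}

print(json.dumps(icu_settings, indent=2))
```

A single multilingual field would then reference multilang_icu as its analyzer instead of a language-specific one.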

from elasticsearch-langdetect.

lokeshmadan commented on August 20, 2024

Hi,
I see so many requests for automatic language detection that picks the right analyzer for indexing. Is this issue currently being tracked? Any response is really appreciated.
Regards

gibrown commented on August 20, 2024

With the removal of _analyzer being specified in the query (in elastic/elasticsearch#9279), auto selection of the analyzer for a field doesn't really make sense as far as I can tell. Each field has only a single analyzer associated with it, so you can't really analyze on the fly based on lang detect.

So either you put your content into a field that is agnostic about the analyzer and use language detection only as something to filter on, or you make one call to determine the language of your content and then index your data into the field that has the appropriate analyzer.

So for instance we have separate fields like:

  • content.en
  • content.es
  • content.fr
  • etc...
  • content.default
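A minimal sketch of the second approach, assuming the language has already been determined by a separate detection call. The function names and the lang field are hypothetical, and the set of languages is just the three listed above; the dotted names content.en and so on are represented as a nested content object.

```python
# Languages that have their own content.<lang> field (illustrative subset).
ANALYZED_LANGUAGES = {"en", "es", "fr"}


def content_subfield(lang: str) -> str:
    """Pick the sub-field under "content" for a detected language code."""
    return lang if lang in ANALYZED_LANGUAGES else "default"


def build_document(text: str, lang: str) -> dict:
    """Build the document body, placing the text under content.<lang>."""
    return {
        "lang": lang,  # hypothetical field, kept so the language can be filtered on
        "content": {content_subfield(lang): text},
    }


print(build_document("Hola, mundo", "es"))
print(build_document("Hallo Welt", "de"))  # no content.de field, so content.default
```

Queries then target the field that matches the query language, with content.default catching everything else.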

jprante commented on August 20, 2024

Released version 2.4.4.1 with the language_to feature.

apatrida commented on August 20, 2024

@jprante for language detection, can you provide a default/fallback language_to target for when the confidence is below some threshold? (It would probably point at a somewhat language-neutral analyzer, such as ICU, which is safer across a lot of languages but not perfect.)
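To make the request concrete, here is a small sketch of the behaviour being asked for. It is not an existing plugin option: pick_target, fallback_field, and min_probability are hypothetical names, and the candidates are assumed to arrive as (language, probability) pairs.

```python
from typing import Dict, List, Tuple


def pick_target(candidates: List[Tuple[str, float]],
                language_to: Dict[str, str],
                fallback_field: str,
                min_probability: float = 0.5) -> str:
    """Choose the field to index into, falling back when detection is unsure."""
    if candidates:
        # Take the most probable language among the detection results.
        lang, probability = max(candidates, key=lambda c: c[1])
        if probability >= min_probability and lang in language_to:
            return language_to[lang]
    # Low confidence, unmapped language, or no result at all.
    return fallback_field


language_to = {"de": "german_field", "en": "english_field"}
print(pick_target([("en", 0.93), ("nl", 0.05)], language_to, "icu_field"))
print(pick_target([("en", 0.41), ("nl", 0.38)], language_to, "icu_field"))
```

The fallback field (icu_field here, also a made-up name) would be the one mapped to a language-neutral analyzer.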
