
Comments (3)

jprante commented on July 19, 2024

Yes, the input data cannot be processed reliably if the text is either short (single words) or short and mixed. To me the result makes sense: the first text contains the words "facebook" and "posts", while the second contains no English word.

This restriction comes from the underlying language detection module; this plugin cannot change it.

from elasticsearch-langdetect.

juliendangers commented on July 19, 2024

Yes, I agree that it makes sense for English to be detected when the URL is in the text. But I do not see the point of including URLs in language detection.

I've done the following:

  • added a pattern for URLs
private final static Pattern urlPattern = Pattern.compile("^(https?|ftp)://[^\\s/$.?#].[^\\s]*$",Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);

(not sure Pattern.UNICODE_CHARACTER_CLASS is necessary here)

  • replaced
text.replaceAll(word.pattern(), " ")

with

text.replaceAll(word.pattern(), " ").replaceAll(urlPattern.pattern(), " ")

in Detector.detect and Detector.detectAll
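
As a rough illustration of the change described above, here is a minimal, self-contained sketch. It is not the plugin's or the underlying library's code: the class `UrlStripper` and the method `stripUrls` are hypothetical names, and the pattern is a simplified variant of the one quoted above. Two details may be worth checking in the original approach: the `^` and `$` anchors only match when the whole input is a single URL, and `String.replaceAll(urlPattern.pattern(), " ")` recompiles the regex from its string form, so flags such as `Pattern.CASE_INSENSITIVE` are not applied. The sketch drops the anchors and goes through the compiled pattern's matcher instead.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of elasticsearch-langdetect.
public final class UrlStripper {

    // Assumption: a simplified, unanchored variant of the pattern from the comment,
    // so that URLs embedded anywhere in the text are matched.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "\\b(?:https?|ftp)://[^\\s/$.?#][^\\s]*",
            Pattern.CASE_INSENSITIVE);

    private UrlStripper() {
    }

    // Replaces every URL in the text with a single space.
    public static String stripUrls(String text) {
        // Using the compiled Pattern keeps the CASE_INSENSITIVE flag.
        return URL_PATTERN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(stripUrls("Mon post HTTP://facebook.com/posts/123 est ici"));
    }
}
```

Whether this preprocessing belongs in the plugin or in the underlying detection module is a separate question; the sketch only shows the string handling.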

But you're right, this should be done in the underlying language detection module, so I'm going to submit a PR to it.

This issue can be closed, don't you think?


jprante commented on July 19, 2024

I see the point that a URL is not text. But there is a lot of data that is not text, so I think URLs/URIs are only one example.

For this plugin, I think the most viable approach is to pass only preprocessed input to language detection, that is, input that has already been reduced to recognizable language.

The most general approach would be part-of-speech (POS) tagging, as used in natural language processing and text mining. It would be a good idea to combine a POS tagger with language detection of the kind this plugin provides.

