Comments (3)
Yes, the input data cannot be reliably processed if the text is either short (single words) or short and mixed. To me it makes sense: in the first text there are the words "facebook" and "posts"; in the second there is no English word.
This restriction is due to the underlying lang detect module; this plugin cannot change it.
from elasticsearch-langdetect.
Yes, I agree that it makes sense that English is detected when the URL is in the text. But I do not see the point of using URLs in language detection.
I've done the following:
- added a pattern for url
private static final Pattern urlPattern = Pattern.compile("^(https?|ftp)://[^\\s/$.?#].[^\\s]*$", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);
(not sure Pattern.UNICODE_CHARACTER_CLASS is necessary here)
- replaced
text.replaceAll(word.pattern(), " ")
by
text.replaceAll(word.pattern(), " ").replaceAll(urlPattern.pattern(), " ")
in Detector.detect and Detector.detectAll
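For anyone who wants to try the same idea outside the plugin, here is a minimal standalone sketch. It is not the plugin's code: the class name UrlStripper and the method stripUrls are hypothetical. It also differs from the snippet above in two ways, both of which are my own assumptions about what is wanted. First, the ^ and $ anchors are dropped, because with them the expression only matches when the whole input is a single URL. Second, it calls the compiled Pattern through a Matcher, because text.replaceAll(urlPattern.pattern(), " ") recompiles the expression from its string and so loses the CASE_INSENSITIVE flag.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of elasticsearch-langdetect.
final class UrlStripper {

    // Same URL expression as quoted in the comment, minus the ^ and $ anchors
    // (assumption: URLs embedded in longer text should be removed as well).
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(https?|ftp)://[^\\s/$.?#].[^\\s]*",
            Pattern.CASE_INSENSITIVE);

    // Replaces every URL in the text with a single space.
    static String stripUrls(String text) {
        // Matcher.replaceAll uses the compiled pattern, so the flags are kept.
        return URL_PATTERN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String input = "Bonjour le monde https://www.facebook.com/posts/123 merci";
        // The URL is gone; the surrounding words are left for language detection.
        System.out.println(stripUrls(input));
    }
}
```

If the existing word pattern in Detector replaces punctuation (I have not checked what it matches), the URL replacement would probably need to run before it, otherwise the "://" is already gone by the time the URL pattern is applied.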
But you're right, this should be done in the underlying lang detect module; I'm going to submit a PR to it.
This issue can be closed, don't you think?
I see the point that a URL is not text. But there is a lot of data that is not text, so I think URL/URI is only one example.
For this plugin, I think the most viable approach is to pass only input to language detection that has been preprocessed so that it is recognizable language.
The most general approach would be part-of-speech (POS) tagging, as in natural language processing / text mining. It would be a good idea to combine a POS tagger with language detection like this plugin can do.