o19s / elyzer
"Stop worrying about Elasticsearch analyzers", my therapist says
License: Apache License 2.0
Since the sole purpose of the tool is to analyze text, I think --text
isn't necessary; it should simply accept all unparsed arguments as the text:
elyzer --index myIndex --analyzer standard text to analyze
Currently, the text needs to be prefixed with --text and quoted:
elyzer --index myIndex --analyzer standard --text "text to analyze"
The following will fail or behave unexpectedly:
... --text text to analyze
=> fails with unrecognized arguments: to analyze
... --text text --text to --text analyze
=> only analyzes the last one, silently drops the first two

I am using ES version 1.7 and couldn't find a way to install it. Is 1.7 supported?
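The --text-less invocation requested earlier maps naturally onto an argparse positional with `nargs="+"`. A minimal sketch (the option set is simplified; this is not elyzer's actual parser):

```python
import argparse

# Hypothetical sketch of the requested interface: everything after the
# options is collected into the text to analyze, no --text flag needed.
parser = argparse.ArgumentParser(prog="elyzer")
parser.add_argument("--index", required=True)
parser.add_argument("--analyzer", required=True)
parser.add_argument("text", nargs="+", help="text to analyze")

# Simulates: elyzer --index myIndex --analyzer standard text to analyze
args = parser.parse_args(
    ["--index", "myIndex", "--analyzer", "standard", "text", "to", "analyze"])
text = " ".join(args.text)
```

With `nargs="+"` argparse also rejects an empty invocation, so the "silently drops" failure mode above can't occur.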
Hi, I got this error while running the command below. I am using Elasticsearch version 5.0.1.
Command
elyzer --es "http://localhost:9200" --index index_name --analyzer analyzer_name --text "some string"
Error:
CHAR_FILTER: special_char_filter
Traceback (most recent call last):
File "/usr/local/bin/elyzer", line 9, in <module>
load_entry_point('elyzer==0.2.2', 'console_scripts', 'elyzer')()
File "/usr/local/lib/python2.7/dist-packages/elyzer/__main__.py", line 30, in main
es=es))
File "/usr/local/lib/python2.7/dist-packages/elyzer/elyzer.py", line 65, in stepWise
char_filters=",".join(charFiltersInUse))
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 69, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/indices.py", line 29, in analyze
'_analyze'), params=params, body=body)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py", line 329, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_urllib3.py", line 106, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/base.py", line 105, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, u'illegal_argument_exception', u'request [/autocomplete_ahs_index/_analyze] contains unrecognized parameter: [char_filters] -> did you mean [char_filter]?')
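The error message itself points at the cause: ES 5.x renamed the _analyze API's plural parameters (`char_filters`, `token_filters`) to their singular forms. One way to stay compatible with both generations is to pick the parameter name from the server's major version; a sketch (not elyzer's actual fix):

```python
# Build the _analyze request body with a version-aware key: ES 5.x
# accepts "char_filter" where ES 2.x used "char_filters".
def analyze_body(text, char_filters, es_major_version):
    key = "char_filter" if es_major_version >= 5 else "char_filters"
    return {"text": text, key: char_filters}
```

The same switch would apply to the token-filter stage.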
Great work! It would also be nice if this supported Unicode text input (example: café). See the error below:
➜ elyzer git:(master) python __main__.py --es "http://localhost:9200" --index my_index --analyzer my_analyzer "café"
TOKENIZER: kuromoji_tokenizer
Traceback (most recent call last):
File "__main__.py", line 47, in <module>
main()
File "__main__.py", line 36, in main
es=es))
File "/Users/toiwa/Projects/Private/elyzer/elyzer/elyzer.py", line 72, in stepWise
analyzeResp = es.indices.analyze(index=indexName, body=body)
File "/Library/Python/2.7/site-packages/elasticsearch/client/utils.py", line 73, in _wrapped
return func(*args, params=params, **kwargs)
File "/Library/Python/2.7/site-packages/elasticsearch/client/indices.py", line 32, in analyze
'_analyze'), params=params, body=body)
File "/Library/Python/2.7/site-packages/elasticsearch/transport.py", line 284, in perform_request
body = self.serializer.dumps(body)
File "/Library/Python/2.7/site-packages/elasticsearch/serializer.py", line 50, in dumps
raise SerializationError(data, e)
elasticsearch.exceptions.SerializationError: ({'text': 'caf\xc3\xa9', 'char_filter': [], 'tokenizer': u'kuromoji_tokenizer'}, UnicodeDecodeError('ascii', '"caf\xc3\xa9"', 4, 5, 'ordinal not in range(128)'))
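The traceback shows a Python 2 byte string (`'caf\xc3\xa9'`) reaching the JSON serializer. On Python 2, `sys.argv` yields byte strings, so decoding the input before building the request body would avoid the crash; a sketch under that assumption:

```python
import sys

def to_unicode(text, encoding=None):
    # Python 2's sys.argv gives byte strings in the terminal's encoding;
    # decode them so the JSON serializer can handle input like "café".
    if isinstance(text, bytes):
        return text.decode(encoding or sys.getfilesystemencoding() or "utf-8")
    return text
```

On Python 3 the input is already `str`, so the function is a no-op there.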
I would like to see a nice message telling me the analyzer was not found, instead of a Python exception.
Since providing an index and an analyzer is required, it would be nice to support environment variables so that defaults can be set in the current shell environment.
Some suggestions:
ELYZER_ROOT_URL=http://server:9123
=> same as --es http://server:9123
ELYZER_INDEX=myIndex
=> same as --index myIndex
ELYZER_ANALYZER=myAnalyzer
=> same as --analyzer myAnalyzer
The environment variables would be the fallback if the corresponding argument is not passed to elyzer, i.e. command-line arguments always win.
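The fallback rule is straightforward to express with argparse by sourcing defaults from the environment; explicit arguments then override them automatically. A sketch, assuming the variable names proposed above:

```python
import argparse
import os

def build_parser(env=None):
    # Environment variables act as fallbacks; explicit command-line
    # arguments always win because they replace the default.
    env = os.environ if env is None else env
    p = argparse.ArgumentParser(prog="elyzer")
    p.add_argument("--es", default=env.get("ELYZER_ROOT_URL", "http://localhost:9200"))
    p.add_argument("--index", default=env.get("ELYZER_INDEX"),
                   required="ELYZER_INDEX" not in env)
    p.add_argument("--analyzer", default=env.get("ELYZER_ANALYZER"),
                   required="ELYZER_ANALYZER" not in env)
    return p
```

Marking the options `required` only when no variable is set preserves today's behavior for users who don't export anything.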
Together with #6, this would make elyzer a quick and handy tool:
export ELYZER_ROOT_URL=http://server:9123
export ELYZER_INDEX=myIndex
export ELYZER_ANALYZER=myAnalyzer
$ elyzer analyze this text for me thank you
The two remaining shortcomings could be addressed by setting explain: true when calling the Analyze API (available since ES 2.2).
Currently this can only be used against an unsecured instance; could basic authentication please be incorporated?
{'error': {'root_cause': [{'type': 'security_exception', 'reason': 'missing authentication credentials for REST request [/my_index/_settings]', 'header': {'WWW-Authenticate': 'Basic realm="security" charset="UTF-8"'}}], 'type': 'security_exception', 'reason': 'missing authentication credentials for REST request [/my_index/_settings]', 'header': {'WWW-Authenticate': 'Basic realm="security" charset="UTF-8"'}}, 'status': 401}
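The elasticsearch-py client already accepts an `http_auth` argument on the `Elasticsearch` constructor, so this would mostly be a matter of adding credential flags (the `--user`/`--password` flags here are hypothetical) and passing them through. The Basic header itself is just base64 of `user:password`:

```python
import base64

def basic_auth_header(user, password):
    # The Authorization header that HTTP Basic auth sends:
    # "Basic " + base64("user:password").
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}
```

In practice you'd let the client build this itself, e.g. `Elasticsearch(hosts, http_auth=(user, password))`.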
It would be nice if some of the built-in analyzers were also supported. Mainly the following ones which don't require any specific configuration:
simple
whitespace
keyword
stop
standard
snowball
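Since these analyzers need no index-specific configuration, elyzer could fall back to them whenever the index settings don't define the requested name. A hypothetical sketch:

```python
# Built-in analyzers that work without index-specific configuration.
BUILT_IN = {"standard", "simple", "whitespace", "stop", "keyword", "snowball"}

def resolve_analyzer(index_settings, name):
    custom = index_settings.get("analysis", {}).get("analyzer", {})
    if name in custom:
        return custom[name]
    if name in BUILT_IN:
        # Treat the built-in as a single opaque step, since its
        # internal stages aren't described in the index settings.
        return {"type": name}
    raise KeyError(name)
```

Custom analyzers still win on a name clash, so existing behavior is unchanged.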
$ elyzer --es foo --index bar --analyzer my_analizer 'xxx'
Traceback (most recent call last):
File "/usr/bin/elyzer", line 9, in <module>
load_entry_point('elyzer==1.1.0', 'console_scripts', 'elyzer')()
File "/usr/lib/python3.4/site-packages/elyzer/__main__.py", line 42, in main
es=es))
File "/usr/lib/python3.4/site-packages/elyzer/elyzer.py", line 36, in getAnalyzer
normalizeAnalyzer(analyzer)
File "/usr/lib/python3.4/site-packages/elyzer/elyzer.py", line 19, in normalizeAnalyzer
if (isinstance(analyzer['char_filter'], str) or isinstance(analyzer['char_filter'], unicode)):
NameError: name 'unicode' is not defined
Is anyone else having the same problem?
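The cause is that Python 3 removed the `unicode` builtin, so the `isinstance(..., unicode)` check crashes there. A version-agnostic sketch of the same normalization (the function body approximates what the traceback shows, not elyzer's exact code):

```python
# Python 2 has both str and unicode; Python 3 only str.
try:
    STRING_TYPES = (str, unicode)  # Python 2
except NameError:
    STRING_TYPES = (str,)          # Python 3

def normalize_analyzer(analyzer):
    # Wrap a bare char_filter name in a list, as the original check intended.
    if isinstance(analyzer.get("char_filter"), STRING_TYPES):
        analyzer["char_filter"] = [analyzer["char_filter"]]
    return analyzer
```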
In order to show the state of the tokens at each step of the analysis process, elyzer performs an analyze query for each stage: first char filters, then tokenizer, and finally token filters.
The problem is that on the ES side, the analysis process changed when normalizers were introduced. ES now assumes that if a request contains no normalizer, no analyzer, and no tokenizer, but does contain a token filter or a char filter, then the analyze request should behave like a normalizer.
In the case shown here, elyzer will first perform a request for the html_strip character filter, and ES will think it is a normalizer request, hence the error, since html_strip is not a valid char_filter for normalizers.
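Concretely, the char-filter-only stage body looks roughly like the first dict below. One possible workaround (an assumption, not necessarily elyzer's actual fix) is to always include the passthrough `keyword` tokenizer, which emits the whole input as a single token, so ES runs a normal analysis chain and html_strip stays legal:

```python
# Roughly the per-stage body elyzer sends for the char-filter step;
# with no analyzer or tokenizer present, newer ES treats it as a
# normalizer request and rejects html_strip.
stage_body = {"text": "<b>hi</b>", "char_filter": ["html_strip"]}

# Possible workaround: add the passthrough "keyword" tokenizer so the
# request is unambiguously an analysis (not normalizer) request.
workaround = dict(stage_body, tokenizer="keyword")
```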
Hey, I'm getting some kind of import problem; maybe you can help?
I'm running Python 3.4.3 through pyenv.
Here's what I got after doing pip install elyzer:
elyzer --es "http://localhost:9200" --index companies --analyzer services_analyzer --text "coding"
Traceback (most recent call last):
File "/Users/alexchoi/.pyenv/versions/3.4.3/bin/elyzer", line 9, in <module>
load_entry_point('elyzer==0.1.1', 'console_scripts', 'elyzer')()
File "/Users/alexchoi/.pyenv/versions/3.4.3/lib/python3.4/site-packages/pkg_resources/__init__.py", line 519, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/Users/alexchoi/.pyenv/versions/3.4.3/lib/python3.4/site-packages/pkg_resources/__init__.py", line 2630, in load_entry_point
return ep.load()
File "/Users/alexchoi/.pyenv/versions/3.4.3/lib/python3.4/site-packages/pkg_resources/__init__.py", line 2310, in load
return self.resolve()
File "/Users/alexchoi/.pyenv/versions/3.4.3/lib/python3.4/site-packages/pkg_resources/__init__.py", line 2316, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/Users/alexchoi/.pyenv/versions/3.4.3/lib/python3.4/site-packages/elyzer/__main__.py", line 3, in <module>
from elyzer import stepWise, getAnalyzer
ImportError: cannot import name 'stepWise'
I've had this idea ever since my first GET _analyze in Kibana dev tools: why don't we have the tokens visualized?
This could be useful for devs working with Kibana dev tools.
I've tried creating Kibana plugins over the weekend; it seems like no easy feat. If anyone has useful resources, that would be great.