
elasticsearch-analysis-ansj's Introduction

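## Preface

This is a Chinese word segmentation plugin for Elasticsearch, based on ansj2. Thanks to the original author, and to the ansj author and community contributors.

I forked it mainly to upgrade versions and to add Redis authentication.

For background on several Chinese segmenters used for search, see these two articles:

http://www.chepoo.com/ik-ansj-mmseg4j-segmentation-performance-comparison.html
http://my.oschina.net/apdplat/blog/228615?p=1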

## Version compatibility

| plugin | elasticsearch |
| --- | --- |
| 1.0.0Release | 0.90.2 |
| master | 1.X.0 |

==========

## Plugin installation

From the Elasticsearch directory, run the following command:

./bin/plugin -u http://maven.ansj.org/org/ansj/elasticsearch-analysis-ansj/1.x.1/elasticsearch-analysis-ansj-1.x.1-release.zip -i ansj

## Download and install

Go to http://maven.ansj.org/org/ansj/elasticsearch-analysis-ansj/, download the zip package, and extract it into the plugins directory.

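## Build and install

  • Step 1: you need an Elasticsearch server (obviously).

  • Step 2: clone the code to your machine.

  • Step 3: run mvn clean package.

  • Step 4: go to the $Project_Home/target/releases directory.

  • Step 5: copy the zip package from $Project_Home/target/releases/ to any location and extract it.

  • Step 6: copy the extracted files into $ES_HOME/plugins, and copy the dictionaries into config/ansj.

  • Step 7: configure the plugin by pasting the configuration below at the end of config/elasticsearch.yml under the Elasticsearch directory.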
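Steps 4 to 6 can be scripted. The following is a minimal Python sketch, not part of the plugin. The function name `install_release_zip`, the `ansj` subdirectory under `plugins`, and the expectation of exactly one zip in `target/releases` are all assumptions made for illustration. It does not copy the dictionaries to `config/ansj`.

```python
import zipfile
from pathlib import Path


def install_release_zip(project_home, es_home, plugin_dir_name="ansj"):
    """Extract the built plugin zip into <es_home>/plugins/<plugin_dir_name>.

    Assumes `mvn clean package` has already produced exactly one zip under
    target/releases. The plugin directory name is a guess, not a documented value.
    """
    releases = Path(project_home) / "target" / "releases"
    zips = sorted(releases.glob("*.zip"))
    if len(zips) != 1:
        raise FileNotFoundError(f"expected one zip in {releases}, found {len(zips)}")
    dest = Path(es_home) / "plugins" / plugin_dir_name
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zips[0]) as archive:
        archive.extractall(dest)
    return dest
```

Calling `install_release_zip("/path/to/project", "/path/to/elasticsearch")` returns the directory the archive was extracted into; both paths here are placeholders.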

==========

## Analyzer configuration:

##### Simple configuration:

################################## ANSJ PLUG CONFIG ################################
index:
   analysis:
     analyzer:
       index_ansj:
         type: ansj_index
       query_ansj:
         type: ansj_query

index.analysis.analyzer.default.type: ansj_index
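YAML does not allow tab characters for indentation, so it can be worth checking a pasted block before restarting the node. A minimal Python sketch (the helper name `tab_indented_lines` is made up for this example):

```python
def tab_indented_lines(yaml_text):
    """Return the 1-based numbers of non-blank lines indented with a tab."""
    bad = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank or whitespace-only lines
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad


sample = "index:\n  analysis:\n\tanalyzer:\n"
print(tab_indented_lines(sample))  # [3]
```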

##### Advanced configuration:

################################## ANSJ PLUG CONFIG ################################
# Note: tokenizers are loaded before analyzers, so if you configure redis,
# be sure to configure it in the tokenizer section below as well.
index:
   analysis:
     analyzer:
       index_ansj:
           alias: [ansj_index_analyzer]
           type: ansj_index
           is_name: false
           redis:
               pool:
                   maxactive: 20
                   maxidle: 10
                   maxwait: 100
                   testonborrow: true
               ip: master.redis.yao.com:6379
               channel: ansj_term
       query_ansj:
           alias: [ansj_query_analyzer]
           type: ansj_query
           is_name: false
           redis:
               pool:
                   maxactive: 20
                   maxidle: 10
                   maxwait: 100
                   testonborrow: true
               ip: master.redis.yao.com:6379
               channel: ansj_term
       customer_ansj_index:
           tokenizer: index_ansj_token
           filter: [sysfilter]
       customer_ansj_query:
           tokenizer: query_ansj_token
           filter: [sysfilter]
     tokenizer:
        index_ansj_token:
            type: ansj_index_token
            is_name: false
            is_num: false
            is_quantifier: false
            redis:
                pool:
                    maxactive: 20
                    maxidle: 10
                    maxwait: 100
                    testonborrow: true
                ip: master.redis.yao.com:6379
                channel: ansj_term
        query_ansj_token:
            type: ansj_query_token
            is_name: false
            is_num: false
            is_quantifier: false
            redis:
                pool:
                    maxactive: 20
                    maxidle: 10
                    maxwait: 100
                    testonborrow: true
                ip: master.redis.yao.com:6379
                channel: ansj_term
     filter:
        sysfilter:
            type: synonym
            synonyms:
                - 片,颗 =>粒

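On person-name recognition: I recommend first reading the issue I posted.

If you run into the same problem I did, consider turning person-name recognition off. Redis is not required in the configuration above. user_path can be a directory. Options that are commented out all have default values and do not need to be configured. If you use the Redis feature, make sure there is an ext.dic file under user_path.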
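The ext.dic requirement is easy to verify before starting the node. A small Python sketch, assuming that "under user_path" means directly inside that directory (the function name `has_ext_dic` is hypothetical):

```python
from pathlib import Path


def has_ext_dic(user_path):
    """Return True if user_path contains a regular file named ext.dic."""
    return (Path(user_path) / "ext.dic").is_file()
```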

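If lines like the following appear in your log, congratulations, it worked. The messages report that the stop-word dictionary has loaded, the segmenter has warmed up and is ready, and the Redis listener thread is ready. (Logs are under $ES_HOME/logs, in the file named after your cluster. Skip this if you already know.)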

[2013-10-25 18:23:55,427][INFO ][ansj-analyzer            ] ansj停止词典加载完毕!
[2013-10-25 18:24:01,509][INFO ][ansj-analyzer            ] ansj分词器预热完毕,可以使用!
[2013-10-25 18:24:01,523][INFO ][ansj-redis-pool          ] master.redis.yao.com:6379
[2013-10-25 18:24:01,607][INFO ][ansj-analyzer            ] redis守护线程准备完毕,ip:master.redis.yao.com:6379,port:6379,channel:ansj_term
[2013-10-25 18:24:01,617][INFO ][ansj-redis-msg           ] subscribe channel:ansj_term and subscribedChannels:1
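The check can be automated by scanning the log text for the warm-up message. A minimal Python sketch: the marker string is copied from the sample log above, while the function name and the idea of treating that one line as the success signal are assumptions.

```python
# Marker taken from the sample log line "ansj分词器预热完毕,可以使用!"
READY_MARKER = "ansj分词器预热完毕"


def ansj_ready(log_text):
    """Return True if any log line contains the segmenter warm-up message."""
    return any(READY_MARKER in line for line in log_text.splitlines())
```

Read the cluster's log file as UTF-8 and pass its contents to `ansj_ready`.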

## Usage

Testing

You can use the analyze API to see the effect:

  • Index-time analysis
curl -XGET 'http://[host]:9200/[index]/_analyze?analyzer=ansj_index&text=%E5%8C%97%E4%BA%AC%E9%A6%96%E9%83%BD%E6%9C%BA%E5%9C%BA%E5%8D%97%E8%B7%AF'
  • Query-time analysis
curl -XGET 'http://[host]:9200/[index]/_analyze?analyzer=ansj_query&text=%E5%8C%97%E4%BA%AC%E9%A6%96%E9%83%BD%E6%9C%BA%E5%9C%BA%E5%8D%97%E8%B7%AF'
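The `text` parameter in these URLs is the percent-encoded UTF-8 form of 北京首都机场南路. A short Python sketch that builds such a URL for arbitrary text: the helper name `analyze_url` and the `localhost` and `test` values are placeholders, and port 9200 is taken from the commands above.

```python
from urllib.parse import quote


def analyze_url(host, index, analyzer, text):
    """Build an _analyze URL with the text percent-encoded as UTF-8."""
    return f"http://{host}:9200/{index}/_analyze?analyzer={analyzer}&text={quote(text)}"


print(analyze_url("localhost", "test", "ansj_index", "北京首都机场南路"))
```

When passing the result to curl, quote it so the shell does not interpret the `&`.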

In the mapping, add the analyzer and tokenizer settings. Note that indexing and searching use different analyzers.

"byName": {
  "type": "string",
  "index_analyzer": "index_ansj",
  "search_analyzer": "query_ansj"
},
"name": {
  "type": "string",
  "index_analyzer": "customer_ansj_index",
  "search_analyzer": "customer_ansj_query"
},
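Since every field repeats the same three keys, the mapping fragment can be generated instead of typed. A Python sketch, where the helper name `ansj_string_field` is hypothetical and the field and analyzer names are the ones used in this README:

```python
import json


def ansj_string_field(index_analyzer, search_analyzer):
    """Return a string field definition with separate index and search analyzers."""
    return {
        "type": "string",
        "index_analyzer": index_analyzer,
        "search_analyzer": search_analyzer,
    }


properties = {
    "byName": ansj_string_field("index_ansj", "query_ansj"),
    "name": ansj_string_field("customer_ansj_index", "customer_ansj_query"),
}
print(json.dumps({"properties": properties}, indent=2))
```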

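Next, publish a new word through Redis and see what happens:

redis-cli
publish ansj_term u:c:视康

The segmentation result should change.

redis-cli
publish ansj_term u:d:视康

And it reverts.

Next, publish an ambiguity entry through Redis:

redis-cli
publish ansj_term a:c:减肥瘦身-减肥,nr,瘦身,v

The segmentation result should change again.

redis-cli
publish ansj_term a:d:减肥瘦身

And it reverts.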
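Judging only from the four commands above, a message appears to have the form `<kind>:<action>:<payload>`, with `u` for a user-dictionary word, `a` for an ambiguity entry, `c` to add and `d` to delete. That reading is an inference from the examples, not documented behavior. Under that assumption, a Python sketch that builds the message strings (the function names are hypothetical, and actually sending them would need a Redis client, which is left out):

```python
def user_word_message(word, delete=False):
    """Build a message that adds (u:c:) or deletes (u:d:) a user-dictionary word."""
    action = "d" if delete else "c"
    return f"u:{action}:{word}"


def ambiguity_message(phrase, parts=None):
    """Build an ambiguity message.

    parts is a list of (term, nature) pairs describing how to split the phrase.
    Passing None builds the delete message instead.
    """
    if parts is None:
        return f"a:d:{phrase}"
    flat = ",".join(f"{term},{nature}" for term, nature in parts)
    return f"a:c:{phrase}-{flat}"
```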

##### Example

curl -XPUT http://localhost:9200/ansj2blog
curl -XPUT http://localhost:9200/ansj2blog/_mapping/vincent -d '
{
  "properties": {
    "title": {"type": "string", "index": "analyzed", "index_analyzer": "index_ansj", "search_analyzer": "query_ansj"},
    "content": {"type": "string", "index": "analyzed", "index_analyzer": "customer_ansj_index", "search_analyzer": "customer_ansj_query"},
    "tags": {"type": "string", "index": "analyzed", "index_analyzer": "index_ansj", "search_analyzer": "query_ansj"}
  }
}
'

Index some documents:

curl -XPOST http://localhost:9200/ansj2blog/vincent/1 -d '
{"content":"**驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
'

curl -XPOST http://localhost:9200/ansj2blog/vincent/2 -d '
{"content":"公安部:各地校车将享最高路权"}
'

curl -XPOST http://localhost:9200/ansj2blog/vincent/3 -d '
{"content":"中韩渔警冲突调查:韩警平均每天扣1艘**渔船"}
'

curl -XPOST http://localhost:9200/ansj2blog/vincent/4 -d '
{"content":"美国留给伊拉克的是个烂摊子吗"}
'
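Hand-written request bodies are easy to get wrong, so one option is to generate them. A Python sketch that serializes each document with `json.dumps`, which always yields a single valid JSON object (the helper name `document_body` is hypothetical):

```python
import json


def document_body(content):
    """Serialize one document as a JSON object with a single "content" field."""
    # ensure_ascii=False keeps the Chinese text readable instead of \uXXXX escapes.
    return json.dumps({"content": content}, ensure_ascii=False)


print(document_body("公安部:各地校车将享最高路权"))
```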

Query with highlighting:

curl -XPOST http://localhost:9200/ansj2blog/vincent/_search -d '
{
    "query" : { "term" : { "content" : "**" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}
'

Here is the result:

{
    "took": 15,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.076713204,
        "hits": [
            {
                "_index": "ansj2blog",
                "_type": "vincent",
                "_id": "1",
                "_score": 0.076713204,
                "_source": {
                    "content": "**驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
                },
                "highlight": {
                    "content": [
                        "<tag1>**</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
                    ]
                }
            },
            {
                "_index": "ansj2blog",
                "_type": "vincent",
                "_id": "3",
                "_score": 0.076713204,
                "_source": {
                    "content": "中韩渔警冲突调查:韩警平均每天扣1艘**渔船"
                },
                "highlight": {
                    "content": [
                        "中韩渔警冲突调查:韩警平均每天扣1艘<tag1>**</tag1>渔船"
                    ]
                }
            }
        ]
    }
}
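To use the highlighted fragments in a program, the response can be walked as a plain dictionary. A Python sketch, assuming the response has the shape shown above (the function name `highlighted_fragments` is hypothetical):

```python
def highlighted_fragments(response, field="content"):
    """Return (document id, fragment) pairs from a parsed search response."""
    pairs = []
    for hit in response.get("hits", {}).get("hits", []):
        # Hits without a highlight section for this field are skipped.
        for fragment in hit.get("highlight", {}).get(field, []):
            pairs.append((hit["_id"], fragment))
    return pairs
```

Pass it the result of `json.loads` on the response body. Hits that carry no highlight for the chosen field contribute nothing to the list.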

elasticsearch-analysis-ansj's People

Contributors

4eversm, ansjsun, defp
