elasticsearch-minhash's Introduction

Elasticsearch MinHash Plugin

Overview

The MinHash plugin provides the b-bit MinHash algorithm for Elasticsearch. Using the field type and the token filter provided by this plugin, you can add a minhash value to your documents.
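
As a rough illustration of what a b-bit MinHash value is, the following Python sketch computes one from a list of tokens. It is not the plugin's implementation: the hash family (salted BLAKE2b), the bit packing order, and the name bbit_minhash are assumptions made for illustration, so its output will not match values produced by the plugin. The defaults of 128 hashes and 1 bit are chosen only because they yield 16 bytes, the same length as the example value later in this README.

```python
import hashlib


def bbit_minhash(tokens, num_hashes=128, bits=1, seed=0):
    """Return a b-bit MinHash signature as bytes (illustrative hashing only)."""
    unique = set(tokens)
    if not unique:
        raise ValueError("need at least one token")
    mask = (1 << bits) - 1
    signature = 0
    for i in range(num_hashes):
        # One "hash function" per index: a 64-bit BLAKE2b digest salted with seed + i.
        salt = (seed + i).to_bytes(8, "big")
        minimum = min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in unique
        )
        # Keep only the lowest b bits of each minimum and append them to the signature.
        signature = (signature << bits) | (minimum & mask)
    total_bits = num_hashes * bits
    return signature.to_bytes((total_bits + 7) // 8, "big")


sig = bbit_minhash("fess is java based full text search server".split())
print(len(sig), sig.hex())
```

Two token lists with a large overlap tend to agree in most positions of the signature, which is what makes the value useful for near-duplicate detection.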

Version

Versions in Maven Repository

Issues/Questions

Please file an issue.

Installation

$ $ES_HOME/bin/elasticsearch-plugin install org.codelibs:elasticsearch-minhash:7.14.0

Getting Started

Add MinHash Analyzer

First, you need to add a minhash analyzer when creating your index:

$ curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '{
  "settings":{
    "index":{
      "analysis":{
        "analyzer":{
          "minhash_analyzer":{
            "type":"custom",
            "tokenizer":"standard",
            "filter":["minhash"]
          }
        }
      }
    }
  }
}'

You are free to change the tokenizer, char_filter, and filter settings, but the minhash filter must be the last filter in the chain.
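
A minimal sketch of that rule, assuming you build the request body in Python before sending it: the hypothetical helper minhash_index_body accepts any extra token filters and always places minhash last. The lowercase and stop filters are standard Elasticsearch filters used here only as examples. Wrapping the body in a top-level settings key is an assumption based on an issue report further down stating that Elasticsearch 7 expects it.

```python
import json


def minhash_index_body(tokenizer="standard", extra_filters=(), minhash_filter="minhash"):
    """Build a create-index body whose analyzer chain ends with the minhash filter."""
    filters = [f for f in extra_filters if f != minhash_filter]
    filters.append(minhash_filter)  # the minhash filter must come last
    return {
        "settings": {  # assumption: top-level "settings" wrapper for Elasticsearch 7
            "index": {
                "analysis": {
                    "analyzer": {
                        "minhash_analyzer": {
                            "type": "custom",
                            "tokenizer": tokenizer,
                            "filter": filters,
                        }
                    }
                }
            }
        }
    }


body = minhash_index_body(extra_filters=["lowercase", "stop"])
print(json.dumps(body, indent=2))
```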

Add MinHash field

Put a minhash field into an index mapping:

$ curl -XPUT "localhost:9200/my_index/_mapping" -H 'Content-Type: application/json' -d '{
  "properties":{
    "message":{
      "type":"text",
      "copy_to":"minhash_value"
    },
    "minhash_value":{
      "type":"minhash",
      "store":true,
      "minhash_analyzer":"minhash_analyzer"
    }
  }
}'

The minhash field type is a binary type. The example above calculates a minhash value from the message field and stores it in the minhash_value field.

Get MinHash Value

Add the following document:

$ curl -XPUT "localhost:9200/my_index/_doc/1" -H 'Content-Type: application/json' -d '{
  "message":"Fess is Java based full text search server provided as OSS product."
}'

The minhash value is calculated automatically when adding the document. You can check it as below:

$ curl -XGET "localhost:9200/my_index/_doc/1?pretty&stored_fields=minhash_value,_source"

The response is:

{
  "_index" : "my_index",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "message" : "Fess is Java based full text search server provided as OSS product."
  },
  "fields" : {
    "minhash_value" : [ "KV5rsUfZpcZdVojpG8mHLA==" ]
  }
}
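
The stored value is returned as a Base64 string. As a small sketch for inspecting it outside Elasticsearch, the Python below decodes the value from the response above and prints its length and raw bytes. The example value decodes to 16 bytes (128 bits); how those bits map to individual hashes is not stated in this README, so the sketch makes no assumption about it.

```python
import base64

value = "KV5rsUfZpcZdVojpG8mHLA=="  # minhash_value from the response above
raw = base64.b64decode(value)

print(len(raw), "bytes =", len(raw) * 8, "bits")
print(raw.hex())
```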

References

Change the number of bits and hashes

To change the number of bits and hashes, set them to a token filter setting:

$ curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '{
  "settings":{
    "index":{
      "analysis":{
        "analyzer":{
          "minhash_analyzer":{
            "type":"custom",
            "tokenizer":"standard",
            "filter":["my_minhash"]
          }
        },
        "filter":{
          "my_minhash":{
            "type":"minhash",
            "seed":100,
            "bit":2,
            "size":32
          }
        }
      }
    }
  }
}'

The above sets the number of bits to 2, the number of hashes to 32, and the hash seed to 100.
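
To see how these two settings affect the size of the stored value, here is a small sketch. It assumes the bits of all hashes are packed tightly into bytes, which this README does not state explicitly; signature_bytes is a hypothetical helper name.

```python
def signature_bytes(bit, size):
    """Bytes needed for `size` hashes of `bit` bits each, assuming tight packing."""
    return (bit * size + 7) // 8


print(signature_bytes(bit=2, size=32))   # settings from the example above -> 8
print(signature_bytes(bit=1, size=128))  # -> 16
```

Under that assumption, the 16-byte value shown in Get MinHash Value would be consistent with 128 hashes of 1 bit each, though those defaults are an inference and not something this README confirms.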

elasticsearch-minhash's People

Contributors

carldea, davidefiocco, deka0106, dependabot[bot], keiichiw, marevol, pocke

elasticsearch-minhash's Issues

how can I perform search query against minhash field?

Is there any way to run a search query against the minhash field to find similar documents?

I created a minhash analyzer and added a mapping to store the minhash value with the following code:

PUT /my_index
{
  "index":{
    "analysis":{
      "analyzer":{
        "minhash_analyzer":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":["minhash"]
        }
      }
    }
  }
}

PUT /my_index/_doc/_mapping
{
  "_doc":{
    "properties":{
      "message":{
        "type":"text",
        "copy_to":"minhash_value"
      },
      "minhash_value":{
        "type":"minhash",
        "minhash_analyzer":"minhash_analyzer"
      }
    }
  }
}

PUT /my_index/_doc/1
{
  "message":"Sample text"
}

GET /my_index/_doc/1?pretty&stored_fields=minhash_value,_source

Here I can see that the "minhash_value" is properly calculated and stored.

I am trying to query for similar documents using this advice:

GET /_search
{
    "query": {
        "more_like_this" : {
            "fields" : ["minhash_value"],
            "like" : "7MCNkXlsr8O9pYZs6eSnig==",
            "min_term_freq" : 1,
            "max_query_terms" : 12
        }
    }
}

but I got the following error:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 6,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 5,
    "failures" : [
      {
        "shard" : 0,
        "index" : "my_index",
        "node" : "T25wlUSYSUimAg8ppcStGw",
        "reason" : {
          "type" : "query_shard_exception",
          "reason" : """
failed to create query: {
  "more_like_this" : {
    "fields" : [
      "minhash_value"
    ],
    "like" : [
      "KV5rsUfZpcZdVojpG8mHLA=="
    ],
    "max_query_terms" : 12,
    "min_term_freq" : 1,
    "min_doc_freq" : 5,
    "max_doc_freq" : 2147483647,
    "min_word_length" : 0,
    "max_word_length" : 0,
    "minimum_should_match" : "30%",
    "boost_terms" : 0.0,
    "include" : false,
    "fail_on_unsupported_field" : true,
    "boost" : 1.0
  }
}
""",
          "index_uuid" : "GX2WM-oMTUqQ3hgKZPk28Q",
          "index" : "my_index",
          "caused_by" : {
            "type" : "illegal_argument_exception",
            "reason" : "more_like_this only supports text/keyword fields: [minhash_value]"
          }
        }
      }
    ]
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Any advice on how to perform the search?

Elasticsearch version: 6.5.1
Plugin version: 6.5.0

I also tried Elasticsearch 5.6.14 with plugin version 5.6.1 and got the same error.
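
One possible workaround, sketched here under assumptions rather than as a feature of the plugin, is to fetch the stored minhash_value of candidate documents and compare the values client-side. The Python below decodes two Base64 values (the two that appear in this issue) and reports the fraction of matching bits. It assumes 1-bit hashes, so that each bit corresponds to one hash; bit_match_fraction is a hypothetical helper name.

```python
import base64


def bit_match_fraction(sig_a_b64, sig_b_b64):
    """Fraction of bit positions on which two Base64-encoded signatures agree."""
    a = base64.b64decode(sig_a_b64)
    b = base64.b64decode(sig_b_b64)
    if not a or len(a) != len(b):
        raise ValueError("signatures must be non-empty and of equal length")
    # XOR leaves a 1 bit wherever the two signatures differ.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (len(a) * 8)


print(bit_match_fraction("7MCNkXlsr8O9pYZs6eSnig==", "KV5rsUfZpcZdVojpG8mHLA=="))
```

Comparing every pair this way does not scale to large indices, so it is best suited to re-ranking a small candidate set retrieved by an ordinary text query.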

Error to set custom analyzer in elasticsearch 7.8.1

Hello, I tried to execute the first command from the guide:

{
  "settings":{
    "analysis":{
      "analyzer":{
        "minhash_analyzer":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":["minhash"]
        }
      }
    }
  }
}

I replaced the index key with settings because Elasticsearch 7 does not support the original syntax. When I execute this command, I get an error:

{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "Custom Analyzer [minhash_analyzer] failed to find filter under name [minhash]"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "Custom Analyzer [minhash_analyzer] failed to find filter under name [minhash]"
    },
    "status": 400
}

I installed the plugin using elasticsearch-plugin, and the installation finished without errors.

Failed to find minhash filter in kibana Dev app

Hello,
After installing the minhash filter in Elasticsearch, I started Kibana and, in Dev Tools, copied and pasted the code to create the minhash filter mapping:

curl -XPUT 'localhost:9200/my_index' -d '{
  "index":{
    "analysis":{
      "analyzer":{
        "minhash_analyzer":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":["minhash"]
        }
      }
    }
  }
}'

But then, I got this error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Custom Analyzer [minhash_analyzer] failed to find filter under name [minhash]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Custom Analyzer [minhash_analyzer] failed to find filter under name [minhash]"
  },
  "status": 400
}

First position increment must be > 0 (got 0) for field hash

I've tried a few combinations of things in 5.6 without much luck. First I tried:

curl -XPUT '192.168.1.2:9200/wordvec' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "min_hash_filter": {
          "type": "min_hash"
        }
      },
      "analyzer": {
        "minhash_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["minhash"]
        }
      }
    }
  }
}'
curl -XPUT '192.168.1.2:9200/wordvec/_mapping/fasttext' -d '{
  "fasttext": {
    "properties": {
      "word": {
        "type": "text"
      },
      "vector": {
        "type": "keyword",
        "copy_to": "hash"
      },
      "hash": {
        "type": "minhash",
        "analyzer": "minhash_analyzer",
        "store": "true",
        "index": "analyzed"
      }
    }
  }
}'

and got:

"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [hash] has different [index] values from other types of the same index"

So then I tried setting the vector field to string (which 5.6 converts to text, since string seems to be deprecated) while keeping the hash field as type minhash, and got the following error:

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [hash] has different [index] values from other types of the same index"}],"type":"illegal_argument_exception","reason":"mapper [hash] has different [index] values from other types of the same index"}

So then I tried setting the hash field to string as well. This at least let me create the mapping, but when I tried to push some data, I got:

        "error" : {
          "type" : "illegal_argument_exception",
          "reason" : "first position increment must be > 0 (got 0) for field 'hash'"
        }

Any ideas would be much appreciated!

Thanks so much

Doing fuzzy search or more_like_this query on minhash type?

Hello,
I tried to run a search query on the stored minhash_value field, e.g. with a fuzzy query or a more_like_this query, but I get an error saying the query cannot be used on type minhash:

GET /test_minhash/_doc/_search/
{
  "query": {
    "fuzzy" : {
      "minhash_value" : {
        "value": "reKED0r9qtIDAC8JIpx8Dw==",
        "boost": 1.0,
        "fuzziness": 5,
        "prefix_length": 0,
        "max_expansions": 100
      }
    }
  },
  "stored_fields": ["minhash_value"]
}

Response:

{
  "error": {
    "root_cause": [
      {
        "type": "query_shard_exception",
        "reason": "failed to create query: {\n \"fuzzy\" : {\n \"minhash_value\" : {\n \"value\" : \"reKED0r9qtIDAC8JIpx8Dw==\",\n \"fuzziness\" : \"5\",\n \"prefix_length\" : 0,\n \"max_expansions\" : 100,\n \"transpositions\" : false,\n \"boost\" : 1.0\n }\n }\n}",
        "index_uuid": "FjpcyDT3RIK__bSA5JG1yg",
        "index": "test_minhash"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "test_minhash",
        "node": "A_Rbp6ykRxOP6R-KYRZTeA",
        "reason": {
          "type": "query_shard_exception",
          "reason": "failed to create query: {\n \"fuzzy\" : {\n \"minhash_value\" : {\n \"value\" : \"reKED0r9qtIDAC8JIpx8Dw==\",\n \"fuzziness\" : \"5\",\n \"prefix_length\" : 0,\n \"max_expansions\" : 100,\n \"transpositions\" : false,\n \"boost\" : 1.0\n }\n }\n}",
          "index_uuid": "FjpcyDT3RIK__bSA5JG1yg",
          "index": "test_minhash",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Can only use fuzzy queries on keyword and text fields - not on [minhash_value] which is of type [minhash]"
          }
        }
      }
    ]
  },
  "status": 400
}

What is the best approach to do this?

_score values for predicting similar news articles

Let's assume I have a set of news articles in my ES store. Is there a way to use the MinHash score value to check whether a new article matches any article in ES? What I want to achieve is the following: suppose there are 2 articles on the same subject, one from MSNBC and the other from The Guardian. How can I recognize from the score value that they cover the same subject?
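
One way to reason about this, sketched under assumptions and independent of the Elasticsearch _score: for b-bit MinHash, two unrelated documents still agree on roughly a fraction 2^-b of their hash slots by chance, so the observed match fraction can be rescaled into an estimate of the Jaccard similarity of the two token sets. The helper names and the 0.5 threshold below are hypothetical and would need tuning on real data.

```python
def estimate_jaccard(match_fraction, bits=1):
    """Rescale a slot match fraction into an approximate Jaccard similarity.

    Assumes the chance agreement between unrelated documents is 2**-bits.
    """
    floor = 1.0 / (1 << bits)
    estimate = (match_fraction - floor) / (1.0 - floor)
    return max(0.0, min(1.0, estimate))  # clamp sampling noise into [0, 1]


def same_subject(match_fraction, bits=1, threshold=0.5):
    """Hypothetical decision rule: treat a high estimate as 'same content'."""
    return estimate_jaccard(match_fraction, bits) >= threshold


print(estimate_jaccard(0.75))  # 1-bit hashes, 75% of bits match
print(same_subject(0.9))
```

Note that MinHash measures overlap between token sets, so it is best at finding near-duplicate text. Two independently written articles on the same subject may share few tokens and therefore score low.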

This plugin was built with an older plugin structure. Contact the plugin author to remove the intermediate "elasticsearch" directory within the plugin zip.

Hello,
When I try to install elasticsearch-minhash, I get the error: "This plugin was built with an older plugin structure. Contact the plugin author to remove the intermediate "elasticsearch" directory within the plugin zip."

What should I do? Do I have to copy the unzipped elasticsearch-minhash-master somewhere into the Elasticsearch folder?

How to use copy_bits_to operator?

I was using the minhash plugin with Kibana, but I couldn't retrieve the bits keyword field with GET:

PUT /test_minhash_test
{
  "index":{
    "analysis":{
      "analyzer":{
        "minhash_analyzer":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":["minhash"]
        }
      }
    }
  }
}

PUT /test_minhash_test/_doc/_mapping
{
  "_doc":{
    "properties":{
      "message": {
        "type":"text",
        "copy_to":"minhash_value"
      },
      "minhash_value":{
        "type":"minhash",
        "minhash_analyzer":"minhash_analyzer",
        "store":true,
        "copy_bits_to": "content_minhash_bits"
      },
      "content_minhash_bits": {
        "type": "keyword",
        "store":true
      }
    }
  }
}

GET /test_minhash_test/_doc/_search/?pretty&stored_fields=*

What am I doing wrong?

null_pointer_exception

This is the error I get when I try to use this plugin:

{
  "error": {
    "root_cause": [
      {
        "type": "generation_exception",
        "reason": "failed to serialize source for type [message]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "Failed to parse mapping [message]: failed to serialize source for type [message]",
    "caused_by": {
      "type": "generation_exception",
      "reason": "failed to serialize source for type [message]",
      "caused_by": {
        "type": "null_pointer_exception",
        "reason": null
      }
    }
  },
  "status": 400
}

The code I'm using:

PUT test
{
  "index":{
    "analysis":{
      "analyzer":{
        "minhash_analyzer":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":["minhash"]
        }
      }
    }
  },
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 0
        }
    },
    "mappings": {
        "message": {
            "_all": {
                "enabled": false
            },
            "properties": {
   
                "text": {
                    "type": "text",
                    "analyzer": "persian",
                    "copy_to":"minhash_value"
                },
                "minhash_value":{
                  "type":"minhash",
                  "minhash_analyzer":"minhash_analyzerhg"
                }
            }
        }
    }
}
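
One thing worth checking in the request above: the mapping refers to an analyzer named minhash_analyzerhg, while the analysis section defines minhash_analyzer. Whether this mismatch is what triggers the null_pointer_exception is not confirmed here. As a small sketch, a hypothetical helper like the one below can flag minhash fields whose minhash_analyzer is not defined before the request is sent.

```python
def undefined_minhash_analyzers(analysis, properties):
    """Return {field: analyzer} for minhash fields whose analyzer is not defined."""
    defined = set(analysis.get("analyzer", {}))
    return {
        name: spec.get("minhash_analyzer")
        for name, spec in properties.items()
        if spec.get("type") == "minhash"
        and spec.get("minhash_analyzer") not in defined
    }


# Values copied from the request above.
analysis = {
    "analyzer": {
        "minhash_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["minhash"],
        }
    }
}
properties = {
    "text": {"type": "text", "analyzer": "persian", "copy_to": "minhash_value"},
    "minhash_value": {"type": "minhash", "minhash_analyzer": "minhash_analyzerhg"},
}

print(undefined_minhash_analyzers(analysis, properties))
```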

Request error: BadRequestError(400, 'illegal_argument_exception', 'failed to find analyzer [custom_analyzer]')

These are my settings:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "custom_analyzer": {
            "type": "custom",
            "tokenizer": "ik_max_word",
            "filter": ["synonym_filter"]
          }
        },
        "filter": {
          "synonym_filter": {
            "type": "synonym",
            "synonyms_path": "analysis/synonym.dic"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "question": {"type": "text", "analyzer": "custom_analyzer", "search_analyzer": "custom_analyzer"},
      "answer": {"type": "text"},
      "file_id": {"type": "text"}
    }
  }
}

Then I used the analyzer "custom_analyzer" in my tokenizer function:

def tokenization(self, question):
    body = {"text": question, "analyzer": 'custom_analyzer'}
    tokens = self.es.indices.analyze(index=self.index_name, body=body)
    return [token["token"] for token in tokens["tokens"]]

I got this error:
ERROR - Request error: BadRequestError(400, 'illegal_argument_exception', 'failed to find analyzer [custom_analyzer]')

How can I solve this problem?

"reason": "Custom Analyzer [minhash_analyzer] failed to find filter under name [minhash]"

Following the README, after installing the minhash plugin, I ran:

curl -XPUT 'localhost:9200/my_index' -d '{
  "index":{
    "analysis":{
      "analyzer":{
        "minhash_analyzer":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":["minhash"]
        }
      }
    }
  }
}'

It reports this error:

"type": "illegal_argument_exception",
"reason": "Custom Analyzer [minhash_analyzer] failed to find filter under name [minhash]"
