emoji-search's Introduction

Emoji, flags and emoticons support for Elasticsearch

Add support for emoji and flags in any Lucene compatible search engine!

If you wish to search 🍩 to find donuts in your documents, you came to the right place. This project offers synonym files ready for use in an Elasticsearch analyzer.

Requirements to index emoji in Elasticsearch

Requirements by version:

  • Elasticsearch >= 6.7: The standard tokenizer now understands emoji 🎉 thanks to Lucene 7.7.0, so no plugin is needed.
  • Elasticsearch >= 6.4 and < 6.7: You need to install the official ICU Plugin. See our blog post about this change.
  • Elasticsearch < 6.4: You need our custom ICU Tokenizer Plugin, see our blog post (2016).

Run the following test to verify that you get 4 EMOJI tokens:

GET _analyze
{
  "text": ["🍩 🇫🇷 👩‍🚒 🚣🏾‍♀"]
}
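
On Elasticsearch >= 6.4 and < 6.7, the default analyzer does not go through the ICU tokenizer, so the same check would presumably need to name it explicitly. A minimal sketch, assuming the official ICU plugin is installed and exposes its tokenizer under the name icu_tokenizer:

```console
GET _analyze
{
  "tokenizer": "icu_tokenizer",
  "text": ["🍩 🇫🇷 👩‍🚒 🚣🏾‍♀"]
}
```

If the plugin is missing, Elasticsearch should reject the request with an unknown tokenizer error, which makes this a quick way to confirm the installation.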

The Synonyms, flags and emoticons

What you need to search with emoji is a way to expand them to words that can match searches and documents, in your language. That's the goal of the synonym dictionaries.

We build Solr / Lucene compatible synonym files in all languages supported by Unicode CLDR so you can set them up in an analyzer. They look like this:

👩‍🚒 => 👩‍🚒, firefighter, firetruck, woman
👩‍✈ => 👩‍✈, pilot, plane, woman
🥓 => 🥓, bacon, meat, food
🥔 => 🥔, potato, vegetable, food
😅 => 😅, cold, face, open, smile, sweat
😆 => 😆, face, laugh, mouth, open, satisfied, smile
🚎 => 🚎, bus, tram, trolley
🇫🇷 => 🇫🇷, france
🇬🇧 => 🇬🇧, united kingdom

For emoticons, use this mapping with a char_filter to replace emoticons with emoji.
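
A minimal sketch of such a char_filter, assuming emoticons.txt has been copied to config/analysis (see Installation) and is written in the format expected by the mapping character filter. The index name emoticon_demo and the names emoticons_char_filter and emoticons_to_emoji are placeholders chosen for this example, not names defined by the project:

```console
PUT /emoticon_demo
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons_char_filter": {
          "type": "mapping",
          "mappings_path": "analysis/emoticons.txt"
        }
      },
      "analyzer": {
        "emoticons_to_emoji": {
          "char_filter": ["emoticons_char_filter"],
          "tokenizer": "standard"
        }
      }
    }
  }
}
```

Character filters run before the tokenizer, so an emoticon is already rewritten to an emoji by the time token filters, such as a synonym filter, see the token stream.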

Installation

Download the emoji and emoticon files you want from this repository and store them in PATH_ES/config/analysis (or anywhere Elasticsearch can read).

config
├── analysis
│   ├── cldr-emoji-annotation-synonyms-en.txt
│   └── emoticons.txt
├── elasticsearch.yml
...

Use them like this (this is a complete English example for Elasticsearch >= 6.7):

PUT /tweets
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        },
        "emoji_variation_selector_filter": {
          "type": "pattern_replace",
          "pattern": "\\uFE0E|\\uFE0F",
          "replacement": ""
        },
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["example"]
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "emoji_variation_selector_filter",
            "english_emoji",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english_with_emoji"
      }
    }
  }
}

You can now test the result with:

GET tweets/_analyze
{
  "field": "content",
  "text": "🍩 🇫🇷 👩‍🚒 🚣🏾‍♀"
}
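
As a rough end-to-end check, you could also index a document containing an emoji and search for one of its synonyms. This is only a sketch: the document text is made up for the example, and it assumes the English synonym file contains the 🥓 line shown earlier and that your Elasticsearch version accepts the _doc endpoint.

```console
POST /tweets/_doc?refresh=true
{
  "content": "Breakfast time 🥓"
}

GET /tweets/_search
{
  "query": {
    "match": {
      "content": "bacon"
    }
  }
}
```

Because the synonym filter expands 🥓 at index time, a search for bacon should be able to match this document even though the word never appears in it.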

How to contribute

Build from CLDR SVN

You will need:

  • php cli
  • php zip and curl extensions

Edit the tag in tools/build-released.php and run php tools/build-released.php.

Update emoticons

Run php tools/build-emoticon.php.

Licenses

Emoji data courtesy of CLDR. See unicode-license.txt for details. Some modifications have been made to the data, see here. Emoticon data based on https://github.com/wooorm/emoticon/ (MIT).

This repository is distributed under the MIT License. Feel free to use and contribute as you please!

emoji-search's People

Contributors

damienalexandre, harmenjanssen
