
ESQL: Translate SQL to Elasticsearch DSL


Use SQL to query Elasticsearch. ES V6 compatible.

Supported features

Keywords and functionalities

  • =, !=, <, >, <=, >=, <>, ()
  • arithmetic operators: +, -, *, /, %, >>, <<, (), ~
  • AND, OR, NOT
  • AS
  • LIKE, IN, REGEX, IS NULL, BETWEEN
  • LIMIT, SIZE, OFFSET
  • GROUP BY, ORDER BY
  • GROUP_CONCAT
  • AVG, MAX, MIN, SUM, COUNT
  • date_histogram, histogram, date_range, range
  • HAVING
  • query key value macro (see usage)
  • pagination (search after)
  • pagination for aggregation
  • UDF
  • JOIN
  • nested queries

Attention

  • Arithmetic expressions are allowed in SELECT and WHERE clauses. They are translated to script queries, which cannot use the inverted index and can therefore be slow.
  • Aggregation functions can be introduced in SELECT, ORDER BY, and HAVING clauses.
  • Fields you aggregate on must not be of type text in ES.
  • COUNT(colName) includes documents with null values in that column in the ES SQL API, while esql excludes documents with null values.
  • Neither the ES SQL API nor esql supports SELECT DISTINCT; a workaround is a query like SELECT * FROM table GROUP BY colName.
  • To use a regex query, the column should be of keyword type; otherwise the regex is applied to each term produced by the tokenizer rather than to the original text itself.
  • Comparisons involving arithmetic can be slow, since they use script queries and thus cannot take advantage of the inverted index. For binary operator precedence, please refer to this link; we don't support all of the operators.

Usage

Please refer to the code and comments in esql.go, which contains all the APIs an external user needs.

Basic Usage

sql := `SELECT COUNT(*), MAX(colA) FROM myTable WHERE colB < 10 GROUP BY colC HAVING COUNT(*) > 20`
e := NewESql()
dsl, _, err := e.ConvertPretty(sql)    // convert sql to dsl
if err == nil {
    fmt.Println(dsl)
}

Custom Query Macro

ESQL supports the ProcessQueryKey API to register a custom policy for column name replacement. It accepts 2 functions: the first determines whether a column name is to be replaced, and the second specifies how to do the replacement. Use case: the user has a custom field field, but to resolve conflicts, the server stores it as Custom.field. The ProcessQueryKey API can do the conversion automatically.

ESQL supports the ProcessQueryValue API to register a custom policy for value processing. It accepts 2 functions: the first determines whether a value of a column is to be processed, and the second specifies how to do the processing. Use case: the user wants to query time in a readable format, but the server stores time as an integer (unix nano). The ProcessQueryValue API can do the conversion automatically.

Below shows an example.

sql := "SELECT colA FROM myTable WHERE colB < 10 AND dateTime = '2015-01-01T02:59:59Z'"
// custom policy that change colName like "col.." to "myCol.."
func myKeyFilter(colName string) bool {
    return strings.HasPrefix(colName, "col")
}
func myKeyProcess(colName string) (string, error) {
    return "myCol"+colName[3:], nil
}
// custom policy that convert formatted time string to unix nano
func myValueFilter(colName string) bool {
    return strings.Contains(colName, "Time") || strings.Contains(colName, "time")
}
func myValueProcess(timeStr string) (string, error) {
    // convert formatted time string to unix nano integer
    parsedTime, err := time.Parse(defaultDateTimeFormat, timeStr)
    if err != nil {
        return "", err
    }
    return fmt.Sprintf("%v", parsedTime.UnixNano()), nil
}
// with the 2 policies, the converted dsl is equivalent to
// "SELECT myColA FROM myTable WHERE myColB < 10 AND dateTime = '1561678568048000000'"
// in which the time is in unix nano format
e := NewESql()
e.ProcessQueryKey(myKeyFilter, myKeyProcess)            // set up macro for key
e.ProcessQueryValue(myValueFilter, myValueProcess)      // set up macro for value
dsl, _, err := e.ConvertPretty(sql)                     // convert sql to dsl
if err == nil {
    fmt.Println(dsl)
}

Pagination

ESQL supports 2 kinds of pagination: the FROM keyword and ES search_after.

  • FROM keyword: the same as standard SQL syntax. Be careful: ES only supports pages within the first 10k results; if your offset is larger than 10k, search_after is necessary.
  • search_after: once you know the paging tokens, just feed them to the Convert or ConvertPretty API in order.

Below shows an example.

// first page
sql_page1 := "SELECT * FROM myTable ORDER BY colA, colB LIMIT 10"
e := NewESql()
dsl_page1, sortFields, err := e.ConvertPretty(sql_page1)

// second page
// 1. Use FROM to retrieve the 2nd page
sql_page2_FROM := "SELECT * FROM myTable ORDER BY colA, colB LIMIT 10 FROM 10"
dsl_page2_FROM, sortFields, err := e.ConvertPretty(sql_page2_FROM)

// 2. Use search_after to retrieve the 2nd page
// we can use sortFields and the query result from page 1 to get the page tokens
sql_page2_search_after := sql_page1
page_token_colA := "123"
page_token_colB := "bbc"
dsl_page2_search_after, sortFields, err := e.ConvertPretty(sql_page2_search_after, page_token_colA, page_token_colB)

ES aggregation functions

| function | signature | example |
|---|---|---|
| date_histogram | date_histogram('field', 'interval', 'format') | SELECT date_histogram('mydate', '1M', 'yyyy-MM-dd') FROM dummy |
| histogram | histogram('field', 'interval', 'min_doc_count', 'extended_bound_min,extended_bound_max') | SELECT histogram('myCol', '5', '1', '2,5') FROM dummy |
| date_range | date_range('colName', 'format', 'val1', 'val2', ...) | SELECT date_range('mydate', 'MM-yy', 'now-10M/M') FROM dummy |
| range | range('colName', 'val1', 'val2', ...) | SELECT range('myColumn', '0', '10', '50') FROM dummy |

Testing

We use Elasticsearch's SQL translate API as a reference in testing. Testing consists of 3 basic steps:

  • use Elasticsearch's SQL translate API to translate the SQL to DSL
  • use our library to convert the SQL to DSL
  • query a local Elasticsearch server with both DSLs and check that the results are identical

However, since ES's SQL API is still experimental, many features are not well supported. For such queries, testing is manual.

Features not covered yet:

  • LIKE, REGEXP keywords: ES V6.5's SQL API does not support regex search, only wildcards (and only the SQL wildcards % and _)
  • some aggregations and arithmetics are tested by manual checking, since ES's SQL API does not support them well

To run test locally:

  • download elasticsearch v6.5 (optional: kibana v6.5) and unzip them
  • run chmod u+x start_service.sh test.sh
  • run ./start_service.sh <elasticsearch_path> <kibana_path> to start a local elasticsearch server (by default, elasticsearch listens on port 9200 and kibana listens on port 5600)
  • run python gen_test_data.py -dmi 1 1000 20 to insert 1000 documents into the local ES
  • run ./test.sh TestSQL to run all the test cases in /testcases/sqls.txt
  • generated dsls are stored in dslsPretty.txt for reference

To customize test cases:

  • modify testcases/sqls.txt
  • run python gen_test_data.py -h for guidance on how to insert custom data into your local ES
  • invalid query test cases are in testcases/sqlsInvalid.txt

Changes from ES V2 to ES V6

| Item | ES V2 | ES V6/V7 |
|---|---|---|
| missing check | {"missing": {"field": "xxx"}} | {"must_not": {"exists": {"field": "xxx"}}} |
| group by multiple columns | nested "aggs" field | "composite" flattened grouping |

Acknowledgement

This project originated from elasticsql. The table below shows the improvements.

| Item | Detail |
|---|---|
| comparison | supports comparison with arithmetics on different columns |
| keyword IS | supports the standard SQL keywords IS NULL and IS NOT NULL for missing checks |
| keyword NOT | supports NOT, converting it recursively since elasticsearch's must_not is not the same as the boolean operator NOT in SQL |
| keyword LIKE | uses the "wildcard" tag, supporting the SQL wildcards '%' and '_' |
| keyword REGEX | uses the "regexp" tag, supporting standard regular expression syntax |
| keyword GROUP BY | uses the "composite" tag to flatten multiple groupings |
| keyword ORDER BY | uses "bucket_sort" to support ordering by aggregation functions |
| keyword HAVING | uses "bucket_selector" and the painless scripting language to support HAVING |
| keyword GROUP_CONCAT | supports GROUP_CONCAT on multiple columns with a custom separator (sorting not supported here) |
| aggregations | allows introducing aggregation functions from HAVING, SELECT, and ORDER BY |
| column name filtering | allows the user to pass a whitelist; when a query selects a column outside the whitelist, the conversion is refused |
| column name replacing | allows the user to pass a function as an initialization parameter; matched column names are replaced according to the policy |
| query value replacing | allows the user to pass a function as an initialization parameter; query values are processed by that function if the column name matches the filter function |
| pagination | also returns the sorting fields for future search_after usage |
| optimization | uses the "filter" tag rather than "must" to avoid scoring analysis and save time |
| optimization | no redundant {"bool": {"filter": xxx}} wrapping |
| optimization | does not return document contents in aggregation queries |
| optimization | returns only the fields the user specifies after SELECT |

Contributors

  • jysui123
