mabel-dev / opteryx Goto Github PK

🦖 A SQL-on-everything Query Engine you can execute over multiple databases and file formats. Query your data, where it lives.

Home Page: https://opteryx.dev

License: Apache License 2.0

Python 96.58% Rust 0.34% Makefile 0.07% Cython 3.02%

python sql analytics big-data serverless sql-engine arrow cloud-run cloud-native query-engine

opteryx's Issues

[FEATURE] Support $PARTITION hints

SELECT * FROM $satellites WHERE $PARTITION.username('green')

This would instruct the engine to try to prune the by_username partitioning to the username=green partition. It is only a hint, so the engine can ignore it if it wants to.

✨ support FLATTEN

Work out how best to implement FLATTEN functionality, to flatten a struct/dict into separate columns.

https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.flatten

It appears to want to work on a table. Perhaps get the struct column, remove it from the source table, apply flatten to the column (now a table), and then add the columns of the flattened table to the original table?
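The remove, flatten and re-attach steps above can be sketched without pyarrow. This is a minimal illustration only: the table is modelled as a plain dict of column name to list of values, and the `flatten_column` helper and the `column.key` naming of the new columns are assumptions, not Opteryx code.

```python
def flatten_column(table: dict, column: str) -> dict:
    """Flatten one struct (dict) column of a dict-of-lists table.

    Hypothetical helper: `table` maps column names to equal-length lists.
    """
    structs = table[column]

    # Collect the struct keys in first-seen order; null structs add nothing.
    keys = []
    for item in structs:
        for key in (item or {}):
            if key not in keys:
                keys.append(key)

    # Step 1: remove the struct column from the source table.
    flattened = {name: values for name, values in table.items() if name != column}

    # Steps 2 and 3: build one column per key and attach it to the table.
    for key in keys:
        flattened[f"{column}.{key}"] = [(item or {}).get(key) for item in structs]
    return flattened


astronauts = {
    "name": ["A", "B"],
    "birth_place": [{"town": "X", "state": "OH"}, None],
}
flat = flatten_column(astronauts, "birth_place")
```

Rows whose struct is null, or which lack a key, get a null (None) in the new column so every column keeps the same length.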

WC 07-FEB

SQL HAVING
SQL ORDER BY
$DATE Filters - not ready to start

WC 28-FEB

Sample dataset with lists and structs

[FEATURE] (sources) Execution plan DAG

The current DAG allows for an arbitrary set of unordered incoming edges, which means the planner can't include planning order or left/right nodes. This should be rewritten so that the incoming edges are ordered (but not all nodes need to care about the order).

Where Query planners are currently passed to nodes, these should be relations to other nodes.

Node execution starts at the final node and pulls chunks through the DAG.
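A minimal sketch of what ordered incoming edges and pull-based execution could look like. All class names here (`PlanNode`, `ScanNode`, `CrossJoinNode`, `LimitNode`) are hypothetical and are not taken from the Opteryx code base; chunks are modelled as plain Python lists.

```python
class PlanNode:
    """A plan node that holds its incoming edges as an ordered list."""

    def __init__(self, name, producers=None):
        self.name = name
        # Order is significant: index 0 is the left input, index 1 the right.
        self.producers = producers or []

    def execute(self):
        raise NotImplementedError


class ScanNode(PlanNode):
    """Leaf node that yields pre-built chunks."""

    def __init__(self, name, chunks):
        super().__init__(name)
        self.chunks = chunks

    def execute(self):
        yield from self.chunks


class CrossJoinNode(PlanNode):
    """A node that cares about edge order: left is streamed, right is buffered."""

    def execute(self):
        left, right = self.producers
        right_rows = [row for chunk in right.execute() for row in chunk]
        for chunk in left.execute():
            yield [(lhs, rhs) for lhs in chunk for rhs in right_rows]


class LimitNode(PlanNode):
    """A node with a single input; it stops pulling once satisfied."""

    def __init__(self, name, producers, limit):
        super().__init__(name, producers)
        self.limit = limit

    def execute(self):
        remaining = self.limit
        for chunk in self.producers[0].execute():
            if remaining <= 0:
                return
            yield chunk[:remaining]
            remaining -= len(chunk)


left = ScanNode("left", [[1, 2], [3]])
right = ScanNode("right", [["a", "b"]])
plan = LimitNode("limit", [CrossJoinNode("join", [left, right])], limit=3)
result = list(plan.execute())
```

Calling `execute()` on the final node is what drives the plan: each node is a generator that pulls chunks from its producers only when asked, so nodes refer to other nodes rather than to a planner.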

[FEATURE] Planner should plan the reads rather than the Reader

The Planner should plan the reads, including which portions to read and which selections and projections to push down to the reader operator.

This should then be described in JSON (or similar), so it can be hashed and used to persist partial reads to enable caching of 'results'
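As a rough sketch of the hashing idea, a read plan held as a dict can be serialised to canonical JSON and hashed to give a stable cache key. The field names in the example plan (`dataset`, `partitions`, `projection`, `selection`) and the `plan_fingerprint` helper are assumptions for illustration, not an existing format.

```python
import hashlib
import json


def plan_fingerprint(read_plan: dict) -> str:
    """Return a stable hex digest for a JSON-serialisable read plan."""
    # sort_keys and fixed separators make the JSON text independent of
    # dict insertion order, so equal plans always hash the same.
    canonical = json.dumps(read_plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical read plan: which portions to read plus the pushed-down
# projection and selection.
plan_a = {
    "dataset": "$satellites",
    "partitions": ["by_username/username=green"],
    "projection": ["id", "name"],
    "selection": [["id", ">", 10]],
}
# The same plan with the keys supplied in a different order.
plan_b = {
    "selection": [["id", ">", 10]],
    "projection": ["id", "name"],
    "partitions": ["by_username/username=green"],
    "dataset": "$satellites",
}

key_a = plan_fingerprint(plan_a)
key_b = plan_fingerprint(plan_b)
```

The digest could then be used as the key under which a partial read is persisted. Note that list order still matters here, so a planner using this approach would need to emit projections and selections in a consistent order.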

[FEATURE] ephemeral tables (VALUES)

SELECT *
FROM (VALUES (1,2),(3,4),(340,455)) AS t(a,b)

Allow tables to be defined in the SQL which don't need to be backed by an on-disk dataset.

This will be useful for small lookups, e.g. to order by an ordinal field like HIGH, MEDIUM, LOW, or to unify nomenclature like TRUE, YES, ON.

[FEATURE] Review using gandiva

https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/

[FEATURE] JOINS

[BUG] RANDOM should be between 0 and 1, not 0 and 1000

[FEATURE] evaluation node should evaluate entries in the WHERE clause

The Selection node contains a chunk of code from the Evaluation node because the functions in the WHERE are not evaluated ahead of this node.

[BUG] EXPLAIN fails for subqueries

EXPLAIN 
SELECT * FROM (SELECT * FROM $astronauts)

fails with this error: __repr__ returned non-string (type QueryPlanner)

[BUG] Columns aliases are not carried through the aggregation node

When aggregation nodes rebuild the metadata, they don't recreate declared aliases (AS) for columns, only the implied aliases (e.g. table_name.column_name).

[FEATURE] use AST to format code

Use an AST to parse the SQL and color-code it.

Shift-Alt-F to format the text.

✨ Rewrite AggregationNode to use pyarrow.table group_by

Test performance first; initial comparison testing had the existing code running faster.

[FEATURE] Reader should support mabel by_ segments (basic support)

[BUG] WHERE statement with a new Function reports a NoneType column

To replicate:

SELECT * FROM $astronauts
WHERE VARCHAR(Group) = '10'

Type mismatch, unable to compare OTHER (NoneType) with VARCHAR

Whilst NoneType is technically correct (the function column doesn't exist, so it is all nulls), this is not expected by or helpful to users.

[BUG] Conversion to list of Dicts is too slow for practical use

[FEATURE] Date Filters

Filter by dates, use the virtual date filters

[FEATURE] nested functions

Allow functions to be called on the result of other functions

[FEATURE] Refactor JOIN code for performance

CROSS JOIN
LEFT JOIN
INNER JOIN

[FEATURE] Improve Performance of ORDER BY RANDOM()

This currently works by assigning a random number to each row and then sorting by the new column. What would probably be faster is creating a list of the row indices (0 to n-1, where n is the number of rows), shuffling this list, and then using .take() to get the shuffled rows (in chunks).
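A minimal sketch of the shuffle-then-take idea, using only the standard library. Rows are held in a plain list and the list comprehension stands in for a `.take()` call on a table; the `shuffled_chunks` name and its parameters are hypothetical.

```python
import random


def shuffled_chunks(rows, chunk_size, seed=None):
    """Yield the rows in random order, chunk_size rows at a time."""
    rng = random.Random(seed)

    # Build the list of row indices and shuffle it; no per-row random
    # column and no sort are needed.
    indices = list(range(len(rows)))
    rng.shuffle(indices)

    for start in range(0, len(indices), chunk_size):
        batch = indices[start:start + chunk_size]
        # Stand-in for table.take(batch) on a columnar table.
        yield [rows[i] for i in batch]


rows = ["a", "b", "c", "d", "e"]
chunks = list(shuffled_chunks(rows, chunk_size=2, seed=42))
```

Shuffling the index list is linear in the number of rows, whereas sorting by a random column is O(n log n), which is why this may be faster; that would still need to be measured.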

[BUG] LIMIT 0 does not limit results

[FEATURE] Review using Cylon

https://cylondata.org/docs/python/

This may have no benefit if not distributed. Where a device has 2 or more CPUs, is it practical and is it useful?

[FEATURE] Go implementations

https://github.com/axiomhq/hyperloglog
https://github.com/plar/go-adaptive-radix-tree

✨ support UNION statements

UNION will likely require changes to the parser, as it may need to dynamically get the statement being worked on from a list rather than being hard-coded to 0.

[FEATURE] standardize temporal queries

https://www.sqlshack.com/temporal-tables-in-sql-server/

The FOR SYSTEM_TIME clause has many variations and options. It is further classified into four temporal sub-clauses. This provides a way to query the data across current and history tables.

AS OF
FROM TO
BETWEEN AND
CONTAINED IN ( , )
ALL

The AS OF clause is used when there is a need to rebuild the original state of the data and need to know the state it was at any specific time in the past. This is possible by specifying the date time as its input.

[FEATURE] Subqueries

Support for table subqueries is a MUST

SELECT results.account
FROM (SELECT * FROM Players) AS results;

Support for subqueries in the SELECT list is a could

SELECT account, (SELECT mascot FROM Guilds WHERE Players.guild = id) AS player_mascot
FROM Players;

Support for subqueries in IN is a should

SELECT account
FROM users
WHERE account IN (SELECT account FROM billing WHERE paid = true)

[FEATURE] table aliases

FROM table AS name

Need to be able to SELECT, WHERE, ORDER and GROUP on the columns using either the aliases or the original names.

Projection should result in unambiguous names, but not by just adding the prefixes to the column names.

[BUG] Selections between different types return zero records, not fails

This returns records:

SELECT * FROM $astronauts
WHERE Group = 10

but this returns no records:

SELECT * FROM $astronauts
WHERE Group = '10'

Rather than return zero records, it should report a type mismatch.
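One way to picture the expected behaviour is a comparison that checks the types before comparing values. This is only a sketch over rows held as dicts; `TypeMismatchError` and `select_equals` are hypothetical names, and the real engine compares columns rather than individual rows.

```python
class TypeMismatchError(Exception):
    """Raised when a column value and a literal have different types."""


def select_equals(rows, column, literal):
    """Return rows where row[column] == literal, failing on a type mismatch."""
    matches = []
    for row in rows:
        value = row[column]
        # Nulls never match, but they are not a type error either.
        if value is None:
            continue
        if type(value) is not type(literal):
            raise TypeMismatchError(
                f"unable to compare {type(value).__name__} "
                f"with {type(literal).__name__} for column '{column}'"
            )
        if value == literal:
            matches.append(row)
    return matches


# Made-up sample rows standing in for $astronauts.
astronauts = [
    {"Name": "A", "Group": 10},
    {"Name": "B", "Group": 7},
    {"Name": "C", "Group": None},
]
group_ten = select_equals(astronauts, "Group", 10)
```

With this check, `select_equals(astronauts, "Group", "10")` raises `TypeMismatchError` instead of silently returning an empty list.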

[FEATURE] (sources) scan planning

Move the scan planning to the init for the scanner.

This will allow us to access the expected rows and other statistics when optimizing.

We can then use the expected rows to determine which dataset to hold in memory when joining.

This should be the information returned by DESCRIBE dataset which means changes to the writer to record more information.

Per column

Name
Type
Nulls?
Unique
Min
Max
Description

The Planner should plan the reads, including which portions to read and which selections and projections to push down to the reader operator.

This should then be described in JSON (or similar), so it can be hashed and used to persist partial reads to enable caching of 'results'

[FEATURE] Documentation

Some documentation has typos, is incomplete or incorrect.

Documentation is in different formats and isn't in the desired structure

[FEATURE] infix notation

Allow for infix notation, e.g.

SELECT first_name || last_name AS name

And

WHERE age - 18 = 0

WC 21-FEB

column aliases
date functions

[FEATURE] Split query planner and query plan into different modules

Split the planner into functional parts:

the planning
the plan
the execution

Each of these new components will be smaller and simpler.

[BUG] GROUP BY should split by the column, even if it's not in the projection

[FEATURE] Support ORDER BY RANDOM()

✨ Support SHOW TABLE(STATUS)

Table information

Size
File type

https://dev.mysql.com/doc/refman/8.0/en/show-tables.html

[FEATURE] struct and list access

SELECT Birth_Place['state'] FROM $astronauts

"projection": [
  {
    "UnnamedExpr": {
      "MapAccess": {
        "column": {
          "Identifier": {
            "value": "Birth_Place",
            "quote_style": null
          }
        },
        "keys": [
          {
            "Value": {
              "SingleQuotedString": "state"
            }
          }
        ]
      }
    }
  }
]

SELECT Misions[0] FROM $astronauts

"MapAccess": {
  "column": {
    "Identifier": {
      "value": "Misions",
      "quote_style": null
    }
  },
  "keys": [
    {
      "Value": {
        "Number": [
          "0",
          false
        ]
      }
    }

[FEATURE] Temporal clauses should be localized

FOR clauses should be specific to the context, for example

SELECT event
FROM log
FOR TODAY
WHERE event NOT IN 
   (
    SELECT event
    FROM log
    FOR YESTERDAY
    )

This will need an alternate implementation to the RegEx based one.

[BUG] Aliases should be the preferred name for columns

[FEATURE] non SELECT queries

> show tables;
+---------------+--------------------+------------+------------+
| table_catalog | table_schema       | table_name | table_type |
+---------------+--------------------+------------+------------+
| datafusion    | public             | t          | BASE TABLE |
| datafusion    | information_schema | tables     | VIEW       |
+---------------+--------------------+------------+------------+

> show columns from t;
+---------------+--------------+------------+-------------+-----------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type | is_nullable |
+---------------+--------------+------------+-------------+-----------+-------------+
| datafusion    | public       | t          | a           | Int32     | NO          |
| datafusion    | public       | t          | b           | Utf8      | NO          |
| datafusion    | public       | t          | c           | Float32   | NO          |
+---------------+--------------+------------+-------------+-----------+-------------+

✨ Support for CTEs (with statements)

WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)
SELECT a, b FROM x;

Initial design

Execute first, then keep the result in memory. Include in the subsequent queries as a LITERAL_TABLE.

This would probably work with a new reader (literal_tabel_reader), which would also need to be added to the factory.

[FEATURE] Use inbuilt AST features

NOT DONE =>

    /// SUBSTRING(<expr> [FROM <expr>] [FOR <expr>])
    Substring {
        expr: Box<Expr>,
        substring_from: Option<Box<Expr>>,
        substring_for: Option<Box<Expr>>,
    },

SELECT TRIM('#! ' FROM '    #SQL Tutorial!    ') AS TrimmedString;
    /// TRIM([BOTH | LEADING | TRAILING] <expr> [FROM <expr>])\
    /// Or\
    /// TRIM(<expr>)
    Trim {
        expr: Box<Expr>,
        // ([BOTH | LEADING | TRAILING], <expr>)
        trim_where: Option<(TrimWhereField, Box<Expr>)>,
    },

DONE =>

    /// EXTRACT(DateTimeField FROM <expr>)
    Extract {
        field: DateTimeField,
        expr: Box<Expr>,
    },

    /// TRY_CAST an expression to a different data type e.g. `TRY_CAST(foo AS VARCHAR(123))`
    //  this differs from CAST in the choice of how to implement invalid conversions
    TryCast {
        expr: Box<Expr>,
        data_type: DataType,
    },

[FEATURE] Put column descriptions and cardinality information into the metadata - and into the scan node

Having cardinality information will help with query planning, for example when deciding between a hash or a tree for the accumulation, as each performs better at a different end of the spectrum.

[FEATURE] Support MySQL style partitions

SELECT * FROM employees PARTITION (p1);

https://dev.mysql.com/doc/refman/5.7/en/partitioning-selection.html

This isn't supported by the AST, so will need a custom interpreter like the temporal queries.

✨ support CASE statements

To have actual utility, the SQL engine needs some code constructs. CASE statements should be convertible to a UDF which we can run in the eval step.

[FEATURE] support `CROSS JOIN` and `UNNEST`

This allows columns that are arrays, like a list of references, to be filtered on like this:

SELECT * FROM findings CROSS JOIN UNNEST(cves) as cve WHERE cve = 'CVE-2017-0144'

or implied CROSS JOIN like this:

SELECT * FROM findings, UNNEST(cves) as cve WHERE cve = 'CVE-2017-0144'

see:
https://medium.com/firebase-developers/using-the-unnest-function-in-bigquery-to-analyze-event-parameters-in-analytics-fb828f890b42

SELECT * FROM UNNEST(['foo', 'bar', 'baz', 'qux', 'corge', 'garply', 'waldo', 'fred']) AS element
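The semantics of CROSS JOIN UNNEST can be sketched in a few lines: each row is repeated once per element of its list column, with the element exposed under the alias. The `cross_join_unnest` helper and the sample findings below are hypothetical and only illustrate the behaviour, not how the engine would implement it.

```python
def cross_join_unnest(rows, list_column, alias):
    """Yield one output row per element of row[list_column]."""
    for row in rows:
        # Rows with an empty or null list produce no output rows.
        for element in row.get(list_column) or []:
            yield {**row, alias: element}


# Made-up sample data; the second CVE identifier is a placeholder.
findings = [
    {"id": 1, "cves": ["CVE-2017-0144", "CVE-0000-0000"]},
    {"id": 2, "cves": []},
    {"id": 3, "cves": ["CVE-2017-0144"]},
]

# Equivalent of:
#   SELECT * FROM findings CROSS JOIN UNNEST(cves) AS cve
#   WHERE cve = 'CVE-2017-0144'
matched = [
    row
    for row in cross_join_unnest(findings, "cves", "cve")
    if row["cve"] == "CVE-2017-0144"
]
matched_ids = [row["id"] for row in matched]
```

Because the function is a generator, rows are expanded lazily, which matters when a list column is long.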

[FEATURE] Memcached middleware

Include hit/miss counts in the statistics.

[FEATURE] Trino compatible API

test client: https://github.com/trinodb/trino-python-client
API documentation: https://trino.io/docs/current/develop/client-protocol.html
