mabel-dev / opteryx

🦖 A SQL-on-everything Query Engine that you can execute over multiple databases and file formats. Query your data, where it lives.

Home Page: https://opteryx.dev

License: Apache License 2.0

Python 96.94% Rust 0.34% Makefile 0.07% Cython 2.65%
python sql analytics big-data serverless sql-engine arrow cloud-run cloud-native query-engine

opteryx's Introduction

Opteryx

Query your data, where it lives.

A unified SQL interface to unlock insights across your diverse data sources, from blob stores to databases: effortless cross-platform data analytics.

What is Opteryx?

Opteryx champions the SQL-on-everything approach, streamlining cross-platform data analytics by federating SQL queries across diverse data sources, including database systems like Postgres and datalake file formats like Parquet. The goal is to enhance your data analytics process by offering a unified way to access data from across your organization.

Opteryx is a Python library that combines elements of in-process database engines like SQLite and DuckDB with federative features found in systems like Presto and Trino. The result is a versatile tool for querying data across multiple data sources in a seamless fashion.

Opteryx offers the following features:

  • SQL queries on data files generated by other processes, such as logs
  • A command-line tool for filtering, transforming, and combining files
  • Integration with familiar tools like pandas and Polars
  • Embeddable as a low-cost engine, enabling portability and allowing for hundreds of analysts to leverage ad hoc databases with ease
  • Unified and federated access to data on disk, in the cloud, and in on-premises databases, not only through the same interface but in the same query

How Does it Work?

Opteryx processes queries by first determining the appropriate language to interact with various downstream data platforms. It translates your query into SQL, CQL, or a suitable query format for document stores like MongoDB, depending on the data source. This allows Opteryx to efficiently retrieve the necessary data from systems such as MySQL or MongoDB to respond to your query.
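
As a minimal sketch of what this looks like in practice, the snippet below registers two hypothetical backends (a Postgres database and a Cloud Storage bucket) behind the same SQL interface; the store names and connection string are placeholders, not real endpoints.

# Import the Opteryx query engine and two of its connectors.
import opteryx
from opteryx.connectors import SqlConnector, GcpCloudStorageConnector

# Register a relational backend under the 'pg' prefix.
# The connection string is a placeholder for illustration only.
opteryx.register_store(
    prefix="pg",
    connector=SqlConnector,
    remove_prefix=True,
    connection="postgresql+psycopg2://user:password@host/analytics"
)

# Register a (hypothetical) Cloud Storage bucket under the 'lake' prefix.
opteryx.register_store("lake", GcpCloudStorageConnector)

# The same SQL dialect is used for both stores; Opteryx translates and
# pushes work down to each backend where it can.
orders = opteryx.query("SELECT * FROM pg.orders LIMIT 5;")
events = opteryx.query("SELECT * FROM lake.events LIMIT 5;")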


Why Use Opteryx?

Familiar Interface

Opteryx supports key parts of the Python DBAPI and SQL-92 standards, which many analysts and engineers will already know how to use.
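
A minimal sketch of the DBAPI-style usage, assuming the connection/cursor interface over the built-in $astronauts sample dataset:

# Import the Opteryx query engine.
import opteryx

# Open a connection and cursor in the familiar DBAPI style.
conn = opteryx.connect()
cursor = conn.cursor()

# Execute a query against the built-in $astronauts sample dataset.
cursor.execute("SELECT name FROM $astronauts LIMIT 5;")

# Fetch the results as rows.
rows = cursor.fetchall()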

Consistent Syntax

Opteryx creates a common SQL layer over multiple data platforms, allowing backend systems to be upgraded, migrated, or consolidated without changing any Opteryx code.

Where possible, errors and warnings returned by Opteryx help the user to understand how to fix their statement to reduce time-to-success for even novice SQL users.

Consumption-Based Billing Friendly

Opteryx is well suited to deployment in pay-as-you-use environments, like Google Cloud Run. It is great for situations where you have low-volume usage or multiple environments, where the costs of many traditional database deployments can quickly add up.

Python Ecosystem

Opteryx is open-source Python; it quickly and easily integrates into Python code, including Jupyter Notebooks, so you can start querying your data within minutes. Opteryx integrates with many of your favorite Python data tools: you can use Opteryx to run SQL against pandas and Polars DataFrames, and even execute a JOIN between an in-memory DataFrame and a remote SQL dataset.
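
As a minimal sketch of mixing an in-memory DataFrame with another dataset in one query, the snippet below registers a small hypothetical pandas DataFrame and joins it to the space_missions.parquet file used in the examples further down; it will not run unless that file exists.

# Required imports.
import opteryx
import pandas

# A small, hypothetical in-memory DataFrame of missions we care about.
crew_df = pandas.DataFrame({"mission_name": ["Sputnik-1", "Explorer 1"]})

# Register the DataFrame so it can be referenced in SQL as 'crew'.
opteryx.register_df("crew", crew_df)

# Join the in-memory DataFrame to a Parquet file in a single query.
result = opteryx.query(
    "SELECT sm.Mission, sm.Launched_at "
    "FROM crew INNER JOIN 'space_missions.parquet' AS sm "
    "ON crew.mission_name = sm.Mission;"
)
result.head()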

Time Travel

Designed for data analytics in environments where decisions need to be replayable, Opteryx allows you to query data as it was at a point in time in the past, so you can replay decision algorithms against the facts as they were known then. You can even join a table to a historic version of itself, which is great for finding deltas in datasets over time. (Data must be structured to enable temporal queries.)
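
A minimal sketch of the temporal syntax, assuming a hypothetical 'log' dataset that is structured for temporal queries:

# Import the Opteryx query engine.
import opteryx

# Count events as they were known yesterday and as they are known today.
# 'log' is a hypothetical, temporally partitioned dataset.
yesterday = opteryx.query("SELECT COUNT(*) FROM log FOR YESTERDAY;")
today = opteryx.query("SELECT COUNT(*) FROM log FOR TODAY;")

# Compare the two results to find the delta between the snapshots.
yesterday.head()
today.head()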

Fast

Benchmarked on an M2 Pro Mac, Opteryx runs an ad hoc GROUP BY over a 6 million row Parquet file via the CLI in roughly a quarter of a second from a cold start (no caching, no predefined schema). (Different systems will have different performance characteristics.)

Instant Elasticity

Designed to run in Knative and similar environments like Google Cloud Run, Opteryx can scale down to zero, and scale up to respond to thousands of concurrent queries within seconds.

Bring your own Data


Opteryx supports multiple query engines, dataframe APIs, and storage formats. You can mix and match sources in a single query. Opteryx can even JOIN datasets stored in different formats on different platforms in the same query, such as Parquet and MySQL.

Opteryx allows you to query your data directly in the systems where they are stored, eliminating the need to duplicate data into a common store for analytics. This saves you the cost and effort of maintaining duplicates.

Opteryx can push parts of your query to the source query engine, allowing queries to run at the speed of the backend, rather than your local computer.
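
A minimal sketch of a cross-platform JOIN, assuming a hypothetical MySQL database registered under the 'sql' prefix and a local Parquet file; the connection string, file, and table names are placeholders.

# Import the Opteryx query engine and its SQL connector.
import opteryx
from opteryx.connectors import SqlConnector

# Register a (hypothetical) MySQL database under the 'sql' prefix.
opteryx.register_store(
    prefix="sql",
    connector=SqlConnector,
    remove_prefix=True,
    connection="mysql+pymysql://user:password@host/warehouse"
)

# JOIN a Parquet file with a table in the MySQL database in one query.
# The parts of the query touching the MySQL table can be pushed down
# to the database engine rather than running locally.
result = opteryx.query(
    "SELECT o.order_id, c.region "
    "FROM 'orders.parquet' AS o "
    "INNER JOIN sql.customers AS c ON o.customer_id = c.customer_id;"
)
result.head()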

And if there isn't a connector in the box for your data platform, bespoke connectors can be added.

Install

Installing from PyPI is recommended.

pip install opteryx

To build Opteryx from source, refer to the contribution guides.

Opteryx installs with the small set of libraries it needs for core functionality, such as NumPy, PyArrow, and orjson. Some features require additional libraries; you are notified of these as they are needed.

Examples

Filter a Dataset on the Command Line
Execute a Simple Query in Python
Execute SQL on a pandas DataFrame
Query Data on Local Disk
Query Data on GCS
Query Data in SQLite

Try Opteryx now using our interactive labs on Binder.


Filter a Dataset on the Command Line

In this example, we are running Opteryx from the command line to filter one of the internal example datasets and display the results on the console.

python -m opteryx "SELECT * FROM \$astronauts WHERE 'Apollo 11' IN UNNEST(missions);"

this example is complete and should run as-is

Execute a Simple Query in Python

In this example, we are showing the basic usage of the Python API by executing a simple query that makes no references to any datasets.

# Import the Opteryx SQL query engine library.
import opteryx

# Execute a SQL query to evaluate the expression 4 * 7.
# The result is stored in the 'result' variable.
result = opteryx.query("SELECT 4 * 7;")

# Display the first row(s) of the result to verify the query executed correctly.
result.head()
 ID   4 * 7
  1      28

this example is complete and should run as-is

Execute SQL on a pandas DataFrame

In this example, we are running a SQL statement on a pandas DataFrame and returning the result as a new pandas DataFrame.

# Required imports
import opteryx
import pandas

# Read data from the exoplanets.csv file hosted on Google Cloud Storage
# The resulting DataFrame is stored in the variable `pandas_df`.
pandas_df = pandas.read_csv("https://storage.googleapis.com/opteryx/exoplanets/exoplanets.csv")

# Register the pandas DataFrame with Opteryx under the alias "exoplanets"
# This makes the DataFrame available for SQL-like queries.
opteryx.register_df("exoplanets", pandas_df)

# Perform an SQL query to group the data by `koi_disposition` and count the number
# of occurrences of each distinct `koi_disposition`.
# The result is stored in `aggregated_df`.
aggregated_df = opteryx.query("SELECT koi_disposition, COUNT(*) FROM exoplanets GROUP BY koi_disposition;").pandas()

# Display the aggregated DataFrame to get a preview of the result.
aggregated_df.head()
  koi_disposition  COUNT(*)
0       CONFIRMED      2293
1  FALSE POSITIVE      5023
2       CANDIDATE      2248 

this example is complete and should run as-is

Query Data on Local Disk

In this example, we are querying and filtering a file directly. This example will not run as written because the file being queried does not exist.

# Import the Opteryx query engine.
import opteryx

# Execute a SQL query to select the first 5 rows from the 'space_missions.parquet' table.
# The result will be stored in the 'result' variable.
result = opteryx.query("SELECT * FROM 'space_missions.parquet' LIMIT 5;")

# Display the result.
# This is useful for quick inspection of the data.
result.head()
ID Company Location Price Launched_at Rocket Rocket_Status Mission Mission_Status
0 RVSN USSR Site 1/5, Baikonur Cosmodrome, null 1957-10-04 19:28:00 Sputnik 8K71PS Retired Sputnik-1 Success
1 RVSN USSR Site 1/5, Baikonur Cosmodrome, null 1957-11-03 02:30:00 Sputnik 8K71PS Retired Sputnik-2 Success
2 US Navy LC-18A, Cape Canaveral AFS, Fl null 1957-12-06 16:44:00 Vanguard Retired Vanguard TV3 Failure
3 AMBA LC-26A, Cape Canaveral AFS, Fl null 1958-02-01 03:48:00 Juno I Retired Explorer 1 Success
4 US Navy LC-18A, Cape Canaveral AFS, Fl null 1958-02-05 07:33:00 Vanguard Retired Vanguard TV3BU Failure

this example requires a data file, space_missions.parquet.

Query Data in SQLite

In this example, we are querying a SQLite database via Opteryx. This example will not run as written because the file being queried does not exist.

# Import the Opteryx query engine and the SqlConnector from its connectors module.
import opteryx
from opteryx.connectors import SqlConnector

# Register a new data store with the prefix "sql", specifying the SQL Connector to handle it.
# This allows queries with the 'sql' prefix to be routed to the appropriate SQL database.
opteryx.register_store(
   prefix="sql",  # Prefix for distinguishing this particular store
   connector=SqlConnector,  # Specify the connector to handle queries for this store
   remove_prefix=True,  # Remove the prefix from the table name when querying SQLite
   connection="sqlite:///database.db"  # SQLAlchemy connection string for the SQLite database
)

# Execute a SQL query to select specified columns from the 'planets' table in the SQL store,
# limiting the output to 5 rows. The result is stored in the 'result' variable.
result = opteryx.query("SELECT name, mass, diameter, density FROM sql.planets LIMIT 5;")

# Display the result.
# This is useful for quickly verifying that the query executed correctly.
result.head()
ID name mass diameter density
1 Mercury 0.33 4879 5427
2 Venus 4.87 12104 5243
3 Earth 5.97 12756 5514
4 Mars 0.642 6792 3933
5 Jupiter 1898.0 142984 1326

this example requires a data file, database.db.

Query Data on GCS

In this example, we are querying a dataset on GCS in a public bucket called 'opteryx'.

# Import the Opteryx query engine and the GcpCloudStorageConnector from its connectors module.
import opteryx
from opteryx.connectors import GcpCloudStorageConnector

# Register a new data store named 'opteryx', specifying the GcpCloudStorageConnector to handle it.
# This allows queries for this particular store to be routed to the appropriate GCP Cloud Storage bucket.
opteryx.register_store(
    "opteryx",  # Name of the store to register
    GcpCloudStorageConnector  # Connector to handle queries for this store
)

# Execute a SQL query to select all columns from the 'space_missions' table located in the 'opteryx' store,
# and limit the output to 5 rows. The result is stored in the 'result' variable.
result = opteryx.query("SELECT * FROM opteryx.space_missions LIMIT 5;")

# Display the result.
# This is useful for quickly verifying that the query executed correctly.
result.head()
ID Company Location Price Launched_at Rocket Rocket_Status Mission Mission_Status
0 RVSN USSR Site 1/5, Baikonur Cosmodrome, null 1957-10-04 19:28:00 Sputnik 8K71PS Retired Sputnik-1 Success
1 RVSN USSR Site 1/5, Baikonur Cosmodrome, null 1957-11-03 02:30:00 Sputnik 8K71PS Retired Sputnik-2 Success
2 US Navy LC-18A, Cape Canaveral AFS, Fl null 1957-12-06 16:44:00 Vanguard Retired Vanguard TV3 Failure
3 AMBA LC-26A, Cape Canaveral AFS, Fl null 1958-02-01 03:48:00 Juno I Retired Explorer 1 Success
4 US Navy LC-18A, Cape Canaveral AFS, Fl null 1958-02-05 07:33:00 Vanguard Retired Vanguard TV3BU Failure

this example is complete and should run as-is

Further Examples

For further examples, check out the interactive labs on Binder.

Community

Join the community on Discord, or follow Opteryx on X.

Get Involved

  • Star this repo
  • Contribute — join us in building Opteryx, through writing code, or inspiring others to use it.
  • Let us know your ideas, how you are using Opteryx, or report a bug or feature request.
  • See the contributor documentation for Opteryx. It's easy to get started, and we're really friendly if you need any help!
  • If you're interested in contributing to the code now, check out GitHub issues. Feel free to ask questions or open a draft PR.

Security


See the project Security Policy for information about reporting vulnerabilities.

License


Opteryx is licensed under Apache 2.0 except where specific modules note otherwise.

Status


Opteryx is in beta. Beta means different things to different people; to us, being beta means:

  • Core functionality has good regression test coverage to help ensure stability
  • Some edge cases may have undetected bugs
  • Performance tuning is incomplete
  • Changes are focused on feature completion, bugs, performance, reducing debt, and security
  • Code structure and APIs are not stable and may change

Related Projects

  • orso DataFrame library
  • mabel Streaming data APIs
  • mesos MySQL connector for Opteryx

opteryx's People

Contributors

cclauss, dependabot[bot], fossabot, gitter-badger, gva-jjoyce, gva-nigelclarke, joocer, xb500


opteryx's Issues

[BUG] Conversion to list of Dicts is too slow for practical use


✨ Support for CTEs (WITH statements)

WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)
SELECT a, b FROM x;

Initial design

Execute the CTE first, then keep the result in memory. Include it in the subsequent queries as a LITERAL_TABLE.

This would probably work with a new reader (literal_table_reader), which would also need to be added to the factory.

[FEATURE] table aliases

  • FROM table AS name

Need to be able to SELECT, WHERE, ORDER, and GROUP on the columns using either the aliases or the original names.

Projection should result in unambiguous names, but not simply by adding the prefixes to the column names.

[BUG] GROUP BY should split by the column, even if it's not in the projection


[FEATURE] Temporal clauses should be localized

FOR clauses should be specific to the context, for example

SELECT event
FROM log
FOR TODAY
WHERE event NOT IN 
   (
    SELECT event
    FROM log
    FOR YESTERDAY
    )

This will need an alternate implementation to the RegEx based one.

[FEATURE] Use inbuilt AST features

NOT DONE =>

    /// SUBSTRING(<expr> [FROM <expr>] [FOR <expr>])
    Substring {
        expr: Box<Expr>,
        substring_from: Option<Box<Expr>>,
        substring_for: Option<Box<Expr>>,
    },
SELECT TRIM('#! ' FROM '    #SQL Tutorial!    ') AS TrimmedString;
    /// TRIM([BOTH | LEADING | TRAILING] <expr> [FROM <expr>])\
    /// Or\
    /// TRIM(<expr>)
    Trim {
        expr: Box<Expr>,
        // ([BOTH | LEADING | TRAILING], <expr>)
        trim_where: Option<(TrimWhereField, Box<Expr>)>,
    },

DONE =>

    /// EXTRACT(DateTimeField FROM <expr>)
    Extract {
        field: DateTimeField,
        expr: Box<Expr>,
    },
    /// TRY_CAST an expression to a different data type e.g. `TRY_CAST(foo AS VARCHAR(123))`
    //  this differs from CAST in the choice of how to implement invalid conversions
    TryCast {
        expr: Box<Expr>,
        data_type: DataType,
    },

✨ support CASE statements

In order to have actual utility, the SQL engine needs some code constructs; CASE statements should be convertible to a UDF which we can run in the eval step.
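
A minimal sketch of the kind of CASE statement this asks for, written as a query the engine would accept once the feature lands, assuming the built-in $planets sample dataset:

# Import the Opteryx query engine.
import opteryx

# Classify planets by density using a CASE expression.
# This assumes CASE support as requested in this issue.
result = opteryx.query(
    "SELECT name, "
    "       CASE WHEN density > 3000 THEN 'rocky' ELSE 'gaseous' END AS planet_type "
    "FROM $planets;"
)
result.head()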

WC 28-FEB

  • Sample dataset with lists and structs

[FEATURE] (sources) scan planning

Move the scan planning to the init for the scanner.

This will allow us to access the expected rows and other statistics when optimizing.

We can then use the expected rows to determine which dataset to hold in memory when joining.

This should be the information returned by DESCRIBE dataset which means changes to the writer to record more information.

Per column:

  • Name
  • Type
  • Nulls?
  • Unique
  • Min
  • Max
  • Description

The Planner should plan the reads, including which portions to read and which selections and projections to push down to the reader operator.

This should then be described in JSON (or similar), so it can be hashed and used to persist partial reads to enable caching of 'results'

[FEATURE] Reader should support mabel by_ segments (basic support)


[FEATURE] Subqueries

Support for table subqueries is a MUST

SELECT results.account
FROM (SELECT * FROM Players) AS results;

Support for scalar subqueries in the SELECT clause is a COULD

SELECT account, (SELECT mascot FROM Guilds WHERE Players.guild = id) AS player_mascot
FROM Players;

Support for subqueries in IN is a SHOULD

SELECT account
FROM users
WHERE account IN (SELECT account FROM billing WHERE paid = true)

[FEATURE] standardize temporal queries

https://www.sqlshack.com/temporal-tables-in-sql-server/

The FOR SYSTEM_TIME clause has many variations and options. It is further classified into several temporal sub-clauses. This provides a way to query the data across current and history tables.

AS OF
FROM TO
BETWEEN AND
CONTAINED IN ( , )
ALL

The AS OF clause is used when there is a need to rebuild the original state of the data as it was at a specific time in the past. This is done by specifying the date-time as its input.

[FEATURE] Planner should plan the reads rather than the Reader

The Planner should plan the reads, including which portions to read and which selections and projections to push down to the reader operator.

This should then be described in JSON (or similar), so it can be hashed and used to persist partial reads to enable caching of 'results'

[FEATURE] Support ORDER BY RANDOM()


[FEATURE] Support $PARTITION hints

SELECT * FROM $satellites WHERE $PARTITION.username('green')

This would instruct the engine to try (it can ignore if it wants to) to prune the by_username partitioning to the username=green partition.

WC 07-FEB

  • SQL HAVING
  • SQL ORDER BY
  • $DATE Filters - not ready to start

[BUG] RANDOM should be between 0 and 1, not 0 and 1000


[FEATURE] non SELECT queries

> show tables;
+---------------+--------------------+------------+------------+
| table_catalog | table_schema       | table_name | table_type |
+---------------+--------------------+------------+------------+
| datafusion    | public             | t          | BASE TABLE |
| datafusion    | information_schema | tables     | VIEW       |
+---------------+--------------------+------------+------------+
> show columns from t;
+---------------+--------------+------------+-------------+-----------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type | is_nullable |
+---------------+--------------+------------+-------------+-----------+-------------+
| datafusion    | public       | t          | a           | Int32     | NO          |
| datafusion    | public       | t          | b           | Utf8      | NO          |
| datafusion    | public       | t          | c           | Float32   | NO          |
+---------------+--------------+------------+-------------+-----------+-------------+

[FEATURE] support `CROSS JOIN` and `UNNEST`

This allows columns that are arrays, like a list of references, to be filtered on like this:

SELECT * FROM findings CROSS JOIN UNNEST(cves) AS cve WHERE cve = 'CVE-2017-0144'

or implied CROSS JOIN like this:

SELECT * FROM findings, UNNEST(cves) AS cve WHERE cve = 'CVE-2017-0144'

see:
https://medium.com/firebase-developers/using-the-unnest-function-in-bigquery-to-analyze-event-parameters-in-analytics-fb828f890b42

SELECT * FROM UNNEST(['foo', 'bar', 'baz', 'qux', 'corge', 'garply', 'waldo', 'fred']) AS element

[FEATURE] struct and list access

SELECT Birth_Place['state'] FROM $astronauts

"projection": [
            {
              "UnnamedExpr": {
                "MapAccess": {
                  "column": {
                    "Identifier": {
                      "value": "Birth_Place",
                      "quote_style": null
                    }
                  },
                  "keys": [
                    {
                      "Value": {
                        "SingleQuotedString": "state"
                      }
                    }
                  ]
                }
              }
            }
          ]

SELECT Misions[0] FROM $astronauts

"MapAccess": {
                  "column": {
                    "Identifier": {
                      "value": "Misions",
                      "quote_style": null
                    }
                  },
                  "keys": [
                    {
                      "Value": {
                        "Number": [
                          "0",
                          false
                        ]
                      }
                    }

✨ support UNION statements

UNION will likely require changes to the parser, as it may need to dynamically get the statement being worked on from a list, rather than it being hard-coded to 0.

[FEATURE] JOINS

  • CROSS JOIN
  • JOIN
  • INNER JOIN _ ON
  • INNER JOIN _ USING
  • LEFT JOIN _ ON _

[FEATURE] Documentation

Some documentation has typos, or is incomplete or incorrect.

Documentation is in different formats and isn't in the desired structure.

[FEATURE] (sources) Execution plan DAG

The current DAG allows for an arbitrary set of unordered incoming edges, which means the planner can't include planning order or left/right nodes. This should be rewritten to give the incoming edges an order (although not all nodes need to care).

Where query planners are currently passed to nodes, these should be relations to other nodes.

Node execution starts at the final node and pulls chunks through the DAG.
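
A minimal sketch of pull-based execution over ordered incoming edges, using hypothetical node classes rather than the actual Opteryx operators:

# Hypothetical operator nodes; not the actual Opteryx planner classes.
class ScanNode:
    def __init__(self, chunks):
        self.chunks = chunks

    def execute(self):
        # Leaf node: yield chunks of rows.
        yield from self.chunks


class FilterNode:
    def __init__(self, child, predicate):
        self.child = child          # a single, ordered incoming edge
        self.predicate = predicate

    def execute(self):
        # Pull chunks from the child node and emit filtered chunks.
        for chunk in self.child.execute():
            yield [row for row in chunk if self.predicate(row)]


# Execution starts at the final node and pulls chunks through the DAG.
plan = FilterNode(ScanNode([[1, 2, 3], [4, 5, 6]]), lambda value: value % 2 == 0)
for chunk in plan.execute():
    print(chunk)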

[FEATURE] ephemeral tables (VALUES)

SELECT *
FROM (VALUES (1,2),(3,4),(340,455)) AS t(a,b)

Allow tables to be defined in the SQL which don't need to be backed by an on-disk dataset.

This will be useful for small lookups, e.g. to order by an ordinal field like HIGH, MEDIUM, LOW, or to unify nomenclature like TRUE, YES, ON.

[FEATURE] Improve Performance of ORDER BY RANDOM()

This currently works by assigning a random number to each row and then sorting by the new column. What would probably be faster is creating a 0..n list, where n is the number of rows, shuffling this list, and then using .take() to get the shuffled rows (in chunks), as sketched below.
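
A minimal sketch of the shuffle-and-take approach on an in-memory pyarrow Table; this illustrates the idea rather than the current Opteryx implementation:

# Required imports.
import numpy
import pyarrow

# A small example table standing in for the relation being sorted.
table = pyarrow.table({"value": list(range(10))})

# Build a 0..n index, shuffle it, then take the rows in the shuffled order.
# In practice the take could be performed in chunks to bound memory use.
indices = numpy.random.permutation(table.num_rows)
shuffled = table.take(indices)

print(shuffled.to_pydict())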
