andygrove / datafusion-archive
DataFusion has now been donated to the Apache Arrow project
Home Page: https://arrow.apache.org/
License: Apache License 2.0
We need an integration test that runs as part of the CI build, covering the standard aggregate functions, e.g. MIN, MAX, COUNT, SUM, AVG.
Instead of interpreting the plan and expressions at runtime, the executor should generate closures from the plan and these closures can be called per row. This should be faster.
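Here is a minimal sketch of that idea, using an illustrative expression type rather than DataFusion's actual plan structures: the tree is walked once to build nested closures, so the per-row call involves no match on the expression.

enum Expr {
    Column(usize),
    Literal(f64),
    Add(Box<Expr>, Box<Expr>),
}

type Row = Vec<f64>;
type CompiledExpr = Box<dyn Fn(&Row) -> f64>;

// Walk the tree once, producing a closure that can be called per row.
fn compile(expr: &Expr) -> CompiledExpr {
    match expr {
        Expr::Column(i) => {
            let i = *i;
            Box::new(move |row| row[i])
        }
        Expr::Literal(v) => {
            let v = *v;
            Box::new(move |_| v)
        }
        Expr::Add(l, r) => {
            let l = compile(l);
            let r = compile(r);
            Box::new(move |row| l(row) + r(row))
        }
    }
}

fn main() {
    // (col0 + 1.5), compiled once and then evaluated for each row
    let expr = Expr::Add(Box::new(Expr::Column(0)), Box::new(Expr::Literal(1.5)));
    let eval = compile(&expr);
    for row in &[vec![1.0], vec![2.0]] {
        println!("{}", eval(row));
    }
}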
Invalid but parsed successfully:
SELECT CREATE EXTERNAL TABLE test (id VARCHAR(1)) FROM foo WHERE 1=2
SELECT 1 FROM foo WHERE CREATE EXTERNAL TABLE test (id VARCHAR(1))
Valid (debatable if useful) but not accepted:
CREATE EXTERNAL TABLE test ()
I'm trying to fix the parser, but it seems to be non-trivial. Is there a reason why you wrote your own parser and didn't use some parser combinator library like nom or combine?
This should be fairly simple thanks to https://github.com/danburkert/kudu-rs
I wonder if we could use a simple configuration file: a .toml file that defines all the necessary settings. Then, as the project moves forward and needs more configuration, it would be easier to use, and we would not need to pass all the arguments on the command line every time.
For example:
# Here we can put configuration that
# is common to both
[datafusion]
etcd = "http://127.0.0.1:2379"
[worker]
bind = "0.0.0.0:8080"
data_dir = "./path/data_dir"
webroot = "./src/bin/worker"
[console]
# some specific configuration
# to the console
When we start the console or the worker, we could pass a flag that loads the configuration, either from a default location or from an explicit path:
worker -c
worker -c /path/to/config.toml
And likewise for the console
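A minimal sketch of loading such a file, assuming the serde and toml crates; the struct and field names mirror the example above but are otherwise hypothetical:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Config {
    datafusion: DataFusionConfig,
    worker: Option<WorkerConfig>,
}

#[derive(Debug, Deserialize)]
struct DataFusionConfig {
    etcd: String,
}

#[derive(Debug, Deserialize)]
struct WorkerConfig {
    bind: String,
    data_dir: String,
    webroot: String,
}

// Read the file named by the -c flag and parse it into typed config.
fn load_config(path: &str) -> Result<Config, Box<dyn std::error::Error>> {
    let contents = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&contents)?)
}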
Something like this:
https://www.iconfinder.com/icons/360735/atom_discovery_physic_science_icon#size=256
We need a way to register CSV files and define a schema via SQL so we can query CSV from a SQL console without having to write Rust code. The standard way of doing this is with a CREATE EXTERNAL TABLE command that defines the schema and provides the other information required to open the CSV, like the directory path.
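For illustration, the syntax might look something like this (hypothetical; the STORED AS and LOCATION clauses are loosely modeled on Hive's DDL):
CREATE EXTERNAL TABLE test (id INT, name VARCHAR(100)) STORED AS CSV LOCATION '/path/to/data/'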
Here are docs for how Hive does this:
Adding IPC support could allow DataFusion to work alongside existing Spark jobs and allow for a gradual migration from Spark to DataFusion.
Also it could allow non-Rust code to be executed by DataFusion.
The current type system uses an enum as a wrapper around some standard types, but this should probably be re-implemented using generics instead to make it easier to add new types. The system should support all the standard primitive types (bool, int, long, float, double) as well as string, date/time, and binary.
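A hedged sketch of what the generics-based direction could look like (illustrative only, not the actual design): a trait per supported type instead of one central enum, so operations are monomorphized by the compiler with no run-time tag checks.

trait ScalarValue: Clone + std::fmt::Debug {}

impl ScalarValue for bool {}
impl ScalarValue for i32 {}     // int
impl ScalarValue for i64 {}     // long
impl ScalarValue for f32 {}     // float
impl ScalarValue for f64 {}     // double
impl ScalarValue for String {}
impl ScalarValue for Vec<u8> {} // binary

// Operations can then be written once, generically.
fn column_min<T: ScalarValue + PartialOrd>(values: &[T]) -> Option<&T> {
    values.iter().fold(None, |acc, v| match acc {
        Some(m) if m <= v => Some(m),
        _ => Some(v),
    })
}

fn main() {
    let xs = vec![3.0_f64, 1.0, 2.0];
    println!("{:?}", column_min(&xs)); // Some(1.0)
}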
I can see that TravisCI is currently being used, but it might behoove us to move this project to CircleCI.
CircleCI 2.0 gives us a lot of flexibility, and allows us to run things in containers with versions of code that we desire. It also means we don't have to maintain our own build cluster.
Just something to think about. I can take this on if you want; shouldn't take longer than a few hours.
Hi all, any idea how this kind of project maps to something like Apache Beam?
This is some of the oldest code in the repo and I used Copy a lot, but it would be good to clean this up now before too much more functionality is built.
Create a Dockerfile to run the worker process (to try out the worker locally, just run cargo run worker).
A/C: build the release binary (cargo build --release) and create the Docker image.
Currently the project is a binary project with a main method containing some sample code. We should change it to a library and add examples instead.
This library looks like a good place to start:
https://docs.rs/libloading/0.5.0/libloading/
Users should be able to publish crates containing structs that implement appropriate datafusion traits for types and functions and then register the produced libraries with datafusion at runtime.
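A minimal sketch of the loading side, using the libloading 0.5 API; the exported symbol name and registration scheme are hypothetical:

use libloading::{Library, Symbol};

fn register_udf_library(path: &str) -> std::io::Result<()> {
    let lib = Library::new(path)?;
    unsafe {
        // The published crate would export a well-known registration hook.
        let register: Symbol<unsafe extern "C" fn()> = lib.get(b"register_udfs")?;
        register();
    }
    // Keep the library loaded for as long as its functions may be used;
    // a real design would store it in a registry instead of leaking it.
    std::mem::forget(lib);
    Ok(())
}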
For the first benchmark, I'm thinking that a geospatial use case would be good. It would be easy to generate CSV files of varying size containing fake address data and lat/lng coordinates, and then run a job to translate the lat/lngs to quadkeys or a Mercator projection.
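As a hedged sketch of the per-row work such a job would do, here is the usual Bing-Maps-style lat/lng-to-quadkey conversion:

fn quadkey(lat: f64, lng: f64, zoom: u32) -> String {
    use std::f64::consts::PI;
    // Web Mercator projection to normalized tile coordinates.
    let lat = lat.max(-85.05112878).min(85.05112878);
    let x = (lng + 180.0) / 360.0;
    let sin_lat = (lat * PI / 180.0).sin();
    let y = 0.5 - ((1.0 + sin_lat) / (1.0 - sin_lat)).ln() / (4.0 * PI);
    let scale = (1u64 << zoom) as f64;
    let tile_x = ((x * scale) as u64).min((1 << zoom) - 1);
    let tile_y = ((y * scale) as u64).min((1 << zoom) - 1);
    // Interleave the tile coordinate bits into a base-4 digit string.
    (0..zoom).rev().map(|i| {
        let mask = 1u64 << i;
        let mut digit = 0;
        if tile_x & mask != 0 { digit += 1; }
        if tile_y & mask != 0 { digit += 2; }
        std::char::from_digit(digit, 10).unwrap()
    }).collect()
}

fn main() {
    // Seattle-ish coordinates at zoom level 3
    println!("{}", quadkey(47.6, -122.3, 3)); // "021"
}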
References:
In order to stream data between nodes and to efficiently swap data out to disk and store in a native format, it will be necessary to serialize to a binary format.
KX Systems sold probably a billion dollars worth of licenses to financial institutions by optimizing the following operation: http://code.kx.com/q/ref/joins/#aj-aj0-asof-join
To my knowledge, Spark doesn't have an optimized implementation of this, or if it does, I have no idea what they call it at the moment.
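For reference, a hedged sketch of the core idea on timestamp-sorted batches: for each left-side row, pick the most recent right-side row at or before it.

// For each left timestamp, return the index of the most recent right
// row at or before it (None if there is none). Inputs must be sorted.
fn asof_join(left_ts: &[i64], right_ts: &[i64]) -> Vec<Option<usize>> {
    let mut out = Vec::with_capacity(left_ts.len());
    let mut r = 0;
    for &t in left_ts {
        while r < right_ts.len() && right_ts[r] <= t {
            r += 1;
        }
        out.push(if r == 0 { None } else { Some(r - 1) });
    }
    out
}

fn main() {
    // trades at t=3,7 matched against quotes at t=1,5,6
    println!("{:?}", asof_join(&[3, 7], &[1, 5, 6])); // [Some(0), Some(2)]
}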
I looked at the code 2-3 weeks ago when it was first announced with the 2018 big-data-on-Rust blog post. I took a cursory look today. I am going through a similar exercise in Rust, scratching my own itch after working 4+ years in the same domain on JVM/Scala/etc.
I would like this project to succeed so here are some observations that may help in the long run (caveat - I may have missed some of the points in your code).
1. The streaming model is more general and can easily be relaxed to batch processing; the inverse is always hard. On the other hand, batch processing is more efficient than processing event-by-event. In my experience something like micro-batching works best in terms of flexibility and performance.
2. Columnar processing is more efficient on the current crop of hardware, as it plays better with dispatch overhead, cache locality and data prefetching. It is also a natural extension of micro-batching from point 1. My own benchmarks show a 20x difference with batches of 1024 tuples, around 2-3 cycles per arithmetic operation on f64, including SQL NULL correctness.
3. You definitely don't want to dispatch on types in the leaves, e.g. in the function/expression bodies like https://github.com/andygrove/datafusion-rs/blob/master/src/functions/math.rs#L14 - data types are a compile-time concept and should not exist at run-time. You can use generics, type erasure and columnar dispatch to get there. Think about this in the same vein as Rust vs. Ruby: Rust is faster because it does not need to check the run-time type on each operation; the compiler guarantees it.
Instead of a switch-on-types interpreter, where you pay the dispatch cost on each event, you may want to think about expressions in a more functional way, e.g.
fn gteq(lhs: Expr, rhs: Expr) -> Box<Fn(&Tuple) -> Value>
or, if you go with the columnar approach:
fn gteq(lhs: Expr, rhs: Expr) -> Box<Fn(&Frame) -> Column>
and compose the whole computation just once before executing it: following a reference once per 1000 tuples is a lot cheaper than going through a match on each tuple.
You can take a look at a minimalistic example I put together some time ago: https://gist.github.com/luben/95c1c05f36ec56a57f5624c1b40e9f11
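To make the columnar variant concrete, here is a hedged sketch along the same lines; Frame and Column are hypothetical stand-ins, not the project's actual types:

struct Frame {
    columns: Vec<Vec<f64>>,
}
type Column = Vec<bool>;
type CompiledExpr = Box<dyn Fn(&Frame) -> Column>;

// Compose the comparison once; each call then runs a tight loop over a
// whole batch with no per-tuple dispatch.
fn gteq(lhs: usize, rhs: usize) -> CompiledExpr {
    Box::new(move |f| {
        f.columns[lhs]
            .iter()
            .zip(&f.columns[rhs])
            .map(|(l, r)| l >= r)
            .collect()
    })
}

fn main() {
    let frame = Frame { columns: vec![vec![1.0, 3.0], vec![2.0, 2.0]] };
    let expr = gteq(0, 1); // built once
    println!("{:?}", expr(&frame)); // [false, true]
}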
good work, good start, I will learn it and do something.
Implement some basic geospatial functions, drawing inspiration from https://github.com/Esri/spatial-framework-for-hadoop
Datafusion needs to be able to process tuples containing values that are of a user-defined type not known at compile time. Datafusion doesn't need to understand the types too much but needs to be able to perform serde operations and pass them to user-defined functions.
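A hedged sketch of what such a hook might look like; the trait name and method are hypothetical, not DataFusion's actual API:

trait UserDefinedType: std::fmt::Debug {
    // Serialize the value to bytes so the engine can store or ship it
    // without understanding its internals.
    fn to_bytes(&self) -> Vec<u8>;
}

#[derive(Debug)]
struct Point { x: f64, y: f64 }

impl UserDefinedType for Point {
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(16);
        out.extend_from_slice(&self.x.to_le_bytes());
        out.extend_from_slice(&self.y.to_le_bytes());
        out
    }
}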
Need to implement correct error handling, i.e. returning Result in the SQL parser, to avoid panics.
Here's an example:
DataFusion Console
$ SELECT * FROM foo
Executing: SELECT * FROM foo
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParserError("Prefix parser expected a keyword but found Mult")', /checkout/src/libcore/result.rs:906:4
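A minimal sketch of the desired behavior, with an illustrative stub standing in for the real parser: the console matches on the parser's Result instead of calling unwrap().

#[derive(Debug)]
enum ParserError {
    Unexpected(String),
}

// Illustrative stub in place of the real SQL parser.
fn parse_sql(sql: &str) -> Result<String, ParserError> {
    if sql.trim_start().to_uppercase().starts_with("SELECT") {
        Ok(sql.to_string())
    } else {
        Err(ParserError::Unexpected(format!("unsupported statement: {}", sql)))
    }
}

fn main() {
    // The console reports the error and keeps running instead of panicking.
    match parse_sql("CREATE TABLE foo") {
        Ok(plan) => println!("Executing: {}", plan),
        Err(e) => println!("Parse error: {:?}", e),
    }
}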
To match the ad-hoc query capabilities of Spark many users will expect to be able to apply functional transformations to DataFrames using custom Rust code without having to package that code in a crate dependency behind a UDF trait.
Is this even possible with Rust?
Currently the console executes the input when enter is hit. Normal behavior for a SQL console is to accept multi-line statements terminated with a semicolon.
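A hedged sketch of the accumulate-until-semicolon loop (a real console would also need to handle semicolons inside string literals):

use std::io::{self, BufRead, Write};

fn main() {
    let stdin = io::stdin();
    let mut buffer = String::new();
    print!("$ ");
    io::stdout().flush().unwrap();
    for line in stdin.lock().lines() {
        let line = line.unwrap();
        buffer.push_str(&line);
        buffer.push(' ');
        if buffer.trim_end().ends_with(';') {
            // A complete statement has been entered; execute it.
            println!("Executing: {}", buffer.trim());
            buffer.clear();
            print!("$ ");
        } else {
            // Keep accumulating and show a continuation prompt.
            print!("> ");
        }
        io::stdout().flush().unwrap();
    }
}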
The first use case should be sending a job to read a CSV file and partition it into multiple CSV files.
Maybe write a Rust wrapper around this?
It should be possible to use some pre-defined scalar functions in a query. For example:
SELECT sqrt(x) FROM foo
SELECT * FROM foo WHERE sqrt(x) < sqrt(y)
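A hedged sketch of the kind of interface the engine could dispatch such functions through; the trait and types are hypothetical, not the actual API:

trait ScalarFunction {
    fn name(&self) -> &str;
    fn execute(&self, args: &[f64]) -> f64;
}

struct Sqrt;

impl ScalarFunction for Sqrt {
    fn name(&self) -> &str { "sqrt" }
    fn execute(&self, args: &[f64]) -> f64 { args[0].sqrt() }
}

fn main() {
    let f: Box<dyn ScalarFunction> = Box::new(Sqrt);
    // What the engine would do for SELECT sqrt(x) FROM foo, per value of x:
    println!("{} -> {}", f.name(), f.execute(&[9.0])); // sqrt -> 3
}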
Currently the worker terminates if it cannot connect to etcd immediately. It should instead keep trying in case etcd is still starting up.
Worker 8987a3f3-71e1-5cca-aadf-bc165f528fac listening on 127.0.0.1:8088 and serving content from /opt/datafusion/www
Heartbeat loop failed: Error { repr: Kind(NotFound) }
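A minimal sketch of the retry behavior; connect_to_etcd is a hypothetical stand-in for the real client call:

use std::{thread, time::Duration};

fn connect_with_retry(endpoint: &str, max_attempts: u32) -> Result<(), String> {
    for attempt in 1..=max_attempts {
        match connect_to_etcd(endpoint) {
            Ok(()) => return Ok(()),
            Err(e) => {
                eprintln!("etcd connect attempt {} failed: {}; retrying in 2s", attempt, e);
                thread::sleep(Duration::from_secs(2));
            }
        }
    }
    Err(format!("could not reach etcd at {} after {} attempts", endpoint, max_attempts))
}

// Hypothetical placeholder so the sketch stands alone; the real worker
// would call its etcd client here.
fn connect_to_etcd(_endpoint: &str) -> Result<(), String> {
    Err("connection refused".to_string())
}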
This is the lowest common denominator for being able to run some distributed workloads on existing Hadoop clusters.
There is no working pure-Rust HDFS library unfortunately, so we'll probably have to start with this one, which wraps a C library that wraps the Java client.
We should add an example query optimization to make sure that the current structures for the logical plan are going to work well enough.
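As one candidate, here is a hedged sketch of a constant-folding pass over a toy expression type (illustrative, not the actual logical plan): it rewrites a comparison of two literals, such as the WHERE 1=2 example above, into a constant.

#[derive(Debug, Clone)]
enum Expr {
    Literal(i64),
    Column(usize),
    Eq(Box<Expr>, Box<Expr>),
}

fn fold_constants(expr: Expr) -> Expr {
    match expr {
        Expr::Eq(l, r) => {
            let l = fold_constants(*l);
            let r = fold_constants(*r);
            match (&l, &r) {
                // A comparison of two literals becomes a constant predicate.
                (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal((a == b) as i64),
                _ => Expr::Eq(Box::new(l), Box::new(r)),
            }
        }
        other => other,
    }
}

fn main() {
    // WHERE 1 = 2 folds to the constant 0 (false) before execution.
    let pred = Expr::Eq(Box::new(Expr::Literal(1)), Box::new(Expr::Literal(2)));
    println!("{:?}", fold_constants(pred));
}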