andygrove / datafusion-archive
DataFusion has now been donated to the Apache Arrow project
Home Page: https://arrow.apache.org/
License: Apache License 2.0
We need an integration test that runs as part of the CI build, covering the standard aggregate functions, e.g. MIN, MAX, COUNT, SUM, AVG.
Instead of interpreting the plan and expressions at runtime, the executor should generate closures from the plan and these closures can be called per row. This should be faster.
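Here is a minimal sketch of that idea, using an illustrative expression type rather than DataFusion's actual plan structures: the tree is walked once to build nested closures, so the per-row call involves no match on the expression.

enum Expr {
    Column(usize),
    Literal(f64),
    Add(Box<Expr>, Box<Expr>),
}

type Row = Vec<f64>;
type CompiledExpr = Box<dyn Fn(&Row) -> f64>;

// Walk the tree once, producing a closure that can be called per row.
fn compile(expr: &Expr) -> CompiledExpr {
    match expr {
        Expr::Column(i) => {
            let i = *i;
            Box::new(move |row| row[i])
        }
        Expr::Literal(v) => {
            let v = *v;
            Box::new(move |_| v)
        }
        Expr::Add(l, r) => {
            let l = compile(l);
            let r = compile(r);
            Box::new(move |row| l(row) + r(row))
        }
    }
}

fn main() {
    // (col0 + 1.5), compiled once and then evaluated for each row
    let expr = Expr::Add(Box::new(Expr::Column(0)), Box::new(Expr::Literal(1.5)));
    let eval = compile(&expr);
    for row in &[vec![1.0], vec![2.0]] {
        println!("{}", eval(row));
    }
}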
Invalid but parsed successfully:
SELECT CREATE EXTERNAL TABLE test (id VARCHAR(1)) FROM foo WHERE 1=2
SELECT 1 FROM foo WHERE CREATE EXTERNAL TABLE test (id VARCHAR(1))
Valid (debatable if useful) but not accepted:
CREATE EXTERNAL TABLE test ()
I'm trying to fix the parser, but it seems to be non-trivial. Is there a reason why you wrote your own parser and didn't use some parser combinator library like nom or combine?
This should be fairly simple thanks to https://github.com/danburkert/kudu-rs
I wonder if we could use a simple configuration file: a .toml file that defines all the necessary settings. Then, as the project moves forward and needs more configuration, it would be easier to use, and we would not need to pass all the arguments on the command line every time.
For example:
# Here we can put configuration that
# is common to both
[datafusion]
etcd = "http://127.0.0.1:2379"
[worker]
bind = "0.0.0.0:8080"
data_dir = "./path/data_dir"
webroot = "./src/bin/worker"
[console]
# some specific configuration
# to the console
When we start the console or the worker, we could pass a flag that loads the configuration, either from a default location or from an explicit path:
worker -c
worker -c /path/to/config.toml
And likewise for the console
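A minimal sketch of loading such a file, assuming the serde and toml crates; the struct and field names mirror the example above but are otherwise hypothetical:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Config {
    datafusion: DataFusionConfig,
    worker: Option<WorkerConfig>,
}

#[derive(Debug, Deserialize)]
struct DataFusionConfig {
    etcd: String,
}

#[derive(Debug, Deserialize)]
struct WorkerConfig {
    bind: String,
    data_dir: String,
    webroot: String,
}

// Read the file named by the -c flag and parse it into typed config.
fn load_config(path: &str) -> Result<Config, Box<dyn std::error::Error>> {
    let contents = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&contents)?)
}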
Something like this:
https://www.iconfinder.com/icons/360735/atom_discovery_physic_science_icon#size=256
We need a way to register CSV files and define a schema via SQL so we can query CSV from a SQL console without having to write Rust code. The standard way of doing this is with a CREATE EXTERNAL TABLE command that defines the schema and provides the other information required to open the CSV, like the directory path.
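For illustration, the syntax might look something like this (hypothetical; the STORED AS and LOCATION clauses are loosely modeled on Hive's DDL):
CREATE EXTERNAL TABLE test (id INT, name VARCHAR(100)) STORED AS CSV LOCATION '/path/to/data/'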
Here are docs for how Hive does this:
Adding IPC support could allow DataFusion to work alongside existing Spark jobs and allow for a gradual migration from Spark to DataFusion.
Also it could allow non-Rust code to be executed by DataFusion.
The current type system uses an enum as a wrapper around some standard types, but this should probably be re-implemented using generics instead to make it easier to add new types. The system should support all the standard primitive types (bool, int, long, float, double) as well as string, date/time, and binary.
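A hedged sketch of what the generics-based direction could look like (illustrative only, not the actual design): a trait per supported type instead of one central enum, so operations are monomorphized by the compiler with no run-time tag checks.

trait ScalarValue: Clone + std::fmt::Debug {}

impl ScalarValue for bool {}
impl ScalarValue for i32 {}     // int
impl ScalarValue for i64 {}     // long
impl ScalarValue for f32 {}     // float
impl ScalarValue for f64 {}     // double
impl ScalarValue for String {}
impl ScalarValue for Vec<u8> {} // binary

// Operations can then be written once, generically.
fn column_min<T: ScalarValue + PartialOrd>(values: &[T]) -> Option<&T> {
    values.iter().fold(None, |acc, v| match acc {
        Some(m) if m <= v => Some(m),
        _ => Some(v),
    })
}

fn main() {
    let xs = vec![3.0_f64, 1.0, 2.0];
    println!("{:?}", column_min(&xs)); // Some(1.0)
}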
I can see that TravisCI is currently being used, but it might behoove us to move this project to CircleCI.
CircleCI 2.0 gives us a lot of flexibility, and allows us to run things in containers with versions of code that we desire. It also means we don't have to maintain our own build cluster.
Just something to think about. I can take this on if you want; shouldn't take longer than a few hours.
Hi all, any idea how this kind of project maps to something like Apache Beam?
This is some of the oldest code in the repo and I used Copy a lot, but it would be good to clean this up now before too much more functionality is built.
Create a Dockerfile to run the worker process (to try out the worker locally, just run cargo run worker).
A/C: build the release binary (cargo build --release) and create the Docker image.
Currently the project is a binary project with a main method containing some sample code. We should change it to a library and add examples instead.
This library looks like a good place to start:
https://docs.rs/libloading/0.5.0/libloading/
Users should be able to publish crates containing structs that implement appropriate datafusion traits for types and functions and then register the produced libraries with datafusion at runtime.
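A minimal sketch of the loading side, using the libloading 0.5 API; the exported symbol name and registration scheme are hypothetical:

use libloading::{Library, Symbol};

fn register_udf_library(path: &str) -> std::io::Result<()> {
    let lib = Library::new(path)?;
    unsafe {
        // The published crate would export a well-known registration hook.
        let register: Symbol<unsafe extern "C" fn()> = lib.get(b"register_udfs")?;
        register();
    }
    // Keep the library loaded for as long as its functions may be used;
    // a real design would store it in a registry instead of leaking it.
    std::mem::forget(lib);
    Ok(())
}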
For the first benchmark, I'm thinking that a geospatial use case would be good. It would be easy to generate CSV files of varying size containing fake address data and lat/lng coordinates, and then run a job to translate the lat/lngs to quadkeys or a Mercator projection.
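As a hedged sketch of the per-row work such a job would do, here is the usual Bing-Maps-style lat/lng-to-quadkey conversion:

fn quadkey(lat: f64, lng: f64, zoom: u32) -> String {
    use std::f64::consts::PI;
    // Web Mercator projection to normalized tile coordinates.
    let lat = lat.max(-85.05112878).min(85.05112878);
    let x = (lng + 180.0) / 360.0;
    let sin_lat = (lat * PI / 180.0).sin();
    let y = 0.5 - ((1.0 + sin_lat) / (1.0 - sin_lat)).ln() / (4.0 * PI);
    let scale = (1u64 << zoom) as f64;
    let tile_x = ((x * scale) as u64).min((1 << zoom) - 1);
    let tile_y = ((y * scale) as u64).min((1 << zoom) - 1);
    // Interleave the tile coordinate bits into a base-4 digit string.
    (0..zoom).rev().map(|i| {
        let mask = 1u64 << i;
        let mut digit = 0;
        if tile_x & mask != 0 { digit += 1; }
        if tile_y & mask != 0 { digit += 2; }
        std::char::from_digit(digit, 10).unwrap()
    }).collect()
}

fn main() {
    // Seattle-ish coordinates at zoom level 3
    println!("{}", quadkey(47.6, -122.3, 3)); // "021"
}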
References:
In order to stream data between nodes and to efficiently swap data out to disk and store in a native format, it will be necessary to serialize to a binary format.
KX Systems sold probably a billion dollars worth of licenses to financial institutions by optimizing the following operation: http://code.kx.com/q/ref/joins/#aj-aj0-asof-join
To my knowledge, Spark doesn't have an optimized implementation of this, or if it does, I have no idea what they call it at the moment.
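For reference, a hedged sketch of the core idea on timestamp-sorted batches: for each left-side row, pick the most recent right-side row at or before it.

// For each left timestamp, return the index of the most recent right
// row at or before it (None if there is none). Inputs must be sorted.
fn asof_join(left_ts: &[i64], right_ts: &[i64]) -> Vec<Option<usize>> {
    let mut out = Vec::with_capacity(left_ts.len());
    let mut r = 0;
    for &t in left_ts {
        while r < right_ts.len() && right_ts[r] <= t {
            r += 1;
        }
        out.push(if r == 0 { None } else { Some(r - 1) });
    }
    out
}

fn main() {
    // trades at t=3,7 matched against quotes at t=1,5,6
    println!("{:?}", asof_join(&[3, 7], &[1, 5, 6])); // [Some(0), Some(2)]
}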
I looked at the code 2-3 weeks ago when it was first announced with the 2018 big-data-on-Rust blog post. I took a cursory look today. I am going through a similar exercise in Rust, scratching my own itch after working 4+ years in the same domain on JVM/Scala/etc.
I would like this project to succeed so here are some observations that may help in the long run (caveat - I may have missed some of the points in your code).
1. The streaming model is more general and can easily be relaxed to batch processing; the inverse is always hard. On the other hand, batch processing is more efficient than processing event-by-event. In my experience something like micro-batching works best in terms of flexibility and performance.
2. Columnar processing is more efficient on the current crop of hardware, as it plays better with dispatch overhead, cache locality and data prefetching. It is also a natural extension of micro-batching from point 1. My own benchmarks show a 20x difference with batches of 1024 tuples, around 2-3 cycles per arithmetic operation on f64, including SQL NULL correctness.
3. You definitely don't want to dispatch on types in the leaves, e.g. in the function/expression bodies like https://github.com/andygrove/datafusion-rs/blob/master/src/functions/math.rs#L14 - data types are a compile-time concept and should not exist at run-time. You can use generics, type erasure and columnar dispatch to get there. Think about this in the same vein as Rust vs. Ruby: Rust is faster because it does not need to check the run-time type on each operation; the compiler guarantees it.
Instead of a switch-on-types interpreter, where you pay the dispatch cost on each event, you may want to think about expressions in a more functional way, e.g.
fn gteq(lhs: Expr, rhs: Expr) -> Box<Fn(&Tuple) -> Value>
or, if you go with the columnar approach:
fn gteq(lhs: Expr, rhs: Expr) -> Box<Fn(&Frame) -> Column>
and compose the whole computation just once before executing it: following a reference once per 1000 tuples is a lot cheaper than going through a match on each tuple.
You can take a look at a minimalistic example I put together some time ago: https://gist.github.com/luben/95c1c05f36ec56a57f5624c1b40e9f11
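To make the columnar variant concrete, here is a hedged sketch along the same lines; Frame and Column are hypothetical stand-ins, not the project's actual types:

struct Frame {
    columns: Vec<Vec<f64>>,
}
type Column = Vec<bool>;
type CompiledExpr = Box<dyn Fn(&Frame) -> Column>;

// Compose the comparison once; each call then runs a tight loop over a
// whole batch with no per-tuple dispatch.
fn gteq(lhs: usize, rhs: usize) -> CompiledExpr {
    Box::new(move |f| {
        f.columns[lhs]
            .iter()
            .zip(&f.columns[rhs])
            .map(|(l, r)| l >= r)
            .collect()
    })
}

fn main() {
    let frame = Frame { columns: vec![vec![1.0, 3.0], vec![2.0, 2.0]] };
    let expr = gteq(0, 1); // built once
    println!("{:?}", expr(&frame)); // [false, true]
}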
good work, good start, I will learn it and do something.
Implement some basic geospatial functions, drawing inspiration from https://github.com/Esri/spatial-framework-for-hadoop
Datafusion needs to be able to process tuples containing values that are of a user-defined type not known at compile time. Datafusion doesn't need to understand the types too much but needs to be able to perform serde operations and pass them to user-defined functions.
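A hedged sketch of what such a hook might look like; the trait name and method are hypothetical, not DataFusion's actual API:

trait UserDefinedType: std::fmt::Debug {
    // Serialize the value to bytes so the engine can store or ship it
    // without understanding its internals.
    fn to_bytes(&self) -> Vec<u8>;
}

#[derive(Debug)]
struct Point { x: f64, y: f64 }

impl UserDefinedType for Point {
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(16);
        out.extend_from_slice(&self.x.to_le_bytes());
        out.extend_from_slice(&self.y.to_le_bytes());
        out
    }
}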
Need to implement correct error handling, i.e. returning Result in the SQL parser, to avoid panics.
Here's an example:
DataFusion Console
$ SELECT * FROM foo
Executing: SELECT * FROM foo
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParserError("Prefix parser expected a keyword but found Mult")', /checkout/src/libcore/result.rs:906:4
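A minimal sketch of the desired behavior, with an illustrative stub standing in for the real parser: the console matches on the parser's Result instead of calling unwrap().

#[derive(Debug)]
enum ParserError {
    Unexpected(String),
}

// Illustrative stub in place of the real SQL parser.
fn parse_sql(sql: &str) -> Result<String, ParserError> {
    if sql.trim_start().to_uppercase().starts_with("SELECT") {
        Ok(sql.to_string())
    } else {
        Err(ParserError::Unexpected(format!("unsupported statement: {}", sql)))
    }
}

fn main() {
    // The console reports the error and keeps running instead of panicking.
    match parse_sql("CREATE TABLE foo") {
        Ok(plan) => println!("Executing: {}", plan),
        Err(e) => println!("Parse error: {:?}", e),
    }
}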
To match the ad-hoc query capabilities of Spark many users will expect to be able to apply functional transformations to DataFrames using custom Rust code without having to package that code in a crate dependency behind a UDF trait.
Is this even possible with Rust?
Currently the console executes the input when enter is hit. Normal behavior for a SQL console is to accept multi-line statements terminated with a semicolon.
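A hedged sketch of the accumulate-until-semicolon loop (a real console would also need to handle semicolons inside string literals):

use std::io::{self, BufRead, Write};

fn main() {
    let stdin = io::stdin();
    let mut buffer = String::new();
    print!("$ ");
    io::stdout().flush().unwrap();
    for line in stdin.lock().lines() {
        let line = line.unwrap();
        buffer.push_str(&line);
        buffer.push(' ');
        if buffer.trim_end().ends_with(';') {
            // A complete statement has been entered; execute it.
            println!("Executing: {}", buffer.trim());
            buffer.clear();
            print!("$ ");
        } else {
            // Keep accumulating and show a continuation prompt.
            print!("> ");
        }
        io::stdout().flush().unwrap();
    }
}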
The first use case should be sending a job to read a CSV file and partition it into multiple CSV files.
Maybe write a Rust wrapper around this?
It should be possible to use some pre-defined scalar functions in a query. For example:
SELECT sqrt(x) FROM foo
SELECT * FROM foo WHERE sqrt(x) < sqrt(y)
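A hedged sketch of the kind of interface the engine could dispatch such functions through; the trait and types are hypothetical, not the actual API:

trait ScalarFunction {
    fn name(&self) -> &str;
    fn execute(&self, args: &[f64]) -> f64;
}

struct Sqrt;

impl ScalarFunction for Sqrt {
    fn name(&self) -> &str { "sqrt" }
    fn execute(&self, args: &[f64]) -> f64 { args[0].sqrt() }
}

fn main() {
    let f: Box<dyn ScalarFunction> = Box::new(Sqrt);
    // What the engine would do for SELECT sqrt(x) FROM foo, per value of x:
    println!("{} -> {}", f.name(), f.execute(&[9.0])); // sqrt -> 3
}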
Currently the worker terminates if it cannot connect to etcd immediately. It should instead keep trying in case etcd is still starting up.
Worker 8987a3f3-71e1-5cca-aadf-bc165f528fac listening on 127.0.0.1:8088 and serving content from /opt/datafusion/www
Heartbeat loop failed: Error { repr: Kind(NotFound) }
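A minimal sketch of the retry behavior; connect_to_etcd is a hypothetical stand-in for the real client call:

use std::{thread, time::Duration};

fn connect_with_retry(endpoint: &str, max_attempts: u32) -> Result<(), String> {
    for attempt in 1..=max_attempts {
        match connect_to_etcd(endpoint) {
            Ok(()) => return Ok(()),
            Err(e) => {
                eprintln!("etcd connect attempt {} failed: {}; retrying in 2s", attempt, e);
                thread::sleep(Duration::from_secs(2));
            }
        }
    }
    Err(format!("could not reach etcd at {} after {} attempts", endpoint, max_attempts))
}

// Hypothetical placeholder so the sketch stands alone; the real worker
// would call its etcd client here.
fn connect_to_etcd(_endpoint: &str) -> Result<(), String> {
    Err("connection refused".to_string())
}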
This is the lowest common denominator for being able to run some distributed workloads on existing Hadoop clusters.
There is no working pure-Rust HDFS library unfortunately, so we'll probably have to start with this one, which wraps a C library that wraps the Java client.
We should add an example query optimization to make sure that the current structures for the logical plan are going to work well enough.
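As one candidate, here is a hedged sketch of a constant-folding pass over a toy expression type (illustrative, not the actual logical plan): it rewrites a comparison of two literals, such as the WHERE 1=2 example above, into a constant.

#[derive(Debug, Clone)]
enum Expr {
    Literal(i64),
    Column(usize),
    Eq(Box<Expr>, Box<Expr>),
}

fn fold_constants(expr: Expr) -> Expr {
    match expr {
        Expr::Eq(l, r) => {
            let l = fold_constants(*l);
            let r = fold_constants(*r);
            match (&l, &r) {
                // A comparison of two literals becomes a constant predicate.
                (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal((a == b) as i64),
                _ => Expr::Eq(Box::new(l), Box::new(r)),
            }
        }
        other => other,
    }
}

fn main() {
    // WHERE 1 = 2 folds to the constant 0 (false) before execution.
    let pred = Expr::Eq(Box::new(Expr::Literal(1)), Box::new(Expr::Literal(2)));
    println!("{:?}", fold_constants(pred));
}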