This project forked from apache/datafusion-ballista

Apache Arrow Ballista Distributed Query Engine

Home Page: https://arrow.apache.org/ballista

License: Apache License 2.0

Ballista: Distributed SQL Query Engine, built on Apache Arrow

Ballista is a distributed SQL query engine powered by the Rust implementation of Apache Arrow and Apache Arrow DataFusion.

If you are looking for documentation for a released version of Ballista, please refer to the Ballista User Guide.

Overview

Ballista implements a similar design to Apache Spark (particularly Spark SQL), but there are some key differences:

  • The choice of Rust as the main execution language avoids the overhead of garbage collection (GC) pauses and results in more predictable processing times.
  • Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized processing (SIMD) and efficient compression. Although Spark does have some columnar support, it is still largely row-based today.
  • The combination of Rust and Arrow provides excellent memory efficiency: memory usage can be 5x to 10x lower than Apache Spark in some cases, so more processing can fit on a single node, reducing the overhead of distributed compute.
  • The use of Apache Arrow as the memory model and network protocol means that data can be exchanged efficiently between executors using the Flight Protocol, and between clients and schedulers/executors using the Flight SQL Protocol.
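
The difference between row-based and columnar layouts mentioned above can be illustrated with a small sketch. This is not Ballista or Arrow code: the sample data and the function names (`to_columns`, `total_row_based`, `total_columnar`) are hypothetical, and Python's standard `array` module only stands in for a typed, contiguous column buffer.

```python
from array import array

# Hypothetical sample data, used only for illustration.
rows = [
    {"id": 1, "price": 10.0, "qty": 2},
    {"id": 2, "price": 4.5, "qty": 10},
    {"id": 3, "price": 7.25, "qty": 4},
]


def to_columns(rows):
    """Pivot row-oriented records into one typed, contiguous array per column."""
    return {
        "id": array("q", (r["id"] for r in rows)),
        "price": array("d", (r["price"] for r in rows)),
        "qty": array("q", (r["qty"] for r in rows)),
    }


def total_row_based(rows):
    # Row layout: every record is visited and each field is looked up by name.
    return sum(r["price"] * r["qty"] for r in rows)


def total_columnar(columns):
    # Columnar layout: only the two columns the query needs are scanned;
    # the "id" column is never touched.
    return sum(p * q for p, q in zip(columns["price"], columns["qty"]))


columns = to_columns(rows)
print(total_row_based(rows), total_columnar(columns))
```

Both functions compute the same total. The columnar version reads only the columns it needs from homogeneous buffers, which is the property that makes vectorized processing and compression practical in a real columnar engine.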

Architecture

A Ballista cluster consists of one or more scheduler processes and one or more executor processes. These processes can be run as native binaries and are also available as Docker Images, which can be easily deployed with Docker Compose or Kubernetes.

The following diagram shows the interaction between clients and the scheduler for submitting jobs, and the interaction between the executor(s) and the scheduler for fetching tasks and reporting task status.

[Figure: Ballista cluster diagram]

See the architecture guide for more details.
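
The interaction described above (clients submit jobs, executors fetch tasks and report task status) can be sketched as a toy in-process model. Everything here is an assumption made for illustration: the `Scheduler` class, its method names, and the task ID format are hypothetical, and none of this reflects Ballista's actual interfaces or wire protocol.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Scheduler:
    """Toy scheduler: accepts jobs, hands out tasks, records task status."""
    pending: deque = field(default_factory=deque)
    status: dict = field(default_factory=dict)

    def submit_job(self, job_id, num_tasks):
        # A client submits a job; here a job is just N independent tasks.
        for i in range(num_tasks):
            task_id = f"{job_id}/{i}"
            self.pending.append(task_id)
            self.status[task_id] = "pending"

    def poll_task(self, executor_id):
        # An executor asks for work; None means nothing is queued.
        if not self.pending:
            return None
        task_id = self.pending.popleft()
        self.status[task_id] = f"running on {executor_id}"
        return task_id

    def report_status(self, task_id, state):
        # An executor reports the outcome of a task.
        self.status[task_id] = state

    def job_done(self, job_id):
        prefix = f"{job_id}/"
        return all(state == "completed"
                   for task_id, state in self.status.items()
                   if task_id.startswith(prefix))


def run_executor_once(scheduler, executor_id):
    """One polling round: fetch a task, pretend to execute it, report back."""
    task_id = scheduler.poll_task(executor_id)
    if task_id is None:
        return False
    scheduler.report_status(task_id, "completed")
    return True


scheduler = Scheduler()
scheduler.submit_job("job-1", num_tasks=4)
executors = ["executor-a", "executor-b"]
# Each round, every executor polls once; stop when no executor found work.
while any([run_executor_once(scheduler, e) for e in executors]):
    pass
print(scheduler.job_done("job-1"))
```

A real cluster runs these roles as separate processes communicating over the network; the sketch only shows the direction of each call.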

Features

  • Supports HDFS as well as cloud object stores. S3 is supported today; GCS and Azure support are planned.
  • DataFrame and SQL APIs are available from Python and Rust.
  • Clients can connect to a Ballista cluster using Flight SQL.
  • JDBC support via the Arrow Flight SQL JDBC Driver.
  • Scheduler web interface and REST API for monitoring query progress and viewing query plans and metrics.
  • Support for Docker, Docker Compose, and Kubernetes deployment, as well as manual deployment on bare metal.

Performance

We run some simple benchmarks comparing Ballista with Apache Spark to track progress with performance optimizations. These are benchmarks derived from TPC-H and not official TPC-H benchmarks. These results are from running individual queries at scale factor 10 (10 GB) on a single node with a single executor and 24 concurrent tasks.

The tracking issue for improving these results is #339.

[Figure: benchmark results]

Getting Started

The easiest way to get started is to run one of the standalone or distributed examples. After that, refer to the Getting Started Guide.

Project Status

Ballista supports a wide range of SQL, including CTEs, joins, and subqueries, and can execute complex queries at scale.

Refer to the DataFusion SQL Reference for more information on supported SQL.
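
As a rough illustration of the query shapes named above, the sketch below combines a CTE, a join, and a subquery in one statement. It runs against Python's built-in `sqlite3` module purely so that the example is self-contained; it does not use Ballista or DataFusion, and the table names and data are hypothetical. Whether the same query text runs unchanged on Ballista depends on the SQL support described in the DataFusion SQL Reference.

```python
import sqlite3

QUERY = """
WITH order_totals AS (              -- CTE
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
)
SELECT c.name, t.total
FROM customers AS c
JOIN order_totals AS t ON t.customer_id = c.id          -- join
WHERE t.total > (SELECT AVG(total) FROM order_totals)   -- subquery
ORDER BY t.total DESC
"""

# Hypothetical in-memory tables, used only to make the query runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO orders VALUES (1, 50.0), (1, 25.0), (2, 10.0), (3, 40.0);
""")

# Customers whose order total is above the average order total.
result = conn.execute(QUERY).fetchall()
print(result)
```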

Ballista is maturing quickly and is now working towards being production-ready. See the roadmap for more details.

Contribution Guide

Please see the Contribution Guide for information about contributing to Ballista.
