Giter VIP home page Giter VIP logo

database-stream-processor's Introduction

License: MIT CI workflow codecov

Database Stream Processor

Database Stream Processor (DBSP) is a framework for computing over data streams that aims to be more expressive and performant than existing streaming engines.

Why DBSP?

Computing over streaming data is hard. Streaming computations operate over changing inputs and must continuously update their outputs as new inputs arrive. They must do so in real time, processing new data and producing outputs with a bounded delay.

Existing streaming engines have limited expressive power, which is why their use in the modern data stack is restricted to data preparation and visualization. Complex business intelligence tasks are left to traditional database systems where they run as periodic (e.g., daily) batch jobs. This is inadequate in the world that increasingly relies on real-time alerts, recommendations, control, etc.

DBSP intends to change the situation by offering a simple promise: to execute any query workload that can run in batch mode in streaming mode.

Overview

DBSP unifies two kinds of streaming data: time series data and change data.

  • Time series data can be thought of as an infinitely growing log indexed by time.

  • Change data represents updates (insertions, deletions, modifications) to some state modeled as a table of records.

In DBSP, a time series is just a table where records are only ever added and never removed or modified. As a result, this table can grow unboundedly; hence most queries work with subsets of the table within a bounded time window. DBSP does not need to wait for all data within a window to become available before evaluating a query (although the user may choose to do so): like all queries, time window queries are updated on the fly as new inputs become available. This means that DBSP can work with arbitrarily large windows as long as they fit within available storage.

DBSP queries are composed of the following classes of operators that apply to both time series and change data:

  1. Per-record operators that parse, validate, filter, transform data streams one record at a time.

  2. The complete set of relational operators: select, project, join, aggregate, etc.

  3. Recursion: Recursive queries express iterative computations, e.g., partitioning a graph into strongly connected components. Like all DPSP queries, recursive queries update their outputs incrementally as new data arrives.

In addition, DBSP supports windowing operators that group time series data into time windows, including various forms of tumbling and sliding windows, windows driven by watermarks, etc.

Theory

Delivering this functionality with strong performance and correctness guarantees requires building on a solid foundation. The theory behind DBSP is described in the accompanying paper:

Applications

TODO

Documentation

The project is still in its early days. API and internals documentation is coming soon.

TODO

Contributing

Setting up git hooks

Execute the following command to make git commit check the code before commit:

GITDIR=$(git rev-parse --git-dir)
ln -sf $(pwd)/tools/pre-push ${GITDIR}/hooks/pre-push

Running Benchmarks against DBSP

The repository has a number of benchmarks available in the benches directory that provide a comparison of DBSP's performance against a known set of tests.

Each benchmark has its own options and behavior, as outlined below.

Nexmark Benchmark

You can run the complete set of Nexmark queries, with the default settings, with:

cargo bench --bench nexmark --features with-nexmark

By default this will run each query with a total of 100 million events emitted at 10M per second (by two event generator threads), using 2 CPU cores for processing the data.

To run just the one query, q3, with only 10 million events, but using 8 CPU cores to process the data and 6 event generator threads, you can run:

cargo bench --bench nexmark --features with-nexmark -- --query q3 --max-events 10000000 --cpu-cores 8 --num-event-generators 6

For further options that you can use with the Nexmark benchmark,

cargo bench --bench nexmark --features with-nexmark -- --help

An extensive blog post about the implementation of Nexmark in DBSP: https://liveandletlearn.net/post/vmware-take-3-experience-with-rust-and-dbsp/

database-stream-processor's People

Contributors

absoludity avatar gz avatar kixiron avatar ryzhyk avatar vmwghbot avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.