Database Stream Processor (DBSP) is a framework for computing over data streams that aims to be more expressive and performant than existing streaming engines.
Computing over streaming data is hard. Streaming computations operate over changing inputs and must continuously update their outputs as new inputs arrive. They must do so in real time, processing new data and producing outputs with a bounded delay.
Existing streaming engines have limited expressive power, which is why their use in the modern data stack is restricted to data preparation and visualization. Complex business intelligence tasks are left to traditional database systems where they run as periodic (e.g., daily) batch jobs. This is inadequate in the world that increasingly relies on real-time alerts, recommendations, control, etc.
DBSP intends to change the situation by offering a simple promise: to execute any query workload that can run in batch mode in streaming mode.
DBSP unifies two kinds of streaming data: time series data and change data.
-
Time series data can be thought of as an infinitely growing log indexed by time.
-
Change data represents updates (insertions, deletions, modifications) to some state modeled as a table of records.
In DBSP, a time series is just a table where records are only ever added and never removed or modified. As a result, this table can grow unboundedly; hence most queries work with subsets of the table within a bounded time window. DBSP does not need to wait for all data within a window to become available before evaluating a query (although the user may choose to do so): like all queries, time window queries are updated on the fly as new inputs become available. This means that DBSP can work with arbitrarily large windows as long as they fit within available storage.
DBSP queries are composed of the following classes of operators that apply to both time series and change data:
-
Per-record operators that parse, validate, filter, transform data streams one record at a time.
-
The complete set of relational operators: select, project, join, aggregate, etc.
-
Recursion: Recursive queries express iterative computations, e.g., partitioning a graph into strongly connected components. Like all DPSP queries, recursive queries update their outputs incrementally as new data arrives.
In addition, DBSP supports windowing operators that group time series data into time windows, including various forms of tumbling and sliding windows, windows driven by watermarks, etc.
Delivering this functionality with strong performance and correctness guarantees requires building on a solid foundation. The theory behind DBSP is described in the accompanying paper:
TODO
The project is still in its early days. API and internals documentation is coming soon.
TODO
Execute the following command to make git commit
check the code before commit:
GITDIR=$(git rev-parse --git-dir)
ln -sf $(pwd)/tools/pre-push ${GITDIR}/hooks/pre-push
The repository has a number of benchmarks available in the benches
directory that provide a comparison of DBSP's performance against a known set of tests.
Each benchmark has its own options and behavior, as outlined below.
You can run the complete set of Nexmark queries, with the default settings, with:
cargo bench --bench nexmark --features with-nexmark
By default this will run each query with a total of 100 million events emitted at 10M per second (by two event generator threads), using 2 CPU cores for processing the data.
To run just the one query, q3, with only 10 million events, but using 8 CPU cores to process the data and 6 event generator threads, you can run:
cargo bench --bench nexmark --features with-nexmark -- --query q3 --max-events 10000000 --cpu-cores 8 --num-event-generators 6
For further options that you can use with the Nexmark benchmark,
cargo bench --bench nexmark --features with-nexmark -- --help
An extensive blog post about the implementation of Nexmark in DBSP: https://liveandletlearn.net/post/vmware-take-3-experience-with-rust-and-dbsp/