
toyDB

Distributed SQL database in Rust, written as a learning project. Most components are built from scratch, including:

  • Raft-based distributed consensus engine for linearizable state machine replication.

  • ACID-compliant transaction engine with MVCC-based snapshot isolation.

  • Pluggable storage engine with BitCask and in-memory backends.

  • Iterator-based query engine with heuristic optimization and time-travel support.

  • SQL interface including projections, filters, joins, aggregates, and transactions.

toyDB is not suitable for real-world use, but may be of interest to others learning about database internals.

Documentation

  • Architecture guide: a guide to toyDB's architecture and implementation.

  • SQL examples: comprehensive examples of toyDB's SQL features.

  • SQL reference: detailed reference documentation for toyDB's SQL dialect.

  • References: books and other research material used while building toyDB.

Usage

With a Rust compiler installed, a local five-node cluster can be started on localhost ports 9601 to 9605, with data under cluster/*/data:

$ ./cluster/run.sh

A command-line client can be built and used with node 5 on localhost:9605:

$ cargo run --release --bin toysql
Connected to toyDB node "toydb-e". Enter !help for instructions.
toydb> CREATE TABLE movies (id INTEGER PRIMARY KEY, title VARCHAR NOT NULL);
toydb> INSERT INTO movies VALUES (1, 'Sicario'), (2, 'Stalker'), (3, 'Her');
toydb> SELECT * FROM movies;
1|Sicario
2|Stalker
3|Her

toyDB supports most common SQL features, including joins, aggregates, and ACID transactions.

Architecture

[toyDB architecture diagram]

toyDB's architecture is fairly typical for distributed SQL databases: a transactional key/value store managed by a Raft cluster with a SQL query engine on top. See the architecture guide for more details.

Tests

toyDB has decent test coverage, with about a thousand tests of core functionality. These consist of in-code unit-tests for many low-level components, golden master integration tests of the SQL engine under tests/sql, and a basic set of end-to-end cluster tests under tests/. Jepsen tests, or similar system-wide correctness and reliability tests, are desirable but not yet implemented.

Execute cargo test to run all tests, or check out the latest CI run.

Benchmarks

toyDB is not optimized for performance, but it comes with a workload benchmarking tool that can run various workloads against a toyDB cluster. For example:

# Start a 5-node toyDB cluster.
$ ./cluster/run.sh
[...]

# Run a read-only benchmark via all 5 nodes.
$ cargo run --release --bin workload read
Preparing initial dataset... done (0.096s)
Spawning 16 workers... done (0.003s)
Running workload read (rows=1000 size=64 batch=1)...

Time   Progress     Txns      Rate       p50       p90       p99      pMax
1.0s       7.2%     7186    7181/s     2.3ms     3.1ms     4.0ms     9.6ms
2.0s      14.4%    14416    7205/s     2.3ms     3.1ms     4.2ms     9.6ms
3.0s      22.5%    22518    7504/s     2.2ms     2.9ms     4.0ms     9.6ms
4.0s      30.3%    30303    7574/s     2.2ms     2.9ms     3.8ms     9.6ms
5.0s      38.2%    38200    7639/s     2.2ms     2.8ms     3.7ms     9.6ms
6.0s      46.0%    45961    7659/s     2.2ms     2.8ms     3.7ms     9.6ms
7.0s      53.3%    53343    7620/s     2.2ms     2.8ms     3.7ms     9.6ms
8.0s      61.2%    61220    7651/s     2.2ms     2.8ms     3.6ms     9.6ms
9.0s      68.2%    68194    7576/s     2.2ms     2.8ms     3.7ms     9.6ms
10.0s     75.8%    75800    7579/s     2.2ms     2.8ms     3.7ms     9.6ms
11.0s     82.9%    82864    7533/s     2.2ms     2.9ms     3.7ms    18.2ms
12.0s     90.6%    90583    7548/s     2.2ms     2.9ms     3.7ms    18.2ms
13.0s     98.3%    98311    7562/s     2.2ms     2.9ms     3.7ms    18.2ms
13.2s    100.0%   100000    7569/s     2.2ms     2.9ms     3.7ms    18.2ms

Verifying dataset... done (0.001s)

The available workloads are:

  • read: single-row primary key lookups.
  • write: single-row inserts to sequential primary keys.
  • bank: makes bank transfers between various customers and accounts. To make things interesting, this includes joins, secondary indexes, sorting, and conflicts.

For more information about workloads and parameters, run cargo run --bin workload -- --help.

Example workload results:

Workload   Time       Txns      Rate       p50       p90       p99      pMax
read       13.2s    100000    7569/s     2.2ms     2.9ms     3.7ms    18.2ms
write      22.2s    100000    4502/s     3.9ms     4.5ms     4.9ms    15.7ms
bank       155.0s   100000     645/s    16.9ms    41.7ms    95.0ms  1044.4ms

Debugging

VSCode provides a very intuitive environment for debugging toyDB. The debug configuration is included under .vscode/launch.json. Follow these steps to set it up:

  1. Install the CodeLLDB extension.

  2. Go to the "Run and Debug" tab and select e.g. "Debug unit tests in library 'toydb'".

  3. To debug the binary, select "Debug executable 'toydb'" under "Run and Debug".

Credits

The toyDB logo is courtesy of @jonasmerlin.

toydb's People

Contributors

chenyukang, cm-iwata, erikgrinaker, iamazy, jonasmerlin, light-city, messense, zaaath


toydb's Issues

I like this repo

Hi Erik,
I'm new to Rust and databases, and I think this repo is an amazing project for learning both.
I want to contribute some code, but I did not find a code of conduct or a CONTRIBUTING file.
Is there anything I should know about that?

BitCask storage engine

BitCask is a very simple log-structured key/value store, used by Riak. It basically writes key/value pairs to an append-only log, and keeps a map of keys to file byte offsets in memory, with regular compaction.

This is just the right level of complexity for toyDB, and is suitable both for Raft log and state machine storage.
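The core BitCask idea described above can be sketched in a few lines. This is an illustrative toy, not toyDB's actual implementation: the "log" is an in-memory byte buffer here for brevity, whereas a real BitCask appends to a file and rebuilds the key directory by scanning the log on startup.

```rust
use std::collections::HashMap;

/// A minimal BitCask-style store: values go into an append-only log, and an
/// in-memory "keydir" maps each key to the byte offset and length of its
/// latest value.
struct MiniBitCask {
    log: Vec<u8>,                             // append-only log of value bytes
    keydir: HashMap<Vec<u8>, (usize, usize)>, // key -> (offset, length)
}

impl MiniBitCask {
    fn new() -> Self {
        Self { log: Vec::new(), keydir: HashMap::new() }
    }

    /// Appends the value to the log and points the keydir at it. Old values
    /// remain in the log as garbage until compaction rewrites it.
    fn set(&mut self, key: &[u8], value: &[u8]) {
        let offset = self.log.len();
        self.log.extend_from_slice(value);
        self.keydir.insert(key.to_vec(), (offset, value.len()));
    }

    /// Reads are one map lookup plus one read at a known offset.
    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.keydir.get(key).map(|&(off, len)| &self.log[off..off + len])
    }
}

fn main() {
    let mut db = MiniBitCask::new();
    db.set(b"a", b"1");
    db.set(b"a", b"2"); // overwrite: stale bytes stay in the log
    assert_eq!(db.get(b"a"), Some(&b"2"[..]));
    assert_eq!(db.get(b"b"), None);
}
```

Compaction (not shown) would rewrite the log keeping only the entries the keydir still points at.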

Iterators should be O(1), streaming, and borrowless

Iterators are currently rather hacky and inefficient: they use O(log n) lookups per next() call (B+tree and MVCC) or buffer all results (Raft and SQL engines). Ideally, iterators should have O(1) complexity per next() call, stream all results (with some amount of IO buffering), and not hold read borrows on the entire data structure for the lifetime of the iterator.
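The tension between the last two goals can be sketched as follows. This borrowless iterator holds its own Arc handle instead of a borrow, but to do so it must re-seek on every next(), which is exactly the O(log n)-per-call pattern the issue complains about. All types here are illustrative, not toyDB's.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;
use std::sync::{Arc, Mutex};

/// A scan that does not borrow the store: it clones an Arc handle and
/// remembers the last key yielded, re-seeking on each next() call.
struct Scan {
    store: Arc<Mutex<BTreeMap<Vec<u8>, Vec<u8>>>>,
    next_from: Bound<Vec<u8>>,
}

impl Iterator for Scan {
    type Item = (Vec<u8>, Vec<u8>);

    fn next(&mut self) -> Option<Self::Item> {
        let store = self.store.lock().unwrap();
        // Re-seek past the last key: O(log n) per call, the cost of not
        // holding a long-lived borrow or range cursor into the tree.
        let (k, v) = store.range((self.next_from.clone(), Bound::Unbounded)).next()?;
        let item = (k.clone(), v.clone());
        self.next_from = Bound::Excluded(k.clone());
        Some(item)
    }
}

fn main() {
    let data = BTreeMap::from([(b"a".to_vec(), vec![1u8]), (b"b".to_vec(), vec![2u8])]);
    let store = Arc::new(Mutex::new(data));
    let scan = Scan { store, next_from: Bound::Unbounded };
    let keys: Vec<_> = scan.map(|(k, _)| k).collect();
    assert_eq!(keys, vec![b"a".to_vec(), b"b".to_vec()]);
}
```

Getting O(1) next() without the borrow would require something like chunked prefetching or an owned snapshot, which is the work the issue asks for.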

API documentation

Build with GitHub Actions and host on GitHub Pages, or something.

Bank tests

For concurrency and performance testing.

Client serialization retries

The client should automatically retry serialization failures. This is non-trivial due to async closure lifetimes. See bank simulation for implementation example.
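A synchronous retry wrapper of the kind being asked for might look like the sketch below; the error type and retry policy are made up for illustration, and the async closure lifetime issue the text mentions is sidestepped by keeping it synchronous.

```rust
/// A stand-in for toyDB's error type; only the retryable variant matters here.
#[derive(Debug, PartialEq)]
enum Error {
    Serialization, // transient conflict: safe to retry the whole transaction
    #[allow(dead_code)]
    Other(String),
}

/// Runs a transaction closure, retrying it up to `max` times on
/// serialization failures. The closure must be safe to re-run from scratch.
fn with_retries<T>(max: usize, mut txn: impl FnMut() -> Result<T, Error>) -> Result<T, Error> {
    let mut attempts = 0;
    loop {
        match txn() {
            Err(Error::Serialization) if attempts < max => attempts += 1,
            result => return result,
        }
    }
}

fn main() {
    // Simulate a transaction that conflicts twice before succeeding.
    let mut calls = 0;
    let result = with_retries(3, || {
        calls += 1;
        if calls < 3 { Err(Error::Serialization) } else { Ok(calls) }
    });
    assert_eq!(result, Ok(3));
}
```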

Improve SQL key encoding

All storage keys (even individual rows and index entries) currently use full identifiers for tables and columns; they should use integer identifiers instead. They must also escape separators.
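Separator escaping can be sketched with one common scheme: escape 0x00 bytes inside a key component as 0x00 0xff, and terminate each component with 0x00 0x00, so component boundaries stay unambiguous and byte ordering is preserved. This is an assumed scheme for illustration, not necessarily the encoding toyDB ends up with.

```rust
/// Escapes a key component so 0x00 can serve as an unambiguous separator:
/// interior 0x00 bytes become 0x00 0xff, and the component is terminated
/// with 0x00 0x00.
fn escape(component: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(component.len() + 2);
    for &b in component {
        if b == 0x00 {
            out.extend_from_slice(&[0x00, 0xff]);
        } else {
            out.push(b);
        }
    }
    out.extend_from_slice(&[0x00, 0x00]); // component terminator
    out
}

fn main() {
    // A plain component just gains the terminator.
    assert_eq!(escape(b"ab"), vec![b'a', b'b', 0x00, 0x00]);
    // An embedded 0x00 no longer collides with the terminator.
    assert_eq!(escape(&[0x00]), vec![0x00, 0xff, 0x00, 0x00]);
}
```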

The MVCC anomaly_read_skew test seems to be written incorrectly.

mvcc.rs: anomaly_read_skew function

let t1 = mvcc.begin()?;
let t2 = mvcc.begin()?;

assert_eq!(t1.get(b"a")?, Some(vec![0]));
t2.set(b"a", vec![2])?;
t2.set(b"b", vec![2])?;
t2.commit()?;
assert_eq!(t1.get(b"a")?, Some(vec![0]));

The test's comment explains: "Read skew is when t1 reads a and b, but t2 modifies b in between the reads. Snapshot isolation prevents this."

However, according to the explanation in the document below, read skew involves reading two different items, while this test reads a both times.

https://vladmihalcea.com/a-beginners-guide-to-read-and-write-skew-phenomena/

Therefore, the final read should be of b instead of a.

On-disk B+tree store

By default, toyDB uses an on-disk Raft log for persistence and an in-memory B+tree key-value store for SQL state data. It might be interesting to build an on-disk B+tree key-value store as well.

Unit test comment error

The follower's last_index is 3 and its last_term is 2, so the Campaign message below carries an outdated index:

fn step_solicitvote_last_index_outdated() -> Result<()> {
    let (follower, mut node_rx) = setup()?;
    let mut node = follower.step(Envelope {
        from: 3,
        to: 1,
        term: 3,
        message: Message::Campaign { last_index: 2, last_term: 2 },
    })?;
    // [...]
}

The test's comment says: "SolicitVote is rejected if last_term is outdated."

So the comment should say last_index instead of last_term.

Add error::Result

Since all Result instances use error::Error, we should add error::Result which always uses this error type.
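The proposed alias is a one-liner; the Error type below is a stand-in for toyDB's error::Error, just to show how signatures shorten.

```rust
/// A stand-in for the crate's single error type.
#[derive(Debug)]
pub struct Error(pub String);

/// Crate-wide alias: signatures can now say Result<T> instead of
/// Result<T, Error>.
pub type Result<T> = std::result::Result<T, Error>;

/// An illustrative function using the alias.
fn parse_flag(s: &str) -> Result<bool> {
    match s {
        "true" => Ok(true),
        "false" => Ok(false),
        other => Err(Error(format!("invalid flag {other}"))),
    }
}

fn main() {
    assert_eq!(parse_flag("true").unwrap(), true);
    assert!(parse_flag("maybe").is_err());
}
```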

Hash Join Issue?

Hi @erikgrinaker,

Thank you for open-sourcing this great project. I would say this is the best resource for learning database principles. Meanwhile, the code quality is awesome: neat and elegant. Every piece is excellent.

I have a question about one of the optimizer rules, shown below:

(Expression::Field(a, a_label), Expression::Field(b, b_label)) => {
    let (left_field, right_field) = if a < left_size {
        ((a, a_label), (b - left_size, b_label))
    } else {
        ((b, b_label), (a - left_size, a_label))
    };
    Ok(Node::HashJoin { left, left_field, right, right_field, outer })
}

When it is an equality join, we switch to a hash join based on Field(a) and Field(b). Not sure if I'm missing anything, but what if the corresponding fields (columns) are not unique, i.e. there are duplicate values in the fields? I feel the hash join will miss some rows.

The hash map is created by collecting the key/value pairs:

let right: HashMap<Value, Row> = rrows
    .map(|res| match res {
        Ok(row) if row.len() <= r => {
            Err(Error::Internal(format!("Right index {} out of bounds", r)))
        }
        Ok(row) => Ok((row[r].clone(), row)),
        Err(err) => Err(err),
    })
    .collect::<Result<_>>()?;

If there are duplicate values, I think only the last pair will exist in the map.

Am I missing something? I appreciate any feedback. Thanks!
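The duplicate-key concern can be demonstrated, and the usual fix sketched, by building the hash table as a multimap. This is a simplified inner join over toy row types, not toyDB's actual code: keeping a Vec<Row> per key means no row is dropped when the join key is not unique.

```rust
use std::collections::HashMap;

// Toy stand-ins for the engine's value and row types.
type Value = String;
type Row = Vec<Value>;

/// Inner hash join on left column `l` and right column `r`, keeping *all*
/// right-hand rows per key (HashMap<Value, Vec<Row>> rather than
/// HashMap<Value, Row>).
fn hash_join(left: &[Row], l: usize, right: &[Row], r: usize) -> Vec<(Row, Row)> {
    // Build phase: a multimap from join key to every matching right row.
    let mut table: HashMap<Value, Vec<Row>> = HashMap::new();
    for row in right {
        table.entry(row[r].clone()).or_default().push(row.clone());
    }
    // Probe phase: emit one output pair per matching right row.
    let mut out = Vec::new();
    for lrow in left {
        if let Some(matches) = table.get(&lrow[l]) {
            for rrow in matches {
                out.push((lrow.clone(), rrow.clone()));
            }
        }
    }
    out
}

fn main() {
    let left: Vec<Row> = vec![vec!["1".to_string(), "x".to_string()]];
    let right: Vec<Row> = vec![
        vec!["1".to_string(), "a".to_string()],
        vec!["1".to_string(), "b".to_string()], // duplicate join key
    ];
    // Both right rows match, so the join yields two pairs; a map holding a
    // single Row per key would have kept only one of them.
    assert_eq!(hash_join(&left, 0, &right, 0).len(), 2);
}
```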

Questions about storage and indices

Hello @erikgrinaker!

Thanks for this amazing project. It actually inspired me to create my own toy database as well. I'm also following Designing Data-Intensive Applications and Database Internals. I'm not sure if you have time to answer them, but here are my questions:

  1. I recently read about LSM-trees in both books, and now I'm confused about your storage engine implementations. They are named log and KV in toyDB. To me, "log" is probably an LSM-tree, but what is KV then?
  2. As far as I understand, all entries are stored as KV pairs such as "{table_id}_{primary_id}_{col}": row[col]. Is that correct?
  3. How did you implement SQL-level indexes? Separate B-trees? I couldn't locate the index creation logic in the codebase.

Thanks in advance!

Improve cluster status

Status calls should have more content (e.g. cluster peer status) and better formatting.

Handle aborts and disconnects in SQL session

Aborts caused by Raft leadership changes or client disconnections may leave stale transactions. The SQL session and client should try to recover these, by rolling back on drop and resuming transactions on commit/rollback error.

Can't connect to the database

I've cloned the toydb repository and ran cargo build --release successfully. I then went into the target/release directory and tried to run ./toysql, and got the following error message:

Error: Internal("No connection could be made because the target machine actively refused it. (os error 10061)")

I ran the same command with administrator privileges and got the same error message. I'm on Windows 10 x64 Version 2004, OS build 20201.1000.

Error macros

Or some other convenient way to create errors.

Run with error

When I ran the command below, I got the following error:

cargo run --release --bin toysql
the following packages contain code that will be rejected by a future version of Rust: nom v5.1.2
note: to see what the problems were, use the option --future-incompat-report, or run cargo report future-incompatibilities --id 1
Running target/release/toysql
Error: Internal("Connection refused (os error 111)")

My operating system is Ubuntu 22.04 and the Rust version is 1.70. Please help me solve this issue. Thanks!

Raft panic on call during startup

Submitting a call immediately after starting a cluster causes a panic:

Error: Internal("Unexpected Raft mutate response MutateState { call_id: [250, 91, 73, 148, 245, 202, 77, 21, 184, 190, 68, 192, 45, 3, 88, 140], command: [129, 0, 129, 0, 192] }")

Seems like the node returns the message we submitted, or something. Probably only applies to candidates, since it works fine once the cluster settles.

Schema cache

The KV SQL engine should cache schema lookups.

This needs to be implemented at the MVCC level so that versioning works, probably by specifying prefixes that should be cached. The downside of this is that it needs to go through serialization.

MVCC parallelism

Hey @erikgrinaker,

I read the architecture guide, but maybe I missed this somehow. I was wondering how MVCC storage behaves in terms of parallelism. I see that there is a Mutex here. How does that affect execution exactly? Is there really only one transaction running at a time? This would mean transactions are concurrent, but still not parallel, as I understand it.

Sorry if I'm missing something fundamental.

pub struct MVCC<E: Engine> {
    /// The underlying KV engine. It is protected by a mutex so it can be shared between txns.
    engine: Arc<Mutex<E>>,
}

Thanks in advance.

Storage engine

Probably B-tree based, possibly LSM-trees with B-tree indexes.

Should have config options to switch storage engine, both for Raft and SQL.

Resolve fields in planner

Plan nodes should not have unqualified or aliased names; they should all be resolved and validated during planning, to simplify the optimization stage.
