detsys-testkit

This is an alpha release, please don't share it widely just yet, thanks!

System tests are usually slow and non-deterministic and should therefore be used sparingly -- this project tries to turn that idea on its head.

By abstracting out the non-deterministic parts of the programs we write, e.g. concurrency, and putting them behind interfaces, we can then implement these interfaces twice:

  1. using the non-deterministic primitives, e.g. send and receive over the network, this gives you back the original program which you had before introducing the abstraction;
  2. using functions that deterministically manipulate a pure data-structure, e.g. queues representing incoming network messages, this gives you a simulation of what your original program will do once you use the non-deterministic interface.

This simulation forms the basis for our system tests. The two key aspects of the simulation are that it's deterministic (which doesn't mean there's no randomness) and that we can control the passage of time (so we don't have to wait for timeouts etc), which together give us fast and deterministic tests.
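
As a minimal sketch of the idea in Go -- hypothetical types, not the project's actual interface -- the same network abstraction can be backed either by real sockets or by a plain in-memory queue:

    package example

    // Network is a hypothetical abstraction over message passing; the point is
    // only that it can be implemented twice.
    type Network interface {
        Send(to string, msg []byte)
        Receive() (from string, msg []byte)
    }

    // udpNetwork would be the production implementation, using real sockets
    // (details elided).
    type udpNetwork struct{ /* e.g. a *net.UDPConn */ }

    // simNetwork is the simulation implementation: a deterministic, in-memory
    // queue of messages that a test scheduler can manipulate freely.
    type simNetwork struct{ inbox []envelope }

    type envelope struct {
        from, to string
        msg      []byte
    }

    func (n *simNetwork) Send(to string, msg []byte) {
        n.inbox = append(n.inbox, envelope{from: "self", to: to, msg: msg})
    }

    func (n *simNetwork) Receive() (string, []byte) {
        e := n.inbox[0] // a fuller sketch would handle the empty-inbox case
        n.inbox = n.inbox[1:]
        return e.from, e.msg
    }

Running the original program against the simulated implementation is what the rest of this README refers to as the simulation.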

Warning: This is not a black-box testing approach, and unless your system already hides all non-determinism behind interfaces, it's likely that your system will need a substantial rewrite, which is probably not worth it unless your system is a distributed system that hasn't already been battle-tested.

For more about simulation testing see this document.

Getting started

There are many parts that need to be understood before we can explain how simulation testing works in detail. It's perhaps easier to start with a concrete example. The following steps get you started using some of the tools on a simple distributed register example:

  1. Install nix;
  2. Clone this repository;
  3. cd into the repository;
  4. Issue nix-build to compile all binaries;
  5. Run nix-env -if default.nix to package up and install all binaries;
  6. Prepare the database with detsys db up;
  7. Start the scheduler component with detsys scheduler up.

We are now ready to start testing.

In case you don't have a system of your own to test, which is very likely if you are just getting started, you can try it out on one of the examples provided in this repository using the instructions below:

  1. Change directory to where the distributed register example lives with cd src/sut/register;
  2. Run the first test from the test-suite by typing go test -tags json1 -run 1;
  3. Notice how test id 0 and run id 3 don't pass the analysis. To debug that run, enter detsys debug 0 3;
  4. Navigate up and down through the messages with your arrow keys or j and k, and finally exit the debugger with q or Ctrl-c.

At this point it might make sense to have a look at the go test-suite in example_test.go and the actual implementation of the distributed register. The implementation consists of two parts: a front-end (see frontend1.go) which receives client requests and propagates them to two distributed registers (register.go). The client can request a write to or a read from the registers, and the idea with there being two registers is that there is some form of redundancy. The implementation is flawed however, as you might have guessed from the fact that the test fails. Can you figure out exactly what the implementation does and why it's wrong by using the debugger alone? For further refinements of the implementation see frontend{2,3,4}.go and see if you can figure out why they do or don't work as well.

More about nix can be found here, including a recommended developer workflow and why it's preferable to docker.

How it works on a higher-level

Now that we have looked at a concrete example we are in a better position to explain the high-level picture of what is going on.

Testing, in general, can be split up into different stages. First we write or generate a test case, then we execute that test case, afterwards we check that the outcome of the execution matches our expectations, and finally we might want to aggregate some statistics about some subset of all tests that we have done for further analysis.

When testing concurrent programs or distributed systems there are typically many ways to execute a single test case. These different executions are the non-determinism typically associated with concurrent or distributed systems. For example, when two concurrent requests are made against some service, we typically cannot know which finishes first.

If we want our tests to be deterministic we need to be able to control the scheduling of threads, or, in the distributed systems case, the scheduling of when messages between the nodes of the system arrive at their destination.
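
Controlled does not have to mean fixed: a seeded pseudo-random schedule is still deterministic. A toy illustration in Go (not project code):

    package main

    import (
        "fmt"
        "math/rand"
    )

    func main() {
        msgs := []string{"write(1)", "ack", "read()", "value(1)"}
        // The same seed always produces the same "random" delivery order,
        // which is what makes a randomised schedule reproducible.
        r := rand.New(rand.NewSource(42))
        r.Shuffle(len(msgs), func(i, j int) { msgs[i], msgs[j] = msgs[j], msgs[i] })
        fmt.Println(msgs) // identical output on every run with seed 42
    }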

Distributed systems typically consist of several different components that are not necessarily written in the same programming language. Distributed systems need to be resilient in the presence of failures. Distributed systems are also long running, and as they run they might accumulate "junk" which makes them fail over time. Distributed systems need to be able to be upgraded without downtime; to do so they need to allow different software versions of the components to be compatible.

With all these constraints in mind, let us sketch the high-level design of this project.

In order to be able to test how a system performs over time, we must be able to test over time. That is: simulate a week's worth of traffic today, make sure everything looks fine, close everything down, and then the day after be able to pick up where we left off and continue simulating another week's worth of traffic, and so on. This type of workflow is also useful when you need to explore a huge state space with limited resources, e.g. nightly CI runs, where you want to avoid retesting the same thing. To facilitate the bookkeeping necessary to do so we introduce a database for the testing work.

The different stages of testing, e.g. generating, executing, checking, etc, become separate processes which can be run independently and which communicate via the database. This allows for things like generating one test case, executing it several times (especially important for concurrent or distributed systems), and checking each execution several times, all at different moments in time (as we learn new things to assert, we can check old execution traces without rerunning the test).

In order to avoid the non-determinism of distributed systems we assume that all components that rely on network communication implement a reactor-like interface.

This interface abstracts out the concurrency from the implementation of the system under test (SUT) and lets us create fake communication channels between the nodes of the system. In fact we route all network messages through a scheduler which assigns arrival times to all messages and thereby controls in which order messages arrive, hence eliminating the non-determinism related to networking in a distributed system.

Another big source of non-determinism in distributed systems is faults. Messages might arrive late, not arrive at all, or nodes might crash, etc. The fault-space grows very quickly, so in order to achieve any meaningful coverage we use lineage-driven fault injection. In short, it starts with a successful test execution, tries to figure out which steps in the execution were crucial to the outcome, and bases the fault injection on that analysis.
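
Roughly, and with entirely made-up types (the actual ldfi component works differently and hands the problem to a SAT solver), the loop looks something like this:

    package ldfi

    import "fmt"

    // Fault is a hypothetical fault description: a message omission or a node
    // crash at some logical time.
    type Fault struct {
        Kind     string // "omission" or "crash"
        From, To string // edge for omissions; To is the crashed node for crashes
        At       int    // logical time
    }

    type FaultSet []Fault

    // explore sketches the lineage-driven loop: run the test, find out which
    // steps the good outcome depended on (its lineage), and propose fault sets
    // that cut those steps, until the budget runs out.
    func explore(run func(FaultSet) (passed bool, lineage []Fault), budget int) []FaultSet {
        var failing []FaultSet
        tried := map[string]bool{}
        candidates := []FaultSet{nil} // start with the fault-free run
        for i := 0; i < budget && len(candidates) > 0; i++ {
            fs := candidates[0]
            candidates = candidates[1:]
            passed, lineage := run(fs)
            if !passed {
                failing = append(failing, fs)
                continue
            }
            // The run still passed, so every step in its lineage suggests a new
            // fault set to try (the real component instead asks a SAT solver for
            // an untried set satisfying the accumulated constraints).
            for _, f := range lineage {
                next := append(append(FaultSet{}, fs...), f)
                if key := fmt.Sprint(next); !tried[key] {
                    tried[key] = true
                    candidates = append(candidates, next)
                }
            }
        }
        return failing
    }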

Because the SUT can be implemented in multiple different languages, there's a small shim on top of the SUT, called executor, which receives messages from the scheduler and applies them to the SUT. The idea being that this shim can easily be ported to other programming languages.
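
A bare-bones sketch of such a shim in Go; the message format and reactor interface here are made up, the real executor and lib differ:

    package executor

    import (
        "encoding/json"
        "net/http"
    )

    // Reactor is a hypothetical version of the reactor-like interface: the SUT
    // consumes one message and returns the messages it wants to send in response.
    type Reactor interface {
        Receive(from string, msg json.RawMessage) []OutgoingMessage
    }

    type OutgoingMessage struct {
        To  string          `json:"to"`
        Msg json.RawMessage `json:"msg"`
    }

    type incoming struct {
        From string          `json:"from"`
        To   string          `json:"to"`
        Msg  json.RawMessage `json:"msg"`
    }

    // Serve applies messages forwarded by the scheduler to the right reactor and
    // returns the resulting outgoing messages for the scheduler to timestamp.
    func Serve(reactors map[string]Reactor) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            var in incoming
            if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            out := reactors[in.To].Receive(in.From, in.Msg)
            json.NewEncoder(w).Encode(out)
        })
    }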

One consequence of taming the non-determinism is that if we find a problem in our SUT, then we can examine the execution in detail using a time-traveling debugger which allows us to step forwards and backwards.

Here's a high-level diagram of the project:

[Diagram: control structure]

As you can see, the developer can generate test cases, execute them, and check their traces, all in separate steps, by calling the test library. Typically this is done from the test-suite of the software under test (SUT). There's also a command-line tool which exposes much of the test library together with some other conveniences, such as migrating the database, starting the scheduler and opening the debugger, as we saw in the getting started section above -- things you typically only want to do once and therefore outside of the SUT's test-suite.

More examples

  • Reliable broadcast example from the Lineage-driven fault injection paper (2015) by Peter Alvaro et al.

How it works on a lower-level

For the typical user it should be enough to understand the high-level picture, the examples and the library API for the programming language they want to write the tests in (currently only Golang is supported).

However, if you are curious or want to contribute to the project itself, it's also helpful to understand more about the components themselves, and that's what this section is about.

Let's start by introducing all components:

  • checker: checks if a test case execution adheres to a given black-box specification, uses Jepsen's Elle transactional consistency checker;

  • cli: discoverable and unifying interface between the developer and rest of the components;

  • db: takes care of migrations for the sqlite database, typically only needed once. The schema has one table which is a simple append-only event store of json events, and a bunch of SQL VIEWs which are aggregated from the event store;

  • debugger: visual aid for figuring out what happened during a test case execution;

  • executor: programming language specific shim on top of the software under test (SUT) for converting programming language agnostic messages from the scheduler to the SUT;

  • generator: tool for generating test cases (currently merely a placeholder, not suitable for use);

  • ldfi: suggests faults to inject given past test case executions;

  • lib: a Golang library which provides the reactor-like interface and exposes all components, so they can be easily called from a test-suite (or from the command-line via cli);

  • logger: Batches writes to the db, for better performance;

  • ltl: checks if a test case execution adheres to a given white-box specification, uses linear temporal logic for concise assertions about the global state of all nodes in a SUT;

  • scheduler: loads a generated test case and executes it by forwarding the messages involved to the appropriate executors one at a time. The new messages that the executor gets back from the SUT while processing a forwarded message are sent back to the scheduler, which randomly, but deterministically using a seed, assigns arrival times to them, thereby simulating concurrency; the process continues until there are no messages left or some max execution time has been reached. See the following document for pseudo code of the scheduler implementation, and the Go sketch after this component list for the rough shape of the algorithm;

  • sut/register: example test-suite for a simple distributed register;

  • sut/broadcast: example test-suite for a simple broadcast protocol, taken from the lineage-driven fault injection paper.
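
Here is the Go sketch referred to in the scheduler entry above. The types are hypothetical and the real scheduler keeps far more bookkeeping; this only conveys the shape of the algorithm:

    package scheduler

    import (
        "container/heap"
        "math/rand"
    )

    type Message struct {
        From, To string
        Body     []byte
        Arrival  int64 // simulated arrival time
    }

    // agenda is a min-heap of pending messages ordered by arrival time.
    type agenda []Message

    func (a agenda) Len() int            { return len(a) }
    func (a agenda) Less(i, j int) bool  { return a[i].Arrival < a[j].Arrival }
    func (a agenda) Swap(i, j int)       { a[i], a[j] = a[j], a[i] }
    func (a *agenda) Push(x interface{}) { *a = append(*a, x.(Message)) }
    func (a *agenda) Pop() interface{} {
        old := *a
        m := old[len(old)-1]
        *a = old[:len(old)-1]
        return m
    }

    // Run delivers messages in arrival-time order; deliver hands a message to an
    // executor and returns the SUT's replies, which get random (but seeded, hence
    // reproducible) arrival times before being put back on the agenda.
    func Run(seed int64, initial []Message, maxTime int64, deliver func(Message) []Message) {
        rnd := rand.New(rand.NewSource(seed))
        var pending agenda
        for _, m := range initial {
            heap.Push(&pending, m)
        }
        now := int64(0)
        for pending.Len() > 0 && now < maxTime {
            next := heap.Pop(&pending).(Message)
            now = next.Arrival
            for _, reply := range deliver(next) {
                reply.Arrival = now + 1 + rnd.Int63n(100) // simulated network delay
                heap.Push(&pending, reply)
            }
        }
    }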

Because everything is deterministic, there are a few parameters which uniquely determine the outcome of a test run. They are:

  • the test case;
  • the git commit hashes of the SUT and all above components;
  • the scheduler seed;
  • the faults that get injected.

A single test case might be executed using different (but compatible) versions (git commits) of the SUT or project components, or using a different scheduler seed. Likewise a single test case execution might be analysed several times, for example using different checkers.

The database is basically organised to facilitate the above and enable as much caching as possible. The components log everything that they are doing to the db in a way that can be reproduced. In the future we hope to enable sharing of the db between developers and CI, to avoid rerunning the same tests.

How to contribute

Have a look at the ROADMAP.md for a high-level overview of where we are heading and see the file CONTRIBUTING.md for more workflow related stuff.

See also

License

See the file LICENSE.

Issues

More atoms in LTL checker

At the moment we only have = on the state of a node (either before or after the event); it would be useful to be able to use more when expressing properties:

  • Having access to the event:
    • The message in the event
    • The sender/receiver of the event
    • The time (simulated) that it was sent/received
  • Potentially having access to faults? To know if a node has crashed.

In order to support time, besides more predicates for talking about durations, we might also want the local binding operator from hybrid logics. Then we could write formulas like:

[] (request -> \down w. (<> \w'. response /\ Sub(w.time, w'.time) < 10 seconds))

But there might be easier extensions to work with for describing such properties.
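
For reference, a bare-bones checker with only state atoms could be sketched like this in Go (hypothetical types, not the actual ltl component); this issue is essentially about enriching the Atom case with event, sender/receiver, time and fault information:

    package ltl

    // State is one node's dumped state at one logical time step.
    type State map[string]interface{}

    // Trace is the global state of all nodes over logical time: trace[t][node].
    type Trace [][]State

    // Formula is a simplified LTL syntax: atoms on node state, negation,
    // conjunction, "always" ([]) and "eventually" (<>).
    type Formula interface{ holds(tr Trace, t int) bool }

    type Atom struct {
        Node int
        Pred func(State) bool
    }
    type Not struct{ F Formula }
    type And struct{ L, R Formula }
    type Always struct{ F Formula }
    type Eventually struct{ F Formula }

    func (a Atom) holds(tr Trace, t int) bool { return a.Pred(tr[t][a.Node]) }
    func (n Not) holds(tr Trace, t int) bool  { return !n.F.holds(tr, t) }
    func (c And) holds(tr Trace, t int) bool  { return c.L.holds(tr, t) && c.R.holds(tr, t) }

    func (g Always) holds(tr Trace, t int) bool {
        for i := t; i < len(tr); i++ {
            if !g.F.holds(tr, i) {
                return false
            }
        }
        return true
    }

    func (e Eventually) holds(tr Trace, t int) bool {
        for i := t; i < len(tr); i++ {
            if e.F.holds(tr, i) {
                return true
            }
        }
        return false
    }

    // Check evaluates a formula at the start of the trace.
    func Check(f Formula, tr Trace) bool { return f.holds(tr, 0) }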

Event loop run-time system

So far the design of this project has been very top-down -- from specification
in our heads, down to the software under test (SUT). Basically this can be
summarised by saying that we've introduced enough structure to allow for
deterministic testing, i.e. the Reactor interface, etc.

It's time to start thinking bottom-up -- from the hardware and OS level, up to
the SUT. Trying to answer questions like:

  • convenience, i.e. what structure do we need in order to conveniently implement the SUT? Adding
    the ability to broadcast is an example of this kind of thinking;
  • "real" implementation rather than "simulation" implementation;
  • deployment;
  • operator experience (application specific observability);
  • upgrades;
  • performance;
  • security;
  • high-availability;
  • scalability.

Fortunately a lot more people care about most of these topics so we have a lot
more inspiration to draw from than when tackling the top-down/simulation testing
topic.

Let's address these topics in turn.

Solution sketch for convenience, "real" implementation and performance

Most of these ideas come from: the E programming language, Goblins, CapTP, Cap'n Proto.

  • One event loop per OS process / CPU core

  • Several event loops on a single computer are possible

  • Event loops can also run on remote computers

  • Actors are spawned on event loops

  • Actors can send synchronous messages to actors in the same event loop and
    get replies immediately

  • Actors can communicate across event loops, but have to do so asynchronously.
    In order to not block the event loop, when an async message is sent a
    continuation/callback must be provided which will be invoked by the event
    loop once the reply comes back from the remote actor.

  • Similarly filesystem I/O can be implemented to not block the event loop.

  • An event loop maintains the following data (see the Go sketch after this list):

    1. a heap of actors
    2. a stack of local messages (stack grows as local actors call other local
      actors, and shrinks as the replies come back in)
    3. a queue of remote messages (these get dequeued onto the stack)
  • Advantages over the current situation

    • Non-blocking event loop
    • Calls to local actors are synchronous, so we can avoid manually composing
      two actors in order to avoid passing messages between them
      (asynchronously)
    • Related to the above, we get replies so we can avoid maintaining
      "sessions" inside the actors
  • Possible golang event loop libraries we could build on top of: panjf2000/gnet and tidwall/evio;
    also see the architecture of nginx for inspiration.
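
Here is the Go sketch referred to in the list above; the types are hypothetical and only meant to make the three pieces of data concrete:

    package eventloop

    type ActorRef int

    type Message struct {
        From, To ActorRef
        Body     interface{}
        // OnReply is the continuation invoked when the reply to an asynchronous,
        // cross-event-loop message comes back.
        OnReply func(reply Message)
    }

    type Actor interface {
        Receive(msg Message) []Message
    }

    type EventLoop struct {
        actors map[ActorRef]Actor // 1. a heap of actors
        stack  []Message          // 2. local (synchronous) calls in progress
        remote chan Message       // 3. a queue of incoming remote messages
    }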

Solution sketch for operator experience

In addition to what we sketched out to store inside the event loop, we could
also store received messages and state diffs in a ring-buffer of some size (in
order to use a fixed amount of memory/disk). If we can make the ring-buffer
remotely dumpable, we could use it to build a debugger similar to what we
currently have for tests, but for live networks.

We could also make it possible to connect to an event loop and print things like:

  • current actors' state;
  • supervisor logs about the last crashes;
  • whatever interesting statistics we'd like to keep track of.

One could also imagine being able to patch the state of an actor that's in some way stuck.

Solution sketch for deployment and high-availability (of actors)

Most of the following ideas come from Erlang.

According to Wikipedia:

In 1998 Ericsson announced the AXD301 switch, containing over a million lines
of Erlang and reported to achieve a high availability of nine "9"s

One of the key ingredients behind this achievement is something called
supervisor trees. Before explaining the tree aspect, let's first explain what a
supervisor is. A supervisor is an actor whose sole purpose is to make sure
that the actors below/underneath it are healthy and working. The way this is
implemented is that the supervisor spawns its children in such a way that if a
child throws an exception the supervisor gets that exception. Each supervisor
has a restart strategy associated with it: for example, if one of its children
dies then restart only that child; another strategy would be to restart all
children, etc. The supervisor also has a max restart intensity: if more than X
restarts happen within some period of time, then the supervisor terminates all
its children and then terminates itself.

This is where supervisor trees come in. A supervisor might have other
supervisors as its children and together they form a tree with workers at the
leaves. So if a supervisor terminates itself, the supervisor above it in the
tree will get notified and can restart things according to its strategy etc.

A correctly organised supervisor tree spanning over multiple machines can be
extremely resilient. Because the tree is hierarchical and ordered (from left to right),
we also get graceful degradation where if one subtree fails the components in the rest
of the tree can still provide partial service to client requests.
There's also another consequence of this which is more subtle: crashing is
cheap, because restarts are cheap (fine-grained and merely recreating an object on the heap,
rather than coarse-grained and restarting a whole docker image as in the Kubernetes case).
We know that:

almost all (92%) of the catastrophic system failures are the result of
incorrect handling of non-fatal errors explicitly signaled in software.

and

in 58% of the catastrophic failures, the underlying faults could easily have
been detected through simple testing of error handling code.

so by avoiding writing error handling code (just crash instead) we actually
avoid writing a lot of bugs. If the specification isn't clear about some edge
case, simply crash the program there instead of trying to be clever about it:
the supervisor will do the restarts, the client won't notice any downtime
(perhaps having to do a retry), and if the edge case is rare enough it will work
after the restart (clearing any junk from the state). The developers will get
notified that restarts have happened and can then choose to fix the problem in a
principled way (or perhaps not, if it's rare enough).

Supervisor trees can also be used as units of deployment: instead of spawning an
actor on an event loop, we could spawn a supervisor on an event loop and have it
spawn its children and make sure they stay alive.

We could imagine giving the root supervisor special treatment, where we let
systemd, Kubernetes or whatever make sure that the event loop stays alive, and
whenever it dies it gets restarted and the root supervisor spawned on it.

There are some existing implementations of supervisors in golang, e.g.
go-sup (seems unmaintained) and suture, which we might be able to use,
but it's also fairly easy to implement from scratch (~250 lines).
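
A minimal sketch of the core of such a supervisor in Go, using a one-for-one restart strategy and a crude restart-intensity limit (an illustration, not the ~250 line implementation alluded to above):

    package supervise

    import (
        "log"
        "time"
    )

    // Supervise runs child, restarting it whenever it panics. If more than
    // maxRestarts restarts happen within window, the supervisor gives up and
    // itself panics, so that the supervisor above it in the tree can take over.
    func Supervise(child func(), maxRestarts int, window time.Duration) {
        var restarts []time.Time
        for {
            if crashed := run(child); !crashed {
                return // child terminated normally
            }
            now := time.Now()
            restarts = append(restarts, now)
            // Keep only the restarts that fall within the window.
            recent := restarts[:0]
            for _, t := range restarts {
                if now.Sub(t) <= window {
                    recent = append(recent, t)
                }
            }
            restarts = recent
            if len(restarts) > maxRestarts {
                panic("supervisor: max restart intensity reached")
            }
            log.Println("supervisor: restarting child")
        }
    }

    // run executes the child and reports whether it panicked.
    func run(child func()) (crashed bool) {
        defer func() {
            if r := recover(); r != nil {
                crashed = true
            }
        }()
        child()
        return false
    }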

Solution sketch for upgrades

This idea comes from Erlang.

If actors were serialisable (this is a BIG if in golang, but trivial in e.g.
Clojure), we could send a new version over the wire to the event loop, which
could swap out the old version (after serving any outstanding requests) for the
new one with zero downtime. You can even imagine automatic rollback to the
previous version if too many failures are observed.

Solution sketch for security

The E family has a nice story for security based on capabilities, which
can be implemented using scope alone. The basic idea is
that in order to, for example, communicate with an actor you need a reference
to it and you can only get a reference to it if you spawned the actor or got
sent a message containing the reference. This solves the authorisation problem.
The idea can be extended to any other resource, e.g. "you can only request files
from some particular path of my filesystem if you got a token for that and only
I can create such tokens".
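
The gist can be sketched in a few lines of Go (illustrative only): a reference to an actor is an unforgeable value, and the only ways to obtain one are to spawn the actor yourself or to be handed the reference in a message.

    package caps

    // Ref is a capability: an unforgeable reference to an actor, here simply a
    // send-only channel. Without the reference you cannot talk to the actor at
    // all, and there is no global registry to look actors up in.
    type Ref chan<- string

    // Spawn creates an actor and returns the only reference to it.
    func Spawn(behaviour func(msg string)) Ref {
        ch := make(chan string)
        go func() {
            for msg := range ch {
                behaviour(msg)
            }
        }()
        return ch
    }

Passing a Ref along inside a message is then exactly what it means to grant someone the capability to talk to that actor.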

Capability-based security is different from the more commonly used "access
control list" security: in the filesystem example above, access control
lists correspond to UNIX-style filesystem permissions, i.e. this group has
access to this directory. For a longer comparison between the two approaches see
the following paper.

Solution sketch for automatic scaling

One could imagine having the root supervisors be able to monitor the resource
consumption of their event loop and, if it goes above or below some value,
automatically provision or shut down computers and spawn additional workers, or
kill idle ones, from some load-balanced pool.

The middle

Once we are done with this bottom-up approach, we'll also have to consider the
middle -- the gap that's left after top-down and bottom-up. It's tempting to
avoid thinking too much about this now, as it might constrain our bottom-up
thinking unnecessarily, but we will need to reconcile the tension between the
two eventually.

package priorities conflict while following getting started guide

I ran into this issue when running nix-env from the getting started guide:

$ nix-env -if default.nix 
installing 'detsys-latest' 
building '/nix/store/g3ny0ykryd84dwvi20yjwynxccx22v0z-user-environment.drv'... 
error: packages '/nix/store/4cdb7dvv8kwhh1qbdgs5nxa5d4318nnl-db-latest/migrations/0_create_event_log.sql' and '/nix/store/2m6392bbssl39kyk3cqgralgldgb8hl6-detsys-latest/migrations/0_create_event_log.sql' have the same priority 5; use 'nix-env --set-flag priority NUMBER INSTALLED_PKGNAME' to change the priority of one of the conflicting packages (0 being the highest priority) builder for '/nix/store/g3ny0ykryd84dwvi20yjwynxccx22v0z-user-environment.drv' failed with exit code 1 
error: build of '/nix/store/g3ny0ykryd84dwvi20yjwynxccx22v0z-user-environment.drv' failed

I attempted to change the priority of one of the packages but that does not seem to change anything

(seanlyons)(15:44:25): ~/Code/symbiont.io/detsys-testkit $ nix-env --set-flag priority 0 /nix/store/2m6392bbssl39kyk3cqgralgldgb8hl6-detsys-latest/migrations/0_create_event_log.sql
error: selector '/nix/store/2m6392bbssl39kyk3cqgralgldgb8hl6-detsys-latest/migrations/0_create_event_log.sql' matches no derivations

Any advice on how to proceed?

Useful Context

$ nix --version
nix (Nix) 2.3.10

$ git reflog
544836f (HEAD -> main, origin/main, origin/HEAD) HEAD@{0}: pull: Fast-forward

Use a json encoder that supports private fields in golang

In order to be able to make assertions about how the state of all components changes over time and to make debugging easier, we dump and save the state of components in json format when they receive an event.

In golang we currently use encoding/json for this, but given that it simply drops private fields this isn't ideal (which leads to workarounds such as introducing an extra datatype which exposes the private fields).

Perhaps another library which also encodes private fields can be used, e.g. https://github.com/json-iterator/go.

Build and run issues on a Mac

Hi, I'd like to report some errors that I ran into on my Mac while going through the getting started guide. Maybe someone will find it useful.

Hash mismatches for go modules

Error

error: hash mismatch in fixed-output derivation '/nix/store/0jgjkfkdw76dkg6vvs2m5p5qw820ljvp-cli-latest-go-modules.drv':
         specified: sha256-NJOPtpqMUEe0F2TIizrUi40l25e0t0eiQr/Y0acQr4I=
            got:    sha256-JPfNPmp+ufZsrCjl7lxdSFefn9SChbIV/8Og6ykOb/Q=
error: 1 dependencies of derivation '/nix/store/s4b6msbg6lysz1k3vhlkz5xva7pi9jfx-cli-latest.drv' failed to build
error: 1 dependencies of derivation '/nix/store/crmx9q265fi88iwx603kqgp3zk7p7lwr-detsys-latest.drv' failed to build

Workaround

Use the "got" hashes.

Comments

Not sure why those have changed. Either I'm using a different version of go or it's platform-dependent. buildGoModule seems to account for that (at least to some extent) with proxyVendor (see here), however, I'm neither a go nor a nix expert. I also checked all minor versions between 1.15 and 1.19 and I was getting different hashes each time.

Missing go.sum entries

Errors

go: github.com/symbiont-io/detsys-testkit/src/[email protected] requires
	go.uber.org/[email protected]: missing go.sum entry; to add it:
	go mod download go.uber.org/zap
missing go.sum entry for module providing package go.uber.org/atomic (imported by go.uber.org/zap); to add:
	go get go.uber.org/[email protected]

Fix (as above)

go mod download go.uber.org/zap
go get go.uber.org/[email protected]

Segmentation fault when running tests

Error

--- FAIL: TestRegister1 (0.00s)
panic: Binary was compiled with 'CGO_ENABLED=0', go-sqlite3 requires cgo to work. This is a stub
	panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x114d41e]

Workaround

Running export CGO_ENABLED=1 just before the tests seems to fix the issue. I'm confused by this, as the default.nix files contain the correct export, so everything should have been built correctly.

Missing dependency

Error

Error occured during analysis:
exit status 1
Exception in thread "main" java.util.concurrent.ExecutionException: java.io.IOException: Cannot run program "dot": error=2, No such file or directory

Fix

nix-env -iA nixpkgs.graphviz (you may want to add it to your Nix dependencies somewhere)

Other errors

... that could potentially find their place in some troubleshooting guide or at least can stay here for reference

Nix search path issue

Error
file 'nixpkgs/lib' was not found in the Nix search path (add it using $NIX_PATH or -I), at /nix/store/f0my6g5a4d353j7h3771baqfn96f1a3v-gitignore-src/default.nix:1:16
Fix

source ~/.nix-profile/etc/profile.d/nix.sh && nix-channel --update (trial and error from various threads; the combination of these 2 seems to work)

Too many open files

Error
opening directory '/nix/store/iyxfjv4v79i38nj2qyr8m5rblp23affm-maven-compiler-plugin-3.1.jar': Too many open files
Fix

ulimit -n 10240 (obviously)

Estimate the search space

In order to display progress and to gauge how well our lineage-driven optimisations are working we should try to estimate the search space.

For a first draft, that also includes notes about the tricky parts, see #174.

(Event) logging improvements

Possible improvements to the event logging include:

  • Make it easier to read the event log (by pretty printing, filtering, search, etc);
  • Timestamp the log entry at the log site rather than once it reaches the db (which could be delayed due to the queue being long);
  • If relevant, have components include their version in the logged event (and perhaps tests shouldn't be runnable with uncommitted code);
  • Text logging (so we can get debug information interleaved with system events, e.g. network trace events);
  • Spans, wide-events or parent events (so we can group/indent children for readability, measure how long spans take to execute, etc);
  • Log errors before panicking;
  • "Verbosity level", "debug build" or similar to control/avoid logging non-essential events (text logging, performance metrics, etc);
  • Make it possible to point the debugger to the event log (or a view of it) for debugging the test system rather than the SUT.

A couple of notes:

  • The "verbosity level" should likely be part of the "create run" event, so that in case it changes then the run is rerun (once we got hash identifiers #65);
  • Currently our db reads might be stale (if lots of events are written, a queue builds and then a read happens), we need to check if this ever can becomes a problem. This was fixed in #104.

Automate changelog generation

There are many tools available that allow a changelog to be automatically generated from the commit log and/or closed PRs/issues.

Which tool would produce the best result in our case? If we rely on semantic commit log messages, then we should also document those in CONTRIBUTING.md.

Add support for more different kinds of faults

The faults we currently support are:

  • Message omissions (which is enough to simulate partitions);
  • Fail-stop failures, i.e. permanent crashes of nodes.

These are the same faults as used in the lineage-driven fault injection papers.

We also partially support:

  • Messages being delayed (latency) or reordered.

These arise naturally from the scheduler choosing arrival times for messages (note however that these faults are introduced randomly via the seed of the run and are not subject to lineage-driven optimisations).

List of other crash faults we'd like to support:

There are also many byzantine faults one can think of, which basically boil down to:

  • Arbitrarily change the state of a node at any time;
  • Arbitrarily change a message between two nodes while it's in transit.

For most of the faults above we know how to introduce them in a random fashion; the tricky part is to figure out how they interact with the lineage-driven optimisation.

The debugger starts slowly when there is a lot of network traffic

The sequence diagrams shown in the debugger are generated by calling plantuml once per network event. All these calls are done at the start of the debugger, which creates a slowdown if there are many network events.

Possible solutions:

  1. Generate the sequence diagrams lazily on demand;
  2. Generate the diagrams in the background somehow, outside of the debugger;
  3. Use something faster/draw our own diagrams.

Event loop metrics

Sketch of how we could collect metrics for measuring performance inside the event loop.

  • Histograms

    • logarithmic bucketing rather than sampling

      "Unlike popular metric systems today, this does not destroy the accuracy of
      histograms by sampling. Instead, a logarithmic bucketing function compresses
      values, generally within 1% of their true value (although between 0 and 1 the
      precision loss may not be within this boundary). This allows for extreme
      compression, which allows us to calculate arbitrarily high percentiles with no
      loss of accuracy - just a small amount of precision. This is particularly
      useful for highly-clustered events that are tolerant of a small precision loss,
      but for which you REALLY care about what the tail looks like, such as measuring
      latency across a distributed system." -- Tyler "spacejam" Neely

    • Simple interface, basically two functions:

      • measure(value: Double, h: Histogram)
      • percentile(p: Double, h: Histogram): Double
      • E.g.:
         h := newHistogram
         measure(1, h)
         measure(1, h)
         measure(2, h)
         measure(2, h)
         measure(3, h)
         assert(percentile(0.0, h), 1) // min
         assert(percentile(40.0, h), 1)
         assert(percentile(40.1, h), 2)
         assert(percentile(80.0, h), 2)
         assert(percentile(80.1, h), 3)
         assert(percentile(100.0, h), 3) // max
        
    • Implementation idea: any double can be compressed into one of 2^16 buckets,
      with less than 1% compression loss (using the natural logarithm function for
      compression and exponentiation for decompression, hence the name logarithmic
      bucketing); see the Go sketch at the end of this issue

    • Rust crate: https://github.com/spacejam/historian

    • Golang lib: https://github.com/spacejam/loghisto

  • Metrics

    • Merely a record of histograms and counters

    • Together they capture the following metrics (taken from https://sled.rs/perf.html#metrics):

      • latency - the time that an operation takes

      • throughput - how many operations can be performed in some unit of time

      • utilization - the proportion of time that a system (server, disk, hashmap,
        etc...) is busy handling requests, as opposed to waiting for
        the next request to arrive.

      • saturation - the extent to which requests must queue before being handled
        by the system, usually measured in terms of queue depth
        (length).

      • space - whoah.

    • E.g. one histogram for client request/response latency, another one for
      client req saturation (keeping track of what the queue depth was when the
      client req arrived), and a counter for throughput

    • Built into the SUT, deployment agnostic, could be sampled by e.g. prometheus
      or anything else?

  • Event loop metrics

    • USE (utilisation, saturation and errors)

      • We already mentioned client req/resp latency, client req saturation and
        throughput above;

      • Main event loop utilisation: keep track of time when finished processing
        last event, at the beginning of processing a new event measure the time
        between last event finished processing and now, add the difference to the
        sum of idle time;

      • We could do similar to above for async I/O worker thread pool also;

      • The async I/O work queue can also be measured for saturation same as the
        event queue;

      • Actor utilisation: # of messages sent to actor / # of total messages sent;

      • Actor space: one crude way would be to check the length of the json string
        when we serialise the state;

      • Errors?

    • TSA (Thread State Analysis)

  • What to optimise?

  • Questions

    • Regarding Neely's comment on sampling, does that mean important things can
      get lost when sampling?

    • Can we do causal profiling by virtually speeding up actors (i.e. slowing
      down all but one actor)?

    • In Coz (the causal profiler) they measure latency by:

      1. having a transactions counter that gets incremented when a client req
        arrives and decremented when a response is written back to the client
      2. having a counter for throughput which is increased on client resp
      3. using Little's law: latency = transactions / throughput.

      Is this anyhow more accurate than simply taking the difference in time
      between client req and resp? Or do they simply do this because it's more
      efficient in terms of CPU?

    • Using Little's law seems to make sense for client req/resp, because they
      have a clear notion of transaction, but internal messages between nodes
      don't have that and so we can't measure latency for those using the law?
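
The Go sketch referred to under "Implementation idea" above -- a logarithmically bucketed histogram with the measure/percentile interface. This is a rough illustration, not the historian/loghisto implementations linked above:

    package histogram

    import (
        "math"
        "sort"
    )

    const precision = 100.0 // more buckets => smaller compression loss

    type Histogram struct {
        counts map[int]uint64 // bucket index -> number of measurements
        total  uint64
    }

    func New() *Histogram { return &Histogram{counts: map[int]uint64{}} }

    // compress maps a positive value to a bucket index via the natural logarithm;
    // decompress maps a bucket index back to an approximation of the value.
    func compress(v float64) int   { return int(math.Round(math.Log(v) * precision)) }
    func decompress(b int) float64 { return math.Exp(float64(b) / precision) }

    func (h *Histogram) Measure(v float64) {
        h.counts[compress(v)]++
        h.total++
    }

    // Percentile returns (an approximation of) the value below which p percent
    // of the measurements fall, using a nearest-rank style lookup.
    func (h *Histogram) Percentile(p float64) float64 {
        if h.total == 0 {
            return math.NaN()
        }
        buckets := make([]int, 0, len(h.counts))
        for b := range h.counts {
            buckets = append(buckets, b)
        }
        sort.Ints(buckets)
        need := uint64(math.Ceil(p / 100.0 * float64(h.total)))
        if need == 0 {
            need = 1
        }
        var seen uint64
        for _, b := range buckets {
            seen += h.counts[b]
            if seen >= need {
                return decompress(b)
            }
        }
        return decompress(buckets[len(buckets)-1])
    }

With the five measurements from the example above (1, 1, 2, 2, 3) this reproduces the asserted percentiles to within the stated ~1% precision loss.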

Make it easy to find out when two tests diverge

In order to find non-determinisms in the test library or the software under test, it would be useful to be able to tell at which point in time two tests diverged.

This could be a separate program which goes through the test db and compares the (hash of the) output of the two tests stepwise in logical time, and if it finds a mismatch it prints the time of divergence. Further comparison can then be done by running two debuggers, one for each test, and navigating to the time of divergence (perhaps a flag can be added to the debugger which tells it to start displaying at some point in time).
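
The stepwise comparison itself is simple. A sketch in Go, assuming (hypothetically) that each test's per-step output can be read back as an ordered list of byte blobs:

    package divergence

    import "crypto/sha256"

    // FirstDivergence compares two executions step by step in logical time by
    // hashing each step's output, and returns the first step at which they
    // differ, or -1 if no divergence is found in the common prefix.
    func FirstDivergence(a, b [][]byte) int {
        n := len(a)
        if len(b) < n {
            n = len(b)
        }
        for t := 0; t < n; t++ {
            if sha256.Sum256(a[t]) != sha256.Sum256(b[t]) {
                return t
            }
        }
        return -1
    }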

Handle SUT panic gracefully

Currently, if the SUT panics the executor will also panic; it would be good to handle this a bit more gracefully.

The executor should detect if the SUT panics and, if so, report back to the scheduler that the SUT crashed. We could also create a new event that is emitted from the executor/scheduler when the SUT crashes.
The executor could try to get the last state of the reactor that crashed and emit that.
The scheduler could emit its current state, and then finish.

All checkers should probably check if such an event was emitted and, if so, fail directly.

Use bazel as build system

Currently go code can be built and tested using bazel. Other bazel related tasks:

  • Use bazel to build and test python (ldfi);
  • Use bazel to build and test clojure (checker, scheduler);
  • Use bazel to build uberjar and later GraalVM native image;
    • Fix problem with GraalVM image not finding clojure.core libraries at run-time for checker. Fixed in: #161;
    • Fix build on MacOS. rules_graal supports neither $location substitutions nor C dependencies, so currently the path to zlib on linux is hardcoded.
  • Use bazel to build and test Haskell code (ldfi2);
    • Fix problem with MacOS 11.2.2 Big Sur not finding C libraries: tweag/rules_haskell#1500;
    • Fix linking issue to z3. ldd ./bazel-bin/src/ldfi2/ldfi2 shows libz3.so => /home/.../detsys-testkit/./bazel-bin/src/ldfi2/../../_solib_k8/_U@haskell_Uz3_S_S_Clibz3___Ulib/libz3.so but after nix-env -if default.nix then ldd $(which detsys-ldfi) says libz3.so => not found (note that when ldfi2 is built with nix we don't get this problem). Fixed in #192.
  • Use bazel in combination with nix (rules_nixpkgs);
  • The --version flag doesn't work on Golang binaries built by bazel;
  • The --version flag doesn't work on Clojure binaries built by bazel. Fixed in #205;
  • The --version flag doesn't work on Haskell binaries built by bazel. Fixed in #198;
  • Figure out where to store db's migration files in the bazel build;
  • Fix Caused by: java.lang.Exception: No native library found for os.name=Linux, os.arch=x86_64, paths=[] problem for checker in the bazel build;
  • Fix panic: no such function: json_extract problem for debugger with bazel build;
  • Use bazel from within nix-build: tweag/rules_nixpkgs#73 (will this not ruin the bazel caching?);
  • Package up all detsys binaries built by bazel using nix-build: #151;
  • Do a nix post-install check on the packaged up binaries to ensure that they are all of the same version;
  • Figure out how to use bazel on CI;
  • Figure out how to write integration tests using detsys with bazel;
  • Make it easy to reproduce/rebuild all the components to some specific version (that was used in some test run, so that that test run could be reproduced);
  • Can we package up all binaries in a .deb or brew tarball using nix? (what happens to system dependencies e.g. libz3?)
  • Document how to use bazel for development vs nix for packaging.

Speculative fault-space exploration

Currently lineage-driven fault injection (LDFI) returns a set of faults, which we inject in the next concrete run, which gives us a new trace and thus new constraints for LDFI, making it better at guessing what the next set of faults should be, and so on. Clearly, as the set of constraints grows, LDFI will get slower.

One possible optimisation that could help in this situation is to have LDFI return several sets of faults instead of just one, and thus do several concrete runs of the test per run of LDFI. We should be able to do this today with minor changes.

An obvious next step would be to run these concrete runs in parallel. This would require more work though, e.g. making the scheduler be able to keep track of several runs at the same time, and it would also put more strain on the db so we probably want to switch over to using the logger component which batches writes to the db. Should be done after #70, #72, and possibly #65.

Computing coverage

Tests pass, but what exactly is being tested? We currently have fixed/static programs, but the faults that lineage-driven fault injection (LDFI) introduces are variable/dynamic.

Some ideas:

  • Golang's built in coverage, this is good for getting a feel for which parts of the SUT's code are exercised, but requires manual effort and time to do a proper analysis;
  • Classify histories/heap traces into equivalence classes based on unit tests a la John Hughes' talk and IOHK's work (perhaps using LTL-like formulas);
  • SUT metrics, e.g. how many leader elections happened, how many retries were sent, etc;
  • Analyse the faults table that LDFI populates, e.g. how different are the faults that are introduced from each other, or which nodes of the cluster are targeted by crashes;
  • Usage model / operational profile and path coverage rather than code coverage.

Hash based identifiers rather than counters for tests and runs

Monotonically increasing counters are currently being used as identifiers for tests and runs of tests; using hashes instead would give us several advantages:

  1. Avoid rerunning the same test twice (given that the tests should be deterministic...);
  2. Share tests between users/CI by hash, thus avoiding clashes in the db (i.e. the same run has different counter ids);
  3. Concurrent runs would less likely cause identifier races (if we are not careful two threads could still run the same test and therefore create a hash clash).

The disadvantage of using hash identifiers is that they are longer.

What exactly to use to create the hash remains to be figured out. It should include everything that makes a run unique and reproducible, i.e. the test, seed, faults, git commits of all software involved, etc.
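
For illustration only -- the fields below are a guess at what "everything that makes a run unique" might look like, not a decided format -- such an identifier could be derived like this:

    package runid

    import (
        "crypto/sha256"
        "encoding/hex"
        "encoding/json"
    )

    // Run is a hypothetical description of everything that determines a run's
    // outcome; pinning down the exact field set is precisely what this issue asks.
    type Run struct {
        TestCase json.RawMessage   `json:"test-case"`
        Commits  map[string]string `json:"commits"` // component -> git commit
        Seed     int64             `json:"seed"`
        Faults   json.RawMessage   `json:"faults"`
    }

    // Hash gives a stable, content-addressed identifier: same inputs, same hash.
    // Note that encoding/json sorts map keys, so the encoding is deterministic.
    func (r Run) Hash() string {
        bs, err := json.Marshal(r)
        if err != nil {
            panic(err)
        }
        sum := sha256.Sum256(bs)
        return hex.EncodeToString(sum[:])
    }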

Generalise test case generation, checking and deployment

Generator

Currently the generator component generates a fixed test case; it would be nice if it could use some generic grammar/schema as a basis for generating random test cases. This is not an immediate concern because:

  1. One can choose to avoid using the generator and instead generate a random test case using whatever programming language the SUT testsuite is written in;
  2. Initially, at least, the test cases can be small and handpicked as testing many different interleavings of messages and faults in a single test case is probably more effective in finding bugs than generating many different test cases and testing few interleavings and faults.

However, in the long run, especially if we want to use usage model or operational profile based test case generation, then if (1) above is used we will end up reimplementing this functionality for every programming language involved.

Checker

Similarly for the checker component, one can currently only check histories that have the exact same structure as the register or list-append model in the Elle checker. While sticking to the Elle checker is probably a good idea when it comes to checking transactional consistency, it would be good if there at least was a generic way of mapping operations from the SUT to the operations in Elle, e.g. if the SUT uses an operation Write to do list appends, then there would need to be a way to map this to Elle's :append operation.

Other checkers, e.g. a heap/memory checker, are outside what Elle can handle and so it should also be possible to use multiple different checkers in some way.

Again this is not critical as one can always call out to arbitrary checkers written in the testsuite of the SUT.

Deployment

Currently deployment of the SUT happens via the executor component which is deployed using the library component from the testsuite of the SUT. The initial state of the SUT reactors is constructed inside the testsuite as well. It could be interesting to try to make the deployment step more standalone from the testsuite as it would potentially make the following things easier:

  1. Running the tests against an already deployed SUT;
  2. Doing SUT deployments during test execution (to test upgrades, backup and restore, etc);
  3. Introduce supervisor trees controlling reactors (extend reactors with start and terminate operations);
  4. Deploy the test system itself.

It would also make the executor and library code, which is tied to the programming language of the SUT, smaller. Again nothing critical.

Add ability to change the state of a component while its running

Currently we always initialise new components to some "initial state" when we spin them up. It would be nice to be able to "load" the end state from a previous test run in order to be able to pick up and continue testing from where we left off. It would also be nice to be able to "patch" the state of a running component to do hotfixes or simulate byzantine faults.

Note that something special would have to be done with the Init() messages if a test is resumed.

Proof terms for LTL Checker

Currently the LTL checker just returns a Bool, so we don't get any information other than whether the check passed or not. It would be good to have a reason for why a check failed; for example, if you have always P it would be good to know at what logical time P failed, if the whole formula failed.

A general idea is to move all the negation to the atoms, and have the atoms give a reason for why the predicate failed. This is similar to what we did in quickcheck-state-machine: https://github.com/advancedtelematic/quickcheck-state-machine/blob/2515c206c5ce6795d01930027edfe69b19a66c8e/src/Test/StateMachine/Logic.hs#L103

Make LDFI smarter

The current implementation of LDFI isn't as smart as it could probably be. At the moment it uses a SAT solver to pick a fault-set that hasn't been seen before, but we don't have much insight into what would happen if any particular failure were injected. Part of the problem is that we simply don't have that information, for two reasons:

  1. We don't really know which message made the checker happy.
  2. We don't know why different reactors choose to send certain messages.

This assumes that there will be some message sent to a reactor that makes the checker happy (and keeps it happy). If we knew what that message was (potentially using some model), we could, as a strategy, try to inject faults so that the message can't happen. If we can't drop the message (because that is not a valid move according to the failure spec), we need to make it so that the reactor that sent the message doesn't want to send it (for which we need 2), and this process might need to be repeated if the proposed earlier message(s) are also outside the failure spec.

Make the debugger more keyboard friendly

Currently we need to use the mouse to focus on a different window in the debugger, e.g. the sequence diagram or switching to the reactor state of a different node.

At the very least it would be nice if you could use TAB and shift-TAB to switch back and forth between the windows and then scroll, but perhaps even nicer would be if one could scroll any window without having to switch to it first (or automatically switch to it). E.g. shift-J/K or shift-up/down arrow to scroll the sequence diagram, or shift-1,2,3,... to switch between the nodes' state.

Refactor the scheduler component

The requirements for the scheduler component have changed since it was first written, e.g. timers seem more practical than ticks, which has made the code more complicated and harder to maintain.

Here's a list of things that could be refactored:

  • Remove the code related to ticks (because they are too inefficient for practical use);

  • Make the scheduler into a reactor (so we can easier test it);

  • Remove the enforcing of the protocol from the scheduler (i.e. :state); the protocol could/should be enforced between reactors rather than inside the reactors (see Joe Armstrong's Getting Erlang to talk to the outside world (2002) paper or chapter 9 and appendix C of his thesis for more details);

  • Use a faster communication channel between the scheduler and the executors than HTTP, e.g. named pipes (as this is a bottleneck).

  • State diffs: add to journal/log (decide on format, probably json, but with which fields?), add to debugger

  • Recreate old traces from new journal/log (jepsen history for elle checker, network trace for debugger, heap/state trace for ltl checker)

  • Application/SUT logging: add to journal (can we append straight to the journal from zapper?)

  • Scheduler should generate arrival times

  • Scheduler/executor needs to support timer events (Sean's format?)

  • Faults (executor needs support for restart db)

  • Should faults use their own "simulated logical time"? (see comment in #377)

  • Message format:

    1. InternalMessage.Args => Aeson.Value
    2. Parametrise message => solves actor being cumbersome to write problem
    3. Change codec to work on parametrised Message
    4. Separate transport from codec?
  • Journal to disk (not just in-memory, not really important for testing)

  • Use disruptor in event loop

  • Don't send node name over the wire as part of logical time (the node name is only needed "locally" to the node)

  • Catch and print better error messages when pattern-matching failures happen
