
verdi's Introduction

Verdi


Verdi is a Coq framework to implement and formally verify distributed systems. Verdi supports several different fault models ranging from idealistic to realistic. Verdi's verified system transformers (VSTs) encapsulate common fault tolerance techniques. Developers can verify an application in an idealized fault model, and then apply a VST to obtain an application that is guaranteed to have analogous properties in a more adversarial environment.

Building and installation instructions

We recommend installing Verdi via opam, which will automatically build and install its dependencies:

opam repo add coq-extra-dev https://coq.inria.fr/opam/extra-dev
opam install coq-verdi

To build Verdi manually, first install all requirements. Then, run make in the Verdi root directory. This will compile the framework's core specifications and proofs, as well as some simple example systems and their correctness proofs.

To run Verdi systems on real hardware, event handler code must be extracted to OCaml and linked with one of the shims in the Verdi runtime library that handles low-level network communication.

Documentation

To set up your own Verdi-based distributed systems verification project, we recommend basing it on Verdi LockServ.

Verdi LockServ contains a minimalistic implementation of a message-passing lock server and a proof that it maintains mutual exclusion between client nodes. At build time, extracted OCaml code is linked to a runtime library shim to produce an executable program that can be run in a cluster. There is also a simple script to interface with cluster nodes.

In addition to the example verified systems listed below, see the scientific papers and blog posts listed at the Verdi website. See also Verdi Raft, a verified implementation of the Raft distributed consensus protocol.

Files

  • Core Verdi files:
    • Verdi.v: exporting of core Verdi theories, imported by systems
    • Net.v: core (unlabeled) network semantics
    • LabeledNet.v: labeled network semantics, for use in liveness reasoning
    • HandlerMonad.v: a monad for writing network/input handlers
    • StatePacketPacket.v: a technique for writing easily decomposable invariants
  • Example systems:
    • Counter.v: counting server with backup
    • LockServ.v: lock server with proof of safety
    • LiveLockServ.v: lock server with proof of liveness
    • VarD.v: vard, a key-value store
  • Verified system transformers:
    • SeqNum.v and SeqNumCorrect.v, a system transformer implementing sequence numbering (the underlying idea is sketched after this list)
    • LockServSeqNum.v, the sequence numbering transformer applied to the lock server
    • PrimaryBackup.v, a system transformer implementing asynchronous primary-backup replication
    • VarDPrimaryBackup.v, the primary-backup transformer applied to the key-value store
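
As a rough illustration of the idea behind the sequence numbering transformer (a sketch only; Verdi's actual SeqNum.v is written in Coq and differs in its details, and all names below are hypothetical), a receiver can deduplicate retransmitted messages by remembering which sequence numbers it has already processed for each sender:

  (* Hypothetical OCaml sketch of sequence-number deduplication;
     not Verdi's SeqNum transformer. *)

  type 'msg tagged = { sender : string; seq : int; payload : 'msg }

  (* Sender side: tag each outgoing message with a strictly increasing number. *)
  let make_sender name =
    let counter = ref 0 in
    fun payload ->
      incr counter;
      { sender = name; seq = !counter; payload }

  (* Receiver side: remember the sequence numbers already processed for each
     sender and silently drop duplicates, e.g. network retransmissions. *)
  let make_receiver handle =
    let seen : (string, int list) Hashtbl.t = Hashtbl.create 16 in
    fun msg ->
      let prev = try Hashtbl.find seen msg.sender with Not_found -> [] in
      if List.mem msg.seq prev then ()   (* duplicate: ignore *)
      else begin
        Hashtbl.replace seen msg.sender (msg.seq :: prev);
        handle msg.payload               (* deliver at most once per number *)
      end

The point of the transformer is that a system verified under a friendlier delivery model keeps analogous guarantees once duplication is handled this way at the network boundary.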

verdi's People

Contributors

a1kmm, dwoos, gares, hackedy, jasongross, jfehrle, justinads, mernst, palmskog, ppedrot, skyskimmer, steveanton, wilcoxjay, ztatlock


verdi's Issues

Server crashes when trying to produce large packets because of stack overflow

We found a bug that crashes the leader with a stack overflow when a lagging follower tries to recover. The overflow seems to occur within the recursive function restore_from_log (Shim.ml) while a very large packet is being constructed, before the leader actually tries to send it.

This problem can be reproduced through the following process:
a) start 3 servers;
b) execute one client request;
c) stop a follower server;
d) execute many client requests (in our tests, at least 521,932 requests).
e) restart the server that was stopped.

Here’s a sample output produced by the leader when it crashes:

   [Term 1] Sending 50 entries to 2 (currently have 521932 entries), commitIndex=521882
   [Term 1] Sending 521881 entries to 3 (currently have 521932 entries), commitIndex=521882
   [Term 1] Received AppendEntriesReply 50 entries true, commitIndex 521883
  Fatal error: exception Stack overflow
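
One plausible direction (a sketch only; the entry format and names below are assumptions, not the actual restore_from_log in Shim.ml) is to build large collections of entries with an accumulator, so the recursion is a tail call and the OCaml stack stays constant even for hundreds of thousands of entries:

  (* Sketch: read entries with an accumulator; the recursive call is in tail
     position, so replaying ~500k entries does not grow the call stack.
     Illustrative names only; the real Shim.ml code differs. *)
  let entries_from_log (chan : in_channel) =
    let rec loop acc =
      match (try Some (input_value chan) with End_of_file -> None) with
      | None -> List.rev acc              (* preserve log order *)
      | Some entry -> loop (entry :: acc) (* tail call: constant stack *)
    in
    loop []

Capping the number of entries sent in a single AppendEntries message would also sidestep the problem by keeping individual packets small.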

Transient system call errors during recovery cause inconsistent re-initialization

During recovery, when opening the snapshot file, the server presumes that any error it encounters means that no snapshot was created (function get_initial_state in file Shim.ml). However, errors while opening a file can also be caused by transient OS problems such as insufficient kernel memory (ENOMEM) or exceeding the system-wide limit on open files (ENFILE). If such an error occurs during recovery, the server will silently discard part of the persistent state (the disk snapshot) while still reading the rest of it (the disk log), which leads to safety problems.

The following sequence of steps should reproduce the bug:
a) issue client PUT requests so that snapshots and log entries are written to disk (~1000 requests)
b) stop all servers
c) remove all the permissions of the respective snapshot files (chmod 000 verdi-snapshot-900*).
d) restart all servers
e) issue one GET client request

In our tests, the GET request issued after recovery (step e) returns a result as if the key-value store had never been populated (step a).

Beyond having replicas forget all of their state, it may also be possible to construct cases where replicas only partially forget their state, since after recovery they discard the snapshot but not the disk log.
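
One hedged sketch of a fix (hypothetical names, not the actual get_initial_state in Shim.ml): treat only ENOENT as "no snapshot exists" and let any other error, including transient ones like ENOMEM or ENFILE, propagate so that recovery fails loudly instead of silently reinitializing:

  (* Sketch: only a missing file means there is no snapshot; any other error
     propagates to the caller. Requires the unix library. Illustrative only. *)
  let load_snapshot path =
    match Unix.openfile path [Unix.O_RDONLY] 0 with
    | exception Unix.Unix_error (Unix.ENOENT, _, _) ->
        None                              (* no snapshot was ever written *)
    | fd ->
        let chan = Unix.in_channel_of_descr fd in
        let snap = input_value chan in    (* assumes a Marshal-format snapshot *)
        close_in chan;
        Some snap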

Transfer-based correctness theorem for Raft is missing

We should prove something like "If P is an invariant of the underlying state machine, then P is an invariant of every state machine in Raft". This should follow directly from StateMachineCorrectness.

Thanks to UPenn folks for reporting!

Crash during update of snapshot causes loss of data

A crash of the server while it is executing the function that writes a snapshot to disk (updating the existing snapshot) can cause loss of data and prevent the server from recovering correctly afterwards.

This bug is more serious than issue #38 because it can lead to loss of data: when the server crashes while executing the save function, it may already have deleted/truncated the existing disk snapshot before the new snapshot was safely written to disk.

This problem can be reproduced by simulating a crash immediately after the snapshot file is opened with O_TRUNC (save function in Shim.ml) and before the write is actually made, for example, by adding the statement assert(env.saves < 10000);.

It is probably harder to fix this bug than issue #38 because a correct implementation needs to ensure that several steps (i.e., replacing the old snapshot with the new snapshot and truncating the log) are atomic despite crashes.
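
A common technique for this kind of problem (a sketch under assumed names, not the actual save function in Shim.ml) is to write the new snapshot to a temporary file, flush and fsync it, and then atomically rename it over the old snapshot; a crash at any point then leaves either the complete old snapshot or the complete new one on disk:

  (* Sketch of crash-safe snapshot replacement via write-to-temp-then-rename.
     Hypothetical names; requires the unix library (Unix.fsync needs OCaml >= 4.08). *)
  let write_snapshot path snapshot =
    let tmp = path ^ ".tmp" in
    let fd = Unix.openfile tmp [Unix.O_WRONLY; Unix.O_CREAT; Unix.O_TRUNC] 0o640 in
    let chan = Unix.out_channel_of_descr fd in
    Marshal.to_channel chan snapshot [];
    flush chan;
    Unix.fsync fd;       (* make sure the bytes have reached the disk *)
    close_out chan;
    Unix.rename tmp path (* atomic on POSIX: the old snapshot survives a crash *)

This sketch does not address the harder part noted above, namely making the snapshot replacement and the log truncation appear atomic together.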

Modelling multithreaded applications

After reading the papers and skimming through the examples I noticed that verdi doesn't seem to be suited for modelling concurrent applications, as handler callbacks are deterministic (prove me wrong).

Perhaps something like actor-based concurrency can be emulated using custom network semantics with extended names, say (node * thread) and different message delivery guarantees depending on node part. Even something like shared state can be emulated by an actor... Yet this solution seems too finicky and I would avoid taking this path unless I absolutely have to.

Question: was I right about handlers being deterministic? If so, what is the best approach to the problem in your opinion?

Transformed LockServ missing

The paper discusses the application of the sequence numbering transformer to the lock service, but there is nothing like this in the formalization.

Thanks to folks from UPenn for reporting!

Proposal: verify Verdi shim functions down to machine code

Several parts of the Verdi shims (currently written in OCaml) could be specified and verified down to machine code, for example using the Verified Software Toolchain (VST).

The current limitation is that there is no formal specification of UDP or TCP, which are used in the shims (all shims use TCP in some way).

What commit works for verdi with coq 8.12?

Sincere apologies; perhaps this isn't supported, but given all my efforts I thought it would be worth at least asking before giving up: is there a commit of Verdi that works with Coq 8.12?

I did try --ignore-constraints-on=coq, but it doesn't seem to have given me the same success as it did with cheerios (uwplse/cheerios#20).

this is what I tried:

git clone git@github.com:uwplse/verdi.git deps/verdi
(cd coq-projects/verdi && opam install -y --ignore-constraints-on=coq .)
# this doesn't seem to do anything different than the above attempt, above uses dev & ends up using verdi-runtime
#git clone git@github.com:uwplse/verdi.git deps/verdi
#(cd deps/verdi && git checkout fdb4ede19d2150c254f0ebcfbed4fb9547a734b0 && git rev-parse HEAD)
#(cd deps/verdi && git checkout 3d22ce073f7d16da58eb8e1aa3c71bf8f588a04f && git rev-parse HEAD)
#(cd deps/verdi && git checkout 35508f2af94f9da979ece0cbdfa191019f2c5478 && git rev-parse HEAD) # right before coq >=8.14 warning
#(cd deps/verdi && git checkout cb016cf9d2ae61ff757a0b6fa443b391a5416b63 && git rev-parse HEAD)  # coq >=8.14 warning
#(cd deps/verdi && opam install -y .)

Please do not depend on autogenerated names of `destruct` that are sensitive to the file in which records are defined

My bug minimizer cannot fully minimize the example at coq/coq#14798 (comment) because the proof script

   destruct net.
   concludes.
   destruct nwState. auto.

relies on the fact that the record Verdi.Net.network is defined in some file other than GhostSimulations.v (in this case, in the file Net.v). This is an instance of coq/coq#3523.

Suggested fixes:

-  destruct net.
+  destruct net as [? nwState].
   concludes.
   destruct nwState.

or

   destruct net.
   concludes.
-  destruct nwState.
+  let nwState := match goal with H : data |- _ => H end in destruct nwState.

Documentation

Hi there, where can I find the documentation for this library?

What version of coq-verdi does one need for coq 8.10

I know this is an annoying request, but which commit of Verdi do you recommend for use with Coq 8.10?


Actually this seems to have worked:

(iit_synthesis) brando9~ $ opam pin add coq-verdi git+https://github.com/uwplse/verdi.git#f3ef8d77afcac495c0864de119e83b25d294e8bb

[coq-verdi.dev] synchronised (git+https://github.com/uwplse/verdi.git#f3ef8d77afcac495c0864de119e83b25d294e8bb)
coq-verdi is now pinned to git+https://github.com/uwplse/verdi.git#f3ef8d77afcac495c0864de119e83b25d294e8bb (version dev)

The following actions will be performed:
  ∗ install coq-cheerios dev* [required by coq-verdi]
  ∗ install coq-verdi    dev*
===== ∗ 2 =====
Do you want to continue? [Y/n] Y

<><> Processing actions <><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Processing  3/6: [coq-cheerios: make]
∗ installed coq-cheerios.dev
∗ installed coq-verdi.dev
Done.

Server assumes that it can read the entire client request with a single recv call

If the server is not able to read the entire client request (which is sent over TCP) with a single call, it wrongly presumes that the client has disconnected and throws an exception (function read_from_socket in file Shim.ml).

The problem can usually be reproduced by simply sending the request using two send calls on the client side.
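
The usual remedy for this class of bug (sketch only; read_from_socket in Shim.ml may be structured quite differently) is to keep receiving and buffering until a complete newline-terminated request has arrived, and to treat only a zero-length read, not a short read, as a disconnection:

  (* Sketch: accumulate TCP data until a full request (ending in '\n') is
     buffered. A short read just means more bytes are still in flight;
     only a 0-byte recv signals that the client closed the connection.
     Illustrative names; requires the unix library. *)
  let read_request fd =
    let buf = Bytes.create 4096 in
    let acc = Buffer.create 256 in
    let rec loop () =
      let s = Buffer.contents acc in
      if String.contains s '\n' then s
      else
        let n = Unix.recv fd buf 0 (Bytes.length buf) [] in
        if n = 0 then failwith "client disconnected"
        else (Buffer.add_subbytes acc buf 0 n; loop ())
    in
    loop ()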

what are eClient and eId?

hey!

I've been working through your Raft implementation, and I can't figure out what the natural numbers in the eClient and eId fields represent:

 Record entry := mkEntry {
                      eAt : name;
                      eClient : nat;
                      eId : nat;
                      eIndex : logIndex;
                      eTerm : term;
                      eInput : input
                    }.

Some clarity would be much appreciated; a colleague and I have been trying to figure out whether those naturals are actually names, terms, or something else entirely.

many thanks
-Carter

Client-server marshaling allows users to inject commands and to crash the server

The server-client communication protocol relies on spaces to delimit the arguments of a request and on newlines to delimit requests, which are sent over TCP connections. Because the characters of the command arguments (keys and values) are not validated and the meta-characters (newlines and spaces) are not escaped, users of vard.py can provide specially crafted input to: 1) crash the server; 2) inject commands that are executed by the servers and cause subsequent requests to return wrong results.

To reproduce the bug it suffices to create a client application that issues the following vard.py library calls:

  1. Crash server:
    GET("key1 - - \n")

During the execution of the GET, the leader crashes; before terminating, it produces the following error message:

client disconnected: received invalid input
Fatal error: exception Unix.Unix_error(Unix.EBADF, "send", "")
  2. Inject commands:
PUT(key1,key1) = key1
PUT(key2,key2) = key2
PUT(key3,key3) = key3
GET(key1 - - \n132201621 216857 GET key2) = key1
GET(key1) = key2
GET(key2) = key1
GET(key3) = key2

Note that the last three GET operations produce an incorrect result.
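
One way to close this hole (a sketch with hypothetical names, not vard's actual wire-protocol code) is to reject keys and values that contain the protocol's delimiter characters before a request is ever serialized; a stronger alternative is length-prefixed framing, so argument contents can never be confused with protocol structure:

  (* Sketch: refuse arguments containing the space/newline meta-characters
     that the text protocol uses as delimiters. Hypothetical helper only. *)
  let is_delim c = c = ' ' || c = '\n' || c = '\r'

  let validate_arg (s : string) : (string, string) result =
    let bad = ref false in
    String.iter (fun c -> if is_delim c then bad := true) s;
    if !bad then Error ("argument contains a protocol delimiter: " ^ String.escaped s)
    else Ok s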

Server is unable to recover when disk log is incomplete due to a crash while writing an entry

A crash during a write to the disk log (by M.to_channel in function save in file Shim.ml) can cause a partial entry to be appended to the log. When this happens, the crashed server is unable to recover and produces the following error when starting:
"Fatal error: exception Failure("input_value: truncated object")"

This problem can be simulated by stopping the server and then trimming the last byte from the log. We expect partial writes during crashes to be more likely when the log write crosses disk block boundaries.
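
A hedged sketch of one possible recovery strategy (hypothetical names; not the actual Shim.ml code): replay the log entry by entry and treat a truncated trailing object as the end of the log, on the assumption that the partially written final entry was not yet acknowledged before the crash (whether that assumption holds for Verdi's shim would need to be checked):

  (* Sketch: stop replaying at a truncated trailing entry instead of aborting.
     Illustrative only; the real recovery code in Shim.ml differs. *)
  let replay_log apply init (chan : in_channel) =
    let rec loop st =
      match input_value chan with
      | entry -> loop (apply st entry)
      | exception End_of_file -> st   (* clean end of log *)
      | exception Failure _ -> st     (* truncated object from a crash mid-append *)
    in
    loop init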

Cannot get extraction/vard make to work

I've been enjoying this repo quite a bit (thanks for it) in the course of translating it to Haskell (https://github.com/cartazio/haver-raft), and I've recently tried to get the demo to run. I'm getting an error I can't quite understand from the make in the extraction directory (though the verdi/Makefile worked fine). Any thoughts on what's tripping me up?

➜  vard git:(master) uname -a
Darwin Kernel Version 14.5.0: Wed Jul 29 02:26:53 PDT 2015; root:xnu-2782.40.9~1/RELEASE_X86_64 x86_64
➜  vard git:(master) ocaml -version
The OCaml toplevel, version 4.02.3
➜  vard git:(master) ocamlbuild -version
ocamlbuild 4.02.3
➜  vard git:(master) coqc --version
The Coq Proof Assistant, version 8.5beta2 (August 2015)
compiled on Aug 31 2015 15:12:12 with OCaml 4.02.3
➜  vard git:(master) make
make -C coq
/Applications/Xcode.app/Contents/Developer/usr/bin/make -f Makefile.coq
make[2]: *** No targets.  Stop.
make[1]: *** [default] Error 2
make: *** [coq] Error 2
➜  vard git:(master) tree
.
├── Makefile
├── _build
│   ├── _digests
│   ├── _log
│   ├── lib
│   │   ├── Shim.cmi
│   │   ├── Shim.cmo
│   │   ├── Shim.ml
│   │   └── Shim.ml.depends
│   ├── ml
│   │   ├── VarDArrangement.ml
│   │   ├── VarDArrangement.ml.depends
│   │   ├── vard.ml
│   │   └── vard.ml.depends
│   └── ocamlc.where
├── bench
│   ├── bench.py
│   ├── etcd.py
│   ├── etcd.txt
│   ├── setup.py
│   ├── vard.py
│   ├── vard.txt
│   └── vard_open_loop.py
├── coq
│   ├── ExtractVarDRaft.v
│   ├── Makefile
│   ├── Makefile.coq
│   └── VarDRaft.v
├── lib -> ../lib
├── ml
│   ├── VarDArrangement.ml
│   └── vard.ml
└── scripts
    ├── bench-etcd.sh
    ├── bench-vard.sh
    ├── clear-etcd.sh
    ├── clear-verdi.sh
    ├── start-tmux-debug.sh
    └── start-tmux.sh
