xlab-uiuc / slooo Goto Github PK

Slooo: A Fail-slow Fault Injection Testing Framework

Shell 0.37% C++ 0.61% Xonsh 81.65% Python 17.36%

distributed-systems distributed-system fault-tolerance

slooo's Introduction

Slooo: A Fail-slow Fault Injection Testing Framework

Slooo is a Xonsh-based fault injection framework for distributed systems.

Slooo is a part of the DepFast project in which we evaluate the fail-slow fault tolerance of a quorum system using fault injection, e.g., slowing down a node by adding delays and creating contention on the CPU/memory/disk/Network.

Doing such fault injection requires a lot of scripting. Some are application specific (e.g., scripts to start and terminate the system) and some are generic (e.g., injecting certain types of faults).

We wrote a lot of shell scripts for fast scripting, but soon we ran into maintenance hell, especially every time there is a major reorgs of the team (members leaving and joining). After many rounds of energy draining and time wasting, we decided to write a more structured, reusable framework to minimize the overhead.

The choice of using Xonsh comes from the following considerations:

We still need shell scripts for the ease of integration. Specifically, many of the scripts of the system under test are in shell.
We want high-level languages which can build some abstractions and reusable code.

Xonsh serves both – it is a chimera of Python and Shell.

We have used Slooo to test a number of quorum systems, including RethinkDB, MongoDB and TiDB.

The test can be done in a "pseudo-distributed" mode and in cloud environments. The former runs all the tests on one machine and the latter runs the tests in a cloud platform. Currently, we only support Azure Cloud (which sponsored the DepFast project).

Please checkout the tutorial on how to write fault-injection tests using Slooo.

slooo's People

Contributors

Stargazers

Watchers

Forkers

migoxia

slooo's Issues

bug?

https://github.com/xlab-uiuc/slooo/blob/main/SUT/rethink/rethink.xsh#L11

Why mongodb constants are used by RethinkDB?

See #21

Clean `tests`

Please clean the tests you wrote in /tests

rename the directory to DBNAME rather than some ambiguous name, i.e., rename rethink to rethinkdb.
within each folder, name the main test case into test_main.xsh so people know what is the main file to look at (later, it is easy to script because one can just list all the test_main in the tests rather than needing to remember the name of the entry file.
for server_configs.json, please either name it server_config_local.json or server_config_azure.json
Please make sure that you refactor the code rather than introducing a lot of bugs such as in #22 (comment)

Improvements on Slooo

I know discussing slooo improvements might be premature given that my assignment has not been finished yet. I just don't want my ideas from my experience testing the systems to go away:

Possible Improvements
~~1. Sandbox for local mode: When running in local mode, slooo should sandbox node instances in order to~~
~~1. provide more kinds faults in local mode -- for example, disks, network devices can be simulated and thus we can inject disk and network slowness.~~
~~2. get rid of sudoers~~
~~3. avoid interference with local environment.~~
*I am not sure whether the current docker solution is good enough or there are better solutions. Ideally, I think we should provide individualized control over each node, but I am not sure whether it is worth doing so.

Better Result Naming Conventions: Slowness config should be included in the folder name so that we don't have to manage results manually.
Reduce Command-line Output: Some commandline outputs (like cgdelete errors before benchmark is run) could be eliminated / redirected to some log file because those outputs are massive and not meaningful before benchmark is run.
System Usage Recording: This is proposed by Varishith and he is working on this improvement.

Paper Submission Plan

Evaluation Section

Look at the other student's reports to understand how they used slooo and their review of slooo (https://drive.google.com/drive/u/0/folders/1RuV5Y5cJ72K7HneSOtuwPkgTNBE0419U)
Review Sachin Ashok's scripts and report (https://github.com/Sachin-A/PySyncObj)
Enhance assignment 2 to get more results and also fix the holes pointed out by @tianyin (#38)

Point Break

Review https://github.com/xlab-uiuc/slooo/pulls?q=is%3Apr+is%3Aclosed for point break

Repo Updates

Cleanup the Repo (remove dead code)
Unmerged PRs
Merge the Updated Design Repo

Paper Submission

Paper Writing (https://2022.esec-fse.org/track/fse-2022-demonstrations)

Need to add a reqs file

just a reminder to self.

We really have to run through the tests

just skimming through the code and bugs are so easy to be spotted
https://github.com/xlab-uiuc/slooo/blob/main/utils/quorum.xsh#L1

I'm not sure whether those bugs matter (it crashes or just ignore if it cannot find the package).

We need to run through the RethinkDB or MongoDB to verify whether the code works.

Readings

Hi team, could you carefully read the Limpbench paper?

https://ucare.cs.uchicago.edu/pdf/socc13-limplock.pdf

It's mostly closed related to our work and we need to position our work carefully.

We can pitch any or multiple of the following:

Slooo implements exactly what proposed in Limpbench, so we are a tool paper (hopefully that Limpbench is not released, is it?)
Slooo is much more extensible and can be used for many different types of systems.
Slooo is much more extensible and can be used to simulate more types of faults.
Slooo is not only more extensible but also more versatile (it can do X and Y and Z which is hard to be done by Limpbench)
Please let me know if there is anything you believe you can put on the table.

Feature request: support Copilot

@varshith15 I wonder whether it's possible to support Copilot using Slooo,
https://github.com/princeton-sns/copilot/tree/main/src

It's a different protocol which is supposed to be resilient to single slow leader,
https://www.usenix.org/conference/osdi20/presentation/ngo

Run through the tutorial

@siyuanchai1999 We are going to release the Slooo tool to CS 598XU.

Before that, could you run through the tutorial and see whether everything works?

https://github.com/xlab-uiuc/slooo/blob/main/tutorial.md

`faults`

It's good that you put fault implementations in fault.
https://github.com/xlab-uiuc/slooo/tree/main/faults

However, it is frustrating to read the code there.

Names like 1, 2, 3 are horrible names -- how can your users know what's going on by the names? Name them as the fault's name please (e.g., slow_cpu.sh).

Also, I don't find where 1.sh is called. Are those deadcode?

Why they are not written in xonsh but in shell?

Please write this tutorial

@varshith15 @Essoz

I posted #16 but didn't see any progress. I suspect you don't know how to write it. If that's the case, I prepare the writing as follows,
https://docs.google.com/document/d/1PNEr4W9QiRSeRIkXjEMApsrPZaqZAVhCZ3wEfN_fkA4/edit#

Could you please fill it ASAP?

I read through the codebase. The code is quite not clean :(

There are dead code and redundant code (e.g., the temp.xsh and rethink.xsh under https://github.com/xlab-uiuc/slooo/tree/main/SUT/rethink). As a user, how can I figure out what is what?

Please put hard work as the code quality is really not on par to something for 40 users.

Varshith's Assignment #2

It is in general pretty nicely done! Good job!

There are some information needs to be clarified. Please add those.

There are a few bigger problems. Let me list them here:

Could you inject the faults during the workload run? It seems to me that you inject the workloads before the workload run? That's the reason that your leader crash does not even affect performance. Leader election means availability which should affect performance.
For Q5 and Q6, the choice of 50MB does not look right. You are suppose to use a value that leads to a fail-slow behavior. The fact that 50MB >> the memory needed by RethinkDB node means a bad choice.
You said that it is expected that the performance goes down when a follower node fails slow (no matter whether it is CPU or memory). Why is it expected?
For a quorum write, the write only needs to persist on two nodes (a leader and a follower). So one slow follower is not supposed to cause problems.
In Q10, you mentioned that there is a leader crash? Why it crashes?
Ritesh previously observed that a slow follower could crash the leader (https://tianyin.github.io/pub/depfast.pdf). Do you observe that?

please rename RSM to Quorum

https://github.com/xlab-uiuc/slooo/blob/main/utils/rsm.xsh

It would require some work to refactor the other code.

Is this a template?

https://github.com/xlab-uiuc/slooo/blob/main/utils/rsm_standard_server_configs.json

If so, please name it server_config_template.json

Action Plan for Improving Slooo

What are resources?

https://github.com/xlab-uiuc/slooo/tree/main/resources

Look like completely irrelevant stuff or dead code?

Can I remove them?

"@Tianyin Xu can you confirm if anything else needs to be done?"

Sorry that I just had time to read over your answers again.

Yes, could you finish the following:

For the faults onto leader, can you inject the faults during the workload, so that the results account for the leader re-election?
I can't access the figures in Q10 -- not sure why.
I still really understand the choice of using 50MB -- why not smaller?
In #37, you said "Not exactly sure what the reason might be. Have to investigate more." -- can you do this?

Create a Docker image for the tutorial

@varshith15 will write a dockerfile and test it.

Can I ask people to use Slooo?

@varshith15 @Essoz

I'm teaching a grad-level course on reliable software systems. I want to design an assignment on testing quorum systems.

Can I ask the students to use your framework?

Action plan #3

Please use the following code as template:
template(mongo): https://github.com/xlab-uiuc/slooo/blob/main/mongo/temp.xsh
server_configs: https://github.com/xlab-uiuc/slooo/blob/main/mongo/server_configs.json
super class to inherit from: https://github.com/xlab-uiuc/slooo/blob/main/utils/rsm.xsh

There are a lot of changes in https://github.com/xlab-uiuc/slooo/blob/main/utils/general.xsh
Please update your code carefully

Andrei:

Update the redis code to support local mode
Test redis code in both modes

Varshith:

Update the mongodb code to support local mode
Test mongodb code in both modes
Update the rethinkdb code to support local mode
Test rethinkdb code in both modes
Update the tidb code to support local mode
Test tidb code in both modes

Yuxuan:

Update the polardb code to support local mode
Test polardb code in both modes

Action plan #2 (Deadline Oct 8, Tool should be ready by then) #3

Here's the new action plan for this week. Feel free to add to it or clarify things.

Andrei:

Test Redis slooo code

Varshith:

Test TiDB code

Yuxuan:

Make more config-friendly interface for injecting/specifying fail-slow faults; "For example, a user can feed path to a specific slowness-config file when running run.xsh and then slow.xsh will read from that file when injecting slowness."
Modify PolarDB code (use init_disk instead)

Future goals:

Users should write functions to format the benchmark results into a unified format required by the framework.
Porting/implementing other databases (which ones in particular TBD).
Making the framework even more generalizable.

Why a common util is specific to Mongo?

https://github.com/xlab-uiuc/slooo/blob/main/utils/constants.xsh

Proposal

This is a followup discussion to #39

If you are looking for ideas to strengthen the tool and be more competitive in terms of research.

Here is the next step I would proposal. I actually wrote it as a small project for 598.

1. Point break in Slooo

Slooo is an Xonsh-based fault injection tool. The current Slooo tool hardcoded the faults to inject. For example, to inject a CPU fail-slow fault, it uses cgroup to limit the CPU quota,
https://github.com/xlab-uiuc/slooo/blob/main/faults/fault_inject.xsh#L5-L6

Oftentimes, one needs to experiment with multiple levels of fail-slow faults (e.g., multiple CPU quotas). Currently, s/he has to change it manually in Slooo, which is awkward. An automated solution is needed.

Furthermore, many times, one needs to find the "point break" – the fail-slow faults that cause the worse slowdown but do not turn the system into a crashing behavior (e.g., the minimal memory allocation that does not OOM). Currently, Slooo does not support it.

The project is to build the above two features.

Check Azure credit usage

@Stuart0l @varshith15

Can one of you check the usage of Azure credits?

We have $40,000 from Azure till 6/30/2022.

@Stuart0l mentioned that there are $32,500 remaining. It means we used $7,500 in the past month?

If it is the case, it's likely there are something wrong. At least, we can't avoid $75,00 per month as we only have $40,000 and we should reserve at least half of our credits for the evaluations later.

Could you see:

How is $75,00 used?
How the credits were spent in the last month?
How much we have spent on @varshith15 's MongoDB experiments?
How much we have spent on EPaxos experiments?

It should give us a good idea how to plan the money.

Thanks!

EPaxos Performance Measurement

I re-ran some of the measurements in the EPaxos Revisited paper. The main difference from the original EPaxos paper is that they are run in a WAN environment.

Settings

All experiments are run on 5 server machines and 5 client machines, located at 5 datacenters in the world (WA, CA, VA, UK, JP), with 1 server and 1 client in each datacenter. Each client only sends requests to the server in its own datacenter. All servers are of the Standard D4s v3 type, with 4 vCPUs and 16G RAM each. In each experiment, I captured metrics for 60 seconds at steady-state (run for 120s and drop the first and last 30s).

I ran two sets of experiments. One is running different workloads with the same number of clients. Four workloads are tested.

Zipfian: 0.9 skew, 1 million unique keys in Zipfian distribution, 50% write operations
0% conflict keys: 50% write operations
2% conflict keys: 50% write operations
100% conflict keys: 50% write operations

There are 10 clients at each client machine. Every client only issues operations to the server machine in the same data center.

Another is running the same workload with a different number of clients. I tested the Zipfian workload with 0.9 and 0.6 skew, 1m unique keys, and 50% and 10% operations, using up to 900 clients.

Results

Experiment 1

The following shows the throughput and 99% execution latency of in each datacenter:

The following is taken from Fig. 6 in EPaxos Revisited paper:

The total throughput I get is 600 ops for Zipfian, 615 ops for 0%, 2%, and 333 ops for 100%, which is roughly consistent with what the paper claimed (650 ops for Zipfian, 0% and 2%, 230 ops for 100%).

The 99% latency I get is not all consistent with the results in the paper. The Zipfian workload latency is roughly the same. 2% and 100% latency is nearly the same as Zipfian latency, while 0% latency is about half of Zipfian latency at all datacenters (except JP).

Experiment 2

The following shows the throughput vs. threads in each datacenter

The following shows the throughput vs. latency in each datacenter
skew 0.9, 50% write:

skew 0.6, 10% write:

For 0.9 skew, up to 900 threads, the throughput still increases linearly with the threads. And I didn't see latency increase significantly as the throughput increases. So I believe higher throughput is still achievable. The total throughput at 900 threads is about 30k ops. However, according to Fig. 12, the throughput of unmodified EPaxos at skew 0.9, 50% write is about 15k. I don't know how they got this number as they didn't mention how many clients are used. For 0.6 skew, there is a trend that throughput is out of linear growth with threads. The total throughput at 900 threads is about 37k ops, roughly consistent with the data point in Fig. 12 (40k).

The new merged code broke the assignment

@plzdoo reported that the Assignment #2 is broken,
https://docs.google.com/document/d/1bVw1c9az67OvTKwVSBq41hWpOsuxLbrWNXcJxzkA9t8/edit

It is caused by the recent commit,
6bd8ff9

It seems we were not very careful in backward compatibility. @varshith15

We will need to fix either the code or the doc.

@plzdoo would you be willing to do it?

bad name

https://github.com/xlab-uiuc/slooo/blob/main/utils/general.xsh

general is a very bad name. It's hard for me to know what is general code.

It seems all the code there are related to VM ops? If that's the case, please rename it to vm_utils

Feature requests

IMPORTANT:
Only works with Microsoft Azure (so far).

I accidentally saw this. This is no good, especially for a tool paper.

Not every developer has resources to test stuff in cloud (cloud credits are expensive) -- if it is Azure only, it's hard for you to claim usefulness
Even developers test on cloud, not every developers use Azure -- your tool can't be used for AWS and GCP?

So, this needs to be addressed and (1) is more important than (2).

We need to provide a local mode to allow developers to do testing on a local machine or VM. This can make people's life much easier. Certainly, local testing can't be as versatile as cloud testing, but we should provide whatever needed for the best local testing.

For (2), we already prove that it works for Azure. We don't have to implement AWS and GCP support, but need to convince people that it's easy to extend and how to extend if later someone wants to extend.

[Onboarding task]

Copy of REU_ Testing Reliability of Quorum Systems - Google Docs.pdf

Action plan (Deadline Oct 8, Tool should be ready by then)

Varshith:

Test framework code for MongoDB
Test framework code for RethinkDB
Email Andrew to get clarity on the different types of experiments for TiDB
Port TiDB to the framework
Test framework code for TiDB

Andrei:

Port Redis

@TheEnthralled Please fill/update in this part

Yuxuan:

Finish Analyzing fail-slowness behavior of PolarDB
Test PolarDB on other distros
Have tested PolarDB's compatibility on Ubuntu-20.04 (Hyper-V virtual machine)
Port PolarDB to the framework

Alex:
@CodeTiger927 Please fill/update in this part

Can we use Slack for faster communication?

I ping you on Slack but no response...

We need a FAQ

A handheld tutorial on how to use Slooo

@varshith15 @Essoz

Given that you will have 40 users, could you write a tutorial on how to use Slooo based on one system?

Just pick one system (RethinkDB or MongoDB or whatever you like) and demonstrate each step and required code how to do the following:

kill a node
slow down CPU of a node
slow down memory of a node

(you can choose a follower as the node)

The node can be either a leader or a follow, or both (e.g., in Copilot).

Action plan for slooo

Based on #39 #40, we propose the following steps for improving slooo.

Start with easier improvements that make the code base more structured, to fully make use of the OOP model and to get rid of some heritage from the DepFast project.

Logging useful information like node membership for each experiment to provide a stronger base for result reasoning.
Error detection in code: we want to identify command execution status in utility functions to avoid dumping meaningless outputs into terminal.
Support easy switch between multiple levels of fail-slow faults and allows users to specify multiple slowness configs.
Slooo code redesign using OOP model. We want to allow for easier feature integration and better code readability since it is external users that adapt the tool to the quorum system. The redesign will be considered in parallel with the above three points.

Then we work on

system data collection: Use extra threads to record system usage. We may allow the user to specify sample rates, but specific details need further discussion.

and,

The point break (auto-tuning) feature. Our current thought is that, the tool should run experiments using random levels of fail-slow faults. Then, based on previous results (a statistical approach), the tool narrows the range of fail-slow faults and run a new round of experiments and repeats the process. Finally, we narrow down to the "breaking" point. My concerns for this approach are that: (1) it can be costly to run multiple rounds of experiments, (2) the statistical approach may not work, and thus the tool may find nothing eventually.

finally,

rewrite the slooo documentation contributed by @tianyin

Paper wrtting

Prof. @tianyin, I think we are almost done with porting the code (MongoDB, RethinkDB, TiDB, Redis, PolarDB)
(Hopefully we will have a working version ( version 1 ) of the framework by the end of the week).

We want to start think about the paper now. And I am not exactly sure as to where to start.
Could you please give some pointers on this? As to where to start and what to do.