Preamble
Idea: 51-test-cluster
Title: Testing cluster
Status: Draft
Created: 2017-11-29
Summary
Provision a test cluster consisting of Status nodes running a simulation of real user behavior. Set up high-level metrics monitoring and track changes between releases.
Vision
The idea stems from #22 (tools for diagnosing performance regressions). One of the main challenges there is simulating real-world load, and currently we have no way to do this. Analyzing performance on a single device is also prone to inaccurate results due to the high variability of hardware, software running in the background and other conditions. We also have no easy way to gather the metrics we want from devices.
This leads to the idea of provisioning a cluster consisting of nodes (status-go, real devices or both), including bootnodes. The cluster may run on its own test network or on an existing test network (Ropsten). Each node in the cluster shall be instrumented and configured for metrics collection. Infrastructure for metrics gathering, storage and display should be set up.
![rand_graph](https://user-images.githubusercontent.com/880202/33385026-e68e478a-d527-11e7-8faf-0f7628e71c20.png)
Using graph visualization tools (like Grafana), it'd be possible to see statistically sound performance measurements, pinpoint changes to specific releases/versions and easily identify regressions.
![screen-shot-2017-09-19-at-12 32 38](https://user-images.githubusercontent.com/880202/33385014-df9aea0a-d527-11e7-9f7e-54a3eff5a4c1.png)
Think of this cluster as a Status network playground, where you can deploy, say, 30% of nodes with a new change and easily see the difference in performance metrics against the stable version. It also enables further possibilities for data gathering and exploration. For example: by collecting stats about each incoming and outgoing Whisper message, we can visualize Whisper protocol behavior, which may help to build intuition around it and help to debug/develop future versions of the protocol.
Swarm Participants
- Lead contributor: @divan
- Testing & Evaluation:
- Contributor:
- Contributor:
- UX (if relevant):
Requirements
- There is a cluster provisioned and deployed using available cloud providers
- Deployment scripts and tooling are well documented, codified (i.e. terraform/packer) and understood
- Core team members have a clear understanding of how to deploy a Status build to the cluster
- Status code is instrumented with metrics and the Core team has a clear understanding of how to add new metrics if needed
- Metrics gathering infrastructure is designed and deployed (this includes metrics collection, storage and visualization software)
- Real-world usage simulation is designed and implemented, so each deployed node automatically starts "behaving as a user"
Goals & Implementation Plan
Implementation of this idea has three roughly independent parts that need to be researched, designed and implemented:
- cluster infrastructure
- metrics part (changes to code and infrastructure)
- usage simulation
Cluster infrastructure
This part should start by evaluating the viable size of the cluster we want: 50 nodes, 100, 1000, dynamic? Then, which nodes the cluster should consist of: only status-go nodes, real devices/simulators, or both.
Then find the best software solution for that. This part requires an understanding of the Ethereum discovery process. Solutions like Docker Swarm might be enough, but it's possible that we'll want to simulate a real network topology, for which we'll need specialized simulators like Mininet. Each node should probably be isolated using containers, but alternative isolation approaches can of course be evaluated. It's unlikely that the cluster can run on a modern laptop (it would be awesome though), so a cloud provider should be chosen, whichever is easiest to work with (AWS/GCP/DO, I guess).
Once the vision of how the cluster should look is clear, provisioning scripts and tools should be designed and implemented to be developer friendly, with a high level of automation (again, terraform is probably the right way to go). Ideally, we should be able to deploy as many identical clusters as we wish without any hassle.
If the cluster runs on a private network, it should set up its own bootnodes as well.
Metrics
As the main purpose of having a test cluster is to gather data and observe behavior at scale, the code needs to be instrumented to provide those metrics to the metrics collection infrastructure. Here we have two connected parts: code instrumentation and setting up the metrics collection infrastructure.
Metrics instrumentation
Developers might want to add custom metrics apart from the obvious things to measure (CPU, memory, I/O stats, etc.). Go code would probably want to report the number of goroutines, garbage collection stats and so on, plus many custom things like the number of Jail cells or incoming and outgoing RPC requests.
The task here is to make code instrumentation as friendly to the developer as possible: it should be easy to add and test new metrics with a minimal learning curve. One example of such an easy approach is the expvar package from the Go standard library, which might work perfectly for the pull model of metrics. Which model to use (pull or push) is a subject to investigate.
Finally, the instrumentation should not go into production. This can be implemented via build tags, or simply by mocking it with a dummy no-op metrics sender, which doesn't change the resulting binary code.
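The no-op approach could look like this (the `Metrics` interface and method names are hypothetical, for illustration only):

```go
package main

import "fmt"

// Metrics is a hypothetical instrumentation interface; call sites use it
// everywhere, and the build decides which implementation is wired in.
type Metrics interface {
	Count(name string, delta int64)
	Gauge(name string, value float64)
}

// noopMetrics discards everything; the empty methods can be inlined
// away, so the production binary is effectively unchanged.
type noopMetrics struct{}

func (noopMetrics) Count(string, int64)   {}
func (noopMetrics) Gauge(string, float64) {}

// logMetrics is a stand-in for a real reporter used on the test cluster.
type logMetrics struct{}

func (logMetrics) Count(name string, delta int64)   { fmt.Printf("count %s += %d\n", name, delta) }
func (logMetrics) Gauge(name string, value float64) { fmt.Printf("gauge %s = %g\n", name, value) }

// metrics would be selected via build tags (e.g. `-tags metrics`);
// here we swap it at runtime just to show both behaviors.
var metrics Metrics = noopMetrics{}

func main() {
	metrics.Count("rpc.requests", 1) // silent in production builds
	metrics = logMetrics{}
	metrics.Count("rpc.requests", 1) // reported on the test cluster
}
```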
Metrics infrastructure
This infrastructure should be a part of the cluster deployment, so if there are many clusters, each has its own metrics dashboard and tooling. Essentially it involves metrics collection code, storage (for some period of time) and visualization software. There is currently a lot of software to choose from, including Prometheus and Grafana, so the best tools should be chosen here.
Then the deployment scripts and code should be implemented. Ideally, this should require (almost) zero configuration on the nodes' side.
Usage simulation
This part consists of developing ways to automate user interaction with a Status node and researching real-world user behavior. The first is more or less simple: provide an API to talk to the node and make it do stuff (send messages, create chats, use dApps, send money, etc.). The second is trickier, because effectively it's about simulating the whole economy and human behavior: the simulation code should decide who sends a message to whom, how often, how much money to send, how to use dApps, and so on.
Obviously, a perfect real-world simulation is unlikely to be achieved; we just need the simulation to have two properties:
- sequences of user actions generated should be close to the real world usage (we might grab them from TestFairy sessions)
- probability distributions should be close to what we think is the real-world case (e.g. a Poisson distribution for independent actions); we should gather knowledge about real-world usage and improve the simulation as much as possible
Each simulation agent could be independent or controlled by a single node in the cluster; which approach is better is subject to investigation.
Minimum Viable Product
MVP should consist of:
- a simple cluster of <10 nodes, possibly running on the laptop
- three metrics reported and visualized in the dashboard: CPU, memory, Goroutines (i.e. one custom metric)
- naive user behavior simulation: login, add a new contact, send some messages, receive messages, sleep, repeat
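The naive MVP behavior could be a simple script around a node API (the `Node` type and its methods here are hypothetical placeholders for whatever API the node exposes):

```go
package main

import (
	"fmt"
	"time"
)

// Node is a hypothetical handle to a Status node's API; the method set
// mirrors the naive MVP loop (login, add contact, send, sleep, repeat).
type Node struct{ user string }

func (n *Node) Login()                { fmt.Println(n.user, "login") }
func (n *Node) AddContact(who string) { fmt.Println(n.user, "add contact", who) }
func (n *Node) Send(to, msg string)   { fmt.Println(n.user, "->", to+":", msg) }

// simulate runs the naive behavior loop for a fixed number of rounds.
func simulate(n *Node, peer string, rounds int) {
	n.Login()
	n.AddContact(peer)
	for i := 0; i < rounds; i++ {
		n.Send(peer, fmt.Sprintf("hello #%d", i))
		time.Sleep(10 * time.Millisecond) // stand-in for a realistic pause
	}
}

func main() {
	simulate(&Node{user: "alice"}, "bob", 3)
}
```

Later iterations would replace the fixed sleep and scripted sequence with the distribution-driven behavior described above.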
Goal Date: 2017-12-25 (Xmas!)
Description: MVP
Iteration N.1
- increase the number of nodes to 100
- set up cloud infrastructure if needed to support this number
- set up metrics infrastructure for the cluster in the cloud
Goal Date: 2018-01-20 (adjusted for the winter holidays)
Description: Move to the cloud
Iteration N.2
- evaluate and develop user simulation strategy and tooling
- work on developer experience: libs & docs
- add more metrics and better visualizations (including net topology, for example)
Goal Date: 2018-02-10
Description: Make it candy
Supporting Role Communication
Copyright
Copyright and related rights waived via CC0.