
nighthawk's People

Contributors

abaptiste, ashishb-solo, chipmunkie, colimitt, dependabot[bot], dubious90, eric846, fei-deng, htuch, jiajunye, kbaichoo, kushthedude, martinezlucas98, mmorel-35, mum4k, nareddyt, oschaaf, pamorgan, pemor, phlax, pradeepcrao, qqustc, razdeep, sebastianavila5, tomjzzhang, trevortaoarm, wbpcode, wjuan-afk, yanavlasov, zhaomoy


nighthawk's Issues

Nighthawk as-a-service

Currently Nighthawk supports a CLI which allows for quick and easy execution of single test runs.
But it would be really neat to have a service next to that, which would accept configuration (gRPC or CLI) and (re-)configure on the fly as requested.

Fix CI test-run flakes

Recently Envoy added an ASSERT that sometimes fires when the integration-test server calls sleep() from its constructor before affected Nighthawk tests fork() the integration test server.

The time-system enforces that sleep() gets called on a single thread, but this isn't aware of forks.

The fork() was done as a workaround because without it we get into a fight with Envoy's integration test server about who owns the Runtime. This needs to be addressed in Nighthawk to deflake test-runs.

For reference, the offending code (take a deep breath):

// We fork the integration test fixture into a child process, to avoid conflicting

Orchestration

One idea is to implement a series of tests in Python that start up release builds of Envoy plus another server (e.g. nginx) with the same configuration, and run a predefined series of Nighthawk tests against them. Adding this may also deprecate client_test.cc, which does something hideous: it forks the integration test server into its own process in the test framework to avoid an assert (both NH and Envoy try to own the runtime loader singleton).

Support gRPC

Add the capability to load-test the gRPC protocol.

Test-server integration with the fault filter

It would be great to synthesize request headers based on the test-server's own configuration to control the delays that Envoy's fault filter induces, instead of having to send a separate request header for that.

Add global rate limiting

We might want to have a global rate limit, since different workers might be moving at different effective rates (e.g. due to hyper-threading imbalance).

Merge Statistic implementation with Envoy's

Currently Nighthawk both relies on Envoy's statistics concepts and implements its own.
The two have rapidly converged to do the same thing. It would be great to unify them.
One key difference is that Nighthawk uses HdrHistogram_c, a feature which would be good to preserve.

Tidy up includes

Envoy and NH header includes started overlapping after the transfer from envoy-perf/nighthawk to the root of this repo, resulting in potential conflicts. Tidy this up.

Generating the compilation database is broken as of Bazel 0.25

Running bazel/gen_compilation_database.sh fails with:

oschaaf@burst:~/code/envoy-perf-vscode/nighthawk$ bazel/gen_compilation_database.sh 
DEBUG: /home/oschaaf/code/envoy-perf-vscode/nighthawk/bazel-compilation-database-0.3.1/aspects.bzl:99:9: Rule with no sources: @com_google_protobuf//:cc_wkt_protos
ERROR: /home/oschaaf/.cache/bazel/_bazel_oschaaf/385722e931c3493bb3c210a3b1bab888/external/com_lyft_protoc_gen_validate/validate/BUILD:39:1: in //bazel-compilation-database-0.3.1:aspects.bzl%compilation_database_aspect aspect on cc_library rule @com_lyft_protoc_gen_validate//validate:cc_validate: 
Traceback (most recent call last):
	File "/home/oschaaf/.cache/bazel/_bazel_oschaaf/385722e931c3493bb3c210a3b1bab888/external/com_lyft_protoc_gen_validate/validate/BUILD", line 39
		//bazel-compilation-database-0.3.1:aspects.bzl%compilation_database_aspect(...)
	File "/home/oschaaf/code/envoy-perf-vscode/nighthawk/bazel-compilation-database-0.3.1/aspects.bzl", line 120, in _compilation_database_aspect_impl
		target.cc
<target @com_lyft_protoc_gen_validate//validate:cc_validate> (rule 'cc_library') doesn't have provider 'cc'
ERROR: Analysis of aspect '//bazel-compilation-database-0.3.1:aspects.bzl%compilation_database_aspect of //api/client:benchmark_options_cc' failed; build aborted: Analysis of target '@com_lyft_protoc_gen_validate//validate:cc_validate' failed; build aborted
INFO: Elapsed time: 1.335s
INFO: 0 processes.

Some sleuthing pinpointed this as breaking when Bazel 0.25 was released.
After reverting to 0.24 this works again. The Docker image we use for clang-tidy also
has Bazel 0.24, so that still works.

QPS Autoranging

Find the points to sample when drawing the QPS vs. latency curve with a minimal number of sample points, based on examining gradients between existing sampled points.

Human readable/writable input/output

Currently we rely on tclap and fmt::format in C++ to interface with the benchmarking libs. Alternatively, we could use Python or similar to implement a specialized front-end tool (e.g. CLI, HTTP server).

Optionize the cpu warmup delay

Currently workers in Nighthawk spend two seconds in a spin/yield loop, polling the clock to assess time-to-start. This should be configurable, and the default can probably be much shorter.

Improve coverage accuracy

There seems to be some inaccuracy in code coverage measurement: GCOV_EXCLUDE_XXX is ignored, and some code in headers is always flagged as not run (inlined code?).

Envoy is anticipated to switch to native coverage as well in the future, at which point it makes sense to revisit this.

Test-server per-request configuration: add an option to clear values

Currently the test server applies per-request configuration, specified in a request header, to the process-level configuration by performing a proto-level Merge().
This doesn't always work well: it's not possible to override the server configuration with type-specific defaults. For example, response-size (int, default=0) cannot be overridden to 0 when the server-level configuration is non-zero.
A bool-valued field called clear could be helpful here, to allow the client to indicate it wants to fully specify the configuration (and not inherit from what the server has configured).

Disambiguate common headers

Today, header -I paths are set up so that we have the following situation:

#include "common/api/api_impl.h"
#include "common/common/cleanup.h"
#include "common/common/thread_impl.h"
#include "common/event/dispatcher_impl.h"
#include "common/event/real_time_system.h"
#include "common/filesystem/filesystem_impl.h"
#include "common/frequency.h"
#include "common/network/utility.h"
#include "common/runtime/runtime_impl.h"
#include "common/thread_local/thread_local_impl.h"
#include "common/uri_impl.h"
#include "common/utility.h"

Some of these headers, e.g. real_time_system.h, come from Envoy's tree, effectively @envoy//source/common/event/real_time_system.h, others come from NH, e.g. //source/common:uri_impl.h.

Ideally we have a cleaner way to visually disambiguate this in the header block.

Snapshot the envoy/nh stats periodically from the workers

Snapshot the envoy/nh stats periodically from the workers so they can be plotted through time. What happens upon shutdown of the pool is interesting as well, so knowing what the stats look like pre- and post-shutdown matters too.

Nighthawk: track-for-future list

Copied from envoyproxy/envoy-perf#32, this issue tracks high priority items and technical debt. This needs to be split out, but for now copy over so we can close this over at envoy-perf.

  • This looks like a reasonable CLI to start with, but I would also like to take a proto variant of this as an input, and to make that canonical. [WIP]
  • JSON and proto output of this data; this is probably easy to do and high priority, since most consumers will want this as an option. It's also easy to reformat to something human readable from JSON via a bunch of libraries/tools.
  • gcovr test coverage. first attempts to get it working failed.
  • Add content about benchmarking best practices [WIP nighthawk/README.md]
  • Implement and use BenchmarkWorker
  • Pull in and use C++ HdrHistogram library, get rid of the python stuff for that
  • Currently we copied / hacked some stuff in ssl.h (e.g. MClientContextConfigImpl). I did this to avoid a cascade of dependencies because of the argument list of one of the interfaces involved. Revisit, evaluate, and see if we can get rid of this (or else discuss). [WIP]
  • Evaluate and discuss usage of pending_requests in the Nighthawk client. In closed loop mode, we want to detect and warn if we hit this limit because latency measurements will be skewed.
  • Disable libevent threading support
  • Overall, add docs where needed, and check for const correctness.
  • Harvey remarked: I assume this a is a per-worker rate limiter. We might want to have a global rate limit, since different workers might be moving at different effective rates, e.g. due to hyper-threading imbalance. This has to be weighed against potential lock contention, but I think we need something like this.
  • Use the dispatcher on the worker thread to delay its start instead of calling usleep(). oschaaf: the dispatcher timer's minimal resolution is 1ms, which is rather coarse for our purpose. I was thinking that maybe we can leverage the sequencer to guard the worker starting time, and spin to start workers with high-precision timing.
  • Rate limiter: bursty behavior concerns due to thread scheduling. See if we can excite that behavior and address it. We may need to globally enforce rate limiting while doing this and introduce locks, so discussing the design of the rate limiter would be good.
  • check_format/fix_format in CI
  • One thing we should decide on early is if we want an OS syscall abstraction layer. It's pretty nice for when you want to mock, and isn't that much additional boilerplate to use. See https://github.com/envoyproxy/envoy/blob/master/include/envoy/api/os_sys_calls.h and its uses in Envoy.
  • hdrhistogram_c is now included as a submodule. It would be better to make this an external repository in the Bazel sense and inject a BUILD file into it. See https://github.com/envoyproxy/envoy/blob/84d038f5a67a8dc8ca9753e9089a33d8110db3c2/api/bazel/repositories.bzl#L34 [WIP - https://github.com/envoyproxy/envoy-perf/pull/55]
  • [WIP] Add ASAN/TSAN tests to CI. Work in progress; we experienced difficulties getting this to work and had to resort to quite a hack to override the linker that gets used. We want to revisit that and do it right once we figure out how.
  • Currently NH supports the simplest of cases only, sending GET requests. Wire through a to-be-designed (Http)RequestSynthesizer in the future to allow for more functionality.
  • We need to track response_trailers_ and look at grpc-status when doing gRPC load tests later on. It's fine not to do this for now, maybe just leave a TODO
  • Redirect TCLAP logs to Envoy's logging subsystem. See https://github.com/maddouri/tclap/blob/1e1cc4fb9abbc4bfcd62c73085c3c446fca681dd/include/tclap/CmdLineOutput.h
  • Do a pass to switch us to using ::testing::XXX in tests.
  • the stats implementation is converging rapidly towards what is already there in Envoy. We should look into reusing Envoy's functionality, and perhaps see if we can upstream HdrHistogram support and double check that the stdev implementation in Envoy is numerically stable
  • Factor out the boiler plate in BenchmarkClientTest to get a reusable integration test base.
  • Envoy has stats dumping protos, see https://github.com/envoyproxy/envoy/blob/master/api/envoy/admin/v2alpha/metrics.proto, reuse them.
  • OS: discuss: Human usable interface & human readable output: currently we rely on tclap and fmt::format in c++ to interface with the benchmarking libs. Alternatively, we could use python or some such as a specialized tool to implement the frontend (e.g. cli, http server).
  • OS: discuss: Expose tls pool configuration options to control max requests allowed per connection, ciphers used, sigalg, etc. Discuss how we want to do that, do we directly accept Envoy's config for that? Or do we abstract and map our own config to Envoy's version? The first option is quick and easy, but uncontrolled. The latter may allow us to expose options more carefully and at our own pace.
  • OS: discuss: Snapshot the envoy/nh stats periodically from the workers for plotting them through time. What happens upon shutdown of the pool is interesting as well, so just pre and post shutdown is an interesting distinction too.
  • OS: discuss: Implement a series of tests in python that start up release builds of envoy + another server (e.g. nginx) with the same configuration, and run a predefined series of nighthawk tests against them. Adding this may also deprecate client_test.cc which does something hideous to fork the integration test server into its own process in the test framework.
  • OS: Add support for multiple connections in the h2 pool implementation.
  • OS: Controlling and separately managing connection management
  • OS: Separately track tls handshake timings, connection set up timings
  • Proactively prefetching connections that are lost while running a benchmark test would be a nice enhancement on top of the initial prefetching

Allow control of TLS ciphers and settings

Explicit control of TLS ciphers and session reuse makes server-to-server performance comparisons easier. Additionally, it would be nice to be able to test just the overhead of setting up these connections (and not perform any requests) and/or track times of specific milestones during the connection/TLS setup process.

Add support for multiple concurrent H2 connections

Currently Nighthawk mostly uses a single http/2 connection to issue requests. This may lead to hotspotting processes on the benchmark target. The expectation is that this will be fixed upstream in Envoy.

Add support for open-loop latency measurement

Currently Nighthawk is capable of doing closed-loop testing, which means that when configured resource limits are met (e.g. max connections, max streams), no new requests will be issued, even when that means not reaching the requested request pacing (this will show up in the output as a time-spent-blocked histogram).
In real life, clients will not wait like that, and supporting open-loop testing will help measure latencies under those circumstances.

An important part of this feature is that load generation should auto-terminate upon detecting a certain amount of in-flight requests, and maybe when it detects certain resource shortages (like running out of file descriptors).

Support QUIC

It would be awesome to have support for QUIC eventually. It is anticipated that once this feature is landed in Envoy, adding it here will be low hanging fruit.
