
nighthawk's People

Contributors

abaptiste, ashishb-solo, chipmunkie, colimitt, dependabot[bot], dubious90, eric846, fei-deng, htuch, jiajunye, kbaichoo, kushthedude, martinezlucas98, mmorel-35, mum4k, nareddyt, oschaaf, pamorgan, pemor, phlax, pradeepcrao, qqustc, razdeep, sebastianavila5, tomjzzhang, trevortaoarm, wbpcode, wjuan-afk, yanavlasov, zhaomoy


nighthawk's Issues

Nighthawk as-a-service

Currently Nighthawk supports a CLI which allows for quick and easy execution of single test runs.
But it would be really neat to have a service next to that, which would accept configuration (gRPC or CLI) and (re-)configure on the fly as requested.

Fix CI test-run flakes

Recently Envoy added an ASSERT that sometimes fires when the integration-test server calls sleep() from its constructor before affected Nighthawk tests fork() the integration test server.

The time-system enforces that sleep() gets called on a single thread, but this isn't aware of forks.

The fork() was done as a workaround because without it we get into a fight with Envoy's integration test server about who owns the Runtime. This needs to be addressed in Nighthawk to deflake test-runs.

For reference, the offending code (take a deep breath):

// We fork the integration test fixture into a child process, to avoid conflicting

Orchestration

One idea is to implement a series of tests in Python that start up release builds of Envoy plus another server (e.g. nginx) with the same configuration, and run a predefined series of Nighthawk tests against them. Adding this may also deprecate client_test.cc, which does something hideous: it forks the integration test server into its own process in the test framework to avoid an assert (both NH and Envoy try to own the runtime loader singleton).

Support gRPC

Add the capability to load-test the gRPC protocol.

Test-server integration with the fault filter

It would be great to synthesize request headers based on the test-server's own configuration to control the delays that Envoy's fault filter induces, instead of having to send a separate request header for that.

Add global rate limiting

We might want to have a global rate limit, since different workers might be moving at different effective rates (e.g. due to hyper-threading imbalance).

Merge Statistic implementation with Envoy's

Currently Nighthawk both relies on Envoy's statistics concepts and implements its own.
The two have rapidly converged to do the same thing. It would be great to unify them.
One key difference is that Nighthawk uses HdrHistogram_c, a feature which would be good to preserve.

Tidy up includes

Envoy and NH header includes started overlapping after the transfer from envoy-perf/nighthawk to the root of this repo, resulting in potential conflicts. Tidy this up.

Generating the compilation database is broken as of Bazel 0.25

Running bazel/gen_compilation_database.sh fails with:

oschaaf@burst:~/code/envoy-perf-vscode/nighthawk$ bazel/gen_compilation_database.sh 
DEBUG: /home/oschaaf/code/envoy-perf-vscode/nighthawk/bazel-compilation-database-0.3.1/aspects.bzl:99:9: Rule with no sources: @com_google_protobuf//:cc_wkt_protos
ERROR: /home/oschaaf/.cache/bazel/_bazel_oschaaf/385722e931c3493bb3c210a3b1bab888/external/com_lyft_protoc_gen_validate/validate/BUILD:39:1: in //bazel-compilation-database-0.3.1:aspects.bzl%compilation_database_aspect aspect on cc_library rule @com_lyft_protoc_gen_validate//validate:cc_validate: 
Traceback (most recent call last):
	File "/home/oschaaf/.cache/bazel/_bazel_oschaaf/385722e931c3493bb3c210a3b1bab888/external/com_lyft_protoc_gen_validate/validate/BUILD", line 39
		//bazel-compilation-database-0.3.1:aspects.bzl%compilation_database_aspect(...)
	File "/home/oschaaf/code/envoy-perf-vscode/nighthawk/bazel-compilation-database-0.3.1/aspects.bzl", line 120, in _compilation_database_aspect_impl
		target.cc
<target @com_lyft_protoc_gen_validate//validate:cc_validate> (rule 'cc_library') doesn't have provider 'cc'
ERROR: Analysis of aspect '//bazel-compilation-database-0.3.1:aspects.bzl%compilation_database_aspect of //api/client:benchmark_options_cc' failed; build aborted: Analysis of target '@com_lyft_protoc_gen_validate//validate:cc_validate' failed; build aborted
INFO: Elapsed time: 1.335s
INFO: 0 processes.

Some sleuthing pinpointed this as breaking when Bazel 0.25 was released.
After reverting to 0.24 this works again. The Docker image we use for clang-tidy also
has Bazel 0.24, so that still works.

QPS Autoranging

Find the points to sample when drawing the QPS vs. latency curve with a minimal number of sample points, based on examining gradients between existing sampled points.

Human readable/writable input/output

Currently we rely on tclap and fmt::format in C++ to interface with the benchmarking libs. Alternatively, we could use Python or similar to implement a specialized front-end tool (e.g. CLI, HTTP server).

Optionize the cpu warmup delay

Currently workers in Nighthawk spend two seconds in a spin/yield loop, polling the clock to assess time-to-start. This should be configurable, and the default can probably be much shorter.

Improve coverage accuracy

There seems to be some inaccuracy in code coverage measurement: GCOV_EXCLUDE_XXX is ignored, and some code in headers is always flagged as not run (inlined code?).

Envoy is anticipated to switch to native coverage as well in the future, at which point it makes sense to revisit this.

Test-server per-request configuration: add an option to clear values

Currently the test server applies per-request configuration, specified in a request header, to the process-level configuration by performing a proto-level Merge().
This doesn't always work well: it's not possible to override the server configuration with type-specific defaults. For example, response-size (int, default=0) cannot be overridden to 0 when the server-level configuration is non-zero.
A bool-valued field called clear could be helpful here, to allow the client to indicate it wants to fully specify the configuration (and not inherit from what the server has configured).

Disambiguate common headers

Today, header -I paths are set up so that we have the following situation:

#include "common/api/api_impl.h"
#include "common/common/cleanup.h"
#include "common/common/thread_impl.h"
#include "common/event/dispatcher_impl.h"
#include "common/event/real_time_system.h"
#include "common/filesystem/filesystem_impl.h"
#include "common/frequency.h"
#include "common/network/utility.h"
#include "common/runtime/runtime_impl.h"
#include "common/thread_local/thread_local_impl.h"
#include "common/uri_impl.h"
#include "common/utility.h"

Some of these headers, e.g. real_time_system.h, come from Envoy's tree, effectively @envoy//source/common/event/real_time_system.h, others come from NH, e.g. //source/common:uri_impl.h.

Ideally we have a cleaner way to visually disambiguate this in the header block.

Snapshot the envoy/nh stats periodically from the workers

Snapshot the envoy/nh stats periodically from the workers so they can be plotted through time. What happens upon shutdown of the pool is interesting as well, so knowing what the stats look like pre- and post-shutdown matters too.

Nighthawk: track-for-future list

Copied from envoyproxy/envoy-perf#32, this issue tracks high priority items and technical debt. This needs to be split out, but for now copy over so we can close this over at envoy-perf.

  • This looks like a reasonable CLI to start with, but I would also like to take a proto variant of this as an input, and to make that canonical. [WIP]
  • JSON and proto output of this data; this is probably easy to do and high priority, since most consumers will want this as an option. It's also easy to reformat to something human readable from JSON via a bunch of libraries/tools.
  • gcovr test coverage. first attempts to get it working failed.
  • Add content about benchmarking best practices [WIP nighthawk/README.md]
  • Implement and use BenchmarkWorker
  • Pull in and use C++ HdrHistogram library, get rid of the python stuff for that
  • Currently we copied / hacked some stuff in ssl.h (e.g. MClientContextConfigImpl). I did this to avoid a cascade of dependencies because of the argument list of one of the interfaces involved. Revisit, evaluate, and see if we can get rid of this (or else discuss). [WIP]
  • Evaluate and discuss usage of pending_requests in the Nighthawk client. In closed loop mode, we want to detect and warn if we hit this limit because latency measurements will be skewed.
  • Disable libevent threading support
  • Overall, add docs where needed, and check for const correctness.
  • Harvey remarked: I assume this a is a per-worker rate limiter. We might want to have a global rate limit, since different workers might be moving at different effective rates, e.g. due to hyper-threading imbalance. This has to be weighed against potential lock contention, but I think we need something like this.
  • Use the dispatcher on the worker thread to delay its start instead of calling usleep(). oschaaf: the dispatcher timer's minimal resolution is 1ms, which is rather coarse for our purpose. I was thinking that maybe we can leverage the sequencer to guard the worker starting time, and spin to start workers with high-precision timing.
  • Rate limiter: bursty behavior concerns due to thread scheduling. See if we can excite that behavior and address it. We may need to globally enforce rate limiting while doing this and introduce locks, so discussing the design of the rate limiter would be good.
  • check_format/fix_format in CI
  • One thing we should decide on early is if we want an OS syscall abstraction layer. It's pretty nice for when you want to mock, and isn't that much additional boilerplate to use. See https://github.com/envoyproxy/envoy/blob/master/include/envoy/api/os_sys_calls.h and its uses in Envoy.
  • hdrhistogram_c is now included as a submodule. It would be better to make this an external repository in the Bazel sense and inject a BUILD file into it. See https://github.com/envoyproxy/envoy/blob/84d038f5a67a8dc8ca9753e9089a33d8110db3c2/api/bazel/repositories.bzl#L34 [WIP - https://github.com/envoyproxy/envoy-perf/pull/55]
  • [WIP] Add ASAN/TSAN tests to CI. Work in progress; we experienced difficulties getting this to work and had to resort to quite a hack to override the linker that gets used. We want to revisit that and do it right once we figure out how.
  • Currently NH supports the simplest of cases only, sending GET requests. Wire through a to-be-designed (Http)RequestSynthesizer in the future to allow for more functionality.
  • We need to track response_trailers_ and look at grpc-status when doing gRPC load tests later on. It's fine not to do this for now, maybe just leave a TODO
  • Redirect TCLAP logs to Envoy's logging subsystem. See https://github.com/maddouri/tclap/blob/1e1cc4fb9abbc4bfcd62c73085c3c446fca681dd/include/tclap/CmdLineOutput.h
  • Do a pass to switch us to using ::testing::XXX in tests.
  • the stats implementation is converging rapidly towards what is already there in Envoy. We should look into reusing Envoy's functionality, and perhaps see if we can upstream HdrHistogram support and double check that the stdev implementation in Envoy is numerically stable
  • Factor out the boiler plate in BenchmarkClientTest to get a reusable integration test base.
  • Envoy has stats dumping protos, see https://github.com/envoyproxy/envoy/blob/master/api/envoy/admin/v2alpha/metrics.proto, reuse them.
  • OS: discuss: Human usable interface & human readable output: currently we rely on tclap and fmt::format in c++ to interface with the benchmarking libs. Alternatively, we could use python or some such as a specialized tool to implement the frontend (e.g. cli, http server).
  • OS: discuss: Expose tls pool configuration options to control max requests allowed per connection, ciphers used, sigalg, etc. Discuss how we want to do that, do we directly accept Envoy's config for that? Or do we abstract and map our own config to Envoy's version? The first option is quick and easy, but uncontrolled. The latter may allow us to expose options more carefully and at our own pace.
  • OS: discuss: Snapshot the envoy/nh stats periodically from the workers for plotting them through time. What happens upon shutdown of the pool is interesting as well, so just pre and post shutdown is an interesting distinction too.
  • OS: discuss: Implement a series of tests in python that start up release builds of envoy + another server (e.g. nginx) with the same configuration, and run a predefined series of nighthawk tests against them. Adding this may also deprecate client_test.cc which does something hideous to fork the integration test server into its own process in the test framework.
  • OS: Add support for multiple connections in the h2 pool implementation.
  • OS: Controlling and separately managing connection management
  • OS: Separately track tls handshake timings, connection set up timings
  • Proactively prefetching connections that are lost while running a benchmark test would be a nice enhancement on top of the initial prefetching

Allow control of TLS ciphers and settings

Explicit control of TLS ciphers and session reuse makes server-to-server performance comparisons easier. Additionally, it would be nice to be able to test just the overhead of setting up these connections (and not perform any requests) and/or track times of specific milestones during the connection/TLS setup process.

Add support for multiple concurrent H2 connections

Currently Nighthawk mostly uses a single http/2 connection to issue requests. This may lead to hotspotting processes on the benchmark target. The expectation is that this will be fixed upstream in Envoy.

Add support for open-loop latency measurement

Currently Nighthawk is capable of doing closed-loop testing, which means that when configured resource limits are met (e.g. max connections, max streams), no new requests will be issued, even when that means not reaching the requested request pacing (this will show up in the output as a time-spent-blocked histogram).
In real life, clients will not wait like that, and supporting open-loop testing will help measure latencies under those circumstances.

An important part of this feature is that load generation should auto-terminate upon detecting a certain amount of in-flight requests, and maybe when it detects certain resource shortages (like running out of file descriptors).

Support QUIC

It would be awesome to have support for QUIC eventually. It is anticipated that once this feature is landed in Envoy, adding it here will be low hanging fruit.
