zio / zio-zmx
Monitoring, Metrics and Diagnostics for ZIO
Home Page: https://zio.github.io/zio-zmx/
License: Apache License 2.0
Many of the operators defined on the services in the `zio.zmx` package have environmental requirements (e.g. `listen` has a `Clock` dependency). When working with services it is generally an anti-pattern to have methods that require an environment type, since the environmental dependency should be expressed in the service itself. For example, for the `listen` method we can access the `Clock` instance that the metrics service already requires and provide it, so that none of the operators still require a `Clock`.
This ticket is to go through all the methods on the services in the `zio.zmx` package and make sure they do not require an environment type, refactoring them if necessary as described above.
I had a short review with @jdegoes. Specifically we looked at my implementation of a SingletonService which would guarantee that I am always getting the same instance of the Metrics reporting channel.
The Singleton Service implementation I had was not pure, so that should be removed and the code refactored accordingly.
Also, the ZmxApp should reuse the functionality of App rather than creating a copy of it.
Building on #107, this ticket is to implement a codec to convert between our internal representation of the Prometheus data model and the format expected by their API.
A completed ticket would include tests verifying that encoding our internal representation yields the expected result and that decoding known payloads yields the expected representation in our data model.
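As a sketch of what such a codec could look like, here is a minimal encoder from a hypothetical internal counter representation to the Prometheus text exposition format. `PromCounter` and `encodeCounter` are illustrative names only; the real internal representation comes from the data model in #107.

```scala
// Hypothetical internal representation; the real data model is defined in #107.
final case class PromCounter(name: String, help: String, value: Double)

// Encode to the Prometheus text exposition format:
//   # HELP <name> <help>
//   # TYPE <name> counter
//   <name> <value>
def encodeCounter(c: PromCounter): String =
  s"# HELP ${c.name} ${c.help}\n" +
    s"# TYPE ${c.name} counter\n" +
    s"${c.name} ${c.value}\n"
```

A decoder going the other way, plus round-trip tests against known payloads, would complete the picture described above.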
This ticket is for general cleanup and polishing of the ZIO ZMX code base, focusing on the following:
- Removing stray `println` statements
- Adding `()` after side-effecting methods
- Using `val` rather than `def` where appropriate
The `CoreMetrics` service is designed to interface directly with package-private methods in `FiberContext` within ZIO to expose metrics on the execution of ZIO programs. This ticket is to actually implement that service. This is dependent on functionality being added to ZIO to expose the needed hooks, which should hopefully be complete in advance of the hackathon. Assuming that is ready, implementing the service will consist primarily of calling those methods and re-exporting their results, along with adding tests to show that the functionality is working correctly.
The RESP protocol is just about serialization and could be implemented separately from the ZMX protocol, where the ZMX protocol would be just a specific set of commands on top of RESP.
Based on the discussion on Discord between @swachter and @atooni, in order to align the open PRs:
We are creating a ZMX server which will have access to fibers including execution traces, which could potentially have sensitive information about the underlying code being monitored. We need to make sure the ZMX server is secure by default.
When issued a malformed bulk string command, the ZMX server fails and does not respond to subsequent requests.
Example bad request (wrong character count):
echo -ne '*1\r\n$3\r\ndump\r\n' | nc localhost 1111
should be:
echo -ne '*1\r\n$4\r\ndump\r\n' | nc localhost 1111
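The length prefix that was wrong above (`$3` vs `$4`) is easy to get right mechanically. As an illustration (this is a sketch, not the project's actual encoder), a bulk string can be rendered from its payload so the declared length always matches:

```scala
// Sketch only: a RESP bulk string is length-prefixed, so deriving the prefix
// from the payload keeps the two in sync. For non-ASCII payloads the prefix
// should be the byte length rather than the character count.
def bulkString(payload: String): String =
  s"$$${payload.length}\r\n$payload\r\n"

// A RESP command is an array of bulk strings, prefixed with the element count.
def command(args: String*): String =
  s"*${args.length}\r\n" + args.map(bulkString).mkString
```

For example, `command("dump")` produces exactly the corrected request above, `*1\r\n$4\r\ndump\r\n`. Independently of this, the server should recover from a malformed request rather than stop serving.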
We have a variety of cases where we are using `List` for a collection. Unless there is a specific reason we want linear access, we should generally be using ZIO's `Chunk` data type, as it provides good performance for a range of operations that require linear or random access and has fast append and prepend.
This ticket is to go through ZIO ZMX and replace `List` with `Chunk` wherever possible.
Implement a gauge whose value can go up and down as described in https://prometheus.io/docs/instrumenting/writing_clientlibs/#gauge
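A minimal in-memory sketch of such a gauge (assuming a plain JVM implementation; this is not the eventual ZMX API) might look like:

```scala
// Sketch of a Prometheus-style gauge: unlike a counter, its value may move
// in both directions, and it can also be set directly.
final class Gauge {
  private var value: Double = 0.0
  def inc(amount: Double = 1.0): Unit = synchronized { value += amount }
  def dec(amount: Double = 1.0): Unit = inc(-amount)
  def set(v: Double): Unit = synchronized { value = v }
  def get: Double = synchronized(value)
}
```

The Prometheus guidelines linked above also suggest convenience helpers such as `setToCurrentTime` and timing blocks, which could be layered on top of this core.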
The `RequestParser`, located in zio.zmx.diagnostics.parser.scala, can parse strings into requests in the Redis serialization protocol (RESP). We should add tests of the parser with malformed and very large inputs, to make sure that we fail as soon as possible when a string is malformed, and that we fail when a string exceeds some reasonable size.
The signature of the `listen` method on the `Metrics` service is currently:
override def listen(f: List[Metric[_]] => IO[Exception, List[Long]]): ZIO[Clock, Throwable, Fiber.Runtime[Throwable, Nothing]]
It is very unclear from this signature what the `List[Long]` is supposed to represent. We should consider whether we can simplify `f` to `List[Metric[_]] => IO[Exception, Unit]`: basically, "you give me a function that takes a list of metrics and does some side effect with them, and I will give you back an effect that listens for those forever".
We should also look at whether we can simplify the result type to `IO[Throwable, Nothing]`. This would indicate that the effect will never succeed, because it represents a server that will just keep running forever. The caller of the method could then fork it to get the fiber that is currently returned, but we separate concerns a little more.
Finally, we should look at how this method can actually fail and if the error types are as specific as possible.
Currently the `FiberDumpProvider` service implements a single method, `getFiberDumps`, that returns a "dump" of the status of all fibers in a program.
This can be problematic because in a very large ZIO program there may be many fibers running at a time, so it may take non-trivial time to obtain a dump of all fibers. We should change this to return a `ZStream[Any, Nothing, Fiber.Dump]` instead of an `Iterable[Fiber.Dump]`, so the user of the fiber dumps can consume them incrementally.
Once we do this we should also add variants that allow the user to specify a particular fiber that they want to obtain a dump for as well as one that allows the user to obtain dumps up to a specified depth of descendants from the designated fibers. Then we can implement the basic "get dumps for every fiber" variant in terms of this.
Note that this may require a PR to ZIO Core to expose the necessary functionality.
ZIO ZMX contains several implementations of services in the environment. While this pattern is very useful, it is helpful primarily when we want to support alternative implementations of a service. There are several cases where we currently have functionality implemented as services that will never have more than one implementation. In these cases code can be simplified by moving the functionality to normal methods on an object.
This ticket is to do the following:
- Delete the `FiberDumpProvider` service in the `fibers` package and inline its functionality into the `Diagnostics` service in the `zio.zmx` package object, since it is not used anywhere else. Once we do this we can delete the `fibers` package object entirely.
- Make the `parser` package a set of static methods, since we only expect to ever have one RESP parser. I would also rename this to `RESPParser` for clarity.
We should review the code to see if we have any places where we are busy polling for a variable to be true. In particular, it looks like in the `collect` method in the `zio.zmx` package object we are repeatedly polling the queue. We should see if there is another way to refactor this code to avoid this, and review the rest of the code to see if there are any other cases like this.
The `ZMXServer` trait currently has only a single method, `close`, which closes the channel. We can delete this trait and have the `ZMXServer.make` operator return either a `ZManaged` or a `ZLayer` that builds in the necessary finalization logic.
The UDP client, located in the `UDPClient.scala` file, currently uses `zio.nio` in its implementation. We want ZIO ZMX to have no dependencies other than ZIO itself, so it is as easy as possible for users to add to their application.
This ticket is to refactor the UDP client to implement the functionality directly in terms of non-blocking methods from `java.nio`. A completed ticket will also have tests to verify that this functionality is working correctly.
This is to add gauges based on existing JMX metrics, for example, the number of threads, current memory consumption, etc.
Another problem with async code is profiling, since JVM profilers are stack-based. ZIO now has a unique opportunity to be the first solution to provide decent profiling of async code on the JVM.
Implement a Summary that samples observations (usually things like request durations) over sliding windows of time and provides instantaneous insight into their distributions, frequencies, and sums as described on https://prometheus.io/docs/instrumenting/writing_clientlibs/#summary
The documentation needs to include all supported metric types; in particular, Histograms and Summaries must be added.
The ZMX server, located in the ZMXServer.scala file, currently uses `zio.nio` in its implementation. We want ZIO ZMX to have no dependencies other than ZIO itself, so it is as easy as possible for users to add to their application.
This ticket is to refactor the ZMX server to implement the functionality directly in terms of non-blocking methods from `java.nio`. A completed ticket will also have tests to verify that this functionality is working correctly.
Implement a monotonically increasing counter as described in https://prometheus.io/docs/instrumenting/writing_clientlibs/#counter
Implement a Histogram that allows aggregatable distributions of events, such as request latencies. This is at its core a counter per bucket, as described on https://prometheus.io/docs/instrumenting/writing_clientlibs/#histogram
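To make the "counter per bucket" idea concrete, here is a minimal sketch (a hypothetical API, not ZMX's): each observation increments every bucket whose upper bound it does not exceed, producing the cumulative bucket counts Prometheus expects.

```scala
// Sketch of a Prometheus-style histogram. Buckets are cumulative: an
// observation of 3.0 counts toward every bucket with upper bound >= 3.0.
final class Histogram(bucketBounds: List[Double]) {
  // A +Inf bucket is always present, per the Prometheus conventions.
  private val bounds = (bucketBounds :+ Double.PositiveInfinity).sorted.distinct
  private var counts = Map.empty[Double, Long]
  private var sum    = 0.0
  private var count  = 0L

  def observe(value: Double): Unit = synchronized {
    bounds.filter(value <= _).foreach { b =>
      counts = counts.updated(b, counts.getOrElse(b, 0L) + 1L)
    }
    sum += value
    count += 1
  }

  def bucketCount(le: Double): Long = synchronized(counts.getOrElse(le, 0L))
  def totalCount: Long = synchronized(count)
  def totalSum: Double = synchronized(sum)
}
```

The `sum` and `count` fields correspond to the `_sum` and `_count` series a Prometheus client exposes alongside the buckets.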
A data model for StatsD exists in the MetricsDataModel folder. Right now this to a certain extent mixes the definition of the data model (e.g. `ServiceCheckStatus.Ok`) with the interpretation of that model (`val value: Int = 0`).
We can clean this up by implementing a separate method that "interprets" an object using pattern matching:
def encode(status: ServiceCheckStatus): Int =
???
It may make sense to have all the types in the data model inherit from one common super type so we can encode any of them.
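For example, the interpretation could live in a standalone `encode` with an exhaustive match. The variants other than `Ok` below, and the numeric codes, are the standard StatsD/DogStatsD service check statuses and are assumed here for illustration:

```scala
// Pure data model: no encoding knowledge lives on the variants themselves.
sealed trait ServiceCheckStatus
object ServiceCheckStatus {
  case object Ok       extends ServiceCheckStatus
  case object Warning  extends ServiceCheckStatus
  case object Critical extends ServiceCheckStatus
  case object Unknown  extends ServiceCheckStatus
}

// Separate interpreter: the compiler checks the match is exhaustive.
def encode(status: ServiceCheckStatus): Int = status match {
  case ServiceCheckStatus.Ok       => 0
  case ServiceCheckStatus.Warning  => 1
  case ServiceCheckStatus.Critical => 2
  case ServiceCheckStatus.Unknown  => 3
}
```

With a common super type, a single interpreter could encode any value in the data model this way.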
Currently the ZMX server and client are implemented to run on the JVM using NIO. We need to add the same server capability for use with Scala.js.
Scala.js does support the `nio` `ByteBuffer`, so we can continue to use that for this implementation:
https://github.com/scala-js/scala-js/blob/master/javalib/src/main/scala/java/nio/ByteBuffer.scala
One option to implement the server is to use the Node.js `net` library. So that we don't have to reinvent the wheel to interop Scala.js with the Node.js `net` library, we can use this dependency as an option: https://github.com/scalajs-io/nodejs
The goal is the ability to send messages from client to server, and for the server to respond back, using both Scala.js and Java.
Currently in the `ZMXSupervisor` and related data types we use a `SortedSet` of fibers. I don't think there is any reason the set needs to be sorted. An ordinary set of fibers backed by a `HashSet` would have better performance.
The fiber dumps contain many newline characters.
RESP Simple Strings can contain neither `\r` nor `\n`.
Now, with the proper RESP implementation, this is correctly ensured (newlines are removed).
So this is now correctly broken.
To not lose the newlines, the serialized `ZMXProtocol.Data.FiberDump` must be an Array of Bulk Strings.
Building on #107 and #108, this ticket is to create an actual implementation of a ZMX server that interacts with Prometheus using our own data model and encoder.
A completed ticket would implement the same functionality demonstrated in the PrometheusSpec.scala file but would not have any dependency on a third-party Prometheus client. Our own implementation can be put in the zio.zmx.prometheus package with a `live` method that constructs the server.
We have several methods on the `Metrics` service that take a large number of parameters, including `serviceCheck` and `event`. This can create a less than ideal API for the user, because it can require specifying parameters that are not needed, or lead to confusing the order of parameters, since several have the same type (e.g. `Option[String]`).
An alternative is to have a `MetricAttributes` data type. Then each of these methods could just accept a `Set[MetricAttributes]` instead of all of these individual arguments, and the library would internally traverse those attributes and construct the metric accordingly.
This ticket would involve creating the `MetricAttributes` data type, including smart constructors that make it easy for users to create the common types of metric attributes already represented in these signatures, and functionality to interpret the attributes back to the required parameters when needed, demonstrating that the functionality that works today also works with this new encoding.
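A rough sketch of what that could look like; every name here (`MetricAttribute`, `Host`, `SampleRate`, `Tag`, `hostOf`) is hypothetical, not a final design:

```scala
// Each attribute stands in for one of today's positional/optional parameters.
sealed trait MetricAttribute
object MetricAttribute {
  final case class Host(value: String)             extends MetricAttribute
  final case class SampleRate(value: Double)       extends MetricAttribute
  final case class Tag(key: String, value: String) extends MetricAttribute

  // Smart constructors keep call sites readable and order-independent.
  def host(value: String): MetricAttribute             = Host(value)
  def sampleRate(value: Double): MetricAttribute       = SampleRate(value)
  def tag(key: String, value: String): MetricAttribute = Tag(key, value)
}

// Interpreting the attribute set back to a parameter a method needs today.
def hostOf(attrs: Set[MetricAttribute]): Option[String] =
  attrs.collectFirst { case MetricAttribute.Host(h) => h }
```

A call site would then read `metrics.event(name, text, Set(MetricAttribute.host("web-1"), MetricAttribute.tag("env", "prod")))` instead of a run of same-typed optional arguments.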
`Metrics.Service` offers `listen` methods that are used to provide "senders" for aggregated metrics. There is nothing that prevents one from registering multiple senders. In that case, multiple collect daemons would compete in polling metrics from the underlying shared ring buffer. This results in surprising behavior.
In the recently merged #140 I added this TODO comment:
case ZMXProtocol.Response.Success(data) =>
  data match {
    case executionMetrics: ZMXProtocol.Data.ExecutionMetrics =>
      // TODO: Format of `ExecutionMetrics` serialized could be discussed and revisited.
      Resp.BulkString(executionMetrics.toString).serialize
In that PR I actually kept the previous format, but I think we can improve it with the use of other RESP types (especially with the recent proper implementation of RESP).
The metrics themselves have this form:
abstract class ExecutionMetrics {
  def concurrency: Int
  def capacity: Int
  def size: Int
  def enqueuedCount: Long
  def dequeuedCount: Long
  def workersCount: Int
}
Right now they are sent as one big Bulk String:
$86\r\n
concurrency:1\r\n
capacity:2\r\n
size:3\r\n
enqueued_count:4\r\n
dequeued_count:5\r\n
workers_count:6\r\n
After parsing from the RESP format, this gives a string of "concurrency:1\r\ncapacity:2\r\nsize:3\r\nenqueued_count:4\r\ndequeued_count:5\r\nworkers_count:6" that the client has to parse again.
An `Array` type could definitely be used. While RESP does not support key-value structures, I think that instead of encoding them as `key:value` we could use `Array`s again. Then these values, since they are `Int`/`Long`, could be represented as RESP `Integer`s.
So I'd propose something along the lines of:
Array(
  Array(
    SimpleString("concurrency"),
    Integer(1)
  ),
  Array(
    SimpleString("capacity"),
    Integer(2)
  ),
  [...]
)
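For reference, serializing such a nested structure is straightforward under the RESP rules (`+` for Simple Strings, `:` for Integers, `*` for Arrays with an element count). A minimal standalone sketch, with illustrative types rather than the project's actual `Resp` encoding:

```scala
// Illustrative RESP value types (the project's real Resp ADT may differ).
sealed trait Resp
final case class SimpleString(value: String)  extends Resp
final case class Integer(value: Long)         extends Resp
final case class RespArray(items: List[Resp]) extends Resp

// RESP wire encodings: "+<str>\r\n", ":<int>\r\n", "*<count>\r\n<elements>".
def serialize(r: Resp): String = r match {
  case SimpleString(s)  => s"+$s\r\n"
  case Integer(n)       => s":$n\r\n"
  case RespArray(items) => s"*${items.length}\r\n" + items.map(serialize).mkString
}
```

For example, a one-pair metrics array serializes to `*1\r\n*2\r\n+concurrency\r\n:1\r\n`, which any RESP-capable client can decode without a second, ad-hoc parsing step.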
I'm leaving this open for discussion.
It's more a question of how much of the RESP types we want to use, because API clients will have to support them.
But if we're already using a decent part of RESP, then why not just use its full power?
I think that using proper RESP types would make more sense than asking clients to parse the result again, now from our non-standard `key:value`/CRLF format.
Pinging @jczuchnowski since you've been working on that part.
Currently `ZMXProtocol` is limited to a predetermined set of types of data: execution metrics, fiber dumps, and string messages. This ticket is to make the protocol more extensible so that the user can also get data on user-defined metrics that do not fall into these categories.
A completed ticket will include a test showing an example of an application writing custom metrics and then being able to get access to those metrics.
Right now we're sending the fiber dump as one string, but it would be more useful to the client to receive it as an array of fiber dumps, one per fiber:
*5\r\n
+fiber dump #1\r\n
+fiber dump #2\r\n
+fiber dump #3\r\n
+fiber dump #4\r\n
+fiber dump #5\r\n
Unless there are performance concerns for splitting it on the server.
ZIO ZMX needs to keep track of the graph of fibers within a ZIO program, that is, which fibers are descendants of which other fibers. Currently ZIO ZMX uses the `Graph` data type in `zio.zmx.graph` to do this, which represents an immutable graph.
The first part of this ticket is to add benchmarks for the performance of ZIO ZMX when generating fiber dumps for ZIO programs that construct extremely large fiber graphs (e.g. 100,000 fibers) with different structures: very "deep" trees where one fiber forks another fiber that forks another fiber many times, very "broad" trees where one fiber forks many fibers, and mixtures of these two.
After that, the next step would be to explore whether there is a more efficient representation that could be used. The current representation is something like `AtomicReference[Map[Fiber, Set[Fiber]]]`. Alternatively, it could be something like `ConcurrentMap[Fiber, Set[Fiber]]`. We need the benchmarks to see whether that is really an improvement.
A simple CLI tool to call ZMX to get fiber dumps and metrics, using the existing ZMXClient already developed.
ZIO ZMX supports the concept of an "unsafe" service to allow users to record metrics with less overhead in certain cases. This makes sense in general, but the `listenUnsafe` method is doing things that are directly in the wheelhouse of ZIO and are much less safe without ZIO. We should delete this method and implement it in terms of the existing `listen` method.
This is the error:
/tmp/docusaurus2351590966069628621install_ssh.sh: line 4: GITHUB_DEPLOY_KEY: unbound variable
[error] java.lang.AssertionError: assertion failed: command returned 1: [/tmp/docusaurus2351590966069628621install_ssh.sh]
[error] at scala.Predef$.assert(Predef.scala:223)
[error] at mdoc.DocusaurusPlugin$XtensionProcess.execute(DocusaurusPlugin.scala:163)
[error] at mdoc.DocusaurusPlugin$.$anonfun$projectSettings$5(DocusaurusPlugin.scala:115)
[error] at mdoc.DocusaurusPlugin$.$anonfun$projectSettings$5$adapted(DocusaurusPlugin.scala:96)
[error] at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:62)
[error] at sbt.std.Transform$$anon$4.work(Transform.scala:67)
[error] at sbt.Execute.$anonfun$submit$2(Execute.scala:281)
[error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:19)
[error] at sbt.Execute.work(Execute.scala:290)
[error] at sbt.Execute.$anonfun$submit$1(Execute.scala:281)
[error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:178)
[error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:37)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[error] at java.lang.Thread.run(Thread.java:748)
[error] (docs / docusaurusPublishGhpages) java.lang.AssertionError: assertion failed: command returned 1: [/tmp/docusaurus2351590966069628621install_ssh.sh]
[error] Total time: 22 s, completed Jun 14, 2020 8:12:17 PM
Exited with code exit status 1
sample circleci failure: https://circleci.com/gh/zio/zio-zmx/493
This issue has occurred due to this change in the latest mdoc release:
https://github.com/scalameta/mdoc/pull/352/files
In the CircleCI environment building the microsite, we need to set the `GITHUB_DEPLOY_KEY` environment variable to the base64-encoded private SSH key that we set up for the microsite deploys.
For example, get the key by doing: `cat .ssh/id_zmx | base64 | pbcopy`
We want ZIO ZMX to support Prometheus as well as StatsD. We currently support a version of this today (see the PrometheusSpec.scala file in the `test` folder), but it relies on a dependency on the Prometheus client, which we do not want to have. Therefore, we need to do our own implementation of the Prometheus protocol.
The first step is to create a data model for the objects within the Prometheus data model, for example `Counter`, `Histogram`, and `CollectorRegistry`. At this point this would be a pure data model that describes the information we need to interact with Prometheus, but does not contain any of the actual logic for converting this data into the format expected by Prometheus.
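To make the shape concrete, a pure-data sketch could look like the following; the names follow the standard Prometheus client-library concepts, and the final ZMX model may differ in names and fields:

```scala
// Pure data: no encoding or I/O logic lives in the model.
final case class Label(name: String, value: String)

sealed trait PromMetric {
  def name: String
  def labels: List[Label]
}

final case class Counter(name: String, labels: List[Label], value: Double) extends PromMetric

final case class Histogram(
  name: String,
  labels: List[Label],
  buckets: Map[Double, Long], // upper bound -> cumulative count
  sum: Double,
  count: Long
) extends PromMetric

// The registry only aggregates metrics; encoding is a separate concern (#108).
final case class CollectorRegistry(metrics: List[PromMetric])
```

Keeping the model pure means the exposition-format codec can be written and tested against it independently.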
The ZMX client, located in the `ZMXClient.scala` file, currently uses `zio.nio` in its implementation. We want ZIO ZMX to have no dependencies other than ZIO itself, so it is as easy as possible for users to add to their application.
This ticket is to refactor the ZMX client to implement the functionality directly in terms of non-blocking methods from `java.nio`. A completed ticket will also have tests to verify that this functionality is working correctly.
The ZIO ZMX library should expose a small number of core abstractions, primarily the `Metrics` service in the `package.scala` file, and then specific implementations of that service in appropriate packages (e.g. a StatsD implementation in the `zio.zmx.statsd` package). Everything else that is used to implement these services should be package-private, to provide a clean API for users and flexibility to evolve implementation details.
This ticket is to reorganize the package structure of the project as follows:
- Rename the `metrics` package to be the `statsd` package
- Rename `UDPClient` to `StatsdClient` and `UDPClientUnsafe` to `StatsdClientUnsafe` for clarity, since UDP is a communication protocol and this package is really concerned with implementation details that are specific to StatsD
- Move the `Metrics.live` service in the `zio.zmx` package object to the `zio.zmx.statsd` package object, since this implementation is specific to StatsD
- Move the data types in the `MetricsDataModel` and `MetricsConfigDataModel` traits to be top-level data types within the `zio.zmx` package and delete the traits. Since the traits are not used anywhere except being extended in the package object, this should be equivalent and will make it easier for users to find these data types
- Other than the `live` method in `zio.zmx.statsd` above, make sure everything in any of the `zio.zmx` sub-packages (e.g. `diagnostics`, `fibers`, `graph`, `parser`) is `private[zmx]`
After these changes, generating ScalaDoc for ZIO ZMX should produce documentation that only shows the values in the `zio.zmx` package object, the data types currently in `MetricsModel` and `MetricsDataModel`, and the `live` service in the `zio.zmx.statsd` package.