zio / zio-zmx
Monitoring, Metrics and Diagnostics for ZIO
Home Page: https://zio.github.io/zio-zmx/
License: Apache License 2.0
Many of the operators defined on the services in the `zio.zmx` package have environmental requirements (e.g. `listen` has a `Clock` dependency). When working with services it is generally an anti-pattern to have methods that require an environment type, since the environmental dependency should be expressed in the service itself. For example, for the `listen` method we can access the `Clock` instance that the metrics service already requires and provide it, so that none of the operators still require a `Clock`.
This ticket is to go through all the methods on the services in the `zio.zmx` package and make sure they do not require an environment type, refactoring them if necessary as described above.
I had a short review with @jdegoes. Specifically we looked at my implementation of a SingletonService which would guarantee that I am always getting the same instance of the Metrics reporting channel.
The Singleton Service implementation I had was not pure, so that should be removed and the code refactored accordingly.
Also, the ZmxApp should reuse the functionality of App rather than creating a copy of it.
Building on #107, this ticket is to implement a codec to convert between our internal representation of the Prometheus data model and the format expected by their API.
A completed ticket would include tests verifying that encoding our internal representation yields the expected result and that decoding known payloads yields the expected representation in our data model.
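As a sketch of what such a codec could look like, here is a minimal encoder from a hypothetical internal counter representation to the Prometheus text exposition format. `PromCounter` and `encodeCounter` are illustrative names only; the real internal representation comes from the data model in #107.

```scala
// Hypothetical internal representation; the real data model is defined in #107.
final case class PromCounter(name: String, help: String, value: Double)

// Encode to the Prometheus text exposition format:
//   # HELP <name> <help>
//   # TYPE <name> counter
//   <name> <value>
def encodeCounter(c: PromCounter): String =
  s"# HELP ${c.name} ${c.help}\n" +
    s"# TYPE ${c.name} counter\n" +
    s"${c.name} ${c.value}\n"
```

A decoder going the other way, plus round-trip tests against known payloads, would complete the picture described above.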
This ticket is for general cleanup and polishing of the ZIO ZMX code base, focusing on the following:
- Removing stray `println` statements
- Adding `()` after side-effecting methods
- Using `val` rather than `def` where appropriate
The `CoreMetrics` service is designed to interface directly with package-private methods in `FiberContext` within ZIO to expose metrics on the execution of ZIO programs. This ticket is to actually implement that service. This is dependent on functionality being added to ZIO to expose the needed hooks, which should hopefully be complete in advance of the hackathon. Assuming that is ready, implementing the service will consist primarily of calling those methods and re-exporting their results, along with adding tests to show that the functionality is working correctly.
The RESP protocol is just about serialization and could be implemented separately from the ZMX protocol, where the ZMX protocol would be just a specific set of commands on top of RESP.
Based on the discussion on Discord between @swachter and @atooni, in order to align the open PRs:
We are creating a ZMX server which will have access to fibers including execution traces, which could potentially have sensitive information about the underlying code being monitored. We need to make sure the ZMX server is secure by default.
When issued a malformed bulk string command, the ZMX server fails and does not respond to subsequent requests.
Example bad request (wrong character count):
echo -ne '*1\r\n$3\r\ndump\r\n' | nc localhost 1111
should be:
echo -ne '*1\r\n$4\r\ndump\r\n' | nc localhost 1111
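The length prefix that was wrong above (`$3` vs `$4`) is easy to get right mechanically. As an illustration (this is a sketch, not the project's actual encoder), a bulk string can be rendered from its payload so the declared length always matches:

```scala
// Sketch only: a RESP bulk string is length-prefixed, so deriving the prefix
// from the payload keeps the two in sync. For non-ASCII payloads the prefix
// should be the byte length rather than the character count.
def bulkString(payload: String): String =
  s"$$${payload.length}\r\n$payload\r\n"

// A RESP command is an array of bulk strings, prefixed with the element count.
def command(args: String*): String =
  s"*${args.length}\r\n" + args.map(bulkString).mkString
```

For example, `command("dump")` produces exactly the corrected request above, `*1\r\n$4\r\ndump\r\n`. Independently of this, the server should recover from a malformed request rather than stop serving.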
We have a variety of cases where we are using `List` for a collection. Unless there is a specific reason we want linear access, we should generally be using ZIO's `Chunk` data type, as it provides good performance for a range of operations that require linear or random access and has fast append and prepend.
This ticket is to go through ZIO ZMX and replace `List` with `Chunk` wherever possible.
Implement a gauge whose value can go up and down as described in https://prometheus.io/docs/instrumenting/writing_clientlibs/#gauge
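A minimal in-memory sketch of such a gauge (assuming a plain JVM implementation; this is not the eventual ZMX API) might look like:

```scala
// Sketch of a Prometheus-style gauge: unlike a counter, its value may move
// in both directions, and it can also be set directly.
final class Gauge {
  private var value: Double = 0.0
  def inc(amount: Double = 1.0): Unit = synchronized { value += amount }
  def dec(amount: Double = 1.0): Unit = inc(-amount)
  def set(v: Double): Unit = synchronized { value = v }
  def get: Double = synchronized(value)
}
```

The Prometheus guidelines linked above also suggest convenience helpers such as `setToCurrentTime` and timing blocks, which could be layered on top of this core.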
The `RequestParser`, located in zio.zmx.diagnostics.parser.scala, can parse strings into requests in the Redis serialization protocol (RESP). We should add tests of the parser with malformed and very large inputs, to make sure that we fail as soon as possible when a string is malformed, and that we fail when a string exceeds some reasonable size.
The signature of the `listen` method on the `Metrics` service is currently:
override def listen(f: List[Metric[_]] => IO[Exception, List[Long]]): ZIO[Clock, Throwable, Fiber.Runtime[Throwable, Nothing]]
It is very unclear from this signature what the `List[Long]` is supposed to represent. We should consider whether we can simplify `f` to `List[Metric[_]] => IO[Exception, Unit]`: basically, "you give me a function that takes a list of metrics and does some side effect with them, and I will give you back an effect that listens for those forever".
We should also look at whether we can simplify the result type to `IO[Throwable, Nothing]`. This would indicate that the effect will never succeed, because it represents a server that will just keep running forever. The caller of the method could then fork it to get the fiber that is currently returned, but we separate concerns a little more.
Finally, we should look at how this method can actually fail and if the error types are as specific as possible.
Currently the `FiberDumpProvider` service implements a single method, `getFiberDumps`, that returns a "dump" of the status of all fibers in a program.
This can be problematic because in a very large ZIO program there may be many fibers running at a time, so it may take non-trivial time to obtain a dump of all fibers. We should change this to return a `ZStream[Any, Nothing, Fiber.Dump]` instead of an `Iterable[Fiber.Dump]`, so the user of the fiber dumps can consume them incrementally.
Once we do this we should also add variants that allow the user to specify a particular fiber that they want to obtain a dump for as well as one that allows the user to obtain dumps up to a specified depth of descendants from the designated fibers. Then we can implement the basic "get dumps for every fiber" variant in terms of this.
Note that this may require a PR to ZIO Core to expose the necessary functionality.
ZIO ZMX contains several implementations of services in the environment. While this pattern is very useful, it is helpful primarily when we want to support alternative implementations of a service. There are several cases where we currently have functionality implemented as services that will never have more than one implementation. In these cases code can be simplified by moving the functionality to normal methods on an object.
This ticket is to do the following:
- Delete the `FiberDumpProvider` service in the `fibers` package and inline its functionality into the `Diagnostics` service in the `zio.zmx` package object, since it is not used anywhere else. Once we do this we can delete the `fibers` package object entirely.
- Make the `parser` package a set of static methods, since we only expect to ever have one RESP parser. I would also rename this to `RESPParser` for clarity.
We should review the code to see if we have any places where we are busy polling for a variable to be true. In particular, it looks like in the `collect` method in the `zio.zmx` package object we are repeatedly polling the queue. We should see if there is another way to refactor this code to avoid this, and review the rest of the code to see if there are any other cases like this.
The `ZMXServer` trait currently has only a single method, `close`, which closes the channel. We can delete this trait and have the `ZMXServer.make` operator return either a `ZManaged` or a `ZLayer` that builds in the necessary finalization logic.
The UDP client, located in the `UDPClient.scala` file, currently uses `zio.nio` in its implementation. We want ZIO ZMX to have no dependencies other than ZIO itself, so it is as easy as possible for users to add to their application.
This ticket is to refactor the UDP client to implement the functionality directly in terms of non-blocking methods from `java.nio`. A completed ticket will also have tests to verify that this functionality is working correctly.
This is to add gauges based on existing JMX metrics, for example, the number of threads, current memory consumption, etc.
Another problem with async code is profiling, since JVM profilers are stack-based. ZIO now has a unique opportunity to be the first solution to provide decent profiling of async code on the JVM.
Implement a Summary that samples observations (usually things like request durations) over sliding windows of time and provides instantaneous insight into their distributions, frequencies, and sums as described on https://prometheus.io/docs/instrumenting/writing_clientlibs/#summary
The documentation needs to include all supported metric types; in particular, Histograms and Summaries must be added.
The ZMX server, located in the ZMXServer.scala file, currently uses `zio.nio` in its implementation. We want ZIO ZMX to have no dependencies other than ZIO itself, so it is as easy as possible for users to add to their application.
This ticket is to refactor the ZMX server to implement the functionality directly in terms of non-blocking methods from `java.nio`. A completed ticket will also have tests to verify that this functionality is working correctly.
Implement a monotonically increasing counter as described in https://prometheus.io/docs/instrumenting/writing_clientlibs/#counter
Implement a Histogram that allows aggregatable distributions of events, such as request latencies. This is at its core a counter per bucket, as described on https://prometheus.io/docs/instrumenting/writing_clientlibs/#histogram
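To make the "counter per bucket" idea concrete, here is a minimal sketch (a hypothetical API, not ZMX's): each observation increments every bucket whose upper bound it does not exceed, producing the cumulative bucket counts Prometheus expects.

```scala
// Sketch of a Prometheus-style histogram. Buckets are cumulative: an
// observation of 3.0 counts toward every bucket with upper bound >= 3.0.
final class Histogram(bucketBounds: List[Double]) {
  // A +Inf bucket is always present, per the Prometheus conventions.
  private val bounds = (bucketBounds :+ Double.PositiveInfinity).sorted.distinct
  private var counts = Map.empty[Double, Long]
  private var sum    = 0.0
  private var count  = 0L

  def observe(value: Double): Unit = synchronized {
    bounds.filter(value <= _).foreach { b =>
      counts = counts.updated(b, counts.getOrElse(b, 0L) + 1L)
    }
    sum += value
    count += 1
  }

  def bucketCount(le: Double): Long = synchronized(counts.getOrElse(le, 0L))
  def totalCount: Long = synchronized(count)
  def totalSum: Double = synchronized(sum)
}
```

The `sum` and `count` fields correspond to the `_sum` and `_count` series a Prometheus client exposes alongside the buckets.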
A data model for StatsD exists in the MetricsDataModel folder. Right now this to a certain extent mixes the definition of the data model (e.g. `ServiceCheckStatus.Ok`) with the interpretation of that model (`val value: Int = 0`).
We can clean this up by implementing a separate method that "interprets" an object using pattern matching:
def encode(status: ServiceCheckStatus): Int =
???
It may make sense to have all the types in the data model inherit from one common super type so we can encode any of them.
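For example, the interpretation could live in a standalone `encode` with an exhaustive match. The variants other than `Ok` below, and the numeric codes, are the standard StatsD/DogStatsD service check statuses and are assumed here for illustration:

```scala
// Pure data model: no encoding knowledge lives on the variants themselves.
sealed trait ServiceCheckStatus
object ServiceCheckStatus {
  case object Ok       extends ServiceCheckStatus
  case object Warning  extends ServiceCheckStatus
  case object Critical extends ServiceCheckStatus
  case object Unknown  extends ServiceCheckStatus
}

// Separate interpreter: the compiler checks the match is exhaustive.
def encode(status: ServiceCheckStatus): Int = status match {
  case ServiceCheckStatus.Ok       => 0
  case ServiceCheckStatus.Warning  => 1
  case ServiceCheckStatus.Critical => 2
  case ServiceCheckStatus.Unknown  => 3
}
```

With a common super type, a single interpreter could encode any value in the data model this way.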
Currently the ZMX server and client are implemented to run on the JVM using NIO. We need to add the same server capability for use with Scala.js.
Scala.js does support the `nio` `ByteBuffer`, so we can continue to use that for this implementation:
https://github.com/scala-js/scala-js/blob/master/javalib/src/main/scala/java/nio/ByteBuffer.scala
One option to implement the server is to use the Node.js `net` library. So that we don't have to reinvent the wheel to interop Scala.js with the Node.js `net` library, we can use this dependency as an option: https://github.com/scalajs-io/nodejs
The goal is the ability to send messages from client to server, and for the server to respond back, using both Scala.js and Java.
Currently in the `ZMXSupervisor` and related data types we use a `SortedSet` of fibers. I don't think there is any reason the set needs to be sorted. An ordinary set of fibers backed by a `HashSet` would have better performance.
The fiber dumps contain many newline characters.
RESP Simple Strings can contain neither `\r` nor `\n`.
Now, with the proper RESP implementation, this is correctly ensured (newlines are removed).
So this is now correctly broken.
To not lose the newlines, the serialized `ZMXProtocol.Data.FiberDump` must be an Array of Bulk Strings.
Building on #107 and #108, this ticket is to create an actual implementation of a ZMX server that interacts with Prometheus using our own data model and encoder.
A completed ticket would implement the same functionality demonstrated in the PrometheusSpec.scala file but would not have any dependency on a third-party Prometheus client. Our own implementation can be put in the zio.zmx.prometheus package with a `live` method that constructs the server.
We have several methods on the `Metrics` service that take a large number of parameters, including `serviceCheck` and `event`. This can create a less than ideal API for the user, because it can require specifying parameters that are not needed, or lead to confusing the order of parameters, since several have the same type (e.g. `Option[String]`).
An alternative is to have a `MetricAttributes` data type. Then each of these methods could just accept a `Set[MetricAttributes]` instead of all of these individual arguments, and the library would internally traverse those attributes and construct the metric accordingly.
This ticket would involve creating the `MetricAttributes` data type, including smart constructors that make it easy for users to create the common types of metric attributes already represented in these signatures, and functionality to interpret the attributes back to the required parameters when needed, demonstrating that the functionality that works today also works with this new encoding.
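A rough sketch of what that could look like; every name here (`MetricAttribute`, `Host`, `SampleRate`, `Tag`, `hostOf`) is hypothetical, not a final design:

```scala
// Each attribute stands in for one of today's positional/optional parameters.
sealed trait MetricAttribute
object MetricAttribute {
  final case class Host(value: String)             extends MetricAttribute
  final case class SampleRate(value: Double)       extends MetricAttribute
  final case class Tag(key: String, value: String) extends MetricAttribute

  // Smart constructors keep call sites readable and order-independent.
  def host(value: String): MetricAttribute             = Host(value)
  def sampleRate(value: Double): MetricAttribute       = SampleRate(value)
  def tag(key: String, value: String): MetricAttribute = Tag(key, value)
}

// Interpreting the attribute set back to a parameter a method needs today.
def hostOf(attrs: Set[MetricAttribute]): Option[String] =
  attrs.collectFirst { case MetricAttribute.Host(h) => h }
```

A call site would then read `metrics.event(name, text, Set(MetricAttribute.host("web-1"), MetricAttribute.tag("env", "prod")))` instead of a run of same-typed optional arguments.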
`Metrics.Service` offers `listen` methods that are used to provide "senders" for aggregated metrics. There is nothing that prevents one from registering multiple senders. In that case, multiple collect daemons would compete in polling metrics from the underlying shared ring buffer. This results in surprising behavior.
In the recently merged #140 I added this TODO comment:
case ZMXProtocol.Response.Success(data) =>
  data match {
    case executionMetrics: ZMXProtocol.Data.ExecutionMetrics =>
      // TODO: Format of `ExecutionMetrics` serialized could be discussed and revisited.
      Resp.BulkString(executionMetrics.toString).serialize
In that PR I actually kept the previous format, but I think we can improve it with the use of other RESP types (especially with the recent proper implementation of RESP).
The metrics themselves have this form:
abstract class ExecutionMetrics {
  def concurrency: Int
  def capacity: Int
  def size: Int
  def enqueuedCount: Long
  def dequeuedCount: Long
  def workersCount: Int
}
Right now they are sent as one big Bulk String:
$86\r\n
concurrency:1\r\n
capacity:2\r\n
size:3\r\n
enqueued_count:4\r\n
dequeued_count:5\r\n
workers_count:6\r\n
After parsing from the RESP format, this gives a string of "concurrency:1\r\ncapacity:2\r\nsize:3\r\nenqueued_count:4\r\ndequeued_count:5\r\nworkers_count:6" that the client has to parse again.
An `Array` type could definitely be used. While RESP does not support key-value structures, I think that instead of encoding them as `key:value` we could use `Array`s again. Then these values, since they are `Int`/`Long`, could be represented as RESP `Integer`s.
So I'd propose something along the lines of:
Array(
  Array(
    SimpleString("concurrency"),
    Integer(1)
  ),
  Array(
    SimpleString("capacity"),
    Integer(2)
  ),
  [...]
)
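For reference, serializing such a nested structure is straightforward under the RESP rules (`+` for Simple Strings, `:` for Integers, `*` for Arrays with an element count). A minimal standalone sketch, with illustrative types rather than the project's actual `Resp` encoding:

```scala
// Illustrative RESP value types (the project's real Resp ADT may differ).
sealed trait Resp
final case class SimpleString(value: String)  extends Resp
final case class Integer(value: Long)         extends Resp
final case class RespArray(items: List[Resp]) extends Resp

// RESP wire encodings: "+<str>\r\n", ":<int>\r\n", "*<count>\r\n<elements>".
def serialize(r: Resp): String = r match {
  case SimpleString(s)  => s"+$s\r\n"
  case Integer(n)       => s":$n\r\n"
  case RespArray(items) => s"*${items.length}\r\n" + items.map(serialize).mkString
}
```

For example, a one-pair metrics array serializes to `*1\r\n*2\r\n+concurrency\r\n:1\r\n`, which any RESP-capable client can decode without a second, ad-hoc parsing step.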
I'm leaving this open for discussion.
It's more a question of how much of the RESP types we want to use, because API clients will have to support them.
But if we're already using a decent part of RESP, then why not just use its full power?
I think that using proper RESP types would make more sense than asking clients to parse the result again, now from our non-standard `key:value`/CRLF format.
Pinging @jczuchnowski since you've been working on that part.
Currently `ZMXProtocol` is limited to a predetermined set of types of data: execution metrics, fiber dumps, and string messages. This ticket is to make the protocol more extensible so that the user can also get data on user-defined metrics that do not fall into these categories.
A completed ticket will include a test showing an example of an application writing custom metrics and then being able to get access to those metrics.
Right now we're sending the fiber dump as one string, but it would be more useful to the client to receive it as an array of fiber dumps, one per fiber:
*5\r\n
+fiber dump #1\r\n
+fiber dump #2\r\n
+fiber dump #3\r\n
+fiber dump #4\r\n
+fiber dump #5\r\n
Unless there are performance concerns for splitting it on the server.
ZIO ZMX needs to keep track of the graph of fibers within a ZIO program, that is, which fibers are descendants of which other fibers. Currently ZIO ZMX uses the `Graph` data type in `zio.zmx.graph` to do this, which represents an immutable graph.
The first part of this ticket is to add benchmarks for the performance of ZIO ZMX when generating fiber dumps for ZIO programs that construct extremely large fiber graphs (e.g. 100,000 fibers) with different structures: very "deep" trees where one fiber forks another fiber that forks another fiber many times, very "broad" trees where one fiber forks many fibers, and mixtures of these two.
After that, the next step would be to explore whether there is a more efficient representation that could be used. The current representation is something like `AtomicReference[Map[Fiber, Set[Fiber]]]`. Alternatively, it could be something like `ConcurrentMap[Fiber, Set[Fiber]]`. We need the benchmarks to see whether that is really an improvement.
A simple CLI tool to call ZMX to get fiber dumps and metrics, using the existing ZMXClient already developed.
ZIO ZMX supports the concept of an "unsafe" service to allow users to record metrics with less overhead in certain cases. This makes sense in general, but the `listenUnsafe` method is doing things that are directly in the wheelhouse of ZIO and are much less safe without ZIO. We should delete this method and implement it in terms of the existing `listen` method.
This is the error:
/tmp/docusaurus2351590966069628621install_ssh.sh: line 4: GITHUB_DEPLOY_KEY: unbound variable
[error] java.lang.AssertionError: assertion failed: command returned 1: [/tmp/docusaurus2351590966069628621install_ssh.sh]
[error] at scala.Predef$.assert(Predef.scala:223)
[error] at mdoc.DocusaurusPlugin$XtensionProcess.execute(DocusaurusPlugin.scala:163)
[error] at mdoc.DocusaurusPlugin$.$anonfun$projectSettings$5(DocusaurusPlugin.scala:115)
[error] at mdoc.DocusaurusPlugin$.$anonfun$projectSettings$5$adapted(DocusaurusPlugin.scala:96)
[error] at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:62)
[error] at sbt.std.Transform$$anon$4.work(Transform.scala:67)
[error] at sbt.Execute.$anonfun$submit$2(Execute.scala:281)
[error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:19)
[error] at sbt.Execute.work(Execute.scala:290)
[error] at sbt.Execute.$anonfun$submit$1(Execute.scala:281)
[error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:178)
[error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:37)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[error] at java.lang.Thread.run(Thread.java:748)
[error] (docs / docusaurusPublishGhpages) java.lang.AssertionError: assertion failed: command returned 1: [/tmp/docusaurus2351590966069628621install_ssh.sh]
[error] Total time: 22 s, completed Jun 14, 2020 8:12:17 PM
Exited with code exit status 1
sample circleci failure: https://circleci.com/gh/zio/zio-zmx/493
This issue has occurred due to this change in the latest mdoc release:
https://github.com/scalameta/mdoc/pull/352/files
In the CircleCI environment building the microsite, we need to set the `GITHUB_DEPLOY_KEY` environment variable to the base64-encoded private SSH key that we set up for the microsite deploys.
For example, get the key by doing: `cat .ssh/id_zmx | base64 | pbcopy`
We want ZIO ZMX to support Prometheus as well as StatsD. We currently support a version of this today (see the PrometheusSpec.scala file in the `test` folder), but it relies on a dependency on the Prometheus client, which we do not want to have. Therefore, we need to do our own implementation of the Prometheus protocol.
The first step is to create a data model for the objects within the Prometheus data model, for example `Counter`, `Histogram`, and `CollectorRegistry`. At this point this would be a pure data model that describes the information we need to interact with Prometheus, but does not contain any of the actual logic for converting this data into the format expected by Prometheus.
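To make the shape concrete, a pure-data sketch could look like the following; the names follow the standard Prometheus client-library concepts, and the final ZMX model may differ in names and fields:

```scala
// Pure data: no encoding or I/O logic lives in the model.
final case class Label(name: String, value: String)

sealed trait PromMetric {
  def name: String
  def labels: List[Label]
}

final case class Counter(name: String, labels: List[Label], value: Double) extends PromMetric

final case class Histogram(
  name: String,
  labels: List[Label],
  buckets: Map[Double, Long], // upper bound -> cumulative count
  sum: Double,
  count: Long
) extends PromMetric

// The registry only aggregates metrics; encoding is a separate concern (#108).
final case class CollectorRegistry(metrics: List[PromMetric])
```

Keeping the model pure means the exposition-format codec can be written and tested against it independently.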
The ZMX client, located in the `ZMXClient.scala` file, currently uses `zio.nio` in its implementation. We want ZIO ZMX to have no dependencies other than ZIO itself, so it is as easy as possible for users to add to their application.
This ticket is to refactor the ZMX client to implement the functionality directly in terms of non-blocking methods from `java.nio`. A completed ticket will also have tests to verify that this functionality is working correctly.
The ZIO ZMX library should expose a small number of core abstractions, primarily the `Metrics` service in the `package.scala` file, and then specific implementations of that service in appropriate packages (e.g. a StatsD implementation in the `zio.zmx.statsd` package). Everything else that is used to implement these services should be package-private, to provide a clean API for users and flexibility to evolve implementation details.
This ticket is to reorganize the package structure of the project as follows:
- Rename the `metrics` package to be the `statsd` package
- Rename `UDPClient` to `StatsdClient` and `UDPClientUnsafe` to `StatsdClientUnsafe` for clarity, since UDP is a communication protocol and this package is really concerned with implementation details that are specific to StatsD
- Move the `Metrics.live` service in the `zio.zmx` package object to the `zio.zmx.statsd` package object, since this implementation is specific to StatsD
- Move the data types in the `MetricsDataModel` and `MetricsConfigDataModel` traits to be top-level data types within the `zio.zmx` package and delete the traits. Since the traits are not used anywhere except being extended in the package object, this should be equivalent and will make it easier for users to find these data types
- Other than the `live` method in `zio.zmx.statsd` above, make sure everything in any of the `zio.zmx` sub-packages (e.g. `diagnostics`, `fibers`, `graph`, `parser`) is `private[zmx]`
After these changes, generating ScalaDoc for ZIO ZMX should produce documentation that only shows the values in the `zio.zmx` package object, the data types currently in `MetricsModel` and `MetricsDataModel`, and the `live` service in the `zio.zmx.statsd` package.