
metrics-cluster-aggregator's Introduction

Metrics Cluster Aggregator

License: Apache 2

Reaggregates statistics received from multiple Metrics Aggregator Daemon instances into aggregates across each cluster. Simply put, this means combining the values from each host in your fleet. Both the host and cluster values are published to various configurable data sinks.

Usage

Installation

Manual

The artifacts from the build are in metrics-cluster-aggregator/target/appassembler and should be copied to an appropriate directory on your application host(s).

Docker

If you use Docker, we publish a base docker image that makes it easy for you to layer configuration on top of. Create a Docker image based on the image arpnetworking/cluster-aggregator. Configuration files are typically located at /opt/cluster-aggregator/config/. In addition, you can specify CONFIG_FILE (defaults to /opt/cluster-aggregator/config/config.json), PARAMS (defaults to $CONFIG_FILE), LOGGING_CONFIG (defaults to "-Dlogback.configurationFile=/opt/cluster-aggregator/config/logback.xml"), and JAVA_OPTS (defaults to $LOGGING_CONFIG) environment variables to control startup.
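As a sketch, a layered image might look like the following; the tag and copied file names are illustrative, not pinned recommendations:

```dockerfile
# Illustrative Dockerfile layering site configuration on the published base image
FROM arpnetworking/cluster-aggregator:latest

# Overlay configuration at the conventional location
COPY config.json /opt/cluster-aggregator/config/config.json
COPY logback.xml /opt/cluster-aggregator/config/logback.xml
```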

Execution

In the installation's bin directory there are scripts to start Metrics Cluster Aggregator: cluster-aggregator (Linux) and cluster-aggregator.bat (Windows). One of these should be executed on system start with appropriate parameters; for example:

/usr/local/lib/metrics-cluster-aggregator/bin/cluster-aggregator /usr/local/lib/metrics-cluster-aggregator/config/config.json

Configuration

Logging

To customize logging you may provide a Logback configuration file. To use a custom logging configuration, define and export an environment variable before executing cluster-aggregator:

export JAVA_OPTS="-Dlogback.configurationFile=/usr/local/lib/metrics-cluster-aggregator/config/logger.xml"

Where /usr/local/lib/metrics-cluster-aggregator/config/logger.xml is the path to your logging configuration file.
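A minimal logger.xml might look like the following; the appender and pattern here are illustrative only:

```xml
<!-- Minimal example Logback configuration; adjust appenders and levels to taste -->
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```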

Daemon

The Metrics Cluster Aggregator configuration is specified in a JSON file. The location of the configuration file is passed to metrics-cluster-aggregator as a command line argument:

/usr/local/etc/metrics-cluster-aggregator/config/prod.conf

The configuration specifies:

  • logDirectory - The location of additional logs. This is independent of the logging configuration.
  • clusterPipelineConfiguration - The location of the configuration file for the cluster statistics pipeline.
  • hostPipelineConfiguration - The location of the configuration file for the host statistics pipeline.
  • httpHost - The IP address to bind the HTTP server to.
  • httpPort - The port to bind the HTTP server to.
  • aggregationHost - The IP address to bind the TCP aggregation server to.
  • aggregationPort - The port to bind the TCP aggregation server to.
  • maxConnectionTimeout - The maximum aggregation server client connection timeout in ISO-8601 period notation.
  • minConnectionTimeout - The minimum aggregation server client connection timeout in ISO-8601 period notation.
  • jvmMetricsCollectionInterval - The JVM metrics collection interval in ISO-8601 period notation.
  • rebalanceConfiguration - Configuration for aggregator shard rebalancing.
  • pekkoConfiguration - Configuration of Pekko.

For example:

{
  "logDirectory": "/usr/local/lib/metrics-cluster-aggregator/logs",
  "clusterPipelineConfiguration": "/usr/local/lib/metrics-cluster-aggregator/config/cluster-pipeline.json",
  "hostPipelineConfiguration": "/usr/local/lib/metrics-cluster-aggregator/config/host-pipeline.json",
  "httpPort": 7066,
  "httpHost": "0.0.0.0",
  "aggregationHost": "0.0.0.0",
  "aggregationPort": 7065,
  "maxConnectionTimeout": "PT2M",
  "minConnectionTimeout": "PT1M",
  "jvmMetricsCollectionInterval": "PT0.5S",
  "rebalanceConfiguration": {
    "maxParallel": 100,
    "threshold": 500
  },
  "pekkoConfiguration": {
    "pekko": {
      "loggers": ["org.apache.pekko.event.slf4j.Slf4jLogger"],
      "loglevel": "DEBUG",
      "stdout-loglevel": "DEBUG",
      "logging-filter": "org.apache.pekko.event.slf4j.Slf4jLoggingFilter",
      "actor": {
        "provider": "org.apache.pekko.cluster.ClusterActorRefProvider",
        "debug": {
          "unhandled": "on"
        }
      },
      "cluster": {
        "sharding": {
          "state-store-mode": "persistence"
        },
        "seed-nodes": [
          "pekko.tcp://[email protected]:2551"
        ]
      },
      "remote": {
        "log-remote-lifecycle-events": "on",
        "netty": {
          "tcp": {
            "hostname": "127.0.0.1",
            "port": 2551
          }
        }
      }
    }
  }
}

Pipeline

Metrics Cluster Aggregator supports two pipelines. The first is the host pipeline, which handles publication of all statistics received from Metrics Aggregator Daemon instances. The second is the cluster pipeline, which handles all statistics (re)aggregated by cluster across the host statistics from Metrics Aggregator Daemon instances. In both cases the pipeline defines one or more destinations, or sinks, for the statistics.

For example:

{
    "sinks":
    [
        {
            "type": "com.arpnetworking.tsdcore.sinks.CarbonSink",
            "name": "my_application_carbon_sink",
            "serverAddress": "192.168.0.1"
        }
    ]
}

HOCON

The daemon and pipeline configuration files may be written in HOCON when specified with a .conf extension.
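For illustration, the JSON daemon configuration shown earlier could be expressed in HOCON (e.g. as prod.conf); this is a sketch of the equivalent syntax, not a recommended configuration:

```hocon
# Equivalent HOCON form of the example daemon configuration
logDirectory = "/usr/local/lib/metrics-cluster-aggregator/logs"
clusterPipelineConfiguration = "/usr/local/lib/metrics-cluster-aggregator/config/cluster-pipeline.json"
hostPipelineConfiguration = "/usr/local/lib/metrics-cluster-aggregator/config/host-pipeline.json"
httpHost = "0.0.0.0"
httpPort = 7066
aggregationHost = "0.0.0.0"
aggregationPort = 7065
maxConnectionTimeout = "PT2M"
minConnectionTimeout = "PT1M"
jvmMetricsCollectionInterval = "PT0.5S"
```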

Development

To build the service locally you must satisfy these prerequisites:

Note: Requires at least Docker for Mac Beta version 1.12.0-rc4-beta19 (build: 10258)

Next, fork the repository, clone and build:

Building:

metrics-cluster-aggregator> ./mvnw verify

To use the local version in your project you must first install it locally:

metrics-cluster-aggregator> ./mvnw install

To debug the server during run on port 9000:

metrics-cluster-aggregator> ./mvnw -Ddebug=true docker:start

To debug the server during integration tests on port 9000:

metrics-cluster-aggregator> ./mvnw -Ddebug=true verify

You can determine the version of the local build from the pom.xml file. Using the local version is intended only for testing or development.

You may also need to add the local repository to your build in order to pick-up the local version:

  • Maven - Included by default.
  • Gradle - Add mavenLocal() to build.gradle in the repositories block.
  • SBT - Add resolvers += Resolver.mavenLocal into project/plugins.sbt.
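For Gradle, the repositories block described above might look like this (a sketch; your existing repositories may differ):

```groovy
// Example build.gradle fragment enabling resolution from the local Maven repository
repositories {
    mavenLocal()
    mavenCentral()
}
```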

License

Published under Apache Software License 2.0, see LICENSE

© Groupon Inc., 2014

metrics-cluster-aggregator's People

Contributors

abbeyqy, brandonarp, cwbriones, ddimensia, dependabot[bot], joeyjackson, matthayter, orborde, phoenixrion, vjkoskela


metrics-cluster-aggregator's Issues

Support population size calculation in missing sinks

The HttpPostSink supports publishing the metrics samples_sent and samples_dropped if the population size of samples is calculated in the serialize method of the implementing class. Currently only the KairosDBSink makes this calculation, while the other sinks that extend HttpPostSink leave it empty (it is an Optional), so these metrics are not published for those sinks. In order to support the metrics samples_sent and samples_dropped in all sinks, the serialize method must be implemented to calculate the population size of each byte[] chunk for the following sinks:

  • SignalFxSink
  • MonitordSink
  • KMonDSink
  • InfluxDBSink
  • DataDogSink
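The shape of the fix can be sketched as follows. This is illustrative only: SerializedChunk and ChunkSerializer are stand-ins, not the real HttpPostSink contract. The idea is that each serialized chunk carries the count of samples it holds so samples_sent and samples_dropped can be reported.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

// Hypothetical result type: serialized bytes plus the sample count they represent.
record SerializedChunk(byte[] data, Optional<Long> populationSize) {}

final class ChunkSerializer {
    // Serialize samples to a newline-delimited body and record the population size.
    static SerializedChunk serialize(List<String> samples) {
        String body = String.join("\n", samples);
        return new SerializedChunk(
                body.getBytes(StandardCharsets.UTF_8),
                Optional.of((long) samples.size()));
    }
}
```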

Create Cross Cluster Heartbeat Metric

We should create a metric for the inter-node heartbeat interval measured by Akka. Exporting it to AMP via the existing OSS AMP client is sufficient since I am working on exporting all those metrics to Vortex2. I believe these are reasonable places to start:

https://doc.akka.io/docs/akka/2.5/remoting.html#watching-remote-actors
https://doc.akka.io/docs/akka/2.5/general/configuration.html (see "failure-detector")
https://github.com/akka/akka/blob/v2.5.30/akka-remote/src/main/scala/akka/remote/PhiAccrualFailureDetector.scala#L147

The solution that I would recommend is subclassing/encapsulating the standard detector and injecting our class with MetricsFactory to record the heartbeat metrics (by destination node).
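The wrapping approach could be sketched like this. Note that FailureDetector here is a minimal stand-in for the real Akka interface, and the metrics callback stands in for a MetricsFactory-backed recorder; all names and signatures are assumptions.

```java
import java.util.function.BiConsumer;

// Stand-in for Akka's failure detector contract (assumption, not the real API).
interface FailureDetector {
    boolean isAvailable();
    void heartbeat();
}

// Wraps a standard detector and records the observed heartbeat interval
// (in milliseconds) per destination node each time a heartbeat arrives.
final class InstrumentedFailureDetector implements FailureDetector {
    private final FailureDetector delegate;
    private final String destinationNode;
    private final BiConsumer<String, Long> recordIntervalMillis;
    private long lastHeartbeatNanos = -1;

    InstrumentedFailureDetector(FailureDetector delegate, String destinationNode,
                                BiConsumer<String, Long> recordIntervalMillis) {
        this.delegate = delegate;
        this.destinationNode = destinationNode;
        this.recordIntervalMillis = recordIntervalMillis;
    }

    @Override public boolean isAvailable() { return delegate.isAvailable(); }

    @Override public void heartbeat() {
        long now = System.nanoTime();
        if (lastHeartbeatNanos >= 0) {
            // Record the interval since the previous heartbeat for this node.
            recordIntervalMillis.accept(destinationNode, (now - lastHeartbeatNanos) / 1_000_000);
        }
        lastHeartbeatNanos = now;
        delegate.heartbeat();
    }
}
```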

Protocol for host and cluster support

I recently did some work to allow for a machine to record metrics pertaining to multiple hosts. The primary use case being monitoring a load balancer host where the metrics pertain to the vip instances. The query log file is being written correctly but the data seems to be intermittently attributed to either the host where the tsd agg instance is running or the spoofed host.

Based on an old Jira ticket and this line, it would appear that when a connection is made to the cluster agg instance, a single host is set for whatever messages are sent for the duration of that connection.

How to import Metrics Cluster Aggregator into Eclipse?

I wrote an Eclipse plugin and am now collecting a database of large Java Maven projects. I tried to import Metrics Cluster Aggregator into Eclipse, but I couldn't compile it and run the tests.

I tried changing the Java version or Maven version, adding that plugin to the Maven dependencies, and following all of the StackOverflow suggestions, such as removing the .m2 folder, but none of them worked.

Is Metrics Cluster Aggregator Eclipse compatible? Is it possible to run their test classes inside Eclipse? If so, what procedure should I follow?

Docker Image for 1.0.2 removed, but still needed

Hi guys, we have a team here that was using the docker image for version 1.0.2 (referenced by the latest tag). We're not ready to bump to 1.0.3, but we're also no longer able to access the docker image that contained version 1.0.2. Would you be able to do a quick push of the 1.0.2 docker image and tag it in Docker Hub?

Commons Accumulator Implementations for Sum

The ArpNetworking/Commons library now has implementations of different Accumulator patterns to sum Double values with different cost (memory/cpu) and precision tradeoffs. We already support these on the read side in the ISM KairosDb-Extensions. This issue is to support these in CAGG; unfortunately, it's not as straightforward as CAGG's Statistic and Calculator pattern was never intended to be parameterized. Since MAD and CAGG shared a data model at one point, the design constraints and implementation options are similar. Please see the corresponding issue in MAD for details:

ArpNetworking/metrics-aggregator-daemon#209

NPE in 1.11.1

I upgraded my cluster from 1.11.0 to 1.11.1 today and I'm seeing an NPE. It looks like it's coming from an actor, but the lack of a backtrace is a bit concerning.

Cluster is running in proxy mode (as opposed to aggregation mode).

{"time":"2021-01-09T00:57:11.784Z","name":"log","level":"crit",
"data":{"message":null},"exception":
{"type":"java.lang.NullPointerException","message":null,"backtrace":[],"data":
{"_id":"1bd476f2","_class":"java.lang.NullPointerException"}},
"context":{"host":"cagg-1.cagg.default.svc.cluster.local","processId":"1","threadId":
"Metrics-akka.actor.default-dispatcher-4","logger":"a.a.OneForOneStrategy"},
"id":"fa1a0f59-cf72-498f-b552-0b8ee609c7dc","version":"0"}

Add support for renamed meta dimension names

MAD will start sending "_host", "_service", and "_cluster" to CAgg with the next revision. It will do so initially in addition to the existing "host", "service" and "cluster" dimension keys. We should extend CAgg to support both with the underscore prefixed versions taking precedence. Once complete and released we can deprecate sending the old names from MAD (see AggregationServerSink in MAD).
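The proposed precedence rule can be sketched as follows; the class and method names here are hypothetical, not CAgg's actual API:

```java
import java.util.Map;
import java.util.Optional;

// Sketch of the lookup rule: prefer the underscore-prefixed meta dimension
// ("_host") and fall back to the legacy key ("host") when it is absent.
final class MetaDimensions {
    static Optional<String> resolve(Map<String, String> dimensions, String key) {
        String prefixed = dimensions.get("_" + key);
        if (prefixed != null) {
            return Optional.of(prefixed);
        }
        return Optional.ofNullable(dimensions.get(key));
    }
}
```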

Cluster Aggregator Pipeline Blocks Akka Cluster Formation

We had a scenario where a bad (invalid JSON) cluster pipeline configuration was deployed. When the cluster restarted it seemed that the Akka cluster was unhealthy.

{"time":"2018-07-20T02:58:34.790Z","name":"log","level":"warn","data":{"message":"Cluster Node [akka.tcp://Metrics@iad4f-re22-2a:2551] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://Metrics@iad4d-rd41-38a:2551, status = Up)]. Node roles [dc-default]"},"context":{"host":"iad4f-re22-2a.sjc.dropbox.com","processId":"7","threadId":"Metrics-akka.actor.default-dispatcher-28","logger":"a.c.ClusterCoreDaemon"},"id":"ad8b61f7-4a18-479f-bf98-8ea3289b1f52","version":"0"}

Eventually each node would output:

{"time":"2018-07-20T02:58:20.809Z","name":"log","level":"warn","data":{"message":"Association with remote system [akka.tcp://Metrics@iad4c-rf14-36a:2551] has failed, address is now gated for [5000] ms. Reason: [Disassociated] "},"context":{"host":"iad4f-re22-2a.sjc.dropbox.com","processId":"7","threadId":"Metrics-akka.actor.default-dispatcher-33","logger":"a.r.ReliableDeliverySupervisor"},"id":"7b7baeff-df15-4846-90ba-891a162b3e51","version":"0"}

You would also see some of these:

{"time":"2018-07-20T02:58:27.961Z","name":"log","level":"warn","data":{"message":"heartbeat interval is growing too large: 2001 millis"},"context":{"host":"iad4f-re22-2a.sjc.dropbox.com","processId":"7","threadId":"Metrics-akka.actor.default-dispatcher-67","logger":"a.r.PhiAccrualFailureDetector"},"id":"8c7597b8-3dda-432d-a26c-6eed5a26df1b","version":"0"}

And then there was the scary:

{"time":"2018-07-20T02:58:58.790Z","name":"log","level":"info","data":{"message":"Cluster Node [akka.tcp://Metrics@iad4f-re22-2a:2551] - Leader can currently not perform its duties, reachability status: [akka.tcp://Metrics@iad4c-rf14-36a:2551 -> akka.tcp://Metrics@iad4a-rl2-17c:2551: Reachable [Unreachable] (26), akka.tcp://Metrics@iad4c-rf14-36a:2551 -> akka.tcp://Metrics@iad4d-rd41-38a:2551: Unreachable [Unreachable] (12), akka.tcp://Metrics@iad4c-rf14-36a:2551 -> akka.tcp://Metrics@iad4f-re22-2a:2551: Unreachable [Unreachable] (19), akka.tcp://Metrics@iad4d-rd41-38a:2551 -> akka.tcp://Metrics@iad4a-rl2-17c:2551: Reachable [Unreachable] (32), akka.tcp://Metrics@iad4d-rd41-38a:2551 -> akka.tcp://Metrics@iad4c-rf14-36a:2551: Unreachable [Unreachable] (31), akka.tcp://Metrics@iad4d-rd41-38a:2551 -> akka.tcp://Metrics@iad4f-re22-2a:2551: Unreachable [Unreachable] (30), akka.tcp://Metrics@iad4f-re22-2a:2551 -> akka.tcp://Metrics@iad4a-rl2-17c:2551: Unreachable [Unreachable] (27), akka.tcp://Metrics@iad4f-re22-2a:2551 -> akka.tcp://Metrics@iad4c-rf14-36a:2551: Unreachable [Unreachable] (26), akka.tcp://Metrics@iad4f-re22-2a:2551 -> akka.tcp://Metrics@iad4d-rd41-38a:2551: Unreachable [Unreachable] (25)], member status: [akka.tcp://Metrics@iad4a-rl2-17c:2551 Up seen=false, akka.tcp://Metrics@iad4c-rf14-36a:2551 Up seen=false, akka.tcp://Metrics@iad4d-rd41-38a:2551 Up seen=false, akka.tcp://Metrics@iad4f-re22-2a:2551 Up seen=true]"},"context":{"host":"iad4f-re22-2a.sjc.dropbox.com","processId":"7","threadId":"Metrics-akka.actor.default-dispatcher-48","logger":"a.c.Cluster(akka://Metrics)"},"id":"220dc047-2a81-4507-802d-203cd7902b27","version":"0"}

So it seems that Akka cluster formation is dependent on a successful loading of the cluster pipeline. However, intuitively it feels like this should not be the case; or at the very least if this dependency exists and must exist then the cluster formation should not even be attempted if the cluster pipeline configuration cannot be loaded.

Application Fails to Start without a Database

Specifically, if the block in the configuration is missing:

"metrics_clusteragg": {
      "jdbcUrl": "jdbc:h2:/opt/cluster-aggregator/data/metrics:clusteragg;AUTO_SERVER=TRUE;AUTO_SERVER_PORT=7067;MODE=PostgreSQL;INIT=create schema if not exists clusteragg;DB_CLOSE_DELAY=-1",
      "driverName": "org.h2.Driver",
      "username": "sa",
      "password": "secret",
      "maximumPoolSize": 2,
      "minimumIdle": 2,
      "idleTimeout": 0,
      "modelPackages": [ "com.arpnetworking.clusteraggregator.models.ebean" ]
    },

... then the application will not start, failing with a Guice initialization error. This DB (AFAIK) is only used for Circonus sinks.

Ideally, the application should be able to run without this configuration if no Circonus sinks are in use.

Hello! We found a vulnerable dependency in your project.

Hi! We spotted a vulnerable dependency in your project, which might threaten your software. We also found another project that uses the same vulnerable dependency in a similar way, and they have upgraded it. We therefore believe your project is likely to be affected by this vulnerability in the same way. The following shows the detailed information.

Vulnerability description

  • CVE: CVE-2019-16943
  • Vulnerable dependency: com.fasterxml.jackson.core:jackson-databind:2.9.8
  • Vulnerable function: com.fasterxml.jackson.databind.JavaType:isEnumType()
  • Invocation Path:
com.arpnetworking.tsdcore.sinks.circonus.CirconusClient:handleBrokerListResponse(play.libs.ws.StandaloneWSRequest,play.libs.ws.StandaloneWSResponse)
 ⬇️ 
com.fasterxml.jackson.databind.ObjectMapper:readValue(java.lang.String,com.fasterxml.jackson.core.type.TypeReference)
 ⬇️ 
...
 ⬇️ 
com.fasterxml.jackson.databind.JavaType:isEnumType()

Upgrade example

Another project also used the same dependency with a similar invocation path, and they have taken actions to resolve this issue.

io.github.zeroone3010.yahueapi.Hue$HueBridgeConnectionBuilder:lambda$initializeApiConnection$0(java.lang.String)
 ⬇️ 
com.fasterxml.jackson.databind.ObjectMapper:readValue(java.lang.String,com.fasterxml.jackson.core.type.TypeReference)
 ⬇️ 
...
 ⬇️ 
com.fasterxml.jackson.databind.JavaType:isEnumType()

Therefore, you might also need to upgrade this dependency. Hope this helps! 😄

Incorrect sharding when `cluster` etc. is set in dimensions

The Messages.StatisticSetRecord no longer requires the cluster, service and host to be present in their respective fields, instead allowing them to be conveyed via a dimension key-value. The AggClientConnection class defers to the dimensions for retrieval of the cluster, service and host; however, the AggMessageExtractor does not, and still looks in the now-optional fields of the StatisticSetRecord message.

This would appear to be a bug, causing incorrect sharding when the messages sent from MAD have opted to use dimensions to convey cluster, service and host.

Can someone sanity check me on this?

Data Model Improvement

The ClusterAggregator data model is now independent from the TsdAggregator/MetricsAggregatorDaemon data models. Design decisions that were made to support the combined data model should be re-evaluated. In particular, any decision should address the structure of the data with respect to AggregatedData versus PeriodicData and which (if either) is most applicable to ClusterAggregator.

Regardless of which way the data model design evolves, the following classes at least need to be considered in the refactoring. They currently use and seem to rely on deprecated methods in AggregatedData.

  • PeriodicStatisticsActor
  • PeriodMetrics
  • BookkeeperPersistence
  • Emitter
  • AggMessageExtractor

HTTP Status Endpoint

The system currently has two separate status endpoints. One is a basic standard informational endpoint about the service (name, version, sha). The other provides detailed information about the distributed Akka cluster. The former is configurable and by default uses /status which shadows the latter. The ideal solution would be to combine the static information from the first endpoint with the dynamic information from the second endpoint and make both available on the same configurable endpoint which by default would be /status.

Race condition in AlertSink

In these lines in the AlertSink, there is a race condition (the method is otherwise designed to be thread-safe AFAIK): if one thread does getAndSet with the newConfiguration, another thread may do a second getAndSet and then call shutdown before the first thread has called launch.

Before #24 , this caused an NPE in the DynamicConfig class.

A possible solution is to launch() the newConfiguration before getAndSet(); however, this means the new config will be launched before the old is shutdown(). I'm not confident whether this is acceptable.
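Another option is to serialize the whole swap-and-transition sequence under a lock, so no thread can shut down a configuration that another thread has swapped in but not yet launched. The sketch below is illustrative only: Configuration is a stand-in for the real dynamic-configuration type, and this is one possible approach, not AlertSink's actual code.

```java
import java.util.concurrent.atomic.AtomicReference;

final class ConfigurationHolder {
    // Stand-in for the real configuration lifecycle contract (assumption).
    interface Configuration {
        void launch();
        void shutdown();
    }

    private final AtomicReference<Configuration> current = new AtomicReference<>();
    private final Object swapLock = new Object();

    // Swap in a new configuration: shut down the old one and launch the new
    // one atomically with respect to other offers, avoiding the race where a
    // second swap shuts down a configuration before the first thread launches it.
    void offer(Configuration newConfiguration) {
        synchronized (swapLock) {
            Configuration old = current.getAndSet(newConfiguration);
            if (old != null) {
                old.shutdown();
            }
            newConfiguration.launch();
        }
    }
}
```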

Replace WF Proxy

With the latest round of updates, the WaveFront proxy was removed due to dependency conflicts and tests that caused the JVM to terminate. Find a solution to reinstate the WF Proxy or find a suitable alternative.

See #65 for more information.

AJC Incompatibilities

Incompatibilities with the AspectJ compiler used to weave Logback-Steno context information need to be resolved. The up-to-date list of impacted classes may be found in the pom.xml under the AspectJ plugin configuration in the form of an exclusion list. The problems should be addressed and the exclusions removed. At the time of writing these classes were impacted:

  • AggClientConnection
  • AggregationMessage
  • ParallelLeastShardAllocationStrategy
