
veneur's Introduction



What Is Veneur?

Veneur (/vɛnˈʊr/, rhymes with “assure”) is a distributed, fault-tolerant pipeline for runtime data. It provides a server implementation of the DogStatsD protocol or SSF for aggregating metrics and sending them downstream to one or more supported sinks. It can also act as a global aggregator for histograms, sets and counters.

More generically, Veneur is a convenient sink for various observability primitives with lots of outputs!

Use Case

Once you cross a threshold into dozens, hundreds or (gasp!) thousands of machines emitting metric data for an application, you've moved into that world where data about individual hosts is uninteresting except in aggregate form. Instead of paying to store tons of data points and then aggregating them later at read time, Veneur can calculate global aggregates, like percentiles, and forward only those along to your time series database.

Veneur is also a StatsD or DogStatsD protocol transport, forwarding the locally collected metrics over more reliable TCP implementations.

Here are some examples of why Stripe and other companies are using Veneur today:

  • reducing cost by pre-aggregating metrics such as timers into percentiles
  • creating a vendor-agnostic metric collection pipeline
  • consolidating disparate observability data (from trace spans to metrics, and more!)
  • improving efficiency over other metric aggregator implementations
  • improving reliability by building a more resilient forwarding system over single points of failure

We wanted percentiles, histograms and sets to be global. We wanted to unify our observability clients, be vendor agnostic and build automatic features like SLI measurement. Veneur helps us do all this and more!

Status

Veneur is currently handling all metrics for Stripe and is considered production ready. It is under active development and maintenance! Starting with v1.6, Veneur operates on a six-week release cycle, and all releases are tagged in git. If you'd like to contribute, see CONTRIBUTING!

Building Veneur requires Go 1.11 or later.

Features

Vendor And Backend Agnostic

Veneur has many sinks, so your data can be sent to one or more vendors, TSDBs, or tracing stores!

Modern Metrics Format (Or Others!)

Unify metrics, spans and logs via the Sensor Sensibility Format. Also works with DogStatsD, StatsD and Prometheus.

Global Aggregation

If configured to do so, Veneur can selectively aggregate global metrics to be cumulative across all instances that report to a central Veneur, allowing global percentile calculation, global counters or global sets.

For example, say you emit a timer foo.bar.call_duration_ms from 20 hosts that are configured to forward to a central Veneur. You'll see the following:

  • Metrics that have been "globalized"
    • foo.bar.call_duration_ms.50percentile: the p50 across all hosts, by tag
    • foo.bar.call_duration_ms.90percentile: the p90 across all hosts, by tag
    • foo.bar.call_duration_ms.95percentile: the p95 across all hosts, by tag
    • foo.bar.call_duration_ms.99percentile: the p99 across all hosts, by tag
  • Metrics that remain host-local
    • foo.bar.call_duration_ms.avg: by-host tagged average
    • foo.bar.call_duration_ms.count: by-host tagged count which (when summed) shows the total count of times this metric was emitted
    • foo.bar.call_duration_ms.max: by-host tagged maximum value
    • foo.bar.call_duration_ms.median: by-host tagged median value
    • foo.bar.call_duration_ms.min: by-host tagged minimum value
    • foo.bar.call_duration_ms.sum: by-host tagged sum value representing the total time

Clients can choose to override this behavior by including the tag veneurlocalonly.
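
For example, a host might emit that timer over UDP like this (a sketch only: the port 8126 and the env:prod tag are assumptions; the port should match whatever you have in statsd_listen_addresses):

# Emit a timer sample that participates in global percentile aggregation
echo -n "foo.bar.call_duration_ms:87|ms|#env:prod" | nc -u -w 1 localhost 8126

# The same metric kept strictly host-local via the magic tag mentioned above
echo -n "foo.bar.call_duration_ms:87|ms|#env:prod,veneurlocalonly" | nc -u -w 1 localhost 8126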

Approximate Histograms

Because Veneur is built to handle lots and lots of data, it uses approximate histograms. We have our own implementation of Dunning's t-digest, which has bounded memory consumption and reduced error at extreme quantiles. Metrics are consistently routed to the same worker to distribute load and to be added to the same histogram.

Datadog's DogStatsD — and StatsD — uses an exact histogram which retains all samples and is reset every flush period. This means that there is a loss of precision when using Veneur, but the resulting percentile values are meant to be more representative of a global view.

Datadog Distributions

Because Veneur already handles "global" histograms, any DogStatsD packets received with type d (Datadog's distribution type) will be considered a histogram and therefore compatible with all sinks. Veneur does not send any metrics to Datadog typed as a Datadog-native distribution.

Approximate Sets

Veneur uses HyperLogLogs for approximate unique sets. These are a very efficient unique counter with fixed memory consumption.

Global Counters

Via an optional magic tag Veneur will forward counters to a global host for accumulation. This feature was primarily developed to control tag cardinality. Some counters are valuable but do not require per-host tagging.

Sink Routing

Veneur supports routing metrics to specific sinks using the metric_sink_routing configuration field with the structure:

metric_sink_routing:  # or
  - name: string
    match:  # or
      - name:
          kind: any | exact | prefix | regex
          value: string
        tags:  # and
          - kind: exact | prefix | regex
            unset: bool
            value: string
    sinks:
      matched:  # and
        - string
      not_matched:  # and
        - string

The metric_sink_routing field contains a list of routing rules, containing a name field for identifying the rule in logs, a list of matchers, and sinks. A matcher contains a name matcher for matching the name of a metric, and a list of tag matchers for matching tags the metric has; the name matcher and all of the tag matchers must match in order for the matcher to match a given metric.

The kind for the name matcher can be one of any, exact, prefix, or regex. The name matcher matches a metric name:

  • any: always; the name of the metric is ignored, and the value field is unused.
  • exact: if the metric name equals the value field.
  • prefix: if the name starts with the value field.
  • regex: if the metric name matches the regular expression specified in the value field.

The kind of the tag matcher can be one of exact, prefix or regex. The tag matcher matches a metric tag:

  • exact: if the tag equals the value field.
  • prefix: if the tag starts with the value field.
  • regex: if the tag matches the regular expression specified in the value field.

For a tag matcher to match a given metric, if the unset field is not set or is false, the tag matcher must match at least one tag in the metric; if the unset field is true, the tag matcher must match none of the tags in the metric.

If a metric matches any of the entries in the match field of a given rule, it is flushed to all of the sinks listed in the matched field; if a metric matches none of the matchers in a given rule, it is sent to all of the sinks listed in the not_matched section.
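
As an illustrative sketch only (the rule name route-foo, the env:production tag, and the sink names datadog and signalfx are assumptions; use the sinks you actually have configured), a rule that routes production metrics prefixed with foo. to one sink and everything else to another might look like:

metric_sink_routing:
  - name: route-foo
    match:
      - name:
          kind: prefix
          value: foo.
        tags:
          - kind: exact
            value: env:production
    sinks:
      matched:
        - datadog
      not_matched:
        - signalfx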

Concepts

  • Global metrics are those that benefit from being aggregated for chunks — or all — of your infrastructure. These are histograms (including the percentiles generated by timers) and sets.
  • Metrics that are sent to another Veneur instance for aggregation are said to be "forwarded". This terminology helps to decipher configuration and metric options below.
  • Flushed, in Veneur, means metrics or spans processed by a sink.

By Metric Type Behavior

To clarify how each metric type behaves in Veneur, please use the following:

  • Counters: Locally accrued, flushed to sinks (see magic tags for global version)
  • Gauges: Locally accrued, flushed to sinks (see magic tags for global version)
  • Histograms: Locally accrued, count, max and min flushed to sinks, percentiles forwarded to forward_address for global aggregation when set.
  • Timers: Locally accrued, count, max and min flushed to sinks, percentiles forwarded to forward_address for global aggregation when set.
  • Sets: Locally accrued, forwarded to forward_address for global aggregation when set.

Expiration

Veneur expires all metrics on each flush. If a metric is no longer being sent (or is sent sparsely) Veneur will not send it as zeros! This was chosen because the combination of the approximation's features and the additional hysteresis imposed by retaining these approximations over time was deemed more complex than desirable.

Other Notes

  • Veneur aligns its flush timing with the local clock. For the default interval of 10s Veneur will generally emit metrics at 00, 10, 20, 30, … seconds after the minute.
  • Veneur will delay its first metric emission to align with the clock as stated above. This may result in a brief quiet period after a restart, at worst less than interval seconds long.

Usage

veneur -f example.yaml

See example.yaml for a sample config. Be sure to set the appropriate *_api_key!

Setup

Here we'll document some explanations of setup choices you may make when using Veneur.

Clients

Veneur is capable of ingesting:

  • DogStatsD including events and service checks
  • SSF
  • StatsD as a subset of DogStatsD, but this may cause trouble depending on where you store your metrics.

To use clients with Veneur you need only point your client of choice at the proper host and port combination; a sketch of the corresponding listener configuration follows the list below. This port should match one of:

  • statsd_listen_addresses for UDP- and TCP-based clients
  • ssf_listen_addresses for SSF-based clients using UDP or UNIX domain sockets.
  • grpc_listen_addresses for both SSF- and DogStatsD-based clients using gRPC (over TCP).
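
For example, a listener configuration covering all three options might look like the following sketch (the ports and socket path are assumptions; example.yaml documents the defaults for your version):

statsd_listen_addresses:
  - udp://localhost:8126
  - tcp://localhost:8126
ssf_listen_addresses:
  - udp://localhost:8128
  - unix:///tmp/veneur-ssf.sock
grpc_listen_addresses:
  - tcp://localhost:8181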

Einhorn Usage

When you upgrade Veneur (deploy, stop, start with new binary) there will be a brief period where Veneur will not be able to handle HTTP requests. At Stripe we use Einhorn as a shared socket manager to bridge the gap until Veneur is ready to handle HTTP requests again.

You'll need to consult Einhorn's documentation for installation, setup and usage. But once you've done that you can tell Veneur to use Einhorn by setting http_address to einhorn@0. This informs goji/bind to use its Einhorn handling code to bind to the file descriptor for HTTP.
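
As a rough sketch only (the Einhorn flags and the port are assumptions; consult Einhorn's documentation for the authoritative invocation), running Veneur under Einhorn might look something like:

# Let Einhorn own the listening socket and spawn Veneur under it
einhorn -n 1 -b 127.0.0.1:8127 -- veneur -f config.yaml

# ...with the matching setting in config.yaml:
#   http_address: einhorn@0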

Forwarding

Veneur instances can be configured to forward their global metrics to another Veneur instance. You can use this feature to get the best of both worlds: metrics that benefit from global aggregation can be passed up to a single global Veneur, while other metrics can be published locally with host-scoped information. Note: forwarding adds an additional delay to metric availability corresponding to the value of the interval configuration option, as the local Veneur flushes to its configured upstream, which then flushes any received metrics when its own interval expires.

If a local instance receives a histogram or set, it will publish the local parts of that metric (the count, min and max) directly to sinks, but instead of publishing percentiles, it will package the entire histogram and send it to the global instance. The global instance will aggregate all the histograms together and publish their percentiles to sinks.

Note that the global instance can also receive metrics over UDP. It will publish a count, min and max for the samples that were sent directly to it, but not counting any samples from other Veneur instances (this ensures that things don't get double-counted). You can even chain multiple levels of forwarding together if you want. This might be useful if, for example, your global Veneur is under too much load. The root of the tree will be the Veneur instance that has an empty forward_address. (Do not tell a Veneur instance to forward metrics to itself. We don't support that and it doesn't really make sense in the first place.)

With respect to the tags configuration option, the tags that will be added are those of the Veneur that actually publishes to a sink. If a local instance forwards its histograms and sets to a global instance, the local instance's tags will not be attached to the forwarded structures. It will still use its own tags for the other metrics it publishes, but the percentiles will get extra tags only from the global instance.

Proxy

To improve availability, you can leverage veneur-proxy in conjunction with Consul service discovery.

The proxy can be configured to query the Consul API for instances of a service using consul_forward_service_name. Each healthy instance is then entered into a hash ring. When choosing which host to forward to, Veneur uses a combination of metric name and tags to consistently choose the same host for forwarding.
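
A minimal veneur-proxy configuration sketch, assuming a Consul service registered as veneur-global (the service name is illustrative; see the veneur-proxy documentation for the full set of options):

# Discover healthy global Veneur instances via Consul and hash metrics across them
consul_forward_service_name: veneur-global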

See more documentation for Proxy Veneur.

Static Configuration

For static configuration you need one Veneur, which we'll call the global instance, and one or more other Veneurs, which we'll call local instances. The local instances should have their forward_address configured to the global instance's http_address. The global instance should have an empty forward_address (i.e. just don't set it). You can then report metrics to any Veneur's statsd_listen_addresses as usual.
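
A hedged sketch of the two configs (the hostname and ports here are illustrative, not defaults):

# Global instance: leave forward_address unset
http_address: 0.0.0.0:8127
statsd_listen_addresses:
  - udp://0.0.0.0:8126

# Local instances: point forward_address at the global instance's http_address
forward_address: http://veneur-global.internal:8127
statsd_listen_addresses:
  - udp://localhost:8126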

Magic Tag

If you want a metric to be strictly host-local, you can tell Veneur not to forward it by including a veneurlocalonly tag in the metric packet, e.g. foo:1|h|#veneurlocalonly. This tag will not actually appear in storage; Veneur removes it.

Global Counters And Gauges

Relatedly, if you want to forward a counter or gauge to the global Veneur instance to reduce tag cardinality, you can tell Veneur to flush it to the global instance by including a veneurglobalonly tag in the metric's packet. This veneurglobalonly tag is stripped and will not be passed on to sinks.

Note: For global counters to report correctly, the local and global Veneur instances should be configured to have the same flush interval.

Note: Global gauges are "random write wins" since they are merged in a non-deterministic order at the global Veneur.
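
For example, a sketch of a counter packet carrying the tag (the metric name and port are illustrative):

echo -n "jobs.completed:1|c|#veneurglobalonly" | nc -u -w 1 localhost 8126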

Configuration

Veneur expects to have a config file supplied via -f PATH. The included example.yaml explains all the options!

The config file can be validated using a pair of flags:

  • -validate-config: checks that the config file specified via -f is valid YAML, and has correct datatypes for all fields.
  • -validate-config-strict: checks the above, and also that there are no unknown fields.

Configuration via Environment Variables

Veneur and veneur-proxy each allow configuration via environment variables using envconfig. Options provided via environment variables take precedence over those in the config file. This allows stuff like:

VENEUR_DEBUG=true veneur -f someconfig.yml

Note: The environment variables used for configuration map to the field names in config.go, capitalized, with the prefix VENEUR_. For example, the environment variable equivalent of datadog_api_hostname is VENEUR_DATADOGAPIHOSTNAME.

You may specify configuration options that are arrays by separating the values with a comma, for example VENEUR_AGGREGATES="min,max".
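
Putting those rules together, a sketch (the derived variable names follow the capitalize-and-strip-underscores mapping described above and should be double-checked against the field names in config.go for your version):

VENEUR_INTERVAL=10s \
VENEUR_AGGREGATES="min,max" \
VENEUR_STATSDLISTENADDRESSES="udp://localhost:8126" \
veneur -f someconfig.yml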

Monitoring

Here are the important things to monitor with Veneur:

At Local Node

When running as a local instance, you will be primarily concerned with the following metrics:

  • veneur.flush*.error_total as a count of errors when flushing metrics. This should rarely happen. Occasional errors are fine, but sustained errors are bad.

Forwarding

If you are forwarding metrics to central Veneur, you'll want to monitor these:

  • veneur.forward.error_total and the cause tag. This should pretty much never happen and definitely not be sustained.
  • veneur.forward.duration_ns and veneur.forward.duration_ns.count. These metrics track the per-host time spent performing a forward. The time should be minimal!

At Global Node

When forwarding you'll want to also monitor the global nodes you're using for aggregation:

  • veneur.import.request_error_total and the cause tag. This should pretty much never happen and definitely not be sustained.
  • veneur.import.response_duration_ns and veneur.import.response_duration_ns.count to monitor duration and number of received forwards. This should not fail and not take very long. How long it takes will depend on how many metrics you're forwarding.
  • And the same veneur.flush.* metrics from the "At Local Node" section.

Metrics

Veneur will emit metrics to the stats_address configured above in DogStatsD form. Those metrics are:

  • veneur.sink.metric_flush_total_duration_ns.* - Duration of flushes per-sink, tagged by sink.
  • veneur.packet.error_total - Number of packets that Veneur could not parse due to some sort of formatting error by the client. Tagged by packet_type and reason.
  • veneur.forward.post_metrics_total - Indicates how many metrics are being forwarded in a given POST request. A "metric", in this context, refers to a unique combination of name, tags and metric type.
  • veneur.*.content_length_bytes.* - The number of bytes in a single POST body. Remember that Veneur POSTs large sets of metrics in multiple separate bodies in parallel. Uses a histogram, so there are multiple metrics generated depending on your local DogStatsD config.
  • veneur.forward.duration_ns - Same as flush.duration_ns, but for forwarding requests.
  • veneur.flush.error_total - Number of errors received POSTing via sinks.
  • veneur.forward.error_total - Number of errors received POSTing to an upstream Veneur. See also import.request_error_total below.
  • veneur.gc.number - Number of completed GC cycles.
  • veneur.gc.pause_total_ns - Total nanoseconds of stop-the-world GC pauses since the program started.
  • veneur.mem.heap_alloc_bytes - Bytes of allocated heap objects, including unreachable objects the GC has not yet freed.
  • veneur.worker.metrics_processed_total - Total number of metric packets processed between flushes by workers, tagged by worker. This helps you find hot spots where a single worker is handling a lot of metrics. The sum across all workers should be approximately proportional to the number of packets received.
  • veneur.worker.metrics_flushed_total - Total number of metrics flushed at each flush time, tagged by metric_type. A "metric", in this context, refers to a unique combination of name, tags and metric type. You can use this metric to detect when your clients are introducing new instrumentation, or when you acquire new clients.
  • veneur.worker.metrics_imported_total - Total number of metrics received via the importing endpoint. A "metric", in this context, refers to a unique combination of name, tags, type and originating host. This metric indicates how much of a Veneur instance's load is coming from imports.
  • veneur.import.response_duration_ns - Time spent responding to import HTTP requests. This metric is broken into part tags for request (time spent blocking the client) and merge (time spent sending metrics to workers).
  • veneur.import.request_error_total - A counter for the number of import requests that have errored out. You can use this for monitoring and alerting when imports fail.
  • veneur.listen.received_per_protocol_total - A counter for the number of metrics/spans/etc. received by direct listening on global Veneur instances. This can be used to observe metrics that were received from direct emits as opposed to imports. Tagged by protocol.

Error Handling

In addition to logging, Veneur will dutifully send any errors it generates to a Sentry instance. This will occur if you set the sentry_dsn configuration option. Not setting the option will disable Sentry reporting.

Performance

Processing packets quickly is the name of the game.

Benchmarks

The common use case for Veneur is as an aggregator and host-local replacement for DogStatsD, therefore processing UDP fast is no longer the priority. That said, we were processing > 60k packets/second in production before shifting to the current local aggregation method. This outperformed both the Datadog-provided DogStatsD and StatsD in our infrastructure.

SO_REUSEPORT

As other implementations have observed, there's a limit to how many UDP packets a single kernel thread can consume before it starts to fall over. Veneur supports the SO_REUSEPORT socket option on Linux, allowing multiple threads to share the UDP socket with kernel-space balancing between them. If you've tried throwing more cores at Veneur and it's just not going fast enough, this feature can probably help by allowing more of those cores to work on the socket (which is Veneur's hottest code path by far). Note that this is only supported on Linux (right now). We have not added support for other platforms, like darwin and BSDs.

TCP connections

Veneur supports reading the statsd protocol from TCP connections. This is mostly to support TLS encryption and authentication, but might be useful on its own. Since TCP is a continuous stream of bytes, this requires each stat to be terminated by a new line character ('\n'). Most statsd clients only add new lines between stats within a single UDP packet, and omit the final trailing new line. This means you will likely need to modify your client to use this feature.
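
For example, against a plain (non-TLS) TCP listener you can send a stat from a shell like this (the port is an assumption matching the TLS example below; printf supplies the trailing newline the TCP listener requires):

printf 'foo.bar:1|c\n' | nc localhost 8129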

TLS encryption and authentication

If you specify the tls_key and tls_certificate options, Veneur will only accept TLS connections on its TCP port. This allows the metrics sent to Veneur to be encrypted.

If you specify the tls_authority_certificate option, Veneur will require clients to present a client certificate, signed by this authority. This ensures that only authenticated clients can connect.

You can generate your own set of keys using openssl:

# Generate the authority key and certificate (2048-bit RSA signed using SHA-256)
openssl genrsa -out cakey.pem 2048
openssl req -new -x509 -sha256 -key cakey.pem -out cacert.pem -days 1095 -subj "/O=Example Inc/CN=Example Certificate Authority"

# Generate the server key and certificate, signed by the authority
openssl genrsa -out serverkey.pem 2048
openssl req -new -sha256 -key serverkey.pem -out serverkey.csr -days 1095 -subj "/O=Example Inc/CN=veneur.example.com"
openssl x509 -sha256 -req -in serverkey.csr -CA cacert.pem -CAkey cakey.pem -CAcreateserial -out servercert.pem -days 1095

# Generate a client key and certificate, signed by the authority
openssl genrsa -out clientkey.pem 2048
openssl req -new -sha256 -key clientkey.pem -out clientkey.csr -days 1095 -subj "/O=Example Inc/CN=Veneur client key"
openssl x509 -req -in clientkey.csr -CA cacert.pem -CAkey cakey.pem -CAcreateserial -out clientcert.pem -days 1095

Set statsd_listen_addresses, tls_key, tls_certificate, and tls_authority_certificate:

statsd_listen_addresses:
  - "tcp://localhost:8129"
tls_certificate: |
  -----BEGIN CERTIFICATE-----
  MIIC8TCCAdkCCQDc2V7P5nCDLjANBgkqhkiG9w0BAQsFADBAMRUwEwYDVQQKEwxC
  ...
  -----END CERTIFICATE-----
tls_key: |
  -----BEGIN RSA PRIVATE KEY-----
  MIIEpAIBAAKCAQEA7Sntp4BpEYGzgwQR8byGK99YOIV2z88HHtPDwdvSP0j5ZKdg
  ...
  -----END RSA PRIVATE KEY-----
tls_authority_certificate: |
  -----BEGIN CERTIFICATE-----
  ...
  -----END CERTIFICATE-----
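
To smoke-test a TLS-enabled listener, you can wrap the same newline-terminated payload in openssl s_client, presenting the client certificate generated above (a sketch; adjust the host and port to match your configuration):

printf 'foo.bar:1|c\n' | openssl s_client -connect localhost:8129 -quiet -cert clientcert.pem -key clientkey.pem -CAfile cacert.pem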

Performance implications of TLS

Establishing a TLS connection is fairly expensive, so you should reuse connections as much as possible. RSA keys are also far more expensive than using ECDH keys. Using localhost on a machine with one CPU, Veneur was able to establish ~700 connections/second using ECDH prime256v1 keys, but only ~110 connections/second using RSA 2048-bit keys. According to the Go profiling for a Veneur instance using TLS with RSA keys, approximately 25% of the CPU time was in the TLS handshake, and 13% was decrypting data.

Name

The veneur is the person who acted as superintendent of the chase, and especially of hounds, in French medieval venery, and was an important officer of the royal household. In other words, it is the master of dogs. :)

veneur's People

Contributors

aditya-stripe, an-stripe, andresgalindo-stripe, arnavdugar-stripe, asf-stripe, aubrey-stripe, chimeracoder, choo-stripe, chualan-stripe, clin-stripe, cory-stripe, dependabot[bot], eriwo-stripe, evanj, gphat, joshu-stripe, kiran, kklipsch-stripe, krisreeves-stripe, mimran-stripe, noahgoldman, prudhvi, redsn0w422, rma-stripe, sdboyer-stripe, sjung-stripe, tummychow, vasi-stripe, yanske, yasha-stripe


veneur's Issues

sentry support

veneur should catch panics and report them to sentry. Error reporting to sentry might also be nice, but since errors are generated on a packet-by-packet basis, the existing metrics should cover a lot of our reporting needs there.

Configuration Through Environment Variables?

I'm wondering if there is any interest in enabling configuration through environment variables. This seems to be the default way to configure things with Docker and would be quite easy to enable using the https://github.com/kelseyhightower/envconfig library with the Config struct that is currently used. It could either be a flag that can be used as an alternative to --config or it could default to reading from the environment if --config is not set.

If there is interest in doing this, I'd be willing to contribute the code.

Aggregated histogram metrics are not forwarded to global veneur

I have a local Veneur forwarding metrics to a global Veneur. If I tag a histogram metric with veneurglobalonly, aggregations of that metric are not forwarded to the global Veneur (e.g. $ echo "testing.histo:1234|h|#veneurglobalonly,blah:yes" | nc -w 1 -u localhost 18125).

(Screenshots of the max and 99percentile series omitted.)

Can this behaviour be supported?

Non-numeric gauge blocks all metrics in flush interval from being sent

When using the latest related version of veneur inside of Stripe, I noticed that gauges with non-numeric values cause veneur to emit an error message, and fail to deliver all of the metrics contained in the flush interval.

You can reproduce this issue by running this command:

echo -n "this_is_a_bad_metric:nan|g|#shell" >/dev/udp/localhost/8200

You should see this error message in the log:

[2017-05-11 21:37:48.262346] time="2017-05-11T21:37:48Z" level=error msg="Could not render JSON" action=flush error="json: unsupported value: NaN" 

I was only able to reproduce this using a gauge; a counter didn't produce the error in the log. This is somewhat confusing, because looking at the code, I thought that this packet would fail to parse:

return nil, fmt.Errorf("Invalid number for metric value: %s", valueChunk)

Which would trigger an error higher up the stack:

log.WithFields(logrus.Fields{

Causing just that one metric to be excluded. This doesn't appear to be happening for some reason.

Logrus configuration

By default, logrus outputs to stderr. When running in GCE with the Stackdriver agent, stderr lines end up with severity 'warning', even when the level in the text-format is 'Info'.

As the logrus readme says this is how it is supposed to be used:

  if Environment == "production" {
    log.SetFormatter(&log.JSONFormatter{})
  } else {
    // The TextFormatter is default, you don't actually have to do this.
    log.SetFormatter(&log.TextFormatter{})
  }

Can we integrate something like that in Veneur? That way I believe stackdriver picks up the log level. I'm not sure though. In either case, stuff like "Completed flush to Datadog" and "Completed forward to upstream Veneur" should go to stdout, right?

Common tags on sinks: overwrite or not?

One of the issues that came up in #386 was whether it was appropriate for the gRPC sink's "common tags" to supersede any tags on the spans it ingests. This struck @asf-stripe as odd, as it did me, once he mentioned it - but I then checked the datadog sink and noticed that it does, indeed, overwrite tags on the incoming span.

Given that the application of common tags is itself somewhat uneven (#388), and that I didn't see (on quick ctrl-F-based perusal) any direct discussion of it in the README, I thought it worth opening an issue to explicitly clarify what the intended behavior is.

[1.8.1] Unable to specify "zero" SSF Listener Addresses via environment configuration

Hi there!

I ran into an issue today upgrading to the latest docker image of Veneur. There appears to be a bug when trying to set up the ssf_listener_addresses field via the environmental configuration suggested here: https://github.com/stripe/veneur/tree/master/public-docker-images#running. As far as I can tell, the initial configuration is provided by the example.yaml file, which specifies "default" values for ssf_listener_addresses. If I wanted to change those I'd have to overwrite them by providing -e VENEUR_SSFLISTENERADDRESSES="" to the docker run command. One small difference is that I'm using my Kubernetes deployment to pass the environment variables rather than directly using docker run

Due to the way envconfig works, I can't specify an empty string and have it translate to an empty slice. If I specify an empty string for this value, envconfig will parse it as a slice of size 1. If I don't specify a value for VENEUR_SSFLISTENERADDRESSES, I'll get the default value from the example.yaml. In the first case, an empty string will be passed to this line which causes a fatal error. In the latter case, I appear to be hanging infinitely when flushing metrics. Some quick debugging on my own end led me to believe it was because I had invalid addresses specified as my ssf_listener_addresses in the example.yaml file. My debug logs just showed workers 1-96 flushing without ever actually doing anything.

It looks like this could be fixed by commenting out the actual values for ssf_listener_addresses in the example.yaml file. I could also have just created my own docker image, but this seemed faster initially. I have confirmed that this issue is not present in release 1.7.0, presumably because the multiple ssf listener address functionality had yet to be added.

I hope this is enough information, but if not let me know. I also understand this may be an edge case you don't want to fix.

Thanks,

Chad

Unusual behaviour with aggregation

We’re aggregating metrics from our rails app, and seeing a few strange behaviours when we turn this on for two clusters. We don’t have any clear understanding of what’s going on here, but hoping you might be able to shed some light, or help us debug!

For a brief rundown on our architecture, we have a Veneur local agent running on each box that runs Rails, and then a global aggregator collecting everything and shipping it off to Datadog. We’ve also patched the global aggregator to consider local tags, as well as global.

The problem we’re seeing is when two clusters with a similar name (example: production-api-thing and production-api-thing-other) are both aggregated through the same global aggregator, we see a strange pattern in the way metrics arrive at Datadog.

(Screenshot of the Datadog debug dashboard omitted.)

The problem occurs when both clusters are aggregated. If only one or the other is being aggregated, we do not see this behaviour, only when both are. The metrics flip as seen between 14:00 and 14:30, but there doesn’t seem to be a pattern in how long it takes to flip between.

In this time, we don’t see anything untoward in the metrics we’re collecting from Veneur itself, since we’re still emitting the same number of metrics, they’re just being aggregated by the wrong bucket.

Have you any suggestions how we might dig into this further, or what might be the problem?

veneur-proxy doesn't work on Kubernetes via GRPC

I've been trying to use veneur-proxy on our setup that only has one veneur-global and a bunch of veneur-local's running as sidecars on each pod that needs to send metrics to Datadog.

The setup with just one veneur-global works fine, but after fiddling around with the proxy, the farthest I've got was:

time="2019-11-11T15:29:14Z" level=debug msg="Found TCP port" port=8128
time="2019-11-11T15:29:14Z" level=debug msg="Got destinations" destinations="[http://172.21.153.68:8128]" service=veneur-global
time="2019-11-11T15:29:41Z" level=error msg="Proxying failed" duration=2.684632ms error="failed to forward to the host 'http://172.21.153.68:8128' (cause=forward, metrics=1): failed to send 1 metrics over gRPC: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address http://172.21.153.68:8128: too many colons in address\"" protocol=grpc
time="2019-11-11T15:29:44Z" level=debug msg="About to refresh destinations" acceptingForwards=false consulForwardGRPCService=veneur-global consulForwardService= consulTraceService=

That error is similar to when the forward gRPC address is added with the http:// prefix, and looking at the code it seems that the "http://" prefix is hard-coded by the KubernetesDiscoverer while it's not added by Consul.

What is the reasoning for that? I don't want to fall back to sending this over HTTP just because the discoverer can't handle this.

Thanks for your time.

Unable to send to SignalFX

We have built a binary and it is sending StatsD metrics to DD, which is one of the sinks.
But the SignalFx metrics are not coming through even though it says success.

Then we changed the ingest URL to a rubbish string and we got the exact same "INFO[0110] Completed flush to SignalFx" message - so we now think NONE of the SFX config is recognised.

We have tried this in Docker and as a native Go binary.
Again, the DD metrics work from the same Go app; it's just a matter of getting the other SignalFx sink to work.

Here is the SignalFx-specific Veneur config and logs ....

log.txt
config.yaml.zip

RFC: TCP/TLS support?

Summary: Would you be interested in merging a patch that adds TCP statsd, and requires TLS client certificates?

Details:

For slightly bizarre reasons, we need to listen for stats on a port that is exposed to the Internet (most of our application is in Google App Engine, with other services on Google Compute Engine. They are not on the same network). For our services that do this, we require a TLS client certificate, to ensure that only our trusted clients can connect. As a result, I've hacked veneur to support the following:

  1. Accepting stats from a TCP socket. This has one caveat: messages must be terminated with a \n, which not all clients do (many, including Datadog's only include \n if packing multiple metrics in a single packet, and omit it for the last one).

  2. Requiring a TLS client certificate signed by a given key to accept the connection.

This isn't 100% in production for us (Bluecore) yet, but we are testing it now, with the goal of using this to deliver most of our stats from App Engine to Datadog. It would be convenient for us if you would merge our patch. However, I understand this is a pretty esoteric feature and we might be the only users in the world. I'm happy to maintain a private fork if that makes more sense.

Thanks!

Bucket alignment

After migrating from dd-agent to veneur we have the following problem on DataDog:

(Screenshot comparing dd-agent and Veneur datapoint alignment omitted.)

On the left, dd-agent aligns flushes at interval boundaries (0s, 10s, 20s, ... after the minute), and the datapoints are only defined at those x-axis values, while on the right the 10s buckets from veneur (from many different instances) land at 0s, 5s, 10s, 15s, ... intervals.

The alignment should match how dd-agent does this, for correct behavior in Datadog.

The way dd-agent works is by

bucket_start = 1508492676 - (1508492676 % interval)
// bucket_start = 1508492670

while veneur uses a ticker. If every instance starts at a distinct time, and you have a sufficiently large number of instances, the flushes are uniformly distributed over that 10s interval.

Proposed solution would be something like this:
https://stackoverflow.com/questions/33267106/golang-time-ticker-how-to-begin-on-even-timestamps

Veneur sinks support for Wavefront

Not sure this is the right way or forum to ask this question; however, we would like to integrate Veneur with Wavefront.
Is there already support available for that sink, or is there any work going towards it which we can keep track of?

Datadog api key is leaking into logs in default config

Starting up the software with no configuration overrides etc. places logging into debug mode which results in log messages like this:

time="2018-08-24T20:13:02Z" level=debug msg="POSTed successfully" action=flush endpoint="https://app.datadoghq.com/api/v1/series?api_key=<MY ACTUAL API KEY>" request_headers="map[Content-Encoding:[deflate] Traceid:[7918023244848592696] Parentid:[3942115212271789064] Spanid:[1544772588644824963] Resource:[flush] Content-Type:[application/json]]" request_length=29313 response="{\"status\": \"ok\"}" response_headers="map[Content-Length:[16] Dd-Pool:[propjoe] X-Content-Type-Options:[nosniff] Strict-Transport-Security:[max-age=15724800;] Date:[Fri, 24 Aug 2018 20:13:02 GMT] Content-Type:[text/json]]" status="202 Accepted"

While I think there is some value in outputting the datadog apikey in a debug message I think that both for security and sanity that logging should probably not be set to debug level by default. Or that examples like this one (https://varnull.adityamukerjee.net/2018/04/05/observing-kubernetes-services-with-veneur/) should probably show how to override the log level.

Consider a better default config file for your container.

I recently did an upgrade from a fairly old version of the tool (4.x) to 11.x and found that with the inclusion and default use of your default config file my deployment went from working to spewing errors right and left.

It's my opinion that all sinks should probably have an enabled true/false flag, but also maybe not providing values that cause them to activate by default would be nice. A short list of things I can remember modifying the config file to disable: splunk, falconer, signalfx, grpc, xray...

Allow listening for local metrics/SSF on a UNIX domain socket

Right now, veneur opens a UDP socket, and optionally a TCP socket to listen for data; these are both fine, but not the best choices for local operation. Some highlights of the downsides of an internet socket for local metrics include:

  • for UDP: MTU and max packet sizes start to matter, especially on larger events and metrics with long tag values.
  • TCP: connection overhead; you end up going through the entire TCP/IP stack (simulated with some shortcuts) for sending data.
  • TCP: makes you wait for an ack.
  • Both: less flexibility when running veneur on a container host - have to make that port accessible to the container with potential network shenanigans.

The solution to this is fairly well-used in the industry: UNIX domain sockets (aka local sockets). This is a path in the file system that clients connect to (e.g. Ruby, go) and then can treat like a normal socket.

Major upsides of UNIX domain sockets are:

  • Reliability: Data sent into the local socket is guaranteed to make it out the other end, in order. Buffer sizes are more forgiving and allow for more data to be sent than e.g. in UDP datagrams.
  • Performance: data sent goes into a kernel buffer directly with no waiting for acks.
  • Safety: We can be sure that only local clients can ever connect to veneur, even if we make config mistakes.
  • More safety / clarity: The file in the file system has an owner and permissions (which makes it easier to grok who can connect).
  • Operability in containers: You can run a single veneur on the container host and mount the socket into a container like any other file.

This will also require a mild re-work of client libraries (most dogstatsd clients speak only UDP, and you'll need to be more careful sending data, as they can and do block if you send with default socket options), but I think implementing UNIX domain sockets and letting client libraries start using them will be a major win for everyone.

Veneur CPU usage based on metric packets processed per second.

Hi Team
We are seeing Veneur using 1 logical CPU for about 60k metrics processed per second. We are sending stats to Veneur via the Java Datadog client over a Unix Domain Socket. Just wanted to know if such usage is expected or if we have configured something incorrectly.

Attaching the signalfx charts that show the stats:

(Charts omitted: veneur.worker.metrics_processed_total and logical CPUs used by Veneur.)

Force all metrics to be sent to the global aggregator

We want to use Veneur in a mode where we run a set of "leaf" Veneurs which receive stats from applications, and forward ALL metrics to a global Veneur. This is because we don't care about host tags, and so we don't want to pay for them :). We've gotten away with having a single Veneur instance so far, but we are starting to run into CPU limits, and we figure this scheme will allow us to scale substantially farther without needing to shard.

We are going to hack this into a local fork and make sure it works for us. If it does, would you be possibly interested in a pull request to add this as some sort of configuration option? I can understand that this may be a strange enough use case to not want it, and I don't want to do the work to figure out how to make this configurable if it won't be accepted.

Thanks!

veneur-prometheus over reports counts and histograms

The Problem

Prometheus-instrumented applications report an ever-increasing number for counters (and histograms). That is, if 10 events come in and you query it you will get 10; if 5 more come in, you will get 15; and so on, until the application is restarted and the count starts over.

Our bridge code is not accounting for this when sending to statsd/dogstatsd. So from statsd's perspective it thinks that 10 things happened, then 15 things happened, etc. This means that for long-running processes you will get dramatic over-reporting of events.

You can verify this behavior by observing this test: https://github.com/stripe/veneur/blob/kklipsch/prometheus/cmd/veneur-prometheus/main_test.go#L22

Proposal

To fix this behavior, the most straightforward approach is to take 2 points and subtract them to calculate the counter and histogram values. In the main case this works fine. There are 3 special cases that need to be accounted for:

  • Restart of veneur-prometheus: On restart you will lose your statefulness on the previous value. This means that during restarts you will miss statistics. This can be mitigated somewhat by saving state, but that becomes very complicated (what if the restart takes a really long time, do you want to report to statsd all of the events that happened during that period in a single flush interval to statsd?). Propose to ignore this case and just understand the stats gap that is generated by restarts.
  • Restart of the application being monitored clears counts: This will cause the counter to go smaller than the previous sample. In that case you can't know how many increments have happened on the stat (for instance previous value was 1250, restart happens and then when you poll again the value is 600). You know a minimum of 600 events happened, but it could be any number above that as the counter could have gotten up to an arbitrary X before the restart happened. Propose to report the minimum value for this case.
  • Restart of the application being monitored allows for histogram buckets to change. If the restart was part of a version change, the static buckets in histograms (and summaries I believe) can change. This means that you'd need to keep track of your histogram metrics by bucket and the code needs to be aware of this changing and continue to operate if it happens.

Can this be used to aggregate `host` into one?

Hey there!

We have plenty of dd agents which are each adding their own host tag and therefore multiplying the number of metrics we have in Datadog.

Would it be possible to use Veneur to aggregate host to be only one value? We don't use the host split the different agents give us.

Solutions for Tee-ing statsd traffic

I have a weird edgecase scenario and can't tell if veneur can support it. I'd like to tee off some statsd traffic that is sent to a veneur instance. I'd like to send it to both datadog and a local prometheus (I assumed via the statsd exporter). But I don't see a sink for statsd or prometheus. Any advice?

Documentation unclear on how local Veneur should be used

Hi, first of all, thanks for making this!

I would like a clarification (that I couldn't find in the current README) on how a local Veneur is intended to be used (i.e. what's the best practice?):

  • should a local Veneur instance replace DogStatsD from the dd-agent entirely?

OR

  • should DogStatsD merely be configured to forward its metrics to the local Veneur instance?

I expect the answer is the latter scenario, based on the comment for udp_address in the configuration section of the README, but I was unable to find a firm, explicit recommendation in the documentation.

Package rename in vendored dependencies

Hello! We're working on integrating Veneur into our (mono)repo, and are running up against some vendoring issues.

Specifically, x/net/http in this repo is older than the one we're using, which contains some package renames (namely lex/httplex moved into http/httpguts).

Here's the specific change: golang/net@cbb82b5

I've bumped the version locally to the latest upstream, and everything looks good, so I'd be more than happy to offer that up as a patch.

Thanks!

Error in assertion about maximum required size.

I apologize for the incorrect comment in my original Java code about the maximum size for a merging digest. In fact, the bound is 2 * ceiling(compression), not as in the code.

This changes this line: https://github.com/stripe/veneur/blob/master/tdigest/merging_digest.go#L75

Here is a proof:

In the algorithm, we convert q values to k values using

k = compression * (asin(2 * q - 1) + \pi / 2) / \pi

For q = 1, k_max = compression. This is the largest possible value k can have and the number of centroids is less than or equal to ceiling(2 * compression). We only consider cases where compression >= 1 which implies that we have 2 or more centroids.

Take n as the number of centroids. The sum of the k-sizes for these centroids must be exactly k_max.

If n is even, then suppose that n > 2 ceiling(k_max). Further, to avoid collapse to a smaller number of centroids, we know that all adjacent pairs of centroid weights must be greater than 1. But we can form k_max consecutive pairs which must add up to more than k_max. This implies that the weights on the items after the k_max pairs must be negative which is impossible. Thus, n > 2 ceiling(k_max) is impossible for even n.

If n is odd, then we will have n > 2 ceiling(k_max) + 1 if n > 2 ceiling(k_max). Again, we can form at least k_max pairs, each with k_size greater than 1. There will be at least 3 centroids after these pairs. But this implies that the k-size of the remaining centroids will be less than or equal to one. But the first two of these must have k-size together greater or equal to one so the remaining centroids must have negative k-size. Thus, n > 2 ceiling(k_max) + 1 > 2 ceiling(k_max) is impossible for n odd.

By contradiction, n <= 2 ceiling(compression)

Change the README and Website to explain in plain English what this does

Please see the discussion over on https://news.ycombinator.com/item?id=17586185

Someone did finally explain what this package does and this is what they said:

"The use case: you have more than a hundred machines emitting lots of monitoring data, much of which is uninteresting except in aggregate form. Instead of paying to store millions of data points and then computing the aggregates later, Veneur can calculate things like percentiles itself and only forwards those. ... It also has reliability benefits when operating over a network that might drop UDP packets, such as the public internet."

That makes sense to me. But when I first came across this package, I could not figure out what it does even after reading the entire README! I've been a programmer for decades, so I don't think it's me. I've never heard of "observability data" and a search didn't help. The Wikipedia article on observability left me even more confused. And for a while I thought it might have to do with observables, like in RxJS....

Same criticism applies to the website. "Veneur is a distributed, fault-tolerant observability pipeline. It's fast, memory-efficient, and capable of aggregating global metrics"

I have no idea what that means. Global weather metrics? Something to do with the globe I guess...

I did finally figure out what this package does by following the link at the bottom of the website. This article https://stripe.com/blog/introducing-veneur-high-performance-and-global-aggregation-for-datadog

"high performance and global aggregation for Datadog"

Getting closer.

Which led me to DataDog, where I finally get plain English: "Modern monitoring & analytics. See inside any stack, any app, at any scale, anywhere."

What a journey! README -> Veneur Website -> Blog Post -> Datadog website -> "Oh, this is about monitoring applications and collecting that monitoring data!"

I suggest something at the very top of your README and on your website along the lines of "If you monitor applications using Datadog, Veneur can save you money and increase reliability".

What's the upside for you? If I come across a package and can't use it now, I'll bookmark it and come back to it when I need it. That won't happen if I can't quickly figure out what category the software belongs in and how to tag it.

Public Docker image

Feature request:
I see that this has a maintained Dockerfile, which has allowed me to compile veneur with minimal effort. However, I can't seem to find an official docker image for running veneur. While I can certainly fork and tweak this repo to fit into my own build pipeline, I feel like I'm redoing effort that others have already done. Would it be possible to release a docker image with release tags? I'm happy to provide a PR for a Dockerfile that builds a production worthy image, but it would have to fit into Stripe's build pipeline.

Does Veneur retry failed datadog flushes?

I'm not that good at Go so I couldn't find my answer from the code.

Does Veneur retry sending failed Datadog flushes? I've seen a fair amount of flushes fail due to HTTP errors (timeouts) and I'm wondering if those are just warnings or could mean data loss.
...: time="2018-10-10T10:22:44-05:00" level=warning msg="Could not execute request" action=flush error="net/http: request canceled (Client.Timeout exceeded while awaiting headers)" host=app.datadoghq.com path=/api/v1/series

Are these requests retried? And if so, how many retries before the segment is lost?

Upgrade github.com/gogo/protobuf to later version

https://github.com/gogo/protobuf/releases is at 1.1.1 right now and the veneur dependency is pinned to 0.5.0 (https://github.com/stripe/veneur/blob/master/Gopkg.toml#L45)

1.x protobuf introduced some new types and fields in generated protobuf types that aren't available in 0.5.0.

We ran into an issue pulling in an API that pinned the version and dep was unable to resolve a dependency without using an override.

Overriding to 1.1.1 seemed to work, but figured I'd drop a line here to see if we can just roll that dependency forward

Local instance tags on aggregated metrics

When metrics are forwarded from the local instances to a global aggregator, tags that are configured on the local instances aren't included.

What we'd like to do is to have each local instance include a cluster tag, which will get forwarded with the aggregated percentiles/counters, so that we can have one central aggregator but still be able to group metrics by the cluster which it's being sent from (host level isn't important to us, but cluster level is).

Can you explain the reasoning behind making this decision? Is it possible to add this as an option to allow aggregation based on some tags?

Dynamic per tag api key support for signalfx sink

Hi!

The intended use case for this is having a per-service API Key, without having to hard code them all in the config file. I'd be adding this to the SignalFX sink.

This idea is @prudhvi's, and I'd be working with him to handle the meat of the implementation. He said he'd chatted with someone and that he'd gotten affirmation that a PR with this logic would be accepted, if it was generic enough.

I had a rough design in mind, let me know if it sounds reasonable. In a ticker polling loop:

Unix domain socket support for statsd metrics

Hi Team
Would you accept a PR to add UDS listening support for statsd metrics?
If so, would you accept listening for metrics on a DATAGRAM socket? I see that SSF span traces already support UDS via STREAM sockets.

For statsd metrics we think DATAGRAM is better suited to mimic UDP behavior; the standard Datadog clients support sending stats to a DATAGRAM socket, and we prefer listening on a datagram socket instead of a stream socket for that reason.
https://github.com/DataDog/datadog-agent/wiki/Unix-Domain-Sockets-support
https://github.com/DataDog/datadog-agent/wiki/Unix-Domain-Sockets-support#client-libraries-state

Not sending float numbers that are less than 1 for Counters to datadog.

Hi,

We recently started using Veneur to replace the Datadog agent and export data to Datadog. We are not using a release; we used the code from GitHub as of July 17th. We have observed that a Counter with a float value less than 1 does not get sent to Datadog. As a workaround, I multiplied the same metric by 10 (our number is around 0.2) and it is successfully pushed to Datadog.

Is this a known issue? Or is it fixed in the later code/release?

Thanks,
Jingyuan

Histograms .avg uneven

We have a fleet of servers that directly inject metrics into a fleet of Veneurs.
These Veneurs forward metrics to a single Veneur aggregator.

Some of the servers' metrics include measures of the time to process requests of an API infrastructure. On this API we also directly inject "requests" metrics into a single Veneur.

We are comparing the metrics from both sides and we have uneven results, mainly on histograms' .avg metrics.
When we used a local dd-agent on the servers the metrics would match. The dd-agent would receive localhost UDP.
We removed the local dd-agent due to the need to free CPU.

The servers don't use a local Veneur; they send metrics directly via UDP to the Veneurs, injecting about 56K metrics/second, about 18K distinct metrics, into the Veneurs.

We use Veneur 1.3.1 on the Veneurs, and 1.7.0 on the aggregator.
The Veneurs flush every 10s into the aggregator.
The aggregator flushes at 20s intervals.

veneurs.cfg.txt
aggregator.cfg.txt
both.sysctl.conf.txt

Is this scenario feasible or would you advise another?

datadog metrics error

Hi Friends,
I am running a standalone local Veneur Docker server with a valid DATADOG_APIKEY configuration.

After the server starts up I consistently see Datadog connectivity failures. Please help!

sr_veneur    | time="2019-11-01T22:05:58Z" level=warning msg="Could not POST" action=flush endpoint="https://app.datadoghq.com/api/v1/series?api_key=\"<valid key replaced for github purpose>\"" error=403 request_headers="map[Content-Encoding:[deflate] Content-Type:[application/json] Ot-Tracer-Sampled:[true] Ot-Tracer-Spanid:[72e562337c4bff4f] Ot-Tracer-Traceid:[696dc5e6b593e6c0]]" request_length=1196 response="<html><body><h1>403 Forbidden</h1>\nRequest forbidden by administrative rules.\n</body></html>\n" response_headers="map[Cache-Control:[no-cache] Content-Type:[text/html] Date:[Fri, 01 Nov 2019 22:05:58 GMT]]" status="403 Forbidden"
sr_veneur    | time="2019-11-01T22:05:58Z" level=info msg="Completed flush to Datadog" metrics=114
sr_veneur    | time="2019-11-01T22:05:58Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:05:58Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=debug msg=Flushing worker=0
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=debug msg="Checkpointing flushed spans for X-Ray" dropped_spans=6 flushed_spans=6
sr_veneur    | time="2019-11-01T22:06:08Z" level=debug msg="Worker count chosen" workers=1
sr_veneur    | time="2019-11-01T22:06:08Z" level=debug msg="Chunk size chosen" chunkSize=107
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=info msg="Completed flush to SignalFx" metrics=107 success=true
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Could not POST" action=flush endpoint="https://app.datadoghq.com/api/v1/series?api_key=\"<valid key replaced for github purpose>\"" error=403 request_headers="map[Content-Encoding:[deflate] Content-Type:[application/json] Ot-Tracer-Sampled:[true] Ot-Tracer-Spanid:[490c3bc208624814] Ot-Tracer-Traceid:[3d89f85a11d4f6d9]]" request_length=1090 response="<html><body><h1>403 Forbidden</h1>\nRequest forbidden by administrative rules.\n</body></html>\n" response_headers="map[Cache-Control:[no-cache] Content-Type:[text/html] Date:[Fri, 01 Nov 2019 22:06:08 GMT]]" status="403 Forbidden"
sr_veneur    | time="2019-11-01T22:06:08Z" level=info msg="Completed flush to Datadog" metrics=107
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:18Z" level=debug msg=Flushing worker=0
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:18Z" level=debug msg="Checkpointing flushed spans for X-Ray" dropped_spans=6 flushed_spans=7
sr_veneur    | time="2019-11-01T22:06:18Z" level=debug msg="Worker count chosen" workers=1
sr_veneur    | time="2019-11-01T22:06:18Z" level=debug msg="Chunk size chosen" chunkSize=114
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:18Z" level=info msg="Completed flush to SignalFx" metrics=114 success=true
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Could not POST" action=flush endpoint="https://app.datadoghq.com/api/v1/series?api_key=\"<valid key replaced for github purpose>\"" error=403 request_headers="map[Content-Encoding:[deflate] Content-Type:[application/json] Ot-Tracer-Sampled:[true] Ot-Tracer-Spanid:[47e7e8050e834905] Ot-Tracer-Traceid:[2a94b0994f734c21]]" request_length=1196 response="<html><body><h1>403 Forbidden</h1>\nRequest forbidden by administrative rules.\n</body></html>\n" response_headers="map[Cache-Control:[no-cache] Content-Type:[text/html] Date:[Fri, 01 Nov 2019 22:06:18 GMT]]" status="403 Forbidden"
sr_veneur    | time="2019-11-01T22:06:18Z" level=info msg="Completed flush to Datadog" metrics=114
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
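
For what it's worth, a 403 from the series endpoint usually means the API key itself is being rejected. One way to check the key outside of Veneur is a small probe against Datadog's key-validation endpoint; this is a hedged sketch, and the endpoint and DD-API-KEY header are my assumption of the standard v1 validate API:

package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	key := os.Getenv("DATADOG_API_KEY") // the same key given to Veneur

	req, err := http.NewRequest("GET", "https://api.datadoghq.com/api/v1/validate", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("DD-API-KEY", key)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// 200 means the key is valid; 403 reproduces the error Veneur is seeing.
	fmt.Println("status:", resp.Status)
}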

Need to change Sirupsen/logrus to sirupsen/logrus

Problem

It appears that the official library package has been renamed to the lowercased version (https://github.com/sirupsen/logrus).

This causes problems when attempting to use the test hook:

import "github.com/Sirupsen/logrus/hooks/test"

because pulling down that package references the downcased package name, which results in an error when running go test:

cannot use logger (type *"github.com/stripe/veneur/vendor/github.com/sirupsen/logrus".Logger) as type *"github.com/stripe/veneur/vendor/github.com/Sirupsen/logrus".Logger in argument

Proposed solution

Update the bundled package in the vendor directory to the downcased version and change all imports to the downcased name.
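
Once the vendored copy and the imports agree on the downcased path, the test hook behaves as expected. A minimal sketch using logrus's own test helpers (not Veneur code):

package main

import (
	"fmt"

	"github.com/sirupsen/logrus"            // downcased canonical import path
	"github.com/sirupsen/logrus/hooks/test" // test hook under the same casing
)

func main() {
	// NewNullLogger returns a logger that discards output plus a hook that
	// records every entry, which is what the mixed-case imports broke.
	logger, hook := test.NewNullLogger()
	logger.Warn("something happened")

	fmt.Println(len(hook.Entries), hook.LastEntry().Level == logrus.WarnLevel)
}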

Veneur panic

Our Veneurs (v1.3.1) started crashing sporadically last Thursday.
Yesterday I compiled from master and updated them all, but the crash seems to persist.

The problem is that it's occasional, and I can't seem to debug the ingested metrics.

Thank you for any help.

LOG:
time="2018-11-27T13:38:11Z" level=info msg="Completed flush to Datadog" metrics=2403
fatal error: fault
unexpected fault address 0x10f9000
[signal SIGSEGV: segmentation violation code=0x2 addr=0x10f9000 pc=0x10f8ffa]

goroutine 134 [running]:
runtime.throw(0x13cc01a, 0x5)
	/usr/local/go/src/runtime/panic.go:608 +0x72 fp=0xc000ac7a40 sp=0xc000ac7a10 pc=0x42bf72
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:397 +0x275 fp=0xc000ac7a90 sp=0xc000ac7a40 pc=0x441aa5
github.com/stripe/veneur.(*Worker).ProcessMetric(0xc000362fc0, 0xc000ac7dc8)
	/go/src/github.com/stripe/veneur/worker.go:311 +0x9fa fp=0xc000ac7c28 sp=0xc000ac7a90 pc=0x10f8ffa
github.com/stripe/veneur.(*Worker).Work(0xc000362fc0)
	/go/src/github.com/stripe/veneur/worker.go:259 +0x3ee fp=0xc000ac7fa8 sp=0xc000ac7c28 pc=0x10f854e
github.com/stripe/veneur.NewFromConfig.func1(0xc0005dc600, 0xc000362fc0)
	/go/src/github.com/stripe/veneur/server.go:277 +0x51 fp=0xc000ac7fd0 sp=0xc000ac7fa8 pc=0x10fe351
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc000ac7fd8 sp=0xc000ac7fd0 pc=0x45bab1
created by github.com/stripe/veneur.NewFromConfig
	/go/src/github.com/stripe/veneur/server.go:273 +0x95c

goroutine 1 [chan receive, 244 minutes]:
github.com/stripe/veneur.(*Server).Serve(0xc0005dc600)
	/go/src/github.com/stripe/veneur/server.go:1074 +0x6f
main.main()
	/go/src/github.com/stripe/veneur/cmd/veneur/main.go:94 +0x2eb

goroutine 36 [IO wait]:
internal/poll.runtime_pollWait(0x7f131c6a9950, 0x72, 0xc0006c2870)
	/usr/local/go/src/runtime/netpoll.go:173 +0x66
internal/poll.(*pollDesc).wait(0xc000401198, 0x72, 0xffffffffffffff00, 0x15b5280, 0x206e558)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9a
internal/poll.(*pollDesc).waitRead(0xc000401198, 0xc0004e8000, 0x2000, 0x2000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc000401180, 0xc0004e8000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x179
net.(*netFD).Read(0xc000401180, 0xc0004e8000, 0x2000, 0x2000, 0x408a7b, 0xc00002e000, 0x125a980)
	/usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc00000c050, 0xc0004e8000, 0x2000, 0x2000, 0x0, 0x0, 0x0)


CONFIG:
statsd_listen_addresses:
 - udp://0.0.0.0:8125
tls_key: ""
tls_certificate: ""
tls_authority_certificate: ""
forward_address: ""
forward_use_grpc: false
interval: "10s"
synchronize_with_interval: false
stats_address: "localhost:8125"
http_address: "0.0.0.0:8127"
grpc_address: "0.0.0.0:8128"
indicator_span_timer_name: ""
percentiles:
  - 0.99
  - 0.95
aggregates:
  - "min"
  - "max"
  - "median"
  - "avg"
  - "count"
  - "sum"
num_workers: 96
num_readers: 4
num_span_workers: 10
span_channel_capacity: 100
metric_max_length: 4096
trace_max_length_bytes: 16384
read_buffer_size_bytes: 4194304
debug: false
debug_ingested_spans: false
debug_flushed_metrics: false
mutex_profile_fraction: 0
block_profile_rate: 0
sentry_dsn: ""
enable_profiling: false
datadog_api_hostname: https://app.datadoghq.com
datadog_api_key: "DATADOG_KEY"
datadog_flush_max_per_body: 25000
datadog_trace_api_address: ""
datadog_span_buffer_size: 16384
aws_access_key_id: ""
aws_secret_access_key: ""
aws_region: ""
aws_s3_bucket: ""
flush_file: ""

ADDITIONAL INFO:
The Veneur instances receive metrics via UDP at a rate of at least 1000/s.
We health-check them over TCP on port 8127 (/healthcheck), and failing servers are replaced automatically (AWS ASG).

S3 plugin - bucket is not in region

Created bucket "devops-veneur-test2" in region us-west-2 with this inline bucket policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<<LOCAL_ACCOUNT_ID>>:root"
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::devops-veneur-test2",
                "arn:aws:s3:::devops-veneur-test2/*"
            ]
        }
    ]
}

Created an IAM user with this inline policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::devops-veneur-test2/*",
                "arn:aws:s3:::devops-veneur-test2"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets",
                "s3:HeadBucket"
            ],
            "Resource": "*"
        }
    ]
}

Created Access Key / Secret Access Key for above user.

Configured /root/.aws/credentials on the Veneur host with the above credentials.
With those credentials I am able to run:

aws s3 cp /etc/veneur.cfg s3://devops-veneur-test2/ --region us-west-2

Configured veneur.cfg with:

aws_access_key_id: "<<ACCESS_KEY>>"
aws_secret_access_key: "<<SECRET_ACCESS_KEY>>"
aws_region: "us-west-2"
aws_s3_bucket: "devops-veneur-test2"

Veneur logs:

time="2018-12-03T14:40:19Z" level=info msg="Set mutex profile fraction" MutexProfileFraction=0 previousMutexProfileFraction=0
time="2018-12-03T14:40:19Z" level=info msg="Set block profile rate (nanoseconds)" BlockProfileRate=0
time="2018-12-03T14:40:19Z" level=info msg="Preparing workers" number=32
time="2018-12-03T14:40:19Z" level=info msg="Successfully created AWS session"
time="2018-12-03T14:40:19Z" level=info msg="S3 archives are enabled"
time="2018-12-03T14:40:19Z" level=info msg="Starting server" version=cdcfb315b61057a0ee5c8bb3fd3b3bca0a69059b
time="2018-12-03T14:40:19Z" level=info msg="Starting span workers" n=0
time="2018-12-03T14:40:19Z" level=info msg="Starting span sink" sink=metric_extraction
time="2018-12-03T14:40:19Z" level=info msg="Listening on UDP address" address="0.0.0.0:8125" listeners=4 protocol=statsd
time="2018-12-03T14:40:19Z" level=info msg="Tracing sockets are not configured - not reading trace socket"
time="2018-12-03T14:40:19Z" level=info msg="Starting gRPC server" address="0.0.0.0:8128"
time="2018-12-03T14:40:19Z" level=info msg="Starting Event worker"
time="2018-12-03T14:40:19Z" level=info msg="HTTP server listening" address="0.0.0.0:8127"
time="2018-12-03T14:40:29Z" level=error msg="Error posting to s3" error="BucketRegionError: incorrect region, the bucket is not in 'us-west-2' region\n\tstatus code: 301, request id: , host id: " metrics=2
time="2018-12-03T14:40:39Z" level=error msg="Error posting to s3" error="BucketRegionError: incorrect region, the bucket is not in 'us-west-2' region\n\tstatus code: 301, request id: , host id: " metrics=37
time="2018-12-03T14:40:49Z" level=error msg="Error posting to s3" error="BucketRegionError: incorrect region, the bucket is not in 'us-west-2' region\n\tstatus code: 301, request id: , host id: " metrics=37
