Comments (17)

matthiasr commented on June 11, 2024

Ouch, that sounds bad. Could this be related to some metric that is submitted and poisons the registry?

Logging flags are currently missing, see #111. Also, #80 would still allow getting metrics about the exporter itself even when that happens. That might at least point to where more logging is needed …

matthiasr commented on June 11, 2024

Also cross-linking #63 for handling of inconsistent labels.

TimoL commented on June 11, 2024

We are facing the exact same issue.

We operate a CI system based on Zuul [1] and Nodepool [2]. Both are configured to push their statsd data directly to statsd-exporter. As soon as there's some load on the system, statsd-exporter goes into that "return error 500" mode.

Thanks to K8S liveness probes this is detected and the statsd-exporter is restarted, but it enters failure mode again within seconds. Result: we're sometimes missing >75% of the timeseries data in Prometheus.

We assume that Zuul or Nodepool pushes some StatsD metrics that the statsd-exporter does not handle properly. We have not yet traced the exact event that triggers the error state in statsd-exporter, but we plan to investigate the issue.

Additional note:
To make things worse, the /-/healthy API still reports "ok". I consider this a separate bug.

[1] https://docs.openstack.org/infra/zuul/
[2] https://docs.openstack.org/infra/nodepool/

TimoL commented on June 11, 2024

We did some more investigation on that issue. Our initial assumption

Zuul or Nodepool pushes some StatsD metrics, that the statsd-exporter does not handle properly

seems to be incorrect. All received statsd data seems to be fine.

After enabling more logging, we discovered something else:

"Config file changed (\"/opt/statsd-mapping/statsd-mapping.conf\": MODIFY|ATTRIB), attempting reload" source="main.go:114"
"Config reloaded successfully" source="main.go:120"

These reloads happen every 1 to 2 minutes.

Background:
We're running the statsd-exporter Docker container on OpenShift (Kubernetes), and the statsd-mapping.conf file is mounted into the container as a ConfigMap. This seems to trigger the config reloads.
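
For context, the watch-and-reload mechanism behind those log lines looks roughly like the sketch below. It assumes the fsnotify library and a hypothetical reloadConfig helper; it is illustrative, not statsd_exporter's actual code. Kubernetes refreshes ConfigMap volumes by swapping symlinks, which can surface as modify/attribute events on the watched path even when the content has not changed.

```
// Illustrative sketch only; statsd_exporter's real implementation differs.
// Assumes github.com/fsnotify/fsnotify and a hypothetical reloadConfig().
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	if err := watcher.Add("/opt/statsd-mapping/statsd-mapping.conf"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case ev := <-watcher.Events:
			// Both content writes and attribute changes land here, so a
			// ConfigMap volume sync can trigger a reload without any real change.
			if ev.Op&(fsnotify.Write|fsnotify.Chmod) != 0 {
				log.Printf("Config file changed (%s), attempting reload", ev)
				reloadConfig() // hypothetical: re-read and re-apply the mapping
			}
		case err := <-watcher.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}

// reloadConfig is a stand-in for whatever re-reads the mapping file.
func reloadConfig() {}
```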

Our current assumption is a race condition between

  • config reload and
  • incoming statsd events

@matthiasr I would be very interested in your opinion on this.

In the meantime we moved the statsd-mapping.conf file into the Docker image and the config reloads disappeared. Now we need to wait for the next period of high load. I will provide an update when we get some results.

TimoL commented on June 11, 2024

Ok, our assumption of a config reload race condition is also wrong.
The implementation uses a mutex and looks solid.

After silencing these periodic config reloads we still see the same issue with "http error 500 returned".

Need to dig deeper into this. Any help or idea is very welcome ;-)

matthiasr commented on June 11, 2024

Is there anything in the logs of the exporter during this time? Any indication why it's throwing a 500?

TimoL commented on June 11, 2024

Maybe a stupid question (due to my lacking Go skills):
How do I get some kind of verbose logging out of the statsd exporter (and the underlying libs)?

For now I cannot see anything suspicious in the logs. No errors of any kind. Only the warnings covered by #63 (and these were gathered by simply adding info logs):

time="2018-03-02T06:24:29Z" level=info msg="lineToEvents:  zuul.geard.packet.WORK_COMPLETE:0\|ms" source="exporter.go:428"
  | time="2018-03-02T06:24:29Z" level=info msg="lineToEvents:  zuul.geard.packet.WORK_COMPLETE:1\|c" source="exporter.go:428"
  | time="2018-03-02T06:24:29Z" level=info msg="A change of configuration created inconsistent metrics for \"zuul_geard_packet\". You have to restart the statsd_exporter, and you should consider the effects on your monitoring setup. Error: duplicate metrics collector registration attempted" source="exporter.go:295"
``` 
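
As an aside, that last "duplicate metrics collector registration attempted" message can be reproduced in isolation. The following is a hypothetical, minimal snippet (not the exporter's code): registering a counter and a summary under the same metric name is essentially what a mixed "|c"/"|ms" stream asks the registry to do.

```
// Hypothetical minimal reproduction, not the exporter's code.
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	counter := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "zuul_geard_packet", Help: "statsd events",
	})
	summary := prometheus.NewSummary(prometheus.SummaryOpts{
		Name: "zuul_geard_packet", Help: "statsd events",
	})

	prometheus.MustRegister(counter)
	if err := prometheus.Register(summary); err != nil {
		// err should be an AlreadyRegisteredError:
		// "duplicate metrics collector registration attempted"
		log.Println(err)
	}
}
```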

matthiasr commented on June 11, 2024

The error is happening in the HTTP endpoint handled by the Prometheus client library, and it is not logging errors by default. Additionally, we are using the deprecated handler from github.com/prometheus/client_golang/prometheus.

I think the simplest way to get logging is to switch to promhttp and do this, but pass an ErrorLog in the HandlerOpts.
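
A minimal sketch of that suggestion, assuming the standard library logger and the default registry (the exporter may wire this differently):

```
// Hypothetical wiring for the /metrics endpoint: promhttp instead of the
// deprecated prometheus.Handler(), with an error logger so gathering
// failures are no longer silent.
package main

import (
	"log"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	handler := promhttp.HandlerFor(prometheus.DefaultGatherer, promhttp.HandlerOpts{
		// ErrorLog receives any error hit while gathering or encoding metrics.
		ErrorLog: log.New(os.Stderr, "promhttp: ", log.LstdFlags),
		// Keep returning HTTP 500 on errors, but now with a log line saying why.
		ErrorHandling: promhttp.HTTPErrorOnError,
	})
	http.Handle("/metrics", handler)
	log.Fatal(http.ListenAndServe(":9102", nil))
}
```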

TimoL commented on June 11, 2024

@matthiasr Thanks for the hint, will test this asap.

TimoL commented on June 11, 2024

@matthiasr with your instructions I was able to get the error message (added line breaks for better readability, removed project data):

time="2018-03-08T16:48:07Z" level=error msg="error gathering metrics:
3 error(s) occurred:\n* collected metric zuul_job_results
label:<name:\"branch\" value:\"master\" >
label:<name:\"hostname\" value:\"xxxxxxxxxx\" >
label:<name:\"job\" value:\"xxxxxxxxxx\" > 
label:<name:\"pipeline\" value:\"xxxxxxxxxx\" > 
label:<name:\"project\" value:\"xxxxxxxxxx\" > 
label:<name:\"result\" value:\"xxxxxxxxxx\" > 
label:<name:\"tenant\" value:\"xxxxxxxxxx\" > 
counter:<value:1 >  should be a Summary\n* collected metric zuul_job_results 
label:<name:\"branch\" value:\"xxxxxxxxxx\" > 
label:<name:\"hostname\" value:\"xxxxxxxxxx\" > 
label:<name:\"job\" value:\"xxxxxxxxxx\" > 
label:<name:\"pipeline\" value:\"xxxxxxxxxx\" > 
label:<name:\"project\" value:\"xxxxxxxxxx\" > 
label:<name:\"result\" value:\"xxxxxxxxxx\" > 
label:<name:\"tenant\" value:\"xxxxxxxxxx\" > 
counter:<value:1 >  should be a Summary\n* collected metric zuul_job_results 
label:<name:\"branch\" value:\"master\" > 
label:<name:\"hostname\" value:\"xxxxxxxxxx\" > 
label:<name:\"job\" value:\"xxxxxxxxxx\" > 
label:<name:\"pipeline\" value:\"xxxxxxxxxx\" > 
label:<name:\"project\" value:\"xxxxxxxxxx\" > 
label:<name:\"result\" value:\"xxxxxxxxxx\" > 
label:<name:\"tenant\" value:\"xxxxxxxxxx\" > 
counter:<value:1 >  should be a Summary\n"
source="<autogenerated>:1"

matthiasr commented on June 11, 2024

This looks like you are emitting a mixture of statsd timers and counters under the same metric name, or under different metric names that get mapped to zuul_job_results. The Prometheus client library rejects that because it can't merge the two.

As a simple solution, drop the counters – the summary includes a _count metric that indicates the count of observed timer events.
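
As a small illustration of why the counter is redundant, reusing the metric name from the log excerpt (the snippet itself is hypothetical, not Zuul's or the exporter's code):

```
// Hypothetical illustration; only the metric name is taken from the log above.
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	jobResults := prometheus.NewSummaryVec(prometheus.SummaryOpts{
		Name: "zuul_job_results",
		Help: "Zuul job result timings.",
	}, []string{"result"})
	prometheus.MustRegister(jobResults)

	jobResults.WithLabelValues("SUCCESS").Observe(12.3)

	// A scrape of the default registry now contains, among others:
	//   zuul_job_results_sum{result="SUCCESS"} 12.3
	//   zuul_job_results_count{result="SUCCESS"} 1
	// zuul_job_results_count is the per-labelset event count that a separate
	// statsd counter would otherwise have provided.
}
```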

Would you mind submitting a PR with the logging fix?

TimoL commented on June 11, 2024

A PR to handle the underlying issue was just proposed here: #136

matthiasr commented on June 11, 2024

Some thoughts in light of the changes in the Prometheus client library (copy from #170):

In principle this is something I want to support, but it is not easy to implement. The Prometheus client really prefers consistent metrics; I believe we need to convert to the Collector form and ConstMetrics, and make sure to mark the collector as Unchecked. Alternatively/additionally, we could fully handle expanding label sets ourselves, but that may have many edge cases that need handling.

In any case, I would wait until #164 is in; it gets us one step closer to this by handling metrics on a name-by-name basis.
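
For illustration, a rough sketch of that Collector/ConstMetrics shape, with made-up type and field names rather than a proposal for the exporter's actual structure:

```
// Hypothetical shape; type and field names are made up.
package main

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

type gaugeState struct {
	labels prometheus.Labels
	value  float64
}

type statsdCollector struct {
	mu     sync.Mutex
	gauges map[string]gaugeState // keyed by metric name, filled from statsd events
}

// Describe is intentionally empty: a collector that announces no descriptors
// is registered as "unchecked", so descriptor consistency is not enforced.
func (c *statsdCollector) Describe(chan<- *prometheus.Desc) {}

// Collect builds ConstMetrics on the fly from whatever the statsd listener
// has accumulated, so label sets are free to change between scrapes.
func (c *statsdCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for name, g := range c.gauges {
		ch <- prometheus.MustNewConstMetric(
			prometheus.NewDesc(name, "statsd-derived gauge", nil, g.labels),
			prometheus.GaugeValue, g.value,
		)
	}
}

func main() {
	c := &statsdCollector{gauges: map[string]gaugeState{
		"example_queue_depth": {labels: prometheus.Labels{"queue": "default"}, value: 3},
	}}
	prometheus.MustRegister(c) // registered as an unchecked collector
}
```

The trade-off of the unchecked route is that problems are only surfaced (or tolerated) at scrape time rather than at registration time.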

matthiasr commented on June 11, 2024

I would like the Collector to be in a separate package, so that I can reuse it in graphite_exporter. Also see graphite_exporter for a simple Collector/ConstMetrics-style exporter. The statsd case is a superset because it also supports counters, summaries, and histograms.

matthiasr commented on June 11, 2024

From #179:

Currently, if you try to update a metric with a label set that is inconsistent with what's registered, the metric gets dropped. As of release 0.9 of the Prometheus Go client, specifically 417, inconsistent label dimensions are now allowed.

Since this is now officially allowed, we should indeed support it. I think (but have not looked into it in detail) that the issue is that we send some metric descriptors, so we don't get treated as an unchecked collector.
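
Extending the sketch above, the relaxed behaviour would look roughly like this Collect variant (hypothetical, and only meaningful for an unchecked collector):

```
// Hypothetical Collect variant for the unchecked collector sketched earlier.
// With client_golang >= 0.9, a scrape accepts the same metric name with two
// different label dimensions, where older releases rejected it.
func (c *statsdCollector) Collect(ch chan<- prometheus.Metric) {
	ch <- prometheus.MustNewConstMetric(
		prometheus.NewDesc("zuul_job_results", "job results", []string{"result"}, nil),
		prometheus.CounterValue, 1, "SUCCESS")
	ch <- prometheus.MustNewConstMetric(
		prometheus.NewDesc("zuul_job_results", "job results", []string{"result", "branch"}, nil),
		prometheus.CounterValue, 2, "FAILURE", "master")
}
```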

vsakhart commented on June 11, 2024

@matthiasr Any plans to work on this soon or are you willing to take a PR?

matthiasr commented on June 11, 2024

PRs are always welcome for any issue! Sorry I didn't make that clear
