
Comments (25)

SuperQ commented on June 11, 2024

Are you running one central statsd_exporter?

The general recommendation is to run one exporter per node, and configure all apps to send to localhost. This allows for easy scaling and eliminates the exporter as a SPoF.

It also helps if you are using UDP to transport statsd metrics, as packet loss on the loopback interface is much less of an issue than across the network.
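
For illustration, a minimal Go sketch of an app sending a statsd timer line over UDP to a colocated exporter (this assumes statsd_exporter's default UDP port 9125; the metric name is made up):

package main

import (
	"fmt"
	"net"
)

func main() {
	// Assumes statsd_exporter runs on the same host, listening for statsd
	// lines on its default UDP port :9125.
	conn, err := net.Dial("udp", "localhost:9125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// A plain statsd timer line: <name>:<value>|ms
	fmt.Fprintf(conn, "api_request_duration:42|ms")
}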

matthiasr commented on June 11, 2024

That comes out to ~10KiB per metric, which is indeed a lot.

As @SuperQ says, the first thing I would do is colocate the exporter with every data source, instead of the central statsd server.

I am still interested in getting to the bottom of the memory usage. What are all these bytes? Do we really need them? Any input is welcome!

matthiasr commented on June 11, 2024

I got a notification for a comment that made good points but appears to be lost?

This might be related to summaries with quantile – if you have a lot of timer metrics, these are translated into summaries and can be this expensive.

Now, generally histograms are probably a better idea here, but they raise the question of how to configure buckets. On the other hand, one way to mitigate the memory cost of quantiles would be to make MaxAge, BufCap and AgeBuckets configurable as well, so the configuration complexity would be about the same.

How should this look in configuration?
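
One possible shape, sketched from the implementation side: per-mapping summary options next to the existing timer_type and histogram buckets. All type and field names below are hypothetical, not statsd_exporter's actual mapper config:

package mapper

import "time"

// Hypothetical sketch only; these names do not exist in statsd_exporter as-is.
type SummaryOptions struct {
	MaxAge     time.Duration `yaml:"max_age"`
	AgeBuckets uint32        `yaml:"age_buckets"`
	BufCap     uint32        `yaml:"buf_cap"`
}

type Mapping struct {
	Match          string          `yaml:"match"`
	Name           string          `yaml:"name"`
	TimerType      string          `yaml:"timer_type"`
	Buckets        []float64       `yaml:"buckets"`         // for histogram timers
	SummaryOptions *SummaryOptions `yaml:"summary_options"` // for summary timers
}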

shuz commented on June 11, 2024

@matthiasr I deleted the comment because I found the exporter still OOMs after I switched to histograms (default buckets), and the calculated average memory per metric is still quite high. You can try it using this configuration:

defaults:
  timer_type: histogram

shuz commented on June 11, 2024

MaxAge, BufCap and AgeBuckets are not configurable now.

I'm trying to make them configurable globally in my branch:

func (c *SummaryContainer) Get(metricName string, labels prometheus.Labels, help string) (prometheus.Summary, error) {
	hash := hashNameAndLabels(metricName, labels)
	summary, ok := c.Elements[hash]
	if !ok {
		summary = prometheus.NewSummary(
			prometheus.SummaryOpts{
				Name:        metricName,
				Help:        help,
				ConstLabels: labels,
				// Hard-coded globally for now; these should become configurable.
				BufCap:      500,
				AgeBuckets:  2,
				MaxAge:      120 * time.Second,
			})
		// Register and cache the new summary, then return it.
		if err := prometheus.Register(summary); err != nil {
			return nil, err
		}
		c.Elements[hash] = summary
	}
	return summary, nil
}

If it works, they should also be configurable per metric.

shuz commented on June 11, 2024

Some testing results. It seems related to the summary timer type.

Test 1: 80,000 metrics, no timers
80,000 time series
Memory: 100-200 MB

Test 2: 20,000 timers as histograms with 10 buckets
280,000 time series
Memory: 50-60 MB

Test 3: 20,000 timers as summaries (0.5:0.05, 0.9:0.01, 0.99:0.001), constant value (always 200ms)
100,000 time series
Memory: 1.7-5.4 GB

Test 4: 20,000 timers as summaries (0.5:0.05, 0.95:0.005), constant value (always 200ms)
Memory: similar to Test 3

Test 5: 20,000 timers as summaries (0.5:0.05, 0.95:0.005), constant value (always 200ms), AgeBuckets=2
Memory: about 1.3 GB

Some findings:

When there are enough stats sent, BufCap and MaxAge don't really impact the memory usage.
AgeBuckets=1 is optimal for memory but may result in unstable quantile values. If I use real data instead of the constant 200ms value, the memory usage is higher: in my test, AgeBuckets=2 reached about 3 GB of memory when using random values r.NormFloat64()*50+200.
Changing the quantile targets from (0.5:0.05, 0.9:0.01, 0.99:0.001) to (0.5:0.05, 0.95:0.005) doesn't change the memory usage much. (A minimal sketch of this kind of test follows below.)
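
For reference, a minimal sketch of a test along these lines (not the exact harness used above; the number of observations is arbitrary):

package main

import (
	"fmt"
	"math/rand"
	"runtime"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	r := rand.New(rand.NewSource(1))

	// 20,000 summaries with two quantile targets and AgeBuckets=2.
	summaries := make([]prometheus.Summary, 0, 20000)
	for i := 0; i < 20000; i++ {
		summaries = append(summaries, prometheus.NewSummary(prometheus.SummaryOpts{
			Name:       fmt.Sprintf("timer_%d", i),
			Help:       "test timer",
			Objectives: map[float64]float64{0.5: 0.05, 0.95: 0.005},
			AgeBuckets: 2,
			MaxAge:     120 * time.Second,
		}))
	}

	// Feed each summary normally distributed "latencies" around 200ms.
	for i := 0; i < 500; i++ {
		for _, s := range summaries {
			s.Observe(r.NormFloat64()*50 + 200)
		}
	}

	var m runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&m)
	fmt.Printf("heap in use: %d MiB\n", m.HeapInuse/1024/1024)
}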

matthiasr commented on June 11, 2024

That's a really good find, thanks a lot @shuz. At the very least, this is something to mention in the README.

How do people feel about changing the default?

brian-brazil commented on June 11, 2024

In my opinion if it is to be changed, the place to change it is in client_golang.

I'm not surprised that summary quantiles are expensive, it's one of the reasons I recommend against them.

matthiasr commented on June 11, 2024

whether statsd timers are observed into histograms or summaries is already configurable, both per match and globally. I don't know what would need to change in the library?

matthiasr commented on June 11, 2024

sorry, I wasn't explicit enough. I meant changing the default for the timer_type setting.

brian-brazil commented on June 11, 2024

Ah, I don't see a problem with changing it.

shuz commented on June 11, 2024

Some more findings. In our service we find the actual requests don't produce that many samples. Our pattern is that many dynamic label combinations are measured, even though the avg or max of a p99 doesn't make much sense.
We were able to make the memory usage much smoother by making the coldBuf/hotBuf/quantile stream buffers all start with zero capacity.
Another thing we found is that the Gather() call spawns a goroutine for each metric. This is also a big cost, so we added a TTL for the metrics, based on the fact that our metric rate is quite low (a rough sketch of the TTL idea follows below). However, a good workaround is to create a version of Gather() that collects metrics without spawning goroutines, given the simple collectors used by statsd_exporter.
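
To make the TTL idea concrete, here is a rough sketch with made-up names (a container keyed by metric hash, with a last-seen timestamp); this is not statsd_exporter's actual code:

package main

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// trackedSummary pairs a summary with the last time it received a value;
// the container's Get would refresh lastSeen on every update.
type trackedSummary struct {
	summary  prometheus.Summary
	lastSeen time.Time
}

type summaryContainer struct {
	mu       sync.Mutex
	elements map[uint64]*trackedSummary
}

// sweep unregisters and drops summaries that have not been updated within ttl.
func (c *summaryContainer) sweep(ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	for hash, e := range c.elements {
		if now.Sub(e.lastSeen) > ttl {
			prometheus.Unregister(e.summary)
			delete(c.elements, hash)
		}
	}
}

func main() {
	c := &summaryContainer{elements: map[uint64]*trackedSummary{}}
	// In the exporter this would run periodically, e.g. from a ticker goroutine.
	c.sweep(5 * time.Minute)
}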

matthiasr commented on June 11, 2024

What drawbacks does this approach have? How does it behave with concurrent scrapes? Should we generally do this without spawning so many goroutines?

For your use case (aggregation across dimensions) histograms are really a better choice though – you can sum these up, grouped as you like, and get reasonable quantile estimations across dimensions.

brian-brazil commented on June 11, 2024

It's a client_golang thing, but I think using goroutines only for custom collectors might make sense. There's no real need with the standard metric types.

matthiasr commented on June 11, 2024

Do you mean it's a problem in the library, or in how we use it? How can we improve this?

brian-brazil commented on June 11, 2024

It's client_golang internals, though having a metric object per child doesn't exactly help. @beorn7

shuz commented on June 11, 2024

In our use case, we set up statsd_exporter as a bridge for existing apps. They already send quite a lot of metrics with labels in the DogStatsD format, and the previous agent kept running well with around 150 MB of memory, since it flushed every 10 seconds.

With statsd_exporter, we find that the memory usage keeps growing, and the average memory used per metric doesn't feel right.

Then we found that the number of goroutines grows along with the memory usage:

[screenshot: goroutine count growing in step with memory usage]
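
For context, this kind of growth is easy to watch from inside the process (client_golang's Go collector also exports it as go_goroutines); a minimal sketch:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Log goroutine count and heap usage once a minute while the exporter runs.
	for range time.Tick(time.Minute) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("goroutines=%d heap_inuse=%d MiB\n",
			runtime.NumGoroutine(), m.HeapInuse/1024/1024)
	}
}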

Then we hacked the Prometheus Go client library to use only one goroutine in Gather():

	// Inside (*Registry).Gather: instead of one goroutine per collector, run
	// all Collect calls sequentially in a single goroutine. wg and metricChan
	// are set up by the surrounding Gather code as before, with wg.Add still
	// called once per collector.

	// make a copy of collectors to prevent concurrent access
	collectors := make([]Collector, 0, len(r.collectorsByID))
	for _, collector := range r.collectorsByID {
		collectors = append(collectors, collector)
	}

	go func(collectors []Collector) {
		for _, collector := range collectors {
			func(collector Collector) {
				defer wg.Done()
				collector.Collect(metricChan)
			}(collector)
		}
	}(collectors)

It's a special case, since we are handling quite a lot of metrics in one container, but it reduced the memory usage by half.

from statsd_exporter.

beorn7 avatar beorn7 commented on June 11, 2024

Making Gather behave differently depending on the nature of the Collector sounds like a horrible leak in the abstraction. I would much prefer to avoid that. Perhaps it would help to limit the number of goroutines, or do something smarter that adapts. I filed prometheus/client_golang#369.
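
For illustration, a minimal sketch of the "limit the number of goroutines" idea as a bounded worker pool over collectors (a generic pattern, not the actual change discussed in prometheus/client_golang#369):

package main

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// collectAll drains every collector using at most maxWorkers goroutines,
// instead of spawning one goroutine per collector.
func collectAll(collectors []prometheus.Collector, maxWorkers int) <-chan prometheus.Metric {
	out := make(chan prometheus.Metric)
	jobs := make(chan prometheus.Collector)
	var wg sync.WaitGroup

	for i := 0; i < maxWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range jobs {
				c.Collect(out)
			}
		}()
	}

	go func() {
		for _, c := range collectors {
			jobs <- c
		}
		close(jobs)
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	// Drain whatever the (empty) collector list produces.
	for range collectAll(nil, 4) {
	}
}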

matthiasr commented on June 11, 2024

@brian-brazil just to be sure I understand correctly – part of the problem is how we initialize a new metric object for each label combination here (and in the other Get implementations)? And we could optimize this by indexing on the metricName only, and always passing the labels to that metric object?

If we ever wanted to expire metrics, that would mean we could only do so on a per-metric rather than a per-time-series level, but that's not necessarily bad.
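
For illustration, a rough sketch of that direction: keep one SummaryVec per metric name and route observations to children via labels, instead of creating and registering a separate Summary per label combination (names here are illustrative, not statsd_exporter's actual code):

package main

import "github.com/prometheus/client_golang/prometheus"

// vecContainer indexes on the metric name only; label values select the child.
// It assumes a given metric name always arrives with the same label names.
type vecContainer struct {
	elements map[string]*prometheus.SummaryVec
}

func (c *vecContainer) observe(name string, labels prometheus.Labels, value float64) {
	vec, ok := c.elements[name]
	if !ok {
		labelNames := make([]string, 0, len(labels))
		for l := range labels {
			labelNames = append(labelNames, l)
		}
		vec = prometheus.NewSummaryVec(prometheus.SummaryOpts{
			Name: name,
			Help: "statsd timer",
		}, labelNames)
		prometheus.MustRegister(vec)
		c.elements[name] = vec
	}
	vec.With(labels).Observe(value)
}

func main() {
	c := &vecContainer{elements: map[string]*prometheus.SummaryVec{}}
	c.observe("api_request_duration", prometheus.Labels{"endpoint": "/users"}, 0.2)
}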

brian-brazil commented on June 11, 2024

part of the problem is how we initialize a new metric object for each label combination?

Yes.

And we could optimize this by indexing on the metricName only, and always passing the labels to that metric object?

Yes, though that's presuming client_golang internals don't change.

matthiasr commented on June 11, 2024

It seems to me that's also semantically closer to how metric objects are meant to be used.

sathsb commented on June 11, 2024

@matthiasr I still see high memory usage when using histograms. Just curious whether the client_golang change from prometheus/client_golang#370 has been merged into statsd_exporter, and whether it will fix the issue.

matthiasr commented on June 11, 2024

@sathsb no, I haven't done that yet – I'm sorry for the long silence, I was away for a few months. While I'm still catching up, would you like to submit a PR that updates the vendored dependency/dependencies?

matthiasr commented on June 11, 2024

There was an upgrade to a newer, but not newest, client version in #119 – if you want to pick this up, you can probably salvage some of the necessary changes from that.

matthiasr commented on June 11, 2024

I believe this is done now.
