Comments (25)
Are you running one central statsd_exporter?
The general recommendation is to run one exporter per node, and configure all apps to send to localhost. This allows for easy scaling and eliminates the exporter as a single point of failure (SPoF).
It also helps if you are using UDP to transport statsd metrics, as packet loss is less of an issue.
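For illustration, a minimal sketch of the client side under this setup, assuming the exporter listens on its default UDP port 9125; the metric name and value are made up:

package main

import (
    "fmt"
    "net"
)

func main() {
    // With one statsd_exporter per node, the application only ever
    // talks to localhost; 9125 is the exporter's default statsd port.
    conn, err := net.Dial("udp", "localhost:9125")
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Plain statsd wire format: <name>:<value>|<type>.
    // Here: a timer observation of 320ms for a hypothetical metric.
    fmt.Fprint(conn, "myapp.request_duration:320|ms")
}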
That comes out as ~10KiB per metric, which is indeed a lot.
As @SuperQ says, the first thing I would do is colocate the exporter with every data source, instead of the central statsd server.
I am still interested in getting to the bottom of the memory usage. What are all these bytes? Do we really need them? Any input is welcome!
I got a notification for a comment that made good points but appears to be lost?
This might be related to summaries with quantile – if you have a lot of timer metrics, these are translated into summaries and can be this expensive.
Now, generally histograms are probably a better idea here, but they open the question of how to configure buckets. On the other hand, one way to mitigate the memory cost of quantiles would be to make MaxAge, BufCap and AgeBuckets also configurable, so the complexity would be the same.
How should this look in configuration?
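For concreteness, one possible shape; both blocks here are sketches of possible syntax, not options that exist today (the summary_options names simply mirror prometheus.SummaryOpts):

defaults:
  timer_type: summary
  summary_options:
    max_age: 2m
    age_buckets: 2
    buf_cap: 500
mappings:
- match: myapp.request_duration
  timer_type: histogram
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2.5]

Per-match overrides of summary_options would answer the same question for summaries that per-match buckets would for histograms.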
@matthiasr I deleted the comment because I find my exporter still OOMs after I switch to histograms (default buckets), and the calculated average memory is still quite high. You can try it using this configuration:

defaults:
  timer_type: histogram
MaxAge, BufCap and AgeBuckets are not configurable now.
I'm trying to make them configurable globally in my branch:
func (c *SummaryContainer) Get(metricName string, labels prometheus.Labels, help string) (prometheus.Summary, error) {
    hash := hashNameAndLabels(metricName, labels)
    summary, ok := c.Elements[hash]
    if !ok {
        summary = prometheus.NewSummary(
            prometheus.SummaryOpts{
                Name:        metricName,
                Help:        help,
                ConstLabels: labels,
                // BufCap stays at client_golang's default (500);
                // AgeBuckets and MaxAge are tightened from the
                // defaults of 5 and 10 minutes.
                BufCap:     500,
                AgeBuckets: 2,
                MaxAge:     120 * time.Second,
            })
        c.Elements[hash] = summary
    }
    return summary, nil
}
If it works they should also be configurable per metric.
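A rough sketch of the global wiring, with hypothetical flag names (statsd_exporter doesn't expose these today); the defaults shown match client_golang's DefMaxAge, DefAgeBuckets and DefBufCap:

package main

import (
    "flag"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// hypothetical command-line flags for global summary tuning
var (
    summaryMaxAge     = flag.Duration("summary.max-age", 10*time.Minute, "MaxAge for summaries")
    summaryAgeBuckets = flag.Uint("summary.age-buckets", 5, "AgeBuckets for summaries")
    summaryBufCap     = flag.Uint("summary.buf-cap", 500, "BufCap for summaries")
)

func main() {
    flag.Parse()
    // Instead of the hard-coded values in Get above:
    summary := prometheus.NewSummary(prometheus.SummaryOpts{
        Name:       "example_timer",
        Help:       "example",
        MaxAge:     *summaryMaxAge,
        AgeBuckets: uint32(*summaryAgeBuckets),
        BufCap:     uint32(*summaryBufCap),
    })
    fmt.Println(summary.Desc())
}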
Some testing results. It seems related to the summary timer type.
- Test 1: 80000 metrics, no timers; 80000 time series; memory: 100-200M
- Test 2: 20000 timers as histograms with 10 buckets; 280000 time series; memory: 50-60M
- Test 3: 20000 timers as summaries (0.5:0.05, 0.9:0.01, 0.99:0.001), constant value (always 200ms); 100000 time series; memory: 1.7-5.4G
- Test 4: 20000 timers as summaries (0.5:0.05, 0.95:0.005), constant value (always 200ms); memory: similar to Test 3
- Test 5: 20000 timers as summaries (0.5:0.05, 0.95:0.005), constant value (always 200ms), AgeBuckets=2; memory: about 1.3G
Some findings:
- When enough stats are sent, BufCap and MaxAge don't really impact the memory usage.
- AgeBuckets=1 is optimal for memory but may result in unstable quantile values. If I use real data instead of the constant 200ms value, the memory usage is higher: in my test, AgeBuckets=2 reached about 3G of memory with random values r.NormFloat64()*50+200.
- Changing the targets from (0.5:0.05, 0.9:0.01, 0.99:0.001) to (0.5:0.05, 0.95:0.005) doesn't change the memory usage much.
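For reference, a minimal standalone sketch that approximates the summary tests above (20000 summaries with the Test 3 objectives, fed the same random distribution); the metric names are made up:

package main

import (
    "fmt"
    "math/rand"
    "runtime"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    // 20000 summaries, one per timer, with the objectives from Test 3.
    summaries := make([]prometheus.Summary, 20000)
    for i := range summaries {
        summaries[i] = prometheus.NewSummary(prometheus.SummaryOpts{
            Name:       fmt.Sprintf("timer_%d", i),
            Help:       "memory test",
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        })
    }
    // Feed every summary repeatedly, like a steady statsd stream would.
    for n := 0; n < 100; n++ {
        for _, s := range summaries {
            s.Observe(rand.NormFloat64()*50 + 200)
        }
    }
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("heap in use: %d MiB\n", m.HeapInuse/1024/1024)
}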
That's a really good find, thanks a lot @shuz. At the very least, this is something to mention in the README.
How do people feel about changing the default?
In my opinion if it is to be changed, the place to change it is in client_golang.
I'm not surprised that summary quantiles are expensive, it's one of the reasons I recommend against them.
Whether statsd timers are observed into histograms or summaries is already configurable, both per match and globally. I don't know what would need to change in the library.
Sorry, I wasn't explicit enough. I meant changing the default for the timer_type setting.
Ah, I don't see a problem with changing it.
Some more findings. In our service, we find the actual requests don't produce too many samples per metric. Our pattern is that many dynamic label combinations are measured, even though avg or max of p99 across them doesn't make too much sense.
We were able to make the memory usage much smoother by making the coldBuf/hotBuf/quantile stream buffers all start from 0 capacity.
Another thing we found is that the Gather() call forks a goroutine for each metric. This is also a big cost, so we added a TTL for the metrics, based on the fact that our metric rate is quite low. However, a better workaround would be a version of Gather() that collects metrics without forking goroutines, given the simple collectors used by statsd_exporter.
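For what it's worth, the TTL part can be a small amount of code; a sketch of the idea (the map layout follows the SummaryContainer quoted earlier, but the entry type and sweep are hypothetical):

package main

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// entry pairs a live summary with the last time a sample arrived for it.
type entry struct {
    summary  prometheus.Summary
    lastSeen time.Time
}

// sweep drops metrics that have been idle for longer than ttl, so
// low-rate, high-cardinality metrics don't accumulate forever.
// Deleting from a map while ranging over it is safe in Go.
func sweep(elements map[uint64]*entry, ttl time.Duration) {
    now := time.Now()
    for hash, e := range elements {
        if now.Sub(e.lastSeen) > ttl {
            delete(elements, hash)
        }
    }
}

func main() {
    elements := map[uint64]*entry{}
    // ... fill elements as samples arrive, updating lastSeen ...
    sweep(elements, 5*time.Minute)
}

In the exporter, a dropped metric would also need to be unregistered from the registry.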
What drawbacks does this approach have? How does this behave with concurrent scrapes? Should we generally do this without so many goroutines?
For your use case (aggregation across dimensions) histograms are really a better choice though – you can sum these up, grouped as you like, and get reasonable quantile estimations across dimensions.
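To make the aggregation point concrete: assuming the timers are mapped to a histogram named myapp_request_duration with a service label, a p99 across all services is a single PromQL query:

histogram_quantile(0.99, sum by (le) (rate(myapp_request_duration_bucket[5m])))

With summaries there is no equivalent; averaging pre-computed per-series quantiles gives statistically meaningless results.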
It's a client_golang thing, but I think using goroutines only for custom collectors might make sense. There's no real need with the standard metric types.
Do you mean it's a problem in the library, or in how we use it? How can we improve this?
It's client_golang internals, though having a metric object per child doesn't exactly help. @beorn7
In our use case, we set up statsd_exporter as a bridge for existing apps. They used to send quite a lot of metrics with labels using the dogstatsd format, and that agent kept running well with around 150M of memory, since it flushes every 10 seconds.
We find that with statsd_exporter, the memory usage always grows, and the average memory used by each metric doesn't feel right.
Then we found that the number of goroutines grows with the memory usage.
So we hacked the prometheus golang client lib to use only 1 goroutine in Gather():
// make a copy of collectors to prevent concurrent access
collectors := make([]Collector, 0, len(r.collectorsByID))
for _, collector := range r.collectorsByID {
    collectors = append(collectors, collector)
}
// a single goroutine walks all collectors sequentially, instead of
// one goroutine per collector
go func(collectors []Collector) {
    for _, collector := range collectors {
        func(collector Collector) {
            defer wg.Done()
            collector.Collect(metricChan)
        }(collector)
    }
}(collectors)
It's a special case, since we are handling quite a lot of metrics in the container, but it reduced the memory usage by half.
Making Gather behave differently depending on the nature of the Collector sounds like a horrible leak in the abstraction. I would much prefer to avoid that. Perhaps it will help to limit the number of goroutines, or do something smarter, which would adapt. I filed prometheus/client_golang#369.
@brian-brazil just to be sure I understand correctly – part of the problem is how we initialize a new metric object for each label combination here (and in the other Get implementations)? And we could optimize this by indexing on the metricName only, always passing the labels to that metric object?
If we ever wanted to expire metrics, that would mean we could only do so on a per-metric not a per-timeseries level, but that's not necessarily bad.
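To make that concrete, a sketch of what Get could look like with one SummaryVec per metric name (c.Vectors and labelNames are hypothetical, and the return type becomes Observer in recent client_golang versions):

// one vector per metric name; label values resolve to children inside it
func (c *SummaryContainer) Get(metricName string, labels prometheus.Labels, help string) (prometheus.Observer, error) {
    vec, ok := c.Vectors[metricName]
    if !ok {
        vec = prometheus.NewSummaryVec(
            prometheus.SummaryOpts{Name: metricName, Help: help},
            labelNames(labels), // label names are fixed on first use
        )
        c.Vectors[metricName] = vec
    }
    return vec.GetMetricWith(labels)
}

func labelNames(labels prometheus.Labels) []string {
    names := make([]string, 0, len(labels))
    for name := range labels {
        names = append(names, name)
    }
    return names
}

The caveat is visible in labelNames: every later sample for the same metric must carry exactly the same label names, or GetMetricWith returns an error.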
part of the problem is how we initialize a new metric object for each label combination?
Yes.
And we could optimize this by indexing on the metricName only, and always passing the labels to that metric object?
Yes, though that's presuming client_golang internals don't change.
It seems to me like that's also semantically more how metrics objects are meant to be used.
@matthiasr I still see high memory usage when using histograms. Just curious to know if the client_golang from prometheus/client_golang#370 is merged in statsd_exporter and if it will fix the issue.
@sathsb no, I haven't done that yet – I'm sorry for the long silence, I was away for a few months. While I'm still catching up, would you like to submit a PR that updates the vendored dependency/dependencies?
There was an upgrade to a newer, but not newest, client version in #119 – if you want to pick this up, you can probably salvage some of the necessary changes from that.
I believe this is done now.