beam-telemetry / telemetry_metrics_prometheus_core
Core Prometheus Telemetry.Metrics Reporter package for telemetry_metrics_prometheus
License: Apache License 2.0
Hey all,
I would like to suggest that we adopt default bucket values consistent with what other client libraries use:
# Python client default
DEFAULT_BUCKETS = (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5, 5.0, 7.5, 10.0, INF)

// Go client default
DefBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

// Java client default
private double[] buckets = new double[] { .005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10 };
| Value | Duration | Python | Golang | Java |
| --- | --- | --- | --- | --- |
| 0.005 | 5ms | Yes | Yes | Yes |
| 0.01 | 10ms | Yes | Yes | Yes |
| 0.025 | 25ms | Yes | Yes | Yes |
| 0.05 | 50ms | Yes | Yes | Yes |
| 0.075 | 75ms | Yes | No | Yes |
| 0.1 | 100ms | Yes | Yes | Yes |
| 0.25 | 250ms | Yes | Yes | Yes |
| 0.5 | 500ms | Yes | Yes | Yes |
| 0.75 | 750ms | Yes | No | Yes |
| 1.0 | 1s | Yes | Yes | Yes |
| 2.5 | 2.5s | Yes | Yes | Yes |
| 5.0 | 5s | Yes | Yes | Yes |
| 7.5 | 7.5s | Yes | No | Yes |
| 10.0 | 10s | Yes | Yes | Yes |
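For this library, the union of the Python and Java defaults would cover every row in the table. A minimal sketch of what that could look like in Elixir (the attribute name is mine, not the library's):

# Union of the client defaults above; the +Inf bucket is appended
# automatically by most clients (Python lists it explicitly), so it is
# omitted here.
@default_buckets [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75,
                  1.0, 2.5, 5.0, 7.5, 10.0]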
If doing
def child_spec(_) do
  TelemetryMetricsPrometheus.Core.child_spec(metrics: metrics())
end
Dialyzer complains about no local return, because the spec for Core.child_spec
doesn't mention any metrics:
telemetry_metrics_prometheus_core/lib/core.ex
Line 141 in c3cc031
The problem is the prometheus_option type, which is far more restrictive than what, for example, start_link will actually accept.
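A sketch of the kind of broadening that would fix it (the exact option types are my guess, mirroring what start_link/1 accepts at runtime):

@type prometheus_option ::
        {:metrics, [Telemetry.Metrics.t()]}
        | {:name, atom()}

@spec child_spec([prometheus_option()]) :: Supervisor.child_spec()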
Please forgive me if this is a dumb question - I'm still learning my way around metrics in general and Prometheus in particular.
Would it make sense to have, at least as an option, the ability to initialise counter metrics to 0 when they're first set up? As it stands (if my reading of things is correct), they return no value until the first time they're incremented. That leads to a not-entirely-intuitive result when you later use the increase
function in PromQL: the very first event doesn't count, because there's no "increase" between null and 1.
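A sketch of what the option might look like at reporter init (the option name is hypothetical, purely to illustrate the request):

TelemetryMetricsPrometheus.Core.start_link(
  metrics: metrics(),
  # hypothetical option: seed each counter series with 0 at registration,
  # so the first scrape already exposes the series and PromQL's increase()
  # sees the very first event
  initialize_counters: true
)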
It is possible for labels to have values which do not implement the String.Chars protocol, such as Ecto structs. If one of these values is encountered, the entire scrape fails.
We should add exception handling around this operation, log a warning that it occurred, and continue with the export.
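A minimal sketch of the proposed guard (module and function names are mine, not the library's):

defmodule LabelSafety do
  require Logger

  # Stringify one label value, logging and skipping values that have no
  # String.Chars implementation instead of failing the whole scrape.
  def safe_to_string(value) do
    {:ok, to_string(value)}
  rescue
    Protocol.UndefinedError ->
      Logger.warning("Skipping label value #{inspect(value)}: String.Chars not implemented")
      :error
  end
end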
Hi! I'm trying to collect a histogram of the size in bytes of a piece of data in addition to some other timing metrics. When I do this I get warnings about both inconsistent units and not using seconds. This seems incorrect to me as I'm collecting something that's not time-based. I'm relatively new to Prometheus but AFAICT it's not unexpected to use the histogram metric type to collect non-timing data.
As an example:
def metrics do
  [
    distribution("my_app.response_sent.body_size",
      buckets: [0.1, 1.0, 10.0],
      unit: {:byte, :kilobyte}
    ),
    distribution("my_app.response_sent.duration",
      buckets: [0.01, 0.1, 1.0, 10.0],
      unit: :second
    )
  ]
end
This results in the following warning logs:
15:30:10.285 [warn] Multiple time units found in your Telemetry.Metrics definitions.
Prometheus recommends using consistent time units to make view creation simpler.
You can disable this validation check by adding `consistent_units: false` in the validations options on reporter init.
15:30:10.285 [warn] Prometheus requires that time units MUST only be offered in seconds according to their guidelines, though this is not always practical.
https://prometheus.io/docs/instrumenting/writing_clientlibs/#histogram.
You can disable this validation check by adding `require_seconds: false` in the validations options on reporter init.
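For reference, the global opt-out the warnings point to would look roughly like this (the exact keyword shape is my assumption based on the warning text):

TelemetryMetricsPrometheus.Core.start_link(
  metrics: metrics(),
  validations: [consistent_units: false, require_seconds: false]
)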
On the whole I appreciate the warnings, so it would be nice to be able to disable them at the individual-metric level instead.
I'd like to add support for exemplars as in the OpenMetrics Specification.
Hi @bryannaegele! The lib looks great. The only feedback I have is that the dependency on telemetry_poller is probably not necessary: recent versions of telemetry_poller start their own process to perform measurements, and as long as the user depends on it and defines the proper metrics, the measurements will reach this lib.
Today this library uses :ets. Would it be possible to add a way to persist metric values across supervisor restarts? When deploying with Kubernetes, pods restart, the applications restart with them, and the metrics in :ets are reset. I think we could keep using :ets but also save the metrics to DETS, and read them back from DETS when the supervisor starts.
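A minimal sketch of the idea (not the library's API; the table name and file path are hypothetical):

defmodule MetricsPersistence do
  @dets_name :metrics_backup
  @dets_file ~c"/var/lib/my_app/metrics.dets"

  # Dump the metrics ETS table to DETS, e.g. from a terminate/2 callback.
  def save(ets_table) do
    {:ok, dets} = :dets.open_file(@dets_name, file: @dets_file)
    :ok = :dets.from_ets(dets, ets_table)
    :dets.close(dets)
  end

  # Load previously saved metrics back into the ETS table on startup.
  def restore(ets_table) do
    {:ok, dets} = :dets.open_file(@dets_name, file: @dets_file)
    _ = :dets.to_ets(dets, ets_table)
    :dets.close(dets)
  end
end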
I have issues compiling this part:
telemetry_metrics_prometheus_core/lib/core/exporter.ex
Lines 108 to 121 in a0e82ac
Using git bisect I narrowed the issue down to this commit: elixir-lang/elixir@51d23cb
It looks like ~S no longer works the way it did before, and some rework is needed to avoid compile-time errors:
== Compilation error in file lib/core/exporter.ex ==
** (MismatchedDelimiterError) mismatched delimiter found on lib/core/exporter.ex:116:3:
error: unexpected reserved word: end
│
114 │ |> String.replace(~S(\\), ~S(\\\\))
│ └ unclosed delimiter
115 │ |> String.replace(~S(\n), ~S(\\n))
116 │ end
│ └ mismatched closing delimiter (expected ")")
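One possible rework (my sketch, assuming the intent is standard Prometheus exposition-format escaping) is to drop the ~S sigils in favour of plain double-quoted strings, where the escaping rules are unambiguous:

# In a double-quoted string, "\\" is one backslash and "\\\\" is two, so
# this replaces each backslash with an escaped backslash and each literal
# newline with the two-character sequence \n.
value
|> String.replace("\\", "\\\\")
|> String.replace("\n", "\\n")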
Just to make sure: is this a possible DoS vector if Prometheus is not scraping a node and you have distribution metrics on board? It seems so; am I wrong?
There is no option to delete an existing entry in the ETS table. For example, if I have a sum metric with some tags, there is no way to remove the value related to a specific set of tags. Because of that, the size of the reports generated during scrapes can only grow, and there is no way to remove values that are no longer needed from these reports.
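Purely to illustrate the shape of what is being asked for, a hypothetical API (this function does not exist today):

# Hypothetical: drop the time series for one tag set so it no longer
# appears in generated reports.
TelemetryMetricsPrometheus.Core.delete_series(
  "my_app.queue.length",
  %{queue: "emails"}
)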
I noticed that summary is not yet supported.
I am working on Supavisor performance, and our benchmarking shows that TelemetryMetricsPrometheusCore is the main culprit of slowness. It is called in a very tight loop (via :telemetry, obviously), and it turns out that tag filtering is a substantial part of handling each event.
I ran a simple Benchee script to check whether it would be possible to speed it up:
defmodule Filtering do
  def old_filter(tags, tag_values) do
    case Enum.reject(tags, &match?(%{^&1 => _}, tag_values)) do
      [] -> :ok
      missing_tags -> {:tags_missing, missing_tags}
    end
  end

  def old(tags, tag_values) do
    with :ok <- old_filter(tags, tag_values) do
      Map.take(tag_values, tags)
    end
  end

  def new(tags, tag_values) do
    case Map.take(tag_values, tags) do
      vals when map_size(vals) == length(tags) -> vals
      vals -> {:tags_missing, tags -- Map.keys(vals)}
    end
  end

  def new_2(tags, tag_values) do
    Enum.reduce(tags, {[], []}, fn key, {f, n} ->
      case tag_values do
        %{^key => v} -> {[{key, v} | f], n}
        _ -> {f, [key | n]}
      end
    end)
    |> case do
      {vals, []} -> vals
      {_, missing} -> {:tags_missing, missing}
    end
  end
end
tags = Map.new(1..100, &{&1, &1})
keys = Enum.take_random(1..100, 20)
Benchee.run(%{
  "matching old" => fn -> Filtering.old(keys, tags) end,
  "matching new" => fn -> Filtering.new(keys, tags) end,
  "matching new_2" => fn -> Filtering.new_2(keys, tags) end
})

Benchee.run(%{
  "missing old" => fn -> Filtering.old([-1 | keys], tags) end,
  "missing new" => fn -> Filtering.new([-1 | keys], tags) end,
  "missing new_2" => fn -> Filtering.new_2([-1 | keys], tags) end
})
The results show that the Enum.reject/2 approach disfavours the happy path:
Operating System: macOS
CPU Information: Apple M3 Pro
Number of Available Cores: 12
Available memory: 36 GB
Elixir 1.17.1
Erlang 26.2.5
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 21 s
Benchmarking matching new ...
Benchmarking matching new_2 ...
Benchmarking matching old ...
Name ips average deviation median 99th %
matching new_2 4.57 M 218.79 ns ±9601.91% 166 ns 833 ns
matching new 3.53 M 283.58 ns ±6234.59% 250 ns 375 ns
matching old 2.41 M 414.93 ns ±4529.68% 375 ns 500 ns
Comparison:
matching new_2 4.57 M
matching new 3.53 M - 1.30x slower +64.78 ns
matching old 2.41 M - 1.90x slower +196.14 ns
Operating System: macOS
CPU Information: Apple M3 Pro
Number of Available Cores: 12
Available memory: 36 GB
Elixir 1.17.1
Erlang 26.2.5
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 21 s
Benchmarking missing new ...
Benchmarking missing new_2 ...
Benchmarking missing old ...
Name ips average deviation median 99th %
missing old 5.74 M 174.27 ns ±9073.23% 166 ns 291 ns
missing new_2 1.31 M 760.70 ns ±2264.45% 709 ns 917 ns
missing new 1.29 M 772.91 ns ±2281.40% 709 ns 958 ns
Comparison:
missing old 5.74 M
missing new_2 1.31 M - 4.36x slower +586.43 ns
missing new 1.29 M - 4.44x slower +598.64 ns
This is rather odd, as a metrics library should optimise for the happy path: we can assume that most of the time (ideally always) all metadata will be available. There are also further optimisations available; for example, length(tags) could be cached so that it is not recomputed on every event. Maybe there are other optimisations I haven't thought of, but as shown above, we can do quite a bit better on the happy path.
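A sketch of that caching idea (module and function names are mine): build the filter once per metric at registration, so the per-event happy path is a single Map.take/2 plus a map_size/1 guard with length(tags) precomputed.

defmodule Filtering.Cached do
  def compile_filter(tags) do
    # computed once at registration, not on every event
    n = length(tags)

    fn tag_values ->
      case Map.take(tag_values, tags) do
        vals when map_size(vals) == n -> vals
        vals -> {:tags_missing, tags -- Map.keys(vals)}
      end
    end
  end
end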
I have scoured this repo, and it is not used.
Hello,
we are switching to this library for metrics exposure in an application that used https://hex.pm/packages/prometheus beforehand.
In doing so we ran into usage scenarios that we can't handle at the moment (please correct me if I am wrong).
The biggest problem is when a sanity check triggers, which leads to sum/counter metrics being re-initialized with absolute values.
For sum metrics we found an (ugly) workaround: we pull the metric value from the ETS table, calculate the delta to our desired absolute value, and emit an event that has the desired effect.
For counter metrics this approach is not possible (but it's also fine to use sum with prometheus_type: counter in this case).
So we would really like to be able to simply set an absolute value for sum and counter metrics.
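For illustration, the kind of API we have in mind (hypothetical; it does not exist today):

# Hypothetical: overwrite the stored value instead of accumulating deltas.
TelemetryMetricsPrometheus.Core.set("my_app.jobs.processed.count", %{}, 1234)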
Thanks for your attention to this topic,
Ole
I have a sum over the request duration from Plug.Telemetry, more or less defined like this:
sum(
  "http.request.duration.seconds.total",
  event_name: "phoenix.endpoint.stop",
  measurement: :duration,
  unit: {:native, :second}
)
Telemetry.Metrics converts the native time integer to a float. I think that behaviour makes sense: the alternative of just using the output of System.convert_time_unit/3 would mean most of my measurements come out as 0s, because they are sub-second.
However, this fails because the Sum module's handle_event/4 uses :ets.update_counter/4, which only works with integers.
See line 61 of:
telemetry_metrics_prometheus_core/lib/core/sum.ex
Lines 53 to 67 in cd8bea3
I'm not quite sure how to attack this problem. If the conversion happened when rendering instead of when storing, this wouldn't be an issue, but that only solves the :native vs :second case. In general I think it should be possible to sum any numeric, not just integers.
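One possible direction (my sketch, not the library's code): keep the atomic integer fast path and fall back to a read-modify-write for floats. Note the float branch is not atomic under concurrent writers, so it would need more care in practice.

defmodule FloatSum do
  # Atomic fast path for integers.
  def add(table, key, value) when is_integer(value) do
    :ets.update_counter(table, key, value, {key, 0})
  end

  # Non-atomic fallback for floats.
  def add(table, key, value) when is_float(value) do
    current =
      case :ets.lookup(table, key) do
        [{^key, n}] -> n
        [] -> 0
      end

    :ets.insert(table, {key, current + value})
  end
end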
Thanks for making this library. The abstractions seem really great. We are looking at using this library in our enterprise systems and I have a few design questions.
Have you considered using a dynamic supervisor and spawning a child for every metric that is subscribed to? I'm thinking along the lines of multiple Registry children instead of just one.
My reasoning is that every call to telemetry.execute is synchronous and has to wait for the registry to process the message. A high volume of events being emitted could cause a call in Plug to block, slowing down web request response times.
Maybe this is a non-issue because event handling is so efficient that it never gets bottlenecked in practice. I'm just curious.
When an event is handled, it looks like it writes to an :ets table that has {:write_concurrency, true}, so it should be extremely fast. The only vectors I can see that could potentially cause an issue are the :keep and :tag_values options:
Telemetry.Metrics.counter("http.request.count",
  keep: fn _metadata ->
    # something with high latency, resource consumption, or error prone
    true
  end,
  tag_values: fn metadata ->
    # something with high latency, resource consumption, or error prone
    metadata
  end
)
I'm not sure of the likelihood of this happening, but Murphy's Law. Given that developers could shoot themselves in the foot here if they're not aware of the full impact of those functions, have you considered reporting metrics in a best-effort fashion?
My thought would be something along the lines of spawning a task using Task.Supervisor.async_nolink/2
immediately when an event is received. My reasoning is that I would prefer my application to continue to respond to web requests quickly regardless of my metrics being reported.
If a developer did do something resource intensive in those functions I'm thinking it would eventually manifest itself in an OOM or some other issue, but response latency would not be affected until a complete crash. The host application would be completely protected from a failed task because of a runtime exception or a timeout.
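To make the shape of that idea concrete, a minimal sketch (handler and supervisor names are mine; for pure fire-and-forget, Task.Supervisor.start_child/2 avoids the reply message that async_nolink/2 sends back to the emitting process):

defmodule BestEffortHandler do
  # Hand the event off immediately so the emitting process never blocks on
  # expensive :keep / :tag_values callbacks; a crash or timeout in the task
  # cannot take the caller down.
  def handle_event(event, measurements, metadata, config) do
    {:ok, _pid} =
      Task.Supervisor.start_child(MyApp.MetricsTaskSupervisor, fn ->
        record(event, measurements, metadata, config)
      end)

    :ok
  end

  defp record(_event, _measurements, _metadata, _config) do
    # the actual metric bookkeeping (ETS write) would happen here
    :ok
  end
end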
Sorry for the long post and thank you if you made it this far reading. I'm curious what your thoughts are on these questions -- even if for my own learning.
Thanks,
Spencer
TODO:
Such an approach may be useful when you need to manually control when metrics are added to the store because of tight-loop performance requirements.
Instead of using a custom metric name and overriding event_name and measurement, what about accepting a name in reporter_options?
From:
Metrics.distribution("prometheus_metrics.scrape.duration.milliseconds",
  event_name: [:prometheus_metrics, :plug, :stop],
  measurement: :duration,
  unit: {:native, :second},
  reporter_options: [
    buckets: [0.05, 0.1, 0.2, 0.5, 1]
  ]
)
To:
Metrics.distribution("prometheus_metrics.plug.stop",
  unit: {:native, :second},
  reporter_options: [
    name: "prometheus_metrics.scrape.duration.milliseconds",
    buckets: [0.05, 0.1, 0.2, 0.5, 1]
  ]
)
Hi. We are using your library in our project and find it helpful. Thank you.
However, we found a memory leak: if nobody ever asks for metrics through the API, the ETS dist table is never cleaned up.
We had accumulated 1 GB of data and 3M records in the dist table before noticing it.
Hi, we have a situation where we can't know in advance all the metrics that will be available, and they might also change dynamically while the application is running. I saw TelemetryMetricsPrometheus.Core.Registry.register, which seems like it would solve our problem. However, it is undocumented (probably not meant to be part of the interface), so I am reluctant to use it. Any chance it could be made part of the public interface? Are there any known or expected issues with using it in production code?
This may be useful for situations like measuring :inet.getstat/{1,2} values without converting back and forth. The value returned by this metric is already summed. Technically we could use a last_value metric there, but that would mark the value with the wrong type in the exported format, which can be problematic for dashboards or other services that parse the type to present a meaningful graph.
Hello,
Is there any plan to add support for sending metrics to a Pushgateway? I have a few use cases for this, including:
- an application_started counter metric (it is hard to do this right now, because if my app crashes on startup before the first scrape, the metric is never consumed by Prometheus).

I am not 100% sure where this functionality belongs; I originally started in the prom_ex library and made my way down to here.
I can implement this if someone can point me in the right direction. I believe we just need to get an extra config value into the Counter/Sum/LastValue/Distribution modules' handle_event functions, which would indicate whether the metric should be written to ets as it is now, or sent to the Pushgateway. I just don't know where that extra config value should be declared.
While building I get a few warnings about deprecated use of Logger.warn:
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
lib/core/aggregator.ex:58: TelemetryMetricsPrometheus.Core.Aggregator.filter_and_drop_time_series_with_bad_tag_values/2
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
lib/core/registry.ex:196: TelemetryMetricsPrometheus.Core.Registry.register_metrics/2
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
lib/core/registry.ex:203: TelemetryMetricsPrometheus.Core.Registry.register_metrics/2
Can I make a PR?
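The fix is mechanical at each call site, e.g.:

# before (Logger.warn/1 is deprecated since Elixir 1.15)
Logger.warn(message)

# after
Logger.warning(message)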