beam-telemetry / telemetry_metrics_prometheus_core
Core Prometheus Telemetry.Metrics Reporter package for telemetry_metrics_prometheus
License: Apache License 2.0
Hey all,
I would like to suggest that we adopt default bucket values consistent with what other client libraries use:
# Python client default
DEFAULT_BUCKETS = (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5, 5.0, 7.5, 10.0, INF)

// Go client default
DefBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

// Java client default
private double[] buckets = new double[] { .005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10 };
| Value | Duration | Python | Golang | Java |
| --- | --- | --- | --- | --- |
| 0.005 | 5ms | Yes | Yes | Yes |
| 0.01 | 10ms | Yes | Yes | Yes |
| 0.025 | 25ms | Yes | Yes | Yes |
| 0.05 | 50ms | Yes | Yes | Yes |
| 0.075 | 75ms | Yes | No | Yes |
| 0.1 | 100ms | Yes | Yes | Yes |
| 0.25 | 250ms | Yes | Yes | Yes |
| 0.5 | 500ms | Yes | Yes | Yes |
| 0.75 | 750ms | Yes | No | Yes |
| 1.0 | 1s | Yes | Yes | Yes |
| 2.5 | 2.5s | Yes | Yes | Yes |
| 5.0 | 5s | Yes | Yes | Yes |
| 7.5 | 7.5s | Yes | No | Yes |
| 10.0 | 10s | Yes | Yes | Yes |
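For this library, the union of the Python and Java defaults would cover every row in the table. A minimal sketch of what that could look like in Elixir (the attribute name is mine, not the library's):

# Union of the client defaults above; the +Inf bucket is appended
# automatically by most clients (Python lists it explicitly), so it is
# omitted here.
@default_buckets [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75,
                  1.0, 2.5, 5.0, 7.5, 10.0]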
If doing
def child_spec(_) do
  TelemetryMetricsPrometheus.Core.child_spec(metrics: metrics())
end
Dialyzer complains about no local return, because the spec for Core.child_spec
doesn't mention any metrics:
telemetry_metrics_prometheus_core/lib/core.ex
Line 141 in c3cc031
The problem is the prometheus_option type, which is far more restrictive than what, for example, start_link will actually accept.
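A sketch of the kind of broadening that would fix it (the exact option types are my guess, mirroring what start_link/1 accepts at runtime):

@type prometheus_option ::
        {:metrics, [Telemetry.Metrics.t()]}
        | {:name, atom()}

@spec child_spec([prometheus_option()]) :: Supervisor.child_spec()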
Please forgive me if this is a dumb question - I'm still learning my way around metrics in general and Prometheus in particular.
Would it make sense to have, at least as an option, the ability to initialise counter metrics to 0 when they're first set up? As it stands (if my reading of things is correct), they return no value until the first time they're incremented. That leads to a not-entirely-intuitive result when you later use the increase
function in PromQL: the very first event doesn't count, because there's no "increase" between null and 1.
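A sketch of what the option might look like at reporter init (the option name is hypothetical, purely to illustrate the request):

TelemetryMetricsPrometheus.Core.start_link(
  metrics: metrics(),
  # hypothetical option: seed each counter series with 0 at registration,
  # so the first scrape already exposes the series and PromQL's increase()
  # sees the very first event
  initialize_counters: true
)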
It is possible for labels to have values which do not implement the String.Chars protocol, such as Ecto structs. If one of these values is encountered, the entire scrape fails.
We should add exception handling around this operation, log a warning that it occurred, and continue with the export.
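A minimal sketch of the proposed guard (module and function names are mine, not the library's):

defmodule LabelSafety do
  require Logger

  # Stringify one label value, logging and skipping values that have no
  # String.Chars implementation instead of failing the whole scrape.
  def safe_to_string(value) do
    {:ok, to_string(value)}
  rescue
    Protocol.UndefinedError ->
      Logger.warning("Skipping label value #{inspect(value)}: String.Chars not implemented")
      :error
  end
end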
Hi! I'm trying to collect a histogram of the size in bytes of a piece of data in addition to some other timing metrics. When I do this I get warnings about both inconsistent units and not using seconds. This seems incorrect to me as I'm collecting something that's not time-based. I'm relatively new to Prometheus but AFAICT it's not unexpected to use the histogram metric type to collect non-timing data.
As an example:
def metrics do
  [
    distribution("my_app.response_sent.body_size",
      buckets: [0.1, 1.0, 10.0],
      unit: {:byte, :kilobyte}
    ),
    distribution("my_app.response_sent.duration",
      buckets: [0.01, 0.1, 1.0, 10.0],
      unit: :second
    )
  ]
end
This results in the following warning logs:
15:30:10.285 [warn] Multiple time units found in your Telemetry.Metrics definitions.
Prometheus recommends using consistent time units to make view creation simpler.
You can disable this validation check by adding `consistent_units: false` in the validations options on reporter init.
15:30:10.285 [warn] Prometheus requires that time units MUST only be offered in seconds according to their guidelines, though this is not always practical.
https://prometheus.io/docs/instrumenting/writing_clientlibs/#histogram.
You can disable this validation check by adding `require_seconds: false` in the validations options on reporter init.
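For reference, the global opt-out the warnings point to would look roughly like this (the exact keyword shape is my assumption based on the warning text):

TelemetryMetricsPrometheus.Core.start_link(
  metrics: metrics(),
  validations: [consistent_units: false, require_seconds: false]
)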
On the whole I appreciate the warnings, so it would be nice to be able to disable them at the individual-metric level instead.
I'd like to add support for exemplars as in the OpenMetrics Specification.
Hi @bryannaegele! The lib looks great. The only feedback I have is that the dependency on telemetry_poller is probably not necessary: recent versions of telemetry_poller start their own process to perform measurements, and as long as the user depends on it and defines the proper metrics, the measurements will reach this lib.
Today this library uses :ets. Would it be possible to add a way to persist metric values across supervisor restarts? When deploying with Kubernetes, pods restart, the applications restart with them, and the metrics in :ets are reset. I think we could keep using :ets but also save the metrics to DETS, and read them back from DETS when the supervisor starts.
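A minimal sketch of the idea (not the library's API; the table name and file path are hypothetical):

defmodule MetricsPersistence do
  @dets_name :metrics_backup
  @dets_file ~c"/var/lib/my_app/metrics.dets"

  # Dump the metrics ETS table to DETS, e.g. from a terminate/2 callback.
  def save(ets_table) do
    {:ok, dets} = :dets.open_file(@dets_name, file: @dets_file)
    :ok = :dets.from_ets(dets, ets_table)
    :dets.close(dets)
  end

  # Load previously saved metrics back into the ETS table on startup.
  def restore(ets_table) do
    {:ok, dets} = :dets.open_file(@dets_name, file: @dets_file)
    _ = :dets.to_ets(dets, ets_table)
    :dets.close(dets)
  end
end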
I have issues compiling this part:
telemetry_metrics_prometheus_core/lib/core/exporter.ex
Lines 108 to 121 in a0e82ac
Using git bisect I narrowed the issue down to this commit: elixir-lang/elixir@51d23cb
It looks like ~S no longer works the way it did before, and some rework is needed to avoid compile-time errors:
== Compilation error in file lib/core/exporter.ex ==
** (MismatchedDelimiterError) mismatched delimiter found on lib/core/exporter.ex:116:3:
error: unexpected reserved word: end
│
114 │ |> String.replace(~S(\\), ~S(\\\\))
│ └ unclosed delimiter
115 │ |> String.replace(~S(\n), ~S(\\n))
116 │ end
│ └ mismatched closing delimiter (expected ")")
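One possible rework (my sketch, assuming the intent is standard Prometheus exposition-format escaping) is to drop the ~S sigils in favour of plain double-quoted strings, where the escaping rules are unambiguous:

# In a double-quoted string, "\\" is one backslash and "\\\\" is two, so
# this replaces each backslash with an escaped backslash and each literal
# newline with the two-character sequence \n.
value
|> String.replace("\\", "\\\\")
|> String.replace("\n", "\\n")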
Just to make sure: is this a possible DoS vector if Prometheus is not scraping a node and you have distribution metrics on board? It seems so; am I wrong?
There is no option to delete an existing entry in the ETS table. For example, if I have a sum metric with some tags, there is no way to remove the value related to a specific set of tags. Because of that, the size of the reports generated during scrapes can only grow, and there is no way to remove values that are no longer needed from these reports.
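Purely to illustrate the shape of what is being asked for, a hypothetical API (this function does not exist today):

# Hypothetical: drop the time series for one tag set so it no longer
# appears in generated reports.
TelemetryMetricsPrometheus.Core.delete_series(
  "my_app.queue.length",
  %{queue: "emails"}
)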
I noticed that summary is not yet supported.
I am working on Supavisor performance, and our benchmarking shows that TelemetryMetricsPrometheusCore is the main culprit of slowness. It is called in a very tight loop (via :telemetry, obviously), and it turns out that tag filtering is a substantial part of handling each event.
I ran a simple Benchee script to check whether it would be possible to speed it up:
defmodule Filtering do
  def old_filter(tags, tag_values) do
    case Enum.reject(tags, &match?(%{^&1 => _}, tag_values)) do
      [] -> :ok
      missing_tags -> {:tags_missing, missing_tags}
    end
  end

  def old(tags, tag_values) do
    with :ok <- old_filter(tags, tag_values) do
      Map.take(tag_values, tags)
    end
  end

  def new(tags, tag_values) do
    case Map.take(tag_values, tags) do
      vals when map_size(vals) == length(tags) -> vals
      vals -> {:tags_missing, tags -- Map.keys(vals)}
    end
  end

  def new_2(tags, tag_values) do
    Enum.reduce(tags, {[], []}, fn key, {f, n} ->
      case tag_values do
        %{^key => v} -> {[{key, v} | f], n}
        _ -> {f, [key | n]}
      end
    end)
    |> case do
      {vals, []} -> vals
      {_, missing} -> {:tags_missing, missing}
    end
  end
end
tags = Map.new(1..100, &{&1, &1})
keys = Enum.take_random(1..100, 20)
Benchee.run(%{
  "matching old" => fn -> Filtering.old(keys, tags) end,
  "matching new" => fn -> Filtering.new(keys, tags) end,
  "matching new_2" => fn -> Filtering.new_2(keys, tags) end
})

Benchee.run(%{
  "missing old" => fn -> Filtering.old([-1 | keys], tags) end,
  "missing new" => fn -> Filtering.new([-1 | keys], tags) end,
  "missing new_2" => fn -> Filtering.new_2([-1 | keys], tags) end
})
The results show that the Enum.reject/2 approach disfavours the happy path:
Operating System: macOS
CPU Information: Apple M3 Pro
Number of Available Cores: 12
Available memory: 36 GB
Elixir 1.17.1
Erlang 26.2.5
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 21 s
Benchmarking matching new ...
Benchmarking matching new_2 ...
Benchmarking matching old ...
Name ips average deviation median 99th %
matching new_2 4.57 M 218.79 ns ±9601.91% 166 ns 833 ns
matching new 3.53 M 283.58 ns ±6234.59% 250 ns 375 ns
matching old 2.41 M 414.93 ns ±4529.68% 375 ns 500 ns
Comparison:
matching new_2 4.57 M
matching new 3.53 M - 1.30x slower +64.78 ns
matching old 2.41 M - 1.90x slower +196.14 ns
Operating System: macOS
CPU Information: Apple M3 Pro
Number of Available Cores: 12
Available memory: 36 GB
Elixir 1.17.1
Erlang 26.2.5
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 21 s
Benchmarking missing new ...
Benchmarking missing new_2 ...
Benchmarking missing old ...
Name ips average deviation median 99th %
missing old 5.74 M 174.27 ns ±9073.23% 166 ns 291 ns
missing new_2 1.31 M 760.70 ns ±2264.45% 709 ns 917 ns
missing new 1.29 M 772.91 ns ±2281.40% 709 ns 958 ns
Comparison:
missing old 5.74 M
missing new_2 1.31 M - 4.36x slower +586.43 ns
missing new 1.29 M - 4.44x slower +598.64 ns
This is rather odd, as a metrics library should optimise for the happy path: we can assume that most of the time (ideally always) all metadata will be available. There are also further optimisations available; for example, length(tags) could be cached so that it is not recomputed on every event. Maybe there are other optimisations I haven't thought of, but as shown above, we can do quite a bit better on the happy path.
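A sketch of that caching idea (module and function names are mine): build the filter once per metric at registration, so the per-event happy path is a single Map.take/2 plus a map_size/1 guard with length(tags) precomputed.

defmodule Filtering.Cached do
  def compile_filter(tags) do
    # computed once at registration, not on every event
    n = length(tags)

    fn tag_values ->
      case Map.take(tag_values, tags) do
        vals when map_size(vals) == n -> vals
        vals -> {:tags_missing, tags -- Map.keys(vals)}
      end
    end
  end
end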
I have scoured this repo, and it is not used.
Hello,
we are switching to this library for metrics exposure in an application that used https://hex.pm/packages/prometheus beforehand.
In doing so we ran into usage scenarios that we can't handle at the moment (please correct me if I am wrong).
The biggest problem is when a sanity check triggers, which leads to sum/counter metrics being re-initialized with absolute values.
For sum metrics we found an (ugly) workaround: we pull the metric value from the ETS table, calculate the delta to our desired absolute value, and emit an event that has the desired effect.
For counter metrics this approach is not possible (but it's also fine to use sum with prometheus_type: counter in this case).
So we would really like to be able to simply set an absolute value for sum and counter metrics.
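For illustration, the kind of API we have in mind (hypothetical; it does not exist today):

# Hypothetical: overwrite the stored value instead of accumulating deltas.
TelemetryMetricsPrometheus.Core.set("my_app.jobs.processed.count", %{}, 1234)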
Thanks for your attention to this topic,
Ole
I have a sum over the request duration from Plug.Telemetry, more or less defined like this:
sum(
  "http.request.duration.seconds.total",
  event_name: "phoenix.endpoint.stop",
  measurement: :duration,
  unit: {:native, :second}
)
Telemetry.Metrics converts the native time integer to a float. I think that behaviour makes sense: the alternative of just using the output of System.convert_time_unit/3 would mean most of my measurements come out as 0s, because they are sub-second.
However, this fails because the Sum module's handle_event/4 uses :ets.update_counter/4, which only works with integers.
See line 61 of:
telemetry_metrics_prometheus_core/lib/core/sum.ex
Lines 53 to 67 in cd8bea3
I'm not quite sure how to attack this problem. If the conversion happened when rendering instead of when storing, this wouldn't be an issue, but that only solves the :native vs :second case. In general I think it should be possible to sum any numeric, not just integers.
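One possible direction (my sketch, not the library's code): keep the atomic integer fast path and fall back to a read-modify-write for floats. Note the float branch is not atomic under concurrent writers, so it would need more care in practice.

defmodule FloatSum do
  # Atomic fast path for integers.
  def add(table, key, value) when is_integer(value) do
    :ets.update_counter(table, key, value, {key, 0})
  end

  # Non-atomic fallback for floats.
  def add(table, key, value) when is_float(value) do
    current =
      case :ets.lookup(table, key) do
        [{^key, n}] -> n
        [] -> 0
      end

    :ets.insert(table, {key, current + value})
  end
end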
Thanks for making this library. The abstractions seem really great. We are looking at using this library in our enterprise systems and I have a few design questions.
Have you considered using a dynamic supervisor and spawning a child for every metric that is subscribed to? I'm thinking along the lines of multiple Registry children instead of just one.
My reasoning is that every call to telemetry.execute is synchronous and has to wait for the registry to process the message. A high volume of events being emitted could cause a call in Plug to block, slowing down web request response times.
Maybe this is a non-issue because event handling is so efficient that it never gets bottlenecked in practice. I'm just curious.
When an event is handled, it looks like it writes to an :ets table that has {:write_concurrency, true}, so it should be extremely fast. The only vectors I can see that could potentially cause an issue are the :keep and :tag_values options:
Telemetry.Metrics.counter("http.request.count",
  keep: fn _metadata ->
    # something with high latency, resource consumption, or error prone
    true
  end,
  tag_values: fn metadata ->
    # something with high latency, resource consumption, or error prone
    metadata
  end
)
I'm not sure of the likelihood of this happening, but Murphy's Law. Given that developers could shoot themselves in the foot here if they're not aware of the full impact of those functions, have you considered reporting metrics in a best-effort fashion?
My thought would be something along the lines of spawning a task using Task.Supervisor.async_nolink/2
immediately when an event is received. My reasoning is that I would prefer my application to continue to respond to web requests quickly regardless of my metrics being reported.
If a developer did do something resource intensive in those functions I'm thinking it would eventually manifest itself in an OOM or some other issue, but response latency would not be affected until a complete crash. The host application would be completely protected from a failed task because of a runtime exception or a timeout.
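To make the shape of that idea concrete, a minimal sketch (handler and supervisor names are mine; for pure fire-and-forget, Task.Supervisor.start_child/2 avoids the reply message that async_nolink/2 sends back to the emitting process):

defmodule BestEffortHandler do
  # Hand the event off immediately so the emitting process never blocks on
  # expensive :keep / :tag_values callbacks; a crash or timeout in the task
  # cannot take the caller down.
  def handle_event(event, measurements, metadata, config) do
    {:ok, _pid} =
      Task.Supervisor.start_child(MyApp.MetricsTaskSupervisor, fn ->
        record(event, measurements, metadata, config)
      end)

    :ok
  end

  defp record(_event, _measurements, _metadata, _config) do
    # the actual metric bookkeeping (ETS write) would happen here
    :ok
  end
end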
Sorry for the long post and thank you if you made it this far reading. I'm curious what your thoughts are on these questions -- even if for my own learning.
Thanks,
Spencer
TODO:
Such an approach may be useful when you need to manually control when metrics are added to the store because of tight-loop performance requirements.
Instead of using a custom metric name and overriding event_name and measurement, what about accepting a name in reporter_options?
From:
Metrics.distribution("prometheus_metrics.scrape.duration.milliseconds",
  event_name: [:prometheus_metrics, :plug, :stop],
  measurement: :duration,
  unit: {:native, :second},
  reporter_options: [
    buckets: [0.05, 0.1, 0.2, 0.5, 1]
  ]
)
To:
Metrics.distribution("prometheus_metrics.plug.stop",
  unit: {:native, :second},
  reporter_options: [
    name: "prometheus_metrics.scrape.duration.milliseconds",
    buckets: [0.05, 0.1, 0.2, 0.5, 1]
  ]
)
Hi. We are using your library in our project and find it helpful. Thank you.
However, we found a memory leak: if nobody ever asks for metrics through the API, the ETS dist table is never cleaned up.
We had accumulated 1 GB of data and 3M records in the dist table before noticing it.
Hi, we have a situation where we can't know in advance all the metrics that will be available, and they might also change dynamically while the application is running. I saw TelemetryMetricsPrometheus.Core.Registry.register, which seems like it would solve our problem. However, it is undocumented (probably not meant to be part of the interface), so I am reluctant to use it. Any chance it could be made part of the public interface? Are there any known or expected issues with using it in production code?
This may be useful for situations like measuring :inet.getstat/{1,2} values without converting back and forth. The value returned by this metric is already summed. Technically we could use a last_value metric there, but that would mark the value with the wrong type in the exported format, which can be problematic for dashboards or other services that parse the type to present a meaningful graph.
Hello,
Is there any plan to add support for sending metrics to a Pushgateway? I have a few use cases for this, including:
- an application_started counter metric (it is hard to do this right now, because if my app crashes on startup before the first scrape, the metric is never consumed by Prometheus).

I am not 100% sure where this functionality belongs; I originally started in the prom_ex library and made my way down to here.
I can implement this if someone can point me in the right direction. I believe we just need to get an extra config value into the Counter/Sum/LastValue/Distribution modules' handle_event functions, which would indicate whether the metric should be written to ets as it is now, or sent to the Pushgateway. I just don't know where that extra config value should be declared.
While building I get a few warnings about deprecated use of Logger.warn:
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
lib/core/aggregator.ex:58: TelemetryMetricsPrometheus.Core.Aggregator.filter_and_drop_time_series_with_bad_tag_values/2
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
lib/core/registry.ex:196: TelemetryMetricsPrometheus.Core.Registry.register_metrics/2
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
lib/core/registry.ex:203: TelemetryMetricsPrometheus.Core.Registry.register_metrics/2
Can I make a PR?
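The fix is mechanical at each call site, e.g.:

# before (Logger.warn/1 is deprecated since Elixir 1.15)
Logger.warn(message)

# after
Logger.warning(message)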