
prometheus-net.dotnetruntime's Introduction

prometheus-net.DotNetMetrics

A plugin for the prometheus-net package, exposing .NET core runtime metrics including:

  • Garbage collection frequencies and timings by generation/type, pause timings and GC CPU consumption ratio
  • Heap size by generation
  • Bytes allocated by small/large object heap
  • JIT compilations and JIT CPU consumption ratio
  • Thread pool size, scheduling delays and reasons for growing/shrinking
  • Lock contention
  • Exceptions thrown, broken down by type

These metrics are essential for understanding the performance of any non-trivial application. Even if your application is well instrumented, you're only getting half the story: what the runtime is doing completes the picture.

Using this package

Requirements

  • .NET 5.0+ recommended; .NET Core 3.1+ is supported
  • The prometheus-net package

Install it

The package can be installed from NuGet:

dotnet add package prometheus-net.DotNetRuntime

Start collecting metrics

You can start metric collection with:

IDisposable collector = DotNetRuntimeStatsBuilder.Default().StartCollecting();

You can customize the types of .NET metrics collected via the Customize method:

IDisposable collector = DotNetRuntimeStatsBuilder
	.Customize()
	.WithContentionStats()
	.WithJitStats()
	.WithThreadPoolStats()
	.WithGcStats()
	.WithExceptionStats()
	.StartCollecting();

Once the collector is registered, you should see metrics prefixed with dotnet_ visible in your metric output (make sure you are exporting your metrics).
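For example, a minimal sketch of pairing the collector with prometheus-net's standalone `MetricServer` to export the metrics over HTTP (the port number is an arbitrary choice):

```csharp
using System;
using Prometheus;
using Prometheus.DotNetRuntime;

class Program
{
    static void Main()
    {
        // Start collecting .NET runtime metrics into the default registry.
        IDisposable collector = DotNetRuntimeStatsBuilder.Default().StartCollecting();

        // Expose all registered metrics (including the dotnet_* series)
        // on http://localhost:1234/metrics for Prometheus to scrape.
        var server = new MetricServer(port: 1234);
        server.Start();

        Console.WriteLine("Serving metrics on http://localhost:1234/metrics");
        Console.ReadLine();
    }
}
```

In an ASP.NET Core application you would typically use the `MapMetrics()`/`UseMetricServer()` middleware from prometheus-net.AspNetCore instead of a standalone server.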

Choosing a CaptureLevel

By default, the library generates metrics based on event counters. This allows basic instrumentation of applications with very little performance overhead.

You can enable higher-fidelity metrics by providing a custom CaptureLevel, e.g.:

DotNetRuntimeStatsBuilder
	.Customize()
	.WithGcStats(CaptureLevel.Informational)
	.WithExceptionStats(CaptureLevel.Errors)
	...

Most builder methods allow passing a custom CaptureLevel; see the documentation on exposed metrics for more information.

Performance impact of CaptureLevel.Errors+

The harder you work the .NET Core runtime, the more events it generates. Event generation and processing costs can stack up, especially around these types of events:

  • JIT stats: each method compiled by the JIT compiler emits two events. Most JIT compilation is performed at startup, so depending on the size of your application this could impact your startup performance.
  • GC stats with CaptureLevel.Verbose: an event is emitted for every 100KB of allocations. If you are consistently allocating memory at a rate above 1GB/sec, you might like to disable GC stats.
  • Exception stats with CaptureLevel.Errors: an event is generated for every exception thrown.

Recycling collectors

Since .NET Core 3.1 there have been performance issues where CPU consumption can grow over time when long-running trace sessions are used. While many of these issues have been addressed in .NET 6.0, a workaround was identified: periodically stopping and starting (AKA recycling) collectors helps reduce CPU consumption:

IDisposable collector = DotNetRuntimeStatsBuilder.Default()
	// Recycles all collectors once every day
	.RecycleCollectorsEvery(TimeSpan.FromDays(1))
	.StartCollecting();

While this technique has been observed to reduce CPU consumption, it has also been identified as a possible cause of application instability.

Behaviour on different runtime versions:

  • .NET Core 3.1: recycling verified to cause massive instability; recycling cannot be enabled.
  • .NET 5.0: recycling verified to be beneficial; recycling every day is enabled by default.
  • .NET 6.0+: recycling verified to be less necessary now that the long-standing issues have been addressed, although some users still report it to be beneficial; disabled by default but can be enabled.

TL;DR: if you observe CPU usage increasing over time, try enabling recycling. If you then see unexpected crashes, try disabling it.

Examples

An example docker-compose stack is available in the examples/ folder. Start it with:

docker-compose up -d

You can then visit http://localhost:3000 to view metrics being generated by a sample application.

Grafana dashboard

The metrics exposed can drive a rich dashboard, giving you graphical insight into the performance of your application (exported dashboard available here):

Grafana dashboard sample

Further reading

prometheus-net.dotnetruntime's People

Contributors

blankensteiner, cwhsu1984, djluck, dustinchilson, leohexspoor, lodejard, nazmialtun, pjb3005, tiagotartari


prometheus-net.dotnetruntime's Issues

Record ThreadPool.PendingWorkItemCount Property

This seems like a good candidate to record.

https://docs.microsoft.com/en-us/dotnet/api/system.threading.threadpool.pendingworkitemcount?view=netcore-3.1

In fact, running dotnet counters monitor --process-id 123 gives the results below, of which ThreadPool Queue Length is one:

Press p to pause, r to resume, q to quit.
    Status: Running

[System.Runtime]
    % Time in GC since last GC (%)                         0
    Allocation Rate / 1 sec (B)                       66,480
    CPU Usage (%)                                          0
    Exception Count / 1 sec                                0
    GC Heap Size (MB)                                    234
    Gen 0 GC Count / 60 sec                                0
    Gen 0 Size (B)                               200,991,744
    Gen 1 GC Count / 60 sec                                0
    Gen 1 Size (B)                                   385,440
    Gen 2 GC Count / 60 sec                                0
    Gen 2 Size (B)                               210,807,272
    LOH Size (B)                                  74,140,000
    Monitor Lock Contention Count / 1 sec                  0
    Number of Active Timers                              314
    Number of Assemblies Loaded                          272
    ThreadPool Completed Work Item Count / 1 sec          38
    ThreadPool Queue Length                                0
    ThreadPool Thread Count                                9
    Working Set (MB)                                     242
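Until such a metric exists in the library, this property can be exposed directly with prometheus-net. A sketch, assuming .NET Core 3.0+ (where `ThreadPool.PendingWorkItemCount` is available); the metric name `dotnet_threadpool_queue_length` and 5-second poll interval are illustrative choices:

```csharp
using System;
using System.Threading;
using Prometheus;

// Polls ThreadPool.PendingWorkItemCount into a prometheus-net gauge.
var queueLength = Metrics.CreateGauge(
    "dotnet_threadpool_queue_length",
    "Number of work items currently queued to the thread pool");

// Sample the queue length every 5 seconds; keep a reference to the timer
// so it isn't garbage collected.
var timer = new Timer(
    _ => queueLength.Set(ThreadPool.PendingWorkItemCount),
    state: null,
    dueTime: TimeSpan.Zero,
    period: TimeSpan.FromSeconds(5));
```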

Document metrics exposed

Currently there is no documentation of the metrics exposed; it would be helpful to have a breakdown of the metrics exposed per collector and the label values they expose.

Using with MetricPusher

Hi. Can I use it with MetricPusher? I can't use MetricServer because Prometheus runs in Docker on its own network, and I want to push metrics from localhost when running the application from Visual Studio.
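This should work: the collector only registers metrics with prometheus-net's default registry, and how they are shipped is independent of that. A sketch using prometheus-net's `MetricPusher` against a Pushgateway (the endpoint URL and job name below are placeholders for your environment):

```csharp
using System;
using Prometheus;
using Prometheus.DotNetRuntime;

// Register the runtime metrics with the default registry as usual.
IDisposable collector = DotNetRuntimeStatsBuilder.Default().StartCollecting();

// Periodically push the default registry's metrics to a Pushgateway,
// which Prometheus then scrapes from inside its own network.
var pusher = new MetricPusher(
    endpoint: "http://localhost:9091/metrics",
    job: "my-app");
pusher.Start();
```

Note this requires a Pushgateway (or another push-compatible endpoint) reachable from your machine; Prometheus itself does not accept pushed metrics directly.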

Collect segment information

Currently the library doesn't collect any information about GC memory segments. By doing this, we would be able to calculate:

  • % LOH allocated vs utilized (useful to understand potential LOH fragmentation issues)
  • % SOH allocated vs utilized

See the GCCreateSegment_V1 and GCFreeSegment_V1 events.

[v4] Latest release fails to work with automated binding redirects

In v4 of the package, AssemblyFileVersion is 1.0.0.0. Previous versions had appropriate major.minor from package.

The problem I've encountered is using a dependency that requires v3.4 while I use v4.0. The problem is because my app references an assembly from package v4 that declares AssemblyFileVersion=1.0.0.0 so it generates a redirect of "0.0.0.0-1.0.0.0=>1.0.0.0", and the dependency I use requires DLL with AssemblyFileVersion=3.4.0.0 which doesn't fit into the redirect.

Now, if the v4 had AssemblyFileVersion=4.0.0.0, the redirect .NET generates would be "0.0.0.0-4.0.0.0=>4.0.0.0", which would be used for the 3.4.0.0 requirement.

I hope I've explained the issue enough, since binding redirects belong in hell... ;)

Add dotnet_build_info

Hello,
there seems to be a custom of adding an app_build_info metric to expose version info. Would it be something to add as a default to this package?

Something like this:

var buildversion = Assembly.GetEntryAssembly().GetCustomAttribute<AssemblyFileVersionAttribute>().Version;
var assemblyversion = Assembly.GetEntryAssembly().GetName().Version;
var assemblyname = Assembly.GetEntryAssembly().GetName().Name;
var runtimetargetversion = Assembly.GetEntryAssembly().GetCustomAttribute<TargetFrameworkAttribute>().FrameworkName;
var runtimeversion = RuntimeInformation.FrameworkDescription;
var osversion = RuntimeInformation.OSDescription;

var labels = new string[] { "buildversion", "assemblyversion", "assemblyname", "runtimetargetversion", "runtimeversion", "osversion" };
var labelvalues = new string[] { buildversion, assemblyversion.ToString(), assemblyname, runtimetargetversion, runtimeversion, osversion };
var appinfo = Metrics
                .CreateGauge("dotnet_build_info", "application build information", new GaugeConfiguration
                {
                    LabelNames = labels
                });
// Set the labelled gauge to 1 so the series is actually published.
appinfo.WithLabels(labelvalues).Set(1);

JitStatsCollector throwing errors under load

System.ArgumentOutOfRangeException: Counter value cannot decrease.
Parameter name: increment
   at Prometheus.Counter.Child.Inc(Double increment) in d:\a\1\s\Prometheus.NetStandard\Counter.cs:line 29
   at Prometheus.DotNetRuntime.StatsCollectors.JitStatsCollector.ProcessEvent(EventWrittenEventArgs e) in C:\dev\prometheus-net.DotNetRuntime\src\prometheus-net.DotNetRuntime\StatsCollectors\JitStatsCollector.cs:line 71
   at Prometheus.DotNetRuntime.DotNetEventListener.OnEventWritten(EventWrittenEventArgs eventData) in C:\dev\prometheus-net.DotNetRuntime\src\prometheus-net.DotNetRuntime\DotNetEventListener.cs:line 41

Increasing CPU when using thread pool statistics

This might be related to #6, but I'm not sure as .NET Core 3.1 is used.

We run prometheus-net.DotNetRuntime in an Alpine Linux Docker container with .NET Core 3.1. We noticed CPU usage slowly increasing over time. The situation improves significantly when thread pool stats are not collected (see plot below), but even then there is still a measurable increase in CPU usage. There doesn't seem to be a measurable increase in memory consumption.

Initialization with thread pool stats:

collector = DotNetRuntimeStatsBuilder
                .Customize()
                .WithContentionStats()
                .WithJitStats()
                .WithThreadPoolSchedulingStats()
                .WithThreadPoolStats()
                .WithGcStats()
                .WithExceptionStats()
                .StartCollecting();

Initialization without thread pool stats:

collector = DotNetRuntimeStatsBuilder
                .Customize()
                .WithJitStats()
                .WithGcStats()
                .WithExceptionStats()
                .StartCollecting();

[image: CPU increase over time]

We also did some CPU profiling on the affected system, but unfortunately 99% of the CPU time is spent in unmanaged code.
image

Any ideas how the increase in CPU could be avoided while still collecting metrics? Or any idea how to further investigate this issue?

Future of this project

I haven't put much work into this project over the past few months due to work commitments but there's a few ideas I have for improvements:

  • Improved sampling (aka toggling noisy event sources on/off to avoid the perf overhead of continuously collecting these events)
  • Extracting the core "event timing" logic into a separate library to enable other metric libraries to expose these metrics
  • Expanded set of metrics collected by looking at other event sources (e.g. Sql commands, ASP.NET core, GRPC, etc.)

@sywhang I'd be interested in learning a bit about the roadmap for telemetry in .NET Core if you could spare the time; I'm particularly interested in any future improvements around event collection + filtering and event counters.

Is Fasterflect a required dependency?

Is this something that can be pulled in as code? It doesn't seem like there is much of it that this library uses, and it would be nice to keep the dependency list down.

NullReferenceException thrown when GcStats CaptureLevel is set to Verbose and app is running locally in Debug mode

When we run our application locally in Debug mode, it throws a NullReferenceException at runtime. It only happens when we configure DotNetRuntimeStats for GcStats with CaptureLevel.Verbose. If we set CaptureLevel.Informational, the exception is not thrown.

I couldn't determine the exact source of the exception. Here is the stack trace I currently get:

System.NullReferenceException: 'Object reference not set to an instance of an object.'
This exception was originally thrown at this call stack:
    System.Diagnostics.Tracing.EventPipePayloadDecoder.DecodePayload(ref System.Diagnostics.Tracing.EventSource.EventMetadata, System.ReadOnlySpan<byte>)
    System.Diagnostics.Tracing.NativeRuntimeEventSource.ProcessEvent(uint, uint, System.DateTime, System.Guid, System.Guid, System.ReadOnlySpan<byte>)
    System.Diagnostics.Tracing.EventPipeEventDispatcher.DispatchEventsToEventListeners()
    System.Threading.Tasks.Task.InnerInvoke()
    System.Threading.Tasks.Task..cctor.AnonymousMethod__272_0(object)
    System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, object)
    System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()

image

I created sample app which shows similar behavior. You can check it out here: https://github.com/AnnaGorozia/PrometheusSample

Expose collecting metrics as separate functionality

Hey, @djluck. First of all, a huge thanks for your effort in making such a great library. This is what was missing in the dotnet community.

My question is quite simple: do you have plans to expose the metrics collector functionality as a separate package? It would allow integrating this functionality with various profiling, metrics and performance monitoring tools.
I would be glad to hear your thoughts about it.

The given key '8' was not present in the dictionary exception

Hi
I'm facing an exception thrown by the ThreadPoolMetricsProducer:

System.Collections.Generic.KeyNotFoundException: The given key '8' was not present in the dictionary.
   at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
   at Prometheus.DotNetRuntime.Metrics.Producers.ThreadPoolMetricsProducer.<RegisterMetrics>b__29_4(ThreadPoolAdjustedEvent e)
   at Prometheus.DotNetRuntime.EventListening.DotNetEventListener.OnEventWritten(EventWrittenEventArgs eventData)

This happens in a .NET 6 web API application using prometheus-net.DotNetRuntime version 4.2.3.
The way the collector is initialized:

...
var builder = DotNetRuntimeStatsBuilder.Default();

if (!_options.UseDefaultMetrics)
{
    builder = DotNetRuntimeStatsBuilder.Customize()
        .WithContentionStats(CaptureLevel.Informational)
        .WithGcStats(CaptureLevel.Verbose)
        .WithThreadPoolStats(CaptureLevel.Informational)
        .WithExceptionStats(CaptureLevel.Errors)
        .WithJitStats();
}

builder
    .RecycleCollectorsEvery(_options.RecycleEvery)
    .WithErrorHandler(ex => _logger.LogError(ex, "Unexpected exception occurred in prometheus-net.DotNetRuntime"));

if (_options.UseDebuggingMetrics)
{
    _logger.LogInformation("Using debugging metrics.");
    builder.WithDebuggingMetrics(true);
}

_logger.LogInformation("Starting prometheus-net.DotNetRuntime...");

builder.StartCollecting();
...

Should process_cpu_seconds_total be exported?

Question/Possible Bug

I've noticed that in the example Grafana dashboards, included is a process_cpu_seconds_total metric. This is used with process_cpu_count.

process_cpu_count is working fine for me, but process_cpu_seconds_total is not being published.

However, I can't find this metric anywhere in this repo. So, I'm not actually sure if this is a bug or not. I would have assumed this was just an oversight on the part of the dashboard, and the metric is actually coming from some other source... but process_cpu_count is published, which makes me think that process_cpu_seconds_total should be published as well.

So, should process_cpu_seconds_total be exported? If not, where should we be obtaining that metric for the example Grafana dashboard?

UnsupportedEventParserRuntimeException with v4.2

Hello, I've just updated from version 4.1 to 4.2 and I'm now seeing the exception above.

It's a .NET 6 project and I'm doing DotNetRuntimeStatsBuilder.Customize().WithThreadPoolStats().WithGcStats().StartCollecting().

Reverting fixes the problem, but frankly I am not sure whether all the metrics are really emitted, since the project is in development and nothing is scraping it yet.

Returning object reference error

.WithErrorHandler(ex => _logger.LogError(ex, "Unexpected exception occurred in prometheus-net.DotNetRuntime"));

It always returns an object reference error. Am I missing anything?

Also target .NET Standard 2.1

Seems like all that is needed for this to work is to add it as a target framework in prometheus-net.DotNetRuntime.csproj:

<TargetFrameworks>netcoreapp2.2;netcoreapp3.0;netstandard2.1</TargetFrameworks>

And then add this dependency:

<ItemGroup Condition="'$(TargetFramework)' == 'netstandard2.1'">
  <PackageReference Include="System.Collections.Immutable" Version="1.6.0" />
</ItemGroup>

Would that be possible? I can create a PR for it if you want.

How could we add metrics for custom event counters?

Sometimes we may use other event counters. For example, when using Microsoft.Data.SqlClient, I want to record some metrics from its event counters. Is there a recommended way to implement this?
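The library doesn't expose a hook for arbitrary counter sources, but you can listen to any EventCounter provider yourself with a plain `EventListener` and forward the values into prometheus-net. A sketch under these assumptions: the source name `"Microsoft.Data.SqlClient.EventSource"` is the provider you want, and the `sqlclient_` metric prefix is an illustrative naming choice:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;
using Prometheus;

// Forwards EventCounter values from a named EventSource into prometheus-net gauges.
public sealed class CounterBridge : EventListener
{
    private readonly Dictionary<string, Gauge> _gauges = new();

    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "Microsoft.Data.SqlClient.EventSource")
        {
            // Ask the source to publish its counters every 10 seconds.
            EnableEvents(source, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string> { ["EventCounterIntervalSec"] = "10" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        if (e.EventName != "EventCounters" ||
            e.Payload?[0] is not IDictionary<string, object> payload)
            return;

        var name = (string)payload["Name"];
        // "Mean" is populated for polling counters, "Increment" for incrementing ones.
        var value = payload.TryGetValue("Mean", out var mean) && mean is not null
            ? Convert.ToDouble(mean)
            : Convert.ToDouble(payload["Increment"]);

        if (!_gauges.TryGetValue(name, out var gauge))
            _gauges[name] = gauge = Metrics.CreateGauge(
                "sqlclient_" + name.Replace('-', '_'),
                "Forwarded event counter: " + name);
        gauge.Set(value);
    }
}
```

Keep a reference to the `CounterBridge` instance alive for the application's lifetime, as disposing an `EventListener` stops event delivery.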

Breaking changes in prometheus-net 3.X

It looks like the old way of registering metrics has moved from the Collector class to the Metrics class.
Guidance or updates on how to register these metrics are needed.

There seems to be no way to disable collector recycling

Recycling collectors can apparently crash the .NET 6.0 runtime:

I would therefore like to disable recycling. If I understand it right, the recycling is no longer necessary with .NET 6.0 anyway, right? (Based on comments in dotnet/runtime#43985)

However, it seems there is no option for this! RecycleCollectorsEvery() does not accept a null/zero parameter. I request that this option be added and/or that recycling be disabled by default from 6.0 onwards if it is no longer relevant.

Memory Leak, Increasing CPU usage

This is an awesome library. Thanks for building this.

I dropped this into two .NET Core 2.2 services running in Kubernetes yesterday morning, setting it up to run all of the collectors like this:

DotNetRuntimeStatsBuilder.Customize()
    .WithContentionStats()
    .WithJitStats()
    .WithThreadPoolSchedulingStats()
    .WithThreadPoolStats()
    .WithGcStats()
    .StartCollecting();

Throughout the day I saw memory and CPU usage kept climbing. It looks like there's a memory leak somewhere, and something is causing CPU usage to increase over time.

I already have a metric derived from Process.GetCurrentProcess().WorkingSet64. This is what that metric did yesterday:

image

The green box shows memory usage when I had all of this library's collectors running. This is for one of the two services, and there are three instances of each, which is why there are three lines on the graph. Each line is Process.GetCurrentProcess().WorkingSet64 over time for that instance. Each time the colors change is from a deployment when the pod names change; the query I'm using for that chart treats them as a completely different metric name at that point.

Here's a graph showing the summed rate of Process.GetCurrentProcess().PrivilegedProcessorTime.TotalSeconds and Process.GetCurrentProcess().UserProcessorTime.TotalSeconds for all instances of each of the two services over the past 2 days:

image

Sorry for not including the legend -- it has names I don't want to share. The yellow and green lines on the bottom are privileged processor time. The other two are user processor time. One of them is so spiky because it's running a pretty intense scheduled job every 20 minutes. The other is only responding to HTTP requests.

I turned everything off except the thread pool scheduling stats last night. The most recent parts of the above charts are from only running this:

DotNetRuntimeStatsBuilder.Customize()
    .WithThreadPoolSchedulingStats()
    .StartCollecting();

If there's a problem, then it's not with the thread pool scheduling stats.

I'll add the others back one at a time to try to find where the problem is. I'll report back to this issue with my findings.

Thanks again for this library. I didn't know about the CLR events this library records. I'm thankful you wrote this both to learn about them and also to get insight I didn't know was so readily available.

Add Support for .net5.0

When targeting a .NET 5.0 service, this library seems to not fully support net5.0 targets.
image

My library multi-targets netcoreapp3.1 and net5.0 and has issues with the 5.0 build.

I forked and got your lib building just fine on 5.0 off master, but noticed you had changes/comments in #43. Any issue with a PR adding support for net5.0?

Disposable object exception

Use .NET Core 3.1 WebApi

Preparing steps:

  1. add this code in Configure method (Startup.cs)
        public override void Configure(IApplicationBuilder app, IWebHostEnvironment env)
        {
             DotNetRuntimeStatsBuilder.Default().StartCollecting(); 
        }
  2. add an xUnit project with two or more classes
    namespace foo
    {
        public class FooControllerTests : IClassFixture<WebApplicationFactory<Startup>>
        {
            private readonly WebApplicationFactory<Startup> _factory;

            public FooControllerTests(WebApplicationFactory<Startup> factory)
            {
                _factory = factory;
            }

            [Fact]
            public void TestMethod()
            {
                var client = _factory.CreateClient();
            }
        }

        public class BarControllerTests : IClassFixture<WebApplicationFactory<Startup>>
        {
            private readonly WebApplicationFactory<Startup> _factory;

            public BarControllerTests(WebApplicationFactory<Startup> factory)
            {
                _factory = factory;
            }

            [Fact]
            public void TestMethod()
            {
                var client = _factory.CreateClient();
            }
        }
    }

Execution:
Run the tests in parallel. xUnit will throw the exception: ".NET runtime metrics are already being collected. Dispose() of your previous collector before calling this method again."

Workaround:
I resolved this by using compilation symbols:

          #if !DEBUG
            DotNetRuntimeStatsBuilder.Default().StartCollecting();
          #endif

But I need a cleaner solution to this problem. How can I do this?
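One cleaner approach, since parallel test hosts created by WebApplicationFactory share the same process, is to guard the call so it runs at most once per process. A sketch (the `RuntimeStats` class name is illustrative, not part of the library):

```csharp
using System;
using Prometheus.DotNetRuntime;

public static class RuntimeStats
{
    // Lazy<T> with default thread-safety guarantees the factory runs at most
    // once per process, even when several test hosts start the app in parallel.
    private static readonly Lazy<IDisposable> Collector =
        new(() => DotNetRuntimeStatsBuilder.Default().StartCollecting());

    public static void EnsureStarted() => _ = Collector.Value;
}
```

Then call `RuntimeStats.EnsureStarted()` from `Configure` instead of calling `StartCollecting()` directly; the second and later invocations are no-ops, so the "already being collected" exception never triggers.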

Thread pool scheduling metrics are incomplete

Thread pool scheduling metrics can be incomplete. I suspect this is because a GC can occur between an item being scheduled on the thread pool and it being dequeued and executed (the workId payload identifier is just the address of the work item being queued and dequeued). The upshot is that timings of thread pool scheduling latency can go missing.

I need to spend some time investigating the correct fix; I suspect converting the workId back to a reference may be required.

Getting EventSource error in cpu-usage callback

My program keeps outputting:

EventSource Error: ERROR: Exception during EventCounter cpu-usage metricProvider callback: Attempted to divide by zero.

I only use prometheus-net.DotNetRuntime for CPU monitoring, so I assume it comes from this library. It doesn't print a stack trace or anything, so I don't have any more info.


DotNetRuntimeStatsBuilder
	.Customize()
	.WithContentionStats(CaptureLevel.Informational)
	.WithJitStats()
	.WithThreadPoolStats()
	.WithGcStats(CaptureLevel.Informational)
	.WithExceptionStats()
	.StartCollecting();

.NET 5
Windows 20H2
