petabridge / akkadotnet-healthcheck
Healthchecks for Akka.NET Applications :hospital:
License: Apache License 2.0
Need to validate the following via unit tests:
AkkaPersistenceLivenessProbeProvider from HOCON configuration when starting up an ActorSystem with it configured
AkkaPersistenceLivenessProbe should be able to correctly handle subscriptions in any state.
AkkaPersistenceLivenessProbe should correctly report that Akka.Persistence is available when it is
AkkaPersistenceLivenessProbe should correctly report that Akka.Persistence is NOT available when it isn't able to load at startup.
AkkaPersistenceLivenessProbe should correctly report that Akka.Persistence has become unavailable AFTER initially being available (simulate a future change in dis-connectivity.)
We will likely need to create some custom Akka.Persistence journal and SnapshotStore implementations in order to succeed in testing these - please take a look at some of the tests we have in the
and
LGTM - need to validate the nuget publication locally (or check what the build server produced.) Don't want to publish any sample projects and need to make sure that the correct `README.md` files are included.
Originally posted by @Aaronontheweb in #148 (review)
While analysing our storage account's blob snapshot folder, I saw that I had a lot of Akka.Healthchecks snapshots. In the standup yesterday it was mentioned that the suicideProbe should clean up the journal/snapshot store after its tests. This is definitely not the case:
At this moment this cluster only has 3 nodes, so old recycled nodes (pods) still have snapshots/journal records lingering around.
It would be nice to remove the snapshot/journal records used for the healthprobe after the probe is finished. Of course there is the scenario where a pod crashes during the healthprobe and you still get undeleted journal/snapshot records, but the chance of that happening is really low and I can live with that.
From some conversations on Discord - it seems like this probe might be a bit too aggressive:
Hi @everyone, I have a small question about the Persistence healthchecks. I think they were changed and they are now cleaning up their snapshots https://github.com/petabridge/akkadotnet-healthcheck. That cleanup sometimes fails with a 404. At the same time the seems to fail, unsure if that is because the delete failed or because the creation failed, but that brings down the container running akka.net. Is there a way to add fault tolerance for this? Because if I add fault tolerance to the container healthchecks, all healthchecks will have that extra tolerance, which might not be wanted.
Aaronontheweb — 04/24/2024 8:27 AM
cc @Arkatufus - we just made a bunch of bug fixes to these because they were throwing off false positives at startup @kupo1309
do you have a lot of load at startup or something @kupo1309 ? Or does this probe just fail eventually later
kupo1309 — 04/24/2024 8:43 AM
no this is after days/weeks of running, it seems
so my guess is that it is in fact a transient issue in the azurestorage
we are using 1.5.18
Arkatufus — 04/24/2024 9:13 AM
@kupo1309 you can add a layer of resiliency on top of it, like, it needs to fail twice or 3 times in a row before being killed?
kupo1309 — 04/24/2024 9:17 AM
its 3 by default indeed
i am upping it to 10, but indeed it seems like it failed a few times, then it got disassociated and then it got restarted.
TL;DR; - we might need to have this probe persistently fail several times before we mark the node as unhealthy. Failing at the first sign of trouble seems like it compounds problems that busy systems are having.
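The "fail several times before marking the node unhealthy" idea can be sketched roughly as below. This is a language-neutral illustration in Python, not code from Akka.HealthCheck: the class name `ConsecutiveFailureGate`, its `threshold` parameter, and the default of 3 are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ConsecutiveFailureGate:
    """Reports unhealthy only after `threshold` probe failures in a row.

    Hypothetical helper for illustration; not part of Akka.HealthCheck.
    """

    threshold: int = 3  # assumed value, not a library default
    _failures: int = field(default=0, init=False)

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the health to report."""
        if probe_ok:
            self._failures = 0  # any success resets the streak
        else:
            self._failures += 1
        return self._failures < self.threshold


gate = ConsecutiveFailureGate(threshold=3)
outcomes = (True, False, False, True, False, False, False)
results = [gate.record(ok) for ok in outcomes]
print(results)  # only the final entry is False
```

With a gate like this, two transient failures followed by a success never surface as unhealthy; only the third failure in a row does.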
Add the ability to turn on debug logging for all built-in transports for both liveness and readiness probes.
The logs should also make it clear whether or not it's the liveness OR readiness probe writing to the transport.
Need to implement Recover for LivenessStatus, with a test to make sure it correctly signals both when the cluster is up and when it is no longer reachable.
[INFO][2/7/2023 10:35:01 PM][Thread 0055][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Persistence probe terminated. Recreating...
[INFO][2/7/2023 10:35:11 PM][Thread 0052][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Received recovery status PersistenceLivenessStatus(JournalRecovered=True, SnapshotRecovered=True, JournalPersisted=True, SnapshotSaved=True, Failures=null) from probe.
[INFO][2/7/2023 10:35:21 PM][Thread 0054][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Received recovery status PersistenceLivenessStatus(JournalRecovered=True, SnapshotRecovered=False, JournalPersisted=True, SnapshotSaved=True, Failures=null) from probe.
[INFO][2/7/2023 10:35:31 PM][Thread 0055][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Received recovery status PersistenceLivenessStatus(JournalRecovered=True, SnapshotRecovered=True, JournalPersisted=True, SnapshotSaved=True, Failures=null) from probe.
[INFO][2/7/2023 10:35:41 PM][Thread 0010][akka.tcp://AkkaWebApi@localhost:8081/system/healthcheck-live-persistence] Received recovery status PersistenceLivenessStatus(JournalRecovered=True, SnapshotRecovered=True, JournalPersisted=True, SnapshotSaved=True, Failures=null) from probe.
This needs to go into debug logging or not get logged at all unless there's a problem.
It seems you can only configure a single probe for the readiness (liveness) check. Isn't it possible to run multiple readiness (liveness) checks, like 'cluster readiness', 'persistence journal check', 'my custom check1', ...?
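One way to picture what is being asked for is a composite check that runs several named checks and reports healthy only if all of them pass. The sketch below is a language-neutral illustration in Python; `run_checks` and the lambda checks are hypothetical and do not correspond to any Akka.HealthCheck API.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def run_checks(checks: Dict[str, Check]) -> Tuple[bool, Dict[str, bool]]:
    """Run every named check; overall health is the AND of all results."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a check that throws counts as unhealthy
    return all(results.values()), results


# Stand-in checks named after the examples in the question.
readiness_checks = {
    "cluster readiness": lambda: True,
    "persistence journal check": lambda: False,
    "my custom check1": lambda: True,
}
healthy, detail = run_checks(readiness_checks)
print(healthy, detail)
```

Keeping the per-check results alongside the overall answer lets an operator see which check caused a failure.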
Should log this automatically without any configuration settings - just to let the end-user know in the startup logs that the system is running with one or both of these tools enabled.
WebApiTemplate.App.csproj: [NU1100] Unable to resolve 'Akka.HealthCheck.Hosting.Web (>= 1.0.0)' for 'net7.0'. PackageSourceMapping is enabled, the following source(s) were not considered: nuget.****
Is the issue here how we're targeting ASP.NET?
Need to rename all projects and the NuGet packages accordingly.
I would like to use Akka.NET clustering together with healthchecks. Do I need Akka.HealthCheck.Cluster or is Akka.HealthCheck enough?
Do you have an example of how to check whether the cluster is up and running and use this for monitoring?
Thank you
Should do a dump of all of the built-in HealthCheck settings at launch, so users can troubleshoot when first configuring the probes and transports.
These HCs are poisonous and can prevent cluster formation. Need to be relaxed.
AkkaClusterLivenessProbe - Liveness probe for clustering.
Reports healthy when:
The ActorSystem joined a cluster.
The ActorSystem is connected to a cluster
Reports unhealthy when:
The ActorSystem just started and has not joined a cluster.
The ActorSystem left the cluster.
Rewrite to
ClusterLivenessProbeProvider - Liveness probe for clustering.
Reports healthy when:
Reports unhealthy when:
The ActorSystem is leaving the cluster.
AkkaClusterReadinessProbe - Readiness probe for clustering.
Reports healthy when:
The ActorSystem joined a cluster.
The ActorSystem is connected to a cluster
Reports unhealthy when:
The ActorSystem just started and has not joined a cluster.
All other nodes in the cluster are unreachable.
Rewrite to:
ClusterReadinessProbeProvider - Readiness probe for clustering.
Reports healthy when:
Reports unhealthy when:
All other nodes in the cluster are unreachable.
Given how this is implemented:
We never actually handle the incoming socket requests, thus the liveness probe will eventually fail given enough time. Need to actually handle the socket request and verify it by sending back some trivial piece of data.
Can you make this fix and verify it via a TCP integration test @izavala ?
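The fix described above (accept each incoming connection and send back a trivial piece of data) might look roughly like this. It is a minimal Python sketch of the idea, not the library's transport: the function names, the `b"ok"` payload, and the loopback port selection are assumptions for illustration.

```python
import socket
import threading


def serve_liveness(server: socket.socket, is_alive, stop: threading.Event) -> None:
    """Accept each probe connection, answer it, and close it."""
    server.settimeout(0.2)  # wake up regularly to notice the stop flag
    while not stop.is_set():
        try:
            conn, _ = server.accept()
        except socket.timeout:
            continue
        with conn:
            if is_alive():
                conn.sendall(b"ok")  # trivial payload the prober can verify
            # when not alive, close without replying so the prober reads nothing


def probe(port: int) -> bytes:
    """What an external prober might do: connect and read until EOF."""
    chunks = []
    with socket.create_connection(("127.0.0.1", port), timeout=2) as client:
        while True:
            data = client.recv(16)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen()
port = server.getsockname()[1]
stop = threading.Event()
worker = threading.Thread(
    target=serve_liveness, args=(server, lambda: True, stop), daemon=True
)
worker.start()
reply = probe(port)
stop.set()
worker.join()
server.close()
print(reply)
```

Because every connection is accepted and closed, pending connections no longer pile up in the listen backlog, which is the failure mode the issue describes.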
The SuicideProbe class checks the state of the persistence journal.
It recovers the last event and snapshot from the journal.
After that, it writes a new event and snapshot to the journal
and deletes all old events and snapshots.
The issue is that it already sends a RecoveryStatus back on successful recovery,
without checking whether the new event or snapshot was persisted successfully.
So when the journal succeeds only at recovery (read mode) and not at persisting (write mode),
the RecoveryStatus will still always be successful.
The bottom line is that the write and delete of the new "hit" messages are
not used by the healthcheck itself and only hit one sector of the SSD every 10 seconds.
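The behaviour this report asks for, where the status reflects the write side as well as the read side, can be sketched as follows. This is a hedged Python illustration, not the SuicideProbe implementation: the field names mirror the `PersistenceLivenessStatus` values seen in the log output on this page, while `run_probe_cycle` and its callback parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersistenceLivenessStatus:
    journal_recovered: bool = False
    snapshot_recovered: bool = False
    journal_persisted: bool = False
    snapshot_saved: bool = False

    @property
    def healthy(self) -> bool:
        # Read-side and write-side results must all succeed.
        return all((
            self.journal_recovered,
            self.snapshot_recovered,
            self.journal_persisted,
            self.snapshot_saved,
        ))


def run_probe_cycle(recover, persist, save_snapshot) -> PersistenceLivenessStatus:
    """Run recovery and both writes, then build one status from all results.

    The three callbacks are stand-ins for the real journal operations.
    """
    journal_ok, snapshot_ok = recover()
    persisted = persist()
    saved = save_snapshot()
    return PersistenceLivenessStatus(journal_ok, snapshot_ok, persisted, saved)


# A journal that can be read but not written to.
read_only = run_probe_cycle(lambda: (True, True), lambda: False, lambda: False)
print(read_only.healthy)  # False
```

The key point is that the status is only built after the write and snapshot results are known, so a read-only journal is reported as unhealthy.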
I'm adding the latest Akka.HealthCheck.Cluster package.
Although the README says the probe will return unhealthy when the ActorSystem is up but has not yet joined the cluster, I'm getting a Healthy status for both liveness and readiness, with just a message saying that the node has not yet joined.
"akka-live-cluster": {
  "data": {
    "message": "not yet joined cluster"
  },
  "description": "Akka.NET cluster is alive",
  "duration": "00:00:00.0000842",
  "status": "Healthy",
  "tags": [
    "akka",
    "live",
    "cluster"
  ]
},
"akka-ready-cluster": {
  "data": {
    "message": "not yet joined cluster"
  },
  "description": "Akka.NET cluster is ready",
  "duration": "00:00:00.0000657",
  "status": "Healthy",
  "tags": [
    "akka",
    "ready",
    "cluster"
  ]
}
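For comparison, the outcome the reporter expected from the README can be sketched as a status that follows the membership state. This is a hypothetical Python illustration of that expectation, not how Akka.HealthCheck.Cluster builds its response; `cluster_liveness_entry` is an invented name and the entry shape is copied from the output above.

```python
def cluster_liveness_entry(joined: bool) -> dict:
    """Build a health entry whose status follows the membership state."""
    return {
        # The message is only attached while the node has not joined.
        "data": {} if joined else {"message": "not yet joined cluster"},
        "description": "Akka.NET cluster is alive",
        "status": "Healthy" if joined else "Unhealthy",
        "tags": ["akka", "live", "cluster"],
    }


not_joined = cluster_liveness_entry(joined=False)
print(not_joined["status"])  # Unhealthy
```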