simplystaking / panic_cosmos

🚨 PANIC for Cosmos

License: GNU General Public License v3.0

Python 99.96% Shell 0.04%

panic_cosmos's Introduction

PANIC for Cosmos

⚠️ This Repo has been archived. Cosmos node monitoring and alerting has been integrated into this product, which we now actively maintain.

PANIC for Cosmos is a lightweight yet powerful open source monitoring and alerting solution for Cosmos nodes by Simply VC. It is compatible with any chain built using the Cosmos SDK. The tool was built with user friendliness in mind, without excluding cool and useful features like phone calls for major alerts and Telegram commands for increased control over your alerter.

The alerter's focus on a modular design means that it is beginner friendly but also developer friendly. It allows the user to decide which components of the alerter to set up while making it easy for developers to add new features. PANIC also offers two levels of configurability, user and internal, allowing more experienced users to fine-tune the alerter to their preference.

We are sure that PANIC will be beneficial for node operators in the Cosmos community and we look forward to feedback. Read on if you are interested in the design of the alerter, wish to try it out, would like to support and contribute to this open source project, or just want to check out upcoming features.

Design and Features

Click here if you want to dive into the design and feature set of PANIC

Ready, Set, Alert!

Click here if you are ready to try out PANIC on your Cosmos nodes!

Support and Contribution

On top of the additional work that we will put in ourselves to improve and maintain the tool, any support from the community, whether through development work or by delegating to the Simply VC validator, will be greatly appreciated.

Who We Are

Simply VC runs highly reliable and secure infrastructure in our own datacentre in Malta, built with the aim of supporting the growth of the blockchain ecosystem. Read more about us on our website and Twitter:

Simply VC website: https://simply-vc.com.mt/
Simply VC Twitter: https://twitter.com/Simply_VC

panic_cosmos's Issues

HTTPS with self signed certificates

Hello,

I tried to connect PANIC to an RPC endpoint that is behind an NGINX server with SSL enabled and a self-signed certificate.

Setup: 3 servers
PANIC ------(HTTPS)----NGINX:443----(HTTP)-----COSMOS-RPC:26657

Error:

Sep 26 23:51:44 pipenv[23856]: Trying to connect to https://xxxxxxxxxxxxxxxxx:8443/node1/status
Sep 26 23:51:44 pipenv[23856]: Failed to connect to cosmos-node at https://xxxxxxxxxxxxxxxxx:8443/node1
Sep 26 23:51:44 pipenv[23856]: PANIC MAJOR - Node cosmos-node was not accessible during PANIC startup. cosmos-node will NOT be monitored until it is accessible and PANIC restarted afterwards. Some features of PANIC might be affected.

The CA certificate is installed on the machine where PANIC is run.
A curl https://xxxxxxxxxxxxxxxxx:8443/node1/status works fine, no errors prompted.

Then I simply swapped https for http in the URL above: same setup, just HTTP instead of HTTPS and, of course, a different port (80), and it worked.

Therefore I guess the self-signed certificate is to blame.

Any suggestions?
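One possible explanation, offered as an assumption rather than a confirmed diagnosis: curl trusts the system CA store, whereas a Python HTTP client may verify against its own bundle. The `requests` library, for example, defaults to the `certifi` bundle and honours the `REQUESTS_CA_BUNDLE` environment variable. The standard-library sketch below shows how an extra CA file can be trusted explicitly. `build_tls_context` and `fetch_status` are hypothetical helpers, not part of PANIC.

```python
import json
import ssl
import urllib.request
from typing import Optional


def build_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Return a verifying TLS context that also trusts ca_file, if given."""
    context = ssl.create_default_context()  # system CAs, hostname checking on
    if ca_file is not None:
        # Add the private/self-signed CA on top of the defaults.
        context.load_verify_locations(cafile=ca_file)
    return context


def fetch_status(url: str, ca_file: Optional[str] = None,
                 timeout: float = 10.0) -> dict:
    """GET a JSON document over HTTPS using the context built above."""
    context = build_tls_context(ca_file)
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would look like `fetch_status("https://rpc.example:8443/node1/status", ca_file="/path/to/ca.pem")`, where both the host and the CA path are placeholders.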

First run impression - Documentation

Hi,

I just installed PANIC and tested some of its alerts. I find it useful, with great potential, especially once it offers more granular configuration: thresholds, switching specific alerts on/off, maybe customizing alerts, and so on.

Things that I had to do outside the documentation in order to make it work:

Make sure you are running the correct Python version. If you have several Python binaries installed, run the correct one. Example:
pipenv run python3.6 run_setup.py instead of pipenv run python run_setup.py
I had to manually install the following libraries, otherwise the setup wouldn't work:

pip install twilio
pip install python-telegram-bot  ## Not the library named simply "telegram" !
pip install redis
pip install python-dateutil

Timeout for last height checked keys

A Redis timeout should be added for last height checked keys so that statuses in Telegram related to these keys do not stick around forever if, for example, the chain name changes (cosmoshub-2 to cosmoshub-3). This improvement might also apply to other keys.
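A minimal sketch of the idea, assuming the client behaves like redis-py's `redis.Redis`, whose `set` method accepts an `ex` expiry. The key layout and the seven-day value are illustrative assumptions, not PANIC's actual settings.

```python
from datetime import timedelta

# Assumed expiry; the issue does not specify a value.
LAST_HEIGHT_KEY_TTL = timedelta(days=7)


def last_height_key(chain_name: str, node_name: str) -> str:
    """Build a hypothetical key name; PANIC's real key layout may differ."""
    return "{}:{}:last_height_checked".format(chain_name, node_name)


def store_last_height(client, chain_name: str, node_name: str, height: int) -> None:
    """Store the height with an expiry so stale keys disappear on their own.

    `client` is expected to expose redis-py's `set(name, value, ex=...)`,
    where `ex` may be a number of seconds or a timedelta.
    """
    client.set(last_height_key(chain_name, node_name), height,
               ex=LAST_HEIGHT_KEY_TTL)
```

Because every write refreshes the expiry, keys for an active chain stay alive, while keys for a retired chain name lapse after the timeout.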

Network Monitor Crashes If Chain Has Not Started

The network monitor crashes if the chain of the nodes being monitored has not started. This can be fixed by checking for the presence of the "last_block_height" value before trying to obtain it, or using the "last_block_height" value under NODE:26657/status/, which seems to be available even before the chain starts.

The error message is:
Network monitor (NETWORK_MONITOR_NAME) terminated due to exception: 'last_block_height'
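The first suggested fix (checking for the value before reading it) could be sketched as follows. The nesting `result -> response -> last_block_height` is an assumed response shape, and `extract_last_block_height` is a hypothetical helper.

```python
from typing import Optional


def extract_last_block_height(payload: dict) -> Optional[int]:
    """Return the last block height, or None if the chain has not started.

    The nesting used here (result -> response -> last_block_height) is an
    assumption; adjust it to the endpoint that is actually queried.
    """
    response = payload.get("result", {}).get("response", {})
    raw_height = response.get("last_block_height")
    if raw_height is None:
        return None  # field absent: treat as "no blocks yet" instead of crashing
    return int(raw_height)
```

Returning `None` lets the caller skip the round or raise a dedicated alert rather than terminate on a `KeyError`.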

Network Monitor Should Not Pick Syncing Node

Network monitor should select nodes more intelligently and avoid picking nodes that are not fully operational, such as those that are syncing.
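One way to express this selection rule, as a sketch: skip any candidate whose status reports that it is still catching up. The input format and `pick_data_source` are assumptions made for illustration. The flag is assumed to sit at `result -> sync_info -> catching_up` in the parsed `/status` response.

```python
from typing import Iterable, Optional, Tuple


def pick_data_source(candidates: Iterable[Tuple[str, dict]]) -> Optional[str]:
    """Return the name of the first node that reports it is not catching up.

    Each candidate is (node_name, status), where `status` is assumed to be
    the parsed JSON returned by the node's /status endpoint.
    """
    for node_name, status in candidates:
        sync_info = status.get("result", {}).get("sync_info", {})
        # Treat a missing flag as "still syncing" to stay on the safe side.
        if sync_info.get("catching_up", True) is False:
            return node_name
    return None
```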

Errors During Alerting Means Possibly Skipped Alerts

PANIC should have a mechanism to keep track of the most recent alerts that were not transmitted due to errors (up to some time or amount threshold). Errors in these scenarios are typically of a networking nature, such as when the alerter loses its internet connection and cannot transmit alerts, but can also be unexpected situations that cause the alerter to malfunction.

It will have to be decided whether to retransmit the alerts when possible, or just let the node operator know that some alerts were not transmitted. The former may confuse the operator if it is not made clear that these alerts are not about the present state of the alerter. A compromise between the two approaches will most likely be implemented.
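A sketch of the "time or amount threshold" idea: a buffer that keeps at most a fixed number of failed alerts and discards those that are too old when drained. The class name and default limits are assumptions, not a description of what PANIC implements.

```python
import time
from collections import deque
from typing import Callable, List


class UnsentAlertBuffer:
    """Remember alerts that failed to send, bounded by count and by age."""

    def __init__(self, max_alerts: int = 50, max_age_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._alerts = deque(maxlen=max_alerts)  # oldest entries drop first
        self._max_age = max_age_seconds
        self._clock = clock

    def record_failure(self, message: str) -> None:
        """Store an alert that could not be transmitted, with its timestamp."""
        self._alerts.append((self._clock(), message))

    def drain(self) -> List[str]:
        """Return buffered alerts that are still recent enough, then clear."""
        cutoff = self._clock() - self._max_age
        fresh = [message for timestamp, message in self._alerts
                 if timestamp >= cutoff]
        self._alerts.clear()
        return fresh
```

Once connectivity returns, the drained messages could either be resent with a clear "delayed" marker or summarised as a count, which matches the compromise described above.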

Bug with /unsnooze and /unmute commands when Redis is offline

I have recently encountered a minor bug when using PANIC commands in Telegram. When snoozing Twilio calls and Redis stops running, the /unsnooze command returns a "Twilio calls have not been snoozed" message. I think it should first check whether Redis is running. The same issue happened with the periodic alive reminder.
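The suggested check could look like the sketch below, which pings Redis before reading the snooze state. The key name and the first and third reply strings are hypothetical. `client` is assumed to offer redis-py's `ping`, `exists` and `delete` methods.

```python
REDIS_DOWN_REPLY = "Redis is not reachable, so the snooze state could not be checked."
UNSNOOZED_REPLY = "Twilio calls have been unsnoozed."
NOT_SNOOZED_REPLY = "Twilio calls have not been snoozed."


def unsnooze_reply(client, snooze_key: str = "twilio_snooze") -> str:
    """Decide what to answer to /unsnooze, checking Redis availability first."""
    try:
        client.ping()
    except Exception:
        # With redis-py, redis.exceptions.RedisError would be the narrower catch.
        return REDIS_DOWN_REPLY
    if client.exists(snooze_key):
        client.delete(snooze_key)
        return UNSNOOZED_REPLY
    return NOT_SNOOZED_REPLY
```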

Muting and snoozing when reminder and calls disabled

Currently, the Telegram commands allow the user to mute and snooze even when Twilio calls and the periodic alive reminder are not enabled. Telegram should let the user know that muting/snoozing does not make sense if the corresponding feature is not enabled.

Improve Behaviour During Network Upgrade

When upgrading a network, the height goes back to zero. PANIC should be more aware of this fact and should not require the operator to restart the tool and clear Redis data.

'Precommits' Changed to 'Signatures' in Tendermint >v0.33

PANIC does not currently work for Cosmos SDK chains that use Tendermint >v0.33, since the 'precommits' field returned when querying a specific block was renamed to 'signatures'.

Thanks to @kalpatech-team for noticing this.
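A version-tolerant lookup might be sketched like this, trying the newer field name first and falling back to the older one. The nesting `result -> block -> last_commit` is an assumed response layout, and `extract_commit_votes` is a hypothetical helper.

```python
def extract_commit_votes(block_response: dict) -> list:
    """Return the commit votes from a parsed block query response.

    Assumed layout: result -> block -> last_commit, holding either
    'signatures' (newer Tendermint) or 'precommits' (older Tendermint).
    """
    last_commit = (block_response.get("result", {})
                   .get("block", {})
                   .get("last_commit", {}))
    if "signatures" in last_commit:
        return last_commit["signatures"]
    return last_commit.get("precommits", [])
```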

Twilio and Periodic Alive Reminder Default Snooze/Mute Hours

For user-friendliness, calling /snooze or /mute without an hour value in Telegram should still snooze Twilio or mute the periodic alive reminder, respectively, for a default amount of hours predefined in the internal config.
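As a sketch, the argument handling could fall back to a configured default when no value is supplied. `DEFAULT_SNOOZE_HOURS` stands in for the internal-config value and is an assumed number. `parse_snooze_hours` is a hypothetical helper.

```python
from typing import List

# Assumed default; the issue proposes reading it from the internal config.
DEFAULT_SNOOZE_HOURS = 1.0


def parse_snooze_hours(args: List[str],
                       default_hours: float = DEFAULT_SNOOZE_HOURS) -> float:
    """Return the hours given after the command, or the default if none."""
    if not args:
        return default_hours
    hours = float(args[0])  # raises ValueError for non-numeric input
    if hours <= 0:
        raise ValueError("hours must be positive")
    return hours
```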

SMTP Authentication

I could not find a way to enable SMTP Authentication. Is it supported?

It would be nice to have this feature available. Many hosting providers require SMTP authentication for sending email through their servers. This way it would be easier to set up email alerts.
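For reference, authenticated sending with Python's standard `smtplib` generally looks like the sketch below. It does not describe PANIC's email channel. The helper names are hypothetical, and the server is assumed to support STARTTLS.

```python
import smtplib
from email.message import EmailMessage
from typing import Optional


def build_alert_email(sender: str, recipient: str, subject: str,
                      body: str) -> EmailMessage:
    """Assemble a plain-text alert email."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_authenticated(smtp: smtplib.SMTP, message: EmailMessage,
                       username: Optional[str] = None,
                       password: Optional[str] = None) -> None:
    """Send over an open SMTP connection, logging in first if credentials are given."""
    smtp.starttls()  # upgrade to TLS before any credentials are sent
    if username is not None and password is not None:
        smtp.login(username, password)
    smtp.send_message(message)
```

Usage would be along the lines of `with smtplib.SMTP("smtp.example.com", 587) as smtp: send_authenticated(smtp, message, "user", "secret")`, where the host and credentials are placeholders.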

Clear Current Config Without Reconfiguring

Users currently cannot clear the current config without being forced to reconfigure, when running the setup script. For instance, if the alerter is set up to use Redis and the user wishes to disable this interaction, the setup script does not allow the user to clear the Redis-related configurations without being asked for a new Redis configuration.

The repos setup does not force repo name uniqueness

In the current implementation, the repos setup does not force the user to enter unique repo names.

This should not be allowed since a user may by mistake give two distinct repos the same name, and as a result the GitHub monitors would not work as expected.
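A uniqueness check for the setup step could be as small as the sketch below. Comparing names case-insensitively and ignoring surrounding whitespace are assumptions. `duplicate_names` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, List


def duplicate_names(names: Iterable[str]) -> List[str]:
    """Return the names that occur more than once (normalised, sorted)."""
    counts = Counter(name.strip().lower() for name in names)
    return sorted(name for name, count in counts.items() if count > 1)
```

The setup script could reject the new entry, or ask again, whenever this list is non-empty.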

Per node alert rules - how?

Hello, considering the following setup:

one mainnet validator behind two sentry nodes
one testnet validator

using 4 node declarations:

[node_001]
node_name = mainnet-validator
node_rpc_url = http://mainnet-validator:26657
node_is_validator = true
include_in_node_monitor = true
include_in_network_monitor = true

[node_002]
node_name = mainnet-sentry1
node_rpc_url = http://mainnet-sentry1:26657
node_is_validator = false
include_in_node_monitor = false
include_in_network_monitor = true

[node_003]
node_name = mainnet-sentry2
node_rpc_url = http://mainnet-sentry2:26657
node_is_validator = false
include_in_node_monitor = false
include_in_network_monitor = true

[node_004]
node_name = testnet-validator
node_rpc_url = http://testnet-validator:26657
node_is_validator = true
include_in_node_monitor = false
include_in_network_monitor = false

How can we set up the alerts in such a way that we only receive Twilio alerts for 001 and Telegram alerts for 002/003/004?

The idea is to not get paged unless something really serious is happening (for which the above setup works), but also to not miss alerts on the sentries and testnet on lower-priority channels, like Telegram.

Periodic alive reminder crashes if Redis not set up

The periodic alive reminder crashes if Redis is not set up because it does not check whether or not Redis is enabled. When Redis is not enabled, the Redis object passed around is None. When the periodic alive reminder uses the None object to check whether or not alive messages are muted, it crashes.
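The missing guard amounts to a `None` check before the Redis object is touched, sketched below. The key name and function are hypothetical. `exists` is the redis-py method of that name.

```python
def alive_reminder_is_muted(redis_client,
                            mute_key: str = "alive_reminder_mute") -> bool:
    """Report whether alive reminders are muted; never muted without Redis."""
    if redis_client is None:
        # Redis is disabled, so no mute state can have been stored.
        return False
    return bool(redis_client.exists(mute_key))
```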

Should have ability to send email to multiple addresses

The email channel should accept multiple recipients, similar to how the twilio channel accepts multiple phone numbers. This involves:

Updating the setup process to accept multiple recipient email addresses.
Updating the config parser to split the semicolon-separated input into a list of addresses.
Updating the email channel to accept and send to multiple email addresses, taking care to catch any exception when sending, so that if one send fails, the next is still attempted.
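The parsing and sending steps above can be sketched as follows. `send_one` stands for whatever function delivers a single email. It and the two helper names are assumptions for illustration.

```python
from typing import Callable, Dict, List


def parse_recipients(raw: str) -> List[str]:
    """Split a semicolon-separated string into a list of non-empty addresses."""
    return [address.strip() for address in raw.split(";") if address.strip()]


def send_to_all(recipients: List[str],
                send_one: Callable[[str], None]) -> Dict[str, Exception]:
    """Attempt every recipient; return the failures keyed by address."""
    failures = {}
    for address in recipients:
        try:
            send_one(address)
        except Exception as error:
            # Keep going so that one failed send does not block the rest.
            failures[address] = error
    return failures
```

Returning the failures, rather than raising on the first one, leaves it to the caller to log or alert about the addresses that could not be reached.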

Issue in the current architecture

As far as I can see from https://github.com/SimplyVC/panic/blob/master/doc/DESIGN_AND_FEATURES.md, and after testing the implementation in our current stack, nothing guarantees that the validator is healthy and still running while the node monitor is receiving its status.

The node monitor deals with exactly one node, such that multiple node monitors are started up if you set up the alerter with multiple nodes. In a typical monitoring round, the node monitor:

1. Checks if the node is reachable from [RPC_URL]
2. Gets the node's status from [RPC_URL]/status
   - Gets and stores the voting power
   - Gets and stores the catching-up status
3. Gets the node's net info from [RPC_URL]/net_info
   - Gets and stores the number of peers
4. Saves its state and the node's state
5. Sleeps until the next monitoring round

At each step, the network monitor goes through the data sources and picks the first full node that responds ([RPC_URL]/health). Having additional full nodes increases data source redundancy.

The issue is that the current design doesn't check whether the information received is actually being sent by the Cosmos validator. It could be cached data, the result of wrong RPC calls, a node stuck in a loop, etc.

The node monitor should at least compare the fetched validator state with the previous state. For example, if block number < previous block number, or UTC time < previous UTC time, it should trigger an error. Ideally, the information received by the node monitor should be signed by the validator and then verified by the node monitor, so that the node monitor can always be assured that the validator is running fine.
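The simpler of the two proposals, comparing each reading with the previous one, could be sketched like this. `NodeSnapshot` and `find_regressions` are hypothetical names, and how the height and block time are obtained is left out.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class NodeSnapshot:
    """One reading of a node's reported height and latest block time."""
    height: int
    block_time: datetime


def find_regressions(previous: NodeSnapshot, current: NodeSnapshot) -> List[str]:
    """Describe every way in which the new reading went backwards."""
    problems = []
    if current.height < previous.height:
        problems.append("height went backwards: {} -> {}".format(
            previous.height, current.height))
    if current.block_time < previous.block_time:
        problems.append("block time went backwards")
    return problems
```

A check like this would also fire when a network upgrade resets the height, so a real implementation would need to treat a change of chain as a fresh start.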

Multiple printing of alerts in alert logger.

I am noticing that in the alerts logger the same alert is printed twice. This is due to the fact that when a logger has been already created, another handler is added to the same logger.
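The usual guard against this in Python's `logging` module is to attach a handler only if the logger has none yet, since `logging.getLogger` returns the same object for a given name. A minimal sketch, with a hypothetical function name and an assumed log format:

```python
import logging


def get_alert_logger(name: str, log_path: str) -> logging.Logger:
    """Return a file-backed logger, attaching its handler only once."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # getLogger returns the same object for the same name, so adding a
    # handler on every call would make each record appear multiple times.
    if not logger.handlers:
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
        logger.addHandler(handler)
    return logger
```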

Update links in internal_config.ini

From what I can observe, some links in the internal_config.ini need to be updated because they still point to cosmos hub 2 data.

Incorrect Total Number of Missing Validators Outputted on Missed Block Alert

Cannot run tests without first creating config files.

The included unit tests cannot be executed before creating the three config files (main/nodes/repos) required for normal operation of PANIC. These however should not be required as the testing code provides its own three config files.

A solution to this would probably have to do away with the internal_parsed and user_parsed files, or at least move the config variables within to some class, so that the config variables are not created as soon as the files are loaded.

Can not install deps in docker container

I am trying to install this project in Docker and I have a problem installing the dependencies.

The Dockerfile is:

FROM python:3.5.2-alpine

WORKDIR /src

RUN apk add --no-cache  git

RUN git clone https://github.com/SimplyVC/panic_cosmos.git /src && git checkout $MONITORING_TAG

RUN pip install pipenv
RUN apk add --no-cache build-base libffi-dev
RUN mkdir -p /root/.local/share/virtualenv/wheel/3.5/embed/1
RUN pipenv sync

MONITORING_TAG=v1.1.2

and in the log, I see an error in the last step:

Installing dependencies from Pipfile.lock (bcf234)...
An error occurred while installing configparser==5.0.0 --hash=sha256:cffc044844040c7ce04e9acd1838b5f2e5fa3170182f6fda4d2ea8b0099dbadd --hash=sha256:2ca44140ee259b5e3d8aaf47c79c36a7ab0d5e94d70bd4105c03ede7a20ea5a1! Will try again.
An error occurred while installing cryptography==3.0; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4' --hash=sha256:384d7c681b1ab904fff3400a6909261cae1d0939cc483a68bdedab282fb89a07 --hash=sha256:8713ddb888119b0d2a1462357d5946b8911be01ddbf31451e1d07eaa5077a261 --hash=sha256:124af7255ffc8e964d9ff26971b3a6153e1a8a220b9a685dc407976ecb27a06a --hash=sha256:8ecef21ac982aa78309bb6f092d1677812927e8b5ef204a10c326fc29f1367e2 --hash=sha256:8ecf9400d0893836ff41b6f977a33972145a855b6efeb605b49ee273c5e6469f --hash=sha256:8e924dbc025206e97756e8903039662aa58aa9ba357d8e1d8fc29e3092322053 --hash=sha256:ce82cc06588e5cbc2a7df3c8a9c778f2cb722f56835a23a68b5a7264726bb00c --hash=sha256:45741f5499150593178fc98d2c1a9c6722df88b99c821ad6ae298eff0ba1ae71 --hash=sha256:dea0ba7fe6f9461d244679efa968d215ea1f989b9c1957d7f10c21e5c7c09ad6 --hash=sha256:4d355f2aee4a29063c10164b032d9fa8a82e2c30768737a2fd56d256146ad559 --hash=sha256:0cbfed8ea74631fe4de00630f4bb592dad564d57f73150d6f6796a24e76c76cd --hash=sha256:9367d00e14dee8d02134c6c9524bb4bd39d4c162456343d07191e2a0b5ec8b3b --hash=sha256:bec7568c6970b865f2bcebbe84d547c52bb2abadf74cefce396ba07571109c67 --hash=sha256:0c608ff4d4adad9e39b5057de43657515c7da1ccb1807c3a27d4cf31fc923b4b --hash=sha256:51e40123083d2f946794f9fe4adeeee2922b581fa3602128ce85ff813d85b81f --hash=sha256:bea0b0468f89cdea625bb3f692cd7a4222d80a6bdafd6fb923963f2b9da0e15f --hash=sha256:4b9303507254ccb1181d1803a2080a798910ba89b1a3c9f53639885c90f7a756 --hash=sha256:ab49edd5bea8d8b39a44b3db618e4783ef84c19c8b47286bf05dfdb3efb01c83 --hash=sha256:a09fd9c1cca9a46b6ad4bea0a1f86ab1de3c0c932364dbcf9a6c2a5eeb44fa77! Will try again.
Installing initially failed dependencies...
[InstallError]:   File "/usr/local/lib/python3.5/site-packages/pipenv/cli/command.py", line 696, in sync
[InstallError]:       system=state.system
[InstallError]:   File "/usr/local/lib/python3.5/site-packages/pipenv/core.py", line 2892, in do_sync
[InstallError]:       system=system,
[InstallError]:   File "/usr/local/lib/python3.5/site-packages/pipenv/core.py", line 1312, in do_init
[InstallError]:       pypi_mirror=pypi_mirror,
[InstallError]:   File "/usr/local/lib/python3.5/site-packages/pipenv/core.py", line 900, in do_install_dependencies
[InstallError]:       retry_list, procs, failed_deps_queue, requirements_dir, **install_kwargs
[InstallError]:   File "/usr/local/lib/python3.5/site-packages/pipenv/core.py", line 796, in batch_install
[InstallError]:       _cleanup_procs(procs, failed_deps_queue, retry=retry)
[InstallError]:   File "/usr/local/lib/python3.5/site-packages/pipenv/core.py", line 703, in _cleanup_procs
[InstallError]:       raise exceptions.InstallError(c.dep.name, extra=err_lines)
[pipenv.exceptions.InstallError]: DEPRECATION: Python 3.5 reached the end of its life on September 13th, 2020. Please upgrade your Python as Python 3.5 is no longer maintained. pip 21.0 will drop support for Python 3.5 in January 2021. pip 21.0 will remove support for this functionality.
[pipenv.exceptions.InstallError]: ERROR: Could not find a version that satisfies the requirement configparser==5.0.0
[pipenv.exceptions.InstallError]: ERROR: No matching distribution found for configparser==5.0.0
ERROR: Couldn't install package: configparser
 Package installation failed...

Alerter Should Start Even With Missing Nodes/Repos

Currently, if at least one node or repo is not accessible, the alerter does not start at all. This should be changed so that it does start. This means that PANIC will not have all necessary data from the missing object, and thus some features might be affected. This should be made clear to the operator via alerts.

Network monitor alert not specific enough

Network monitor's "...could not find a live full node..." alert should contain the name of the network monitor, for when multiple network monitors are running. This requires:

Adding a network monitor name argument to CouldNotFindLiveFullNodeAlert and updating the alert text to include the name.
Updating the alert's test to reflect the new alert text.

Stargate update causes exception on node restart

Restarting a Cosmos sentry used for network monitoring raises an exception in PANIC:
Network monitor (cosmoshub-4) terminated due to exception: Expecting value: line 1 column 1 (char 0)
This behavior didn't exist on cosmoshub-3.
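"Expecting value: line 1 column 1 (char 0)" is the message Python's `json` module produces when asked to parse an empty or non-JSON body. That would be consistent with the restarting node briefly returning something other than JSON, although this is an assumption. A defensive parse could be sketched as follows. `parse_rpc_body` is a hypothetical helper.

```python
import json
from typing import Optional


def parse_rpc_body(body: str) -> Optional[dict]:
    """Parse an RPC response body, returning None if it is not a JSON object."""
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return None  # empty or non-JSON body, e.g. while the node restarts
    return parsed if isinstance(parsed, dict) else None
```

With a `None` result the monitor could retry on the next round instead of terminating.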

pipenv set up

After some issues with my pipenv setup, I found a couple of things that made it work. I thought I could leave them here; if you want me to add them somewhere else, let me know.

```
Pipfile: /tmp/Pipfile
Using /usr/bin/python3.8 (3.8.2) to create virtualenv…
⠙ Creating virtual environment...ModuleNotFoundError: No module named 'virtualenv.seed.via_app_data'

✘ Failed creating virtual environment

[pipenv.exceptions.VirtualenvCreationException]:
Failed to create virtual environment.
```

I uninstalled pipenv:

```
$ pip3 uninstall pipenv
Found existing installation: pipenv 2020.6.2
Uninstalling pipenv-2020.6.2:
  Would remove:
    /home/markhneedham/.local/bin/pipenv
    /home/markhneedham/.local/bin/pipenv-resolver
    /home/markhneedham/.local/lib/python3.8/site-packages/pipenv-2020.6.2.dist-info/*
    /home/markhneedham/.local/lib/python3.8/site-packages/pipenv/*
Proceed (y/n)? y
  Successfully uninstalled pipenv-2020.6.2
```

And then thought I should check if there was anything left in the ~/.local/bin directory:

```
$ ls -alh
```

virtualenv was still there! I thought it would have been removed when I uninstalled pipenv, but perhaps it was installed separately when I installed something else, not sure.

Anyway, I got rid of virtualenv:

```
$ pip3 uninstall virtualenv
Found existing installation: virtualenv 20.0.30
Uninstalling virtualenv-20.0.30:
  Would remove:
    /home/markhneedham/.local/bin/virtualenv
    /home/markhneedham/.local/lib/python3.8/site-packages/virtualenv-20.0.30.dist-info/*
    /home/markhneedham/.local/lib/python3.8/site-packages/virtualenv/*
Proceed (y/n)? y
  Successfully uninstalled virtualenv-20.0.30
```

And now we’ll install pipenv again:

```
$ pip3 install pipenv
```

Then I could create a virtual environment.

Support for OpsGenie & PagerDuty

More of a question/feature request. Have tools like OpsGenie and PagerDuty been explored as options for alerting? Many operators may use these tools to integrate with existing tooling such as Alertmanager. It would be nice to have alerts in a single tool/resource :)

'...has been inaccessible for...' message ignores days

If a node is not accessible for more than 24 hours, the time duration in alert messages restarts from hour zero and does not show the number of days. To fix this, rather than adding a day count to the alert message, the days can be converted to hours.
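A common cause of this symptom in Python, which is an assumption about PANIC's code, is reading `timedelta.seconds`, which excludes whole days, instead of `timedelta.total_seconds()`. A sketch of formatting that keeps counting hours past 24, with a hypothetical function name:

```python
from datetime import timedelta


def format_downtime(duration: timedelta) -> str:
    """Format a duration as H:MM:SS, with hours that keep counting past 24."""
    total_seconds = int(duration.total_seconds())  # includes whole days
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return "{}:{:02d}:{:02d}".format(hours, minutes, seconds)
```

For example, a downtime of 1 day, 2 hours, 3 minutes and 4 seconds is rendered as `26:03:04`.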