zalando / pgobserver

A battle-tested, flexible & comprehensive monitoring solution for your PostgreSQL databases

Home Page: https://zalando.github.io/PGObserver/

License: Other

Shell 1.40% Python 33.33% PLpgSQL 24.98% JavaScript 1.01% CSS 2.31% Makefile 0.04% HTML 13.41% Java 21.12% Dockerfile 0.42% TSQL 1.98%
monitoring data-infrastructure

pgobserver's Introduction

PGObserver

PGObserver is a battle-tested monitoring solution for your PostgreSQL databases. It covers almost all the metrics provided by the database engine's internal statistics collector, and works out of the box with all PostgreSQL versions (beginning with 9.0) as well as AWS RDS. You don’t have to install any non-standard, server-side database extensions to take advantage of its core functionality, nor do you need to register any privileged users.

Monitored Metrics Include:

  • Stored procedure data: number of calls, run time per procedure, self time per procedure
  • All executed statements data: query runtimes, call counts, time spent on IO
    • based on the pg_stat_statements module, which must be enabled on the DB
  • Table IO statistics: number of sequential and index scans, number of inserts, (hot) updates, deletes, table and index size, heap hits vs disk hits
  • General database indicators: number of backends, exceptions, deadlocks, temporary files written
  • Schema usage: procedure calls, inserts/updates/deletes (IUD)
  • CPU load: needs a plpythonu stored procedure from the sql/data_collection_helpers folder (plpythonu isn't available on RDS)
  • WAL (XLOG) volumes
  • Index usage
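
For illustration, most of these counters come straight from the standard statistics views. A query along these lines (a sketch, not the gatherer's exact SQL) returns the table IO metrics listed above:

-- A sketch of the kind of query a gatherer runs; not PGObserver's exact SQL.
-- Heap vs. disk hits come from the companion view pg_statio_user_tables.
SELECT s.schemaname,
       s.relname,
       s.seq_scan,
       s.idx_scan,
       s.n_tup_ins,
       s.n_tup_upd,
       s.n_tup_hot_upd,
       s.n_tup_del,
       pg_total_relation_size(s.relid) AS total_size_bytes
  FROM pg_stat_user_tables s
 ORDER BY total_size_bytes DESC;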

For some metrics you must install data-gathering wrapper functions (stored procedures) on the server being monitored; these let you circumvent the superuser requirement, as sketched below.
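
The pattern (a hypothetical sketch, not one of the shipped helpers from the sql folder) is a SECURITY DEFINER function created once by a superuser, which the unprivileged monitoring user is then allowed to execute:

-- Hypothetical example of the wrapper pattern; the real helpers live in the
-- repository's sql folder. Assumes the zz_utils schema used by PGObserver's
-- helpers exists. Must be created by a superuser.
CREATE OR REPLACE FUNCTION zz_utils.get_stat_statements()
RETURNS SETOF pg_stat_statements AS
$$
  SELECT * FROM pg_stat_statements;  -- normally superuser-only detail
$$ LANGUAGE sql SECURITY DEFINER;    -- executes with the owner's rights

-- 'pgobserver_gatherer' is a placeholder for your monitoring role.
GRANT EXECUTE ON FUNCTION zz_utils.get_stat_statements() TO pgobserver_gatherer;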

Go here for some PGObserver screenshots and a nice illustration by Zalando Tech resident artist Kolja Wilcke.

Additional Features

With some extra setup (see instructions), you can also:

  • monitor blocked processes (needs a cron script on the host DB)
  • monitor pg_stat_statements (needs an enabled pg_stat_statements extension)
  • do cron aggregations to speed up sproc-load and database-size graphs; these are useful when monitoring tens of instances
  • export metrics data to InfluxDB for custom charting/dashboarding with Grafana or some other tool

Status

PGObserver is still in use, but it no longer receives active attention or development, as stored procedure usage has dropped in new projects.

How PGObserver Works

A Java application gathers metrics by querying PostgreSQL performance views (pg_stat_*). You can configure gathering intervals for the different metrics on a per-host, per-metric basis. This enables you to gather more details for critical systems and provide fewer details for less-important systems — thereby reducing the amount of data stored.

Additionally, you can configure sets of hosts to monitor from different Java processes — for example, when deploying to multiple locations with limited accessibility.

PGObserver’s frontend is a standalone Python + CherryPy application; the "screenshots" folder includes basic examples. Charts are rendered with the JS Flot library.

To help you generate minimalistic test data for a local test setup, we’ve included the testdata.py script: https://github.com/zalando/PGObserver/blob/master/frontend/src/testdata.py

Quick Test Run Using Vagrant

Make sure you've installed the latest version of Vagrant. Clone PGObserver to the machine where you want to run it, then run the following from the PGObserver base directory:

git clone https://github.com/zalando/PGObserver.git
cd PGObserver
vagrant up

This last step will take a while, as PGObserver performs the following inside the virtual machine:

  • Fetches and starts an official PostgreSQL 9.3 Docker image
  • Compiles the gatherer for you, creates a Docker image, and starts it inside the VM
  • Creates a Docker image for the frontend and starts it inside the VM
  • Exposes ports 38080 and 38081 for the frontend and the gatherer, respectively. You can then open the frontend on port 38080 and configure a database cluster to monitor — e.g., http://localhost:38080/hosts/

The easiest way to run it somewhere else is to change the config files and create your own Docker images to deploy. Just point it to the PostgreSQL cluster where you created the PGObserver database.

Setup

Install:

  • Python 2.7 (to run PGObserver’s frontend)
  • pip, then install the Python dependencies:
    pip install -r frontend/requirements.txt
  • the PostgreSQL contrib modules pg_trgm and btree_gist; these should come with your operating system distribution in a package named postgresql-contrib (or similar)

Create the schema by executing the SQL files from the sql/schema folder on the Postgres database where you want to store monitoring data:

cat sql/schema/*.sql | psql -1 -f - -d my_pgobserver_db
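
To sanity-check the schema load, you can list the created tables (the monitor_data schema name is taken from the repository's schema files):

-- Quick sanity check after loading the schema files.
SELECT table_name
  FROM information_schema.tables
 WHERE table_schema = 'monitor_data'
 ORDER BY table_name;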

Configuration

Start by preparing your configuration files for gatherer and frontend; the provided examples are good starting points.

  • set your database connection parameters: name, host, and port
  • configure the usernames and passwords for the gatherer and the frontend; find the defaults here
  • set gather_group (important for the gatherer only; it enables running many gatherer processes)
  • create an unprivileged user on the database you want to monitor, so it can do selects from the system catalogs (see the sketch below)
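
For the last point, a minimal sketch (role name, password, and database name are placeholders; an ordinary login role can already read most pg_stat_* views):

-- Placeholder names; pick your own role name and password.
CREATE ROLE pgobserver_gatherer LOGIN PASSWORD 'secret';
GRANT CONNECT ON DATABASE my_monitored_db TO pgobserver_gatherer;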

Configuring Hosts to Monitor

You can either:

  • Insert an entry into the monitor_data.hosts table with the connection details and to-be-monitored features of the cluster you want to monitor (use the same password that you used in the previous step); a sketch of such an insert follows this list; OR
  • use the frontend web application's /hosts page (next step), entering all needed data and pressing "add", followed by "reload" to refresh the menus
    • set host_gather_group to decide which gatherer monitors which cluster
    • to decide which schemas are scanned for sproc statistics, review the table sproc_schemas_monitoring_configuration. Defaults are provided.
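
If you go the SQL route, the insert looks roughly like this (the column names and values are illustrative; check the sql/schema files for the authoritative definition of monitor_data.hosts):

-- Illustrative only; consult the sql/schema files for the real column list.
INSERT INTO monitor_data.hosts
       (host_name, host_port, host_db, host_user, host_password,
        host_gather_group, host_enabled)
VALUES ('db1.example.com', 5432, 'shop_db', 'pgobserver_gatherer',
        'secret', 'default', true);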

Some features will require you to create corresponding helper functions on the databases being monitored:

  • CPU load monitoring requires a stored procedure from cpu_load.sql. This is a plpythonu function, so a superuser is needed.
  • For pg_stat_statements monitoring, you need this file.
  • For table & index bloat query, you need this.
  • Blocking processes monitoring requires setup from this folder.

Run the frontend by going into the "frontend" folder and running run.sh, which launches "python src/web.py" and puts it in the background:

frontend$ ./run.sh --config frontend.yaml

Build the data gatherer single jar, including dependencies, by going to the "gatherer" folder and running:

mvn clean verify assembly:single

Start the data-monitoring daemons by running the gatherer's run.sh.

Troubleshooting Hint


You might have to change your PostgreSQL server configuration to gather certain types of statistics. Please refer to the Postgres documentation on The Statistics Collector and pg_stat_statements.
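
For example, stored procedure timings require function tracking, and pg_stat_statements must be preloaded. A sketch for PostgreSQL 9.4+ (on older versions, edit postgresql.conf directly):

-- Requires a superuser; shared_preload_libraries takes effect only
-- after a server restart, track_functions after a reload.
ALTER SYSTEM SET track_functions = 'all';
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
SELECT pg_reload_conf();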

Contributions

PGObserver welcomes contributions from the community. Please go to the Issues page to learn more about planned project enhancements and known bugs. Feel free to make a pull request and we'll take a look.

Thank You

Thank you to our Zalando contributors, as well as Fabian Genter.

License

Copyright 2012 Zalando GmbH

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

pgobserver's People

Contributors

a1exsh, alexeyklyukin, codechimp, deric, evakbe, fabian-gentner, hjacobs, igoramadas, jan-m, jbspeakr, kmoppel, lappleapple, mawenzel, s-soroosh, slitsche, sstirbu, williamn

pgobserver's Issues

setup instructions

The setup instructions are a little sparse.

Some additional steps I had to do:

$ easy_install cherrypy 
$ easy_install Jinja2
$ easy_install argparse

Some additional things that could be included:

  • what version of python is required
  • what python dependencies are required
  • what versions of postgresql are supported

Additionally the setup requires directly inserting rows into the database. I think it would be worthwhile to make a simple install script to initialize the data.

No such file pgobserver.yaml Error (vagrant build).

I know there is a sample one in the gatherer directory, but I'm not sure about the vagrant build process:

==> default: ldconfig deferred processing now taking place
==> default:  ---> a4b7b19c1c01
==> default: Removing intermediate container 1f5d7dcf1b93
==> default: Step 2 : RUN mkdir -p /app
==> default:  ---> Running in 430f78046c53
==> default:  ---> 2d56fb5b4e81
==> default: Removing intermediate container 430f78046c53
==> default: Step 3 : COPY pgobserver.yaml /app/pgobserver.yaml
==> default: pgobserver.yaml: no such file or directory

Failing "Quick Test Run Using Vagrant"

Hi,

I tried to test PGObserver using the Quick Test Run Using Vagrant, and I got an error at Step 6:

default: Step 6 : RUN pip install -r /app/requirements.txt
default: ---> Running in b64ab7ca8a57
default: Downloading/unpacking CherryPy==3.2.4 (from -r /app/requirements.txt (line 1))
default: Cannot fetch index base URL http://pypi.python.org/simple/
default: Could not find any downloads that satisfy the requirement CherryPy==3.2.4 (from -r /app/requirements.txt (line 1))
default: No distributions at all found for CherryPy==3.2.4 (from -r /app/requirements.txt (line 1))
default: Storing complete log in /root/.pip/pip.log
default: The command '/bin/sh -c pip install -r /app/requirements.txt' returned a non-zero code: 1
The SSH command responded with a non-zero exit status. Vagrant
assumes that this means the command failed. The output for this command
should be in the log above. Please read the output to determine what
went wrong.

It may be related to PyPI; if so, feel free to close this issue.

Thank you for your work!

Improve log output with current database context

In some cases, the current database is not logged in the log files.
For instance:

SEVERE: Error during Load gathering java.sql.SQLException: Timed out waiting for a free available connection. at com.jolbox.bonecp.DefaultConnectionStrategy.getConnectionInternal(DefaultConnectionStrategy.java:88)

does not allow checking the server side.

Ignore the 'recent 2 API schemas' rule for sprocs if newer versions of them were not called yet

On one occasion, looking at the sprocs/all page of PGObserver, I couldn't find a sproc that was actively being called on the running database. The reason was that it was older than 2 API versions. The problem is, no newer versions of that sproc were called (the backend was still at the N-3 API version), and the data for it was missing altogether.

The limitations, AFAIK, are there because when a sproc changes significantly over time, the data for the older version becomes irrelevant. Another reason is to reduce the amount of data to process. I think we can still satisfy both those requirements if we include the data for a sproc older than N-2 API revisions only when the following conditions are true:

  • the sproc belongs to at least N-8 API or later (if one API release corresponds to a single week, it's unlikely that we will be running anything older than 2 months).
  • there is no data for the calls of the sproc with the same name, but later API revisions.
  • the sproc was called recently (i.e., during the past week)

In other words, if the sproc A exists in R8, R9, R10, R11 and R12 APIs, the latest one in the database is R12 and it was called with R8 and R10 versions only during the last week, then we do include data for R10 only, because there were no calls for R11 and R12.

If, on the other hand, the same sproc gets called with R11 and R12, then only those 2 revisions are included, discarding the other ones.

frontend error

Hi!!

I'm trying to start the frontend via run.sh and I get this error:

Traceback (most recent call last):
  File "web.py", line 88, in <module>
    main()
  File "web.py", line 76, in main
    root.report = report.Report()
AttributeError: 'NoneType' object has no attribute 'report'

You guys can help?

Thanks!

Wrong REGEX to check for valid S3 URL

While trying to deploy the frontend, I receive the following error:

Exception: Invalid S3 url: https://s3.eu-central-1.amazonaws.com/laas-config/config.yaml    

Most likely, the reason is the regex to check for valid S3 URLs in https://github.com/zalando/PGObserver/blob/149e01867d0be82acfe4fdf184782a53fe4a4c05/frontend/src/aws_s3_configreader.py:

    m = re.match('https?://s3-(.*)\.amazonaws.com/(.*)', s3_url)
    if not m:
        raise Exception('Invalid S3 url: {}'.format(s3_url))

This regex works for all S3 URLs in eu-west-1 where at least our links look as follows:

https://s3-eu-west-1.amazonaws.com/zalando-stups-mint-123456-eu-west-1/app-credentials/client.json

However, in eu-central-1, the links seem to have a slightly different structure:

https://s3.eu-central-1.amazonaws.com/zalando-stups-mint-123456-eu-central-1/pgobserverfrontend/client.json

Stored procedure data not collected

In the web interface, the stored procedure timings are not being collected. Is there an extension that needs to be installed? Am I missing any steps?
The only step I have not done is create the CPU load scripts, since I do not need it.

PGObserver Docker image

Trying to set up PGObserver on Fedora 25, I ran into several problems.

Therefore I wonder: is there an all-in-one 'one-click' PGObserver Docker image planned or already existing?

If not, would it be something you would like to have, or do you think it's not a good idea, if I were to create one?

Best,
Oliver

pg_stat_statements not working

Hi, I've got a problem with logging pg_stat_statements. I've got the module enabled in the database (the view pg_stat_statements exists in Schemas->public->Views) and there's of course some data. But when I try to access it on the frontend, it doesn't show any statistics. Also, the host selection is empty (see the picture).

[screenshot: 2017-04-28 11:28:52]

The error log shows only this: ERROR: function zz_utils.get_load_average() does not exist, but I think it's not related to my problem. Any help would be appreciated :)

configuration problems

Hi there,

I have problems displaying all the tables with Top 10 SProcs. On the same page, Load Average 15min Sprocs shows a graph. I have already configured the pg_stat_statements module in my cluster:

shared_preload_libraries = 'pg_stat_statements'         # (change requires restart)
pg_stat_statements.max = 10000
pg_stat_statements.track = all

With this enabled, I am able to do a
select * from pg_stat_statements
when connected to my database in psql and get the expected result, but not in the PGObserver web frontend.

I get a warning message in pgmon_java_*****:

Nov 24, 2016 2:23:25 PM de.zalando.pgobserver.gatherer.SprocGatherer getSchemasToBeMonitored WARNING: NOT nspname LIKE ANY (array['public','pg\_%','information_schema','tmp%','temp%']::text[]) AND nspname LIKE ANY (array['%']::text[]) Nov 24, 2016 2:23:25 PM de.zalando.pgobserver.gatherer.SprocGatherer gatherData

Are there some configurations of PostgreSQL that I am missing?
Could you please give me some more advice?

Thanks

Matthias

500 Internal Server Error on /perfstatstatements

It seems like the frontend has trouble reading some queries from the pg_stat_statements table:

500 Internal Server Error
...
perf_stat_statements.html, line 84, in block "body"
    <tr class="hiddenrow" style="display:none" ><td colspan="8"><textarea readonly>{{ d['query'] }}</textarea></td>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 210: ordinal not in range(128)

This row seems to cause trouble in perf_stat_statements.html:

<tr class="hiddenrow" style="display:none" ><td colspan="8"><textarea readonly>{{ d['query'] }}</textarea></td>

Same problem with the perf_stat_statements_detailed.html by the way:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 210: ordinal not in range(128)

Hopefully you know how to fix it?

EDIT:

I guess I found the query which causes the problem:

UPDATE afg SET status = status_in, updated_at = now(), display = display_in, urls = urls_in, type = type_in, headline = headline_in, description_1 = description_1_in, description_2 = description_2_in, destination_url = destination (...)

I guess the (...) is the problem, because when I delete

<tr class="hiddenrow" style="display:none" ><td colspan="8"><textarea readonly>{{ d['query'] }}</textarea></td>

from perf_stat_statements.html, the /perfstatstatements loads fine. And the show graph url loads perfectly fine for all queries, except for the queries which start with

UPDATE afg SET status = status_in, updated_at = now(), 

So I guess the (...) is causing problems.

error in version check of maven in build.sh

The command gathering the Maven version in build.sh should read
MAVEN_VERSION=$(mvn --version | grep -o '3.' | head -n 1)

The original (with head first and then grep) writes "3.\n3." into the variable, which is different from "3.".

Improve index bloat estimation

We had an index that was shown as a 0.3 bloat ratio (meaning 0.3%) and 0 bytes to win by the PGObserver estimator, while the estimate at https://github.com/ioguix/pgsql-bloat-estimation/blob/master/btree/btree_bloat.sql showed it as 50% bloated.

After creating a new index with the same parameters, it was indeed smaller than the original one (the original was 18 GB, and the replacement took only 7 GB). Therefore, it looks like the bloat estimator needs to be re-examined.
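
A quick way to cross-check any estimator is to rebuild the index and compare sizes (a sketch; the index, table, and column names are hypothetical):

-- Hypothetical names; CONCURRENTLY avoids blocking writes.
CREATE INDEX CONCURRENTLY my_index_new ON my_table (my_column);
SELECT pg_size_pretty(pg_relation_size('my_index'))     AS old_size,
       pg_size_pretty(pg_relation_size('my_index_new')) AS new_size;
DROP INDEX CONCURRENTLY my_index;  -- once the new index is validated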

PGObserver resets its accumulated sp_calls metrics on failover of the underlying DB

During a failover of the database being monitored, we observed that PGObserver resets its sp_calls (total number of calls) metric to 0, confusing our reporting script. This is a bug; it should instead continue from the previously collected values and, possibly, forgo writing new data while those measurements are not available.

Provide more Documentation for the Frontend

Hi there,

I tried to use the tool today.

It was not possible for me to get it running properly. At least for the Python part, I was missing documentation on how to get it running. Can you provide me with some more information?

Typo in schema.sql

There is a typo in the schema.sql script, line 120.

It is "host_grather_group" and should be "host_gather_group".

pgobserver.conf

Hi,
I'm trying to install PGObserver. I have cloned the project from git. However, I cannot find the pgobserver.conf file. I'm referring to the third line in the setup guide:
Copy pgobserver.conf to the home folder: ~/.pgobserver.conf

Running find etc on the PGObserver directory didn't return anything useful.

Thanks and best regards,
Andreas

Top pg_stat_statements entries, Host: is empty

Hi,

I have a problem in the Pg_stat_statements report: I have no data in Host; if I change the URL to one of my "host_ui_shortname" values, the data are displayed correctly.

It is strange that only on this page my hosts are not loaded.

Could you please give me some information to resolve this problem?

Thanks
luca

configuration problems

Hey there,

I tried this tool today and unfortunately I am having some problems. I am able to start the frontend and also the gatherer. The frontend also displays some information about my configured database, so I guess that the basic configuration is okay. However, on several pages of the frontend, no data is displayed. For example, the tables Top 10 Sprocs last 1 hour by total run time and Top 10 Sprocs last 3 hours by total run time are not displayed, while on the same page, Load Average 15min Sprocs only shows a graph. Furthermore, the following tables don't show anything:

  • Perf. Issues - Tables
  • Perf. Issues - SProcs
  • Perf. Issues - Indexes
  • Pg_Stat_Statements report

Especially the Pg_Stat_Statements report is very important for me, but it does not display any data. I have already configured the pg_stat_statements module in my cluster like this (added in postgresql.conf):

shared_preload_libraries = 'pg_stat_statements'         # (change requires restart)
pg_stat_statements.max = 10000
pg_stat_statements.track = all

Restarted everything, but still no luck.

With this enabled, I am able to do a
select * from pg_stat_statements
when connected to my database in psql and get the expected result, but not in the PGObserver web frontend.

Furthermore, I get the following error message when trying to access the "Top Tables" page:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 656, in respond
    response.body = self.handler()
  File "/usr/local/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 188, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 34, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/home/ilexx/Downloads/PGObserver/frontend/src/tablesfrontend.py", line 132, in index
    top_tables = tabledata.getTopTables(None, date_from, date_to, order, limit)
  File "/home/ilexx/Downloads/PGObserver/frontend/src/tabledata.py", line 388, in getTopTables
    pattern = '%' + pattern + '%'
TypeError: cannot concatenate 'str' and 'NoneType' objects

Are there some configurations of PostgreSQL that I am missing?
Could you please give me some more advice?
So far, thanks...

Add database column to Performance Analyses views

At the moment we have multiple databases in one cluster. We even have multiple clusters on one host.

If we want to use the Performance Analyses views, we can only select hosts. If you have configured multiple databases on one cluster and one host, you have multiple hosts in the selection box but no way to see which database you are actually on.

It would be really helpful to use the UI Longname or UI Shortname in the menu to view the performance stuff, instead of the hostname. That way you can view all the other databases on the same host or cluster as well.

Why not gathering public schema procedures

Why are public schema procedures dropped in SprocGatherer#getQuery?
I have some procedures in the public schema and I need to gather their statistics. Can I include the public schema in the SQL query if there are no side effects?

How to measure Sprocs (stored procedures)?

It's not very clear to me how to measure the sproc statistics.

So far I know I've got these 3 tables:

sproc_performance_data
sproc_schemas_monitoring_configuration
sprocs

This is my sproc_schemas_monitoring_configuration table:
[screenshot: 2015-11-04 15:52:31]

Do I need to insert rows into those other 2 tables, or should this be enough? And should the scmc_host_id be equal to the host id of the monitored host in the hosts table?

Index Statistics per object

We're using pgObserver to monitor our postgresql instances. It's really nice and has provided us with an easy way to view database activity.

As best as I can tell from looking at the gatherer code there is currently no collection of index statistics. It would be really useful if information about index use and bloat was available.

We're interested in the data that is available from postgres as described here:
https://wiki.postgresql.org/wiki/Index_Maintenance
http://www.postgresql.org/docs/current/static/monitoring-stats.html#MONITORING-STATS-SETUP

Is this feature planned?

Thanks for making pgObserver!
