rapidloop / pgmetrics Goto Github PK

View Code? Open in Web Editor NEW

940.0 940.0 64.0 1.76 MB

Collect and display information and stats from a running PostgreSQL server

Home Page: https://pgmetrics.io

License: Apache License 2.0

Go 99.90% Dockerfile 0.10%

database go golang metrics metrics-gathering monitoring postgres postgresql

pgmetrics's People

Contributors

Stargazers

Watchers

pgmetrics's Issues

report notification queue usage

Collect and report notification queue usage:

pg_notification_queue_usage() - fraction of the asynchronous notification queue currently occupied. Documented at https://www.postgresql.org/docs/current/static/functions-info.html

What determines the sort order for list of table details in pgmetrics output?

Version used: pgmetrics_1.9.2_linux_amd64
Command line parameters: pgmetrics -h HOST -U USER -w DB

I'm executing pgmetrics against two separate environments that run the same database with the same schema. I wanted to see a side-by-side comparison of the results. The bulk of the pgmetrics output is the detailed information listed about the tables in the database being inspected but the sort order for the tables varies wildly between the two environments.

What determines the sort order of the tables when pgmetrics exports that data? Can that sort order be controlled via any parameter so I can get both environments to list table details in the same order? I was not able to find any parameters that might control that.

Thank you.

Windows/SSPI: gives "unknown authentication response: 9"

Trying to authenticate using SSPI on Windows - see

https://wiki.postgresql.org/wiki/Configuring_for_single_sign-on_using_SSPI_on_Windows

I can open a shell window using the Windows domain account and run PostgreSQL commands (Windows domain account is PostgreSQL superuser).

Can I do the same with pgmetrics? I try:

pgmetrics.exe -h localhost -U username -w databasename

Output:

pgmetrics: pq: unknown authentication response: 9

Or is this not supported?

Slow bloat query

I am trying to use pgmetrics on a big (10TB+), busy (1GB/s RW) database. It takes around 5 minutes for pgmetrics to run. I traced the problem to the "bloat query" spinning in CPU, doing no I/O.

I have traced the problem to the bloated pg_class (the irony: pgmetrics does not collect bloat on pg_catalog):
vacuum (full, analyze, verbose) pg_class;

INFO:  vacuuming "pg_catalog.pg_class"
INFO:  "pg_class": found 1 removable, 7430805 nonremovable row versions in 158870 pages
DETAIL:  7429943 dead row versions cannot be removed yet.
CPU 1.36s/6.40u sec elapsed 9.85 sec.
INFO:  analyzing "pg_catalog.pg_class"
INFO:  "pg_class": scanned 60000 of 158869 pages, containing 295 live rows and 2806547 dead rows; 295 rows in sample, 781 estimated total rows
VACUUM

pg_class has so many dead rows because the workload is temp-table heavy (creating/destroying 1M+ temporary tables per day) and has long running analytics queries running for 24h+.

PG query planner assumes that index scan on pg_class will be very quick and plans Nested loop with Index scan. However, the index scan has 7M dead tuples to filter out and the query takes more than 200 seconds.

If I create a temp table from pg_class to contain only the live tuples:

CREATE TEMPORARY TABLE pg_class_alive AS SELECT oid,* from pg_class;
CREATE UNIQUE INDEX pg_class_alive_oid_index ON pg_class_alive(oid);
CREATE UNIQUE INDEX pg_class_alive_relname_nsp_index ON pg_class_alive(relname, relnamespace);
CREATE INDEX pg_class_tblspc_relfilenode_index ON pg_class_alive(reltablespace, relfilenode);
ANALYZE pg_class_alive;

and run the bloat query on pg_class_alive instead of pg_class:

SELECT
   nn.nspname AS schemaname,
   cc.relname AS tablename,
   COALESCE(cc.reltuples,0) AS reltuples,
   COALESCE(cc.relpages,0) AS relpages,
   COALESCE(CEIL((cc.reltuples*((datahdr+8-
     (CASE WHEN datahdr%8=0 THEN 8 ELSE datahdr%8 END))+nullhdr2+4))/(8192-20::float)),0) AS otta
 FROM
    pg_class_alive cc
 JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname <> 'information_schema'
 LEFT JOIN
 (
   SELECT
     foo.nspname,foo.relname,
     (datawidth+32)::numeric AS datahdr,
     (maxfracsum*(nullhdr+8-(case when nullhdr%8=0 THEN 8 ELSE nullhdr%8 END))) AS nullhdr2
   FROM (
     SELECT
       ns.nspname, tbl.relname,
       SUM((1-coalesce(null_frac,0))*coalesce(avg_width, 2048)) AS datawidth,
       MAX(coalesce(null_frac,0)) AS maxfracsum,
       23+(
         SELECT 1+count(*)/8
         FROM pg_stats s2
         WHERE null_frac<>0 AND s2.schemaname = ns.nspname AND s2.tablename = tbl.relname
       ) AS nullhdr
     FROM pg_attribute att
     JOIN pg_class_alive tbl ON att.attrelid = tbl.oid
     JOIN pg_namespace ns ON ns.oid = tbl.relnamespace
     LEFT JOIN pg_stats s ON s.schemaname=ns.nspname
     AND s.tablename = tbl.relname
     AND s.inherited=false
     AND s.attname=att.attname
     WHERE att.attnum > 0 AND tbl.relkind='r'
     GROUP BY 1,2
   ) AS foo
 ) AS rs
 ON cc.relname = rs.relname AND nn.nspname = rs.nspname
 LEFT JOIN pg_index i ON indrelid = cc.oid
 LEFT JOIN pg_class_alive c2 ON c2.oid = i.indexrelid

it runs in 10s, 20x faster

WDYT about adding such a hack (maybe under an option?) to pgmetrics?

pg_stat_statements scan failed: sql: Scan error on column index 3, name "left": converting NULL to string is unsupported

Getting following error:

pgmetrics: pgmetrics scan failed: sql: Scan error on column index 3, name "left": converting NULL to string is unsupported

from time to time. Running select pg_stat_statements_reset helps, but it does come back
pgmetrics version: 1.8.1
postgres version: 12.1 (Ubuntu 12.1-1.pgdg18.04+1)

Unable to login without password

pgmetrics -h 127.0.0.1 -p 5441 -U postgres --no-password example
pgmetrics: pq: unexpected error: "setting PGSYSCONFDIR not supported"

I have the right entries in .pgpass so this works:

psql -h 127.0.0.1 -p 5441 -U postgres example
psql (10.2 (Ubuntu 10.2-1.pgdg14.04+1))
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

use iso date format everywhere in report

is it possible to enforce iso date time formats everywhere in the report? it lines up much nicer in a report table and it's more international...

PgMetrics not working for AWS Aurora Instances

We are executing pgmetrics for AWS aurora instances with Postgres 14/15 versions but immediately command failed like below.

AWS Aurora PG 14.9 :

[root@localhost pgmetrics_1.13.1_linux_amd64]# ./pgmetrics -h *********.us-east-1.rds.amazonaws.com -U postgres --aws-rds-dbid=bk-pg-poc -f json -o /u01/Aws_Aurora_PG_14.9_Json_OP.txt postgres
Password:
pgmetrics: pg_stat_wal query failed: pq: Function pg_stat_get_wal() is currently not supported for Aurora
[root@localhost pgmetrics_1.13.1_linux_amd64]#

writer Instance:
[root@localhost pgmetrics_1.13.1_linux_amd64]# ./pgmetrics -h *********.us-east-1.rds.amazonaws.com -U postgres --aws-rds-dbid=bk-pg-poc-instance-1 -f json postgres
Password:
pgmetrics: pg_stat_wal query failed: pq: Function pg_stat_get_wal() is currently not supported for Aurora
[root@localhost pgmetrics_1.13.1_linux_amd64]#

AWS Aurora PG 15.4

[root@localhost pgmetrics_1.13.1_linux_amd64]# ./pgmetrics -h ********.us-east-1.rds.amazonaws.com -U postgres --aws-rds-dbid=bk-pg-poc -f json -o /u01/Aws_Aurora_PG_15.4_Json_OP.txt postgres
Password:
pgmetrics: pg_stat_wal query failed: pq: Function pg_stat_get_wal() is currently not supported for Aurora
[root@localhost pgmetrics_1.13.1_linux_amd64]#

the rapidloop/pq fork causes a conflict when using the collector package

Hi,

When using the collector package with other code that uses the lib/pq standard package, you get an error:

panic: sql: Register called twice for driver postgres

This is due to pgmetrics using the fork of lib/pq, rapidloop/pq.

Is it possible to change this to use the standard library so that it will work with other code?

Thanks

PGBOUNCER 1.21 breaking change.

PGBOUNCER with 1.21 introduced max_prepared_statements and I believe along that came a prepared_statements inside the SHOW SERVERS and SHOW CLIENTS

[Feature] Support connect by Keyword/Value Connection Strings

Hello! Thanks for pgmetrics.
Please add connect by Keyword/Value Connection Strings

In the keyword/value format, each parameter setting is in the form keyword = value, with space(s) between settings. Spaces around a setting's equal sign are optional. To write an empty value, or a value containing spaces, surround it with single quotes, for example keyword = 'a value'. Single quotes and backslashes within a value must be escaped with a backslash, i.e., ' and \.

Example:

host=localhost port=5432 dbname=mydb connect_timeout=10

https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING

For example my cloud:

pgmetrics "host=host \
    port=6432 \
    sslmode=verify-full \
    dbname=dbname \
    user=user \
    target_session_attrs=read-write"

FYI: Chocolatey package

I created chocolatey package for Windows:

https://chocolatey.org/packages/pgmetrics
Src: https://github.com/majkinetor/au-packages/tree/master/pgmetrics

It will be kept up to date via bot within single day from the Github release.

Thanks for the tool.

pg_stat_statements

If the pg_stat_statements extension is installed, collect information from the pg_stat_statements view also.

AWS RDS PostgreSQL - pg_ls_dir and pg_current_logfile

Hi,

I am trying to use pgmetrics against an AWS PostgreSQL RDS (Version 11.5).

It appears the AWS RDS superuser does not have permissions to pg_ls_dir and pg_current_logfile that I see in the code https://github.com/rapidloop/pgmetrics/blob/master/collector/collect.go.

I am seeing the below error messages on the logfile.
ERROR: 42501: permission denied for function pg_current_logfile
ERROR: 42501: permission denied for function pg_current_logfile

Is there any options I can use to eliminate these errors?

Also, is there any plans to modify this to work on opensource PostgreSQL and AWS RDS PostgreSQL?

Thanks and keep up the great work!

Scan error on column index 4, name "pid"

Got this in the latest postgres

pgmetrics: pg_locks query failed: sql: Scan error on column index 4, name "pid": converting driver.Value type ("") to a int: invalid syntax

Incorrect min_wal_size displayed for PG ver >= 10

In 1.10.0, displaying max_wal_size in text reports for PG versions >= 10 has been fixed (513dcc0). However, this has not been done for min_wal_size, which leads to outputs like:

    +--------------------+----------------+
    |            Setting |          Value |
    +--------------------+----------------+
    |          wal_level |        replica |
    |    archive_timeout |              0 |
    |    wal_compression |            off |
    |       max_wal_size | 10240 (10 GiB) |
    |       min_wal_size |  1024 (16 GiB) |

Log parsing clarification

Hi there!

Love the project, it's super useful and provides very interesting insights.

I note that you have documented that you can collect from the log files and display detailed information from these logs, but it's not super clear in which situations it applies.

Specifically, there are two separate use cases where I see issues (or confusion on my part) in getting log parsing to work:

Remote server, even when specifying --log-file paths
Local server, when the log destination is syslog/journald

Can you clarify if it's possible to enable the log parsing (and seemingly associated info in the output, like Slow Queries) when using pgmetrics with either of these configurations?

If it's not possible to do so in either of these situations, can the specific configurations/environment necessary for this be documented?

Thanks very much!

Please, add building instructions

I'm not a go magician. It would be nice to know how to build it myself.

Thanks! :D

When available on PG 11?

I am doing a lot of testing with PG 11 beta4, but pgmetrics does not work for that yet. Any idea when I can start beta testing pgmetrics for PG 11?

Query failed on launch

./pgmetrics --no-password
pgmetrics: pg_roles/pg_auth_members query failed: pq: bigint en dehors des limites

psql -V
psql (PostgreSQL) 10.3 (Debian 10.3-1.pgdg80+1)

PGLOCALEDIR issue

-bash-4.2$ /instance1/dl/pgm/pgmetrics_1.1.0_linux_amd64/pgmetrics --no-password
pgmetrics: pq: unexpected error: "setting PGLOCALEDIR not supported"

i can get past this by unsetting it but why is this raised?

thanks

How to build the project

Hello,
can you provide more information about how to build the project from scratch ? I can't build it with methods on site:
$ go get github.com/rapidloop/pgmetrics/...
../go/src/github.com/rapidloop/pgmetrics/cmd/pgmetrics/report.go:773:32: error: reference to undefined field or method ‘Truncate’
return time.Duration(1e6 * ms).Truncate(time.Millisecond).String()

Then I download the source code and try again:
GOROOT=/root/pgmetrics-1.2.0/vendor go build ./cmd/pgmetrics
cmd/pgmetrics/report.go:26:2: cannot find package "github.com/dustin/go-humanize" in any of:
/root/pgmetrics-1.2.0/vendor/src/github.com/dustin/go-humanize (from $GOROOT)
/root/go/src/github.com/dustin/go-humanize (from $GOPATH)
cmd/pgmetrics/main.go:30:2: cannot find package "github.com/howeyc/gopass" in any of:
/root/pgmetrics-1.2.0/vendor/src/github.com/howeyc/gopass (from $GOROOT)
/root/go/src/github.com/howeyc/gopass (from $GOPATH)
cmd/pgmetrics/collect.go:32:2: cannot find package "github.com/lib/pq" in any of:
/root/pgmetrics-1.2.0/vendor/src/github.com/lib/pq (from $GOROOT)
/root/go/src/github.com/lib/pq (from $GOPATH)
cmd/pgmetrics/main.go:31:2: cannot find package "github.com/pborman/getopt" in any of:
/root/pgmetrics-1.2.0/vendor/src/github.com/pborman/getopt (from $GOROOT)
/root/go/src/github.com/pborman/getopt (from $GOPATH)
cmd/pgmetrics/collect.go:33:2: cannot find package "github.com/rapidloop/pgmetrics" in any of:
/root/pgmetrics-1.2.0/vendor/src/github.com/rapidloop/pgmetrics (from $GOROOT)
/root/go/src/github.com/rapidloop/pgmetrics (from $GOPATH)
cmd/pgmetrics/main.go:33:2: cannot find package "golang.org/x/crypto/ssh/terminal" in any of:
/root/pgmetrics-1.2.0/vendor/src/golang.org/x/crypto/ssh/terminal (from $GOROOT)
/root/go/src/golang.org/x/crypto/ssh/terminal (from $GOPATH)

Thanks

how to collect system info with pgmetrics

dear team,
when I read doc about pgmetrics, I find pgmetrics can collect system info, like cpu/memory load and disk usage,
but I can't find any tip in the help list, or command demo in the doc,

https://pgmetrics.io/docs/metrics.html -- System-level Information

my env:

$ cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)

$ pgmetrics -V
pgmetrics 1.15.0

could you add more info on it?

pg_class query failed error

pgmetrics -h localhost -p 5432 -U postgres --no-password mydb

pgmetrics: pg_class query failed: sql: Scan error on column index 2, name "pg_get_expr": unsupported Scan, storing driver.Value type <nil> into type *string

Using CentOS 7 64 bit and PostgreSQL 11.1.

Installation method:

wget https://github.com/rapidloop/pgmetrics/releases/download/v1.5.1/pgmetrics_1.5.1_linux_amd64.tar.gz

tar -zxvf pgmetrics_1.5.1_linux_amd64.tar.gz
cd pgmetrics_1.5.1_linux_amd64

cp ./pgmetrics /usr/bin/pgmetrics

pgmetrics -V
-->
pgmetrics 1.5.1

Command line improvements

First of all, thanks for your new tool! Here are a few suggestions:

Make table output optional or filterable through a new command parameter. Thousands of tables within a database can cause a lot of unwanted output, perhaps something like --no-tables.
Add a schema option to limit table output, --schema schema1, schema2
Put system information output before table output. That way top level stuff is always at top.

Keep up the good work!

Authentication by client cert does not work

ajr$ psql -h db2 -p 2222
Timing is on.
SET
Time: 0,578 ms
search_path

erdb, syslog, pki, dd, ext_ip4r, archiveopteryx, public
(1 row)

Time: 0,617 ms
psql (10.3, server 10.5)
SSL connection (protocol: TLSv1.2, cipher: DHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

operations=# \q
ajr$ PGSSLMODE=require pgmetrics -h db2 -p 2222 -w
pgmetrics: remote error: tls: handshake failure
ajr$

In pg_hba.conf the related line is
hostssl ... cert clientcert=1

Wrong interpretation of maintenance_work_mem value

Hey,

I've just started playing around with pgmetrics and noticed the following:

in postgresql.conf I have maintenance_work_mem = 128MB which seems to be internally set as value "131072", which means the base unit is KB and not B.

In the pgmetrics output (in the "Vacuum Progress" block) however this seems to be interpreted as B:
maintenance_work_mem | 131072 (128 KiB)

I am not sure if the memory values are internally always in KB (which would make the fix rather easy) or if depending on configuration settings / environment variable / postgres version this might require a lookup for the base unit before pretty-printing it.

How to read some values reported by pgmetrics (e.g. shared_buffers, max_wal_size)

Hello. I think i've an issue similar to #2 Command SHOW shared_buffers; shows

 shared_buffers 
----------------
 4GB
(1 row)

But pgmetrics reports the following value 524288.

    +---------------------------------+-------------------+
    |                         Setting |             Value |
    +---------------------------------+-------------------+
    |                  shared_buffers |  524288 (4.0 GiB) |
    |                        work_mem |    32768 (32 MiB) |
    |            maintenance_work_mem | 1048576 (1.0 GiB) |
    |                    temp_buffers |    1024 (8.0 MiB) |

What are the units and the values on the left ?

Similarly, and more inconsistently SHOW max_wal_size; returns

 max_wal_size 
--------------
 2GB
(1 row)

But pgmetrics reports as follows

    +--------------------+---------------+
    |            Setting |         Value |
    +--------------------+---------------+
    |          wal_level |       replica |
    |    archive_timeout |             0 |
    |    wal_compression |           off |
    |       max_wal_size | 2048 (32 GiB) |
    |       min_wal_size | 512 (8.0 GiB) |
    | checkpoint_timeout |           300 |
    |   full_page_writes |            on |
    |  wal_keep_segments |             0 |
    +--------------------+---------------+

Break up the Postgres shared memory is not available in pgmetrics/pgDash currently.

I have postgresql instance I am running it as a pod in kubernetes cluster. I want to monitor the 1). Shared Memory 2). Process Memory or Backend Memory using pgDash tool. Can you please check on this and add the required features

Using pgmetrics as a library

in its current shape, pgmetrics is a simple and useful command line tool. but we are not able to use it in our environment today because we already have some monitoring code running and we don't want to add another binary which is only responsible for postgresql monitoring.

ideal solution would be use pgmetrics as a library.

the code is pretty clean and as far as i can understand, if we export collect, any other package should be able to use pgmetrics as a library.

no pg_hba.conf entry for host

I'm trying to run pgmetrics against our Heroku postgres database, but I'm receiving the following error message:
pgmetrics: pq: no pg_hba.conf entry for host "xxx.xxx.xx.xx", user "xxxxxxxx", database "xxxxxxxx", SSL off

The the host ip in the error message is not the public ip of the database instance, but looks like the private ip.
Connecting to the database using psql from the same terminal with the same credentials works as expected.
pgmetrics runs succesfully against my localhost database.
I'm using the darwin build of v1.6.3 from https://github.com/rapidloop/pgmetrics/releases
I also tried running the docker image version but there I get the same error message.

Is there a setting I need to provide on the command line for it to work with a Heroku postgres database?

When using JSON output, all "setttings" are strings

In the json output created by:

pgmetrics --no-password --format=json

All of the setting values are stored as strings in the JSON output:

    "maintenance_work_mem": {
      "setting": "65536"
    },
    "max_connections": {
      "setting": "100"
    },
    "max_files_per_process": {
      "setting": "1000"
    },

Would it be possible (or sensible) to output parameters that we "know" will be numbers as actual numbers in JSON? That just makes it easier later on to do mathematics on the result.

We can automatically get the three main groupings of string/integer/real with:

string

SELECT name, setting, vartype, COALESCE(boot_val,'') AS boot_val, source,
       COALESCE(sourcefile,'') AS sourcefile, COALESCE(sourceline,0) AS sourceline
  FROM pg_settings
  WHERE vartype NOT IN ('integer','real')
  ORDER BY name ASC;

integer

SELECT name, setting::integer, vartype, COALESCE(boot_val,'')::integer AS boot_val, source,
       COALESCE(sourcefile,'') AS sourcefile, COALESCE(sourceline,0) AS sourceline
  FROM pg_settings
  WHERE vartype = 'integer'
  ORDER BY name ASC;

real

SELECT name, setting::real, vartype, COALESCE(boot_val,'')::real AS boot_val, source,
       COALESCE(sourcefile,'') AS sourcefile, COALESCE(sourceline,0) AS sourceline
  FROM pg_settings
  WHERE vartype = 'real'
  ORDER BY name ASC;

I guess we don't really need vartype in there, but I thought it may be helpful (or needed) when the JSON is created.

Also, I've added column aliases "AS boot_val", etc. just so that the SQL output is human readable, rather than columns being called "coalesce".

I did a basic test of the theory here:

integer

postgres=# select array_to_json(array_agg(row_to_json(t)))
postgres-# from (
postgres(# SELECT name, setting::integer
postgres(#   FROM pg_settings
postgres(#   WHERE vartype = 'integer'
postgres(#     AND name IN ('maintenance_work_mem','max_connections','max_files_per_process')
postgres(#   ORDER BY name ASC
postgres(# )t
postgres-# ;
                                                               array_to_json
--------------------------------------------------------------------------------------------------------------------------------------------
 [{"name":"maintenance_work_mem","setting":65536},{"name":"max_connections","setting":100},{"name":"max_files_per_process","setting":1000}]
(1 row)

real

postgres=# select array_to_json(array_agg(row_to_json(t)))
postgres-# from (
postgres(# SELECT name, setting::real
postgres(#   FROM pg_settings
postgres(#   WHERE vartype = 'real'
postgres(#   ORDER BY name ASC
postgres(#   LIMIT 3
postgres(# )t
postgres-# ;
                                                                           array_to_json
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 [{"name":"autovacuum_analyze_scale_factor","setting":0.1},{"name":"autovacuum_vacuum_scale_factor","setting":0.2},{"name":"bgwriter_lru_multiplier","setting":2}]
(1 row)

pgmetrics 1.4.0 failed to support scram authentication

I have a database (PostgreSQL 10.3) with password of dbuser encrypted with scram-sha-256.
The entry in pg_hba.conf is "host [database] [user] [ip] md5"

When running pgmetrics for the database, it got an error

pgmetrics: pq: unknown authentication response: 10

This indicates scram authentication is not supported.

I have to change the password encryption of dbuser with md5 or disable authentication method in pg_hba.conf by adding entry "host trust" to overcome the error.

In production environment, neither of the above workaround is allowed.

Can you please check and resolve the above issue?

Release as a homebrew package

When popular enough (>50 stars, I think), would be great to have this as a homebrew package so one could just use brew install pgmetrics. I tried to create this package, but it requires building from source and I didn't have the time to figure that out for a go project right now. Also, they want it to be around for more than 30 days (a semi-valid concern).

Postgres 13: wal_keep_segments displays no value

In Postgres 13, pgmetrics displays no value for wal_keep_segments because it has been renamed to wal_keep_size.

Crashes PSQL when table has high Partition Count

I've caused psql14 to crash multiple times trying to run pgmetrics with a hash partitioned table with 1024 partitions.

AWS RDS, Aurora cloudwatch collection error

Hi!
I´m trying to run pgmetrics from linux to get the metrics for RDS Aurora Cluster and I´m getting this error:

pgmetrics: warning: failed to collect from AWS RDS: failed to get CloudWatch metric data: ValidationError: The value idEBSIOBalance% for parameter MetricDataQueries.member.30.Id is not matching the expected pattern ^[a-z][a-zA-Z0-9_]*$. The value idEBSByteBalance% for parameter MetricDataQueries.member.31.Id is not matching the expected pattern ^[a-z][a-zA-Z0-9_]*$. status code: 400, request id: 72bfee0a-a37f-4c4b-aa88-a5a304b4b660

The command that we are running is:

PGPASSWORD=XXXXXXXX ./pgmetrics -h writer-instance-endpoint -U SuperUser --aws-rds-dbid=clstr-instance-identifier -w -fjson database

The database output is ok, just fail with the RDS stats.

Thanks very much!

Graphite format - willing to submit pull request

Hey there, love the details with these metrics!

I wrote a NodeJS script to flatten / convert json output into a graphite line protocol ready format here:
https://gist.github.com/JohnnyMarnell/0ab7e95e2994b7960fe7a32667a529df

If I were to start porting it to Go (n00b), would you consider a pull request?

Something like pgmetrics -f graphite?

Omitting retrieval of index definition

As noticed in 17be4a7 getting index definition waits for an exclusive lock. This causes a lot of scary ERROR entries in PostgreSQL logs like:

2021-09-21 05:40:54.826 EDT [1880711-3] app=pgmetrics ERROR:  canceling statement due to lock timeout
2021-09-21 05:40:54.826 EDT [1880711-4] app=pgmetrics STATEMENT:  SELECT indexrelid, pg_get_indexdef(indexrelid) FROM pg_stat_user_indexes

It'd be useful to have extra omit option that would disable just the retrieval of index definitions, while keeping index stats.

Support Citus 11

Citus 11.x introduced a couple of breaking changes to the tables that pgmetrics queries, which causes errors when pgmetrics is run against Citus clusters on Citus 11. See release notes for details: https://www.citusdata.com/updates/v11-0/. The specific breaking changes are:

citus_worker_stat_activity has been dropped in favor of citus_stat_activity and citus_dist_stat_activity. citusdata/citus#5760. This causes this error when querying:

pgmetrics: warning: citus_worker_stat_activity query failed: pq: relation "citus_worker_stat_activity" does not exist

citus_dist_stat_activity does not have query_hostname and is a view over citus_stat_activity: citusdata/citus#5760 which causes an issue when trying to get query_hostname:

pgmetrics: warning: citus_dist_stat_activity query failed: pq: column "query_hostname" does not exist

citus_lock_waits no longer has column waiting_pid but waiting_gpid (see: citusdata/citus#5766) which causes this issue when trying to query the table:

pgmetrics: warning: citus_lock_waits query failed: pq: column "waiting_pid" does not exist

Would it be possible to get these models updated for the most recent major version of Citus?

get all database details without specifying dbnames?

is there a way to get the detailed, per database report for the whole cluster without the need to specify a list of dbnames?

support PG16

it report following with PG16:

pgmetrics: pg_stat_activity query failed: pq: invalid input syntax for type xid: ""

it works well on PG15.

Problem exporting pg_stat_progress_vacuum table

./pgmetrics_1.4.0_linux_amd64/pgmetrics --host=127.0.0.1 --port=5432

pgmetrics: pg_stat_progress_vacuum query failed: sql: Scan error on column index 1: converting driver.Value type <nil> ("<nil>") to a int: invalid syntax

postgres=> SELECT * from pg_stat_progress_vacuum;
  pid  | datid | datname  | relid | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples 
-------+-------+----------+-------+-------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
 14048 | 12996 | postgres |       |       |                 |                   |                    |                    |                 |                
 14058 | 12996 | postgres |       |       |                 |                   |                    |                    |                 |                
 14069 | 12996 | postgres |       |       |                 |                   |                    |                    |                 |

stack

Postgres 9.6 on Google Cloud SQL

Missing (interesting) stats

Congratulations on 1.0!

I saw your release, and I figured I'd take a look at what I have in PgZabbix that I'd really miss, in order to replace my collection with the one you're doing here. ( If I can maintain less code, that's good for me)

Now, before I sit down and butcher something in Go, I figured I'd chat a bit about it to see:

Insert / Update / Fetch / Delete stats

Rationale: We're monitoring on insertion /update rates in the DB level to track anomalies in production environments. Our loads are usually predictable with some time-based variance, causing this to be a very reliable predictor of anything from a peer net outage to application services not working.

Slow query count

Rationale: A growing amount of queries over a certain threshold of speed is usually the sign of something being wrong.

We're separating this into select like queries, all queries, and dml queries (not "insert/update/delete")

Connections used for replication

Actually, we use two methods here, both tracking connections per application, and connections used for replication.
We're checking the medium term number of connections used here to track backups (should run), and state of applications.

If there's interest here, I'll look at moving over my own code to instead integrate directly with pgmetrics, it seems like a nice way to learn Go.

Improve heuristic to check if PG is local

If PostgreSQL is listening on loopback interface, but not necessarily on 127.0.0.1 then

pgmetrics/collector/collect.go

Line 625 in 1d87609

func (c *collector) getLocal() {

fails to detect that PG is actually local.
E.g. if you have /etc/hosts with

127.0.0.1 localhost
127.0.1.1 pg

and you configure PG with

listen_addresses = 'pg'

then PG will listen on 127.0.1.1 but the TCP connection to 127.0.1.1 will appear from 127.0.0.1, i.e. inet_client_addr() will be 127.0.0.1 and inet_server_addr() will be 127.0.1.1.

Maybe a special case for 127.0.0.0/8 would be useful here?

pg_stat_archiver query failed: sql: Scan error on column index 6, name "date_part": converting NULL to int64 is unsupported

After upgrading to AWS RDS Aurora Postgres 11.11, I started getting this error:

pg_stat_archiver query failed: sql: Scan error on column index 6, name "date_part": converting NULL to int64 is unsupported

I was able to fix it with this patch on v1.10.5:

diff --git a/collector/collect.go b/collector/collect.go
index fe4e4f4..262ee19 100644
--- a/collector/collect.go
+++ b/collector/collect.go
@@ -611,7 +611,7 @@ func (c *collector) getWALArchiver() {
 			failed_count,
 			COALESCE(last_failed_wal, ''),
 			COALESCE(EXTRACT(EPOCH FROM last_failed_time)::bigint, 0),
-			EXTRACT(EPOCH FROM stats_reset)::bigint
+			COALESCE(EXTRACT(EPOCH FROM stats_reset)::bigint, 0)
 		  FROM pg_stat_archiver`
 	a := &c.result.WALArchiving
 	if err := c.db.QueryRowContext(ctx, q).Scan(&a.ArchivedCount, &a.LastArchivedWAL,

If anyone else needs to quickly get this fixed before a new release is available that fixes it, here's a Dockerfile snippet to help out building a patched version of 1.10.5.

FROM golang:1.15-buster as pgmetrics_builder

RUN git clone -b v1.10.5 https://github.com/rapidloop/pgmetrics.git
COPY pgmetrics-fix-wal-archiver-stats.patch ./
RUN cd pgmetrics \
    && git apply ../pgmetrics-fix-wal-archiver-stats.patch \
    && CGO_ENABLED=0 go build -ldflags '-s -w -X main.version=1.10.5-patched -extldflags "-static"' -trimpath ./cmd/pgmetrics


FROM your_image

COPY --from=pgmetrics_builder /go/pgmetrics/pgmetrics ./

pg_available_extensions query failed

mc3$ PGSSLMODE=require pgmetrics -h db4 -w -f human -t 60 operations
pgmetrics: pg_available_extensions query failed: sql: Scan error on column index 4, name "comment": unsupported Scan, storing driver.Value type into type *string
mc3$ psql -h db4 postgres
Timing is on.
SET
Time: 7,236 ms
search_path

erdb, syslog, pki, dd, ext_ip4r, archiveopteryx, public
(1 row)

Time: 7,388 ms
psql (10.5, server 10.6)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

postgres=#

rapidloop / pgmetrics Goto Github PK

pgmetrics's People

Contributors

Stargazers

Watchers

Forkers

pgmetrics's Issues

ajr$ psql -h db2 -p 2222 Timing is on. SET Time: 0,578 ms search_path

Insert / Update / Fetch / Delete stats

Slow query count

Connections used for replication

Recommend Projects

Recommend Topics

Recommend Org

ajr$ psql -h db2 -p 2222
Timing is on.
SET
Time: 0,578 ms
search_path