
postgres-flex's Introduction


High Availability Postgres on Fly.io

This repo contains all the code and configuration necessary to run a highly available Postgres cluster in a Fly.io organization's private network. This source is packaged into Docker images which allow you to track and upgrade versions cleanly as new features are added.

Getting started

# Be sure you're running the latest version of flyctl.
fly version update

# Provision a 3 member cluster
fly pg create --name <app-name> --initial-cluster-size 3 --region ord --flex

High Availability

For HA, it's recommended that you run at least 3 members within your primary region. Automatic failovers will only consider members residing within the primary region. The primary region is defined by the PRIMARY_REGION environment variable within the fly.toml file.
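
One way to confirm which region a cluster treats as primary is to read that variable from a running member (the -C flag runs a single command over SSH):

# Print the primary region from a running member
fly ssh console --app <app-name> -C "printenv PRIMARY_REGION"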

Horizontal scaling

Use the clone command to scale up your cluster.

# List your active Machines
fly machines list --app <app-name>

# Clone a machine into a target region
fly machines clone <machine-id> --region <target-region>

Staying up-to-date!

This project is in active development so it's important to stay current with the latest changes and bug fixes.

# Use the following command to verify you're on the latest version.
fly image show --app <app-name>

# Update your Machines to the latest version.
fly image update --app <app-name>

TimescaleDB support

We currently maintain a separate TimescaleDB-enabled image that you can specify at provision time.

fly pg create --image-ref flyio/postgres-flex-timescaledb:15
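
Note that provisioning the image doesn't enable the extension in your databases. A minimal follow-up, assuming the image already includes timescaledb in shared_preload_libraries:

# Connect to your database
fly pg connect -a <app-name>

# ...then at the psql prompt:
#   CREATE EXTENSION IF NOT EXISTS timescaledb;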

Having trouble?

Create an issue or ask a question here: https://community.fly.io/

Contributing

If you're looking to get involved, fork the project and send pull requests.

postgres-flex's People

Contributors

billyb2, dalperin, davissp14, ederene20, lubien, michaeldwan, smorimoto


postgres-flex's Issues

Updating `--shared-preload-libraries` is highly sensitive to formatting

It's currently far too easy for someone to blow away the repmgr reference when updating shared-preload-libraries. When a user needs to specify multiple values, it's also important that they wrap the value in single quotes.

For example:

This formatting will lead to breakage:

fly pg config update -a app-db --shared-preload-libraries "repmgr,pg_stat_statements"

This is the correct formatting:

fly pg config update -a app-db --shared-preload-libraries "'repmgr,pg_stat_statements'"

There are a couple of improvements we need to make here (see the sketch after this list):

  1. repmgr should never be removed. If the user attempts to remove it, we should just include the reference by default.
  2. We should verify the value has the proper formatting before sending it to the VM.
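
A minimal normalization sketch covering both points; normalize_spl is a hypothetical helper, not part of the current codebase:

# Hypothetical helper: re-add repmgr if dropped, then wrap in single quotes
normalize_spl() {
  libs="$1"
  case "$libs" in
    *repmgr*) ;;                    # repmgr already present, keep as-is
    *) libs="repmgr,${libs}" ;;     # re-add repmgr by default
  esac
  printf "'%s'\n" "$libs"           # enforce single-quote wrapping
}

normalize_spl "pg_stat_statements"  # => 'repmgr,pg_stat_statements'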

Add a separate Dockerfile that provides TimescaleDB support

Given that TimescaleDB support for new PG versions can be heavily delayed, we do not want to couple the Timescale extension with our base image. We should consider adding a separate Dockerfile that's built and maintained separately, which will make it easier to separate the upgrade paths.

DB Size metric is showing the aggregate size across all members.

The DB Size metric, as well as other metrics on the Fly metrics dashboard, currently conveys aggregates across all the Machines associated with a given app. This is obviously not correct and is confusing for the end user. I believe the primary problem is that the metrics dashboard was originally designed to convey app-wide metrics as opposed to machine-level metrics. I'm not sure this can be solved until we figure out how to address the metrics dashboard for V2 apps.

@lubien do you have any ideas on how we could address this?

Add health check that monitors global locks

There are currently two different locks that need to be monitored and alerted on:

  1. Readonly.lock
    The cluster has been made read-only. The most common reason for this would be because the primary is nearing disk capacity. There's an existing capacity check that runs on the primary, but it's a bit of a hack.

  2. Zombie.lock
    This member has been fenced as we are unable to confirm that the booting/running primary is the actual primary.

Reference:
https://github.com/fly-apps/postgres-flex/blob/master/docs/fencing.md

The alerting should make it obvious what's going on with the user's cluster, and we should have supporting documentation to enable users to address the issue on their own.
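
A minimal sketch of what such a check could look like; the lock file paths under /data are assumptions based on the fencing doc, not confirmed:

# Hypothetical health check: fail loudly if a global lock is present
for lock in /data/readonly.lock /data/zombie.lock; do
  if [ -f "$lock" ]; then
    echo "CRITICAL: $lock present - see docs/fencing.md" >&2
    exit 1
  fi
done
echo "OK: no global locks held"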

Move event handling to internal API

The event handler that repmgr calls is currently a standalone file. This works, but it's not great from a logging perspective, as all logs associated with events are currently piped to an events.log file rather than to the app logs.

Standby fails to rejoin when associated replication slot has been removed

The Standby is not able to recover properly once the replication slot has been removed.

2023-02-04T22:28:59Z app[9080563f609108] ord [info]postgres        | 2023-02-04 22:28:59.448 UTC [692] STATEMENT:  START_REPLICATION SLOT "repmgr_slot_1094506454" 0/8000000 TIMELINE 1
2023-02-04T22:28:59Z app[9185e0ef4712d8] ord [info]postgres  | 2023-02-04 22:28:59.449 UTC [578] LOG:  waiting for WAL to become available at 0/8002000
2023-02-04T22:29:04Z app[9185e0ef4712d8] ord [info]postgres  | 2023-02-04 22:29:04.449 UTC [632] FATAL:  could not start WAL streaming: ERROR:  replication slot "repmgr_slot_1094506454" does not exist
2023-02-04T22:29:04Z app[9080563f609108] ord [info]postgres        | 2023-02-04 22:29:04.449 UTC [704] ERROR:  replication slot "repmgr_slot_1094506454" does not exist
2023-02-04T22:29:04Z app[9185e0ef4712d8] ord [info]postgres  | 2023-02-04 22:29:04.450 UTC [578] LOG:  waiting for WAL to become available at 0/8002000
2023-02-04T22:29:04Z app[9080563f609108] ord [info]postgres        | 2023-02-04 22:29:04.449 UTC [704] STATEMENT:  START_REPLICATION SLOT "repmgr_slot_1094506454" 0/8000000 TIMELINE 1
2023-02-04T22:29:09Z app[9185e0ef4712d8] ord [info]postgres  | 2023-02-04 22:29:09.453 UTC [635] FATAL:  could not start WAL streaming: ERROR:  replication slot "repmgr_slot_1094506454" does not exist
2023-02-04T22:29:09Z app[9185e0ef4712d8] ord [info]postgres  | 2023-02-04 22:29:09.453 UTC [578] LOG:  waiting for WAL to become available at 0/8002000
2023-02-04T22:29:09Z app[9080563f609108] ord [info]postgres        | 2023-02-04 22:29:09.453 UTC [713] ERROR:  replication slot "repmgr_slot_1094506454" does not exist
2023-02-04T22:29:09Z app[9080563f609108] ord [info]postgres        | 2023-02-04 22:29:09.453 UTC [713] STATEMENT:  START_REPLICATION SLOT "repmgr_slot_1094506454" 0/8000000 TIMELINE 1
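
One possible manual workaround, run from psql on the primary; recreating the slot by hand is an assumption on my part (repmgr's own rejoin tooling may be the better path):

# Inspect existing replication slots
psql -U postgres -c "SELECT slot_name, active FROM pg_replication_slots;"

# Recreate the missing slot so the standby can resume streaming
psql -U postgres -c "SELECT pg_create_physical_replication_slot('repmgr_slot_1094506454');"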

Update Fly.io Postgres docs

There are a few things in Fly's PG docs that need to be updated to reflect the new state of reality.

  1. Docs should include a link to this repo.
  2. TimescaleDB guidance needs to be updated.
  3. ...

Fly pg commands should verify cluster health beforehand

Any fly pg ... commands that require a cluster restart should verify cluster health before running. If there's a failing health check on a standby, for example, the restart process will never complete.

This can be particularly frustrating if the failing health check is a VM check. We may need to figure out a way to ignore VM checks when waiting for the member to become healthy.
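
Until that lands, a manual pre-flight check is possible with stock flyctl:

# Review health checks across all Machines before triggering a restart
fly checks list --app <app-name>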

Support repmgr witness

The witness is a lightweight component that acts as a voting member in a given cluster. It doesn't hold user data; its only purpose is to act as a stand-in so quorum can be met.

As an example, say we have a 2 member setup running within the same region.

Node A ( primary )
Node B

If Node B goes down, we have to assume a potential network partition and turn the primary read-only.

This is where the witness comes in:

Node A ( primary )
Node B
Node C ( witness )

If Node B goes down, we can continue functioning as we can now interact with the majority of the cluster.

Why would I run a witness over a 3rd member?

You should probably just run a 3rd member. The witness is really just a budget-friendly way to achieve HA.
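
For reference, witness support is built into repmgr itself; a registration sketch, where the config path and primary host are placeholders, not confirmed for this image:

# Run on the witness node, pointing it at the current primary
repmgr -f /etc/repmgr.conf witness register -h <primary-host>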

Internal connection audit

All internal connections that are initiated by this codebase should be established using the flypgadmin credentials. This will allow users to clearly differentiate their connections from ours.

Better Username Generation in Flyctl when attaching to an app

Currently, flyctl just uses the attached app's name to generate a database username. This works well when launching an app with a unique name. However, if a user tries to reattach the database to an app with the same name as the database, the attach will fail. This can happen when a user deletes an app, keeps the database, then relaunches the app. My current fix for that specific case is to create a new user with the format "{app_name}-{random_number}", which more or less guarantees a unique username. Could this cause any issues that I can't currently see? Should this naming convention be applied more generally when creating/attaching a database to an app (assuming a username isn't already specified by the user, of course)?
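
A trivial illustration of the proposed format; this is illustrative shell, not the flyctl implementation:

# Generate a unique-ish database username from an app name
APP_NAME="my-app"
echo "${APP_NAME}-$((RANDOM % 100000))"   # e.g. my-app-48213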

Split-brain during in-region network partition

This setup is currently susceptible to split-brain in the event of a network partition when all members remain up.

Here's an example 3 member setup:
Member A ( primary )
Member B
Member C

If a network partition happens that separates Member A from B and C: A | B, C

Member A will continue to be the primary, and a new election will be initiated between B and C, since Member A will appear down from their perspective. The end result is a split-brain.

Why isn't this a problem across regions?
The primary is restricted to the region set by the PRIMARY_REGION environment variable. If a network partition happens between regions, the non-primary region will not hold an election.

What happens if members are restarted?
If Member A is rebooted after the network partition has been established, it will become a zombie primary and go readonly. How this works is described here: #49

WAL Archiver/PITR/Disaster recovery (PG Barman Support)

Add support for PG Barman

This should be fairly straightforward to support.

  1. PG Barman will need to run in its own Machine.
  2. It will need to be fronted with haproxy just like a normal member so connections that hit this Machine will still be routed correctly.
  3. The Init/Post Init process will need to be adjusted such that only haproxy and Barman are initialized.

Why Barman over something like WAL-G?

  1. Repmgr works with Barman out of the box.
  2. WAL-G is expected to run next to the main database, which can lead to resource contention.
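
Once a Barman Machine exists, day-to-day operation is the standard Barman CLI; <server-name> is a placeholder for whatever the Barman config names this cluster:

# Verify Barman can reach the server and the WAL archive is healthy
barman check <server-name>

# Take a base backup
barman backup <server-name>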

Turn primary readonly when disk capacity reaches 90%

In order to prevent users from running their disks into the ground, we need to take some precautions to prevent disks from reaching 100%.

In general, disk capacity isn't a problem we really want to concern ourselves with long term; however, until we have a proper alerting system in place, we will need to compromise a bit.
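
A minimal sketch of such a guard using stock tooling; the /data mount point and the psql invocation are assumptions:

# Hypothetical guard: flip the cluster read-only at >= 90% disk usage
USAGE=$(df --output=pcent /data | tail -1 | tr -dc '0-9')
if [ "$USAGE" -ge 90 ]; then
  psql -U postgres -c "ALTER SYSTEM SET default_transaction_read_only = on;" \
                   -c "SELECT pg_reload_conf();"
fi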

Config synchronization from Consul should not impact the boot process

When a member is rebooted, we work to sync any user-defined configuration from Consul. If Consul is not available, this process fails, and the boot process fails with it.

We should be loud about this kind of failure, but make sure it doesn't fail the boot process. In general, this config sync is only required on boot if the Machine was down at the time the config was originally applied.
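
A sketch of the desired behavior; sync_config_from_consul is a hypothetical stand-in for the real sync step:

# Warn loudly on a failed sync, but let the boot continue
if ! sync_config_from_consul; then
  echo "WARNING: failed to sync user config from Consul; continuing boot" >&2
fi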

Feature Stability Checklist

List of features and bits of functionality that need to be confirmed working as intended. Please create a separate issue for any bugs discovered and link it to this issue.

Flyctl

Basics

  • Provisioning via fly pg create --repmgr ...
  • Restore from a Primary: fly pg create --repmgr --snapshot-id ...
  • Restore from a Standby: fly pg create --repmgr --snapshot-id ...
  • Horizontal scaling: fly machines clone <machine-id>
  • Image updates: fly image update ... (#75)
  • Member should be unregistered on machine removal. fly machines stop <machine-id> ... fly machines remove <machine-id>

Administrative

  • fly pg attach
  • fly pg detach
  • fly pg connect
  • fly pg restart
  • fly pg users list
  • fly pg db list
  • fly pg failover. ( Not yet available )
  • fly pg config show. ( fixed by #74 )
  • fly pg config update ( fixed by #74 )

Disk capacity

  • Primary should automatically go readonly when disk capacity > 90%.
  • Primary should automatically return to read/write once disk capacity is brought back below 90%.

Note: You can test this by ssh'ing to the primary and using fallocate to fill the disk.

# See current usage/capacity
df -h

# Use fallocate to fill up your disk and see how the cluster responds.
fallocate -l 100M /data/filename 
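
To exercise the recovery case, free the space and confirm usage drops back below the threshold:

# Remove the filler file and re-check usage
rm /data/filename
df -h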

Quorum

  • Primary should automatically go readonly when quorum is lost (e.g. shutting down the majority of nodes in a cluster).
  • Primary should go read-only in the event of a network partition where all nodes stay up.

Automatic member unregistration fails if primary role cannot be resolved

Example two member setup:
Member A ( primary )
Member B

Steps to reproduce

  1. Member B is stopped
  2. Member A goes readonly since quorum cannot be met.
  3. Member B is removed via flyctl
  4. The unregistration process fails, since flyctl cannot find the primary.

Manual fix

  1. SSH into a running VM
  2. su - postgres (the login shell drops you into the postgres home directory)
  3. repmgr daemon status
  4. repmgr standby unregister --node-id <node-id>

Documentation tasks

  • Document quorum requirements.
  • Document how to recover a fenced primary.
  • Document how to recover when cluster goes read-only due to capacity.
  • Document how to manually unregister a member in the event flyctl fails to unregister it on removal.
  • Document how to view repmgr events via repmgr cluster event.
  • Document Stolon -> Repmgr migration.

Leverage repmgr's built-in health checks

repmgr node check

Node "fdaa:0:2e26:a7b:c850:fa7a:5336:2":
	Server role: OK (node is primary)
	Replication lag: OK (N/A - node is primary)
	WAL archiving: OK (0 pending archive ready files)
	Upstream connection: OK (N/A - node is primary)
	Downstream servers: OK (2 of 2 downstream nodes attached)
	Replication slots: OK (2 of 2 physical replication slots are active)
	Missing physical replication slots: OK (node has no missing physical replication slots)
	Configured data directory: OK (configured "data_directory" is "/data/postgresql")

Removing a zombie primary fails from Flyctl

The state of the member needs to be confirmed before we attempt to unregister the node. E.g., we can't run repmgr standby unregister if the metadata claims it's a non-active primary.
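
A sketch of the safer manual path, assuming SSH access to a healthy member; both subcommands are standard repmgr:

# Confirm the node's recorded role before unregistering
repmgr cluster show

# For a node recorded as an (inactive) primary
repmgr primary unregister --node-id <node-id>

# For a node recorded as a standby
repmgr standby unregister --node-id <node-id>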

Member unregistration errors when member already unregistered.

$ fly machines remove 5683256c73258e
machine 5683256c73258e was found and is currently in stopped state, attempting to destroy...
unregistering postgres member 'fdaa:0:2e26:a7b:7d17:4463:955d:2' from the cluster... (failed)
failed to unregister postgres member: failed to resolve member: no rows in result set
(success)
5683256c73258e has been destroyed

If the member has already been unregistered, then we don't need to return an error. Also, there's clearly an issue with the output, since (success) is indicated alongside the failure.

DeadmemberTick failing with `no rows in result set`

Looks like this bug was introduced when we changed how member IDs are generated. The member ID is no longer deterministically generated, so we need to pull it from the config after member initialization.
