
postgres-flex's Introduction


High Availability Postgres on Fly.io

This repo contains all the code and configuration necessary to run a highly available Postgres cluster in a Fly.io organization's private network. This source is packaged into Docker images which allow you to track and upgrade versions cleanly as new features are added.

Getting started

# Be sure you're running the latest version of flyctl.
fly version update

# Provision a 3 member cluster
fly pg create --name <app-name> --initial-cluster-size 3 --region ord --flex

High Availability

For HA, it's recommended that you run at least 3 members within your primary region. Automatic failovers will only consider members residing within the primary region. The primary region is defined by the PRIMARY_REGION environment variable within the fly.toml file.
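
One way to confirm which region a cluster treats as primary is to read that variable from a running member (the -C flag runs a single command over SSH):

# Print the primary region from a running member
fly ssh console --app <app-name> -C "printenv PRIMARY_REGION"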

Horizontal scaling

Use the clone command to scale up your cluster.

# List your active Machines
fly machines list --app <app-name>

# Clone a machine into a target region
fly machines clone <machine-id> --region <target-region>

Staying up-to-date!

This project is in active development so it's important to stay current with the latest changes and bug fixes.

# Use the following command to verify you're on the latest version.
fly image show --app <app-name>

# Update your Machines to the latest version.
fly image update --app <app-name>

TimescaleDB support

We currently maintain a separate TimescaleDB-enabled image that you can specify at provision time.

fly pg create --image-ref flyio/postgres-flex-timescaledb:15
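
Note that provisioning the image doesn't enable the extension in your databases. A minimal follow-up, assuming the image already includes timescaledb in shared_preload_libraries:

# Connect to your database
fly pg connect -a <app-name>

# ...then at the psql prompt:
#   CREATE EXTENSION IF NOT EXISTS timescaledb;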

Having trouble?

Create an issue or ask a question here: https://community.fly.io/

Contributing

If you're looking to get involved, fork the project and send pull requests.

postgres-flex's People

Contributors

billyb2, dalperin, davissp14, ederene20, lubien, michaeldwan, smorimoto


postgres-flex's Issues

Updating `--shared-preload-libraries` is highly sensitive to formatting

It's currently far too easy for someone to blow away the repmgr reference when updating shared-preload-libraries. When a user needs to specify multiple values, it's also important that they wrap the value in single quotes.

For example:

This formatting will lead to breakage:

fly pg config update -a app-db --shared-preload-libraries "repmgr,pg_stat_statements"

This is the correct formatting:

fly pg config update -a app-db --shared-preload-libraries "'repmgr,pg_stat_statements'"

There are a couple of improvements we need to make here (see the sketch after this list):

  1. repmgr should never be removed. If the user attempts to remove it, we should just include the reference by default.
  2. We should verify the value has the proper formatting before sending it to the VM.
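
A minimal normalization sketch covering both points; normalize_spl is a hypothetical helper, not part of the current codebase:

# Hypothetical helper: re-add repmgr if dropped, then wrap in single quotes
normalize_spl() {
  libs="$1"
  case "$libs" in
    *repmgr*) ;;                    # repmgr already present, keep as-is
    *) libs="repmgr,${libs}" ;;     # re-add repmgr by default
  esac
  printf "'%s'\n" "$libs"           # enforce single-quote wrapping
}

normalize_spl "pg_stat_statements"  # => 'repmgr,pg_stat_statements'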

Add a separate Dockerfile that provides TimescaleDB support

Given that TimescaleDB support for new PG versions can be heavily delayed, we do not want to couple the Timescale extension with our base image. We should consider adding a separate Dockerfile that's built and maintained separately, which will make it easier to separate the upgrade paths.

DB Size metric is showing the aggregate size across all members.

The DB Size metric, as well as other metrics on the Fly metrics dashboard, currently conveys aggregates across all the Machines associated with a given app. This is obviously not correct and is confusing for the end user. I believe the primary problem is that the metrics dashboard was originally designed to convey app-wide metrics as opposed to machine-level metrics. I'm not sure this can be solved until we figure out how to address the metrics dashboard for V2 apps.

@lubien do you have any ideas on how we could address this?

Add health check that monitors global locks

There are currently two different locks that need to be monitored and alerted on:

  1. Readonly.lock
    The cluster has been made read-only. The most common reason for this would be because the primary is nearing disk capacity. There's an existing capacity check that runs on the primary, but it's a bit of a hack.

  2. Zombie.lock
    This member has been fenced as we are unable to confirm that the booting/running primary is the actual primary.

Reference:
https://github.com/fly-apps/postgres-flex/blob/master/docs/fencing.md

The alerting should make it obvious what's going on with the user's cluster, and we should have supporting documentation to enable users to address the issue on their own.
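
A minimal sketch of what such a check could look like; the lock file paths under /data are assumptions based on the fencing doc, not confirmed:

# Hypothetical health check: fail loudly if a global lock is present
for lock in /data/readonly.lock /data/zombie.lock; do
  if [ -f "$lock" ]; then
    echo "CRITICAL: $lock present - see docs/fencing.md" >&2
    exit 1
  fi
done
echo "OK: no global locks held"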

Move event handling to internal API

The event handler that repmgr calls is currently a standalone file. This works, but it's not great from a logging perspective, as all logs associated with events are currently piped to an events.log file rather than to the app logs.

Standby fails to rejoin when associated replication slot has been removed

The Standby is not able to recover properly once the replication slot has been removed.

2023-02-04T22:28:59Z app[9080563f609108] ord [info]postgres        | 2023-02-04 22:28:59.448 UTC [692] STATEMENT:  START_REPLICATION SLOT "repmgr_slot_1094506454" 0/8000000 TIMELINE 1
2023-02-04T22:28:59Z app[9185e0ef4712d8] ord [info]postgres  | 2023-02-04 22:28:59.449 UTC [578] LOG:  waiting for WAL to become available at 0/8002000
2023-02-04T22:29:04Z app[9185e0ef4712d8] ord [info]postgres  | 2023-02-04 22:29:04.449 UTC [632] FATAL:  could not start WAL streaming: ERROR:  replication slot "repmgr_slot_1094506454" does not exist
2023-02-04T22:29:04Z app[9080563f609108] ord [info]postgres        | 2023-02-04 22:29:04.449 UTC [704] ERROR:  replication slot "repmgr_slot_1094506454" does not exist
2023-02-04T22:29:04Z app[9185e0ef4712d8] ord [info]postgres  | 2023-02-04 22:29:04.450 UTC [578] LOG:  waiting for WAL to become available at 0/8002000
2023-02-04T22:29:04Z app[9080563f609108] ord [info]postgres        | 2023-02-04 22:29:04.449 UTC [704] STATEMENT:  START_REPLICATION SLOT "repmgr_slot_1094506454" 0/8000000 TIMELINE 1
2023-02-04T22:29:09Z app[9185e0ef4712d8] ord [info]postgres  | 2023-02-04 22:29:09.453 UTC [635] FATAL:  could not start WAL streaming: ERROR:  replication slot "repmgr_slot_1094506454" does not exist
2023-02-04T22:29:09Z app[9185e0ef4712d8] ord [info]postgres  | 2023-02-04 22:29:09.453 UTC [578] LOG:  waiting for WAL to become available at 0/8002000
2023-02-04T22:29:09Z app[9080563f609108] ord [info]postgres        | 2023-02-04 22:29:09.453 UTC [713] ERROR:  replication slot "repmgr_slot_1094506454" does not exist
2023-02-04T22:29:09Z app[9080563f609108] ord [info]postgres        | 2023-02-04 22:29:09.453 UTC [713] STATEMENT:  START_REPLICATION SLOT "repmgr_slot_1094506454" 0/8000000 TIMELINE 1
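
One possible manual workaround, run from psql on the primary; recreating the slot by hand is an assumption on my part (repmgr's own rejoin tooling may be the better path):

# Inspect existing replication slots
psql -U postgres -c "SELECT slot_name, active FROM pg_replication_slots;"

# Recreate the missing slot so the standby can resume streaming
psql -U postgres -c "SELECT pg_create_physical_replication_slot('repmgr_slot_1094506454');"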

Update Fly.io Postgres docs

There are a few things in Fly's PG docs that need to be updated to reflect the new state of reality.

  1. Docs should include a link to this repo.
  2. TimescaleDB guidance needs to be updated.
  3. ...

Fly pg commands should verify cluster health beforehand

Any fly pg ... commands that require a cluster restart should verify cluster health before running. If there's a failing health check on a standby, for example, the restart process will never complete.

This can be particularly frustrating if the failing health check is a VM check. We may need to figure out a way to ignore VM checks when waiting for the member to become healthy.
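
Until that lands, a manual pre-flight check is possible with stock flyctl:

# Review health checks across all Machines before triggering a restart
fly checks list --app <app-name>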

Support repmgr witness

The witness is a lightweight component that acts as a voting member in a given cluster. It doesn't hold user data; its only purpose is to act as a stand-in so quorum can be met.

As an example, say we have a 2 member setup running within the same region.

Node A ( primary )
Node B

If Node B goes down, we have to assume a potential network partition and turn the primary read-only.

This is where the witness comes in:

Node A ( primary )
Node B
Node C ( witness )

If Node B goes down, we can continue functioning as we can now interact with the majority of the cluster.

Why would I run a witness over a 3rd member?

You should probably just run a 3rd member. The witness is really just a budget-friendly way to achieve HA.
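
For reference, witness support is built into repmgr itself; a registration sketch, where the config path and primary host are placeholders, not confirmed for this image:

# Run on the witness node, pointing it at the current primary
repmgr -f /etc/repmgr.conf witness register -h <primary-host>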

Internal connection audit

All internal connections that are initiated by this codebase should be established using the flypgadmin credentials. This will allow users to clearly differentiate their connections from ours.

Better Username Generation in Flyctl when attaching to an app

Currently, flyctl just uses the attached app's name to generate a database username. This works well when launching an app with a unique name. However, if a user tries to reattach the database to an app with the same name as the database, the attach will fail. This can happen when a user deletes an app, keeps the database, then relaunches the app. My current fix for that specific case is to create a new user with the format "{app_name}-{random_number}", which more or less guarantees a unique username. Could this cause any issues that I can't currently see? Should this naming convention be applied more generally when creating/attaching a database to an app (assuming a username isn't already specified by the user, of course)?
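
A trivial illustration of the proposed format; this is illustrative shell, not the flyctl implementation:

# Generate a unique-ish database username from an app name
APP_NAME="my-app"
echo "${APP_NAME}-$((RANDOM % 100000))"   # e.g. my-app-48213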

Split-brain during in-region network partition

This setup is currently susceptible to split-brain in the event of a network partition when all members remain up.

Here's an example 3 member setup:
Member A ( primary )
Member B
Member C

If a network partition happens that separates Member A from B and C: A | B, C

Member A will continue to be the primary, and a new election will be initiated between B and C, since Member A will appear down from their perspective. The end result is a split-brain.

Why isn't this a problem across regions?
The primary is restricted to the region set by the PRIMARY_REGION environment variable. If a network partition happens between regions, the non-primary region will not hold an election.

What happens if members are restarted?
If Member A is rebooted after the network partition has been established, it will become a zombie primary and go readonly. How this works is described here: #49

WAL Archiver/PITR/Disaster recovery (PG Barman Support)

Add support for PG Barman

This should be fairly straightforward to support.

  1. PG Barman will need to run in its own Machine.
  2. It will need to be fronted with haproxy just like a normal member so connections that hit this Machine will still be routed correctly.
  3. The Init/Post Init process will need to be adjusted such that only haproxy and Barman are initialized.

Why Barman over something like WAL-G?

  1. Repmgr works with Barman out of the box.
  2. WAL-G is expected to run next to the main database, which can lead to resource contention.
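
Once a Barman Machine exists, day-to-day operation is the standard Barman CLI; <server-name> is a placeholder for whatever the Barman config names this cluster:

# Verify Barman can reach the server and the WAL archive is healthy
barman check <server-name>

# Take a base backup
barman backup <server-name>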

Turn primary readonly when disk capacity reaches 90%

In order to prevent users from running their disks into the ground, we need to take some precautions to prevent disks from reaching 100%.

In general, disk capacity isn't a problem we really want to concern ourselves with long term; however, until we have a proper alerting system in place, we will need to compromise a bit.
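
A minimal sketch of such a guard using stock tooling; the /data mount point and the psql invocation are assumptions:

# Hypothetical guard: flip the cluster read-only at >= 90% disk usage
USAGE=$(df --output=pcent /data | tail -1 | tr -dc '0-9')
if [ "$USAGE" -ge 90 ]; then
  psql -U postgres -c "ALTER SYSTEM SET default_transaction_read_only = on;" \
                   -c "SELECT pg_reload_conf();"
fi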

Config synchronization from Consul should not impact the boot process

When a member is rebooted, we work to sync any user-defined configuration from Consul. If Consul is not available, this process fails, and the boot process fails with it.

We should be loud about this kind of failure, but make sure it doesn't fail the boot process. In general, this config sync is only required on boot if the Machine was down at the time the config was originally applied.
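
A sketch of the desired behavior; sync_config_from_consul is a hypothetical stand-in for the real sync step:

# Warn loudly on a failed sync, but let the boot continue
if ! sync_config_from_consul; then
  echo "WARNING: failed to sync user config from Consul; continuing boot" >&2
fi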

Feature Stability Checklist

List of features and bits of functionality that need to be confirmed working as intended. Please create a separate issue for any bugs discovered and link it to this issue.

Flyctl

Basics

  • Provisioning via fly pg create --repmgr ...
  • Restore from a Primary: fly pg create --repmgr --snapshot-id ...
  • Restore from a Standby: fly pg create --repmgr --snapshot-id ...
  • Horizontal scaling: fly machines clone <machine-id>
  • Image updates: fly image update ... (#75)
  • Member should be unregistered on machine removal. fly machines stop <machine-id> ... fly machines remove <machine-id>

Administrative

  • fly pg attach
  • fly pg detach
  • fly pg connect
  • fly pg restart
  • fly pg users list
  • fly pg db list
  • fly pg failover. ( Not yet available )
  • fly pg config show. ( fixed by #74 )
  • fly pg config update ( fixed by #74 )

Disk capacity

  • Primary should automatically go readonly when disk capacity > 90%.
  • Primary should automatically return to read/write once disk capacity is brought back below 90%.

Note: You can test this by ssh'ing to the primary and using fallocate to fill the disk.

# See current usage/capacity
df -h

# Use fallocate to fill up your disk and see how the cluster responds.
fallocate -l 100M /data/filename 
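
To exercise the recovery case, free the space and confirm usage drops back below the threshold:

# Remove the filler file and re-check usage
rm /data/filename
df -h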

Quorum

  • Primary should automatically go readonly when quorum is lost (e.g. shutting down the majority of nodes in a cluster).
  • Primary should go read-only in the event of a network partition where all nodes stay up.

Automatic member unregistration fails if primary role cannot be resolved

Example two member setup:
Member A ( primary )
Member B

Steps to reproduce

  1. Member B is stopped
  2. Member A goes readonly since quorum cannot be met.
  3. Member B is removed via flyctl
  4. The unregistration process fails, since flyctl cannot find the primary.

Manual fix

  1. SSH into a running VM
  2. su - postgres (the login shell drops you into the postgres home directory)
  3. repmgr daemon status
  4. repmgr standby unregister --node-id <node-id>

Documentation tasks

  • Document quorum requirements.
  • Document how to recover a fenced primary.
  • Document how to recover when cluster goes read-only due to capacity.
  • Document how to manually unregister a member in the event flyctl fails to unregister it on removal.
  • Document how to view repmgr events via repmgr cluster event.
  • Document Stolon -> Repmgr migration.

Leverage repmgr's built-in health checks

repmgr node check

Node "fdaa:0:2e26:a7b:c850:fa7a:5336:2":
	Server role: OK (node is primary)
	Replication lag: OK (N/A - node is primary)
	WAL archiving: OK (0 pending archive ready files)
	Upstream connection: OK (N/A - node is primary)
	Downstream servers: OK (2 of 2 downstream nodes attached)
	Replication slots: OK (2 of 2 physical replication slots are active)
	Missing physical replication slots: OK (node has no missing physical replication slots)
	Configured data directory: OK (configured "data_directory" is "/data/postgresql")

Removing a zombie primary fails from Flyctl

The state of the member needs to be confirmed before we attempt to unregister the node. E.g., we can't run repmgr standby unregister if the metadata claims it's a non-active primary.
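
A sketch of the safer manual path, assuming SSH access to a healthy member; both subcommands are standard repmgr:

# Confirm the node's recorded role before unregistering
repmgr cluster show

# For a node recorded as an (inactive) primary
repmgr primary unregister --node-id <node-id>

# For a node recorded as a standby
repmgr standby unregister --node-id <node-id>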

Member unregistration errors when member already unregistered.

$ fly machines remove 5683256c73258e
machine 5683256c73258e was found and is currently in stopped state, attempting to destroy...
unregistering postgres member 'fdaa:0:2e26:a7b:7d17:4463:955d:2' from the cluster... (failed)
failed to unregister postgres member: failed to resolve member: no rows in result set
(success)
5683256c73258e has been destroyed

If the member has already been unregistered, then we don't need to return an error. Also, there's clearly an issue with the output, since (success) is indicated alongside the failure.

DeadmemberTick failing with `no rows in result set`

Looks like this bug was introduced when we changed how member IDs are generated. The member ID is no longer deterministically generated, so we need to pull it from the config after member initialization.
