PostgreSQL high availability: Patroni, etcd and automatic failover
Most teams discover their PostgreSQL database isn't actually highly available the night it goes down. The replica existed, the backups were running — but nobody had planned who promotes the replica, how fast, and how to prevent the old primary from waking up in parallel and corrupting the data.
This article explains what real PostgreSQL HA actually requires: streaming replication, leader election via etcd, automatic failover orchestrated by Patroni — and above all the operational pitfalls nobody mentions until you've lived through them in production.
High availability is not a backup — and not just a replica
Three different mechanisms solve three different problems, and confusing them is the first mistake.
A backup (ideally with Point-In-Time Recovery) protects against data loss or corruption: an unlucky DELETE, a botched migration, a dying disk. It's measured in RPO — how much data you're willing to lose.
A replica is a live copy of your database on another server. It protects against losing a machine, but on its own it decides nothing: if the primary dies, the replica stays patiently read-only, waiting for a human to promote it.
High availability is the automatic continuity of service when the primary disappears. It's measured in RTO — how long your application stays down. It's the mechanism that detects the failure, picks a replica, promotes it, and redirects writes — without waking anyone at 3 AM.
A replica without automatic failover isn't HA: it's a replica plus a pager. And HA without a tested backup isn't protection: a DELETE FROM users replicates to every node in milliseconds. You need all three.
The starting point: streaming replication
PostgreSQL natively replicates its transaction log (the WAL, or Write-Ahead Log) from the primary to one or more replicas. This is the foundation, but the mode you choose has direct consequences.
In asynchronous replication, the primary confirms the transaction to the client without waiting for the replica to acknowledge. It's fast, but if the primary crashes hard, the last transactions that weren't yet replicated are lost: your RPO isn't zero.
In synchronous replication, the primary waits for at least one replica to acknowledge the write before confirming the commit to the client. RPO becomes zero (no confirmed transaction is lost), but every commit pays the cost of a network round-trip, and write availability becomes coupled to the replica's availability.
This choice isn't neutral, and it isn't binary: a well-tuned cluster combines both depending on the node. But whatever the mode, replication still doesn't answer the essential question: who becomes primary, and when?
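In asynchronous mode, the data you stand to lose is the WAL the replica has not yet received. A minimal sketch of how that gap can be quantified, assuming you have already read the two positions with pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on the replica; the helper names and sample values are illustrative, not part of any library:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000148' to an absolute byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high and low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL written on the primary that the replica has not replayed yet."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))


# Hypothetical sample values for illustration only.
lag = replication_lag_bytes("0/3000148", "0/3000060")
print(lag)  # 232
```

A lag that stays near zero means a crash would cost little; a lag that grows under load is your real, measured RPO.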
The real problem: who promotes, and how do you avoid split-brain
When the primary goes down, four things have to happen, in order, with no mistakes:
- Detect the failure (and distinguish it from a simple network slowdown).
- Choose the most up-to-date replica to promote.
- Promote that node as the new primary.
- Prevent the old primary from coming back thinking it's still in charge.
The last step is the most dangerous. If the old primary reappears after a network partition and keeps accepting writes while a new primary accepts others, you have two diverging sources of truth: that's split-brain, and reconciling the data is painful at best, impossible at worst.
Doing all this by hand is slow and error-prone. You need automatic orchestration — and a single source of truth about the primary's identity that all nodes agree on.
etcd: the source of truth and the quorum
That's the role of the Distributed Configuration Store (DCS). PostgreSQL HA most commonly uses etcd: a distributed key-value store backed by the Raft consensus algorithm, which guarantees that all members see the same data, even under partial failure.
The principle is simple: the primary holds a leader key with a time-to-live (TTL) in etcd, which it must continuously renew. As long as it renews, it's in charge. If it stops (crash, freeze, network isolation), the key expires, and the remaining nodes can elect a new leader. Since only one node can hold a valid leader key at a time, the cluster can never agree on two leaders at once, which removes the main cause of split-brain.
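The lease mechanics can be illustrated with a toy, in-memory stand-in for the leader key. This is only a sketch of the idea: the class name and methods are hypothetical, time is passed in explicitly to keep the example deterministic, and a real DCS such as etcd provides the same guarantees through consensus rather than a single Python object.

```python
from typing import Optional


class LeaderLease:
    """Toy model of a leader key with a TTL (not an etcd client)."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def _expired(self, now: float) -> bool:
        return self.holder is None or now >= self.expires_at

    def acquire(self, node: str, now: float) -> bool:
        """Take the key only if no other node holds a valid one."""
        if not self._expired(now):
            return self.holder == node
        self.holder, self.expires_at = node, now + self.ttl
        return True

    def renew(self, node: str, now: float) -> bool:
        """Extend the lease; fails if it expired or belongs to another node."""
        if self._expired(now) or self.holder != node:
            return False
        self.expires_at = now + self.ttl
        return True


lease = LeaderLease(ttl=30)
print(lease.acquire("pg1", now=0))   # True: pg1 becomes leader
print(lease.acquire("pg2", now=10))  # False: pg1 still holds a valid key
print(lease.renew("pg1", now=20))    # True: lease now runs until t=50
print(lease.acquire("pg2", now=55))  # True: pg1 stopped renewing, key expired
print(lease.renew("pg1", now=60))    # False: pg1 is no longer the leader
```

The last call is the important one: a former leader that comes back finds that it cannot renew, which is its signal to stop acting as primary.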
But etcd itself has to stay reliable, and that's where the quorum comes in. Raft requires a majority of members to be reachable before it can make a decision: floor(N/2) + 1 out of N. Hence one non-negotiable rule: an odd number of members. Three members tolerate the loss of one node; five tolerate two. An even number gains you nothing: four members need three votes, so they still tolerate only one failure, just like three, while adding one more machine that can fail.
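The arithmetic behind the odd-number rule is short enough to check directly. The function names below are illustrative:

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority remains reachable."""
    return members - quorum(members)


for n in (2, 3, 4, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 2 members: quorum 2, tolerates 0 failure(s)
# 3 members: quorum 2, tolerates 1 failure(s)
# 4 members: quorum 3, tolerates 1 failure(s)
# 5 members: quorum 3, tolerates 2 failure(s)
```

Two members tolerate nothing at all, which is why a two-node setup needs the witness described next.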
That's also why, when you only have two data servers, you add a third lightweight etcd member — a witness — purely to reach an odd quorum without paying for a third full server.
Patroni: the orchestrator
Patroni is the component that bridges PostgreSQL and etcd. It runs on each node, manages the local PostgreSQL (startup, configuration, promotion, demotion) and communicates with the DCS.
Concretely, the Patroni on the primary renews the leader key at every cycle (loop_wait). When a primary disappears and its key expires after the ttl, the Patroni instances on the replicas trigger an election: the one with the most advanced WAL — that is, the one with the least lag from the dead primary — is promoted, and the others reconfigure themselves automatically to follow the new primary. Patroni also exposes a REST API (/primary, /replica, /health) that's invaluable for routing — more on that below.
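The selection rule can be sketched as a single function. This is a simplification under stated assumptions: in Patroni the decision is decentralized (each replica compares its own position with the others and then races for the leader key), and the lag threshold here is loosely modelled on Patroni's maximum_lag_on_failover setting. The function name, node names and positions are hypothetical.

```python
from typing import Dict, Optional


def pick_candidate(
    replica_wal: Dict[str, int],
    last_leader_wal: int,
    maximum_lag: int,
) -> Optional[str]:
    """Return the replica with the most advanced WAL position (in bytes).

    Replicas lagging more than `maximum_lag` bytes behind the last known
    leader position are excluded. Returns None if no replica qualifies.
    """
    eligible = {
        name: pos
        for name, pos in replica_wal.items()
        if last_leader_wal - pos <= maximum_lag
    }
    if not eligible:
        return None
    # Iterate over sorted names so that ties are broken deterministically.
    return max(sorted(eligible), key=lambda name: eligible[name])


positions = {"pg2": 50_331_976, "pg3": 50_331_744}
print(pick_candidate(positions, last_leader_wal=50_331_976, maximum_lag=1_048_576))
# pg2
```

Returning None matters as much as returning a name: if every replica is too far behind, refusing to promote is safer than silently losing a large window of writes.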
The basic install is almost trivial. It's the fine tuning and the operations that separate a cluster that holds up from one that sabotages itself.
Routing clients to the right node
This is the piece half the tutorials forget. After a failover, the primary has moved to a different machine — and therefore a different IP. If your application still points to the old node, your failover worked perfectly… and your app is still down.
The proven solution: an HAProxy in front of the cluster, querying Patroni's REST API for its health checks. It routes writes to whichever node answers primary on /primary, and can distribute reads across replicas via /replica. On failover, HAProxy automatically follows the new primary. A PgBouncer is often added to pool connections, since PostgreSQL doesn't enjoy spawning too many.
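What the load balancer does on each health check can be sketched in a few lines. Patroni answers HTTP 200 on /primary only on the node that currently holds the leader role, and 200 on /replica only on a running replica; other nodes answer with an error status. The host names below are hypothetical, port 8008 is assumed to be the REST API port (Patroni's usual default), and the probe is injectable so the logic can be exercised without a live cluster.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Optional


def http_status(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 from a node that does not hold the role
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return None


def nodes_with_role(
    nodes: Iterable[str],
    role: str,
    probe: Callable[[str], Optional[int]] = http_status,
) -> List[str]:
    """Nodes whose Patroni endpoint for `role` ('primary' or 'replica') returns 200."""
    return [n for n in nodes if probe(f"http://{n}:8008/{role}") == 200]


# Simulated responses standing in for a three-node cluster.
fake_cluster = {
    "http://pg1:8008/primary": 503, "http://pg1:8008/replica": 200,
    "http://pg2:8008/primary": 200, "http://pg2:8008/replica": 503,
    "http://pg3:8008/primary": None, "http://pg3:8008/replica": None,
}
hosts = ["pg1", "pg2", "pg3"]
print(nodes_with_role(hosts, "primary", probe=fake_cluster.get))  # ['pg2']
print(nodes_with_role(hosts, "replica", probe=fake_cluster.get))  # ['pg1']
```

Writes go to the single node in the first list, reads can be spread over the second, and an unreachable node simply drops out of both.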
Where it actually breaks
Here's the part you only learn in production. We've seen a cluster fail over dozens of times in a few days — not because the hardware was failing, but because the tuning was too jumpy.
Ghost failovers. Combine a ttl that's too short with an unstable network, and the slightest micro-incident expires the leader key: the cluster fails over when there was no real failure. This is the whole trade-off of the ttl / loop_wait / retry_timeout triad. A low ttl detects real failures quickly but multiplies false positives; a higher ttl (for example 60 s, against Patroni's default of 30 s) stabilizes the cluster at the cost of a slightly slower failover. There's no magic value: it's tuned to the actual quality of the network link between the nodes.
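The three settings are not independent. A rule of thumb found in Patroni's documentation is that loop_wait + 2 * retry_timeout should not exceed ttl, so that the leader has time for a full cycle plus retries before its key expires. A minimal sketch of that sanity check, with a hypothetical function name and values in seconds:

```python
def timing_is_consistent(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Check the rule of thumb: loop_wait + 2 * retry_timeout <= ttl."""
    return loop_wait + 2 * retry_timeout <= ttl


print(timing_is_consistent(ttl=30, loop_wait=10, retry_timeout=10))  # True
print(timing_is_consistent(ttl=20, loop_wait=10, retry_timeout=10))  # False
print(timing_is_consistent(ttl=60, loop_wait=10, retry_timeout=20))  # True
```

The second case is the jumpy configuration: a single slow round-trip to the DCS is enough for the key to expire while the primary is still healthy.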
Synchronous mode and write availability. Enabling synchronous_mode means Patroni only promotes a replica that was synchronous, so no confirmed transaction is lost on failover, but a stuck synchronous replica can stall writes on the primary. Patroni dynamically manages the list of synchronous replicas (synchronous_standby_names) to limit this risk, but it's a setting to understand, not to flip blindly.
Inter-datacenter latency. Spreading a cluster across multiple sites to survive losing a datacenter adds latency — both to the etcd quorum and to synchronous replication. A network hiccup between two DCs can be enough to trigger an election. The placement of the quorum and the witness becomes an architectural decision in its own right.
Residual split-brain. To absolutely prevent an old primary from continuing to write, you rely on a watchdog that reboots a node unable to demote itself in time, and on the fact that Patroni voluntarily makes a leader stand down when it loses access to the DCS.
And always: HA is not a backup. A bad migration replicates instantly to every node. Without off-site PITR and a restore that is tested on a regular schedule, your ultra-available cluster will simply serve corrupted data, in a highly available way.
A reference architecture
A serious, no-frills HA PostgreSQL cluster looks like this:
               Application
                    │
                 HAProxy ──(Patroni API health check)
                 /     \
          Primary       Replica(s)
       [PG+Patroni]   [PG+Patroni]
                 \     /
        etcd (odd quorum: 3 members,
              e.g. 2 data nodes + 1 witness)
        Spread across 2 datacenters
        Encrypted off-site backups + restore tested monthly
The building blocks: three PostgreSQL + Patroni nodes (or two data nodes plus a witness), an odd etcd quorum, an HAProxy in front for routing, all spread across two separate datacenters, and off-site backups whose restore is verified every month. None of these blocks are optional.
HA is a discipline, not an install
Setting Patroni up once is the easy 20%. The remaining 80% is what separates real high availability from "HA on paper": tuning the timeouts to your network's reality, running regular failover drills, monitoring that distinguishes a slowdown from a failure, restore tests, and the human response when something goes off-script at 3 AM. Most self-managed setups drift slowly toward HA on paper without anyone noticing, right up until the night it matters.
This is exactly what we operate: PostgreSQL in high availability, managed the way we run our own — Patroni + etcd cluster, failover tested under real conditions, verified backups, monitoring and on-call engineers.
If you want continuity without hiring an SRE or locking yourself into opaque cloud billing, that's the point of our managed infrastructure.