PostgreSQL Replication and High Availability
An advanced guide to PostgreSQL replication and high availability, covering streaming replication, read/write splitting in Node.js, automated failover, WAL archiving, and managed service options.
Overview
PostgreSQL replication is the mechanism that copies data from one database server (the primary) to one or more other servers (standbys or replicas), enabling read scaling, disaster recovery, and zero-downtime maintenance. If your application cannot tolerate minutes of downtime during a database failure or during schema migrations, replication is not optional -- it is foundational infrastructure. This guide covers streaming replication setup, read/write splitting in Node.js, automated failover with Patroni, WAL-based recovery, and practical monitoring for production systems.
Prerequisites
- PostgreSQL 15+ installed (two instances for primary/replica testing)
- Node.js 18+ with the pg driver
- Docker and Docker Compose (for the local testing environment)
- Basic familiarity with PostgreSQL configuration files (postgresql.conf, pg_hba.conf)
- Understanding of TCP networking and SSH
Install the Node.js dependencies:
npm install pg express
Why Replication Matters
A single PostgreSQL instance is a single point of failure. I have watched production databases go down at 2 AM because of a failed disk controller, and without replication, that means your application is offline until someone manually restores from a backup. Backups alone are not enough -- they give you recovery, but recovery takes time. Replication gives you a live copy of your data that is ready to serve traffic within seconds.
There are three primary reasons to replicate:
Read Scaling. Your primary handles all writes. Replicas handle read queries. For read-heavy workloads (which most web applications are), this can multiply your effective throughput by the number of replicas you maintain. I have seen applications go from saturating a single instance at 80% CPU to running comfortably at 30% per node after adding two read replicas.
Disaster Recovery. When your primary dies -- and eventually it will -- a standby server already has a near-complete copy of your data. Failover to a warm standby takes seconds, not hours. The difference between "we were down for 3 seconds" and "we were down for 4 hours while we restored from last night's backup" is often the difference between keeping and losing customers.
Zero-Downtime Maintenance. Major version upgrades, OS patches, hardware replacements -- all of these traditionally require downtime. With replication, you can promote a standby, patch the old primary, re-add it as a standby, and switch back. Your application never goes offline.
Replication Types
PostgreSQL supports several replication mechanisms, each suited to different use cases.
Physical (Streaming) Replication
This is the workhorse. The primary streams its Write-Ahead Log (WAL) to standbys in real time. The standby replays these WAL records to maintain an exact byte-for-byte copy of the primary. Every database, every table, every index -- the entire cluster is replicated.
Streaming replication is what you want for high availability. It is simple, reliable, and fast. The standby can serve read-only queries while replaying WAL.
Logical Replication
Introduced in PostgreSQL 10, logical replication replicates individual row changes decoded from the WAL rather than the physical WAL stream itself. You define publications (sets of tables) on the publisher and subscriptions on the target. This allows selective replication -- you can replicate specific tables to specific targets, and the target can have its own tables, indexes, and even write operations.
Use logical replication when you need to replicate a subset of tables, replicate between different PostgreSQL major versions, or replicate to a database that also needs to accept writes.
Physical vs. Logical: When to Use Which
| Feature | Physical (Streaming) | Logical |
|---|---|---|
| Granularity | Entire cluster | Per-table |
| Read-only standby | Yes | No (target is read-write) |
| Cross-version | No | Yes |
| DDL replication | Yes | No |
| Failover capable | Yes | Not directly |
| Setup complexity | Low | Medium |
For high availability, use physical streaming replication. For data distribution, ETL pipelines, or cross-version migrations, use logical replication.
WAL (Write-Ahead Log) Fundamentals
Everything in PostgreSQL replication starts with the WAL. Before any data change is written to the actual data files (tables, indexes), it is first written to the WAL. This guarantees durability -- if the server crashes mid-write, PostgreSQL can replay the WAL on startup to recover to a consistent state.
WAL files are 16MB segments stored in pg_wal/ (formerly pg_xlog/). Each segment has a name like 000000010000000000000001. The primary generates WAL continuously as data changes occur.
Streaming replication works by shipping these WAL records from the primary to standbys in real time, over a TCP connection. The standby has a WAL receiver process that connects to the primary's WAL sender process. WAL records flow continuously, typically with sub-second latency on a local network.
-- See current WAL position on the primary
SELECT pg_current_wal_lsn();
-- pg_current_wal_lsn
-- --------------------
-- 0/3A0028D0
-- See WAL position on a standby
SELECT pg_last_wal_replay_lsn();
-- pg_last_wal_replay_lsn
-- ------------------------
-- 0/3A0028D0
The difference between the primary's current WAL LSN (Log Sequence Number) and the standby's replayed LSN is your replication lag.
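To turn an LSN difference into bytes, use pg_wal_lsn_diff(). For example, comparing the primary's current position against a standby LSN captured a moment earlier (the literal LSN below is just the sample value from above):
-- Byte distance between the primary's current position and a previously captured standby LSN
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3A0028D0') AS lag_bytes;
-- lag_bytes
-- -----------
-- 0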
Setting Up Streaming Replication
This is the step-by-step process I use for setting up a primary and one standby. I will use PostgreSQL 16 paths and syntax.
Primary Configuration
First, create a replication user on the primary:
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'strong_password_here';
Edit postgresql.conf on the primary:
# Enable WAL shipping
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
# Optional: enable WAL archiving for PITR
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'
# Listen on all interfaces (or specific IP)
listen_addresses = '*'
Edit pg_hba.conf to allow replication connections from the standby:
# TYPE DATABASE USER ADDRESS METHOD
host replication replicator 10.0.1.0/24 scram-sha-256
Restart the primary:
sudo systemctl restart postgresql
Standby Configuration
Stop PostgreSQL on the standby server, then wipe its data directory and create a base backup from the primary:
sudo systemctl stop postgresql
sudo rm -rf /var/lib/postgresql/16/main/*
sudo -u postgres pg_basebackup \
-h 10.0.1.10 \
-U replicator \
-D /var/lib/postgresql/16/main \
-Fp -Xs -P -R
The -R flag is critical -- it creates the standby.signal file and writes the primary connection info to postgresql.auto.conf. Without it, the standby will start as a standalone primary.
Verify the standby configuration was created:
cat /var/lib/postgresql/16/main/postgresql.auto.conf
# primary_conninfo = 'user=replicator password=strong_password_here host=10.0.1.10 port=5432'
Optionally tune standby-specific settings in postgresql.conf:
# Allow read queries on the standby
hot_standby = on
# Performance tuning for replay
max_standby_streaming_delay = 30s
hot_standby_feedback = on
Start the standby:
sudo systemctl start postgresql
Verify Replication
On the primary, check connected standbys:
SELECT client_addr, state, sent_lsn, replay_lsn,
(sent_lsn - replay_lsn) AS byte_lag
FROM pg_stat_replication;
Expected output:
client_addr | state | sent_lsn | replay_lsn | byte_lag
--------------+-----------+-------------+-------------+----------
10.0.1.11 | streaming | 0/3A003F80 | 0/3A003F80 | 0
A byte_lag of 0 means the standby is fully caught up.
Synchronous vs. Asynchronous Replication
By default, streaming replication is asynchronous: the primary commits a transaction and immediately acknowledges it to the client, without waiting for the standby to confirm receipt. This is fast but introduces a window where data exists only on the primary.
Asynchronous (default):
- Primary commits instantly
- Standby may lag behind by milliseconds to seconds
- If the primary fails, the most recent transactions may be lost
- Negligible performance impact
Synchronous:
- Primary waits for at least one standby to confirm WAL receipt before acknowledging the commit
- Zero data loss guarantee
- Adds network round-trip latency to every write operation
- If the synchronous standby goes down, the primary stalls
To enable synchronous replication on the primary:
# postgresql.conf on primary
synchronous_commit = on
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
Each standby identifies itself via its application_name in its primary_conninfo:
# postgresql.auto.conf on standby
primary_conninfo = 'user=replicator password=... host=10.0.1.10 application_name=standby1'
My recommendation: use asynchronous replication for most applications. The sub-second lag is acceptable for read replicas, and the performance cost of synchronous writes is significant. Reserve synchronous replication for financial systems or any workload where losing even one committed transaction is unacceptable.
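One useful middle ground: synchronous_commit can be changed per session or per transaction, so a cluster configured for synchronous replication can still relax durability for low-value writes. A minimal sketch, assuming synchronous_standby_names is set as above (audit_log is a hypothetical table):
-- Cluster default stays synchronous; this one transaction only waits for the local WAL flush.
BEGIN;
SET LOCAL synchronous_commit = 'local';  -- do not wait for the standby for this commit
INSERT INTO audit_log (event) VALUES ('page_view');  -- audit_log is a hypothetical table
COMMIT;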
Read Replicas for Load Distribution
Once you have streaming replication running, your standbys can serve read-only queries. This is the simplest and most impactful scaling strategy for read-heavy applications.
The architecture is straightforward:
- All INSERT, UPDATE, DELETE, and DDL statements go to the primary
- All SELECT statements go to one or more replicas
- Your application layer routes queries accordingly
This is not automatic. PostgreSQL does not route queries for you. Your application must maintain separate connection pools and decide where to send each query.
Connection Routing in Node.js
Here is how I implement read/write splitting in Node.js using the pg module. The key is maintaining two connection pools -- one for the primary, one for the replica -- and exposing a data access layer that routes queries based on whether they mutate data.
var pg = require('pg');
var primaryPool = new pg.Pool({
host: process.env.PG_PRIMARY_HOST || 'localhost',
port: parseInt(process.env.PG_PRIMARY_PORT || '5432'),
database: process.env.PG_DATABASE || 'myapp',
user: process.env.PG_USER || 'appuser',
password: process.env.PG_PASSWORD || 'password',
max: 20,
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000
});
var replicaPool = new pg.Pool({
host: process.env.PG_REPLICA_HOST || 'localhost',
port: parseInt(process.env.PG_REPLICA_PORT || '5433'),
database: process.env.PG_DATABASE || 'myapp',
user: process.env.PG_USER || 'appuser',
password: process.env.PG_PASSWORD || 'password',
max: 30, // More connections for reads
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000
});
// Error handling for both pools
primaryPool.on('error', function(err) {
console.error('Primary pool error:', err.message);
});
replicaPool.on('error', function(err) {
console.error('Replica pool error:', err.message);
});
/**
* Execute a read query against the replica.
* Falls back to primary if the replica is unavailable.
*/
function queryRead(text, params) {
return replicaPool.query(text, params).catch(function(err) {
console.warn('Replica query failed, falling back to primary:', err.message);
return primaryPool.query(text, params);
});
}
/**
* Execute a write query against the primary.
*/
function queryWrite(text, params) {
return primaryPool.query(text, params);
}
/**
* Execute a transaction (always on the primary).
*/
function transaction(callback) {
return primaryPool.connect().then(function(client) {
return client.query('BEGIN')
.then(function() {
return callback(client);
})
.then(function(result) {
return client.query('COMMIT').then(function() {
client.release();
return result;
});
})
.catch(function(err) {
return client.query('ROLLBACK').then(function() {
client.release();
throw err;
});
});
});
}
module.exports = {
queryRead: queryRead,
queryWrite: queryWrite,
transaction: transaction,
primaryPool: primaryPool,
replicaPool: replicaPool
};
Note the fallback in queryRead. If the replica is down, reads automatically route to the primary. This keeps your application running during replica maintenance.
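Usage is then a one-line decision per query. Assuming the module above is saved as db.js (the fuller version later in this guide exposes read/write instead of queryRead/queryWrite), a hypothetical caller looks like this:
var db = require('./db');

// Read: routed to the replica, with automatic fallback to the primary
db.queryRead('SELECT id, email, name FROM users WHERE id = $1', [42])
  .then(function(result) {
    console.log(result.rows);
  });

// Write: always routed to the primary
db.queryWrite('UPDATE users SET name = $1 WHERE id = $2', ['New Name', 42])
  .then(function(result) {
    console.log('Updated ' + result.rowCount + ' row(s)');
  });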
Connection Pooling with PgBouncer
Direct connections from Node.js to PostgreSQL are fine for small applications. At scale, you want a connection pooler between your application and PostgreSQL. PgBouncer is the standard choice -- it is lightweight, battle-tested, and used by virtually every large PostgreSQL deployment.
A typical production architecture:
App Server 1 ─┐
App Server 2 ──┼──> PgBouncer (primary) ──> PostgreSQL Primary
App Server 3 ─┘
App Server 1 ─┐
App Server 2 ──┼──> PgBouncer (replica) ──> PostgreSQL Replica
App Server 3 ─┘
PgBouncer configuration (pgbouncer.ini):
[databases]
myapp_primary = host=10.0.1.10 port=5432 dbname=myapp
myapp_replica = host=10.0.1.11 port=5432 dbname=myapp
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
server_lifetime = 3600
server_idle_timeout = 600
log_connections = 0
log_disconnections = 0
Use pool_mode = transaction -- this means a server connection is assigned to a client only for the duration of a transaction, then returned to the pool. This is dramatically more efficient than session mode for web applications where most connections are idle between requests.
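Once PgBouncer is running, its admin console is the quickest way to confirm pool behavior. Connect to the virtual pgbouncer database with a user listed in admin_users or stats_users (assumed to be configured) and inspect the pools:
# Inspect live pool state and traffic statistics via the PgBouncer admin console
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW POOLS;"
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW STATS;"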
pgpool-II is an alternative that offers query-based routing (automatically sending SELECT to replicas and writes to the primary). However, it is more complex to configure, heavier at runtime, and introduces its own failure modes. I prefer handling routing in the application layer where I have full control.
Logical Replication for Selective Table Replication
When you need to replicate only specific tables -- for example, shipping product catalog data to an analytics database, or replicating user accounts across regions -- logical replication is the right tool.
On the primary (publisher):
-- Create a publication for specific tables
CREATE PUBLICATION product_pub FOR TABLE products, categories, inventory;
-- Or replicate all tables
CREATE PUBLICATION all_tables_pub FOR ALL TABLES;
On the target (subscriber):
-- Create the target tables first (schema is NOT replicated)
-- Then create a subscription
CREATE SUBSCRIPTION product_sub
CONNECTION 'host=10.0.1.10 dbname=myapp user=replicator password=...'
PUBLICATION product_pub;
Key differences from streaming replication:
- The subscriber is a fully read-write database
- You must create the table schemas on the subscriber manually
- DDL changes (ALTER TABLE, CREATE INDEX) are not replicated
- Each table must have a primary key or REPLICA IDENTITY set
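That last requirement matters in practice: without a replica identity, UPDATE and DELETE on a published table are rejected on the publisher. If a table genuinely has no primary key, you can fall back to REPLICA IDENTITY FULL at the cost of writing the entire old row to the WAL (events is a hypothetical table name):
-- Default behavior: use the primary key as the replica identity
ALTER TABLE events REPLICA IDENTITY DEFAULT;
-- No primary key? Log the full old row for every UPDATE/DELETE (more WAL volume)
ALTER TABLE events REPLICA IDENTITY FULL;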
Check subscription status:
SELECT subname, received_lsn, latest_end_lsn,
latest_end_lsn - received_lsn AS lag_bytes
FROM pg_stat_subscription;
Monitoring Replication Lag
Replication lag is the single most important metric for a replicated PostgreSQL deployment. If your replica is 30 seconds behind and you fail over to it, you lose 30 seconds of committed data.
From the Primary
SELECT
application_name,
client_addr,
state,
sync_state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
Realistic output from a healthy system:
application_name | client_addr | state | sync_state | sent_lsn | write_lsn | flush_lsn | replay_lsn | replay_lag_bytes | write_lag | flush_lag | replay_lag
------------------+-------------+-----------+------------+-------------+-------------+-------------+-------------+------------------+-----------+-----------+------------
standby1 | 10.0.1.11 | streaming | async | 0/4E002A38 | 0/4E002A38 | 0/4E002A38 | 0/4E002A38 | 0 | 00:00:00 | 00:00:00 | 00:00:00
standby2 | 10.0.1.12 | streaming | async | 0/4E002A38 | 0/4E002A38 | 0/4E002A38 | 0/4E001B60 | 3800 | 00:00:00 | 00:00:00 | 00:00:00.003
The replay_lag column shows the time lag directly. Anything under 1 second is healthy for most applications.
From the Replica
SELECT
now() - pg_last_xact_replay_timestamp() AS replication_delay,
pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn(),
pg_wal_lsn_diff(
pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn()
) AS replay_lag_bytes;
Alerting on Lag
Set up monitoring that alerts when replication lag exceeds your threshold. Here is a simple Node.js health check:
var db = require('./db');
function checkReplicationLag(thresholdMs) {
var query = "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) * 1000 AS lag_ms";
return db.replicaPool.query(query).then(function(result) {
var lagMs = parseFloat(result.rows[0].lag_ms);
if (lagMs > thresholdMs) {
console.error('ALERT: Replication lag is ' + lagMs + 'ms (threshold: ' + thresholdMs + 'ms)');
return { healthy: false, lagMs: lagMs };
}
return { healthy: true, lagMs: lagMs };
}).catch(function(err) {
console.error('ALERT: Cannot reach replica:', err.message);
return { healthy: false, lagMs: -1, error: err.message };
});
}
// Check every 10 seconds, alert if lag exceeds 5 seconds
setInterval(function() {
checkReplicationLag(5000);
}, 10000);
Automated Failover with Patroni
Manual failover is fine for planned maintenance. For unplanned failures at 2 AM, you need automated failover. Patroni is the best tool for this job. It is a Python-based HA manager that uses a distributed consensus store (etcd, Consul, or ZooKeeper) to manage cluster state and perform automatic failover.
How Patroni Works
- Patroni runs as a daemon on each PostgreSQL node
- Each node registers itself with the consensus store (etcd)
- One node holds the leader lock -- that node is the primary
- Patroni continuously monitors the primary's health
- If the primary becomes unreachable, the remaining nodes perform a leader election
- The winning standby is promoted to primary
- Other standbys reconfigure to follow the new primary
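Once Patroni is running on each node, patronictl shows cluster state and can drive a controlled switchover. A quick check, assuming the configuration file lives at /etc/patroni/patroni.yml (adjust the path to your layout):
# List cluster members, their roles, state, and lag
patronictl -c /etc/patroni/patroni.yml list

# Planned switchover to another member (prompts for confirmation)
patronictl -c /etc/patroni/patroni.yml switchover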
Patroni Configuration
scope: myapp-cluster
name: node1
restapi:
listen: 0.0.0.0:8008
connect_address: 10.0.1.10:8008
etcd3:
hosts: 10.0.1.100:2379,10.0.1.101:2379,10.0.1.102:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1MB
postgresql:
use_pg_rewind: true
parameters:
wal_level: replica
max_wal_senders: 5
max_replication_slots: 5
hot_standby: on
wal_keep_size: 1GB
initdb:
- encoding: UTF8
- data-checksums
postgresql:
listen: 0.0.0.0:5432
connect_address: 10.0.1.10:5432
data_dir: /var/lib/postgresql/16/main
authentication:
superuser:
username: postgres
password: postgres_password
replication:
username: replicator
password: replicator_password
parameters:
shared_buffers: 2GB
effective_cache_size: 6GB
work_mem: 64MB
Connecting Through Patroni
With Patroni, you do not connect directly to a specific PostgreSQL node. Instead, you put a load balancer (HAProxy) or a DNS-based service discovery layer in front of the cluster. HAProxy is the most common choice:
frontend postgresql_primary
bind *:5000
default_backend primary_backend
backend primary_backend
option httpchk GET /primary
http-check expect status 200
server node1 10.0.1.10:5432 check port 8008
server node2 10.0.1.11:5432 check port 8008
server node3 10.0.1.12:5432 check port 8008
frontend postgresql_replicas
bind *:5001
default_backend replica_backend
backend replica_backend
option httpchk GET /replica
http-check expect status 200
balance roundrobin
server node1 10.0.1.10:5432 check port 8008
server node2 10.0.1.11:5432 check port 8008
server node3 10.0.1.12:5432 check port 8008
HAProxy health-checks each node's Patroni REST API. Only the current primary returns 200 for /primary. All standbys return 200 for /replica. Failover is transparent to your application.
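From the application's point of view, HAProxy's two frontends simply become the database hosts: writes go to port 5000, reads to port 5001. With the db.js module from earlier, only environment variables change (haproxy.internal is a placeholder for your load balancer's address):
# Route the Node.js connection pools through HAProxy instead of individual nodes
export PG_PRIMARY_HOST=haproxy.internal   # placeholder hostname for the HAProxy instance
export PG_PRIMARY_PORT=5000               # frontend that always targets the current primary
export PG_REPLICA_HOST=haproxy.internal
export PG_REPLICA_PORT=5001               # round-robin frontend over healthy replicas
node app.js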
repmgr as an Alternative
repmgr is an older, simpler tool for PostgreSQL replication management and failover. It works without a consensus store (no etcd needed) and uses a witness node pattern instead. For smaller deployments where running etcd feels like overkill, repmgr is a solid choice. However, Patroni has become the de facto standard for production HA, and its integration with Kubernetes (via the Zalando PostgreSQL operator) makes it the better long-term investment.
Point-in-Time Recovery (PITR) with WAL Archiving
Replication protects you from server failure, but it does not protect you from bad SQL. If someone runs DELETE FROM orders WHERE 1=1; at 3:15 PM, that statement replicates to all standbys instantly. All your replicas now have the same missing data.
PITR lets you restore your database to any specific moment in time by combining a base backup with archived WAL files.
Enable WAL Archiving
On the primary:
# postgresql.conf
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'
archive_timeout = 60 # Force archive every 60 seconds even if WAL not full
For production, archive to object storage (S3, Spaces) instead of local disk. Tools like pgBackRest or WAL-G handle this with compression and encryption:
# WAL-G configuration for DigitalOcean Spaces
export WALG_S3_PREFIX=s3://my-bucket/wal-archive
export AWS_ENDPOINT=https://nyc3.digitaloceanspaces.com
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
# Take a base backup
wal-g backup-push /var/lib/postgresql/16/main
# List backups
wal-g backup-list
Performing PITR
# Stop PostgreSQL
sudo systemctl stop postgresql
# Restore base backup
wal-g backup-fetch /var/lib/postgresql/16/main LATEST
# Create recovery configuration
cat > /var/lib/postgresql/16/main/postgresql.auto.conf << 'EOF'
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2026-02-07 15:14:00 UTC'
recovery_target_action = 'promote'
EOF
# Create recovery signal
touch /var/lib/postgresql/16/main/recovery.signal
# Start PostgreSQL -- it will replay WAL up to the target time
sudo systemctl start postgresql
The database replays all WAL records up to 15:14:00, one minute before the accidental DELETE. Your data is restored.
High Availability on Managed Services
If you are running on a managed platform, replication and failover are handled for you -- to a degree.
DigitalOcean Managed Databases
DigitalOcean provides managed PostgreSQL with built-in HA:
- Standby nodes: Add standby nodes that automatically fail over if the primary goes down. Failover takes 15-30 seconds.
- Read replicas: Add read-only replicas with separate connection strings. You still implement read/write splitting in your application.
- Automatic backups: Daily backups with 7-day retention. PITR is available on production plans.
- Connection pooling: Built-in PgBouncer with configurable pool modes.
Your .do/app.yaml for an App Platform deployment with separate connection strings:
envs:
- key: PG_PRIMARY_HOST
value: ${db-primary.HOSTNAME}
- key: PG_PRIMARY_PORT
value: ${db-primary.PORT}
- key: PG_REPLICA_HOST
value: ${db-replica.HOSTNAME}
- key: PG_REPLICA_PORT
value: ${db-replica.PORT}
AWS RDS
RDS Multi-AZ provides synchronous replication to a standby in a different availability zone. Failover is automatic and takes 60-120 seconds. Read replicas are asynchronous and can be in different regions.
Azure Database for PostgreSQL
Azure's Flexible Server offers zone-redundant HA with automatic failover. Read replicas support up to 5 replicas with configurable regions.
The critical point with all managed services: you still own the application-level routing. The managed service gives you separate endpoints for primary and replica. Your application code must direct writes to the primary endpoint and reads to the replica endpoint. The db.js module I showed earlier works identically regardless of whether you are running self-managed or managed PostgreSQL.
Testing Your Failover Strategy
An untested failover plan is not a plan. Here is what I test quarterly:
Simulate primary failure: Stop the primary PostgreSQL process (not a graceful shutdown -- kill -9 it). Verify that Patroni or your managed service promotes the standby within your SLA window.
Verify data integrity: After failover, run checksums against a known dataset. Ensure no transactions were lost (or that the loss is within your acceptable window for async replication).
Test application behavior: Does your application reconnect automatically? Does it correctly detect the new primary? Connection pool libraries usually handle reconnection, but verify it.
Measure failover duration: Time the entire process from primary failure to the application successfully writing to the new primary.
Test failback: After promoting the standby, can you reintroduce the old primary as a new standby? With pg_rewind, this should be possible without a full base backup.
# Test: kill the primary PostgreSQL process
sudo kill -9 $(head -1 /var/lib/postgresql/16/main/postmaster.pid)
# Watch Patroni logs on the standby
journalctl -u patroni -f
# Expected: "promoted self to leader by acquiring session lock"
# Verify from the application
curl http://localhost:3000/health
# Expected: {"status":"ok","primary":"connected","replica":"connected"}
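For the failback step, pg_rewind can resynchronize the old primary from the new one without a fresh base backup. A sketch of the procedure, assuming data checksums (or wal_log_hints) are enabled and the old primary was shut down cleanly; 10.0.1.11 stands in for the newly promoted primary:
# On the old primary, after it has stopped:
sudo -u postgres pg_rewind \
  --target-pgdata=/var/lib/postgresql/16/main \
  --source-server='host=10.0.1.11 port=5432 user=postgres dbname=postgres'

# Recreate standby.signal (and set primary_conninfo in postgresql.auto.conf),
# then start it as a standby of the new primary
sudo -u postgres touch /var/lib/postgresql/16/main/standby.signal
sudo systemctl start postgresql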
Backup Strategies That Complement Replication
Replication is not a backup. Say it again. Replication is not a backup. If you drop a table on the primary, that DROP replicates to every standby. Replication protects against hardware failure. Backups protect against human error.
A complete strategy includes:
- Continuous WAL archiving to object storage (S3, Spaces)
- Daily base backups via pgBackRest or WAL-G
- Retention policy: Keep 7 daily, 4 weekly, and 12 monthly backups
- Regular restore testing: Restore a backup to a test server at least monthly
- Cross-region copies: Store backups in a different region than your primary
# pgBackRest daily backup cron
0 2 * * * pgbackrest --stanza=myapp --type=diff backup
# Weekly full backup
0 3 * * 0 pgbackrest --stanza=myapp --type=full backup
# Verify backup integrity
pgbackrest --stanza=myapp check
Complete Working Example
Here is a complete Node.js Express application with a read/write splitting data access layer, plus a Docker Compose file for local testing with a primary and replica.
Docker Compose: Primary + Replica
version: '3.8'
services:
pg-primary:
image: postgres:16
container_name: pg-primary
environment:
POSTGRES_USER: appuser
POSTGRES_PASSWORD: apppassword
POSTGRES_DB: myapp
ports:
- "5432:5432"
volumes:
- ./docker/primary/init.sh:/docker-entrypoint-initdb.d/init.sh
- primary_data:/var/lib/postgresql/data
command: >
postgres
-c wal_level=replica
-c max_wal_senders=5
-c max_replication_slots=5
-c hot_standby=on
-c wal_keep_size=256MB
healthcheck:
test: ["CMD-SHELL", "pg_isready -U appuser -d myapp"]
interval: 5s
timeout: 3s
retries: 5
pg-replica:
image: postgres:16
container_name: pg-replica
user: postgres  # run the custom entrypoint as postgres so "exec postgres" is permitted
environment:
PGUSER: replicator
PGPASSWORD: replicator_pass
ports:
- "5433:5432"
volumes:
- replica_data:/var/lib/postgresql/data
depends_on:
pg-primary:
condition: service_healthy
entrypoint: |
bash -c "
# Wait for primary
until pg_isready -h pg-primary -U appuser; do sleep 1; done
# If data directory is empty, take a base backup
if [ ! -s /var/lib/postgresql/data/PG_VERSION ]; then
rm -rf /var/lib/postgresql/data/*
pg_basebackup -h pg-primary -U replicator -D /var/lib/postgresql/data -Fp -Xs -P -R
echo 'hot_standby = on' >> /var/lib/postgresql/data/postgresql.auto.conf
fi
exec postgres
"
volumes:
primary_data:
replica_data:
Primary Init Script
Save as docker/primary/init.sh:
#!/bin/bash
set -e
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'replicator_pass';
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL,
name VARCHAR(255) NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
user_id INTEGER REFERENCES users(id),
title VARCHAR(500) NOT NULL,
body TEXT,
published BOOLEAN DEFAULT false,
created_at TIMESTAMP DEFAULT NOW()
);
INSERT INTO users (email, name) VALUES
('[email protected]', 'Alice Johnson'),
('[email protected]', 'Bob Smith');
INSERT INTO posts (user_id, title, body, published) VALUES
(1, 'Getting Started with PostgreSQL', 'A beginner guide...', true),
(2, 'Advanced Indexing Strategies', 'Deep dive into indexes...', true);
EOSQL
# Allow replication connections
echo "host replication replicator all scram-sha-256" >> "$PGDATA/pg_hba.conf"
Node.js Application
db.js -- Data access layer with read/write splitting:
var pg = require('pg');
var config = {
primary: {
host: process.env.PG_PRIMARY_HOST || 'localhost',
port: parseInt(process.env.PG_PRIMARY_PORT || '5432'),
database: process.env.PG_DATABASE || 'myapp',
user: process.env.PG_USER || 'appuser',
password: process.env.PG_PASSWORD || 'apppassword',
max: 20,
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000
},
replica: {
host: process.env.PG_REPLICA_HOST || 'localhost',
port: parseInt(process.env.PG_REPLICA_PORT || '5433'),
database: process.env.PG_DATABASE || 'myapp',
user: process.env.PG_USER || 'appuser',
password: process.env.PG_PASSWORD || 'apppassword',
max: 30,
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000
}
};
var primaryPool = new pg.Pool(config.primary);
var replicaPool = new pg.Pool(config.replica);
var replicaHealthy = true;
primaryPool.on('error', function(err) {
console.error('[DB] Primary pool error:', err.message);
});
replicaPool.on('error', function(err) {
console.error('[DB] Replica pool error:', err.message);
replicaHealthy = false;
});
function read(text, params) {
if (!replicaHealthy) {
return primaryPool.query(text, params);
}
return replicaPool.query(text, params).catch(function(err) {
console.warn('[DB] Replica read failed, falling back to primary:', err.message);
replicaHealthy = false;
// Re-check replica health every 30 seconds
setTimeout(function() {
replicaPool.query('SELECT 1').then(function() {
console.log('[DB] Replica is back online');
replicaHealthy = true;
}).catch(function() {
replicaHealthy = false;
});
}, 30000);
return primaryPool.query(text, params);
});
}
function write(text, params) {
return primaryPool.query(text, params);
}
function transaction(callback) {
return primaryPool.connect().then(function(client) {
return client.query('BEGIN')
.then(function() {
return callback(client);
})
.then(function(result) {
return client.query('COMMIT').then(function() {
client.release();
return result;
});
})
.catch(function(err) {
return client.query('ROLLBACK').then(function() {
client.release();
throw err;
});
});
});
}
function getReplicationStatus() {
return primaryPool.query(
"SELECT application_name, client_addr, state, " +
"pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes, " +
"replay_lag " +
"FROM pg_stat_replication"
).then(function(result) {
return result.rows;
});
}
function healthCheck() {
var primaryOk = false;
var replicaOk = false;
var lagBytes = null;
return primaryPool.query('SELECT 1')
.then(function() {
primaryOk = true;
return replicaPool.query('SELECT 1');
})
.then(function() {
replicaOk = true;
return getReplicationStatus();
})
.then(function(status) {
if (status.length > 0) {
lagBytes = parseInt(status[0].lag_bytes || '0');
}
return {
status: primaryOk ? 'ok' : 'degraded',
primary: primaryOk ? 'connected' : 'disconnected',
replica: replicaOk ? 'connected' : 'disconnected',
replicationLagBytes: lagBytes
};
})
.catch(function(err) {
return {
status: 'degraded',
primary: primaryOk ? 'connected' : 'disconnected',
replica: replicaOk ? 'connected' : 'disconnected',
error: err.message
};
});
}
module.exports = {
read: read,
write: write,
transaction: transaction,
healthCheck: healthCheck,
getReplicationStatus: getReplicationStatus,
primaryPool: primaryPool,
replicaPool: replicaPool
};
app.js -- Express application:
var express = require('express');
var db = require('./db');
var app = express();
app.use(express.json());
// Health check endpoint (includes replication status)
app.get('/health', function(req, res) {
db.healthCheck().then(function(status) {
var statusCode = status.status === 'ok' ? 200 : 503;
res.status(statusCode).json(status);
});
});
// READ: Get all users (routed to replica)
app.get('/users', function(req, res) {
db.read('SELECT id, email, name, created_at FROM users ORDER BY id')
.then(function(result) {
res.json(result.rows);
})
.catch(function(err) {
res.status(500).json({ error: err.message });
});
});
// READ: Get user by ID (routed to replica)
app.get('/users/:id', function(req, res) {
db.read('SELECT id, email, name, created_at FROM users WHERE id = $1', [req.params.id])
.then(function(result) {
if (result.rows.length === 0) {
return res.status(404).json({ error: 'User not found' });
}
res.json(result.rows[0]);
})
.catch(function(err) {
res.status(500).json({ error: err.message });
});
});
// WRITE: Create a user (routed to primary)
app.post('/users', function(req, res) {
var email = req.body.email;
var name = req.body.name;
db.write(
'INSERT INTO users (email, name) VALUES ($1, $2) RETURNING id, email, name, created_at',
[email, name]
).then(function(result) {
res.status(201).json(result.rows[0]);
}).catch(function(err) {
res.status(500).json({ error: err.message });
});
});
// WRITE: Create a post with transaction (routed to primary)
app.post('/posts', function(req, res) {
var userId = req.body.user_id;
var title = req.body.title;
var body = req.body.body;
db.transaction(function(client) {
return client.query(
'SELECT id FROM users WHERE id = $1', [userId]
).then(function(userResult) {
if (userResult.rows.length === 0) {
throw new Error('User not found');
}
return client.query(
'INSERT INTO posts (user_id, title, body) VALUES ($1, $2, $3) RETURNING *',
[userId, title, body]
);
});
}).then(function(result) {
res.status(201).json(result.rows[0]);
}).catch(function(err) {
var statusCode = err.message === 'User not found' ? 404 : 500;
res.status(statusCode).json({ error: err.message });
});
});
// READ: Get posts with user info (routed to replica)
app.get('/posts', function(req, res) {
var query = [
'SELECT p.id, p.title, p.body, p.published, p.created_at,',
' u.name AS author_name, u.email AS author_email',
'FROM posts p',
'JOIN users u ON u.id = p.user_id',
'WHERE p.published = true',
'ORDER BY p.created_at DESC'
].join('\n');
db.read(query).then(function(result) {
res.json(result.rows);
}).catch(function(err) {
res.status(500).json({ error: err.message });
});
});
// Replication status endpoint (admin)
app.get('/admin/replication', function(req, res) {
db.getReplicationStatus().then(function(status) {
res.json(status);
}).catch(function(err) {
res.status(500).json({ error: err.message });
});
});
var port = process.env.PORT || 3000;
app.listen(port, function() {
console.log('Server running on port ' + port);
console.log('Primary: ' + (process.env.PG_PRIMARY_HOST || 'localhost') + ':' + (process.env.PG_PRIMARY_PORT || '5432'));
console.log('Replica: ' + (process.env.PG_REPLICA_HOST || 'localhost') + ':' + (process.env.PG_REPLICA_PORT || '5433'));
});
Running the Complete Example
# Start PostgreSQL primary and replica
docker compose up -d
# Wait for replication to establish
sleep 10
# Verify replication is working
docker exec pg-primary psql -U appuser -d myapp \
-c "SELECT client_addr, state, replay_lag FROM pg_stat_replication;"
# Install and start the Node.js app
npm install pg express
node app.js
# Test write (goes to primary)
curl -X POST http://localhost:3000/users \
-H "Content-Type: application/json" \
-d '{"email":"[email protected]","name":"Carol Williams"}'
# Test read (goes to replica)
curl http://localhost:3000/users
# Check health (includes replication status)
curl http://localhost:3000/health
Common Issues and Troubleshooting
1. Replica Fails to Start: "database system identifier differs"
FATAL: database system identifier differs between the primary and standby
DETAIL: The primary's identifier is 7321456789012345678, the standby's identifier is 7321456789012345999.
This means the standby's data directory was not created from a base backup of the current primary. The standby has data from a different PostgreSQL cluster entirely. Solution: wipe the standby data directory and run pg_basebackup again.
sudo systemctl stop postgresql
sudo rm -rf /var/lib/postgresql/16/main/*
sudo -u postgres pg_basebackup -h primary_host -U replicator -D /var/lib/postgresql/16/main -Fp -Xs -P -R
sudo systemctl start postgresql
2. Replication Lag Keeps Growing
Check pg_stat_replication on the primary:
application_name | state | replay_lag_bytes
------------------+---------+------------------
standby1 | streaming | 524288000
A lag of 500MB+ that keeps growing means the replica cannot keep up with the write workload. Common causes:
- Slow disk I/O on the replica: The standby replays WAL sequentially, which is I/O-bound. Use SSDs.
- Long-running queries on the replica: Queries block WAL replay due to conflict resolution. Check max_standby_streaming_delay -- if set too high, long queries delay replay; if set too low, queries get cancelled.
- Network bandwidth: Ensure the network between primary and standby can sustain your WAL generation rate.
-- Check for queries blocking replay on the standby
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
3. "FATAL: no pg_hba.conf entry for replication connection"
FATAL: no pg_hba.conf entry for replication connection from host "10.0.1.11", user "replicator"
The primary's pg_hba.conf does not allow replication connections from the standby's IP. Add the correct entry:
host replication replicator 10.0.1.11/32 scram-sha-256
Then reload (not restart):
sudo systemctl reload postgresql
4. Standby Queries Cancelled Due to Recovery Conflict
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might not see results because of a recently started recovery action.
This happens when the primary vacuums or removes row versions that an active query on the standby still needs. Increase max_standby_streaming_delay to give queries more time, or enable hot_standby_feedback so the primary knows which row versions the standby still needs:
# On the standby
max_standby_streaming_delay = 60s
hot_standby_feedback = on
Warning: hot_standby_feedback = on can cause table bloat on the primary, because it prevents vacuum from cleaning up row versions that the standby is still using. Monitor pg_stat_user_tables.n_dead_tup on the primary.
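A simple way to watch for that bloat is to track dead tuples per table on the primary and compare the trend before and after enabling the setting:
-- Tables with the most dead tuples on the primary (watch the trend over time)
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;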
5. WAL Segments Removed Before Standby Could Fetch Them
LOG: started streaming WAL from primary at 0/12000000 on timeline 1
FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000012 has already been removed
The primary has deleted WAL segments that the standby still needs. This happens when the standby falls too far behind and wal_keep_size is too small. Solutions:
- Increase wal_keep_size on the primary (e.g., 2GB or more)
- Use replication slots (guarantees WAL retention, but can fill your disk if the standby is down for too long):
-- On the primary
SELECT * FROM pg_create_physical_replication_slot('standby1_slot');
Then reference the slot in the standby's primary_conninfo:
primary_slot_name = 'standby1_slot'
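Because a slot retains WAL for as long as its standby is disconnected, monitor how much WAL each slot is holding back on the primary:
-- WAL retained on behalf of each replication slot (run on the primary)
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;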
Best Practices
Always use replication slots for physical replication. They prevent WAL segments from being deleted before the standby has consumed them. Monitor slot lag to prevent disk exhaustion.
Set wal_keep_size as a safety net, not a primary mechanism. Replication slots are the reliable way to retain WAL; wal_keep_size is your backstop if slots are not configured.
Monitor replication lag continuously. Set alerts at two thresholds: a warning at 5 seconds and a critical alert at 30 seconds. Lag that grows monotonically indicates a systemic problem.
Test failover regularly. Schedule quarterly failover drills. Document the process, measure the duration, and verify application behavior after promotion. An untested failover plan will fail when you need it most.
Use pg_rewind for fast failback. After a failover, pg_rewind can resynchronize the old primary from the new primary without a full base backup, saving hours on large databases.
Keep backups independent from replication. Replication replicas are not backups. Run pgBackRest or WAL-G on a schedule, archive WAL to object storage, and test restores monthly.
Use connection pooling in production. PgBouncer in transaction mode between your application and PostgreSQL reduces connection overhead dramatically. A database that supports 200 direct connections can serve 2000+ application connections through PgBouncer.
Enable hot_standby_feedback cautiously. It prevents query cancellations on the replica but can cause vacuum delays on the primary. Monitor dead tuple counts on the primary when this is enabled.
Size your replicas identically to the primary. Under-provisioned replicas fall behind under load. When they get promoted during failover, they cannot handle the full write workload. Your standby must be able to run as the primary at any moment.
Implement read-after-write consistency where needed. After a write, the application may immediately read the data. If the read goes to a replica that has not yet replayed the write, the data appears missing. For critical flows (e.g., user registration then immediate login), read from the primary for a short window after the write; a minimal sketch of this pattern follows.
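Here is a minimal sketch of that pattern layered on the db.js module from the complete example. The 5-second window and the helper names (readForUser, writeForUser, stickyUntil) are illustrative choices, not part of any library:
// read-after-write.js -- keep a user's reads on the primary briefly after they write
var db = require('./db');

var STICKY_MS = 5000;   // assumed window; tune to your observed replication lag
var stickyUntil = {};   // userId -> timestamp until which reads should hit the primary

function writeForUser(userId, text, params) {
  return db.write(text, params).then(function(result) {
    stickyUntil[userId] = Date.now() + STICKY_MS;
    return result;
  });
}

function readForUser(userId, text, params) {
  if (Date.now() < (stickyUntil[userId] || 0)) {
    // This user wrote recently; the replica may not have replayed it yet
    return db.primaryPool.query(text, params);
  }
  return db.read(text, params);
}

module.exports = { writeForUser: writeForUser, readForUser: readForUser };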
References
- PostgreSQL Documentation: High Availability, Load Balancing, and Replication
- PostgreSQL Documentation: WAL Configuration
- PostgreSQL Documentation: Streaming Replication
- PostgreSQL Documentation: Logical Replication
- Patroni GitHub Repository
- pgBackRest Documentation
- WAL-G GitHub Repository
- PgBouncer Documentation
- repmgr Documentation
- DigitalOcean Managed Databases Documentation
