Add RDS read replicas for read scaling and DR

Read replicas absorb read load and give you a promotion-ready DR target — for read-heavy workloads they pay for themselves in primary-instance right-sizing.

Read replicas: the basics

What is an RDS read replica and why does it get flagged?

An RDS read replica is an asynchronous, read-only copy of a primary database instance. For MySQL, MariaDB, PostgreSQL, and Oracle it uses the engine's native replication: binlog replication for MySQL and MariaDB, WAL streaming replication for PostgreSQL, and Data Guard for Oracle. The primary ships its change records and each replica applies them. Aurora works differently: replicas attach to the same shared storage volume rather than re-applying writes, so lag is typically tens of milliseconds and you can stand up a reader without copying any data.

A database without read replicas is a single point of failure with two distinct problems. Every read query — even the boring SELECTs that drive dashboards, search, reporting, or feature flags — competes with writes on the same instance. And if the primary fails, your recovery option is a Multi-AZ failover (if you have it) or a restore from backup that takes minutes to hours. There's no warm secondary you can promote on a whim, and no regional fallback unless you've planned for one.

Continuity check RDS-DR-004 flags any DB instance running with zero read replicas. The severity is intentionally LOW — plenty of databases are write-heavy enough or small enough that a replica isn't worth the spend. The check is there to make you decide deliberately, not to fail every small dev database in the account.

In this lesson you'll learn when a read replica is the right call and when it isn't, how replication actually works under the hood (and where the lag bites you), and how to add a same-region replica or a cross-region DR replica with the exact CLI commands. You'll see the trade-off between replica cost and the primary right-sizing it unlocks, plus the promote-replica flow you'll actually run during a DR event.

Fun fact

GitHub's October 2018 outage was a replica problem

When GitHub lost connectivity between its East and West Coast sites for 43 seconds, the MySQL clusters failed over to West Coast primaries. The East Coast databases held a brief window of writes that had never replicated west, and the new West Coast primaries were already taking fresh writes, so neither side could simply catch up without risking data integrity. The team chose data consistency over uptime, and service stayed degraded for 24 hours and 11 minutes while they restored and resynchronized the data. The lesson: read replicas are a powerful DR option, but "asynchronous" means "sometimes behind," and the promotion decision needs a clear runbook before the incident, not during one.

Adding a read replica in action

Marco runs the data platform at a SaaS analytics company. RDS-DR-004 has fired on their production PostgreSQL instance — a db.r5.4xlarge running their reporting workload, $2,400/month. The team's been throwing bigger instances at slow dashboards for a year.

He pulls Performance Insights for the last 7 days. The wait events break down to roughly 78% read I/O and 12% write I/O — a 6.5:1 read:write ratio. Database load is consistently above the vCPU line, and the top query is a 200-line analytical SELECT that runs every 30 seconds from the dashboard service.

He decides on one same-region replica to absorb the reporting reads. If the math works out, the team can right-size the primary down to r5.2xlarge after a week of routing reads off it — net cost neutral, with a promotion-ready DR target as a bonus.

First, create the read replica in the same region. The replica inherits the primary's parameter group, security groups, and KMS key by default.

$ aws rds create-db-instance-read-replica \
    --db-instance-identifier prod-reports-replica-1 \
    --source-db-instance-identifier prod-reports \
    --db-instance-class db.r5.4xlarge \
    --availability-zone eu-west-1b \
    --no-publicly-accessible \
    --tags Key=Purpose,Value=read-scaling Key=Env,Value=prod
{
  "DBInstance": {
    "DBInstanceIdentifier": "prod-reports-replica-1",
    "DBInstanceClass": "db.r5.4xlarge",
    "Engine": "postgres",
    "DBInstanceStatus": "creating",
    "ReadReplicaSourceDBInstanceIdentifier": "prod-reports",
    "StorageEncrypted": true,
    "MultiAZ": false
  }
}
# Replica creation takes 15-40 minutes depending on source volume size.

Same-region replica creation. Sized to match the primary while the read load gets diverted.

Once it's available, verify replication lag before pointing application reads at it. Anything under a few seconds is normal; sustained double-digit seconds means the replica can't keep up and needs investigating.

$ aws rds describe-db-instances --db-instance-identifier prod-reports-replica-1 \
    --query 'DBInstances[0].[DBInstanceStatus,ReadReplicaSourceDBInstanceIdentifier]' --output text
available	prod-reports
# describe-db-instances reports status and source, not lag; read ReplicaLag (seconds) from CloudWatch.
$ aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name ReplicaLag \
    --dimensions Name=DBInstanceIdentifier,Value=prod-reports-replica-1 \
    --start-time $(date -u -d '5 minutes ago' +%FT%TZ) --end-time $(date -u +%FT%TZ) \
    --period 300 --statistics Average --query 'Datapoints[0].Average' --output text
0.4
# 0.4s lag: well within reporting tolerance, safe to route SELECTs.

Replication lag check before cutover. Pair with CloudWatch's ReplicaLag metric for ongoing visibility.
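
For the ongoing visibility side, one option is a CloudWatch alarm on ReplicaLag. The sketch below wraps aws cloudwatch put-metric-alarm in a function. The five-minute evaluation window, the alarm name suffix, the choice to treat missing data as breaching, and the SNS topic ARN are all assumptions for illustration, not values from this lesson.

```shell
# Sketch: alarm when a replica's ReplicaLag stays above a threshold.
# Alarm name, evaluation window, and notification target are assumptions.
create_replica_lag_alarm() {
  local replica_id="$1"      # e.g. prod-reports-replica-1
  local threshold_s="$2"     # lag in seconds that counts as too far behind
  local sns_topic_arn="$3"   # hypothetical SNS topic to notify

  aws cloudwatch put-metric-alarm \
    --alarm-name "${replica_id}-replica-lag" \
    --namespace AWS/RDS \
    --metric-name ReplicaLag \
    --dimensions "Name=DBInstanceIdentifier,Value=${replica_id}" \
    --statistic Average \
    --period 60 \
    --evaluation-periods 5 \
    --threshold "${threshold_s}" \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data breaching \
    --alarm-actions "${sns_topic_arn}"
}
```

Called with a threshold of 10, this would fire after five consecutive one-minute averages above 10 seconds, which roughly matches the "sustained double-digit seconds" guidance above.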

How replication actually works: deep dive

For MySQL, MariaDB, PostgreSQL, and Oracle, RDS read replicas use the database engine's native asynchronous replication. The primary records every committed change in its binlog (MySQL, MariaDB), WAL stream (PostgreSQL), or redo log (Oracle); the replica connects to the primary, pulls that stream, and applies the changes locally. MySQL and MariaDB replay logical binlog events, single-threaded in older versions and with parallel appliers in newer ones, while PostgreSQL replays physical WAL records in a single recovery process. Either way, apply throughput on the replica has a ceiling well below what a primary with many concurrent writers can generate, so a write-heavy primary can saturate the replica's apply path and lag will grow.

Aurora is fundamentally different. Storage is a shared distributed volume sitting underneath every instance in the cluster, so a reader doesn't replay writes at all: it sees the same 6-way-replicated storage the writer is updating. That's why Aurora supports up to 15 readers per cluster, lag is usually tens of milliseconds, and adding a reader doesn't require copying data. The trade-off is that the shared volume can't span regions, so cross-region Aurora replicas rely on a separate storage-level replication mechanism (Aurora Global Database).

Cross-region replicas add two costs you don't see in a same-region setup. Storage is doubled (the replica region needs its own copy of the data), and every byte of replication traffic is billed as cross-region data transfer at roughly $0.02/GB. The transfer bill scales with how much change the primary ships, not with database size: at that rate, the $300+/month figure for a busy 500GB database corresponds to about 15TB of replicated changes a month (roughly 500GB a day), before the instance cost. In return you get a region-isolated promotion target, which can be the difference between a 15-minute outage and a 6-hour AWS regional incident.

# Cross-region DR replica in eu-central-1, source primary lives in eu-west-1.
# Run the call against the destination region; --source-region lets the CLI
# presign the request when the source instance is encrypted.
aws rds create-db-instance-read-replica \
  --region eu-central-1 \
  --source-region eu-west-1 \
  --db-instance-identifier prod-reports-dr \
  --source-db-instance-identifier arn:aws:rds:eu-west-1:123456789012:db:prod-reports \
  --kms-key-id alias/rds-eu-central-1 \
  --db-instance-class db.r5.2xlarge \
  --tags Key=Purpose,Value=dr Key=Env,Value=prod

# Monitor lag via the cross-region replica's CloudWatch metrics (GNU date syntax).
aws cloudwatch get-metric-statistics \
  --region eu-central-1 \
  --namespace AWS/RDS \
  --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=prod-reports-dr \
  --start-time $(date -u -d '1 hour ago' +%FT%TZ) \
  --end-time $(date -u +%FT%TZ) \
  --period 60 --statistics Average Maximum
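
To sanity-check the cross-region transfer cost before committing, a back-of-the-envelope estimate is usually enough. The helper below is a minimal sketch: the function name is made up for illustration, the daily change volume is something you would need to measure for your own workload, and the default rate is the rough $0.02/GB figure quoted above.

```shell
# Sketch: estimate monthly cross-region replication transfer cost.
# change_gb_per_day is an assumption you should measure (binlog/WAL volume);
# rate_per_gb defaults to the approximate $0.02/GB quoted in the text.
xregion_transfer_cost() {
  local change_gb_per_day="$1"
  local rate_per_gb="${2:-0.02}"
  awk -v gb="$change_gb_per_day" -v rate="$rate_per_gb" \
    'BEGIN { printf "%.2f\n", gb * 30 * rate }'
}
```

For example, xregion_transfer_cost 500 prints 300.00, the 30-day cost of shipping 500GB of changes a day.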

What is the impact of running RDS without read replicas?

The most direct impact is over-provisioning the primary. A read-heavy workload that should be served from a smaller primary plus a replica gets served from one oversized primary instead. For workloads with a read:write ratio above roughly 3:1, the cost of one replica is usually less than the right-sizing it unlocks — a $1,200/month replica that lets you drop a $2,400/month primary to $1,200/month is net zero on cost and adds DR.
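
The break-even arithmetic is simple enough to script. This is a small sketch with a hypothetical helper name; it takes whole-dollar monthly prices and prints the change in monthly spend after adding a replica and right-sizing the primary.

```shell
# Sketch: net monthly cost change from adding a replica and downsizing the primary.
# Inputs are whole-dollar monthly prices; a result of 0 means cost neutral.
replica_net_delta() {
  local old_primary="$1"   # current primary cost
  local new_primary="$2"   # primary cost after right-sizing
  local replica="$3"       # replica cost
  echo "$((new_primary + replica - old_primary))"
}
```

With the figures above, replica_net_delta 2400 1200 1200 prints 0. If the replica stays at the larger size (assuming it costs the same $2,400 as the primary's class), replica_net_delta 2400 1200 2400 prints 1200, so the neutral outcome depends on sizing the replica down as well.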

The second-order impact is operational fragility. Without a replica, every long-running analytical query, every backup-window I/O spike, every schema migration, and every pg_dump runs against the same instance serving production traffic. Latency goes up at the worst possible moments — month-end reporting, end-of-quarter exports, the moment a new dashboard ships — and the team's only lever is to upsize the instance again.

The third-order impact is DR posture. Without a read replica (or Multi-AZ standby), RPO and RTO are bounded by your backup cadence and restore time. If the latest restorable backup is an hour old and the restore takes 45 minutes, you lose up to an hour of writes (RPO) and are down for at least 45 minutes (RTO). A read replica drops RPO to seconds and RTO to the time it takes to promote and repoint applications, usually under five minutes for same-region and ten to fifteen for cross-region.

And the cost of a regional outage without a cross-region replica is its own category. AWS regional events are rare but they happen — and "we don't have a DR plan for that region" usually means a multi-day outage while the team rebuilds from S3 backups in another region. For any system with revenue impact above a few thousand dollars an hour, a cross-region replica is the cheapest insurance you can buy.

How do you add and operate read replicas safely?

Adding a replica is a four-step loop. The order matters — provision before routing, verify lag before cutover, and treat the promotion path as a documented runbook before you ever need it.

1. Profile the read:write ratio and pick the use case

Pull 7-14 days of Performance Insights or CloudWatch metrics — focus on read IOPS vs write IOPS, and database load by wait event. Above a 3:1 read:write ratio, a same-region replica usually pays for itself in primary right-sizing. Below 2:1, the replica is probably DR-only and should be sized accordingly. Don't add a replica because the check fired — add it because the workload actually benefits.
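
The thresholds in this step can be turned into a quick check. The sketch below assumes you feed it average ReadIOPS and WriteIOPS for the period; both function names are hypothetical, the "borderline" label for ratios between 2:1 and 3:1 is an added convenience rather than something this lesson defines, and the date calls use GNU date syntax.

```shell
# Sketch: average an AWS/RDS metric over the last 7 days (hourly datapoints).
# Usage: avg_rds_metric <db-instance-id> <ReadIOPS|WriteIOPS>
avg_rds_metric() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name "$2" \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --start-time "$(date -u -d '7 days ago' +%FT%TZ)" \
    --end-time "$(date -u +%FT%TZ)" \
    --period 3600 --statistics Average \
    --query 'avg(Datapoints[].Average)' --output text
}

# Classify a workload using the 3:1 and 2:1 thresholds from this step.
classify_read_write_ratio() {
  local read_iops="$1" write_iops="$2"
  awk -v r="$read_iops" -v w="$write_iops" 'BEGIN {
    if (w <= 0) { print "no writes: ratio undefined"; exit }
    ratio = r / w
    if (ratio > 3)      verdict = "read-scaling candidate"
    else if (ratio < 2) verdict = "DR-only"
    else                verdict = "borderline"
    printf "%.1f:1 %s\n", ratio, verdict
  }'
}
```

A call such as classify_read_write_ratio "$(avg_rds_metric prod-reports ReadIOPS)" "$(avg_rds_metric prod-reports WriteIOPS)" would combine the two. IOPS is only one of the signals mentioned here, so treat the verdict as a prompt to look at wait events too.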

2. Create the replica and verify lag before cutover

Use create-db-instance-read-replica to provision. For same-region, size it to match the read load you intend to send it (often smaller than the primary). For cross-region DR, size it to handle full production traffic after promotion. Wait for status available, then watch ReplicaLag for at least an hour under representative load before routing any application reads to it.
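
One way to make the "watch ReplicaLag before routing reads" rule mechanical is a small gate script. This is a sketch under stated assumptions: the function names are invented, the check count and interval are up to you, a missing datapoint is treated as a failure, and the date call uses GNU date syntax.

```shell
# Print the most recent 5-minute average ReplicaLag (seconds) for a replica.
latest_replica_lag() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS --metric-name ReplicaLag \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --start-time "$(date -u -d '5 minutes ago' +%FT%TZ)" \
    --end-time "$(date -u +%FT%TZ)" \
    --period 300 --statistics Average \
    --query 'Datapoints[0].Average' --output text
}

# Succeed only if lag is a number and at or below the allowed maximum.
lag_ok() {
  awk -v lag="$1" -v max="$2" 'BEGIN {
    if (lag !~ /^[0-9]+(\.[0-9]+)?$/) exit 1   # "None" or empty: no datapoint
    exit ((lag + 0 <= max + 0) ? 0 : 1)
  }'
}

# Poll lag <checks> times, <interval_s> apart; fail on the first bad reading.
wait_for_steady_lag() {
  local replica_id="$1" max_lag_s="$2" checks="$3" interval_s="$4"
  local i=0 lag
  while [ "$i" -lt "$checks" ]; do
    lag="$(latest_replica_lag "$replica_id")"
    if ! lag_ok "$lag" "$max_lag_s"; then
      echo "lag '${lag}' is above ${max_lag_s}s or missing, do not cut over" >&2
      return 1
    fi
    i=$((i + 1))
    sleep "$interval_s"
  done
  echo "lag stayed at or under ${max_lag_s}s for ${checks} checks"
}
```

For an hour of observation at five-minute intervals, wait_for_steady_lag prod-reports-replica-1 5 12 300 would be one reasonable invocation; the 5-second ceiling is an example value, not a recommendation from AWS.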

3. Route reads carefully — beware eventual consistency

Asynchronous replication means a read against the replica might miss a write that just happened on the primary. Route safe reads (reporting, search, dashboards, analytics) to the replica; route read-after-write paths (checkout confirmation, user-just-saved settings) to the primary. Most ORMs and connection-pooling layers (PgBouncer, ProxySQL, RDS Proxy with read endpoints for Aurora) support a read/write split with explicit hints — use them.
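
The routing rule can be illustrated with a tiny helper for scripts and batch jobs. This is only a sketch: the hostnames are placeholders, the workload labels are made up, and a real application would normally do this in its ORM or proxy layer. The design choice worth copying is the default: anything not explicitly marked as lag-tolerant goes to the primary.

```shell
# Hypothetical endpoints; substitute your own instance endpoints.
PRIMARY_HOST="${PRIMARY_HOST:-prod-reports.example.internal}"
REPLICA_HOST="${REPLICA_HOST:-prod-reports-replica-1.example.internal}"

# Pick a host by workload type. Only reads that tolerate replication lag
# go to the replica; everything else, including unknown labels, uses the primary.
db_host_for() {
  case "$1" in
    reporting|search|dashboard|analytics) echo "$REPLICA_HOST" ;;
    *) echo "$PRIMARY_HOST" ;;
  esac
}
```

A reporting job could then connect with psql -h "$(db_host_for reporting)", while a read-after-write path falls through to the primary.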

4. Document and rehearse the promotion runbook

A replica you've never promoted in anger is a replica you can't promote during an outage. Schedule a quarterly DR drill: promote a non-production replica with promote-read-replica, verify the application can repoint cleanly (DNS or connection-string update), measure the actual RTO, and tear it back down. Patching cadence matters too — apply minor-version patches to the replica first, validate, then to the primary, so you never patch the primary into a state the replica can't replicate from.

# DR drill: promote the replica, measure RTO, then tear down.
aws rds promote-read-replica \
  --db-instance-identifier prod-reports-replica-1 \
  --backup-retention-period 7

# Wait for promotion to complete (status moves through 'modifying' to 'available').
aws rds wait db-instance-available \
  --db-instance-identifier prod-reports-replica-1

# Confirm it's now a standalone DB (ReadReplicaSourceDBInstanceIdentifier should be null).
aws rds describe-db-instances \
  --db-instance-identifier prod-reports-replica-1 \
  --query 'DBInstances[0].ReadReplicaSourceDBInstanceIdentifier'
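
To capture the RTO number the drill asks for, the promotion can be timed end to end. The sketch below is one possible shape, with hypothetical function names and polling defaults. It polls until the instance is both available and no longer reports a replication source, because the instance can still show available for a moment right after the promote call. The measured time covers promotion only, not repointing the application.

```shell
# Sketch: promote a replica and report how long promotion took.
# Usage: run_promotion_drill <replica-id> [poll_seconds] [max_polls]
run_promotion_drill() {
  local replica_id="$1" poll_s="${2:-15}" max_polls="${3:-120}"
  local started finished polls=0 state
  started="$(date +%s)"
  aws rds promote-read-replica \
    --db-instance-identifier "$replica_id" >/dev/null || return 1
  while [ "$polls" -lt "$max_polls" ]; do
    # Text output prints "None" for a null replication source.
    state="$(aws rds describe-db-instances \
      --db-instance-identifier "$replica_id" \
      --query 'DBInstances[0].[DBInstanceStatus,ReadReplicaSourceDBInstanceIdentifier]' \
      --output text)"
    if [ "$state" = "$(printf 'available\tNone')" ]; then
      finished="$(date +%s)"
      echo "promotion took $((finished - started))s"
      return 0
    fi
    polls=$((polls + 1))
    sleep "$poll_s"
  done
  echo "promotion not confirmed after ${max_polls} polls" >&2
  return 1
}

# Tear the promoted drill instance back down (no final snapshot is kept).
teardown_drill_instance() {
  aws rds delete-db-instance \
    --db-instance-identifier "$1" \
    --skip-final-snapshot \
    --delete-automated-backups
}
```

Run this against a non-production replica, as the step above recommends; teardown_drill_instance is destructive and is only appropriate for an instance created for the drill.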

You've completed Add RDS read replicas for read scaling and DR. You can now read a workload's read:write ratio, decide whether a replica earns its keep, provision and verify it without breaking consistency guarantees, and rehearse the promotion path before you need it. The next time RDS-DR-004 fires, you'll have a four-step loop ready to run — and a primary that doesn't need to be one size larger than it should be.
