Site Reliability

Add RDS read replicas for read scaling and DR

Read replicas absorb read load and give you a promotion-ready DR target — for read-heavy workloads they pay for themselves in primary-instance right-sizing.

Last reviewed 27 May 2026

Read replicas: the basics

What is an RDS read replica and why does it get flagged?

An RDS read replica is an asynchronous, read-only copy of a primary database instance. For MySQL, MariaDB, PostgreSQL, and Oracle it uses the engine's native replication: MySQL and MariaDB replicas apply the primary's binlog, PostgreSQL replicas stream and replay WAL records, and Oracle replicas rely on Oracle Active Data Guard. For Aurora it's a different beast entirely: replicas attach to the same shared storage volume rather than re-applying writes, so lag is typically tens of milliseconds and you can stand up a reader without copying any data.

A database without read replicas is a single point of failure with two distinct problems. Every read query — even the boring SELECTs that drive dashboards, search, reporting, or feature flags — competes with writes on the same instance. And if the primary fails, your recovery option is a Multi-AZ failover (if you have it) or a restore from backup that takes minutes to hours. There's no warm secondary you can promote on a whim, and no regional fallback unless you've planned for one.

Continuity check RDS-DR-004 flags any DB instance running with zero read replicas. The severity is intentionally LOW — plenty of databases are write-heavy enough or small enough that a replica isn't worth the spend. The check is there to make you decide deliberately, not to fail every small dev database in the account.

In this lesson you'll learn when a read replica is the right call and when it isn't, how replication actually works under the hood (and where the lag bites you), and how to add a same-region replica or a cross-region DR replica with the exact CLI commands. You'll see the trade-off between replica cost and the primary right-sizing it unlocks, plus the promote-replica flow you'll actually run during a DR event.

Fun fact

GitHub's October 2018 outage was a replica problem

When GitHub lost connectivity between its East and West Coast sites for 43 seconds, the MySQL cluster failed over to a West Coast primary. The replicas in the East — now stale by tens of seconds of writes — couldn't catch up safely without risking data integrity. The team chose data consistency over uptime and spent 24 hours and 11 minutes reconciling writes by hand. The lesson: read replicas are a powerful DR option, but "asynchronous" means "sometimes behind," and the promotion decision needs a clear runbook before the incident, not during one.

Adding a read replica in action

Marco runs the data platform at a SaaS analytics company. RDS-DR-004 has fired on their production PostgreSQL instance — a db.r5.4xlarge running their reporting workload, $2,400/month. The team's been throwing bigger instances at slow dashboards for a year.

He pulls Performance Insights for the last 7 days. The wait events break down to roughly 78% read I/O and 12% write I/O — a 6.5:1 read:write ratio. Database load is consistently above the vCPU line, and the top query is a 200-line analytical SELECT that runs every 30 seconds from the dashboard service.

He decides on one same-region replica to absorb the reporting reads. If the math works out, the team can right-size both the primary and the replica to r5.2xlarge after a week of routing reads off the primary: net cost neutral against today's single r5.4xlarge, with a promotion-ready DR target as a bonus.

First, create the read replica in the same region. The replica inherits the primary's parameter group, security groups, and KMS key by default.

$ aws rds create-db-instance-read-replica --db-instance-identifier prod-reports-replica-1 --source-db-instance-identifier prod-reports --db-instance-class db.r5.4xlarge --availability-zone eu-west-1b --no-publicly-accessible --tags Key=Purpose,Value=read-scaling Key=Env,Value=prod

{
    "DBInstance": {
        "DBInstanceIdentifier": "prod-reports-replica-1",
        "DBInstanceClass": "db.r5.4xlarge",
        "Engine": "postgres",
        "DBInstanceStatus": "creating",
        "ReadReplicaSourceDBInstanceIdentifier": "prod-reports",
        "StorageEncrypted": true,
        "MultiAZ": false
    }
}

# Replica creation takes 15-40 minutes depending on source volume size.

Same-region replica creation. Sized to match the primary while the read load gets diverted.

Once it's available, verify replication lag before pointing application reads at it. Anything under a few seconds is normal; sustained double-digit seconds means the replica can't keep up and needs investigating.

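$ aws rds describe-db-instances --db-instance-identifier prod-reports-replica-1 --query 'DBInstances[0].{Status:DBInstanceStatus,Source:ReadReplicaSourceDBInstanceIdentifier,Replication:StatusInfos[0].Status}' --output table

------------------------------------------
|          DescribeDBInstances           |
+----------+-------------+---------------+
| Status   | Source      | Replication   |
+----------+-------------+---------------+
| available| prod-reports| replicating   |
+----------+-------------+---------------+

# describe-db-instances reports replication state, not lag in seconds; the lag itself comes from CloudWatch's ReplicaLag metric.

$ aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name ReplicaLag --dimensions Name=DBInstanceIdentifier,Value=prod-reports-replica-1 --start-time $(date -u -d '10 minutes ago' +%FT%TZ) --end-time $(date -u +%FT%TZ) --period 60 --statistics Average --query 'max_by(Datapoints, &Timestamp).Average' --output text

0.4

# 0.4s lag: well within reporting tolerance, safe to route SELECTs.

Replication lag check before cutover. Pair with a CloudWatch alarm on the ReplicaLag metric for ongoing visibility.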
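As a rough illustration of the rule of thumb above, the sketch below classifies a single lag reading in seconds. The function name `lag_verdict` and the 5-second cut-off standing in for "a few seconds" are assumptions made for this example; the 10-second threshold mirrors the "double-digit seconds" guidance.

```shell
# lag_verdict SECONDS
# Classify a ReplicaLag reading. Thresholds are illustrative assumptions:
#   under 5s  -> ok           ("a few seconds" in the text, cut-off assumed)
#   5s to 10s -> watch
#   10s or up -> investigate  ("double-digit seconds" in the text)
lag_verdict() {
  awk -v lag="$1" 'BEGIN {
    if (lag < 5)       print "ok"
    else if (lag < 10) print "watch"
    else               print "investigate"
  }'
}
```

Feed it an average taken over a window rather than one sample, since the guidance is about sustained lag: for example, `lag_verdict 0.4` for the reading shown above.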

How replication actually works (deep dive)

For MySQL, MariaDB, PostgreSQL, and Oracle, RDS read replicas use the database engine's native asynchronous replication. The primary records every committed change in its binlog (MySQL, MariaDB) or WAL stream (PostgreSQL); the replica connects to the primary, pulls that stream, and applies the changes locally. MySQL-family replicas re-apply logical binlog events, while PostgreSQL replicas replay physical WAL records. Apply is single-threaded in older MySQL versions and can run in parallel in newer ones, and PostgreSQL WAL replay is largely serial, so the apply path has far less parallelism than the primary's write path. A write-heavy primary can saturate the apply path on the replica, and lag will grow.

Aurora is fundamentally different. Storage is a shared distributed volume sitting underneath every instance in the cluster, so a reader doesn't replay writes at all: it sees the same 6-way-replicated storage the writer is updating. That's why Aurora supports up to 15 readers per cluster, lag is usually tens of milliseconds, and adding a reader doesn't require copying data. The trade-off is that the shared volume lives in a single region, so cross-region Aurora readers need a separate mechanism: Aurora Global Database, which replicates at the storage layer to a secondary cluster in another region.

Cross-region replicas add two costs you don't see in a same-region setup. Storage is doubled (the replica region needs its own copy of the data), and every byte of replication traffic is billed as cross-region data transfer at roughly $0.02/GB. Transfer cost tracks the volume of changes shipped, not the size of the database: $300/month at that rate is about 15 TB of replicated changes, or roughly 500 GB a day, which a very write-heavy 500GB database could generate. That is before the instance cost, but you get a region-isolated promotion target, which can be the difference between a 15-minute outage and a 6-hour AWS regional incident.

# Cross-region DR replica in eu-central-1, source primary lives in eu-west-1.
# Run the call against the destination region; --source-region names the primary's region.
aws rds create-db-instance-read-replica \
  --region eu-central-1 \
  --source-region eu-west-1 \
  --db-instance-identifier prod-reports-dr \
  --source-db-instance-identifier arn:aws:rds:eu-west-1:123456789012:db:prod-reports \
  --kms-key-id alias/rds-eu-central-1 \
  --db-instance-class db.r5.2xlarge \
  --tags Key=Purpose,Value=dr Key=Env,Value=prod

# Monitor lag via the cross-region replica's CloudWatch metrics.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value=prod-reports-dr \
  --start-time $(date -u -d '1 hour ago' +%FT%TZ) \
  --end-time $(date -u +%FT%TZ) \
  --period 60 --statistics Average Maximum
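
The transfer figure quoted earlier in this section can be reproduced with a one-line estimate. This is a minimal sketch: the function name `transfer_cost` is made up for illustration, and the default rate is the approximate $0.02/GB from the text, not a quoted AWS price.

```shell
# transfer_cost GB_PER_MONTH [RATE_PER_GB]
# Estimate monthly cross-region replication transfer cost in dollars.
# RATE_PER_GB defaults to 0.02, the approximate figure used in this lesson.
transfer_cost() {
  awk -v gb="$1" -v rate="${2:-0.02}" 'BEGIN { printf "%.2f\n", gb * rate }'
}
```

For example, `transfer_cost 15000` prints 300.00, so a $300/month transfer bill implies roughly 15 TB of replicated changes per month. The input should be the volume of changes shipped, not the size of the database.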

What is the impact of running RDS without read replicas?

The most direct impact is over-provisioning the primary. A read-heavy workload that should be served from a smaller primary plus a replica gets served from one oversized primary instead. For workloads with a read:write ratio above roughly 3:1, the cost of one replica is usually less than the right-sizing it unlocks — a $1,200/month replica that lets you drop a $2,400/month primary to $1,200/month is net zero on cost and adds DR.
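
The cost comparison in the paragraph above is simple enough to sketch as arithmetic. The function name `net_monthly_delta` is hypothetical, and all inputs are whole dollars per month.

```shell
# net_monthly_delta REPLICA_COST CURRENT_PRIMARY_COST RIGHTSIZED_PRIMARY_COST
# Net change in monthly spend after adding a replica and right-sizing the primary.
# 0 means cost-neutral; a positive number is the extra spend the replica must justify.
net_monthly_delta() {
  replica=$1
  current=$2
  rightsized=$3
  echo $(( replica - (current - rightsized) ))
}
```

With the figures from the text, `net_monthly_delta 1200 2400 1200` prints 0. A replica kept at the larger size, `net_monthly_delta 2400 2400 1200`, prints 1200, which is why the replica's own size matters to the cost-neutral claim.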

The second-order impact is operational fragility. Without a replica, every long-running analytical query, every backup-window I/O spike, every schema migration, and every pg_dump runs against the same instance serving production traffic. Latency goes up at the worst possible moments — month-end reporting, end-of-quarter exports, the moment a new dashboard ships — and the team's only lever is to upsize the instance again.

The third-order impact is DR posture. Without a read replica (or Multi-AZ standby), RPO and RTO are bounded by your backup cadence and restore time. If the most recent restorable backup is an hour old and the restore takes 45 minutes, you lose up to an hour of writes (RPO) and are down for at least 45 minutes (RTO). A read replica drops RPO to seconds and RTO to the time it takes to promote and repoint applications, usually under five minutes for same-region and ten to fifteen for cross-region.

And the cost of a regional outage without a cross-region replica is its own category. AWS regional events are rare but they happen — and "we don't have a DR plan for that region" usually means a multi-day outage while the team rebuilds from S3 backups in another region. For any system with revenue impact above a few thousand dollars an hour, a cross-region replica is the cheapest insurance you can buy.
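
One way to put numbers on the insurance argument is to set the cost of a single outage beside a year of replica spend. This is a sketch only: the function names are invented for the example, and the revenue rate and outage length you pass in are your own estimates, not figures from this lesson.

```shell
# outage_cost REVENUE_PER_HOUR HOURS
# Direct revenue exposure of one outage, in whole dollars.
outage_cost() {
  echo $(( $1 * $2 ))
}

# replica_annual_cost MONTHLY_COST
# Twelve months of replica spend (instance, storage, transfer), in whole dollars.
replica_annual_cost() {
  echo $(( $1 * 12 ))
}
```

For a hypothetical system earning $3,000 an hour, `outage_cost 3000 6` gives 18000 for a six-hour regional incident, against `replica_annual_cost 1200` at 14400 for a replica costing $1,200 a month all-in.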

How do you add and operate read replicas safely?

Adding a replica is a four-step loop. The order matters — provision before routing, verify lag before cutover, and treat the promotion path as a documented runbook before you ever need it.

1. Profile the read:write ratio and pick the use case

Pull 7-14 days of Performance Insights or CloudWatch metrics — focus on read IOPS vs write IOPS, and database load by wait event. Above a 3:1 read:write ratio, a same-region replica usually pays for itself in primary right-sizing. Below 2:1, the replica is probably DR-only and should be sized accordingly. Don't add a replica because the check fired — add it because the workload actually benefits.
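
The thresholds in this step can be written down as a small helper. This is a minimal sketch: `rw_ratio` and `replica_case` are hypothetical names, and the inputs are whatever read and write measures you pulled (IOPS, or shares of database load), as long as both use the same unit.

```shell
# rw_ratio READ WRITE
# Print the read:write ratio to one decimal place.
rw_ratio() {
  awk -v r="$1" -v w="$2" 'BEGIN {
    if (w <= 0) { print "inf"; exit }
    printf "%.1f\n", r / w
  }'
}

# replica_case READ WRITE
# Apply the rules of thumb from the text:
#   above 3:1 -> read-scaling   below 2:1 -> dr-only   otherwise -> borderline
replica_case() {
  awk -v r="$1" -v w="$2" 'BEGIN {
    if (w <= 0 || r / w > 3) print "read-scaling"
    else if (r / w < 2)      print "dr-only"
    else                     print "borderline"
  }'
}
```

With the 78% read and 12% write split from the worked example, `rw_ratio 78 12` prints 6.5 and `replica_case 78 12` prints read-scaling. The "borderline" band between 2:1 and 3:1 is where the text gives no firm rule, so treat it as a prompt to look closer rather than a verdict.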

2. Create the replica and verify lag before cutover

Use create-db-instance-read-replica to provision. For same-region, size it to match the read load you intend to send it (often smaller than the primary). For cross-region DR, size it to handle full production traffic after promotion. Wait for status available, then watch ReplicaLag for at least an hour under representative load before routing any application reads to it.

3. Route reads carefully — beware eventual consistency

Asynchronous replication means a read against the replica might miss a write that just happened on the primary. Route safe reads (reporting, search, dashboards, analytics) to the replica; route read-after-write paths (checkout confirmation, user-just-saved settings) to the primary. Many ORMs and proxy layers (ProxySQL, RDS Proxy reader endpoints for Aurora) support a read/write split with explicit hints, so use them. With a plain connection pooler such as PgBouncer, run separate pools for the primary and replica endpoints and choose between them in the application.
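
The routing rule above can be sketched as a lookup from workload type to endpoint. Everything named here is hypothetical: the `pick_endpoint` function, the workload labels, and the two example hostnames, which you would replace with your own endpoints.

```shell
# Hypothetical endpoints; override via the environment with real hostnames.
PRIMARY_HOST="${PRIMARY_HOST:-primary.example.internal}"
REPLICA_HOST="${REPLICA_HOST:-replica.example.internal}"

# pick_endpoint WORKLOAD
# Lag-tolerant reads go to the replica. Anything else, including
# read-after-write paths and unknown labels, goes to the primary.
pick_endpoint() {
  case "$1" in
    reporting|search|dashboard|analytics) echo "$REPLICA_HOST" ;;
    *) echo "$PRIMARY_HOST" ;;
  esac
}
```

Defaulting unknown workloads to the primary is a deliberate choice in this sketch: a misrouted read on the primary costs some capacity, while a misrouted read-after-write on the replica can show a user stale data.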

4. Document and rehearse the promotion runbook

A replica you've never promoted in anger is a replica you can't promote during an outage. Schedule a quarterly DR drill: promote a non-production replica with promote-read-replica, verify the application can repoint cleanly (DNS or connection-string update), measure the actual RTO, and tear it back down. Patching cadence matters too — apply minor-version patches to the replica first, validate, then to the primary, so you never patch the primary into a state the replica can't replicate from.

# DR drill: promote the replica, measure RTO, then tear down.
aws rds promote-read-replica \
  --db-instance-identifier prod-reports-replica-1 \
  --backup-retention-period 7

# Wait for promotion to complete (status moves through 'modifying' to 'available').
aws rds wait db-instance-available \
  --db-instance-identifier prod-reports-replica-1

# Confirm it's now a standalone DB (ReadReplicaSourceDBInstanceIdentifier should be null).
aws rds describe-db-instances \
  --db-instance-identifier prod-reports-replica-1 \
  --query 'DBInstances[0].ReadReplicaSourceDBInstanceIdentifier'
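
To record the RTO during a drill rather than estimate it, the waiting step can be wrapped in a timer. This is a minimal sketch with a hypothetical helper name, `time_step`; it measures wall-clock seconds for whatever command it is given and nothing more.

```shell
# time_step COMMAND [ARGS...]
# Run a command, discard its output, and print the elapsed whole seconds.
# Returns 1 without printing a duration if the command fails.
time_step() {
  start=$(date +%s)
  "$@" >/dev/null || return 1
  end=$(date +%s)
  echo $(( end - start ))
}
```

In a drill you might run `time_step aws rds wait db-instance-available --db-instance-identifier prod-reports-replica-1` straight after the promote call. That captures only the promotion wait; the time to repoint applications has to be measured separately and added to get the real RTO.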


You've completed Add RDS read replicas for read scaling and DR. You can now read a workload's read:write ratio, decide whether a replica earns its keep, provision and verify it without breaking consistency guarantees, and rehearse the promotion path before you need it. The next time RDS-DR-004 fires, you'll have a four-step loop ready to run — and a primary that doesn't need to be one size larger than it should be.

RDS read replicas: what they cost, what they buy, and when they're worth it

A tiering decision driven by read:write ratio — not every database needs one

An RDS read replica is a read-only copy of a primary database that absorbs SELECT traffic. It is billed as a separate instance at standard RDS rates — the same instance family and size you choose, plus its own storage. On a write-heavy database or a low-traffic dev instance the replica is pure overhead. On a read-heavy production database it can offset its own cost by allowing the primary to be right-sized down.

The financial model is straightforward to check: compare the replica's monthly cost against the cost difference between the current primary size and the next size down. A $1,200/month replica that lets the primary drop from $2,400/month to $1,200/month is cost-neutral and adds a promotion-ready recovery target at no incremental spend. If the primary can't be right-sized, the replica is a net cost add — justified only by DR or read-latency requirements, both of which should be budgeted explicitly.

RDS-DR-004 fires at LOW severity deliberately: it is not asserting every database needs a replica, it is asserting every database should have an explicit decision about whether it does. Finance's role is to ensure that decision is quantitative and recorded — read:write ratio drives the cost case, environment tier drives the DR case, and neither should be left to an unchecked engineer default.

This is the finance partner's read on RDS-DR-004. You'll understand the cost model (when a replica pays for itself through primary right-sizing and when it doesn't), the read:write ratio threshold that makes the case, and the cross-region cost add (instance plus data transfer) that changes the math for DR replicas. No CLI required; the goal is a sharper question at the next cloud cost review.

How a finance partner reads the replica cost case

Dana is reviewing the monthly cloud bill. The production reporting database — a db.r5.4xlarge at $2,400/month — has been upsized twice in two years. RDS-DR-004 has fired at LOW severity, flagging it has no read replica. Dana's first question is whether the replica would pay for itself.

The read:write ratio from the past 30 days is 6.5:1. A same-region replica of the same class, db.r5.4xlarge, would run about $2,400/month, the same as the current primary. But if routing reporting reads off the primary lets the team run both the primary and the replica as r5.2xlarge at $1,200/month each, the net change is zero: $1,200 saved on the primary, $1,200 added for the replica. What the business gets for that zero net spend is a warm DR target and an end to the recurring 'the dashboards are slow' incidents.

Dana documents the case in the cloud cost review: replica addition is cost-neutral at current read ratio, right-sizing delta to be confirmed after two weeks of routing data. If the primary doesn't drop in size, re-evaluate. The decision is specific, quantified, and time-boxed — not 'we added a replica because the check fired.'

Three cost impacts hidden inside RDS-DR-004

The most directly visible cost impact is chronic primary over-provisioning. A read-heavy database without a replica ends up on the largest instance the team can justify — because there's no other lever. Each upsize compounds the problem: the bill grows, the read-to-write ratio stays the same, and the underlying issue is never resolved. A replica that unlocks a one-tier primary right-size is often cost-neutral or better, but that math only gets run when someone looks at the read:write ratio explicitly.

The second cost impact is invisible until an incident: the absence of a DR target. Without a read replica or a Multi-AZ standby, recovering from a primary failure means restoring from backup, typically 45 minutes to several hours for a large database, plus the loss of whatever was written since the last restorable point. That recovery window is a quantifiable exposure. The question for the risk register is: at our revenue rate, what does a two-hour unplanned outage cost? A same-region replica that reduces RTO to under five minutes is almost always cheaper insurance than that number implies.

The third cost impact applies only to cross-region DR replicas: a data-transfer charge on top of the replica instance cost. Every byte of replication traffic crosses region boundaries at roughly $0.02/GB. The charge scales with the volume of changes replicated, not the database size: $300 per month at that rate corresponds to about 15 TB of replicated changes, which a very write-heavy 500GB database could generate. Cross-region replicas need to be explicitly justified against the cost of a regional outage, not added reflexively, but for revenue-critical systems the math tends to be straightforward.

How finance governs the replica decision without running CLI commands

Finance can't provision replicas, but it owns the framing that turns RDS-DR-004 from a reactive alert into a cost-model exercise. Four points of leverage, applied at the regular cadence.

1. Require a read:write ratio before approving any primary upsize

Before signing off on an RDS instance upsize, ask engineering to pull the read:write ratio from Performance Insights. If it's above 3:1, the right answer is usually a replica plus a smaller primary, not a larger primary alone. Making this a standing question converts a recurring cost driver into a one-time, often cost-neutral decision.

2. Model replica cost against primary right-sizing savings

For each flagged production database, build the simple two-column case: replica monthly cost vs. primary right-sizing delta. At a 3:1+ read ratio the columns frequently cancel out. Where they don't, the net replica cost needs to be justified explicitly — either as a DR investment (what is a two-hour outage worth?) or as a read-latency investment with a measurable application SLA benefit.

3. Separate DR replicas from read-scaling replicas in the budget

Same-region replicas added purely for DR should be budgeted as resilience spend, not infrastructure optimization. Cross-region replicas add an ongoing data-transfer charge on top of the instance cost — roughly $0.02/GB — which should be modeled and approved separately. Mixing these into a single 'replica' line obscures the purpose and makes cost reviews harder.

4. Treat intentional no-replica databases as a recorded decision

Not every database needs a replica, and that's correct. Dev, test, and low-traffic databases that don't justify the spend should be documented as intentionally single-instance with a recorded reason. That converts a silently ignored LOW finding into a visible, auditable decision — and prevents the list from growing unchecked as new databases are launched.

You've finished the finance partner's view of RDS-DR-004. You know how to run the cost model — read:write ratio against primary right-sizing delta — to determine whether a replica is cost-neutral, a net add, or cheaper insurance than the outage it prevents. You know the cross-region data-transfer cost that changes the math for DR replicas, and the four governance levers — requiring the ratio before upsizes, modeling the two-column case, separating DR from read-scaling in the budget, and recording intentional no-replica decisions — that keep this control from being either ignored or blindly remediated.

RDS read replicas: the one question

Are our read-heavy databases sized right, and do we have a promotion path if the primary goes down?

An RDS read replica routes read traffic away from the primary database. For databases dominated by reads — reporting, dashboards, search — it lets the primary run leaner and gives the team a pre-built recovery target to promote if the primary fails. Without one, read load and write load share the same instance, and recovery from a primary failure is a restore from backup measured in minutes to hours.

The check fires at LOW severity because it is a prompted decision, not a blanket finding. Most dev and low-traffic databases rightly have no replica. The executive question is narrower: for our production databases with high read load or revenue impact, have we consciously decided whether a replica is warranted — and is that decision on record? Resilience by policy, not by whoever last clicked through a launch wizard.

A short read for the executive who needs to know what RDS-DR-004 actually surfaces. You'll get the plain-English version of when a read replica is worth the spend, why it's a per-database tiering decision rather than a mandate, and what a well-governed answer looks like — explicit choices per production database, recorded, with clear ownership of both the cost and the recovery plan.

What it looks like when this decision is made deliberately

At one company the VP of Engineering kept approving database upsizes for the reporting workload — each one billed as a performance fix, none of them tracked against a root cause. When RDS-DR-004 flagged the database, the cloud team used the read:write ratio data to surface the real issue: 78% of the load was reads that could be served from a replica.

The executive conversation shifted from 'approve the next instance size' to 'approve a replica and right-size the primary — same cost, adds a DR target.' The spend didn't increase. What changed was that the decision was explicit, documented, and tied to a measurable outcome rather than repeated in the next budget cycle.

That's the right signal: not that every RDS-DR-004 finding gets a replica, but that each one produces a documented answer. That means a cost-neutral case for read-heavy workloads, an explicit DR justification for others, and a recorded single-instance-by-design decision for low-traffic databases that don't need one.

What running without a replica actually exposes the business to

The headline exposure is recovery time after a primary database failure. Without a read replica or a Multi-AZ standby, there is no warm copy to fail over to: the team restores from the most recent backup, which typically takes 45 minutes to several hours depending on database size, and loses any data written since that backup. For a checkout, billing, or user-data database, that window is a direct business impact.

The secondary exposure is a cost-efficiency gap that compounds over time. Databases without replicas absorb all read load on the primary, so read-heavy workloads get repeatedly upsized as a performance band-aid. The upsizes show up on the bill; the root cause does not. A replica converts that cycle into a one-time, often cost-neutral decision.

The leadership question is not "how many replicas do we have?" It's narrower: for our revenue- and customer-critical databases, do we have a documented, tested recovery path that isn't 'restore from a backup and hope,' and is the cost of that path modeled against the cost of the outage it prevents?

The leadership moves on RDS-DR-004

Three decisions, not sixteen. The executive handle is to set the default and require the exception to be documented.

1. Set a default for revenue-critical databases

Any database that backs a revenue- or customer-facing service should have a documented recovery path — either a read replica that can be promoted, a Multi-AZ standby, or both. Make that a policy, not a preference. The teams should be able to demonstrate, for each critical database, that a primary failure has a tested sub-ten-minute recovery path.

2. Accept single-instance for everything else — with a recorded reason

Dev, test, and low-traffic databases rightly have no replica. The policy goal is not replica count; it's that the no-replica decision is explicit and recorded rather than inherited from a launch-wizard default. A clean RDS-DR-004 picture is one where every production database has a deliberate answer, not one where every database has a replica.

3. Ask for the RTO on the databases that matter

At the next leadership review, ask for the measured — not theoretical — recovery time for each revenue-critical database after a primary failure. If the answer is 'restore from backup, estimated 90 minutes,' that's the exposure. If it's 'promote the replica, tested at four minutes last quarter,' that's coverage. The distinction is what this control is surfacing.

Two takeaways from this lesson: a database without a replica has a recovery time measured in tens of minutes to hours after a primary failure, not seconds — and on a read-heavy workload a replica often pays for itself by unlocking a primary right-size. The leadership question is whether every revenue-critical database has a documented, tested recovery path and a deliberate decision on record. If the answer is yes, the control is doing its job. If it's 'we haven't checked,' RDS-DR-004 is the prompt to find out.
