Site Reliability

Address stale backup recovery points

A backup plan that runs daily but whose newest recovery point is nine days old isn't protecting you — it's failing quietly. The age of your latest restore point is your real-world RPO.

12 min·10 sections·AWS

Last reviewed 27 May 2026

Stale recovery points: the basics

When your newest backup is older than your schedule says it should be

A recovery point is a single restorable copy of a resource — an EBS snapshot, an RDS cluster snapshot, a DynamoDB table backup — taken by a Backup Plan on a schedule. The freshness that matters isn't whether recovery points exist; it's the age of the newest one. A daily plan should never leave a resource with a latest recovery point older than roughly 24-36 hours (the schedule interval plus the backup window plus a little slack). When the newest point is nine days old, the plan has silently stopped protecting that resource.
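The 24-36 hour figure is plain addition. A minimal sketch, assuming hypothetical values for the backup window and the slack; the function names and the 8-hour and 4-hour figures are illustrative and not taken from any real plan:

```shell
# max_expected_age_hours INTERVAL_HOURS WINDOW_HOURS SLACK_HOURS
# Prints the oldest age (in hours) the newest recovery point should reach
# while the schedule is healthy. All three inputs are values you supply.
max_expected_age_hours() {
  local interval="$1" window="$2" slack="$3"
  echo $(( interval + window + slack ))
}

# is_stale AGE_HOURS THRESHOLD_HOURS
# Succeeds (exit 0) when the age is over the threshold, fails otherwise.
is_stale() {
  [ "$1" -gt "$2" ]
}

# Example: daily plan, assumed 8-hour backup window, assumed 4 hours of slack.
THRESHOLD=$(max_expected_age_hours 24 8 4)
echo "Flag anything older than ${THRESHOLD} hours"
```

With these inputs the threshold lands at the top of the 24-36 hour range; a nine-day-old point (216 hours) is far outside it.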

This check flags any resource whose most recent successful recovery point is older than its backup frequency implies it should be. It maps directly to RPO — recovery point objective, the maximum data loss you've agreed to tolerate. Your true RPO at any moment equals the age of the newest restorable recovery point. If the latest is nine days old, restoring it loses nine days of data, no matter what the plan's schedule claims on paper.

It's flagged because this is the quiet failure mode of backup. A missing Backup Plan screams; a stale one whispers. The plan still exists, the dashboard still shows it as "configured," and everyone assumes coverage is fine — right up until a restore is needed and the only available point is from before the data that mattered. The gap between the schedule on paper and the newest point in the vault is the entire risk.

In this lesson you'll learn why the age of the newest recovery point — not the existence of a plan — is what determines real-world RPO, the four ways recovery points go stale (silent job failures, selection gaps, lifecycle deleting points faster than they're created, and a schedule too infrequent for the resource's stated RPO), how to read the latest CreationDate per resource from the AWS CLI, and how to build an alarm that fires when any resource's newest recovery point crosses its RPO budget. You'll see the edge cases that bite and the fix: resolve the underlying failure or selection gap, then tighten frequency to actually meet the RPO.

Fun fact

The plan that ran for everyone except the one that mattered

A SaaS platform team audited their AWS Backup vault and found a plan dutifully producing recovery points every night — 312 resources, green across the board. But one resource, their primary Postgres cluster, had a newest recovery point 41 days old. Months earlier someone had renamed its BackupRequired tag during a Terraform refactor, and the tag-based selection silently stopped matching it. The plan kept running flawlessly for everything else, so no alarm fired and no job failed — there was simply no job for that one resource. They'd been one bad deploy away from a 41-day data-loss event, and the dashboard said "protected" the entire time.

Catching a stale recovery point in action

Lena owns reliability at a payments startup with a stated 24-hour RPO on every production database. The continuity dashboard flags one RDS cluster: its newest recovery point is 9 days old against a daily Backup Plan. Everything else in the vault is green, so it's not a broken plan — it's something specific to this one resource.

She pulls the recovery points for just that cluster, sorted by creation date. The newest CreationDate is indeed 9 days ago, and the job history shows a failed job every night since, each one a KMSKeyError. Someone replaced the KMS key that encrypts this cluster and didn't grant the new key to the backup service role, so every nightly job for this cluster has failed silently while the dashboard counted the last good point as "a recovery point exists."

She fixes the key grant, runs an on-demand backup to close the gap immediately, and confirms a fresh completed recovery point lands within the window. Then she wires up the real fix: a check that compares the newest recovery point's age to the 24-hour RPO budget per resource and alarms the moment any resource crosses it — so the next silent failure surfaces in hours, not after nine days.

List recovery points for the flagged resource, newest first, and read the latest CreationDate. The current time minus that date is your real RPO for this resource.

$ aws backup list-recovery-points-by-resource --resource-arn arn:aws:rds:us-east-1:123456789012:cluster:prod-payments --query 'reverse(sort_by(RecoveryPoints,&CreationDate))[:4].{Created:CreationDate,Status:Status,Vault:BackupVaultName}' --output table

-----------------------------------------------------------------------
|                    ListRecoveryPointsByResource                     |
+----------------------------+------------+---------------------------+
|          Created           |   Status   |           Vault           |
+----------------------------+------------+---------------------------+
|  2026-05-17T05:04:11Z      |  COMPLETED |  prod-backup-vault        |
|  2026-05-16T05:03:58Z      |  COMPLETED |  prod-backup-vault        |
|  2026-05-15T05:04:02Z      |  COMPLETED |  prod-backup-vault        |
|  2026-05-14T05:03:49Z      |  COMPLETED |  prod-backup-vault        |
+----------------------------+------------+---------------------------+

# Newest COMPLETED point is 2026-05-17 — today is 2026-05-26. Real RPO = 9 days.

# Against a 24-hour RPO budget, this resource is 8 days past its limit.

The newest COMPLETED CreationDate is the only number that matters — its age is your real-world RPO for this resource.

Now make it a check: compute the age of the newest recovery point and compare it to the RPO budget in hours. Anything over budget exits non-zero so CI or a scheduled job can alarm on it.

$ RPO_HOURS=24; ARN=arn:aws:rds:us-east-1:123456789012:cluster:prod-payments; bash check-rpo.sh "$ARN" "$RPO_HOURS"

Newest recovery point: 2026-05-17T05:04:11Z
Age: 218 hours / RPO budget: 24 hours
STALE: 194 hours past the RPO budget
$ echo $?
1

Newest-point-age vs RPO-budget as a non-zero exit — the difference between catching staleness in hours and discovering it in an incident.

How recovery points go stale under the hood (deep dive)

AWS Backup records every backup attempt as a job with a state such as COMPLETED, FAILED, ABORTED, or EXPIRED, and produces a restorable recovery point only on success. The list-recovery-points-by-resource call returns the resource's recovery points, each with a Status and a CreationDate; failed jobs never appear there. The freshness of a resource is therefore the CreationDate of its newest COMPLETED recovery point, full stop. A wall of FAILED jobs since then is invisible to anyone who only checks "does a recovery point exist?", because one still does; it's just old.

There are four ways the newest point drifts past where the schedule should keep it. First, silent job failures: a KMS key the service role can't use, an RDS storage-full condition, a resource modified mid-backup, or a completion window too short — each fails the job without deleting the last good point, so coverage looks intact. Second, selection gaps: tag-based selections (BackupRequired=true) stop matching when a tag is renamed or removed in a refactor, so the plan simply never schedules a job for that resource. Third, lifecycle eating its own tail: a DeleteAfterDays shorter than the effective creation cadence deletes points faster than new ones land, walking the newest point steadily older. Fourth, frequency mismatch: a weekly plan on a resource with a 24-hour RPO is stale-by-design — even when every job succeeds, the newest point can legitimately be six days old.
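The four causes can be checked in a fixed order once you have the facts in hand. A minimal sketch, assuming you have already gathered the inputs yourself (for example from the job history and the plan's rule); the function name, argument order, and output labels are hypothetical, and the lifecycle test is a deliberate simplification:

```shell
# classify_stale_cause FAILED_JOBS SELECTED DELETE_AFTER_DAYS INTERVAL_HOURS RPO_HOURS
#   FAILED_JOBS        failed or aborted jobs since the last good point (integer)
#   SELECTED           "yes" if the resource still matches a plan selection
#   DELETE_AFTER_DAYS  retention on the backup rule, in days
#   INTERVAL_HOURS     hours between scheduled runs
#   RPO_HOURS          the resource's RPO budget, in hours
# Prints the first cause that applies, checked in the order the text lists them.
classify_stale_cause() {
  local failed="$1" selected="$2" delete_days="$3" interval="$4" rpo="$5"
  if [ "$failed" -gt 0 ]; then
    echo "silent-job-failures"
  elif [ "$selected" != "yes" ]; then
    echo "selection-gap"
  elif [ $(( delete_days * 24 )) -lt "$interval" ]; then
    # Simplification: retention shorter than the gap between runs.
    echo "lifecycle-too-short"
  elif [ "$interval" -gt "$rpo" ]; then
    echo "frequency-mismatch"
  else
    echo "unknown"
  fi
}

# Example: nine failed nightly jobs on a daily plan with 35-day retention.
classify_stale_cause 9 yes 35 24 24
```

The ordering matters: a resource can have both failing jobs and a slow schedule, and the failure should be fixed first because tightening the schedule only produces more failed jobs.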

Storage cost is per recovery point per GB-month and varies by resource type: EBS snapshots run around $0.05/GB-month in US-East, with RDS, DynamoDB, EFS, and FSx each priced higher, and a cross-region copy roughly doubles the bill. Tightening frequency to meet an RPO adds recovery points and therefore cost, but the increment is usually small, because snapshots of most resource types are incremental and each new point stores only the blocks that changed. It is also trivial next to the cost of a multi-day data-loss event. The fix is never "take fewer backups"; it's "resolve the failure or selection gap, then make the cadence match the RPO."
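To put a rough number on that increment, here is a small sketch of the arithmetic. The function name and the stored-GB figures are hypothetical; only the $0.05/GB-month rate and the doubling for a cross-region copy come from the text above, and your actual stored size has to be measured, since incremental points do not each cost the full volume size:

```shell
# monthly_backup_cost STORED_GB PRICE_PER_GB_MONTH COPIES
#   STORED_GB           total GB actually held in recovery points (measured, not guessed)
#   PRICE_PER_GB_MONTH  per-GB-month rate for the resource type
#   COPIES              1 for a single region, 2 if a cross-region copy is kept
# Prints the estimated monthly cost with two decimals.
monthly_backup_cost() {
  awk -v gb="$1" -v price="$2" -v copies="$3" \
    'BEGIN { printf "%.2f\n", gb * price * copies }'
}

# Example with assumed sizes: 500 GB stored today, 520 GB after tightening
# the schedule, single region, EBS-style rate.
before=$(monthly_backup_cost 500 0.05 1)
after=$(monthly_backup_cost 520 0.05 1)
echo "before=${before} after=${after} (USD per month)"
```

The point of the sketch is the shape of the comparison, not the figures: the delta between the two calls is what tightening the cadence costs.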

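#!/usr/bin/env bash
# check-rpo.sh: fail if the newest recovery point is older than the RPO budget.
# Exit codes: 0 = within budget, 1 = stale, 2 = no COMPLETED recovery point.
# Requires GNU date (for `date -d`).
ARN="${1:?usage: check-rpo.sh <resource-arn> [rpo-hours]}"; RPO_HOURS="${2:-24}"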

# Newest COMPLETED recovery point for this resource.
NEWEST=$(aws backup list-recovery-points-by-resource \
  --resource-arn "$ARN" \
  --query 'reverse(sort_by(RecoveryPoints[?Status==`COMPLETED`],&CreationDate))[0].CreationDate' \
  --output text)

if [ "$NEWEST" = "None" ] || [ -z "$NEWEST" ]; then
  echo "No COMPLETED recovery point found for $ARN"; exit 2
fi

AGE_HOURS=$(( ( $(date -u +%s) - $(date -u -d "$NEWEST" +%s) ) / 3600 ))
echo "Newest recovery point: $NEWEST"
echo "Age: ${AGE_HOURS} hours / RPO budget: ${RPO_HOURS} hours"

if [ "$AGE_HOURS" -gt "$RPO_HOURS" ]; then
  echo "STALE: $(( AGE_HOURS - RPO_HOURS )) hours past the RPO budget"; exit 1
fi
echo "OK: within RPO budget"

What is the impact of stale recovery points?

The direct impact is data loss measured in time. Your real RPO equals the age of the newest restorable recovery point, so a resource flagged with a 9-day-old point has a 9-day worst-case data-loss window today — regardless of what the plan's schedule promises. For a transactional system that's thousands of writes; for a customer database it's accounts, orders, and audit history that simply cannot be reconstructed. The schedule on paper is irrelevant at restore time; only the newest point in the vault can be restored.

The insidious part is false confidence. A stale recovery point doesn't trigger the alarms a missing plan does. The plan exists, the last good point exists, and most dashboards count "a recovery point is present" as healthy. So teams carry an exposure they believe they've eliminated, which is strictly worse than a known gap, because nobody is watching it. The lesson on stopped instances applies here in reverse: there, "stopped" looked free but cost money; here, "backed up" looks safe but isn't.

There's a recovery-time impact too. When the newest point is stale, an incident forces a choice between restoring old data (large RPO loss) or attempting a partial reconstruction from logs and replicas (large RTO cost, and often impossible). Neither is the clean restore the team rehearsed. Stale points quietly convert a 1-hour recovery story into a multi-day forensic exercise, which is exactly when leadership and customers are watching most closely.

Finally, it's an audit and compliance exposure. SOC 2 CC7.5, ISO 27001 A.12.3, HIPAA backup provisions, and PCI DSS all expect backups that are demonstrably running and recoverable — not merely configured. A resource whose newest recovery point is weeks old against a daily plan is a control failure an auditor will escalate, and "the plan was enabled" is not a defense when the evidence shows no fresh recovery points were produced.

How do you keep recovery points fresh?

Closing a stale-recovery-point gap is a four-step loop: measure the age of the newest point per resource, find why it drifted, fix the underlying failure or selection gap, then make the cadence actually meet the RPO — and alarm continuously so the next gap surfaces in hours.

1. Measure newest-point age per resource against its RPO budget

For every protected resource, pull the newest COMPLETED recovery point's CreationDate and compute its age. Compare that age to the resource's stated RPO — 24 hours for a daily-plan database, 1 hour for a high-value transactional system. The newest-point age, not the plan's existence, is the metric. Tag resources with their RPO (RPOHours=24) so the budget travels with the resource and the check is self-describing.

2. Diagnose why the newest point drifted

Walk the four causes in order. Check list-backup-jobs for a wall of FAILED/ABORTED jobs since the last good point (KMS grant, storage-full, completion window). Check whether the resource still matches its plan's selection — a renamed or dropped tag is the classic silent gap. Check the rule's DeleteAfterDays against the real creation cadence — lifecycle can delete points faster than they land. If all jobs succeed and selection is intact, the cause is frequency mismatch: the schedule is simply too slow for the RPO.

3. Resolve the failure, then close the gap immediately

Fix the root cause — grant the KMS key to the backup role, re-tag the resource into its selection, lengthen the completion window, or correct the lifecycle. Then run an on-demand backup (start-backup-job) to land a fresh recovery point right away rather than waiting for the next scheduled run; a stale resource shouldn't stay stale for another full cycle while the fix bakes. Confirm a new COMPLETED point appears within the backup window before considering it closed.

4. Tighten frequency to meet RPO and alarm on age continuously

If the cause was frequency mismatch, change the rule's ScheduleExpression so the interval is comfortably inside the RPO, for example cron(0 */6 ? * * *) for a system that must never exceed a few hours of loss. Then wire continuous detection: an EventBridge rule on AWS Backup "Backup Job State Change" events catches failures live, and a scheduled check comparing newest-point age to the RPOHours tag catches selection and lifecycle drift. AWS Backup Audit Manager's controls for minimum backup frequency and for the last recovery point created report the same thing nightly. The goal is that the next stale point alarms in hours, not weeks.
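Step 1 across many resources can be sketched as a sweep over "ARN budget" pairs. This is a sketch under stated assumptions, not a finished tool: the function names are hypothetical, it relies on GNU date, and it assumes each resource carries the RPOHours tag described in step 1.

```shell
# newest_age_hours ARN
# Prints the age in whole hours of the newest COMPLETED recovery point.
# Returns non-zero if the lookup fails or no COMPLETED point exists.
newest_age_hours() {
  local newest
  newest=$(aws backup list-recovery-points-by-resource \
    --resource-arn "$1" \
    --query 'reverse(sort_by(RecoveryPoints[?Status==`COMPLETED`],&CreationDate))[0].CreationDate' \
    --output text < /dev/null) || return 1
  if [ -z "$newest" ] || [ "$newest" = "None" ]; then
    return 1
  fi
  echo $(( ( $(date -u +%s) - $(date -u -d "$newest" +%s) ) / 3600 ))
}

# sweep_rpo reads "ARN RPO_HOURS" pairs on stdin, one per line.
# Prints one status line per resource and returns non-zero if any
# resource is over budget or has no COMPLETED recovery point.
sweep_rpo() {
  local arn rpo age rc=0
  while read -r arn rpo; do
    if [ -z "$arn" ]; then continue; fi
    if ! age=$(newest_age_hours "$arn"); then
      echo "NO-POINT $arn"; rc=1; continue
    fi
    if [ "$age" -gt "$rpo" ]; then
      echo "STALE $arn age=${age}h budget=${rpo}h"; rc=1
    else
      echo "OK $arn age=${age}h budget=${rpo}h"
    fi
  done
  return "$rc"
}
```

One way to feed it, assuming the Resource Groups Tagging API is available in the account, is to pipe the output of `aws resourcegroupstaggingapi get-resources --tag-filters Key=RPOHours` with a `--query` that emits each ResourceARN alongside its RPOHours value as text. Keeping the lookup in its own function means the comparison logic can be exercised with a stub before it ever touches an account.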

# Step 2: are recent jobs failing for this resource? A wall of FAILED behind one old COMPLETED is the tell.
aws backup list-backup-jobs \
  --by-resource-arn arn:aws:rds:us-east-1:123456789012:cluster:prod-payments \
  --query 'BackupJobs[:6].{Created:CreationDate,State:State,Msg:StatusMessage}' \
  --output table

# Step 3: close the gap now with an on-demand backup rather than waiting for the next schedule.
aws backup start-backup-job \
  --backup-vault-name prod-backup-vault \
  --resource-arn arn:aws:rds:us-east-1:123456789012:cluster:prod-payments \
  --iam-role-arn arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole

# Step 4: tighten the rule so cadence sits comfortably inside the RPO (every 6h for a few-hour RPO).
# Edit the plan's ScheduleExpression to: cron(0 */6 ? * * *)
aws backup update-backup-plan --backup-plan-id a1b2c3d4-5e6f-7890-abcd-ef0123456789 \
  --backup-plan file://prod-tiered-plan.json
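The live-failure half of step 4 can be sketched as an EventBridge rule. This assumes the AWS Backup job event carries its state in a `detail.state` field and that you already have an SNS topic whose policy lets EventBridge publish to it; the function names, the rule name, and the topic ARN are placeholders you supply.

```shell
# backup_failure_pattern prints an EventBridge event pattern that matches
# AWS Backup jobs ending in a non-success state.
backup_failure_pattern() {
  cat <<'JSON'
{
  "source": ["aws.backup"],
  "detail-type": ["Backup Job State Change"],
  "detail": { "state": ["FAILED", "ABORTED", "EXPIRED"] }
}
JSON
}

# create_backup_failure_rule RULE_NAME SNS_TOPIC_ARN
# Creates the rule and points it at the topic. Nothing is created until
# you call this function yourself with your own values.
create_backup_failure_rule() {
  local name="$1" topic_arn="$2"
  aws events put-rule --name "$name" --event-pattern "$(backup_failure_pattern)"
  aws events put-targets --rule "$name" --targets "Id=1,Arn=${topic_arn}"
}
```

This rule only sees jobs that ran and failed. A resource that dropped out of its selection produces no job and therefore no event, which is why the scheduled age check is still needed alongside it.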

Quick quiz

Question 1 of 5

A daily Backup Plan is enabled and shows green, but one RDS cluster's newest recovery point is 9 days old. The plan runs fine for every other resource. The cluster has a 24-hour RPO. What's the right first move?

Keep learning

Dig deeper into recovery points, RPO measurement, and continuous backup monitoring.

You've completed Address stale backup recovery points. You now know that real RPO is the age of your newest restorable recovery point — not the schedule on paper — the four ways points drift stale (silent failures, selection gaps, lifecycle eating its tail, frequency mismatch), how to read newest-point age from the CLI, and the four-step loop to diagnose, close the gap on demand, and alarm continuously. The next time a backup dashboard shows all-green, you'll know to ask the only question that matters: how old is the newest recovery point?


Stale recovery points: what it means for risk

A backup that exists on paper but hasn't actually run

Most people treat "we have backups" as a yes/no fact. It isn't. A backup is only as good as its most recent successful copy. This finding flags resources where the schedule says "daily" but the newest usable copy is, say, nine days old — the policy is configured, but it has quietly stopped running for that resource. From a risk standpoint that's worse than no backup, because it creates false confidence: everyone believes they're covered.

The number that matters here is the recovery point objective, or RPO — the most data the business has agreed it can afford to lose. Your real RPO is simply the age of the newest restorable copy. If the latest recovery point for a customer database is nine days old, then a failure today means restoring to nine days ago and losing everything since. For a billing or transactional system, nine days of lost data is a customer-trust and possibly a regulatory event, not a technical footnote.

This is also an audit and compliance exposure. SOC 2, ISO 27001, HIPAA, and PCI all expect backups that are not just configured but demonstrably running and recoverable. A plan that hasn't produced a fresh recovery point in over a week is exactly the kind of finding an auditor escalates. At the operational review, the question to ask isn't "do we have backups?" — it's "what is the age of the newest recovery point for each protected system, and how does that compare to the RPO we promised?"

This lesson is for the finance partner who signs off on "we're backed up" and needs to know what that claim is actually worth. It explains RPO in plain terms — how much data you'd lose if you restored the newest copy available — why a configured plan can still leave you exposed, what audit and compliance owners expect to see, and the one question to put on the operational-review agenda so a silent backup failure surfaces in a week rather than in an incident post-mortem. No CLI, no internals.

How a finance partner reads the exposure

Priya is the finance partner embedded with the platform team at a payments startup. At the operational review, the reliability lead presents the backup dashboard as "all green." Priya asks the question that is now a standing agenda item: "For each production database, how old is the newest restorable backup, and how does that compare to the RPO we promised customers?" The lead pulls the detail and one RDS cluster shows a newest recovery point 9 days old against a 24-hour commitment.

Priya doesn't need to know it was a KMS key grant. What she frames is the exposure: for nine days, the worst-case data loss on a customer payments database has been nine days, not the one day the business signed up for — and that's a customer-trust and audit issue, not a technical detail. She asks two follow-ups: how long was it stale before anyone noticed, and what now alerts us automatically. The answers — "nine days" and "nothing, until this review" — are the real finding.

The commitment Priya extracts isn't "fix the cluster"; engineering already did that in the meeting. It's that the age of the newest recovery point per critical system becomes a standing line on the operational review, with an automatic alert when any system crosses its RPO budget. The dollar cost of a fresh nightly backup is trivial; the cost of discovering a nine-day gap during an actual incident is not.

Why this matters to risk, not just operations

The exposure here isn't a line item — it's a contingent liability. Every system with a stale recovery point carries a worst-case data-loss window equal to the age of its newest backup. For a revenue or compliance-critical system, the cost of materializing that loss — lost transactions, customer churn, regulatory penalties, breach-notification obligations — dwarfs anything in the regular cloud bill. Finance should treat a stale recovery point on a critical system the way it treats an uninsured asset.

The remediation cost, by contrast, is negligible. Closing the gap is an engineering fix plus a modest increase in recovery-point storage when frequency is tightened to meet the RPO — typically a few percent on an already-small backup line. The asymmetry is the whole point: spending a little to keep backups fresh is cheap insurance against an event whose cost is open-ended. This is one of the rare findings where the right financial move is to spend slightly more, deliberately.

It interacts with the compliance budget directly. Audits, certifications, and customer security questionnaires all require evidence that backups run and recover. A stale-recovery-point finding turns a clean audit into a remediation cycle with its own cost and timeline, and in regulated industries it can gate contracts and renewals. The cheapest place to spend here is on continuous measurement of recovery-point age, so the finding never reaches an auditor in the first place.

Finally, treat it as a leading indicator of recovery readiness. A backup line that grows steadily while recovery-point ages quietly drift is the signature of a plan running on autopilot that nobody is actually validating. The metric to watch isn't backup spend — it's the age of the newest recovery point per critical system against its RPO budget. If that's measured and flat, the spend is doing its job; if it's unmeasured, the spend is buying false confidence.

What finance can actually do about this

Finance can't run a backup job, but it can make recovery-point freshness a measured, accountable number instead of an assumption. Four levers, used at the operational review.

1. Put newest-point age on the operational review

Add "age of the newest recovery point per critical system, vs its RPO budget" as a standing line — alongside how long any breach has been open and what alerts on it. The green/red plan count is not the metric; the age of the newest restorable copy is. If any critical system trends past its RPO budget, that's the escalation, and it should surface in a week, not in an incident review.

2. Tie RPO to the systems that carry real exposure

Not every resource needs a tight RPO; the budget should be set deliberately for the systems whose data loss is a customer-trust or compliance event. Require that each critical system has a documented RPO and that the newest-point age is reported against it. This converts "are we backed up?" into a measurable, defensible number an auditor and a board can both read.

3. Fund the fix — it's cheap insurance

Closing these gaps costs a little more recovery-point storage and an engineering fix, against an open-ended loss event. Approve the modest increase in backup spend that comes with tightening frequency to meet the RPO, and treat it as insurance, not waste. This is one of the rare findings where the right financial answer is to deliberately spend slightly more.

4. Watch whether it's measured, not just whether it's green

The failure mode here is autopilot — a plan that runs unattended while recovery-point ages quietly drift. The question that exposes it isn't "do we have backups?" but "who is alerted, automatically, when a recovery point goes stale, and how fast?" If the answer is "nobody, until someone checks," the spend is buying false confidence and that's the conversation, regardless of the dollar amount.

Quick quiz

Question 1 of 5

The reliability team presents an all-green backup dashboard, but you learn one production database's newest restorable backup is 9 days old against a 24-hour RPO commitment. As the finance partner, what's the right next move?


You've finished the finance partner's view of stale recovery points. You know that "we have backups" is worth exactly the age of the newest restorable copy, why a configured plan can still leave a critical system exposed, and the four finance levers: put newest-point age on the operational review, tie RPO to the systems that carry real exposure, fund the cheap fix, and watch whether freshness is measured rather than merely green. Next time the backup dashboard is green, you'll have a sharper question than "are we backed up?"


Stale recovery points: the headline

The backup says it's running. It isn't.

A configured backup plan can silently stop protecting a system — failed jobs, a resource that dropped out of coverage, a schedule too slow for the data — while still appearing healthy on every dashboard. The exposure is invisible until the day you need to restore and discover the newest copy is days old. Your worst-case data-loss window equals the age of that newest copy.

This is a business-continuity issue, not a cost one. A stale recovery point on a customer-facing system means a disaster recovery today would roll the business back days, not hours. The one question worth asking: what is our worst-case data-loss window on our most critical systems right now — and does anyone actually measure the age of the latest backup, or just confirm a plan exists?

A short read for the exec who needs the continuity headline and one question to ask. You'll get the framing — real RPO is the age of your newest backup, not the schedule on paper — what a growing gap signals about operational discipline, and the single question that exposes the risk without any technical depth.

What it looks like when the org gets this right

At one company the quarterly continuity review used to open with a green backup dashboard and a confident "everything's protected." Then an incident exposed a database whose newest backup had quietly been weeks old, and the recovery rolled customers back further than anyone thought possible. The exec sponsor changed the standing question from "do we have backups?" to "what is the age of the newest recovery point on our most critical systems, and does it meet the RPO we promised?"

Within two quarters the review changed shape. The headline was no longer a green/red plan count; it was a single number per critical system — worst-case data-loss window in hours — and an automatic alert whenever any system drifted past its budget. The org stopped confusing "a plan is configured" with "a fresh backup exists," and the gap between the two became something measured continuously rather than discovered during a disaster. That's the right outcome state: RPO is a number you watch, not a hope you hold.

Why this is on the report at all

This category measures the gap between what the business promised about recovery and what is actually true today. A configured backup plan is a promise; the age of the newest recovery point is the reality. When those diverge, the business is carrying a continuity risk it believes it has retired — and that belief is precisely what makes it dangerous, because no one is watching a risk they think is closed.

The trend and coverage say more than any single dollar figure. If every critical system's worst-case data-loss window is measured, small, and stable, the org's recovery discipline is real. If "we have backups" is the only assurance on offer, the org is one silent failure away from discovering its true RPO during an incident, which is the most expensive possible time to find out. This sits at the intersection of continuity, compliance, and customer trust.

The leadership move on this category

The actionable handle for an executive isn't to manage backup jobs — it's to insist recovery readiness is measured continuously rather than assumed.

1. Ask for the worst-case data-loss window, not a plan count

The one question that exposes this risk: "What is the age of the newest backup on our most critical systems right now, and does it meet the RPO we promised customers?" A green dashboard answers a different, easier question. Make the worst-case data-loss window per critical system the number that gets reported.

2. Insist on automatic alerting, not periodic checking

A stale recovery point that's caught at a quarterly review was exposed for a quarter. Require that any system drifting past its RPO budget triggers an automatic alert in hours. The difference between continuous detection and periodic checking is the difference between a near-miss and a headline data-loss event.

3. Treat measured RPO as a confidence signal

If every critical system's data-loss window is measured, small, and stable across reviews, recovery discipline is real and leadership can look elsewhere. If "we have backups" is the only assurance offered, the org doesn't actually know its RPO — and will discover it during an incident, at the worst possible moment.

Quick quiz

Question 1 of 5

You ask your team for the worst-case data-loss window on critical systems. They report it's measured per system, small, stable across the last three reviews, and any drift past the RPO budget triggers an automatic alert. What's the right read?


That's the lesson. Two takeaways worth holding onto: a configured backup plan is a promise, but your real worst-case data-loss window is the age of your newest backup — and that gap must be measured and alerted continuously, not assumed from a green dashboard. The leadership question is "what is our worst-case data-loss window right now?" — not "do we have backups?"


Part of the learning path Build in resilience
