Stale recovery points: the basics
When your newest backup is older than your schedule says it should be
A recovery point is a single restorable copy of a resource — an EBS snapshot, an RDS cluster snapshot, a DynamoDB table backup — taken by a Backup Plan on a schedule. The freshness that matters isn't whether recovery points exist; it's the age of the newest one. A daily plan should never leave a resource with a latest recovery point older than roughly 24-36 hours (the schedule interval plus the backup window plus a little slack). When the newest point is nine days old, the plan has silently stopped protecting that resource.
This check flags any resource whose most recent successful recovery point is older than its backup frequency implies it should be. It maps directly to RPO — recovery point objective, the maximum data loss you've agreed to tolerate. Your true RPO at any moment equals the age of the newest restorable recovery point. If the latest is nine days old, restoring it loses nine days of data, no matter what the plan's schedule claims on paper.
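The arithmetic behind that claim is small enough to sketch. The helper below is an illustration only, not part of AWS Backup: `age_hours` is a hypothetical name, it assumes GNU `date` (for the `-d` option) and ISO 8601 timestamps, and its optional second argument exists only so the calculation can be checked against a fixed "now".

```shell
#!/usr/bin/env bash
# age_hours <creation-timestamp> [now-timestamp]
# Prints the whole hours elapsed between a recovery point's CreationDate and "now".
# Hypothetical helper; assumes GNU date and ISO 8601 timestamps.
age_hours() {
  local created_s now_s
  created_s=$(date -u -d "$1" +%s) || return 1
  if [ -n "${2:-}" ]; then
    now_s=$(date -u -d "$2" +%s) || return 1
  else
    now_s=$(date -u +%s)
  fi
  echo $(( (now_s - created_s) / 3600 ))
}

# A point created nine days before "now" means 216 hours of exposure.
age_hours "2024-03-01T02:00:00Z" "2024-03-10T02:00:00Z"
```

Whatever the plan's schedule says, that number is what a restore would actually lose.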
It's flagged because this is the quiet failure mode of backup. A missing Backup Plan screams; a stale one whispers. The plan still exists, the dashboard still shows it as "configured," and everyone assumes coverage is fine — right up until a restore is needed and the only available point is from before the data that mattered. The gap between the schedule on paper and the newest point in the vault is the entire risk.
In this lesson you'll learn why the age of the newest recovery point — not the existence of a plan — is what determines real-world RPO, the four ways recovery points go stale (silent job failures, selection gaps, lifecycle deleting points faster than they're created, and a schedule too infrequent for the resource's stated RPO), how to read the latest CreationDate per resource from the AWS CLI, and how to build an alarm that fires when any resource's newest recovery point crosses its RPO budget. You'll see the edge cases that bite and the fix: resolve the underlying failure or selection gap, then tighten frequency to actually meet the RPO.
The plan that ran for everyone except the one that mattered
A SaaS platform team audited their AWS Backup vault and found a plan dutifully producing recovery points every night — 312 resources, green across the board. But one resource, their primary Postgres cluster, had a newest recovery point 41 days old. Months earlier someone had renamed its BackupRequired tag during a Terraform refactor, and the tag-based selection silently stopped matching it. The plan kept running flawlessly for everything else, so no alarm fired and no job failed — there was simply no job for that one resource. They'd been one bad deploy away from a 41-day data-loss event, and the dashboard said "protected" the entire time.
Catching a stale recovery point in action
Lena owns reliability at a payments startup with a stated 24-hour RPO on every production database. The continuity dashboard flags one RDS cluster: its newest recovery point is 9 days old against a daily Backup Plan. Everything else in the vault is green, so it's not a broken plan — it's something specific to this one resource.
She pulls the recovery points for just that cluster, sorted by creation date. The newest CreationDate is indeed 9 days ago, and there's a run of failed job records every night since, each one a KMSKeyError. Someone rotated the vault's KMS key and didn't grant the new key to the backup service role, so every nightly job for resources in that vault has failed silently while the dashboard counted the last good point as "a recovery point exists."
She fixes the key grant, runs an on-demand backup to close the gap immediately, and confirms a fresh completed recovery point lands within the window. Then she wires up the real fix: a check that compares the newest recovery point's age to the 24-hour RPO budget per resource and alarms the moment any resource crosses it — so the next silent failure surfaces in hours, not after nine days.
List recovery points for the flagged resource, newest first, and read the latest CreationDate. The time elapsed since that date is your real RPO for this resource.
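A minimal sketch of that listing, wrapped in a function so the ARN is a parameter. The function name `list_newest_points` is hypothetical and the ARN in the usage comment is a placeholder; the command, its `--resource-arn` option, and the `CreationDate`, `Status`, and `BackupVaultName` fields come from the AWS Backup CLI. It assumes credentials and a default region are already configured.

```shell
# list_newest_points <resource-arn>
# Shows the five most recent recovery points for one resource, newest first.
list_newest_points() {
  aws backup list-recovery-points-by-resource \
    --resource-arn "$1" \
    --query 'reverse(sort_by(RecoveryPoints,&CreationDate))[:5].{Created:CreationDate,Status:Status,Vault:BackupVaultName}' \
    --output table
}

# Example (placeholder ARN):
# list_newest_points arn:aws:rds:us-east-1:123456789012:cluster:prod-payments
```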
The newest COMPLETED CreationDate is the only number that matters — its age is your real-world RPO for this resource.
Now make it a check: compute the age of the newest recovery point and compare it to the RPO budget in hours. Anything over budget exits non-zero so CI or an EventBridge rule can alarm.
Newest-point-age vs RPO-budget as a non-zero exit — the difference between catching staleness in hours and discovering it in an incident.
How recovery points go stale under the hood
AWS Backup records every backup attempt as a job with a state (COMPLETED, FAILED, ABORTED, or EXPIRED) and produces a restorable recovery point only on success. The list-recovery-points-by-resource call returns recovery points rather than jobs, each with a Status and a CreationDate, so failed attempts never appear in it. The freshness of a resource is therefore the CreationDate of its newest COMPLETED recovery point, full stop. A wall of FAILED jobs since then is invisible to anyone who only checks "does a recovery point exist?", because one still does; it's just old.
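To make that hidden wall of failures visible, one option is to count the failed jobs created after the newest good point. This is a sketch under assumptions: `failed_jobs_since` is a hypothetical name, the timestamp you pass would be the newest COMPLETED CreationDate, and only the FAILED state is counted (ABORTED and EXPIRED would need separate calls). As far as I know, `list-backup-jobs` only covers roughly the last 30 days of jobs, so an older gap will undercount.

```shell
# failed_jobs_since <resource-arn> <timestamp>
# Counts FAILED backup jobs for one resource created after the given time,
# for example the CreationDate of its newest COMPLETED recovery point.
failed_jobs_since() {
  aws backup list-backup-jobs \
    --by-resource-arn "$1" \
    --by-state FAILED \
    --by-created-after "$2" \
    --query 'length(BackupJobs)' \
    --output json
}

# Example (placeholder ARN and date):
# failed_jobs_since arn:aws:rds:us-east-1:123456789012:cluster:prod-payments 2024-03-01T02:00:00Z
```

A non-zero count next to an old newest point indicates silent job failure rather than a selection gap, where there would be no jobs at all.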
There are four ways the newest point drifts past where the schedule should keep it. First, silent job failures: a KMS key the service role can't use, an RDS storage-full condition, a resource modified mid-backup, or a completion window too short — each fails the job without deleting the last good point, so coverage looks intact. Second, selection gaps: tag-based selections (BackupRequired=true) stop matching when a tag is renamed or removed in a refactor, so the plan simply never schedules a job for that resource. Third, lifecycle eating its own tail: a DeleteAfterDays shorter than the effective creation cadence deletes points faster than new ones land, walking the newest point steadily older. Fourth, frequency mismatch: a weekly plan on a resource with a 24-hour RPO is stale-by-design — even when every job succeeds, the newest point can legitimately be six days old.
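The fourth cause can be checked without touching AWS at all, because it is pure arithmetic on the schedule. The sketch below assumes the worst-case age of a healthy resource is one schedule interval plus the backup window; both function names and the 8-hour window are illustrative, not values from any real plan.

```shell
# worst_case_age_hours <interval-hours> <window-hours>
# Oldest the newest point can get when every job succeeds: one full interval
# plus the time a job is allowed to take to complete.
worst_case_age_hours() {
  echo $(( $1 + $2 ))
}

# meets_rpo <interval-hours> <window-hours> <rpo-hours>
# Succeeds (exit 0) only if the schedule can hold the RPO even when healthy.
meets_rpo() {
  [ "$(worst_case_age_hours "$1" "$2")" -le "$3" ]
}

# Illustrative numbers: a weekly plan (168 h) with an 8 h window against a 24 h RPO.
if meets_rpo 168 8 24; then
  echo "schedule can meet the RPO"
else
  echo "stale by design: worst case $(worst_case_age_hours 168 8) h vs 24 h RPO"
fi
```

If this check fails, no amount of fixing jobs or tags will help; only a tighter schedule closes the gap.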
Storage cost is per recovery point per GB-month and varies by resource type: EBS snapshots run around $0.05/GB-month in US-East, with RDS, DynamoDB, EFS, and FSx each priced higher, and cross-region copies double it. Tightening frequency to meet an RPO adds recovery points and therefore cost, but the increment is usually small because incremental snapshots store only the data that changed since the previous point, and it is trivial next to the cost of a multi-day data-loss event. The fix is never "take fewer backups"; it's "resolve the failure or selection gap, then make the cadence match the RPO."
#!/usr/bin/env bash
# check-rpo.sh: fail if the newest recovery point is older than the RPO budget.
# Usage: check-rpo.sh <resource-arn> [rpo-hours]
# Assumes GNU date and that the CLI prints CreationDate as an ISO 8601 timestamp.
ARN="$1"; RPO_HOURS="${2:-24}"

# Newest COMPLETED recovery point for this resource.
NEWEST=$(aws backup list-recovery-points-by-resource \
  --resource-arn "$ARN" \
  --query 'reverse(sort_by(RecoveryPoints[?Status==`COMPLETED`],&CreationDate))[0].CreationDate' \
  --output text)

# An empty result means the resource has no restorable point at all.
if [ "$NEWEST" = "None" ] || [ -z "$NEWEST" ]; then
  echo "NO COMPLETED recovery points for $ARN"; exit 2
fi

# Age in whole hours, computed in UTC.
AGE_HOURS=$(( ( $(date -u +%s) - $(date -u -d "$NEWEST" +%s) ) / 3600 ))
echo "Newest recovery point: $NEWEST"
echo "Age: ${AGE_HOURS} hours / RPO budget: ${RPO_HOURS} hours"

if [ "$AGE_HOURS" -gt "$RPO_HOURS" ]; then
  echo "STALE: $(( AGE_HOURS - RPO_HOURS )) hours past the RPO budget"; exit 1
fi
echo "OK: within RPO budget"

What is the impact of stale recovery points?
The direct impact is data loss measured in time. Your real RPO equals the age of the newest restorable recovery point, so a resource flagged with a 9-day-old point has a 9-day worst-case data-loss window today — regardless of what the plan's schedule promises. For a transactional system that's thousands of writes; for a customer database it's accounts, orders, and audit history that simply cannot be reconstructed. The schedule on paper is irrelevant at restore time; only the newest point in the vault can be restored.
The insidious part is false confidence. A stale recovery point doesn't trigger the alarms a missing plan does. The plan exists, the last good point exists, and most dashboards count "a recovery point is present" as healthy. So teams carry an exposure they believe they've eliminated, which is strictly worse than a known gap because nobody is watching it. The stopped-instance lesson applies here in reverse: there, "stopped" looked free but cost money; here, "backed up" looks safe but isn't.
There's a recovery-time impact too. When the newest point is stale, an incident forces a choice between restoring old data (large RPO loss) or attempting a partial reconstruction from logs and replicas (large RTO cost, and often impossible). Neither is the clean restore the team rehearsed. Stale points quietly convert a 1-hour recovery story into a multi-day forensic exercise, which is exactly when leadership and customers are watching most closely.
Finally, it's an audit and compliance exposure. SOC 2 CC7.5, ISO 27001 A.12.3, HIPAA backup provisions, and PCI DSS all expect backups that are demonstrably running and recoverable — not merely configured. A resource whose newest recovery point is weeks old against a daily plan is a control failure an auditor will escalate, and "the plan was enabled" is not a defense when the evidence shows no fresh recovery points were produced.
How do you keep recovery points fresh?
Closing a stale-recovery-point gap is a four-step loop: measure the age of the newest point per resource, find why it drifted, fix the underlying failure or selection gap, then make the cadence actually meet the RPO — and alarm continuously so the next gap surfaces in hours.
1. Measure newest-point age per resource against its RPO budget
For every protected resource, pull the newest COMPLETED recovery point's CreationDate and compute its age. Compare that age to the resource's stated RPO — 24 hours for a daily-plan database, 1 hour for a high-value transactional system. The newest-point age, not the plan's existence, is the metric. Tag resources with their RPO (RPOHours=24) so the budget travels with the resource and the check is self-describing.
2. Diagnose why the newest point drifted
Walk the four causes in order. Check list-backup-jobs for a wall of FAILED/ABORTED jobs since the last good point (KMS grant, storage-full, completion window). Check whether the resource still matches its plan's selection — a renamed or dropped tag is the classic silent gap. Check the rule's DeleteAfterDays against the real creation cadence — lifecycle can delete points faster than they land. If all jobs succeed and selection is intact, the cause is frequency mismatch: the schedule is simply too slow for the RPO.
3. Resolve the failure, then close the gap immediately
Fix the root cause — grant the KMS key to the backup role, re-tag the resource into its selection, lengthen the completion window, or correct the lifecycle. Then run an on-demand backup (start-backup-job) to land a fresh recovery point right away rather than waiting for the next scheduled run; a stale resource shouldn't stay stale for another full cycle while the fix bakes. Confirm a new COMPLETED point appears within the backup window before considering it closed.
4. Tighten frequency to meet RPO and alarm on age continuously
If the cause was frequency mismatch, change the rule's ScheduleExpression so the interval is comfortably inside the RPO, for example cron(0 */6 ? * * *) for a system that must never exceed a few hours of loss. Then wire continuous detection: an EventBridge rule on AWS Backup's Backup Job State Change events catches failures live, and a scheduled check comparing newest-point age to the RPOHours tag catches selection and lifecycle drift. AWS Backup Audit Manager's Min Frequency control reports the same thing nightly. The goal is that the next stale point alarms in hours, not weeks.
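The two halves of that continuous detection might be sketched as follows. Both function names are hypothetical. The tag lookup assumes the resource type is supported by the Resource Groups Tagging API, and the event pattern assumes the job state is carried in a `state` field of the event detail, which is worth verifying against a sample event before relying on it. The rule also needs a target (for example an SNS topic added with `aws events put-targets`) before it notifies anyone.

```shell
# rpo_budget_for <resource-arn>
# Reads the RPOHours tag from a resource; prints "None" if the tag is absent.
rpo_budget_for() {
  aws resourcegroupstaggingapi get-resources \
    --resource-arn-list "$1" \
    --query 'ResourceTagMappingList[0].Tags[?Key==`RPOHours`].Value | [0]' \
    --output text
}

# create_failed_job_rule <rule-name>
# Creates an EventBridge rule matching backup jobs that end without a recovery point.
create_failed_job_rule() {
  aws events put-rule \
    --name "$1" \
    --event-pattern '{"source":["aws.backup"],"detail-type":["Backup Job State Change"],"detail":{"state":["FAILED","ABORTED","EXPIRED"]}}'
}

# Example (placeholder names):
# rpo_budget_for arn:aws:rds:us-east-1:123456789012:cluster:prod-payments
# create_failed_job_rule backup-job-failed
```

The event rule covers silent job failures; the tag-driven age check is still needed for selection gaps, because a resource that never gets a job never emits an event.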
# Step 2: are recent jobs failing for this resource? A wall of FAILED behind one old COMPLETED is the tell.
aws backup list-backup-jobs \
  --by-resource-arn arn:aws:rds:us-east-1:123456789012:cluster:prod-payments \
  --query 'BackupJobs[:6].{Created:CreationDate,State:State,Msg:StatusMessage}' \
  --output table

# Step 3: close the gap now with an on-demand backup rather than waiting for the next schedule.
aws backup start-backup-job \
  --backup-vault-name prod-backup-vault \
  --resource-arn arn:aws:rds:us-east-1:123456789012:cluster:prod-payments \
  --iam-role-arn arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole

# Step 4: tighten the rule so cadence sits comfortably inside the RPO (every 6h for a few-hour RPO).
# Edit the plan's ScheduleExpression to: cron(0 */6 ? * * *)
aws backup update-backup-plan --backup-plan-id a1b2c3d4-5e6f-7890-abcd-ef0123456789 \
  --backup-plan file://prod-tiered-plan.json

Quick quiz
A daily Backup Plan is enabled and shows green, but one RDS cluster's newest recovery point is 9 days old. The plan runs fine for every other resource. The cluster has a 24-hour RPO. What's the right first move?
Keep learning
Dig deeper into recovery points, RPO measurement, and continuous backup monitoring.
- AWS Backup: working with recovery points. How recovery points are created, listed, and restored, and what CreationDate and Status actually mean.
- Monitoring AWS Backup with EventBridge and CloudWatch. Wire Backup Job state-change events to alarms so a failed or stale job surfaces in hours, not weeks.
- AWS Backup Audit Manager: frameworks and controls. Built-in Min Frequency and Min Retention controls that flag resources whose backups are too old or too sparse.
- AWS Well-Architected, Reliability: back up data (REL09). How RPO and RTO drive backup frequency, and why measuring recovery-point freshness is a reliability requirement.
You've completed Address stale backup recovery points. You now know that real RPO is the age of your newest restorable recovery point — not the schedule on paper — the four ways points drift stale (silent failures, selection gaps, lifecycle eating its tail, frequency mismatch), how to read newest-point age from the CLI, and the four-step loop to diagnose, close the gap on demand, and alarm continuously. The next time a backup dashboard shows all-green, you'll know to ask the only question that matters: how old is the newest recovery point?