Restore testing: the basics
What does it mean to "test" a backup?
A backup is two things stitched together: a snapshot sitting in S3 (or a vault), and an unproven assumption that you can turn it back into a running resource when something goes wrong. Most teams invest heavily in the first half — backup plans, lifecycle rules, cross-region copies — and almost nothing in the second. Until you've actually restored a recovery point and watched it come up healthy, you don't have a backup. You have a file.
Restore testing is the practice of running that create→restore→verify→cleanup loop on a schedule, in a sandbox, against real recovery points your backup plan produced. AWS Backup launched it as a managed feature in late 2023 specifically because the failure mode it catches — corrupt snapshots, missing IAM grants, KMS keys deleted from the source account, schema-drift in restored RDS instances — is invisible until the day you actually need the backup.
Continuity check BKP-006 ("Restore Testing Not Configured") flags any AWS Backup plan that has no associated restore testing plan. It's a MEDIUM-severity finding rather than HIGH because the backups themselves still exist — but it's the difference between knowing your DR strategy works and assuming it does. Auditors increasingly treat the gap the same way.
In this lesson you'll learn why "Schrödinger's backup" is a real operational risk, how AWS Backup's managed Restore Testing feature automates the verification loop, where the verification surface stops and your application-level checks have to start, and how to schedule tests at a cadence that matches each workload's criticality without setting fire to your bill.
Schrödinger's backup
The sysadmin saying is: "The state of any backup is unknown until a restore is attempted." Until you open the box, the recovery point is simultaneously a working backup and a useless one, and the universe doesn't pick a state until you need it. A 2022 ESG survey found that 79% of organisations had attempted a recovery in the previous year; 53% of those attempts failed at least partly. Even among the ones that worked, teams typically found out whether the restore would work only during the incident itself. AWS Backup Restore Testing exists to move that discovery point earlier: ideally to a Tuesday morning, not a Friday night.
Restore testing in action
Marco runs platform engineering at a payments processor. After a near-miss where a restore from a tagged-but-corrupted RDS snapshot took 11 hours of frantic debugging before he gave up and rebuilt from a logical dump, he committed to never letting that happen unannounced again.
His backup plan already covers the production RDS fleet — daily snapshots, 35-day retention, copied to a second region. What it doesn't have is any evidence those snapshots are restorable. BKP-006 flags it. Marco builds a restore testing plan that picks one random RDS recovery point per week, restores it to a sandbox account, waits for it to reach available, fires a smoke-test query, and tears it down.
He starts by creating the restore testing plan itself. The schedule is a cron expression; the selection window decides which recovery points qualify (e.g. those created in the last 7 days).
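A minimal sketch of that first step with the AWS CLI. The plan name, the Tuesday 06:00 UTC schedule, the 4-hour start window, and the wildcard vault selector are illustrative assumptions rather than values from Marco's setup. The name uses underscores because restore testing plan names accept only alphanumeric characters and underscores. The script writes the plan definition to a file and calls AWS only if the CLI is installed.

```shell
# Plan definition (all values are illustrative; adjust to your environment).
cat > restore-testing-plan.json <<'EOF'
{
  "RestoreTestingPlanName": "weekly_rds_restore_test",
  "ScheduleExpression": "cron(0 6 ? * TUE *)",
  "ScheduleExpressionTimezone": "UTC",
  "StartWindowHours": 4,
  "RecoveryPointSelection": {
    "Algorithm": "RANDOM_WITHIN_WINDOW",
    "IncludeVaults": ["*"],
    "RecoveryPointTypes": ["SNAPSHOT"],
    "SelectionWindowDays": 7
  }
}
EOF

# Create the plan. Needs the AWS CLI and credentials for the account
# that will host the restore tests.
if command -v aws >/dev/null 2>&1; then
  aws backup create-restore-testing-plan \
    --restore-testing-plan file://restore-testing-plan.json
else
  echo "aws CLI not found; wrote restore-testing-plan.json only" >&2
fi
```

RANDOM_WITHIN_WINDOW picks one random eligible recovery point per run; LATEST_WITHIN_WINDOW would always test the newest one instead.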
The plan object — the schedule and selection algorithm — without any resources attached yet.
After attaching the RDS resource type and letting the first run fire, check what actually happened. get-restore-testing-plan tells you about the plan; list-restore-jobs shows individual restore attempts.
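One way to pull those results from the CLI, sketched under the assumption that the plan is named weekly_rds_restore_test (a hypothetical name). It looks up the plan's ARN, then narrows the restore job list to the jobs that plan started and prints the status, validation, and cleanup columns.

```shell
PLAN_NAME="weekly_rds_restore_test"   # hypothetical plan name

# List the restore jobs started by one restore testing plan.
list_restore_test_jobs() {
  plan_arn=$(aws backup get-restore-testing-plan \
    --restore-testing-plan-name "$1" \
    --query 'RestoreTestingPlan.RestoreTestingPlanArn' \
    --output text) || return 1

  aws backup list-restore-jobs \
    --by-restore-testing-plan-arn "$plan_arn" \
    --query 'RestoreJobs[].{Job:RestoreJobId,Status:Status,Validation:ValidationStatus,Deletion:DeletionStatus,Created:CreationDate,Completed:CompletionDate}' \
    --output table
}

if command -v aws >/dev/null 2>&1; then
  list_restore_test_jobs "$PLAN_NAME"
else
  echo "aws CLI not found; skipping restore job listing" >&2
fi
```

The difference between the Created and Completed columns gives a rough restore duration for each run.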
A completed weekly restore test — restored, verified, deleted, with a real duration measurement.
Restore testing under the hood (deep dive)
A restore testing plan is a separate AWS Backup object from a backup plan: they draw on the same vaults but otherwise live independently. The plan stores three things: a cron schedule, a recovery-point selection algorithm (latest or random within the selection window), and a set of RestoreTestingSelection objects that say "for this resource type, use this IAM role and these override parameters." At the scheduled time, AWS Backup picks an eligible recovery point per selection, calls the underlying restore API (rds:RestoreDBInstanceFromDBSnapshot for RDS, ec2:CreateVolume for EBS, etc.), waits for the resource to reach available, marks the test successful or failed, and then deletes the restored resource.
The verification surface is deliberately narrow: AWS Backup checks that the resource creates without errors and reaches its provider's "healthy" state. It does not run application-level checks: it won't query your RDS instance to confirm rows are intact, won't curl an endpoint on the restored EC2, won't validate that the EFS filesystem actually contains the files you expect. For those, you wire a Step Functions workflow or a Lambda hook against the EventBridge "Restore Job State Change" event with status COMPLETED and run your own assertions before the cleanup proceeds.
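A sketch of the EventBridge side of that wiring. The rule name is hypothetical, and the pattern assumes restore jobs are published with source aws.backup, detail-type "Restore Job State Change", and a status field in the event detail. Confirm the field names against a sample event from your own account before relying on it.

```shell
# Event pattern matching completed restore jobs (field names are assumptions
# to verify against a real event).
cat > restore-completed-pattern.json <<'EOF'
{
  "source": ["aws.backup"],
  "detail-type": ["Restore Job State Change"],
  "detail": {
    "status": ["COMPLETED"]
  }
}
EOF

# Create the rule. A Step Functions state machine or Lambda function is then
# attached as its target with `aws events put-targets` (target ARN omitted here).
if command -v aws >/dev/null 2>&1; then
  aws events put-rule \
    --name restore-test-completed \
    --event-pattern file://restore-completed-pattern.json
else
  echo "aws CLI not found; wrote restore-completed-pattern.json only" >&2
fi
```

The pattern matches every completed restore job in the account, so the target should check that the job came from a restore testing plan before running assertions.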
Billing is the part that bites people. The temporarily-restored resource bills exactly like a production resource for the duration of the test — RDS instance-hours, EBS volume-hours, IOPS if provisioned. A weekly restore test of a db.r6i.4xlarge running for 50 minutes is roughly $1.50/test, but the same plan against an Aurora Serverless v2 provisioned at 32 ACUs while the test runs can be far worse. Schedule tests off-peak, sample small subsets of the fleet rather than every resource, and use StartWindowHours to spread restores out so you don't pay for a thundering herd of restored instances all running concurrently.
# The IAM role passed to the restore testing plan needs both backup:* permissions
# AND the create/delete permissions for each resource type you're testing.
# This is the trimmed inline policy for an RDS-only test plan.
cat <<'EOF' > restore-test-role-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "backup:StartRestoreJob",
        "backup:DescribeRestoreJob",
        "backup:GetRecoveryPointRestoreMetadata",
        "rds:RestoreDBInstanceFromDBSnapshot",
        "rds:DescribeDBInstances",
        "rds:DeleteDBInstance",
        "rds:AddTagsToResource",
        "iam:PassRole",
        "kms:Decrypt",
        "kms:GenerateDataKey"
      ],
      "Resource": "*"
    }
  ]
}
EOF
What is the impact of skipping restore testing?
The headline impact is the one nobody plans for: the restore that doesn't work. Common causes are mundane — a KMS key on the source account got rotated and the snapshot's old key version was deleted; the destination subnet group no longer exists; the RDS engine version is now deprecated and won't accept new instance creates; the SG referenced by the launch template was deleted six months ago. None of these show up in the backup metadata. All of them surface at the worst possible moment, during a real recovery, while customers and exec staff watch.
The second-order impact is your RTO/RPO commitments. If your DR document says "recover production RDS in 2 hours" but you've never timed an actual restore, that number is a guess. Real restore times for large databases routinely run 4-12 hours depending on snapshot size, IOPS, and warm-up — and restore testing is the only way to know which bucket your fleet falls into. Restore tests measure RestoreDurationSeconds automatically; you can alert if any test exceeds your tier's RTO and find out about the gap before it matters.
Audit and compliance increasingly catch the gap directly. SOC 2 CC9.1 ("Identifies, selects, and develops risk mitigation activities for risks arising from potential business disruptions") and ISO 27001:2013 A.17.1.3 ("verify the established and implemented information security continuity controls at regular intervals") both expect documented evidence of tested restores. Automated restore-test logs satisfy this without the operational pain of a manual annual DR drill: auditors get a CSV; engineers don't lose a quarter.
On the cost side, restore testing is one of the few SRE controls that's almost always cheaper than its alternatives. Weekly tests for the critical-tier fleet of 12 RDS instances at $1.50 each = ~$75/month. A four-hour unplanned production outage caused by a discovered-too-late corrupt backup costs anywhere from $50k to several million in lost revenue, customer credits, and engineering time. The ROI math is not subtle.
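The arithmetic behind those figures, as a small sketch. The $1.80 hourly rate is an assumed on-demand price chosen to reproduce the $1.50 per-test figure, not a quoted AWS price; substitute your own instance rate and test duration.

```shell
# Cost of one restore test: the hourly rate prorated over the minutes the
# restored instance runs. Arguments: hourly_rate_usd minutes
per_test_cost() {
  awk -v rate="$1" -v minutes="$2" \
    'BEGIN { printf "%.2f\n", rate * minutes / 60 }'
}

# Monthly cost of weekly tests across a fleet (52 weeks spread over 12 months).
# Arguments: hourly_rate_usd minutes instance_count
monthly_cost() {
  awk -v rate="$1" -v minutes="$2" -v n="$3" \
    'BEGIN { printf "%.2f\n", rate * minutes / 60 * n * 52 / 12 }'
}

echo "per test:  \$$(per_test_cost 1.80 50)"
echo "per month: \$$(monthly_cost 1.80 50 12)"
```

With these inputs the script prints $1.50 per test and $78.00 per month, in the same range as the rough monthly figure quoted above.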
How do you set up restore testing properly?
Restore testing is a four-step loop that turns assumptions into evidence. Skip any step and you're back to hoping.
1. Inventory which workloads need which cadence
Not every backup is worth weekly testing. Map every protected resource to a tier: critical (revenue-bearing RDS, primary EBS) gets weekly tests, mid-tier gets monthly, cold storage and rarely-restored assets get quarterly. The point isn't to test everything — it's to make sure the things that matter are tested often enough that a regression surfaces within one acceptable-recovery window.
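The tier-to-cadence mapping can be captured as a small lookup, sketched here as a shell function. The tier names follow the paragraph above; the specific cron expressions (06:00 on Tuesdays, on the 1st of each month, and on the 1st of each quarter) are assumptions to adapt to your own off-peak window.

```shell
# Map a workload tier to an AWS cron schedule expression.
schedule_for_tier() {
  case "$1" in
    critical) echo 'cron(0 6 ? * TUE *)' ;;        # weekly
    mid)      echo 'cron(0 6 1 * ? *)' ;;          # monthly, on the 1st
    cold)     echo 'cron(0 6 1 1,4,7,10 ? *)' ;;   # quarterly
    *)        echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

for tier in critical mid cold; do
  echo "$tier -> $(schedule_for_tier "$tier")"
done
```

Each expression can be used as the ScheduleExpression of a separate restore testing plan, one plan per tier.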
2. Build the IAM role with both halves
The restore-testing IAM role needs backup:* actions on the testing API surface AND the underlying resource type's create/delete permissions (e.g. rds:RestoreDBInstanceFromDBSnapshot + rds:DeleteDBInstance for RDS). Forgetting the second half is the #1 reason first-time setups fail: the plan runs, the restore is initiated, and the role can't actually call the underlying service. When a failed job's status message includes an encoded authorization failure, decode it with aws sts decode-authorization-message to see which action was denied.
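A sketch of creating that role with the AWS CLI. The role name matches the ARN used in this lesson's selection example; the inline policy name is hypothetical, and the permissions file is assumed to be the restore-test-role-policy.json produced by the policy snippet in this lesson. The trust policy lets the AWS Backup service assume the role.

```shell
# Trust policy: allow the AWS Backup service to assume the role.
cat > restore-test-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "backup.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the role, then attach the permissions policy as an inline policy.
# Assumes restore-test-role-policy.json already exists in this directory.
if command -v aws >/dev/null 2>&1; then
  aws iam create-role \
    --role-name AWSBackupRestoreTestingRole \
    --assume-role-policy-document file://restore-test-trust-policy.json
  aws iam put-role-policy \
    --role-name AWSBackupRestoreTestingRole \
    --policy-name restore-test-rds \
    --policy-document file://restore-test-role-policy.json
else
  echo "aws CLI not found; wrote restore-test-trust-policy.json only" >&2
fi
```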
3. Chain application-level verification when the restore matters
AWS Backup verifies the resource reaches available and stops. For anything beyond "does it boot," hook a Step Functions workflow to the EventBridge "Restore Job State Change" event with status COMPLETED: run a smoke query against restored RDS, hit a health endpoint on restored EC2, count files in restored EFS. Only then let the cleanup proceed. The Step Functions log becomes the audit evidence for application-level recoverability.
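Once your own check has run, its verdict can be written back to the restore job with put-restore-validation-result. The sketch below shows one way to shape that step; run_smoke_query and RESTORE_JOB_ID are hypothetical placeholders for your own check and for the job ID taken from the EventBridge event.

```shell
# Translate a smoke-test exit code into a validation status AWS Backup accepts.
validation_status() {
  if [ "$1" -eq 0 ]; then echo "SUCCESSFUL"; else echo "FAILED"; fi
}

# Record the verdict on the restore job.
# Arguments: restore_job_id status message
report_validation() {
  aws backup put-restore-validation-result \
    --restore-job-id "$1" \
    --validation-status "$2" \
    --validation-status-message "$3"
}

# Example wiring (run_smoke_query and RESTORE_JOB_ID are hypothetical):
#   run_smoke_query
#   report_validation "$RESTORE_JOB_ID" "$(validation_status $?)" "smoke query result"
echo "exit 0 -> $(validation_status 0), exit 1 -> $(validation_status 1)"
```

The result has to be reported within the selection's ValidationWindowHours, since the restored resource is cleaned up once that window closes.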
4. Measure restore time as an SLO, not just a number
Each completed restore test surfaces RestoreDurationSeconds. Ship this metric to CloudWatch and alarm if any test exceeds your tier's RTO. A restore that quietly grows from 45 minutes to 4 hours over six months — usually because the database doubled in size — is exactly the kind of slow regression that breaks DR plans, and it's invisible without instrumented testing.
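A sketch of the measurement step, assuming you derive the duration yourself from the CreationDate and CompletionDate timestamps that list-restore-jobs reports. The timestamps, the Custom/RestoreTesting namespace, the Tier dimension, and the 2-hour RTO are all illustrative, and the date arithmetic relies on GNU date.

```shell
# Seconds between two ISO-8601 timestamps (requires GNU date for -d).
# Arguments: start_timestamp end_timestamp
restore_duration_seconds() {
  echo $(( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ))
}

# Succeeds (exit 0) when the duration is longer than the RTO, both in seconds.
exceeds_rto() {
  [ "$1" -gt "$2" ]
}

# Publish the duration as a custom metric so a CloudWatch alarm can watch it.
# Arguments: duration_seconds tier
publish_duration() {
  aws cloudwatch put-metric-data \
    --namespace "Custom/RestoreTesting" \
    --metric-name RestoreDurationSeconds \
    --unit Seconds \
    --value "$1" \
    --dimensions Tier="$2"
}

# Example values only; in practice these come from the restore job record.
duration=$(restore_duration_seconds "2024-01-02T06:00:00Z" "2024-01-02T06:45:00Z")
echo "restore took ${duration}s"
if exceeds_rto "$duration" 7200; then
  echo "RTO breached" >&2
fi
# publish_duration "$duration" critical   # run once AWS credentials are configured
```

A CloudWatch alarm on the published metric, with the threshold set to the tier's RTO in seconds, turns a slowly growing restore time into a page long before a real recovery depends on it.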
# Attach the RDS resource selection to the plan created earlier.
# Plan and selection names may contain only alphanumeric characters and underscores.
aws backup create-restore-testing-selection \
  --restore-testing-plan-name weekly_rds_restore_test \
  --restore-testing-selection '{
    "RestoreTestingSelectionName": "prod_rds_fleet",
    "ProtectedResourceType": "RDS",
    "IamRoleArn": "arn:aws:iam::123456789012:role/AWSBackupRestoreTestingRole",
    "ProtectedResourceConditions": {
      "StringEquals": [{ "Key": "aws:ResourceTag/Tier", "Value": "critical" }]
    },
    "RestoreMetadataOverrides": {
      "DBSubnetGroupName": "sandbox-subnet-group",
      "VpcSecurityGroupIds": "sg-0restoretestsandbox01"
    },
    "ValidationWindowHours": 1
  }'
Keep learning
Dig deeper into restore testing, backup verification, and DR practice on AWS.
- AWS Backup: Restore testing user guide. The full reference for restore testing plans, selections, IAM, and event-driven validation.
- AWS Well-Architected: Reliability Pillar (REL13). Where DR strategy, RTO/RPO definition, and restore validation fit in the reliability picture.
- AWS re:Invent 2023: Announcing AWS Backup Restore Testing. The launch blog post with worked examples for RDS, EC2, EBS, and DynamoDB.
- ISO 27001:2022 A.5.30: ICT readiness for business continuity. The control language auditors quote when they ask for restore-test evidence.
You've completed Configure AWS Backup restore testing. You can now build a restore testing plan, attach the right IAM role for both halves of the workflow, chain application-level verification where it matters, and measure restore duration as a real SLO. The next time BKP-006 flags a backup plan with no tests, you'll have a four-step loop ready to run — and the next time you actually need to restore, you'll already know it works.