Site Reliability

Configure AWS Backup restore testing

An untested backup is a hope. Automate scheduled restores so you find out before the incident whether your backups actually work.

14 min · 10 sections · AWS

Last reviewed 27 May 2026

Restore testing: the basics

What does it mean to "test" a backup?

A backup is two things stitched together: a snapshot sitting in S3 (or a vault), and an unproven assumption that you can turn it back into a running resource when something goes wrong. Most teams invest heavily in the first half — backup plans, lifecycle rules, cross-region copies — and almost nothing in the second. Until you've actually restored a recovery point and watched it come up healthy, you don't have a backup. You have a file.

Restore testing is the practice of running that create→restore→verify→cleanup loop on a schedule, in a sandbox, against real recovery points your backup plan produced. AWS Backup launched it as a managed feature in late 2023 specifically because the failure mode it catches — corrupt snapshots, missing IAM grants, KMS keys deleted from the source account, schema-drift in restored RDS instances — is invisible until the day you actually need the backup.

Continuity check BKP-006 ("Restore Testing Not Configured") flags any AWS Backup plan that has no associated restore testing plan. It's a MEDIUM-severity finding rather than HIGH because the backups themselves still exist — but it's the difference between knowing your DR strategy works and assuming it does. Auditors increasingly treat the gap the same way.

In this lesson you'll learn why "Schrödinger's backup" is a real operational risk, how AWS Backup's managed Restore Testing feature automates the verification loop, where the verification surface stops and your application-level checks have to start, and how to schedule tests at a cadence that matches each workload's criticality without setting fire to your bill.

Fun fact

Schrödinger's backup

The sysadmin saying is: "The state of any backup is unknown until a restore is attempted." Until you open the box, the recovery point is simultaneously a working backup and a useless one, and the universe doesn't pick a state until you need it. A 2022 ESG survey found that 79% of organisations had attempted a recovery in the previous year; 53% of those attempts failed at least partly. Even when a restore worked, teams typically found out whether it would only during the incident itself. AWS Backup Restore Testing exists to move that discovery point earlier: ideally to a Tuesday morning, not a Friday night.

Restore testing in action

Marco runs platform engineering at a payments processor. After a near-miss where a restore from a tagged-but-corrupted RDS snapshot took 11 hours of frantic debugging before he gave up and rebuilt from a logical dump, he committed to never letting that happen unannounced again.

His backup plan already covers the production RDS fleet — daily snapshots, 35-day retention, copied to a second region. What it doesn't have is any evidence those snapshots are restorable. BKP-006 flags it. Marco builds a restore testing plan that picks one random RDS recovery point per week, restores it to a sandbox account, waits for it to reach available, fires a smoke-test query, and tears it down.

He starts by creating the restore testing plan itself.

The schedule is a cron expression; the selection window decides which recovery points qualify (e.g. created in the last 7 days). Plan names may contain only letters, digits, and underscores, so the name below uses underscores rather than hyphens.

$ aws backup create-restore-testing-plan --restore-testing-plan '{"RestoreTestingPlanName":"weekly_rds_restore_test","ScheduleExpression":"cron(0 4 ? * MON *)","ScheduleExpressionTimezone":"UTC","StartWindowHours":2,"RecoveryPointSelection":{"Algorithm":"RANDOM_WITHIN_WINDOW","IncludeVaults":["arn:aws:backup:eu-west-1:123456789012:backup-vault:prod-rds"],"RecoveryPointTypes":["SNAPSHOT"],"SelectionWindowDays":7}}'

{
  "RestoreTestingPlanArn": "arn:aws:backup:eu-west-1:123456789012:restore-testing-plan:weekly_rds_restore_test-9c1f",
  "RestoreTestingPlanName": "weekly_rds_restore_test",
  "CreationTime": "2026-05-11T14:22:08.412000+00:00"
}

# Plan exists — but no resource types are attached yet. Next: tell it what to restore.

The plan object — the schedule and selection algorithm — without any resources attached yet.

After attaching the RDS resource type and letting the first run fire, check what actually happened. get-restore-testing-plan tells you about the plan; list-restore-jobs-by-protected-resource shows the individual restore attempts for one resource. Restore jobs report CreationDate and CompletionDate rather than a ready-made duration, so the restore time is the difference between the two.

$ aws backup list-restore-jobs-by-protected-resource --resource-arn arn:aws:rds:eu-west-1:123456789012:db:billing-prod --query 'RestoreJobs[?CreatedBy.RestoreTestingPlanArn!=`null`] | [0]'

{
  "RestoreJobId": "4F8A2C19-3D7B-4E1F-9A6C-2B8D5E0F1C34",
  "Status": "COMPLETED",
  "ValidationStatus": "SUCCESSFUL",
  "DeletionStatus": "SUCCESSFUL",
  "BackupSizeInBytes": 412316860416,
  "CreationDate": "2026-05-18T04:09:14+00:00",
  "CompletionDate": "2026-05-18T04:56:41+00:00",
  "CreatedResourceArn": "arn:aws:rds:eu-west-1:123456789012:db:rt-billing-prod-9c1f"
}

# 47 minutes (2,847 seconds) from CreationDate to CompletionDate. Comfortably under our 2-hour RTO for this tier.

A completed weekly restore test — restored, verified, deleted, with a real duration measurement.

Restore testing under the hood (deep dive)

A restore testing plan is a separate AWS Backup object from a backup plan: the two can point at the same vaults but otherwise live independently. The plan stores three things: a cron schedule, a recovery-point selection algorithm (LATEST_WITHIN_WINDOW or RANDOM_WITHIN_WINDOW), and a set of RestoreTestingSelection objects that say "for this resource type, use this IAM role and these override parameters." At the scheduled time, AWS Backup picks an eligible recovery point per selection, calls the underlying restore API (rds:RestoreDBInstanceFromDBSnapshot for RDS, ec2:CreateVolume for EBS, etc.), waits for the resource to reach available, marks the test successful or failed, and then deletes the restored resource.

The verification surface is deliberately narrow: AWS Backup checks that the resource creates without errors and reaches its provider's "healthy" state. It does not run application-level checks: it won't query your RDS instance to confirm rows are intact, won't curl an endpoint on the restored EC2 instance, and won't validate that the EFS file system actually contains the files you expect. For those, you wire a Step Functions workflow or a Lambda function to the EventBridge event that AWS Backup emits when a restore job reaches COMPLETED, run your own assertions inside the selection's validation window, and report the outcome with PutRestoreValidationResult before the cleanup proceeds.
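
As a minimal sketch of that hook, the functions below build an EventBridge pattern for completed restore jobs and report a validation result back to AWS Backup. The rule name restore-test-completed and the function names are hypothetical, and the pattern assumes restore events arrive with detail-type "Restore Job State Change" and a status field in detail; confirm the event shape in your own account before relying on it.

```shell
# Sketch only: rule name and function names are illustrative assumptions.

# Emit an EventBridge pattern matching AWS Backup restore jobs that reach COMPLETED.
restore_completed_pattern() {
  cat <<'EOF'
{
  "source": ["aws.backup"],
  "detail-type": ["Restore Job State Change"],
  "detail": { "status": ["COMPLETED"] }
}
EOF
}

# Register the rule. Defined but not called here: it needs AWS credentials.
create_restore_completed_rule() {
  aws events put-rule \
    --name restore-test-completed \
    --event-pattern "$(restore_completed_pattern)"
}

# Report the outcome of your own assertions on a restore job.
# $1 = restore job ID, $2 = SUCCESSFUL or FAILED, $3 = short message.
report_validation() {
  aws backup put-restore-validation-result \
    --restore-job-id "$1" \
    --validation-status "$2" \
    --validation-status-message "$3"
}
```

The target of the rule (a Lambda function or a Step Functions state machine) would run the smoke checks and then call report_validation with the job ID taken from the event.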

Billing is the part that bites people. The temporarily-restored resource bills exactly like a production resource for the duration of the test — RDS instance-hours, EBS volume-hours, IOPS if provisioned. A weekly restore test of a db.r6i.4xlarge running for 50 minutes is roughly $1.50/test, but the same plan against an Aurora Serverless v2 provisioned at 32 ACUs while the test runs can be far worse. Schedule tests off-peak, sample small subsets of the fleet rather than every resource, and use StartWindowHours to spread restores out so you don't pay for a thundering herd of restored instances all running concurrently.
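
The per-test figure is simple arithmetic: hourly rate times the fraction of an hour the restored instance runs. A rough sketch follows; the $1.80/hour rate is an assumption chosen to land on the $1.50 figure above, not a quoted price, so substitute the on-demand rate for your instance class and region.

```shell
# Estimate the compute cost of one restore test.
# $1 = hourly on-demand rate in USD (assumed), $2 = minutes the restored instance runs.
per_test_cost() {
  awk -v rate="$1" -v minutes="$2" 'BEGIN { printf "%.2f\n", rate * minutes / 60 }'
}

per_test_cost 1.80 50   # hypothetical $1.80/hour instance for 50 minutes; prints 1.50
```

Storage and provisioned IOPS on the restored volume are billed on top of this, so treat the result as a floor rather than a full estimate.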

# The IAM role passed to the restore testing selection needs both backup:* permissions
# AND the create/delete permissions for each resource type you're testing.
# The role must also trust backup.amazonaws.com so AWS Backup can assume it.
# This is the trimmed inline policy for an RDS-only test plan.
cat <<'EOF' > restore-test-role-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "backup:StartRestoreJob",
        "backup:DescribeRestoreJob",
        "backup:GetRecoveryPointRestoreMetadata",
        "rds:RestoreDBInstanceFromDBSnapshot",
        "rds:DescribeDBInstances",
        "rds:DeleteDBInstance",
        "rds:AddTagsToResource",
        "iam:PassRole",
        "kms:Decrypt",
        "kms:GenerateDataKey"
      ],
      "Resource": "*"
    }
  ]
}
EOF
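
The inline policy only covers what the role may do; AWS Backup also has to be allowed to assume it. A minimal sketch of the trust policy and the role creation follows. The role name matches the one used in this lesson, while the inline policy name restore-test-inline is a made-up example.

```shell
# Trust policy letting the AWS Backup service assume the role.
cat <<'EOF' > restore-test-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "backup.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the role and attach the inline policy written earlier.
# Defined but not called here: it needs AWS credentials.
create_restore_test_role() {
  aws iam create-role \
    --role-name AWSBackupRestoreTestingRole \
    --assume-role-policy-document file://restore-test-trust-policy.json
  aws iam put-role-policy \
    --role-name AWSBackupRestoreTestingRole \
    --policy-name restore-test-inline \
    --policy-document file://restore-test-role-policy.json
}
```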

What is the impact of skipping restore testing?

The headline impact is the one nobody plans for: the restore that doesn't work. Common causes are mundane: the KMS key that encrypted the snapshot was disabled or scheduled for deletion, or its key policy no longer grants the restore role access; the destination subnet group no longer exists; the RDS engine version is now deprecated and won't accept new instance creates; the security group referenced by the launch template was deleted six months ago. None of these show up in the backup metadata. All of them surface at the worst possible moment, during a real recovery, while customers and exec staff watch.

The second-order impact is your RTO/RPO commitments. If your DR document says "recover production RDS in 2 hours" but you've never timed an actual restore, that number is a guess. Real restore times for large databases routinely run 4-12 hours depending on snapshot size, IOPS, and warm-up, and restore testing is the only way to know which bucket your fleet falls into. Every restore job records a CreationDate and a CompletionDate, so each test yields a measured duration; you can alert if any test exceeds your tier's RTO and find out about the gap before it matters.

Audit and compliance increasingly catch the gap directly. SOC 2 CC9.1 ("Identifies, selects, and develops risk mitigation activities for risks arising from potential business disruptions") and ISO 27001 A.17.1.3 ("verify the established and implemented information security continuity controls at regular intervals") both expect documented evidence of tested restores. Automated restore-test logs satisfy this without the operational pain of a manual annual DR drill — auditors get a CSV; engineers don't lose a quarter.

On the cost side, restore testing is one of the few SRE controls that's almost always cheaper than its alternatives. Weekly tests for the critical-tier fleet of 12 RDS instances at $1.50 each = ~$75/month. A four-hour unplanned production outage caused by a discovered-too-late corrupt backup costs anywhere from $50k to several million in lost revenue, customer credits, and engineering time. The ROI math is not subtle.
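
To make the comparison concrete, here is a small sketch of both sides of that sum. It assumes weekly tests (52 per year per instance) and uses the lesson's own figures; the function names are illustrative.

```shell
# Monthly cost of weekly tests: instances x cost per test x 52 weeks / 12 months.
monthly_test_cost() {
  awk -v n="$1" -v cost="$2" 'BEGIN { printf "%.0f\n", n * cost * 52 / 12 }'
}

# Years of testing that a single outage of a given cost would have paid for.
# $1 = outage cost in USD, $2 = monthly testing cost in USD.
years_funded_by_outage() {
  awk -v outage="$1" -v monthly="$2" 'BEGIN { printf "%.0f\n", outage / (monthly * 12) }'
}

monthly_test_cost 12 1.50          # prints 78, in line with the ~$75/month above
years_funded_by_outage 50000 78    # prints 53
```

Even at the low end of the outage range, one avoided incident covers decades of weekly tests.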

How do you set up restore testing properly?

Restore testing is a four-step loop that turns assumptions into evidence. Skip any step and you're back to hoping.

1. Inventory which workloads need which cadence

Not every backup is worth weekly testing. Map every protected resource to a tier: critical (revenue-bearing RDS, primary EBS) gets weekly tests, mid-tier gets monthly, cold storage and rarely-restored assets get quarterly. The point isn't to test everything — it's to make sure the things that matter are tested often enough that a regression surfaces within one acceptable-recovery window.

2. Build the IAM role with both halves

The restore-testing IAM role needs backup:* actions on the testing API surface AND the underlying resource type's create/delete permissions (e.g. rds:RestoreDBInstanceFromDBSnapshot + rds:DeleteDBInstance for RDS). Forgetting the second half is the #1 reason first-time setups fail silently: the plan runs, the restore is initiated, and the role can't actually call the underlying service. When a failed job's status message contains an encoded authorization failure, decode it with aws sts decode-authorization-message to see exactly which action was denied.

3. Chain application-level verification when the restore matters

AWS Backup verifies the resource reaches available and stops. For anything beyond "does it boot," hook a Step Functions workflow to the EventBridge event for a restore job reaching COMPLETED: run a smoke query against restored RDS, hit a health endpoint on restored EC2, count files on restored EFS. Report the result with PutRestoreValidationResult, and only then let the cleanup proceed. The Step Functions log becomes the audit evidence for application-level recoverability.

4. Measure restore time as an SLO, not just a number

Each completed restore job records its CreationDate and CompletionDate; the difference is your measured restore duration. Ship this metric to CloudWatch and alarm if any test exceeds your tier's RTO. A restore that quietly grows from 45 minutes to 4 hours over six months, usually because the database doubled in size, is exactly the kind of slow regression that breaks DR plans, and it's invisible without instrumented testing.
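
One way step 4 could look in practice: derive the duration from a restore job's creation and completion timestamps, compare it with the tier's RTO, and publish it as a custom metric. This is a sketch under assumptions: it relies on GNU date for timestamp parsing, and the namespace Custom/RestoreTesting, the metric name, and the Tier dimension are invented for illustration.

```shell
# Seconds between two ISO 8601 timestamps (assumes GNU date).
restore_duration_seconds() {
  local start end
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $(( end - start ))
}

# Succeeds (exit 0) when the measured duration exceeds the RTO, both in seconds.
breaches_rto() {
  [ "$1" -gt "$2" ]
}

# Publish the measurement. Defined but not called here: it needs AWS credentials.
# $1 = tier name, $2 = duration in seconds.
publish_duration() {
  aws cloudwatch put-metric-data \
    --namespace "Custom/RestoreTesting" \
    --metric-name RestoreDurationSeconds \
    --dimensions Name=Tier,Value="$1" \
    --unit Seconds \
    --value "$2"
}

# Example timestamps 47 minutes 27 seconds apart, checked against a 2-hour RTO.
duration=$(restore_duration_seconds "2026-05-18T04:09:14+00:00" "2026-05-18T04:56:41+00:00")
echo "$duration"   # prints 2847
if breaches_rto "$duration" 7200; then echo "RTO breached"; else echo "within RTO"; fi
```

A CloudWatch alarm on the published metric, with the threshold set to the tier's RTO in seconds, then turns a slow regression into a page instead of a surprise.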

# Attach the RDS resource selection to the plan created earlier.
# Selection names, like plan names, allow only letters, digits, and underscores.
# The subnet group name and security group ID are sandbox placeholders.
aws backup create-restore-testing-selection \
  --restore-testing-plan-name weekly_rds_restore_test \
  --restore-testing-selection '{
    "RestoreTestingSelectionName": "prod_rds_fleet",
    "ProtectedResourceType": "RDS",
    "IamRoleArn": "arn:aws:iam::123456789012:role/AWSBackupRestoreTestingRole",
    "ProtectedResourceArns": ["*"],
    "ProtectedResourceConditions": {
      "StringEquals": [{ "Key": "aws:ResourceTag/Tier", "Value": "critical" }]
    },
    "RestoreMetadataOverrides": {
      "DBSubnetGroupName": "sandbox-subnet-group",
      "VpcSecurityGroupIds": "sg-0123456789abcdef0"
    },
    "ValidationWindowHours": 1
  }'
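
The tiers from step 1 could translate into schedule expressions along these lines. This is a sketch: the tier names follow the lesson, but the 04:00 UTC slot and the choice of the 1st of the month are assumptions to adapt to your own quiet hours.

```shell
# Map a criticality tier to an AWS cron schedule expression.
# Fields: minutes hours day-of-month month day-of-week year.
schedule_for_tier() {
  case "$1" in
    critical) echo 'cron(0 4 ? * MON *)' ;;        # weekly, Mondays at 04:00 UTC
    mid)      echo 'cron(0 4 1 * ? *)' ;;          # monthly, on the 1st
    cold)     echo 'cron(0 4 1 1,4,7,10 ? *)' ;;   # quarterly, first day of the quarter
    *)        echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

schedule_for_tier critical   # prints cron(0 4 ? * MON *)
```

Each tier would get its own restore testing plan, with the returned expression used as the plan's ScheduleExpression.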

Quick quiz

You've configured an AWS Backup restore testing plan that fires weekly against your RDS production fleet. Tests complete with ValidationStatus: SUCCESSFUL. What does that status actually prove?

Keep learning

Dig deeper into restore testing, backup verification, and DR practice on AWS.

You've completed Configure AWS Backup restore testing. You can now build a restore testing plan, attach the right IAM role for both halves of the workflow, chain application-level verification where it matters, and measure restore duration as a real SLO. The next time BKP-006 flags a backup plan with no tests, you'll have a four-step loop ready to run — and the next time you actually need to restore, you'll already know it works.

Restore testing: the cost and risk context

Why an untested backup is a hidden liability on the balance sheet

Cloud backup plans produce recovery points: snapshots stored in backup vaults and tracked in AWS Backup. Those snapshots have a real cost: storage fees, cross-region copy fees, long-term retention charges. But the cost of storing them is only half the equation. The other half is whether they can actually be turned back into a running system when you need them. BKP-006 flags any backup plan with no restore testing plan, meaning you're paying for DR insurance with no evidence the policy pays out.

The failure modes restore testing catches are mundane but costly: a KMS key deleted from the source account, a VPC subnet group removed, an RDS engine version deprecated. None of these show up in backup metadata. All of them surface during a real incident — the worst time to discover your recovery point is worthless. Each silent corruption is a liability sitting under the backup line on your cloud bill.

For finance the relevant framing is that restore testing is a very cheap control relative to what it insures. Weekly tests for a production-tier fleet typically run under $100/month in temporary compute charges. The cost of discovering a broken backup during a real outage — lost revenue, customer credits, emergency engineering time — is orders of magnitude higher. BKP-006 is the trigger to make that insurance verifiable.

This lesson is for the finance partner who wants to understand what BKP-006 actually flags and why it belongs on the cloud-cost-and-risk review. It explains why paying for backup storage without testing recoverability is a latent liability, how to think about restore-testing costs as a predictable line item versus the open-ended cost of a failed recovery, and the two governance levers — tiered cadence by criticality and RTO measurement as a tracked SLO — that keep both the spend and the risk defensible. No CLI commands required.

How a finance partner frames the restore testing conversation

Dana is the cloud finance partner for the same payments processor. At the quarterly review she asks a pointed question: the backup line on the cloud bill runs to $8,400/month in RDS snapshots and cross-region copies across 34 protected resources. She wants to know what share of that spend is verified recoverable versus theoretical.

The engineering team pulls up BKP-006. Twelve backup plans, none with a restore testing plan attached. Dana doesn't ask them to test everything — she asks them to tier it. Which databases back live payment flows? Those get weekly tests. Which are internal reporting or dev replicas? Monthly is fine. The rest can be quarterly. She co-signs the testing budget: approximately $90/month in temporary compute charges against $8,400/month in backup storage. The ratio makes the ask easy.

Six weeks later the first tests surface two silent failures: one snapshot encrypted with a KMS key that had since been scheduled for deletion, and another on a deprecated engine version. Both would otherwise have been discovered during a real incident instead of a Tuesday morning test. Dana notes in the risk register: 'Restore testing identified two previously undetected unrecoverable backups; both remediated. Backup spend is now verifiably productive.'

The financial exposure hidden in an untested backup fleet

When a backup plan has no restore testing, the backup storage cost on the cloud bill represents a liability, not just a service charge. You're paying to retain recovery points that may not restore — and because the failure modes are silent (deleted KMS keys, deprecated engine versions, removed subnet groups), you won't find out until the moment you actually need them. That's the finance-relevant read on BKP-006: it flags spend that has not been validated against its stated purpose.

The ROI case for restore testing is unusually clean. You can model both sides of the ledger. Testing cost: take the most expensive resource type in the backup fleet, estimate its restore duration from the AWS Backup console's size metadata, and price it as instance-hours per test at the on-demand rate. For typical production RDS databases this runs $1–$3 per test. At weekly cadence for a twelve-instance critical fleet that's 52 tests a month, or roughly $50–$155/month: a predictable, bounded line item.

The avoided-cost side: a failed restore during a real incident means discovery time (often hours of debugging before the team accepts the snapshot is unusable), then a fallback path (logical dump, earlier snapshot, or partial recovery) with its own RTO. For a revenue-bearing service, hours of unplanned downtime typically cost ten to a thousand times the annual testing budget. The insurance framing is the right one: restore testing is a cheap, measurable premium against a large, uncertain claim.

For the audit and compliance dimension, automated restore-test logs directly satisfy SOC 2 CC9.1 and ISO 27001 A.17.1.3 requirements for documented evidence of tested recovery. That replaces a costly and disruptive annual manual DR drill — freeing engineering time and replacing a once-a-year snapshot of readiness with continuous, timestamped evidence.

What finance can actually do about BKP-006

Finance can't configure restore testing plans, but it can set the conditions — through budgeting, tiering, and governance — that make the right cadence a funded, trackable line item rather than an afterthought.

1. Approve a tiered testing budget by workload criticality

Work with engineering to classify every backup plan against business criticality — revenue-bearing production systems, internal tools, dev/test environments. Budget restore-testing compute charges at each tier's cadence: weekly tests for critical, monthly for mid-tier, quarterly for cold. This converts the testing cost from a surprise to a planned line, and keeps the spend proportionate to the risk being mitigated.

2. Track verified-recoverable spend as a percentage of backup spend

Put one metric on the cloud-finance dashboard: the share of backup storage spend that is covered by a passing restore test in the last test cycle. Backup plans that fail BKP-006 are unverified spend. This reframes the control from a compliance checkbox into a spend-quality signal — you're paying for DR insurance; this is the metric that tells you whether the policy is active.

3. Treat failed restore tests as a budget risk event

When a restore test fails (KMS key deleted, engine version deprecated, subnet group removed), flag it in the risk register with an estimated recovery-cost impact had the failure been discovered during a real incident. This prices the control gap explicitly, makes remediation easier to prioritise against competing engineering work, and builds a record for audit.

4. Require RTO measurement as an SLO, funded alongside the test

Restore duration can be measured for every test. Make it a funded requirement, not a nice-to-have, that each tier's measured restore time is compared against its documented RTO target, and that alerts fire when a database grows large enough for its restore time to breach that target. A silent RTO regression is a liability; measuring it is cheap.
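
For whoever builds the dashboard, the metric in lever 2 reduces to one division. A minimal sketch, where the $5,040 covered figure is a made-up example set against the $8,400/month backup bill from Dana's review:

```shell
# Share of backup spend covered by a passing restore test, as a percentage.
# $1 = monthly spend on plans with a passing test, $2 = total monthly backup spend.
verified_share() {
  awk -v verified="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * verified / total
  }'
}

verified_share 5040 8400   # hypothetical: $5,040 of $8,400 covered; prints 60.0
```

Tracking this number per review cycle shows whether unverified spend is shrinking, which is the trend finance actually cares about.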

Quick quiz

Your cloud backup spend is $9,200/month across 40 protected resources. BKP-006 shows 15 backup plans have no restore testing. Engineering estimates weekly restore tests for the 10 critical-tier plans would cost roughly $120/month. What's the right finance call?

You've finished the finance partner's view of BKP-006 and restore testing. You know why an untested backup fleet is a latent liability rather than just a compliance gap, how to model the testing cost against backup storage spend and the avoided cost of a failed recovery, and the four levers — tiered budgeting by criticality, verified-recoverable spend as a dashboard metric, priced risk events for failed tests, and RTO measurement as a funded SLO — that make restore testing a defensible, trackable spend decision rather than an engineering afterthought.

Restore testing: the board-level question

Can we actually recover — or do we just have files?

Every organisation with a cloud backup plan assumes it works. Restore testing is the only way to replace that assumption with evidence. BKP-006 flags backup plans where that evidence has never been produced — meaning the DR plan hasn't been tested and the RTO/RPO commitments in the business continuity document are unverified.

The executive question is simple: if we needed to recover a critical system today, would it work, and how long would it take? Without restore testing the honest answer is "we don't know." With it, the answer is a measured duration from last week's automated test. The difference is the difference between a DR posture that exists on paper and one that can be defended to a board, an auditor, or a customer.

A short read for the executive who owns DR accountability. You'll get the plain-English version of why "we have backups" is not the same as "we can recover," what AWS Backup Restore Testing does to close that gap, and what a healthy posture looks like: critical workloads tested on a regular schedule with measured restore times, and every BKP-006 finding resolved or recorded as a deliberate risk acceptance. The implementation detail is your team's job — this lesson covers the decision.

What it looks like when the org gets this right

Before restore testing, the CISO at one fintech got the same answer every time she asked about DR readiness: 'Backups are running, retention is configured.' It was technically true and completely uninformative. She had no idea whether the snapshots were actually restorable or how long a real recovery would take.

After adopting restore testing as a tracked control, the answer changed shape. Weekly tests for the critical-tier fleet produced a dashboard: 12 backup plans tested, 12 passing, median restore time 47 minutes, longest 94 minutes. The DR document now has measured durations instead of estimates, and the next time an auditor asked for evidence of tested restores, the team handed over a CSV of automated test results.

The CISO's takeaway was not about the implementation — it was about the accountability shift. Before, 'we have backups' was an assertion. After, 'our backups are tested weekly and restore in under two hours' was a fact with a timestamp. That's the difference restore testing makes at the leadership level.

Why BKP-006 is a DR credibility question, not a backup question

The impact of skipping restore testing is simple to state: when you actually need to recover, you find out whether your backup works. If the answer is no, the recovery becomes an incident inside the incident — debugging a broken restore while the system is down. The RTO in your business continuity plan becomes aspirational rather than measured.

This surfaces as a leadership question during any serious outage or audit: 'Have we tested our restores?' Without automated restore testing the honest answer is either 'no' or 'not recently.' That answer is increasingly unacceptable to regulators, to enterprise customers reviewing your SOC 2 report, and to your own board. BKP-006 is the signal that the gap exists.

The healthy end state is that the question 'can we recover critical systems within our documented RTO?' has a measured answer — not an estimate, but a timestamp from last week's automated test. Getting there requires no large project, just a tiered testing plan aligned to criticality. The executive job is to require that evidence exists and is reviewed, not to implement it.

The leadership posture on restore testing

The executive responsibility is not to configure restore testing — it's to require that DR commitments are evidence-backed and that the evidence is reviewed regularly.

1. Set the standard: critical workloads are tested, not assumed

Make it policy that any workload with a documented RTO/RPO must have an active restore testing plan. That's a two-line policy statement, not a technical mandate. Engineering chooses the cadence and tooling; the policy creates the accountability that prevents BKP-006 from accumulating silently.

2. Ask for one number at the leadership review

The question that belongs at the quarterly executive review is: 'What percentage of our critical backup plans have passed a restore test in the last cycle, and what was the slowest measured restore time?' The answer tells you whether DR is governed by evidence or assumption. No technical detail required.

3. Require documented risk acceptance for any gap

Any backup plan that remains untested — BKP-006 open — should carry a named risk owner, a recorded reason, and a review date. That converts an ignored alert into an auditable decision. It also surfaces which teams haven't been allocated the time or budget to test, which is the real executive action item.
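
The team preparing the answer to the review question in item 2 could produce it from an export of test results. The sketch below assumes a CSV of plan name, validation status, and duration in seconds; that layout and the sample rows are invented for illustration.

```shell
# Summarise restore test results: percent passing and slowest restore in minutes.
# Input on stdin: CSV lines "plan_name,status,duration_seconds" (assumed format).
summarise_tests() {
  awk -F, '
    { total++; if ($2 == "SUCCESSFUL") pass++; if ($3 > max) max = $3 }
    END {
      if (total == 0) { print "no tests"; exit }
      printf "%d%% passing, slowest %d min\n", 100 * pass / total, max / 60
    }'
}

# Hypothetical sample rows.
printf '%s\n' \
  'billing_prod,SUCCESSFUL,2847' \
  'ledger_prod,SUCCESSFUL,5640' \
  'reports_prod,FAILED,0' | summarise_tests   # prints: 66% passing, slowest 94 min
```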

Quick quiz

An auditor asks: 'Can you demonstrate that your critical systems would recover within your documented RTO?' Your team has AWS Backup configured with 35-day retention on all production databases, but no restore testing plans. What's the honest answer — and the right response?

That's the lesson. Two takeaways: 'we have backups' is not the same as 'we can recover,' and the gap between them is closed by evidence, not by assumption. The leadership posture is a two-line policy — critical workloads with documented RTOs must have active restore tests — and a single review question: are all critical backup plans passing, and what's the slowest measured restore time? Everything else is implementation.

Part of the learning path Build in resilience
