Restore test failures: the basics
Why "successfully created" is not "successfully restorable"
AWS Backup tracks three independent jobs, and people routinely conflate them. A backup job writes a recovery point to a vault. A copy job replicates that recovery point to another region or account. A restore test takes one of those recovery points and actually rebuilds it into a live resource to prove the snapshot is usable. Only the third one answers the question that matters in a disaster — can we get the workload back? A green backup job tells you the bytes were captured; it tells you nothing about whether those bytes will boot.
Restore testing automates the select→restore→verify→cleanup loop on a schedule: it picks an eligible recovery point, calls the underlying restore API, waits for the resource to reach its provider's healthy state, marks the test passed or failed, and tears the restored resource down. When that restore step errors out, the test job lands in FAILED and the check flags it. The backup itself is untouched and still sitting in the vault, but you now have evidence that, restored under these parameters with this role and this network config, it does not come back. An untested backup is a hope; a failed restore test is a hope that just got disproven.
This is flagged separately from backup-job and copy-job failures on purpose. A failed backup job means you have no recovery point. A failed copy job means your recovery point isn't where your DR plan assumes it is. A failed restore test means the recovery point exists and is in the right place — but the path from snapshot to running resource is broken. Each one is a different fix; this lesson is about the third, because it's the one that silently invalidates a DR strategy everyone believes is working.
In this lesson you'll learn why a successful backup job is not proof of recoverability, how AWS Backup restore testing surfaces the real failure modes — wrong or vanished target subnet/VPC, unavailable instance types or engine versions, missing IAM restore-role permissions or KMS decrypt access, capacity and quota limits, and validation-script timeouts — and how to diagnose them from restore-job history and the StatusMessage field. You'll see the CLI to list failed restore jobs, read the error, fix the restore metadata, role, or key, and re-run the test plan. You'll also learn to keep restore-test failures clearly separate from backup-job and copy-job failures, because each is a different problem with a different fix.
The snapshot that restored into a deleted subnet
A SaaS company's weekly RDS restore tests ran green for four months, then started failing every single run with StatusMessage: DBSubnetGroupNotFoundFault. Nobody had touched the backup plan — but a networking team had decommissioned an old sandbox VPC during a cleanup sprint, and the restore testing selection still pointed DBSubnetGroupName at a subnet group that no longer existed. The backups were perfect. The recovery points were valid. The restore simply had nowhere to land. Had a real outage hit during that window, the team would have discovered the broken target network mid-incident instead of in a Tuesday-morning alert. The fix was one line in the restore metadata override; the lesson was that restore tests fail for reasons that have nothing to do with the backup itself — which is precisely why you run them.
Fixing a restore test failure in action
Priya owns DR for a healthcare-analytics platform. The dashboard flags three failed restore-test jobs overnight, all against the production RDS fleet that her SOC 2 report says recovers within a 4-hour RTO. The backup jobs are all green; the recovery points are all present in the vault. It's specifically the restore tests that are red.
She doesn't guess. She lists the failed restore jobs and reads the StatusMessage on each — that one field is where AWS Backup records why the underlying restore API call rejected the attempt. Two of the three say KMSKeyNotAccessibleFault: the specified KMS key... access denied, and the third says InsufficientDBInstanceCapacity for the db.r6i.4xlarge instance class the restore parameters request in that AZ.
The KMS failures trace to a key policy change: a security cleanup removed the restore-testing role from the CMK that encrypts the production snapshots, so the role can no longer decrypt the recovery points. She re-adds the role's kms:Decrypt grant to the key policy. For the capacity failure she edits the restore metadata override to pin a different AZ with headroom, then re-runs the plan on demand rather than waiting a week. All three tests come back COMPLETED with a ValidationStatus of SUCCESSFUL, and the RTO clock she now trusts reads 51 minutes.
First, list every failed restore job. Restore-test jobs are restore jobs created by a restore testing plan, so filtering on FAILED status surfaces them alongside any other failed restores.
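A minimal sketch of that listing step, wrapped in a function so it can be reused. The function name list_failed_restore_jobs is made up for illustration, and the Plan column assumes that jobs started by a restore testing plan carry CreatedBy.RestoreTestingPlanArn in the list-restore-jobs output; ad-hoc restores would show an empty value there.

```shell
# List FAILED restore jobs with the columns needed for triage.
# Plan is expected to be empty for restores that no testing plan started.
list_failed_restore_jobs() {
  aws backup list-restore-jobs \
    --by-status FAILED \
    --query 'RestoreJobs[].{Id:RestoreJobId,Type:ResourceType,Created:CreationDate,Plan:CreatedBy.RestoreTestingPlanArn}' \
    --output table
}

# Usage (needs AWS credentials with backup:ListRestoreJobs):
#   list_failed_restore_jobs
```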
Failed restore-test jobs. The Plan column confirms they came from a restore testing plan, not an ad-hoc restore.
Now read the full StatusMessage for one failed job — this single field tells you which restore parameter, role, or key the underlying restore API rejected.
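One way to pull that field for a single job is sketched below. The function name show_restore_failure is hypothetical; the role and recovery point are printed next to the message because those are usually the next things to check.

```shell
# Print why one restore job failed.
# $1 is a restore job ID copied from the list-restore-jobs output.
show_restore_failure() {
  aws backup describe-restore-job \
    --restore-job-id "$1" \
    --query '{Status:Status,StatusMessage:StatusMessage,Role:IamRoleArn,RecoveryPoint:RecoveryPointArn}' \
    --output json
}

# Usage:
#   show_restore_failure "<restore-job-id>"
```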
StatusMessage is the diagnosis. KMS, subnet, instance-type, quota, and timeout failures all name themselves here.
Restore test failures under the hood
When a restore testing plan fires, AWS Backup selects an eligible recovery point, then calls the resource type's native restore API on your behalf using the IAM role attached to the restore testing selection — rds:RestoreDBInstanceFromDBSnapshot, ec2:RunInstances from an AMI/snapshot, dynamodb:RestoreTableFromBackup, and so on. It passes the recovery point's stored restore metadata, merged with any RestoreMetadataOverrides you set (target subnet group, security groups, instance class, DB engine version). It waits for the resource to reach available, optionally runs the validation window, then deletes the resource. A FAILED restore job means one of those steps threw, and the provider's error string is captured verbatim in StatusMessage.
The failure modes cluster into five families, and they are almost never about the snapshot bytes.
- Network: the DBSubnetGroupName/VpcSecurityGroupIds in the metadata point at a subnet group or security group that was deleted or lives in a region the snapshot can't restore to (DBSubnetGroupNotFoundFault, InvalidParameterValue).
- Resource availability: the instance class or engine version baked into the restore parameters is no longer offered in that AZ or has been deprecated (InsufficientDBInstanceCapacity, InvalidParameterCombination).
- Permissions: the restore role is missing a create/delete action or, very commonly, kms:Decrypt/kms:CreateGrant on the CMK that encrypts the recovery point (KMSKeyNotAccessibleFault, AccessDenied).
- Limits: account-level service quotas (max RDS instances, vCPU limits, EIP caps) reject the new resource (LimitExceeded).
- Validation: a chained Step Functions/Lambda check times out or returns failure, setting ValidationStatus to FAILED or TIMED_OUT even though the resource came up.
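The five families can be turned into a small classifier for bucketing failed jobs. This is a rough sketch: the function name classify_restore_failure is invented here, the patterns cover only the error strings named in this lesson, and real StatusMessage text may need more patterns.

```shell
# Map a restore job's StatusMessage (and optional ValidationStatus) to a failure family.
# $1 = StatusMessage text, $2 = ValidationStatus (may be empty)
classify_restore_failure() {
  case "$1" in
    *KMSKeyNotAccessibleFault*|*AccessDenied*)
      echo permissions ;;
    *DBSubnetGroupNotFoundFault*|*InvalidParameterValue*)
      echo network ;;
    *InsufficientDBInstanceCapacity*|*InvalidParameterCombination*)
      echo availability ;;
    *LimitExceeded*)
      echo quota ;;
    *)
      # No known restore error: fall back to the validation result, if any.
      case "${2:-}" in
        *FAILED*|*TIMED_OUT*) echo validation ;;
        *) echo unclassified ;;
      esac ;;
  esac
}
```

Anything that lands in unclassified is a prompt to read the message by hand and, if it recurs, add a pattern for it.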
Two details matter for triage. First, restore testing uses a separate IAM role and separate network/parameter config from your production restore runbook — so a restore test can fail on config that production restores would never hit, and vice versa; always confirm whether the failure is in the test harness or the actual recovery path. Second, transient failures (a momentary InsufficientDBInstanceCapacity in one AZ) self-heal on the next run, while structural failures (deleted subnet, revoked KMS grant, deprecated engine) recur every single run — the recurrence pattern in restore-job history is the fastest signal of which kind you're looking at.
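The recurrence signal can be approximated with a few lines of shell. This is a heuristic sketch, not an AWS feature: the function name recurrence_kind and the threshold of two consecutive failures are assumptions chosen for illustration.

```shell
# Classify a resource's restore-test history by its trailing failure streak.
# Args: job statuses for ONE resource, oldest first, e.g. COMPLETED FAILED FAILED
recurrence_kind() {
  streak=0         # consecutive FAILED results at the end of the history
  seen_failure=0   # whether any run in the history failed
  for status in "$@"; do
    if [ "$status" = "FAILED" ]; then
      streak=$((streak + 1))
      seen_failure=1
    else
      streak=0
    fi
  done
  if [ "$streak" -ge 2 ]; then
    echo structural      # keeps failing: fix config, role, or key
  elif [ "$streak" -eq 1 ]; then
    echo undetermined    # one failure so far: read StatusMessage, then re-run
  elif [ "$seen_failure" -eq 1 ]; then
    echo transient       # failed earlier, latest run recovered
  else
    echo healthy
  fi
}
```

The statuses would come from the restore-job history for a single protected resource, sorted by creation date.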
# Triage loop: pull every failed restore job from the last 7 days and print the
# StatusMessage so you can bucket failures by family (KMS / subnet / capacity / quota).
# GNU date syntax; on macOS/BSD use: date -u -v-7d +%s
START=$(date -u -d '7 days ago' +%s)
for id in $(aws backup list-restore-jobs --by-status FAILED \
    --by-created-after "$START" \
    --query 'RestoreJobs[].RestoreJobId' --output text); do
  echo "=== $id ==="
  aws backup describe-restore-job --restore-job-id "$id" \
    --query 'StatusMessage' --output text
done
# If the message is KMSKeyNotAccessibleFault, confirm the restore role is on the key policy.
# Replace the placeholder with the ID of the CMK that encrypts the recovery point.
KEY_ID='<cmk-id>'
aws kms get-key-policy --key-id "$KEY_ID" --policy-name default \
  --query 'Policy' --output text | python3 -m json.tool | grep -A3 RestoreTestingRole
What is the impact of failed restore tests?
The headline impact is the recovery that doesn't happen. A failed restore test is a dress rehearsal that went wrong, which means the live performance — the actual disaster recovery — would have gone wrong too. The same KMSKeyNotAccessibleFault that turned a test red on Tuesday will turn a real restore red on the Friday night the primary database is corrupted, except now there are customers down, an incident bridge open, and an executive asking why the backups everyone said were fine won't come back. The test failure is the gift: it moves the discovery to a moment when the stakes are a re-run, not an outage.
The second-order impact is on your RTO/RPO commitments. If your DR document promises a 4-hour recovery but your restore tests are failing, that number isn't conservative — it's fictional. You cannot recover in four hours via a path that doesn't complete at all. Worse, partial-credit thinking creeps in: teams see green backup jobs, assume green recovery, and quietly let the restore-test failures age in a backlog because "the backups are fine." That backlog is a list of systems whose stated recovery capability is currently unproven, and the gap is invisible on every dashboard that only watches backup success.
Audit and compliance catch this directly. SOC 2 CC9.1 and ISO 27001 A.5.30 (ICT readiness for business continuity) expect documented evidence not just that backups exist but that recovery has been verified. A clean record of passing restore tests is exactly that evidence; a backlog of failed ones is the opposite — it's documented proof that recovery was attempted and did not work. Auditors increasingly know the difference between a backup-completion report and a restore-verification report, and they ask for the second.
Finally, there's a credibility and culture cost. The first time a team confidently asserts "we're fully backed up" and then can't restore during an incident, every future assurance gets discounted. Conversely, a clean restore-test record is one of the few DR claims that survives scrutiny because it's continuously, automatically re-proven. The cost of keeping restore tests green is trivial — pennies to a few dollars per test run; the cost of an unrecoverable backup discovered in production runs from tens of thousands to millions in downtime, credits, and lost trust.
How do you fix a failed restore test?
Fixing a restore-test failure is a four-step loop: read the StatusMessage to find the real cause, repair the specific restore metadata, role, or key, re-run the plan on demand to confirm, then close the loop so the same class of failure can't recur silently.
1. Read the StatusMessage before touching anything
Pull the failed restore job with describe-restore-job and read StatusMessage in full. It carries the verbatim error from the underlying restore API and almost always names the cause: DBSubnetGroupNotFoundFault (network), InsufficientDBInstanceCapacity or a deprecated-engine error (resource availability), KMSKeyNotAccessibleFault/AccessDenied (permissions), LimitExceeded (quota), or a ValidationStatus of FAILED on an otherwise-created resource (your chained check). Also note whether the failure recurs every run (structural) or appears once and clears (transient); that distinction decides whether you fix config or just re-run.
2. Repair the specific restore metadata, role, or key
Match the fix to the family. Network: update RestoreMetadataOverrides in the restore testing selection to a subnet group and security group that exist in the test region. Availability: pin a supported instance class / engine version, or move to an AZ with capacity. Permissions: add the missing create/delete action to the restore role, and for encrypted recovery points re-add the role's kms:Decrypt and kms:CreateGrant on the source CMK's key policy — the single most common silent failure. Quota: request a service-quota increase or sample fewer resources per run. Validation: fix or extend the timeout on the Step Functions/Lambda check.
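For the network and availability families, the repair usually means rewriting the overrides on the restore testing selection. The sketch below assumes the update-restore-testing-selection CLI command and uses the override key names exactly as this lesson spells them; the exact keys and their casing vary by resource type, so check them against the restore testing documentation before relying on this. The function names and every plan, selection, subnet group, and security group name passed in are hypothetical.

```shell
# Build the JSON payload carrying the corrected overrides.
# $1 = DB subnet group name, $2 = security group ID, $3 = Availability Zone
build_selection_update() {
  printf '{"RestoreMetadataOverrides":{"DBSubnetGroupName":"%s","VpcSecurityGroupIds":"%s","AvailabilityZone":"%s"}}' \
    "$1" "$2" "$3"
}

# Apply the overrides to one selection of one restore testing plan.
# $1 = plan name, $2 = selection name, remaining args go to build_selection_update
update_selection_overrides() {
  plan=$1
  selection=$2
  shift 2
  aws backup update-restore-testing-selection \
    --restore-testing-plan-name "$plan" \
    --restore-testing-selection-name "$selection" \
    --restore-testing-selection "$(build_selection_update "$@")"
}

# Usage (all names are placeholders):
#   update_selection_overrides my_plan my_selection \
#     sandbox-subnet-group-current sg-0123456789abcdef0 us-east-1b
```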
3. Re-run the test plan on demand and confirm green
Don't wait a week for the next scheduled fire to learn whether your fix worked. Start an on-demand restore with start-restore-job using the same recovery point and the corrected metadata/role, and confirm the job reaches COMPLETED. An on-demand restore job runs outside the restore testing plan, so delete the restored resource yourself once you have checked it, and look for ValidationStatus: SUCCESSFUL on the plan's next scheduled run. Verify the fix on the actual failing recovery point, not a fresh one: a brand-new snapshot may dodge the exact condition (e.g. it's encrypted with a key the role can still reach) and give you a false all-clear.
4. Close the loop so it can't recur silently
A restore test failed because something drifted — a subnet was deleted, a key policy was tightened, a quota was hit. Prevent the recurrence: add the restore-testing role to any change-review checklist for KMS key policies and shared networking, alarm on Restore Job FAILED EventBridge events so a structural break pages someone the same day instead of aging in a dashboard, and keep restore-test results clearly separated from backup-job and copy-job status so nobody reads a green backup as proof of recovery.
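The alarm in this step could be wired up roughly as follows. This sketch assumes AWS Backup publishes restore job changes to EventBridge with the detail-type "Restore Job State Change" and a detail.status field. The function names, the rule name, and the SNS topic ARN are placeholders, and the topic's access policy would also need to allow EventBridge to publish to it.

```shell
# Event pattern matching restore jobs that end in FAILED.
restore_failed_event_pattern() {
  printf '%s' '{"source":["aws.backup"],"detail-type":["Restore Job State Change"],"detail":{"status":["FAILED"]}}'
}

# Create the rule and point it at a notification target.
# $1 = rule name, $2 = SNS topic ARN (both hypothetical)
create_restore_failure_alarm() {
  aws events put-rule \
    --name "$1" \
    --event-pattern "$(restore_failed_event_pattern)" &&
  aws events put-targets \
    --rule "$1" \
    --targets "Id=1,Arn=$2"
}

# Usage:
#   create_restore_failure_alarm restore-job-failed \
#     arn:aws:sns:us-east-1:123456789012:dr-alerts
```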
# After fixing the KMS key policy / restore metadata, re-run the SAME recovery point
# on demand instead of waiting for the next weekly fire.
# The job ID in the ARN is elided; paste the full recovery point ARN from the failed job.
RP_ARN='arn:aws:rds:us-east-1:123456789012:snapshot:awsbackup:job-...'
ROLE='arn:aws:iam::123456789012:role/AWSBackupRestoreTestingRole'
# Capture the new job ID so the follow-up check needs no manual copy-paste.
JOB_ID=$(aws backup start-restore-job \
  --recovery-point-arn "$RP_ARN" \
  --iam-role-arn "$ROLE" \
  --metadata '{
    "DBInstanceClass": "db.r6i.4xlarge",
    "DBSubnetGroupName": "sandbox-subnet-group-current",
    "VpcSecurityGroupIds": "sg-0restoretestsandbox01",
    "AvailabilityZone": "us-east-1b"
  }' \
  --query 'RestoreJobId' --output text)
# Then confirm it actually came back this time.
aws backup describe-restore-job --restore-job-id "$JOB_ID" \
  --query '{Status:Status,Validation:ValidationStatus,Created:CreationDate,Completed:CompletionDate}'
Keep learning
Dig deeper into restore testing, restore-job diagnostics, and DR validation on AWS.
- AWS Backup: Restore testing user guide. The full reference for restore testing plans, selections, IAM roles, metadata overrides, and validation.
- AWS Backup: Managing restore jobs and statuses. How restore jobs work, the status values, and where the StatusMessage diagnostic comes from.
- AWS Well-Architected: Reliability Pillar (REL13). Where restore validation, RTO/RPO, and DR testing fit in the reliability picture.
- ISO 27001:2022 A.5.30: ICT readiness for business continuity. The control language auditors quote when they ask for evidence that recovery has actually been verified.
You've completed Fix restore test failures. You now know why a successful backup job is not proof of recoverability, the five families of restore-test failure — network, resource availability, permissions, quota, and validation — how to diagnose each from the StatusMessage in restore-job history, and the four-step read-fix-rerun-close loop that turns a red test green. The next time the check flags a failed restore test, you'll have a defensible path from "flagged" to "recovery re-proven" — and you'll keep it firmly separate from backup-job and copy-job failures, because each is a different problem.