Backup job failures: the basics
Why a 'configured' backup plan can still leave you with nothing to restore
AWS Backup runs every protected resource through a backup job on its schedule. Each job ends in one of a handful of states: COMPLETED (the recovery point exists), FAILED (something went wrong and no recovery point was created), EXPIRED (the job never started inside its window and was abandoned), or ABORTED (the job was cancelled before it finished). A plan that lists a resource doesn't mean that resource is backed up; only a stream of COMPLETED jobs means that. The two are easy to confuse, and that confusion is exactly where data-loss incidents live.
The check flags backup jobs that have ended in FAILED or EXPIRED. The danger is that the failure is silent. The plan still exists, the selection still matches, the console still shows the resource as 'protected' — but last night's job threw an IAM error, or the KMS key denied the backup role, or the resource was mid-modification and couldn't be quiesced, and no recovery point was written. Nothing pages anyone. The gap between 'last good recovery point' and 'now' grows by a day, every day, until someone tries to restore and discovers the most recent point is three weeks old.
It's flagged because the asymmetry is brutal. A plan is created once and trusted forever; jobs run nightly and fail quietly. A single missed configuration step — a permission, a key policy, a vault policy — turns a green dashboard into a recovery gap that only surfaces during the exact incident the backup was supposed to cover. The job-failure signal is the only early warning you get before that moment.
In this lesson you'll learn why a backup job fails even when the plan looks correct, how to read the BackupJobState and StatusMessage fields to tell the common root causes apart — IAM role gaps, KMS key-policy denials, resources in a non-backupable state, Windows VSS errors, concurrent-job throttling, deleted-before-run, and vault policy denials on copy — and the fix pattern for each. You'll see the AWS CLI to list failed jobs and read their status messages, how to wire EventBridge and SNS so the next failure pages a human instead of sitting silent, and how AWS Backup Audit Manager turns ad hoc checking into a nightly compliance report.
The backup that hadn't run since the key rotated
A SaaS company rotated the CMK on their production backup vault as part of a routine security hardening sprint, but forgot to update the key policy to grant the AWS Backup service role kms:GenerateDataKey and kms:Decrypt on the new key. Every nightly job for 41 protected RDS and EBS resources began ending in FAILED with the status message Access Denied — but no alarm was wired, so nothing surfaced. The gap was discovered 47 days later when an engineer went to restore a database to a point two days prior and found the most recent recovery point predated the key rotation. The fix was two lines of key-policy JSON; the lesson was the missing alert that would have caught it on night one.
Fixing a backup job failure in action
Dana runs platform reliability at a logistics company. The dashboard flags 18 backup jobs in a FAILED state over the last week, clustered on a single RDS-backed selection. The plan is fine, the selection matches, the resources show as protected — but no recovery points have landed for those databases in eight days.
She doesn't guess at the cause. She pulls the failed jobs and reads the StatusMessage on each — that field carries the actual reason AWS recorded. Twelve say Access Denied (a KMS key-policy problem on the vault CMK), four say resource is in an invalid state (instances mid-modification during a maintenance window), and two say aborted because it exceeded the completion window — a throttling/concurrency issue where too many jobs queued behind a slow one.
She fixes the dominant cause first: adds the backup service role to the vault CMK key policy with kms:GenerateDataKey and kms:Decrypt, reruns one job by hand to confirm it now reaches COMPLETED, then wires an EventBridge rule on Backup Job State Change to an SNS topic so the next failure pages the on-call channel within minutes instead of sitting silent for seven weeks.
First, list every backup job that ended in a FAILED state, with the resource and the reason AWS recorded in StatusMessage.
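A minimal sketch of that listing with the AWS CLI. The function name `list_backup_jobs_in_state` is hypothetical; the `--by-state` filter and the `BackupJobId`, `ResourceArn`, and `StatusMessage` fields are the ones `aws backup list-backup-jobs` returns. It assumes credentials and a default region are already configured.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one tab-separated line per job in the given state.
# Columns: job ID, resource ARN, StatusMessage (the reason AWS recorded).
list_backup_jobs_in_state() {
  local state="${1:-FAILED}"
  aws backup list-backup-jobs \
    --by-state "$state" \
    --query 'BackupJobs[].[BackupJobId, ResourceArn, StatusMessage]' \
    --output text
}

# Usage (needs AWS credentials and a default region):
#   list_backup_jobs_in_state FAILED
#   list_backup_jobs_in_state EXPIRED
```

Wrapping each job's fields in brackets makes `--output text` print one job per line, which keeps the output easy to sort and count.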
The StatusMessage field tells you the root cause per job — don't fix blind, group by reason first.
Now make the next failure loud. Wire an EventBridge rule on Backup Job State Change to an SNS topic so a failed job pages the on-call channel within minutes.
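One way this wiring could look, as a sketch rather than a finished setup. The helper names, the rule name, and the target ID `sns-oncall` are hypothetical, and the pattern assumes the job state arrives in the event's `detail.state` field. The SNS topic is assumed to exist already.

```shell
#!/usr/bin/env bash
# Event pattern: match AWS Backup job state changes that ended badly.
backup_failure_pattern() {
  cat <<'JSON'
{
  "source": ["aws.backup"],
  "detail-type": ["Backup Job State Change"],
  "detail": { "state": ["FAILED", "EXPIRED", "ABORTED"] }
}
JSON
}

# Hypothetical helper: create the rule and point it at an existing SNS topic.
# $1 = rule name, $2 = SNS topic ARN (both supplied by you).
wire_backup_failure_alerts() {
  local rule_name="$1" topic_arn="$2"
  aws events put-rule \
    --name "$rule_name" \
    --event-pattern "$(backup_failure_pattern)" \
    --state ENABLED &&
  aws events put-targets \
    --rule "$rule_name" \
    --targets "Id=sns-oncall,Arn=$topic_arn"
}

# Usage (the topic ARN below is a placeholder):
#   wire_backup_failure_alerts backup-job-failures \
#     arn:aws:sns:us-east-1:123456789012:oncall-alerts
```

The topic's access policy also has to allow `events.amazonaws.com` to call `sns:Publish`, otherwise the rule matches but nothing is delivered.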
The fix for the silent-failure mode itself: every future job failure becomes a page, not a surprise.
Backup jobs under the hood (deep dive)
When a plan's schedule fires, AWS Backup assumes the IAM role named in the Backup Selection (typically AWSBackupDefaultServiceRole) and uses it to call the underlying snapshot API for the resource type — CreateSnapshot for EBS, CreateDBSnapshot for RDS, and so on. If that role lacks the managed policy for the resource type, or the resource is encrypted with a CMK whose key policy doesn't grant the role kms:GenerateDataKey/kms:Decrypt, the call returns AccessDenied and the job ends FAILED with that reason recorded in StatusMessage. This is the single most common failure class, and it almost always traces to a role or key-policy change made for an unrelated reason.
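A rough preflight check for this failure class, under stated assumptions: the function names are hypothetical, the role is expected to carry the AWS managed policy `AWSBackupServiceRolePolicyForBackup`, and the key-policy check is only a text search for the role ARN, not a real policy evaluation.

```shell
#!/usr/bin/env bash
# AWS managed policy that lets the service role take backups.
BACKUP_POLICY_ARN="arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup"

# Succeeds if the managed backup policy is attached to the role named in $1.
role_has_backup_policy() {
  aws iam list-attached-role-policies \
    --role-name "$1" \
    --query 'AttachedPolicies[].PolicyArn' \
    --output text | tr '\t' '\n' | grep -qxF "$BACKUP_POLICY_ARN"
}

# Succeeds if the key policy of KMS key $1 mentions role ARN $2 anywhere.
# Coarse text check only: it does not evaluate actions or conditions.
key_policy_mentions_role() {
  aws kms get-key-policy \
    --key-id "$1" \
    --policy-name default \
    --query 'Policy' \
    --output text | grep -qF "$2"
}

# Usage (key ID and role ARN are placeholders):
#   role_has_backup_policy AWSBackupDefaultServiceRole || echo "managed policy missing"
#   key_policy_mentions_role <key-id> <role-arn> || echo "role not named in key policy"
```

A miss on the second check is a hint, not proof: a key policy can also delegate access to the account and leave the grant to IAM policies, so confirm against the job's StatusMessage before changing anything.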
The other failure classes each leave a fingerprint in StatusMessage. A resource mid-modification (an instance resizing, an RDS instance applying a parameter group, a volume being modified) reports resource is in an invalid state because it can't be quiesced. Windows EC2 instances that use VSS for application-consistent snapshots fail with a VSS-specific message when the VSS components or the SSM agent aren't healthy. Hitting the per-account concurrent-backup limit, or queuing behind a slow job past the rule's CompletionWindowMinutes, surfaces as EXPIRED or 'exceeded the completion window'. A resource deleted between selection-evaluation and job-run fails with a not-found message. And a copy job into a destination vault whose access policy denies backup:CopyIntoBackupVault fails on the copy leg even when the source backup succeeded.
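Those fingerprints can be turned into a small classifier, sketched here. The function name and class labels are hypothetical, and the substrings it matches are the ones quoted in this lesson; real StatusMessage text varies by service, so treat the patterns as a starting point.

```shell
#!/usr/bin/env bash
# Map a StatusMessage string to a coarse failure class.
# Matching is case-insensitive and based on substrings quoted in this lesson.
classify_backup_failure() {
  local msg
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    # Assumption: a copy-leg denial names the denied action in its message.
    *copyintobackupvault*)             echo "copy-vault-policy" ;;
    *"access denied"*|*accessdenied*)  echo "iam-or-kms" ;;
    *"invalid state"*)                 echo "resource-state" ;;
    *vss*)                             echo "windows-vss" ;;
    *"completion window"*)             echo "window-or-throttling" ;;
    *"not found"*|*"does not exist"*)  echo "resource-deleted" ;;
    *)                                 echo "unclassified" ;;
  esac
}

# Usage: count failed jobs per class.
#   aws backup list-backup-jobs --by-state FAILED \
#     --query 'BackupJobs[].[StatusMessage]' --output text |
#   while IFS= read -r m; do classify_backup_failure "$m"; done |
#   sort | uniq -c | sort -rn
```

Anything that lands in `unclassified` deserves a manual read of the full message rather than a guess.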
Three signals let you diagnose without guessing. BackupJobState (and StatusMessage) from list-backup-jobs/describe-backup-job is the authoritative per-job record. EventBridge emits a Backup Job State Change event for every transition, which you can route to SNS for real-time alerting — this is how you convert a silent failure into a page. And AWS Backup Audit Manager runs nightly frameworks ('Backups Recovery Point Completed Within Frequency', minimum-retention checks) that turn 'is everything succeeding org-wide?' into a single compliance report instead of per-account CLI archaeology.
# Read the full failure reason for one job. StatusMessage is the diagnosis.
aws backup describe-backup-job \
  --backup-job-id 9f8e7d6c-1234-5678-90ab-cdef01234567 \
  --query '{State:State,Reason:StatusMessage,Resource:ResourceArn,Role:IamRoleArn}'

# Common fix for the 'Access Denied' class: grant the backup role on the vault CMK.
# (Add this statement to the KMS key policy, then rerun the job to confirm COMPLETED.)
cat > kms-backup-grant.json <<'JSON'
{
  "Sid": "AllowBackupRoleUseOfTheKey",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole" },
  "Action": ["kms:GenerateDataKey", "kms:Decrypt", "kms:DescribeKey", "kms:CreateGrant"],
  "Resource": "*"
}
JSON

# Then trigger an on-demand job to verify the fix before trusting the schedule.
aws backup start-backup-job \
  --backup-vault-name prod-backup-vault \
  --resource-arn arn:aws:rds:us-east-1:123456789012:db:prod-orders \
  --iam-role-arn arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole

What is the impact of unaddressed backup job failures?
The headline impact is a recovery gap that compounds silently. Every failed job is a day a protected resource produced no recovery point, but the cost is zero until the moment you need to restore — at which point it can be total. A team that believes it can restore to within 24 hours, and discovers during an incident that the most recent good recovery point is three weeks old, has effectively had no backups for that window. The defining feature of this failure mode is that it is invisible right up until it is catastrophic.
The second-order impact is the false-confidence multiplier. Because the plan still shows as configured and the resource still shows as protected, every downstream assumption built on 'it's backed up' inherits the gap — disaster-recovery runbooks, customer SLAs, the RPO promised in a sales contract. A single unremediated KMS or IAM failure on a backup selection can quietly invalidate a recovery objective across dozens of resources at once, and no dashboard that only checks 'is a plan present' will catch it.
On the compliance side, the failure is an audit finding in waiting. SOC 2 CC7.5, ISO 27001 A.12.3, HIPAA §164.308(a)(7), and PCI DSS all expect backups to be not merely configured but verified as succeeding. 'We have a backup plan' does not pass when the evidence shows jobs have been failing for weeks. AWS Backup Audit Manager's nightly report is the cheapest answer to a control owner — but only if someone is acting on the failures it reports rather than letting them accumulate.
Finally there's the operational drag of late discovery. A failure caught on night one is a two-line key-policy or IAM fix and one rerun. The same failure caught seven weeks later is an incident: reconstructing what data was lost, explaining the gap to customers or regulators, and doing forensic work to figure out which of dozens of resources were affected and for how long. The cost of the fix is constant; the cost of the delay is everything else.
How do you fix backup job failures and keep them fixed?
Remediation is a four-step loop: read the failure reason, fix the dominant cause first, verify with an on-demand job, then wire alerting so the next failure is loud instead of silent.
1. Read StatusMessage and group failures by root cause
Pull every FAILED/EXPIRED job and group them by the StatusMessage field — that string is AWS's own diagnosis. Access Denied means an IAM role gap or a KMS key-policy denial. resource is in an invalid state means the resource was busy (mid-modification, in-use exclusive). A VSS message means the Windows backup agent or SSM agent is unhealthy. exceeded the completion window/EXPIRED means concurrency or throttling. A not-found message means the resource was deleted between selection and run. Don't fix blind — most failures cluster on one or two causes.
2. Fix the dominant cause, by class
For Access Denied: attach the AWS Backup managed policy for the resource type to the service role, and grant that role kms:GenerateDataKey/kms:Decrypt/kms:DescribeKey/kms:CreateGrant on the resource's and the vault's CMK. For 'invalid state': move the schedule outside maintenance windows and widen StartWindowMinutes. For VSS: repair the SSM agent and VSS components or fall back to crash-consistent snapshots. For completion-window/throttling: widen CompletionWindowMinutes and stagger schedules so jobs don't all fire at once. For copy-leg denials: fix the destination vault's access policy to allow backup:CopyIntoBackupVault from the source account.
3. Verify with an on-demand job before trusting the schedule
Never assume a config fix worked because it looked right. Trigger an on-demand backup of one affected resource with start-backup-job and watch it reach COMPLETED. This proves the role, the key policy, and the vault policy all line up for real, on that resource type, before you wait a full day for the next scheduled run to either confirm or quietly fail again. The on-demand run is the difference between 'I changed the policy' and 'backups are working.'
4. Make the next failure loud, then audit org-wide
The root fix for the silent-failure mode is alerting. Create an EventBridge rule on Backup Job State Change filtered to FAILED/EXPIRED/ABORTED, target an SNS topic, and fan out to PagerDuty or Slack so the next failure pages within minutes. Then enable AWS Backup Audit Manager's built-in frameworks for a nightly org-wide compliance report on completion and retention, and track backup success rate as a standing operational metric. A single missed configuration should surface as a page on night one — never as a three-week-old recovery point during an incident.
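Step 3's "watch it reach COMPLETED" can be sketched as a polling loop around `describe-backup-job`. The function name, the default interval, and the poll limit are hypothetical choices; the states checked are the job states named in this lesson plus PARTIAL, which is treated here as not successful.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a backup job until it reaches a terminal state.
# $1 = backup job ID, $2 = seconds between polls (default 30),
# $3 = maximum number of polls (default 120).
# Returns 0 only if the job ends COMPLETED.
wait_for_backup_job() {
  local job_id="$1" interval="${2:-30}" max_polls="${3:-120}"
  local state="" i=0
  while [ "$i" -lt "$max_polls" ]; do
    state=$(aws backup describe-backup-job \
      --backup-job-id "$job_id" \
      --query 'State' --output text) || return 2
    case "$state" in
      COMPLETED) echo "$job_id: COMPLETED"; return 0 ;;
      FAILED|EXPIRED|ABORTED|PARTIAL) echo "$job_id: $state"; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$job_id: still $state after $max_polls polls"
  return 3
}

# Usage:
#   wait_for_backup_job "$JOB_ID" 30 120 && echo "fix verified"
```

Distinct return codes let a wrapper script tell a failed job (1) from a CLI error (2) or a job that is simply still running when the poll limit is hit (3).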
# Triage: count failed jobs, grouped by reason.
# The brackets around StatusMessage make --output text print one message per line,
# which is what sort | uniq -c needs.
aws backup list-backup-jobs --by-state FAILED \
  --query 'BackupJobs[].[StatusMessage]' --output text | sort | uniq -c | sort -rn
# After fixing the dominant cause (e.g. KMS key policy), verify with one on-demand job.
# Capture the job ID so the next command can follow the same job.
JOB_ID=$(aws backup start-backup-job \
  --backup-vault-name prod-backup-vault \
  --resource-arn arn:aws:rds:us-east-1:123456789012:db:prod-orders \
  --iam-role-arn arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole \
  --query 'BackupJobId' --output text)

# Watch it through to COMPLETED before trusting the schedule.
aws backup describe-backup-job --backup-job-id "$JOB_ID" \
  --query '{State:State,Reason:StatusMessage,PercentDone:PercentDone}'

Quick quiz
Question 1 of 5: The dashboard flags 18 backup jobs in a FAILED state on one RDS selection; the StatusMessage on each reads 'Access Denied' and the resources are encrypted with a CMK on the backup vault. What's the right next move?
Keep learning
Dig deeper into backup job states, failure diagnosis, alerting, and org-wide compliance.
- AWS Backup: monitoring backup, copy, and restore jobs. Job states, the StatusMessage field, and how to inspect failed and expired jobs.
- AWS Backup: using EventBridge for backup events. Wire Backup Job State Change events to SNS so failures alert in real time instead of going silent.
- AWS Backup: troubleshooting (IAM, KMS, and resource-state failures). Common failure causes and fixes: permissions, encryption key policies, and unsupported resource states.
- AWS Backup Audit Manager. Built-in frameworks that produce a nightly compliance report on backup completion, frequency, and retention.
You've completed Fix AWS Backup job failures. You now know why a configured plan can still leave you with nothing to restore, how to read BackupJobState and StatusMessage to tell IAM, KMS, resource-state, VSS, throttling, and copy-policy failures apart, the per-cause fix pattern, and — most importantly — how to wire EventBridge/SNS alerting so the next failure is loud instead of a three-week-old recovery gap discovered during an incident. The next time the check flags a failed backup job, you'll have a diagnosis-to-fix-to-alert path ready to run.