Site Reliability

Fix AWS Backup job failures

A backup plan can look configured on paper while its jobs quietly fail every night — leaving a growing recovery gap nobody notices until a restore is needed.

13 min·10 sections·AWS

Last reviewed 27 May 2026

Backup job failures: the basics

Why a 'configured' backup plan can still leave you with nothing to restore

AWS Backup runs every protected resource through a backup job on its schedule. Each job ends in one of a handful of states: COMPLETED (the recovery point exists), FAILED (something went wrong and no recovery point was created), EXPIRED (the job never started inside its window and was abandoned), or ABORTED (the job was stopped before it finished). Listing a resource in a plan doesn't mean that resource is backed up; only a stream of COMPLETED jobs does. The two are easy to confuse, and that confusion is exactly where data-loss incidents live.

The check flags backup jobs that have ended in FAILED or EXPIRED. The danger is that the failure is silent. The plan still exists, the selection still matches, the console still shows the resource as 'protected' — but last night's job threw an IAM error, or the KMS key denied the backup role, or the resource was mid-modification and couldn't be quiesced, and no recovery point was written. Nothing pages anyone. The gap between 'last good recovery point' and 'now' grows by a day, every day, until someone tries to restore and discovers the most recent point is three weeks old.

It's flagged because the asymmetry is brutal. A plan is created once and trusted forever; jobs run nightly and fail quietly. A single missed configuration step — a permission, a key policy, a vault policy — turns a green dashboard into a recovery gap that only surfaces during the exact incident the backup was supposed to cover. The job-failure signal is the only early warning you get before that moment.

In this lesson you'll learn why a backup job fails even when the plan looks correct, how to read the BackupJobState and StatusMessage fields to tell the common root causes apart — IAM role gaps, KMS key-policy denials, resources in a non-backupable state, Windows VSS errors, concurrent-job throttling, deleted-before-run, and vault policy denials on copy — and the fix pattern for each. You'll see the AWS CLI to list failed jobs and read their status messages, how to wire EventBridge and SNS so the next failure pages a human instead of sitting silent, and how AWS Backup Audit Manager turns ad hoc checking into a nightly compliance report.

Fun fact

The backup that hadn't run since the key rotated

A SaaS company rotated the CMK on their production backup vault as part of a routine security hardening sprint, but forgot to update the key policy to grant the AWS Backup service role kms:GenerateDataKey and kms:Decrypt on the new key. Every nightly job for 41 protected RDS and EBS resources began ending in FAILED with the status message Access Denied — but no alarm was wired, so nothing surfaced. The gap was discovered 47 days later when an engineer went to restore a database to a point two days prior and found the most recent recovery point predated the key rotation. The fix was two lines of key-policy JSON; the lesson was the missing alert that would have caught it on night one.

Fixing a backup job failure in action

Dana runs platform reliability at a logistics company. The dashboard flags 18 backup jobs in a FAILED state over the last week, clustered on a single RDS-backed selection. The plan is fine, the selection matches, the resources show as protected — but no recovery points have landed for those databases in eight days.

She doesn't guess at the cause. She pulls the failed jobs and reads the StatusMessage on each — that field carries the actual reason AWS recorded. Twelve say Access Denied (a KMS key-policy problem on the vault CMK), four say resource is in an invalid state (instances mid-modification during a maintenance window), and two say aborted because it exceeded the completion window — a throttling/concurrency issue where too many jobs queued behind a slow one.

She fixes the dominant cause first: adds the backup service role to the vault CMK key policy with kms:GenerateDataKey and kms:Decrypt, reruns one job by hand to confirm it now reaches COMPLETED, then wires an EventBridge rule on Backup Job State Change to an SNS topic so the next failure pages the on-call channel within minutes instead of sitting silent for seven weeks.

First, list every backup job that ended in a FAILED state, with the resource and the reason AWS recorded in StatusMessage.

$ aws backup list-backup-jobs --by-state FAILED --query 'BackupJobs[].{Resource:ResourceArn,Type:ResourceType,State:State,Reason:StatusMessage}' --output table

(Output: a ListBackupJobs table with one row per failed job and the columns Reason, Resource, State, and Type.)

# StatusMessage is the diagnosis: 'Access Denied' = IAM/KMS, 'invalid state' = resource busy, 'VSS' = Windows agent.

The StatusMessage field tells you the root cause per job — don't fix blind, group by reason first.
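The listing above covers only FAILED, while the check also flags EXPIRED jobs. A minimal sketch of a wrapper that sweeps each unhealthy state in turn follows; the function name list_unhealthy_backup_jobs is hypothetical, and the loop is there because --by-state takes a single value per call.

```shell
# Hypothetical helper: list jobs in each unhealthy terminal state.
# list-backup-jobs accepts one --by-state value, so call it once per state.
list_unhealthy_backup_jobs() {
  for state in FAILED EXPIRED ABORTED; do
    # Wrapping the fields in [...] makes --output text print one job per line.
    aws backup list-backup-jobs --by-state "$state" \
      --query 'BackupJobs[].[State,ResourceType,ResourceArn,StatusMessage]' \
      --output text
  done
}

# Usage:
# list_unhealthy_backup_jobs
```

Each output line is tab-separated, so it can be piped into sort, cut, or awk for grouping.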

Now make the next failure loud. Wire an EventBridge rule on Backup Job State Change to an SNS topic so a failed job pages the on-call channel within minutes.

$ aws events put-rule --name backup-job-failed --event-pattern '{"source":["aws.backup"],"detail-type":["Backup Job State Change"],"detail":{"state":["FAILED","EXPIRED","ABORTED"]}}'

{
    "RuleArn": "arn:aws:events:us-east-1:123456789012:rule/backup-job-failed"
}

# Rule created. Now point it at an SNS topic that fans out to PagerDuty / Slack.

$ aws events put-targets --rule backup-job-failed \
    --targets 'Id=1,Arn=arn:aws:sns:us-east-1:123456789012:backup-alerts'

# Next FAILED/EXPIRED/ABORTED job now alerts in minutes — not discovered weeks later by a restore attempt.

The fix for the silent-failure mode itself: every future job failure becomes a page, not a surprise.
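One detail the two commands above leave implicit: when the rule and target are created from the CLI, the SNS topic's access policy must itself allow EventBridge to publish, otherwise the rule matches but nothing is delivered. A sketch of that policy, assuming the backup-alerts topic from the example; the function name sns_policy_for_eventbridge and the Sid are hypothetical.

```shell
# Hypothetical helper: print an SNS access policy that lets EventBridge publish
# to the given topic. The Sid value is illustrative.
sns_policy_for_eventbridge() {
  topic_arn="$1"
  cat <<JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEventBridgeToPublish",
      "Effect": "Allow",
      "Principal": { "Service": "events.amazonaws.com" },
      "Action": "sns:Publish",
      "Resource": "${topic_arn}"
    }
  ]
}
JSON
}

# Usage (this replaces the topic's whole policy, so merge in any existing statements first):
# aws sns set-topic-attributes \
#   --topic-arn arn:aws:sns:us-east-1:123456789012:backup-alerts \
#   --attribute-name Policy \
#   --attribute-value "$(sns_policy_for_eventbridge arn:aws:sns:us-east-1:123456789012:backup-alerts)"
```

After applying it, an on-demand job that is expected to fail (or a test event) is a reasonable way to confirm the alert actually arrives.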

Backup jobs under the hood: deep dive

When a plan's schedule fires, AWS Backup assumes the IAM role named in the Backup Selection (typically AWSBackupDefaultServiceRole) and uses it to call the underlying snapshot API for the resource type — CreateSnapshot for EBS, CreateDBSnapshot for RDS, and so on. If that role lacks the managed policy for the resource type, or the resource is encrypted with a CMK whose key policy doesn't grant the role kms:GenerateDataKey/kms:Decrypt, the call returns AccessDenied and the job ends FAILED with that reason recorded in StatusMessage. This is the single most common failure class, and it almost always traces to a role or key-policy change made for an unrelated reason.
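A quick way to check the role side of this is to look at which managed policies are attached. The sketch below assumes the role relies on the AWS-managed policy AWSBackupServiceRolePolicyForBackup; a role that grants equivalent permissions through a custom or inline policy would fail this check while still being valid. The function name role_has_backup_policy is hypothetical.

```shell
# Hypothetical helper: succeed only if the AWS-managed backup policy is attached
# to the role. Does not inspect inline or customer-managed policies.
role_has_backup_policy() {
  role_name="$1"
  aws iam list-attached-role-policies --role-name "$role_name" \
    --query 'AttachedPolicies[].PolicyName' --output text \
    | tr '\t' '\n' \
    | grep -qx 'AWSBackupServiceRolePolicyForBackup'
}

# Usage:
# role_has_backup_policy AWSBackupDefaultServiceRole || echo "managed backup policy missing"
```

This covers only the IAM half; the KMS key policy still has to grant the same role access to the key.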

The other failure classes each leave a fingerprint in StatusMessage. A resource mid-modification (an instance resizing, an RDS instance applying a parameter group, a volume being modified) reports resource is in an invalid state because it can't be quiesced. Windows EC2 instances that use VSS for application-consistent snapshots fail with a VSS-specific message when the VSS components or the SSM agent aren't healthy. Hitting the per-account concurrent-backup limit, or queuing behind a slow job past the rule's CompletionWindowMinutes, surfaces as EXPIRED or 'exceeded the completion window'. A resource deleted between selection-evaluation and job-run fails with a not-found message. And a copy job into a destination vault whose access policy denies backup:CopyIntoBackupVault fails on the copy leg even when the source backup succeeded.
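The fingerprints described above can be turned into a rough triage function. This is a sketch only: the substrings it matches are assumptions drawn from the messages quoted in this lesson, not an official list of StatusMessage values, and the class labels and the function name classify_backup_failure are hypothetical.

```shell
# Hypothetical helper: map a StatusMessage string to a coarse failure class.
# The matched substrings are assumptions based on the examples in this lesson.
classify_backup_failure() {
  # Lowercase the message so matching is case-insensitive.
  msg="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$msg" in
    # Check the copy leg first: a copy denial may also contain "access denied".
    *copyintobackupvault*)                 echo "copy-vault-policy" ;;
    *"access denied"*|*accessdenied*)      echo "iam-or-kms" ;;
    *"invalid state"*)                     echo "resource-busy" ;;
    *vss*)                                 echo "windows-vss" ;;
    *"completion window"*)                 echo "window-exceeded" ;;
    *"not found"*|*"does not exist"*)      echo "resource-deleted" ;;
    *)                                     echo "unclassified" ;;
  esac
}

# Usage:
# classify_backup_failure "Access Denied"
```

Anything that lands in "unclassified" is worth reading by hand and, if it recurs, adding as a new pattern.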

Three signals let you diagnose without guessing. The State and StatusMessage fields returned by list-backup-jobs and describe-backup-job (State carries the API's BackupJobState value) are the authoritative per-job record. EventBridge emits a Backup Job State Change event for every transition, which you can route to SNS for real-time alerting; this is how you convert a silent failure into a page. And AWS Backup Audit Manager evaluates its frameworks nightly against controls such as whether a recovery point was created within the required frequency and whether minimum retention is met, which turns 'is everything succeeding org-wide?' into a single compliance report instead of per-account CLI archaeology.

# Read the full failure reason for one job — StatusMessage is the diagnosis.
aws backup describe-backup-job \
  --backup-job-id 9f8e7d6c-1234-5678-90ab-cdef01234567 \
  --query '{State:State,Reason:StatusMessage,Resource:ResourceArn,Role:IamRoleArn}'

# Common fix for the 'Access Denied' class: grant the backup role on the vault CMK.
# (Add this statement to the KMS key policy, then rerun the job to confirm COMPLETED.)
cat > kms-backup-grant.json <<'JSON'
{
  "Sid": "AllowBackupRoleUseOfTheKey",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole" },
  "Action": ["kms:GenerateDataKey", "kms:Decrypt", "kms:DescribeKey", "kms:CreateGrant"],
  "Resource": "*"
}
JSON

# Then trigger an on-demand job to verify the fix before trusting the schedule.
aws backup start-backup-job \
  --backup-vault-name prod-backup-vault \
  --resource-arn arn:aws:rds:us-east-1:123456789012:db:prod-orders \
  --iam-role-arn arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole

What is the impact of unaddressed backup job failures?

The headline impact is a recovery gap that compounds silently. Every failed job is a day a protected resource produced no recovery point, but the cost is zero until the moment you need to restore — at which point it can be total. A team that believes it can restore to within 24 hours, and discovers during an incident that the most recent good recovery point is three weeks old, has effectively had no backups for that window. The defining feature of this failure mode is that it is invisible right up until it is catastrophic.

The second-order impact is the false-confidence multiplier. Because the plan still shows as configured and the resource still shows as protected, every downstream assumption built on 'it's backed up' inherits the gap — disaster-recovery runbooks, customer SLAs, the RPO promised in a sales contract. A single unremediated KMS or IAM failure on a backup selection can quietly invalidate a recovery objective across dozens of resources at once, and no dashboard that only checks 'is a plan present' will catch it.

On the compliance side, the failure is an audit finding in waiting. SOC 2 CC7.5, ISO 27001 A.12.3 (A.8.13 in the 2022 edition), HIPAA §164.308(a)(7), and PCI DSS all expect backups to be not merely configured but verified as succeeding. 'We have a backup plan' does not pass when the evidence shows jobs have been failing for weeks. AWS Backup Audit Manager's nightly report is the cheapest answer to a control owner, but only if someone is acting on the failures it reports rather than letting them accumulate.

Finally there's the operational drag of late discovery. A failure caught on night one is a two-line key-policy or IAM fix and one rerun. The same failure caught seven weeks later is an incident: reconstructing what data was lost, explaining the gap to customers or regulators, and doing forensic work to figure out which of dozens of resources were affected and for how long. The cost of the fix is constant; the cost of the delay is everything else.

How do you fix backup job failures and keep them fixed?

Remediation is a four-step loop: read the failure reason, fix the dominant cause first, verify with an on-demand job, then wire alerting so the next failure is loud instead of silent.

1. Read StatusMessage and group failures by root cause

Pull every FAILED/EXPIRED job and group them by the StatusMessage field — that string is AWS's own diagnosis. Access Denied means an IAM role gap or a KMS key-policy denial. resource is in an invalid state means the resource was busy (mid-modification, in-use exclusive). A VSS message means the Windows backup agent or SSM agent is unhealthy. exceeded the completion window/EXPIRED means concurrency or throttling. A not-found message means the resource was deleted between selection and run. Don't fix blind — most failures cluster on one or two causes.

2. Fix the dominant cause, by class

For Access Denied: attach the AWS Backup managed policy for the resource type to the service role, and grant that role kms:GenerateDataKey/kms:Decrypt/kms:DescribeKey/kms:CreateGrant on the resource's and the vault's CMK. For 'invalid state': move the schedule outside maintenance windows and widen StartWindowMinutes. For VSS: repair the SSM agent and VSS components or fall back to crash-consistent snapshots. For completion-window/throttling: widen CompletionWindowMinutes and stagger schedules so jobs don't all fire at once. For copy-leg denials: fix the destination vault's access policy to allow backup:CopyIntoBackupVault from the source account.
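For the window-related fixes, the two settings live on the backup rule inside the plan. A minimal sketch of one rule with widened windows follows; the rule name, schedule, retention, and the specific minute values are illustrative assumptions, and the fragment would sit inside the Rules array of a plan document passed to update-backup-plan.

```shell
# Write an illustrative backup rule with explicit start and completion windows.
# All values here are assumptions to adapt, not recommendations.
cat > backup-rule-windows.json <<'JSON'
{
  "RuleName": "nightly-staggered",
  "TargetBackupVaultName": "prod-backup-vault",
  "ScheduleExpression": "cron(0 5 ? * * *)",
  "StartWindowMinutes": 120,
  "CompletionWindowMinutes": 720,
  "Lifecycle": { "DeleteAfterDays": 35 }
}
JSON
```

Staggering is then a matter of giving different selections different ScheduleExpression times so their jobs don't all start in the same minute.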

3. Verify with an on-demand job before trusting the schedule

Never assume a config fix worked because it looked right. Trigger an on-demand backup of one affected resource with start-backup-job and watch it reach COMPLETED. This proves the role, the key policy, and the vault policy all line up for real, on that resource type, before you wait a full day for the next scheduled run to either confirm or quietly fail again. The on-demand run is the difference between 'I changed the policy' and 'backups are working.'

4. Make the next failure loud, then audit org-wide

The root fix for the silent-failure mode is alerting. Create an EventBridge rule on Backup Job State Change filtered to FAILED/EXPIRED/ABORTED, target an SNS topic, and fan out to PagerDuty or Slack so the next failure pages within minutes. Then enable AWS Backup Audit Manager's built-in frameworks for a nightly org-wide compliance report on completion and retention, and track backup success rate as a standing operational metric. A single missed configuration should surface as a page on night one — never as a three-week-old recovery point during an incident.
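Tracking success rate as a standing metric can start as simple arithmetic over job states: completed jobs divided by all jobs that reached a terminal state. A sketch, assuming one state per line on standard input; the function name backup_success_rate is hypothetical, and jobs still in progress are deliberately left out of the denominator.

```shell
# Hypothetical helper: read one job state per line and print the success rate.
# Only terminal states count toward the total; in-progress jobs are ignored.
backup_success_rate() {
  awk '
    $1 ~ /^(COMPLETED|FAILED|EXPIRED|ABORTED|PARTIAL)$/ { total++ }
    $1 == "COMPLETED" { ok++ }
    END {
      if (total == 0) { print "no finished jobs"; exit }
      printf "%d/%d completed (%.1f%%)\n", ok, total, 100 * ok / total
    }'
}

# Usage (the [State] wrapper prints one state per line; the date is an example):
# aws backup list-backup-jobs --by-created-after 2026-05-20T00:00:00Z \
#   --query 'BackupJobs[].[State]' --output text | backup_success_rate
```

For example, three COMPLETED jobs and one FAILED job report as 3/4 completed (75.0%).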

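# Triage: how many jobs failed, grouped by reason (list-backup-jobs covers the last 30 days).
# The [StatusMessage] wrapper makes --output text print one message per line, so uniq -c can count them.
aws backup list-backup-jobs --by-state FAILED \
  --query 'BackupJobs[].[StatusMessage]' --output text | sort | uniq -c | sort -rn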

# After fixing the dominant cause (e.g. KMS key policy), verify with one on-demand job.
aws backup start-backup-job \
  --backup-vault-name prod-backup-vault \
  --resource-arn arn:aws:rds:us-east-1:123456789012:db:prod-orders \
  --iam-role-arn arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole \
  --query '{Job:BackupJobId}'

# Watch it through to COMPLETED before trusting the schedule.
aws backup describe-backup-job --backup-job-id <id> \
  --query '{State:State,Reason:StatusMessage,PercentDone:PercentDone}'
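Rather than rerunning describe-backup-job by hand, the verification step can be wrapped in a small polling loop. This is a sketch under simple assumptions: the function name wait_for_backup_job and the 30-second default interval are hypothetical, and there is no overall timeout, so a job stuck in a non-terminal state would keep the loop running.

```shell
# Hypothetical helper: poll a backup job until it reaches a terminal state.
# Returns 0 on COMPLETED, 1 on any other terminal state, 2 if the CLI call fails.
wait_for_backup_job() {
  job_id="$1"
  interval="${2:-30}"   # seconds between polls
  while :; do
    state="$(aws backup describe-backup-job --backup-job-id "$job_id" \
      --query 'State' --output text)" || return 2
    case "$state" in
      COMPLETED) echo "$state"; return 0 ;;
      FAILED|EXPIRED|ABORTED|PARTIAL) echo "$state"; return 1 ;;
    esac
    sleep "$interval"
  done
}

# Usage:
# wait_for_backup_job <id> && echo "fix verified"
```

The non-zero return on failure makes it easy to chain into a script that reads StatusMessage again when the verification job does not complete.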

Quick quiz

Question 1 of 5

The dashboard flags 18 backup jobs in a FAILED state on one RDS selection; the StatusMessage on each reads 'Access Denied' and the resources are encrypted with a CMK on the backup vault. What's the right next move?

Keep learning

Dig deeper into backup job states, failure diagnosis, alerting, and org-wide compliance.

You've completed Fix AWS Backup job failures. You now know why a configured plan can still leave you with nothing to restore, how to read BackupJobState and StatusMessage to tell IAM, KMS, resource-state, VSS, throttling, and copy-policy failures apart, the per-cause fix pattern, and — most importantly — how to wire EventBridge/SNS alerting so the next failure is loud instead of a three-week-old recovery gap discovered during an incident. The next time the check flags a failed backup job, you'll have a diagnosis-to-fix-to-alert path ready to run.

Backup job failures: what it means for risk

The difference between 'we back up' and 'we can recover'

When a team says "yes, that's backed up," what they usually mean is "a backup plan is configured for it." Those are not the same thing. A plan is the intent; the backup job is the act. If the job fails — and jobs fail for mundane reasons like an expired permission or a misconfigured encryption key — the intent is intact but the act never happened. The resource shows as protected on every dashboard while the actual recoverable copy quietly stopped being created weeks ago.

This finding flags backup jobs that ended in a failed or expired state. Each one is a night where a protected resource produced no recovery point. The exposure isn't a dollar amount on this month's invoice — it's a growing gap in recoverability that costs nothing until the day you need to restore, at which point it can cost a regulatory finding, a customer-data-loss disclosure, or days of reconstruction. A failed backup is the cheapest possible thing to ignore and one of the most expensive things to discover late.

From a governance and audit standpoint, the question to bring to the operational review is not "do we have backups?" — everyone answers yes to that. The question is "what is our backup success rate, and how old is the oldest unaddressed failure?" SOC 2, ISO 27001, and HIPAA controls all expect backups to be not just configured but verified as succeeding. A pattern of unremediated job failures is an audit finding waiting to happen, and the right control is an alert that makes the next failure loud rather than a quarterly discovery that it has been failing all along.

This lesson is for the finance or risk partner who hears "it's backed up" and wants to know whether that statement is actually true and verifiable. It walks through why a configured backup can silently stop working, what the exposure is when it does (recoverability, audit, compliance — not a line on the invoice), what number to ask for at the operational review, and what 'good' looks like: not just 'we have backups' but a measured success rate with a loud alert on the next failure. No CLI, no internals — just the framing to ask the right question and recognise a weak answer.

How a risk partner closes the loop

Priya is the risk and finance partner embedded with the platform team. At the monthly operational review the reliability lead presents a slide that says "backups: green." Priya asks the question that's now standard on her agenda: "What's the backup job success rate this month, and what's the oldest failure we haven't resolved?" The honest answer turns out to be 76% success on one selection, with failures going back eight days — the green board was showing 'plan configured,' not 'jobs succeeding.'

The conversation that follows isn't technical. Priya doesn't ask about KMS keys or completion windows. She asks three things: how many protected resources currently have no recent recovery point, how long that's been true, and whether anyone gets alerted when a job fails. The answer to the last one — "not automatically" — is the real finding. A silent failure mode is the gap; the eight days is just how long it took to notice by accident.

Two outcomes go into the next pack. First, engineering wires failure alerting so success rate is monitored continuously, not discovered at review time. Second, 'backup job success rate' joins the standing operational metrics, with the threshold being any sustained dip below 100% on protected resources. Priya knows the dollar saving here is zero — the value is that a recoverability promise the company makes to customers and auditors is now measured and defensible instead of assumed.

Why this matters to risk, not the bill

Unlike most findings on a cost dashboard, this one has effectively no line-item impact — a failed backup costs nothing because no storage was written. That's precisely what makes it dangerous to a finance or risk function: there is no invoice signal to catch it. The exposure is entirely off-balance-sheet until it crystallises as an incident, and then it lands as remediation cost, contractual penalty, or regulatory finding all at once.

The material impact is on contractual and regulatory commitments. If customer contracts or compliance attestations state a recovery point objective — 'we recover to within an hour of data loss' — a pattern of failed backups silently breaches that commitment for every affected resource. The breach is real the moment the jobs start failing, not when it's discovered. Finance should treat backup success rate as a covenant-compliance metric, not an IT detail, because it underwrites promises the business has already sold and signed.

The third impact is on audit credibility. When an auditor asks for evidence of working backups, 'we have a plan configured' is a weak answer that invites deeper scrutiny; 'here is our measured backup success rate and our alerting on failures' closes the control in one artifact. A standing record of unaddressed job failures is the opposite — it converts a routine control test into a finding, which is expensive in audit hours and reputational terms even when no data was actually lost.

Finally, it's a leading indicator of operational discipline. A backup selection that has been failing for weeks with no one noticing means there is no monitoring closing the loop on this class of automated job — and the same blind spot almost certainly applies to other silent-failure systems (replication, log shipping, certificate renewal). Watch the success rate and the time-to-detect as signals of whether the team's operational hygiene is real or assumed.

What finance and risk can actually do about this

Finance can't fix a KMS key policy, but it can make backup success a measured, governed metric instead of an assumption. Four levers, used at the operational review cadence.

1. Put backup success rate on the operational review as a standing line

Add 'backup job success rate' and 'oldest unresolved failure' to the standing review pack. 'Do we have backups?' is the wrong question — everyone answers yes. The right questions are the success rate against protected resources and how long the oldest failure has gone unaddressed. Either one moving in the wrong direction is the prompt to escalate, well before an incident forces the conversation.

2. Treat the RPO as a covenant, not an aspiration

Wherever a recovery point objective is written into a customer contract or a compliance attestation, the agreement should be that a backup success rate below 100% on the resources covered by that promise is a breach to be reported, not an ops detail. That single framing turns 'fix the backup' from an engineering chore into protecting a commitment the business has already sold.

3. Require alerting as a precondition for calling a system 'protected'

Make it policy that no resource is reported as 'backed up' unless its backup jobs are monitored and failures alert automatically. A configured plan without failure alerting is an unmonitored promise. This converts the abstract risk into a concrete, checkable control: either failures page someone within a day, or the system isn't really protected — regardless of what the plan says.

4. Track time-to-detect as the discipline metric

The dangerous number isn't how many jobs failed — it's how long failures sit unnoticed. A failure caught on night one is a trivial fix; the same failure at seven weeks is an incident. Ask for the time-to-detect on backup failures and treat a multi-day answer as a sign the loop isn't being closed, even if the success rate looks acceptable on average.

Quick quiz

Question 1 of 5

At the operational review the reliability lead presents 'backups: green', but when you ask, the backup job success rate on one selection is 76% and the oldest unresolved failure is eight days old. As the risk partner, what's the right next move?

You've finished the risk partner's view of backup job failures. You know why 'we have backups' and 'we can recover' are different statements, why this exposure never shows up on the invoice, how it ties to RPO covenants and audit credibility, and the levers you control — success rate on the review pack, RPO-as-covenant, alerting as a precondition for 'protected', and time-to-detect as the discipline metric. Next time someone says 'backups: green', you'll have a sharper question than 'do we have backups?'

Backup job failures: the headline

Recovery you believe you have, but don't

Having a backup policy and having a working backup are different things. Backup jobs run automatically every night, and they can fail silently — a permission lapses, an encryption key changes — while every status board still shows the system as 'protected.' The business carries a recovery promise it can no longer keep, and finds out only during the incident the backup existed to cover.

A pattern of backup failures is a business-continuity exposure, not an IT housekeeping detail. The one question worth asking is: "What is our backup success rate, and would we know within a day if it dropped?" If the answer is anything other than a confident number and a yes, the organisation is one bad week away from discovering its disaster-recovery plan was theoretical.

A short read for the executive who needs the continuity headline and one question to ask their team. You'll get the distinction between having a backup policy and having working backups, what a pattern of silent failures signals about operational discipline, and what 'good' looks like at an org level — a known success rate and same-day alerting. No commands, no implementation.

What it looks like when the org gets this right

At one company the quarterly continuity review used to open with a backup slide that simply read 'green.' Then a near-miss — a restore attempt that found a three-week-old recovery point — changed the question the exec sponsor asked. He stopped accepting 'green' and started asking: "What's our backup success rate, and how quickly would we know if it dropped?"

Within a quarter the slide changed. It no longer said 'green'; it showed a success-rate number and a single line: 'every backup failure pages on-call within minutes.' The team had wired failure alerting and added the success rate to the standing metrics. The sponsor hadn't asked them to chase a dollar figure — he'd asked them to make a silent failure mode loud.

That's the right outcome state for this category. The goal isn't 'we have backups' — every company says that. The goal is 'we measure that backups succeed and we'd know within a day if they stopped.' The continuity review stops being a reassurance ritual and becomes a real confidence signal.

Why this is on the report at all

This category carries no dollar figure, which is exactly why it's worth an executive's attention: there's no bill to catch it, so the only thing standing between a silent failure and a data-loss incident is whether the organisation measures backup success and alerts on failure. A confident success-rate number with same-day alerting means the continuity promise is real; a green dashboard with no success metric means it's assumed.

There's a direct risk-and-reputation dimension too. The recovery promises in customer contracts, the attestations in compliance reports, and the disaster-recovery plan the board has been shown all rest on backups actually working. A pattern of unaddressed failures quietly invalidates all of them at once, and the failure surfaces at the worst possible moment — during the incident the backups existed to cover. This sits at the intersection of continuity, compliance, and customer trust, which is why it belongs in front of leadership and not buried in an ops queue.

The leadership move on this category

The executive handle isn't to chase individual failures — it's to insist the organisation measures backup success and detects failure fast.

1. Demand a success rate, not a green light

Replace 'are backups green?' with 'what is our backup success rate and how is it trending?' A green status board can mean 'plan configured' while jobs fail silently. A measured success rate against protected resources is the only answer that tells you the recovery promise is real.

2. Insist that every failure is detected within a day

The defining risk here is silence. Make same-day detection of backup failures a non-negotiable operating norm — every failed job should page a human within minutes. This single requirement converts the worst failure mode (a recovery gap discovered weeks later during an incident) into a routine same-day fix.

3. Treat a pattern of failures as a continuity signal

Ask for the trend at the leadership review. A flat 100% success rate with active alerting means the continuity promise underwriting customer contracts and the DR plan is sound. A pattern of unaddressed failures signals that the organisation's recovery posture is theoretical — and the same blind spot likely affects other silent-failure systems.

Quick quiz

Question 1 of 5

You're reviewing the continuity pack and see backup job success rate has been 100% for three quarters, with every failure paging on-call within minutes. What's the right read?

That's the lesson. Two takeaways worth holding onto: having a backup policy and having working backups are different things, and the size of this risk is hidden because it never hits the bill. The leadership question is about a measured success rate and same-day detection — not about a green light.
