Site Reliability

Fix cross-region backup copy failures

The local backup succeeded, so the dashboard looks green — but the cross-region copy silently failed, and the off-region DR copy you think you have doesn't exist.

12 min·10 sections·AWS

Last reviewed 27 May 2026

Cross-region copy failures: the basics

Why a green backup plan can still leave you with no DR copy

AWS Backup runs a plan in two distinct stages. First the local backup job takes a recovery point into the vault in the workload's own region — that's the part that usually works. Then, if the rule has a Copy Action attached, a separate copy job fires to push that recovery point into a vault in your nominated DR region. These are two independent jobs with two independent success states, and the second one can fail completely while the first reports COMPLETED. The plan looks healthy at a glance because the resource was, in fact, backed up — just not where you think.

The check flags backup plans where copy jobs are entering the FAILED state. That matters far more than it sounds, because the failure is silent in every place people normally look. The protected-resource view is green. The local recovery point exists. The only place the truth lives is the copy-job history and the StatusMessage on each failed job. Teams discover the gap during a region outage — the worst possible moment to learn that the off-region copy they planned their DR around was never actually written.

This is the failure-mode companion to configuring cross-region copy in the first place. Having a Copy Action defined is necessary but not sufficient: the destination region needs an opted-in account, an existing destination vault, a usable KMS key with the right policy, and a copy role with the right trust. When any one of those is wrong, the local backup keeps succeeding and the copy keeps failing, indefinitely, until somebody reads the copy-job log.

In this lesson you'll learn the two-stage anatomy of an AWS Backup plan that lets a local backup succeed while the cross-region copy fails, and why that failure is invisible everywhere except the copy-job log. You'll see how to diagnose it from list-copy-jobs and the StatusMessage field, the handful of root causes that account for almost every failure — a missing or unusable destination KMS key, a missing or policy-denied destination vault, a destination region the account hasn't opted into, resource-type limits, and copy-role trust — and the order to fix them in. You'll also wire up an EventBridge alarm so the next failed copy is loud within minutes instead of discovered in a post-mortem.

Fun fact

The DR copy that failed for 94 days straight

A SaaS platform team ran cross-region copies from eu-west-2 to eu-west-1 for two years and never thought about them again — the plans were green every morning. During an unrelated audit they pulled the copy-job history and found every single copy job had been failing for 94 consecutive days with the same StatusMessage: "Access denied while assuming role". A routine least-privilege cleanup had pruned a KMS permission from the copy role. The local backups had succeeded perfectly the whole time, so nothing ever alerted. They had been paying for a DR posture that had quietly evaporated three months earlier — and would have discovered it for the first time during an actual region failure.

Diagnosing a failing copy job in action

Lena owns platform reliability for a healthtech company. The continuity dashboard flips a backup plan from green to flagged: local backups COMPLETED as always, but copy jobs in the last 24 hours are FAILED. The protected-resource view still shows everything backed up, which is exactly why nobody had noticed — the gap was only ever visible in the copy-job log.

She lists the failed copy jobs and reads the StatusMessage on each one. They all say the same thing: the copy can't use the destination vault's KMS key. The destination vault exists, the Copy Action is wired correctly, the local recovery points are fine — but the re-encryption step at the destination is being denied. That points at a key policy, not a backup-plan problem.

She inspects the destination KMS key policy and finds it: a recent security-hardening change tightened the key policy and dropped the statement that let AWS Backup use the key for cross-region copies. She adds the statement back, starts an on-demand copy of the latest local recovery point rather than waiting for the 2am schedule, watches it reach COMPLETED, and then does the part most teams skip: she confirms the destination recovery point is actually present and restorable in eu-west-1. Total time: about fifteen minutes, most of it spent reading the StatusMessage carefully enough to fix the right thing.

First, list the failed copy jobs in the source region over the last day and read the StatusMessage — that field is where the real cause lives.

$ aws backup list-copy-jobs --region eu-west-2 --by-state FAILED --by-created-after $(date -u -d '24 hours ago' +%FT%TZ) --query 'CopyJobs[*].{Resource:ResourceArn,Dest:DestinationBackupVaultArn,State:State,Msg:StatusMessage}' --output table

# Local backups are COMPLETED; only the cross-region copies fail. StatusMessage points at KMS.

The copy-job log is the only place the failure is visible — and StatusMessage names the root cause.

The most common culprit: the destination vault's KMS key policy doesn't let AWS Backup use it. Pull the policy and confirm the Backup service statement is missing.

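# get-key-policy accepts only a key ID or key ARN, so resolve the alias with describe-key first.
$ aws kms get-key-policy --region eu-west-1 --key-id "$(aws kms describe-key --region eu-west-1 --key-id alias/backup-key --query KeyMetadata.KeyId --output text)" --policy-name default --query Policy --output text | python3 -m json.tool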

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAdmins",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:root" },
      "Action": "kms:*",
      "Resource": "*"
    }
  ]
}

# No statement granting the AWS Backup copy role kms:Decrypt/Encrypt/GenerateDataKey/CreateGrant.

A hardened key policy with only an admin statement silently blocks every cross-region copy.
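The same check can be scripted against a saved copy of the policy. The sketch below is a rough text search, not a policy evaluator: `missing_kms_actions` is a hypothetical helper name, and it assumes the policy has been saved to a local file. It ignores principals, conditions, and wildcards, so treat an empty result as a hint rather than proof that the copy role can use the key.

```shell
# List which of the four KMS actions named in this lesson never appear
# in a saved key policy file. Plain text search only: it does not look at
# principals, conditions, or wildcards such as kms:*.
missing_kms_actions() {
  policy_file="$1"
  for action in kms:Decrypt kms:Encrypt kms:GenerateDataKey kms:CreateGrant; do
    # -F: fixed string, -q: quiet. Matching the surrounding quotes stops
    # kms:GenerateDataKey from matching kms:GenerateDataKeyWithoutPlaintext.
    grep -Fq "\"$action\"" "$policy_file" || echo "$action"
  done
}

# Example (assumes the policy was saved to key-policy.json beforehand):
# missing_kms_actions key-policy.json
```

Any action it prints is one to look for by hand in the policy before editing it.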

Why the local backup succeeds and the copy fails: deep dive

A plan rule produces a local backup job; on its COMPLETED transition, AWS Backup fires one copy job per entry in the rule's CopyActions array. The two jobs are independent — the local job only needs the source vault and source KMS key, both of which are typically healthy, while the copy job has a longer chain of dependencies in a different region. That asymmetry is the whole reason for the silent gap: the thing people watch (the local backup) almost never fails, and the thing they don't watch (the copy) is where every failure lands.
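To see which rules in a plan actually carry copy actions, the plan definition can be queried directly. This is a minimal sketch: `show_copy_actions` is a hypothetical wrapper name and the plan ID is a placeholder you would look up with list-backup-plans.

```shell
# Print each rule in a backup plan with the destination vaults of its copy actions.
# A rule whose Copies value is null or empty has no cross-region copy configured.
show_copy_actions() {
  region="$1"    # region the plan lives in, e.g. eu-west-2
  plan_id="$2"   # backup plan ID (placeholder; find it with list-backup-plans)
  aws backup get-backup-plan \
    --region "$region" \
    --backup-plan-id "$plan_id" \
    --query 'BackupPlan.Rules[*].{Rule:RuleName,Copies:CopyActions[*].DestinationBackupVaultArn}' \
    --output json
}

# Example (commented out so pasting the function has no side effects):
# show_copy_actions eu-west-2 <your-plan-id>
```

Each destination ARN listed under a rule corresponds to one copy job per completed local backup, which is the number of copy jobs you should expect to find in the copy-job history.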

The dominant cause, by a wide margin, is KMS. KMS keys are region-scoped, so the copy re-encrypts the recovery point with the destination vault's key. If that key's policy doesn't grant the AWS Backup copy role kms:Decrypt, kms:Encrypt, kms:GenerateDataKey, and kms:CreateGrant, the copy fails with an access-denied StatusMessage even though the vault and plan are perfect. A source key that isn't grantable to the destination produces the same symptom. The remaining causes are structural: the destination vault doesn't exist or its access policy denies backup:CopyIntoBackupVault; the destination region isn't enabled/opted-in for the account (common with newer opt-in regions); the resource type isn't supported for cross-region copy; or the copy IAM role's trust or permissions are wrong. Each leaves the local backup untouched.
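When there are many failed jobs, a first-pass triage of StatusMessage text can be sketched as a simple pattern match. The function name and the match patterns below are illustrative guesses at message wording, not an official list of AWS Backup messages; tune them against your own copy-job history.

```shell
# Map a copy-job StatusMessage to the dependency most likely to be broken.
# Patterns are assumptions about wording and are checked in order, so the
# most common cause (KMS) wins when a message mentions several things.
classify_copy_failure() {
  # Lower-case the message so matching is case-insensitive.
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *kms*)                              echo "kms-key-policy" ;;
    *assum*|*" role"*)                  echo "copy-role-trust" ;;
    *vault*)                            echo "destination-vault" ;;
    *opt-in*|*opted*|*"not enabled"*)   echo "region-opt-in" ;;
    *"not supported"*|*unsupported*)    echo "resource-type" ;;
    *)                                  echo "unknown" ;;
  esac
}

# Example:
# classify_copy_failure "Access denied while assuming role"
```

Anything that falls through to "unknown" still needs a human to read the full message; the point is only to group the obvious cases quickly.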

This is distinct from an in-region backup-job failure, which is a separate concern with separate causes — vault lock conflicts, source-resource state, in-region IAM or KMS on the source side. A backup-job failure means you have no recovery point at all and is usually loud. A copy-job failure means you have a perfectly good local recovery point and no off-region copy, and is usually silent. Diagnose them as two different problems: read list-backup-jobs for the former, list-copy-jobs --by-state FAILED plus describe-copy-job for the latter, and let the StatusMessage steer you to the specific dependency that's broken.

# Drill into a single failed copy job for the full StatusMessage and the source/dest vaults.
aws backup describe-copy-job \
  --copy-job-id 8e3c5a7d-2f4b-4c8a-9d1e-6b3a7f9c2e51 \
  --region eu-west-2 \
  --query 'CopyJob.{State:State,Msg:StatusMessage,Source:SourceBackupVaultArn,Dest:DestinationBackupVaultArn}'

# Confirm the destination region is actually opted-in for this account (a common silent cause).
aws account get-region-opt-status \
  --region-name eu-west-1 \
  --query 'RegionOptStatus'

# Confirm the local backup job for the same recovery point succeeded — proving it's a copy-only failure.
aws backup list-backup-jobs \
  --region eu-west-2 \
  --by-state COMPLETED \
  --by-created-after $(date -u -d '24 hours ago' +%FT%TZ) \
  --query 'BackupJobs[*].{Resource:ResourceArn,State:State}'

What is the impact of a failing cross-region copy?

The headline impact is having no off-region recovery point when a region event hits, despite believing you do. AZ-level failures are survivable from the local vault; a region-wide outage is exactly the scenario the cross-region copy exists for. If the copy has been failing, your achievable recovery point during that event is limited to whatever exists outside the affected region, which by assumption is nothing, so the RPO target is missed outright and your RTO stretches to days while you reconstruct from offline or customer-supplied data. The local recovery points are stranded in the region that's down.

The second-order impact is the false-confidence tax. Because the local backup is green and the protected-resource view looks complete, teams build DR runbooks, sign DR attestations, and reassure customers on the basis of copies that aren't being written. The gap doesn't degrade gracefully — it's binary and invisible right up until the moment of failure, when the cost is at its absolute peak. A control that is assumed-working and actually-broken is worse than a control everyone knows is missing, because nobody is compensating for it.

The third impact is contractual and audit exposure. Controls such as SOC 2 CC7.5, ISO 27001 A.17, and PCI DSS 12.10.1, along with the DR clauses in most enterprise MSAs, are typically evidenced by recovery points that demonstrably exist in a second region. A failing copy job means the control is non-operating. An auditor who asks for evidence of a recent destination recovery point (the actual object, not the plan config) will find the gap, and "the plan was configured to copy" is not a defence when no copies landed.

The financial impact, by contrast, is small and almost irrelevant. Existing DR storage keeps billing whether or not new copies land, and a failing copy may even cost slightly less than a working one because nothing new is transferred or stored at the destination. The cost of fixing it is a few minutes of engineering time. The cost of not fixing it is measured in the hours or days of downtime during a regional event, which for most production workloads is orders of magnitude larger than the entire DR storage line.

How do you fix a failing cross-region copy safely?

Fixing a failed copy is a four-step loop: read the StatusMessage to find the real cause, fix the most common culprit first (KMS and vault policy), force an immediate copy and verify it lands and restores, and then make the next failure loud with an alarm so you never discover this in a post-mortem again.

1. Diagnose from copy-job history and StatusMessage

Run aws backup list-copy-jobs --by-state FAILED in the source region and describe-copy-job on a sample, and read the StatusMessage carefully — it almost always names the broken dependency directly (access denied on a KMS key, vault not found, region not opted-in, unsupported resource type, role assumption failure). Confirm in parallel that the local backup job for the same recovery point reached COMPLETED, so you know you're fixing a copy-only failure and not a deeper backup problem.

2. Fix the most common culprit: KMS and vault policy

The destination vault's KMS key policy must grant the AWS Backup copy role kms:Decrypt, kms:Encrypt, kms:GenerateDataKey, and kms:CreateGrant; the destination Backup Vault must exist and its access policy must allow backup:CopyIntoBackupVault. Fix the key policy and vault policy first — they account for the large majority of failures. Then check the structural causes in order: is the destination region opted-in for the account, is the resource type supported for cross-region copy, and does the copy IAM role have the right trust and permissions.

3. Force an immediate copy, then verify it lands and restores

Don't wait for the 2am schedule to find out if the fix worked. Start an on-demand copy of the latest local recovery point with start-copy-job (start-backup-job has no copy option, so on-demand copies go through start-copy-job), watch the copy job reach COMPLETED in list-copy-jobs, then do the step teams skip: restore the destination recovery point in the DR region with StartRestoreJob, confirm it boots, and tear it down. A copy that completes but has never been restored from is still unproven.

4. Make the next failure loud with an alarm

The root cause of the danger isn't the failure — it's that the failure was silent. Wire an EventBridge rule on AWS Backup copy-job state changes filtering for FAILED, targeting SNS or PagerDuty, so a broken copy pages the on-call within minutes instead of surfacing in an audit months later. Pair it with a recurring report of "days since last successful DR copy" per critical workload, so a copy that quietly stops gets noticed by the metric even if the alarm is misconfigured.
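The "days since last successful DR copy" figure from step 4 is simple date arithmetic once you have the completion time of the newest COMPLETED copy job. A minimal sketch, assuming GNU date (as the other snippets in this lesson do); `days_since` is a hypothetical helper name, and the second argument exists only to make the calculation repeatable.

```shell
# Whole days between an ISO-8601 UTC timestamp and "now".
# Relies on GNU date's -d option for parsing.
days_since() {
  last="$1"                          # e.g. completion time of the last good copy
  now="${2:-$(date -u +%FT%TZ)}"     # optional override of the current time
  last_s=$(date -u -d "$last" +%s)   # seconds since the epoch
  now_s=$(date -u -d "$now" +%s)
  echo $(( (now_s - last_s) / 86400 ))
}

# Example: feed it the completion time of the newest COMPLETED copy job.
# days_since 2026-05-20T02:13:00Z
```

A report would run this per critical workload and escalate when the number is anything other than near zero.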

# 1) Add the missing AWS Backup statement to the destination KMS key policy, then re-run the copy.
#    (Append this statement to the existing key policy document and put-key-policy.)
cat > kms-backup-statement.json <<'JSON'
{
  "Sid": "AllowBackupCopy",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole" },
  "Action": [
    "kms:Decrypt",
    "kms:Encrypt",
    "kms:GenerateDataKey",
    "kms:CreateGrant"
  ],
  "Resource": "*"
}
JSON

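# 2) Force an immediate copy of the newest local recovery point instead of waiting for the schedule.
#    (start-backup-job has no copy option; on-demand copies use start-copy-job.)
RP_ARN=$(aws backup list-recovery-points-by-backup-vault \
  --region eu-west-2 \
  --backup-vault-name prod-rds-vault \
  --by-resource-arn arn:aws:rds:eu-west-2:123456789012:db:prod \
  --query 'sort_by(RecoveryPoints, &CreationDate)[-1].RecoveryPointArn' \
  --output text)
aws backup start-copy-job \
  --region eu-west-2 \
  --recovery-point-arn "$RP_ARN" \
  --source-backup-vault-name prod-rds-vault \
  --destination-backup-vault-arn arn:aws:backup:eu-west-1:123456789012:backup-vault:prod-rds-vault-dr \
  --iam-role-arn arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole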

# 3) Verify the copy now reaches COMPLETED.
aws backup list-copy-jobs \
  --region eu-west-2 \
  --by-state COMPLETED \
  --by-created-after $(date -u -d '1 hour ago' +%FT%TZ) \
  --query 'CopyJobs[*].{State:State,Dest:DestinationBackupVaultArn,Bytes:BackupSizeInBytes}'

# 4) Make the next failure loud: EventBridge rule on FAILED copy jobs -> SNS.
aws events put-rule \
  --region eu-west-2 \
  --name backup-copy-failed \
  --event-pattern '{"source":["aws.backup"],"detail-type":["Copy Job State Change"],"detail":{"state":["FAILED"]}}'
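An EventBridge rule on its own matches events but notifies nobody until it has a target. The sketch below attaches an SNS topic to the rule; `attach_copy_failure_target` is a hypothetical wrapper name and the topic ARN in the example is a placeholder. It assumes the topic already exists and that its access policy lets events.amazonaws.com publish to it.

```shell
# Attach an SNS target to the backup-copy-failed rule created above.
attach_copy_failure_target() {
  region="$1"      # region where the rule lives, e.g. eu-west-2
  topic_arn="$2"   # existing SNS topic the on-call rota subscribes to
  aws events put-targets \
    --region "$region" \
    --rule backup-copy-failed \
    --targets "Id=copy-failed-sns,Arn=$topic_arn"
}

# Example (commented out so pasting the function has no side effects):
# attach_copy_failure_target eu-west-2 arn:aws:sns:eu-west-2:123456789012:backup-copy-failed-alerts
```

It is worth failing a copy on purpose once, for example against a test vault, to confirm the notification actually arrives.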

Quick quiz

A backup plan's local backup jobs are all COMPLETED, but list-copy-jobs --by-state FAILED shows every cross-region copy failing with StatusMessage "Access denied" on the destination KMS key. What's the right first move?

Keep learning

Dig deeper into AWS Backup copy mechanics, the KMS and policy dependencies behind them, and how to monitor for silent failures.

You've completed Fix cross-region backup copy failures. You now know why a local backup can succeed while the cross-region copy silently fails, how to diagnose it from the copy-job history and StatusMessage, the handful of root causes — KMS key policy first, then vault policy, region opt-in, resource-type limits, and role trust — and how to force, verify, and alarm on copies so a missing DR copy is loud instead of discovered in a post-mortem. Next time a plan looks green, you'll know to check whether the copy actually landed.

Cross-region copy failures: what it means for risk

Paying for a DR copy you don't actually have

The organisation decided, at some point, that critical workloads need a backup copy held in a second AWS region so a regional outage can't destroy both the workload and its only backup. That decision usually shows up on the invoice as warm storage in the DR region and a per-copy charge. This finding flags the case where those copies are silently failing: the backup plan reports success, the local backup exists, but the copy to the second region never lands. The intended protection isn't there, even though everyone believes it is.

The exposure is asymmetric and easy to underestimate. Nothing breaks day to day. Reports stay green. The first time the gap becomes visible is during an actual regional disaster — exactly when there is no time left to fix it and the cost of the gap is at its maximum. In risk terms this is a control that is documented, assumed effective, and actually broken: the most dangerous kind, because no one is watching it.

It also has a compliance dimension. Geographic-redundancy requirements in frameworks like SOC 2, ISO 27001, and PCI DSS, and the DR clauses in most enterprise contracts, are satisfied by recovery points that actually exist in a second region — not by a backup plan that was configured to create them. A failing copy job means the control is non-operating, and an auditor who asks to see the destination recovery points will find the gap whether or not the team has. The right framing at the operational review is not "are we paying for DR?" but "can we prove a recent copy actually landed in the DR region?"

This lesson is for the finance partner who sees "backup" and "DR storage" on the cloud invoice and reasonably assumes the protection is working. It explains why a backup plan can report success while the off-region copy quietly fails, why that's a risk-and-compliance problem rather than a cost one, how the exposure behaves (invisible until a disaster, then catastrophic), and the small number of things to ask for at the operational review so a broken control gets caught in a quarter rather than in an outage. No commands, no internals.

How a finance partner surfaces the gap

Priya is the finance partner embedded with the platform team at a healthtech company. The DR-storage line on the cloud invoice has looked stable for months, so on paper the disaster-recovery posture is funded and working. At the operational review she asks the question that's now standard on the agenda: "For our critical workloads, can we show evidence that a backup copy actually landed in the second region in the last week — not just that the plan is configured to send one?"

The answer that comes back isn't comfortable. The team pulls the copy-job history and finds the copies have been failing; the local backups were green the whole time, so nothing had flagged it. The conversation that follows isn't technical — Priya doesn't ask about KMS keys or vault policies. She asks three things: how long has the copy been failing, which workloads are exposed in the meantime, and what will tell us automatically the next time it breaks. The team commits to fixing the copy, restoring from the destination to prove it, and adding an alert so a failed copy surfaces in minutes.

A month later the review includes a new standing line: "days since last successful DR copy" per critical workload, alongside a note that copy failures now page the on-call. Priya knows the dollar amount on the DR-storage line was never the point — a stable invoice told her nothing about whether the control was working. The metric that matters is evidence of a recent successful copy, and that's now the thing she asks for.

Why this matters to risk, not the bill

Unlike most findings on a cost dashboard, this one barely touches the invoice. The DR-storage and per-copy line items keep billing at roughly the same rate whether copies succeed or fail — a failing copy might even shave a little off, because nothing new is being stored at the destination. So watching the dollar amount tells you nothing about whether the control works. That's the trap: a stable, healthy-looking invoice can sit on top of a completely broken DR posture.

The real impact is risk exposure that is invisible in steady state and maximal in a disaster. The organisation is funding a disaster-recovery capability and, when copies are failing, simply not receiving it. The exposure is binary — there either is a recent off-region recovery point or there isn't — and it stays hidden until a regional event forces the question. From a risk-quantification standpoint, the expected loss is low-probability but catastrophic-magnitude, which is precisely the profile that justifies a cheap, always-on control rather than periodic spot checks.

There's a commitment-and-budget angle too: this is spend that is failing to buy its intended outcome. Money is leaving the budget for DR storage while the DR outcome — recoverability from a second region — is not being delivered. That's worse than waste, because waste is at least visible; this is paying full price for a capability that silently isn't there. The right budget question is whether the spend is producing verified copies, not whether the spend is stable.

Finally, it's a control-assurance signal. If a DR copy could fail for weeks without anyone noticing, the same blind spot almost certainly applies to other assumed-working controls — encryption, retention, replication. A failing copy that went undetected is a leading indicator that the organisation verifies controls by configuration rather than by evidence, and that pattern predicts other unpleasant audit surprises.

What finance can actually do about this

Finance can't fix a KMS policy, but it can change what the organisation reports and verifies so a broken DR copy gets caught in a quarter rather than in an outage. Four levers, used at the operational review.

1. Ask for evidence of a recent copy, not configuration

Make the standing question at the operational review "can we show a backup copy actually landed in the DR region in the last week?" rather than "is cross-region copy configured?" Those are different claims, and the gap between them is exactly where this risk lives. Evidence means a recent destination recovery point, not a screenshot of the plan.

2. Put 'days since last successful DR copy' on the report

Add a per-critical-workload line showing how long since the last verified destination copy. A green backup tick hides this; an explicit recency metric makes a silently-failing copy show up as a number creeping upward. If it's anything other than near-zero for a critical workload, that's the escalation prompt.

3. Require that copy failures are alerted, not just logged

The agreement to push for is that a failed copy pages a human within minutes. Finance doesn't implement the alarm, but it can make "is there an automatic alert on copy failure?" a precondition for signing off the DR posture as effective. An unmonitored control is, for risk purposes, an absent one.

4. Treat verified recoverability as the funded outcome

The DR-storage spend is buying recoverability from a second region, not storage for its own sake. Frame the budget line as conditional on that outcome being demonstrated — a restore drill or a verified recent copy — so the spend is tied to proof rather than to the existence of a configuration that may not work.

Quick quiz

The cloud invoice shows the DR-storage line flat and healthy, but an audit reveals the cross-region copy jobs have been failing silently for weeks. As the finance partner, what's the right read and next move?

You've finished the finance partner's view of cross-region copy failures. You know why a stable DR-storage invoice can hide a completely broken control, why this is a risk-and-compliance issue rather than a cost one, and the four levers — ask for evidence not configuration, track days-since-last-successful-copy, require alerting on failure, and tie the spend to a verified outcome. Next time the DR line shows up at the operational review, you'll have a sharper question than "are we paying for DR?"

Cross-region copy failures: the headline

A disaster-recovery copy that quietly stopped existing

The business agreed that critical data should have a backup held in a second region so a regional outage can't take out both the system and its only copy. This finding means that copy has been silently failing: the primary backup still runs, the dashboards stay green, but the off-region copy the recovery plan depends on isn't being written. The protection the business is paying for and counting on is not actually in place.

The danger is that nothing looks wrong until the day it matters most. This is a continuity and audit issue, not a cost one — the fix is cheap and fast, but only if it's done before a regional event rather than during one. The leadership question is simple: can the team show evidence that a recent backup copy actually landed in the second region, not just that the system was configured to send one?

A short read for the executive who wants the business-continuity headline and the one question to ask. You'll get why a DR backup copy can silently stop existing while everything looks healthy, what that signals about how the org verifies its controls, and what "good" looks like — evidence that a recent copy actually landed, not just that one was configured.

What it looks like when the org gets this right

At one company the quarterly continuity review used to report "backups: green" and move on. Then a near-miss — a copy that had been failing for weeks, caught by luck during an audit — changed the question the exec sponsor asked. Instead of "are backups running?" she started asking "can we prove a recent DR copy actually landed, and how fast would we know if it stopped?"

Within a quarter the reporting changed shape. The headline was no longer a green backup tick; it was "every critical workload has a verified destination copy from the last 24 hours, and a failed copy pages the on-call within minutes." The cost of the DR storage didn't move — it was never the issue — but the confidence behind the number became real. The team had stopped trusting that a configured control was a working one.

That's the right outcome state. The goal isn't "backups configured"; it's "DR copies verified and monitored." A configured-but-unverified copy is the most expensive kind of false comfort, and the leadership move is to make the verification, not the configuration, the thing that gets reported.

Why this is on the report at all

The dollar amount here is negligible and beside the point. This is tracked because it represents a continuity control the business is paying for and assuming is effective, that may in fact be silently broken. The risk profile is low-probability and high-consequence — nothing breaks until a regional event, and then the missing off-region copy is the difference between a recovery measured in hours and one measured in days. That asymmetry is exactly why a quietly-failing copy belongs on a leadership report even though it costs almost nothing.

The deeper signal is about how the organisation assures its controls. If a DR copy can fail for weeks unnoticed, the org is trusting configuration over evidence — and the same blind spot likely covers other controls that show up in audits, customer security reviews, and incident post-mortems. Most CFOs care about the audit and contractual exposure; most CIOs care about the recoverability; both should care that an assumed-working control was actually broken and nothing caught it.

The leadership move on this category

The handle for an executive isn't the dollar amount — it's insisting the organisation verifies its continuity controls by evidence rather than by configuration.

1. Ask for proof, not assurance

"Can we demonstrate a recent DR copy actually exists, and restore from it?" is a one-minute question that cuts straight through a green dashboard. A configured control and a working control are different things; only the second one matters in a disaster, and only proof distinguishes them.

2. Require that silent failures become loud

The core failure here is that a broken control went unnoticed. Insist that critical continuity controls page a human when they fail, and make "how fast would we know if this stopped working?" a standard question for any control the business depends on. Detection latency is itself a risk metric.

3. Make verification the thing that gets reported

Shift the continuity report from "backups configured" to "DR copies verified and monitored." If the answer is "verified within the last day and alarmed on failure" for every critical workload, the posture is real and leadership can look elsewhere. If it's "configured," that's the gap to close before the next regional event, not after.

Quick quiz

Your continuity report says "backups: green" but a near-miss showed the off-region DR copy had silently been failing. What's the right thing to change about how this is reported?

That's the lesson. Two takeaways worth holding onto: a green backup dashboard can sit on top of a DR copy that silently stopped existing, and the leadership move is to demand proof and monitoring — "can we show a recent copy landed, and how fast would we know if it stopped?" — not configuration.

Part of the learning path Build in resilience
