Fix cross-region backup copy failures

The local backup succeeded, so the dashboard looks green — but the cross-region copy silently failed, and the off-region DR copy you think you have doesn't exist.

Cross-region copy failures: the basics

Why a green backup plan can still leave you with no DR copy

AWS Backup runs a plan in two distinct stages. First the local backup job takes a recovery point into the vault in the workload's own region — that's the part that usually works. Then, if the rule has a Copy Action attached, a separate copy job fires to push that recovery point into a vault in your nominated DR region. These are two independent jobs with two independent success states, and the second one can fail completely while the first reports COMPLETED. The plan looks healthy at a glance because the resource was, in fact, backed up — just not where you think.
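One way to see which rules in a plan actually carry the second stage is to inspect the plan document. The sketch below assumes the JSON shape returned by `aws backup get-backup-plan` (a `BackupPlan` object whose `Rules` may each carry a `CopyActions` list). The function name `copy_destinations` and the sample plan are illustrative and do not come from the lesson.

```python
import json


def copy_destinations(plan_doc: dict) -> dict:
    """Map each rule name to the destination vault ARNs it copies to.

    An empty list means the rule only has the local backup stage.
    """
    rules = plan_doc.get("BackupPlan", {}).get("Rules", [])
    return {
        rule["RuleName"]: [
            action["DestinationBackupVaultArn"]
            for action in rule.get("CopyActions", [])
        ]
        for rule in rules
    }


# Illustrative plan document; names and ARNs are made up for the example.
sample_plan = json.loads("""
{
  "BackupPlan": {
    "BackupPlanName": "prod-daily",
    "Rules": [
      {
        "RuleName": "rds-daily",
        "TargetBackupVaultName": "prod-rds-vault",
        "CopyActions": [
          {"DestinationBackupVaultArn": "arn:aws:backup:eu-west-1:123456789012:backup-vault:prod-rds-dr"}
        ]
      },
      {"RuleName": "scratch-daily", "TargetBackupVaultName": "scratch-vault"}
    ]
  }
}
""")

for rule_name, destinations in copy_destinations(sample_plan).items():
    stage = "local + copy" if destinations else "local only"
    print(f"{rule_name}: {stage} {destinations}")
```

A rule listed as "local + copy" only tells you a copy job will be attempted. Whether it succeeds is a separate question answered by the copy-job history.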

The check flags backup plans where copy jobs are entering the FAILED state. That matters far more than it sounds, because the failure is silent in every place people normally look. The protected-resource view is green. The local recovery point exists. The only place the truth lives is the copy-job history and the StatusMessage on each failed job. Teams discover the gap during a region outage — the worst possible moment to learn that the off-region copy they planned their DR around was never actually written.

This is the failure-mode companion to configuring cross-region copy in the first place. Having a Copy Action defined is necessary but not sufficient: the account must be opted in to the destination region, the destination vault must already exist, the destination KMS key must be usable under its key policy, and the copy role must have the right trust and permissions. When any one of those is wrong, the local backup keeps succeeding and the copy keeps failing, indefinitely, until somebody reads the copy-job log.

In this lesson you'll learn the two-stage anatomy of an AWS Backup plan that lets a local backup succeed while the cross-region copy fails, and why that failure is invisible everywhere except the copy-job log. You'll see how to diagnose it from list-copy-jobs and the StatusMessage field, the handful of root causes that account for almost every failure — a missing or unusable destination KMS key, a missing or policy-denied destination vault, a destination region the account hasn't opted into, resource-type limits, and copy-role trust — and the order to fix them in. You'll also wire up an EventBridge alarm so the next failed copy is loud within minutes instead of discovered in a post-mortem.

Fun fact

The DR copy that failed for 94 days straight

A SaaS platform team ran cross-region copies from eu-west-2 to eu-west-1 for two years and never thought about them again — the plans were green every morning. During an unrelated audit they pulled the copy-job history and found every single copy job had been failing for 94 consecutive days with the same StatusMessage: "Access denied while assuming role". A routine least-privilege cleanup had pruned a KMS permission from the copy role. The local backups had succeeded perfectly the whole time, so nothing ever alerted. They had been paying for a DR posture that had quietly evaporated three months earlier — and would have discovered it for the first time during an actual region failure.

Diagnosing a failing copy job in action

Lena owns platform reliability for a healthtech company. The continuity dashboard flips a backup plan from green to flagged: local backups COMPLETED as always, but copy jobs in the last 24 hours are FAILED. The protected-resource view still shows everything backed up, which is exactly why nobody had noticed — the gap was only ever visible in the copy-job log.

She lists the failed copy jobs and reads the StatusMessage on each one. They all say the same thing: the copy can't use the destination vault's KMS key. The destination vault exists, the Copy Action is wired correctly, the local recovery points are fine — but the re-encryption step at the destination is being denied. That points at a key policy, not a backup-plan problem.

She inspects the destination KMS key policy and finds it: a recent security-hardening change tightened the key policy and dropped the statement that let AWS Backup use the key for cross-region copies. She adds the statement back, starts an on-demand copy job for the latest local recovery point rather than waiting for the 2am schedule, watches it reach COMPLETED, and then does the part most teams skip: she confirms the destination recovery point is actually present and restorable in eu-west-1. Total time: about fifteen minutes, most of it reading the StatusMessage carefully enough to fix the right thing.

First, list the failed copy jobs in the source region over the last day and read the StatusMessage — that field is where the real cause lives.

$ aws backup list-copy-jobs --region eu-west-2 --by-state FAILED --by-created-after $(date -u -d '24 hours ago' +%FT%TZ) --query 'CopyJobs[*].{Resource:ResourceArn,Dest:DestinationBackupVaultArn,State:State,Msg:StatusMessage}' --output table
-------------------------------------------------------------------------------
| ListCopyJobs |
+-------------+----------------------------+--------+------------------------+
| Resource | Dest | State | Msg |
+-------------+----------------------------+--------+------------------------+
| ...:db/prod | ...:vault:prod-rds-dr | FAILED | Access denied: KMS key |
| ...:vol/abc | ...:vault:prod-ebs-dr | FAILED | Access denied: KMS key |
+-------------+----------------------------+--------+------------------------+
# Local backups are COMPLETED; only the cross-region copies fail. StatusMessage points at KMS.

The copy-job log is the only place the failure is visible — and StatusMessage names the root cause.
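When there are more than a handful of failed jobs, grouping them by destination and message makes the common cause stand out. This is a minimal sketch that assumes the full JSON output of `aws backup list-copy-jobs --output json` (without the `--query` projection) has been loaded into a dict. `summarize_failures` and the sample data are illustrative.

```python
from collections import Counter


def summarize_failures(copy_jobs_doc: dict) -> Counter:
    """Count FAILED copy jobs per (destination vault ARN, StatusMessage) pair."""
    counts = Counter()
    for job in copy_jobs_doc.get("CopyJobs", []):
        if job.get("State") != "FAILED":
            continue
        key = (
            job.get("DestinationBackupVaultArn", "unknown"),
            job.get("StatusMessage", "(no message)"),
        )
        counts[key] += 1
    return counts


# Illustrative data shaped like the list-copy-jobs response.
RDS_DR = "arn:aws:backup:eu-west-1:123456789012:backup-vault:prod-rds-dr"
EBS_DR = "arn:aws:backup:eu-west-1:123456789012:backup-vault:prod-ebs-dr"
sample = {
    "CopyJobs": [
        {"State": "FAILED", "DestinationBackupVaultArn": RDS_DR, "StatusMessage": "Access denied: KMS key"},
        {"State": "FAILED", "DestinationBackupVaultArn": RDS_DR, "StatusMessage": "Access denied: KMS key"},
        {"State": "FAILED", "DestinationBackupVaultArn": EBS_DR, "StatusMessage": "Access denied: KMS key"},
        {"State": "COMPLETED", "DestinationBackupVaultArn": EBS_DR},
    ]
}

# Most frequent (destination, message) pairs first.
for (dest, message), count in summarize_failures(sample).most_common():
    print(f"{count:>3}  {dest}  {message}")
```

To use it on real data, save the CLI output to a file and pass `json.load(open("copy-jobs.json"))` to `summarize_failures`.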

The most common culprit: the destination vault's KMS key policy doesn't let AWS Backup use it. Pull the policy and confirm the Backup service statement is missing.

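$ aws kms get-key-policy --region eu-west-1 --key-id "$(aws kms describe-key --region eu-west-1 --key-id alias/backup-key --query KeyMetadata.KeyId --output text)" --policy-name default --query Policy --output text | python3 -m json.tool
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAdmins",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/KeyAdmins"
            },
            "Action": "kms:*",
            "Resource": "*"
        }
    ]
}
# get-key-policy takes a key ID or key ARN, not an alias, so the alias is resolved with describe-key first.
# KeyAdmins is an illustrative role name. (A statement for the account root would instead delegate access to IAM policies.)
# No statement granting the AWS Backup copy role kms:Decrypt/Encrypt/GenerateDataKey/CreateGrant.

A hardened key policy whose only statement names an admin role silently blocks every cross-region copy.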
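The same inspection can be done mechanically. The sketch below takes a parsed key policy and reports which of the four actions named in this lesson are not granted to a given role by any Allow statement. It is deliberately simplified and makes assumptions: it ignores Deny statements, Condition blocks, NotAction and NotPrincipal, and it does not evaluate IAM policies that an account-root statement would delegate to. `missing_actions` and the sample policy are illustrative.

```python
import fnmatch

REQUIRED_ACTIONS = ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey", "kms:CreateGrant"]


def _as_list(value):
    """Policy fields may be a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def missing_actions(policy: dict, role_arn: str, required=REQUIRED_ACTIONS) -> list:
    """Return the required actions that no Allow statement grants to role_arn."""
    granted = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        arns = _as_list(principal.get("AWS")) if isinstance(principal, dict) else _as_list(principal)
        if role_arn not in arns:
            continue
        granted.extend(_as_list(stmt.get("Action")))
    # Action names are case-insensitive and may contain wildcards such as kms:*.
    return [
        action
        for action in required
        if not any(fnmatch.fnmatchcase(action.lower(), pattern.lower()) for pattern in granted)
    ]


COPY_ROLE = "arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole"

# Illustrative hardened policy: only a (hypothetical) admin role is named.
hardened = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAdmins",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/KeyAdmins"},
            "Action": "kms:*",
            "Resource": "*",
        }
    ],
}

print(missing_actions(hardened, COPY_ROLE))
```

An empty list from this check is a hint rather than proof that the copy will succeed, since the real policy evaluation also considers the parts this sketch skips.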

Deep dive

Why the local backup succeeds and the copy fails

A plan rule produces a local backup job; on its COMPLETED transition, AWS Backup fires one copy job per entry in the rule's CopyActions array. The two jobs are independent — the local job only needs the source vault and source KMS key, both of which are typically healthy, while the copy job has a longer chain of dependencies in a different region. That asymmetry is the whole reason for the silent gap: the thing people watch (the local backup) almost never fails, and the thing they don't watch (the copy) is where every failure lands.

The dominant cause, by a wide margin, is KMS. KMS keys are region-scoped, so the copy re-encrypts the recovery point with the destination vault's key. If that key's policy doesn't grant the AWS Backup copy role kms:Decrypt, kms:Encrypt, kms:GenerateDataKey, and kms:CreateGrant, the copy fails with an access-denied StatusMessage even though the vault and plan are perfect. A source key that the copy role cannot use to decrypt the recovery point produces the same symptom. The remaining causes are structural: the destination vault doesn't exist or its access policy denies backup:CopyIntoBackupVault; the destination region isn't enabled for the account (common with newer opt-in regions); the resource type isn't supported for cross-region copy; or the copy IAM role's trust or permissions are wrong. Each leaves the local backup untouched.

This is distinct from an in-region backup-job failure, which is a separate concern with separate causes — vault lock conflicts, source-resource state, in-region IAM or KMS on the source side. A backup-job failure means you have no recovery point at all and is usually loud. A copy-job failure means you have a perfectly good local recovery point and no off-region copy, and is usually silent. Diagnose them as two different problems: read list-backup-jobs for the former, list-copy-jobs --by-state FAILED plus describe-copy-job for the latter, and let the StatusMessage steer you to the specific dependency that's broken.

# Drill into a single failed copy job for the full StatusMessage and the source/dest vaults.
aws backup describe-copy-job \
  --copy-job-id 8e3c5a7d-2f4b-4c8a-9d1e-6b3a7f9c2e51 \
  --region eu-west-2 \
  --query 'CopyJob.{State:State,Msg:StatusMessage,Source:SourceBackupVaultArn,Dest:DestinationBackupVaultArn}'

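# Confirm the destination region is enabled for this account (a common silent cause with opt-in regions).
# ENABLED or ENABLED_BY_DEFAULT means usable; eu-west-1 is a default region, so expect ENABLED_BY_DEFAULT.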
aws account get-region-opt-status \
  --region-name eu-west-1 \
  --query 'RegionOptStatus'

# Confirm the local backup job for the same recovery point succeeded — proving it's a copy-only failure.
aws backup list-backup-jobs \
  --region eu-west-2 \
  --by-state COMPLETED \
  --by-created-after $(date -u -d '24 hours ago' +%FT%TZ) \
  --query 'BackupJobs[*].{Resource:ResourceArn,State:State}'
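The "copy-only failure" check in the commands above can also be done by joining the two job lists. The sketch below assumes the unprojected JSON from `list-backup-jobs` and `list-copy-jobs`, and relies on the `RecoveryPointArn` field of a backup job matching the `SourceRecoveryPointArn` field of its copy job. `uncopied_recovery_points` and the sample ARNs are illustrative.

```python
def uncopied_recovery_points(backup_jobs_doc: dict, copy_jobs_doc: dict) -> list:
    """List (resource ARN, recovery point ARN) pairs that were backed up
    locally but have no COMPLETED copy job."""
    copied = {
        job.get("SourceRecoveryPointArn")
        for job in copy_jobs_doc.get("CopyJobs", [])
        if job.get("State") == "COMPLETED"
    }
    return [
        (job.get("ResourceArn"), job.get("RecoveryPointArn"))
        for job in backup_jobs_doc.get("BackupJobs", [])
        if job.get("State") == "COMPLETED" and job.get("RecoveryPointArn") not in copied
    ]


# Illustrative data: two local recovery points, only one of which was copied.
RP_DB = "arn:aws:backup:eu-west-2:123456789012:recovery-point:db-0001"
RP_VOL = "arn:aws:backup:eu-west-2:123456789012:recovery-point:vol-0001"
backup_jobs = {
    "BackupJobs": [
        {"State": "COMPLETED", "ResourceArn": "arn:aws:rds:eu-west-2:123456789012:db:prod", "RecoveryPointArn": RP_DB},
        {"State": "COMPLETED", "ResourceArn": "arn:aws:ec2:eu-west-2:123456789012:volume/vol-abc", "RecoveryPointArn": RP_VOL},
    ]
}
copy_jobs = {
    "CopyJobs": [
        {"State": "FAILED", "SourceRecoveryPointArn": RP_DB},
        {"State": "COMPLETED", "SourceRecoveryPointArn": RP_VOL},
    ]
}

for resource, recovery_point in uncopied_recovery_points(backup_jobs, copy_jobs):
    print(f"no DR copy: {resource} ({recovery_point})")
```

Anything this prints has a good local recovery point and nothing off-region, which is the silent state this lesson is about.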

What is the impact of a failing cross-region copy?

The headline impact is having no off-region recovery point when a region event hits, despite believing you do. AZ-level failures are survivable from the local vault; a region-wide outage is exactly the scenario the cross-region copy exists for. If the copy has been failing, your RPO during that event collapses to whatever exists outside the affected region — which, by assumption, is nothing — and your RTO stretches to days while you reconstruct from offline or customer-supplied data. The local recovery points are stranded in the region that's down.

The second-order impact is the false-confidence tax. Because the local backup is green and the protected-resource view looks complete, teams build DR runbooks, sign DR attestations, and reassure customers on the basis of copies that aren't being written. The gap doesn't degrade gracefully — it's binary and invisible right up until the moment of failure, when the cost is at its absolute peak. A control that is assumed-working and actually-broken is worse than a control everyone knows is missing, because nobody is compensating for it.

The third impact is contractual and audit exposure. Controls such as SOC 2 CC7.5, ISO 27001 A.17, and PCI DSS 12.10.1, along with the DR clauses in most enterprise MSAs, are commonly evidenced with recovery points that demonstrably exist in a second region. A failing copy job means the control is non-operating. An auditor who asks for evidence of a recent destination recovery point (not the plan config, the actual object) will find the gap, and "the plan was configured to copy" is not a defence when no copies landed.

The financial impact, by contrast, is small and almost irrelevant. A failed copy writes nothing to the destination vault, so it adds no new DR storage or inter-region transfer charges, and a failing copy may even cost slightly less than a working one. The cost of fixing it is a few minutes of engineering time. The cost of not fixing it is denominated in the hours or days of downtime during a regional event, which for most production workloads is orders of magnitude larger than the entire DR storage line.

How do you fix a failing cross-region copy safely?

Fixing a failed copy is a four-step loop: read the StatusMessage to find the real cause, fix the most common culprit first (KMS and vault policy), force an immediate copy and verify it lands and restores, and then make the next failure loud with an alarm so you never discover this in a post-mortem again.

1. Diagnose from copy-job history and StatusMessage

Run aws backup list-copy-jobs --by-state FAILED in the source region and describe-copy-job on a sample, and read the StatusMessage carefully — it almost always names the broken dependency directly (access denied on a KMS key, vault not found, region not opted-in, unsupported resource type, role assumption failure). Confirm in parallel that the local backup job for the same recovery point reached COMPLETED, so you know you're fixing a copy-only failure and not a deeper backup problem.
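Triage can be roughed out in code by mapping a StatusMessage to the dependency it most likely points at. The keyword lists below are guesses made for illustration, not documented AWS Backup message formats, so treat the result as a starting point and still read the full message. `suspect` and `SUSPECTS` are hypothetical names.

```python
# (keywords, suspected dependency), checked in order. Keywords are assumptions.
SUSPECTS = [
    (("kms",), "destination KMS key policy"),
    (("vault",), "destination vault or its access policy"),
    (("opt-in", "opted", "not enabled"), "destination region opt-in"),
    (("unsupported", "not supported"), "resource type not supported for cross-region copy"),
    (("role", "assum"), "copy role trust or permissions"),
]


def suspect(status_message: str) -> str:
    """Return the first suspected dependency whose keywords appear in the message."""
    text = status_message.lower()
    for keywords, dependency in SUSPECTS:
        if any(keyword in text for keyword in keywords):
            return dependency
    return "unclassified: read the full message"


for message in ("Access denied: KMS key", "Access denied while assuming role"):
    print(f"{message!r} -> {suspect(message)}")
```

The order of `SUSPECTS` mirrors the fix order used in this lesson: KMS first, then the vault, then the structural causes.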

2. Fix the most common culprit: KMS and vault policy

The destination vault's KMS key policy must grant the AWS Backup copy role kms:Decrypt, kms:Encrypt, kms:GenerateDataKey, and kms:CreateGrant; the destination Backup Vault must exist and its access policy must allow backup:CopyIntoBackupVault. Fix the key policy and vault policy first — they account for the large majority of failures. Then check the structural causes in order: is the destination region opted-in for the account, is the resource type supported for cross-region copy, and does the copy IAM role have the right trust and permissions.

3. Force an immediate copy, then verify it lands and restores

Don't wait for the 2am schedule to find out if the fix worked. start-backup-job has no copy option, so start an on-demand copy of the latest local recovery point with start-copy-job, watch the copy job reach COMPLETED in list-copy-jobs, then do the step teams skip: restore the destination recovery point in the DR region with StartRestoreJob, confirm the restored resource comes up, and tear it down. A copy that completes but has never been restored from is still unproven.

4. Make the next failure loud with an alarm

The root cause of the danger isn't the failure — it's that the failure was silent. Wire an EventBridge rule on AWS Backup copy-job state changes filtering for FAILED, targeting SNS or PagerDuty, so a broken copy pages the on-call within minutes instead of surfacing in an audit months later. Pair it with a recurring report of "days since last successful DR copy" per critical workload, so a copy that quietly stops gets noticed by the metric even if the alarm is misconfigured.
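The "days since last successful DR copy" figure can be derived from the copy-job history. The sketch below assumes `CompletionDate` arrives as an ISO 8601 string with an explicit UTC offset, as AWS CLI v2 prints it by default; other timestamp formats would need different parsing. `days_since_last_copy` and the sample jobs are illustrative.

```python
from datetime import datetime, timezone


def days_since_last_copy(copy_jobs: list, resources: list, now: datetime) -> dict:
    """Map each resource ARN to whole days since its newest COMPLETED copy.

    None means no completed copy was found for that resource at all.
    """
    latest = {}
    for job in copy_jobs:
        if job.get("State") != "COMPLETED" or not job.get("CompletionDate"):
            continue
        completed = datetime.fromisoformat(job["CompletionDate"])
        arn = job.get("ResourceArn")
        if arn not in latest or completed > latest[arn]:
            latest[arn] = completed
    return {
        arn: (now - latest[arn]).days if arn in latest else None
        for arn in resources
    }


# Illustrative history: the database last copied on 1 May, the volume never.
DB = "arn:aws:rds:eu-west-2:123456789012:db:prod"
VOL = "arn:aws:ec2:eu-west-2:123456789012:volume/vol-abc"
jobs = [
    {"State": "COMPLETED", "ResourceArn": DB, "CompletionDate": "2024-04-20T02:00:00+00:00"},
    {"State": "COMPLETED", "ResourceArn": DB, "CompletionDate": "2024-05-01T02:00:00+00:00"},
    {"State": "FAILED", "ResourceArn": DB, "CompletionDate": "2024-05-09T02:00:00+00:00"},
    {"State": "FAILED", "ResourceArn": VOL},
]
report = days_since_last_copy(jobs, [DB, VOL], datetime(2024, 5, 10, tzinfo=timezone.utc))
for arn, days in report.items():
    print(arn, "never copied" if days is None else f"{days} days")
```

Passing the list of critical resources explicitly matters: a workload whose copy has never succeeded has no COMPLETED job to count from, and would otherwise vanish from the report.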

# 1) Add the missing AWS Backup statement to the destination KMS key policy, then re-run the copy.
#    (Append this statement to the existing key policy document and put-key-policy.)
cat > kms-backup-statement.json <<'JSON'
{
  "Sid": "AllowBackupCopy",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole" },
  "Action": [
    "kms:Decrypt",
    "kms:Encrypt",
    "kms:GenerateDataKey",
    "kms:CreateGrant"
  ],
  "Resource": "*"
}
JSON

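# 2) Force an immediate copy of the newest local recovery point instead of waiting for the schedule.
#    (start-backup-job has no copy option; start-copy-job copies an existing recovery point.)
RP_ARN=$(aws backup list-recovery-points-by-backup-vault \
  --region eu-west-2 \
  --backup-vault-name prod-rds-vault \
  --by-resource-arn arn:aws:rds:eu-west-2:123456789012:db:prod \
  --query 'sort_by(RecoveryPoints,&CreationDate)[-1].RecoveryPointArn' --output text)
aws backup start-copy-job \
  --region eu-west-2 \
  --recovery-point-arn "$RP_ARN" \
  --source-backup-vault-name prod-rds-vault \
  --destination-backup-vault-arn arn:aws:backup:eu-west-1:123456789012:backup-vault:prod-rds-dr \
  --iam-role-arn arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole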

# 3) Verify the copy now reaches COMPLETED.
aws backup list-copy-jobs \
  --region eu-west-2 \
  --by-state COMPLETED \
  --by-created-after $(date -u -d '1 hour ago' +%FT%TZ) \
  --query 'CopyJobs[*].{State:State,Dest:DestinationBackupVaultArn,Bytes:BackupSizeInBytes}'

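# 4) Make the next failure loud: EventBridge rule on FAILED copy jobs -> SNS.
#    (backup-alerts is an illustrative topic name; its access policy must let events.amazonaws.com publish.)
aws events put-rule \
  --region eu-west-2 \
  --name backup-copy-failed \
  --event-pattern '{"source":["aws.backup"],"detail-type":["Copy Job State Change"],"detail":{"state":["FAILED"]}}'
aws events put-targets \
  --region eu-west-2 \
  --rule backup-copy-failed \
  --targets Id=sns,Arn=arn:aws:sns:eu-west-2:123456789012:backup-alerts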
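Step 1 in the listing above leaves the merge of the new statement into the existing key policy as a manual edit. A small helper can do it repeatably. This is a sketch under the assumption that the current policy has been saved as JSON (for example from `aws kms get-key-policy`); `upsert_statement` is a hypothetical name and the sample policy is illustrative.

```python
import copy
import json


def upsert_statement(policy: dict, statement: dict) -> dict:
    """Return a copy of policy with statement added.

    Any existing statement with the same Sid is replaced, so running the
    merge twice does not create duplicates. The input is not modified.
    """
    sid = statement["Sid"]
    merged = copy.deepcopy(policy)
    existing = merged.get("Statement", [])
    if isinstance(existing, dict):  # a single statement may not be wrapped in a list
        existing = [existing]
    merged["Statement"] = [s for s in existing if s.get("Sid") != sid] + [statement]
    return merged


# Illustrative current policy and the statement from step 1.
current = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAdmins",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/KeyAdmins"},
            "Action": "kms:*",
            "Resource": "*",
        }
    ],
}
backup_statement = {
    "Sid": "AllowBackupCopy",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole"},
    "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey", "kms:CreateGrant"],
    "Resource": "*",
}

merged = upsert_statement(current, backup_statement)
print(json.dumps(merged, indent=2))
```

The printed document can be written to a file and applied with `aws kms put-key-policy --key-id <key-id> --policy-name default --policy file://merged-policy.json` in the destination region. Review the diff before applying, since put-key-policy replaces the whole policy.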

You've completed Fix cross-region backup copy failures. You now know why a local backup can succeed while the cross-region copy silently fails, how to diagnose it from the copy-job history and StatusMessage, the handful of root causes — KMS key policy first, then vault policy, region opt-in, resource-type limits, and role trust — and how to force, verify, and alarm on copies so a missing DR copy is loud instead of discovered in a post-mortem. Next time a plan looks green, you'll know to check whether the copy actually landed.