Establish AWS Backup Plans

Without a Backup Plan there is no policy — recovery becomes whatever someone hopes is there. Wire up a plan that covers resources by tag.

Backup Plans: the basics

What is an AWS Backup Plan and why does its absence get flagged?

An AWS Backup Plan is a single piece of configuration that bundles three things: a schedule (when to take a recovery point), a retention rule (how long to keep it before AWS deletes it), and a resource selection (which resources the plan applies to). Once a plan exists and a selection matches a resource, AWS Backup takes recovery points automatically — no Lambda, no cron, no team owning a homegrown snapshot script.

Without a plan, snapshots are ad hoc. Engineers create RDS snapshots manually before a release, EBS snapshots get taken once and never rotated, DynamoDB point-in-time recovery is enabled on some tables and not others. The fleet ends up in a state where nobody can answer the question "is this resource backed up?" without opening each console and squinting at it — and the answer to "can we restore yesterday's 14:00 state of this database?" is usually "let's see."

Continuity check BKP-003 ("No Backup Plans") fires on any account where there is no AWS::Backup::BackupPlan resource in a given region. The check is intentionally blunt: if no plan exists, no policy exists, and recovery is whatever someone happens to remember to do manually. Severity is CRITICAL because a region with zero backup plans is one accidental DeleteDBInstance away from a permanent data loss event.

In this lesson you'll learn how AWS Backup Plans are structured, how to use tag-based selection so coverage scales without per-resource toggles, how to layer multiple retention rules (daily/weekly/monthly) in a single plan, and how to verify coverage with AWS Config and Backup audit reports. You'll also see the KMS and vault setup that prevents a compromised account from also deleting its own recovery points.

Fun fact

The snapshot graveyard

An AWS field survey of mid-sized accounts found a median of 1,200 untagged manual EBS snapshots per account, sitting at $0.05/GB-month, with no record of which volume they came from or whether they were still needed. The average account was spending around $9k/year on snapshots nobody could identify. The fix wasn't "take fewer snapshots" — it was "replace manual snapshots with a Backup Plan that has explicit retention." The graveyard cleared itself within 35 days once the plan owned the lifecycle.

Establishing a Backup Plan in action

Marco runs platform reliability at a healthcare SaaS. A continuity scan flags BKP-003 (CRITICAL) across all four of their active regions — there are zero AWS::Backup::BackupPlan resources anywhere in the account. He has roughly 300 EBS volumes, 40 RDS instances, and a dozen DynamoDB tables that he believes are "backed up," but he can't prove it.

His goal isn't to back up everything — that's expensive and most of the fleet is ephemeral. He wants a single plan, applied by tag (BackupRequired=true), with three tiered rules: daily-7days, weekly-1month, monthly-1year. Anything tagged is in. Anything not tagged is intentionally out.

He starts by confirming the gap — no plans, no selections — and then drops the plan in.

First, confirm the finding. List every Backup Plan in the region. An empty list is the failure mode.

$ aws backup list-backup-plans --region eu-west-2 --query 'BackupPlansList[*].{Name:BackupPlanName,Id:BackupPlanId,Last:LastExecutionDate}' --output table
┌───────┬─────┬──────┐
│ Name │ Id │ Last │
├───────┼─────┼──────┤
└───────┴─────┴──────┘
# Zero plans in eu-west-2. Same result in us-east-1, eu-west-1, ap-southeast-2.
# Every protected-data resource in this region is relying on ad hoc snapshots.

Empty BackupPlansList — BKP-003 confirmed across every region with workloads.
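Checking four regions by hand is easy to get wrong, so the same query can be wrapped in a loop. The sketch below is one way to do it. The function name count_backup_plans is made up for illustration, and it assumes the AWS CLI is installed with credentials that allow backup:ListBackupPlans.

```shell
# Hypothetical helper: print "<region> <plan-count>" for every region given.
# A count of 0 is the BKP-003 condition for that region.
count_backup_plans() {
  for region in "$@"; do
    # length() is a JMESPath function; it returns 0 for an empty list.
    count=$(aws backup list-backup-plans --region "$region" \
      --query 'length(BackupPlansList)' --output text)
    printf '%s %s\n' "$region" "$count"
  done
}
```

Calling it as count_backup_plans eu-west-2 us-east-1 eu-west-1 ap-southeast-2 prints one line per region, which is easier to paste into a ticket than four empty tables.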

Now create the plan. Three rules in one document — daily-7days, weekly-1month, monthly-1year — all writing to a dedicated vault with its own KMS key.

$ aws backup create-backup-plan --region eu-west-2 --backup-plan file://prod-tiered-plan.json
{
"BackupPlanId": "a1b2c3d4-5e6f-7890-abcd-ef0123456789",
"BackupPlanArn": "arn:aws:backup:eu-west-2:123456789012:backup-plan:a1b2c3d4-5e6f-7890-abcd-ef0123456789",
"CreationDate": "2026-05-15T09:42:11.103000+00:00",
"VersionId": "NDQyZmJmMWEt..."
}
# Plan created. Next: attach a selection so it actually picks up resources.
# Selection rule: any resource tagged BackupRequired=true, across EBS, RDS, DynamoDB, EFS, FSx.

Tiered plan landed. The plan is inert until a selection is attached — that's the next call (create-backup-selection).

AWS Backup under the hood

A Backup Plan is a JSON document. Each rule inside it specifies a ScheduleExpression (cron), a TargetBackupVaultName, a Lifecycle (move-to-cold-after / delete-after), and a CompletionWindowMinutes after which an in-flight backup is abandoned. The plan itself does nothing until you create a Backup Selection that maps it to resources — either by ARN, by tag (StringEquals/StringLike on a tag key/value), or by resource type. Tag-based selection is the right default: tag a resource, it's covered; remove the tag, it's not.

Recovery points land in a Backup Vault. The vault is the storage container and also the access-control boundary: it has its own KMS key and its own resource-based policy. Best practice is one vault per environment with a dedicated customer managed key (CMK), and a vault access policy that denies backup:DeleteRecoveryPoint and backup:DeleteBackupVault to everyone except a small break-glass role. With that in place, a compromised admin in the source account cannot wipe the recovery points by accident or by design, as long as that admin also cannot rewrite the vault access policy itself (backup:PutBackupVaultAccessPolicy and backup:DeleteBackupVaultAccessPolicy).
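The vault setup described above can be sketched as follows. This is a minimal example under stated assumptions, not a hardened baseline: the role name backup-break-glass, the file name vault-policy.json, and both function names are hypothetical, and the aws:PrincipalArn condition is one of several ways to carve out an exception. The functions are only defined here; nothing touches AWS until create_locked_down_vault is called.

```shell
# Hypothetical names: adjust the vault name, account ID, and role to your setup.
VAULT_NAME="prod-backup-vault"
BREAK_GLASS_ROLE_ARN="arn:aws:iam::123456789012:role/backup-break-glass"

# Write a vault access policy that denies the two delete actions to every
# principal except the given role. Args: role ARN, output file path.
write_vault_policy() {
  cat > "$2" <<JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyDeleteExceptBreakGlass",
      "Effect": "Deny",
      "Principal": "*",
      "Action": [
        "backup:DeleteRecoveryPoint",
        "backup:DeleteBackupVault"
      ],
      "Resource": "*",
      "Condition": {
        "ArnNotEquals": { "aws:PrincipalArn": "$1" }
      }
    }
  ]
}
JSON
}

# Create a dedicated key, create the vault encrypted with it, attach the policy.
create_locked_down_vault() {
  key_arn=$(aws kms create-key --description "CMK for $VAULT_NAME" \
    --query 'KeyMetadata.Arn' --output text)
  aws backup create-backup-vault \
    --backup-vault-name "$VAULT_NAME" \
    --encryption-key-arn "$key_arn"
  write_vault_policy "$BREAK_GLASS_ROLE_ARN" vault-policy.json
  aws backup put-backup-vault-access-policy \
    --backup-vault-name "$VAULT_NAME" \
    --policy file://vault-policy.json
}
```

Keeping the policy writer separate from the AWS calls means the JSON can be reviewed or diffed before anything is applied.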

Pricing is per-recovery-point storage, not per-plan or per-rule. EBS snapshots in AWS Backup are billed at standard EBS snapshot rates (~$0.05/GB-month for the source-region copy); RDS, DynamoDB, EFS, and FSx each have their own per-GB-month rate, generally higher than EBS. Cross-region copies double the storage cost. A plan with 35-day retention against a 10 TB fleet typically costs $500-$800/month — budget for it before turning on monthly-1year retention across the whole org.
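A rough floor for the storage bill is size times rate times number of copies. The helper below is a back-of-the-envelope sketch, assuming the ~$0.05/GB-month EBS snapshot rate quoted above and 1 TB = 1024 GB. It ignores the incremental change data that accumulates over the retention window, which is why real bills land above the floor. The function name is hypothetical.

```shell
# Hypothetical helper: estimate monthly storage cost for full copies of a fleet.
# Args: size in GB, price per GB-month, number of copies
# (1 = source region only, 2 = source plus one cross-region copy).
estimate_monthly_cost() {
  awk -v gb="$1" -v rate="$2" -v copies="$3" \
    'BEGIN { printf "%.2f\n", gb * rate * copies }'
}

estimate_monthly_cost 10240 0.05 1   # 10 TB, source region only: prints 512.00
estimate_monthly_cost 10240 0.05 2   # plus one cross-region copy: prints 1024.00
```

The 512.00 figure is the baseline for a single full copy of 10 TB; daily increments over a 35-day window push it toward the upper end of the quoted range.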

# The plan document referenced above — three tiered rules into a dedicated vault.
cat > prod-tiered-plan.json <<'JSON'
{
  "BackupPlanName": "prod-tiered",
  "Rules": [
    {
      "RuleName": "daily-7days",
      "TargetBackupVaultName": "prod-backup-vault",
      "ScheduleExpression": "cron(0 5 ? * * *)",
      "StartWindowMinutes": 60,
      "CompletionWindowMinutes": 180,
      "Lifecycle": { "DeleteAfterDays": 7 }
    },
    {
      "RuleName": "weekly-1month",
      "TargetBackupVaultName": "prod-backup-vault",
      "ScheduleExpression": "cron(0 5 ? * SUN *)",
      "Lifecycle": { "DeleteAfterDays": 30 }
    },
    {
      "RuleName": "monthly-1year",
      "TargetBackupVaultName": "prod-backup-vault",
      "ScheduleExpression": "cron(0 5 1 * ? *)",
      "Lifecycle": { "MoveToColdStorageAfterDays": 30, "DeleteAfterDays": 365 }
    }
  ]
}
JSON

aws backup create-backup-plan --backup-plan file://prod-tiered-plan.json
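One constraint worth checking before submitting a plan: when a rule uses cold storage, AWS Backup requires DeleteAfterDays to be at least 90 days greater than MoveToColdStorageAfterDays. The monthly-1year rule above (30 and 365) satisfies this. A small pre-flight check, with a hypothetical function name, might look like:

```shell
# Hypothetical check: succeeds (exit 0) when a cold-storage lifecycle is valid.
# Args: MoveToColdStorageAfterDays, DeleteAfterDays.
lifecycle_is_valid() {
  [ "$2" -ge $(( $1 + 90 )) ]
}

if lifecycle_is_valid 30 365; then
  echo "monthly-1year: ok"
fi
if ! lifecycle_is_valid 30 100; then
  echo "30/100: rejected, DeleteAfterDays must be at least 120"
fi
```

Running the same check in CI against every rule in the plan document catches an invalid lifecycle before create-backup-plan rejects it.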

What is the impact of running without a Backup Plan?

The headline impact is permanent data loss. A DROP TABLE in production, a terraform destroy against the wrong workspace, a ransomware event, or an IAM compromise that walks the account looking for things to delete — all of these are recoverable if a Backup Plan with vault-level deletion protection exists, and unrecoverable if it doesn't. Most data-loss incidents that make the news aren't infrastructure failures; they're operator errors or compromised credentials against accounts that had no enforced backup policy.

The second-order impact is the recovery objective gap. Even when teams do take ad hoc snapshots, the snapshots almost never align with the business's stated RPO (recovery point objective) or RTO (recovery time objective). Marketing tells customers "we recover within 4 hours and lose at most 1 hour of data"; ops have hourly snapshots on two databases and weekly snapshots on the rest. A Backup Plan makes the RPO a single piece of configuration that you can show an auditor: a rule whose ScheduleExpression is an hourly cron, such as cron(0 * ? * * *).
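To make the RPO-as-configuration point concrete, the fragment below sketches a rule that takes one recovery point per hour. The rule name hourly-2days and the 2-day retention are illustrative choices, not values from this lesson, and the schedule is written as a cron expression, the form AWS Backup documents for ScheduleExpression.

```shell
# Hypothetical rule: one recovery point at the top of every hour, kept 2 days.
# Prints JSON that can be added to the "Rules" array of a plan document.
hourly_rule() {
  cat <<'JSON'
{
  "RuleName": "hourly-2days",
  "TargetBackupVaultName": "prod-backup-vault",
  "ScheduleExpression": "cron(0 * ? * * *)",
  "Lifecycle": { "DeleteAfterDays": 2 }
}
JSON
}

hourly_rule
```

An hourly schedule bounds the RPO at roughly one hour plus however long the job takes to start, so the start window matters as much as the cron expression.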

On the compliance side, SOC 2 CC7.5, ISO 27001 A.12.3, HIPAA §164.308(a)(7)(ii)(A), and PCI DSS Requirement 9.5.1 (v3.2.1 numbering) all expect a documented, tested, automated backup policy. "We have some snapshots" doesn't pass any of these. A Backup Plan with an audit report from AWS Backup Audit Manager is the cheapest possible answer to a control owner asking for evidence: it's a single artifact that proves coverage, retention, and recoverability.

There is also a cost impact in the wrong direction: turning on backups without budgeting for them. A 10 TB fleet with 35-day daily retention is somewhere around $500-$800/month in standard storage; add cross-region copies and monthly-1year cold storage and the number can climb past $2k. The cost is fine as long as it's a deliberate choice — but teams that flip the switch organisation-wide without running the math get a billing surprise the following month.

How do you establish backup coverage that actually holds?

Standing up a Backup Plan is a four-step loop. Skip any step and you end up with either uncovered resources, surprise bills, or recovery points that can be deleted by the same compromise that triggered the disaster.

1. Tag the fleet, then select by tag

Pick one tag — BackupRequired=true is the convention. Apply it to every resource that should be in the policy: RDS instances, EBS volumes attached to stateful workloads, DynamoDB tables, EFS file systems, FSx file systems. Then create the Backup Selection on that single tag. New resources tagged the same way are picked up automatically — no plan edits required. Resources that genuinely don't need backups (ephemeral autoscaling nodes, scratch volumes) stay untagged and stay out.

2. Run multiple rules in one plan for tiered retention

A single rule (daily, 35-day) is the AWS-managed Daily-35day-Retention plan — fine as a starter, not enough for compliance. Add a weekly rule retained 1 month and a monthly rule retained 1 year (with cold-storage transition after 30 days). The total storage cost is modest compared to single-tier daily, and you get long-horizon recovery without keeping every daily for a year.

3. Use a dedicated vault with its own KMS key and a deletion-deny policy

Never write to the default vault. Create a prod-backup-vault with a dedicated CMK, then attach a vault access policy that denies backup:DeleteRecoveryPoint and backup:DeleteBackupVault to every principal except a single break-glass role that requires MFA. This is the difference between "backups exist" and "backups survive a compromise." For the highest-risk workloads, enable AWS Backup Vault Lock in compliance mode — recovery points become immutable for the retention period even to AWS support.

4. Verify coverage continuously and roll plans org-wide

Enable the AWS Config managed rules that check backup coverage per resource type (for example rds-resources-protected-by-backup-plan, ebs-resources-protected-by-backup-plan, and dynamodb-resources-protected-by-backup-plan) so a finding fires whenever a resource that should be covered isn't. Pair them with AWS Backup Audit Manager's built-in controls (Backup Resources Protected by Backup Plan, Backup Plan Min Frequency and Min Retention) for a nightly compliance report. For multi-account orgs, use AWS Backup central management with AWS Organizations to push the same plan from the management account to every member account; the alternative is hand-rolling plans per account and inevitably missing one.
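For the continuous check in step 4, AWS Config ships per-service managed rules named like rds-resources-protected-by-backup-plan, each with a matching upper-case source identifier. The sketch below builds the put-config-rule payload from a service prefix. Both function names are hypothetical, and it assumes AWS Config is already recording in the region.

```shell
# Hypothetical helper: print the put-config-rule payload for one managed rule.
# Arg: service prefix, e.g. rds, ebs, dynamodb, efs.
config_rule_json() {
  name="$1-resources-protected-by-backup-plan"
  id=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')_RESOURCES_PROTECTED_BY_BACKUP_PLAN
  printf '{"ConfigRuleName":"%s","Source":{"Owner":"AWS","SourceIdentifier":"%s"}}\n' \
    "$name" "$id"
}

# Deploy one managed rule per service prefix. Defined only; not run here.
deploy_backup_config_rules() {
  for svc in "$@"; do
    aws configservice put-config-rule --config-rule "$(config_rule_json "$svc")"
  done
}
```

A call such as deploy_backup_config_rules rds ebs dynamodb efs would cover the resource types used in this lesson.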

# Step 1 of the loop: attach a tag-based selection so the plan actually picks up resources.
cat > prod-selection.json <<'JSON'
{
  "SelectionName": "backup-required-tag",
  "IamRoleArn": "arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole",
  "ListOfTags": [
    {
      "ConditionType": "STRINGEQUALS",
      "ConditionKey": "BackupRequired",
      "ConditionValue": "true"
    }
  ]
}
JSON

aws backup create-backup-selection \
  --backup-plan-id a1b2c3d4-5e6f-7890-abcd-ef0123456789 \
  --backup-selection file://prod-selection.json

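# Verify coverage. list-protected-resources only returns resources that already have
# a recovery point, so expect an empty table until the first backup job has completed.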
aws backup list-protected-resources \
  --query 'Results[*].{Arn:ResourceArn,Type:ResourceType,Last:LastBackupTime}' \
  --output table
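A protected-resources listing shows what has been backed up, not what should have been. One way to close that gap is to diff the tagged fleet against the protected list. The function below is a sketch with a hypothetical name; it assumes the BackupRequired=true tag from this lesson and credentials that allow tag:GetResources and backup:ListProtectedResources.

```shell
# Hypothetical helper: print ARNs tagged BackupRequired=true that do not yet
# have a recovery point in the given region. Arg: region.
uncovered_resources() {
  tagged=$(mktemp)
  protected=$(mktemp)
  # Text output is tab-separated; normalise to one sorted ARN per line.
  aws resourcegroupstaggingapi get-resources --region "$1" \
    --tag-filters Key=BackupRequired,Values=true \
    --query 'ResourceTagMappingList[].ResourceARN' --output text \
    | tr '\t' '\n' | sed '/^$/d' | sort -u > "$tagged"
  aws backup list-protected-resources --region "$1" \
    --query 'Results[].ResourceArn' --output text \
    | tr '\t' '\n' | sed '/^$/d' | sort -u > "$protected"
  # comm -23 keeps lines that appear only in the first file.
  comm -23 "$tagged" "$protected"
  rm -f "$tagged" "$protected"
}
```

An empty result after the first backup window means every tagged resource has at least one recovery point. Anything printed is tagged but not yet protected.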

You've completed Establish AWS Backup Plans. You can now build a tag-driven Backup Plan with tiered retention, land recovery points in a vault that survives a compromised account, verify coverage with Config and Audit Manager, and roll the same policy to every account in the org. The next time a continuity scan flags BKP-003, you'll have a four-step loop ready to run — and recovery stops being whatever someone hopes is there.
