Configure backups and retention

One capability across databases, tables, streams, file systems and snapshots: make sure every data store can be recovered to a recent point, and that no backup is shared with the public internet.

14 min·10 sections·AWS

Backups and retention: the basics

What does "recoverable" actually mean across AWS data stores?

Recoverability is not one setting. Every AWS data store expresses it differently: RDS has a backup retention period in days that enables daily snapshots plus continuous transaction-log archiving for point-in-time recovery, DynamoDB has point-in-time recovery (PITR) that captures a rolling 35-day change log, DocumentDB and Neptune have their own retention windows, EFS has automatic backups, ElastiCache for Redis has automatic snapshots, Redshift has automated snapshots, and Kinesis has a stream retention period that defines how long records can be replayed. Each is the same idea wearing a different name: if something goes wrong, can you get the data back?

AWS Security Hub turns each of these into its own control, which is why a single estate can fail a dozen backup checks at once. RDS.11, RDS.26 and RDS.50 cover database and cluster backups, DynamoDB.2 and DynamoDB.4 cover PITR and backup plans, DocumentDB.2 covers cluster retention, Neptune.5 covers automated backups, EFS.2 and EFS.7 cover file-system backups, ElastiCache.1 covers Redis snapshots, Redshift.3 covers automated snapshots, and Kinesis.3 covers stream retention. A second, related set (EC2.1, EC2.182 and the service-level public-snapshot checks such as RDS.1, DocumentDB.3 and Neptune.3) covers the other half of this capability: a backup that is shared publicly is a data leak, so snapshots must not be exposed to all accounts.

The good news is that most of this is one decision repeated: turn recovery on, set a sensible window, and keep the resulting backups private. Most failures are drift: a database launched from a template with retention set to zero, a table created with PITR off (the default), a stream left at the 24-hour default, a snapshot copied with the wrong sharing flag. The job is to find every data store that cannot be recovered or whose backups are exposed, fix the production ones, and enforce a retention floor so new resources arrive protected.

In this lesson you will learn how AWS expresses recoverability across databases, tables, streams, file systems and snapshots, how to find every production data store that cannot be recovered or whose backups are exposed publicly, and how to fix them without spending on coverage that genuinely is not needed. The Controls this lesson covers section lists every Security Hub control in this capability, each linking to a deep page with the exact check and a copy-and-paste fix.

Fun fact

The retention period that quietly read zero

A team launched an RDS instance from a template they had copied from a tutorial, which set the backup retention period to zero so demo environments tore down instantly and for free. The instance graduated to staging, then quietly to production, carrying the zero the whole way. Eighteen months later a migration script dropped the wrong table, and there was no automated backup and no point-in-time recovery to fall back on: the most recent restore point was a manual snapshot someone had taken, by luck, six weeks earlier. The same trap shows up across the capability. DynamoDB PITR is off by default and only captures forward from the moment you enable it, and Kinesis streams default to 24 hours, so when a consumer fails on a Friday the unread records can expire before anyone is back on Monday. Recovery has to be turned on before you need it, never after.

Finding unrecoverable data across an estate

Marco picks up a batch of backup findings during the weekly compliance triage. Security Hub shows failures spread across RDS, DynamoDB and Kinesis in three accounts, plus a couple of EBS snapshots flagged as shared publicly.

Rather than work the findings one by one, he starts with the data stores whose loss is most expensive. Before changing anything, he lists which RDS instances have backups disabled, so he can separate the production databases that must be fixed from the dev instances that can be documented as exceptions.

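Start with the resources whose loss is most expensive. RDS instances with a retention period of zero have no automated backups and no point-in-time recovery at all; those with a period below the floor can be recovered, but only within a window that is too short.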

$ aws rds describe-db-instances --query 'DBInstances[?BackupRetentionPeriod<`7`].[DBInstanceIdentifier,BackupRetentionPeriod,Engine]' --output table
----------------------------------------------
| prod-orders | 0 | postgres |
| prod-billing | 3 | mysql |
| dev-scratch | 0 | mysql |
----------------------------------------------
# prod-orders has NO backups; prod-billing is below the 7-day floor; dev-scratch can stay.

Retention of 0 disables backups entirely; anything 1 to 6 fails the default 7-day floor. Fix the production databases first, then document the dev ones as exceptions.
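The rule above can be sketched as a small shell helper. This is an illustrative sketch, not part of the original walkthrough: the function names `classify_retention` and `label_rds_instances` are hypothetical, and the 7-day default is simply the floor used in this lesson.

```shell
# Classify an RDS BackupRetentionPeriod value against a retention floor.
# $1 = retention in days, $2 = floor in days (assumed default: 7).
classify_retention() {
  local days="$1" floor="${2:-7}"
  if [ "$days" -eq 0 ]; then
    echo "disabled"       # no automated backups, no point-in-time recovery
  elif [ "$days" -lt "$floor" ]; then
    echo "below-floor"    # backups exist, but the window is too short
  else
    echo "ok"
  fi
}

# Hypothetical wrapper: label every instance in the current Region.
# Not called here; it needs AWS credentials to run.
label_rds_instances() {
  aws rds describe-db-instances \
    --query 'DBInstances[].[DBInstanceIdentifier,BackupRetentionPeriod]' \
    --output text |
  while read -r id days; do
    printf '%s\t%s\t%s\n' "$id" "$days" "$(classify_retention "$days")"
  done
}
```

Calling `label_rds_instances` with credentials in place would print one tab-separated line per instance, which makes it straightforward to sort the disabled instances to the top before deciding which are production.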

How AWS keeps a data store recoverable (deep dive)

Most recovery controls resolve to one of two mechanisms. The first is a retention window measured in time: RDS takes a daily snapshot and continuously archives transaction logs so it can replay to any second within the retention period, DynamoDB PITR captures a rolling 35-day change log, and Kinesis keeps records readable for a configurable window (24 hours by default, up to 8,760 hours). Set the window to zero or leave it at a low default and the safety net simply is not there. The second is an automatic backup job: EFS, ElastiCache for Redis, Redshift, DocumentDB and Neptune each take scheduled backups when the relevant setting is on. Security Hub reads the configured value directly (for example BackupRetentionPeriod on RDS, PointInTimeRecoveryStatus on DynamoDB, RetentionPeriodHours on Kinesis) and fails anything below the floor.
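To see the same values Security Hub reads, one option is to query each field directly with the AWS CLI. The sketch below wraps three such lookups in helper functions; the function names are hypothetical, and the resource names passed in are whatever exists in your account.

```shell
# Each function prints the raw configured value for one resource.
# None is called here; they need AWS credentials and a real resource name.

rds_retention_days() {       # $1 = DB instance identifier
  aws rds describe-db-instances --db-instance-identifier "$1" \
    --query 'DBInstances[0].BackupRetentionPeriod' --output text
}

dynamodb_pitr_status() {     # $1 = table name; prints ENABLED or DISABLED
  aws dynamodb describe-continuous-backups --table-name "$1" \
    --query 'ContinuousBackupsDescription.PointInTimeRecoveryDescription.PointInTimeRecoveryStatus' \
    --output text
}

kinesis_retention_hours() {  # $1 = stream name
  aws kinesis describe-stream-summary --stream-name "$1" \
    --query 'StreamDescriptionSummary.RetentionPeriodHours' --output text
}
```

For example, `dynamodb_pitr_status prod-orders` would print the PITR status for the table named in the remediation example later in this lesson.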

Two behaviours catch teams out. Backups only ever protect forward from the moment they are enabled: turning on DynamoDB PITR the day after an accident gives you one day of history, not 35, so the only safe time to enable recovery is at creation. And a restore is usually not an in-place rollback: DynamoDB and RDS point-in-time restores create a new resource from the chosen timestamp, and some settings are not carried over (on DynamoDB, for example, auto scaling policies, TTL and tags), so the restore runbook matters as much as the setting.
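Both behaviours are easy to make concrete. The sketch below is illustrative only: `restorable_days` is a hypothetical helper that computes how much history is actually available (the smaller of the time since recovery was enabled and the window), and `restore_table_copy` wraps the DynamoDB restore call to show that the target is a new table, not the original.

```shell
# Days of history actually restorable: the smaller of "days since recovery
# was enabled" and the retention window (assumed default: 35 days).
restorable_days() {
  local enabled_days="$1" window="${2:-35}"
  if [ "$enabled_days" -lt "$window" ]; then
    echo "$enabled_days"
  else
    echo "$window"
  fi
}

restorable_days 1     # enabled yesterday: prints 1
restorable_days 90    # enabled long ago: prints 35

# A point-in-time restore writes to a NEW table; the source is left untouched.
# Not called here; it needs AWS credentials and a PITR-enabled table.
restore_table_copy() {  # $1 = source table, $2 = new table name, $3 = timestamp
  aws dynamodb restore-table-to-point-in-time \
    --source-table-name "$1" --target-table-name "$2" \
    --restore-date-time "$3"
}
```

Because the restored table has a different name, the runbook also has to cover repointing the application and reapplying any settings that did not carry over.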

The other half of this capability is keeping the backups themselves private. EBS and RDS snapshots can be shared with other accounts or with all accounts, and a snapshot shared with all accounts is a public copy of your data. The public-snapshot controls evaluate snapshot sharing attributes and fail any snapshot exposed to the public, which is why EBS snapshot block-public-access (EC2.182) is the strongest backstop: it prevents any snapshot in the account from being made public regardless of an individual sharing flag.
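A quick way to check this half of the capability is to list the EBS snapshots you own that anyone can restore, and to read the current block-public-access state. The helpers below are a minimal sketch with hypothetical function names; they operate on the Region the CLI is configured for.

```shell
# List snapshots owned by this account that are restorable by all accounts.
# Not called here; it needs AWS credentials.
list_public_ebs_snapshots() {
  aws ec2 describe-snapshots --owner-ids self --restorable-by-user-ids all \
    --query 'Snapshots[].SnapshotId' --output text
}

# Print the current snapshot block-public-access state for this Region.
snapshot_bpa_state() {
  aws ec2 get-snapshot-block-public-access-state --query 'State' --output text
}

# True only for the strictest state, which blocks all public sharing.
bpa_is_strict() {  # $1 = state string returned by snapshot_bpa_state
  [ "$1" = "block-all-sharing" ]
}
```

A typical check would be `bpa_is_strict "$(snapshot_bpa_state)" || echo "public sharing is not fully blocked"`, run once per Region in use.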

What is the impact of leaving data unrecoverable?

The direct impact is the absence of a rewind. With backups disabled or retention too short, there is no automated way to undo a bad migration, a buggy batch job, an accidental delete or a ransomware encryption event. The failure mode is silent and total: the data store works perfectly until the day someone needs to restore and discovers they cannot. For a production system this is the difference between a five-minute restore to just before the incident and an open-ended reconstruction effort with permanent gaps.

The second-order impact is blast radius across everything downstream. When events or records are lost, every system that derives from them (analytics dashboards, billing reconciliation, audit trails, ML training sets) inherits a gap that is often invisible until someone asks a question the data can no longer answer. A short retention window quietly raises the severity of every incident upstream of it.

On the compliance side, backup and recovery map directly to recognised frameworks: NIST 800-53 (the contingency-planning controls CP-9 and CP-10, plus SI-12 on information retention), SOC 2, ISO 27001 and PCI DSS all expect production data to be recoverable and backups to be protected. A failing finding is documented audit evidence, and a publicly shared snapshot is a data-exposure incident in its own right. A clean, complete set of backup controls across every account is among the cheapest and most defensible artefacts you can hand an auditor.

How do you make data recoverable safely?

Work the capability as one loop rather than chasing individual findings. The order matters: fix the highest-impact production data stores first, mind the operational gotchas on the way, and enforce a retention floor so new resources arrive protected.

1. Inventory every data store and its recovery state

Across services, list each data store with its recovery setting: RDS BackupRetentionPeriod, DynamoDB PITR status, Kinesis RetentionPeriodHours, plus the automatic-backup flags on EFS, ElastiCache, Redshift, DocumentDB and Neptune. Separately, list every EBS and RDS snapshot shared publicly. Capture the environment tag for each so you can separate production (must fix) from genuinely disposable resources (document as exceptions). Read replicas are excluded from RDS.11 because their recovery follows the source instance, so do not waste effort on them.

2. Set a retention floor that matches the recovery requirement

Seven days is a sensible minimum for RDS and Kinesis, but match it to your real recovery point objective and any regulation: 35 days is the RDS automated maximum, 8,760 hours the Kinesis maximum, and longer horizons layer on manual snapshots or AWS Backup. For DynamoDB, PITR is simply on or off. Set the window once, correctly, per data classification rather than blindly applying one number everywhere.

3. Apply changes at the right time and un-share public snapshots

Mind the gotchas. Changing RDS backup retention from zero to a non-zero value can cause a brief outage or I/O pause while the first base snapshot is taken, so apply it in the maintenance window unless the instance carries no traffic. DynamoDB PITR and Kinesis retention increases take effect without downtime. For exposed backups, remove the public sharing from each snapshot. And because recovery only protects forward, enable it early rather than during an incident.

4. Ratchet it in with defaults and guardrails

Fix the source, not just the symptom. Set the retention floor and PITR-on in your CloudFormation and Terraform modules so new resources arrive protected, enable account-level EBS snapshot block-public-access so no snapshot can be made public again, and back the lot with AWS Config rules (for example db-instance-backup-enabled, dynamodb-pitr-enabled) so the posture cannot drift. For the resources you intentionally leave unprotected, record a documented exception rather than ignoring the finding.

# Set a 7-day backup floor on every non-replica instance below it.
# This also matches dev instances such as dev-scratch, so skip documented exceptions first.
for db in $(aws rds describe-db-instances \
    --query 'DBInstances[?ReadReplicaSourceDBInstanceIdentifier==`null` && BackupRetentionPeriod<`7`].DBInstanceIdentifier' --output text); do
  aws rds modify-db-instance --db-instance-identifier "$db" \
    --backup-retention-period 7 --no-apply-immediately
done

# Turn on DynamoDB point-in-time recovery (instant, no downtime).
aws dynamodb update-continuous-backups --table-name prod-orders \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

# Stop any snapshot in this account from being shared publicly (the setting applies per Region).
aws ec2 enable-snapshot-block-public-access --state block-all-sharing
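The commands above leave out two parts of the steps: raising Kinesis retention and un-sharing snapshots that are already public. A minimal sketch of both follows, together with one way to register the Config rules named in step 4. The function names are hypothetical, and the 7-day default is the floor used in this lesson rather than an AWS default.

```shell
# Raise a Kinesis stream to a floor expressed in days (7 days = 168 hours).
# The call is rejected if the stream already retains data for longer.
raise_kinesis_retention() {  # $1 = stream name, $2 = days (assumed default: 7)
  local hours=$(( ${2:-7} * 24 ))
  aws kinesis increase-stream-retention-period \
    --stream-name "$1" --retention-period-hours "$hours"
}

# Remove public sharing from one EBS snapshot.
unshare_ebs_snapshot() {     # $1 = snapshot ID
  aws ec2 modify-snapshot-attribute --snapshot-id "$1" \
    --attribute createVolumePermission --operation-type remove --group-names all
}

# Remove public sharing from one manual RDS snapshot.
unshare_rds_snapshot() {     # $1 = DB snapshot identifier
  aws rds modify-db-snapshot-attribute --db-snapshot-identifier "$1" \
    --attribute-name restore --values-to-remove all
}

# Register an AWS managed Config rule, for example:
#   enable_managed_rule db-instance-backup-enabled DB_INSTANCE_BACKUP_ENABLED
#   enable_managed_rule dynamodb-pitr-enabled DYNAMODB_PITR_ENABLED
enable_managed_rule() {      # $1 = rule name, $2 = managed rule identifier
  aws configservice put-config-rule --config-rule \
    "{\"ConfigRuleName\":\"$1\",\"Source\":{\"Owner\":\"AWS\",\"SourceIdentifier\":\"$2\"}}"
}
```

None of these functions runs on its own; each needs AWS credentials and a real resource name, and the un-share helpers are meant to be fed the IDs found during the inventory step.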

You can now treat backups and retention as one capability rather than a scatter of findings: inventory every data store's recovery state, set a retention floor that matches each system's recovery requirement, fix the production resources highest-impact first while keeping every snapshot private, and ratchet the estate shut with retention defaults, snapshot block-public-access and Config guardrails. The Controls this lesson covers section below links every control in this group to its deep page and fix.

Controls this lesson covers

One capability, many AWS Security Hub controls. This lesson is the shared playbook; each control below keeps its own deep page with the exact check, severity and a copy-and-paste fix.