Learning path

Build in resilience

Multi-AZ, backups, restore testing and read replicas, to survive the bad day.

22 lessons · ~287 min total

1
Site Reliability AWS

Add RDS read replicas for read scaling and DR

Read replicas absorb read load and give you a promotion-ready DR target — for read-heavy workloads they pay for themselves in primary-instance right-sizing.

14 min
2
Site Reliability AWS

Configure cross-region backup copy

Backups in the same region as the workload won't help when the region goes down. Add a cross-region copy rule before disaster forces the conversation.

15 min
3
Site Reliability AWS

Configure AWS Backup restore testing

An untested backup is a hope. Automate scheduled restores so you find out before the incident whether your backups actually work.

14 min
4
Site Reliability AWS

Establish AWS Backup Plans

Without a Backup Plan there is no policy — recovery becomes whatever someone hopes is there. Wire up a plan that covers resources by tag.

16 min
5
Site Reliability AWS

Enable S3 versioning on critical buckets

Without versioning, an overwrite or delete is permanent. Turn it on and pair with a noncurrent-version lifecycle so you don't pay forever.

12 min
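
As a rough illustration of the "noncurrent-version lifecycle" mentioned in this lesson, the sketch below builds a lifecycle configuration that expires old object versions after a set number of days. The function name, the rule ID and the 30-day default are assumptions made for this example, not values taken from the lesson. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`.

```python
def noncurrent_expiry_lifecycle(noncurrent_days=30):
    """Build a lifecycle configuration that expires noncurrent object versions.

    The 30-day default and the rule ID are illustrative assumptions;
    choose a window that matches how far back you need to recover.
    """
    if noncurrent_days < 1:
        raise ValueError("noncurrent_days must be at least 1")
    return {
        "Rules": [
            {
                "ID": "expire-noncurrent-versions",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: applies to every object
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
            }
        ]
    }


config = noncurrent_expiry_lifecycle(noncurrent_days=30)
```

The rule only affects versions that are no longer current, so the latest version of each object stays in place while older versions stop accruing storage cost after the window.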
6
Site Reliability AWS

Protect EBS volumes with AWS Backup

EBS volumes not covered by any backup plan or DLM policy have no recovery path. Wire up coverage by tag and verify.

13 min
7
Site Reliability AWS

Protect EC2 instances with AWS Backup

EC2 instance backups capture the full machine state — find instances with no protection and bring them under a backup plan.

13 min
8
Site Reliability AWS

Protect S3 buckets with AWS Backup

Versioning alone won't save you from ransomware that deletes versions. Layer AWS Backup or replication on top — and verify recovery.

14 min
9
Site Reliability AWS

Enable automated backups on RDS

A retention period of 0 disables automated backups entirely — and with them point-in-time recovery, leaving the database one bad query or deploy away from unrecoverable data loss.

12 min
10
Site Reliability AWS

Extend RDS backup retention

Automated backups are on, but the retention window is too short — corruption or a bad change discovered after the window rolls off is unrecoverable.

11 min
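
The two RDS checks above (retention of 0, and retention that is on but too short) can be sketched as a single pass over instance records. This is a minimal sketch, assuming records shaped like the `DBInstances` entries returned by boto3's `describe_db_instances`. The function name and the 7-day minimum are assumptions for illustration, since the lessons do not state a specific threshold.

```python
def flag_rds_backup_gaps(db_instances, min_retention_days=7):
    """Return {instance identifier: finding} for instances with weak backup settings.

    min_retention_days is an assumed policy value, not an AWS default.
    """
    findings = {}
    for db in db_instances:
        name = db["DBInstanceIdentifier"]
        days = db.get("BackupRetentionPeriod", 0)
        if days == 0:
            # Retention of 0 means automated backups, and with them PITR, are off.
            findings[name] = "automated backups disabled"
        elif days < min_retention_days:
            findings[name] = f"retention {days}d is below {min_retention_days}d"
    return findings


# Hand-written sample records standing in for a real API response.
sample = [
    {"DBInstanceIdentifier": "orders-db", "BackupRetentionPeriod": 0},
    {"DBInstanceIdentifier": "users-db", "BackupRetentionPeriod": 1},
    {"DBInstanceIdentifier": "billing-db", "BackupRetentionPeriod": 14},
]
report = flag_rds_backup_gaps(sample)
```

In practice the records would come from a paginated `describe_db_instances` call; the sample list here only keeps the sketch runnable without AWS credentials.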
11
Site Reliability AWS

Enable DynamoDB point-in-time recovery

A bad deploy or a fat-fingered delete can corrupt a DynamoDB table in seconds. PITR lets you rewind to any second in the last 35 days — but it's off by default on every table.

12 min
12
Site Reliability AWS

Protect DynamoDB tables with AWS Backup

Point-in-time recovery lives with the table and dies with it. Layer AWS Backup on top for isolated, long-term, cross-Region copies — and verify the restore.

12 min
13
Site Reliability AWS

Protect RDS instances with AWS Backup

Native RDS backups die with the database — bring your RDS instances and Aurora clusters under a centralized AWS Backup plan so they're protected by policy, not per-DB settings.

13 min
14
Site Reliability AWS

Configure S3 Cross-Region Replication

A single bucket in a single Region is a single point of failure. Replicate critical data to a second Region — and understand the cost before you do.

13 min
15
Site Reliability AWS

Enable DynamoDB Global Tables

Global Tables give you active-active, multi-Region DynamoDB with seconds-RTO failover and local-latency reads — but you pay for every replicated write and a full copy of the data in every Region, so it's a deliberate choice, not a default.

13 min
16
Site Reliability AWS

Address stale backup recovery points

A backup plan that runs daily but whose newest recovery point is nine days old isn't protecting you — it's failing quietly. The age of your latest restore point is your real-world RPO.

12 min
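
The idea that the age of the newest restore point is the real-world RPO can be sketched as a small calculation. This is a minimal sketch, assuming recovery point records with `CreationDate` and `Status` fields, as returned by AWS Backup's `list_recovery_points_by_resource` in boto3. The function names and the 2-day threshold in the usage lines are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def newest_recovery_point_age(recovery_points, now=None):
    """Return the age of the newest COMPLETED recovery point, or None if there is none."""
    now = now or datetime.now(timezone.utc)
    completed = [
        rp["CreationDate"]
        for rp in recovery_points
        if rp.get("Status") == "COMPLETED"
    ]
    if not completed:
        return None
    return now - max(completed)


def is_stale(recovery_points, max_age, now=None):
    """True when no completed recovery point exists or the newest one exceeds max_age."""
    age = newest_recovery_point_age(recovery_points, now)
    return age is None or age > max_age


# Hand-written sample: the plan "runs daily" but the newest usable point is nine days old.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
points = [
    {"CreationDate": datetime(2024, 1, 1, tzinfo=timezone.utc), "Status": "COMPLETED"},
    {"CreationDate": datetime(2024, 1, 9, tzinfo=timezone.utc), "Status": "EXPIRED"},
]
age = newest_recovery_point_age(points, now)
stale = is_stale(points, timedelta(days=2), now)
```

Only completed recovery points count toward the age, since a point in any other state is not something you could rely on during a restore.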
17
Site Reliability AWS

Fix AWS Backup job failures

A backup plan can look configured on paper while its jobs quietly fail every night — leaving a growing recovery gap nobody notices until a restore is needed.

13 min
18
Site Reliability AWS

Fix cross-region backup copy failures

The local backup succeeded, so the dashboard looks green — but the cross-region copy silently failed, and the off-region DR copy you think you have doesn't exist.

12 min
19
Site Reliability AWS

Fix restore test failures

A backup job that says "succeeded" only proves bytes were written. A failed restore test is your warning — issued on a Tuesday, not during the disaster — that those bytes may not come back as a running resource.

12 min
20
Compliance AWS

Deploy across multiple Availability Zones

One capability across databases, caches, load balancers, file systems, search domains and serverless: make sure no single Availability Zone outage can take a production workload down.

14 min
21
Compliance AWS

Keep software and engines patched

One capability across databases, runtimes, clusters and instances: make sure no workload runs on an unsupported or unpatched version that has stopped receiving security fixes.

14 min
22
Compliance AWS

Harden load balancers (ALB/NLB/CLB)

One capability across Application and Classic Load Balancers and the Auto Scaling groups behind them: reject malformed HTTP, drain connections cleanly, balance traffic evenly and replace genuinely broken instances, mostly through single attribute flips.

13 min
