Learning path

Build in resilience

Multi-AZ, backups, restore testing and read replicas, to survive the bad day.

22 lessons · ~287 min total

Lessons in this path

  1. Site Reliability · AWS

    Add RDS read replicas for read scaling and DR

    Read replicas absorb read load and give you a promotion-ready DR target — for read-heavy workloads they pay for themselves in primary-instance right-sizing.

    14 min
  2. Site Reliability · AWS

    Configure cross-region backup copy

    Backups in the same region as the workload won't help when the region goes down. Add a cross-region copy rule before disaster forces the conversation.

    15 min
  3. Site Reliability · AWS

    Configure AWS Backup restore testing

    An untested backup is a hope. Automate scheduled restores so you find out before the incident whether your backups actually work.

    14 min
  4. Site Reliability · AWS

    Establish AWS Backup Plans

    Without a Backup Plan there is no policy — recovery becomes whatever someone hopes is there. Wire up a plan that covers resources by tag.

    16 min
  5. Site Reliability · AWS

    Enable S3 versioning on critical buckets

    Without versioning, an overwrite or delete is permanent. Turn it on and pair it with a noncurrent-version lifecycle rule so you don't pay for old versions forever.

    12 min
  6. Site Reliability · AWS

    Protect EBS volumes with AWS Backup

    EBS volumes not covered by any backup plan or DLM policy have no recovery path. Wire up coverage by tag and verify.

    13 min
  7. Site Reliability · AWS

    Protect EC2 instances with AWS Backup

    EC2 instance backups capture the instance configuration and its attached EBS volumes. Find instances with no protection and bring them under a backup plan.

    13 min
  8. Site Reliability · AWS

    Protect S3 buckets with AWS Backup

    Versioning alone won't save you from ransomware that deletes versions. Layer AWS Backup or replication on top — and verify recovery.

    14 min
  9. Site Reliability · AWS

    Enable automated backups on RDS

    A retention period of 0 disables automated backups entirely — and with them point-in-time recovery, leaving the database one bad query or deploy away from unrecoverable data loss.

    12 min
  10. Site Reliability · AWS

    Extend RDS backup retention

    Automated backups are on, but the retention window is too short — corruption or a bad change discovered after the window rolls off is unrecoverable.

    11 min
  11. Site Reliability · AWS

    Enable DynamoDB point-in-time recovery

    A bad deploy or a fat-fingered delete can corrupt a DynamoDB table in seconds. PITR lets you rewind to any second in the last 35 days — but it's off by default on every table.

    12 min
  12. Site Reliability · AWS

    Protect DynamoDB tables with AWS Backup

    Point-in-time recovery lives with the table and dies with it. Layer AWS Backup on top for isolated, long-term, cross-Region copies — and verify the restore.

    12 min
  13. Site Reliability · AWS

    Protect RDS instances with AWS Backup

    Native RDS backups die with the database — bring your RDS instances and Aurora clusters under a centralized AWS Backup plan so they're protected by policy, not per-DB settings.

    13 min
  14. Site Reliability · AWS

    Configure S3 Cross-Region Replication

    A single bucket in a single Region is a single point of failure. Replicate critical data to a second Region — and understand the cost before you do.

    13 min
  15. Site Reliability · AWS

    Enable DynamoDB Global Tables

    Global Tables give you active-active, multi-Region DynamoDB with local-latency reads and failover measured in seconds. You pay for every replicated write and for a full copy of the data in every Region, so treat it as a deliberate choice, not a default.

    13 min
  16. Site Reliability · AWS

    Address stale backup recovery points

    A backup plan that runs daily but whose newest recovery point is nine days old isn't protecting you; it's failing quietly. The age of your latest recovery point is your real-world RPO.

    12 min
  17. Site Reliability · AWS

    Fix AWS Backup job failures

    A backup plan can look configured on paper while its jobs quietly fail every night — leaving a growing recovery gap nobody notices until a restore is needed.

    13 min
  18. Site Reliability · AWS

    Fix cross-region backup copy failures

    The local backup succeeded, so the dashboard looks green — but the cross-region copy silently failed, and the off-region DR copy you think you have doesn't exist.

    12 min
  19. Site Reliability · AWS

    Fix restore test failures

    A backup job that says "succeeded" only proves bytes were written. A failed restore test is your warning — issued on a Tuesday, not during the disaster — that those bytes may not come back as a running resource.

    12 min
  20. Compliance · AWS

    Deploy across multiple Availability Zones

    One capability across databases, caches, load balancers, file systems, search domains and serverless: make sure no single Availability Zone outage can take a production workload down.

    14 min
  21. Compliance · AWS

    Keep software and engines patched

    One capability across databases, runtimes, clusters and instances: make sure no workload runs on an unsupported or unpatched version that has stopped receiving security fixes.

    14 min
  22. Compliance · AWS

    Harden load balancers (ALB/NLB/CLB)

    One capability across Application and Classic Load Balancers and the Auto Scaling groups behind them: reject malformed HTTP, drain connections cleanly, balance traffic evenly and replace genuinely broken instances, mostly through single attribute flips.

    13 min