
Deploy across multiple Availability Zones

One capability across databases, caches, load balancers, file systems, search domains and serverless: make sure no single Availability Zone outage can take a production workload down.



High availability across AZs: the basics

What does "multi-AZ" actually mean across AWS services?

An Availability Zone (AZ) is one or more physically isolated datacentres within an AWS Region, with independent power, cooling and networking. AWS builds Regions from multiple AZs precisely because a single zone can have a power, network or hardware event. "Multi-AZ" means spreading a workload across two or more zones so that one zone failing reduces capacity rather than causing an outage. Each service expresses this differently: RDS keeps a synchronous standby in a second AZ, an Auto Scaling group launches instances across several subnets, an Elastic Load Balancer registers targets in multiple zones, FSx runs an active and a standby file server, OpenSearch and ElastiCache replicate nodes across zones, and even Lambda needs to be wired to subnets in more than one AZ when it runs in a VPC.

AWS Security Hub turns each of these into its own control, which is why a single estate can fail a dozen high-availability checks at once. RDS.5 and RDS.15 cover database instances and clusters, AutoScaling.2 and AutoScaling.6 cover compute fleets, ELB.10 and ELB.13 cover load balancers, FSx.3, FSx.4 and FSx.5 cover the three FSx file-system types, ES.6, ES.7 and Opensearch.6 cover search domains, ElastiCache.3 covers cache failover, Neptune.9, Redshift.16, Redshift.18 and DMS.13 cover their respective engines, NetworkFirewall.1 covers firewall endpoints, and Lambda.5 covers VPC functions. They look like separate problems on the report, but they are one capability: no production workload should sit in a single zone.

The good news is that most single-AZ exposure is drift, not intent. A launch wizard left a database single-AZ, a Terraform module hard-coded one subnet, a cluster was promoted from a pilot without a resilience review. The job is to find every production workload that lives in one zone, decide which genuinely need a standby and which are throwaway, span the ones that matter across at least two AZs, and then enforce multi-AZ as the default so new resources arrive resilient.

In this lesson you will learn how AWS expresses high availability across databases, compute, load balancers, file systems, search domains and serverless, how to find every production workload that lives in a single Availability Zone, and how to span the ones that matter across zones without breaking the few that are intentionally single-AZ. The "Controls this lesson covers" section lists every Security Hub control in this capability, each linking to a deep page with the exact check and a copy-and-paste fix.

Fun fact

The morning us-east-1 reminded everyone that AZs fail

When a large Availability Zone disruption hit us-east-1, teams running single-AZ resources in the affected zone watched databases, fleets and file shares go unreachable for hours. Teams that had paid the premium for multi-AZ on the same workloads saw AWS promote standbys and rebalance capacity into the healthy zones; their endpoints flickered for a minute or two and then carried on. The pattern repeats in every major zone event: AZs are engineered to fail independently, and the workloads that survive are the ones whose capacity could move when one zone went down. The annual multi-AZ premium on a critical workload is frequently less than the cost of a single multi-hour outage of the service it backs.

Finding single-AZ exposure across an estate

Priya is the platform lead at a scale-up preparing for its first SOC 2 audit. Security Hub shows high-availability failures spread across RDS, Auto Scaling, ELB and OpenSearch in three accounts that pre-date the team's current guardrails.

Rather than work the findings one by one, she starts with the resources that hold data and have the highest impact, listing which databases are still single-AZ so she can separate the production systems that need a standby from the dev instances that do not, before changing anything.

Start with the resources whose loss is most expensive. Single-AZ RDS instances are a common and high-impact finding.

$ aws rds describe-db-instances --query 'DBInstances[?MultiAZ==`false`].[DBInstanceIdentifier,Engine,AvailabilityZone]' --output table
-----------------------------------------------------
| prod-orders-db | postgres | us-east-1a |
| prod-auth-db | postgres | us-east-1a |
| dev-sandbox-db | mysql | us-east-1b |
-----------------------------------------------------
# Two production databases with no standby; the dev sandbox can stay single-AZ.

Single-AZ production databases are the highest-value target in this group. Cross-reference each against its environment tag, then fix the production ones first.
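The cross-reference can be done in one pass. The sketch below is illustrative only: it assumes each instance carries a tag whose key is `environment` and whose production value is `production` (both hypothetical names, so adjust them to your tagging scheme), and the helper names are made up for this example.

```shell
# Assumption: workloads are tagged with key "environment"; override TAG_KEY if yours differs.
TAG_KEY="${TAG_KEY:-environment}"

# Print single-AZ instances as: identifier <TAB> AZ <TAB> environment tag value.
# A missing tag shows up as "None" in text output.
list_single_az_with_env() {
  aws rds describe-db-instances \
    --query "DBInstances[?MultiAZ==\`false\`].[DBInstanceIdentifier,AvailabilityZone,TagList[?Key=='${TAG_KEY}']|[0].Value]" \
    --output text
}

# Keep only the rows whose third column is "production" and print the identifier.
only_production() {
  awk -F'\t' '$3 == "production" { print $1 }'
}

# Usage (needs AWS credentials):
#   list_single_az_with_env | only_production
```

Rows that come back with `None` in the third column are untagged, which is worth treating as a finding in its own right before deciding whether the instance needs a standby.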

How AWS makes a workload survive a zone outage: deep dive

Most high-availability controls resolve to one of three mechanisms. The first is a standby or replica in a second zone with automatic failover, which is how RDS.5, FSx.3, FSx.4, FSx.5, ElastiCache.3, Neptune.9 and DMS.13 work: AWS keeps a second copy current and fails over automatically, repointing a stable DNS endpoint so applications reconnect without a configuration change. Replication is synchronous for RDS Multi-AZ instances, whereas ElastiCache replicates asynchronously, so a cache failover can lose the most recent writes. The second is spreading stateless capacity across zones, which is how AutoScaling.2, AutoScaling.6, ELB.10 and ELB.13 work: the subnet list you provide is the only thing telling the resource which zones it may use, so adding subnets in more zones is the act of opting into them. The third is node and replica placement across zones for clustered engines, which is how ES.6, ES.7, Opensearch.6, Redshift.16 and Redshift.18 work.
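To triage a long findings list by mechanism rather than by service, the grouping above can be written as a small lookup. This is a sketch: the function names and the three labels are invented for illustration, and controls the paragraph does not classify (RDS.15, NetworkFirewall.1, Lambda.5 and so on) deliberately fall through to `unclassified`.

```shell
# Map a Security Hub control ID to the mechanism named in the paragraph above.
mechanism_for_control() {
  case "$1" in
    RDS.5|FSx.3|FSx.4|FSx.5|ElastiCache.3|Neptune.9|DMS.13)
      echo "standby-with-failover" ;;
    AutoScaling.2|AutoScaling.6|ELB.10|ELB.13)
      echo "spread-across-subnets" ;;
    ES.6|ES.7|Opensearch.6|Redshift.16|Redshift.18)
      echo "replica-placement" ;;
    *)
      echo "unclassified" ;;
  esac
}

# Read control IDs (one per line) and print "mechanism <TAB> control", sorted by mechanism.
group_by_mechanism() {
  while read -r id; do
    if [ -n "$id" ]; then
      printf '%s\t%s\n' "$(mechanism_for_control "$id")" "$id"
    fi
  done | sort
}
```

Feeding it the failing control IDs from a Security Hub export gives three work queues, one per kind of fix.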

Security Hub evaluates these through AWS Config. Change-triggered checks usually re-evaluate shortly after the resource changes, but periodic checks can take up to a day, so a fix does not necessarily flip the finding to PASSED at once even though the configuration change itself takes effect quickly. This matters when you are gathering audit evidence against a deadline. Some changes are also irreversible in place: FSx for Windows, OpenZFS and ONTAP deployment types are chosen at creation, so remediating FSx.3, FSx.4 or FSx.5 means building a new Multi-AZ file system and migrating the data, not toggling a flag.

Some capabilities are split into two controls because they check different resource types or different properties: RDS.5 (instances) versus RDS.15 (clusters), AutoScaling.2 (group spans AZs) versus AutoScaling.6 (multiple instance types across AZs for capacity diversity). The strongest end state is not just spanning resources today but enforcing multi-AZ as a provisioning default through infrastructure-as-code and AWS Config rules, so no production resource can be created single-AZ without a deliberate, recorded exception.

What is the impact of running production in a single AZ?

The direct impact is availability. A single-AZ resource has exactly one copy of itself in one zone. If that zone suffers a power, network, cooling or hardware failure, which AWS designs explicitly around because it happens, the resource becomes unreachable. With no standby, a database can only be recovered by restoring from a backup into a healthy zone, a process that can take from tens of minutes to several hours and can lose writes since the last backup. A single-AZ Auto Scaling group cannot launch replacement capacity while its only zone is impaired, so the fleet shrinks towards zero as instances fail or turn over.

The second-order impact is that planned maintenance becomes downtime too. On a multi-AZ database AWS patches the standby, fails over and then patches the old primary, turning a maintenance outage into a brief failover. On a single-AZ resource the same patch takes the workload offline for its duration. So single-AZ exposure costs you on both the unplanned and the planned axis.

On the compliance side, frameworks such as SOC 2, ISO 27001, HIPAA, PCI DSS and FedRAMP expect production workloads to meet documented availability commitments, and on AWS resilience to a single-zone failure is the usual way to demonstrate that. A passing set of high-availability controls across every account is defensible audit evidence; a scatter of single-AZ production resources is the wrong answer to a question auditors and enterprise customers now ask directly.

How do you make workloads multi-AZ safely?

Work the capability as one loop rather than chasing individual findings. The order matters: decide what genuinely needs resilience before you start changing topologies, and confirm the surrounding networking can reach the new zones so you do not take a live service offline.

1. Inventory every production workload that lives in one AZ

Across services, list the resources that are single-AZ: RDS instances and clusters, Auto Scaling groups, load balancers, FSx file systems, OpenSearch and ElastiCache deployments, Neptune, Redshift, DMS and VPC Lambda functions. Treat this inventory as the source of truth rather than the Security Hub finding count, because one workload can trigger several controls, and capture the environment tag for each so the next step has the data it needs.
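A minimal sketch of such an inventory follows, covering three of the services as an illustration. The function names are hypothetical, and the remaining services (FSx, OpenSearch, Neptune, Redshift, DMS, Lambda) would need their own `describe` calls added in the same shape.

```shell
# Prefix each non-empty line on stdin with a service label, tab-separated.
label() {
  awk -v svc="$1" 'NF { print svc "\t" $0 }'
}

# Emit one "service <TAB> resource" row per single-AZ resource found.
inventory_single_az() {
  aws rds describe-db-instances \
    --query 'DBInstances[?MultiAZ==`false`].[DBInstanceIdentifier]' \
    --output text | label rds
  aws autoscaling describe-auto-scaling-groups \
    --query 'AutoScalingGroups[?length(AvailabilityZones)==`1`].[AutoScalingGroupName]' \
    --output text | label autoscaling
  aws elasticache describe-replication-groups \
    --query 'ReplicationGroups[?AutomaticFailover!=`"enabled"`].[ReplicationGroupId]' \
    --output text | label elasticache
}

# Usage (needs AWS credentials, one account and Region at a time):
#   inventory_single_az > single-az-inventory.tsv
```

Because each row names a resource rather than a finding, a workload that trips several controls still appears once per resource, which keeps the count honest.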

2. Assign a resilience tier to each workload

Decide per workload, not in bulk. Production, customer-facing and revenue-critical systems get multi-AZ; the premium is justified by automatic failover and zero-data-loss durability. Dev, test and disposable resources stay single-AZ by design. Record the decision against each resource with a tag so the choice is auditable and not re-litigated on every scan.
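One way to make the decision auditable is to write it back onto the resource. The sketch below assumes a tag key of `resilience-tier` and the tier values shown, all of which are illustrative names, not an AWS convention. Anything that is not clearly production or clearly disposable is marked for a human decision, not defaulted.

```shell
# Map an environment tag value to a proposed resilience tier.
# Unknown environments are flagged for review, not decided in bulk.
tier_for_environment() {
  case "$1" in
    production|prod)  echo "multi-az" ;;
    dev|test|sandbox) echo "single-az-by-design" ;;
    *)                echo "needs-review" ;;
  esac
}

# Record the tier on an RDS resource. $1 = resource ARN, $2 = environment value.
# "resilience-tier" is a hypothetical tag key chosen for this example.
record_tier() {
  aws rds add-tags-to-resource --resource-name "$1" \
    --tags "Key=resilience-tier,Value=$(tier_for_environment "$2")"
}
```

Other services have their own tagging commands, so `record_tier` as written only covers RDS.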

3. Span the resources that matter, networking first

Before flipping a flag, confirm the VPC actually has suitable subnets in another zone, routed correctly and tagged for the workload; otherwise new capacity launches and immediately fails its health checks. For databases and caches, enable the standby; for fleets and load balancers, add subnets in two or more zones and mirror the set on the load balancer; for FSx, accept that the deployment type is fixed at creation and plan a build-new-and-migrate project. Prioritise resources that hold data.
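The networking pre-check can be scripted before any topology change. This is a sketch with hypothetical helper names; it only confirms that subnets exist in at least two zones, so route tables, NACLs and security groups still need checking separately.

```shell
# Count distinct AZs in "subnet-id <TAB> az" rows read from stdin.
count_azs() {
  awk 'NF >= 2 { seen[$2] = 1 } END { n = 0; for (az in seen) n++; print n }'
}

# List a VPC's subnets with the AZ each one lives in. $1 = VPC ID.
subnets_with_az() {
  aws ec2 describe-subnets --filters "Name=vpc-id,Values=$1" \
    --query 'Subnets[].[SubnetId,AvailabilityZone]' --output text
}

# Succeed only if the VPC has subnets in two or more zones.
vpc_spans_two_azs() {
  [ "$(subnets_with_az "$1" | count_azs)" -ge 2 ]
}

# Usage (needs AWS credentials; the VPC ID is a placeholder):
#   vpc_spans_two_azs vpc-0123456789abcdef0 || echo "add a subnet in another AZ first"
```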

4. Ratchet it in with defaults and guardrails

Make multi-AZ the default in your CloudFormation and Terraform modules so new production resources arrive resilient, and back it with AWS Config rules (for example autoscaling-multiple-az and rds-multi-az-support) and Service Control Policies so the posture cannot quietly drift back to one zone. For the resources you intentionally leave single-AZ, record a documented exception rather than ignoring the finding.

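# Fix the highest-impact data stores first: enable Multi-AZ on production databases.
# Assumes production instances carry the tag environment=production; adjust to your scheme.
# Adding a standby can affect performance while it syncs, so consider an off-peak window.
for db in $(aws rds describe-db-instances \
    --query 'DBInstances[?MultiAZ==`false` && DBClusterIdentifier==null && TagList[?Key==`"environment"` && Value==`"production"`]].DBInstanceIdentifier' --output text); do
  aws rds modify-db-instance --db-instance-identifier "$db" \
    --multi-az --apply-immediately \
    && echo "$db: standby being provisioned in a second AZ"
done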

# Span a stateless compute fleet across three AZs, then mirror the set on its load balancer.
aws autoscaling update-auto-scaling-group --auto-scaling-group-name web-tier-asg \
  --vpc-zone-identifier "subnet-0aaa1,subnet-0bbb2,subnet-0ccc3"
aws elbv2 set-subnets --load-balancer-arn "$ALB_ARN" \
  --subnets subnet-0aaa1 subnet-0bbb2 subnet-0ccc3
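For the guardrail step, the two managed AWS Config rules named in step 4 can be switched on from the CLI. The sketch below uses hypothetical helper names and assumes a Config recorder is already running in the account and Region; in practice these rules are more often deployed through CloudFormation or Terraform alongside the rest of the baseline.

```shell
# Build the JSON for an AWS managed Config rule. $1 = rule name, $2 = source identifier.
managed_rule_json() {
  printf '{"ConfigRuleName":"%s","Source":{"Owner":"AWS","SourceIdentifier":"%s"}}\n' "$1" "$2"
}

# Enable the two managed rules mentioned in step 4.
enable_multi_az_rules() {
  aws configservice put-config-rule \
    --config-rule "$(managed_rule_json autoscaling-multiple-az AUTOSCALING_MULTIPLE_AZ)"
  aws configservice put-config-rule \
    --config-rule "$(managed_rule_json rds-multi-az-support RDS_MULTI_AZ_SUPPORT)"
}
```

Config rules detect drift after the fact; preventing a single-AZ resource from being created at all is the job of the module defaults and Service Control Policies.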


You can now treat high availability as one capability rather than a scatter of findings: inventory every production workload that lives in a single zone, tier it by business importance, span the ones that matter across at least two AZs (building new where the deployment type is fixed at creation), and ratchet the estate shut with multi-AZ defaults and Config guardrails. The "Controls this lesson covers" section below links every control in this group to its deep page and fix.


Controls this lesson covers

One capability, many AWS Security Hub controls. This lesson is the shared playbook; each control below keeps its own deep page with the exact check, severity and a copy-and-paste fix.