Deploy across multiple Availability Zones

One capability across databases, caches, load balancers, file systems, search domains and serverless: make sure no single Availability Zone outage can take a production workload down.

14 min·10 sections·AWS

Last reviewed 16 June 2026

Remediates AWS Security Hub: AutoScaling.2 AutoScaling.6 DMS.13 ElastiCache.3 ELB.10 ELB.13 ES.6 ES.7 FSx.3 FSx.4 FSx.5 Lambda.5 Neptune.9 NetworkFirewall.1 Opensearch.6 RDS.5 RDS.15 Redshift.16 Redshift.18

High availability across AZs: the basics

What does "multi-AZ" actually mean across AWS services?

An Availability Zone (AZ) is one or more physically isolated datacentres within an AWS Region, with independent power, cooling and networking. AWS designs every Region with multiple AZs precisely because a single zone can have a power, network or hardware event. "Multi-AZ" is the capability of spreading a workload across two or more zones so that one zone failing reduces capacity rather than causing an outage. Each service expresses this differently: RDS keeps a synchronous standby in a second AZ, an Auto Scaling group launches instances across several subnets, an Elastic Load Balancer registers targets in multiple zones, FSx runs an active and a standby file server, OpenSearch and ElastiCache replicate nodes across zones, and even Lambda needs to be wired to subnets in more than one AZ when it runs in a VPC.

AWS Security Hub turns each of these into its own control, which is why a single estate can fail a dozen high-availability checks at once. RDS.5 and RDS.15 cover database instances and clusters, AutoScaling.2 and AutoScaling.6 cover compute fleets, ELB.10 and ELB.13 cover load balancers, FSx.3, FSx.4 and FSx.5 cover the three FSx file-system types, ES.6, ES.7 and Opensearch.6 cover search domains, ElastiCache.3 covers cache failover, Neptune.9, Redshift.16, Redshift.18 and DMS.13 cover their respective engines, NetworkFirewall.1 covers firewall endpoints, and Lambda.5 covers VPC functions. They look like separate problems on the report, but they are one capability: no production workload should sit in a single zone.

The good news is that most single-AZ exposure is drift, not intent. A launch wizard left a database single-AZ, a Terraform module hard-coded one subnet, a cluster was promoted from a pilot without a resilience review. The job is to find every production workload that lives in one zone, decide which genuinely need a standby and which are throwaway, span the ones that matter across at least two AZs, and then enforce multi-AZ as the default so new resources arrive resilient.

In this lesson you will learn how AWS expresses high availability across databases, compute, load balancers, file systems, search domains and serverless, how to find every production workload that lives in a single Availability Zone, and how to span the ones that matter across zones without breaking the few that are intentionally single-AZ. The Controls this lesson covers section lists every Security Hub control in this capability, each linking to a deep page with the exact check and a copy-and-paste fix.

Fun fact

The morning us-east-1 reminded everyone that AZs fail

When a large AWS Availability Zone disruption hit us-east-1, teams running single-AZ resources in the affected zone watched databases, fleets and file shares go unreachable for hours, while teams that had paid the premium for multi-AZ on the same workloads saw AWS promote standbys and rebalance capacity in the healthy zones, with their endpoints flickering for a minute or two and then carrying on. The pattern repeats in every major zone event: AZs are engineered to fail independently, and the workloads that survive are the ones whose capacity could move when one zone went down. The annual multi-AZ premium on a critical workload is frequently less than the cost of a single multi-hour outage of the service it backs.

Finding single-AZ exposure across an estate

Priya is the platform lead at a scale-up preparing for its first SOC 2 audit. Security Hub shows high-availability failures spread across RDS, Auto Scaling, ELB and OpenSearch in three accounts that pre-date the team's current guardrails.

Rather than work the findings one by one, she starts with the resources that hold data and have the highest impact, listing which databases are still single-AZ so she can separate the production systems that need a standby from the dev instances that do not, before changing anything.

Start with the resources whose loss is most expensive. Single-AZ RDS instances are a common and high-impact finding.

$ aws rds describe-db-instances --query 'DBInstances[?MultiAZ==`false`].[DBInstanceIdentifier,Engine,AvailabilityZone]' --output table
-----------------------------------------------------
| prod-orders-db | postgres | us-east-1a |
| prod-auth-db   | postgres | us-east-1a |
| dev-sandbox-db | mysql    | us-east-1b |
-----------------------------------------------------
# Two production databases with no standby; the dev sandbox can stay single-AZ.

Single-AZ production databases are the highest-value target in this group. Cross-reference each against its environment tag, then fix the production ones first.
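The cross-reference against environment tags can be sketched as follows. This is a minimal sketch, not a definitive script: the tag key `Environment` and the value `prod` are assumptions about the tagging scheme, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# List single-AZ RDS instances together with their environment tag.
# ASSUMPTION: the tag key "Environment" is illustrative; substitute your own.
list_single_az_with_env() {
  aws rds describe-db-instances \
    --query 'DBInstances[?MultiAZ==`false`].[DBInstanceIdentifier, TagList[?Key==`"Environment"`].Value | [0]]' \
    --output text
}

# Keep only the rows whose second column marks production.
# ASSUMPTION: "prod" is the tag value used for production.
production_only() {
  awk '$2 == "prod" { print $1 }'
}

# Usage: list_single_az_with_env | production_only
```

Untagged instances show up with `None` in the second column, which is itself worth reviewing before any change is made.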

How AWS makes a workload survive a zone outage: deep dive

Most high-availability controls resolve to one of three mechanisms. The first is a standby or replica in a second zone with automatic failover, which is how RDS.5, FSx.3, FSx.4, FSx.5, ElastiCache.3, Neptune.9 and DMS.13 work: AWS keeps a second copy in another zone (synchronously for RDS Multi-AZ instances, through replicas for engines such as ElastiCache) and fails over automatically, repointing a stable DNS endpoint so applications reconnect without a configuration change. The second is spreading stateless capacity across zones, which is how AutoScaling.2, AutoScaling.6, ELB.10 and ELB.13 work: the subnet list you provide is the only thing telling the resource which zones it may use, so adding subnets in more zones is the act of opting into them. The third is node and replica placement across zones for clustered engines, which is how ES.6, ES.7, Opensearch.6, Redshift.16 and Redshift.18 work.
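When triaging a findings export, it can help to group control IDs by the mechanism that fixes them. The sketch below encodes the grouping from the paragraph above; the function name and the mechanism labels are hypothetical, and controls the paragraph does not classify fall through to `unclassified`.

```shell
#!/usr/bin/env bash
# Map a Security Hub control ID to the remediation mechanism described above.
# The labels are illustrative, not AWS terminology.
mechanism_for() {
  case "$1" in
    RDS.5|FSx.3|FSx.4|FSx.5|ElastiCache.3|Neptune.9|DMS.13)
      echo "standby-failover" ;;
    AutoScaling.2|AutoScaling.6|ELB.10|ELB.13)
      echo "subnet-spread" ;;
    ES.6|ES.7|Opensearch.6|Redshift.16|Redshift.18)
      echo "replica-placement" ;;
    *)
      echo "unclassified" ;;
  esac
}

# Usage: mechanism_for ELB.13
```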

Security Hub evaluates these through AWS Config, typically on a several-hour cycle, so a fix does not flip the finding to PASSED instantly even though the configuration change itself takes effect quickly. This matters when you are gathering audit evidence against a deadline. Some changes are also irreversible in place: FSx for Windows, OpenZFS and ONTAP deployment types are chosen at creation, so remediating FSx.3, FSx.4 or FSx.5 means building a new Multi-AZ file system and migrating the data, not toggling a flag.
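Because the FSx deployment type cannot be toggled, it is worth listing which file systems would need a rebuild before scheduling anything. A minimal sketch, assuming the single-AZ deployment type values all begin with `SINGLE_AZ`; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print file system ID, type and deployment type for every FSx file system.
# The deployment type lives under a different key per file-system type,
# so the query takes whichever one is present.
list_fsx_deployment_types() {
  aws fsx describe-file-systems \
    --query 'FileSystems[].[FileSystemId, FileSystemType, WindowsConfiguration.DeploymentType || OpenZFSConfiguration.DeploymentType || OntapConfiguration.DeploymentType]' \
    --output text
}

# Keep only file systems whose deployment type is a single-AZ variant;
# these are the build-new-and-migrate candidates.
single_az_fsx() {
  awk '$3 ~ /^SINGLE_AZ/ { print $1 }'
}

# Usage: list_fsx_deployment_types | single_az_fsx
```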

Some capabilities are split into two controls because the setting lives on a different resource type or addresses a different failure mode: RDS.5 (instances) versus RDS.15 (clusters), AutoScaling.2 (group spans AZs) versus AutoScaling.6 (multiple instance types across AZs for capacity diversity). The strongest end state is not just spanning resources today but enforcing multi-AZ as a provisioning default through infrastructure-as-code and AWS Config rules, so no production resource can be created single-AZ without a deliberate, recorded exception.

What is the impact of running production in a single AZ?

The direct impact is availability. A single-AZ resource has exactly one copy of itself in one zone. If that zone suffers a power, network, cooling or hardware failure, the kind of event AWS explicitly designs Regions to tolerate, the resource becomes unreachable. With no standby, a database can only be recovered by restoring from a backup into a healthy zone, a process that can take from tens of minutes to several hours and can lose writes made since the last backup. A single-AZ Auto Scaling group cannot launch replacement capacity while its only zone is impaired, so the fleet shrinks towards zero as instances fail or turn over.

The second-order impact is that planned maintenance becomes downtime too. On a multi-AZ database AWS patches the standby, fails over and then patches the old primary, turning a maintenance outage into a brief failover. On a single-AZ resource the same patch takes the workload offline for its duration. So single-AZ exposure costs you on both the unplanned and the planned axis.

On the compliance side, the frameworks most teams are audited against, including SOC 2, ISO 27001, HIPAA, PCI DSS and FedRAMP, carry availability or contingency requirements, and auditors commonly read them as expecting production workloads to withstand a single-zone failure. A passing set of high-availability controls across every account is defensible audit evidence; a scatter of single-AZ production resources is the wrong answer to a question auditors and enterprise customers now ask directly.

How do you make workloads multi-AZ safely?

Work the capability as one loop rather than chasing individual findings. The order matters: decide what genuinely needs resilience before you start changing topologies, and confirm the surrounding networking can reach the new zones so you do not take a live service offline.

1. Inventory every production workload that lives in one AZ

Across services, list the resources that are single-AZ: RDS instances and clusters, Auto Scaling groups, load balancers, FSx file systems, OpenSearch and ElastiCache deployments, Neptune, Redshift, DMS and VPC Lambda functions. Treat this inventory as the source of truth rather than the Security Hub finding count, because one workload can trigger several controls, and capture the environment tag for each so the next step has the data it needs.
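A starting point for the inventory might look like the sketch below. It covers only three of the services listed (Auto Scaling groups, v2 load balancers and ElastiCache replication groups); the function names are hypothetical, and the other services would need their own checks.

```shell
#!/usr/bin/env bash
# Auto Scaling groups that are limited to one Availability Zone.
single_az_asgs() {
  aws autoscaling describe-auto-scaling-groups \
    --query 'AutoScalingGroups[?length(AvailabilityZones)==`1`].AutoScalingGroupName' \
    --output text
}

# Application, Network and Gateway load balancers enabled in one zone only.
single_az_load_balancers() {
  aws elbv2 describe-load-balancers \
    --query 'LoadBalancers[?length(AvailabilityZones)==`1`].LoadBalancerName' \
    --output text
}

# ElastiCache replication groups without automatic failover enabled.
caches_without_failover() {
  aws elasticache describe-replication-groups \
    --query 'ReplicationGroups[?AutomaticFailover!=`"enabled"`].ReplicationGroupId' \
    --output text
}

# Emit one "check<TAB>resource" line per exposed resource.
inventory() {
  local check name
  for check in single_az_asgs single_az_load_balancers caches_without_failover; do
    for name in $("$check"); do
      printf '%s\t%s\n' "$check" "$name"
    done
  done
}

# Usage: inventory > single-az-inventory.tsv
```

One line per resource, rather than per finding, keeps the count aligned with workloads instead of controls.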

2. Assign a resilience tier to each workload

Decide per workload, not in bulk. Production, customer-facing and revenue-critical systems get multi-AZ; the premium is justified by automatic failover and zero-data-loss durability. Dev, test and disposable resources stay single-AZ by design. Record the decision against each resource with a tag so the choice is auditable and not re-litigated on every scan.
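Recording the decision as a tag could be done along these lines for RDS. This is a sketch under assumptions: the tag key `ResilienceTier` and its two allowed values are hypothetical conventions, not AWS-defined names.

```shell
#!/usr/bin/env bash
# Record the resilience tier on an RDS resource so the decision is auditable.
# ASSUMPTION: "ResilienceTier" and its values are an illustrative convention.
record_tier() {
  local arn="$1" tier="$2"
  case "$tier" in
    multi-az|single-az-by-design) ;;
    *) echo "unknown tier: $tier" >&2; return 1 ;;
  esac
  aws rds add-tags-to-resource --resource-name "$arn" \
    --tags "Key=ResilienceTier,Value=$tier"
}

# Usage: record_tier "$DB_ARN" single-az-by-design
```

Rejecting unknown values up front keeps the tag vocabulary small enough to query reliably later.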

3. Span the resources that matter, networking first

Before flipping a flag, confirm the VPC actually has suitable subnets in another zone, routed correctly and tagged for the workload; otherwise new capacity launches and immediately fails its health checks. For databases and caches, enable the standby; for fleets and load balancers, add subnets in two or more zones and mirror the set on the load balancer; for FSx, accept that the deployment type is fixed at creation and plan a build-new-and-migrate project. Prioritise resources that hold data.
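The networking pre-check can be sketched as a count of distinct zones that already have a subnet in the VPC. The function names are hypothetical, and the check says nothing about routing or tagging, which still need a manual look.

```shell
#!/usr/bin/env bash
# Print the distinct Availability Zones that have at least one subnet in a VPC.
subnet_azs() {
  aws ec2 describe-subnets --filters "Name=vpc-id,Values=$1" \
    --query 'Subnets[].AvailabilityZone' --output text | tr '\t' '\n' | sort -u
}

# Succeed only if the VPC has subnets in two or more zones.
has_two_zones() {
  [ "$(subnet_azs "$1" | grep -c .)" -ge 2 ]
}

# Usage: has_two_zones "$VPC_ID" || echo "add a subnet in a second AZ first"
```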

4. Ratchet it in with defaults and guardrails

Make multi-AZ the default in your CloudFormation and Terraform modules so new production resources arrive resilient, and back it with AWS Config rules (for example autoscaling-multiple-az and rds-multi-az-support) and Service Control Policies so the posture cannot quietly drift back to one zone. For the resources you intentionally leave single-AZ, record a documented exception rather than ignoring the finding.

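# Fix the highest-impact data stores first: enable Multi-AZ on production databases.
# Assumes production instances carry the tag Environment=prod; adjust the filter to your tagging scheme.
# --apply-immediately starts the change now rather than in the next maintenance window.
for db in $(aws rds describe-db-instances \
    --query 'DBInstances[?MultiAZ==`false` && DBClusterIdentifier==`null` && TagList[?Key==`"Environment"` && Value==`"prod"`]].DBInstanceIdentifier' --output text); do
  aws rds modify-db-instance --db-instance-identifier "$db" \
    --multi-az --apply-immediately
  echo "$db: standby being provisioned in a second AZ"
done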

# Span a stateless compute fleet across three AZs, then mirror the set on its load balancer.
aws autoscaling update-auto-scaling-group --auto-scaling-group-name web-tier-asg \
  --vpc-zone-identifier "subnet-0aaa1,subnet-0bbb2,subnet-0ccc3"
aws elbv2 set-subnets --load-balancer-arn "$ALB_ARN" \
  --subnets subnet-0aaa1 subnet-0bbb2 subnet-0ccc3
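For the guardrail step, the two AWS managed Config rules named in step 4 could be enabled roughly as follows. A minimal sketch: it assumes an AWS Config configuration recorder is already running in the account and Region, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Enable one AWS managed Config rule by rule name and managed-rule identifier.
enable_managed_rule() {
  local name="$1" identifier="$2"
  aws configservice put-config-rule --config-rule \
    "{\"ConfigRuleName\":\"$name\",\"Source\":{\"Owner\":\"AWS\",\"SourceIdentifier\":\"$identifier\"}}"
}

# Turn on the multi-AZ checks for RDS instances and Auto Scaling groups.
enable_multi_az_guardrails() {
  enable_managed_rule rds-multi-az-support RDS_MULTI_AZ_SUPPORT
  enable_managed_rule autoscaling-multiple-az AUTOSCALING_MULTIPLE_AZ
}

# Usage: enable_multi_az_guardrails
```

Config rules detect drift after the fact; preventing single-AZ creation outright still relies on module defaults and policy.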

Quick quiz

Security Hub shows high-availability failures across RDS, Auto Scaling, ELB and OpenSearch. What is the most efficient way to think about them?

Keep learning

Go deeper on how high availability works across the services in this capability.

You can now treat high availability as one capability rather than a scatter of findings: inventory every production workload that lives in a single zone, tier it by business importance, span the ones that matter across at least two AZs (building new where the deployment type is fixed at creation), and ratchet the estate shut with multi-AZ defaults and Config guardrails. The Controls this lesson covers section below links every control in this group to its deep page and fix.

High availability across AZs: the cost and risk view

A tiering decision where the premium is predictable and the downside is an outage

Multi-AZ is one of the few resilience capabilities that carries a real, recurring cost rather than being free to enable. A Multi-AZ RDS instance roughly doubles its compute and storage charge because you pay to run the standby, a Multi-AZ FSx file system runs a second file server, and spanning compute or databases across zones adds cross-AZ data-transfer charges. So this is not a blanket on switch: it is a tiering decision about which workloads are important enough to insure.

Frame each failing control by what depends on it, not by the control count. A single-AZ database behind checkout carries a far higher expected loss than a single-AZ dev sandbox, yet both show up as one red finding. Map the failures to the business workload behind them and prioritise by exposure. Production and customer-facing systems almost always justify the premium; dev, test and disposable environments usually do not, and should be recorded as deliberate single-AZ exceptions rather than left as silent gaps.

The premium is unusually clean to model because it is structural, not variable: roughly 2x the per-resource charge plus cross-AZ transfer for chatty workloads. Against that sits the documented cost of a zone-outage incident, lost revenue per hour, SLA credits, incident response and recovery. For revenue-critical systems the maths is one-sided, which makes the spend easy to approve once it is framed as insurance with a known premium.
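The structural premium can be written down as simple arithmetic. In the sketch below, "roughly 2x" is read as one extra copy of the current single-AZ charge; the function name and the dollar figures in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Estimate the annual multi-AZ premium in whole dollars.
#   $1  current monthly single-AZ charge (the standby costs roughly this again)
#   $2  estimated monthly cross-AZ data-transfer charge
annual_premium() {
  local monthly_resource="$1" monthly_transfer="$2"
  echo $(( (monthly_resource + monthly_transfer) * 12 ))
}

# Usage (illustrative figures): annual_premium 400 25
```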

This lesson is for the finance partner who sees a cluster of high-availability findings on the security report and wants to know what the right response is and what it costs. It covers why multi-AZ is a tiering decision rather than a blanket mandate, how to estimate the premium per resource, which workloads justify it, and how to turn a list of red findings into a risk-ordered, budgeted remediation plan with documented exceptions.

How a finance partner frames the multi-AZ decision

Priya is the finance partner supporting the platform team ahead of a SOC 2 audit. The security report shows a dozen high-availability failures across RDS, Auto Scaling, ELB and OpenSearch in three older accounts. Unlike most security findings, this one has a real recurring price tag, so she does not approve a blanket fix. She asks the tiering question first: which of these single-AZ resources actually carry business risk if a zone goes down, and which are dev sandboxes where an outage is an inconvenience.

The team pulls the resource list with environment tags. Two production databases sit behind checkout and authentication with no standby, an OpenSearch domain backs the customer search experience, and the rest are dev and staging instances. Priya prices the premium precisely because it is structural, not variable: roughly 2x the per-resource charge for the production database standbys plus a small amount of cross-AZ transfer she can estimate from existing CloudWatch network metrics. Her finance pack frames it as insurance with a known premium: 'The multi-AZ premium on the two production databases and the search domain is a small, predictable annual line against the open-ended cost of a multi-hour zone outage of checkout. The dev resources stay single-AZ by design and are recorded as deliberate exceptions, because paying the premium on them would be waste.'

Why single-AZ exposure belongs on the risk register

The cost model here is unusually clean to forecast because the premium is structural, not variable. Multi-AZ on a database roughly doubles its charge, multi-AZ on a file system runs a second server, and spanning compute across zones adds cross-AZ transfer you can estimate from existing CloudWatch network metrics. So for any workload you can price the resilience premium precisely up front, rather than discovering it later on the bill.

Against that predictable premium sits an open-ended tail risk. A single-AZ production resource is a quantifiable exposure: the cost of a zone-outage incident is the revenue lost during a multi-hour recovery plus SLA credits and reputational damage, against an annual premium that is typically a small fraction of that. The finance role is to attach expected loss to each failing production resource so the work is prioritised by risk, to record every intentional single-AZ exception with an owner, and to budget the premium on critical systems as a deliberate resilience line rather than letting it surface as an unexplained bill increase.
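Attaching an expected loss to a resource can be sketched as outage cost multiplied by an assumed yearly likelihood. Every input here is a hypothetical estimate supplied by the team, not a figure from AWS, and the function names are illustrative.

```shell
#!/usr/bin/env bash
# Expected annual loss in whole dollars for one single-AZ production resource.
#   $1  expected recovery time in hours
#   $2  revenue lost per hour of outage
#   $3  SLA credits and incident cost per event
#   $4  assumed chance of a zone event per year, as a whole percentage
expected_annual_loss() {
  local hours="$1" revenue_per_hour="$2" fixed_cost="$3" probability_pct="$4"
  echo $(( (hours * revenue_per_hour + fixed_cost) * probability_pct / 100 ))
}

# Succeed when the expected loss exceeds the annual premium ($1).
premium_justified() {
  local premium="$1"; shift
  [ "$(expected_annual_loss "$@")" -gt "$premium" ]
}

# Usage (illustrative figures): premium_justified 5100 4 20000 10000 10
```

The comparison is deliberately crude; it leaves out reputational damage, which only pushes the result further towards paying the premium.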

What finance can actually do about high availability

Finance cannot enable a standby, but it owns the framing that makes multi-AZ a deliberate, tiered spend rather than either waste or an uninsured gap. There are three levers, each applied at a regular review cadence.

1. Make resilience tier a budgeting input

Agree a simple rule with engineering: production-critical workloads are budgeted with the multi-AZ premium included, lower environments are not. That turns the premium from a surprise on the bill into a planned line item, and makes any exception (a critical dev system, or a production system deliberately left single-AZ) an explicit ask rather than a silent default.

2. Track failing production resources alongside their dollar cost

Put the count of single-AZ production resources, and the estimated premium to remediate them, on the security-and-cost review. The number that matters is failing production resources, not total fails, because dev and test single-AZ is expected. This keeps the conversation on insurable risk rather than raw finding counts.
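The review metric could be produced with a sketch like the one below, shown for RDS only. It assumes production instances carry the tag `Environment=prod`, which is a hypothetical convention, and the premium figure is passed in from whatever estimate the team already holds.

```shell
#!/usr/bin/env bash
# Count single-AZ RDS instances tagged as production.
# ASSUMPTION: the tag Environment=prod is illustrative; adjust to your scheme.
count_single_az_prod_databases() {
  aws rds describe-db-instances \
    --query 'length(DBInstances[?MultiAZ==`false` && TagList[?Key==`"Environment"` && Value==`"prod"`]])' \
    --output text
}

# Print one line for the security-and-cost review.
#   $1  estimated annual premium to remediate, in whole dollars
review_line() {
  printf 'single-AZ production databases: %s; estimated annual premium to remediate: %s\n' \
    "$(count_single_az_prod_databases)" "$1"
}

# Usage (illustrative figure): review_line 5100
```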

3. Price the premium against the outage it prevents

Frame the multi-AZ premium on a critical workload as insurance: a small, predictable annual cost against the much larger cost of a multi-hour zone outage. For revenue-critical systems the maths is almost always one-sided, and saying so explicitly makes the spend easy to approve and every intentional single-AZ exception easy to challenge.

Quick quiz

Why is multi-AZ different from most security findings when finance assesses the cost of remediation?

Keep learning

Go deeper on how high availability works across the services in this capability.

You have finished the finance view of high availability. You know multi-AZ is the rare resilience capability with a real, predictable premium (roughly 2x per resource plus cross-AZ transfer), which makes it a tiering decision rather than a blanket switch: insure the production-critical workloads, leave the disposable ones single-AZ by design, and record every exception with an owner. Next time the findings land, you will price the premium up front, track failing production resources rather than raw counts, and frame the spend as cheap insurance against an open-ended outage.

High availability across AZs: the headline

Whether the business survives a datacentre outage without a manual scramble

Many cloud resources run inside a single Availability Zone, in effect one datacentre, unless you configure them across several. If that one zone has an event, a single-AZ database, fleet or file share goes dark and recovery is a manual, multi-hour effort, often with data loss. The report shows this as a scatter of separate findings across databases, compute, storage and networking, but the underlying question is one: which of our production systems are one datacentre outage away from going down?

Unlike most security findings this one is partly a cost decision, because multi-AZ carries a recurring premium on the resources that need it. So the leadership question is not "are we compliant today?" It is "is every production-critical workload resilient by design, with the few intentional single-AZ exceptions on the record?" The defensible end state is multi-AZ by default for production, deliberate exceptions for low-stakes systems, and a guardrail so resilience cannot drift back to one zone unnoticed.

This is the rare class of control where the trade is explicit: a predictable, bounded premium against an open-ended outage risk. For the systems that matter, that is cheap insurance, and it should be a decision leadership makes deliberately rather than one inherited from whoever clicked through a launch wizard.

A short read for the leader who needs to know what single-AZ exposure risks, why multi-AZ is a per-workload tiering decision because it carries a recurring premium, and what a defensible end state looks like across the estate: production resilient by default, exceptions on the record, enforced by policy.

What it looks like when resilience is a deliberate tier, not an accident

When a large Availability Zone disruption hit a major Region, the leadership team watched a competitor's checkout go dark for hours while their own customer-facing systems flickered for a minute and carried on. At the next review the CEO asked the question the outage had made concrete: if a datacentre went down for us, which of our systems would survive, and which would need a manual scramble?

The honest answer at the time was a mix, because resilience had been inherited from whoever clicked through each launch wizard rather than decided deliberately. The team's response was to make resilience an explicit per-workload tier. Production and revenue-critical systems were spanned across at least two zones, with automatic failover in about a minute; dev and disposable resources were left single-AZ on purpose and recorded as deliberate exceptions. The default in the infrastructure-as-code modules was changed so new production resources arrive resilient, backed by Config rules so the posture cannot drift. The next time the question came up, the answer was no longer a shrug but a tracked list: every production-critical workload multi-AZ, every exception on the record. That is the difference between resilience by policy and resilience by accident, and the premium that buys it is frequently less than the cost of a single multi-hour outage of the service it protects.

Why this is a board-level risk

Single-AZ exposure is a direct proxy for a question every executive eventually gets asked: if a datacentre goes down, do our critical systems stay up? For each single-AZ production workload the honest answer is no, recovery would be a manual scramble. Multi-AZ makes the answer yes, with automatic failover in about a minute, and the high-availability controls turn that abstract question into a concrete, tracked list.

What makes this a leadership item rather than a pure engineering one is the cost trade-off: multi-AZ carries a recurring premium, so it cannot be a blanket mandate. The healthy signal is not zero findings; it is that every production-critical workload is resilient by design and every exception is a recorded, deliberate decision. That is the difference between resilience by policy and resilience by accident.

The leadership move on high availability

The executive handle is not to mandate multi-AZ everywhere, which would waste money on workloads that do not need it. It is to require that every workload's resilience tier is a deliberate, recorded decision matched to its business importance.

1. Set a default: production is multi-AZ

Make it policy that anything customer- or revenue-facing spans at least two zones unless there is a documented reason not to. A clear default removes the per-resource debate and ensures the systems that matter are protected by design rather than by chance.

2. Accept intentional single-AZ for low-stakes systems

Do not drive the finding count to zero. Dev, test and disposable resources are correctly single-AZ, and paying the premium on them is waste. The goal is the right tier for each system, with each exception documented and owned, not uniform spend.

3. Ask for the trend on protected critical workloads

At the leadership review the one-line question is whether all production-critical workloads are multi-AZ with every exception documented. A consistent yes means resilience is governed by policy: a one-minute confidence signal that needs no technical depth.

Quick quiz

After a zone disruption took out a competitor's checkout, your CEO asks whether your critical systems would survive a datacentre outage. What is the defensible end state to be able to point to?

Keep learning

Go deeper on how high availability works across the services in this capability.

Two takeaways. Multi-AZ is the rare control where the trade is explicit: a predictable, bounded premium against an open-ended outage risk, so it is a per-workload tiering decision, not a blanket mandate. And the healthy signal is not zero findings; it is that every production-critical workload is resilient by design, every exception is a deliberate recorded decision, and a guardrail stops the posture drifting back to one zone unnoticed.

Controls this lesson covers

One capability, many AWS Security Hub controls. This lesson is the shared playbook; each control below keeps its own deep page with the exact check, severity and a copy-and-paste fix.

AutoScaling

DMS

ElastiCache

ELB

ES

FSx

Lambda

Neptune

NetworkFirewall

Opensearch

RDS

Redshift
