
Consolidate redundant NAT Gateways

Many VPCs run more NAT Gateways than they need — each one bills ~$32/month just to exist, so collapse the duplicates without breaking your AZ-resilience or cross-AZ data-transfer math.

12 min · 10 sections · AWS

Redundant NAT Gateways: the basics

When does a VPC have more NAT Gateways than it needs?

A NAT Gateway lets resources in a private subnet reach the internet outbound without being reachable inbound. It bills in two parts: a flat $0.045 per hour (about $32.40 per gateway over a 720-hour month, at us-east-1 rates) just for being provisioned, plus $0.045 per GB of traffic processed. The hourly charge accrues whether or not the VPC actually needs that gateway. This lesson isn't about a NAT with zero traffic; it's about a VPC carrying more NATs than its design requires, each one quietly adding $32 to the bill.

The over-provisioning patterns are predictable. A NAT per private subnet when one per AZ would do. Two or more NATs in the same Availability Zone — pure duplication, zero added resilience. A dev or test VPC built from the production template that inherited a NAT-per-AZ layout it never needed. Or leftover NATs from an old subnet design where the routes were re-pointed but the original gateways were never torn down. In every case multiple gateways exist where fewer would carry the same traffic with the same availability.

It's flagged because the redundancy is invisible until someone groups the gateways by VPC and AZ and reads the route tables. The classic best practice — one NAT per AZ — exists for a reason, so consolidation is a genuine tradeoff rather than free money: it isn't always safe to collapse to a single gateway. The skill is telling true duplication from legitimate per-AZ redundancy, and knowing which environments tolerate which.

In this lesson you'll learn how to spot a VPC carrying more NAT Gateways than its design needs — duplicates within an AZ, a NAT-per-subnet sprawl, non-prod VPCs inheriting prod-grade redundancy — and how to consolidate them without breaking egress. You'll see how to group gateways by VPC and AZ to find duplication, how to read route tables to learn which subnets depend on which NAT, and the central tradeoff: collapsing to one NAT saves $32/month per removed gateway but can add cross-AZ data-transfer charges ($0.01/GB each way) and shrink your failure domain to a single AZ.

Fun fact

Two NATs, one zone, zero extra resilience

An audit of a media company's main account found a VPC running four NAT Gateways: one each in us-east-1a and us-east-1b — and two in us-east-1c. The second us-east-1c gateway was created during a failed Terraform apply 18 months earlier; nothing routed to it, and even if something had, a second gateway in the same AZ adds no availability — an AZ outage takes both down together. It had billed roughly $580 in pure provisioning charges for redundancy that, by definition, could never exist. The fix was a one-line route check and a single delete-nat-gateway call.

Consolidating NAT Gateways in action

Lena runs quarterly VPC hygiene for a logistics company with 60-odd AWS accounts. Her tooling flags the orders-dev VPC: three NAT Gateways, one per AZ, on a development environment that only ever needs egress to pull packages and hit a couple of APIs. Three gateways is $97/month of provisioning for a sandbox that has no uptime SLA at all.

Before touching anything she groups every NAT in the VPC by AZ, then walks the route tables to learn which private subnets point at which gateway. Two findings: the per-AZ pattern was copied straight from the production template, and one of the three NATs has a route table whose subnets are all empty — a leftover from a subnet redesign last year.

The plan splits by environment. For orders-dev she keeps a single NAT and re-points the other two AZs' route tables at it — accepting the small cross-AZ transfer charge because dev egress volume is tiny and there's no resilience requirement. In production she does the opposite: she leaves the legitimate one-per-AZ layout alone and only removes the genuine duplicate and the orphaned NAT. Net: three gateways removed across the two VPCs, ~$97/month saved, no production failure domain touched.

First, list every NAT Gateway in the VPC and group by AZ — the fastest way to spot same-AZ duplicates and over-provisioning.

$ aws ec2 describe-nat-gateways --filter Name=vpc-id,Values=vpc-0orders123dev --query 'NatGateways[?State==`available`].{Nat:NatGatewayId,Subnet:SubnetId,Created:CreateTime}' --output table
-----------------------------------------------------------------------
|                         DescribeNatGateways                         |
+----------------------+----------------------+-----------------------+
| Created              | Subnet               | Nat                   |
+----------------------+----------------------+-----------------------+
| 2024-11-02           | subnet-0aaa (us-1a)  | nat-0aaa111           |
| 2024-11-02           | subnet-0bbb (us-1b)  | nat-0bbb222           |
| 2024-11-02           | subnet-0ccc (us-1c)  | nat-0ccc333           |
+----------------------+----------------------+-----------------------+
# The query returns each NAT's subnet ID; the AZ in parentheses is annotated here for readability.
# 3 NATs, one per AZ, on a DEV VPC with no uptime SLA: over-provisioned.

Grouping by AZ exposes both same-AZ duplicates and per-AZ sprawl on environments that don't need it.
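Because the NAT listing is keyed by subnet rather than by zone, the grouping step needs a subnet-to-AZ lookup. The following is a minimal sketch of that join, assuming two whitespace-separated text files; the function name `count_nats_per_az` and the file names are hypothetical, not part of any AWS tooling.

```shell
# Sketch: count NAT Gateways per Availability Zone and flag same-AZ duplicates.
# Hypothetical inputs (whitespace- or tab-separated, one record per line):
#   $1: "<nat-id> <subnet-id>"
#   $2: "<subnet-id> <az>"
count_nats_per_az() {
  awk '
    NR == FNR { az[$1] = $2; next }            # first file read: subnet -> AZ map
    {
      zone = ($2 in az) ? az[$2] : "unknown"   # second file read: NAT -> subnet
      n[zone]++
    }
    END {
      for (z in n) printf "%s %d%s\n", z, n[z], (n[z] > 1 ? " DUPLICATE" : "")
    }
  ' "$2" "$1" | sort
}

# One possible way to produce the two files with the AWS CLI (not run here):
#   aws ec2 describe-nat-gateways \
#     --filter "Name=vpc-id,Values=$VPC_ID" "Name=state,Values=available" \
#     --query 'NatGateways[].[NatGatewayId,SubnetId]' --output text > nats.txt
#   aws ec2 describe-subnets --filters "Name=vpc-id,Values=$VPC_ID" \
#     --query 'Subnets[].[SubnetId,AvailabilityZone]' --output text > subnets.txt
#   count_nats_per_az nats.txt subnets.txt
```

Any zone printed with the DUPLICATE marker hosts more than one NAT and is a candidate for the same-AZ consolidation described in this lesson.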

Now find which subnets route through each NAT — you must re-point these before deleting, or egress breaks for anything that subnet hosts.

$ aws ec2 describe-route-tables --filters Name=route.nat-gateway-id,Values=nat-0ccc333 --query 'RouteTables[].{RT:RouteTableId,Subnets:Associations[?SubnetId!=null].SubnetId}'
[
  {
    "RT": "rtb-0c1c1c1c1c1",
    "Subnets": ["subnet-0ccc-private"]
  }
]
# subnet-0ccc-private (a private subnet in 1c, distinct from the public subnet that hosts the NAT)
# routes through nat-0ccc333; re-point it to nat-0aaa111 first.

Read the route tables before deleting — re-point dependent subnets to the surviving NAT, then tear the redundant one down.
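The same route-table filter can be turned around to find gateways that nothing routes to. This is a rough sketch under the assumption that the AWS CLI is installed and configured; the function name `find_unrouted_nats` is hypothetical, and the result is a list of candidates to review, not a verdict.

```shell
# Sketch: print available NAT Gateways in a VPC that no route table targets.
# Usage: find_unrouted_nats <vpc-id>
find_unrouted_nats() {
  vpc_id="$1"
  for nat in $(aws ec2 describe-nat-gateways \
      --filter "Name=vpc-id,Values=$vpc_id" "Name=state,Values=available" \
      --query 'NatGateways[].NatGatewayId' --output text); do
    # Number of route tables holding at least one route that points at this NAT.
    count=$(aws ec2 describe-route-tables \
      --filters "Name=route.nat-gateway-id,Values=$nat" \
      --query 'length(RouteTables)' --output text)
    if [ "$count" = "0" ]; then
      echo "$nat"
    fi
  done
}
```

A gateway that is still referenced by a route table with no associated subnets, or only empty ones, will not show up here, so that case still needs the manual route-table read described above.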

How NAT redundancy and the cross-AZ trap actually work: deep dive

A NAT Gateway is a managed service that you create in one specific subnet, where it places a network interface, so it lives inside a single Availability Zone. That single-AZ scope is the whole reason the per-AZ pattern exists: if the AZ hosting your only NAT goes down, every private subnet that routes through it loses egress, even subnets in healthy AZs. Running one NAT per AZ, with each AZ's private subnets routing to the NAT in their own AZ, means a single AZ failure only takes out egress for that AZ's workloads, which were already affected. This is genuine resilience, not waste, which is why blindly collapsing to one gateway is wrong for production.

The second reason same-AZ routing matters is money, and it's the trap of careless consolidation. When a private subnet in us-east-1b routes to a NAT in us-east-1a, its traffic crosses an AZ boundary and incurs cross-AZ data-transfer charges of ~$0.01/GB each way: once on the way to the NAT and again on the return path. Collapse three NATs to one and you save 2 × $32.40 = $64.80/month in provisioning, but you've just routed two AZs' worth of traffic across zone boundaries. Counting the charge in both directions (about $0.02 per GB), a workload pushing 4 TB/month across the boundary now pays roughly $80/month in cross-AZ transfer it didn't pay before, which is more than the provisioning you saved.
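The trade-off above is simple enough to compute before touching any routes. The sketch below uses the lesson's figures as assumptions ($0.045 per gateway-hour, a 720-hour month, and cross-AZ traffic counted at $0.01/GB in each direction); the function names are hypothetical, and current pricing for your region should be checked before relying on the numbers.

```shell
# Sketch: weigh provisioning savings against cross-AZ transfer after consolidation.
# All rates are the lesson's figures, treated here as assumptions.
HOURLY_USD=0.045     # per NAT Gateway hour
HOURS=720            # 30-day month, matching the $32.40 figure
CROSS_AZ_USD=0.02    # per GB, counting $0.01 in each direction

# Monthly cross-AZ traffic (GB) at which the provisioning saving is fully consumed.
# Usage: breakeven_gb <gateways_removed>
breakeven_gb() {
  awk -v n="$1" -v h="$HOURLY_USD" -v m="$HOURS" -v x="$CROSS_AZ_USD" \
    'BEGIN { printf "%.0f\n", (n * h * m) / x }'
}

# Net monthly change in USD; negative means consolidation costs more than it saves.
# Usage: net_monthly_usd <gateways_removed> <cross_az_gb_per_month>
net_monthly_usd() {
  awk -v n="$1" -v gb="$2" -v h="$HOURLY_USD" -v m="$HOURS" -v x="$CROSS_AZ_USD" \
    'BEGIN { printf "%.2f\n", (n * h * m) - (gb * x) }'
}

breakeven_gb 2          # two gateways removed
net_monthly_usd 2 4000  # two removed, 4,000 GB/month now crossing AZs
net_monthly_usd 2 200   # two removed, 200 GB/month now crossing AZs
```

With these assumed rates the three calls print 3240, -15.20, and 60.80: removing two gateways pays off only while less than about 3.2 TB/month ends up crossing zones.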

So the decision is data-volume- and environment-dependent. Same-AZ duplicates (two NATs in one AZ) are always safe to consolidate — the second adds neither resilience nor different routing. NATs that no subnet routes to are always safe to delete. Per-AZ NATs in production should usually stay. Per-AZ NATs in low-volume non-prod are the sweet spot for consolidation: the cross-AZ charge is negligible at low traffic and there's no resilience requirement to protect.

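# List every available NAT Gateway in the current region, sorted by VPC and subnet,
# to spot duplicates fast. A list projection keeps VpcId as the first column;
# with a {key: value} projection, text output orders columns by key name instead.
aws ec2 describe-nat-gateways \
  --filter Name=state,Values=available \
  --query 'NatGateways[].[VpcId, SubnetId, NatGatewayId]' \
  --output text | sort

# Map each subnet to its Availability Zone so the rows above can be grouped by AZ.
aws ec2 describe-subnets \
  --query 'Subnets[].[SubnetId, AvailabilityZone]' \
  --output text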

# For a candidate NAT, list every subnet whose route table points at it.
# These must be re-pointed to the surviving NAT BEFORE deletion, or egress breaks.
aws ec2 describe-route-tables \
  --filters Name=route.nat-gateway-id,Values=nat-0ccc333 \
  --query 'RouteTables[].Associations[?SubnetId!=null].SubnetId'

What's the impact of running redundant NAT Gateways?

The direct cost is $32.40 per redundant gateway per month. One duplicate is rounding error; a fleet-wide pattern is not. A 60-account org that copies a 3-NAT-per-VPC template onto every environment — including dozens of dev, staging, and sandbox VPCs that need no AZ resilience — is carrying hundreds of gateways, a large fraction of them redundant by design. At $32/month each, trimming non-prod from three gateways to one across 40 VPCs is ~$2,600/month of pure provisioning saved with no functional change.
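The fleet-wide figure is worth recomputing for your own estate. Here is a small sketch of that arithmetic, assuming the lesson's $0.045 hourly rate and a 720-hour month; the function name `fleet_savings_usd` is hypothetical.

```shell
# Sketch: monthly provisioning saved by trimming NAT Gateways across many VPCs.
# Usage: fleet_savings_usd <vpc_count> <current_nats_per_vpc> <target_nats_per_vpc>
fleet_savings_usd() {
  awk -v v="$1" -v cur="$2" -v tgt="$3" 'BEGIN {
    monthly = 0.045 * 720          # USD per gateway per month (assumed rate)
    printf "%.2f\n", v * (cur - tgt) * monthly
  }'
}

fleet_savings_usd 40 3 1   # 40 non-prod VPCs, three gateways down to one
```

For the 40-VPC example this prints 2592.00, which is the ~$2,600/month quoted above. It covers provisioning only and ignores any cross-AZ transfer the consolidation introduces.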

The most expensive mistake here is over-consolidating and triggering cross-AZ data transfer. If a team collapses three NATs to one to chase the provisioning saving, every byte from the other two AZs' subnets now crosses zone boundaries at ~$0.01/GB each way. For a chatty, high-egress workload this variable charge can dwarf the $64/month they saved — turning a tidy win into a net loss that only shows up on next month's data-transfer line, where nobody's looking.

There's a resilience cost that doesn't appear on the bill at all. Collapse production to a single NAT and you've created a single-AZ failure domain for egress: an AZ outage now takes down outbound connectivity for every private subnet, including those in AZs that are otherwise healthy. The $64/month saved is invisible against the cost of an avoidable multi-AZ egress outage during an incident. This is why production per-AZ NATs are usually left alone — the redundancy is the point.

Finally, NAT sprawl clutters the network the way any unmanaged primitive does. Extra gateways mean extra Elastic IPs, extra route-table entries, and a more complicated diagram to reason about during an incident or a PCI/SOC 2 audit. Consolidating the genuinely redundant ones isn't only a cost story — it's a simpler, more defensible network.

How do you consolidate NAT Gateways safely?

Consolidation is a four-step loop per VPC: map the gateways by AZ, decide the target topology for that environment, re-point routes to the survivor, then delete the redundant gateways and release their EIPs. The decision is environment-dependent — aggressive in non-prod, surgical in prod.

1. Map every NAT by VPC and AZ to find true redundancy

Group all gateways by VPC and Availability Zone. Two NATs in the same AZ are always redundant — the second adds no resilience and no different routing. A NAT-per-subnet layout where one-per-AZ would do is over-provisioned. Any NAT that no route table points at is dead. These three categories are the consolidation candidates; per-AZ NATs serving live subnets in their own AZ are not, until you've decided the target topology.

2. Pick the target topology per environment, not globally

Production usually keeps one NAT per AZ — the redundancy protects against a zone outage and keeps traffic same-AZ to avoid transfer charges. Non-prod (dev, staging, sandbox) can almost always collapse to a single NAT: there's no uptime SLA, and low egress volume makes the resulting cross-AZ transfer charge negligible. Decide the target per VPC before touching anything, and write it down so the next audit doesn't re-flag a deliberate choice.

3. Re-point dependent subnets to the surviving NAT first

For every subnet that routes through a NAT you intend to delete, update its route table's 0.0.0.0/0 route to point at the surviving gateway — before deletion, not after. Use replace-route, confirm egress still works from a test instance, then proceed. Re-pointing a subnet to a NAT in a different AZ is the moment cross-AZ transfer starts billing; that's fine in low-volume non-prod and worth re-checking in anything chatty.

4. Delete the redundant NAT, release its EIP, then watch

Run delete-nat-gateway, wait for the gateway to reach the deleted state, then run release-address on its Elastic IP; an orphaned EIP keeps billing ~$3.65/month. Watch CloudWatch and app alerting for 24–48 hours in case a long-tail dependency surfaces. Record the target topology decision in your change log, and remove the gateway from its Terraform definition if it has one, so it isn't quietly recreated by the next Terraform apply.

# 1. Re-point the dependent subnet's route to the SURVIVING NAT, before deleting.
aws ec2 replace-route \
  --route-table-id rtb-0c1c1c1c1c1 \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id nat-0aaa111

# 2. Now the redundant NAT has no dependents — delete it.
aws ec2 delete-nat-gateway --nat-gateway-id nat-0ccc333
aws ec2 wait nat-gateway-deleted --nat-gateway-ids nat-0ccc333

# 3. Release its Elastic IP, or it keeps billing ~$3.65/month standalone.
aws ec2 release-address --allocation-id eipalloc-0xxxxxxxxx
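The release step needs the allocation ID of the Elastic IP, which is easiest to capture before the gateway is deleted. The following sketch wraps the sequence above in two functions; it assumes a configured AWS CLI, the function names are hypothetical, and it should only be run after the routes have been re-pointed.

```shell
# Sketch: print the Elastic IP allocation IDs attached to a NAT Gateway.
# Usage: nat_allocation_ids <nat-gateway-id>
nat_allocation_ids() {
  aws ec2 describe-nat-gateways \
    --nat-gateway-ids "$1" \
    --query 'NatGateways[0].NatGatewayAddresses[].AllocationId' \
    --output text
}

# Sketch: delete a NAT Gateway, wait for deletion, then release its EIPs.
# Usage: delete_nat_and_release_eips <nat-gateway-id>
delete_nat_and_release_eips() {
  nat_id="$1"
  # Capture the allocation IDs first, while the gateway still exists.
  alloc_ids=$(nat_allocation_ids "$nat_id") || return 1
  aws ec2 delete-nat-gateway --nat-gateway-id "$nat_id" || return 1
  aws ec2 wait nat-gateway-deleted --nat-gateway-ids "$nat_id" || return 1
  for alloc in $alloc_ids; do
    # Text output prints "None" for an address with no allocation ID
    # (for example on a private NAT Gateway); skip those.
    if [ "$alloc" != "None" ]; then
      aws ec2 release-address --allocation-id "$alloc"
    fi
  done
}
```

Calling `delete_nat_and_release_eips nat-0ccc333` would replace steps 2 and 3 above. Deleting the gateway disassociates its Elastic IP but does not release it, which is why the release is a separate call.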

Quick quiz

A production VPC runs three NAT Gateways — one each in us-east-1a, 1b, and 1c — plus a fourth in 1c left over from a failed deploy. Each AZ's private subnets route to the NAT in their own AZ; nothing routes to the fourth. The workload is high-egress. What's the right move?

You've completed Consolidate redundant NAT Gateways. You now know how to spot a VPC carrying more gateways than it needs — same-AZ duplicates, per-subnet sprawl, non-prod inheriting prod-grade redundancy, and orphans nothing routes to — and the four-step loop to consolidate safely: map by AZ, choose the target topology per environment, re-point routes, then delete and release. Most importantly, you can weigh the ~$32/month-per-gateway saving against the cross-AZ data-transfer charge and the single-AZ failure domain, so you consolidate aggressively in non-prod and surgically in prod.
