Monitoring

Add CloudWatch alarms to load balancers

Load balancers see every request - alarms on 5xx rates and unhealthy host count catch outages before customers do.

13 min·10 sections·AWS

Last reviewed 27 May 2026

Load balancer alarms: the basics

What does it mean for a load balancer to be "unmonitored"?

Every request that hits your application passes through a load balancer first. ALBs terminate HTTPS, route by path or host, and spread traffic across targets; NLBs distribute raw TCP/UDP connections at L4. Either way, the LB is the only component that sees the full picture of what's working and what isn't - it's the single most useful place in the stack to attach alarms.

An "unmonitored" LB is one with no CloudWatch alarms wired to its key metrics: 5xx rates, unhealthy host count, target response time, rejected connection count. When the LB is silent, the first signal of an outage is a customer complaint, an external uptime check, or a sales rep noticing the demo environment is down. By the time you find out, the incident clock has already been running for minutes or hours.

AWS check COV-004 flags ELB and ELBv2 resources that have zero CloudWatch alarms associated with them. The finding fires per-LB at HIGH severity because a load balancer in production without alarms is a structural visibility gap, not a tuning problem - and it's almost always an oversight rather than a deliberate choice.

In this lesson you'll learn which four alarms every production ALB and NLB should have, how ALB and NLB metric names differ, why the distinction between target 5xx and ELB 5xx matters in an incident, and how to provision the standard alarm set automatically so new load balancers are never created without monitoring. You'll see real CLI calls to audit an account for unmonitored LBs and bulk-create the alarms in a single pass.

Fun fact

The 11-minute Friday afternoon

A US SaaS team once shipped a release at 4:50pm on a Friday that caused a target group's healthy host count to drop to one. The ALB kept serving 200s from the lone healthy instance, but with no UnHealthyHostCount alarm, nobody noticed. At 11:02pm the last host failed its health check and the LB started returning 503s. The first signal was an enterprise customer's pager going off, followed by the team's own. A $0.10-a-month alarm would have fired at 4:51pm, more than six hours before the first 503.

Adding load balancer alarms in action

Marco runs reliability at a B2B fintech. A quarterly audit surfaces 23 production ALBs and 4 NLBs with zero CloudWatch alarms attached. Most of them were created by the EKS AWS Load Balancer Controller (one per Kubernetes Ingress) or by ECS service definitions - the LB shipped with the workload, but the alarms never did.

He picks the worst offender first: the public-facing ALB in front of the payments API. It handles around 1,200 requests per second at peak and serves the most revenue-critical surface in the product. He pulls its current 5xx rate to confirm the workload is healthy enough that adding alarms now won't immediately fire false positives.

Then he wires the four alarms that every production LB should have. The whole exercise for one ALB takes about three minutes; the rest he scripts.

First, check the recent 5xx behaviour on the target side. HTTPCode_Target_5XX_Count counts errors your app returned - the ones that mean a backend crashed, not the LB itself.

$ aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB --metric-name HTTPCode_Target_5XX_Count --dimensions Name=LoadBalancer,Value=app/payments-api/50dc6c495c0c9188 --start-time $(date -u -d '1 hour ago' +%FT%TZ) --end-time $(date -u +%FT%TZ) --period 300 --statistics Sum

{
  "Datapoints": [
    { "Timestamp": "2026-05-15T13:45:00Z", "Sum": 2.0, "Unit": "Count" },
    { "Timestamp": "2026-05-15T13:50:00Z", "Sum": 0.0, "Unit": "Count" },
    { "Timestamp": "2026-05-15T13:55:00Z", "Sum": 1.0, "Unit": "Count" },
    { "Timestamp": "2026-05-15T14:00:00Z", "Sum": 0.0, "Unit": "Count" }
  ]
}

# ~3 5xx in the last hour against ~4.3M requests. Baseline is clean - alarm at >10/5min won't fire false positives.

5xx baseline on the payments-api ALB. Healthy enough to wire alarms without immediate noise.

Now create the four alarms in one pass. The ALB dimension uses the LoadBalancer ARN suffix; target group alarms use both LoadBalancer and TargetGroup.

$ aws cloudwatch put-metric-alarm --alarm-name payments-api-alb-target-5xx --metric-name HTTPCode_Target_5XX_Count --namespace AWS/ApplicationELB --statistic Sum --period 300 --evaluation-periods 1 --threshold 10 --comparison-operator GreaterThanThreshold --dimensions Name=LoadBalancer,Value=app/payments-api/50dc6c495c0c9188 --alarm-actions arn:aws:sns:eu-west-1:123456789012:sre-pager

# 1/4: HTTPCode_Target_5XX_Count > 10 / 5min - backend errors

# 2/4: HTTPCode_ELB_5XX_Count > 5 / 5min - LB-side errors (no healthy target, timeout)

# 3/4: UnHealthyHostCount > 0 / 1min - any target failing health checks

# 4/4: TargetResponseTime p99 > 1.5 / 5min - latency SLO breach

All four alarms created. State: INSUFFICIENT_DATA until next evaluation window.

The four-alarm core set, all routed to the SRE pager SNS topic. The HTTPCode_ELB_5XX one is the most important - it fires when the LB itself can't serve, not just when a backend crashes.
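The session above spells out only the first put-metric-alarm call. A minimal sketch of the other three might look like the following, assuming the thresholds listed in the comments above. The function name and the target group value in the usage line are hypothetical; the percentile alarm uses --extended-statistic because p99 is not a plain statistic.

```shell
# Sketch: alarms 2-4 of the core set for a single ALB.
#   $1 = LoadBalancer dimension, e.g. app/payments-api/50dc6c495c0c9188
#   $2 = TargetGroup dimension, e.g. targetgroup/<name>/<id>
#   $3 = SNS topic ARN to notify
create_remaining_alb_alarms() {
  local lb="$1" tg="$2" topic="$3"
  local name
  name=$(echo "$lb" | cut -d/ -f2)

  # 2/4: 5xx generated by the load balancer itself.
  aws cloudwatch put-metric-alarm --alarm-name "${name}-alb-elb-5xx" \
    --namespace AWS/ApplicationELB --metric-name HTTPCode_ELB_5XX_Count \
    --statistic Sum --period 300 --evaluation-periods 1 --threshold 5 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=LoadBalancer,Value="$lb" \
    --alarm-actions "$topic"

  # 3/4: any target failing health checks (needs both dimensions).
  aws cloudwatch put-metric-alarm --alarm-name "${name}-alb-unhealthy-hosts" \
    --namespace AWS/ApplicationELB --metric-name UnHealthyHostCount \
    --statistic Maximum --period 60 --evaluation-periods 1 --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=TargetGroup,Value="$tg" Name=LoadBalancer,Value="$lb" \
    --alarm-actions "$topic"

  # 4/4: p99 latency in seconds; percentiles use --extended-statistic.
  aws cloudwatch put-metric-alarm --alarm-name "${name}-alb-latency-p99" \
    --namespace AWS/ApplicationELB --metric-name TargetResponseTime \
    --extended-statistic p99 --period 300 --evaluation-periods 1 --threshold 1.5 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=LoadBalancer,Value="$lb" \
    --alarm-actions "$topic"
}
```

Usage, with a placeholder target group ID: create_remaining_alb_alarms app/payments-api/50dc6c495c0c9188 targetgroup/payments-api/<id> arn:aws:sns:eu-west-1:123456789012:sre-pager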

Load balancer metrics under the hood (deep dive)

ALBs and NLBs publish to different CloudWatch namespaces and use partly different metric names. ALBs use AWS/ApplicationELB and ship HTTP-aware metrics: HTTPCode_Target_2XX_Count, HTTPCode_Target_5XX_Count, HTTPCode_ELB_5XX_Count, TargetResponseTime, RequestCount, ActiveConnectionCount, RejectedConnectionCount. NLBs use AWS/NetworkELB and ship L4 metrics: ProcessedBytes, ActiveFlowCount, NewFlowCount, TCP_Client_Reset_Count, TCP_Target_Reset_Count. Both namespaces publish UnHealthyHostCount per target group. If you copy an ALB alarm template verbatim onto an NLB, the HTTP alarms will never fire - those metric names don't exist in AWS/NetworkELB, so by default the alarms just sit in INSUFFICIENT_DATA.

The HTTPCode_Target vs HTTPCode_ELB distinction is the single most useful signal in any ALB incident. HTTPCode_Target_5XX_Count means your application returned a 5xx - a backend crashed, threw an exception, or hit a downstream timeout. HTTPCode_ELB_5XX_Count means the load balancer itself returned a 5xx because it couldn't reach a healthy target, the target rejected the connection, or the LB's own infrastructure throttled. Different fix paths: target 5xx is an app bug, ELB 5xx is a capacity, health-check, or networking problem. An alarm on each tells the on-call where to start looking before they've even opened the runbook.

Alarming on rate, not absolute count, matters more on busier LBs. A threshold of "100 5xx in 5 minutes" is meaningful for a service doing 1,000 RPS and meaningless for one doing 100k RPS. For high-traffic services, alarm on the math expression m1/m2 > 0.01 where m1 is HTTPCode_Target_5XX_Count and m2 is RequestCount - that's a 1% error rate, which is portable across any traffic level. For per-target normalisation, RequestCountPerTarget already divides by the healthy target count.
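One way to express the error-rate alarm is with the --metrics form of put-metric-alarm, sketched below. The function names and alarm name are hypothetical. The sketch wraps m1 in FILL(m1, 0) as an assumption: ALB 5xx counters are only reported when non-zero, so a plain m1/m2 would have no value in clean periods.

```shell
# Sketch: alarm on the target 5xx *rate* (m1/m2) instead of an absolute count.
#   $1 = LoadBalancer dimension, e.g. app/payments-api/50dc6c495c0c9188
error_rate_metrics_json() {
  local lb="$1"
  cat <<EOF
[
  {"Id": "m1", "ReturnData": false,
   "MetricStat": {"Metric": {"Namespace": "AWS/ApplicationELB",
       "MetricName": "HTTPCode_Target_5XX_Count",
       "Dimensions": [{"Name": "LoadBalancer", "Value": "$lb"}]},
     "Period": 300, "Stat": "Sum"}},
  {"Id": "m2", "ReturnData": false,
   "MetricStat": {"Metric": {"Namespace": "AWS/ApplicationELB",
       "MetricName": "RequestCount",
       "Dimensions": [{"Name": "LoadBalancer", "Value": "$lb"}]},
     "Period": 300, "Stat": "Sum"}},
  {"Id": "e1", "Expression": "FILL(m1, 0) / m2",
   "Label": "target 5xx rate", "ReturnData": true}
]
EOF
}

#   $1 = LoadBalancer dimension, $2 = SNS topic ARN
create_error_rate_alarm() {
  local lb="$1" topic="$2"
  local name
  name=$(echo "$lb" | cut -d/ -f2)
  # Threshold 0.01 = 1% of requests. Periods with no traffic produce no
  # datapoint, which is treated as healthy here.
  aws cloudwatch put-metric-alarm --alarm-name "${name}-alb-target-5xx-rate" \
    --metrics "$(error_rate_metrics_json "$lb")" \
    --evaluation-periods 1 --threshold 0.01 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$topic"
}
```

Because the period and statistic live inside each MetricStat entry, the top-level --namespace, --metric-name, --statistic and --period flags are left out.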

# Audit every ALB and NLB in the current region for an alarm on one key metric.
aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[].[LoadBalancerArn,Type]' --output text |
while read -r arn type; do
  dim_value=${arn#*loadbalancer/}
  # ALB and NLB metrics live in different namespaces, so pick per type.
  case "$type" in
    application) ns=AWS/ApplicationELB; metric=HTTPCode_Target_5XX_Count ;;
    network)     ns=AWS/NetworkELB;     metric=TCP_Target_Reset_Count ;;
    *)           continue ;;
  esac
  alarm_count=$(aws cloudwatch describe-alarms-for-metric \
    --namespace "$ns" --metric-name "$metric" \
    --dimensions Name=LoadBalancer,Value="$dim_value" \
    --query 'length(MetricAlarms)' --output text)
  echo "$dim_value alarms=$alarm_count"
done

# Anything reporting alarms=0 needs the standard alarm set.

What is the impact of running load balancers without alarms?

The direct impact is detection latency. A load balancer is the canary for the entire workload behind it - 5xx rates, target health, latency - and without alarms the first signal of an outage is whatever the slowest external check happens to be: a customer email, a status-page tweet, a CI smoke test the next morning. Mean time to detect goes from seconds to hours, and every minute of undetected outage is revenue and trust burning.

The second-order impact is incident triage time. When the on-call gets paged at 3am with "the app is down," they need to know in 30 seconds whether the backend is throwing exceptions, the LB has no healthy targets, the LB itself is throttled, or latency has crept past the SLO. A well-named alarm answers that before they open the runbook. With no alarms, they're stuck pulling four different CloudWatch graphs from memory while customers wait.

There's a compliance angle on regulated workloads. SOC 2, ISO 27001, and PCI DSS all expect documented monitoring and alerting on production systems. "We have CloudWatch metrics" isn't the same as "we have alarms wired to a pager" - and an auditor walking through an account with 23 unmonitored production ALBs will write that up as a finding.

The fix is cheap. A CloudWatch standard-resolution alarm costs $0.10/month; the standard four-alarm set on every LB in a fleet of 50 LBs is $20/month. The cost of a single 30-minute payment-API outage that an UnHealthyHostCount alarm would have caught in 60 seconds is several orders of magnitude higher.

How do you wire load balancer alarms safely?

Adding alarms is a four-step loop. Get the standard set right once, then make it the default for every new LB so this problem doesn't recur with the next ECS service or Kubernetes Ingress.

1. Inventory every LB and its current alarm coverage

Walk both elbv2 (ALB/NLB) and elb (Classic) APIs in every region. For each LB, count CloudWatch alarms associated with its metrics. Anything with zero alarms goes on the remediation list; anything with only one or two is partial coverage that probably misses one of the key signals. Don't trust a Terraform repo to tell you the state - alarms drift, and LBs created by Kubernetes Ingress controllers or ECS service definitions aren't always in the IaC.
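A rough sketch of that inventory walk is below. It assumes credentials that can call ec2:DescribeRegions and both load balancer APIs; the function name is hypothetical. It prints the CloudWatch dimension value for each LB: the ARN suffix for ALB/NLB, and the load balancer name for Classic, whose metrics live in AWS/ELB under the LoadBalancerName dimension.

```shell
# Sketch: list every ALB/NLB and Classic ELB in every enabled region.
# Prints one line per load balancer: "<region> <type> <dimension value>".
list_all_load_balancers() {
  local region type arn lb_name
  for region in $(aws ec2 describe-regions \
        --query 'Regions[].RegionName' --output text); do
    # ALB/NLB (elbv2): dimension value is the ARN suffix after "loadbalancer/".
    aws elbv2 describe-load-balancers --region "$region" \
      --query 'LoadBalancers[].[Type,LoadBalancerArn]' --output text |
    while read -r type arn; do
      if [ -n "$arn" ]; then
        echo "$region $type ${arn#*loadbalancer/}"
      fi
    done
    # Classic (elb): dimension value is the load balancer name.
    aws elb describe-load-balancers --region "$region" \
      --query 'LoadBalancerDescriptions[].LoadBalancerName' --output text |
    tr '\t' '\n' |
    while read -r lb_name; do
      if [ -n "$lb_name" ]; then
        echo "$region classic $lb_name"
      fi
    done
  done
}
```

Feeding each output line into a per-LB alarm count gives the remediation list for the whole account rather than a single region.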

2. Provision the standard four-alarm set

For ALBs: HTTPCode_Target_5XX_Count (backend errors), HTTPCode_ELB_5XX_Count (LB-side errors), UnHealthyHostCount > 0 per target group (target health), TargetResponseTime p99 (latency SLO). For NLBs: substitute TCP_Target_Reset_Count for target 5xx, keep UnHealthyHostCount, and skip the HTTP-only metrics (HTTPCode_ELB_5XX_Count, TargetResponseTime) - NLBs don't see HTTP. On ALBs, also add RejectedConnectionCount - it fires when the LB itself is overwhelmed and rejecting connections at the edge. It is an ALB metric, so don't copy it onto NLBs. Route alarms to an SNS topic the on-call actually monitors, not an email alias that nobody reads.
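For the NLB side, a minimal sketch under the same conventions could look like this. The function name is hypothetical and the reset threshold of 100 per 5 minutes is a placeholder, not a recommendation from this lesson; tune it against the NLB's own baseline.

```shell
# Sketch: core alarms for a single NLB (namespace AWS/NetworkELB).
#   $1 = LoadBalancer dimension, e.g. net/<name>/<id>
#   $2 = TargetGroup dimension, e.g. targetgroup/<name>/<id>
#   $3 = SNS topic ARN to notify
create_nlb_alarms() {
  local lb="$1" tg="$2" topic="$3"
  local name
  name=$(echo "$lb" | cut -d/ -f2)

  # Target resets stand in for target 5xx at L4. Threshold is a placeholder.
  aws cloudwatch put-metric-alarm --alarm-name "${name}-nlb-target-resets" \
    --namespace AWS/NetworkELB --metric-name TCP_Target_Reset_Count \
    --statistic Sum --period 300 --evaluation-periods 1 --threshold 100 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=LoadBalancer,Value="$lb" \
    --alarm-actions "$topic"

  # Target health works as on an ALB, only the namespace changes.
  aws cloudwatch put-metric-alarm --alarm-name "${name}-nlb-unhealthy-hosts" \
    --namespace AWS/NetworkELB --metric-name UnHealthyHostCount \
    --statistic Maximum --period 60 --evaluation-periods 1 --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=TargetGroup,Value="$tg" Name=LoadBalancer,Value="$lb" \
    --alarm-actions "$topic"
}
```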

3. Complement with CloudWatch Synthetics

Alarms on LB metrics tell you whether the infrastructure is healthy - they don't tell you whether a customer can actually complete checkout. Add a CloudWatch Synthetics canary that runs the critical user flow every minute from a real browser; alarm on canary failure. This catches problems that LB metrics miss: a deploy that returns 200s but with the wrong page, a session-affinity bug, a third-party JS failure. Canaries are cheap and they catch the failure modes that pure infrastructure alarms can't.
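Alarming on canary failure can reuse the same put-metric-alarm pattern against the CloudWatchSynthetics namespace. The sketch below assumes a canary already exists; the canary name in the usage line, the function name, and the 90% threshold are all illustrative assumptions.

```shell
# Sketch: page when a Synthetics canary stops passing.
#   $1 = canary name, $2 = SNS topic ARN to notify
create_canary_alarm() {
  local canary="$1" topic="$2"
  # SuccessPercent is the share of canary runs that passed in the period.
  # Missing data is treated as breaching: a canary that stops running
  # should page just like one that fails.
  aws cloudwatch put-metric-alarm --alarm-name "${canary}-canary-failing" \
    --namespace CloudWatchSynthetics --metric-name SuccessPercent \
    --statistic Average --period 300 --evaluation-periods 1 --threshold 90 \
    --comparison-operator LessThanThreshold \
    --treat-missing-data breaching \
    --dimensions Name=CanaryName,Value="$canary" \
    --alarm-actions "$topic"
}
```

Usage, with a hypothetical canary name: create_canary_alarm checkout-flow arn:aws:sns:eu-west-1:123456789012:sre-pager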

4. Prevent recurrence with automatic provisioning

The reason these alarms get missed is that ALBs are created by many different paths - the AWS Load Balancer Controller for EKS, ECS service definitions, manual ALB-per-microservice via Terraform - and none of those default to creating alarms. Fix it once: an EventBridge rule that fires on CreateLoadBalancer events triggers a Lambda that creates the standard alarm set, tagged with the LB's owner. Now every new LB ships with monitoring on day one without anyone remembering to add it.
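The EventBridge half of that wiring can be sketched as follows. It assumes CloudTrail is recording management events in the region (CreateLoadBalancer reaches EventBridge as an "AWS API Call via CloudTrail" event) and that the alarm-creating Lambda already exists; the rule name, function names, and Lambda ARN are hypothetical.

```shell
# Sketch: EventBridge rule that invokes a Lambda on every CreateLoadBalancer call.
create_lb_event_pattern() {
  cat <<'EOF'
{
  "source": ["aws.elasticloadbalancing"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["elasticloadbalancing.amazonaws.com"],
    "eventName": ["CreateLoadBalancer"]
  }
}
EOF
}

#   $1 = ARN of the Lambda that creates the standard alarm set
wire_auto_alarm_rule() {
  local lambda_arn="$1" rule="lb-created-add-alarms"
  local rule_arn
  # Create (or update) the rule and capture its ARN.
  rule_arn=$(aws events put-rule --name "$rule" \
    --event-pattern "$(create_lb_event_pattern)" \
    --query RuleArn --output text)
  # Allow EventBridge to invoke the function for this rule only.
  aws lambda add-permission --function-name "$lambda_arn" \
    --statement-id "${rule}-invoke" --action lambda:InvokeFunction \
    --principal events.amazonaws.com --source-arn "$rule_arn"
  # Point the rule at the function.
  aws events put-targets --rule "$rule" --targets "Id=1,Arn=${lambda_arn}"
}
```

The Lambda itself would read the new load balancer's ARN from the event detail and issue the same put-metric-alarm calls used elsewhere in this lesson.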

# Bulk-create the 5xx alarms on every ALB in the region. put-metric-alarm
# overwrites an alarm with the same name, so re-running the loop is safe.
for arn in $(aws elbv2 describe-load-balancers \
      --query 'LoadBalancers[?Type==`application`].LoadBalancerArn' --output text); do
  dim=$(echo "$arn" | sed 's|.*loadbalancer/||')
  name=$(echo "$dim" | cut -d/ -f2)

  aws cloudwatch put-metric-alarm --alarm-name "${name}-target-5xx" \
    --namespace AWS/ApplicationELB --metric-name HTTPCode_Target_5XX_Count \
    --statistic Sum --period 300 --evaluation-periods 1 --threshold 10 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=LoadBalancer,Value="$dim" \
    --alarm-actions arn:aws:sns:eu-west-1:123456789012:sre-pager

  aws cloudwatch put-metric-alarm --alarm-name "${name}-elb-5xx" \
    --namespace AWS/ApplicationELB --metric-name HTTPCode_ELB_5XX_Count \
    --statistic Sum --period 300 --evaluation-periods 1 --threshold 5 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=LoadBalancer,Value="$dim" \
    --alarm-actions arn:aws:sns:eu-west-1:123456789012:sre-pager

  # Still to add per ALB: UnHealthyHostCount for each target group and TargetResponseTime p99.
done
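UnHealthyHostCount needs one alarm per target group, so the bulk loop has to discover the target groups behind each ALB first. A minimal sketch, assuming alarm names derived from the target group name (the naming scheme and function name are hypothetical):

```shell
# Sketch: one UnHealthyHostCount alarm per target group behind an ALB.
#   $1 = load balancer ARN, $2 = SNS topic ARN to notify
create_unhealthy_host_alarms() {
  local arn="$1" topic="$2"
  local lb_dim="${arn#*loadbalancer/}"
  local tg_arn tg_dim tg_name
  aws elbv2 describe-target-groups --load-balancer-arn "$arn" \
    --query 'TargetGroups[].TargetGroupArn' --output text | tr '\t' '\n' |
  while read -r tg_arn; do
    if [ -z "$tg_arn" ]; then
      continue
    fi
    tg_dim="${tg_arn##*:}"                    # targetgroup/<name>/<id>
    tg_name=$(echo "$tg_dim" | cut -d/ -f2)
    aws cloudwatch put-metric-alarm --alarm-name "${tg_name}-unhealthy-hosts" \
      --namespace AWS/ApplicationELB --metric-name UnHealthyHostCount \
      --statistic Maximum --period 60 --evaluation-periods 1 --threshold 0 \
      --comparison-operator GreaterThanThreshold \
      --dimensions Name=TargetGroup,Value="$tg_dim" Name=LoadBalancer,Value="$lb_dim" \
      --alarm-actions "$topic"
  done
}
```

Calling create_unhealthy_host_alarms "$arn" with the SNS topic ARN inside the bulk loop covers target health for every ALB in the same pass.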


You've completed Add CloudWatch alarms to load balancers. You now know which four signals every production LB should be alarming on, how ALB and NLB metrics differ, why HTTPCode_Target vs HTTPCode_ELB is the most useful distinction in any incident, and how to automate the standard alarm set so new load balancers ship with monitoring on day one. The next time COV-004 fires, you'll have a four-step loop ready to run.


Load balancer alarms: what they cost and why they matter to the budget

A $0.40/month line item that defines whether you find out about an outage in seconds or hours

A load balancer sits in front of every application and is the only component that sees the complete error and latency picture for all traffic. AWS flags load balancers with zero CloudWatch alarms as COV-004 at HIGH severity — HIGH because unmonitored production infrastructure is a governance gap, not a configuration preference.

CloudWatch alarms are priced at $0.10 per alarm per month at standard resolution. The four-alarm set that covers a production load balancer costs $0.40/month per load balancer. That is the full incremental cost to close a HIGH-severity finding. Against the revenue exposure from a multi-hour undetected outage, this is not a cost-versus-benefit question — it is a decision about whether a $0.40 line item is worth the risk it removes.

The finance framing is a tiering one: production and revenue-facing load balancers should be treated as non-negotiable for the standard alarm set; internal, dev, and test load balancers are a judgment call. The value of tracking this control is that it forces every unmonitored LB to be either remediated or recorded as an intentional exception with a documented owner — so the monitoring posture is a deliberate decision, not an oversight.

This lesson is for the finance partner who sees COV-004 findings on the security dashboard and wants to understand the cost and risk dimension before approving remediation. It covers what CloudWatch alarms on load balancers actually cost ($0.10/alarm/month, four alarms per LB), why this is a tiering decision rather than a blanket mandate, how to frame the spend against the outage revenue exposure it prevents, and how to ensure that any unmonitored production LB is either remediated or recorded as an intentional decision with a documented owner. No CLI commands required.


How a finance partner frames the load balancer alarm decision

Yuki is the FinOps lead at a payments company. The quarterly security review surfaces 18 COV-004 findings: 18 load balancers with no CloudWatch alarms. Rather than approve a bulk remediation without scrutiny, she asks the tiering question: which of these 18 are production, customer-facing, or revenue-critical, and which are dev, internal, or ephemeral, where a lower monitoring bar is acceptable?

The engineering team sorts them using environment and owner tags. Eleven are production ALBs fronting live services — the payments API, the checkout flow, the customer portal. Yuki agrees the four-alarm standard set at $0.40/month per LB is clearly justified; the annual cost for all eleven is under $55, against the revenue exposure from a single undetected payments outage. The remaining seven are dev and internal tooling ALBs; the team documents them as intentionally bare-bones and records a review date.

Yuki's note for the finance pack is one sentence: 'Production load balancers now carry monitoring by design at a total incremental cost of under $55/year; non-production exceptions are recorded.' The spend is trivial and the risk reduction is concrete — and the next COV-004 scan will show production coverage as a deliberate posture, not luck.

The financial exposure of unmonitored load balancers

The direct financial exposure is the gap between when an outage starts and when someone with the ability to fix it gets paged. On a production ALB with no alarms, that gap is bounded by whichever external signal fires first — a customer complaint, a status-page query from a key account, a CI run that fails overnight. Every minute of undetected downtime on a revenue-critical service is lost transaction value that is straightforward to model: hourly revenue rate multiplied by detection lag.

The remediation cost is exceptionally clean to compare against that exposure. Four CloudWatch standard-resolution alarms at $0.10/month each equals $0.40/month per load balancer, or $4.80/year. For a fleet of 30 production LBs that is $144/year total — a rounding error on any cloud bill. The asymmetry between the annual monitoring cost and the cost of a single missed outage makes this one of the easiest risk-versus-spend justifications in FinOps.

The second-order cost is incident triage time. On-call engineers without named alarms spend the first 5–15 minutes of an incident reconstructing what broke — pulling CloudWatch graphs, guessing whether it's backend errors, LB capacity, or health check failures. Well-named alarms cut that cold-start time to near zero. That has a real dollar value in reduced engineer-hours per incident, and it reduces the likelihood of costly mis-triage decisions made under pressure.

For the finance review, the right framing is not 'how much do alarms cost?' but 'what is the uninsured revenue exposure per production LB per month?' Once that number is on the table, the $0.40/month monitoring cost frames itself.

What finance can drive on COV-004

Finance cannot create CloudWatch alarms, but it owns the framing that keeps monitoring coverage from being treated as optional. Four levers, applied at the regular cost and risk review cadence.

1. Make monitoring a budgeted line in provisioning

Agree with engineering that every production load balancer is budgeted to include the four-alarm standard set — $0.40/month per LB — as a non-optional provisioning cost. This treats monitoring the same way you treat compute: not an add-on but a component. It also prevents the common pattern where alarms are cut during a cost-reduction sprint because they weren't explicitly on the approved spec.

2. Track COV-004 failures against production LBs only

The finding count that matters to finance is unmonitored production load balancers, not total findings. Dev, test, and ephemeral LBs without alarms are expected and intentional; including them in the headline number dilutes the risk signal. Put the production-only count on the cost-and-security dashboard alongside the estimated annual cost to remediate: $4.80 per LB per year, which is under $50 for a fleet of ten.

3. Require a documented exception for every uncovered production LB

Any production load balancer left without the standard alarm set should carry a recorded justification with an owner name and a review date — not a silently suppressed finding. This converts the decision from 'we didn't get to it' into 'we made a choice and signed off on it,' which is what survives a post-incident review or an auditor walkthrough.

4. Price the monitoring gap as revenue exposure, not as a cost line

Frame the decision in terms of detection lag: for a payments API processing $X per hour, every additional minute of undetected downtime is $X/60 in lost transaction value. The four-alarm set that closes the detection gap from hours to seconds costs $4.80 per LB per year. Making that comparison explicit — in the same review where the cloud bill is discussed — removes the ambiguity about whether the $0.40/month is justified.


You've finished the finance partner's view of COV-004. You know the full remediation cost is $0.40/month per production load balancer — one of the most asymmetric risk-versus-spend decisions in cloud governance — why this is a tiering call rather than a blanket mandate, and the four levers that keep monitoring coverage defensible: budgeting alarms as a provisioning component, tracking only production findings, requiring documented exceptions, and framing the cost against the revenue exposure it prevents. Next time COV-004 shows up in a review, you'll have the numbers ready.


Unmonitored load balancers: the one-line risk

Whether your teams find out about outages before customers do

Every application's traffic flows through a load balancer. If that load balancer has no alarms, the first signal of an outage is a customer complaint or an external status check — not an internal page. COV-004 flags this at HIGH severity because silent production infrastructure is a risk posture question, not a configuration detail.

The cost to fix it is negligible: four CloudWatch alarms per load balancer at $0.10 each. The question this control really asks is whether monitoring coverage on revenue-facing infrastructure is policy or accident. The healthy answer is that every production load balancer is covered by design, every exception is documented, and the team finds out about problems before customers do.

A short read for the executive who wants to understand what COV-004 represents and what a healthy answer looks like. You'll get the plain-English version of why unmonitored load balancers are flagged at HIGH severity, what the fix costs, and what governance looks like: production infrastructure monitored by policy, exceptions documented, and the business finding out about outages before customers do.


What it looks like when monitoring is a policy, not an accident

At one company, the VP of Engineering used to get "we'd know pretty quickly" as the answer to "how fast would we know if payments went down?" After operationalising COV-004 as a tracked control, the answer changed: every production load balancer had a named alarm routed to an on-call pager, and the team had documented which internal and dev LBs were intentionally uncovered.

The shift that mattered wasn't the alarms themselves — it was that monitoring coverage on revenue-facing infrastructure became a policy rather than a side effect of whoever happened to add it. A new service automatically got its alarm set on day one because the provisioning process required it. The executive question "would we know before customers do?" had a verifiable yes attached to it, not a guess.

Why this is a risk posture item, not a tooling detail

The headline impact of an unmonitored load balancer is detection latency: the time between when an outage starts and when someone inside the organisation knows about it. If that first signal is a customer, a detection lag of minutes becomes hours, and the credibility cost on top of the revenue cost is significant. COV-004 is flagged HIGH precisely because this isn't a performance-tuning question — it's whether the organisation is flying blind on its own infrastructure.

The second point is that remediation cost is trivial. Four alarms per load balancer at $0.10 each means the entire monitoring gap can be closed for cents per LB per month. The leadership question isn't whether to fix it — it's whether production infrastructure is covered by policy so it stays fixed as the fleet grows. A service that ships without alarms today is the next undetected outage.

The leadership move on COV-004

The executive handle here is not to approve a one-time remediation ticket — it is to require that monitoring coverage on production load balancers becomes a default, not a task that gets skipped when a team is moving fast.

1. Set a default: production LBs ship with alarms

Make it an engineering standard that any load balancer serving production traffic includes the four-alarm set from day one. The reason COV-004 accumulates is that dozens of teams create LBs via Kubernetes Ingress controllers, ECS service definitions, and Terraform modules — none of which default to wiring alarms. A provisioning-level default removes the per-team memory dependency entirely.

2. Accept bare monitoring for low-stakes environments

Don't aim for zero COV-004 findings across the board. Dev, internal, and test load balancers without alarms are intentional. The goal is that production infrastructure is covered by design and every exception to that is a documented decision, not that every LB in the account has four alarms regardless of purpose.

3. Ask the one-question readiness check

At the leadership review the single question worth asking is: 'If the payments load balancer failed right now, how long before the on-call gets paged?' If the answer is 'a few minutes, the alarm fires on UnHealthyHostCount,' monitoring is working. If the answer is 'we'd see it in the logs' or 'a customer would tell us,' COV-004 is describing a real gap.


That's the lesson. The core question COV-004 asks is whether the organisation finds out about outages before customers do — and the answer should be yes by policy, not by luck. The fix is trivial in cost and permanent when baked into provisioning. The leadership signal to watch for is not a zero finding count but a clear split: production load balancers covered by design, intentional exceptions documented, and the team with a pager answer ready when you ask how fast they'd know.


Part of the learning path Get your alarms right
