Load balancer alarms: the basics
What does it mean for a load balancer to be "unmonitored"?
Every request that hits your application passes through a load balancer first. ALBs terminate HTTPS, route by path or host, and spray traffic across targets; NLBs distribute raw TCP/UDP connections at L4, with no view of paths or hosts. Either way, the LB sees every request or connection on its way to your targets, which makes it the single most useful place in the stack to attach alarms.
An "unmonitored" LB is one with no CloudWatch alarms wired to its key metrics: 5xx rates, unhealthy host count, target response time, rejected connection count. When the LB is silent, the first signal of an outage is a customer complaint, an external uptime check, or a sales rep noticing the demo environment is down. By the time you find out, the incident clock has already been running for minutes or hours.
AWS check COV-004 flags ELB and ELBv2 resources that have zero CloudWatch alarms associated with them. The finding fires per-LB at HIGH severity because a load balancer in production without alarms is a structural visibility gap, not a tuning problem - and it's almost always an oversight rather than a deliberate choice.
In this lesson you'll learn which four alarms every production ALB and NLB should have, how ALB and NLB metric names differ, why the distinction between target 5xx and ELB 5xx matters in an incident, and how to provision the standard alarm set automatically so new load balancers are never created without monitoring. You'll see real CLI calls to audit an account for unmonitored LBs and bulk-create the alarms in a single pass.
The 11-minute Friday afternoon
A US SaaS team once shipped a release at 4:50pm on a Friday that left a target group with a single healthy host. The ALB kept serving 200s from that lone instance, but with no UnHealthyHostCount alarm, nobody noticed. At 11:02pm the last host failed its health check and the LB started returning 503s. The first signal was an enterprise customer's pager going off, followed by the team's own. An UnHealthyHostCount alarm, at roughly $0.10 a month, would have fired at about 4:51pm, more than six hours before the first 503, while the fix was still a routine rollback rather than an outage.
Adding load balancer alarms in action
Marco runs reliability at a B2B fintech. A quarterly audit surfaces 23 production ALBs and 4 NLBs with zero CloudWatch alarms attached. Most of them were created by the EKS AWS Load Balancer Controller (one per Kubernetes Ingress) or by ECS service definitions - the LB shipped with the workload, but the alarms never did.
He picks the worst offender first: the public-facing ALB in front of the payments API. It handles around 1,200 requests per second at peak and serves the most revenue-critical surface in the product. He pulls its current 5xx rate to confirm the workload is healthy enough that adding alarms now won't immediately fire false positives.
Then he wires the four alarms that every production LB should have. The whole exercise for one ALB takes about three minutes; the rest he scripts.
First, check the recent 5xx behaviour on the target side. HTTPCode_Target_5XX_Count counts 5xx responses generated by your targets - a backend that crashed or threw an exception - not responses generated by the LB itself.
5xx baseline on the payments-api ALB. Healthy enough to wire alarms without immediate noise.
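A baseline check along these lines could be sketched as follows. The ARN suffix in LB_DIM is a made-up placeholder, the function name is hypothetical, and the 5-minute period is an assumption. Because this metric is only reported when it is non-zero, a period with no datapoint means no target 5xx were recorded.

```shell
# Sketch: sum target-side 5xx per 5-minute bucket over a time window.
# LB_DIM is a placeholder ARN suffix (everything after "loadbalancer/").
LB_DIM="app/payments-api/50dc6c495c0c9188"

target_5xx_baseline() {
  # $1 = LoadBalancer dimension value, $2 = start time, $3 = end time (ISO 8601)
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name HTTPCode_Target_5XX_Count \
    --dimensions Name=LoadBalancer,Value="$1" \
    --start-time "$2" --end-time "$3" \
    --period 300 --statistics Sum \
    --query 'sort_by(Datapoints,&Timestamp)[].[Timestamp,Sum]' --output text
}
```

Calling `target_5xx_baseline "$LB_DIM" 2024-05-03T15:00:00Z 2024-05-03T16:00:00Z` (placeholder timestamps) prints one timestamp and sum per period that had errors.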
Now create the four alarms in one pass. The ALB dimension uses the LoadBalancer ARN suffix; target group alarms use both LoadBalancer and TargetGroup.
The four-alarm core set, all routed to the SRE pager SNS topic. The HTTPCode_ELB_5XX one is the most important - it fires when the LB itself can't serve, not just when a backend crashes.
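Two alarms in the set need a little more care than the plain 5xx counts: UnHealthyHostCount takes both the TargetGroup and LoadBalancer dimensions, and a p99 latency alarm uses `--extended-statistic` rather than `--statistic`. The sketch below shows one way to write them. The function names, periods, and evaluation counts are illustrative assumptions, not recommended values; the SNS topic ARN is the one used elsewhere in this lesson.

```shell
# Sketch: the target-health and latency alarms for one ALB.
# Thresholds and periods are illustrative; tune them to your own SLOs.
SNS_TOPIC="arn:aws:sns:eu-west-1:123456789012:sre-pager"

create_unhealthy_host_alarm() {
  # $1 = LoadBalancer dimension, $2 = TargetGroup dimension, $3 = alarm name prefix
  aws cloudwatch put-metric-alarm --alarm-name "$3-unhealthy-hosts" \
    --namespace AWS/ApplicationELB --metric-name UnHealthyHostCount \
    --statistic Maximum --period 60 --evaluation-periods 2 --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=TargetGroup,Value="$2" Name=LoadBalancer,Value="$1" \
    --alarm-actions "$SNS_TOPIC"
}

create_p99_latency_alarm() {
  # $1 = LoadBalancer dimension, $2 = alarm name prefix, $3 = p99 threshold in seconds
  aws cloudwatch put-metric-alarm --alarm-name "$2-latency-p99" \
    --namespace AWS/ApplicationELB --metric-name TargetResponseTime \
    --extended-statistic p99 --period 60 --evaluation-periods 3 --threshold "$3" \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=LoadBalancer,Value="$1" \
    --treat-missing-data notBreaching \
    --alarm-actions "$SNS_TOPIC"
}
```

For example, `create_p99_latency_alarm app/payments-api/50dc6c495c0c9188 payments-api 1.5` (placeholder identifiers) would page when p99 latency stays above 1.5 seconds for three consecutive minutes.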
Load balancer metrics under the hood: deep dive
ALBs and NLBs publish to different CloudWatch namespaces and use partly different metric names. ALBs use AWS/ApplicationELB and ship HTTP-aware metrics: HTTPCode_Target_2XX_Count, HTTPCode_Target_5XX_Count, HTTPCode_ELB_5XX_Count, TargetResponseTime, RequestCount, ActiveConnectionCount, RejectedConnectionCount. NLBs use AWS/NetworkELB and ship L4 metrics: ProcessedBytes, ActiveFlowCount, NewFlowCount, TCP_Client_Reset_Count, TCP_Target_Reset_Count, UnHealthyHostCount. If you copy an ALB alarm template verbatim onto an NLB, the alarm sits in INSUFFICIENT_DATA and never fires, because those metric names don't exist in the NLB namespace.
The HTTPCode_Target vs HTTPCode_ELB distinction is the single most useful signal in any ALB incident. HTTPCode_Target_5XX_Count means your application returned a 5xx - a backend crashed, threw an exception, or hit a downstream timeout. HTTPCode_ELB_5XX_Count means the load balancer itself returned a 5xx because it couldn't reach a healthy target, the target rejected the connection, or the LB's own infrastructure throttled. Different fix paths: target 5xx is an app bug, ELB 5xx is a capacity, health-check, or networking problem. An alarm on each tells the on-call where to start looking before they've even opened the runbook.
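The triage logic above can be written down as a small helper that turns the two counts into a starting point for the on-call. This is only a sketch: the function name and the wording of the hints are invented for illustration, and the ordering in the "both" case is an assumption rather than anything the metrics themselves dictate.

```shell
# Sketch: map the two 5xx sums for a period to a "where to look first" hint.
triage_hint() {
  # $1 = HTTPCode_Target_5XX_Count sum, $2 = HTTPCode_ELB_5XX_Count sum
  if [ "$1" -gt 0 ] && [ "$2" -gt 0 ]; then
    echo "both: check target health first, then application logs"
  elif [ "$2" -gt 0 ]; then
    echo "elb: check target health, health checks, capacity, networking"
  elif [ "$1" -gt 0 ]; then
    echo "target: check application logs and downstream dependencies"
  else
    echo "none: no 5xx recorded in this period"
  fi
}
```

In practice the same split usually comes from the alarm names alone, which is why naming one alarm per metric pays off.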
Alarming on rate, not absolute count, matters more on busier LBs. A threshold of "100 5xx in 5 minutes" is meaningful for a service doing 1,000 RPS and meaningless for one doing 100k RPS. For high-traffic services, alarm on the math expression m1/m2 > 0.01 where m1 is HTTPCode_Target_5XX_Count and m2 is RequestCount - that's a 1% error rate, which is portable across any traffic level. For per-target normalisation, RequestCountPerTarget already divides by the healthy target count.
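To make the arithmetic concrete: at 1,000 RPS a 5-minute period carries 300,000 requests, so 100 errors is about 0.03%; at 100k RPS it carries 30 million, and the same 100 errors is about 0.0003%. A 1% threshold means 3,000 errors in the first case and 300,000 in the second. The sketch below shows one way to express the m1/m2 alarm with the `--metrics` parameter of `put-metric-alarm`. The function names, the alarm name you pass in, and the choice of `notBreaching` for missing data are assumptions; the last one reflects that the 5xx metric is only reported when it is non-zero.

```shell
# Sketch: alarm when target 5xx exceed 1% of requests, using metric math.
error_rate_metrics_json() {
  # $1 = LoadBalancer dimension value; prints the JSON passed to --metrics
  cat <<EOF
[
  {"Id": "rate", "Expression": "m1 / m2", "Label": "Target 5xx rate", "ReturnData": true},
  {"Id": "m1", "ReturnData": false, "MetricStat": {"Period": 300, "Stat": "Sum",
    "Metric": {"Namespace": "AWS/ApplicationELB", "MetricName": "HTTPCode_Target_5XX_Count",
      "Dimensions": [{"Name": "LoadBalancer", "Value": "$1"}]}}},
  {"Id": "m2", "ReturnData": false, "MetricStat": {"Period": 300, "Stat": "Sum",
    "Metric": {"Namespace": "AWS/ApplicationELB", "MetricName": "RequestCount",
      "Dimensions": [{"Name": "LoadBalancer", "Value": "$1"}]}}}
]
EOF
}

create_error_rate_alarm() {
  # $1 = LoadBalancer dimension, $2 = alarm name, $3 = SNS topic ARN
  aws cloudwatch put-metric-alarm --alarm-name "$2" \
    --metrics "$(error_rate_metrics_json "$1")" \
    --threshold 0.01 --comparison-operator GreaterThanThreshold \
    --evaluation-periods 1 --treat-missing-data notBreaching \
    --alarm-actions "$3"
}
```

Only the expression returns data, so the alarm evaluates the ratio and ignores the raw counts.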
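# Audit every ALB and NLB in the current region for an alarm on one key metric.
# A zero here means no alarm on that metric with only the LoadBalancer dimension;
# alarms on other metrics or dimension sets are not counted.
aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[].[LoadBalancerArn,Type]' --output text |
while read -r arn type; do
  dim_value=$(echo "$arn" | sed 's|.*loadbalancer/||')
  case "$type" in
    application) namespace=AWS/ApplicationELB; metric=HTTPCode_Target_5XX_Count ;;
    network)     namespace=AWS/NetworkELB;     metric=TCP_Target_Reset_Count ;;
    *)           continue ;;  # skip other LB types
  esac
  alarm_count=$(aws cloudwatch describe-alarms-for-metric \
    --namespace "$namespace" \
    --metric-name "$metric" \
    --dimensions Name=LoadBalancer,Value="$dim_value" \
    --query 'length(MetricAlarms)' --output text)
  echo "$dim_value alarms=$alarm_count"
done
# Anything reporting alarms=0 needs the standard alarm set.
What is the impact of running load balancers without alarms?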
The direct impact is detection latency. A load balancer is the canary for the entire workload behind it - 5xx rates, target health, latency - and without alarms the first signal of an outage is whatever the slowest external check happens to be: a customer email, a status-page tweet, a CI smoke test the next morning. Mean time to detect goes from seconds to hours, and every minute of undetected outage is revenue and trust burning.
The second-order impact is incident triage time. When the on-call gets paged at 3am with "the app is down," they need to know in 30 seconds whether the backend is throwing exceptions, the LB has no healthy targets, the LB itself is throttled, or latency has crept past the SLO. A well-named alarm answers that before they open the runbook. With no alarms, they're stuck pulling four different CloudWatch graphs from memory while customers wait.
There's a compliance angle on regulated workloads. SOC 2, ISO 27001, and PCI DSS all expect documented monitoring and alerting on production systems. "We have CloudWatch metrics" isn't the same as "we have alarms wired to a pager" - and an auditor walking through an account with 23 unmonitored production ALBs will write that up as a finding.
The fix is cheap. A CloudWatch standard-resolution alarm costs $0.10/month; the standard four-alarm set on every LB in a fleet of 50 LBs is $20/month. The cost of a single 30-minute payment-API outage that an UnHealthyHostCount alarm would have caught in 60 seconds is several orders of magnitude higher.
How do you wire load balancer alarms safely?
Adding alarms is a four-step loop. Get the standard set right once, then make it the default for every new LB so this problem doesn't recur with the next ECS service or Kubernetes Ingress.
1. Inventory every LB and its current alarm coverage
Walk both elbv2 (ALB/NLB) and elb (Classic) APIs in every region. For each LB, count CloudWatch alarms associated with its metrics. Anything with zero alarms goes on the remediation list; anything with only one or two is partial coverage that probably misses one of the key signals. Don't trust a Terraform repo to tell you the state - alarms drift, and LBs created by Kubernetes Ingress controllers or ECS service definitions aren't always in the IaC.
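The inventory step could be sketched like this. The function names are hypothetical, and the sketch assumes credentials that can read every enabled region. The alarm count only looks at each alarm's top-level dimensions, so alarms built on metric math are not counted; treat a zero as "needs a closer look" rather than proof. Classic ELBs publish under AWS/ELB with the dimension name LoadBalancerName, which is why the dimension name is a parameter.

```shell
# Sketch: list every ALB/NLB and Classic ELB in every enabled region.
list_all_load_balancers() {
  for region in $(aws ec2 describe-regions \
      --query 'Regions[].RegionName' --output text); do
    # ALBs and NLBs come from the elbv2 API; print their dimension value.
    aws elbv2 describe-load-balancers --region "$region" \
      --query 'LoadBalancers[].[Type,LoadBalancerArn]' --output text |
      while read -r type arn; do
        [ -n "$arn" ] && echo "$region $type ${arn#*:loadbalancer/}"
      done
    # Classic ELBs come from the older elb API and are identified by name.
    for name in $(aws elb describe-load-balancers --region "$region" \
        --query 'LoadBalancerDescriptions[].LoadBalancerName' --output text); do
      echo "$region classic $name"
    done
  done
}

# Sketch: count metric alarms whose dimensions reference a given LB.
count_alarms_for_lb() {
  # $1 = region, $2 = dimension name (LoadBalancer or LoadBalancerName),
  # $3 = dimension value
  aws cloudwatch describe-alarms --region "$1" --alarm-types MetricAlarm \
    --query "length(MetricAlarms[?Dimensions[?Name=='$2' && Value=='$3']])" \
    --output json
}
```

JSON output is used for the count so that the `length()` is computed over the full, paginated result rather than once per page.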
2. Provision the standard four-alarm set
For ALBs: HTTPCode_Target_5XX_Count (backend errors), HTTPCode_ELB_5XX_Count (LB-side errors), UnHealthyHostCount > 0 per target group (target health), TargetResponseTime p99 (latency SLO). For NLBs: substitute TCP_Target_Reset_Count for target 5xx and skip TargetResponseTime - NLBs don't see HTTP. Beyond the core four, add a rejection alarm: RejectedConnectionCount on ALBs, which fires when the LB itself is overwhelmed and rejecting connections at the edge, and the flow-level counterpart RejectedFlowCount on NLBs, since RejectedConnectionCount is not published in AWS/NetworkELB. Route alarms to an SNS topic the on-call actually monitors, not an email alias that nobody reads.
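For the NLB side, a sketch of the substituted alarms might look like the following. The function name is hypothetical and the thresholds are placeholders; a sensible reset-count threshold depends heavily on the protocol and traffic level, so treat 100 per 5 minutes as an assumption to be tuned.

```shell
# Sketch: NLB counterparts of the ALB alarms, in the AWS/NetworkELB namespace.
create_nlb_alarms() {
  # $1 = LoadBalancer dimension (net/<name>/<id>), $2 = TargetGroup dimension,
  # $3 = alarm name prefix, $4 = SNS topic ARN
  aws cloudwatch put-metric-alarm --alarm-name "$3-target-resets" \
    --namespace AWS/NetworkELB --metric-name TCP_Target_Reset_Count \
    --statistic Sum --period 300 --evaluation-periods 1 --threshold 100 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=LoadBalancer,Value="$1" \
    --treat-missing-data notBreaching \
    --alarm-actions "$4"
  aws cloudwatch put-metric-alarm --alarm-name "$3-unhealthy-hosts" \
    --namespace AWS/NetworkELB --metric-name UnHealthyHostCount \
    --statistic Maximum --period 60 --evaluation-periods 2 --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=TargetGroup,Value="$2" Name=LoadBalancer,Value="$1" \
    --alarm-actions "$4"
}
```

The only structural differences from the ALB versions are the namespace and the metric names, which is exactly what a copied ALB template gets wrong.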
3. Complement with CloudWatch Synthetics
Alarms on LB metrics tell you whether the infrastructure is healthy - they don't tell you whether a customer can actually complete checkout. Add a CloudWatch Synthetics canary that runs the critical user flow every minute from a real browser; alarm on canary failure. This catches problems that LB metrics miss: a deploy that returns 200s but with the wrong page, a session-affinity bug, a third-party JS failure. Canaries are cheap and they catch the failure modes that pure infrastructure alarms can't.
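Alarming on canary failure can reuse the same `put-metric-alarm` call, pointed at the canary's SuccessPercent metric in the CloudWatchSynthetics namespace. The sketch below assumes a canary already exists; the function name and the 90% threshold are illustrative. Missing data is treated as breaching here on the assumption that a canary that has stopped reporting is itself worth a page.

```shell
# Sketch: page when a Synthetics canary's success rate drops.
create_canary_alarm() {
  # $1 = canary name (must match an existing canary), $2 = SNS topic ARN
  aws cloudwatch put-metric-alarm --alarm-name "$1-canary-failing" \
    --namespace CloudWatchSynthetics --metric-name SuccessPercent \
    --statistic Average --period 300 --evaluation-periods 1 --threshold 90 \
    --comparison-operator LessThanThreshold \
    --dimensions Name=CanaryName,Value="$1" \
    --treat-missing-data breaching \
    --alarm-actions "$2"
}
```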
4. Prevent recurrence with automatic provisioning
The reason these alarms get missed is that ALBs are created by many different paths - the AWS Load Balancer Controller for EKS, ECS service definitions, manual ALB-per-microservice via Terraform - and none of those default to creating alarms. Fix it once: an EventBridge rule that fires on CreateLoadBalancer events triggers a Lambda that creates the standard alarm set, tagged with the LB's owner. Now every new LB ships with monitoring on day one without anyone remembering to add it.
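The EventBridge wiring could be sketched as below. It assumes CloudTrail is recording management events in the region, since that is how CreateLoadBalancer calls reach EventBridge, and that a Lambda function that creates the alarm set already exists. The rule name, statement ID, target ID, and function names are all hypothetical.

```shell
# Sketch: invoke an alarm-provisioning Lambda whenever a load balancer is created.
lb_created_event_pattern() {
  cat <<'EOF'
{
  "source": ["aws.elasticloadbalancing"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["elasticloadbalancing.amazonaws.com"],
    "eventName": ["CreateLoadBalancer"]
  }
}
EOF
}

wire_alarm_provisioner() {
  # $1 = rule name, $2 = ARN of the Lambda that creates the standard alarm set
  rule_arn=$(aws events put-rule --name "$1" \
    --event-pattern "$(lb_created_event_pattern)" \
    --query RuleArn --output text)
  # Allow EventBridge to invoke the function, scoped to this rule only.
  aws lambda add-permission --function-name "$2" \
    --statement-id "$1-invoke" --action lambda:InvokeFunction \
    --principal events.amazonaws.com --source-arn "$rule_arn"
  aws events put-targets --rule "$1" --targets "Id=alarm-provisioner,Arn=$2"
}
```

The Lambda itself would read the new LB's ARN from the event detail and issue the same `put-metric-alarm` calls used for existing LBs.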
# Bulk-create the two LB-level 5xx alarms on every ALB in the current region.
# put-metric-alarm creates or updates by name, so re-running the loop is safe.
for arn in $(aws elbv2 describe-load-balancers \
    --query 'LoadBalancers[?Type==`application`].LoadBalancerArn' --output text); do
  dim=$(echo "$arn" | sed 's|.*loadbalancer/||')   # app/<name>/<id>
  name=$(echo "$dim" | cut -d/ -f2)                # <name>
  # Backend errors. The metric is only reported when non-zero, so missing data is healthy.
  aws cloudwatch put-metric-alarm --alarm-name "${name}-target-5xx" \
    --namespace AWS/ApplicationELB --metric-name HTTPCode_Target_5XX_Count \
    --statistic Sum --period 300 --evaluation-periods 1 --threshold 10 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --dimensions Name=LoadBalancer,Value="$dim" \
    --alarm-actions arn:aws:sns:eu-west-1:123456789012:sre-pager
  # Errors generated by the load balancer itself.
  aws cloudwatch put-metric-alarm --alarm-name "${name}-elb-5xx" \
    --namespace AWS/ApplicationELB --metric-name HTTPCode_ELB_5XX_Count \
    --statistic Sum --period 300 --evaluation-periods 1 --threshold 5 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --dimensions Name=LoadBalancer,Value="$dim" \
    --alarm-actions arn:aws:sns:eu-west-1:123456789012:sre-pager
  # Still to add: UnHealthyHostCount (needs the TargetGroup dimension as well)
  # and TargetResponseTime p99 (uses --extended-statistic p99 instead of --statistic).
done
Keep learning
Dig deeper into load balancer observability and the AWS tooling around it.
- CloudWatch metrics for Application Load Balancers: full list of ALB metrics, dimensions, and recommended alarm patterns from AWS.
- CloudWatch metrics for Network Load Balancers: NLB metric names differ from ALB - the L4 equivalents for connection resets, processed bytes, and flow counts.
- Amazon CloudWatch Synthetics: end-to-end user-flow canaries that catch what infrastructure alarms miss.
- AWS Well-Architected - Reliability Pillar (Monitor workload resources): where load balancer alarms fit in the broader reliability framework.
You've completed "Add CloudWatch alarms to load balancers". You now know which four signals every production LB should be alarming on, how ALB and NLB metrics differ, why HTTPCode_Target vs HTTPCode_ELB is the most useful distinction in any incident, and how to automate the standard alarm set so new load balancers ship with monitoring on day one. The next time COV-004 fires, you'll have a four-step loop ready to run.