Tame flapping CloudWatch alarms

10 state transitions in 24 hours isn't a fire — it's a misconfigured threshold. Tune the eval window or use anomaly detection.

Flapping alarms: the basics

What does it mean for a CloudWatch alarm to flap?

A CloudWatch alarm flaps when it bounces between ALARM and OK over and over in a short window — 5, 10, sometimes 20 state transitions in 24 hours. Each transition fires whatever's wired to AlarmActions and OKActions: an SNS topic, a PagerDuty integration, an auto-scaling step, a Lambda. So a flapping alarm isn't a single noisy notification; it's a fire hose. Ops gets paged at 02:14, recovered at 02:16, paged at 02:19, recovered at 02:23, and so on until someone snoozes the rotation entirely.

The metric isn't broken. The alarm is. The threshold sits right at the value the metric normally hovers around, the evaluation window is a single one-minute period, and the metric has enough natural variance to cross the line a dozen times a day under perfectly healthy conditions. The alarm is doing exactly what you asked — it just turns out you asked for the wrong thing.

AWS surfaces this pattern through check ALH-004 ("Flapping Alarms"), which counts state transitions over a 24-hour window and flags anything with more than a small handful. The default heuristic is that real incidents are sticky — once they start, they stay started for minutes-to-hours — and an alarm transitioning every few minutes is almost certainly oscillating around its threshold rather than detecting a real, sustained problem.
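The "real incidents are sticky" heuristic can be sketched by measuring how long each ALARM episode lasts. This is an illustration only, not how any AWS check is implemented: the function names, the `max_transitions` cut-off, and the 15-minute `sticky_after` limit are all assumptions, and the timestamps are the page times from the example earlier in this lesson.

```python
from datetime import datetime, timedelta


def alarm_dwell_times(transitions):
    """Return the duration of each completed ALARM episode.

    `transitions` is a list of (timestamp, new_state) tuples, oldest first,
    where new_state is "ALARM" or "OK".
    """
    episodes = []
    alarm_started = None
    for timestamp, new_state in transitions:
        if new_state == "ALARM" and alarm_started is None:
            alarm_started = timestamp
        elif new_state == "OK" and alarm_started is not None:
            episodes.append(timestamp - alarm_started)
            alarm_started = None
    return episodes


def looks_like_flapping(transitions, max_transitions, sticky_after):
    """Hypothetical heuristic: many transitions and no long-lived ALARM episode."""
    episodes = alarm_dwell_times(transitions)
    if len(transitions) <= max_transitions or not episodes:
        return False
    return max(episodes) < sticky_after


def at(hhmm):
    # The date is arbitrary; only the gaps between timestamps matter here.
    return datetime.strptime("2026-01-01 " + hhmm, "%Y-%m-%d %H:%M")


history = [
    (at("02:14"), "ALARM"), (at("02:16"), "OK"),
    (at("02:19"), "ALARM"), (at("02:23"), "OK"),
]

dwell = alarm_dwell_times(history)
print([int(d.total_seconds() // 60) for d in dwell])  # [2, 4]
print(looks_like_flapping(history, max_transitions=3,
                          sticky_after=timedelta(minutes=15)))  # True
```

Episodes of two and four minutes are far shorter than a typical incident, which is the signature of an alarm oscillating around its threshold.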

In this lesson you'll learn how to recognise a flapping CloudWatch alarm, why it's flapping (it's almost always one of three things), and how to tune it back into something useful with the M-of-N evaluation model, a longer period, a better threshold, or an Anomaly Detection band. You'll also learn when flapping is a real signal that the underlying system is oscillating — and you should fix that, not the alarm.

Fun fact

The pager that cried wolf

Google's SRE book famously argues that an alert page should be actionable, urgent, and rare. Their internal benchmark: on-callers should get at most two pages per 12-hour shift, and every page should require a human decision. The most common reason that bar gets blown is flapping — one badly-tuned alarm firing 20 times overnight burns the whole month's page budget by Tuesday. Once on-callers learn an alarm is unreliable, they start ignoring it; the first real incident it catches is the one that gets missed.

Taming a flapping alarm in action

Marco runs platform at a SaaS company. A check fires from the FinOps dashboard: alarm "SES bounce rate >= 9%" has transitioned state 10 times in the last 24 hours. Severity MEDIUM. Marco's PagerDuty inbox confirms it — ten pages, ten auto-resolves, nobody actually did anything about any of them.

He pulls the alarm's history before changing a thing. The pattern is unmistakable: the metric bounces between 8.4% and 9.3% every few minutes, crossing the 9.0% threshold each time. The bounce rate isn't broken — it's just naturally variable, and the threshold sits right in the middle of its normal range.

He has three levers: require multiple periods to alarm (M-of-N), lengthen the period to smooth the metric, or move to anomaly detection. He picks M-of-N first because it's the cheapest change with the biggest effect.

First, count state transitions over the last 24 hours. This is the canonical "is this alarm flapping?" query.

$ aws cloudwatch describe-alarm-history \
    --alarm-name 'SES bounce rate >= 9%' \
    --history-item-type StateUpdate \
    --start-date "$(date -u -d '24 hours ago' +%FT%TZ)" \
    --end-date "$(date -u +%FT%TZ)" \
    --query 'length(AlarmHistoryItems)'
10
# 10 transitions in 24h: textbook flap.
# (date -d is GNU date; on macOS/BSD use: date -u -v-24H +%FT%TZ)

Counting StateUpdate events over the last day.

Now look at the actual transitions to see how close the metric is to the threshold each time it crosses. The command asks for the six newest items and reverses them, so the table reads oldest first.

$ aws cloudwatch describe-alarm-history \
    --alarm-name 'SES bounce rate >= 9%' \
    --history-item-type StateUpdate \
    --scan-by TimestampDescending \
    --max-items 6 \
    --query 'reverse(AlarmHistoryItems)[*].[Timestamp,HistorySummary]' \
    --output table
┌──────────────────────┬───────────────────────────────────────────────────────┐
│ Timestamp            │ HistorySummary                                        │
├──────────────────────┼───────────────────────────────────────────────────────┤
│ 2026-05-15T02:14:00Z │ Alarm updated from OK to ALARM (9.12 >= 9.00)         │
│ 2026-05-15T02:16:00Z │ Alarm updated from ALARM to OK (8.81 < 9.00)          │
│ 2026-05-15T02:19:00Z │ Alarm updated from OK to ALARM (9.04 >= 9.00)         │
│ 2026-05-15T02:23:00Z │ Alarm updated from ALARM to OK (8.72 < 9.00)          │
│ 2026-05-15T02:27:00Z │ Alarm updated from OK to ALARM (9.21 >= 9.00)         │
│ 2026-05-15T02:31:00Z │ Alarm updated from ALARM to OK (8.94 < 9.00)          │
└──────────────────────┴───────────────────────────────────────────────────────┘
# Crossings are all within 0.3 of the threshold. Metric is healthy; the alarm is too tight.

Six most recent transitions — the metric is oscillating ±0.3 around the line.
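To put a number on "how close", a quick calculation over the six values in the table above gives the distance of each crossing from the 9.0 line. The variable names are illustrative; the values are copied from the sample output.

```python
THRESHOLD = 9.0
# Values copied from the six transitions in the table above.
crossings = [9.12, 8.81, 9.04, 8.72, 9.21, 8.94]

# Absolute distance of each datapoint from the threshold.
margins = [round(abs(value - THRESHOLD), 2) for value in crossings]
widest = max(margins)

print(margins)   # [0.12, 0.19, 0.04, 0.28, 0.21, 0.06]
print(widest)    # 0.28
print(f"{widest / THRESHOLD:.1%} of the threshold")  # 3.1% of the threshold
```

Every crossing sits within about 3% of the threshold value, which is noise-level movement rather than a change in the health of the service.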

Flapping under the hood

A CloudWatch alarm evaluates a metric on a defined Period (the size of each datapoint, e.g. 60s or 300s), checks the most recent EvaluationPeriods datapoints, and transitions to ALARM if at least DatapointsToAlarm of them breach the threshold. The common starting point is EvaluationPeriods=1 and DatapointsToAlarm=1: a single bad datapoint, and you're paged. (If you omit DatapointsToAlarm, every datapoint in the evaluation window has to breach.) This is fine for hard-edged metrics (was the Lambda invoked?) and terrible for noisy ones (bounce rate, queue depth, p99 latency).

The fix is the M-of-N pattern: require M breaching datapoints out of the last N. Setting EvaluationPeriods=5 and DatapointsToAlarm=3 means the alarm only fires if 3 out of the last 5 periods cross the threshold: random spikes in a single period are absorbed, sustained issues still trigger. At a 5-minute period, that means at least 15 minutes of breaching data inside a 25-minute window (the three breaching periods don't have to be consecutive). That's appropriate for slow-burn alarms (bounce rate, error rate, fill rate) but slower than you want for pages on user-facing latency.
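A small simulation makes the difference visible. This is a simplified model of M-of-N evaluation for a `>=` alarm, assuming one datapoint per period and no missing data; it is not CloudWatch's actual evaluator, and the sample series are made up for illustration.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return the alarm state after each datapoint for a >= threshold alarm."""
    states = []
    for i in range(len(datapoints)):
        # Look at the most recent `evaluation_periods` datapoints.
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaching = sum(1 for value in window if value >= threshold)
        states.append("ALARM" if breaching >= datapoints_to_alarm else "OK")
    return states


def count_transitions(states):
    """Count how many times the state changes from one datapoint to the next."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Illustrative series: one hovers around 9.0, the other climbs and stays high.
noisy = [8.8, 9.1, 8.7, 8.9, 9.2, 8.8, 8.6, 9.1, 8.9, 8.7]
sustained = [8.8, 9.4, 9.6, 9.5, 9.7, 9.6]

print(count_transitions(evaluate_alarm(noisy, 9.0, 1, 1)))  # 6
print(count_transitions(evaluate_alarm(noisy, 9.0, 5, 3)))  # 0
print(evaluate_alarm(sustained, 9.0, 5, 3))
# ['OK', 'OK', 'OK', 'ALARM', 'ALARM', 'ALARM']
```

With 1-of-1 the noisy series changes state six times; with 3-of-5 it never alarms, while the sustained breach still fires on its third breaching datapoint.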

The other lever is the metric itself. CloudWatch Anomaly Detection runs a model over the last ~2 weeks of a metric, learns its daily and weekly seasonality, and emits an upper/lower band. An anomaly-detection alarm fires when the metric leaves the band, not when it crosses an absolute number. It's immune to the kind of flapping caused by a static threshold sitting on top of a naturally-variable metric — but it's not free: it adds cost per metric per month, and it takes time to train, so cold-start alarms behave oddly for the first couple of weeks.
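To see why a band behaves differently from a fixed number, here is a deliberately crude stand-in: a per-hour band of mean plus or minus two standard deviations. This is not CloudWatch's model, which is considerably more sophisticated; the function names, the `width` parameter, and the sample values are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def fit_band(history, width=2.0):
    """Build a (lower, upper) band per hour of day.

    `history` maps hour_of_day -> values previously observed at that hour.
    """
    band = {}
    for hour, values in history.items():
        centre = mean(values)
        spread = stdev(values)
        band[hour] = (centre - width * spread, centre + width * spread)
    return band


def is_anomalous(band, hour, value):
    """True when the value falls outside the band learned for that hour."""
    lower, upper = band[hour]
    return value < lower or value > upper


# Made-up history: the metric runs high overnight and lower in the afternoon.
history = {
    2: [8.7, 9.1, 8.9, 9.2, 8.8],
    14: [6.1, 6.4, 5.9, 6.2, 6.3],
}
band = fit_band(history)

print(is_anomalous(band, 2, 9.12))   # False: normal for 02:00
print(is_anomalous(band, 14, 8.0))   # True: under 9.0, but unusual for 14:00
print(is_anomalous(band, 2, 11.5))   # True: far outside the overnight range
```

The 02:00 reading of 9.12 would trip a static 9.0 threshold yet sits comfortably inside its band, while 8.0 at 14:00 would pass the static threshold unnoticed despite being well outside what that hour normally looks like.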

# Inspect the alarm's current evaluation config.
aws cloudwatch describe-alarms \
  --alarm-names 'SES bounce rate >= 9%' \
  --query "MetricAlarms[0].{Period:Period,Eval:EvaluationPeriods,Datapoints:DatapointsToAlarm,Threshold:Threshold}"

# Example output:
# {
#   "Period": 60,
#   "Eval": 1,
#   "Datapoints": 1,
#   "Threshold": 9.0
# }
# Single 1-minute period at the metric's typical value — guaranteed to flap.

What is the impact of leaving an alarm flapping?

The most direct impact is alert fatigue. An on-caller who gets paged ten times in a shift for the same alarm stops reading the page text — they swipe to acknowledge, go back to sleep, and the next real incident on that alarm lands in a brain that has been trained to ignore it. SRE post-mortems are full of "the alarm fired but we'd been ignoring it for weeks."

The second-order impact is downstream automation. AlarmActions don't just page humans: they trigger Auto Scaling steps, Lambda functions, SSM documents, Step Functions. A flapping CPU alarm wired to a scaling policy will scale up and down every few minutes, churning instances and burning money on EC2 instance time and EBS volumes that exist for thirty minutes apiece. A flapping alarm wired to an incident-response Lambda spawns ten investigations a day, ten log queries, ten Slack threads.

The third impact is on the SLO itself. If your alarm threshold corresponds to a service-level objective ("page if error rate > 1%"), a flapping alarm corrupts your error-budget accounting: every transition shows up as a notional breach event, and the team starts arguing about whether the SLO is really broken or the alarm is just wrong. Both can be true at once, which is the worst case to debug.

And there's a cost dimension: SNS notifications and downstream Lambda invocations are billed per request, and every PagerDuty event consumes responder time even where the paging tool is priced per seat rather than per event. A single flapping alarm doing 20 transitions a day across multiple actions isn't expensive by itself, but at fleet scale across 200 alarms, the bill plus the human time to investigate adds up to thousands a month for no actionable information.

How do you safely tame a flapping alarm?

Taming a flapping alarm is a four-step loop. The order matters — guessing at the tuning before you look at the metric's actual distribution is how you end up with an alarm that's no longer noisy but also no longer catches the thing it was meant to catch.

1. Confirm it's the alarm, not the system

Pull describe-alarm-history and the underlying metric for the last 7 days. If the metric is oscillating tightly around the threshold while the underlying service is healthy, the alarm is mis-tuned. If the metric is genuinely sawtoothing — HPA fighting itself, retry storms, a feedback loop in queue depth — the system is the problem and the alarm is correctly reporting it. Fix the system, not the alarm.

2. Tune evaluation periods, period, and threshold

Try M-of-N first: EvaluationPeriods=5, DatapointsToAlarm=3 absorbs almost all single-period noise. If that's not enough, raise the Period from 60s to 300s — the longer averaging window smooths bursts. Move the threshold last, and only by a small buffer above the metric's typical p95 (not the mean) — otherwise you'll silence real signals along with the noise.

3. Consider Anomaly Detection for naturally variable metrics

If the metric has strong daily or weekly seasonality (traffic, queue depth, bounce rates that vary by send volume), a static threshold is the wrong tool. Replace it with a CloudWatch Anomaly Detection alarm — the band adapts to the metric's pattern and only fires on actual deviation. Acceptable trade-off: small added cost per metric, a 2-week warm-up before the band is reliable.

4. Audit the rest of the fleet for the same pattern

Where there's one flapping alarm there are usually ten. Loop describe-alarm-history over every alarm in the account, count StateUpdates per 24h, and flag anything over 5 transitions. The fix from steps 1-3 applies to all of them; the audit prevents the next 2am pager-storm of the same shape.
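The audit in step 4 boils down to a count and a filter. The sketch below runs on an in-memory dictionary that stands in for the results of describe-alarm-history; in practice you would fill it from the CLI or an SDK. The function name and every alarm name except the SES one are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def flapping_alarms(history_by_alarm, now, max_transitions=5,
                    window=timedelta(hours=24)):
    """Return (alarm_name, transition_count) pairs over the limit, worst first.

    `history_by_alarm` maps alarm name -> list of StateUpdate timestamps.
    """
    cutoff = now - window
    flagged = []
    for name, timestamps in history_by_alarm.items():
        recent = sum(1 for ts in timestamps if cutoff <= ts <= now)
        if recent > max_transitions:
            flagged.append((name, recent))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


now = datetime(2026, 5, 15, 12, 0, tzinfo=timezone.utc)
history_by_alarm = {
    # Ten transitions in the last five hours.
    "SES bounce rate >= 9%": [now - timedelta(minutes=30 * i) for i in range(1, 11)],
    # Six inside the 24h window, two older ones that should be ignored.
    "queue-depth-high": [now - timedelta(hours=3 * i) for i in range(1, 7)]
                        + [now - timedelta(hours=30), now - timedelta(hours=40)],
    # A single transition: not flapping.
    "disk-nearly-full": [now - timedelta(hours=2)],
}

print(flapping_alarms(history_by_alarm, now))
# [('SES bounce rate >= 9%', 10), ('queue-depth-high', 6)]
```

Sorting worst first gives you a tuning queue: start at the top and apply steps 1 to 3 to each alarm in turn.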

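# Apply M-of-N evaluation: 3 out of the last 5 periods at 5-minute granularity.
# put-metric-alarm replaces the whole alarm definition, so repeat every setting
# you want to keep (dimensions, OK actions, ...) alongside the ones you change.
# Check the scale of your datapoints before copying the threshold: AWS/SES
# documents Reputation.BounceRate as a fraction (0.05 = 5%), while this lesson
# writes the values as percentages.
aws cloudwatch put-metric-alarm \
  --alarm-name 'SES bounce rate >= 9%' \
  --namespace AWS/SES \
  --metric-name Reputation.BounceRate \
  --statistic Average \
  --period 300 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 9.5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:ses-alerts

# Result: alarm only fires when 3 of the last 5 five-minute datapoints are at or above 9.5%.
# Eliminates the ±0.3 oscillation around the old 9.0% line.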
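The two other levers in that command, the longer period and the buffered threshold, can be checked on paper before you touch the alarm. This sketch assumes ten made-up one-minute datapoints and a hypothetical buffer of 0.25; on a real alarm you would feed in several days of the metric instead.

```python
from statistics import mean, quantiles


def resample(datapoints, factor):
    """Average consecutive groups of `factor` datapoints (e.g. 5 x 60s -> 300s)."""
    return [mean(datapoints[i:i + factor])
            for i in range(0, len(datapoints) - factor + 1, factor)]


def buffered_threshold(datapoints, buffer):
    """Place the threshold a fixed buffer above the 95th percentile."""
    p95 = quantiles(datapoints, n=20, method="inclusive")[-1]
    return p95 + buffer


# Illustrative one-minute series hovering around the old 9.0 threshold.
one_minute = [8.8, 9.1, 8.7, 9.2, 8.9, 9.0, 8.6, 9.3, 8.8, 9.1]
five_minute = resample(one_minute, 5)

breaches_60s = sum(1 for v in one_minute if v >= 9.0)
breaches_300s = sum(1 for v in five_minute if v >= 9.0)
print(breaches_60s, breaches_300s)  # 5 0

threshold = buffered_threshold(one_minute, buffer=0.25)
print(round(threshold, 2))
```

Averaging over five minutes removes every breach in this series, and a threshold anchored to the p95 rather than the mean lands above the normal range instead of in the middle of it.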

You've completed Tame flapping CloudWatch alarms. You can now spot a flap from describe-alarm-history, decide whether the alarm or the system is the problem, and apply the right combination of M-of-N, a longer period, a buffered threshold, or anomaly detection to silence the noise without losing the signal. The next time a 2am page-storm fires from the same alarm twice in three minutes, you'll have a four-step loop ready to run.