Frequently firing alarms: the basics
What does "frequently firing" actually mean?
A CloudWatch alarm has a job: tell the on-call engineer when something is genuinely wrong. Every time it transitions from OK to ALARM and back, that's one episode — one ping in a channel, one row in the incident log, one moment of attention spent. A healthy alarm fires when the thing it watches breaks, which for most real systems is a handful of times a month at most.
"Frequently firing" describes an alarm that has gone into ALARM dozens of times or more over a few weeks, not because the underlying system keeps having real outages, but because the alarm is poorly tuned to the metric it watches. The classic shape is an alarm that fires 100+ times in 30 days with no corresponding incident tickets, no remediation work, and no escalations: just a stream of pages everyone has learned to ignore.
Many monitoring tools can flag this pattern automatically. It's adjacent to but distinct from "flapping" (rapid OK ↔ ALARM transitions inside a single 24-hour window): frequently firing is many distinct episodes spread over weeks. The root cause is often the same, a threshold mismatched to reality, but the observable signal is different.
In this lesson you'll learn how to spot a frequently firing alarm, how to decide whether to tune it, fix the underlying signal, or suppress it to a quieter channel, and how to build alarm-health tracking into your monitoring practice so the noise doesn't creep back. You'll see real CloudWatch CLI calls to audit alarm history and apply targeted threshold changes.
Alert fatigue is measurable — and it kills response time
A 2020 study of incident response across SaaS engineering teams found that on-call engineers exposed to >50% false-positive page rates took 40% longer to respond to genuine pages than peers on cleaner rotations. The brain learns that a buzz at 3am usually means nothing — and that learning generalises. By the time a real incident fires, the first instinct is to silence the notification and check it later. "It's probably just that bounce-rate alarm again" is how outages get long.
Auditing a noisy alarm in action
Marco is the SRE lead at SevenC3, a mid-sized SaaS company. The on-call rotation has been grumbling for weeks about pages from an SES bounce-rate alarm — "SevenC3 SES bounce rate >= 9%" — that nobody ever actually acts on. The team's monitoring dashboard flags it: 158 state transitions in 30 days, severity HIGH on their alarm-health report.
Before he tunes or deletes anything, Marco wants to know which of two things is happening: is the bounce rate genuinely sitting at 9% (so the threshold is wrong), or is it mostly healthy and only briefly spiking past 9% (so the alarm has the right idea but the wrong evaluation window)? He pulls the alarm's state-change history to see the shape of the noise.
What he finds is the most common pattern: bounce rate hovers between 8.7% and 9.3% for most of the month, and the alarm is essentially counting random crossings of a line that the system happens to sit on. The threshold isn't catching incidents; it's measuring noise.
First, pull the alarm's state-change history and count the transitions to ALARM. This is the noise audit.
Counting OK→ALARM transitions over the audit window.
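The history pull might look like the sketch below. It is a minimal example, not the lesson's original snippet: the helper name `alarm_transitions_per_day` is hypothetical, GNU `date` is assumed for the `-d` option, and the CLI is assumed to print ISO 8601 timestamps. Grouping by day shows whether the transitions are spread evenly (noise) or clustered (incidents).

```shell
# Hypothetical helper: count transitions into ALARM per day for one alarm.
# Usage: alarm_transitions_per_day ALARM_NAME [DAYS]
# Assumes GNU date (-d) and ISO 8601 timestamps in the CLI output.
alarm_transitions_per_day() {
  name="$1"
  days="${2:-30}"
  aws cloudwatch describe-alarm-history \
    --alarm-name "$name" \
    --history-item-type StateUpdate \
    --start-date "$(date -u -d "$days days ago" +%FT%TZ)" \
    --query "AlarmHistoryItems[?contains(HistorySummary, 'to ALARM')].Timestamp" \
    --output text |
    tr '\t' '\n' |
    # Keep only the YYYY-MM-DD part; short lines (e.g. "None") are skipped.
    awk 'length($0) >= 10 { print substr($0, 1, 10) }' |
    sort | uniq -c |
    # Print "date count" instead of uniq's "count date".
    awk '{ print $2, $1 }'
}
```

For Marco's alarm the call would be `alarm_transitions_per_day 'SevenC3 SES bounce rate >= 9%' 30`; piping the result through `awk '{ s += $2 } END { print s + 0 }'` gives the total for the window.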
Now pull the underlying metric to see whether bounce rate is genuinely sitting near 9% or spiking past it.
Daily bounce-rate distribution over the audit window.
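A sketch of that metric pull, under the same assumptions (GNU `date`, a hypothetical helper name `daily_bounce_rate`). It uses the account-level `Reputation.BounceRate` metric in the `AWS/SES` namespace, which reports a fraction (0.09 means 9%), so the helper converts to percent for readability.

```shell
# Hypothetical helper: daily min/avg/max of the SES bounce rate, in percent.
# Usage: daily_bounce_rate [DAYS]
# Assumes GNU date (-d); Reputation.BounceRate is reported as a fraction.
daily_bounce_rate() {
  days="${1:-30}"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/SES \
    --metric-name Reputation.BounceRate \
    --start-time "$(date -u -d "$days days ago" +%FT%TZ)" \
    --end-time "$(date -u +%FT%TZ)" \
    --period 86400 \
    --statistics Minimum Average Maximum \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Minimum, Average, Maximum]' \
    --output text |
    # One line per day: date, then each statistic scaled from fraction to percent.
    awk 'NF == 4 {
      printf "%s min=%.2f%% avg=%.2f%% max=%.2f%%\n",
        substr($1, 1, 10), $2 * 100, $3 * 100, $4 * 100
    }'
}
```

If most days show a minimum just under 9% and a maximum just over it, the alarm is sitting on the line rather than catching spikes.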
How CloudWatch alarms actually decide to fire
A CloudWatch alarm evaluates its metric on a fixed cadence, once per period (e.g. 60s, 300s), and counts how many of the last N evaluation periods breached the threshold. The alarm transitions to ALARM only when M of those N periods breach, where N is --evaluation-periods and M is --datapoints-to-alarm. Many teams leave these at 1-of-1, which is exactly the configuration that produces frequent firing: a single data point on the wrong side of the threshold flips the state immediately.
Raising M and N is the cheapest mitigation. An alarm configured as 3-of-5 needs three breaching datapoints among the last five before it fires, and stays in ALARM until fewer than three of the last five breach. That makes it dramatically less reactive to single-point noise while still catching a genuine sustained event. The cost is detection latency: a sustained breach takes roughly M to N periods to trigger, which for a 60s-period alarm at 3-of-5 is 3-5 minutes. For a metric like bounce rate, that's a trivial trade.
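To make the M-of-N effect concrete, here is a small offline simulation. It is a deliberately simplified model (no missing-data handling, no INSUFFICIENT_DATA state), and the helper name `count_alarm_episodes` is hypothetical. It reads one metric value per line and counts how many times the simulated alarm would enter ALARM.

```shell
# Hypothetical helper: simulate M-of-N evaluation over a series of datapoints.
# Usage: count_alarm_episodes THRESHOLD M N < values   (one value per line)
# Simplified model: a datapoint breaches when value >= THRESHOLD; the state is
# ALARM when at least M of the last N datapoints breach, otherwise OK.
count_alarm_episodes() {
  awk -v t="$1" -v m="$2" -v n="$3" '
    NF {
      count++
      breach[count] = ($1 + 0 >= t + 0) ? 1 : 0
      hits = 0
      # Look back over at most the last n datapoints.
      for (i = count; i > count - n && i >= 1; i--) hits += breach[i]
      state = (hits >= m) ? "ALARM" : "OK"
      # Count only transitions into ALARM, not time spent there.
      if (state == "ALARM" && prev != "ALARM") episodes++
      prev = state
    }
    END { print episodes + 0 }'
}
```

Fed the series 8.8 9.1 8.9 8.7 9.2 8.8 8.9 9.3 9.4 9.5 9.2 8.8 8.7 8.9 8.6 with a threshold of 9, `count_alarm_episodes 9 1 1` reports 3 episodes (two isolated blips plus the sustained rise), while `count_alarm_episodes 9 3 5` reports 1 (only the sustained rise). A metric that alternates above and below the line on every datapoint can still trip 3-of-5, which is why M-of-N complements a sensible threshold rather than replacing one.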
CloudWatch Anomaly Detection is the structural fix when the underlying metric is genuinely variable. Instead of a hard-coded threshold, it learns the metric's daily/weekly seasonality and fires when the value falls outside a confidence band. For metrics with diurnal patterns (latency, request rate, queue depth, bounce rate) it produces a fraction of the noise of a fixed threshold — and catches anomalies a fixed threshold misses entirely.
```shell
# Audit the noisiest alarms in the account: find the 80/20.
# Assumes GNU date (for -d). Alarm names may contain spaces, so read whole lines.
start=$(date -u -d '30 days ago' +%FT%TZ)
aws cloudwatch describe-alarms --query 'MetricAlarms[*].AlarmName' --output text |
  tr '\t' '\n' |
  while IFS= read -r name; do
    # JSON output makes --query run once over all pages, giving a single count.
    count=$(aws cloudwatch describe-alarm-history \
      --alarm-name "$name" \
      --history-item-type StateUpdate \
      --start-date "$start" \
      --output json \
      --query "AlarmHistoryItems[?contains(HistorySummary, 'to ALARM')] | length(@)")
    echo "$count $name"
  done | sort -rn | head -20
```

What is the impact of leaving noisy alarms in place?
The first-order impact is response time on the real incidents. A team conditioned by months of false pages stops treating the next page as urgent. The buzz arrives, the engineer thinks "probably the bounce-rate one again," and the response that should have started in 2 minutes starts in 20. Multiply that across a noisy on-call rotation and the team is operating at a measurable handicap when something genuinely breaks.
The second-order impact is on the people doing the rotation. Sustained false-positive paging is one of the strongest predictors of on-call burnout — sleep fragmentation, weekend disruption, and the slow erosion of trust in the monitoring stack itself. Engineers leave teams over this; the cost shows up as attrition, not as a line item on the AWS bill.
The third-order impact is monitoring atrophy. Once the team has learned to ignore one alarm, they're more likely to silence others, mute channels, or stop investigating ALARM states entirely. By the time a real incident fires, half the alarms are muted and the dashboards haven't been opened in a week. This is how monitoring stacks die — not by going wrong, but by being ignored.
There's also a small but real direct cost: CloudWatch alarms are billed at $0.10 per alarm per month for standard resolution, and high-resolution alarms cost more. A single noisy alarm is cheap; an account full of unused, unmaintained alarms accumulating over years adds up — typically not enough to matter on its own, but a useful proxy metric for monitoring debt.
How do you fix a frequently firing alarm?
There are three remediation paths — tune, fix, or suppress — and they apply in that order of preference. The four-step loop below works for any noisy alarm; the only judgement call is which of the three paths the alarm belongs on.
1. Diagnose whether the metric is doing what's intended
Pull 30 days of the underlying metric and look at the distribution. If the metric genuinely sits near the threshold most of the time, the threshold is wrong — tune it. If the metric is mostly fine and the alarm catches real spikes, the underlying issue needs fixing. If neither — the metric just isn't a useful signal for whatever the alarm was trying to detect — suppress to a low-priority channel or delete.
2. Tune to reality, or replace fixed thresholds with Anomaly Detection
If the threshold is wrong, set it to a level the business actually cares about (e.g. bounce rate >= 12%, not 9%) and combine with M-of-N evaluation periods (3-of-5 instead of 1-of-1) to absorb noise. For metrics with seasonality, switch to Anomaly Detection: on put-metric-alarm, replace --threshold with --threshold-metric-id pointing at an ANOMALY_DETECTION_BAND expression defined in --metrics, and use a band comparison operator such as GreaterThanUpperThreshold. Anomaly Detection typically produces fewer false positives than a static line, especially for diurnal metrics.
3. Suppress low-criticality alarms to a quieter channel
Not every alarm deserves a page. For informational signals — bounce rates trending up, queue depth growing, a non-critical job running long — route to Slack or email instead of PagerDuty. Use SNS topic routing or PagerDuty's severity rules to make this a config change rather than a code change. The criterion is simple: if there's no immediate action the on-call should take, it shouldn't wake them up.
4. Apply the delete-or-fix rule, and audit alarm health monthly
Any alarm that has been firing without remediation for 90+ days should be deleted. It is, by definition, not driving action — keeping it around degrades trust in the rest of the stack. Run the noise audit (describe-alarm-history aggregated by alarm) monthly and treat the top-10 noisiest alarms as a standing review item. Usually the 80/20 rule applies — a few alarms produce most of the noise, and fixing them transforms the on-call experience. For complex systems, fold related noisy child alarms into a single tunable composite alarm.
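The diagnosis in step 1 can be roughed out offline before touching any alarm. The sketch below is an assumption-laden heuristic, not an AWS feature: the helper name `diagnose_threshold`, the width of the "near the threshold" band, and the 50% cut-off are all illustrative choices. It reads one metric value per line, in the same unit as the threshold.

```shell
# Hypothetical helper: classify a metric series against an alarm threshold.
# Usage: diagnose_threshold THRESHOLD BAND < values   (one value per line)
# Heuristic (assumed cut-off): if more than half the datapoints sit within
# +/- BAND of the threshold, the threshold is probably measuring noise.
diagnose_threshold() {
  awk -v t="$1" -v band="$2" '
    NF {
      total++
      v = $1 + 0
      if (v >= t - band && v <= t + band) near++
      if (v >= t + 0) breaches++
    }
    END {
      if (total == 0) { print "no data"; exit }
      printf "near=%d/%d breaches=%d ", near, total, breaches
      if (near / total > 0.5)  print "=> tune the threshold"
      else if (breaches > 0)   print "=> investigate the spikes"
      else                     print "=> review whether the alarm is a useful signal"
    }'
}
```

For a series like 8.8, 9.1, 8.9, 9.2, 9.0 with `diagnose_threshold 9 0.5`, every point lands inside the band, so the helper suggests tuning. The third branch is only a prompt for human judgement: whether a metric is a useful signal cannot be read off the numbers alone.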
```shell
# Tune the bounce-rate alarm: 3-of-5 evaluation periods, threshold raised to 12%.
# Reputation.BounceRate is a fraction, so 12% is 0.12.
# A new alarm name creates a new alarm; delete the old ">= 9%" alarm separately.
aws cloudwatch put-metric-alarm \
  --alarm-name 'SevenC3 SES bounce rate >= 12%' \
  --namespace AWS/SES \
  --metric-name Reputation.BounceRate \
  --statistic Average \
  --period 300 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 0.12 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ses-alerts
```
```shell
# Or, swap the fixed threshold for Anomaly Detection and let CloudWatch learn the seasonality.
aws cloudwatch put-metric-alarm \
  --alarm-name 'SevenC3 SES bounce rate anomaly' \
  --comparison-operator GreaterThanUpperThreshold \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold-metric-id ad1 \
  --metrics '[{"Id":"m1","MetricStat":{"Metric":{"Namespace":"AWS/SES","MetricName":"Reputation.BounceRate"},"Period":300,"Stat":"Average"},"ReturnData":true},{"Id":"ad1","Expression":"ANOMALY_DETECTION_BAND(m1, 2)"}]' \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ses-alerts
```

Quick quiz
You have a CloudWatch alarm that's fired 158 times in 30 days. You pull the underlying metric and see the value lives between 8.7% and 9.3% for most of the month, with the threshold set at 9%. What's the right next move?
Keep learning
Dig deeper into alarm design, Anomaly Detection, and reducing on-call noise.
- Amazon CloudWatch Anomaly Detection: Replace static thresholds with bands learned from the metric's own seasonality.
- Amazon CloudWatch Composite Alarms: Fold multiple noisy child alarms into one tunable parent to reduce page volume.
- AWS Well-Architected, Operational Excellence Pillar: Includes guidance on observability, alarm design, and runbook practices.
- Google SRE Book, Practical Alerting: The canonical reference on what makes a good alert and how to avoid alert fatigue.
You've completed Address frequently firing alarms. You now know how to diagnose a noisy alarm, decide between tuning, fixing, and suppressing, and run a monthly noise audit to keep the 80/20 of noisy alarms in check. The next time the on-call rotation grumbles about pages nobody acts on, you'll have a four-step loop ready to run — diagnose, tune or replace, route correctly, and delete what doesn't drive action.