Monitoring

Address frequently firing alarms

An alarm that fires 158 times in 30 days isn't catching incidents — it's generating noise. Tune, suppress, or fix the underlying problem.


Last reviewed 27 May 2026

Frequently firing alarms: the basics

What does "frequently firing" actually mean?

A CloudWatch alarm has a job: tell the on-call engineer when something is genuinely wrong. Every time it transitions from OK to ALARM and back, that's one episode — one ping in a channel, one row in the incident log, one moment of attention spent. A healthy alarm fires when the thing it watches breaks, which for most real systems is a handful of times a month at most.

"Frequently firing" describes an alarm that has crossed that threshold dozens of times over weeks — not because the underlying system keeps having real outages, but because the alarm is poorly tuned to the metric it watches. The classic shape is an alarm that fires 100+ times in 30 days with no corresponding incident tickets, no remediation work, and no escalations — just a stream of pages everyone has learned to ignore.

Most monitoring tools flag this pattern automatically. It's adjacent to but distinct from "flapping" (rapid OK ↔ ALARM transitions inside a single 24-hour window): frequently firing is many distinct episodes spread over weeks. Same root cause — a threshold mismatched to reality — but a different observable signal.

In this lesson you'll learn how to spot a frequently firing alarm, how to decide whether to tune it, fix the underlying signal, or suppress it to a quieter channel, and how to build alarm-health tracking into your monitoring practice so the noise doesn't creep back. You'll see real CloudWatch CLI calls to audit alarm history and apply targeted threshold changes.

Fun fact

Alert fatigue is measurable — and it kills response time

A 2020 study of incident response across SaaS engineering teams found that on-call engineers exposed to >50% false-positive page rates took 40% longer to respond to genuine pages than peers on cleaner rotations. The brain learns that a buzz at 3am usually means nothing — and that learning generalises. By the time a real incident fires, the first instinct is to silence the notification and check it later. "It's probably just that bounce-rate alarm again" is how outages get long.

Auditing a noisy alarm in action

Marco is the SRE lead at SevenC3, a mid-sized SaaS company. The on-call rotation has been grumbling for weeks about pages from an SES bounce-rate alarm — "SevenC3 SES bounce rate >= 9%" — that nobody ever actually acts on. The team's monitoring dashboard flags it: 158 state transitions in 30 days, severity HIGH on their alarm-health report.

Before he tunes or deletes anything, Marco wants to know which of two things is happening: either the bounce rate is genuinely sitting at 9% (so the threshold is wrong), or it is flapping near 9% (so the alarm has the right idea but the wrong period). He pulls the alarm's state-change history to see the shape of the noise.

What he finds is the most common pattern: bounce rate hovers between 8.7% and 9.3% for most of the month, and the alarm is essentially counting random crossings of a line that the system happens to sit on. The threshold isn't catching incidents; it's measuring noise.

First, pull the alarm's state-change history and count the transitions to ALARM. This is the noise audit.

$ aws cloudwatch describe-alarm-history --alarm-name 'SevenC3 SES bounce rate >= 9%' --history-item-type StateUpdate --start-date $(date -u -d '30 days ago' +%FT%TZ) --end-date $(date -u +%FT%TZ) --query 'AlarmHistoryItems[?contains(HistorySummary, `to ALARM`)] | length(@)'

158

# 158 OK→ALARM transitions in 30 days — roughly one page every 4.5 hours.

Counting OK→ALARM transitions over the audit window.
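The single count does not show whether those transitions are spread across the month or bunched into a few bad days, which is the difference between frequently firing and flapping. A minimal sketch of a per-day tally, assuming the history is exported as tab-separated timestamp and summary lines; the helper name alarms_per_day is hypothetical and not part of the AWS CLI:

```shell
# Tally transitions into ALARM per UTC day.
# Input (stdin): one "timestamp<TAB>summary" line per history item.
alarms_per_day() {
  awk -F'\t' '
    $2 ~ /to ALARM/ { n[substr($1, 1, 10)]++ }   # first 10 chars = YYYY-MM-DD
    END { for (day in n) print day, n[day] }
  ' | sort
}

# Assumed usage (needs AWS credentials, so left commented out):
# aws cloudwatch describe-alarm-history \
#   --alarm-name 'SevenC3 SES bounce rate >= 9%' \
#   --history-item-type StateUpdate \
#   --query 'AlarmHistoryItems[].[Timestamp,HistorySummary]' \
#   --output text | alarms_per_day
```

Many days with a handful of transitions each points to frequent firing; a few days carrying most of the count points to flapping.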

Now pull the underlying metric to see whether bounce rate is genuinely sitting near 9% or spiking past it.

$ aws cloudwatch get-metric-statistics --namespace AWS/SES --metric-name Reputation.BounceRate --start-time $(date -u -d '30 days ago' +%FT%TZ) --end-time $(date -u +%FT%TZ) --period 86400 --statistics Average Maximum --query 'Datapoints[*].[Timestamp,Average,Maximum]' --output text

2026-04-15T00:00:00Z 0.087 0.094
2026-04-16T00:00:00Z 0.089 0.092
2026-04-17T00:00:00Z 0.091 0.096
2026-04-18T00:00:00Z 0.090 0.093
2026-04-19T00:00:00Z 0.088 0.094

# Average lives at 8.7-9.1%: the alarm is firing on a line the system sits on, not on real incidents.

Daily bounce-rate distribution over the audit window.
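To put a number on "the metric sits on the threshold", one option is to count how many daily averages fall within a small band around it. This is a sketch only: the helper name near_threshold and the choice of band are assumptions, and it expects the three-column text output (timestamp, average, maximum) on stdin.

```shell
# Report how many daily averages sit within +/- BAND of THRESHOLD.
# Input (stdin): "timestamp average maximum" rows, whitespace-separated.
near_threshold() {  # usage: near_threshold THRESHOLD BAND
  awk -v t="$1" -v band="$2" '
    {
      total++
      d = $2 - t
      if (d < 0) d = -d          # absolute distance from the threshold
      if (d <= band) near++
    }
    END { printf "%d of %d days within %s of %s\n", near, total, band, t }
  '
}

# Assumed usage: pipe the get-metric-statistics text output into the helper.
# aws cloudwatch get-metric-statistics ... --output text | near_threshold 0.09 0.005
```

If most days land inside the band, the threshold is sitting on the metric's normal operating range, and tuning is the likely path rather than fixing or suppressing.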

How CloudWatch alarms actually decide to fire (deep dive)

A CloudWatch alarm evaluates its metric on a fixed cadence — by default every period (e.g. 60s, 300s) — and counts how many of the last N evaluation periods breached the threshold. The alarm transitions to ALARM only when M of those N periods breach (the --datapoints-to-alarm and --evaluation-periods parameters). Most teams leave these at 1-of-1, which is exactly the configuration that produces frequent firing: a single data point at the wrong side of the threshold flips the state immediately.

Increasing the M-of-N ratio is the cheapest mitigation. An alarm that needs 3 of 5 breaching datapoints before firing (and 3 of 5 OK datapoints before recovering) is dramatically less reactive to single-point noise without losing sensitivity to a genuine sustained event. The cost is between M and N periods of detection latency, which for a 3-of-5 alarm with a 60s period is 3-5 minutes. For a metric like bounce rate, that's a trivial trade.

CloudWatch Anomaly Detection is the structural fix when the underlying metric is genuinely variable. Instead of a hard-coded threshold, it learns the metric's daily/weekly seasonality and fires when the value falls outside a confidence band. For metrics with diurnal patterns (latency, request rate, queue depth, bounce rate) it produces a fraction of the noise of a fixed threshold — and catches anomalies a fixed threshold misses entirely.
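The effect of M-of-N evaluation can be sketched offline on a series of datapoints. The helper below is a simplified model, not CloudWatch's actual evaluator: it ignores missing-data handling and the INSUFFICIENT_DATA state, and the name count_fires and the sample values are made up for illustration.

```shell
# Count OK->ALARM transitions for datapoints on stdin (one value per line)
# under an M-of-N rule: ALARM when at least M of the last N values are >= THRESHOLD.
count_fires() {  # usage: count_fires THRESHOLD M N
  awk -v t="$1" -v m="$2" -v n="$3" '
    {
      breach[NR] = ($1 >= t) ? 1 : 0
      hits = 0
      for (i = NR; i > NR - n && i >= 1; i--) hits += breach[i]
      state = (hits >= m) ? "ALARM" : "OK"
      if (state == "ALARM" && prev != "ALARM") fires++
      prev = state
    }
    END { print fires + 0 }
  '
}

# Ten noisy points hovering around 9.0, then a sustained rise.
sample='8.8 9.1 8.9 8.8 9.2 8.7 8.9 9.1 8.8 8.9 9.4 9.5 9.6 9.5'
printf '%s\n' $sample | count_fires 9.0 1 1   # prints 4
printf '%s\n' $sample | count_fires 9.0 3 5   # prints 1
```

On this sample the 1-of-1 rule fires on every isolated crossing, while 3-of-5 fires once, on the sustained rise at the end.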

# Audit the noisiest alarms in the account over the last 30 days to find the 80/20.
# Requires GNU date for -d. Alarm names may contain spaces, so read one name per line.
start=$(date -u -d '30 days ago' +%FT%TZ)
aws cloudwatch describe-alarms --query 'MetricAlarms[*].AlarmName' --output text | \
  tr '\t' '\n' | \
  while IFS= read -r name; do
    count=$(aws cloudwatch describe-alarm-history \
      --alarm-name "$name" \
      --history-item-type StateUpdate \
      --start-date "$start" \
      --query 'AlarmHistoryItems[?contains(HistorySummary, `to ALARM`)] | length(@)')
    echo "$count $name"
  done | sort -rn | head -20
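To check whether the 80/20 pattern actually holds, the "count name" lines from an audit like this can be summarised as the share of all fires produced by the top K alarms. A small sketch; top_share is a hypothetical helper name:

```shell
# Share of all OK->ALARM transitions that come from the K noisiest alarms.
# Input (stdin): "count name" lines, in any order.
top_share() {  # usage: top_share K
  sort -rn | awk -v k="$1" '
    { total += $1; if (NR <= k) top += $1 }
    END {
      if (total > 0) printf "top %d alarms: %.0f%% of fires\n", k, 100 * top / total
    }
  '
}
```

Save the unsorted audit lines to a file and run `top_share 10 < counts.txt` (the file name is illustrative). A high percentage means fixing a short list removes most of the noise.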

What is the impact of leaving noisy alarms in place?

The first-order impact is response time on the real incidents. A team conditioned by months of false pages stops treating the next page as urgent. The buzz arrives, the engineer thinks "probably the bounce-rate one again," and the response that should have started in 2 minutes starts in 20. Multiply that across a noisy on-call rotation and the team is operating at a measurable handicap when something genuinely breaks.

The second-order impact is on the people doing the rotation. Sustained false-positive paging is one of the strongest predictors of on-call burnout — sleep fragmentation, weekend disruption, and the slow erosion of trust in the monitoring stack itself. Engineers leave teams over this; the cost shows up as attrition, not as a line item on the AWS bill.

The third-order impact is monitoring atrophy. Once the team has learned to ignore one alarm, they're more likely to silence others, mute channels, or stop investigating ALARM states entirely. By the time a real incident fires, half the alarms are muted and the dashboards haven't been opened in a week. This is how monitoring stacks die — not by going wrong, but by being ignored.

There's also a small but real direct cost: CloudWatch alarms are billed at $0.10 per alarm per month for standard resolution, and high-resolution alarms cost more. A single noisy alarm is cheap; an account full of unused, unmaintained alarms accumulating over years adds up — typically not enough to matter on its own, but a useful proxy metric for monitoring debt.

How do you fix a frequently firing alarm?

There are three remediation paths — tune, fix, or suppress — and they apply in that order of preference. The four-step loop below works for any noisy alarm; the only judgement call is which of the three paths the alarm belongs on.

1. Diagnose whether the metric is doing what's intended

Pull 30 days of the underlying metric and look at the distribution. If the metric genuinely sits near the threshold most of the time, the threshold is wrong — tune it. If the metric is mostly fine and the alarm catches real spikes, the underlying issue needs fixing. If neither — the metric just isn't a useful signal for whatever the alarm was trying to detect — suppress to a low-priority channel or delete.

2. Tune to reality, or replace fixed thresholds with Anomaly Detection

If the threshold is wrong, set it to a level the business actually cares about (e.g. bounce rate > 12%, not 9%) and combine with M-of-N evaluation periods (3-of-5 instead of 1-of-1) to absorb noise. For metrics with seasonality, switch to Anomaly Detection: on put-metric-alarm, replace the static --threshold with a --metrics array that includes an ANOMALY_DETECTION_BAND expression, and point --threshold-metric-id at that expression. Anomaly Detection typically produces fewer false positives than a static line, especially for diurnal metrics.

3. Suppress low-criticality alarms to a quieter channel

Not every alarm deserves a page. For informational signals — bounce rates trending up, queue depth growing, a non-critical job running long — route to Slack or email instead of PagerDuty. Use SNS topic routing or PagerDuty's severity rules to make this a config change rather than a code change. The criterion is simple: if there's no immediate action the on-call should take, it shouldn't wake them up.

4. Apply the delete-or-fix rule, and audit alarm health monthly

Any alarm that has been firing without remediation for 90+ days should be deleted. It is, by definition, not driving action — keeping it around degrades trust in the rest of the stack. Run the noise audit (describe-alarm-history aggregated by alarm) monthly and treat the top-10 noisiest alarms as a standing review item. Usually the 80/20 rule applies — a few alarms produce most of the noise, and fixing them transforms the on-call experience. For complex systems, fold related noisy child alarms into a single tunable composite alarm.

# Tune the bounce-rate alarm: 3-of-5 evaluation periods, threshold raised to 12%.
aws cloudwatch put-metric-alarm \
  --alarm-name 'SevenC3 SES bounce rate >= 12%' \
  --namespace AWS/SES \
  --metric-name Reputation.BounceRate \
  --statistic Average \
  --period 300 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 0.12 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ses-alerts

# Or, swap fixed threshold for Anomaly Detection — let CloudWatch learn the seasonality.
aws cloudwatch put-metric-alarm \
  --alarm-name 'SevenC3 SES bounce rate anomaly' \
  --comparison-operator GreaterThanUpperThreshold \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold-metric-id ad1 \
  --metrics '[{"Id":"m1","MetricStat":{"Metric":{"Namespace":"AWS/SES","MetricName":"Reputation.BounceRate"},"Period":300,"Stat":"Average"},"ReturnData":true},{"Id":"ad1","Expression":"ANOMALY_DETECTION_BAND(m1, 2)"}]' \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ses-alerts
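For the composite-alarm option mentioned in step 4, the alarm rule is a boolean expression over child alarm states. The sketch below only builds that rule string; the helper name composite_rule and the child alarm names are hypothetical, and it assumes alarm names contain no double quotes.

```shell
# Build a composite alarm rule joining child alarms with AND or OR.
composite_rule() {  # usage: composite_rule AND|OR NAME...
  local op="$1" rule="" name
  shift
  for name in "$@"; do
    rule="${rule:+$rule $op }ALARM(\"$name\")"
  done
  printf '%s\n' "$rule"
}

# Assumed usage (needs AWS credentials, so left commented out):
# aws cloudwatch put-composite-alarm \
#   --alarm-name 'SevenC3 SES delivery degraded' \
#   --alarm-rule "$(composite_rule AND 'child alarm A' 'child alarm B')" \
#   --alarm-actions arn:aws:sns:us-east-1:123456789012:ses-alerts
```

Joining with AND pages only when all children agree, which is the noise-reducing choice; OR simply merges the children into one notification stream.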

You've completed Address frequently firing alarms. You now know how to diagnose a noisy alarm, decide between tuning, fixing, and suppressing, and run a monthly noise audit to keep the 80/20 of noisy alarms in check. The next time the on-call rotation grumbles about pages nobody acts on, you'll have a four-step loop ready to run — diagnose, tune or replace, route correctly, and delete what doesn't drive action.

Frequently firing alarms: what it costs to ignore them

Alert noise is an operational overhead with a measurable price tag

A CloudWatch alarm is a decision gate: it exists to trigger a human response when a metric crosses a threshold that matters to the business. Each transition from OK to ALARM that generates no remediation (no ticket, no fix, no investigation) is a unit of wasted on-call time. When an alarm fires 158 times in 30 days without a single follow-through action, you are paying engineering labour at on-call rates to acknowledge noise roughly once every four and a half hours, around the clock.

The direct cost model is straightforward: count the annual alarm-fire frequency, multiply by the average time-to-acknowledge and the loaded hourly cost of an on-call engineer, and you have the labour cost of leaving the alarm untuned. For a senior engineer paged at off-hours rates, 158 monthly false fires can represent thousands of dollars in fully-loaded labour annually — per alarm. Spread across an account with dozens of noisy alarms, the aggregate is material.

The indirect cost is harder to price but larger: teams conditioned by months of false pages take longer to respond to real incidents, and the monitoring stack itself erodes in usefulness. Both effects translate into higher mean-time-to-resolve on actual outages — which hits SLA penalties, customer retention, and revenue. The frequently-firing alarm control is the trigger to quantify and address that exposure before it compounds.

This lesson is for the finance partner who wants to understand why noisy alarms show up as an operational cost item and how to quantify it. You'll get the framework for estimating the labour cost of false pages, why a frequently-firing alarm is a recoverable spend problem rather than a permanent overhead, and the governance levers (a monthly noise audit and a delete-or-fix rule for alarms that have fired without remediation for 90+ days) that keep on-call costs from silently compounding. No CLI knowledge required.

How a finance partner frames the noisy-alarm conversation

Dana is the finance partner at SevenC3. When the monthly engineering review surfaces that the on-call rotation received 158 pages from a single SES bounce-rate alarm last month, her first question isn't about the alarm itself — it's about the labour cost. She estimates: 158 acknowledgements at an average of 8 minutes each, at the loaded on-call rate, comes to roughly 21 hours of engineering time in a single month. For a senior engineer at fully-loaded cost, that's a meaningful line item — and the alarm wasn't catching real incidents.

Dana tables a second question: what does it cost to fix versus what does it cost to leave it? The SRE team says threshold tuning takes about an hour. Dana approves the work immediately — the payback period is measured in days. She also asks for the same calculation on the next-noisiest ten alarms in the account, and puts a standing agenda item on the monthly review: false-positive page rate as an operational efficiency metric alongside cost and performance.

Her framing for the finance pack is: "Alarm noise is a labour overhead that compounds quietly. We now track it, and the cost of remediation is consistently a fraction of the cost of the noise it eliminates."

How alert noise translates into cost — and into risk

The quantifiable cost of a noisy alarm has two components: direct labour and indirect exposure. The direct component is tractable: alarm fires per month multiplied by average acknowledgement time multiplied by the loaded on-call rate. For a senior engineer at typical fully-loaded costs, a single alarm firing 158 times per month — roughly once every four and a half hours — can represent several thousand dollars of annual labour, none of which produces a resolved incident. Scale that across an account with a dozen poorly-tuned alarms and the aggregate is a meaningful, recurring overhead.
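For readers who want to check the arithmetic, the direct-labour formula can be written as a few lines of shell. This is an illustrative sketch: the helper name noise_cost is hypothetical, and the hourly rate is an input you supply, not a figure from this lesson.

```shell
# Direct labour cost of one noisy alarm.
# usage: noise_cost FIRES_PER_MONTH ACK_MINUTES HOURLY_RATE
noise_cost() {
  awk -v f="$1" -v m="$2" -v r="$3" 'BEGIN {
    hours = f * m / 60            # engineer-hours per month spent acknowledging
    printf "%.1f hours/month, %.0f per year\n", hours, hours * r * 12
  }'
}

# Example call; replace RATE with your own loaded on-call rate.
# noise_cost 158 8 "$RATE"
```

With 158 fires and an 8-minute acknowledgement, the hours figure comes to about 21 per month; the annual cost scales linearly with whatever rate is plugged in.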

The indirect component is larger and harder to model precisely, but it connects directly to the risk register. Alert-fatigued teams take materially longer to respond to genuine incidents — conservatively 40% longer per published incident-response research. Longer response times translate into longer mean time to resolution, which means longer customer-visible outages, greater SLA exposure, and higher probability of a reputational event. The cost of a single major incident where slow response was a factor almost always exceeds the annual cost of the alarm noise that caused the fatigue.

There is also a workforce cost. Sustained false-positive paging is a leading predictor of on-call burnout and voluntary attrition. Replacing an experienced on-call engineer costs roughly 1.5–2x their annual salary in recruiting and ramp time. Alarm hygiene is rarely framed as a retention issue, but it is one — and unlike the labour cost of false pages, attrition is a one-time cliff event rather than a recoverable monthly overhead.

The direct CloudWatch billing cost is the smallest component: $0.10 per alarm per month at standard resolution. It's worth tracking as a proxy for monitoring debt, but it should not be the primary frame — the labour, risk, and retention costs are orders of magnitude larger and are the right basis for prioritising remediation work.

What finance can do about frequently firing alarms

Finance doesn't tune alarm thresholds, but it owns the framing that makes alarm hygiene a prioritised spend decision rather than an indefinitely deferred housekeeping task. Four levers, applied on a regular cadence.

1. Put false-positive page rate on the monthly efficiency review

Ask for the alarm-fire-to-remediation ratio: of all pages generated last month, what fraction led to an action? If more than 30-40% of pages led to no action, that false-positive rate is a cost problem and a risk signal worth tracking. Framing it as a monthly metric makes it visible and creates pressure to improve it, whereas a one-off audit gets fixed once and regresses.

2. Cost the noise before approving remediation work

Require a simple calculation before prioritising alarm-tuning work: fires per month multiplied by average acknowledgement time multiplied by on-call labour rate. For most noisy alarms the payback period for a one-hour tuning effort is measured in days. Showing that calculation makes the business case obvious and gets the work scheduled rather than queued indefinitely.

3. Apply a 90-day delete-or-justify rule as a standing governance item

Any alarm that has fired 20+ times in the past 90 days with no linked incident or remediation ticket should either be fixed or deleted at the next review. Finance can enforce this by treating surviving alarm debt — counted as alarm-months above the threshold — as an operational overhead line, the same way you would treat unresolved cost anomalies.

4. Track the 80/20 — a few alarms produce most of the noise

The top-10 noisiest alarms in most accounts account for the large majority of total false pages. Finance can ask for that concentrated list and treat remediating it as a discrete, time-boxed project with a clear ROI calculation. Fixing the top ten is almost always both sufficient and faster than a broad hygiene sweep.
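Two of the numbers these levers rely on, the false-positive page rate and the payback period for a tuning effort, are simple ratios. A hedged sketch follows; the helper names are hypothetical and the payback calculation assumes a 30-day month.

```shell
# Share of pages that led to no action, as a whole-number percentage.
false_positive_rate() {  # usage: false_positive_rate TOTAL_PAGES ACTIONED_PAGES
  awk -v total="$1" -v acted="$2" 'BEGIN {
    if (total <= 0) { print "n/a"; exit }
    printf "%.0f\n", 100 * (total - acted) / total
  }'
}

# Days until a one-off fix pays for itself in saved acknowledgement time.
payback_days() {  # usage: payback_days FIX_HOURS FIRES_PER_MONTH ACK_MINUTES
  awk -v fix="$1" -v f="$2" -v m="$3" 'BEGIN {
    if (f * m <= 0) { print "n/a"; exit }
    saved_per_day = f * m / 60 / 30   # hours of acknowledgement avoided per day
    printf "%.1f\n", fix / saved_per_day
  }'
}

payback_days 1 158 8   # one hour of tuning vs 158 fires at 8 minutes each: prints 1.4
```

Because both sides of the payback ratio are measured in engineer-hours, the result does not depend on the labour rate.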

You've finished the finance partner's view of alarm hygiene. You know how to cost a noisy alarm in terms of wasted on-call labour, why the remediation payback is almost always measured in days rather than quarters, and the four levers — false-positive rate on the monthly review, a cost-first prioritisation model, a 90-day delete-or-justify rule, and focusing on the top-10 to capture the 80/20 — that keep monitoring debt from compounding. Next time this comes up in a sprint planning discussion, you'll have a number, not a shrug.

Frequently firing alarms: the headline

Noise desensitises the team that stands between you and the next outage

An alarm that fires a hundred times a month without anyone acting on it is not a monitoring tool — it is background noise. The engineering team learns to ignore it, and that habit generalises: the next page, the one that matters, gets the same slow initial response as the last hundred that didn't.

This is a reliability risk disguised as an operational quirk. The question for leadership is not how many alarms exist, but whether the team still trusts them. A monitoring stack eroded by noise takes longer to respond to genuine incidents, and longer response times convert into longer outages, missed SLAs, and reputational exposure. Addressing frequently firing alarms is how you preserve the on-call team's trust in the systems they operate.

A short read for the executive who wants to understand why the monitoring team's alert hygiene is a leadership concern. You'll get the plain-English version of what a noisy alarm does to incident response quality, why this is a risk posture issue rather than a technical one, and what the one-question readout looks like at a leadership review: the trend on false-positive page rate and whether the team still trusts the stack.

What it looks like when leadership takes alarm hygiene seriously

Priya, a CTO, used to get a vague answer whenever she asked whether the on-call team was operating well. After the second quarter in which an incident post-mortem cited slow initial response as a contributing factor, she added one metric to her monthly engineering readout: false-positive page rate.

The first time the number appeared — 68% of all pages in the previous month generated no remediation action — it was obvious the monitoring stack had a trust problem. The team wasn't lazy; they'd been conditioned by months of noise. Engineering triaged the top-10 noisiest alarms, which accounted for 80% of the false pages, and addressed them over two weeks.

Three months later the false-positive rate was under 15%, mean-time-to-acknowledge on real incidents had dropped by 30%, and the on-call rotation had stopped rotating people out after one week because of burnout. Priya's read: "Alarm hygiene isn't an engineering housekeeping task. It's how you preserve your incident response capability before you need it."

Why this is a reliability risk, not just a tooling problem

The impact of frequently firing alarms is not an inconvenience for engineers — it is a degradation of the organisation's ability to respond to real incidents. A team that has been paged 158 times in a month without a single genuine incident to show for it has learned, rationally, that pages don't matter. That learned indifference does not turn off selectively when the next real outage fires.

For leadership, the risk framing is this: every month that high false-positive page rates go unaddressed, the effective response capability of the on-call team decreases. The next time a customer-facing service goes down, the team that should respond in two minutes is operating with a habituated delay. That delay is the gap between an incident resolved before customers notice and one that shows up in the post-mortem as 'we were slow to triage because we assumed it was noise.'

The accountability question is straightforward: does the organisation know its false-positive page rate, does that rate trend over time, and is there a standing process to address the alarms generating most of the noise? An organisation that can answer yes to all three has monitoring governed by policy. One that cannot is relying on luck that the team's response instincts survive another quarter of conditioning.

The leadership handle on alarm hygiene

The executive question isn't which thresholds to change — it's whether the organisation has a policy that prevents monitoring debt from accumulating silently and degrading incident response capability over time.

1. Require a false-positive page rate as a standing metric

Ask for it at the monthly engineering review: of all pages generated, what fraction required no action? A healthy stack keeps that false-positive rate under 20-25%. Anything above 50% means the team is responding to more noise than signal. Making the number visible is the first governance step; teams that track it improve it.

2. Set a delete-or-fix policy for alarms that fire unactioned for 90+ days

Establish a standing rule: any alarm that has fired repeatedly without generating a remediation action for 90 days is either fixed or deleted. It is not a neutral cost of running a monitoring stack; it is a degrader of the incident response capability. A policy removes the need for individual engineering judgment on whether to touch a legacy alarm.

3. Ask one question at the leadership review

The signal you want is: does the on-call team trust the monitoring stack? Proxy metrics are the false-positive page rate trend and mean-time-to-acknowledge on genuine incidents. Both trends improving is evidence that the policy is working. A flat or worsening false-positive rate is the trigger for a deeper conversation.

That's the lesson. Two takeaways: a high false-positive page rate is a reliability risk that shows up as slow incident response before it shows up as a visible outage, and the governance answer is a standing metric and a delete-or-fix policy rather than a one-off cleanup. The leadership question at every review is whether the team still trusts the monitoring stack — and the false-positive rate is the one number that answers it.

Part of the learning path Get your alarms right
