Monitoring

Tame flapping CloudWatch alarms

10 state transitions in 24 hours isn't a fire — it's a misconfigured threshold. Tune the eval window or use anomaly detection.

12 min·10 sections·AWS

Last reviewed 27 May 2026

Flapping alarms: the basics

What does it mean for a CloudWatch alarm to flap?

A CloudWatch alarm flaps when it bounces between ALARM and OK over and over in a short window — 5, 10, sometimes 20 state transitions in 24 hours. Each transition fires whatever's wired to AlarmActions and OKActions: an SNS topic, a PagerDuty integration, an auto-scaling step, a Lambda. So a flapping alarm isn't a single noisy notification; it's a fire hose. Ops gets paged at 02:14, recovered at 02:16, paged at 02:19, recovered at 02:23, and so on until someone snoozes the rotation entirely.

The metric isn't broken. The alarm is. The threshold sits right at the value the metric normally hovers around, the evaluation window is a single one-minute period, and the metric has enough natural variance to cross the line a dozen times a day under perfectly healthy conditions. The alarm is doing exactly what you asked — it just turns out you asked for the wrong thing.

AWS surfaces this pattern through check ALH-004 ("Flapping Alarms"), which counts state transitions over a 24-hour window and flags anything with more than a small handful. The default heuristic is that real incidents are sticky — once they start, they stay started for minutes-to-hours — and an alarm transitioning every few minutes is almost certainly oscillating around its threshold rather than detecting a real, sustained problem.

In this lesson you'll learn how to recognise a flapping CloudWatch alarm, why it's flapping (it's almost always one of three things), and how to tune it back into something useful with the M-of-N evaluation model, a longer period, a better threshold, or an Anomaly Detection band. You'll also learn when flapping is a real signal that the underlying system is oscillating — and you should fix that, not the alarm.

Fun fact

The pager that cried wolf

Google's SRE book famously argues that an alert page should be actionable, urgent, and rare. Their internal benchmark: on-callers should get at most two pages per 12-hour shift, and every page should require a human decision. The most common reason that bar gets blown is flapping — one badly-tuned alarm firing 20 times overnight burns the whole month's page budget by Tuesday. Once on-callers learn an alarm is unreliable, they start ignoring it; the first real incident it catches is the one that gets missed.

Taming a flapping alarm in action

Marco runs platform at a SaaS company. A check fires from the FinOps dashboard: alarm "SES bounce rate >= 9%" has transitioned state 10 times in the last 24 hours. Severity MEDIUM. Marco's PagerDuty inbox confirms it: five pages, five auto-resolves (ten transitions in all), and nobody actually did anything about any of them.

He pulls the alarm's history before changing a thing. The pattern is unmistakable: the metric bounces between 8.7% and 9.3% every few minutes, crossing the 9.0% threshold each time. The bounce rate isn't broken; it's just naturally variable, and the threshold sits right in the middle of its normal range.

He has three levers: require multiple periods to alarm (M-of-N), lengthen the period to smooth the metric, or move to anomaly detection. He picks M-of-N first because it's the cheapest change with the biggest effect.

First, count state transitions over the last 24 hours. This is the canonical "is this alarm flapping?" query.

$ aws cloudwatch describe-alarm-history \
    --alarm-name 'SES bounce rate >= 9%' \
    --history-item-type StateUpdate \
    --start-date "$(date -u -d '24 hours ago' +%FT%TZ)" \
    --end-date "$(date -u +%FT%TZ)" \
    --query 'length(AlarmHistoryItems)'

# 10 transitions in 24h — textbook flap.

Counting StateUpdate events over the last day.

Now look at the actual transitions to see how close the metric is to the threshold each time it crosses.

$ aws cloudwatch describe-alarm-history \
    --alarm-name 'SES bounce rate >= 9%' \
    --history-item-type StateUpdate \
    --max-items 6 \
    --query 'AlarmHistoryItems[*].[Timestamp,HistorySummary]' \
    --output table

┌──────────────────────┬───────────────────────────────────────────────────────┐
│ Timestamp            │ HistorySummary                                        │
├──────────────────────┼───────────────────────────────────────────────────────┤
│ 2026-05-15T02:14:00Z │ Alarm updated from OK to ALARM (9.12 >= 9.00)         │
│ 2026-05-15T02:16:00Z │ Alarm updated from ALARM to OK (8.81 < 9.00)          │
│ 2026-05-15T02:19:00Z │ Alarm updated from OK to ALARM (9.04 >= 9.00)         │
│ 2026-05-15T02:23:00Z │ Alarm updated from ALARM to OK (8.72 < 9.00)          │
│ 2026-05-15T02:27:00Z │ Alarm updated from OK to ALARM (9.21 >= 9.00)         │
│ 2026-05-15T02:31:00Z │ Alarm updated from ALARM to OK (8.94 < 9.00)          │
└──────────────────────┴───────────────────────────────────────────────────────┘

# Crossings are all within 0.3 of the threshold. Metric is healthy; the alarm is too tight.

Six most recent transitions — the metric is oscillating ±0.3 around the line.
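
To check the "within 0.3" reading mechanically, the summaries can be parsed and each crossing's distance from the threshold computed. The sketch below is in Python and assumes the HistorySummary strings carry the value and threshold in parentheses exactly as in the table above; real summaries may be worded differently, so treat the regular expression as an assumption.

```python
import re

# HistorySummary strings copied from the table above.
summaries = [
    "Alarm updated from OK to ALARM (9.12 >= 9.00)",
    "Alarm updated from ALARM to OK (8.81 < 9.00)",
    "Alarm updated from OK to ALARM (9.04 >= 9.00)",
    "Alarm updated from ALARM to OK (8.72 < 9.00)",
    "Alarm updated from OK to ALARM (9.21 >= 9.00)",
    "Alarm updated from ALARM to OK (8.94 < 9.00)",
]

# Assumed layout: "(<value> <operator> <threshold>)" at the end of the summary.
PATTERN = re.compile(r"\((\d+(?:\.\d+)?) (?:>=|<) (\d+(?:\.\d+)?)\)$")


def crossing_margins(lines):
    """Return |value - threshold| for each summary that matches PATTERN."""
    margins = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            value, threshold = float(match.group(1)), float(match.group(2))
            margins.append(round(abs(value - threshold), 2))
    return margins


margins = crossing_margins(summaries)
print(margins)       # distance of each crossing from the threshold
print(max(margins))  # largest distance across the six transitions
```

If the largest margin is small relative to the metric's normal spread, that points at the threshold placement rather than at the service.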

Flapping under the hood (deep dive)

A CloudWatch alarm evaluates a metric on a defined Period (the size of each datapoint, e.g. 60s or 300s), checks the most recent EvaluationPeriods datapoints, and transitions to ALARM if at least DatapointsToAlarm of them breach the threshold. A common starting configuration is EvaluationPeriods=1 with DatapointsToAlarm=1 (if you omit DatapointsToAlarm, every one of the EvaluationPeriods datapoints must breach): a single bad datapoint, and you're paged. This is fine for hard-edged metrics (was the Lambda invoked?) and terrible for noisy ones (bounce rate, queue depth, p99 latency).

The fix is the M-of-N pattern: require M breaching datapoints out of the last N. Setting EvaluationPeriods=5 and DatapointsToAlarm=3 means the alarm fires only if 3 of the last 5 periods breach the threshold, so a random spike in a single period is absorbed while a sustained issue still triggers. At a 5-minute period, that means at least 15 minutes of breaching data within a 25-minute window (the three breaching periods need not be consecutive). That's appropriate for slow-burn alarms (bounce rate, error rate, fill rate) but slower than you want for pages on user-facing latency.
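
To make the difference concrete, here is a small Python simulation of M-of-N evaluation. It is a simplified model, not CloudWatch's implementation: it ignores missing data and the INSUFFICIENT_DATA state, and the datapoints are hypothetical values chosen to show isolated spikes followed by a sustained breach.

```python
def alarm_states(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return 'ALARM' or 'OK' after each datapoint using M-of-N evaluation."""
    states = []
    for i in range(len(datapoints)):
        # Look at the most recent N datapoints (fewer at the start of the series).
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaching = sum(1 for value in window if value >= threshold)
        states.append("ALARM" if breaching >= datapoints_to_alarm else "OK")
    return states


def count_transitions(states):
    """Count how many times the state changes between consecutive datapoints."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Hypothetical series: three isolated spikes, then a sustained breach.
series = [8.8, 9.1, 8.8, 8.9, 9.2, 8.7, 8.8, 9.1, 8.9, 8.8, 9.4, 9.5, 9.6, 9.5]

one_of_one = alarm_states(series, 9.0, evaluation_periods=1, datapoints_to_alarm=1)
three_of_five = alarm_states(series, 9.0, evaluation_periods=5, datapoints_to_alarm=3)

print(count_transitions(one_of_one))     # 7
print(count_transitions(three_of_five))  # 1
print(three_of_five[-1])                 # ALARM
```

With 1-of-1, every isolated spike produces a page and a recovery; with 3-of-5, the spikes are absorbed and the alarm changes state once, when the breach is sustained.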

The other lever is the metric itself. CloudWatch Anomaly Detection runs a model over the last ~2 weeks of a metric, learns its daily and weekly seasonality, and emits an upper/lower band. An anomaly-detection alarm fires when the metric leaves the band, not when it crosses an absolute number. It's immune to the kind of flapping caused by a static threshold sitting on top of a naturally-variable metric — but it's not free: it adds cost per metric per month, and it takes time to train, so cold-start alarms behave oddly for the first couple of weeks.
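
The band idea can be illustrated with a deliberately crude stand-in: a fixed band of mean plus or minus k standard deviations over a training window. This is not CloudWatch's model (which also learns seasonality); the history values and the width of 2.0 are hypothetical, and the point is only to contrast "outside the normal range" with "above a fixed number".

```python
from statistics import mean, pstdev


def band(history, width=2.0):
    """Return (lower, upper) as mean -/+ width * population standard deviation."""
    centre = mean(history)
    spread = pstdev(history)
    return centre - width * spread, centre + width * spread


def outside_band(value, history, width=2.0):
    """True when value falls outside the band learned from history."""
    lower, upper = band(history, width)
    return value < lower or value > upper


# Hypothetical training window for a bounce-rate-like metric (percent).
history = [8.6, 8.9, 9.1, 8.8, 9.2, 8.7, 9.0, 8.9, 9.3, 8.5]

STATIC_THRESHOLD = 9.0
for value in (9.1, 10.2):
    print(value, value >= STATIC_THRESHOLD, outside_band(value, history))
```

A reading of 9.1 breaches the static threshold but sits inside the band, while 10.2 is flagged by both; that gap is where static-threshold flapping lives.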

# Inspect the alarm's current evaluation config.
aws cloudwatch describe-alarms \
  --alarm-names 'SES bounce rate >= 9%' \
  --query "MetricAlarms[0].{Period:Period,Eval:EvaluationPeriods,Datapoints:DatapointsToAlarm,Threshold:Threshold}"

# Example output:
# {
#   "Period": 60,
#   "Eval": 1,
#   "Datapoints": 1,
#   "Threshold": 9.0
# }
# Single 1-minute period at the metric's typical value — guaranteed to flap.

What is the impact of leaving an alarm flapping?

The most direct impact is alert fatigue. An on-caller who gets paged ten times in a shift for the same alarm stops reading the page text — they swipe to acknowledge, go back to sleep, and the next real incident on that alarm lands in a brain that has been trained to ignore it. SRE post-mortems are full of "the alarm fired but we'd been ignoring it for weeks."

The second-order impact is downstream automation. AlarmActions don't just page humans; directly or through SNS, they drive Auto Scaling steps, Lambda functions, SSM documents, and Step Functions workflows. A flapping CPU alarm wired to a scaling policy will scale out and in every few minutes, churning instances and burning money on EC2 instance time and EBS volumes that exist for thirty minutes apiece. A flapping alarm wired to an incident-response Lambda spawns ten investigations a day, ten log queries, ten Slack threads.

The third impact is on the SLO itself. If your alarm threshold corresponds to a service-level objective ("page if error rate > 1%"), a flapping alarm corrupts your error-budget accounting: every transition shows up as a notional breach event, and the team starts arguing about whether the SLO is really broken or the alarm is just wrong. Both can be true at once, which is the worst case to debug.

And there's a cost dimension: SNS notifications, PagerDuty events, and downstream Lambda invocations are all billed per event. A single flapping alarm doing 20 transitions a day across multiple actions isn't expensive by itself — but at fleet scale across 200 alarms, the bill plus the human time to investigate adds up to thousands a month for no actionable information.

How do you safely tame a flapping alarm?

Taming a flapping alarm is a four-step loop. The order matters — guessing at the tuning before you look at the metric's actual distribution is how you end up with an alarm that's no longer noisy but also no longer catches the thing it was meant to catch.

1. Confirm it's the alarm, not the system

Pull describe-alarm-history and the underlying metric for the last 7 days. If the metric is oscillating tightly around the threshold while the underlying service is healthy, the alarm is mis-tuned. If the metric is genuinely sawtoothing — HPA fighting itself, retry storms, a feedback loop in queue depth — the system is the problem and the alarm is correctly reporting it. Fix the system, not the alarm.

2. Tune evaluation periods, period, and threshold

Try M-of-N first: EvaluationPeriods=5, DatapointsToAlarm=3 absorbs almost all single-period noise. If that's not enough, raise the Period from 60s to 300s — the longer averaging window smooths bursts. Move the threshold last, and only by a small buffer above the metric's typical p95 (not the mean) — otherwise you'll silence real signals along with the noise.

3. Consider Anomaly Detection for naturally variable metrics

If the metric has strong daily or weekly seasonality (traffic, queue depth, bounce rates that vary by send volume), a static threshold is the wrong tool. Replace it with a CloudWatch Anomaly Detection alarm — the band adapts to the metric's pattern and only fires on actual deviation. Acceptable trade-off: small added cost per metric, a 2-week warm-up before the band is reliable.

4. Audit the rest of the fleet for the same pattern

Where there's one flapping alarm there are usually ten. Loop describe-alarm-history over every alarm in the account, count StateUpdates per 24h, and flag anything over 5 transitions. The fix from steps 1-3 applies to all of them; the audit prevents the next 2am pager-storm of the same shape.
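
The audit in step 4 boils down to counting StateUpdate items per alarm inside a 24-hour window. The Python sketch below works on items shaped like the AlarmHistoryItems entries returned by describe-alarm-history (AlarmName, HistoryItemType, Timestamp); how you fetch them is left out, and the second alarm name and all timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def flapping_alarms(history_items, now, max_transitions=5, window=timedelta(hours=24)):
    """Return {alarm_name: transitions} for alarms over max_transitions in window."""
    cutoff = now - window
    counts = Counter(
        item["AlarmName"]
        for item in history_items
        if item["HistoryItemType"] == "StateUpdate"
        and cutoff <= item["Timestamp"] <= now
    )
    return {name: n for name, n in counts.items() if n > max_transitions}


now = datetime(2026, 5, 15, 12, 0, tzinfo=timezone.utc)


def item(name, hours_ago, kind="StateUpdate"):
    """Build a hypothetical history item for the example."""
    return {
        "AlarmName": name,
        "HistoryItemType": kind,
        "Timestamp": now - timedelta(hours=hours_ago),
    }


history = (
    [item("SES bounce rate >= 9%", h) for h in range(1, 11)]          # 10 transitions
    + [item("hypothetical-queue-depth", h) for h in (2, 9)]           # 2 transitions
    + [item("hypothetical-queue-depth", 30)]                          # outside the window
    + [item("SES bounce rate >= 9%", 3, kind="ConfigurationUpdate")]  # not a transition
)

print(flapping_alarms(history, now))
```

Keeping the counting logic separate from the API calls makes the cut-off of 5 transitions easy to adjust and test.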

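# Apply M-of-N evaluation: 3 out of the last 5 periods at 5-minute granularity.
# put-metric-alarm replaces the whole alarm definition, so restate every setting
# you want to keep (for example --ok-actions, if the existing alarm has any).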
aws cloudwatch put-metric-alarm \
  --alarm-name 'SES bounce rate >= 9%' \
  --namespace AWS/SES \
  --metric-name Reputation.BounceRate \
  --statistic Average \
  --period 300 \
  --evaluation-periods 5 \
  --datapoints-to-alarm 3 \
  --threshold 9.5 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:ses-alerts

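# Result: the alarm fires only when 3 of the last 5 five-minute datapoints
# (15 of the last 25 minutes) are at or above 9.5, which absorbs the ±0.3
# oscillation around the old 9.0% line. Check whether your metric reports a
# percentage or a fraction before copying the threshold.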
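
One way a threshold like 9.5 could be arrived at is the "small buffer above p95" rule from step 2. The Python sketch below uses a nearest-rank percentile; the sample values and the 0.3 buffer are hypothetical, and in practice the samples would come from the metric's recent history.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100), 1-based
    return ordered[max(rank, 1) - 1]


def buffered_threshold(values, pct=95, buffer=0.3):
    """Threshold = chosen percentile plus a small buffer, rounded to 2 places."""
    return round(nearest_rank_percentile(values, pct) + buffer, 2)


# Hypothetical healthy-period samples of the metric (percent).
samples = [
    8.7, 8.7, 8.8, 8.8, 8.8, 8.8, 8.9, 8.9, 8.9, 8.9,
    9.0, 9.0, 9.0, 9.0, 9.1, 9.1, 9.1, 9.2, 9.2, 9.3,
]

print(round(sum(samples) / len(samples), 2))  # mean: 8.96
print(nearest_rank_percentile(samples, 95))   # p95: 9.2
print(buffered_threshold(samples))            # 9.5
```

In this made-up sample the mean lands almost exactly on the old 9.0 line, which is the situation that produces flapping; anchoring on p95 moves the line above the metric's normal range.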

Quick quiz

Question 1 of 5

An alarm has transitioned state 10 times in 24 hours. The metric is oscillating between 8.8% and 9.2% against a 9.0% threshold, and the underlying service is healthy. What's the right fix?

Keep learning

Dig deeper into CloudWatch alarm tuning and the math behind M-of-N evaluation.

You've completed Tame flapping CloudWatch alarms. You can now spot a flap from describe-alarm-history, decide whether the alarm or the system is the problem, and apply the right combination of M-of-N, a longer period, a buffered threshold, or anomaly detection to silence the noise without losing the signal. The next time a 2am page-storm fires from the same alarm twice in three minutes, you'll have a four-step loop ready to run.

Flapping alarms: what it means for cost and signal quality

Misconfigured thresholds generate billable noise and erode the value of your alerting investment

A CloudWatch alarm flaps when it bounces between ALARM and OK repeatedly — 10, 20, sometimes more state transitions in a single day. Every transition is a billable event: SNS notifications, PagerDuty pages, and any downstream Lambda invocations each carry a per-event charge. A single flapping alarm is cheap in isolation; at fleet scale across dozens of alarms, the cumulative bill for noise with zero operational value is measurable and completely avoidable.

More importantly, every flapping alarm fires against engineering time. An on-call engineer who gets paged ten times overnight for the same auto-resolving condition isn't responding to incidents — they're burning incident-response budget on a misconfiguration. That's a unit-economics problem: you're paying the human cost of incident response without getting the reliability outcome it's meant to buy.

AWS check ALH-004 surfaces flapping alarms by counting state transitions over 24 hours and flagging anything above a small threshold. The cost framing is straightforward: each flagged alarm represents a stream of billable events and billable human time that, once the alarm is correctly tuned, drops to near zero. Fixing a flapping alarm is one of the few operational improvements with an immediate, measurable, positive impact on both the observability bill and the on-call burden.

This lesson is for the finance partner who sees SNS and Lambda line items on the cloud bill and wants to understand where flapping alarms fit in the observability cost picture. You'll learn why a misconfigured evaluation window turns a single alarm into a per-event billing stream, how to quantify the noise cost at fleet scale, and what the tuning levers are — M-of-N, period, threshold buffer — so you can speak to the remediation cost-benefit with engineering. No CLI commands required.

How a finance partner reads the alarm-flapping bill

Dana is the FinOps lead at a SaaS company. During a monthly cloud cost review she spots an SNS line climbing steadily — $340 in April, $410 in May, nothing obvious in the release log to explain it. She pulls a breakdown by topic and finds one topic, ses-alerts, generating over 600 notifications a month. A quick cross-reference shows it wired to the same "SES bounce rate >= 9%" alarm Marco's team has been getting paged on.

Dana maps the cost chain: every state transition fires one SNS notification (effectively free at this volume; the Lambda subscriber bills at roughly $0.0000002 per request, so that isn't the issue either) and one PagerDuty event ($0.08 per incident under their current plan). Ten transitions a day across a 30-day month is 300 PagerDuty incidents from a single alarm: around $24 in PagerDuty charges alone, plus the SNS publishes, plus the analyst time to triage 300 non-incidents.
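
Dana's arithmetic can be written down as a small Python helper so the same calculation can be repeated for other alarms. The rates are the ones quoted in the story; the function and dictionary key names are illustrative, and the Lambda figure covers the request charge only.

```python
def monthly_noise_cost(transitions_per_day, days, per_event_rates):
    """Return (event_count, {action: monthly_cost_usd}) for one alarm."""
    events = transitions_per_day * days
    costs = {name: round(events * rate, 6) for name, rate in per_event_rates.items()}
    return events, costs


rates = {
    "pagerduty": 0.08,             # USD per incident, from the plan in the story
    "lambda_requests": 0.0000002,  # USD per request, duration not included
}

events, costs = monthly_noise_cost(transitions_per_day=10, days=30, per_event_rates=rates)
print(events)              # 300
print(costs["pagerduty"])  # 24.0
```

The Lambda request charge comes out to a fraction of a cent, which is why the PagerDuty line and the human triage time dominate the picture.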

She brings the number to the engineering review with one ask: show me the 10 alarms with the highest transition count this month. The list reveals the same pattern across multiple alarms — thresholds sitting at the metric's natural mean. Dana approves two hours of engineering time per alarm for tuning and sets a target: alarm-generated PagerDuty incidents down 80% within 60 days. The resulting bill reduction pays for the tuning work in the first month.

The measurable cost of a fleet of flapping alarms

The direct bill impact of flapping alarms is small per alarm but compounds at scale. SNS publishes are effectively free, but every downstream action costs money: PagerDuty bills per incident event, incident-management Lambda invocations bill per call plus duration, and Auto Scaling actions that launch and terminate instances in rapid cycles generate compute and EBS charges for instances that may exist for under an hour. Most on-demand instances are billed per second with a one-minute minimum, so every short-lived launch is paid for even though it did no useful work. A single alarm doing 20 transitions a day isn't material; 50 alarms with the same pattern across a production fleet generate a real, recurring, avoidable cost.

The harder cost to model is the human one. On-call engineers are a finite resource. Every non-actionable page pulls them away from real work, degrades their sleep, and gradually erodes their confidence in the monitoring system. At the unit-economics level, you are paying the full burdened cost of incident response — the engineer's time, the PagerDuty event, the downstream automation — for an event that requires no human action and produces no operational value. That is the definition of waste.

There is also an automation-spend multiplier. If a flapping alarm is wired to an Auto Scaling policy, every oscillation between ALARM and OK spawns a scale-out and a scale-in event. EC2 instances launched and terminated every 30 minutes are billed at full on-demand rates for the time they run, along with their EBS volumes, without ever serving sustained load. The same pattern applies to Lambda-based runbooks: ten alarm transitions a day means ten Lambda invocations, ten CloudWatch log streams, and potentially ten downstream API calls to ticketing or incident systems. All of it is billed; none of it is informative.

The correct framing for finance is not "alarm noise is a purely operational problem." It's a chargeback and efficiency question: which teams own the alarms generating the highest event volumes, what is the per-team cost of that noise, and what does the remediation — typically a few hours of an engineer's time per alarm — cost against the ongoing waste it eliminates? The math almost always favors fixing the alarm in week one.

What finance can actually do about flapping alarms

Finance can't touch put-metric-alarm, but it can shape the remediation conversation with the right framing — turning alarm tuning from an engineering backlog item into a cost-justified prioritization decision.

1. Quantify the noise cost before the tuning conversation

Pull the top-ten alarms by state-transition count and map each one to its action chain: SNS topics, PagerDuty integrations, downstream Lambdas. Calculate the monthly event volume and the associated charges — PagerDuty per-incident rates, Lambda invocation costs, any Auto Scaling churn. Present it as a recurring cost with a one-time fix cost (typically a few hours of engineering time per alarm). The ROI on alarm tuning is almost always realized within the first billing cycle.

2. Introduce alarm-quality as a FinOps metric

Add "alarm-generated incidents per week" to the observability cost review alongside SNS spend and Lambda invocation counts. Segment by team so each cost center can see its own noise contribution. A team that sees its alarm noise is generating $400/month in PagerDuty events and 60 engineer-hours of non-actionable response has a concrete number to optimize against — far more motivating than a generic "tune your alarms" request.

3. Budget the tuning sprint as a waste-elimination line

When engineering asks for capacity to run an alarm audit and tuning sprint, frame the budget request as waste elimination with a calculable payback period: if the current noise costs X per month in direct charges plus Y in engineer time, and the tuning work costs Z, the break-even is Z / (X + Y) months. For most teams this is under 30 days. Approve it as a cost-efficiency investment, not as overhead.

4. Require a documented threshold rationale for high-volume alarms

Any alarm generating more than 10 transitions per day should carry a documented reason its threshold is set where it is — ideally with a reference to the metric's p95 or p99 over the last 30 days. If that documentation doesn't exist, the threshold was set by guess. Finance's contribution is to make documentation a condition of the alarm being on the bill — if it's generating cost, it needs a recorded justification for its configuration.
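
The break-even formula from step 3, Z / (X + Y), is simple enough to sketch in Python. Only the $24 direct charge comes from the earlier example; the engineer-time and tuning-cost figures below are hypothetical placeholders to show the shape of the calculation.

```python
def break_even_months(direct_monthly, human_monthly, one_time_fix):
    """Months until a one-time fix pays for itself: Z / (X + Y)."""
    monthly_waste = direct_monthly + human_monthly
    if monthly_waste <= 0:
        raise ValueError("no recurring waste, so the fix never pays back")
    return one_time_fix / monthly_waste


X = 24.0    # direct charges per month (PagerDuty figure from the example above)
Y = 500.0   # hypothetical monthly cost of engineer time spent on non-incidents
Z = 300.0   # hypothetical one-time cost of the tuning work

months = break_even_months(X, Y, Z)
print(round(months, 2))  # 0.57
```

With these placeholder numbers the fix pays back in a little over half a month; the useful part is plugging in your own X, Y and Z.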

Quick quiz

Question 1 of 5

A cost review shows one CloudWatch alarm generating 240 PagerDuty incident events per month at $0.08 each — $19.20/month — plus an incident-response Lambda running 240 times at ~$0.0004 each. Engineering estimates 3 hours to tune the alarm, eliminating 90% of the noise. How should finance frame this remediation decision?

Keep learning

You've finished the finance view of flapping alarms. You know how to quantify the direct cost of alarm noise — PagerDuty events, Lambda invocations, Auto Scaling churn — and how to frame alarm tuning as a waste-elimination investment with a calculable payback period. You also have the four levers for keeping alarm economics healthy: quantifying noise cost upfront, tracking alarm-generated incidents as a FinOps metric, budgeting tuning sprints as cost-efficiency work, and requiring documented threshold rationale for any alarm that's generating charges. Next time an SNS or PagerDuty line spikes unexpectedly, you'll know exactly which question to ask.

Flapping alarms: the headline

Alert systems that cry wolf train engineers to ignore them — including when a real incident fires

A flapping CloudWatch alarm is one that bounces between "alerting" and "resolved" repeatedly — sometimes 10 or 20 times in a single day — on a healthy system. The alarm isn't catching a real problem; the threshold is positioned right where the metric's normal daily variance crosses it, so the system fires constantly for no reason.

The organizational consequence is alert fatigue: once an on-call team learns that a specific alarm auto-resolves within minutes, they stop treating it as urgent. The risk is that when a real incident eventually fires on the same alarm, the conditioned response is to dismiss it. That's a governance failure — the alerting investment is providing false assurance rather than genuine oversight.

AWS flags this pattern automatically through check ALH-004. The right leadership read is not a count of noisy alarms but a question of process maturity: are alerts configured by policy — with defined thresholds, documented rationale, and a regular tuning cadence — or are they set once and never revisited? Flapping alarms at scale are a symptom of the latter, and the fix is a policy, not just a configuration change.

A short read for the leader who wants to understand what alert fatigue means for organizational risk and how to ask the right question at a governance review. You'll get the plain-English version of why alarms flap, why it degrades the reliability of the entire alerting system over time, and what a mature, policy-driven alerting posture looks like — defined thresholds, documented rationale, a regular tuning review. No implementation detail.

What it looks like when the org gets alert tuning right

At one company the VP of Engineering, Priya, used to start every Monday with a digest of weekend pages. The count was high — 40, 60, sometimes 80 over a Saturday and Sunday — and almost none required a human decision. Engineers had stopped reading page text; they'd swipe to acknowledge and go back to sleep. The team called it the "false alarm tax."

After a quarterly retrospective surfaced three near-misses where real incidents got caught in the noise, Priya made alarm quality a tracked metric alongside availability and deployment frequency. Engineering ran a one-week audit, flagged every alarm with more than five transitions per day, and applied a standard tuning pass — M-of-N evaluation, longer periods, small threshold buffers — to each one.

Weekend page volume dropped from 70 to 8. More importantly, the eight pages that remained all required a genuine decision. On-call engineers started trusting the system again: when a page fired, they knew it was real. That trust is the outcome worth measuring — not the count of green checks on a security dashboard, but whether the alerting system is genuinely governing infrastructure quality by policy rather than generating background noise by accident.

Why untuned alarms are a governance risk, not just an ops nuisance

The executive risk in a fleet of flapping alarms is not the noise itself; it's what the noise trains the team to do. Engineers who receive ten non-actionable pages from the same alarm within 24 hours adapt: they learn to dismiss that alarm on sight. The problem is that the same conditioned response fires when the alarm eventually catches a real incident. Alert fatigue is a recurring theme in outage post-mortems: "the alarm fired, but we'd been ignoring it for three weeks."

From a governance standpoint, an alerting system that produces chronic false signals is not providing oversight — it is providing the appearance of oversight. Dashboards show alarms configured; what they don't show is that those alarms have been trained out of human attention. That is a material gap between the control that exists on paper and the control that actually operates in practice.

The second executive risk is that flapping alarms wired to automation — Auto Scaling policies, incident-response workflows, scaling Lambdas — execute that automation repeatedly on a healthy system. Infrastructure churning in response to a misconfigured alarm is not a hypothetical: it wastes compute spend, can destabilize the very service the alarm was meant to protect, and creates a change history that is nearly impossible to audit causally.

The right question at a leadership review is not "how many alarms do we have?" but "are our alarms governed by policy?" Governance means defined thresholds with documented rationale, M-of-N evaluation parameters matched to the metric's natural variance, and a regular tuning cadence that catches drift. Where that policy exists and is followed, flapping alarms are caught and fixed quickly. Where it doesn't, the organization is running on hope — and hoping is not a monitoring strategy.

The leadership move on flapping alarms

The executive handle is not to mandate alarm tuning one alarm at a time — it's to require that alerting is governed by policy and that governance is visible at the regular review.

1. Set a standard: every alarm must have a documented threshold rationale

Make it a standing requirement that any alarm wired to a human-notification or automation action carries a documented rationale for its threshold and evaluation window — ideally signed off by the team that owns the metric. This single policy converts alarm configuration from a one-time guess to a reviewable decision, and it makes drift visible: when an alarm starts flapping, the question becomes whether the threshold rationale is still valid, not "why does this thing keep firing?"

2. Track alarms-by-policy as a maturity signal

Ask at the quarterly engineering review: what percentage of our production alarms have documented, reviewed thresholds? The trajectory matters more than the absolute number. An org moving from 20% documented to 60% in two quarters is governing its alerting. One that's been at 20% for a year is operating on defaults and hope. Alarms-by-policy is a one-number proxy for observability maturity.

3. Treat a tuning backlog as a risk item, not a tech-debt item

A queue of known-flapping, known-untuned alarms is not just technical debt — it's an active reliability risk. Each alarm on that list is training on-call engineers to ignore its alert type, and one of them will eventually catch a real incident. Elevate the alarm-tuning backlog to the risk register with a target: every alarm with more than 5 transitions per day must be resolved or documented as intentional within 30 days. That creates accountability without requiring executive involvement in individual alarm configurations.

Quick quiz

Question 1 of 5

At a quarterly review the engineering lead says: "We have 30 production alarms that have each fired more than 5 times per day over the last month. The on-call team has started acknowledging them automatically without reading them." What's the right leadership response?

Keep learning

That's the lesson. Two takeaways: a fleet of untuned alarms is a governance risk, not just an ops nuisance — it trains engineers to ignore the system and erodes the reliability signal that leadership depends on for oversight. And the fix is a policy, not just a configuration change. The right end state is alerting governed by documented thresholds, reviewed on a regular cadence, with alarm-quality tracked as a maturity metric at the engineering review. When you can say every production alarm has a documented rationale and a team accountable for its signal quality, you have monitoring by policy rather than monitoring by hope.
