Silent alarms: the basics
What's a "never-triggered" alarm and why is it a problem?
A CloudWatch alarm watches a metric, compares it to a threshold, and changes state — OK, ALARM, or INSUFFICIENT_DATA — based on the comparison. When the alarm transitions into ALARM, it fires actions: SNS notifications, Auto Scaling steps, Lambda invocations, PagerDuty pages. That state-change is the entire reason the alarm exists.
A "never-triggered" alarm is one that has sat in OK for months or years without ever transitioning to ALARM. On paper that looks like a sign of a healthy system — the metric never crossed the line, so the workload must be fine. In practice it's ambiguous. The alarm might genuinely be guarding a healthy system, or it might be quietly broken in a way that means it would never fire even if the underlying condition actually happened.
ALH-003 ("Never Triggered Alarms") flags alarms whose StateValue has been OK for an extended period with no state transitions in the history log. The severity is LOW because a silent alarm isn't actively harmful — but it represents an untested control. The first time you find out it doesn't work is the incident it was supposed to catch.
In this lesson you'll learn how to find alarms that have never fired, the failure modes that quietly mask broken alarms, and how to verify each one actually works using either synthetic metric data or set-alarm-state. You'll also see how to document an alarm's intent so future-you can decide whether to keep, fix, or delete it during a quarterly review.
The Knight Capital alarm that never was
In 2012 Knight Capital lost roughly $440M in 45 minutes after a deployment left old code running on one of eight servers. The signals existed: according to the SEC's findings, automated emails referencing the error went out to staff before the market opened, but they were not designed as alerts and nobody acted on them. "We have monitoring" and "we have working alarms" are not the same sentence. Untested alarms behave the same way: present in the inventory, absent in the incident.
Auditing silent alarms in action
Marco runs SRE at a mid-size SaaS. A FinOps review surfaces 84 CloudWatch alarms in the production account, and a quick query shows 31 of them — over a third — haven't transitioned state in 90+ days. ALH-003 has flagged the whole set.
Most of them are probably fine. The auto-scaling targets, the queue-depth alerts, the routine CPU thresholds — they sit at OK because the system genuinely behaves. But buried in the 31 is at least one alarm that was added during an incident two years ago and has since had its underlying metric renamed, and another that's pointing at a dimension for an instance that was terminated last summer.
Marco doesn't trust the list. He picks one alarm — a critical SES bounce-rate alarm — and decides to actually prove it works before assuming it does.
First, pull the audit list: every alarm in OK with no state transitions in the last 90 days. This is the ALH-003 query.
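A minimal sketch of what such a query could look like, assuming the AWS CLI is configured for the account and that the `StateUpdatedTimestamp` field returned by `describe-alarms` is an acceptable stand-in for "last state transition". The helper name `list_silent_alarms` and the cutoff date are illustrative, not part of ALH-003 itself.

```shell
# list_silent_alarms is a hypothetical helper, not an AWS command.
# It prints alarms currently in OK whose last state update is older than
# the cutoff date passed as the first argument (ISO 8601, e.g. 2026-02-15).
list_silent_alarms() {
  cutoff="$1"
  aws cloudwatch describe-alarms \
    --state-value OK \
    --query 'MetricAlarms[].[AlarmName,StateUpdatedTimestamp]' \
    --output text |
    # Text output is tab-separated; ISO 8601 timestamps sort lexicographically,
    # so a plain string comparison against the cutoff is enough here.
    awk -F '\t' -v cutoff="$cutoff" '$2 < cutoff { print $1 "\t" $2 }'
}

# Usage (the cutoff is an example value):
#   list_silent_alarms 2026-02-15
```

The comparison assumes the CLI prints timestamps in UTC; if your CLI is configured with a different timestamp format, adjust the cutoff accordingly.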
Every alarm in OK with no state change since Feb. Some are real, some are suspect.
Now prove the SES bounce-rate alarm actually works. set-alarm-state forces a transition and fires the alarm's actions — without producing any real bounce traffic.
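One way the drill might be scripted, as a sketch only. `drill_alarm` is a hypothetical wrapper around two real `set-alarm-state` calls, and the default reason text is an assumption; use whatever wording your on-call team recognizes as a drill.

```shell
# drill_alarm is a hypothetical helper: force an alarm into ALARM so its
# configured actions run, then set it back to OK.
drill_alarm() {
  alarm_name="$1"
  reason="${2:-quarterly audit drill, not a real incident}"

  # Force the transition; the alarm's ALARM actions fire as if it were real.
  aws cloudwatch set-alarm-state \
    --alarm-name "$alarm_name" \
    --state-value ALARM \
    --state-reason "$reason" || return 1

  # Reset explicitly. CloudWatch would also return the alarm to its true
  # state on the next evaluation, but an explicit reset keeps history readable.
  aws cloudwatch set-alarm-state \
    --alarm-name "$alarm_name" \
    --state-value OK \
    --state-reason "$reason (reset)"
}

# Usage:
#   drill_alarm ses-bounce-rate-high
```

Note that this exercises only the action path. It says nothing about whether the metric, dimension, or threshold is right.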
set-alarm-state proves the SNS + Slack path works end-to-end without waiting for a real bounce spike.
How CloudWatch alarms actually work
A CloudWatch alarm is a small state machine. Each period (configurable, commonly 60 or 300 seconds) the alarm pulls the metric for the configured statistic (Sum, Average, p99, etc.) across the configured dimensions, compares it to the threshold using the configured operator (GreaterThanThreshold, LessThanLowerOrGreaterThanUpperThreshold, etc.), and decides which state to be in once enough evaluation periods agree. State changes, and only state changes, fire actions.
This is where silent failures live. If the metric stops being published, the alarm does not go to ALARM. With the default treat-missing-data=missing it transitions to INSUFFICIENT_DATA, and with notBreaching or ignore it can sit in OK indefinitely; only treat-missing-data=breaching turns missing data into an ALARM. If the dimension points at a resource that no longer exists, you get the same outcome. If the threshold is set to 999 when the metric never exceeds 100, the comparison never evaluates true. If the operator is inverted (GreaterThanThreshold when you meant LessThanThreshold), the alarm will never fire under the condition it was supposed to catch.
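To check the first two failure modes (a metric that stopped publishing, or a dimension pointing at a dead resource), one option is to count recent datapoints for the exact namespace, metric, and dimension the alarm uses. The sketch below assumes a single dimension; `count_datapoints` and all the values in the usage line are placeholders.

```shell
# count_datapoints is a hypothetical helper. It prints how many daily
# datapoints exist for a metric in a time window. Zero suggests the metric
# or the dimension behind the alarm is no longer producing data.
count_datapoints() {
  namespace="$1"; metric="$2"; dimension="$3"; start="$4"; end="$5"
  aws cloudwatch get-metric-statistics \
    --namespace "$namespace" \
    --metric-name "$metric" \
    --dimensions "$dimension" \
    --start-time "$start" \
    --end-time "$end" \
    --period 86400 \
    --statistics SampleCount \
    --query 'length(Datapoints)' \
    --output text
}

# Usage (every value here is a placeholder):
#   count_datapoints CWAgent disk_used_percent \
#     Name=InstanceId,Value=i-0123456789abcdef0 \
#     2026-05-01T00:00:00Z 2026-05-15T00:00:00Z
```

A result of 0 for an alarm that sits in OK is a strong hint that the alarm is watching nothing.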
CloudWatch charges roughly $0.10 per alarm metric per month for standard-resolution alarms and $0.30 for high-resolution. That's pocket change individually, but across a fleet with thousands of alarms accumulated over years of incidents, the bill is real. Worse, every silent broken alarm is dead weight in your inventory — it dilutes attention and makes the working alarms harder to trust.
# Pull the recent state history for a suspect alarm: has it changed state at all?
aws cloudwatch describe-alarm-history \
  --alarm-name legacy-worker-disk-full \
  --history-item-type StateUpdate \
  --max-records 50 \
  --query 'AlarmHistoryItems[].[Timestamp,HistorySummary]' \
  --output table
# If this returns no StateUpdate entries, the alarm has not transitioned within
# the history CloudWatch retains. Combined with treat-missing-data=notBreaching
# or ignore, that often means the underlying metric is no longer being published.
What's the impact of unverified silent alarms?
The direct impact is missed incidents. Every silent alarm represents a control you believe is in place but haven't proven. When the condition it was meant to catch actually happens (bounce rate spikes, disk fills, dead-letter queue backs up), you find out about it from a customer ticket instead of a page, and time to detection stretches from minutes to however long it takes a customer to complain.
The second-order impact is alert fatigue working in reverse. Engineers trust the alarm inventory as a proxy for "we'd hear about it." When that proxy is silently broken, post-incident reviews keep finding the same root cause: "we had an alarm for this, it just didn't fire." That erodes trust in the entire monitoring system, and the response is usually to add more alarms — which compounds the problem.
The financial impact is small but real. Each alarm costs roughly $0.10/month — a fleet with 2,000 alarms is paying ~$2,400/year just to keep them configured. A meaningful fraction of those are usually defunct. More importantly, every silent alarm is a future incident-investigation hour: when an outage hits and the alarm didn't fire, someone has to figure out why, fix it, and update the runbook — usually under pressure at 3am.
There's also a compliance angle. SOC 2 CC7.2 and ISO 27001:2013 A.12.4 expect monitoring controls to be tested periodically. An auditor asking "how do you know your alarms work?" wants to hear about a documented verification process, not "they're in the inventory and we trust they're configured correctly."
How do you audit and fix silent alarms?
Auditing silent alarms is a four-step loop, run quarterly. The goal isn't to keep every alarm — it's to make sure every alarm in the inventory is verified, documented, and still relevant.
1. Inventory alarms by last state transition
Pull describe-alarms filtered to StateValue=OK, then cross-reference describe-alarm-history for the StateUpdate event count. Any alarm whose only history entry is its own creation, or whose last transition is older than your audit window (90 days is reasonable), goes on the audit list. Don't audit every alarm every quarter — focus on the silent set.
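The cross-reference could be scripted roughly as follows. Both helper names are hypothetical, and the count only covers whatever history CloudWatch still retains for the alarm, so treat a zero as "no recorded transitions" rather than proof the alarm never fired.

```shell
# state_update_count is a hypothetical helper: number of StateUpdate history
# items on the first page of results for one alarm (enough to tell 0 from >0).
state_update_count() {
  aws --no-paginate cloudwatch describe-alarm-history \
    --alarm-name "$1" \
    --history-item-type StateUpdate \
    --query 'length(AlarmHistoryItems)' \
    --output text </dev/null
}

# flag_never_transitioned reads alarm names on stdin, one per line, and
# prints those with no recorded state transitions.
flag_never_transitioned() {
  while IFS= read -r name; do
    if [ "$(state_update_count "$name")" = "0" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Usage:
#   aws cloudwatch describe-alarms --state-value OK \
#     --query 'MetricAlarms[].[AlarmName]' --output text | flag_never_transitioned
```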
2. Verify the alarm actually fires
Two safe ways to test. set-alarm-state --state-value ALARM forces a transition and fires every configured action — the cleanest end-to-end test of the SNS topic, Lambda, and downstream paging integration. Or put-metric-data with a value above the threshold for a few evaluation periods, which exercises the metric→evaluation→action path too. Always set a clear state-reason like "quarterly audit" so the on-call team knows it's a drill, and reset to OK immediately after.
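For the synthetic-data route, a sketch under a few assumptions: the alarm watches a custom namespace (`put-metric-data` rejects namespaces that begin with `AWS/`, so this does not apply to metrics such as the SES reputation metrics), the alarm has no dimensions, and one datapoint per period is enough to breach. `publish_breach` is a hypothetical helper.

```shell
# publish_breach is a hypothetical helper: publish the same breaching value
# several times, pausing between datapoints so each lands in its own period.
publish_breach() {
  namespace="$1"; metric="$2"; value="$3"
  count="${4:-3}"    # number of datapoints to publish
  pause="${5:-60}"   # seconds between datapoints
  i=0
  while [ "$i" -lt "$count" ]; do
    aws cloudwatch put-metric-data \
      --namespace "$namespace" \
      --metric-name "$metric" \
      --value "$value" || return 1
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then
      sleep "$pause"
    fi
  done
}

# Usage (names and values are placeholders):
#   publish_breach Custom/Drill QueueDepth 999 3 60
```

If the alarm is scoped to dimensions, the same `--dimensions` must be added to the `put-metric-data` call, otherwise the datapoints land on a different metric and the alarm never sees them.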
3. Tabletop the intent against the configuration
For each alarm, write one sentence describing what it's meant to catch and read the configuration with that intent in mind. Does the metric actually measure the thing you described? Is the threshold inside the range of plausible values? Is the comparison operator the right direction? An alarm called "low-disk-warning" with GreaterThanThreshold 10 is configured backwards: it fires whenever the disk is more than 10% full, which is almost always, instead of when free space runs low. These bugs hide in plain sight until someone reads the config out loud.
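To make the read-through easier, the relevant fields can be pulled into one row per alarm. The sketch below uses real `describe-alarms` output fields; the helper names and the name-versus-operator check are illustrative heuristics, not a substitute for reading the configuration against the intent.

```shell
# show_alarm_config is a hypothetical helper: one tab-separated row with the
# fields worth reading aloud during a tabletop.
show_alarm_config() {
  aws cloudwatch describe-alarms \
    --alarm-names "$1" \
    --query 'MetricAlarms[].[AlarmName,Namespace,MetricName,Statistic,ComparisonOperator,Threshold,TreatMissingData]' \
    --output text
}

# check_direction is a crude heuristic: flag alarms whose name contains "low"
# but whose operator (field 5) starts with GreaterThan.
check_direction() {
  awk -F '\t' 'tolower($1) ~ /low/ && $5 ~ /^GreaterThan/ {
    print $1 ": name says low, operator is " $5
  }'
}

# Usage:
#   show_alarm_config low-disk-warning | check_direction
```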
4. Document intent in the description, then keep or delete
Every alarm needs a one-line description that explains its purpose, action, and severity — e.g. "Fires when SES bounce rate > 5% sustained 15min — pages on-call, indicates email reputation degradation." If you can't write that sentence, delete the alarm. Untracked alarms accumulate over years; the quarterly audit is your chance to keep the inventory honest. A short, verified set of alarms is worth more than a long unverified one.
# Update the description so the next on-call (and the next auditor) knows the intent.
# put-metric-alarm replaces the whole definition, so every setting is restated.
# Reputation.BounceRate is reported as a fraction, so 5% is a threshold of 0.05.
aws cloudwatch put-metric-alarm \
  --alarm-name ses-bounce-rate-high \
  --alarm-description 'Fires when SES bounce rate > 5% sustained 15min; pages on-call, indicates email reputation degradation. Last verified 2026-05-15.' \
  --metric-name Reputation.BounceRate \
  --namespace AWS/SES \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 0.05 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:ops-alerts
# Delete the alarms whose intent you couldn't articulate — they were dead weight.
aws cloudwatch delete-alarms --alarm-names legacy-worker-disk-full old-elb-latency-2023
Keep learning
Dig deeper into CloudWatch alarms and the discipline of testing your monitoring.
- Amazon CloudWatch alarms (User Guide): How alarm state machines, evaluation periods, and missing-data handling actually work.
- set-alarm-state (AWS CLI reference): Force a state transition for testing, the safest way to verify an alarm's action path end-to-end.
- describe-alarm-history (AWS CLI reference): Pull state-change history for any alarm to see whether it has ever actually transitioned.
- Google SRE Book, "Monitoring Distributed Systems": The canonical reference on what monitoring is for, why untested alerts are dangerous, and how to keep the alarm inventory honest.
You've completed Audit alarms that never trigger. You can now spot silent alarms in your inventory, verify them with set-alarm-state or synthetic metric data, tabletop the configuration against the intent, and decide quarterly which alarms to keep and which to delete. The next ALH-003 finding won't be a question mark — you'll have a four-step loop ready to run.