Audit alarms that never trigger

An alarm that's been OK for 12 months is either fine or unverified — review periodically before you trust it for the next incident.

Silent alarms: the basics

What's a "never-triggered" alarm and why is it a problem?

A CloudWatch alarm watches a metric, compares it to a threshold, and changes state — OK, ALARM, or INSUFFICIENT_DATA — based on the comparison. When the alarm transitions into ALARM, it fires actions: SNS notifications, Auto Scaling steps, Lambda invocations, PagerDuty pages. That state-change is the entire reason the alarm exists.

A "never-triggered" alarm is one that has sat in OK for months or years without ever transitioning to ALARM. On paper that looks like a sign of a healthy system — the metric never crossed the line, so the workload must be fine. In practice it's ambiguous. The alarm might genuinely be guarding a healthy system, or it might be quietly broken in a way that means it would never fire even if the underlying condition actually happened.

ALH-003 ("Never Triggered Alarms") flags alarms whose StateValue has been OK for an extended period with no state transitions in the history log. The severity is LOW because a silent alarm isn't actively harmful — but it represents an untested control. The first time you find out it doesn't work is the incident it was supposed to catch.

In this lesson you'll learn how to find alarms that have never fired, the failure modes that quietly mask broken alarms, and how to verify each one actually works using either synthetic metric data or set-alarm-state. You'll also see how to document an alarm's intent so future-you can decide whether to keep, fix, or delete it during a quarterly review.

Fun fact

The Knight Capital alarm that never was

In 2012 Knight Capital lost $440M in 45 minutes because a deployment left old code running on one of eight servers. The warning signs existed: before the market opened that morning, an internal system sent a batch of automated emails referencing the error. They were not designed as alerts, though, and no one acted on them in time. "We have monitoring" and "we have working alarms" are not the same sentence. Untested alarms behave the same way: present in the inventory, absent in the incident.

Auditing silent alarms in action

Marco runs SRE at a mid-size SaaS. A FinOps review surfaces 84 CloudWatch alarms in the production account, and a quick query shows 31 of them — over a third — haven't transitioned state in 90+ days. ALH-003 has flagged the whole set.

Most of them are probably fine. The auto-scaling targets, the queue-depth alerts, the routine CPU thresholds — they sit at OK because the system genuinely behaves. But buried in the 31 is at least one alarm that was added during an incident two years ago and has since had its underlying metric renamed, and another that's pointing at a dimension for an instance that was terminated last summer.

Marco doesn't trust the list. He picks one alarm — a critical SES bounce-rate alarm — and decides to actually prove it works before assuming it does.

First, pull the audit list: every alarm in OK with no state transitions in the last 90 days. This is the ALH-003 query.

$ aws cloudwatch describe-alarms --state-value OK --query 'MetricAlarms[?StateUpdatedTimestamp<=`2026-02-14`].[AlarmName,MetricName,Namespace,Threshold,StateUpdatedTimestamp]' --output table
┌─────────────────────────┬───────────────────────────┬────────────────────┬───────────┬───────────────────────┐
│ AlarmName               │ MetricName                │ Namespace          │ Threshold │ StateUpdatedTimestamp │
├─────────────────────────┼───────────────────────────┼────────────────────┼───────────┼───────────────────────┤
│ ses-bounce-rate-high    │ Reputation.BounceRate     │ AWS/SES            │ 5.0       │ 2024-11-03T08:21:14Z  │
│ rds-prod-cpu-warn       │ CPUUtilization            │ AWS/RDS            │ 80.0      │ 2025-01-09T14:02:51Z  │
│ alb-target-5xx-spike    │ HTTPCode_Target_5XX_Count │ AWS/ApplicationELB │ 50.0      │ 2024-12-22T09:44:00Z  │
│ legacy-worker-disk-full │ DiskSpaceUtilization      │ System/Linux       │ 90.0      │ 2024-07-17T03:11:09Z  │
│ sqs-deadletter-depth    │ ApproximateNumberOfM…     │ AWS/SQS            │ 1.0       │ 2024-10-04T11:55:32Z  │
└─────────────────────────┴───────────────────────────┴────────────────────┴───────────┴───────────────────────┘
# 31 rows total. legacy-worker-disk-full hasn't transitioned since July 2024, well over a year, and System/Linux is a custom namespace used by the legacy CloudWatch monitoring scripts, so check that anything still publishes to it.

Every alarm in OK with no state change since Feb. Some are real, some are suspect.
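The hard-coded cutoff date in the query has to be recomputed for every audit. As a minimal sketch, assuming the alarms were exported with `aws cloudwatch describe-alarms --state-value OK --output json` and the `MetricAlarms` list was loaded from that file, the cutoff can be derived from the audit date instead. The helper names `parse_ts` and `stale_ok_alarms` and the `api-latency-p99` alarm are illustrative, not part of any AWS tooling.

```python
import json
from datetime import datetime, timedelta, timezone


def parse_ts(raw):
    """Parse the ISO-8601 timestamps the AWS CLI prints in JSON output."""
    # Older Python versions reject a trailing "Z", so normalise it first.
    ts = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def stale_ok_alarms(alarms, now, days=90):
    """Return (name, last_change) for OK alarms unchanged for `days` or more."""
    cutoff = now - timedelta(days=days)
    stale = [
        (a["AlarmName"], parse_ts(a["StateUpdatedTimestamp"]))
        for a in alarms
        if a["StateValue"] == "OK"
        and parse_ts(a["StateUpdatedTimestamp"]) <= cutoff
    ]
    return sorted(stale, key=lambda pair: pair[1])  # oldest first


# Two rows shaped like MetricAlarms entries; api-latency-p99 is hypothetical.
sample = json.loads("""
[
  {"AlarmName": "legacy-worker-disk-full", "StateValue": "OK",
   "StateUpdatedTimestamp": "2024-07-17T03:11:09Z"},
  {"AlarmName": "api-latency-p99", "StateValue": "OK",
   "StateUpdatedTimestamp": "2026-05-01T10:00:00+00:00"}
]
""")
audit_day = datetime(2026, 5, 15, tzinfo=timezone.utc)
for name, changed in stale_ok_alarms(sample, audit_day):
    print(name, changed.date())
```

Sorting oldest-first puts the alarms most likely to be orphaned at the top of the audit list. Composite alarms come back under a separate `CompositeAlarms` key and would need the same treatment.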

Now prove the SES bounce-rate alarm actually works. set-alarm-state forces a transition and fires the alarm's actions — without producing any real bounce traffic.

$ aws cloudwatch set-alarm-state --alarm-name ses-bounce-rate-high --state-value ALARM --state-reason 'Quarterly audit — verifying SNS + Slack action fires'
# (set-alarm-state returns no output on success — confirmation is downstream.)
# Slack channel #ops-alerts, 14:32 UTC:
[ALARM] ses-bounce-rate-high — Quarterly audit — verifying SNS + Slack action fires
Region: eu-west-1 Account: 123456789012 Threshold: > 5.0 sustained 15m
# Action fired. Reset state and document the test.
$ aws cloudwatch set-alarm-state --alarm-name ses-bounce-rate-high --state-value OK --state-reason 'Audit complete'

set-alarm-state proves the SNS + Slack path works end-to-end without waiting for a real bounce spike.

How CloudWatch alarms actually work

A CloudWatch alarm is a small state machine. Once per period (configurable; 60 and 300 seconds are common choices) the alarm pulls the metric for the configured statistic (Sum, Average, p99, etc.) across the configured dimensions, compares it to the threshold using the configured operator (GreaterThanThreshold, LessThanThreshold, etc.), and decides which state to be in. State changes, and only state changes, fire actions.

This is where silent failures live. If the metric stops being published, the alarm transitions to INSUFFICIENT_DATA, not ALARM: unless you've set --treat-missing-data breaching, missing data won't trigger anything. Set it to notBreaching or ignore and the alarm can sit in OK indefinitely with no data behind it. If the dimension points at a resource that no longer exists, you get the same outcome. If the threshold is set to 999 when the metric never exceeds 100, the comparison never evaluates true. If the operator is inverted (GreaterThanThreshold when you meant LessThanThreshold), the alarm will never fire under the condition it was supposed to catch.
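To make those failure modes concrete, here is a deliberately simplified model of one alarm evaluation. It is a sketch, not CloudWatch's actual algorithm: it assumes every datapoint in the window must breach (N out of N) and ignores the wider evaluation range CloudWatch consults when data is missing. The function name `evaluate` is made up for illustration, and `None` stands for a missing datapoint.

```python
import operator

COMPARATORS = {
    "GreaterThanThreshold": operator.gt,
    "GreaterThanOrEqualToThreshold": operator.ge,
    "LessThanThreshold": operator.lt,
    "LessThanOrEqualToThreshold": operator.le,
}


def evaluate(datapoints, threshold, comparison, treat_missing="missing", current="OK"):
    """Return the state a simplified alarm would settle in for one window."""
    breaches = COMPARATORS[comparison]
    verdicts = []
    for value in datapoints:
        if value is not None:
            verdicts.append(breaches(value, threshold))
        elif treat_missing == "breaching":
            verdicts.append(True)
        elif treat_missing == "notBreaching":
            verdicts.append(False)
        # "missing" and "ignore": a gap contributes no verdict at all.
    if not verdicts:
        # Nothing to judge: "ignore" keeps the current state.
        return current if treat_missing == "ignore" else "INSUFFICIENT_DATA"
    return "ALARM" if all(verdicts) else "OK"


gone = [None, None, None]  # the metric is no longer published
print(evaluate(gone, 90, "GreaterThanThreshold"))                  # default "missing"
print(evaluate(gone, 90, "GreaterThanThreshold", "notBreaching"))  # looks healthy
print(evaluate([40, 60, 100], 999, "GreaterThanThreshold"))        # unreachable threshold
print(evaluate([95, 96, 97], 90, "LessThanThreshold"))             # inverted operator
```

In this model only the first case leaves OK, and it lands in INSUFFICIENT_DATA rather than ALARM. The other three stay in OK even though the alarm cannot do its job.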

CloudWatch charges roughly $0.10 per alarm metric per month for standard-resolution alarms and $0.30 for high-resolution. That's pocket change individually, but across a fleet with thousands of alarms accumulated over years of incidents, the bill is real. Worse, every silent broken alarm is dead weight in your inventory — it dilutes attention and makes the working alarms harder to trust.

# Pull the full state history for a suspect alarm — has it ever changed state?
aws cloudwatch describe-alarm-history \
  --alarm-name legacy-worker-disk-full \
  --history-item-type StateUpdate \
  --max-records 50 \
  --query 'AlarmHistoryItems[].[Timestamp,HistorySummary]' \
  --output table

# An empty result means no state transitions within the window CloudWatch retains
# alarm history for. That window is limited, so fall back on StateUpdatedTimestamp
# for anything older. An alarm that stays in OK with --treat-missing-data
# notBreaching or ignore may simply have no metric behind it any more.
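If you run the same command with `--output json` and without the `--query` filter, the entries can be checked programmatically. The sketch below assumes each StateUpdate item carries a `HistoryData` JSON string with `oldState.stateValue` and `newState.stateValue`, as in AWS's published examples; verify that against your own output before relying on it. The helpers `state_transitions`, `ever_alarmed`, and `make_item` are illustrative names.

```python
import json


def state_transitions(history_items):
    """Return (timestamp, old_state, new_state) for every StateUpdate entry."""
    found = []
    for item in history_items:
        if item.get("HistoryItemType") != "StateUpdate":
            continue
        data = json.loads(item["HistoryData"])  # HistoryData is a JSON string
        found.append((
            item["Timestamp"],
            data["oldState"]["stateValue"],
            data["newState"]["stateValue"],
        ))
    return found


def ever_alarmed(history_items):
    """True if any retained transition ended in ALARM."""
    return any(new == "ALARM" for _, _, new in state_transitions(history_items))


def make_item(timestamp, old, new):
    """Build a sample entry shaped like an AlarmHistoryItems element."""
    payload = {"version": "1.0",
               "oldState": {"stateValue": old},
               "newState": {"stateValue": new}}
    return {"Timestamp": timestamp,
            "HistoryItemType": "StateUpdate",
            "HistorySummary": f"Alarm updated from {old} to {new}",
            "HistoryData": json.dumps(payload)}


history = [make_item("2024-07-17T03:11:09Z", "INSUFFICIENT_DATA", "OK")]
print(state_transitions(history))
print("ever alarmed:", ever_alarmed(history))
```

A result of `False` only covers the retained history, so treat it as "no evidence the alarm has fired" and not as proof that it never did.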

What's the impact of unverified silent alarms?

The direct impact is missed incidents. Every silent alarm represents a control you believe is in place but haven't proven. When the condition it was meant to catch actually happens (bounce rate spikes, disk fills, dead-letter queue backs up), you find out about it from a customer ticket instead of a page, and mean time to detect grows accordingly.

The second-order impact is alert fatigue working in reverse. Engineers trust the alarm inventory as a proxy for "we'd hear about it." When that proxy is silently broken, post-incident reviews keep finding the same root cause: "we had an alarm for this, it just didn't fire." That erodes trust in the entire monitoring system, and the response is usually to add more alarms — which compounds the problem.

The financial impact is small but real. Each alarm costs roughly $0.10/month — a fleet with 2,000 alarms is paying ~$2,400/year just to keep them configured. A meaningful fraction of those are usually defunct. More importantly, every silent alarm is a future incident-investigation hour: when an outage hits and the alarm didn't fire, someone has to figure out why, fix it, and update the runbook — usually under pressure at 3am.
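The arithmetic behind that figure is simple enough to script for your own inventory. This sketch takes the approximate rates quoted in this lesson ($0.10 standard, $0.30 high-resolution, per alarm metric per month) as defaults; actual pricing varies by region, so treat them as assumptions. The function name `annual_alarm_cost` is illustrative.

```python
def annual_alarm_cost(standard, high_res=0, standard_cents=10, high_res_cents=30):
    """Yearly cost in dollars; rates are cents per alarm metric per month."""
    # Work in whole cents to avoid floating-point drift, convert at the end.
    monthly_cents = standard * standard_cents + high_res * high_res_cents
    return monthly_cents * 12 / 100


print(annual_alarm_cost(2000))  # 2400.0, the fleet-wide figure above
print(annual_alarm_cost(31))    # 37.2, the 31 silent alarms from Marco's audit
```

The second number shows why cost alone rarely justifies the audit; the case rests on the incidents a broken alarm fails to catch.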

There's also a compliance angle. SOC 2 CC7.2 and ISO 27001 A.12.4 expect monitoring controls to be tested periodically. An auditor asking "how do you know your alarms work?" wants to hear about a documented verification process — not "they're in the inventory and we trust they're configured correctly."

How do you audit and fix silent alarms?

Auditing silent alarms is a four-step loop, run quarterly. The goal isn't to keep every alarm — it's to make sure every alarm in the inventory is verified, documented, and still relevant.

1. Inventory alarms by last state transition

Pull describe-alarms filtered to StateValue=OK, then cross-reference describe-alarm-history for the StateUpdate event count. Any alarm whose only history entry is its own creation, or whose last transition is older than your audit window (90 days is reasonable), goes on the audit list. Don't audit every alarm every quarter — focus on the silent set.

2. Verify the alarm actually fires

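Two safe ways to test. set-alarm-state --state-value ALARM forces a transition and fires every configured action, which makes it the cleanest end-to-end test of the SNS topic, Lambda, and downstream paging integration. It does not touch the metric, though, so it can't tell you whether the threshold or dimensions are right. For that, use put-metric-data to publish a value above the threshold for a few evaluation periods, which exercises the whole metric→evaluation→action path. This only works for custom namespaces, because put-metric-data rejects namespaces that begin with AWS/. Always set a clear state-reason like "quarterly audit" so the on-call team knows it's a drill, and reset to OK immediately after.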
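The lesson shows set-alarm-state in action but not the synthetic-data route, so here is a hedged sketch of it. The function only builds the `aws cloudwatch put-metric-data` command for review and does not run it. The name `synthetic_datapoint_cmd` and the instance ID are hypothetical, and the namespace guard reflects the fact that namespaces beginning with AWS/ are reserved for AWS services.

```python
import shlex


def synthetic_datapoint_cmd(namespace, metric_name, value, dimensions=None):
    """Build (but do not run) an `aws cloudwatch put-metric-data` command."""
    if namespace.startswith("AWS/"):
        # Reserved namespaces cannot take custom datapoints.
        raise ValueError(
            f"cannot publish synthetic data to {namespace}; use set-alarm-state"
        )
    cmd = [
        "aws", "cloudwatch", "put-metric-data",
        "--namespace", namespace,
        "--metric-name", metric_name,
        "--value", str(value),
    ]
    if dimensions:
        # CLI shorthand: Name=Value pairs joined by commas.
        pairs = ",".join(f"{name}={val}" for name, val in dimensions.items())
        cmd += ["--dimensions", pairs]
    return cmd


cmd = synthetic_datapoint_cmd(
    "System/Linux", "DiskSpaceUtilization", 95.0,
    {"InstanceId": "i-0123456789abcdef0"},  # hypothetical instance ID
)
print(shlex.join(cmd))
```

The dimensions have to match the alarm's dimensions exactly, otherwise the datapoint lands in a different metric and the alarm never sees it. Synthetic values also stay in the metric's history, so note the drill wherever the team reads graphs.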

3. Tabletop the intent against the configuration

For each alarm, write one sentence describing what it's meant to catch and read the configuration with that intent in mind. Does the metric actually measure the thing you described? Is the threshold inside the range of plausible values? Is the comparison operator the right direction? An alarm called "low-disk-warning" with GreaterThanThreshold 10 is configured backwards — it'll only fire when the disk is more than 10% full, which is always. These bugs hide in plain sight until someone reads the config out loud.
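Part of that tabletop read can be mechanised. The sketch below checks one `MetricAlarms` entry from describe-alarms against the metric's observed range, which the caller supplies (for example from get-metric-statistics over the last few weeks). The function `lint_alarm`, its warning strings, and the `queue-depth-high` alarm are illustrative. It catches only the obvious mistakes and is no substitute for reading the intent.

```python
def lint_alarm(alarm, observed_min, observed_max):
    """Return warnings for one MetricAlarms entry from describe-alarms."""
    warnings = []
    op = alarm.get("ComparisonOperator", "")
    threshold = alarm.get("Threshold")  # absent on anomaly-detection alarms

    if threshold is not None:
        if op.startswith("GreaterThan"):
            if threshold > observed_max:
                warnings.append("threshold is above every observed value")
            elif threshold < observed_min:
                warnings.append("every observed value breaches; operator may be inverted")
        elif op.startswith("LessThan"):
            if threshold < observed_min:
                warnings.append("threshold is below every observed value")
            elif threshold > observed_max:
                warnings.append("every observed value breaches; operator may be inverted")

    if not alarm.get("AlarmActions"):
        warnings.append("no alarm actions configured")
    if alarm.get("ActionsEnabled") is False:
        warnings.append("actions are disabled")
    if not alarm.get("AlarmDescription"):
        warnings.append("no description documenting intent")
    return warnings


unreachable = {
    "AlarmName": "queue-depth-high",  # hypothetical alarm
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": 999,
    "ActionsEnabled": True,
    "AlarmActions": ["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],
    "AlarmDescription": "Pages on-call when the queue backs up.",
}
backwards = {
    "AlarmName": "low-disk-warning",
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": 10,
}
print(lint_alarm(unreachable, observed_min=0, observed_max=100))
print(lint_alarm(backwards, observed_min=35, observed_max=80))
```

A clean result means only that the threshold sits inside the observed range and the basics are filled in. Whether the metric measures the right thing still needs a human.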

4. Document intent in the description, then keep or delete

Every alarm needs a one-line description that explains its purpose, action, and severity — e.g. "Fires when SES bounce rate > 5% sustained 15min — pages on-call, indicates email reputation degradation." If you can't write that sentence, delete the alarm. Untracked alarms accumulate over years; the quarterly audit is your chance to keep the inventory honest. A short, verified set of alarms is worth more than a long unverified one.

# Update the description so the next on-call (and the next auditor) knows the intent.
# put-metric-alarm overwrites the whole alarm definition, so restate every setting.
# Reputation.BounceRate is published as a fraction, so 5% is a threshold of 0.05, not 5.0.
aws cloudwatch put-metric-alarm \
  --alarm-name ses-bounce-rate-high \
  --alarm-description 'Fires when SES bounce rate > 5% sustained 15min. Pages on-call; indicates email reputation degradation. Last verified 2026-05-15.' \
  --metric-name Reputation.BounceRate \
  --namespace AWS/SES \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 0.05 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:ops-alerts

# Delete the alarms whose intent you couldn't articulate — they were dead weight.
aws cloudwatch delete-alarms --alarm-names legacy-worker-disk-full old-elb-latency-2023

You've completed Audit alarms that never trigger. You can now spot silent alarms in your inventory, verify them with set-alarm-state or synthetic metric data, tabletop the configuration against the intent, and decide quarterly which alarms to keep and which to delete. The next ALH-003 finding won't be a question mark — you'll have a four-step loop ready to run.
