Monitoring

Audit alarms that never trigger

An alarm that has sat in OK for 12 months is either guarding a healthy system or quietly broken, and you can't tell which until you test it. Review silent alarms periodically before you rely on them in the next incident.

11 min·10 sections·AWS

Last reviewed 27 May 2026

Silent alarms: the basics

What's a "never-triggered" alarm and why is it a problem?

A CloudWatch alarm watches a metric, compares it to a threshold, and changes state — OK, ALARM, or INSUFFICIENT_DATA — based on the comparison. When the alarm transitions into ALARM, it fires actions: SNS notifications, Auto Scaling steps, Lambda invocations, PagerDuty pages. That state-change is the entire reason the alarm exists.

A "never-triggered" alarm is one that has sat in OK for months or years without ever transitioning to ALARM. On paper that looks like a sign of a healthy system — the metric never crossed the line, so the workload must be fine. In practice it's ambiguous. The alarm might genuinely be guarding a healthy system, or it might be quietly broken in a way that means it would never fire even if the underlying condition actually happened.

ALH-003 ("Never Triggered Alarms") flags alarms whose StateValue has been OK for an extended period with no state transitions in the history log. The severity is LOW because a silent alarm isn't actively harmful — but it represents an untested control. The first time you find out it doesn't work is the incident it was supposed to catch.

In this lesson you'll learn how to find alarms that have never fired, the failure modes that quietly mask broken alarms, and how to verify each one actually works using either synthetic metric data or set-alarm-state. You'll also see how to document an alarm's intent so future-you can decide whether to keep, fix, or delete it during a quarterly review.

Fun fact

The Knight Capital alarm that never was

In 2012 Knight Capital lost roughly $440M in 45 minutes because a deployment left old code running on one of eight servers. Warning signs existed: automated emails flagging an error went out to staff before the market opened, but they were not designed as alerts and nobody acted on them. "We have monitoring" and "we have working alarms" are not the same sentence. Untested alarms behave the same way: present in the inventory, absent in the incident.

Auditing silent alarms in action

Marco runs SRE at a mid-size SaaS. A FinOps review surfaces 84 CloudWatch alarms in the production account, and a quick query shows 31 of them — over a third — haven't transitioned state in 90+ days. ALH-003 has flagged the whole set.

Most of them are probably fine. The auto-scaling targets, the queue-depth alerts, the routine CPU thresholds — they sit at OK because the system genuinely behaves. But buried in the 31 is at least one alarm that was added during an incident two years ago and has since had its underlying metric renamed, and another that's pointing at a dimension for an instance that was terminated last summer.

Marco doesn't trust the list. He picks one alarm — a critical SES bounce-rate alarm — and decides to actually prove it works before assuming it does.

First, pull the audit list: every alarm in OK with no state transitions in the last 90 days. This is the ALH-003 query.

$ aws cloudwatch describe-alarms --state-value OK --query 'MetricAlarms[?StateUpdatedTimestamp<=`2026-02-14`].[AlarmName,MetricName,Namespace,Threshold,StateUpdatedTimestamp]' --output table

┌─────────────────────────┬───────────────────────────┬────────────────────┬──────┬───────────────────────┐
│ AlarmName               │ MetricName                │ Namespace          │ Thr. │ StateUpdatedTimestamp │
├─────────────────────────┼───────────────────────────┼────────────────────┼──────┼───────────────────────┤
│ ses-bounce-rate-high    │ Reputation.BounceRate     │ AWS/SES            │ 5.0  │ 2024-11-03T08:21:14Z  │
│ rds-prod-cpu-warn       │ CPUUtilization            │ AWS/RDS            │ 80.0 │ 2025-01-09T14:02:51Z  │
│ alb-target-5xx-spike    │ HTTPCode_Target_5XX_Count │ AWS/ApplicationELB │ 50.0 │ 2024-12-22T09:44:00Z  │
│ legacy-worker-disk-full │ DiskSpaceUtilization      │ System/Linux       │ 90.0 │ 2024-07-17T03:11:09Z  │
│ sqs-deadletter-depth    │ ApproximateNumberOfM…     │ AWS/SQS            │ 1.0  │ 2024-10-04T11:55:32Z  │
└─────────────────────────┴───────────────────────────┴────────────────────┴──────┴───────────────────────┘

# 31 rows total. legacy-worker-disk-full last changed state in July 2024, and System/Linux is a custom namespace fed by the older CloudWatch monitoring scripts, not by an AWS service.

Every alarm in OK with no state change since Feb. Some are real, some are suspect.

Now prove the SES bounce-rate alarm actually works. set-alarm-state forces a transition and fires the alarm's actions — without producing any real bounce traffic.

$ aws cloudwatch set-alarm-state --alarm-name ses-bounce-rate-high --state-value ALARM --state-reason 'Quarterly audit — verifying SNS + Slack action fires'

# (set-alarm-state returns no output on success — confirmation is downstream.)

# Slack channel #ops-alerts, 14:32 UTC:

[ALARM] ses-bounce-rate-high — Quarterly audit — verifying SNS + Slack action fires

Region: eu-west-1 Account: 123456789012 Threshold: > 5.0 sustained 15m

# Action fired. Reset state and document the test.

$ aws cloudwatch set-alarm-state --alarm-name ses-bounce-rate-high --state-value OK --state-reason 'Audit complete'

set-alarm-state proves the SNS + Slack path works end-to-end without waiting for a real bounce spike.

How CloudWatch alarms actually work (deep dive)

A CloudWatch alarm is a small state machine. Every period (you choose the length; 60 and 300 seconds are common) the alarm pulls the metric for the configured statistic (Sum, Average, p99, etc.) across the configured dimensions, compares it to the threshold using the configured operator (GreaterThanThreshold, LessThanLowerOrGreaterThanUpperThreshold, etc.), and decides which state to be in. Notifications fire on state changes, not while the alarm sits in a state; Auto Scaling actions are the main exception, because they keep being invoked while the alarm stays in ALARM.
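To make the evaluation rule concrete, here is a minimal shell sketch of the simplest case only: a GreaterThanThreshold alarm with no missing data that needs every one of its last N datapoints to breach. The function name would_alarm and its argument order are invented for illustration, and reporting INSUFFICIENT_DATA when there are fewer datapoints than periods is a simplification; real evaluation also handles missing data and M-of-N settings.

```shell
# would_alarm THRESHOLD EVALUATION_PERIODS VALUE...
# Hypothetical helper, not part of the AWS CLI.
# Prints ALARM if each of the last EVALUATION_PERIODS values is greater than
# THRESHOLD, OK if any of them is not, and INSUFFICIENT_DATA if fewer values
# than EVALUATION_PERIODS were supplied (a simplification).
would_alarm() {
  local threshold="$1" periods="$2"
  shift 2
  printf '%s\n' "$@" | awk -v t="$threshold" -v n="$periods" '
    { v[NR] = $1 + 0 }                      # force numeric values
    END {
      t += 0; n += 0
      if (NR < n) { print "INSUFFICIENT_DATA"; exit }
      for (i = NR - n + 1; i <= NR; i++)
        if (v[i] <= t) { print "OK"; exit }  # one non-breaching point is enough
      print "ALARM"
    }'
}

# Example: threshold 5.0, three periods of 300s each ("sustained 15 minutes").
would_alarm 5.0 3 1.2 6.1 7.4 5.5
```

The last three values all exceed 5.0, so this call reports ALARM; replace the final value with 4.9 and it reports OK, which is why a brief spike does not page anyone.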

This is where silent failures live. If the metric stops being published, the alarm does not go to ALARM. With the default treat-missing-data=missing it drops to INSUFFICIENT_DATA; with notBreaching it settles in OK, and with ignore it keeps its current state, so it can look healthy indefinitely. Only treat-missing-data=breaching turns missing data into an ALARM. If the dimension points at a resource that no longer exists, you get the same outcome, because no datapoints arrive. If the threshold is set to 999 when the metric never exceeds 100, the comparison never evaluates true. If the operator is inverted (GreaterThanThreshold when you meant LessThanThreshold), the alarm will never fire under the condition it was supposed to catch.
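The missing-data behaviour is easy to misremember, so here is a small lookup sketch of where an alarm ends up once every datapoint in its evaluation window is missing. The function name state_when_metric_missing is hypothetical; the four setting names are the values accepted by --treat-missing-data.

```shell
# state_when_metric_missing TREAT_MISSING_DATA [CURRENT_STATE]
# Hypothetical helper for illustration only.
# Prints the state an alarm settles into when its metric stops arriving.
state_when_metric_missing() {
  case "$1" in
    missing)      echo "INSUFFICIENT_DATA" ;;  # the default setting
    notBreaching) echo "OK" ;;                 # silence is treated as healthy
    breaching)    echo "ALARM" ;;              # silence is treated as a breach
    ignore)       echo "${2:-OK}" ;;           # keeps whatever state it was in
    *)            echo "unknown treat-missing-data value: $1" >&2; return 1 ;;
  esac
}

state_when_metric_missing notBreaching
```

An alarm that has been in OK for a year with notBreaching or ignore deserves a closer look than one configured with missing, because only the latter would have shown a transition when the metric disappeared.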

CloudWatch charges roughly $0.10 per alarm metric per month for standard-resolution alarms and $0.30 for high-resolution. That's pocket change individually, but across a fleet with thousands of alarms accumulated over years of incidents, the bill is real. Worse, every silent broken alarm is dead weight in your inventory — it dilutes attention and makes the working alarms harder to trust.
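The per-alarm rates turn into a fleet figure with simple arithmetic. The sketch below uses integer cents to avoid floating point; the function name alarm_annual_cost_usd is made up for this example, and the default of 10 cents per alarm per month is the approximate standard-resolution rate quoted above, which you should check against current pricing for your region.

```shell
# alarm_annual_cost_usd COUNT [CENTS_PER_ALARM_PER_MONTH]
# Hypothetical helper. Defaults to 10 cents (the approximate standard-resolution
# rate); pass 30 for high-resolution alarms.
alarm_annual_cost_usd() {
  local count="$1" cents="${2:-10}"
  local total=$(( count * cents * 12 ))        # annual cost in cents
  printf '%d.%02d\n' $(( total / 100 )) $(( total % 100 ))
}

alarm_annual_cost_usd 2000      # 2,000 standard-resolution alarms
alarm_annual_cost_usd 84        # the 84 alarms in the example account
```

For 2,000 standard-resolution alarms this prints 2400.00, and for the 84-alarm account it prints 100.80, which is why the argument for auditing rests on risk rather than spend.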

# Pull the full state history for a suspect alarm — has it ever changed state?
aws cloudwatch describe-alarm-history \
  --alarm-name legacy-worker-disk-full \
  --history-item-type StateUpdate \
  --max-records 50 \
  --query 'AlarmHistoryItems[].[Timestamp,HistorySummary]' \
  --output table

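# The StateUpdate filter excludes creation and configuration edits, so an empty
# result means no transition is on record. CloudWatch keeps alarm history only
# for a limited time, so read that as "no recent transition", not proof that the
# alarm never fired. Combined with treat-missing-data=notBreaching or ignore, a
# long-silent OK alarm often means the underlying metric is no longer published.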

What's the impact of unverified silent alarms?

The direct impact is missed incidents. Every silent alarm represents a control you believe is in place but haven't proven. When the condition it was meant to catch actually happens (bounce rate spikes, disk fills, dead-letter queue backs up), you find out about it from a customer ticket instead of a page, and detection takes far longer than the alarm would have.

The second-order impact is alert fatigue working in reverse. Engineers trust the alarm inventory as a proxy for "we'd hear about it." When that proxy is silently broken, post-incident reviews keep finding the same root cause: "we had an alarm for this, it just didn't fire." That erodes trust in the entire monitoring system, and the response is usually to add more alarms — which compounds the problem.

The financial impact is small but real. Each alarm costs roughly $0.10/month — a fleet with 2,000 alarms is paying ~$2,400/year just to keep them configured. A meaningful fraction of those are usually defunct. More importantly, every silent alarm is a future incident-investigation hour: when an outage hits and the alarm didn't fire, someone has to figure out why, fix it, and update the runbook — usually under pressure at 3am.

There's also a compliance angle. SOC 2 CC7.2 and ISO 27001 A.12.4 expect monitoring controls to be tested periodically. An auditor asking "how do you know your alarms work?" wants to hear about a documented verification process — not "they're in the inventory and we trust they're configured correctly."

How do you audit and fix silent alarms?

Auditing silent alarms is a four-step loop, run quarterly. The goal isn't to keep every alarm — it's to make sure every alarm in the inventory is verified, documented, and still relevant.

1. Inventory alarms by last state transition

Pull describe-alarms filtered to StateValue=OK, then cross-reference describe-alarm-history for the StateUpdate event count. Any alarm whose only history entry is its own creation, or whose last transition is older than your audit window (90 days is reasonable), goes on the audit list. Don't audit every alarm every quarter — focus on the silent set.
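As a sketch of the inventory step, the filter below takes "name timestamp" pairs and keeps the alarms whose last state change is older than a cutoff date. It relies on ISO-8601 timestamps sorting correctly as plain strings. The function name silent_alarms is hypothetical, and the two-column input format assumes alarm names without spaces.

```shell
# silent_alarms CUTOFF_DATE
# Hypothetical helper. Reads "AlarmName StateUpdatedTimestamp" pairs on stdin
# and prints the names whose timestamp sorts before CUTOFF_DATE (YYYY-MM-DD).
# Appending "" forces awk to compare the values as strings.
silent_alarms() {
  awk -v cutoff="$1" '($2 "") < (cutoff "") { print $1 }'
}

# Intended use (needs AWS credentials, so it is left commented out here):
#   aws cloudwatch describe-alarms --state-value OK \
#     --query 'MetricAlarms[].[AlarmName,StateUpdatedTimestamp]' \
#     --output text | silent_alarms 2026-02-14

# Self-contained demonstration with two sample rows:
printf '%s\n' \
  'ses-bounce-rate-high 2024-11-03T08:21:14Z' \
  'checkout-latency-p99 2026-03-01T00:00:00Z' | silent_alarms 2026-02-14
```

Only ses-bounce-rate-high is printed, because the second (invented) alarm changed state after the cutoff. Feeding the output into describe-alarm-history gives the per-alarm transition check described above.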

2. Verify the alarm actually fires

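There are two safe ways to test. set-alarm-state --state-value ALARM forces a transition and fires every configured action, which makes it the cleanest test of the SNS topic, Lambda, and downstream paging integration. It does not exercise the metric or the threshold, though, and the forced state is temporary: the alarm returns to its real state at the next evaluation. The alternative is put-metric-data with a value above the threshold for a few evaluation periods, which exercises the whole metric→evaluation→action path. That only works for custom namespaces, because you cannot publish into AWS/ namespaces such as AWS/SES, and the synthetic datapoints stay in the metric's history. With set-alarm-state, always set a clear state-reason like "quarterly audit" so the on-call team knows it's a drill, and reset to OK straight after.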
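The synthetic-data route can be sketched as a thin wrapper around put-metric-data. This is an illustration under stated assumptions: the function name publish_synthetic_datapoint, the AWS_CMD override (used here for dry runs), and the Custom/Workers namespace and Host dimension in the example are all hypothetical. The wrapper refuses AWS/ namespaces, since those are reserved for AWS services and need set-alarm-state instead.

```shell
# publish_synthetic_datapoint NAMESPACE METRIC VALUE [DIMENSIONS]
# Hypothetical wrapper. Set AWS_CMD=echo to print the command instead of
# running it. DIMENSIONS uses the CLI shorthand, e.g. Host=worker-1, and must
# match the dimensions the alarm watches or the alarm will not see the data.
publish_synthetic_datapoint() {
  local namespace="$1" metric="$2" value="$3" dimensions="${4:-}"
  case "$namespace" in
    AWS/*)
      echo "AWS/ namespaces are reserved; use set-alarm-state for $metric" >&2
      return 1 ;;
  esac
  set -- cloudwatch put-metric-data \
    --namespace "$namespace" --metric-name "$metric" --value "$value"
  [ -n "$dimensions" ] && set -- "$@" --dimensions "$dimensions"
  "${AWS_CMD:-aws}" "$@"
}

# Dry run: shows the command that would be sent.
AWS_CMD=echo publish_synthetic_datapoint Custom/Workers DiskSpaceUtilization 95 Host=worker-1
```

A single datapoint rarely flips an alarm that evaluates several periods, so in practice you would repeat the call once per period until the alarm transitions, then stop and let real data return it to OK.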

3. Tabletop the intent against the configuration

For each alarm, write one sentence describing what it's meant to catch and read the configuration with that intent in mind. Does the metric actually measure the thing you described? Is the threshold inside the range of plausible values, and in the metric's own unit (a fraction versus a percentage, bytes versus gigabytes)? Is the comparison operator the right direction? An alarm called "low-disk-warning" with GreaterThanThreshold 10 on a disk-used percentage is configured backwards: it is breaching whenever the disk is more than 10% full, which is nearly always, so it never transitions when the disk actually fills up. These bugs hide in plain sight until someone reads the config out loud.
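Part of the tabletop check can be mechanised: compare the threshold and operator against the range the metric has actually covered. The sketch below assumes you already know the observed minimum and maximum (for example from a get-metric-statistics query, not shown); the function name threshold_check and its output labels are invented for this example.

```shell
# threshold_check OPERATOR THRESHOLD OBSERVED_MIN OBSERVED_MAX
# Hypothetical helper. Classifies a static-threshold alarm against the range
# of values the metric has actually taken.
threshold_check() {
  awk -v op="$1" -v t="$2" -v lo="$3" -v hi="$4" 'BEGIN {
    t += 0; lo += 0; hi += 0
    if (op == "GreaterThanThreshold")               { some = (hi > t);  all = (lo > t) }
    else if (op == "GreaterThanOrEqualToThreshold") { some = (hi >= t); all = (lo >= t) }
    else if (op == "LessThanThreshold")             { some = (lo < t);  all = (hi < t) }
    else if (op == "LessThanOrEqualToThreshold")    { some = (lo <= t); all = (hi <= t) }
    else { print "unsupported-operator"; exit }
    if (all)       print "always-breaching"   # stuck in ALARM, never transitions
    else if (some) print "reachable"          # threshold sits inside the range
    else           print "never-breaching"    # threshold is outside the range
  }'
}

threshold_check GreaterThanThreshold 999 0 100   # threshold far above the metric
threshold_check GreaterThanThreshold 10 35 80    # the backwards low-disk alarm
```

The first call reports never-breaching and the second always-breaching. Neither proves the alarm is wrong, since a healthy metric may simply not have reached its threshold yet, but both results are worth a second look.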

4. Document intent in the description, then keep or delete

Every alarm needs a one-line description that explains its purpose, action, and severity — e.g. "Fires when SES bounce rate > 5% sustained 15min — pages on-call, indicates email reputation degradation." If you can't write that sentence, delete the alarm. Untracked alarms accumulate over years; the quarterly audit is your chance to keep the inventory honest. A short, verified set of alarms is worth more than a long unverified one.

# Update the description so the next on-call (and the next auditor) knows the intent.
# put-metric-alarm replaces the whole alarm definition, so repeat every setting you want to keep.
# SES publishes Reputation.BounceRate as a fraction (0.05 = 5%), so the threshold is 0.05;
# a threshold of 5.0 could never be breached.
aws cloudwatch put-metric-alarm \
  --alarm-name ses-bounce-rate-high \
  --alarm-description 'Fires when SES bounce rate > 5% sustained 15min. Pages on-call; indicates email reputation degradation. Last verified 2026-05-15.' \
  --metric-name Reputation.BounceRate \
  --namespace AWS/SES \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 0.05 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:ops-alerts

# Delete the alarms whose intent you couldn't articulate — they were dead weight.
aws cloudwatch delete-alarms --alarm-names legacy-worker-disk-full old-elb-latency-2023
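If descriptions carry a "Last verified YYYY-MM-DD" stamp, the next audit can find stale ones mechanically. The sketch below assumes that exact stamp wording, which is a convention of this lesson's example rather than anything CloudWatch enforces; the function name needs_reverification is hypothetical.

```shell
# needs_reverification DESCRIPTION CUTOFF_DATE
# Hypothetical helper. Succeeds (exit 0) when the description has no
# "Last verified YYYY-MM-DD" stamp, or when the stamp is older than CUTOFF_DATE.
needs_reverification() {
  local description="$1" cutoff="$2" stamp
  stamp="$(printf '%s\n' "$description" |
    sed -n 's/.*Last verified \([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/p')"
  if [ -z "$stamp" ]; then
    return 0                       # never verified, as far as the record shows
  fi
  # ISO dates sort as plain strings, so a string comparison is enough.
  awk -v s="$stamp" -v c="$cutoff" 'BEGIN { exit !((s "") < (c "")) }'
}

if needs_reverification 'Fires when queue depth > 0. Last verified 2025-11-01.' 2026-02-14; then
  echo "re-verify"
else
  echo "current"
fi
```

Here the stamp predates the cutoff, so the alarm is queued for re-verification. Descriptions can be pulled with describe-alarms and the AlarmDescription field, then passed through this check one alarm at a time.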

Quick quiz

You find an alarm that's been in OK for 14 months with zero state transitions in its history. The on-call team says they're sure the workload has had bad days in that window. What's the right next move?

Keep learning

Dig deeper into CloudWatch alarms and the discipline of testing your monitoring.

You've completed Audit alarms that never trigger. You can now spot silent alarms in your inventory, verify them with set-alarm-state or synthetic metric data, tabletop the configuration against the intent, and decide quarterly which alarms to keep and which to delete. The next ALH-003 finding won't be a question mark — you'll have a four-step loop ready to run.

Silent alarms: the cost and governance angle

CloudWatch alarms are a recurring line item — are you paying for controls that actually work?

CloudWatch standard-resolution alarms cost roughly $0.10 per alarm metric per month. Individually that's noise, but a large AWS account can accumulate thousands of alarms over years of incidents and projects — each one a small, ongoing charge regardless of whether it works. ALH-003 surfaces the subset of that inventory that has never fired: alarms that have sat in OK for months or years with no state transition on record.

From a spend perspective the direct cost is marginal. The FinOps concern is what the inventory represents: every silent alarm is either a legitimate control on a genuinely stable system, or dead weight — a control that would fail to fire even if the condition it was meant to catch actually happened. You can't tell which just from the count. The only way to distinguish a healthy-system alarm from a broken one is to verify it, and that verification process is what ALH-003 is prompting.

For finance, the right frame is governance hygiene. An alarm inventory with a meaningful fraction of never-triggered items is an inventory that hasn't been reviewed. Periodic audits — quarterly is reasonable — convert that ambiguous list into deliberate decisions: keep the verified ones, fix the broken ones, delete the defunct ones. That keeps both the spend and the risk posture honest, and means the team can answer 'do our monitoring controls work?' with evidence rather than assumption.

This lesson is for the finance partner who wants to understand why a CloudWatch alarm audit matters beyond the small per-alarm cost. You'll see what ALH-003 flags and why, how to think about the alarm inventory as a governance object rather than an engineering detail, and what a well-run quarterly review produces — a shorter, verified set of controls with documented decisions on every exception. No commands required; the value is the framing that turns an SRE audit into a defensible monitoring posture.

How a finance partner frames the alarm audit

Dana is the FinOps partner reviewing the Q2 cloud operations summary. The security dashboard shows 31 CloudWatch alarms flagged under ALH-003 — never triggered in 90-plus days. The per-alarm cost is small, but Dana's first question isn't about the cost; it's about what the list implies for the risk register.

She asks the SRE team to split the 31 by business criticality. Of the 31, eight are on production workloads — customer-facing services where an undetected incident would have revenue or SLA impact. The other 23 are on dev, test, and deprecated resources. For the 23, Dana's recommendation is: verify quickly or delete. Paying $0.10/month for an unverified alarm on a retired resource is waste; more importantly it pollutes the inventory and makes the working alarms harder to trust.

For the eight production alarms, Dana asks one question: when were they last proven to fire? The answer — never, for most of them — goes into the risk register as a monitoring-control gap. The cost to fix it is engineering hours, not cloud spend. Her takeaway for the finance pack is: 'Eight production monitoring controls are unverified. The risk isn't the alarm bill — it's that we'd find out about incidents from customers, not pages.'

What unverified alarms cost — beyond the alarm bill

The direct CloudWatch cost is easy to model: standard-resolution alarms run roughly $0.10 per alarm per month. A fleet of 2,000 alarms is approximately $2,400 per year. A meaningful fraction of that — in many mature accounts, 20–40% — is likely defunct or unverified. That's real but not the primary financial exposure.

The larger cost is incident economics. An unverified alarm on a production workload is a monitoring-control gap with a deferred price tag. When the condition it was supposed to catch actually occurs (a bounce-rate spike, a dead-letter queue backing up, a disk filling), the team finds out from a customer ticket or a manual check rather than a page. Detection takes longer, and the cost of that gap shows up as incident response hours, SLA credits, and reputational exposure. That cost is episodic and hard to budget for, which makes it worse than a predictable line item.

There is also an audit exposure. SOC 2 CC7.2 and ISO 27001 A.12.4 expect monitoring controls to be tested periodically. An alarm inventory that has never been verified is a gap auditors are likely to find. The cost of a finding at audit, meaning remediation effort plus the possibility of a qualified opinion, can easily exceed the cost of a quarterly verification programme.

For finance, the right model is: unverified alarms represent a contingent liability, not just a sunk cost. The quarterly audit converts that liability into a documented, defensible control set, and the small engineering cost to run it is cheap relative to the incidents it prevents.

How finance drives the alarm audit without running commands

Finance can't run set-alarm-state, but it owns the framing that turns an SRE backlog item into a quarterly governance process with accountability. Four levers.

1. Put alarm verification on the regular security-and-cost review

Track the count of unverified production alarms as a standing metric alongside cost and compliance findings. Separating production from dev/test is essential — a large number of silent alarms on development resources is expected and low priority; a non-zero number on production workloads is a monitoring-control gap that should have a remediation date.

2. Require a documented reason for every silent alarm kept without verification

The output of a quarterly audit isn't just a shorter list; it's a record. Every alarm that survives the audit should have a one-line description of its intent, the date it was last verified, and either evidence of a successful test or a documented reason why the silent period is expected (e.g., 'fires only on SES bounce rate exceeding 5%; observed bounce rates have not approached the threshold'). That documentation is what makes the control defensible at audit.

3. Price defunct alarms as waste, not just noise

When the engineering team identifies alarms pointing at terminated resources or renamed metrics, treat deletion as a cost-hygiene action as well as a security action. Each removed alarm saves $0.10/month, which is individually trivial but matters at scale — and more importantly, a smaller inventory is a more trustworthy one. Track alarm count alongside alarm spend to give the team a concrete output metric for the audit.

4. Treat the quarterly audit as a control, not a one-off

The risk of unverified alarms compounds over time: alarms accumulate, resources are renamed or retired, and the gap between what the inventory says and what actually works widens. The quarterly cadence is what keeps the posture honest. Finance can lock this in by making it a standing item in the governance calendar with an owner and a deliverable — not an ad-hoc request that gets deprioritised under sprint pressure.

Quick quiz

A quarterly alarm audit finds 18 ALH-003 findings: 5 on production services, 13 on dev and test resources. As the FinOps partner, what is the right response?

You've finished the finance partner's view of ALH-003. You know the real cost isn't the per-alarm spend — it's the contingent liability of unverified production monitoring controls and the audit exposure that comes with them. You have four governance levers: tracking unverified production alarms as a standing metric, requiring documented reasons for retained silent alarms, treating defunct-alarm deletion as cost hygiene, and locking in a quarterly verification cadence with an owner. The next ALH-003 finding is a conversation you can lead, not just observe.

Silent alarms: the accountability question

An untested alarm is a control you believe is in place but haven't proven

CloudWatch alarms are the mechanism that tells the team when something is wrong. A 'never-triggered' alarm is one that has sat quietly in an OK state for months or years with no record of ever firing — which could mean the system is healthy, or could mean the alarm is broken and would fail to fire even in an incident. ALH-003 flags the entire ambiguous set.

The leadership question this control surfaces is simple: can you defend the monitoring posture? If a major incident occurs and the alarm that should have caught it didn't fire, the post-incident review will ask when it was last verified. An untested inventory has no good answer. The low-severity rating reflects that a silent alarm isn't actively causing harm today — but the risk materialises at the exact moment you need it most.

The healthy posture is not a zero-finding dashboard — it's a verified, deliberately maintained alarm inventory where every silent alarm has been reviewed, tested, and kept or removed on purpose. That's the difference between 'we have monitoring' and 'our monitoring works.'

A concise read for the executive who needs to understand what this finding says about the organisation's monitoring posture. You'll get the plain-English version of why a never-triggered alarm is a risk rather than a reassurance, what good looks like (a verified, intentionally maintained inventory), and the one question to ask at a review to confirm the team has moved from assumption to evidence. No implementation detail required.

What happens when an untested alarm is the one that matters

At one company the SES bounce-rate alarm had been in the inventory for two years. It was listed in the runbook. It was counted in the monitoring dashboard. Nobody had ever tested it. When the email sending domain was migrated and the metric namespace changed, the alarm silently lost its connection to the underlying metric — but it stayed in OK and nobody noticed.

Six months later a bounce spike hit. The alarm didn't fire. The support team found out from a customer whose emails were bouncing. The post-incident review found the broken alarm within minutes — it had never transitioned state since the migration.

The finding wasn't that the alarm was broken. It was that nobody had a process to know. ALH-003 is that process: it surfaces the alarms that have been silent long enough to be worth verifying. The executive question is whether that verification happens on a schedule, before an incident, or in a post-incident review, after one.

Why monitoring gaps show up as business risk, not IT risk

An unverified alarm is a risk that is invisible until it materialises. The team believes the monitoring is in place; the alarm is in the dashboard; the control is on the compliance list. But if it hasn't been tested, all of that is assumption. The impact lands when an incident the alarm was supposed to catch instead arrives as a customer complaint — and the post-incident review finds the alarm has been silently broken for months.

The business impact of missed incidents is well-understood: extended downtime, SLA penalties, customer-facing errors, and the reputational cost of finding out about problems from outside rather than inside. What makes unverified alarms a distinct leadership concern is that the risk is latent but discoverable: the team can run a quarterly audit and find it, whereas without that process the organisation carries the exposure invisibly.

The compliance dimension matters too. Regulators and auditors under SOC 2 and ISO 27001 expect monitoring controls to be tested, not just configured. The board question isn't 'do we have alarms?' — it's 'do we have a process that proves they work?' A verified, periodically reviewed alarm inventory answers that question with evidence. An unaudited one does not.

The leadership ask on ALH-003

The executive handle is not to drive the finding count to zero — it's to ensure there is a periodic, documented process that distinguishes verified alarms from unverified ones, and that production monitoring controls have evidence of working.

1. Require a verified alarm inventory on the quarterly security review

Ask for one number at each quarterly review: how many production-critical alarms remain unverified after this quarter's testing? A declining unverified count over successive quarters is the evidence of a functioning process. It doesn't require technical depth to interpret.

2. Distinguish production from non-production in the finding

A large total count of ALH-003 findings is noise if most are on dev and test resources. The executive signal is the production subset — those are the controls that protect revenue and customer experience. Push for the finding to be reported with that segmentation rather than as a raw total.

3. Accept intentional silence with a documented reason

Not every alarm that rarely fires is broken. An SES bounce-rate alarm on a well-managed domain may legitimately never exceed threshold. The policy isn't 'every silent alarm must be fixed' — it's 'every silent alarm must have been reviewed and either verified or documented as expected-silent.' That's the distinction between governance and noise reduction.

Quick quiz

At the quarterly business review the security team reports that all production-critical CloudWatch alarms have been tested this quarter and pass, while 15 dev/test alarms remain unverified. What is the right read of this signal?

That's the lesson. The one takeaway: 'we have alarms' and 'our alarms work' are different sentences. The leadership posture that closes the gap is a quarterly verified inventory where every production-critical alarm has evidence of working and every silent exception has a documented reason. Ask for that at the governance review — it's a one-number signal that doesn't require technical depth to interpret.

Part of the learning path Get your alarms right
