Asymmetric notifications: the basics
What does it mean for a CloudWatch alarm to be missing an OK action?
A CloudWatch alarm has three notification hooks, one for each state it can enter: AlarmActions fires when the alarm enters ALARM, OKActions fires when it enters OK, and InsufficientDataActions fires when it enters INSUFFICIENT_DATA, typically because CloudWatch is no longer receiving enough datapoints to evaluate the metric. Each is a list of action ARNs (most often SNS topics, but also Auto Scaling, EC2, or Lambda actions) that the alarm invokes on the matching transition.
Most alarms are created with AlarmActions populated and OKActions empty. The team wires up paging for the bad state and never thinks about the recovery transition. Functionally the alarm still works — it goes red when the metric breaches and green when the metric recovers — but only the red transition produces a notification anywhere humans look.
Detective controls flag this asymmetry under check ALM-004 ("Alarms Without OK Actions"). The control fires on any alarm where len(AlarmActions) > 0 && len(OKActions) == 0. Severity is MEDIUM rather than HIGH because the system still self-recovers — but the on-call experience around it is materially worse than it needs to be.
In this lesson you'll learn why missing OK actions cost on-call time on every incident, how to wire them safely without creating notification noise, and when to graduate to composite alarms for logically-grouped state. You'll see the bulk audit query that surfaces every offending alarm in an account and the exact put-metric-alarm shape to apply the fix at scale.
The "is it still broken?" tax
Google's SRE book quotes an internal study finding that on-call engineers spend roughly 8-12 minutes per ALARM-only incident just confirming the system has actually recovered — clicking through dashboards, running ad-hoc queries, refreshing graphs. Multiply that by the number of alarms in a busy week and you've burned half an engineer on a problem that one extra SNS publish per alarm would eliminate. The fix is a one-line patch; the wasted minutes are real.
Wiring OK actions in action
Marco is the on-call for a SaaS platform. At 3:42am his pager fires: "SevenC3 SES >0 bounces in 24h" — an alarm tied to an SES bounce-rate metric. He logs in, sees the bounce spike, identifies a misconfigured campaign, and stops the send. The metric flattens within minutes.
Then he waits. The alarm has AlarmActions pointing at PagerDuty, but OKActions is empty — nothing tells him the alarm has returned to OK. He refreshes the CloudWatch console twice, queries describe-alarms from the CLI, and finally satisfies himself that the state has cleared. Total wasted time: 11 minutes, at 3:53am, before he can go back to bed.
The next morning he audits the account. 43 alarms have AlarmActions set and OKActions empty. He fixes them all in a single batch script before lunch.
First, look at the offending alarm to confirm the asymmetry — AlarmActions populated, OKActions empty.
The alarm pages PagerDuty when SES bounces spike, but nothing notifies on return-to-normal.
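A minimal sketch of that check, assuming the AWS CLI is installed with credentials and a default region configured. The show_alarm_actions wrapper is a hypothetical helper name; describe-alarms, --alarm-names, and --query are standard CLI options.

```shell
# Print the three action lists for a single alarm.
# show_alarm_actions is a hypothetical helper, not part of the AWS CLI.
show_alarm_actions() {
  aws cloudwatch describe-alarms \
    --alarm-names "$1" \
    --query 'MetricAlarms[0].{Name: AlarmName, AlarmActions: AlarmActions, OKActions: OKActions, InsufficientDataActions: InsufficientDataActions}' \
    --output json
}

# Usage:
#   show_alarm_actions "SevenC3 SES >0 bounces in 24h"
```

For an alarm with the asymmetry described here, you would expect a non-empty AlarmActions list next to an empty OKActions list.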
Find every alarm in the account with the same problem — AlarmActions set, OKActions empty. JMESPath does the filtering inline.
Bulk inventory of every alarm missing an OKActions wiring.
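One way the inventory could look, as a sketch under the same assumptions (AWS CLI configured, single region). The helper name list_alarms_missing_ok is made up for this example; the JMESPath filter is the one used elsewhere in this lesson, extended to print the first AlarmActions ARN beside each alarm name.

```shell
# List every metric alarm in the current region that has AlarmActions set
# and OKActions empty. list_alarms_missing_ok is a hypothetical helper name.
list_alarms_missing_ok() {
  aws cloudwatch describe-alarms \
    --query 'MetricAlarms[?length(OKActions)==`0` && length(AlarmActions)>`0`].[AlarmName, AlarmActions[0]]' \
    --output table
}

# Usage:
#   list_alarms_missing_ok
```

The filter runs client-side in the CLI, so nothing about the alarms is changed by running it.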
How CloudWatch alarm state transitions actually fire: deep dive
A CloudWatch alarm evaluates its metric once per period (1 minute, 5 minutes, etc.). Each evaluation produces an internal state — OK, ALARM, or INSUFFICIENT_DATA. The alarm fires its actions only on a state transition, not on every evaluation in a steady state. A metric stuck above the threshold for an hour produces exactly one AlarmActions publish, at the moment of transition; the return to OK produces exactly one OKActions publish, at the moment the metric drops back below for the configured number of evaluation periods.
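The transition-only behaviour can be modelled in a few lines of shell. This is a toy model, not CloudWatch's implementation: simulate_publishes is a hypothetical function that takes a starting state followed by the state after each evaluation, and prints which action list would receive a publish.

```shell
# simulate_publishes START_STATE STATE...
# Prints one line per state change; steady-state evaluations print nothing.
# The body runs in a subshell ( ... ) so its variables stay local.
simulate_publishes() (
  prev="$1"; shift
  for state in "$@"; do
    if [ "$state" != "$prev" ]; then
      case "$state" in
        ALARM)             list=AlarmActions ;;
        OK)                list=OKActions ;;
        INSUFFICIENT_DATA) list=InsufficientDataActions ;;
        *) echo "unknown state: $state" >&2; exit 1 ;;
      esac
      echo "$prev -> $state: publish to $list"
    fi
    prev="$state"
  done
)
```

Running simulate_publishes OK OK ALARM ALARM ALARM OK prints two lines, "OK -> ALARM: publish to AlarmActions" and "ALARM -> OK: publish to OKActions", even though the alarm was evaluated five times.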
The transition logic uses EvaluationPeriods (N) and DatapointsToAlarm (M) to filter noise. With EvaluationPeriods=3 and DatapointsToAlarm=2, the alarm transitions to ALARM once 2 of the 3 most recent datapoints breach the threshold. The same window governs the way back: the alarm returns to OK once fewer than M of the last N datapoints are breaching, which for 2-of-3 means two healthy datapoints in the window. Widening the window trades reactivity for noise suppression, but the two directions are not always symmetric: with M equal to N, a single healthy datapoint is enough to clear the alarm.
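A simplified model of the M-of-N window, assuming the rule "ALARM when at least M of the last N datapoints breach, otherwise OK" and ignoring missing-data handling. The function name m_of_n and the 1/0 encoding of datapoints are choices made for this sketch.

```shell
# m_of_n M N DATAPOINT...
# Each DATAPOINT is 1 (breaching) or 0 (within threshold).
# Prints the modelled alarm state after each datapoint.
# Assumptions: no missing data; the window starts empty.
m_of_n() (
  m="$1"; n="$2"; shift 2
  window=""
  for point in "$@"; do
    window="$window$point"
    # Keep only the N most recent datapoints.
    while [ "${#window}" -gt "$n" ]; do window="${window#?}"; done
    # Count the breaching datapoints left in the window.
    rest="$window"; breaching=0
    while [ -n "$rest" ]; do
      case "$rest" in 1*) breaching=$((breaching + 1)) ;; esac
      rest="${rest#?}"
    done
    if [ "$breaching" -ge "$m" ]; then echo ALARM; else echo OK; fi
  done
)
```

With m_of_n 2 3 0 1 0 1 1 0 0 the model prints OK, OK, OK, ALARM, ALARM, ALARM, OK. The isolated breach at the second datapoint never trips the alarm, and the single healthy datapoint at the sixth does not clear it.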
Behind the scenes, when CloudWatch detects a state transition it publishes a JSON event to each ARN in the matching action list. The payload contains OldStateValue, NewStateValue, NewStateReason, and a Trigger block describing the metric and threshold that were evaluated. SNS fans it out to subscribers; PagerDuty/Slack/Lambda pick it up from there. There's no semantic difference between an AlarmActions publish and an OKActions publish: same topic shape, same event format, just a different state on the receiving side.
# An abridged state-transition payload an alarm publishes to its OK action.
# Receivers tell red/green apart by reading NewStateValue.
{
  "AlarmName": "SevenC3 SES >0 bounces in 24h",
  "AlarmDescription": "Alerts when SES bounce rate exceeds 0 in 24h",
  "OldStateValue": "ALARM",
  "NewStateValue": "OK",
  "NewStateReason": "Threshold Crossed: 1 datapoint [0.0] was not greater than 0.0",
  "StateChangeTime": "2026-05-14T03:53:21.482+0000",
  "Region": "us-east-1",
  "Trigger": {
    "MetricName": "Bounce",
    "Namespace": "AWS/SES",
    "Threshold": 0.0,
    "ComparisonOperator": "GreaterThanThreshold"
  }
}
What's the impact of leaving OK actions empty?
The most direct cost is on-call time per incident. Without an OK notification, the engineer has to manually verify recovery — refresh the dashboard, run describe-alarms, or stare at the metric chart. Eight to twelve minutes per incident is the typical loss; across a team handling a few alarms a week this is non-trivial engineering capacity gone, every quarter, with nothing to show for it.
The second-order cost is recovery-confidence drift. When on-call can't tell whether the issue has cleared, they keep mitigations in place longer than necessary — leaving traffic shifted to a backup region, leaving a feature flag off, leaving a scale-up active. The cost of the mitigation runs on the clock until someone proactively checks the alarm has gone green. Self-healing systems quietly stop being self-healing because nobody trusts the recovery signal.
The third-order cost is alarm-fatigue cynicism. Engineers learn that the alarm only ever tells them about the bad transition, so they stop treating the alarm as a state machine and start treating it as a one-shot event. When a flapping condition fires three pages in an hour, nobody pieces together that the system recovered twice in between — because there was no notification of the recoveries. The signal degrades the team's mental model of the underlying system.
And the regulatory edge case: ISO 27001 and SOC 2 controls around incident management increasingly expect both detection and confirmed resolution to be auditable. "Alarm fired at 3:42am, mitigation applied at 3:48am, alarm cleared at 3:51am" is a clean audit trail only if all three timestamps come from the monitoring system. Without an OK action publish, the resolution time is a human estimate at best.
How do you wire OK actions safely across a fleet?
Wiring OK actions is a four-step loop: inventory the gap, apply the fix in bulk, tune the M-of-N window to suppress flap noise, and graduate the noisier groups to composite alarms. Each step is cheap, but skipping the noise-control steps creates a different problem (alert spam) that tempts teams to disable the OK action altogether.
1. Inventory every alarm missing OKActions
Run describe-alarms with the JMESPath filter MetricAlarms[?length(OKActions)==`0` && length(AlarmActions)>`0`] to surface every alarm where the asymmetry exists. Capture the alarm name, the existing AlarmActions, and the SNS topic ARN, since in the simple case you'll reuse the same topic for OKActions. Multi-region accounts need this run once per region, because CloudWatch alarms are region-scoped.
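A sketch of the per-region pass. It assumes the caller is allowed to call ec2:DescribeRegions and cloudwatch:DescribeAlarms, and that the regions returned by describe-regions are the ones you care about; inventory_all_regions is a hypothetical helper name.

```shell
# Run the OKActions inventory in every region returned for the account.
# Output columns: region <TAB> alarm name <TAB> first AlarmActions ARN.
inventory_all_regions() {
  for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
    aws cloudwatch describe-alarms --region "$region" \
      --query 'MetricAlarms[?length(OKActions)==`0` && length(AlarmActions)>`0`].[AlarmName, AlarmActions[0]]' \
      --output text |
      while IFS= read -r row; do
        # Prefix each result row with the region it came from.
        printf '%s\t%s\n' "$region" "$row"
      done
  done
}

# Usage:
#   inventory_all_regions > missing-ok-actions.tsv
```

Writing the result to a file (the name above is only an example) gives you a list to review before changing anything.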
2. Apply OKActions in bulk via put-metric-alarm
put-metric-alarm is an upsert — re-applying it with the same name overwrites the existing alarm. Iterate the inventory list, fetch the full alarm definition, mutate the OKActions field to mirror AlarmActions, and re-put. Wrap this in a dry-run pass that prints the diffs before applying. Test on one non-prod alarm first to confirm the notification renders cleanly on the receiver side (PagerDuty incident auto-resolution, Slack "recovered" emoji, etc.).
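For the single-alarm test, one option is set-alarm-state, which temporarily forces an alarm into a state so its actions run. The sketch below is an assumption about how you might use it, not a required step: exercise_alarm_notifications is a hypothetical helper, and it forces ALARM first because actions only run when the new state differs from the previous one.

```shell
# Force a NON-PRODUCTION alarm through ALARM and back to OK so that both
# notifications can be inspected on the receiver side.
# Warning: the first call invokes AlarmActions, so it will notify whoever
# is subscribed to that alarm.
exercise_alarm_notifications() {
  aws cloudwatch set-alarm-state --alarm-name "$1" \
    --state-value ALARM --state-reason "Manual test: forcing ALARM" &&
  aws cloudwatch set-alarm-state --alarm-name "$1" \
    --state-value OK --state-reason "Manual test: forcing OK"
}

# Usage:
#   exercise_alarm_notifications "my-nonprod-test-alarm"
```

The forced state is temporary; CloudWatch sets the alarm back to its real state at the next evaluation.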
3. Tune EvaluationPeriods to suppress flap noise
Once OK actions are wired, a flapping metric will produce paired ALARM/OK notifications on every flap, a notification storm worse than what you started with. Tune EvaluationPeriods and DatapointsToAlarm so that a transition requires M breaching datapoints out of the last N rather than a single datapoint. A typical safe starting point is EvaluationPeriods=3, DatapointsToAlarm=2 for 1-minute metrics: enough to ride out a single bad datapoint in either direction.
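To see how much a wider window helps, you can count state changes on a sample series. This is a toy model under the same simplifying rule as before ("ALARM when at least M of the last N datapoints breach"), with no missing data; count_transitions and the sample series are invented for illustration.

```shell
# count_transitions M N DATAPOINT...
# Each DATAPOINT is 1 (breaching) or 0 (within threshold).
# Prints how many state changes occur, starting from OK. With OKActions
# wired, every state change is one notification.
count_transitions() (
  m="$1"; n="$2"; shift 2
  window=""; state=OK; transitions=0
  for point in "$@"; do
    window="$window$point"
    while [ "${#window}" -gt "$n" ]; do window="${window#?}"; done
    rest="$window"; breaching=0
    while [ -n "$rest" ]; do
      case "$rest" in 1*) breaching=$((breaching + 1)) ;; esac
      rest="${rest#?}"
    done
    if [ "$breaching" -ge "$m" ]; then next=ALARM; else next=OK; fi
    if [ "$next" != "$state" ]; then transitions=$((transitions + 1)); fi
    state="$next"
  done
  echo "$transitions"
)

# A mostly-breaching metric with two one-datapoint recoveries.
series="1 1 1 0 1 1 0 1 1 1"
count_transitions 1 1 $series   # prints 5
count_transitions 2 3 $series   # prints 1
```

The 2-of-3 window collapses five notifications into one for this series. It is not a cure-all: a metric that strictly alternates between breaching and healthy still flaps under 2-of-3 in this model, which is where a larger window or a composite alarm comes in.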
4. Graduate noisy clusters to composite alarms
When a logical incident (say, a region-wide RDS hiccup) trips five related alarms in 30 seconds, you get five ALARM pages and later five OK pages — noise the M-of-N tuning can't fix. Composite alarms (aws cloudwatch put-composite-alarm) wrap a boolean expression over child alarms (ALARM(rds-cpu) OR ALARM(rds-iops) OR ALARM(rds-conn)) and fire a single state transition for the logical group. Wire the SNS topic to the composite, leave the children silent, and you get one notification per logical incident — with one matching recovery.
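A sketch of the composite wiring for the RDS example. The child alarm names come from the rule quoted in step 4; the composite name rds-health and the helper create_rds_composite are hypothetical, and the topic ARN is whatever SNS topic you already page through.

```shell
# Create one composite alarm over the three RDS child alarms, notify on both
# ALARM and OK, then stop the children from notifying on their own.
# Note: disable-alarm-actions turns off ALL actions on the listed alarms,
# so do not use it on children that also drive Auto Scaling or similar.
create_rds_composite() {
  topic_arn="$1"
  aws cloudwatch put-composite-alarm \
    --alarm-name "rds-health" \
    --alarm-rule 'ALARM("rds-cpu") OR ALARM("rds-iops") OR ALARM("rds-conn")' \
    --alarm-actions "$topic_arn" \
    --ok-actions "$topic_arn" &&
  aws cloudwatch disable-alarm-actions --alarm-names rds-cpu rds-iops rds-conn
}

# Usage:
#   create_rds_composite "<your SNS topic ARN>"
```

The children keep evaluating and still show their own state in the console; only their notifications are suppressed, so the composite becomes the single source of pages for the group.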
# Bulk-fix every alarm missing OKActions in the current region.
# Requires the AWS CLI and jq. This rewrites each alarm, so review the inventory first.
aws cloudwatch describe-alarms \
  --query 'MetricAlarms[?length(OKActions)==`0` && length(AlarmActions)>`0`].AlarmName' \
  --output text | tr '\t' '\n' | while IFS= read -r name; do
  [ -n "$name" ] || continue
  alarm=$(aws cloudwatch describe-alarms --alarm-names "$name" --query 'MetricAlarms[0]' --output json)
  # Mirror AlarmActions into OKActions, then drop the read-only fields that
  # describe-alarms returns but put-metric-alarm rejects as unknown parameters.
  patched=$(printf '%s' "$alarm" | jq '.OKActions = .AlarmActions
    | del(.AlarmArn, .AlarmConfigurationUpdatedTimestamp, .StateValue, .StateReason,
          .StateReasonData, .StateUpdatedTimestamp, .StateTransitionedTimestamp, .EvaluationState)')
  aws cloudwatch put-metric-alarm --cli-input-json "$patched" && echo "Wired OKActions for $name"
done
Quick quiz
You've just wired OKActions to mirror AlarmActions on 43 alarms in a busy account. Within a day, on-call is complaining that two flappy alarms are now paging in pairs every few minutes. What's the right next move?
Keep learning
Dig deeper into CloudWatch alarm semantics and notification design.
- CloudWatch alarm actions documentation: official reference for AlarmActions, OKActions, and InsufficientDataActions on metric alarms.
- CloudWatch composite alarms: combine child alarms into a single logical alarm to suppress correlated-noise notification storms.
- Evaluating CloudWatch alarms (M-of-N evaluation periods): exactly how EvaluationPeriods and DatapointsToAlarm interact; essential reading before tuning.
- Google SRE Book, "Monitoring Distributed Systems": the reference text on signal vs noise in alerting, and why recovery confirmation matters.
You've completed Wire OK actions on CloudWatch alarms. You can now audit an account for asymmetric notifications, wire OK actions in bulk via put-metric-alarm, suppress flap noise with M-of-N evaluation windows, and graduate correlated alarms to composite alarms for single-notification semantics. The next time on-call wakes up to a 3am page, they'll know — without checking a dashboard — exactly when the system recovered.