Monitoring

Wire OK actions on CloudWatch alarms

If you page on alarm, page on recovery too — otherwise on-call wonders whether it's still broken.

11 min·10 sections·AWS

Last reviewed 27 May 2026

Asymmetric notifications: the basics

What does it mean for a CloudWatch alarm to be missing an OK action?

A CloudWatch alarm has three notification hooks, one for each state it can enter: AlarmActions fires when the alarm enters ALARM, OKActions fires when it returns to OK, and InsufficientDataActions fires when it enters INSUFFICIENT_DATA, typically because the metric has stopped reporting. Each is a list of SNS topic ARNs (or Auto Scaling, EC2, or Lambda action ARNs) that the alarm publishes to on the matching transition.

Most alarms are created with AlarmActions populated and OKActions empty. The team wires up paging for the bad state and never thinks about the recovery transition. Functionally the alarm still works — it goes red when the metric breaches and green when the metric recovers — but only the red transition produces a notification anywhere humans look.

Detective controls flag this asymmetry under check ALM-004 ("Alarms Without OK Actions"). The control fires on any alarm where len(AlarmActions) > 0 && len(OKActions) == 0. Severity is MEDIUM rather than HIGH because the system still self-recovers — but the on-call experience around it is materially worse than it needs to be.

In this lesson you'll learn why missing OK actions cost on-call time on every incident, how to wire them safely without creating notification noise, and when to graduate to composite alarms for logically-grouped state. You'll see the bulk audit query that surfaces every offending alarm in an account and the exact put-metric-alarm shape to apply the fix at scale.

Fun fact

The "is it still broken?" tax

Google's SRE book quotes an internal study finding that on-call engineers spend roughly 8-12 minutes per ALARM-only incident just confirming the system has actually recovered — clicking through dashboards, running ad-hoc queries, refreshing graphs. Multiply that by the number of alarms in a busy week and you've burned half an engineer on a problem that one extra SNS publish per alarm would eliminate. The fix is a one-line patch; the wasted minutes are real.

Wiring OK actions in action

Marco is the on-call for a SaaS platform. At 3:42am his pager fires: "SevenC3 SES >0 bounces in 24h" — an alarm tied to an SES bounce-rate metric. He logs in, sees the bounce spike, identifies a misconfigured campaign, and stops the send. The metric flattens within minutes.

Then he waits. The alarm has AlarmActions pointing at PagerDuty, but OKActions is empty — nothing tells him the alarm has returned to OK. He refreshes the CloudWatch console twice, queries describe-alarms from the CLI, and finally satisfies himself that the state has cleared. Total wasted time: 11 minutes, at 3:53am, before he can go back to bed.

The next morning he audits the account. 43 alarms have AlarmActions set and OKActions empty. He fixes them all in a single batch script before lunch.

First, look at the offending alarm to confirm the asymmetry — AlarmActions populated, OKActions empty.

$ aws cloudwatch describe-alarms --alarm-names "SevenC3 SES >0 bounces in 24h" --query 'MetricAlarms[0].{Name:AlarmName,Alarm:AlarmActions,OK:OKActions,InsufficientData:InsufficientDataActions}'

{
  "Name": "SevenC3 SES >0 bounces in 24h",
  "Alarm": [
    "arn:aws:sns:us-east-1:123456789012:oncall-pagerduty"
  ],
  "OK": [],
  "InsufficientData": []
}

# Pages on bad. Silent on recovery. Classic asymmetric notification.

The alarm pages PagerDuty when SES bounces spike, but nothing notifies on return-to-normal.

Find every alarm in the account with the same problem — AlarmActions set, OKActions empty. JMESPath does the filtering inline.

$ aws cloudwatch describe-alarms --query 'MetricAlarms[?length(OKActions)==`0` && length(AlarmActions)>`0`].[AlarmName,AlarmActions[0]]' --output table

┌───────────────────────────────┬─────────────────────────────────────────────────────┐
│ AlarmName                     │ AlarmAction                                         │
├───────────────────────────────┼─────────────────────────────────────────────────────┤
│ SevenC3 SES >0 bounces in 24h │ arn:aws:sns:us-east-1:123456789012:oncall-pagerduty │
│ prod-rds-cpu-high             │ arn:aws:sns:us-east-1:123456789012:oncall-pagerduty │
│ efs-mount-failures            │ arn:aws:sns:us-east-1:123456789012:oncall-pagerduty │
│ alb-5xx-rate                  │ arn:aws:sns:us-east-1:123456789012:oncall-pagerduty │
│ ... 39 more rows              │                                                     │
└───────────────────────────────┴─────────────────────────────────────────────────────┘

# 43 alarms across the account, all silent on recovery.

Bulk inventory of every alarm missing an OKActions wiring.
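Because CloudWatch alarms are region-scoped, the inventory above only covers the CLI's current region. A minimal sketch of repeating it everywhere, assuming the caller is allowed to run ec2 describe-regions and wants every enabled region checked; list_asymmetric_alarms is an illustrative helper name, not part of the AWS CLI:

```shell
# Print "region<TAB>alarm name" for every alarm that has AlarmActions set
# but OKActions empty, across all regions enabled for the account.
# list_asymmetric_alarms is a hypothetical helper name.
list_asymmetric_alarms() (
  filter='MetricAlarms[?length(OKActions)==`0` && length(AlarmActions)>`0`].AlarmName'
  for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
    aws cloudwatch describe-alarms --region "$region" \
      --query "$filter" --output text | tr '\t' '\n' |
      while IFS= read -r name; do
        # Skip the empty line produced when a region has no matches.
        if [ -n "$name" ]; then
          printf '%s\t%s\n' "$region" "$name"
        fi
      done
  done
)
```

Calling list_asymmetric_alarms with no arguments prints one line per offending alarm, which can be piped to sort or wc -l for a per-region count.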

How CloudWatch alarm state transitions actually fire (deep dive)

A CloudWatch alarm evaluates its metric once per period (1 minute, 5 minutes, etc.). Each evaluation produces an internal state — OK, ALARM, or INSUFFICIENT_DATA. The alarm fires its actions only on a state transition, not on every evaluation in a steady state. A metric stuck above the threshold for an hour produces exactly one AlarmActions publish, at the moment of transition; the return to OK produces exactly one OKActions publish, at the moment the metric drops back below for the configured number of evaluation periods.

The transition logic uses EvaluationPeriods (N) and DatapointsToAlarm (M) to filter noise. With EvaluationPeriods=3 and DatapointsToAlarm=2, the alarm transitions to ALARM once 2 of the 3 most recent periods are above threshold. The same window governs the way back: the alarm returns to OK once fewer than M of the last N periods are in breach, so with 2-of-3 it needs at least two non-breaching periods in the window. Widen the window and you trade reactivity for noise suppression in both directions.

Behind the scenes, when CloudWatch detects a state transition it publishes a JSON event to each ARN in the matching action list. The payload contains OldStateValue, NewStateValue, NewStateReason, and a Trigger block describing the metric and threshold behind the transition. SNS fans it out to subscribers; PagerDuty, Slack, or Lambda pick it up from there. There's no structural difference between an AlarmActions publish and an OKActions publish: same topic shape, same event format, and only the state values differ on the receiving side.

# The full state-transition payload an alarm publishes to its OK action.
# Receivers tell red/green apart by reading NewStateValue.

{
  "AlarmName": "SevenC3 SES >0 bounces in 24h",
  "AlarmDescription": "Alerts when SES bounce rate exceeds 0 in 24h",
  "OldStateValue": "ALARM",
  "NewStateValue": "OK",
  "NewStateReason": "Threshold Crossed: 1 datapoint [0.0] was not greater than 0.0",
  "StateChangeTime": "2026-05-14T03:53:21.482+0000",
  "Region": "us-east-1",
  "Trigger": {
    "MetricName": "Bounce",
    "Namespace": "AWS/SES",
    "Threshold": 0.0,
    "ComparisonOperator": "GreaterThanThreshold"
  }
}
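The transition-only behaviour and the M-of-N window described above can be sketched as a small simulation. This is a simplified model, not CloudWatch's evaluator: it assumes integer datapoints, a GreaterThanThreshold comparison, an initial state of OK, no missing data, and that the alarm is in ALARM whenever at least M of the last N datapoints breach. simulate_alarm is an illustrative name.

```shell
# Simulate M-of-N evaluation for a GreaterThanThreshold alarm on integer
# datapoints and print one line per state transition, which is the only
# time an action list is published. simulate_alarm is an illustrative
# helper, not a CloudWatch API.
# Usage: simulate_alarm M N THRESHOLD datapoint...
simulate_alarm() (
  m=$1; n=$2; threshold=$3
  shift 3
  state=OK; i=0; window=""; count=0
  for point in "$@"; do
    i=$((i + 1))
    # Append the new datapoint and keep only the N most recent.
    window="${window:+$window }$point"
    count=$((count + 1))
    if [ "$count" -gt "$n" ]; then
      window=${window#* }
      count=$n
    fi
    # Count the datapoints in the window that breach the threshold.
    breaching=0
    for p in $window; do
      if [ "$p" -gt "$threshold" ]; then breaching=$((breaching + 1)); fi
    done
    if [ "$breaching" -ge "$m" ]; then next=ALARM; else next=OK; fi
    # Publish only when the state changes, never in a steady state.
    if [ "$next" != "$state" ]; then
      if [ "$next" = ALARM ]; then actions=AlarmActions; else actions=OKActions; fi
      echo "period $i: $state -> $next (publish to $actions)"
      state=$next
    fi
  done
)
```

Running simulate_alarm 2 3 0 0 5 0 7 9 0 0 0 prints a transition to ALARM at period 4 and a return to OK at period 7. The lone spike at period 2 never pages, and the sustained breach produces a single ALARM line rather than one per period.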

What's the impact of leaving OK actions empty?

The most direct cost is on-call time per incident. Without an OK notification, the engineer has to manually verify recovery — refresh the dashboard, run describe-alarms, or stare at the metric chart. Eight to twelve minutes per incident is the typical loss; across a team handling a few alarms a week this is non-trivial engineering capacity gone, every quarter, with nothing to show for it.

The second-order cost is recovery-confidence drift. When on-call can't tell whether the issue has cleared, they keep mitigations in place longer than necessary — leaving traffic shifted to a backup region, leaving a feature flag off, leaving a scale-up active. The cost of the mitigation runs on the clock until someone proactively checks the alarm has gone green. Self-healing systems quietly stop being self-healing because nobody trusts the recovery signal.

The third-order cost is alarm-fatigue cynicism. Engineers learn that the alarm only ever tells them about the bad transition, so they stop treating the alarm as a state machine and start treating it as a one-shot event. When a flapping condition fires three pages in an hour, nobody pieces together that the system recovered twice in between — because there was no notification of the recoveries. The signal degrades the team's mental model of the underlying system.

And the regulatory edge case: ISO 27001 and SOC 2 controls around incident management increasingly expect both detection and confirmed resolution to be auditable. "Alarm fired at 3:42am, mitigation applied at 3:48am, alarm cleared at 3:51am" is a clean audit trail only if all three timestamps come from the monitoring system. Without an OK action publish, the resolution time is a human estimate at best.

How do you wire OK actions safely across a fleet?

Wiring OK actions is a four-step loop: inventory the gap, apply the fix in bulk, tune the M-of-N window to suppress flap noise, and graduate the noisier groups to composite alarms. Each step is cheap, but skipping the noise-control steps creates a different problem (alert spam) that tempts teams to disable the OK action altogether.

1. Inventory every alarm missing OKActions

Run describe-alarms with the JMESPath filter MetricAlarms[?length(OKActions)==`0` && length(AlarmActions)>`0`] to surface every alarm where the asymmetry exists. Capture the alarm name, the existing AlarmActions, and the SNS topic ARN; in the simple case you'll reuse the same topic for OKActions. Multi-region accounts need this run once per region, because CloudWatch alarms are region-scoped.

2. Apply OKActions in bulk via put-metric-alarm

put-metric-alarm is an upsert — re-applying it with the same name overwrites the existing alarm. Iterate the inventory list, fetch the full alarm definition, mutate the OKActions field to mirror AlarmActions, and re-put. Wrap this in a dry-run pass that prints the diffs before applying. Test on one non-prod alarm first to confirm the notification renders cleanly on the receiver side (PagerDuty incident auto-resolution, Slack "recovered" emoji, etc.).

3. Tune EvaluationPeriods to suppress flap noise

Once OK actions are wired, a flapping metric produces paired ALARM/OK notifications on every flap, which can be a worse notification storm than the one you started with. Tune EvaluationPeriods and DatapointsToAlarm so that a single stray data point cannot flip the state in either direction. A typical safe starting point for 1-minute metrics is EvaluationPeriods=3, DatapointsToAlarm=2: the alarm then needs two breaching data points out of the last three to fire, and two non-breaching ones to clear.

4. Graduate noisy clusters to composite alarms

When a logical incident (say, a region-wide RDS hiccup) trips five related alarms in 30 seconds, you get five ALARM pages and later five OK pages — noise the M-of-N tuning can't fix. Composite alarms (aws cloudwatch put-composite-alarm) wrap a boolean expression over child alarms (ALARM(rds-cpu) OR ALARM(rds-iops) OR ALARM(rds-conn)) and fire a single state transition for the logical group. Wire the SNS topic to the composite, leave the children silent, and you get one notification per logical incident — with one matching recovery.
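Step 4 above can be sketched as follows. This is a hedged example rather than a prescribed layout: build_any_rule and put_group_alarm are hypothetical helper names, the composite name in the usage note is made up, and the sketch assumes child alarm names contain no double quotes. Removing the children's own actions is a separate step not shown here.

```shell
# Build a composite-alarm rule that is in ALARM when any child alarm is.
# build_any_rule and put_group_alarm are hypothetical helper names.
build_any_rule() (
  rule=""
  for child in "$@"; do
    if [ -z "$rule" ]; then
      rule="ALARM(\"$child\")"
    else
      rule="$rule OR ALARM(\"$child\")"
    fi
  done
  printf '%s\n' "$rule"
)

# Create or update the composite and wire the same topic to both the
# ALARM and OK transitions, so each logical incident pages once and
# resolves once.
# Usage: put_group_alarm COMPOSITE_NAME SNS_TOPIC_ARN child-alarm...
put_group_alarm() (
  name=$1; topic=$2
  shift 2
  aws cloudwatch put-composite-alarm \
    --alarm-name "$name" \
    --alarm-rule "$(build_any_rule "$@")" \
    --alarm-actions "$topic" \
    --ok-actions "$topic"
)
```

For the RDS example, a call might look like put_group_alarm rds-group arn:aws:sns:us-east-1:123456789012:oncall-pagerduty rds-cpu rds-iops rds-conn, where rds-group is an invented name for the composite.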

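# Bulk-fix every alarm missing OKActions in the current region.
# Requires the AWS CLI and jq. Try it on one non-prod alarm first.
aws cloudwatch describe-alarms \
  --query 'MetricAlarms[?length(OKActions)==`0` && length(AlarmActions)>`0`].AlarmName' \
  --output text | tr '\t' '\n' | while IFS= read -r name; do
  # Fetch the full definition so the re-put preserves every other setting.
  alarm=$(aws cloudwatch describe-alarms --alarm-names "$name" --query 'MetricAlarms[0]')
  # Mirror AlarmActions into OKActions and drop the read-only fields that
  # describe-alarms returns but put-metric-alarm does not accept as input
  # (extend the list if your CLI version returns others).
  fixed=$(echo "$alarm" | jq '.OKActions = .AlarmActions
    | del(.AlarmArn, .AlarmConfigurationUpdatedTimestamp, .StateValue, .StateReason,
          .StateReasonData, .StateUpdatedTimestamp, .StateTransitionedTimestamp)')
  aws cloudwatch put-metric-alarm --cli-input-json "$fixed" \
    && echo "Wired OKActions for $name"
done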
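Step 2 recommends a dry-run pass before anything is applied. A minimal sketch of one, assuming a plain list of intended changes is enough of a diff for review; preview_ok_actions is a hypothetical helper name, and it only reads alarm definitions, never writes them:

```shell
# Dry run: list the OKActions change each alarm would receive, without
# calling put-metric-alarm. preview_ok_actions is a hypothetical helper name.
preview_ok_actions() (
  aws cloudwatch describe-alarms \
    --query 'MetricAlarms[?length(OKActions)==`0` && length(AlarmActions)>`0`].[AlarmName, to_string(AlarmActions)]' \
    --output text |
    # Split on tabs only, so alarm names containing spaces stay intact.
    while IFS="$(printf '\t')" read -r name actions; do
      if [ -n "$name" ]; then
        printf 'WOULD SET %s: OKActions [] -> %s\n' "$name" "$actions"
      fi
    done
)
```

Reviewing this output against the inventory count is a cheap way to confirm the filter matches what you expect before the bulk fix touches any alarm.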


You've completed Wire OK actions on CloudWatch alarms. You can now audit an account for asymmetric notifications, wire OK actions in bulk via put-metric-alarm, suppress flap noise with M-of-N evaluation windows, and graduate correlated alarms to composite alarms for single-notification semantics. The next time on-call wakes up to a 3am page, they'll know — without checking a dashboard — exactly when the system recovered.


Asymmetric alarm notifications: what it costs in engineer-hours

Every alarm that pages on failure but stays silent on recovery adds a manual verification step to every incident

A CloudWatch alarm has two critical state transitions: into ALARM, when a metric breaches its threshold, and back to OK, when the metric recovers. Most alarms in AWS accounts are configured to send a notification (usually a page or a Slack message) only on the ALARM transition. When the metric recovers, nothing is sent. The control ALM-004 flags every alarm where this asymmetry exists.

The unit economics are straightforward. Every incident handled by an on-call engineer requires a manual verification step if there is no OK notification — they must open a dashboard, run a CLI query, or refresh a console view to confirm the system has returned to normal. Studies of on-call patterns put this at roughly 8–12 minutes per incident. At a burdened engineering rate, a team handling several incidents per week can easily spend tens of thousands of dollars annually confirming recoveries that a single SNS publish per alarm would deliver automatically.

The fix costs essentially nothing in cloud spend — wiring an OK action reuses the same SNS topic already attached to the alarm, so there is no incremental per-message cost worth modeling. The control is about labour efficiency: the dollar impact is entirely in the on-call time saved, and that return starts accumulating on the first incident after the fix is applied.

This lesson is for the finance partner who wants to understand why a monitoring configuration detail translates into recurring engineering cost. It covers the labour-unit economics of on-call recovery verification, why the fix has near-zero cloud spend impact, how to frame the conversation around wasted engineer-hours rather than finding counts, and the one governance lever — a standing OK-actions audit as part of the monitoring hygiene review — that keeps the cost from accruing silently going forward. No CLI knowledge required.


How a finance partner quantifies the OK-actions gap

Dana is reviewing the quarterly engineering cost report and notices a line item: on-call support hours are running 15% above budget. She pulls the incident log and sees a pattern — a large share of incidents show a gap of 10–20 minutes between the engineer's mitigation action and the ticket close. The notes are consistent: 'confirmed metric recovered manually.'

She flags the ALM-004 finding count for the same period: 43 alarms with no OK actions across the production account. The math is simple — if each incident costs roughly 10 minutes of senior engineer time for manual recovery confirmation, and the account handles roughly 30 incidents per month touching these alarms, that's 5 hours of burdened engineer cost per month going to a task that a zero-cost configuration change would eliminate.

Dana brings the number to the next engineering review as a concrete ask: fix the 43 alarms, re-run the audit monthly. The cloud cost impact is negligible — SNS publishes are priced in fractions of a cent. The labour saving is immediate and recurring. It shows up in the next quarterly comparison as a reduction in on-call support hours, and the root cause is one line in a bulk fix script.

What missing OK actions cost on the P&L

The most direct line item is on-call labour: every incident that ends without an automated recovery notification requires the on-call engineer to verify recovery manually. At 8–12 minutes per incident, an account with 30 monthly incidents touching alarms with no OK actions loses 4–6 hours a month (30 × 8–12 minutes), or roughly 50–70 hours per year, to a task that a configuration change would eliminate entirely. Priced at a burdened senior engineer rate, that's a recurring cost with no corresponding value delivered.

The second cost is mitigation over-run. When on-call lacks a recovery signal, they leave mitigations active longer than necessary — a scale-up running an extra 30–60 minutes waiting for confirmation, a feature flag kept off through peak traffic, a failover left in place until someone manually checks. Each of these has a direct cloud cost or a revenue impact. The compounding effect across a year of incidents adds up to a number that rarely appears on a cost report but is fully attributable to the missing notification.

The third cost is audit exposure. Accurate incident close times require an automated record of when the system recovered. Without OK action timestamps, resolution times are human estimates that don't survive a rigorous SOC 2 or ISO 27001 review. If the organization is working toward or maintaining a certification, retroactively reconstructing resolution times from log data is expensive and often inconclusive. The cost of remediation after an audit finding is typically far higher than the cost of wiring the OK actions before it.

How finance drives OK-actions remediation as a spend-efficiency initiative

Finance can't run put-metric-alarm, but it can frame the remediation as a unit-economics problem and hold the recurring audit that prevents regression. Four levers that keep both the labour cost and the configuration gap visible.

1. Put on-call verification time on the cost report

Ask engineering for the count of incidents per month and the average manual recovery-verification time (typically 8–12 minutes per incident for accounts missing OK actions). Multiply by the burdened engineer rate and present the annual total as a line item. A concrete dollar figure converts an abstract configuration gap into an approved remediation item faster than a finding count does.

2. Track ALM-004 failures as a labour-efficiency metric

Include the ALM-004 finding count in the monthly cloud operations review alongside the on-call labour line. The goal metric is not zero findings (some alarms genuinely don't need OK notifications) but zero alarms where the decision is undocumented. Every alarm with AlarmActions set and OKActions empty should have either an OKActions entry or a recorded reason why it was omitted.

3. Require a post-fix comparison on incident labour

After the bulk remediation, ask for a before/after comparison of on-call hours at the next quarterly review. The reduction in manual verification time should be visible in the incident log. Closing the loop with data makes the case for funding similar monitoring hygiene work in the future and converts the remediation from a one-time fix to part of a repeatable cost-control pattern.

4. Build the audit into the monitoring hygiene cadence

Work with engineering to add the ALM-004 check to whatever periodic review already governs alarm quality — alarm noise audits, PagerDuty escalation reviews, or CloudWatch cost reviews. Scheduling it means the gap doesn't silently re-accumulate as new alarms are created without OK actions, and finance has a standing line of sight to the metric without waiting for a compliance report to surface it.


You've finished the finance partner's view of wiring OK actions. You know how to translate the ALM-004 finding into a labour-efficiency cost argument, why the cloud spend impact is negligible, how to frame the before/after comparison for on-call hours at the quarterly review, and the four levers — on-call cost reporting, ALM-004 as a labour metric, post-fix comparison, and recurring audit — that keep the saving visible and the gap from silently returning. Next time this comes up, you'll have a dollar figure ready before the engineering team finishes explaining the fix.


One-way alarms: a governance gap in incident management

Alerting that detects failure but never confirms recovery leaves on-call teams managing incidents by instinct rather than signal

Every CloudWatch alarm in an AWS account can send notifications in two directions: when a problem starts, and when it ends. Most accounts have wired the first and skipped the second. The result is an on-call team that gets paged when something breaks but receives no automated confirmation that it has recovered — they must go looking.

This is a process maturity issue. A monitoring system that only reports the start of an incident forces manual recovery verification on every on-call engineer. That adds latency to the incident close, introduces the risk that mitigations are left active longer than necessary, and creates a gap in the audit trail — the time of recovery becomes an estimate rather than a recorded event. For organizations subject to ISO 27001 or SOC 2, both the detection and the confirmed resolution of an incident are expected to be auditable; missing OK actions create a structural gap in that record.

The finding (ALM-004) is rated MEDIUM, not because the impact is minor, but because the system still self-recovers — the gap is in the notification layer, not the recovery itself. The governance question is whether incident management is operating by policy or by accident: a complete alarm configuration is a deliberate choice, and an incomplete one means the process depends on engineers filling in the gap manually every time.

A concise read for the executive who wants to understand what this control protects and what organizational outcome it drives. You'll learn why one-way alarms are a process maturity gap, why the fix matters for incident audit trails (not just on-call experience), what the healthy end state looks like — symmetric notifications by policy, not by chance — and the single leadership question that confirms the organization is there. No implementation detail.


What it looks like when the org closes the recovery-notification gap

A VP of Engineering at a mid-size SaaS company was reviewing a post-incident report after a payment processing degradation. The incident had been resolved within 12 minutes of detection — good. But the report's timeline showed that the on-call engineer had spent an additional 14 minutes after mitigation manually verifying the system had recovered before closing the ticket, because no automatic recovery notification came through.

She asked a simple question: 'Why does our monitoring system tell us when things break but not when they're fixed?' The answer was that 43 alarms had AlarmActions wired but OKActions empty, a gap that took 20 minutes to fix in bulk. After the change, subsequent incident reports showed clean, automated close times: the recovery signal sat in the audit trail alongside the alert, with no manual verification step and no gap.

The lesson she took wasn't technical — it was organizational. A monitoring system that only reports failures but never confirms recoveries isn't a complete incident management system. Closing that gap is a policy decision as much as a configuration one: symmetric notifications should be the standard, enforced at the point where alarms are created, not patched after the fact when someone notices the trail going cold.

Why missing recovery notifications are an organizational risk, not just a technical gap

One-way alarms degrade an organization's incident management process in ways that compound over time. The immediate effect is that on-call engineers can't close incidents cleanly — they verify recovery manually, introducing both delay and the risk of a premature close if they're wrong. Over months, the team internalizes the alarm as a one-shot event rather than a state machine, which erodes their ability to reason about system stability from monitoring data alone.

The governance exposure is that incident audit trails become incomplete by construction. ISO 27001 and SOC 2 both treat detection and confirmed resolution as distinct auditable events. An account where OK actions are missing systematically produces audit trails with automated detection timestamps but estimated resolution times — a gap that becomes a finding during certification reviews and, worse, a gap in the post-incident record when an outage becomes a customer escalation.

The organizational signal is whether monitoring is governed by policy or by accident. A team that has reviewed its alarms and deliberately chosen symmetric notifications — ALARM and OK both wired — has made a process decision. A team where OK actions are missing on 40% of its alarms has an inherited default. The difference matters: the first team has a monitoring standard it can point to; the second has a gap it learns about at the worst possible time.

The leadership move on missing OK actions

The executive handle isn't to mandate specific SNS topic ARNs — it's to establish that symmetric alarm notification is the organizational standard and that exceptions require a recorded rationale.

1. Set a standard: alarms that page must also confirm recovery

Make it policy that any alarm wired to a paging or alerting channel must also carry an OK action to the same or equivalent channel. This removes the ambiguity — the question isn't "should this alarm have an OK action?" but "why doesn't this one?" A clear default reduces the engineering decision surface and means new alarms ship correctly configured rather than requiring a retroactive audit.

2. Treat undocumented exceptions as policy violations

Alarms where OK actions are intentionally omitted — for example, alarms on one-way state changes like deployment triggers — are legitimate exceptions, but they should be documented with a reason. An alarm that silently fails the ALM-004 check with no recorded rationale is a gap, not a deliberate choice. The audit trail requirement is the same as for any other exception to a stated security or operations policy.

3. Ask for the trend, not the total

At the quarterly operations review, the one question worth asking is: 'Is the share of new alarms shipping with OK actions configured going up?' A declining new-alarm gap means the standard is being absorbed into how the team works. A flat or rising gap means the policy isn't being enforced at the point where alarms are created — which is a process problem, not a remediation one.


That's the lesson. Two takeaways: an alarm that pages on failure but never confirms recovery is an incomplete incident management process, and the fix is a policy decision as much as a configuration one — symmetric notifications by design, not by accident. The leadership question is whether every alarm that pages your team also notifies on recovery, and whether every exception to that standard is recorded. That's a one-question audit that takes thirty seconds and tells you whether your monitoring is governed or inherited.


Part of the learning path Get your alarms right
