Monitoring

Fix INSUFFICIENT_DATA alarms

An alarm in INSUFFICIENT_DATA isn't quiet — it's blind. The metric stopped reporting, often because the resource is gone or the dimensions are wrong.

12 min · 10 sections · AWS

Last reviewed 27 May 2026

INSUFFICIENT_DATA alarms: the basics

What does INSUFFICIENT_DATA actually mean?

CloudWatch alarms have three states, not two. OK means the metric is reporting and within thresholds. ALARM means the metric is reporting and the threshold has been breached. INSUFFICIENT_DATA means CloudWatch isn't receiving enough datapoints in the recent evaluation window to make a determination either way — the alarm has nothing to alarm on.
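To see how an account splits across the three states, a tally like the following can help. This is a minimal sketch: the function names count_alarm_states and list_alarm_states are hypothetical, and list_alarm_states assumes the AWS CLI is installed with credentials that allow cloudwatch:DescribeAlarms.

```shell
# Tally alarm states read from stdin (one state per line) as STATE=count.
count_alarm_states() {
  sort | uniq -c | awk '{ printf "%s=%s\n", $2, $1 }'
}

# Print the state of every metric alarm, one per line.
# Assumes AWS credentials are configured; nothing runs until you call it.
list_alarm_states() {
  aws cloudwatch describe-alarms \
    --query 'MetricAlarms[].StateValue' --output text | tr '\t' '\n'
}
```

Running list_alarm_states | count_alarm_states prints one line per state. A large INSUFFICIENT_DATA count next to a small ALARM count is the pattern this lesson is about.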

On a dashboard an INSUFFICIENT_DATA alarm often looks fine. It's not red. It hasn't paged anyone. The graph next to it is blank but in a quiet way. That's exactly what makes it dangerous: a disk-full alarm sitting in INSUFFICIENT_DATA for six months is not protecting you from a disk filling up. It's protecting you from nothing.

AWS Trusted Advisor and the FinOps Dashboard's ALM-002 check flag this pattern because it's almost always a sign that something has drifted — the resource was deleted, the CloudWatch agent stopped reporting, an instance was replaced and the alarm still points at the dead instance ID, or the alarm was created against a metric namespace that never existed. The alarm is configured. The alarm is enabled. The alarm is blind.

In this lesson you'll learn the three-state CloudWatch alarm model, why INSUFFICIENT_DATA is a real failure mode rather than a quiet success, the four most common reasons alarms drift into it, and the diagnostic flow to decide whether to delete the alarm, repoint it, or fix the underlying agent. You'll see real AWS CLI investigation and the exact calls to clean up a fleet of stale alarms.

Fun fact

The alarm that outlived the instance

In a 2023 retro from a UK retailer, the post-mortem showed a critical disk-space alarm had been in INSUFFICIENT_DATA for 412 days before anyone noticed. The instance it monitored had been replaced as part of a routine AMI refresh — the new instance got a new ID, the alarm kept pointing at the old one, and CloudWatch dutifully reported "no data" every minute for over a year. The disk that eventually filled up was on the replacement instance. There was no alarm watching it.

Cleaning up INSUFFICIENT_DATA alarms in action

Marco runs the SRE rotation at a retail platform. A FinOps Dashboard scan flags ALM-002 with eight INSUFFICIENT_DATA alarms across the production account. One of them — "CRITICAL EC2 Sainsburys Web Server disk used % high" — is marked HIGH severity because the name suggests it's meant to catch a serious failure mode.

He doesn't know whether the alarm is broken, the instance is gone, or the CloudWatch agent has crashed. All three look identical from the alarm's perspective: no datapoints in the evaluation window. So he starts where the alarm starts — describe-alarms, filtered to just the broken ones.

The output tells him the alarm's metric, namespace, and dimensions. From there he can open the metric in the console and see immediately whether the resource it points at still exists.

First, list every alarm currently in INSUFFICIENT_DATA. The state-value filter is the fast way to scope the search.

$ aws cloudwatch describe-alarms --state-value INSUFFICIENT_DATA --query 'MetricAlarms[*].{Name:AlarmName,Metric:MetricName,Namespace:Namespace,Dims:Dimensions}' --output table

┌─────────────────────────────────────────────────────┬──────────────────┬───────────┬───────────────────────────────────┐
│ Name                                                │ Metric           │ Namespace │ Dims                              │
├─────────────────────────────────────────────────────┼──────────────────┼───────────┼───────────────────────────────────┤
│ CRITICAL EC2 Sainsburys Web Server disk used % high │ DiskUsage        │ CWAgent   │ InstanceId=i-0c4f2a9b1e8d7c3f6    │
│ prod-rds-cpu-high                                   │ CPUUtilization   │ AWS/RDS   │ DBInstanceIdentifier=prod-db-2022 │
│ checkout-worker-memory                              │ mem_used_pct     │ CWAgent   │ InstanceId=i-08fb9a32c1d4e7a09    │
│ legacy-elb-5xx-rate                                 │ HTTPCode_ELB_5XX │ AWS/ELB   │ LoadBalancerName=old-prod-elb     │
└─────────────────────────────────────────────────────┴──────────────────┴───────────┴───────────────────────────────────┘

# Four of the eight flagged alarms shown, each a different shape of broken: instance ID, RDS name, agent metric, deleted ELB.

All alarms in INSUFFICIENT_DATA across the account.

Now confirm whether the underlying resource still exists. If the instance is gone, the alarm has been blind since the terminate call.

$ aws ec2 describe-instances --instance-ids i-0c4f2a9b1e8d7c3f6 --query 'Reservations[].Instances[].{State:State.Name,Launched:LaunchTime}' --output table

An error occurred (InvalidInstanceID.NotFound) when calling the DescribeInstances operation: The instance ID 'i-0c4f2a9b1e8d7c3f6' does not exist

# The instance was terminated. The alarm has been watching nothing for however long that's been true.

The InstanceId in the alarm's dimensions doesn't exist any more — the resource is gone.

How CloudWatch decides an alarm is INSUFFICIENT_DATA (deep dive)

A CloudWatch alarm evaluates a metric on a fixed schedule defined by its Period and EvaluationPeriods. A typical disk-full alarm watches 5-minute datapoints over 3 consecutive periods. Each evaluation cycle CloudWatch looks at the most recent N periods and decides: does the metric breach the threshold (ALARM), is it within bounds (OK), or are there simply not enough datapoints to decide (INSUFFICIENT_DATA)?

What "not enough datapoints" means is governed by TreatMissingData. There are four options: missing (the default — a missing datapoint counts as missing, and if all evaluation periods are missing the alarm goes INSUFFICIENT_DATA), notBreaching (a missing datapoint counts as good, alarm stays OK), breaching (a missing datapoint counts as bad, alarm goes ALARM), and ignore (missing datapoints don't change the alarm state at all, so it stays in whatever state it last had). Most alarms in the wild use the default, which is why "the metric stopped reporting" almost always presents as INSUFFICIENT_DATA rather than ALARM.
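The four TreatMissingData options can be sketched as a small lookup. This is a deliberately simplified model: it covers only the case where every datapoint in the evaluation window is missing, and the function name resolve_all_missing is hypothetical. CloudWatch's handling of partially missing windows is more involved and is not modelled here.

```shell
# Resulting alarm state when ALL datapoints in the evaluation window are missing.
# Usage: resolve_all_missing <TreatMissingData> <previous-state>
resolve_all_missing() {
  local treat="${1:-missing}" previous="${2:-OK}"
  case "$treat" in
    missing)      echo "INSUFFICIENT_DATA" ;;  # the default
    notBreaching) echo "OK" ;;                 # missing counts as good
    breaching)    echo "ALARM" ;;              # missing counts as bad
    ignore)       echo "$previous" ;;          # state is left unchanged
    *) echo "unknown TreatMissingData: $treat" >&2; return 1 ;;
  esac
}
```

For a disk alarm on an agent metric, comparing resolve_all_missing missing OK with resolve_all_missing breaching OK shows the trade-off: the default goes blind when the agent dies, while breaching turns silence into a page.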

Alarms are decoupled from the resources they monitor. There's no foreign key from an EC2 instance to its alarms — if you terminate the instance, the alarms survive. They just stop receiving data. This decoupling is what makes ASGs especially problematic: ephemeral instances come and go with fresh IDs, but per-instance alarms keep pointing at instance IDs that no longer exist. The fix is to alarm on aggregate metrics at the ASG or target-group level, not on individual instance IDs.
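To find the alarms most exposed to this, one option is to list every alarm that carries an InstanceId dimension. This is a rough sketch: per_instance_alarms is a hypothetical helper name, and the filter only sees alarms built on a single metric, since metric-math alarms keep their dimensions in a different field.

```shell
# Print the name of every metric alarm that has an InstanceId dimension, one per line.
# Assumes AWS credentials are configured; nothing runs until you call it.
per_instance_alarms() {
  aws cloudwatch describe-alarms \
    --query "MetricAlarms[?Dimensions[?Name=='InstanceId']].AlarmName" \
    --output text | tr '\t' '\n'
}
```

Any name this prints for a workload that sits behind an ASG is a candidate for moving to an ASG-level or target-group-level dimension.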

# Look at the alarm's full config — TreatMissingData is the single field that decides how it handles a missing datapoint.
aws cloudwatch describe-alarms \
  --alarm-names "CRITICAL EC2 Sainsburys Web Server disk used % high" \
  --query 'MetricAlarms[0].{Metric:MetricName,Period:Period,EvalPeriods:EvaluationPeriods,Treat:TreatMissingData,Dims:Dimensions}'

# Spot the alarms that will silently stay quiet because TreatMissingData is set permissively.
aws cloudwatch describe-alarms \
  --query 'MetricAlarms[?TreatMissingData==`notBreaching` || TreatMissingData==`ignore`].AlarmName'

What is the impact of leaving alarms in INSUFFICIENT_DATA?

The direct impact is that the alarm isn't doing its job. A disk-full alarm in INSUFFICIENT_DATA will not page you when the disk fills up. A 5xx-rate alarm in INSUFFICIENT_DATA will not catch the spike that wakes up customer support. You're paying for the alarm ($0.10/month per metric alarm — pocket change individually, real money across thousands), but more importantly you're paying with a false sense of coverage.

The second-order impact is what those alarms typically protect against. Disk-full leads to writes failing, databases corrupting, queues backing up. 5xx-rate leads to customer-visible outages. CPU saturation leads to autoscaling thrash. Each of these has its own bill — incident hours, SLO credit refunds, customer churn — that dwarfs the cost of the alarm itself, and the alarm was the cheapest line of defence against it.

The third-order impact is on the team's signal-to-noise ratio. Once people learn that some alarms are broken, they stop trusting the dashboard. The next person on call sees "INSUFFICIENT_DATA" next to a critical alarm and shrugs, because that's been the state for months. By the time it actually breaks for a real reason, nobody is watching.

On the compliance side, frameworks like SOC 2 and ISO 27001 expect monitoring to be both configured and effective. An alarm that exists but doesn't fire is audit evidence of a control gap, not evidence of a working control. An auditor sampling your alarms will eventually ask why eight of them have been blind for over a year.

How do you clean up INSUFFICIENT_DATA alarms?

Cleanup is a four-step loop. The point isn't to make every alarm green — it's to make every alarm honest. An honest alarm either watches a live resource or stops existing.

1. Inventory every alarm in INSUFFICIENT_DATA

Use describe-alarms --state-value INSUFFICIENT_DATA to pull the full list, with metric, namespace, and dimensions. Don't trust the alarm name — names lie, dimensions don't. Group the output by what the dimensions point at: an InstanceId (probably terminated), an AutoScalingGroupName (probably fine, agent issue), an ELB name (likely renamed or deleted), a CWAgent custom metric (probably an agent problem).

2. Triage by opening the metric in CloudWatch

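For each alarm, query the metric directly: run aws cloudwatch get-metric-statistics with the alarm's exact dimensions over the last 24 hours. If that returns datapoints, the metric is alive and the problem lies in the alarm's own settings (period, statistic, or unit). If it returns nothing, run aws cloudwatch list-metrics for the same namespace and metric name. If the metric is reporting under different dimensions, the alarm's dimensions no longer match what is being published (typically an instance was replaced and the alarm wasn't repointed). If it is not reporting at all, the resource is gone or the agent is dead. Each case has a different fix.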

3. Delete, repoint, or fix the agent

If the resource is gone for good (terminated, deleted), delete the alarm. If the resource was replaced, update the alarm's dimensions to the new ID — or better, move it to a dimension that survives replacement (AutoScalingGroupName, TargetGroup, ClusterName). If the resource exists but the metric is empty, log in and check the CloudWatch agent — systemctl status amazon-cloudwatch-agent will usually tell you it crashed three weeks ago.

4. Prevent recurrence with Config and ASG-level alarms

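Enable the AWS Config managed rule cloudwatch-alarm-resource-check to flag resources of a given type (EC2 instances, for example) that have no alarm on a named metric. That catches the replacement instance nobody is watching. The rule does not detect the reverse case, an alarm whose resource no longer exists, so keep a scheduled describe-alarms --state-value INSUFFICIENT_DATA sweep for that. For any workload behind an ASG, alarm on the ASG-level aggregate (CPUUtilization on the AutoScalingGroupName dimension) instead of per-instance metrics, because these survive instance churn. For disk and memory metrics that need the agent, monitor agent health itself with a separate alarm on a metric the agent always publishes, with TreatMissingData set to breaching so that silence pages someone.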
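The grouping from step 1 can be sketched as a small classifier over the inventory output. It is a heuristic only: the function names likely_cause and triage_report are hypothetical, the labels simply restate the guesses from step 1, and only the first dimension of each alarm is inspected.

```shell
# Guess the likely cause from the alarm's namespace and first dimension name.
likely_cause() {
  local namespace="$1" dimension="$2"
  case "$dimension" in
    InstanceId)           echo "instance probably terminated or replaced" ;;
    AutoScalingGroupName) echo "group probably fine, suspect the agent" ;;
    LoadBalancerName|LoadBalancer) echo "load balancer likely renamed or deleted" ;;
    *)
      if [ "$namespace" = "CWAgent" ]; then
        echo "probably an agent problem"
      else
        echo "unclassified, inspect by hand"
      fi ;;
  esac
}

# Read tab-separated "name, namespace, dimension" lines on stdin and
# print "name <TAB> likely cause" for each alarm.
triage_report() {
  while IFS="$(printf '\t')" read -r name namespace dimension; do
    printf '%s\t%s\n' "$name" "$(likely_cause "$namespace" "$dimension")"
  done
}
```

With credentials configured, the input can come from aws cloudwatch describe-alarms --state-value INSUFFICIENT_DATA --query 'MetricAlarms[].[AlarmName,Namespace,Dimensions[0].Name]' --output text piped into triage_report. Treat the result as a starting order for investigation, not a diagnosis.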

# Resource is gone for good — delete the alarm.
aws cloudwatch delete-alarms \
  --alarm-names "CRITICAL EC2 Sainsburys Web Server disk used % high"

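# Resource was replaced: repoint the alarm, here at the ASG rather than at the new instance ID.
# put-metric-alarm overwrites the whole alarm definition, so pass every setting again.
# The agent must publish DiskUsage with exactly these dimensions, or the alarm stays blind.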
aws cloudwatch put-metric-alarm \
  --alarm-name "CRITICAL EC2 Sainsburys Web Server disk used % high" \
  --metric-name DiskUsage \
  --namespace CWAgent \
  --dimensions Name=AutoScalingGroupName,Value=sainsburys-web-asg Name=device,Value=xvda1 \
  --statistic Maximum --period 300 --evaluation-periods 2 \
  --threshold 85 --comparison-operator GreaterThanThreshold \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:ops-pager
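The triage in step 2 can be sketched with two queries and a verdict. Several things here are assumptions: the function names are hypothetical, datapoint_count relies on GNU date for the -d flag and handles a single-dimension alarm only, and the list-metrics check for other reporting dimension sets is an addition of this sketch.

```shell
# Count datapoints for the alarm's exact dimensions over the last 24 hours.
# Usage: datapoint_count <namespace> <metric> <dimension-name> <dimension-value>
datapoint_count() {
  aws cloudwatch get-metric-statistics \
    --namespace "$1" --metric-name "$2" \
    --dimensions "Name=$3,Value=$4" \
    --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 3600 --statistics SampleCount \
    --query 'length(Datapoints)' --output json
}

# Count how many dimension combinations report this metric at all.
# Usage: reporting_dimension_sets <namespace> <metric>
reporting_dimension_sets() {
  aws cloudwatch list-metrics \
    --namespace "$1" --metric-name "$2" \
    --query 'length(Metrics)' --output json
}

# Turn the two counts into a next action.
# Usage: triage_verdict <datapoints-for-alarm-dimensions> <dimension-sets-reporting>
triage_verdict() {
  if [ "$1" -gt 0 ]; then
    echo "metric is live: check the alarm's period, statistic and unit"
  elif [ "$2" -gt 0 ]; then
    echo "dimension mismatch: repoint the alarm"
  else
    echo "no data anywhere: resource gone or agent dead"
  fi
}
```

For the disk alarm in the walkthrough, the call would be triage_verdict "$(datapoint_count CWAgent DiskUsage InstanceId i-0c4f2a9b1e8d7c3f6)" "$(reporting_dimension_sets CWAgent DiskUsage)". Keeping the verdict separate from the AWS calls makes it easy to run the same logic over a saved inventory.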

You've completed Fix INSUFFICIENT_DATA alarms. You now know the three-state alarm model, why INSUFFICIENT_DATA is a failure mode and not a quiet success, the four common causes (terminated resource, replaced instance, dead agent, wrong namespace), and the four-step loop to clean them up — inventory, triage, delete-or-repoint-or-fix, prevent recurrence. The next time the FinOps Dashboard flags ALM-002, you'll have a real plan instead of a shrug.

INSUFFICIENT_DATA alarms: the cost of monitoring that isn't monitoring

You're paying for alarms that stopped watching anything — and paying again when the incident they were meant to catch arrives

CloudWatch alarms can sit in a third state called INSUFFICIENT_DATA, which means the metric feeding them has stopped arriving. The alarm doesn't fire, it doesn't show red, and it doesn't page anyone. It just sits there, silently not watching. The FinOps Dashboard's ALM-002 check surfaces this because a monitoring bill with broken monitors is a cost that generates zero protection.

Each CloudWatch alarm costs a small fixed amount per month — roughly $0.10 per metric alarm. That isn't the real cost. The real cost is the incident these alarms were supposed to prevent: a disk filling up with no alert, a database CPU pegged with no notification, a 5xx spike that customer support discovers before your team does. Each of those incidents carries engineering hours, potential SLO credits, and customer impact. The alarm was the cheap insurance against it.

From a finance perspective this is a straightforward waste-plus-risk problem. The spend on non-functional alarms is small but tells you something: your monitoring investment isn't delivering its coverage. The right frame is not just cleaning up the cost line — it's asking what incidents your team is unprotected from right now, and whether the remediation cost (cheap) is justified against the incident cost (much larger). It almost always is.

This lesson is for the finance partner who wants to understand why broken alarms show up as both a cost issue and a risk issue. You'll get the plain-English version of what INSUFFICIENT_DATA means, why it's a monitoring spend that returns zero protection, how to think about which broken alarms represent genuine risk exposure versus harmless stale config, and the governance lever — a regular ALM-002 review with a documented exception process — that keeps the monitoring investment honest. No CLI commands required.

How a finance partner reads an ALM-002 report

Dana is the FinOps lead reviewing the monthly cloud controls report. ALM-002 shows eight alarms in INSUFFICIENT_DATA across the production account. Before asking the team to remediate all eight, Dana asks the tiering question: which of these alarms were protecting something that matters, and which are stale config on systems that no longer exist?

The team pulls the list with the alarm names and the resources they were watching. Three are on instance IDs that were terminated over a year ago during an AMI refresh: harmless stale config, $0.30/month combined, no risk exposure. Five are on live systems whose metrics have stopped arriving: a database disk alarm, a checkout-worker memory alarm, and three network-related checks on active load balancers.

Dana's call on the five live-system alarms is immediate: the remediation cost is a few engineer-hours, the risk of leaving them broken is an incident nobody gets alerted to. She approves the fix and adds a line to the monthly review: ALM-002 findings on live systems must be remediated within two business days. The three terminated-instance alarms get deleted and documented. The monitoring budget now buys actual coverage.

Why this belongs in the cost-and-risk review, not just the ops backlog

The direct monitoring cost of an INSUFFICIENT_DATA alarm is negligible — a fraction of a dollar per month. That's not the number that belongs in a finance review. The number that belongs in a finance review is the expected cost of the incident the alarm was supposed to prevent. A disk-full alarm on a production database: what's the cost of a multi-hour database outage? A 5xx-rate alarm on checkout: what's the cost of a checkout disruption that customer support catches before engineering? Those are the exposure dollars that INSUFFICIENT_DATA alarms leave uninsured.

The second finance-relevant impact is audit posture. SOC 2 and ISO 27001 both evaluate whether monitoring is working, not merely configured. An auditor who asks to see alarm coverage and finds eight alarms that have been in INSUFFICIENT_DATA for over a year is not seeing evidence of good controls — they're seeing evidence of a gap. That creates audit finding risk, potential certification complications, and enterprise customer questions that cost more to manage than the remediation would have.

The right framing for finance is simple: categorize the broken alarms by what they were protecting. Alarms on terminated, low-stakes resources are cleanup items. Alarms on production systems, watching for failure modes with real business impact, are risks the business believes are insured and currently are not. Approve the fix for the second category immediately and document the first. The monitoring investment only pays off if the monitors work.

What finance can do about INSUFFICIENT_DATA alarms

Finance doesn't repoint alarms, but it owns the policy that ensures the monitoring investment delivers real coverage. Four levers that fit inside the regular FinOps cadence.

1. Separate the list by what the alarms were protecting

Not all broken alarms carry the same risk. Alarms on terminated or decommissioned resources are harmless stale config — worth cleaning up, but not urgent. Alarms on live production systems represent coverage gaps that are currently uninsured. Triage the ALM-002 list by resource status before deciding urgency and remediation priority.

2. Set a remediation SLA for production-system gaps

Agree with engineering on a response window for INSUFFICIENT_DATA findings on live, production-critical systems — two business days is a common target. This converts monitoring debt from a background backlog item into a tracked obligation, the same way you'd track an unpatched critical CVE. Document the SLA so exceptions require explicit sign-off.

3. Add ALM-002 to the compliance and audit narrative

SOC 2 and ISO 27001 evaluators ask whether monitoring is effective, not just configured. A recurring ALM-002 review with a documented remediation record is evidence of an effective control, not just a deployed one. Include the finding count and remediation rate in the controls section of the compliance pack, with a trend line rather than a snapshot.

4. Price the monitoring coverage gap against the incident it prevents

When a production alarm is broken, document the failure mode it was covering and estimate the cost of that failure mode occurring undetected — outage duration, engineering hours, SLO credits, customer impact. That number is almost always much larger than the remediation cost. Making the comparison explicit turns an engineering to-do into a funded, urgent finance decision.

You've finished the finance partner's view of INSUFFICIENT_DATA alarms. You know why this is a risk-and-spend issue rather than just an engineering cleanup task, how to split the ALM-002 list into urgent live-system gaps versus harmless stale config, and the four levers that keep the monitoring investment honest: triaging by what the alarms were protecting, a remediation SLA for production gaps, including ALM-002 in the compliance narrative, and pricing the coverage gap against the incident cost. Next time ALM-002 fires, you'll ask the right question: not how many are broken, but what they were protecting.

INSUFFICIENT_DATA: the monitoring that looks fine but isn't

Alarms that are configured but blind are a gap in your operational risk posture, not a minor housekeeping item

AWS CloudWatch alarms are the primary tripwire between normal operations and an incident. An INSUFFICIENT_DATA alarm is one that has stopped receiving data — typically because the resource it watched was replaced or deleted — and is therefore not watching anything. It sits on the dashboard without a red badge, creating the appearance of coverage where there is none.

This is an operational risk question more than a cost one. The useful executive question is not how many alarms are broken, but which systems they were protecting. A disk-full alarm that has been blind for six months on a database server is a different risk exposure than a stale alarm on a decommissioned dev instance. The ALM-002 check surfaces the list so the risk can be assessed against business systems, not just alarm counts.

The healthy end state is simple to describe: every alarm that exists is watching a live resource, and every system that matters has a working alarm watching it. That's monitoring by policy, not by hope.

A short read for the executive sponsor of operational reliability. You'll understand what INSUFFICIENT_DATA actually means in non-technical terms, why it's an operational risk gap rather than a minor housekeeping task, and what the one leadership question is that distinguishes a healthy monitoring posture from an accidental one. No implementation depth required.

What it looks like when this goes wrong

At one mid-size retailer, a disk-full alarm had been in INSUFFICIENT_DATA for over a year. The instance it was watching had been replaced during a routine refresh; the new instance got a new ID and no one repointed the alarm. Eleven months later the replacement instance's disk filled up. There was no alert. Customer support noticed before the engineering team did.

The post-mortem identified the root cause quickly: the monitoring was configured but not verified. The alarm existed, it was enabled, it just wasn't watching anything real. The operational cost — incident response, a brief checkout outage, customer communications — was orders of magnitude larger than the cost of repointing the alarm would have been.

The executive takeaway wasn't technical. It was that 'monitoring is configured' and 'monitoring is working' are different statements that require different verification. ALM-002 is the check that distinguishes between them.

What is actually at risk when alarms go blind

The business impact of INSUFFICIENT_DATA alarms is a gap between the monitoring posture the organization believes it has and the one it actually has. Every alarm in this state represents a failure mode that no one will be alerted to if it occurs. The practical consequence is that incidents get discovered by customers, by support tickets, or by manual checks — not by the automated tripwires that were put in place precisely to catch them first.

There is also a secondary risk to organizational trust in the monitoring system itself. Operations teams learn quickly which alarms are reliable and which are habitually broken. Once that trust erodes, people stop acting urgently on alerts, and legitimate alarms get treated with the same skepticism as stale ones. Operational discipline degrades incrementally rather than catastrophically.

For the executive, the material question is not the alarm count — it is whether the systems the business depends on are actually being watched. A handful of broken alarms on live, critical infrastructure is a higher-priority item than dozens of stale alarms on decommissioned resources. The ALM-002 check gives you the list to make that distinction.

The leadership position on broken alarms

The executive ask is not to mandate alarm remediation counts — it's to require that the monitoring posture matches the stated operational commitments. Two decisions and one standing question.

1. Require working alarms on production-critical systems

Make it policy that any system covered by an SLA, SLO, or uptime commitment must have functional monitoring — alarms in INSUFFICIENT_DATA on those systems are a policy violation, not a backlog item. This creates a clear escalation path when the check fires rather than leaving it at engineering discretion.

2. Accept stale alarms on decommissioned resources as cleanup, not risk

Not every INSUFFICIENT_DATA finding is urgent. Alarms on terminated resources are harmless config drift — they should be cleaned up on a regular cadence, but they don't represent an uninsured risk. Distinguish between the two so leadership attention lands on the right category.

3. Ask the standing question at each operational review

The one-line leadership question for ALM-002 is: 'Are there any INSUFFICIENT_DATA alarms currently watching live production systems?' A consistent no — or a shrinking yes with a tracked remediation timeline — is the signal that monitoring is governed by policy. That's a one-minute read that requires no technical depth.

That's the lesson. Two takeaways: INSUFFICIENT_DATA means the alarm exists but is watching nothing, which is a different risk exposure depending entirely on whether the system it was supposed to watch is live and critical. The leadership question is not the alarm count — it's whether any production-critical systems are currently unmonitored. A policy that requires working alarms on committed-uptime systems, with a fast remediation SLA, is what separates monitoring by design from monitoring by assumption.

Part of the learning path Get your alarms right
