INSUFFICIENT_DATA alarms: the basics
What does INSUFFICIENT_DATA actually mean?
CloudWatch alarms have three states, not two. OK means the metric is reporting and within thresholds. ALARM means the metric is reporting and the threshold has been breached. INSUFFICIENT_DATA means CloudWatch isn't receiving enough datapoints in the recent evaluation window to make a determination either way — the alarm has nothing to alarm on.
On a dashboard an INSUFFICIENT_DATA alarm often looks fine. It's not red. It hasn't paged anyone. The graph next to it is blank but in a quiet way. That's exactly what makes it dangerous: a disk-full alarm sitting in INSUFFICIENT_DATA for six months is not protecting you from a disk filling up. It's protecting you from nothing.
AWS Trusted Advisor and the FinOps Dashboard's ALM-002 check flag this pattern because it's almost always a sign that something has drifted — the resource was deleted, the CloudWatch agent stopped reporting, an instance was replaced and the alarm still points at the dead instance ID, or the alarm was created against a metric namespace that never existed. The alarm is configured. The alarm is enabled. The alarm is blind.
In this lesson you'll learn the three-state CloudWatch alarm model, why INSUFFICIENT_DATA is a real failure mode rather than a quiet success, the four most common reasons alarms drift into it, and the diagnostic flow to decide whether to delete the alarm, repoint it, or fix the underlying agent. You'll walk through an AWS CLI investigation and the calls used to clean up a fleet of stale alarms.
The alarm that outlived the instance
In a 2023 retro from a UK retailer, the post-mortem showed a critical disk-space alarm had been in INSUFFICIENT_DATA for 412 days before anyone noticed. The instance it monitored had been replaced as part of a routine AMI refresh — the new instance got a new ID, the alarm kept pointing at the old one, and CloudWatch dutifully reported "no data" every minute for over a year. The disk that eventually filled up was on the replacement instance. There was no alarm watching it.
Cleaning up INSUFFICIENT_DATA alarms in action
Marco runs the SRE rotation at a retail platform. A FinOps Dashboard scan flags ALM-002 with eight INSUFFICIENT_DATA alarms across the production account. One of them — "CRITICAL EC2 Sainsburys Web Server disk used % high" — is marked HIGH severity because the name suggests it's meant to catch a serious failure mode.
He doesn't know whether the alarm is broken, the instance is gone, or the CloudWatch agent has crashed. All three look identical from the alarm's perspective: no datapoints in the evaluation window. So he starts where the alarm starts — describe-alarms, filtered to just the broken ones.
The output tells him the alarm's metric, namespace, and dimensions. From there he can open the metric in the console and see immediately whether the resource it points at still exists.
First, list every alarm currently in INSUFFICIENT_DATA. The state-value filter is the fast way to scope the search.
All alarms in INSUFFICIENT_DATA across the account.
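A minimal sketch of that listing call, wrapped in a shell function so it can be reused in later steps. The function name list_blind_alarms is made up for illustration; --state-value, --query, and --output are standard AWS CLI options, and the selected fields are taken from the describe-alarms response.

```shell
# Print one line per alarm currently in INSUFFICIENT_DATA:
# alarm name, namespace, metric name, and when the state last changed.
list_blind_alarms() {
  aws cloudwatch describe-alarms \
    --state-value INSUFFICIENT_DATA \
    --query 'MetricAlarms[].[AlarmName,Namespace,MetricName,StateUpdatedTimestamp]' \
    --output text
}
```

The StateUpdatedTimestamp column is a quick way to see how long each alarm has been blind.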
Now confirm whether the underlying resource still exists. If the instance is gone, the alarm has been blind since the terminate call.
The InstanceId in the alarm's dimensions doesn't exist any more — the resource is gone.
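One way to run that existence check from the CLI, as a sketch. The function name instance_state is hypothetical, and so is the instance ID in the usage line. It assumes EC2 answers with an InvalidInstanceID.NotFound error for IDs it no longer knows about; an instance terminated very recently may still be listed for a while with the state terminated.

```shell
# Print the instance's lifecycle state, or "gone" if EC2 has no record of the ID.
# Any other failure (credentials, region, throttling) is reported as an error
# rather than being mistaken for a missing instance.
instance_state() {
  if _out="$(aws ec2 describe-instances --instance-ids "$1" \
      --query 'Reservations[].Instances[].State.Name' --output text 2>&1)"; then
    echo "${_out:-gone}"
  else
    case "$_out" in
      *InvalidInstanceID.NotFound*) echo "gone" ;;
      *) echo "error: $_out" >&2; return 1 ;;
    esac
  fi
}
```

Usage, with a placeholder ID: instance_state i-0123456789abcdef0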
How CloudWatch decides an alarm is INSUFFICIENT_DATA: deep dive
A CloudWatch alarm evaluates a metric on a fixed schedule defined by its Period and EvaluationPeriods. A typical disk-full alarm watches 5-minute datapoints over 3 consecutive periods. Each evaluation cycle CloudWatch looks at the most recent N periods and decides: does the metric breach the threshold (ALARM), is it within bounds (OK), or are there simply not enough datapoints to decide (INSUFFICIENT_DATA)?
What "not enough datapoints" means is governed by TreatMissingData. There are four options: missing (the default — a missing datapoint counts as missing, and if all evaluation periods are missing the alarm goes INSUFFICIENT_DATA), notBreaching (a missing datapoint counts as good, alarm stays OK), breaching (a missing datapoint counts as bad, alarm goes ALARM), and ignore (missing datapoints don't change the alarm state at all, so it stays in whatever state it last had). Most alarms in the wild use the default, which is why "the metric stopped reporting" almost always presents as INSUFFICIENT_DATA rather than ALARM.
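The four options can be summarised as a small lookup. This is a simplified model of the case where every datapoint in the evaluation window is missing, written as a shell function with a made-up name; it is not how CloudWatch is implemented, and it does not cover windows where only some datapoints are missing.

```shell
# Simplified model: what state does an alarm end up in when ALL datapoints
# in the evaluation window are missing?
#   $1 = TreatMissingData setting
#   $2 = the alarm's previous state (only matters for "ignore")
state_when_all_missing() {
  case "$1" in
    missing)      echo "INSUFFICIENT_DATA" ;;  # the default
    notBreaching) echo "OK" ;;                 # missing counts as good
    breaching)    echo "ALARM" ;;              # missing counts as bad
    ignore)       echo "$2" ;;                 # state is left unchanged
    *)            echo "unknown TreatMissingData: $1" >&2; return 1 ;;
  esac
}
```

For example, state_when_all_missing ignore OK prints OK, which is how an alarm on a dead metric can sit in OK indefinitely.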
Alarms are decoupled from the resources they monitor. There's no foreign key from an EC2 instance to its alarms — if you terminate the instance, the alarms survive. They just stop receiving data. This decoupling is what makes ASGs especially problematic: ephemeral instances come and go with fresh IDs, but per-instance alarms keep pointing at instance IDs that no longer exist. The fix is to alarm on aggregate metrics at the ASG or target-group level, not on individual instance IDs.
# Look at the alarm's full config. TreatMissingData is the field that decides how a missing datapoint is handled.
aws cloudwatch describe-alarms \
  --alarm-names "CRITICAL EC2 Sainsburys Web Server disk used % high" \
  --query 'MetricAlarms[0].{Metric:MetricName,Period:Period,EvalPeriods:EvaluationPeriods,Treat:TreatMissingData,Dims:Dimensions}'
# Spot the alarms that will silently stay quiet because TreatMissingData is set permissively.
aws cloudwatch describe-alarms \
  --query "MetricAlarms[?TreatMissingData=='notBreaching' || TreatMissingData=='ignore'].AlarmName"
What is the impact of leaving alarms in INSUFFICIENT_DATA?
The direct impact is that the alarm isn't doing its job. A disk-full alarm in INSUFFICIENT_DATA will not page you when the disk fills up. A 5xx-rate alarm in INSUFFICIENT_DATA will not catch the spike that wakes up customer support. You're paying for the alarm ($0.10/month per metric alarm — pocket change individually, real money across thousands), but more importantly you're paying with a false sense of coverage.
The second-order impact is what those alarms typically protect against. Disk-full leads to writes failing, databases corrupting, queues backing up. 5xx-rate leads to customer-visible outages. CPU saturation leads to autoscaling thrash. Each of these has its own bill — incident hours, SLO credit refunds, customer churn — that dwarfs the cost of the alarm itself, and the alarm was the cheapest line of defence against it.
The third-order impact is on the team's signal-to-noise ratio. Once people learn that some alarms are broken, they stop trusting the dashboard. The next person on call sees "INSUFFICIENT_DATA" next to a critical alarm and shrugs, because that's been the state for months. By the time it actually breaks for a real reason, nobody is watching.
On the compliance side, frameworks like SOC 2 and ISO 27001 expect monitoring to be both configured and effective. An alarm that exists but doesn't fire is audit evidence of a control gap, not evidence of a working control. An auditor sampling your alarms will eventually ask why eight of them have been blind for over a year.
How do you clean up INSUFFICIENT_DATA alarms?
Cleanup is a four-step loop. The point isn't to make every alarm green — it's to make every alarm honest. An honest alarm either watches a live resource or stops existing.
1. Inventory every alarm in INSUFFICIENT_DATA
Use describe-alarms --state-value INSUFFICIENT_DATA to pull the full list, with metric, namespace, and dimensions. Don't trust the alarm name — names lie, dimensions don't. Group the output by what the dimensions point at: an InstanceId (probably terminated), an AutoScalingGroupName (probably fine, agent issue), an ELB name (likely renamed or deleted), a CWAgent custom metric (probably an agent problem).
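A rough sketch of the grouping step: count how often each dimension name appears across the blind alarms. The function name is hypothetical. An alarm with several dimensions is counted once per dimension, so treat the numbers as a triage hint rather than an exact alarm count.

```shell
# Tally dimension names (InstanceId, AutoScalingGroupName, ...) across all
# alarms in INSUFFICIENT_DATA, most frequent first.
count_blind_alarms_by_dimension() {
  aws cloudwatch describe-alarms \
    --state-value INSUFFICIENT_DATA \
    --query 'MetricAlarms[].Dimensions[].Name' \
    --output text \
    | tr '\t' '\n' \
    | sed '/^$/d' \
    | sort | uniq -c | sort -rn
}
```

A large InstanceId count at the top of the output suggests per-instance alarms that have outlived their instances.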
2. Triage by opening the metric in CloudWatch
For each alarm, query the metric directly: run aws cloudwatch get-metric-statistics with the alarm's exact namespace, metric name, and dimensions over the last 24 hours. If that returns zero datapoints, the resource is gone, the agent is dead, or the metric is now being published under different dimensions (an instance was replaced and the alarm wasn't repointed). To tell these apart, run aws cloudwatch list-metrics for the same namespace and metric name and compare the dimension sets it returns with the alarm's. Each case has a different fix.
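The triage queries might look like the sketch below. Both function names are invented for illustration, the example handles a single dimension only, and the 24-hours-ago timestamp assumes GNU date (the -d option); on macOS or BSD the date invocation would need adjusting.

```shell
# Count datapoints reported in the last 24 hours for one metric/dimension pair.
#   $1 = namespace, $2 = metric name, $3 = dimension name, $4 = dimension value
count_datapoints_24h() {
  aws cloudwatch get-metric-statistics \
    --namespace "$1" --metric-name "$2" \
    --dimensions "Name=$3,Value=$4" \
    --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 3600 --statistics Maximum \
    --query 'length(Datapoints)' --output text
}

# Show the dimension sets actually being published for a metric, to compare
# against the alarm's dimensions.
#   $1 = namespace, $2 = metric name
list_reported_dimensions() {
  aws cloudwatch list-metrics \
    --namespace "$1" --metric-name "$2" \
    --query 'Metrics[].Dimensions' --output json
}
```

A count of 0 from the first function combined with a non-empty result from the second points at a dimension mismatch rather than a dead resource. Note that list-metrics only returns metrics that have reported data recently.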
3. Delete, repoint, or fix the agent
If the resource is gone for good (terminated, deleted), delete the alarm. If the resource was replaced, update the alarm's dimensions to the new ID — or better, move it to a dimension that survives replacement (AutoScalingGroupName, TargetGroup, ClusterName). If the resource exists but the metric is empty, log in and check the CloudWatch agent — systemctl status amazon-cloudwatch-agent will usually tell you it crashed three weeks ago.
4. Prevent recurrence with Config and ASG-level alarms
Enable the AWS Config managed rule cloudwatch-alarm-resource-check to flag resources of a given type (EC2 instances, for example) that have no alarm on a named metric. That catches the replacement instance that came up without coverage, which is the other half of an alarm left pointing at a dead ID. For any workload behind an ASG, alarm on the ASG-level aggregate (CPUUtilization on the AutoScalingGroupName dimension) instead of per-instance; these alarms survive instance churn. For disk and memory metrics that need the agent, monitor agent health itself with a separate alarm on the agent's own heartbeat metric.
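Enabling the Config rule from the CLI could look roughly like this. It is a sketch under assumptions: the rule name ec2-cpu-alarm-coverage and the function name are invented, and the parameters shown (resourceType and metricName) reflect the managed rule's documented purpose of checking that resources of a type have an alarm on a named metric. Verify the parameter names against the current AWS Config documentation before relying on it, and note that AWS Config must already be recording in the account.

```shell
# Create a Config rule that marks EC2 instances NON_COMPLIANT when they have
# no CloudWatch alarm on CPUUtilization. Rule name is a placeholder.
enable_alarm_coverage_rule() {
  aws configservice put-config-rule --config-rule '{
    "ConfigRuleName": "ec2-cpu-alarm-coverage",
    "Source": {"Owner": "AWS", "SourceIdentifier": "CLOUDWATCH_ALARM_RESOURCE_CHECK"},
    "Scope": {"ComplianceResourceTypes": ["AWS::EC2::Instance"]},
    "InputParameters": "{\"resourceType\":\"AWS::EC2::Instance\",\"metricName\":\"CPUUtilization\"}"
  }'
}
```

This complements the ASG-level alarms: the rule finds resources with no coverage, while the aggregate alarms avoid per-instance coverage gaps in the first place.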
# Resource is gone for good: delete the alarm.
aws cloudwatch delete-alarms \
  --alarm-names "CRITICAL EC2 Sainsburys Web Server disk used % high"
# Resource was replaced: repoint the alarm to the new instance ID (or the ASG, ideally).
# disk_used_percent is the CloudWatch agent's default name for this metric on Linux; use whatever name your agent config publishes.
# The dimensions must match a dimension set the agent actually publishes, or the alarm goes straight back to INSUFFICIENT_DATA.
aws cloudwatch put-metric-alarm \
  --alarm-name "CRITICAL EC2 Sainsburys Web Server disk used % high" \
  --metric-name disk_used_percent \
  --namespace CWAgent \
  --dimensions Name=AutoScalingGroupName,Value=sainsburys-web-asg Name=device,Value=xvda1 \
  --statistic Maximum --period 300 --evaluation-periods 2 \
  --threshold 85 --comparison-operator GreaterThanThreshold \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:ops-pager
Keep learning
Dig deeper into CloudWatch alarm semantics and lifecycle management.
- CloudWatch alarms and missing data: official documentation of the alarm state machine and the four TreatMissingData options.
- AWS Config managed rule cloudwatch-alarm-resource-check: continuous detection of resources that have no alarm on a named metric.
- CloudWatch agent troubleshooting: when the metric is gone but the resource still exists, this is where to start.
- AWS Well-Architected, Operational Excellence Pillar: how alarming fits the broader observability and operations practice.
You've completed Fix INSUFFICIENT_DATA alarms. You now know the three-state alarm model, why INSUFFICIENT_DATA is a failure mode and not a quiet success, the four common causes (terminated resource, replaced instance, dead agent, wrong namespace), and the four-step loop to clean them up — inventory, triage, delete-or-repoint-or-fix, prevent recurrence. The next time the FinOps Dashboard flags ALM-002, you'll have a real plan instead of a shrug.