Alarm coverage: the basics
What does it mean for an EC2 instance to be "unmonitored"?
Every EC2 instance ships with two built-in status checks — system reachability and instance reachability — and a handful of free CloudWatch metrics: CPUUtilization, NetworkIn/Out, DiskReadOps/WriteOps, and the StatusCheckFailed family. The metrics exist whether or not anyone is looking at them. An alarm is what turns a metric into a page; without one, an instance can melt down for hours before anyone notices.
An "unmonitored" instance is one that emits those metrics but has zero CloudWatch alarms attached to it. The CPU can saturate, the kernel can panic, the EBS volume can fill — and the only signal is a customer complaint or a stalled deploy. AWS Trusted Advisor flags this under COV-001 ("EC2 Instances Without Alarms") because it's one of the highest-leverage gaps in any account: the fix is cheap, the cost of skipping it is operational pain that compounds.
The pattern is almost always the same. One-off boxes launched from the console for "a quick test" never get alarms. Older AMIs cloned into new VPCs miss the userdata that wires them up. Auto Scaling Group instances inherit alarms from their launch template, but the standalone EC2 sitting next to the ASG — the one that ended up running the cron job nobody documented — has nothing.
In this lesson you'll learn which four CloudWatch alarms every production EC2 instance should have, the difference between system and instance status checks, why the gap usually exists, and how to provision alarms automatically based on tags so new instances inherit the baseline on launch. You'll see the audit query that finds unmonitored instances, the CLI calls to create the four baseline alarms, and the tradeoff calculation for fleets where blanket alarm coverage stops being economical.
Auto-recovery is one alarm away
StatusCheckFailed_System isn't just a notification — wire it to an EC2 recovery action and AWS will automatically migrate the instance to new underlying hardware when the hypervisor fails, preserving the instance ID, private IP, Elastic IP, and all metadata. It's a one-line alarm definition that turns a silent hardware fault into a 3-minute self-heal. Most teams find out this feature exists about a week after a host failure has already cost them a midnight incident.
Adding alarm coverage in action
Marco runs platform engineering at a SaaS company. Trusted Advisor flags COV-001 against the production account with 47 EC2 instances missing alarms — including i-02ea2a0c3d34975b9, the "Seven Web Server 2026" box that handles the marketing site.
He's not surprised. The web server was spun up by a contractor 18 months ago, never made it into Terraform, and survived two team handovers without ever being audited. There's no point arguing about how it got there — he needs to know exactly which instances are bare and what their tags say so he can apply the right baseline.
He starts by cross-referencing every running instance against the alarms that have a matching InstanceId dimension.
First, find every running instance with zero alarms whose dimensions include its instance ID. The trick is comparing two list outputs.
Set-subtraction across describe-instances and describe-alarms — the cheapest audit query AWS gives you.
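A minimal sketch of that set-subtraction, assuming the AWS CLI is configured for the account and region being audited. The function names are illustrative and not part of the lesson, and alarms built on metric math (which carry no top-level dimensions) are not counted as coverage here.

```shell
# Sketch only: function names are hypothetical helpers.

# Running instance IDs, one per line.
list_running_instances() {
  aws ec2 describe-instances \
    --filters Name=instance-state-name,Values=running \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text | grep -Eo 'i-[0-9a-f]+' | sort -u
}

# Instance IDs that appear as an InstanceId dimension on at least one metric alarm.
list_alarmed_instances() {
  aws cloudwatch describe-alarms \
    --query "MetricAlarms[].Dimensions[] | [?Name=='InstanceId'].Value" \
    --output text | grep -Eo 'i-[0-9a-f]+' | sort -u
}

# Set subtraction: IDs in file $1 (running) that are absent from file $2 (alarmed).
unmonitored() {
  sorted_running=$(mktemp)
  sorted_alarmed=$(mktemp)
  sort -u "$1" > "$sorted_running"
  sort -u "$2" > "$sorted_alarmed"
  # comm -23 keeps only the lines unique to the first (sorted) file.
  comm -23 "$sorted_running" "$sorted_alarmed"
  rm -f "$sorted_running" "$sorted_alarmed"
}
```

To run it, write each list to a file (`list_running_instances > running.txt`, `list_alarmed_instances > alarmed.txt`) and call `unmonitored running.txt alarmed.txt`. Keeping the comparison separate from the API calls means it can be checked against saved output without touching the account.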
Now provision the four baseline alarms for the worst offender. CPU, both status checks, and (because the CloudWatch agent is installed) disk space.
Baseline alarm set for one instance. The status-system alarm has EC2 auto-recovery as its action.
Alarms and status checks under the hood
EC2 runs two status checks every minute, on every instance, for free. The system status check covers the hypervisor and the underlying host — network connectivity, hardware faults, power. When it fails, the right response is almost always to move the instance to new hardware: that's what the ec2:recover alarm action does, and it works for any current-gen instance type with an EBS-only root volume. The instance status check covers everything inside the guest OS: out-of-memory kernel panics, exhausted ephemeral disk, misconfigured networking. AWS can't fix that for you — it's your problem to investigate.
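The split between the two checks can be expressed as a small triage rule. The sketch below is illustrative: the function names are hypothetical, and the mapping simply restates the paragraph above (a system failure calls for recovery, an instance failure calls for investigation inside the guest).

```shell
# Sketch only: function names are hypothetical helpers.

# Print the system and instance status for one instance, tab-separated
# (for example "ok<TAB>impaired"). Requires AWS credentials.
fetch_status() {
  aws ec2 describe-instance-status \
    --instance-ids "$1" --include-all-instances \
    --query 'InstanceStatuses[0].[SystemStatus.Status, InstanceStatus.Status]' \
    --output text
}

# Map the two check results to the response described above.
# $1 = system status, $2 = instance status.
triage_status() {
  case "$1/$2" in
    impaired/*)  echo "recover" ;;      # host-level fault: move to new hardware
    ok/impaired) echo "investigate" ;;  # guest OS fault: AWS cannot fix it
    ok/ok)       echo "healthy" ;;
    *)           echo "unknown" ;;      # initializing, insufficient-data, and so on
  esac
}
```

Calling `triage_status $(fetch_status i-02ea2a0c3d34975b9)` combines the two. In practice the alarms do this work continuously, so a helper like this is mainly useful for spot checks.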
Default CloudWatch metrics come from the hypervisor, which means CPU, network, and EBS volume IOPS are free and require zero agent. Memory and disk-space utilisation, on the other hand, are inside the guest — AWS cannot see them — so you need the CloudWatch agent installed and configured to push mem_used_percent and disk_used_percent to a custom namespace (typically CWAgent). No agent, no memory or disk alarms. This is the single most common reason an alarm "didn't fire" — the metric was never there to begin with.
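As a rough illustration of what "installed and configured" means, the sketch below writes a minimal agent configuration that publishes mem_used_percent and disk_used_percent to the CWAgent namespace. The function name and file path are assumptions for the example. Setting drop_device here is also an assumption: it removes the device dimension so that disk metrics are keyed only by InstanceId, path, and fstype.

```shell
# Sketch only: write a minimal CloudWatch agent config to the path given in $1.
# The function name is a hypothetical helper, not part of the lesson.
write_agent_config() {
  # The quoted heredoc delimiter keeps ${aws:InstanceId} literal,
  # so the agent (not the shell) resolves it at runtime.
  cat > "$1" <<'JSON'
{
  "metrics": {
    "namespace": "CWAgent",
    "append_dimensions": { "InstanceId": "${aws:InstanceId}" },
    "metrics_collected": {
      "mem": { "measurement": ["mem_used_percent"] },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["/"],
        "drop_device": true
      }
    }
  }
}
JSON
}
```

After writing the file (for example `write_agent_config /tmp/cwagent.json`), the agent can load it with `amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/tmp/cwagent.json`. Whatever dimensions the agent ends up publishing are the ones an alarm has to specify exactly.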
Alarm pricing is straightforward: $0.10 per alarm-month for standard-resolution alarms, $0.30 for high-resolution. Composite alarms (which combine other alarms with a boolean expression) are $0.50 each but they don't trigger on every state change — they fire when the composite expression flips. On a small fleet the bill is invisible; on a 10,000-instance fleet four alarms each becomes $4,000/month and the conversation shifts from "alarm everything" to "alarm what we'll actually act on."
# The two built-in status checks: both are free metrics, both ship every minute.
aws cloudwatch list-metrics \
    --namespace AWS/EC2 \
    --metric-name StatusCheckFailed_System \
    --dimensions Name=InstanceId,Value=i-02ea2a0c3d34975b9

aws cloudwatch list-metrics \
    --namespace AWS/EC2 \
    --metric-name StatusCheckFailed_Instance \
    --dimensions Name=InstanceId,Value=i-02ea2a0c3d34975b9

# Memory and disk metrics only exist if the CloudWatch agent is pushing them.
aws cloudwatch list-metrics \
    --namespace CWAgent \
    --dimensions Name=InstanceId,Value=i-02ea2a0c3d34975b9

What is the impact of running EC2 without alarms?
The most direct impact is undetected failure. An instance can sit pinned at 100% CPU, swapping memory, or with a full root disk for hours — and the first signal is a customer ticket or a stalled batch job. Mean time to detect (MTTD) for an unmonitored failure is whatever the slowest external feedback loop is, which on a B2B SaaS is typically half a business day. By the time you know, you've already missed your SLO.
The second-order impact is that auto-recovery never gets to do its job. When the underlying hardware fails, which on any sizeable fleet is a matter of when rather than if, a StatusCheckFailed_System alarm wired to ec2:recover resolves it in about three minutes with no human in the loop. Without the alarm, the host is just dead until someone notices and manually stops and starts the instance, which on a Sunday night usually means an hour-long incident.
The third-order impact is the compounding cost of "we don't actually know." Capacity planning, right-sizing, anomaly detection — all of them depend on a baseline of trusted metrics with at least some operational scrutiny. Instances without alarms tend to drift outside the visibility envelope: nobody's looking, so nobody notices when CPU has been at 95% for a month, or when the disk has been quietly filling at 200MB/day. The fix when it finally surfaces is always more expensive than the fix would have been when the alarm should have fired.
There's a financial impact too, but it cuts both ways. Five alarms per instance (the baseline four plus one workload alarm) across a 1,000-instance fleet is 5,000 alarms × $0.10 = $500/month for full coverage. On a multi-tenant Kubernetes platform with ten thousand short-lived nodes, that math stops working. At that point you move alarming up the stack (container health, ingress success rate, queue depth) and drop per-node alarms to just the status checks. The cost of monitoring is real; the cost of not monitoring is usually larger.
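The fleet arithmetic is easy to check with a small helper. This is a sketch: the function name is hypothetical, and the default of 10 cents per alarm is the standard-resolution price quoted in this lesson, so adjust it if your pricing differs.

```shell
# Sketch only: monthly alarm cost in dollars.
# Usage: monthly_alarm_cost INSTANCES ALARMS_PER_INSTANCE [CENTS_PER_ALARM]
monthly_alarm_cost() {
  instances=$1
  alarms_each=$2
  cents_per_alarm=${3:-10}   # standard-resolution price from the lesson
  # Work in whole cents to stay within shell integer arithmetic.
  total_cents=$(( instances * alarms_each * cents_per_alarm ))
  printf '%d.%02d\n' "$(( total_cents / 100 ))" "$(( total_cents % 100 ))"
}
```

`monthly_alarm_cost 1000 5` prints 500.00 and `monthly_alarm_cost 10000 4` prints 4000.00, matching the figures in the text. Dropping a 10,000-node fleet to the two status-check alarms (`monthly_alarm_cost 10000 2`) halves that to 2000.00.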
How do you close the coverage gap?
Closing the gap is a four-step loop. The first three are about getting to 100% coverage; the fourth makes sure new instances inherit it automatically instead of starting from zero every time.
1. Audit which instances have no alarms
Cross-reference describe-instances against describe-alarms filtered on the InstanceId dimension. Any running instance ID that doesn't appear in the alarm list is bare. Run this as a daily Lambda — the output is one number (count of unmonitored instances) and one S3 file (the list). Trend the number; if it grows, you have a process problem upstream of the alarm gap itself.
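One way to produce the number and the file is sketched below. The bucket, key prefix, namespace, and metric name are all placeholders chosen for the example, and publishing the count as a custom metric is just one possible way to trend it.

```shell
# Sketch only: names below are hypothetical placeholders.

# Count instance IDs in the audit output (one per line, blank lines ignored).
count_unmonitored() {
  # grep -c exits non-zero when the count is 0, so mask the exit status.
  grep -c . "$1" || true
}

# Upload the list to S3 and publish the count as a custom metric.
# $1 = audit output file, $2 = S3 bucket name.
publish_audit() {
  list_file=$1
  bucket=$2
  aws s3 cp "$list_file" "s3://$bucket/alarm-coverage/$(date -u +%Y-%m-%d).txt"
  aws cloudwatch put-metric-data \
    --namespace "Custom/AlarmCoverage" \
    --metric-name UnmonitoredInstances \
    --unit Count \
    --value "$(count_unmonitored "$list_file")"
}
```

With the count in CloudWatch, a single alarm on that metric can flag the process problem when the number starts to grow.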
2. Define and apply the four baseline alarms
The baseline is CPUUtilization > 85% sustained, StatusCheckFailed_System ≥ 1 (with ec2:recover as the action), StatusCheckFailed_Instance ≥ 1 (with SNS paging), and either disk_used_percent or mem_used_percent from the CloudWatch agent if it's installed. For workloads that need it, add a fifth alarm on request latency, queue depth, or whatever the actual SLI is. Stop there, because more alarms past the baseline make on-call worse, not better.
3. Use Systems Manager Quick Setup or Application Insights for curated bundles
AWS Systems Manager Quick Setup for CloudWatch Application Insights auto-discovers common stacks (Java EE, .NET, SQL Server, MySQL) on tagged instances and provisions a workload-aware alarm set, not just the OS-level metrics. It is worth turning on for any fleet running supported software: the service itself carries no additional charge (the alarms and metrics it creates are billed at standard CloudWatch rates), and the curated alarm definitions are better than what most teams write from scratch.
4. Provision alarms automatically via tags
Wire an EventBridge rule on EC2 RunInstances to a Lambda that reads the new instance's tags and creates the standard alarm set if Environment=prod (or whatever your convention is). This is the only sustainable answer — manual coverage is a treadmill, tag-driven coverage scales without intervention. For endpoint-level monitoring, CloudWatch Synthetics canaries hit the user-facing URL and alarm on the experience, which catches problems instance metrics never see.
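The wiring for that rule might look like the sketch below. Two things here are assumptions rather than the lesson's method: the rule matches the EC2 Instance State-change Notification event (instances entering the running state) instead of the RunInstances API call, and the rule name, target ID, and function names are placeholders. The Lambda itself is assumed to exist already.

```shell
# Sketch only: rule name, target ID, and function names are hypothetical.

# Event pattern matching instances that enter the running state.
running_event_pattern() {
  cat <<'JSON'
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": { "state": ["running"] }
}
JSON
}

# Look up the Environment tag for instance $1; prints "None" when the tag is absent.
environment_tag() {
  aws ec2 describe-tags \
    --filters "Name=resource-id,Values=$1" "Name=key,Values=Environment" \
    --query 'Tags[0].Value' --output text
}

# Decide whether an instance gets the baseline, given its Environment tag value.
needs_baseline() {
  [ "$1" = "prod" ]
}

# Create the rule and point it at an existing Lambda function ARN ($1).
create_rule() {
  aws events put-rule \
    --name ec2-baseline-alarms \
    --event-pattern "$(running_event_pattern)"
  aws events put-targets \
    --rule ec2-baseline-alarms \
    --targets "Id=baseline-alarms,Arn=$1"
}
```

The Lambda also needs a resource policy that lets events.amazonaws.com invoke it (for example via `aws lambda add-permission`). Inside the function, the same check applies: read the tag, and create the alarm set only when `needs_baseline "$(environment_tag "$id")"` succeeds.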
# Provision the four baseline alarms for one instance. Repeat in a loop or wrap in a Lambda triggered on RunInstances.
INSTANCE=i-02ea2a0c3d34975b9
REGION=eu-west-1
ACCOUNT=123456789012
TOPIC="arn:aws:sns:$REGION:$ACCOUNT:ops-alerts"

# CPU: alarm when 3 of the last 5 one-minute datapoints exceed 85%.
# One-minute periods assume detailed monitoring is enabled; with basic monitoring
# CPUUtilization arrives every 5 minutes, so use --period 300 instead.
aws cloudwatch put-metric-alarm \
    --alarm-name "$INSTANCE-cpu" \
    --metric-name CPUUtilization --namespace AWS/EC2 \
    --dimensions "Name=InstanceId,Value=$INSTANCE" \
    --statistic Average --period 60 --evaluation-periods 5 --datapoints-to-alarm 3 \
    --threshold 85 --comparison-operator GreaterThanThreshold \
    --alarm-actions "$TOPIC"

# System status check: recover onto new hardware and notify.
aws cloudwatch put-metric-alarm \
    --alarm-name "$INSTANCE-status-system" \
    --metric-name StatusCheckFailed_System --namespace AWS/EC2 \
    --dimensions "Name=InstanceId,Value=$INSTANCE" \
    --statistic Maximum --period 60 --evaluation-periods 2 \
    --threshold 0 --comparison-operator GreaterThanThreshold \
    --alarm-actions "arn:aws:automate:$REGION:ec2:recover" "$TOPIC"

# Instance status check: guest OS problem, so page a human.
aws cloudwatch put-metric-alarm \
    --alarm-name "$INSTANCE-status-instance" \
    --metric-name StatusCheckFailed_Instance --namespace AWS/EC2 \
    --dimensions "Name=InstanceId,Value=$INSTANCE" \
    --statistic Maximum --period 60 --evaluation-periods 2 \
    --threshold 0 --comparison-operator GreaterThanThreshold \
    --alarm-actions "$TOPIC"

# Disk space: requires the CloudWatch agent. The dimensions must match exactly what
# the agent publishes. By default it also adds a device dimension, so include it here
# unless the agent config sets drop_device. Confirm with list-metrics on CWAgent.
aws cloudwatch put-metric-alarm \
    --alarm-name "$INSTANCE-disk" \
    --metric-name disk_used_percent --namespace CWAgent \
    --dimensions "Name=InstanceId,Value=$INSTANCE" "Name=path,Value=/" "Name=fstype,Value=xfs" \
    --statistic Average --period 300 --evaluation-periods 2 \
    --threshold 85 --comparison-operator GreaterThanThreshold \
    --alarm-actions "$TOPIC"

Quick quiz
An EC2 host hits a hardware fault and the instance becomes unreachable. Which single alarm, configured correctly, will recover it automatically with no human in the loop?
Keep learning
Dig deeper into EC2 monitoring, automated recovery, and tag-driven provisioning.
- AWS: Status checks for your EC2 instances. How system vs instance status checks work, and how to wire EC2 auto-recovery to a CloudWatch alarm.
- Installing the CloudWatch agent on EC2. The agent setup that unlocks memory and disk-space metrics; without it those alarms can't exist.
- CloudWatch Application Insights. Curated, workload-aware alarm bundles for Java, .NET, SQL Server, and other common stacks.
- CloudWatch Synthetics canaries. Endpoint-level monitoring that catches user-visible problems instance metrics miss.
You've completed Add CloudWatch alarms to EC2 instances. You know which four alarms form the baseline, why StatusCheckFailed_System is the highest-leverage one you can wire, when blanket coverage stops being economical, and how to make new instances inherit alarms automatically through tag-driven Lambda provisioning. The next time COV-001 lights up against an account, you'll have a four-step loop — audit, baseline, curate, automate — ready to run.