Monitoring

Add CloudWatch alarms to EC2 instances

An instance with no alarms tells you nothing when it breaks — wire the baseline four metrics and tag-driven automation.

13 min·10 sections·AWS

Last reviewed 27 May 2026

Alarm coverage: the basics

What does it mean for an EC2 instance to be "unmonitored"?

Every EC2 instance ships with two built-in status checks — system reachability and instance reachability — and a handful of free CloudWatch metrics: CPUUtilization, NetworkIn/Out, DiskReadOps/WriteOps, and the StatusCheckFailed family. The metrics exist whether or not anyone is looking at them. An alarm is what turns a metric into a page; without one, an instance can melt down for hours before anyone notices.

An "unmonitored" instance is one that emits those metrics but has zero CloudWatch alarms attached to it. The CPU can saturate, the kernel can panic, the EBS volume can fill — and the only signal is a customer complaint or a stalled deploy. AWS Trusted Advisor flags this under COV-001 ("EC2 Instances Without Alarms") because it's one of the highest-leverage gaps in any account: the fix is cheap, the cost of skipping it is operational pain that compounds.

The pattern is almost always the same. One-off boxes launched from the console for "a quick test" never get alarms. Older AMIs cloned into new VPCs miss the userdata that wires them up. Auto Scaling Group instances inherit alarms from their launch template, but the standalone EC2 sitting next to the ASG — the one that ended up running the cron job nobody documented — has nothing.

In this lesson you'll learn which four CloudWatch alarms every production EC2 instance should have, the difference between system and instance status checks, why the gap usually exists, and how to provision alarms automatically based on tags so new instances inherit the baseline on launch. You'll see the audit query that finds unmonitored instances, the CLI calls to create the four baseline alarms, and the tradeoff calculation for fleets where blanket alarm coverage stops being economical.

Fun fact

Auto-recovery is one alarm away

StatusCheckFailed_System isn't just a notification — wire it to an EC2 recovery action and AWS will automatically migrate the instance to new underlying hardware when the hypervisor fails, preserving the instance ID, private IP, Elastic IP, and all metadata. It's a one-line alarm definition that turns a silent hardware fault into a 3-minute self-heal. Most teams find out this feature exists about a week after a host failure has already cost them a midnight incident.

Adding alarm coverage in action

Marco runs platform engineering at a SaaS company. Trusted Advisor flags COV-001 against the production account with 47 EC2 instances missing alarms — including i-02ea2a0c3d34975b9, the "Seven Web Server 2026" box that handles the marketing site.

He's not surprised. The web server was spun up by a contractor 18 months ago, never made it into Terraform, and survived two team handovers without ever being audited. There's no point arguing about how it got there — he needs to know exactly which instances are bare and what their tags say so he can apply the right baseline.

He starts by cross-referencing every running instance against the alarms that have a matching InstanceId dimension.

First, find every running instance that has no alarm carrying its instance ID as a dimension. The trick is comparing two list outputs.

$ comm -23 <(aws ec2 describe-instances --filters Name=instance-state-name,Values=running --query 'Reservations[].Instances[].InstanceId' --output text | tr '\t' '\n' | sort) <(aws cloudwatch describe-alarms --query 'MetricAlarms[].Dimensions[?Name==`InstanceId`].Value' --output text | tr '\t' '\n' | sort -u)

i-02ea2a0c3d34975b9

i-0741b8e9f2c3d4a51

i-09c2d3e4f5a6b7891

i-0a1b2c3d4e5f67890

i-0f8e7d6c5b4a39281

# 47 instance IDs total — every one of them is running blind right now.

Set-subtraction across describe-instances and describe-alarms — the cheapest audit query AWS gives you.
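The set subtraction is easier to follow on its own. The sketch below runs the same comm -23 step on two small hand-written lists instead of live API output. The helper name unmonitored and the file names are made up for illustration; the IDs are borrowed from the sample output above.

```shell
# unmonitored RUNNING_FILE ALARMED_FILE
# Prints IDs present in RUNNING_FILE but absent from ALARMED_FILE.
# comm needs sorted input, so both lists are sorted into temp files first.
unmonitored() {
  run_sorted="$(mktemp)"
  alarm_sorted="$(mktemp)"
  sort -u "$1" > "$run_sorted"
  sort -u "$2" > "$alarm_sorted"
  comm -23 "$run_sorted" "$alarm_sorted"
  rm -f "$run_sorted" "$alarm_sorted"
}

# Illustrative data: three running instances, one of which already has an alarm.
workdir="$(mktemp -d)"
printf '%s\n' i-02ea2a0c3d34975b9 i-0741b8e9f2c3d4a51 i-0a1b2c3d4e5f67890 > "$workdir/running.txt"
printf '%s\n' i-0741b8e9f2c3d4a51 > "$workdir/alarmed.txt"

# The list is the remediation queue; the count is the number to trend.
unmonitored "$workdir/running.txt" "$workdir/alarmed.txt" > "$workdir/bare.txt"
bare_count="$(wc -l < "$workdir/bare.txt" | tr -d ' ')"
echo "$bare_count unmonitored instance(s)"
cat "$workdir/bare.txt"
```

In a real audit the two input files would come from describe-instances and describe-alarms, as in the one-liner above.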

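Now provision the four baseline alarms for the worst offender: CPU, both status checks, and (because the CloudWatch agent is installed) disk space. put-metric-alarm has no --instance-id or --topic shortcut flags, so the loop calls a small wrapper, here given the hypothetical name create-baseline-alarm.sh, around the full put-metric-alarm commands listed under "How do you close the coverage gap?".

$ INSTANCE=i-02ea2a0c3d34975b9; TOPIC=arn:aws:sns:eu-west-1:123456789012:ops-alerts; for a in cpu status-system status-instance disk; do ./create-baseline-alarm.sh "$INSTANCE" "$a" "$TOPIC"; done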

Creating i-02ea2a0c3d34975b9-cpu CPUUtilization > 85% for 3 of 5 minutes

Creating i-02ea2a0c3d34975b9-status-system StatusCheckFailed_System >= 1 → ec2:recover

Creating i-02ea2a0c3d34975b9-status-instance StatusCheckFailed_Instance >= 1 → SNS page

Creating i-02ea2a0c3d34975b9-disk disk_used_percent > 85% (CWAgent /)

# 4 alarms × $0.10 = $0.40/month per instance. 47 instances = $18.80/month for full coverage.

Baseline alarm set for one instance. The status-system alarm has EC2 auto-recovery as its action.

Alarms and status checks under the hood

EC2 runs two status checks every minute, on every instance, for free. The system status check covers the hypervisor and the underlying host — network connectivity, hardware faults, power. When it fails, the right response is almost always to move the instance to new hardware: that's what the ec2:recover alarm action does, and it works for any current-gen instance type with an EBS-only root volume. The instance status check covers everything inside the guest OS: out-of-memory kernel panics, exhausted ephemeral disk, misconfigured networking. AWS can't fix that for you — it's your problem to investigate.

Default CloudWatch metrics come from the hypervisor, which means CPU, network, and EBS volume IOPS are free and require zero agent. Memory and disk-space utilisation, on the other hand, are inside the guest — AWS cannot see them — so you need the CloudWatch agent installed and configured to push mem_used_percent and disk_used_percent to a custom namespace (typically CWAgent). No agent, no memory or disk alarms. This is the single most common reason an alarm "didn't fire" — the metric was never there to begin with.

Alarm pricing is straightforward: $0.10 per alarm-month for standard-resolution alarms, $0.30 for high-resolution. Composite alarms (which combine other alarms with a boolean expression) are $0.50 each; they change state only when the combined expression flips, not on every state change of the underlying alarms. On a small fleet the bill is invisible; on a 10,000-instance fleet, four alarms each comes to $4,000/month and the conversation shifts from "alarm everything" to "alarm what we'll actually act on."
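The fleet arithmetic can be sketched as a small shell function. This is an illustration only: the function name monthly_alarm_cost is hypothetical, and the prices are the ones quoted in the paragraph above, so check current pricing for your region before relying on the output.

```shell
# monthly_alarm_cost INSTANCES ALARMS_PER_INSTANCE [PRICE_CENTS]
# Prints the monthly alarm cost in dollars. Works in whole cents so plain
# integer shell arithmetic is enough. PRICE_CENTS defaults to 10 ($0.10).
monthly_alarm_cost() {
  instances="$1"
  per_instance="$2"
  price_cents="${3:-10}"
  total_cents=$((instances * per_instance * price_cents))
  printf '%d.%02d\n' $((total_cents / 100)) $((total_cents % 100))
}

monthly_alarm_cost 47 4        # the 47-instance audit, standard resolution
monthly_alarm_cost 10000 4     # the large-fleet case
monthly_alarm_cost 47 4 30     # the same 47 instances with high-resolution alarms
```

Passing the price as an argument keeps the same function usable for high-resolution or composite alarms without editing it.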

# The two built-in status checks — both are free metrics, both ship every minute.
aws cloudwatch list-metrics \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-02ea2a0c3d34975b9

aws cloudwatch list-metrics \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_Instance \
  --dimensions Name=InstanceId,Value=i-02ea2a0c3d34975b9

# Memory and disk metrics only exist if the CloudWatch agent is pushing them.
aws cloudwatch list-metrics \
  --namespace CWAgent \
  --dimensions Name=InstanceId,Value=i-02ea2a0c3d34975b9
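If the CWAgent query above comes back empty, the agent is either not installed or not configured to publish these metrics. As a rough sketch, assuming a Linux instance and a config file path chosen here purely for illustration, a minimal agent configuration that publishes memory and root-disk utilisation could be written like this:

```shell
# Write a minimal CloudWatch agent config that publishes memory and
# root-disk utilisation. The quoted 'EOF' keeps ${aws:InstanceId} literal
# so the agent, not the shell, substitutes the instance ID.
# CONFIG_FILE is an illustrative path, not a required location.
CONFIG_FILE="${CONFIG_FILE:-./amazon-cloudwatch-agent.json}"

cat > "$CONFIG_FILE" <<'EOF'
{
  "metrics": {
    "namespace": "CWAgent",
    "append_dimensions": { "InstanceId": "${aws:InstanceId}" },
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["/"] }
    }
  }
}
EOF

echo "wrote $CONFIG_FILE"
```

The agent would then be pointed at the file, for example with amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:<path>. By default the disk metric is published with path, device, and fstype dimensions in addition to InstanceId, so any alarm on disk_used_percent has to name the same dimension set.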

What is the impact of running EC2 without alarms?

The most direct impact is undetected failure. An instance can sit pinned at 100% CPU, swapping memory, or with a full root disk for hours — and the first signal is a customer ticket or a stalled batch job. Mean time to detect (MTTD) for an unmonitored failure is whatever the slowest external feedback loop is, which on a B2B SaaS is typically half a business day. By the time you know, you've already missed your SLO.

The second-order impact is that auto-recovery never gets to do its job. Underlying hardware does fail, and on a fleet of any size it is a matter of when rather than if. When it happens, a StatusCheckFailed_System alarm wired to ec2:recover resolves it in about three minutes with no human in the loop. Without the alarm, the host is just dead until someone notices and manually stops and starts the instance, which on a Sunday night usually means an hour-long incident.

The third-order impact is the compounding cost of "we don't actually know." Capacity planning, right-sizing, anomaly detection — all of them depend on a baseline of trusted metrics with at least some operational scrutiny. Instances without alarms tend to drift outside the visibility envelope: nobody's looking, so nobody notices when CPU has been at 95% for a month, or when the disk has been quietly filling at 200MB/day. The fix when it finally surfaces is always more expensive than the fix would have been when the alarm should have fired.

There's a financial impact too, but it cuts both ways. Five alarms across a 1,000-instance fleet is 5,000 alarms × $0.10 = $500/month for full coverage. On a multi-tenant Kubernetes platform with ten thousand short-lived nodes, that math stops working — at which point you move alarming up the stack (container health, ingress success rate, queue depth) and drop per-node alarms to just the status checks. The cost of monitoring is real; the cost of not monitoring is usually larger.

How do you close the coverage gap?

Closing the gap is a four-step loop. The first three are about getting to 100% coverage; the fourth makes sure new instances inherit it automatically instead of starting from zero every time.

1. Audit which instances have no alarms

Cross-reference describe-instances against describe-alarms filtered on the InstanceId dimension. Any running instance ID that doesn't appear in the alarm list is bare. Run this as a daily Lambda — the output is one number (count of unmonitored instances) and one S3 file (the list). Trend the number; if it grows, you have a process problem upstream of the alarm gap itself.

2. Define and apply the baseline four alarms

CPUUtilization > 85% sustained, StatusCheckFailed_System ≥ 1 (with ec2:recover as the action), StatusCheckFailed_Instance ≥ 1 (with SNS paging), and either disk_used_percent or mem_used_percent from the CloudWatch agent if it's installed. For workloads that need it, add a fifth — request latency, queue depth, or whatever the actual SLI is. Stop there: more alarms past the baseline make on-call worse, not better.

3. Use Systems Manager Quick Setup or Application Insights for curated bundles

AWS Systems Manager Quick Setup for CloudWatch Application Insights auto-discovers common stacks (Java EE, .NET, SQL Server, MySQL) on tagged instances and provisions a workload-aware alarm set, not just the OS-level metrics. Worth turning on for any fleet running supported software: Application Insights itself carries no extra charge (the alarms and metrics it creates are billed at normal CloudWatch rates), and the curated alarm definitions are better than what most teams write from scratch.

4. Provision alarms automatically via tags

Wire an EventBridge rule on EC2 RunInstances to a Lambda that reads the new instance's tags and creates the standard alarm set if Environment=prod (or whatever your convention is). This is the only sustainable answer — manual coverage is a treadmill, tag-driven coverage scales without intervention. For endpoint-level monitoring, CloudWatch Synthetics canaries hit the user-facing URL and alarm on the experience, which catches problems instance metrics never see.

# Provision the four baseline alarms for one instance. Repeat in a loop or wrap in a Lambda triggered on RunInstances.
INSTANCE=i-02ea2a0c3d34975b9
TOPIC=arn:aws:sns:eu-west-1:123456789012:ops-alerts
REGION=eu-west-1

# CPU: above 85% for 3 of the last 5 one-minute datapoints.
# One-minute periods need detailed monitoring; with basic (5-minute) monitoring use --period 300.
aws cloudwatch put-metric-alarm \
  --alarm-name "$INSTANCE-cpu" \
  --metric-name CPUUtilization --namespace AWS/EC2 \
  --dimensions "Name=InstanceId,Value=$INSTANCE" \
  --statistic Average --period 60 --evaluation-periods 5 --datapoints-to-alarm 3 \
  --threshold 85 --comparison-operator GreaterThanThreshold \
  --alarm-actions "$TOPIC"

# System status check: two failed minutes trigger EC2 auto-recovery and notify the topic.
aws cloudwatch put-metric-alarm \
  --alarm-name "$INSTANCE-status-system" \
  --metric-name StatusCheckFailed_System --namespace AWS/EC2 \
  --dimensions "Name=InstanceId,Value=$INSTANCE" \
  --statistic Maximum --period 60 --evaluation-periods 2 \
  --threshold 0 --comparison-operator GreaterThanThreshold \
  --alarm-actions "arn:aws:automate:$REGION:ec2:recover" "$TOPIC"

# Instance status check: a guest-OS problem AWS cannot fix, so page a human.
aws cloudwatch put-metric-alarm \
  --alarm-name "$INSTANCE-status-instance" \
  --metric-name StatusCheckFailed_Instance --namespace AWS/EC2 \
  --dimensions "Name=InstanceId,Value=$INSTANCE" \
  --statistic Maximum --period 60 --evaluation-periods 2 \
  --threshold 0 --comparison-operator GreaterThanThreshold \
  --alarm-actions "$TOPIC"

# Disk: requires the CloudWatch agent. The dimension set must match exactly what the agent
# publishes (by default it also includes device); confirm with list-metrics --namespace CWAgent.
aws cloudwatch put-metric-alarm \
  --alarm-name "$INSTANCE-disk" \
  --metric-name disk_used_percent --namespace CWAgent \
  --dimensions "Name=InstanceId,Value=$INSTANCE" Name=path,Value=/ Name=fstype,Value=xfs \
  --statistic Average --period 300 --evaluation-periods 2 \
  --threshold 85 --comparison-operator GreaterThanThreshold \
  --alarm-actions "$TOPIC"
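Step 4, the tag-driven decision, can be sketched in the same shell style. This is a sketch under assumptions rather than a finished Lambda: the function names are hypothetical, create_baseline_alarms is a placeholder for the four put-metric-alarm calls above, and Environment=prod is just the example convention used in this lesson. The same logic would live in the Lambda handler in whichever runtime you use.

```shell
# is_prod TAG_VALUE: succeeds only for the production tag value.
is_prod() {
  [ "$1" = "prod" ]
}

# environment_tag INSTANCE_ID: prints the Environment tag value, or "None"
# when the tag is absent. Needs AWS CLI credentials at call time.
environment_tag() {
  aws ec2 describe-tags \
    --filters "Name=resource-id,Values=$1" "Name=key,Values=Environment" \
    --query 'Tags[0].Value' --output text
}

# create_baseline_alarms INSTANCE_ID: placeholder for the four
# put-metric-alarm calls listed above.
create_baseline_alarms() {
  echo "would create baseline alarms for $1"
}

# provision_if_prod INSTANCE_ID: the decision the automation makes on launch.
provision_if_prod() {
  env_value="$(environment_tag "$1")"
  if is_prod "$env_value"; then
    create_baseline_alarms "$1"
  else
    echo "skipping $1 (Environment=$env_value)"
  fi
}
```

Calling provision_if_prod with a new instance ID either provisions the baseline or records why it was skipped, which keeps intentionally bare instances visible instead of silent.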

Quick quiz

Question 1 of 5

An EC2 host hits a hardware fault and the instance becomes unreachable. Which single alarm, configured correctly, will recover it automatically with no human in the loop?

Keep learning

Dig deeper into EC2 monitoring, automated recovery, and tag-driven provisioning.

You've completed Add CloudWatch alarms to EC2 instances. You know which four alarms form the baseline, why StatusCheckFailed_System is the highest-leverage one you can wire, when blanket coverage stops being economical, and how to make new instances inherit alarms automatically through tag-driven Lambda provisioning. The next time COV-001 lights up against an account, you'll have a four-step loop — audit, baseline, curate, automate — ready to run.


EC2 alarm coverage: what it costs and why it matters

A predictable $0.10-per-alarm-per-month spend that eliminates undetected failures

Every EC2 instance continuously generates free metrics — CPU, status checks, network. Without a CloudWatch alarm attached, those metrics are invisible: the instance can be broken for hours and the only signal is a customer complaint. AWS Trusted Advisor flags this as COV-001. The fix is attaching alarms, and the cost is unusually transparent: each standard-resolution alarm costs $0.10 per month, flat.

The finance framing is a coverage question: four baseline alarms per instance at $0.40/instance/month gives you CPU saturation, hardware fault detection, OS-level failure detection, and disk pressure. On a 100-instance fleet that's $40/month for end-to-end coverage. The alternative — zero alarms — is not a saving, it's an uninsured exposure. An undetected failure on a production instance during business hours typically costs multiples of the entire fleet's annual monitoring budget in incident response and revenue impact.

This control is also a good lens on fleet hygiene. The instances most likely to be unmonitored are the ones nobody owns: contractor-built boxes, cloned AMIs, instances that outlived their original project. A systematic alarm-coverage program surfaces these orphans as a byproduct — and decommissioning a handful of them usually recovers more cost than the alarms themselves add.

This lesson is for the finance partner who sees CloudWatch on the cloud bill and wants to understand what alarm coverage actually buys — and what it costs. You'll get the per-alarm unit economics ($0.10/month), the four-alarm baseline that every production instance should carry, the breakeven logic for large fleets where per-node alarms stop making sense, and the governance framing: which instances are covered, which are intentionally bare, and what a recorded exception looks like. No CLI knowledge required.

How a finance partner frames the alarm coverage decision

Priya is the finance partner for the infrastructure team. At the quarterly review, the Trusted Advisor report flags 47 EC2 instances with COV-001 — zero alarms attached. Her first instinct is not to approve blanket remediation; it's to ask a tiering question: which of these 47 instances are production workloads where an undetected failure actually costs money, and which are dev boxes where the impact is inconvenience at most?

The team pulls the instance list with environment tags. Thirty-one are tagged Environment=prod, including the marketing web server, the API gateway nodes, and several data processing instances. The remaining sixteen are dev, staging, or untagged. Priya agrees the four-alarm baseline on the 31 production instances is straightforward: 31 × $0.40/month = $12.40/month for full production coverage. The dev instances she treats as a separate conversation — some should be decommissioned entirely, which would save more than the monitoring would cost.

Her output for the finance pack is simple: "$12.40/month closes the production monitoring gap. Separately, 16 dev/untagged instances are being reviewed for decommission." The alarm cost is trivially approved; the real value Priya drives is surfacing that 16 instances of uncertain ownership are costing real compute spend with no monitoring and possibly no active use.

What missing alarm coverage costs — and what fixing it costs

The direct financial exposure from an unmonitored instance is the elapsed time between failure and detection multiplied by the business cost of that failure. On a production B2B application, an undetected failure that surfaces six hours later via a customer ticket typically means missed SLOs, support escalation, and engineering incident response — costs that stack up quickly against a $0.10/month alarm that would have paged someone within five minutes.

The self-healing angle has a hard dollar value too. The StatusCheckFailed_System alarm with an ec2:recover action is the cheapest form of business continuity available in AWS: free status-check metric plus a $0.10 alarm triggers automatic hardware migration in about three minutes, with no on-call engineer in the loop. A 3 a.m. hardware fault on a monitored, auto-recovering instance is a non-event. The same fault on an unmonitored instance is an incident with an on-call page, a manual restart, and whatever downstream impact accumulated during the gap.

The cost math scales predictably with fleet size, which makes it easy to model. Four alarms per instance at $0.40/month: a 100-instance production fleet is $40/month, a 500-instance fleet is $200/month, a 1,000-instance fleet is $400/month. These are not material monitoring costs. The crossover point where you'd reconsider blanket per-instance alarms is in the tens of thousands of short-lived nodes — at which point you restructure alarming up the stack rather than eliminating it.

A secondary financial benefit of a coverage program is fleet hygiene. The instances most likely to show up bare in a COV-001 audit are also the most likely to be orphaned compute spend — nobody owns them, nobody decommissions them. Closing the alarm gap often surfaces instances that should be terminated, recovering EC2 spend that dwarfs the monitoring budget.

What finance can do about the alarm coverage gap

Finance can't write the alarms, but it can set the framing that turns alarm coverage from an engineering to-do into a governed spend with predictable cost and clear accountability. Four levers.

1. Budget the monitoring premium per environment tier

Agree with engineering that production instances are budgeted with the four-alarm baseline included — $0.40/instance/month is the standard line. Lower environments are either covered at a reduced baseline or not at all, by design. This makes the monitoring cost a planned input rather than a surprise on the CloudWatch bill, and gives finance a clear expectation to check against.

2. Track COV-001 failures as a production-instance count, not a raw number

Total unmonitored instances is the wrong metric — it conflates production gaps with dev instances nobody cares about. The number that matters is unmonitored production instances: any non-zero value here is an uninsured operational risk that has a dollar cost when it fires. Put this on the security-and-cost review with the estimated remediation cost alongside it.

3. Require documented exceptions for intentionally bare instances

Any production instance left without the alarm baseline should carry a recorded, finance-visible reason — not a silently suppressed finding. The distinction between 'we reviewed this and accepted the risk' and 'we never looked at it' is what makes the alarm-coverage picture defensible at audit or in a post-incident review.

4. Use decommission opportunities found during the audit

Unmonitored instances are disproportionately orphaned compute — no owner, no active use. The COV-001 audit query produces a list that's also a decommission candidate list. Recovering the EC2 cost of even a handful of forgotten instances typically pays for the entire fleet's alarm coverage several times over, making this a cost-neutral or cost-positive remediation.

Quick quiz

Question 1 of 5

A COV-001 audit finds 60 EC2 instances with no alarms: 20 production, 25 dev/test, and 15 untagged with no clear owner. As the finance partner, what's the right approach?

Keep learning

Dig deeper into EC2 monitoring, automated recovery, and tag-driven provisioning.

You've finished the finance view of EC2 alarm coverage. You know the unit economics ($0.10/alarm/month, $0.40/instance for the baseline four), why a COV-001 audit is also a decommission-candidate list, and the four levers — tier-based budgeting, production-instance coverage rate, documented exceptions, and decommission recovery — that make this a cost-neutral or cost-positive fix. Next time it comes up, you'll ask a sharper question than 'how much will monitoring cost?'


EC2 alarm coverage: the headline

Whether broken instances fail silently or page someone

EC2 instances generate health signals continuously. Without alarms, those signals go nowhere — a server can fail, saturate, or fill its disk and the first anyone knows is a customer complaint. AWS Trusted Advisor tracks this gap as COV-001. Closing it means wiring four baseline alarms per instance at roughly $0.40/instance/month.

The leadership question is straightforward: which instances are running in production without basic failure detection? The answer on most accounts is not zero — it's typically a mix of forgotten boxes, old clones, and standalone instances that escaped the standard provisioning process. This control forces those gaps into the open. The cost of fixing them is predictable and small; the cost of leaving them is periodic operational surprises at the worst possible times.

A short read for the leader who wants to know what COV-001 is signalling and what the one accountability question is. You'll understand why unmonitored instances are an operational risk rather than just a technical gap, what the alarm coverage rate tells you about fleet hygiene, and what the right end state looks like: production instances with baseline detection, intentional exceptions documented, and new instances automatically wired from day one.

What it looks like when alarm coverage is policy, not accident

After a late-night incident where a production web server had been pegged at 100% CPU for six hours before a customer complaint surfaced it, the CTO asked a simple question: why did nobody know? The answer was that there were no alarms — the instance had been launched manually and never wired into the standard monitoring stack.

The team's response was to treat alarm coverage as a policy question rather than a per-instance engineering task. Production instances get the four-alarm baseline automatically on launch via tag-driven Lambda provisioning. Any instance without that baseline shows up on the weekly COV-001 report and is remediated or documented within a week.

Two months later the same CTO asked the same question in a different form: "if a server broke right now, how quickly would we know?" The answer had changed from 'when a customer tells us' to 'within five minutes.' That's the shift this control enables — and it's the one-line confidence signal that belongs on any operational review.

Why this matters above the technical level

Unmonitored EC2 instances are a governance gap as much as an engineering one. The question isn't whether the team is competent — it's whether the systems that carry business risk have the basic detection infrastructure to give anyone a chance to respond. A production instance with zero alarms means that when it fails, the business is relying on luck, customer complaints, or a stalled batch job to surface it. That's not a posture; it's an accident waiting to happen.

The auto-recovery capability makes this especially stark for leadership. A hardware fault on a properly alarmed instance resolves automatically in three minutes. The same fault on an unmonitored instance becomes a midnight call, a manual restart, and a post-incident review. The people who pay for that incident — in overtime, in customer trust, in SLA credits — are rarely the same people who left the alarm off when the instance was launched.

This control is also a proxy for how mature the provisioning process is. Accounts with high alarm-coverage rates have standardised their instance launch process so monitoring is applied automatically, not manually remembered. Accounts with dozens of COV-001 failures have a process gap upstream: instances are being launched by hand, by contractors, or from old AMIs outside any standard. That upstream gap is worth more executive attention than the alarm count itself.

The leadership move on EC2 alarm coverage

The executive handle isn't to approve every alarm individually — it's to require that production instances carry basic failure detection by default and that any exception is a deliberate, recorded decision.

1. Set a default: production instances get the baseline on launch

Make it policy that anything tagged as production is provisioned with at least the four-alarm baseline — CPU, hardware fault (with auto-recovery), OS failure, and disk pressure. The right place to enforce this is the provisioning process, not a manual checklist. Tag-driven Lambda provisioning means the policy executes automatically and the COV-001 count is an exception report, not a to-do list.

2. Accept intentional gaps for low-stakes environments

Don't mandate alarms on every dev and test instance — that's waste. The goal is right coverage for each tier, not uniform spend. Dev instances that are intentionally bare are fine; production instances that are accidentally bare are not. The distinction is whether the decision was made or just overlooked.

3. Ask for the coverage rate on production instances

At the operational review the one question worth asking is: 'What percentage of our production EC2 instances have at least the four-alarm baseline, and what are the exceptions?' A coverage rate consistently above 95% with documented exceptions is a healthy signal. A rate below that without a clear plan is an accountability conversation.

Quick quiz

Question 1 of 5

Your cloud operations report shows production EC2 alarm coverage at 72% — 28% of production instances have no alarms. The team says it's because instances launched before the tag-driven automation was in place were never backfilled. What's the right response?

Keep learning

Dig deeper into EC2 monitoring, automated recovery, and tag-driven provisioning.

Two takeaways: an unmonitored production instance is an uninsured operational risk, and the right metric isn't the count of alarms — it's the percentage of production instances with the baseline covered and every exception on the record. The four-alarm baseline at $0.40/instance/month is never the budget conversation; the process gap that leaves instances bare is.


Part of the learning path Get your alarms right
