RDS alarms: the basics
What does it mean for an RDS instance to be "unalarmed"?
Amazon RDS publishes a deep set of metrics to CloudWatch by default — CPU, free storage, connections, IOPS, latency, replica lag — but it does not create a single alarm on any of them. Every alarm is opt-in. An RDS instance can be running in production for years emitting healthy metrics and zero notifications: AWS shows you the line on a graph, it doesn't tell you when the line goes bad.
An "unalarmed" RDS instance is one without alarms on the handful of metrics that catch the incidents you actually care about. The classic symptom is hearing about the database failing from your application team — "the API is timing out" — instead of from your monitoring stack. By the time the app is timing out you've usually already taken customer-visible downtime and lost the early-warning window when a simple action (kill a runaway query, add storage, restart a connection pool) would have prevented the outage.
The cloudwatchOps check COV-002 flags this exact pattern: any DBInstance with no CloudWatch alarms attached to its core metrics. Severity is CRITICAL because the failure mode isn't "slow" — it's "silent until the application breaks," which is the worst kind of outage to debug at 3am with no signal pointing at the actual cause.
In this lesson you'll learn the five RDS alarms that catch the overwhelming majority of database incidents, why the disk-space alarm is the highest-leverage of the lot, how to extend the set for Aurora and Multi-AZ deployments, and how to bulk-create the standard alarm set across an entire fleet of existing instances. You'll see the actual CloudWatch CLI calls and a Lambda pattern for auto-attaching alarms whenever a new DBInstance is created.
Disk-full is RDS's silent killer
When an RDS instance runs out of storage it doesn't crash: its status changes to storage-full, it becomes read-only, and it refuses every write until you grow the volume. The application sees writes hang, then fail; reads keep working for a while, which makes the incident look like a partial outage and sends responders chasing the wrong thing. AWS has offered storage autoscaling for years, and still the single most common preventable RDS outage is a database that quietly filled up overnight because no one had an alarm on FreeStorageSpace.
Wiring the standard alarm set in action
Priya runs platform reliability at a fintech. A cloudwatchOps scan returns 47 RDS instances across three regions with no CloudWatch alarms attached at all — every one of them production or production-adjacent. Severity CRITICAL on each.
She doesn't start with all 47. She picks one — db-prod-payments-1, the busiest writer — and works out the right alarm set for it first. The same set will then go onto every instance via a small Lambda, but the thresholds need to be sensible before they get fanned out, otherwise she's just trading silence for alarm fatigue.
She starts by listing the existing alarms (expecting none) and the metric baseline for the last 14 days.
First, confirm there genuinely are no alarms on the instance. The CloudWatch describe-alarms-for-metric call is per-metric, so check the high-value metrics one at a time.
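A minimal sketch of that per-metric check, assuming AWS CLI v2 with credentials and a default region already configured. The function name check_alarm_coverage and the choice of five metrics for a writer instance (ReplicaLag only applies to replicas) are illustrative assumptions, not part of any standard tooling.

```shell
# Core metrics to check on a writer instance (assumed set; adjust per workload).
CORE_METRICS="CPUUtilization DatabaseConnections FreeStorageSpace ReadLatency WriteLatency"

# Hypothetical helper: print the alarm names attached to each core metric.
check_alarm_coverage() {
  instance_id="$1"
  for metric in $CORE_METRICS; do
    names=$(aws cloudwatch describe-alarms-for-metric \
      --namespace AWS/RDS \
      --metric-name "$metric" \
      --dimensions "Name=DBInstanceIdentifier,Value=$instance_id" \
      --query 'MetricAlarms[].AlarmName' \
      --output text)
    # Nothing after the colon means no alarm watches this metric.
    echo "$metric: $names"
  done
}
```

Calling check_alarm_coverage db-prod-payments-1 prints one line per metric; an instance with no coverage shows nothing after each colon.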
Five blank lines, five missing alarms. The disk-space gap is the one that ends careers.
Now create the standard set in one go. Thresholds are conservative defaults — tighten or loosen per workload after a week of observation.
FreeStorageSpace alarm at 20 GiB — the single highest-leverage RDS alarm you'll ever create.
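The call behind that alarm might look like the sketch below. The function name, the alarm naming scheme, the Minimum statistic, and the three 5-minute evaluation periods are assumptions chosen for illustration; the SNS topic ARN is whatever paging topic you already use.

```shell
# Hypothetical helper: alarm when FreeStorageSpace drops below 20 GiB.
# FreeStorageSpace is reported in bytes, so the threshold is 20 * 1024^3.
create_free_storage_alarm() {
  instance_id="$1"
  sns_topic_arn="$2"
  threshold_bytes=$((20 * 1024 * 1024 * 1024))
  aws cloudwatch put-metric-alarm \
    --alarm-name "rds-${instance_id}-free-storage-low" \
    --alarm-description "FreeStorageSpace below 20 GiB on ${instance_id}" \
    --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --dimensions "Name=DBInstanceIdentifier,Value=${instance_id}" \
    --statistic Minimum \
    --period 300 \
    --evaluation-periods 3 \
    --threshold "$threshold_bytes" \
    --comparison-operator LessThanThreshold \
    --treat-missing-data breaching \
    --alarm-actions "$sns_topic_arn"
}
```

A fixed 20 GiB floor suits mid-sized volumes; on very small or very large volumes a percentage of allocated storage is usually a better fit.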
RDS alarms under the hood (deep dive)
RDS publishes core metrics to CloudWatch at 1-minute resolution by default, in the AWS/RDS namespace, keyed per instance by the DBInstanceIdentifier dimension. A CloudWatch alarm is a rule that evaluates a metric statistic over a defined period (typically 60 or 300 seconds), compares the data points against a threshold for a number of consecutive periods (EvaluationPeriods), and moves between the OK, ALARM, and INSUFFICIENT_DATA states based on the result. Transitions into ALARM fire AlarmActions, usually an SNS topic, which fan out to PagerDuty, Slack, email, or a Lambda.
The five alarms that catch the most incidents are: CPUUtilization > 80% sustained (query inefficiency or a genuine need to scale up), DatabaseConnections above roughly 80% of max_connections (connection leak or pool too small), FreeStorageSpace < 20% of allocated (the silent killer; the metric is reported in bytes, so the percentage has to be converted to an absolute byte threshold per instance), ReadLatency/WriteLatency p99 above a workload-specific baseline (IOPS contention or slow queries), and ReplicaLag > 60s for any read replica. Aurora adds AuroraReplicaLag and Aurora-specific CPU/IO metrics; alarm on those too, at the cluster level rather than per-instance.
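CloudWatch alarm thresholds are absolute numbers, so "20% of allocated" needs a small conversion. The sketch below assumes the instance's AllocatedStorage (reported in GiB by describe-db-instances) is the right baseline; both function names are hypothetical.

```shell
# Hypothetical helper: convert "N% of allocated storage" into a byte threshold.
# args: allocated storage in GiB, percentage (defaults to 20)
free_storage_threshold_bytes() {
  allocated_gib="$1"
  percent="${2:-20}"
  echo $(( allocated_gib * 1024 * 1024 * 1024 * percent / 100 ))
}

# Hypothetical helper: look up AllocatedStorage (GiB) for one instance.
allocated_storage_gib() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].AllocatedStorage' \
    --output text
}
```

For a 500 GiB volume, free_storage_threshold_bytes 500 yields 107374182400, which is 100 GiB expressed in bytes.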
The trick people miss: --treat-missing-data matters. The default behaviour is missing, which means an alarm on an instance that stops emitting metrics entirely (because it's been deleted, stuck, or detached from CloudWatch) drifts to INSUFFICIENT_DATA rather than ALARM, and nobody gets paged. For FreeStorageSpace and CPU, set it to breaching so that missing data is treated as a failure. Multi-AZ failovers also emit RDS events (not CloudWatch metrics), so subscribe to RDS event categories failover, failure, and maintenance via SNS to catch the control-plane side.
# Subscribe to RDS event categories for control-plane signals (Multi-AZ failover, etc.).
aws rds create-event-subscription \
  --subscription-name rds-prod-events \
  --sns-topic-arn arn:aws:sns:eu-west-1:123456789012:rds-alerts \
  --source-type db-instance \
  --event-categories failover failure maintenance availability \
  --enabled
# List which event categories exist (handy for filtering noise out of the subscription).
aws rds describe-event-categories --source-type db-instance
What is the impact of running RDS without alarms?
The direct impact is detection latency. Without alarms, the first signal of a database problem is usually a customer complaint or an application-tier error rate spike — both of which arrive several minutes (sometimes hours) after the underlying RDS metric started telling the story. Every minute of detection latency is a minute of customer-visible failure that could have been prevented or shortened.
The disk-full case is the textbook example. FreeStorageSpace usually trends downward predictably, declining steadily over a week or two as logs and tables grow, and an alarm at 20% gives you days of lead time to investigate, archive, or extend. Without the alarm the first symptom is the database flipping to read-only at 100% full, which produces a partial outage that looks like an application bug. Recovery then typically takes far longer than the prevented path would have, because responders chase the wrong layer first.
The second-order impact is on-call quality. Engineers who get paged by a database problem via "the app is timing out" don't know whether it's the app, the network, the load balancer, or the database — every page becomes an investigation rather than a fix. Once standard RDS alarms are in place, the page itself tells you the cause: "FreeStorageSpace LOW" is a different runbook from "CPU HIGH," and the on-call engineer can be productive in 30 seconds instead of 30 minutes.
On the regulatory and audit side, frameworks like SOC 2, ISO 27001, and PCI DSS expect a documented monitoring posture for systems holding regulated data. "We have CloudWatch dashboards" doesn't satisfy that — auditors want to see alarms wired to a paging destination with a documented response SLA. An unalarmed production database is an audit finding waiting to happen, regardless of whether you've had a real incident yet.
How do you safely roll out RDS alarms across a fleet?
Closing the alarm-coverage gap is a four-step loop. It mirrors the EC2 coverage problem: instances created via console, by Terraform without alarm modules, or by app teams who didn't know to set them — same gap, different service.
1. Inventory every DBInstance and check for the standard alarm set
Use aws rds describe-db-instances to enumerate every instance across every region, then aws cloudwatch describe-alarms-for-metric to check coverage on the five core metrics. Don't trust Terraform state or runbook spreadsheets — go to the source of truth. The output is your remediation backlog, sorted by environment (prod first), then engine (Aurora gets cluster-level alarms in addition).
2. Bulk-create the standard set with sensible defaults
Wrap put-metric-alarm in a small script (the CLI walkthrough above is the shape) and run it across every unalarmed instance. Defaults: CPU > 80% for 15 min, FreeStorageSpace < 20% of allocated, DatabaseConnections > 80% of max_connections (look it up per instance, since it scales with instance class), ReadLatency/WriteLatency p99 > 100 ms sustained (the metrics are reported in seconds, so the threshold is 0.1), ReplicaLag > 60s. Send all of them to one SNS topic initially and split routing later once you know the noise floor.
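One possible shape for that wrapper is sketched below. The function names, alarm naming scheme, and argument order are hypothetical. As a simplification, the latency alarms use the Average statistic; for p99 you would swap --statistic for --extended-statistic p99. ReplicaLag is left out because it only exists on read replicas, but it follows the same pattern.

```shell
# Hypothetical helper: one put-metric-alarm call per metric, 5-minute periods.
# args: instance_id metric statistic comparison threshold eval_periods topic_arn [missing_data]
put_rds_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "rds-$1-$2" \
    --namespace AWS/RDS \
    --metric-name "$2" \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --statistic "$3" \
    --comparison-operator "$4" \
    --threshold "$5" \
    --period 300 \
    --evaluation-periods "$6" \
    --treat-missing-data "${8:-missing}" \
    --alarm-actions "$7"
}

# Hypothetical helper: attach the standard set to one instance.
# args: instance_id topic_arn max_connections free_storage_threshold_bytes
create_standard_alarms() {
  instance_id="$1"; topic_arn="$2"; max_connections="$3"; free_storage_bytes="$4"
  conn_threshold=$(( max_connections * 80 / 100 ))
  # CPU > 80% for 15 minutes (3 periods of 300 s).
  put_rds_alarm "$instance_id" CPUUtilization Average GreaterThanThreshold 80 3 "$topic_arn" breaching
  put_rds_alarm "$instance_id" DatabaseConnections Maximum GreaterThanThreshold "$conn_threshold" 3 "$topic_arn"
  put_rds_alarm "$instance_id" FreeStorageSpace Minimum LessThanThreshold "$free_storage_bytes" 3 "$topic_arn" breaching
  # Latency metrics are in seconds: 0.1 = 100 ms.
  put_rds_alarm "$instance_id" ReadLatency Average GreaterThanThreshold 0.1 3 "$topic_arn"
  put_rds_alarm "$instance_id" WriteLatency Average GreaterThanThreshold 0.1 3 "$topic_arn"
}
```

Because put-metric-alarm overwrites an alarm with the same name, rerunning the script after adjusting thresholds updates the existing alarms rather than duplicating them.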
3. Add control-plane and Performance Insights coverage on top
CloudWatch metrics only show you the data plane. Subscribe to RDS event categories (failover, failure, maintenance) via create-event-subscription so Multi-AZ failovers, instance failures, and maintenance actions also page. Snapshot and parameter-group events belong to their own source types, so add separate subscriptions if you want those to page as well. For Aurora and any high-traffic instance, enable Performance Insights and alarm on the database load it reports, using top-N waits and top SQL by DB load as the drill-down; those signals tend to surface slow-query problems before they show up as CPU or latency anomalies.
4. Close the provisioning gap with a tag-based Lambda
Same pattern as EC2 coverage: wire an EventBridge rule on aws.rds -> CreateDBInstance to a Lambda that reads the new instance's tags and creates the standard alarm set automatically. Add a custom AWS Config rule for alarm coverage to detect any instance drifting back into the unalarmed state; the managed RDS rules (db-instance-backup-enabled, rds-cluster-iam-authentication-enabled, and similar) cover backups and authentication, not alarms. The Lambda handles new instances; Config flags anyone who deletes the alarms after the fact.
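The EventBridge half of step 4 can be sketched as follows, assuming CloudTrail is recording management events in the region (CreateDBInstance reaches EventBridge as an "AWS API Call via CloudTrail" event). The function name, rule name, target Id, and the Lambda itself are hypothetical.

```shell
# Event pattern matching the CreateDBInstance API call recorded by CloudTrail.
RDS_CREATE_PATTERN='{"source":["aws.rds"],"detail-type":["AWS API Call via CloudTrail"],"detail":{"eventSource":["rds.amazonaws.com"],"eventName":["CreateDBInstance"]}}'

# Hypothetical helper: create the rule and point it at the alarm-attaching Lambda.
# args: rule_name lambda_arn
wire_create_rule() {
  rule_name="$1"
  lambda_arn="$2"
  aws events put-rule \
    --name "$rule_name" \
    --event-pattern "$RDS_CREATE_PATTERN" \
    --state ENABLED
  aws events put-targets \
    --rule "$rule_name" \
    --targets "Id=alarm-attacher,Arn=$lambda_arn"
}
```

The Lambda also needs a resource-based permission allowing events.amazonaws.com to invoke it (aws lambda add-permission), otherwise the rule matches but the function never runs.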
# Find every RDS instance across every region with zero alarms on FreeStorageSpace.
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
  for id in $(aws rds describe-db-instances --region "$region" \
      --query 'DBInstances[].DBInstanceIdentifier' --output text); do
    count=$(aws cloudwatch describe-alarms-for-metric \
      --region "$region" \
      --namespace AWS/RDS \
      --metric-name FreeStorageSpace \
      --dimensions "Name=DBInstanceIdentifier,Value=$id" \
      --query 'length(MetricAlarms)' --output text)
    # Treat an empty result as zero so a failed lookup surfaces as UNALARMED.
    [ "${count:-0}" -eq 0 ] && echo "UNALARMED: $region/$id"
  done
done
# Feed the list into a put-metric-alarm loop with your standard thresholds.
Quick quiz
Question 1 of 5: You're rolling out the standard alarm set across 47 unalarmed RDS instances. Which single alarm should you prioritize because its absence causes the most preventable major incidents?
Keep learning
Dig deeper into RDS observability and the CloudWatch tooling around it.
- Amazon RDS: Monitoring metrics with CloudWatch. Full metric reference for AWS/RDS, including which metrics are emitted per engine.
- Amazon RDS event categories and notification subscriptions. Control-plane events (failover, failure, maintenance) that don't show up as CloudWatch metrics.
- Amazon RDS Performance Insights. Top-N waits and SQL-by-DB-load signals that catch slow queries before they become CPU alarms.
- AWS Well-Architected, Reliability Pillar: Monitoring. Where standard alarm coverage fits in the broader reliability operating model.
You've completed Add CloudWatch alarms to RDS instances. You now know the five alarms that catch the overwhelming majority of database incidents, why FreeStorageSpace is the single highest-leverage one, how to extend coverage with RDS event subscriptions and Performance Insights, and how to close the provisioning gap with a tag-based Lambda. The next time a cloudwatchOps scan flags COV-002 across a fleet, you'll have a four-step loop ready to run: inventory, bulk-create, extend, prevent recurrence.