Add CloudWatch alarms to RDS instances

Without alarms, the first sign that RDS has failed is your application timing out. Wire the standard set so the database tells on itself first.

Last reviewed 27 May 2026

RDS alarms: the basics

What does it mean for an RDS instance to be "unalarmed"?

Amazon RDS publishes a deep set of metrics to CloudWatch by default — CPU, free storage, connections, IOPS, latency, replica lag — but it does not create a single alarm on any of them. Every alarm is opt-in. An RDS instance can be running in production for years emitting healthy metrics and zero notifications: AWS shows you the line on a graph, it doesn't tell you when the line goes bad.

An "unalarmed" RDS instance is one without alarms on the handful of metrics that catch the incidents you actually care about. The classic symptom is hearing about the database failing from your application team — "the API is timing out" — instead of from your monitoring stack. By the time the app is timing out you've usually already taken customer-visible downtime and lost the early-warning window when a simple action (kill a runaway query, add storage, restart a connection pool) would have prevented the outage.

The cloudwatchOps check COV-002 flags this exact pattern: any DBInstance with no CloudWatch alarms attached to its core metrics. Severity is CRITICAL because the failure mode isn't "slow" — it's "silent until the application breaks," which is the worst kind of outage to debug at 3am with no signal pointing at the actual cause.

In this lesson you'll learn the five RDS alarms that catch the overwhelming majority of database incidents, why the disk-space alarm is the highest-leverage of the lot, how to extend the set for Aurora and Multi-AZ deployments, and how to bulk-create the standard alarm set across an entire fleet of existing instances. You'll see the actual CloudWatch CLI calls and a Lambda pattern for auto-attaching alarms whenever a new DBInstance is created.

Fun fact

Disk-full is RDS's silent killer

When an RDS instance runs out of storage it doesn't crash. Its status changes to storage-full, it effectively becomes read-only, and it refuses every write until you grow the volume. The application sees writes hang, then fail; reads keep working for a while, which makes the incident look like a partial outage and sends responders chasing the wrong thing. AWS has offered storage autoscaling for years, and still one of the most common preventable RDS outages is a database that quietly filled up overnight because no one had an alarm on FreeStorageSpace.

Wiring the standard alarm set in action

Priya runs platform reliability at a fintech. A cloudwatchOps scan returns 47 RDS instances across three regions with no CloudWatch alarms attached at all — every one of them production or production-adjacent. Severity CRITICAL on each.

She doesn't start with all 47. She picks one — db-prod-payments-1, the busiest writer — and works out the right alarm set for it first. The same set will then go onto every instance via a small Lambda, but the thresholds need to be sensible before they get fanned out, otherwise she's just trading silence for alarm fatigue.

She starts by listing the existing alarms (expecting none) and the metric baseline for the last 14 days.

First, confirm there genuinely are no alarms on the instance. The CloudWatch describe-alarms-for-metric call is per-metric, so check the high-value metrics one at a time.

$ for m in CPUUtilization FreeStorageSpace DatabaseConnections ReadLatency WriteLatency; do echo "== $m =="; aws cloudwatch describe-alarms-for-metric --namespace AWS/RDS --metric-name $m --dimensions Name=DBInstanceIdentifier,Value=db-prod-payments-1 --query 'MetricAlarms[*].AlarmName' --output text; done

== CPUUtilization ==

== FreeStorageSpace ==

== DatabaseConnections ==

== ReadLatency ==

== WriteLatency ==

# Zero alarms on every metric. This is what COV-002 means by 'unalarmed'.

Five blank lines, five missing alarms. The disk-space gap is the one that ends careers.

Now create the standard set in one go. Thresholds are conservative defaults — tighten or loosen per workload after a week of observation.

$ aws cloudwatch put-metric-alarm --alarm-name rds-db-prod-payments-1-free-storage --metric-name FreeStorageSpace --namespace AWS/RDS --statistic Average --dimensions Name=DBInstanceIdentifier,Value=db-prod-payments-1 --period 300 --evaluation-periods 2 --threshold 21474836480 --comparison-operator LessThanThreshold --alarm-actions arn:aws:sns:eu-west-1:123456789012:rds-alerts --treat-missing-data breaching

# Threshold = 20 GiB in bytes (20 × 1024³ = 21474836480). Fires when free space stays below 20 GiB for 10 min (2 periods × 300 s).

# (No output on success — describe-alarms confirms it.)

$ aws cloudwatch describe-alarms --alarm-names rds-db-prod-payments-1-free-storage --query 'MetricAlarms[0].{Name:AlarmName,State:StateValue,Threshold:Threshold}'

{

"Name": "rds-db-prod-payments-1-free-storage",

"State": "OK",

"Threshold": 21474836480.0

}

# One alarm down. The other four follow the same pattern with different metrics and thresholds.

FreeStorageSpace alarm at 20 GiB — the single highest-leverage RDS alarm you'll ever create.
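
The remaining four alarms can be sketched with a small wrapper around put-metric-alarm. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names put_rds_alarm and create_remaining_alarms, the alarm-name suffixes, and the max_connections argument are hypothetical, the thresholds are the conservative defaults used in this lesson, and the SNS topic is the one from the command above.

```shell
# Hypothetical helper (not part of the AWS CLI): wraps put-metric-alarm with the
# fields every alarm in the set shares. Arguments:
#   instance suffix metric statistic operator threshold periods [treat_missing]
SNS_TOPIC_ARN="${SNS_TOPIC_ARN:-arn:aws:sns:eu-west-1:123456789012:rds-alerts}"

put_rds_alarm() {
  local instance="$1" suffix="$2" metric="$3" stat="$4"
  local operator="$5" threshold="$6" periods="$7" missing="${8:-missing}"
  local stat_flag="--statistic"
  # Percentile statistics (p99, p95, ...) go through --extended-statistic instead.
  case "$stat" in
    p[0-9]*) stat_flag="--extended-statistic" ;;
  esac
  aws cloudwatch put-metric-alarm \
    --alarm-name "rds-${instance}-${suffix}" \
    --namespace AWS/RDS \
    --metric-name "$metric" \
    --dimensions "Name=DBInstanceIdentifier,Value=${instance}" \
    "$stat_flag" "$stat" \
    --period 300 \
    --evaluation-periods "$periods" \
    --threshold "$threshold" \
    --comparison-operator "$operator" \
    --alarm-actions "$SNS_TOPIC_ARN" \
    --treat-missing-data "$missing"
}

# Creates the CPU, connection, and latency alarms for one instance.
# max_connections is passed in by the caller because it depends on engine and class.
create_remaining_alarms() {
  local instance="$1" max_connections="$2"
  local conn_threshold=$(( max_connections * 80 / 100 ))
  # CPU above 80% for 3 x 5 min; missing data counts as a failure.
  put_rds_alarm "$instance" cpu-high CPUUtilization Average GreaterThanThreshold 80 3 breaching
  put_rds_alarm "$instance" connections-high DatabaseConnections Average GreaterThanThreshold "$conn_threshold" 2
  # Latency metrics are reported in seconds: 0.1 = 100 ms.
  put_rds_alarm "$instance" read-latency-high ReadLatency p99 GreaterThanThreshold 0.1 3
  put_rds_alarm "$instance" write-latency-high WriteLatency p99 GreaterThanThreshold 0.1 3
}
```

Calling create_remaining_alarms db-prod-payments-1 500 would issue four put-metric-alarm calls, with the connection alarm at 400. Treat the numbers as starting points and revisit them after a week of observation.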

RDS alarms under the hood

RDS publishes its core metrics to CloudWatch at 1-minute resolution by default, in the AWS/RDS namespace; per-instance metrics are keyed by the DBInstanceIdentifier dimension. A CloudWatch alarm is a stateful rule: it evaluates a metric over a defined period (typically 60 or 300 seconds), compares the data points against a threshold for a number of consecutive periods (EvaluationPeriods), and sits in one of three states: OK, ALARM, or INSUFFICIENT_DATA. State transitions fire the configured actions (AlarmActions on entering ALARM), usually an SNS topic that fans out to PagerDuty, Slack, email, or a Lambda.

The five alarms that catch the most incidents are: CPUUtilization > 80% sustained (inefficient queries or a genuine need to scale up), DatabaseConnections above a set percentage of max_connections (connection leak or pool too small), FreeStorageSpace < 20% of allocated (the silent killer; the metric is reported in bytes, so the percentage has to be converted into a byte threshold per instance), ReadLatency/WriteLatency p99 above a workload-specific baseline (IOPS contention or slow queries; both metrics are reported in seconds), and ReplicaLag > 60s for any read replica. Aurora adds AuroraReplicaLag and Aurora-specific CPU/IO metrics; alarm on those too, at the cluster level rather than per instance.
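
A plain metric alarm compares against an absolute value, so a percentage such as "20% of allocated" has to be turned into bytes before it can be used as a FreeStorageSpace threshold. The sketch below shows one way to do that; the function names storage_threshold_bytes and get_allocated_gib are hypothetical, and it relies on describe-db-instances reporting AllocatedStorage in GiB.

```shell
# Convert "N percent of allocated storage" into the byte threshold that a
# FreeStorageSpace alarm needs. Arguments: allocated size in GiB, percentage.
storage_threshold_bytes() {
  local allocated_gib="$1" percent="$2"
  echo $(( allocated_gib * 1024 * 1024 * 1024 * percent / 100 ))
}

# Look up the allocated storage (in GiB) for one instance.
get_allocated_gib() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].AllocatedStorage' \
    --output text
}
```

For example, storage_threshold_bytes "$(get_allocated_gib db-prod-payments-1)" 20 prints 21474836480 for an instance with 100 GiB allocated, which is the same 20 GiB threshold used in the walkthrough.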

The trick people miss: --treat-missing-data matters. The default is missing, which means absent data points are simply ignored: an instance that stops emitting metrics entirely (because it's been deleted, is stuck, or has stopped reporting) leaves the alarm in INSUFFICIENT_DATA rather than ALARM, and nobody gets paged. For FreeStorageSpace and CPU, set it to breaching so that missing data is treated as a failure. Multi-AZ failovers also emit RDS events (not CloudWatch metrics), so subscribe to RDS event categories such as failover, failure, maintenance, and availability via SNS to catch the control-plane side.

# Subscribe to RDS event categories for control-plane signals (Multi-AZ failover, etc.).
aws rds create-event-subscription \
  --subscription-name rds-prod-events \
  --sns-topic-arn arn:aws:sns:eu-west-1:123456789012:rds-alerts \
  --source-type db-instance \
  --event-categories failover failure maintenance availability \
  --enabled

# List which event categories exist (handy for filtering noise out of the subscription).
aws rds describe-event-categories --source-type db-instance

What is the impact of running RDS without alarms?

The direct impact is detection latency. Without alarms, the first signal of a database problem is usually a customer complaint or an application-tier error rate spike — both of which arrive several minutes (sometimes hours) after the underlying RDS metric started telling the story. Every minute of detection latency is a minute of customer-visible failure that could have been prevented or shortened.

The disk-full case is the textbook example. FreeStorageSpace usually trends downward predictably, over a slow week or two as logs and tables grow, and an alarm at 20% gives you days of lead time to investigate, archive, or extend. Without the alarm the first symptom is the database flipping to read-only at 100% full, which produces a partial outage that looks like an application bug. Recovery typically takes far longer than on the prevented path because responders chase the wrong layer first.

The second-order impact is on-call quality. Engineers who get paged by a database problem via "the app is timing out" don't know whether it's the app, the network, the load balancer, or the database — every page becomes an investigation rather than a fix. Once standard RDS alarms are in place, the page itself tells you the cause: "FreeStorageSpace LOW" is a different runbook from "CPU HIGH," and the on-call engineer can be productive in 30 seconds instead of 30 minutes.

On the regulatory and audit side, frameworks like SOC 2, ISO 27001, and PCI DSS expect a documented monitoring posture for systems holding regulated data. "We have CloudWatch dashboards" doesn't satisfy that — auditors want to see alarms wired to a paging destination with a documented response SLA. An unalarmed production database is an audit finding waiting to happen, regardless of whether you've had a real incident yet.

How do you safely roll out RDS alarms across a fleet?

Closing the alarm-coverage gap is a four-step loop. It mirrors the EC2 coverage problem: instances created via console, by Terraform without alarm modules, or by app teams who didn't know to set them — same gap, different service.

1. Inventory every DBInstance and check for the standard alarm set

Use aws rds describe-db-instances to enumerate every instance across every region, then aws cloudwatch describe-alarms-for-metric to check coverage on the five core metrics. Don't trust Terraform state or runbook spreadsheets — go to the source of truth. The output is your remediation backlog, sorted by environment (prod first), then engine (Aurora gets cluster-level alarms in addition).

2. Bulk-create the standard set with sensible defaults

Wrap put-metric-alarm in a small script (the put-metric-alarm call shown earlier is the shape) and run it across every unalarmed instance. Defaults: CPU > 80% for 15 min, FreeStorageSpace < 20% of allocated, DatabaseConnections > 80% of max_connections (look it up per instance, since it scales with instance class), ReadLatency/WriteLatency p99 > 100 ms sustained (the metrics are reported in seconds, so the threshold value is 0.1), ReplicaLag > 60s. Send all of them to one SNS topic initially and split routing later, once you know the noise floor.

3. Add control-plane and Performance Insights coverage on top

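CloudWatch metrics only show you the data plane. Subscribe to RDS event categories (failover, failure, maintenance) via create-event-subscription so that Multi-AZ failovers, instance failures, and maintenance events also page; add the backup and configuration change categories if snapshot failures and parameter-group changes should page as well. For Aurora and any high-traffic instance, enable Performance Insights and alarm on database load (Performance Insights publishes a DBLoad metric to CloudWatch), then use its top-waits and top-SQL views to find the cause. Load-based signals often surface slow-query problems before they show up as CPU or latency anomalies.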

4. Close the provisioning gap with a tag-based Lambda

Same pattern as EC2 coverage: wire an EventBridge rule that matches the RDS CreateDBInstance API call (event source aws.rds) to a Lambda that reads the new instance's tags and creates the standard alarm set automatically. Add an AWS Config rule to detect any instance drifting back into the unalarmed state: managed rules such as db-instance-backup-enabled cover other RDS properties, so alarm coverage needs a custom rule. The Lambda handles new instances; Config flags anyone who deletes the alarms after the fact.
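
The EventBridge wiring from step 4 might look like the sketch below. It is a sketch under stated assumptions rather than a definitive setup: the rule name and the Lambda function ARN are hypothetical, the Lambda itself is not shown, and the pattern matches API calls delivered through CloudTrail, so it assumes a trail is recording management events in the region.

```shell
# Hypothetical names: adjust the rule name and the Lambda ARN to your account.
RULE_NAME="rds-attach-standard-alarms"
LAMBDA_ARN="${LAMBDA_ARN:-arn:aws:lambda:eu-west-1:123456789012:function:rds-attach-standard-alarms}"

# Event pattern: CreateDBInstance calls recorded by CloudTrail.
event_pattern() {
  cat <<'JSON'
{
  "source": ["aws.rds"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["rds.amazonaws.com"],
    "eventName": ["CreateDBInstance"]
  }
}
JSON
}

# Create the rule, allow EventBridge to invoke the Lambda, then attach the target.
wire_create_rule() {
  local rule_arn
  rule_arn="$(aws events put-rule --name "$RULE_NAME" \
    --event-pattern "$(event_pattern)" \
    --query 'RuleArn' --output text)"
  aws lambda add-permission \
    --function-name "$LAMBDA_ARN" \
    --statement-id "${RULE_NAME}-invoke" \
    --action lambda:InvokeFunction \
    --principal events.amazonaws.com \
    --source-arn "$rule_arn"
  aws events put-targets --rule "$RULE_NAME" \
    --targets "Id=1,Arn=${LAMBDA_ARN}"
}
```

Running wire_create_rule once per region is enough; EventBridge rules are regional, so a fleet spread over three regions needs three rules.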

# Find every RDS instance across every region with zero alarms on FreeStorageSpace.
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
  for id in $(aws rds describe-db-instances --region "$region" --query 'DBInstances[].DBInstanceIdentifier' --output text); do
    count=$(aws cloudwatch describe-alarms-for-metric \
      --region "$region" \
      --namespace AWS/RDS \
      --metric-name FreeStorageSpace \
      --dimensions "Name=DBInstanceIdentifier,Value=$id" \
      --query 'length(MetricAlarms)' --output text)
    # Report the instance only when the lookup succeeded and returned zero alarms.
    if [ "$count" = "0" ]; then echo "UNALARMED: $region/$id"; fi
  done
done

# Feed the list into a put-metric-alarm loop with your standard thresholds.
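
One possible shape for that loop is sketched below. It reads the UNALARMED lines produced by the inventory script and creates a FreeStorageSpace alarm at 20% of each instance's allocated storage. The function name alarm_unalarmed, the ACCOUNT_ID variable, and the convention of one rds-alerts SNS topic per region are assumptions made for this sketch.

```shell
# Reads lines of the form "UNALARMED: <region>/<instance-id>" on stdin.
# Assumption: an SNS topic named rds-alerts exists in every region being processed.
ACCOUNT_ID="${ACCOUNT_ID:-123456789012}"

alarm_unalarmed() {
  local marker target region id gib threshold
  while read -r marker target; do
    region="${target%%/*}"   # text before the first slash
    id="${target#*/}"        # text after the first slash
    # AllocatedStorage is reported in GiB.
    gib="$(aws rds describe-db-instances --region "$region" \
      --db-instance-identifier "$id" \
      --query 'DBInstances[0].AllocatedStorage' --output text)"
    # 20% of allocated storage, expressed in bytes.
    threshold=$(( gib * 1024 * 1024 * 1024 * 20 / 100 ))
    aws cloudwatch put-metric-alarm --region "$region" \
      --alarm-name "rds-${id}-free-storage" \
      --namespace AWS/RDS \
      --metric-name FreeStorageSpace \
      --dimensions "Name=DBInstanceIdentifier,Value=${id}" \
      --statistic Average \
      --period 300 \
      --evaluation-periods 2 \
      --threshold "$threshold" \
      --comparison-operator LessThanThreshold \
      --alarm-actions "arn:aws:sns:${region}:${ACCOUNT_ID}:rds-alerts" \
      --treat-missing-data breaching
  done
}
```

Saving the inventory output to a file and running alarm_unalarmed < unalarmed.txt keeps the two steps separate, so the list can be reviewed before any alarms are created.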

Quick quiz

You're rolling out the standard alarm set across 47 unalarmed RDS instances. Which single alarm should you prioritize because its absence causes the most preventable major incidents?

You've completed Add CloudWatch alarms to RDS instances. You now know the five alarms that catch the overwhelming majority of database incidents, why FreeStorageSpace is the single highest-leverage one, how to extend coverage with RDS event subscriptions and Performance Insights, and how to close the provisioning gap with a tag-based Lambda. The next time a cloudwatchOps scan flags COV-002 across a fleet, you'll have a four-step loop ready to run: inventory, bulk-create, extend, prevent recurrence.

RDS alarms: what they cost and what they prevent

CloudWatch alarms are a near-zero spend item that materially shortens the cost of a database incident

Amazon RDS writes metrics — CPU, storage, connections, latency — to CloudWatch automatically and at no incremental cost. A CloudWatch alarm on one of those metrics costs roughly $0.10 per alarm per month. The entire standard set of five alarms on a production database costs less than a dollar a month. Despite that, RDS creates zero alarms by default — the platform gives you the data, the alerting is entirely opt-in.

The finance relevance is what 'unalarmed' costs when something goes wrong. Without alarms the first signal of a database problem is a customer complaint or application error spike, both of which arrive after the incident has already started, when it is too late for preventive action. Detection latency has a direct dollar cost: every minute of customer-visible downtime that a timely alarm would have prevented is revenue and reputation lost in order to save roughly $0.10 per alarm per month.

The cloudwatchOps check COV-002 flags this gap at CRITICAL severity. From a finance perspective the finding is almost never a cost reduction opportunity — CloudWatch alarms are too cheap to optimise. It is a risk-and-efficiency opportunity: alarms shorten mean time to detect, shorten mean time to respond, and reduce the probability of a costly extended outage. The cost of not having them is paid in incident hours, not cloud spend.

This lesson is for the finance partner who wants to understand why COV-002 findings show up on the operations review and what the right cost-and-risk framing is. You'll learn why CloudWatch alarms are almost never a cost optimisation target — they're too cheap — and why the real dollar story is in detection latency: how much of a database incident's cost is attributable to not knowing about the problem until the application broke. No CLI commands required; the focus is on the tiering logic for which databases need a full alarm set, what exception governance looks like, and how alarm coverage connects to the audit and compliance picture.

How a finance partner frames the alarm-coverage gap

Caitlin is the FinOps partner reviewing the monthly operations scorecard. The cloudwatchOps section shows 47 CRITICAL COV-002 findings — 47 RDS instances with zero CloudWatch alarms. Her first question is not 'how much does it cost to fix?' — CloudWatch alarms are less than a dollar each per month, so the remediation cost is negligible. Her question is 'what is the expected cost of not fixing this?'

She works with the platform team to split the 47 into tiers: 31 are production or production-adjacent databases behind customer-facing services; 16 are development and test instances. For the 31 production databases, she prices the gap. The team's last major RDS incident — a disk-full event six months ago — caused four hours of partial outage; a FreeStorageSpace alarm would have given two days of lead time. At roughly $30k/hour in lost transaction revenue, that single preventable incident cost more than the total annual alarm spend on all 47 databases combined.
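
Caitlin's comparison can be written out as a worked calculation. The $0.10 monthly price per alarm and the five-alarm set are the figures used in this lesson; treating every one of the 47 instances as carrying exactly five alarms is a simplifying assumption.

```latex
\begin{align*}
\text{Incident cost} &\approx 4\ \text{h} \times \$30{,}000/\text{h} = \$120{,}000 \\
\text{Annual alarm spend} &\approx 47 \times 5 \times \$0.10 \times 12 = \$282
\end{align*}
```

On those numbers, the one preventable incident cost more than 400 times the yearly alarm bill for the whole fleet.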

Caitlin's recommendation: approve the full alarm rollout immediately, budget the alarm spend as a fixed reliability line item under the operations budget, and flag unalarmed production RDS instances on the monthly risk register until coverage is complete. The math is clear enough that she doesn't need a second meeting to get the decision.

The cost model behind unalarmed RDS databases

The cost of not alarming a production RDS database has two components: the extra cost of incidents that run longer because they were detected late, and the full cost of incidents that an early warning would have prevented outright. Both are substantially larger than the cost of the alarms themselves. A full standard alarm set for one database costs roughly $0.50–$1.00 per month. A single major incident attributable to late detection, such as a disk-full event that flipped the database read-only while engineers chased the application layer for an hour before diagnosing the real cause, easily runs to tens of thousands of dollars in incident labour, lost revenue, and customer remediation.

For finance, the disk-space alarm is the clearest case to model. FreeStorageSpace declines predictably over days or weeks, so an alarm at 20% typically gives 48–72 hours of lead time. The cost of the alarm firing and someone extending storage is a few engineer-hours. The cost of missing it is a transition to the storage-full status that makes the database read-only, creates a partial outage that looks like an application bug, generates an all-hands incident, and, depending on how the application handles write failures, may require manual data reconciliation afterward. The mean time to recovery on the prevented path is minutes; on the undetected path it is hours.

Alarm coverage also has a direct compliance cost. SOC 2 Type II, ISO 27001, and PCI DSS all expect organisations to demonstrate a documented monitoring posture — alarms wired to a paging destination with a response SLA — for systems holding regulated data. An unalarmed production database that holds payment or customer records is a control gap that an auditor will flag and that will require remediation evidence, which is additional cost on top of the original exposure. Addressing COV-002 now removes this as a future audit finding.

The right budget framing: alarm coverage is not an optional operational nicety — it is a component of the cost of running a production database responsibly. Add the monthly alarm spend to the database line item as a standard overhead, track the percentage of production RDS instances with full coverage on the operations scorecard, and treat any unalarmed production database as an open risk item with a dollar estimate attached until it is remediated.

What finance can drive on RDS alarm coverage

Finance can't create CloudWatch alarms, but it can own the decision framework, the coverage metric, and the budget governance that keeps alarm coverage complete and defensible at audit.

1. Define the alarm coverage tier by environment, not by incident history

Agree a policy with engineering: production and production-adjacent RDS databases must have the full standard alarm set, funded as part of the database's operational cost. Dev and test databases are opt-in. Document this tier definition once and use it as the decision rule for every new database provisioned — the goal is a predictable, pre-approved default rather than a per-database finance conversation every time.

2. Track unalarmed production instances as an open risk item with a cost estimate

Put the count of COV-002 failing production databases on the monthly operations scorecard, alongside an estimated cost exposure: for each unalarmed production database, note the relevant incident type (disk-full, connection-leak, etc.) and a rough cost-of-incident estimate based on past incidents or industry benchmarks. This keeps the finding visible and quantified rather than buried in a security dashboard.

3. Budget alarm spend as a standard database overhead

The CloudWatch alarm cost for a full standard set on one database is under a dollar per month. Add it as a per-database line item in the cloud unit cost model — similar to how backup storage or monitoring agents are accounted for — rather than treating it as discretionary ops spend subject to quarterly review. This removes the friction of reapproving a $0.50/month item repeatedly and prevents alarm coverage from being quietly cut during cost optimisation rounds.

4. Require documented justification for every unalarmed production exception

Any production database intentionally left without the full alarm set should carry a finance-visible suppression reason in the operations record. This is the same discipline as a documented exception to a compliance control: it converts 'we ignored the finding' into 'we made a recorded decision with a named owner.' At audit, a clean documented exception is defensible; a silent omission is not.

Quick quiz

Your operations review shows 12 production RDS instances flagged COV-002 with no CloudWatch alarms. Engineering estimates the alarm set costs $0.60/month per database. A previous disk-full incident on an unalarmed database took 4 hours to resolve and cost an estimated $45,000 in lost revenue. What is the right finance recommendation?

You've finished the finance view of COV-002. The key numbers to carry: the standard five-alarm set costs under a dollar per database per month, a single preventable disk-full incident typically costs orders of magnitude more, and the alarm spend should be budgeted as standard database operational overhead rather than discretionary monitoring spend. Your contributions are the tier policy, the risk-register line item for unalarmed production instances, and the requirement that every exception is a documented decision — not a silent omission.

RDS alarms: the leadership angle

Does the database tell us when it is failing, or do customers tell us?

AWS provides detailed database health data but creates zero automatic alerts on it. Without explicit configuration, the first sign of a database problem is typically a customer complaint or a monitoring alert from the application layer — by which point the incident has already started. CloudWatch alarms are the mechanism that makes the database report its own problems first.

The check COV-002 flags this at CRITICAL severity because silent database failures are the hardest kind to respond to: there is no early-warning signal, responders arrive late, and the window for a simple preventive action has already closed. This is a monitoring posture question, not a cost question — setting up the standard alarm set costs under a dollar a month per database. The leadership issue is whether the organisation has decided that production databases must alert on their own problems, and enforces that as policy rather than leaving it to individual engineers.

A short read for the executive who wants to understand what this finding actually means and what the one policy question is. You'll get the plain-English version of why databases don't alarm by default, why silent failures are the most expensive kind, and what 'good' looks like: every production-critical database configured to report its own problems, with coverage tracked as an operational posture metric and every exception a deliberate, recorded decision. No technical implementation detail.

What it looks like when an org closes the alarm gap

Before the CTO at one company set alarm coverage as a tracked metric, the answer to 'how do we find out about database problems?' was effectively 'a customer tells us or the app goes red.' The mean time to detect a database incident was measured in minutes to hours, and the first responders spent most of the early recovery window figuring out what layer the problem was even at.

After defining a policy — every production RDS instance must carry the standard five alarms, with any exception documented — the character of database incidents changed. The on-call engineer now arrives to a page that names the problem: 'FreeStorageSpace LOW on db-payments-1' is a different runbook from 'CPU HIGH', and both are faster to resolve than 'the app is slow, good luck.' Mean time to detect dropped from tens of minutes to under five.

The CTO's one-line question at each review became: 'What percentage of our production RDS instances have full alarm coverage, and what are the documented exceptions?' A consistent answer above 95% means the monitoring posture is governed. That's the signal that matters — not whether a specific incident happened, but whether the infrastructure is set up to tell the team before a customer does.

Why this is a risk item, not just an engineering to-do

Running a production database without alarms is equivalent to deciding that the organisation will hear about database problems from customers before it hears about them internally. That is not a technical default — it is an implicit risk decision that most organisations would not make deliberately. COV-002 makes it explicit: these are the systems currently configured to fail silently.

The detection-latency gap has a compounding effect on incident cost. Early-warning alarms don't just shorten time to detect — they change the character of the response. An engineer who gets a 'FreeStorageSpace LOW' alert two days in advance has a scheduled task; an engineer who gets a 'the API is timing out' call at 2am has an investigation. The latter produces more total downtime, more engineer hours, more customer communication, and a much higher probability of a compliance or contractual SLA breach.

On the audit and accountability side, regulators and enterprise customers increasingly ask for documented evidence of monitoring controls — not dashboards, but alarms with defined thresholds, named destinations, and response SLAs. The absence of alarms on a production database holding regulated data is an audit finding regardless of whether an incident has occurred. Leadership owns the answer to 'are our production databases configured to tell us before customers notice?', and right now the honest answer, for every instance flagged by COV-002, is no.

The leadership ask on RDS alarm coverage

The executive lever is not to review alarm configurations — it is to set the policy and the accountability structure so that alarm coverage is the default, not the exception.

1. Set the policy: production databases alarm on their own problems

Define and communicate a single rule: every production or customer-facing RDS database must carry the standard alarm set, with any exception documented and owner-signed. A clear, standing policy removes the per-database debate and makes alarm coverage a provisioning requirement rather than a post-launch review item.

2. Measure it and put it on the operations review

Ask for one number at each review: what percentage of production RDS instances have full alarm coverage? A target above 95% is reasonable; a number consistently below that is a policy adherence issue, not just an engineering backlog. The trend is what matters — coverage should improve monotonically as remediations land and new provisioning defaults enforce the alarm set.

3. Hold exceptions to the same standard as compliance findings

An intentionally unalarmed production database should require the same documentation and sign-off as a suppressed compliance control: a stated reason, a named owner, and a review date. This prevents the exception list from growing silently and ensures the monitoring posture is defensible at audit or after an incident.

Quick quiz

At the quarterly business review the CISO reports that 80% of production RDS instances now have full CloudWatch alarm coverage, up from 30% last quarter, with all exceptions documented and owner-signed. What is the right leadership response?

That's the lesson. Two things to remember: databases don't alarm by default, which means the organisation has implicitly chosen to hear about failures from customers unless someone configures them otherwise. And fixing it costs less than a dollar per database per month. The leadership question is whether there is a policy that makes alarm coverage the default for production systems, a metric that tracks it, and an accountability structure that ensures exceptions are deliberate. If those three are in place, COV-002 becomes evidence the policy is working.

Part of the learning path Get your alarms right
