
Delete idle ElastiCache clusters

An ElastiCache cluster serving no traffic still bills per node-hour for its full provisioned capacity: find the idle ones, take a final snapshot, and tear them down cleanly.



Idle ElastiCache clusters: the basics

Why a cache with no traffic still costs full price

Amazon ElastiCache (Redis OSS, Valkey, or Memcached) is billed per node-hour for the full provisioned capacity of every node in the cluster — not for how much of the cache you actually read or write. A cache.t3.medium node runs about $0.068/hour, or roughly $50 a month, and that meter keeps ticking whether the cache serves a million requests a second or none at all. Multi-node clusters scale the bill linearly: a three-node replication group is three times the node-hours, every hour, forever, until you delete it.
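
As a rough sketch of that arithmetic, the snippet below multiplies an hourly rate by a node count and an assumed 730 hours per month (8760 / 12). The function name monthly_cost is illustrative, and the rate is the approximate figure quoted above, not live pricing.

```shell
#!/usr/bin/env bash
# Estimate the monthly on-demand cost of a cluster from its node-hour rate.
# Assumption: 730 billable hours per month; the rate is passed in, not looked up.
monthly_cost() {
  local hourly_rate="$1" node_count="$2"
  awk -v r="$hourly_rate" -v n="$node_count" 'BEGIN { printf "%.2f\n", r * n * 730 }'
}

monthly_cost 0.068 1   # one cache.t3.medium at the approximate rate above
monthly_cost 0.068 3   # the same node type in a three-node replication group
```

The first call prints 49.64 and the second 148.92: the bill depends only on rate, node count, and hours, and nothing in the formula refers to traffic.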

The check flags a cluster that shows effectively no activity: below 1% memory utilization (BytesUsedForCache near zero), near-zero CurrConnections, and flat command counts (GetTypeCmds/SetTypeCmds on Redis OSS and Valkey, CmdGet/CmdSet on Memcached) over a 7-to-14-day window. That pattern almost always means one of two things: the cluster was over-provisioned at launch "to be safe" and the workload never grew into it, or the application that depended on it was retired and the cache was left behind. Either way it's provisioned capacity nobody is using.

It's flagged because caches are easy to forget. They sit one layer behind the application, they don't appear in the user-facing architecture, and "is anything still pointing at this Redis?" is a harder question than it looks. A single idle cache.r6g.large costs about $115 a month; a forgotten three-node cluster is several hundred. The longer it runs unexamined, the more it costs and the less anyone remembers why it exists.

In this lesson you'll learn the node-hour billing model that makes an idle cache cost full price, how to tell a genuinely abandoned cluster from a legitimate warm-standby, staging, or batch cache, and the safe teardown pattern — confirm idleness over 14 days, identify the owning app, take a final snapshot, delete the replication group, then clean up the orphaned subnet group, parameter group, and security-group rules. You'll see the CloudWatch metric pulls that prove a cache is idle and the CLI sequence to snapshot-then-delete without leaving plumbing behind for someone to wire up by mistake.

Fun fact

The cache that outlived its app by a year

A platform team auditing a long-lived staging account found a three-node cache.r5.large Redis replication group humming along with CurrConnections flat at zero for 380 days straight. The session-store service it had been built for was decommissioned the previous spring — but the cache, its subnet group, and its security group were never touched. At roughly $0.226/node-hour it had quietly billed over $5,900 across the year for a cache nothing had connected to since the service shipped its last release.

Retiring an idle cache in action

Devi runs the FinOps cadence at a mid-sized SaaS company. The dashboard flags a Redis OSS replication group, sessions-stg, a single cache.r6g.large node, with BytesUsedForCache under 1% and CurrConnections flat at zero for the last 14 days — about $115/month on the bill, provisioned 11 months ago.

She doesn't delete it on the metric alone. A cache can look idle and still be a warm standby, a staging cache that only sees traffic during release windows, or a batch cache that's hot for two hours once a week. So she pulls 14 days of CurrConnections, CacheHits, and NetworkBytesIn at hourly granularity to rule out a periodic pattern, then chases the owning app via the cluster's security-group references and its Owner tag.

The metrics are dead flat, with no weekly spike and no release-window blip, and the security group is referenced by nothing. The Owner=raj.p tag leads to a Slack message: "sessions-stg Redis still needed?" Raj replies in minutes: "That was for the old session service, we moved to DynamoDB last year. Kill it." Devi deletes the replication group with a final snapshot as insurance, waits for the deletion to complete, then cleans up the now-orphaned subnet group and parameter group so nothing reattaches to them by accident.

First, confirm the cache is genuinely idle: pull 14 days of connection counts at hourly granularity so a weekly batch or release-window spike can't hide in a daily average.

$ aws cloudwatch get-metric-statistics --namespace AWS/ElastiCache --metric-name CurrConnections --dimensions Name=CacheClusterId,Value=sessions-stg-001 --start-time $(date -u -d '14 days ago' +%FT%TZ) --end-time $(date -u +%FT%TZ) --period 3600 --statistics Maximum
{
"Datapoints": [
{ "Timestamp": "2026-05-12T00:00:00Z", "Maximum": 0.0, "Unit": "Count" },
{ "Timestamp": "2026-05-12T01:00:00Z", "Maximum": 0.0, "Unit": "Count" },
{ "Timestamp": "2026-05-12T02:00:00Z", "Maximum": 0.0, "Unit": "Count" },
{ "Timestamp": "2026-05-12T03:00:00Z", "Maximum": 0.0, "Unit": "Count" }
]
}
# Output abridged to the first four of 336 hourly datapoints. Peak connections never rose above 0 in 14 days: no client, no batch, no warm standby.

Maximum (not Sum) at hourly granularity exposes any periodic client — a weekly batch would show a non-zero peak.
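
Reading 336 datapoints by eye is error-prone, so here is a minimal sketch of the same check as a filter. The name classify_idle and its optional baseline argument are assumptions made for illustration. It reads hourly Maximum values on stdin and reports whether any hour rose above the baseline.

```shell
#!/usr/bin/env bash
# Classify a cache as idle or in use from hourly Maximum datapoints.
# Input: whitespace-separated numbers on stdin (spaces, tabs, or newlines).
# The baseline argument is an assumption: 0 by default.
classify_idle() {
  local baseline="${1:-0}"
  awk -v b="$baseline" '
    { for (i = 1; i <= NF; i++) { n++; if ($i + 0 > b) busy++ } }
    END {
      if (n == 0)    print "no-data"
      else if (busy) printf "in-use (%d of %d hours above baseline)\n", busy, n
      else           printf "idle (%d hours checked)\n", n
    }'
}

# Inline sample values stand in for a live CloudWatch call.
printf '0\n0\n0\n0\n' | classify_idle
printf '0\n0\n12\n0\n' | classify_idle
```

Against a live account, the same get-metric-statistics call with --query 'Datapoints[].Maximum' --output text could be piped into classify_idle. A no-data result should be treated as unknown, not idle.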

Confirmed idle and owner signed off: delete the whole replication group and have ElastiCache take a final snapshot first as part of the same call (this removes every node, not just one).

$ aws elasticache delete-replication-group --replication-group-id sessions-stg --final-snapshot-identifier sessions-stg-final-2026-05-26
{
"ReplicationGroup": {
"ReplicationGroupId": "sessions-stg",
"Status": "deleting",
"MemberClusters": ["sessions-stg-001"],
"SnapshottingClusterId": "sessions-stg-001"
}
}
# Final snapshot captured before teardown. It is kept as a manual snapshot until you delete it, so the restore path stays open.

Delete the replication group, not individual nodes — and Redis/Valkey can snapshot first; Memcached cannot, so confirm extra-hard before deleting one.

How ElastiCache billing actually works (deep dive)

ElastiCache bills per node-hour for every node in the cluster, on demand, regardless of utilisation. Rough US-East on-demand rates: cache.t3.micro ≈ $0.017/hr ($12/mo), cache.t3.medium ≈ $0.068/hr ($50/mo), cache.r6g.large ≈ $0.156/hr ($115/mo), cache.r6g.xlarge ≈ $0.312/hr ($230/mo). A Redis OSS or Valkey replication group multiplies that by its node count, with cluster mode enabled or disabled (primaries and replicas alike bill at the node rate), so a three-node r6g.large group is ~$345/month whether it serves a request or not. There is no "stopped" state for ElastiCache the way there is for EC2: the only way to stop the meter is to delete the cluster.

The metrics that prove idleness live in the AWS/ElastiCache CloudWatch namespace. BytesUsedForCache under ~1% of node capacity means almost nothing is stored; CurrConnections sitting at a flat, unchanging floor means no client is connected (on Redis OSS and Valkey, ElastiCache's own monitoring can hold a few connections, so the floor may be a small constant and not a literal zero); and GetTypeCmds/SetTypeCmds (Redis OSS/Valkey) or CmdGet/CmdSet (Memcached) flat at zero means no reads or writes are happening. Pull Maximum at hourly granularity over 14 days instead of a daily Average: a daily average can flatten a once-a-week batch spike into apparent silence, and a warm-standby or release-window cache is exactly the false positive you want to avoid deleting.

Teardown differs by engine. Redis OSS and Valkey support a final snapshot: the cache serialises to an .rdb file stored in S3-backed snapshot storage (billed at standard backup rates, far cheaper than the live node-hours), so you get a restore path. Memcached has no persistence and cannot snapshot; deleting it is irreversible, so the connection and command-rate checks have to carry the full weight. For replication groups, delete the replication group and not the individual cache clusters, because deleting nodes one at a time can trigger failovers and leaves a partial cluster billing on the way down.
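
The engine split can be sketched as a small dry-run planner that prints the delete command without running it. plan_teardown is a hypothetical helper. The aws subcommands and flags it prints are the ones used elsewhere in this lesson, and the engine names assume the lowercase values the API reports.

```shell
#!/usr/bin/env bash
# Print (do not run) the delete command appropriate to the engine.
# plan_teardown is a hypothetical helper for illustration only.
plan_teardown() {
  local engine="$1" id="$2" snapshot="$3"
  case "$engine" in
    redis|valkey)
      # A snapshot name is required so the teardown always leaves a restore path.
      [ -n "$snapshot" ] || { echo "snapshot name required for $engine" >&2; return 1; }
      echo "aws elasticache delete-replication-group --replication-group-id $id --final-snapshot-identifier $snapshot"
      ;;
    memcached)
      # No snapshot is possible, so this command has no restore path.
      echo "aws elasticache delete-cache-cluster --cache-cluster-id $id"
      ;;
    *)
      echo "unknown engine: $engine" >&2
      return 1
      ;;
  esac
}

plan_teardown redis sessions-stg sessions-stg-final-2026-05-26
plan_teardown memcached legacy-memcached-001
```

Printing the command instead of executing it keeps a human in the loop for the irreversible Memcached case.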

# Describe the replication group to capture status, node type, member clusters, and failover setting.
aws elasticache describe-replication-groups \
  --replication-group-id sessions-stg \
  --query 'ReplicationGroups[0].{Status:Status, NodeType:CacheNodeType, Members:MemberClusters, AutomaticFailover:AutomaticFailover}'

# Confirm memory is effectively empty before deletion (daily Average and Maximum over 14 days, in bytes).
aws cloudwatch get-metric-statistics \
  --namespace AWS/ElastiCache --metric-name BytesUsedForCache \
  --dimensions Name=CacheClusterId,Value=sessions-stg-001 \
  --start-time $(date -u -d '14 days ago' +%FT%TZ) \
  --end-time $(date -u +%FT%TZ) \
  --period 86400 --statistics Average Maximum
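
BytesUsedForCache comes back in bytes, so the "under 1%" test needs the node's memory as a denominator. The sketch below does that conversion. The function names are illustrative, and the 13.07 GiB figure for a cache.r6g.large is an assumption to verify against the current node-type table. Usable memory is also somewhat lower once reserved memory is subtracted.

```shell
#!/usr/bin/env bash
# Convert a BytesUsedForCache reading into percent of node memory.
# Assumption: node memory is supplied in GiB by the caller.
mem_utilization_pct() {
  local bytes_used="$1" node_gib="$2"
  awk -v u="$bytes_used" -v g="$node_gib" \
    'BEGIN { printf "%.2f\n", u / (g * 1024 * 1024 * 1024) * 100 }'
}

# Succeeds (exit 0) when utilization is below the 1% idle threshold.
is_effectively_empty() {
  awk -v p="$(mem_utilization_pct "$1" "$2")" 'BEGIN { exit !(p < 1) }'
}

mem_utilization_pct 52428800 13.07   # 50 MiB used on an assumed 13.07 GiB node
```

With 50 MiB stored, the call prints 0.37, comfortably under the 1% threshold the check uses.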

What's the impact of leaving idle ElastiCache clusters running?

The direct cost is node-hours billed for capacity nothing is using. A single idle cache.t3.medium is ~$50/month; a cache.r6g.large is ~$115; a three-node r6g.large replication group provisioned for HA is ~$345. Across a long-lived org with staging accounts, decommissioned services, and over-provisioned launches, idle caches routinely add up to thousands of dollars a month — and unlike EC2 there's no stop button, so the meter runs every hour until someone deletes the cluster outright.

There's a sizing trap layered on top of the idle-cache problem. Teams pick a node type at launch "to be safe," and a cache that genuinely needed a t3.micro ($12/mo) ends up on an r6g.large ($115/mo): a roughly 10x overspend that looks like normal usage on the bill because the cache isn't idle, just oversized. The idle check catches the clusters with no traffic; a right-sizing review catches the ones that serve real traffic on far more memory than they'll ever fill.

Cluster topology multiplies any mistake. A Multi-AZ replication group bills every replica node at the full node rate, so an over-provisioned or abandoned replication group costs three or six times what a single node would, depending on its node count. And because ElastiCache has no stopped state, the usual "pause it for a sprint" option doesn't exist here: the only lever is delete-and-restore-from-snapshot, which means idle caches tend to sit there indefinitely instead of being parked.

Finally, abandoned caches are a hygiene and security drag, not only a cost one. They hold network plumbing — subnet groups, parameter groups, security-group rules — that clutters the VPC, and a forgotten Redis with an open security group is exactly the kind of stale, unowned data store that shows up as an audit or pentest finding. A short list of caches with clear owners is far easier to reason about, and to defend at audit, than a long one nobody remembers provisioning.

How do you retire idle ElastiCache clusters safely?

Deleting a cache is cheap; deleting one that a weekly batch quietly depends on is how you cause a 2 a.m. incident. Run this four-step loop for every flagged cluster.

1. Confirm idleness over a real 14-day window

Pull CurrConnections, CacheHits (Redis OSS/Valkey) or CmdGet (Memcached), and NetworkBytesIn at hourly granularity for at least 14 days, using Maximum not Average. A daily average flattens a once-a-week batch or a release-window cache into apparent silence; hourly peaks won't. If any hour is non-zero, treat the cluster as in use until you've identified the client: a warm standby, staging cache, or periodic compliance job is exactly the false positive you don't want to delete. A 14-day window cannot see a monthly job, so if one is plausible, extend the window to at least 35 days; CloudWatch keeps hourly datapoints for 455 days.

2. Identify the owning application before touching anything

Trace who depends on the cache via its security-group references (which security groups are allowed to reach it, and what's attached to them), its parameter group, and its tags, starting with Owner. A short message to the owner ("this cache has been idle 14 days, OK to retire?") resolves most decisions in a day. Never delete on the metric alone on a first pass; build trust with the team before automating the destruction path.

3. Snapshot (if you can), then delete the replication group

For Redis OSS or Valkey, take a final snapshot to S3-backed storage: it's far cheaper than the live node-hours and gives you a restore path for as long as you keep the snapshot (a final snapshot is a manual snapshot, so the automatic retention window doesn't age it out). Delete the whole replication group, not individual nodes; deleting nodes one at a time can trigger failovers and leaves a partial cluster billing. Memcached cannot snapshot at all, so its deletion is irreversible: make the connection and command checks carry the full weight before you pull it.

4. Clean up the orphaned plumbing so it can't be reused by mistake

Deleting the cluster leaves its subnet group, parameter group, and security-group rules behind. Remove or clearly tag them so nobody wires a new cache to a stale config — and so the next audit doesn't have to puzzle over networking that points at nothing. Adopt a tag convention (Owner, ExpiresAt, Lifecycle=ephemeral|persistent) on new caches and enforce it with AWS Config, so the next month's idle-cache report is shorter by default.
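
Step 2 can be sketched with a few read-only calls, shown here as a minimal outline, not a complete audit. The function names, region, and account ID are hypothetical placeholders. The aws subcommands and filters are standard, but only the ARN helper is exercised when the snippet runs.

```shell
#!/usr/bin/env bash
# Sketch of step 2: trace who could still reach the cache.
# Function names, region, and account ID are hypothetical placeholders.

# Build the ARN that list-tags-for-resource expects for a replication group.
replication_group_arn() {
  local region="$1" account_id="$2" group_id="$3"
  echo "arn:aws:elasticache:${region}:${account_id}:replicationgroup:${group_id}"
}

# Read the tags (including Owner, if set) on the replication group.
show_tags() {
  aws elasticache list-tags-for-resource --resource-name "$1"
}

# List network interfaces that carry a given client security group;
# an empty result suggests nothing is attached to it any more.
show_sg_attachments() {
  aws ec2 describe-network-interfaces \
    --filters "Name=group-id,Values=$1" \
    --query 'NetworkInterfaces[].{Id:NetworkInterfaceId,Description:Description}'
}

replication_group_arn us-east-1 123456789012 sessions-stg
```

All three calls are read-only, so they are safe to run before anyone has agreed to a deletion.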

# 1. Final snapshot (Redis/Valkey only) and delete the replication group in one go.
aws elasticache delete-replication-group \
  --replication-group-id sessions-stg \
  --final-snapshot-identifier sessions-stg-final-2026-05-26

# 2. Wait for full deletion before cleaning up dependent resources.
aws elasticache wait replication-group-deleted \
  --replication-group-id sessions-stg

# 3. Remove the now-orphaned subnet group and custom parameter group.
aws elasticache delete-cache-subnet-group --cache-subnet-group-name sessions-stg-subnets
aws elasticache delete-cache-parameter-group --cache-parameter-group-name sessions-stg-params

# 4. For Memcached there's no snapshot, so delete the cache cluster directly.
# (A standalone Redis OSS/Valkey cache cluster can still add --final-snapshot-identifier.)
# aws elasticache delete-cache-cluster --cache-cluster-id legacy-memcached-001
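
The ExpiresAt tag from step 4 is only useful if something compares it to today's date. Here is one hedged way to do that, assuming the tag holds an ISO date (YYYY-MM-DD). is_expired is a hypothetical helper, and the tag values in the commented command are examples.

```shell
#!/usr/bin/env bash
# Check an ExpiresAt tag value (ISO date, YYYY-MM-DD) against a reference date.
# is_expired is a hypothetical helper; ISO dates sort correctly as plain strings.
is_expired() {
  local expires_at="$1" today="${2:-$(date -u +%F)}"
  # Appending "" forces a string comparison in awk.
  awk -v e="$expires_at" -v t="$today" 'BEGIN { exit !((e "") < (t "")) }'
}

# Tagging a new cache with the convention (placeholder ARN, example values):
# aws elasticache add-tags-to-resource --resource-name <replication-group-arn> \
#   --tags Key=Owner,Value=raj.p Key=ExpiresAt,Value=2026-12-31 Key=Lifecycle,Value=ephemeral

if is_expired 2026-03-01 2026-05-26; then echo "expired"; else echo "still valid"; fi
if is_expired 2026-12-31 2026-05-26; then echo "expired"; else echo "still valid"; fi
```

The two sample checks print "expired" and then "still valid". A scheduled job could run the same comparison against each cache's tags to build the review list.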

Quick quiz


You find a single-node Redis OSS cluster with CurrConnections flat at zero and BytesUsedForCache under 1% for 14 days. The owner confirms the dependent service was retired last year. What's the right next move?

You've completed Delete idle ElastiCache clusters. You now know why a cache with no traffic still bills full price per node-hour, how to prove idleness over a real 14-day window without nuking a warm standby or batch cache, and the safe teardown loop — confirm, identify the owner, snapshot-then-delete the replication group, clean up the plumbing. The next time the wastage report flags an idle cache, you'll have a defensible path from "flagged" to "resolved" with a snapshot in your back pocket.
