Reduce SQS queue message retention

Teams often set a long retention "just in case". Tighten it once you know the real consumer SLA.

Message retention: the basics

What does SQS retention actually control?

Every SQS queue has a MessageRetentionPeriod attribute — the maximum amount of time SQS will hold an unconsumed message before silently deleting it. The default is 4 days; the legal range is 60 seconds to 14 days. It's not a delivery delay or a consumer timeout — it's a failure budget for everything downstream of the queue.
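Because the attribute is expressed in seconds, values such as 1209600 are hard to read at a glance. The following is a minimal sketch of a helper that renders a retention value in days, hours, and minutes and rejects anything outside the 60-second to 14-day range described above. The function name `describe_retention` is hypothetical, not part of the AWS CLI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: render a MessageRetentionPeriod value (seconds) in a
# readable form, rejecting values outside the SQS range of 60..1209600.
describe_retention() {
  local seconds="$1"
  if [ "$seconds" -lt 60 ] || [ "$seconds" -gt 1209600 ]; then
    echo "invalid: must be between 60 and 1209600 seconds" >&2
    return 1
  fi
  local days=$((seconds / 86400))
  local hours=$((seconds % 86400 / 3600))
  local minutes=$((seconds % 3600 / 60))
  echo "${days}d ${hours}h ${minutes}m"
}

describe_retention 345600    # prints: 4d 0h 0m  (the default)
describe_retention 1209600   # prints: 14d 0h 0m (the maximum)
```

Unlike a plain division by 86400, this keeps sub-day retention values visible instead of rounding them down to zero days.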

When a queue is created with 14-day retention, it's almost never because someone sat down with the consumer team and calculated the real outage window the workload can tolerate. It's because somebody picked the maximum value in the console to be safe. That choice survives in production for years and quietly becomes the team's de-facto SLA: "we can be down for two weeks before customers notice messages have been dropped."

The wastage checker flags this as a LOW-severity finding because the $ cost is indirect — SQS bills per request, not per second of retention. But a 14-day retention is a smell. It usually means there's no DLQ, no alarm on ApproximateAgeOfOldestMessage, and nobody knows what the consumer's real recovery-time objective should be. The retention value is doing a job that an alarm and a DLQ should be doing.

In this lesson you'll learn what message retention is actually for, why "maximum just in case" is the wrong default, and how to pick a retention value that matches your consumer's real SLA. You'll see how to inspect a queue's attributes, change retention live (no draining required), and put the right alarms and DLQ in place so retention becomes a meaningful guard rail instead of a way to ignore broken consumers.

Fun fact

The 14-day silent drop

SQS does not notify you when retention expires. Messages older than MessageRetentionPeriod are simply removed from the queue: no event, no dedicated CloudWatch metric, no DLQ delivery. The only trace is indirect, for example NumberOfMessagesSent running ahead of NumberOfMessagesDeleted while the queue depth stops growing, and almost nobody alarms on that gap. Teams routinely discover their consumer has been broken for a week only when a customer asks where their data went, by which time SQS has quietly thrown it away.

Tightening retention in action

Nina is on call for the order-processing pipeline. A finance review flags an SQS queue named order-events-prod with the recommendation "Queue has 14 day message retention. Review message retention settings." She's never thought about retention before — the queue was created two years ago and has "just worked."

She pulls the queue's attributes and the consumer team's runbook. The runbook says the order-processor must recover within 4 hours of an incident or finance starts losing reconciliation data. There is no DLQ. There is no alarm on the queue's oldest-message age.

Nina realises the 14-day retention isn't protecting anything; it's hiding the fact that nobody has ever wired up the alarm and DLQ that would actually catch a stuck consumer. She drops the retention to 2 days (12× the 4-hour runbook RTO, a deliberately wide margin), adds a DLQ for poison-pill messages, and creates a CloudWatch alarm that fires when ApproximateAgeOfOldestMessage exceeds 30 minutes.

First, inspect the queue's current attributes to confirm the retention setting and check whether a DLQ is already configured.

$ aws sqs get-queue-attributes \
    --queue-url https://sqs.eu-west-1.amazonaws.com/123456789012/order-events-prod \
    --attribute-names All
{
    "Attributes": {
        "MessageRetentionPeriod": "1209600",
        "VisibilityTimeout": "30",
        "ApproximateNumberOfMessages": "412",
        "ApproximateNumberOfMessagesNotVisible": "6",
        "CreatedTimestamp": "1714521600"
    }
}
# Output abridged. 1209600 seconds = 14 days.
# SQS omits RedrivePolicy when none is set, so its absence means no DLQ.

Retention is at the 14-day maximum and there's no DLQ — classic "set once, never revisited" pattern.

Now apply the new retention online. set-queue-attributes returns immediately and doesn't require draining or recreating the queue, although a MessageRetentionPeriod change can take up to 15 minutes to propagate.

# Assumes the DLQ order-events-dlq already exists. The attributes are passed as JSON because
# RedrivePolicy is itself a JSON string, which the key=value shorthand does not parse reliably.
$ aws sqs set-queue-attributes \
    --queue-url https://sqs.eu-west-1.amazonaws.com/123456789012/order-events-prod \
    --attributes '{"MessageRetentionPeriod":"172800","RedrivePolicy":"{\"deadLetterTargetArn\":\"arn:aws:sqs:eu-west-1:123456789012:order-events-dlq\",\"maxReceiveCount\":\"5\"}"}'
# Command returns no output on success. Verify with get-queue-attributes.
$ aws sqs get-queue-attributes --queue-url ... --attribute-names MessageRetentionPeriod RedrivePolicy
{
    "Attributes": {
        "MessageRetentionPeriod": "172800",
        "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:eu-west-1:123456789012:order-events-dlq\",\"maxReceiveCount\":\"5\"}"
    }
}
# 172800s = 2 days. A message received 5 times without being deleted moves to the DLQ.

Retention is now 2 days and poison-pill messages flow to a DLQ rather than rotting in the main queue.
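The walkthrough mentions the oldest-message alarm but does not show it. One way it could look is sketched below, assuming the standard AWS/SQS CloudWatch namespace and the QueueName dimension. The alarm name suffix and the SNS topic ARN are hypothetical placeholders, and the function only prints the command unless APPLY=1 is set.

```shell
#!/usr/bin/env bash
# Sketch of the oldest-message alarm from the scenario above.
# The alarm name suffix and the SNS topic ARN are hypothetical placeholders.
create_age_alarm() {
  local queue_name="$1" threshold_seconds="$2" topic_arn="$3"
  # Reuse the function's positional parameters as the argument list.
  set -- cloudwatch put-metric-alarm \
    --alarm-name "${queue_name}-oldest-message-age" \
    --namespace AWS/SQS \
    --metric-name ApproximateAgeOfOldestMessage \
    --dimensions "Name=QueueName,Value=${queue_name}" \
    --statistic Maximum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold "$threshold_seconds" \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$topic_arn"
  # Dry run by default: print the command. Set APPLY=1 to call AWS.
  if [ "${APPLY:-0}" = "1" ]; then
    aws "$@"
  else
    echo "aws $*"
  fi
}

# 30 minutes = 1800 seconds, the threshold chosen in the scenario.
create_age_alarm order-events-prod $((30 * 60)) \
  "arn:aws:sns:eu-west-1:123456789012:oncall-pages"
```

The Maximum statistic is used because the metric reports the age of the single oldest message, so an average across the period would understate how stale the queue has become.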

Retention under the hood (deep dive)

SQS does not bill for retention itself. The price model is per-request — $0.40 per million API calls for Standard queues, $0.50 per million for FIFO — and storage is included in the request charge. So in pure $ terms, a 14-day retention costs no more than a 1-hour retention if the queue is being drained healthily. The cost shows up indirectly, in two ways.

First, when a consumer is broken, long retention lets the backlog keep growing until the retention limit is hit. If consumers are still receiving messages and failing to process them, each message becomes visible again after its visibility timeout and is received again. Every one of those receives is a billable request, so the request count balloons. A queue that normally costs $5/month can quietly turn into a $500/month queue while the consumer is wedged. Second, downstream resources are either billed per call (Lambda invocations, third-party APIs) or capped (Lambda concurrency, RDS connections), so a flood of replays after a 14-day backlog releases can produce a large surprise bill or a secondary outage.

Retention is enforced server-side by SQS's storage layer. When set-queue-attributes updates MessageRetentionPeriod, the new value applies to every message currently in the queue and every message that arrives after; there's no per-message override. The change can take up to 15 minutes to propagate, after which messages that have already exceeded the new retention window are deleted. There is no warning, no DLQ delivery, no event: they're just gone.
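Since lowering retention can delete messages that are already older than the new value, a pre-flight check is worth running first. The sketch below is one possible approach, assuming GNU date and a configured AWS CLI. Both function names are hypothetical. The oldest-message age is read from CloudWatch, because that is where ApproximateAgeOfOldestMessage is published.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check before lowering MessageRetentionPeriod.

# Fetch the queue's peak ApproximateAgeOfOldestMessage (seconds) over the last
# 15 minutes from CloudWatch. Assumes GNU date and a configured AWS CLI.
oldest_message_age() {
  local queue_name="$1" age
  age=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/SQS \
    --metric-name ApproximateAgeOfOldestMessage \
    --dimensions "Name=QueueName,Value=${queue_name}" \
    --start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Maximum \
    --query 'max(Datapoints[].Maximum)' \
    --output text)
  # CloudWatch returns a float such as 5400.0, or None when there is no data.
  case "$age" in
    None|"") echo 0 ;;
    *) echo "${age%.*}" ;;
  esac
}

# Succeeds (exit 0) when the oldest message is already older than the
# proposed retention, meaning the change would delete messages.
retention_change_drops_messages() {
  local oldest_age="$1" new_retention="$2"
  [ "$oldest_age" -gt "$new_retention" ]
}

# Usage sketch (not executed here):
#   age=$(oldest_message_age order-events-prod)
#   if retention_change_drops_messages "$age" 172800; then
#     echo "drain or redrive first: oldest message is ${age}s old"
#   fi
```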

# List every queue in the account and pull MessageRetentionPeriod for each.
aws sqs list-queues --query 'QueueUrls[]' --output text | tr '\t' '\n' | while read -r url; do
  # list-queues prints "None" when the account has no queues; skip it.
  case "$url" in ""|None) continue ;; esac
  retention=$(aws sqs get-queue-attributes \
    --queue-url "$url" \
    --attribute-names MessageRetentionPeriod \
    --query 'Attributes.MessageRetentionPeriod' \
    --output text)
  printf '%s\t%s seconds (%s days)\n' "$url" "$retention" "$((retention / 86400))"
done

# Pipe through `sort -k2 -nr` to list the longest-retention queues first.
# The day count is integer division, so anything under 86400 seconds shows as 0 days.

What is the impact of leaving retention at the maximum?

The direct cost is small but real. A wedged consumer against a 14-day-retention queue produces 14 days of polling and replay request charges before SQS finally starts dropping messages. Take a high-throughput queue (say 1,000 messages/sec) with 10 idle consumers each polling once per second: over 1,209,600 seconds that is about 12 million ReceiveMessage calls, or roughly $4.84 at $0.40 per million. That fee is minor next to what the upstream and downstream services charge for the replay storm once roughly 1.2 billion backed-up messages are released.
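The polling arithmetic can be checked directly. The sketch below uses the polling rate from the paragraph above and the $0.40 per million Standard-queue request price quoted earlier. The function name `polling_cost` is hypothetical, and the estimate ignores free-tier requests and regional price differences.

```shell
#!/usr/bin/env bash
# Hypothetical estimator: ReceiveMessage calls and request cost for consumers
# that keep polling for a whole retention window.
polling_cost() {
  local consumers="$1" polls_per_second="$2" window_seconds="$3"
  local price_per_million="$4" requests dollars
  requests=$((consumers * polls_per_second * window_seconds))
  # Shell arithmetic is integer-only, so awk computes the dollar figure.
  dollars=$(awk -v r="$requests" -v p="$price_per_million" \
    'BEGIN { printf "%.2f", r / 1000000 * p }')
  echo "${requests} requests, \$${dollars}"
}

# 10 consumers, 1 poll per second each, 14 days (1209600 s), $0.40 per million.
polling_cost 10 1 1209600 0.40   # prints: 12096000 requests, $4.84
```

Ten pollers at one request per second come to about 12 million calls over 14 days. Reaching billions of calls would take thousands of polls per second.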

The second-order impact is operational. Long retention masks broken consumers: the queue depth grows for days before anyone notices, and by the time someone investigates, the consumer's owning team is on holiday and the runbook is two years out of date. A short retention paired with an age alarm would have paged someone within the hour; 14 days lets the problem incubate.

The third-order impact is data correctness. When SQS finally hits the retention limit and starts deleting messages, those messages are gone — no DLQ, no audit trail, no notification. For order pipelines, billing events, or analytics ingestion, this is a silent data-loss event. The team thinks they have a 14-day safety net; they actually have a 14-day countdown to losing data with zero observability.

On the compliance side, long retention can also extend the data-protection surface: SQS messages may contain PII, and a 14-day retention means that PII is queryable via API for two weeks after the producer wrote it. Some regulators treat this as a separate processing activity that needs its own justification.

How do you set retention correctly?

Picking the right retention is a four-step loop: understand the SLA, pick a value, wire up the safety net, and make sure new queues inherit the pattern.

1. Ask the consumer team for the real RTO

Before you change a single attribute, find out from the consumer's owning team how long they can be offline before customers notice or data is lost. That number — usually somewhere between 1 hour and 2 days for production workloads — is your retention floor. Multiply by 2–4× for safety, but don't reach for 14 days just because it's available. If nobody can answer the RTO question, that's the real finding to raise.
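The RTO-times-margin rule above can be turned into a retention value mechanically. A minimal sketch follows, with the hypothetical helper `retention_from_rto` clamping the result to the 60-second to 14-day range SQS accepts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive MessageRetentionPeriod (seconds) from the
# consumer's RTO in hours and a safety multiplier, clamped to the SQS range.
retention_from_rto() {
  local rto_hours="$1" multiplier="$2"
  local seconds=$((rto_hours * multiplier * 3600))
  if [ "$seconds" -lt 60 ]; then
    seconds=60
  elif [ "$seconds" -gt 1209600 ]; then
    seconds=1209600
  fi
  echo "$seconds"
}

retention_from_rto 4 4    # prints: 57600  (16 hours)
retention_from_rto 4 12   # prints: 172800 (2 days)
```

If the result hits the 14-day cap, that is a hint the RTO itself deserves a second look, not a reason to accept the maximum.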

2. Set retention and add a DLQ in the same change

Reducing retention without a DLQ is dangerous: you're just shrinking the silent-drop window. Always pair the change with a RedrivePolicy pointing at a dead-letter queue and a sensible maxReceiveCount (typically 5). The DLQ should have its own long retention (14 days is appropriate here; it's what 14 days was always supposed to mean) so failed messages survive long enough to be diagnosed. On standard queues a message's age keeps counting from its original enqueue time after it moves to the DLQ, which is another reason the DLQ's retention must exceed the source queue's.
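Setting retention and the redrive policy in one call means passing RedrivePolicy as a JSON string nested inside the attributes JSON, which is easy to get wrong by hand. The following sketch builds that payload. The function name `redrive_attributes` and the variables in the usage comments are hypothetical, and the inputs are assumed to contain no characters that need JSON escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the --attributes JSON for set-queue-attributes,
# with RedrivePolicy embedded as an escaped JSON string.
# Assumes the inputs contain no quotes or backslashes.
redrive_attributes() {
  local retention="$1" dlq_arn="$2" max_receive="$3"
  printf '{"MessageRetentionPeriod":"%s","RedrivePolicy":"{\\"deadLetterTargetArn\\":\\"%s\\",\\"maxReceiveCount\\":\\"%s\\"}"}\n' \
    "$retention" "$dlq_arn" "$max_receive"
}

# Usage sketch (requires a configured AWS CLI; not executed here):
#   aws sqs create-queue --queue-name order-events-dlq \
#     --attributes MessageRetentionPeriod=1209600
#   DLQ_ARN=$(aws sqs get-queue-attributes --queue-url "$DLQ_URL" \
#     --attribute-names QueueArn --query Attributes.QueueArn --output text)
#   aws sqs set-queue-attributes --queue-url "$QUEUE_URL" \
#     --attributes "$(redrive_attributes 172800 "$DLQ_ARN" 5)"
```

Creating the DLQ first matters: the redrive policy refers to it by ARN, so the source queue cannot be pointed at a queue that does not exist yet.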

3. Alarm on ApproximateAgeOfOldestMessage

Retention is the failure budget; the alarm is what tells you you're spending it. Create a CloudWatch alarm on ApproximateAgeOfOldestMessage set to something well below your retention, typically 10–25% of the retention value. For a 2-day retention that is roughly 5 to 12 hours, and a tight RTO justifies a lower threshold still. This catches stuck consumers long before SQS starts deleting anything, and gives the on-call engineer real time to react.
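The 10–25% guideline is simple to compute from the retention value. A minimal sketch, with the hypothetical helper `alarm_threshold_band` printing the lower and upper thresholds in seconds:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the 10% and 25% alarm thresholds (in seconds)
# for a given MessageRetentionPeriod.
alarm_threshold_band() {
  local retention="$1"
  echo "$((retention * 10 / 100)) $((retention * 25 / 100))"
}

alarm_threshold_band 172800   # prints: 17280 43200  (4.8 hours to 12 hours)
```

Treat the band as an upper bound. When the consumer's RTO is shorter than the lower figure, alarm at a fraction of the RTO instead.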

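4. Enforce the pattern via IaC and policy checks

New queues will drift back to the default, and somebody, somewhere, will pick 14 days again, unless the pattern is enforced. Bake retention + DLQ + alarm into your Terraform/CloudFormation queue module so every team gets them for free. For organisations with strong governance, a custom AWS Config rule can flag any queue with MessageRetentionPeriod > 345600 (4 days) and no attached RedrivePolicy. Config detects rather than blocks, and SCPs work at the level of API actions and IAM condition keys, which makes them a poor fit for attribute-level rules like this one. To reject such queues before they exist, add a policy-as-code check to the IaC pipeline.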

# Audit queues that keep messages longer than the 4-day default and have no DLQ:
# these are the high-risk ones. Requires jq.
aws sqs list-queues --query 'QueueUrls[]' --output text | tr '\t' '\n' | while read -r url; do
  # list-queues prints "None" when the account has no queues; skip it.
  case "$url" in ""|None) continue ;; esac
  attrs=$(aws sqs get-queue-attributes \
    --queue-url "$url" \
    --attribute-names MessageRetentionPeriod RedrivePolicy \
    --output json)
  retention=$(echo "$attrs" | jq -r '.Attributes.MessageRetentionPeriod // "0"')
  dlq=$(echo "$attrs" | jq -r '.Attributes.RedrivePolicy // "none"')
  if [ "$retention" -gt 345600 ] && [ "$dlq" = "none" ]; then
    printf 'RISK\t%s\tretention=%sd\tno-DLQ\n' "$url" "$((retention / 86400))"
  fi
done

You've completed Reduce SQS queue message retention. You now know that retention is a failure budget, not a safety blanket; that the right value comes from the consumer's real RTO; and that retention only works as a guard rail when paired with a DLQ and an oldest-message alarm. Next time you see a 14-day retention on a production queue, you'll have the four-step loop — ask the RTO, set retention + DLQ, alarm on age, enforce in IaC — ready to run.