Lambda error alarms: the basics
Why does a Lambda function need its own CloudWatch alarms?
Every Lambda function emits a small set of CloudWatch metrics by default: Invocations, Errors, Throttles, Duration, ConcurrentExecutions, and a few others. These metrics are free and automatic, published at one-minute granularity, but they're inert. Nothing happens when Errors spikes from 0 to 50 unless you've attached a CloudWatch alarm and pointed it at a notification target. Without that alarm, the function can fail in production for hours, days, or quietly forever, and the only signal anyone gets is a downstream user noticing their order didn't arrive.
A Lambda "without error alarms" usually means there's no CloudWatch alarm on Errors, no alarm on Throttles, and no alarm on Duration approaching the configured Timeout. The team relied on the function never failing — or on dashboards being checked — or on an upstream error handler doing the right thing. None of those substitute for an actual alarm with a notification target.
Security Hub doesn't catch this; AWS Config's lambda-function-settings-check doesn't catch it either. The check that does is COV-003 ("Lambda Functions Without Error Alarms"), which scans every function in the account and lists the ones with no alarm watching their Errors or Throttles metrics. Severity is medium because the failure mode is invisible-by-default rather than immediately destructive — but the cost of finding out late, in retries and incident hours, is usually much higher than the cost of the alarms.
In this lesson you'll learn the four Lambda metrics that actually matter, when to alarm on an absolute error count vs an error rate, the async-invoke retry trap that makes errors expensive even when they look harmless, and how to bulk-create a sensible alarm set across every function in an account. You'll see real CLI investigation, an Embedded Metric Format example for business-level alarms, and the standard pattern for auto-provisioning alarms on function creation.
Async invokes retry. Twice. Silently.
When Lambda is invoked asynchronously (by S3, SNS, EventBridge, or an SDK call with InvocationType=Event) and the function throws an error, Lambda retries the invocation twice by default, waiting about a minute before the first retry and about two minutes before the second, and then gives up. Each retry is a full billable invocation. Without an alarm or a DLQ, a fully broken async function quietly costs up to 3× its normal bill, succeeds at nothing, and produces no visible signal anywhere except the Errors metric you weren't watching. Teams routinely discover this only when the monthly bill is triple what it should be.
Adding Lambda alarms in action
Marco runs the platform team at a fintech. A weekly cost review flags that one of their event-processing Lambdas — order-enricher-prod — has burned through $2,800 in extra invocations over the last 14 days with no corresponding traffic increase. Nobody got paged. The function is invoked async from an EventBridge rule, has no DLQ, and no CloudWatch alarms.
He pulls the Errors metric for the last week. Errors is averaging 180 per hour against ~600 invocations per hour — a 30% failure rate. Every failure is silently retried twice, which is what's inflating the invocation count and the bill. The downstream system was tolerant enough that the missing data wasn't visible from the user side.
Step one is to confirm the gap. Step two is to put the alarms in place so the next incident pages someone within minutes instead of two weeks.
First, list every Lambda in the account that has no alarms attached to its Errors metric. The check joins the output of list-functions against describe-alarms-for-metric.
Inventory of every Lambda function and the number of alarms watching its Errors metric.
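A minimal sketch of that join, assuming the AWS CLI is already configured with credentials and a default region. The function names errors_alarm_count and lambda_alarm_inventory are illustrative, not part of any AWS tooling. As far as the CloudWatch documentation describes it, describe-alarms-for-metric reports only standard single-metric alarms, so a function covered solely by a metric-math alarm would still show a count of 0 here.

```shell
# Sketch: count the alarms watching each function's Errors metric.
# Assumes a configured AWS CLI; helper names are hypothetical.

# Print the number of standard alarms attached to one function's Errors metric.
errors_alarm_count() {   # usage: errors_alarm_count <function-name>
  aws cloudwatch describe-alarms-for-metric \
    --namespace AWS/Lambda --metric-name Errors \
    --dimensions "Name=FunctionName,Value=$1" \
    --query 'length(MetricAlarms)' --output text
}

# Print "<function-name> <alarm-count>" for every function in the region.
lambda_alarm_inventory() {
  aws lambda list-functions --query 'Functions[].FunctionName' --output text \
    | tr '\t' '\n' \
    | while read -r fn; do
        [ -n "$fn" ] || continue
        printf '%s %s\n' "$fn" "$(errors_alarm_count "$fn")"
      done
}

# Usage: list only the functions with no Errors alarm at all.
#   lambda_alarm_inventory | awk '$2 == 0 { print $1 }'
```

Nothing runs until you call lambda_alarm_inventory, so the block is safe to source into a shell session first and inspect.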
Now confirm the impact on order-enricher-prod by pulling the actual Errors-to-Invocations ratio for the last 7 days.
7-day error rate for the unmonitored function — sustained 30% failure, silently retried.
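One way to pull that ratio, sketched under the same assumption of a configured AWS CLI. The helper names metric_sum and error_rate_pct are hypothetical. Start and end times are passed as Unix epoch seconds, which the AWS CLI accepts for timestamp parameters, so the sketch avoids platform-specific date arithmetic.

```shell
# Sketch: 7-day Errors-to-Invocations ratio for one function.
# Assumes a configured AWS CLI; helper names are hypothetical.

# Sum one AWS/Lambda metric for a function over the last N days.
metric_sum() (   # usage: metric_sum <function-name> <metric-name> <days>
  end=$(date +%s)
  start=$((end - $3 * 86400))
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda --metric-name "$2" \
    --dimensions "Name=FunctionName,Value=$1" \
    --start-time "$start" --end-time "$end" \
    --period 86400 --statistics Sum \
    --query 'sum(Datapoints[].Sum)' --output text
)

# Turn two totals into a percentage, guarding against zero invocations.
error_rate_pct() {   # usage: error_rate_pct <errors> <invocations>
  awk -v e="$1" -v i="$2" \
    'BEGIN { if (i > 0) printf "%.1f\n", 100 * e / i; else print "n/a" }'
}

# Usage:
#   errors=$(metric_sum order-enricher-prod Errors 7)
#   invocations=$(metric_sum order-enricher-prod Invocations 7)
#   error_rate_pct "$errors" "$invocations"
```

With the hourly figures from Marco's investigation, error_rate_pct 180 600 prints 30.0.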
Lambda metrics under the hood (deep dive)
Lambda publishes its core metrics (Errors, Throttles, Invocations, Duration, and ConcurrentExecutions) to CloudWatch automatically, aggregated at one-minute granularity. Errors counts invocations that ended in a function error: an uncaught exception, a runtime error, a timeout, or an out-of-memory kill. Throttles counts invocation requests rejected because the account or reserved-concurrency limit was hit. ConcurrentExecutions reports how many function instances are processing events at the same time. Duration is the time the function code spends processing an event.
The two most useful alarms are usually built differently: an absolute Errors count alarm (e.g. >5 errors in 5 minutes) catches sudden spikes — a deploy that broke production, an upstream IAM key that just expired. An error-rate alarm — Errors / Invocations expressed as a percentage via a CloudWatch metric math expression — catches the slow-burn pattern, the function that's quietly failing 30% of the time at low volume and never trips the absolute-count threshold. Most mature teams run both, with different notification targets (rate → ticket, spike → page).
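The error-rate alarm is the less familiar of the two, so here is a sketch of how it could be expressed with metric math through put-metric-alarm's --metrics parameter. The helper names and the alarm name suffix are hypothetical, and the 2% threshold over five one-minute periods is only a starting point, not a recommendation for every workload. Missing data is treated as not breaching so that idle minutes, where the division has no invocations to work with, don't raise the alarm.

```shell
# Sketch: error-rate alarm built on metric math (Errors / Invocations).
# Assumes a configured AWS CLI; helper names are hypothetical.

# Print the --metrics JSON for one function: two hidden inputs, one expression.
error_rate_metrics_json() {   # usage: error_rate_metrics_json <function-name>
  cat <<EOF
[
  {"Id": "errors", "ReturnData": false,
   "MetricStat": {"Stat": "Sum", "Period": 60,
     "Metric": {"Namespace": "AWS/Lambda", "MetricName": "Errors",
       "Dimensions": [{"Name": "FunctionName", "Value": "$1"}]}}},
  {"Id": "invocations", "ReturnData": false,
   "MetricStat": {"Stat": "Sum", "Period": 60,
     "Metric": {"Namespace": "AWS/Lambda", "MetricName": "Invocations",
       "Dimensions": [{"Name": "FunctionName", "Value": "$1"}]}}},
  {"Id": "error_rate", "Label": "Error rate (%)", "ReturnData": true,
   "Expression": "100 * errors / invocations"}
]
EOF
}

# Alarm when the error rate stays above 2% for five consecutive minutes.
create_error_rate_alarm() {   # usage: create_error_rate_alarm <function-name> <topic-arn>
  aws cloudwatch put-metric-alarm \
    --alarm-name "$1-error-rate" \
    --metrics "$(error_rate_metrics_json "$1")" \
    --evaluation-periods 5 \
    --threshold 2 --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$2"
}

# Usage:
#   create_error_rate_alarm order-enricher-prod \
#     arn:aws:sns:eu-west-1:123456789012:lambda-alarms
```

Because the rate alarm is meant to open a ticket rather than page, you would typically pass a different SNS topic here than the one used for the spike alarm.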
Duration alarms should be set against the configured Timeout, not against a guess. If your Timeout is 30s, alarm at 24s (80%) on the p99 statistic — that gives you a window to investigate before invocations start dying with a Task timed out after… error. ConcurrentExecutions should be alarmed against the account or function concurrency limit; a sustained climb toward the limit is your canary for needing more reserved concurrency, or for a downstream service slowing down and back-pressuring the Lambda.
# The standard alarm set for a single Lambda function: Errors, Throttles, Duration.
FN=order-enricher-prod
TOPIC_ARN=arn:aws:sns:eu-west-1:123456789012:lambda-alarms
TIMEOUT_MS=30000
# 1. Absolute error spike: more than 5 errors in a single 5-minute window pages on a sudden break.
aws cloudwatch put-metric-alarm \
  --alarm-name "${FN}-errors-spike" \
  --namespace AWS/Lambda --metric-name Errors \
  --dimensions "Name=FunctionName,Value=${FN}" \
  --statistic Sum --period 300 --evaluation-periods 1 \
  --threshold 5 --comparison-operator GreaterThanThreshold \
  --alarm-actions "${TOPIC_ARN}"
# 2. Throttles: anything above 0 sustained for 3 minutes is a capacity problem.
aws cloudwatch put-metric-alarm \
  --alarm-name "${FN}-throttles" \
  --namespace AWS/Lambda --metric-name Throttles \
  --dimensions "Name=FunctionName,Value=${FN}" \
  --statistic Sum --period 60 --evaluation-periods 3 \
  --threshold 0 --comparison-operator GreaterThanThreshold \
  --alarm-actions "${TOPIC_ARN}"
# 3. Duration at 80% of Timeout: warn before invocations start timing out.
aws cloudwatch put-metric-alarm \
  --alarm-name "${FN}-duration-p99" \
  --namespace AWS/Lambda --metric-name Duration \
  --dimensions "Name=FunctionName,Value=${FN}" \
  --extended-statistic p99 --period 60 --evaluation-periods 5 \
  --threshold $((TIMEOUT_MS * 80 / 100)) --comparison-operator GreaterThanThreshold \
  --alarm-actions "${TOPIC_ARN}"
What is the impact of running Lambda without alarms?
The most direct impact is undetected failure. A Lambda invoked from API Gateway returns 5xx to the caller, and at least the caller knows something broke. A Lambda invoked async from S3, SNS, or EventBridge fails into the void — the event source has already moved on. Without an alarm, the only signal is downstream: the expected output never appears, the customer's report is missing, the invoice doesn't get sent. Teams routinely discover async Lambda failures from a customer support ticket several days later.
The second-order impact is cost. Async invokes retry twice by default, so a 30% failure rate on a 50M-invocation-per-month function adds 30M extra billable invocations — at roughly $0.20 per million plus GB-seconds, that's hundreds to thousands of dollars per month per function. If the function has a DLQ wired up, dead-letter writes also bill (SQS or SNS). If it doesn't, the events are silently dropped after the retries and you're paying for compute that produced nothing.
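The arithmetic behind that estimate can be sketched in a few lines. The request price is the $0.20 per million quoted above; the compute figures are assumptions for illustration only: a hypothetical 512 MB function averaging 1 second per invocation, at an assumed on-demand rate of $0.0000166667 per GB-second (check current pricing for your region and architecture). The sketch also assumes every retry of a failing event fails again, which is the worst case.

```shell
# Sketch: what async retries add to the bill. Helper names are hypothetical.

# Extra invocations, assuming every retry of a failing event also fails.
extra_invocations() {   # usage: extra_invocations <monthly-invocations> <failure-pct> <retries>
  echo $(( $1 * $2 / 100 * $3 ))
}

# Request charge only, at $0.20 per million invocations.
request_cost_usd() {   # usage: request_cost_usd <invocations>
  awk -v n="$1" 'BEGIN { printf "%.2f\n", n / 1000000 * 0.20 }'
}

# Compute charge at an ASSUMED $0.0000166667 per GB-second.
compute_cost_usd() {   # usage: compute_cost_usd <invocations> <memory-mb> <avg-ms>
  awk -v n="$1" -v mb="$2" -v ms="$3" \
    'BEGIN { printf "%.2f\n", n * (mb / 1024) * (ms / 1000) * 0.0000166667 }'
}

extra=$(extra_invocations 50000000 30 2)
echo "extra invocations: $extra"                                   # 30000000
echo "request charge:    $(request_cost_usd "$extra") USD"         # 6.00 USD
echo "compute charge:    $(compute_cost_usd "$extra" 512 1000) USD" # 250.00 USD
```

The request charge is almost negligible; the GB-seconds are where the retries hurt, and they scale directly with memory size and duration.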
Duration approaching Timeout has its own cost shape: every invocation that hits Timeout is billed for the full Timeout duration regardless of whether useful work happened. A function with a 30s Timeout whose slowest invocations quietly run at 28s is paying roughly 28× more for each of them than it would if they finished in 1s. Right-sizing memory (which proportionally scales CPU) often pays back instantly here, but you can't right-size what you can't see.
On the SRE side, no alarms means no MTTR data, no error budgets, no SLO. The function effectively has no SLA — it works until it doesn't, and "doesn't" can mean anything from a 30-second blip to a six-week silent outage. Compliance frameworks (SOC 2, ISO 27001) increasingly expect documented monitoring of production workloads; "we look at the dashboard occasionally" is not an answer that survives an audit.
How do you put Lambda alarms in place safely?
Adding alarms to existing Lambdas is a four-step loop. The hard part isn't writing the alarms — it's making sure they fire on the right thing, with the right threshold, and that every new function gets the same treatment automatically.
1. Inventory and prioritise
Use the inventory pattern above to list every function and the number of alarms attached. Sort by monthly invocation volume: the top 20% of functions by invocations are almost always the top 80% of impact. Cover those first with the standard alarm set (Errors spike, Throttles, Duration p99). Lower-traffic functions can get a minimal Errors alarm and graduate later.
2. Pick thresholds from real data, not from defaults
Pull 14 days of metrics for each function before setting a threshold. An Errors threshold of 5 in 5 minutes corresponds to a 0.1% error rate on a function handling 1,000 invocations per minute but a 10% error rate on one handling 10 per minute, so the same number is noisy on one function and nearly blind on the other. An error-rate alarm (Errors / Invocations * 100 via metric math) is volume-agnostic; alarm at >2% sustained for 5 minutes for most production functions. For Duration, set the alarm at 80% of the configured Timeout, on the p99 statistic, not the Average.
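A sketch of how the Duration threshold could be derived from the function's real configuration and checked against history, rather than hard-coded. It assumes a configured AWS CLI; the helper names are hypothetical. get-function-configuration reports Timeout in seconds, while the Duration metric is in milliseconds, hence the conversion.

```shell
# Sketch: derive the Duration alarm threshold from real data.
# Assumes a configured AWS CLI; helper names are hypothetical.

# Read the configured Timeout (in seconds) for a function.
configured_timeout_s() {   # usage: configured_timeout_s <function-name>
  aws lambda get-function-configuration \
    --function-name "$1" --query 'Timeout' --output text
}

# 80% of the Timeout, converted to milliseconds to match the Duration metric.
duration_threshold_ms() {   # usage: duration_threshold_ms <timeout-seconds>
  echo $(( $1 * 1000 * 80 / 100 ))
}

# Daily p99 Duration for the last 14 days, oldest first.
duration_p99_history() (   # usage: duration_p99_history <function-name>
  end=$(date +%s)
  start=$((end - 14 * 86400))
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda --metric-name Duration \
    --dimensions "Name=FunctionName,Value=$1" \
    --start-time "$start" --end-time "$end" \
    --period 86400 --extended-statistics p99 \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, ExtendedStatistics.p99]' \
    --output text
)

# Usage:
#   duration_threshold_ms "$(configured_timeout_s order-enricher-prod)"
#   duration_p99_history order-enricher-prod
```

If the daily p99 already sits above the computed threshold, fix the function or the Timeout before creating the alarm; otherwise it fires on day one and gets muted.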
3. Layer alarms on business outcomes, not just runtime
A Lambda can have 0 errors and still be broken — it can return wrong data, skip work, or silently swallow exceptions. Use Embedded Metric Format (EMF) inside the function to emit business metrics: orders_processed, payments_settled, records_written. Alarm on "orders_processed == 0 over 5 minutes" alongside the standard runtime alarms. AWS Lambda Powertools' metrics module gives you the EMF boilerplate for free.
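To make the idea concrete, here is a sketch of what an EMF log line looks like and how the matching alarm could be created. The namespace OrderPipeline, the service dimension, and the helper names are all hypothetical. The log line is produced with printf only to make its shape visible; inside a real function you would emit it through your runtime's logger or Powertools. Missing data is treated as breaching, so a function that stops emitting the metric entirely also trips the alarm.

```shell
# Sketch: a business-outcome metric via EMF, plus a "nothing processed" alarm.
# Namespace, dimension, and helper names are hypothetical.

# Print one EMF log line recording how many orders a batch processed.
emf_orders_processed() {   # usage: emf_orders_processed <service> <count>
  printf '{"_aws":{"Timestamp":%s,"CloudWatchMetrics":[{"Namespace":"OrderPipeline","Dimensions":[["service"]],"Metrics":[{"Name":"orders_processed","Unit":"Count"}]}]},"service":"%s","orders_processed":%s}\n' \
    "$(( $(date +%s) * 1000 ))" "$1" "$2"
}

# Alarm when no orders were processed in a 5-minute window.
create_zero_orders_alarm() {   # usage: create_zero_orders_alarm <service> <topic-arn>
  aws cloudwatch put-metric-alarm \
    --alarm-name "$1-no-orders-processed" \
    --namespace OrderPipeline --metric-name orders_processed \
    --dimensions "Name=service,Value=$1" \
    --statistic Sum --period 300 --evaluation-periods 1 \
    --threshold 0 --comparison-operator LessThanOrEqualToThreshold \
    --treat-missing-data breaching \
    --alarm-actions "$2"
}

# Usage:
#   emf_orders_processed order-enricher 3
#   create_zero_orders_alarm order-enricher \
#     arn:aws:sns:eu-west-1:123456789012:lambda-alarms
```

A zero-output alarm only makes sense for functions with steady traffic; for a function that legitimately idles overnight, widen the period or restrict the alarm to business hours.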
4. Auto-provision alarms on function creation
Manually adding alarms to existing functions is a one-off project; making sure every new function gets them is what makes it stick. Wire an EventBridge rule on eventName = CreateFunction20150331 to a small "alarm-provisioner" Lambda that reads tags (monitoring=standard, monitoring=high-value) and calls put-metric-alarm for the appropriate set. New functions arrive with alarms attached; engineers don't have to remember.
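The EventBridge side of that wiring might look like the sketch below. It assumes CloudTrail management events are being recorded in the region (the rule matches API calls delivered via CloudTrail) and that the provisioner Lambda already exists. The rule name, target Id, and helper names are hypothetical.

```shell
# Sketch: route Lambda CreateFunction events to an alarm-provisioner function.
# Assumes CloudTrail management events are recorded; names are hypothetical.

# Print the event pattern that matches new-function API calls.
create_function_event_pattern() {
  cat <<'EOF'
{
  "source": ["aws.lambda"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["lambda.amazonaws.com"],
    "eventName": ["CreateFunction20150331"]
  }
}
EOF
}

# Create the rule and point it at the provisioner Lambda.
wire_alarm_provisioner() {   # usage: wire_alarm_provisioner <provisioner-lambda-arn>
  aws events put-rule \
    --name lambda-alarm-provisioner \
    --event-pattern "$(create_function_event_pattern)"
  aws events put-targets \
    --rule lambda-alarm-provisioner \
    --targets "Id=provisioner,Arn=$1"
}

# Usage:
#   wire_alarm_provisioner \
#     arn:aws:lambda:eu-west-1:123456789012:function:alarm-provisioner
```

The provisioner also needs a resource-based permission allowing events.amazonaws.com to invoke it (aws lambda add-permission with the rule's ARN as the source), which is left out here to keep the sketch short.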
# Bulk-apply the standard alarm set to every production Lambda missing one.
TOPIC_ARN=arn:aws:sns:eu-west-1:123456789012:lambda-alarms
aws lambda list-functions --query 'Functions[?ends_with(FunctionName, `-prod`)].FunctionName' --output text \
  | tr '\t' '\n' \
  | while read -r FN; do
      # Count the alarms already watching this function's Errors metric.
      EXISTING=$(aws cloudwatch describe-alarms-for-metric \
        --namespace AWS/Lambda --metric-name Errors \
        --dimensions "Name=FunctionName,Value=$FN" \
        --query 'length(MetricAlarms)' --output text)
      if [ "$EXISTING" = "0" ]; then
        echo "Provisioning alarms for $FN"
        ./provision-lambda-alarms.sh "$FN" "$TOPIC_ARN"
      fi
    done
Quick quiz
Question 1 of 5: You discover a Lambda invoked asynchronously from S3 has been failing 30% of the time for two weeks with no alarms. You're about to fix the bug. What's the most important monitoring step to do at the same time?
Keep learning
Dig deeper into Lambda observability and the alarm patterns that make it stick.
- AWS Lambda, "Working with Lambda function metrics": canonical reference for every metric Lambda emits, what it counts, and when it ships.
- AWS Lambda Powertools: well-trodden library for logger, tracer, and EMF metrics inside the function; observability scaffolding without rolling your own.
- CloudWatch Embedded Metric Format (EMF) specification: emit structured metrics from inside the function to alarm on business outcomes, not just runtime.
- AWS Well-Architected, Reliability Pillar (Monitoring): where Lambda alarms fit into the broader monitoring strategy for serverless workloads.
You've completed "Add error alarms to Lambda functions". You now know which Lambda metrics actually matter, when to alarm on absolute counts versus error rates, why async-invoke retries make unmonitored failures so expensive, and how to bulk-provision alarms across an existing fleet and auto-attach them to every new function. The next time COV-003 flags an unmonitored Lambda, you'll have a four-step loop (inventory, threshold, business-outcome, auto-provision) ready to run.