Monitoring

Add error alarms to Lambda functions

A Lambda that throws errors silently is the most expensive failure mode in serverless — every retry costs money and the user never sees it.

12 min · 10 sections · AWS

Last reviewed 27 May 2026

Lambda error alarms: the basics

Why does a Lambda function need its own CloudWatch alarms?

Every Lambda function emits a small set of CloudWatch metrics by default: Invocations, Errors, Throttles, Duration, ConcurrentExecutions, and a few others. These metrics are free, automatic, and reported at one-minute granularity, but they're inert. Nothing happens when Errors spikes from 0 to 50 unless you've attached a CloudWatch alarm and pointed it at a notification target. Without that alarm, the function can fail in production for hours, days, or quietly forever, and the only signal anyone gets is the downstream user noticing their order didn't arrive.

A Lambda "without error alarms" usually means there's no CloudWatch alarm on Errors, no alarm on Throttles, and no alarm on Duration approaching the configured Timeout. The team relied on the function never failing — or on dashboards being checked — or on an upstream error handler doing the right thing. None of those substitute for an actual alarm with a notification target.

Security Hub doesn't catch this; AWS Config's lambda-function-settings-check doesn't catch it either. The check that does is COV-003 ("Lambda Functions Without Error Alarms"), which scans every function in the account and lists the ones with no alarm watching their Errors or Throttles metrics. Severity is medium because the failure mode is invisible-by-default rather than immediately destructive — but the cost of finding out late, in retries and incident hours, is usually much higher than the cost of the alarms.

In this lesson you'll learn the four Lambda metrics that actually matter, when to alarm on an absolute error count vs an error rate, the async-invoke retry trap that makes errors expensive even when they look harmless, and how to bulk-create a sensible alarm set across every function in an account. You'll see real CLI investigation, an Embedded Metric Format example for business-level alarms, and the standard pattern for auto-provisioning alarms on function creation.

Fun fact

Async invokes retry. Twice. Silently.

When Lambda is invoked asynchronously (by S3, SNS, EventBridge, or an SDK call with InvocationType=Event) and the function throws an error, Lambda retries the invocation twice by default, waiting about a minute before the first retry and about two minutes before the second, then gives up. Each retry is a full billable invocation. Without an alarm or a DLQ, a function that fails every time quietly costs 3× its normal bill, succeeds at nothing, and produces no visible signal anywhere except the Errors metric you weren't watching. Teams routinely discover this only when the monthly bill is triple what it should be.

Adding Lambda alarms in action

Marco runs the platform team at a fintech. A weekly cost review flags that one of their event-processing Lambdas — order-enricher-prod — has burned through $2,800 in extra invocations over the last 14 days with no corresponding traffic increase. Nobody got paged. The function is invoked async from an EventBridge rule, has no DLQ, and no CloudWatch alarms.

He pulls the Errors metric for the last week. Errors is averaging 180 per hour against ~600 invocations per hour — a 30% failure rate. Every failure is silently retried twice, which is what's inflating the invocation count and the bill. The downstream system was tolerant enough that the missing data wasn't visible from the user side.

Step one is to confirm the gap. Step two is to put the alarms in place so the next incident pages someone within minutes instead of two weeks.

First, list every Lambda in the account that has no alarms attached to its Errors metric. The check joins list-functions against describe-alarms-for-metric. It counts only alarms defined directly on the metric, so alarms that reference Errors inside a metric math expression do not show up here.

$ aws lambda list-functions --query 'Functions[].FunctionName' --output text | tr '\t' '\n' | while read -r FN; do echo "$FN $(aws cloudwatch describe-alarms-for-metric --namespace AWS/Lambda --metric-name Errors --dimensions Name=FunctionName,Value="$FN" --query 'length(MetricAlarms)' --output text)"; done

order-enricher-prod 0
payment-callback-prod 0
user-signup-prod 1
billing-monthly-prod 0
report-generator-prod 1
image-thumbnailer-prod 0

# 4 of 6 production functions have zero alarms on Errors. That's the gap.

Inventory of every Lambda function and the number of alarms watching its Errors metric.

Now confirm the impact on order-enricher-prod by pulling the actual Errors-to-Invocations ratio for the last 7 days.

$ aws cloudwatch get-metric-data --metric-data-queries file://error-rate.json --start-time $(date -u -d '7 days ago' +%FT%TZ) --end-time $(date -u +%FT%TZ)

{
  "MetricDataResults": [
    {
      "Id": "errorRate",
      "Label": "Errors / Invocations (%)",
      "Values": [29.7, 31.2, 28.4, 30.9, 32.1, 29.0, 30.5],
      "StatusCode": "Complete"
    }
  ]
}

# ~30% sustained error rate. Every errored attempt, first try or retry, is billed in full and produces nothing.

7-day error rate for the unmonitored function — sustained 30% failure, silently retried.
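The command above reads its queries from error-rate.json, which is not shown. A minimal sketch of what that file could contain follows. The helper name write_error_rate_query is hypothetical, and the daily period (86400 seconds) is an assumption chosen to match the seven values in the output.

```shell
# Hypothetical helper: writes a get-metric-data query file that divides
# Errors by Invocations for one function. The daily period is an assumption.
write_error_rate_query() {
  fn="$1"; out="$2"
  cat > "$out" <<EOF
[
  {
    "Id": "errors",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{ "Name": "FunctionName", "Value": "${fn}" }]
      },
      "Period": 86400,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "invocations",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/Lambda",
        "MetricName": "Invocations",
        "Dimensions": [{ "Name": "FunctionName", "Value": "${fn}" }]
      },
      "Period": 86400,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "errorRate",
    "Expression": "100 * errors / invocations",
    "Label": "Errors / Invocations (%)"
  }
]
EOF
}

write_error_rate_query order-enricher-prod error-rate.json
```

Only the errorRate expression returns data; the two raw metrics are inputs, which is why the response contains a single result.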

Lambda metrics under the hood

Lambda publishes its core metrics automatically: Errors, Throttles, Invocations, Duration, and ConcurrentExecutions arrive in CloudWatch at one-minute granularity. Errors counts invocations that ended in a function error (uncaught exception, runtime error, timeout, OOM). Throttles counts invocation requests rejected because the account or reserved-concurrency limit was hit; throttled requests are not counted in Invocations or Errors. ConcurrentExecutions reports how many invocations are running at the same instant. Duration is the time the function code spends processing an event, reported in milliseconds.

The two most useful alarms are usually built differently: an absolute Errors count alarm (e.g. >5 errors in 5 minutes) catches sudden spikes — a deploy that broke production, an upstream IAM key that just expired. An error-rate alarm — Errors / Invocations expressed as a percentage via a CloudWatch metric math expression — catches the slow-burn pattern, the function that's quietly failing 30% of the time at low volume and never trips the absolute-count threshold. Most mature teams run both, with different notification targets (rate → ticket, spike → page).

Duration alarms should be set against the configured Timeout, not against a guess. If your Timeout is 30s, alarm at 24s (80%) on the p99 statistic — that gives you a window to investigate before invocations start dying with a Task timed out after… error. ConcurrentExecutions should be alarmed against the account or function concurrency limit; a sustained climb toward the limit is your canary for needing more reserved concurrency, or for a downstream service slowing down and back-pressuring the Lambda.

# The standard alarm set for a single Lambda function: Errors, Throttles, Duration.
FN=order-enricher-prod
TOPIC_ARN=arn:aws:sns:eu-west-1:123456789012:lambda-alarms
TIMEOUT_MS=30000

# 1. Absolute error spike: more than 5 errors in one 5-minute window. Pages on a sudden break.
# Lambda publishes no Errors datapoint while a function is idle, so treat gaps as healthy.
aws cloudwatch put-metric-alarm \
  --alarm-name "${FN}-errors-spike" \
  --namespace AWS/Lambda --metric-name Errors \
  --dimensions Name=FunctionName,Value=${FN} \
  --statistic Sum --period 300 --evaluation-periods 1 \
  --threshold 5 --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions ${TOPIC_ARN}

# 2. Throttles — anything above 0 sustained is a capacity problem.
aws cloudwatch put-metric-alarm \
  --alarm-name "${FN}-throttles" \
  --namespace AWS/Lambda --metric-name Throttles \
  --dimensions Name=FunctionName,Value=${FN} \
  --statistic Sum --period 60 --evaluation-periods 3 \
  --threshold 0 --comparison-operator GreaterThanThreshold \
  --alarm-actions ${TOPIC_ARN}

# 3. Duration at 80% of Timeout — warn before invocations start timing out.
aws cloudwatch put-metric-alarm \
  --alarm-name "${FN}-duration-p99" \
  --namespace AWS/Lambda --metric-name Duration \
  --dimensions Name=FunctionName,Value=${FN} \
  --extended-statistic p99 --period 60 --evaluation-periods 5 \
  --threshold $((TIMEOUT_MS * 80 / 100)) --comparison-operator GreaterThanThreshold \
  --alarm-actions ${TOPIC_ARN}
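
The error-rate alarm described earlier needs metric math rather than a single metric. One way it might look is sketched here, assuming the 2% threshold and five-minute window suggested in this lesson. The helper names error_rate_metrics and create_error_rate_alarm are hypothetical.

```shell
# Hypothetical helper: prints the --metrics JSON for an error-rate alarm.
# Only the expression returns data, which is what the alarm evaluates.
error_rate_metrics() {
  fn="$1"
  cat <<EOF
[
  {"Id": "errors", "ReturnData": false, "MetricStat": {"Stat": "Sum", "Period": 60,
    "Metric": {"Namespace": "AWS/Lambda", "MetricName": "Errors",
      "Dimensions": [{"Name": "FunctionName", "Value": "${fn}"}]}}},
  {"Id": "invocations", "ReturnData": false, "MetricStat": {"Stat": "Sum", "Period": 60,
    "Metric": {"Namespace": "AWS/Lambda", "MetricName": "Invocations",
      "Dimensions": [{"Name": "FunctionName", "Value": "${fn}"}]}}},
  {"Id": "errorRate", "ReturnData": true, "Label": "Errors / Invocations (%)",
    "Expression": "100 * errors / invocations"}
]
EOF
}

# Alarm when the rate stays above 2% for five consecutive one-minute periods.
# Defined but not called here; invoke it with a function name and an SNS topic ARN.
create_error_rate_alarm() {
  fn="$1"; topic_arn="$2"
  aws cloudwatch put-metric-alarm \
    --alarm-name "${fn}-error-rate" \
    --metrics "$(error_rate_metrics "$fn")" \
    --evaluation-periods 5 \
    --threshold 2 --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$topic_arn"
}
```

With --metrics, the namespace, metric name, statistic, and period move inside the JSON, so the usual --namespace and --statistic flags are omitted. A call would look like create_error_rate_alarm "$FN" "$TOPIC_ARN".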

What is the impact of running Lambda without alarms?

The most direct impact is undetected failure. A Lambda invoked from API Gateway returns 5xx to the caller, and at least the caller knows something broke. A Lambda invoked async from S3, SNS, or EventBridge fails into the void — the event source has already moved on. Without an alarm, the only signal is downstream: the expected output never appears, the customer's report is missing, the invoice doesn't get sent. Teams routinely discover async Lambda failures from a customer support ticket several days later.

The second-order impact is cost. Async invokes retry twice by default, so a 30% failure rate on a 50M-invocation-per-month function adds 30M extra billable invocations — at roughly $0.20 per million plus GB-seconds, that's hundreds to thousands of dollars per month per function. If the function has a DLQ wired up, dead-letter writes also bill (SQS or SNS). If it doesn't, the events are silently dropped after the retries and you're paying for compute that produced nothing.
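
The retry arithmetic can be checked with a few lines of shell. This is a rough sketch, not a billing calculator: the function names are hypothetical, the prices are assumptions ($0.20 per million requests and $0.0000166667 per GB-second, typical x86 on-demand rates), and it assumes every retry of a failed event also fails.

```shell
# Extra billable invocations when a share of async events fails every attempt.
# Arguments: monthly event count, failure rate in whole percent.
retry_invocations() {
  echo $(( $1 * $2 / 100 * 2 ))   # two retries per failed event
}

# Approximate cost of those retries in USD. Prices are assumptions (see above).
# Arguments: invocation count, memory in GB, duration in seconds.
retry_cost_usd() {
  awk -v n="$1" -v gb="$2" -v sec="$3" \
    'BEGIN { printf "%.2f\n", n / 1000000 * 0.20 + n * gb * sec * 0.0000166667 }'
}

extra=$(retry_invocations 50000000 30)
echo "extra invocations: $extra"
echo "extra cost (1 GB, 1 s each): \$$(retry_cost_usd "$extra" 1 1)"
```

At 1 GB and one second per invocation the request charge is a rounding error next to the compute charge, which is why memory size and duration decide whether the waste is hundreds or thousands of dollars.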

Duration approaching Timeout has its own cost shape: every invocation that hits Timeout is billed for the full Timeout duration regardless of whether useful work happened. On a function with a 30s Timeout whose p99 has crept up to 28s, the slowest 1% of invocations each cost nearly 30× what a 1s invocation would, and they are one bad day away from timing out entirely. Right-sizing memory (which proportionally scales CPU) often pays back instantly here, but you can't right-size what you can't see.

On the SRE side, no alarms means no MTTR data, no error budgets, no SLO. The function effectively has no SLA — it works until it doesn't, and "doesn't" can mean anything from a 30-second blip to a six-week silent outage. Compliance frameworks (SOC 2, ISO 27001) increasingly expect documented monitoring of production workloads; "we look at the dashboard occasionally" is not an answer that survives an audit.

How do you put Lambda alarms in place safely?

Adding alarms to existing Lambdas is a four-step loop. The hard part isn't writing the alarms — it's making sure they fire on the right thing, with the right threshold, and that every new function gets the same treatment automatically.

1. Inventory and prioritise

Use the CLI inventory above to list every function and the number of alarms attached. Sort by monthly invocation volume: a small share of functions usually carries most of the traffic, and most of the impact. Cover those first with the standard alarm set (Errors spike, Throttles, Duration p99). Lower-traffic functions can get a minimal Errors alarm and graduate later.

2. Pick thresholds from real data, not from defaults

Pull 14 days of metrics for each function before setting a threshold. An absolute threshold means different things at different volumes: 5 errors in 5 minutes is a 0.1% error rate on a 1000-rpm function, where it may fire on background noise, and a 10% error rate on a 10-rpm function, where a serious failure can sit just under it. An error-rate alarm (Errors / Invocations * 100 via metric math) is volume-agnostic; alarm at >2% sustained for 5 minutes for most production functions. For Duration, set the alarm at 80% of the configured Timeout, on the p99 statistic, not the Average.
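
For the Duration threshold, the configured Timeout can be read from the function instead of being typed in by hand. A small sketch under that assumption follows; both function names are hypothetical.

```shell
# 80% of a timeout given in seconds, expressed in milliseconds
# (the unit of the Duration metric).
duration_threshold_ms() {
  echo $(( $1 * 1000 * 80 / 100 ))
}

# Hypothetical wrapper: look up the configured Timeout (reported in seconds)
# rather than hard-coding it. Not called here because it needs AWS credentials.
duration_threshold_for() {
  timeout_s=$(aws lambda get-function-configuration \
    --function-name "$1" --query 'Timeout' --output text) || return 1
  duration_threshold_ms "$timeout_s"
}

duration_threshold_ms 30
```

Reading the value at provisioning time keeps the alarm in step with the function: if someone later raises the Timeout, re-running the provisioning recalculates the threshold.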

3. Layer alarms on business outcomes, not just runtime

A Lambda can have 0 errors and still be broken: it can return wrong data, skip work, or silently swallow exceptions. Use Embedded Metric Format (EMF) inside the function to emit business metrics: orders_processed, payments_settled, records_written. Alarm on "orders_processed == 0 over 5 minutes" alongside the standard runtime alarms. The metrics module in Powertools for AWS Lambda gives you the EMF boilerplate for free.
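
To make the format concrete, here is a sketch of a single EMF log line written from shell. In practice the function's own runtime (or Powertools) would print it. The helper name emit_emf_count and the OrderEnricher namespace are assumptions for illustration.

```shell
# Prints one EMF log line. Written to stdout inside a Lambda, it lands in
# CloudWatch Logs, where the metric is extracted. Names are illustrative.
emit_emf_count() {
  namespace="$1"; fn="$2"; metric="$3"; value="$4"
  ts_ms=$(( $(date +%s) * 1000 ))   # EMF timestamps are epoch milliseconds
  printf '{"_aws":{"Timestamp":%s,"CloudWatchMetrics":[{"Namespace":"%s","Dimensions":[["FunctionName"]],"Metrics":[{"Name":"%s","Unit":"Count"}]}]},"FunctionName":"%s","%s":%s}\n' \
    "$ts_ms" "$namespace" "$metric" "$fn" "$metric" "$value"
}

emit_emf_count OrderEnricher order-enricher-prod orders_processed 42
```

The metric definition sits under _aws, while the dimension and metric values are ordinary top-level keys. An alarm on the Sum of orders_processed with a LessThanThreshold comparison at 1 and --treat-missing-data breaching would then also fire when the function stops emitting the metric altogether.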

4. Auto-provision alarms on function creation

Manually adding alarms to existing functions is a one-off project; making sure every new function gets them is what makes it stick. Wire an EventBridge rule on eventName = CreateFunction20150331 (the name CloudTrail records for Lambda's CreateFunction call, so a trail logging management events must be enabled) to a small "alarm-provisioner" Lambda that reads tags (monitoring=standard, monitoring=high-value) and calls put-metric-alarm for the appropriate set. New functions arrive with alarms attached; engineers don't have to remember.
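
The rule itself could be sketched as follows, assuming CloudTrail management events are already being logged. The rule name lambda-created, the target Id, and both function names are placeholders, and the provisioner Lambda is not shown.

```shell
# Event pattern matching Lambda CreateFunction calls recorded by CloudTrail.
create_function_pattern() {
  cat <<'EOF'
{
  "source": ["aws.lambda"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["lambda.amazonaws.com"],
    "eventName": ["CreateFunction20150331"]
  }
}
EOF
}

# Hypothetical wiring: creates the rule and points it at the provisioner.
# Defined but not called here; pass the provisioner function's ARN.
wire_alarm_provisioner() {
  provisioner_arn="$1"
  aws events put-rule --name lambda-created \
    --event-pattern "$(create_function_pattern)" &&
  aws events put-targets --rule lambda-created \
    --targets "Id=alarm-provisioner,Arn=${provisioner_arn}"
}
```

The provisioner also needs a resource-based permission (aws lambda add-permission with principal events.amazonaws.com) before EventBridge is allowed to invoke it.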

# Bulk-apply the standard alarm set to every production Lambda missing one.
TOPIC_ARN=arn:aws:sns:eu-west-1:123456789012:lambda-alarms

# '-prod' is a JMESPath raw string literal; backtick literals are expected to be valid JSON.
aws lambda list-functions --query "Functions[?ends_with(FunctionName, '-prod')].FunctionName" --output text \
  | tr '\t' '\n' \
  | while read -r FN; do
      [ -n "$FN" ] || continue   # skip the empty line produced when nothing matches
      EXISTING=$(aws cloudwatch describe-alarms-for-metric \
        --namespace AWS/Lambda --metric-name Errors \
        --dimensions Name=FunctionName,Value="$FN" \
        --query 'length(MetricAlarms)' --output text)
      if [ "$EXISTING" = "0" ]; then
        echo "Provisioning alarms for $FN"
        ./provision-lambda-alarms.sh "$FN" "$TOPIC_ARN"
      fi
    done

Quick quiz

You discover a Lambda invoked asynchronously from S3 has been failing 30% of the time for two weeks with no alarms. You're about to fix the bug. What's the most important monitoring step to do at the same time?

You've completed Add error alarms to Lambda functions. You now know which Lambda metrics actually matter, when to alarm on absolute counts versus error rates, why async-invoke retries make unmonitored failures so expensive, and how to bulk-provision alarms across an existing fleet and auto-attach them to every new function. The next time COV-003 flags an unmonitored Lambda, you'll have a four-step loop — inventory, threshold, business-outcome, auto-provision — ready to run.

Lambda error alarms: what they cost and what silence costs more

Unmonitored Lambda failures are a direct, quantifiable line on the bill — and you won't see them until weeks later

AWS Lambda is billed per invocation and per GB-second of compute. That sounds clean until you learn that a failing async Lambda retries twice by default: automatically, silently, and fully billable. The multiplier is steeper than it looks. If 30% of a function's events fail and the retries fail too, the function makes about 1.6 invocations per event, so it costs roughly 60% more than a healthy function at the same traffic, not 30% more. A function that fails every time costs three times as much. COV-003 flags functions with no alarms watching this, at medium severity.

The finance framing is straightforward: an unmonitored Lambda is an open cost line with no ceiling. You only find out something is wrong when the bill arrives — or when a customer tickets about missing data. Alarms are cheap (CloudWatch alarms cost a few cents per metric per month) and they are the only mechanism that turns a silent failure into an actionable signal before the cost accumulates.

This is not a blanket "alarm everything equally" situation. The right approach is to tier by invocation volume and business criticality: high-volume production functions get a full alarm set (absolute error spike, sustained error-rate, throttles, duration at 80% of timeout); lower-traffic or internal functions get a minimal errors alarm. The goal is that every function's failure mode has a known cost ceiling, not that every function gets the same treatment.

This lesson is for the finance partner who wants to understand why Lambda errors translate directly to overspend, how to quantify the exposure per function, and what a cost-justified monitoring posture looks like. You'll learn how the retry-billing mechanism works, how to read the invocation-to-error ratio on the bill, and how to frame the alarm spend as a small, predictable cost against a potentially large silent-failure cost. No CLI or infrastructure knowledge required.

How a finance partner reads a Lambda cost anomaly

Priya is the FinOps lead reviewing the monthly AWS bill. One line item catches her eye: a single Lambda function, order-enricher-prod, has $2,800 in invocation charges over 14 days that can't be explained by traffic growth. The traffic trend from EventBridge is flat. The invocation count is roughly 60% higher than it was the previous month.

She knows the async-retry pattern: when a Lambda fails, AWS retries it twice, billing each attempt. A jump in invocations against flat traffic is the tell. She asks the platform team for the failure rate: 30% of events are failing, and the retries fail as well. That puts invocations at about 1.6 times the healthy baseline, with more than half of them (0.9 of every 1.6) ending in an error. The function has no alarms. Nobody was paged. The bug has been running for weeks.

Her takeaway for the finance pack is cost and process: the $2,800 overspend is recoverable, but the real exposure is that this pattern can recur on any of the other unmonitored functions in the account. The fix is not just patching this function — it's requiring that every production Lambda has an alarm that would have surfaced this within minutes. She adds it to the FinOps governance checklist as a standing requirement, alongside a monthly check of COV-003 findings.

The direct cost of an unmonitored Lambda — and how to model it

The billing impact of Lambda errors is more mechanical than most cloud cost issues: it's the async-retry multiplier. An async invocation that fails is retried twice by AWS, each at full price. A function with a 30% failure rate on 50 million events per month generates roughly 30 million extra billable invocations. The request charge on those is small (about $6 at the standard $0.20 per million); the compute charge is what hurts. At 1 GB of memory and one second per invocation, the retries add roughly $500 per month at typical x86 on-demand rates, and heavier or longer-running functions reach thousands of dollars per month. Without an alarm, this accumulates undetected for weeks.

The duration-to-timeout pattern adds a second exposure. Lambda bills for the full configured Timeout on any invocation that hits it, not just the work done. On a function with a 30-second Timeout whose slowest 1% of invocations (the p99) run for 28 seconds, each of those invocations costs nearly 30 times what a one-second invocation would, and any that cross the limit are billed in full for nothing. This is invisible without a Duration p99 alarm.

For the finance model, the right unit of exposure is per-function-per-month: take the current monthly event count, apply the expected failure rate (or a conservative 5–10% assumption if unknown), multiply by 3 (the failed first attempt plus two retries), and cost it at the function's per-invocation rate. That gives you the wasted-compute exposure for that function. For a portfolio of unmonitored production Lambdas the aggregate is often surprisingly large.

The mitigation cost is small and fixed: a CloudWatch alarm costs roughly $0.10 per metric per month. A full alarm set (Errors spike, error rate, Throttles, Duration p99) is under $0.50 per function per month. The case for adding alarms to every production Lambda is not a close call financially — the only question is whether you tier the alarm set by function criticality or apply a uniform baseline.

The finance levers for getting Lambda alarms in place

Finance can't write CloudWatch alarms, but it owns the framing that turns alarm coverage from a nice-to-have into a budgeted, auditable requirement. Four levers, applied at the regular governance cadence.

1. Quantify the unmonitored exposure before the next review

Ask engineering for the list of production Lambdas with no alarms (COV-003 output) and their monthly invocation volumes. Apply a conservative 5% failure-rate assumption to each and multiply by 3 (the failed first attempt plus two retries). That gives you an estimated wasted-invocation cost for each unmonitored function, a concrete number to put next to the remediation cost (under $0.50 per function per month for a full alarm set). The case almost always makes itself.

2. Require alarm coverage as a production gate

Work with engineering to add alarm coverage to the definition of production-readiness. A function without at least an Errors alarm should not be deployed to production without a documented exception. This is a process control, not a technical one — finance owns the budget approval for production workloads and can make this a condition of sign-off.

3. Track COV-003 findings with their estimated wasted-cost exposure

Put the count of COV-003 production findings on the monthly FinOps review, alongside the estimated monthly overspend from unmonitored retries. Trend it. A declining count with a tracked cost impact is the metric that keeps engineering prioritising this work against other demands.

4. Treat the auto-provisioner as infrastructure spend, not overhead

The auto-provisioner Lambda that attaches alarms to new functions on creation (step 4 of the engineering remediation) is a small, fixed cost that prevents the problem from recurring. Budget it explicitly as a monitoring infrastructure line, not as a project task that gets cut. Its value is preventing future COV-003 findings before they accrue cost.

Quick quiz

A COV-003 scan shows 8 production Lambdas with no alarms. The team estimates the three highest-volume functions account for 95% of total invocations. How should the finance partner frame the remediation priority?

You've finished the finance partner's view of COV-003. You know how to model the wasted-retry cost exposure per function, why the remediation cost (cents per alarm per month) makes this a straightforward ROI, and the four governance levers — quantify exposure first, make alarm coverage a production gate, track the trend with cost impact, and budget the auto-provisioner as infrastructure — that keep the posture from slipping. The next time this shows up on the FinOps report, you'll have a sharper question than "how many findings are there?"

Lambda error alarms: the board-level question

Are we paying for compute that fails silently, and do we know about it before the customer does?

Serverless functions are invisible-by-default when they fail. Unlike a web server that returns an error to the caller, an async Lambda that fails just retries quietly and is billed for two more attempts before giving up. A function can fail 30% of the time for weeks, run about 60% over its expected bill, and produce no customer-visible signal until someone notices on the monthly invoice or a customer files a support ticket.

The leadership question is simple: do we have a mechanism that tells us a production function is broken before the customer does? COV-003 flags every function that lacks one. Alarms are not expensive or complex to add — they're CloudWatch rules that fire a notification when a metric crosses a threshold. The absence of them is the risk. The healthy posture is that every production workload has a known owner who gets paged when it fails, not a dashboard someone checks occasionally.

A short read for the executive who needs to know whether the engineering team has a reliable mechanism for detecting silent Lambda failures. You'll learn why async serverless workloads fail differently from web services, why that makes monitoring a business accountability question rather than a purely technical one, and what the one-line answer looks like at a leadership review: every production function has an owner who gets paged when it breaks, and exceptions are documented.

What it looks like when a silent failure reaches the boardroom

At one company, the CTO first learned that a core data-enrichment Lambda had been broken for three weeks not from engineering — from the customer success team, who'd been fielding complaints about missing order history. The function was invoked asynchronously, had no DLQ, and no alarms. Nobody owned it. The bill for that month had a $4,000 anomaly nobody had investigated.

The question she asked wasn't technical: "Why did no one know?" The answer was that there was no system designed to tell anyone. The function ran, failed, retried, charged, and produced nothing — and the only way to notice was to either check a dashboard proactively or wait for a customer to complain.

The corrective action she drove wasn't about Lambda specifically. It was a policy: every production workload has a named owner and a notification path. If it breaks, someone gets paged within minutes, not weeks. COV-003 is the mechanism that enforces this for Lambda — every function it flags is a workload that doesn't yet meet that standard.

Why Lambda monitoring is a governance question, not an engineering detail

The headline risk is not technical — it's accountability. A production Lambda without alarms is a workload with no owner in the operational sense: nobody is notified when it breaks, nobody is on the hook for its reliability, and the business discovers failures through customer complaints or unexpected bills rather than through internal signals. That is a governance gap, not just a monitoring gap.

The cost side is real and quantifiable. A failing async Lambda retries twice, so a broken function costs up to three times its expected bill before anyone notices. For high-volume functions this is thousands of dollars per month per function, accruing silently until the invoice closes. COV-003 surfaces every function in that state.

The compliance exposure is growing. SOC 2, ISO 27001, and many enterprise security frameworks now expect documented monitoring of production workloads. "We review dashboards occasionally" is not a control. Alarms with named notification targets and documented owners are. Getting COV-003 to zero is as much about audit readiness as it is about cost or reliability.

The leadership move on COV-003

The executive action on Lambda alarm coverage isn't to mandate a specific alarm type — it's to establish that every production workload has an operational owner and a notification path, and that gaps are tracked and closed.

1. Set the standard: production workloads get monitored

Make it a standing policy that any function processing production traffic has at minimum one alarm with a named notification target. The specifics (which metrics, which thresholds) are engineering decisions. The accountability question — "who gets paged when this breaks?" — is not. Require an answer to that question for every production Lambda.

2. Accept a tiered approach for lower-stakes functions

Not every Lambda needs a full five-alarm set. Internal utility functions, low-volume batch jobs, and dev-environment Lambdas can carry a minimal Errors alarm or none at all. The tier boundary should be explicit: production-traffic functions get monitored, everything else by documented exception. Don't drive COV-003 to zero at the cost of over-alarming low-stakes functions — drive it to zero for production-critical ones.

3. Ask for the trend on COV-003 production findings

At the leadership review, the one metric that matters is the count of production Lambdas with no Errors alarm, trended month over month. A declining number means the org is closing the gap. Flat or rising means monitoring is not being treated as a production requirement. That is the signal to act on — not the absolute count.

Quick quiz

Your security review shows 12 Lambda functions flagged by COV-003. Engineering confirms 4 are production-critical, 5 are internal tooling, and 3 are dev-environment functions. What's the right executive response?

That's the lesson. One takeaway: a Lambda without alarms is a production workload with no operational owner — it breaks silently, costs more than it should, and your first signal is a customer complaint or a surprise on the invoice. The leadership standard is that every production-critical function has a named owner who gets notified when it fails. COV-003 tracks the gap. A declining count means the policy is being enforced; a flat one means it isn't.

Part of the learning path Get your alarms right
