Right-size Lambda provisioned concurrency

Provisioned concurrency keeps execution environments warm to kill cold starts — but you pay for every warm second whether traffic uses it or not. Set it wrong and you're heating an empty room around the clock.

13 min·10 sections·AWS

Right-sizing provisioned concurrency: the basics

Why warm capacity is a second meter you opted into

On-demand Lambda is pay-per-use: an environment spins up when a request arrives, and a brand-new environment incurs a cold start (the time to download code, start the runtime, and run your init code). Provisioned concurrency (PC) is the fix for latency-sensitive functions: you tell Lambda to keep N execution environments initialised and warm at all times, so requests served by those environments skip the cold start entirely. The catch is that this is a separate, always-on price dimension. In us-east-1 you pay roughly $0.0000041667 per GB-second of provisioned concurrency for every second the capacity is reserved, whether a single request ever touches it or not, and you still pay the per-request fee and a (reduced) duration charge on top when requests do arrive.

That changes the cost model fundamentally. On-demand Lambda scales to zero: no traffic, no bill. PC does not scale to zero. It bills 86,400 seconds a day per reserved environment, times the function's memory in GB, times the PC rate. A 1024 MB function with 50 units of PC reserved 24/7 is about $540 a month in PC charges before a single invocation (50 units × 1 GB × 2,592,000 seconds × $0.0000041667). If that function only sees real concurrency of 5–8 during business hours and nothing overnight, you're paying to keep 40-plus environments warm that traffic never reaches.
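The reservation arithmetic can be sketched in a few lines of Python. This is a minimal illustration, assuming the us-east-1 rate quoted in this lesson and a 30-day month; the function name `monthly_reservation_cost` is hypothetical, and the 8-unit figure is simply the top of the 5–8 real-concurrency range from the example.

```python
# Reservation-only cost of provisioned concurrency (no invocations included).
PC_RATE_PER_GB_SECOND = 0.0000041667    # us-east-1 rate quoted in this lesson
SECONDS_PER_30_DAY_MONTH = 30 * 86_400  # 2,592,000


def monthly_reservation_cost(units: int, memory_mb: int) -> float:
    """Dollars per 30-day month to hold `units` environments warm 24/7."""
    memory_gb = memory_mb / 1024
    return units * memory_gb * SECONDS_PER_30_DAY_MONTH * PC_RATE_PER_GB_SECOND


reserved = monthly_reservation_cost(50, 1024)  # what is billed
needed = monthly_reservation_cost(8, 1024)     # what peak traffic would need
print(f"reserved: ${reserved:,.2f}/month")
print(f"needed:   ${needed:,.2f}/month")
print(f"idle:     ${reserved - needed:,.2f}/month")
```

The gap between the first two lines is the idle heat: money spent on warm environments that no request reaches.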

Cost reviews flag PC because it is almost always provisioned for the peak and then forgotten. Someone sets it during a launch to guarantee p99 latency, picks a comfortable round number, and never revisits it as traffic patterns settle. The signal to watch is ProvisionedConcurrencyUtilization, the fraction of warm environments actually serving traffic. Persistently low utilization (say, under 40-50%) means most of the warm capacity is idle heat: you're carrying cold-start insurance for concurrency you don't have, every second of every day.

In this lesson you'll learn how provisioned concurrency differs from on-demand Lambda, the separate per-GB-second price you pay for warm capacity around the clock, and how to read ProvisionedConcurrencyUtilization and ProvisionedConcurrencySpilloverInvocations to tell whether you're over- or under-provisioned. You'll see the real CLI to inspect utilization and set the PC count, how to use Application Auto Scaling — scheduled or target-tracking — to ramp warm capacity up only during peak windows instead of 24/7, and when to drop PC entirely: latency-tolerant workloads where cold starts are fine, or Java functions where SnapStart is the cheaper fix.

Fun fact

The reservation that worked weekends for free

A payments team set 100 units of provisioned concurrency on a checkout-validation function ahead of a Black Friday launch, sizing for the busiest expected minute. The launch went fine, and the reservation stayed at 100, 24/7, for the next eleven months. Their own dashboards later showed average ProvisionedConcurrencyUtilization of 11%: outside a two-hour weekday lunch spike, real concurrency rarely cleared 12. At 1536 MB the full reservation cost about $1,620 a month, so that idle 89% was burning roughly $1,400 a month to keep environments warm that essentially never took a request on evenings or weekends. The fix wasn't to delete the safety net. It was a scheduled scaling action that set PC to 15 overnight and at weekends and ramped to 60 for the lunch window, cutting the bill by more than two-thirds with no latency regression.

Right-sizing provisioned concurrency in action

Dana runs the platform team at a fintech. A cost review flags one Lambda, a real-time fraud-scoring function on the checkout path, carrying about $1,300 of monthly provisioned-concurrency reservation charges (80 units × 1.5 GB × roughly $10.80 per GB-month). It's configured with 80 units of PC at 1536 MB, reserved 24/7, and it was set during last year's product launch.

She pulls the ProvisionedConcurrencyUtilization metric for the last two weeks. The picture is stark: a weekday peak of about 55% between 9am and 6pm, dropping to under 8% overnight and barely 12% at weekends. Spillover invocations (requests that exceeded the warm pool and fell back to on-demand execution, where they can hit a cold start) are essentially zero, which confirms 80 is well above what even the peak needs.

Dana doesn't just delete the reservation; the latency guarantee matters on a checkout path. She sets the weekday reservation to 45, covering a weekday peak of roughly 44 busy environments (55% of 80), and adds two Application Auto Scaling scheduled actions: ramp to 45 at 8am UTC and back to 12 at 7pm UTC on weekdays, which leaves a flat 12 over the weekend. Utilization climbs from ~25% blended to ~70%, spillover stays at zero, and the PC reservation bill drops from about $1,300 to about $370 a month, with p99 latency on the critical path unchanged.

First, read how much of the reserved warm capacity is actually being used. ProvisionedConcurrencyUtilization is the fraction (0-1) of warm environments serving traffic — pull the average and peak over the last week.

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name ProvisionedConcurrencyUtilization \
    --dimensions Name=FunctionName,Value=fraud-scoring Name=Resource,Value=fraud-scoring:live \
    --start-time "$(date -u -d '7 days ago' +%FT%TZ)" \
    --end-time "$(date -u +%FT%TZ)" \
    --period 3600 \
    --statistics Average Maximum \
    --query 'Datapoints | sort_by(@,&Timestamp)[-4:]'
[
{ "Timestamp": "2026-05-25T02:00:00Z", "Average": 0.07, "Maximum": 0.11 },
{ "Timestamp": "2026-05-25T03:00:00Z", "Average": 0.06, "Maximum": 0.09 },
{ "Timestamp": "2026-05-25T13:00:00Z", "Average": 0.55, "Maximum": 0.62 },
{ "Timestamp": "2026-05-25T14:00:00Z", "Average": 0.53, "Maximum": 0.60 }
]
# Peak ~55%, overnight ~6-7%. The reservation sits far above a peak it never reaches, and it never scales down.

Utilization is the ground truth: warm environments serving traffic versus warm environments billed. Low overnight = idle heat.

Confirm you're not under-provisioned before cutting. ProvisionedConcurrencySpilloverInvocations counts requests that exceeded the warm pool and fell back to on-demand execution, where they can hit a cold start. Near-zero spillover, combined with low utilization, means there's room to trim.

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name ProvisionedConcurrencySpilloverInvocations \
    --dimensions Name=FunctionName,Value=fraud-scoring Name=Resource,Value=fraud-scoring:live \
    --start-time "$(date -u -d '7 days ago' +%FT%TZ)" \
    --end-time "$(date -u +%FT%TZ)" \
    --period 86400 \
    --statistics Sum
{
"Datapoints": [
{ "Timestamp": "2026-05-24T00:00:00Z", "Sum": 0.0 },
{ "Timestamp": "2026-05-25T00:00:00Z", "Sum": 0.0 }
],
"Label": "ProvisionedConcurrencySpilloverInvocations"
}
# Zero spillover at 80 units + low utilization = clear over-provisioning. Safe to right-size and schedule.

Spillover is the safety check: zero spillover plus low utilization confirms you can cut the reservation without forcing cold starts.

Right-sizing provisioned concurrency under the hood (deep dive)

Provisioned concurrency is billed on its own price dimension, distinct from on-demand invocation cost. In us-east-1 you pay roughly $0.0000041667 per GB-second of PC for every second a unit is reserved, plus a reduced duration charge (about $0.0000097222 per GB-second on x86) for invocations that run on that warm capacity, plus the standard per-request fee. The reservation charge is the part that doesn't sleep: one PC unit on a 1024 MB function held for a full 30-day month is roughly 1 GB × 2,592,000 seconds × $0.0000041667 ≈ $10.80 a month before any traffic. Multiply by the unit count and you have a fixed monthly floor that on-demand Lambda never has, because on-demand scales to zero and PC does not.
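One way to see how the three charges interact is to compare a reserved GB-second with the same work run on demand. The sketch below is illustrative only: the two PC rates are the ones quoted in this lesson, while the on-demand duration rate ($0.0000166667 per GB-second, x86, us-east-1) is an assumption not stated in the lesson, and the per-request fee is left out because it is the same in both cases.

```python
# Rates per GB-second, us-east-1, x86.
PC_RESERVATION_RATE = 0.0000041667      # quoted in this lesson
PC_DURATION_RATE = 0.0000097222         # quoted in this lesson
ON_DEMAND_DURATION_RATE = 0.0000166667  # assumption: not quoted in this lesson


def pc_cost_per_gb_second(utilization: float) -> float:
    """Cost of one reserved GB-second when `utilization` of it is busy."""
    return PC_RESERVATION_RATE + utilization * PC_DURATION_RATE


def on_demand_cost_per_gb_second(utilization: float) -> float:
    """Cost of running the same busy fraction on demand (idle time is free)."""
    return utilization * ON_DEMAND_DURATION_RATE


def break_even_utilization() -> float:
    """Utilization at which the PC and on-demand duration costs are equal."""
    return PC_RESERVATION_RATE / (ON_DEMAND_DURATION_RATE - PC_DURATION_RATE)


print(f"break-even utilization: {break_even_utilization():.0%}")
for u in (0.1, 0.3, 0.6, 0.9):
    print(f"u={u:.0%}  PC={pc_cost_per_gb_second(u):.10f}  "
          f"on-demand={on_demand_cost_per_gb_second(u):.10f}")
```

Under these assumed rates, a reservation that stays busy more than about 60% of the time is cheaper than on-demand for the same work. Below that, the warm capacity is a premium paid purely for latency.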

The two metrics that tell you whether the floor is right-sized are ProvisionedConcurrencyUtilization and ProvisionedConcurrencySpilloverInvocations. Utilization is a 0-1 ratio: the number of warm environments actively running an invocation divided by the number reserved. Persistently low utilization means you're paying to keep environments warm that traffic never reaches. Spillover counts invocations that arrived when all warm environments were busy and therefore ran on on-demand capacity, where they can hit a cold start; it's the under-provisioning signal. The sweet spot is high utilization (you're using what you pay for) with near-zero spillover (you're not forcing cold starts on the requests you provisioned to protect). Both are emitted per function version or alias, which is why the CloudWatch dimensions include the Resource qualifier alongside FunctionName.

Because demand is rarely flat, the cost-efficient move is usually to make the reservation move with traffic rather than sizing it for a 24/7 peak. Application Auto Scaling registers the function's PC as a scalable target and supports two policies: scheduled actions (set PC to a value at a cron time — ideal for predictable daily/weekly patterns) and target-tracking (Application Auto Scaling watches utilization and adds or removes warm capacity to hold it near a target like 0.7). Scheduled scaling is cheapest and most predictable when the traffic shape is known; target-tracking adapts to noisier patterns but reacts with a lag, so it's paired with a floor that covers baseline. For latency-tolerant workloads the right answer is often to drop PC entirely and accept cold starts; for Java specifically, SnapStart restores a snapshot of an initialised environment at no extra reservation charge, making PC unnecessary for most Java cold-start problems.

# Inspect the current provisioned-concurrency config on an alias.
aws lambda get-provisioned-concurrency-config \
  --function-name fraud-scoring \
  --qualifier live \
  --query '{Requested:RequestedProvisionedConcurrentExecutions,Available:AvailableProvisionedConcurrentExecutions,Status:Status}'

# Set the steady reservation to a right-sized value covering the weekday peak.
aws lambda put-provisioned-concurrency-config \
  --function-name fraud-scoring \
  --qualifier live \
  --provisioned-concurrent-executions 45

# Register PC as a scalable target so Application Auto Scaling can ramp it on a schedule.
aws application-autoscaling register-scalable-target \
  --service-namespace lambda \
  --resource-id function:fraud-scoring:live \
  --scalable-dimension lambda:function:ProvisionedConcurrency \
  --min-capacity 12 --max-capacity 45
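For the target-tracking option, the policy attached to the scalable target registered above could look like the following. This is a sketch that only builds the request parameters: the helper name `target_tracking_policy` and the policy name are hypothetical, and the 0.7 target is the example value from this lesson, not a recommendation.

```python
import json


def target_tracking_policy(function_name: str, alias: str,
                           target: float = 0.7) -> dict:
    """Build put_scaling_policy parameters that hold PC utilization near `target`."""
    return {
        "PolicyName": f"{function_name}-{alias}-pc-utilization",  # hypothetical name
        "ServiceNamespace": "lambda",
        "ResourceId": f"function:{function_name}:{alias}",
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
            },
        },
    }


policy = target_tracking_policy("fraud-scoring", "live")
print(json.dumps(policy, indent=2))
# With boto3 installed and credentials configured, the dict could be passed as:
#   boto3.client("application-autoscaling").put_scaling_policy(**policy)
```

The min and max capacity on the scalable target still bound what the policy can do, so the floor of 12 continues to cover baseline traffic while the policy reacts.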

What is the impact of over-provisioned concurrency?

The direct cost is a fixed monthly floor that runs whether or not traffic uses it. Unlike on-demand Lambda — where idle is free — every reserved PC unit bills around the clock at the reservation rate. At 1024 MB that's about $10.80 a month per unit before a single invocation; at 1536 MB it's roughly $16.20. A function over-provisioned at 80 units when 30 would cover peak is wasting 50 units — $540 to $810 a month on that one function — every month, all year, on capacity that never serves a request.

The waste concentrates exactly where on-demand thinking misleads people. Engineers reason about Lambda as pay-per-use and assume quiet hours are cheap. PC inverts that: the most wasteful hours are the idle ones, when utilization collapses but the reservation keeps billing. A function reserved for a weekday lunch spike pays the full reservation rate through every night and weekend (often two-thirds of the calendar or more) for capacity that sits warm, idle, and billed. Scheduling the reservation to match the traffic shape, rather than the peak, is where most of the saving lives.

There's a commitment angle too. Provisioned concurrency is covered by Compute Savings Plans, which discount the PC dimension alongside on-demand Lambda, Fargate, and EC2. If you commit a Savings Plan against an over-provisioned PC baseline, you lock a discount onto idle warm capacity — the same stranded-commitment trap as reserving an oversized EC2 fleet. The reservation should be right-sized and scheduled first, so any Savings Plan sits on the efficient warm-capacity run-rate rather than the inflated one.

Finally, over-provisioning hides the real latency question. When a function is slow, the reflexive fix is "add more provisioned concurrency," and a generous reservation papers over genuine init-time problems — heavy SDK initialisation, large deployment packages, slow VPC ENI attachment — that more warm environments only mask at increasing cost. Worse, PC is sometimes reached for on workloads that don't need it at all: latency-tolerant batch or async functions where a cold start is irrelevant, or Java functions where SnapStart would solve cold starts for free. Each of those is a reservation bill with no latency justification underneath it.

How do you right-size provisioned concurrency safely?

Right-sizing PC is a four-step loop that runs on the FinOps cadence: find the functions carrying warm-capacity charges, measure their utilization and spillover, right-size and schedule the reservation to the real traffic shape, and re-check whether the workload needs PC at all.

1. Find every function with provisioned concurrency and its cost

List the functions and aliases that carry PC configs across every region and account, with the reserved unit count, memory size, and estimated monthly reservation cost (units × memory in GB × ~$0.0000041667 × seconds in the month). PC is opt-in and set per version or alias, so the list is short, usually a handful of latency-critical functions. Rank by reservation cost, not invocation volume, because the bill follows reserved units × memory × time, independent of how much traffic actually hits them.
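Once the inventory is collected, the ranking is a small calculation. The sketch below assumes the inventory has already been gathered into a list; the `PcConfig` type, the function names, and every entry in `inventory` are hypothetical, and the rate is the us-east-1 figure quoted in this lesson.

```python
from dataclasses import dataclass

PC_RATE_PER_GB_SECOND = 0.0000041667  # us-east-1 rate quoted in this lesson
SECONDS_PER_30_DAY_MONTH = 30 * 86_400


@dataclass
class PcConfig:
    function: str
    qualifier: str
    units: int
    memory_mb: int


def monthly_cost(cfg: PcConfig) -> float:
    """Estimated reservation cost for one PC config over a 30-day month."""
    memory_gb = cfg.memory_mb / 1024
    return cfg.units * memory_gb * SECONDS_PER_30_DAY_MONTH * PC_RATE_PER_GB_SECOND


def rank_by_cost(configs: list[PcConfig]) -> list[PcConfig]:
    """Most expensive reservation first, regardless of invocation volume."""
    return sorted(configs, key=monthly_cost, reverse=True)


# Hypothetical inventory for illustration only.
inventory = [
    PcConfig("thumbnailer", "prod", units=10, memory_mb=512),
    PcConfig("checkout-api", "live", units=80, memory_mb=1536),
    PcConfig("search-suggest", "prod", units=20, memory_mb=1024),
]

for cfg in rank_by_cost(inventory):
    print(f"{cfg.function}:{cfg.qualifier:<5} {cfg.units:>3} units "
          f"{cfg.memory_mb:>5} MB  ${monthly_cost(cfg):>9,.2f}/month")
```

Sorting by reservation cost puts the large, high-memory reservations at the top of the review list even when they serve little traffic.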

2. Measure utilization and spillover before changing anything

Pull ProvisionedConcurrencyUtilization (the fraction of warm capacity in use) and ProvisionedConcurrencySpilloverInvocations (requests that overflowed to cold starts) over at least one full traffic cycle — a week covering weekdays and weekend. Low utilization with zero spillover means you're over-provisioned and can cut safely. High utilization with rising spillover means you're under-provisioned and should hold or raise the floor. The daily and weekly shape of utilization is what tells you whether to schedule.
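The decision rule in this step can be written down as a small function. This is a sketch under stated assumptions: the name `assess_reservation` and the 0.8 `trim_below` threshold are illustrative choices, not AWS guidance, and the inputs are expected to cover at least one full traffic cycle.

```python
def assess_reservation(peak_utilization: float,
                       spillover_invocations: float,
                       trim_below: float = 0.8) -> str:
    """Classify one function's reservation from its two PC metrics.

    peak_utilization: highest ProvisionedConcurrencyUtilization seen (0-1).
    spillover_invocations: summed ProvisionedConcurrencySpilloverInvocations.
    trim_below: assumed threshold under which the peak leaves room to cut.
    """
    if spillover_invocations > 0:
        return "under-provisioned"  # requests overflowed the warm pool: hold or raise
    if peak_utilization < trim_below:
        return "over-provisioned"   # even the busiest period left capacity idle
    return "right-sized"


print(assess_reservation(peak_utilization=0.62, spillover_invocations=0))
print(assess_reservation(peak_utilization=0.97, spillover_invocations=140))
print(assess_reservation(peak_utilization=0.88, spillover_invocations=0))
```

Spillover is checked first on purpose: any overflow means the reservation is already failing the requests it exists to protect, whatever the utilization number says.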

3. Right-size the steady value, then schedule with Application Auto Scaling

Set the steady reservation to cover the weekday peak with modest headroom (use the peak utilization, not the average). Then register the alias as a scalable target and add scheduled actions to ramp capacity up before the busy window and down for nights and weekends — or target-tracking to hold utilization near ~0.7 for noisier patterns, paired with a min-capacity floor for baseline. Scheduling against a known daily shape captures the bulk of the saving because it stops paying peak rates through the two-thirds of the week that is quiet.
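To estimate what a schedule saves before applying it, weight each capacity level by the hours it is in force. The sketch below uses the 45/12 weekday schedule from this lesson's scheduled actions (45 units from 08:00 to 19:00 Monday to Friday, 12 otherwise); the function name `weekly_average_units` is hypothetical.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_average_units(peak_units: int, base_units: int,
                         peak_hours_per_day: int, peak_days_per_week: int) -> float:
    """Time-weighted average number of reserved units across one week."""
    peak_hours = peak_hours_per_day * peak_days_per_week
    base_hours = HOURS_PER_WEEK - peak_hours
    return (peak_units * peak_hours + base_units * base_hours) / HOURS_PER_WEEK


avg = weekly_average_units(peak_units=45, base_units=12,
                           peak_hours_per_day=11, peak_days_per_week=5)
saving_vs_flat_45 = 1 - avg / 45
print(f"average reserved units: {avg:.1f}")
print(f"saving vs a flat 45:    {saving_vs_flat_45:.0%}")
```

Because the reservation charge is linear in units and time, the average unit count is all you need: multiply it by memory in GB and the monthly rate to get the scheduled bill.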

4. Ask whether the workload needs PC at all

Provisioned concurrency only earns its cost on functions where cold-start latency is genuinely user-visible and SLA-bound. For latency-tolerant async, batch, or queue-driven functions, drop PC entirely and accept cold starts — the saving is the whole reservation. For Java functions, SnapStart restores a pre-initialised snapshot at no reservation charge and removes most cold-start pain for free, so it's almost always the better lever than PC. Re-confirm the justification on a cadence: a function that needed PC at launch may not need it once traffic and init time have settled.

# Add scheduled scaling: ramp warm capacity to 45 for the weekday peak, down to 12 overnight.
aws application-autoscaling put-scheduled-action \
  --service-namespace lambda \
  --resource-id function:fraud-scoring:live \
  --scalable-dimension lambda:function:ProvisionedConcurrency \
  --scheduled-action-name ramp-up-weekday-peak \
  --schedule 'cron(0 8 ? * MON-FRI *)' \
  --scalable-target-action MinCapacity=45,MaxCapacity=45

aws application-autoscaling put-scheduled-action \
  --service-namespace lambda \
  --resource-id function:fraud-scoring:live \
  --scalable-dimension lambda:function:ProvisionedConcurrency \
  --scheduled-action-name ramp-down-overnight \
  --schedule 'cron(0 19 ? * MON-FRI *)' \
  --scalable-target-action MinCapacity=12,MaxCapacity=12

Quick quiz

Question 1 of 5

A checkout function reserves 80 units of provisioned concurrency at 1536 MB, 24/7. Over a week, ProvisionedConcurrencyUtilization peaks at 55% on weekday afternoons and sits under 8% overnight, with zero spillover invocations. What's the right move?

You've completed Right-size Lambda provisioned concurrency. You now know how PC differs from on-demand Lambda, why warm capacity bills around the clock on its own price dimension, how to read ProvisionedConcurrencyUtilization and ProvisionedConcurrencySpilloverInvocations, and the four-step loop — find the provisioned functions, measure utilization and spillover, right-size and schedule with Application Auto Scaling, then re-check whether the workload needs PC at all. The next time a cost review flags warm-capacity charges, you'll have a defensible path from "flagged" to "right-sized" without touching the latency guarantee that matters.
