Cost

Investigate a cost anomaly

A 7x cost spike on EC2 in one account isn't always an outage — but it always means the bill changed. Triage, attribute, decide.

16 min·10 sections·AWS

Last reviewed 27 May 2026

Cost anomalies: the basics

What does it mean for cloud cost to be "anomalous"?

A cost anomaly is a daily spend value that lands outside the expected band for a given (account, service) tuple — far enough outside that a baseline model trained on the last 30–90 days of history flags it as unlikely to be normal variation. Daily AWS spend has natural rhythm: batch jobs run nightly, traffic dips on weekends, monthly cron jobs spike on the 1st. Anomaly detection learns those rhythms and surfaces the points that break them.

AWS Cost Anomaly Detection and most third-party tools (including this dashboard's ml-anomaly detector) work the same way conceptually: they train a model per monitored dimension, compute an expected value plus a confidence band for each day, and emit an anomaly when the actual value crosses that band. The output isn't "something is wrong" — it's "this looks unusual, here's what changed."

A real example: Amazon EC2 cost in one account jumped from an expected $188.62/day to an actual $1,366.40/day — variance $1,177.78, a 7.2× spike. The model had been watching that account+service pair for weeks and was confident the previous days fit the band. That's an anomaly worth a human looking at — not necessarily an emergency, but the bill changed and someone needs to know why.

In this lesson you'll learn how cloud cost anomaly detection actually works, the triage decision tree for any spike (usage change vs. rate change), how to attribute a 7× anomaly back to a specific resource and a specific team in under thirty minutes, and how to decide whether the new cost should be accepted (budget up) or fixed (root-cause down). You'll see real Cost Explorer queries and the post-mortem format that turns one-off spikes into team-wide muscle memory.

Fun fact

The most expensive bug is the one nobody owns

A well-cited industry estimate is that 30%+ of cloud spend is wasted, and a meaningful chunk of that waste lives in anomalies that nobody investigated because they fell into the cracks between teams. A 7× EC2 spike on a shared account is everyone's problem and therefore nobody's. The first team to claim an anomaly tends to fix it; the ones that bounce between Slack channels for two weeks become the line items finance asks about in the next QBR.

Triaging a cost anomaly in action

Marco runs FinOps at a healthcare SaaS. On a Friday morning the dashboard fires an anomaly: Amazon Elastic Compute Cloud in account 412988273341 jumped from $188.62/day expected to $1,366.40/day actual on 2026-02-26 — variance $1,177.78, 7.2×. The detector tags it cost-spike, ml-anomaly. Severity HIGH.

Before pinging anyone, Marco runs the triage decision tree. Cost = usage × rate. So the spike is either a usage change (a new resource, more requests, more data) or a rate change (a Savings Plan that lapsed, a region with different pricing, an instance family change). He starts with usage because that's the more common cause and the easier one to verify.

He pulls Cost Explorer grouped by USAGE_TYPE for that account+service+date range. The 80/20 of any anomaly is usually one usage type doing 80% of the damage.

Group the anomalous day's spend by usage type to see which line item moved.

$ aws ce get-cost-and-usage \
    --time-period Start=2026-02-26,End=2026-02-27 \
    --granularity DAILY \
    --metrics UnblendedCost \
    --filter '{"And":[{"Dimensions":{"Key":"LINKED_ACCOUNT","Values":["412988273341"]}},{"Dimensions":{"Key":"SERVICE","Values":["Amazon Elastic Compute Cloud - Compute"]}}]}' \
    --group-by Type=DIMENSION,Key=USAGE_TYPE \
    --query 'ResultsByTime[0].Groups[?to_number(Metrics.UnblendedCost.Amount) > `50`]'
# Amount is returned as a string, so it has to go through to_number() before the numeric comparison.

[

{ "Keys": ["USE1-BoxUsage:p4d.24xlarge"], "Metrics": { "UnblendedCost": { "Amount": "982.31", "Unit": "USD" } } },

{ "Keys": ["USE1-BoxUsage:m5.large"], "Metrics": { "UnblendedCost": { "Amount": "171.04", "Unit": "USD" } } },

{ "Keys": ["USE1-DataTransfer-Regional-Bytes"], "Metrics": { "UnblendedCost": { "Amount": "118.92", "Unit": "USD" } } },

{ "Keys": ["USE1-EBS:VolumeUsage.gp3"], "Metrics": { "UnblendedCost": { "Amount": "61.40", "Unit": "USD" } } }

]

# p4d.24xlarge alone is $982 of the day's $1,366: 72% of total spend and, if all of it is new, about 83% of the $1,178 variance. That's the resource.

One usage type accounts for ~72% of the day's spend and most of the variance: the classic 80/20 of an anomaly.
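
The same ranking can be done offline once the Cost Explorer output is saved. The sketch below is a minimal illustration that reuses the four groups and the alert figures from this example; the function name `rank_usage_types` is hypothetical, and the last step assumes the top line item is entirely new spend.

```python
import json

# Groups copied from the Cost Explorer output in this example.
groups_json = """
[
  {"Keys": ["USE1-BoxUsage:p4d.24xlarge"], "Metrics": {"UnblendedCost": {"Amount": "982.31", "Unit": "USD"}}},
  {"Keys": ["USE1-BoxUsage:m5.large"], "Metrics": {"UnblendedCost": {"Amount": "171.04", "Unit": "USD"}}},
  {"Keys": ["USE1-DataTransfer-Regional-Bytes"], "Metrics": {"UnblendedCost": {"Amount": "118.92", "Unit": "USD"}}},
  {"Keys": ["USE1-EBS:VolumeUsage.gp3"], "Metrics": {"UnblendedCost": {"Amount": "61.40", "Unit": "USD"}}}
]
"""

ACTUAL_DAILY = 1366.40    # actual spend for the day, from the alert
EXPECTED_DAILY = 188.62   # expected spend for the day, from the alert


def rank_usage_types(groups, day_total):
    """Return (usage_type, cost, share_of_day) tuples, largest cost first."""
    rows = []
    for group in groups:
        # Cost Explorer returns Amount as a string, so convert before doing math.
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        rows.append((group["Keys"][0], cost, cost / day_total))
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranked = rank_usage_types(json.loads(groups_json), ACTUAL_DAILY)
for usage_type, cost, share in ranked:
    print(f"{usage_type:36s} ${cost:>8,.2f}  {share:6.1%} of the day's spend")

# Assumption: the top line item did not exist before the spike.
variance = ACTUAL_DAILY - EXPECTED_DAILY
top_share_of_variance = ranked[0][1] / variance
print(f"top item as a share of the variance: {top_share_of_variance:.1%}")
```

Keeping the two ratios separate matters: share of the day's spend and share of the variance answer different questions, and only the second says how much of the spike a single line item explains.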

Now confirm whether this is a new resource (RunInstances around detection time) or an existing one whose usage changed.

$ aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances --start-time 2026-02-25T00:00:00Z --end-time 2026-02-26T23:59:59Z --query 'Events[?contains(Resources[].ResourceType, `AWS::EC2::Instance`)].[EventTime,Username,Resources[?ResourceType==`AWS::EC2::Instance`].ResourceName|[0]]' --output table

┌──────────────────────┬──────────────────────────┬──────────────────────┐

│ EventTime │ Username │ InstanceId │

├──────────────────────┼──────────────────────────┼──────────────────────┤

│ 2026-02-25T18:42:17Z │ ml-research/k.tanaka │ i-0d8c91a4e7b2f6033 │

│ 2026-02-25T18:42:31Z │ ml-research/k.tanaka │ i-0a17e8d2b9c4f5018 │

│ 2026-02-26T09:11:04Z │ autoscaling.amazonaws.com│ i-094f8c1e2a7d6b502 │

└──────────────────────┴──────────────────────────┴──────────────────────┘

# Two p4d.24xlarge instances launched the evening before the spike. Owner: ml-research team.

CloudTrail confirms a new launch by a specific IAM principal — attribution complete.

How anomaly detection actually works (deep dive)

Under the hood, the detector trains a separate model per monitored dimension — most commonly per (account_id, service) tuple, sometimes finer (per usage_type, per tag value). For each tuple it ingests the last 30–90 days of daily UnblendedCost, fits an expected curve (typically a seasonal decomposition plus a residual model), and computes a confidence band. When a new day's actual spend lands outside that band, the detector emits an event with expectedCost, actualCost, variance, the band width, and a confidence score.
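
To make the idea of an expected value plus a band concrete, here is a deliberately simple sketch. It is not how AWS or this dashboard implement detection: it assumes a day-of-week mean as the seasonal part and a plain standard deviation of the residuals as the band, and the history is synthetic. The names `expected_band` and `is_anomalous` and the multiplier `k` are hypothetical.

```python
from datetime import date, timedelta
from statistics import mean, stdev


def expected_band(history, target_day, k=3.0):
    """Return (expected, lower, upper) for target_day.

    history maps date -> daily cost for one (account, service) tuple.
    Seasonal part: mean of past spend on the same weekday.
    Residual part: standard deviation of (actual - weekday mean) over the history.
    """
    by_weekday = {}
    for day, cost in history.items():
        by_weekday.setdefault(day.weekday(), []).append(cost)
    weekday_mean = {wd: mean(costs) for wd, costs in by_weekday.items()}

    residuals = [cost - weekday_mean[day.weekday()] for day, cost in history.items()]
    spread = stdev(residuals)

    expected = weekday_mean[target_day.weekday()]
    return expected, max(0.0, expected - k * spread), expected + k * spread


def is_anomalous(actual, band):
    """True when the actual value falls outside the band."""
    _, lower, upper = band
    return actual < lower or actual > upper


# Synthetic history (an assumption): 56 days ending 2026-02-25 with a weekday/weekend rhythm.
start = date(2026, 1, 1)
history = {}
for i in range(56):
    day = start + timedelta(days=i)
    base = 150.0 if day.weekday() >= 5 else 195.0   # weekends run cheaper
    history[day] = base + ((i * 7) % 11 - 5)        # deterministic wobble of up to $5

band = expected_band(history, date(2026, 2, 26))
print("band:", tuple(round(value, 2) for value in band))
print("1366.40 anomalous?", is_anomalous(1366.40, band))
print("198.00 anomalous?", is_anomalous(198.00, band))
```

A production detector would also handle trend, month-boundary jobs, and a warm-up period, but the shape is the same: learn the rhythm, then flag what falls outside it.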

The detector also tracks detectionCount and detectionDates — every day the anomaly persists, the count ticks up. A single day at 7× could be a one-off (a one-shot batch job, a botched test). Sixteen consecutive days at 7× means the cost base has permanently moved and the model is now consistently confused. Persistence is the strongest signal that something real happened; the cost-anomaly inbox should sort by detection count, not severity.
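
Sorting by persistence is easy to sketch. The example below assumes each inbox entry carries a `detectionDates` list as described above; the entries themselves and the helper `consecutive_run` are hypothetical, and the helper counts only the unbroken run of days ending at the most recent detection.

```python
from datetime import date, timedelta


def consecutive_run(detection_dates):
    """Length of the run of consecutive days ending at the most recent detection."""
    if not detection_dates:
        return 0
    days = sorted(set(detection_dates), reverse=True)
    run = 1
    for newer, older in zip(days, days[1:]):
        if newer - older != timedelta(days=1):
            break  # a gap ends the run
        run += 1
    return run


# Hypothetical inbox entries.
inbox = [
    {"id": "a-1", "severity": "HIGH", "detectionDates": [date(2026, 2, 26)]},
    {"id": "a-2", "severity": "MEDIUM",
     "detectionDates": [date(2026, 2, 11) + timedelta(days=i) for i in range(16)]},
    {"id": "a-3", "severity": "LOW",
     "detectionDates": [date(2026, 2, 20), date(2026, 2, 25), date(2026, 2, 26)]},
]

by_persistence = sorted(
    inbox, key=lambda a: consecutive_run(a["detectionDates"]), reverse=True
)
for anomaly in by_persistence:
    print(anomaly["id"], anomaly["severity"], consecutive_run(anomaly["detectionDates"]))
```

In this made-up inbox the sixteen-day MEDIUM anomaly sorts above the one-day HIGH one, which is the behaviour the paragraph argues for.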

AWS Cost Anomaly Detection (the AWS-native version) lets you create monitors of four types: AWS services, linked accounts, cost categories, and cost allocation tags. Alert thresholds (absolute $ or %) and notification destinations (SNS, email, Slack via EventBridge) are set on alert subscriptions, each of which attaches to one or more monitors. The same detection logic powers this dashboard's ml-anomaly channel; the difference is where the model lives and how the alerts are surfaced, not what they mean.

# Pull the raw anomaly record from AWS Cost Anomaly Detection.
aws ce get-anomalies \
  --date-interval StartDate=2026-02-26,EndDate=2026-02-26 \
  --query 'Anomalies[?Impact.TotalImpact>`500`].[AnomalyId,AnomalyStartDate,Impact.TotalActualSpend,Impact.TotalExpectedSpend,DimensionValue]' \
  --output table

# List the monitors so you know which dimensions are actually watched.
aws ce get-anomaly-monitors \
  --query 'AnomalyMonitors[].[MonitorName,MonitorType,MonitorDimension]' \
  --output table

What is the impact of unattributed cost anomalies?

The direct impact is the bill. A persistent 7× spike on an EC2 account that was running $188/day costs an extra ~$35k a month if nobody catches it — and "nobody catches it" is the default outcome when an anomaly bounces between Slack channels for a week. The detector did its job; the cost is the gap between detection and ownership.

The second-order impact is forecast pollution. Budgets, savings-plan recommendations, and capacity plans all train on recent spend. An unflagged anomaly that persists becomes the new baseline; six weeks later you're sizing your reserved-capacity commitment on numbers that include $35k/month of accidental p4d. The model can't tell the difference between intentional growth and a forgotten test rig.

The third-order impact is trust. Finance loses confidence in cloud spend forecasts when anomalies don't get explained; engineering loses confidence in the alert channel when too many anomalies fire without follow-up. Both problems compound — finance starts demanding manual approval for any spend increase, engineering starts muting the channel.

The good news: anomalies that are attributed within 24 hours rarely cost more than a few thousand dollars in waste. Speed of triage is the dominant variable. A clear decision tree and a documented owner per (account, service) tuple turn a $35k/month surprise into a $2k one-off.

How do you triage and act on a cost anomaly?

Anomaly response is a four-step loop. Skipping any step turns a one-day spike into a six-week mystery.

1. Attribute — find the resource and the owner

Group the anomalous day by USAGE_TYPE in Cost Explorer to find the 80/20 line item, then by RESOURCE_ID (or use CloudTrail RunInstances/CreateBucket events around the detection timestamp) to find the specific resource. Grouping by RESOURCE_ID goes through get-cost-and-usage-with-resources, which needs resource-level data enabled in the Cost Explorer settings and only reaches back 14 days. Cost-allocation tags (or the IAM principal that created the resource) give you the owning team. The whole exercise is 15 minutes if your tagging is in shape.

2. Diagnose — usage change or rate change?

Cost = usage × rate. If usage rose, check CloudTrail for new launches and CloudWatch for traffic/queue/processing increases on existing resources. If usage is flat, check rate: a Savings Plan or RI that lapsed, a workload that moved to a more expensive region, a switch from spot to on-demand. Data-transfer spikes (especially cross-AZ and cross-region) are the most-missed category, so always check DataTransfer-Regional-Bytes (cross-AZ traffic), the region-pair AWS-Out-Bytes usage types (cross-region traffic), and DataTransfer-Out-Bytes (internet egress).
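
When both usage and rate move at once, it helps to split the change in cost into the part explained by each. The sketch below is one common way to do that split, not something prescribed by this lesson; the function name and all the numbers are hypothetical.

```python
def decompose_cost_change(usage_before, rate_before, usage_after, rate_after):
    """Split a change in cost into a usage effect and a rate effect.

    usage effect = (change in usage) x old rate
    rate effect  = (change in rate)  x new usage
    The two effects add up exactly to the total change in cost.
    """
    usage_effect = (usage_after - usage_before) * rate_before
    rate_effect = (rate_after - rate_before) * usage_after
    total = usage_after * rate_after - usage_before * rate_before
    return usage_effect, rate_effect, total


# Hypothetical scenario A: instance-hours triple at an unchanged rate.
usage_case = decompose_cost_change(100, 0.10, 300, 0.10)
print("usage change:", usage_case)

# Hypothetical scenario B: same hours, but a discounted rate lapses back to on-demand.
rate_case = decompose_cost_change(100, 0.06, 100, 0.10)
print("rate change: ", rate_case)
```

If the usage effect dominates, the next stop is CloudTrail and CloudWatch; if the rate effect dominates, check commitments, regions, and purchase options first.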

3. Decide — intentional or accidental?

Once you know what changed and who owns it, ask the owner one question: was this intentional? If yes (feature launch, planned capacity, marketing event) update the budget and forecast — that's now the new normal and the anomaly model needs to relearn. If no (forgotten test rig, autoscaling bug, runaway loop) treat it as a real incident: fix the root cause and verify the next day's spend returns to the expected band.

4. Post-mortem — one-liner per anomaly

Every anomaly should produce a single-sentence note: "X caused Y, fixed by Z, prevented by W." Build the team library and reference it the next time something similar fires. Patterns repeat — p4d instances left running over a weekend, ECR pulls from the wrong region, ALB log retention misconfigured — and the second occurrence should be a 30-second triage, not a re-investigation.
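
One lightweight way to keep that library consistent is a tiny record type that renders the sentence for you. This is only a sketch: the class `PostMortem`, the helper `find_similar`, and the fix and prevention text in the sample entries are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostMortem:
    cause: str        # X: what changed
    effect: str       # Y: what it did to the bill
    fix: str          # Z: how it was resolved
    prevention: str   # W: what stops a repeat

    def one_liner(self):
        return (f"{self.cause} caused {self.effect}, "
                f"fixed by {self.fix}, prevented by {self.prevention}.")


def find_similar(library, keyword):
    """Return earlier notes whose cause mentions the keyword (case-insensitive)."""
    return [note for note in library if keyword.lower() in note.cause.lower()]


# Hypothetical library entries.
library = [
    PostMortem("p4d instances left running over a weekend", "a 7x EC2 spike",
               "terminating the instances", "an auto-stop tag policy"),
    PostMortem("ECR pulls from the wrong region", "a cross-region data-transfer spike",
               "pointing the cluster at the local registry", "a region check in CI"),
]

for note in find_similar(library, "P4D"):
    print(note.one_liner())
```

A shared document works just as well; the point is that every entry has the same four slots, so the second occurrence is a lookup rather than an investigation.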

# Set up a per-service anomaly monitor with an SNS notification.
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "per-service-monitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

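# SNS subscribers need "Frequency": "IMMEDIATE"; DAILY and WEEKLY summaries are for EMAIL subscribers.
# ThresholdExpression replaces the deprecated Threshold field; this one alerts at $500 or more of total impact.
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "finops-cost-anomalies",
    "ThresholdExpression": {"Dimensions": {"Key": "ANOMALY_TOTAL_IMPACT_ABSOLUTE", "MatchOptions": ["GREATER_THAN_OR_EQUAL"], "Values": ["500"]}},
    "Frequency": "IMMEDIATE",
    "MonitorArnList": ["arn:aws:ce::123456789012:anomalymonitor/..."],
    "Subscribers": [{"Type": "SNS", "Address": "arn:aws:sns:us-east-1:123456789012:finops-alerts"}]
  }'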

Quick quiz

Question 1 of 5

A cost-anomaly alert shows EC2 in account 412988273341 jumped from $188.62/day expected to $1,366.40/day actual — a 7.2× spike. detectionCount is 1. What's the right next step?

Keep learning

Dig deeper into cost anomaly detection, attribution, and FinOps incident response.

You've completed Investigate a cost anomaly. You now know how anomaly detection works, the usage-vs-rate triage decision tree, how to attribute a 7× spike to a resource and an owner in under thirty minutes, and how to close every anomaly with a one-line post-mortem that compounds team knowledge. The next time the inbox fires an ml-anomaly with $1,177 of variance, you'll have a four-step loop — attribute, diagnose, decide, post-mortem — ready to run.

Cost anomalies: what finance needs to know

A statistical alert that the bill changed — and a prompt to find out whether it should have

A cost anomaly is a data point: the actual spend on a given (account, service) pair on a given day came in materially outside the range the model predicted based on the prior 30–90 days of history. The detector learned the normal rhythm of that spend — weeknight batch spikes, weekend dips, end-of-month jobs — and flagged a value that breaks that pattern. It is a signal that something changed, not a diagnosis of what changed or whether it matters.

For finance, the number that counts is the variance in dollars: how much did actual exceed expected, and has it persisted? A single-day $1,177 variance on a $188/day workload is notable but might be a one-shot batch run. Sixteen consecutive days of the same variance is about $18,800 of accumulated unplanned spend and, if it continues, a run-rate of roughly $35k a month ($424k annualised) that belongs in the forecast revision and the budget conversation. detectionCount is the variable to watch; it converts a spike into a rate.

The central question for every anomaly is ownership: which team ran the cost up, was it intentional, and does the forecast need to move? A usage spike caused by a feature launch is expected spend that just wasn't communicated to finance in time — the right response is to update the budget, not to reverse it. A forgotten test rig is waste that should be stopped. The detector cannot tell the difference; only attribution to an owner and a conversation with that team can.

This lesson is for the finance partner who wants to understand what a cost anomaly alert actually represents, how the numbers in the alert (expected cost, actual cost, variance, detectionCount) translate to budget exposure, and how to distinguish planned spend that wasn't communicated from genuine waste. No CLI commands required — the focus is on the attribution workflow, the decision to accept or fix, and the forecast and budget adjustments that close the loop.

How a finance partner moves through an anomaly alert

Dana is the finance partner for the platform org. When the $1,177 EC2 anomaly lands in her inbox she doesn't start with the CLI; she starts with the numbers. Expected $188.62/day. Actual $1,366.40/day. detectionCount: 1. That last number matters: a single-day spike at 7× could be a one-shot batch run that will self-correct; if detectionCount climbs to 5 or 10 she's looking at $6k–$12k of unplanned month-to-date variance and a potential $35k/month run-rate change.

Her first question is whether there's a named owner on the account. She checks the account's cost-allocation tags — account 412988273341 is tagged to the ML Research team. That gives her someone to call before escalating anything. She drops a one-line Slack message: "$1,177 EC2 spike in your account yesterday — intentional? If so I need to update the forecast."

K. Tanaka replies within the hour: two p4d.24xlarge GPU instances launched for a short training run, expected to terminate by Sunday. Dana logs the conversation, notes the expected termination, and sets a calendar check for Monday to confirm the instances stopped and the anomaly closed. If they're still running Monday, that becomes a $35k/month forecast revision conversation — and a budget amendment request if the run goes longer than the sprint.

The financial exposure of an unattributed anomaly

The direct cost is arithmetic. A $188/day EC2 account at 7× produces an extra $1,177/day of unplanned spend. One day is a rounding error in most budgets. But if the anomaly persists for 30 days — which is the median outcome when nobody claims ownership — that's $35,310 of unbudgeted run-rate, roughly $424k annualised. The detector flagged it on day one; every subsequent day of delay is a finance problem, not a detection problem.

The second financial impact is forecast contamination. Savings Plan coverage recommendations, RI sizing, and next-quarter budget models all ingest recent spend data. An unattributed anomaly that persists for four to six weeks silently inflates the baseline those models train on. You end up over-committing to reserved capacity because the model believed $1,366/day was the new normal for that account — and unwinding an over-commit costs money too.

The chargeback and audit-trail impact runs parallel. If your chargeback model allocates anomalous spend to the team that owns the account, an unattributed spike shows up as an unexplained cost spike in that team's report. Finance gets questions it can't answer, teams dispute charges they didn't authorise, and the credibility of the cost allocation model erodes. A one-line attribution note per anomaly — "ML Research team, training run, intentional, expected to close Sunday" — is all the audit trail you need.

The controllable variable is speed of closure. Anomalies resolved within 24 hours rarely accumulate more than $1–2k in waste; anomalies that drift for a week routinely reach $8–12k before someone escalates. SLA-ing the response time — e.g., any anomaly over $500 variance gets an owner claim within 4 business hours — is the single highest-ROI process change in an anomaly-management program.

The finance partner's triage loop for a cost anomaly

Finance can't run Cost Explorer queries or CloudTrail lookups directly, but it owns the SLA, the attribution record, and the budget decision that closes the anomaly. Four steps, used every time.

1. Convert variance to run-rate exposure

The first thing to compute is detectionCount × daily variance. One day of $1,177 is noise; ten days of $1,177 is $11,770 of month-to-date overrun and, if it continues, a run-rate of about $35k a month ($424k annualised) that belongs in the next forecast revision. Sorting the anomaly inbox by (detectionCount × variance) rather than severity alone surfaces the ones that matter financially.
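
The arithmetic fits in a spreadsheet, but a short sketch makes the ranking rule explicit. Only the first inbox entry below comes from this lesson's example; the other two, and the function names, are hypothetical. The monthly run-rate assumes a 30-day month.

```python
def exposure(daily_variance, detection_count):
    """Overrun so far: days the anomaly has persisted x daily variance."""
    return daily_variance * detection_count


def monthly_run_rate(daily_variance, days_in_month=30):
    """What the variance costs per month if it continues unchanged."""
    return daily_variance * days_in_month


inbox = [
    {"id": "ec2-ml-research", "severity": "HIGH", "variance": 1177.78, "detectionCount": 1},
    {"id": "s3-logs", "severity": "MEDIUM", "variance": 240.00, "detectionCount": 12},    # hypothetical
    {"id": "rds-staging", "severity": "LOW", "variance": 95.50, "detectionCount": 3},     # hypothetical
]

ranked = sorted(
    inbox, key=lambda a: exposure(a["variance"], a["detectionCount"]), reverse=True
)
for anomaly in ranked:
    overrun = exposure(anomaly["variance"], anomaly["detectionCount"])
    run_rate = monthly_run_rate(anomaly["variance"])
    print(f'{anomaly["id"]:18s} overrun ${overrun:>9,.2f}  run-rate ${run_rate:>10,.2f}/month')
```

In this made-up inbox a twelve-day MEDIUM anomaly outranks the one-day HIGH spike, because accumulated exposure, not the size of a single day, drives the ordering.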

2. Check ownership before escalating

Use the account's cost-allocation tags or the team responsible for that account in your cloud governance model to identify the owner before pinging anyone. A cold message to the wrong team creates noise; a message to the right owner with "$1,177 spike in your EC2 account yesterday — was this intentional?" usually gets a same-day response. Document the response — even a one-line Slack reply — as the attribution record.

3. Decide: update the forecast or track the waste

Once the owner responds, the finance decision is binary. If the spike was intentional — a feature launch, a training run, a marketing event — log the reason and update the relevant account's forecast for however long the elevated spend is expected to continue. If it was accidental — a forgotten test rig, an autoscaling bug — log it as waste-to-recover and track the account until next-day spend returns to the expected band. Both outcomes need a record; the worst outcome is an anomaly that closes without any note.

4. Set the response SLA and enforce it

Anomalies over a materiality threshold — e.g., any single-day variance above $500, or any anomaly with detectionCount ≥ 3 — should have a response SLA of four business hours for owner identification and 24 hours for a documented disposition (intentional / accidental / under investigation). SLA compliance is a metric worth tracking in the FinOps review: the ratio of anomalies closed with an attribution note within SLA is the leading indicator of how well the organisation's accountability loop is working.
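
The compliance metric can be computed from a plain export of the anomaly log. The sketch below assumes each record has a detection timestamp and an optional disposition timestamp; the record layout, field names, and sample data are hypothetical. It measures only the 24-hour disposition SLA in elapsed hours, leaving out the four-business-hour owner claim, which would need a business calendar.

```python
from datetime import datetime, timedelta

MATERIALITY_USD = 500.0                   # single-day variance threshold from the text
MATERIALITY_COUNT = 3                     # detectionCount threshold from the text
DISPOSITION_SLA = timedelta(hours=24)     # documented disposition within 24 hours


def is_material(anomaly):
    """An anomaly is in scope if either materiality rule applies."""
    return (anomaly["variance"] > MATERIALITY_USD
            or anomaly["detectionCount"] >= MATERIALITY_COUNT)


def sla_compliance(anomalies):
    """Share of material anomalies with a disposition recorded within the SLA."""
    material = [a for a in anomalies if is_material(a)]
    if not material:
        return None
    on_time = [
        a for a in material
        if a["disposition_at"] is not None
        and a["disposition_at"] - a["detected_at"] <= DISPOSITION_SLA
    ]
    return len(on_time) / len(material)


# Hypothetical anomaly log.
t0 = datetime(2026, 2, 27, 9, 0)
anomalies = [
    {"id": "a-1", "variance": 1177.78, "detectionCount": 1, "detected_at": t0,
     "disposition_at": t0 + timedelta(hours=1)},
    {"id": "a-2", "variance": 620.00, "detectionCount": 2, "detected_at": t0,
     "disposition_at": t0 + timedelta(hours=30)},
    {"id": "a-3", "variance": 120.00, "detectionCount": 4, "detected_at": t0,
     "disposition_at": t0 + timedelta(hours=20)},
    {"id": "a-4", "variance": 900.00, "detectionCount": 1, "detected_at": t0,
     "disposition_at": None},
    {"id": "a-5", "variance": 80.00, "detectionCount": 1, "detected_at": t0,
     "disposition_at": t0 + timedelta(hours=2)},
]

rate = sla_compliance(anomalies)
print(f"disposition SLA compliance: {rate:.0%}")
```

Anomalies below the materiality threshold are excluded from the denominator, so a flood of small alerts cannot flatter the number.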

Quick quiz

Question 1 of 5

An anomaly has been open for 8 days: EC2 in account 412988273341 running $1,177/day above expected. detectionCount is 8. No owner has responded. What's the right finance move?

You've finished the finance partner's view of cost anomaly investigation. You know how to convert a daily variance into a run-rate exposure, how to identify the account owner before escalating, how to log the disposition — intentional spend or waste-to-recover — in a way that satisfies the audit trail, and how a response-time SLA converts an unpredictable alert channel into a measurable governance metric. Next time an anomaly sits open for three days, you'll have the framing to escalate it without waiting for engineering to notice.

Cost anomalies: the headline

The bill changed materially — does the organisation know why?

A cost anomaly alert means the cloud bill for a specific account and service moved well outside what the model expected based on recent history. The 7.2× EC2 spike in this lesson's example — $188/day expected, $1,366/day actual — is not necessarily a problem. It might be a planned capacity event. What it is, definitively, is a change the organisation should be able to explain.

The leadership question is not technical: it's whether there is a named owner who knows why spend moved, and whether that information reached finance in time to update the forecast. Anomalies that sit unanswered for weeks become the surprise line items at the next QBR. The maturity signal is speed of attribution: the best-run teams close every anomaly with a one-sentence owner response within 24 hours, turning a potential $35k monthly surprise into a two-thousand-dollar one-off.

A short read on why cost anomalies are a leadership concern even when the amounts seem small. You'll understand the one question an anomaly always demands an answer to — does this organisation know why the bill changed? — and what a mature anomaly-response culture looks like: fast attribution, a named owner, and a one-sentence closure note. No technical depth needed.

What this anomaly looks like from the top

The same Friday-morning anomaly lands in the leadership digest: a $1,177 single-day cost spike on EC2 in the ML Research account. The operational question — what resource, which team — was answered inside an hour. K. Tanaka launched two GPU instances for a training run; they're expected down by Sunday. Finance has the note.

The thing worth tracking at leadership level isn't this specific anomaly — it's whether the pattern holds. Did the team close it within 24 hours? Yes. Did the owner respond with a one-sentence explanation? Yes. Did finance get the information they needed to update the forecast if the run extends? Yes. That's exactly the anomaly-response culture that keeps a $1k spike from becoming a $35k surprise on the QBR slide.

The maturity signal isn't zero anomalies — it's that every anomaly has a named owner and a one-line closure note. When the digest starts showing anomalies that sat unanswered for five days, that's the governance gap to address.

Why unattributed anomalies are a governance gap, not just a cost issue

An anomaly that nobody investigates within 24 hours is a signal about how the organisation governs cloud spending, not just a line on the bill. The dollar amount in isolation — $1,177 on day one — is manageable. What it tests is whether there is a functioning accountability loop: a named owner per cloud account, a process for claiming an anomaly, and a closed-loop response that reaches finance before the month-end actuals land.

When that loop doesn't exist, the second-order cost is more damaging than the first. Unattributed anomalies pollute the spend data that informs capacity planning, reserved-capacity commitments, and budget forecasts. Six weeks of unacknowledged anomalous spend can skew the numbers that drive millions of dollars of annual commitment decisions — at which point the $35k anomaly has a much larger tail.

The trust impact compounds the financial one. Finance partners who stop trusting cloud spend forecasts because anomalies frequently go unexplained start requiring manual sign-off on every cloud spend increase — slowing down engineering velocity as a side effect. Engineering teams who stop trusting the alert channel because alerts rarely get acknowledged start muting notifications — eliminating the early-warning system entirely. Both outcomes are self-inflicted and preventable by the same process: a fast, consistent, documented response to every anomaly over a materiality threshold.

The leadership frame for anomaly response

The executive's role in anomaly response isn't to triage individual spikes — it's to ensure the organisation has the process and accountability structure that makes fast triage the default. Three levers.

1. Require a named owner per cloud account

Every cloud account should map to a named team or cost centre, and that mapping should be current and queryable. Anomaly attribution fails at the first step when "who owns this account?" doesn't have an instant answer. A maintained account register — even a shared spreadsheet that feeds into the cost-allocation tagging policy — is the foundation the whole triage loop rests on.

2. Set a public SLA and review it monthly

Publish a response-time expectation for anomalies over a materiality threshold — for example, any anomaly over $500 variance gets an owner claim within four business hours and a disposition note within 24. Review the SLA compliance rate monthly in the finance or FinOps operating review. The number itself matters less than the act of measuring it: teams that know anomaly-response time is tracked tend to respond faster.

3. Treat the post-mortem library as institutional memory

Recurring anomaly patterns — GPU instances left running over weekends, cross-region data transfer spikes, autoscaling groups that never scale back down — compound until they're named and documented. Ask the FinOps team to maintain a one-liner-per-anomaly log and reference it when the same pattern fires again. The second occurrence of any anomaly type should take thirty seconds to triage, not thirty minutes.

Quick quiz

Question 1 of 5

Your FinOps team reports that last month 60% of anomalies over $500 variance were closed within 24 hours with an attribution note; 40% drifted for more than 5 days with no response. What's the right read?

That's the lesson. Two things to carry forward: every cost anomaly is a test of whether the organisation has a functioning ownership loop — a named account owner, a fast attribution, and a one-line disposition that reaches finance before month-end. And the leading indicator of that loop's health is a single metric: the share of anomalies over your materiality threshold that close with an attribution note within 24 hours. When that number is high, surprises at the QBR get rarer. When it's low, that's the governance gap to fix.
