Cost anomalies: the basics
What does it mean for cloud cost to be "anomalous"?
A cost anomaly is a daily spend value that lands outside the expected band for a given (account, service) tuple — far enough outside that a baseline model trained on the last 30–90 days of history flags it as unlikely to be normal variation. Daily AWS spend has natural rhythm: batch jobs run nightly, traffic dips on weekends, monthly cron jobs spike on the 1st. Anomaly detection learns those rhythms and surfaces the points that break them.
AWS Cost Anomaly Detection and most third-party tools (including this dashboard's ml-anomaly detector) work the same way conceptually: they train a model per monitored dimension, compute an expected value plus a confidence band for each day, and emit an anomaly when the actual value crosses that band. The output isn't "something is wrong" — it's "this looks unusual, here's what changed."
A real example: Amazon EC2 cost in one account jumped from an expected $188.62/day to an actual $1,366.40/day — variance $1,177.78, a 7.2× spike. The model had been watching that account+service pair for weeks and was confident the previous days fit the band. That's an anomaly worth a human looking at — not necessarily an emergency, but the bill changed and someone needs to know why.
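The figures in that example can be reproduced with a few lines of shell. This is only a sketch of the arithmetic; `anomaly_stats` is a hypothetical helper name, not part of any AWS tool.

```shell
# anomaly_stats: variance and spike multiple for one anomaly record.
# Arguments: expected daily cost, actual daily cost (both in dollars).
anomaly_stats() {
  awk -v e="$1" -v a="$2" 'BEGIN {
    # variance = actual - expected; multiple = actual / expected
    printf "variance=%.2f multiple=%.1fx\n", a - e, a / e
  }'
}

anomaly_stats 188.62 1366.40   # prints: variance=1177.78 multiple=7.2x
```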
In this lesson you'll learn how cloud cost anomaly detection actually works, the triage decision tree for any spike (usage change vs. rate change), how to attribute a 7× anomaly back to a specific resource and a specific team in under thirty minutes, and how to decide whether the new cost should be accepted (budget up) or fixed (root-cause down). You'll see real Cost Explorer queries and the post-mortem format that turns one-off spikes into team-wide muscle memory.
The most expensive bug is the one nobody owns
A widely cited industry estimate is that 30% or more of cloud spend is wasted, and a meaningful share of that waste lives in anomalies that nobody investigated because they fell through the cracks between teams. A 7× EC2 spike on a shared account is everyone's problem and therefore nobody's. Anomalies that a team claims early tend to get fixed; the ones that bounce between Slack channels for two weeks become the line items finance asks about in the next QBR.
Triaging a cost anomaly in action
Marco runs FinOps at a healthcare SaaS. On a Friday morning the dashboard fires an anomaly: Amazon Elastic Compute Cloud in account 412988273341 jumped from $188.62/day expected to $1,366.40/day actual on 2026-02-26 — variance $1,177.78, 7.2×. The detector tags it cost-spike, ml-anomaly. Severity HIGH.
Before pinging anyone, Marco runs the triage decision tree. Cost = usage × rate. So the spike is either a usage change (a new resource, more requests, more data) or a rate change (a Savings Plan that lapsed, a region with different pricing, an instance family change). He starts with usage because that's the more common cause and the easier one to verify.
He pulls Cost Explorer grouped by USAGE_TYPE for that account+service+date range. The 80/20 of any anomaly is usually one usage type doing 80% of the damage.
Group the anomalous day's spend by usage type to see which line item moved.
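The grouping described here could be run with the AWS CLI roughly as follows. This is a sketch rather than the exact query from the walkthrough: `ce_filter` and `spend_by_usage_type` are hypothetical helper names, and the SERVICE value you pass must match what Cost Explorer reports for your account (for EC2 instances that is usually "Amazon Elastic Compute Cloud - Compute", but confirm it with `aws ce get-dimension-values`).

```shell
# ce_filter: build the Cost Explorer filter JSON for one account + service.
ce_filter() {
  printf '{"And":[{"Dimensions":{"Key":"LINKED_ACCOUNT","Values":["%s"]}},{"Dimensions":{"Key":"SERVICE","Values":["%s"]}}]}' "$1" "$2"
}

# spend_by_usage_type: print "usage type, unblended cost" rows, largest first.
# Arguments: account id, service name, start date, end date (YYYY-MM-DD).
# The End date is exclusive in Cost Explorer, so pass the day after.
spend_by_usage_type() {
  aws ce get-cost-and-usage \
    --time-period "Start=$3,End=$4" \
    --granularity DAILY \
    --metrics UnblendedCost \
    --filter "$(ce_filter "$1" "$2")" \
    --group-by Type=DIMENSION,Key=USAGE_TYPE \
    --query 'ResultsByTime[0].Groups[].[Keys[0], Metrics.UnblendedCost.Amount]' \
    --output text | sort -k2,2nr
}
```

For the anomaly above, the call would look like `spend_by_usage_type 412988273341 "Amazon Elastic Compute Cloud - Compute" 2026-02-26 2026-02-27`.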
One usage type accounts for ~72% of the variance — classic 80/20 of an anomaly.
Now confirm whether this is a new resource (RunInstances around detection time) or an existing one whose usage changed.
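One way to run that check is CloudTrail's `lookup-events` command, sketched below. `recent_launches` is a hypothetical wrapper name. Note that the lookup is per region and only covers the last 90 days of management events, so the region and time window are assumptions you supply.

```shell
# recent_launches: list RunInstances calls in a time window for one region.
# Arguments: region, start time, end time (ISO 8601, e.g. 2026-02-25T00:00:00Z).
recent_launches() {
  aws cloudtrail lookup-events \
    --region "$1" \
    --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
    --start-time "$2" \
    --end-time "$3" \
    --query 'Events[].[EventTime, Username, EventId]' \
    --output table
}
```

A call such as `recent_launches us-east-1 2026-02-25T00:00:00Z 2026-02-27T00:00:00Z` returns the principal (`Username`) behind each launch, which is the attribution step. If nothing shows up, the resource probably already existed and its usage changed instead.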
CloudTrail confirms a new launch by a specific IAM principal — attribution complete.
How anomaly detection actually works (deep dive)
Under the hood, the detector trains a separate model per monitored dimension — most commonly per (account_id, service) tuple, sometimes finer (per usage_type, per tag value). For each tuple it ingests the last 30–90 days of daily UnblendedCost, fits an expected curve (typically a seasonal decomposition plus a residual model), and computes a confidence band. When a new day's actual spend lands outside that band, the detector emits an event with expectedCost, actualCost, variance, the band width, and a confidence score.
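To make the band idea concrete, here is a deliberately simplified sketch. It assumes a plain mean plus or minus k standard deviations band over the training window, not the seasonal decomposition described above, and it is not how AWS or this dashboard actually implements detection. `band_check` is a hypothetical name.

```shell
# band_check: read one daily cost per line on stdin (the training window)
# and test a new day's actual spend against mean +/- k standard deviations.
# Arguments: actual spend, k (optional, defaults to 3).
band_check() {
  awk -v actual="$1" -v k="${2:-3}" '
    NF { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n < 2) { print "insufficient history"; exit 1 }
      mean = sum / n
      var = (sumsq - n * mean * mean) / (n - 1)   # sample variance
      if (var < 0) var = 0                        # guard against rounding
      sd = sqrt(var)
      lo = mean - k * sd; hi = mean + k * sd
      status = (actual < lo || actual > hi) ? "ANOMALY" : "normal"
      printf "%s expected=%.2f band=[%.2f, %.2f] actual=%.2f\n", status, mean, lo, hi, actual
    }'
}
```

Feeding it a short window, for example `printf '%s\n' 180 190 185 195 190 | band_check 1366.40`, flags the day as an anomaly, while a value such as 192 stays inside the band. A real detector would also model weekday and month-start seasonality before computing the residual band.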
The detector also tracks detectionCount and detectionDates — every day the anomaly persists, the count ticks up. A single day at 7× could be a one-off (a one-shot batch job, a botched test). Sixteen consecutive days at 7× means the cost base has permanently moved and the model is now consistently confused. Persistence is the strongest signal that something real happened; the cost-anomaly inbox should sort by detection count, not severity.
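Sorting by persistence is a one-liner if the inbox can be exported. The sketch below assumes a hypothetical CSV layout of `id,severity,detectionCount,variance`; the layout and the `rank_inbox` name are illustrative only.

```shell
# rank_inbox: read CSV rows of id,severity,detectionCount,variance on stdin
# and print the most persistent anomalies first, breaking ties by variance.
rank_inbox() {
  sort -t, -k3,3nr -k4,4nr
}
```

With this ordering, a MEDIUM anomaly that has fired for sixteen days lands above a HIGH one seen once, which matches the priority argued for above.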
AWS Cost Anomaly Detection (the AWS-native version) lets you create monitors of four types: AWS services, linked accounts, cost categories, and cost allocation tags. Alerting is configured separately through alert subscriptions: each subscription attaches to one or more monitors and carries its own alert threshold (absolute $ or %) and notification destinations (SNS, email, Slack via EventBridge). The same detection logic powers this dashboard's ml-anomaly channel; the difference is where the model lives and how the alerts are surfaced, not what they mean.
# Pull the raw anomaly record from AWS Cost Anomaly Detection.
aws ce get-anomalies \
  --date-interval StartDate=2026-02-26,EndDate=2026-02-26 \
  --query 'Anomalies[?Impact.TotalImpact>`500`].[AnomalyId,AnomalyStartDate,Impact.TotalActualSpend,Impact.TotalExpectedSpend,DimensionValue]' \
  --output table

# List the monitors so you know which dimensions are actually watched.
aws ce get-anomaly-monitors \
  --query 'AnomalyMonitors[].[MonitorName,MonitorType,MonitorDimension]' \
  --output table
What is the impact of unattributed cost anomalies?
The direct impact is the bill. A persistent 7× spike on an EC2 account that was running $188/day costs an extra ~$35k a month if nobody catches it — and "nobody catches it" is the default outcome when an anomaly bounces between Slack channels for a week. The detector did its job; the cost is the gap between detection and ownership.
The second-order impact is forecast pollution. Budgets, savings-plan recommendations, and capacity plans all train on recent spend. An unflagged anomaly that persists becomes the new baseline; six weeks later you're sizing your reserved-capacity commitment on numbers that include $35k/month of accidental p4d instance spend. The model can't tell the difference between intentional growth and a forgotten test rig.
The third-order impact is trust. Finance loses confidence in cloud spend forecasts when anomalies don't get explained; engineering loses confidence in the alert channel when too many anomalies fire without follow-up. Both problems compound — finance starts demanding manual approval for any spend increase, engineering starts muting the channel.
The good news: anomalies that are attributed within 24 hours rarely cost more than a few thousand dollars in waste. Speed of triage is the dominant variable. A clear decision tree and a documented owner per (account, service) tuple turns a $35k/month surprise into a $2k one-off.
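The gap between those two outcomes is just the daily variance multiplied by the number of days the anomaly goes unowned. A small sketch of that arithmetic, using the variance from the running example (`waste_for_delay` is a hypothetical name):

```shell
# waste_for_delay: extra spend accrued while an anomaly goes unowned.
# Arguments: daily variance in dollars, days between detection and fix.
waste_for_delay() {
  awk -v v="$1" -v d="$2" 'BEGIN { printf "%.2f\n", v * d }'
}

waste_for_delay 1177.78 30   # a month of drift: 35333.40
waste_for_delay 1177.78 2    # caught within two days: 2355.56
```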
How do you triage and act on a cost anomaly?
Anomaly response is a four-step loop. Skipping any step turns a one-day spike into a six-week mystery.
1. Attribute — find the resource and the owner
Group the anomalous day by USAGE_TYPE in Cost Explorer to find the 80/20 line item, then narrow to the specific resource: either group by RESOURCE_ID (available through get-cost-and-usage-with-resources once resource-level data is enabled, and only for the last 14 days) or look for CloudTrail RunInstances/CreateBucket events around the detection timestamp. Cost-allocation tags (or the IAM principal that created the resource) give you the owning team. The whole exercise is 15 minutes if your tagging is in shape.
2. Diagnose — usage change or rate change?
Cost = usage × rate. If usage rose, check CloudTrail for new launches and CloudWatch for traffic/queue/processing increases on existing resources. If usage is flat, check rate: a Savings Plan or RI that lapsed, a workload that moved to a more expensive region, a switch from spot to on-demand. Data-transfer spikes (especially cross-AZ and cross-region) are the most-missed category — always check DataTransfer-Regional-Bytes and DataTransfer-Out-Bytes.
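Because cost = usage × rate, comparing a baseline day with the anomalous day splits the spike into a usage ratio and an effective-rate ratio. The sketch below assumes you already have cost and usage quantity for both days (Cost Explorer's UsageQuantity metric is one possible source). `classify_change` is a hypothetical helper, and picking the larger ratio is a simplification for illustration.

```shell
# classify_change: decide whether usage or rate moved more between two days.
# Arguments: baseline cost, baseline usage, anomalous cost, anomalous usage.
classify_change() {
  awk -v c0="$1" -v u0="$2" -v c1="$3" -v u1="$4" 'BEGIN {
    if (u0 <= 0 || u1 <= 0 || c0 <= 0) { print "insufficient data"; exit 1 }
    usage_ratio = u1 / u0
    rate_ratio = (c1 / u1) / (c0 / u0)   # effective rate = cost / usage
    kind = (usage_ratio >= rate_ratio) ? "usage change" : "rate change"
    printf "%s (usage x%.2f, rate x%.2f)\n", kind, usage_ratio, rate_ratio
  }'
}
```

With illustrative numbers, `classify_change 100 50 700 350` reports a usage change (seven times the hours at the same rate), while `classify_change 100 50 160 50` reports a rate change (same hours, 1.6 times the effective rate), which is the pattern a lapsed Savings Plan would produce.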
3. Decide — intentional or accidental?
Once you know what changed and who owns it, ask the owner one question: was this intentional? If yes (feature launch, planned capacity, marketing event) update the budget and forecast — that's now the new normal and the anomaly model needs to relearn. If no (forgotten test rig, autoscaling bug, runaway loop) treat it as a real incident: fix the root cause and verify the next day's spend returns to the expected band.
4. Post-mortem — one-liner per anomaly
Every anomaly should produce a single-sentence note: "X caused Y, fixed by Z, prevented by W." Build the team library and reference it the next time something similar fires. Patterns repeat — p4d instances left running over a weekend, ECR pulls from the wrong region, ALB log retention misconfigured — and the second occurrence should be a 30-second triage, not a re-investigation.
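A team library can be as simple as an append-only text file that you grep later. The sketch below formats one entry in the "X caused Y, fixed by Z, prevented by W" shape. The function name, field order, and file name are assumptions for illustration.

```shell
# postmortem_line: format a one-line post-mortem entry.
# Arguments: date, account id, service, cause (X), effect (Y), fix (Z), prevention (W).
postmortem_line() {
  printf '%s %s/%s: %s caused %s, fixed by %s, prevented by %s.\n' \
    "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}
```

An entry could be recorded with something like `postmortem_line 2026-02-27 412988273341 EC2 "a forgotten test rig" "a 7.2x spike" "terminating the instances" "a weekend shutdown schedule" >> postmortems.log`, where every value is a made-up example and `postmortems.log` is a hypothetical file.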
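# Set up a per-service anomaly monitor with an SNS notification.
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "per-service-monitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

# Subscribe an SNS topic to that monitor, using the MonitorArn returned above.
# SNS subscribers require Frequency IMMEDIATE (DAILY and WEEKLY are email-only),
# and ThresholdExpression replaces the deprecated Threshold field.
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "finops-cost-anomalies",
    "ThresholdExpression": {
      "Dimensions": {
        "Key": "ANOMALY_TOTAL_IMPACT_ABSOLUTE",
        "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
        "Values": ["500"]
      }
    },
    "Frequency": "IMMEDIATE",
    "MonitorArnList": ["arn:aws:ce::123456789012:anomalymonitor/..."],
    "Subscribers": [{"Type": "SNS", "Address": "arn:aws:sns:us-east-1:123456789012:finops-alerts"}]
  }'
Quick quiz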
Question 1 of 5: A cost-anomaly alert shows EC2 in account 412988273341 jumped from $188.62/day expected to $1,366.40/day actual, a 7.2× spike. detectionCount is 1. What's the right next step?
Keep learning
Dig deeper into cost anomaly detection, attribution, and FinOps incident response.
- AWS Cost Anomaly Detection documentation: Service docs covering monitor types, alert subscriptions, and the underlying detection model.
- AWS Cost Explorer API reference: The full set of dimensions and group-by keys for attribution queries.
- FinOps Foundation Anomaly Management capability: How anomaly response fits the broader FinOps lifecycle, including roles and SLAs.
- AWS Well-Architected Cost Optimization Pillar: Anomaly detection in the broader context of cost-aware architecture and governance.
You've completed Investigate a cost anomaly. You now know how anomaly detection works, the usage-vs-rate triage decision tree, how to attribute a 7× spike to a resource and an owner in under thirty minutes, and how to close every anomaly with a one-line post-mortem that compounds team knowledge. The next time the inbox fires an ml-anomaly with $1,177 of variance, you'll have a four-step loop — attribute, diagnose, decide, post-mortem — ready to run.