Harden SageMaker and ML workloads

One capability across SageMaker notebooks, models, processing jobs and endpoints: cut their network paths, drop their privileges and encrypt their traffic so a single compromised ML job cannot reach your data or your account.

15 min·10 sections·AWS

Last reviewed 16 June 2026

Remediates AWS Security Hub: SageMaker.1 SageMaker.2 SageMaker.3 SageMaker.4 SageMaker.5 SageMaker.8 SageMaker.9 SageMaker.10 SageMaker.11 SageMaker.12 SageMaker.13 SageMaker.14 SageMaker.15 SageMaker.16 SageMaker.17 SageMaker.19

Hardening ML workloads: the basics

Why a SageMaker resource is also a credential-bearing compute box on your network

Amazon SageMaker is not one thing; it is a fleet of compute resources that each carry an IAM execution role and sit somewhere on your network. A notebook instance is a managed JupyterLab server. A model is an inference container. A processing job (data quality, model quality, model bias, explainability) is a transient container that reads your data. An endpoint serves predictions at scale. Each of these runs with permissions to read S3 buckets and call other services, and by default many of them can reach the public internet, run with broad OS privileges, or send traffic between containers in the clear.

AWS Security Hub turns each weak default into its own control, which is why a single ML estate can fail a dozen or more SageMaker checks at once. SageMaker.1, SageMaker.2 and SageMaker.3 harden notebook instances (no direct internet access, deploy in a VPC, no root access). SageMaker.4 requires more than one instance behind an endpoint for availability. SageMaker.5 enables network isolation on models. SageMaker.8 keeps notebook instances on a supported platform version. SageMaker.9 through SageMaker.17 and SageMaker.19 isolate the network on monitoring and processing jobs, encrypt inter-container traffic on those jobs, encrypt the feature group offline store, and require a private container registry. They look like separate findings, but they are one capability: shrink what an ML job can reach and remove the credentials and privileges it does not need.

The reason these matter more than the average finding is that an ML resource almost always holds a copy of real data plus a role that can read more. The job is to find every SageMaker resource that can reach the internet, run as root, or talk in plaintext, decide which genuinely need an exception, route the rest through your VPC with isolation and encryption on, and then make the hardened configuration the default so new resources arrive compliant.

In this lesson you will learn how SageMaker expresses network reachability, privilege and encryption across notebooks, models, processing jobs and endpoints, how to find every weakly-configured ML resource in an account, and how to harden them without breaking the science. The Controls this lesson covers section lists every Security Hub control in this capability, each linking to a deep page with the exact check and a copy-and-paste fix.

Fun fact

The notebook that owned the data lake

In a well-known cloud red-team exercise, the entry point was not a leaked key or a public bucket. It was a SageMaker notebook left with direct internet access, root access and a generously scoped execution role. The tester opened a terminal, queried the instance metadata endpoint at 169.254.169.254 to lift the role's temporary credentials, and from there read every S3 bucket the role could touch: the company's entire training data lake. None of it required exploiting AWS itself. The weak defaults on one ML resource were the whole exploit. With network isolation on, root off, and the resource inside a VPC, the same foothold reaches nothing.

Finding weakly-configured ML resources across an estate

Priya is the ML platform security lead at a healthcare analytics company preparing for a HIPAA assessment. Security Hub shows SageMaker failures spread across notebooks, models and processing jobs in accounts that pre-date the team's current guardrails.

Rather than work the findings one by one, she starts by listing the notebooks that can reach the internet directly, since those are the highest-impact and the ones that need a rebuild rather than a flag flip.

Start with the resources that expose data directly. Notebooks with direct internet access and no VPC subnet are a common and high-impact finding.

$ aws sagemaker list-notebook-instances --query 'NotebookInstances[].NotebookInstanceName' --output text | tr '\t' '\n' | while read -r n; do aws sagemaker describe-notebook-instance --notebook-instance-name "$n" --query '[NotebookInstanceName,DirectInternetAccess,SubnetId,RootAccess]' --output text; done

ml-feature-exploration Enabled None Enabled

shared-training-box Enabled None Enabled

fraud-model-prod Disabled subnet-0ab12cd34 Disabled

# Two notebooks route straight to the internet, run as root, and carry data-reading roles.

Enabled internet plus no Subnet plus Enabled root is the worst case: unmonitored egress, OS-level privilege and a role that can read data. Fix these first.
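The triage rule above can be sketched as a small shell function. This is a minimal sketch, not part of any AWS tooling: the function name classify_notebook and the tier labels are assumptions made for illustration. The three arguments are the DirectInternetAccess, SubnetId and RootAccess values as they appear in the listing.

```shell
# classify_notebook INTERNET SUBNET ROOT
# Hypothetical helper: maps the three describe-notebook-instance fields to a triage tier.
# The tier names are illustrative and are not Security Hub severities.
classify_notebook() {
  internet=$1; subnet=$2; root=$3
  if [ "$internet" = "Enabled" ] && [ "$subnet" = "None" ] && [ "$root" = "Enabled" ]; then
    echo "critical"   # open egress, no VPC and root: fix first
  elif [ "$internet" = "Enabled" ] || [ "$subnet" = "None" ]; then
    echo "high"       # network exposure: needs a rebuild
  elif [ "$root" = "Enabled" ]; then
    echo "medium"     # root only: fixable in place on a stopped instance
  else
    echo "ok"
  fi
}

# Example: two rows from the listing above.
classify_notebook Enabled None Enabled
classify_notebook Disabled subnet-0ab12cd34 Disabled
```

Piping each listing row through a helper like this gives a worklist ordered by tier rather than by notebook name.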

How SageMaker hardening actually works

Most SageMaker controls resolve to one of three mechanisms. The first is network reachability: a notebook's DirectInternetAccess and SubnetId (SageMaker.1, SageMaker.2), a model's EnableNetworkIsolation (SageMaker.5), and the network isolation and VPC configuration on monitoring and processing jobs (part of the SageMaker.9 to SageMaker.17 and SageMaker.19 family). The second is privilege: a notebook's RootAccess (SageMaker.3) and keeping the notebook platform identifier on a supported version (SageMaker.8). The third is data protection on the jobs that read your data: EnableInterContainerTrafficEncryption on data-quality, model-quality, model-bias and explainability jobs, encryption on the feature group OfflineStoreConfig, and requiring that images come from a private ECR registry rather than a public one.
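For the job-level flags, the settings travel together in a network configuration passed at creation. The sketch below only builds that value in AWS CLI shorthand syntax. The helper name job_network_config is hypothetical and the subnet and security group IDs are placeholders.

```shell
# job_network_config SUBNET_ID SECURITY_GROUP_ID
# Hypothetical helper: prints a shorthand value that turns on both inter-container
# traffic encryption and network isolation, and pins the job to a VPC subnet.
job_network_config() {
  printf 'EnableInterContainerTrafficEncryption=true,EnableNetworkIsolation=true,VpcConfig={SecurityGroupIds=[%s],Subnets=[%s]}\n' "$2" "$1"
}

# Placeholder IDs; substitute your own private subnet and security group.
job_network_config subnet-0ab12cd34ef56 sg-0aa11bb22cc33
```

The printed string is intended as the --network-config argument of aws sagemaker create-processing-job, and the monitoring job-definition commands accept a network configuration of the same shape. Check the CLI reference for the exact command before relying on it.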

A critical operational detail is immutability. Several of the highest-value settings (DirectInternetAccess and SubnetId on a notebook, EnableNetworkIsolation on a model, and the encryption and isolation flags on a created job) are fixed at creation and cannot be flipped in place. Remediating those is a rebuild: stop and recreate the notebook, or create a new model and cut the endpoint over. RootAccess is the exception: it is mutable on a stopped notebook via update-notebook-instance. Knowing which is which is the difference between a five-minute fix and a planned migration.
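The mutable-versus-immutable split can be captured as a lookup that a remediation script consults before acting. This is an illustrative sketch built only from the settings named above. The function name remediation_method and its return labels are assumptions, and any setting not listed should be checked against the API reference rather than guessed.

```shell
# remediation_method SETTING
# Hypothetical lookup: says whether a setting can be changed in place or needs a rebuild.
remediation_method() {
  case "$1" in
    RootAccess)
      echo "update-in-place" ;;   # stop, update-notebook-instance, start
    DirectInternetAccess|SubnetId|EnableNetworkIsolation|EnableInterContainerTrafficEncryption)
      echo "rebuild" ;;           # fixed at creation: recreate the resource
    *)
      echo "unknown" ;;           # not covered here: check before assuming either way
  esac
}

for s in RootAccess DirectInternetAccess EnableNetworkIsolation; do
  printf '%s\t%s\n' "$s" "$(remediation_method "$s")"
done
```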

Security Hub evaluates these through AWS Config, mostly change-triggered, so a fix flips the finding to PASSED on the next evaluation rather than instantly. The strongest position is preventive: an SCP or IAM condition that denies creating a notebook with DirectInternetAccess Enabled or RootAccess Enabled, and IaC templates that set EnableNetworkIsolation and inter-container encryption to true by default, so the hardened state is the only state new resources can be born in.
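A preventive guardrail of this kind might look like the policy below, written to a file from the shell. Treat it as a sketch under stated assumptions rather than a vetted policy: it relies on the sagemaker:DirectInternetAccess and sagemaker:RootAccess condition keys, the file name and Sid values are invented for illustration, and it should be validated against the current SageMaker IAM reference and tried in a sandbox before being attached anywhere.

```shell
# Hypothetical output location for the policy document.
POLICY_FILE="${TMPDIR:-/tmp}/deny-open-notebooks.json"

# Creation is denied unless the request explicitly sets the value to Disabled,
# so a request that omits the parameter (and would get the default) is also denied.
# Updates are denied only when they explicitly try to turn root access on.
cat > "$POLICY_FILE" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCreateWithInternet",
      "Effect": "Deny",
      "Action": "sagemaker:CreateNotebookInstance",
      "Resource": "*",
      "Condition": { "StringNotEquals": { "sagemaker:DirectInternetAccess": "Disabled" } }
    },
    {
      "Sid": "DenyCreateWithRoot",
      "Effect": "Deny",
      "Action": "sagemaker:CreateNotebookInstance",
      "Resource": "*",
      "Condition": { "StringNotEquals": { "sagemaker:RootAccess": "Disabled" } }
    },
    {
      "Sid": "DenyUpdateToRoot",
      "Effect": "Deny",
      "Action": "sagemaker:UpdateNotebookInstance",
      "Resource": "*",
      "Condition": { "StringEquals": { "sagemaker:RootAccess": "Enabled" } }
    }
  ]
}
EOF

echo "wrote $POLICY_FILE"
# To register it as an SCP (not run here; the name and description are placeholders):
#   aws organizations create-policy --type SERVICE_CONTROL_POLICY \
#     --name deny-open-notebooks --description "Deny open SageMaker notebooks" \
#     --content "file://$POLICY_FILE"
```

The three statements are separate because condition keys inside a single statement are combined with AND, while the guardrail needs to deny on either setting.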

What is the impact of leaving ML workloads unhardened?

The direct impact is data exfiltration. An ML resource almost always carries an execution role with read access to S3 and other data stores, and scientists routinely pull copies of real datasets onto notebooks and into processing jobs. A resource with direct internet access, or a model container that can make outbound calls, has an unmonitored path to ship that data and those credentials straight out of your account, with none of your VPC flow logs or egress controls seeing it happen. Root access on a notebook amplifies this by letting a compromised session install tooling, persist and lift the role's credentials from the metadata service.

The second-order impact is blast radius and integrity. Inter-container traffic on a monitoring or processing job that runs in the clear can be read or tampered with; a public container registry pulls images you do not control into a job that touches sensitive data. Each weak default is attack surface that has to be defended continuously. Hardening shrinks that surface to the handful of controlled paths you actually operate.

On the compliance side, the common frameworks (SOC 2, ISO 27001, HIPAA, PCI DSS and FedRAMP) expect evidence of network segmentation, least privilege and encryption in transit on infrastructure that handles regulated data. A passing set of SageMaker controls across every account is a cheap, defensible artefact to hand an auditor, and it maps directly to the NIST 800-53 boundary-protection (SC-7) and least-privilege (AC-6) controls.

How do you harden ML workloads safely?

Work the capability as one loop rather than chasing individual findings. The order matters: find the resources that need a rebuild before you start, so you can plan the migrations rather than discover them mid-change.

1. Inventory every ML resource and what it can reach

Across accounts and regions, list notebook instances (DirectInternetAccess, SubnetId, RootAccess), models (EnableNetworkIsolation), monitoring and processing jobs (network isolation, inter-container encryption, VPC config), feature groups (offline store encryption) and the registries images come from. Cross-reference each resource's execution role against the data it can read. A notebook with broad S3 access is more urgent than a sandbox with none. Treat this inventory as the source of truth, not the finding count, because one resource can trigger several controls.

2. Separate genuine exceptions from drift, and confirm before destroying

Most weak configuration is unintended. For notebooks and models that need a rebuild, confirm with the owner that work is pushed to Git or S3 before you delete anything, since the volume is lost on delete. Establish whether any resource genuinely needs an outbound path (model weights, a third-party API); the right answer is to bake the dependency into the image or stage it in S3, not to leave the resource open.

3. Harden highest impact first, with the right method per setting

For mutable settings (RootAccess), stop the notebook, update the flag, start it again. For immutable settings (DirectInternetAccess, SubnetId, EnableNetworkIsolation), rebuild: recreate the notebook in a private subnet with internet access disabled, or create a new isolated model and cut the endpoint over blue-green. Turn on inter-container traffic encryption and network isolation on processing and monitoring jobs at creation, encrypt the feature group offline store, and switch jobs to a private ECR registry. Make sure locked-down notebooks still have a controlled route to packages, either a NAT gateway or VPC endpoints to a private package mirror, so pip and conda keep working.

4. Ratchet it shut with preventive guardrails

Cleanup without prevention just resets the clock. Use a Service Control Policy or IAM condition to deny creating notebooks with DirectInternetAccess Enabled or RootAccess Enabled, and bake EnableNetworkIsolation, inter-container encryption and a private registry into your CloudFormation, CDK or Terraform model and job modules so new resources arrive compliant. Keep the AWS Config rules running so Security Hub re-flags any drift.

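# Disable root across every notebook that has it on (mutable only on a stopped instance).
for n in $(aws sagemaker list-notebook-instances \
    --query 'NotebookInstances[].NotebookInstanceName' --output text); do
  root=$(aws sagemaker describe-notebook-instance --notebook-instance-name "$n" \
    --query 'RootAccess' --output text)
  status=$(aws sagemaker describe-notebook-instance --notebook-instance-name "$n" \
    --query 'NotebookInstanceStatus' --output text)
  if [ "$root" = "Enabled" ]; then
    # Stop only if running; a notebook that was already stopped stays stopped.
    if [ "$status" = "InService" ]; then
      aws sagemaker stop-notebook-instance --notebook-instance-name "$n"
      aws sagemaker wait notebook-instance-stopped --notebook-instance-name "$n"
    fi
    aws sagemaker update-notebook-instance --notebook-instance-name "$n" --root-access Disabled
    # The update is asynchronous (status Updating); wait for Stopped before restarting.
    aws sagemaker wait notebook-instance-stopped --notebook-instance-name "$n"
    if [ "$status" = "InService" ]; then
      aws sagemaker start-notebook-instance --notebook-instance-name "$n"
    fi
    echo "$n: root access disabled"
  fi
done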

# Immutable settings need a rebuild. Once work is saved and the old instance is deleted
# (the name must be free), recreate it locked down: private subnet, no direct internet.
# DirectInternetAccess and SubnetId cannot be changed in place.
aws sagemaker create-notebook-instance \
  --notebook-instance-name ml-feature-exploration \
  --instance-type ml.t3.medium \
  --role-arn arn:aws:iam::111122223333:role/SageMakerExecution \
  --subnet-id subnet-0ab12cd34ef56 \
  --security-group-ids sg-0aa11bb22cc33 \
  --direct-internet-access Disabled \
  --root-access Disabled
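The model side of the rebuild follows the same pattern: create a new isolated model, point a new endpoint config at it, then switch the endpoint. The sketch below prints the commands by default instead of running them. The helper names run and cutover_isolated_model, the variant name, the instance type, the image URI and the bucket are all placeholders chosen for illustration.

```shell
# DRY_RUN=1 (the default here) prints each command instead of executing it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '+ %s\n' "$*"; else "$@"; fi
}

# cutover_isolated_model ENDPOINT NEW_MODEL IMAGE_URI MODEL_DATA_URL ROLE_ARN
cutover_isolated_model() {
  endpoint=$1; model=$2; image=$3; data=$4; role=$5
  # 1. New model with network isolation on (it cannot be enabled on the existing model).
  run aws sagemaker create-model --model-name "$model" \
    --primary-container "Image=$image,ModelDataUrl=$data" \
    --execution-role-arn "$role" --enable-network-isolation
  # 2. New endpoint config that serves the isolated model from two instances.
  run aws sagemaker create-endpoint-config --endpoint-config-name "${model}-config" \
    --production-variants "VariantName=AllTraffic,ModelName=$model,InitialInstanceCount=2,InstanceType=ml.m5.large"
  # 3. Point the live endpoint at the new config.
  run aws sagemaker update-endpoint --endpoint-name "$endpoint" \
    --endpoint-config-name "${model}-config"
}

# Placeholder values throughout; review the printed commands, then rerun with DRY_RUN=0.
cutover_isolated_model fraud-endpoint fraud-model-isolated \
  111122223333.dkr.ecr.eu-west-1.amazonaws.com/fraud-model:1 \
  s3://example-bucket/model.tar.gz \
  arn:aws:iam::111122223333:role/SageMakerExecution
```

Keeping the old model and endpoint config until the new variant has served real traffic gives a simple rollback: update the endpoint back to the previous config.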

Keep learning

Go deeper on how SageMaker network isolation, notebook networking and the rest of this capability work.

You can now treat ML hardening as one capability rather than a scatter of findings: inventory what each SageMaker resource can reach, separate the genuine exceptions, harden highest-impact first (rebuilding for the immutable settings and flipping the mutable ones), and ratchet the estate shut with preventive guardrails and secure-by-default templates. The Controls this lesson covers section below links every control in this group to its deep page and fix.

Hardening ML workloads: the cost and risk view

A near-zero-cost capability that removes the data-exfiltration risk on your most data-rich workloads

SageMaker resources are the workbenches and serving infrastructure your data science teams use. The relevant point for finance is not their hourly cost, which is small, but that they hold access to sensitive datasets and the keys to read more, often with an unmonitored path to send all of it outside the company. Almost every control in this group costs nothing in AWS spend to fix. Disabling root, turning on network isolation, encrypting inter-container traffic and requiring a private registry do not change the bill. The only real cost sits in the few legitimate exceptions and in the VPC plumbing (NAT gateways, VPC endpoints) that a locked-down notebook needs to still install packages.

Frame each failing control as a line on the risk register rather than a compliance checkbox. A notebook with direct internet access and a broad role holds far higher expected loss than a sandbox with no data, yet both can show up as red findings. Map the failures to the data each resource can reach and prioritise by exposure, not by control count. The downside of leaving them open is the documented cost of a data-exfiltration breach: regulatory fines, breach notification, forensics and reputational damage, which land in entirely different parts of the P&L from the cloud bill.

These controls also tend to accumulate, because some settings (direct internet access, network isolation) are fixed at creation and need a rebuild rather than a flag flip. The metric to watch is the count of non-hardened ML resources and how quickly it returns to zero after being flagged, with the durable answer being a guardrail so new resources are born compliant rather than cleaned up later.

This lesson is for the finance partner who sees a cluster of SageMaker findings on the security report and wants to know what the right response is and what it costs. It covers why most of these controls are free to fix, which carry a real rebuild or migration cost, and how to turn a list of red findings into a risk-ordered remediation plan keyed to the data each resource can reach.

How a finance partner frames the ML hardening decision

Anika is the finance partner for a healthcare analytics company heading into a HIPAA assessment. Security Hub has thrown a cluster of SageMaker findings across notebooks, models and processing jobs, and the platform team has booked time on the next risk review to walk her through them. Her first instinct is not to ask what the fix costs, because she already suspects the answer is close to nothing. It is to ask which of these resources actually touch regulated patient data and which are empty sandboxes, so the work is ordered by exposure rather than by the length of the finding list.

The team maps each failing resource to the data its execution role can read. Two notebooks (ml-feature-exploration and shared-training-box) have direct internet access, root on, and broad S3 read across the patient datasets, the worst combination on the estate. A handful of processing jobs run inter-container traffic in the clear, and one feature group offline store is unencrypted. Anika's output for the finance pack is short: the AWS spend to remediate is effectively zero, the only real costs are a little VPC plumbing (a NAT gateway plus VPC endpoints so the locked-down notebooks can still install packages) and the engineering time for two notebook rebuilds. She records the two internet-facing notebooks as the highest expected loss on the risk register, because each is an unmonitored path that could ship the entire training data lake out of the account.

Why ML hardening belongs on the risk register

The cost model here is asymmetric in a way that is rare for security work. Hardening ML workloads costs effectively nothing in AWS spend, and leaving them open creates an unbounded tail risk on the workloads most likely to hold sensitive data. Each failing control is a resource where a future accident or compromise could expose its contents, and while the probability per event is low, the probability across all resources over years is not.

The documented cost of a data-exfiltration breach includes mandatory notification within tight regulatory windows, forensics and legal fees that routinely run into six and seven figures, and customer churn. For regulated health or financial data the numbers are larger still. Against that, the remediation is engineering time plus a bounded set of notebook rebuilds and endpoint cutovers for the immutable settings. The finance role is to attach expected loss to each failing resource so the work is prioritised by the data it can reach, not by the order the report lists it.

What finance can do about the ML hardening gap

Finance cannot turn off root or rebuild a notebook, but it owns the framing that turns a scatter of SageMaker findings into a risk-ordered, governed programme with a near-zero budget. Three levers.

1. Separate the near-zero fixes from the real costs

Most of this capability costs nothing in AWS spend: disabling root, turning on network isolation, encrypting inter-container traffic and requiring a private registry do not move the bill. The only genuine costs are the VPC plumbing a locked-down notebook needs (NAT gateway or VPC endpoints so pip and conda still work) and the engineering time for the immutable rebuilds. Make that split explicit in the finance pack so the work is not delayed by a budget conversation that does not really exist.

2. Risk-weight each failing resource by the data it can reach

Control count is the wrong metric. A notebook with direct internet access, root, and broad S3 read across regulated data carries far higher expected loss than an idle sandbox with no data-access role, even though both show as red findings. Map every failing resource to its execution role and the data that role can read, then prioritise by exposure. The expected loss of a data-exfiltration breach (regulatory fines, breach notification, forensics, churn) lands in a different part of the P&L from the trivial cloud cost of the fix.

3. Track the non-hardened count and require recorded exceptions

The metric to watch is the count of non-hardened ML resources and how quickly it returns to zero after being flagged. Any resource left open on purpose (a genuine outbound dependency for model weights or a third-party API) should carry a recorded, finance-visible exception, not a silently suppressed finding. The durable answer is a preventive guardrail so new resources are born compliant rather than cleaned up later, which converts a recurring remediation cost into a one-time one.
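The second lever, ordering by exposure rather than by finding count, can be shown with a few lines of shell. Every number below is made up for illustration, the exposure scale (0 for no data access up to 3 for regulated data) is an assumption, and batch-scoring-job and scratch-sandbox are hypothetical resources added next to the two notebooks from the scenario.

```shell
# Columns: resource, open_findings, data_exposure (0 = no data access, 3 = regulated data).
# All values are illustrative, not taken from a real report.
findings='scratch-sandbox 5 0
ml-feature-exploration 3 3
shared-training-box 3 3
batch-scoring-job 1 2'

# rank_by_exposure: highest data exposure first; finding count is only a tie-breaker.
rank_by_exposure() {
  sort -k3,3nr -k2,2nr -k1,1
}

printf '%s\n' "$findings" | rank_by_exposure
```

The sandbox has the most findings but sorts last, because nothing it can reach is worth stealing. That is the ordering the risk register should reflect.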

You have finished the finance view of ML hardening. You know that almost every control in this group costs nothing in AWS spend, that the only real costs are a little VPC plumbing and a bounded set of notebook rebuilds, and that the right way to order the work is by the data each resource can reach, not by the finding count. Next time a cluster of SageMaker findings lands on the report, you will risk-weight it, separate the near-zero fixes from the real ones, and push for a guardrail so the count stays at zero.

Hardening ML workloads: the headline

Whether ML infrastructure is held to the same network and privilege standards as the rest of the estate

Machine-learning workloads run on infrastructure that, by default, can reach the open internet, run with high privileges and carry credentials to your data. That is a powerful combination on resources that exist specifically to work with sensitive datasets, and the report shows it as a scatter of separate findings across notebooks, models, processing jobs and endpoints.

The leadership question is not whether the estate is compliant today. It is whether ML workloads are held to the same network and privilege standards as everything else, and whether a new model or notebook can be created outside those guardrails tomorrow. The defensible end state is that ML resources run inside the VPC with isolation, least privilege and encryption on by default, with the handful of genuine exceptions documented.

None of this is a cost decision. Hardening ML workloads is essentially free in AWS terms. It is a governance decision about whether secure-by-default is enforced on the ML platform or depends only on each team remembering to do the right thing.

A short read for the leader who needs to know what weakly-configured ML infrastructure exposes, why hardening it is a governance decision rather than a budget one, and what a defensible secure-by-default end state looks like across the ML estate.

What it looks like when ML hardening is policy, not accident

After a red-team exercise lifted an execution role's credentials from a SageMaker notebook left with direct internet access and root on, then read the company's entire training data lake without exploiting AWS at all, the CTO asked a blunt question at the next review: are our ML workboxes held to the same network and privilege standard as everything else, or do they get a pass because the data scientists need to move fast? The honest answer at the time was no. Notebooks were being created by hand, outside the guardrails, and nobody could say how many carried a path straight to the internet.

The response was to treat ML hardening as a governance default rather than a per-resource engineering task. New models and jobs ship with network isolation, inter-container encryption and a private registry baked into the IaC modules, and an SCP denies creating a notebook with direct internet access or root enabled. The handful of resources with a genuine outbound dependency are documented exceptions, not silent gaps. Two quarters on, the same CTO asked the harder question: if a notebook were compromised tomorrow, what could it reach? The answer had moved from the whole data lake to nothing outside its isolated subnet. That shift, from hoping each team remembers to enforcing secure-by-default, is the whole point of this capability.

Why this is a board-level risk

ML workloads concentrate two ingredients of a serious breach: access to sensitive data and, by default, weak network and privilege boundaries. The pattern behind most cloud data exposure (a resource that could reach the internet or run with broad privilege, usually by accident rather than by design) applies with extra force here because these resources exist to work with data. Hardening makes that root cause structurally difficult.

The cost of getting this wrong is large and well documented, while the cost of fixing it is essentially engineering time plus a small, bounded set of rebuilds. This is the rare class of control where the risk of inaction is severe and the cost of action is small, which is exactly the trade leadership should be quickest to approve, and to make permanent with a preventive guardrail.

The leadership move on ML hardening

The executive handle is not to approve each SageMaker fix individually. It is to require that ML workloads meet the same network, privilege and encryption standard as the rest of the estate, by default, and that any exception is a deliberate, recorded decision.

1. Set a default: ML resources are secure-by-default on creation

Make it policy that anything tagged production runs inside the VPC with network isolation, least privilege and encryption on. The right place to enforce this is the provisioning path: an SCP or IAM condition that denies creating a notebook with direct internet access or root enabled, plus IaC modules that default isolation and inter-container encryption to true. That way the hardened state is the only state a new model or notebook can be born in, and the Security Hub count becomes an exception report rather than a backlog.

2. Demand proof before any rebuild destroys work

Several of the highest-value settings (direct internet access, subnet, model network isolation) are fixed at creation and need a rebuild, which deletes the notebook volume. Require confirmation that work is pushed to Git or S3 before anything is destroyed, and that the migration is planned rather than discovered mid-change. This is the difference between a clean cutover and a data-science team losing a week of unsaved work to a security fix.

3. Ask for the hardened-coverage rate and the exception list

At the review the one question worth asking is: what share of our production ML resources run inside the VPC with isolation, least privilege and encryption on, and what are the documented exceptions? This is the rare control where the risk of inaction is severe and the cost of action is essentially engineering time, so a coverage rate that is not trending hard toward 100 percent, with a short named exception list, is an accountability conversation rather than a budget one.
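The coverage rate in the third point is simple arithmetic: hardened production resources divided by all production resources. A minimal sketch follows, with a hypothetical helper name and made-up counts.

```shell
# coverage_percent HARDENED TOTAL
# Hypothetical helper: integer percentage, rounded down; "n/a" when there is nothing to measure.
coverage_percent() {
  if [ "$2" -eq 0 ]; then
    echo "n/a"
  else
    echo $(( $1 * 100 / $2 ))
  fi
}

# Illustrative counts: 42 of 60 production ML resources are fully hardened.
coverage_percent 42 60   # prints 70
```

Reported alongside the named exception list, the trend in this one number is what the review should track.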

Two takeaways. First, weakly-configured ML infrastructure combines access to sensitive data with weak network and privilege boundaries, which is the recipe behind most cloud data exposure. Second, this is the rare control where the risk of inaction is severe while the cost of action is essentially free. The right metric is not how many SageMaker findings are open. It is whether ML workloads are held to the same standard as the rest of the estate by default, with every exception on the record.

Controls this lesson covers

One capability, many AWS Security Hub controls. This lesson is the shared playbook; each control below keeps its own deep page with the exact check, severity and a copy-and-paste fix.

SageMaker
