
Harden SageMaker and ML workloads

One capability across SageMaker notebooks, models, processing jobs and endpoints: cut their network paths, drop their privileges and encrypt their traffic so a single compromised ML job cannot reach your data or your account.

15 min·10 sections·AWS


Hardening ML workloads: the basics

Why a SageMaker resource is also a credential-bearing compute box on your network

Amazon SageMaker is not one thing; it is a fleet of compute resources, each of which carries an IAM execution role and sits somewhere on your network. A notebook instance is a managed JupyterLab server. A model is an inference container. A processing job (data quality, model quality, model bias, explainability) is a transient container that reads your data. An endpoint serves predictions at scale. Each of these runs with permissions to read S3 buckets and call other services, and by default many of them can reach the public internet, run with broad OS privileges, or send traffic between containers in the clear.

AWS Security Hub turns each weak default into its own control, which is why a single ML estate can fail a dozen or more SageMaker checks at once. SageMaker.1, SageMaker.2 and SageMaker.3 harden notebook instances (no direct internet access, deploy in a VPC, no root access). SageMaker.5 enables network isolation on models. SageMaker.9 keeps notebook platforms upgraded. SageMaker.8 increases endpoint instance count for availability. SageMaker.10 through SageMaker.17 and SageMaker.19 isolate the network on monitoring and processing jobs, encrypt inter-container traffic on those jobs, encrypt the feature group offline store, and require a private container registry. They look like separate findings, but they are one capability: shrink what an ML job can reach and remove the credentials and privileges it does not need.

The reason these matter more than the average finding is that an ML resource almost always holds a copy of real data plus a role that can read more. The job is to find every SageMaker resource that can reach the internet, run as root, or talk in plaintext, decide which genuinely need an exception, route the rest through your VPC with isolation and encryption on, and then make the hardened configuration the default so new resources arrive compliant.

In this lesson you will learn how SageMaker expresses network reachability, privilege and encryption across notebooks, models, processing jobs and endpoints, how to find every weakly configured ML resource in an account, and how to harden them without breaking the science. The "Controls this lesson covers" section lists every Security Hub control in this capability, each linking to a deep page with the exact check and a copy-and-paste fix.

Fun fact

The notebook that owned the data lake

In a well-known cloud red-team exercise the entry point was not a leaked key or a public bucket. It was a SageMaker notebook left with direct internet access, root access and a generously scoped execution role. The tester opened a terminal, queried the instance metadata endpoint at 169.254.169.254 to lift the role's temporary credentials, and from there read every S3 bucket the role could touch: the company's entire training data lake. None of it required exploiting AWS itself. The weak defaults on one ML resource were the whole exploit. With direct internet access off, root off and the notebook inside a VPC with controlled egress, the same foothold has no unmonitored path out.

Finding weakly-configured ML resources across an estate

Priya is the ML platform security lead at a healthcare analytics company preparing for a HIPAA assessment. Security Hub shows SageMaker failures spread across notebooks, models and processing jobs in accounts that pre-date the team's current guardrails.

Rather than work the findings one by one, she starts by listing the notebooks that can reach the internet directly, since those are the highest-impact findings and the ones that need a rebuild rather than a flag flip.

Start with the resources that expose data directly. Notebooks with direct internet access and no VPC subnet are a common and high-impact finding.

# A list query ([...]) keeps the columns in the order written; with --output text,
# a {...} hash would be printed sorted by key name instead.
$ aws sagemaker list-notebook-instances \
    --query 'NotebookInstances[].NotebookInstanceName' --output text \
  | tr '\t' '\n' \
  | while read -r n; do
      aws sagemaker describe-notebook-instance --notebook-instance-name "$n" \
        --query '[NotebookInstanceName,DirectInternetAccess,SubnetId,RootAccess]' \
        --output text
    done
ml-feature-exploration Enabled None Enabled
shared-training-box Enabled None Enabled
fraud-model-prod Disabled subnet-0ab12cd34 Disabled
# Two notebooks route straight to the internet, run as root, and carry data-reading roles.

Enabled internet plus no Subnet plus Enabled root is the worst case: unmonitored egress, OS-level privilege and a role that can read data. Fix these first.
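The worst-case filter can be sketched as a small shell function. This is a minimal sketch, assuming rows arrive in the four-column order of the listing above (name, internet, subnet, root); the function name `worst_case_notebooks` is hypothetical and not part of the AWS CLI.

```shell
# Hypothetical helper: reads "name internet subnet root" rows on stdin and
# prints the names of notebooks that meet all three worst-case conditions.
worst_case_notebooks() {
  while read -r name internet subnet root; do
    if [ "$internet" = "Enabled" ] && [ "$subnet" = "None" ] && [ "$root" = "Enabled" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Example with the three rows from the listing above.
printf '%s\n' \
  'ml-feature-exploration Enabled None Enabled' \
  'shared-training-box Enabled None Enabled' \
  'fraud-model-prod Disabled subnet-0ab12cd34 Disabled' | worst_case_notebooks
# prints:
# ml-feature-exploration
# shared-training-box
```

Because the function only reads text, the output of the listing loop can be piped straight into it.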

How SageMaker hardening actually works: deep dive

Most SageMaker controls resolve to one of three mechanisms. The first is network reachability: a notebook's DirectInternetAccess and SubnetId (SageMaker.1, SageMaker.2), a model's EnableNetworkIsolation (SageMaker.5), and the network isolation and VPC configuration on monitoring and processing jobs (the SageMaker.10 to SageMaker.17 and SageMaker.19 family). The second is privilege: a notebook's RootAccess (SageMaker.3) and keeping the platform identifier upgraded (SageMaker.9). The third is data protection on the jobs that read your data: EnableInterContainerTrafficEncryption on data-quality, model-quality, model-bias and explainability jobs, encryption on the feature group OfflineStoreConfig, and requiring that images come from a private ECR registry rather than a public one.
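The first mechanism can be checked for models with a short loop. This is a sketch rather than a definitive audit: it assumes the AWS CLI is configured with credentials and a default region, the function name `models_without_isolation` is hypothetical, and multi-container models (which have no PrimaryContainer) will show `None` in the image column.

```shell
# Hypothetical helper: prints one "name isolation image" row for each model
# whose EnableNetworkIsolation is not True. The image column helps spot
# containers pulled from a public registry at the same time.
models_without_isolation() {
  for m in $(aws sagemaker list-models \
      --query 'Models[].ModelName' --output text); do
    aws sagemaker describe-model --model-name "$m" \
      --query '[ModelName,EnableNetworkIsolation,PrimaryContainer.Image]' \
      --output text
  done | awk '$2 != "True" { print }'
}

# Usage (calls AWS, so it is left commented out here):
# models_without_isolation
```

Filtering on "not True" rather than "False" is a deliberate choice in this sketch, so that a missing value is treated as a finding instead of being silently passed.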

A critical operational detail is immutability. Several of the highest-value settings (DirectInternetAccess and SubnetId on a notebook, EnableNetworkIsolation on a model, and the encryption and isolation flags on a created job) are fixed at creation and cannot be flipped in place. Remediating those is a rebuild: delete and recreate the notebook, or create a new model and cut the endpoint over. RootAccess is the exception: it can be changed on a stopped notebook with update-notebook-instance. Knowing which is which is the difference between a five-minute fix and a planned migration.
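The mutable-versus-immutable split can be written down as a lookup so that a remediation script picks the right method per setting. This is an illustrative sketch: the function name `remediation_for` is hypothetical, and it covers only the settings named in this lesson.

```shell
# Hypothetical lookup: maps a setting named in this lesson to its remediation
# method. Anything not listed is reported as unknown rather than guessed.
remediation_for() {
  case "$1" in
    RootAccess)
      echo "update-in-place (stop, update-notebook-instance, start)" ;;
    DirectInternetAccess|SubnetId)
      echo "rebuild (delete and recreate the notebook)" ;;
    EnableNetworkIsolation)
      echo "rebuild (create a new model and cut the endpoint over)" ;;
    EnableInterContainerTrafficEncryption)
      echo "rebuild (create a new job definition)" ;;
    *)
      echo "unknown (check the API reference before planning)" ;;
  esac
}

remediation_for RootAccess
remediation_for SubnetId
```

Sorting an inventory by this answer up front shows how many findings are quick flips and how many need a scheduled migration.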

Security Hub evaluates these through AWS Config, mostly change-triggered, so a fix flips the finding to PASSED on the next evaluation rather than instantly. The strongest position is preventive: an SCP or IAM condition that denies creating a notebook with DirectInternetAccess Enabled or RootAccess Enabled, and IaC templates that set EnableNetworkIsolation and inter-container encryption to true by default, so the hardened state is the only state new resources can be born in.
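A preventive guardrail of this kind might look like the following Service Control Policy, written to a local file for review. Treat it as a sketch under stated assumptions: the condition keys `sagemaker:DirectInternetAccess` and `sagemaker:RootAccess` are quoted from memory of the SageMaker service authorization reference and should be verified there, the Sids and file name are illustrative, and the behaviour when a request omits a parameter should be tested in a sandbox account before rollout.

```shell
# Writes a candidate SCP to a local file; nothing is sent to AWS.
# File name and Sids are illustrative assumptions.
scp_file="${TMPDIR:-/tmp}/deny-open-notebooks-scp.json"
cat > "$scp_file" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCreateWithDirectInternet",
      "Effect": "Deny",
      "Action": "sagemaker:CreateNotebookInstance",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": { "sagemaker:DirectInternetAccess": "Disabled" }
      }
    },
    {
      "Sid": "DenyCreateWithRoot",
      "Effect": "Deny",
      "Action": "sagemaker:CreateNotebookInstance",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": { "sagemaker:RootAccess": "Disabled" }
      }
    },
    {
      "Sid": "DenyEnablingRootOnUpdate",
      "Effect": "Deny",
      "Action": "sagemaker:UpdateNotebookInstance",
      "Resource": "*",
      "Condition": {
        "StringEquals": { "sagemaker:RootAccess": "Enabled" }
      }
    }
  ]
}
EOF
echo "wrote $scp_file"
```

The two create-time conditions sit in separate statements because keys inside one condition block are ANDed together, which would let a notebook through if only one of the two settings were weak. The create statements use StringNotEquals so that a request that leaves the setting out is also denied, while the update statement uses StringEquals so that unrelated updates are not blocked. Once reviewed, the file could be registered with `aws organizations create-policy --type SERVICE_CONTROL_POLICY` and attached to the relevant organizational unit.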

What is the impact of leaving ML workloads unhardened?

The direct impact is data exfiltration. An ML resource almost always carries an execution role with read access to S3 and other data stores, and scientists routinely pull copies of real datasets onto notebooks and into processing jobs. A resource with direct internet access, or a model container that can make outbound calls, has an unmonitored path to ship that data and those credentials straight out of your account, with none of your VPC flow logs or egress controls seeing it happen. Root access on a notebook amplifies this by letting a compromised session install tooling, persist and lift the role's credentials from the metadata service.

The second-order impact is blast radius and integrity. Inter-container traffic on a monitoring or processing job that runs in the clear can be read or tampered with, and pulling from a public container registry brings images you do not control into a job that touches sensitive data. Each weak default is attack surface that has to be defended continuously. Hardening shrinks that surface to the handful of controlled paths you actually operate.

On the compliance side, the major frameworks (SOC 2, ISO 27001, HIPAA, PCI DSS and FedRAMP) expect evidence of network segmentation, least privilege and encryption in transit on infrastructure that handles regulated data. A passing set of SageMaker controls across every account is a cheap, defensible artefact to hand an auditor, and it maps directly to the NIST 800-53 boundary-protection (SC-7) and least-privilege (AC-6) controls.

How do you harden ML workloads safely?

Work the capability as one loop rather than chasing individual findings. The order matters: find the resources that need a rebuild before you start, so you can plan the migrations rather than discover them mid-change.

1. Inventory every ML resource and what it can reach

Across accounts and regions, list notebook instances (DirectInternetAccess, SubnetId, RootAccess), models (EnableNetworkIsolation), monitoring and processing jobs (network isolation, inter-container encryption, VPC config), feature groups (offline store encryption) and the registries images come from. Cross-reference each resource's execution role against the data it can read. A notebook with broad S3 access is more urgent than a sandbox with none. Treat this inventory as the source of truth, not the finding count, because one resource can trigger several controls.

2. Separate genuine exceptions from drift, and confirm before destroying

Most weak configuration is unintended. For notebooks and models that need a rebuild, confirm with the owner that work is pushed to Git or S3 before you delete anything, since the notebook's storage volume is deleted with the instance. Establish whether any resource genuinely needs an outbound path (model weights, a third-party API); the right answer is to bake the dependency into the image or stage it in S3, not to leave the resource open.

3. Harden highest impact first, with the right method per setting

For mutable settings (RootAccess), stop the notebook, update the flag, start it again. For immutable settings (DirectInternetAccess, SubnetId, EnableNetworkIsolation), rebuild: recreate the notebook in a private subnet with internet access disabled, or create a new isolated model and cut the endpoint over blue-green. Turn on inter-container traffic encryption and network isolation on processing and monitoring jobs at creation, encrypt the feature group offline store, and switch jobs to a private ECR registry. Make sure locked-down notebooks still have a controlled route to packages, either a NAT gateway or VPC endpoints in front of a private package mirror, so pip and conda still work.

4. Ratchet it shut with preventive guardrails

Cleanup without prevention just resets the clock. Use a Service Control Policy or IAM condition to deny creating notebooks with DirectInternetAccess Enabled or RootAccess Enabled, and bake EnableNetworkIsolation, inter-container encryption and a private registry into your CloudFormation, CDK or Terraform model and job modules so new resources arrive compliant. Keep the AWS Config rules running so Security Hub re-flags any drift.
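The inventory in step 1 can be sketched for notebooks as a loop over regions. This is a minimal sketch, assuming credentials that are allowed to call `ec2:DescribeRegions` and the SageMaker list and describe actions; the function name `inventory_notebooks` is hypothetical. Covering several accounts would mean repeating it per account, for example with a different CLI profile each time.

```shell
# Hypothetical helper: prints "region name internet subnet root" for every
# notebook instance in every region enabled for the current account.
inventory_notebooks() {
  for region in $(aws ec2 describe-regions \
      --query 'Regions[].RegionName' --output text); do
    for n in $(aws sagemaker list-notebook-instances --region "$region" \
        --query 'NotebookInstances[].NotebookInstanceName' --output text); do
      row=$(aws sagemaker describe-notebook-instance --region "$region" \
        --notebook-instance-name "$n" \
        --query '[NotebookInstanceName,DirectInternetAccess,SubnetId,RootAccess]' \
        --output text)
      printf '%s\t%s\n' "$region" "$row"
    done
  done
}

# Usage (calls AWS, so it is left commented out here):
# inventory_notebooks > notebook-inventory.tsv
```

Writing the result to a file gives a snapshot to rank against each execution role's data access, which is the cross-reference step 1 asks for.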

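# Disable root across every notebook that has it on (mutable only on a stopped instance).
for n in $(aws sagemaker list-notebook-instances \
    --query 'NotebookInstances[].NotebookInstanceName' --output text); do
  root=$(aws sagemaker describe-notebook-instance --notebook-instance-name "$n" \
    --query 'RootAccess' --output text)
  status=$(aws sagemaker describe-notebook-instance --notebook-instance-name "$n" \
    --query 'NotebookInstanceStatus' --output text)
  [ "$root" = "Enabled" ] || continue
  # Stop only running notebooks; stopping an already-stopped one returns an error.
  if [ "$status" = "InService" ]; then
    aws sagemaker stop-notebook-instance --notebook-instance-name "$n"
  fi
  aws sagemaker wait notebook-instance-stopped --notebook-instance-name "$n"
  aws sagemaker update-notebook-instance --notebook-instance-name "$n" --root-access Disabled
  # The instance may report Updating until the change lands; wait for Stopped before starting.
  aws sagemaker wait notebook-instance-stopped --notebook-instance-name "$n"
  # Restart only the notebooks that were running before the change.
  if [ "$status" = "InService" ]; then
    aws sagemaker start-notebook-instance --notebook-instance-name "$n"
  fi
  echo "$n: root access disabled"
done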

# Immutable settings need a rebuild. Once the owner has confirmed their work is saved,
# delete the old instance, then recreate it locked down: private subnet, no direct
# internet. (DirectInternetAccess and SubnetId cannot be changed in place.)
aws sagemaker create-notebook-instance \
  --notebook-instance-name ml-feature-exploration \
  --instance-type ml.t3.medium \
  --role-arn arn:aws:iam::111122223333:role/SageMakerExecution \
  --subnet-id subnet-0ab12cd34ef56 \
  --security-group-ids sg-0aa11bb22cc33 \
  --direct-internet-access Disabled \
  --root-access Disabled
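The model half of the rebuild, creating an isolated model and cutting the endpoint over, could be sketched as follows. This is an assumption-laden sketch, not a definitive procedure: the function name `cutover_to_isolated_model`, the variant name, the instance type and the instance count are all illustrative, and it assumes a single-container model whose image and artefacts you already know.

```shell
# Hypothetical helper: creates an isolated copy of a model, points a new
# endpoint config at it, and updates the endpoint to use that config.
# Args: endpoint name, new model name, image URI, model data S3 URL, role ARN.
# The body runs in a subshell so its variables do not leak into the caller.
cutover_to_isolated_model() (
  endpoint=$1 model=$2 image=$3 data=$4 role=$5
  aws sagemaker create-model --model-name "$model" \
    --primary-container "Image=$image,ModelDataUrl=$data" \
    --execution-role-arn "$role" \
    --enable-network-isolation &&
  aws sagemaker create-endpoint-config --endpoint-config-name "${model}-config" \
    --production-variants \
      "VariantName=AllTraffic,ModelName=$model,InitialInstanceCount=2,InstanceType=ml.m5.large" &&
  aws sagemaker update-endpoint --endpoint-name "$endpoint" \
    --endpoint-config-name "${model}-config" &&
  aws sagemaker wait endpoint-in-service --endpoint-name "$endpoint"
)

# Usage (calls AWS, so it is left commented out; every value is a placeholder):
# cutover_to_isolated_model fraud-endpoint fraud-model-isolated \
#   111122223333.dkr.ecr.us-east-1.amazonaws.com/fraud:1 \
#   s3://example-bucket/fraud/model.tar.gz \
#   arn:aws:iam::111122223333:role/SageMakerExecution
```

Each step is chained with `&&` so a failed model or config creation stops the sequence before the endpoint is touched. The old model and endpoint config are left in place, which keeps a rollback target until the new variant has been validated.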


You can now treat ML hardening as one capability rather than a scatter of findings: inventory what each SageMaker resource can reach, separate the genuine exceptions, harden highest-impact first (rebuilding for the immutable settings and flipping the mutable ones), and ratchet the estate shut with preventive guardrails and secure-by-default templates. The "Controls this lesson covers" section below links every control in this group to its deep page and fix.


Controls this lesson covers

One capability, many AWS Security Hub controls. This lesson is the shared playbook; each control below keeps its own deep page with the exact check, severity and a copy-and-paste fix.