Compliance · Medium severity

AWS Security Hub · SageMaker

SageMaker.4: Endpoint production variants should have an initial instance count greater than 1

Written and reviewed by Emnode · Last reviewed

What does AWS Security Hub SageMaker.4 check?

SageMaker.4 fails when a production variant in an endpoint configuration has `InitialInstanceCount` set to 1, leaving the endpoint with no Availability Zone redundancy.
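To see the field the check reads for a single configuration, the AWS CLI can list each production variant next to its `InitialInstanceCount`. This is a minimal sketch: `variant_initial_counts` is a hypothetical helper name, and the configuration name passed to it is a placeholder.

```shell
# Hypothetical helper: print "<VariantName> <InitialInstanceCount>" for every
# production variant in one endpoint configuration.
# $1 = endpoint configuration name (placeholder supplied by the caller).
variant_initial_counts() {
  aws sagemaker describe-endpoint-config \
    --endpoint-config-name "$1" \
    --query 'ProductionVariants[].[VariantName,InitialInstanceCount]' \
    --output text
}
```

Calling `variant_initial_counts my-endpoint-config` prints one line per variant; any line showing a count of 1 is what this control is described as flagging.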

Why does SageMaker.4 matter?

A single-instance variant has nowhere to fail over to. When AWS reclaims the instance during an AZ maintenance event, SageMaker must reprovision — pulling the image and loading model weights — and the endpoint returns 5xx errors for minutes while it does. For a model on a checkout or recommendation path, that downtime hits revenue directly.

How do I fix SageMaker.4?

  1. Inventory endpoint configs and flag any production variant with an instance count of 1.
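  2. Create a new endpoint configuration with the count raised to 2 or more. Endpoint configurations cannot be edited in place, and with multiple instances SageMaker attempts to distribute them across AZs for you.
  3. Update the endpoint to the new config. SageMaker keeps the existing instances serving until the new ones are in service, so the switch should not cause downtime.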
  4. For endpoints where a second instance genuinely is not worth the cost, document a tracked exception.

Remediation script · bash

# Flag every endpoint config that has a production variant with InitialInstanceCount of 1.
for c in $(aws sagemaker list-endpoint-configs \
    --query 'EndpointConfigs[].EndpointConfigName' --output text); do
  single=$(aws sagemaker describe-endpoint-config --endpoint-config-name "$c" \
    --query 'ProductionVariants[?InitialInstanceCount==`1`].VariantName' --output text)
  if [ -n "$single" ] && [ "$single" != "None" ]; then
    echo "$c: single-instance variants: $single"
  fi
done

# Endpoint configs are immutable, so the fix is a new config plus an endpoint update.
# All names, the model, and the instance type below are placeholders; copy the real
# values from describe-endpoint-config for the config being replaced.
aws sagemaker create-endpoint-config \
  --endpoint-config-name checkout-ranker-config-v2 \
  --production-variants \
    VariantName=AllTraffic,ModelName=checkout-ranker,InstanceType=ml.m5.large,InitialInstanceCount=2,InitialVariantWeight=1.0

# Point the endpoint at the new config and wait until it is back in service.
aws sagemaker update-endpoint \
  --endpoint-name checkout-ranker \
  --endpoint-config-name checkout-ranker-config-v2
aws sagemaker wait endpoint-in-service --endpoint-name checkout-ranker
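
Once an endpoint has been moved to a multi-instance configuration, it may be worth confirming what is actually running. The sketch below reads `CurrentInstanceCount` for each variant of a live endpoint and reports any that are still below 2. The helper name `under_provisioned_variants` is hypothetical, and the endpoint name is whatever the caller passes in.

```shell
# Hypothetical helper: list variants of a live endpoint running fewer than 2 instances.
# $1 = endpoint name. Exits non-zero if any variant is below 2.
under_provisioned_variants() {
  aws sagemaker describe-endpoint \
    --endpoint-name "$1" \
    --query 'ProductionVariants[].[VariantName,CurrentInstanceCount]' \
    --output text |
  awk '$2 < 2 { print $1 " has " $2 " instance(s)"; bad = 1 } END { exit bad }'
}
```

A call such as `under_provisioned_variants my-endpoint && echo "all variants redundant"` prints the confirmation only when every variant reports at least two instances.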

Full walkthrough (console steps, edge cases and verification) in the lesson Harden SageMaker and ML workloads.

Is SageMaker.4 a false positive?

This is one of the rare findings where remediation costs money. Low-value or batch-style endpoints may legitimately run single-instance; the right move there is a documented exception, not blind remediation.
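
One possible way to record such an exception, assuming the team tracks exceptions inside Security Hub rather than in a separate ticketing system, is to suppress the individual finding with a note explaining why. In this sketch `suppress_with_note` is a hypothetical helper name; the finding Id and product ARN must be copied from the actual finding.

```shell
# Hypothetical helper: mark one Security Hub finding as SUPPRESSED with a reason.
# $1 = finding Id, $2 = product ARN (both copied from the finding),
# $3 = reason text (keep it free of commas, since CLI shorthand splits on them),
# $4 = who approved the exception.
suppress_with_note() {
  aws securityhub batch-update-findings \
    --finding-identifiers "Id=$1,ProductArn=$2" \
    --workflow "Status=SUPPRESSED" \
    --note "Text=$3,UpdatedBy=$4"
}
```

Suppression applies only to the finding identified, so every single-instance configuration still needs its own recorded decision.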

Part of the learning path Lock down access