Compliance

Enable event replication on EventBridge global endpoints

Security Hub EventBridge.4 — a global endpoint without event replication can't recover from a Regional failover on its own, leaving you to fix it by hand at the worst possible moment.

11 min·10 sections·AWS

Last reviewed 27 May 2026

Remediates AWS Security Hub: EventBridge.4

EventBridge global endpoints: the basics

Why a failover-ready endpoint still needs replication switched on

An EventBridge global endpoint is the front door that makes an event-driven application Region-fault-tolerant. You point your publishers at one endpoint, attach a Route 53 health check, and configure a primary and a secondary Region. If the health check flips to unhealthy, EventBridge automatically reroutes all custom events to the event bus in the secondary Region within minutes — no code change, no manual cutover. It's the resilience pattern AWS recommends for any workload that can't tolerate a single Region going dark.

EventBridge.4 checks one specific knob on that endpoint: whether event replication is enabled. With replication on, EventBridge uses managed rules to send every custom event to the event buses in both the primary and the secondary Region, all the time. The control fails — flagging the AWS::Events::Endpoint resource — whenever a global endpoint exists with replication turned off.

It's flagged Medium because the gap is invisible until you actually need it. An endpoint with replication off still routes events normally and still fails over when the health check trips. The catch surfaces afterward: without replication, EventBridge can't automatically recover, so getting traffic back to the primary Region requires someone to manually reset the Route 53 health check to healthy. That's exactly the kind of manual step you don't want sitting on the critical path during an incident.

In this lesson you'll learn what an EventBridge global endpoint does, how Route 53 health checks drive automatic failover to a secondary Region, and why event replication is the difference between a system that recovers on its own and one that needs a manual health-check reset. You'll see the AWS CLI commands to find global endpoints with replication off and to inspect their configuration, the requirement that custom event buses share a name across Regions, the small cost replication adds, and how to enforce the setting so new endpoints arrive compliant.

Fun fact

Failover worked. Failback didn't.

A payments team built a textbook multi-Region setup: global endpoint, Route 53 health check, primary in us-east-1, secondary in us-west-2. During a real Regional event the failover fired exactly as designed — events rerouted to the secondary bus within minutes, and customers never noticed. The surprise came after the Region recovered. Because event replication had never been enabled, EventBridge couldn't auto-reset, so traffic stayed pinned to the secondary Region until an on-call engineer manually flipped the Route 53 health check back to healthy hours later. The outage they'd engineered around lasted minutes; the extra manual hours of running on the backup Region were entirely self-inflicted by one disabled setting.

Turning on event replication in action

Devon owns the events platform at a logistics company. Security Hub flags EventBridge.4 on the orders-global endpoint — a global endpoint fronting the order-processing event bus, with replication disabled. The endpoint has been live for eight months and has never been exercised by a real failover.

He confirms the topology first: a Route 53 health check is attached, the primary is us-east-1 and the secondary is us-west-2, and crucially both Regions have a custom event bus named orders-bus in the same account — the precondition for failover to work at all. Without matching bus names across Regions, enabling replication wouldn't fix anything.

Devon enables event replication on the endpoint via the EventBridge console — Edit endpoint, then select Event replication enabled — and re-runs the check. EventBridge now publishes every custom event to both Regions' buses through managed rules. He notes the small bump in EventBridge cost in the next month's bill as expected, and documents that orders-global will now recover automatically after a failover instead of waiting on a manual health-check reset.

First, list every EventBridge global endpoint and whether event replication is enabled, so you know which ones the control will flag.

$ aws events list-endpoints --query 'Endpoints[].{Name:Name,State:State,Replication:ReplicationConfig.State}' --output table

-------------------------------------------------

| ListEndpoints |

+-----------------+-----------+-----------------+

| Name | State | Replication |

+-----------------+-----------+-----------------+

| orders-global | ACTIVE | DISABLED |

| billing-global | ACTIVE | ENABLED |

+-----------------+-----------+-----------------+

# orders-global has replication DISABLED — it can't auto-recover after failover.

An inventory of global endpoints with their replication state — DISABLED is exactly what EventBridge.4 fails on.

Inspect the flagged endpoint to confirm its Route 53 health check and the primary/secondary event buses before enabling replication.

$ aws events describe-endpoint --name orders-global --query '{Replication:ReplicationConfig.State,HealthCheck:RoutingConfig.FailoverConfig.Primary.HealthCheck,Primary:EventBuses[0].EventBusArn,Secondary:EventBuses[1].EventBusArn}'

{

"Replication": "DISABLED",

"HealthCheck": "arn:aws:route53:::healthcheck/abc123",

"Primary": "arn:aws:events:us-east-1:111122223333:event-bus/orders-bus",

"Secondary": "arn:aws:events:us-west-2:111122223333:event-bus/orders-bus"

}

# Same bus name 'orders-bus' in both Regions — failover precondition is met.

Confirm the health check exists and both Regions use a same-named custom bus before flipping replication on.

Global endpoints and replication under the hooddeep dive

A global endpoint is an AWS::Events::Endpoint resource that wraps a pair of custom event buses — one in a primary Region, one in a secondary Region — behind a single endpoint ID. Publishers call PutEvents against the endpoint instead of a specific bus. The endpoint's RoutingConfig.FailoverConfig references a Route 53 health check; while that check reports healthy, events flow to the primary bus, and when it flips to unhealthy, EventBridge reroutes all custom events to the secondary bus within minutes. For this to work, both Regions must have a custom event bus with the same name in the same account.

Event replication is controlled by ReplicationConfig.State, which is either ENABLED or DISABLED. When enabled, EventBridge provisions managed rules that copy every custom event to the buses in both Regions continuously — not just during failover. That ongoing dual-write does two things: it continuously proves the secondary path is correctly wired (you'd notice immediately if events stopped arriving there), and it's the mechanism EventBridge relies on to automatically reset the Route 53 health check to healthy and fail traffic back to the primary once the Region recovers.

Without replication, the failover-out path still works — the health check trips and events route to the secondary — but the failback-in path is manual. There's no continuous secondary stream for EventBridge to confirm health against, so an operator has to manually mark the Route 53 health check healthy to return traffic to the primary Region. Replication trades a small per-event cost (events are billed in both Regions) for removing that manual step from the recovery runbook.

# Find every global endpoint with replication disabled across the account.
aws events list-endpoints \
  --query 'Endpoints[?ReplicationConfig.State==`DISABLED`].[Name,Arn]' \
  --output table

# Inspect one endpoint's full routing and replication configuration.
aws events describe-endpoint --name orders-global \
  --query '{Replication:ReplicationConfig.State,Routing:RoutingConfig,Buses:EventBuses}'

# Confirm both Regions actually have the same-named custom bus (failover precondition).
aws events describe-event-bus --name orders-bus --region us-east-1 --query 'Arn'
aws events describe-event-bus --name orders-bus --region us-west-2 --query 'Arn'

What is the impact of disabled event replication?

The headline impact is a broken recovery path. With replication off, your global endpoint will fail over to the secondary Region correctly, but it cannot automatically fail back. Returning traffic to the primary Region requires an operator to manually reset the Route 53 health check to healthy. That means extra unplanned time running on the secondary Region — time that depends on someone noticing, having the right access, and executing the reset correctly under incident pressure.

The second impact is loss of continuous validation. Replication's ongoing dual-write to both Regions is a built-in canary: if the secondary path is misconfigured, you find out during normal operation, not during an outage. Without it, your failover configuration is untested in practice, and the first time it's exercised for real is the worst possible time to discover that the secondary bus name doesn't match or a permission is missing.

There is a real cost to enabling replication, and it's worth naming honestly: because events are sent to event buses in both Regions, you pay EventBridge ingestion on both, increasing the monthly bill. AWS calls this out explicitly. For a high-volume endpoint that's a measurable line item, not a rounding error — which is exactly why the trade-off deserves a deliberate decision rather than a default.

Finally, there's a compliance and audit dimension. EventBridge.4 maps to NIST 800-53 high-availability and resilience controls (CP-10, SC-5(2), SI-13(5)). A failing control on a system you've designated business-critical is the kind of finding that surfaces in audits and disaster-recovery reviews — and 'we have failover but it doesn't auto-recover' is a hard position to defend when the whole point of the architecture was resilience.

How do you remediate EventBridge.4 safely?

Remediation is a four-step loop: confirm the endpoint is genuinely critical, verify the failover preconditions, enable replication, then enforce the setting so new endpoints arrive compliant.

1. Confirm the endpoint matters before paying for replication

Replication adds recurring cost, so start by classifying the endpoint. Is this global endpoint fronting a business-critical, low-RTO workload, or a low-stakes internal one? For critical systems, automatic recovery is worth the per-event charge in both Regions; for genuinely non-critical ones you may consciously accept the finding with a documented exception. Don't blanket-enable across every endpoint without this triage.

2. Verify the failover preconditions are actually met

Before enabling replication, confirm the topology is sound: a Route 53 health check is attached to the endpoint, and both the primary and secondary Regions have a custom event bus with the same name in the same account. AWS calls this out explicitly — failover only works with matching same-named buses across Regions. Enabling replication on a misconfigured endpoint won't deliver the recovery you expect.

3. Enable event replication on the endpoint

Set the endpoint's replication to enabled. In the EventBridge console, edit the global endpoint and under Event replication choose 'Event replication enabled', as the AWS remediation guidance describes. EventBridge then provisions managed rules that copy every custom event to both Regions' buses, restoring the automatic failback path and giving you continuous validation that the secondary side is wired correctly.

4. Enforce replication on new endpoints with Config and IaC

Stop the finding from recurring by codifying it. Bake replication-enabled into your CloudFormation, CDK, or Terraform module for global endpoints so every new one ships compliant, and keep the AWS Config rule global-endpoint-event-replication-enabled in scope so any drift or hand-built endpoint is caught automatically. The control re-evaluates on change, so a non-compliant endpoint surfaces within minutes of creation.

# Enable event replication on an existing global endpoint.
# (Re-supply the existing routing config so it isn't cleared on update.)
aws events update-endpoint \
  --name orders-global \
  --replication-config State=ENABLED \
  --routing-config '{"FailoverConfig":{"Primary":{"HealthCheck":"arn:aws:route53:::healthcheck/abc123"},"Secondary":{"Route":"us-west-2"}}}' \
  --event-buses '[{"EventBusArn":"arn:aws:events:us-east-1:111122223333:event-bus/orders-bus"},{"EventBusArn":"arn:aws:events:us-west-2:111122223333:event-bus/orders-bus"}]'

# Verify replication is now enabled.
aws events describe-endpoint --name orders-global \
  --query 'ReplicationConfig.State'

Quick quiz

Question 1 of 5

A business-critical global endpoint fails over to its secondary Region correctly during a real outage, but EventBridge.4 is failing on it. After the primary Region recovers, what behavior should you expect, and what's the right fix?

Keep learning

Dig deeper into EventBridge global endpoints, failover mechanics, and the controls around them.

You've completed Enable event replication on EventBridge global endpoints. You now know what a global endpoint does, why Route 53 health checks drive automatic failover, and why event replication is the difference between a system that recovers on its own and one that needs a manual health-check reset. You've also seen the failover precondition — same-named custom buses in both Regions — and the deliberate cost trade-off replication carries. The next time EventBridge.4 flags an endpoint, you'll know how to triage, verify, enable, and enforce in one pass.

Back to the library

EventBridge global endpoints: what they protect

A resilience setting, not a line item — but it's about avoided cost

Some applications are built so that if one AWS Region has an outage, traffic automatically shifts to a backup Region and customers barely notice. EventBridge global endpoints are one of the pieces that makes that automatic shift work for event-driven systems — the kind of plumbing that moves messages between parts of an application. This control checks one safety setting on that plumbing called event replication.

When replication is enabled, the system keeps a live copy of every message flowing through both the main Region and the backup Region at all times. That redundancy is what lets the application heal itself after an outage without an engineer logging in to flip switches by hand. When replication is off, the failover still happens, but coming back afterward is a manual job — slower, more error-prone, and dependent on the right person being awake.

From a finance standpoint this isn't a cost-saving finding; it's a resilience and avoided-cost finding. The dollars at stake aren't on the AWS bill — enabling replication actually adds a small amount to it — they're in the downtime you avoid. The question to hold in mind is: what does an hour of this application being unable to recover cost the business in lost revenue, SLA penalties, or customer trust? If that number is large, this is cheap insurance.

This lesson is for the finance partner who needs to understand why a resilience setting shows up on a compliance dashboard and how to weigh it. It explains what global endpoints protect, why replication adds a small recurring cost rather than saving one, how to frame the trade-off as avoided downtime versus a known monthly charge, and the governance question to ask engineering: which of our multi-Region systems are business-critical, and do they all recover automatically? No commands, no AWS internals — just the framing for the conversation.

Fun fact

Failover worked. Failback didn't.

How a finance partner frames the trade-off

Priya is the finance partner working with the platform team at a logistics company. A resilience finding lands on the compliance dashboard: the order-processing system's failover setup is missing a setting called event replication. Engineering explains that turning it on adds a small recurring charge to the cloud bill because events now get copied to a backup Region continuously.

Priya's instinct is to ask whether the extra charge is justified, and she frames it the way she'd frame any insurance decision: what does an hour of the order system being stuck on manual recovery cost the business? Engineering's answer — lost orders, SLA credits to two large customers, and the on-call time to fix it by hand — dwarfs the monthly replication charge by orders of magnitude. The decision takes about five minutes.

She doesn't stop at this one system. The useful move is to ask the broader question: which of our multi-Region systems are business-critical, and do all of them recover automatically, or just this one? That turns a single dashboard finding into a one-time portfolio review — and a standing expectation that anything important enough to be multi-Region is also funded to recover on its own.

Why this matters to risk, not the cost line

This is not a cost-savings finding — it's the opposite, and that's the most important thing to understand about it. Enabling event replication adds a recurring charge to the cloud bill because every event is billed in two Regions instead of one. For a high-volume system that can be a meaningful number, so the right framing is a deliberate insurance decision rather than a default-on or default-off rule.

The value being purchased is avoided downtime and avoided manual toil during an incident. The relevant comparison is the recurring replication charge against the cost of the system being stuck on manual recovery: lost revenue per hour, SLA penalties or service credits owed to customers, and the engineering time consumed firefighting instead of building. For a genuinely business-critical system those avoided costs typically exceed the replication charge by a wide margin.

The decision should be scoped, not blanket. Not every multi-Region endpoint is business-critical, and replication on a low-stakes internal system may not be worth its added cost. The finance contribution is to force the categorisation: which endpoints are critical enough to fund automatic recovery, and which can accept a manual reset? That single classification turns a recurring stream of identical findings into a one-time policy decision.

Treat a cluster of these findings on critical systems as a resilience-investment signal. If business-critical multi-Region workloads keep showing up with replication off, it usually means resilience is being designed but not fully funded — the architecture stops one setting short of self-healing. That gap is cheaper to close on purpose than to discover during an outage.

What finance can actually do about this

Finance can't enable replication, but it can frame the decision so engineering applies it where it pays off. Three levers, used at the resilience and budget reviews.

1. Insist on a criticality classification first

Before approving recurring replication cost, ask engineering to classify each global endpoint as critical or non-critical. Replication is worth funding where downtime is expensive and acceptable to skip where it isn't. That single classification turns a recurring stream of identical findings into a one-time, defensible policy decision.

2. Frame it as insurance, with the avoided cost named

For each critical endpoint, get a rough number for what an hour of stuck-on-manual-recovery costs — lost revenue, SLA credits, firefighting time. Compare that to the recurring replication charge. When the avoided cost dwarfs the charge, the decision is easy and the spend is justified on the record, not waved through.

3. Fund resilience to completion, not halfway

If the business has paid to make a system multi-Region, the small additional cost of automatic recovery is part of the same commitment. Treat a critical system that can fail over but not fail back as an under-funded design, and budget the replication cost as a line item in the system's resilience envelope rather than an unexpected overage.

4. Track the finding count on critical systems as a signal

Watch how many business-critical endpoints carry this finding over time. A persistent count means resilience is being designed but not fully funded. A clean state on critical systems is the metric — not driving every endpoint everywhere to enabled regardless of stakes.

Quick quiz

Question 1 of 5

Engineering says enabling event replication on a critical global endpoint will add a recurring charge because events are billed in two Regions. As the finance partner, what's the right way to decide?

Keep learning

Dig deeper into EventBridge global endpoints, failover mechanics, and the controls around them.

You've finished the finance partner's view of EventBridge.4. You know this is an avoided-cost decision rather than a saving, why replication adds a recurring charge, and how to frame it as insurance against expensive downtime on the systems that warrant it. The key levers are a criticality classification, a named avoided-cost comparison, funding resilience to completion, and tracking the finding count on critical systems as a signal. Next time it lands on the dashboard, you'll have a sharper question than 'why is the bill going up?'

Back to the library

EventBridge global endpoints: the headline

Whether a critical system can heal itself after a Regional outage

This control answers a simple resilience question: if an AWS Region fails, can this application recover on its own, or does it need a human to step in? EventBridge global endpoints provide the automatic Region-to-Region failover; a setting called event replication is what lets the system recover automatically afterward instead of waiting on a manual reset.

The finding is Medium severity, and it's fundamentally about operational readiness, not cost. The small recurring AWS charge for replication is the price of an automatic recovery path for a system the business has already decided is worth making multi-Region. The leadership read is whether the resilience you've paid to design actually works end to end, or stops short of self-healing at the moment you'd most need it to.

A short read for the executive who wants to know what this Medium finding means for resilience and what one question to ask. You'll get the plain-language version of automatic failover, why the recovery half of that story matters as much as the failover half, and what 'good' looks like: every business-critical multi-Region system recovers without manual intervention. No commands, no implementation detail.

Fun fact

Failover worked. Failback didn't.

What it looks like when the org gets this right

At one company the quarterly resilience review used to focus on whether failover worked — could the system shift to a backup Region if one failed? The answer was reassuringly yes. What the exec sponsor eventually asked was sharper: 'And then what? Does it come back on its own, or does someone have to be awake to fix it?'

That question exposed a gap. Several critical systems could fail over automatically but needed manual intervention to recover — the exact pattern EventBridge.4 flags. The org set a simple norm: any system important enough to run in two Regions is important enough to recover without a human in the loop, and the small recurring cost of that automation is just part of the price of being multi-Region.

The right outcome state isn't 'we have failover.' It's 'our critical systems heal themselves end to end, and we've consciously funded the parts that make that true.' The compliance finding becomes a confidence signal that the resilience the business paid for actually closes the loop.

Why this is on the report at all

This finding tracks whether the resilience the business has already invested in actually closes the loop. The organisation decided certain systems were important enough to run in two Regions; this control checks whether those systems can also recover on their own after a Regional event, or whether recovery still depends on a person executing a manual step under pressure. The gap is invisible until an outage, which is exactly why it's worth tracking proactively.

There's a deliberate cost trade-off embedded here, and leadership should see it as such. Enabling automatic recovery adds a small recurring charge. The right posture isn't 'always on' or 'always off' but a conscious classification of which systems are critical enough to fund self-healing. A clean state on this control for your critical systems is a signal that resilience is both designed and funded end to end — and that the disaster-recovery story you'd tell an auditor or a board actually holds together.

The leadership move on this category

The executive handle isn't to mandate replication everywhere — it's to set the norm that critical systems recover on their own and to fund that consciously.

1. Set the norm: critical means self-healing

Establish that any system important enough to run in two Regions is important enough to recover without a human in the loop. That single principle resolves most instances of this finding without case-by-case debate and aligns architecture with the resilience the business actually wants.

2. Accept and fund the trade-off explicitly

Automatic recovery has a recurring cost. Decide deliberately which systems warrant it rather than letting the default decide by omission. Funding self-healing on critical systems is a small, knowable cost against an unknowable outage cost — make that a stated leadership decision, not an accident.

3. Ask one question at the resilience review

'Do our critical multi-Region systems recover on their own, or do they need someone to step in?' is a one-minute item that tells you whether the resilience you paid for closes the loop. If the answer is 'on their own' across the critical list, the disaster-recovery story holds and leadership attention can move elsewhere.

Quick quiz

Question 1 of 5

At the resilience review you learn the critical event-processing systems can fail over to a backup Region but need a manual step to recover afterward. What's the right leadership response?

Keep learning

Dig deeper into EventBridge global endpoints, failover mechanics, and the controls around them.

That's the lesson. Two takeaways worth holding onto: failover working is only half the resilience story — recovering on its own is the other half — and automatic recovery carries a small, deliberate cost worth funding on the systems that matter. The leadership question is whether your critical multi-Region systems heal themselves end to end, not whether any single setting is enabled.

Back to the library

Enable event replication on EventBridge global endpoints

EventBridge global endpoints: the basics

Failover worked. Failback didn't.

Turning on event replication in action

Global endpoints and replication under the hooddeep dive

What is the impact of disabled event replication?

How do you remediate EventBridge.4 safely?

1. Confirm the endpoint matters before paying for replication

2. Verify the failover preconditions are actually met

3. Enable event replication on the endpoint

4. Enforce replication on new endpoints with Config and IaC

Quick quiz

Keep learning

EventBridge global endpoints: what they protect

Failover worked. Failback didn't.

How a finance partner frames the trade-off

Why this matters to risk, not the cost line

What finance can actually do about this

1. Insist on a criticality classification first

2. Frame it as insurance, with the avoided cost named

3. Fund resilience to completion, not halfway

4. Track the finding count on critical systems as a signal

Quick quiz

Keep learning

EventBridge global endpoints: the headline

Failover worked. Failback didn't.

What it looks like when the org gets this right

Why this is on the report at all

The leadership move on this category

1. Set the norm: critical means self-healing

2. Accept and fund the trade-off explicitly

3. Ask one question at the resilience review

Quick quiz

Keep learning

Related compliance lessons