| 1 | The Problem It Solves | 2 | Pattern Structure |
| 3 | When to Use | 4 | When Not to Use |
| 5 | Trade-offs | 6 | Implementation Approach |
| 7 | Anti-Patterns to Avoid | 8 | Cloud-Specific Implementations |
| 9 | References |
The Problem It Solves
Without a circuit breaker, a slow or unavailable downstream service causes the calling service to exhaust its thread pool waiting for responses. Each waiting thread holds a connection and memory. New requests queue up behind the waiting threads. The calling service eventually runs out of resources and fails too — a cascading failure that takes down healthy services along with the unhealthy one.
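The exhaustion described above can be made concrete with a little arithmetic. The sketch below applies Little's law (average concurrent calls = arrival rate × time per call). The pool size, request rate, and latencies are illustrative assumptions, not figures from any particular system.

```python
def threads_in_use(requests_per_second, seconds_per_call):
    """Little's law: average concurrent calls = arrival rate x time per call."""
    return requests_per_second * seconds_per_call


def seconds_until_exhausted(pool_size, requests_per_second, seconds_per_call):
    """Time for a pool to fill once every call takes seconds_per_call.

    Returns None when steady-state demand fits inside the pool.
    """
    if threads_in_use(requests_per_second, seconds_per_call) <= pool_size:
        return None
    return pool_size / requests_per_second


# Illustrative numbers only: a 200-thread pool receiving 100 requests per second.
healthy = threads_in_use(100, 0.05)   # roughly 5 threads busy at 50 ms per call
hung = threads_in_use(100, 30.0)      # 3000 threads wanted if calls hang for 30 s
time_to_fill = seconds_until_exhausted(200, 100, 30.0)  # pool is full after 2 s
```

Under these assumptions a pool that normally runs almost idle is fully occupied about two seconds after the downstream service stops responding.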
Pattern Structure
%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'Inter, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
    START([Service Makes Remote Call])
    START --> STATE{Circuit State}
    STATE -->|Closed - normal operation| CALL[Attempt remote call]
    CALL --> OUTCOME{Call Outcome}
    OUTCOME -->|Success| RESET[Reset failure counter\nReturn response]
    OUTCOME -->|Failure or timeout| COUNT[Increment failure counter]
    COUNT --> THRESHOLD{Failure threshold\nexceeded?}
    THRESHOLD -->|No| CALL
    THRESHOLD -->|Yes| OPEN[Open circuit\nStart recovery timer]
    STATE -->|Open - failing fast| FAST_FAIL[Fail immediately\nNo remote call made\nReturn fallback or error]
    FAST_FAIL --> TIMER{Recovery timer\nexpired?}
    TIMER -->|No| FAST_FAIL
    TIMER -->|Yes| HALF[Half-open state\nAllow probe request through]
    HALF --> PROBE[Attempt probe call]
    PROBE --> PROBE_RESULT{Probe\nSucceeded?}
    PROBE_RESULT -->|Yes| CLOSED([Close circuit\nResume normal operation])
    PROBE_RESULT -->|No| OPEN
    style START fill:#4f8ef7,color:#fff
    style CLOSED fill:#10b981,color:#fff
    style OPEN fill:#fef3c7
    style FAST_FAIL fill:#fef3c7
    style HALF fill:#e0f2fe
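The three states in the diagram can be sketched as a small in-process class. This is a minimal, single-threaded illustration rather than a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `recovery_timeout` are assumptions chosen for this example, and a real service would add locking and would normally use an established library.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal breaker; not thread-safe."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                # Fail fast: no remote call is made while the timer runs.
                raise CircuitOpenError("circuit open; failing fast")
            # Recovery timer expired: let this call through as a probe.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success resets the counter and closes the circuit.
        self._failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if (self.state is State.HALF_OPEN
                or self._failures >= self.failure_threshold):
            self.state = State.OPEN
            self._opened_at = self._clock()
```

A caller would wrap each remote call, for example `breaker.call(fetch_catalogue)`, where `fetch_catalogue` is a hypothetical function, and catch `CircuitOpenError` to serve a fallback.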
When to Use
- Any service that makes synchronous remote calls to downstream dependencies
- Systems where the failure of a downstream dependency should not bring down the caller
- High-traffic services where thread pool exhaustion from slow downstream calls is a realistic risk
- Microservices architectures where cascading failures across services are a known operational concern
When Not to Use
- Asynchronous messaging patterns where the caller does not wait for a response
- Internal in-process calls that do not cross a network boundary
- Simple two-tier applications where there is only one dependency and failure is acceptable
Trade-offs
| Benefit | Cost |
|---|---|
| Prevents cascading failures — failing fast protects the caller | Fallback behaviour must be designed and tested |
| Gives the downstream service time to recover | Adds latency measurement overhead per call |
| Enables graceful degradation — serve partial results | State management for the circuit requires storage or in-process counters |
| Provides operational visibility into dependency health | Half-open probe logic must be tuned per dependency |
Implementation Approach
Define thresholds appropriate to the dependency. The circuit around a payment service should typically open after fewer failures than the circuit around a recommendation service. Common starting points: open after five consecutive failures, or after a 50% failure rate over a ten-second window.
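The two starting points above (consecutive failures and failure rate over a time window) can be combined in one tracker. The sketch below is an assumption-laden illustration: `FailureWindow` and its parameters are hypothetical names, and `min_calls` is an extra guard, not mentioned above, that stops a single failure on a quiet service from counting as a 100% failure rate.

```python
import time
from collections import deque


class FailureWindow:
    """Tracks call outcomes and decides whether the circuit should open."""

    def __init__(self, max_consecutive=5, max_failure_rate=0.5,
                 window_seconds=10.0, min_calls=10, clock=time.monotonic):
        self.max_consecutive = max_consecutive
        self.max_failure_rate = max_failure_rate
        self.window_seconds = window_seconds
        self.min_calls = min_calls  # assumption: minimum sample size for the rate
        self._clock = clock
        self._events = deque()      # (timestamp, succeeded) pairs, oldest first
        self._consecutive = 0

    def record(self, succeeded):
        now = self._clock()
        self._events.append((now, succeeded))
        self._consecutive = 0 if succeeded else self._consecutive + 1
        self._evict(now)

    def _evict(self, now):
        # Drop outcomes that have aged out of the sliding window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

    def should_open(self):
        self._evict(self._clock())
        if self._consecutive >= self.max_consecutive:
            return True
        total = len(self._events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total >= self.max_failure_rate
```

A stricter dependency such as payments might be configured with, say, `max_consecutive=3`, while a recommendation service could tolerate a higher `max_failure_rate`. Those specific values are illustrative only.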
Implement meaningful fallbacks. When the circuit is open, return a cached result, a default value, or a clear error that the upstream caller can handle. A cached product catalogue from five minutes ago is better than an exception that propagates to the user.
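One way to sketch the cached-result fallback is a wrapper that remembers the last good response and serves it, within an age limit, when the call fails. `CachedFallback`, `max_age_seconds`, and `default` are hypothetical names for this example. In practice the wrapped `fetch` would be the circuit-protected call, so a rejected call from an open circuit would be handled by the same `except` branch.

```python
import time


class CachedFallback:
    """Serve the last good value when the wrapped fetch fails."""

    def __init__(self, fetch, max_age_seconds=300.0, default=None,
                 clock=time.monotonic):
        self._fetch = fetch
        self.max_age_seconds = max_age_seconds
        self._default = default
        self._clock = clock
        self._value = None
        self._stored_at = None  # None until the first successful fetch

    def get(self):
        try:
            value = self._fetch()
        except Exception:
            # Broad catch for brevity; a real service would narrow this.
            return self._fallback()
        self._value, self._stored_at = value, self._clock()
        return value

    def _fallback(self):
        fresh_enough = (
            self._stored_at is not None
            and self._clock() - self._stored_at <= self.max_age_seconds
        )
        # Prefer a recent cached value; otherwise return the explicit default.
        return self._value if fresh_enough else self._default
```

The default of 300 seconds mirrors the five-minute catalogue example above. The right age limit depends on how stale the data is allowed to be.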
Expose circuit state as a metric. The circuit state — closed, open, half-open — and the failure rate per dependency are essential operational metrics. Alert when any circuit opens in production. A circuit opening is a signal that a dependency is failing.
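A rough sketch of the bookkeeping behind those metrics follows. It uses only the standard library and keeps everything in memory; `CircuitMetrics` and its methods are hypothetical, and a real service would forward these values to its metrics and alerting system instead of storing them in a list.

```python
from collections import Counter

# Numeric encoding so the state can be exported as a gauge (an assumption).
STATE_VALUES = {"closed": 0, "half_open": 1, "open": 2}


class CircuitMetrics:
    """In-memory record of circuit state and failure rate per dependency."""

    def __init__(self):
        self.state = {}         # dependency name -> current state name
        self.calls = Counter()  # (dependency, "success" | "failure") -> count
        self.alerts = []        # stand-in for a real alerting hook

    def on_state_change(self, dependency, new_state):
        if new_state not in STATE_VALUES:
            raise ValueError(f"unknown state: {new_state}")
        self.state[dependency] = new_state
        if new_state == "open":
            self.alerts.append(f"circuit for {dependency} opened")

    def on_call(self, dependency, succeeded):
        outcome = "success" if succeeded else "failure"
        self.calls[(dependency, outcome)] += 1

    def failure_rate(self, dependency):
        ok = self.calls[(dependency, "success")]
        bad = self.calls[(dependency, "failure")]
        total = ok + bad
        return bad / total if total else 0.0

    def snapshot(self):
        # One entry per dependency, ready to be scraped or logged.
        return {
            dep: {
                "state": name,
                "state_value": STATE_VALUES[name],
                "failure_rate": self.failure_rate(dep),
            }
            for dep, name in self.state.items()
        }
```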
Set appropriate timeouts on the calls the circuit wraps. A circuit breaker without a timeout is incomplete. If the call never times out, the circuit never opens. Set a timeout shorter than the caller's own timeout so failures are detected before the caller times out itself.
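The timeout rule can be sketched with the standard library's `concurrent.futures`. The helper below is illustrative: `call_with_timeout` and `CallTimedOut` are hypothetical names, and the approach only stops waiting, because the worker thread keeps running until the underlying call returns. Where the client library offers its own timeout parameter, that is usually the better choice.

```python
import concurrent.futures
import time

# Shared pool for outbound calls; the size is an arbitrary illustration.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


class CallTimedOut(Exception):
    """Raised when the wrapped call exceeds its timeout."""


def call_with_timeout(func, timeout_seconds, caller_timeout_seconds):
    """Run func, giving up after timeout_seconds.

    Rejects configurations where the dependency timeout is not shorter
    than the caller's own timeout.
    """
    if timeout_seconds >= caller_timeout_seconds:
        raise ValueError(
            "dependency timeout must be shorter than the caller's timeout"
        )
    future = _executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the call has not started yet;
        # a call that is already running continues in the background.
        future.cancel()
        raise CallTimedOut(f"no response within {timeout_seconds}s") from None


def slow_call():
    time.sleep(0.3)  # simulates a dependency that is too slow
    return "late"
```

A circuit breaker would count `CallTimedOut` as a failure, which is what allows a hanging dependency to trip the circuit at all.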
Anti-Patterns to Avoid
No fallback when the circuit opens. The circuit opens and the caller lets the resulting exception propagate to the user as a 500 error. The cascade is stopped at the service boundary, but the user experience is no better than it would be without a circuit breaker.
Instead, design a fallback response for every circuit that can open. The fallback may be degraded (an empty list, a cached result, a user-visible message), but it is a deliberate choice, not an unhandled exception.
Inconsistent circuit state across instances. Each instance of a horizontally scaled service maintains its own in-process circuit state. Instance A opens its circuit while Instance B sees different traffic and stays closed, so the circuit state is inconsistent across the fleet.
Instead, for stateless, horizontally scaled services, either back the circuit breaker with a shared cache such as Redis, or accept that each instance manages its own state and use percentage-based thresholds rather than absolute counts, so that instances trip at the same failure rate regardless of how much traffic each one receives.
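The shared-state option can be sketched as follows. To keep the example runnable without a server, an in-memory `InMemoryStore` stands in for the shared cache. Its `get`, `set`, `incr`, and `delete` methods are a hypothetical interface for this sketch, not a real Redis client API, and `SharedCircuit` is likewise an assumed name. The design is deliberately simple: failures are counted in a fixed window, and an "open" flag with a time-to-live plays the role of the recovery timer. When the flag expires, every instance resumes calls at once instead of sending a single probe.

```python
import time


class InMemoryStore:
    """Stand-in for a shared cache with expiring keys (hypothetical interface)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def _live(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)  # expired or missing
            return None
        return entry

    def get(self, key):
        entry = self._live(key)
        return None if entry is None else entry[0]

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def incr(self, key, ttl):
        entry = self._live(key)
        if entry is None:
            self._data[key] = (1, self._clock() + ttl)  # first hit starts the window
            return 1
        self._data[key] = (entry[0] + 1, entry[1])      # keep the original expiry
        return entry[0] + 1

    def delete(self, key):
        self._data.pop(key, None)


class SharedCircuit:
    """Circuit state kept in a shared store so all instances agree."""

    def __init__(self, store, name, failure_threshold=5,
                 window_seconds=10.0, recovery_timeout=30.0):
        self._store = store
        self._fail_key = f"circuit:{name}:failures"
        self._open_key = f"circuit:{name}:open"
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.recovery_timeout = recovery_timeout

    def allow_request(self):
        # The circuit is open for as long as the flag exists in the store.
        return self._store.get(self._open_key) is None

    def record_failure(self):
        count = self._store.incr(self._fail_key, ttl=self.window_seconds)
        if count >= self.failure_threshold:
            self._store.set(self._open_key, "1", ttl=self.recovery_timeout)

    def record_success(self):
        self._store.delete(self._fail_key)
```

Each service instance would construct its own `SharedCircuit` against the same store, so a circuit opened by one instance is seen by all of them. The trade-off is an extra cache round trip per call and a new dependency that can itself fail.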
Cloud-Specific Implementations
- AWS: Lambda and API Gateway provide built-in timeout settings, and Lambda supports configurable retries for asynchronous invocations. For circuit breaker state shared across instances, use ElastiCache for Redis. For Java-based Lambda functions, Resilience4j provides an in-process circuit breaker.
References
- Nygard, Michael T. — Release It! Design and Deploy Production-Ready Software. Pragmatic Bookshelf, 2018.
- Fowler, Martin — Circuit Breaker. martinfowler.com/bliki/CircuitBreaker
- Resilience4j — Circuit breaker for Java. resilience4j.readme.io
- Netflix — Hystrix: Latency and Fault Tolerance. github.com/Netflix/Hystrix