Why Services Fail

In a microservices architecture, failures are inevitable. A service might be slow, unresponsive, or completely down. Without protection, a single failing service can bring down your entire system!

The Cascading Failure Problem

Imagine a chain of dominoes:

  • Service A calls Service B
  • Service B is slow/failing
  • Service A waits, using up threads
  • Service A runs out of threads and fails
  • Now systems calling Service A also fail!
  • 💥 Entire system crashes from one failure
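
A toy illustration of that domino effect, as a Python sketch (the worker-pool size, delays, and service names are made up): when a downstream call has no timeout, a few hung requests are enough to tie up every worker thread in the caller.

Python - Thread Starvation Sketch
import time
from concurrent.futures import ThreadPoolExecutor

def call_service_b():
    # Pretend Service B has hung: every call blocks for a long time
    time.sleep(60)

def handle_request_in_a(request_id):
    call_service_b()  # No timeout, so this worker thread is now stuck
    return f"handled {request_id}"

pool = ThreadPoolExecutor(max_workers=4)  # Service A has only 4 worker threads

# The first 4 requests occupy every worker; the rest just queue up, so
# Service A looks "down" to its own callers even though only Service B failed.
futures = [pool.submit(handle_request_in_a, i) for i in range(10)]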

Circuit Breaker Pattern

Think of your home's electrical circuit breaker. When there's a problem, it "opens" (breaks the circuit) to prevent damage. Microservices use the same concept!

Circuit Breaker States

🟢 Closed (Normal): Requests flow through normally

🔴 Open (Failing): Requests are rejected immediately; no calls reach the failing service

🟡 Half-Open (Testing): A limited number of trial requests are allowed through to check whether the service has recovered

Python - Simple Circuit Breaker
import requests
from datetime import datetime, timedelta

class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=5):
        self.failure_threshold = failure_threshold
        self.failure_count = 0
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN
        self.open_until = None

    def call_service(self, url):
        # If circuit is OPEN, fail fast
        if self.state == "OPEN":
            if datetime.now() > self.open_until:
                self.state = "HALF_OPEN"
            else:
                raise Exception("Circuit is OPEN - service unavailable")

        try:
            # Try to call the service
            response = requests.get(url, timeout=3)
            response.raise_for_status()  # Treat HTTP 4xx/5xx responses as failures too

            # Success! Reset failure count
            self.failure_count = 0
            self.state = "CLOSED"
            return response

        except Exception as e:
            # Failure!
            self.failure_count += 1

            # Too many failures? Open the circuit
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                self.open_until = datetime.now() + timedelta(seconds=60)
                print("Circuit OPENED - too many failures!")

            raise e

# Usage
breaker = SimpleCircuitBreaker()
try:
    result = breaker.call_service('http://unreliable-service/api')
except Exception as e:
    print("Service call failed:", e)

Other Resilience Patterns

  • Timeout: Don't wait forever - set a maximum wait time
  • Retry: Try again if a request fails (but not forever!)
  • Fallback: Return cached data or a default value when the service fails (a combined sketch follows below)
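
A minimal sketch of the timeout and fallback patterns together, in Python (the pricing-service URL, cache contents, and default price are illustrative assumptions; retry with backoff gets its own example later):

Python - Timeout with Fallback
import requests

CACHED_PRICES = {"SKU-1": 9.99}  # Hypothetical last-known-good cache

def get_price(sku):
    try:
        # Timeout: never wait more than 2 seconds for the pricing service
        response = requests.get(f"http://pricing-service/prices/{sku}", timeout=2)
        response.raise_for_status()
        return response.json()["price"]
    except requests.RequestException:
        # Fallback: serve stale cached data (or a safe default) instead of failing
        return CACHED_PRICES.get(sku, 0.0)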

Advanced Resilience Patterns

Build robust microservices that handle failures gracefully.

Circuit Breaker Implementation

Let's configure a production-ready circuit breaker with all three states, using Resilience4j:

Java - Resilience4j Circuit Breaker
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.time.Duration;

// Configure circuit breaker
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                    // Open at 50% failure rate
    .waitDurationInOpenState(Duration.ofSeconds(60))  // Stay open for 60s
    .slidingWindowSize(10)                       // Look at last 10 calls
    .minimumNumberOfCalls(5)                     // Need 5 calls before checking
    .permittedNumberOfCallsInHalfOpenState(3)   // Try 3 calls in half-open
    .build();

CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
CircuitBreaker breaker = registry.circuitBreaker("paymentService");

// Use circuit breaker
public PaymentResult processPayment(PaymentRequest request) {
    return breaker.executeSupplier(() -> {
        // This code is protected by circuit breaker
        return paymentClient.process(request);
    });
}

Retry Pattern with Exponential Backoff

Don't retry immediately - use increasing delays to give the service time to recover.

Python - Exponential Backoff
import time
import random

def retry_with_backoff(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # Last attempt, give up

            # Exponential backoff with jitter: waits of roughly 1s, 2s, 4s, 8s between attempts
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed. Retrying in {wait_time:.1f}s...")
            time.sleep(wait_time)

# Usage (call_external_api stands in for any flaky dependency)
result = retry_with_backoff(lambda: call_external_api())

Bulkhead Pattern

Isolate resources so that one failing component cannot starve the rest of the system.

  • Thread Pool Bulkhead: Separate thread pools per service - each external service gets its own thread pool
  • Semaphore Bulkhead: Limit concurrent calls - at most N concurrent calls per service (see the asyncio sketch after the Node.js example below)
  • Resource Bulkhead: Separate resources - dedicated DB connections and memory per dependency

Node.js - Bulkhead with Pool
class Bulkhead {
    constructor(maxConcurrent = 10) {
        this.maxConcurrent = maxConcurrent;
        this.currentRunning = 0;
        this.queue = [];
    }

    async execute(task) {
        // Check if we have capacity
        if (this.currentRunning >= this.maxConcurrent) {
            // Wait in queue
            return new Promise((resolve, reject) => {
                this.queue.push({ task, resolve, reject });
            });
        }

        return this._run(task);
    }

    async _run(task) {
        this.currentRunning++;
        try {
            const result = await task();
            return result;
        } finally {
            this.currentRunning--;
            this._processQueue();
        }
    }

    _processQueue() {
        if (this.queue.length > 0 && this.currentRunning < this.maxConcurrent) {
            const { task, resolve, reject } = this.queue.shift();
            this._run(task).then(resolve).catch(reject);
        }
    }
}

// Usage (inside an async function): separate bulkheads for different services
const paymentBulkhead = new Bulkhead(5);   // Max 5 concurrent payment calls
const shippingBulkhead = new Bulkhead(10); // Max 10 concurrent shipping calls

await paymentBulkhead.execute(() => processPayment(order));
await shippingBulkhead.execute(() => shipOrder(order));
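
The semaphore variant from the table above can be sketched in Python with asyncio (the limit of 5 and the fetch_inventory helper are assumptions for illustration):

Python - Semaphore Bulkhead (asyncio)
import asyncio

# At most 5 concurrent calls to the inventory service; extra callers wait here
# instead of exhausting connections or threads elsewhere in the process.
inventory_semaphore = asyncio.Semaphore(5)

async def call_inventory_service(item_id):
    async with inventory_semaphore:
        # fetch_inventory is a placeholder for the real async HTTP call
        return await fetch_inventory(item_id)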

Netflix Hystrix

Netflix pioneered resilience patterns with Hystrix (now in maintenance mode, but its concepts remain relevant):

  • Circuit Breaker: Automatic failure detection and recovery
  • Bulkhead: Thread pool isolation per dependency
  • Fallbacks: Graceful degradation with cached data
  • Real-time Monitoring: Dashboard showing circuit states

Impact: Netflix serves 200M+ subscribers with high availability despite constant component failures

Advanced Resilience Engineering

Production-grade resilience patterns and chaos engineering practices.

Multi-Layer Defense Strategy

Combine multiple resilience patterns for robust systems:

Java - Combined Resilience Patterns
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.timelimiter.TimeLimiter;
import io.github.resilience4j.decorators.Decorators;

public class ResilientServiceCall {
    private final CircuitBreaker circuitBreaker;
    private final Retry retry;
    private final Bulkhead bulkhead;
    private final RateLimiter rateLimiter;
    private final TimeLimiter timeLimiter;

    public Result callService(Request request) {
        return Decorators.ofSupplier(() -> makeActualCall(request))
            .withCircuitBreaker(circuitBreaker)    // Layer 1: Circuit breaker
            .withRetry(retry)                       // Layer 2: Retry with backoff
            .withBulkhead(bulkhead)                // Layer 3: Limit concurrency
            .withRateLimiter(rateLimiter)          // Layer 4: Rate limiting
            .withTimeLimiter(timeLimiter)          // Layer 5: Timeout
            .withFallback(this::fallbackResponse)  // Layer 6: Fallback
            .get();
    }

    private Result fallbackResponse(Throwable throwable) {
        // Return cached data or default response
        return cacheService.getLastKnownGood()
            .orElse(Result.defaultResult());
    }
}

Chaos Engineering

Intentionally inject failures to test resilience:

  • Latency Injection: Tests slow dependencies - e.g., add a 2s delay to 10% of requests (a toy latency-injection wrapper follows below)
  • Error Injection: Tests service failures - e.g., return a 500 error for 5% of calls
  • Instance Termination: Tests server crashes - e.g., kill random service instances
  • Network Partition: Tests network splits - e.g., block communication between services
  • Resource Exhaustion: Tests memory/CPU limits - e.g., consume 90% of available memory
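
Before reaching for a full chaos platform, the idea behind latency injection can be sketched in a few lines of Python (the 10% rate, 2-second delay, and get_payment_status function are illustrative):

Python - Toy Latency Injection
import functools
import random
import time

def inject_latency(rate=0.10, delay_seconds=2.0):
    """Decorator that delays a fraction of calls to simulate a slow dependency."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                time.sleep(delay_seconds)  # Simulated network/dependency slowness
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(rate=0.10, delay_seconds=2.0)
def get_payment_status(order_id):
    # Placeholder for the real downstream call
    return {"order_id": order_id, "status": "PAID"}
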
Chaos Experiment with Gremlin
# Gremlin-style chaos experiment (illustrative pseudo-configuration; the "#" lines are annotations, not valid JSON)
{
  "name": "Payment Service Latency Test",
  "hypothesis": "System remains functional with 95% success rate even when payment service has 2s latency",
  "experiment": {
    "target": {
      "service": "payment-service",
      "percentage": 50  # Affect 50% of instances
    },
    "attack": {
      "type": "latency",
      "duration": "10m",
      "magnitude": "2000ms"
    }
  },
  "steady_state": {
    "metrics": [
      {"name": "success_rate", "min": 95},
      {"name": "p99_latency", "max": "5000ms"},
      {"name": "error_rate", "max": 5}
    ]
  },
  "rollback": {
    "on_violation": true,
    "alerts": ["slack", "pagerduty"]
  }
}
Chaos Engineering Best Practices

  • Start small - begin in test environments
  • Define hypothesis before experiment
  • Have rollback plan ready
  • Monitor closely during experiments
  • Run during business hours (when team is available)
  • Document learnings and improve systems

Observability for Resilience

You can't fix what you can't see. Proper monitoring is crucial:

Prometheus Metrics for Circuit Breaker
from prometheus_client import Counter, Gauge, Histogram

# Circuit breaker metrics
circuit_breaker_state = Gauge(
    'circuit_breaker_state',
    'Circuit breaker state (0=closed, 1=open, 2=half_open)',
    ['service']
)

circuit_breaker_failures = Counter(
    'circuit_breaker_failures_total',
    'Total number of circuit breaker failures',
    ['service']
)

circuit_breaker_success = Counter(
    'circuit_breaker_success_total',
    'Total number of successful calls',
    ['service']
)

service_call_duration = Histogram(
    'service_call_duration_seconds',
    'Service call duration',
    ['service'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

# Update metrics
circuit_breaker_state.labels(service='payment').set(1)  # Open
circuit_breaker_failures.labels(service='payment').inc()
service_call_duration.labels(service='payment').observe(2.5)
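
To make these metrics scrapeable, prometheus_client can expose them over HTTP (port 8000 is an arbitrary choice):

Exposing the Metrics Endpoint
from prometheus_client import start_http_server

# Serve all registered metrics at http://localhost:8000/metrics for Prometheus to scrape
start_http_server(8000)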

Amazon's Resilience Culture

Amazon's approach to building resilient systems:

  • Shuffle Sharding: Customers are isolated to a subset of servers, so a failure affects only a small percentage of them (see the sketch after this list)
  • Cell-Based Architecture: Independent failure domains limit blast radius
  • Static Stability: Systems designed to work without dependencies when possible
  • Game Days: Regular chaos experiments in production
  • Wheel of Misfortune: Training exercises with real failure scenarios
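
A toy illustration of shuffle sharding in Python (the fleet size, shard size, and hashing scheme are assumptions, not Amazon's implementation): each customer gets a deterministic, random-looking subset of servers, so two customers rarely share their whole shard and a "poison" workload degrades only a small slice of the fleet.

Python - Shuffle Sharding Sketch
import hashlib
import random

SERVERS = [f"server-{i}" for i in range(16)]  # Hypothetical fleet of 16 servers
SHARD_SIZE = 4                                # Each customer is served by 4 of them

def shard_for_customer(customer_id):
    # Seed a PRNG from the customer id so the assignment is stable across calls
    seed = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return rng.sample(SERVERS, SHARD_SIZE)

# Two customers usually overlap on few (often zero) servers, so a failure
# triggered by one customer's traffic leaves most other customers unaffected.
print(shard_for_customer("customer-a"))
print(shard_for_customer("customer-b"))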

Philosophy: "Everything fails, all the time" - design for failure, not perfection