Why Services Fail

In a microservices architecture, failures are inevitable. A service might be slow, unresponsive, or completely down. Without protection, a single failing service can bring down your entire system!

The Cascading Failure Problem

Imagine a chain of dominoes:

  • Service A calls Service B
  • Service B is slow/failing
  • Service A waits, using up threads
  • Service A runs out of threads and fails
  • Now systems calling Service A also fail!
  • 💥 Entire system crashes from one failure
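
A toy illustration of that domino effect, as a Python sketch (the worker-pool size, delays, and service names are made up): when a downstream call has no timeout, a few hung requests are enough to tie up every worker thread in the caller.

Python - Thread Starvation Sketch
import time
from concurrent.futures import ThreadPoolExecutor

def call_service_b():
    # Pretend Service B has hung: every call blocks for a long time
    time.sleep(60)

def handle_request_in_a(request_id):
    call_service_b()  # No timeout, so this worker thread is now stuck
    return f"handled {request_id}"

pool = ThreadPoolExecutor(max_workers=4)  # Service A has only 4 worker threads

# The first 4 requests occupy every worker; the rest just queue up, so
# Service A looks "down" to its own callers even though only Service B failed.
futures = [pool.submit(handle_request_in_a, i) for i in range(10)]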

Circuit Breaker Pattern

Think of your home's electrical circuit breaker. When there's a problem, it "opens" (breaks the circuit) to prevent damage. Microservices use the same concept!

Circuit Breaker States

🟢 Closed (Normal): Requests flow through normally

🔴 Open (Failing): Requests are rejected immediately; no calls reach the failing service

🟡 Half-Open (Testing): A limited number of trial requests are allowed through to check whether the service has recovered

Python - Simple Circuit Breaker
import requests
from datetime import datetime, timedelta

class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=5):
        self.failure_threshold = failure_threshold
        self.failure_count = 0
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN
        self.open_until = None

    def call_service(self, url):
        # If circuit is OPEN, fail fast
        if self.state == "OPEN":
            if datetime.now() > self.open_until:
                self.state = "HALF_OPEN"
            else:
                raise Exception("Circuit is OPEN - service unavailable")

        try:
            # Try to call the service
            response = requests.get(url, timeout=3)
            response.raise_for_status()  # Treat HTTP 4xx/5xx responses as failures too

            # Success! Reset failure count
            self.failure_count = 0
            self.state = "CLOSED"
            return response

        except Exception as e:
            # Failure!
            self.failure_count += 1

            # Too many failures? Open the circuit
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                self.open_until = datetime.now() + timedelta(seconds=60)
                print("Circuit OPENED - too many failures!")

            raise e

# Usage
breaker = SimpleCircuitBreaker()
try:
    result = breaker.call_service('http://unreliable-service/api')
except Exception as e:
    print("Service call failed:", e)

Other Resilience Patterns

  • Timeout: Don't wait forever - set a maximum wait time
  • Retry: Try again if a request fails (but not forever!)
  • Fallback: Return cached data or a default value when the service fails (a combined sketch follows below)
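
A minimal sketch of the timeout and fallback patterns together, in Python (the pricing-service URL, cache contents, and default price are illustrative assumptions; retry with backoff gets its own example later):

Python - Timeout with Fallback
import requests

CACHED_PRICES = {"SKU-1": 9.99}  # Hypothetical last-known-good cache

def get_price(sku):
    try:
        # Timeout: never wait more than 2 seconds for the pricing service
        response = requests.get(f"http://pricing-service/prices/{sku}", timeout=2)
        response.raise_for_status()
        return response.json()["price"]
    except requests.RequestException:
        # Fallback: serve stale cached data (or a safe default) instead of failing
        return CACHED_PRICES.get(sku, 0.0)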

Advanced Resilience Patterns

Build robust microservices that handle failures gracefully.

Circuit Breaker Implementation

Let's configure a production-ready circuit breaker with all three states, using Resilience4j:

Java - Resilience4j Circuit Breaker
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.time.Duration;

// Configure circuit breaker
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                    // Open at 50% failure rate
    .waitDurationInOpenState(Duration.ofSeconds(60))  // Stay open for 60s
    .slidingWindowSize(10)                       // Look at last 10 calls
    .minimumNumberOfCalls(5)                     // Need 5 calls before checking
    .permittedNumberOfCallsInHalfOpenState(3)   // Try 3 calls in half-open
    .build();

CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
CircuitBreaker breaker = registry.circuitBreaker("paymentService");

// Use circuit breaker
public PaymentResult processPayment(PaymentRequest request) {
    return breaker.executeSupplier(() -> {
        // This code is protected by circuit breaker
        return paymentClient.process(request);
    });
}

Retry Pattern with Exponential Backoff

Don't retry immediately - use increasing delays to give the service time to recover.

Python - Exponential Backoff
import time
import random

def retry_with_backoff(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # Last attempt, give up

            # Exponential backoff with jitter: waits of roughly 1s, 2s, 4s, 8s between attempts
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed. Retrying in {wait_time:.1f}s...")
            time.sleep(wait_time)

# Usage (call_external_api stands in for any flaky dependency)
result = retry_with_backoff(lambda: call_external_api())

Bulkhead Pattern

Isolate resources so that one failing component cannot starve the rest of the system.

  • Thread Pool Bulkhead: Separate thread pools per service - each external service gets its own thread pool
  • Semaphore Bulkhead: Limit concurrent calls - at most N concurrent calls per service (see the asyncio sketch after the Node.js example below)
  • Resource Bulkhead: Separate resources - dedicated DB connections and memory per dependency

Node.js - Bulkhead with Pool
class Bulkhead {
    constructor(maxConcurrent = 10) {
        this.maxConcurrent = maxConcurrent;
        this.currentRunning = 0;
        this.queue = [];
    }

    async execute(task) {
        // Check if we have capacity
        if (this.currentRunning >= this.maxConcurrent) {
            // Wait in queue
            return new Promise((resolve, reject) => {
                this.queue.push({ task, resolve, reject });
            });
        }

        return this._run(task);
    }

    async _run(task) {
        this.currentRunning++;
        try {
            const result = await task();
            return result;
        } finally {
            this.currentRunning--;
            this._processQueue();
        }
    }

    _processQueue() {
        if (this.queue.length > 0 && this.currentRunning < this.maxConcurrent) {
            const { task, resolve, reject } = this.queue.shift();
            this._run(task).then(resolve).catch(reject);
        }
    }
}

// Usage (inside an async function): separate bulkheads for different services
const paymentBulkhead = new Bulkhead(5);   // Max 5 concurrent payment calls
const shippingBulkhead = new Bulkhead(10); // Max 10 concurrent shipping calls

await paymentBulkhead.execute(() => processPayment(order));
await shippingBulkhead.execute(() => shipOrder(order));
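
The semaphore variant from the table above can be sketched in Python with asyncio (the limit of 5 and the fetch_inventory helper are assumptions for illustration):

Python - Semaphore Bulkhead (asyncio)
import asyncio

# At most 5 concurrent calls to the inventory service; extra callers wait here
# instead of exhausting connections or threads elsewhere in the process.
inventory_semaphore = asyncio.Semaphore(5)

async def call_inventory_service(item_id):
    async with inventory_semaphore:
        # fetch_inventory is a placeholder for the real async HTTP call
        return await fetch_inventory(item_id)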

Netflix Hystrix

Netflix pioneered resilience patterns with Hystrix (now in maintenance mode, but its concepts remain relevant):

  • Circuit Breaker: Automatic failure detection and recovery
  • Bulkhead: Thread pool isolation per dependency
  • Fallbacks: Graceful degradation with cached data
  • Real-time Monitoring: Dashboard showing circuit states

Impact: Netflix serves 200M+ subscribers with high availability despite constant component failures

Advanced Resilience Engineering

Production-grade resilience patterns and chaos engineering practices.

Multi-Layer Defense Strategy

Combine multiple resilience patterns for robust systems:

Java - Combined Resilience Patterns
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.timelimiter.TimeLimiter;
import io.github.resilience4j.decorators.Decorators;

public class ResilientServiceCall {
    private final CircuitBreaker circuitBreaker;
    private final Retry retry;
    private final Bulkhead bulkhead;
    private final RateLimiter rateLimiter;
    private final TimeLimiter timeLimiter;

    public Result callService(Request request) {
        return Decorators.ofSupplier(() -> makeActualCall(request))
            .withCircuitBreaker(circuitBreaker)    // Layer 1: Circuit breaker
            .withRetry(retry)                       // Layer 2: Retry with backoff
            .withBulkhead(bulkhead)                // Layer 3: Limit concurrency
            .withRateLimiter(rateLimiter)          // Layer 4: Rate limiting
            .withTimeLimiter(timeLimiter)          // Layer 5: Timeout
            .withFallback(this::fallbackResponse)  // Layer 6: Fallback
            .get();
    }

    private Result fallbackResponse(Throwable throwable) {
        // Return cached data or default response
        return cacheService.getLastKnownGood()
            .orElse(Result.defaultResult());
    }
}

Chaos Engineering

Intentionally inject failures to test resilience:

  • Latency Injection: Tests slow dependencies - e.g., add a 2s delay to 10% of requests (a toy latency-injection wrapper follows below)
  • Error Injection: Tests service failures - e.g., return a 500 error for 5% of calls
  • Instance Termination: Tests server crashes - e.g., kill random service instances
  • Network Partition: Tests network splits - e.g., block communication between services
  • Resource Exhaustion: Tests memory/CPU limits - e.g., consume 90% of available memory
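
Before reaching for a full chaos platform, the idea behind latency injection can be sketched in a few lines of Python (the 10% rate, 2-second delay, and get_payment_status function are illustrative):

Python - Toy Latency Injection
import functools
import random
import time

def inject_latency(rate=0.10, delay_seconds=2.0):
    """Decorator that delays a fraction of calls to simulate a slow dependency."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                time.sleep(delay_seconds)  # Simulated network/dependency slowness
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(rate=0.10, delay_seconds=2.0)
def get_payment_status(order_id):
    # Placeholder for the real downstream call
    return {"order_id": order_id, "status": "PAID"}
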
Chaos Experiment with Gremlin
# Gremlin-style chaos experiment (illustrative pseudo-configuration; the "#" lines are annotations, not valid JSON)
{
  "name": "Payment Service Latency Test",
  "hypothesis": "System remains functional with 95% success rate even when payment service has 2s latency",
  "experiment": {
    "target": {
      "service": "payment-service",
      "percentage": 50  # Affect 50% of instances
    },
    "attack": {
      "type": "latency",
      "duration": "10m",
      "magnitude": "2000ms"
    }
  },
  "steady_state": {
    "metrics": [
      {"name": "success_rate", "min": 95},
      {"name": "p99_latency", "max": "5000ms"},
      {"name": "error_rate", "max": 5}
    ]
  },
  "rollback": {
    "on_violation": true,
    "alerts": ["slack", "pagerduty"]
  }
}
Chaos Engineering Best Practices

  • Start small - begin in test environments
  • Define hypothesis before experiment
  • Have rollback plan ready
  • Monitor closely during experiments
  • Run during business hours (when team is available)
  • Document learnings and improve systems

Observability for Resilience

You can't fix what you can't see. Proper monitoring is crucial:

Prometheus Metrics for Circuit Breaker
from prometheus_client import Counter, Gauge, Histogram

# Circuit breaker metrics
circuit_breaker_state = Gauge(
    'circuit_breaker_state',
    'Circuit breaker state (0=closed, 1=open, 2=half_open)',
    ['service']
)

circuit_breaker_failures = Counter(
    'circuit_breaker_failures_total',
    'Total number of circuit breaker failures',
    ['service']
)

circuit_breaker_success = Counter(
    'circuit_breaker_success_total',
    'Total number of successful calls',
    ['service']
)

service_call_duration = Histogram(
    'service_call_duration_seconds',
    'Service call duration',
    ['service'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

# Update metrics
circuit_breaker_state.labels(service='payment').set(1)  # Open
circuit_breaker_failures.labels(service='payment').inc()
service_call_duration.labels(service='payment').observe(2.5)
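
To make these metrics scrapeable, prometheus_client can expose them over HTTP (port 8000 is an arbitrary choice):

Exposing the Metrics Endpoint
from prometheus_client import start_http_server

# Serve all registered metrics at http://localhost:8000/metrics for Prometheus to scrape
start_http_server(8000)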

Amazon's Resilience Culture

Amazon's approach to building resilient systems:

  • Shuffle Sharding: Customers are isolated to a subset of servers, so a failure affects only a small percentage of them (see the sketch after this list)
  • Cell-Based Architecture: Independent failure domains limit blast radius
  • Static Stability: Systems designed to work without dependencies when possible
  • Game Days: Regular chaos experiments in production
  • Wheel of Misfortune: Training exercises with real failure scenarios
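
A toy illustration of shuffle sharding in Python (the fleet size, shard size, and hashing scheme are assumptions, not Amazon's implementation): each customer gets a deterministic, random-looking subset of servers, so two customers rarely share their whole shard and a "poison" workload degrades only a small slice of the fleet.

Python - Shuffle Sharding Sketch
import hashlib
import random

SERVERS = [f"server-{i}" for i in range(16)]  # Hypothetical fleet of 16 servers
SHARD_SIZE = 4                                # Each customer is served by 4 of them

def shard_for_customer(customer_id):
    # Seed a PRNG from the customer id so the assignment is stable across calls
    seed = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return rng.sample(SERVERS, SHARD_SIZE)

# Two customers usually overlap on few (often zero) servers, so a failure
# triggered by one customer's traffic leaves most other customers unaffected.
print(shard_for_customer("customer-a"))
print(shard_for_customer("customer-b"))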

Philosophy: "Everything fails, all the time" - design for failure, not perfection