Error Handling & Recovery


Error Types in Agent Systems

Why Error Handling Matters

The Problem: Agents interact with external APIs, databases, and LLMs -- all of which can fail. Without proper error handling, a single failure can crash the entire agent workflow.

The Solution: Robust error handling with retry strategies, fallback chains, and graceful degradation keeps agents running reliably even when individual components fail.

Real Impact: Production agents with proper error handling achieve 99%+ uptime versus 80-90% without it.

Real-World Analogy

Think of error handling like a pilot handling in-flight problems:

  • Retry = Toggle the switch again -- maybe it was a momentary glitch
  • Fallback = Switch to the backup system when primary fails
  • Self-Correction = Adjust altitude based on new instrument readings
  • Graceful Degradation = Land at a closer airport instead of the destination
  • Circuit Breaker = Stop trying a system that keeps failing

Common Error Categories

Tool Errors

API timeouts, rate limits, invalid parameters, network failures. Most common and most recoverable.

LLM Errors

Rate limits, context overflow, content filtering, malformed output. Requires retry or model switching.

Logic Errors

Agent enters infinite loops, makes contradictory decisions, or fails to make progress. Needs loop detection.

Data Errors

Invalid input data, schema mismatches, encoding issues. Requires validation and sanitization.

Key Takeaway: Agent errors fall into distinct categories: tool errors (API failures, timeouts, rate limits), LLM errors (hallucinations, refusals, malformed tool calls), logic errors (infinite loops, contradictory decisions, no progress), and data errors (invalid input, schema mismatches). Each requires a different handling strategy.
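The categories above can be modeled as an exception hierarchy so that handlers can catch broad or narrow failure classes. This is a minimal sketch; the class names are illustrative, not from any specific library:

```python
class AgentError(Exception):
    """Base class for all agent-level failures."""

class ToolError(AgentError):
    """Tool failures: API timeouts, invalid parameters, network errors."""

class RateLimitError(ToolError):
    """HTTP 429-style throttling; usually retryable with backoff."""

class LLMError(AgentError):
    """Model failures: context overflow, content filtering, malformed output."""

class LogicError(AgentError):
    """Agent loop problems: no progress, contradictory decisions."""

class DataError(AgentError):
    """Invalid input data, schema mismatches, encoding issues."""
```

A handler can then catch `ToolError` to cover every tool failure, or `RateLimitError` specifically when backoff is the right response.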

Retry Strategies

Error Handling Decision Tree
[Flowchart] Error occurs -> Is it retryable? Yes: retry with backoff; if retries fail, try a fallback, then degrade gracefully. No (fatal error): report and stop.
error_handling.py
import time
import random

# Application-defined exception types (minimal stubs shown here)
class RateLimitError(Exception): pass
class ToolError(Exception): pass
class APIError(Exception): pass

def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the error
            # Exponential backoff with jitter to avoid thundering-herd retries
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.1)
            time.sleep(delay + jitter)
        except (ToolError, APIError) as e:
            # Return the error to the agent for self-correction
            return {"error": str(e), "retryable": True}

class FallbackChain:
    def __init__(self, providers):
        self.providers = providers  # e.g. [gpt4, claude, local_model]

    def call(self, messages):
        for provider in self.providers:
            try:
                return provider.chat(messages)
            except Exception:
                continue  # try the next provider in the chain
        raise RuntimeError("All providers failed")
Output (retry with backoff)
Attempt 1: API rate limited (429), retrying in 1s...
Attempt 2: API rate limited (429), retrying in 2s...
Attempt 3: Success - response received in 340ms

Common Mistake

Wrong: Retrying all errors with the same strategy

Why it fails: Retrying a 400 Bad Request wastes tokens -- the same malformed request will fail every time. Retrying a hallucinated tool call with the same prompt often produces the same hallucination.

Instead: Classify errors: retry transient errors (429, 500, timeouts) with backoff; for LLM errors, modify the prompt to include the error feedback; for permanent errors (invalid API key, missing resource), fail fast with a clear message.
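The classification step can be sketched as a small dispatch function keyed on HTTP-style status codes. The code mapping is illustrative; real systems would also inspect exception types and response bodies:

```python
def classify_error(status_code: int) -> str:
    """Map a status code to a handling strategy (illustrative mapping)."""
    if status_code in (429, 500, 502, 503, 504):
        return "retry"      # transient: retry with exponential backoff
    if status_code in (400, 401, 403, 404, 422):
        return "fail_fast"  # permanent: the same request will fail again
    return "fallback"       # unknown: try an alternative provider or tool
```

Calling `classify_error(429)` yields `"retry"`, while `classify_error(400)` yields `"fail_fast"`, so the agent never burns tokens re-sending a malformed request.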

Fallback Chains

Fallback strategies and when to use them:

  • Model Fallback: primary model unavailable (GPT-4o -> Claude -> Llama)
  • Tool Fallback: primary tool fails (Google Search -> Bing -> DuckDuckGo)
  • Strategy Fallback: primary approach fails (RAG -> Web Search -> Cached Answer)
  • Quality Fallback: trade quality for reliability (Detailed answer -> Summary -> "I cannot help")

Self-Correction

Self-Correction Patterns

  • Output Validation: Check agent output against expected schema, retry if invalid
  • Reflection: Ask the LLM to evaluate its own response and identify errors
  • Test-Driven: Run generated code, feed errors back for fixing
  • Critic Agent: A separate agent evaluates and requests corrections
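The Output Validation pattern above can be sketched as a loop that checks the model's output against an expected shape and re-prompts with the validation error on failure. Here `call_llm` is a hypothetical stand-in for your model client, and the required-keys check is a simplified schema:

```python
import json

def validated_call(call_llm, prompt, required_keys, max_attempts=3):
    """Retry an LLM call until its JSON output contains required_keys."""
    for attempt in range(max_attempts):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
            missing = [k for k in required_keys if k not in data]
            if not missing:
                return data
            feedback = f"missing keys: {missing}"
        except json.JSONDecodeError as e:
            feedback = f"invalid JSON: {e}"
        # Feed the validation error back into the next attempt
        prompt = (f"{prompt}\n\nYour last output was invalid ({feedback}). "
                  "Return valid JSON with the required keys.")
    raise ValueError("Output failed validation after retries")
```

A real implementation would validate against a full JSON Schema or Pydantic model rather than a key list.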

Graceful Degradation

Degradation levels:

  • Full Service: all systems operational; complete, detailed response
  • Reduced: some tools unavailable; partial answer with a disclaimer
  • Minimal: only the LLM available; best-effort answer from training data
  • Cached: all APIs down; return cached or pre-computed answers
  • Failure: critical failure; clear error message plus escalation

Deep Dive: Self-Correction Patterns

The most powerful error recovery for LLM agents is self-correction: feeding the error back to the model and asking it to fix its approach. For example, if a tool call returns a validation error, append the error to the conversation and prompt "The previous tool call failed with: [error]. Please fix the parameters and try again." This works because LLMs can reason about their mistakes when given explicit feedback. Limit self-correction to 2-3 attempts to avoid token waste on unfixable errors.
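This feedback loop can be sketched in a few lines. Here `chat` (returns the model's next tool call) and `run_tool` (executes it) are hypothetical stand-ins for your agent's model client and tool executor:

```python
def self_correcting_tool_call(chat, run_tool, messages, max_attempts=3):
    """Let the model retry a failed tool call after seeing the error."""
    for _ in range(max_attempts):
        tool_call = chat(messages)
        try:
            return run_tool(tool_call)
        except Exception as e:
            # Append the error so the model can reason about its mistake
            messages.append({
                "role": "user",
                "content": f"The previous tool call failed with: {e}. "
                           "Please fix the parameters and try again.",
            })
    raise RuntimeError("Tool call failed after self-correction attempts")
```

Note the hard cap on attempts, matching the 2-3 attempt limit recommended above.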

Key Takeaway: Implement graceful degradation: when an agent cannot complete its primary task, it should fall back to a simpler approach or provide a partial result with an explanation, rather than failing silently or returning an error.
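A degradation ladder can be sketched as ordered fallthrough: try each level and drop to the next on failure. The provider callables (`full_pipeline`, `llm_only`) and the cache dict are hypothetical stand-ins:

```python
def answer_with_degradation(question, full_pipeline, llm_only, cache):
    """Walk the degradation ladder: full service -> LLM-only -> cache -> failure."""
    try:
        return full_pipeline(question)        # Full Service: tools + LLM
    except Exception:
        pass
    try:
        answer = llm_only(question)           # Minimal: LLM only
        return f"{answer}\n\n(Note: live tools unavailable; best-effort answer.)"
    except Exception:
        pass
    if question in cache:                     # Cached: pre-computed answers
        return f"{cache[question]} (cached answer)"
    # Failure: clear message and escalation instead of silent failure
    return "Sorry, I can't answer right now. This issue has been escalated."
```

Each rung returns a usable (if degraded) response with an explanation, never a bare stack trace.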

Quick Reference

  • Exponential Backoff: increase delay between retries (delay = base * 2^attempt + jitter)
  • Circuit Breaker: stop calling failing services (track failures, open after a threshold)
  • Timeout: set a max wait time per tool call (e.g., 30s for APIs, 60s for complex tasks)
  • Max Steps: limit agent loop iterations to prevent infinite loops (usually 10-25)
  • Logging: record all errors and decisions (structured logs with trace IDs)
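The circuit breaker entry above can be sketched as a small wrapper that counts consecutive failures and refuses calls ("opens") once a threshold is hit; after a cooldown it allows one trial call. The threshold and timing values are illustrative:

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold      # consecutive failures before opening
        self.reset_after = reset_after  # seconds before allowing a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("Circuit open: skipping failing service")
            self.opened_at = None  # half-open: permit one trial call
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Once open, the breaker fails fast instead of burning retries and tokens against a service that keeps failing.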