Error Handling & Recovery


Error Types in Agent Systems

Why Error Handling Matters

The Problem: Agents interact with external APIs, databases, and LLMs -- all of which can fail. Without proper error handling, a single failure can crash the entire agent workflow.

The Solution: Robust error handling with retry strategies, fallback chains, and graceful degradation keeps agents running reliably even when individual components fail.

Real Impact: Production agents with proper error handling achieve 99%+ uptime versus 80-90% without it.

Real-World Analogy

Think of error handling like a pilot handling in-flight problems:

  • Retry = Toggle the switch again -- maybe it was a momentary glitch
  • Fallback = Switch to the backup system when primary fails
  • Self-Correction = Adjust altitude based on new instrument readings
  • Graceful Degradation = Land at a closer airport instead of the destination
  • Circuit Breaker = Stop trying a system that keeps failing

Common Error Categories

Tool Errors

API timeouts, rate limits, invalid parameters, network failures. Most common and most recoverable.

LLM Errors

Rate limits, context overflow, content filtering, malformed output. Requires retry or model switching.

Logic Errors

Agent enters infinite loops, makes contradictory decisions, or fails to make progress. Needs loop detection.

Data Errors

Invalid input data, schema mismatches, encoding issues. Requires validation and sanitization.

Key Takeaway: Agent errors fall into distinct categories: tool errors (API failures, timeouts, rate limits), LLM errors (hallucinations, refusals, malformed tool calls), logic errors (infinite loops, contradictory decisions, no progress), and data errors (invalid input, schema mismatches). Each requires a different handling strategy.
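The categories above can be modeled as an exception hierarchy so that handlers can catch broad or narrow failure classes. This is a minimal sketch; the class names are illustrative, not from any specific library:

```python
class AgentError(Exception):
    """Base class for all agent-level failures."""

class ToolError(AgentError):
    """Tool failures: API timeouts, invalid parameters, network errors."""

class RateLimitError(ToolError):
    """HTTP 429-style throttling; usually retryable with backoff."""

class LLMError(AgentError):
    """Model failures: context overflow, content filtering, malformed output."""

class LogicError(AgentError):
    """Agent loop problems: no progress, contradictory decisions."""

class DataError(AgentError):
    """Invalid input data, schema mismatches, encoding issues."""
```

A handler can then catch `ToolError` to cover every tool failure, or `RateLimitError` specifically when backoff is the right response.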

Retry Strategies

Error Handling Decision Tree
[Flowchart] Error occurs -> Is it retryable? Yes: retry with backoff; if retries fail, try a fallback, then degrade gracefully. No (fatal error): report and stop.
error_handling.py
import time
import random

# Application-defined exception types (minimal stubs shown here)
class RateLimitError(Exception): pass
class ToolError(Exception): pass
class APIError(Exception): pass

def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the error
            # Exponential backoff with jitter to avoid thundering-herd retries
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.1)
            time.sleep(delay + jitter)
        except (ToolError, APIError) as e:
            # Return the error to the agent for self-correction
            return {"error": str(e), "retryable": True}

class FallbackChain:
    def __init__(self, providers):
        self.providers = providers  # e.g. [gpt4, claude, local_model]

    def call(self, messages):
        for provider in self.providers:
            try:
                return provider.chat(messages)
            except Exception:
                continue  # try the next provider in the chain
        raise RuntimeError("All providers failed")
Output (retry with backoff)
Attempt 1: API rate limited (429), retrying in 1s...
Attempt 2: API rate limited (429), retrying in 2s...
Attempt 3: Success - response received in 340ms

Common Mistake

Wrong: Retrying all errors with the same strategy

Why it fails: Retrying a 400 Bad Request wastes tokens -- the same malformed request will fail every time. Retrying a hallucinated tool call with the same prompt often produces the same hallucination.

Instead: Classify errors: retry transient errors (429, 500, timeouts) with backoff; for LLM errors, modify the prompt to include the error feedback; for permanent errors (invalid API key, missing resource), fail fast with a clear message.
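The classification step can be sketched as a small dispatch function keyed on HTTP-style status codes. The code mapping is illustrative; real systems would also inspect exception types and response bodies:

```python
def classify_error(status_code: int) -> str:
    """Map a status code to a handling strategy (illustrative mapping)."""
    if status_code in (429, 500, 502, 503, 504):
        return "retry"      # transient: retry with exponential backoff
    if status_code in (400, 401, 403, 404, 422):
        return "fail_fast"  # permanent: the same request will fail again
    return "fallback"       # unknown: try an alternative provider or tool
```

Calling `classify_error(429)` yields `"retry"`, while `classify_error(400)` yields `"fail_fast"`, so the agent never burns tokens re-sending a malformed request.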

Fallback Chains

Fallback strategies and when to use them:

  • Model Fallback: primary model unavailable (GPT-4o -> Claude -> Llama)
  • Tool Fallback: primary tool fails (Google Search -> Bing -> DuckDuckGo)
  • Strategy Fallback: primary approach fails (RAG -> Web Search -> Cached Answer)
  • Quality Fallback: trade quality for reliability (Detailed answer -> Summary -> "I cannot help")

Self-Correction

Self-Correction Patterns

  • Output Validation: Check agent output against expected schema, retry if invalid
  • Reflection: Ask the LLM to evaluate its own response and identify errors
  • Test-Driven: Run generated code, feed errors back for fixing
  • Critic Agent: A separate agent evaluates and requests corrections
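The Output Validation pattern above can be sketched as a loop that checks the model's output against an expected shape and re-prompts with the validation error on failure. Here `call_llm` is a hypothetical stand-in for your model client, and the required-keys check is a simplified schema:

```python
import json

def validated_call(call_llm, prompt, required_keys, max_attempts=3):
    """Retry an LLM call until its JSON output contains required_keys."""
    for attempt in range(max_attempts):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
            missing = [k for k in required_keys if k not in data]
            if not missing:
                return data
            feedback = f"missing keys: {missing}"
        except json.JSONDecodeError as e:
            feedback = f"invalid JSON: {e}"
        # Feed the validation error back into the next attempt
        prompt = (f"{prompt}\n\nYour last output was invalid ({feedback}). "
                  "Return valid JSON with the required keys.")
    raise ValueError("Output failed validation after retries")
```

A real implementation would validate against a full JSON Schema or Pydantic model rather than a key list.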

Graceful Degradation

Degradation levels:

  • Full Service: all systems operational; complete, detailed response
  • Reduced: some tools unavailable; partial answer with a disclaimer
  • Minimal: only the LLM available; best-effort answer from training data
  • Cached: all APIs down; return cached or pre-computed answers
  • Failure: critical failure; clear error message plus escalation

Deep Dive: Self-Correction Patterns

The most powerful error recovery for LLM agents is self-correction: feeding the error back to the model and asking it to fix its approach. For example, if a tool call returns a validation error, append the error to the conversation and prompt "The previous tool call failed with: [error]. Please fix the parameters and try again." This works because LLMs can reason about their mistakes when given explicit feedback. Limit self-correction to 2-3 attempts to avoid token waste on unfixable errors.
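This feedback loop can be sketched in a few lines. Here `chat` (returns the model's next tool call) and `run_tool` (executes it) are hypothetical stand-ins for your agent's model client and tool executor:

```python
def self_correcting_tool_call(chat, run_tool, messages, max_attempts=3):
    """Let the model retry a failed tool call after seeing the error."""
    for _ in range(max_attempts):
        tool_call = chat(messages)
        try:
            return run_tool(tool_call)
        except Exception as e:
            # Append the error so the model can reason about its mistake
            messages.append({
                "role": "user",
                "content": f"The previous tool call failed with: {e}. "
                           "Please fix the parameters and try again.",
            })
    raise RuntimeError("Tool call failed after self-correction attempts")
```

Note the hard cap on attempts, matching the 2-3 attempt limit recommended above.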

Key Takeaway: Implement graceful degradation: when an agent cannot complete its primary task, it should fall back to a simpler approach or provide a partial result with an explanation, rather than failing silently or returning an error.
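A degradation ladder can be sketched as ordered fallthrough: try each level and drop to the next on failure. The provider callables (`full_pipeline`, `llm_only`) and the cache dict are hypothetical stand-ins:

```python
def answer_with_degradation(question, full_pipeline, llm_only, cache):
    """Walk the degradation ladder: full service -> LLM-only -> cache -> failure."""
    try:
        return full_pipeline(question)        # Full Service: tools + LLM
    except Exception:
        pass
    try:
        answer = llm_only(question)           # Minimal: LLM only
        return f"{answer}\n\n(Note: live tools unavailable; best-effort answer.)"
    except Exception:
        pass
    if question in cache:                     # Cached: pre-computed answers
        return f"{cache[question]} (cached answer)"
    # Failure: clear message and escalation instead of silent failure
    return "Sorry, I can't answer right now. This issue has been escalated."
```

Each rung returns a usable (if degraded) response with an explanation, never a bare stack trace.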

Quick Reference

  • Exponential Backoff: increase delay between retries (delay = base * 2^attempt + jitter)
  • Circuit Breaker: stop calling failing services (track failures, open after a threshold)
  • Timeout: set a max wait time per tool call (e.g., 30s for APIs, 60s for complex tasks)
  • Max Steps: limit agent loop iterations to prevent infinite loops (usually 10-25)
  • Logging: record all errors and decisions (structured logs with trace IDs)
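The circuit breaker entry above can be sketched as a small wrapper that counts consecutive failures and refuses calls ("opens") once a threshold is hit; after a cooldown it allows one trial call. The threshold and timing values are illustrative:

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold      # consecutive failures before opening
        self.reset_after = reset_after  # seconds before allowing a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("Circuit open: skipping failing service")
            self.opened_at = None  # half-open: permit one trial call
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Once open, the breaker fails fast instead of burning retries and tokens against a service that keeps failing.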