🧪 AI Experimentation & A/B Testing

Master statistical testing, experiment design, and data-driven decision making for AI products

🎯 Why Experimentation Matters in AI

📊 Data-Driven Decisions

Remove guesswork from product development. Every feature change is validated with real user data.

67% better decisions
2.5x ROI increase

🎯 Risk Mitigation

Test changes on small user segments before full rollout, preventing costly mistakes.

  • Catch performance regressions early
  • Validate business assumptions
  • Protect user experience
  • Ensure model improvements

🚀 Innovation Velocity

Ship faster with confidence. Multiple experiments run in parallel accelerate learning.

✓ 10x faster iteration cycles
✓ 3x more features tested

📈 ROI Calculator for A/B Testing

Enter your metrics to calculate the potential ROI of A/B testing...

📊 Industry Impact

Company | Experiment | Result | Business Impact
Google | 41 shades of blue | +$200M revenue | Optimal link color
Amazon | 1-Click ordering | +35% conversion | Billions in revenue
Netflix | Personalized thumbnails | +30% engagement | Reduced churn
Booking.com | Urgency messaging | +10% bookings | Market leadership
LinkedIn | AI recommendations | +50% connections | Network growth

🎓 Key Principles

Principle 1: "If you can't measure it, you can't improve it" - Every AI feature should be tested before full deployment.
Principle 2: Small improvements compound - a 1% improvement every week adds up to roughly a 68% annual gain (1.01^52 ≈ 1.68).
Principle 3: Not everything that can be measured matters - Focus on metrics that drive business value.

📚 Experimentation Fundamentals

🔬 The Scientific Method for Product Development

1. Hypothesis

"AI recommendations will increase engagement"

2. Design

Control vs Treatment groups

3. Execute

Run for statistical significance

4. Analyze

Make data-driven decision

📊 Core Statistical Concepts

Hypothesis Testing

Null vs Alternative Hypothesis
# Null Hypothesis (H₀)
H₀: μ_treatment = μ_control
"There is no difference between groups"

# Alternative Hypothesis (H₁)
H₁: μ_treatment ≠ μ_control
"There is a significant difference"

# Decision Rule
if p_value < α (0.05):
    reject H₀  # Treatment has effect
else:
    fail to reject H₀  # No evidence of effect

Statistical Significance

How unlikely the observed difference would be if there were truly no effect

  • p-value: Probability of seeing a result at least this extreme if H₀ is true
  • α (alpha): Significance level (typically 0.05)
  • Confidence: 1 - α (typically 95%)
Rule: p < 0.05 = Statistically significant

Statistical Power

Probability of detecting a true effect

Power Analysis

80% power (recommended minimum)

Factors affecting power:

  • Sample size (larger = more power)
  • Effect size (larger = easier to detect)
  • Variance (lower = more power)
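
These factors can be tied together with a quick back-of-the-envelope calculation. The sketch below uses the normal approximation for a two-proportion test; the rates and sample sizes are illustrative assumptions, not benchmarks.

Power Approximation (sketch)
from scipy.stats import norm

def approximate_power(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-proportion z-test."""
    se = (p_control * (1 - p_control) / n_per_group
          + p_treatment * (1 - p_treatment) / n_per_group) ** 0.5
    z_alpha = norm.ppf(1 - alpha / 2)        # critical value for the two-sided test
    z_effect = abs(p_treatment - p_control) / se
    return norm.cdf(z_effect - z_alpha)      # probability of detecting the true effect

# Bigger samples and bigger effects raise power; higher variance lowers it
print(approximate_power(0.10, 0.11, n_per_group=5_000))    # ≈ 0.37
print(approximate_power(0.10, 0.11, n_per_group=15_000))   # ≈ 0.81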

🎯 Types of Tests

Test Type | Use Case | Example | Pros | Cons
A/B Test | Compare two versions | Button color | Simple, clear results | One variable at a time
A/B/n Test | Multiple variants | 3+ headlines | Test many options | Requires more traffic
Multivariate | Multiple variables | Layout + color + text | Interaction effects | Very large sample needed
Bandit | Optimize while testing | Dynamic allocation | Minimize opportunity cost | Complex analysis
Holdout | Long-term effects | Algorithm changes | Measure cumulative impact | Reduces test velocity

๐Ÿ“ Sample Size Calculation

Sample Size Calculator

Enter parameters to calculate required sample size...
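
For a conversion-rate test, the required sample per variant can be approximated with the standard two-proportion formula. A minimal sketch (the baseline and lift below are illustrative assumptions):

Sample Size Sketch
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)     # e.g. 10% baseline with a 10% lift -> 11%
    z_alpha = norm.ppf(1 - alpha / 2)      # 1.96 for 95% confidence
    z_beta = norm.ppf(power)               # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2) + 1

# Detecting a 10% relative lift on a 10% baseline at α = 0.05 and 80% power
print(sample_size_per_variant(0.10, 0.10))   # ≈ 14,700 users per variant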

🎨 Key Metrics Types

Primary Metrics

Main success indicators

  • Conversion rate
  • Revenue per user
  • Engagement rate
  • Retention

Secondary Metrics

Supporting indicators

  • Click-through rate
  • Time on page
  • Bounce rate
  • Feature adoption

Guardrail Metrics

Protect from harm

  • Page load time
  • Error rates
  • User complaints
  • System stability

🔄 Common Experimentation Patterns

๐Ÿ—๏ธ Experiment Design Patterns

Sequential Testing

Stop early when results are clear

Sequential Analysis
import numpy as np
from scipy import stats

class SequentialTest:
    def __init__(self, alpha=0.05, beta=0.20):
        self.alpha = alpha  # Type I error
        self.beta = beta    # Type II error
        self.log_likelihood_ratio = 0
        
    def update(self, control_success, control_total, 
                treatment_success, treatment_total):
        """Update test with new data"""
        # Calculate conversion rates
        p_control = control_success / control_total
        p_treatment = treatment_success / treatment_total
        
        # Update log likelihood ratio
        # (simplified for illustration: a textbook SPRT accumulates the
        #  per-observation likelihood ratio under H1 vs H0 rather than log(p_t / p_c))
        if p_control > 0 and p_treatment > 0:
            self.log_likelihood_ratio += np.log(
                p_treatment / p_control
            )
        
        # Check stopping conditions
        upper_bound = np.log((1 - self.beta) / self.alpha)
        lower_bound = np.log(self.beta / (1 - self.alpha))
        
        if self.log_likelihood_ratio >= upper_bound:
            return "STOP: Treatment wins"
        elif self.log_likelihood_ratio <= lower_bound:
            return "STOP: No difference"
        else:
            return "CONTINUE: Need more data"

Stratified Randomization

Ensure balanced groups across segments

  • Balance by user demographics
  • Account for seasonality
  • Control for device types
  • Geographic distribution
Reduces variance by 20-40%
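
A minimal sketch of stratified (blocked) assignment: users are grouped by the stratification keys and split within each group. The field names (device, country, id) are assumptions for illustration.

Stratified Assignment (sketch)
import random
from collections import defaultdict

def stratified_assignment(users, strata_keys=("device", "country"), seed=42):
    """Randomize within each stratum so both arms stay balanced on the keys."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for user in users:
        by_stratum[tuple(user[k] for k in strata_keys)].append(user["id"])

    assignments = {}
    for stratum, ids in by_stratum.items():
        rng.shuffle(ids)
        for i, user_id in enumerate(ids):
            # Alternate inside the stratum -> near 50/50 split per segment
            assignments[user_id] = "treatment" if i % 2 == 0 else "control"
    return assignments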

Switchback Testing

For marketplace and network effects

Hour 1: A
Hour 2: B
Hour 3: A
Hour 4: B

📊 Analysis Patterns

P-Value Simulator

Enter your test results to calculate statistical significance...
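
The underlying calculation can be reproduced with a standard two-proportion z-test; a minimal sketch with illustrative counts:

Two-Proportion Z-Test (sketch)
import numpy as np
from scipy.stats import norm

def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
    """Two-sided z-test for the difference between two conversion rates."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)                  # pooled rate under H₀
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    return z, 2 * (1 - norm.cdf(abs(z)))                      # two-tailed p-value

z, p = two_proportion_z_test(conv_c=500, n_c=5_000, conv_t=570, n_t=5_000)
print(f"z = {z:.2f}, p = {p:.3f}")   # ≈ z = 2.26, p = 0.024 -> significant at α = 0.05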

🚨 Common Pitfalls & Solutions

Pitfall | Description | Impact | Solution
Peeking | Checking results too early | Inflated false positives | Sequential testing or fixed horizon
Multiple Testing | Testing many metrics | Type I error inflation | Bonferroni correction
Simpson's Paradox | Aggregate vs segment results differ | Wrong conclusions | Segment analysis
Novelty Effect | Initial excitement bias | Overestimated impact | Longer test duration
Sample Ratio Mismatch | Unequal group sizes | Invalid results | SRM detection
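
The last pitfall, Sample Ratio Mismatch, is cheap to check automatically: compare the observed split against the designed split with a chi-square test. A minimal sketch:

SRM Check (sketch)
from scipy.stats import chisquare

def srm_p_value(control_n, treatment_n, expected_split=(0.5, 0.5)):
    """Test whether the observed traffic split matches the designed split."""
    total = control_n + treatment_n
    expected = [total * expected_split[0], total * expected_split[1]]
    _, p_value = chisquare([control_n, treatment_n], f_exp=expected)
    return p_value

# A very small p-value (commonly p < 0.001) suggests broken randomization
print(srm_p_value(50_000, 48_900))   # ≈ 0.0005 -> investigate before trusting results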

🎯 Metric Selection Patterns

Metric Framework
class MetricFramework:
    def __init__(self):
        self.metrics = {
            'primary': [],
            'secondary': [],
            'guardrails': []
        }
    
    def add_primary_metric(self, metric, success_criteria):
        """Add primary success metric"""
        self.metrics['primary'].append({
            'name': metric,
            'type': 'primary',
            'success_criteria': success_criteria,
            'weight': 1.0
        })
    
    def add_guardrail(self, metric, threshold):
        """Add guardrail metric"""
        self.metrics['guardrails'].append({
            'name': metric,
            'type': 'guardrail',
            'threshold': threshold,
            'direction': 'no_harm'  # Metric must stay at or below its threshold
        })
    
    def evaluate_experiment(self, results):
        """Evaluate if experiment is successful"""
        decision = {
            'ship': True,
            'reasons': []
        }
        
        # Check primary metrics
        for metric in self.metrics['primary']:
            if not self.meets_criteria(results[metric['name']], 
                                      metric['success_criteria']):
                decision['ship'] = False
                decision['reasons'].append(
                    f"{metric['name']} did not meet success criteria"
                )
        
        # Check guardrails
        for guardrail in self.metrics['guardrails']:
            if self.violates_guardrail(results[guardrail['name']], 
                                       guardrail['threshold']):
                decision['ship'] = False
                decision['reasons'].append(
                    f"{guardrail['name']} guardrail violated"
                )
        
        return decision
    
    def meets_criteria(self, result, criteria):
        """Check if result meets success criteria"""
        return result['lift'] >= criteria['min_lift'] and \
               result['p_value'] < criteria['significance_level']
    
    def violates_guardrail(self, result, threshold):
        """Check if guardrail is violated (metric exceeded its threshold)"""
        return result['value'] > threshold

# Usage Example
framework = MetricFramework()
framework.add_primary_metric(
    'conversion_rate',
    {'min_lift': 0.02, 'significance_level': 0.05}
)
framework.add_guardrail('page_load_time', threshold=2.0)
framework.add_guardrail('error_rate', threshold=0.01)
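
A hypothetical results payload, shaped the way meets_criteria and violates_guardrail read it, could then be evaluated like this:

results = {
    'conversion_rate': {'lift': 0.035, 'p_value': 0.01},   # clears the 2% minimum lift
    'page_load_time': {'value': 1.8},                      # under the 2.0s guardrail
    'error_rate': {'value': 0.004},                        # under the 1% guardrail
}

decision = framework.evaluate_experiment(results)
print(decision)   # {'ship': True, 'reasons': []}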

🔄 Iteration Patterns

Fast Iteration

Week 1: Test A
Week 2: Test B
Week 3: Test C
Week 4: Winner

Rapid testing for quick wins

Staged Rollout

Rollout Progress

1% → 5% → 20% → 100%

Feature Flags

Feature Flag Pattern
if feature_flag.is_enabled('new_ai_model', user_id):
    result = new_model.predict(data)
else:
    result = old_model.predict(data)

Decouple deployment from release
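
The feature_flag client above is left abstract; one common way to implement it is deterministic hash bucketing, so a given user always gets the same answer and the rollout percentage can be raised without re-randomizing. A minimal sketch (not any specific library's API):

Percentage Rollout (sketch)
import hashlib

class PercentageFlag:
    """Toy feature flag: stable per user, adjustable rollout percentage."""
    def __init__(self, rollout_percent):
        self.rollout_percent = rollout_percent

    def is_enabled(self, flag_name, user_id):
        key = f"{flag_name}:{user_id}".encode()
        bucket = int(hashlib.sha1(key).hexdigest(), 16) % 100
        return bucket < self.rollout_percent   # raise the percentage to widen the rollout

feature_flag = PercentageFlag(rollout_percent=5)   # stage two of 1% -> 5% -> 20% -> 100%
print(feature_flag.is_enabled('new_ai_model', user_id='u_12345'))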

💻 Hands-On Practice

🧮 Statistical Significance Calculator

Test Your Results

Control Group
Treatment Group

Enter your test data to check for statistical significance...

📊 Confidence Interval Visualizer

Visualize Uncertainty

Configure parameters to visualize confidence interval...
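
The interval itself comes straight from the standard error. A minimal sketch for a single conversion rate (the counts are illustrative):

Confidence Interval (sketch)
import numpy as np
from scipy.stats import norm

def proportion_ci(conversions, visitors, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    p = conversions / visitors
    se = np.sqrt(p * (1 - p) / visitors)
    z = norm.ppf(1 - (1 - confidence) / 2)   # 1.96 for 95%
    return p - z * se, p + z * se

low, high = proportion_ci(550, 5_000)
print(f"Observed 11.0%, 95% CI: [{low:.1%}, {high:.1%}]")   # ≈ [10.1%, 11.9%]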

🎲 Power Analysis Tool

Check Test Power

Enter parameters to calculate statistical power...

🔬 Experiment Simulator

Complete A/B Test Simulation
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

class ABTestSimulator:
    def __init__(self, control_rate, treatment_rate, daily_traffic):
        self.control_rate = control_rate
        self.treatment_rate = treatment_rate
        self.daily_traffic = daily_traffic
        self.results = []
        
    def run_day(self):
        """Simulate one day of the experiment"""
        # Split traffic 50/50
        control_n = self.daily_traffic // 2
        treatment_n = self.daily_traffic // 2
        
        # Generate conversions
        control_conversions = np.random.binomial(
            control_n, self.control_rate
        )
        treatment_conversions = np.random.binomial(
            treatment_n, self.treatment_rate
        )
        
        return {
            'control_visitors': control_n,
            'control_conversions': control_conversions,
            'treatment_visitors': treatment_n,
            'treatment_conversions': treatment_conversions
        }
    
    def run_experiment(self, days):
        """Run full experiment"""
        cumulative_control_v = 0
        cumulative_control_c = 0
        cumulative_treatment_v = 0
        cumulative_treatment_c = 0
        
        for day in range(1, days + 1):
            day_results = self.run_day()
            
            cumulative_control_v += day_results['control_visitors']
            cumulative_control_c += day_results['control_conversions']
            cumulative_treatment_v += day_results['treatment_visitors']
            cumulative_treatment_c += day_results['treatment_conversions']
            
            # Calculate current statistics
            control_rate = cumulative_control_c / cumulative_control_v
            treatment_rate = cumulative_treatment_c / cumulative_treatment_v
            
            # Perform statistical test
            test_result = self.statistical_test(
                cumulative_control_c, cumulative_control_v,
                cumulative_treatment_c, cumulative_treatment_v
            )
            
            self.results.append({
                'day': day,
                'control_rate': control_rate,
                'treatment_rate': treatment_rate,
                'lift': (treatment_rate - control_rate) / control_rate,
                'p_value': test_result['p_value'],
                'significant': test_result['p_value'] < 0.05,
                'confidence_interval': test_result['ci']
            })
        
        return pd.DataFrame(self.results)
    
    def statistical_test(self, c_conv, c_total, t_conv, t_total):
        """Perform chi-square test"""
        contingency_table = [
            [c_conv, c_total - c_conv],
            [t_conv, t_total - t_conv]
        ]
        
        chi2, p_value, dof, expected = stats.chi2_contingency(
            contingency_table
        )
        
        # Calculate confidence interval for lift
        p_c = c_conv / c_total
        p_t = t_conv / t_total
        se_c = np.sqrt(p_c * (1 - p_c) / c_total)
        se_t = np.sqrt(p_t * (1 - p_t) / t_total)
        se_diff = np.sqrt(se_c**2 + se_t**2)
        
        diff = p_t - p_c
        ci_lower = diff - 1.96 * se_diff
        ci_upper = diff + 1.96 * se_diff
        
        return {
            'p_value': p_value,
            'ci': (ci_lower, ci_upper)
        }
    
    def plot_results(self):
        """Visualize experiment results over time"""
        df = pd.DataFrame(self.results)
        
        fig, axes = plt.subplots(2, 2, figsize=(12, 8))
        
        # Conversion rates over time
        axes[0, 0].plot(df['day'], df['control_rate'], 
                       label='Control', color='blue')
        axes[0, 0].plot(df['day'], df['treatment_rate'], 
                       label='Treatment', color='green')
        axes[0, 0].set_title('Conversion Rates')
        axes[0, 0].set_xlabel('Day')
        axes[0, 0].set_ylabel('Rate')
        axes[0, 0].legend()
        
        # P-value over time
        axes[0, 1].plot(df['day'], df['p_value'], color='red')
        axes[0, 1].axhline(y=0.05, color='gray', linestyle='--',
                          label='α = 0.05')
        axes[0, 1].set_title('P-Value Evolution')
        axes[0, 1].set_xlabel('Day')
        axes[0, 1].set_ylabel('P-Value')
        axes[0, 1].legend()
        
        # Lift with confidence interval
        axes[1, 0].plot(df['day'], df['lift'] * 100, color='purple')
        axes[1, 0].fill_between(df['day'], 
                               df['confidence_interval'].apply(lambda x: x[0] * 100),
                               df['confidence_interval'].apply(lambda x: x[1] * 100),
                               alpha=0.3, color='purple')
        axes[1, 0].set_title('Lift % with 95% CI')
        axes[1, 0].set_xlabel('Day')
        axes[1, 0].set_ylabel('Lift %')
        
        # Significance indicator
        colors = ['green' if sig else 'red' 
                 for sig in df['significant']]
        axes[1, 1].bar(df['day'], df['significant'], color=colors)
        axes[1, 1].set_title('Statistical Significance')
        axes[1, 1].set_xlabel('Day')
        axes[1, 1].set_ylabel('Significant (1) or Not (0)')
        
        plt.tight_layout()
        plt.show()

# Run simulation
simulator = ABTestSimulator(
    control_rate=0.10,      # 10% baseline
    treatment_rate=0.11,    # 11% treatment (10% lift)
    daily_traffic=1000
)

results = simulator.run_experiment(days=30)
print(results.tail())
simulator.plot_results()

📈 Test Duration Calculator

How Long Should You Run Your Test?

Enter your test parameters to calculate optimal duration...
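
Duration is mostly arithmetic once the sample size is known: divide the required sample per variant by the traffic each variant actually receives, then round up (ideally to whole weeks to cover weekly seasonality). A minimal sketch with illustrative numbers:

Duration Estimate (sketch)
import math

def test_duration_days(sample_per_variant, daily_traffic,
                       n_variants=2, traffic_allocation=1.0):
    """Days needed for every variant to reach the required sample size."""
    daily_per_variant = daily_traffic * traffic_allocation / n_variants
    return math.ceil(sample_per_variant / daily_per_variant)

# ≈ 14,700 users per variant (from the sample size sketch) at 2,000 visits/day
print(test_duration_days(14_700, daily_traffic=2_000))   # 15 days; round up to full weeks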

🎯 Segmentation Analysis

Segment | Control CR | Treatment CR | Lift | P-Value | Decision
New Users | 3.2% | 4.1% | +28% | 0.002 | Ship
Returning Users | 8.5% | 8.3% | -2% | 0.451 | No Effect
Mobile | 2.1% | 2.8% | +33% | 0.012 | Ship
Desktop | 5.4% | 5.2% | -4% | 0.623 | No Effect
High Value | 15.2% | 14.8% | -3% | 0.089 | Monitor
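
A breakdown like the table above can be produced by running the same two-proportion test per segment. In this sketch the column names (segment, variant, converted) are assumptions, and the more segments you slice, the more a multiple-testing correction matters.

Segment Analysis (sketch)
import pandas as pd
from scipy.stats import norm

def segment_lift_report(df):
    """Per-segment lift and two-sided p-value for a binary conversion metric."""
    rows = []
    for segment, seg in df.groupby('segment'):
        control = seg.loc[seg['variant'] == 'control', 'converted']
        treatment = seg.loc[seg['variant'] == 'treatment', 'converted']
        p_c, p_t = control.mean(), treatment.mean()
        p_pool = seg['converted'].mean()
        se = (p_pool * (1 - p_pool) * (1 / len(control) + 1 / len(treatment))) ** 0.5
        z = (p_t - p_c) / se
        rows.append({
            'segment': segment,
            'control_cr': p_c,
            'treatment_cr': p_t,
            'lift': (p_t - p_c) / p_c,
            'p_value': 2 * (1 - norm.cdf(abs(z))),
        })
    return pd.DataFrame(rows)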

🚀 Advanced Experimentation

🎰 Multi-Armed Bandits

Thompson Sampling Implementation
import numpy as np
from scipy.stats import beta

class ThompsonSamplingBandit:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.successes = np.zeros(n_arms)
        self.failures = np.zeros(n_arms)
        self.total_rewards = 0
        self.counts = np.zeros(n_arms)
        
    def select_arm(self):
        """Select arm using Thompson Sampling"""
        # Sample from Beta distribution for each arm
        samples = []
        for arm in range(self.n_arms):
            # Beta(α, β) where α = successes + 1, β = failures + 1
            sample = beta.rvs(
                self.successes[arm] + 1,
                self.failures[arm] + 1
            )
            samples.append(sample)
        
        # Select arm with highest sampled value
        return np.argmax(samples)
    
    def update(self, arm, reward):
        """Update arm statistics"""
        self.counts[arm] += 1
        if reward == 1:
            self.successes[arm] += 1
            self.total_rewards += 1
        else:
            self.failures[arm] += 1
    
    def get_arm_probabilities(self):
        """Get probability of selecting each arm"""
        n_simulations = 10000
        selections = np.zeros(self.n_arms)
        
        for _ in range(n_simulations):
            arm = self.select_arm()
            selections[arm] += 1
        
        return selections / n_simulations
    
    def run_experiment(self, true_rates, n_rounds):
        """Run bandit experiment"""
        rewards_history = []
        arm_history = []
        
        for round in range(n_rounds):
            # Select arm
            arm = self.select_arm()
            arm_history.append(arm)
            
            # Get reward (simulate conversion)
            reward = np.random.binomial(1, true_rates[arm])
            rewards_history.append(reward)
            
            # Update statistics
            self.update(arm, reward)
            
            # Log progress
            if (round + 1) % 1000 == 0:
                avg_reward = self.total_rewards / (round + 1)
                print(f"Round {round + 1}: Avg Reward = {avg_reward:.3f}")
                print(f"Arm Selection: {self.counts / self.counts.sum()}")
        
        return {
            'total_reward': self.total_rewards,
            'arm_counts': self.counts,
            'final_rates': self.successes / np.maximum(self.counts, 1),
            'regret': self.calculate_regret(true_rates, arm_history)
        }
    
    def calculate_regret(self, true_rates, arm_history):
        """Calculate cumulative regret"""
        best_arm = np.argmax(true_rates)
        best_rate = true_rates[best_arm]
        
        cumulative_regret = 0
        for arm in arm_history:
            regret = best_rate - true_rates[arm]
            cumulative_regret += regret
        
        return cumulative_regret

# Compare Thompson Sampling vs A/B Testing
true_rates = [0.10, 0.12, 0.11, 0.09]  # True conversion rates
bandit = ThompsonSamplingBandit(n_arms=4)
results = bandit.run_experiment(true_rates, n_rounds=10000)

print(f"\nFinal Results:")
print(f"Best Arm Found: {np.argmax(results['final_rates'])}")
print(f"True Best Arm: {np.argmax(true_rates)}")
print(f"Cumulative Regret: {results['regret']:.2f}")
print(f"Traffic Allocation: {results['arm_counts'] / results['arm_counts'].sum()}")

🧬 Bayesian A/B Testing

Bayesian Approach

Update beliefs with evidence

  • Prior beliefs → Posterior
  • Probability of being best
  • Expected loss calculation
  • No p-value fixation

Advantages

  • 95% credible intervals
  • Early stopping OK

Implementation

Bayesian Test
import numpy as np

# Observed data (illustrative counts)
conv_a, n_a = 500, 5000   # control conversions / visitors
conv_b, n_b = 560, 5000   # treatment conversions / visitors

# Posterior samples with uniform Beta(1, 1) priors
samples_a = np.random.beta(conv_a + 1, n_a - conv_a + 1, 100_000)
samples_b = np.random.beta(conv_b + 1, n_b - conv_b + 1, 100_000)

# Probability B > A
prob_b_better = np.mean(samples_b > samples_a)

# Expected loss if we ship A while B is actually better
loss_choosing_a = np.maximum(samples_b - samples_a, 0).mean()

# Decision
if prob_b_better > 0.95:
    decision = "Choose B"

🔀 Multivariate Testing

Variant | Button Color | Button Text | Layout | Conversion | Interactions
Control | Blue | "Buy Now" | Standard | 5.0% | -
Var 1 | Green | "Buy Now" | Standard | 5.3% | Color effect
Var 2 | Blue | "Get Started" | Standard | 5.5% | Text effect
Var 3 | Green | "Get Started" | Standard | 6.2% | Color × Text
Var 4 | Green | "Get Started" | Centered | 7.1% | All factors

🎯 CUPED (Controlled-experiment Using Pre-Experiment Data)

CUPED Variance Reduction
import numpy as np
import pandas as pd
from scipy import stats

class CUPED:
    def __init__(self, pre_period_days=30):
        self.pre_period_days = pre_period_days

    def reduce_variance(self, df, experiment_start_date):
        """Apply CUPED to reduce metric variance"""
        # Split into pre-experiment and experiment periods
        pre_data = df[df['date'] < experiment_start_date]
        exp_data = df[df['date'] >= experiment_start_date]

        # Calculate pre-period metric for each user
        pre_metric = pre_data.groupby('user_id')['metric'].mean()

        # Merge with experiment data (users without pre-period data are dropped)
        exp_data = exp_data.merge(
            pre_metric.rename('pre_metric').reset_index(),
            on='user_id'
        )
        
        # Calculate theta (optimal coefficient)
        cov = np.cov(exp_data['metric'], exp_data['pre_metric'])[0, 1]
        var = np.var(exp_data['pre_metric'])
        theta = cov / var if var > 0 else 0
        
        # Adjust metric using CUPED
        exp_data['adjusted_metric'] = (
            exp_data['metric'] - 
            theta * (exp_data['pre_metric'] - exp_data['pre_metric'].mean())
        )
        
        # Calculate variance reduction
        original_var = np.var(exp_data['metric'])
        adjusted_var = np.var(exp_data['adjusted_metric'])
        variance_reduction = 1 - (adjusted_var / original_var)
        
        print(f"Variance Reduction: {variance_reduction:.1%}")
        print(f"This is equivalent to {1/(1-variance_reduction):.1f}x more sample")
        
        return exp_data

# Usage (experiment_data columns assumed: user_id, date, variant, metric)
cuped = CUPED(pre_period_days=30)
adjusted_data = cuped.reduce_variance(experiment_data, experiment_start_date)

# Now run test on adjusted metric
control_adjusted = adjusted_data[
    adjusted_data['variant'] == 'control'
]['adjusted_metric']
treatment_adjusted = adjusted_data[
    adjusted_data['variant'] == 'treatment'
]['adjusted_metric']

# T-test on adjusted metrics (lower variance = higher power)
t_stat, p_value = stats.ttest_ind(control_adjusted, treatment_adjusted)

📊 Network Effects & Interference

Cluster Randomization

Randomize groups instead of individuals

  • Geographic clusters
  • Social networks
  • Time-based clusters
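
A minimal sketch of cluster-level assignment, assuming each user record carries an id and a cluster field such as city; the analysis then has to treat the cluster, not the individual, as the unit of randomization.

Cluster Assignment (sketch)
import random

def cluster_randomize(users, cluster_key='city', seed=7):
    """Assign whole clusters to one arm so connected users share a treatment."""
    rng = random.Random(seed)
    clusters = sorted({user[cluster_key] for user in users})
    rng.shuffle(clusters)
    arm_of_cluster = {c: ('treatment' if i % 2 == 0 else 'control')
                      for i, c in enumerate(clusters)}
    return {user['id']: arm_of_cluster[user[cluster_key]] for user in users}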

Synthetic Control

Create control from historical data

Used when randomization impossible

Ego-Network Randomization

Randomize user + connections

User + Friends

⚡ Real-Time Decision Making

Dynamic Allocation Simulator

Variant A: 25%
Variant B: 25%
Variant C: 25%
Variant D: 25%

Each variant starts with an equal 25% share; as performance data accumulates, a bandit-style allocator shifts traffic toward the better-performing variants.

⚡ Quick Reference Guide

📋 Experimentation Checklist

✅ Pre-Launch

  • Define hypothesis
  • Choose primary metric
  • Calculate sample size
  • Set test duration
  • Check randomization
  • Set up tracking

✅ During Test

  • Monitor SRM
  • Check data quality
  • Watch guardrails
  • Document issues
  • Avoid peeking
  • Maintain test integrity

✅ Post-Test

  • Validate results
  • Check segments
  • Analyze secondary metrics
  • Document learnings
  • Make decision
  • Plan rollout

📊 Statistical Formulas

Essential Formulas
# Sample Size (per variant)
n = (Z_α + Z_β)² × 2σ² / δ²

# Where:
# Z_α = Z-score for significance (1.96 for 95%)
# Z_β = Z-score for power (0.84 for 80%)
# σ² = Variance
# δ = Minimum detectable effect

# Standard Error (proportion)
SE = sqrt(p × (1-p) / n)

# Confidence Interval
CI = p ± Z × SE

# Z-Score
Z = (p₁ - p₂) / sqrt(SE₁² + SE₂²)

# P-Value (two-tailed)
p_value = 2 × (1 - norm.cdf(abs(Z)))

# Relative Lift
lift = (treatment - control) / control × 100%

# Statistical Power
power = 1 - β

# Effect Size (Cohen's d)
d = (μ₁ - μ₂) / σ_pooled

# Chi-Square Test
χ² = Σ((O - E)² / E)

# Multiple Testing Correction (Bonferroni)
α_adjusted = α / m  # m = number of tests

# Bayesian Probability
P(B > A) = ∫∫ I(b > a) × P(a) × P(b) da db

🛠️ Tools Comparison

Tool | Best For | Features | Pricing
Optimizely | Enterprise | Full stack, Stats Engine | $$$
Google Optimize (sunset 2023) | Web testing | Visual editor, GA integration | Free
LaunchDarkly | Feature flags | Progressive rollouts | $$
Statsig | Product analytics | Auto-logging, Pulse | $
Split.io | Engineering teams | SDKs, Targeting | $$

💡 Common Mistakes to Avoid

โŒ Statistical Errors

  • Stopping tests early
  • Ignoring multiple testing
  • P-hacking
  • Cherry-picking segments
  • Ignoring power analysis

โŒ Design Errors

  • Weak hypothesis
  • Wrong metrics
  • Poor randomization
  • Contamination
  • Selection bias

โŒ Implementation Errors

  • Broken tracking
  • Bot traffic
  • Technical bugs
  • Inconsistent experience
  • Data leakage

📊 Decision Framework

Ship Decision

Primary metric: ✓ Significant
Guardrails: ✓ No harm
Segments: ✓ Consistent
→ Decision: SHIP

Iterate Decision

Primary metric: ✗ Not sig
Secondary: ✓ Positive
Learnings: ✓ Clear
→ Decision: ITERATE

Kill Decision

Primary metric: ✗ Negative
Guardrails: ✗ Violated
Cost: High
→ Decision: KILL