🧪 AI Experimentation & A/B Testing

Master statistical testing, experiment design, and data-driven decision making for AI products

🎯 Why Experimentation Matters in AI

📊 Data-Driven Decisions

Remove guesswork from product development. Every feature change is validated with real user data.

67% better decisions
2.5x ROI increase

🎯 Risk Mitigation

Test changes on small user segments before full rollout, preventing costly mistakes.

  • Catch performance regressions early
  • Validate business assumptions
  • Protect user experience
  • Ensure model improvements

🚀 Innovation Velocity

Ship faster with confidence. Multiple experiments run in parallel accelerate learning.

✓ 10x faster iteration cycles
✓ 3x more features tested

📈 ROI Calculator for A/B Testing

Enter your metrics to calculate the potential ROI of A/B testing...

📊 Industry Impact

Company | Experiment | Result | Business Impact
Google | 41 shades of blue | +$200M revenue | Optimal link color
Amazon | 1-Click ordering | +35% conversion | Billions in revenue
Netflix | Personalized thumbnails | +30% engagement | Reduced churn
Booking.com | Urgency messaging | +10% bookings | Market leadership
LinkedIn | AI recommendations | +50% connections | Network growth

🎓 Key Principles

Principle 1: "If you can't measure it, you can't improve it" - Every AI feature should be tested before full deployment.
Principle 2: Small improvements compound - a 1% improvement every week adds up to roughly a 68% annual gain (1.01^52 ≈ 1.68).
Principle 3: Not everything that can be measured matters - Focus on metrics that drive business value.

📚 Experimentation Fundamentals

🔬 The Scientific Method for Product Development

1. Hypothesis

"AI recommendations will increase engagement"

2. Design

Control vs Treatment groups

3. Execute

Run for statistical significance

4. Analyze

Make data-driven decision

📊 Core Statistical Concepts

Hypothesis Testing

Null vs Alternative Hypothesis
# Null Hypothesis (H₀)
H₀: μ_treatment = μ_control
"There is no difference between groups"

# Alternative Hypothesis (H₁)
H₁: μ_treatment ≠ μ_control
"There is a significant difference"

# Decision Rule
if p_value < α (0.05):
    reject H₀  # Treatment has effect
else:
    fail to reject H₀  # No evidence of effect

Statistical Significance

How unlikely the observed difference would be if there were truly no effect

  • p-value: Probability of seeing a result at least this extreme if H₀ is true
  • α (alpha): Significance level (typically 0.05)
  • Confidence: 1 - α (typically 95%)
Rule: p < 0.05 = Statistically significant

Statistical Power

Probability of detecting a true effect

Power Analysis

80% power (recommended minimum)

Factors affecting power:

  • Sample size (larger = more power)
  • Effect size (larger = easier to detect)
  • Variance (lower = more power)
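
These factors can be tied together with a quick back-of-the-envelope calculation. The sketch below uses the normal approximation for a two-proportion test; the rates and sample sizes are illustrative assumptions, not benchmarks.

Power Approximation (sketch)
from scipy.stats import norm

def approximate_power(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-proportion z-test."""
    se = (p_control * (1 - p_control) / n_per_group
          + p_treatment * (1 - p_treatment) / n_per_group) ** 0.5
    z_alpha = norm.ppf(1 - alpha / 2)        # critical value for the two-sided test
    z_effect = abs(p_treatment - p_control) / se
    return norm.cdf(z_effect - z_alpha)      # probability of detecting the true effect

# Bigger samples and bigger effects raise power; higher variance lowers it
print(approximate_power(0.10, 0.11, n_per_group=5_000))    # ≈ 0.37
print(approximate_power(0.10, 0.11, n_per_group=15_000))   # ≈ 0.81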

🎯 Types of Tests

Test Type | Use Case | Example | Pros | Cons
A/B Test | Compare two versions | Button color | Simple, clear results | One variable at a time
A/B/n Test | Multiple variants | 3+ headlines | Test many options | Requires more traffic
Multivariate | Multiple variables | Layout + color + text | Interaction effects | Very large sample needed
Bandit | Optimize while testing | Dynamic allocation | Minimize opportunity cost | Complex analysis
Holdout | Long-term effects | Algorithm changes | Measure cumulative impact | Reduces test velocity

๐Ÿ“ Sample Size Calculation

Sample Size Calculator

Enter parameters to calculate required sample size...
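
For a conversion-rate test, the required sample per variant can be approximated with the standard two-proportion formula. A minimal sketch (the baseline and lift below are illustrative assumptions):

Sample Size Sketch
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)     # e.g. 10% baseline with a 10% lift -> 11%
    z_alpha = norm.ppf(1 - alpha / 2)      # 1.96 for 95% confidence
    z_beta = norm.ppf(power)               # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2) + 1

# Detecting a 10% relative lift on a 10% baseline at α = 0.05 and 80% power
print(sample_size_per_variant(0.10, 0.10))   # ≈ 14,700 users per variant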

🎨 Key Metrics Types

Primary Metrics

Main success indicators

  • Conversion rate
  • Revenue per user
  • Engagement rate
  • Retention

Secondary Metrics

Supporting indicators

  • Click-through rate
  • Time on page
  • Bounce rate
  • Feature adoption

Guardrail Metrics

Protect from harm

  • Page load time
  • Error rates
  • User complaints
  • System stability

🔄 Common Experimentation Patterns

๐Ÿ—๏ธ Experiment Design Patterns

Sequential Testing

Stop early when results are clear

Sequential Analysis
import numpy as np
from scipy import stats

class SequentialTest:
    def __init__(self, alpha=0.05, beta=0.20):
        self.alpha = alpha  # Type I error
        self.beta = beta    # Type II error
        self.log_likelihood_ratio = 0
        
    def update(self, control_success, control_total, 
                treatment_success, treatment_total):
        """Update test with new data"""
        # Calculate conversion rates
        p_control = control_success / control_total
        p_treatment = treatment_success / treatment_total
        
        # Update log likelihood ratio
        # (simplified for illustration: a textbook SPRT accumulates the
        #  per-observation likelihood ratio under H1 vs H0 rather than log(p_t / p_c))
        if p_control > 0 and p_treatment > 0:
            self.log_likelihood_ratio += np.log(
                p_treatment / p_control
            )
        
        # Check stopping conditions
        upper_bound = np.log((1 - self.beta) / self.alpha)
        lower_bound = np.log(self.beta / (1 - self.alpha))
        
        if self.log_likelihood_ratio >= upper_bound:
            return "STOP: Treatment wins"
        elif self.log_likelihood_ratio <= lower_bound:
            return "STOP: No difference"
        else:
            return "CONTINUE: Need more data"

Stratified Randomization

Ensure balanced groups across segments

  • Balance by user demographics
  • Account for seasonality
  • Control for device types
  • Geographic distribution
Reduces variance by 20-40%
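
A minimal sketch of stratified (blocked) assignment: users are grouped by the stratification keys and split within each group. The field names (device, country, id) are assumptions for illustration.

Stratified Assignment (sketch)
import random
from collections import defaultdict

def stratified_assignment(users, strata_keys=("device", "country"), seed=42):
    """Randomize within each stratum so both arms stay balanced on the keys."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for user in users:
        by_stratum[tuple(user[k] for k in strata_keys)].append(user["id"])

    assignments = {}
    for stratum, ids in by_stratum.items():
        rng.shuffle(ids)
        for i, user_id in enumerate(ids):
            # Alternate inside the stratum -> near 50/50 split per segment
            assignments[user_id] = "treatment" if i % 2 == 0 else "control"
    return assignments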

Switchback Testing

For marketplace and network effects

Hour 1: A
Hour 2: B
Hour 3: A
Hour 4: B

📊 Analysis Patterns

P-Value Simulator

Enter your test results to calculate statistical significance...
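
The underlying calculation can be reproduced with a standard two-proportion z-test; a minimal sketch with illustrative counts:

Two-Proportion Z-Test (sketch)
import numpy as np
from scipy.stats import norm

def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
    """Two-sided z-test for the difference between two conversion rates."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)                  # pooled rate under H₀
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    return z, 2 * (1 - norm.cdf(abs(z)))                      # two-tailed p-value

z, p = two_proportion_z_test(conv_c=500, n_c=5_000, conv_t=570, n_t=5_000)
print(f"z = {z:.2f}, p = {p:.3f}")   # ≈ z = 2.26, p = 0.024 -> significant at α = 0.05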

🚨 Common Pitfalls & Solutions

Pitfall | Description | Impact | Solution
Peeking | Checking results too early | Inflated false positives | Sequential testing or fixed horizon
Multiple Testing | Testing many metrics | Type I error inflation | Bonferroni correction
Simpson's Paradox | Aggregate vs segment results differ | Wrong conclusions | Segment analysis
Novelty Effect | Initial excitement bias | Overestimated impact | Longer test duration
Sample Ratio Mismatch | Unequal group sizes | Invalid results | SRM detection
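
The last pitfall, Sample Ratio Mismatch, is cheap to check automatically: compare the observed split against the designed split with a chi-square test. A minimal sketch:

SRM Check (sketch)
from scipy.stats import chisquare

def srm_p_value(control_n, treatment_n, expected_split=(0.5, 0.5)):
    """Test whether the observed traffic split matches the designed split."""
    total = control_n + treatment_n
    expected = [total * expected_split[0], total * expected_split[1]]
    _, p_value = chisquare([control_n, treatment_n], f_exp=expected)
    return p_value

# A very small p-value (commonly p < 0.001) suggests broken randomization
print(srm_p_value(50_000, 48_900))   # ≈ 0.0005 -> investigate before trusting results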

🎯 Metric Selection Patterns

Metric Framework
class MetricFramework:
    def __init__(self):
        self.metrics = {
            'primary': [],
            'secondary': [],
            'guardrails': []
        }
    
    def add_primary_metric(self, metric, success_criteria):
        """Add primary success metric"""
        self.metrics['primary'].append({
            'name': metric,
            'type': 'primary',
            'success_criteria': success_criteria,
            'weight': 1.0
        })
    
    def add_guardrail(self, metric, threshold):
        """Add guardrail metric"""
        self.metrics['guardrails'].append({
            'name': metric,
            'type': 'guardrail',
            'threshold': threshold,
            'direction': 'no_harm'  # Metric must stay at or below its threshold
        })
    
    def evaluate_experiment(self, results):
        """Evaluate if experiment is successful"""
        decision = {
            'ship': True,
            'reasons': []
        }
        
        # Check primary metrics
        for metric in self.metrics['primary']:
            if not self.meets_criteria(results[metric['name']], 
                                      metric['success_criteria']):
                decision['ship'] = False
                decision['reasons'].append(
                    f"{metric['name']} did not meet success criteria"
                )
        
        # Check guardrails
        for guardrail in self.metrics['guardrails']:
            if self.violates_guardrail(results[guardrail['name']], 
                                       guardrail['threshold']):
                decision['ship'] = False
                decision['reasons'].append(
                    f"{guardrail['name']} guardrail violated"
                )
        
        return decision
    
    def meets_criteria(self, result, criteria):
        """Check if result meets success criteria"""
        return result['lift'] >= criteria['min_lift'] and \
               result['p_value'] < criteria['significance_level']
    
    def violates_guardrail(self, result, threshold):
        """Check if guardrail is violated (metric exceeded its threshold)"""
        return result['value'] > threshold

# Usage Example
framework = MetricFramework()
framework.add_primary_metric(
    'conversion_rate',
    {'min_lift': 0.02, 'significance_level': 0.05}
)
framework.add_guardrail('page_load_time', threshold=2.0)
framework.add_guardrail('error_rate', threshold=0.01)
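
A hypothetical results payload, shaped the way meets_criteria and violates_guardrail read it, could then be evaluated like this:

results = {
    'conversion_rate': {'lift': 0.035, 'p_value': 0.01},   # clears the 2% minimum lift
    'page_load_time': {'value': 1.8},                      # under the 2.0s guardrail
    'error_rate': {'value': 0.004},                        # under the 1% guardrail
}

decision = framework.evaluate_experiment(results)
print(decision)   # {'ship': True, 'reasons': []}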

🔄 Iteration Patterns

Fast Iteration

Week 1: Test A
Week 2: Test B
Week 3: Test C
Week 4: Winner

Rapid testing for quick wins

Staged Rollout

Rollout Progress

1% → 5% → 20% → 100%

Feature Flags

Feature Flag Pattern
if feature_flag.is_enabled('new_ai_model', user_id):
    result = new_model.predict(data)
else:
    result = old_model.predict(data)

Decouple deployment from release
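
The feature_flag client above is left abstract; one common way to implement it is deterministic hash bucketing, so a given user always gets the same answer and the rollout percentage can be raised without re-randomizing. A minimal sketch (not any specific library's API):

Percentage Rollout (sketch)
import hashlib

class PercentageFlag:
    """Toy feature flag: stable per user, adjustable rollout percentage."""
    def __init__(self, rollout_percent):
        self.rollout_percent = rollout_percent

    def is_enabled(self, flag_name, user_id):
        key = f"{flag_name}:{user_id}".encode()
        bucket = int(hashlib.sha1(key).hexdigest(), 16) % 100
        return bucket < self.rollout_percent   # raise the percentage to widen the rollout

feature_flag = PercentageFlag(rollout_percent=5)   # stage two of 1% -> 5% -> 20% -> 100%
print(feature_flag.is_enabled('new_ai_model', user_id='u_12345'))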

💻 Hands-On Practice

🧮 Statistical Significance Calculator

Test Your Results

Control Group
Treatment Group

Enter your test data to check for statistical significance...

📊 Confidence Interval Visualizer

Visualize Uncertainty

Configure parameters to visualize confidence interval...
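
The interval itself comes straight from the standard error. A minimal sketch for a single conversion rate (the counts are illustrative):

Confidence Interval (sketch)
import numpy as np
from scipy.stats import norm

def proportion_ci(conversions, visitors, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    p = conversions / visitors
    se = np.sqrt(p * (1 - p) / visitors)
    z = norm.ppf(1 - (1 - confidence) / 2)   # 1.96 for 95%
    return p - z * se, p + z * se

low, high = proportion_ci(550, 5_000)
print(f"Observed 11.0%, 95% CI: [{low:.1%}, {high:.1%}]")   # ≈ [10.1%, 11.9%]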

🎲 Power Analysis Tool

Check Test Power

Enter parameters to calculate statistical power...

🔬 Experiment Simulator

Complete A/B Test Simulation
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

class ABTestSimulator:
    def __init__(self, control_rate, treatment_rate, daily_traffic):
        self.control_rate = control_rate
        self.treatment_rate = treatment_rate
        self.daily_traffic = daily_traffic
        self.results = []
        
    def run_day(self):
        """Simulate one day of the experiment"""
        # Split traffic 50/50
        control_n = self.daily_traffic // 2
        treatment_n = self.daily_traffic // 2
        
        # Generate conversions
        control_conversions = np.random.binomial(
            control_n, self.control_rate
        )
        treatment_conversions = np.random.binomial(
            treatment_n, self.treatment_rate
        )
        
        return {
            'control_visitors': control_n,
            'control_conversions': control_conversions,
            'treatment_visitors': treatment_n,
            'treatment_conversions': treatment_conversions
        }
    
    def run_experiment(self, days):
        """Run full experiment"""
        cumulative_control_v = 0
        cumulative_control_c = 0
        cumulative_treatment_v = 0
        cumulative_treatment_c = 0
        
        for day in range(1, days + 1):
            day_results = self.run_day()
            
            cumulative_control_v += day_results['control_visitors']
            cumulative_control_c += day_results['control_conversions']
            cumulative_treatment_v += day_results['treatment_visitors']
            cumulative_treatment_c += day_results['treatment_conversions']
            
            # Calculate current statistics
            control_rate = cumulative_control_c / cumulative_control_v
            treatment_rate = cumulative_treatment_c / cumulative_treatment_v
            
            # Perform statistical test
            test_result = self.statistical_test(
                cumulative_control_c, cumulative_control_v,
                cumulative_treatment_c, cumulative_treatment_v
            )
            
            self.results.append({
                'day': day,
                'control_rate': control_rate,
                'treatment_rate': treatment_rate,
                'lift': (treatment_rate - control_rate) / control_rate,
                'p_value': test_result['p_value'],
                'significant': test_result['p_value'] < 0.05,
                'confidence_interval': test_result['ci']
            })
        
        return pd.DataFrame(self.results)
    
    def statistical_test(self, c_conv, c_total, t_conv, t_total):
        """Perform chi-square test"""
        contingency_table = [
            [c_conv, c_total - c_conv],
            [t_conv, t_total - t_conv]
        ]
        
        chi2, p_value, dof, expected = stats.chi2_contingency(
            contingency_table
        )
        
        # Calculate confidence interval for lift
        p_c = c_conv / c_total
        p_t = t_conv / t_total
        se_c = np.sqrt(p_c * (1 - p_c) / c_total)
        se_t = np.sqrt(p_t * (1 - p_t) / t_total)
        se_diff = np.sqrt(se_c**2 + se_t**2)
        
        diff = p_t - p_c
        ci_lower = diff - 1.96 * se_diff
        ci_upper = diff + 1.96 * se_diff
        
        return {
            'p_value': p_value,
            'ci': (ci_lower, ci_upper)
        }
    
    def plot_results(self):
        """Visualize experiment results over time"""
        df = pd.DataFrame(self.results)
        
        fig, axes = plt.subplots(2, 2, figsize=(12, 8))
        
        # Conversion rates over time
        axes[0, 0].plot(df['day'], df['control_rate'], 
                       label='Control', color='blue')
        axes[0, 0].plot(df['day'], df['treatment_rate'], 
                       label='Treatment', color='green')
        axes[0, 0].set_title('Conversion Rates')
        axes[0, 0].set_xlabel('Day')
        axes[0, 0].set_ylabel('Rate')
        axes[0, 0].legend()
        
        # P-value over time
        axes[0, 1].plot(df['day'], df['p_value'], color='red')
        axes[0, 1].axhline(y=0.05, color='gray', linestyle='--',
                          label='α = 0.05')
        axes[0, 1].set_title('P-Value Evolution')
        axes[0, 1].set_xlabel('Day')
        axes[0, 1].set_ylabel('P-Value')
        axes[0, 1].legend()
        
        # Lift with confidence interval
        axes[1, 0].plot(df['day'], df['lift'] * 100, color='purple')
        axes[1, 0].fill_between(df['day'], 
                               df['confidence_interval'].apply(lambda x: x[0] * 100),
                               df['confidence_interval'].apply(lambda x: x[1] * 100),
                               alpha=0.3, color='purple')
        axes[1, 0].set_title('Lift % with 95% CI')
        axes[1, 0].set_xlabel('Day')
        axes[1, 0].set_ylabel('Lift %')
        
        # Significance indicator
        colors = ['green' if sig else 'red' 
                 for sig in df['significant']]
        axes[1, 1].bar(df['day'], df['significant'], color=colors)
        axes[1, 1].set_title('Statistical Significance')
        axes[1, 1].set_xlabel('Day')
        axes[1, 1].set_ylabel('Significant (1) or Not (0)')
        
        plt.tight_layout()
        plt.show()

# Run simulation
simulator = ABTestSimulator(
    control_rate=0.10,      # 10% baseline
    treatment_rate=0.11,    # 11% treatment (10% lift)
    daily_traffic=1000
)

results = simulator.run_experiment(days=30)
print(results.tail())
simulator.plot_results()

📈 Test Duration Calculator

How Long Should You Run Your Test?

Enter your test parameters to calculate optimal duration...
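
Duration is mostly arithmetic once the sample size is known: divide the required sample per variant by the traffic each variant actually receives, then round up (ideally to whole weeks to cover weekly seasonality). A minimal sketch with illustrative numbers:

Duration Estimate (sketch)
import math

def test_duration_days(sample_per_variant, daily_traffic,
                       n_variants=2, traffic_allocation=1.0):
    """Days needed for every variant to reach the required sample size."""
    daily_per_variant = daily_traffic * traffic_allocation / n_variants
    return math.ceil(sample_per_variant / daily_per_variant)

# ≈ 14,700 users per variant (from the sample size sketch) at 2,000 visits/day
print(test_duration_days(14_700, daily_traffic=2_000))   # 15 days; round up to full weeks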

🎯 Segmentation Analysis

Segment | Control CR | Treatment CR | Lift | P-Value | Decision
New Users | 3.2% | 4.1% | +28% | 0.002 | Ship
Returning Users | 8.5% | 8.3% | -2% | 0.451 | No Effect
Mobile | 2.1% | 2.8% | +33% | 0.012 | Ship
Desktop | 5.4% | 5.2% | -4% | 0.623 | No Effect
High Value | 15.2% | 14.8% | -3% | 0.089 | Monitor
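
A breakdown like the table above can be produced by running the same two-proportion test per segment. In this sketch the column names (segment, variant, converted) are assumptions, and the more segments you slice, the more a multiple-testing correction matters.

Segment Analysis (sketch)
import pandas as pd
from scipy.stats import norm

def segment_lift_report(df):
    """Per-segment lift and two-sided p-value for a binary conversion metric."""
    rows = []
    for segment, seg in df.groupby('segment'):
        control = seg.loc[seg['variant'] == 'control', 'converted']
        treatment = seg.loc[seg['variant'] == 'treatment', 'converted']
        p_c, p_t = control.mean(), treatment.mean()
        p_pool = seg['converted'].mean()
        se = (p_pool * (1 - p_pool) * (1 / len(control) + 1 / len(treatment))) ** 0.5
        z = (p_t - p_c) / se
        rows.append({
            'segment': segment,
            'control_cr': p_c,
            'treatment_cr': p_t,
            'lift': (p_t - p_c) / p_c,
            'p_value': 2 * (1 - norm.cdf(abs(z))),
        })
    return pd.DataFrame(rows)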

🚀 Advanced Experimentation

🎰 Multi-Armed Bandits

Thompson Sampling Implementation
import numpy as np
from scipy.stats import beta

class ThompsonSamplingBandit:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.successes = np.zeros(n_arms)
        self.failures = np.zeros(n_arms)
        self.total_rewards = 0
        self.counts = np.zeros(n_arms)
        
    def select_arm(self):
        """Select arm using Thompson Sampling"""
        # Sample from Beta distribution for each arm
        samples = []
        for arm in range(self.n_arms):
            # Beta(α, β) where α = successes + 1, β = failures + 1
            sample = beta.rvs(
                self.successes[arm] + 1,
                self.failures[arm] + 1
            )
            samples.append(sample)
        
        # Select arm with highest sampled value
        return np.argmax(samples)
    
    def update(self, arm, reward):
        """Update arm statistics"""
        self.counts[arm] += 1
        if reward == 1:
            self.successes[arm] += 1
            self.total_rewards += 1
        else:
            self.failures[arm] += 1
    
    def get_arm_probabilities(self):
        """Get probability of selecting each arm"""
        n_simulations = 10000
        selections = np.zeros(self.n_arms)
        
        for _ in range(n_simulations):
            arm = self.select_arm()
            selections[arm] += 1
        
        return selections / n_simulations
    
    def run_experiment(self, true_rates, n_rounds):
        """Run bandit experiment"""
        rewards_history = []
        arm_history = []
        
        for round in range(n_rounds):
            # Select arm
            arm = self.select_arm()
            arm_history.append(arm)
            
            # Get reward (simulate conversion)
            reward = np.random.binomial(1, true_rates[arm])
            rewards_history.append(reward)
            
            # Update statistics
            self.update(arm, reward)
            
            # Log progress
            if (round + 1) % 1000 == 0:
                avg_reward = self.total_rewards / (round + 1)
                print(f"Round {round + 1}: Avg Reward = {avg_reward:.3f}")
                print(f"Arm Selection: {self.counts / self.counts.sum()}")
        
        return {
            'total_reward': self.total_rewards,
            'arm_counts': self.counts,
            'final_rates': self.successes / np.maximum(self.counts, 1),
            'regret': self.calculate_regret(true_rates, arm_history)
        }
    
    def calculate_regret(self, true_rates, arm_history):
        """Calculate cumulative regret"""
        best_arm = np.argmax(true_rates)
        best_rate = true_rates[best_arm]
        
        cumulative_regret = 0
        for arm in arm_history:
            regret = best_rate - true_rates[arm]
            cumulative_regret += regret
        
        return cumulative_regret

# Compare Thompson Sampling vs A/B Testing
true_rates = [0.10, 0.12, 0.11, 0.09]  # True conversion rates
bandit = ThompsonSamplingBandit(n_arms=4)
results = bandit.run_experiment(true_rates, n_rounds=10000)

print(f"\nFinal Results:")
print(f"Best Arm Found: {np.argmax(results['final_rates'])}")
print(f"True Best Arm: {np.argmax(true_rates)}")
print(f"Cumulative Regret: {results['regret']:.2f}")
print(f"Traffic Allocation: {results['arm_counts'] / results['arm_counts'].sum()}")

🧬 Bayesian A/B Testing

Bayesian Approach

Update beliefs with evidence

  • Prior beliefs → Posterior
  • Probability of being best
  • Expected loss calculation
  • No p-value fixation

Advantages

  • 95% credible intervals
  • Early stopping OK

Implementation

Bayesian Test
import numpy as np

# Observed data (illustrative counts)
conv_a, n_a = 500, 5000   # control conversions / visitors
conv_b, n_b = 560, 5000   # treatment conversions / visitors

# Posterior samples with uniform Beta(1, 1) priors
samples_a = np.random.beta(conv_a + 1, n_a - conv_a + 1, 100_000)
samples_b = np.random.beta(conv_b + 1, n_b - conv_b + 1, 100_000)

# Probability B > A
prob_b_better = np.mean(samples_b > samples_a)

# Expected loss if we ship A while B is actually better
loss_choosing_a = np.maximum(samples_b - samples_a, 0).mean()

# Decision
if prob_b_better > 0.95:
    decision = "Choose B"

🔀 Multivariate Testing

Variant | Button Color | Button Text | Layout | Conversion | Interactions
Control | Blue | "Buy Now" | Standard | 5.0% | -
Var 1 | Green | "Buy Now" | Standard | 5.3% | Color effect
Var 2 | Blue | "Get Started" | Standard | 5.5% | Text effect
Var 3 | Green | "Get Started" | Standard | 6.2% | Color × Text
Var 4 | Green | "Get Started" | Centered | 7.1% | All factors

🎯 CUPED (Controlled-experiment Using Pre-Experiment Data)

CUPED Variance Reduction
import numpy as np
import pandas as pd
from scipy import stats

class CUPED:
    def __init__(self, pre_period_days=30):
        self.pre_period_days = pre_period_days

    def reduce_variance(self, df, experiment_start_date):
        """Apply CUPED to reduce metric variance"""
        # Split into pre-experiment and experiment periods
        pre_data = df[df['date'] < experiment_start_date]
        exp_data = df[df['date'] >= experiment_start_date]

        # Calculate pre-period metric for each user
        pre_metric = pre_data.groupby('user_id')['metric'].mean()

        # Merge with experiment data (users without pre-period data are dropped)
        exp_data = exp_data.merge(
            pre_metric.rename('pre_metric').reset_index(),
            on='user_id'
        )
        
        # Calculate theta (optimal coefficient)
        cov = np.cov(exp_data['metric'], exp_data['pre_metric'])[0, 1]
        var = np.var(exp_data['pre_metric'])
        theta = cov / var if var > 0 else 0
        
        # Adjust metric using CUPED
        exp_data['adjusted_metric'] = (
            exp_data['metric'] - 
            theta * (exp_data['pre_metric'] - exp_data['pre_metric'].mean())
        )
        
        # Calculate variance reduction
        original_var = np.var(exp_data['metric'])
        adjusted_var = np.var(exp_data['adjusted_metric'])
        variance_reduction = 1 - (adjusted_var / original_var)
        
        print(f"Variance Reduction: {variance_reduction:.1%}")
        print(f"This is equivalent to {1/(1-variance_reduction):.1f}x more sample")
        
        return exp_data

# Usage (experiment_data columns assumed: user_id, date, variant, metric)
cuped = CUPED(pre_period_days=30)
adjusted_data = cuped.reduce_variance(experiment_data, experiment_start_date)

# Now run test on adjusted metric
control_adjusted = adjusted_data[
    adjusted_data['variant'] == 'control'
]['adjusted_metric']
treatment_adjusted = adjusted_data[
    adjusted_data['variant'] == 'treatment'
]['adjusted_metric']

# T-test on adjusted metrics (lower variance = higher power)
t_stat, p_value = stats.ttest_ind(control_adjusted, treatment_adjusted)

📊 Network Effects & Interference

Cluster Randomization

Randomize groups instead of individuals

  • Geographic clusters
  • Social networks
  • Time-based clusters
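
A minimal sketch of cluster-level assignment, assuming each user record carries an id and a cluster field such as city; the analysis then has to treat the cluster, not the individual, as the unit of randomization.

Cluster Assignment (sketch)
import random

def cluster_randomize(users, cluster_key='city', seed=7):
    """Assign whole clusters to one arm so connected users share a treatment."""
    rng = random.Random(seed)
    clusters = sorted({user[cluster_key] for user in users})
    rng.shuffle(clusters)
    arm_of_cluster = {c: ('treatment' if i % 2 == 0 else 'control')
                      for i, c in enumerate(clusters)}
    return {user['id']: arm_of_cluster[user[cluster_key]] for user in users}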

Synthetic Control

Create control from historical data

Used when randomization impossible

Ego-Network Randomization

Randomize user + connections

User + Friends

⚡ Real-Time Decision Making

Dynamic Allocation Simulator

Variant A: 25%
Variant B: 25%
Variant C: 25%
Variant D: 25%

Each variant starts with an equal 25% share; as performance data accumulates, a bandit-style allocator shifts traffic toward the better-performing variants.

⚡ Quick Reference Guide

📋 Experimentation Checklist

✅ Pre-Launch

  • Define hypothesis
  • Choose primary metric
  • Calculate sample size
  • Set test duration
  • Check randomization
  • Set up tracking

✅ During Test

  • Monitor SRM
  • Check data quality
  • Watch guardrails
  • Document issues
  • Avoid peeking
  • Maintain test integrity

✅ Post-Test

  • Validate results
  • Check segments
  • Analyze secondary metrics
  • Document learnings
  • Make decision
  • Plan rollout

📊 Statistical Formulas

Essential Formulas
# Sample Size (per variant)
n = (Z_α + Z_β)² × 2σ² / δ²

# Where:
# Z_α = Z-score for significance (1.96 for 95%)
# Z_β = Z-score for power (0.84 for 80%)
# σ² = Variance
# δ = Minimum detectable effect

# Standard Error (proportion)
SE = sqrt(p × (1-p) / n)

# Confidence Interval
CI = p ± Z × SE

# Z-Score
Z = (p₁ - p₂) / sqrt(SE₁² + SE₂²)

# P-Value (two-tailed)
p_value = 2 × (1 - norm.cdf(abs(Z)))

# Relative Lift
lift = (treatment - control) / control × 100%

# Statistical Power
power = 1 - β

# Effect Size (Cohen's d)
d = (μ₁ - μ₂) / σ_pooled

# Chi-Square Test
χ² = Σ((O - E)² / E)

# Multiple Testing Correction (Bonferroni)
α_adjusted = α / m  # m = number of tests

# Bayesian Probability
P(B > A) = ∫∫ I(b > a) × P(a) × P(b) da db

🛠️ Tools Comparison

Tool | Best For | Features | Pricing
Optimizely | Enterprise | Full stack, Stats Engine | $$$
Google Optimize (sunset 2023) | Web testing | Visual editor, GA integration | Free
LaunchDarkly | Feature flags | Progressive rollouts | $$
Statsig | Product analytics | Auto-logging, Pulse | $
Split.io | Engineering teams | SDKs, Targeting | $$

💡 Common Mistakes to Avoid

โŒ Statistical Errors

  • Stopping tests early
  • Ignoring multiple testing
  • P-hacking
  • Cherry-picking segments
  • Ignoring power analysis

โŒ Design Errors

  • Weak hypothesis
  • Wrong metrics
  • Poor randomization
  • Contamination
  • Selection bias

โŒ Implementation Errors

  • Broken tracking
  • Bot traffic
  • Technical bugs
  • Inconsistent experience
  • Data leakage

📊 Decision Framework

Ship Decision

Primary metric: ✓ Significant
Guardrails: ✓ No harm
Segments: ✓ Consistent
→ Decision: SHIP

Iterate Decision

Primary metric: ✗ Not sig
Secondary: ✓ Positive
Learnings: ✓ Clear
→ Decision: ITERATE

Kill Decision

Primary metric: ✗ Negative
Guardrails: ✗ Violated
Cost: High
→ Decision: KILL