Master statistical testing, experiment design, and data-driven decision making for AI products
Remove guesswork from product development. Every feature change is validated with real user data.
Test changes on small user segments before full rollout, preventing costly mistakes.
Ship faster with confidence. Multiple experiments run in parallel accelerate learning.
Enter your metrics to calculate the potential ROI of A/B testing...
| Company | Experiment | Result | Business Impact |
|---|---|---|---|
| Google | 41 shades of blue | Optimal link color | +$200M revenue |
| Amazon | 1-Click ordering | +35% conversion | Billions in revenue |
| Netflix | Personalized thumbnails | +30% engagement | Reduced churn |
| Booking.com | Urgency messaging | +10% bookings | Market leadership |
| LinkedIn | AI recommendations | +50% connections | Network growth |
"AI recommendations will increase engagement"
Control vs Treatment groups
Run for statistical significance
Make data-driven decision
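A minimal sketch of that flow, assuming hypothetical conversion counts: split traffic, compare conversion rates with a two-proportion z-test, and decide against α = 0.05.

```python
# Sketch of the hypothesis -> test -> decision flow (all numbers illustrative).
import numpy as np
from scipy.stats import norm

control_conv, control_n = 480, 5000      # control: 9.6% conversion
treatment_conv, treatment_n = 540, 5000  # treatment: 10.8% conversion

p_c = control_conv / control_n
p_t = treatment_conv / treatment_n

# Pooled standard error under H0 (no difference)
p_pool = (control_conv + treatment_conv) / (control_n + treatment_n)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treatment_n))

z = (p_t - p_c) / se
p_value = 2 * (1 - norm.cdf(abs(z)))     # two-tailed

print(f"lift = {(p_t - p_c) / p_c:.1%}, z = {z:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: ship the AI recommendations")
else:
    print("Fail to reject H0: keep iterating")
```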
```
# Null Hypothesis (H₀)
H₀: μ_treatment = μ_control
"There is no difference between groups"

# Alternative Hypothesis (H₁)
H₁: μ_treatment ≠ μ_control
"There is a significant difference"

# Decision Rule
if p_value < α (0.05):
    reject H₀          # Treatment has an effect
else:
    fail to reject H₀  # No evidence of an effect
```
The probability of seeing a result at least this extreme if there were no real difference (the p-value); a result is called significant when it falls below α
Probability of detecting a true effect
80% power (recommended minimum)
Factors affecting power: sample size, effect size (minimum detectable effect), significance level (α), and metric variance
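A small sketch, with illustrative numbers, of how two of those factors translate into power for a proportion metric: larger samples and larger effects raise power, noisier metrics and stricter α lower it.

```python
# Approximate power of a two-sided test for a difference in proportions.
import numpy as np
from scipy.stats import norm

def power_two_proportions(p_control, mde, n_per_variant, alpha=0.05):
    """Power to detect an absolute lift of `mde` over `p_control`."""
    p_treatment = p_control + mde
    se = np.sqrt(p_control * (1 - p_control) / n_per_variant
                 + p_treatment * (1 - p_treatment) / n_per_variant)
    z_alpha = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_alpha - mde / se)

for n in (1_000, 5_000, 20_000):
    print(n, f"{power_two_proportions(0.10, 0.01, n):.0%}")
# ~11%, ~37%, ~90%: only the largest sample clears the 80% minimum.
```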
| Test Type | Use Case | Example | Pros | Cons |
|---|---|---|---|---|
| A/B Test | Compare two versions | Button color | Simple, clear results | One variable at a time |
| A/B/n Test | Multiple variants | 3+ headlines | Test many options | Requires more traffic |
| Multivariate | Multiple variables | Layout + color + text | Interaction effects | Very large sample needed |
| Bandit | Optimize while testing | Dynamic allocation | Minimize opportunity cost | Complex analysis |
| Holdout | Long-term effects | Algorithm changes | Measure cumulative impact | Reduces test velocity |
Enter parameters to calculate required sample size...
Primary metrics: main success indicators
Secondary metrics: supporting indicators
Guardrail metrics: protect users from harm
Stop early when results are clear
```python
import numpy as np

class SequentialTest:
    def __init__(self, alpha=0.05, beta=0.20):
        self.alpha = alpha  # Type I error
        self.beta = beta    # Type II error
        self.log_likelihood_ratio = 0

    def update(self, control_success, control_total,
               treatment_success, treatment_total):
        """Update test with new data"""
        # Calculate conversion rates
        p_control = control_success / control_total
        p_treatment = treatment_success / treatment_total

        # Update log likelihood ratio
        if p_control > 0 and p_treatment > 0:
            self.log_likelihood_ratio += np.log(p_treatment / p_control)

        # Check stopping conditions (Wald boundaries)
        upper_bound = np.log((1 - self.beta) / self.alpha)
        lower_bound = np.log(self.beta / (1 - self.alpha))

        if self.log_likelihood_ratio >= upper_bound:
            return "STOP: Treatment wins"
        elif self.log_likelihood_ratio <= lower_bound:
            return "STOP: No difference"
        else:
            return "CONTINUE: Need more data"
```
Ensure balanced groups across segments
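One way to keep groups balanced and reproducible is deterministic, hash-based bucketing; a minimal sketch, assuming hypothetical user records and segment labels, with a balance check per segment.

```python
# Sketch: stable, hash-based assignment plus a per-segment balance check.
import hashlib
from collections import Counter

def assign_variant(user_id: str, experiment: str, variants=("control", "treatment")):
    """Same user always gets the same variant for a given experiment."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Illustrative users split across two segments
users = [{"id": f"u{i}", "segment": "mobile" if i % 3 else "desktop"} for i in range(9000)]
counts = Counter((u["segment"], assign_variant(u["id"], "ai_recs_v1")) for u in users)
print(counts)  # each segment should land roughly 50/50
```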
For marketplace and network effects
Enter your test results to calculate statistical significance...
| Pitfall | Description | Impact | Solution |
|---|---|---|---|
| Peeking | Checking results too early | Inflated false positives | Sequential testing or fixed horizon |
| Multiple Testing | Testing many metrics | Type I error inflation | Bonferroni correction |
| Simpson's Paradox | Aggregate vs segment results differ | Wrong conclusions | Segment analysis |
| Novelty Effect | Initial excitement bias | Overestimated impact | Longer test duration |
| Sample Ratio Mismatch | Unequal group sizes | Invalid results | SRM detection |
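For the last pitfall, a minimal SRM check is a chi-square goodness-of-fit test against the intended split; a sketch with illustrative counts and a deliberately strict threshold.

```python
# Sketch: Sample Ratio Mismatch (SRM) detection against an intended 50/50 split.
from scipy.stats import chisquare

control_users, treatment_users = 50_421, 49_198   # illustrative counts
observed = [control_users, treatment_users]
total = sum(observed)
expected = [total / 2, total / 2]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:  # strict threshold: SRM should be rare
    print(f"SRM detected (p = {p_value:.2e}) - investigate before trusting results")
else:
    print(f"No SRM detected (p = {p_value:.3f})")
```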
```python
class MetricFramework:
    def __init__(self):
        self.metrics = {
            'primary': [],
            'secondary': [],
            'guardrails': []
        }

    def add_primary_metric(self, metric, success_criteria):
        """Add primary success metric"""
        self.metrics['primary'].append({
            'name': metric,
            'type': 'primary',
            'success_criteria': success_criteria,
            'weight': 1.0
        })

    def add_guardrail(self, metric, threshold):
        """Add guardrail metric"""
        self.metrics['guardrails'].append({
            'name': metric,
            'type': 'guardrail',
            'threshold': threshold,
            'direction': 'no_harm'  # Should not exceed the threshold
        })

    def evaluate_experiment(self, results):
        """Evaluate if experiment is successful"""
        decision = {'ship': True, 'reasons': []}

        # Check primary metrics
        for metric in self.metrics['primary']:
            if not self.meets_criteria(results[metric['name']],
                                       metric['success_criteria']):
                decision['ship'] = False
                decision['reasons'].append(
                    f"{metric['name']} did not meet success criteria"
                )

        # Check guardrails
        for guardrail in self.metrics['guardrails']:
            if self.violates_guardrail(results[guardrail['name']],
                                       guardrail['threshold']):
                decision['ship'] = False
                decision['reasons'].append(
                    f"{guardrail['name']} guardrail violated"
                )

        return decision

    def meets_criteria(self, result, criteria):
        """Check if result meets success criteria"""
        return (result['lift'] >= criteria['min_lift'] and
                result['p_value'] < criteria['significance_level'])

    def violates_guardrail(self, result, threshold):
        """Guardrail is violated when the metric exceeds its threshold
        (e.g. page load time above 2.0s, error rate above 1%)."""
        return result['value'] > threshold


# Usage Example
framework = MetricFramework()
framework.add_primary_metric(
    'conversion_rate',
    {'min_lift': 0.02, 'significance_level': 0.05}
)
framework.add_guardrail('page_load_time', threshold=2.0)
framework.add_guardrail('error_rate', threshold=0.01)
```
Rapid testing for quick wins
1% → 5% → 20% → 100%
```python
if feature_flag.is_enabled('new_ai_model', user_id):
    result = new_model.predict(data)
else:
    result = old_model.predict(data)
```
Decouple deployment from release
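A hedged sketch of the 1% → 5% → 20% → 100% ramp, not tied to any specific feature-flag SDK: bucket each user into 0–99 by hash and compare against the current rollout percentage, so users enabled at 1% stay enabled as the ramp widens.

```python
# Sketch: percentage-based progressive rollout (flag names and thresholds illustrative).
import hashlib

ROLLOUT_PERCENT = {"new_ai_model": 5}   # ramp: 1 -> 5 -> 20 -> 100

def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministic 0-99 bucket per (flag, user); stable as the percentage increases."""
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT.get(flag, 0)

print(is_enabled("new_ai_model", "user_123"))
```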
Enter your test data to check for statistical significance...
Configure parameters to visualize confidence interval...
Enter parameters to calculate statistical power...
```python
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

class ABTestSimulator:
    def __init__(self, control_rate, treatment_rate, daily_traffic):
        self.control_rate = control_rate
        self.treatment_rate = treatment_rate
        self.daily_traffic = daily_traffic
        self.results = []

    def run_day(self):
        """Simulate one day of the experiment"""
        # Split traffic 50/50
        control_n = self.daily_traffic // 2
        treatment_n = self.daily_traffic // 2

        # Generate conversions
        control_conversions = np.random.binomial(control_n, self.control_rate)
        treatment_conversions = np.random.binomial(treatment_n, self.treatment_rate)

        return {
            'control_visitors': control_n,
            'control_conversions': control_conversions,
            'treatment_visitors': treatment_n,
            'treatment_conversions': treatment_conversions
        }

    def run_experiment(self, days):
        """Run full experiment"""
        cumulative_control_v = 0
        cumulative_control_c = 0
        cumulative_treatment_v = 0
        cumulative_treatment_c = 0

        for day in range(1, days + 1):
            day_results = self.run_day()
            cumulative_control_v += day_results['control_visitors']
            cumulative_control_c += day_results['control_conversions']
            cumulative_treatment_v += day_results['treatment_visitors']
            cumulative_treatment_c += day_results['treatment_conversions']

            # Calculate current statistics
            control_rate = cumulative_control_c / cumulative_control_v
            treatment_rate = cumulative_treatment_c / cumulative_treatment_v

            # Perform statistical test
            test_result = self.statistical_test(
                cumulative_control_c, cumulative_control_v,
                cumulative_treatment_c, cumulative_treatment_v
            )

            self.results.append({
                'day': day,
                'control_rate': control_rate,
                'treatment_rate': treatment_rate,
                'lift': (treatment_rate - control_rate) / control_rate,
                'p_value': test_result['p_value'],
                'significant': test_result['p_value'] < 0.05,
                'confidence_interval': test_result['ci']
            })

        return pd.DataFrame(self.results)

    def statistical_test(self, c_conv, c_total, t_conv, t_total):
        """Perform chi-square test"""
        contingency_table = [
            [c_conv, c_total - c_conv],
            [t_conv, t_total - t_conv]
        ]
        chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

        # Confidence interval for the absolute difference in rates
        p_c = c_conv / c_total
        p_t = t_conv / t_total
        se_c = np.sqrt(p_c * (1 - p_c) / c_total)
        se_t = np.sqrt(p_t * (1 - p_t) / t_total)
        se_diff = np.sqrt(se_c**2 + se_t**2)

        diff = p_t - p_c
        ci_lower = diff - 1.96 * se_diff
        ci_upper = diff + 1.96 * se_diff

        return {'p_value': p_value, 'ci': (ci_lower, ci_upper)}

    def plot_results(self):
        """Visualize experiment results over time"""
        df = pd.DataFrame(self.results)
        fig, axes = plt.subplots(2, 2, figsize=(12, 8))

        # Conversion rates over time
        axes[0, 0].plot(df['day'], df['control_rate'], label='Control', color='blue')
        axes[0, 0].plot(df['day'], df['treatment_rate'], label='Treatment', color='green')
        axes[0, 0].set_title('Conversion Rates')
        axes[0, 0].set_xlabel('Day')
        axes[0, 0].set_ylabel('Rate')
        axes[0, 0].legend()

        # P-value over time
        axes[0, 1].plot(df['day'], df['p_value'], color='red')
        axes[0, 1].axhline(y=0.05, color='gray', linestyle='--', label='α = 0.05')
        axes[0, 1].set_title('P-Value Evolution')
        axes[0, 1].set_xlabel('Day')
        axes[0, 1].set_ylabel('P-Value')
        axes[0, 1].legend()

        # Relative lift with confidence interval (the CI of the absolute
        # difference is converted to a relative lift via the control rate)
        axes[1, 0].plot(df['day'], df['lift'] * 100, color='purple')
        axes[1, 0].fill_between(
            df['day'],
            df['confidence_interval'].apply(lambda x: x[0]) / df['control_rate'] * 100,
            df['confidence_interval'].apply(lambda x: x[1]) / df['control_rate'] * 100,
            alpha=0.3, color='purple'
        )
        axes[1, 0].set_title('Lift % with 95% CI')
        axes[1, 0].set_xlabel('Day')
        axes[1, 0].set_ylabel('Lift %')

        # Significance indicator
        colors = ['green' if sig else 'red' for sig in df['significant']]
        axes[1, 1].bar(df['day'], df['significant'], color=colors)
        axes[1, 1].set_title('Statistical Significance')
        axes[1, 1].set_xlabel('Day')
        axes[1, 1].set_ylabel('Significant (1) or Not (0)')

        plt.tight_layout()
        plt.show()


# Run simulation
simulator = ABTestSimulator(
    control_rate=0.10,     # 10% baseline
    treatment_rate=0.11,   # 11% treatment (10% lift)
    daily_traffic=1000
)
results = simulator.run_experiment(days=30)
print(results.tail())
simulator.plot_results()
```
Enter your test parameters to calculate optimal duration...
| Segment | Control CR | Treatment CR | Lift | P-Value | Decision |
|---|---|---|---|---|---|
| New Users | 3.2% | 4.1% | +28% | 0.002 | Ship |
| Returning Users | 8.5% | 8.3% | -2% | 0.451 | No Effect |
| Mobile | 2.1% | 2.8% | +33% | 0.012 | Ship |
| Desktop | 5.4% | 5.2% | -4% | 0.623 | No Effect |
| High Value | 15.2% | 14.8% | -3% | 0.089 | Monitor |
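Each segment is an extra hypothesis test, so the p-values above should be read with a multiple-testing correction; a quick sketch applying Bonferroni to the table's numbers.

```python
# Sketch: Bonferroni correction across the five segment tests above.
segment_p_values = {
    "New Users": 0.002, "Returning Users": 0.451, "Mobile": 0.012,
    "Desktop": 0.623, "High Value": 0.089,
}
alpha, m = 0.05, len(segment_p_values)
alpha_adjusted = alpha / m   # 0.01

for segment, p in segment_p_values.items():
    verdict = "significant" if p < alpha_adjusted else "not significant"
    print(f"{segment}: p = {p} -> {verdict} at adjusted alpha = {alpha_adjusted}")
# Mobile (p = 0.012) clears 0.05 but not the corrected threshold.
```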
```python
import numpy as np
from scipy.stats import beta

class ThompsonSamplingBandit:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.successes = np.zeros(n_arms)
        self.failures = np.zeros(n_arms)
        self.total_rewards = 0
        self.counts = np.zeros(n_arms)

    def select_arm(self):
        """Select arm using Thompson Sampling"""
        # Sample from Beta distribution for each arm
        samples = []
        for arm in range(self.n_arms):
            # Beta(α, β) where α = successes + 1, β = failures + 1
            sample = beta.rvs(self.successes[arm] + 1, self.failures[arm] + 1)
            samples.append(sample)
        # Select arm with highest sampled value
        return np.argmax(samples)

    def update(self, arm, reward):
        """Update arm statistics"""
        self.counts[arm] += 1
        if reward == 1:
            self.successes[arm] += 1
            self.total_rewards += 1
        else:
            self.failures[arm] += 1

    def get_arm_probabilities(self):
        """Get probability of selecting each arm"""
        n_simulations = 10000
        selections = np.zeros(self.n_arms)
        for _ in range(n_simulations):
            arm = self.select_arm()
            selections[arm] += 1
        return selections / n_simulations

    def run_experiment(self, true_rates, n_rounds):
        """Run bandit experiment"""
        rewards_history = []
        arm_history = []

        for round_num in range(n_rounds):
            # Select arm
            arm = self.select_arm()
            arm_history.append(arm)

            # Get reward (simulate conversion)
            reward = np.random.binomial(1, true_rates[arm])
            rewards_history.append(reward)

            # Update statistics
            self.update(arm, reward)

            # Log progress
            if (round_num + 1) % 1000 == 0:
                avg_reward = self.total_rewards / (round_num + 1)
                print(f"Round {round_num + 1}: Avg Reward = {avg_reward:.3f}")
                print(f"Arm Selection: {self.counts / self.counts.sum()}")

        return {
            'total_reward': self.total_rewards,
            'arm_counts': self.counts,
            'final_rates': self.successes / np.maximum(self.counts, 1),
            'regret': self.calculate_regret(true_rates, arm_history)
        }

    def calculate_regret(self, true_rates, arm_history):
        """Calculate cumulative regret"""
        best_arm = np.argmax(true_rates)
        best_rate = true_rates[best_arm]

        cumulative_regret = 0
        for arm in arm_history:
            cumulative_regret += best_rate - true_rates[arm]
        return cumulative_regret


# Compare Thompson Sampling vs A/B Testing
true_rates = [0.10, 0.12, 0.11, 0.09]  # True conversion rates
bandit = ThompsonSamplingBandit(n_arms=4)
results = bandit.run_experiment(true_rates, n_rounds=10000)

print("\nFinal Results:")
print(f"Best Arm Found: {np.argmax(results['final_rates'])}")
print(f"True Best Arm: {np.argmax(true_rates)}")
print(f"Cumulative Regret: {results['regret']:.2f}")
print(f"Traffic Allocation: {results['arm_counts'] / results['arm_counts'].sum()}")
```
Update beliefs with evidence
```python
import numpy as np

# Posterior samples per variant: Beta(successes + 1, failures + 1).
# The conversion counts below are illustrative placeholders.
samples_a = np.random.beta(120 + 1, 1000 - 120 + 1, size=100_000)
samples_b = np.random.beta(140 + 1, 1000 - 140 + 1, size=100_000)

# Probability B > A
prob_b_better = np.mean(samples_b > samples_a)

# Expected loss from choosing A when B is actually better
loss_choosing_a = np.maximum(samples_b - samples_a, 0).mean()

# Decision
if prob_b_better > 0.95:
    decision = "Choose B"
```
| Variant | Button Color | Button Text | Layout | Conversion | Interactions |
|---|---|---|---|---|---|
| Control | Blue | "Buy Now" | Standard | 5.0% | - |
| Var 1 | Green | "Buy Now" | Standard | 5.3% | Color effect |
| Var 2 | Blue | "Get Started" | Standard | 5.5% | Text effect |
| Var 3 | Green | "Get Started" | Standard | 6.2% | Color × Text |
| Var 4 | Green | "Get Started" | Centered | 7.1% | All factors |
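A quick worked check of the color × text interaction using the conversion rates in the table: Var 3's lift exceeds the sum of the individual lifts, which is what "interaction effect" means here.

```python
# Worked check of the color x text interaction (rates taken from the table above).
control, green, text, both = 0.050, 0.053, 0.055, 0.062

color_effect = green - control      # +0.3 pp
text_effect = text - control        # +0.5 pp
combined_effect = both - control    # +1.2 pp

interaction = combined_effect - (color_effect + text_effect)
print(f"interaction = {interaction * 100:+.1f} pp")  # +0.4 pp beyond additive effects
```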
```python
import numpy as np
from scipy import stats

class CUPED:
    def __init__(self, pre_period_days=30):
        self.pre_period_days = pre_period_days

    def reduce_variance(self, df, experiment_start_date):
        """Apply CUPED to reduce metric variance.

        Expects a DataFrame with columns: user_id, date, variant, metric.
        """
        # Split pre-experiment and experiment data
        pre_data = df[df['date'] < experiment_start_date]
        exp_data = df[df['date'] >= experiment_start_date]

        # Calculate pre-period metric for each user
        pre_metric = pre_data.groupby('user_id')['metric'].mean()

        # Merge with experiment data
        exp_data = exp_data.merge(pre_metric.rename('pre_metric'), on='user_id')

        # Calculate theta (optimal coefficient)
        cov = np.cov(exp_data['metric'], exp_data['pre_metric'])[0, 1]
        var = np.var(exp_data['pre_metric'])
        theta = cov / var if var > 0 else 0

        # Adjust metric using CUPED
        exp_data['adjusted_metric'] = (
            exp_data['metric']
            - theta * (exp_data['pre_metric'] - exp_data['pre_metric'].mean())
        )

        # Calculate variance reduction
        original_var = np.var(exp_data['metric'])
        adjusted_var = np.var(exp_data['adjusted_metric'])
        variance_reduction = 1 - (adjusted_var / original_var)

        print(f"Variance Reduction: {variance_reduction:.1%}")
        print(f"This is equivalent to {1 / (1 - variance_reduction):.1f}x more sample")

        return exp_data


# Usage (experiment_data is a DataFrame with user_id, date, variant, metric)
cuped = CUPED(pre_period_days=30)
adjusted_data = cuped.reduce_variance(experiment_data, experiment_start_date)

# Now run the test on the adjusted metric
control_adjusted = adjusted_data[adjusted_data['variant'] == 'control']['adjusted_metric']
treatment_adjusted = adjusted_data[adjusted_data['variant'] == 'treatment']['adjusted_metric']

# T-test on adjusted metrics (lower variance = higher power)
t_stat, p_value = stats.ttest_ind(control_adjusted, treatment_adjusted)
```
Randomize groups instead of individuals
Create control from historical data
Randomize the user together with their connections (user + friends)
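A minimal sketch of randomizing groups instead of individuals, assuming a hypothetical user-to-cluster mapping (a friend group, city, or marketplace): the whole cluster gets one variant, so treated and control users do not interact.

```python
# Sketch: cluster-level randomization (cluster assignments are illustrative).
import hashlib

def cluster_variant(cluster_id: str, experiment: str) -> str:
    """Everyone in the same cluster (city, friend group, ...) gets the same variant."""
    digest = hashlib.sha256(f"{experiment}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

user_to_cluster = {"alice": "sf", "bob": "sf", "carol": "nyc"}  # hypothetical mapping
for user, cluster in user_to_cluster.items():
    print(user, cluster, cluster_variant(cluster, "marketplace_pricing_v2"))
# alice and bob share a cluster, so they always share a variant.
```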
Click variants to simulate performance and see dynamic allocation...
```
# Sample Size (per variant)
n = (Z_α + Z_β)² × 2σ² / δ²
# Where:
#   Z_α = Z-score for significance (1.96 for 95%)
#   Z_β = Z-score for power (0.84 for 80%)
#   σ²  = Variance
#   δ   = Minimum detectable effect

# Standard Error (proportion)
SE = sqrt(p × (1 - p) / n)

# Confidence Interval
CI = p ± Z × SE

# Z-Score
Z = (p₁ - p₂) / sqrt(SE₁² + SE₂²)

# P-Value (two-tailed)
p_value = 2 × (1 - norm.cdf(abs(Z)))

# Relative Lift
lift = (treatment - control) / control × 100%

# Statistical Power
power = 1 - β

# Effect Size (Cohen's d)
d = (μ₁ - μ₂) / σ_pooled

# Chi-Square Test
χ² = Σ((O - E)² / E)

# Multiple Testing Correction (Bonferroni)
α_adjusted = α / m   # m = number of tests

# Bayesian Probability
P(B > A) = ∫∫ I(b > a) × P(a) × P(b) da db
```
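A worked example of the sample-size formula above, assuming a 10% baseline conversion rate and a 1-percentage-point minimum detectable effect at 95% confidence and 80% power.

```python
# Worked sample-size example: n = (Z_alpha + Z_beta)^2 * 2 * sigma^2 / delta^2
import numpy as np
from scipy.stats import norm

p_baseline = 0.10          # 10% baseline conversion
delta = 0.01               # minimum detectable effect: 1 percentage point
z_alpha = norm.ppf(0.975)  # 1.96 for a two-sided 95% test
z_beta = norm.ppf(0.80)    # 0.84 for 80% power

sigma_sq = p_baseline * (1 - p_baseline)   # variance of a Bernoulli metric
n_per_variant = (z_alpha + z_beta) ** 2 * 2 * sigma_sq / delta ** 2
print(f"~{int(np.ceil(n_per_variant)):,} users per variant")  # roughly 14,000
```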
| Tool | Best For | Features | Pricing |
|---|---|---|---|
| Optimizely | Enterprise | Full stack, Stats Engine | $$$ |
| Google Optimize | Web testing | Visual editor, GA integration | Free |
| LaunchDarkly | Feature flags | Progressive rollouts | $$ |
| Statsig | Product analytics | Auto-logging, Pulse | $ |
| Split.io | Engineering teams | SDKs, Targeting | $$ |
Primary metric: ✓ Significant
Guardrails: ✓ No harm
Segments: ✓ Consistent
→ Decision: SHIP
Primary metric: ✗ Not significant
Secondary: ✓ Positive
Learnings: ✓ Clear
→ Decision: ITERATE
Primary metric: ✗ Negative
Guardrails: ✗ Violated
Cost: High
→ Decision: KILL