Master statistical testing, experiment design, and data-driven decision making for AI products
Remove guesswork from product development. Every feature change is validated with real user data.
Test changes on small user segments before full rollout, preventing costly mistakes.
Ship faster with confidence: running multiple experiments in parallel accelerates learning.
| Company | Experiment | Result | Business Impact |
|---|---|---|---|
| Google | 41 shades of blue | Optimal link color | +$200M revenue |
| Amazon | 1-Click ordering | +35% conversion | Billions in revenue |
| Netflix | Personalized thumbnails | +30% engagement | Reduced churn |
| Booking.com | Urgency messaging | +10% bookings | Market leadership |
| LinkedIn | AI recommendations | +50% connections | Network growth |
"AI recommendations will increase engagement"
Control vs Treatment groups
Run for statistical significance
Make data-driven decision
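A minimal sketch of step 2's assignment, assuming users are identified by a `user_id` string and a 50/50 split; the experiment name salts the hash so different experiments get independent assignments:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "ai_recommendations_v1") -> str:
    """Deterministically assign a user to control or treatment (50/50 split)."""
    # Hash experiment name + user id so assignment is stable across sessions
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 0-99
    return "treatment" if bucket < 50 else "control"

print(assign_variant("user_12345"))  # the same user always gets the same variant
```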
# Null Hypothesis (H₀)
H₀: μ_treatment = μ_control
"There is no difference between groups"
# Alternative Hypothesis (H₁)
H₁: μ_treatment ≠ μ_control
"There is a significant difference"
# Decision Rule
if p_value < α (0.05):
    reject H₀          # Treatment has an effect
else:
    fail to reject H₀  # No evidence of an effect
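As a concrete illustration of this decision rule, here is a minimal two-proportion z-test; the conversion counts are made-up numbers you would replace with your own:

```python
import numpy as np
from scipy.stats import norm

# Illustrative results: conversions / visitors per group
control_conv, control_n = 500, 10_000
treatment_conv, treatment_n = 560, 10_000

p_c, p_t = control_conv / control_n, treatment_conv / treatment_n
p_pooled = (control_conv + treatment_conv) / (control_n + treatment_n)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / control_n + 1 / treatment_n))
z = (p_t - p_c) / se
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-tailed

alpha = 0.05
print(f"z = {z:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```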
Statistical significance: how unlikely the observed difference would be if there were truly no effect (the p-value, compared against α)
Statistical power: the probability of detecting a true effect when one exists
80% power (recommended minimum)
Factors affecting power: sample size, minimum detectable effect size, significance level (α), and metric variance
| Test Type | Use Case | Example | Pros | Cons |
|---|---|---|---|---|
| A/B Test | Compare two versions | Button color | Simple, clear results | One variable at a time |
| A/B/n Test | Multiple variants | 3+ headlines | Test many options | Requires more traffic |
| Multivariate | Multiple variables | Layout + color + text | Interaction effects | Very large sample needed |
| Bandit | Optimize while testing | Dynamic allocation | Minimize opportunity cost | Complex analysis |
| Holdout | Long-term effects | Algorithm changes | Measure cumulative impact | Reduces test velocity |
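A minimal sample-size sketch using the standard two-proportion formula; the baseline rate and minimum detectable effect below are assumed values you would replace with your own:

```python
import numpy as np
from scipy.stats import norm

def sample_size_per_variant(baseline_rate, mde_relative, alpha=0.05, power=0.80):
    """Sample size per variant for a two-sided test on conversion rates."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)  # minimum detectable effect (relative)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(np.ceil(n))

# e.g. 10% baseline conversion, detect a 10% relative lift
print(sample_size_per_variant(0.10, 0.10))  # roughly 14,000-15,000 users per variant
```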
Primary metrics: the main success indicators
Secondary metrics: supporting indicators that add context
Guardrail metrics: protect users and the business from harm
Sequential testing: stop early when results are clear
import numpy as np
from scipy import stats
class SequentialTest:
def __init__(self, alpha=0.05, beta=0.20):
self.alpha = alpha # Type I error
self.beta = beta # Type II error
self.log_likelihood_ratio = 0
def update(self, control_success, control_total,
treatment_success, treatment_total):
"""Update test with new data"""
# Calculate conversion rates
p_control = control_success / control_total
p_treatment = treatment_success / treatment_total
        # Update log likelihood ratio (a simplified SPRT-style statistic
        # based on the ratio of observed conversion rates)
if p_control > 0 and p_treatment > 0:
self.log_likelihood_ratio += np.log(
p_treatment / p_control
)
# Check stopping conditions
upper_bound = np.log((1 - self.beta) / self.alpha)
lower_bound = np.log(self.beta / (1 - self.alpha))
if self.log_likelihood_ratio >= upper_bound:
return "STOP: Treatment wins"
elif self.log_likelihood_ratio <= lower_bound:
return "STOP: No difference"
else:
return "CONTINUE: Need more data"
Stratified randomization: ensure balanced groups across key user segments
Cluster or switchback designs: for marketplaces and products with network effects
| Pitfall | Description | Impact | Solution |
|---|---|---|---|
| Peeking | Checking results too early | Inflated false positives | Sequential testing or fixed horizon |
| Multiple Testing | Testing many metrics | Type I error inflation | Bonferroni correction |
| Simpson's Paradox | Aggregate vs segment results differ | Wrong conclusions | Segment analysis |
| Novelty Effect | Initial excitement bias | Overestimated impact | Longer test duration |
| Sample Ratio Mismatch | Unequal group sizes | Invalid results | SRM detection |
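For the last pitfall, a minimal SRM check: compare observed group sizes against the planned 50/50 split with a chi-square goodness-of-fit test (the counts below are illustrative):

```python
from scipy.stats import chisquare

# Observed users per group vs the expected 50/50 split
observed = [50_912, 49_088]
expected = [sum(observed) / 2] * 2

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:  # a strict threshold is typical for SRM alerts
    print(f"Sample ratio mismatch detected (p = {p_value:.2e}) - do not trust the results")
else:
    print(f"No SRM detected (p = {p_value:.3f})")
```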
class MetricFramework:
def __init__(self):
self.metrics = {
'primary': [],
'secondary': [],
'guardrails': []
}
def add_primary_metric(self, metric, success_criteria):
"""Add primary success metric"""
self.metrics['primary'].append({
'name': metric,
'type': 'primary',
'success_criteria': success_criteria,
'weight': 1.0
})
    def add_guardrail(self, metric, threshold):
        """Add a guardrail metric with an upper bound it must not exceed"""
        self.metrics['guardrails'].append({
            'name': metric,
            'type': 'guardrail',
            'threshold': threshold,
            'direction': 'upper_bound'  # Metric must stay at or below the threshold
        })
def evaluate_experiment(self, results):
"""Evaluate if experiment is successful"""
decision = {
'ship': True,
'reasons': []
}
# Check primary metrics
for metric in self.metrics['primary']:
if not self.meets_criteria(results[metric['name']],
metric['success_criteria']):
decision['ship'] = False
decision['reasons'].append(
f"{metric['name']} did not meet success criteria"
)
# Check guardrails
for guardrail in self.metrics['guardrails']:
if self.violates_guardrail(results[guardrail['name']],
guardrail['threshold']):
decision['ship'] = False
decision['reasons'].append(
f"{guardrail['name']} guardrail violated"
)
return decision
def meets_criteria(self, result, criteria):
"""Check if result meets success criteria"""
return result['lift'] >= criteria['min_lift'] and \
result['p_value'] < criteria['significance_level']
    def violates_guardrail(self, result, threshold):
        """Check if a guardrail is violated (metric exceeded its upper bound)"""
        return result['value'] > threshold
# Usage Example
framework = MetricFramework()
framework.add_primary_metric(
'conversion_rate',
{'min_lift': 0.02, 'significance_level': 0.05}
)
framework.add_guardrail('page_load_time', threshold=2.0)   # seconds (upper bound)
framework.add_guardrail('error_rate', threshold=0.01)      # errors per request (1% ceiling)
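A hypothetical `results` payload showing how `evaluate_experiment` consumes the metrics defined above; the numbers are illustrative:

```python
results = {
    'conversion_rate': {'lift': 0.035, 'p_value': 0.01},  # +3.5% lift, significant
    'page_load_time': {'value': 1.8},                     # seconds, under the 2.0s bound
    'error_rate': {'value': 0.004},                       # under the 1% ceiling
}

decision = framework.evaluate_experiment(results)
print(decision)  # {'ship': True, 'reasons': []}
```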
Rapid testing for quick wins
1% → 5% → 20% → 100%
if feature_flag.is_enabled('new_ai_model', user_id):
result = new_model.predict(data)
else:
result = old_model.predict(data)
Decouple deployment from release
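A sketch of how the rollout stages above can be driven by a single hash-based percentage gate; the `is_enabled` helper and the stage values are illustrative, not a specific feature-flag library's API:

```python
import hashlib

ROLLOUT_PERCENT = {'new_ai_model': 5}  # bump 1 -> 5 -> 20 -> 100 as confidence grows

def is_enabled(flag: str, user_id: str) -> bool:
    """Stable percentage rollout: a user stays enabled once their bucket is covered."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < ROLLOUT_PERCENT.get(flag, 0)

print(is_enabled('new_ai_model', 'user_12345'))
```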
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
class ABTestSimulator:
def __init__(self, control_rate, treatment_rate, daily_traffic):
self.control_rate = control_rate
self.treatment_rate = treatment_rate
self.daily_traffic = daily_traffic
self.results = []
def run_day(self):
"""Simulate one day of the experiment"""
# Split traffic 50/50
control_n = self.daily_traffic // 2
treatment_n = self.daily_traffic // 2
# Generate conversions
control_conversions = np.random.binomial(
control_n, self.control_rate
)
treatment_conversions = np.random.binomial(
treatment_n, self.treatment_rate
)
return {
'control_visitors': control_n,
'control_conversions': control_conversions,
'treatment_visitors': treatment_n,
'treatment_conversions': treatment_conversions
}
def run_experiment(self, days):
"""Run full experiment"""
cumulative_control_v = 0
cumulative_control_c = 0
cumulative_treatment_v = 0
cumulative_treatment_c = 0
for day in range(1, days + 1):
day_results = self.run_day()
cumulative_control_v += day_results['control_visitors']
cumulative_control_c += day_results['control_conversions']
cumulative_treatment_v += day_results['treatment_visitors']
cumulative_treatment_c += day_results['treatment_conversions']
# Calculate current statistics
control_rate = cumulative_control_c / cumulative_control_v
treatment_rate = cumulative_treatment_c / cumulative_treatment_v
# Perform statistical test
test_result = self.statistical_test(
cumulative_control_c, cumulative_control_v,
cumulative_treatment_c, cumulative_treatment_v
)
self.results.append({
'day': day,
'control_rate': control_rate,
'treatment_rate': treatment_rate,
'lift': (treatment_rate - control_rate) / control_rate,
'p_value': test_result['p_value'],
'significant': test_result['p_value'] < 0.05,
'confidence_interval': test_result['ci']
})
return pd.DataFrame(self.results)
def statistical_test(self, c_conv, c_total, t_conv, t_total):
"""Perform chi-square test"""
contingency_table = [
[c_conv, c_total - c_conv],
[t_conv, t_total - t_conv]
]
chi2, p_value, dof, expected = stats.chi2_contingency(
contingency_table
)
# Calculate confidence interval for lift
p_c = c_conv / c_total
p_t = t_conv / t_total
se_c = np.sqrt(p_c * (1 - p_c) / c_total)
se_t = np.sqrt(p_t * (1 - p_t) / t_total)
se_diff = np.sqrt(se_c**2 + se_t**2)
diff = p_t - p_c
ci_lower = diff - 1.96 * se_diff
ci_upper = diff + 1.96 * se_diff
return {
'p_value': p_value,
'ci': (ci_lower, ci_upper)
}
def plot_results(self):
"""Visualize experiment results over time"""
df = pd.DataFrame(self.results)
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Conversion rates over time
axes[0, 0].plot(df['day'], df['control_rate'],
label='Control', color='blue')
axes[0, 0].plot(df['day'], df['treatment_rate'],
label='Treatment', color='green')
axes[0, 0].set_title('Conversion Rates')
axes[0, 0].set_xlabel('Day')
axes[0, 0].set_ylabel('Rate')
axes[0, 0].legend()
# P-value over time
axes[0, 1].plot(df['day'], df['p_value'], color='red')
axes[0, 1].axhline(y=0.05, color='gray', linestyle='--',
                           label='α = 0.05')
axes[0, 1].set_title('P-Value Evolution')
axes[0, 1].set_xlabel('Day')
axes[0, 1].set_ylabel('P-Value')
axes[0, 1].legend()
        # Lift with confidence interval (the CI is on the absolute difference,
        # so convert it to relative lift by dividing by the control rate)
        axes[1, 0].plot(df['day'], df['lift'] * 100, color='purple')
        axes[1, 0].fill_between(
            df['day'],
            df['confidence_interval'].apply(lambda x: x[0]) / df['control_rate'] * 100,
            df['confidence_interval'].apply(lambda x: x[1]) / df['control_rate'] * 100,
            alpha=0.3, color='purple')
axes[1, 0].set_title('Lift % with 95% CI')
axes[1, 0].set_xlabel('Day')
axes[1, 0].set_ylabel('Lift %')
# Significance indicator
colors = ['green' if sig else 'red'
for sig in df['significant']]
axes[1, 1].bar(df['day'], df['significant'], color=colors)
axes[1, 1].set_title('Statistical Significance')
axes[1, 1].set_xlabel('Day')
axes[1, 1].set_ylabel('Significant (1) or Not (0)')
plt.tight_layout()
plt.show()
# Run simulation
simulator = ABTestSimulator(
control_rate=0.10, # 10% baseline
treatment_rate=0.11, # 11% treatment (10% lift)
daily_traffic=1000
)
results = simulator.run_experiment(days=30)
print(results.tail())
simulator.plot_results()
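Test duration follows directly from the required sample size and the traffic each variant receives per day; a quick sketch with illustrative numbers:

```python
import math

required_per_variant = 14_748   # e.g. from the sample-size formula above
daily_traffic = 10_000          # total visitors per day
variants = 2
traffic_per_variant_per_day = daily_traffic / variants

days = math.ceil(required_per_variant / traffic_per_variant_per_day)
print(f"Run the test for at least {days} days")  # ~3 days at this traffic level
```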
| Segment | Control Conv. Rate | Treatment Conv. Rate | Lift | P-Value | Decision |
|---|---|---|---|---|---|
| New Users | 3.2% | 4.1% | +28% | 0.002 | Ship |
| Returning Users | 8.5% | 8.3% | -2% | 0.451 | No Effect |
| Mobile | 2.1% | 2.8% | +33% | 0.012 | Ship |
| Desktop | 5.4% | 5.2% | -4% | 0.623 | No Effect |
| High Value | 15.2% | 14.8% | -3% | 0.089 | Monitor |
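Because the five segment comparisons above multiply the chance of a false positive, a Bonferroni-adjusted threshold is one simple correction; the p-values are copied from the table:

```python
segment_p_values = {
    'New Users': 0.002,
    'Returning Users': 0.451,
    'Mobile': 0.012,
    'Desktop': 0.623,
    'High Value': 0.089,
}

alpha = 0.05
alpha_adjusted = alpha / len(segment_p_values)  # 0.01 with five tests

for name, p in segment_p_values.items():
    verdict = "significant" if p < alpha_adjusted else "not significant after correction"
    print(f"{name}: p = {p:.3f} -> {verdict}")
```

Note that Mobile (p = 0.012) no longer clears the corrected threshold, which is exactly the kind of over-claiming the correction guards against.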
import numpy as np
from scipy.stats import beta
class ThompsonSamplingBandit:
def __init__(self, n_arms):
self.n_arms = n_arms
self.successes = np.zeros(n_arms)
self.failures = np.zeros(n_arms)
self.total_rewards = 0
self.counts = np.zeros(n_arms)
def select_arm(self):
"""Select arm using Thompson Sampling"""
# Sample from Beta distribution for each arm
samples = []
for arm in range(self.n_arms):
            # Beta(α, β) where α = successes + 1, β = failures + 1
sample = beta.rvs(
self.successes[arm] + 1,
self.failures[arm] + 1
)
samples.append(sample)
# Select arm with highest sampled value
return np.argmax(samples)
def update(self, arm, reward):
"""Update arm statistics"""
self.counts[arm] += 1
if reward == 1:
self.successes[arm] += 1
self.total_rewards += 1
else:
self.failures[arm] += 1
def get_arm_probabilities(self):
"""Get probability of selecting each arm"""
n_simulations = 10000
selections = np.zeros(self.n_arms)
for _ in range(n_simulations):
arm = self.select_arm()
selections[arm] += 1
return selections / n_simulations
def run_experiment(self, true_rates, n_rounds):
"""Run bandit experiment"""
rewards_history = []
arm_history = []
for round in range(n_rounds):
# Select arm
arm = self.select_arm()
arm_history.append(arm)
# Get reward (simulate conversion)
reward = np.random.binomial(1, true_rates[arm])
rewards_history.append(reward)
# Update statistics
self.update(arm, reward)
# Log progress
if (round + 1) % 1000 == 0:
avg_reward = self.total_rewards / (round + 1)
print(f"Round {round + 1}: Avg Reward = {avg_reward:.3f}")
print(f"Arm Selection: {self.counts / self.counts.sum()}")
return {
'total_reward': self.total_rewards,
'arm_counts': self.counts,
'final_rates': self.successes / np.maximum(self.counts, 1),
'regret': self.calculate_regret(true_rates, arm_history)
}
def calculate_regret(self, true_rates, arm_history):
"""Calculate cumulative regret"""
best_arm = np.argmax(true_rates)
best_rate = true_rates[best_arm]
cumulative_regret = 0
for arm in arm_history:
regret = best_rate - true_rates[arm]
cumulative_regret += regret
return cumulative_regret
# Run Thompson Sampling on four variants with known true conversion rates
true_rates = [0.10, 0.12, 0.11, 0.09] # True conversion rates
bandit = ThompsonSamplingBandit(n_arms=4)
results = bandit.run_experiment(true_rates, n_rounds=10000)
print(f"\nFinal Results:")
print(f"Best Arm Found: {np.argmax(results['final_rates'])}")
print(f"True Best Arm: {np.argmax(true_rates)}")
print(f"Cumulative Regret: {results['regret']:.2f}")
print(f"Traffic Allocation: {results['arm_counts'] / results['arm_counts'].sum()}")
Update beliefs with evidence
import numpy as np
from scipy.stats import beta

# Posterior samples for each variant: Beta(conversions + 1, non-conversions + 1)
# (illustrative counts)
samples_a = beta.rvs(120 + 1, 1000 - 120 + 1, size=100_000)
samples_b = beta.rvs(135 + 1, 1000 - 135 + 1, size=100_000)

# Probability B > A
prob_b_better = np.mean(samples_b > samples_a)

# Expected loss if we choose A (average amount by which B beats A)
loss_choosing_a = np.maximum(samples_b - samples_a, 0).mean()

# Decision
if prob_b_better > 0.95:
    decision = "Choose B"
| Variant | Button Color | Button Text | Layout | Conversion | Interactions |
|---|---|---|---|---|---|
| Control | Blue | "Buy Now" | Standard | 5.0% | - |
| Var 1 | Green | "Buy Now" | Standard | 5.3% | Color effect |
| Var 2 | Blue | "Get Started" | Standard | 5.5% | Text effect |
| Var 3 | Green | "Get Started" | Standard | 6.2% | Color × Text |
| Var 4 | Green | "Get Started" | Centered | 7.1% | All factors |
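To quantify the Color × Text interaction suggested by the table, one option is a logistic regression with an interaction term; this sketch uses statsmodels on simulated user-level data whose cell rates mirror the table above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Hypothetical conversion rates per (color, text) cell, mirroring the table
cell_rates = {
    ('blue', 'buy_now'): 0.050,
    ('green', 'buy_now'): 0.053,
    ('blue', 'get_started'): 0.055,
    ('green', 'get_started'): 0.062,
}

rows = []
for (color, text), rate in cell_rates.items():
    converted = rng.binomial(1, rate, size=5_000)  # 5,000 visitors per cell
    rows.append(pd.DataFrame({'converted': converted, 'color': color, 'text': text}))
df = pd.concat(rows, ignore_index=True)

# The color:text coefficient captures the interaction effect
model = smf.logit('converted ~ color * text', data=df).fit(disp=0)
print(model.summary())
```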
import numpy as np
from scipy import stats

class CUPED:
    def __init__(self, pre_period_days=30):
        self.pre_period_days = pre_period_days

    def reduce_variance(self, df, experiment_start_date):
        """Apply CUPED to reduce metric variance using pre-experiment data"""
        # Split into pre-experiment and experiment periods
        pre_data = df[df['date'] < experiment_start_date]
        exp_data = df[df['date'] >= experiment_start_date]
# Calculate pre-period metric for each user
pre_metric = pre_data.groupby('user_id')['metric'].mean()
# Merge with experiment data
exp_data = exp_data.merge(
pre_metric.rename('pre_metric'),
on='user_id'
)
# Calculate theta (optimal coefficient)
cov = np.cov(exp_data['metric'], exp_data['pre_metric'])[0, 1]
var = np.var(exp_data['pre_metric'])
theta = cov / var if var > 0 else 0
# Adjust metric using CUPED
exp_data['adjusted_metric'] = (
exp_data['metric'] -
theta * (exp_data['pre_metric'] - exp_data['pre_metric'].mean())
)
# Calculate variance reduction
original_var = np.var(exp_data['metric'])
adjusted_var = np.var(exp_data['adjusted_metric'])
variance_reduction = 1 - (adjusted_var / original_var)
print(f"Variance Reduction: {variance_reduction:.1%}")
print(f"This is equivalent to {1/(1-variance_reduction):.1f}x more sample")
return exp_data
# Usage
cuped = CUPED(pre_period_days=30)
adjusted_data = cuped.reduce_variance(experiment_data, experiment_start_date)
# Now run test on adjusted metric
control_adjusted = adjusted_data[
adjusted_data['variant'] == 'control'
]['adjusted_metric']
treatment_adjusted = adjusted_data[
adjusted_data['variant'] == 'treatment'
]['adjusted_metric']
# T-test on adjusted metrics (lower variance = higher power)
t_stat, p_value = stats.ttest_ind(control_adjusted, treatment_adjusted)
Cluster randomization: randomize whole groups (e.g. cities) instead of individuals; a sketch follows below
Synthetic control: construct a control group from historical data
Ego-network randomization: randomize a user together with their connections (user + friends)
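A minimal sketch of cluster randomization, assigning whole cities rather than individual users so that network spillover stays within a variant; the city list is illustrative:

```python
import numpy as np

# Illustrative clusters: whole cities are the unit of randomization
cities = ["austin", "berlin", "sao_paulo", "tokyo", "lagos", "mumbai", "toronto", "seoul"]

rng = np.random.default_rng(7)
shuffled = rng.permutation(cities)
assignment = {city: ("treatment" if i < len(cities) // 2 else "control")
              for i, city in enumerate(shuffled)}

def assign_user(user_city: str) -> str:
    """Every user inherits the variant of their city, keeping spillover within a variant."""
    return assignment[user_city]

print(assignment)
print(assign_user("berlin"))
```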
# Sample Size (per variant)
n = (Z_α + Z_β)² × 2σ² / δ²
# Where:
# Z_α = Z-score for significance (1.96 for 95%)
# Z_β = Z-score for power (0.84 for 80%)
# σ² = Variance
# δ = Minimum detectable effect

# Standard Error (proportion)
SE = sqrt(p × (1 - p) / n)

# Confidence Interval
CI = p ± Z × SE

# Z-Score
Z = (p₁ - p₂) / sqrt(SE₁² + SE₂²)

# P-Value (two-tailed)
p_value = 2 × (1 - norm.cdf(abs(Z)))

# Relative Lift
lift = (treatment - control) / control × 100%

# Statistical Power
power = 1 - β

# Effect Size (Cohen's d)
d = (μ₁ - μ₂) / σ_pooled

# Chi-Square Test
χ² = Σ((O - E)² / E)

# Multiple Testing Correction (Bonferroni)
α_adjusted = α / m   # m = number of tests

# Bayesian Probability
P(B > A) = ∫∫ I(b > a) × P(a) × P(b) da db
| Tool | Best For | Features | Pricing |
|---|---|---|---|
| Optimizely | Enterprise | Full stack, Stats Engine | $$$ |
| Google Optimize | Web testing (sunset by Google in 2023) | Visual editor, GA integration | Free |
| LaunchDarkly | Feature flags | Progressive rollouts | $$ |
| Statsig | Product analytics | Auto-logging, Pulse | $ |
| Split.io | Engineering teams | SDKs, Targeting | $$ |
Primary metric: ✅ significant · Guardrails: ✅ no harm · Segments: ✅ consistent → Decision: SHIP
Primary metric: ❌ not significant · Secondary: ✅ positive · Learnings: ✅ clear → Decision: ITERATE
Primary metric: ❌ negative · Guardrails: ❌ violated · Cost: high → Decision: KILL