Tags: cro, experiments, analytics, conversion, growth

Experiment Results Analysis

Analyze completed A/B test results with statistical rigor to produce ship/kill/extend decisions and follow-up test hypotheses.

Context

Use this after an A/B test has completed its run and you need to determine whether the variant won, lost, or needs more time. This skill closes the CRO experimentation loop: funnel-audit finds leaks, experiment-backlog plans tests, and this skill analyzes results and generates the next round of hypotheses. Without rigorous post-test analysis, teams either ship inconclusive variants or kill promising tests too early.

Procedure

  1. Validate test setup. Confirm the test ran for at least 2 full business cycles (14+ days). Verify the traffic split was balanced (within 2% of target). Check whether the original sample size requirement was met. Flag any mid-test changes (variant edits, traffic reallocation, audience changes) that compromise validity.
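If the original sample size requirement was never recorded, it can be reconstructed from the standard two-proportion formula. A minimal sketch using only the Python standard library (the function name and the 3% baseline / 10% relative MDE figures are illustrative assumptions, not values from this document):

```python
import math
from statistics import NormalDist

def required_sample_size(baseline_cvr, mde_relative, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect a relative lift of
    `mde_relative` over `baseline_cvr` at the given alpha and power."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + mde_relative)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)  # round up to whole visitors

# e.g. 3% baseline CVR, aiming to detect a 10% relative lift
n = required_sample_size(0.03, 0.10)
```

Compare `n` against the actual per-variant counts from step 1; a test that never reached its requirement cannot produce a clean KILL on "no detectable effect."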

  2. Calculate core statistics. For each variant compute: conversion rate, absolute lift, relative lift percentage. Calculate the two-proportion z-test p-value and 95% confidence interval for the difference. Report statistical power achieved given the observed effect size. If the test tracked revenue, compute average revenue per visitor with confidence interval.
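These statistics need no external stats package; `statistics.NormalDist` from the Python standard library suffices. A sketch (function name and sample counts are illustrative; note that power computed at the observed effect size is post-hoc power and should be read cautiously):

```python
from statistics import NormalDist

def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-proportion z-test: lift, p-value, CI, and achieved power."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a                # absolute lift
    rel_lift = diff / p_a          # relative lift
    # pooled standard error for the z-test p-value
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # unpooled standard error for the CI on the difference
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    # post-hoc power at the observed effect (lower-tail term neglected)
    power = 1 - NormalDist().cdf(z_crit - abs(diff) / se)
    return {"cvr_a": p_a, "cvr_b": p_b, "abs_lift": diff,
            "rel_lift": rel_lift, "p_value": p_value,
            "ci_95": ci, "power": power}

result = two_proportion_test(500, 10000, 580, 10000)
```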

  3. Run segment analysis. Break results by device type (mobile vs desktop), traffic source (organic, paid, direct, referral), and visitor type (new vs returning). Flag any segment where the variant effect reverses direction — this indicates the aggregate result may be misleading (Simpson's paradox risk).
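The reversal check reduces to comparing the sign of each segment's lift against the sign of the aggregate lift. A minimal sketch (the data shape and segment names are assumptions for illustration):

```python
def flag_segment_reversals(segments):
    """`segments` maps name -> (control_conversions, control_visitors,
    variant_conversions, variant_visitors). Returns the segments whose
    lift direction disagrees with the aggregate (Simpson's paradox risk)."""
    totals = [sum(vals[i] for vals in segments.values()) for i in range(4)]
    agg_lift = totals[2] / totals[3] - totals[0] / totals[1]
    reversed_segments = []
    for name, (ca, na, cb, nb) in segments.items():
        seg_lift = cb / nb - ca / na
        if seg_lift * agg_lift < 0:  # opposite signs => direction reversal
            reversed_segments.append(name)
    return reversed_segments

flags = flag_segment_reversals({
    "mobile":  (300, 6000, 290, 6000),   # variant worse on mobile
    "desktop": (200, 4000, 290, 4000),   # variant better on desktop
})
```

Here the aggregate lift is positive but mobile reverses, so the aggregate result alone would be misleading.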

  4. Determine decision. Apply the decision framework:

    • SHIP — p < 0.05, lift is positive, the 95% confidence interval for the difference does not cross zero, the test ran its full duration, and no segment reverses direction.
    • KILL — p < 0.05 with negative lift, or p > 0.20 after full sample size reached (no detectable effect).
    • EXTEND — 0.05 < p < 0.20 with positive trend. Calculate projected days to reach significance at the current effect size and traffic rate. If projected duration exceeds 60 additional days, recommend killing.
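The framework above can be sketched as a single function. The thresholds come from the bullets; the fallback for ambiguous combinations (e.g. a significant positive lift with a segment reversal) is an assumption and defaults to EXTEND:

```python
def decide(p_value, rel_lift, ci_low, ran_full_duration,
           segment_reversal, sample_size_met, projected_extra_days):
    """Apply the SHIP / KILL / EXTEND framework to one completed test."""
    if (p_value < 0.05 and rel_lift > 0 and ci_low > 0
            and ran_full_duration and not segment_reversal):
        return "SHIP"
    if (p_value < 0.05 and rel_lift < 0) or \
            (p_value > 0.20 and sample_size_met):
        return "KILL"
    if 0.05 <= p_value <= 0.20 and rel_lift > 0:
        # positive trend: extend only if significance is reachable soon
        return "EXTEND" if projected_extra_days <= 60 else "KILL"
    return "EXTEND"  # ambiguous cases default to gathering more data
```

For the EXTEND projection, re-run the sample size formula at the observed effect size and divide the shortfall by daily traffic to get `projected_extra_days`.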

  5. Model revenue impact. Annualize the observed lift: (daily traffic × lift in conversion rate × average order value × 365). Apply the lower bound of the 95% confidence interval for a conservative estimate. Note seasonal caveats if the test ran during an atypical period.
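The annualization is a straight multiplication; running it twice — once with the observed lift and once with the lower CI bound — yields the point and conservative estimates. A minimal sketch (traffic, lift, and AOV figures are illustrative assumptions):

```python
def annualized_lift(daily_traffic, cvr_lift, avg_order_value):
    """Projected yearly revenue impact of an absolute CVR lift."""
    return daily_traffic * cvr_lift * avg_order_value * 365

# 2,000 visitors/day, +0.8pp observed lift (CI lower bound +0.25pp), €60 AOV
point = annualized_lift(2000, 0.008, 60)          # observed lift
conservative = annualized_lift(2000, 0.0025, 60)  # lower 95% CI bound
```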

  6. Generate follow-up hypotheses. Based on what the test revealed about user behavior, propose 2-3 next experiments. Each hypothesis must follow the format: "If we [change], then [metric] will [improve/decline] because [reason learned from this test]." Prioritize hypotheses that compound the winning insight or investigate unexpected segment differences.

Output Format

# Experiment Results: [Test Name]

## Test Summary
| Field | Value |
|-------|-------|
| Hypothesis | [Original hypothesis] |
| Duration | [Start] – [End] ([N] days) |
| Traffic split | [X/Y]% |
| Sample size target | [N] per variant |
| Sample size actual | Control: [N] / Variant: [N] |

## Statistical Results
| Variant | Visitors | Conversions | CVR | 95% CI |
|---------|----------|-------------|-----|--------|
| Control | | | | |
| Variant | | | | |

- **Relative lift:** [X]%
- **p-value:** [X]
- **Statistical power:** [X]%
- **Decision:** SHIP / KILL / EXTEND

## Decision Rationale
[2-3 sentences explaining why this decision, referencing the data above]

## Revenue Impact
- **Annualized lift (point estimate):** [currency][amount]
- **Conservative estimate (lower 95% CI):** [currency][amount]
- **Caveats:** [seasonal, traffic trend, AOV assumptions]

## Segment Analysis
| Segment | Control CVR | Variant CVR | Lift | Consistent? |
|---------|------------|-------------|------|-------------|
| Mobile | | | | Yes/No |
| Desktop | | | | Yes/No |
| New visitors | | | | Yes/No |
| Returning visitors | | | | Yes/No |

## Follow-Up Hypotheses
1. If we [change], then [metric] will [improve] because [learning from this test].
2. If we [change], then [metric] will [improve] because [learning from this test].
3. If we [change], then [metric] will [improve] because [learning from this test].

## Handoff
- Ship to: [engineer/team] with implementation spec
- Monitor: [analytics-expert] for post-ship regression check at 7/14/30 days
- Next test: [experiment-backlog] with follow-up hypotheses above

QA Rubric (scored)

  • Statistical rigor (0-5): significance properly calculated with confidence intervals, not just p-values.
  • Decision quality (0-5): recommendation is justified by data and accounts for practical significance and segment consistency.
  • Revenue modeling (0-5): impact projections use confidence ranges and disclose assumptions.
  • Learning extraction (0-5): follow-up hypotheses are derived from specific test insights, not recycled generic ideas.

Examples (good/bad)

  • Good: "Variant B increased checkout CVR by 14.2% (p = 0.003, 95% CI: [8.1%, 20.3%]). SHIP. Conservative annualized revenue lift: €22,400. However, mobile segment showed no lift — follow-up test should isolate mobile checkout experience."
  • Bad: "The variant looks better based on the numbers. We should ship it and see what happens." (no p-value, no CI, no segment check, no revenue projection)

Variants

  • Quick-read variant: summary table and decision only, skip segment analysis (for weekly standup updates).
  • High-stakes variant: add 99% confidence threshold, require segment consistency across all segments before shipping, include Bayesian credible interval alongside frequentist CI.
  • Multi-variant variant: extend to 3+ variants with Bonferroni correction for multiple comparisons.