Part of Forge DevKit ecosystem
◇ forge-ab
Test with rigor, not hunches
The problem
Tests launched without statistical rigor
A team runs an A/B test for three days and declares a winner. Sample size: 47 visitors. That's noise, not signal.
No pre-committed hypothesis
Change a button color, measure everything, and find something significant. Classic p-hacking disguised as experimentation.
Test results don't get documented
Nobody remembers what you tested last quarter. Same experiments get repeated. Learnings evaporate.
How it works
Install
One command adds forge-ab to your environment.
Configure
A 3-gate wizard reads your analytics context and establishes experimentation principles.
Experiment
Structured hypothesis, pre-committed sample sizes, isolated variables, documented results.
Learn
Every test produces a structured doc: hypothesis, result, confidence level, and next action. Win or lose, it's searchable.
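The hypothesis → design → analyze → learn flow above implies a structured record per experiment. A minimal sketch of what such a record might look like (field names here are illustrative, not forge-ab's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExperimentRecord:
    """Illustrative shape of a structured, searchable experiment doc."""
    hypothesis: str          # "If [change] then [metric] because [reason]"
    metric: str              # primary metric being measured
    guardrail: str           # condition that must not regress
    sample_size: int         # pre-committed before launch
    result: str = "pending"  # win / loss / inconclusive
    confidence: Optional[float] = None  # filled in at analysis time
    next_action: str = ""    # documented learning for the next experiment

# Example record mirroring the sample output below
record = ExperimentRecord(
    hypothesis="If the CTA turns green then clicks rise 10% because it contrasts more",
    metric="CTA click-through rate on /pricing",
    guardrail="Bounce rate must not increase by > 5%",
    sample_size=3200,
)
```

Because the sample size and hypothesis are fields set before launch, the record itself enforces pre-commitment: analysis fills in `result` and `confidence` later, it never rewrites the design.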
Key capabilities
◇ 3 experiment modes
Hypothesis (structured if/then/because), design (sample size + duration calc), analyze (significance test + documented learning).
◇ Sample size pre-commitment
Calculate required sample size before launch. No early stopping, no p-hacking.
◇ 4 psychology biases
Anchoring to first results, confirmation bias in analysis, novelty effect - surfaced as experiment warnings.
◇ Documented learnings
Every experiment produces structured documentation. Win or lose, knowledge compounds.
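Sample size pre-commitment boils down to a standard two-proportion power calculation. A sketch using only the standard library (the baseline conversion rate is an assumed input here, not something forge-ab is shown to infer):

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline: float, mde_rel: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per arm to detect a relative lift of mde_rel
    over baseline with a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha=0.05
    z_power = NormalDist().inv_cdf(power)           # e.g. 0.84 for power=0.80
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p1 - p2) ** 2)
    return math.ceil(n)

# Assumed 20% baseline CTR, 10% relative MDE, 80% power
n = sample_size_per_arm(baseline=0.20, mde_rel=0.10)
```

Computing this number before launch is what makes "no early stopping" enforceable: the decision date is fixed by `n` and traffic, not by when the results start to look good.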
Sample output
A real-world example of what this module produces.
◆ Experiment: Pricing Page CTA Color
Hypothesis: Changing CTA from blue to green increases clicks by 10%
Metric: CTA click-through rate on /pricing
Guardrail: Bounce rate must not increase by > 5%
Design:
Control: Blue CTA (#2563EB) 50% traffic
Variant: Green CTA (#16A34A) 50% traffic
Sample: 3,200 visitors (MDE 10%, power 80%)
Duration: ~14 days at current traffic
Pre-commit: Decision logged before results - no peeking
Who is this for
Product Manager
Run statistically rigorous experiments with pre-committed hypotheses and sample sizes.
Growth Lead
Document every experiment result - wins and losses compound into organizational knowledge.
Data-Driven Developer
Get concrete experiment specs with sample size calculations instead of gut-feel testing.
forge-ab vs Ad-hoc A/B testing
| Dimension | Ad-hoc A/B testing | forge-ab |
|---|---|---|
| Statistical rigor | Run for a week, pick the winner | Pre-committed sample size, significance threshold |
| Hypothesis | Change it, measure everything, find something significant | Structured: If [change] then [metric] because [reason] |
| Knowledge retention | Results in a Slack thread, then forgotten | Documented learnings that compound across experiments |