Part of Forge DevKit ecosystem

forge-ab

Test with rigor, not hunches

The problem

Tests launched without statistical rigor

Team runs A/B test for 3 days, declares a winner. Sample size: 47 visitors. That's noise, not signal.

No pre-committed hypothesis

Change button color, measure everything, find something significant. Classic p-hacking disguised as experimentation.

Test results don't get documented

Nobody remembers what you tested last quarter. Same experiments get repeated. Learnings evaporate.

How it works

1. Install

One command adds forge-ab to your environment.

forge install forge-ab

2. Configure

A 3-gate wizard reads your analytics context and establishes your experimentation principles.

3. Experiment

Structured hypothesis, pre-committed sample sizes, isolated variables, documented results.

Modes: hypothesis / design / analyze

4. Learn

Every test produces a structured doc: hypothesis, result, confidence level, and next action. Win or lose, it's searchable.
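The source doesn't specify forge-ab's actual doc schema, but the four fields named above (hypothesis, result, confidence level, next action) can be sketched as a record; everything below, including the field names and example values, is an illustrative assumption, not forge-ab output:

```python
from dataclasses import dataclass, asdict

@dataclass
class ExperimentLearning:
    """Hypothetical record of one experiment's outcome.

    Field names are illustrative, not forge-ab's real schema.
    """
    experiment: str
    hypothesis: str     # structured "If [change] then [metric] because [reason]"
    result: str         # what actually happened, win or lose
    confidence: float   # e.g. the observed p-value
    next_action: str    # what the team does with the learning

# A made-up entry, loosely echoing the sample spec later in this page
doc = ExperimentLearning(
    experiment="Pricing Page CTA Color",
    hypothesis="If the CTA turns green then CTR rises 10% because it stands out",
    result="variant +2.0pp CTR, not statistically significant",
    confidence=0.07,
    next_action="re-test with a larger sample or drop",
)
print(asdict(doc))
```

Keeping losses in the same searchable structure as wins is what makes the knowledge compound instead of evaporating in a chat thread.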

Key capabilities

3 experiment modes

Hypothesis (structured if/then/because), design (sample size + duration calc), analyze (significance test + documented learning).

Sample size pre-commitment

Calculate required sample size before launch. No early stopping, no p-hacking.
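forge-ab's internal calculation isn't shown here, but the standard two-proportion formula behind this kind of pre-commitment can be sketched in plain Python. The baseline rate, the absolute MDE, and the daily-traffic figure below are illustrative assumptions, not forge-ab defaults:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Visitors needed per arm to detect an absolute lift of `mde_abs`
    over `baseline` with a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power = 0.80
    p1, p2 = baseline, baseline + mde_abs
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Illustrative numbers: 10% baseline CTR, detect an absolute +5pp lift
n = sample_size_per_arm(baseline=0.10, mde_abs=0.05)
daily_traffic = 100                        # assumed visitors per day per arm
print(n, ceil(n / daily_traffic))          # per-arm sample, ~duration in days
```

Committing to this number before launch is what rules out early stopping: the test runs until both arms reach it, and only then is significance evaluated.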

4 psychology biases

Anchoring to first results, confirmation bias in analysis, the novelty effect - surfaced as experiment warnings.

Documented learnings

Every experiment produces structured documentation. Win or lose, knowledge compounds.

Sample output

A real-world example of what this module produces.

forge:ab design - Experiment Spec
 Experiment: Pricing Page CTA Color

Hypothesis: Changing CTA from blue to green increases clicks by 10%
Metric:     CTA click-through rate on /pricing
Guardrail:  Bounce rate must not increase by > 5%

Design:
  Control:  Blue CTA (#2563EB)    50% traffic
  Variant:  Green CTA (#16A34A)   50% traffic
  Sample:   3,200 visitors        (MDE 10%, power 80%)
  Duration: ~14 days              at current traffic

Pre-commit: Decision logged before results - no peeking
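The spec above stops before the analyze step. One common way to implement its significance test is a two-proportion z-test, sketched below in plain Python; the click counts are made-up illustrations, and the function name is ours, not forge-ab's:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value for the difference between two click-through rates,
    using a pooled two-proportion z-test."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative counts at the pre-committed sample: 1,600 visitors per arm.
# A 12% vs 10% CTR at this sample size is NOT significant at 0.05.
p = two_proportion_p_value(clicks_a=160, n_a=1600, clicks_b=192, n_b=1600)
print(round(p, 3))
```

Crucially, this test runs once, at the pre-committed sample size. Re-running it every day and stopping at the first p < 0.05 is exactly the peeking the pre-commit line rules out.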

Who is this for

Product Manager

Run statistically rigorous experiments with pre-committed hypotheses and sample sizes.

Growth Lead

Document every experiment result - wins and losses compound into organizational knowledge.

Data-Driven Developer

Get concrete experiment specs with sample size calculations instead of gut-feel testing.

forge-ab vs Ad-hoc A/B testing

Dimension           | Ad-hoc A/B testing                                        | forge-ab
Statistical rigor   | Run for a week, pick the winner                           | Pre-committed sample size, significance threshold
Hypothesis          | Change it, measure everything, find something significant | Structured: If [change] then [metric] because [reason]
Knowledge retention | Results in a Slack thread, then forgotten                 | Documented learnings that compound across experiments