
Sequential Testing

What is Sequential Testing?

Traditional A/B testing best practices dictate that the readout of experiment metrics should occur only once, when the target sample size of the experiment has been reached. Continuous monitoring for the purpose of decision making results in inflated false positive rates (a.k.a. the peeking problem), much higher than expected based on the significance level selected for the test. This is because p-values fluctuate and are likely to drop in and out of significance just by random chance, even when there is no real effect. Continuous monitoring also introduces selection bias in the date we pick for the readout: selectively choosing a date based on the observed results is essentially cherry-picking a stat-sig result that would never be observed if the data were analyzed only once, over the entire, pre-determined duration of the experiment. This increases the false positive rate (observing an experimental effect when there is none).

In Sequential Testing, the p-values for each preliminary analysis window are adjusted to compensate for the increased false positive rate associated with peeking. The goal is to enable early decision making when there is sufficient evidence, while limiting the risk of false positives.

While peeking is typically discouraged, regular monitoring of experiments with sequential testing is particularly valuable in some cases:

- Unexpected regressions: Sometimes experiments have bugs or unintended consequences that severely impact key metrics. Sequential testing helps identify these regressions early and distinguishes significant effects from random fluctuations.
- Opportunity cost: This arises when a significant loss may be incurred by delaying the experiment decision, such as launching a new feature ahead of a major event or fixing a bug. If sequential testing shows an improvement in the key metrics, an early decision can be made. But use caution: an early stat-sig result for certain metrics doesn't guarantee sufficient power to detect regressions in other metrics. Limit this approach to cases where only a small number of metrics are relevant to the decision.

Quick Guide: Interpreting Sequential Testing Results

Click on Edit at the top of the metrics section in Pulse to toggle Sequential Testing on/off. When enabled, an adjustment is automatically applied to results calculated before the target completion date of the experiment. The dashed line represents the expanded confidence interval resulting from the adjustment; the solid bar is the standard confidence interval computed without any adjustments. If the adjusted confidence interval overlaps with zero, the metric delta is not stat-sig at the moment, and the experiment should continue its course as planned.

Hover over a metric and click View Details to review the progression of the sequential test. The Sequential Testing Z-Statistic time series shows the adjusted Z-score thresholds for the metric: these start out high, signifying the increased confidence needed for making an early decision. When the target duration is reached, they converge to the standard Z-score for the selected significance level (dashed lines).
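The inflation from peeking described above is easy to demonstrate with a small simulation. The sketch below is purely illustrative (the function name and parameters are hypothetical, not part of any product API): it runs A/A experiments with no real effect, and compares testing once at the end against testing after every day and stopping at the first stat-sig result.

```python
import random
from statistics import NormalDist

def peeking_false_positive_rate(n_experiments=1000, n_per_day=100, n_days=14,
                                alpha=0.05, peek=True, seed=7):
    """Simulate A/A tests (no true effect) and measure how often we wrongly
    declare significance. With peek=True we run a z-test after every day and
    stop at the first stat-sig result; with peek=False we test only once,
    at the end of the pre-determined duration."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. ~1.96 for alpha=0.05
    false_positives = 0
    for _ in range(n_experiments):
        diff_sum, n = 0.0, 0
        hit = False
        for _day in range(n_days):
            # Accumulate daily per-user metric differences; mean 0 = no effect.
            for _ in range(n_per_day):
                diff_sum += rng.gauss(0.0, 1.0)
                n += 1
            if peek:
                z = diff_sum / n ** 0.5  # mean / SE, with known sigma = 1
                if abs(z) > z_crit:
                    hit = True  # an early "stat-sig" readout -> decision made
                    break
        if not peek:
            z = diff_sum / n ** 0.5
            hit = abs(z) > z_crit
        if hit:
            false_positives += 1
    return false_positives / n_experiments

print(peeking_false_positive_rate(peek=False))  # close to the nominal 0.05
print(peeking_false_positive_rate(peek=True))   # noticeably higher
```

With 14 daily looks, the peeking false positive rate typically lands several times above the nominal 5%, even though every individual test uses the correct critical value; this is exactly the inflation that sequential testing's adjusted p-values compensate for.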

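The behavior of the adjusted thresholds (starting high, then converging to the standard Z-score at the target duration) can be sketched with one common adjustment family, an O'Brien-Fleming-style boundary. This is an assumption for illustration only; the function names are hypothetical and the exact adjustment used in the product may differ.

```python
from statistics import NormalDist

def adjusted_z_threshold(fraction_complete, alpha=0.05):
    """O'Brien-Fleming-style critical value: the final-analysis Z threshold
    inflated by 1/sqrt(information fraction). It starts out high for early
    readouts and converges to the standard threshold at fraction_complete=1."""
    z_final = NormalDist().inv_cdf(1 - alpha / 2)
    return z_final / fraction_complete ** 0.5

def expanded_confidence_interval(delta, std_err, fraction_complete, alpha=0.05):
    """Expanded CI for an early readout: wider than the standard CI, so a
    delta must be larger (relative to its noise) to read as stat-sig early."""
    z = adjusted_z_threshold(fraction_complete, alpha)
    return (delta - z * std_err, delta + z * std_err)

# Halfway through the experiment the bar for an early decision is higher:
print(round(adjusted_z_threshold(0.5), 3))  # ~2.772 vs the standard ~1.96
print(round(adjusted_z_threshold(1.0), 3))  # back to ~1.96 at target duration
```

For example, a delta of 0.02 with standard error 0.01 is stat-sig under the standard 95% interval (0.02 ± 1.96 × 0.01 excludes zero), but halfway through the experiment the expanded interval 0.02 ± 2.772 × 0.01 overlaps zero, so the experiment should continue as planned.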