Our thanks to Tom Siegfried for raising the issue; a simple example is needed:
For reasons known only to him, your lunch companion takes out two coins, a quarter and a nickel. He flips both coins, first the quarter, then the nickel. He does this five times and, each time, the nickel lands the same way as the quarter: heads when the quarter lands heads, tails when it lands tails. "This," he says, "can't be coincidence; the quarter must be forcing how the nickel lands."
Being naturally skeptical, you immediately set out to discover whether this hypothesis, that the quarter forced the nickel, is tenable. You assume a null hypothesis: that there's no connection between the coins and that the matches happened by chance. With fair coins, the nickel has a 1/2 chance of matching the quarter on any one flip, so the probability of it falling the same as the quarter five times in a row by chance is (1/2)^5 = 1/32, or about 3%. If you repeated the exercise every day, you'd get a five-fold match about once a month on average.
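For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes fair, independent coins under the null hypothesis; the variable names are made up for illustration.

```python
from fractions import Fraction

# Assumption: under the null hypothesis both coins are fair and independent,
# so the nickel matches the quarter on any single flip with probability 1/2.
p_match_once = Fraction(1, 2)
n_flips = 5

# Probability that all five flips match by pure chance.
p_all_match = p_match_once ** n_flips

# If the exercise is run once a day, the average wait for a five-fold match
# is 1/p days (the mean of a geometric distribution).
expected_days = 1 / p_all_match

print(p_all_match)         # 1/32
print(float(p_all_match))  # 0.03125
print(expected_days)       # 32
```

An average wait of 32 days is where "about once a month" comes from.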
The usual scientific threshold for statistical significance is 5%. The USEPA uses 10% to classify a Group A carcinogen. Since smaller is more significant, a probability of 3% is safely inside those thresholds and appears to decisively reject the null hypothesis [cue the theremin]. You've found a "statistically significant result". The quarter appears to be forcing the nickel.
You might even say that the probability is 97% that the quarter is forcing the nickel. You'd be in good company.
And you'd be wrong on so many levels that it hurts my head to think about it.
Yet there has to be a 97% somewhere—and here it is: If the null hypothesis is correct—if there's no causal connection between the quarter and the nickel—the probability of at least one mismatch in five tosses is 97%.
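The 97% is just the complement of 1/32. The sketch below computes it exactly and then roughly checks it by simulating many five-flip sessions. The fair, independent coins, the `simulate` helper, the seed and the trial count are all assumptions made for illustration.

```python
import random
from fractions import Fraction

# Exact value: 1 minus the chance that all five flips match.
p_all_match = Fraction(1, 2) ** 5
p_some_mismatch = 1 - p_all_match
print(p_some_mismatch, float(p_some_mismatch))  # 31/32 0.96875


def simulate(trials, n_flips=5, seed=0):
    """Estimate the chance of at least one mismatch with fair, independent coins."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    mismatched = 0
    for _ in range(trials):
        # Each session is n_flips (quarter, nickel) pairs; True means heads.
        flips = [(rng.random() < 0.5, rng.random() < 0.5) for _ in range(n_flips)]
        if any(quarter != nickel for quarter, nickel in flips):
            mismatched += 1
    return mismatched / trials


estimate = simulate(100_000)
print(estimate)  # should land close to 0.97
```

Note what the number describes: how often the coins would disagree if chance alone were at work. It says nothing about how likely the forcing hypothesis is.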
Which didn't happen. And that's too bad. If we'd had just one mismatch, we'd have laid to rest the forcing hypothesis and that would be the end of the matter. Nonetheless, we can't say that we proved the forcing hypothesis.
As it is, all we can say is that we failed to rule out the forcing hypothesis: the data gave us no stronger argument in favor of pure chance. Without more information or more experiments, we have no idea what the probability is that the quarter forced the nickel. We might say, on the strength of five matches, that for small values of smidgen, it's a smidgen more probable than it was before.
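One way to put a size on the smidgen is a Bayesian update, which is not part of the argument above and is offered only as a sketch. It needs two things the experiment doesn't supply: a prior probability that the quarter forces the nickel (the values below are arbitrary guesses), and the assumption that a forcing quarter would produce a match every time. The `posterior_forcing` function is hypothetical.

```python
from fractions import Fraction


def posterior_forcing(prior, n_matches=5):
    """Probability of forcing after n_matches straight matches, given a prior."""
    # Assumption: if the quarter really forces the nickel, every flip matches.
    likelihood_forcing = Fraction(1)
    # Under pure chance with fair coins, each flip matches with probability 1/2.
    likelihood_chance = Fraction(1, 2) ** n_matches
    numerator = prior * likelihood_forcing
    return numerator / (numerator + (1 - prior) * likelihood_chance)


# Arbitrary illustrative priors, from deeply skeptical to even odds.
for prior in (Fraction(1, 1_000_000), Fraction(1, 1000), Fraction(1, 2)):
    post = posterior_forcing(prior)
    print(f"prior {float(prior):.6f} -> posterior {float(post):.6f}")
```

The answer depends entirely on the prior you walk in with. A skeptic who starts at one in a million ends up at roughly 32 in a million: more probable than before, but nowhere near 97%.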