P-Value Formula and Interpretation
Reference for calculating and interpreting p-values in hypothesis testing.
Covers null rejection, one-tailed vs two-tailed, and z-test vs t-test.
The Concept
The p-value is the probability of getting results at least as extreme as the observed results, assuming the null hypothesis is true.
For a Z-Test
One-tailed (left): p = P(Z < z)
One-tailed (right): p = P(Z > z)
Two-tailed: p = 2 × P(Z > |z|)
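The three formulas above can be computed with only the standard library, using the identity P(Z > z) = ½·erfc(z/√2) for the standard normal. This is a minimal sketch; the function names `norm_sf` and `p_value` are illustrative, not from any particular library.

```python
from math import erfc, sqrt

def norm_sf(z):
    """Survival function of the standard normal: P(Z > z)."""
    return 0.5 * erfc(z / sqrt(2))

def p_value(z, tail="two"):
    """P-value for a z statistic; tail is 'left', 'right', or 'two'."""
    if tail == "left":
        return 1.0 - norm_sf(z)       # P(Z < z)
    if tail == "right":
        return norm_sf(z)             # P(Z > z)
    return 2.0 * norm_sf(abs(z))      # 2 × P(Z > |z|)
```

In practice, `scipy.stats.norm.sf` gives the same survival function without hand-rolling the erfc identity.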
Decision Rules
| P-Value | Common Interpretation | Decision (at α = 0.05) |
|---|---|---|
| p < 0.001 | Very strong evidence against H₀ | Reject H₀ |
| p < 0.01 | Strong evidence against H₀ | Reject H₀ |
| p < 0.05 | Moderate evidence against H₀ | Reject H₀ |
| 0.05 ≤ p < 0.10 | Weak evidence against H₀ | Fail to reject H₀ |
| p ≥ 0.10 | Little to no evidence against H₀ | Fail to reject H₀ |
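The decision table translates directly into a lookup function. A sketch (the function name `decide` is illustrative):

```python
def decide(p, alpha=0.05):
    """Map a p-value to an evidence label and a reject/fail-to-reject decision."""
    if p < 0.001:
        evidence = "very strong evidence against H0"
    elif p < 0.01:
        evidence = "strong evidence against H0"
    elif p < 0.05:
        evidence = "moderate evidence against H0"
    elif p < 0.10:
        evidence = "weak evidence against H0"
    else:
        evidence = "little to no evidence against H0"
    decision = "Reject H0" if p < alpha else "Fail to reject H0"
    return evidence, decision
```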
Common Misconceptions
- A p-value is NOT the probability that the null hypothesis is true
- A p-value is NOT the probability your result is due to chance
- p < 0.05 does not mean the result is practically important
- p > 0.05 does not mean there is no effect — it means you lack evidence
- A very small p-value with a tiny effect size may not be meaningful
Example
A z-test gives z = 2.15. What is the two-tailed p-value?
P(Z > 2.15) = 0.0158 (from z-table or calculator)
Two-tailed p = 2 × 0.0158 = 0.0316
Since 0.0316 < 0.05, this result is statistically significant at the 5% level.
We reject the null hypothesis.
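The same example, checked numerically rather than from a z-table (using the erfc identity for the normal tail probability):

```python
from math import erfc, sqrt

z = 2.15
p_one = 0.5 * erfc(z / sqrt(2))   # P(Z > 2.15) ≈ 0.0158
p_two = 2 * p_one                 # two-tailed p ≈ 0.0316
print(round(p_one, 4), round(p_two, 4))
```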
Key Notes
- What p-value actually means: The p-value is the probability of obtaining a test result at least as extreme as the observed one, assuming the null hypothesis is true. It is NOT the probability that H₀ is true.
- The 0.05 threshold is a convention: Ronald Fisher suggested 0.05 as a rough guideline, not a law of science. Some fields use 0.01 (stricter) or 0.10 (more lenient). High-energy physics requires p < 0.0000003 (5-sigma) before claiming a discovery.
- Statistical vs practical significance: With a very large sample, even a trivially small and unimportant effect can produce p < 0.05. Always report effect size (Cohen's d, R², etc.) alongside the p-value.
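To make the statistical-vs-practical point concrete, Cohen's d for two groups can be computed with a pooled standard deviation. The numbers in the test are invented for illustration: a 0.2-point difference between group means with SD 2.0 gives d = 0.1, a trivial effect even though huge samples could make it "significant".

```python
from math import sqrt

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Cohen's d for two independent groups, using a pooled SD."""
    pooled = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled
```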
- Multiple comparisons inflate false positives: If you run 20 independent tests at p < 0.05, you expect about 1 false positive by chance. Apply Bonferroni correction (divide α by the number of tests) or use FDR methods.
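The Bonferroni correction described above is a one-liner: compare each p-value to α divided by the number of tests. A sketch (the function name is illustrative):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 only where p < alpha / (number of tests)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]
```

With three tests at α = 0.05, the per-test threshold becomes 0.05 / 3 ≈ 0.0167, so a p-value of 0.04 that would pass alone no longer does.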
- One-tailed vs two-tailed tests: A two-tailed test checks for an effect in either direction (more common). A one-tailed test checks only one direction, and for the same data it gives half the two-tailed p-value when the effect lies in the predicted direction — use it only with strong prior directional justification, decided before seeing the data.
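The halving relationship is easy to verify numerically: for the same z statistic, the right-tailed p is exactly half the two-tailed p (again via the erfc identity for the normal tail).

```python
from math import erfc, sqrt

def sf(z):
    """P(Z > z) for the standard normal."""
    return 0.5 * erfc(z / sqrt(2))

z = 2.15
p_right = sf(z)          # one-tailed (right)
p_two = 2 * sf(abs(z))   # two-tailed
# For the same z, the one-tailed p is exactly half the two-tailed p,
# which is why choosing a one-tailed test after seeing the data inflates significance.
```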