Imagine you run an A/B test on your website and get a “statistically significant” result (p < 0.05). You celebrate, roll out the new design, and... nothing noticeably changes. What went wrong? You found a real difference, but it was so tiny that it didn't matter in practice. This is where effect size comes in.
Effect size measures how big a difference or relationship actually is, not just whether it exists. While p-values tell you “is there an effect?”, effect size tells you “how large is the effect?”. In this tutorial, we'll explore what effect size means, the most common measures, and why every researcher should report it.
Effect size is a quantitative measure of the magnitude of a phenomenon. Unlike a p-value, which simply tells you whether an observed result is unlikely under the null hypothesis, effect size tells you how much two groups differ, or how strongly two variables are related.
Think of it this way: if a doctor tells you a new medication “works” (p < 0.05), your next question should be “how well does it work?”. Does it reduce pain by 1% or by 50%? That's the question effect size answers.
Statistical Significance vs. Practical Significance
A result can be statistically significant but practically meaningless (especially with large samples), or practically important but not statistically significant (especially with small samples).
Different research designs call for different effect size measures. Here are the most widely used ones:
Cohen's d measures the difference between two group means in terms of standard deviations. It's the most common effect size for comparing two groups (e.g., treatment vs. control).
$$d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}$$

Where $\bar{X}_1$ and $\bar{X}_2$ are the group means, and $s_p$ is the pooled standard deviation.
A Cohen's d of 0.5 means the two group means are half a standard deviation apart. The larger the d, the more the groups differ.
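As a minimal sketch, here's how you might compute Cohen's d in Python with NumPy. The scores below are simulated purely for illustration:

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    # Pooled SD weights each group's variance by its degrees of freedom.
    s_pooled = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / s_pooled

# Hypothetical test scores: treatment shifted up by about half an SD
rng = np.random.default_rng(42)
treatment = rng.normal(loc=105, scale=15, size=50)
control = rng.normal(loc=100, scale=15, size=50)
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```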
Pearson's r measures the strength and direction of the linear relationship between two continuous variables. It ranges from -1 to +1.
Values close to +1 or -1 indicate a strong relationship; values close to 0 indicate a weak relationship.
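A quick sketch, assuming SciPy is available; the variables (hours_studied, exam_score) are made up for illustration:

```python
import numpy as np
from scipy import stats

# Simulate a noisy positive linear relationship
rng = np.random.default_rng(0)
hours_studied = rng.uniform(0, 10, size=100)
exam_score = 60 + 3 * hours_studied + rng.normal(0, 8, size=100)

r, p_value = stats.pearsonr(hours_studied, exam_score)
print(f"Pearson's r = {r:.2f}, p = {p_value:.4f}")
```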
Eta-squared ($\eta^2$) measures the proportion of total variance in the outcome that is explained by the independent variable. It's commonly used with ANOVA and related tests.
An $\eta^2$ of 0.06 means 6% of the variance in the outcome is accounted for by the grouping variable.
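Here is a small sketch of computing eta-squared directly from its definition, the between-group sum of squares divided by the total sum of squares, with simulated data for three hypothetical groups:

```python
import numpy as np

def eta_squared(*groups):
    """Eta-squared: proportion of total variance explained by group membership."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Between-group variability: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Three hypothetical teaching methods, 30 students each
rng = np.random.default_rng(1)
a = rng.normal(70, 10, 30)
b = rng.normal(74, 10, 30)
c = rng.normal(78, 10, 30)
print(f"eta-squared = {eta_squared(a, b, c):.3f}")
```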
The odds ratio compares the odds of an event occurring in one group to the odds in another. It's widely used in medical research and logistic regression.
An OR of 1 means no difference; OR > 1 means higher odds in the treatment group; OR < 1 means lower odds.
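A minimal example, using a made-up 2×2 table of recovery counts:

```python
# Hypothetical counts:        (recovered, not recovered)
treatment = (60, 40)
control = (45, 55)

odds_treatment = treatment[0] / treatment[1]  # 60/40 = 1.50
odds_control = control[0] / control[1]        # 45/55 ≈ 0.82
odds_ratio = odds_treatment / odds_control
print(f"OR = {odds_ratio:.2f}")  # ≈ 1.83: higher odds of recovery with treatment
```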
Jacob Cohen proposed the following benchmarks as general guidelines for interpreting common effect size measures. While these are widely used, remember that what counts as a “small” or “large” effect depends on the context of your research.
| Interpretation | Cohen's d | Pearson's r | Eta-squared (η²) |
|---|---|---|---|
| Small | 0.2 | 0.10 | 0.01 |
| Medium | 0.5 | 0.30 | 0.06 |
| Large | 0.8 | 0.50 | 0.14 |
Context Matters
Cohen himself cautioned that these are rough guidelines. In some fields, a “small” effect can be hugely important. For example, a medication that reduces heart attack risk by just 1% (small effect) could save thousands of lives when applied across millions of people. Always interpret effect sizes within the context of your specific domain.
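If you want a quick label for a computed d, a helper like the one below encodes Cohen's benchmarks. This is only a sketch of one common convention; as noted above, domain context should always override these labels:

```python
def interpret_cohens_d(d):
    """Map |d| onto Cohen's rough benchmarks. Context should override these labels."""
    d = abs(d)
    if d < 0.2:
        return "negligible"
    elif d < 0.5:
        return "small"
    elif d < 0.8:
        return "medium"
    return "large"

print(interpret_cohens_d(0.65))  # "medium"
```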
The best way to understand effect size is to see it. In this visualization, two bell curves represent a control group and a treatment group. As you increase Cohen's d, the treatment group's distribution shifts further to the right, meaning the groups become more distinct. Notice how the overlap between the two distributions decreases as the effect gets larger.
*[Interactive visualization: a slider controls Cohen's d, currently set to 0.5 (medium effect), showing a distribution overlap of 80.3%.]*
The overlap percentage shows how much the two distributions share in common. A smaller overlap means the effect is easier to distinguish from no effect.
Try experimenting with the slider and notice how the overlap shrinks as d grows: at d = 0.2 the two distributions are nearly indistinguishable (about 92% overlap), while at d = 0.8 they are clearly separated (about 69% overlap).
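You can reproduce these overlap numbers yourself. For two normal distributions with equal standard deviations whose means differ by d standard deviations, the overlapping coefficient is 2Φ(−d/2); a short sketch using SciPy:

```python
from scipy.stats import norm

def overlap(d):
    """Overlapping coefficient of two equal-SD normal distributions separated by d SDs."""
    return 2 * norm.cdf(-abs(d) / 2)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: overlap = {overlap(d):.1%}")
# d = 0.2: overlap = 92.0%
# d = 0.5: overlap = 80.3%
# d = 0.8: overlap = 68.9%
```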
One of the most common mistakes in statistics is equating a small p-value with a large effect. In reality, the p-value is driven by both the effect size and the sample size. For a two-sample t-test with n observations per group, the test statistic is

$$t = d \sqrt{\frac{n}{2}}$$

This means that with a large enough sample, even a trivially small effect will produce a tiny p-value. Consider these two scenarios (the simulation sketch after them makes this concrete):
- A small study (say, 20 participants per group) that finds a large difference but p = .10: a meaningful effect that the test doesn't have enough power to detect.
- A huge study (say, 100,000 users per group) that finds a negligible difference but p < .001: a trivial effect that looks impressive only because of the massive sample.
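Here is a small simulation sketch of the second phenomenon: the true effect is held fixed at a trivial d = 0.05 while the sample size grows, and the p-value collapses anyway. The data are simulated, so exact p-values will vary with the seed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
d = 0.05  # a trivially small true effect, in standard deviation units

for n in (50, 1_000, 100_000):
    group1 = rng.normal(d, 1, n)  # mean shifted up by d SDs
    group2 = rng.normal(0, 1, n)
    t, p = stats.ttest_ind(group1, group2)
    print(f"n = {n:>7,} per group: p = {p:.4f}")
# The effect never changes, but p shrinks steadily as n grows.
```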
This is why the American Psychological Association (APA) and many journals now require reporting effect sizes alongside p-values. Together, they give the full picture: the p-value tells you whether an effect is unlikely to be chance, and the effect size tells you whether it matters.
Best Practice
Always report effect sizes in your research. A complete finding reads like: “The treatment group scored significantly higher than the control group, t(98) = 2.45, p = .016, d = 0.49 (medium effect).”
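As a sketch, here is how you might compute and print such a report from raw data. The sample data and the cohens_d helper are illustrative, not a fixed API:

```python
import numpy as np
from scipy import stats

def cohens_d(g1, g2):
    """Cohen's d with pooled standard deviation (same as the earlier sketch)."""
    n1, n2 = len(g1), len(g2)
    s_pooled = np.sqrt(((n1 - 1) * np.var(g1, ddof=1) + (n2 - 1) * np.var(g2, ddof=1))
                       / (n1 + n2 - 2))
    return (np.mean(g1) - np.mean(g2)) / s_pooled

# Hypothetical scores for two groups of 50
rng = np.random.default_rng(3)
treatment = rng.normal(0.5, 1, 50)
control = rng.normal(0.0, 1, 50)

t, p = stats.ttest_ind(treatment, control)
df = len(treatment) + len(control) - 2
print(f"t({df}) = {t:.2f}, p = {p:.3f}, d = {cohens_d(treatment, control):.2f}")
```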
Let's test your understanding of effect size with some practice problems. These will help you interpret effect sizes in real-world contexts and understand the relationship between effect size, p-values, and sample size.