Alternatives to NHST

Klinkenberg

University of Amsterdam

2024-02-26

The problem with
P-values

There is no problem

The problem with P-values is that they are often misunderstood and misinterpreted. The P-value is the probability of observing a sample statistic as or more extreme as the one obtained, given that the null hypothesis is true. It is NOT the probability that the null hypothesis is true. The P-value is NOT a measure of the strength of the evidence against the null hypothesis.

The misinterpretation is the problem, and not adhering to the Nayman-Pearson paradigm

The dance of the P-value

H0 and HA distribution

G*Power

Determine the required sample size for a desired test power, significance level, and effect size.

G*Power is a tool to compute statistical power analyses for many different \(t\) tests, \(F\) tests, \(\chi^2\) tests, \(z\) tests and some exact tests.

gpower.hhu.de

Confidence Interval

The confidence interval is a range of values that is likely to contain the true value of an unknown population parameter. The confidence interval is calculated from a given set of sample data. The confidence interval is used to express the uncertainty associated with a sample estimate of a population parameter.

We use the standard error to calculate the lower and upper limit of the confidence interval.

Standard Error

95% confidence interval

\[SE = \frac{\text{Standard deviation}}{\text{Square root of sample size}} = \frac{s}{\sqrt{n}}\]

  • Lowerbound = \(\bar{x} - 1.96 \times SE\)
  • Upperbound = \(\bar{x} + 1.96 \times SE\)

Plot CI

5 out of 100 samples

Common Misinterpretations

Confidence intervals and levels are frequently misunderstood, and published studies have shown that even professional scientists often misinterpret them (Wikipedia, 2024)

Hoekstra, Morey, Rouder, & Wagenmakers (2014) administerred the following questionair to 120 researchers.

Important

All of the statements are false

Researcher don’t know

Table 2 from Hoekstra et al. (2014)
#True First-Year Students (n = 442) Master Students (n = 34) Researchers (n = 118)
0 2% 0% 3%
1 6% 24% 9%
2 14% 18% 14%
3 26% 15% 25%
4 30% 12% 22%
5 15% 21% 16%
6 7% 12% 11%

Conclusion

Use both NHST and confidence intervals

Example

We also studied the validity by comparing the mean ability ratings of children in different grades. We expected a positive relation between grade and ability. Figure 5 shows the average ability rating for each grade and domain. As expected, children in older age groups had a higher rating than children in younger age groups. In all four domains, there is an overall significant effect of grade: addition \(F(5,1456)=1091.4,p<.01,\omega^2=.78\), subtraction \(F(5,1363)=780.5,p<.01,\omega^2=.74\), multiplication \(F(5,1215)=409.6,p<.01,\omega^2=.62\), and \(F(5,973)=223.31,p<.01,\omega^2=.53\) for division. Levene’s tests show differences in variances for the domains multiplication and division. However, the non-parametric Kruskal-Wallis tests also show significant differences for these domains: \(\chi^2(5)=753.28,p<.01\) for multiplication and \(\chi^2(5)=505.17,p<.01\) for division. For all domains, post hoc analyses show significant differences between all grades, except for the differences between grades five and six (Klinkenberg, Straatemeier, & van der Maas, 2011).

End

Contact

CC BY-NC-SA 4.0

References

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21, 1157–1164.
Klinkenberg, S., Straatemeier, M., & van der Maas, H. L. J. (2011). Computer adaptive practice of maths ability using a new item response model for on the fly ability and difficulty estimation. Computers & Education, 57(2), 1813–1824. https://doi.org/https://doi.org/10.1016/j.compedu.2011.02.003
Wikipedia. (2024). Confidence intervalWikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Confidence%20interval&oldid=1207611938.