In this paper (Rubin, 2020), I consider Fisher’s criticism that the Neyman-Pearson approach to hypothesis testing relies on the assumption of “repeated sampling from the same population.” This criticism is problematic for the Neyman-Pearson approach because it implies that test users need to know, for sure, what counts as the same population as their current population, or an equivalent one. If they don’t know this, then they can’t specify a procedure that would repeatedly sample from this population rather than from other, non-equivalent populations, and without this specification the Neyman-Pearson long-run error rates become meaningless.
I argue that, by definition, researchers do not know for sure what the relevant and irrelevant features of their current populations are. For example, in a psychology study, is the population “1st year undergraduate psychology students” or, more narrowly, “Australian 1st year undergraduate psychology students” or, more broadly, “psychology undergraduate students” or, even more broadly, “young people,” etc.? Researchers can make educated guesses about the relevant and irrelevant aspects of their population. However, they must concede that those guesses may be wrong. Consequently, if a researcher imagines a long run of repeated sampling, then they must imagine that they would make incorrect decisions about their null hypothesis due not only to Type I and Type II errors but also to Type III errors: errors caused by accidentally sampling from populations that are substantively different to their underspecified alternative and null populations.
To be clear, the Neyman-Pearson approach does consider Type III errors. However, it considers them outside of each long run of repeated sampling. It does not allow Type III errors to occur inside a long run of repeated sampling, where the sampling must always be from a correctly specified family of “admissible” populations (Neyman, 1977, p. 106; Neyman & Pearson, 1933, p. 294). In my paper, I argue that researchers are unable to imagine a long run of repeated sampling from the same or equivalent populations as their current population because they are unclear about the relevant and irrelevant characteristics of their current population. Consequently, they are unable to rule out Type III errors within their imagined long run.
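To make the idea of an error rate defined *inside* a long run more concrete, here is a minimal Python sketch under illustrative assumptions that are not taken from the paper: a two-sided z-test of H0: μ = 0 with known σ = 1, n = 30, and α = .05. The function `long_run_rejection_rate` and its parameters `off_target_share` and `off_target_mu` are hypothetical. In the first scenario, every replication samples from the specified null population, so the long-run rejection rate should sit close to α. In the second, half of the replications accidentally sample from a population that differs on an unmodeled feature (here, an assumed mean of 0.5), which stands in for the kind of Type III error described above.

```python
import math
import random
from statistics import NormalDist, fmean


def long_run_rejection_rate(n_reps, n, alpha, mu0, sigma, sample_mu,
                            off_target_share=0.0, off_target_mu=0.0, seed=0):
    """Proportion of repeated two-sided z-tests of H0: mu = mu0 that reject.

    Hypothetical illustration: each replication draws n observations from
    Normal(sample_mu, sigma), except that, with probability off_target_share,
    it instead draws from Normal(off_target_mu, sigma), standing in for an
    accidentally sampled, non-equivalent population.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    se = sigma / math.sqrt(n)                     # known-sigma standard error
    rejections = 0
    for _ in range(n_reps):
        # Decide which population this replication actually samples from.
        mu = off_target_mu if rng.random() < off_target_share else sample_mu
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        z = (fmean(xs) - mu0) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_reps


# Scenario 1: every sample comes from the specified null population.
nominal = long_run_rejection_rate(20_000, 30, 0.05, mu0=0.0, sigma=1.0,
                                  sample_mu=0.0)

# Scenario 2: half of the samples come from a population that differs on an
# unmodeled feature (assumed mean shift of 0.5).
mixed = long_run_rejection_rate(20_000, 30, 0.05, mu0=0.0, sigma=1.0,
                                sample_mu=0.0, off_target_share=0.5,
                                off_target_mu=0.5)

print(f"specified population only: {nominal:.3f}")
print(f"with off-target sampling:  {mixed:.3f}")
```

Under these assumptions, the second rate ends up well above α, even though the null hypothesis is true of the population the researcher intended to study. The nominal α therefore describes only a long run in which the sampled population is correctly specified, which is the condition that, on the argument above, researchers cannot guarantee.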
Following Fisher, I contrast scientific researchers with quality controllers in industrial production settings. Unlike researchers, quality controllers have clear knowledge about the relevant and irrelevant characteristics of their populations. For example, they are given a clear and unequivocal definition of Batch 57 on a production line, and they don’t consider re-conceptualizing Batch 57 as including or excluding other features. They also know which aspects of their testing procedure are relevant and irrelevant, and they are provided with precise quality control standards that allow them to know, for sure, their smallest effect size of interest. Consequently, the Neyman-Pearson approach is suitable for quality controllers because they can imagine a testing process that repeatedly draws random samples from the same population. In contrast, the Neyman-Pearson approach is not appropriate in scientific investigations because researchers do not have a clear understanding of the relevant and irrelevant aspects of their populations, their tests, or the smallest effect size that represents their population. Indeed, they are “researchers” because they are “researching” these things. Hence, it is researchers’ self-declared ignorance and doubt about the nature of their populations that renders Neyman-Pearson long-run error rates scientifically meaningless.
For further information, please see:
Rubin, M. (2020). “Repeated sampling from the same population?” A critique of Neyman and Pearson’s responses to Fisher. European Journal for Philosophy of Science, 10, Article 42, 1–15. https://doi.org/10.1007/s13194-020-00309-6
I also discuss the differences between the Fisherian and Neyman-Pearson approaches to hypothesis testing here.