4.2 Statistical Tests
A statistical test determines whether a statement about a population is plausible given a sample that is drawn from this population. In essence, a statistical test answers the question: Is the sample that we have drawn sufficiently probable if the assumption about the population were true?
We need several ingredients to apply a statistical test:
An assumption about a population.
A criterion to decide if the assumption is sufficiently plausible.
A sample from the population supplying information about the assumption.
A probability for the sample showing how plausible the assumption is.
This section discusses these four ingredients of a statistical test. The assumption about a population is the null hypothesis of the test (Section 4.2.1). We select a significance level, usually five per cent, as the criterion to decide whether the null hypothesis is sufficiently plausible or not. If it is not sufficiently plausible, we reject the null hypothesis. The values for which we reject the null hypothesis constitute the rejection region of the test (Section 4.2.2). We need a sample to test whether the assumption about the population is sufficiently plausible. Finally, we let the computer calculate a probability (p value) of drawing a sample that differs at least as much from the null hypothesis as the sample that we have drawn. If this probability is smaller than the selected significance level, the sample is in the rejection region, so we must reject the null hypothesis (Section 4.2.3). This concludes the statistical test.
4.2.1 The null hypothesis
The assumption that a researcher wants to test is called a research hypothesis. It is a statement about the empirical world that can be tested against data. Communication scientists, for instance, may hypothesize that:
a television station reaches half of all households in a country,
media literacy is below a particular standard (for instance, 5.5 on a 10-point scale) among children,
opinions about immigrants are not equally polarized among young and old voters,
the celebrity endorsing a fundraising campaign makes a difference to adults’ willingness to donate,
more exposure to brand advertisements increases brand awareness among consumers,
and so on.
These are statements about populations: all households in a country, children, voters, adults, and consumers. As these examples illustrate, research hypotheses seldom refer to statistics such as means, proportions, variances, or correlations. Still, we need a statistic to test a hypothesis. The researcher must translate the research hypothesis into a new hypothesis that refers to a statistic in the population, for example, the population mean. The new hypothesis is called a statistical hypothesis.
The most important statistical hypothesis is called the null hypothesis (H0). The null hypothesis specifies one value for a population statistic. Let us focus on the null hypothesis that average media literacy in the population of children equals 5.5 on a scale from one to ten. If 5.5 distinguishes between sufficient and insufficient media literacy on a ten-point scale, it is interesting to know whether average media literacy of children is close to this threshold, and thus just sufficient, or not.
We can test this statement about the population with a random sample of children drawn from the population in which we measure their media literacy. Once we have the measurements, we can calculate average media literacy in the sample. We can compare the sample average to the hypothesized average media literacy in the population. If they are not too far apart, we conclude that the null hypothesis is plausible. If they are too far apart, we don’t think the null hypothesis is plausible and we reject it.
4.2.2 Significance level (\(\alpha\)), significance, rejection region, and Type I error
How far apart must the sample statistic value and the hypothesized population value be to conclude that the null hypothesis is not plausible? The null hypothesis is implausible if the sample that we have drawn is among the samples that are very unlikely if the null hypothesis is true. A commonly accepted threshold value is that the sample is among the five per cent most unlikely samples. This threshold is called the significance level of the test. It is often represented by the symbol \(\alpha\) (the Greek letter alpha). If our sample is among the five per cent most unlikely samples, we reject the null hypothesis and we say that the test is statistically significant.
We can construct a sampling distribution around the hypothesized population value. Remember (Section 1.2.4) that the population value is the expected value of the sampling distribution, that is, its mean (if the estimator is unbiased). The sampling distribution, then, is centered around the population value specified in the null hypothesis. This sampling distribution tells us the probabilities of all possible sample outcomes if the null hypothesis is true. It allows us to identify the most unlikely samples, that is, the samples for which we reject the null hypothesis.
Note that we can construct a sampling distribution for the null hypothesis only if the hypothesis specifies one value for the population statistic. If we had multiple population values in our null hypothesis, for example, average media literacy is 5.5, 5.0, or 4.5 in the population, we would have multiple sampling distributions: one for each value. This is why the null hypothesis must specify a single value.
According to our null hypothesis, the population average is 5.5. If average media literacy of children in the population really were 5.5, which average sample media literacy scores would be most unlikely? We can use a hypothetical sampling distribution with 5.5 as mean value to answer this question.
Average media literacy can be too low to maintain the null hypothesis that it is 5.5 in the population, but it can also be too high. The significance level of five per cent is divided into two halves of 2.5 per cent, one for each tail of the sampling distribution. Graphically speaking (Figure 4.1), the significance level cuts off a part of the left-hand tail and a part of the right-hand tail of the sampling distribution. Sample means in these tails are too unlikely to be found in a sample if the null hypothesis is true.
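As a rough illustration of how the two tails are cut off, the sketch below computes the cutoff values of a two-sided rejection region. Only the hypothesized mean of 5.5 and the five per cent significance level come from the text. The population standard deviation (2.0), the sample size (100), the normal shape of the sampling distribution, and the function name rejection_bounds are assumptions made for this example.

```python
from statistics import NormalDist

# Only H0_MEAN and ALPHA come from the text; the rest are assumed values.
H0_MEAN = 5.5   # population mean according to the null hypothesis
POP_SD = 2.0    # assumed population standard deviation (hypothetical)
N = 100         # assumed sample size (hypothetical)
ALPHA = 0.05    # significance level


def rejection_bounds(h0_mean, pop_sd, n, alpha):
    """Return the lower and upper cutoffs of a two-sided rejection region.

    Assumes a normal sampling distribution centered on the hypothesized mean.
    """
    standard_error = pop_sd / n ** 0.5
    sampling_dist = NormalDist(mu=h0_mean, sigma=standard_error)
    # Half of alpha goes to each tail of the sampling distribution.
    lower = sampling_dist.inv_cdf(alpha / 2)
    upper = sampling_dist.inv_cdf(1 - alpha / 2)
    return lower, upper


lower, upper = rejection_bounds(H0_MEAN, POP_SD, N, ALPHA)
print(f"Reject H0 if the sample mean is below {lower:.2f} or above {upper:.2f}")
```

With these assumed values, the cutoffs are about 5.11 and 5.89. A larger sample or a smaller standard deviation would narrow the sampling distribution and move both cutoffs closer to 5.5.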
The sample means in these two tails constitute the rejection region of the test. If the sample statistic is in the rejection region, we reject the null hypothesis. This is the rule of the game. However, rejecting the null hypothesis does not prove that it is wrong. Perhaps average media literacy is really 5.5 in the population, but we were unfortunate enough to draw a sample of children with very low media literacy scores. This error is called a Type I error: rejecting a null hypothesis that is actually true.
We don’t know whether or when we make this error. We cannot entirely avoid this error because samples can be very different from the population from which they are drawn, as we learned in Chapter 1. Thankfully, however, we know the probability that we make this error. This probability is the significance level.
You should understand the exact meaning of probabilities here. A significance level of .05 allows five per cent of all possible samples to be so different from the population that we reject the null hypothesis even if it is true.
In other words, if we draw many samples and decide on the null hypothesis for each sample, we would reject a true null hypothesis in five per cent of our decisions. So we have a five per cent chance of making a Type I error. We decide on that probability when we select the significance level of the test. We think that five per cent (.05) is an acceptable probability for making this type of error.
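The idea of rejecting a true null hypothesis in about five per cent of our decisions can be checked with a small simulation. The sketch below draws many random samples from a population in which the null hypothesis is true and counts how often the sample mean lands in the rejection region. It is a minimal illustration under assumptions not found in the text: media literacy scores are treated as normally distributed with a standard deviation of 2.0 (ignoring the bounds of the ten-point scale), each sample has 100 children, and the function name type_one_error_rate is hypothetical.

```python
import random
from statistics import NormalDist, fmean


def type_one_error_rate(h0_mean=5.5, pop_sd=2.0, n=100, alpha=0.05,
                        n_samples=5_000, seed=1):
    """Estimate how often a true null hypothesis is rejected.

    All default values except h0_mean and alpha are assumptions
    chosen for illustration.
    """
    rng = random.Random(seed)
    standard_error = pop_sd / n ** 0.5
    sampling_dist = NormalDist(mu=h0_mean, sigma=standard_error)
    lower = sampling_dist.inv_cdf(alpha / 2)
    upper = sampling_dist.inv_cdf(1 - alpha / 2)

    rejections = 0
    for _ in range(n_samples):
        # The null hypothesis is true here: the population mean is h0_mean.
        sample = [rng.gauss(h0_mean, pop_sd) for _ in range(n)]
        sample_mean = fmean(sample)
        if sample_mean < lower or sample_mean > upper:
            rejections += 1  # a Type I error
    return rejections / n_samples


rate = type_one_error_rate()
print(f"Share of samples that reject a true H0: {rate:.3f}")
```

The printed share should be close to .05, although it will not equal .05 exactly because the simulation uses a finite number of samples.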
4.2.3 p Value
How do we know that the sample that we have drawn is among the five per cent most unlikely samples if the null hypothesis is true? In other words, how do we know that our sample statistic outcome is in the rejection region?
In the previous section, we learned that a test is statistically significant if the sample statistic is in the rejection region. Statistical software, however, usually does not report the rejection region for the sample statistic. Instead, it reports the p value of the test, which is sometimes referred to as significance or Sig. in SPSS.
A p value is the probability that a sample is drawn with a value for the sample statistic that is at least as different from the hypothesized population value as the value in the observed sample. In other words, the p value tells us the proportion of all possible samples that differ at least as much from the hypothesized population value as our observed sample does, if the null hypothesis is true. If this proportion is very small, say less than five per cent, the sample that we have drawn is among the unlikely samples.
And what do we do if our sample is among the unlikely ones? We reject the null hypothesis because the test is statistically significant. The decision rule is quite simple if we know the p value of a test: If the p value is below the significance level (usually .05), we reject the null hypothesis. Otherwise, we do not reject it.
This is the golden rule of null hypothesis testing (although some argue that the gold of this rule is fool’s gold, see Chapter 6).
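The decision rule can be sketched in a few lines of code. The example computes a two-sided p value for a sample mean and compares it to the significance level. The hypothesized mean of 5.5 and the .05 level come from the text. The sample means (5.0 and 5.3), the population standard deviation (2.0), the sample size (100), the normal sampling distribution, and the function names are assumptions made for illustration.

```python
from statistics import NormalDist


def two_sided_p_value(sample_mean, h0_mean, pop_sd, n):
    """Probability of a sample mean at least this far from h0_mean,
    in either direction, if the null hypothesis is true.

    Assumes a normal sampling distribution with a known population SD.
    """
    standard_error = pop_sd / n ** 0.5
    z = abs(sample_mean - h0_mean) / standard_error
    return 2 * (1 - NormalDist().cdf(z))


def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 if p is below alpha."""
    return "reject H0" if p_value < alpha else "do not reject H0"


# Hypothetical sample outcomes; pop_sd and n are assumed values.
p_far = two_sided_p_value(sample_mean=5.0, h0_mean=5.5, pop_sd=2.0, n=100)
p_near = two_sided_p_value(sample_mean=5.3, h0_mean=5.5, pop_sd=2.0, n=100)

print(f"Sample mean 5.0: p = {p_far:.4f}, {decide(p_far)}")
print(f"Sample mean 5.3: p = {p_near:.4f}, {decide(p_near)}")
```

Under these assumptions, a sample mean of 5.0 yields a p value of about .012, so the null hypothesis is rejected, whereas a sample mean of 5.3 yields a p value of about .317, so it is not rejected.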
It is important to remember that a p value is a probability under the assumption that the null hypothesis is true. Therefore, it is a conditional probability.
Compare it to the probability that we throw a six with a die. This probability is one out of six under the assumption that the die is fair. Probabilities rest on assumptions. If the assumptions are violated, we cannot calculate probabilities.
If the die is not fair, we don’t know the probability of throwing a six. In the same way, we have no clue whatsoever of the probability of drawing a sample like the one we have if the null hypothesis is not true in the population.