1  Sampling Distribution

How Different Could My Sample Have Been?

Key concepts: inferential statistics, generalization, population, random sample, sample statistic, sampling space, random variable, sampling distribution, probability, probability distribution, discrete probability distribution, expected value/expectation, unbiased estimator, parameter, (downward) biased, representative sample, continuous variable, continuous probability distribution, probability density, (left-hand/right-hand) probability.

Watch the micro lecture Video 1.1 on sampling distributions for an overview of the chapter.

Video 1.1: Introduction to the sampling distribution.

Summary

What does our sample tell us about the population from which it was drawn?

Statistical inference is about estimation and null hypothesis testing. We have collected data on a random sample and we want to draw conclusions (make inferences) about the population from which the sample was drawn. From the proportion of yellow candies in our sample bag, for instance, we want to estimate a plausible range of values for the proportion of yellow candies in a factory’s stock (confidence interval). Alternatively, we may want to test the null hypothesis that one fifth of the candies in a factory’s stock are yellow.

The sample does not offer a perfect miniature image of the population. If we were to draw another random sample, it would have different characteristics. For instance, it would contain more or fewer yellow candies than the previous sample. To make an informed decision on the confidence interval or null hypothesis, we must compare the characteristic of the sample that we have drawn to the characteristics of the samples that we could have drawn.

The characteristics of the samples that we could have drawn constitute a sampling distribution. Sampling distributions are the central element in estimation and null hypothesis testing. In this chapter, we simulate sampling distributions to understand what they are. Here, simulation means that we let a computer draw many random samples from a population.

In Communication Science, we usually work with samples of human beings, for instance, users of social media, people looking for health information or entertainment, citizens preparing to cast a political vote, or an organization’s stakeholders, or with samples of media content such as tweets, TV advertisements, or newspaper articles. In the current and two subsequent chapters, however, we avoid the complexities of these samples.

We focus on a very tangible kind of sample, namely a bag of candies, which helps us understand the basic concepts of statistical inference: sampling distributions (the current chapter), probability distributions (Chapter 2), and estimation (Chapter 3). Once we thoroughly understand these concepts, we turn to Communication Science examples.

1.1 Statistical Inference: Making the Most of Your Data

Statistics is a tool for scientific research. It offers a range of techniques to check whether statements about the observable world are supported by data collected from that world. Scientific theories strive for general statements, that is, statements that apply to many situations. Checking these statements requires lots of data covering all situations addressed by theory.

Collecting data, however, is expensive, so we would like to collect as little data as possible and still be able to draw conclusions about a much larger set. The cost and time involved in collecting large sets of data are also relevant to applied research, such as market research. In this context, too, we want to collect no more data than necessary.

Inferential statistics offers techniques for making statements about a larger set of observations from data collected for a smaller set of observations. The large set of observations about which we want to make a statement is called the population. The smaller set is called a sample. We want to generalize a statement about the sample to a statement about the population from which the sample was drawn.

Traditionally, statistical inference is generalization from the data collected in a random sample to the population from which the sample was drawn. This approach is the focus of the present book because it is currently the most widely used type of statistical inference in the social sciences. We will, however, point out other approaches in Chapter 4.

Statistical inference is conceptually complicated and for that reason quite often used incorrectly. We will therefore spend quite some time on the principles of statistical inference. A good understanding of the principles should help you to recognize and avoid incorrect use of statistical inference. In addition, it should help you to understand the controversies surrounding statistical inference and the ongoing developments in how it is applied in practice. Investing time and energy in fully understanding the principles of statistical inference really pays off later.

1.2 A Discrete Random Variable: How Many Yellow Candies in My Bag?

An obvious but key insight in statistical inference is this: If we draw random samples from the same population, we are likely to obtain different samples. Two random samples from the same population need not be identical, even though they can be.

1.2.1 Sample statistic

We are usually interested in a particular characteristic of the sample rather than in the exact nature of each observation within the sample. For instance, I happen to be very fond of yellow candies. If I buy a bag of candies, my first impulse is to tear the bag open and count the number of yellow candies. Am I lucky today? Does my bag contain a lot of yellow candies?

Figure 1.1: How many yellow candies will our sample bag contain?

The number of yellow candies in a bag is an example of a sample statistic: a value describing a characteristic of the sample. Each bag, that is, each sample, has one outcome score on the sample statistic. For instance, one bag contains four yellow candies, another bag contains seven, and so on. All possible outcome scores constitute the sampling space. A bag of ten candies may contain 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 yellow candies. The numbers 0 to 10 form the sampling space of the sample statistic “number of yellow candies in a bag”.

The sample statistic is called a random variable. It is a variable because different samples can have different scores. The value of a variable may vary from sample to sample. It is a random variable because the score depends on chance, namely the chance that a particular sample is drawn.

1.2.2 Sampling distribution

Some sample statistic outcomes occur more often than other outcomes. We can see this if we draw very many random samples from a population and collect the frequencies of all outcome scores in a table or chart. We call the distribution of the outcome scores of very many samples a sampling distribution.

Figure 1.2: What is a sampling distribution?
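To make this concrete, here is a minimal sketch of such a simulation in Python. The 20% share of yellow candies and the bag size of ten come from the running example; the population size of 1000 candies, the number of simulated bags, and the random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical population: a stock of 1000 candies, 20% of which are yellow.
population = np.array(["yellow"] * 200 + ["other"] * 800)

# Draw 1000 sample bags of 10 candies each and count the yellow candies
# in every bag: the sample statistic.
counts = np.array([np.sum(rng.choice(population, size=10) == "yellow")
                   for _ in range(1000)])

# Tabulating how often each outcome score occurs gives the simulated
# sampling distribution.
for outcome, freq in zip(*np.unique(counts, return_counts=True)):
    print(f"{outcome} yellow candies: {freq} bags")
```

Each simulated bag contributes one observation, its count of yellow candies, to the frequency table.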

1.2.3 Probability and probability distribution

What is the probability of buying a bag with exactly five yellow candies? In statistical terminology, what is the probability of drawing a sample with five yellow candies as sample statistic outcome? This probability is the proportion of all possible samples that we could have drawn that happen to contain five yellow candies.

Of course, the probability of a sample bag with exactly five yellow candies depends on the share of yellow candies in the population of all candies. Figure 1.3 displays the probabilities of a sample bag with a particular number of yellow candies if twenty per cent of the candies in the population are yellow. You can adjust the population share of yellow candies to see what happens.

Figure 1.3: How does the probability of drawing a sample bag with two out of ten candies yellow depend on the proportion of yellow candies in the population?

The sampling distribution summarizes the outcomes of all possible samples that we could have drawn. We can use the distribution of all samples to obtain the probability of buying a bag with exactly five yellow candies: We divide the number of samples with five yellow candies by the total number of samples we have drawn. For example, if 26 out of all 1000 samples have five yellow candies, the proportion of samples with five yellow candies is 26 / 1000 = 0.026. Then, the probability of drawing a sample with five yellow candies is 0.026 (we usually write: .026).
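In the simulation sketched in Section 1.2.2, this division is a single line of code (reusing the counts array from that sketch):

```python
# The probability of exactly five yellow candies is approximated by the
# proportion of simulated bags with that outcome.
p_five = np.mean(counts == 5)
print(f"P(five yellow candies) is roughly {p_five:.3f}")
```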

If we change the frequencies in the sampling distribution into proportions, we obtain the probability distribution of the sample statistic: a sampling space with a probability (between 0 and 1) for each outcome of the sample statistic. Because we are usually interested in probabilities, sampling distributions tend to have proportions, that is, probabilities, instead of frequencies on the vertical axis. See Figure 1.4 for an example.

Figure 1.3 displays the probability distribution of the number of yellow candies per bag of ten candies. This is an example of a discrete probability distribution because only a limited number of outcomes are possible. It is feasible to list the probability of each outcome separately.

The sampling distribution as a probability distribution conveys very important information. It tells us which outcomes we can expect, in our example, how many yellow candies we may find in our bag of ten candies. Moreover, it tells us the probability that a particular outcome may occur. If the sample is drawn from a population in which 20% of candies are yellow, we are quite likely to find zero, one, two, three, or four yellow candies in our bag. A bag with five yellow candies would be rare, six or seven candies would be very rare, and a bag with more than seven yellow candies is extremely unlikely but not impossible. If we buy such a bag, we know that we have been extremely lucky.

We may refer to probabilities both as a proportion, that is, a number between 0 and 1, and as a percentage: a number between 0% and 100%. Proportions are commonly considered to be the correct way to express probabilities. When we talk about probabilities, however, we tend to use percentages; we may, for example, say that the probabilities are fifty-fifty.

1.2.4 Expected value or expectation

We haven’t yet thought about the value that we are most likely to encounter in the sample that we are going to draw. Intuitively, it must be related to the distribution of colours in the population of candies from which the sample was drawn. In other words, the share of yellow candies in the factory’s stock from which the bag was filled, or in the machine that produces the candies, seems relevant to what we may expect to find in our sample.

Figure 1.4: What is the expected value of a probability distribution?

If the share of yellow candies in the population is 0.20 (or 20%), we expect one in every five candies in a bag (sample) to be yellow. In a bag of 10 candies, we would expect two candies to be yellow: one in every five candies, or the population proportion times the total number of candies in the sample: 0.20 * 10 = 2.0. This is the expected value.

The expected value of the proportion of yellow candies in the sample is equal to the proportion of yellow candies in the population. If you carefully inspect a sampling distribution (Figure 1.4), you will see that the expected value also equals the mean of the sampling distribution. This makes sense: Excess yellow candies in some bags must be compensated for by a shortage in other bags.

Thus we arrive at the definition of the expected value of a random variable:

The expected value is the average of the sampling distribution of a random variable.

In our example, the random variable is a sample statistic, more specifically, the number of yellow candies in a sample.

The sampling distribution is an example of a probability distribution, so, more generally, the expected value is the average of a probability distribution. The expected value is also called the expectation of a probability distribution.
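As a quick check, the mean of the simulated sampling distribution from Section 1.2.2 should be close to the expected value of 0.20 * 10 = 2.0. Continuing the earlier Python sketch:

```python
# The expected value is the mean of the sampling distribution; for the
# simulated distribution it should be close to 0.20 * 10 = 2.0.
print(f"Mean of the simulated sampling distribution: {counts.mean():.2f}")
```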

1.2.5 Unbiased estimator

Note that the expected value of the proportion of yellow candies in the bag (sample statistic) equals the true proportion of yellow candies in the candy factory (population statistic). For this reason, the sample proportion is an unbiased estimator of the proportion in the population. More generally, a sample statistic is called an unbiased estimator of the population statistic if the expected value (mean of the sampling distribution) is equal to the population statistic. By the way, we usually refer to the population statistic as a parameter.

Most but not all sample statistics are unbiased estimators of the population statistic. Think, for instance, of the actual number of yellow candies in the sample. This is certainly not an unbiased estimator of the number of yellow candies in the population. Because the population is so much larger than the sample, the population must contain many more yellow candies than the sample. If we were to estimate the number in the population (the parameter) from the number in the sample—for instance, we estimate that there are two yellow candies in the population of all candies because we have two in our sample of ten—we are going to vastly underestimate the number in the population. This estimate is downward biased: It is too low.

In contrast, the proportion in the sample is an unbiased estimator of the population proportion. That is why we do not use the number of yellow candies to generalize from our sample to the population. Instead, we use the proportion of yellow candies. You probably already did this intuitively.
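Continuing the simulation sketch once more, we can make the contrast between the two estimators visible:

```python
# The sample proportion (count / bag size) is an unbiased estimator of the
# population proportion: its average over many samples is close to 0.20.
proportions = counts / 10
print(f"Average sample proportion: {proportions.mean():.3f}")

# The sample count, in contrast, averages about 2, vastly underestimating
# the 200 yellow candies in the hypothetical population.
print(f"Average sample count: {counts.mean():.2f}")
```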

Sometimes, we have to adjust the way in which we calculate a sample statistic to get an unbiased estimator. For instance, we must calculate the standard deviation and variance in the sample in a special way to obtain an unbiased estimate of the population standard deviation and variance. The exact calculation need not bother us, because our statistical software takes care of that. Our software only uses unbiased estimators.
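For the curious: the adjustment for the sample variance divides the sum of squared deviations by n - 1 instead of n (Bessel’s correction). A small sketch with hypothetical candy weights, using numpy’s ddof argument:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
weights = rng.normal(loc=2.8, scale=0.25, size=10)  # hypothetical candy weights

# Dividing the sum of squared deviations by n underestimates the population
# variance; dividing by n - 1 (ddof=1) removes that bias.
print(np.var(weights, ddof=0))  # biased: divides by n
print(np.var(weights, ddof=1))  # unbiased: divides by n - 1
```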

1.2.6 Representative sample

Because the share of yellow candies in the population represents the probability of drawing a yellow candy, we also expect 20% of the candies in our bag to be yellow. For the same reason we expect the shares of all other colours in our sample bag to be equal to their shares in the population. As a consequence, we expect a random sample to resemble the population from which it is drawn.

A sample is representative of a population (in the strict sense) if variables in the sample are distributed in the same way as in the population. Of course, we know that a random sample is likely to differ from the population due to chance, so the actual sample that we have drawn is usually not representative of the population in the strict sense.

But we should expect it to be representative, so we say that it is in principle representative, or representative of the population in the statistical sense. We can use probability theory to account for the misrepresentation in the actual sample that we draw. This is what we do when we use statistical inference to construct confidence intervals and test null hypotheses, as we will learn in later chapters.

1.3 A Continuous Random Variable: Overweight and Underweight

Let us now look at another variable: the weight of candies in a bag. The weight of candies is perhaps more interesting to the average consumer than candy colour because candy weight is related to calories.

1.3.1 Continuous variable

Weight is a continuous variable because we can always think of a new weight between two other weights. For instance, consider two candy weights: 2.8 and 2.81 grams. It is easy to see that there can be a weight in between these two values, for instance, 2.803 grams. Between 2.8 and 2.803 we can discern an intermediate value such as 2.802. In principle, we could continue doing this endlessly, e.g., find a weight between 2.80195661 and 2.80195662 grams even if our scales may not be sufficiently precise to measure any further differences. It is the principle that counts. If we can always think of a new value in between two values, the variable is continuous.

Continuous variable: We can always think of a new value in between two values.

1.3.2 Continuous sample statistic

We are not interested in the weight of a single candy. If a relatively light candy is compensated for by a relatively heavy candy in the same bag, we still get the calories that we want. We are interested in the average weight of all candies in our sample bag, so average candy weight in our sample bag is our key sample statistic. We want to say something about the probabilities of average candy weight in the samples of candies that we can draw. Can we do that?

When we turn to the probabilities of getting samples with a particular average candy weight, we run into problems with a continuous sample statistic. If we wanted to know the probability of drawing a sample bag with an average candy weight of 2.8 grams, we should exclude sample bags with an average candy weight of 2.81 grams, or 2.801 grams, or 2.8000000001 grams, and so on. In fact, we are very unlikely to draw a sample bag with an average candy weight of exactly 2.8 grams, that is, with an infinite number of zeros trailing 2.8. In other words, the probability of such a sample bag is for all practical purposes zero and negligible.

This applies to every average candy weight, so all probabilities are virtually zero. The probability distribution of the sampling space, that is, of all possible outcomes, is going to be very boring: just (nearly) zeros. And it will take forever to list all possible outcomes within the sampling space, because we have an infinite number of possible outcomes. After all, we can always find a new average candy weight between two selected weights.

1.3.3 Probability density

With a continuous sample statistic, we must look at a range of values instead of a single value. We can meaningfully talk about the probability of having a sample bag with an average candy weight of at least 2.8 grams or at most 2.8 grams. We choose a threshold, in this example 2.8 grams, and determine the probability of values above or below this threshold. We can also use two thresholds, for example the probability of an average candy weight between 2.75 and 2.85 grams. This is probably what you were thinking of when I referred to a bag with an average candy weight of 2.8 grams.

If we cannot determine the probability of a single value, which we used to depict on the vertical axis in a plot of a sampling distribution, and we have to link probabilities to a range of values on the horizontal axis, for example, average candy weight above or below 2.8 grams, how can we display probabilities? We have to display a probability as an area between the horizontal axis and a curve. This curve is called a probability density function, so if there is a label on the vertical axis of a continuous probability distribution, it is “Probability density” instead of “Probability”.

Figure 1.5 shows an example of a continuous probability distribution for the average weight of candies in a sample bag. This is the familiar normal distribution, so we could say that the normal curve is the probability density function here. The total area under this curve is set to one, so the area belonging to a range of sample outcomes (average candy weight) is 1 or less, as a probability should be.

Figure 1.5: How do we display probabilities in a continuous sampling distribution? Tip: Click on a slider handle and use your keyboard arrow keys to make small changes to the slider handle position.

A probability density function can give us the probability of values between two thresholds. It can also give us the probability of values up to (and including) a threshold value, which is known as a left-hand probability, or the probability of values above (and including) a threshold value, which is called a right-hand probability. In a null hypothesis significance test (Chapter 4), right-hand and left-hand probabilities are used to calculate p values.
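In code, left-hand and right-hand probabilities are cumulative areas under the density curve. A sketch using scipy.stats, with a hypothetical normal sampling distribution (mean 2.8 grams, standard deviation 0.1 grams, both invented for illustration):

```python
from scipy.stats import norm

# Hypothetical sampling distribution of average candy weight: normal with
# mean 2.8 grams and standard deviation 0.1 grams.
dist = norm(loc=2.8, scale=0.1)

print(dist.cdf(2.8))                    # left-hand probability: P(average <= 2.8)
print(dist.sf(2.9))                     # right-hand probability: P(average >= 2.9)
print(dist.cdf(2.85) - dist.cdf(2.75))  # P(2.75 <= average <= 2.85)
```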

Why did I put (and including) between parentheses? It does not really matter whether we add the exact boundary value (2.8 grams) to the probability on the left or on the right because the probability of getting a bag with average candy weight at exactly 2.8 grams (with a very long trail of zero decimals) is negligible.

Are you struggling with the idea of areas instead of heights (values on the vertical axis) as probabilities? Just realize that we could use the area of a bar in a histogram instead of its height as an indication of the probability in discrete probability distributions, for example, Figure 1.4. The bars in a histogram are all equally wide, so (relative) differences between bar areas are equal to differences in bar height.

1.3.4 Probabilities always sum to 1

While you were playing with Figure 1.5, you may have noticed that displayed probabilities always add up to one. This is true for every probability distribution because it is part of the definition of a probability distribution.

In addition, you may have realized that we can use probability distributions in two ways. We can use them to say how likely or unlikely we are to draw a sample with the sample statistic value in a particular range. For example, what is the chance that we draw a sample bag with an average candy weight over 2.9 grams? But we can also use a probability distribution to find the threshold values that separate the top ten per cent or the bottom five per cent of a distribution. If we want a sample bag that belongs to the ten per cent of bags with the highest average candy weight, what is the minimum average candy weight that such a bag must have?
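Both uses correspond to standard functions of a probability distribution: the cumulative probability for a given threshold and the quantile (percent point) for a given probability. A sketch with the same hypothetical normal distribution as above:

```python
from scipy.stats import norm

dist = norm(loc=2.8, scale=0.1)  # same hypothetical distribution as above

# Use 1: how likely is a sample bag with an average weight over 2.9 grams?
print(dist.sf(2.9))    # right-hand probability

# Use 2: which average weight separates the ten per cent heaviest bags?
print(dist.ppf(0.90))  # 90th percentile: the minimum for the top ten per cent
```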

1.4 Concluding Remarks

A communication scientist wants to know whether children are sufficiently aware of the dangers of media use. On a media literacy scale from one to ten, an average score of 5.5 or higher is assumed to be sufficient.

If we translate this to the simple candy bag example, we realize that the outcome in our sample does not have to be the true population value, for example twenty per cent. If twenty per cent of all candies in the population are yellow, we could very well draw a sample bag with fewer or more than twenty per cent yellow candies.

Average media literacy, then, can exceed 5.5 in our sample of children even if average media literacy in the population is below 5.5, or the other way around. How we decide on this is discussed in later chapters.

1.4.1 Sample characteristics as observations

Perhaps the most confusing aspect of sampling distributions is the fact that samples are our cases (units of analysis) and sample characteristics are our observations. We are accustomed to thinking of observations as measurements on empirical things such as people or candies. We perceive each person or each candy as a case and we observe a characteristic that may change across cases (a variable), for instance the colour or weight of a candy.

In a sampling distribution, however, we observe samples (cases) and measure a sample statistic as the (random) variable. Each sample adds one observation to the sampling distribution, namely its sample statistic value.

1.4.2 Means at three levels

If we are dealing with the proportion of yellow candies in a sample (bag), the sample statistic is a proportion and we want to know the proportion of yellow candies in the population. The sampling distribution collects a large number of sample proportions. The mean of the proportions in the sampling distribution (expected value) equals the proportion of yellow candies in the population, because a sample proportion is an unbiased estimator of the population proportion.

Things become a little confusing if we are interested in a sample mean, such as the average weight of candies in a sample bag. Now we have means at three levels: the population, the sampling distribution, and the sample.

Figure 1.6: What is the relation between the three distributions?

The sampling distribution, here, is a distribution of sample means, but the sampling distribution itself also has a mean, which is called the expected value or expectation of the sampling distribution. Don’t let this confuse you. The mean of the sampling distribution is the average of the average weight of candies in every possible sample bag. This mean of means has the same value as our first mean, namely the average weight of the candies in the population, because a sample mean is an unbiased estimator of the population mean.
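A small simulation may help to keep the three levels apart. Assuming a hypothetical population of candy weights with a mean of about 2.8 grams:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Level 1: a hypothetical population of candy weights (mean about 2.8 grams).
population = rng.normal(loc=2.8, scale=0.25, size=100_000)

# Level 2: the sampling distribution, here the mean weight in each of
# 10,000 simulated sample bags of ten candies.
sample_means = np.array([rng.choice(population, size=10).mean()
                         for _ in range(10_000)])

# Level 3: a single sample and its mean.
print(f"Population mean (parameter):   {population.mean():.3f}")
print(f"Mean of the sample means:      {sample_means.mean():.3f}")  # about equal
print(f"Mean of one particular sample: {sample_means[0]:.3f}")      # varies by chance
```

The first two numbers printed are (almost) equal; the third belongs to a single sample bag and varies from bag to bag.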

Remember this: The population and the sample consist of the same type of observations. In the current example, we are dealing with a sample and a population of candies. In contrast, the sampling distribution is based on a different type of observation, namely samples, for example, sample bags of candies.

The sampling distribution is the crucial link between the sample and the population. On the one hand, the sampling distribution is connected to the population because the population statistic (parameter), for example, the average weight of all candies, is equal to the mean of the sampling distribution. On the other hand, it is linked to the sample because it tells us which sample means we will find with what probabilities. We need the sampling distribution to make statements about the population based on our sample.

1.5 Take-Home Points

  • Values of a sample statistic vary across random samples from the same population. But some values are more probable than other values.

  • The sampling distribution of a sample statistic tells us the probability of drawing a sample with a particular value of the sample statistic or a particular minimum and/or maximum value.

  • If a sample statistic is an unbiased estimator of a parameter, the parameter value equals the average of the sampling distribution, which is called the expected value or expectation.

  • For discrete sample statistics, the sampling distribution tells us the probability of individual sample outcomes. For continuous sample statistics, it tells us the probability density, which gives us the probability of drawing a sample with an outcome that is at least or at most a particular value, or an outcome that is between two values.