Section 7.2 Visualizing the Distribution and Key Theorems

This is a great moment to take a deep breath. We’ve just covered a couple hundred years of statistical thinking in just a few pages. In fact, there are two big ideas, the "law of large numbers" and the "central limit theorem," that we have just partially demonstrated. These two ideas literally took mathematicians like Gerolamo Cardano (1501-1576) and Jacob Bernoulli (1654-1705) several centuries to figure out. If you look these ideas up, you may find a lot of bewildering mathematical details, but for our purposes, there are two really important take-away messages. First, if you run a statistical process a large number of times, it will converge on a stable result. In our case, we knew the average population of the 50 states plus the District of Columbia. These 51 observations were our population, and we wanted to know how many smaller subsets, or samples, of size n=16 we would have to draw before we could get a good approximation of that true value. We learned that drawing one sample provided a poor result. Drawing 400 samples gave us a mean that was off by 1.5%. Drawing 4000 samples gave us a mean that was off by less than 1%. If we had kept going to 40,000 or 400,000 repetitions of our sampling process, we would have come extremely close to the actual average of 6,053,384.
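
If you want to watch this convergence happen, here is a minimal sketch. It is not the command from earlier in the chapter: it assumes R's built-in state.x77 table (1975 population estimates for the 50 states, without the District of Columbia) as a stand-in for the census data, so the numbers will differ from the ones quoted here. The helper name mean_of_means and the seed are arbitrary choices for illustration.

```r
# Stand-in population (assumption): 1975 estimates for the 50 states, built into R
pop <- state.x77[, "Population"] * 1000
set.seed(123)  # arbitrary seed so the run is repeatable

# Mean of `reps` sample means, each from a sample of size 16 drawn with replacement
mean_of_means <- function(reps) {
  mean(replicate(reps, mean(sample(pop, size = 16, replace = TRUE))))
}

reps <- c(1, 400, 4000, 40000)
estimates <- sapply(reps, mean_of_means)

# Percentage gap between each estimate and the true population mean
pct_error <- 100 * abs(estimates - mean(pop)) / mean(pop)
print(data.frame(reps, estimates, pct_error))
```

The gap should generally shrink as the number of repetitions grows, although any single run can wobble a little.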

Second, when we are looking at sample means, and we take the law of large numbers into account, we find that the distribution of sample means starts to create a bell-shaped or normal distribution, and the center of that distribution, the mean of all of those sample means, gets really close to the actual population mean. It gets closer faster for larger samples; in contrast, for smaller samples you have to draw lots and lots of them to get really close. Just for fun, let’s illustrate this with a sample size that is larger than 16. Here’s a run that only repeats 100 times, but each time draws a sample of n=51 (equal in size to the population):
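
A run of this kind might look like the sketch below. It assumes R's built-in state.x77 populations (1975 estimates, 50 states, no District of Columbia) as a stand-in for the census data, so "equal in size to the population" means 50 here rather than 51, and the result will not match the figures in the text.

```r
# Stand-in population (assumption): 1975 estimates for the 50 states, built into R
pop <- state.x77[, "Population"] * 1000
set.seed(123)  # arbitrary seed for repeatability

n <- length(pop)  # sample size equal to the size of the stand-in population
estimate <- mean(replicate(100, mean(sample(pop, size = n, replace = TRUE))))

# Percentage gap between the mean of the 100 sample means and the true mean
pct_error <- 100 * abs(estimate - mean(pop)) / mean(pop)
print(c(estimate = estimate, true_mean = mean(pop), pct_error = pct_error))
```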

Now, we’re only off from the true value of the population mean by about one tenth of one percent. You might be scratching your head now, saying, "Wait a minute, isn’t a sample of 51 the same thing as the whole list of 51 observations?" This is confusing, but it goes back to the question of sampling with replacement that we examined a couple of pages ago (and that appears in the command above as replace=TRUE). Sampling with replacement means that as you draw out one value to include in your random sample, you immediately chuck it back into the list so that, potentially, it could get drawn again either immediately or later. As mentioned before, this practice simplifies the underlying proofs, and it does not cause any practical problems, other than head scratching. In fact, we could go even higher in our sample size with no trouble:
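
To see what sampling with replacement does in practice, here is a small sketch, again assuming the built-in state.x77 populations as a stand-in for the census data. It checks how many distinct states land in one population-sized sample, shows that an oversized sample fails without replacement, and then runs 100 replications with n=120.

```r
# Stand-in population (assumption): 1975 estimates for the 50 states, built into R
pop <- state.x77[, "Population"] * 1000
set.seed(123)  # arbitrary seed for repeatability

# One sample as large as the population, drawn with replacement:
# some states show up more than once, so others must be missing.
draw <- sample(names(pop), size = length(pop), replace = TRUE)
n_distinct <- length(unique(draw))
print(n_distinct)

# Without replacement, a sample larger than the population is an error
oversized <- tryCatch(sample(pop, size = 120, replace = FALSE),
                      error = function(e) conditionMessage(e))
print(oversized)

# With replacement, 100 replications of n = 120 run without complaint
estimate <- mean(replicate(100, mean(sample(pop, size = 120, replace = TRUE))))
print(c(estimate = estimate, true_mean = mean(pop)))
```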

That command runs 100 replications using samples of size n=120. Look how close the mean of the sampling distribution is to the population mean now! Remember that this result will change a little bit every time you run the procedure, because different random samples are being drawn for each run. But the rule of thumb is that the bigger your sample size, what statisticians call n, the closer your estimate will be to the true value. Likewise, the more trials you run, the closer your population estimate will be.

So, if you’ve had a chance to catch your breath, let’s move on to making use of the sampling distribution. First, let’s save one distribution of sample means so that we have a fixed set of numbers to work with:

The assignment is the new part: we’re saving a distribution of sample means to a new vector called "SampleMeans". We should have 10,000 of them:

And the mean of all of these means should be pretty close to our population mean of 6,053,384:
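
The last few steps can be sketched together as follows. The vector name SampleMeans and the sample size of 5 come from this chapter; the data source is an assumption (R's built-in state.x77, 1975 estimates for the 50 states), so the mean you see will sit near that table's population mean rather than near 6,053,384.

```r
# Stand-in population (assumption): 1975 estimates for the 50 states, built into R
pop <- state.x77[, "Population"] * 1000
set.seed(123)  # arbitrary seed for repeatability

# Save 10,000 sample means, each from a sample of size 5 drawn with replacement
SampleMeans <- replicate(10000, mean(sample(pop, size = 5, replace = TRUE)))

print(length(SampleMeans))  # 10000
print(mean(SampleMeans))    # should land near the population mean below
print(mean(pop))
```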

You might also want to run a histogram on SampleMeans and see what the frequency distribution looks like. Right now, all we need to look at is a summary of the list of sample means:
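
Here is one way to do both, as a sketch that rebuilds SampleMeans from the built-in state.x77 populations (an assumed stand-in for the census data, so the summary values will differ from those discussed in this section). The plot title is just a label chosen for illustration.

```r
# Stand-in population (assumption): 1975 estimates for the 50 states, built into R
pop <- state.x77[, "Population"] * 1000
set.seed(123)  # arbitrary seed for repeatability
SampleMeans <- replicate(10000, mean(sample(pop, size = 5, replace = TRUE)))

# Frequency distribution of the sample means
hist(SampleMeans, main = "Sampling distribution of the mean (n = 5)")

# Minimum, quartiles, median, mean, and maximum in one call
s <- summary(SampleMeans)
print(s)
```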

If you need a refresher on the median and quartiles, take a look back at Chapter 3 Rows and Columns.

This summary is full of useful information. First, take a look at the max and the min. The minimum sample mean in the list was 799,100. Think about that for a moment. How could a sample have a mean that small when we know that the true mean is much higher? Wyoming must have been drawn several times in that sample! The answer comes from the randomness involved in sampling. If you run a process 10,000 times you are definitely going to end up with a few weird examples. It’s almost like buying a lottery ticket. The vast majority of tickets are the usual: not a winner. Once in a great while, though, there is a very unusual ticket: a winner. Sampling is the same: the extreme events are unusual, but they do happen if you run the process enough times. The same goes for the maximum: at 25,030,000 the maximum sample mean is much higher than the true mean.

At 5,370,000 the median is quite close to the mean, but not exactly the same because we still have a little bit of rightward skew (the "tail" on the high side is slightly longer than it should be because of the reverse J-shape of the original distribution). The median is very useful because it divides the sample exactly in half: 50%, or exactly 5,000, of the sample means are larger than 5,370,000 and the other 50% are lower. So, if we were to draw one more sample from the population, its mean would have a fifty-fifty chance of being above the median. The quartiles help us to cut things up even more finely. The third quartile divides the bottom 75% from the top 25%. So only 25% of the sample means are higher than 7,622,000. That means that if we drew a new sample from the population, there is only a 25% chance that its mean would be larger than that. Likewise, in the other direction, the first quartile tells us that there is only a 25% chance that the mean of a new sample would be less than 3,853,000.

There is a slightly different way of getting the same information from R that will prove more flexible for us in the long run. The quantile() function can show us the same information as the median and the quartiles, like this:
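
A sketch of such a call is shown here. It rebuilds SampleMeans from the built-in state.x77 populations, which is an assumed stand-in for the census data, so the cut points will not match the ones quoted in this section.

```r
# Stand-in population (assumption): 1975 estimates for the 50 states, built into R
pop <- state.x77[, "Population"] * 1000
set.seed(123)  # arbitrary seed for repeatability
SampleMeans <- replicate(10000, mean(sample(pop, size = 5, replace = TRUE)))

# First quartile, median, and third quartile in one call
quartiles <- quantile(SampleMeans, probs = c(0.25, 0.50, 0.75))
print(quartiles)

# The 50% cut is the median: about half of the sample means lie above it
print(mean(SampleMeans > quartiles[["50%"]]))
```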

You will notice that the values are just slightly different, by less than one tenth of one percent, than those produced by the summary() function. These are actually more precise, although the less precise ones from summary() are fine for most purposes. One reason to use quantile() is that it lets us control exactly where we make the cuts. To get quartiles, we cut at 25% (0.25 in the command just above), at 50%, and at 75%. But what if we wanted instead to cut at 2.5% and 97.5%? Easy to do with quantile():
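
For example, the following sketch cuts at 2.5% and 97.5% and then checks what share of the sample means falls between the two cuts. As before, it assumes the built-in state.x77 populations as a stand-in for the census data, so the values will differ from those in the text.

```r
# Stand-in population (assumption): 1975 estimates for the 50 states, built into R
pop <- state.x77[, "Population"] * 1000
set.seed(123)  # arbitrary seed for repeatability
SampleMeans <- replicate(10000, mean(sample(pop, size = 5, replace = TRUE)))

# Cut points that leave 2.5% of the sample means in each tail
cuts <- quantile(SampleMeans, probs = c(0.025, 0.975))
print(cuts)

# Share of sample means between the two cuts: roughly 95%
inside <- mean(SampleMeans >= cuts[[1]] & SampleMeans <= cuts[[2]])
print(inside)
```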

So this result shows that, if we drew a new sample, there is only a 2.5% chance that the mean would be lower than 2,014,580. Likewise, there is only a 2.5% chance that the new sample mean would be higher than 13,537,085 (because 97.5% of the means in the sampling distribution are lower than that value).

Now let’s put this knowledge to work. Here is a sample of population counts, where each count is the number of people in some kind of unit associated with the U.S.:

3,706,690 159,358 106,405 55,519 53,883

We can easily get these into R and calculate the sample mean:
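
One way to do this, using the five values listed above and the vector name MysterySample that the rest of this section refers to:

```r
# The five mystery observations, typed in by hand
MysterySample <- c(3706690, 159358, 106405, 55519, 53883)

# Sample mean
mystery_mean <- mean(MysterySample)
print(mystery_mean)  # 816371
```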

The mean of our mystery sample is 816,371. The question is, is this a sample of U.S. states or is it something else? Just on its own it would be hard to tell. The first observation in our sample has more people in it than Kansas, Utah, Nebraska, and several other states. We also know from looking at the distribution of raw population data from our previous example that there are many, many states that are quite small in the number of people. Thanks to the work we’ve done earlier in this chapter, however, we have an excellent basis for comparison. We have the sampling distribution of means, and it is fair to say that if we get a new mean to look at, and the new mean is way out in the extreme areas of the sampling distribution, say, below the 2.5% mark or above the 97.5% mark, then it seems much less likely that our MysterySample is a sample of states.

In this case, we can see quite clearly that 816,371 is on the extreme low end of the sampling distribution. Recall that when we ran the quantile() command we found that only 2.5% of the sample means in the distribution were smaller than 2,014,580.

In fact, we could even play around with a more stringent criterion:
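
A sketch of that stricter check follows. It assumes the built-in state.x77 populations (1975 estimates) as a stand-in for the census data, so the cut point it prints will not be the one quoted in this section, and the comparison may come out differently than it does with the 2010 figures.

```r
# Stand-in population (assumption): 1975 estimates for the 50 states, built into R
pop <- state.x77[, "Population"] * 1000
set.seed(123)  # arbitrary seed for repeatability
SampleMeans <- replicate(10000, mean(sample(pop, size = 5, replace = TRUE)))
MysterySample <- c(3706690, 159358, 106405, 55519, 53883)

# Cut point that leaves only 0.5% of the sample means below it
strict_cut <- quantile(SampleMeans, probs = 0.005)
print(strict_cut)

# Is the mystery mean below even this stricter cut point?
below_strict <- mean(MysterySample) < strict_cut[[1]]
print(below_strict)

# Share of simulated sample means that are smaller than the mystery mean
print(mean(SampleMeans < mean(MysterySample)))
```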

This quantile() command shows that only 0.5% of all the sample means are lower than 1,410,883. So our MysterySample mean of 816,371 would definitely be a very rare event, if it were truly a sample of states. From this we can infer, tentatively but based on good statistical evidence, that our MysterySample is not a sample of states. Its mean is just too small for it to be very likely that it came from the states.

And this is in fact correct: MysterySample contains the number of people in five different U.S. territories, including Puerto Rico in the Caribbean and Guam in the Pacific. These territories are land masses and groups of people associated with the U.S., but they are not states and they are different from states in many ways. For one thing they are all islands, so they are limited in land mass. Among the U.S. states, only Hawaii is an island, and it is actually bigger than 10 of the states in the continental U.S. The key thing to take away is that the mean of this sample was sufficiently different from a known distribution of means that we could make an inference that the sample was not drawn from the original population of data.

This reasoning is the basis for virtually all statistical inference. You construct a comparison distribution, you mark off a zone of extreme values, and you compare any new sample of data you get to the distribution to see if it falls in the extreme zone. If it does, you tentatively conclude that the new sample was obtained from some other source than what you used to create the comparison distribution.
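
The three steps in that paragraph can be wrapped into one small function. This is only a sketch under stated assumptions: the function name in_extreme_zone and its arguments are hypothetical, the default 5% extreme zone is split evenly between the two tails, and the comparison population is again the built-in state.x77 table rather than the census data used in this chapter.

```r
# Hypothetical helper: does the mean of new_sample fall in the extreme zone
# of a sampling distribution built from `population`?
in_extreme_zone <- function(new_sample, population, reps = 10000, alpha = 0.05) {
  n <- length(new_sample)
  # 1. Construct the comparison distribution of sample means
  comparison <- replicate(reps, mean(sample(population, size = n, replace = TRUE)))
  # 2. Mark off the extreme zone (alpha/2 in each tail)
  cuts <- quantile(comparison, probs = c(alpha / 2, 1 - alpha / 2))
  # 3. Compare the new sample mean to the cut points
  new_mean <- mean(new_sample)
  new_mean < cuts[[1]] || new_mean > cuts[[2]]
}

# Stand-in population (assumption): 1975 estimates for the 50 states, built into R
pop <- state.x77[, "Population"] * 1000
set.seed(123)  # arbitrary seed for repeatability
print(in_extreme_zone(c(3706690, 159358, 106405, 55519, 53883), pop))
```

A result of TRUE means the new sample mean landed in the extreme zone, so you would tentatively conclude that the sample came from some other source.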

If you feel a bit confused, take heart. There are 400-500 years of mathematical developments represented in that one preceding paragraph. Also, before we had cool programs like R that could be used to create and analyze actual sampling distributions, most of the material above was taught as a set of formulas and proofs. Yuck! Later in the book we will come back to specific statistical procedures that use the reasoning described above. For now, we just need to take note of three additional pieces of information.

First, we looked at the mean of the sampling distribution with mean() and we looked at its shape with hist(), but we never quantified the spread of the distribution:
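
As a sketch, the spread can be measured with sd(). SampleMeans is rebuilt here from the built-in state.x77 populations, an assumed stand-in for the census data, so the number will differ from the one this section goes on to discuss.

```r
# Stand-in population (assumption): 1975 estimates for the 50 states, built into R
pop <- state.x77[, "Population"] * 1000
set.seed(123)  # arbitrary seed for repeatability
SampleMeans <- replicate(10000, mean(sample(pop, size = 5, replace = TRUE)))

# Standard deviation of the 10,000 sample means
empirical_se <- sd(SampleMeans)
print(empirical_se)
```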

This shows us the standard deviation of the distribution of sampling means. Statisticians call this the "standard error of the mean." This chewy phrase would have been clearer, although longer, if it had been something like this: "the standard deviation of the distribution of sample means for samples drawn from a population." Unfortunately, statisticians are not known for giving things clear labels. Suffice to say that when we are looking at a distribution and each data point in that distribution is itself a representation of a sample (for example, a mean), then the standard deviation is referred to as the standard error.

Second, there is a shortcut to finding out the standard error that does not require actually constructing an empirical distribution of 10,000 (or any other number of) sample means. It turns out that the standard deviation of the original raw data and the standard error are closely related by a simple bit of algebra:
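
The shortcut is the standard deviation of the raw data divided by the square root of the sample size. The sketch below applies it to the built-in state.x77 populations (an assumed stand-in for the census data) for n=5, and then for two larger sample sizes used earlier in this section to show how the standard error changes.

```r
# Stand-in population (assumption): 1975 estimates for the 50 states, built into R
pop <- state.x77[, "Population"] * 1000

# Shortcut standard error for samples of size 5
shortcut_se <- sd(pop) / sqrt(5)
print(shortcut_se)

# The same shortcut for several sample sizes
sizes <- c(5, 16, 120)
se_by_size <- sd(pop) / sqrt(sizes)
print(data.frame(n = sizes, standard_error = se_by_size))
```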

The formula in this command takes the standard deviation of the original state data and divides it by the square root of the sample size. Remember that three or four pages ago we created the SampleMeans vector by using the replicate() and sample() commands, and that we used a sample size of n=5. That’s what you see in the formula above, inside of the sqrt() function. In R and other software, sqrt() is the abbreviation for "square root" and not for "squirt" as you might expect. So if you have a set of observations and you calculate their standard deviation, you can also calculate the standard error for a distribution of means (each of which has the same sample size), just by dividing by the square root of the sample size. You may notice that the number we got with the shortcut was slightly larger than the number that came from the distribution itself, but the difference is not meaningful (and only arises because of randomness in the distribution). Another thing you may have noticed is that the larger the sample size, the smaller the standard error. This leads to an important rule for working with samples: the bigger the better.

The last thing is another shortcut. We found out the 97.5% cut point by constructing the sampling distribution and then using quantile() to tell us the actual cuts. You can also calculate cut points using just the mean and the standard error. Two standard errors down from the mean is the 2.5% cut point and two standard errors up from the mean is the 97.5% cut point.
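
Here is a sketch that puts the two approaches side by side. It assumes the built-in state.x77 populations as a stand-in for the census data and uses the population mean as the center for the shortcut; both choices are assumptions for illustration, and the numbers will not match those from the census data.

```r
# Stand-in population (assumption): 1975 estimates for the 50 states, built into R
pop <- state.x77[, "Population"] * 1000
set.seed(123)  # arbitrary seed for repeatability
SampleMeans <- replicate(10000, mean(sample(pop, size = 5, replace = TRUE)))

# Shortcut: mean plus or minus two standard errors
se <- sd(pop) / sqrt(5)
shortcut_cuts <- c(lower = mean(pop) - 2 * se, upper = mean(pop) + 2 * se)

# Empirical: cut points read off the simulated sampling distribution
empirical_cuts <- quantile(SampleMeans, probs = c(0.025, 0.975))

print(shortcut_cuts)
print(empirical_cuts)
```

Expect the two pairs of cut points to differ somewhat.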

You will notice again that this value is different from what we calculated with the quantile() function using the empirical distribution. The differences arise because of the randomness in the distribution that we constructed. The value above is an estimate that is based on statistical proofs, whereas the empirical SampleMeans list that we constructed is just one of a nearly infinite range of such lists that we could create. We could easily reduce the discrepancy between the two methods by using a larger sample size and by having more replications included in the sampling distribution.
