Section 5.3 Investigation 1.13: Sampling Words (cont.)
In Section 3, we transitioned from sampling from an infinite process to sampling from a finite population. We discussed randomly selecting the sample from a list of the entire population as a way to convince critics that the sample is likely to be representative of the larger population. For that reason, we are willing to generalize results from a randomly selected sample to the larger population.
So now the "random chance" we are modelling is not in an individual outcome (like in a coin flip) but in which individuals we obtain in our sample. As suggested in Investigation 1.12, sample statistics from random samples will again follow a predictable pattern, allowing us to measure the strength of evidence against claims about the population parameter and to estimate the margin-of-error in our estimates of the parameter. We will again consider simulation-based p-values, exact p-values, and a mathematical model for estimating the p-value and confidence interval.
Exercises 5.3.1 Estimating the Standard Deviation of Sample Proportions
In the previous investigation, we saw that if we take random samples from a finite population, the mean of the sampling distribution of sample proportions should equal the population proportion (i.e., \(E(\hat{p}) = \pi\)). But what about the standard deviation of the distribution of sample proportions?
For the variable whether the word is short, \(\pi = 0.41\text{.}\) If we take samples of size 10, what does the binomial-based formula \(SD(\hat{p}) = \sqrt{\pi(1-\pi)/n}\) predict for the standard deviation of the sampling distribution?
Use the applet below or in a new tab to generate a distribution of 10,000 random samples of size \(n = 10\text{,}\) calculating the sample proportion of words that are short each time.
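The applet's simulation can also be approximated offline. The following is a minimal sketch, not the applet's actual code. It assumes the population contains 110 short words out of 268 (the counts quoted in this section) and that the binomial-based prediction is \(\sqrt{\pi(1-\pi)/n}\). The function name `simulate_sample_proportions` and the seed are illustrative choices.

```python
import numpy as np

N_SHORT, N_TOTAL = 110, 268   # assumed population: 110 short words out of 268
N_SAMPLE = 10                 # sample size n
N_REPS = 10_000               # number of random samples to draw


def simulate_sample_proportions(n_short, n_total, n_sample, n_reps, seed=0):
    """Draw n_reps simple random samples (without replacement) and return the p-hats."""
    rng = np.random.default_rng(seed)
    # hypergeometric(ngood, nbad, nsample) counts the successes in one sample
    # drawn without replacement from a population of ngood + nbad items.
    counts = rng.hypergeometric(n_short, n_total - n_short, n_sample, size=n_reps)
    return counts / n_sample


p_hats = simulate_sample_proportions(N_SHORT, N_TOTAL, N_SAMPLE, N_REPS)
pi = N_SHORT / N_TOTAL
binomial_sd = np.sqrt(pi * (1 - pi) / N_SAMPLE)

print(f"simulated mean of p-hat: {p_hats.mean():.4f}")
print(f"simulated SD of p-hat:   {p_hats.std():.4f}")
print(f"binomial-formula SD:     {binomial_sd:.4f}")
```

Changing `N_SAMPLE` to 100 should show a larger gap between the simulated and predicted standard deviations, because a larger fraction of the population is being sampled.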
The theoretical formula we've been using for the standard deviation of sample proportions assumes that the probability of success is constant and the trials are independent (properties of a binomial random variable). Now that we are sampling without replacement from a finite population, the population characteristics do change slightly as we remove observations. The probability the first word is short will be \(110/268 \approx 0.410\text{,}\) but if that word is short, the probability the next one will also be short is now \(109/267 \approx 0.408\text{.}\) And if the first word is not short, the probability the second one will be is \(110/267 \approx 0.412\text{.}\) In other words, the probability of success (short word) for the second trial depends on what we obtained in the first trial.
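The conditional probabilities above can be checked with exact fractions. This is a small illustrative sketch using the counts from the paragraph (110 short words out of 268); the variable names are hypothetical. It also shows that, averaged over the first draw, the chance that the second word is short is still \(110/268\text{:}\) it is the dependence between draws, not a change in the overall success probability, that breaks the binomial model.

```python
from fractions import Fraction

N_SHORT, N_TOTAL = 110, 268  # counts quoted in the text

# Probability the first word drawn is short
p_first = Fraction(N_SHORT, N_TOTAL)

# Probability the second word is short, given what the first word was
p_second_given_short = Fraction(N_SHORT - 1, N_TOTAL - 1)
p_second_given_not_short = Fraction(N_SHORT, N_TOTAL - 1)

# Law of total probability: average the two cases over the first draw
p_second = (p_first * p_second_given_short
            + (1 - p_first) * p_second_given_not_short)

for label, p in [
    ("P(first short)", p_first),
    ("P(second short | first short)", p_second_given_short),
    ("P(second short | first not short)", p_second_given_not_short),
    ("P(second short), unconditional", p_second),
]:
    print(f"{label}: {float(p):.3f}")
```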
This means we can no longer use the binomial distribution to calculate a theoretical standard deviation. This impacts our p-value calculations and our confidence intervals. In particular, the predicted standard deviation is too large, and our confidence intervals will have higher coverage rates than we need.
However, we can use a slightly different formula that corrects for the lack of independence: \(SD(\hat{p}) = \sqrt{\frac{\pi(1-\pi)}{n}} \times \sqrt{\frac{N-n}{N-1}}\text{.}\) (The explanation for this formula is in Investigation 1.15.)
Compute this formula for samples of size \(n = 100\text{.}\) How does the prediction of the standard deviation change? Why? How does it compare to the standard deviation of the sample proportions simulated by the applet?
The finite population correction factor decreases the predicted value of the standard deviation (always multiplying by a value less than one) and gives a value that is closer to the simulation results.
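To see the size of the correction numerically, here is a short sketch that multiplies the binomial-based standard deviation by the correction factor \(\sqrt{\frac{N-n}{N-1}}\) quoted in this section. It assumes \(\pi = 110/268\) and \(N = 268\text{;}\) the helper names `binomial_sd`, `fpc`, and `corrected_sd` are illustrative, not part of any applet or library.

```python
import math


def binomial_sd(pi, n):
    """Standard deviation of p-hat assuming independent trials (binomial model)."""
    return math.sqrt(pi * (1 - pi) / n)


def fpc(N, n):
    """Finite population correction factor sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))


def corrected_sd(pi, n, N):
    """Binomial standard deviation multiplied by the finite population correction."""
    return binomial_sd(pi, n) * fpc(N, n)


pi, N = 110 / 268, 268  # assumed population of 268 words, 110 of them short
for n in (10, 100):
    print(f"n = {n:3d}: binomial SD = {binomial_sd(pi, n):.4f}, "
          f"FPC = {fpc(N, n):.4f}, corrected SD = {corrected_sd(pi, n, N):.4f}")
```

The factor is below one whenever \(n > 1\text{,}\) and it shrinks as the sample takes up a larger share of the population.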
Generate 10,000 samples of size \(n = 100\) from this population. Report the mean and standard deviation from your simulation and comment on how they compare to Question 4.
The mean remains close to 0.41. The standard deviation is now very close to 0.049, matching the original formula without the finite population correction.
Because we are now sampling a much smaller fraction of the population, the original formula provides a reasonable prediction of the sample-to-sample variation in the sample proportions. Note that the correction factor \(\sqrt{\frac{N-n}{N-1}}\) is very close to one when \(N\) is large and so it can be ignored!
When the population size is much larger than the sample size, we can model sampling from a finite population as sampling from an infinite process as before. The population size (\(N\)) is considered much larger than the sample size (\(n\)) if \(N > 20 \times n\text{.}\)
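The rule of thumb can be explored with a few lines of code. This sketch holds \(n = 100\) fixed and lets \(N\) grow, printing the correction factor and whether \(N > 20 \times n\) holds. The population sizes listed and the helper name `population_is_large` are illustrative assumptions.

```python
import math


def fpc(N, n):
    """Finite population correction factor sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))


def population_is_large(N, n, ratio=20):
    """Rule of thumb from the text: N counts as 'much larger' when N > 20 * n."""
    return N > ratio * n


n = 100
for N in (268, 1_000, 2_001, 10_000, 1_000_000):
    print(f"N = {N:>9,d}: FPC = {fpc(N, n):.4f}, "
          f"N > 20n? {population_is_large(N, n)}")
```

Right at the threshold the factor is roughly 0.975, so ignoring it overstates the standard deviation by only a few percent.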
If the population size is not large, we use the "finite population correction factor." Probability sampling methods other than simple random sampling would also require calculating a different standard deviation. You can learn more appropriate techniques in a course on sampling design and methods.
Once again you see the fundamental principle of sampling variability. Fortunately, this variability follows a predictable pattern in the long run. With random sampling, we expect the sample proportion to be "representative" of the population proportion. The second key advantage of random sampling is we now know how to estimate the size of the sample-to-sample variation. Sample proportions that are based on larger samples will tend to fall even closer to the population proportion as there is less variability among the sample proportions. So first, select randomly to avoid bias, and then if we can increase the sample size, this will improve the precision of the sample results. Once we know the precision of the sample results in repeated samples, then we can decide whether any one particular sample result can be considered statistically significant or unlikely to happen by chance (by the random sampling process) alone. A small p-value does not guarantee that the sample result did not happen just by chance, but it does allow you to measure how unlikely such a result is.
In Investigation 1.15, you will consider a method for computing an "exact" p-value, taking the population size into account. However, in many situations we don't know the size of the population, just that it is large, and instead we will proceed directly to the binomial or normal-based method, as long as the technical conditions are met.
Checkpoint 5.3.3. Finite Population Correction with \(n = 40\).
Standardized statistics and confidence intervals can take the finite population correction into account. In the Simulating Confidence Intervals applet, use the Method pull-down menu to select Finite Population. Set \(\pi = 0.410\text{,}\) \(N = 268\text{,}\) and \(n = 40\text{.}\) Sample 2000 intervals and determine the coverage rate.
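A rough offline version of this checkpoint can be sketched as follows. How the applet's Finite Population method builds its intervals is not stated here, so this sketch assumes a normal-based (Wald) 95% interval, \(\hat{p} \pm 1.96 \times SE\text{,}\) with the standard error optionally multiplied by \(\sqrt{\frac{N-n}{N-1}}\text{.}\) The function name `coverage_rate`, the seed, and the choice of 110 short words out of 268 (so \(\pi \approx 0.410\)) are assumptions for illustration.

```python
import numpy as np


def coverage_rate(n_short, n_total, n_sample, n_intervals,
                  use_fpc, z=1.96, seed=1):
    """Fraction of simulated Wald intervals that capture the population proportion."""
    rng = np.random.default_rng(seed)
    pi = n_short / n_total
    # Samples are drawn without replacement, so the counts are hypergeometric.
    counts = rng.hypergeometric(n_short, n_total - n_short, n_sample,
                                size=n_intervals)
    p_hat = counts / n_sample
    se = np.sqrt(p_hat * (1 - p_hat) / n_sample)
    if use_fpc:
        # Shrink the standard error by the finite population correction factor.
        se = se * np.sqrt((n_total - n_sample) / (n_total - 1))
    lower, upper = p_hat - z * se, p_hat + z * se
    return float(np.mean((lower <= pi) & (pi <= upper)))


without_fpc = coverage_rate(110, 268, 40, 2000, use_fpc=False)
with_fpc = coverage_rate(110, 268, 40, 2000, use_fpc=True)
print(f"coverage without correction: {without_fpc:.3f}")
print(f"coverage with correction:    {with_fpc:.3f}")
```

Because both calls use the same seed, they evaluate the same 2000 samples; the corrected intervals are narrower, so their coverage can only be equal to or lower than that of the uncorrected intervals.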