Section 25.1 Section 2: Comparing Several Population Means
In the previous section, you learned the chi-squared test for comparing proportions among two or more groups. An important idea was using one overall procedure that compares all groups simultaneously and controls the overall Type I error rate. In this section, you extend that same logic to comparing two or more population means.
The U.S. Vocational Rehabilitation Act of 1973 prohibited discrimination against people with disabilities. Researchers later studied whether physical disabilities affect perceptions of employment qualifications (Cesare, Tannenbaum, and Dalessio, 1990).
Seventy undergraduates were randomly assigned to view one of five videotapes, one for each disability condition, and then rated the applicant on ten questions, each on a scale from 1 to 9. The ten responses were averaged into one overall qualification score. The research question is whether mean ratings differ by disability condition.
(a) Identify the observational units, explanatory variable, and response variable. Classify each variable as quantitative or categorical. Is this an observational study or an experiment? Explain.
(d) What additional information is needed to decide whether these sample means are significantly different beyond what random assignment alone might produce?
(e) Suggest a standardized statistic to measure how much the five sample means differ. Outline a null simulation and indicate what values of the statistic would count as evidence against the null hypothesis.
(f) Use the Comparing Groups (Quantitative) applet: AnovaShuffle. Load DisabilityEmployment.txt, confirm explanatory/response variables, and run shuffles for the Mean Group Diff statistic. What random process is being simulated?
(h) Compare two sets of boxplots with similar group means but different within-group spread. In which set is evidence stronger that at least one population mean differs? Explain how the Mean Group Difference values compare.
Discussion: With similar group means, larger within-group variability weakens evidence for mean differences. Smaller within-group variability strengthens evidence because observed mean differences stand out more relative to natural variation.
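The shuffling idea from parts (e) and (f) and the point about within-group spread can be sketched in a few lines of Python. This is only an illustration, not the applet's code: it assumes the Mean Group Diff statistic is the average absolute difference between all pairs of group means, and the two data sets (`tight` and `spread`) are made-up values chosen so that both have group means 4, 5, and 6.

```python
import random
from itertools import combinations
from statistics import mean


def mean_group_diff(groups):
    """Average absolute difference between all pairs of group means (assumed definition)."""
    means = [mean(g) for g in groups]
    pairs = list(combinations(means, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def shuffle_p_value(groups, reps=2000, seed=1):
    """Share of shuffles whose statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mean_group_diff(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    count = 0
    for _ in range(reps):
        # Re-randomize: shuffle all responses, then deal them back into groups.
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if mean_group_diff(shuffled) >= observed:
            count += 1
    return observed, count / reps


# Hypothetical ratings (not study data): same group means, different spread.
tight = [[3.8, 3.9, 4.0, 4.1, 4.2], [4.8, 4.9, 5.0, 5.1, 5.2], [5.8, 5.9, 6.0, 6.1, 6.2]]
spread = [[1.0, 2.5, 4.0, 5.5, 7.0], [2.0, 3.5, 5.0, 6.5, 8.0], [3.0, 4.5, 6.0, 7.5, 9.0]]

obs_tight, p_tight = shuffle_p_value(tight)
obs_spread, p_spread = shuffle_p_value(spread)
print(f"tight:  statistic={obs_tight:.2f}, p-value={p_tight:.3f}")
print(f"spread: statistic={obs_spread:.2f}, p-value={p_spread:.3f}")
```

The observed statistic is the same for both data sets because it depends only on the group means. What changes is the shuffled (null) distribution: with more within-group spread, shuffling alone produces large differences in group means more often, so the approximate p-value is larger.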
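For large samples, the null distribution of the F-statistic is well modeled by an F distribution with \(I-1\) numerator degrees of freedom and \(n-I\) denominator degrees of freedom, where \(I\) is the number of groups and \(n\) is the total sample size.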
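For reference, one standard way to write the one-way ANOVA F-statistic is shown below. The notation is an assumption of this sketch rather than taken from the text above: \(n_i\text{,}\) \(\bar{x}_i\text{,}\) and \(s_i\) denote the sample size, sample mean, and sample standard deviation of group \(i\text{,}\) and \(\bar{x}\) is the overall mean of all \(n\) observations.

\begin{equation*}
F = \frac{\displaystyle\sum_{i=1}^{I} n_i\left(\bar{x}_i - \bar{x}\right)^2 \Big/ (I-1)}{\displaystyle\sum_{i=1}^{I} (n_i - 1)\, s_i^2 \Big/ (n-I)}
\end{equation*}

The numerator measures variability between the group means and the denominator measures pooled variability within groups. In the disability study, \(I = 5\) and \(n = 70\text{,}\) so the reference distribution has \(5 - 1 = 4\) numerator and \(70 - 5 = 65\) denominator degrees of freedom.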
The distributions of scores across disability conditions are reasonably well behaved with similar standard deviations, so the ANOVA conditions are plausible. The p-value (about 0.030) indicates moderate evidence that mean qualification ratings differ by disability condition. Because this was a randomized experiment, a causal interpretation is justified for these participants: the disability condition shown in the video affected the ratings. Generalization beyond this student sample should be cautious.
Reconsider the boxplots from part (h). Explain how differences in within-group spread affect the F-statistic and p-values, and which set should have the smaller p-value.
Section 25.1.2 Investigation 5.5: Restaurant Spending and Music
Exercises: The Study
A British study by North, Shilcock, and Hargreaves (2003) examined whether background music in a restaurant affected how much diners spent. The restaurant alternated classical music, popular music, and silence on successive nights over 18 days.
(b) Using the summary statistics, calculate the overall weighted mean amount spent, the between-group variability, and the pooled within-group variability. Check consistency with the reported overall mean 22.53 and overall SD 2.969.
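The calculation in part (b) can be organized as a small Python function that works from group sizes, means, and standard deviations alone. This is a sketch under stated assumptions: the function name `anova_from_summary` is hypothetical, and the numbers passed to it are placeholders, not the restaurant study's summary statistics. The last line of the function uses the identity that total variability equals between-group plus within-group variability, which is the consistency check against a reported overall SD.

```python
from math import sqrt


def anova_from_summary(sizes, means, sds):
    """One-way ANOVA quantities computed from per-group summary statistics."""
    n = sum(sizes)   # total sample size
    k = len(sizes)   # number of groups
    # Overall mean weights each group mean by its sample size.
    overall_mean = sum(ni * m for ni, m in zip(sizes, means)) / n
    # Between-group sum of squares: group means around the overall mean.
    ss_between = sum(ni * (m - overall_mean) ** 2 for ni, m in zip(sizes, means))
    # Within-group sum of squares: pooled from the group variances.
    ss_within = sum((ni - 1) * s ** 2 for ni, s in zip(sizes, sds))
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    # Total sum of squares = between + within, so the overall SD is implied.
    overall_sd = sqrt((ss_between + ss_within) / (n - 1))
    return {
        "overall_mean": overall_mean,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f_statistic": ms_between / ms_within,
        "overall_sd": overall_sd,
    }


# Placeholder inputs for illustration only (NOT the study's values).
result = anova_from_summary(sizes=[10, 12, 8], means=[20.0, 22.0, 25.0], sds=[2.0, 3.0, 2.5])
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Substituting the study's actual group sizes, means, and standard deviations would let you compare `overall_mean` and `overall_sd` with the reported 22.53 and 2.969.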
The classical music nights had the largest sample mean spending. The reported ANOVA result (\(F=31.48\text{,}\) very small p-value) gives strong evidence of differences in group means. However, independence is questionable because treatments were assigned by evening rather than to individual diners, so evening-level effects may be confounded with music condition. Thus descriptive differences are clear, but causal interpretation and broader generalization require caution.
(g) Draw 10-20 more times and track the boxplots, F-statistics, and p-values. Did you ever get a p-value below 0.05? Is that possible under the null? Would it be surprising?
Discussion: When the null hypothesis is true and the technical conditions hold, p-values are approximately uniformly distributed between 0 and 1, so about 5% of them fall below 0.05 by chance alone. Power increases when group means are farther apart, when within-group variability is smaller, and when sample sizes are larger.
In ANOVA, a Type I error means concluding at least one population mean differs when all are equal. A Type II error means failing to detect differences when at least one population mean actually differs.
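The claims about Type I error and power can be checked with a rough simulation. The sketch below makes several assumptions that are not in the text: populations are normal with a common standard deviation, group sizes are equal, and the 5% cutoff for F is estimated from simulated null samples rather than read from an F distribution. All population means, standard deviations, and sample sizes are made up for illustration.

```python
import random
from statistics import fmean


def f_statistic(groups):
    """One-way ANOVA F-statistic computed from raw data."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = fmean([x for g in groups for x in g])
    group_means = [fmean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def simulate_f(pop_means, sd, n_per_group, reps, rng):
    """F-statistics from repeated samples drawn from normal populations."""
    return [
        f_statistic([[rng.gauss(mu, sd) for _ in range(n_per_group)] for mu in pop_means])
        for _ in range(reps)
    ]


def rejection_rate(pop_means, sd, n_per_group, reps=1000, seed=2):
    """Estimated probability of rejecting the null at the 5% level."""
    rng = random.Random(seed)
    # Simulated null distribution (all population means equal) gives the cutoff.
    null_f = sorted(simulate_f([0.0] * len(pop_means), sd, n_per_group, reps, rng))
    cutoff = null_f[int(0.95 * reps)]
    alt_f = simulate_f(pop_means, sd, n_per_group, reps, rng)
    return sum(f > cutoff for f in alt_f) / reps


# Hypothetical scenarios: (population means, common SD, observations per group).
scenarios = {
    "null true": ([0.0, 0.0, 0.0], 1.0, 10),
    "baseline": ([0.0, 0.5, 1.0], 1.0, 10),
    "means farther apart": ([0.0, 1.0, 2.0], 1.0, 10),
    "smaller SD": ([0.0, 0.5, 1.0], 0.5, 10),
    "larger samples": ([0.0, 0.5, 1.0], 1.0, 40),
}
rates = {name: rejection_rate(*args) for name, args in scenarios.items()}
for name, rate in rates.items():
    print(f"{name}: rejection rate = {rate:.3f}")
```

When the null is true, the rejection rate estimates the Type I error rate and should land near 0.05. In the other scenarios the rejection rate estimates power, which equals one minus the probability of a Type II error.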
ANOVA assesses whether differences among sample or treatment means are larger than expected from natural within-group variability. The procedure is based on the F distribution and applies both to independent random samples and randomized experiments, with conclusions depending on study design.
You also checked technical conditions for using the F model: independent observations (or random assignment), approximately normal group distributions, and similar group standard deviations.