
Section 25.1 Section 2: Comparing Several Population Means

In the previous section, you learned the chi-squared test for comparing proportions among two or more groups. An important idea was using one overall procedure that compares all groups simultaneously and controls the overall Type I error rate. In this section, you extend that same logic to comparing two or more population means.
The method in this section applies both to independent random samples and to randomized experiments.

Section 25.1.1 Investigation 5.4: Disability Discrimination

Exercises The Study

The U.S. Vocational Rehabilitation Act of 1973 prohibited discrimination against people with disabilities. Researchers later studied whether physical disabilities affect perceptions of employment qualifications (Cesare, Tannenbaum, and Dalessio, 1990).
They prepared videotaped interviews using the same actors and script each time. The only difference was the applicant condition:
no disability (control), leg amputation, Canadian crutches, hearing impairment, and wheelchair confinement.
Seventy undergraduates were randomly assigned to view one videotape and then rated the applicant on ten questions using a 1-9 scale. The responses were averaged into one overall qualification score. The research question is whether mean ratings differ by disability condition.
1. Identify Study Components.
(a) Identify the observational units, explanatory variable, and response variable. Classify each variable as quantitative or categorical. Is this an observational study or an experiment? Explain.
2. State Hypotheses.
(b) State null and alternative hypotheses that reflect the researchers’ conjecture. Define any parameter symbols used.
3. Type I and Type II Errors.
(c) Explain what a Type I error and a Type II error represent in this context.
The five sample means were: amputee 4.429, crutches 5.921, hearing 4.050, none 4.900, wheelchair 5.343.
4. What Else Is Needed?
(d) What additional information is needed to decide whether these sample means are significantly different beyond what random assignment alone might produce?
5. Propose Statistic and Simulation.
(e) Suggest a standardized statistic to measure how much the five sample means differ. Outline a null simulation and indicate what values of the statistic would count as evidence against the null hypothesis.
One approach is a Mean Group Difference statistic and a shuffle/randomization procedure that reassigns response scores to the five disability groups.
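The shuffle procedure can be sketched in Python as follows. This is only an illustration, not the applet's code. It assumes the Mean Group Difference statistic is the average absolute difference between all pairs of group means. The function names are hypothetical, and the ratings in `toy_groups` are made-up placeholder values, not the study data.

```python
import random
from itertools import combinations


def mean_group_diff(groups):
    """Average absolute difference between all pairs of group means (assumed definition)."""
    means = [sum(g) / len(g) for g in groups]
    pair_diffs = [abs(a - b) for a, b in combinations(means, 2)]
    return sum(pair_diffs) / len(pair_diffs)


def shuffle_p_value(groups, num_shuffles=2000, seed=1):
    """Estimate a p-value by re-dealing all responses to groups of the original sizes."""
    rng = random.Random(seed)
    observed = mean_group_diff(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    count_extreme = 0
    for _ in range(num_shuffles):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        # Large values of the statistic count as evidence against the null.
        if mean_group_diff(shuffled) >= observed:
            count_extreme += 1
    return observed, count_extreme / num_shuffles


# Placeholder ratings (NOT the study data): three small groups for illustration only.
toy_groups = [[4.0, 5.0, 6.0], [5.0, 6.0, 7.0], [7.0, 8.0, 9.0]]
observed_stat, p_value = shuffle_p_value(toy_groups)
print(f"observed Mean Group Diff = {observed_stat:.3f}, empirical p-value = {p_value:.3f}")
```

Each shuffle mimics the random assignment of students to videotapes under the assumption that the condition shown has no effect on the rating.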
6. Applet Setup and Interpretation.
(f) Use the Comparing Groups (Quantitative) applet: AnovaShuffle. Load DisabilityEmployment.txt, confirm explanatory/response variables, and run shuffles for the Mean Group Diff statistic. What random process is being simulated?
7. Empirical p-value from Mean Group Diff.
(g) Run a large number of shuffles. Describe the null distribution (shape, mean, SD) and compute the empirical p-value for Mean Group Diff.
8. Compare Two Sets of Boxplots.
(h) Compare two sets of boxplots with similar group means but different within-group spread. In which set is evidence stronger that at least one population mean differs? Explain how the Mean Group Difference values compare.
Discussion: With similar group means, larger within-group variability weakens evidence for mean differences. Smaller within-group variability strengthens evidence because observed mean differences stand out more relative to natural variation.
9. Record Descriptive Statistics.
(i) Record missing descriptive statistics:
Statistic                  None   Amputee  Crutches  Hearing  Wheelchair
Sample size                14     14       14        14       14
Sample mean                4.900  4.429    5.921     4.050    5.343
Sample standard deviation  1.794  1.586    1.482     1.533    1.748
Write one sentence interpreting the sample standard deviation for the none category.
Now focus on deviations in group means from the overall mean.
10. Overall Mean.
(j) What is the overall mean applicant qualification rating across all 70 students?
11. SD of Group Means.
(k) Treat the five sample means as five observations. Compute their standard deviation. What denominator appears in this calculation?
12. Equal Contribution of Means?
(l) Should each sample mean contribute equally to an overall between-group variability measure? Explain.
Define between-group variability as a weighted variance across group means, weighted by group sample sizes.
13. Between-group Variability (Equal n).
(m) Here each group has size 14. Square your standard deviation from part (k) to obtain the variance of the five group means, then multiply that variance by 14 to obtain the between-group variability measure.
14. Measure Natural Variation.
(n) Suggest a measure for the natural variation in these data that does not directly reflect treatment differences.
A natural choice is the pooled within-group variance.
15. Within-group Variability.
(o) Compute the average group variance across the five groups (square each group SD, then average).
For unequal sample sizes, use the pooled-variance form weighted by group sample sizes, with denominator \(n-I\text{.}\)
16. Equal Variability Assumption.
(p) Explain why it is reasonable here to assume similar within-group variability across the groups.
The standardized statistic compares between-group to within-group variability.
17. Compute Variability Ratio.
(q) Compute the ratio: part (m) divided by part (o). How many times larger is between-group variability than within-group variability?
18. Range of Ratio.
(r) What is the smallest possible value of this ratio? The largest?
19. Evidence Direction for Ratio.
(s) When the null is false (not all population means are equal), what kinds of values of this ratio provide evidence against the null?
20. F-statistic Null Distribution.
(t) In the applet, switch the statistic to the F-statistic. Describe the null distribution (shape, mean, SD) and compute the empirical p-value.
21. Compare p-values Across Statistics.
(u) Compare this p-value to the one from Mean Group Difference. Did it change much?
For large samples, the null distribution of this statistic is modeled by an F distribution with numerator df \(I-1\) and denominator df \(n-I\text{.}\)
22. Overlay F Distribution.
(v) Overlay the F distribution in the applet. Does it appear to model the simulated null distribution well? Explain.
23. Technology Confirmation (ANOVA).
(w) Confirm your calculations with ANOVA technology output.
Hint. Technology Detour: ANOVA
R: use summary(aov(response~explanatory)).
Minitab: Stat > ANOVA > One-Way, with Score as response and Disability as factor.
JMP: Analyze > Fit Y by X, then Oneway Analysis > Means/Anova.
Applet: use Comparing Groups (Quantitative), then check Show ANOVA Table.
Technical Conditions.
The ANOVA F-model requires approximately normal group distributions, equal population standard deviations, and independence.
Practical checks in this course:
  • Each group’s distribution appears reasonably well behaved (normal plot, dotplot, or histogram).
  • The ratio of the largest sample SD to the smallest sample SD is less than 2.
  • Independent random samples, or randomized treatment assignment in an experiment.
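The standard deviation check in the list above amounts to one division. A minimal Python sketch, using the sample standard deviations recorded for the disability study (the function name `sd_ratio` is hypothetical):

```python
def sd_ratio(sample_sds):
    """Ratio of the largest to the smallest sample standard deviation."""
    return max(sample_sds) / min(sample_sds)


# Sample SDs from the disability study table (none, amputee, crutches, hearing, wheelchair).
disability_sds = [1.794, 1.586, 1.482, 1.533, 1.748]
ratio = sd_ratio(disability_sds)
print(f"largest/smallest SD = {ratio:.2f}; below 2: {ratio < 2}")
```

The ratio only addresses the equal-variability condition. Normality and independence still have to be judged from graphs and from the study design.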
24. Check Conditions with Data.
(x) Is there evidence of non-normality? Compute the largest-to-smallest sample SD ratio and assess whether it is less than 2.
25. Conclusions Paragraph.
(y) Write a paragraph summarizing conclusions for this study, including significance, causation, and generalizability.
Terminology Detour.
Suppose we compare \(I\) group means, with group sample sizes \(n_i\) and total sample size \(n=\sum n_i\text{.}\)
\(H_0\text{:}\) no treatment effect (equivalently \(\mu_1 = \cdots = \mu_I\)).
\(H_a\text{:}\) there is a treatment effect (at least one \(\mu_i\) differs).
The between-group sum of squares is:
\(SS_{groups}=\sum n_i(\bar{x}_i-\bar{x})^2\text{.}\)
Mean square for groups: \(MS_{groups}=SS_{groups}/(I-1)\text{.}\)
Mean square error (pooled within-group variance): \(MSE=\sum (n_i-1)s_i^2/(n-I)\text{.}\)
ANOVA statistic: \(F = MS_{groups}/MSE\text{,}\) with df \((I-1, n-I)\text{.}\)
Large F values provide evidence against \(H_0\text{.}\)
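The formulas above can be applied directly to summary statistics. The following Python sketch is one way to do so for the disability study; the function name `anova_from_summary` is hypothetical, and the inputs are the sample sizes, means, and standard deviations recorded earlier in this investigation.

```python
def anova_from_summary(sizes, means, sds):
    """Return (MS_groups, MSE, F, df1, df2) from per-group summary statistics."""
    num_groups = len(sizes)
    total_n = sum(sizes)
    # Overall mean is the sample-size-weighted average of the group means.
    overall_mean = sum(n * m for n, m in zip(sizes, means)) / total_n
    ss_groups = sum(n * (m - overall_mean) ** 2 for n, m in zip(sizes, means))
    ms_groups = ss_groups / (num_groups - 1)
    # MSE pools the group variances, weighting each by its degrees of freedom.
    mse = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds)) / (total_n - num_groups)
    return ms_groups, mse, ms_groups / mse, num_groups - 1, total_n - num_groups


# Disability study summary statistics (none, amputee, crutches, hearing, wheelchair).
sizes = [14, 14, 14, 14, 14]
means = [4.900, 4.429, 5.921, 4.050, 5.343]
sds = [1.794, 1.586, 1.482, 1.533, 1.748]

ms_groups, mse, f_stat, df1, df2 = anova_from_summary(sizes, means, sds)
print(f"MS_groups = {ms_groups:.3f}, MSE = {mse:.3f}, F = {f_stat:.2f}, df = ({df1}, {df2})")
```

With these rounded summaries the sketch gives an F-statistic of about 2.86 with df (4, 65). Results computed from rounded summary statistics may differ slightly from output based on the raw data.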
Study Conclusions.
The distributions of scores across disability conditions are reasonably well behaved with similar standard deviations, so the ANOVA conditions are plausible. The p-value (about 0.030) indicates moderate evidence that mean qualification ratings differ by disability condition. Because this was a randomized experiment, it is reasonable to conclude that the displayed disability condition caused the differences in ratings for these participants. Generalization beyond this student sample should be cautious.
26. Practice Problem 5.4A.
Reconsider the boxplots from part (h). Explain how differences in within-group spread affect the F-statistic and p-values, and which set should have the smaller p-value.
27. Practice Problem 5.4B.
Lifetimes of notable people in nine occupational categories were gathered from The World Almanac and Book of Facts.
(a) How many distinct pairs of occupations are there?
(b) Why is ANOVA different from running all possible two-sample t-tests? Why would doing all pairwise tests be inappropriate?
(c) If ANOVA rejects the null, can you conclude every occupation has a different population mean from every other occupation? Explain.

Section 25.1.2 Investigation 5.5: Restaurant Spending and Music

Exercises The Study

A British study by North, Shilcock, and Hargreaves (2003) examined whether background music in a restaurant affected how much diners spent. The restaurant alternated classical music, popular music, and silence on successive nights over 18 days.
Summary statistics for total bills were:
Statistic    Classical music  Pop music    No music
Mean         24.13            21.91        21.70
SD           2.243            2.627        3.332
Sample size  \(n_1=120\)      \(n_2=142\)  \(n_3=131\)
1. State Hypotheses.
(a) State null and alternative hypotheses corresponding to the researchers’ conjecture, in symbols and/or words, and define any symbols used.
2. Compute Key ANOVA Quantities.
(b) Using the summary statistics, calculate the overall weighted mean amount spent, the between-group variability, and the pooled within-group variability. Check consistency with the reported overall mean 22.53 and overall SD 2.969.
3. Compute F and p-value.
(c) Calculate the F-statistic by hand and use technology to determine the p-value.
Which appears larger: variation in amount spent between diners or variation between music conditions?
Hint. Technology Options
F Probability Calculator: fCalc.
R: pf(x, df1, df2, lower.tail=FALSE).
Minitab: Graph > Probability Distribution Plot, Distribution = F.
JMP: Distribution Calculator > F, with upper-tail probability.
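As a cross-check on the technology options above, the upper-tail probability has a simple closed form in the special case where the numerator df is 2, which is the case for three groups: \(P(F > x) = (1 + 2x/df_2)^{-df_2/2}\text{.}\) The Python sketch below uses that formula. It is not a general F calculator, and the function name is hypothetical.

```python
def f_upper_tail_df1_is_2(f_value, df2):
    """P(F > f_value) for an F distribution with numerator df 2 and denominator df df2."""
    if f_value <= 0:
        return 1.0
    return (1 + 2 * f_value / df2) ** (-df2 / 2)


# Music study: I = 3 groups and n = 393 bills give df (2, 390); F = 31.48 is the reported value.
p_value = f_upper_tail_df1_is_2(31.48, 390)
print(f"p-value = {p_value:.2e}")
```

For numerator df of 2 this should agree with pf(x, 2, df2, lower.tail=FALSE) in R. For other numerator df, use one of the tools listed above.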
4. Assess ANOVA Conditions.
(d) Do you consider this p-value valid? Are ANOVA technical conditions met for this design?
Study Conclusions.
The classical music nights had the largest sample mean spending. The reported ANOVA result (\(F=31.48\text{,}\) very small p-value) gives strong evidence of differences in group means. However, independence is questionable because treatments were assigned by evening rather than to individual diners, so evening-level effects may be confounded with music condition. Thus descriptive differences are clear, but causal interpretation and broader generalization require caution.
Applet Exploration: Exploring ANOVA.
Open the Simulating ANOVA Tables applet: AnovaSim.
Set each population mean to 23, sample sizes to 120, 142, and 131, and population SD to 3.
5. First Simulation Draw.
(e) Press Draw Samples once. What F-statistic and p-value do you obtain?
6. Second Simulation Draw.
(f) Press Draw Samples again. Do you get the same F and p-value? Why or why not?
7. Many Draws Under Null.
(g) Draw 10-20 more times and track the boxplots, F-statistics, and p-values. Did you ever get a p-value below 0.05? Is that possible under the null? Would it be surprising?
8. Increase One Mean.
(h) Change \(\mu_1\) to 24 and draw 10-20 times. Do p-values tend to be larger or smaller than in part (g)? Explain.
9. Reduce Sample Sizes.
(i) Change each sample size to 20 and draw 10-20 times. Do p-values tend to be larger or smaller than in part (h)? Explain.
10. Increase Population SD.
(j) Draw samples until the p-value is below 0.3, then increase \(\sigma\) to 7. How does the p-value change, and why?
11. Further Increase Mean Separation.
(k) Continue increasing \(\mu_1\text{.}\) How does the p-value generally change? Explain.
Discussion: Under a true null hypothesis, p-values are uniformly distributed between 0 and 1. Power increases when group means are farther apart, when within-group variability is smaller, and when sample sizes are larger.
In ANOVA, a Type I error means concluding at least one population mean differs when all are equal. A Type II error means failing to detect differences when at least one population mean actually differs.
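The effect of sample size on power described in the discussion above can be explored outside the applet with a small simulation. The Python sketch below is a rough stand-in, not the applet's method. It assumes normal populations with a common SD of 3, and instead of computing p-values it uses the approximate 95th percentile of simulated null F-statistics as the cutoff for a 0.05-level test. The function names, number of repetitions, and seed are arbitrary choices.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F = MS_groups / MSE computed from raw group data."""
    total_n = sum(len(g) for g in groups)
    overall_mean = sum(sum(g) for g in groups) / total_n
    ss_groups = 0.0
    ss_error = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        ss_groups += len(g) * (group_mean - overall_mean) ** 2
        ss_error += sum((x - group_mean) ** 2 for x in g)
    ms_groups = ss_groups / (len(groups) - 1)
    mse = ss_error / (total_n - len(groups))
    return ms_groups / mse


def simulate_f(pop_means, sizes, sigma, reps, rng):
    """F-statistics from `reps` sets of independent normal samples."""
    return [
        f_statistic([[rng.gauss(mu, sigma) for _ in range(n)]
                     for mu, n in zip(pop_means, sizes)])
        for _ in range(reps)
    ]


def estimated_power(alt_means, sizes, sigma=3.0, reps=400, seed=1):
    """Share of alternative-case F values above the simulated null cutoff."""
    rng = random.Random(seed)
    # Under the null the F-statistic does not depend on the common mean, so 0 is used.
    null_f = sorted(simulate_f([0.0] * len(sizes), sizes, sigma, reps, rng))
    cutoff = null_f[int(0.95 * reps)]  # approximate 95th percentile of the null F values
    alt_f = simulate_f(alt_means, sizes, sigma, reps, rng)
    return sum(f > cutoff for f in alt_f) / reps


# Settings taken from the applet exploration: mu_1 = 24, other means 23, sigma = 3.
power_large = estimated_power([24, 23, 23], [120, 142, 131])
power_small = estimated_power([24, 23, 23], [20, 20, 20])
print(f"estimated power, n = 120/142/131: {power_large:.2f}")
print(f"estimated power, n = 20 per group: {power_small:.2f}")
```

Changing `sigma` or the entries of `alt_means` in the same way lets you examine the other two factors. The estimates vary from run to run with the seed and the number of repetitions.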
12. Practice Problem 5.5A.
Analyze Dr. Spock venire percentages by judge (see SpockPers.txt):
(a) What do you learn from this analysis that the earlier chi-squared analysis did not show? Why is it useful?
(b) Produce numerical and graphical summaries across judges.
(c) Carry out ANOVA to test whether at least one judge has a different mean percentage.
(d) Comment on whether ANOVA technical conditions are met.
13. Practice Problem 5.5B.
Revisit the Disability Discrimination study and compare wheelchair versus amputation groups only.
(a) Carry out a two-sided pooled two-sample t-test.
(b) Carry out ANOVA for the same two groups.
(c) Compare p-values and infer how the t and F statistics are related.
(d) Describe when two-sample t procedures are preferred and when ANOVA is preferred.

Section 25.1.3 Section 5.2 Summary

This section introduced comparing several groups on a quantitative response using analysis of variance (ANOVA).
ANOVA assesses whether differences among sample or treatment means are larger than expected from natural within-group variability. The procedure is based on the F distribution and applies both to independent random samples and randomized experiments, with conclusions depending on study design.
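You examined how ANOVA results depend on three key factors: the size of the differences among the group means, the amount of within-group variability, and the sample sizes.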
You also checked technical conditions for using the F model: independent observations (or random assignment), approximately normal group distributions, and similar group standard deviations.
Example 5.2 provides an additional ANOVA application.