The well-known pediatrician and child development author Dr. Benjamin Spock was also an anti-Vietnam War activist. In 1968 he was tried and convicted on charges of conspiracy to violate the Selective Service Act (encouraging young men to avoid the draft). The case was tried by Judge Ford in Boston’s federal courthouse. A peculiar aspect of this case was that his jury contained no women.
A lawyer writing about the case the following year in the University of Chicago Law Review said, "Of all defendants at such trials, Dr. Spock, who had given wise and welcome advice on child-rearing to millions of mothers, would have liked women on his jury" (see Zeisel, 1969). Opinion polls also showed that women were generally more opposed to the Vietnam War than men.
In the Boston District Court, jurors are selected in three stages. First, the Clerk of the Court is supposed to select 300 names at random from the City Directory and place them in a box. In Dr. Spock’s trial, this sample included only 102 women, even though 53% of the eligible jurors in the district were female. At the next stage, the judge selects 30 or more names from those in the box to constitute the venire, the panel of potential jurors. Judge Ford chose 100 potential jurors out of these 300 people, and his choices included only 9 women. Finally, 12 actual jurors are selected after interrogation by both the prosecutor and the defense counsel. Only one potential female juror came before the court, and she was dismissed by the prosecution.
In filing his appeal, Spock’s lawyers argued that Judge Ford had a history of venires in which women were systematically underrepresented. They compared the gender breakdown of this judge’s venires with the venires of six other judges in the same Boston court from a recent sample of court cases. Records revealed the following data:
(a) Calculate the proportion of women on the jury list for each judge. Also create a segmented bar graph or mosaic plot to compare these distributions. How do the judges compare?
(b) Let \(p_i\) represent the long-run probability of judge \(i\) selecting a female for the jury list. State a null and an alternative hypothesis for testing whether these data provide reason to doubt that the probability of women on jury lists is the same for all seven judges.
Note: The null hypothesis states only that the probabilities are equal; it does not specify a particular value for this common probability. The alternative states that at least one probability differs.
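In symbols, one way to write the hypotheses described in this note (using the \(p_i\) notation from part (b), with the seven judges numbered 1 through 7) is:

\[
H_0\colon\ p_1 = p_2 = \cdots = p_7
\qquad\text{versus}\qquad
H_a\colon\ \text{at least one } p_i \text{ differs from the others.}
\]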
(c) Suggest a standardized statistic (formula based on your observed sample data) for assessing the strength of evidence against the null hypothesis. Write your statistic as a formula or rule for obtaining one number that takes into account information relevant to comparing all seven groups.
(d) One possibility is to compare all sample proportions to each other by looking at all pairwise differences. How many such pairs are there? Could we simply sum these differences?
One possible statistic for measuring how much the sample proportions differ from each other is the Mean Group Difference, which finds the absolute value of each pairwise difference, sums these values, and divides by the number of differences.
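The following is a minimal sketch of the Mean Group Difference as just described. The function name `mean_group_difference` and the three example proportions are assumptions made for illustration; they are not the court records.

```python
from itertools import combinations

def mean_group_difference(proportions):
    """Average of the absolute differences over all pairs of group proportions."""
    pairs = list(combinations(proportions, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Number of distinct pairs among seven judges: "7 choose 2".
n_pairs = len(list(combinations(range(7), 2)))
print(n_pairs)  # 21

# Hypothetical proportions for three groups (NOT the court records).
example = [0.30, 0.25, 0.10]
print(mean_group_difference(example))
```

Taking absolute values matters: if the signed differences were simply summed, positive and negative differences could cancel and hide real variation among the groups.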
Although we could consider using the Mean Group Difference statistic here, a more common statistic is the chi-squared statistic. Rather than looking at all differences among the groups, it focuses on how each cell count differs from its expected value under the null hypothesis and then sums this up across all cells.
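Written as a formula, the usual form of this statistic is shown below. Here the sum runs over every cell of the two-way table (for seven judges and two sexes, 14 cells), and the expected count in a cell is the judge's total number of jurors multiplied by the common proportion estimated under the null hypothesis.

\[
\chi^2 \;=\; \sum_{\text{all cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}
\]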
(f) Assuming the null hypothesis is true and each judge has the same probability of a female juror in his pool, suggest an estimate for this common probability.
(g) Judge 1 had 354 jurors on the list. If the long-run proportion of women were 0.261, how many would you expect to be female? How many would you expect to be male?
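The arithmetic in part (g) can be sketched in a few lines. The helper name `expected_counts` is an assumption for illustration; the inputs 354 and 0.261 are the values quoted in the question.

```python
def expected_counts(n_jurors, p_female):
    """Expected (female, male) counts if each juror is female with probability p_female."""
    expected_female = n_jurors * p_female
    expected_male = n_jurors - expected_female
    return expected_female, expected_male

female, male = expected_counts(354, 0.261)
print(round(female, 1), round(male, 1))  # 92.4 261.6
```

Expected counts need not be whole numbers; they describe the long-run average under the null model, not an outcome that must be observed.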
(j) Are the observed counts equal to the expected counts in each cell? Is it possible that the long-run probability of a female juror is the same for each judge and that the observed differences are due to random chance alone?
(l) What types of chi-squared values (large, small, positive, negative) constitute evidence against the null hypothesis of equal long-run probabilities? Explain.
To approximate a p-value, examine how the standardized statistic varies under the null hypothesis of equal probabilities by simulating many random samples under that model.
(m) Outline the steps you would use to generate random data for each judge under the null hypothesis that the probability of a juror being female is the same for each judge.
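One possible way to carry out such a simulation is sketched below; it is not the only valid design (shuffling the sex labels across judges is another). It assumes each judge's jurors are independent draws that are female with the pooled sample proportion, keeps each judge's total fixed, and recomputes the chi-squared statistic for every simulated table. The function names and the three-judge counts are hypothetical and are not the court records.

```python
import random

def chi_squared_statistic(women, totals):
    """Sum of (observed - expected)^2 / expected over the female and male cells."""
    p_hat = sum(women) / sum(totals)  # pooled proportion of women
    if p_hat <= 0 or p_hat >= 1:
        return 0.0  # degenerate table: no variation to measure
    stat = 0.0
    for w, n in zip(women, totals):
        exp_w, exp_m = n * p_hat, n * (1 - p_hat)
        stat += (w - exp_w) ** 2 / exp_w + ((n - w) - exp_m) ** 2 / exp_m
    return stat

def simulate_null(women, totals, reps=1000, seed=0):
    """Simulate chi-squared statistics assuming one common probability for all judges."""
    rng = random.Random(seed)
    p_hat = sum(women) / sum(totals)
    stats = []
    for _ in range(reps):
        # Each juror is female with probability p_hat, independently.
        fake = [sum(rng.random() < p_hat for _ in range(n)) for n in totals]
        stats.append(chi_squared_statistic(fake, totals))
    return stats

# Hypothetical counts for three judges (NOT the court records).
women = [15, 12, 8]
totals = [40, 50, 60]
observed = chi_squared_statistic(women, totals)
null_stats = simulate_null(women, totals)
# Empirical p-value: share of simulated statistics at least as large as the observed one.
p_value = sum(s >= observed for s in null_stats) / len(null_stats)
print(observed, p_value)
```

Fixing the seed makes the run reproducible; increasing `reps` gives a more stable estimate of the p-value.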
(p) Based on your simulation, determine the proportion of simulated values that are as large as or larger than your value from part (k). Does this empirical p-value provide convincing evidence that the observed discrepancy is larger than expected by chance? What do you conclude about whether the seven judges had the same long-run probability of selecting a female juror?
The chi-squared distribution is skewed right and provides a reasonable model for this statistic for large sample sizes. We typically use this model when all expected counts are at least 1 and at least 80% of expected counts are at least 5.
When comparing several population proportions on a binary response, the chi-squared degrees of freedom are \(c-1\), where \(c\) is the number of explanatory variable categories (here \(c = 7\) judges, giving 6 degrees of freedom).
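For an even number of degrees of freedom, such as the 6 that apply to seven judges, the upper-tail probability of the chi-squared distribution has a simple closed form, so a theory-based p-value can be sketched without a statistics library. The function name `chi2_upper_tail_even_df` is an assumption for illustration, and the formula used here does not cover odd degrees of freedom.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for a chi-squared X with a positive, even number of degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2
    term, total = 1.0, 1.0
    # Builds 1 + half + half^2/2! + ... + half^(df/2 - 1) / (df/2 - 1)!
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Sanity check against the familiar 5% critical value for 6 degrees of freedom.
print(chi2_upper_tail_even_df(12.592, 6))
```

In practice a statistics package would normally be used for this step; the sketch only shows where the tail probability comes from.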
Discussion: If the null hypothesis is rejected, the conclusion is that at least one population proportion differs from the rest, but the test itself does not identify exactly which one(s). To learn more, inspect the component terms in the chi-squared sum.
(t) Return to the sum you calculated in part (k). Which cell comparisons provide the largest standardized discrepancies between observed and expected counts?
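As a sketch of how the individual terms might be inspected, the code below lists each cell's contribution \((\text{observed} - \text{expected})^2 / \text{expected}\) from largest to smallest. The function name `cell_contributions` and the three-judge counts are hypothetical and are not the court records.

```python
def cell_contributions(women, totals):
    """Return (judge number, sex, contribution) for every cell, largest contribution first."""
    p_hat = sum(women) / sum(totals)  # pooled proportion of women
    cells = []
    for judge, (w, n) in enumerate(zip(women, totals), start=1):
        exp_w, exp_m = n * p_hat, n * (1 - p_hat)
        cells.append((judge, "women", (w - exp_w) ** 2 / exp_w))
        cells.append((judge, "men", ((n - w) - exp_m) ** 2 / exp_m))
    return sorted(cells, key=lambda cell: cell[2], reverse=True)

# Hypothetical counts for three judges (NOT the court records).
ranked = cell_contributions([15, 12, 8], [40, 50, 60])
for judge, sex, contribution in ranked:
    print(judge, sex, round(contribution, 3))
```

The contributions sum to the chi-squared statistic, so the cells at the top of the list are the ones driving a large value. Comparing observed with expected in those cells shows the direction of the discrepancy.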
In this study, we modeled the jury-panel selections as independent binomial processes and compared seven sample proportions at once. One judge clearly stood out. If the judges truly had the same long-run probability of selecting women, differences as large as those observed would be extremely unlikely by chance alone. Thus the data provide strong evidence that the long-run probability of a juror being female was not the same across all seven judges. The largest contributions to the chi-squared statistic came from Judge 7, with substantially fewer women and more men than expected. This was Judge Ford, assigned to Dr. Spock’s case.
Because these data are observational and not generated by a true random mechanism, a cause-and-effect conclusion is not warranted. Still, the p-value quantifies how surprising these outcomes would be under the equal-probability model.
The chi-squared distribution approximates the sampling distribution of the chi-squared statistic when data arise from independent binomial random variables. This approximation is generally considered valid when all expected counts are at least 1 and at least 80% of expected counts are at least 5.
The data should come from independent random samples or from a randomized comparative experiment. This procedure is often called a chi-squared test of homogeneity.