Section 4: Inference for Regression

These data provide strong evidence that finish time and age are associated in this population of runners: a sample slope as large as the one observed would be highly unlikely under a no-association model.

25. Estimate Sigma.

(y) Suggest how to estimate population residual SD from sample data.

26. Standardize Slope.

(z) Compute standardized slope statistic.

27. Technology Confirmation.

(aa) Confirm t-statistic and p-value with technology output.

28. Practice Problem 5.10A.

Repeat analysis without removing outlier and compare results.

29. Practice Problem 5.10B.

Compare unofficial and official race files and reassess evidence strength.

Section 27.1.2 Investigation 5.11: Running Out of Time (cont.)

Exercises The Study

This investigation uses a randomization (shuffle) test for regression slope significance, conditioning on observed data.

1. Prepare Data.

(a) Load race data, remove outlier, and confirm regression equation.

2. Single Shuffle.

(b) Shuffle y-values once and compare resulting regression line strength to observed.

3. Repeated Shuffles.

(c) Shuffle repeatedly and describe emerging line patterns.

4. Null Slope Distribution.

(d) Draw many shuffles and describe null distribution of slopes; compare to Investigation 5.10 sampling approach.

5. Empirical p-value.

(e) Estimate the p-value by counting the shuffled slopes that are at least as extreme as the observed slope.

6. Standardize Observed Slope.

(f) Standardize observed slope using shuffled SD.

7. t-statistic Comparison.

(g) Compare shuffled t-based p-value with Investigation 5.10 result.

8. Overlay t Distribution.

(h) Assess t-distribution overlay fit for shuffled t-statistics.

9. Regression Table Comparison.

(i) Compare applet regression-table SE to shuffled SD behavior.

Discussion.

Random sampling and random shuffling usually yield similar inferential conclusions, but shuffling conditions on observed x and y values and can be more flexible when population-model assumptions are uncertain.
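
The shuffling procedure used in this investigation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the applet's implementation: the function names `ls_slope` and `shuffle_test`, the number of repetitions, and the small data set are hypothetical stand-ins for the race file.

```python
import random
from statistics import mean


def ls_slope(x, y):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def shuffle_test(x, y, reps=2000, seed=1):
    """Two-sided randomization p-value for the slope.

    The x-values stay fixed; the y-values are shuffled, the slope is refit,
    and shuffled slopes at least as extreme as the observed one are counted.
    """
    rng = random.Random(seed)
    observed = ls_slope(x, y)
    shuffled_y = list(y)  # copy so the caller's data are not reordered
    null_slopes = []
    for _ in range(reps):
        rng.shuffle(shuffled_y)
        null_slopes.append(ls_slope(x, shuffled_y))
    extreme = sum(abs(b) >= abs(observed) for b in null_slopes)
    return observed, extreme / reps, null_slopes


# Hypothetical data (not the race file): ages in years, finish times in minutes.
ages = [22, 25, 31, 34, 38, 41, 45, 52, 58, 63]
times = [24.1, 22.8, 26.0, 25.2, 27.9, 26.5, 29.3, 30.1, 33.4, 32.0]
slope, p_value, null_slopes = shuffle_test(ages, times)
print(f"observed slope = {slope:.3f}, empirical p-value = {p_value:.4f}")
```

Because only the pairing of x and y changes from shuffle to shuffle, every shuffled data set has exactly the same x-values and y-values as the observed data, which is what conditioning on the observed data means here.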

10. Practice Problem 5.11.

Repeat this comparison using height and foot-length data from Investigation 5.8.

Section 27.1.3 Investigation 5.12: Boys’ Heights

Exercises The Study

Data in hypoHt.txt model height observations for boys at ages 2, 3, and 4. This investigation develops regression-model conditions.

1. Identify Variables.

(a) Identify explanatory and response variables.

2. Within vs Between Variability.

(b) Compare within-age variability to differences in age-group means.

3. No-Association Plausibility.

(c) Could a large sample slope/correlation arise by chance if population association were absent?

4. Simulation Strategy.

(d) Describe how to simulate under no-association to assess plausibility.

Basic Regression Model Conditions.

The mean response is linear in x, response SD is constant across x, and response is normally distributed at each x value.
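
These three conditions are often collected into a single model statement. The version below is a standard textbook formulation rather than a quotation from this investigation, and the symbols \(\beta_0\text{,}\) \(\varepsilon_i\text{,}\) and \(\sigma\) are assumed notation:

\[
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma) \text{ independently for } i = 1, \dots, n
\]

Here \(\beta_0 + \beta_1 x_i\) is the mean response at \(x_i\) (linearity), \(\sigma\) is the common standard deviation of the responses at every x value (constant SD), and the normal distribution of \(\varepsilon_i\) gives the normality condition.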

5. Check by Group.

(e) Compare shape, means, and SDs by age category to assess model assumptions.

6. Condition Assessment.

(f) Based on grouped views, comment on whether conditions appear met.

When repeated x values are sparse, residual plots become primary diagnostics.

7. Residual Plots and Diagnostics.

(g) Use residual-vs-x and residual normality plots to evaluate linearity, equal variance, and normality.

8. Condition Summary.

(h) Summarize whether each model condition appears satisfied.

Section 27.1.4 Investigation 5.13: Cat Jumping (cont.)

Exercises The Study

Revisit Cat Jumping data and perform regression inference for takeoff velocity vs percent body fat.

1. Fit Regression Equation.

(a) Determine least-squares line for predicting velocity from percent body fat.

2. Normality of Residuals.

(b) Produce histogram and normal probability plot of residuals; assess normality.

3. Equal Variance and Linearity.

(c) Use residual plot vs explanatory variable (or predicted values) to assess equal variance and linearity.

4. Test for Negative Association.

(d) State hypotheses, report test statistic and p-value, and conclude.

5. Confirm t Formula.

(e) Verify t-statistic equals \(b_1/SE(b_1)\) for slope test against zero.

6. 95% CI for Slope.

(f) Compute 95% confidence interval for population slope by hand.

7. Interpret Slope Interval.

(g) Interpret interval in context.

8. Point Predictions.

(h) Predict velocity at 25% and 50% body fat.

9. Precision Comparison.

(i) Which prediction is more reliable and why?

Technology Detour: Prediction Intervals.

R: use predict(lm(...), newdata=..., interval="prediction").

JMP: Linear Fit options for individual confidence interval formulas.

10. Prediction Interval at 25%.

(j) Report and interpret 95% prediction interval at 25% body fat; identify midpoint.

11. Prediction Interval at 50%.

(k) Repeat at 50% and compare interval widths.

Definition.

A prediction interval at a given x estimates an individual response at that x; a confidence interval at the same x estimates the mean response.
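
The difference between the two interval types can be seen in their half-widths. The sketch below uses the usual simple-regression formulas, in which the prediction interval carries an extra "1 +" term for the variability of a single response. The function name `interval_half_widths` and the data are hypothetical (they are not the Cat Jumping file), and the critical value `t_star` is passed in by hand rather than computed.

```python
import math
from statistics import mean


def interval_half_widths(x, y, x_new, t_star):
    """Return (prediction, ci_half, pi_half) at x_new for simple regression.

    t_star is the t critical value with n - 2 degrees of freedom,
    supplied by the caller (for example from a table).
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # residual standard error
    distance_term = 1 / n + (x_new - x_bar) ** 2 / sxx
    ci_half = t_star * s * math.sqrt(distance_term)      # mean response
    pi_half = t_star * s * math.sqrt(1 + distance_term)  # individual response
    return b0 + b1 * x_new, ci_half, pi_half


# Hypothetical data: percent body fat and takeoff velocity (cm/s).
fat = [15, 18, 22, 25, 28, 30, 33, 36, 40, 44]
velocity = [370, 362, 355, 349, 340, 343, 331, 328, 320, 311]
T_STAR = 2.306  # 95% confidence, 10 - 2 = 8 degrees of freedom

for x_new in (25, 50):
    pred, ci_half, pi_half = interval_half_widths(fat, velocity, x_new, T_STAR)
    print(f"x = {x_new}: prediction {pred:.1f}, CI +/- {ci_half:.1f}, PI +/- {pi_half:.1f}")
```

Both intervals share the same midpoint, the predicted value on the fitted line. Both also widen as x moves away from the mean of the observed x-values, which is one reason a prediction at 50% body fat is less precise than one at 25% in data like these.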

Technology Detour: Confidence Intervals.

R: use interval="confidence".

JMP: Linear Fit options for mean confidence interval formulas.

12. Confidence Interval at 25%.

(l) Report and interpret 95% confidence interval for mean velocity at 25% body fat; identify midpoint.

13. Compare PI and CI.

(m) Compare confidence and prediction intervals and explain width difference.

Study Conclusions.

With model conditions reasonably met, t-based slope inference and interval procedures provide interpretable uncertainty for both average and individual predictions.

14. Practice Problem 5.13.

Apply these interval methods to Talley 5K data for specific runner ages and evaluate validity via residual plots.

Section 27.1.5 Investigation 5.14: Housing Prices

Exercises The Study

Students collected a stratified sample of California home sales (housing.txt) to model selling price from square footage.

1. Initial Linear Fit.

(a) Determine least-squares line and interpret \(R^2\).

2. Model Plausibility.

(b) Based on scatterplot, does basic linear model seem appropriate?

3. Residual Normality.

(c) Use residual histogram/normal plot to assess normality.

4. Residual Pattern Checks.

(d) Plot residuals vs square footage and assess curvature/equal variance.

If model conditions are violated, transformations can improve fit.

5. Log Transformation.

(e) Transform both variables with log base 10, refit, and reassess residual diagnostics.

6. Interpret Transformed Slope.

(f) Interpret transformed-model slope in context.

7. Back-Transformed Prediction.

(g) Predict price for a 3000 sq ft house and back-transform to original dollars.

Discussion.

Transformations often improve normality and variance stability, but interpretation must account for transformed scale and often requires back-transformation.
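
As an illustration of fitting on the transformed scale and then back-transforming, the sketch below fits a line to log base 10 of both variables and converts a prediction back to dollars. It is a minimal example under stated assumptions: the function names `fit_line` and `predict_price_loglog` and the five data points are hypothetical and are not taken from housing.txt.

```python
import math
from statistics import mean


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


def predict_price_loglog(sqft, price, new_sqft):
    """Fit log10(price) on log10(sqft); return (slope, prediction in dollars)."""
    log_x = [math.log10(v) for v in sqft]
    log_y = [math.log10(v) for v in price]
    b0, b1 = fit_line(log_x, log_y)
    log_pred = b0 + b1 * math.log10(new_sqft)  # prediction on the log10 scale
    return b1, 10 ** log_pred                  # back-transform to dollars


# Hypothetical sales (not from housing.txt): square footage and price in dollars.
sqft = [950, 1400, 1850, 2600, 4100]
price = [310_000, 420_000, 515_000, 700_000, 1_050_000]
slope, predicted = predict_price_loglog(sqft, price, 3000)
print(f"log-log slope = {slope:.3f}, predicted price = ${predicted:,.0f}")
```

In a log-log model the slope describes multiplicative change: multiplying square footage by 10 multiplies the predicted price by \(10^{b_1}\), so the slope is not read as dollars per square foot.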

8. Practice Problem 5.14.

Analyze walmart.txt with square-root and log transforms, select better model, predict for 2003, and report a 95% prediction interval.

Section 27.1.6 Technology Exploration: The Regression Effect

Exercises The Study

Using Masters18.txt, examine first-round vs second-round golf scores to study regression to the mean.

1. Scatterplot and Correlation.

(a) Plot round 2 vs round 1 and describe association direction and strength.

2. Sort and List Extremes.

(b) Sort by round-1 score and list top ten and bottom ten with round-2 scores.

3. Improvement Counts.

(c) Count improvements in each group.

4. Compare Improvement Rates.

(d) Which group improved more and by how much?

5. Median Round-2 Scores.

(e) Compare median round-2 scores for top-ten and bottom-ten first-round groups.

6. y=x Line Interpretation.

(f) Add \(y=x\) line and identify examples of improvements and declines.

7. Compare Candidate Lines.

(g) Evaluate whether \(y=x\) is a good summary and predict regression slope relative to 1.

8. Mean Line Comparison.

(h) Add \(\bar{y}\) line and compare with expected regression line.

9. Add Regression Line.

(i) Add regression line and describe location relative to previous lines.

10. Regression Equation and Slope.

(j) Report equation and explain why slope tends to be less than 1 when SDs are similar.

11. Regression to Mean Interpretation.

(k) Explain why slope less than 1 implies very good first rounds tend to worsen and poor first rounds tend to improve.
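
The algebra behind the regression effect follows from the standard identity linking the least-squares slope to the correlation coefficient \(r\) and the two sample standard deviations \(s_x\) and \(s_y\) (these symbols are assumed notation, not taken from the exercise text):

\[
b_1 = r \, \frac{s_y}{s_x}, \qquad \hat{y} - \bar{y} = r \, \frac{s_y}{s_x} \, (x - \bar{x})
\]

When \(s_y \approx s_x\) and \(0 < r < 1\text{,}\) the slope is approximately \(r\text{,}\) so the predicted second-round score is closer to its mean than the first-round score was to its own mean. As a purely hypothetical illustration, with \(r = 0.5\) and equal SDs, a player 4 strokes better than the round-1 mean is predicted to be only 2 strokes better than the round-2 mean.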

Section 27.1.7 Section 5.4 Summary

This section extended regression to inference for slope coefficients and predictions.

Summary of Inference for Regression.

To test slope hypotheses, use \(t=(b_1-\beta_{1,0})/SE(b_1)\) with \(n-2\) degrees of freedom.

Confidence intervals for slope use \(b_1 \pm t^* SE(b_1)\).
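
The test statistic and interval above can be computed directly from the data. The sketch below is one way to do so, assuming the usual formulas \(s = \sqrt{SSE/(n-2)}\) and \(SE(b_1) = s/\sqrt{\sum (x_i - \bar{x})^2}\text{;}\) the function name `slope_inference`, its parameters, and the tiny data set are hypothetical, and the critical value `t_star` is supplied by hand.

```python
import math
from statistics import mean


def slope_inference(x, y, null_slope=0.0, t_star=None):
    """Slope estimate, its standard error, and the t-statistic (df = n - 2).

    If t_star (a t critical value with n - 2 df) is given, a confidence
    interval b1 +/- t_star * SE(b1) is included in the result.
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))   # estimate of the residual SD
    se_b1 = s / math.sqrt(sxx)     # standard error of the slope
    result = {
        "b1": b1,
        "s": s,
        "se_b1": se_b1,
        "t": (b1 - null_slope) / se_b1,
        "df": n - 2,
    }
    if t_star is not None:
        result["ci"] = (b1 - t_star * se_b1, b1 + t_star * se_b1)
    return result


# Tiny hypothetical data set; 3.182 is the 95% critical value for 3 df.
summary = slope_inference([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], t_star=3.182)
print(summary)
```

The formula for \(SE(b_1)\) shows why the slope is estimated more precisely when the residual SD is smaller, when the x-values are more spread out, or when the sample is larger.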

Primary model-utility test is usually \(H_0: \beta_1=0\) versus an alternative indicating association.

Validity conditions (often checked with residual plots) are:

Linearity
Independence
Normality of residuals/conditional responses
Equal variance

When conditions are imperfect, transformations and robust interpretation strategies can improve analysis quality.
