Section 27.1 Section 4: Inference for Regression
In the previous section you described sample relationships between two quantitative variables. In this section you move to inference: what can you conclude about the larger population, and is an observed association statistically significant?
Section 27.1.1 Investigation 5.10: Running Out of Time
Exercises The Study
Data from a local 5K race include age and finish time for 248 runners. The goal is to assess whether finish time changes with age.
1. Identify Variables.
(a) Identify explanatory and response variables.
2. Conjecture Direction and Form.
(b) Predict direction, linearity, and likely significance of association.
3. Scatterplot and Data Features.
(c) Plot finish time vs age and discuss unusual features/outliers.
4. Refit Without Outlier.
(d) Remove outlier and report regression equation and slope interpretation.
5. Sampling-Variability Question.
(e) Could observed slope arise by chance if no population association exists? Outline investigation approach.
Terminology Detour.
Use
\(b_0, b_1\) for sample intercept/slope and
\(\beta_0, \beta_1\) for population parameters.
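To keep the two sets of symbols apart, it may help to write the sample line and the population line side by side. The notation \(\hat{y}\) for a predicted response and \(E(Y \mid x)\) for the population mean response at \(x\) is an assumption here; the detour itself only names the coefficients.

```latex
\[
\text{sample (varies from sample to sample):}\quad \hat{y} = b_0 + b_1 x
\]
\[
\text{population (fixed but unknown):}\quad E(Y \mid x) = \beta_0 + \beta_1 x
\]
```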
6. Hypotheses for Slope.
(f) State null and alternative hypotheses for testing whether finish time changes with age.
7. Compute Sample Summaries.
(g) Obtain means, SDs, and regression residual SD from data.
8. Design Null Population.
(h) Build a null population with
\(\beta_1=0\) and matching sample characteristics; draw one sample and compare slopes.
9. Second Sample and Variability.
(i) Draw another sample; compare regression line and discuss sampling variability.
10. Many Sample Lines.
(j) Draw many sample lines and describe variation pattern.
11. Null Distribution of Slopes.
(k) Describe shape, center, and SD of null slope distribution.
12. Expected Center.
(l) Is center as expected under null? Explain.
13. Change Sigma.
(m) Reduce sigma and describe population scatterplot change.
14. Predict Effect of Sigma Change.
(n) Predict effect on slope distribution.
15. Check Sigma Prediction.
(o) Simulate and evaluate prediction from part (n).
16. Change SD(X).
(p) Modify SD(X); predict effect on slope and intercept sampling distributions.
17. Check SD(X) Prediction.
(q) Simulate and evaluate predictions from part (p).
18. Change Sample Size.
(r) Reduce sample size and predict effect on slope distribution.
19. Check Sample Size Prediction.
(s) Simulate and evaluate prediction from part (r).
20. Summarize Effects.
(t) Summarize effects of sigma, SD(X), and n on slope sampling distribution.
21. Intuition.
(u) Explain intuitively why these effects occur.
22. Formula Consistency.
(v) Compare simulation behavior to formula for SD of slope estimator.
23. Observed Slope Under Null.
(w) Locate observed slope in null distribution and estimate p-value.
24. Alternative Null Slope.
(x) Repeat simulation for a nonzero null slope and assess plausibility.
Study Conclusions.
These data provide strong evidence that finish time and age are associated in this population of runners: a sample slope as large as the one observed would be highly unlikely if there were no association in the population.
25. Estimate Sigma.
(y) Suggest how to estimate population residual SD from sample data.
26. Standardize Slope.
(z) Compute standardized slope statistic.
27. Technology Confirmation.
(aa) Confirm t-statistic and p-value with technology output.
28. Practice Problem 5.10A.
Repeat analysis without removing outlier and compare results.
29. Practice Problem 5.10B.
Compare unofficial and official race files and reassess evidence strength.
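The sampling experiments in this investigation (changing sigma, SD(X), and n under a null population with \(\beta_1 = 0\)) can be sketched in a few lines of code. The following is a minimal simulation, not the applet's method: it assumes normally distributed ages and finish times, and every numeric setting is hypothetical rather than taken from the race data. It compares the simulated SD of the sample slopes with the approximation \(\sigma / (SD(X)\sqrt{n-1})\), which uses the population SD of x in place of the sample SD and so should agree only roughly.

```python
import math
import random
import statistics

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line for ys on xs."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def null_slopes(n, mu_x, sd_x, mu_y, sigma, reps, rng):
    """Simulate `reps` sample slopes from a population with beta_1 = 0."""
    slopes = []
    for _ in range(reps):
        xs = [rng.gauss(mu_x, sd_x) for _ in range(n)]
        # Under the null, y does not depend on x: y = mu_y + noise.
        ys = [rng.gauss(mu_y, sigma) for _ in range(n)]
        slopes.append(least_squares(xs, ys)[1])
    return slopes

rng = random.Random(1)
# Hypothetical settings (NOT the race data): label -> (n, SD(X), sigma).
settings = {
    "baseline": (50, 6.0, 8.0),
    "smaller sigma": (50, 6.0, 4.0),
    "larger SD(X)": (50, 12.0, 8.0),
    "smaller n": (15, 6.0, 8.0),
}
results = {}
for label, (n, sd_x, sigma) in settings.items():
    slopes = null_slopes(n, 40.0, sd_x, 30.0, sigma, 2000, rng)
    approx = sigma / (sd_x * math.sqrt(n - 1))  # formula-based approximation
    results[label] = (statistics.fmean(slopes), statistics.stdev(slopes), approx)
    print(f"{label:14s} mean={results[label][0]: .4f} "
          f"SD={results[label][1]:.4f} formula={approx:.4f}")
```

In each setting the mean of the simulated slopes should sit near 0, and the SD should shrink when sigma decreases, when SD(X) increases, or when n increases.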
Section 27.1.2 Investigation 5.11: Running Out of Time (cont.)
Exercises The Study
This investigation uses a randomization (shuffle) test for regression slope significance, conditioning on observed data.
1. Prepare Data.
(a) Load race data, remove outlier, and confirm regression equation.
2. Single Shuffle.
(b) Shuffle y-values once and compare resulting regression line strength to observed.
3. Repeated Shuffles.
(c) Shuffle repeatedly and describe emerging line patterns.
4. Null Slope Distribution.
(d) Draw many shuffles and describe null distribution of slopes; compare to Investigation 5.10 sampling approach.
5. Empirical p-value.
(e) Estimate the p-value by counting shuffled slopes as extreme as or more extreme than the observed slope.
6. Standardize Observed Slope.
(f) Standardize observed slope using shuffled SD.
7. t-statistic Comparison.
(g) Compare shuffled t-based p-value with Investigation 5.10 result.
8. Overlay t Distribution.
(h) Assess t-distribution overlay fit for shuffled t-statistics.
9. Regression Table Comparison.
(i) Compare applet regression-table SE to shuffled SD behavior.
Discussion.
Random sampling and random shuffling usually yield similar inferential conclusions, but shuffling conditions on observed x and y values and can be more flexible when population-model assumptions are uncertain.
10. Practice Problem 5.11.
Repeat this comparison using height and foot-length data from Investigation 5.8.
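The shuffle test used in this investigation can be sketched as follows. This is an illustrative version under stated assumptions, not the applet's implementation: the (age, finish time) pairs are simulated stand-ins rather than the race file, the p-value is two-sided, and the helper names `slope` and `shuffle_test` are hypothetical.

```python
import random
import statistics

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx

def shuffle_test(xs, ys, reps, rng):
    """Return (observed slope, shuffled slopes, two-sided empirical p-value)."""
    observed = slope(xs, ys)
    shuffled = []
    ys_perm = list(ys)
    for _ in range(reps):
        rng.shuffle(ys_perm)  # break any x-y link but keep both sets of values
        shuffled.append(slope(xs, ys_perm))
    extreme = sum(abs(b) >= abs(observed) for b in shuffled)
    return observed, shuffled, extreme / reps

rng = random.Random(2)
# Hypothetical data standing in for (age, finish time); not the race file.
ages = [rng.uniform(15, 70) for _ in range(60)]
times = [22 + 0.15 * a + rng.gauss(0, 5) for a in ages]

observed, shuffled, p_value = shuffle_test(ages, times, 1000, rng)
z = observed / statistics.stdev(shuffled)  # standardized using the shuffled SD
print(f"observed slope = {observed:.3f}")
print(f"shuffled slopes: mean = {statistics.fmean(shuffled):.3f}, "
      f"SD = {statistics.stdev(shuffled):.3f}")
print(f"empirical p-value = {p_value:.3f}, standardized slope = {z:.2f}")
```

Because shuffling reuses the observed x and y values, the null distribution is centered near 0 by construction; only the pairing changes from one shuffle to the next.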
Section 27.1.3 Investigation 5.12: Boys' Heights
Exercises The Study
Data in
hypoHt.txt model height observations for boys at ages 2, 3, and 4. This investigation develops regression-model conditions.
1. Identify Variables.
(a) Identify explanatory and response variables.
2. Within vs Between Variability.
(b) Compare within-age variability to differences in age-group means.
3. No-Association Plausibility.
(c) Could a large sample slope/correlation arise by chance if population association were absent?
4. Simulation Strategy.
(d) Describe how to simulate under no-association to assess plausibility.
Basic Regression Model Conditions.
The mean response is linear in x, response SD is constant across x, and response is normally distributed at each x value.
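These three conditions can be collected into a single model statement. The symbol \(\varepsilon\) for the random deviation is an assumption introduced here, and the second argument of Normal is a standard deviation, not a variance.

```latex
\[
Y \mid x \;\sim\; \text{Normal}\bigl(\beta_0 + \beta_1 x,\ \sigma\bigr)
\qquad\text{or equivalently}\qquad
Y = \beta_0 + \beta_1 x + \varepsilon,\quad \varepsilon \sim \text{Normal}(0,\ \sigma)
\]
```

Here the same \(\sigma\) applies at every value of \(x\): the line \(\beta_0 + \beta_1 x\) carries the linearity condition, the common \(\sigma\) the equal-SD condition, and the Normal shape the normality condition.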
5. Check by Group.
(e) Compare shape, means, and SDs by age category to assess model assumptions.
6. Condition Assessment.
(f) Based on grouped views, comment on whether conditions appear met.
When repeated x values are sparse, residual plots become primary diagnostics.
7. Residual Plots and Diagnostics.
(g) Use residual-vs-x and residual normality plots to evaluate linearity, equal variance, and normality.
8. Condition Summary.
(h) Summarize whether each model condition appears satisfied.
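When x takes only a few repeated values, as with ages 2, 3, and 4, the condition checks in this investigation reduce to comparing group summaries and residuals. A minimal sketch follows, using simulated heights rather than hypoHt.txt; the group means, common SD, and group size are all hypothetical.

```python
import random
import statistics

rng = random.Random(3)
# Hypothetical heights (inches) for boys aged 2, 3 and 4; NOT the hypoHt.txt data.
ages, heights = [], []
for age, mean_height in [(2, 35.0), (3, 38.0), (4, 41.0)]:
    for _ in range(40):
        ages.append(age)
        heights.append(rng.gauss(mean_height, 1.5))

# Least-squares fit of height on age.
x_bar, y_bar = statistics.fmean(ages), statistics.fmean(heights)
sxx = sum((x - x_bar) ** 2 for x in ages)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, heights)) / sxx
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1 * x) for x, y in zip(ages, heights)]

# Per-age summaries: (mean height, SD of heights, mean residual).
summary = {}
for age in sorted(set(ages)):
    ys = [y for x, y in zip(ages, heights) if x == age]
    rs = [r for x, r in zip(ages, residuals) if x == age]
    summary[age] = (statistics.fmean(ys), statistics.stdev(ys), statistics.fmean(rs))
    print(f"age {age}: mean={summary[age][0]:.2f} "
          f"SD={summary[age][1]:.2f} mean residual={summary[age][2]: .3f}")
print(f"fitted line: height = {b0:.2f} + {b1:.2f} * age")
```

Group mean residuals near 0 are consistent with linearity, and similar group SDs are consistent with equal variance. Normality still needs a histogram or normal probability plot of the residuals, which this sketch does not draw.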
Section 27.1.4 Investigation 5.13: Cat Jumping (cont.)
Exercises The Study
Revisit Cat Jumping data and perform regression inference for takeoff velocity vs percent body fat.
1. Fit Regression Equation.
(a) Determine least-squares line for predicting velocity from percent body fat.
2. Normality of Residuals.
(b) Produce histogram and normal probability plot of residuals; assess normality.
3. Equal Variance and Linearity.
(c) Use residual plot vs explanatory variable (or predicted values) to assess equal variance and linearity.
4. Test for Negative Association.
(d) State hypotheses, report test statistic and p-value, and conclude.
5. Confirm t Formula.
(e) Verify t-statistic equals
\(b_1/SE(b_1)\) for slope test against zero.
6. 95% CI for Slope.
(f) Compute 95% confidence interval for population slope by hand.
7. Interpret Slope Interval.
(g) Interpret interval in context.
8. Point Predictions.
(h) Predict velocity at 25% and 50% body fat.
9. Precision Comparison.
(i) Which prediction is more reliable and why?
Technology Detour: Prediction Intervals.
R: use
predict(lm(...), newdata=..., interval="prediction").
JMP: Linear Fit options for individual confidence interval formulas.
10. Prediction Interval at 25%.
(j) Report and interpret 95% prediction interval at 25% body fat; identify midpoint.
11. Prediction Interval at 50%.
(k) Repeat at 50% and compare interval widths.
Definition.
Prediction intervals are for individual responses; confidence intervals at x are for mean response.
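The difference in width between the two intervals shows up directly in their standard formulas. The symbols are assumptions about notation not fixed by this page: \(x^*\) is the x value of interest, \(s\) the residual SD, \(s_x\) and \(\bar{x}\) the sample SD and mean of the explanatory variable, and \(t^*\) a critical value with \(n-2\) degrees of freedom.

```latex
\[
\text{Confidence interval for the mean response at } x^*:\quad
\hat{y} \pm t^* \, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1)\,s_x^2}}
\]
\[
\text{Prediction interval for an individual response at } x^*:\quad
\hat{y} \pm t^* \, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1)\,s_x^2}}
\]
```

The extra 1 under the square root accounts for the variability of an individual around the mean, so the prediction interval is always wider. Both intervals widen as \(x^*\) moves away from \(\bar{x}\).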
Technology Detour: Confidence Intervals.
R: use
interval="confidence".
JMP: Linear Fit options for mean confidence interval formulas.
12. Confidence Interval at 25%.
(l) Report and interpret 95% confidence interval for mean velocity at 25% body fat; identify midpoint.
13. Compare PI and CI.
(m) Compare confidence and prediction intervals and explain width difference.
Study Conclusions.
With model conditions reasonably met, t-based slope inference and interval procedures provide interpretable uncertainty for both average and individual predictions.
14. Practice Problem 5.13.
Apply these interval methods to Talley 5K data for specific runner ages and evaluate validity via residual plots.
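The by-hand calculations in this investigation (t-statistic, slope interval, and intervals at a given x) can be checked with a short script. This is a sketch under stated assumptions: the body-fat and velocity values are simulated, not the cat data; the helper names `fit` and `interval` are hypothetical; and the critical value 2.101 is the tabled 95% value for 18 degrees of freedom, which matches the simulated n = 20 and would change with the real sample size.

```python
import math
import random
import statistics

def fit(xs, ys):
    """Least-squares fit; returns the summaries needed for inference."""
    n = len(xs)
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # residual SD, the estimate of sigma
    return {"n": n, "x_bar": x_bar, "sxx": sxx, "b0": b0, "b1": b1,
            "s": s, "se_b1": s / math.sqrt(sxx)}

def interval(model, x_star, t_star, individual):
    """Prediction interval if `individual` is True, else CI for the mean response."""
    centre = model["b0"] + model["b1"] * x_star
    extra = 1.0 if individual else 0.0
    se = model["s"] * math.sqrt(
        extra + 1 / model["n"] + (x_star - model["x_bar"]) ** 2 / model["sxx"])
    return centre - t_star * se, centre + t_star * se

rng = random.Random(4)
# Hypothetical (percent body fat, takeoff velocity) pairs; NOT the cat data.
fat = [rng.uniform(10, 45) for _ in range(20)]
velocity = [400 - 2.0 * f + rng.gauss(0, 20) for f in fat]

m = fit(fat, velocity)
T_STAR = 2.101  # 95% critical value for 18 df, taken from a t table
t_stat = m["b1"] / m["se_b1"]
slope_ci = (m["b1"] - T_STAR * m["se_b1"], m["b1"] + T_STAR * m["se_b1"])
ci_25 = interval(m, 25, T_STAR, individual=False)
pi_25 = interval(m, 25, T_STAR, individual=True)
pi_50 = interval(m, 50, T_STAR, individual=True)

print(f"slope = {m['b1']:.3f}, SE = {m['se_b1']:.3f}, t = {t_stat:.2f}")
print(f"95% CI for slope: ({slope_ci[0]:.3f}, {slope_ci[1]:.3f})")
print(f"95% CI for mean at 25%: ({ci_25[0]:.1f}, {ci_25[1]:.1f})")
print(f"95% PI at 25%: ({pi_25[0]:.1f}, {pi_25[1]:.1f})")
print(f"95% PI at 50%: ({pi_50[0]:.1f}, {pi_50[1]:.1f})")
```

The two intervals at 25% share a midpoint (the point prediction) but differ in width. The interval at 50% is wider again because 50% lies outside the simulated body-fat range of 10% to 45%, so it is an extrapolation as well as being far from the mean.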
Section 27.1.5 Investigation 5.14: Housing Prices
Exercises The Study
Students collected a stratified sample of California home sales (
housing.txt) to model selling price from square footage.
1. Initial Linear Fit.
(a) Determine least-squares line and interpret
\(R^2\text{.}\)
2. Model Plausibility.
(b) Based on scatterplot, does basic linear model seem appropriate?
3. Residual Normality.
(c) Use residual histogram/normal plot to assess normality.
4. Residual Pattern Checks.
(d) Plot residuals vs square footage and assess curvature/equal variance.
If model conditions are violated, transformations can improve fit.
5. Log Transformation.
(e) Transform both variables with log base 10, refit, and reassess residual diagnostics.
6. Interpret Transformed Slope.
(f) Interpret transformed-model slope in context.
7. Back-Transformed Prediction.
(g) Predict price for a 3000 sq ft house and back-transform to original dollars.
Discussion.
Transformations often improve normality and variance stability, but interpretation must account for transformed scale and often requires back-transformation.
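For the log-log model in this investigation, back-transformation can be written out explicitly. Here \(b_0\) and \(b_1\) denote the intercept and slope fitted on the \(\log_{10}\) scale; the variable names price and sqft are shorthand assumed for this display.

```latex
\[
\log_{10}\bigl(\widehat{\text{price}}\bigr) = b_0 + b_1 \log_{10}(\text{sqft})
\quad\Longrightarrow\quad
\widehat{\text{price}} = 10^{\,b_0}\,(\text{sqft})^{\,b_1}
\]
```

On this scale, multiplying square footage by a factor \(k\) multiplies the predicted price by \(k^{b_1}\); for example, a tenfold increase in size multiplies the predicted price by \(10^{b_1}\). The back-transformed value is often interpreted as a predicted median price rather than a mean.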
8. Practice Problem 5.14.
Analyze
walmart.txt with square-root and log transforms, select better model, predict for 2003, and report a 95% prediction interval.
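The transform, refit, predict, and back-transform sequence can be sketched as follows. The data are simulated from a hypothetical power-law trend with multiplicative noise, so neither the coefficients nor the predicted price reflect housing.txt or walmart.txt; the helper `least_squares` is likewise only illustrative.

```python
import math
import random
import statistics

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line for ys on xs."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope

rng = random.Random(5)
# Hypothetical (square feet, price) pairs; NOT the housing.txt sample.
sqft = [rng.uniform(800, 5000) for _ in range(80)]
price = [300 * s ** 0.9 * 10 ** rng.gauss(0, 0.1) for s in sqft]

# Transform both variables with log base 10 and refit.
log_sqft = [math.log10(s) for s in sqft]
log_price = [math.log10(p) for p in price]
b0, b1 = least_squares(log_sqft, log_price)

# Predict on the log scale, then back-transform to dollars.
log_pred = b0 + b1 * math.log10(3000)
pred_dollars = 10 ** log_pred

print(f"log10(price) = {b0:.3f} + {b1:.3f} * log10(sqft)")
print(f"predicted log10(price) at 3000 sq ft = {log_pred:.3f}")
print(f"back-transformed prediction = ${pred_dollars:,.0f}")
```

The same pattern applies to a prediction interval: compute the endpoints on the transformed scale first, then back-transform each endpoint.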
Section 27.1.6 Technology Exploration: The Regression Effect
Exercises The Study
Using
Masters18.txt, examine first-round vs second-round golf scores to study regression to the mean.
1. Scatterplot and Correlation.
(a) Plot round 2 vs round 1 and describe association direction and strength.
2. Sort and List Extremes.
(b) Sort by round-1 score and list top ten and bottom ten with round-2 scores.
3. Improvement Counts.
(c) Count improvements in each group.
4. Compare Improvement Rates.
(d) Which group improved more and by how much?
5. Median Round-2 Scores.
(e) Compare median round-2 scores for top-ten and bottom-ten first-round groups.
6. y=x Line Interpretation.
(f) Add
\(y=x\) line and identify examples of improvements and declines.
7. Compare Candidate Lines.
(g) Evaluate whether
\(y=x\) is a good summary and predict regression slope relative to 1.
8. Mean Line Comparison.
(h) Add
\(\bar{y}\) line and compare with expected regression line.
9. Add Regression Line.
(i) Add regression line and describe location relative to previous lines.
10. Regression Equation and Slope.
(j) Report equation and explain why slope tends to be less than 1 when SDs are similar.
11. Regression to Mean Interpretation.
(k) Explain why slope less than 1 implies very good first rounds tend to worsen and poor first rounds tend to improve.
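The regression effect follows from the identity \(b_1 = r\, s_y / s_x\): when the two SDs are similar, the slope is close to \(r\), which is less than 1 in absolute value. The simulation below illustrates this with hypothetical golf scores, not Masters18.txt. It assumes each score is a player's typical score plus independent round-to-round noise, and all numeric settings are invented for illustration.

```python
import random
import statistics

rng = random.Random(6)
# Hypothetical scores: typical score + round-specific noise (NOT Masters18.txt).
n = 80
skill = [rng.gauss(73, 2) for _ in range(n)]
round1 = [s + rng.gauss(0, 2.5) for s in skill]
round2 = [s + rng.gauss(0, 2.5) for s in skill]

# Least-squares slope for predicting round 2 from round 1.
x_bar, y_bar = statistics.fmean(round1), statistics.fmean(round2)
sxx = sum((x - x_bar) ** 2 for x in round1)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(round1, round2))
slope = sxy / sxx
# Recover r from the identity b1 = r * s_y / s_x.
r = slope * statistics.stdev(round1) / statistics.stdev(round2)

# Lower scores are better in golf, so the "top ten" have the ten lowest round-1 scores.
order = sorted(range(n), key=lambda i: round1[i])
best, worst = order[:10], order[-10:]
change_best = statistics.fmean(round2[i] - round1[i] for i in best)
change_worst = statistics.fmean(round2[i] - round1[i] for i in worst)

print(f"slope = {slope:.3f}, r = {r:.3f}")
print(f"mean change (round 2 - round 1), ten best first rounds:  {change_best:+.2f}")
print(f"mean change (round 2 - round 1), ten worst first rounds: {change_worst:+.2f}")
```

A positive mean change for the best first-round group means their scores rose (got worse), while a negative change for the worst group means they improved. In this simulation that happens even though no player's underlying skill changed between rounds.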
Section 27.1.7 Section 5.4 Summary
This section extended regression to inference for slopes, coefficients, and predictions.
Summary of Inference for Regression.
To test slope hypotheses, use
\(t=(b_1-\beta_{1,0})/SE(b_1)\) with
\(n-2\) degrees of freedom.
Confidence intervals for slope use
\(b_1 \pm t^* SE(b_1)\text{.}\)
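Both formulas depend on \(SE(b_1)\), which this summary does not spell out. The standard expression is given here for reference, with \(s\) the residual SD and \(s_x\) the sample SD of the explanatory variable (notation assumed):

```latex
\[
SE(b_1) = \frac{s}{s_x \sqrt{n-1}},
\qquad
s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}}
\]
```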
Primary model-utility test is usually
\(H_0: \beta_1=0\) versus an alternative indicating association.
Validity conditions (often checked with residual plots) are:
Linearity of the mean response in x
Equal SD of the response across x
Normality of residuals/conditional responses
When conditions are not fully met, transforming one or both variables can improve the fit, and conclusions should be stated more cautiously.