Section 26.1 Section 3: Relationships Between Quantitative Variables
In this section you will analyze data sets with two quantitative variables. The goal is to describe and model relationships between variables, beginning with graphical and numerical summaries and then moving to prediction with linear models.
Section 26.1.1 Investigation 5.6: Cat Jumping
Exercises The Study
Harris and Steudel (2002) investigated factors related to domestic cats' maximum takeoff velocity. Variables include body mass, hind limb length, muscle mass, percent body fat, and takeoff velocity.
1. Identify Units and Response.
(a) Identify observational units and the primary response variable. Classify variable type.
2. Univariate Summary of Velocity.
(b) Produce numerical and graphical summaries of takeoff velocity and describe shape, center, spread, and unusual observations.
3. Single-Value Prediction.
(c) If selecting a random domestic cat, what is your best prediction of takeoff velocity?
4. Body Mass Conjecture.
(d) Do you expect takeoff velocity to be related to body mass? Predict direction.
Technology Detour: Scatterplots.
Applet: Place velocity in response and body mass in explanatory.
R: plot(velocity ~ bodymass) or plot(bodymass, velocity).
JMP: Analyze > Fit Y by X.
5. Describe Body Mass Relationship.
(e) Describe the scatterplot association between body mass and takeoff velocity.
6. Identify Outlier Cats.
(f) Identify any outlier cats and describe how they differ in context.
Terminology Detour.
Describe scatterplots by direction, linearity, and strength. For response/explanatory settings, put response on the vertical axis and explanatory on the horizontal axis.
7. Percent Body Fat Relationship.
(g) Plot takeoff velocity vs. percent body fat. Compare strength and linearity to body mass.
8. Other Predictors.
(h) Predict and then evaluate associations of hind limb length and muscle mass with takeoff velocity.
9. Coded Scatterplot by Sex.
(i) Create a coded scatterplot by sex and describe any differences between male and female cats.
Study Conclusions.
The researchers found significant relationships of takeoff velocity with hind limb length and fat-mass ratio, but not with all measured muscle variables. Later in this chapter you will study inferential tools for assessing these relationships.
10. Practice Problem 5.6.
Using KYDerby25.txt, (a) produce a scatterplot of winning speed vs. year and (b) produce a coded scatterplot by track condition. Interpret each plot in context.
Section 26.1.2 Investigation 5.7: Drive for Show, Putt for Dough
Exercises The Study
This investigation compares how PGA scoring average relates to driving distance and putting average using golfers18.txt.
1. Direction Conjecture: Driving.
(a) Predict direction of association between scoring average and driving distance.
2. Direction Conjecture: Putting.
(b) Predict direction of association between scoring average and putting average.
3. Compare Scatterplots.
(c) Produce both scatterplots and compare direction, form, and strength.
4. Quadrant Reasoning.
(d) With mean lines shown, identify aligned vs non-aligned quadrants for both plots.
Definition: Correlation Coefficient.
The correlation coefficient \(r\) measures the strength and direction of a linear relationship and is unitless.
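One common way to compute \(r\) is as the sum of products of z-scores divided by \(n-1\). The Python sketch below follows that formula. The data values are made up for illustration (they are not from golfers18.txt), and the helper name `correlation` is hypothetical.

```python
from statistics import mean, stdev

def correlation(x, y):
    """Sample correlation: sum of products of z-scores, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical values, for illustration only
putts = [1.70, 1.74, 1.76, 1.79, 1.81]   # average putts per hole
score = [69.8, 70.1, 70.9, 70.6, 71.4]   # average strokes per round

r = correlation(putts, score)
# Re-express the scores as strokes per hole (divide by 18) and recompute
r_rescaled = correlation(putts, [s / 18 for s in score])
print(round(r, 3), round(r_rescaled, 3))
```

Each z-score divides a deviation by a standard deviation measured in the same units, so the units cancel. That is why the two printed values agree.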
5. Units of r.
(e) Use the formula to determine the units of \(r\) for putts vs strokes.
6. Interpret r=0.
(f) If correlation equals zero, what might the scatterplot look like?
7. Correlation with Itself.
(g) What is the correlation of a variable with itself?
8. Resistance.
(h) Is correlation resistant to outliers? Explain.
9. Rank Correlations from Plots.
(i) Rank provided scatterplots from strongest negative to strongest positive correlation.
10. Compute Correlations with Technology.
(j) Compute and record correlation coefficients for each plot pair using applet/R/JMP.
11. Range of r.
(k) State the smallest and largest possible values of \(r\text{.}\)
12. Sign of r.
(l) What sign of \(r\) corresponds to negative vs positive association?
13. Meaning of r near 0.
(m) What does \(r\) near zero signify?
14. Meaning of r near ±1.
(n) What does \(r\) close to 1 or -1 signify?
15. Cliche Check.
(o) Which variable has stronger correlation with scoring average: driving distance or putting average? Does this support the cliche?
Study Conclusions.
The putting relationship with scoring average is typically stronger than the driving relationship in these data. Correlation quantifies linear association and should always be interpreted alongside scatterplots.
16. Practice Problem 5.7A.
Reason about correlation when final exam scores are fixed transformations of midterm scores.
17. Practice Problem 5.7B.
Match house-price scatterplots to correlation values and justify.
Section 26.1.3 Applet Exploration: Correlation Guessing Game
Open the Guess the Correlation applet and repeatedly estimate the correlation from displayed scatterplots.
Exercises Exploration Tasks
1. First Guess.
(a) Enter your guess, check against actual value, and reflect on accuracy.
2. Second Guess.
(b) Repeat with a new sample and describe adjustments in your guessing strategy.
3. Ten Trials.
(c) Repeat for 10 scatterplots. Did your guessing improve?
4. Guess vs Actual.
(d) Use Track Performance: interpret Guess vs Actual plot and discuss whether high correlation implies good guessing.
5. Error vs Actual.
(e) Interpret Error vs Actual graph. Were some correlation values easier to guess?
6. Error vs Trial.
(f) Interpret Error vs Trial graph. Did ability improve over time?
7. Perfect Guesser Thought Experiment.
(g) If all guesses are exactly correct, what is correlation between guesses and actual values?
8. Systematic Offset Thought Experiment.
(h) If each guess is 0.2 too high, what is correlation between guesses and actual values?
9. Interpretation Caveat.
(i) Does correlation equal to 1 between guesses and actuals guarantee accurate guesses? Explain.
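The thought experiments above can be checked numerically. The following Python sketch uses made-up "actual" correlations and a hypothetical guesser who is always 0.2 too high. It is an illustration under those assumptions, not output from the applet.

```python
from statistics import mean, stdev

def correlation(x, y):
    """Sample correlation coefficient for paired data."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    return sum((xi - x_bar) * (yi - y_bar)
               for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

actual = [-0.8, -0.3, 0.1, 0.4, 0.7]   # hypothetical actual correlations
guess = [a + 0.2 for a in actual]      # every guess is 0.2 too high

r = correlation(guess, actual)
mean_abs_error = mean([abs(g - a) for g, a in zip(guess, actual)])
print(round(r, 6), round(mean_abs_error, 6))
```

The guesses fall exactly on a line with slope 1 and intercept 0.2, so the correlation with the actual values is perfect even though every guess misses by 0.2. Correlation measures how closely points follow a line, not whether that line is \(y=x\).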
Section 26.1.4 Investigation 5.8: Height and Foot Size
Exercises The Study
This investigation uses student height and foot-length data to build least-squares regression ideas from prediction error.
1. Identify Variables.
(a) Identify observational units, explanatory variable, and response variable.
2. Baseline Prediction.
(b) Predict height from the sample of heights alone with one number.
3. Prediction Error.
(c) Would that prediction always be correct?
Definition: Residual.
The residual for observation \(i\) is \(y_i-\hat{y}_i\text{,}\) the observed value minus the predicted value.
4. Compute Residuals.
(d) Complete residual table and compare overestimation vs underestimation counts.
5. Sign of Residual.
(e) Interpret positive and negative residuals.
6. Overall Error Metric.
(f) Suggest a way to combine residuals to quantify overall error.
7. Why Sum of Residuals Fails.
(g) Compute sum of residuals and explain why it is not useful for fit quality.
8. Proof: Residuals from the Mean Sum to Zero.
(h) Show mathematically that residuals from the sample mean sum to zero.
9. Alternative Criteria.
(i) Suggest two alternatives that avoid cancellation.
10. Calculus Minimization.
(j) Use calculus to show minimizing sum of squared errors over a constant predictor yields the sample mean.
Predicting Height from Foot Length.
11. Describe Scatterplot.
(k) Describe association between height and foot length.
12. Movable Line.
(l) Use applet movable line and record your best-fit equation.
13. Compare Class Lines.
(m) Did everyone get the same equation?
14. Best-Line Criterion.
(n) Propose a criterion for deciding which line is best.
15. Absolute Error Criterion.
(o) Compute SAE for your line and compare across class.
16. Squared Error Criterion.
(p) Compute SSE for your line and compare.
17. Improve Line for SSE.
(q) Adjust line to minimize SSE and report equation.
18. Least Squares Line.
(r) Use applet regression line and compare with your line.
Definition: Least Squares Regression Line.
The least squares regression line is the line that minimizes the sum of squared residuals (SSE) among all possible lines.
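As a rough numerical illustration of this definition, the Python sketch below fits a line using the standard closed-form results \(b_1=r s_y/s_x\) and \(b_0=\bar{y}-b_1\bar{x}\text{,}\) then confirms that nudging either coefficient makes SSE larger. The foot lengths and heights are hypothetical, not the class data, and the function names are invented for this example.

```python
from statistics import mean, stdev

def sse(x, y, b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

def least_squares(x, y):
    """Intercept and slope from b1 = r * s_y / s_x, b0 = y-bar - b1 * x-bar."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    r = sum((xi - x_bar) * (yi - y_bar)
            for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical foot lengths (cm) and heights (in), for illustration only
foot = [24, 26, 27, 28, 29, 31, 32]
height = [62, 65, 66, 67, 70, 71, 74]

b0, b1 = least_squares(foot, height)
best = sse(foot, height, b0, b1)
# SSE for lines whose intercept or slope is nudged away from the fitted values
nudged = [sse(foot, height, b0 + d0, b1 + d1)
          for d0, d1 in [(0.5, 0), (-0.5, 0), (0, 0.05), (0, -0.05)]]
print(round(b0, 2), round(b1, 3), round(best, 3))
print([round(v, 3) for v in nudged])
```

Every nudged line has a larger SSE than the fitted line, which is what "minimizes SSE among all lines" means in practice.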
19. How to Determine Coefficients.
(s) Suggest a technique to determine slope and intercept from data.
20. Derivatives for SSE.
(t) Differentiate SSE with respect to intercept and slope.
21. Solve Normal Equations.
(u) Solve for least-squares coefficients.
22. Compute Coefficients from Summaries.
(v) Use summary statistics with \(b_1=r s_y/s_x\) and \(b_0=\bar{y}-b_1\bar{x}\) to compute the least-squares coefficients.
23. Predictions and Slope Meaning.
(w) Predict heights at two nearby foot lengths and relate difference to slope.
24. Interpret Slope.
(x) Interpret slope in context.
25. Interpret Intercept.
(y) Interpret intercept and discuss meaningfulness.
26. Extrapolation.
(z) Predict at far-out x and explain why this is less reliable.
Definition: Extrapolation.
Extrapolation is prediction at x-values outside the range observed in the data. It is usually ill-advised because the fitted relationship may not hold there.
27. SSE from Mean Line.
(aa) Compute SSE when predicting with \(\bar{y}\) only.
28. Coefficient of Determination.
(bb) Compute the percentage reduction in SSE from the mean line to the regression line and interpret it as \(R^2\text{.}\)
Definition: Coefficient of Determination.
\(R^2\) is the proportion of response variability explained by regression on the explanatory variable.
29. Residual Standard Deviation.
(cc) Compute and interpret \(s=\sqrt{SSE/(n-2)}\text{.}\)
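The two summary measures in this investigation, \(R^2\) and \(s\text{,}\) both come from sums of squared errors. The Python sketch below computes them for hypothetical foot-length and height values (not the class data). The slope is computed as a ratio of sums of cross-products and squares, which is algebraically the same as \(r s_y/s_x\text{.}\)

```python
from math import sqrt
from statistics import mean

# Hypothetical foot lengths (cm) and heights (in), for illustration only
foot = [24, 26, 27, 28, 29, 31, 32]
height = [62, 65, 66, 67, 70, 71, 74]
n = len(foot)

x_bar, y_bar = mean(foot), mean(height)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(foot, height))
s_xx = sum((x - x_bar) ** 2 for x in foot)
b1 = s_xy / s_xx           # least-squares slope
b0 = y_bar - b1 * x_bar    # least-squares intercept

# SSE when every height is predicted by the mean height
sse_mean = sum((y - y_bar) ** 2 for y in height)
# SSE when each height is predicted from the regression line
sse_line = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(foot, height))

r_squared = (sse_mean - sse_line) / sse_mean   # proportional reduction in SSE
s = sqrt(sse_line / (n - 2))                   # residual standard deviation
print(round(r_squared, 3), round(s, 3))
```

Here `r_squared` is the fraction of the mean-only SSE that the line removes, and `s` is in the units of the response (inches in this sketch), so it can be read as a typical prediction error.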
Study Conclusions.
Height and foot length show a moderately strong positive linear association; least-squares regression improves prediction over using the mean alone but still leaves unexplained variability.
30. Practice Problem 5.8.
For the Cat Jumping data: compute and interpret \(r\text{,}\) \(r^2\text{,}\) slope, intercept, and \(s\text{.}\)
Section 26.1.5 Applet Exploration: Behavior of Regression Lines
Exercises Exploration Tasks
Use the Analyzing Two Quantitative Variables applet with height and foot-length data.
1. Baseline Regression Fit.
(a) Display regression line and judge whether it summarizes the linear relationship reasonably.
2. Predict Effect of High-Leverage Point.
(b) Predict effect of adding point (35, 60) on regression equation.
3. Add Point and Report New Equation.
(c) Add the point and report new equation.
4. Move New Point Vertically.
(d) Move the added point vertically and assess how strongly slope/intercept change.
5. Predict Effect of Central Point Movement.
(e) Revert and predict effect of moving point (29, 65) vertically.
6. Move Central Point.
(f) Move (29, 65) and compare influence to part (d).
7. Influence Comparison.
(g) Which point is more influential and why under least squares?
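The comparison in this exploration can also be sketched outside the applet. In the Python example below, only the points (35, 60) and (29, 65) come from the tasks above. The remaining data values and the function names `fit` and `slope_shift` are hypothetical.

```python
from statistics import mean

def fit(x, y):
    """Least-squares intercept and slope."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def slope_shift(x, y, index, delta):
    """Change in the fitted slope when y[index] moves vertically by delta."""
    moved = list(y)
    moved[index] += delta
    return fit(x, moved)[1] - fit(x, y)[1]

# Hypothetical foot lengths (cm) and heights (in); index 4 is the point (29, 65)
foot = [24, 26, 27, 28, 29, 31, 32]
height = [62, 65, 66, 67, 65, 71, 74]

# Add the point (35, 60), then raise it by 10 inches
far = slope_shift(foot + [35], height + [60], index=-1, delta=10)
# Without the added point, raise the central point (29, 65) by 10 inches
central = slope_shift(foot, height, index=4, delta=10)
print(round(far, 3), round(central, 3))
```

In this sketch the same vertical move changes the slope far more for the point at x = 35 than for the point at x = 29. The change in slope is proportional to the point's horizontal distance from the mean of the x-values, so points far from the mean have more pull on the line.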
Section 26.1.6 Excel Exploration: Minimization Criteria
Exercises Exploration Tasks
This activity compares minimization criteria for choosing a single-value predictor \(m\) for heights.
1. Compute SAE from Mean.
(a) Compute SAE using mean as predictor.
2. Compute SAE(m) in Spreadsheet.
(b) Fill in the formulas in Heights.xls and verify SAE at the sample mean.
3. Graph SAE(m).
(c) Graph SAE vs m, sketch shape, and describe its form.
4. Find Minimizer of SAE.
(d) Determine value(s) of m minimizing SAE and report minimum SAE.
5. Connect to Median.
(e) Relate minimizing m to ordered data and familiar location statistics.
6. Modify Middle Value.
(f) Change one middle value and reassess minimizer uniqueness.
7. Modify Extreme Value.
(g) Change largest value and compare effect on SAE minimizer.
8. General Conjecture.
(h) Conjecture rule for minimizing SAE from generic data.
9. MAE Criterion.
(i) Compare MAE(m) and SAE(m): shape and minimizer.
Discussion: SAE is piecewise linear and minimized by a median value (or median interval for even n).
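The spreadsheet calculation can be mirrored in a few lines of Python. This sketch uses six made-up heights (not the values in Heights.xls) so that the even-n case is visible. The function name `sae` and the grid of candidate values are choices made for this example.

```python
from statistics import median

def sae(data, m):
    """Sum of absolute errors when every value is predicted by m."""
    return sum(abs(v - m) for v in data)

# Hypothetical heights (in), for illustration only
heights = [61, 64, 65, 67, 70, 72]

grid = [60 + 0.5 * k for k in range(29)]   # candidate m from 60 to 74
values = [sae(heights, m) for m in grid]
minimum = min(values)
minimizers = [m for m, v in zip(grid, values) if v == minimum]
print(minimum, minimizers[0], minimizers[-1], median(heights))
```

With an even number of observations, every candidate between the two middle values (65 and 67 here) attains the same minimum SAE, and the usual median sits inside that interval.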
10. Least-Squares Minimizer Recall.
(j) Report m that minimizes SSE for these heights.
11. Graph SSE(m).
(k) Graph SSE vs m, describe shape, and identify minimizer.
12. Outlier Sensitivity Comparison.
(l) Create extreme outlier and compare effects on SSE and SAE minimizers.
Discussion.
Least squares has computational and uniqueness advantages but is less resistant to outliers than absolute-error criteria.
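A small Python sketch of the outlier comparison, assuming made-up heights in which the largest value is mistyped. It relies on the facts from this exploration that SSE is minimized at the mean and SAE at the median.

```python
from statistics import mean, median

# Hypothetical heights (in), for illustration only
heights = [61, 64, 65, 67, 70, 72, 73]
# Same data with the largest value mistyped as 173
with_outlier = heights[:-1] + [173]

# The SSE minimizer is the mean; the SAE minimizer is the median
mean_shift = mean(with_outlier) - mean(heights)
median_shift = median(with_outlier) - median(heights)
print(round(mean_shift, 2), median_shift)
```

The least-squares minimizer moves by 100/7, about 14 inches, while the absolute-error minimizer does not move at all in this example.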
Section 26.1.7 Investigation 5.9: Money-Making Movies
Exercises The Study
Using TopTenMovies.txt, investigate whether higher audience ratings are associated with greater box-office revenue.
1. Identify Study Variables.
(a) Identify observational units, explanatory variable, and response variable with types.
2. Initial Scatterplot.
(b) Plot adjusted revenue vs IMDB rating and describe association.
3. Largest Residuals.
(c) Identify movies with largest absolute residuals and interpret what large residual means.
4. Correlation.
(d) Compute and interpret correlation.
5. Regression Equation.
(e) Determine least-squares line, interpret slope and intercept.
6. Rescale Response.
(f) Convert revenue to millions and repeat analysis; compare effects on scatterplot, correlation, slope, and intercept.
7. Interpret R² and s.
(g) Report and interpret \(R^2\) and \(s\text{.}\)
8. Coded Scatterplot by MPAA.
(h) Create coded scatterplot and separate lines by MPAA rating. Describe differences in relationships by category.
9. Coded Scatterplot by Genre.
(i) Repeat coded analysis with primary genre and summarize findings.
10. Influence Check by Removing Extreme Y.
(j) Remove the two movies with revenue above 2 billion and recompute the scatterplot, correlation, and regression line.
11. Effect of Removing Points.
(k) Describe how removal changed results.
Definition: Influential Observation.
An observation is influential if removing it substantially changes \(r\) and/or the regression coefficients. Influential observations often have extreme x-values.
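One way to probe influence is to refit after removing a single observation and compare. The Python sketch below does this for two points in a made-up data set: one with an extreme y-value at a central x, and one with an extreme x-value. The ratings and revenues are hypothetical (not from TopTenMovies.txt), as are the helper names.

```python
from math import sqrt
from statistics import mean

def r_and_slope(x, y):
    """Correlation and least-squares slope for paired data."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy), s_xy / s_xx

def without(values, index):
    """Copy of values with the observation at index removed."""
    return values[:index] + values[index + 1:]

# Hypothetical ratings and revenues (millions), for illustration only
rating = [6.5, 7.0, 7.4, 7.6, 7.8, 8.0, 8.3, 9.2]
revenue = [400, 520, 480, 2100, 610, 650, 700, 300]

full_r, full_slope = r_and_slope(rating, revenue)
r_change, slope_change = {}, {}
for index, label in [(3, "extreme y, central x"), (7, "extreme x")]:
    r, slope = r_and_slope(without(rating, index), without(revenue, index))
    r_change[label] = r - full_r
    slope_change[label] = slope - full_slope
    print(label, round(r_change[label], 3), round(slope_change[label], 1))
```

In these made-up numbers, dropping the extreme-x point changes the slope several times more than dropping the extreme-y point at a central x. Real data can behave differently, so the refit comparison is worth running rather than assuming.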
Study Conclusions.
IMDB rating appears weakly positively related to adjusted revenue. Category-based overlays can reveal systematic under/over-prediction by rating class or genre. Very large residuals in y are not always influential if x is not extreme.
12. Practice Problem 5.9A.
Compare resistance of least-squares and least-absolute-error lines to outliers, and investigate with applet.
13. Practice Problem 5.9B.
Repeat movie analysis using Rotten Tomatoes score as explanatory variable.
Section 26.1.8 Section 5.3 Summary
This section focused on association between two quantitative variables.
You used scatterplots to assess direction, strength, and linearity, and correlation to summarize linear association numerically.
You then fit least-squares regression lines for prediction, interpreted slope and intercept in context, and examined prediction limits such as extrapolation.
You learned model-evaluation tools including residuals, residual standard deviation \(s\text{,}\) coefficient of determination \(R^2\text{,}\) and influential observations.
These tools provide the foundation for inferential regression methods in the next section.