
Advanced High School Statistics: Third Edition

Section 8.3 Transformations for skewed data

County population size among the counties in the US is very strongly right skewed. Can we apply a transformation to make the distribution more symmetric? How would such a transformation affect the scatterplot and residual plot when another variable is graphed against this variable? In this section, we will see the power of transformations for very skewed data.

Subsection 8.3.1 Introduction to transformations

Example 8.3.1.

Consider the histogram of county populations shown in Figure 8.3.2.(a), which shows extreme skew. What isn’t useful about this plot?
Solution.
Nearly all of the data fall into the left-most bin, and the extreme skew obscures many of the potentially interesting details in the data.
Figure 8.3.2. (a) A histogram of the populations of all US counties. (b) A histogram of log\(_{10}\)-transformed county populations. For this plot, the x-value corresponds to the power of 10, e.g. “4” on the x-axis corresponds to \(10^4 =\) 10,000.
There are some standard transformations that may be useful for strongly right skewed data where much of the data is positive but clustered near zero. A transformation is a rescaling of the data using a function. For instance, a plot of the logarithm (base 10) of county populations results in the new histogram in Figure 8.3.2.(b). The transformed data are roughly symmetric, and any potential outliers appear much less extreme than in the original data set. By reining in the outliers and extreme skew, transformations like this often make it easier to build statistical models for the data.
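To see the effect numerically, here is a minimal sketch using synthetic data, not the actual county populations. It assumes log-normal draws as a stand-in for a strongly right skewed variable, and the helper `skewness` is a hypothetical function written for this illustration. A skewness near zero indicates a roughly symmetric distribution.

```python
import math
import random

def skewness(values):
    """Skewness computed as the mean cubed z-score (population form)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

# Synthetic stand-in for county populations (NOT the real data):
# log-normal draws are positive and strongly right skewed.
rng = random.Random(0)
populations = [rng.lognormvariate(10.0, 1.5) for _ in range(3000)]

# The transformation: replace each value with its base-10 logarithm.
log_populations = [math.log10(p) for p in populations]

print(f"skewness before: {skewness(populations):.2f}")
print(f"skewness after:  {skewness(log_populations):.2f}")
```

On the log scale, each unit step corresponds to a factor of 10 on the original scale, which is why "4" on the transformed axis corresponds to 10,000.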
Transformations can also be applied to one or both variables in a scatterplot. A scatterplot of the population change from 2010 to 2017 against the population in 2010 is shown in Figure 8.3.3.(a). In this first scatterplot, it’s hard to decipher any interesting patterns because the population variable is so strongly skewed. However, if we apply a log\(_{10}\) transformation to the population variable, as shown in Figure 8.3.3.(b), a positive association between the variables is revealed. While fitting a line to predict population change (2010 to 2017) from population (in 2010) does not seem reasonable, fitting a line to predict population change from log\(_{10}\)(population) does seem reasonable.
Figure 8.3.3. (a) Scatterplot of population change against the population before the change. (b) A scatterplot of the same data but where the population size has been log-transformed.
Transformations other than the logarithm can be useful, too. For instance, the square root (\(\sqrt{\text{ original observation } }\)) and inverse (\(\frac{1}{\text{ original observation } }\)) are commonly used by data scientists. Common goals in transforming data are to see the data structure differently, reduce skew, assist in modeling, or straighten a nonlinear relationship in a scatterplot.
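As a rough illustration of how these three transformations rescale data, the sketch below applies each one to a few hypothetical observations that span several orders of magnitude. The values and the names `observations` and `transformations` are assumptions made for this example only.

```python
import math

# Hypothetical observations spanning several orders of magnitude.
observations = [1, 10, 100, 1_000, 10_000]

# The three transformations named in the text.
transformations = {
    "sqrt": math.sqrt,
    "log10": math.log10,
    "inverse": lambda v: 1 / v,
}

transformed = {
    name: [func(v) for v in observations]
    for name, func in transformations.items()
}

for name, values in transformed.items():
    print(name, [round(v, 4) for v in values])
```

The square root and logarithm both pull large values closer together, with the logarithm compressing more strongly. The inverse also compresses large values, but it reverses their order, so the largest observation becomes the smallest transformed value.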

Subsection 8.3.2 Transformations to achieve linearity

Figure 8.3.4. Variable \(y\) is plotted against \(x\text{.}\) A nonlinear relationship is evident by the \(\cup\)-pattern shown in the residual plot. The curvature is also visible in the original plot.

Example 8.3.5.

Consider the scatterplot and residual plot in Figure 8.3.4. The regression output is also provided. Is the linear model \(\hat{y} = -52.3564 + 2.7842 x\) a good model for the data?
The regression equation is

y = -52.3564 + 2.7842 x

Predictor       Coef   SE Coef         T          P
Constant    -52.3564    7.2757    -7.196      3e-08
x             2.7842    0.1768    15.752    < 2e-16

S = 13.76    R-Sq = 88.26%    R-Sq(adj) = 87.91%
Solution.
We can note the \(R^2\) value is fairly large. However, this alone does not mean that the model is good. Another model might be much better. When assessing the appropriateness of a linear model, we should look at the residual plot. The \(\cup\)-pattern in the residual plot tells us the original data is curved. If we inspect the two plots, we can see that for small and large values of \(x\) we systematically underestimate \(y\text{,}\) whereas for middle values of \(x\text{,}\) we systematically overestimate \(y\text{.}\) The curved trend can also be seen in the original scatterplot. Because of this, the linear model is not appropriate, and it would not be appropriate to perform a \(t\)-test for the slope because the conditions for inference are not met. However, we might be able to use a transformation to linearize the data.
Regression analysis is easier to perform on linear data. When data are nonlinear, we sometimes transform the data in a way that makes the resulting relationship linear. The most common transformation is log of the \(y\) values. Sometimes we also apply a transformation to the \(x\) values. We generally use the residuals as a way to evaluate whether the transformed data are more linear. If so, we can say that a better model has been found.
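The idea of using residuals to judge a transformation can be sketched as follows. This is a minimal example on synthetic data (an assumption, not the data set behind the figures): \(y\) grows exponentially in \(x\), a least squares line is fit to both \(y\) and \(\text{log}(y)\), and the average residual is compared across the low, middle, and high thirds of \(x\). The helpers `fit_line`, `residuals`, and `mean_by_thirds` are hypothetical names introduced for this illustration.

```python
import math
import random

def fit_line(xs, ys):
    """Least squares intercept and slope for predicting ys from xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(xs, ys):
    """Observed minus predicted values from the least squares line."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def mean_by_thirds(res):
    """Average residual in the low, middle, and high thirds of x."""
    k = len(res) // 3
    parts = (res[:k], res[k:2 * k], res[2 * k:])
    return [sum(part) / len(part) for part in parts]

# Synthetic curved data (an assumption, not the textbook's data):
# y grows exponentially in x with a little multiplicative noise.
rng = random.Random(1)
xs = [10 + i for i in range(60)]  # already sorted by x
ys = [math.exp(1.7 + 0.05 * x + rng.gauss(0, 0.03)) for x in xs]

raw = mean_by_thirds(residuals(xs, ys))
logged = mean_by_thirds(residuals(xs, [math.log(y) for y in ys]))

# A positive, negative, positive pattern is the U-shape in a residual plot.
print("raw y: ", [round(r, 2) for r in raw])
# After the log transformation the averages should all be near zero.
print("log(y):", [round(r, 3) for r in logged])
```

Averaging residuals by thirds is only a crude numeric stand-in for looking at the residual plot, which remains the recommended check.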

Example 8.3.6.

Using the regression output for the transformed data, write the new linear regression equation.
The regression equation is

log(y) = 1.722540 + 0.052985 x

Predictor         Coef     SE Coef        T          P
Constant      1.722540    0.056731    30.36    < 2e-16
x             0.052985    0.001378    38.45    < 2e-16

S = 0.1073    R-Sq = 97.82%    R-Sq(adj) = 97.75%
Solution.
The linear regression equation can be written as: \(\widehat{\text{ log } (y)} = 1.723 +0.053 x\)
Figure 8.3.7. A plot of \(\text{log}(y)\) against \(x\text{.}\) The residuals don’t show any evident patterns, which suggests the transformed data is well-fit by a linear model.
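Because the transformed model predicts \(\text{log}(y)\) rather than \(y\text{,}\) a prediction on the original scale requires undoing the logarithm. The sketch below uses the rounded coefficients from the solution above. The regression output does not state the base of the logarithm, so the natural log default here is an assumption, and the \(x\)-value of 40 is a hypothetical input chosen only for illustration.

```python
import math

# Coefficients as rounded in the example's solution.
INTERCEPT = 1.723
SLOPE = 0.053

def predict_log_y(x):
    """Predicted value of log(y) from the fitted line."""
    return INTERCEPT + SLOPE * x

def predict_y(x, base=math.e):
    """Undo the log to get a prediction on the original y scale.

    ASSUMPTION: the base defaults to e (natural log) because the
    regression output does not state it. Pass base=10 if the output
    was produced with a base-10 logarithm.
    """
    return base ** predict_log_y(x)

x = 40  # hypothetical x-value chosen only for illustration
print(round(predict_log_y(x), 3))  # 3.843
print(round(predict_y(x), 1))
```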

Guided Practice 8.3.8.

Which of the following statements are true? There may be more than one.
  1. There is an apparent linear relationship between \(x\) and \(y\text{.}\)
  2. There is an apparent linear relationship between \(x\) and \(\text{log}(y)\text{.}\)
  3. The model provided by Regression I \((\hat{y} = -52.3564 + 2.7842 x)\) yields a better fit.
  4. The model provided by Regression II \((\widehat{\text{ log } (y)} = 1.723 +0.053 x)\) yields a better fit.
Solution.
Statement 1 is false since there is a nonlinear (curved) trend in the data. Statement 2 is true. Since the transformed data show a stronger linear trend, the transformed model is a better fit, i.e. Statement 3 is false, and Statement 4 is true.

Subsection 8.3.3 Section summary

  • A transformation is a rescaling of the data using a function. When data are very skewed, a log transformation often results in more symmetric data.
  • Regression analysis is easier to perform on linear data. When data are nonlinear, we sometimes transform the data in a way that results in a linear relationship. The most common transformation is log of the \(y\)-values. Sometimes we also apply a transformation to the \(x\)-values.
  • To assess the model, we look at the residual plot of the transformed data. If the residual plot of the original data has a pattern, but the residual plot of the transformed data has no pattern, a linear model for the transformed data is reasonable, and the transformed model provides a better fit than the simple linear model.

Exercises 8.3.4 Exercises

1. Used trucks.

The scatterplot below shows the relationship between year and price (in thousands of $) of a random sample of 42 pickup trucks. Also shown is a residuals plot for the linear model for predicting price from year.
  1. Describe the relationship between these two variables and comment on whether a linear model is appropriate for modeling the relationship between year and price.
  2. The scatterplot below shows the relationship between logged (natural log) price and year of these trucks, as well as the residuals plot for modeling these data. Comment on which model (linear model from earlier or logged model presented here) is a better fit for these data.
  3. The output for the logged model is given below. Interpret the slope in context of the data.
                   Estimate    Std. Error    t value    Pr(>|t|)
    (Intercept)    -271.981        25.042    -10.861       0.000
    Year              0.137         0.013     10.937       0.000
Solution.
  1. The relationship is positive, non-linear, and somewhat strong. Due to the non-linear form of the relationship and the clear non-constant variance in the residuals, a linear model is not appropriate for modeling the relationship between year and price.
  2. Neither is a particularly good fit. For the logged model, the scatterplot and residual plot show more constant variance in the residuals. However, the scatterplot for the logged model appears to have a bit of curvature.
  3. For each one-year increase in the year of the truck, we would expect the price to increase on average by a factor of \(e^{0.137} \approx 1.15\text{,}\) i.e. by about 15%.
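The arithmetic behind this kind of slope interpretation can be sketched as follows. When the response is a natural log, a one-unit increase in \(x\) adds the slope to \(\ln(\hat{y})\text{,}\) which multiplies \(\hat{y}\) by \(e^{\text{slope}}\text{.}\) The function name `percent_change_per_unit` is a hypothetical helper, and the two slopes are the ones reported in the logged-model outputs in these exercises.

```python
import math

def percent_change_per_unit(slope):
    """Percent change in predicted y for a one-unit increase in x,
    when the model is ln(y) = intercept + slope * x."""
    return (math.exp(slope) - 1) * 100

# Slopes reported in the logged-model outputs of these exercises.
for label, slope in [("Year", 0.137), ("hrs_work", 0.058)]:
    factor = math.exp(slope)
    change = percent_change_per_unit(slope)
    print(f"{label}: factor {factor:.3f}, about {change:.0f}% per unit")
```

For small slopes the percent change is close to 100 times the slope itself, but the gap widens as the slope grows, so exponentiating is the safer habit.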

2. Income and hours worked.

The scatterplot below shows the relationship between income and hours worked for a random sample of 787 Americans. Also shown is a residuals plot for the linear model for predicting income from hours worked. The data come from the 2012 American Community Survey.
Source: United States Census Bureau. Summary File. 2012 American Community Survey. U.S. Census Bureau’s American Community Survey Office, 2013. Web.
  1. Describe the relationship between these two variables and comment on whether a linear model is appropriate for modeling the relationship between income and hours worked.
  2. The scatterplot below shows the relationship between logged (natural log) income and hours worked, as well as the residuals plot for modeling these data. Comment on which model (linear model from earlier or logged model presented here) is a better fit for these data.
  3. The output for the logged model is given below. Interpret the slope in context of the data.
                   Estimate    Std. Error    t value    Pr(>|t|)
    (Intercept)       1.017         0.113      9.000       0.000
    hrs_work          0.058         0.003     21.086       0.000