Suppose we collect some data when performing an experiment and plot it as shown on the left of Figure 6.5.1. Notice that there is no line on which all the points lie; in fact, it would be surprising if there were since we can expect some uncertainty in the measurements recorded. There does, however, appear to be a line, as shown on the right, on which the points almost lie.
In this section, we'll explore how the techniques developed in this chapter enable us to find the line that best approximates the data. More specifically, we'll see how the search for a line passing through the data points leads to an inconsistent system $A\mathbf{x}=\mathbf{b}$. Since we are unable to find a solution, we instead seek the vector $\hat{\mathbf{x}}$ where $A\hat{\mathbf{x}}$ is as close as possible to $\mathbf{b}$. Orthogonal projection gives us just the right tool for doing this.
When we've encountered inconsistent systems in the past, we've simply said there is no solution and moved on. The preview activity, however, shows how we can find approximate solutions to an inconsistent system: if there are no solutions to $A\mathbf{x}=\mathbf{b}$, we instead solve the consistent system $A\mathbf{x}=\hat{\mathbf{b}}$, where $\hat{\mathbf{b}}$ is the orthogonal projection of $\mathbf{b}$ onto $\text{Col}(A)$. As we'll see, this solution is, in a specific sense, the best possible.
Plot these three points in Figure 6.5.2. Are you able to draw a line that passes through all three points?
Figure 6.5.2. Plot the three data points here.
Remember that the equation of a line can be written as $y = mx + b$, where $m$ is the slope and $b$ is the $y$-intercept. We will try to find $m$ and $b$ so that the three points lie on the line.
The first data point gives an equation for $m$ and $b$. In particular, if the first data point is $(x_1, y_1)$, then we know that when $x = x_1$, we have $y = y_1$, so that $mx_1 + b = y_1$. Use the other two data points to create a linear system describing $m$ and $b$.
We have obtained a linear system having three equations, one from each data point, for the two unknowns $m$ and $b$. Identify a matrix $A$ and vector $\mathbf{b}$ so that the system has the form $A\mathbf{x}=\mathbf{b}$, where $\mathbf{x} = \begin{bmatrix} m \\ b \end{bmatrix}$.
Notice that the unknown vector $\mathbf{x}$ describes the line that we seek.
Is there a solution to this linear system? How does this question relate to your attempt to draw a line through the three points above?
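The same check can be sketched outside of Sage. Here is a minimal pure-Python version, using three hypothetical data points (the activity's actual points may differ), showing that the three equations $mx_i + b = y_i$ cannot all be satisfied at once:

```python
# Hypothetical data points; each (x, y) contributes one equation m*x + b = y,
# i.e. one row [x, 1] of the matrix A and one entry y of the vector b.
points = [(1.0, 1.0), (2.0, 1.0), (3.0, 3.0)]
A = [[x, 1.0] for x, _ in points]
b = [y for _, y in points]

# Solve the 2x2 system coming from the first two points alone.
(x1, y1), (x2, y2) = points[0], points[1]
m = (y2 - y1) / (x2 - x1)
c = y1 - m * x1

# The resulting line fails the third equation, so the full system is inconsistent.
third_residual = m * points[2][0] + c - points[2][1]
print("m =", m, "b =", c, "third-equation residual =", third_residual)
```

A nonzero residual in the third equation confirms that no single line passes through all three points.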
Since this system is inconsistent, we know that $\mathbf{b}$ is not in the column space $\text{Col}(A)$. Find an orthogonal basis for $\text{Col}(A)$ and use it to find $\hat{\mathbf{b}}$, the orthogonal projection of $\mathbf{b}$ onto $\text{Col}(A)$.
Since $\hat{\mathbf{b}}$ is in $\text{Col}(A)$, the equation $A\mathbf{x}=\hat{\mathbf{b}}$ is consistent. Find its solution $\hat{\mathbf{x}}$ and sketch the line it defines in Figure 6.5.2. We say that this is the line of best fit.
This activity illustrates the idea behind a technique known as orthogonal least squares, which we have been working toward throughout this chapter. If the data points are denoted as $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$, we construct the matrix $A$ and vector $\mathbf{b}$ as
$$A = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$
With the vector $\mathbf{x} = \begin{bmatrix} m \\ b \end{bmatrix}$ representing the line $y = mx + b$, we see that the equation $A\mathbf{x}=\mathbf{b}$ describes a line passing through all the data points. In our activity, it is visually apparent that there is no such line, which agrees with the fact that the equation $A\mathbf{x}=\mathbf{b}$ is inconsistent.
Remember that $\hat{\mathbf{b}}$, the orthogonal projection of $\mathbf{b}$ onto $\text{Col}(A)$, is the closest vector in $\text{Col}(A)$ to $\mathbf{b}$. Therefore, when we solve the equation $A\hat{\mathbf{x}}=\hat{\mathbf{b}}$, we are finding the vector $\hat{\mathbf{x}}$ so that $A\hat{\mathbf{x}}$ is as close to $\mathbf{b}$ as possible. Let's think about what this means within the context of this problem.
Drawing the line defined by the vector $\mathbf{x}$, the quantity $|mx_i + b - y_i|$ reflects the vertical distance between the line and the data point $(x_i, y_i)$, as shown in Figure 6.5.5. Seen in this way, the square of the distance $|A\mathbf{x} - \mathbf{b}|^2$ is a measure of how much the line defined by the vector $\mathbf{x}$ misses the data points. The solution to the least-squares problem is the line that misses the data points by the smallest amount possible.
Given an inconsistent system $A\mathbf{x}=\mathbf{b}$, we seek the vector $\hat{\mathbf{x}}$ that minimizes the distance from $A\hat{\mathbf{x}}$ to $\mathbf{b}$. In other words, $\hat{\mathbf{x}}$ satisfies $A\hat{\mathbf{x}}=\hat{\mathbf{b}}$, where $\hat{\mathbf{b}}$ is the orthogonal projection of $\mathbf{b}$ onto the column space $\text{Col}(A)$. We know the equation $A\hat{\mathbf{x}}=\hat{\mathbf{b}}$ is consistent since $\hat{\mathbf{b}}$ is in $\text{Col}(A)$, and we know there is only one solution if we assume that the columns of $A$ are linearly independent.
We will usually denote the solution of $A\mathbf{x}=\hat{\mathbf{b}}$ by $\hat{\mathbf{x}}$ and call this vector the least-squares approximate solution of $A\mathbf{x}=\mathbf{b}$ to distinguish it from a (possibly non-existent) solution of $A\mathbf{x}=\mathbf{b}$.
There is an alternative method for finding $\hat{\mathbf{x}}$ that does not involve first finding the orthogonal projection $\hat{\mathbf{b}}$. Remember that $\hat{\mathbf{b}}$ is defined by the fact that $\mathbf{b}-\hat{\mathbf{b}}$ is orthogonal to $\text{Col}(A)$. In other words, $\mathbf{b}-A\hat{\mathbf{x}}$ is in the orthogonal complement $\text{Col}(A)^\perp$, which Proposition 6.2.10 tells us is the same as $\text{Nul}(A^T)$. Since $\mathbf{b}-A\hat{\mathbf{x}}$ is in $\text{Nul}(A^T)$, it follows that
$$A^T(\mathbf{b}-A\hat{\mathbf{x}}) = \mathbf{0}, \qquad \text{which means that} \qquad A^TA\hat{\mathbf{x}} = A^T\mathbf{b}.$$
with matrix $A$ and vector $\mathbf{b}$. Since the equation $A\mathbf{x}=\mathbf{b}$ is inconsistent, we will find the least-squares approximate solution by solving the normal equation $A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$.
The rate at which a cricket chirps is related to the outdoor temperature, as reflected in some experimental data that we'll study in this activity. The chirp rate is expressed in chirps per second while the temperature is in degrees Fahrenheit. Evaluate the following cell to load the data:
Use the first data point to write an equation involving the unknowns $m$ and $b$.
Suppose that we represent the unknowns using a vector $\mathbf{x} = \begin{bmatrix} m \\ b \end{bmatrix}$. Use the 15 data points to create the matrix $A$ and vector $\mathbf{b}$ so that the linear system $A\mathbf{x}=\mathbf{b}$ describes the unknown vector $\mathbf{x}$.
Write the normal equations $A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$; that is, find the matrix $A^TA$ and the vector $A^T\mathbf{b}$.
Solve the normal equations to find $\hat{\mathbf{x}}$, the least-squares approximate solution to the equation $A\mathbf{x}=\mathbf{b}$. Call your solution xhat since x has another meaning in Sage.
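The same computation can be carried out in plain Python. This sketch uses hypothetical data standing in for the loaded dataset: form $A^TA$ and $A^T\mathbf{b}$, then solve the resulting $2\times 2$ system.

```python
# Least squares via the normal equation A^T A xhat = A^T b for a line fit.
# The data here is hypothetical, standing in for the cricket dataset.
pts = [(1.0, 1.0), (2.0, 1.0), (3.0, 3.0)]
A = [[x, 1.0] for x, _ in pts]
b = [y for _, y in pts]

n = len(A)
AtA = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(2)] for r in range(2)]
Atb = [sum(A[i][r] * b[i] for i in range(n)) for r in range(2)]

# Solve the 2x2 normal equations by Cramer's rule.
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
m_hat = (Atb[0] * AtA[1][1] - Atb[1] * AtA[0][1]) / det
b_hat = (AtA[0][0] * Atb[1] - AtA[1][0] * Atb[0]) / det
# The least-squares line is y = m_hat * x + b_hat.
```

For larger systems one would use a linear algebra library rather than Cramer's rule, but the structure of the computation is the same.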
What are the values of $m$ and $b$ that you found?
If the chirp rate is 22 chirps per second, what is your prediction for the temperature?
You can plot the data and your line, assuming you called the solution xhat, using the cell below.
This example demonstrates an approach, called linear regression, in which a collection of data is modeled using a linear function found by solving a least-squares problem. Once we have the linear function that best fits the data, we can make predictions about situations that we haven't encountered in the data.
If we're going to use our function to make predictions, it's natural to ask how much confidence we have in these predictions. This is a statistical question that leads to a rich and well-developed theory (for example, see Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning: with Applications in R, Springer, 2013), which we won't explore in much detail here. However, there is one simple measure of how well our linear function fits the data that is known as the coefficient of determination and denoted by $R^2$.
We have seen that the square of the distance $|A\hat{\mathbf{x}} - \mathbf{b}|^2$ measures the amount by which the line fails to pass through the data points. When the line is close to the data points, we expect this number to be small. However, the size of this measure depends on the scale of the data. For instance, the two lines shown in Figure 6.5.8 seem to fit the data equally well, but $|A\hat{\mathbf{x}} - \mathbf{b}|^2$ is 100 times larger on the right.
The coefficient of determination $R^2$ is defined by normalizing $|A\hat{\mathbf{x}} - \mathbf{b}|^2$ so that it is independent of the scale. Recall that we described how to demean a vector in Section 6.1: given a vector $\mathbf{v}$, we obtain $\widetilde{\mathbf{v}}$ by subtracting the average of the components from each component. The coefficient of determination is then
$$R^2 = 1 - \frac{|\mathbf{b} - A\hat{\mathbf{x}}|^2}{|\widetilde{\mathbf{b}}|^2}.$$
A more complete explanation of this definition relies on the concept of variance, which we explore in Exercise 6.5.6.12 and the next chapter. For the time being, it's enough to know that $0 \leq R^2 \leq 1$ and that the closer $R^2$ is to 1, the better the line fits the data. This allows us to evaluate $R^2$ in our original example, illustrated in Figure 6.5.8, and in our study of cricket chirp rates. However, assessing the confidence we have in predictions made by solving a least-squares problem can require considerable thought, and it would be naive to rely only on the value of $R^2$.
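As a concrete sketch, $R^2$ can be computed directly from its definition; the data points and fitted line below are hypothetical.

```python
# R^2 = 1 - |b - A xhat|^2 / |btilde|^2, where btilde is the demeaned b.
pts = [(1.0, 1.0), (2.0, 1.0), (3.0, 3.0)]
b = [y for _, y in pts]
m_hat, c_hat = 1.0, -1.0 / 3.0                # least-squares line for these points
fitted = [m_hat * x + c_hat for x, _ in pts]  # the components of A xhat

residual_sq = sum((bi - fi) ** 2 for bi, fi in zip(b, fitted))
mean_b = sum(b) / len(b)
demeaned_sq = sum((bi - mean_b) ** 2 for bi in b)

r_squared = 1 - residual_sq / demeaned_sq
```

Note that the denominator uses the demeaned vector $\widetilde{\mathbf{b}}$, which is what makes the measure independent of the scale of the data.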
As we've seen, the least-squares approximate solution to $A\mathbf{x}=\mathbf{b}$ may be found by solving the normal equation $A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$, and this can be a practical strategy for some problems. However, this approach can be problematic as small rounding errors can accumulate and lead to inaccurate final results.
As the next activity demonstrates, there is an alternate method for finding the least-squares approximate solution using a $QR$ factorization of the matrix $A$, and this method is preferable as it is numerically more reliable.
Suppose we are interested in finding the least-squares approximate solution to the equation $A\mathbf{x}=\mathbf{b}$ and that we have the $QR$ factorization $A = QR$. Explain why the least-squares approximate solution is given by solving $A\hat{\mathbf{x}} = \hat{\mathbf{b}}$, that is,
$$QR\hat{\mathbf{x}} = \hat{\mathbf{b}}.$$
Multiply both sides of the second expression by $Q^T$ and explain why
$$R\hat{\mathbf{x}} = Q^T\mathbf{b}.$$
Since $R$ is upper triangular, this is a relatively simple equation to solve using back substitution, as we saw in Section 5.1. We will therefore write the least-squares approximate solution as
$$\hat{\mathbf{x}} = R^{-1}Q^T\mathbf{b}$$
and put this to use in the following context.
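Here is a pure-Python sketch of this approach on a small hypothetical matrix: Gram-Schmidt produces $Q$ and $R$, and back substitution then solves $R\hat{\mathbf{x}} = Q^T\mathbf{b}$.

```python
import math

def qr_columns(A):
    """Classical Gram-Schmidt on the columns of A (a list of rows).
    Returns Q as a list of orthonormal columns and R as an upper-triangular
    list of rows, with A = Q R."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    Q, R = [], [[0.0] * n for _ in range(n)]
    for j, v in enumerate(cols):
        w = v[:]
        for i, q in enumerate(Q):
            R[i][j] = sum(qk * vk for qk, vk in zip(q, v))
            w = [wk - R[i][j] * qk for wk, qk in zip(w, q)]
        R[j][j] = math.sqrt(sum(wk * wk for wk in w))
        Q.append([wk / R[j][j] for wk in w])
    return Q, R

A = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]   # hypothetical line-fitting matrix
b = [1.0, 1.0, 3.0]
Q, R = qr_columns(A)

# Q^T b: dot each orthonormal column of Q with b.
qtb = [sum(qk * bk for qk, bk in zip(q, b)) for q in Q]

# Back substitution on the upper-triangular system R xhat = Q^T b.
xhat = [0.0] * len(qtb)
for i in reversed(range(len(qtb))):
    xhat[i] = (qtb[i] - sum(R[i][j] * xhat[j] for j in range(i + 1, len(qtb)))) / R[i][i]
# xhat[0] is the slope and xhat[1] the intercept of the least-squares line.
```

In practice one would use a library routine (such as the book's QR(A) command in Sage), since classical Gram-Schmidt as written here can itself lose accuracy on ill-conditioned matrices.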
Brozak's formula, which is used to calculate a person's body fat index $BFI$, is
$$BFI = 100\left(\frac{4.57}{d} - 4.142\right)$$
where $d$ denotes a person's body density in grams per cubic centimeter. Obtaining an accurate measure of $d$ is difficult, however, because it requires submerging the person in water and measuring the volume of water displaced. Instead, we will gather several other body measurements, which are more easily obtained, and use them to predict $BFI$.
For instance, suppose we take 10 patients and measure their weight in pounds, height in inches, abdomen in centimeters, wrist circumference in centimeters, neck circumference in centimeters, and $BFI$. Evaluating the following cell loads and displays the data.
vectors weight, height, abdomen, wrist, neck, and BFI formed from the columns of the dataset.
the command onesvec(n), which returns an $n$-dimensional vector whose entries are all one.
the command QR(A), which returns the $QR$ factorization of $A$ as Q, R = QR(A).
the command demean(v), which returns the demeaned vector $\widetilde{\mathbf{v}}$.
We would like to find the linear function
$$BFI = c_0 + c_1\,\mathrm{weight} + c_2\,\mathrm{height} + c_3\,\mathrm{abdomen} + c_4\,\mathrm{wrist} + c_5\,\mathrm{neck}$$
that best fits the data.
Use the first data point to write an equation for the parameters $c_0, c_1, \ldots, c_5$.
Describe the linear system $A\mathbf{x}=\mathbf{b}$ for these parameters. More specifically, describe how the matrix $A$ and the vector $\mathbf{b}$ are formed.
Construct the matrix $A$ and find its $QR$ factorization in the cell below.
Find the least-squares approximate solution $\hat{\mathbf{x}}$ by solving the equation $R\hat{\mathbf{x}} = Q^T\mathbf{b}$. You may want to use N(xhat) to display a decimal approximation of the vector. What are the parameters that best fit the data?
Find the coefficient of determination $R^2$ for your parameters. What does this imply about the quality of the fit?
Suppose a person's measurements are: weight 190, height 70, abdomen 90, wrist 18, and neck 35. Estimate this person's $BFI$.
In the examples we've seen so far, we have fit a linear function to a dataset. Sometimes, however, a polynomial, such as a quadratic function, may be more appropriate. It turns out that the techniques we've developed in this section are still useful, as the next activity demonstrates.
In addition to loading and plotting the data, evaluating that cell provides the following commands:
Q, R = QR(A) returns the $QR$ factorization of $A$.
demean(v) returns the demeaned vector $\widetilde{\mathbf{v}}$.
Let's fit a quadratic function of the form
$$y = c_0 + c_1x + c_2x^2$$
to this dataset.
Write four equations, one for each data point, that describe the coefficients $c_0$, $c_1$, and $c_2$.
Express these four equations as a linear system $A\mathbf{x}=\mathbf{b}$ where $\mathbf{x} = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix}$.
Find the $QR$ factorization of $A$ and use it to find the least-squares approximate solution $\hat{\mathbf{x}}$.
Use the parameters $c_0$, $c_1$, and $c_2$ that you found to write the quadratic function that fits the data. You can plot this function, along with the data, by entering your function in the place indicated below.
The matrices that you created in the last activity when fitting a quadratic and cubic function to a dataset have a special form. In particular, if the data points are labeled $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$ and we seek a degree $k$ polynomial, then
$$A = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^k \\ 1 & x_2 & x_2^2 & \cdots & x_2^k \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^k \end{bmatrix}.$$
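A sketch of how such a matrix, often called a Vandermonde matrix, might be built in plain Python (the $x$-values here are hypothetical):

```python
# Build the design matrix for a degree-k polynomial fit: the row for the
# data point x_i is [1, x_i, x_i^2, ..., x_i^k].
def vandermonde(xs, k):
    return [[x ** p for p in range(k + 1)] for x in xs]

A = vandermonde([1.0, 2.0, 3.0], 2)   # quadratic fit, hypothetical x-values
# Each row of A is [1, x, x^2] for the corresponding data point.
```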
Find the vector $\hat{\mathbf{x}}$, the least-squares approximate solution to the linear system that results from fitting a degree 5 polynomial to the data.
If your result is stored in the variable xhat, you may plot the polynomial and the data together using the following cell.
plot_model(xhat, data)
Find the coefficient of determination $R^2$ for this polynomial fit.
Repeat these steps to fit a degree 8 polynomial to the data, plot the polynomial with the data, and find $R^2$.
Repeat one more time by fitting a degree 11 polynomial to the data, creating a plot, and finding $R^2$.
It's certainly true that higher degree polynomials fit the data better, as seen by the increasing values of $R^2$, but that's not always a good thing. For instance, with the degree 11 polynomial, you may notice that the graph of the polynomial wiggles a little more than we would expect. In this case, the polynomial is trying too hard to fit the data, which usually contains some uncertainty, especially if it's obtained from measurements. The error built into the data is called noise, and its presence means that we shouldn't expect our polynomial to fit the data perfectly. When we choose a polynomial whose degree is too high, we give the noise too much weight in the model, which leads to some undesirable behavior, like the wiggles in the graph.
Fitting the data with a polynomial whose degree is too high is called overfitting, a phenomenon that can appear in many machine learning applications. Generally speaking, we would like to choose the degree $k$ large enough to capture the essential features of the data but not so large that we overfit and build the noise into the model. There are ways to determine the optimal value of $k$, but we won't pursue that here.
Choosing a reasonable value of the degree $k$, estimate the extent of Arctic sea ice at month 6.5, roughly at the Summer Solstice.
Given an inconsistent system $A\mathbf{x}=\mathbf{b}$, we find $\hat{\mathbf{x}}$, the least-squares approximate solution, by requiring that $A\hat{\mathbf{x}}$ be as close to $\mathbf{b}$ as possible. In other words, $A\hat{\mathbf{x}}=\hat{\mathbf{b}}$, where $\hat{\mathbf{b}}$ is the orthogonal projection of $\mathbf{b}$ onto $\text{Col}(A)$.
One way to find $\hat{\mathbf{x}}$ is by solving the normal equations $A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$. This is not our preferred method, however, since numerical problems can arise.
A second way to find $\hat{\mathbf{x}}$ uses a $QR$ factorization of $A$. If $A = QR$, then $\hat{\mathbf{x}} = R^{-1}Q^T\mathbf{b}$, and finding $\hat{\mathbf{x}}$ is computationally feasible since $R$ is upper triangular.
This technique may be applied widely and is useful for modeling data. We saw examples in this section where linear functions of several input variables and polynomials provided effective models for different datasets.
A simple measure of the quality of the fit is the coefficient of determination $R^2$, though some additional thought should be given in real applications.
The following cell loads in some data showing the number of people in Bangladesh living without electricity over 27 years. It also defines vectors year, which records the years in the dataset, and people, which records the number of people.
This problem concerns a dataset describing planets in our Solar System. For each planet, we have the length $a$ of the semi-major axis, essentially the distance from the planet to the Sun in AU (astronomical units), and the period $T$, the length of time in years required to complete one orbit around the Sun.
We would like to model this data using the function $T = C\,a^r$, where $C$ and $r$ are parameters we need to determine. Since this isn't a linear function, we will transform this relationship by taking the natural logarithm of both sides to obtain
$$\ln(T) = \ln(C) + r\,\ln(a).$$
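To see why this transformation helps, here is a pure-Python sketch on synthetic data generated exactly from Kepler's third law, $T = a^{3/2}$ (the actual planetary data lives in the Sage cell); fitting a line to the points $(\ln(a), \ln(T))$ recovers the exponent $r = 1.5$.

```python
import math

# Synthetic, noise-free data obeying T = a^1.5; the a-values are a
# hypothetical sample of semi-major axes in AU.
a_vals = [0.39, 1.0, 5.2, 9.5]
T_vals = [a ** 1.5 for a in a_vals]

# Linearize: ln(T) = ln(C) + r * ln(a).
xs = [math.log(a) for a in a_vals]
ys = [math.log(T) for T in T_vals]

# Least-squares line through (xs, ys): slope r, intercept ln(C).
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
ln_C = y_bar - r * x_bar
```

With real measurements the recovered $r$ would only be approximately 1.5, but the log transformation turns the power-law fit into the linear least-squares problem we already know how to solve.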
Evaluating the following cell loads a dataset describing the temperature in the Earth's atmosphere at various altitudes. There are also two vectors altitude, expressed in kilometers, and temperature, in degrees Celsius.
Describe how to form the matrix $A$ and vector $\mathbf{b}$ so that the linear system $A\mathbf{x}=\mathbf{b}$ describes a degree $k$ polynomial fitting the data.
After choosing a value of $k$, construct the matrix $A$ and vector $\mathbf{b}$, and find the least-squares approximate solution $\hat{\mathbf{x}}$.
Plot the polynomial and data using plot_model(xhat, data).
Now examine what happens as you vary the degree $k$ of the polynomial. Choose an appropriate value of $k$ that seems to capture the most important features of the data while avoiding overfitting, and explain your choice.
Use your value of $k$ to estimate the temperature at an altitude of 55 kilometers.
The following cell loads some data describing 1057 houses in a particular real estate market. For each house, we record the living area in square feet, the lot size in acres, the age in years, and the price in dollars. The cell also defines variables area, size, age, and price.
We observed that if the columns of $A$ are linearly independent, then there is a unique least-squares approximate solution $\hat{\mathbf{x}}$ to the equation $A\mathbf{x}=\mathbf{b}$ because the equation $A\mathbf{x}=\hat{\mathbf{b}}$ has a unique solution. We also said that $\hat{\mathbf{x}}$ is the unique solution to the normal equation $A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$ without explaining why this equation has a unique solution. This exercise offers an explanation.
This problem is about the meaning of the coefficient of determination $R^2$ and its connection to variance, a topic that appears in the next section. Throughout this problem, we consider the linear system $A\mathbf{x}=\mathbf{b}$ and the approximate least-squares solution $\hat{\mathbf{x}}$, where $A\hat{\mathbf{x}} = \hat{\mathbf{b}}$. We suppose that $A$ is an $m \times n$ matrix, and we will denote the $m$-dimensional vector whose entries are all 1 by $\mathbf{1} = (1, 1, \ldots, 1)$.
Explain why $\overline{b}$, the mean of the components of $\mathbf{b}$, can be found as the dot product
$$\overline{b} = \frac{1}{m}\,\mathbf{b}\cdot\mathbf{1}.$$
In the examples we have seen in this section, explain why $\mathbf{1}$ is in $\text{Col}(A)$.
If we write $\mathbf{b} = \hat{\mathbf{b}} + (\mathbf{b}-\hat{\mathbf{b}})$, explain why
$$(\mathbf{b}-\hat{\mathbf{b}})\cdot\mathbf{1} = 0$$
and hence why the mean of the components of $\mathbf{b}-\hat{\mathbf{b}}$ is zero.
The variance of an $m$-dimensional vector $\mathbf{v}$ is $\text{Var}(\mathbf{v}) = \frac{1}{m}|\widetilde{\mathbf{v}}|^2$, where $\widetilde{\mathbf{v}}$ is the vector obtained by demeaning $\mathbf{v}$.
Explain why
$$\text{Var}(\mathbf{b}) = \text{Var}(\hat{\mathbf{b}}) + \text{Var}(\mathbf{b}-\hat{\mathbf{b}}).$$
Explain why
$$R^2 = 1 - \frac{\text{Var}(\mathbf{b}-\hat{\mathbf{b}})}{\text{Var}(\mathbf{b})}$$
and hence
$$R^2 = \frac{\text{Var}(\mathbf{b}) - \text{Var}(\mathbf{b}-\hat{\mathbf{b}})}{\text{Var}(\mathbf{b})} = \frac{\text{Var}(\hat{\mathbf{b}})}{\text{Var}(\mathbf{b})}.$$
These expressions indicate why it is sometimes said that $R^2$ measures the "fraction of variance explained" by the function we are using to fit the data. As seen in the previous exercise, there may be other features that are not recorded in the dataset that influence the quantity we wish to predict.
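These identities can be verified numerically. The vectors below are a hypothetical observed vector $\mathbf{b}$ and the fitted values $\hat{\mathbf{b}}$ produced by a least-squares line for it:

```python
# Check Var(b) = Var(bhat) + Var(b - bhat) and R^2 = Var(bhat)/Var(b)
# for a hypothetical least-squares fit.
def var(v):
    mean = sum(v) / len(v)
    return sum((vi - mean) ** 2 for vi in v) / len(v)

b     = [1.0, 1.0, 3.0]          # observed values (hypothetical)
bhat  = [2/3, 5/3, 8/3]          # fitted values A xhat from a least-squares line
resid = [bi - hi for bi, hi in zip(b, bhat)]

decomposition_gap = var(b) - (var(bhat) + var(resid))
r_squared = var(bhat) / var(b)
```

The gap in the variance decomposition is zero (up to rounding), and the ratio $\text{Var}(\hat{\mathbf{b}})/\text{Var}(\mathbf{b})$ agrees with $R^2$ computed from its original definition.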