We’ve now seen what singular value decompositions are, how to construct them, and how they provide important information about a matrix such as orthonormal bases for the four fundamental subspaces. This puts us in a good position to begin using singular value decompositions to solve a wide variety of problems.
Since singular value decompositions so immediately convey fundamental data about a matrix, it seems natural that some of our previous work can be reinterpreted in terms of them. Therefore, we’ll take some time in this section to revisit some familiar topics, such as least-squares problems and principal component analysis, while also looking at some new applications.
Preview Activity 7.5.1.
Suppose that \(A = U\Sigma V^T\) where
\begin{equation*}
\Sigma = \begin{bmatrix}
13 \amp 0 \amp 0 \amp 0 \\
0 \amp 8 \amp 0 \amp 0 \\
0 \amp 0 \amp 2 \amp 0 \\
0 \amp0 \amp 0 \amp 0 \\
0 \amp0 \amp 0 \amp 0
\end{bmatrix},
\end{equation*}
vectors \(\uvec_j\) form the columns of \(U\text{,}\) and vectors \(\vvec_j\) form the columns of \(V\text{.}\)
What are the shapes of the matrices \(A\text{,}\) \(U\text{,}\) and \(V\text{?}\)
What is the rank of \(A\text{?}\)
Describe how to find an orthonormal basis for \(\col(A)\text{.}\)
Describe how to find an orthonormal basis for \(\nul(A)\text{.}\)
If the columns of \(Q\) form an orthonormal basis for \(\col(A)\text{,}\) what is \(Q^TQ\text{?}\)
How would you form a matrix that projects vectors orthogonally onto \(\col(A)\text{?}\)
Subsection 7.5.1 Least-squares problems
Least-squares problems, which we explored in
Section 6.5, arise when we are confronted with an inconsistent linear system
\(A\xvec=\bvec\text{.}\) Since there is no solution to the system, we instead find the vector
\(\xvec\) minimizing the distance between
\(\bvec\) and
\(A\xvec\text{.}\) That is, we find the vector
\(\xhat\text{,}\) the least-squares approximate solution, by solving
\(A\xhat=\bhat\) where
\(\bhat\) is the orthogonal projection of
\(\bvec\) onto the column space of
\(A\text{.}\)
If we have a singular value decomposition \(A=U\Sigma V^T\text{,}\) then the number of nonzero singular values \(r\) tells us the rank of \(A\text{,}\) and the first \(r\) columns of \(U\) form an orthonormal basis for \(\col(A)\text{.}\) This basis may be used to project vectors onto \(\col(A)\) and hence to solve least-squares problems.
Before exploring this connection further, we will introduce Sage as a tool for automating the construction of singular value decompositions. One new feature is that we need to declare our matrix to consist of floating point entries. We do this by including RDF inside the matrix definition, as illustrated in the following cell.
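For instance, such a cell might look like the following sketch; the matrix here is only a hypothetical example, and A.SVD() returns matrices U, Sigma, and V with \(A=U\Sigma V^T\text{.}\)
# Sketch of a Sage cell that constructs a singular value decomposition numerically.
# RDF declares floating point entries; the matrix shown is a hypothetical example.
A = matrix(RDF, [[2, 1], [1, 2], [0, 1]])
U, Sigma, V = A.SVD()     # numerical SVD with A = U*Sigma*V.T (up to rounding)
print(U)
print(Sigma)
print(V)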
Activity 7.5.2.
Consider the equation \(A\xvec=\bvec\) where
\begin{equation*}
\begin{bmatrix}
1 \amp 0 \\
1 \amp 1 \\
1 \amp 2
\end{bmatrix}
\xvec = \threevec{-1}36
\end{equation*}
Find a singular value decomposition for \(A\) using the Sage cell below. What are the singular values of \(A\text{?}\)
What is \(r\text{,}\) the rank of \(A\text{?}\) How can we identify an orthonormal basis for \(\col(A)\text{?}\)
Form the reduced singular value decomposition \(U_r\Sigma_rV_r^T\) by constructing: the matrix \(U_r\text{,}\) consisting of the first \(r\) columns of \(U\text{;}\) the matrix \(V_r\text{,}\) consisting of the first \(r\) columns of \(V\text{;}\) and \(\Sigma_r\text{,}\) a square \(r\times r\) diagonal matrix. Verify that \(A=U_r\Sigma_r V_r^T\text{.}\)
You may find it convenient to remember that if B is a matrix defined in Sage, then B.matrix_from_columns( list ) and B.matrix_from_rows( list ) can be used to extract columns or rows from B. For instance, B.matrix_from_rows([0,1,2]) provides a matrix formed from the first three rows of B.
How does the reduced singular value decomposition provide a matrix whose columns are an orthonormal basis for \(\col(A)\text{?}\)
Explain why a least-squares approximate solution \(\xhat\) satisfies
\begin{equation*}
A\xhat = U_rU_r^T\bvec.
\end{equation*}
What is the product \(V_r^TV_r\) and why does it have this form?
Explain why
\begin{equation*}
\xhat = V_r\Sigma_r^{-1}U_r^T\bvec
\end{equation*}
is the least-squares approximate solution, and use this expression to find
\(\xhat\text{.}\)
This activity demonstrates the power of a singular value decomposition to find a least-squares approximate solution for an equation \(A\xvec = \bvec\text{.}\) Because it immediately provides an orthonormal basis for \(\col(A)\text{,}\) something that we’ve had to construct using the Gram-Schmidt process in the past, we can easily project \(\bvec\) onto \(\col(A)\text{,}\) which results in a simple expression for \(\xhat\text{.}\)
Proposition 7.5.1.
If \(A=U_r\Sigma_r V_r^T\) is a reduced singular value decomposition of \(A\text{,}\) then a least-squares approximate solution to \(A\xvec=\bvec\) is given by
\begin{equation*}
\xhat = V_r\Sigma_r^{-1}U_r^T\bvec.
\end{equation*}
If the columns of \(A\) are linearly independent, then the equation \(A\xhat = \bhat\) has only one solution, so there is a unique least-squares approximate solution \(\xhat\text{.}\) Otherwise, the expression in Proposition 7.5.1 produces the solution to \(A\xhat=\bhat\) having the shortest length.
The matrix \(A^+ = V_r\Sigma_r^{-1}U_r^T\) is known as the Moore-Penrose pseudoinverse of \(A\text{.}\) When \(A\) is invertible, \(A^{-1} = A^+\text{.}\)
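As a hedged illustration, here is one way the proposition might be carried out in Sage for the system in Activity 7.5.2; the variable names and the value \(r=2\) are our own choices.
# Sketch: a least-squares approximate solution via the pseudoinverse,
# applied to the system from Activity 7.5.2.
A = matrix(RDF, [[1, 0], [1, 1], [1, 2]])
b = vector(RDF, [-1, 3, 6])
U, Sigma, V = A.SVD()
r = 2                                        # A has two nonzero singular values
Ur = U.matrix_from_columns(list(range(r)))   # first r columns of U
Vr = V.matrix_from_columns(list(range(r)))   # first r columns of V
Sigmar = Sigma.matrix_from_rows(list(range(r))).matrix_from_columns(list(range(r)))
Aplus = Vr * Sigmar.inverse() * Ur.T         # Moore-Penrose pseudoinverse A^+
xhat = Aplus * b                             # least-squares approximate solution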
Subsection 7.5.2 Rank \(k\) approximations
If we have a singular value decomposition for a matrix \(A\text{,}\) we can form a sequence of matrices \(A_k\) that approximate \(A\) with increasing accuracy. This may feel familiar to calculus students who have seen the way in which a function \(f(x)\) can be approximated by a linear function, a quadratic function, and so forth with increasing accuracy.
We’ll begin with a singular value decomposition of a rank \(r\) matrix \(A\) so that \(A=U\Sigma V^T\text{.}\) To create the approximating matrix \(A_k\text{,}\) we keep the first \(k\) singular values and set the others to zero. For instance, if \(\Sigma = \begin{bmatrix}
22 \amp 0 \amp 0 \amp 0 \amp 0 \\
0 \amp 14 \amp 0 \amp 0 \amp 0 \\
0 \amp 0 \amp 3 \amp 0 \amp 0 \\
0 \amp 0 \amp 0 \amp 0 \amp 0 \\
\end{bmatrix}
\text{,}\) we can form matrices
\begin{equation*}
\Sigma^{(1)} = \begin{bmatrix}
22 \amp 0 \amp 0 \amp 0 \amp 0 \\
0 \amp 0 \amp 0 \amp 0 \amp 0 \\
0 \amp 0 \amp 0 \amp 0 \amp 0 \\
0 \amp 0 \amp 0 \amp 0 \amp 0 \\
\end{bmatrix}, \hspace{24pt}
\Sigma^{(2)} = \begin{bmatrix}
22 \amp 0 \amp 0 \amp 0 \amp 0 \\
0 \amp 14 \amp 0 \amp 0 \amp 0 \\
0 \amp 0 \amp 0 \amp 0 \amp 0 \\
0 \amp 0 \amp 0 \amp 0 \amp 0 \\
\end{bmatrix}
\end{equation*}
and define \(A_1 = U\Sigma^{(1)}V^T\) and \(A_2 =
U\Sigma^{(2)}V^T\text{.}\) Because \(A_k\) has \(k\) nonzero singular values, we know that \(\rank(A_k) = k\text{.}\) In fact, there is a sense in which \(A_k\) is the closest matrix to \(A\) among all rank \(k\) matrices.
Activity 7.5.3.
Let’s consider a matrix \(A=U\Sigma V^T\) where
\begin{align*}
\amp U = \begin{bmatrix}
\frac{1}{2} \amp \frac{1}{2} \amp \frac{1}{2} \amp \frac{1}{2} \\
\frac{1}{2} \amp \frac{1}{2} \amp -\frac{1}{2} \amp -\frac{1}{2} \\
\frac{1}{2} \amp -\frac{1}{2} \amp \frac{1}{2} \amp -\frac{1}{2} \\
\frac{1}{2} \amp -\frac{1}{2} \amp -\frac{1}{2} \amp \frac{1}{2} \\
\end{bmatrix},\hspace{10pt}
\Sigma = \begin{bmatrix}
500 \amp 0 \amp 0 \amp 0 \\
0 \amp 100 \amp 0 \amp 0 \\
0 \amp 0 \amp 20 \amp 0 \\
0 \amp 0 \amp 0 \amp 4
\end{bmatrix}\\
\amp V = \begin{bmatrix}
\frac{1}{2} \amp \frac{1}{2} \amp \frac{1}{2} \amp \frac{1}{2} \\
\frac{1}{2} \amp -\frac{1}{2} \amp -\frac{1}{2} \amp \frac{1}{2} \\
-\frac{1}{2} \amp -\frac{1}{2} \amp \frac{1}{2} \amp \frac{1}{2} \\
-\frac{1}{2} \amp \frac{1}{2} \amp -\frac{1}{2} \amp \frac{1}{2}
\end{bmatrix}
\end{align*}
Evaluating the following cell will create the matrices U, V, and Sigma. Notice how the diagonal_matrix command provides a convenient way to form the diagonal matrix \(\Sigma\text{.}\)
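A sketch of what such a cell might contain appears below; the entries are taken from the matrices above.
# Sketch: define U, V, and Sigma with RDF entries;
# diagonal_matrix builds Sigma from its diagonal entries.
U = matrix(RDF, [[1/2,  1/2,  1/2,  1/2],
                 [1/2,  1/2, -1/2, -1/2],
                 [1/2, -1/2,  1/2, -1/2],
                 [1/2, -1/2, -1/2,  1/2]])
V = matrix(RDF, [[ 1/2,  1/2,  1/2, 1/2],
                 [ 1/2, -1/2, -1/2, 1/2],
                 [-1/2, -1/2,  1/2, 1/2],
                 [-1/2,  1/2, -1/2, 1/2]])
Sigma = diagonal_matrix(RDF, [500, 100, 20, 4])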
Form the matrix
\(A=U\Sigma V^T\text{.}\) What is
\(\rank(A)\text{?}\)
Now form the approximating matrix
\(A_1=U\Sigma^{(1)}
V^T\text{.}\) What is
\(\rank(A_1)\text{?}\)
Find the error in the approximation \(A\approx
A_1\) by finding \(A-A_1\text{.}\)
Now find
\(A_2 = U\Sigma^{(2)} V^T\) and the error
\(A-A_2\text{.}\) What is
\(\rank(A_2)\text{?}\)
Find \(A_3 = U\Sigma^{(3)} V^T\) and the error \(A-A_3\text{.}\) What is \(\rank(A_3)\text{?}\)
What would happen if we were to compute \(A_4\text{?}\)
What do you notice about the error \(A-A_k\) as \(k\) increases?
In this activity, the approximating matrix \(A_k\) has rank \(k\) because its singular value decomposition has \(k\) nonzero singular values. We then saw how the difference between \(A\) and the approximations \(A_k\) decreases as \(k\) increases, which means that the sequence \(A_k\) forms better approximations as \(k\) increases.
Another way to represent \(A_k\) is with a reduced singular value decomposition so that \(A_k = U_k\Sigma_kV_k^T\) where
\begin{equation*}
U_k = \begin{bmatrix}
\uvec_1 \amp \ldots \amp \uvec_k
\end{bmatrix},\hspace{10pt}
\Sigma_k = \begin{bmatrix}
\sigma_1 \amp 0 \amp \ldots \amp 0 \\
0 \amp \sigma_2 \amp \ldots \amp 0 \\
\vdots \amp \vdots \amp \ddots \amp \vdots \\
0 \amp 0 \amp \ldots \amp \sigma_k
\end{bmatrix},\hspace{10pt}
V_k = \begin{bmatrix}
\vvec_1 \amp \ldots \amp \vvec_k
\end{bmatrix}\text{.}
\end{equation*}
Notice that the rank \(1\) matrix \(A_1\) then has the form \(A_1 = \uvec_1\begin{bmatrix}\sigma_1\end{bmatrix}
\vvec_1^T = \sigma_1\uvec_1\vvec_1^T\) and that we can similarly write:
\begin{align*}
A \approx A_1 \amp = \sigma_1\uvec_1\vvec_1^T\\
A \approx A_2 \amp = \sigma_1\uvec_1\vvec_1^T +
\sigma_2\uvec_2\vvec_2^T\\
A \approx A_3 \amp = \sigma_1\uvec_1\vvec_1^T +
\sigma_2\uvec_2\vvec_2^T +
\sigma_3\uvec_3\vvec_3^T\\
\vdots \amp\\
A = A_r \amp = \sigma_1\uvec_1\vvec_1^T +
\sigma_2\uvec_2\vvec_2^T +
\sigma_3\uvec_3\vvec_3^T +
\ldots +
\sigma_r\uvec_r\vvec_r^T\text{.}
\end{align*}
Given two vectors \(\uvec\) and \(\vvec\text{,}\) the matrix \(\uvec~\vvec^T\) is called the outer product of \(\uvec\) and \(\vvec\text{.}\) (The dot product \(\uvec\cdot\vvec=\uvec^T\vvec\) is sometimes called the inner product.) The outer product of two nonzero vectors is always a rank \(1\) matrix, so we see above how \(A_k\) is obtained by adding together \(k\) rank \(1\) matrices, each of which gets us one step closer to the original matrix \(A\text{.}\)
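As a small illustration, assuming the matrices U, Sigma, and V from Activity 7.5.3 are defined in Sage, the approximation \(A_k\) can be assembled one outer product at a time.
# Sketch: build A_k as a sum of k rank 1 terms sigma_i u_i v_i^T.
sigma = Sigma.diagonal()       # the singular values
u = U.columns()
v = V.columns()
k = 2
Ak = matrix(RDF, 4, 4)         # start from the zero matrix
for i in range(k):
    Ak = Ak + sigma[i] * u[i].column() * v[i].row()   # add one rank 1 outer product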
Subsection 7.5.3 Principal component analysis
In
Section 7.3, we explored principal component analysis as a technique to reduce the dimension of a dataset. In particular, we constructed the covariance matrix
\(C\) from a demeaned data matrix and saw that the eigenvalues and eigenvectors of
\(C\) tell us about the variance of the dataset in different directions. We referred to the eigenvectors of
\(C\) as
principal components and found that projecting the data onto a subspace defined by the first few principal components frequently gave us a way to visualize the dataset. As we added more principal components, we retained more information about the original dataset. This feels similar to the rank
\(k\) approximations we have just seen so let’s explore the connection.
Suppose that we have a dataset with \(N\) points, that \(A\) represents the demeaned data matrix, that \(A = U\Sigma V^T\) is a singular value decomposition, and that the singular values of \(A\) are denoted by \(\sigma_i\text{.}\) It follows that the covariance matrix
\begin{equation*}
C = \frac1N AA^T = \frac1N (U\Sigma V^T) (U\Sigma V^T)^T =
U\left(\frac1N \Sigma \Sigma^T\right) U^T.
\end{equation*}
Notice that \(\frac1N \Sigma\Sigma^T\) is a diagonal matrix whose diagonal entries are \(\frac1N\sigma_i^2\text{.}\) Therefore, it follows that
\begin{equation*}
C =
U\left(\frac1N \Sigma \Sigma^T\right) U^T
\end{equation*}
is an orthogonal diagonalization of \(C\text{,}\) showing that the principal components of the dataset, which are the eigenvectors of \(C\text{,}\) are given by the columns of \(U\text{.}\) In other words, the left singular vectors of \(A\) are the principal components of the dataset. It also shows that the variance in the direction of a principal component is the associated eigenvalue of \(C\text{,}\) and therefore
\begin{equation*}
V_{\uvec_i} = \frac1N\sigma_i^2.
\end{equation*}
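For instance, assuming that A is a demeaned data matrix with RDF entries and one column per data point, the principal components and the variances can be read off from a singular value decomposition as in this sketch.
# Sketch: principal components and variances directly from an SVD.
U, Sigma, V = A.SVD()
N = A.ncols()                          # one data point per column
sigma = A.singular_values()
variances = [s^2 / N for s in sigma]   # V_{u_i} = sigma_i^2 / N
components = U.columns()               # the principal components u_i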
Activity 7.5.4.
Let’s revisit the iris dataset that we studied in
Section 7.3. Remember that there are four measurements given for each of 150 irises and that each iris belongs to one of three species.
Evaluating the following cell will load the dataset and define the demeaned data matrix
\(A\) whose shape is
\(4\times150\text{.}\)
Find the singular values of \(A\) using the command A.singular_values() and use them to determine the variance \(V_{\uvec_j}\) in the direction of each of the four principal components. What is the fraction of variance retained by the first two principal components?
We will now write the matrix \(\Gamma = \Sigma
V^T\) so that \(A = U\Gamma\text{.}\) Suppose that a demeaned data point, say, the 100th column of \(A\text{,}\) is written as a linear combination of principal components:
\begin{equation*}
\xvec = c_1\uvec_1+c_2\uvec_2+c_3\uvec_3+c_4\uvec_4.
\end{equation*}
Explain why \(\fourvec{c_1}{c_2}{c_3}{c_4}\text{,}\) the vector of coordinates of \(\xvec\) in the basis of principal components, appears as the 100th column of \(\Gamma\text{.}\)
Suppose that we now project this demeaned data point \(\xvec\) orthogonally onto the subspace spanned by the first two principal components \(\uvec_1\) and \(\uvec_2\text{.}\) What are the coordinates of the projected point in this basis and how can we find them in the matrix \(\Gamma\text{?}\)
Alternatively, consider the approximation \(A_2=U_2\Sigma_2V_2^T\) of the demeaned data matrix \(A\text{.}\) Explain why the 100th column of \(A_2\) represents the projection of \(\xvec\) onto the two-dimensional subspace spanned by the first two principal components, \(\uvec_1\) and \(\uvec_2\text{.}\) Then explain why the coefficients \(c_1\) and \(c_2\) in that projection, \(c_1\uvec_1 + c_2\uvec_2\text{,}\) form the two-dimensional vector \(\twovec{c_1}{c_2}\) that is the 100th column of \(\Gamma_2=\Sigma_2 V_2^T\text{.}\)
Now we’ve seen that the columns of \(\Gamma_2 = \Sigma_2 V_2^T\) form the coordinates of the demeaned data points projected onto the two-dimensional subspace spanned by \(\uvec_1\) and \(\uvec_2\text{.}\) In the cell below, find a singular value decomposition of \(A\) and use it to form the matrix Gamma2. When you evaluate this cell, you will see a plot of the projected demeaned data points, similar to the one we created in Section 7.3.
In our first encounter with principal component analysis, we began with a demeaned data matrix \(A\text{,}\) formed the covariance matrix \(C\text{,}\) and used the eigenvalues and eigenvectors of \(C\) to project the demeaned data onto a lower-dimensional subspace. In this section, we have seen that a singular value decomposition of \(A\) provides a more direct route: the left singular vectors of \(A\) form the principal components, and the approximating matrix \(A_k\) represents the data points projected onto the subspace spanned by the first \(k\) principal components. The coordinates of a projected demeaned data point are given by the columns of \(\Gamma_k = \Sigma_kV_k^T\text{.}\)
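This route can be sketched in a few lines of Sage; here we assume that A is a demeaned data matrix with RDF entries and take \(k=2\text{.}\)
# Sketch: project a demeaned data matrix onto its first k principal components.
U, Sigma, V = A.SVD()
k = 2
Uk = U.matrix_from_columns(list(range(k)))      # the first k principal components
Vk = V.matrix_from_columns(list(range(k)))
Sigmak = Sigma.matrix_from_rows(list(range(k))).matrix_from_columns(list(range(k)))
Gammak = Sigmak * Vk.T    # column j holds the coordinates of projected point j
Ak = Uk * Gammak          # the rank k approximation of A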