PCA

machine learning
unsupervised learning
pca
linear algebra
matrix factorization
eigendecomposition
projection
subspace
covariance matrix
optimization
mlf
mlt
explainer
review

Given a dataset of \(\displaystyle n\) data-points in \(\displaystyle \mathbb{R}^{d}\), the subspace spanned by the top \(\displaystyle k\) eigenvectors of the covariance matrix of the dataset is the closest \(\displaystyle k\)-dimensional subspace to it.

Setup

For convenience, we will assume that the dataset is centered. Let \(X\) be a centered dataset of shape \(d\times n\):

\[ X=\begin{bmatrix}\vert & & \vert\\ x_{1} & \cdots & x_{n}\\ \vert & & \vert \end{bmatrix} \]

The \(i^{th}\) column of \(X\) is the \(i^{th}\) data-point. Since \(X\) is centered, we have:

\[ \sum_{i=1}^{n}x_{i}=0 \]
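As a quick sanity check of this setup, here is a minimal NumPy sketch (the dataset is random and the variable names are my own) that stores one data-point per column, centers the data, and verifies that the columns sum to the zero vector:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100                          # dimension and number of data-points (arbitrary choices)
X = rng.normal(size=(d, n))            # raw dataset, one data-point per column
X = X - X.mean(axis=1, keepdims=True)  # subtract the row-wise mean to center the dataset

# After centering, the sum of the columns is the zero vector.
print(np.allclose(X.sum(axis=1), 0.0))  # True
```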

We are going to look for a subspace of dimension \(k\) that is closest to the dataset. More precisely, we want to minimize the average squared distance between the points and their projections onto the subspace. We can always find an orthonormal basis for a subspace. Let us call the subspace we are looking for \(W\) and fix an orthonormal basis \(\{w_{1},\cdots,w_{k}\}\) for it.

Optimization Problem

The projection of the \(i^{th}\) data-point onto this subspace is:

\[ \sum\limits_{j=1}^{k}(x_{i}^{T}w_{j})w_{j} \]
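Stacking the basis vectors as the columns of a \(d\times k\) matrix \(W\), this projection can also be written as \(WW^{T}x_{i}\). The following sketch (random data-point, random subspace, illustrative names) computes it both ways:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 2
# Orthonormal basis w_1, ..., w_k for a random k-dimensional subspace, stored as columns of W.
W, _ = np.linalg.qr(rng.normal(size=(d, k)))
x = rng.normal(size=d)                  # a single data-point x_i

proj_sum = sum((x @ W[:, j]) * W[:, j] for j in range(k))  # sum_j (x^T w_j) w_j
proj_mat = W @ (W.T @ x)                                   # the same projection in matrix form
print(np.allclose(proj_sum, proj_mat))  # True
```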

Call the difference between the point \(x_{i}\) and its projection the error vector \(e_{i}\), so that the squared distance between them is \(||e_{i}||^{2}\). We first write down \(e_{i}\):

\[ e_{i}=x_{i}-\left[\sum\limits_{j=1}^{k}(x_{i}^{T}w_{j})w_{j}\right] \]

Let us now expand the squared error using the fact that \(||e_{i}||^{2}=e_{i}^{T}e_{i}\), together with the orthonormality of the basis, i.e. \(w_{j}^{T}w_{l}=1\) if \(j=l\) and \(0\) otherwise:

\[ \begin{aligned} e_{i}^{T}e_{i} & =x_{i}^{T}x_{i}-2\sum\limits_{j=1}^{k}(x_{i}^{T}w_{j})^{2}+\sum\limits_{j=1}^{k}\sum\limits_{l=1}^{k}(x_{i}^{T}w_{j})(x_{i}^{T}w_{l})\,w_{j}^{T}w_{l}\\ & =x_{i}^{T}x_{i}-\sum\limits_{j=1}^{k}(x_{i}^{T}w_{j})^{2} \end{aligned} \]
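Continuing in the same vein, a short numerical check of this identity (again with a random point, a random subspace, and illustrative names):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 2
W, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal basis as columns
x = rng.normal(size=d)                        # a single data-point x_i

e = x - W @ (W.T @ x)               # error vector e_i
lhs = e @ e                         # ||e_i||^2
rhs = x @ x - np.sum((x @ W) ** 2)  # x_i^T x_i - sum_j (x_i^T w_j)^2
print(np.allclose(lhs, rhs))        # True
```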

Since we will be minimizing the error, we can drop \(x_{i}^{T}x_{i}\), which is a constant in the context of the minimization problem. We are therefore left with the following quantity for a single data-point:

\[ -\left[\sum\limits_{j=1}^{k}(x_{i}^{T}w_{j})^{2}\right] \]

For the entire dataset, we sum this error and divide by \(n\):

\[ -\frac{1}{n}\sum\limits_{i=1}^{n}\left[\sum\limits_{j=1}^{k}(x_{i}^{T}w_{j})^{2}\right] \]

Writing \((x_{i}^{T}w_{j})^{2}=w_{j}^{T}x_{i}x_{i}^{T}w_{j}\), interchanging the order of the two summations, and setting \(C=\frac{1}{n}\sum\limits_{i=1}^{n}x_{i}x_{i}^{T}\), we get:

\[ -\sum\limits_{j=1}^{k}w_{j}^{T}\left[\frac{1}{n}\sum\limits_{i=1}^{n}x_{i}x_{i}^{T}\right]w_{j}=-\sum_{j=1}^{k}w_{j}^{T}Cw_{j} \]

Minimizing the above quantity is the same as maximizing its negation. Therefore, we have the following optimization problem:

\[ \max\limits_{W}\,\,\,\sum\limits_{i=1}^{k}w_{i}^{T}Cw_{i} \]

where \(\{w_{1},\cdots,w_{k}\}\) is an orthonormal list of vectors and \(C\) is the covariance matrix of the dataset.
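For a centered dataset stored as the columns of \(X\), the covariance matrix is \(C=\frac{1}{n}XX^{T}\). Here is a small sketch (random centered data, illustrative names) that builds \(C\) and checks that the objective \(\sum_{j}w_{j}^{T}Cw_{j}\) agrees with the average over the data-points of \(\sum_{j}(x_{i}^{T}w_{j})^{2}\):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 5, 100, 2
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)         # centered dataset, one data-point per column
W, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal basis as columns

C = (X @ X.T) / n                             # covariance matrix (1/n) sum_i x_i x_i^T
avg_proj_energy = np.sum((X.T @ W) ** 2) / n  # (1/n) sum_i sum_j (x_i^T w_j)^2
objective = np.trace(W.T @ C @ W)             # sum_j w_j^T C w_j
print(np.allclose(avg_proj_energy, objective))  # True
```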

Solution

Since \(C\) is a real symmetric matrix, it is diagonalizable. There is an orthonormal basis for \(\mathbb{R}^{d}\) made up of the eigenvectors of \(C\). Let us denote it by \(\{u_{1},\cdots,u_{d}\}\). We also note that \(C\) is positive semi-definite, so it has \(d\) non-negative eigenvalues. Let us call them \(\lambda_{1}\geqslant\cdots\geqslant\lambda_{d}\geqslant 0\). We can represent \(C\) as:

\[ C=\sum\limits_{i=1}^{d}\lambda_{i}u_{i}u_{i}^{T} \]
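In NumPy, this spectral decomposition of the symmetric matrix \(C\) can be computed with `np.linalg.eigh`; the sketch below (random centered data, as before) reorders the output so that \(\lambda_{1}\geqslant\cdots\geqslant\lambda_{d}\) and verifies the reconstruction of \(C\):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 100
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)
C = (X @ X.T) / n

# eigh returns eigenvalues in ascending order; flip to get lambda_1 >= ... >= lambda_d.
eigvals, eigvecs = np.linalg.eigh(C)
lam = eigvals[::-1]        # lambda_1 >= ... >= lambda_d >= 0
U = eigvecs[:, ::-1]       # corresponding eigenvectors u_1, ..., u_d as columns

# Reconstruct C as sum_i lambda_i u_i u_i^T.
C_rebuilt = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(d))
print(np.allclose(C, C_rebuilt))  # True
```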

Plugging this back into the objective function, we have:

\[ \begin{aligned} \sum\limits_{i=1}^{k}w_{i}^{T}Cw_{i} & =\sum\limits_{i=1}^{k}\sum\limits_{j=1}^{d}\lambda_{j}(w_{i}^{T}u_{j})^{2} \end{aligned} \]

Changing the order of the summation inside, we have:

\[ \sum_{j=1}^{d}\lambda_{j}\sum_{i=1}^{k}(u_{j}^{T}w_{i})^{2} \]

The term \(\sum\limits_{i=1}^{k}(u_{j}^{T}w_{i})^{2}\) is the squared norm of the projection of \(u_{j}\) onto the subspace \(W\). Since the norm of a vector is always greater than or equal to the norm of its projection, we have:

\[ \sum_{i=1}^{k}(u_{j}^{T}w_{i})^{2}\leqslant||u_{j}||^{2}=1 \]
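A quick numerical check of this inequality, with a random unit vector standing in for \(u_{j}\) and a random subspace for \(W\) (illustrative names):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 5, 2
W, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal basis of W as columns
u = rng.normal(size=d)
u = u / np.linalg.norm(u)                     # a unit vector standing in for u_j

proj_sq_norm = np.sum((u @ W) ** 2)           # squared norm of the projection of u onto W
print(proj_sq_norm <= 1.0 + 1e-12)            # True
```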

Write \(\alpha_{j}=\sum\limits_{i=1}^{k}(u_{j}^{T}w_{i})^{2}\), so that the objective equals \(\sum\limits_{j=1}^{d}\lambda_{j}\alpha_{j}\) with \(0\leqslant\alpha_{j}\leqslant 1\). Moreover, since \(\{u_{1},\cdots,u_{d}\}\) is an orthonormal basis for \(\mathbb{R}^{d}\), the weights \(\alpha_{j}\) sum to \(k\):

\[ \sum_{j=1}^{d}\alpha_{j}=\sum_{i=1}^{k}\sum_{j=1}^{d}(u_{j}^{T}w_{i})^{2}=\sum_{i=1}^{k}||w_{i}||^{2}=k \]

Because \(\lambda_{1}\geqslant\cdots\geqslant\lambda_{d}\), a weighted sum \(\sum\limits_{j=1}^{d}\lambda_{j}\alpha_{j}\) with weights in \([0,1]\) that add up to \(k\) is largest when a full weight of \(1\) is placed on the \(k\) largest eigenvalues. We therefore end up with:

\[ \sum\limits_{i=1}^{k}w_{i}^{T}Cw_{i}\leqslant\sum_{j=1}^{k}\lambda_{j} \]

We have thus arrived at an upper bound for the objective function. This upper bound is achieved when \(w_{i}=u_{i}\) for \(1\leqslant i\leqslant k\), and that completes the proof.
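Putting everything together, the sketch below (random centered data, illustrative names) checks that the top-\(k\) eigenvectors attain the bound \(\sum_{j=1}^{k}\lambda_{j}\) and that a random orthonormal basis does no better:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, k = 5, 200, 2
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)
C = (X @ X.T) / n

eigvals, eigvecs = np.linalg.eigh(C)
lam = eigvals[::-1]                           # eigenvalues in decreasing order
U = eigvecs[:, ::-1]                          # matching eigenvectors as columns

def objective(W):
    """sum_j w_j^T C w_j for an orthonormal basis stored as the columns of W."""
    return np.trace(W.T @ C @ W)

top_k = objective(U[:, :k])                   # objective achieved by the top-k eigenvectors
random_W, _ = np.linalg.qr(rng.normal(size=(d, k)))

print(np.isclose(top_k, lam[:k].sum()))       # True: the upper bound is attained
print(objective(random_W) <= top_k + 1e-12)   # True: a random basis does not exceed it
```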

Summary

In summary, the \(k\)-dimensional subspace closest to a centered dataset is the one spanned by the top \(k\) eigenvectors of its covariance matrix.