First let's load in the digits as before

Now let's put the digits into a data matrix, where each row is a digit and each column is a pixel index. To do this, we're going to have to "flatten" the 2D arrays for each digit into a 1D array. We'll go row by row in "scanline order," so the first row goes into the first 28 dimensions, the second row goes into dimensions 29-56, etc. Here's an example of using numpy's flatten method to do this on an example 0 digit.

The original digit image is $28 \times 28$ pixels, so when we flatten it, we get a 784 dimensional vector. It's a little strange to look at, but we can kind of see the pattern. Noting that pure black is 0 and pure white is 1, the top few rows are all white pixels, so we see it stays at 1 for a while, then it jumps down on each line when it passes through the 0 slice. The animation below shows the flattening process in more detail

Now that we have a way to turn a digit into a vector, we can set up our data matrix x. We'll also keep track of the actual labels of the digits in a parallel array y, though we won't actually use those labels in what we're going to do next

Let's compute PCA on the data matrix to find the directions of greatest variance. First, we compute the mean of all of the data points, which corresponds to computing the mean over all rows

Let's have a look at the mean. We have to turn it back from a flattened vector into an image by using numpy's reshape method

Now we'll subtract off the mean from each row and compute the directions of maximum variance in what remains

To get an idea of how much variance each direction explains, we can create a scree plot, which is simply plotting the variance of each component. When we do this, we see that the variance is captured well by many fewer than 784 dimensions

Let's look at a cumulative sum of the ratio to the total variance to see how many components it takes us to explain 90% of the variance. We find that this takes about 86 dimensions

Let's pull out the first 10 directions of greatest variance and plot them. Overall, the first 10 components account for about 50% of the variance in the dataset, and the first 20 components account for about 64% of the variance in the dataset (note the diminishing returns)