Principal Component Regression


The final feature selection technique that we will introduce may come in handy in particular, given the high degree of multicollinearity that we observed in our gene expression features matrix.

In this technique, rather than leaving carefully selected explanatory variables out of our features matrix $X$, we use what we call principal component analysis (PCA) to take our features matrix $X$ and represent it in a different way, $X_{pc}$, such that $X_{pc}$ still preserves some aspect of the original features matrix $X$.

Linear Algebra Basics

We don’t assume that you have taken any linear algebra courses in order to understand this module. However, we need to cover some basics to understand why principal component analysis is cool and useful.

Vectors

First, we define a vector $\mathbf{x}$ as an ordered tuple of real numbers $\mathbf{x}=(x_1,x_2,…,x_n)$.

For instance, $\mathbf{x}=(2,-1,3)$ is a vector with three dimensions. When referring to the tuple as a whole (ie. $\mathbf{x}$), we usually use a bold font. You can think of a vector as a row of values in a dataframe, for instance.

Properties of Vectors

Adding and Subtracting Vectors

We can add (and subtract) two vectors $\mathbf{x}=(x_1,x_2,…,x_n)$ and $\mathbf{y}=(y_1,y_2,…,y_n)$ by adding the corresponding components as such:
$$\mathbf{x+y}=(x_1,x_2,…,x_n )+(y_1,y_2,…,y_n )=(x_1+y_1,x_2+y_2,…,x_n+y_n )$$

For instance, if $\mathbf{x}=(2,-1,3)$ and $\mathbf{y}=(4,0,1)$, then

$$\mathbf{x}+\mathbf{y} = (2,-1,3)+(4,0,1)=(2+4, -1+0, 3+1) = (6, -1, 4)$$

Multiplying a Vector by a Number

We can multiply (and divide) a vector $\mathbf{x}=(x_1,x_2,…,x_n )$ by a single number $\alpha$ as follows:
$$\alpha\mathbf{x}=\alpha(x_1,x_2,…,x_n )=(\alpha x_1,\alpha x_2,…,\alpha x_n )$$

For instance, if $\mathbf{x}=(2,-1,3)$, then

$$5\cdot\mathbf{x} = 5\cdot (2,-1,3) = (10,-5, 15)$$
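
If you would like to check these operations with code, NumPy arrays behave in exactly this componentwise way. Here is a minimal sketch (using the same np import convention we rely on later in this module):

import numpy as np

x = np.array([2, -1, 3])
y = np.array([4, 0, 1])

print(x + y)   # [6 -1 4]  -- componentwise addition
print(5 * x)   # [10 -5 15] -- each component multiplied by 5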

Geometric Representation of Vectors

Radius Vectors: Geometrically, we visualize a vector $\mathbf{x}$ with an arrow emanating from one point and ending at another point. The subtraction of these two points (end point – starting point) should equal your vector $\mathbf{x}$.

For instance, the vector $\mathbf{x}=(2,-1)$ may be visualized with the following representations.

  • The blue arrow below starting at the point $(0,0)$ and ending at $(2,-1)$.

    (because $\mathbf{x}=(2,-1)=(2,-1)-(0,0)$)

  • The orange arrow below starting at the point $(1,3)$ and ending at $(3,2)$.

    (because $\mathbf{x}=(2,-1)=(3,2)-(1,3)$)

[Image: the blue and orange arrow representations of the vector $\mathbf{x}=(2,-1)$ described above]

Thus, there are actually infinitely many ways to represent the same vector.

Positional Vectors: In addition, we can also think of any $n$-dimensional point as a vector $\mathbf{x}$ that happens to emanate from the origin $(0,0,…,0)$ of a Cartesian coordinate system.

For instance, the blue arrow above is the positional vector representation of the point $(2,-1)$.

Parallel Vectors

For any vector $\mathbf{x}$ that is not all zeros, any vector $\alpha \mathbf{x}$ is parallel to $\mathbf{x}$ for $\alpha\neq0$.

  • When $\alpha>0$, $\mathbf{x}$ and $\alpha \mathbf{x}$ are pointing in the same direction.
  • When $\alpha<0$, $\mathbf{x}$ and $\alpha \mathbf{x}$ are pointing in the opposite direction.
[Image: parallel representations of $\mathbf{x}$ (blue), $2\mathbf{x}$ (purple), and $-\tfrac{1}{2}\mathbf{x}$ (yellow) described below]

For instance,

  • one representation of the vector $\mathbf{x}=(2,-1)$ is sketched in blue above,
  • one representation of the vector $2\cdot\mathbf{x}=2\cdot(2,-1)=(4,-2)$ is sketched in purple above, and
  • one representation of the vector $-1/2\cdot\mathbf{x}=-1/2\cdot(2,-1)=(-1,1/2)$ is sketched in yellow above.

Notice how each of these vectors is parallel to the others. Also notice how the yellow vector $-1/2\cdot\mathbf{x}$ is moving in the opposite direction of the blue vector $\mathbf{x}$.

Dot Product

Dot Product of Two Vectors

We define the dot product of two vectors $\mathbf{x}=(x_1,x_2,…,x_n )$ and $\mathbf{y}=(y_1,y_2,…,y_n )$ as follows:
$$\mathbf{x}\cdot \mathbf{y}=x_1 y_1+x_2 y_2+⋯+x_n y_n$$

Example
For instance, the dot product of the vectors $\mathbf{x}_1=(-2,0)$ and $\mathbf{u}=(2,1)$ is the following.

\begin{align*}
\mathbf{x}_1 \cdot \mathbf{u} &= (-2,0) \cdot (2,1) \\
&= -2 \cdot 2 + 0 \cdot 1 \\
&= -4
\end{align*}

Furthermore, the dot product of the vectors $\mathbf{x}_2=(0,2)$ and $\mathbf{u}=(2,1)$ is the following.

\begin{align*}
\mathbf{x}_2 \cdot \mathbf{u} &= (0,2) \cdot (2,1) \\
&= 0 \cdot 2 + 2 \cdot 1 \\
&= 2
\end{align*}
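
As a quick check, we can reproduce these two dot products with NumPy (a minimal sketch; np.dot and the @ operator compute the same quantity):

import numpy as np

u = np.array([2, 1])
x1 = np.array([-2, 0])
x2 = np.array([0, 2])

print(np.dot(x1, u))   # -4
print(x2 @ u)          # 2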

Geometric Representation of the Dot Product of Two Vectors

The dot product $\mathbf{x}\cdot\mathbf{u}$ of a vector $\mathbf{x}$ and $\mathbf{u}$ represents a measure of how far $\mathbf{x}$ "advances" in the direction of $\mathbf{u}$.

We follow the instructions below to visually represent what it means for a given vector $\mathbf{x}$ to "advance" in the direction of another given vector $\mathbf{u}$.

Example:

The positional vector representation of the vector $\mathbf{u}=(2,1)$ is drawn below with the green arrow. The point $\mathbf{x}_2=(0,2)$ is also drawn below in orange.

  1. Find the point on the green arrow $\mathbf{u}= (2,1)$ that is closest to the point $\mathbf{x}_2=(0,2)$. The line between these two points should be perpendicular to the green arrow representation of $\mathbf{u}$.

  2. Draw an orange arrow starting from the origin (0,0) and ending at this point from (1). This new vector that you drew is parallel to $\mathbf{u}=(2,1)$ and we call it the projection of the point $\mathbf{x}_2$ onto the vector $\mathbf{u}$.

[Image: the green positional vector $\mathbf{u}=(2,1)$, the orange point $\mathbf{x}_2=(0,2)$, and its orange projection onto $\mathbf{u}$]

Also remember how the dot product of the vectors $\mathbf{x}_2=(0,2)$ and $\mathbf{u}=(2,1)$ is positive.

\begin{align*} \mathbf{x}_2 \cdot \mathbf{u} &= (0,2) \cdot (2,1) \\ &= 0 \cdot 2 + 2 \cdot 1 \\ &= 2 \end{align*}

This indicates that this point advances in the same direction as $\mathbf{u}$.

Example:

Now let's try visualizing the projection of the yellow point $\mathbf{x}_1=(-2,0)$ onto this same vector $\mathbf{u}=(2,1)$. We run into an extra complication when representing this projection, because the perpendicular from the point $\mathbf{x}_1=(-2,0)$ does not land on the green positional vector $\mathbf{u}=(2,1)$ itself.

In this case you can sketch and use another vector $\mathbf{w}=(-4,-2)$ (like the light green one below) that is parallel to $\mathbf{u}=(2,1)$ in the steps below.

  1. Find the point on the light green arrow $\mathbf{w}=(-4,-2)$ that is closest to the point $\mathbf{x}_1=(-2,0)$.

  2. Draw a yellow arrow starting from the origin (0,0) and ending at this point from (1). This new vector that you drew is parallel to $\mathbf{u}=(2,1)$ and we call it the projection of the point $\mathbf{x}_1$ onto the vector $\mathbf{u}$.

[Image: the light green vector $\mathbf{w}=(-4,-2)$, the yellow point $\mathbf{x}_1=(-2,0)$, and its yellow projection onto $\mathbf{u}$]

Also remember how the dot product of the vectors $\mathbf{x}_1=(-2,0)$ and $\mathbf{u}=(2,1)$ is negative.

\begin{align*} \mathbf{x}_1 \cdot \mathbf{u} &= (-2,0) \cdot (2,1) \\ &= -2 \cdot 2 + 0 \cdot 1 \\ &= -4 \end{align*}

This indicates that this point advances in the opposite direction of $\mathbf{u}$.

Interpreting Dot Products as Coordinates on a New Number Line:

We can think of the dot product $\mathbf{x}\cdot \mathbf{u}$ as a coordinate on a new number line, one that measures how far the projection of the point $\mathbf{x}$ advances along the vector $\mathbf{u}$.

[Image: the number line defined by $\mathbf{u}$, with the projections of $\mathbf{x}_1$ and $\mathbf{x}_2$ located at coordinates $-4$ and $2$]
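
To make this concrete, here is a minimal sketch that computes both of these new number-line coordinates, along with the projection vectors themselves using the standard projection formula $\frac{\mathbf{x}\cdot\mathbf{u}}{\mathbf{u}\cdot\mathbf{u}}\mathbf{u}$ (a formula we have not derived here):

import numpy as np

u = np.array([2, 1])
x1 = np.array([-2, 0])
x2 = np.array([0, 2])

# Coordinates of each point on the new "u number line" (the dot products)
print(x1 @ u, x2 @ u)              # -4  2

# The projection vectors themselves (the closest points on the line through u)
proj_x1 = (x1 @ u) / (u @ u) * u   # [-1.6 -0.8], parallel to u, opposite direction
proj_x2 = (x2 @ u) / (u @ u) * u   # [ 0.8  0.4], parallel to u, same direction
print(proj_x1, proj_x2)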

Dimensionality Reduction

Dimensionality Reduction of an Observation

Projecting onto a Single Vector

By projecting a given point $\mathbf{x}=(x_1,x_2,...,x_n)$ onto a single given vector $\mathbf{u}=(u_1,u_2,...,u_n)$, we are performing a type of what we call dimensionality reduction of the point $\mathbf{x}=(x_1,x_2,...,x_n)$.

Ex: For instance, suppose we choose to project the 4-dimensional point $\mathbf{x}_1=(6,7,8,9)$ onto the vector $\mathbf{u}=(1,0,0,0)$ by calculating their dot product.

$$\mathbf{x}_1\cdot\mathbf{u}=(6,7,8,9)\cdot(1,0,0,0) = 6\cdot1 + 7\cdot0 + 8\cdot0 + 9\cdot0 = 6$$

Then we are now representing our 4-dimensional point $\mathbf{x}_1=(6,7,8,9)$ with the 1-dimensional point (6).

[Image: the 1-dimensional representation (6) of the point $\mathbf{x}_1=(6,7,8,9)$]

Projecting onto Multiple Vectors

Furthermore, by projecting a given point $\mathbf{x}=(x_1,x_2,...,x_n)$ onto a set of $p$ vectors $\mathbf{u}_1, \mathbf{u}_2,...,\mathbf{u}_p$, we are also performing a type of what we call dimensionality reduction of the point $\mathbf{x}=(x_1,x_2,...,x_n)$.

Ex: For instance, suppose we choose to project the 4-dimensional point $\mathbf{x}=(6,7,8,9)$ first onto the vector $\mathbf{u}_1=(1,0,0,0)$ and then onto the vector $\mathbf{u}_2=(0,1,0,0)$ with the following two dot products.

$$\mathbf{x}\cdot\mathbf{u}_1=(6,7,8,9)\cdot(1,0,0,0) = 6\cdot1 + 7\cdot0 + 8\cdot0 + 9\cdot0 = 6$$
$$\mathbf{x}\cdot\mathbf{u}_2=(6,7,8,9)\cdot(0,1,0,0) = 6\cdot0 + 7\cdot1 + 8\cdot0 + 9\cdot0 = 7$$

Now we are representing our 4-dimensional point $\mathbf{x}=(6,7,8,9)$ with the 2-dimensional point (6,7).

[Image: the 2-dimensional representation (6,7) of the point $\mathbf{x}=(6,7,8,9)$]

Dimensionality Reduction of a Dataset

Similarly, we can choose to project multiple n-dimensional points (ie. rows) in a given dataset $X$ (with $n$ columns) onto a set of $p$ vectors $\mathbf{u}_1, \mathbf{u}_2,...,\mathbf{u}_p$. By doing so, we are performing a dimensionality reduction of the dataset $X$ and representing it with another dataset $\hat{X}$ which now only has $p$ columns.

Ex: For instance, we can take the following dataset $X_{old}$ below which has 4 columns and two observations.

import numpy as np
import pandas as pd

X_old = pd.DataFrame(np.array([[6,7,8,9], [10,11,12,13]]))
X_old

0 1 2 3
0 6 7 8 9
1 10 11 12 13

And then we can project the two rows in this dataset first onto the vector $\mathbf{u}_1=(1,0,0,0)$ and then onto the vector $\mathbf{u}_2=(0,1,0,0)$.

vectors = pd.DataFrame(np.array([[1, 0, 0, 0], [0,1,0,0]]))
vectors

0 1 2 3
0 1 0 0 0
1 0 1 0 0

First Row Dot Products
$$\mathbf{x}_1\cdot\mathbf{u}_1=(6,7,8,9)\cdot(1,0,0,0) = 6\cdot1 + 7\cdot0 + 8\cdot0 + 9\cdot0 = 6$$
$$\mathbf{x}_1\cdot\mathbf{u}_2=(6,7,8,9)\cdot(0,1,0,0) = 6\cdot0 + 7\cdot1 + 8\cdot0 + 9\cdot0 = 7$$

Second Row Dot Products
$$\mathbf{x}_2\cdot\mathbf{u}_1=(10,11,12,13)\cdot(1,0,0,0) = 10\cdot1 + 11\cdot0 + 12\cdot0 + 13\cdot0 = 10$$
$$\mathbf{x}_2\cdot\mathbf{u}_2=(10,11,12,13)\cdot(0,1,0,0) = 10\cdot0 + 11\cdot1 + 12\cdot0 + 13\cdot0 = 11$$

By doing so, we achieve a dimensionality reduction of $X_{old}$, which we call $X_{new}$, and which represents each of our 2 original row observations with just 2 dimensions.

X_new = pd.DataFrame(np.dot(X_old, vectors.T), columns=['new_coord_1', 'new_coord_2'])
X_new

new_coord_1 new_coord_2
0 6 7
1 10 11
[Image: the two rows of $X_{old}$ represented by their new 2-dimensional coordinates in $X_{new}$]

Goals of a Dimensionality Reduction Algorithm

There are many methods and algorithms that will reduce the dimension of a given point/observation or set of points/observations (ie. a dataset). Usually the goals of a dimensionality reduction algorithm are the following.

  1. Represent an $n$-dimensional dataset as a $p$-dimensional dataset, where $p < n$.

  2. This lower-dimensional representation should strive to preserve some “property” of the original observation (or dataset).

For instance, by projecting a given dataset onto the vectors $\mathbf{u}_1=(1,0,0,0)$ and $\mathbf{u}_2=(0,1,0,0)$ from above, what we end up preserving is the first two columns of the original dataset.

Full Projection with Principal Component Analysis

One of the most common types of dimensionality reduction methods uses what's called principal component analysis (PCA). In a principal component analysis, we take an original dataset $X$ with $n$ columns and project it onto a carefully selected set of $p$ vectors $\mathbf{u}_1, \mathbf{u}_2,...,\mathbf{u}_p$ which we call loading vectors. By doing so, we achieve a new dataset $X_{pc}$ which is a representation of $X$, that now only has $p$ columns.



A "full projection" of the original dataset $X$ with $n$ columns using PCA involves projecting $X$ onto a carefully selected set of $p=n$ vectors $\mathbf{u}_1, \mathbf{u}_2,...,\mathbf{u}_n$. Thus in this case, we achieve a new dataset $X_{pc}$ which is a *representation* of $X$, that also happens to have $n$ columns. This is technically not a "dimensionality reduction" of $X$ because we have not actually reduced the dimension of $X$, but just represented it in a different form.



However, in order to understand the properties of an actual dimensionality reduction of $X$ using PCA, which we'll discuss in 10.4, we need to understand the properties of the "full projection" of $X$ using PCA first. We'll discuss and demonstrate these properties by creating a "full projection" of our features matrix $X$ that is comprised of $n=50$ gene explanatory variable columns corresponding to 150 tumor rows.

X

X159 X960 X980 X986 X1023 X1028 X1064 X1092 X1103 X1109 ... X1563 X1574 X1595 X1597 X1609 X1616 X1637 X1656 X1657 X1683
0 -0.845493 1.526262 -2.218126 1.641149 1.210038 -1.801174 1.991829 -1.266751 1.803601 -1.871297 ... 1.864684 -1.849315 -2.016888 1.556149 -1.735853 -2.134520 -2.417362 1.599595 -2.097080 -1.964144
1 0.878424 1.692429 -2.046898 2.694920 2.512264 -1.506456 2.183692 -1.074956 2.121859 -1.969695 ... 1.638766 -1.804299 -0.930737 1.168764 -1.642269 -0.908221 -1.878902 2.048030 -1.720365 -2.294808
2 1.117994 2.625543 -2.273475 1.683788 2.095848 -2.155872 2.114443 -0.896636 1.831405 -2.276697 ... 2.129761 -2.133432 -2.173389 2.081198 -2.225724 -2.126557 -2.456196 2.128701 -1.512812 -1.910912
3 -0.536355 2.015416 -2.368843 2.558947 2.094437 -2.519153 2.217922 -1.444650 2.199390 -2.202884 ... 2.321338 -1.834924 -2.107330 1.617957 -1.959183 -2.088288 -2.346309 2.178543 -1.822892 -2.050343
4 1.375436 3.037213 -2.164606 2.821925 1.825272 -1.849208 2.575300 -0.942689 1.996785 -2.427971 ... 2.666453 -2.198057 -2.223759 2.240747 -2.315530 -2.106164 -2.127631 2.191884 -2.355830 -2.459397
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
145 -0.450093 -0.712083 0.062016 0.016073 -0.911054 0.286857 -0.612718 -0.141252 -0.305011 0.261221 ... -0.635939 0.311086 0.338626 -0.404976 0.409222 0.313488 0.191822 0.596065 1.289382 0.474246
146 -0.322474 -0.371127 0.362781 -0.755361 -0.730451 1.272614 -0.211888 1.198752 -0.345393 0.220046 ... -0.217614 0.566488 0.550564 -0.899421 0.523832 0.578086 0.294984 0.095218 0.990912 -0.431382
147 0.314081 -0.797145 0.664606 -0.661115 -1.025616 0.269226 -0.531681 -0.104954 -0.102499 -0.050642 ... -0.630174 0.984316 0.978791 0.736561 0.704657 0.997862 0.051828 0.678716 1.102883 0.424926
148 -0.227688 -0.953516 0.402115 -0.741837 -1.194118 -0.449345 -1.135600 -0.725638 -0.248574 -0.063934 ... -0.820660 0.209566 0.215154 -1.318596 0.365007 0.189423 -0.408029 -0.104882 0.257973 0.178702
149 -0.842105 -0.535221 0.296167 -0.580676 -0.848944 -0.045335 -0.533834 1.625074 -0.332672 0.289700 ... -0.876239 2.026926 1.962872 -1.269113 0.582182 2.036181 0.223438 -0.533956 0.989518 0.017268

150 rows × 50 columns

PCA() Function

First, we can use the PCA() function to instantiate a pca object that will contain all the information relevant to this particular principal component analysis that we are about to conduct.

from sklearn.decomposition import PCA
pca = PCA()

Principal Components

We then use the .fit_transform() function that corresponds to this pca object we created, along with our $X$ dataframe which is comprised of our $n=50$ gene explanatory variables.

This .fit_transform() function will:

  1. create our new projected dataset $X_{pc}$, which is a representation of $X$
  2. update our pca object to now hold relevant information about this particular principal component analysis which we can extract later.
X_pc = pca.fit_transform(X)
X_pc=pd.DataFrame(X_pc, columns = ['princ_comp_'+str(i) for i in range(1,51)])
X_pc

princ_comp_1 princ_comp_2 princ_comp_3 princ_comp_4 princ_comp_5 princ_comp_6 princ_comp_7 princ_comp_8 princ_comp_9 princ_comp_10 ... princ_comp_41 princ_comp_42 princ_comp_43 princ_comp_44 princ_comp_45 princ_comp_46 princ_comp_47 princ_comp_48 princ_comp_49 princ_comp_50
0 11.811550 -0.426965 0.493211 -0.215049 -0.878618 -0.206519 -0.519120 -0.359708 1.233738 0.110285 ... -0.145820 -0.006228 0.167487 0.011840 0.018605 0.004295 0.019660 0.044441 -0.022706 0.026468
1 13.408741 0.179091 -1.894700 0.760387 1.559272 -0.266017 0.808696 0.054908 -0.199177 1.005478 ... 0.115522 0.216464 -0.092097 0.093863 -0.078108 -0.109997 0.030465 0.023067 0.301279 0.030154
2 14.787795 0.175545 0.466971 0.567701 0.393446 0.312356 -0.569608 -0.072699 -0.221028 -0.064616 ... 0.040838 -0.119659 -0.058398 -0.020088 0.079735 -0.010872 0.071032 -0.063136 0.000997 0.017111
3 13.986303 -0.360751 0.165300 -0.224779 -0.262125 -0.041952 0.126688 -0.507498 0.405516 -0.145314 ... -0.204435 0.115458 0.015716 -0.187804 0.036921 -0.016803 -0.030482 0.091201 -0.148672 0.010797
4 16.180329 0.879694 -0.119710 0.722362 0.187244 -0.958821 -0.392817 0.736819 -0.962863 -0.592854 ... 0.005439 0.427350 -0.093517 -0.045853 -0.134727 -0.075841 0.063655 -0.091192 -0.091140 -0.032713
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
145 -2.912419 0.287644 -0.130006 -0.010599 0.045865 0.228536 0.230506 -1.267780 0.401567 -0.212682 ... 0.096869 -0.104359 -0.163481 -0.090162 0.288743 0.076196 0.063715 0.010581 0.003595 0.006235
146 -2.850561 -0.103263 -0.920080 0.333949 -0.920864 0.823344 0.172882 -0.068164 -1.087548 0.612276 ... 0.035487 0.134741 0.074192 0.096764 0.007018 -0.128810 0.113191 -0.003900 -0.013923 -0.006210
147 -3.663353 1.085507 -0.223835 1.232618 -0.004888 0.350161 0.286931 -0.539557 0.440470 0.096053 ... 0.233901 -0.041811 0.141582 -0.036514 0.000147 -0.150253 0.069814 0.068094 0.036242 0.014631
148 -3.140596 0.065543 1.327583 -0.130014 -0.310971 0.797273 0.083089 -1.209402 1.129151 0.537873 ... 0.171780 -0.040860 -0.107404 0.200736 -0.179975 0.008448 0.004026 -0.021564 -0.098246 0.012250
149 -4.010312 -0.542476 -1.810049 1.734627 -1.017348 1.445337 0.048484 0.950447 -0.949820 0.112723 ... 0.062236 0.047296 -0.199606 0.090891 0.065233 0.112811 0.027649 0.050115 0.005592 -0.002153

150 rows × 50 columns

Notice how our new projected $X_{pc}$ dataset representation of $X$ also happens to have $p=50$ columns and 150 rows, just like our $X$ matrix. We call each of the columns in this new projected matrix $X_{pc}$ principal components. So we'll call $X_{pc}$ our "principal component matrix".

PCA Loading Vectors

A "full projection" PCA involves projecting $X$ onto a carefully selected set of $p=n$ vectors $\mathbf{u}_1, \mathbf{u}_2,...,\mathbf{u}_n$.

For instance, the first principal component column in $X_{pc}$ is found by calculating the dot product of every row (ie. point) in $X$ with the first loading vector $\mathbf{u}_1$.

Similarly, the second principal component column in $X_{pc}$ is found by calculating the dot product of every row (ie. point) in $X$ with the second loading vector $\mathbf{u}_2$.

So on and so forth, in our case the 50th principal component column in $X_{pc}$ is found by calculating the dot product of every row (ie. point) in $X$ with the 50th loading vector $\mathbf{u}_{50}$.

We can extract these $p=50$ loading vectors used in our PCA by using the .components_ attribute that corresponds to our pca object.

pca.components_

array([[ 0.01781239, 0.15537908, -0.15654806, ..., 0.14085223,
-0.13114171, -0.14418143],
[ 0.41826166, -0.05396324, 0.02206133, ..., 0.16290096,
0.15927019, 0.06690061],
[ 0.36795989, -0.05588702, -0.04953479, ..., -0.13784997,
-0.2006779 , -0.11401138],
...,
[ 0.00847199, -0.06123676, -0.04837649, ..., 0.03208006,
-0.00696163, 0.00138566],
[-0.00808032, 0.0770841 , -0.12344374, ..., -0.02149068,
-0.00327142, 0.00225006],
[-0.00583861, -0.01124934, 0.03080762, ..., 0.01811343,
0.0161778 , -0.00200254]])

Let's put this output in a dataframe. In the pca.components_ array, each row represents one of our $p=50$ loading vectors $\mathbf{u}_1, \mathbf{u}_2,...,\mathbf{u}_n$, and each column corresponds to one of our $n=50$ genes.

In any given PCA, each of the $n$ values in a loading vector $\mathbf{u}$ will always correspond to one of the $n$ columns in $X$. To display the loading vectors as the columns of a dataframe (with the genes as the row index), we transpose this array.

U = pd.DataFrame(pca.components_.T, columns = ['loading_vector_'+str(i) for i in range(1,51)], index=X.columns)
U

loading_vector_1 loading_vector_2 loading_vector_3 loading_vector_4 loading_vector_5 loading_vector_6 loading_vector_7 loading_vector_8 loading_vector_9 loading_vector_10 ... loading_vector_41 loading_vector_42 loading_vector_43 loading_vector_44 loading_vector_45 loading_vector_46 loading_vector_47 loading_vector_48 loading_vector_49 loading_vector_50
X159 0.017812 0.155379 -0.156548 0.148403 0.134461 -0.124165 0.152606 -0.088624 0.150115 -0.151563 ... 0.150893 -0.146285 -0.146682 0.128307 -0.142528 -0.145048 -0.142858 0.140852 -0.131142 -0.144181
X960 0.418262 -0.053963 0.022061 -0.066759 -0.113198 -0.097637 -0.108869 -0.176995 -0.032802 0.010736 ... 0.091364 0.029216 0.031576 0.198690 0.048338 0.035256 0.014046 0.162901 0.159270 0.066901
X980 0.367960 -0.055887 -0.049535 -0.066882 -0.035670 -0.270309 -0.060840 -0.239211 -0.116554 -0.031616 ... -0.099543 -0.143113 -0.146161 -0.082369 -0.180748 -0.156990 -0.043061 -0.137850 -0.200678 -0.114011
X986 0.321827 0.032125 -0.015338 0.016729 -0.107646 0.075492 0.036442 0.379043 0.080649 -0.157446 ... -0.004458 0.302420 0.298987 0.149837 -0.041730 0.307181 -0.151724 0.041838 -0.094807 -0.158632
X1023 0.449596 0.007566 0.019430 0.028314 0.192433 -0.144174 -0.128311 -0.013929 0.101880 0.065064 ... 0.082733 -0.057208 -0.048007 0.332554 -0.035332 -0.044184 0.081733 -0.102344 0.038500 0.126927
X1028 0.012552 -0.137091 -0.194819 -0.279245 -0.165952 -0.197014 -0.158794 0.268200 0.000715 -0.225895 ... -0.054965 0.050405 0.040003 -0.066271 -0.121526 0.054273 0.070472 0.182095 0.175955 -0.157420
X1064 -0.006985 -0.074024 -0.025743 -0.012787 -0.235718 0.138390 0.076197 -0.615434 0.132836 0.027474 ... 0.048702 0.181687 0.185034 0.013396 -0.066054 0.191415 0.052137 0.014243 -0.046542 -0.053076
X1092 0.208612 0.015872 0.008116 -0.004398 -0.084464 -0.029236 0.002819 -0.004669 0.048315 0.108768 ... 0.091081 0.121988 0.119800 -0.014212 -0.062139 0.128846 0.351188 -0.021250 -0.327030 -0.123426
X1103 -0.199892 -0.137891 -0.025444 -0.024975 -0.301526 -0.192015 -0.017592 -0.055913 0.024744 -0.029922 ... 0.002084 -0.038875 -0.042675 0.278923 0.013883 -0.048093 -0.220070 0.101277 -0.219096 0.026283
X1109 0.220427 0.044067 -0.039397 -0.030270 0.071652 0.167517 0.038671 -0.264120 -0.200584 -0.012419 ... 0.054619 -0.015448 0.003418 -0.218856 0.169437 0.010289 -0.157870 0.247343 0.213033 -0.211683
X1124 0.226738 0.143314 0.019148 -0.066467 -0.001412 0.600210 0.052907 -0.023057 -0.012798 -0.000445 ... -0.071434 -0.099429 -0.110289 0.031458 -0.147915 -0.122546 0.073969 0.089819 0.019560 0.102216
X1136 -0.212652 0.000355 -0.046220 -0.204276 0.170182 -0.191195 -0.071165 -0.407776 0.127220 -0.073530 ... -0.007895 0.143734 0.115333 -0.015003 -0.122194 0.116786 -0.087645 0.108481 -0.040969 0.168734
X1141 0.031034 0.104602 0.083692 0.039691 0.229527 -0.166660 0.130406 -0.031224 -0.164275 0.000995 ... 0.287098 0.001881 -0.012239 0.270146 0.057396 0.001384 0.120654 0.343491 0.192130 0.101682
X1144 -0.019776 -0.070769 -0.088275 -0.113019 -0.368380 0.272053 0.041454 -0.059239 0.136801 -0.139080 ... 0.040382 -0.094908 -0.114377 0.218182 -0.019291 -0.120970 -0.000655 -0.092255 0.102593 -0.135703
X1169 0.122464 0.063279 0.085838 -0.005954 0.162777 0.120927 -0.025478 -0.063266 -0.169328 -0.033148 ... -0.121805 0.006134 0.015527 -0.206195 -0.031959 0.014036 -0.334315 -0.082831 -0.335507 -0.088748
X1173 -0.087877 0.126174 -0.167632 0.174898 0.392938 0.175711 0.050924 -0.043101 0.092097 -0.074625 ... 0.059981 0.078199 0.087285 0.098619 0.044807 0.091944 0.179118 -0.111550 -0.076816 -0.078286
X1179 0.124069 -0.018389 0.143086 0.036132 -0.152709 -0.087590 0.124492 -0.092206 0.094599 -0.120705 ... -0.204930 -0.032491 -0.015126 0.082339 0.344550 -0.027210 -0.070796 -0.178541 0.062485 -0.135851
X1193 -0.025286 -0.000506 0.031127 0.195155 -0.085712 -0.020155 0.059100 -0.023022 -0.053097 0.148340 ... -0.068399 0.004706 0.024460 -0.055410 -0.715891 0.023644 -0.113824 -0.054651 0.205696 -0.092311
X1203 0.146288 -0.000112 -0.031844 0.046176 -0.083271 -0.043014 0.080281 0.035944 0.113109 -0.068915 ... -0.075761 -0.025225 0.007921 -0.350700 -0.164963 0.012427 0.013668 0.174766 0.354395 0.189140
X1206 -0.103683 0.118282 0.092487 -0.019634 -0.239873 0.041272 0.044019 -0.013709 -0.131936 0.104485 ... -0.005504 0.015551 -0.013875 0.307346 0.037699 -0.022873 -0.099672 -0.075424 0.207749 -0.055867
X1208 -0.176899 0.050418 0.124804 0.056912 -0.050710 0.104602 -0.088192 -0.017534 -0.012420 0.030285 ... 0.021271 0.041982 0.053182 0.074382 -0.134492 0.084958 0.222618 -0.042078 0.073065 0.162275
X1219 -0.129378 -0.026354 -0.005932 -0.023590 0.184126 0.174404 -0.301034 -0.114992 -0.128453 -0.173306 ... -0.155266 0.017481 0.010794 0.166693 0.017666 0.051842 0.054568 0.128611 0.195751 -0.435202
X1232 0.022979 -0.080550 -0.026058 0.212002 -0.260184 0.070996 0.017039 -0.048644 -0.387971 0.085944 ... -0.083874 -0.037853 -0.002749 -0.022923 0.099221 0.007838 0.340935 0.075350 -0.149667 0.146910
X1264 0.048176 0.042611 0.047620 0.026092 0.073310 0.009335 0.076244 0.012739 -0.072920 0.065724 ... 0.232342 0.062241 0.027941 0.114471 -0.142282 0.014263 -0.188469 -0.428522 0.104502 -0.119752
X1272 -0.044456 0.017061 0.001810 0.571795 -0.153363 -0.126445 -0.142731 0.016108 -0.031726 0.051336 ... 0.176511 -0.017774 -0.014901 -0.138304 0.038297 -0.021671 -0.000403 0.046601 0.005657 -0.342960
X1292 0.065217 0.051259 0.048864 0.003276 -0.116805 0.124506 -0.114355 0.075897 0.375585 0.185275 ... 0.304785 -0.006561 -0.045690 -0.199783 0.067782 -0.043893 -0.002045 0.208273 -0.087984 -0.088076
X1297 -0.015462 0.041132 -0.062837 0.011807 -0.101603 0.211110 -0.080156 0.048335 -0.151611 -0.195408 ... 0.309848 0.016645 -0.021580 0.029653 0.011578 -0.022906 -0.361167 -0.082992 -0.076335 0.319307
X1329 -0.055715 -0.225197 0.163480 -0.323673 0.093945 0.140383 0.049372 0.036200 0.176198 0.136077 ... 0.144929 -0.077438 -0.041501 0.057085 -0.134709 -0.076264 0.180068 -0.083678 -0.075652 -0.251042
X1351 -0.034191 -0.019839 0.035499 -0.111579 -0.010350 -0.099824 -0.054332 -0.058706 -0.071143 0.282306 ... 0.278765 0.006697 -0.005509 -0.136270 0.208608 0.018433 -0.039524 -0.276192 0.288203 -0.176453
X1362 -0.039014 0.143290 -0.029425 0.115273 0.005437 -0.034730 -0.077483 0.012132 0.168286 0.196789 ... -0.185811 0.029291 -0.029093 0.269789 -0.120746 -0.022267 0.013550 0.115664 -0.038650 -0.045677
X1416 -0.077942 0.103474 -0.010782 -0.176538 -0.132968 -0.048242 0.038189 0.013779 -0.183229 -0.026409 ... 0.419022 -0.011417 -0.006745 -0.070683 -0.112683 -0.012535 0.091876 0.165991 -0.264339 -0.096445
X1417 -0.062266 -0.098479 -0.012124 -0.128917 0.110340 0.025609 0.042983 0.088347 0.001696 0.208110 ... -0.102891 -0.045851 -0.010878 0.034724 0.021485 -0.002683 -0.218073 0.302536 -0.070303 -0.187969
X1418 -0.006303 0.077226 -0.210539 0.066228 -0.062355 -0.029866 0.061995 -0.009788 0.201299 -0.127190 ... 0.212587 0.077293 0.000490 -0.042193 0.014618 -0.005061 0.041836 -0.087824 0.143552 0.097948
X1430 0.038380 -0.240830 -0.125565 -0.153055 0.100998 0.014254 -0.316553 0.022687 -0.093939 0.011461 ... 0.118002 0.058036 0.003768 -0.081724 -0.068105 0.016340 0.106612 -0.190753 0.037359 0.043977
X1444 -0.042209 0.518339 -0.113513 -0.186104 -0.106077 -0.094141 -0.260994 -0.018930 -0.070040 -0.202743 ... -0.061493 0.008912 -0.036933 -0.040251 -0.040975 -0.054904 0.134400 -0.144913 -0.030252 -0.060203
X1470 0.008846 -0.180399 0.019525 0.047119 0.026535 0.056868 -0.134981 0.038870 -0.190909 0.243379 ... 0.148297 0.077341 -0.008781 0.052572 -0.062891 -0.008664 -0.112598 0.111133 0.009755 0.088454
X1506 0.030345 0.175019 -0.214361 -0.000623 -0.009565 -0.084153 0.322153 -0.005773 -0.103015 0.013937 ... -0.056529 0.058241 0.016853 -0.086628 0.014799 -0.026429 0.168029 -0.099648 0.057114 -0.093141
X1514 -0.009418 0.024194 -0.183627 -0.259792 0.001257 -0.034902 0.391719 0.069026 -0.346416 0.071422 ... -0.004394 0.058937 0.024577 0.068888 -0.055119 0.030318 0.014202 0.029137 -0.005748 -0.070891
X1529 0.011946 0.226646 0.179226 -0.172660 -0.018406 -0.057620 0.230953 -0.001386 0.211190 0.016250 ... 0.069104 0.050053 -0.029073 -0.103728 0.030500 -0.011109 -0.077085 0.014407 0.056399 0.014344
X1553 0.003682 -0.285660 -0.210854 0.110515 0.029275 0.059826 0.315452 -0.001545 0.079583 -0.081392 ... 0.032816 0.060829 -0.050904 0.004010 0.048115 -0.077282 0.041003 -0.004929 -0.037913 -0.081544
X1563 0.053598 -0.086980 -0.195750 -0.031111 -0.040204 -0.013779 0.011093 -0.001053 0.171042 0.175140 ... -0.114530 -0.049130 0.033065 -0.112991 0.095346 0.014630 -0.021139 -0.021744 0.000519 0.087946
X1574 -0.005888 -0.401671 -0.134626 0.070906 0.110501 -0.006910 0.145766 -0.007757 -0.014523 -0.222011 ... 0.070701 -0.079197 0.048543 0.036022 -0.043104 0.061582 0.025979 -0.101712 0.036233 0.035401
X1595 -0.006642 -0.040717 0.312115 -0.005151 0.006293 0.001495 -0.066913 -0.008409 -0.055011 -0.327570 ... 0.128328 0.039176 -0.021649 -0.098670 0.017110 0.002082 0.044883 0.012209 -0.025054 -0.050363
X1597 -0.026063 0.081706 0.294589 -0.080806 -0.018562 -0.077256 0.216944 0.048691 -0.028898 0.048844 ... -0.006106 -0.124579 -0.020719 -0.021708 -0.036671 -0.008816 0.115317 -0.016079 0.013025 -0.068175
X1609 -0.023896 0.069817 -0.463192 -0.002227 -0.008592 0.065144 -0.080882 -0.008289 -0.000274 -0.008496 ... 0.095558 0.001356 -0.101700 0.001603 0.034910 -0.086209 0.008694 0.002314 0.017110 -0.054807
X1616 0.009050 0.114797 -0.291884 -0.088528 -0.015213 0.014439 -0.093520 0.024977 -0.002981 0.415652 ... -0.083823 0.139192 0.021364 0.029387 0.099757 -0.034024 -0.100652 -0.025852 -0.024180 0.034125
X1637 -0.019676 0.099173 -0.068873 0.069614 -0.028774 0.015506 0.044541 -0.011246 -0.035131 -0.069189 ... 0.034675 0.069013 0.082642 0.050530 -0.009205 0.033023 -0.056398 0.049371 0.019651 -0.028709
X1656 0.008472 -0.061237 -0.048376 -0.008102 0.017021 0.006985 -0.049196 0.022115 0.033606 -0.045214 ... 0.013024 -0.157708 -0.266469 0.019690 -0.037679 -0.324203 0.090757 0.032080 -0.006962 0.001386
X1657 -0.008080 0.077084 -0.123444 -0.030478 -0.024103 0.016439 -0.028242 0.006639 0.017509 0.035757 ... 0.064443 -0.801126 0.376263 0.028089 0.010205 0.345781 -0.029638 -0.021491 -0.003271 0.002250
X1683 -0.005839 -0.011249 0.030808 0.002976 0.011965 -0.001808 -0.019630 -0.003248 -0.010698 -0.019683 ... 0.005727 0.022715 0.717744 -0.000839 0.000008 -0.689050 0.007632 0.018113 0.016178 -0.002003

50 rows × 50 columns
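
One detail the description above glosses over: scikit-learn's PCA first centers each column of $X$ (ie. subtracts the column mean) before computing these dot products. With that in mind, we can verify the dot product relationship between $X$, the loading vectors, and $X_{pc}$ directly (a minimal sketch):

import numpy as np

X_centered = X - X.mean()                    # PCA centers each column of X first
u1 = pca.components_[0]                      # first loading vector (one row of pca.components_)
first_pc_by_hand = X_centered.values @ u1    # dot product of every (centered) row of X with u1

print(np.allclose(first_pc_by_hand, X_pc['princ_comp_1']))   # expect: True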

Principal Component Properties in a "Full PCA Projection"

Let's discuss and show four important properties that a "principal component matrix" $X_{pc}$ will always follow, specifically when $X_{pc}$ is a "full projection" (ie. has $p=n$ columns).

Principal Components Property 1: Correlations of 0

First, let's calculate the correlation of all pairs of principal component columns in our principal component matrix $X_{pc}$. Notice how ALL of the correlations between distinct principal component columns are 0! This property will hold for any principal component matrix $X_{pc}$ that we might create.

This is actually a great property for $X_{pc}$ to have! If we can trust that $X_{pc}$ is representing our original features matrix $X$ well, that is, preserving the aspects of $X$ that we desire it to, then we can instead use $X_{pc}$ as our features matrix to predict tumor size $y$. Because the new explanatory variable columns in $X_{pc}$ have correlations of 0, we no longer need to worry about multicollinearity issues!

X_pc.corr().round(3)

princ_comp_1 princ_comp_2 princ_comp_3 princ_comp_4 princ_comp_5 princ_comp_6 princ_comp_7 princ_comp_8 princ_comp_9 princ_comp_10 ... princ_comp_41 princ_comp_42 princ_comp_43 princ_comp_44 princ_comp_45 princ_comp_46 princ_comp_47 princ_comp_48 princ_comp_49 princ_comp_50
princ_comp_1 1.0 0.0 0.0 0.0 -0.0 -0.0 -0.0 0.0 -0.0 -0.0 ... 0.0 -0.0 -0.0 -0.0 -0.0 0.0 -0.0 0.0 -0.0 -0.0
princ_comp_2 0.0 1.0 0.0 0.0 0.0 -0.0 0.0 0.0 -0.0 0.0 ... -0.0 0.0 -0.0 0.0 0.0 -0.0 -0.0 0.0 0.0 0.0
princ_comp_3 0.0 0.0 1.0 -0.0 0.0 -0.0 0.0 -0.0 0.0 -0.0 ... -0.0 0.0 -0.0 0.0 0.0 0.0 0.0 -0.0 0.0 0.0
princ_comp_4 0.0 0.0 -0.0 1.0 -0.0 0.0 -0.0 0.0 -0.0 -0.0 ... 0.0 -0.0 -0.0 0.0 0.0 -0.0 0.0 0.0 0.0 -0.0
princ_comp_5 -0.0 0.0 0.0 -0.0 1.0 0.0 0.0 0.0 -0.0 -0.0 ... -0.0 -0.0 -0.0 0.0 -0.0 -0.0 -0.0 -0.0 0.0 -0.0
princ_comp_6 -0.0 -0.0 -0.0 0.0 0.0 1.0 0.0 -0.0 -0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 -0.0 0.0 -0.0 -0.0 0.0
princ_comp_7 -0.0 0.0 0.0 -0.0 0.0 0.0 1.0 -0.0 -0.0 -0.0 ... -0.0 0.0 -0.0 0.0 0.0 0.0 0.0 -0.0 0.0 -0.0
princ_comp_8 0.0 0.0 -0.0 0.0 0.0 -0.0 -0.0 1.0 -0.0 0.0 ... 0.0 -0.0 -0.0 -0.0 -0.0 0.0 0.0 -0.0 -0.0 -0.0
princ_comp_9 -0.0 -0.0 0.0 -0.0 -0.0 -0.0 -0.0 -0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 -0.0 0.0 0.0 0.0 0.0
princ_comp_10 -0.0 0.0 -0.0 -0.0 -0.0 0.0 -0.0 0.0 0.0 1.0 ... -0.0 -0.0 0.0 -0.0 -0.0 0.0 -0.0 0.0 -0.0 -0.0
princ_comp_11 0.0 0.0 0.0 -0.0 0.0 -0.0 0.0 -0.0 0.0 -0.0 ... 0.0 -0.0 0.0 0.0 0.0 -0.0 0.0 -0.0 0.0 0.0
princ_comp_12 0.0 -0.0 -0.0 0.0 -0.0 -0.0 0.0 -0.0 -0.0 -0.0 ... 0.0 0.0 0.0 -0.0 -0.0 -0.0 0.0 -0.0 0.0 -0.0
princ_comp_13 -0.0 0.0 -0.0 -0.0 0.0 -0.0 -0.0 -0.0 -0.0 0.0 ... -0.0 -0.0 -0.0 -0.0 -0.0 0.0 0.0 -0.0 -0.0 -0.0
princ_comp_14 0.0 -0.0 0.0 -0.0 0.0 0.0 -0.0 -0.0 -0.0 0.0 ... -0.0 -0.0 0.0 0.0 -0.0 -0.0 -0.0 -0.0 0.0 -0.0
princ_comp_15 -0.0 -0.0 0.0 0.0 -0.0 -0.0 0.0 -0.0 -0.0 0.0 ... 0.0 -0.0 0.0 -0.0 -0.0 0.0 -0.0 -0.0 -0.0 -0.0
princ_comp_16 -0.0 0.0 0.0 0.0 0.0 -0.0 -0.0 0.0 0.0 0.0 ... -0.0 -0.0 0.0 -0.0 -0.0 0.0 0.0 -0.0 -0.0 -0.0
princ_comp_17 -0.0 0.0 -0.0 0.0 0.0 -0.0 0.0 -0.0 -0.0 -0.0 ... 0.0 0.0 -0.0 0.0 0.0 0.0 0.0 0.0 -0.0 -0.0
princ_comp_18 -0.0 -0.0 0.0 0.0 -0.0 0.0 -0.0 -0.0 -0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 -0.0 -0.0 0.0 -0.0 -0.0
princ_comp_19 -0.0 0.0 0.0 0.0 0.0 -0.0 0.0 -0.0 -0.0 -0.0 ... 0.0 -0.0 -0.0 -0.0 -0.0 0.0 -0.0 -0.0 0.0 0.0
princ_comp_20 -0.0 -0.0 0.0 0.0 -0.0 -0.0 0.0 -0.0 0.0 0.0 ... -0.0 0.0 -0.0 -0.0 0.0 -0.0 0.0 -0.0 0.0 0.0
princ_comp_21 0.0 -0.0 0.0 0.0 -0.0 -0.0 0.0 -0.0 0.0 0.0 ... 0.0 0.0 -0.0 -0.0 -0.0 0.0 0.0 -0.0 -0.0 -0.0
princ_comp_22 -0.0 -0.0 0.0 -0.0 -0.0 -0.0 0.0 -0.0 -0.0 0.0 ... -0.0 0.0 0.0 0.0 -0.0 0.0 0.0 0.0 0.0 0.0
princ_comp_23 -0.0 -0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.0 0.0 ... -0.0 -0.0 -0.0 -0.0 0.0 -0.0 0.0 0.0 -0.0 0.0
princ_comp_24 0.0 0.0 0.0 0.0 0.0 -0.0 -0.0 0.0 -0.0 -0.0 ... 0.0 -0.0 0.0 0.0 -0.0 0.0 0.0 0.0 -0.0 0.0
princ_comp_25 -0.0 0.0 -0.0 -0.0 -0.0 0.0 0.0 -0.0 -0.0 0.0 ... 0.0 -0.0 -0.0 0.0 -0.0 0.0 0.0 -0.0 -0.0 0.0
princ_comp_26 0.0 0.0 -0.0 -0.0 0.0 -0.0 -0.0 0.0 -0.0 -0.0 ... -0.0 0.0 0.0 0.0 -0.0 -0.0 0.0 -0.0 -0.0 0.0
princ_comp_27 0.0 0.0 -0.0 0.0 -0.0 0.0 -0.0 -0.0 0.0 -0.0 ... 0.0 0.0 0.0 0.0 -0.0 -0.0 -0.0 0.0 -0.0 -0.0
princ_comp_28 -0.0 -0.0 0.0 -0.0 -0.0 -0.0 0.0 0.0 -0.0 -0.0 ... 0.0 -0.0 0.0 0.0 -0.0 -0.0 0.0 0.0 -0.0 0.0
princ_comp_29 -0.0 -0.0 0.0 0.0 -0.0 -0.0 0.0 0.0 0.0 -0.0 ... 0.0 -0.0 0.0 0.0 0.0 -0.0 -0.0 -0.0 0.0 0.0
princ_comp_30 0.0 0.0 -0.0 0.0 0.0 0.0 0.0 -0.0 0.0 -0.0 ... -0.0 -0.0 -0.0 -0.0 0.0 0.0 -0.0 -0.0 0.0 -0.0
princ_comp_31 -0.0 -0.0 0.0 0.0 -0.0 -0.0 0.0 0.0 0.0 -0.0 ... -0.0 0.0 0.0 0.0 -0.0 0.0 0.0 -0.0 -0.0 0.0
princ_comp_32 -0.0 -0.0 -0.0 0.0 -0.0 -0.0 -0.0 0.0 0.0 -0.0 ... -0.0 0.0 -0.0 -0.0 0.0 -0.0 0.0 0.0 0.0 0.0
princ_comp_33 -0.0 0.0 -0.0 -0.0 0.0 0.0 -0.0 0.0 -0.0 0.0 ... 0.0 -0.0 -0.0 0.0 -0.0 -0.0 -0.0 -0.0 -0.0 0.0
princ_comp_34 -0.0 0.0 0.0 0.0 0.0 -0.0 0.0 0.0 0.0 0.0 ... -0.0 0.0 -0.0 0.0 0.0 0.0 -0.0 0.0 -0.0 -0.0
princ_comp_35 0.0 -0.0 0.0 0.0 -0.0 0.0 -0.0 -0.0 0.0 0.0 ... -0.0 -0.0 0.0 -0.0 -0.0 0.0 0.0 -0.0 0.0 -0.0
princ_comp_36 -0.0 0.0 0.0 -0.0 -0.0 -0.0 0.0 0.0 -0.0 -0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 -0.0 0.0 0.0 0.0
princ_comp_37 0.0 -0.0 -0.0 -0.0 0.0 0.0 0.0 0.0 0.0 -0.0 ... 0.0 -0.0 0.0 0.0 0.0 -0.0 -0.0 0.0 -0.0 0.0
princ_comp_38 -0.0 -0.0 0.0 0.0 -0.0 0.0 -0.0 0.0 -0.0 -0.0 ... -0.0 0.0 0.0 -0.0 -0.0 -0.0 -0.0 -0.0 0.0 0.0
princ_comp_39 0.0 0.0 -0.0 0.0 0.0 0.0 -0.0 0.0 -0.0 -0.0 ... 0.0 0.0 0.0 0.0 -0.0 -0.0 -0.0 0.0 -0.0 -0.0
princ_comp_40 -0.0 0.0 -0.0 -0.0 -0.0 0.0 0.0 -0.0 -0.0 -0.0 ... 0.0 0.0 0.0 -0.0 -0.0 -0.0 -0.0 0.0 0.0 0.0
princ_comp_41 0.0 -0.0 -0.0 0.0 -0.0 0.0 -0.0 0.0 0.0 -0.0 ... 1.0 -0.0 -0.0 -0.0 0.0 0.0 0.0 -0.0 0.0 0.0
princ_comp_42 -0.0 0.0 0.0 -0.0 -0.0 0.0 0.0 -0.0 0.0 -0.0 ... -0.0 1.0 0.0 0.0 -0.0 -0.0 0.0 -0.0 0.0 -0.0
princ_comp_43 -0.0 -0.0 -0.0 -0.0 -0.0 0.0 -0.0 -0.0 0.0 0.0 ... -0.0 0.0 1.0 0.0 0.0 -0.0 -0.0 0.0 -0.0 -0.0
princ_comp_44 -0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.0 0.0 -0.0 ... -0.0 0.0 0.0 1.0 0.0 0.0 -0.0 -0.0 -0.0 -0.0
princ_comp_45 -0.0 0.0 0.0 0.0 -0.0 0.0 0.0 -0.0 0.0 -0.0 ... 0.0 -0.0 0.0 0.0 1.0 -0.0 0.0 0.0 -0.0 0.0
princ_comp_46 0.0 -0.0 0.0 -0.0 -0.0 -0.0 0.0 0.0 -0.0 0.0 ... 0.0 -0.0 -0.0 0.0 -0.0 1.0 -0.0 -0.0 0.0 -0.0
princ_comp_47 -0.0 -0.0 0.0 0.0 -0.0 0.0 0.0 0.0 0.0 -0.0 ... 0.0 0.0 -0.0 -0.0 0.0 -0.0 1.0 -0.0 0.0 0.0
princ_comp_48 0.0 0.0 -0.0 0.0 -0.0 -0.0 -0.0 -0.0 0.0 0.0 ... -0.0 -0.0 0.0 -0.0 0.0 -0.0 -0.0 1.0 0.0 0.0
princ_comp_49 -0.0 0.0 0.0 0.0 0.0 -0.0 0.0 -0.0 0.0 -0.0 ... 0.0 0.0 -0.0 -0.0 -0.0 0.0 0.0 0.0 1.0 0.0
princ_comp_50 -0.0 0.0 0.0 -0.0 -0.0 0.0 -0.0 -0.0 0.0 -0.0 ... 0.0 -0.0 -0.0 -0.0 0.0 -0.0 0.0 0.0 0.0 1.0

50 rows × 50 columns

Principal Components Property 2: Principal Component Variances in Descending Order

Next, let's calculate the variance of each of our $p=50$ principal component columns. The principal component columns returned by this PCA() function are always automatically sorted in order of descending variance. This property will hold for any principal component matrix $X_{pc}$ that we might create.

X_pc.var()

princ_comp_1 38.097594
princ_comp_2 2.084947
princ_comp_3 1.575693
princ_comp_4 1.139962
princ_comp_5 0.909499
princ_comp_6 0.814228
princ_comp_7 0.640248
princ_comp_8 0.537941
princ_comp_9 0.512215
princ_comp_10 0.352804
princ_comp_11 0.335422
princ_comp_12 0.315845
princ_comp_13 0.271892
princ_comp_14 0.248252
princ_comp_15 0.231304
princ_comp_16 0.213070
princ_comp_17 0.196217
princ_comp_18 0.178154
princ_comp_19 0.172938
princ_comp_20 0.151599
princ_comp_21 0.127468
princ_comp_22 0.116333
princ_comp_23 0.110676
princ_comp_24 0.099914
princ_comp_25 0.093986
princ_comp_26 0.087873
princ_comp_27 0.082091
princ_comp_28 0.076610
princ_comp_29 0.071974
princ_comp_30 0.059705
princ_comp_31 0.056508
princ_comp_32 0.051601
princ_comp_33 0.037665
princ_comp_34 0.037302
princ_comp_35 0.036107
princ_comp_36 0.032421
princ_comp_37 0.024838
princ_comp_38 0.024100
princ_comp_39 0.023174
princ_comp_40 0.019491
princ_comp_41 0.017513
princ_comp_42 0.016241
princ_comp_43 0.012114
princ_comp_44 0.010989
princ_comp_45 0.009619
princ_comp_46 0.008930
princ_comp_47 0.004783
princ_comp_48 0.003170
princ_comp_49 0.002186
princ_comp_50 0.000366
dtype: float64
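
We can confirm this descending ordering programmatically, and also note that these same variances are stored on the fitted pca object in its explained_variance_ attribute (a quick sketch):

import numpy as np

# The principal component variances never increase from one column to the next...
print((X_pc.var().diff().dropna() <= 0).all())             # expect: True

# ...and they match the variances stored on the fitted pca object
print(np.allclose(X_pc.var(), pca.explained_variance_))    # expect: True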

Principal Components Property 3: $X$ Column Variances Sum = $X_{pc}$ Column Variances Sum

Next, let's calculate the sum of these $p=50$ principal component column variances in $X_{pc}$.

X_pc_var_sum = X_pc.var().sum()
X_pc_var_sum

50.33557046979871

And let's compare this to the sum of the $n=50$ original $X$ column variances.

X_var_sum=X.var().sum()
X_var_sum

50.33557046979867

These two column variance sums of $X$ and $X_{pc}$ are the same! This property will hold for any principal component matrix $X_{pc}$ that we might create, when $X_{pc}$ is a "full projection" of $X$.
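
Rather than eyeballing the two printed sums, we can check this equality up to floating point error (a quick sketch):

import numpy as np

print(np.isclose(X_var_sum, X_pc_var_sum))                    # expect: True
print(np.isclose(X_var_sum, pca.explained_variance_.sum()))   # expect: True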

Principal Components Property 4: Efficient "Variance Capture"

Let's take a look at each of our $p=50$ principal component variances again. The variance of the first principal component is quite high in comparison to the variances of the remaining 49 principal components.

X_pc.var()

princ_comp_1 38.097594
princ_comp_2 2.084947
princ_comp_3 1.575693
princ_comp_4 1.139962
princ_comp_5 0.909499
princ_comp_6 0.814228
princ_comp_7 0.640248
princ_comp_8 0.537941
princ_comp_9 0.512215
princ_comp_10 0.352804
princ_comp_11 0.335422
princ_comp_12 0.315845
princ_comp_13 0.271892
princ_comp_14 0.248252
princ_comp_15 0.231304
princ_comp_16 0.213070
princ_comp_17 0.196217
princ_comp_18 0.178154
princ_comp_19 0.172938
princ_comp_20 0.151599
princ_comp_21 0.127468
princ_comp_22 0.116333
princ_comp_23 0.110676
princ_comp_24 0.099914
princ_comp_25 0.093986
princ_comp_26 0.087873
princ_comp_27 0.082091
princ_comp_28 0.076610
princ_comp_29 0.071974
princ_comp_30 0.059705
princ_comp_31 0.056508
princ_comp_32 0.051601
princ_comp_33 0.037665
princ_comp_34 0.037302
princ_comp_35 0.036107
princ_comp_36 0.032421
princ_comp_37 0.024838
princ_comp_38 0.024100
princ_comp_39 0.023174
princ_comp_40 0.019491
princ_comp_41 0.017513
princ_comp_42 0.016241
princ_comp_43 0.012114
princ_comp_44 0.010989
princ_comp_45 0.009619
princ_comp_46 0.008930
princ_comp_47 0.004783
princ_comp_48 0.003170
princ_comp_49 0.002186
princ_comp_50 0.000366
dtype: float64

This first principal component variance (38.098) by itself accounts for a large share of the sum of ALL the column variances of the original matrix $X$ (50.336).

X_pc.var()['princ_comp_1']

38.097594271680194

X_var_sum

50.33557046979867

By calculating the ratio of these two numbers, we say that principal component 1 "captures" 75.7% of the original column variance sum.

X_pc.var()['princ_comp_1']/X_var_sum

0.7568722061973797

Similarly, we might also calculate that principal component 2 "captures" 4.14% of the original column variance sum.

X_pc.var()['princ_comp_2']/X_var_sum

0.04142093996418921

We similarly can calculate this "variance capture" for each of our $p=50$ principal components. Notice how most of our principal components beyond the first few columns "capture" comparatively very little of the original column variance sum.

X_pc.var()/X_var_sum

princ_comp_1 0.756872
princ_comp_2 0.041421
princ_comp_3 0.031304
princ_comp_4 0.022647
princ_comp_5 0.018069
princ_comp_6 0.016176
princ_comp_7 0.012720
princ_comp_8 0.010687
princ_comp_9 0.010176
princ_comp_10 0.007009
princ_comp_11 0.006664
princ_comp_12 0.006275
princ_comp_13 0.005402
princ_comp_14 0.004932
princ_comp_15 0.004595
princ_comp_16 0.004233
princ_comp_17 0.003898
princ_comp_18 0.003539
princ_comp_19 0.003436
princ_comp_20 0.003012
princ_comp_21 0.002532
princ_comp_22 0.002311
princ_comp_23 0.002199
princ_comp_24 0.001985
princ_comp_25 0.001867
princ_comp_26 0.001746
princ_comp_27 0.001631
princ_comp_28 0.001522
princ_comp_29 0.001430
princ_comp_30 0.001186
princ_comp_31 0.001123
princ_comp_32 0.001025
princ_comp_33 0.000748
princ_comp_34 0.000741
princ_comp_35 0.000717
princ_comp_36 0.000644
princ_comp_37 0.000493
princ_comp_38 0.000479
princ_comp_39 0.000460
princ_comp_40 0.000387
princ_comp_41 0.000348
princ_comp_42 0.000323
princ_comp_43 0.000241
princ_comp_44 0.000218
princ_comp_45 0.000191
princ_comp_46 0.000177
princ_comp_47 0.000095
princ_comp_48 0.000063
princ_comp_49 0.000043
princ_comp_50 0.000007
dtype: float64
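
These ratios are common enough that scikit-learn precomputes them for us: the explained_variance_ratio_ attribute of the fitted pca object should match the series above (a quick check):

import numpy as np

print(np.allclose(X_pc.var() / X_var_sum, pca.explained_variance_ratio_))   # expect: True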

It is not uncommon to see principal component variances that look like this, where the variances of the first few principal components are much larger than the variances of later principal components. This is because the $p$ loading vectors $\mathbf{u}_1, \mathbf{u}_2,...,\mathbf{u}_p$ that are carefully selected and used in a principal component analysis are designed to achieve this effect.

"Maximal Variance Vector" $u_1$

That is, the first loading vector $\mathbf{u}_1$ found in PCA is designed so that the 1st principal component column in $X_{pc}$ that it creates will have a variance that is as high as possible.

"Second Maximal Variance Vector" $u_2$

Similarly, the second loading vector $\mathbf{u}_2$ found in PCA is designed so that the 2nd principal component column in $X_{pc}$ that it creates will have a variance that is as high as possible. However, this "second maximal variance vector" also has an additional constraint: it must be perpendicular to $\mathbf{u}_1$!

"Third Maximal Variance Vector" $u_3$

Keeping this process going, the third loading vector $\mathbf{u}_3$ found in PCA is designed so that the 3rd principal component column in $X_{pc}$ that it creates will have a variance that is as high as possible. However, this "third maximal variance vector" also has an additional constraint: it must be perpendicular to $\mathbf{u}_1$ and $\mathbf{u}_2$!

"Efficient Variance Capture"

This general process of finding the "kth maximal variance" vector in PCA keeps going until we have carefully selected all $p=50$ loading vectors and calculated their $p=50$ corresponding principal component columns in $X_{pc}$.
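
One fact we haven't emphasized above is that each loading vector returned by PCA() has length 1. Combined with the perpendicularity constraints, this means the dot product of any two distinct loading vectors is 0, while the dot product of a loading vector with itself is 1. We can check all of these pairwise dot products at once (a minimal sketch):

import numpy as np

# Every pairwise dot product of the 50 loading vectors:
# 1s on the diagonal (unit length), 0s everywhere else (perpendicular)
pairwise_dots = pca.components_ @ pca.components_.T
print(np.allclose(pairwise_dots, np.eye(50)))   # expect: True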

Dimensionality Reduction with Principal Component Analysis

In 10.3, we used PCA to create $p=50$ principal component columns in $X_{pc}$. This "full projection" $X_{pc}$ is not technically a dimensionality reduction of $X$ as it has the same number of columns as $X$.

If we'd like to actually reduce the number of dimensions (ie. columns) in $X_{pc}$ that we are using to represent the original matrix $X$, then all we need to do is select a subset of $p<n$ columns from $X_{pc}$.

But which of the $n$ possible principal component columns are best to select from $X_{pc}$? Furthermore, what is the best number of $p$ columns to select?

Goal of PCA Dimensionality Reduction

As with any dimensionality reduction algorithm, there is always something about the original dataset $X$ that we are trying to preserve. With a PCA dimensionality reduction algorithm, our goals are to:

  1. keep the number $p$ of selected principal component columns in $X_{pc}$ low, while
  2. keeping the sum of the variances of these $p$ selected principal components as high as possible.

Because the principal components that are returned by PCA() are already sorted in order of descending variance, this means that we will always want to select the first $p$ principal components that are returned.

X_pc.var()

princ_comp_1 38.097594
princ_comp_2 2.084947
princ_comp_3 1.575693
princ_comp_4 1.139962
princ_comp_5 0.909499
princ_comp_6 0.814228
princ_comp_7 0.640248
princ_comp_8 0.537941
princ_comp_9 0.512215
princ_comp_10 0.352804
princ_comp_11 0.335422
princ_comp_12 0.315845
princ_comp_13 0.271892
princ_comp_14 0.248252
princ_comp_15 0.231304
princ_comp_16 0.213070
princ_comp_17 0.196217
princ_comp_18 0.178154
princ_comp_19 0.172938
princ_comp_20 0.151599
princ_comp_21 0.127468
princ_comp_22 0.116333
princ_comp_23 0.110676
princ_comp_24 0.099914
princ_comp_25 0.093986
princ_comp_26 0.087873
princ_comp_27 0.082091
princ_comp_28 0.076610
princ_comp_29 0.071974
princ_comp_30 0.059705
princ_comp_31 0.056508
princ_comp_32 0.051601
princ_comp_33 0.037665
princ_comp_34 0.037302
princ_comp_35 0.036107
princ_comp_36 0.032421
princ_comp_37 0.024838
princ_comp_38 0.024100
princ_comp_39 0.023174
princ_comp_40 0.019491
princ_comp_41 0.017513
princ_comp_42 0.016241
princ_comp_43 0.012114
princ_comp_44 0.010989
princ_comp_45 0.009619
princ_comp_46 0.008930
princ_comp_47 0.004783
princ_comp_48 0.003170
princ_comp_49 0.002186
princ_comp_50 0.000366
dtype: float64

Property of a Subset of Principal Components

For instance, let's take a subset of our $n=50$ principal components. We'll select the first 3.

X_pc_subset = X_pc.iloc[:,0:3]
X_pc_subset

princ_comp_1 princ_comp_2 princ_comp_3
0 11.811550 -0.426965 0.493211
1 13.408741 0.179091 -1.894700
2 14.787795 0.175545 0.466971
3 13.986303 -0.360751 0.165300
4 16.180329 0.879694 -0.119710
... ... ... ...
145 -2.912419 0.287644 -0.130006
146 -2.850561 -0.103263 -0.920080
147 -3.663353 1.085507 -0.223835
148 -3.140596 0.065543 1.327583
149 -4.010312 -0.542476 -1.810049

150 rows × 3 columns

Notice how the sum of these three principal component variances in $X_{pc}$ (41.76) is smaller than the sum of our $n=50$ column variances in $X$ (50.34).

X_pc_subset.var()

princ_comp_1 38.097594
princ_comp_2 2.084947
princ_comp_3 1.575693
dtype: float64

X_pc_subset.var().sum()

41.75823345208681

X.var().sum()

50.33557046979867

This property will hold for any principal component analysis we might conduct.

To compare and recap...
  • sum($X$ column variances) $=$ sum(ALL $X_{pc}$ column variances)
  • sum($X$ column variances) $\geq$ sum(SUBSET OF $X_{pc}$ column variances)

Which principal components to use?

Recall that the principal components columns that are returned by the PCA() function are sorted in order of descending column variance. Thus if we would like to represent our $X$ matrix with, say $p=3$ principal component columns from $X_{pc}$, we should select the first three listed principal components, as these three will preserve the most amount of the sum of column variances in $X$.

X_pc.var()

princ_comp_1 38.097594
princ_comp_2 2.084947
princ_comp_3 1.575693
princ_comp_4 1.139962
princ_comp_5 0.909499
princ_comp_6 0.814228
princ_comp_7 0.640248
princ_comp_8 0.537941
princ_comp_9 0.512215
princ_comp_10 0.352804
princ_comp_11 0.335422
princ_comp_12 0.315845
princ_comp_13 0.271892
princ_comp_14 0.248252
princ_comp_15 0.231304
princ_comp_16 0.213070
princ_comp_17 0.196217
princ_comp_18 0.178154
princ_comp_19 0.172938
princ_comp_20 0.151599
princ_comp_21 0.127468
princ_comp_22 0.116333
princ_comp_23 0.110676
princ_comp_24 0.099914
princ_comp_25 0.093986
princ_comp_26 0.087873
princ_comp_27 0.082091
princ_comp_28 0.076610
princ_comp_29 0.071974
princ_comp_30 0.059705
princ_comp_31 0.056508
princ_comp_32 0.051601
princ_comp_33 0.037665
princ_comp_34 0.037302
princ_comp_35 0.036107
princ_comp_36 0.032421
princ_comp_37 0.024838
princ_comp_38 0.024100
princ_comp_39 0.023174
princ_comp_40 0.019491
princ_comp_41 0.017513
princ_comp_42 0.016241
princ_comp_43 0.012114
princ_comp_44 0.010989
princ_comp_45 0.009619
princ_comp_46 0.008930
princ_comp_47 0.004783
princ_comp_48 0.003170
princ_comp_49 0.002186
princ_comp_50 0.000366
dtype: float64

For instance, by calculating the ratio of the sum of the three highest principal component variances (41.76) and the sum of the original column variances in $X$ (50.34), we say that 82.96% of the sum of the original column variances in $X$ was preserved/captured by using just the first three principal components.

X_pc_subset.var()

princ_comp_1 38.097594
princ_comp_2 2.084947
princ_comp_3 1.575693
dtype: float64

X_pc_subset.var().sum()

41.75823345208681

X_var_sum

50.33557046979867

X_pc_subset.var().sum()/X_var_sum

0.8295969045814579

How many principal components to use?

Cumulative Principal Component Variance Capture

When deciding how many principal components are sufficient to use for our research goal, we can calculate the cumulative sum of each of the $n=50$ possible principal component variances using the .cumsum() method. This method cumulatively adds each corresponding value in a given series to the sum of the values that have come before it.

X_pc.var().cumsum()

princ_comp_1 38.097594
princ_comp_2 40.182541
princ_comp_3 41.758233
princ_comp_4 42.898196
princ_comp_5 43.807695
princ_comp_6 44.621923
princ_comp_7 45.262171
princ_comp_8 45.800112
princ_comp_9 46.312327
princ_comp_10 46.665131
princ_comp_11 47.000554
princ_comp_12 47.316399
princ_comp_13 47.588291
princ_comp_14 47.836543
princ_comp_15 48.067847
princ_comp_16 48.280917
princ_comp_17 48.477133
princ_comp_18 48.655288
princ_comp_19 48.828225
princ_comp_20 48.979824
princ_comp_21 49.107291
princ_comp_22 49.223624
princ_comp_23 49.334300
princ_comp_24 49.434214
princ_comp_25 49.528200
princ_comp_26 49.616074
princ_comp_27 49.698165
princ_comp_28 49.774774
princ_comp_29 49.846749
princ_comp_30 49.906454
princ_comp_31 49.962962
princ_comp_32 50.014563
princ_comp_33 50.052228
princ_comp_34 50.089530
princ_comp_35 50.125637
princ_comp_36 50.158058
princ_comp_37 50.182896
princ_comp_38 50.206996
princ_comp_39 50.230170
princ_comp_40 50.249661
princ_comp_41 50.267174
princ_comp_42 50.283414
princ_comp_43 50.295528
princ_comp_44 50.306517
princ_comp_45 50.316136
princ_comp_46 50.325065
princ_comp_47 50.329848
princ_comp_48 50.333019
princ_comp_49 50.335205
princ_comp_50 50.335570
dtype: float64

By dividing each of these cumulative variance sums by 50.34 (ie. the sum of the $n=50$ column variances in $X$), we can track what percent of the sum of the original column variances in $X$ is preserved by using the corresponding number of principal components.

  • For instance, by using just the first principal component, 75.7% would be preserved.
  • By using just the first and second principal components, 79.8% would be preserved.
  • So on and so forth, we can verify that by using all 50 principal components, 100% would be preserved.
percent_preserved = X_pc.var().cumsum()/X_var_sum
percent_preserved

princ_comp_1 0.756872
princ_comp_2 0.798293
princ_comp_3 0.829597
princ_comp_4 0.852244
princ_comp_5 0.870313
princ_comp_6 0.886489
princ_comp_7 0.899208
princ_comp_8 0.909896
princ_comp_9 0.920072
princ_comp_10 0.927081
princ_comp_11 0.933744
princ_comp_12 0.940019
princ_comp_13 0.945421
princ_comp_14 0.950353
princ_comp_15 0.954948
princ_comp_16 0.959181
princ_comp_17 0.963079
princ_comp_18 0.966618
princ_comp_19 0.970054
princ_comp_20 0.973066
princ_comp_21 0.975598
princ_comp_22 0.977909
princ_comp_23 0.980108
princ_comp_24 0.982093
princ_comp_25 0.983960
princ_comp_26 0.985706
princ_comp_27 0.987337
princ_comp_28 0.988859
princ_comp_29 0.990289
princ_comp_30 0.991475
princ_comp_31 0.992598
princ_comp_32 0.993623
princ_comp_33 0.994371
princ_comp_34 0.995112
princ_comp_35 0.995829
princ_comp_36 0.996473
princ_comp_37 0.996967
princ_comp_38 0.997446
princ_comp_39 0.997906
princ_comp_40 0.998293
princ_comp_41 0.998641
princ_comp_42 0.998964
princ_comp_43 0.999204
princ_comp_44 0.999423
princ_comp_45 0.999614
princ_comp_46 0.999791
princ_comp_47 0.999886
princ_comp_48 0.999949
princ_comp_49 0.999993
princ_comp_50 1.000000
dtype: float64

One idea: Find a Point of Diminishing Returns

By plotting these values, we might decide to look for a point of diminishing returns. For instance, after the addition of roughly the ninth principal component, we do not gain much additional percent preserved by adding any subsequent principal components.

import matplotlib.pyplot as plt

plt.figure(figsize=(15,5))
plt.plot(percent_preserved)
plt.xticks(rotation=90)
plt.show()
[Image: line plot of percent_preserved versus the number of principal components, leveling off after roughly the ninth component]

Thus, we might decide to use just 9 principal components, which preserves about 92% of the sum of the column variances in $X$.

Another idea: Minimum Variance Capture Threshold

Alternatively, subject matter experts of a particular application might suggest that principal component analyses that achieve a minimum of 80% original variance sum capture tend to yield sufficient results. Thus in this case we may decide to use just $p=3$ principal components, which is the lowest number of principal components that still achieves this 80% minimum variance capture threshold.

X_pc.var().cumsum()/X_var_sum

princ_comp_1 0.756872
princ_comp_2 0.798293
princ_comp_3 0.829597
princ_comp_4 0.852244
princ_comp_5 0.870313
princ_comp_6 0.886489
princ_comp_7 0.899208
princ_comp_8 0.909896
princ_comp_9 0.920072
princ_comp_10 0.927081
princ_comp_11 0.933744
princ_comp_12 0.940019
princ_comp_13 0.945421
princ_comp_14 0.950353
princ_comp_15 0.954948
princ_comp_16 0.959181
princ_comp_17 0.963079
princ_comp_18 0.966618
princ_comp_19 0.970054
princ_comp_20 0.973066
princ_comp_21 0.975598
princ_comp_22 0.977909
princ_comp_23 0.980108
princ_comp_24 0.982093
princ_comp_25 0.983960
princ_comp_26 0.985706
princ_comp_27 0.987337
princ_comp_28 0.988859
princ_comp_29 0.990289
princ_comp_30 0.991475
princ_comp_31 0.992598
princ_comp_32 0.993623
princ_comp_33 0.994371
princ_comp_34 0.995112
princ_comp_35 0.995829
princ_comp_36 0.996473
princ_comp_37 0.996967
princ_comp_38 0.997446
princ_comp_39 0.997906
princ_comp_40 0.998293
princ_comp_41 0.998641
princ_comp_42 0.998964
princ_comp_43 0.999204
princ_comp_44 0.999423
princ_comp_45 0.999614
princ_comp_46 0.999791
princ_comp_47 0.999886
princ_comp_48 0.999949
princ_comp_49 0.999993
princ_comp_50 1.000000
dtype: float64
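
Rather than scanning the printed series by eye, we can compute the smallest $p$ that clears the threshold directly (a minimal sketch; the 0.80 value is just the example threshold suggested above):

import numpy as np

threshold = 0.80
# Index of the first principal component whose cumulative capture reaches the threshold
p = int(np.argmax(percent_preserved.values >= threshold)) + 1
print(p)   # 3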

Another idea: Select the $p$ that Yields the Best Predictive Model Results

Recall that our particular plan for this PCA of $X$ is to use a subset of columns in $X_{pc}$ as our features matrix to predict tumor size $y$. Given this, we might also choose to use the number of columns $p$ of $X_{pc}$ that gives us the best regression model prediction results. We will use this strategy in 10.4.
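
As a preview, one way this strategy could look is the following hedged sketch: for each candidate $p$, fit a linear regression on the first $p$ principal components and compare cross-validated prediction scores. (The 5-fold split and default $R^2$ scoring here are illustrative choices, not necessarily what we will use in 10.4.)

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

cv_scores = {}
for p in range(1, 51):
    # Cross-validated score of a linear regression using only the first p principal components
    scores = cross_val_score(LinearRegression(), X_pc.iloc[:, 0:p], y, cv=5)
    cv_scores[p] = scores.mean()

best_p = max(cv_scores, key=cv_scores.get)
best_p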

Principal Component Regression

We now have an understanding of how we can use PCA to project our original features matrix $X$ onto a new representation of this features matrix $X_{pc}$. Thus, we can then use a carefully selected subset of columns from $X_{pc}$ as our new features matrix to predict our breast tumor size response variable in a basic linear regression model. We call this technique of using the principal components as explanatory variables in a regression model principal component regression (PCR).

Benefits of using a Principal Component Regression

No Multicollinearity

Remember that the correlation between any pair of principal component columns in $X_{pc}$ is 0. Thus, none of our new explanatory variables from $X_{pc}$ that we will use in the regression model will be collinear.
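As a quick check, the sketch below (assuming X_pc is the principal component DataFrame created above) verifies that the largest off-diagonal correlation in $X_{pc}$ is essentially zero, up to floating point error.

import numpy as np

#Largest correlation between any two distinct principal component columns
corr = X_pc.corr().values
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
np.abs(off_diag).max()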

Trust the Model Slope Interpretations

Without multicollinearity issues, we can have more trust in our interpretation of our new resulting linear regression slopes.

Less Likely to Overfit

Furthermore, not having any collinear explanatory variables may decrease the chance of our model overfitting to the training dataset. Thus, this new model may be able to yield better test dataset performance.

Full Dataset Principal Component Regression

Now, if our goal were simply to make predictions for the full dataset (ie. all the rows) using a set of principal components, then we could fit a single linear regression model with this full dataset. Let's suppose that we have decided ahead of time that we want to use the first three principal components.

#Full principal component features matrix
X_pc.iloc[:,0:3].head()

princ_comp_1 princ_comp_2 princ_comp_3
0 11.811550 -0.426965 0.493211
1 13.408741 0.179091 -1.894700
2 14.787795 0.175545 0.466971
3 13.986303 -0.360751 0.165300
4 16.180329 0.879694 -0.119710
#Full target array
y.head()

0 13.0
1 15.0
2 15.0
3 20.0
4 10.0
Name: size, dtype: float64

Fitting the PCR Model

Then, to perform principal component regression on the "full dataset", we can simply instantiate a new LinearRegression() object and .fit() it with our first three principal components as the features matrix and our breast tumor size target array $y$.

lin_reg_pcr = LinearRegression() 
lin_reg_pcr.fit(X_pc.iloc[:,0:3], y)
LinearRegression()

Extracting this principal component regression model's intercept and slopes, we see that we have fit the following model.

$$\hat{size} = 22.96 -0.39pc_1-0.73pc_2+0.28pc_3$$

print('Intercept:', lin_reg_pcr.intercept_)
df_slopes = pd.DataFrame(lin_reg_pcr.coef_.T, columns = ['slopes'], index=['pc1', 'pc2', 'pc3'])
df_slopes

Intercept: 22.96

slopes
pc1 -0.391468
pc2 -0.726656
pc3 0.275390

PCR Model Predictions

If we wanted to extract the tumor size predictions for the 150 observations in our full dataset, we would want to make sure that the input to our .predict() function is the first three principal components of the full dataset $X$, rather than the dataset $X$ itself.

y_pred = lin_reg_pcr.predict(X_pc.iloc[:,0:3])
y_pred

array([18.78223536, 17.05898452, 17.17208611, 17.79247049, 15.95371266,
16.92216003, 18.28549809, 17.42602432, 17.25829098, 19.56258496,
18.87368979, 17.5406931 , 16.69711724, 17.5370352 , 16.59402649,
18.080839 , 18.8681992 , 17.74428173, 17.3165964 , 17.8102937 ,
19.09251529, 17.49145873, 16.84044545, 22.75663624, 23.35646358,
22.53406711, 21.89349112, 22.34098344, 23.73493596, 23.26829746,
23.79740106, 22.25244063, 22.72381252, 22.66298912, 22.04713142,
23.30997459, 23.51692273, 23.4614089 , 20.81238838, 23.33675538,
23.7775952 , 22.31739973, 22.26779547, 23.23129963, 23.21622935,
23.87628578, 20.80809791, 24.06720198, 25.12756053, 24.54637757,
25.03023919, 20.82344775, 24.51886577, 20.81280865, 23.53400027,
23.95787424, 24.41863092, 24.23575736, 23.70329892, 24.80429707,
25.09099107, 24.10131249, 25.19659412, 23.52513061, 25.12856823,
25.32914828, 23.0105964 , 25.92130806, 24.04406905, 25.05416293,
22.67613225, 22.68416055, 22.97011836, 23.73409273, 24.15341063,
23.87912757, 16.98039267, 24.20918844, 24.55144161, 24.59814211,
24.77736544, 24.99726953, 25.6272155 , 23.37908305, 25.18345741,
17.78085024, 25.04379738, 25.8089857 , 22.30382519, 24.91829514,
21.20770507, 24.37274658, 23.01224947, 24.05804845, 25.0388183 ,
24.48331429, 24.52291178, 23.91242497, 24.80045229, 24.86095048,
24.54159674, 24.50110256, 25.22391314, 25.95776742, 26.11947206,
24.10773636, 24.5605316 , 25.4344583 , 26.12746542, 25.08741727,
24.36252458, 25.05653278, 24.82378462, 24.16451572, 24.80178323,
25.57057798, 24.97564244, 25.94237941, 22.55138943, 24.61369877,
24.27727434, 24.8725372 , 24.24639502, 24.67863461, 20.96094766,
22.7522888 , 24.71510595, 22.1013685 , 25.53608709, 24.44036447,
25.06923728, 24.14398794, 24.11888633, 25.01942252, 25.10714894,
23.51393539, 22.78532681, 23.03104612, 27.40491195, 24.7436145 ,
23.92580002, 24.87809634, 24.6085368 , 22.81619218, 24.98874072,
23.85529911, 23.89755984, 23.54365388, 24.50741925, 24.4256344 ])

Full Dataset R^2

Thus, if we want to calculate the R^2 of the full dataset, we can do so with the actual and predicted tumor sizes of the full dataset.

r2_full = r2_score(y, y_pred)
r2_full

0.09278220321022701

However, given that our overarching goal is to build a linear regression model that will perform well with new datasets, rather than with the dataset that trained the model, this full dataset R^2 value will not be very informative of this PCR model's performance on new datasets.

Train-Test-Split Principal Component Regression - No Parameter Tuning

Rather than fitting a principal component regression model to the full dataset, we can:

  1. train a single principal component regression model with the principal components of a single training dataset and
  2. test this principal component regression model with the principal components of a single test dataset.

Let's also suppose we know ahead of time that we'd like to use just the first 3 principal components.

1. Create Training Data Principal Components

We want to be careful about the way that we go about creating our training and test dataset principal components, given that the goal of this analysis is to determine how a trained principal component regression model might perform on a new dataset.

What to do?

Given this goal, we specifically want to create our training dataset principal component matrix $X_{pctrain}$ from just $X_{train}$ itself.

pca_train = PCA()
X_pc_train = pca_train.fit_transform(X_train)
X_pc_train=pd.DataFrame(X_pc_train, columns = ['princ_comp_'+str(i) for i in range(1,51)], index=X_train.index)
X_pc_train

princ_comp_1 princ_comp_2 princ_comp_3 princ_comp_4 princ_comp_5 princ_comp_6 princ_comp_7 princ_comp_8 princ_comp_9 princ_comp_10 ... princ_comp_41 princ_comp_42 princ_comp_43 princ_comp_44 princ_comp_45 princ_comp_46 princ_comp_47 princ_comp_48 princ_comp_49 princ_comp_50
0 15.154212 0.378553 -0.320491 -0.674058 0.680039 0.040436 0.196779 0.152997 0.079250 0.254449 ... 0.121378 0.186821 -0.079747 -0.003711 0.147100 0.053695 -0.100830 0.032327 0.013352 -0.000843
1 15.108695 0.152287 -0.481896 0.681803 -0.078833 0.333664 -0.614771 -0.323650 0.145248 0.432646 ... 0.277414 -0.130298 -0.032602 0.020076 0.053498 -0.145137 0.003475 0.028177 -0.006085 0.009547
2 -2.907332 -2.117009 0.629097 -1.916650 -1.164595 -0.589500 -0.122971 0.064672 0.287769 0.173999 ... -0.100008 0.075933 0.038485 -0.044895 -0.062338 -0.054181 -0.081461 0.047390 -0.017325 -0.035101
3 -1.049041 -0.058965 -2.136942 -1.164711 0.057740 -0.973921 0.353416 0.792569 0.959555 0.173803 ... 0.014045 0.026185 -0.106634 0.002292 -0.117111 -0.038452 -0.006669 0.028538 0.024617 -0.024165
4 -3.096553 -0.699373 0.253451 0.669848 0.415624 0.320489 0.740290 -0.327692 -0.008696 0.308461 ... -0.143137 -0.003342 0.221769 -0.069819 0.066691 -0.084811 -0.037378 -0.005318 0.010782 -0.020877
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
130 -1.300137 0.637101 -0.677342 -0.671490 0.884207 0.022748 -1.015565 -0.175094 -0.743117 -0.012716 ... -0.145279 -0.056100 -0.029594 0.201205 -0.073815 -0.043745 0.026978 0.001158 -0.001572 0.000592
131 -3.365041 -1.758297 1.787233 -0.666252 0.475666 -1.392063 -0.414421 -0.524241 0.563024 -0.194544 ... 0.109103 0.080474 0.216132 0.050841 0.010901 -0.175149 -0.086981 -0.000202 0.044988 0.028282
132 -3.492955 -0.798863 0.091437 0.532173 1.027900 0.952705 0.971850 -0.713013 -0.860726 0.123826 ... -0.352325 0.096231 -0.022505 0.135272 -0.132143 -0.017553 0.046214 0.028089 -0.008525 0.001676
133 -3.797796 0.116999 -0.708484 2.017865 0.113257 -0.590945 1.129595 1.146102 -0.441518 0.054279 ... -0.059823 -0.028992 0.182056 -0.032827 0.071474 0.079713 0.017983 0.087996 -0.022466 -0.023059
134 12.100270 -0.566505 -0.652507 -0.422493 0.694564 -0.095119 -0.566419 1.208140 -0.176023 -0.333702 ... -0.093814 -0.027542 -0.263283 0.017713 -0.107719 -0.064030 -0.002155 0.002189 -0.056152 0.013871

135 rows × 50 columns

What not to do?

Let's contrast this with what we do not want to do, which is:

  1. first create the principal component matrix $X_{pc}$ of the full features matrix $X$
  2. and then split the rows of the full principal component matrix $X_{pc}$ into a training and test dataset.

Notice how the principal component values for the 25th observation from the original features matrix are different for $X_{pc}$ vs. $X_{pctrain}$. This is because when a set of principal components is created from a given dataset, every row observation in the original dataset has an impact on all the rows in the resulting principal component matrix.

X_pc.iloc[[25],:]

princ_comp_1 princ_comp_2 princ_comp_3 princ_comp_4 princ_comp_5 princ_comp_6 princ_comp_7 princ_comp_8 princ_comp_9 princ_comp_10 ... princ_comp_41 princ_comp_42 princ_comp_43 princ_comp_44 princ_comp_45 princ_comp_46 princ_comp_47 princ_comp_48 princ_comp_49 princ_comp_50
25 -1.876304 1.729524 0.349771 -0.844223 -0.080117 -0.484791 -0.757216 0.315798 -0.353992 0.212031 ... 0.13748 -0.103587 0.045167 0.099001 -0.04951 0.114287 0.021333 -0.00387 0.007436 -0.040517

1 rows × 50 columns

X_pc_train.loc[[25],:]

princ_comp_1 princ_comp_2 princ_comp_3 princ_comp_4 princ_comp_5 princ_comp_6 princ_comp_7 princ_comp_8 princ_comp_9 princ_comp_10 ... princ_comp_41 princ_comp_42 princ_comp_43 princ_comp_44 princ_comp_45 princ_comp_46 princ_comp_47 princ_comp_48 princ_comp_49 princ_comp_50
25 -3.178595 0.131896 -0.763482 0.284198 -1.328848 0.295244 -0.119101 -0.037461 -0.770827 -0.088768 ... 0.079618 -0.219801 -0.064785 0.031257 -0.031685 -0.037161 -0.063803 0.008031 0.027785 -0.006889

1 rows × 50 columns

Why?

If your goal is to build a fixed principal component regression model, and then use it to predict the breast tumor size of any new dataset of breast tumor tissue samples that comes along perhaps years later, then you will not have all of these new datasets "in your hands" when you try to create the $X_{pc}$ principal components. Thus, in order to design a model that best mimics what will happen in real life, you should create the training dataset principal components directly via just $X_{train}$.

2. Fit a Linear Regression Model with the Training Principal Components

Next, let's use the first three training principal components to train a new linear regression model. Extracting this model's intercept and slopes, we see that we have fit the following model.

$$\hat{size} = 22.88 -0.36pc_1-0.74pc_2-0.47pc_3$$

lin_reg_pcr_train = LinearRegression() 
lin_reg_pcr_train.fit(X_pc_train.iloc[:,0:3], y_train)
LinearRegression()
print('Intercept:',lin_reg_pcr_train.intercept_)
df_slopes = pd.DataFrame(lin_reg_pcr_train.coef_.T, columns = ['slopes'], index=['pc1', 'pc2', 'pc3'])
df_slopes

Intercept: 22.881481481481483

slopes
pc1 -0.362056
pc2 -0.741294
pc3 -0.466778

3. Create Test Data Principal Components

Next, we'd like to test our principal component regression model with the first three test dataset principal components. We also want to be careful about how we create the test dataset principal components.

Importance of Matching Explanatory Variable Units

When making predictions with a regression model, it is important that the units of your inputs match the units of the corresponding explanatory variables that were used to train the model. Otherwise, the predictions will become misleading and nonsensical.

For instance, suppose a simple linear regression model was trained to predict a person's height (in) given their weight (kg).

$$\hat{height} = 100 + 1.5(weight_{kg})$$

If instead we were to use a person's weight in (lbs) as input (which has a larger scale than kg), then this model would give a wildly inflated prediction for this person's height!
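For instance, for a hypothetical person who weighs 70 kg (about 154 lbs), using the correct units gives

$$\hat{height} = 100 + 1.5(70) = 205$$

whereas mistakenly plugging in the same weight in lbs gives

$$\hat{height} = 100 + 1.5(154) = 331$$

which is a wildly inflated prediction.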

What are the "units" of a training principal component?

Each of our training principal components can be thought of as a particular type of "measurement" that comes with a particular type of "unit." For instance, the "unit" that corresponds to training principal component 1 "measures" how far a given point "advances" along the training loading vector $\mathbf{u}_{train1}$.

How do we make the test principal component "units" match the training principal component "units"?

So in order for our test principal component "units" to match the training principal component "units", we need our test principal components to specifically measure how far along each of the training loading vectors $\mathbf{u}_{train1}, \mathbf{u}_{train2},...,\mathbf{u}_{trainp}$ the observations in the test dataset "advance". Thus, to do this we need to calculate the dot product of each observation in the test dataset with each training loading vector $\mathbf{u}_{train1}, \mathbf{u}_{train2},...,\mathbf{u}_{trainp}$.
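As a minimal sketch (assuming pca_train is the PCA() object that we fit with $X_{train}$ in step 1 above), this is roughly what that dot product calculation looks like by hand for a single test observation. Note that scikit-learn's PCA also centers each observation with the training column means (stored in pca_train.mean_) before taking the dot products.

import numpy as np

#One test observation, projected onto the training loading vectors by hand
x_new = X_test.iloc[0].values
scores_by_hand = (x_new - pca_train.mean_) @ pca_train.components_.T

#scores_by_hand should match the first row of the test principal components created below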

Slightly Different PCA Code

The code to create the test principal components in this particular manner is shown below.

X_pc_test = pca_train.transform(X_test)
X_pc_test=pd.DataFrame(X_pc_test, columns = ['princ_comp_'+str(i) for i in range(1,51)], index=X_test.index)
X_pc_test.head()

princ_comp_1 princ_comp_2 princ_comp_3 princ_comp_4 princ_comp_5 princ_comp_6 princ_comp_7 princ_comp_8 princ_comp_9 princ_comp_10 ... princ_comp_41 princ_comp_42 princ_comp_43 princ_comp_44 princ_comp_45 princ_comp_46 princ_comp_47 princ_comp_48 princ_comp_49 princ_comp_50
0 12.273654 1.258480 -0.535613 -0.468472 1.260134 -0.988088 0.222236 -0.024651 -1.087082 -0.079052 ... 0.472153 -0.431586 -0.330107 -0.032902 -0.243829 -0.258314 -0.078710 0.096953 -0.216197 -0.066499
1 -4.642753 -1.055885 0.864788 -0.056576 0.944469 -1.448501 -0.020915 0.460650 -0.387290 0.429761 ... 0.065958 -0.114075 0.096415 -0.083839 0.001999 -0.090375 -0.145598 -0.083447 -0.042976 0.036695
2 -3.550102 0.162141 -1.735819 2.703701 0.370880 -0.760811 0.683817 1.010507 -0.233910 -0.142261 ... 0.379743 -0.333020 0.031706 0.082612 0.198936 0.036783 -0.145777 0.073758 0.001752 -0.024326
3 -4.704339 -0.559763 1.226415 0.804478 -0.024969 0.344427 0.345989 0.055987 -0.340693 0.009730 ... -0.230930 0.023548 0.062292 0.147554 -0.241371 -0.133440 -0.021751 -0.043838 -0.008331 -0.041117
4 -4.523153 -1.402877 1.882620 -0.084833 1.044504 -0.745928 0.628752 0.099121 0.112402 -0.677695 ... -0.074426 0.188613 0.022592 0.135899 -0.104108 -0.056394 0.098333 -0.039768 -0.083790 -0.016413

5 rows × 50 columns

This code is slightly different from the code that we used to create the training principal components. Notice in the summary below that we DID NOT fit a new PCA() object on $X_{test}$ in order to create the test principal components $X_{pctest}$.

#Training Principal Components
pca_train = PCA()
X_pc_train = pca_train.fit_transform(X_train)
#Test Principal Components
X_pc_test = pca_train.transform(X_test)

Rather than defining a new PCA() container pca_test, one that would be fit with the $X_{test}$ dataset, we instead used the same PCA() container pca_train that was fit with the $X_{train}$ dataset.

By using the .transform() function corresponding to pca_train, rather than the .fit_transform() function, we indicate that we would like to use training loading vectors $\mathbf{u}_{train1}, \mathbf{u}_{train2},...,\mathbf{u}_{trainp}$ contained in the corresponding pca_train to create our $X_{pctest}$.

This ensures that the units of the columns in $X_{pctest}$ match the units of the columns in $X_{pctrain}$ that were used to train the model.
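For contrast, here is a hypothetical sketch of the approach that we avoided; the names pca_test and X_pc_test_wrong are illustrative only and do not appear anywhere in this analysis.

#What NOT to do: fitting a separate PCA() container on the test dataset would create
#test "units" based on the test loading vectors, which would not match the training
#principal component "units" that were used to train the model.
#pca_test = PCA()
#X_pc_test_wrong = pca_test.fit_transform(X_test)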

4. Test the Model with the Test Principal Components

Finally, let's use our carefully created test principal components (just the first 3) to test our model.

y_pred_test = lin_reg_pcr_train.predict(X_pc_test.iloc[:,0:3])
y_pred_test

array([17.75483684, 24.94147632, 24.85686621, 24.42720243, 24.68029682,
21.85478643, 17.62796346, 18.84334036, 23.90686002, 24.05411386,
23.13374648, 24.20332363, 23.09336472, 26.15524415, 23.68880049])

r2_test = r2_score(y_test, y_pred_test)
r2_test

0.14582777723204277

Conclusion

Notice how the test R^2=0.146 of this PCR model is still much lower than the best test R^2=0.285 that we have found so far with a model in this module (8.4.8).

However, let's consider that our decision to use the first $p=3$ principal components was made in a naive fashion. Could it be the case that using a different number $p$ of principal components would have yielded a higher test R^2? We'll consider this question in 10.4.4 below.

Train-Test-Split Principal Component Regression - With Parameter Tuning

Given that the number $p$ of principal components to use in our PCR model is left up to us, we can employ what we call parameter tuning to help us select the value of $p$ that best helps us meet our goals. In our case, we'd like to select the number $p$ of principal components that will yield the highest test dataset R^2.

Using a for loop, for each value of p=1,2,...,49 we

  • train a PCR model using the first $p$ training principal components, and
  • test the corresponding PCR model using the first $p$ test principal components.

We can see that by using $p=38$ principal components, we achieve a test dataset R^2=0.349, which is the highest amongst all of the PCR models created with our training dataset.

However, we should note that we observe quite a bit of fluctuation in the test dataset R^2 as we increase the number of principal components. It tends to decrease, then increase, then decrease again, and so on. This may indicate that the variance preserving mechanism of PCA is not necessarily capturing the elements of $X$ that contribute the most to the predictive power of the model.

for p in range(1,50):
    #1. Train a model with p training principal components
    lin_reg_pcr_train = LinearRegression()
    lin_reg_pcr_train.fit(X_pc_train.iloc[:,0:p], y_train)

    #2. Test a model with p test principal components
    y_pred_test = lin_reg_pcr_train.predict(X_pc_test.iloc[:,0:p])
    r2_test = r2_score(y_test, y_pred_test)
    print('Test R^2 with', p, 'Principal Components:', r2_test)

Test R^2 with 1 Principal Components: 0.1771482395318289
Test R^2 with 2 Principal Components: 0.16510593739330526
Test R^2 with 3 Principal Components: 0.14582777723204277
Test R^2 with 4 Principal Components: 0.12793685271961208
Test R^2 with 5 Principal Components: 0.13720764268741847
Test R^2 with 6 Principal Components: 0.022999674094031475
Test R^2 with 7 Principal Components: 0.009630215645979767
Test R^2 with 8 Principal Components: 0.006491275865039947
Test R^2 with 9 Principal Components: 0.012162541341550703
Test R^2 with 10 Principal Components: 0.01027589690976638
Test R^2 with 11 Principal Components: 0.020146157428211797
Test R^2 with 12 Principal Components: -0.04453329858806643
Test R^2 with 13 Principal Components: 0.013482123323740991
Test R^2 with 14 Principal Components: 0.03508855928927401
Test R^2 with 15 Principal Components: 0.025561195205017317
Test R^2 with 16 Principal Components: 0.06279152824723644
Test R^2 with 17 Principal Components: 0.091883501847151
Test R^2 with 18 Principal Components: 0.10037082276858333
Test R^2 with 19 Principal Components: 0.1604071440345306
Test R^2 with 20 Principal Components: 0.16464811400871326
Test R^2 with 21 Principal Components: 0.16115922136218397
Test R^2 with 22 Principal Components: 0.22183257703635617
Test R^2 with 23 Principal Components: 0.23964571536585721
Test R^2 with 24 Principal Components: 0.24307373407536015
Test R^2 with 25 Principal Components: 0.23644120461610119
Test R^2 with 26 Principal Components: 0.22939746916611248
Test R^2 with 27 Principal Components: 0.2340352435540347
Test R^2 with 28 Principal Components: 0.27455015963684115
Test R^2 with 29 Principal Components: 0.2718721776176467
Test R^2 with 30 Principal Components: 0.26353705148694073
Test R^2 with 31 Principal Components: 0.2278966775045821
Test R^2 with 32 Principal Components: 0.06393682638126907
Test R^2 with 33 Principal Components: -0.048347910361851376
Test R^2 with 34 Principal Components: -0.05233288908195455
Test R^2 with 35 Principal Components: -0.0912873354966679
Test R^2 with 36 Principal Components: 0.29990253193343286
Test R^2 with 37 Principal Components: 0.31956204947552447
Test R^2 with 38 Principal Components: 0.3490294220670397
Test R^2 with 39 Principal Components: 0.30021254840725364
Test R^2 with 40 Principal Components: 0.1762991003672567
Test R^2 with 41 Principal Components: 0.16918194085811789
Test R^2 with 42 Principal Components: 0.1783015707585105
Test R^2 with 43 Principal Components: 0.16264811179998573
Test R^2 with 44 Principal Components: 0.2291240366657168
Test R^2 with 45 Principal Components: 0.14656182139889862
Test R^2 with 46 Principal Components: -0.25577780753575596
Test R^2 with 47 Principal Components: -0.36465711656342203
Test R^2 with 48 Principal Components: -0.4141472010172742
Test R^2 with 49 Principal Components: -0.7550178960526708
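As a minimal sketch (using the same loop as above), we could also collect these test R^2 values in a list so that the best $p$ can be identified programmatically rather than read off of the printout.

import numpy as np

test_r2s = []
for p in range(1,50):
    lin_reg_pcr_train = LinearRegression()
    lin_reg_pcr_train.fit(X_pc_train.iloc[:,0:p], y_train)
    y_pred_test = lin_reg_pcr_train.predict(X_pc_test.iloc[:,0:p])
    test_r2s.append(r2_score(y_test, y_pred_test))

#Best number of principal components and its test R^2 (p=38 for the output above)
best_p = int(np.argmax(test_r2s)) + 1
best_p, max(test_r2s)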

Conclusion

Great! The test R^2=0.349 of our best PCR model was better than the best test R^2=0.285 that we have found so far with an elastic net model in 8.4.8. So if we only planned to use a single training and test dataset to infer how well our model might do with new datasets, then this best PCR model would be the one that we would want to select.

Cross-Validation, Parameter Tuning, and Principal Component Regression

Finally, let's not forget the shortcomings that we discussed in 9.1 when it comes to using just a single training and test dataset to infer how well a given model might do when it comes to making predictions with new datasets.

So let's wrap this module up by using cross-validation and principal component regression. We'll also use parameter tuning as well to select the best number of training principal components to use that will achieve the highest average test fold R^2.

Same k=5 Test Folds

For an "apples-to-apples" comparision, let's use the same random $k=5$ test folds that we used to cross-validate our linear regression and elastic net models that we explored in 9.5.3.

cross_val = KFold(n_splits=5, shuffle=True, random_state=207)

Full Features Matrix Principal Components $X_{pc}$

Rather than creating distinct principal components ($X_{pctrain}$ and $X_{pctest}$) for each training and test features matrix pair $X_{train}$ and $X_{test}$, we will supply the principal components $X_{pc}$ that were calculated from the full features matrix $X$ (containing all rows).

X_pc.head()

princ_comp_1 princ_comp_2 princ_comp_3 princ_comp_4 princ_comp_5 princ_comp_6 princ_comp_7 princ_comp_8 princ_comp_9 princ_comp_10 ... princ_comp_41 princ_comp_42 princ_comp_43 princ_comp_44 princ_comp_45 princ_comp_46 princ_comp_47 princ_comp_48 princ_comp_49 princ_comp_50
0 11.811550 -0.426965 0.493211 -0.215049 -0.878618 -0.206519 -0.519120 -0.359708 1.233738 0.110285 ... -0.145820 -0.006228 0.167487 0.011840 0.018605 0.004295 0.019660 0.044441 -0.022706 0.026468
1 13.408741 0.179091 -1.894700 0.760387 1.559272 -0.266017 0.808696 0.054908 -0.199177 1.005478 ... 0.115522 0.216464 -0.092097 0.093863 -0.078108 -0.109997 0.030465 0.023067 0.301279 0.030154
2 14.787795 0.175545 0.466971 0.567701 0.393446 0.312356 -0.569608 -0.072699 -0.221028 -0.064616 ... 0.040838 -0.119659 -0.058398 -0.020088 0.079735 -0.010872 0.071032 -0.063136 0.000997 0.017111
3 13.986303 -0.360751 0.165300 -0.224779 -0.262125 -0.041952 0.126688 -0.507498 0.405516 -0.145314 ... -0.204435 0.115458 0.015716 -0.187804 0.036921 -0.016803 -0.030482 0.091201 -0.148672 0.010797
4 16.180329 0.879694 -0.119710 0.722362 0.187244 -0.958821 -0.392817 0.736819 -0.962863 -0.592854 ... 0.005439 0.427350 -0.093517 -0.045853 -0.134727 -0.075841 0.063655 -0.091192 -0.091140 -0.032713

5 rows × 50 columns

Parameter Tuning

Similarly, using a for loop, for each value of p=1,2,...,49 our cross_val_score() function will do the following.

  • Train 5 PCR models, with each model using the first $p$ training principal components.
  • Then test the 5 PCR models with the 5 corresponding test datasets, also using the first $p$ test principal components.

Best Model

From the output below, we can see that the PCR model that achieves the highest average test fold R^2=0.0583 is the one that uses $p=2$ principal components.

for p in range(1,50):
    #1. Create a new linear regression object
    pcr = LinearRegression()

    #2. Perform cross-validation on this pcr model, using p principal components of the full features matrix X
    test_fold_r2 = cross_val_score(pcr, X_pc.iloc[:,0:p], y, cv=cross_val, scoring="r2")

    #3. Results
    print('p=',p)
    print('Mean Test Fold R^2', test_fold_r2.mean())
    print('Std Test Fold R^2:', test_fold_r2.std())
    print('---------------')

p= 1
Mean Test Fold R^2 0.05310765287126964
Std Test Fold R^2: 0.07273512278352448
---------------
p= 2
Mean Test Fold R^2 0.05830284045014955
Std Test Fold R^2: 0.08812894801403544
---------------
p= 3
Mean Test Fold R^2 0.03912345098634906
Std Test Fold R^2: 0.08297872707696613
---------------
p= 4
Mean Test Fold R^2 0.0472817877237212
Std Test Fold R^2: 0.07430376610069425
---------------
p= 5
Mean Test Fold R^2 0.049473310593190246
Std Test Fold R^2: 0.05513493789616346
---------------
p= 6
Mean Test Fold R^2 0.016554709650679088
Std Test Fold R^2: 0.06809987285976846
---------------
p= 7
Mean Test Fold R^2 -0.000982724843883953
Std Test Fold R^2: 0.08086871091659864
---------------
p= 8
Mean Test Fold R^2 -0.011064790395554724
Std Test Fold R^2: 0.08140491828900553
---------------
p= 9
Mean Test Fold R^2 -0.02251704776606862
Std Test Fold R^2: 0.08847370206160866
---------------
p= 10
Mean Test Fold R^2 -0.028704981008047394
Std Test Fold R^2: 0.08959677040016283
---------------
p= 11
Mean Test Fold R^2 -0.06274270560448458
Std Test Fold R^2: 0.09676662175160836
---------------
p= 12
Mean Test Fold R^2 -0.0902277639299566
Std Test Fold R^2: 0.08802751205731629
---------------
p= 13
Mean Test Fold R^2 -0.09991441050026692
Std Test Fold R^2: 0.09026949746309668
---------------
p= 14
Mean Test Fold R^2 -0.07250036075608199
Std Test Fold R^2: 0.07870318177492441
---------------
p= 15
Mean Test Fold R^2 -0.10109944490395328
Std Test Fold R^2: 0.09248631329344988
---------------
p= 16
Mean Test Fold R^2 -0.27006021937749847
Std Test Fold R^2: 0.19866282628248066
---------------
p= 17
Mean Test Fold R^2 -0.25078983993041104
Std Test Fold R^2: 0.192636233681652
---------------
p= 18
Mean Test Fold R^2 -0.26564322444570687
Std Test Fold R^2: 0.20344738388557507
---------------
p= 19
Mean Test Fold R^2 -0.3749819577350667
Std Test Fold R^2: 0.34801699765198957
---------------
p= 20
Mean Test Fold R^2 -0.30237328094464455
Std Test Fold R^2: 0.21006125160587524
---------------
p= 21
Mean Test Fold R^2 -0.3082019700433073
Std Test Fold R^2: 0.20530899734714947
---------------
p= 22
Mean Test Fold R^2 -0.34652974609712694
Std Test Fold R^2: 0.2392278133283507
---------------
p= 23
Mean Test Fold R^2 -0.3328883997777091
Std Test Fold R^2: 0.23597768835366584
---------------
p= 24
Mean Test Fold R^2 -0.3611770930688762
Std Test Fold R^2: 0.21581089072635354
---------------
p= 25
Mean Test Fold R^2 -0.3710784083989951
Std Test Fold R^2: 0.237813257448039
---------------
p= 26
Mean Test Fold R^2 -0.3702338787966342
Std Test Fold R^2: 0.2692002232703087
---------------
p= 27
Mean Test Fold R^2 -0.3406059152410257
Std Test Fold R^2: 0.185906926582966
---------------
p= 28
Mean Test Fold R^2 -0.3692598397870996
Std Test Fold R^2: 0.22654759000799246
---------------
p= 29
Mean Test Fold R^2 -0.4211444307259775
Std Test Fold R^2: 0.28504924304405005
---------------
p= 30
Mean Test Fold R^2 -0.48690553733766995
Std Test Fold R^2: 0.24274786194035097
---------------
p= 31
Mean Test Fold R^2 -0.510241479627088
Std Test Fold R^2: 0.24684044810102407
---------------
p= 32
Mean Test Fold R^2 -0.5163731413218677
Std Test Fold R^2: 0.2506752703018624
---------------
p= 33
Mean Test Fold R^2 -0.5107750429641869
Std Test Fold R^2: 0.2244761338866777
---------------
p= 34
Mean Test Fold R^2 -0.445010464831512
Std Test Fold R^2: 0.19134555739034284
---------------
p= 35
Mean Test Fold R^2 -0.45256586833627654
Std Test Fold R^2: 0.12986146376824173
---------------
p= 36
Mean Test Fold R^2 -0.43132736567673147
Std Test Fold R^2: 0.18505640828553216
---------------
p= 37
Mean Test Fold R^2 -0.4907449792565076
Std Test Fold R^2: 0.1656744425534545
---------------
p= 38
Mean Test Fold R^2 -0.5171595988703066
Std Test Fold R^2: 0.207966911585072
---------------
p= 39
Mean Test Fold R^2 -0.4809096142840904
Std Test Fold R^2: 0.1954058299506181
---------------
p= 40
Mean Test Fold R^2 -0.5376703952955421
Std Test Fold R^2: 0.19454069689939923
---------------
p= 41
Mean Test Fold R^2 -0.6341104853075585
Std Test Fold R^2: 0.23812594852998725
---------------
p= 42
Mean Test Fold R^2 -0.6342207124467839
Std Test Fold R^2: 0.24878119872078844
---------------
p= 43
Mean Test Fold R^2 -0.6920152089227156
Std Test Fold R^2: 0.5271161695076473
---------------
p= 44
Mean Test Fold R^2 -0.6854927870883453
Std Test Fold R^2: 0.5333256664530581
---------------
p= 45
Mean Test Fold R^2 -0.6925657018243203
Std Test Fold R^2: 0.516075803376406
---------------
p= 46
Mean Test Fold R^2 -0.7985937028778137
Std Test Fold R^2: 0.5437815475390569
---------------
p= 47
Mean Test Fold R^2 -0.816041062730563
Std Test Fold R^2: 0.5401928127071236
---------------
p= 48
Mean Test Fold R^2 -0.7756565927748866
Std Test Fold R^2: 0.4232556430813156
---------------
p= 49
Mean Test Fold R^2 -0.8516585701940915
Std Test Fold R^2: 0.4077510018936373
---------------

Analysis Shortcomings

Were you able to spot the "hand waving" that we had to employ in our PCR cross-validation analysis in 10.4.5? Unfortunately, the cross-validation analysis that we just conducted was not as rigorous as the train-test-split analysis that we conducted for PCR in 10.4.3.

In our cross_val_score() function, we simply supplied a subset of principal component columns from the full $X_{pc}$. Unfortunately, what the cross_val_score() function does on the back-end is randomly split the rows of this $X_{pc}$ features matrix, creating 5 training and test dataset pairs.

So unfortunately, for a given training and test dataset pair considered by the cross_val_score() function, we are:

  1. NOT creating a distinct set of training principal components based on just the training features matrix $X_{train}$ as we do in 10.4.3, and
  2. NOT creating a distinct set of test principal components based on loading vectors that correspond to the training dataset as we do in 10.4.3.

So in other words, we can no longer guarantee that the units of the explanatory variables being used to train the model exactly match the units of the explanatory variables that are being used to test the model.

However, the larger a full dataset is, the more likely it is that these two sets of "units" will be close enough.
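As a minimal sketch of how this shortcoming could be avoided (this goes beyond the analysis conducted above), we could wrap the PCA step and the linear regression inside a scikit-learn Pipeline, which re-fits the loading vectors on the training folds only within each cross-validation split. The sketch below assumes X, y, and cross_val are defined as above.

from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

p = 2   #number of principal components to keep (an illustrative choice)
pcr_pipeline = Pipeline([
    ('pca', PCA(n_components=p)),   #fit on the training fold rows only
    ('reg', LinearRegression())
])

#Each training fold gets its own loading vectors; each test fold is projected onto them
test_fold_r2 = cross_val_score(pcr_pipeline, X, y, cv=cross_val, scoring="r2")
test_fold_r2.mean()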

Conclusion

Great! When performing cross-validation on our PCR model, we found that the average test fold R^2 of our best model (0.0583) was better than the best average test fold R^2 (0.0418) that we have seen so far out of all of our other cross-validation model analyses in 9.5.4.

Note, however, that the best model found with PCR and cross-validation yielded different conclusions than the best model found with PCR and just a single training and test dataset. Our single training and test dataset showed that a PCR model using 38 principal components yielded the best results, whereas our cross-validation analysis with PCR models showed that using just 2 principal components yielded the best results.

However, we want to keep in mind that our PCR cross-validation involved some slight "hand waving", as discussed in 10.4.6. So we may want to take these results with a "grain of salt".