Breast Cancer Research Introduction


To fully demonstrate the benefits of the remaining feature selection techniques that we'll discuss in this module, let's introduce a new dataset and a new research goal.

Breast Tumor Tissue Sample Dataset

Chanrion et al. (2008) [1] reported results of a study of 155 patients treated for breast cancer with tamoxifen. The patients were followed for a period of time and diagnosed as having a recurrence of breast cancer (R) or being recurrence free (RF). Various clinical measurements were made, including tumor size at the time of treatment. Gene expression was measured for a large number of gene sequences. Here we focus on a sample of 50 gene sequences.

[1] Chanrion, M., et al. "A Gene Expression Signature that Can Predict the Recurrence of Tamoxifen-Treated Primary Breast Cancer." Clinical Cancer Research 14(6), March 15, 2008.

We can see that this dataset comprises what we call the expression levels of 50 gene sequences for 150 breast tumor tissue samples that were collected. This dataset also includes the size of each tumor.

import pandas as pd

#Load the breast tumor tissue sample dataset
df = pd.read_csv('breast_tumor.csv')
df.head()

X159 X960 X980 X986 X1023 X1028 X1064 X1092 X1103 X1109 ... X1574 X1595 X1597 X1609 X1616 X1637 X1656 X1657 X1683 size
0 0.739512 1.971036 -1.660842 2.777183 3.299062 -1.954834 2.784970 0.411848 3.974931 -0.979751 ... -1.735267 -2.036134 3.114202 -1.567229 -2.100264 -2.067843 2.116261 -2.057195 -0.868534 13.0
1 2.037903 2.197854 -1.263034 4.082346 5.426886 -1.732520 3.085890 0.688056 4.503384 -1.185032 ... -1.646363 0.127756 2.772590 -1.451107 0.267480 -1.526069 2.643856 -1.625604 -1.415037 15.0
2 2.218338 3.471559 -1.789433 2.829994 4.746466 -2.222392 2.977280 0.944858 4.021099 -1.825502 ... -2.296393 -2.347923 3.577213 -2.175087 -2.084889 -2.106915 2.738768 -1.387816 -0.780555 15.0
3 0.972344 2.638734 -2.010999 3.913935 4.744161 -2.496426 3.139577 0.155651 4.632121 -1.671513 ... -1.706846 -2.216318 3.168707 -1.844349 -2.010999 -1.996352 2.797407 -1.743066 -1.010999 20.0
4 2.412235 4.033491 -1.536501 4.239650 4.304348 -1.991067 3.700095 0.878536 4.295705 -2.141092 ... -2.424026 -2.448274 3.717911 -2.286523 -2.045515 -1.776328 2.813104 -2.353637 -1.687061 10.0

5 rows × 51 columns

df.shape

(150, 51)

df.columns

Index(['X159', 'X960', 'X980', 'X986', 'X1023', 'X1028', 'X1064', 'X1092',
'X1103', 'X1109', 'X1124', 'X1136', 'X1141', 'X1144', 'X1169', 'X1173',
'X1179', 'X1193', 'X1203', 'X1206', 'X1208', 'X1219', 'X1232', 'X1264',
'X1272', 'X1292', 'X1297', 'X1329', 'X1351', 'X1362', 'X1416', 'X1417',
'X1418', 'X1430', 'X1444', 'X1470', 'X1506', 'X1514', 'X1529', 'X1553',
'X1563', 'X1574', 'X1595', 'X1597', 'X1609', 'X1616', 'X1637', 'X1656',
'X1657', 'X1683', 'size'],
dtype='object')

If we calculate some basic summary statistics for each of these 50 gene sequences, we can see that a gene sequence's expression level can take on both positive and negative values. Generally speaking, the higher the gene expression value, the more that gene is "expressed" in the tissue sample; the lower the value, the less that gene is "expressed" in the tissue sample.

summary_stats = df.describe()
summary_stats

X159 X960 X980 X986 X1023 X1028 X1064 X1092 X1103 X1109 ... X1574 X1595 X1597 X1609 X1616 X1637 X1656 X1657 X1683 size
count 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000 ... 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000
mean 1.376306 -0.112320 3.492451 0.744515 1.321874 -0.596157 -0.339048 2.236122 0.980135 2.924173 ... 1.917094 1.982022 1.741923 0.586703 2.021079 0.364399 0.234298 0.345367 2.377693 22.960000
std 0.755686 1.369578 2.331048 1.242713 1.639463 0.756856 1.573671 1.444945 1.666017 2.093202 ... 1.981597 1.998929 0.884797 1.245007 1.937274 1.009526 1.180466 1.149509 1.658281 8.722323
min -0.382146 -1.942775 -2.028855 -1.172946 -0.831602 -2.496426 -2.309684 -1.584963 -0.775162 -2.302563 ... -2.496426 -2.548893 0.182386 -2.376563 -2.180813 -2.195256 -1.317740 -2.548893 -2.071866 9.000000
25% 0.949494 -0.849483 3.886548 0.031332 0.110678 -0.951850 -1.256071 0.950987 -0.013481 2.764891 ... 1.326367 1.434842 1.150739 0.207832 1.405858 0.299315 -0.560667 -0.191817 2.228602 18.000000
50% 1.469655 -0.596223 4.345319 0.407620 0.737692 -0.534296 -0.875781 2.450144 0.354814 3.561006 ... 2.249980 2.335982 1.561707 0.893613 2.291399 0.617323 -0.184270 0.648666 2.818252 20.000000
75% 1.894190 -0.239456 4.797746 0.783101 1.840496 -0.162224 -0.300901 3.388225 0.888427 4.251084 ... 3.168609 3.268658 2.149453 1.362798 3.304314 0.911426 0.426736 1.080240 3.337945 27.250000
max 2.971671 4.033491 6.454067 4.508867 5.426886 2.170873 4.285718 5.354386 6.022652 5.828356 ... 5.920234 5.892563 4.814187 2.889297 5.952548 2.965542 3.415801 3.558226 5.239447 58.000000

8 rows × 51 columns

Also notice that the standard deviations of these 50 genes range widely, from about 0.71 to 3.27. We'll revisit this insight in the next section.

summary_stats.loc['std'].sort_values()

X1232 0.707890
X159 0.755686
X1028 0.756856
X1416 0.816352
X1206 0.835004
X1264 0.845427
X1219 0.867225
X1597 0.884797
X1179 0.892574
X1418 0.954523
X1637 1.009526
X1430 1.024816
X1173 1.055141
X1514 1.063305
X1141 1.112103
X1657 1.149509
X1656 1.180466
X1444 1.237122
X986 1.242713
X1136 1.242932
X1609 1.245007
X1351 1.278610
X1563 1.305220
X1193 1.309822
X1208 1.327855
X960 1.369578
X1529 1.443172
X1092 1.444945
X1064 1.573671
X1023 1.639463
X1203 1.649872
X1417 1.652708
X1683 1.658281
X1103 1.666017
X1169 1.734047
X1124 1.760771
X1329 1.905598
X1272 1.928260
X1616 1.937274
X1574 1.981597
X1595 1.998929
X1553 2.014539
X1144 2.034201
X1109 2.093202
X980 2.331048
X1362 2.535929
X1470 2.580006
X1506 3.047798
X1292 3.247265
X1297 3.271446
size 8.722323
Name: std, dtype: float64

Research Goal: Predict Tumor Size with New Tissue Samples

Ideally, we'd like to pursue two research goals with this dataset and the resulting analysis. First, we'd like to build a linear regression model that effectively predicts the breast tumor size of new tissue samples based on the gene expression levels of the 50 gene sequences that we are exploring.

However, with only 150 tissue samples and 50 candidate explanatory variables, it's possible that the inclusion of all 50 available gene sequences as explanatory variables in our model may lead to overfitting. Thus, we would like to employ feature selection techniques that can help us select the best combination of these 50 gene explanatory variables to put in our linear regression model.

Another Research Goal: Discover which Genes are Most Important when it Comes to Predicting Breast Tumor Size

As we discussed in Module 8, if we want to interpret the magnitude of a slope as representing how important that explanatory variable is for predicting the response variable, then the explanatory variables should be on the same scale. Because we can see, for instance, that the standard deviations of our gene explanatory variables vary quite a bit, we should scale our explanatory variables first.
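
To make this concrete, here is a minimal sketch (ours, not part of the original analysis) of what standardizing a single column does: we subtract the column's mean and divide by its standard deviation, so the scaled values have a mean of roughly 0 and a standard deviation of roughly 1.

#Illustrative sketch: standardize one gene expression column by hand
x159_scaled = (df['X159'] - df['X159'].mean()) / df['X159'].std()
print(x159_scaled.mean().round(4), x159_scaled.std().round(4))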

Dataset Preprocessing and Exploration

Training and Test Datasets

Because one of our research goals is to infer how well our linear regression model might perform when making new breast tumor size predictions (for which we do not know the actual tumor size), we can again randomly split our full dataset into a training dataset to train our model and a test dataset to test the model's predictions.

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.1, random_state=102)
df_train.head()

X159 X960 X980 X986 X1023 X1028 X1064 X1092 X1103 X1109 ... X1574 X1595 X1597 X1609 X1616 X1637 X1656 X1657 X1683 size
5 1.244203 3.485278 -1.838879 3.354405 3.596351 -2.372703 2.537772 -0.042187 4.409049 -1.824082 ... -2.081240 -2.229338 3.370172 -2.063752 -1.995848 -1.946939 2.648671 -1.697325 -1.107875 13.0
2 2.218338 3.471559 -1.789433 2.829994 4.746466 -2.222392 2.977280 0.944858 4.021099 -1.825502 ... -2.296393 -2.347923 3.577213 -2.175087 -2.084889 -2.106915 2.738768 -1.387816 -0.780555 15.0
102 0.316945 -0.390375 4.743326 0.551410 1.817976 0.048910 -0.698110 1.448081 0.164387 4.441513 ... 1.324413 1.353907 1.057898 0.512706 1.436640 0.893563 -0.536053 0.893563 4.305040 18.0
40 1.835268 -0.603578 4.350014 0.617475 -0.478047 -0.584963 -0.163006 0.784271 0.123989 3.938599 ... 1.115477 1.284453 1.063326 0.358454 1.326852 0.653197 -0.315776 -0.135655 2.279382 15.0
142 1.209177 -0.891624 4.499789 0.286713 -0.084269 0.136573 -1.162271 2.839159 0.360295 3.737733 ... 3.126064 3.204745 1.625320 0.866073 3.235488 0.360295 -0.949727 1.257053 1.820451 18.0

5 rows × 51 columns

df_test.head()

X159 X960 X980 X986 X1023 X1028 X1064 X1092 X1103 X1109 ... X1574 X1595 X1597 X1609 X1616 X1637 X1656 X1657 X1683 size
8 1.047177 2.971751 -1.541659 3.551610 4.324723 -1.866498 3.437098 0.301806 5.121727 -1.408392 ... -2.138303 -2.138303 2.833941 -1.941266 -1.988060 -1.895942 2.615933 -2.120816 -1.576992 18.0
103 -0.053242 -0.804744 5.557256 0.158262 -0.263034 0.897100 -0.201634 3.580600 0.031027 5.412831 ... 4.022689 4.092374 1.046294 0.902703 3.847852 1.457530 -0.682260 0.222392 3.564785 25.0
65 2.263034 -0.902389 4.597829 0.673556 -0.260152 -0.304006 -1.089267 3.166715 1.594549 3.993221 ... 5.340384 5.431289 1.604071 -0.160040 5.412104 0.130931 -0.463947 -0.888969 2.613532 20.0
82 0.703225 -1.351222 6.454067 -0.791135 0.772766 0.413224 -1.097466 3.007232 0.471900 4.177958 ... 4.822038 4.838095 1.770643 1.456133 4.972323 1.104842 -1.193681 1.550727 2.623119 25.0
117 -0.361188 -1.219169 5.836283 -0.163886 -0.185746 0.669575 -0.952382 3.676407 0.410188 4.192342 ... 4.860650 4.873258 0.481271 2.715997 4.811154 0.857647 -0.743435 0.155871 2.732116 18.0

5 rows × 51 columns

Features Matrix and Target Array

Unfortunately, the regularization feature selection techniques that we introduced in Module 8 are not available through the .ols() function. Instead, we will revert to using the LinearRegression() function, which we used in Data Science Discovery.

Recall that the LinearRegression() function requires us to split the variables in our dataframe into two objects.

  1. The features matrix dataframe X, comprised of the explanatory variables.
  2. The target array dataframe/series y, comprised of the response variable.

Let's create our features matrix X and target array y for both the training and test dataset for this linear regression model below.

#Training Features matrix
X_train = df_train.drop(['size'], axis=1)
X_train.head()

X159 X960 X980 X986 X1023 X1028 X1064 X1092 X1103 X1109 ... X1563 X1574 X1595 X1597 X1609 X1616 X1637 X1656 X1657 X1683
5 1.244203 3.485278 -1.838879 3.354405 3.596351 -2.372703 2.537772 -0.042187 4.409049 -1.824082 ... 3.870400 -2.081240 -2.229338 3.370172 -2.063752 -1.995848 -1.946939 2.648671 -1.697325 -1.107875
2 2.218338 3.471559 -1.789433 2.829994 4.746466 -2.222392 2.977280 0.944858 4.021099 -1.825502 ... 3.565945 -2.296393 -2.347923 3.577213 -2.175087 -2.084889 -2.106915 2.738768 -1.387816 -0.780555
102 0.316945 -0.390375 4.743326 0.551410 1.817976 0.048910 -0.698110 1.448081 0.164387 4.441513 ... 0.715666 1.324413 1.353907 1.057898 0.512706 1.436640 0.893563 -0.536053 0.893563 4.305040
40 1.835268 -0.603578 4.350014 0.617475 -0.478047 -0.584963 -0.163006 0.784271 0.123989 3.938599 ... 0.504675 1.115477 1.284453 1.063326 0.358454 1.326852 0.653197 -0.315776 -0.135655 2.279382
142 1.209177 -0.891624 4.499789 0.286713 -0.084269 0.136573 -1.162271 2.839159 0.360295 3.737733 ... -0.425306 3.126064 3.204745 1.625320 0.866073 3.235488 0.360295 -0.949727 1.257053 1.820451

5 rows × 50 columns

#Training target array
y_train=df_train['size']
y_train.head()

5 13.0
2 15.0
102 18.0
40 15.0
142 18.0
Name: size, dtype: float64

#Test Features matrix
X_test = df_test.drop(['size'], axis=1)
X_test.head()

X159 X960 X980 X986 X1023 X1028 X1064 X1092 X1103 X1109 ... X1563 X1574 X1595 X1597 X1609 X1616 X1637 X1656 X1657 X1683
8 1.047177 2.971751 -1.541659 3.551610 4.324723 -1.866498 3.437098 0.301806 5.121727 -1.408392 ... 3.552632 -2.138303 -2.138303 2.833941 -1.941266 -1.988060 -1.895942 2.615933 -2.120816 -1.576992
103 -0.053242 -0.804744 5.557256 0.158262 -0.263034 0.897100 -0.201634 3.580600 0.031027 5.412831 ... -0.436099 4.022689 4.092374 1.046294 0.902703 3.847852 1.457530 -0.682260 0.222392 3.564785
65 2.263034 -0.902389 4.597829 0.673556 -0.260152 -0.304006 -1.089267 3.166715 1.594549 3.993221 ... 0.077243 5.340384 5.431289 1.604071 -0.160040 5.412104 0.130931 -0.463947 -0.888969 2.613532
82 0.703225 -1.351222 6.454067 -0.791135 0.772766 0.413224 -1.097466 3.007232 0.471900 4.177958 ... 0.282239 4.822038 4.838095 1.770643 1.456133 4.972323 1.104842 -1.193681 1.550727 2.623119
117 -0.361188 -1.219169 5.836283 -0.163886 -0.185746 0.669575 -0.952382 3.676407 0.410188 4.192342 ... -0.560957 4.860650 4.873258 0.481271 2.715997 4.811154 0.857647 -0.743435 0.155871 2.732116

5 rows × 50 columns

#Test target array
y_test=df_test['size']
y_test.head()

8 18.0
103 25.0
65 20.0
82 25.0
117 18.0
Name: size, dtype: float64
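
Before moving on, here is a minimal sketch (ours, not part of the module's analysis itself) that illustrates the overfitting concern raised above: we fit a linear regression with all 50 gene explanatory variables and compare the training and test R^2 scores. A training R^2 that is much higher than the test R^2 would suggest the model is overfitting.

from sklearn.linear_model import LinearRegression

#Fit a baseline model using all 50 gene explanatory variables
baseline_model = LinearRegression()
baseline_model.fit(X_train, y_train)

#Compare training and test R^2; a large gap suggests overfitting
print('Training R^2:', baseline_model.score(X_train, y_train))
print('Test R^2:', baseline_model.score(X_test, y_test))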

Features Matrices Scaling

Also, given that one of our goals is to use the slope magnitudes to interpret each gene explanatory variable's relative importance when it comes to predicting tumor size, we should scale each of our features matrices. We do this below.

from sklearn.preprocessing import StandardScaler

#Create a StandardScaler() object and use it to fit and transform the training features matrix
scaler_training = StandardScaler()
scaled_expl_vars = scaler_training.fit_transform(X_train)

#Put this numpy array scaled output back into a dataframe with the same columns
X_train = pd.DataFrame(scaled_expl_vars, columns=X_train.columns)
X_train.head()

X159 X960 X980 X986 X1023 X1028 X1064 X1092 X1103 X1109 ... X1563 X1574 X1595 X1597 X1609 X1616 X1637 X1656 X1657 X1683
0 -0.188911 2.674820 -2.364522 2.152603 1.412846 -2.420321 1.899571 -1.568374 2.145277 -2.324344 ... 2.387077 -2.088785 -2.183883 1.848449 -2.228972 -2.151242 -2.361514 2.040161 -1.846183 -2.168571
1 1.134933 2.664634 -2.342705 1.719746 2.129373 -2.214271 2.186127 -0.884551 1.906277 -2.325037 ... 2.151231 -2.201519 -2.245515 2.083075 -2.321449 -2.199038 -2.526149 2.116996 -1.572131 -1.967851
2 -1.449049 -0.202849 0.539825 -0.161034 0.304911 0.899298 -0.210205 -0.535919 -0.469680 0.733466 ... -0.056732 -0.304314 -0.321553 -0.771911 -0.088888 -0.308724 0.561709 -0.675766 0.447905 1.150740
3 0.614342 -0.361152 0.366279 -0.106503 -1.125521 0.030367 0.138680 -0.995805 -0.494568 0.488028 ... -0.220176 -0.413791 -0.357650 -0.765760 -0.217015 -0.367657 0.314344 -0.487914 -0.463411 -0.091434
4 -0.236511 -0.575025 0.432366 -0.379519 -0.880195 1.019470 -0.512836 0.427818 -0.348990 0.389999 ... -0.940583 0.639702 0.640387 -0.128886 0.204629 0.656876 0.012911 -1.028546 0.769755 -0.372861

5 rows × 50 columns

#Use the existing StandardScaler() object (fitted on the training data) to transform the test features matrix
scaled_expl_vars = scaler_training.transform(X_test)

#Put this numpy array scaled output back into a dataframe with the same columns
X_test = pd.DataFrame(scaled_expl_vars, columns=X_test.columns)
X_test.head()

Note that we use the .transform() function (as opposed to .fit_transform()) to scale the X_test dataframe. This way, the column means and standard deviations from the training dataset, stored in the fitted scaler_training object, are used to scale the test dataset.

X159 X960 X980 X986 X1023 X1028 X1064 X1092 X1103 X1109 ... X1563 X1574 X1595 X1597 X1609 X1616 X1637 X1656 X1657 X1683
0 -0.299415 2.002417 -1.727194 1.935547 1.635520 -1.451124 1.895163 -1.471236 1.929423 -1.757712 ... 1.947210 -1.670136 -1.666857 1.216891 -1.485737 -1.665706 -1.905008 2.162811 -1.711147 -1.970706
1 -1.533480 -0.484835 0.834148 -0.389406 -0.814173 1.478331 -0.105865 0.865151 -0.753172 1.095184 ... -0.930971 0.801870 0.802887 -0.823446 0.408357 0.709364 0.745571 -0.601882 0.211822 0.926456
2 1.064109 -0.549146 0.487980 -0.036352 -0.812634 0.205140 -0.593996 0.570227 0.070741 0.501449 ... -0.560555 1.330575 1.333612 -0.186824 -0.299434 1.345975 -0.302971 -0.418882 -0.700224 0.390467
3 -0.685140 -0.844753 1.157724 -1.039885 -0.261094 0.965415 -0.598505 0.456583 -0.520849 0.578713 ... -0.412634 1.122597 1.098479 0.003293 0.776943 1.166995 0.466807 -1.030578 1.301929 0.395869
4 -1.878826 -0.757781 0.934823 -0.610126 -0.772904 1.237150 -0.518719 0.933421 -0.553369 0.584729 ... -1.021066 1.138089 1.112417 -1.468336 1.616017 1.101404 0.271424 -0.653162 0.157231 0.457284

5 rows × 50 columns
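
As a quick sanity check (a sketch of ours, not part of the original analysis), the scaled training columns should have a mean of roughly 0 and a standard deviation of roughly 1, while the scaled test columns will be only approximately standardized, because they were scaled using the training means and standard deviations.

#Training columns are standardized exactly; test columns only approximately
print(X_train['X159'].mean().round(4), X_train['X159'].std().round(4))
print(X_test['X159'].mean().round(4), X_test['X159'].std().round(4))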

Multicollinearity Checking

Next, because one of our research goals is to correctly interpret our slopes (for the purpose of assessing gene importance in the model), we should first check our explanatory variables for collinearity.

X_train.corr()

X159 X960 X980 X986 X1023 X1028 X1064 X1092 X1103 X1109 ... X1563 X1574 X1595 X1597 X1609 X1616 X1637 X1656 X1657 X1683
X159 1.000000 0.060265 -0.065454 0.022752 0.002266 -0.226498 -0.035437 -0.160238 0.051143 -0.110140 ... 0.146743 -0.020795 -0.010273 0.333225 -0.141372 -0.008198 -0.072578 0.134411 -0.073557 -0.151282
X960 0.060265 1.000000 -0.916792 0.919474 0.864897 -0.656544 0.934777 -0.444383 0.892529 -0.882282 ... 0.888604 -0.845295 -0.851930 0.726146 -0.841161 -0.841833 -0.825226 0.793374 -0.791090 -0.838848
X980 -0.065454 -0.916792 1.000000 -0.863944 -0.775288 0.762141 -0.913493 0.513205 -0.893101 0.948234 ... -0.883796 0.868768 0.874294 -0.731147 0.873856 0.863753 0.895619 -0.860707 0.770817 0.893986
X986 0.022752 0.919474 -0.863944 1.000000 0.810531 -0.650212 0.909173 -0.449640 0.855915 -0.805998 ... 0.855282 -0.829077 -0.826968 0.704965 -0.794439 -0.817660 -0.786007 0.716178 -0.784597 -0.804646
X1023 0.002266 0.864897 -0.775288 0.810531 1.000000 -0.618030 0.789146 -0.379213 0.751222 -0.712516 ... 0.762159 -0.789073 -0.787609 0.588211 -0.708756 -0.779140 -0.682828 0.638509 -0.674408 -0.662923
X1028 -0.226498 -0.656544 0.762141 -0.650212 -0.618030 1.000000 -0.645414 0.484050 -0.665692 0.722887 ... -0.706024 0.760260 0.762429 -0.646901 0.725366 0.757094 0.729037 -0.659633 0.653814 0.699566
X1064 -0.035437 0.934777 -0.913493 0.909173 0.789146 -0.645414 1.000000 -0.485517 0.886098 -0.874262 ... 0.860687 -0.840807 -0.843057 0.668920 -0.823614 -0.836642 -0.845297 0.776042 -0.805988 -0.844329
X1092 -0.160238 -0.444383 0.513205 -0.449640 -0.379213 0.484050 -0.485517 1.000000 -0.473404 0.400143 ... -0.516239 0.564969 0.562968 -0.395651 0.522040 0.562572 0.410890 -0.441094 0.481274 0.445237
X1103 0.051143 0.892529 -0.893101 0.855915 0.751222 -0.665692 0.886098 -0.473404 1.000000 -0.860258 ... 0.876620 -0.782109 -0.785695 0.783122 -0.796972 -0.776115 -0.813230 0.791853 -0.728272 -0.782653
X1109 -0.110140 -0.882282 0.948234 -0.805998 -0.712516 0.722887 -0.874262 0.400143 -0.860258 1.000000 ... -0.822570 0.794170 0.798974 -0.728567 0.822577 0.787503 0.910524 -0.848353 0.729511 0.877479
X1124 0.031407 0.865214 -0.837546 0.838944 0.663751 -0.622546 0.910569 -0.432644 0.805809 -0.835547 ... 0.782090 -0.767345 -0.772630 0.651323 -0.774667 -0.766530 -0.826978 0.746686 -0.810543 -0.815883
X1136 0.161022 0.799013 -0.837336 0.779458 0.681390 -0.638374 0.762544 -0.446870 0.881181 -0.807831 ... 0.849828 -0.729269 -0.723239 0.821403 -0.735579 -0.709176 -0.731699 0.797020 -0.571091 -0.692232
X1141 0.011729 0.871462 -0.864145 0.829365 0.756040 -0.651875 0.861079 -0.469561 0.855482 -0.812283 ... 0.843947 -0.797747 -0.799305 0.647370 -0.782430 -0.788425 -0.712398 0.799659 -0.712964 -0.801396
X1144 0.040838 0.942138 -0.957587 0.904227 0.782889 -0.709512 0.955349 -0.470001 0.915338 -0.929812 ... 0.903989 -0.845723 -0.851139 0.755900 -0.848812 -0.840404 -0.874500 0.851006 -0.774372 -0.879659
X1169 -0.004202 0.925896 -0.926571 0.864454 0.789092 -0.699165 0.909802 -0.447707 0.894222 -0.883932 ... 0.867721 -0.854186 -0.863700 0.717545 -0.846288 -0.851842 -0.823758 0.820538 -0.782410 -0.840337
X1173 0.058787 0.903325 -0.884817 0.870304 0.776981 -0.693137 0.877916 -0.554142 0.865982 -0.794434 ... 0.863425 -0.846434 -0.850215 0.675853 -0.833425 -0.838393 -0.736912 0.749456 -0.773375 -0.804784
X1179 0.062648 0.737455 -0.736313 0.654872 0.641327 -0.605075 0.656605 -0.343754 0.726666 -0.711082 ... 0.740930 -0.683865 -0.685372 0.658865 -0.635948 -0.672497 -0.620235 0.763833 -0.520312 -0.608884
X1193 0.091274 0.933136 -0.930017 0.873882 0.803894 -0.673898 0.908863 -0.465221 0.909069 -0.881124 ... 0.887649 -0.828929 -0.828686 0.727626 -0.830956 -0.816732 -0.804668 0.809600 -0.731388 -0.834385
X1203 0.079657 0.916031 -0.904998 0.899109 0.779479 -0.687560 0.899677 -0.485905 0.944242 -0.861692 ... 0.910898 -0.808672 -0.810954 0.808131 -0.803194 -0.800176 -0.828396 0.828596 -0.758954 -0.794665
X1206 0.174463 0.760769 -0.742724 0.796077 0.670605 -0.639950 0.746090 -0.500799 0.779156 -0.687160 ... 0.802564 -0.703009 -0.695499 0.697781 -0.656429 -0.683848 -0.651239 0.666202 -0.608689 -0.673853
X1208 0.044885 0.908029 -0.926673 0.884039 0.765225 -0.671927 0.926455 -0.450485 0.903138 -0.905937 ... 0.879817 -0.817365 -0.815426 0.719438 -0.795445 -0.803429 -0.824524 0.834101 -0.719321 -0.844577
X1219 0.180275 0.769047 -0.807458 0.769531 0.633759 -0.633706 0.733670 -0.411896 0.811018 -0.761748 ... 0.830423 -0.715472 -0.711176 0.802107 -0.705878 -0.694509 -0.733160 0.800655 -0.663075 -0.708285
X1232 0.426183 0.303486 -0.324989 0.249726 0.166148 -0.392972 0.270865 -0.269866 0.236039 -0.369432 ... 0.327017 -0.301708 -0.310809 0.328062 -0.333782 -0.311180 -0.329475 0.355530 -0.322522 -0.347581
X1264 -0.051055 -0.781417 0.802760 -0.782776 -0.703112 0.658323 -0.804180 0.397005 -0.747521 0.814639 ... -0.681377 0.758666 0.758941 -0.589019 0.786272 0.752985 0.803112 -0.603683 0.722360 0.774020
X1272 -0.040265 -0.911553 0.947765 -0.873484 -0.767387 0.751029 -0.915850 0.494310 -0.854102 0.891946 ... -0.865252 0.878443 0.881443 -0.701314 0.847869 0.871232 0.859305 -0.817937 0.837731 0.893305
X1292 -0.076443 -0.925540 0.983793 -0.872776 -0.778493 0.751053 -0.919327 0.495884 -0.894957 0.959648 ... -0.886213 0.860019 0.864635 -0.736384 0.873662 0.853065 0.904217 -0.861037 0.783029 0.901897
X1297 0.017390 0.946077 -0.961785 0.906564 0.814310 -0.722749 0.945475 -0.473534 0.934736 -0.918783 ... 0.908553 -0.864263 -0.867555 0.749340 -0.859787 -0.856443 -0.859907 0.847827 -0.791277 -0.869537
X1329 0.009951 0.929842 -0.932724 0.886701 0.789396 -0.695484 0.909363 -0.441399 0.896489 -0.895399 ... 0.904664 -0.837141 -0.839982 0.738162 -0.830809 -0.826397 -0.815080 0.867191 -0.759899 -0.863806
X1351 0.074195 0.811867 -0.847300 0.750453 0.680455 -0.660163 0.795969 -0.450868 0.805231 -0.825490 ... 0.861870 -0.747672 -0.751341 0.683120 -0.743854 -0.735933 -0.730542 0.850602 -0.607218 -0.750289
X1362 0.020027 0.918716 -0.923671 0.875123 0.802120 -0.676309 0.902487 -0.419660 0.946665 -0.897589 ... 0.887244 -0.805565 -0.806308 0.745834 -0.808091 -0.794275 -0.843427 0.835359 -0.721160 -0.804713
X1416 0.180052 0.383643 -0.502941 0.356281 0.214178 -0.423820 0.400211 -0.473305 0.409376 -0.477921 ... 0.502524 -0.425794 -0.425664 0.424760 -0.416625 -0.417702 -0.375468 0.593593 -0.258241 -0.429165
X1417 -0.074155 -0.891227 0.949137 -0.835970 -0.775966 0.776325 -0.879697 0.439637 -0.848448 0.910556 ... -0.830148 0.842794 0.847958 -0.672883 0.872998 0.836976 0.868016 -0.804726 0.787150 0.887542
X1418 -0.071250 -0.792134 0.772190 -0.770712 -0.690774 0.623884 -0.792832 0.407504 -0.738287 0.752375 ... -0.670585 0.738928 0.741526 -0.566998 0.778472 0.738323 0.752189 -0.564174 0.801533 0.735440
X1430 -0.077804 -0.891555 0.900095 -0.830997 -0.774981 0.675972 -0.858329 0.479091 -0.853732 0.877924 ... -0.788063 0.824912 0.832096 -0.646183 0.837266 0.822258 0.829237 -0.717934 0.757981 0.837409
X1444 -0.151062 -0.822917 0.910982 -0.774846 -0.698487 0.763655 -0.821939 0.516456 -0.795250 0.868480 ... -0.779184 0.827906 0.826656 -0.663202 0.814365 0.818499 0.812694 -0.760813 0.722381 0.851153
X1470 0.043124 0.927974 -0.941770 0.906335 0.785812 -0.707760 0.926588 -0.455920 0.899787 -0.900745 ... 0.923467 -0.830640 -0.832105 0.794707 -0.836461 -0.818802 -0.856976 0.863743 -0.766625 -0.854665
X1506 -0.031407 -0.927604 0.977642 -0.886268 -0.788302 0.743441 -0.928034 0.496425 -0.898061 0.935573 ... -0.892210 0.873149 0.876923 -0.727647 0.846958 0.866361 0.901064 -0.843021 0.805861 0.897225
X1514 -0.086193 -0.820769 0.820674 -0.755153 -0.685370 0.683551 -0.837623 0.470159 -0.692624 0.798052 ... -0.724666 0.790684 0.795717 -0.489634 0.783708 0.790052 0.771904 -0.647816 0.767054 0.851241
X1529 0.262708 0.807609 -0.866144 0.736371 0.682320 -0.696159 0.760780 -0.429428 0.801496 -0.857029 ... 0.808604 -0.717575 -0.721919 0.749539 -0.781370 -0.711594 -0.784322 0.794576 -0.618542 -0.767723
X1553 -0.015018 -0.843533 0.865949 -0.819131 -0.777474 0.756747 -0.834719 0.571974 -0.775977 0.789738 ... -0.787036 0.996108 0.998524 -0.638189 0.824152 0.997639 0.781007 -0.697473 0.759585 0.772824
X1563 0.146743 0.888604 -0.883796 0.855282 0.762159 -0.706024 0.860687 -0.516239 0.876620 -0.822570 ... 1.000000 -0.786572 -0.793890 0.813031 -0.774063 -0.780421 -0.767569 0.851589 -0.699145 -0.779229
X1574 -0.020795 -0.845295 0.868768 -0.829077 -0.789073 0.760260 -0.840807 0.564969 -0.782109 0.794170 ... -0.786572 1.000000 0.995950 -0.639595 0.825637 0.995033 0.786121 -0.699452 0.761645 0.777340
X1595 -0.010273 -0.851930 0.874294 -0.826968 -0.787609 0.762429 -0.843057 0.562968 -0.785695 0.798974 ... -0.793890 0.995950 1.000000 -0.646385 0.829484 0.998837 0.793389 -0.702696 0.765187 0.779285
X1597 0.333225 0.726146 -0.731147 0.704965 0.588211 -0.646901 0.668920 -0.395651 0.783122 -0.728567 ... 0.813031 -0.639595 -0.646385 1.000000 -0.640804 -0.634429 -0.676188 0.727538 -0.574218 -0.628768
X1609 -0.141372 -0.841161 0.873856 -0.794439 -0.708756 0.725366 -0.823614 0.522040 -0.796972 0.822577 ... -0.774063 0.825637 0.829484 -0.640804 1.000000 0.823166 0.828017 -0.722580 0.750383 0.811844
X1616 -0.008198 -0.841833 0.863753 -0.817660 -0.779140 0.757094 -0.836642 0.562572 -0.776115 0.787503 ... -0.780421 0.995033 0.998837 -0.634429 0.823166 1.000000 0.788440 -0.683023 0.765562 0.770101
X1637 -0.072578 -0.825226 0.895619 -0.786007 -0.682828 0.729037 -0.845297 0.410890 -0.813230 0.910524 ... -0.767569 0.786121 0.793389 -0.676188 0.828017 0.788440 1.000000 -0.739133 0.769479 0.851905
X1656 0.134411 0.793374 -0.860707 0.716178 0.638509 -0.659633 0.776042 -0.441094 0.791853 -0.848353 ... 0.851589 -0.699452 -0.702696 0.727538 -0.722580 -0.683023 -0.739133 1.000000 -0.557172 -0.752078
X1657 -0.073557 -0.791090 0.770817 -0.784597 -0.674408 0.653814 -0.805988 0.481274 -0.728272 0.729511 ... -0.699145 0.761645 0.765187 -0.574218 0.750383 0.765562 0.769479 -0.557172 1.000000 0.771097
X1683 -0.151282 -0.838848 0.893986 -0.804646 -0.662923 0.699566 -0.844329 0.445237 -0.782653 0.877479 ... -0.779229 0.777340 0.779285 -0.628768 0.811844 0.770101 0.851905 -0.752078 0.771097 1.000000

50 rows × 50 columns

import seaborn as sns
import matplotlib.pyplot as plt

#Visualize the full correlation matrix as a heatmap
sns.heatmap(X_train.corr(), vmin=-1, vmax=1, cmap='RdBu')
plt.show()
[Heatmap of the pairwise correlations among the 50 scaled gene explanatory variables]

As we can see in the correlation matrix above, many pairs of explanatory variables have a high correlation magnitude. For instance, the relationship between genes X1023 and X960 is strong and linear (r ≈ 0.86, as shown in the scatter plot below). Thus, any linear regression model we build that includes all of these genes is likely to suffer from multicollinearity, and we would not be able to trust our resulting slope interpretations as much.

Therefore, in addition to building a model that cuts out gene explanatory variables that would lead to overfitting (and thus worse performance on new/test dataset predictions), we'd also like to build a model that cuts out gene explanatory variables that are collinear with other gene explanatory variables. Doing so would help us meet our other goal of being able to effectively interpret our resulting model slopes.
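
One way to quantify the extent of the problem (a minimal sketch, not part of the original analysis) is to extract every pair of explanatory variables whose correlation magnitude exceeds some threshold, say 0.9.

import numpy as np

#Keep only the upper triangle of the absolute correlation matrix so each pair appears once
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

#Stack into a series of gene pairs and keep the highly correlated ones
high_pairs = upper.stack().sort_values(ascending=False)
print(high_pairs[high_pairs > 0.9])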

sns.lmplot(x='X1023', y='X960', data=X_train)
plt.show()
[Scatter plot of X960 versus X1023 in the training data with a fitted regression line, showing a strong positive linear relationship]
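
As a quick numerical companion to this plot (our sketch), we can pull this pair's correlation directly from the training data.

#Correlation between X1023 and X960 in the training data
print(X_train['X1023'].corr(X_train['X960']))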