Breast Cancer Research Introduction
To fully demonstrate the benefits of the remaining feature selection techniques that we'll discuss in this module, let's introduce a new dataset and a new research goal.
Breast Tumor Tissue Sample Dataset
Chanrion et al. (2008) [1] reported the results of a study of 155 patients treated for breast cancer with tamoxifen. The patients were followed for a period of time and diagnosed as having a recurrence of breast cancer (R) or as being recurrence free (RF). Various clinical measurements were made, including tumor size at the time of treatment. Gene expression was measured for a large number of gene sequences; here we focus on a sample of 50 gene sequences.
[1] Chanrion, M., et al. "A Gene Expression Signature that Can Predict the Recurrence of Tamoxifen-Treated Primary Breast Cancer." Clinical Cancer Research 14(6), March 15, 2008.
We can see that this dataset comprises what we call the expression levels of 50 gene sequences for 150 breast tumor tissue samples that were collected. This dataset also includes the size of each tumor.
import pandas as pd

df = pd.read_csv('breast_tumor.csv')
df.head()
 | X159 | X960 | X980 | X986 | X1023 | X1028 | X1064 | X1092 | X1103 | X1109 | ... | X1574 | X1595 | X1597 | X1609 | X1616 | X1637 | X1656 | X1657 | X1683 | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.739512 | 1.971036 | -1.660842 | 2.777183 | 3.299062 | -1.954834 | 2.784970 | 0.411848 | 3.974931 | -0.979751 | ... | -1.735267 | -2.036134 | 3.114202 | -1.567229 | -2.100264 | -2.067843 | 2.116261 | -2.057195 | -0.868534 | 13.0 |
1 | 2.037903 | 2.197854 | -1.263034 | 4.082346 | 5.426886 | -1.732520 | 3.085890 | 0.688056 | 4.503384 | -1.185032 | ... | -1.646363 | 0.127756 | 2.772590 | -1.451107 | 0.267480 | -1.526069 | 2.643856 | -1.625604 | -1.415037 | 15.0 |
2 | 2.218338 | 3.471559 | -1.789433 | 2.829994 | 4.746466 | -2.222392 | 2.977280 | 0.944858 | 4.021099 | -1.825502 | ... | -2.296393 | -2.347923 | 3.577213 | -2.175087 | -2.084889 | -2.106915 | 2.738768 | -1.387816 | -0.780555 | 15.0 |
3 | 0.972344 | 2.638734 | -2.010999 | 3.913935 | 4.744161 | -2.496426 | 3.139577 | 0.155651 | 4.632121 | -1.671513 | ... | -1.706846 | -2.216318 | 3.168707 | -1.844349 | -2.010999 | -1.996352 | 2.797407 | -1.743066 | -1.010999 | 20.0 |
4 | 2.412235 | 4.033491 | -1.536501 | 4.239650 | 4.304348 | -1.991067 | 3.700095 | 0.878536 | 4.295705 | -2.141092 | ... | -2.424026 | -2.448274 | 3.717911 | -2.286523 | -2.045515 | -1.776328 | 2.813104 | -2.353637 | -1.687061 | 10.0 |
5 rows × 51 columns
df.shape
(150, 51)
df.columns
Index(['X159', 'X960', 'X980', 'X986', 'X1023', 'X1028', 'X1064', 'X1092',
'X1103', 'X1109', 'X1124', 'X1136', 'X1141', 'X1144', 'X1169', 'X1173',
'X1179', 'X1193', 'X1203', 'X1206', 'X1208', 'X1219', 'X1232', 'X1264',
'X1272', 'X1292', 'X1297', 'X1329', 'X1351', 'X1362', 'X1416', 'X1417',
'X1418', 'X1430', 'X1444', 'X1470', 'X1506', 'X1514', 'X1529', 'X1553',
'X1563', 'X1574', 'X1595', 'X1597', 'X1609', 'X1616', 'X1637', 'X1656',
'X1657', 'X1683', 'size'],
dtype='object')
If we calculate some basic summary statistics for each of these 50 gene sequences, we can see that a gene sequence's expression level can take on both positive and negative values. Generally speaking, the higher the gene expression value, the more that gene is "expressed" in the tissue sample; likewise, the lower the value, the less that gene is "expressed" in the tissue sample.
summary_stats = df.describe()
summary_stats
 | X159 | X960 | X980 | X986 | X1023 | X1028 | X1064 | X1092 | X1103 | X1109 | ... | X1574 | X1595 | X1597 | X1609 | X1616 | X1637 | X1656 | X1657 | X1683 | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | ... | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 1.376306 | -0.112320 | 3.492451 | 0.744515 | 1.321874 | -0.596157 | -0.339048 | 2.236122 | 0.980135 | 2.924173 | ... | 1.917094 | 1.982022 | 1.741923 | 0.586703 | 2.021079 | 0.364399 | 0.234298 | 0.345367 | 2.377693 | 22.960000 |
std | 0.755686 | 1.369578 | 2.331048 | 1.242713 | 1.639463 | 0.756856 | 1.573671 | 1.444945 | 1.666017 | 2.093202 | ... | 1.981597 | 1.998929 | 0.884797 | 1.245007 | 1.937274 | 1.009526 | 1.180466 | 1.149509 | 1.658281 | 8.722323 |
min | -0.382146 | -1.942775 | -2.028855 | -1.172946 | -0.831602 | -2.496426 | -2.309684 | -1.584963 | -0.775162 | -2.302563 | ... | -2.496426 | -2.548893 | 0.182386 | -2.376563 | -2.180813 | -2.195256 | -1.317740 | -2.548893 | -2.071866 | 9.000000 |
25% | 0.949494 | -0.849483 | 3.886548 | 0.031332 | 0.110678 | -0.951850 | -1.256071 | 0.950987 | -0.013481 | 2.764891 | ... | 1.326367 | 1.434842 | 1.150739 | 0.207832 | 1.405858 | 0.299315 | -0.560667 | -0.191817 | 2.228602 | 18.000000 |
50% | 1.469655 | -0.596223 | 4.345319 | 0.407620 | 0.737692 | -0.534296 | -0.875781 | 2.450144 | 0.354814 | 3.561006 | ... | 2.249980 | 2.335982 | 1.561707 | 0.893613 | 2.291399 | 0.617323 | -0.184270 | 0.648666 | 2.818252 | 20.000000 |
75% | 1.894190 | -0.239456 | 4.797746 | 0.783101 | 1.840496 | -0.162224 | -0.300901 | 3.388225 | 0.888427 | 4.251084 | ... | 3.168609 | 3.268658 | 2.149453 | 1.362798 | 3.304314 | 0.911426 | 0.426736 | 1.080240 | 3.337945 | 27.250000 |
max | 2.971671 | 4.033491 | 6.454067 | 4.508867 | 5.426886 | 2.170873 | 4.285718 | 5.354386 | 6.022652 | 5.828356 | ... | 5.920234 | 5.892563 | 4.814187 | 2.889297 | 5.952548 | 2.965542 | 3.415801 | 3.558226 | 5.239447 | 58.000000 |
8 rows × 51 columns
Also notice that the standard deviations of these 50 genes range widely, from 0.71 to 3.27. We'll revisit this insight in the next section.
summary_stats.loc['std'].sort_values()
X1232 0.707890
X159 0.755686
X1028 0.756856
X1416 0.816352
X1206 0.835004
X1264 0.845427
X1219 0.867225
X1597 0.884797
X1179 0.892574
X1418 0.954523
X1637 1.009526
X1430 1.024816
X1173 1.055141
X1514 1.063305
X1141 1.112103
X1657 1.149509
X1656 1.180466
X1444 1.237122
X986 1.242713
X1136 1.242932
X1609 1.245007
X1351 1.278610
X1563 1.305220
X1193 1.309822
X1208 1.327855
X960 1.369578
X1529 1.443172
X1092 1.444945
X1064 1.573671
X1023 1.639463
X1203 1.649872
X1417 1.652708
X1683 1.658281
X1103 1.666017
X1169 1.734047
X1124 1.760771
X1329 1.905598
X1272 1.928260
X1616 1.937274
X1574 1.981597
X1595 1.998929
X1553 2.014539
X1144 2.034201
X1109 2.093202
X980 2.331048
X1362 2.535929
X1470 2.580006
X1506 3.047798
X1292 3.247265
X1297 3.271446
size 8.722323
Name: std, dtype: float64
Research Goal: Predict Tumor Size for New Tissue Samples
Ideally, we'd like to pursue two research goals with this dataset and the resulting analysis. First, we'd like to build a linear regression model that effectively predicts the breast tumor size of new tissue samples based on the expression levels of the 50 gene sequences that we are exploring.
However, our dataset contains only 150 tissue samples, so including all 50 available gene sequences as explanatory variables in our model may lead to overfitting. Therefore, we would like to employ feature selection techniques that can help us select the best combination of these 50 gene explanatory variables to put in our linear regression model.
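To make this concern concrete, here is a minimal preview sketch (assuming the breast_tumor.csv file from above is available; we'll formalize this train/test split in the next section). A training R² that is much higher than the test R² is a classic symptom of an overfit model.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

#Fit a linear regression that uses all 50 gene sequences as explanatory variables
df_preview = pd.read_csv('breast_tumor.csv')
train, test = train_test_split(df_preview, test_size=0.1, random_state=102)
full_model = LinearRegression().fit(train.drop(['size'], axis=1), train['size'])
#Compare the model's fit on the data it was trained on vs. held-out data
print('Training R^2:', full_model.score(train.drop(['size'], axis=1), train['size']))
print('Test R^2:', full_model.score(test.drop(['size'], axis=1), test['size']))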
Another Research Goal: Discover Which Genes Are Most Important for Predicting Breast Tumor Size
As we discussed in Module 8, if we want to interpret the magnitude of a slope as representing how important that explanatory variable is for predicting the response variable, then the explanatory variables should all be on the same scale. Because we can see, for instance, that the standard deviations of our gene explanatory variables vary quite a bit, we should scale our explanatory variables first.
Dataset Preprocessing and Exploration
Training and Test Datasets
Because one of our research goals is to infer how well our linear regression model might perform when making breast tumor size predictions for new tissue samples (for which we do not know the actual tumor size), we can again randomly split our full dataset into a training dataset used to train the model and a test dataset used to test the model's predictions.
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.1, random_state=102)
df_train.head()
 | X159 | X960 | X980 | X986 | X1023 | X1028 | X1064 | X1092 | X1103 | X1109 | ... | X1574 | X1595 | X1597 | X1609 | X1616 | X1637 | X1656 | X1657 | X1683 | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 1.244203 | 3.485278 | -1.838879 | 3.354405 | 3.596351 | -2.372703 | 2.537772 | -0.042187 | 4.409049 | -1.824082 | ... | -2.081240 | -2.229338 | 3.370172 | -2.063752 | -1.995848 | -1.946939 | 2.648671 | -1.697325 | -1.107875 | 13.0 |
2 | 2.218338 | 3.471559 | -1.789433 | 2.829994 | 4.746466 | -2.222392 | 2.977280 | 0.944858 | 4.021099 | -1.825502 | ... | -2.296393 | -2.347923 | 3.577213 | -2.175087 | -2.084889 | -2.106915 | 2.738768 | -1.387816 | -0.780555 | 15.0 |
102 | 0.316945 | -0.390375 | 4.743326 | 0.551410 | 1.817976 | 0.048910 | -0.698110 | 1.448081 | 0.164387 | 4.441513 | ... | 1.324413 | 1.353907 | 1.057898 | 0.512706 | 1.436640 | 0.893563 | -0.536053 | 0.893563 | 4.305040 | 18.0 |
40 | 1.835268 | -0.603578 | 4.350014 | 0.617475 | -0.478047 | -0.584963 | -0.163006 | 0.784271 | 0.123989 | 3.938599 | ... | 1.115477 | 1.284453 | 1.063326 | 0.358454 | 1.326852 | 0.653197 | -0.315776 | -0.135655 | 2.279382 | 15.0 |
142 | 1.209177 | -0.891624 | 4.499789 | 0.286713 | -0.084269 | 0.136573 | -1.162271 | 2.839159 | 0.360295 | 3.737733 | ... | 3.126064 | 3.204745 | 1.625320 | 0.866073 | 3.235488 | 0.360295 | -0.949727 | 1.257053 | 1.820451 | 18.0 |
5 rows × 51 columns
df_test.head()
 | X159 | X960 | X980 | X986 | X1023 | X1028 | X1064 | X1092 | X1103 | X1109 | ... | X1574 | X1595 | X1597 | X1609 | X1616 | X1637 | X1656 | X1657 | X1683 | size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 1.047177 | 2.971751 | -1.541659 | 3.551610 | 4.324723 | -1.866498 | 3.437098 | 0.301806 | 5.121727 | -1.408392 | ... | -2.138303 | -2.138303 | 2.833941 | -1.941266 | -1.988060 | -1.895942 | 2.615933 | -2.120816 | -1.576992 | 18.0 |
103 | -0.053242 | -0.804744 | 5.557256 | 0.158262 | -0.263034 | 0.897100 | -0.201634 | 3.580600 | 0.031027 | 5.412831 | ... | 4.022689 | 4.092374 | 1.046294 | 0.902703 | 3.847852 | 1.457530 | -0.682260 | 0.222392 | 3.564785 | 25.0 |
65 | 2.263034 | -0.902389 | 4.597829 | 0.673556 | -0.260152 | -0.304006 | -1.089267 | 3.166715 | 1.594549 | 3.993221 | ... | 5.340384 | 5.431289 | 1.604071 | -0.160040 | 5.412104 | 0.130931 | -0.463947 | -0.888969 | 2.613532 | 20.0 |
82 | 0.703225 | -1.351222 | 6.454067 | -0.791135 | 0.772766 | 0.413224 | -1.097466 | 3.007232 | 0.471900 | 4.177958 | ... | 4.822038 | 4.838095 | 1.770643 | 1.456133 | 4.972323 | 1.104842 | -1.193681 | 1.550727 | 2.623119 | 25.0 |
117 | -0.361188 | -1.219169 | 5.836283 | -0.163886 | -0.185746 | 0.669575 | -0.952382 | 3.676407 | 0.410188 | 4.192342 | ... | 4.860650 | 4.873258 | 0.481271 | 2.715997 | 4.811154 | 0.857647 | -0.743435 | 0.155871 | 2.732116 | 18.0 |
5 rows × 51 columns
Features Matrix and Target Array
Unfortunately, the regularization feature selection techniques that we introduced in section 8 are not possible with the .ols() function. Instead, we will return to the LinearRegression() function, which we used in Data Science Discovery.
Recall that the LinearRegression() function requires us to split the variables in our dataframe into two objects:
- The features matrix X: a dataframe comprised of the explanatory variables.
- The target array y: a dataframe/series comprised of the response variable.
Let's create our features matrix X and target array y for both the training and test dataset for this linear regression model below.
#Training Features matrix
X_train = df_train.drop(['size'], axis=1)
X_train.head()
 | X159 | X960 | X980 | X986 | X1023 | X1028 | X1064 | X1092 | X1103 | X1109 | ... | X1563 | X1574 | X1595 | X1597 | X1609 | X1616 | X1637 | X1656 | X1657 | X1683 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 1.244203 | 3.485278 | -1.838879 | 3.354405 | 3.596351 | -2.372703 | 2.537772 | -0.042187 | 4.409049 | -1.824082 | ... | 3.870400 | -2.081240 | -2.229338 | 3.370172 | -2.063752 | -1.995848 | -1.946939 | 2.648671 | -1.697325 | -1.107875 |
2 | 2.218338 | 3.471559 | -1.789433 | 2.829994 | 4.746466 | -2.222392 | 2.977280 | 0.944858 | 4.021099 | -1.825502 | ... | 3.565945 | -2.296393 | -2.347923 | 3.577213 | -2.175087 | -2.084889 | -2.106915 | 2.738768 | -1.387816 | -0.780555 |
102 | 0.316945 | -0.390375 | 4.743326 | 0.551410 | 1.817976 | 0.048910 | -0.698110 | 1.448081 | 0.164387 | 4.441513 | ... | 0.715666 | 1.324413 | 1.353907 | 1.057898 | 0.512706 | 1.436640 | 0.893563 | -0.536053 | 0.893563 | 4.305040 |
40 | 1.835268 | -0.603578 | 4.350014 | 0.617475 | -0.478047 | -0.584963 | -0.163006 | 0.784271 | 0.123989 | 3.938599 | ... | 0.504675 | 1.115477 | 1.284453 | 1.063326 | 0.358454 | 1.326852 | 0.653197 | -0.315776 | -0.135655 | 2.279382 |
142 | 1.209177 | -0.891624 | 4.499789 | 0.286713 | -0.084269 | 0.136573 | -1.162271 | 2.839159 | 0.360295 | 3.737733 | ... | -0.425306 | 3.126064 | 3.204745 | 1.625320 | 0.866073 | 3.235488 | 0.360295 | -0.949727 | 1.257053 | 1.820451 |
5 rows × 50 columns
#Training target array
y_train = df_train['size']
y_train.head()
5 13.0
2 15.0
102 18.0
40 15.0
142 18.0
Name: size, dtype: float64
#Test Features matrix
X_test = df_test.drop(['size'], axis=1)
X_test.head()
 | X159 | X960 | X980 | X986 | X1023 | X1028 | X1064 | X1092 | X1103 | X1109 | ... | X1563 | X1574 | X1595 | X1597 | X1609 | X1616 | X1637 | X1656 | X1657 | X1683 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 1.047177 | 2.971751 | -1.541659 | 3.551610 | 4.324723 | -1.866498 | 3.437098 | 0.301806 | 5.121727 | -1.408392 | ... | 3.552632 | -2.138303 | -2.138303 | 2.833941 | -1.941266 | -1.988060 | -1.895942 | 2.615933 | -2.120816 | -1.576992 |
103 | -0.053242 | -0.804744 | 5.557256 | 0.158262 | -0.263034 | 0.897100 | -0.201634 | 3.580600 | 0.031027 | 5.412831 | ... | -0.436099 | 4.022689 | 4.092374 | 1.046294 | 0.902703 | 3.847852 | 1.457530 | -0.682260 | 0.222392 | 3.564785 |
65 | 2.263034 | -0.902389 | 4.597829 | 0.673556 | -0.260152 | -0.304006 | -1.089267 | 3.166715 | 1.594549 | 3.993221 | ... | 0.077243 | 5.340384 | 5.431289 | 1.604071 | -0.160040 | 5.412104 | 0.130931 | -0.463947 | -0.888969 | 2.613532 |
82 | 0.703225 | -1.351222 | 6.454067 | -0.791135 | 0.772766 | 0.413224 | -1.097466 | 3.007232 | 0.471900 | 4.177958 | ... | 0.282239 | 4.822038 | 4.838095 | 1.770643 | 1.456133 | 4.972323 | 1.104842 | -1.193681 | 1.550727 | 2.623119 |
117 | -0.361188 | -1.219169 | 5.836283 | -0.163886 | -0.185746 | 0.669575 | -0.952382 | 3.676407 | 0.410188 | 4.192342 | ... | -0.560957 | 4.860650 | 4.873258 | 0.481271 | 2.715997 | 4.811154 | 0.857647 | -0.743435 | 0.155871 | 2.732116 |
5 rows × 50 columns
#Test target array
y_test = df_test['size']
y_test.head()
8 18.0
103 25.0
65 20.0
82 25.0
117 18.0
Name: size, dtype: float64
Scaling the Features Matrices
Also, given that one of our goals is to use the slope magnitudes to interpret each gene explanatory variable's relative importance when it comes to predicting tumor size, we should scale each of our features matrices. We do this below.
from sklearn.preprocessing import StandardScaler
#Create a StandardScaler() object and use it to fit and transform the training features matrix
scaler_training = StandardScaler()
scaled_expl_vars = scaler_training.fit_transform(X_train)
#Put this numpy array scaled output back into a dataframe with the same columns
X_train = pd.DataFrame(scaled_expl_vars, columns=X_train.columns)
X_train.head()
 | X159 | X960 | X980 | X986 | X1023 | X1028 | X1064 | X1092 | X1103 | X1109 | ... | X1563 | X1574 | X1595 | X1597 | X1609 | X1616 | X1637 | X1656 | X1657 | X1683 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.188911 | 2.674820 | -2.364522 | 2.152603 | 1.412846 | -2.420321 | 1.899571 | -1.568374 | 2.145277 | -2.324344 | ... | 2.387077 | -2.088785 | -2.183883 | 1.848449 | -2.228972 | -2.151242 | -2.361514 | 2.040161 | -1.846183 | -2.168571 |
1 | 1.134933 | 2.664634 | -2.342705 | 1.719746 | 2.129373 | -2.214271 | 2.186127 | -0.884551 | 1.906277 | -2.325037 | ... | 2.151231 | -2.201519 | -2.245515 | 2.083075 | -2.321449 | -2.199038 | -2.526149 | 2.116996 | -1.572131 | -1.967851 |
2 | -1.449049 | -0.202849 | 0.539825 | -0.161034 | 0.304911 | 0.899298 | -0.210205 | -0.535919 | -0.469680 | 0.733466 | ... | -0.056732 | -0.304314 | -0.321553 | -0.771911 | -0.088888 | -0.308724 | 0.561709 | -0.675766 | 0.447905 | 1.150740 |
3 | 0.614342 | -0.361152 | 0.366279 | -0.106503 | -1.125521 | 0.030367 | 0.138680 | -0.995805 | -0.494568 | 0.488028 | ... | -0.220176 | -0.413791 | -0.357650 | -0.765760 | -0.217015 | -0.367657 | 0.314344 | -0.487914 | -0.463411 | -0.091434 |
4 | -0.236511 | -0.575025 | 0.432366 | -0.379519 | -0.880195 | 1.019470 | -0.512836 | 0.427818 | -0.348990 | 0.389999 | ... | -0.940583 | 0.639702 | 0.640387 | -0.128886 | 0.204629 | 0.656876 | 0.012911 | -1.028546 | 0.769755 | -0.372861 |
5 rows × 50 columns
#Use the existing StandardScaler() object (fitted on the training data) to transform the test features matrix
scaled_expl_vars = scaler_training.transform(X_test)
#Put this numpy array scaled output back into a dataframe with the same columns
X_test = pd.DataFrame(scaled_expl_vars, columns=X_test.columns)
X_test.head()
Note that we use the .transform() method (as opposed to .fit_transform()) to scale the `X_test` dataframe, so that the test features are standardized using the column means and standard deviations from the TRAINING dataset stored in the fitted `scaler_training` object.
 | X159 | X960 | X980 | X986 | X1023 | X1028 | X1064 | X1092 | X1103 | X1109 | ... | X1563 | X1574 | X1595 | X1597 | X1609 | X1616 | X1637 | X1656 | X1657 | X1683 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.299415 | 2.002417 | -1.727194 | 1.935547 | 1.635520 | -1.451124 | 1.895163 | -1.471236 | 1.929423 | -1.757712 | ... | 1.947210 | -1.670136 | -1.666857 | 1.216891 | -1.485737 | -1.665706 | -1.905008 | 2.162811 | -1.711147 | -1.970706 |
1 | -1.533480 | -0.484835 | 0.834148 | -0.389406 | -0.814173 | 1.478331 | -0.105865 | 0.865151 | -0.753172 | 1.095184 | ... | -0.930971 | 0.801870 | 0.802887 | -0.823446 | 0.408357 | 0.709364 | 0.745571 | -0.601882 | 0.211822 | 0.926456 |
2 | 1.064109 | -0.549146 | 0.487980 | -0.036352 | -0.812634 | 0.205140 | -0.593996 | 0.570227 | 0.070741 | 0.501449 | ... | -0.560555 | 1.330575 | 1.333612 | -0.186824 | -0.299434 | 1.345975 | -0.302971 | -0.418882 | -0.700224 | 0.390467 |
3 | -0.685140 | -0.844753 | 1.157724 | -1.039885 | -0.261094 | 0.965415 | -0.598505 | 0.456583 | -0.520849 | 0.578713 | ... | -0.412634 | 1.122597 | 1.098479 | 0.003293 | 0.776943 | 1.166995 | 0.466807 | -1.030578 | 1.301929 | 0.395869 |
4 | -1.878826 | -0.757781 | 0.934823 | -0.610126 | -0.772904 | 1.237150 | -0.518719 | 0.933421 | -0.553369 | 0.584729 | ... | -1.021066 | 1.138089 | 1.112417 | -1.468336 | 1.616017 | 1.101404 | 0.271424 | -0.653162 | 0.157231 | 0.457284 |
5 rows × 50 columns
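As a quick sanity check, here is a short sketch of what we would expect to see: each scaled training column should have a mean of approximately 0 and a standard deviation of approximately 1, while the test columns, which were scaled with the training means and standard deviations, will only be approximately standardized.
#Each training column should be standardized; the test columns only approximately so
print(X_train.mean().abs().max())   #essentially 0
print(X_train.std().max())          #~1 (pandas computes std with ddof=1, so not exactly 1)
print(X_test.mean().abs().max())    #noticeably nonzero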
Multicollinearity Checking
Next, because one of our research goals is to correctly interpret our slopes (to gauge each gene's importance in the model), we should first check our explanatory variables for multicollinearity.
X_train.corr()
 | X159 | X960 | X980 | X986 | X1023 | X1028 | X1064 | X1092 | X1103 | X1109 | ... | X1563 | X1574 | X1595 | X1597 | X1609 | X1616 | X1637 | X1656 | X1657 | X1683 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X159 | 1.000000 | 0.060265 | -0.065454 | 0.022752 | 0.002266 | -0.226498 | -0.035437 | -0.160238 | 0.051143 | -0.110140 | ... | 0.146743 | -0.020795 | -0.010273 | 0.333225 | -0.141372 | -0.008198 | -0.072578 | 0.134411 | -0.073557 | -0.151282 |
X960 | 0.060265 | 1.000000 | -0.916792 | 0.919474 | 0.864897 | -0.656544 | 0.934777 | -0.444383 | 0.892529 | -0.882282 | ... | 0.888604 | -0.845295 | -0.851930 | 0.726146 | -0.841161 | -0.841833 | -0.825226 | 0.793374 | -0.791090 | -0.838848 |
X980 | -0.065454 | -0.916792 | 1.000000 | -0.863944 | -0.775288 | 0.762141 | -0.913493 | 0.513205 | -0.893101 | 0.948234 | ... | -0.883796 | 0.868768 | 0.874294 | -0.731147 | 0.873856 | 0.863753 | 0.895619 | -0.860707 | 0.770817 | 0.893986 |
X986 | 0.022752 | 0.919474 | -0.863944 | 1.000000 | 0.810531 | -0.650212 | 0.909173 | -0.449640 | 0.855915 | -0.805998 | ... | 0.855282 | -0.829077 | -0.826968 | 0.704965 | -0.794439 | -0.817660 | -0.786007 | 0.716178 | -0.784597 | -0.804646 |
X1023 | 0.002266 | 0.864897 | -0.775288 | 0.810531 | 1.000000 | -0.618030 | 0.789146 | -0.379213 | 0.751222 | -0.712516 | ... | 0.762159 | -0.789073 | -0.787609 | 0.588211 | -0.708756 | -0.779140 | -0.682828 | 0.638509 | -0.674408 | -0.662923 |
X1028 | -0.226498 | -0.656544 | 0.762141 | -0.650212 | -0.618030 | 1.000000 | -0.645414 | 0.484050 | -0.665692 | 0.722887 | ... | -0.706024 | 0.760260 | 0.762429 | -0.646901 | 0.725366 | 0.757094 | 0.729037 | -0.659633 | 0.653814 | 0.699566 |
X1064 | -0.035437 | 0.934777 | -0.913493 | 0.909173 | 0.789146 | -0.645414 | 1.000000 | -0.485517 | 0.886098 | -0.874262 | ... | 0.860687 | -0.840807 | -0.843057 | 0.668920 | -0.823614 | -0.836642 | -0.845297 | 0.776042 | -0.805988 | -0.844329 |
X1092 | -0.160238 | -0.444383 | 0.513205 | -0.449640 | -0.379213 | 0.484050 | -0.485517 | 1.000000 | -0.473404 | 0.400143 | ... | -0.516239 | 0.564969 | 0.562968 | -0.395651 | 0.522040 | 0.562572 | 0.410890 | -0.441094 | 0.481274 | 0.445237 |
X1103 | 0.051143 | 0.892529 | -0.893101 | 0.855915 | 0.751222 | -0.665692 | 0.886098 | -0.473404 | 1.000000 | -0.860258 | ... | 0.876620 | -0.782109 | -0.785695 | 0.783122 | -0.796972 | -0.776115 | -0.813230 | 0.791853 | -0.728272 | -0.782653 |
X1109 | -0.110140 | -0.882282 | 0.948234 | -0.805998 | -0.712516 | 0.722887 | -0.874262 | 0.400143 | -0.860258 | 1.000000 | ... | -0.822570 | 0.794170 | 0.798974 | -0.728567 | 0.822577 | 0.787503 | 0.910524 | -0.848353 | 0.729511 | 0.877479 |
X1124 | 0.031407 | 0.865214 | -0.837546 | 0.838944 | 0.663751 | -0.622546 | 0.910569 | -0.432644 | 0.805809 | -0.835547 | ... | 0.782090 | -0.767345 | -0.772630 | 0.651323 | -0.774667 | -0.766530 | -0.826978 | 0.746686 | -0.810543 | -0.815883 |
X1136 | 0.161022 | 0.799013 | -0.837336 | 0.779458 | 0.681390 | -0.638374 | 0.762544 | -0.446870 | 0.881181 | -0.807831 | ... | 0.849828 | -0.729269 | -0.723239 | 0.821403 | -0.735579 | -0.709176 | -0.731699 | 0.797020 | -0.571091 | -0.692232 |
X1141 | 0.011729 | 0.871462 | -0.864145 | 0.829365 | 0.756040 | -0.651875 | 0.861079 | -0.469561 | 0.855482 | -0.812283 | ... | 0.843947 | -0.797747 | -0.799305 | 0.647370 | -0.782430 | -0.788425 | -0.712398 | 0.799659 | -0.712964 | -0.801396 |
X1144 | 0.040838 | 0.942138 | -0.957587 | 0.904227 | 0.782889 | -0.709512 | 0.955349 | -0.470001 | 0.915338 | -0.929812 | ... | 0.903989 | -0.845723 | -0.851139 | 0.755900 | -0.848812 | -0.840404 | -0.874500 | 0.851006 | -0.774372 | -0.879659 |
X1169 | -0.004202 | 0.925896 | -0.926571 | 0.864454 | 0.789092 | -0.699165 | 0.909802 | -0.447707 | 0.894222 | -0.883932 | ... | 0.867721 | -0.854186 | -0.863700 | 0.717545 | -0.846288 | -0.851842 | -0.823758 | 0.820538 | -0.782410 | -0.840337 |
X1173 | 0.058787 | 0.903325 | -0.884817 | 0.870304 | 0.776981 | -0.693137 | 0.877916 | -0.554142 | 0.865982 | -0.794434 | ... | 0.863425 | -0.846434 | -0.850215 | 0.675853 | -0.833425 | -0.838393 | -0.736912 | 0.749456 | -0.773375 | -0.804784 |
X1179 | 0.062648 | 0.737455 | -0.736313 | 0.654872 | 0.641327 | -0.605075 | 0.656605 | -0.343754 | 0.726666 | -0.711082 | ... | 0.740930 | -0.683865 | -0.685372 | 0.658865 | -0.635948 | -0.672497 | -0.620235 | 0.763833 | -0.520312 | -0.608884 |
X1193 | 0.091274 | 0.933136 | -0.930017 | 0.873882 | 0.803894 | -0.673898 | 0.908863 | -0.465221 | 0.909069 | -0.881124 | ... | 0.887649 | -0.828929 | -0.828686 | 0.727626 | -0.830956 | -0.816732 | -0.804668 | 0.809600 | -0.731388 | -0.834385 |
X1203 | 0.079657 | 0.916031 | -0.904998 | 0.899109 | 0.779479 | -0.687560 | 0.899677 | -0.485905 | 0.944242 | -0.861692 | ... | 0.910898 | -0.808672 | -0.810954 | 0.808131 | -0.803194 | -0.800176 | -0.828396 | 0.828596 | -0.758954 | -0.794665 |
X1206 | 0.174463 | 0.760769 | -0.742724 | 0.796077 | 0.670605 | -0.639950 | 0.746090 | -0.500799 | 0.779156 | -0.687160 | ... | 0.802564 | -0.703009 | -0.695499 | 0.697781 | -0.656429 | -0.683848 | -0.651239 | 0.666202 | -0.608689 | -0.673853 |
X1208 | 0.044885 | 0.908029 | -0.926673 | 0.884039 | 0.765225 | -0.671927 | 0.926455 | -0.450485 | 0.903138 | -0.905937 | ... | 0.879817 | -0.817365 | -0.815426 | 0.719438 | -0.795445 | -0.803429 | -0.824524 | 0.834101 | -0.719321 | -0.844577 |
X1219 | 0.180275 | 0.769047 | -0.807458 | 0.769531 | 0.633759 | -0.633706 | 0.733670 | -0.411896 | 0.811018 | -0.761748 | ... | 0.830423 | -0.715472 | -0.711176 | 0.802107 | -0.705878 | -0.694509 | -0.733160 | 0.800655 | -0.663075 | -0.708285 |
X1232 | 0.426183 | 0.303486 | -0.324989 | 0.249726 | 0.166148 | -0.392972 | 0.270865 | -0.269866 | 0.236039 | -0.369432 | ... | 0.327017 | -0.301708 | -0.310809 | 0.328062 | -0.333782 | -0.311180 | -0.329475 | 0.355530 | -0.322522 | -0.347581 |
X1264 | -0.051055 | -0.781417 | 0.802760 | -0.782776 | -0.703112 | 0.658323 | -0.804180 | 0.397005 | -0.747521 | 0.814639 | ... | -0.681377 | 0.758666 | 0.758941 | -0.589019 | 0.786272 | 0.752985 | 0.803112 | -0.603683 | 0.722360 | 0.774020 |
X1272 | -0.040265 | -0.911553 | 0.947765 | -0.873484 | -0.767387 | 0.751029 | -0.915850 | 0.494310 | -0.854102 | 0.891946 | ... | -0.865252 | 0.878443 | 0.881443 | -0.701314 | 0.847869 | 0.871232 | 0.859305 | -0.817937 | 0.837731 | 0.893305 |
X1292 | -0.076443 | -0.925540 | 0.983793 | -0.872776 | -0.778493 | 0.751053 | -0.919327 | 0.495884 | -0.894957 | 0.959648 | ... | -0.886213 | 0.860019 | 0.864635 | -0.736384 | 0.873662 | 0.853065 | 0.904217 | -0.861037 | 0.783029 | 0.901897 |
X1297 | 0.017390 | 0.946077 | -0.961785 | 0.906564 | 0.814310 | -0.722749 | 0.945475 | -0.473534 | 0.934736 | -0.918783 | ... | 0.908553 | -0.864263 | -0.867555 | 0.749340 | -0.859787 | -0.856443 | -0.859907 | 0.847827 | -0.791277 | -0.869537 |
X1329 | 0.009951 | 0.929842 | -0.932724 | 0.886701 | 0.789396 | -0.695484 | 0.909363 | -0.441399 | 0.896489 | -0.895399 | ... | 0.904664 | -0.837141 | -0.839982 | 0.738162 | -0.830809 | -0.826397 | -0.815080 | 0.867191 | -0.759899 | -0.863806 |
X1351 | 0.074195 | 0.811867 | -0.847300 | 0.750453 | 0.680455 | -0.660163 | 0.795969 | -0.450868 | 0.805231 | -0.825490 | ... | 0.861870 | -0.747672 | -0.751341 | 0.683120 | -0.743854 | -0.735933 | -0.730542 | 0.850602 | -0.607218 | -0.750289 |
X1362 | 0.020027 | 0.918716 | -0.923671 | 0.875123 | 0.802120 | -0.676309 | 0.902487 | -0.419660 | 0.946665 | -0.897589 | ... | 0.887244 | -0.805565 | -0.806308 | 0.745834 | -0.808091 | -0.794275 | -0.843427 | 0.835359 | -0.721160 | -0.804713 |
X1416 | 0.180052 | 0.383643 | -0.502941 | 0.356281 | 0.214178 | -0.423820 | 0.400211 | -0.473305 | 0.409376 | -0.477921 | ... | 0.502524 | -0.425794 | -0.425664 | 0.424760 | -0.416625 | -0.417702 | -0.375468 | 0.593593 | -0.258241 | -0.429165 |
X1417 | -0.074155 | -0.891227 | 0.949137 | -0.835970 | -0.775966 | 0.776325 | -0.879697 | 0.439637 | -0.848448 | 0.910556 | ... | -0.830148 | 0.842794 | 0.847958 | -0.672883 | 0.872998 | 0.836976 | 0.868016 | -0.804726 | 0.787150 | 0.887542 |
X1418 | -0.071250 | -0.792134 | 0.772190 | -0.770712 | -0.690774 | 0.623884 | -0.792832 | 0.407504 | -0.738287 | 0.752375 | ... | -0.670585 | 0.738928 | 0.741526 | -0.566998 | 0.778472 | 0.738323 | 0.752189 | -0.564174 | 0.801533 | 0.735440 |
X1430 | -0.077804 | -0.891555 | 0.900095 | -0.830997 | -0.774981 | 0.675972 | -0.858329 | 0.479091 | -0.853732 | 0.877924 | ... | -0.788063 | 0.824912 | 0.832096 | -0.646183 | 0.837266 | 0.822258 | 0.829237 | -0.717934 | 0.757981 | 0.837409 |
X1444 | -0.151062 | -0.822917 | 0.910982 | -0.774846 | -0.698487 | 0.763655 | -0.821939 | 0.516456 | -0.795250 | 0.868480 | ... | -0.779184 | 0.827906 | 0.826656 | -0.663202 | 0.814365 | 0.818499 | 0.812694 | -0.760813 | 0.722381 | 0.851153 |
X1470 | 0.043124 | 0.927974 | -0.941770 | 0.906335 | 0.785812 | -0.707760 | 0.926588 | -0.455920 | 0.899787 | -0.900745 | ... | 0.923467 | -0.830640 | -0.832105 | 0.794707 | -0.836461 | -0.818802 | -0.856976 | 0.863743 | -0.766625 | -0.854665 |
X1506 | -0.031407 | -0.927604 | 0.977642 | -0.886268 | -0.788302 | 0.743441 | -0.928034 | 0.496425 | -0.898061 | 0.935573 | ... | -0.892210 | 0.873149 | 0.876923 | -0.727647 | 0.846958 | 0.866361 | 0.901064 | -0.843021 | 0.805861 | 0.897225 |
X1514 | -0.086193 | -0.820769 | 0.820674 | -0.755153 | -0.685370 | 0.683551 | -0.837623 | 0.470159 | -0.692624 | 0.798052 | ... | -0.724666 | 0.790684 | 0.795717 | -0.489634 | 0.783708 | 0.790052 | 0.771904 | -0.647816 | 0.767054 | 0.851241 |
X1529 | 0.262708 | 0.807609 | -0.866144 | 0.736371 | 0.682320 | -0.696159 | 0.760780 | -0.429428 | 0.801496 | -0.857029 | ... | 0.808604 | -0.717575 | -0.721919 | 0.749539 | -0.781370 | -0.711594 | -0.784322 | 0.794576 | -0.618542 | -0.767723 |
X1553 | -0.015018 | -0.843533 | 0.865949 | -0.819131 | -0.777474 | 0.756747 | -0.834719 | 0.571974 | -0.775977 | 0.789738 | ... | -0.787036 | 0.996108 | 0.998524 | -0.638189 | 0.824152 | 0.997639 | 0.781007 | -0.697473 | 0.759585 | 0.772824 |
X1563 | 0.146743 | 0.888604 | -0.883796 | 0.855282 | 0.762159 | -0.706024 | 0.860687 | -0.516239 | 0.876620 | -0.822570 | ... | 1.000000 | -0.786572 | -0.793890 | 0.813031 | -0.774063 | -0.780421 | -0.767569 | 0.851589 | -0.699145 | -0.779229 |
X1574 | -0.020795 | -0.845295 | 0.868768 | -0.829077 | -0.789073 | 0.760260 | -0.840807 | 0.564969 | -0.782109 | 0.794170 | ... | -0.786572 | 1.000000 | 0.995950 | -0.639595 | 0.825637 | 0.995033 | 0.786121 | -0.699452 | 0.761645 | 0.777340 |
X1595 | -0.010273 | -0.851930 | 0.874294 | -0.826968 | -0.787609 | 0.762429 | -0.843057 | 0.562968 | -0.785695 | 0.798974 | ... | -0.793890 | 0.995950 | 1.000000 | -0.646385 | 0.829484 | 0.998837 | 0.793389 | -0.702696 | 0.765187 | 0.779285 |
X1597 | 0.333225 | 0.726146 | -0.731147 | 0.704965 | 0.588211 | -0.646901 | 0.668920 | -0.395651 | 0.783122 | -0.728567 | ... | 0.813031 | -0.639595 | -0.646385 | 1.000000 | -0.640804 | -0.634429 | -0.676188 | 0.727538 | -0.574218 | -0.628768 |
X1609 | -0.141372 | -0.841161 | 0.873856 | -0.794439 | -0.708756 | 0.725366 | -0.823614 | 0.522040 | -0.796972 | 0.822577 | ... | -0.774063 | 0.825637 | 0.829484 | -0.640804 | 1.000000 | 0.823166 | 0.828017 | -0.722580 | 0.750383 | 0.811844 |
X1616 | -0.008198 | -0.841833 | 0.863753 | -0.817660 | -0.779140 | 0.757094 | -0.836642 | 0.562572 | -0.776115 | 0.787503 | ... | -0.780421 | 0.995033 | 0.998837 | -0.634429 | 0.823166 | 1.000000 | 0.788440 | -0.683023 | 0.765562 | 0.770101 |
X1637 | -0.072578 | -0.825226 | 0.895619 | -0.786007 | -0.682828 | 0.729037 | -0.845297 | 0.410890 | -0.813230 | 0.910524 | ... | -0.767569 | 0.786121 | 0.793389 | -0.676188 | 0.828017 | 0.788440 | 1.000000 | -0.739133 | 0.769479 | 0.851905 |
X1656 | 0.134411 | 0.793374 | -0.860707 | 0.716178 | 0.638509 | -0.659633 | 0.776042 | -0.441094 | 0.791853 | -0.848353 | ... | 0.851589 | -0.699452 | -0.702696 | 0.727538 | -0.722580 | -0.683023 | -0.739133 | 1.000000 | -0.557172 | -0.752078 |
X1657 | -0.073557 | -0.791090 | 0.770817 | -0.784597 | -0.674408 | 0.653814 | -0.805988 | 0.481274 | -0.728272 | 0.729511 | ... | -0.699145 | 0.761645 | 0.765187 | -0.574218 | 0.750383 | 0.765562 | 0.769479 | -0.557172 | 1.000000 | 0.771097 |
X1683 | -0.151282 | -0.838848 | 0.893986 | -0.804646 | -0.662923 | 0.699566 | -0.844329 | 0.445237 | -0.782653 | 0.877479 | ... | -0.779229 | 0.777340 | 0.779285 | -0.628768 | 0.811844 | 0.770101 | 0.851905 | -0.752078 | 0.771097 | 1.000000 |
50 rows × 50 columns
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(X_train.corr(), vmin=-1, vmax=1, cmap='RdBu')
plt.show()

As we can see in the correlation matrix above, many pairs of explanatory variables have a high correlation magnitude. For instance, the scatterplot below shows that the relationship between genes X1023 and X960 is strong and linear (their correlation is 0.86). Thus, any linear regression model that includes all of these genes is likely to suffer from multicollinearity, and we would not be able to trust our resulting slope interpretations as much.
Therefore, in addition to building a model that cuts out gene explanatory variables that would lead to overfitting (and thus to worse performance on new/test dataset predictions), we'd also like to build a model that cuts out gene explanatory variables that are collinear with other gene explanatory variables. Doing so would help us meet our other goal of being able to effectively interpret our resulting model slopes.
sns.lmplot(x='X1023', y='X960', data=X_train)
plt.show()
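Beyond eyeballing the heatmap, a common way to quantify multicollinearity is the variance inflation factor (VIF), which measures how well each explanatory variable can be predicted from the others; values far above 10 indicate strong collinearity. Below is a minimal sketch using statsmodels (assuming it is installed); because our features matrix is already standardized, we can apply it to X_train directly.
from statsmodels.stats.outliers_influence import variance_inflation_factor
#Compute the VIF of each gene explanatory variable against all of the others
vifs = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns)
#The genes with the largest VIFs are the most collinear with the rest
vifs.sort_values(ascending=False).head()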
