Comparing Machine Learning Models


In the previous few pages, we demonstrated three different techniques for building a classifier that separates Instagram accounts into fake and real accounts.

On this page, we'll return to our research question for this section: can we distinguish between real and fake Instagram accounts? To answer it, we'll compare our three models to determine which performs best, particularly on new data.

Creating a Testing and Training Set

To start, we want to be able to evaluate how well each model will perform on new data. To do this, we'll prepare and separate our data into a testing and training set. We do so with the train_test_split function from the model_selection module within sklearn. By using the random_state argument, we can ensure that we can replicate the same testing and training split if we need to rerun the analysis.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=207)
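
Since our dataset is small, it can help to confirm how many accounts end up in each set. This is an optional check, and it assumes X and y are the features and labels prepared on the earlier pages.

# Number of accounts (rows) in the training and testing sets
print(X_train.shape, X_test.shape)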

Fitting the Three Models

We will now compare the performance of three classifiers: a logistic regression model, a random forest, and a neural network. First, we need to fit each of the three models to our training data.

import statsmodels.formula.api as smf

# Fit the logistic regression model to the training data
df_train = X_train.copy()
df_train['y'] = y_train
glm = smf.logit(formula = 'y ~ has_a_profile_pic_yes + number_of_words_in_name + num_characters_in_bio + number_of_posts + number_of_followers + number_of_follows', data = df_train).fit()

Warning: Maximum number of iterations has been exceeded. Current function value: 0.131286 Iterations: 35

# Fit the random forest to the training data
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# Fit the neural network to the training data
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(8,8,8), activation='relu', solver='adam', max_iter=500)
mlp.fit(X_train, y_train)

Note that the logistic regression model returns a warning message. Further investigation into this message indicates that the data may be perfectly separable, at least using some of the predictor variables. This message might lead us to refine our model further; for the purposes of this page, we will use the model as returned so that our resulting models are as comparable as possible.

glm.summary()
Logit Regression Results
Dep. Variable: y No. Observations: 96
Model: Logit Df Residuals: 89
Method: MLE Df Model: 6
Date: Mon, 07 Aug 2023 Pseudo R-squ.: 0.8105
Time: 14:27:35 Log-Likelihood: -12.603
converged: False LL-Null: -66.521
Covariance Type: nonrobust LLR p-value: 5.785e-21
                             coef    std err         z    P>|z|     [0.025     0.975]
Intercept                -57.0171    3.2e+05    -0.000    1.000  -6.27e+05   6.27e+05
has_a_profile_pic_yes     52.8556    3.2e+05     0.000    1.000  -6.27e+05   6.27e+05
number_of_words_in_name    1.0425      0.566     1.841    0.066     -0.067      2.152
num_characters_in_bio      0.0939      0.046     2.032    0.042      0.003      0.185
number_of_posts            0.0025      0.012     0.200    0.841     -0.022      0.027
number_of_followers        0.0252      0.009     2.960    0.003      0.009      0.042
number_of_follows         -0.0081      0.003    -3.094    0.002     -0.013     -0.003


Possibly complete quasi-separation: A fraction 0.52 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

Summary output for the logistic regression model.
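
If we did want to investigate the quasi-separation message, one simple starting point (a sketch, not a required step) is to cross-tabulate the binary predictor with the response in the training data. If either level of has_a_profile_pic_yes contains only fake or only real accounts, that variable perfectly predicts those observations on its own, which would be consistent with the warning.

import pandas as pd

# Cross-tabulate profile-picture status against the account labels in the training data;
# a zero count in any cell indicates that this predictor separates those accounts perfectly
print(pd.crosstab(df_train['has_a_profile_pic_yes'], df_train['y']))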

Generating Predictions for the Testing Data

Now that we have our models fit to our training data, we can apply them to the testing data. This allows us to evaluate the model's performance on data that it has not previously seen. In this case, we will use the predicted probabilities of an Instagram account being real for each of the models.

import pandas as pd

# Predicted probabilities that each testing account is real, from the logistic regression
y_prob_glm = glm.predict(X_test)
y_prob_glm = pd.DataFrame(y_prob_glm)
y_prob_glm.columns = ['prob_real']

# Predicted probabilities from the random forest (one column per class)
y_prob_rf = rf.predict_proba(X_test)
y_prob_rf = pd.DataFrame(y_prob_rf)
y_prob_rf.columns = ['prob_fake', 'prob_real']

# Predicted probabilities from the neural network (one column per class)
y_prob_nn = mlp.predict_proba(X_test)
y_prob_nn = pd.DataFrame(y_prob_nn)
y_prob_nn.columns = ['prob_fake', 'prob_real']
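
Note that predict_proba returns one column per class, in the order given by the classifier's classes_ attribute. The column labels above assume that fake accounts are coded 0 and real accounts are coded 1; we can confirm the ordering with an optional check.

# The columns of predict_proba follow the order of the fitted class labels
print(rf.classes_)
print(mlp.classes_)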

Comparing the Three Models

Now that we have the predicted probabilities, we can use them to classify accounts. With the ROC curve and the corresponding AUC metric, we can determine which of the three models performs best on the testing data across all possible thresholds.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Plot an ROC curve, reporting its AUC in the legend
def plot_roc(fpr, tpr, auc, lw=2):
    plt.plot(fpr, tpr, color='darkorange', lw=lw,
             label='ROC curve (area = '+str(round(auc,3))+')')
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc="lower right")
    plt.show()
fprs_glm, tprs_glm, thresholds_glm = roc_curve(y_true=y_test, y_score=y_prob_glm['prob_real'])
auc_glm = roc_auc_score(y_true=y_test, y_score=y_prob_glm['prob_real'])
plot_roc(fprs_glm, tprs_glm, auc_glm)

The ROC curve and AUC for the logistic regression model evaluated on the testing data.

fprs_rf, tprs_rf, thresholds_rf = roc_curve(y_true=y_test, y_score=y_prob_rf['prob_real'])
auc_rf = roc_auc_score(y_true=y_test, y_score=y_prob_rf['prob_real'])
plot_roc(fprs_rf, tprs_rf, auc_rf)

The ROC curve and AUC for the random forest evaluated on the testing data.

fprs_nn, tprs_nn, thresholds_nn = roc_curve(y_true=y_test, y_score=y_prob_nn['prob_real'])
auc_nn = roc_auc_score(y_true=y_test, y_score=y_prob_nn['prob_real'])
plot_roc(fprs_nn, tprs_nn, auc_nn)

The ROC curve and AUC for the neural network evaluated on the testing data.

From this output, we can compare the performance of each of these three machine learning techniques on our testing data. The AUC (printed in the legend in the lower right corner of each plot) measures how well each model performs over all possible thresholds used to separate the two classes in the testing data. The random forest appears to perform the best overall, as it has the largest AUC of 0.958, compared to the logistic regression AUC of 0.937 and the neural network AUC of 0.86. Each of these AUCs is quite large, indicating that all three models perform well on the available testing data.
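
To make the comparison easier to read, we can also collect the three AUC values into a single table. This snippet simply reuses the auc_glm, auc_rf, and auc_nn values computed above.

# Gather the AUC of each model into one table, sorted from best to worst
auc_summary = pd.DataFrame({'model': ['logistic regression', 'random forest', 'neural network'],
                            'AUC': [auc_glm, auc_rf, auc_nn]})
print(auc_summary.sort_values('AUC', ascending=False))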

We can also see the trade-off between the top two models in terms of the frequency of false positives or false negatives. For example, the random forest is able to maintain a false positive rate of 0 up to a true positive rate of about 0.95, while the logistic regression model obtains a true positive rate of 1 once the false positive rate is at least 0.1. If a low false positive rate is more important, then the random forest might be prioritized. If a higher true positive rate is preferred, then the logistic regression model might be selected.

We have not selected a threshold to use for any particular model here, although you could follow the steps outlined in Module 10 to do so.
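
As an illustration only (not the procedure from Module 10), one common rule of thumb is to choose the threshold that maximizes the gap between the true positive rate and the false positive rate, sometimes called Youden's J statistic. The sketch below applies that idea to the random forest output computed above; in practice, a threshold is better chosen using training or validation data rather than the testing data.

import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative only: the ROC threshold that maximizes TPR - FPR for the random forest
best_threshold = thresholds_rf[np.argmax(tprs_rf - fprs_rf)]

# Classify a testing account as real when its predicted probability meets the threshold
y_pred_rf = (y_prob_rf['prob_real'] >= best_threshold).astype(int)
print(best_threshold)
print(confusion_matrix(y_test, y_pred_rf))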

Conclusion

One limitation of this analysis is that our Instagram data contains only 120 observations. Our testing data, which is 20% of the observations, consists of only 24 accounts. Because our dataset is so small, it is possible that the measured performance of each of these models is inflated. Ideally, we would gather more data in order to fully evaluate the performance of each model.

We've seen that we are able to apply some of the same analyses to random forests and neural networks as we have to logistic regression models. This is because each of these machine learning techniques provides predicted probabilities as part of its output. Once we have the predicted probabilities, we can build and evaluate a classifier using the same process we have seen before.

These additional machine learning techniques provide a different way to generate predictions compared to the regression-based techniques from earlier. Random forests and neural networks are only two of the many machine learning techniques available.

Overall, we've seen that while the process of fitting the model is distinct for each of the techniques, downstream analyses are comparable once the predictions have been obtained.