Random Forests


While a single decision tree can serve as a classifier, a random forest combines multiple decision trees to produce a more robust classifier. Techniques that combine the results of multiple models are called ensemble methods.

Specifically, random forests attempt to improve on the performance of decision trees by reducing the overfitting and instability that are common among decision trees. In essence, the algorithm takes a random sample with replacement from the available data and fits a decision tree to that sample. It then returns to the original data, takes another random sample with replacement, and fits another decision tree. This process is repeated until many decision trees have been created; together, these trees serve as the forest. Additionally, only a random subset of the features is considered as the candidate variable for the optimal data separation at each split.

For a new observation, the output or prediction for the response variable combines the results returned from each of the trees in the forest, typically by taking the most common predicted class (equivalently, by averaging the trees' votes).
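To make the procedure concrete, here is a minimal sketch of the idea, assuming a feature data frame X and a label series y like the ones constructed in the next section; the variable names, the choice of 100 trees, and the use of scikit-learn's DecisionTreeClassifier with max_features = 'sqrt' are illustrative assumptions, not the exact implementation used by scikit-learn's own random forest.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_trees = 100
trees = []

for _ in range(n_trees):
    # Take a random sample with replacement (a bootstrap sample) of the rows.
    rows = rng.integers(0, len(X), size = len(X))
    X_boot, y_boot = X.iloc[rows], y.iloc[rows]
    # Consider only a random subset of the features at each split.
    tree = DecisionTreeClassifier(max_features = 'sqrt')
    trees.append(tree.fit(X_boot, y_boot))

# Each tree classifies every observation; the forest reports the most common class.
votes = np.array([tree.predict(X) for tree in trees])
forest_pred = pd.DataFrame(votes).mode().iloc[0].to_numpy()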

Fitting a Random Forest

We'll fit a random forest. Before we fit the model, note that we will bypass some of the typical steps that might be needed, including selecting how many decision trees will be fit and using cross validation (a sketch of how that selection could be done follows the code below).

from sklearn.ensemble import RandomForestClassifier

# Encode the response: fake accounts as 0 and real accounts as 1.
y = df['account_type'].map({'fake': 0, 'real': 1})

# Assemble the features, converting any categorical columns to indicator variables.
X = pd.get_dummies(df[['has_a_profile_pic', 'number_of_words_in_name', 'num_characters_in_bio',
                       'number_of_posts', 'number_of_followers', 'number_of_follows']], drop_first = True)

rf = RandomForestClassifier()
rf.fit(X, y)
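One of the steps being bypassed is choosing the number of trees. As a rough, optional sketch (the candidate values of n_estimators below are arbitrary), a cross-validated grid search could be used to compare a few settings before committing to one.

# A sketch of choosing the number of trees via 5-fold cross-validation;
# the candidate values are arbitrary illustrations.
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(RandomForestClassifier(), {'n_estimators': [50, 100, 200, 500]}, cv = 5)
search.fit(X, y)
search.best_params_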

After fitting the random forest with the default number of trees (100), we can use it to predict values for each of our original observations. The prediction returned for an observation is the classification most commonly returned by the decision trees in the forest. Note that we might opt to do this with our testing data if we had originally set some aside.

# Predicted account type (0 = fake, 1 = real) for each original observation.
y_pred = rf.predict(X)

Let's check out how our predicted values compare to the actual values.

pd.crosstab(y_pred, y)
account_type   0   1
row_0
0             60   0
1              0  60
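If preferred, the same comparison can be summarized with scikit-learn's metrics functions; this is simply an alternative view of the crosstab above.

# An equivalent summary using scikit-learn's built-in metrics.
from sklearn.metrics import confusion_matrix, accuracy_score

confusion_matrix(y, y_pred)   # rows are actual classes, columns are predicted classes
accuracy_score(y, y_pred)     # proportion of accounts classified correctly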

We can also calculate the proportion of decision trees that classify an observation as either a fake or a real Instagram account. This could serve as our estimated probability that a given account is fake (or real). For example, we can see that our first observation had characteristics consistent with a real account, while the second observation had some characteristics that resulted in it being misclassified by 22 of the 100 decision trees that were fit to the data.

Note that we convert the resulting predicted probabilities into a data frame in this example to make them easier to inspect.

# Each row gives the estimated probability of each class, ordered by class
# label (0 = fake, 1 = real); here, the share of trees choosing that class.
y_prob = rf.predict_proba(X)
y_prob = pd.DataFrame(y_prob)
y_prob.columns = ['prob_fake', 'prob_real']
y_prob.head()
   prob_fake  prob_real
0       0.02       0.98
1       0.22       0.78
2       0.01       0.99
3       0.11       0.89
4       0.04       0.96

Here, we see that we have perfect prediction (no accounts were misclassified) after averaging across the 100 decision trees that were fit. This is likely affected by the fact that we haven't placed any constraints on our trees (they can continue to subdivide the data until there is perfect separation) and that we are evaluating the model on the same data that was used to train it. Typically, we would evaluate the performance of the random forest on testing data. We'll demonstrate how to do so in a few pages.
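In the meantime, here is a rough sketch of what both adjustments could look like; the constraint values and the 75/25 split are arbitrary choices for illustration.

# A sketch of constraining the trees and evaluating on held-out testing data;
# max_depth, min_samples_leaf, and the split proportion are arbitrary here.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

rf_constrained = RandomForestClassifier(max_depth = 4, min_samples_leaf = 5)
rf_constrained.fit(X_train, y_train)
rf_constrained.score(X_test, y_test)   # accuracy on the testing data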

However, even though the classifications appeared to be perfect, we can see variability in the predicted probabilities.

import matplotlib.pyplot as plt

# Distance of each predicted probability from 0.5: values near 0 mean the trees
# were split nearly evenly, while values near 0.5 mean the trees largely agreed.
y_prob['adjusted_prob'] = abs(y_prob['prob_fake'] - 0.5)

y_prob['adjusted_prob'].hist()
plt.xlabel('Difference of Maximum Probability and 0.5')
plt.ylabel('Frequency')
plt.title('Histogram of How Separable the Instagram Accounts Are')
plt.show()

Histogram of how far above 0.5 each observation's predicted probability for its correct class was.

Here, we see a graph that demonstrates the distribution of how close each observation's predicted probability is to 0.5. For example, we can see that we have 3 observations with predicted probabilities that are around 0.15 away from 0.5, indicating that these observations were misclassified fairly often (by about 35 of the 100 decision trees). However, a majority of observations (approximately 65 of the 120 accounts) were correctly classified by at least 96 of the 100 decision trees. In general, most of our decision trees perform well on most of our data.
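These counts can be read off the adjusted probabilities directly; the thresholds below simply restate the 35-of-100 and 96-of-100 cutoffs described above.

# Number of accounts whose predicted probability is within 0.15 of 0.5
# (at least 35 of the 100 trees disagreed with the final classification).
(y_prob['adjusted_prob'] <= 0.15).sum()

# Number of accounts classified correctly by at least 96 of the 100 trees.
(y_prob['adjusted_prob'] >= 0.46).sum()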

Advantages and Drawbacks of Random Forests

A random forest typically performs better than a single decision tree, since it averages over many possible decision trees. It also has the advantage of reducing the propensity to overfit to the data that is common with a single decision tree. The repeated resampling used to build a random forest is loosely analogous to a k-fold cross validation procedure, which can similarly serve as a way to improve and/or evaluate the performance of various machine learning methods. Random forests can also accommodate many different variable types as features for the problem.

Beyond its performance as a predictor, a random forest can also be used to provide additional information or to capture some of the same information as other machine learning techniques. In other words, random forests collect information that can be helpful for understanding different components of the data. One example is that a random forest can indicate how important an individual feature is, since important features tend to be selected early (near the top of the trees) when separating the data. An index of how often, and how effectively, a feature is selected to separate the data can serve as a measure of how important that variable is to understanding the data.
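In scikit-learn, one such measure is exposed through the fitted forest's feature_importances_ attribute; a quick way to inspect it for the model above is sketched here.

# Impurity-based importance of each feature, labeled with the feature
# names and sorted from most to least important.
importances = pd.Series(rf.feature_importances_, index = X.columns)
importances.sort_values(ascending = False)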

Unfortunately, random forests are not as easy to understand as a single decision tree. With a decision tree, you can specify the exact conditions needed for an observation to be predicted to belong to a specific group. When multiple decision trees are combined, it is challenging to summarize all of their results into one interpretable statement about why a particular output was produced. An algorithm with this type of behavior is sometimes referred to as a black box algorithm, which limits some of its applicability for researchers and users, since the results cannot be independently understood or verified with simple language. In other words, it is more challenging to explain exactly why the results are what they are.