Decision Trees


Decision trees use information from the available predictors to make a prediction about the output. Specifically, a decision tree first attempts to identify the variable that best separates the two classes, along with a cutoff for that variable that performs well. This is designed to encourage homogeneity within each of the two resulting groups; that is, we want the observations within each group to be as similar as possible after each step. Each variable-and-cutoff combination is called a decision rule. The tree then attempts to identify a second variable and cutoff to further separate a portion of the data into the two classes. It continues to identify separating variables and cutoffs until perfect separation is achieved, a certain threshold of correctly classified observations is met, or a certain number of decision rules have been generated. In some cases, the probability of an observation being in a certain class can also be returned as the final output.

In other words, the result of a decision tree is a flow chart that returns either a probability or a class based on the conditions met by an observation.
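To make the idea of a "good" cutoff concrete, the sketch below (not part of the original example, and not how the fitting algorithm is implemented in practice) uses made-up helper functions and toy data to score a single candidate decision rule by the homogeneity of the two groups it creates, measured here with Gini impurity; lower scores indicate more homogeneous groups.

import numpy as np

def gini(labels):
    # Gini impurity is 0 when a group contains only one class (perfectly homogeneous)
    _, counts = np.unique(labels, return_counts = True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def split_score(x, labels, cutoff):
    # Weighted average impurity of the two groups created by the rule x <= cutoff
    left, right = labels[x <= cutoff], labels[x > cutoff]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy data (made up for illustration): a cutoff of 2 separates the classes perfectly
x = np.array([0, 1, 2, 5, 8, 13])
labels = np.array(['fake', 'fake', 'fake', 'real', 'real', 'real'])
split_score(x, labels, cutoff = 2)   # 0.0, a perfect split
split_score(x, labels, cutoff = 8)   # 0.4, a less homogeneous split

A decision tree effectively searches over many such variable-and-cutoff combinations and keeps the rule with the best score at each step.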

Fitting a Decision Tree

For the purposes of this page, we'll show one example of fitting a decision tree to predict whether an Instagram account is real or fake.
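This example assumes the Instagram accounts data has already been loaded into a pandas DataFrame named df; a minimal loading step might look like the sketch below, where the file name is a hypothetical placeholder rather than part of the original example.

import pandas as pd

# Hypothetical file name; replace with the actual path to the Instagram accounts data
df = pd.read_csv('instagram_accounts.csv')

We then store the target variable, account_type, which records whether each account is real or fake.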

y = df['account_type']

Next, we prepare our features (predictor variables). We can use the get_dummies function from pandas to convert our categorical variable, has_a_profile_pic, into a quantitative (indicator) variable.

X = pd.get_dummies(df[['has_a_profile_pic', 'number_of_words_in_name', 'num_characters_in_bio', 'number_of_posts',
                       'number_of_followers', 'number_of_follows']], drop_first = True)
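As a small illustration of what get_dummies does (using made-up values, since the raw data is not shown here), a two-level categorical column becomes a single indicator column when drop_first = True.

example = pd.DataFrame({'has_a_profile_pic': ['yes', 'no', 'yes']})
pd.get_dummies(example, drop_first = True)
# Returns one indicator column, has_a_profile_pic_yes, containing
# True/False (or 1/0, depending on the pandas version) for each row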

We can now generate the decision tree by supplying the classifier from sklearn with the features and the target.

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X,y)

Once the classifier has been fit, we can generate predictions for our data of interest. Below, we use the original (training) data.

y_pred = clf.predict(X)

We can see that the classifier performs perfectly on the data that trained it. We can generate the confusion matrix either manually (with crosstab from pandas) or with the confusion_matrix function from sklearn. Note that we would ordinarily use testing data to evaluate the performance of the classifier. For illustrative purposes here, we have bypassed this step and focused only on the implementation of a decision tree.

pd.crosstab(y_pred, y)
account_type  fake  real
row_0
fake            60     0
real             0    60

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y, y_pred)
cm
    array([[60,  0],
           [ 0, 60]])
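As noted above, we would ordinarily hold out testing data rather than evaluating on the data that trained the classifier. That bypassed step might look something like the sketch below, where the 30% test size and the random_state value are arbitrary choices rather than part of the original example.

from sklearn.model_selection import train_test_split

# Reserve 30% of the observations as a test set (arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

clf_holdout = DecisionTreeClassifier()
clf_holdout.fit(X_train, y_train)

# Accuracy on observations the tree has never seen
clf_holdout.score(X_test, y_test)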

To learn more about the actual separation path between the two classes, we can generate a figure of the decision tree below. The first few lines of code adjust the plot to increase its size and improve its legibility. We see that the first node involves the column with index 1 (the number of characters in the bio) and a cutoff of 2. If this first condition is met, then the observations go to the left side; if it is not met, then the observations go to the right side. Then, we move on to our second separating condition on each side, based either on the number of followers or on whether the account has a profile picture.

import matplotlib.pyplot as plt
from sklearn import tree

# Increase the figure size so the text in each node is legible
plt.figure(figsize = (12, 12))
tree.plot_tree(clf, fontsize = 10)
plt.show()

A visualization of the decision tree.
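If the figure is difficult to read, the same decision rules can also be printed as indented text with the export_text function from sklearn.tree, using the column names of X as the feature names.

from sklearn.tree import export_text

# Print each decision rule and cutoff, one level of the tree per indentation level
print(export_text(clf, feature_names = list(X.columns)))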

Advantages and Drawbacks of Decision Trees

From this output, we can observe some of the advantages of decision trees. The structure of a decision tree is simple, being a composite of easy-to-understand conditions. The algorithm therefore also provides a simple method for estimating the target of an observation by hand. The process for making decisions is clearly defined, which is a benefit compared to algorithms where the underlying reasons for the predicted output are unclear or more complex to understand.

A decision tree can be built with very little data. Even with little data to support the separation between different groups, a decision tree can still be informative. Because it is based on simple decision rules, the rules can be easily interpreted and provide some intuition as to the underlying phenomenon in the data.

Decision trees are also flexible, as they can provide classifiers for many classes and can provide different types of output (probabilities or exact classification levels), depending on the format of the data.
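For example, with the classifier fit above, predicted class probabilities can be obtained with the predict_proba method; for a decision tree, these correspond to the class proportions in the leaf that each observation falls into.

# Probability of each class (columns ordered as in clf.classes_) for every observation
clf.predict_proba(X)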

On the other hand, decision trees do have some drawbacks. Without any modifications to the fitting process, a decision tree will often overfit the training data, and some of its specific decision rules may not generalize to additional data. Decision trees can also be unstable, as a small change in the data might produce a different set of optimal decision rules. In conjunction with overfitting, decision trees tend to have worse predictive performance than other classifiers.

Finally, decision trees do not treat all variables in a fully comparable way. In particular, categorical variables with more levels are more likely to be selected for a decision rule than categorical variables with fewer levels or quantitative variables, because the additional levels offer more candidate splits and thus may appear to promote better separation between the classes.

Note that decision trees have additional parameters and features that can be adjusted to optimize their performance. For this initial introduction to decision trees, we will not cover each of these parameters in detail.
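As a pointer toward those options, the sketch below shows two commonly adjusted parameters of DecisionTreeClassifier, max_depth and min_samples_leaf, which limit how far the tree can grow and can help reduce overfitting; the particular values here are arbitrary.

# Arbitrary example values: the tree may be at most 3 levels deep,
# and every leaf must contain at least 5 observations
clf_limited = DecisionTreeClassifier(max_depth = 3, min_samples_leaf = 5)
clf_limited.fit(X, y)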