Instagram Classifier Introduction


Building a Classifier

Let's return to the research goal that we explored in Module 7: we'd like to build a classifier model that predicts whether any given Instagram account is real or fake. Using the fake_insta.csv dataset that we explored in Module 7, we now have the following potential explanatory variables that we could use to predict our response variable (account_type):

  • has_a_profile_pic
  • number_of_words_in_name
  • num_characters_in_bio
  • number_of_posts
  • number_of_followers
  • number_of_follows

import pandas as pd

df = pd.read_csv('fake_insta.csv')
print(df.shape)
df.head()

(120, 7)

has_a_profile_pic number_of_words_in_name num_characters_in_bio number_of_posts number_of_followers number_of_follows account_type
0 yes 1 30 35 488 604 real
1 yes 5 64 3 35 6 real
2 yes 2 82 319 328 668 real
3 yes 1 143 273 14890 7369 real
4 yes 1 76 6 225 356 real

For coding ease, let's create a list of our numerical explanatory variable column names.

num_cols = list(df.columns)
num_cols.remove('has_a_profile_pic')
num_cols.remove('account_type')
num_cols

['number_of_words_in_name',
'num_characters_in_bio',
'number_of_posts',
'number_of_followers',
'number_of_follows']

Basic Descriptive Analytics and Data Cleaning

But let's not forget to first perform some basic descriptive analytics on this dataset. With the goal of building an effective classifier from this dataset in mind, this preliminary analysis may provide insights that help us make decisions such as the following:

  1. How should we clean the dataset?
  2. Which model should we use?
  3. What are the potential pitfalls of our chosen model?

Outliers

Unfortunately, the presence of strong outliers in a dataset can sometimes negatively influence the predictive power of many types of models. For instance, a model may work hard to yield better predictions for a few outliers, perhaps at the expense of good predictions for the bulk of observations that are non-outliers. So let's first inspect our dataset for the presence of some types of outliers.

There can actually be many ways in which an observation might be considered an outlier.

Single Variable Outliers

As we've seen in Unit 7, some observations can be labeled as "outliers" based on how far away they are from the bulk of observations for just a single variable. To look for observations that are outliers due to just a single variable, we can look at a boxplot of each numerical explanatory variable in the dataset.

import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot of each numerical explanatory variable.
for col in num_cols:
    sns.boxplot(df[col])
    plt.title(col)
    plt.show()
[Boxplots of each of the five numerical explanatory variables]

Technically, we can observe single variable outliers for each of our 5 numerical variables in this dataset. That is, each numerical variable has at least one observation in which it is either:

  • $>Q3+1.5IQR$ (actually just this type in this case)
  • $<Q1-1.5IQR$

Let's see what would theoretically happen if we got rid of all observations that were deemed a single variable outlier using these inequalities.

df_temp = df.copy()
print('Initial Number of Rows:', df_temp.shape[0])

# Filter out the high single variable outliers, one column at a time.
for col in num_cols:
    Q1 = df_temp[col].quantile(.25)
    Q3 = df_temp[col].quantile(.75)
    IQR = Q3 - Q1
    df_temp = df_temp[df_temp[col] < Q3 + 1.5*IQR]
    print('Remaining Number of Rows:', df_temp.shape[0])

Initial Number of Rows: 120
Remaining Number of Rows: 116
Remaining Number of Rows: 109
Remaining Number of Rows: 99
Remaining Number of Rows: 91
Remaining Number of Rows: 86

Data Cleaning Issue

Unfortunately, getting rid of all observations that are either above or below these technical outlier thresholds seems to decrease the dataset size dramatically. Our dataset was quite small (120 rows) to begin with, and now it only has 86 rows.

Given that we also eventually plan on splitting this dataset into a training and test dataset, this will make our test dataset extremely small. Having small datasets can increase the variability of our analysis results. For instance, our potential results, insights, and decisions may vary wildly based on the particular random seed that we used to randomly split the data into a training and test dataset.
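To make this concern concrete, here is a minimal sketch (assuming scikit-learn's train_test_split, which we use later) that counts how many fake and real accounts land in the test set under a few different random seeds. With this few rows, the class balance of the test set can shift noticeably from seed to seed; the seed values here are arbitrary illustrations.

from sklearn.model_selection import train_test_split

# Illustrative only: check how the test set's class balance changes with the random seed.
for seed in [1, 2, 3]:
    _, df_test_demo = train_test_split(df, test_size=0.2, random_state=seed)
    print(seed, df_test_demo['account_type'].value_counts().to_dict())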

Potential Analysis Compromise

It seems like only the number_of_followers variable has a dramatically high set of outliers. So let's just focus on filtering out these extremely high single variable outliers.

sns.boxplot(df['number_of_followers'])
plt.title('Number of Followers')
plt.show()
[Boxplot of number_of_followers]

We can see that by filtering out just the single variable outliers in the number_of_followers column, we only need to delete 15 observations.

col = 'number_of_followers'
df = df.copy()
Q1 = df[col].quantile(.25)
Q3 = df[col].quantile(.75)
IQR = Q3 - Q1
# Keep only the accounts below the high outlier threshold for this column.
df = df[df[col] < Q3 + 1.5*IQR]
print('Remaining Number of Rows:', df.shape[0])

Remaining Number of Rows: 105

And now the remaining 105 number_of_followers observations in the updated dataset only have a few new outliers that are not quite as high.

sns.boxplot(df['number_of_followers'])
plt.title('Number of Followers')
plt.show()
[Boxplot of number_of_followers after filtering]

Two-Variable Outliers

Furthermore, we can have outliers that only appear when looking at the relationship between two or more variables in a scatterplot. For instance, if we were not trying to conserve as many observations as possible in this dataset, we might also consider deleting the account that distinguishes itself with both a high number of posts (330) AND a high number of people that it follows (3504). We can see that this point is an outlier in the number_of_posts vs. number_of_follows scatterplot below.

sns.pairplot(df)
plt.show()
[Pairplot of the numerical explanatory variables]
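If we did decide to drop that single two-variable outlier, a minimal sketch of the filter might look like the following. The 330-post and 3504-follows values come from the scatterplot discussion above; we are not actually applying this filter to our working dataset.

# Sketch only -- we keep this account in our working dataset.
bivariate_outlier = (df['number_of_posts'] >= 330) & (df['number_of_follows'] >= 3504)
df_without_bivariate_outlier = df[~bivariate_outlier]
print('Remaining Number of Rows:', df_without_bivariate_outlier.shape[0])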

3+ Variable Outliers

Outliers that only appear in scatterplots of 3 or more dimensions become even more challenging to see visually. Thus, we usually need to rely on more sophisticated techniques to search for all possible outliers. Furthermore, not all outliers will necessarily have a negative influence on the models that we build. Thus, techniques that help us pick out only the influential outliers can help us conserve more of the original data for model building.
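For instance, one common multivariate option is an isolation forest, which scores each observation by how easily it can be isolated across all of the numerical variables at once. Here is a minimal sketch; the 5% contamination level is an arbitrary illustration, not a recommendation.

from sklearn.ensemble import IsolationForest

# Flag roughly 5% of rows as multivariate outlier candidates (illustrative choice).
iso = IsolationForest(contamination=0.05, random_state=207)
labels = iso.fit_predict(df[num_cols])  # -1 = flagged as an outlier, 1 = inlier
print('Flagged rows:', (labels == -1).sum())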

Explanatory Variable Relationships with the Response Variable

Given that our goal is to predict whether an account is fake or real, our response variable in this analysis will be account_type.

Let's first see if we can detect any individual associations between our account_type response variable and our six potential explanatory variables.

Numerical Explanatory Variables

To visualize the relationship between one of our numerical explanatory variables and our categorical response variable, we can use side-by-side boxplots like we discussed in module 8.

# Side-by-side boxplots of each numerical explanatory variable, split by account_type.
for col in num_cols:
    sns.boxplot(x=col, y='account_type', data=df)
    plt.show()
[Side-by-side boxplots of each numerical explanatory variable, split by account_type]

Associations

For four of our five numerical explanatory variables (all except number_of_follows), we can see at least a decent association with the account_type variable. In each of these four side-by-side boxplot visualizations, there does not appear to be too much overlap between the two IQR boxes, indicating at least somewhat of an association.

However, in the side-by-side boxplot visualization of number_of_follows and account_type, we can see that there is a strong overlap between the two IQR boxes. This indicates that there is not a strong association between number_of_follows and account_type.
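One simple way to back up this visual impression is to compare the quartiles of each numerical variable within each account type; if the fake and real IQRs barely overlap, the variable looks like a promising predictor. A minimal sketch:

# Compare Q1, median, and Q3 for fake vs. real accounts for each numerical variable.
for col in num_cols:
    print(col)
    print(df.groupby('account_type')[col].quantile([0.25, 0.5, 0.75]))
    print()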

Potential Feature Selection "Hack"

In section 11, we'll introduce more sophisticated feature selection methods that can help us "weed out" potential explanatory variables that may not bring enough predictive power to the model and may thus lead to overfitting. However, in some modeling scenarios we may have millions of potential explanatory variables to use! With such a large number of potential explanatory variables, these sophisticated feature selection methods may take too long to run.

Thus, in situations like this, we might first employ a basic "hack" feature selection method in which we weed out any explanatory variables (like number_of_follows) that do not display a strong association with the response variable. This preliminary weeding out of explanatory variables may reduce the pool of potential explanatory variables enough to then employ more sophisticated feature selection methods that more directly address our research goal of not overfitting.

Because we only have 6 potential explanatory variables, we won't automatically weed out the number_of_follows variable. We'll leave it to our more sophisticated feature selection methods to determine if it should be left out or not.
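As a rough illustration of what that kind of preliminary screening could look like, here is a sketch that ranks each numerical variable by how far apart the fake and real medians are relative to the variable's overall IQR. This is just one illustrative screening rule, not a standard named procedure or the method this course prescribes.

# Illustrative screening score: gap between class medians, scaled by the overall IQR.
for col in num_cols:
    class_medians = df.groupby('account_type')[col].median()
    overall_iqr = df[col].quantile(0.75) - df[col].quantile(0.25)
    score = abs(class_medians['real'] - class_medians['fake']) / overall_iqr if overall_iqr > 0 else float('nan')
    print(col, round(score, 2))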

Categorical Explanatory Variables

To visualize the relationship between our categorical explanatory variable has_a_profile_pic and our categorical response variable account_type, we can create a cross-tabulation table of the two categorical variables, and then plot this table in a barplot like we discussed in module 8.

temp = pd.crosstab(df['has_a_profile_pic'], df['account_type'], normalize='index')
temp

account_type fake real
has_a_profile_pic
no 1.000000 0.000000
yes 0.367089 0.632911

temp.plot.bar()
plt.ylabel('Relative Frequency')
plt.show()
[Barplot of account_type relative frequencies by has_a_profile_pic]

We can see, for instance, in the cross-tabulation table and/or the barplot above (looking at the blue bar heights) that accounts with a profile picture are much less likely to be fake. Thus we can say that there is an association between has_a_profile_pic and account_type.

Alternatively, we could have noted in the cross-tabulation table and/or the barplot above (looking at the orange bar heights) that accounts with a profile picture are much more likely to be real. Thus we can say that there is an association between has_a_profile_pic and account_type.

Remember, if you see at least one color in these types of plots for which the bar heights are not all the same, then there is an association between the two categorical variables.
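If we wanted a quick numerical check on top of this visual rule, one option (not part of this module's required workflow, just an illustration) is a chi-square test of independence on the raw counts:

from scipy.stats import chi2_contingency

# Chi-square test of independence on the raw (unnormalized) counts.
counts = pd.crosstab(df['has_a_profile_pic'], df['account_type'])
chi2, p_value, dof, expected = chi2_contingency(counts)
print('p-value:', p_value)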

Numerical Explanatory Variable Collinearity

In section 9, we'll discuss that we also need to watch out for collinear numerical explanatory variables when fitting a logistic regression model, just as we did when fitting a linear regression model. So let's also inspect the relationships between each pair of our numerical explanatory variables.

sns.pairplot(df)
plt.show()
[Pairplot of the numerical explanatory variables]

By looking at the relationship between each of our pairs of numerical explanatory variables, we can see that most of these relationships are luckily quite weak. The only pair of explanatory variables that has a moderate relationship is number_of_follows and number_of_followers with a correlation of $R=0.607$. However, a common correlation threshold for deeming whether two explanatory variables are collinear is $R=0.70$. So according to this threshold at least, it does not look like our logistic regression model will have an issue with multicollinear explanatory variables.

df[num_cols].corr()

number_of_words_in_name num_characters_in_bio number_of_posts number_of_followers number_of_follows
number_of_words_in_name 1.000000 0.198617 0.434214 0.183240 -0.053608
num_characters_in_bio 0.198617 1.000000 0.442632 0.410773 0.080228
number_of_posts 0.434214 0.442632 1.000000 0.398960 0.154180
number_of_followers 0.183240 0.410773 0.398960 1.000000 0.606749
number_of_follows -0.053608 0.080228 0.154180 0.606749 1.000000
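Since we're applying a specific threshold ($R=0.70$), a small sketch that scans the correlation matrix for any pair of explanatory variables exceeding it can make this check explicit:

# Flag any pair of numerical explanatory variables with |R| above the 0.70 threshold.
corr = df[num_cols].corr()
threshold = 0.70
for i, col1 in enumerate(num_cols):
    for col2 in num_cols[i + 1:]:
        if abs(corr.loc[col1, col2]) > threshold:
            print(col1, col2, round(corr.loc[col1, col2], 3))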

Training and Test Datasets

Let's focus on the language of our stated research goal: we'd like to build a model that predicts whether any given Instagram account is real or fake. In other words, we'd like to build a classifier model that will yield good predictions for new Instagram accounts.

Thus, in order for us to get a better sense as to how well our chosen classifier will perform with new Instagram account datasets, we should either employ cross-validation techniques or at the very least a train-test-split technique to assess this. We'll introduce cross-validation techniques for classifier models in section 12, but for now let's just randomly split our dataset into a single training and test dataset.

As we've done for linear regression models, we'll use:

  • our training dataset to train our classifier model, and
  • our test dataset to test our classifier model.

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2, random_state=207)
df_train.head()

has_a_profile_pic number_of_words_in_name num_characters_in_bio number_of_posts number_of_followers number_of_follows account_type
47 yes 2 0 0 87 40 real
51 yes 2 81 25 341 274 real
75 no 1 0 1 24 2 fake
93 yes 0 0 15 772 3239 fake
76 no 2 0 0 13 22 fake

df_test.head()

has_a_profile_pic number_of_words_in_name num_characters_in_bio number_of_posts number_of_followers number_of_follows account_type
71 no 1 0 0 16 2 fake
6 yes 1 132 9 213 254 real
16 yes 2 86 25 96 499 real
84 no 1 0 0 21 31 fake
55 yes 7 24 465 654 535 real
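To preview where this is headed (the actual model fitting and evaluation are covered in the coming sections), here is a minimal, hedged sketch of fitting a logistic regression classifier on the training dataset. The 0/1 encoding and default settings below are illustrative assumptions, not the module's prescribed workflow.

from sklearn.linear_model import LogisticRegression

# Encode the categorical explanatory variable and the response as 0/1 (illustrative).
X_train = df_train[num_cols].copy()
X_train['has_a_profile_pic'] = (df_train['has_a_profile_pic'] == 'yes').astype(int)
y_train = (df_train['account_type'] == 'real').astype(int)

X_test = df_test[num_cols].copy()
X_test['has_a_profile_pic'] = (df_test['has_a_profile_pic'] == 'yes').astype(int)
y_test = (df_test['account_type'] == 'real').astype(int)

# Fit the classifier on the training data and score it on the held-out test data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))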