Instagram Classifier Introduction


Building a Classifier

Let's return to the research goal that we explored in Module 7: we'd like to build a classifier model that predicts whether any given Instagram account is real or fake. Using the fake_insta.csv dataset that we explored in Module 7, we now have the following potential explanatory variables that we could use to predict our response variable (account_type):

  • has_a_profile_pic
  • number_of_words_in_name
  • num_characters_in_bio
  • number_of_posts
  • number_of_followers
  • number_of_follows

import pandas as pd

df = pd.read_csv('fake_insta.csv')
print(df.shape)
df.head()

(120, 7)

has_a_profile_pic number_of_words_in_name num_characters_in_bio number_of_posts number_of_followers number_of_follows account_type
0 yes 1 30 35 488 604 real
1 yes 5 64 3 35 6 real
2 yes 2 82 319 328 668 real
3 yes 1 143 273 14890 7369 real
4 yes 1 76 6 225 356 real

For coding ease, let's create a list of our numerical explanatory variable column names.

num_cols = list(df.columns)
num_cols.remove('has_a_profile_pic')
num_cols.remove('account_type')
num_cols

['number_of_words_in_name',
'num_characters_in_bio',
'number_of_posts',
'number_of_followers',
'number_of_follows']

Basic Descriptive Analytics and Data Cleaning

But let's not forget to first perform some basic descriptive analytics on this dataset. With the goal of building an effective classifier from this dataset in mind, this preliminary analysis may provide insights that help us make decisions such as the following:

  1. How should we clean the dataset?
  2. Which model should we use?
  3. What are the potential pitfalls of our chosen model?

Outliers

Unfortunately, the presence of strong outliers in a dataset can sometimes negatively influence the predictive power of many types of models. For instance, a model may work hard to yield better predictions for a few outliers, perhaps at the expense of good predictions for the bulk of observations that are non-outliers. So let's first inspect our dataset for the presence of some types of outliers.

There can actually be many ways in which an observation might be considered an outlier.

Single Variable Outliers

As we've seen in Unit 7, some observations can be labeled as "outliers" based on how far away they are from the bulk of observations for just a single variable. To look for observations that are outliers due to just a single variable, we can look at a boxplot of each numerical explanatory variable in the dataset.

import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot of each numerical explanatory variable.
for col in num_cols:
    sns.boxplot(df[col])
    plt.title(col)
    plt.show()
[Boxplots of each of the five numerical explanatory variables]

Technically, we can observe single variable outliers for each of our 5 numerical variables in this dataset. That is, each numerical variable has at least one observation in which it is either:

  • $>Q3+1.5IQR$ (actually just this type in this case)
  • $<Q1-1.5IQR$

Let's see what would theoretically happen if we got rid of all observations that were deemed a single variable outlier using these inequalities.

df_temp = df.copy()
print('Initial Number of Rows:', df_temp.shape[0])

# Filter out the high single variable outliers, one column at a time.
for col in num_cols:
    Q1 = df_temp[col].quantile(.25)
    Q3 = df_temp[col].quantile(.75)
    IQR = Q3 - Q1
    df_temp = df_temp[df_temp[col] < Q3 + 1.5*IQR]
    print('Remaining Number of Rows:', df_temp.shape[0])

Initial Number of Rows: 120
Remaining Number of Rows: 116
Remaining Number of Rows: 109
Remaining Number of Rows: 99
Remaining Number of Rows: 91
Remaining Number of Rows: 86

Data Cleaning Issue

Unfortunately, getting rid of all observations that are either above or below these technical outlier thresholds seems to decrease the dataset size dramatically. Our dataset was quite small (120 rows) to begin with, and now it only has 86 rows.

Given that we also eventually plan on splitting this dataset into a training and test dataset, this will make our test dataset extremely small. Having small datasets can increase the variability of our analysis results. For instance, our potential results, insights, and decisions may vary wildly based on the particular random seed that we used to randomly split the data into a training and test dataset.
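To make this concern concrete, here is a minimal sketch (assuming scikit-learn's train_test_split, which we use later) that counts how many fake and real accounts land in the test set under a few different random seeds. With this few rows, the class balance of the test set can shift noticeably from seed to seed; the seed values here are arbitrary illustrations.

from sklearn.model_selection import train_test_split

# Illustrative only: check how the test set's class balance changes with the random seed.
for seed in [1, 2, 3]:
    _, df_test_demo = train_test_split(df, test_size=0.2, random_state=seed)
    print(seed, df_test_demo['account_type'].value_counts().to_dict())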

Potential Analysis Compromise

It seems like only the number_of_followers variable has a dramatically high set of outliers. So let's just focus on filtering out these extremely high single variable outliers.

sns.boxplot(df['number_of_followers'])
plt.title('Number of Followers')
plt.show()
[Boxplot of number_of_followers]

We can see that by filtering out just the single variable outliers in the number_of_followers column, we only need to delete 15 observations.

col = 'number_of_followers'
df = df.copy()
Q1 = df[col].quantile(.25)
Q3 = df[col].quantile(.75)
IQR = Q3 - Q1
# Keep only the accounts below the high outlier threshold for this column.
df = df[df[col] < Q3 + 1.5*IQR]
print('Remaining Number of Rows:', df.shape[0])

Remaining Number of Rows: 105

And now the remaining 105 number_of_followers observations in the updated dataset only have a few new outliers that are not quite as high.

sns.boxplot(df['number_of_followers'])
plt.title('Number of Followers')
plt.show()
[Boxplot of number_of_followers after filtering]

Two-Variable Outliers

Furthermore, we can have outliers that only appear when looking at the relationship between two or more variables in a scatterplot. For instance, if we were not trying to conserve as many observations as possible in this dataset, we might also consider deleting the account that distinguishes itself with both a high number of posts (330) AND a high number of people that it follows (3504). We can see that this point is an outlier in the number_of_posts vs. number_of_follows scatterplot below.

sns.pairplot(df)
plt.show()
[Pairplot of the numerical explanatory variables]
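If we did decide to drop that single two-variable outlier, a minimal sketch of the filter might look like the following. The 330-post and 3504-follows values come from the scatterplot discussion above; we are not actually applying this filter to our working dataset.

# Sketch only -- we keep this account in our working dataset.
bivariate_outlier = (df['number_of_posts'] >= 330) & (df['number_of_follows'] >= 3504)
df_without_bivariate_outlier = df[~bivariate_outlier]
print('Remaining Number of Rows:', df_without_bivariate_outlier.shape[0])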

3+ Variable Outliers

Outliers that only appear in scatterplots of 3 or more dimensions become even more challenging to see visually. Thus, we usually need to rely on more sophisticated techniques to search for all possible outliers. Furthermore, not all outliers will necessarily have a negative influence on the models that we build. Thus, techniques that help us pick out only the influential outliers can help us conserve more of the original data for model building.
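For instance, one common multivariate option is an isolation forest, which scores each observation by how easily it can be isolated across all of the numerical variables at once. Here is a minimal sketch; the 5% contamination level is an arbitrary illustration, not a recommendation.

from sklearn.ensemble import IsolationForest

# Flag roughly 5% of rows as multivariate outlier candidates (illustrative choice).
iso = IsolationForest(contamination=0.05, random_state=207)
labels = iso.fit_predict(df[num_cols])  # -1 = flagged as an outlier, 1 = inlier
print('Flagged rows:', (labels == -1).sum())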

Explanatory Variable Relationships with the Response Variable

Given that our goal is to predict whether an account is fake or real, our response variable in this analysis will be account_type.

Let's first see if we can detect any individual associations between our account_type response variable and our six potential explanatory variables.

Numerical Explanatory Variables

To visualize the relationship between one of our numerical explanatory variables and our categorical response variable, we can use side-by-side boxplots like we discussed in module 8.

# Side-by-side boxplots of each numerical explanatory variable, split by account_type.
for col in num_cols:
    sns.boxplot(x=col, y='account_type', data=df)
    plt.show()
[Side-by-side boxplots of each numerical explanatory variable, split by account_type]

Associations

For four of our five numerical explanatory variables (all except number_of_follows), we can see at least a decent association with the account_type variable. In each of these four side-by-side boxplot visualizations, there does not appear to be too much overlap between the two IQR boxes, indicating at least somewhat of an association.

However, in the side-by-side boxplot visualization of number_of_follows and account_type, we can see that there is a strong overlap between the two IQR boxes. This indicates that there is not a strong association between number_of_follows and account_type.
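One simple way to back up this visual impression is to compare the quartiles of each numerical variable within each account type; if the fake and real IQRs barely overlap, the variable looks like a promising predictor. A minimal sketch:

# Compare Q1, median, and Q3 for fake vs. real accounts for each numerical variable.
for col in num_cols:
    print(col)
    print(df.groupby('account_type')[col].quantile([0.25, 0.5, 0.75]))
    print()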

Potential Feature Selection "Hack"

In section 11, we'll introduce more sophisticated feature selection methods that can help us "weed out" potential explanatory variables that may not bring enough predictive power to the model and may thus lead to overfitting. However, in some modeling scenarios we may have millions of potential explanatory variables to use! With such a large number of potential explanatory variables, these sophisticated feature selection methods may take too long to run.

Thus, in situations like this, we might first employ a basic "hack" feature selection method in which we weed out any explanatory variables (like number_of_follows) that do not display a strong association with the response variable. This preliminary weeding out of explanatory variables may reduce the pool of potential explanatory variables enough to then employ more sophisticated feature selection methods that more directly address our research goal of not overfitting.

Because we only have 6 potential explanatory variables, we won't automatically weed out the number_of_follows variable. We'll leave it to our more sophisticated feature selection methods to determine if it should be left out or not.
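As a rough illustration of what that kind of preliminary screening could look like, here is a sketch that ranks each numerical variable by how far apart the fake and real medians are relative to the variable's overall IQR. This is just one illustrative screening rule, not a standard named procedure or the method this course prescribes.

# Illustrative screening score: gap between class medians, scaled by the overall IQR.
for col in num_cols:
    class_medians = df.groupby('account_type')[col].median()
    overall_iqr = df[col].quantile(0.75) - df[col].quantile(0.25)
    score = abs(class_medians['real'] - class_medians['fake']) / overall_iqr if overall_iqr > 0 else float('nan')
    print(col, round(score, 2))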

Categorical Explanatory Variables

To visualize the relationship between our categorical explanatory variable has_a_profile_pic and our categorical response variable account_type, we can create a cross-tabulation table of the two categorical variables, and then plot this table in a barplot like we discussed in module 8.

temp = pd.crosstab(df['has_a_profile_pic'], df['account_type'], normalize='index')
temp

account_type fake real
has_a_profile_pic
no 1.000000 0.000000
yes 0.367089 0.632911

temp.plot.bar()
plt.ylabel('Relative Frequency')
plt.show()
[Barplot of account_type relative frequencies by has_a_profile_pic]

We can see, for instance, in the cross-tabulation table and/or the barplot above (looking at the blue bar heights) that accounts with a profile picture are much less likely to be fake. Thus we can say that there is an association between has_a_profile_pic and account_type.

Alternatively, we could have noted in the cross-tabulation table and/or the barplot above (looking at the orange bar heights) that accounts with a profile picture are much more likely to be real. Thus we can say that there is an association between has_a_profile_pic and account_type.

Remember, if you see at least one color in these types of plots for which the bar heights are not all the same, then there is an association between the two categorical variables.
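If we wanted a quick numerical check on top of this visual rule, one option (not part of this module's required workflow, just an illustration) is a chi-square test of independence on the raw counts:

from scipy.stats import chi2_contingency

# Chi-square test of independence on the raw (unnormalized) counts.
counts = pd.crosstab(df['has_a_profile_pic'], df['account_type'])
chi2, p_value, dof, expected = chi2_contingency(counts)
print('p-value:', p_value)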

Numerical Explanatory Variable Collinearity

In section 9, we'll discuss that we also need to watch out for collinear numerical explanatory variables when fitting a logistic regression model, just as we did when fitting a linear regression model. So let's also inspect the relationships between each pair of our numerical explanatory variables.

sns.pairplot(df)
plt.show()
[Pairplot of the numerical explanatory variables]

By looking at the relationship between each of our pairs of numerical explanatory variables, we can see that most of these relationships are luckily quite weak. The only pair of explanatory variables that has a moderate relationship is number_of_follows and number_of_followers with a correlation of $R=0.607$. However, a common correlation threshold for deeming whether two explanatory variables are collinear is $R=0.70$. So according to this threshold at least, it does not look like our logistic regression model will have an issue with multicollinear explanatory variables.

df[num_cols].corr()

number_of_words_in_name num_characters_in_bio number_of_posts number_of_followers number_of_follows
number_of_words_in_name 1.000000 0.198617 0.434214 0.183240 -0.053608
num_characters_in_bio 0.198617 1.000000 0.442632 0.410773 0.080228
number_of_posts 0.434214 0.442632 1.000000 0.398960 0.154180
number_of_followers 0.183240 0.410773 0.398960 1.000000 0.606749
number_of_follows -0.053608 0.080228 0.154180 0.606749 1.000000
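Since we're applying a specific threshold ($R=0.70$), a small sketch that scans the correlation matrix for any pair of explanatory variables exceeding it can make this check explicit:

# Flag any pair of numerical explanatory variables with |R| above the 0.70 threshold.
corr = df[num_cols].corr()
threshold = 0.70
for i, col1 in enumerate(num_cols):
    for col2 in num_cols[i + 1:]:
        if abs(corr.loc[col1, col2]) > threshold:
            print(col1, col2, round(corr.loc[col1, col2], 3))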

Training and Test Datasets

Let's focus on the language of our stated research goal: we'd like to build a model that predicts whether any given Instagram account is real or fake. In other words, we'd like to build a classifier model that will yield good predictions for new Instagram accounts.

Thus, in order for us to get a better sense as to how well our chosen classifier will perform with new Instagram account datasets, we should either employ cross-validation techniques or at the very least a train-test-split technique to assess this. We'll introduce cross-validation techniques for classifier models in section 12, but for now let's just randomly split our dataset into a single training and test dataset.

As we've done for linear regression models, we'll use:

  • our training dataset to train our classifier model, and
  • our test dataset to test our classifier model.

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2, random_state=207)
df_train.head()

has_a_profile_pic number_of_words_in_name num_characters_in_bio number_of_posts number_of_followers number_of_follows account_type
47 yes 2 0 0 87 40 real
51 yes 2 81 25 341 274 real
75 no 1 0 1 24 2 fake
93 yes 0 0 15 772 3239 fake
76 no 2 0 0 13 22 fake

df_test.head()

has_a_profile_pic number_of_words_in_name num_characters_in_bio number_of_posts number_of_followers number_of_follows account_type
71 no 1 0 0 16 2 fake
6 yes 1 132 9 213 254 real
16 yes 2 86 25 96 499 real
84 no 1 0 0 21 31 fake
55 yes 7 24 465 654 535 real
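To preview where this is headed (the actual model fitting and evaluation are covered in the coming sections), here is a minimal, hedged sketch of fitting a logistic regression classifier on the training dataset. The 0/1 encoding and default settings below are illustrative assumptions, not the module's prescribed workflow.

from sklearn.linear_model import LogisticRegression

# Encode the categorical explanatory variable and the response as 0/1 (illustrative).
X_train = df_train[num_cols].copy()
X_train['has_a_profile_pic'] = (df_train['has_a_profile_pic'] == 'yes').astype(int)
y_train = (df_train['account_type'] == 'real').astype(int)

X_test = df_test[num_cols].copy()
X_test['has_a_profile_pic'] = (df_test['has_a_profile_pic'] == 'yes').astype(int)
y_test = (df_test['account_type'] == 'real').astype(int)

# Fit the classifier on the training data and score it on the held-out test data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))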