Instagram Classifier Introduction
Building a Classifier
Let's return to our research goal that we explored in Module 7. That is, we'd like to build a classifier model that predicts whether any given Instagram account is real or fake. Using our fake_insta.csv dataset that we explored in Module 7, we now have the following potential explanatory variables that we could use to predict our response variable (account_type):
- has_a_profile_pic
- number_of_words_in_name
- num_characters_in_bio
- number_of_posts
- number_of_followers
- number_of_follows
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('fake_insta.csv')
print(df.shape)
df.head()
(120, 7)
| | has_a_profile_pic | number_of_words_in_name | num_characters_in_bio | number_of_posts | number_of_followers | number_of_follows | account_type |
|---|---|---|---|---|---|---|---|
| 0 | yes | 1 | 30 | 35 | 488 | 604 | real |
| 1 | yes | 5 | 64 | 3 | 35 | 6 | real |
| 2 | yes | 2 | 82 | 319 | 328 | 668 | real |
| 3 | yes | 1 | 143 | 273 | 14890 | 7369 | real |
| 4 | yes | 1 | 76 | 6 | 225 | 356 | real |
For coding ease, let's create a list of our numerical explanatory variable column names.
num_cols = list(df.columns)
num_cols.remove('has_a_profile_pic')
num_cols.remove('account_type')
num_cols
['number_of_words_in_name',
'num_characters_in_bio',
'number_of_posts',
'number_of_followers',
'number_of_follows']
Basic Descriptive Analytics and Data Cleaning
But let's not forget to first perform some basic descriptive analytics on this dataset. With the goal of building an effective classifier in mind, this preliminary analysis can inform the following types of decisions that we might make in order to better meet our research goal.
- How should we clean the dataset?
- Which model should we use?
- What are the potential pitfalls of our chosen model?
Outliers
Unfortunately, the presence of strong outliers in a dataset can negatively influence the predictive power of many types of models. For instance, a model may work hard to yield better predictions for a few outliers, perhaps at the expense of good predictions for the bulk of non-outlier observations. So let's first inspect our dataset for the presence of some types of outliers.
There can actually be many ways in which an observation might be considered an outlier.
Single Variable Outliers
As we've seen in Unit 7, there are some types of observations that can be labeled as "outliers" based on how far away they are from the bulk of observations for just a single variable. To look for observations that are outliers due to just a single variable, we can look at a boxplot for each numerical explanatory variable in the dataset.
for col in num_cols:
    sns.boxplot(df[col])
    plt.title(col)
    plt.show()

Technically, each of our 5 numerical variables has at least one observation that is a single variable outlier. That is, each numerical variable has at least one observation whose value is either:
- $>Q3+1.5IQR$ (actually just this type in this case)
- $<Q1-1.5IQR$

The quick sketch below counts how many observations cross each of these fences for every variable.
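This is a quick sanity check (a sketch using the df and num_cols objects defined above, not part of the original analysis) that counts how many observations fall outside each fence for every numerical variable.

for col in num_cols:
    Q1 = df[col].quantile(.25)
    Q3 = df[col].quantile(.75)
    IQR = Q3 - Q1
    # count observations below the lower fence and above the upper fence
    n_low = (df[col] < Q1 - 1.5*IQR).sum()
    n_high = (df[col] > Q3 + 1.5*IQR).sum()
    print(col, '| low outliers:', n_low, '| high outliers:', n_high)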
Let's see what would theoretically happen if we got rid of all observations that were deemed a single variable outlier using these inequalities.
df_temp = df.copy()
print('Initial Number of Rows:', df_temp.shape[0])
for col in num_cols:
    Q1 = df_temp[col].quantile(.25)
    Q3 = df_temp[col].quantile(.75)
    IQR = Q3 - Q1
    df_temp = df_temp[df_temp[col] < Q3 + 1.5*IQR]
    print('Remaining Number of Rows:', df_temp.shape[0])
Initial Number of Rows: 120
Remaining Number of Rows: 116
Remaining Number of Rows: 109
Remaining Number of Rows: 99
Remaining Number of Rows: 91
Remaining Number of Rows: 86
Data Cleaning Issue
Unfortunately, getting rid of all observations that are either above or below these technical outlier thresholds seems to decrease the dataset size dramatically. Our dataset was quite small (120 rows) to begin with, and now it only has 86 rows.
Given that we also eventually plan on splitting this dataset into a training and test dataset, this will make our test dataset extremely small. Having small datasets can increase the variability of our analysis results. For instance, our potential results, insights, and decisions may vary wildly based on the particular random seed that we used to randomly split the data into a training and test dataset.
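To illustrate this seed sensitivity, here is a minimal sketch (using the 86-row df_temp data frame from the filtering experiment above) that splits the data with several different random seeds and prints the proportion of fake accounts that lands in each 20% test set. The particular seeds and the split size are arbitrary choices for illustration.

from sklearn.model_selection import train_test_split

# With only 86 rows, the class make-up of a 20% test set can shift noticeably
# from one random seed to the next.
for seed in [1, 2, 3, 4, 5]:
    train_tmp, test_tmp = train_test_split(df_temp, test_size=0.2, random_state=seed)
    print('seed', seed, '| proportion fake in test set:',
          round((test_tmp['account_type'] == 'fake').mean(), 2))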
Potential Analysis Compromise
It seems like only the number_of_followers variable has a dramatically high set of outliers. So let's just focus on filtering out these extremely high single variable outliers.
sns.boxplot(df['number_of_followers'])
plt.title('Number of Followers')
plt.show()

We can see that by just filtering out the single variable outliers for the number_of_followers column, we only need to delete 15 observations.
col = 'number_of_followers'
df = df.copy()
Q1 = df[col].quantile(.25)
Q3 = df[col].quantile(.75)
IQR = Q3 - Q1
df = df[df[col] < Q3 + 1.5*IQR]
print('Remaining Number of Rows:', df.shape[0])
Remaining Number of Rows: 105
And now the 105 observations remaining in the updated dataset have only a few new number_of_followers outliers, none of which are nearly as high.
sns.boxplot(df['number_of_followers'])
plt.title('Number of Followers')
plt.show()

Two-Variable Outliers
Furthermore, we can have outliers that only appear when looking at the relationship between two or more variables in a scatterplot. For instance, if we were not trying to conserve as many observations as possible in this dataset, we might also consider deleting the account that distinguishes itself with both a high number of posts (330) AND a high number of accounts that it follows (3504). We can see that this point is an outlier in the number_of_posts vs. number_of_follows scatterplot below.
sns.pairplot(df)
plt.show()

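As a quick check, an ad hoc filter like the one below (the 300 and 3000 cutoffs are arbitrary values chosen by eye, just to isolate this one point) pulls out that account from the current dataset.

# Ad hoc cutoffs chosen by eye to isolate the two-variable outlier described above.
df[(df['number_of_posts'] > 300) & (df['number_of_follows'] > 3000)]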
3+ Variable Outliers
Outliers that only appear in scatterplots of 3 or more variables become even more challenging to spot visually. Thus, we usually need to rely on more sophisticated techniques to search for all possible outliers. Furthermore, not all outliers will necessarily have a negative influence on the models that we build. Thus, techniques that help us pick out only the influential outliers can help us conserve more of the original data for model building.
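As one illustration of such a technique (not something used elsewhere in this module), scikit-learn's LocalOutlierFactor flags points that sit unusually far from their neighbors across all of the numerical variables at once; the n_neighbors value below is simply its default.

from sklearn.neighbors import LocalOutlierFactor

# fit_predict returns 1 for inliers and -1 for flagged multivariate outliers.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(df[num_cols])
print('Number of flagged multivariate outliers:', (labels == -1).sum())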
Explanatory Variable Relationships with the Response Variable
Given that our goal is to predict whether an account is fake or real, our response variable in this analysis will be account_type.
Let's first see if we can detect any individual associations between our account_type response variable and our six potential explanatory variables.
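Before examining these associations, it's worth a quick look at how balanced the response variable itself is (a simple check, not shown in the original output).

# Count how many real and fake accounts remain after the outlier filtering above.
df['account_type'].value_counts()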
Numerical Explanatory Variables
To visualize the relationship between one of our numerical explanatory variables and our categorical response variable, we can use side-by-side boxplots like we discussed in module 8.
for col in num_cols:
    sns.boxplot(x=col, y='account_type', data=df)
    plt.show()

Associations
For four of the five numerical explanatory variables (all except number_of_follows), we can see at least a decent association with the account_type variable. In each of the first four side-by-side boxplot visualizations, there does not appear to be too much overlap between the two IQR boxes, indicating at least somewhat of an association.
However, in the side-by-side boxplot visualization of number_of_follows and account_type we can see that there is a strong overlap between the two IQR boxes. This indicates that there is not a strong association between number_of_follows and account_type.
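One way to back up this visual read with numbers (a sketch, assuming the same df and num_cols objects as above) is to compute Q1 and Q3 for each numerical variable separately within the real and fake groups and compare the resulting intervals.

# Q1 and Q3 of each numerical variable within each account_type group;
# heavily overlapping (Q1, Q3) intervals suggest a weaker association.
df.groupby('account_type')[num_cols].quantile([.25, .75])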
Potential Feature Selection "Hack"
In section 11, we'll introduce more sophisticated feature selection methods that can help us "weed out" potential explanatory variables that may not bring enough predictive power to the model and thus may lead to overfitting. However, in some modeling scenarios we may have millions of potential explanatory variables to use! With a large number of potential explanatory variables, these sophisticated feature selection methods may take too long to run.
Thus, in situations like this, we might first employ a basic "hack" feature selection method in which we weed out any explanatory variables (like number_of_follows) that do not display a strong association with the response variable. This preliminary "weed out" of explanatory variables may reduce the pool of potential explanatory variables enough to then employ more sophisticated feature selection methods that more directly approach our research goal of not overfitting.
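As a rough sketch of what this "hack" could look like in code (the point-biserial correlation score and the 0.2 cutoff below are arbitrary illustrative choices, not a prescribed method), we could score each numerical explanatory variable against a 0/1-encoded version of the response and weed out the weak ones.

from scipy.stats import pointbiserialr

# Encode the response as 0/1 and score each numerical variable by its
# point-biserial correlation with it; weed out variables below the cutoff.
y = (df['account_type'] == 'fake').astype(int)
for col in num_cols:
    r, p = pointbiserialr(y, df[col])
    print(col, '| r =', round(r, 3), '|', 'keep' if abs(r) > 0.2 else 'weed out')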
Because we only have 6 potential explanatory variables, we won't automatically weed out the number_of_follows variable. We'll leave it to our more sophisticated feature selection methods to determine if it should be left out or not.
Categorical Explanatory Variables
To visualize the relationship between our categorical explanatory variable has_a_profile_pic and our categorical response variable account_type, we can create a cross-tabulation table of the two categorical variables and then plot this table as a barplot, like we discussed in module 8.
temp = pd.crosstab(df['has_a_profile_pic'], df['account_type'], normalize='index')
temp
| has_a_profile_pic | fake | real |
|---|---|---|
| no | 1.000000 | 0.000000 |
| yes | 0.367089 | 0.632911 |
temp.plot.bar()
plt.ylabel('Relative Frequency')
plt.show()

We can see, for instance, in the cross-tabulation table and/or the barplot above (looking at the blue bar heights) that accounts with a profile picture are much less likely to be fake. Thus we can say that there is an association between has_a_profile_pic and account_type.
Alternatively, we could have noted in the cross-tabulation table and/or the barplot above (looking at the orange bar heights) that accounts with a profile picture are much more likely to be real. Thus we can say that there is an association between has_a_profile_pic and account_type.
Remember, if you see at least one color in these types of plots whose bar heights are not all the same, then there is an association between the two categorical variables.
Numerical Explanatory Variable Collinearity
Next, in section 9 we'll discuss that we also need to watch out for collinear numerical explanatory variables when fitting a logistic regression model, just like we did when fitting a linear regression model. So let's also inspect the relationships between each pair of our numerical explanatory variables.
sns.pairplot(df)
plt.show()

By looking at the relationship between each of our pairs of numerical explanatory variables, we can see that most of these relationships are luckily quite weak. The only pair of explanatory variables that has a moderate relationship is number_of_follows and number_of_followers with a correlation of $R=0.607$. However, a common correlation threshold for deeming whether two explanatory variables are collinear is $R=0.70$. So according to this threshold at least, it does not look like our logistic regression model will have an issue with multicollinear explanatory variables.
df[num_cols].corr()
| | number_of_words_in_name | num_characters_in_bio | number_of_posts | number_of_followers | number_of_follows |
|---|---|---|---|---|---|
| number_of_words_in_name | 1.000000 | 0.198617 | 0.434214 | 0.183240 | -0.053608 |
| num_characters_in_bio | 0.198617 | 1.000000 | 0.442632 | 0.410773 | 0.080228 |
| number_of_posts | 0.434214 | 0.442632 | 1.000000 | 0.398960 | 0.154180 |
| number_of_followers | 0.183240 | 0.410773 | 0.398960 | 1.000000 | 0.606749 |
| number_of_follows | -0.053608 | 0.080228 | 0.154180 | 0.606749 | 1.000000 |
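As a small programmatic version of this check (a sketch, not part of the original notebook), we can scan the correlation matrix for any pair of explanatory variables whose absolute correlation exceeds the 0.70 threshold; with the correlations above, nothing gets flagged.

# Flag any pair of numerical explanatory variables with |R| above the 0.70 threshold.
corr = df[num_cols].corr()
for i, col1 in enumerate(num_cols):
    for col2 in num_cols[i+1:]:
        if abs(corr.loc[col1, col2]) > 0.70:
            print('Potentially collinear pair:', col1, col2, round(corr.loc[col1, col2], 3))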
Training and Test Datasets
Let's narrow in on the language of our stated research goal: we'd like to build a model that predicts whether any given Instagram account is real or fake. In other words, we'd like to build a classifier model that will yield good predictions for new Instagram accounts.
Thus, in order for us to get a better sense as to how well our chosen classifier will perform with new Instagram account datasets, we should either employ cross-validation techniques or at the very least a train-test-split technique to assess this. We'll introduce cross-validation techniques for classifier models in section 12, but for now let's just randomly split our dataset into a single training and test dataset.
As we've done for linear regression models, we'll use:
- our training dataset to train our classifier model, and
- our test dataset to test our classifier model.
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2, random_state=207)
df_train.head()
| | has_a_profile_pic | number_of_words_in_name | num_characters_in_bio | number_of_posts | number_of_followers | number_of_follows | account_type |
|---|---|---|---|---|---|---|---|
| 47 | yes | 2 | 0 | 0 | 87 | 40 | real |
| 51 | yes | 2 | 81 | 25 | 341 | 274 | real |
| 75 | no | 1 | 0 | 1 | 24 | 2 | fake |
| 93 | yes | 0 | 0 | 15 | 772 | 3239 | fake |
| 76 | no | 2 | 0 | 0 | 13 | 22 | fake |
df_test.head()
| | has_a_profile_pic | number_of_words_in_name | num_characters_in_bio | number_of_posts | number_of_followers | number_of_follows | account_type |
|---|---|---|---|---|---|---|---|
| 71 | no | 1 | 0 | 0 | 16 | 2 | fake |
| 6 | yes | 1 | 132 | 9 | 213 | 254 | real |
| 16 | yes | 2 | 86 | 25 | 96 | 499 | real |
| 84 | no | 1 | 0 | 0 | 21 | 31 | fake |
| 55 | yes | 7 | 24 | 465 | 654 | 535 | real |