Making Predictions


As of 7/19/2023, the DataScienceDuo Instagram account:

  • has a profile picture
  • has 3 words in their name (i.e., "Data Science Duo")
  • has 131 characters in their bio
  • has 18 posts
  • has 292 followers
  • and follows 14 accounts.

Probability Prediction

Predicted Probability it's Real

Let's first predict the probability that this account is real. Recall that because we designated our response variable y=1 to represent real accounts, $\hat{p}$ will be the predicted probability that a given account is real.

We could manually plug these values into our logistic regression equation (in the format below).

\begin{align*} \hat{p} = \frac{1}{1 + \exp\left(-\begin{aligned} &-37.47 \\ &+ 30.73(1) \\ &+ 2.60(3) \\ &+ 0.087(131) \\ &- 0.0060(18) \\ &+ 0.025(292) \\ &- 0.0046(14) \end{aligned}\right)} =0.9999999968774005 \end{align*}
import numpy as np

# Plug the account's values into the fitted equation (rounded coefficients)
prob_real = 1/(1+np.exp(-(-37.47 + 30.73*1 +
                          2.60*3 + 0.087*131 - 0.0060*18 +
                          0.025*292 - 0.0046*14)))
prob_real

0.9999999968774005
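The hand calculation above can also be organized so that the coefficients and the account's values are kept separate. The following is a minimal sketch using only the standard library; the dictionary keys and the helper names `linear_predictor` and `logistic` are illustrative assumptions, and the coefficients are the rounded values from the equation above rather than the model's full-precision estimates.

```python
import math

# Rounded coefficients copied from the equation above.
# The keys are illustrative labels (an assumption), not necessarily
# the exact term names stored in the fitted model.
coefs = {
    "intercept": -37.47,
    "has_a_profile_pic": 30.73,
    "number_of_words_in_name": 2.60,
    "num_characters_in_bio": 0.087,
    "number_of_posts": -0.0060,
    "number_of_followers": 0.025,
    "number_of_follows": -0.0046,
}

# DataScienceDuo's explanatory variable values (1 = has a profile picture).
account = {
    "has_a_profile_pic": 1,
    "number_of_words_in_name": 3,
    "num_characters_in_bio": 131,
    "number_of_posts": 18,
    "number_of_followers": 292,
    "number_of_follows": 14,
}

def linear_predictor(coefs, x):
    """Intercept plus the sum of coefficient * value over all variables."""
    return coefs["intercept"] + sum(coefs[name] * value for name, value in x.items())

def logistic(z):
    """Convert a linear predictor (log-odds) into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

z = linear_predictor(coefs, account)   # predicted log-odds of being real
prob_real_manual = logistic(z)         # predicted probability of being real
print(z, prob_real_manual)
```

Keeping the linear predictor `z` as its own value is convenient because the same number is reused later when computing the odds and the log-odds.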

Or we could use our .predict() function with the given explanatory variable values.

Note: The .predict() function specifically returns the predictive probability.

probs = log_mod_full.predict(exog=dict(has_a_profile_pic='yes',
number_of_words_in_name=3,
num_characters_in_bio=131,
number_of_posts=18,
number_of_followers=292,
number_of_follows=14))

probs

0 1.0
dtype: float64

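Wow! The model predicts a very high probability that this Instagram account is real. Technically this probability is 0.999999997178352, but it was rounded to 1.0 in the output. (It differs slightly from the manual result of 0.9999999968774005, most likely because the manual calculation plugs in rounded coefficient values.)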

prob_real = probs.values[0]
prob_real

0.999999997178352

Predicted Probability it's Fake

Similarly, the model predicts a very low probability that the account is fake: about 0.00000000282.

prob_fake = 1-prob_real
prob_fake

2.821648026340995e-09
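As a cross-check, the probability of being fake can be computed in two equivalent ways: as the complement of the probability of being real, or directly from the logistic function with the sign of the linear predictor flipped. The sketch below assumes the rounded coefficients from the manual calculation, so it gives roughly 3.1e-09 rather than the value based on .predict(); the variable names are illustrative.

```python
import math

# Linear predictor (log-odds of "real") from the rounded coefficients.
z = (-37.47 + 30.73*1 + 2.60*3 + 0.087*131
     - 0.0060*18 + 0.025*292 - 0.0046*14)

prob_real = 1.0 / (1.0 + math.exp(-z))

# Way 1: complement rule, P(fake) = 1 - P(real).
prob_fake_complement = 1.0 - prob_real

# Way 2: logistic function evaluated at -z, P(fake) = 1 / (1 + exp(z)).
prob_fake_direct = 1.0 / (1.0 + math.exp(z))

print(prob_fake_complement, prob_fake_direct)
```

The two results agree to many digits. When the probability of being real is this close to 1, the direct form avoids subtracting two nearly equal numbers, which is a small numerical convenience rather than a change in the model.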

Prediction Evaluation

Given that the DataScienceDuo Instagram account is a real account, we can see that the model performed extremely well at predicting the status of this particular account.

Odds Prediction

Predicted Odds it's Real

Next, let's predict the odds that this account is real.

Numerical Odds

Manual Calculation

We can calculate these odds by hand using the equation below.

Recall that because we designated our response variable y=1 to represent real accounts, the resulting odds $\frac{\hat{p}}{1-\hat{p}}$ will be the odds that a given account is real.

\begin{align*} \hat{odds} = \frac{\hat{p}}{1-\hat{p}} =\exp\left(\begin{aligned} &-37.47 \\ &+ 30.73(1) \\ &+ 2.60(3) \\ &+ 0.087(131) \\ &- 0.0060(18) \\ &+ 0.025(292) \\ &- 0.0046(14) \end{aligned}\right) =320245997 \end{align*}
odds_real = np.exp(-37.47 + 30.73*1 + 
2.60*3 + 0.087*131 - 0.0060*18 +
0.025*292 - 0.0046*14)
odds_real

320245997.303172

Converting from a Probability

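Or, given that the predicted probability that the account is real is $\hat{p}=0.9999999968774005$, we could calculate the odds using the probability-to-odds conversion equation below. (The code that follows uses the probability returned by .predict(), 0.999999997178352, so its output differs from 320245997. When $\hat{p}$ is this close to 1, tiny differences in $\hat{p}$ produce large differences in the odds.)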

$$\hat{odds}_{real} = \frac{\hat{p}}{1-\hat{p}}\approx 320245997$$

odds_real=prob_real/(1-prob_real)
odds_real

354402812.76864773
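To see how sensitive the odds are to the probability when it is very close to 1, the conversion can be applied to both probabilities that appear in this section. This is a minimal sketch; `prob_to_odds` is a hypothetical helper name, not part of any library.

```python
def prob_to_odds(p):
    """Convert a probability p (0 <= p < 1) into numerical odds p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)

# Probability from the manual calculation (rounded coefficients).
p_manual = 0.9999999968774005
# Probability returned by .predict() (full-precision coefficients).
p_predict = 0.999999997178352

odds_manual = prob_to_odds(p_manual)
odds_predict = prob_to_odds(p_predict)

# The probabilities differ only in the ninth decimal place or so,
# yet the odds differ by tens of millions.
print(odds_manual, odds_predict, odds_predict - odds_manual)
```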

Prose Odds

Note that 320245997 is the numerical odds that the account is real. To convert these odds into the prose format, note that the numerical odds of being real are simply the ratio of the number of real chances to the number of fake chances.

$$320245997=\frac{320245997\ real\ chances}{1\ fake\ chance}$$

So the most straightforward way to communicate these odds using the prose definition would be to say "the predicted odds that this account is real are 320245997 to 1".

Technically, we could multiply both of these numbers by any other positive number (say 2) and get an equivalent odds statement. For instance, "the predicted odds that this account is real are $2\cdot320245997$ to $2\cdot1$".
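The equivalence of scaled odds statements can be checked with exact rational arithmetic. The sketch below uses Python's `fractions.Fraction`; the helper `prose_odds` is a hypothetical name introduced only for illustration.

```python
from fractions import Fraction

real_chances, fake_chances = 320245997, 1

# Numerical odds of "real" as an exact ratio of real to fake chances.
odds_real = Fraction(real_chances, fake_chances)

# Multiplying both counts by 2 describes the same odds.
odds_real_scaled = Fraction(2 * real_chances, 2 * fake_chances)
print(odds_real == odds_real_scaled)

# Swapping the counts gives the odds of "fake".
odds_fake = Fraction(fake_chances, real_chances)
print(float(odds_fake))

def prose_odds(numerator, denominator, event):
    """Format an odds statement in the prose 'a to b' style."""
    return f"the predicted odds that this account is {event} are {numerator} to {denominator}"

print(prose_odds(real_chances, fake_chances, "real"))
print(prose_odds(fake_chances, real_chances, "fake"))
```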

Predicted Odds it's Fake

Prose Odds

Given that the model predicts "320245997 real chances" for every "1 fake chance", we could say "the predicted odds that the account is fake are 1 to 320245997", using the prose odds definition.

Numerical Odds

Converting from Prose Odds

To calculate the predicted numerical odds that it is fake, note that these odds are simply the ratio of the number of fake chances to the number of real chances.

$\hat{odds}_{fake}=\frac{number\ of\ fake\ chances}{number\ of\ real\ chances}=\frac{1}{320245997}\approx 0.00000000312$

Converting from Probabilities

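Or, given that we know $\hat{p}$ (the predicted probability that it's real), we can use the conversion equation to find the odds that it's fake. (The equation uses the manually calculated $\hat{p}$; the code that follows uses the probability from .predict(), so its output differs slightly.)

$\hat{odds}_{fake} = \frac{1-\hat{p}}{\hat{p}} \approx 0.00000000312$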

odds_fake = (1-prob_real)/prob_real
odds_fake

2.8216480343026926e-09

Predicted Odds Evaluation

Given that the DataScienceDuo account is real and the predicted (numerical) odds that it is real, 320245997, are very large (i.e., much larger than 1), the model predicted the status of this account very well.

Log-Odds Prediction

Predicted Log-Odds it's Real

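Finally, let's calculate the log-odds that this account is real by taking the natural log of the predicted numerical odds that it's real. (The equation uses the manually calculated odds; the code uses odds_real as last computed from the .predict() probability, so it returns about 19.69 rather than 19.58.)

$\log(\hat{odds}_{real}) = \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = \log(320245997)=19.58$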

log_odds_real = np.log(odds_real)
log_odds_real

19.68594471336982

Predicted Log-Odds it's Fake

Next, let's calculate the log-odds that this account is fake by taking the natural log of the predicted numerical odds that it's fake.

$\log(\hat{odds}_{fake}) = \log\left(\frac{1-\hat{p}}{\hat{p}}\right) = \log(1/320245997)=-19.58$

log_odds_fake = np.log(odds_fake)
log_odds_fake

-19.68594471336982

log_odds_fake = np.log(1/odds_real)
log_odds_fake

-19.68594471336982
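The symmetry between the two log-odds, and the fact that the log-odds of being real is just the linear part of the logistic regression equation, can be checked with a short sketch. It assumes the odds from the manual calculation (rounded coefficients), so the log-odds come out near 19.58 rather than 19.69; the variable names are illustrative.

```python
import math

# Predicted odds of "real" from the manual calculation above.
odds_real = 320245997.303172
odds_fake = 1.0 / odds_real

log_odds_real = math.log(odds_real)
log_odds_fake = math.log(odds_fake)

# Linear predictor built from the same rounded coefficients.
z = (-37.47 + 30.73*1 + 2.60*3 + 0.087*131
     - 0.0060*18 + 0.025*292 - 0.0046*14)

# log(odds_real) matches z, and log(odds_fake) is its negative.
print(log_odds_real, log_odds_fake, z)
```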

Watch out for extrapolation!

Just like with linear regression, we need to watch out for extrapolation when it comes to making predictions with our model.

The DataScienceDuo account:

  • has 3 words in their name (i.e., "Data Science Duo")
  • has 131 characters in their bio
  • has 18 posts
  • has 292 followers
  • and follows 14 accounts.

Because each of the DataScienceDuo account's numerical values falls within the range of values in the training data (see the minimum and maximum values below), this prediction is not an extrapolation. Thus, we can have more trust that our model has been exposed to accounts with explanatory variable values like DataScienceDuo's, and therefore more trust that this will be a good, data-informed prediction.

df_train[num_cols].describe().loc[['min','max']]

     number_of_words_in_name  num_characters_in_bio  number_of_posts  number_of_followers  number_of_follows
min                      0.0                    0.0              0.0                  0.0                1.0
max                      5.0                  138.0            590.0               1572.0             4239.0
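The range comparison can be automated with a small check. The sketch below copies the minimum and maximum values from the table above into a plain dictionary; the function name `out_of_range` and the second, made-up account are hypothetical and only serve to illustrate the check.

```python
# (min, max) of each numerical explanatory variable in the training data,
# copied from the describe() output above.
train_ranges = {
    "number_of_words_in_name": (0.0, 5.0),
    "num_characters_in_bio": (0.0, 138.0),
    "number_of_posts": (0.0, 590.0),
    "number_of_followers": (0.0, 1572.0),
    "number_of_follows": (1.0, 4239.0),
}

datascienceduo = {
    "number_of_words_in_name": 3,
    "num_characters_in_bio": 131,
    "number_of_posts": 18,
    "number_of_followers": 292,
    "number_of_follows": 14,
}

def out_of_range(x, ranges):
    """Return the variables whose values fall outside the training range."""
    return [name for name, value in x.items()
            if not ranges[name][0] <= value <= ranges[name][1]]

# Empty list: no variable is outside the training range.
print(out_of_range(datascienceduo, train_ranges))

# Hypothetical account with far more followers than any training account.
big_account = dict(datascienceduo, number_of_followers=50000)
print(out_of_range(big_account, train_ranges))
```

This checks each variable on its own. An account could still have an unusual combination of values that the training data never covered, so passing this check is a necessary but not sufficient sign that a prediction is not an extrapolation.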