Missing Data


Imagine that you are a participant in a survey. The survey includes a personal question, an unclear question, or a question that you either don't want to or don't know how to answer. What happens in that instance? You may respond with an answer like "I don't know."

How does this get recorded in the data? Often, it will be recorded as a missing value. Encountering missing data when analyzing data is common, so knowing available options can help select an appropriate method.

Issues Caused by Missing Data

Let's start looking at the Chicago Airbnb data. We will start by using the original version of the data as downloaded from Inside Airbnb.

We'll look at the host_listings_count as provided by the data.

df['host_listings_count'].describe()
        count    7747.000000
        mean      427.291597
        std      1339.708447
        min         0.000000
        25%         1.000000
        50%         4.000000
        75%        17.000000
        max      4807.000000
        Name: host_listings_count, dtype: float64
    

We can observe some summary characteristics for the number of listings that a host has.

(df['host_listings_count'] == 0).sum()
19

We see that we have 19 listings where the host is recorded to have 0 listings. This value does not seem to make sense, as each host must include at least one listing to be included in our data. Let's adjust this value to be reported as "missing" instead.

Since we are directly adjusting our data, it can be helpful to do so on a copy of the data. That way, if we make a mistake or otherwise need to return to the original data, we can do so easily. We do so using the copy function.

df_copy = df.copy()
df_copy['host_listings_count'][df_copy['host_listings_count'] == 0] = 'missing'

Python returns a warning message.

Note that we did receive a Warning message (but not an Error message). Python is making sure that we are aware that we have only adjusted those observations with a host_listings_count of 0 and not the entire data frame. However, it did successfully adjust this variable for those observations.

We've now replaced the values of 0 with "missing", since we don't think those values are reasonable. How does this affect our later analyses?

df_copy['host_listings_count'].describe()
    count     7747
    unique      61
    top          1
    freq      2331
    Name: host_listings_count, dtype: int64
    
df_copy['host_listings_count'].mean()

Now we get a lengthy error message, with the last few lines shown above

Once we replace the 0 values in the data with the word missing, the mean can no longer be calculated. This is because it is unclear how to incorporate the string value "missing" into the mean calculation.

There is another approach that we can use; instead of representing the values with 0 using the string "missing", we can instead as Python to recognize that it is not a valid number. One option is with None. None will be recorded as NaN, which stands for "Not a Number", communicates that the given value is missing.

Once we have updated the data to use NaN for the missing values, the mean can again be calculated appropriately.

df_copy['host_listings_count'][df_copy['host_listings_count'] == "missing"] = None
df_copy['host_listings_count'].mean()
428.342132505176

We can see that how missing data is encoded affects the data cleaning and data analysis steps that must be followed later. Therefore, it can be very helpful to have Python identify and represent any missing data appropriately, especially for future analyses.

We'll demonstrate one approach to identifying missing data along with a discussion of why this matters on this page.

Using Python to Identify Missing Data

Since missing values generally prevent proper analysis of quantitative variables, we'll focus on identifying missing values encoded in quantitative variables. We'll also use a manufactured Instagram dataset that has more missing values for the rest of this page and some of the following pages.

Note that once we read in the Instagram data as df below, we can no longer access the Chicago Airbnb data using the object df. For this reason, it is important to be careful when naming variables.

df = pd.read_csv('fake_insta_updated.csv')
df.head()
has_a_profile_pic number_of_words_in_name num_characters_in_bio number_of_posts number_of_followers number_of_follows account_type
0 yes 1 30 35 488 604 real
1 yes 5 64 3 35 6 real
2 yes 2 82 319 328 668 real
3 yes 1 143 273 14890 7369 real
4 yes 1 76 6 225 356 real

The first five observations for the Instagram account.

df.dtypes
    has_a_profile_pic          object
    number_of_words_in_name     int64
    num_characters_in_bio       int64
    number_of_posts             int64
    number_of_followers         int64
    number_of_follows          object
    account_type               object
    dtype: object
   

We see that we have many variables that we have what appears to be many quantitative variables. Notice that the "number_of_follows" variable is recorded as an object variable type (indicating categorical variable) but seems like it should be recorded as a quantitative one instead. The "number_of_follows" variable is the only quantitative variable that seems to be like this.

As a start, let's look at the values that the "number_of_follows" variable has.

df['number_of_follows'].unique()
    array(['604', '6', '668', '7369', '356', '424', '254', '521', '143',
           '358', '492', '436', '437', '622', '141', '337', '499', '605',
           '199', '694', '276', "Don't know/Refused (VOL.)", '367', '157',
           '545', '138', '1395', '490', '347', '5514', '552', '573', '963',
           '449', '562', '346', '151', '148', '3504', '185', '293', '549',
           '466', '993', '1111', '40', '1055', '482', '47', '274', '223',
           '363', '568', '535', '577', '474', '505', '2', '64', '30', '82',
           '124', '25', '33', '34', '38', '18', '1', '15', '22', '353', '24',
           '2287', '6153', '31', '250', '6172', '2129', '324', '126', '350',
           '764', '3239', '920', '105', '58', '55', '175', '202', '636', '72',
           '7453', '162', '829', '776', '942', '1445', '4239', '1381', '669',
           '235', '7', '270', '76', '811', '164', '3572', '1695', '68'],
          dtype=object)
    

It does appear that it has missing values. Instead of being encoded as N/A, which Python would recognize and treat as a missing automatically, the missing value is recorded as "Don't know/Refused (VOL.)". We'd like Python to recognize the missing values when reading in the data, and we can do so by instructing Python that these values should be treated as missing.

We can specify any missing values using the na_values argument when reading in the data.

df = pd.read_csv('fake_insta_updated.csv', na_values = ["Don't know/Refused (VOL.)"])
df.dtypes
    has_a_profile_pic           object
    number_of_words_in_name      int64
    num_characters_in_bio        int64
    number_of_posts              int64
    number_of_followers          int64
    number_of_follows          float64
    account_type                object
    dtype: object
   

The "number_of_follows" variable is now recorded as a float type variable, and we can calculate summary measures for this variable.

df['number_of_follows'].mean()
783.8898305084746

Python was able to automatically identify all values that were recorded as "Don't know/Refused (VOL.)" and substitute NaN for that value. This eases later analyses substantially.

We can also instruct Python to recognize multiple values as missing variables, if missing values were recorded differently. For example, the following code would replace any values of missing, unknown, and no answer in the data with NaN.

df = pd.read_csv('data.csv', na_values = ['missing', 'unknown', 'no answer'])

Generally, it is easier to handle missing values when first reading in the data. However, you do need to be careful, as missing values are replaced for all variables in the dataset. For example, we might not want to read in "0" as a missing value for our Airbnb example above, as it would also remove informative 0 values in the data.

Analyzing Data with Missing Values

Once we have identified missing data, we may need to determine how to handle the missing data. For our course, we will often elect to remove the missing observations from the data. First, we will demonstrate how to do this. Then, we will briefly identify some alternatives for analyzing missing data. Many sophisticated techniques for missing data exist, and you may learn more about these in future courses.

Counting Missing Values

First, how many missing values are in the data? It is helpful to understand the scope of the missing values, so you understand how much of your data is affected by the missing values.

If we return to the Instagram data, we can count the number of missing values by chaining two functions together:

  • the isna function returns a True for a value that is recorded as missing and a False for a value that is not missing
  • the sum function adds all of the True values in a column, resulting in a count of the number of missing values
df.isna().sum()
    has_a_profile_pic          0
    number_of_words_in_name    0
    num_characters_in_bio      0
    number_of_posts            0
    number_of_followers        0
    number_of_follows          2
    account_type               0
    dtype: int64
    
df.shape
(120, 7)

We can observe that we have only 2 missing follow values out of 120 observations. Since this is a small portion of our data (1.67%), losing this information is not crucial. If we found that a variable had more missing values, we may be more concerned about why that variable was so poorly recorded.

Removing Missing Values

In this case, because we are only missing 2 observations, a simple option is to remove the observations with missing values. We can do this using the dropna function.

df = df.dropna()
df.shape
(118, 7)

Observe that now, our variable has 118 observations, so we have removed the 2 rows that had missing number of follows from the original data.

Data Imputation

Instead of removing missing values, we may instead elect to approximate what those missing values might have been. This allows us to continue using the available data from those observations. This is called imputation, since we are inferring through a calculation what the value could reasonably be.

What you choose to use as the replacement value for your missing values should depend on the context. Some common values to use include:

- 0
- 1
- median
- mean

We may also estimate the mean or median using similar values before imputation. For example, we may decide to estimate the median number of follows using only those Instagram accounts with similar characteristics to those with missing values. In this case, our two missing observations both have a profile picture and 0 characters in the bio, so we may limit our estimation to only the Instagram accounts with similar characteristics before calculating the median, to provide a more reasonable estimate for the number of follows for the missing values.

For the Chicago Airbnb example, we may choose to replace the host listing counts of 0 with 1, as we know that these hosts do have at least one listing (the one that is in our data).

For our course, we'll remove the observations that have missing values for simplicity. For more complex analyses, you could choose to save some of the information available by using data imputation instead, depending on the context.

When to Remove Missing Data

When handling missing data, we would like to preserve as much data as possible while still having quality data. balance influences our decision about how to handle missing data but should also be considered in when to adjust our data for the missing data.

For example, we could remove all observations with any missing data as soon as we read in the data. In the Instagram data, that amounts to removing 1.67% of our original data. This step may be preferred for ease of analysis and for communication. It might also be preferred if we anticipate that observations with missing data may have less quality information recorded for the other variables in the data.

However, what if we have variables in our data that we have no interest in analyzing? This might be especially true for datasets with a large number of variables, like the Chicago Airbnb data. If we remove all observations with any missing values, then we might be removing observations that we could use for our analysis.

For example, suppose that we had a variable recorded for cost of available parking, with options including free, paid, and N/A (representing not applicable), for units with parking included, parking available for pay, and parking unavailable, respectively. If we removed all of the observations with N/A, not only would we be removing an informative level for our parking variable but we would also be removing all units that do not have parking available. Our sample would be less representative of the population, since we would no longer have any data on units that do not have parking available. Additionally, if we do not use the cost of parking variable, we would be limiting our data for no functional reason.

An alternative approach would be isolate our variables of interest from the data first before removing any observations with missing values. We would then retain as much information from the data as possible by keeping as many observations that contain helpful information. This approach may mean that multiple analyses from the same data are based on different sets of observations, resulting in different sample sizes and slightly different outcomes. Communicating the data cleaning process clearly is crucial.

One important characteristic when determining when to remove missing data is the proportion (or number) of observations that would be removed. It is entirely possible that more than half of the observations may have a missing value somewhere in the dataset. In this situation, removing all incomplete observations would reduce your data considerably and may support only removing observations that are missing values for your variables of interest. On the other hand, our Instagram data has incomplete observations for only 1.67% of the observations. This is a relatively small portion of the data and removing them is not likely to change the results substantially.

Thinking Critically about Missing Data

While missing data does pose a challenge to data analysis, there are also examples of where missing data is informative. Consider the previous example of the cost for parking. The option of N/A may be used when no parking is available for a unit. However, the option of N/A may also be recorded if this information wasn't available for a given unit, or if the cost of parking differed by the day of week. Some of these situations could be solved with more informative or careful data collection, but some of these situations may not be able to be clearly recorded in the data without making the data too messy. What can we do?

When handling missing data, it is important to consider the variable carefully. It can also help to think about scenarios when data might be missing, to consider why that might be missing.

The Washington Post has collected data on fatal police shootings since 2015. This data includes a number of variables, including the name and some demographic information about the victim, including age, race, and sex. Some of the demographic information is not available for all victims in the data. There are some victims with all demographic information present except the race variable. What reasons might the race not be recorded? It could be that the race truly cannot be determined and is reflecting the available information. It is also possible that race is withheld for some reason. These reasons could be due to some of the racial tensions currently in the United States. The missing values in this data might not be completely uninformative.

Missing data does not only include values for a given variable being missing. We should also consider when full observations might be missing from the data. Are there portions of the population that are not represented in the data? For example, some data is missing perspectives from certain demographics, since that data is not included in that data.

In general, it can be illuminating to explore the biases that are present in the data through the information that might be missing in the data. For this reason, it can be helpful to think critically about missing data. Some questions that you might ask include:

  • What observations might be systematically missing from my data?
  • What variable values might be systematically missing from my data?
  • What might not be captured in the data in general

Other Types of Missing or Incomplete Data

So far, we've considered data that is missing and recorded as "unavailable" or another similar value.

There are other ways in which data can be missing, incomplete, or otherwise unrealistic. We'll demonstrate a few examples here and a few more on the page about measurement errors, although more are possible.

Truncated Data

Some data is truncated, which means that the values are not fully recorded. For example, the Airbnb data lists the maximum number that a listing can accommodate as 16. However, the website actually truncates this value at 16; therefore, any listing that can host 16 or more people is simplified to 16. This is a limitation to the available data. Therefore, it can be helpful to recognize that the true mean or the true standard deviation will be slightly larger than reported, based on some of the listings underreporting the number that can be accommodated.

Numerical Data Codes

When a variable is recorded using numbers, certain numbers may have special meanings. Be sure to read through any supporting documentation of the data. Some common numbers include 0, 9, 99, or other numbers with repeated 9s. All of these may stand for missing data.

For this reason, it is not enough to look for variables with an unexpected variable type. Filtering out missing data that have a specific value can also be performed before analysis.

The following example shows code that could be used to remove observations that have a specific value in a dataset.

df = df[df['host_listings_count'] != 0]

Overall, it is important to carefully consider what values might be missing, to encode them properly, and to determine the optimal way to handle the missing values, to preserve as much helpful information as possible while also retaining quality observations.