Missing Data
Imagine that you are a participant in a survey. The survey includes a personal question, an unclear question, or a question that you either don't want to or don't know how to answer. What happens in that instance? You may respond with an answer like "I don't know."
How does this get recorded in the data? Often, it will be recorded as a missing value. Encountering missing data when analyzing data is common, so knowing some of your available options can help select an appropriate method to handle the missing data.
Issues Caused by Missing Data
Let's start looking at the Chicago Airbnb data. We will start by using the original version of the data as downloaded from Inside Airbnb.
We'll look at the host_listings_count as provided by the data.
df['host_listings_count'].describe()
count    7747.000000
mean      427.291597
std      1339.708447
min         0.000000
25%         1.000000
50%         4.000000
75%        17.000000
max      4807.000000
Name: host_listings_count, dtype: float64
We can observe some summary characteristics for the number of listings that a host has.
(df['host_listings_count'] == 0).sum()
19
We see that we have 19 listings where the host is recorded to have 0 listings. This value does not seem to make sense, as each host must include at least one listing to be included in our data. Let's adjust this value to be reported as "missing" instead.
Since we are directly adjusting our data, it can be helpful to do so on a copy of the data. That way, if we make a mistake or otherwise need to return to the original data, we can do so easily. We make the copy using the copy function. With the code below, we retain our original data as df and perform our remaining analysis on df_copy, allowing us to return to our original data in df if needed.
df_copy = df.copy()
We can then perform our task either in multiple lines of code or by nesting our code as shown below. Starting in the middle of the code below, we use a logical statement, df_copy['host_listings_count'] == 0, to identify the 19 entries where the host was recorded as having 0 listings. This is nested inside a filtering statement for df_copy['host_listings_count'], so we only pull out the host_listings_count values that are 0. At the end, we see = 'missing', where we replace those host_listings_count values that were equal to 0 with the value "missing".
df_copy['host_listings_count'][df_copy['host_listings_count'] == 0] = 'missing'

Python returns a warning message.
Note that we did receive a Warning message (but not an Error message). This SettingWithCopyWarning appears because we used chained indexing (two selection steps in a row), so pandas cannot guarantee whether the assignment modified df_copy itself or only a temporary copy of it. In this case, it did successfully adjust this variable for the observations that we wanted to change in df_copy. Warning messages are something to explore: they indicate that your code did run but may have accomplished its task in a way that is prone to mistakes.
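If we want to avoid the warning entirely, a single .loc assignment selects the rows and the column in one step, so pandas modifies the data frame directly. A minimal sketch on a small stand-in data frame (the column name matches the Airbnb data, but the values are made up for illustration):

```python
import pandas as pd

# Small stand-in data frame (made-up values for illustration)
df_copy = pd.DataFrame({'host_listings_count': [0, 1, 4, 0, 17]})

# Convert to object dtype first, since the column will hold both
# numbers and the string 'missing'
df_copy['host_listings_count'] = df_copy['host_listings_count'].astype(object)

# One .loc call selects the matching rows and the column together,
# so no chained-indexing warning is raised
df_copy.loc[df_copy['host_listings_count'] == 0, 'host_listings_count'] = 'missing'

print(df_copy['host_listings_count'].tolist())  # ['missing', 1, 4, 'missing', 17]
```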
Returning to our analysis, we've now replaced the values of 0 with "missing", since we don't think the 0 values are reasonable. How does this affect our later analyses?
df_copy['host_listings_count'].describe()
count     7747
unique      61
top          1
freq      2331
Name: host_listings_count, dtype: int64
df_copy['host_listings_count'].mean()

Now we get a lengthy error message rather than a numerical result.
Once we replace the 0 values in the data with the word "missing", the mean can no longer be calculated. This is because it is unclear how to incorporate the string value "missing" into the mean calculation.
There is another approach that we can use: instead of representing the 0 values with the string "missing", we can ask Python to recognize that the variable should be a number but that the entry is not a valid number. One option is with None. In a numeric column, None is recorded as NaN, which stands for "Not a Number" and communicates that the given value is missing.
Once we have updated the data to use NaN for the missing values, the mean can again be calculated appropriately.
df_copy['host_listings_count'][df_copy['host_listings_count'] == "missing"] = None
df_copy['host_listings_count'].mean()
428.342132505176
We can see that how missing data is encoded affects the data cleaning and the downstream data analysis. Therefore, it can be very helpful to have Python identify and represent missing data appropriately, especially for future analyses.
We'll demonstrate one approach to identifying missing data along with a discussion of why this matters on this page.
Using Python to Identify Missing Data
Since missing values may prevent proper analysis of quantitative variables, we'll focus on identifying missing values encoded in quantitative variables. We'll also use a manufactured Instagram dataset (modified from 07-00 and returned to in Logistic Regression) that has more missing values for the purposes of detecting and adjusting missing values.
Note that once we read in the Instagram data as df below, we can no longer access the Chicago Airbnb data using the object df. For this reason, it is important to be careful when naming variables.
df = pd.read_csv('fake_insta_updated.csv')
df.head()
|  | has_a_profile_pic | number_of_words_in_name | num_characters_in_bio | number_of_posts | number_of_followers | number_of_follows | account_type |
|---|---|---|---|---|---|---|---|
| 0 | yes | 1 | 30 | 35 | 488 | 604 | real |
| 1 | yes | 5 | 64 | 3 | 35 | 6 | real |
| 2 | yes | 2 | 82 | 319 | 328 | 668 | real |
| 3 | yes | 1 | 143 | 273 | 14890 | 7369 | real |
| 4 | yes | 1 | 76 | 6 | 225 | 356 | real |
The first five observations for the Instagram accounts.
At first, we might see what appear to be many quantitative variables in the data, along with a few categorical variables.
df.dtypes
has_a_profile_pic          object
number_of_words_in_name     int64
num_characters_in_bio       int64
number_of_posts             int64
number_of_followers         int64
number_of_follows          object
account_type               object
dtype: object
After exploring the variable data types in Python, we notice that the number_of_follows variable is recorded as an object type (indicating a categorical variable) even though, from our initial viewing of the data, it seems like it should be quantitative. It is the only quantitative-looking variable recorded this way.
As a start, let's look at the values that the "number_of_follows" variable has.
df['number_of_follows'].unique()
array(['604', '6', '668', '7369', '356', '424', '254', '521', '143', '358', '492', '436', '437', '622', '141', '337', '499', '605', '199', '694', '276', "Don't know/Refused (VOL.)", '367', '157', '545', '138', '1395', '490', '347', '5514', '552', '573', '963', '449', '562', '346', '151', '148', '3504', '185', '293', '549', '466', '993', '1111', '40', '1055', '482', '47', '274', '223', '363', '568', '535', '577', '474', '505', '2', '64', '30', '82', '124', '25', '33', '34', '38', '18', '1', '15', '22', '353', '24', '2287', '6153', '31', '250', '6172', '2129', '324', '126', '350', '764', '3239', '920', '105', '58', '55', '175', '202', '636', '72', '7453', '162', '829', '776', '942', '1445', '4239', '1381', '669', '235', '7', '270', '76', '811', '164', '3572', '1695', '68'], dtype=object)
It does appear that it has missing values. Instead of being encoded as N/A, which Python would recognize and treat as a missing value automatically, the missing value is recorded as "Don't know/Refused (VOL.)". We'd like Python to recognize this entry as a missing value when reading in the data, and we can do so by instructing Python that these values should be treated as missing.
We can specify any missing values using the na_values argument when reading in the data.
df = pd.read_csv('fake_insta_updated.csv', na_values = ["Don't know/Refused (VOL.)"])
df.dtypes
has_a_profile_pic           object
number_of_words_in_name      int64
num_characters_in_bio        int64
number_of_posts              int64
number_of_followers          int64
number_of_follows          float64
account_type                object
dtype: object
The "number_of_follows" variable is now recorded as a float type variable, and we can calculate summary measures for this variable.
df['number_of_follows'].mean()
783.8898305084746
Python was able to automatically identify all values that were recorded as "Don't know/Refused (VOL.)" and substitute NaN for that value. This eases later analyses substantially.
We can also instruct Python to recognize multiple values as missing, if missing values were recorded differently. For example, the following code would replace any values of missing, unknown, and no answer in the data with NaN.
df = pd.read_csv('data.csv', na_values = ['missing', 'unknown', 'no answer'])
Generally, it is easier to handle missing values when first reading in the data. However, you do need to be careful, as missing values are replaced for all variables in the dataset. For example, we might not want to read in "0" as a missing value for our Airbnb example above, as it would also replace informative 0 values in the data.
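One way around this, since na_values also accepts a dictionary mapping column names to the values treated as missing in that column only, is to restrict the replacement to specific variables. A sketch using a small in-memory CSV with made-up columns, where 0 is a valid price but an invalid listings count:

```python
import io
import pandas as pd

# Made-up CSV text: 0 is a valid price but an implausible listings count
csv_text = """host_listings_count,price
0,0
3,120
5,85
"""

# Passing a dict limits the na_values to the named column only
df = pd.read_csv(io.StringIO(csv_text),
                 na_values={'host_listings_count': [0]})

print(df['host_listings_count'].isna().sum())  # 1: the 0 listings count became NaN
print(df['price'].isna().sum())                # 0: the 0 price is untouched
```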
Analyzing Data with Missing Values
Once we have identified missing data, we may need to determine how to handle the missing data. For our course, we will often elect to remove the missing observations from the data. First, we will demonstrate how to do this. Then, we will briefly identify some alternatives for analyzing missing data. Many sophisticated techniques for missing data exist, and you may learn more about these in future courses.
Counting Missing Values
First, how many missing values are in the data? It is helpful to understand the scope of the missing values, so you understand how much of your data is affected by the missing values.
If we return to the Instagram data, we can count the number of missing values by chaining two functions together:
- the isna function returns True for a value that is recorded as missing and False for a value that is not missing
- the sum function adds all of the True values in a column, resulting in a count of the number of missing values
df.isna().sum()
has_a_profile_pic          0
number_of_words_in_name    0
num_characters_in_bio      0
number_of_posts            0
number_of_followers        0
number_of_follows          2
account_type               0
dtype: int64
df.shape
(120, 7)
We can observe that we have only 2 missing follow values out of 120 observations. Since this is a small portion of our data (1.67%), losing this information is not substantial. If we found that a variable had more missing values, we may be more concerned about why that variable was so poorly recorded.
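Because isna() returns True/False values, chaining it with mean() instead of sum() gives the proportion of missing values directly. A minimal sketch on a made-up column (the proportion here is illustrative, not the 1.67% from the real data):

```python
import numpy as np
import pandas as pd

# Made-up follow counts with two missing entries
follows = pd.Series([604.0, np.nan, 668.0, np.nan, 356.0],
                    name='number_of_follows')

# isna() gives True/False; mean() turns those flags into a proportion
prop_missing = follows.isna().mean()
print(prop_missing)  # 0.4
```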
Removing Missing Values
In this case, because we are only missing 2 observations, a simple option is to remove the observations with missing values. We can do this using the dropna
function.
df = df.dropna()
df.shape
(118, 7)
Observe that now, our data has 118 observations, so we have removed the 2 rows that had missing values for number of follows from the original data.
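By default, dropna removes a row if any of its columns is missing. When only certain variables matter for an analysis, the subset argument of dropna restricts the check to those columns. A sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up data: one row is missing a follow count, another a bio length
df = pd.DataFrame({
    'number_of_follows': [604.0, np.nan, 668.0],
    'num_characters_in_bio': [30.0, 64.0, np.nan],
})

# Drop rows only when number_of_follows itself is missing
df_follows = df.dropna(subset=['number_of_follows'])
print(df_follows.shape)  # (2, 2): the row missing a bio length is kept
```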
Data Imputation
Instead of removing missing values, we may instead elect to approximate what those missing values might have been. This allows us to continue using the available data from those observations. This is called imputation, since we are inferring through a calculation what the value could reasonably be.
Common choices for the imputed value include:

- 0
- 1
- the median of the observed values
- the mean of the observed values
We may also estimate the mean or median using only similar observations based on some of the observed characteristics before imputation. For example, we may decide to estimate the median number of follows using only those Instagram accounts with similar characteristics to those with missing values. In this case, our two missing observations both have a profile picture and 0 characters in the bio, so we may filter our data to only the Instagram accounts with similar characteristics before calculating the median. This might provide a more reasonable estimate for the number of follows for the missing values.
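That filtering-then-imputing idea can be sketched with the fillna function; the accounts and follow counts below are made up, with the "similar" group defined here as accounts that have a profile picture:

```python
import numpy as np
import pandas as pd

# Made-up accounts; one follow count is missing
df = pd.DataFrame({
    'has_a_profile_pic': ['yes', 'yes', 'yes', 'no', 'yes'],
    'number_of_follows': [604.0, np.nan, 668.0, 50.0, 356.0],
})

# Median among similar accounts only (here: accounts with a profile picture)
similar = df[df['has_a_profile_pic'] == 'yes']
similar_median = similar['number_of_follows'].median()

# fillna substitutes the chosen value for every NaN in the column
df['number_of_follows'] = df['number_of_follows'].fillna(similar_median)
print(df['number_of_follows'].isna().sum())  # 0
```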
For the Chicago Airbnb example, we may choose to replace the host listing counts of 0 with 1, as we know that these hosts do have at least one listing (the one that is in our data) and assume that the host has no other listings.
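That substitution can be sketched with the replace function; the column name matches the Airbnb data, but the counts below are made up:

```python
import pandas as pd

# Made-up listing counts (the 0 values are the implausible entries)
df_copy = pd.DataFrame({'host_listings_count': [0, 1, 4, 0, 17]})

# Each host has at least the one listing that appears in our data,
# so substitute 1 for the 0 counts
df_copy['host_listings_count'] = df_copy['host_listings_count'].replace(0, 1)
print(df_copy['host_listings_count'].tolist())  # [1, 1, 4, 1, 17]
```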
The decision we'll most often make in our course is to remove the observations with missing values, for the sake of simplicity. For more complex analyses, you could choose to save some of the information available by using data imputation instead, depending on the context.
When to Remove Missing Data
When handling missing data, we would like to preserve as much data as possible while still maintaining quality data. How we prioritize and balance these two competing interests influences not only how we handle missing data but also when we adjust our data.
For example, we could remove all observations with any missing data as soon as we read in the data. In the Instagram data, that amounts to removing 1.67% of our original data. This step may be preferred for ease of analysis and for communication. It might also be preferred if we anticipate that observations with missing data may have less quality information recorded for the other variables in the data.
However, what if we have variables in our data that we have no interest in analyzing? This might be especially true for datasets with a large number of variables, like the Chicago Airbnb data. If we remove all observations with any missing values, then we might be removing observations that we could use for our analysis.
For example, suppose that we had a variable recording the cost of available parking, with options including free, paid, and N/A (representing not applicable) for units with parking included, parking available for pay, and parking unavailable, respectively. If we removed all of the observations with N/A, not only would we be removing an informative level of our parking variable, but we would also be removing all units that do not have parking available. Our sample would no longer represent all Chicago Airbnb units, since we would no longer have any data on units without parking. Additionally, if we never use the cost of parking variable, we would be limiting our data for no functional reason.
An alternative approach would be to isolate our variables of interest from the data before removing any observations with missing values. We would then retain as much information as possible by keeping every observation that contains helpful information for the analysis at hand. This approach may mean that multiple analyses from the same data are based on different sets of observations, resulting in different sample sizes and slightly different calculations and results. Communicating the data cleaning process clearly is therefore crucial.
One important consideration when deciding when to remove missing data is the proportion (or number) of observations that would be removed. It is entirely possible that more than half of the observations have a missing value for at least one variable. In this situation, removing all incomplete observations would reduce your data considerably. Instead, this may suggest only removing observations that are missing values for your variables of interest, to retain as much information as possible. On the other hand, our Instagram data has incomplete observations for only 1.67% of the observations. This is a relatively small portion of the data, and removing it is not likely to change the results substantially.
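To measure that proportion before deciding, isna().any(axis=1) flags each row that has at least one missing value, and mean() converts those flags to a proportion. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up data: two of the four rows have at least one missing value
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, 2.0, 3.0, 4.0],
})

# any(axis=1) flags rows with at least one missing value;
# mean() converts the flags to a proportion
prop_incomplete = df.isna().any(axis=1).mean()
print(prop_incomplete)  # 0.5
```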
Thinking Critically about Missing Data
While missing data does pose a challenge to data analysis, there are also cases where missing data is informative. Consider the previous example of the cost of parking. The option of N/A may be used when no parking is available for a unit. However, N/A may also be recorded if this information wasn't available for a given unit, or if parking was free or paid depending on the time of day or day of the week. In this scenario, N/A may represent several different situations simultaneously. Some of these situations could be resolved with more informative or careful data collection, but others may not be recordable clearly without making the data too messy. What can we do?
When handling missing data, it is important to consider the variable carefully. It can also help to think about scenarios when data might be missing, to consider the underlying cause of the missing data.
The Washington Post has collected data on fatal police shootings since 2015. This data includes a number of variables, including the name and some demographic information about the victim, including age, race, and sex. The demographic information is not available for all victims in the data. There are some victims with all demographic information present except the race variable. What reasons might the race not be recorded? It could be that the race truly cannot be determined and is reflecting the available information. It is also possible that race is withheld for some reason. These reasons could be due to some of the racial tensions currently in the United States. The missing values in this data might not be completely uninformative.
Missing data does not only include values for a given variable being missing. We should also consider when full observations might be missing from the data. Are there portions of the population that are not represented in the data? For example, certain demographics may not be included in a dataset and therefore have a missing perspective in the data.
In general, it can be illuminating to explore the biases (lack of representativeness) that are present in the data through the information that might be missing in the data. For this reason, it can be helpful to think critically about missing data. Some questions that you might ask include:
- What observations might be systematically missing from my data?
- What variable values might be systematically missing from my data?
- What might not be captured in the data in general?
Other Types of Missing or Incomplete Data
So far, we've considered data that is missing and recorded as "unavailable" or another similar value.
There are other ways in which data can be missing, incomplete, or otherwise unrealistic. We'll demonstrate a few examples here and a few more on the page about measurement errors. Many more possibilities exist.
Truncated Data
Some data is truncated, which means that the values are not fully recorded. For example, the Airbnb data lists the maximum number that a listing can accommodate as 16. However, the website actually truncates this value at 16; therefore, any listing that can host 16 or more people is simplified to 16. This is a limitation to the available data. It can be helpful to recognize that the true mean or the true standard deviation will be slightly larger than reported, based on some of the listings underreporting the number that can be accommodated.
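One simple diagnostic, assuming a column named accommodates as in the Airbnb data (the values below are made up), is to count how many listings sit exactly at the truncation cap, since those are the listings whose true capacity may be understated:

```python
import pandas as pd

# Made-up capacities; the real Airbnb data caps this column at 16
df = pd.DataFrame({'accommodates': [2, 16, 4, 16, 8, 6]})

# Listings recorded at the cap may actually hold 16 or more people
at_cap = (df['accommodates'] == 16).sum()
print(at_cap)  # 2
```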
Numerical Data Codes
When a variable is recorded using numbers, certain numbers may have special meanings. Be sure to read through any supporting documentation of the data. Some common numbers include 0, 9, 99, or other numbers with repeated 9s. All of these may stand for missing data.
For this reason, it is not enough to look for variables with an unexpected variable type; we should also check for suspicious values. Filtering out missing data recorded as a specific value can also be performed before analysis.
The following example shows code that could be used to remove observations that have a specific value in a dataset.
df = df[df['host_listings_count'] != 0]
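Alternatively, when the documentation says a numeric code such as 99 marks missing values, replacing the code with NaN keeps the rest of each observation available. A sketch with a hypothetical age variable (both the column and the code are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical age variable where the code 99 marks missing data
df = pd.DataFrame({'age': [25, 99, 34, 99, 41]})

# Swap the documented missing-data code for NaN so summaries ignore it
df['age'] = df['age'].replace(99, np.nan)
print(df['age'].isna().sum())  # 2
```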
Overall, it is important to carefully consider what values might be missing, to encode them properly, and to determine the optimal way to handle the missing values, to preserve as much helpful information as possible while also retaining quality observations.