Measurement Errors


Some missing values are easy to identify. Others are more hidden but still identifiable with careful searching. Still other values may not appear to be missing at all, yet may not accurately reflect the value they should. A value might be recorded incorrectly, or it might simply be uncommon. Identifying and handling these more unusual values is more challenging: the analyst has to draw on their understanding of the context to determine the best path forward. On this page, we describe some of the most common ways to identify and handle unusual values.

The values recorded in the data are crucial, as our analysis can only be as good as the data it is based on. In other words: garbage in, garbage out.

Concepts to Variables

Consider our analysis of how Chicago Airbnb hosts differ by whether they are larger or smaller operators. Specifically, we defined larger hosts to be those who had three or more listings in Chicago.

Is this the best way to distinguish those who may be operating Airbnb as a business venture from those who have some extra space they are trying to rent out for extra money? Are there other features that should be considered in defining larger hosts, like square footage, the total number of people who could be accommodated across all listings, or having multiple listings in many locations, including outside of Chicago? All of these could help distinguish the different types of hosts, but they may not be sufficient.

In our example, we chose one way to define larger hosts compared to smaller hosts. This variable may or may not capture what we hoped it would. For this reason, it is helpful to carefully consider the concept or idea that you'd like to capture and then translate it into a variable that can be measured or recorded.

Data scientists are often not included in a study until after the data has been gathered, so it might not be possible to influence which variables are recorded. However, data scientists can often control how certain concepts are defined using the variables that were recorded. For this reason, it is important to think critically about each and every step of the analysis. Is it best to define larger Chicago Airbnb hosts as we have? What other variables might you incorporate if you were able to redefine this using additional information in the data?

Biased Values

Recall that we'd like our data to reflect the true values. However, some variables are often subject to non-response or response biases: individuals may not respond at all, may skip a specific question, or may respond in a way that does not accurately reflect what the researcher intended to measure. We talked about missing values a few pages ago; these correspond to non-response for specific questions. We also talked about entire observations that may be absent from the data (a different type of missing data), which can likewise be due to non-response bias.

Response biases occur when someone responds in a way that misrepresents their true value. This might be due to the way the question was asked, who asked it, concerns over how they might be perceived, or other aspects of the interaction.

It is important to be aware of response biases when collecting sensitive data. For example, when collecting data about cheating or other dishonest behaviors, there are survey designs that help elicit more accurate answers. In one such design, the researcher is kept blind to the exact question a person is answering: each respondent privately flips a coin that determines whether they truthfully answer whether they have cheated on an exam, or simply give a definitive answer of "yes." Because any individual "yes" could have come from the coin, respondents can answer honestly without revealing themselves.
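Even though individual answers are masked, the researcher can still estimate the overall rate. Here is a minimal sketch of the coin-flip design above, with hypothetical responses: since half of respondents answer "yes" automatically, the observed rate of "yes" answers can be inverted to estimate the true rate.

# Heads -> answer "yes" regardless of the truth; tails -> answer truthfully.
# So P(yes) = 0.5 + 0.5 * p_true, which gives p_true = 2 * P(yes) - 1.
def estimate_true_rate(responses):
    p_yes = sum(r == "yes" for r in responses) / len(responses)
    return max(0.0, 2 * p_yes - 1)  # clip at 0 to absorb sampling noise

responses = ["yes"] * 62 + ["no"] * 38  # hypothetical: 62% answered "yes"
print(estimate_true_rate(responses))    # estimated true rate: 0.24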

Similarly, salary and other financial variables are often considered sensitive, and not every person is willing to share their salary; some may not even know their exact salary. One strategy to reduce bias for the salary variable is to record it as a categorical variable with a set of ranges. For example, a respondent might report their salary as falling into one of the following categories: $0-$30,000; $30,001-$60,000; $60,001-$90,000; $90,001-$120,000; or $120,000 or more. This approach has the drawback that less specific salary information is collected, but the benefit that more respondents are comfortable reporting a broad range than an exact number. Here, researchers typically prefer some information from more respondents over more specific information from fewer respondents.
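If exact salaries had already been collected, a minimal sketch of collapsing them into the ranges above could use pandas' pd.cut (the salaries series here is hypothetical; in practice, a survey would offer the ranges directly as answer choices):

import pandas as pd

salaries = pd.Series([25_000, 48_000, 71_000, 95_000, 150_000])  # hypothetical
bins = [0, 30_000, 60_000, 90_000, 120_000, float('inf')]
labels = ['$0-$30,000', '$30,001-$60,000', '$60,001-$90,000',
          '$90,001-$120,000', '$120,000 or more']
# include_lowest=True so that a $0 salary falls in the first range
salary_range = pd.cut(salaries, bins=bins, labels=labels, include_lowest=True)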

Being aware of these sensitive variables and planning ahead can help you gather less biased values, so that conclusions are drawn from data that better represents the truth.

Unusual Quantitative Values

We've seen how to identify values that are missing, including some recorded in ways other than NaN or None. However, some values appear to be recorded but may be unrealistic. There could be a recording error for that value, the information for an observation may seem internally inconsistent (e.g., highest degree is high school while profession is medical doctor), or the value could be correct but highly unusual. We'll focus on highly unusual values for quantitative variables in this section and assume that each value has been accurately recorded.

Highly unusual values of quantitative variables, whether extremely small or extremely large, are called outliers. Unusual values need not be extreme, however; they could also fall in a gap between two groups. When we analyze data with unusual values, what should we do with them? There are a few options, each with its own advantages.

  1. You could retain the observations, because these values represent a real observation.
  2. You could remove the outliers or unusual observations, because these values do not represent a "typical" observation that we would anticipate to be representative of most observations that could exist.
  3. You could transform the variable so that its distribution is closer to normal. For example, for right-skewed variables like salary, a log transformation pulls in the extremely large values, which are more spread out than the smaller ones.

Different analysts may select different options, all of which may be valid, depending on their research question or the goal for analysis.
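As a minimal sketch, here is how each option might look for the nightly price variable we examine next (assuming the df DataFrame used throughout this page; the 1.5 × IQR fence in option 2 is just one common convention, and log1p is used in option 3 because, as we will see, some prices are recorded as $0):

import numpy as np

prices = df['price']

# Option 1: retain all observations as-is.
prices_kept = prices

# Option 2: drop outliers; the 1.5 * IQR fence is one common convention.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
prices_trimmed = prices[(prices >= q1 - 1.5 * iqr) & (prices <= q3 + 1.5 * iqr)]

# Option 3: log-transform to reduce right skew; log1p handles $0 prices.
prices_logged = np.log1p(prices)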

For the Chicago Airbnb data, consider the nightly price for each unit.

df['price'].describe()
    count     7747.000000
    mean       184.285917
    std       1160.005899
    min          0.000000
    25%         77.000000
    50%        124.000000
    75%        189.000000
    max      99998.000000
    Name: price, dtype: float64

We see that the maximum price is recorded as $99,998 per night. This seems very unusual for a nightly rate for a listing in Chicago. We can first examine the available information about this particular listing.

df.loc[df['price'] == 99998,
       ['price', 'listing_url', 'beds', 'accommodates', 'bathrooms_text',
        'availability_30', 'availability_365']]
            price                            listing_url  beds  accommodates bathrooms_text  availability_30  availability_365
    1335  99998.0  https://www.airbnb.com/rooms/24894665   4.0             6         1 bath                0                 0

A few variables for the most expensive listing in the Chicago Airbnb data.

The most expensive unit has 4 beds, accommodates 6 people, has 1 bathroom, and no availability for the next year. This seems like a unit where the host is artificially raising the price to ensure that no one books the unit.

In this case, we can also check the current listing using the provided listing url, although this isn't always available depending on the data. As of this writing, the price is listed as $632 per night. Therefore, the original value doesn't seem accurate.

How about some of the other most expensive listings? We can examine the listings with the highest prices.

df['price'].sort_values().tail(20)
    6574     1640.0
    1814     1653.0
    1815     1653.0
    5545     1800.0
    4687     1800.0
    2047     2000.0
    7024     2057.0
    606      2221.0
    2111     2249.0
    2151     2380.0
    4015     2429.0
    2013     2599.0
    6571     2880.0
    7262     2960.0
    2131     4500.0
    140      5000.0
    977      5060.0
    6809     6676.0
    517     10000.0
    1335    99998.0
    Name: price, dtype: float64

The twenty most expensive Chicago Airbnbs (per night), based on our data.

One option would be to choose a cutoff: below it, we assume the price represents the intended price; above it, we explore further to determine whether the listing should be included in the data. Looking at the most expensive units, there appear to be clusters within the top 20. One possible cutoff could be $3,000, although other reasonable choices exist, and more sophisticated methods could be used to identify unreasonable prices.
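A minimal sketch of applying such a cutoff, using the $3,000 boundary suggested above:

# Set aside listings priced above the cutoff for further review,
# rather than silently deleting them.
suspect = df[df['price'] > 3000]
df_clean = df[df['price'] <= 3000]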

Uncommon Categorical Levels

Categorical variables can also have uncommon values, or levels that are not commonly recorded.

Let's observe the distribution of host locations for all Airbnb hosts with units in Chicago.

df_host['host_location'].value_counts()
    Chicago, IL                 2559
    United States                 17
    Los Angeles, CA               14
    Illinois, United States       12
    New York, NY                  12
                                ... 
    Crystal Lake, IL               1
    Orlando, FL                    1
    La Cañada Flintridge, CA       1
    Bogota, Colombia               1
    Campbell, CA                   1
    Name: host_location, Length: 225, dtype: int64

We see that the majority of hosts are located in Chicago, IL. The next most common location is United States, with only 17 hosts, then Los Angeles with 14, and 222 other locations each with 12 or fewer hosts.

We might opt to separate the host location variable into two options: either local hosts or non-local hosts. While we would be grouping together hosts from many locations in the non-local host group, it is plausible that non-local hosts may be similar in some manner and provide us with more information with which to draw conclusions. If using this scheme, we may want to sort through the other locations provided to determine what defines a local host (is being in the state of Illinois or a suburb sufficient to be local?). We also may need to recognize that a few hosts may be miscategorized. For example, hosts from the United States could be located within the city of Chicago.

We do so with the following code.

# Flag hosts whose recorded location is exactly 'Chicago, IL'
df_host['host_Chicago'] = (df_host['host_location'] == 'Chicago, IL')
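If we instead wanted a broader definition of local that includes Illinois suburbs, one possible sketch is below; the exact string matching is an assumption about how locations are recorded in this data.

# Broader (hypothetical) definition of "local": any Illinois location.
# na=False treats hosts with a missing location as non-local.
df_host['host_local'] = (
    df_host['host_location'].str.contains(', IL', na=False)
    | (df_host['host_location'] == 'Illinois, United States')
)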

Gender is a common variable for which this type of adjustment may be needed. For example, gender could include options like male, female, and non-binary. Depending on the analysis, you may want to include the non-binary respondents, but you may have too few non-binary responses to support a full analysis of that group.

If analyzing salary to learn about the effectiveness of salary negotiations, for example, one option might be to combine groups so that you have a typically disadvantaged group of female and non-binary respondents and a typically advantaged group of male respondents.
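A minimal sketch of this combination, assuming a hypothetical df_survey DataFrame with a gender column:

# Hypothetical survey data: collapse gender into two analysis groups.
df_survey['negotiation_group'] = df_survey['gender'].map({
    'male': 'typically advantaged',
    'female': 'typically disadvantaged',
    'non-binary': 'typically disadvantaged',
})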

Depending on the context and goals of the analysis, the data scientist may need to be creative in determining both if and how to combine groups.

Take Away

Preparing data for analysis is a crucial step in the data science pipeline. While examining the data, it is important to continually evaluate reasonable options for data cleaning, relying on the research goals and the context of the data. In this way, you can get the most out of the conclusions that can be drawn from the data, balancing both the quantity and quality of the data available. With this in mind, we will turn to extending our analysis beyond our current data in the next modules by applying machine learning techniques.