Review of Data Basics


This page reviews some of the crucial components from Data Science Discovery that we will repeatedly return to as we continue exploring data science concepts. These fundamental concepts will guide you as you first discover and then explore a data set in order to draw conclusions about the data and the surrounding world.

Variable Types

Recall that there are two primary conceptual types of variables: quantitative variables and categorical variables.

Quantitative variables, also called numerical variables, have values recorded as numbers. The numbers themselves are meaningful and can reasonably be used for calculations. That is, calculating a mean, median, or sum provides meaningful information.

Categorical variables have values recorded as levels or categories. The levels are often recorded with words, although they can also be translated into numbers. However, the numbers themselves would not be informative in a calculation of a mean, median, or sum.

A special type of categorical variable is one that only has two possible values. Some example values include True or False, Yes or No, 0 or 1, and Absent or Present. We will see that this type of variable has some helpful properties that can be used during analysis. This variable may be referred to as a boolean, logical, or dichotomous variable.
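One of these helpful properties can be sketched briefly: when a two-valued variable is recorded as True or False, Python treats True as 1 and False as 0, so a sum counts the True values and a mean gives their proportion. The small data frame below is made up for illustration (the values are hypothetical, not taken from the course data).

```python
import pandas as pd

# Hypothetical toy data: whether each of five listings can be booked instantly
toy = pd.DataFrame({"instant_bookable": [True, False, True, True, False]})

# True counts as 1 and False as 0, so the sum is the number of True values
n_true = toy["instant_bookable"].sum()

# The mean is then the proportion of True values
prop_true = toy["instant_bookable"].mean()

print(n_true, prop_true)
```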

When determining a variable type, the true underlying behavior of the variable should be considered. For example, zip codes or bar codes are recorded using numbers, but the value itself is not meaningful in a calculation. Therefore, these variables are considered to be categorical variables. Similarly, even though the levels of a categorical variable could be translated to numbers (e.g., 1 = Democrat, 2 = Republican, 3 = Independent, 4 = Undetermined), the underlying behavior of the variable is the word responses. This variable would then be a categorical variable.
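As a rough illustration of the zip code example, the sketch below uses a made-up column of zip codes stored as integers. Python will compute a mean without complaint, but the result is not meaningful; converting the values to strings and counting the levels is one reasonable alternative. The data frame and column name are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: zip codes that happen to be stored as integers
toy = pd.DataFrame({"zip_code": [60615, 60613, 60613, 60637]})

# pandas will compute a mean, but the result has no real-world meaning
meaningless_mean = toy["zip_code"].mean()

# Converting to strings signals that the values are labels, not quantities
toy["zip_code"] = toy["zip_code"].astype(str)

# Counting how often each level occurs is a sensible summary for categories
counts = toy["zip_code"].value_counts()
print(counts)
```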

The conceptual variable types do relate to how Python records variable types. Python variable types include further separation of variables to assist with data storage, memory, and calculations.

The two most common forms for numerical variables in Python are int (integers, or whole numbers) and float (numbers that allow decimals). A complex type is also available for complex numbers, although it is not typically used. Older versions of Python also had a separate long type for very large integers; in Python 3, int handles these. A number such as 64 often appears at the end of the type name (for example, int64 or float64) to indicate how many bits are used to store each value.

The most common form for categorical variables in Python is object. An object variable could be made entirely of strings or could contain a mix of items. However, when a column that mixes text and numbers is read from a file, the numbers are typically stored as strings, which means that they cannot be used for calculations until they are converted. Logical variables can be recorded either as bool variables (short for Boolean) or as objects. Python will generally only read logical variables as bool variables if they are recorded as True or False; a variable recorded as t or f, for example, will be read as an object.
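To make these storage types concrete, here is a minimal sketch that builds a small data frame with one column of each kind and reports the types. All column names and values are hypothetical. The last few lines show one way to convert numbers stored as text so that they can be used in calculations.

```python
import pandas as pd

# Hypothetical toy data with one column of each common type
toy = pd.DataFrame({
    "bedrooms": [1, 3, 2],               # whole numbers
    "price": [85.0, 240.5, 129.99],      # numbers with decimals
    "superhost": [True, False, True],    # recorded as True/False
    "mixed": ["two", 2, 2.5],            # words and numbers together
})

# Report how each column is stored
print(toy.dtypes)

# Numbers that arrive as text must be converted before calculating
as_text = pd.Series(["10", "20", "30"])
as_numbers = pd.to_numeric(as_text)
total = as_numbers.sum()
print(total)
```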

Other options include ways to record calendar time, calendar dates, and ordered categories. We generally won't use those for our course.

Let's see an example of how we can use Python to report the variable types.

import pandas as pd
df = pd.read_csv('listings.csv')
df.head()
id listing_url scrape_id last_scraped source name description neighborhood_overview picture_url host_id ... review_scores_communication review_scores_location review_scores_value license instant_bookable calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 2384 https://www.airbnb.com/rooms/2384 20230319041143 2023-03-19 city scrape Hyde Park - Walk to UChicago You are invited to be the sole Airbnb guest in... The apartment is less than one block from beau... https://a0.muscache.com/pictures/acf6b3c0-47f2... 2613 ... 4.99 4.96 4.93 R17000015609 f 1 0 1 0 2.13
1 1772920 https://www.airbnb.com/rooms/1772920 20230319041143 2023-03-19 city scrape 3 Bedroom Across from Wrigley Field AllStar Suite Welcome to The Inn at Wrigleyville, where you ... Besides being steps from the "Friendly Confine... https://a0.muscache.com/pictures/28490752/b4cc... 9297431 ... 4.75 4.92 4.58 2446868 t 6 6 0 0 0.11
2 1773021 https://www.airbnb.com/rooms/1773021 20230319041143 2023-03-19 city scrape 4 Bedroom Across from Wrigley Field Stadium Suite Welcome to The Inn at Wrigleyville, where you ... Besides being steps from the "Friendly Confine... https://a0.muscache.com/pictures/28491077/6edb... 9297431 ... 4.88 5.00 4.85 2446868 t 6 6 0 0 1.11
3 1773025 https://www.airbnb.com/rooms/1773025 20230319041143 2023-03-19 city scrape 4 Bedroom Across from Wrigley Field Legend Suite Welcome to The Inn at Wrigleyville, where you ... Besides being steps from the "Friendly Confine... https://a0.muscache.com/pictures/28489088/cb9d... 9297431 ... 4.90 5.00 4.76 2446867 t 6 6 0 0 0.90
4 1810118 https://www.airbnb.com/rooms/1810118 20230319041143 2023-03-19 city scrape LARGE Private 1BR/Full Bath near U of Chicago LARGE Bedroom (22'x12' / 24.5m²) w/ private-ac... Wake up and stop by the Robust Coffee shop for... https://a0.muscache.com/pictures/miso/Hosting-... 9483312 ... 4.96 4.48 4.88 R17000015592 f 2 1 1 0 3.21

5 rows × 75 columns

First five observations of Chicago Airbnb listings.

df.dtypes
        id                                                int64
        listing_url                                      object
        scrape_id                                         int64
        last_scraped                                     object
        source                                           object
                                                         ...   
        calculated_host_listings_count                    int64
        calculated_host_listings_count_entire_homes       int64
        calculated_host_listings_count_private_rooms      int64
        calculated_host_listings_count_shared_rooms       int64
        reviews_per_month                               float64
        Length: 75, dtype: object

Variable types in Python for 10 of the 75 variables.

We can see a few object variables (categorical variables): listing_url, last_scraped, and source. We also see variables recorded as integers, including id, scrape_id, and several listing counts calculated for the host. The listing counts are quantitative; id and scrape_id are stored as integers but act as identifiers, so conceptually they behave like categorical variables. We see one quantitative variable recorded as a float, reviews_per_month. In this case, we can only see 10 variable types, as the middle 65 are not printed.
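When the printout is truncated like this, one option is to tally how many columns are stored as each type, or to select columns by type. The sketch below uses a small hypothetical stand-in for the listings data (four columns with a few values copied from the output above) rather than the full file.

```python
import pandas as pd

# Hypothetical stand-in for the listings data (only four columns)
toy = pd.DataFrame({
    "id": [2384, 1772920, 1773021],
    "source": ["city scrape", "city scrape", "city scrape"],
    "calculated_host_listings_count": [1, 6, 6],
    "reviews_per_month": [2.13, 0.11, 1.11],
})

# Tally how many columns are stored as each type
type_counts = toy.dtypes.value_counts()
print(type_counts)

# Keep only the names of columns stored as numbers (int or float)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

With the full data frame, print(df.dtypes.to_string()) is one way to display all 75 types without truncation.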

Data Roles

We may also want to consider the context surrounding how variables are used in a study.

Common options for variable roles include response, predictor, confounder, and control variables.

A response variable is the variable of interest and the variable that responds to any changes. It is also called a dependent variable.

A predictor variable is a variable that is anticipated to explain changes in the response variable. In an experiment, this variable might be manipulated by a researcher. It is also called an independent or explanatory variable.

A confounder variable is a variable that is associated with both the response and predictor variables. This variable may obscure the true relationship in a situation.

Finally, a control variable is a variable that may affect the response variable. It's often not thought to be associated with the predictor variable(s). However, removing its effect on the response variable can help to further isolate and identify the relationship between predictor variable(s) and the response variable.

Hidden or lurking variables may be either confounders or control variables but are typically excluded from the available data. These variables may affect the results and introduce uncertainty to conclusions drawn from the available data.

The roles of variables may change depending on the research question, even if the research is based on the same dataset. It is important to include key stakeholders when defining the research question of interest to be sure that you are using all variables appropriately.

Observational Units

In Data Science Discovery, you worked with data frames. Data frames include rows, which typically correspond to observations, and columns, which typically correspond to variables.

You practiced filtering data by pulling out the rows that met certain conditions. These rows were sometimes also referred to as results or observations.
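As a brief refresher, filtering can be sketched as follows: build a True/False condition for every row, then keep the rows where the condition is True. The data frame, column names, and cutoffs are hypothetical.

```python
import pandas as pd

# Hypothetical toy listings; values are made up for illustration
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "bedrooms": [1, 3, 2, 4],
    "price": [90, 250, 140, 310],
})

# Keep the rows where the condition is True
large = toy[toy["bedrooms"] >= 3]

# Combine conditions with & (and) or | (or), wrapping each in parentheses
large_affordable = toy[(toy["bedrooms"] >= 3) & (toy["price"] < 300)]

print(large)
print(large_affordable)
```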

As we work with data, it can be helpful to consider what defines a row or observational unit. This can be one of the most helpful steps in understanding what exactly your data is representing and what types of questions your data might be capable of answering.

For example, we'll be working with Chicago Airbnb data throughout this course. What does each row of the data represent? Without looking at the data, some options include:

  • each booking at a Chicago Airbnb
  • each individual stay at a Chicago Airbnb
  • each host of a Chicago Airbnb
  • each unit (listing) of an Airbnb located in Chicago

We could have observations (or observational units) that correspond to each individual stay. Variables could include the number of nights in the reservation, whether the guest received any complaints, and how many times the guest messaged the host. We could instead have observational units corresponding to hosts, with variables like the number of units that each host has and the annual income from all of the properties. Or we could have observational units for each Airbnb listing (or unit), with variables like the price per night, the number of bedrooms, and the number of bathrooms. Our example data uses observational units of each Chicago Airbnb listing.

Some datasets include a variable that identifies the observational unit. For example, our data has variables for the listing id, listing name, and listing url. These variables assist us in matching the data back to the original listing, so that we could add more variables if needed and learn more about the units. An identifying variable takes a distinct value for each observation, since an observational unit can't be repeated in the data. Additionally, an identifying variable is of very limited use for analysis. That is, the variable is typically categorical with distinct values, so few conclusions can be drawn from analyzing it.
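One quick way to check which variable identifies the observational unit is to test whether its values ever repeat. The sketch below uses a hypothetical three-row stand-in with id and host_id values copied from the first rows of the output shown earlier.

```python
import pandas as pd

# Hypothetical stand-in; note that two listings share a host
toy = pd.DataFrame({
    "id": [2384, 1772920, 1773021],
    "host_id": [2613, 9297431, 9297431],
})

# An identifier for the observational unit should never repeat
id_is_unique = toy["id"].is_unique

# host_id repeats, which suggests that a row is not a host
host_is_unique = toy["host_id"].is_unique

# Compare the number of rows with the number of distinct hosts
n_rows = len(toy)
n_hosts = toy["host_id"].nunique()
print(id_is_unique, host_is_unique, n_rows, n_hosts)
```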

In each example, knowing the observational unit helps you understand the data and the types of questions that can be asked.

For example, we can't use our current data to assess whether guests who stay at units with more bedrooms generally stay longer than guests who stay at units with fewer bedrooms. We could use our current data to assess whether units with more bedrooms generally require longer minimum stays for guests compared to units with fewer bedrooms. We also could modify our data to answer whether hosts who have more listings generally also have listings with more bedrooms than hosts who have fewer listings.
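Modifying the data to answer a host-level question usually means changing the observational unit from listings to hosts. A minimal sketch of one way to do this, assuming hypothetical columns id, host_id, and bedrooms with made-up values:

```python
import pandas as pd

# Hypothetical toy listings: one row per listing
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "host_id": [10, 20, 20, 20],
    "bedrooms": [1, 3, 4, 2],
})

# Collapse to one row per host: count the listings and average the bedrooms
hosts = toy.groupby("host_id").agg(
    n_listings=("id", "count"),
    mean_bedrooms=("bedrooms", "mean"),
).reset_index()

print(hosts)
```

After this step, each row of hosts represents a host rather than a listing, so questions about hosts can be asked directly.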

Along with determining the size of the data (number of observations and number of variables), identifying the observational unit is the first step that I complete when understanding what is included in the data.

Towards Thinking Critically about Data

Above, we demonstrated some of the first steps when encountering new data and orienting yourself to the data. One of the goals of our course will extend beyond understanding the data into thinking critically about the data. We'll return to this goal regularly both in relation to the data and the results of any analytical procedures.

Data provides some small snapshots of a situation, but it also has limitations. It's important to think critically throughout the data science pipeline.

For our Chicago Airbnb data, we may want to think about whether a listing is still active or not. For example, some listings have no availability for the next year: no bookings can be made, but the listing is still on the Airbnb site. There may be multiple reasons for this. The listing might be under renovation, it might be very popular and truly booked, or it might not have been permanently deleted even though it is not actually available to any guest. This might make you wonder whether the listing should be included in the data and whether it helps to address any research questions of interest. Deciding which of these explanations is most likely and how to adjust our analyses may involve multiple perspectives, including those of the people who will rely on your answer, the implications of the results, and time or cost constraints.
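A decision like this can be explored in code before committing to it. The sketch below flags listings with no availability and counts how many would be affected by excluding them. The column name availability_365 is an assumption (it does not appear in the output shown on this page), and the values are made up.

```python
import pandas as pd

# Hypothetical toy listings; availability_365 is an assumed column name
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "availability_365": [0, 120, 365, 0],
})

# Flag listings with no open days in the coming year
toy["no_availability"] = toy["availability_365"] == 0

# How many listings would be affected by excluding them?
n_flagged = toy["no_availability"].sum()

# One possible choice: analyze only listings with some availability
possibly_active = toy[~toy["no_availability"]]
print(n_flagged, len(possibly_active))
```

Flagging rather than deleting keeps both options open, so results can be compared with and without the questionable listings.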

We'll continue to see examples of thinking critically about our available data and limitations as we work through the course material.

In this first section, we'll work with data wrangling and data summarization to determine what the data contains, how best to demonstrate and communicate the results to others, and what limitations might be present in our data and analysis.