Review of Data Basics
This page reviews some of the crucial components from Data Science Discovery that we will repeatedly return to as we continue exploring data science concepts. These fundamental concepts will guide you as you first discover and then explore a dataset in order to draw conclusions about the data and the surrounding world.
Variable Types
Conceptual Variable Types
Recall that there are two primary conceptual types of variables: quantitative variables and categorical variables.
Quantitative variables, also called numerical variables, have values recorded as numbers. The numbers themselves are meaningful and can reasonably be used in calculations. That is, calculating a mean, median, or sum provides meaningful information.
Categorical variables have values recorded as levels or categories. The levels are often recorded with words, although they can also be translated into numbers. However, the numbers themselves would not be informative in a calculation of a mean, median, or sum.
A special type of categorical variable is one that only has two possible values. Some example values include True or False, Yes or No, 0 or 1, and Absent or Present. We will see that this type of variable has some helpful properties that can be used during analysis. This variable may be referred to as a boolean, logical, or dichotomous variable.
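One of these properties can be previewed with a small sketch. The values below are invented for illustration. When a two-valued variable is stored as True/False, Python treats True as 1 and False as 0, so the sum counts the True values and the mean gives the proportion of True values.

```python
import pandas as pd

# Hypothetical two-valued variable (values invented for illustration).
instant = pd.Series([True, False, True, True, False])

n_true = instant.sum()      # True counts as 1, False as 0
prop_true = instant.mean()  # proportion of values that are True

print(n_true)     # 3
print(prop_true)  # 0.6
```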
When determining a variable type, the true underlying behavior of the variable should be considered. For example, zip codes or bar codes are recorded using numbers, but the value itself is not meaningful in a calculation. Therefore, these variables are considered to be categorical variables. Similarly, even though the levels of a categorical variable could be translated to numbers (e.g., 1 = Democrat, 2 = Republican, 3 = Independent, 4 = Undetermined), the underlying behavior of the variable is the word responses. This variable would then be a categorical variable.
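The coding example above can be sketched in Python. The responses below are invented, and the names codes and labels are chosen only for this illustration. Python will compute a mean of the numeric codes without complaint, but counting the levels is the summary that matches the variable's categorical behavior.

```python
import pandas as pd

# Hypothetical survey responses, coded 1-4 as in the example above.
codes = pd.Series([1, 2, 2, 3, 1, 4, 2])
labels = {1: "Democrat", 2: "Republican", 3: "Independent", 4: "Undetermined"}

# The mean runs without error, but it does not describe any respondent.
code_mean = codes.mean()

# Translating the codes back to words and counting each level is a
# summary that fits a categorical variable.
party = codes.map(labels)
counts = party.value_counts()

print(code_mean)
print(counts)
```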
Programming Variable Types
To help with data storage, memory use, and calculations, computer programs and applications also assign each variable a type. We will call these programming or Python variable types, although different programs may use different variants of these programming variable types. The conceptual variable types do relate to how Python records variable types.
The two most common forms for numerical variables in Python are int (integer, or whole numbers) and float (numbers that allow decimals). The types long and complex are also possible for very large or complex numbers, respectively, although they are not typically used. The suffix 64 may also appear at the end of the variable type (for example, int64 or float64) to indicate how the variable is stored; for the purposes of our course, we will not explore or discuss the 64 further.
The most common form for categorical variables in Python is object. An object variable could be made entirely of strings or could contain a mix of numbers, words, and combinations of the two. However, any numbers stored this way operate as strings, which means that they cannot be used for calculations.
Logical variables can be recorded either as bool variables (short for Boolean) or as objects. Python will only read logical variables as bool variables if they are recorded as True or False.
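As a minimal sketch of this distinction, suppose a logical variable were recorded with the text values "t" and "f" instead of True and False. The values below are invented for illustration. The variable would be stored as an object, and one way to obtain a bool variable is to compare each value to "t".

```python
import pandas as pd

# Hypothetical logical variable recorded with the text values "t" and "f".
raw = pd.Series(["f", "t", "t", "t", "f"], dtype=object)
print(raw.dtype)  # object, not bool

# Comparing each value to "t" produces a Boolean (bool) variable.
as_bool = raw == "t"
print(as_bool.dtype)     # bool
print(as_bool.tolist())  # [False, True, True, True, False]
```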
Other options include ways to record calendar time, calendar dates, and ordered categories. We generally won't use those for our course.
Let's see an example of how we can use Python to report the variable types.
import pandas as pd
# Load the Chicago Airbnb listings and preview the first five rows
df = pd.read_csv('listings.csv')
df.head()
| | id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2384 | https://www.airbnb.com/rooms/2384 | 20230319041143 | 2023-03-19 | city scrape | Hyde Park - Walk to UChicago | You are invited to be the sole Airbnb guest in... | The apartment is less than one block from beau... | https://a0.muscache.com/pictures/acf6b3c0-47f2... | 2613 | ... | 4.99 | 4.96 | 4.93 | R17000015609 | f | 1 | 0 | 1 | 0 | 2.13 |
1 | 1772920 | https://www.airbnb.com/rooms/1772920 | 20230319041143 | 2023-03-19 | city scrape | 3 Bedroom Across from Wrigley Field AllStar Suite | Welcome to The Inn at Wrigleyville, where you ... | Besides being steps from the "Friendly Confine... | https://a0.muscache.com/pictures/28490752/b4cc... | 9297431 | ... | 4.75 | 4.92 | 4.58 | 2446868 | t | 6 | 6 | 0 | 0 | 0.11 |
2 | 1773021 | https://www.airbnb.com/rooms/1773021 | 20230319041143 | 2023-03-19 | city scrape | 4 Bedroom Across from Wrigley Field Stadium Suite | Welcome to The Inn at Wrigleyville, where you ... | Besides being steps from the "Friendly Confine... | https://a0.muscache.com/pictures/28491077/6edb... | 9297431 | ... | 4.88 | 5.00 | 4.85 | 2446868 | t | 6 | 6 | 0 | 0 | 1.11 |
3 | 1773025 | https://www.airbnb.com/rooms/1773025 | 20230319041143 | 2023-03-19 | city scrape | 4 Bedroom Across from Wrigley Field Legend Suite | Welcome to The Inn at Wrigleyville, where you ... | Besides being steps from the "Friendly Confine... | https://a0.muscache.com/pictures/28489088/cb9d... | 9297431 | ... | 4.90 | 5.00 | 4.76 | 2446867 | t | 6 | 6 | 0 | 0 | 0.90 |
4 | 1810118 | https://www.airbnb.com/rooms/1810118 | 20230319041143 | 2023-03-19 | city scrape | LARGE Private 1BR/Full Bath near U of Chicago | LARGE Bedroom (22'x12' / 24.5m²) w/ private-ac... | Wake up and stop by the Robust Coffee shop for... | https://a0.muscache.com/pictures/miso/Hosting-... | 9483312 | ... | 4.96 | 4.48 | 4.88 | R17000015592 | f | 2 | 1 | 1 | 0 | 3.21 |
5 rows × 75 columns
First five observations of Chicago Airbnb listings.
This is your first look at the dataset that we'll use throughout this course. The dataset contains Airbnb listings from Chicago, IL, and was scraped from the Airbnb website in March 2023. Take a minute to scroll through this dataset, noticing some of the prominent variables and thinking about what questions you might have about it.
df.dtypes
id                                               int64
listing_url                                     object
scrape_id                                        int64
last_scraped                                    object
source                                          object
                                                 ...
calculated_host_listings_count                   int64
calculated_host_listings_count_entire_homes      int64
calculated_host_listings_count_private_rooms     int64
calculated_host_listings_count_shared_rooms      int64
reviews_per_month                              float64
Length: 75, dtype: object
The variable types of 10 of the 75 variables, as reported by Python.
We can see a few object variables (categorical variables): listing_url, last_scraped, and source. We also see variables recorded as integers: id, scrape_id, and the four listing count variables calculated for the host. One quantitative variable, reviews_per_month, is recorded as a float. In this case, we can only see the variable types for 10 variables, as the middle 65 are not printed.
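When the printed output hides some of the variables, a few pandas tools can help. The sketch below uses a small stand-in data frame named listings (a name and set of values chosen only for this illustration) and shows three options: counting how many variables have each type, selecting only the numeric columns, and temporarily lifting the row limit so that every type prints.

```python
import pandas as pd

# Small stand-in for the listings data (values are illustrative only).
listings = pd.DataFrame({
    "id": [2384, 1772920],
    "source": pd.Series(["city scrape", "city scrape"], dtype=object),
    "reviews_per_month": [2.13, 0.11],
})

# Count how many variables have each programming type.
type_counts = listings.dtypes.value_counts()
print(type_counts)

# Keep only the names of the numeric (int and float) columns.
numeric_cols = listings.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Temporarily lift the display limit so that no rows are replaced by "...".
with pd.option_context("display.max_rows", None):
    print(listings.dtypes)
```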
Note that the variables id and scrape_id are both recorded as integer variable types. Think about this recording for a moment. Does that match up with what we would anticipate from their conceptual variable type? Is this concerning?
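One possible response to this mismatch, sketched under the assumption that we never want to do arithmetic on the identifiers, is to store them as text. The three values below are the first three id values shown in the table above.

```python
import pandas as pd

# First three id values from the listings shown above.
ids = pd.Series([2384, 1772920, 1773021], name="id")

# Python will average the ids, but the result does not identify anything.
id_mean = ids.mean()
print(id_mean)

# Storing the ids as text makes the programming type match the
# conceptual (categorical) type and prevents accidental arithmetic.
ids_as_text = ids.astype(str)
print(ids_as_text.tolist())  # ['2384', '1772920', '1773021']
```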
Variable Roles
We may also want to consider the context surrounding how variables are used in a study.
Common options for variable roles include response, predictor, confounder, and control variables.
A response variable is the variable of interest and the variable that responds to any changes. It is also called the dependent variable.
A predictor variable is a variable that is anticipated to explain changes in the response variable. In an experiment, this variable might be manipulated by a researcher. It is also called an independent or explanatory variable.
A confounder variable is a variable that is associated with both the response and predictor variables. This variable may obscure the true relationship between the predictor and the response.
Finally, a control variable is a variable that may affect the response variable. It's often not thought to be associated with the predictor variable(s). However, removing its effect on the response variable can help to further isolate and identify the relationship between predictor variable(s) and the response variable.
Hidden or lurking variables may be either confounders or control variables but are typically excluded from the available data. These variables may affect the results and introduce uncertainty to conclusions drawn from the available data. Our research aim may be to identify a causal relationship between our predictor variable and our response variable. In that case, we would like to isolate this relationship as much as possible and remove any unwanted effects, to maximize the support for the suggested relationship between our variable(s) of interest.
The roles of variables may change depending on the research question, even if the research is based on the same dataset. It is important to include key stakeholders when defining the research question of interest to be sure that you are using all variables appropriately.
Observational Units
In Data Science Discovery, you worked with data frames. A data frame has rows, which typically represent observations, and columns, which typically represent variables.
You practiced filtering data by pulling out the rows that met certain conditions. These rows were sometimes also referred to as results or observations.
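As a brief refresher, filtering can be sketched as follows. The data frame, its column names, and its values are all invented for this illustration.

```python
import pandas as pd

# Hypothetical listings; column names and values are invented.
listings = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "bedrooms": [1, 3, 2, 1],
    "price": [80.0, 250.0, 140.0, 95.0],
})

# Build a True/False condition for each row, then keep the rows
# where the condition is True.
has_two_plus = listings["bedrooms"] >= 2
filtered = listings[has_two_plus]

print(filtered)
```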
As we work with data, it can be helpful to consider what defines a row or observation. This can be one of the most helpful steps in understanding what exactly your data is representing and what types of questions your data might be capable of answering.
For example, we'll be working with Chicago Airbnb data throughout this course. What does each row of the data represent? Without looking at the data, some options include:
- each booking at a Chicago Airbnb
- each individual stay at a Chicago Airbnb
- each host of a Chicago Airbnb
- each unit (listing) of an Airbnb located in Chicago
For example, we could have observations (or observational units) that correspond to each individual stay. Variables could include the number of nights the reservation was for, whether the guest received any complaints, and how many times the guest messaged the host. We could have observational units corresponding to hosts, with variables like the number of units that each host has and the annual income for all of the properties. We could have observational units for each Airbnb listing (or unit), with variables like the price per night, the number of bedrooms, and the number of bathrooms. Our example data uses an observational unit of a single Chicago Airbnb listing.
Some datasets will have a variable recorded for the observational unit. For example, our data has variables for the listing id, listing name, and listing url. These variables help us match the data back to the original listing, so that we could add more variables if needed and learn more about the units. A variable that represents the observational unit takes a distinct value for each observation, since observational units can't be repeated in the data. Additionally, such a variable is of very limited use for analysis. That is, the variable is typically categorical with distinct values, so few conclusions can be drawn from analyzing it.
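A quick way to check whether a variable could represent the observational unit is to test whether its values are all distinct. The sketch below uses the id and host_id values from the five listings shown earlier; the data frame name listings is chosen only for this illustration.

```python
import pandas as pd

# id and host_id values from the first five listings shown earlier.
listings = pd.DataFrame({
    "id": [2384, 1772920, 1773021, 1773025, 1810118],
    "host_id": [2613, 9297431, 9297431, 9297431, 9483312],
})

# Each listing id appears once, so id can identify the observational unit.
print(listings["id"].is_unique)       # True

# One host owns three of these listings, so host_id repeats.
print(listings["host_id"].is_unique)  # False

# An equivalent check: the number of distinct ids equals the number of rows.
print(listings["id"].nunique() == len(listings))  # True
```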
In each example, knowing the observational unit helps you understand the data and the types of questions that can be asked.
For example, we can't use our current data to assess whether guests who stay at units with more bedrooms generally stay longer than guests who stay at units with fewer bedrooms. We could use our current data to assess whether units with more bedrooms generally require longer minimum stays for guests compared to units with fewer bedrooms. We also could modify our data to answer whether hosts who have more listings generally also have listings with more bedrooms than hosts who have fewer listings.
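The last question above requires changing the observational unit from listings to hosts. A minimal sketch of one way to do this is to group the listings by host and summarize each group. The id and host_id values come from the five listings shown earlier, while the bedrooms column and its values are assumptions made for this illustration.

```python
import pandas as pd

listings = pd.DataFrame({
    "id": [2384, 1772920, 1773021, 1773025, 1810118],
    "host_id": [2613, 9297431, 9297431, 9297431, 9483312],
    "bedrooms": [1, 3, 4, 4, 1],  # assumed values, for illustration only
})

# Collapse the data so that each row is one host rather than one listing.
hosts = (
    listings.groupby("host_id")
    .agg(n_listings=("id", "count"), mean_bedrooms=("bedrooms", "mean"))
    .reset_index()
)

print(hosts)
```

After this step, each row of hosts describes a single host, so questions about hosts can be asked directly of the new data frame.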
Along with determining the size of the data (number of observations and number of variables), identifying the observational unit is among the first steps that I complete when understanding what is included in the data.
Towards Thinking Critically about Data
Above, we demonstrated some of the first steps when encountering new data and orienting yourself to the data. One of the goals of our course will extend beyond understanding the data into thinking critically about the data. We'll return to this goal regularly both in relation to the data and the results of any analytical procedures.
Data provides some small snapshots of a situation, but it also has limitations. It's important to think critically throughout the data science pipeline.
For our Chicago Airbnb data, we may want to think about whether a listing is still active or not. For example, some listings have no availability for the next year: no bookings can be made, but the listing is still on the Airbnb site. There may be multiple reasons for this. The listing may be under renovation, it may be very popular and truly booked, or it may not have been permanently deleted even though it is not actually available to any guest. This might make you wonder whether the listing should be included in the data and whether it helps to address any research questions of interest. Deciding which of these explanations is more likely and how to adjust our analyses may involve multiple perspectives, including those who will rely on your answer, the implications of the results, and time or cost constraints.
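A first step toward this kind of check can be sketched in code. The sketch assumes a column named availability_365 that records the number of days a listing can be booked over the next year; that column name and all values below are assumptions for illustration. The listings are flagged instead of deleted, so that the decision about whether to keep them can be made later.

```python
import pandas as pd

# Hypothetical listings; availability_365 is an assumed column name.
listings = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "availability_365": [0, 210, 0, 365],
})

# Identify the listings with no availability in the next year.
no_availability = listings["availability_365"] == 0

print(no_availability.sum())   # 2 listings have no availability
print(no_availability.mean())  # 0.5, the proportion of such listings

# Flag these listings instead of removing them from the data.
listings["possibly_inactive"] = no_availability
print(listings)
```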
We'll continue to see examples of thinking critically about our available data and limitations as we work through the course material.
In this first section, we'll work with data wrangling and data summarization to determine what the data contains, how best to demonstrate and communicate the results to others, and what limitations might be present in our data and analysis.