Answering Questions Using Data


We often have a research question or goal that guides our data analysis. We'd like to learn more about the world, and so we formulate a question. Once we have that question, we aim to answer it using available data.

What do we need to think about when it comes to answering questions with data?

Which comes first?

Do you first have a question, and then generate data for that question? Or do you look at the data and variables first and then pose your question? And does this distinction matter?

Suppose that you pose a question first. You can then carefully plan what sample you'd like to obtain, what questions you'd like to ask, and how to conduct your study to get the most precise measurements. Then you can analyze your data. You may have the best data for your question, but that data may or may not apply to additional questions. You will also likely have invested time, effort, and money in generating the data.

On the other hand, you may find that you are given data that has already been collected, as we have in this course. Once you have the data, you may look through the variables and then pose questions about the variables that exist in the data. Some say that using data to determine questions is one of the hallmarks of the field of data science. The benefit is that the data is already available and analysis can begin quickly. However, the data may be lower in quality or less exact for the question of interest.

For example, suppose that I wanted to answer the question: do hosts with more listings generally have larger properties compared to hosts with fewer listings?

If I were to collect my own data, I could determine exactly how to gather information about the hosts. How do I measure what constitutes a larger property? I could choose to use square footage, number of bedrooms, or how many individuals can be accommodated by a listing. I could also determine what defines a host with more or fewer properties.

However, we already have the data available to us. We do not have the ability to gather data to meet all of our specifications, so we are limited to what is available in the data. For example, we may not be able to determine how many listings a host has around the world; we may have to define a large host based on their Chicago listings alone. We also might consider looking through the available variables to define any follow-up questions as we work to understand differences between large and small hosts.
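To make this concrete, here is a minimal sketch of defining "large" and "small" hosts using only what the data contains. The rows, the column names (host_id, accommodates), and the cutoff of two listings are all hypothetical and chosen for illustration; they are not taken from the actual Chicago Airbnb data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows standing in for the listings data (one row per listing).
# The column names "host_id" and "accommodates" are assumptions.
listings = [
    {"host_id": "A", "accommodates": 2},
    {"host_id": "A", "accommodates": 4},
    {"host_id": "A", "accommodates": 6},
    {"host_id": "B", "accommodates": 2},
    {"host_id": "C", "accommodates": 3},
    {"host_id": "C", "accommodates": 5},
]

# Assumed cutoff: a host with at least this many listings in the data is "large".
LARGE_HOST_MIN_LISTINGS = 2


def summarize_by_host_size(rows, min_listings):
    """Return the mean listing size for large and small hosts."""
    # Collect the size of every listing under its host.
    sizes_by_host = defaultdict(list)
    for row in rows:
        sizes_by_host[row["host_id"]].append(row["accommodates"])

    # Label each host by its listing count, then pool its listings' sizes.
    groups = {"large": [], "small": []}
    for sizes in sizes_by_host.values():
        label = "large" if len(sizes) >= min_listings else "small"
        groups[label].extend(sizes)

    return {label: mean(sizes) for label, sizes in groups.items() if sizes}


summary = summarize_by_host_size(listings, LARGE_HOST_MIN_LISTINGS)
print(summary)
```

Both the measure of property size and the cutoff are choices the analyst makes after seeing which variables are available. A different choice, such as number of bedrooms, could lead to a different summary.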

Who does the research question apply to?

Beyond balancing the investment of data collection with the quality of the data and the ability to answer questions most appropriately, we also want to consider the implications of the phrasing of the research question. For example, will results that summarize the available data be sufficient, so that our statement applies only to the current data? Or do we want results that generalize to the underlying population that the data stems from?

When our data is representative of a larger group, we can use the results to generalize statements to that larger group (with appropriate uncertainty accounted for). To be considered representative, the data should come from a random sample and have no (or minimal) biases present. In other words, we hope that our available data provides an appropriate snapshot of the underlying population. We'll formalize this process with inference in our last two modules.
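A small simulation can illustrate why the sampling approach matters. This is only a sketch under stated assumptions: the population is a made-up list of 1,000 listing sizes, and the "convenience" sample deliberately keeps only the largest listings to mimic a biased selection.

```python
import random
from statistics import mean

# Fixed seed so the sketch is reproducible.
rng = random.Random(0)

# Hypothetical population: one "size" value per listing (made-up numbers).
population = list(range(1, 1001))

# A simple random sample: every listing is equally likely to be chosen.
random_sample = rng.sample(population, 100)

# A biased sample: only the 100 largest listings are observed.
convenience_sample = sorted(population)[-100:]

population_mean = mean(population)
random_sample_mean = mean(random_sample)
convenience_sample_mean = mean(convenience_sample)

print("population mean:        ", population_mean)
print("random sample mean:     ", random_sample_mean)
print("convenience sample mean:", convenience_sample_mean)
```

The random sample's mean will typically land near the population mean, while the biased sample's mean is far too large by construction. Only the first kind of sample supports statements about the larger group.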

When our data is not representative of a larger group, we cannot say much that is definitive about the underlying population. The data may fail to be representative because of the sampling approach or because of bias present in the data. We'll return to these ideas shortly.

We can write conclusions that summarize our available data only and do not generalize to another group. For these conclusions, we should be clear that the statement only applies to our available data and that it is not intended to apply to a larger group. This clear communication can assist in ensuring that the data is used appropriately.

What does the data say about causation?

Many times, the ultimate goal is to make sense of the world using data. In other words, we want to be able to say "If I study for 2 more hours, then I should get an A on my upcoming test." We want to make causal claims using the data at hand. However, there are many factors that could contribute to a causal relationship.

As described in Data Science Discovery, the best method for having data support causation is using randomized controlled trials, or experiments. Random assignment distributes any confounding factors approximately equally across the two groups, which reduces any systematic effects of confounding variables. Sometimes experiments are not ethical or practical; in those cases, sophisticated causal inference techniques could be applied to help support causal statements. These techniques are beyond the scope of our course.
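The balancing effect of random assignment can be seen in a short simulation. This is an illustrative sketch, not part of the course materials: the units, the confounder name prior_hours, and its values are all hypothetical.

```python
import random
from statistics import mean

# Fixed seed so the sketch is reproducible.
rng = random.Random(1)

# Hypothetical study units, each with a made-up confounder value
# (for example, hours already spent studying before the experiment).
units = [{"id": i, "prior_hours": i % 20} for i in range(1000)]

# Random assignment: shuffle the units, then split them in half.
shuffled = units[:]
rng.shuffle(shuffled)
treatment = shuffled[:500]
control = shuffled[500:]

# Compare the confounder's average in the two groups.
treatment_mean = mean(u["prior_hours"] for u in treatment)
control_mean = mean(u["prior_hours"] for u in control)

print("treatment mean prior_hours:", treatment_mean)
print("control mean prior_hours:  ", control_mean)
```

Because group membership is decided by chance alone, the two averages should be close. The same holds for confounders that were never measured, which is what makes randomization so useful for causal claims.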

To make causal claims, which are implied by statements using phrases like "make it more likely" or "causes", we should ensure that we have randomization as part of the assignment to treatment or control groups.

For example, our Chicago Airbnb data does not include random assignment. In fact, there would not be a practical way to assign some hosts to have more listings and others to have fewer. Therefore, we will not be able to make causal claims using our Chicago Airbnb data. For example, we would not be able to say that a host having more listings on Airbnb makes it more likely that the units are larger, because many other factors could explain that relationship. Instead, we might say something like: a host having more listings is associated with the listings generally being larger in our data.