Answering Questions Using Data


We often have a research question or goal that guides our data analysis. We'd like to learn more about the world, and so we formulate a question. Once we have that question, we aim to answer it using available data.

What do we need to think about when it comes to answering questions with data?

Which comes first?

Do you first have a question, and then generate data for that question? Or do you look at the data and variables first and then pose your question? And does this distinction matter?

Suppose that you pose a question first. You can then carefully plan which sample you'd like to obtain, which questions you'd like to ask, and how to conduct your study so that your measurements are as exact as possible, and then you can analyze your data. You may have the best data for your question, but that data may or may not be usable for answering additional questions. You also will likely have invested time, effort, and money when generating the data.

On the other hand, you may find that you are given data that has already been collected, as we have in this course. Once you have the data, you may look through the variables. Using the available data and variables, you may pose questions about variables that exist in the data. Some say that using data to determine questions is one of the trademarks of the field of data science. The benefit is that the data is already available and analysis can begin quickly. However, the data may be lower in quality or less exact for the question of interest.

For example, suppose that I wanted to answer the question: do hosts with more listings generally have larger properties compared to hosts with fewer listings? I may be wondering if there really are two types of "hosts" on Airbnb: those who have an extra room in their residence and are trying to have a little extra revenue (fewer properties and smaller listings) and those who are using Airbnb as a larger business (more properties and more extravagant and therefore larger properties). Does this idea seem to be supported by the data?

If I were to collect my own data, I could determine exactly how to gather information about the hosts. How do I measure what constitutes a larger property? I could choose to use square footage, number of bedrooms, or how many individuals can be accommodated by a listing. I could also determine what defines a host with more or fewer properties.

However, I don't have (or want to take) the time to collect this information for all Airbnb hosts. Additionally, I already have Airbnb data available to me. Although it might not be ideal, I'll opt to repurpose my available data in order to answer my question of interest, but I'll be limited to the variables that are already available in the data. I may also recognize that my answer may not be exact, but the approximation may be "good enough" for my research purposes. For example, I may not be able to determine how many listings a host has around the world; I may have to define a large host based on their Chicago listings alone. I also might consider looking through the available variables to define any follow-up questions as I work to understand differences between large and small hosts.
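To make the idea of repurposing available data concrete, here is a minimal sketch of how a "large host" might be defined from the listings on hand. Everything specific in it is an assumption for illustration: the rows are made up (they are not real Airbnb records), the field names host_id, accommodates, and bedrooms are hypothetical, and the cutoff of two or more listings is an arbitrary choice.

```python
from collections import defaultdict
from statistics import mean

# Made-up illustrative rows, NOT real Airbnb data.
# The field names (host_id, accommodates, bedrooms) are assumed for this sketch.
listings = [
    {"host_id": "A", "accommodates": 2, "bedrooms": 1},
    {"host_id": "B", "accommodates": 6, "bedrooms": 3},
    {"host_id": "B", "accommodates": 8, "bedrooms": 4},
    {"host_id": "B", "accommodates": 4, "bedrooms": 2},
    {"host_id": "C", "accommodates": 3, "bedrooms": 1},
    {"host_id": "D", "accommodates": 5, "bedrooms": 2},
    {"host_id": "D", "accommodates": 7, "bedrooms": 3},
]

# Assumed cutoff: a host with at least this many listings counts as "large".
LARGE_HOST_MIN_LISTINGS = 2


def summarize_by_host_size(rows, min_listings=LARGE_HOST_MIN_LISTINGS):
    """Group listings by host size and average the available size proxies."""
    # Collect each host's listings so we can count them.
    by_host = defaultdict(list)
    for row in rows:
        by_host[row["host_id"]].append(row)

    # Label every listing by whether its host is "large" or "small".
    groups = {"large": [], "small": []}
    for host_rows in by_host.values():
        label = "large" if len(host_rows) >= min_listings else "small"
        groups[label].extend(host_rows)

    # Summarize each non-empty group with the proxies we happen to have.
    return {
        label: {
            "n_listings": len(group),
            "mean_accommodates": mean(r["accommodates"] for r in group),
            "mean_bedrooms": mean(r["bedrooms"] for r in group),
        }
        for label, group in groups.items()
        if group
    }


summary = summarize_by_host_size(listings)
for label, stats in summary.items():
    print(label, stats)
```

Notice that the sketch only counts listings that appear in the data, so a host with many listings outside the data set would still be labeled "small". That is exactly the kind of approximation we accept when we repurpose existing data.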

Who does the research question apply to?

Beyond balancing the investment of data collection with the quality of the data and the ability to answer questions most appropriately, we also want to consider the implications of the phrasing of the research question. For example, will results that summarize the available data be sufficient, so that our statement only needs to apply to the current data? Or, do we want results that generalize to the underlying population that the data stems from?

When our data is representative of a larger group, then we can use the results to generalize statements to the larger group (with appropriate uncertainty accounted for). To be considered representative, the data should come from a random sample and have no (or minimal) biases present. In other words, we hope that our available data provides an appropriate snapshot of the underlying population. Because our snapshot should appropriately represent a larger group, we can generalize results from our data to that larger group. We'll formalize this process more with our two inference modules.
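A small simulation can illustrate why a random sample gives an appropriate snapshot while a biased one does not. This is only a sketch under stated assumptions: the population is synthetic (10,000 invented values drawn from a normal distribution), the sample size of 500 is arbitrary, and the "biased" sample is an extreme case that keeps only the largest values.

```python
import random
from statistics import mean

# A fixed seed so the sketch gives the same numbers each time it runs.
rng = random.Random(0)

# Synthetic population (invented values), e.g. a numeric variable of interest.
population = [rng.gauss(100, 20) for _ in range(10_000)]

# A simple random sample: every unit has the same chance of being selected.
random_sample = rng.sample(population, 500)

# A deliberately biased sample: only the 500 largest values are observed.
biased_sample = sorted(population)[-500:]

pop_mean = mean(population)
random_error = abs(mean(random_sample) - pop_mean)
biased_error = abs(mean(biased_sample) - pop_mean)

print(f"population mean:       {pop_mean:.2f}")
print(f"random sample error:   {random_error:.2f}")
print(f"biased sample error:   {biased_error:.2f}")
```

The random sample's mean lands close to the population mean, while the biased sample's mean is far off, no matter how many units it contains. Collecting more data does not repair a sampling approach that is not representative.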

When our data is not representative of a larger group, we cannot say anything definitive about the underlying population. The data may fail to be representative because of the sampling approach or because bias is present in the data. We'll return to these ideas shortly.

We can write conclusions that summarize our available data only and do not generalize to another group. For these conclusions, we should be clear that the statement only applies to our available data and that it is not intended to apply to a larger group. This clear communication can assist in ensuring that the data is used appropriately.

What does the data say about causation?

Many times, the ultimate goal is to make sense of the world using data. In other words, we want to be able to say "If I study for 2 more hours, then I should get an A on my upcoming test." We want to make causal claims using the data at hand. However, there are many factors that could contribute to a causal relationship, as we hinted at when describing variable roles.

As described in Data Science Discovery, the best method for having data support causation is using randomized controlled trials, oftentimes in the form of experiments. This approach will approximately equally distribute any confounding factors into the two (or more) groups and reduce any systematic effects of confounding variables. Sometimes experiments are not ethical or practical; in those cases, sophisticated causal inference techniques could be applied to help support causal statements. These techniques are beyond the scope of our course.
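The balancing effect of random assignment can be seen in a short simulation. This is a sketch under stated assumptions, not a description of any real study: the confounder is a hypothetical "prior preparation" score with invented values, and the self-selection scenario is an extreme case in which the best-prepared half of the students all choose the treatment (say, studying 2 more hours).

```python
import random
from statistics import mean

# A fixed seed so the sketch gives the same numbers each time it runs.
rng = random.Random(1)

# Hypothetical confounder: a "prior preparation" score for 2,000 students.
prep = [rng.gauss(50, 10) for _ in range(2000)]

# Random assignment: shuffle the students, then split them into two groups.
shuffled = prep[:]
rng.shuffle(shuffled)
treat_random, control_random = shuffled[:1000], shuffled[1000:]

# Self-selection (extreme case): the best-prepared half chooses the treatment.
ranked = sorted(prep)
control_self, treat_self = ranked[:1000], ranked[1000:]

# Compare how different the groups are on the confounder in each scenario.
gap_random = abs(mean(treat_random) - mean(control_random))
gap_self = abs(mean(treat_self) - mean(control_self))

print(f"confounder gap with random assignment: {gap_random:.2f}")
print(f"confounder gap with self-selection:    {gap_self:.2f}")
```

With random assignment, the two groups have nearly the same average preparation, so a difference in outcomes is hard to blame on preparation. With self-selection, the groups differ substantially before any treatment occurs, so preparation remains a competing explanation for any difference in test results.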

To make causal claims, which are implied by statements using phrases like "make it more likely" or "causes", we should ensure that we have randomization as part of the assignment to treatment or control groups.

Our Chicago Airbnb data, for instance, does not include random assignment. In fact, there would not be a practical way to assign some hosts to have more listings and others to have fewer. Therefore, we will not be able to make causal claims using our Chicago Airbnb data. For example, we would not be able to say that a host having more listings on Airbnb makes it more likely that the units are larger, because many other factors could be at play in the relationship between these two variables. Instead, we might say something like "a host having more listings is associated with the listings generally being larger in our data," if this in fact appears to be supported by our data.