Populations


Populations of Interest

Identifying your population of interest is a crucial and often overlooked step when analyzing and interpreting data.

What is the population that you’d like to make a statement about? In an ideal world where you could collect all possible data, the population would be every single unit that you would collect data on.

Some examples of populations of interest could be:

  • All University of Illinois students
  • All online Data Science Discovery students
  • All bakery orders
  • All visits to a doctor’s office
  • All crayons produced at a factory
  • All Chicago Airbnb listings

While all of these could be populations of interest, some of these could also be clarified further. For example, let’s consider the second proposed population: all online Data Science Discovery students. Questions could include:

  • What does it take to be considered a Data Science Discovery student? Should we count everyone who visits the site one time? Who completes one module? Who completes all modules?
  • Data Science Discovery is also taught through a residential course at the University of Illinois. Should we also consider the students registered through the University of Illinois as online Data Science Discovery students? Would we only want to count them if they use the online content?
  • Are we only interested in previous Data Science Discovery students, or would we also want to make statements about future Data Science Discovery students?

The concept of the population of interest can be somewhat abstract. For example, consider the last question above. Future Data Science Discovery students may not be fully defined, as the students may not have decided to take the course yet. However, if we want to make a statement about Data Science Discovery students, we may be interested in extending a prediction or statement to those upcoming students. It is possible that our population of interest can only be approximated or cannot be fully enumerated.

Additionally, our population consists of individual observational units. In some of the examples above, these units were people, but observational units do not need to be people. They could also be customer orders from a bakery (rather than customers), visits to a doctor’s office (rather than patients), crayons produced (rather than employees), or housing units (rather than Airbnb customers).

Since we’ll be focusing on Airbnb throughout this module, let’s also consider some of the ways we can refine this population:

  • Do the listings (units) need to be located in any specific location? How do we define a listing being located in Chicago? Do the suburbs count?
  • Does the listing need to have had any specific number of bookings?
  • Should we count listings that have been removed?
  • What about listings where the owners have changed? Do they count as two different listings, or only one listing?
  • What about listings where the unit has been updated? Is this one listing, or two? If one, how do we determine which characteristics to include in the data?
  • Is it only listings that are online as of a specific date?
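One way to make these refinements concrete is to write the population definition down as an explicit rule. The following is a minimal sketch only: the listing records, the field names (`city`, `bookings`, `removed_on`), and the thresholds are all hypothetical and not taken from any real Airbnb dataset. It shows how two reasonable definitions can produce two different populations from the same records.

```python
from datetime import date

# Hypothetical listing records; every field name and value is made up for illustration.
listings = [
    {"id": 1, "city": "Chicago", "bookings": 12, "removed_on": None},
    {"id": 2, "city": "Evanston", "bookings": 3, "removed_on": None},
    {"id": 3, "city": "Chicago", "bookings": 0, "removed_on": None},
    {"id": 4, "city": "Chicago", "bookings": 7, "removed_on": date(2023, 1, 15)},
]


def in_population(listing, as_of, min_bookings=1, cities=("Chicago",)):
    """Return True if a listing meets one explicit (assumed) population definition."""
    located = listing["city"] in cities              # does the location count?
    booked = listing["bookings"] >= min_bookings     # enough bookings?
    removed_on = listing["removed_on"]
    online = removed_on is None or removed_on > as_of  # still online on the cutoff date?
    return located and booked and online


as_of = date(2023, 6, 1)

# Strict definition: city of Chicago only, at least one booking, online as of the cutoff.
strict_ids = [row["id"] for row in listings if in_population(row, as_of)]

# Broader definition: include one suburb and listings with no bookings yet.
broad_ids = [
    row["id"]
    for row in listings
    if in_population(row, as_of, min_bookings=0, cities=("Chicago", "Evanston"))
]

print(strict_ids)
print(broad_ids)
```

Writing the rule as a function with named parameters makes each choice visible, so stakeholders can question or change a single criterion without rewriting the analysis.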

Again, there may still be other considerations to refine this population that we haven’t considered yet. Defining and refining the population of interest can be completed in stages, especially as other key stakeholders ask questions relevant to their contexts. The best data science isn’t completed independently without any outside input until the end, but instead is a team effort from the beginning to the end, including data scientists, statisticians, computer scientists, and domain experts.

Population Data

We’ve talked about how to define your population of interest and considered the clarifications that the definition might need. Why does this matter?

First, how does your data compare to your population of interest?

Suppose that your population of interest is all previous customer orders that a bakery has received. It is possible that the bakery has all of this information available in a dataset. The data would then be considered a census, or data for every observation in the population of interest. Inference is not necessary in this case: we can calculate summary measures directly on the population, and we have no sampling uncertainty about it. We can analyze this dataset directly to answer our research questions using the tools that you have learned in Modules 1-10, and our results summarize the truth about the population as long as the recorded variables are accurate.
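As a small illustration of working with a census, the sketch below computes summary measures directly on a made-up list of order totals. The values and variable names are hypothetical, and it uses Python's standard library rather than any particular tool from earlier modules. Because the list is assumed to contain every order, the results are population values rather than estimates.

```python
from statistics import mean, pstdev

# Hypothetical census: the total (in dollars) of every order the bakery has received.
order_totals = [12.50, 8.00, 23.75, 15.25, 8.00, 40.50]

# These are population parameters, not estimates, because no orders are missing.
population_mean = mean(order_totals)
# pstdev divides by N, which treats the data as the entire population.
population_sd = pstdev(order_totals)
# Proportion of all orders above $20.
share_over_20 = sum(total > 20 for total in order_totals) / len(order_totals)

print(population_mean, population_sd, share_over_20)
```

Note the use of `pstdev` rather than `stdev`: the sample version divides by N - 1 to account for estimating from a sample, which is not needed when the data is the whole population.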

Missing Data

Second, you may also want to consider who from the population might be missing from the data. This has implications in your interpretations of the results.

There are a few primary ways that units could be missing from the data:

  • An observation is excluded from the sampling frame, or the set of units that we could feasibly sample
  • An observation may not be selected to be included in the sample from the sampling frame
  • An observation may be selected for the sample but may not respond or may otherwise not end up in the sample

Here, we’ll consider the first situation. In the next section, we’ll discuss more about the second and third situations.

Most importantly, is there a systematic way in which observations are excluded from the sampling frame? To answer this question, we might consider why observations might be excluded from the list.

For example, let’s consider the population of all Data Science Discovery students, now or in the future. Future students may not have decided to participate in Data Science Discovery yet, or we may not have access to their information, so it’s impossible to include them in the sampling frame. Are the future Data Science Discovery students different from the current ones in a substantial or systematic way? I might be concerned that earlier adopters of Data Science Discovery are systematically different from future students, and therefore the same generalization may not apply to the future students. Or maybe the version of Data Science Discovery that earlier adopters used differs from the version that future students will use, so that the two groups might not be that similar. In either case, I might be concerned about how applicable my results would be to the full population, because I can’t record information about a portion of the population that may differ from my sample.

What if an observation is currently in the population but not contained within the sampling frame? We may want to dive deeper into why the observation was excluded from the sampling frame. Is there a characteristic that resulted in this observation being excluded, so that observations with that characteristic are systematically excluded? If so, we may be concerned that we are missing information about a portion of the population. It’s also possible that a technical or record-keeping oversight occurred, and that the observation was unintentionally or accidentally omitted.
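When the population can be listed, one rough check for systematic exclusion is to compare exclusion rates across a characteristic. The sketch below assumes a hypothetical population of eight students labeled by how they take the course and a hypothetical sampling frame. The identifiers, labels, and counts are invented for illustration.

```python
from collections import Counter

# Hypothetical population: every student, labeled by how they take the course.
population = {
    "s1": "online", "s2": "online", "s3": "online", "s4": "online",
    "s5": "residential", "s6": "residential",
    "s7": "residential", "s8": "residential",
}

# Hypothetical sampling frame: the students we could actually sample.
sampling_frame = {"s1", "s2", "s5", "s6", "s7", "s8"}

# Units in the population that the frame misses.
excluded = set(population) - sampling_frame

# Share of each group that is missing from the frame.
group_sizes = Counter(population.values())
excluded_by_group = Counter(population[student] for student in excluded)
exclusion_rate = {
    group: excluded_by_group[group] / size for group, size in group_sizes.items()
}

print(sorted(excluded))
print(exclusion_rate)
```

If the exclusion rates differ sharply between groups, as they do in this made-up example, the missing units are unlikely to be a record-keeping accident, and conclusions about the under-covered group deserve extra caution.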

What Can We Do?

What do we do when our sampling frame doesn’t match our population of interest? You’d be surprised by how often this happens, as it is quite challenging to accurately identify and fully record the full population of interest. Instead of abandoning the analysis, we can acknowledge the additional uncertainty in our conclusions.

We should consider who might be missing from our sampling frame and why they might be missing. If it is plausible that we are systematically missing some portion of the population, maybe we need to adjust our sampling frame, if possible. How can or should we be more inclusive?

If it is not possible to add a missing observation to our sampling frame, consider this when you frame your conclusions. Recognize and communicate the limitations of the results. We can’t collect data on some systematic portion of the population, and this limits how generalizable the results may be. In other words, it adds to the uncertainty that we will define throughout the next two modules.