Samples


Sample Data

Once we have a sampling frame, we need to select the observations for our sample. We then need to access those observations and record characteristics for each of them.

The sample data is often the data that we actually have access to. What do we want to do with this sample data? Many times, we want to use this sample data – the data that we have available to us – to make some sort of generalization that applies to the population as a whole. We often don’t have enough money, time, or resources to gather full census data, and so we instead use our sample, alongside statistical theories, in order to make reasonable statements about the full underlying population.

Gathering the sample is a crucial component of the data analysis process. After all, we can only perform an analysis using the data at hand. When that data is not well collected, our results may not be reliable. Remember the adage: Garbage in, garbage out. This means that if our original data going in has serious issues, then the output (results) that we generate from it will contain the same issues and limitations, and in some cases may exaggerate them.

Missing Observations

Before jumping into the samples themselves, let’s finish our conversation about missing observations. In the previous section, we identified three ways in which observations could be missing from a sample and we discussed one. Here, we will continue by discussing the other two ways:

  • An observation may not be selected to be included in the sample from the sampling frame
  • An observation may be selected for the sample but may not respond/be included in the sample

First, we’ll consider the situation where an observation is not selected to be included in the sample. Provided that the sampling is done appropriately (we’ll talk about some appropriate sampling methods next), this isn’t something that we are concerned about. One of the primary motivations for sampling is that it would be too challenging to collect data for the full population, and so we decide to only collect data for a subset of the population. By design, the rest of the population is left out. As long as there aren’t systematic ways in which portions of the population are not sampled, we aren’t too concerned.

Lastly, we may have an observation selected for a sample, but that observation does not respond or isn’t included in the data for some reason. Again, we may want to consider why we don’t have that observation. Is there a systematic reason the observation is not in the data? For example, are we systematically unable to contact those who are busy, don’t have a cell phone, don’t respond to texts, or aren’t active on social media? Do those individuals share a common characteristic, possibly based on age, socioeconomic status, or employment status? Or, is there a question after which respondents drop off? I was recently taking a survey for a company where each question was required. I didn’t know the answer to one of the questions, and there was no way to bypass it. Since I couldn’t proceed, I left the survey. Alternatively, a question could be offensive or triggering to a respondent and similarly lead them to discontinue the survey.
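To see why systematic non-response matters, here is a small simulation sketch. Everything in it is an assumption made for illustration: the population of weekly work hours, the `responds` function, and the rule that busier people are less likely to respond are all hypothetical, not taken from any real survey.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: weekly work hours for 50,000 people.
population = [rng.uniform(10, 60) for _ in range(50_000)]

def responds(hours, rng):
    """Assumed response model: busier people respond less often.

    The probability of responding falls linearly from 0.9 (at 10 hours)
    to 0.1 (at 60 hours). This rule is invented for illustration.
    """
    p = 0.9 - 0.8 * (hours - 10) / 50
    return rng.random() < p

# Step 1: select observations with simple random sampling.
selected = rng.sample(population, 5_000)

# Step 2: only some of the selected observations actually respond.
respondents = [h for h in selected if responds(h, rng)]

population_mean = mean(population)
selected_mean = mean(selected)
respondent_mean = mean(respondents)

print(f"population mean:  {population_mean:.1f}")
print(f"selected mean:    {selected_mean:.1f}")
print(f"respondent mean:  {respondent_mean:.1f}")
```

Under these assumptions, the mean of the selected observations should land close to the population mean, while the mean of the respondents should fall noticeably below it. The selection step was fine; the non-response is what makes the data unrepresentative.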

There are some approaches to limit non-responses, from multiple nudges to more intensive follow-ups. Many of these approaches require additional resources. Some of these approaches are utilized by the US government, for example, when collecting samples for various surveys.

Sampling Approaches

When we gather a sample, we are often interested in having that sample be representative of the population. The gold standard of sampling also includes random selection of the observations in the sample; the most basic form of this is simple random sampling. Beyond it, there are a number of other ways to help ensure the sample is in fact representative of the population while using appropriate statistical practices.

With simple random sampling, each observation has the same probability of being included in any given sample. In the long run, if we were to generate repeated samples, each observation would be included in approximately the same number of samples. Inclusion in a sample is determined by random selection, which ensures that human or systematic biases aren’t incorporated in the selection of the sample.
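The long-run idea can be checked with a short sketch. It uses `random.sample` from Python’s standard library; the sampling frame of 20 observation IDs, the sample size, and the helper name `simple_random_sample` are hypothetical choices for illustration.

```python
import random
from collections import Counter

def simple_random_sample(frame, n, rng):
    """Draw n distinct observations from the frame, each equally likely."""
    return rng.sample(frame, n)

rng = random.Random(42)   # fixed seed so the sketch is reproducible
frame = list(range(20))   # hypothetical sampling frame of 20 observation IDs
n = 5                     # sample size

one_sample = simple_random_sample(frame, n, rng)

# Repeat the sampling many times and count how often each observation appears.
repeats = 10_000
counts = Counter()
for _ in range(repeats):
    counts.update(simple_random_sample(frame, n, rng))

# Each inclusion rate should be close to n / N = 5 / 20 = 0.25.
inclusion_rates = {obs: counts[obs] / repeats for obs in frame}
for obs, rate in inclusion_rates.items():
    print(obs, round(rate, 3))
```

No single sample contains every observation, but across many repeated samples each observation is included at roughly the same rate.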

However, sometimes you know that you want to include observations with certain characteristics in your sample. Some of these characteristics may be rarer than others, so you can’t be certain that individuals (or enough individuals) with these characteristics will be included in your sample based on random chance alone. In these instances, you might instead prefer a more complex method of sampling. One possible option is stratified random sampling. In stratified random sampling, you first separate your sampling frame into multiple frames (strata) based on certain characteristics; for example, you might create a sampling frame of patients without a disease and another for those with a disease. Then, you perform simple random sampling within each stratum. This ensures that your sample contains representatives of each subgroup of interest.
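The two steps above (split the frame into strata, then sample randomly within each stratum) can be sketched as follows. The patient records, the 20 cases of disease out of 1,000, and the helper name `stratified_sample` are hypothetical and only serve to illustrate the idea.

```python
import random

def stratified_sample(frame, stratum_of, n_per_stratum, rng):
    """Split the frame into strata, then draw a simple random sample in each.

    stratum_of is a function that returns the stratum label for an observation.
    If a stratum has fewer than n_per_stratum members, all of them are taken.
    """
    strata = {}
    for obs in frame:
        strata.setdefault(stratum_of(obs), []).append(obs)

    sample = {}
    for label, members in strata.items():
        k = min(n_per_stratum, len(members))
        sample[label] = rng.sample(members, k)
    return sample

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical frame: 1,000 patients, of whom only 20 have the disease.
patients = [{"id": i, "disease": i < 20} for i in range(1000)]

sample = stratified_sample(patients, lambda p: p["disease"], 10, rng)

print(len(sample[True]), "patients with the disease")
print(len(sample[False]), "patients without the disease")
```

A simple random sample of 20 patients from this frame would often include no patients with the disease at all. Sampling within each stratum guarantees 10 from each group.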

More complex methods are also possible. In some of these methods, a statistical weighting is performed to adjust the sample to most accurately represent the population. (Consider the polls associated with the 2016 US presidential election, which led to inaccurate conclusions: https://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html and https://www.wpr.org/polls-missed-mark-2016-experts-say-things-are-different-2020. Weighting is used within these polls. Although it led to incorrect conclusions here, the method is not inherently incorrect. We’ll talk a little more about this in a few pages.) In other methods, non-randomness is used, often to improve the convenience or cost of the sampling. However, these types of samples are prone to biases, resulting in a sample that might not appropriately represent the population as a whole.
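As a rough illustration of the weighting idea mentioned above, the sketch below reweights group averages by known population shares. This is one simple form of weighting and not necessarily what any particular poll uses. The groups, responses, population shares, and the helper name `weighted_mean_by_group` are all made up for the example.

```python
def weighted_mean_by_group(sample, population_shares):
    """Average each group's responses, then combine by population share.

    sample: list of (group, value) pairs.
    population_shares: dict mapping group -> share of the population.
    Assumes every group in population_shares appears in the sample.
    """
    by_group = {}
    for group, value in sample:
        by_group.setdefault(group, []).append(value)

    return sum(
        population_shares[group] * (sum(values) / len(values))
        for group, values in by_group.items()
    )

# Hypothetical yes/no responses (1 = yes). The sample over-represents "young".
sample = [
    ("young", 1), ("young", 1), ("young", 0), ("young", 1),
    ("old", 0), ("old", 1),
]

# Assumed known population shares (for example, from a census).
population_shares = {"young": 0.5, "old": 0.5}

unweighted = sum(value for _, value in sample) / len(sample)
weighted = weighted_mean_by_group(sample, population_shares)

print(f"unweighted: {unweighted:.3f}")
print(f"weighted:   {weighted:.3f}")
```

Here the two "old" respondents together count as much as the four "young" respondents. When a group has very few respondents, each of them carries a large weight, so one unusual respondent can move the estimate considerably.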

In our course, we will typically analyze data collected using simple random sampling from the population, or treat the data as if it were collected using simple random sampling.

I Have Poorly Collected Data - What Do I Do?

This is a hard question, and there often isn’t a perfect solution.

There are some sophisticated statistical methods (beyond the scope of what we can cover here) that you may be able to apply, at least in certain scenarios.

Often, there isn’t anything that stops you from performing the analysis. Most of the statistical methods and calculations can be applied to any data and generate some output. However, keep in mind: Garbage in, garbage out. Your output might not appropriately generalize to the population. Your results may have severe limitations, to the extent that they may not provide reasonable (or helpful) insights.

Be sure to state your limitations and uncertainty clearly, and more than once, in your conclusions.