Welcome to Data Science Exploration!
A Second-Semester Exploration into the World of Data Science: Summarizing, Generalizing, and Predicting
In this second-semester exploration into the world of data science, we bring together the tools learned in Data Science Discovery and introduce students to the interconnected nature of the data science pipeline with real-world datasets in Python. We explore how different decisions made along the steps of the pipeline in pursuit of a research goal or question can lead to different outcomes or answers. With this in mind, we use well-motivated research goals and questions to explore best practices for conducting and communicating a beginning-to-end data science analysis.
With real-world datasets, we delve deeper into answering questions by summarizing data, making generalizations from the data to a larger population (statistical inference), and making predictions for individuals (machine learning). We start by considering how data wrangling is an essential tool in preparing our data to answer our question of interest. We also consider how both numerical and graphical data summaries can provide insights about the underlying data.
We further hone the concepts of statistical inference by simulating sampling distributions and introducing simulation-based inference. This allows us to take the information provided from a sample to make statements and evaluate theories about the underlying population.
We explore how to build and evaluate linear regression and logistic regression models for machine learning purposes, introducing feature selection techniques (including regularization and principal component analysis) as well as cross-validation. We also discuss how to build and evaluate classifier models and how to build more interpretable models. Finally, students will learn how to use linear and logistic regression models to evaluate evidence in favor of associations existing in larger populations of data.

Graphic summary of the Data Science Pipeline (icons created with Flaticon.com)
Module 7: Understanding and Wrangling Data
This module provides an introduction to the interconnected nature of the data science pipeline. We consider what it means to pursue research goals and ask research questions effectively with data. Given that there are often many decisions involved in pursuing a beginning-to-end data science analysis, what are some best practices when it comes to communicating our research findings? How do we need to clean, manipulate, and prepare our data in order to answer our questions of interest accurately?
7-00 » Your Data Science Journey - From Beginning to End
7-01 » Review of Data Basics
7-02 » Answering Questions Using Data
7-03 » Cleaning and Preparing Data
7-04 » Missing Data
7-05 » Reshaping and Merging Data
7-06 » Summarizing Variables with Statistics, Tables, & Plots
7-07 » Measurement Errors
7-08 » Deeper Dive in Data Cleaning
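As a concrete (and entirely invented) illustration of the wrangling steps in this module, the pandas sketch below handles a missing value, merges in a second table, and summarizes by group. All column names and values are made up for this example; the module's lessons use their own datasets.

```python
import pandas as pd

# Invented data: student scores with one missing value,
# plus a small lookup table we will merge in.
scores = pd.DataFrame({
    "student": ["a", "b", "c", "d"],
    "section": ["AM", "PM", "AM", "PM"],
    "score": [88.0, None, 75.0, 91.0],
})
sections = pd.DataFrame({"section": ["AM", "PM"], "room": [101, 202]})

# Drop rows with a missing score (one common strategy; imputing is another).
clean = scores.dropna(subset=["score"])

# Merge in section information, then summarize each group numerically.
merged = clean.merge(sections, on="section", how="left")
summary = merged.groupby("section")["score"].mean()
print(summary)
```

Each step (drop or impute, merge, summarize) is a decision point in the pipeline, and different choices here can lead to different answers downstream.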
Module 8: Populations, Samples, and Statistics
We return to the idea of summarizing data using a single number. Specifically, we’ll consider a sampling distribution: the distribution of statistic values that arise from different samples. If we choose a single number to summarize a sample, how can that statistic change from one sample to the next? What are the characteristics of the distribution of a statistic? How do they depend on the characteristics of the sample from which the statistic is drawn? How can we appropriately use data to simulate a sampling distribution? These are some of the important questions that we will explore in this module.
8-00 » Overview of Statistical Inference
8-01 » Populations
8-02 » Samples
8-03 » Describing a Sample with Visualizations and Statistics
8-04 » Sampling Distributions
8-05 » Sampling Distribution Properties
8-06 » Sampling Distribution for Two Populations
8-07 » Simulations for Difference Data
8-08 » Calculating Probability for Statistics
8-09 » Deeper Dive into Underlying Theory
8-10 » Conclusion
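A sampling distribution like the ones in this module can be simulated in a few lines of NumPy. The sketch below uses an invented skewed population; the numbers and distribution are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed population: 100,000 values from an exponential
# distribution with mean 2 (and standard deviation 2).
population = rng.exponential(scale=2.0, size=100_000)

# Repeatedly draw samples of size n and record each sample mean.
n, reps = 50, 5_000
sample_means = np.array([
    rng.choice(population, size=n).mean()   # sampling with replacement
    for _ in range(reps)
])

# The simulated sampling distribution centers near the population mean,
# with a spread of roughly sigma / sqrt(n).
print(population.mean(), sample_means.mean(), sample_means.std())
```

Even though the population is skewed, a histogram of `sample_means` looks roughly bell-shaped, which previews the theory discussed later in the module.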
Module 9: Statistical Inference for Populations
Now that we understand how statistics vary from sample to sample, how can we use this information to make statements about the underlying population? We will consider simulation-based approaches to statistical inference, allowing us to provide a range of reasonable values for a parameter or to decide between competing theories about it. We will also take care to use an appropriate parameter and inference technique, relying on guiding questions about our data to make the right decision.
9-00 » Overview
9-01 » Population Parameters and Sample Statistics
9-02 » One Hypothesis Testing Example
9-03 » Hypothesis Testing Framework
9-04 » Confidence Intervals
9-05 » Traditional Procedures for Inference
9-06 » Name That Scenario
9-07 » Conclusion
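One simulation-based technique from this module, the percentile bootstrap, can be sketched as follows. The sample here is invented for illustration; the lessons work with their own data.

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented sample of 40 observations (say, commute times in minutes).
sample = rng.normal(loc=30, scale=8, size=40)

# Bootstrap: resample the sample with replacement many times and record
# the statistic of interest (here, the mean) from each resample.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5_000)
])

# A 95% percentile bootstrap confidence interval for the population mean.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(round(lo, 1), round(hi, 1))
```

The resulting interval is a "range of reasonable values" for the population mean, built entirely from resampling rather than from a theoretical formula.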
Module 10: Linear Regression
This module introduces how a linear regression model can be used and evaluated for machine learning purposes. We discuss how to predict a numerical response variable given a set of numerical and/or categorical variables.
10-00 » Predicting Airbnb Prices for New Datasets
10-01 » Single Variable Descriptive Analytics and Data Manipulation
10-02 » Describing Associations between Two Variables
10-03 » Describing Associations between Three Variables
10-04 » Fitting a Multiple Linear Regression Curve
10-05 » How to Incorporate Categorical Explanatory Variables
10-06 » Interpreting your Model's Slopes
10-07 » Interaction Terms
10-08 » A Machine Learning Technique for Finding Good Predictions for New Datasets
10-09 » Evaluating your Linear Regression Model for Machine Learning and Interpretation Purposes
10-10 » Sampling Distributions for Regression
10-11 » Inference for Regression
10-12 » Airbnb Research Goal Conclusion
10-13 » Variable Transformations
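A minimal sketch of fitting a multiple linear regression with one dummy-coded categorical variable, using invented Airbnb-flavored data and plain NumPy least squares (the lessons may use a statistics library instead):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: price modeled from number of bedrooms (numeric) and
# room type (categorical, dummy-coded as 1 = private room).
n = 200
bedrooms = rng.integers(1, 5, size=n)
is_private_room = rng.integers(0, 2, size=n)
price = 50 + 40 * bedrooms - 30 * is_private_room + rng.normal(0, 10, n)

# Design matrix: intercept column, numeric predictor, dummy variable.
X = np.column_stack([np.ones(n), bedrooms, is_private_room])
coefs, *_ = np.linalg.lstsq(X, price, rcond=None)

intercept, slope_bed, slope_private = coefs
print(coefs.round(1))
```

Each slope is interpreted holding the other explanatory variables fixed: for example, `slope_private` estimates the price difference between private rooms and the baseline room type for listings with the same number of bedrooms.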
Module 11: Logistic Regression and Classification
In this module we introduce the logistic regression model, one of the most common models for predicting a categorical response variable with two distinct values. We discuss how to fit and evaluate a logistic regression model for machine learning purposes, how to use it as a classifier, and how to evaluate the performance of that classifier.
11-00 » Introduction
11-01 » Instagram Classifier Introduction
11-02 » Introducing Logistic Regression
11-03 » Odds and Probability
11-04 » Fitting a Logistic Regression Model
11-05 » Multiple Logistic Regression
11-06 » Making Predictions
11-07 » Slope and Intercept Interpretations
11-08 » Evaluating your Logistic Regression Model
11-09 » Classification with Logistic Regression
11-10 » Inference for Logistic Regression
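The fitting and classification steps can be sketched with plain NumPy. The example below fits a one-predictor logistic regression by gradient descent on invented, Instagram-flavored data; the lessons themselves may rely on a statistics library rather than a hand-rolled fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: classify accounts from one standardized numeric feature.
n = 500
followers = rng.normal(0, 1, size=n)
true_logit = -0.5 + 2.0 * followers
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)  # 1 = "real"

# Fit intercept and slope by gradient descent on the log-loss.
X = np.column_stack([np.ones(n), followers])
beta = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ beta))      # predicted probabilities
    beta -= 0.1 * X.T @ (p - y) / n      # gradient step

# Classify with a 0.5 probability threshold and check accuracy.
pred = (1 / (1 + np.exp(-X @ beta)) >= 0.5).astype(int)
accuracy = (pred == y).mean()
print(beta.round(2), accuracy)
```

The fitted slope lives on the log-odds scale: exponentiating it gives the multiplicative change in odds for a one-unit increase in the feature, which is the interpretation style developed in this module.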
Module 12: Feature Selection and Cross-Validation Techniques
What does it mean to overfit a predictive model? How does an overfit model impact our ability to pursue machine learning goals? One way to overfit a predictive model is to include too many explanatory variables that don't bring 'enough' predictive power to the model. In this module we explore ways of measuring whether an explanatory variable brings 'enough' predictive power to a model, as well as ways of searching for the combination of explanatory variables that best meets our machine learning goals.
12-00 » Introduction
12-01 » Overfitting vs. Underfitting to a Dataset
12-02 » Finding a Parsimonious Model
12-03 » Overview of Feature Selection Techniques
12-04 » Backwards Elimination Algorithm
12-05 » Forward Selection Algorithm
12-06 » Breast Cancer Research Introduction
12-07 » Regularization Techniques
12-08 » Cross-Validation Techniques
12-09 » Principal Component Regression
12-10 » Feature Selection for Logistic Regression
12-11 » Conclusion
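K-fold cross-validation as described in this module can be sketched by hand with NumPy. In this invented example, a parsimonious model is compared against one padded with thirty pure-noise features; the helper name `cv_mse` and all data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented data: y depends on x1 only; the noise features are useless.
n = 100
x1 = rng.normal(size=n)
y = 3 * x1 + rng.normal(0, 1, size=n)

def cv_mse(X, y, k=5):
    """Mean squared prediction error estimated by k-fold cross-validation."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)          # all indices not in this fold
        coefs, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((y[fold] - X[fold] @ coefs) ** 2))
    return np.mean(errs)

ones = np.ones(n)
noise = rng.normal(size=(n, 30))                 # thirty useless features
small = np.column_stack([ones, x1])              # parsimonious model
big = np.column_stack([ones, x1, noise])         # overfit-prone model
print(cv_mse(small, y), cv_mse(big, y))
```

Because the fold assignments are random, individual runs vary, but the padded model's cross-validated error is typically inflated relative to the parsimonious one, which is exactly the signal that feature selection techniques exploit.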
Module 13: More Machine Learning Methods
This module provides a deeper dive into some select machine learning methods, including content and techniques not typically taught during the semester. Students who are eager to learn more can read through these pages to be introduced at a high level to some machine learning methods, their implementation in Python, and brief interpretations of their results.
13-00 » More Machine Learning
13-01 » Decision Trees
13-02 » Random Forests
13-03 » Neural Networks
13-04 » Comparing Machine Learning Models
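As a small taste of one of these methods, the sketch below fits a shallow decision tree with scikit-learn on invented data (assuming scikit-learn is available; the pages in this module use their own datasets and settings).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)

# Invented data: the class is determined mostly by whether feature 0
# exceeds 0, with a little noise mixed in.
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

# A shallow tree keeps the model interpretable: a handful of if/else splits.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.score(X, y))   # accuracy on the training data
```

Limiting `max_depth` is one simple guard against overfitting; random forests, covered next, average many such trees to improve predictions at some cost to interpretability.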