Alternate Options for Using this Content
Different Ordering
The authors of the Data Science Exploration website have experimented with ordering the modules in different ways. Below is an alternative ordering used by one instructor: a course that follows this structure places the machine learning content earlier in the semester, while the inference content is saved for the end.
The same content is covered in both courses; the sole difference from the structure on the home page is the ordering of the content. Specifically, pages 10-10, 12-07, 13-06, and 13-07 below have been moved into different modules, and a few modules contain minor internal reordering compared to the home page.
Module 7: Understanding and Wrangling Data
This module provides an introduction to the interconnected nature of the data science pipeline. We consider what it means to pursue research goals and ask research questions effectively with data. Given that there are often many decisions involved in pursuing a beginning-to-end data science analysis, what are some best practices when it comes to communicating our research findings? How do we need to clean, manipulate, and prepare our data in order to answer our questions of interest accurately?
7-00 » Your Data Science Journey - From Beginning to End
7-01 » Review of Data Basics
7-02 » Answering Questions Using Data
7-03 » Cleaning and Preparing Data
7-04 » Missing Data
7-05 » Reshaping and Merging Data
7-06 » Summarizing Variables with Statistics, Tables, & Plots
7-07 » Measurement Errors
7-08 » Deeper Dive in Data Cleaning
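The cleaning, merging, and summarizing steps this module walks through can be sketched in pandas. Everything below is invented for illustration; the tables, column names, and values are stand-ins, not the course data:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: an inconsistent category label and a missing price
listings = pd.DataFrame({
    "listing_id": [1, 2, 3, 4],
    "neighborhood": ["Downtown", "downtown", "Uptown", "Uptown"],
    "price": [120.0, np.nan, 95.0, 210.0],
})

# Clean: standardize the category labels, then impute the missing price
listings["neighborhood"] = listings["neighborhood"].str.title()
listings["price"] = listings["price"].fillna(listings["price"].median())

# Merge: attach a second (also hypothetical) table of neighborhood attributes
walk_scores = pd.DataFrame({
    "neighborhood": ["Downtown", "Uptown"],
    "walk_score": [88, 72],
})
merged = listings.merge(walk_scores, on="neighborhood", how="left")

# Summarize: a grouped statistic in table form
print(merged.groupby("neighborhood")["price"].agg(["mean", "count"]))
```

Each step here maps onto a page above: imputation (Missing Data), `merge` (Reshaping and Merging Data), and the grouped table (Summarizing Variables with Statistics, Tables, & Plots).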
Module 8: Linear Regression
This module introduces how a linear regression model can be used and evaluated for machine learning purposes. We discuss how to predict a numerical response variable given a set of numerical and/or categorical explanatory variables. Interaction terms and variable transformations are incorporated for enhanced model fit. Finally, we discuss how to make our machine learning regression models more interpretable.
8-00 » Predicting Airbnb Prices for New Datasets
8-01 » Single Variable Descriptive Analytics and Data Manipulation
8-02 » Describing Associations between Two Variables
8-03 » Describing Associations between Three Variables
8-04 » A Machine Learning Technique for Finding Good Predictions for New Datasets
8-05 » Fitting a Multiple Linear Regression Curve
8-06 » How to Incorporate Categorical Explanatory Variables
8-07 » Interpreting your Model's Slopes
8-08 » Evaluating your Linear Regression Model for Machine Learning and Interpretation Purposes
8-09 » Interaction Terms
8-10 » Airbnb Research Goal Conclusion
8-11 » Variable Transformations
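The fit-and-interpret workflow of this module can be sketched with scikit-learn on synthetic data. The variable names and coefficient values below are made up to echo the Airbnb example; this is not the course dataset or its prescribed code:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the Airbnb data: price driven by bedrooms and a
# binary "downtown" indicator (a dummy-coded categorical explanatory variable)
n = 200
bedrooms = rng.integers(1, 5, size=n)
downtown = rng.integers(0, 2, size=n)
price = 50 + 40 * bedrooms + 30 * downtown + rng.normal(0, 5, size=n)

# Fit a multiple linear regression model
X = np.column_stack([bedrooms, downtown])
model = LinearRegression().fit(X, price)

# Slope interpretations: expected price change per additional bedroom,
# and the downtown premium holding bedrooms fixed
print(model.intercept_, model.coef_)
print(model.score(X, price))  # R^2 on the training data
```

Because the data were generated with slopes 40 and 30, the fitted coefficients land close to those values, which is exactly the slope-interpretation exercise pages 8-07 and 8-08 describe.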
Module 9: Feature Selection and Cross-Validation Techniques
What does it mean to overfit a predictive model, and how does an overfit model impact our ability to pursue machine learning goals? One way to overfit is to include too many explanatory variables that do not bring 'enough' predictive power to the model. In this module we explore ways of measuring whether an explanatory variable brings 'enough' predictive power, as well as ways of searching for the combination of explanatory variables that best meets our machine learning goals.
9-00 » Introduction
9-01 » Overfitting vs. Underfitting to a Dataset
9-02 » Finding a Parsimonious Model
9-03 » Overview of Feature Selection Techniques
9-04 » Backwards Elimination Algorithm
9-05 » Forward Selection Algorithm
9-06 » Breast Cancer Research Introduction
9-07 » Regularization Techniques
9-08 » Cross-Validation Techniques
9-09 » Principal Component Regression
9-10 » Conclusion
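Both themes of this module, cross-validation and stepwise feature selection, can be sketched with scikit-learn's `cross_val_score` and `SequentialFeatureSelector`. The data here are synthetic, constructed so that only two of ten candidate features matter; this is an illustration, not the course's own analysis:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Synthetic data: the response depends only on the first two of ten
# candidate features; the other eight are noise an overfit model can chase
n = 150
X = rng.normal(size=(n, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, size=n)

# 5-fold cross-validated R^2: the full ten-feature model vs. the
# parsimonious two-feature model
full_cv = cross_val_score(LinearRegression(), X, y, cv=5).mean()
reduced_cv = cross_val_score(LinearRegression(), X[:, :2], y, cv=5).mean()
print(full_cv, reduced_cv)

# Forward selection: greedily add the feature that most improves CV score
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the features the algorithm kept
```

Because the signal lives entirely in the first two features, forward selection recovers them; swapping `direction="backward"` gives the backwards elimination variant from page 9-04.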
Module 10: Logistic Regression and Classification
In this module we introduce the logistic regression model, one of the most common models for predicting a categorical response variable with two distinct values. We discuss how to fit and evaluate a logistic regression model for machine learning purposes, how to use it as a classifier, and how to evaluate that classifier's performance. Finally, we apply the feature selection techniques introduced in Module 9 to search for the combination of explanatory variables that yields the best classifier performance.
10-00 » Introduction
10-01 » Instagram Classifier Introduction
10-02 » Introducing Logistic Regression
10-03 » Odds and Probability
10-04 » Fitting a Logistic Regression Model
10-05 » Multiple Logistic Regression
10-06 » Making Predictions
10-07 » Slope and Intercept Interpretations
10-08 » Evaluating your Logistic Regression Model
10-09 » Classification with Logistic Regression
10-10 » Feature Selection
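The fit-then-classify pipeline this module builds can be sketched as follows. The feature names loosely echo the Instagram example, but the data, labels, and coefficients are all simulated for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(2)

# Simulated stand-in for the Instagram data: the log-odds of an account
# being "fake" rise with follower_ratio and fall with post_count
n = 400
follower_ratio = rng.normal(0, 1, size=n)
post_count = rng.normal(0, 1, size=n)
log_odds = 1.5 * follower_ratio - 1.0 * post_count
y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Fit a multiple logistic regression model
X = np.column_stack([follower_ratio, post_count])
clf = LogisticRegression().fit(X, y)

# Classification: predicted probabilities -> class labels via a 0.5 threshold
probs = clf.predict_proba(X)[:, 1]
preds = (probs >= 0.5).astype(int)
print(confusion_matrix(y, preds))
print(accuracy_score(y, preds))

# Slope interpretation: a one-unit increase in a feature multiplies the
# odds of the positive class by roughly exp(coefficient)
print(np.exp(clf.coef_))
```

The confusion matrix and accuracy mirror the classifier-evaluation pages above, and the exponentiated coefficients illustrate the odds-based slope interpretations from page 10-07.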
Module 11: More Machine Learning Methods
This module provides a deeper dive into selected machine learning methods, including content and techniques not typically taught during the semester. Students who are eager to learn more can read through these pages to be introduced at a high level to these methods, their implementation in Python, and brief interpretations of their results.
11-00 » More Machine Learning
11-01 » Decision Trees
11-02 » Random Forests
11-03 » Neural Networks
11-04 » Comparing Machine Learning Models
Module 12: Populations, Samples, and Statistics
We return to the idea of summarizing data with a single number. Specifically, we'll consider a sampling distribution: the distribution of possible statistic values that arise from different samples. If we choose a single number to summarize a sample, how can that statistic value change from one sample to the next? What are the characteristics of a statistic's distribution, and how do they depend on the sample from which the statistic is computed? How can we appropriately use data to simulate a sampling distribution? These are some of the important questions we will explore in this module.
12-00 » Overview of Statistical Inference
12-01 » Populations
12-02 » Samples
12-03 » Describing a Sample with Visualizations and Statistics
12-04 » Sampling Distributions
12-05 » Sampling Distribution Properties
12-06 » Sampling Distribution for Two Populations
12-07 » Sampling Distributions for Regression
12-08 » Simulations for Difference Data
12-09 » Calculating Probability for Statistics
12-10 » Deeper Dive into Underlying Theory
12-11 » Conclusion
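The simulation question this module poses can be sketched directly in NumPy: draw many samples from a (hypothetical, simulated) population and watch the distribution of the sample mean take shape. The population below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical right-skewed population, e.g., 100,000 household incomes
population = rng.exponential(scale=50.0, size=100_000)

# Simulate the sampling distribution of the sample mean: draw many samples
# of size n and record each sample's mean (sampling with replacement is a
# fine approximation here because the population is so large)
n = 100
samples = rng.choice(population, size=(2000, n))
means = samples.mean(axis=1)

# The simulated distribution centers near the population mean, and its
# spread is close to the population sd divided by sqrt(n)
print(population.mean(), means.mean())
print(population.std() / np.sqrt(n), means.std())
```

Even though the population itself is strongly skewed, the 2,000 simulated means cluster tightly and symmetrically around the population mean, which is the key sampling-distribution property the pages above develop.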
Module 13: Statistical Inference for Populations
Now that we understand how statistics vary from sample to sample, how can we use this information to make statements about the underlying population? We will consider simulation-based approaches to statistical inference, which allow us to provide a range of reasonable values for a parameter or to decide between competing theories about it. We will also learn to choose an appropriate parameter and inference technique, relying on guiding questions about our data to make the right decision.
13-00 » Overview
13-01 » Population Parameters and Sample Statistics
13-02 » One Hypothesis Testing Example
13-03 » Hypothesis Testing Framework
13-04 » Confidence Intervals
13-05 » Traditional Procedures for Inference
13-06 » Inference for Regression
13-07 » Inference for Logistic Regression
13-08 » Name That Scenario
13-09 » Conclusion
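One common simulation-based approach of the kind this module describes is a permutation test for a difference in means. The two groups below are simulated purely for illustration; the scenario and numbers are not from the course:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical two-group data, e.g., scores under two conditions
group_a = rng.normal(10.0, 2.0, size=50)
group_b = rng.normal(11.0, 2.0, size=50)
observed = group_b.mean() - group_a.mean()

# Permutation test: under the null hypothesis that the condition doesn't
# matter, group labels are exchangeable, so shuffle the pooled data and
# recompute the statistic many times to build a null distribution
pooled = np.concatenate([group_a, group_b])
null_diffs = []
for _ in range(2000):
    shuffled = rng.permutation(pooled)
    null_diffs.append(shuffled[50:].mean() - shuffled[:50].mean())
null_diffs = np.array(null_diffs)

# Two-sided p-value: the share of shuffled differences at least as extreme
# as the one we actually observed
p_value = np.mean(np.abs(null_diffs) >= abs(observed))
print(observed, p_value)
```

This is the hypothesis-testing framework of pages 13-02 through 13-03 in miniature: a null model, a simulated null distribution, and a p-value comparing the observed statistic against it.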