Welcome to Data Science Exploration!

A Second-Semester Exploration into the World of Data Science

In this second-semester course, we bring together the tools learned in Data Science Discovery and introduce students to the interconnected nature of the data science pipeline, working with real-world datasets in Python. We examine how different decisions made along the steps of the pipeline, in pursuit of a research goal or question, can lead to different outcomes or answers. With this in mind, we use well-motivated research goals and questions to explore best practices for conducting and communicating a beginning-to-end data science analysis.

Using real-world datasets, we delve deeper into machine learning and inference techniques. We build and evaluate linear regression and logistic regression models for machine learning purposes, introduce feature selection techniques (including regularization and principal component analysis) along with cross-validation, and discuss how to build and evaluate classifier models. We also discuss how to build more interpretable models.

We further hone the concepts of statistical inference by simulating sampling distributions and introducing simulation-based inference. Students should learn how to use linear and logistic regression models to evaluate evidence in favor of associations existing in larger populations of data.
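As a small taste of the simulation-based approach, the sketch below builds a bootstrap sampling distribution for a sample mean in Python. The data here are simulated stand-ins rather than one of the course datasets, and only NumPy and pandas are assumed.

    import numpy as np
    import pandas as pd

    # Illustrative data: any numerical column from a real dataset would work here.
    rng = np.random.default_rng(seed=42)
    sample = pd.Series(rng.normal(loc=50, scale=10, size=200))

    # Bootstrap: resample the data with replacement many times, recording each mean.
    boot_means = np.array([
        sample.sample(frac=1, replace=True).mean()
        for _ in range(5000)
    ])

    # A 95% bootstrap confidence interval for the population mean.
    lower, upper = np.percentile(boot_means, [2.5, 97.5])
    print(f"95% bootstrap CI for the mean: ({lower:.2f}, {upper:.2f})")

The collection of bootstrap means approximates the sampling distribution of the sample mean, which is the idea the simulation-based inference units build on.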

Enhanced data wrangling and data visualization techniques are introduced as well.

Graphic summary of the Data Science Pipeline (icons created with Flaticon.com)

Module 7: Understanding and Wrangling Data

This module provides an introduction to the interconnected nature of the data science pipeline. We consider what it means to pursue research goals and ask research questions effectively with data. Given that there are often many decisions involved in pursuing a beginning-to-end data science analysis, what are some best practices when it comes to communicating our research findings? Finally, what are some ways in which we might clean and manipulate a dataframe for further analysis?
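As a preview of the kind of dataframe cleaning and manipulation this module covers, here is a minimal pandas sketch on a small made-up dataframe; the column names and values are purely illustrative.

    import pandas as pd

    # A small made-up dataframe with some common messiness.
    df = pd.DataFrame({
        "city": ["Chicago", "chicago", "Urbana", None],
        "temp_f": ["71", "68", "not recorded", "75"],
    })

    # Standardize text, coerce bad numeric entries to missing, then drop incomplete rows.
    df["city"] = df["city"].str.title()
    df["temp_f"] = pd.to_numeric(df["temp_f"], errors="coerce")
    clean = df.dropna().copy()

    # Add a derived column and summarize by group.
    clean["temp_c"] = (clean["temp_f"] - 32) * 5 / 9
    print(clean.groupby("city")["temp_c"].mean())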

Module 8: Linear Regression

This module introduces how a linear regression model can be used and evaluated for machine learning purposes. We discuss how to predict a numerical response variable given a set of numerical and/or categorical explanatory variables. Interaction terms and variable transformations are incorporated for enhanced model fit. Finally, we discuss how to make our machine learning regression models more interpretable.
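For a rough sense of what fitting such a model looks like in Python, the sketch below uses scikit-learn (one reasonable library choice) on a tiny made-up dataset with one numerical and one categorical explanatory variable.

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    # Made-up example: predict price from size (numerical) and neighborhood (categorical).
    homes = pd.DataFrame({
        "size_sqft": [850, 1200, 1500, 2000, 950, 1750],
        "neighborhood": ["north", "south", "north", "south", "north", "south"],
        "price": [150000, 210000, 240000, 320000, 160000, 300000],
    })

    # One-hot encode the categorical explanatory variable.
    X = pd.get_dummies(homes[["size_sqft", "neighborhood"]], drop_first=True)
    y = homes["price"]

    # Fit the linear regression model and check fit on the training data.
    model = LinearRegression().fit(X, y)
    print("R-squared on the training data:", r2_score(y, model.predict(X)))
    print(dict(zip(X.columns, model.coef_)))

Interaction terms and variable transformations can be added by creating new columns (for example, a product of two explanatory variables) before fitting.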

Module 9: Feature Selection and Cross-Validation Techniques

What does it mean to overfit a predictive model? How does an overfit model impact our ability to pursue machine learning goals? One way to overfit a predictive model is to include too many explanatory variables that don't bring 'enough' predictive power to the model. In this module we explore ways of measuring whether an explanatory variable brings 'enough' predictive power to a predictive model. We also explore ways of searching for the combination of explanatory variables that best meets our machine learning goals for a predictive model.
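The sketch below illustrates these ideas under a few assumptions: synthetic data from scikit-learn stand in for a real dataset, and lasso regression serves as one example of a regularization-based approach to trimming uninformative explanatory variables, compared against an unregularized model with 5-fold cross-validation.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, Lasso
    from sklearn.model_selection import cross_val_score

    # Synthetic data: 100 rows, 20 candidate explanatory variables, only 5 informative.
    X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                           noise=10, random_state=0)

    # Compare an unregularized model to a lasso model using 5-fold cross-validation.
    for name, model in [("linear", LinearRegression()), ("lasso", Lasso(alpha=1.0))]:
        scores = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(f"{name}: mean cross-validated R-squared = {scores.mean():.3f}")

Because the scores come from held-out folds rather than the training data, they give a fairer picture of how well each candidate set of explanatory variables generalizes.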

Module 10: Logistic Regression and Classification

In this module we introduce the logistic regression model, one of the most common models for predicting a categorical response variable with two distinct values. We discuss how to fit and evaluate a logistic regression model for machine learning purposes, how to use it as a classifier, and how to evaluate the performance of a classifier model. Finally, we apply the feature selection techniques introduced in Module 9 to search for the combination of explanatory variables that yields the best classifier performance.
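As a minimal sketch of this workflow, the example below fits a logistic regression classifier with scikit-learn and evaluates it on held-out data; the built-in breast cancer dataset simply stands in for the course's real-world data.

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix
    from sklearn.model_selection import train_test_split

    # A built-in two-class dataset stands in for a real-world dataset.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit the logistic regression model and use it as a classifier on held-out data.
    clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # Evaluate classifier performance.
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))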

Module 11: More Machine Learning Methods