In this second-semester exploration into the world of data science, we bring together the tools learned in Data Science Discovery and introduce students to the interconnected nature of the data science pipeline with real-world datasets in Python. We explore how different decisions made along the steps of the pipeline, in pursuit of a research goal or question, can lead to different outcomes or answers. With this in mind, we use well-motivated research goals and questions to explore best practices for conducting and communicating a beginning-to-end data science analysis.
Using real-world datasets, we delve deeper into machine learning and inference techniques. We explore how to build and evaluate linear regression and logistic regression models for machine learning purposes. We introduce feature selection and dimensionality reduction techniques, including regularization and principal component analysis, as well as cross-validation techniques. We discuss how to build and evaluate classifier models and how to build more interpretable models.
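As a rough illustration of the cross-validation idea mentioned above, the following minimal sketch estimates the held-out error of a one-predictor linear regression using k-fold cross-validation. It uses only the Python standard library and synthetic data; the helper names `fit_line` and `k_fold_mse`, the fold count, and the data-generating line are assumptions made for this example, not course materials.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_mse(xs, ys, k=5, seed=0):
    """Average mean squared error on the held-out fold, across k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2
                  for i in held_out]
        fold_errors.append(statistics.fmean(errors))
    return statistics.fmean(fold_errors)


# Synthetic data (an assumption for illustration): y = 2x + 1 plus unit noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

cv_mse = k_fold_mse(xs, ys, k=5)
print(f"5-fold cross-validated MSE: {cv_mse:.3f}")
```

Because each observation is scored by a model that never saw it during fitting, the averaged error is a more honest estimate of performance on new data than the training error would be.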
We further hone the concepts of statistical inference by simulating sampling distributions and introducing simulation-based inference. Students learn how to use linear and logistic regression models to evaluate evidence for associations in the larger populations from which the data were sampled.
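To make the idea of simulating a sampling distribution concrete, here is a minimal sketch using only the Python standard library. It draws many random samples from a synthetic, right-skewed population and records each sample mean. The function name `simulate_sample_means`, the population, and the sample sizes are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples (without replacement within a sample)
    and return the list of their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, sample_size))
            for _ in range(n_samples)]


# Synthetic right-skewed population (an assumption for illustration).
pop_rng = random.Random(1)
population = [pop_rng.expovariate(1 / 10) for _ in range(10_000)]

sample_means = simulate_sample_means(population, sample_size=50,
                                     n_samples=2_000)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)
print(f"Population mean:            {pop_mean:.2f}")
print(f"Mean of sample means:       {statistics.fmean(sample_means):.2f}")
print(f"SD of sample means:         {statistics.stdev(sample_means):.2f}")
print(f"Theoretical standard error: {pop_sd / 50 ** 0.5:.2f}")
```

The simulated sample means should center near the population mean, and their spread should be close to the population standard deviation divided by the square root of the sample size, even though the population itself is skewed.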
We also introduce enhanced data wrangling and data visualization techniques.