The ability to collect, store, and process large amounts of detailed data in a variety of fields has led to a surge in the use of data for decision-making tasks, ranging from governmental policy making to drafting players in sports. Data literacy is thus important, and this first introductory course focuses on shifting from the traditional mode of deterministic (yes/no) thinking to probabilistic thinking. We will review concepts from applied probability and statistics and explore how they can be used to build a data-driven infrastructure, with applications ranging from understanding everyday phenomena (e.g., descriptive modeling) to making decisions based on data (e.g., predictive modeling). In particular, we will focus on the principles and best practices for working with data, including (a) understanding the bias-variance tradeoff, (b) avoiding overfitting, (c) choosing the most appropriate model for your data, and (d) evaluating your model's performance. While the main focus of the course is on supervised learning, we will also introduce unsupervised learning, in particular the problem of clustering. We will explore Monte Carlo simulation and resampling, and how they can be used to make predictions for systems that are too complicated to be solved in closed form. Finally, we will provide an overview of analytical methods for specialized forms of data, including time series and relational data.

Academic Career: Graduate
Course Component: Lecture
Grade Component: Grad LG/SNC Basis
Course Requirements: PREQ: CMPINF 2100 Introduction to Data Centric Computing
Minimum Credits: 3
Maximum Credits: 3