Course Info

Class Time: Tue, Thu 2:00 - 3:15pm
Class Location: Data Science 305
Course Canvas site: https://canvas.its.virginia.edu/courses/115304
Course Teams site: 24F Statistical Learning
Instructor: Dr. Michael D. Porter
Email: mdp2u {at} virginia.edu
Office: Data Science 432
Office Hours: Tuesdays 9:40 - 10:45am in DS 300
(and by appt.)
TA: Kaleigh O’Hara
Email: ear3cg {at} virginia.edu
Office Hours: TBD

Course Delivery

The course will be delivered live and in-person. There may be occasional remote synchronous or pre-recorded lectures. Any such lectures will be recorded and available on Canvas.


Course Prerequisites

Students taking this course should have prior knowledge of linear regression analysis (e.g., DS 6021, SYS 4021/6021, STAT 5120), statistical inference (e.g., APMA 3120), and linear algebra (e.g., APMA 3080). Students should also have a basic working knowledge of a scientific programming language (e.g., R, Python, Matlab). All course examples will be in R (tidyverse dialect).


Course Description

Fundamentals of data mining and machine learning within a common statistical framework. Topics include regression, classification, clustering, resampling, regularization, tree-based methods, ensembles, boosting, and algorithmic bias. Coursework is conducted in the R programming language.


Student Learning Objectives

Students will learn how and when to use common statistical learning methods, understand their comparative strengths and weaknesses, and learn how to critically evaluate their performance. Students completing this course should be able to: (i) construct and apply modern statistical learning methods for predictive modeling, (ii) use unsupervised learning methods to find patterns and structure in data, and (iii) properly select, tune, and evaluate models.


Required Textbooks

  1. An Introduction to Statistical Learning (2nd) by James, Witten, Hastie and Tibshirani.
    • An electronic version of this book is freely available at https://statlearning.com/. This book provides a less technical description of common statistical learning methods.
  2. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Edition) by Hastie, Tibshirani, and Friedman.
  3. Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
    • An electronic version of this book is freely available at http://www.mmds.org/. We will only cover some parts of this text.
  4. Introduction to Data Mining (Second Edition) by Tan, Steinbach, Karpatne, and Kumar.

Other Course Materials

  • This course requires the use of the following statistical and typesetting software:

    • R (http://cran.us.r-project.org) is a free programming language for statistical computing, graphics, and machine learning. I am using R 4.4.1. It is recommended that you update to this version or newer.
    • RStudio is a free IDE for R (https://posit.co/downloads/). I am using RStudio 2024.04.2+764. It is recommended that you update to this version or newer.
    • Quarto (https://quarto.org/docs/get-started/) is a free technical publishing system that replaces RMarkdown. We will use Quarto documents for homework. Version 1.5.56 or higher is required.

  • Other course material and reading assignments will come from instructor notes and recent journal articles.

  • The free textbook Modern Data Science with R by Baumer, Kaplan, and Horton is written for an undergraduate-level “Intro to Data Science” course. It covers the tidyverse, statistical inference, and a basic introduction to many of the methods we will study this semester. It provides good overall preparation and a handy reference.

  • The free textbook Feature Engineering and Selection: A Practical Approach for Predictive Models by Kuhn and Johnson provides a more in-depth coverage of feature engineering than we will be able to do in this course.

  • The free textbook Hands-on Machine Learning with R by Boehmke and Greenwell gives R code with some helpful details for most of the methods we will cover. This can be a handy reference.

  • The free textbook Interpretable Machine Learning by Christoph Molnar is described as A Guide for Making Black Box Models Explainable and covers topics such as feature importance and how to measure the influence of a feature on the predictions (e.g., Shapley, Partial Dependence).

  • The free textbook Introduction to Modern Statistics by Mine Çetinkaya-Rundel and Johanna Hardin is an accessible introduction to modern (i.e., resampling based) statistical inference. If you feel you are still missing the big picture of statistical inference, this is a good place to start.

  • The free textbook Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong is a good reference for the mathematical concepts helpful for machine learning. Chapters 1-7 provide a good foundation for this course.

  • The free textbook Forecasting: Principles and Practice 3e by Rob J Hyndman and George Athanasopoulos provides a great introduction to time series data and forecasting.


Course Assessment

  • The course grade will be based on ten homework assignments (65%), reading quizzes (10%), course participation (in class and on Teams) (5%), and a final exam (20%).

  • A: >91%, A-: 90-91%, B+: 88-89%, B: 82-87%, B-: 80-81%, etc.

    • A+: awarded rarely for exceptional work
  • There is no grade “curving” in this course.

    • There will be no make-up homework, exams, projects, or quizzes.
    • Note: There will be no “extra credit” assignments; spend your time on the assigned work.
  • All homework assignment dates are posted on the course homework page. Note these now so there are no conflicts.

  • All assignment submissions will be made through Canvas. You are given a grace period of 5 minutes for late submissions; the time stamps produced by Canvas will be the authoritative reference for all such decisions. If you have special circumstances (e.g., a documented physical condition) that prevent you from adhering to the posted deadlines, please inform me at least 1 week in advance of the deadline so that I can make arrangements to accommodate you.
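
The component weights listed above can be combined into a single course percentage. The following is only an illustrative sketch, not an official grade calculator: the names WEIGHTS and course_grade are hypothetical, each component score is assumed to be expressed as a fraction between 0 and 1, and the homework self-grading bonus points are left out because the syllabus does not say how they enter the total.

```python
# Component weights taken from the Course Assessment list.
WEIGHTS = {
    "homework": 0.65,
    "quizzes": 0.10,
    "participation": 0.05,
    "final_exam": 0.20,
}


def course_grade(scores: dict[str, float]) -> float:
    """Return the weighted course percentage (0-100).

    Assumption: each value in `scores` is the fraction of available
    credit earned for that component, between 0 and 1.
    """
    return 100 * sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student: 90% on homework, full quiz and participation
# credit, and 80% on the final exam.
example = {
    "homework": 0.90,
    "quizzes": 1.00,
    "participation": 1.00,
    "final_exam": 0.80,
}
print(round(course_grade(example), 1))  # 89.5
```

Under these assumptions, the hypothetical student lands at 89.5%, which the grade scale above would place in the B+ range.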

Homework

  • The 10 homeworks are each worth 50 pts (500 pts total).
    • Several homeworks (see homework page) will be treated like an exam; they are required and must be completed independently (with no help from classmates).
    • You can discuss and work with classmates on the other homework assignments, but what you submit must be in your own words (and code). See Honor Code for more details.
  • Homework will be submitted as Quarto source (which will contain the code) and the compiled HTML.
    • Quarto will produce the HTML and contain the code.
    • All code must be easy to follow (e.g., by good commenting).
    • Mathematical symbols follow LaTeX notation.
  • You will self-grade your homework assignments. The purpose of this is to allow you to actively compare your answers to the solutions as the course progresses (instead of reviewing occasionally or only at the end of the course). This will provide immediate guidance if your solutions are incorrect, teach you improved coding, and give you additional questions to ponder.
    • The TA will assign points; your only responsibility is to indicate what you did wrong or didn’t complete. This is also a place to ask questions if you aren’t sure if your solution is correct.
    • You will receive (+2) bonus points on each homework assignment that you accurately self-grade within 2 days of the posted solutions.

Quizzes

  • There will be around 24 pre-class reading quizzes (due before the start of class) each worth 1 point. Your quiz percentage will be min(Quiz Total, 20)/20.

    • This effectively allows you to drop the 4 lowest quiz scores.
  • The pre-class quizzes are to encourage you to prepare for the lectures.

  • Quizzes will be completed in Canvas/Quizzes.
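
The quiz formula above, min(Quiz Total, 20)/20, can be sketched as follows. This is only an illustration; the function name quiz_percentage and its cap parameter are hypothetical and not part of any course tooling.

```python
def quiz_percentage(quiz_total: float, cap: int = 20) -> float:
    """Return the quiz score as a fraction: min(quiz_total, cap) / cap.

    Points earned beyond `cap` do not raise the score further, which is
    what lets the lowest few quizzes drop out of the calculation.
    """
    return min(quiz_total, cap) / cap


# With around 24 one-point quizzes, a student who earns 18 points
# receives 18/20, while a student who earns 23 points is capped at 20/20.
print(quiz_percentage(18))  # 0.9
print(quiz_percentage(23))  # 1.0
```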

Course Participation

The course participation grade is meant to encourage robust discussion about the course material. I’ve found that students and the professor often gain valuable insights from open discussion. You can earn your participation score from in-class activity and/or by posting questions or responses on the course Teams page.

Final Exam

The final exam will be a comprehensive review of all course materials including lectures, readings, and homeworks.


Course Outline

  • Bias-Variance Trade-off
  • Penalized Regression
  • Nonparametric Methods
  • Classification and Probability Modeling
  • Support Vector Machines
  • Trees and Random Forest
  • Ensembles and Boosting
  • Resampling Methods
  • Feature Engineering and Importance
  • Predictive model evaluation
  • Non-parametric Density Estimation
  • Clustering

Course Management

  • Most course material will be available from the class webpage
  • All assignments (e.g., homeworks, quizzes, exams) will be submitted in Canvas
  • Announcements may be made in email or Canvas/Discussion
  • Course Discussion on Teams
    • We will be using Teams for class discussion. Rather than emailing questions to the teaching staff, I encourage you to post your questions here.
    • The teaching staff will always check discussions during our office hours and possibly at other times.
    • Please feel free to answer questions from other students, but use your discretion in not directly providing specific solutions to a homework problem (e.g., don’t give the code that directly answers a question).
    • Also, please post any discussion questions or material that you want input from the class and instructors.

Recording of classroom lectures

In the event that I or a large number of students cannot attend class in-person, I will record the lecture on Zoom. Because lectures may include fellow students, you and they may be personally identifiable on the recordings. These recordings may only be used for the purpose of individual or group study with other students enrolled in this class during this semester. You may not distribute them in whole or in part through any other platform or to any persons outside of this class, nor may you make your own recordings of this class unless written permission has been obtained from the Instructor and all participants in the class have been informed that recording will occur. If you want additional details on this, please see Provost Policy 005.


Academic Calendar

Important dates for the semester can be found on the academic calendar: http://www.virginia.edu/registrar/calendar.html


Policy on Academic Misconduct (Honor Code)

I trust every student in this course to fully comply with all provisions of the University’s Honor Code and work together to maintain UVA’s Community of Trust. By enrolling in this course, you have agreed to abide by and uphold the Honor System of the University of Virginia, as well as the following policies specific to this course.

  • All submitted work must be pledged.
  • All work must be completed individually unless specific permissions are given on the assignment.
    • Homework and in-class exercises can be discussed with classmates, but the final write-up, code, and solutions must be your own. List the names of who you worked with (like a citation).
    • The individual homework sets must be done completely on your own. You are not to discuss exams with anyone except the teaching staff.
    • You are not permitted to copy code. You will no doubt come across examples on the internet. You can consult them to help understand the concept or process, but write the code in your own words.
  • It is a scholarly responsibility to attribute all your work. This includes figures, code, ideas, etc. Think of it this way: Will someone who reads your submission think that it is your original idea, figure, code, etc.? Add a link and/or reference to all sources you used to solve a problem. It is really of no value to you when you just copy someone else’s solutions (other than to preserve a grade that you didn’t earn).
  • It is not always easy to tell what qualifies as a violation, so do not be afraid to talk to me about it. Such discussions do not imply guilt of any kind.
  • All suspected violations will be forwarded to the Honor Committee, and you may, at my discretion, receive an immediate zero on that assignment regardless of any action taken by the Honor Committee.

Please let me know if you have any questions regarding the course Honor policy. If you believe you may have committed an Honor Offense, you may wish to file a Conscientious Retraction by calling the Honor Offices at (434) 924-7602. For your retraction to be considered valid, it must, among other things, be filed with the Honor Committee before you are aware that the act in question has come under suspicion by anyone. More information can be found at http://honor.virginia.edu. Your Honor representatives can be found at: http://honor.virginia.edu/representatives.

Generative AI Policy

Generative AI (GenAI), like ChatGPT, is a new disruptive technology that has the potential to fundamentally change how we learn, code, and do data science. However, there is little guidance on when and how to use GenAI for learning. As such, I don’t feel very confident in recommending or restricting its use. Therefore, there are no Generative AI restrictions in this course. However, be sure to follow the honor policy as stated above. You cannot copy code and must attribute and detail if and how you used GenAI in the assignments.

GenAI tools can be an especially great resource for troubleshooting and improving code. However, they can also limit your ability to learn good coding if you become too dependent. I do not think GenAI is currently reliable enough to trust for conceptual understanding. I still recommend the assigned reading and references found in the course notes for additional learning resources. If GenAI hallucinates in producing code, you will be able to see right away that it does not produce the desired result. However, if it hallucinates about how a model works or perpetuates common misconceptions about methodology, you may not know about it for a long time.


Disability Statement

The University of Virginia strives to provide accessibility to all students. If you require an accommodation to fully access this course, please contact the Student Disability Access Center (SDAC) at (434) 243-5180. If you are unsure if you require an accommodation, or to learn more about their services, you may contact the SDAC at the number above or by visiting their website at http://studenthealth.virginia.edu/student-disability-access-center/faculty-staff.


Your Well Being

The University of Virginia and SEAS serve as a safe space for students and aim to promote your well-being. If you are feeling overwhelmed, stressed, or isolated, there are many individuals here who are ready and wanting to help. If you wish, you can make an appointment with me to discuss in private. Alternatively, the Student Health Center offers Counseling and Psychological Services (CAPS) https://www.studenthealth.virginia.edu/caps. If you prefer to speak anonymously and confidentially over the phone, call Madison House’s HELP Line 24/7 at 434-295-8255 https://www.madisonhouse.org/overview-helpline/.

If you or someone you know is struggling with gender, sexual, or domestic violence, there are many community and University of Virginia resources available. The Office of the Dean of Students, Sexual Assault Resource Agency (SARA), and UVA Women’s Center are ready and eager to help. Contact the Director of Sexual and Domestic Violence Services at 434-982-2774.


Discrimination and power-based violence

The University of Virginia is dedicated to providing a safe and equitable learning environment for all students. To that end, it is vital that you know two values that I and the University hold as critically important:

  1. Power-based personal violence will not be tolerated.
  2. Everyone has a responsibility to do their part to maintain a safe community on Grounds.

If you or someone you know has been affected by power-based personal violence, more information can be found on the UVA Sexual Violence website that describes reporting options and resources available <www.virginia.edu/sexualviolence>. As your professor and as a person, know that I care about you and your well-being and stand ready to provide support and resources as I can. As a faculty member, I am a responsible employee, which means that I am required by University policy and federal law to report what you tell me to the University’s Title IX Coordinator. The Title IX Coordinator’s job is to ensure that the reporting student receives the resources and support that they need, while also reviewing the information presented to determine whether further action is necessary to ensure survivor safety and the safety of the University community. If you wish to report something that you have seen, you can do so at the Just Report It portal. The worst possible situation would be for you or your friend to remain silent when there are so many here willing and able to help.


Religious Accommodations

Students who wish to request academic accommodation for a religious observance should submit their request to me by email as far in advance as possible. If you have questions or concerns about your request, you can contact the University’s Office for Equal Opportunity and Civil Rights (EOCR) https://eocr.virginia.edu/accommodations-religious-observance. Accommodations do not relieve you of the responsibility for completion of any part of the coursework you miss as the result of a religious observance.