Course Info

Class Time: Lectures posted Mon, Wed (Asynchronous)
Class Location: Online
Course Canvas site: https://canvas.its.virginia.edu/courses/92132
Course Teams site: SYS 6018 Teams (find join code on Canvas)
Instructor: Dr. Michael D. Porter
Email: mdp2u {at} virginia.edu
Office: TBD and Zoom
Office Hours: Mondays 1:45 - 2:45pm (and by appt.)
TA: Hossein Kaviani
Email: hk3sku {at} virginia.edu
Office Hours: TBD

Course Delivery

The course lectures will be delivered asynchronously. Recorded lectures will be posted in Canvas on Mondays and Wednesdays.


Course Prerequisites

Students taking this course should have prior knowledge of linear regression analysis (e.g., SYS/STAT 4021/6021, STAT 5120), statistical inference (e.g., APMA 3120), and linear algebra (e.g., APMA 3080). Students should also have a basic working knowledge of a scientific programming language (e.g., R, Python, Matlab). The R-based SYS-2202 provides sufficient background. All course examples will be in R (tidyverse dialect).


Course Description

Fundamentals of data mining and machine learning within a common statistical framework. Topics include regression, classification, clustering, resampling, regularization, tree-based methods, ensembles, boosting, and Support Vector Machines. Coursework is conducted in the R programming language.


Student Learning Objectives

Students will learn how and when to use common data mining and statistical learning methods, understand their comparative strengths and weaknesses, and learn how to critically evaluate their performance. Students completing this course should be able to: (i) construct and apply novel statistical learning methods for predictive modeling, (ii) use unsupervised learning methods to find structure in data, and (iii) properly select, tune, and evaluate models.


Required Textbooks

  1. An Introduction to Statistical Learning (2nd) by James, Witten, Hastie and Tibshirani.
    • An electronic version of this book is freely available at https://statlearning.com/. This book provides a less technical description of common statistical learning methods.
  2. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Edition) by Hastie, Tibshirani, and Friedman.
  3. Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
    • An electronic version of this book is freely available at http://www.mmds.org/. We will only cover some parts of this text.
  4. Introduction to Data Mining (Second Edition) by Tan, Steinbach, Karpatne, and Kumar.

Other Course Materials

  • This course requires the use of the following statistical and typesetting software:

    • R (http://cran.us.r-project.org) is a free programming language for statistical computing, graphics, and machine learning. I am using R 4.3.2. It is recommended that you update to this version or newer.
    • RStudio is a free IDE for R (https://posit.co/downloads/). I am using RStudio 2023.12.0+369. It is recommended that you update to this version or newer.
    • Quarto (https://quarto.org/docs/get-started/) is a free technical publishing system that replaces RMarkdown. We will use Quarto documents for homework. Version 1.3.450 or higher is required.

  • Other course material and reading assignments will come from instructor notes and recent journal articles.

  • The free textbook Modern Data Science with R by Baumer, Kaplan, and Horton is an undergraduate-level “Intro to Data Science” text. It covers the tidyverse, statistical inference, and a basic introduction to many of the methods we will study this semester. It provides good overall preparation and a handy reference.

  • The free textbook Feature Engineering and Selection: A Practical Approach for Predictive Models by Kuhn and Johnson provides more in-depth coverage of feature engineering than we will be able to manage in this course.

  • The free textbook Hands-on Machine Learning with R by Boehmke and Greenwell gives R code with some helpful details for most of the methods we will cover. This can be a handy reference.

  • The free textbook Interpretable Machine Learning by Christoph Molnar is described as “A Guide for Making Black Box Models Explainable” and covers topics such as feature importance and how to measure the influence of a feature on the predictions (e.g., Shapley, Partial Dependence).

  • The free textbook Introduction to Modern Statistics by Mine Çetinkaya-Rundel and Johanna Hardin is an accessible introduction to modern (i.e., resampling based) statistical inference. If you feel you are still missing the big picture of statistical inference, this is a good place to start.

  • The free textbook Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong is a good reference for the mathematical concepts helpful for machine learning. Chapters 1-7 provide a good foundation for this course.

  • The free textbook Forecasting: Principles and Practice 3e by Rob J Hyndman and George Athanasopoulos provides a great introduction to time series data and forecasting.


Course Assessment

  • The course grade will be based on ten homework assignments (70%), reading quizzes (10%), and a final project (20%).

  • A: >91%, A-: 90-91%, B+: 88-89%, B: 82-87%, B-: 80-81%, etc.

    • A+: awarded rarely for exceptional work
  • There is no grade “curving” in this course.

    • There will be no make-up homework, exams, projects, or quizzes.
    • Note: There will be no “extra credit” assignments; spend your time on the assigned work.
  • All homework assignment dates are posted in the Class Schedule (on the course website). Note these now so there are no conflicts.

  • All assignment submissions will be made through Canvas. You are given a grace period of 5 minutes for late submissions; the time stamps produced by Canvas will be the authoritative reference for all such decisions. If you have special circumstances (e.g., a documented physical condition) that prevent you from adhering to the posted deadlines, please inform me at least 1 week in advance of the deadline so that I can make arrangements to accommodate you.
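
As a rough sketch of how the stated weights (70% homework, 10% quizzes, 20% final project) combine, the Python snippet below computes an overall course percentage from the three component percentages. The function name and the example inputs are hypothetical, and this only illustrates the arithmetic; it is not the official grade calculation.

```python
def course_percentage(homework_pct, quiz_pct, project_pct):
    """Weighted course percentage: 70% homework, 10% quizzes, 20% final project.

    All inputs are percentages on a 0-100 scale (hypothetical interface).
    """
    return 0.70 * homework_pct + 0.10 * quiz_pct + 0.20 * project_pct


# Hypothetical example: 95% on homework, 100% on quizzes, 85% on the project.
example = course_percentage(95, 100, 85)
print(round(example, 2))
```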

Homework

  • The 10 homeworks are each worth 50 pts (500 pts total). Your homework percentage will be min(HW total, 475)/475, allowing you to effectively drop low-scoring submissions. Another way to view this policy is that receiving a 95% is full credit.
    • Homeworks 4 and 8 are treated like exams; they are required and must be completed independently (with no help from classmates).
    • You can discuss and work with classmates on the other homework assignments, but what you submit must be in your own words (and code). See Honor Code for more details.
  • Homework will be submitted as Quarto source (which will contain the code) and the compiled html.
    • Quarto will produce the html and contain the code.
    • All code must be easy to follow (e.g., by good commenting)
    • Mathematical symbols follow LaTeX notation.
  • You will self-assess your homework assignments. The purpose of this is to allow you to actively compare your answers to the solutions as the course progresses (instead of reviewing occasionally or only at end of the course). This will provide immediate guidance if your solutions are incorrect, show you improved coding, and give you additional questions to ponder.
    • The TA will assign points; your only responsibility is to indicate what you did wrong or didn’t complete.
    • The homework self-assessment is due three days after the homework due date.
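
The homework rule min(HW total, 475)/475 can be sketched in a few lines of Python. The function name and the example scores are hypothetical; the snippet only illustrates how the 475-point cap turns up to 25 lost points into full credit.

```python
def homework_percentage(scores, cap=475):
    """Homework percentage under the min(HW total, 475)/475 rule.

    `scores` is a list of up to ten homework scores, each out of 50 points.
    """
    total = sum(scores)
    return 100 * min(total, cap) / cap


# Hypothetical example: nine perfect homeworks and one half-credit homework.
print(homework_percentage([50] * 9 + [25]))
# Hypothetical example: 40 out of 50 on every homework (400 points in total).
print(round(homework_percentage([40] * 10), 1))
```

In the first example the total is exactly 475 points, so the capped rule gives full homework credit even though one assignment earned only half its points.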

Quizzes

  • There will be around 24 pre-class reading quizzes (due before the start of class) each worth 1 point. Your quiz percentage will be min(Quiz Total, 20)/20.

    • This effectively allows you to drop the 4 lowest quiz scores.
  • The pre-class quizzes are to encourage you to prepare for the lectures.

  • Quizzes will be completed in Canvas/Quizzes.
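
The quiz rule min(Quiz Total, 20)/20 works the same way as the homework cap. The Python sketch below uses a hypothetical function name and made-up scores, and assumes about 24 one-point quizzes as stated above.

```python
def quiz_percentage(quiz_scores, cap=20):
    """Quiz percentage under the min(Quiz Total, 20)/20 rule.

    `quiz_scores` is a list of per-quiz points (each quiz is worth 1 point).
    """
    total = sum(quiz_scores)
    return 100 * min(total, cap) / cap


# Hypothetical example: 21 of 24 quizzes earned the point.
print(quiz_percentage([1] * 21 + [0] * 3))
# Hypothetical example: 18 of 24 quizzes earned the point.
print(quiz_percentage([1] * 18 + [0] * 6))
```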

Final Project

  • The objective of the final project is to gain experience implementing a data mining / statistical learning pipeline. You will apply the concepts and methodologies we covered in class to extract actionable insights and knowledge from data.

  • You need to 1) find a problem or task you’d like to solve or understand; 2) find data; 3) use methods covered in this class, or related methods, to help you solve the problem.

  • Deliverables:

    1. Project Proposal: All projects will need to be vetted by me to ensure you are proposing a topic that has a good chance of success. Send me or speak to me about your problem and anticipated methods.
    2. Model Development: Development and implementation of predictive or unsupervised models using appropriate data mining algorithms (e.g., classification, regression, clustering).
    3. Model Evaluation: Evaluation of model performance using relevant metrics and comparison with baseline models.
    4. Final Report: A detailed report documenting the entire project workflow, including data preprocessing, model development, evaluation results, and insights derived from the analysis.
    5. Project Presentation: A professional presentation summarizing the project objectives, methodology, findings, and recommendations for future work.
  • Students will work in teams of 2.

  • The deliverables will be in two parts: a recorded presentation and a written component.

  • Written component considerations

    • There is flexibility on the format of the written component. A single document (e.g., pdf, html), a series of related blog posts, a webpage, or even a Shiny app are all acceptable. The source document or code is also required for all projects.
    • Reproducibility is good data science practice and certainly encouraged, but is not a hard requirement.
    • The written component should look professional (e.g., math font should look good).
  • Presentation component considerations

    • Clarity: Easy to follow presentation. Easy to pick up main points.
    • Engagement/Interesting: Keep things interesting - there are a lot of talks to watch! Slides and verbal communication that engages the audience.
    • Time Management: You only have a few minutes! Focus on conveying the key information that will benefit classmates. You won’t have time to tell us everything you did. We can use the course Teams page to follow up if someone wants to know more.
    • No need to dress up

Course Outline

  • Bias-Variance Trade-off
  • Penalized Regression
  • Nonparametric Methods
  • Classification and Probability Modeling
  • Support Vector Machines
  • Trees and Random Forest
  • Ensembles and Boosting
  • Resampling Methods
  • Feature Engineering and Importance
  • Predictive model evaluation
  • Non-parametric Density Estimation
  • Clustering

Course Management

  • Most course material will be available from the class webpage
  • All assignments (e.g., homeworks, quizzes, exams) will be submitted in Canvas
  • Announcements may be made by email or on Teams
  • Course Discussion on Teams
    • We will be using Teams for class discussion. Rather than emailing questions to the teaching staff, I encourage you to post your questions here.
    • The teaching staff will always check discussions during our office hours and possibly at other times.
    • Please feel free to answer questions from other students, but use your discretion in not directly providing specific solutions to a homework problem (e.g., don’t give the code that directly answers a question).
    • Also, please post any discussion questions or material on which you want input from the class and instructors.

Recording of classroom lectures

I will be recording every lecture to accommodate students who will be learning remotely. Because lectures may include fellow students, you and they may be personally identifiable on the recordings. These recordings may only be used for the purpose of individual or group study with other students enrolled in this class during this semester. You may not distribute them in whole or in part through any other platform or to any persons outside of this class, nor may you make your own recordings of this class unless written permission has been obtained from the Instructor and all participants in the class have been informed that recording will occur. If you want additional details on this, please see Provost Policy 005.


Academic Calendar

Important dates for the semester can be found on the academic calendar: http://www.virginia.edu/registrar/calendar.html


Policy on Academic Misconduct (Honor Code)

I trust every student in this course to fully comply with all provisions of the University’s Honor Code and work together to maintain UVA’s Community of Trust. By enrolling in this course, you have agreed to abide by and uphold the Honor System of the University of Virginia, as well as the following policies specific to this course.

  • All submitted work must be pledged.
  • All work must be completed individually unless specific permissions are given on the assignment.
    • Homework and in-class exercises can be discussed with classmates, but the final write-up, code, and solutions must be your own. List the names of who you worked with (like a citation).
    • The individual homework sets must be done completely on your own. You are not to discuss exams with anyone except the teaching staff.
    • You are not permitted to copy code. You will no doubt come across examples on the internet. You can consult them to help understand the concept or process, but code in your own words.
  • It is a scholarly responsibility to attribute all your work. This includes figures, code, ideas, etc. Think of it this way: Will someone who reads your submission think that it is your original idea, figure, code, etc.? Add a link and/or reference to all sources you used to solve a problem. It is really of no value to you when you just copy someone else’s solutions (other than to preserve a grade that you didn’t earn).
  • It is not always easy to tell what qualifies as a violation, so do not be afraid to talk to me about it. Such discussions do not imply guilt of any kind.
  • All suspected violations will be forwarded to the Honor Committee, and you may, at my discretion, receive an immediate zero on that assignment regardless of any action taken by the Honor Committee.

Please let me know if you have any questions regarding the course Honor policy. If you believe you may have committed an Honor Offense, you may wish to file a Conscientious Retraction by calling the Honor Offices at (434) 924-7602. For your retraction to be considered valid, it must, among other things, be filed with the Honor Committee before you are aware that the act in question has come under suspicion by anyone. More information can be found at http://honor.virginia.edu. Your Honor representatives can be found at: http://honor.virginia.edu/representatives.


Generative AI Policy

Generative AI (GenAI), like ChatGPT, is a new disruptive technology that has the potential to fundamentally change how we learn, code, and do data science. However, there is little guidance on when and how to use GenAI for learning. As such, I don’t feel very confident in recommending or restricting its use. Therefore, there are no Generative AI restrictions in this course. However, be sure to follow the honor policy as stated above. You cannot copy code and must attribute and detail if and how you used GenAI in the assignments.

GenAI tools can be an especially great resource for troubleshooting and improving code. However, they can also limit your ability to learn good coding if you become too dependent. I do not think GenAI is currently reliable enough to trust for conceptual understanding. I still recommend the assigned reading and references found in the course notes for additional learning resources. If GenAI hallucinates in producing code, you will be able to see right away that it does not produce the desired result. However, if it hallucinates about how a model works or perpetuates common misconceptions on methodology, you may not know about it for a long time.


Disability Statement

The University of Virginia strives to provide accessibility to all students. If you require an accommodation to fully access this course, please contact the Student Disability Access Center (SDAC) at (434) 243-5180. If you are unsure if you require an accommodation, or to learn more about their services, you may contact the SDAC at the number above or by visiting their website at http://studenthealth.virginia.edu/student-disability-access-center/faculty-staff.


Your Well Being

The University of Virginia and SEAS serve as a safe space for students and aim to promote your well-being. If you are feeling overwhelmed, stressed, or isolated, there are many individuals here who are ready and wanting to help. If you wish, you can make an appointment with me to discuss in private. Alternatively, the Student Health Center offers Counseling and Psychological Services (CAPS) https://www.studenthealth.virginia.edu/caps. If you prefer to speak anonymously and confidentially over the phone, call Madison House’s HELP Line 24/7 at 434-295-8255 https://www.madisonhouse.org/overview-helpline/. Engineering undergraduates are supported through an array of student support services including peer-to-peer tutoring, professional academic coaching, access to mental health support, and dedicated advising. Graduate Engineering students can find similar student support resources. If you are in another school, you can contact the above Engineering resources and they will help direct you to the appropriate resources.

If you or someone you know is struggling with gender, sexual, or domestic violence, there are many community and University of Virginia resources available. The Office of the Dean of Students, Sexual Assault Resource Agency (SARA), and UVA Women’s Center are ready and eager to help. Contact the Director of Sexual and Domestic Violence Services at 434-982-2774.


Discrimination and power-based violence

The University of Virginia is dedicated to providing a safe and equitable learning environment for all students. To that end, it is vital that you know two values that I and the University hold as critically important:

  1. Power-based personal violence will not be tolerated.
  2. Everyone has a responsibility to do their part to maintain a safe community on Grounds.

If you or someone you know has been affected by power-based personal violence, more information can be found on the UVA Sexual Violence website (www.virginia.edu/sexualviolence), which describes reporting options and available resources. As your professor and as a person, know that I care about you and your well-being and stand ready to provide support and resources as I can. As a faculty member, I am a responsible employee, which means that I am required by University policy and federal law to report what you tell me to the University’s Title IX Coordinator. The Title IX Coordinator’s job is to ensure that the reporting student receives the resources and support that they need, while also reviewing the information presented to determine whether further action is necessary to ensure survivor safety and the safety of the University community. If you wish to report something that you have seen, you can do so at the Just Report It portal. The worst possible situation would be for you or your friend to remain silent when there are so many here willing and able to help.


Religious Accommodations

Students who wish to request academic accommodation for a religious observance should submit their request to me by email as far in advance as possible. If you have questions or concerns about your request, you can contact the University’s Office for Equal Opportunity and Civil Rights (EOCR) https://eocr.virginia.edu/accommodations-religious-observance. Accommodations do not relieve you of the responsibility for completion of any part of the coursework you miss as the result of a religious observance.