BMTRY 790: Introduction to Machine Learning and Data Mining

Spring 2023

 

Instructor: Bethany Wolf

Office: 302B, 135 Cannon Place                                     

Phone: (843)876-1940

Email: wolfb@musc.edu

Lectures: Tuesdays and Thursdays, 9:00-10:30

Office Hours: By appointment

Course website: http://people.musc.edu/~wolfb/BMTRY790_Spring2023/BMTRY790_2023.htm

 

Course Description

Machine learning is an interdisciplinary field at the intersection of statistics and computer science that develops statistical models and interweaves them with computer algorithms. This course provides an introduction to the theory with a basis in real-world application, focusing on the statistical and computational aspects of data analysis. It is designed to serve as an introduction to the fundamental concepts, techniques, and algorithms of machine learning. The course will cover the following topics: data representation, feature extraction, dimension reduction, supervised and unsupervised classification, support vector machines, latent variable models and clustering, and model selection. Throughout the course, probabilistic models will serve as a common thread integrating different statistical learning and inference techniques, including maximum likelihood estimation (MLE), Bayesian parameter estimation, information-theory-based learning, the EM algorithm, and variational methods.

 

The course also aims to develop students' computational skills. Previous experience with R (or another mathematical/statistical computing language) is required. Homework will include reading and commenting on existing code as well as writing your own code. The final project will require programming. The mid-term exam will consist of deriving statistical learning algorithms, writing pseudo-code to demonstrate an understanding of the logical sequence needed to program statistical algorithms, and comparing different statistical learning algorithms.

 

Prerequisites

BMTRY 706: Theoretical Foundations of Statistics I

BMTRY 701/702: Biostatistical Methods I and II

Knowledge of R or permission from the instructor

Required text

Hastie, Tibshirani, and Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, ISBN: 978-0-387-84857-0. Available online at https://statweb.stanford.edu/~tibs/ElemStatLearn/

 

Other suggested readings

Christopher Bishop (2006). Pattern Recognition and Machine Learning. Springer. ISBN: 0-387-31073-8.

 

Class Participation

All students are expected to attend class regularly. As a courtesy to both your instructor and your fellow classmates, please make every effort to be prompt, as class will begin at the scheduled time. The course covers a tremendous amount of material, and all class sessions will be used for lecture or lab activities. Some absences, however, are unavoidable. Special circumstances (e.g., a birth or death in the family, family illness) will be dealt with on an individual basis.

 

Software

We will mostly use R for this course. For learning algorithms that are also available in SAS, we will examine SAS as well. For class, download the latest version of R from the CRAN website (https://cran.r-project.org/). We may also use R packages available through the Bioconductor website (https://www.bioconductor.org/).
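To help with setup, the lines below sketch the usual installation pattern for CRAN and Bioconductor packages. The package names shown (glmnet, limma) are only illustrative placeholders, not a list of required packages for the course.

# Install a package from CRAN (glmnet is only an example package name)
install.packages("glmnet")

# Bioconductor packages are installed via the BiocManager helper, which itself comes from CRAN
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("limma")   # limma is only an example Bioconductor package

# Load an installed package in an R session
library(glmnet)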

 

Grading

Homework Assignments: 60%

Mid-term Exam: 20%

Final Project: 20%

 

Homework: There will be 4 to 5 homework assignments worth a total of 60% of your grade. Assignments will consist of theoretical derivations, application of methods using available software, and writing your own code. Homework should be submitted electronically. However, if you prefer to hand-write your homework and submit a scanned electronic copy, please be sure your handwriting is legible; illegible work will not be graded. All assignments are due one week after they are assigned unless otherwise stated. Homework turned in one day late will receive a 25% reduction, homework turned in two days late will receive a 50% reduction, and homework more than two days late will not be accepted.

 

Exam: There will be one mid-term exam that constitutes 20% of your grade. The mid-term exam will consist of deriving statistical learning algorithms, writing pseudo-code to demonstrate an understanding of the logical sequence needed to program statistical algorithms, and comparing different statistical learning algorithms.

 

Final Project: The final project is designed to simulate a real problem you may face once you have graduated: how to choose among different modeling techniques. The goal of the project is to use the methods learned in class to develop a prediction model for real-world data. The analysis will need to compare the different modeling techniques discussed in class and select the best classification model using appropriate methods for model comparison (a small sketch of this idea follows below). The write-up should include a description of the data, the question being answered, a statistical methods section (as if for a manuscript) detailing how the model was selected, a results section, and a conclusion/discussion of the resulting model.
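As a rough illustration of comparing candidate classifiers, the sketch below estimates cross-validated misclassification error for two methods. The data frame dat, its binary factor outcome y, and the two methods shown are hypothetical placeholders, not requirements for the project.

library(rpart)   # classification trees; a recommended package distributed with R

set.seed(790)
K     <- 5
folds <- sample(rep(1:K, length.out = nrow(dat)))   # assign each row of `dat` to a fold
err   <- matrix(NA, K, 2, dimnames = list(NULL, c("logistic", "tree")))

for (k in 1:K) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]

  # Logistic regression: predicted probability of the second level of the factor y
  fit.lr <- glm(y ~ ., data = train, family = binomial)
  p.lr   <- predict(fit.lr, newdata = test, type = "response")
  cls.lr <- ifelse(p.lr > 0.5, levels(dat$y)[2], levels(dat$y)[1])
  err[k, "logistic"] <- mean(cls.lr != test$y)

  # Classification tree
  fit.tr <- rpart(y ~ ., data = train, method = "class")
  cls.tr <- predict(fit.tr, newdata = test, type = "class")
  err[k, "tree"] <- mean(cls.tr != test$y)
}

colMeans(err)   # average misclassification error for each method across folds

For the actual project you would compare the full set of methods covered in class and may prefer a different error metric; the point here is only the cross-validated comparison pattern.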

 

Suggested Reading

glmmLasso

 

ThrEEBoost

 

Lecture Notes, R Code, Homework, and Data


Lectures | Scripts | Homework | Datasets
Lec 1: Overview | | |
Lec 2: Linear Regression; notes2 | Lec 2 R Code | | Body Fat
Lec 3: Penalized Regression I | Lec 3 R Code | |
Lec 4: Penalized Regression II | Lec 4 R Code | HW 1 |
Lec 5: Penalized Regression III | Lec 5 R Code | |
Lec 6: Regression for Collinearity; notes 6 | Lec 6 R Code | HW 2 | Prostate
Lec 7: Linear Classifiers I; notes 7 | | |
Lec 8: Linear Classifiers II | Lec 8 R Code | | Breast Tissue
Lec 9: Linear Classifiers III | Lec 9 R Code | |
Lec 10: Intro to Splines | | HW 3 | LinSep; Root Resorption
Lec 11: Splines II | Lec 11 R Code | | Thermoregulation
Lec 12: Model Assessment I | | |
Lec 13: Model Assessment II | Lec 13 R Code | |
Lec 14: Generalized Additive Models | Lec 14 R Code | | Ozone; Lupus Nephritis
Lec 15: Decision Trees | Lec 15 R Code | HW 4 | Breast Cancer 2; Hepatitis; SLE Immune Response
Lec 15a: Bayesian Trees (guest lecture) | | |
Lec 16: Multivariate Adaptive Regression Splines (MARS) | Lec 16 R Code | |
Lec 17: Ensemble Models | Lec 17 R Code | Final Project | Project data
Lec 18: Support Vector Machines | Lec 18 R Code | |
Lec 19: Artificial Neural Networks (ANNs) | | |
Lec 20: ANNs Part 2 | Lec 20 R Code | |