BMTRY 790: Introduction to Machine Learning and Data Mining

Spring 2023

 

Instructor: Bethany Wolf

Office: 302B, 135 Cannon Place                                     

Phone: (843)876-1940

Email: wolfb@musc.edu

Lectures: Tuesdays and Thursdays, 9:00-10:30

Office Hours: By appointment

Course website: http://people.musc.edu/~wolfb/BMTRY790_Spring2023/BMTRY790_2023.htm

 

Course Description

Machine learning is an interdisciplinary field at the intersection of statistics and computer science that develops statistical models and interweaves them with computer algorithms. This course provides an introduction to the theory with a basis in real-world application, focusing on the statistical and computational aspects of data analysis. It is designed to serve as an introduction to the fundamental concepts, techniques, and algorithms of machine learning. The course will cover the following topics: data representation, feature extraction, dimension reduction, supervised and unsupervised classification, support vector machines, latent variable models and clustering, and model selection. Throughout the course, probabilistic models will serve as a common thread integrating different statistical learning and inference techniques, including maximum likelihood estimation (MLE), Bayesian parameter estimation, information-theory-based learning, the EM algorithm, and variational methods.

 

The course also aims to develop students' computational skills. Previous experience with R (or another mathematical/statistical computing language) is required. Homework will include reading and commenting on existing code as well as writing your own code. The final project will require programming. The mid-term exam will consist of deriving statistical learning algorithms, writing pseudo-code to demonstrate an understanding of the logical sequence needed to program statistical algorithms, and comparing different statistical learning algorithms.

 

Prerequisites

BMTRY 706: Theoretical Foundations of Statistics I

BMTRY 701/702: Biostatistical Methods I and II

Knowledge of R or permission from the instructor

Required text

Hastie, Tibshirani, and Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, ISBN: 978-0-387-84857-0. Available online at https://statweb.stanford.edu/~tibs/ElemStatLearn/

 

Other suggested readings

Christopher Bishop (2006). Pattern Recognition and Machine Learning. Springer. ISBN: 0-387-31073-8.

 

Class Participation

All students are expected to attend class regularly. As a courtesy to both your instructor and your fellow classmates, please make every effort to be prompt, as class will begin at the scheduled time. The course covers a tremendous amount of material, and all class sessions will be used for lecture or lab activities. Some absences, however, are unavoidable. Special circumstances (e.g., a birth or death in the family, family illness) will be dealt with on an individual basis.

 

Software

We will mostly use R for this course. For learning algorithms that are also available in SAS, we will examine SAS as well. For class, download the latest version of R from the CRAN website (https://cran.r-project.org/). We may also use R packages available through the Bioconductor website (https://www.bioconductor.org/).
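To help with setup, the lines below sketch the usual installation pattern for CRAN and Bioconductor packages. The package names shown (glmnet, limma) are only illustrative placeholders, not a list of required packages for the course.

# Install a package from CRAN (glmnet is only an example package name)
install.packages("glmnet")

# Bioconductor packages are installed via the BiocManager helper, which itself comes from CRAN
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("limma")   # limma is only an example Bioconductor package

# Load an installed package in an R session
library(glmnet)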

 

Grading

Homework Assignments: 60%

Mid-term Exam: 20%

Final Project: 20%

 

Homework: There will be 4 to 5 homework assignments worth a total of 60% of your grade. Assignments will consist of theoretical derivations, application of methods using available software, and writing your own code. Homework should be submitted electronically. However, if you prefer to hand-write your homework and submit a scanned electronic copy, please be sure your handwriting is legible; illegible work will not be graded. All assignments are due one week after they are assigned unless otherwise stated. Homework turned in one day late will receive a 25% reduction, homework turned in two days late will receive a 50% reduction, and homework more than two days late will not be accepted.

 

Exam: There will be one mid-term exam that constitutes 20% of your grade. The mid-term exam will consist of deriving statistical learning algorithms, writing pseudo-code to demonstrate an understanding of the logical sequence needed to program statistical algorithms, and comparing different statistical learning algorithms.

 

Final Project: The final project is designed to simulate a real problem you may face once you have graduated: how to choose among different modeling techniques. The goal of the project is to use the methods learned in class to develop a prediction model for real-world data. The analysis will need to compare the different modeling techniques discussed in class and select the best classification model using appropriate methods for model comparison (a small sketch of this idea follows below). The write-up should include a description of the data, the question being answered, a statistical methods section (as if for a manuscript) detailing how the model was selected, a results section, and a conclusion/discussion of the resulting model.
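As a rough illustration of comparing candidate classifiers, the sketch below estimates cross-validated misclassification error for two methods. The data frame dat, its binary factor outcome y, and the two methods shown are hypothetical placeholders, not requirements for the project.

library(rpart)   # classification trees; a recommended package distributed with R

set.seed(790)
K     <- 5
folds <- sample(rep(1:K, length.out = nrow(dat)))   # assign each row of `dat` to a fold
err   <- matrix(NA, K, 2, dimnames = list(NULL, c("logistic", "tree")))

for (k in 1:K) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]

  # Logistic regression: predicted probability of the second level of the factor y
  fit.lr <- glm(y ~ ., data = train, family = binomial)
  p.lr   <- predict(fit.lr, newdata = test, type = "response")
  cls.lr <- ifelse(p.lr > 0.5, levels(dat$y)[2], levels(dat$y)[1])
  err[k, "logistic"] <- mean(cls.lr != test$y)

  # Classification tree
  fit.tr <- rpart(y ~ ., data = train, method = "class")
  cls.tr <- predict(fit.tr, newdata = test, type = "class")
  err[k, "tree"] <- mean(cls.tr != test$y)
}

colMeans(err)   # average misclassification error for each method across folds

For the actual project you would compare the full set of methods covered in class and may prefer a different error metric; the point here is only the cross-validated comparison pattern.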

 

Suggested Reading

glmmLasso

 

ThrEEBoost

 

Lecture Notes, R Code, Homework, and Data


Lectures | Scripts | Homework | Datasets
Lec 1: Overview | | |
Lec 2: Linear Regression; notes2 | Lec 2 R Code | | Body Fat
Lec 3: Penalized Regression I | Lec 3 R Code | |
Lec 4: Penalized Regression II | Lec 4 R Code | HW 1 |
Lec 5: Penalized Regression III | Lec 5 R Code | |
Lec 6: Regression for Collinearity; notes 6 | Lec 6 R Code | HW 2 | Prostate
Lec 7: Linear Classifiers I; notes 7 | | |
Lec 8: Linear Classifiers II | Lec 8 R Code | | Breast Tissue
Lec 9: Linear Classifiers III | Lec 9 R Code | |
Lec 10: Intro to Splines | | HW 3 | LinSep; Root Resorption
Lec 11: Splines II | Lec 11 R Code | | Thermoregulation
Lec 12: Model Assessment I | | |
Lec 13: Model Assessment II | Lec 13 R Code | |
Lec 14: Generalized Additive Models | Lec 14 R Code | | Ozone; Lupus Nephritis
Lec 15: Decision Trees | Lec 15 R Code | HW 4 | Breast Cancer 2; Hepatitis; SLE Immune Response
Lec 15a: Bayesian Trees (guest lecture) | | |
Lec 16: Multivariate Adaptive Regression Splines (MARS) | Lec 16 R Code | |
Lec 17: Ensemble Models | Lec 17 R Code | Final Project | Project data
Lec 18: Support Vector Machines | Lec 18 R Code | |
Lec 19: Artificial Neural Networks (ANNs) | | |
Lec 20: ANNs Part 2 | Lec 20 R Code | |