BMTRY 790: Introduction to Machine Learning and Data Mining
Spring 2023

Instructor: Bethany Wolf
Office: 302B, 135 Cannon Place
Phone: (843) 876-1940
Email: wolfb@musc.edu
Lectures: Tuesdays and Thursdays, 9:00-10:30
Office Hours: By appointment
Course website: http://people.musc.edu/~wolfb/BMTRY790_Spring2023/BMTRY790_2023.htm
Course Description
Machine learning is an interdisciplinary field at the intersection of statistics and computer science that develops statistical models and interweaves them with computer algorithms. This course provides an introduction to the theory with a basis in real-world application, focusing on statistical and computational aspects of data analysis. It is designed to serve as an introduction to the fundamental concepts, techniques, and algorithms of machine learning. The course will cover the following topics: data representation, feature extraction, dimension reduction, supervised and unsupervised classification, support vector machines, latent variable models and clustering, and model selection. Throughout the course, a main thread of probabilistic models will be used to integrate different statistical learning and inference techniques, including maximum likelihood estimation (MLE), Bayesian parameter estimation, information-theory-based learning, the EM algorithm, and variational methods.
The course also aims to develop students' computational skills. Previous experience with R (or other mathematical/statistical computing languages) is required. The homework will include reading and commenting on existing code as well as writing your own code. The final project will require programming. The mid-term exam will consist of deriving statistical learning algorithms, writing pseudocode to demonstrate understanding of the logical sequence for programming statistical algorithms, and comparing different statistical learning algorithms.
Prerequisites
BMTRY 706: Theoretical Foundations of Statistics I
BMTRY 701/702: Biostatistical Methods I and II
Knowledge of R or permission from the instructor
Required Textbook
Hastie, Tibshirani and Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, ISBN: 978-0-387-84857-0. Available online at https://statweb.stanford.edu/~tibs/ElemStatLearn/
Other suggested readings
Christopher Bishop (2006). Pattern Recognition and Machine Learning. Springer. ISBN: 0-387-31073-8.
Class Participation
All students are expected to attend class regularly. As a courtesy to both your instructor and your fellow classmates, please make every effort to be prompt, as class will begin at the scheduled time. The course covers a tremendous amount of material, and all classes will be used for lecture or lab activities. Some absences, however, are unavoidable. Special circumstances (e.g., a birth or death in the family, family illness) will be dealt with on an individual basis.
Software
We will mostly use R for the course. For learning algorithms available in SAS, we will also examine SAS. For class, download the latest version of R from the CRAN website (https://cran.r-project.org/). We may also use R packages available on the Bioconductor website (https://www.bioconductor.org/).
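As a quick setup sketch (the package names below are illustrative examples, not course requirements), installing packages from CRAN and from Bioconductor works as follows:

```r
# Install a package from CRAN (example package; any CRAN package installs the same way)
install.packages("glmnet")

# Bioconductor packages are installed via the BiocManager helper, which itself comes from CRAN
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("limma")  # example Bioconductor package

# Load an installed package for use in a session
library(glmnet)
```

Running `install.packages()` once per machine is enough; `library()` must be called in each new R session.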
Grading
Homework Assignments: 60%
Mid-term Exam: 20%
Final Project: 20%
Homework: There will be 4 to 5 homework
assignments worth a total of 60% of your grade. Assignments will consist of
some theoretical derivations, application of methods using available software,
as well as writing your own code. Homework should be submitted electronically. However, if you wish to hand-write your homework and then
send a scanned electronic copy, please be
sure your handwriting is legible. Work that is illegible will not be
graded. All assignments are due one week after they are assigned unless
otherwise stated. Homework turned in one day late will receive a 25% reduction,
two days late will receive a 50% reduction, and homework more than two days
late will not be accepted.
Exam: There will also be one mid-term exam that constitutes 20% of your grade. The mid-term exam will consist of deriving statistical learning algorithms, writing pseudocode to demonstrate understanding of the logical sequence for programming statistical algorithms, and comparing different statistical learning algorithms.
Final Project: The final project is designed to simulate a real problem you may face once you have graduated: how to choose from among different modeling techniques. The goal of the project will be to use the methods learned during the class to develop a prediction model for real-world data. The analysis will need to compare the different modeling techniques discussed in class and select the best classification model using appropriate model-comparison criteria. The write-up should include a description of the data, the question being answered, a statistical methods section (as if for a manuscript) that details how a model was selected, a results section, and a conclusion/discussion of the resulting model.
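A minimal sketch of the kind of model comparison the project asks for, assuming a binary classification problem (here two iris species stand in for real-world data, and the fold count and candidate models are purely illustrative):

```r
# Compare two candidate classifiers by 5-fold cross-validated accuracy
library(MASS)  # for lda()

set.seed(1)
d <- droplevels(subset(iris, Species != "setosa"))  # toy binary outcome
k <- 5
folds <- sample(rep(1:k, length.out = nrow(d)))     # random fold assignment
acc <- matrix(NA, k, 2, dimnames = list(NULL, c("logistic", "lda")))

for (i in 1:k) {
  train <- d[folds != i, ]
  test  <- d[folds == i, ]

  # Candidate 1: logistic regression
  glm_fit <- glm(Species ~ Sepal.Length + Sepal.Width, data = train, family = binomial)
  pred1 <- predict(glm_fit, test, type = "response") > 0.5
  acc[i, "logistic"] <- mean(pred1 == (test$Species == levels(d$Species)[2]))

  # Candidate 2: linear discriminant analysis
  lda_fit <- lda(Species ~ Sepal.Length + Sepal.Width, data = train)
  acc[i, "lda"] <- mean(predict(lda_fit, test)$class == test$Species)
}

colMeans(acc)  # average held-out accuracy per model
```

The model with the better cross-validated performance would be selected and then refit on the full data set; for the project, accuracy could be replaced by AUC or another criterion appropriate to the data.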
Lecture Notes, R Code, Homework, and Data
[Course materials table: Lectures, Scripts, Homework, Datasets — links available on the course website]