Machine Learning
Machine learning is a branch of artificial intelligence, broadly defined as computer algorithms that learn and improve by finding patterns in data.
Jeanette Stingone, PhD, MPH is an Assistant Professor of Epidemiology at the Columbia University Mailman School of Public Health.
Introduction
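Machine learning techniques involve data-driven approaches that use patterns in data to generate information. The most commonly used techniques in epidemiology are unsupervised learning, which looks for structure in data without an outcome label, and supervised learning, which learns to predict a known outcome.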
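As a minimal sketch of the difference between the two, the example below applies both approaches to the same simulated data. Everything in it is an assumption made for illustration: the single simulated "biomarker", the two groups, and the helper names (simulate, two_means, fit_centroids, predict) are hypothetical and come from no particular package. The unsupervised step sees only the measurements, while the supervised step also uses the known group labels.

```python
import random
from statistics import mean

def simulate(n_per_group=50, seed=0):
    """Simulate one hypothetical biomarker for two groups (labels 0 and 1)."""
    rng = random.Random(seed)
    low = [(rng.gauss(2.0, 0.5), 0) for _ in range(n_per_group)]
    high = [(rng.gauss(6.0, 0.5), 1) for _ in range(n_per_group)]
    return low + high

def two_means(values, n_iter=20):
    """Unsupervised: 1-D k-means with k=2. Uses the values only, never the labels."""
    centers = [min(values), max(values)]
    for _ in range(n_iter):
        groups = ([], [])
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def fit_centroids(data):
    """Supervised: learn one mean per known outcome label."""
    return {label: mean([v for v, y in data if y == label]) for label in (0, 1)}

def predict(centroids, value):
    """Assign the label whose learned mean is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

data = simulate()
cluster_centers = two_means([v for v, _ in data])
centroids = fit_centroids(data)
accuracy = sum(predict(centroids, v) == y for v, y in data) / len(data)

print("Cluster centers (no labels used):", [round(c, 2) for c in cluster_centers])
print("Training accuracy of the labeled classifier:", round(accuracy, 2))
```

In this deliberately easy setting the two approaches recover similar group centers, but only the supervised model can be scored against a known outcome. The clusters have no labels attached and still need to be interpreted by the analyst.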
Key Concepts
Data partitioning, algorithms, tuning, overfitting, and the bias-variance tradeoff are at the core of machine learning practice.
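These concepts can be sketched together in a few lines. The example below is an illustration under stated assumptions, not a recommended workflow: the data are simulated, the 60/20/20 split and the candidate values of k are arbitrary choices, and the helper names (simulate, knn_predict, accuracy) are hypothetical. It partitions the data, tunes the number of neighbors in a k-nearest-neighbors classifier on the validation set, and evaluates the chosen model once on the held-out test set.

```python
import random

def simulate(n=300, seed=1):
    """Simulated data: one predictor x in [0, 1]; the outcome is 1 with probability x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if rng.random() < x else 0
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(y for _, y in neighbors)
    return 1 if 2 * votes > k else 0

def accuracy(train, evaluation, k):
    """Share of evaluation points whose outcome is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in evaluation)
    return hits / len(evaluation)

data = simulate()
# Data partitioning: 60% training, 20% validation (for tuning), 20% test.
# The simulated rows are already in random order, so no shuffle is needed here.
train, validation, test = data[:180], data[180:240], data[240:]

# Tuning: compare odd values of k only, so a vote can never tie.
results = {}
for k in (1, 5, 15, 45):
    results[k] = (accuracy(train, train, k), accuracy(train, validation, k))
    print(f"k={k:2d}  train={results[k][0]:.2f}  validation={results[k][1]:.2f}")

# Choose k on the validation set, then touch the test set exactly once.
best_k = max(results, key=lambda k: results[k][1])
test_accuracy = accuracy(train, test, best_k)
print(f"chosen k={best_k}  test accuracy={test_accuracy:.2f}")
```

With k=1 every training point is its own nearest neighbor, so the model reproduces the training labels perfectly while doing worse on data it has not seen, which is the signature of overfitting (low bias, high variance). Larger values of k average over more neighbors and trade some of that variance for bias. Reporting performance on the test set rather than the validation set avoids an optimistic estimate, because the validation set was already used to choose k.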
Challenges
Though machine learning is a powerful tool, its successful implementation can be hindered by issues such as limited interpretability and poor data quality.
Future Applications
Though machine learning is currently focused on prediction, there are growing efforts to integrate it with causal inference, use it for data collection and harmonization, and investigate algorithmic fairness.
Resources
Websites
Overview of different packages in R that are typically used for machine learning
scikit-learn - a widely used Python library for implementing machine learning
TensorFlow - Google’s platform for machine learning
Textbooks
Provides an accessible, less technical look at key topics in statistical learning:
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. 2nd Edition. New York: Springer, 2021. ISBN: 978-1071614174 doi: 10.1007/978-1-0716-1418-1_1
Explains concepts of learning from data in a statistical framework, with an emphasis on application rather than theory:
Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Second Edition. New York: Springer, 2009. ISBN: 978-0-387-84857-0 doi: 10.1007/b94608.
Provides guidance on inference within data science and applications of targeted learning to address research questions based on real data:
Dr. Mark J. van der Laan, Dr. Sherri Rose. Targeted Learning in Data Science - Causal Inference for Complex Longitudinal Studies. New York: Springer, 2018. ISBN: 978-3-319-65304-4.
Methodological Articles
Overview of machine learning targeted towards epidemiologists that touches on key concepts, algorithms, and epidemiologic applications:
Bi Q, Goodman KE, Kaminsky J, Lessler J. What is machine learning? A primer for the epidemiologist. AJE 2019; 188: 2222-2239
Explores the relationship between modern machine learning methods and causal inference:
Blakely T, Lynch J, Simons K, Bentley R, Rose S. Reflecting on modern methods: when worlds collide-prediction, machine learning and causal inference. Int J Epi 2019; doi:10.1093/ije/dyz132.
Commentary on grounding the practice of machine learning in an awareness of structural racism’s impact on all aspects of health research:
Robinson WR, Renson A, Naimi AI. Teaching yourself about structural racism will improve your machine learning. Biostatistics 2020; 21: 338-344
Paper on the distinctions between explanatory and predictive modeling:
Shmueli G. To explain or to predict? Statistical Science 2010; 25:289-310.
Application Articles
Application of several algorithms (including artificial neural networks, random forests, gradient boosted trees, least squares, and penalized linear regression) to predict life expectancy at birth in Brazilian municipalities:
Chiavegatto Filho ADP, Dos Santos HG, do Nascimento CF, Massa K, Kawachi I. Overachieving municipalities in public health: A machine learning approach. Epidemiology 2018; 29:836-40.
Use of classification trees and random forests using a range of demographic and health predictors to predict risk of nonfatal suicide attempts by sex:
Gradus JL, Rosellini AJ, Horvath-Puho E, Jiang T, Street AE, Galatzer-Levy I, Lash TL, Sorensen HT. Predicting sex-specific nonfatal suicide attempt risk using machine learning and data from Danish national registries. American Journal of Epidemiology 2021; 190(12):2517-2527.
Use of linear and quantile regression, random forests, Bayesian additive regression trees, and generalized boosted models to predict estimated fetal weight:
Naimi AI, Platt RW, Larkin JC. Machine learning for fetal growth prediction. Epidemiology 2018; 29:290-298.
Courses
Introduction to Machine Learning for Epidemiologists - episummer @ Columbia
Machine Learning Boot Camp: Analyzing Biomedical and Health Data - Columbia University’s Department of Environmental Health Sciences and Department of Biostatistics in the Mailman School of Public Health