Machine Learning

Machine learning is a branch of artificial intelligence, broadly defined as computer algorithms that learn and improve by finding patterns in data.

Jeanette Stingone, PhD, MPH is an Assistant Professor of Epidemiology at the Columbia University Mailman School of Public Health.

Introduction

Machine learning techniques are data-driven approaches that use patterns in data to generate information. The techniques most commonly used in epidemiology are unsupervised and supervised learning.
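To make the distinction concrete, the following is a minimal sketch in plain Python, not a recommended analysis. It assumes a made-up, simulated dataset (one biomarker value per person in two hypothetical groups). The supervised part uses the group labels to learn a nearest-centroid rule; the unsupervised part ignores the labels and runs a simple two-cluster k-means. The function names `predict` and `kmeans_1d` are illustrative only.

```python
import random

random.seed(0)

# Synthetic, hypothetical data: one biomarker value per person, two groups.
unexposed = [random.gauss(1.0, 0.3) for _ in range(50)]
exposed = [random.gauss(3.0, 0.3) for _ in range(50)]
values = unexposed + exposed
labels = [0] * 50 + [1] * 50


def mean(xs):
    return sum(xs) / len(xs)


# Supervised learning: the labels are used to learn a decision rule.
# A nearest-centroid classifier stores the mean of each labeled group.
centroids = {
    label: mean([v for v, y in zip(values, labels) if y == label])
    for label in (0, 1)
}


def predict(value):
    """Assign the label whose group mean is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


# Unsupervised learning: no labels are used; k-means (k=2) looks for structure.
def kmeans_1d(xs, iterations=20):
    centers = [min(xs), max(xs)]  # simple deterministic starting points
    for _ in range(iterations):
        clusters = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            clusters[nearest].append(x)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


cluster_centers = kmeans_1d(values)
print("Supervised prediction for 2.8:", predict(2.8))
print("Unsupervised cluster centers:", [round(c, 2) for c in cluster_centers])
```

In this toy setting both approaches recover the same two groups, but only the supervised rule needed the labels to do so. In practice, a library such as scikit-learn would usually replace these hand-written functions.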

Video: Machine Learning - Introduction (Part 1 of 4)

Key Concepts

Data partitioning, algorithms, tuning, overfitting, and the bias-variance tradeoff are at the core of machine learning practice.
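These concepts can be sketched together in a few lines of plain Python. The example below is illustrative only and assumes simulated data: a hypothetical binary outcome whose risk depends on a single predictor `x`. It partitions the data into training, validation, and test sets, fits a k-nearest-neighbors classifier, tunes the hyperparameter `k` on the validation set, and reports test error once at the end. All function names and the candidate values of `k` are assumptions made for the example.

```python
import random

random.seed(42)


def make_data(n):
    """Simulate a hypothetical binary outcome whose risk rises with x, plus noise."""
    data = []
    for _ in range(n):
        x = random.uniform(0, 10)
        risk = 0.9 if x > 5 else 0.1
        y = 1 if random.random() < risk else 0
        data.append((x, y))
    return data


data = make_data(600)

# Data partitioning: separate sets for fitting, tuning, and final evaluation.
train, validation, test = data[:300], data[300:450], data[450:]


def knn_predict(x, training_set, k):
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(training_set, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(y for _, y in neighbors)
    return 1 if 2 * votes > k else 0


def error_rate(dataset, training_set, k):
    """Share of observations in dataset that the classifier gets wrong."""
    wrong = sum(knn_predict(x, training_set, k) != y for x, y in dataset)
    return wrong / len(dataset)


# Tuning: choose the hyperparameter k using the validation set only.
candidate_ks = [1, 5, 15, 45]
results = {}
for k in candidate_ks:
    results[k] = (error_rate(train, train, k), error_rate(validation, train, k))
    print(f"k={k:>2}  train error={results[k][0]:.3f}  "
          f"validation error={results[k][1]:.3f}")

best_k = min(candidate_ks, key=lambda k: results[k][1])

# The test set is used once, after tuning, to estimate out-of-sample error.
test_error = error_rate(test, train, best_k)
print(f"best k={best_k}  test error={test_error:.3f}")
```

With `k=1` the classifier memorizes the training set, so its training error is zero while its validation error is not; that gap is the signature of overfitting. Small values of `k` give flexible, high-variance fits and large values give smoother, higher-bias fits, which is one way to see the bias-variance tradeoff.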

Video: Machine Learning - Key Concepts (Part 2 of 4)

Challenges

Although machine learning is a powerful tool, its successful implementation can be hindered by issues such as limited practical interpretability and poor data quality.

Video: Machine Learning - Challenges (Part 3 of 4)

Future Applications

Although current machine learning applications focus on prediction, there are growing efforts to integrate machine learning with causal inference, use it for data collection and harmonization, and investigate algorithmic fairness.

Video: Machine Learning - Future Applications (Part 4 of 4)

Resources

Websites

Overview of different packages in R that are typically used for machine learning

scikit-learn - a widely used Python library for implementing machine learning

TensorFlow - Google’s platform for machine learning

Textbooks

Provides an accessible, less technical look at key topics in statistical learning:
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. 2nd Edition. New York: Springer, 2021. ISBN: 978-1071614174 doi: 10.1007/978-1-0716-1418-1_1

Explains concepts of learning from data in a statistical framework, with an emphasis on application rather than theory:
Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Second Edition. New York: Springer, 2009. ISBN: 978-0-387-84857-0 doi: 10.1007/b94608.

Provides guidance on inference within data science and applications of targeted learning to address research questions based on real data:
Mark J. van der Laan, Sherri Rose. Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies. New York: Springer, 2018. ISBN: 978-3-319-65304-4.

Methodological Articles

Overview of machine learning targeted towards epidemiologists that touches on key concepts, algorithms, and epidemiologic applications:
Bi Q, Goodman KE, Kaminsky J, Lessler J. What is machine learning? A primer for the epidemiologist. American Journal of Epidemiology 2019; 188:2222-2239.

Explores the relationship between modern machine learning methods and causal inference:
Blakely T, Lynch J, Simons K, Bentley R, Rose S. Reflection on modern methods: when worlds collide - prediction, machine learning and causal inference. International Journal of Epidemiology 2019; doi:10.1093/ije/dyz132.

Commentary on grounding the practice of machine learning in an awareness of structural racism’s impact on all aspects of health research:
Robinson WR, Renson A, Naimi AI. Teaching yourself about structural racism will improve your machine learning. Biostatistics 2020; 21:338-344.

Paper on the distinctions between explanatory and predictive modeling:
Shmueli G. To explain or to predict? Statistical Science 2010; 25:289-310.

Application Articles

Application of several algorithms (including artificial neural networks, random forests, gradient boosted trees, least squares, and penalized linear regression) to predict birth rates in Brazilian municipalities:
Chiavegatto Filho ADP, Dos Santos HG, do Nascimento CF, Massa K, Kawachi I. Overachieving municipalities in public health: A machine learning approach. Epidemiology 2018; 29:836-40.

Use of classification trees and random forests using a range of demographic and health predictors to predict risk of nonfatal suicide attempts by sex:
Gradus JL, Rosellini AJ, Horvath-Puho E, Jiang T, Street AE, Galatzer-Levy I, Lash TL, Sorensen HT. Predicting sex-specific nonfatal suicide attempt risk using machine learning and data from Danish national registries. American Journal of Epidemiology 2021; 190(12):2517-2527.

Use of linear and quantile regression, random forests, Bayesian additive regression trees, and generalized boosted models to predict estimated fetal weight:
Naimi AI, Platt RW, Larkin JC. Machine learning for fetal growth prediction. Epidemiology 2018; 29:290-298.

Courses

Introduction to Machine Learning for Epidemiologists - episummer @ Columbia

Machine Learning Boot Camp: Analyzing Biomedical and Health Data - Columbia University’s Department of Environmental Health Sciences and Department of Biostatistics in the Mailman School of Public Health