Summary Statistics

Written and curated by Emily Potts, MS, Weija Fan, MS, and Min Qian, PhD. 

Overview

A vital skill for researchers is describing their data with summary statistics: succinct numerical measures that each communicate one aspect of a dataset. These measures capture central tendency (mean, median, mode), variability (range, interquartile range, median absolute deviation, standard deviation, standard error of the mean), or shape (skewness, kurtosis). The most appropriate summary statistics depend on the variable type: mean and standard deviation for continuous variables, frequencies and proportions for categorical variables (nominal, ordinal, or binary), and the median for time-to-event outcomes. Researchers typically calculate these statistics for subgroups within their dataset, defined by variables such as age, sex, socioeconomic factors, or treatment group, and a "Table 1" in publications usually presents these baseline characteristics. Calculating summary statistics is a crucial part of exploratory data analysis (EDA), but care should be taken, because they can mask important features such as outliers or non-normal distributions. To avoid missing meaningful patterns, visualizing the data alongside summary statistics is often recommended.
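The measures named above, and the way a single outlier can distort some of them, can be illustrated with a minimal Python sketch that uses only the standard library. The blood pressure readings and the `summarize` helper are hypothetical and exist only for illustration; the quartiles use the default method of `statistics.quantiles`, which may differ slightly from the defaults in R, SAS, or Stata.

```python
import statistics as st


def summarize(values):
    """Return common measures of central tendency and variability."""
    q1, _, q3 = st.quantiles(values, n=4)  # quartile cut points
    med = st.median(values)
    # Median absolute deviation: median distance from the median
    mad = st.median([abs(v - med) for v in values])
    sd = st.stdev(values)  # sample standard deviation (n - 1 denominator)
    return {
        "n": len(values),
        "mean": st.mean(values),
        "median": med,
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "mad": mad,
        "sd": sd,
        "sem": sd / len(values) ** 0.5,  # standard error of the mean
    }


# Hypothetical systolic blood pressure readings (mmHg)
clean = [118, 121, 119, 125, 122, 120, 123]
with_outlier = clean + [190]  # one extreme reading added

a = summarize(clean)
b = summarize(with_outlier)

for name in a:
    print(f"{name:>6}: {a[name]:8.2f}  ->  {b[name]:8.2f}")
```

In this sketch the mean, range, and standard deviation shift substantially once the extreme reading is added, while the median and median absolute deviation barely move. That contrast is one reason to report robust measures, and to plot the data, when outliers or skewed distributions are suspected.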

Videos

  • Summaries of Data | Lecture 1 - American University of Integrative Sciences (37 minutes) 
    • Explains the rationale for descriptive statistics, defines measures of central tendency, and describes their relationship to distributions.
  • Summaries of Data | Lecture 2 - American University of Integrative Sciences (1 hour 12 minutes) 
    • Explains measures of dispersion, such as range, sample and population variance, standard deviation, and coefficient of variation. 
  • Descriptive Statistics (playlist of 13 videos, each 7 to 21 minutes)
    • Each video goes in-depth on one type of summary statistic, with an intuitive definition, an example calculation, and comparisons to similar summary statistics.

Websites

Readings

R Annotated Code

SAS Annotated Code

If the goal is to generate simple summary statistics (count, mean, standard deviation, minimum, maximum), PROC MEANS and PROC SUMMARY may be easier to use. PROC UNIVARIATE produces more output by default, including quantiles and extreme observations, and offers a more comprehensive set of options. For categorical variables, PROC FREQ generates frequency tables.
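For readers without SAS at hand, the two kinds of output described above can be roughly sketched in Python with the standard library. This is an analogue under stated assumptions, not SAS code and not a reproduction of any procedure's exact output: the records, the variable names, and the `means_table` and `freq_table` structures are all hypothetical.

```python
import statistics as st
from collections import Counter, defaultdict

# Hypothetical records: (treatment group, age in years, smoking status)
records = [
    ("treatment", 54, "never"),
    ("treatment", 61, "former"),
    ("treatment", 47, "never"),
    ("control", 58, "current"),
    ("control", 66, "never"),
    ("control", 50, "former"),
]

# Continuous variable summarized by group: n, mean, SD, min, max
ages_by_group = defaultdict(list)
for group, age, _ in records:
    ages_by_group[group].append(age)

means_table = {
    group: {
        "n": len(ages),
        "mean": st.mean(ages),
        "sd": st.stdev(ages),  # sample standard deviation
        "min": min(ages),
        "max": max(ages),
    }
    for group, ages in ages_by_group.items()
}

# Categorical variable: frequency and percent for each level
counts = Counter(status for _, _, status in records)
total = sum(counts.values())
freq_table = {
    level: {"count": n, "percent": 100 * n / total}
    for level, n in counts.items()
}

for group, row in means_table.items():
    print(group, row)
for level, row in freq_table.items():
    print(level, row)
```

The split mirrors the distinction drawn in the text: continuous variables get a row of moments and extremes per group, while categorical variables get counts and percentages per level.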

Stata Annotated Code

Related Topics