Summary Statistics
Written and curated by Emily Potts, MS, Weija Fan, MS, and Min Qian, PhD.
Overview
A vital skill for researchers is describing their data using summary statistics: succinct numerical measures that communicate information about an aspect of the dataset. These measures capture characteristics such as central tendency (mean, median, mode), variability (range, interquartile range, median absolute deviation, standard deviation, standard error of the mean), and shape (skewness and kurtosis). The most appropriate summary statistics depend on the variable type: for example, mean and standard deviation for continuous variables, frequencies and proportions for categorical variables (nominal, ordinal, or binary), and the median for time-to-event outcomes. Researchers typically calculate these statistics for subgroups within their dataset, such as those defined by age, sex, socioeconomic factors, or treatment group, and a "Table 1" in publications usually presents these baseline characteristics. Calculating summary statistics is a crucial part of exploratory data analysis (EDA), but care should be taken, as they can mask important features like outliers or non-normal distributions. To avoid missing meaningful patterns, visualizing the data alongside summary statistics is often recommended.
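As a rough, language-neutral illustration of the ideas in the overview, the sketch below uses only Python's standard library to build a tiny "Table 1"-style summary by group. The records, the group names, and the helper functions `summarize_continuous` and `summarize_categorical` are all hypothetical and exist only for this example. It shows type-appropriate summaries (mean, standard deviation, median, and interquartile range for a continuous variable; counts and proportions for a categorical one) and how a single outlier can pull the mean away from the median.

```python
from collections import Counter
from statistics import mean, median, quantiles, stdev


def summarize_continuous(values):
    """Return common summary statistics for a continuous variable."""
    # quantiles(..., n=4) returns the three quartile cut points
    # (Python's default "exclusive" method; other software may differ slightly).
    q1, _, q3 = quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": mean(values),
        "sd": stdev(values),  # sample standard deviation (n - 1 denominator)
        "median": median(values),
        "iqr": q3 - q1,
    }


def summarize_categorical(values):
    """Return (count, proportion) for each category level."""
    counts = Counter(values)
    total = len(values)
    return {level: (count, count / total) for level, count in counts.items()}


# Hypothetical records: (treatment group, age in years, sex).
# The age of 95 in the treated group is a deliberate outlier.
records = [
    ("control", 34, "F"), ("control", 41, "M"),
    ("control", 38, "F"), ("control", 45, "M"),
    ("treated", 36, "F"), ("treated", 39, "F"),
    ("treated", 42, "M"), ("treated", 95, "M"),
]

# Build one summary per group, choosing statistics by variable type.
table_one = {}
for group in ("control", "treated"):
    ages = [age for g, age, _ in records if g == group]
    sexes = [sex for g, _, sex in records if g == group]
    table_one[group] = {
        "age": summarize_continuous(ages),
        "sex": summarize_categorical(sexes),
    }

for group, summary in table_one.items():
    print(group, summary)
```

In this made-up data, the control group's mean and median age are both 39.5, while the treated group's mean age is 53 and its median is 40.5. The gap comes entirely from the single 95-year-old, which is the kind of feature a plot of the raw data would reveal and a lone mean would hide.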
Videos
- Summaries of Data | Lecture 1 - American University of Integrative Sciences (37 minutes)
- Explains the rationale for descriptive statistics and defines measures of central tendency and their relationship to distributions.
- Summaries of Data | Lecture 2 - American University of Integrative Sciences (1 hour 12 minutes)
- Explains measures of dispersion, such as range, sample and population variance, standard deviation, and coefficient of variation.
- Descriptive Statistics - (Playlist with 13 videos ranging from 7 to 21 minutes)
- Each video goes in-depth on one type of summary statistic, with an intuitive definition, an example calculation, and comparisons to similar summary statistics.
Websites
- Summarizing One Quantitative Variable - Penn State University | Archive
- Discusses measures of central tendency, position, and variability.
- Measures of Central Tendency and Variability - JMP Statistical Software | Archive
- Includes more detailed links for mean, median, and mode; and normal distribution, standard deviation, empirical rule, and Z-scores.
- Descriptive statistics by hand - Antoine Soetewey at UCLouvain | Archive
- Describes how to compute the main descriptive statistics by hand and how to interpret them.
- What’s the difference between Descriptive and Inferential Statistics? - Bradley University | Archive
- Distinguishes between statistics used to describe the characteristics of data (mean, standard deviation, or frequency) and those used to draw conclusions from data (hypothesis testing, confidence intervals, or regression).
- Writing with Descriptive Statistics - Purdue University | Archive
- A brief outline on integrating summary statistics in writing.
Readings
- Crump, M. J. C., Navarro, D. J., & Suzuki, J. (2019, June 5). Answering Questions with Data (Textbook): Introductory Statistics for Psychology Students. https://www.crumplab.com/statistics/02-Describing_Data.html
- Chapter 2 covers an intuitive rationale for descriptive statistics, with detail on measures of central tendency, variance, and distribution.
- Introduction to Modern Statistics, First Edition by Mine Çetinkaya-Rundel and Johanna Hardin | Archive
- Chapter 4 covers summary statistics (and data visualization) for categorical data and chapter 5 for numerical, continuous data.
- An Introduction to Data Analysis by Michael Franke | Archive
- Chapter 5 covers summary statistics for counts and proportions, for simple, one-dimensional vectors with numeric information, and for the relationship between two numerical vectors.
- Guetterman T. C. (2019). Basics of statistics for primary care research. Family medicine and community health, 7(2), e000067. https://doi.org/10.1136/fmch-2018-000067
- Discusses how summary statistics fit into the clinical research process and compares them to inferential statistics.
R Annotated Code
- Descriptive statistics in R - Antoine Soetewey at UCLouvain | Archive
- Explains how to compute basic descriptive statistics using base R.
- Data transformation with dplyr :: Cheatsheet | Archive
- Concise description of functions within the dplyr package for transforming data, including creating new tables of summary statistics, computing grouped summary statistics, extracting subsets, and making new variables.
- PDF available for download.
- Exploratory analysis using data summaries - Jeff Goldsmith at Columbia University | Archive
- Focuses on computing numeric summaries within groups using the dplyr package.
- R Tips - University of British Columbia | Archive
- Covers frequency tables for categorical variables and tables of means, sums, etc. for numeric variables.
- R code snippets (Transform Tables) | Archive
- List of useful code examples, including computing summary statistics for groups of rows within a table or creating a three-way contingency table.
- Descriptive Statistics in R (Playlist with 46 videos ranging from 2 to 8 minutes)
- Each video discusses a specific descriptive statistic task, with the use of base R and common packages.
- Summarizing with statistics - Andrew Bray at Reed College
- Walk-through tutorial on conducting numeric summaries with an example dataset, covering measures of center, variability, shape, and outliers.
- R Programming: Elegant Cross Tables in R | gtsummary | publication ready cross tables (13 minutes)
- Covers customizations for tables such as percentage by row or column, creating P values, bolding the labels, or changing the levels of the factors.
SAS Annotated Code
If the goal is to generate simple summary statistics (count, mean, standard deviation, minimum, maximum), PROC MEANS and PROC SUMMARY may be easier to use, but PROC UNIVARIATE has a larger default output and a more comprehensive set of options. PROC FREQ is used for categorical variables to generate frequency tables.
- Intro to SAS Notes - Data Summarization - Robert Parker at the University of Florida | Archive
- Describes how to summarize one categorical variable, one quantitative variable, and basic summaries of bivariate data in SAS.
- SAS PROC UNIVARIATE (18 minutes), MEANS (22 minutes), and FREQ (17 minutes)
- Each video discusses the procedure's default output and customization options, shows how to use it to perform tests and analyses, and compares it to the other procedures.
- Descriptive Statistics | SAS Learning Modules - UCLA | Archive
- Covers PROC FREQ for frequencies, PROC MEANS for summary statistics, and PROC UNIVARIATE for detailed summary statistics.
- Summarizing Numerical and Categorical Data in SAS - Winona State University | Archive
- Covers PROC TABULATE, PROC FREQ, and PROC MEANS.
- SAS Summary Stats - University of Cincinnati | Archive
- Covers PROC FREQ, PROC MEANS, and PROC UNIVARIATE.
- Summarizing Continuous Data - Penn State University | Archive
- Discusses the MEANS, SUMMARY, and UNIVARIATE procedures to perform various basic descriptive statistics on the numeric variables in a data set.
- Summarizing Categorical Data - Penn State University | Archive
- Discusses the FREQ procedure as a tool for summarizing and analyzing categorical data.
Stata Annotated Code
- Stata for Students: Descriptive Statistics - University of Wisconsin-Madison | Archive
- Uses summarize and tabulate commands to generate a wide range of statistics, including Frequencies for a Single Categorical Variable, Summary Statistics for One Quantitative Variable over One Categorical Variable, and Frequencies for Three or More Categorical Variables.
- Exploring Data using Stata: Descriptive Statistics - Princeton University | Archive
- Descriptive Statistics Using the Summarize Command - UCLA | Archive
- Shows example of getting descriptive statistics using the summarize command with footnotes explaining the output.
- Stata: Descriptive Statistics – Mean, median, variability - Scott Baldwin at Brigham Young University | Archive
- Instructions on how to summarize multiple variables at once, using “by” to get summary statistics for every group, “if” to get them for a specific group, and “in” to select a particular subset of cases based on their order in the dataset.
- Tables of descriptive statistics - Stata | Archive
- Discusses the use of the dtable command to create a Table 1.
- Data Analysis with Stata - Tables - University of Notre Dame | Archive
- Automates the creation of publication-quality tables for summary statistics (and regression results) with the estout package.
- Descriptive Statistics all Commands in Stata (11 minute video)
- Understand your dataset using the summarize, inspect, codebook, describe, tabulate, and histogram commands.
- Using Stata Creating Crosstabs (7 minute video)
- How to create and customize cross-tabulations using the table command, including the use of conditional functions.
- Creating and exporting tables of descriptive statistics (1 minute video)
- Quickly demonstrates the use of the dtable command to create a “publication-ready” Table 1, using the beginner-friendly drop-down menu.