Data Types and Distributions
Written and curated by Emily Potts, MS, Weija Fan, MS, and Caleb Miles, PhD.
Overview
An essential first step in handling data is to determine each variable’s data type. The two main data types are numerical and categorical; categorical variables take values that are categories rather than numbers. Numerical variables can be discrete (e.g., taking integer values, such as household size) or continuous (i.e., falling on a continuum, as with age or weight). Categorical variables can be ordinal (not numeric, but having an ordering, e.g., Likert scale survey responses) or nominal (having no natural ordering, e.g., marital status).

In statistics, we model variables as being random, and every random variable has a distribution, which determines how frequently different values of the variable occur. A statistical model is a family of distributions that are all the same apart from key identifying quantities known as parameters. Parameters determine the characteristics of a distribution, such as its center, spread, or shape. For example, the normal distribution has two parameters: one determining its mean (center) and the other its variance (spread). Distribution types correspond to data types: for continuous random variables there are continuous distributions (e.g., normal, exponential), and for discrete random variables there are discrete distributions (e.g., binomial, Poisson). One of the central goals of statistics is to learn (i.e., estimate) the values of the parameters in a statistical model from data, which can provide essential information about the variables of interest and their relationships to one another.
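As a rough illustration of how parameters pick out one member of a statistical model, the Python sketch below compares two normal distributions that share a mean but differ in spread, then estimates the parameters from simulated data. The values (a mean of 70, standard deviations of 5 and 12, a sample of 5,000) are hypothetical and chosen only for the example; it uses `NormalDist` from Python's standard `statistics` module and is not drawn from any of the resources listed on this page.

```python
from statistics import NormalDist

# Hypothetical "true" parameters, chosen only for illustration.
TRUE_MEAN = 70.0   # center (e.g., a weight in kg)
TRUE_SD = 12.0     # spread as a standard deviation; the variance is TRUE_SD ** 2

# Two members of the normal family: same mean, different spread.
narrow = NormalDist(mu=TRUE_MEAN, sigma=5.0)
wide = NormalDist(mu=TRUE_MEAN, sigma=TRUE_SD)

# Probability that a value falls between 60 and 80 under each distribution.
p_narrow = narrow.cdf(80) - narrow.cdf(60)
p_wide = wide.cdf(80) - wide.cdf(60)

# Simulate data from the wide distribution, then estimate its parameters
# from the sample alone (the seed makes the simulation reproducible).
sample = wide.samples(5000, seed=1)
fitted = NormalDist.from_samples(sample)

print(f"P(60 < X < 80): narrow={p_narrow:.3f}, wide={p_wide:.3f}")
print(f"estimated mean={fitted.mean:.2f}, estimated sd={fitted.stdev:.2f}")
```

The distribution with the smaller spread places more of its probability near the mean, and the estimates recovered from the simulated sample should land close to the parameters used to generate it.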
Videos
- Teach me STATISTICS in half an hour! (42 minutes | Rec: 1:18 – 13:50)
- Intuitive explanations with examples. Section 1 discusses data types; Section 2 covers data distributions.
- Distributions (Playlist with videos 8-22 minutes)
- Each video covers a specific distribution or concept, including the normal distribution, Student’s t-distribution, and chi-squared distribution.
- Foundational Concepts in Statistics and Epidemiology | Public Health Sciences - Johns Hopkins University (59 minutes | Rec: 21:17 – 30:56)
- Seminar discussing random variables, graphical and numeric methods for describing data, and common parametric distributions.
- Explaining probability distributions (13 minutes)
- A brief guide to understanding random variables and probability distributions (Probability Density Function, Probability Mass Function, and Cumulative Distribution Function).
- Explaining parametric families (15 minutes)
- A brief guide to understanding parametric families of distributions with examples of common parametric families.
Websites
- Statistical Distributions - NYU | Archive
- Focuses on the aspects of distributions that are most useful when analyzing raw data and trying to fit the right distribution to that data.
- Data Types - Mayo Clinic | Archive
- Explains how the type of data helps determine what statistical procedures are appropriate.
- More Good Reasons to Look at the Data - Mayo Clinic | Archive
- Highlights how visual representations of continuous data can help us understand appropriate analyses.
- Continuous Probability Distributions - Penn State University | Archive
- Overview of continuous data distributions.
- Types of Discrete Data and Discrete Distributions - Penn State University | Archive Link #1 and Link #2
- Overview of discrete data types and distributions.
- Overview of data distributions | Archive
- A slightly more technical document with formulas, with details on when to use each type of distribution to model your data.
Readings
- An Intuitive, Interactive, Introduction to Biostatistics (Chapter 5: Probability Distributions) - Caitlin Ward and Collin Nolte at the University of Iowa | Archive
- Sarkar, Sonali. Understanding data for medical statistics. International Journal of Advanced Medical and Health Research 1(1):p 30-33, Jan–Jun 2014. | DOI: 10.4103/2349-4220.134449
- Exploratory Data Analysis (Chapter 4) - Carnegie Mellon University | Archive
- Textbook chapter covering the typical data format, the types of EDA, and univariate and multivariate graphical and non-graphical EDA.
- Introduction to Modern Statistics, First Edition by Mine Çetinkaya-Rundel and Johanna Hardin | Archive
- Part II covers EDA with both data visualization and summary statistics, with chapter 4 on categorical data and chapter 5 on numerical, continuous data.
R Annotated Code
- Understanding Distributions using R | Archive
- Covers distributions and how to use them to summarize your data set, histograms versus density plots, how to check whether data are normally distributed, and proper use of boxplots.
- Elementary Statistics in R - Chapter 2 Describing Patterns in Data | Archive
- Outlines how to describe a one factor variable, one numeric variable, a relationship between two factor variables, and a relationship between a factor variable and a numeric variable.
- Statistical Methods for Data Science - Chapter 2 Data Distributions - Elizabeth Purdom at UC Berkeley | Archive
- Chapter discusses basic exploratory analyses, probability distributions, continuous distributions, density curve estimation, etc.
- How to explore a “new” data set - Massey University, New Zealand | Archive
- Very simple list of functions to familiarize yourself with a new dataset.
SAS Annotated Code
- The Distribution Analysis Task in SAS Studio (5 minutes)
- Beginner-friendly tutorial shows how to evaluate the distribution of a continuous variable through histograms with overlaid density curves and added statistics for assessing normality.
- The One-Way Frequencies Task in SAS Studio (4 minutes)
- Beginner-friendly tutorial shows how to evaluate the distribution of a categorical variable through frequency tables and plots.
- The UNIVARIATE Procedure - Exploring a Data Distribution - SAS
- PROC UNIVARIATE - UCLA | Archive
- Code to investigate the distribution of a continuous variable.
Stata Annotated Code
- Stata Class Notes: Exploring Data - Stata | Archive
- List of common Stata commands in the exploratory process and their uses.
- Introducing Stata—sample session - Stata | Archive
- First two sections (Simple data management and Descriptive statistics) are useful starters to open, investigate, and describe the contents of the dataset.
- Ways to View Data in Stata - Georgia State University | Archive
- Variable Distributions | Stata Textbook Examples - UCLA | Archive
- Shows output and customization options for the summarize command.
Courses and Resources
- Managing, Describing, and Analyzing Data - University of Colorado Boulder
- Free self-paced course covering how to correctly classify and describe data types, four common probability distributions, how to analyze data sets using the appropriate probability distribution, and the basics of sampling in research.
- Introduction to Statistics & Data Analysis in Public Health - Imperial College London
- Free self-paced course to learn how to take a data set you've never seen before, describe its key features, get to know its strengths and quirks, and run some vital basic analyses.