Data Types and Distributions

Written and curated by Emily Potts, MS, Weija Fan, MS, and Caleb Miles, PhD.

Overview

An essential first step in handling data is to determine each variable’s data type. The main data types are numerical and categorical; categorical variables take values that are categories rather than numbers. Numerical variables can be discrete (e.g., taking integer values, as with household size) or continuous (i.e., they fall on a continuum, as with age or weight). Categorical variables can be ordinal (not numeric, but having an ordering, e.g., Likert scale survey responses) or nominal (having no natural ordering, e.g., marital status).

In statistics, we model variables as being random, and every random variable has a distribution, which determines how frequently different values of the variable occur. A statistical model is a family of distributions that are all the same apart from key identifying quantities known as parameters. Parameters determine the characteristics of the distribution, such as its center, spread, or shape. For example, the normal distribution has two parameters: one determining its mean (center) and the other its variance (spread). Distribution types correspond to data types: for continuous random variables, there are continuous distributions (e.g., normal, exponential), and for discrete random variables, there are discrete distributions (e.g., binomial, Poisson). One of the central goals of statistics is to learn (i.e., estimate) the values of the parameters in a statistical model, which can provide essential information about the variables of interest and their relationships to one another.
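As a rough illustration of parameter estimation, the sketch below simulates data from a continuous (normal) and a discrete (binomial) distribution and then estimates the parameters from the simulated values. The "true" parameter values, the sample size, and all variable names are assumptions chosen for this example; it uses only Python's standard library.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

# Hypothetical "true" parameters, chosen only for illustration.
TRUE_MEAN = 70.0   # center of the normal distribution (e.g., weight in kg)
TRUE_SD = 12.0     # standard deviation; the variance is TRUE_SD ** 2
N_TRIALS = 10      # number of trials per binomial observation
TRUE_P = 0.3       # success probability of each trial
N_OBS = 5000       # number of simulated observations

# Continuous variable: draws from a normal distribution.
weights = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N_OBS)]

# Discrete variable: each binomial draw is a count of successes in N_TRIALS.
counts = [
    sum(random.random() < TRUE_P for _ in range(N_TRIALS))
    for _ in range(N_OBS)
]

# Estimate the normal parameters with the sample mean and sample variance.
est_mean = statistics.fmean(weights)
est_var = statistics.variance(weights)

# Estimate the binomial success probability as the mean count per trial.
est_p = statistics.fmean(counts) / N_TRIALS

print(f"normal mean:     true {TRUE_MEAN:.1f}, estimated {est_mean:.2f}")
print(f"normal variance: true {TRUE_SD ** 2:.1f}, estimated {est_var:.2f}")
print(f"binomial p:      true {TRUE_P:.2f}, estimated {est_p:.3f}")
```

With a sample this large, the estimates should land close to the values used to generate the data. With real data the true parameters are unknown, and the estimates are what we learn about them.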

Videos

  • Teach me STATISTICS in half an hour! (42 minutes | Rec: 1:18 – 13:50) 
    • Intuitive explanations with examples. Section 1 discusses data types; Section 2 discusses data distributions. 
  • Distributions (Playlist; videos run 8-22 minutes each)  
    • Each video covers a specific distribution or concept, including the normal distribution, Student’s t-distribution, and chi-squared distribution. 
  • Foundational Concepts in Statistics and Epidemiology | Public Health Sciences - Johns Hopkins University (59 minutes | Rec: 21:17 – 30:56) 
    • Seminar discussing random variables, graphical and numeric methods for describing data, and common parametric distributions. 
  • Explaining probability distributions (13 minutes) 
    • A brief guide to understanding random variables and probability distributions (Probability Density Function, Probability Mass Function, and Cumulative Distribution Function). 
  • Explaining parametric families (15 minutes) 
    • A brief guide to understanding parametric families of distributions with examples of common parametric families. 

Websites

Readings

R Annotated Code

SAS Annotated Code

Stata Annotated Code

Courses and Resources

  • Managing, Describing, and Analyzing Data - University of Colorado Boulder 
    • Free self-paced course covering how to correctly classify and describe data types, four common probability distributions, how to analyze data sets using the appropriate probability distribution, and the basics of sampling in research. 
  • Introduction to Statistics & Data Analysis in Public Health - Imperial College London 
    • Free self-paced course to learn how to take a data set you've never seen before, describe its key features, get to know its strengths and quirks, and run some vital basic analyses. 

Related Topics