ModernDive

Chapter 5 Basic Regression

Now that we are equipped with data visualization skills from Chapter 2, data wrangling skills from Chapter 3, and an understanding of how to import data and the concept of a “tidy” data format from Chapter 4, let’s proceed with data modeling. The fundamental premise of data modeling is to make explicit the relationship between:

  • an outcome variable \(y\), also called a dependent variable or response variable, and
  • an explanatory/predictor variable \(x\), also called an independent variable or covariate.

Another way to state this is using mathematical terminology: we will model the outcome variable \(y\) “as a function” of the explanatory/predictor variable \(x\). When we say “function” here, we aren’t referring to functions in R like the ggplot() function, but rather to a mathematical function. But why do we have two different labels, explanatory and predictor, for the variable \(x\)? That’s because even though the two terms are often used interchangeably, roughly speaking data modeling serves one of two purposes:

  1. Modeling for explanation: When you want to explicitly describe and quantify the relationship between the outcome variable \(y\) and a set of explanatory variables \(x\), determine the significance of any relationships, have measures summarizing these relationships, and possibly identify any causal relationships between the variables.
  2. Modeling for prediction: When you want to predict an outcome variable \(y\) based on the information contained in a set of predictor variables \(x\). Unlike modeling for explanation, however, you don’t care so much about understanding how all the variables relate and interact with one another, but rather only whether you can make good predictions about \(y\) using the information in \(x\).

For example, say we are interested in an outcome variable \(y\) of whether patients develop lung cancer and information \(x\) on their risk factors, such as smoking habits, age, and socioeconomic status. If we are modeling for explanation, we would be interested in both describing and quantifying the effects of the different risk factors. One reason could be that we want to design an intervention to reduce lung cancer incidence in a population, such as targeting smokers of a specific age group with advertising for smoking cessation programs. If we are modeling for prediction, however, we wouldn’t care so much about understanding how all the individual risk factors contribute to lung cancer, but rather only about whether we can make good predictions of which people will develop lung cancer.

In this book, we’ll focus on modeling for explanation and hence refer to \(x\) as explanatory variables. If you are interested in learning about modeling for prediction, we suggest you check out books and courses on the field of machine learning, such as An Introduction to Statistical Learning with Applications in R (ISLR) (James et al. 2017). Furthermore, while there exist many techniques for modeling, such as tree-based models and neural networks, in this book we’ll focus on one particular technique: linear regression. Linear regression is one of the most commonly used and easy-to-understand approaches to modeling.

Linear regression involves a numerical outcome variable \(y\) and explanatory variables \(x\) that are either numerical or categorical. Furthermore, the relationship between \(y\) and \(x\) is assumed to be linear, or in other words, a line. However, we’ll see that what constitutes a “line” will vary depending on the nature of your explanatory variables \(x\).
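To make the idea of modeling \(y\) as a linear function of \(x\) concrete, here is a minimal sketch using base R's lm() function. The built-in cars dataset and the object names fit, coefs, new_speeds, and preds are assumptions chosen purely for illustration; they are not used elsewhere in this chapter. The sketch also hints at the two purposes of modeling described earlier: the coefficients describe the relationship, while predict() uses it.

```r
# Illustrative only: `cars` ships with base R and records each car's
# speed and the distance it took to stop.
fit <- lm(dist ~ speed, data = cars)

# Modeling for explanation: the intercept and slope describe the fitted line.
coefs <- coef(fit)
coefs

# Modeling for prediction: apply the fitted line to new values of x.
new_speeds <- data.frame(speed = c(10, 20))
preds <- predict(fit, newdata = new_speeds)
preds
```

Each prediction is simply the intercept plus the slope times the new speed, which is what it means for the relationship to be a line.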

In Chapter 5 on basic regression, we’ll only consider models with a single explanatory variable \(x\). In Section 5.1, the explanatory variable will be numerical. This scenario is known as simple linear regression. In Section 5.2, the explanatory variable will be categorical.
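As a rough preview of the categorical case, the sketch below fits a regression with a single categorical explanatory variable using base R's lm(). The built-in PlantGrowth dataset and the object names fit_cat, coefs_cat, and group_means are assumptions made for illustration only, not examples from this book.

```r
# Illustrative only: `PlantGrowth` ships with base R. `weight` is numerical
# and `group` is a factor with levels ctrl, trt1, and trt2.
fit_cat <- lm(weight ~ group, data = PlantGrowth)

# With a categorical x, lm() reports a baseline and offsets instead of one slope:
# the intercept is the mean of the baseline group (ctrl), and each remaining
# coefficient is that group's mean minus the baseline mean.
coefs_cat <- coef(fit_cat)
coefs_cat

# Group means computed directly, for comparison with the coefficients.
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
group_means
```

This is one sense in which what constitutes a “line” depends on the explanatory variable: here the fitted model amounts to a set of group means rather than a single sloped line.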

In Chapter 6 on multiple regression, we’ll extend the ideas behind basic regression and consider models with two explanatory variables \(x_1\) and \(x_2\). In Section 6.1, we’ll have two numerical explanatory variables. In Section 6.2, we’ll have one numerical and one categorical explanatory variable. In particular, we’ll consider two such models: interaction and parallel slopes models.

In Chapter 10 on inference for regression, we’ll revisit our regression models and analyze the results using the tools for statistical inference you’ll develop in Chapters 7, 8, and 9 on sampling, bootstrapping and confidence intervals, and hypothesis testing and \(p\)-values, respectively.

Let’s now begin with basic regression, which refers to linear regression models with a single explanatory variable \(x\). We’ll also discuss important statistical concepts like the correlation coefficient, the idea that “correlation isn’t necessarily causation,” and what it means for a line to be “best-fitting.”

Needed packages

Let’s now load all the packages needed for this chapter (this assumes you’ve already installed them). In this chapter, we introduce some new packages:

  1. The tidyverse “umbrella” (Wickham 2019b) package. Recall from our discussion in Section 4.4 that loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once:
    • ggplot2 for data visualization
    • dplyr for data wrangling
    • tidyr for converting data to “tidy” format
    • readr for importing spreadsheet data into R
    • As well as the more advanced purrr, tibble, stringr, and forcats packages
  2. The moderndive package of datasets and functions for tidyverse-friendly introductory linear regression.
  3. The skimr (Quinn et al. 2019) package, which provides a simple-to-use function to quickly compute a wide array of commonly used summary statistics.

If needed, read Section 1.3 for information on how to install and load R packages.
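The three packages listed above can be loaded as sketched below. This assumes all of them are already installed; the commented-out install.packages() line is included only as a reminder and should be run once, if at all.

```r
# One-time installation, if needed (uncomment to run):
# install.packages(c("tidyverse", "moderndive", "skimr"))

library(tidyverse)   # loads ggplot2, dplyr, tidyr, readr, and others
library(moderndive)  # datasets and regression helper functions
library(skimr)       # quick summary statistics
```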

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2017. An Introduction to Statistical Learning: With Applications in R. First. New York, NY: Springer.

Quinn, Michael, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2019. Skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.

Wickham, Hadley. 2019b. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.