## 5.3 Related topics

### 5.3.1 Correlation is not necessarily causation

Throughout this chapter we’ve been cautious when interpreting regression slope coefficients. We always discussed the “associated” effect of an explanatory variable \(x\) on an outcome variable \(y\). For example, our statement from Subsection 5.1.2 was that “for every increase of 1 unit in `bty_avg`, there is an *associated* increase of on average 0.067 units of `score`.” We include the term “associated” to be extra careful not to suggest we are making a *causal* statement. So while the “beauty” score `bty_avg` is positively correlated with teaching `score`, we can’t necessarily make any statements about “beauty” scores’ direct causal effect on teaching score without more information on how this study was conducted. Here is another example: a not-so-great medical doctor goes through medical records and finds that patients who slept with their shoes on tended to wake up more often with headaches. So this doctor declares, “Sleeping with shoes on causes headaches!”

However, there is a good chance that if someone is sleeping with their shoes on, it’s potentially because they are intoxicated from alcohol. Furthermore, higher levels of drinking lead to more hangovers, and hence more headaches. The amount of alcohol consumed here is what’s known as a *confounding/lurking* variable. It “lurks” behind the scenes, confounding the causal relationship (if any) of “sleeping with shoes on” with “waking up with a headache.” We can summarize this in Figure 5.11 with a *causal graph* where:

- Y is a *response* variable; here it is “waking up with a headache.”
- X is a *treatment* variable whose causal effect we are interested in; here it is “sleeping with shoes on.”

To study the relationship between Y and X, we could use a regression model where the outcome variable is set to Y and the explanatory variable is set to be X, as you’ve been doing throughout this chapter. However, Figure 5.11 also includes a third variable with arrows pointing at both X and Y:

- Z is a *confounding* variable that affects both X and Y, thereby “confounding” their relationship. Here the confounding variable is alcohol.

Alcohol will cause people to be both more likely to sleep with their shoes on as well as be more likely to wake up with a headache. Thus any regression model of the relationship between X and Y should also use Z as an explanatory variable. In other words, our doctor needs to take into account who had been drinking the night before. In the next chapter, we’ll start covering multiple regression models that allow us to incorporate more than one variable in our regression models.
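We can illustrate this idea with a small simulation. The following sketch uses made-up data (not the study from this chapter): a confounder `z` (think alcohol consumed) drives both a treatment `x` (shoes on) and an outcome `y` (headache severity), while `x` has no direct effect on `y` at all. All variable names and numbers here are hypothetical, chosen only for illustration:

```
set.seed(76)
n <- 1000
# Z: the confounder (e.g., amount of alcohol consumed)
z <- rnorm(n)
# X: the treatment, driven by Z plus noise; it has NO direct effect on Y
x <- z + rnorm(n)
# Y: the outcome, driven ONLY by Z
y <- 2 * z + rnorm(n)

# Regressing y on x alone: x looks strongly "associated" with y
coef(lm(y ~ x))["x"]
# Also including the confounder z: the coefficient on x shrinks toward 0
coef(lm(y ~ x + z))["x"]
```

The first regression finds a sizable slope for `x` purely because `z` lurks behind both variables; once `z` is included as an additional explanatory variable, the estimated effect of `x` nearly vanishes. This is exactly why the doctor needs to account for drinking, and it previews the multiple regression models of the next chapter.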

Establishing causation is a tricky problem and frequently takes either carefully designed experiments or methods to control for the effects of confounding variables. Both these approaches attempt, as best they can, either to take all possible confounding variables into account or negate their impact. This allows researchers to focus only on the relationship of interest: the relationship between the outcome variable Y and the treatment variable X.

As you read news stories, be careful not to fall into the trap of thinking that correlation necessarily implies causation. Check out the Spurious Correlations website for some rather comical examples of variables that are correlated, but are definitely not causally related.

### 5.3.2 Best-fitting line

Regression lines are also known as “best-fitting” lines. But what do we mean by “best”? Let’s unpack the criterion used in regression to determine “best.” Recall Figure 5.6, where for an instructor with a beauty score of \(x = 7.333\) we mark the *observed value* \(y\) with a circle, the *fitted value* \(\widehat{y}\) with a square, and the *residual* \(y - \widehat{y}\) with an arrow. We re-display Figure 5.6 in the top-left plot of Figure 5.12 in addition to three more arbitrarily chosen course instructors:

The three other plots refer to:

- A course whose instructor had a “beauty” score \(x = 2.333\) and teaching score \(y = 2.7\). The residual in this case is \(2.7 - 4.036 = -1.336\), which we mark with a new blue arrow in the top-right plot.
- A course whose instructor had a “beauty” score \(x = 3.667\) and teaching score \(y = 4.4\). The residual in this case is \(4.4 - 4.125 = 0.275\), which we mark with a new blue arrow in the bottom-left plot.
- A course whose instructor had a “beauty” score \(x = 6\) and teaching score \(y = 3.8\). The residual in this case is \(3.8 - 4.28 = -0.48\), which we mark with a new blue arrow in the bottom-right plot.

Now say we repeated this process of computing residuals for all 463 courses’ instructors, then we squared all the residuals, and then we summed them. We call this quantity the *sum of squared residuals*; it is a measure of the *lack of fit* of a model. Larger values of the sum of squared residuals indicate a bigger lack of fit. This corresponds to a worse fitting model.

If the regression line fits all the points perfectly, then the sum of squared residuals is 0. This is because if the regression line fits all the points perfectly, then the fitted value \(\widehat{y}\) equals the observed value \(y\) in all cases, and hence the residual \(y-\widehat{y}\) = 0 in all cases, and the sum of even a large number of 0’s is still 0.
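To see this concretely, here is a minimal sketch (with made-up collinear points, not data from this chapter) showing that a perfectly fitting line has a sum of squared residuals of essentially zero:

```
# Three points lying exactly on the line y = 1 + 2x
x <- c(1, 2, 3)
y <- 1 + 2 * x

# Fit a regression line through them
perfect_fit <- lm(y ~ x)

# Sum of squared residuals: 0 up to floating-point rounding
sum(resid(perfect_fit)^2)
```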

Furthermore, of all possible lines we can draw through the cloud of 463 points, the regression line minimizes this value. In other words, the regression line and its corresponding fitted values \(\widehat{y}_i\) minimize the sum of the squared residuals:

\[ \sum_{i=1}^{n}(y_i - \widehat{y}_i)^2 \]

Let’s use our data wrangling tools from Chapter 3 to compute the sum of squared residuals exactly:

```
# Fit regression model:
score_model <- lm(score ~ bty_avg, data = evals_ch5)
# Get regression points:
regression_points <- get_regression_points(score_model)
regression_points
```

```
# A tibble: 463 x 5
      ID score bty_avg score_hat residual
   <int> <dbl>   <dbl>     <dbl>    <dbl>
 1     1   4.7    5         4.21    0.486
 2     2   4.1    5         4.21   -0.114
 3     3   3.9    5         4.21   -0.314
 4     4   4.8    5         4.21    0.586
 5     5   4.6    3         4.08    0.52
 6     6   4.3    3         4.08    0.22
 7     7   2.8    3         4.08   -1.28
 8     8   4.1    3.33      4.10   -0.002
 9     9   3.4    3.33      4.10   -0.702
10    10   4.5    3.17      4.09    0.409
# … with 453 more rows
```

```
# Compute sum of squared residuals
regression_points %>%
  mutate(squared_residuals = residual^2) %>%
  summarize(sum_of_squared_residuals = sum(squared_residuals))
```

```
# A tibble: 1 x 1
  sum_of_squared_residuals
                     <dbl>
1                     132.
```

Any other straight line drawn in the figure would yield a sum of squared residuals greater than 132. This is a mathematically guaranteed fact that you can prove using calculus and linear algebra. That’s why alternative names for the linear regression line are the *best-fitting line* and the *least-squares line*. Why do we square the residuals (i.e., the arrow lengths)? So that both positive and negative deviations of the same amount are treated equally.
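We can check this minimizing property numerically. The following sketch uses simulated data as a stand-in for `evals_ch5` (the coefficients and noise level are made up for illustration): it compares the sum of squared residuals of the least-squares line against slightly perturbed lines.

```
# Simulated data standing in for the evals data
set.seed(76)
x <- runif(50, min = 1, max = 8)
y <- 3.88 + 0.067 * x + rnorm(50, sd = 0.5)
fit <- lm(y ~ x)

# Sum of squared residuals for any candidate line y = intercept + slope * x
sse <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

# SSE at the least-squares estimates ...
sse_lm <- sse(coef(fit)[1], coef(fit)[2])

# ... is smaller than the SSE of any other line, e.g. these perturbed ones:
sse_lm < sse(coef(fit)[1] + 0.1, coef(fit)[2])  # TRUE
sse_lm < sse(coef(fit)[1], coef(fit)[2] + 0.1)  # TRUE
```

No matter how we nudge the intercept or slope, the sum of squared residuals can only go up, which is exactly the "least-squares" property.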

*Learning check*

**(LC5.8)** Note in Figure 5.13 there are 3 points marked with dots and:

- The “best” fitting solid regression line in blue
- An arbitrarily chosen dotted red line
- Another arbitrarily chosen dashed green line

Compute the sum of squared residuals by hand for each line and show that of these three lines, the regression line in blue has the smallest value.

### 5.3.3 `get_regression_x()` functions

Recall in this chapter we introduced two functions from the `moderndive` package:

- `get_regression_table()`, which returns a regression table, in Subsection 5.1.2, and
- `get_regression_points()`, which returns point-by-point information from a regression model, in Subsection 5.1.3.

What is going on behind the scenes with the `get_regression_table()` and `get_regression_points()` functions? We mentioned in Subsection 5.1.2 that these were examples of *wrapper functions*. Such functions take other pre-existing functions and “wrap” them into single functions that hide their inner workings from the user. This way all the user needs to worry about is what the inputs and outputs look like. In this subsection, we’ll “get under the hood” of these functions and see how the “engine” of each wrapper function works.

Recall our two-step process to generate a regression table from Subsection 5.1.2:

```
# Fit regression model:
score_model <- lm(formula = score ~ bty_avg, data = evals_ch5)
# Get regression table:
get_regression_table(score_model)
```

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 3.880 | 0.076 | 50.96 | 0 | 3.731 | 4.030 |
| bty_avg | 0.067 | 0.016 | 4.09 | 0 | 0.035 | 0.099 |

The `get_regression_table()` wrapper function takes two pre-existing functions from other R packages:

- `tidy()` from the `broom` package (Robinson, Hayes, and Couch 2020) and
- `clean_names()` from the `janitor` package (Firke 2020)

and “wraps” them into a single function that takes in a saved `lm()` linear model, here `score_model`, and returns a regression table saved as a “tidy” data frame. Here is how we used the `tidy()` and `clean_names()` functions to produce Table 5.11:

```
library(broom)
library(janitor)

score_model %>%
  tidy(conf.int = TRUE) %>%
  mutate_if(is.numeric, round, digits = 3) %>%
  clean_names() %>%
  rename(lower_ci = conf_low, upper_ci = conf_high)
```

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| (Intercept) | 3.880 | 0.076 | 50.96 | 0 | 3.731 | 4.030 |
| bty_avg | 0.067 | 0.016 | 4.09 | 0 | 0.035 | 0.099 |

Yikes! That’s a lot of code! So, in order to simplify your lives, we made the editorial decision to “wrap” all this code into `get_regression_table()`, freeing you from the need to understand the inner workings of the function. Note that the `mutate_if()` function is from the `dplyr` package; it applies the `round()` function, rounding to three decimal places, only to those variables that are numeric.

Similarly, the `get_regression_points()` function is another wrapper function, but this time it returns information about the individual points involved in a regression model: the fitted values, observed values, and residuals. `get_regression_points()` uses the `augment()` function from the `broom` package instead of the `tidy()` function used by `get_regression_table()` to produce the data shown in Table 5.12:

```
library(broom)
library(janitor)

score_model %>%
  augment() %>%
  mutate_if(is.numeric, round, digits = 3) %>%
  clean_names() %>%
  select(-c("std_resid", "hat", "sigma", "cooksd"))
```

| score | bty_avg | fitted | resid |
|---|---|---|---|
| 4.7 | 5.00 | 4.21 | 0.486 |
| 4.1 | 5.00 | 4.21 | -0.114 |
| 3.9 | 5.00 | 4.21 | -0.314 |
| 4.8 | 5.00 | 4.21 | 0.586 |
| 4.6 | 3.00 | 4.08 | 0.520 |
| 4.3 | 3.00 | 4.08 | 0.220 |
| 2.8 | 3.00 | 4.08 | -1.280 |
| 4.1 | 3.33 | 4.10 | -0.002 |
| 3.4 | 3.33 | 4.10 | -0.702 |
| 4.5 | 3.17 | 4.09 | 0.409 |

In this case, it outputs only the variables of interest to students learning regression: the outcome variable \(y\) (`score`), all explanatory/predictor variables (`bty_avg`), all resulting `fitted` values \(\hat{y}\) obtained by applying the equation of the regression line to `bty_avg`, and the `resid`uals \(y - \hat{y}\).

If you’re even more curious about how these and other wrapper functions work, take a look at the source code for these functions on GitHub.

### References

Firke, Sam. 2020. *Janitor: Simple Tools for Examining and Cleaning Dirty Data*. https://CRAN.R-project.org/package=janitor.

Robinson, David, Alex Hayes, and Simon Couch. 2020. *Broom: Convert Statistical Objects into Tidy Tibbles*. https://CRAN.R-project.org/package=broom.