## 10.1 Regression refresher

Before jumping into inference for regression, let’s remind ourselves of the University of Texas at Austin teaching evaluations analysis from Section 5.1.

### 10.1.1 Teaching evaluations analysis

Recall that using simple linear regression, we modeled the relationship between:

- A numerical outcome variable \(y\) (the instructor’s teaching score) and
- A single numerical explanatory variable \(x\) (the instructor’s “beauty” score).

We first created an `evals_ch5` data frame that selected a subset of variables from the `evals` data frame included in the `moderndive` package. This `evals_ch5` data frame contains only the variables of interest for our analysis, in particular the instructor’s teaching `score` and the “beauty” rating `bty_avg`:

```
Rows: 463
Columns: 4
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4…
$ bty_avg <dbl> 5.00, 5.00, 5.00, 5.00, 3.00, 3.00, 3.00, 3.33, 3.33, 3.17, 3…
$ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 4…
```
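The output above can be reproduced with something like the following sketch. It assumes that the `moderndive` and `dplyr` packages are installed and that the four columns shown are selected by name from `evals`.

```r
library(moderndive) # provides the evals data frame
library(dplyr)      # provides select(), glimpse(), and the %>% pipe

# Keep only the four variables shown in the output above
evals_ch5 <- evals %>%
  select(ID, score, bty_avg, age)

# Print a compact overview of the resulting data frame
glimpse(evals_ch5)
```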

In Subsection 5.1.1, we performed an exploratory data analysis of the relationship between the two variables `score` and `bty_avg`. We saw there that a weakly positive correlation of 0.187 existed between the two variables.
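As a quick check, the correlation coefficient can be recomputed with base R's `cor()` function. This is only a sketch: it works directly from the `evals` data frame in `moderndive` so that it does not depend on any other objects, and the name `r` is chosen here purely for illustration.

```r
library(moderndive) # provides the evals data frame

# Pearson correlation coefficient between teaching score and "beauty" score
r <- cor(evals$score, evals$bty_avg)

# Rounded to three decimal places, for comparison with the value quoted above
round(r, 3)
```

The rounded result should agree with the 0.187 reported above.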

This weak positive relationship was evidenced by the scatterplot in Figure 10.1, shown along with the “best-fitting” regression line that summarizes the linear relationship between `score` and `bty_avg`. Recall from Subsection 5.3.2 that we defined the “best-fitting” line as the line that minimizes the *sum of squared residuals*.
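In symbols, for a candidate line \(\widehat{y} = b_0 + b_1 \cdot x\) and \(n\) observations, the quantity being minimized can be written as follows. The index \(i\), running over the observations, is introduced here only for illustration:

\[ \sum_{i=1}^{n} \left(y_i - \widehat{y}_i\right)^2 = \sum_{i=1}^{n} \left(y_i - (b_0 + b_1 \cdot x_i)\right)^2 \]

The “best-fitting” line is the choice of \(b_0\) and \(b_1\) that makes this sum as small as possible.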

```
library(ggplot2)

# Scatterplot of teaching score against beauty score with a fitted regression line
ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "Beauty Score",
       y = "Teaching Score",
       title = "Relationship between teaching and beauty scores") +
  geom_smooth(method = "lm", se = FALSE)
```

Looking at this plot again, you might be asking, “Does that line really have all that positive of a slope?” It does increase from left to right as the `bty_avg` variable increases, but by how much? To get this information, recall that we followed a two-step procedure:

- We first “fit” the linear regression model using the `lm()` function with the formula `score ~ bty_avg`. We saved this model in `score_model`.
- We get the regression table by applying the `get_regression_table()` function from the `moderndive` package to `score_model`.

```
# Fit regression model:
score_model <- lm(score ~ bty_avg, data = evals_ch5)
# Get regression table:
get_regression_table(score_model)
```

term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | 3.880 | 0.076 | 50.96 | 0 | 3.731 | 4.030 |
bty_avg | 0.067 | 0.016 | 4.09 | 0 | 0.035 | 0.099 |

Using the values in the `estimate` column of the resulting regression table in Table 10.1, we could then obtain the equation of the “best-fitting” regression line in Figure 10.1:

\[ \begin{aligned} \widehat{y} &= b_0 + b_1 \cdot x\\ \widehat{\text{score}} &= b_0 + b_{\text{bty}\_\text{avg}} \cdot\text{bty}\_\text{avg}\\ &= 3.880 + 0.067\cdot\text{bty}\_\text{avg} \end{aligned} \]

where \(b_0\) is the fitted intercept and \(b_1\) is the fitted slope for `bty_avg`. Recall the interpretation of the fitted slope \(b_1 = 0.067\):

For every increase of one unit in “beauty” rating, there is an associated increase, on average, of 0.067 units of evaluation score.
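To make this interpretation concrete, here is a small sketch that plugs two “beauty” ratings one unit apart into the fitted equation. The helper name `fitted_score()` and the ratings 4 and 5 are illustrative choices, not part of the original analysis.

```r
# Fitted intercept and slope, taken from the estimate column of the regression table
b0 <- 3.880
b1 <- 0.067

# Fitted teaching score for a given "beauty" rating (hypothetical helper)
fitted_score <- function(bty_avg) {
  b0 + b1 * bty_avg
}

fitted_score(4)                    # 3.880 + 0.067 * 4 = 4.148
fitted_score(5)                    # 3.880 + 0.067 * 5 = 4.215
fitted_score(5) - fitted_score(4)  # the difference equals the slope, 0.067
```

Whichever pair of ratings one unit apart is chosen, the difference in fitted scores is the slope.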

Thus, the slope value quantifies the relationship between the \(y\) variable `score` and the \(x\) variable `bty_avg`. We also discussed the intercept value of \(b_0 = 3.88\) and its lack of practical interpretation, since the range of possible “beauty” scores does not include 0.

### 10.1.2 Sampling scenario

Let’s now revisit this study in terms of the terminology and notation related to sampling we studied in Subsection 7.3.1.

First, let’s view the instructors for these 463 courses as a *representative sample* from a greater *study population*. In our case, let’s assume that the study population is *all* instructors at UT Austin and that the sample of instructors who taught these 463 courses is a representative sample. Unfortunately, we can only *assume* these two facts without more knowledge of the *sampling methodology* used by the researchers.

Since we are viewing these \(n = 463\) courses as a sample, we can view our fitted slope \(b_1 = 0.067\) as a *point estimate* of the *population slope* \(\beta_1\). In other words, \(\beta_1\) quantifies the relationship between teaching `score` and “beauty” average `bty_avg` for *all* instructors at UT Austin. Similarly, we can view our fitted intercept \(b_0 = 3.88\) as a *point estimate* of the *population intercept* \(\beta_0\) for *all* instructors at UT Austin.

Putting these two ideas together, we can view the equation of the fitted line \(\widehat{y}\) = \(b_0 + b_1 \cdot x\) = \(3.880 + 0.067 \cdot \text{bty}\_\text{avg}\) as an estimate of some true and unknown *population line* \(y = \beta_0 + \beta_1 \cdot x\). Thus we can draw parallels between our teaching evaluations analysis and all the sampling scenarios we’ve seen previously. In this chapter, we’ll focus on the final scenario of regression slopes as shown in Table 10.2.

Scenario | Population parameter | Notation | Point estimate | Symbol(s) |
---|---|---|---|---|
1 | Population proportion | \(p\) | Sample proportion | \(\widehat{p}\) |
2 | Population mean | \(\mu\) | Sample mean | \(\overline{x}\) or \(\widehat{\mu}\) |
3 | Difference in population proportions | \(p_1 - p_2\) | Difference in sample proportions | \(\widehat{p}_1 - \widehat{p}_2\) |
4 | Difference in population means | \(\mu_1 - \mu_2\) | Difference in sample means | \(\overline{x}_1 - \overline{x}_2\) |
5 | Population regression slope | \(\beta_1\) | Fitted regression slope | \(b_1\) or \(\widehat{\beta}_1\) |

Since we are now viewing our fitted slope \(b_1\) and fitted intercept \(b_0\) as *point estimates* based on a *sample*, these estimates will again be subject to *sampling variability*. In other words, if we collected a new sample of data on a different set of \(n\) = 463 courses and their instructors, the new fitted slope \(b_1\) will likely differ from 0.067. The same goes for the new fitted intercept \(b_0\). But by how much will these estimates *vary*? This information is in the remaining columns of the regression table in Table 10.1. Our knowledge of sampling from Chapter 7, confidence intervals from Chapter 8, and hypothesis tests from Chapter 9 will help us interpret these remaining columns.
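The idea of sampling variability in a fitted slope can be illustrated with a small simulation. The sketch below does not use the `evals` data. It generates *simulated* samples under made-up assumptions: the values 3.880 and 0.067 are treated as if they were the true \(\beta_0\) and \(\beta_1\), “beauty” ratings are drawn uniformly between 2 and 8, and the noise has a standard deviation of 0.5. The function name `fit_one_sample()` is hypothetical.

```r
set.seed(76) # for reproducibility of the simulated samples

n <- 463 # same sample size as in the teaching evaluations data

# Simulate one sample of n courses and return its fitted slope.
# All population values below are illustrative assumptions, not estimates from real data.
fit_one_sample <- function() {
  bty_avg <- runif(n, min = 2, max = 8)
  score <- 3.880 + 0.067 * bty_avg + rnorm(n, sd = 0.5)
  coef(lm(score ~ bty_avg))[["bty_avg"]]
}

# Repeat the sampling 200 times and collect the fitted slopes
slopes <- replicate(200, fit_one_sample())

summary(slopes) # the fitted slopes differ from sample to sample
sd(slopes)      # their spread is one way to quantify sampling variability
```

Each call to `fit_one_sample()` plays the role of collecting a new sample of 463 courses, and the spread of the values in `slopes` illustrates the sampling variability described above.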