## 10.1 Regression refresher

Before jumping into inference for regression, let’s remind ourselves of the University of Texas at Austin teaching evaluations analysis in Section 5.1.

### 10.1.1 Teaching evaluations analysis

Recall using simple linear regression we modeled the relationship between

1. A numerical outcome variable $$y$$ (the instructor’s teaching score) and
2. A single numerical explanatory variable $$x$$ (the instructor’s “beauty” score).

We first created an evals_ch5 data frame by selecting a subset of the variables from the evals data frame included in the moderndive package. This evals_ch5 data frame contains only the variables of interest for our analysis, in particular the instructor’s teaching score and the instructor’s “beauty” rating bty_avg:

```r
evals_ch5 <- evals %>%
  select(ID, score, bty_avg, age)
glimpse(evals_ch5)
```

```
Rows: 463
Columns: 4
$ ID      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ score   <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4…
$ bty_avg <dbl> 5.00, 5.00, 5.00, 5.00, 3.00, 3.00, 3.00, 3.33, 3.33, 3.17, 3…
$ age     <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 4…
```

In Subsection 5.1.1, we performed an exploratory data analysis of the relationship between these two variables of score and bty_avg. We saw there that a weakly positive correlation of 0.187 existed between the two variables.
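As a reminder, this correlation coefficient can be recomputed directly from evals_ch5 using the get_correlation() function from the moderndive package. A minimal sketch, assuming the dplyr and moderndive packages are loaded:

```r
library(dplyr)
library(moderndive)

evals_ch5 <- evals %>%
  select(ID, score, bty_avg, age)

# Correlation coefficient between teaching score and "beauty" score.
# As noted above, this value is 0.187, a weakly positive correlation.
evals_ch5 %>%
  get_correlation(formula = score ~ bty_avg)
```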

This relationship was evidenced in the scatterplot in Figure 10.1, along with the “best-fitting” regression line that summarizes the linear relationship between the two variables of score and bty_avg. Recall from Subsection 5.3.2 that we defined the “best-fitting” line as the line that minimizes the sum of squared residuals.

```r
ggplot(evals_ch5,
       aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "Beauty Score",
       y = "Teaching Score",
       title = "Relationship between teaching and beauty scores") +
  geom_smooth(method = "lm", se = FALSE)
```

Looking at this plot again, you might be asking, “Does that line really have all that positive of a slope?” It does increase from left to right as the bty_avg variable increases, but by how much? To get at this information, recall that we followed a two-step procedure:

1. We first “fit” the linear regression model using the lm() function with the formula score ~ bty_avg. We saved this model in score_model.
2. We then obtained the regression table by applying the get_regression_table() function from the moderndive package to score_model.
```r
# Fit regression model:
score_model <- lm(score ~ bty_avg, data = evals_ch5)
# Get regression table:
get_regression_table(score_model)
```

TABLE 10.1: Previously seen linear regression table

| term      | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|-----------|----------|-----------|-----------|---------|----------|----------|
| intercept | 3.880    | 0.076     | 50.96     | 0       | 3.731    | 4.030    |
| bty_avg   | 0.067    | 0.016     | 4.09      | 0       | 0.035    | 0.099    |

Using the values in the estimate column of the resulting regression table in Table 10.1, we could then obtain the equation of the “best-fitting” regression line in Figure 10.1:

$$\begin{aligned} \widehat{y} &= b_0 + b_1 \cdot x\\ \widehat{\text{score}} &= b_0 + b_{\text{bty}\_\text{avg}} \cdot\text{bty}\_\text{avg}\\ &= 3.880 + 0.067\cdot\text{bty}\_\text{avg} \end{aligned}$$

where $$b_0$$ is the fitted intercept and $$b_1$$ is the fitted slope for bty_avg. Recall the interpretation of the $$b_1$$ = 0.067 value of the fitted slope:

> For every increase of one unit in “beauty” rating, there is an associated increase, on average, of 0.067 units of evaluation score.

Thus, the slope value quantifies the relationship between the $$y$$ variable score and the $$x$$ variable bty_avg. We also discussed the intercept value of $$b_0$$ = 3.88 and its lack of practical interpretation, since the range of possible “beauty” scores does not include 0.
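To make this interpretation concrete, consider a hypothetical instructor with a “beauty” rating of 5 (a value chosen for illustration; it lies within the observed range of bty_avg). The fitted model predicts a teaching score of

$$\widehat{\text{score}} = 3.880 + 0.067 \cdot 5 = 3.880 + 0.335 = 4.215$$

whereas an otherwise identical instructor with a rating of 4 would have a predicted score of $$3.880 + 0.067 \cdot 4 = 4.148$$, exactly 0.067 units lower.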

### 10.1.2 Sampling scenario

Let’s now revisit this study in terms of the terminology and notation related to sampling we studied in Subsection 7.3.1.

First, let’s view the instructors for these 463 courses as a representative sample from a greater study population. In our case, let’s assume that the study population is all instructors at UT Austin and that the sample of instructors who taught these 463 courses is a representative sample. Unfortunately, we can only assume these two facts without more knowledge of the sampling methodology used by the researchers.

Since we are viewing these $$n$$ = 463 courses as a sample, we can view our fitted slope $$b_1$$ = 0.067 as a point estimate of the population slope $$\beta_1$$. In other words, $$\beta_1$$ quantifies the relationship between teaching score and “beauty” average bty_avg for all instructors at UT Austin. Similarly, we can view our fitted intercept $$b_0$$ = 3.88 as a point estimate of the population intercept $$\beta_0$$ for all instructors at UT Austin.

Putting these two ideas together, we can view the equation of the fitted line $$\widehat{y}$$ = $$b_0 + b_1 \cdot x$$ = $$3.880 + 0.067 \cdot \text{bty}\_\text{avg}$$ as an estimate of some true and unknown population line $$y = \beta_0 + \beta_1 \cdot x$$. Thus we can draw parallels between our teaching evaluations analysis and all the sampling scenarios we’ve seen previously. In this chapter, we’ll focus on the final scenario of regression slopes as shown in Table 10.2.

TABLE 10.2: Scenarios of sampling for inference

| Scenario | Population parameter                 | Notation          | Point estimate                   | Symbol(s)                               |
|----------|--------------------------------------|-------------------|----------------------------------|-----------------------------------------|
| 1        | Population proportion                | $$p$$             | Sample proportion                | $$\widehat{p}$$                         |
| 2        | Population mean                      | $$\mu$$           | Sample mean                      | $$\overline{x}$$ or $$\widehat{\mu}$$   |
| 3        | Difference in population proportions | $$p_1 - p_2$$     | Difference in sample proportions | $$\widehat{p}_1 - \widehat{p}_2$$       |
| 4        | Difference in population means       | $$\mu_1 - \mu_2$$ | Difference in sample means       | $$\overline{x}_1 - \overline{x}_2$$     |
| 5        | Population regression slope          | $$\beta_1$$       | Fitted regression slope          | $$b_1$$ or $$\widehat{\beta}_1$$        |

Since we are now viewing our fitted slope $$b_1$$ and fitted intercept $$b_0$$ as point estimates based on a sample, these estimates will again be subject to sampling variability. In other words, if we collected a new sample of data on a different set of $$n$$ = 463 courses and their instructors, the new fitted slope $$b_1$$ will likely differ from 0.067. The same goes for the new fitted intercept $$b_0$$. But by how much will these estimates vary? This information is in the remaining columns of the regression table in Table 10.1. Our knowledge of sampling from Chapter 7, confidence intervals from Chapter 8, and hypothesis tests from Chapter 9 will help us interpret these remaining columns.
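One way to see this sampling variability concretely is to bootstrap the fitted slope: resample the 463 courses with replacement many times and refit the regression each time. The following is only a preview sketch of this idea, assuming the infer and moderndive packages are loaded; the slight spread of the resulting slopes around the observed $$b_1$$ = 0.067 is precisely the sampling variability that the remaining columns of Table 10.1 quantify.

```r
library(dplyr)
library(moderndive)
library(infer)

# Bootstrap distribution of the fitted slope b1:
# resample the 463 rows with replacement 1000 times,
# refitting the regression of score on bty_avg each time.
bootstrap_slopes <- evals %>%
  select(ID, score, bty_avg) %>%
  specify(score ~ bty_avg) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope")

# Each replicate's slope varies around the observed b1 = 0.067
visualize(bootstrap_slopes)
```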