## 6.3 Related topics

### 6.3.1 Model selection

When should we use an interaction model versus a parallel slopes model? Recall in Sections 6.1.2 and 6.1.3 we fit both interaction and parallel slopes models for the outcome variable \(y\) (teaching score) using a numerical explanatory variable \(x_1\) (age) and a categorical explanatory variable \(x_2\) (gender recorded as a binary variable). We compared these models in Figure 6.3, which we display again now.

A lot of you might have asked yourselves: “Why would I force the lines to have parallel slopes (as seen in the right-hand plot) when they clearly have different slopes (as seen in the left-hand plot)?”.

The answer lies in a philosophical principle known as “Occam’s Razor.” It states that, “all other things being equal, simpler solutions are more likely to be correct than complex ones.” When viewed in a modeling framework, Occam’s Razor can be restated as, “all other things being equal, simpler models are to be preferred over complex ones.” In other words, we should only favor the more complex model if the additional complexity is *warranted*.

Let’s revisit the equations for the regression line for both the interaction and parallel slopes model:

\[ \begin{aligned} \text{Interaction} &: \widehat{y} = \widehat{\text{score}} = b_0 + b_{\text{age}} \cdot \text{age} + b_{\text{male}} \cdot \mathbb{1}_{\text{is male}}(x) + \\ & \qquad b_{\text{age,male}} \cdot \text{age} \cdot \mathbb{1}_{\text{is male}}(x)\\ \text{Parallel slopes} &: \widehat{y} = \widehat{\text{score}} = b_0 + b_{\text{age}} \cdot \text{age} + b_{\text{male}} \cdot \mathbb{1}_{\text{is male}}(x) \end{aligned} \]

The interaction model is “more complex” in that there is an additional \(b_{\text{age,male}} \cdot \text{age} \cdot \mathbb{1}_{\text{is male}}(x)\) interaction term in the equation not present for the parallel slopes model. Viewed alternatively, the regression table for the interaction model in Table 6.3 has *four* rows, whereas the regression table for the parallel slopes model in Table 6.5 has *three* rows. The question becomes: “Is this additional complexity warranted?”. In this case, it can be argued that it is, as evidenced by the clear X-shaped pattern of the two regression lines in the left-hand plot of Figure 6.7.
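Both models can be fit with `lm()`; a sketch, assuming the `evals_ch6` data frame and the `score`, `age`, and `gender` variable names used in Sections 6.1.2 and 6.1.3:

```
# Interaction model: score ~ age * gender is shorthand for
# age + gender + age:gender, yielding four coefficients
score_model_interaction <- lm(score ~ age * gender, data = evals_ch6)

# Parallel slopes model: drop the age:gender term, yielding three coefficients
score_model_parallel_slopes <- lm(score ~ age + gender, data = evals_ch6)
```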

However, let’s consider an example where the additional complexity might *not* be warranted. Let’s consider the `MA_schools` data included in the `moderndive` package, which contains 2017 data on Massachusetts public high schools provided by the Massachusetts Department of Education. For more details, read the help file for this data by running `?MA_schools` in the console.

Let’s model the numerical outcome variable \(y\), average SAT math score for a given high school, as a function of two explanatory variables:

- A numerical explanatory variable \(x_1\), the percentage of that high school’s student body that is economically disadvantaged, and
- A categorical explanatory variable \(x_2\), the school size as measured by enrollment: small (13-341 students), medium (342-541 students), and large (542-4264 students).

Let’s create visualizations of both the interaction and parallel slopes model once again and display the output in Figure 6.8. Recall from Subsection 6.1.3 that the `geom_parallel_slopes()` function is a special-purpose function included in the `moderndive` package, since the `geom_smooth()` method in the `ggplot2` package does not have a convenient way to plot parallel slopes models.

```
# Interaction model
ggplot(MA_schools,
       aes(x = perc_disadvan, y = average_sat_math, color = size)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Percent economically disadvantaged", y = "Math SAT Score",
       color = "School size", title = "Interaction model")
```

```
# Parallel slopes model
ggplot(MA_schools,
       aes(x = perc_disadvan, y = average_sat_math, color = size)) +
  geom_point(alpha = 0.25) +
  geom_parallel_slopes(se = FALSE) +
  labs(x = "Percent economically disadvantaged", y = "Math SAT Score",
       color = "School size", title = "Parallel slopes model")
```

Look closely at the left-hand plot of Figure 6.8, corresponding to the interaction model. While the three slopes are indeed different, they are nearly identical. Now compare the left-hand plot with the right-hand plot, corresponding to the parallel slopes model. The two models don’t appear all that different. So in this case, it can be argued that the additional complexity of the interaction model is *not warranted*, and thus, following Occam’s Razor, we should prefer the “simpler” parallel slopes model. Let’s explicitly define what “simpler” means in this case by comparing the regression tables for the interaction and parallel slopes models in Tables 6.12 and 6.13.

```
model_2_interaction <- lm(average_sat_math ~ perc_disadvan * size,
                          data = MA_schools)
get_regression_table(model_2_interaction)
```

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 594.327 | 13.288 | 44.726 | 0.000 | 568.186 | 620.469 |
| perc_disadvan | -2.932 | 0.294 | -9.961 | 0.000 | -3.511 | -2.353 |
| sizemedium | -17.764 | 15.827 | -1.122 | 0.263 | -48.899 | 13.371 |
| sizelarge | -13.293 | 13.813 | -0.962 | 0.337 | -40.466 | 13.880 |
| perc_disadvan:sizemedium | 0.146 | 0.371 | 0.393 | 0.694 | -0.585 | 0.877 |
| perc_disadvan:sizelarge | 0.189 | 0.323 | 0.586 | 0.559 | -0.446 | 0.824 |

```
model_2_parallel_slopes <- lm(average_sat_math ~ perc_disadvan + size,
                              data = MA_schools)
get_regression_table(model_2_parallel_slopes)
```

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 588.19 | 7.607 | 77.325 | 0.000 | 573.23 | 603.15 |
| perc_disadvan | -2.78 | 0.106 | -26.120 | 0.000 | -2.99 | -2.57 |
| sizemedium | -11.91 | 7.535 | -1.581 | 0.115 | -26.74 | 2.91 |
| sizelarge | -6.36 | 6.923 | -0.919 | 0.359 | -19.98 | 7.26 |

Observe how the regression table for the interaction model has 2 more rows (6 versus 4). This reflects the additional “complexity” of the interaction model over the parallel slopes model.

Furthermore, note in Table 6.12 how the *offsets for the slopes*, 0.146 for `perc_disadvan:sizemedium` and 0.189 for `perc_disadvan:sizelarge`, are small relative to the *slope for the baseline group* of small schools, \(-2.932\). In other words, all three slopes are similarly negative: \(-2.932\) for small schools, \(-2.786\) \((= -2.932 + 0.146)\) for medium schools, and \(-2.743\) \((= -2.932 + 0.189)\) for large schools. These results suggest that irrespective of school size, the relationship between average math SAT scores and the percentage of the student body that is economically disadvantaged is similar and, alas, quite negative.
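These group-specific slopes can be recovered directly from the fitted coefficients; a sketch using the `model_2_interaction` object fit above (the coefficient names follow R's default `term:level` convention):

```
coefs <- coef(model_2_interaction)

# Slope for the baseline group (small schools)
slope_small <- coefs["perc_disadvan"]                            # -2.932
# Add each offset to the baseline slope to get the other groups' slopes
slope_medium <- slope_small + coefs["perc_disadvan:sizemedium"]  # -2.786
slope_large  <- slope_small + coefs["perc_disadvan:sizelarge"]   # -2.743
```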

What you have just performed is a rudimentary form of *model selection*: choosing the model that best fits the data from among a set of candidate models. While the model selection approach we just took was visual in nature and hence somewhat qualitative, more statistically rigorous methods for model selection exist in the fields of multiple regression and statistical/machine learning.
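For instance, one standard quantitative check (beyond the scope of this book) is a partial F-test comparing the two nested models; a sketch using base R's `anova()` with the models fit above:

```
# A large p-value for this comparison suggests the two interaction
# terms do not significantly improve the fit, favoring the
# simpler parallel slopes model under Occam's Razor
anova(model_2_parallel_slopes, model_2_interaction)
```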

### 6.3.2 Correlation coefficient

Recall from Table 6.9 that the correlation coefficient between `income` in thousands of dollars and credit card `debt` was 0.464. What if instead we looked at the correlation coefficient between `income` and credit card `debt`, but where `income` was in dollars and not thousands of dollars? This can be done by multiplying `income` by 1000.

| | debt | income |
|---|---|---|
| debt | 1.000 | 0.464 |
| income | 0.464 | 1.000 |

We see it is the same! We say that the correlation coefficient is *invariant to positive linear transformations*: the correlation between \(x\) and \(y\) will be the same as the correlation between \(a \cdot x + b\) and \(y\) for any numerical values \(a > 0\) and \(b\). (If \(a < 0\), the magnitude of the correlation is unchanged but its sign flips.)
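This can be checked directly; a quick sketch, assuming the `credit_ch6` data frame used later in this chapter, with its `income` (in thousands of dollars) and `debt` variables:

```
# Rescaling income from thousands of dollars to dollars is a
# linear transformation with a = 1000 and b = 0
cor(credit_ch6$income, credit_ch6$debt)
cor(credit_ch6$income * 1000, credit_ch6$debt)
# Both return the same value, 0.464
```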

### 6.3.3 Simpson’s Paradox

Recall in Section 6.2, we saw two seemingly contradictory results when studying the relationship between credit card `debt` and `income`. On the one hand, the right-hand plot of Figure 6.5 suggested that the relationship between credit card `debt` and `income` was *positive*. We re-display this in Figure 6.9.

On the other hand, the multiple regression results in Table 6.10 suggested that the relationship between `debt` and `income` was *negative*. We re-display this information in Table 6.15.

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -385.179 | 19.465 | -19.8 | 0 | -423.446 | -346.912 |
| credit_limit | 0.264 | 0.006 | 45.0 | 0 | 0.253 | 0.276 |
| income | -7.663 | 0.385 | -19.9 | 0 | -8.420 | -6.906 |

Observe how the slope for `income` is \(-7.663\) and, most importantly for now, it is negative. This contradicts our observation in Figure 6.9 that the relationship is positive. How can this be? Recall the interpretation of the slope for `income` in the context of a multiple regression model: *taking into account all the other explanatory variables in our model*, for every increase of one unit in `income` (i.e., $1000), there is an associated decrease of on average $7.663 in `debt`.

In other words, while in *isolation* the relationship between `debt` and `income` may be positive, when taking into account `credit_limit` as well, this relationship becomes negative. These seemingly paradoxical results are due to a phenomenon aptly named *Simpson’s Paradox*. Simpson’s Paradox occurs when trends that exist for the data in aggregate either disappear or reverse when the data are broken down into groups.

Let’s show how Simpson’s Paradox manifests itself in the `credit_ch6` data. Let’s first visualize the distribution of the numerical explanatory variable `credit_limit` with a histogram in Figure 6.10.

The vertical dashed lines are the *quartiles* that cut up the variable `credit_limit` into four equally sized groups. Let’s think of these quartiles as converting our numerical variable `credit_limit` into a categorical variable “`credit_limit` bracket” with four levels. This means that

- 25% of credit limits were between $0 and $3088. Let’s assign these 100 people to the “low” `credit_limit` bracket.
- 25% of credit limits were between $3088 and $4622. Let’s assign these 100 people to the “medium-low” `credit_limit` bracket.
- 25% of credit limits were between $4622 and $5873. Let’s assign these 100 people to the “medium-high” `credit_limit` bracket.
- 25% of credit limits were over $5873. Let’s assign these 100 people to the “high” `credit_limit` bracket.
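This bracketing can be reproduced with `quantile()` and `cut()`; a sketch assuming the `credit_ch6` data frame (the `limit_bracket` variable name is ours):

```
library(dplyr)

credit_ch6 <- credit_ch6 %>%
  mutate(limit_bracket = cut(
    credit_limit,
    # Split at the minimum, the three quartiles, and the maximum
    breaks = quantile(credit_limit, probs = c(0, 0.25, 0.5, 0.75, 1)),
    labels = c("low", "medium-low", "medium-high", "high"),
    include.lowest = TRUE
  ))
```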

Now in Figure 6.11 let’s re-display two versions of the scatterplot of `debt` and `income` from Figure 6.9, but with a slight twist:

- The left-hand plot shows the regular scatterplot and the single regression line, just as you saw in Figure 6.9.
- The right-hand plot shows the *colored scatterplot*, where the color aesthetic is mapped to “`credit_limit` bracket.” Furthermore, there are now four separate regression lines.

In other words, the locations of the 400 points are the same in both scatterplots, but the right-hand plot shows an additional variable of information: `credit_limit` bracket.
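The right-hand plot can be sketched by mapping the bracket variable to the color aesthetic, assuming a `limit_bracket` variable such as the one just described has been added to `credit_ch6`:

```
ggplot(credit_ch6, aes(x = income, y = debt, color = limit_bracket)) +
  geom_point() +
  # One regression line per credit_limit bracket
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Income (in $1000)", y = "Credit card debt (in $)",
       color = "Credit limit bracket")
```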

The left-hand plot of Figure 6.11 focuses on the relationship between `debt` and `income` in *aggregate*. It suggests that overall there exists a positive relationship between `debt` and `income`. However, the right-hand plot of Figure 6.11 focuses on the relationship between `debt` and `income` *broken down by `credit_limit` bracket*. In other words, we focus on four *separate* relationships between `debt` and `income`: one for the “low” `credit_limit` bracket, one for the “medium-low” `credit_limit` bracket, and so on.

Observe in the right-hand plot that the relationship between `debt` and `income` is clearly negative for the “medium-low” and “medium-high” `credit_limit` brackets, while the relationship is somewhat flat for the “low” `credit_limit` bracket. The only bracket where the relationship remains positive is the “high” `credit_limit` bracket. However, this relationship is less positive than the relationship in aggregate, since its slope is shallower than the slope of the regression line in the left-hand plot.

In this example of Simpson’s Paradox, `credit_limit` is a *confounding variable* of the relationship between credit card `debt` and `income`, as we defined in Subsection 5.3.1. Thus, `credit_limit` needs to be accounted for in any appropriate model for the relationship between `debt` and `income`.