## 6.2 Two numerical explanatory variables

Let’s now switch gears and consider multiple regression models where instead of one numerical and one categorical explanatory variable, we now have two numerical explanatory variables. The dataset we’ll use is from *An Introduction to Statistical Learning with Applications in R (ISLR)*, an intermediate-level textbook on statistical and machine learning (James et al. 2017). Its accompanying `ISLR` R package contains the datasets to which the authors apply various machine learning methods.

One frequently used dataset in this book is the `Credit` dataset, where the outcome variable of interest is the credit card debt of 400 individuals. Other variables like income, credit limit, credit rating, and age are included as well. Note that the `Credit` data is not based on real individuals’ financial information, but rather is a simulated dataset used for educational purposes.

In this section, we’ll fit a regression model where we have

- A numerical outcome variable \(y\), the cardholder’s credit card debt
- Two explanatory variables:
    - One numerical explanatory variable \(x_1\), the cardholder’s credit limit
    - Another numerical explanatory variable \(x_2\), the cardholder’s income (in thousands of dollars)

### 6.2.1 Exploratory data analysis

Let’s load the `Credit` dataset. To keep things simple, let’s `select()` the subset of the variables we’ll consider in this chapter, and save this data in the new data frame `credit_ch6`. Notice our slightly different use of the `select()` verb here than we introduced in Subsection 3.8.1. For example, we’ll select the `Balance` variable from `Credit` but then save it with a new variable name `debt`. We do this because here the term “debt” is easier to interpret than “balance.”

```
library(ISLR)
credit_ch6 <- Credit %>%
  as_tibble() %>%
  select(ID, debt = Balance, credit_limit = Limit,
         income = Income, credit_rating = Rating, age = Age)
```

You can observe the effect of our use of `select()` in the first common step of an exploratory data analysis: looking at the raw values either in RStudio’s spreadsheet viewer or by using `glimpse()`.
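The output shown below can be produced with:

```
glimpse(credit_ch6)
```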

```
Rows: 400
Columns: 6
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ debt <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 140…
$ credit_limit <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6…
$ income <dbl> 14.9, 106.0, 104.6, 148.9, 55.9, 80.2, 21.0, 71.4, 15.1…
$ credit_rating <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, …
$ age <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49,…
```

Furthermore, let’s look at a random sample of five out of the 400 credit card holders in Table 6.8. Once again, note that due to the random nature of the sampling, you will likely end up with a different subset of five rows.

ID | debt | credit_limit | income | credit_rating | age |
---|---|---|---|---|---|
272 | 436 | 4866 | 45.0 | 347 | 30 |
239 | 52 | 2910 | 26.5 | 236 | 58 |
87 | 815 | 6340 | 55.4 | 448 | 33 |
108 | 0 | 3189 | 39.1 | 263 | 72 |
149 | 0 | 2420 | 15.2 | 192 | 69 |

Now that we’ve looked at the raw values in our `credit_ch6` data frame and gotten a sense of the data, let’s move on to the next common step in an exploratory data analysis: computing summary statistics. Let’s use the `skim()` function from the `skimr` package, being sure to only `select()` the columns of interest for our model:
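A call along these lines produces the summary shown below (a sketch; the exact output layout depends on your version of `skimr`):

```
library(skimr)
credit_ch6 %>%
  select(debt, credit_limit, income) %>%
  skim()
```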

```
Skim summary statistics
n obs: 400
n variables: 3
── Variable type:integer
variable missing complete n mean sd p0 p25 p50 p75 p100
credit_limit 0 400 400 4735.6 2308.2 855 3088 4622.5 5872.75 13913
debt 0 400 400 520.01 459.76 0 68.75 459.5 863 1999
── Variable type:numeric
variable missing complete n mean sd p0 p25 p50 p75 p100
income 0 400 400 45.22 35.24 10.35 21.01 33.12 57.47 186.63
```

Observe the summary statistics for the outcome variable `debt`: the mean and median credit card debt are $520.01 and $459.50, respectively, and 25% of card holders had debts of $68.75 or less. Now look at one of the explanatory variables, `credit_limit`: the mean and median credit card limit are $4735.60 and $4622.50, respectively. For the other explanatory variable, `income`, 75% of card holders had incomes of $57,470 or less.

Since our outcome variable `debt` and the explanatory variables `credit_limit` and `income` are numerical, we can compute the correlation coefficient between the different possible pairs of these variables. First, we can run the `get_correlation()` command as seen in Subsection 5.1.1 twice, once for each explanatory variable:
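For example (assuming the `moderndive` package, which provides `get_correlation()`, is loaded):

```
credit_ch6 %>%
  get_correlation(formula = debt ~ credit_limit)
credit_ch6 %>%
  get_correlation(formula = debt ~ income)
```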

Or we can simultaneously compute them by returning a *correlation matrix* which we display in Table 6.9. We can see the correlation coefficient for any pair of variables by looking them up in the appropriate row/column combination.
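One way to compute such a matrix is with base R’s `cor()` function (a sketch; the values in Table 6.9 are rounded to three decimal places):

```
credit_ch6 %>%
  select(debt, credit_limit, income) %>%
  cor()
```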

 | debt | credit_limit | income |
---|---|---|---|
debt | 1.000 | 0.862 | 0.464 |
credit_limit | 0.862 | 1.000 | 0.792 |
income | 0.464 | 0.792 | 1.000 |

For example, the correlation coefficient of:

- `debt` with itself is 1, as we would expect based on the definition of the correlation coefficient.
- `debt` with `credit_limit` is 0.862. This indicates a strong positive linear relationship, which makes sense as only individuals with large credit limits can accrue large credit card debts.
- `debt` with `income` is 0.464. This is suggestive of another positive linear relationship, although not as strong as the relationship between `debt` and `credit_limit`.
- As an added bonus, we can read off the correlation coefficient between the two explanatory variables `credit_limit` and `income` as 0.792.

We say there is a high degree of *collinearity* between the `credit_limit` and `income` explanatory variables. Collinearity (or multicollinearity) is a phenomenon where one explanatory variable in a multiple regression model is highly correlated with another.

So in our case, since `credit_limit` and `income` are highly correlated, if we knew someone’s `credit_limit`, we could make pretty good guesses about their `income` as well. Thus, these two variables provide somewhat redundant information. However, we’ll leave discussion of how to work with collinear explanatory variables to a more intermediate-level book on regression modeling.

Let’s visualize the relationship of the outcome variable with each of the two explanatory variables in two separate plots in Figure 6.5.

```
ggplot(credit_ch6, aes(x = credit_limit, y = debt)) +
  geom_point() +
  labs(x = "Credit limit (in $)", y = "Credit card debt (in $)",
       title = "Debt and credit limit") +
  geom_smooth(method = "lm", se = FALSE)

ggplot(credit_ch6, aes(x = income, y = debt)) +
  geom_point() +
  labs(x = "Income (in $1000)", y = "Credit card debt (in $)",
       title = "Debt and income") +
  geom_smooth(method = "lm", se = FALSE)
```

Observe there is a positive relationship between credit limit and credit card debt: as credit limit increases, so does credit card debt. This is consistent with the strongly positive correlation coefficient of 0.862 we computed earlier. In the case of income, the positive relationship doesn’t appear as strong, given the weakly positive correlation coefficient of 0.464.

However, the two plots in Figure 6.5 only focus on the relationship of the outcome variable with each of the two explanatory variables *separately*. To visualize the *joint* relationship of all three variables simultaneously, we need a 3-dimensional (3D) scatterplot as seen in Figure 6.6. Each of the 400 observations in the `credit_ch6` data frame is marked with a blue point where

- The numerical outcome variable \(y\), `debt`, is on the vertical axis.
- The two numerical explanatory variables, \(x_1\) `income` and \(x_2\) `credit_limit`, are on the two axes that form the bottom plane.

Furthermore, we also include the *regression plane*. Recall from Subsection 5.3.2 that regression lines are “best-fitting” in that of all possible lines we can draw through a cloud of points, the regression line minimizes the *sum of squared residuals*. This concept also extends to models with two numerical explanatory variables. The difference is instead of a “best-fitting” line, we now have a “best-fitting” plane that similarly minimizes the sum of squared residuals. Head to this website to open an interactive version of this plot in your browser.

*Learning check*

**(LC6.2)** Conduct a new exploratory data analysis with the same outcome variable \(y\) `debt` but with `credit_rating` and `age` as the new explanatory variables \(x_1\) and \(x_2\). What can you say about the relationship between a credit card holder’s debt and their credit rating and age?

### 6.2.2 Regression plane

Let’s now fit a regression model and get the regression table corresponding to the regression plane in Figure 6.6. To keep things brief in this subsection, we won’t consider an interaction model for the two numerical explanatory variables `income` and `credit_limit` like we did in Subsection 6.1.2 using the model formula `score ~ age * gender`. Rather, we’ll only consider a model fit with a formula of the form `y ~ x1 + x2`. Note, however, that since we now have a regression plane instead of multiple lines, the label “parallel slopes” doesn’t apply when you have two numerical explanatory variables. Just as we have done multiple times throughout Chapter 5 and this chapter, the regression table for this model using our two-step process is in Table 6.10.

```
# Fit regression model:
debt_model <- lm(debt ~ credit_limit + income, data = credit_ch6)
# Get regression table:
get_regression_table(debt_model)
```

term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|
intercept | -385.179 | 19.465 | -19.8 | 0 | -423.446 | -346.912 |
credit_limit | 0.264 | 0.006 | 45.0 | 0 | 0.253 | 0.276 |
income | -7.663 | 0.385 | -19.9 | 0 | -8.420 | -6.906 |

- We first “fit” the linear regression model using the `lm(y ~ x1 + x2, data)` function and save it in `debt_model`.
- We get the regression table by applying the `get_regression_table()` function from the `moderndive` package to `debt_model`.

Let’s interpret the three values in the `estimate` column. First, the `intercept` value is -$385.179. This intercept represents the credit card debt for an individual who has a `credit_limit` of $0 and `income` of $0. In our data, however, the intercept has no practical interpretation since no individuals had `credit_limit` or `income` values of $0. Rather, the intercept is used to situate the regression plane in 3D space.

Second, the `credit_limit` value is $0.264. Taking into account all the other explanatory variables in our model, for every increase of one dollar in `credit_limit`, there is an associated increase of, on average, $0.26 in credit card debt. Just as we did in Subsection 5.1.2, we are cautious *not* to imply causality, as we saw in Subsection 5.3.1 that “correlation is not necessarily causation.” We do this by merely stating there was an *associated* increase.

Furthermore, we preface our interpretation with the statement, “taking into account all the other explanatory variables in our model.” Here, by all other explanatory variables we mean `income`. We do this to emphasize that we are now jointly interpreting the associated effect of multiple explanatory variables in the same model at the same time.

Third, the `income` value is -$7.663. Taking into account all other explanatory variables in our model, for every increase of one unit of `income` ($1000 in actual income), there is an associated decrease of, on average, $7.66 in credit card debt.

Putting these results together, the equation of the regression plane that gives us fitted values \(\widehat{y}\) = \(\widehat{\text{debt}}\) is:

\[ \begin{aligned} \widehat{y} &= b_0 + b_1 \cdot x_1 + b_2 \cdot x_2\\ \widehat{\text{debt}} &= b_0 + b_{\text{limit}} \cdot \text{limit} + b_{\text{income}} \cdot \text{income}\\ &= -385.179 + 0.264 \cdot \text{limit} - 7.663 \cdot \text{income} \end{aligned} \]
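As a quick sanity check, we can plug the first cardholder’s values (`credit_limit` = 3606, `income` = 14.9) into this equation by hand, using the estimates from Table 6.10. Because those estimates are rounded to three decimal places, the result differs slightly from the fitted value the model itself produces:

```
# Hand-computed fitted debt for a cardholder with credit_limit = 3606
# and income = 14.9 (i.e., $14,900), using the rounded coefficient
# estimates from the regression table:
-385.179 + 0.264 * 3606 - 7.663 * 14.9
```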

Recall, however, from the right-hand plot of Figure 6.5 that when plotting the relationship between `debt` and `income` in isolation, there appeared to be a *positive* relationship. In the multiple regression we just performed, however, when *jointly* modeling the relationship between `debt`, `credit_limit`, and `income`, there appears to be a *negative* relationship between `debt` and `income`, as evidenced by the negative slope for `income` of -$7.663. What explains these contradictory results? A phenomenon known as *Simpson’s Paradox*, whereby overall trends that exist in aggregate either disappear or reverse when the data are broken down into groups. In Subsection 6.3.3 we elaborate on this idea by looking at the relationship between `credit_limit` and credit card `debt`, but split along different `income` brackets.

*Learning check*

**(LC6.3)** Fit a new multiple regression model using `lm(debt ~ credit_rating + age, data = credit_ch6)`, where `credit_rating` and `age` are the new numerical explanatory variables \(x_1\) and \(x_2\). Get information about the “best-fitting” regression plane from the regression table by applying the `get_regression_table()` function. How do the regression results match up with the results from your previous exploratory data analysis?

### 6.2.3 Observed/fitted values and residuals

Let’s also compute all fitted values and residuals for our regression model using the `get_regression_points()`

function and present only the first 10 rows of output in Table 6.11. Remember that the coordinates of each of the blue points in our 3D scatterplot in Figure 6.6 can be found in the `income`

, `credit_limit`

, and `debt`

columns. The fitted values on the regression plane are found in the `debt_hat`

column and are computed using our equation for the regression plane in the previous section:

\[ \begin{aligned} \widehat{y} = \widehat{\text{debt}} &= -385.179 + 0.264 \cdot \text{limit} - 7.663 \cdot \text{income} \end{aligned} \]
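The values in Table 6.11 can be generated along these lines, by applying `get_regression_points()` from the `moderndive` package to our saved `debt_model`:

```
regression_points <- get_regression_points(debt_model)
regression_points
```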

ID | debt | credit_limit | income | debt_hat | residual |
---|---|---|---|---|---|
1 | 333 | 3606 | 14.9 | 454 | -120.8 |
2 | 903 | 6645 | 106.0 | 559 | 344.3 |
3 | 580 | 7075 | 104.6 | 683 | -103.4 |
4 | 964 | 9504 | 148.9 | 986 | -21.7 |
5 | 331 | 4897 | 55.9 | 481 | -150.0 |
6 | 1151 | 8047 | 80.2 | 1127 | 23.6 |
7 | 203 | 3388 | 21.0 | 349 | -146.4 |
8 | 872 | 7114 | 71.4 | 948 | -76.0 |
9 | 279 | 3300 | 15.1 | 371 | -92.2 |
10 | 1350 | 6819 | 71.1 | 873 | 477.3 |
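Recall that the regression plane is “best-fitting” in that it minimizes the sum of squared residuals. With the `residual` column above, we can compute that sum directly (a sketch, assuming `dplyr` and `moderndive` are loaded):

```
get_regression_points(debt_model) %>%
  summarize(sum_sq_residuals = sum(residual^2))
```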

### References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2017. *An Introduction to Statistical Learning: With Applications in R*. First. New York, NY: Springer.