
6.2 Two numerical explanatory variables

Let’s now switch gears and consider multiple regression models where instead of one numerical and one categorical explanatory variable, we now have two numerical explanatory variables. The dataset we’ll use is from An Introduction to Statistical Learning with Applications in R (ISLR), an intermediate-level textbook on statistical and machine learning (James et al. 2017). Its accompanying ISLR R package contains the datasets to which the authors apply various machine learning methods.

One frequently used dataset in this book is the Credit dataset, where the outcome variable of interest is the credit card debt of 400 individuals. Other variables like income, credit limit, credit rating, and age are included as well. Note that the Credit data is not based on real individuals’ financial information, but rather is a simulated dataset used for educational purposes.

In this section, we’ll fit a regression model where we have

  1. A numerical outcome variable \(y\), the cardholder’s credit card debt
  2. Two explanatory variables:
    1. One numerical explanatory variable \(x_1\), the cardholder’s credit limit
    2. Another numerical explanatory variable \(x_2\), the cardholder’s income (in thousands of dollars).

6.2.1 Exploratory data analysis

Let’s load the Credit dataset. To keep things simple, let’s select() the subset of the variables we’ll consider in this chapter, and save this data in the new data frame credit_ch6. Notice our slightly different use of the select() verb here compared to how we introduced it in Subsection 3.8.1: we select the Balance variable from Credit but save it under the new variable name debt. We do this because here the term “debt” is easier to interpret than “balance.”
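A sketch of this step, assuming the ISLR and dplyr packages are installed (the original column names Balance, Limit, Income, Rating, and Age come from ISLR’s Credit dataset):

```r
library(dplyr)
library(ISLR)

# Select the variables we'll use in this chapter, renaming as we go:
credit_ch6 <- Credit %>%
  as_tibble() %>%
  select(ID, debt = Balance, credit_limit = Limit,
         income = Income, credit_rating = Rating, age = Age)
```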

You can observe the effect of our use of select() in the first common step of an exploratory data analysis: looking at the raw values either in RStudio’s spreadsheet viewer or by using glimpse().
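For example, running glimpse() on our new data frame yields the output below:

```r
glimpse(credit_ch6)
```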

Rows: 400
Columns: 6
$ ID            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ debt          <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 140…
$ credit_limit  <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6…
$ income        <dbl> 14.9, 106.0, 104.6, 148.9, 55.9, 80.2, 21.0, 71.4, 15.1…
$ credit_rating <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, …
$ age           <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49,…

Furthermore, let’s look at a random sample of five out of the 400 credit card holders in Table 6.8. Once again, note that due to the random nature of the sampling, you will likely end up with a different subset of five rows.
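One way to generate such a sample is a sketch like this, using dplyr’s sample_n():

```r
credit_ch6 %>%
  sample_n(size = 5)
```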

TABLE 6.8: Random sample of 5 credit card holders
 ID  debt  credit_limit  income  credit_rating  age
272   436          4866    45.0            347   30
239    52          2910    26.5            236   58
 87   815          6340    55.4            448   33
108     0          3189    39.1            263   72
149     0          2420    15.2            192   69

Now that we’ve looked at the raw values in our credit_ch6 data frame and got a sense of the data, let’s move on to the next common step in an exploratory data analysis: computing summary statistics. Let’s use the skim() function from the skimr package, being sure to only select() the columns of interest for our model.
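A sketch of this call, assuming the skimr package is installed:

```r
library(skimr)

credit_ch6 %>%
  select(debt, credit_limit, income) %>%
  skim()
```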

Skim summary statistics
 n obs: 400
 n variables: 3

── Variable type: integer ──────────────────────────────────────────
     variable  missing  complete    n     mean      sd   p0    p25     p50      p75   p100
 credit_limit        0       400  400   4735.6  2308.2  855   3088  4622.5  5872.75  13913
         debt        0       400  400   520.01  459.76    0  68.75   459.5      863   1999

── Variable type: numeric ──────────────────────────────────────────
 variable  missing  complete    n   mean     sd     p0    p25    p50    p75    p100
   income        0       400  400  45.22  35.24  10.35  21.01  33.12  57.47  186.63

Observe the summary statistics for the outcome variable debt: the mean and median credit card debt are $520.01 and $459.50, respectively, and 25% of card holders had debts of $68.75 or less. Let’s now look at the explanatory variable credit_limit: the mean and median credit card limits are $4,735.60 and $4,622.50, respectively. As for the other explanatory variable income, 75% of card holders had incomes of $57,470 or less (recall that income is recorded in thousands of dollars).

Since our outcome variable debt and the explanatory variables credit_limit and income are numerical, we can compute the correlation coefficient between the different possible pairs of these variables. First, we can run the get_correlation() command, as seen in Subsection 5.1.1, twice: once for each explanatory variable.
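A sketch of these two calls, using get_correlation() from the moderndive package:

```r
library(moderndive)

credit_ch6 %>%
  get_correlation(formula = debt ~ credit_limit)
credit_ch6 %>%
  get_correlation(formula = debt ~ income)
```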

Or we can compute them simultaneously by returning a correlation matrix, which we display in Table 6.9. We can see the correlation coefficient for any pair of variables by looking them up in the appropriate row/column combination.
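A sketch of this computation using base R’s cor():

```r
credit_ch6 %>%
  select(debt, credit_limit, income) %>%
  cor()
```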

TABLE 6.9: Correlation coefficients between credit card debt, credit limit, and income
              debt  credit_limit  income
debt         1.000         0.862   0.464
credit_limit 0.862         1.000   0.792
income       0.464         0.792   1.000

For example, the correlation coefficient of:

  1. debt with itself is 1 as we would expect based on the definition of the correlation coefficient.
  2. debt with credit_limit is 0.862. This indicates a strong positive linear relationship, which makes sense as only individuals with large credit limits can accrue large credit card debts.
  3. debt with income is 0.464. This is suggestive of another positive linear relationship, although not as strong as the relationship between debt and credit_limit.
  4. As an added bonus, we can read off the correlation coefficient between the two explanatory variables of credit_limit and income as 0.792.

We say there is a high degree of collinearity between the credit_limit and income explanatory variables. Collinearity (or multicollinearity) is a phenomenon where one explanatory variable in a multiple regression model is highly correlated with another.

So in our case since credit_limit and income are highly correlated, if we knew someone’s credit_limit, we could make pretty good guesses about their income as well. Thus, these two variables provide somewhat redundant information. However, we’ll leave discussion on how to work with collinear explanatory variables to a more intermediate-level book on regression modeling.

Let’s visualize the relationship of the outcome variable with each of the two explanatory variables in two separate plots in Figure 6.5.
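A sketch of the ggplot2 code behind these two scatterplots (the axis labels and titles are our own choices):

```r
library(ggplot2)

# Debt vs credit limit, with a best-fitting line overlaid:
ggplot(credit_ch6, aes(x = credit_limit, y = debt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Credit limit (in $)", y = "Credit card debt (in $)",
       title = "Debt and credit limit")

# Debt vs income, with a best-fitting line overlaid:
ggplot(credit_ch6, aes(x = income, y = debt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Income (in $1000)", y = "Credit card debt (in $)",
       title = "Debt and income")
```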


FIGURE 6.5: Relationship between credit card debt and credit limit/income.

Observe there is a positive relationship between credit limit and credit card debt: as credit limit increases, so does credit card debt. This is consistent with the strongly positive correlation coefficient of 0.862 we computed earlier. In the case of income, the positive relationship doesn’t appear as strong, given the weakly positive correlation coefficient of 0.464.

However, the two plots in Figure 6.5 only focus on the relationship of the outcome variable with each of the two explanatory variables separately. To visualize the joint relationship of all three variables simultaneously, we need a 3-dimensional (3D) scatterplot as seen in Figure 6.6. Each of the 400 observations in the credit_ch6 data frame are marked with a blue point where

  1. The numerical outcome variable \(y\) debt is on the vertical axis.
  2. The two numerical explanatory variables, \(x_1\) income and \(x_2\) credit_limit, are on the two axes that form the bottom plane.
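The interactive figure itself isn’t generated by code shown here, but a minimal sketch of such a 3D scatterplot, assuming the plotly package is installed (and omitting the regression plane for brevity), might look like:

```r
library(plotly)

# Plot the 400 observations as blue points in 3D space:
plot_ly(credit_ch6,
        x = ~income, y = ~credit_limit, z = ~debt,
        type = "scatter3d", mode = "markers",
        marker = list(size = 2, color = "blue"))
```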

FIGURE 6.6: 3D scatterplot and regression plane.

Furthermore, we also include the regression plane. Recall from Subsection 5.3.2 that regression lines are “best-fitting” in that of all possible lines we can draw through a cloud of points, the regression line minimizes the sum of squared residuals. This concept also extends to models with two numerical explanatory variables. The difference is instead of a “best-fitting” line, we now have a “best-fitting” plane that similarly minimizes the sum of squared residuals. Head to this website to open an interactive version of this plot in your browser.

Learning check

(LC6.2) Conduct a new exploratory data analysis with the same outcome variable \(y\) debt but with credit_rating and age as the new explanatory variables \(x_1\) and \(x_2\). What can you say about the relationship between a credit card holder’s debt and their credit rating and age?

6.2.2 Regression plane

Let’s now fit a regression model and get the regression table corresponding to the regression plane in Figure 6.6. To keep things brief in this subsection, we won’t consider an interaction model for the two numerical explanatory variables income and credit_limit like we did in Subsection 6.1.2 using the model formula score ~ age * gender. Rather, we’ll only consider a model fit with a formula of the form y ~ x1 + x2. Confusingly, however, since we now have a regression plane instead of multiple lines, the label “parallel slopes” doesn’t apply when you have two numerical explanatory variables. Just as we have done multiple times throughout Chapter 5 and this chapter, we produce the regression table for this model using our two-step process; the result is shown in Table 6.10.

TABLE 6.10: Multiple regression table
term          estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept     -385.179     19.465      -19.8        0  -423.446  -346.912
credit_limit     0.264      0.006       45.0        0     0.253     0.276
income          -7.663      0.385      -19.9        0    -8.420    -6.906

  1. We first “fit” the linear regression model using the lm(y ~ x1 + x2, data) function and save it in debt_model.
  2. We get the regression table by applying the get_regression_table() function from the moderndive package to debt_model.
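A sketch of this two-step process:

```r
library(moderndive)

# 1. Fit the regression model and save it:
debt_model <- lm(debt ~ credit_limit + income, data = credit_ch6)

# 2. Get the regression table:
get_regression_table(debt_model)
```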

Let’s interpret the three values in the estimate column. First, the intercept value is -$385.179. This intercept represents the credit card debt for an individual who has a credit_limit of $0 and an income of $0. In our data, however, the intercept has no practical interpretation since no individuals had credit_limit or income values of $0. Rather, the intercept is used to situate the regression plane in 3D space.

Second, the credit_limit value is $0.264. Taking into account all the other explanatory variables in our model, for every increase of one dollar in credit_limit, there is an associated increase of, on average, $0.26 in credit card debt. Just as we did in Subsection 5.1.2, we are cautious not to imply causality, as we saw in Subsection 5.3.1 that “correlation is not necessarily causation.” We are merely stating that there was an associated increase.

Furthermore, we preface our interpretation with the statement, “taking into account all the other explanatory variables in our model.” Here, by all other explanatory variables we mean income. We do this to emphasize that we are now jointly interpreting the associated effect of multiple explanatory variables in the same model at the same time.

Third, the income value is -$7.663. Taking into account all other explanatory variables in our model, for every increase of one unit of income ($1000 in actual income), there is an associated decrease of, on average, $7.66 in credit card debt.

Putting these results together, the equation of the regression plane that gives us fitted values \(\widehat{y}\) = \(\widehat{\text{debt}}\) is:

\[ \begin{aligned} \widehat{y} &= b_0 + b_1 \cdot x_1 + b_2 \cdot x_2\\ \widehat{\text{debt}} &= b_0 + b_{\text{limit}} \cdot \text{limit} + b_{\text{income}} \cdot \text{income}\\ &= -385.179 + 0.264 \cdot \text{limit} - 7.663 \cdot \text{income} \end{aligned} \]

Recall, however, from the right-hand plot of Figure 6.5 that when plotting the relationship between debt and income in isolation, there appeared to be a positive relationship. In the multiple regression we just fit, however, when jointly modeling the relationship between debt, credit_limit, and income, there appears to be a negative relationship between debt and income, as evidenced by the negative slope for income of -$7.663. What explains these contradictory results? A phenomenon known as Simpson’s Paradox, whereby overall trends that exist in aggregate either disappear or reverse when the data are broken down into groups. In Subsection 6.3.3 we elaborate on this idea by looking at the relationship between credit_limit and credit card debt, but split along different income brackets.

Learning check

(LC6.3) Fit a new multiple regression using lm(debt ~ credit_rating + age, data = credit_ch6), where credit_rating and age are the new numerical explanatory variables \(x_1\) and \(x_2\). Get information about the “best-fitting” regression plane from the regression table by applying the get_regression_table() function. How do the regression results match up with the results from your previous exploratory data analysis?

6.2.3 Observed/fitted values and residuals

Let’s also compute all fitted values and residuals for our regression model using the get_regression_points() function and present only the first 10 rows of output in Table 6.11. Remember that the coordinates of each of the blue points in our 3D scatterplot in Figure 6.6 can be found in the income, credit_limit, and debt columns. The fitted values on the regression plane are found in the debt_hat column and are computed using our equation for the regression plane in the previous section:

\[ \begin{aligned} \widehat{y} = \widehat{\text{debt}} &= -385.179 + 0.264 \cdot \text{limit} - 7.663 \cdot \text{income} \end{aligned} \]
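A sketch of this computation, assuming debt_model from Subsection 6.2.2:

```r
library(moderndive)

# Compute fitted values (debt_hat) and residuals for all 400 card holders;
# Table 6.11 shows the first 10 rows of this output:
regression_points <- get_regression_points(debt_model)
regression_points
```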

TABLE 6.11: Regression points (First 10 credit card holders out of 400)
 ID  debt  credit_limit  income  debt_hat  residual
  1   333          3606    14.9       454    -120.8
  2   903          6645   106.0       559     344.3
  3   580          7075   104.6       683    -103.4
  4   964          9504   148.9       986     -21.7
  5   331          4897    55.9       481    -150.0
  6  1151          8047    80.2      1127      23.6
  7   203          3388    21.0       349    -146.4
  8   872          7114    71.4       948     -76.0
  9   279          3300    15.1       371     -92.2
 10  1350          6819    71.1       873     477.3

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2017. An Introduction to Statistical Learning: With Applications in R. First edition. New York, NY: Springer.