## 5.2 One categorical explanatory variable

It’s an unfortunate truth that life expectancy is not the same across all countries in the world. International development agencies are interested in studying these differences in life expectancy in the hopes of identifying where governments should allocate resources to address this problem. In this section, we’ll explore differences in life expectancy in two ways:

- Differences between continents: Are there significant differences in average life expectancy between the five populated continents of the world: Africa, the Americas, Asia, Europe, and Oceania?
- Differences within continents: How does life expectancy vary within the world’s five continents? For example, is the spread of life expectancy among the countries of Africa larger than the spread of life expectancy among the countries of Asia?

To answer such questions, we’ll use the `gapminder`

data frame included in the `gapminder`

package. This dataset has international development statistics such as life expectancy, GDP per capita, and population for 142 countries for 5-year intervals between 1952 and 2007. Recall we visualized some of this data in Figure 2.1 in Subsection 2.1.2 on the grammar of graphics.

We’ll use this data for basic regression again, but now using an explanatory variable \(x\) that is categorical, as opposed to the numerical explanatory variable model we used in the previous Section 5.1.

- A numerical outcome variable \(y\) (a country’s life expectancy) and
- A single categorical explanatory variable \(x\) (the continent that the country is a part of).

When the explanatory variable \(x\) is categorical, the concept of a “best-fitting” regression line is a little different than the one we saw previously in Section 5.1 where the explanatory variable \(x\) was numerical. We’ll study these differences shortly in Subsection 5.2.2, but first we conduct an exploratory data analysis.

### 5.2.1 Exploratory data analysis

The data on the 142 countries can be found in the `gapminder`

data frame included in the `gapminder`

package. However, to keep things simple, let’s `filter()`

for only those observations/rows corresponding to the year 2007. Additionally, let’s `select()`

only the subset of the variables we’ll consider in this chapter. We’ll save this data in a new data frame called `gapminder2007`

:

```
library(gapminder)
gapminder2007 <- gapminder %>%
filter(year == 2007) %>%
select(country, lifeExp, continent, gdpPercap)
```

Let’s perform the first common step in an exploratory data analysis: looking at the raw data values. You can do this by using RStudio’s spreadsheet viewer or by using the `glimpse()`

command as introduced in Subsection 1.4.3 on exploring data frames:

```
Rows: 142
Columns: 4
$ country <fct> Afghanistan, Albania, Algeria, Angola, Argentina, Australia…
$ lifeExp <dbl> 43.8, 76.4, 72.3, 42.7, 75.3, 81.2, 79.8, 75.6, 64.1, 79.4,…
$ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, Europe, As…
$ gdpPercap <dbl> 975, 5937, 6223, 4797, 12779, 34435, 36126, 29796, 1391, 33…
```

Observe that `Observations: 142`

indicates that there are 142 rows/observations in `gapminder2007`

, where each row corresponds to one country. In other words, the *observational unit* is an individual country. Furthermore, observe that the variable `continent`

is of type `<fct>`

, which stands for *factor*, which is R’s way of encoding categorical variables.

A full description of all the variables included in `gapminder`

can be found by reading the associated help file (run `?gapminder`

in the console). However, let’s fully describe only the 4 variables we selected in `gapminder2007`

:

`country`

: An identification variable of type character/text used to distinguish the 142 countries in the dataset.`lifeExp`

: A numerical variable of that country’s life expectancy at birth. This is the outcome variable \(y\) of interest.`continent`

: A categorical variable with five levels. Here “levels” correspond to the possible categories: Africa, Asia, Americas, Europe, and Oceania. This is the explanatory variable \(x\) of interest.`gdpPercap`

: A numerical variable of that country’s GDP per capita in US inflation-adjusted dollars that we’ll use as another outcome variable \(y\) in the*Learning check*at the end of this subsection.

Let’s look at a random sample of five out of the 142 countries in Table 5.5.

country | lifeExp | continent | gdpPercap |
---|---|---|---|

Togo | 58.4 | Africa | 883 |

Sao Tome and Principe | 65.5 | Africa | 1598 |

Congo, Dem. Rep. | 46.5 | Africa | 278 |

Lesotho | 42.6 | Africa | 1569 |

Bulgaria | 73.0 | Europe | 10681 |

Note that random sampling will likely produce a different subset of 5 rows for you than what’s shown. Now that we’ve looked at the raw values in our `gapminder2007`

data frame and got a sense of the data, let’s move on to computing summary statistics. Let’s once again apply the `skim()`

function from the `skimr`

package. Recall from our previous EDA that this function takes in a data frame, “skims” it, and returns commonly used summary statistics. Let’s take our `gapminder2007`

data frame, `select()`

only the outcome and explanatory variables `lifeExp`

and `continent`

, and pipe them into the `skim()`

function:

```
Skim summary statistics
n obs: 142
n variables: 2
── Variable type:factor
variable missing complete n n_unique top_counts ordered
continent 0 142 142 5 Afr: 52, Asi: 33, Eur: 30, Ame: 25 FALSE
── Variable type:numeric
variable missing complete n mean sd p0 p25 p50 p75 p100
lifeExp 0 142 142 67.01 12.07 39.61 57.16 71.94 76.41 82.6
```

The `skim()`

output now reports summaries for categorical variables (`Variable type:factor`

) separately from the numerical variables (`Variable type:numeric`

). For the categorical variable `continent`

, it reports:

`missing`

,`complete`

, and`n`

, which are the number of missing, complete, and total number of values as before, respectively.`n_unique`

: The number of unique levels to this variable, corresponding to Africa, Asia, Americas, Europe, and Oceania. This refers to how many countries are in the data for each continent.`top_counts`

: In this case, the top four counts:`Africa`

has 52 countries,`Asia`

has 33,`Europe`

has 30, and`Americas`

has 25. Not displayed is`Oceania`

with 2 countries.`ordered`

: This tells us whether the categorical variable is “ordinal”: whether there is an encoded hierarchy (like low, medium, high). In this case,`continent`

is not ordered.

Turning our attention to the summary statistics of the numerical variable `lifeExp`

, we observe that the global median life expectancy in 2007 was 71.94. Thus, half of the world’s countries (71 countries) had a life expectancy less than 71.94. The mean life expectancy of 67.01 is lower, however. Why is the mean life expectancy lower than the median?

We can answer this question by performing the last of the three common steps in an exploratory data analysis: creating data visualizations. Let’s visualize the distribution of our outcome variable \(y\) = `lifeExp`

in Figure 5.7.

```
ggplot(gapminder2007, aes(x = lifeExp)) +
geom_histogram(binwidth = 5, color = "white") +
labs(x = "Life expectancy", y = "Number of countries",
title = "Histogram of distribution of worldwide life expectancies")
```

We see that this data is *left-skewed*, also known as *negatively* skewed: there are a few countries with low life expectancy that are bringing down the mean life expectancy. However, the median is less sensitive to the effects of such outliers; hence, the median is greater than the mean in this case.

Remember, however, that we want to compare life expectancies both between continents and within continents. In other words, our visualizations need to incorporate some notion of the variable `continent`

. We can do this easily with a faceted histogram. Recall from Section 2.6 that facets allow us to split a visualization by the different values of another variable. We display the resulting visualization in Figure 5.8 by adding a `facet_wrap(~ continent, nrow = 2)`

layer.

```
ggplot(gapminder2007, aes(x = lifeExp)) +
geom_histogram(binwidth = 5, color = "white") +
labs(x = "Life expectancy",
y = "Number of countries",
title = "Histogram of distribution of worldwide life expectancies") +
facet_wrap(~ continent, nrow = 2)
```

Observe that unfortunately the distribution of African life expectancies is much lower than the other continents, while in Europe life expectancies tend to be higher and furthermore do not vary as much. On the other hand, both Asia and Africa have the most variation in life expectancies. There is the least variation in Oceania, but keep in mind that there are only two countries in Oceania: Australia and New Zealand.

Recall that an alternative method to visualize the distribution of a numerical variable split by a categorical variable is by using a side-by-side boxplot. We map the categorical variable `continent`

to the \(x\)-axis and the different life expectancies within each continent on the \(y\)-axis in Figure 5.9.

```
ggplot(gapminder2007, aes(x = continent, y = lifeExp)) +
geom_boxplot() +
labs(x = "Continent", y = "Life expectancy",
title = "Life expectancy by continent")
```

Some people prefer comparing the distributions of a numerical variable between different levels of a categorical variable using a boxplot instead of a faceted histogram. This is because we can make quick comparisons between the categorical variable’s levels with imaginary horizontal lines. For example, observe in Figure 5.9 that we can quickly convince ourselves that Oceania has the highest median life expectancies by drawing an imaginary horizontal line at \(y\) = 80. Furthermore, as we observed in the faceted histogram in Figure 5.8, Africa and Asia have the largest variation in life expectancy as evidenced by their large interquartile ranges (the heights of the boxes).

It’s important to remember, however, that the solid lines in the middle of the boxes correspond to the medians (the middle value) rather than the mean (the average). So, for example, if you look at Asia, the solid line denotes the median life expectancy of around 72 years. This tells us that half of all countries in Asia have a life expectancy below 72 years, whereas half have a life expectancy above 72 years.

Let’s compute the median and mean life expectancy for each continent with a little more data wrangling and display the results in Table 5.6.

```
lifeExp_by_continent <- gapminder2007 %>%
group_by(continent) %>%
summarize(median = median(lifeExp),
mean = mean(lifeExp))
```

continent | median | mean |
---|---|---|

Africa | 52.9 | 54.8 |

Americas | 72.9 | 73.6 |

Asia | 72.4 | 70.7 |

Europe | 78.6 | 77.6 |

Oceania | 80.7 | 80.7 |

Observe the order of the second column `median`

life expectancy: Africa is lowest, the Americas and Asia are next with similar medians, then Europe, then Oceania. This ordering corresponds to the ordering of the solid black lines inside the boxes in our side-by-side boxplot in Figure 5.9.

Let’s now turn our attention to the values in the third column `mean`

. Using Africa’s mean life expectancy of 54.8 as a *baseline for comparison*, let’s start making comparisons to the mean life expectancies of the other four continents and put these values in Table 5.7, which we’ll revisit later on in this section.

- For the Americas, it is 73.6 - 54.8 = 18.8 years higher.
- For Asia, it is 70.7 - 54.8 = 15.9 years higher.
- For Europe, it is 77.6 - 54.8 = 22.8 years higher.
- For Oceania, it is 80.7 - 54.8 = 25.9 years higher.

continent | mean | Difference versus Africa |
---|---|---|

Africa | 54.8 | 0.0 |

Americas | 73.6 | 18.8 |

Asia | 70.7 | 15.9 |

Europe | 77.6 | 22.8 |

Oceania | 80.7 | 25.9 |

*Learning check*

**(LC5.4)** Conduct a new exploratory data analysis with the same explanatory variable \(x\) being `continent`

but with `gdpPercap`

as the new outcome variable \(y\). What can you say about the differences in GDP per capita between continents based on this exploration?

### 5.2.2 Linear regression

In Subsection 5.1.2 we introduced simple linear regression, which involves modeling the relationship between a numerical outcome variable \(y\) and a numerical explanatory variable \(x\). In our life expectancy example, we now instead have a categorical explanatory variable `continent`

. Our model will not yield a “best-fitting” regression line like in Figure 5.4, but rather *offsets* relative to a baseline for comparison.

As we did in Subsection 5.1.2 when studying the relationship between teaching scores and “beauty” scores, let’s output the regression table for this model. Recall that this is done in two steps:

- We first “fit” the linear regression model using the
`lm(y ~ x, data)`

function and save it in`lifeExp_model`

. - We get the regression table by applying the
`get_regression_table()`

function from the`moderndive`

package to`lifeExp_model`

.

term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
---|---|---|---|---|---|---|

intercept | 54.8 | 1.02 | 53.45 | 0 | 52.8 | 56.8 |

continentAmericas | 18.8 | 1.80 | 10.45 | 0 | 15.2 | 22.4 |

continentAsia | 15.9 | 1.65 | 9.68 | 0 | 12.7 | 19.2 |

continentEurope | 22.8 | 1.70 | 13.47 | 0 | 19.5 | 26.2 |

continentOceania | 25.9 | 5.33 | 4.86 | 0 | 15.4 | 36.5 |

Let’s once again focus on the values in the `term`

and `estimate`

columns of Table 5.8. Why are there now 5 rows? Let’s break them down one-by-one:

`intercept`

corresponds to the mean life expectancy of countries in Africa of 54.8 years.`continentAmericas`

corresponds to countries in the Americas and the value +18.8 is the same difference in mean life expectancy relative to Africa we displayed in Table 5.7. In other words, the mean life expectancy of countries in the Americas is \(54.8 + 18.8 = 73.6\).`continentAsia`

corresponds to countries in Asia and the value +15.9 is the same difference in mean life expectancy relative to Africa we displayed in Table 5.7. In other words, the mean life expectancy of countries in Asia is \(54.8 + 15.9 = 70.7\).`continentEurope`

corresponds to countries in Europe and the value +22.8 is the same difference in mean life expectancy relative to Africa we displayed in Table 5.7. In other words, the mean life expectancy of countries in Europe is \(54.8 + 22.8 = 77.6\).`continentOceania`

corresponds to countries in Oceania and the value +25.9 is the same difference in mean life expectancy relative to Africa we displayed in Table 5.7. In other words, the mean life expectancy of countries in Oceania is \(54.8 + 25.9 = 80.7\).

To summarize, the 5 values in the `estimate`

column in Table 5.8 correspond to the “baseline for comparison” continent Africa (the intercept) as well as four “offsets” from this baseline for the remaining 4 continents: the Americas, Asia, Europe, and Oceania.

You might be asking at this point why was Africa chosen as the “baseline for comparison” group. This is the case for no other reason than it comes first alphabetically of the five continents; by default R arranges factors/categorical variables in alphanumeric order. You can change this baseline group to be another continent if you manipulate the variable `continent`

’s factor “levels” using the `forcats`

package. See Chapter 15 of *R for Data Science* (Grolemund and Wickham 2017) for examples.

Let’s now write the equation for our fitted values \(\widehat{y} = \widehat{\text{life exp}}\).

\[ \begin{aligned} \widehat{y} = \widehat{\text{life exp}} &= b_0 + b_{\text{Amer}}\cdot\mathbb{1}_{\text{Amer}}(x) + b_{\text{Asia}}\cdot\mathbb{1}_{\text{Asia}}(x) + \\ & \qquad b_{\text{Euro}}\cdot\mathbb{1}_{\text{Euro}}(x) + b_{\text{Ocean}}\cdot\mathbb{1}_{\text{Ocean}}(x)\\ &= 54.8 + 18.8\cdot\mathbb{1}_{\text{Amer}}(x) + 15.9\cdot\mathbb{1}_{\text{Asia}}(x) + \\ & \qquad 22.8\cdot\mathbb{1}_{\text{Euro}}(x) + 25.9\cdot\mathbb{1}_{\text{Ocean}}(x) \end{aligned} \]

Whoa! That looks daunting! Don’t fret, however, as once you understand what all the elements mean, things simplify greatly. First, \(\mathbb{1}_{A}(x)\) is what’s known in mathematics as an “indicator function.” It returns only one of two possible values, 0 and 1, where

\[ \mathbb{1}_{A}(x) = \left\{ \begin{array}{ll} 1 & \text{if } x \text{ is in } A \\ 0 & \text{if } \text{otherwise} \end{array} \right. \]

In a statistical modeling context, this is also known as a *dummy variable*. In our case, let’s consider the first such indicator variable \(\mathbb{1}_{\text{Amer}}(x)\). This indicator function returns 1 if a country is in the Americas, 0 otherwise:

\[ \mathbb{1}_{\text{Amer}}(x) = \left\{ \begin{array}{ll} 1 & \text{if } \text{country } x \text{ is in the Americas} \\ 0 & \text{otherwise}\end{array} \right. \]

Second, \(b_0\) corresponds to the intercept as before; in this case, it’s the mean life expectancy of all countries in Africa. Third, the \(b_{\text{Amer}}\), \(b_{\text{Asia}}\), \(b_{\text{Euro}}\), and \(b_{\text{Ocean}}\) represent the 4 “offsets relative to the baseline for comparison” in the regression table output in Table 5.8: `continentAmericas`

, `continentAsia`

, `continentEurope`

, and `continentOceania`

.

Let’s put this all together and compute the fitted value \(\widehat{y} = \widehat{\text{life exp}}\) for a country in Africa. Since the country is in Africa, all four indicator functions \(\mathbb{1}_{\text{Amer}}(x)\), \(\mathbb{1}_{\text{Asia}}(x)\), \(\mathbb{1}_{\text{Euro}}(x)\), and \(\mathbb{1}_{\text{Ocean}}(x)\) will equal 0, and thus:

\[ \begin{aligned} \widehat{\text{life exp}} &= b_0 + b_{\text{Amer}}\cdot\mathbb{1}_{\text{Amer}}(x) + b_{\text{Asia}}\cdot\mathbb{1}_{\text{Asia}}(x) + \\ & \qquad b_{\text{Euro}}\cdot\mathbb{1}_{\text{Euro}}(x) + b_{\text{Ocean}}\cdot\mathbb{1}_{\text{Ocean}}(x)\\ &= 54.8 + 18.8\cdot\mathbb{1}_{\text{Amer}}(x) + 15.9\cdot\mathbb{1}_{\text{Asia}}(x) + \\ & \qquad 22.8\cdot\mathbb{1}_{\text{Euro}}(x) + 25.9\cdot\mathbb{1}_{\text{Ocean}}(x)\\ &= 54.8 + 18.8\cdot 0 + 15.9\cdot 0 + 22.8\cdot 0 + 25.9\cdot 0\\ &= 54.8 \end{aligned} \]

In other words, all that’s left is the intercept \(b_0\), corresponding to the average life expectancy of African countries of 54.8 years. Next, say we are considering a country in the Americas. In this case, only the indicator function \(\mathbb{1}_{\text{Amer}}(x)\) for the Americas will equal 1, while all the others will equal 0, and thus:

\[ \begin{aligned} \widehat{\text{life exp}} &= 54.8 + 18.8\cdot\mathbb{1}_{\text{Amer}}(x) + 15.9\cdot\mathbb{1}_{\text{Asia}}(x) + 22.8\cdot\mathbb{1}_{\text{Euro}}(x) + \\ & \qquad 25.9\cdot\mathbb{1}_{\text{Ocean}}(x)\\ &= 54.8 + 18.8\cdot 1 + 15.9\cdot 0 + 22.8\cdot 0 + 25.9\cdot 0\\ &= 54.8 + 18.8 \\ & = 73.6 \end{aligned} \]

which is the mean life expectancy for countries in the Americas of 73.6 years in Table 5.7. Note the “offset from the baseline for comparison” is +18.8 years.

Let’s do one more. Say we are considering a country in Asia. In this case, only the indicator function \(\mathbb{1}_{\text{Asia}}(x)\) for Asia will equal 1, while all the others will equal 0, and thus:

\[ \begin{aligned} \widehat{\text{life exp}} &= 54.8 + 18.8\cdot\mathbb{1}_{\text{Amer}}(x) + 15.9\cdot\mathbb{1}_{\text{Asia}}(x) + 22.8\cdot\mathbb{1}_{\text{Euro}}(x) + \\ & \qquad 25.9\cdot\mathbb{1}_{\text{Ocean}}(x)\\ &= 54.8 + 18.8\cdot 0 + 15.9\cdot 1 + 22.8\cdot 0 + 25.9\cdot 0\\ &= 54.8 + 15.9 \\ & = 70.7 \end{aligned} \]

which is the mean life expectancy for Asian countries of 70.7 years in Table 5.7. The “offset from the baseline for comparison” here is +15.9 years.

Let’s generalize this idea a bit. If we fit a linear regression model using a categorical explanatory variable \(x\) that has \(k\) possible categories, the regression table will return an intercept and \(k - 1\) “offsets.” In our case, since there are \(k = 5\) continents, the regression model returns an intercept corresponding to the baseline for comparison group of Africa and \(k - 1 = 4\) offsets corresponding to the Americas, Asia, Europe, and Oceania.

Understanding a regression table output when you’re using a categorical explanatory variable is a topic those new to regression often struggle with. The only real remedy for these struggles is practice, practice, practice. However, once you equip yourselves with an understanding of how to create regression models using categorical explanatory variables, you’ll be able to incorporate many new variables into your models, given the large amount of the world’s data that is categorical. If you feel like you’re still struggling at this point, however, we suggest you closely compare Tables 5.7 and 5.8 and note how you can compute all the values from one table using the values in the other.

*Learning check*

**(LC5.5)** Fit a new linear regression using `lm(gdpPercap ~ continent, data = gapminder2007)`

where `gdpPercap`

is the new outcome variable \(y\). Get information about the “best-fitting” line from the regression table by applying the `get_regression_table()`

function. How do the regression results match up with the results from your previous exploratory data analysis?

### 5.2.3 Observed/fitted values and residuals

Recall in Subsection 5.1.3, we defined the following three concepts:

- Observed values \(y\), or the observed value of the outcome variable
- Fitted values \(\widehat{y}\), or the value on the regression line for a given \(x\) value
- Residuals \(y - \widehat{y}\), or the error between the observed value and the fitted value

We obtained these values and other values using the `get_regression_points()`

function from the `moderndive`

package. This time, however, let’s add an argument setting `ID = "country"`

: this is telling the function to use the variable `country`

in `gapminder2007`

as an *identification variable* in the output. This will help contextualize our analysis by matching values to countries.

country | lifeExp | continent | lifeExp_hat | residual |
---|---|---|---|---|

Afghanistan | 43.8 | Asia | 70.7 | -26.900 |

Albania | 76.4 | Europe | 77.6 | -1.226 |

Algeria | 72.3 | Africa | 54.8 | 17.495 |

Angola | 42.7 | Africa | 54.8 | -12.075 |

Argentina | 75.3 | Americas | 73.6 | 1.712 |

Australia | 81.2 | Oceania | 80.7 | 0.516 |

Austria | 79.8 | Europe | 77.6 | 2.180 |

Bahrain | 75.6 | Asia | 70.7 | 4.907 |

Bangladesh | 64.1 | Asia | 70.7 | -6.666 |

Belgium | 79.4 | Europe | 77.6 | 1.792 |

Observe in Table 5.9 that `lifeExp_hat`

contains the fitted values \(\widehat{y}\) = \(\widehat{\text{lifeExp}}\). If you look closely, there are only 5 possible values for `lifeExp_hat`

. These correspond to the five mean life expectancies for the 5 continents that we displayed in Table 5.7 and computed using the values in the `estimate`

column of the regression table in Table 5.8.

The `residual`

column is simply \(y - \widehat{y}\) = `lifeExp - lifeExp_hat`

. These values can be interpreted as the deviation of a country’s life expectancy from its continent’s average life expectancy. For example, look at the first row of Table 5.9 corresponding to Afghanistan. The residual of \(y - \widehat{y} = 43.8 - 70.7 = -26.9\) is telling us that Afghanistan’s life expectancy is a whopping 26.9 years lower than the mean life expectancy of all Asian countries. This can in part be explained by the many years of war that country has suffered.

*Learning check*

**(LC5.6)** Using either the sorting functionality of RStudio’s spreadsheet viewer or using the data wrangling tools you learned in Chapter 3, identify the five countries with the five smallest (most negative) residuals? What do these negative residuals say about their life expectancy relative to their continents’ life expectancy?

**(LC5.7)** Repeat this process, but identify the five countries with the five largest (most positive) residuals. What do these positive residuals say about their life expectancy relative to their continents’ life expectancy?

### References

Grolemund, Garrett, and Hadley Wickham. 2017. *R for Data Science*. First. Sebastopol, CA: O’Reilly Media. https://r4ds.had.co.nz/.