## D.5 Chapter 5 Solutions

library(tidyverse)
library(moderndive)
library(skimr)
library(gapminder)

(LC5.1) Conduct a new exploratory data analysis with the same outcome variable $$y$$ being score but with age as the new explanatory variable $$x$$. Remember, this involves three things:

1. Looking at the raw data values.
2. Computing summary statistics.
3. Creating data visualizations.

What can you say about the relationship between age and teaching scores based on this exploration?

Solution:

• Looking at the raw data values:
glimpse(evals_ch5)
Rows: 463
Columns: 4
$ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…$ score   <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4…
$bty_avg <dbl> 5.00, 5.00, 5.00, 5.00, 3.00, 3.00, 3.00, 3.33, 3.33, 3.17, 3…$ age     <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 4…
• Computing summary statistics:
skim_with(numeric = list(hist = NULL), integer = list(hist = NULL))
evals_ch5 %>%
select(score, age) %>%
skim()
Skim summary statistics
n obs: 463
n variables: 2
group variables:

── Variable type:integer ───────────────────────────────────────────────────────
variable missing complete   n  mean  sd p0 p25 p50 p75 p100
age       0      463 463 48.37 9.8 29  42  48  57   73

── Variable type:numeric ───────────────────────────────────────────────────────
variable missing complete   n mean   sd  p0 p25 p50 p75 p100
score       0      463 463 4.17 0.54 2.3 3.8 4.3 4.6    5

(Note that for formatting purposes, the inline histogram that is usually printed with skim() has been removed. This can be done by running skim_with(numeric = list(hist = NULL), integer = list(hist = NULL)) prior to using the skim() function as well.)

• Creating data visualizations:
ggplot(evals_ch5, aes(x = age, y = score)) +
geom_point() +
labs(
x = "Age", y = "Teaching Score",
title = "Scatterplot of relationship of teaching score and age"
)

Based on the scatterplot visualization, there seem to have a weak negative relationship between age and teaching score. As age increases, the teaching score see, to decrease slightly.

(LC5.2) Fit a new simple linear regression using lm(score ~ age, data = evals_ch5) where age is the new explanatory variable $$x$$. Get information about the “best-fitting” line from the regression table by applying the get_regression_table() function. How do the regression results match up with the results from your earlier exploratory data analysis?

Solution:

# Fit regression model:
score_age_model <- lm(score ~ age, data = evals_ch5)
# Get regression table:
get_regression_table(score_age_model)
# A tibble: 2 x 7
term      estimate std_error statistic p_value lower_ci upper_ci
<chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept    4.462     0.127    35.195   0        4.213    4.711
2 age         -0.006     0.003    -2.311   0.021   -0.011   -0.001

\begin{aligned} \widehat{y} &= b_0 + b_1 \cdot x\\ \widehat{\text{score}} &= b_0 + b_{\text{age}} \cdot\text{age}\\ &= 4.462 - 0.006\cdot\text{age} \end{aligned}

For every increase of 1 unit in age, there is an associated decrease of, on average, 0.006 units of score. It matches with the results from our earlier exploratory data analysis.

(LC5.3) Generate a data frame of the residuals of the model where you used age as the explanatory $$x$$ variable.

Solution:

score_age_regression_points <- get_regression_points(score_age_model)
score_age_regression_points
# A tibble: 463 x 5
ID score   age score_hat residual
<int> <dbl> <int>     <dbl>    <dbl>
1     1 4.7      36     4.248  0.452
2     2 4.100    36     4.248 -0.148
3     3 3.9      36     4.248 -0.34800
4     4 4.8      36     4.248  0.552
5     5 4.600    59     4.112  0.488
6     6 4.3      59     4.112  0.188
7     7 2.8      59     4.112 -1.312
8     8 4.100    51     4.159 -0.059
9     9 3.4      51     4.159 -0.759
10    10 4.5      40     4.224  0.276
# … with 453 more rows

(LC5.4) Conduct a new exploratory data analysis with the same explanatory variable $$x$$ being continent but with gdpPercap as the new outcome variable $$y$$. Remember, this involves three things:

1. Most crucially: Looking at the raw data values.
2. Computing summary statistics, such as means, medians, and interquartile ranges.
3. Creating data visualizations.

What can you say about the differences in GDP per capita between continents based on this exploration?

Solution:

• Looking at the raw data values:
glimpse(gapminder2007)
Rows: 142
Columns: 4
$country <fct> Afghanistan, Albania, Algeria, Angola, Argentina, Australia…$ lifeExp   <dbl> 43.8, 76.4, 72.3, 42.7, 75.3, 81.2, 79.8, 75.6, 64.1, 79.4,…
$continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, Europe, As…$ gdpPercap <dbl> 975, 5937, 6223, 4797, 12779, 34435, 36126, 29796, 1391, 33…
• Computing summary statistics, such as means, medians, and interquartile ranges:
gapminder2007 %>%
select(gdpPercap, continent) %>%
skim()
Skim summary statistics
n obs: 142
n variables: 2
group variables:

── Variable type:factor ────────────────────────────────────────────────────────
variable missing complete   n n_unique                         top_counts
continent       0      142 142        5 Afr: 52, Asi: 33, Eur: 30, Ame: 25
ordered
FALSE

── Variable type:numeric ───────────────────────────────────────────────────────
variable missing complete   n     mean       sd     p0     p25     p50
gdpPercap       0      142 142 11680.07 12859.94 277.55 1624.84 6124.37
p75     p100
18008.84 49357.19
• Creating data visualizations:
ggplot(gapminder2007, aes(x = continent, y = gdpPercap)) +
geom_boxplot() +
labs(
x = "Continent", y = "GPD per capita",
title = "GDP by continent"
)

Based on this exploration, it seems that GDP’s are very different among different continents, which means that continent might be a statistically significant predictor for an area’s GDP.

(LC5.5) Fit a new linear regression using lm(gdpPercap ~ continent, data = gapminder2007) where gdpPercap is the new outcome variable $$y$$. Get information about the “best-fitting” line from the regression table by applying the get_regression_table() function. How do the regression results match up with the results from your previous exploratory data analysis?

Solution:

# Fit regression model:
gdp_model <- lm(gdpPercap ~ continent, data = gapminder2007)
# Get regression table:
get_regression_table(gdp_model)
# A tibble: 5 x 7
term              estimate std_error statistic p_value  lower_ci upper_ci
<chr>                <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
1 intercept          3089.03   1372.74     2.25    0.026   374.538  5803.53
2 continentAmericas  7914.00   2409.14     3.285   0.001  3150.08  12677.9
3 continentAsia      9383.99   2203.13     4.259   0      5027.46  13740.5
4 continentEurope   21965.4    2269.52     9.678   0     17477.6   26453.3
5 continentOceania  26721.2    7132.96     3.746   0     12616.2   40826.1 

\begin{aligned} \widehat{y} = \widehat{\text{gdpPercap}} &= b_0 + b_{\text{Amer}}\cdot\mathbb{1}_{\mbox{Amer}}(x) + b_{\text{Asia}}\cdot\mathbb{1}_{\mbox{Asia}}(x) + \\ & \qquad b_{\text{Euro}}\cdot\mathbb{1}_{\mbox{Euro}}(x) + b_{\text{Ocean}}\cdot\mathbb{1}_{\mbox{Ocean}}(x)\\ &= 3089 + 7914\cdot\mathbb{1}_{\mbox{Amer}}(x) + 9384\cdot\mathbb{1}_{\mbox{Asia}}(x) + \\ & \qquad 21965\cdot\mathbb{1}_{\mbox{Euro}}(x) + 26721\cdot\mathbb{1}_{\mbox{Ocean}}(x) \end{aligned}

In our previous exploratory data analysis, it seemed that continent is a statistically significant predictor for an area’s GDP. Here, by fit a new linear regression using lm(gdpPercap ~ continent, data = gapminder2007) where gdpPercap is the new outcome variable $$y$$, we are able to write an equation to predict gdpPercap using the continent as statistically significant predictors. Therefore, the regression results matches with the results from your previous exploratory data analysis.

(LC5.6) Using either the sorting functionality of RStudio’s spreadsheet viewer or using the data wrangling tools you learned in Chapter 3, identify the five countries with the five smallest (most negative) residuals? What do these negative residuals say about their life expectancy relative to their continents?

Solution: Using the sorting functionality of RStudio’s spreadsheet viewer, we can identify that the five countries with the five smallest (most negative) residuals are: Afghanistan, Swaziland, Mozambique, Haiti, and Zambia.

These negative residuals indicate that these data points have the biggest negative deviations from their group means. This means that these five countries’ average life expectancies are the lowest comparing to their respective continents’ average life expectancies. For example, the residual for Afghanistan is $$-26.900$$ and it is the smallest residual. This means that the average life expectancy of Afghanistan is $$26.900$$ years lower than the average life expectancy of its continent, Asia.

(LC5.7) Repeat this process, but identify the five countries with the five largest (most positive) residuals. What do these positive residuals say about their life expectancy relative to their continents?

Solution: Using either the sorting functionality of RStudio’s spreadsheet viewer, we can identify that the five countries with the five largest (most positive) residuals are: Reunion, Libya, Tunisia, Mauritius, and Algeria.

These positive residuals indicate that the data points are above the regression line with the longest distance. This means that these five countries’ average life expectancies are the highest comparing to their respective continents’ average life expectancies. For example, the residual for Reunion is $$21.636$$ and it is the largest residual. This means that the average life expectancy of Reunion is $$21.636$$ years higher than the average life expectancy of its continent, Africa.

(LC5.8) Note in the following plot there are 3 points marked with dots along with:

• The “best” fitting solid regression line in blue
• An arbitrarily chosen dotted red line
• Another arbitrarily chosen dashed green line

Compute the sum of squared residuals by hand for each line and show that of these three lines, the regression line in blue has the smallest value.

Solution:

• The “best” fitting solid regression line in blue:

$\sum_{i=1}^{n}(y_i - \widehat{y}_i)^2 = (2.0-1.5)^2+(1.0-2.0)^2+(3.0-2.5)^2 = 1.5$

• An arbitrarily chosen dotted red line:

$\sum_{i=1}^{n}(y_i - \widehat{y}_i)^2 = (2.0-2.5)^2+(1.00-2.5)^2+(3.0-2.5)^2 = 2.75$

• Another arbitrarily chosen dashed green line:

$\sum_{i=1}^{n}(y_i - \widehat{y}_i)^2 = (2.0-2.0)^2+(1.0-1.5)^2+(3.0-1.0)^2 = 4.25$

As calculated, $$1.5 < 2.75 < 4.25$$. Therefore, we show that the regression line in blue has the smallest value of the residual sum of squares.