## 8.2 Computer simulation of resampling

Let’s now mimic our tactile resampling activity virtually with a computer.

### 8.2.1 Virtually resampling once

First, let’s perform the virtual analog of resampling once. Recall that the `pennies_sample`

data frame included in the `moderndive`

package contains the years of our original sample of 50 pennies from the bank. Furthermore, recall in Chapter 7 on sampling that we used the `rep_sample_n()`

function as a virtual shovel to sample balls from our virtual bowl of 2400 balls as follows:

Let’s modify this code to perform the resampling with replacement of the 50 slips of paper representing our original sample 50 pennies:

Observe how we explicitly set the `replace`

argument to `TRUE`

in order to tell `rep_sample_n()`

that we would like to sample pennies *with* replacement. Had we not set `replace = TRUE`

, the function would’ve assumed the default value of `FALSE`

and hence done resampling *without* replacement. Additionally, since we didn’t specify the number of replicates via the `reps`

argument, the function assumes the default of one replicate `reps = 1`

. Lastly, observe also that the `size`

argument is set to match the original sample size of 50 pennies.

Let’s look at only the first 10 out of 50 rows of `virtual_resample`

:

```
# A tibble: 50 x 3
# Groups: replicate [1]
replicate ID year
<int> <int> <dbl>
1 1 37 1962
2 1 1 2002
3 1 45 1997
4 1 28 2006
5 1 50 2017
6 1 10 2000
7 1 16 2015
8 1 47 1982
9 1 23 1998
10 1 44 2015
# … with 40 more rows
```

The `replicate`

variable only takes on the value of 1 corresponding to us only having `reps = 1`

, the `ID`

variable indicates which of the 50 pennies from `pennies_sample`

was resampled, and `year`

denotes the year of minting. Let’s now compute the mean `year`

in our virtual resample of size 50 using data wrangling functions included in the `dplyr`

package:

```
# A tibble: 1 x 2
replicate resample_mean
<int> <dbl>
1 1 1996
```

As we saw when we did our tactile resampling exercise, the resulting mean year is different than the mean year of our 50 originally sampled pennies of 1995.44.

### 8.2.2 Virtually resampling 35 times

Let’s now perform the virtual analog of our 35 friends’ resampling. Using these results, we’ll be able to study the variability in the sample means from 35 resamples of size 50. Let’s first add a `reps = 35`

argument to `rep_sample_n()`

to indicate we would like 35 replicates. Thus, we want to repeat the resampling with the replacement of 50 pennies 35 times.

```
virtual_resamples <- pennies_sample %>%
rep_sample_n(size = 50, replace = TRUE, reps = 35)
virtual_resamples
```

```
# A tibble: 1,750 x 3
# Groups: replicate [35]
replicate ID year
<int> <int> <dbl>
1 1 21 1981
2 1 34 1985
3 1 4 1988
4 1 11 1994
5 1 26 1979
6 1 8 1996
7 1 19 1983
8 1 21 1981
9 1 49 2006
10 1 2 1986
# … with 1,740 more rows
```

The resulting `virtual_resamples`

data frame has 35 \(\cdot\) 50 = 1750 rows corresponding to 35 resamples of 50 pennies. Let’s now compute the resulting 35 sample means using the same `dplyr`

code as we did in the previous section, but this time adding a `group_by(replicate)`

:

```
virtual_resampled_means <- virtual_resamples %>%
group_by(replicate) %>%
summarize(mean_year = mean(year))
virtual_resampled_means
```

```
# A tibble: 35 x 2
replicate mean_year
<int> <dbl>
1 1 1995.58
2 2 1999.74
3 3 1993.7
4 4 1997.1
5 5 1999.42
6 6 1995.12
7 7 1994.94
8 8 1997.78
9 9 1991.26
10 10 1996.88
# … with 25 more rows
```

Observe that `virtual_resampled_means`

has 35 rows, corresponding to the 35 resampled means. Furthermore, observe that the values of `mean_year`

vary. Let’s visualize this variation using a histogram in Figure 8.12.

```
ggplot(virtual_resampled_means, aes(x = mean_year)) +
geom_histogram(binwidth = 1, color = "white", boundary = 1990) +
labs(x = "Resample mean year")
```

Let’s compare our virtually constructed bootstrap distribution with the one our 35 friends constructed via our tactile resampling exercise in Figure 8.13. Observe how they are somewhat similar, but not identical.

Recall that in the “resampling with replacement” scenario we are illustrating here, both of these histograms have a special name: the *bootstrap distribution of the sample mean*. Furthermore, recall they are an approximation to the *sampling distribution* of the sample mean, a concept you saw in Chapter 7 on sampling. These distributions allow us to study the effect of sampling variation on our estimates of the true population mean, in this case the true mean year for *all* US pennies. However, unlike in Chapter 7 where we took multiple samples (something one would never do in practice), bootstrap distributions are constructed by taking multiple resamples from a *single* sample: in this case, the 50 original pennies from the bank.

### 8.2.3 Virtually resampling 1000 times

Remember that one of the goals of resampling with replacement is to construct the bootstrap distribution, which is an approximation of the sampling distribution. However, the bootstrap distribution in Figure 8.12 is based only on 35 resamples and hence looks a little coarse. Let’s increase the number of resamples to 1000, so that we can hopefully better see the shape and the variability between different resamples.

```
# Repeat resampling 1000 times
virtual_resamples <- pennies_sample %>%
rep_sample_n(size = 50, replace = TRUE, reps = 1000)
# Compute 1000 sample means
virtual_resampled_means <- virtual_resamples %>%
group_by(replicate) %>%
summarize(mean_year = mean(year))
```

However, in the interest of brevity, going forward let’s combine these two operations into a single chain of pipe (`%>%`

) operators:

```
virtual_resampled_means <- pennies_sample %>%
rep_sample_n(size = 50, replace = TRUE, reps = 1000) %>%
group_by(replicate) %>%
summarize(mean_year = mean(year))
virtual_resampled_means
```

```
# A tibble: 1,000 x 2
replicate mean_year
<int> <dbl>
1 1 1992.6
2 2 1994.78
3 3 1994.74
4 4 1997.88
5 5 1990
6 6 1999.48
7 7 1990.26
8 8 1993.2
9 9 1994.88
10 10 1996.3
# … with 990 more rows
```

In Figure 8.14 let’s visualize the bootstrap distribution of these 1000 means based on 1000 virtual resamples:

```
ggplot(virtual_resampled_means, aes(x = mean_year)) +
geom_histogram(binwidth = 1, color = "white", boundary = 1990) +
labs(x = "sample mean")
```

Note here that the bell shape is starting to become much more apparent. We now have a general sense for the range of values that the sample mean may take on. But where is this histogram centered? Let’s compute the mean of the 1000 resample means:

```
# A tibble: 1 x 1
mean_of_means
<dbl>
1 1995.36
```

The mean of these 1000 means is 1995.36, which is quite close to the mean of our original sample of 50 pennies of 1995.44. This is the case since each of the 1000 resamples is based on the original sample of 50 pennies.

Congratulations! You’ve just constructed your first bootstrap distribution! In the next section, you’ll see how to use this bootstrap distribution to construct *confidence intervals*.

*Learning check*

**(LC8.1)** What is the chief difference between a bootstrap distribution and a sampling distribution?

**(LC8.2)** Looking at the bootstrap distribution for the sample mean in Figure 8.14, between what two values would you say *most* values lie?