## D.7 Chapter 7 Solutions

library(ggplot2)
library(dplyr)
library(moderndive)
library(gapminder)
library(skimr)

(LC7.1) Why was it important to mix the bowl before we sampled the balls?

Solution:

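Mixing ensures that every ball has an equal chance of ending up in the shovel, so each sample is random rather than a reflection of how the balls happened to be arranged in the bowl.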

(LC7.2) Why is it that our 33 groups of friends did not all have the same numbers of balls that were red out of 50, and hence different proportions red?

Solution:

Because each group scooped a different random set of 50 balls out of the bowl. Different samples contain different mixes of red and white balls, so the counts, and hence the proportions red, differ from group to group. This is sampling variation.

(LC7.3) Why couldn’t we study the effects of sampling variation when we used the virtual shovel only once? Why did we need to take more than one virtual sample (in our case 33 virtual samples)?

Solution:

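If we use the virtual shovel only once, we get only one sample and therefore only one proportion red, which tells us nothing about how that proportion changes from sample to sample. We need many virtual samples to see the range of proportions that sampling variation produces.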
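As a rough illustration of why one sample is not enough, the following sketch rebuilds a stand-in for the bowl in base R and uses a virtual shovel 33 times. The vector `bowl_colors` and the helper `use_shovel()` are hypothetical names introduced here, and the counts of 900 red and 1500 white balls are an assumption about the bowl's contents.

```r
# Stand-in for the bowl: ASSUMED to hold 900 red and 1500 white balls
bowl_colors <- c(rep("red", 900), rep("white", 1500))

# Hypothetical helper: scoop `size` balls without replacement and
# return the proportion of them that are red
use_shovel <- function(size) {
  mean(sample(bowl_colors, size = size) == "red")
}

set.seed(76)  # arbitrary seed so the run is reproducible

one_sample   <- use_shovel(50)                 # a single proportion
many_samples <- replicate(33, use_shovel(50))  # 33 proportions

one_sample           # one number, with nothing to compare it to
range(many_samples)  # spread across the 33 virtual samples
```

A single call gives one number; only the repeated calls show how much the proportion red moves from shovel to shovel.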

(LC7.4) Why did we not take 1000 “tactile” samples of 50 balls by hand?

Solution:

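Taking 1000 samples by hand would be far too tedious and time-consuming. The computer can carry out the same repeated sampling virtually in moments.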

(LC7.5) Looking at Figure 7.10, would you say that sampling 50 balls where 30% of them were red is likely or not? What about sampling 50 balls where 10% of them were red?

Solution:

According to Figure 7.10, fewer than 150 of the 1000 samples had a proportion red of about 30%, so sampling 50 balls where 30% of them are red is possible but not very likely. Almost none of the samples had a proportion red near 10%, so sampling 50 balls where 10% of them are red is extremely unlikely.
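One way to cross-check these impressions, sketched here under the assumption that the bowl holds 900 red and 1500 white balls, is to compute the chance of each outcome directly. Drawing 50 balls without replacement follows a hypergeometric distribution, available in base R as `dhyper()`. This is a supplementary calculation, not the simulation behind Figure 7.10, and the variable names are illustrative.

```r
# ASSUMED bowl contents and shovel size
red_in_bowl   <- 900
white_in_bowl <- 1500
shovel_size   <- 50

# Chance of scooping exactly 15 red balls (30% of 50)
p_30 <- dhyper(15, m = red_in_bowl, n = white_in_bowl, k = shovel_size)

# Chance of scooping exactly 5 red balls (10% of 50)
p_10 <- dhyper(5, m = red_in_bowl, n = white_in_bowl, k = shovel_size)

c(thirty_percent = p_30, ten_percent = p_10)
```

Under these assumptions, the chance of exactly 30% red is many times larger than the chance of exactly 10% red, which agrees with the reading of the histogram.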

(LC7.6) In Figure 7.12, we used shovels to take 1000 samples each, computed the resulting 1000 proportions of the shovel’s balls that were red, and then visualized the distribution of these 1000 proportions in a histogram. We did this for shovels with 25, 50, and 100 slots in them. As the size of the shovels increased, the histograms got narrower. In other words, as the size of the shovels increased from 25 to 50 to 100, did the 1000 proportions

• A. vary less,
• B. vary by the same amount, or
• C. vary more?

Solution:

A. As the histograms got narrower, the 1000 proportions varied less.

(LC7.7) What summary statistic did we use to quantify how much the 1000 proportions red varied?

• A. The inter-quartile range
• B. The standard deviation
• C. The range: the largest value minus the smallest.

Solution:

B. The standard deviation, which quantifies how much a set of values varies around its mean.
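The link between shovel size and the standard deviation of the proportions can be sketched with a small base R simulation. The bowl contents (900 red, 1500 white) are an assumption, `prop_red()` is a hypothetical helper, and the seed is arbitrary, so the exact numbers will differ from those in the book.

```r
# Stand-in for the bowl: ASSUMED to hold 900 red and 1500 white balls
bowl_colors <- c(rep("red", 900), rep("white", 1500))

# Hypothetical helper: proportion red in one shovel of the given size
prop_red <- function(size) {
  mean(sample(bowl_colors, size = size) == "red")
}

set.seed(76)  # arbitrary seed for reproducibility
shovel_sizes <- c(25, 50, 100)

# For each shovel size: take 1000 samples, then compute the standard
# deviation of the resulting 1000 proportions red
sd_by_size <- sapply(shovel_sizes, function(size) {
  sd(replicate(1000, prop_red(size)))
})
names(sd_by_size) <- shovel_sizes

sd_by_size
```

The standard deviation shrinks as the shovel grows, which is the numerical counterpart of the histograms getting narrower.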

(LC7.8) In the case of our bowl activity, what is the population parameter? Do we know its value?

Solution:

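The population parameter in the case of our bowl activity is the population proportion $$p$$: the proportion of the bowl’s balls that are red. In this activity we can know its value, because the contents of the bowl are recorded: 900 of the 2400 balls are red, so $$p = 900/2400 = 0.375$$. In most real-life sampling situations, the value of the population parameter is unknown.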

(LC7.9) What would performing a census in our bowl activity correspond to? Why did we not perform a census?

Solution:

Performing a census in our bowl activity corresponds to going through every ball in the bowl and counting how many are red. We did not perform a census because it would be tedious and time-consuming, and it is unnecessary: a random sample gives a good estimate with far less work.

(LC7.10) What purpose do point estimates serve in general? What is the name of the point estimate specific to our bowl activity? What is its mathematical notation?

Solution:

Point estimates are computed from a sample and serve to estimate an unknown population parameter. In our bowl activity, our point estimate is the sample proportion: the proportion of the shovel’s balls that are red. We mathematically denote the sample proportion using $$\widehat{p}$$.

(LC7.11) How did we ensure that our tactile samples using the shovel were random?

Solution:

We mixed the bowl before each use of the shovel, so that every ball had an equal chance of being scooped.

(LC7.12) Why is it important that sampling be done at random?

Solution:

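So that the samples are unbiased and representative of the population. When every ball has an equal chance of being selected, neither color is systematically favored, and results based on the sample can be generalized to the whole bowl.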

(LC7.13) What are we inferring about the bowl based on the samples using the shovel?

Solution:

We are inferring the proportion of the bowl’s balls that are red, using the proportion red in the shovel as our estimate.

(LC7.14) What purpose did the sampling distributions serve?

Solution:

Using the sampling distributions, for a given sample size $$n$$, we can make statements about what values of the sample proportion $$\widehat{p}$$ we can typically expect and how much they vary from sample to sample.

(LC7.15) What does the standard error of the sample proportion $$\widehat{p}$$ quantify?

Solution:

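The standard error quantifies the effect of sampling variation on our estimates. It is the standard deviation of the sampling distribution of $$\widehat{p}$$, so it measures how much $$\widehat{p}$$ typically varies from one sample to another.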

(LC7.16) The table that follows is a version of Table 7.3 matching sample sizes $$n$$ to different standard errors of the sample proportion $$\widehat{p}$$, but with the rows randomly re-ordered and the sample sizes removed. Fill in the table by matching the correct sample sizes to the correct standard errors.

| Sample size | Standard error of $$\widehat{p}$$ |
|---|---|
| n = | 0.094 |
| n = | 0.045 |
| n = | 0.069 |

Solution:

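From top to bottom: $$n = 25$$, $$n = 100$$, and $$n = 50$$. Larger samples give smaller standard errors, so the largest standard error (0.094) belongs to the smallest sample size and the smallest standard error (0.045) to the largest sample size.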
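The matching can be cross-checked against the common approximation $$\sqrt{p(1-p)/n}$$ for the standard error of a sample proportion. Both the formula and the value $$p = 0.375$$ are brought in here as assumptions; the table itself comes from simulation, so the two sets of numbers are only expected to agree roughly.

```r
# ASSUMED population proportion of red balls
p <- 0.375

# Sample sizes, sorted from smallest to largest
n <- c(25, 50, 100)

# Approximate standard error of the sample proportion for each n
se_approx <- sqrt(p * (1 - p) / n)

# Simulated standard errors from the table, matched to the same n
se_table <- c(0.094, 0.069, 0.045)

data.frame(n = n, se_approx = round(se_approx, 3), se_table = se_table)
```

In both columns the standard error falls as $$n$$ rises, which supports the matching given in the solution.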

For the following four learning checks, let the estimate be the sample proportion $$\widehat{p}$$: the proportion of a shovel’s balls that were red. It estimates the population proportion $$p$$: the proportion of the bowl’s balls that were red.

(LC7.17) What is the difference between an accurate estimate and a precise estimate?

Solution:

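An accurate estimate is one that is, on average, centered on the true value of the population parameter: it is not systematically too high or too low. A precise estimate is one that varies little from sample to sample, whether or not it is centered on the true value. An estimate can therefore be accurate but imprecise, or precise but inaccurate.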

(LC7.18) How do we ensure that an estimate is accurate? How do we ensure that an estimate is precise?

Solution:

To ensure that an estimate is accurate, we sample at random, so that the sample is representative and the estimate is not biased in one direction. To ensure that an estimate is precise, we use a large sample size, which reduces the standard error.
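A small simulation can make accuracy and precision concrete. The sketch below is illustrative only: the bowl contents (900 red, 1500 white) are assumed, and the "unmixed bowl" in which red balls are twice as likely to be scooped is a made-up scenario, as are the names `simulate_props()` and `red_heavy`.

```r
# Stand-in for the bowl: ASSUMED to hold 900 red and 1500 white balls
bowl_colors <- c(rep("red", 900), rep("white", 1500))
p_true <- mean(bowl_colors == "red")

# HYPOTHETICAL biased scooping: red balls are twice as likely to be
# picked as white balls (for example, a badly mixed bowl)
red_heavy <- ifelse(bowl_colors == "red", 2, 1)

# Hypothetical helper: proportions red from `reps` samples of `size`
# balls; `weights = NULL` means every ball is equally likely
simulate_props <- function(size, weights = NULL, reps = 1000) {
  replicate(reps, mean(sample(bowl_colors, size = size, prob = weights) == "red"))
}

set.seed(76)  # arbitrary seed for reproducibility
random_small <- simulate_props(25)
random_large <- simulate_props(100)
biased_large <- simulate_props(100, weights = red_heavy)

data.frame(
  scenario = c("random, n = 25", "random, n = 100", "biased, n = 100"),
  center   = c(mean(random_small), mean(random_large), mean(biased_large)),
  spread   = c(sd(random_small), sd(random_large), sd(biased_large))
)
```

Random sampling keeps the center near the true proportion at either sample size (accuracy), the larger shovel shrinks the spread (precision), and the biased scooping stays off target no matter how large the shovel is.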

(LC7.19) In a real-life situation, we would not take 1000 different samples to infer about a population, but rather only one. Then, what was the purpose of our exercises where we took 1000 different samples?

Solution:

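To study the effect of sampling variation. Taking 1000 samples lets us build the sampling distribution of $$\widehat{p}$$ and see how far a single estimate typically falls from the true value, and how that typical error shrinks as the sample size grows. This tells us how much to trust the one sample we would take in real life.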

(LC7.20) Figure 7.16 with the targets shows four combinations of “accurate versus precise” estimates. Draw four corresponding sampling distributions of the sample proportion $$\widehat{p}$$, like the one in the left-most plot in Figure 7.15.

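Solution:

Each of the four sampling distributions is a histogram of $$\widehat{p}$$ drawn relative to the true value $$p$$:

• Accurate and precise: centered at $$p$$ and narrow.
• Accurate but not precise: centered at $$p$$ and wide.
• Precise but not accurate: narrow, but centered away from $$p$$.
• Neither accurate nor precise: wide and centered away from $$p$$.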

Comment on the representativeness of the following sampling methodologies:

(LC7.21) The Royal Air Force wants to study how resistant all their airplanes are to bullets. They study the bullet holes on all the airplanes on the tarmac after an air battle against the Luftwaffe (German Air Force).

Solution:

The airplanes on the tarmac after an air battle against the Luftwaffe are not a good representation of all airplanes, because the airplanes that were hit in their most vulnerable areas did not make it back to the tarmac. This is called survivorship bias (also survivor bias or survival bias): the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. It can lead to false conclusions in several different ways, and it is a form of selection bias.

(LC7.22) Imagine it is 1993, a time when almost all households had landlines. You want to know the average number of people in each household in your city. You randomly pick out 500 phone numbers from the phone book and conduct a phone survey.

Solution:

This is not a good representation, because: (1) adults are more likely to pick up phone calls; (2) households with more people are more likely to have someone available to pick up the phone, so larger households are over-represented; (3) we cannot be certain that all households are listed in the phone book.

(LC7.23) You want to know the prevalence of illegal downloading of TV shows among students at a local college. You get the emails of 100 randomly chosen students and ask them, “How many times did you download a pirated TV show last week?”.

Solution:

This is not a good representation, because it is very likely that students will under-report their downloading in order to stay out of trouble, so we may not get honest data. This is a form of response bias. In addition, students choose whether to answer the email at all, which can introduce volunteer bias: systematic error due to differences between those who choose to participate in studies and those who do not.

(LC7.24) A local college administrator wants to know the average income of all graduates in the last 10 years. So they get the records of five randomly chosen graduates, contact them, and obtain their answers.

Solution:

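Because the five graduates are chosen at random, the sampling method itself is unbiased. However, the sample size is far too small: with only five graduates, the average income will vary greatly from sample to sample, so the estimate is not precise.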