## 7.4 Case study: Polls

Let’s now switch gears to a more realistic sampling scenario than our bowl activity: a poll. In practice, pollsters do not take 1000 repeated samples as we did in our previous sampling activities, but rather take only a single sample that’s as large as possible.

On December 4, 2013, National Public Radio in the US reported on a poll of President Obama’s approval rating among young Americans aged 18-29 in an article, “Poll: Support For Obama Among Young Americans Eroding.” The poll was conducted by the Kennedy School’s Institute of Politics at Harvard University. A quote from the article:

> After voting for him in large numbers in 2008 and 2012, young Americans are souring on President Obama.
>
> According to a new Harvard University Institute of Politics poll, just 41 percent of millennials — adults ages 18-29 — approve of Obama’s job performance, his lowest-ever standing among the group and an 11-point drop from April.

Let’s tie elements of the real-life poll in this news article with our “tactile” and “virtual” bowl activity from Sections 7.1 and 7.2 using the terminology, notations, and definitions we learned in Section 7.3. You’ll see that our sampling activity with the bowl is an idealized version of what pollsters are trying to do in real life.

First, who is the (Study) Population of $$N$$ individuals or observations of interest?

• Bowl: $$N$$ = 2400 identically sized red and white balls
• Obama poll: $$N$$ = ? young Americans aged 18-29

Second, what is the population parameter?

• Bowl: The population proportion $$p$$ of all the balls in the bowl that are red.
• Obama poll: The population proportion $$p$$ of all young Americans who approve of Obama’s job performance.

Third, what would a census look like?

• Bowl: Manually going over all $$N$$ = 2400 balls and exactly computing the population proportion $$p$$ of the balls that are red.
• Obama poll: Locating all $$N$$ young Americans and asking them all if they approve of Obama’s job performance. In this case, we don’t even know what the population size $$N$$ is!

Fourth, how do you perform sampling to obtain a sample of size $$n$$?

• Bowl: Using a shovel with $$n$$ slots.
• Obama poll: One method is to get a list of phone numbers of all young Americans and pick out $$n$$ phone numbers. In this poll, the sample size was $$n = 2089$$ young Americans.

Fifth, what is your point estimate (AKA sample statistic) of the unknown population parameter?

• Bowl: The sample proportion $$\widehat{p}$$ of the balls in the shovel that were red.
• Obama poll: The sample proportion $$\widehat{p}$$ of young Americans in the sample that approve of Obama’s job performance. In this poll’s case, $$\widehat{p} = 0.41 = 41\%$$, the quoted percentage in the second paragraph of the article.

Sixth, is the sampling procedure representative?

• Bowl: Are the contents of the shovel representative of the contents of the bowl? Because we mixed the bowl before sampling, we can feel confident that they are.
• Obama poll: Is the sample of $$n = 2089$$ young Americans representative of all young Americans aged 18-29? This depends on whether the sampling was random.

Seventh, are the samples generalizable to the greater population?

• Bowl: Is the sample proportion $$\widehat{p}$$ of the shovel’s balls that are red a “good guess” of the population proportion $$p$$ of the bowl’s balls that are red? Given that the sample was representative, the answer is yes.
• Obama poll: Is the sample proportion $$\widehat{p} = 0.41$$ of the sample of young Americans who approved of Obama’s job performance a “good guess” of the population proportion $$p$$ of all young Americans who approved of Obama’s job performance at this time in 2013? In other words, can we confidently say that roughly 41% of all young Americans approved of Obama at the time of the poll? Again, this depends on whether the sampling was random.

Eighth, is the sampling procedure unbiased? In other words, do all observations have an equal chance of being included in the sample?

• Bowl: Since each ball was equally sized and we mixed the bowl before using the shovel, each ball had an equal chance of being included in a sample and hence the sampling was unbiased.
• Obama poll: Did all young Americans have an equal chance at being represented in this poll? Again, this depends on whether the sampling was random.

Ninth and lastly, was the sampling done at random?

• Bowl: As long as you mixed the bowl sufficiently before sampling, your samples would be random.
• Obama poll: Was the sampling conducted at random? We can’t answer this question without knowing about the sampling methodology used by the Kennedy School’s Institute of Politics at Harvard University. We’ll discuss this more at the end of this section.

In other words, the poll by the Kennedy School’s Institute of Politics at Harvard University can be thought of as an instance of using the shovel to sample balls from the bowl. Furthermore, if another polling company conducted a similar poll of young Americans at roughly the same time, it would likely get a different estimate than 41%. This is due to sampling variation.
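To see sampling variation in action, here is a minimal Python sketch that simulates five independent polls of $$n = 2089$$ respondents. It assumes, purely for illustration, that the true approval rate is $$p = 0.41$$ (in reality $$p$$ is unknown) and treats each respondent as an independent draw, which is a reasonable approximation when the population is far larger than the sample. The function `run_poll` and the constants are hypothetical names introduced for this example.

```python
import random


def run_poll(true_p, n, rng):
    """Simulate one poll of n respondents and return the sample proportion."""
    # Each respondent approves with probability true_p, independently.
    approvals = sum(rng.random() < true_p for _ in range(n))
    return approvals / n


# Assumption for illustration only: the true approval rate is 0.41.
# In a real poll, p is unknown; estimating it is the whole point.
ASSUMED_P = 0.41
N_SAMPLE = 2089  # sample size reported for the poll

rng = random.Random(76)  # fixed seed so the sketch is reproducible
estimates = [run_poll(ASSUMED_P, N_SAMPLE, rng) for _ in range(5)]

for i, p_hat in enumerate(estimates, start=1):
    print(f"Poll {i}: p_hat = {p_hat:.3f}")
```

Each simulated poll should return a slightly different $$\widehat{p}$$, even though every poll samples from the same population with the same $$p$$.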

Let’s now revisit the sampling paradigm from Subsection 7.3.1:

In general:

• If the sampling of a sample of size $$n$$ is done at random, then
• the sample is unbiased and representative of the population of size $$N$$, thus
• any result based on the sample can generalize to the population, thus
• the point estimate is a “good guess” of the unknown population parameter, thus
• instead of performing a census, we can infer about the population using sampling.

Specific to the bowl:

• If we extract a sample of $$n = 50$$ balls at random, in other words, we mix all of the equally sized balls before using the shovel, then
• the contents of the shovel are an unbiased representation of the contents of the bowl’s 2400 balls, thus
• any result based on the shovel’s balls can generalize to the bowl, thus
• the sample proportion $$\widehat{p}$$ of the $$n = 50$$ balls in the shovel that are red is a “good guess” of the population proportion $$p$$ of the $$N = 2400$$ balls that are red, thus
• instead of manually going over all 2400 balls in the bowl, we can infer about the bowl using the shovel.
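The bowl version of this paradigm can be sketched in a few lines of Python. The population size $$N = 2400$$ and shovel size $$n = 50$$ come from the text; the number of red balls (`ASSUMED_RED = 900`) is an assumption made for this sketch, as are all variable names.

```python
import random

N_BALLS = 2400     # population size N from the text
ASSUMED_RED = 900  # assumption for this sketch: number of red balls in the bowl
SHOVEL_SIZE = 50   # sample size n from the text

bowl = ["red"] * ASSUMED_RED + ["white"] * (N_BALLS - ASSUMED_RED)

# Census: inspect every ball to get the population proportion p exactly.
p = bowl.count("red") / N_BALLS

# Sampling: random.sample draws without replacement, giving every ball
# the same chance of selection, like one scoop from a well-mixed bowl.
rng = random.Random(2013)  # fixed seed so the sketch is reproducible
shovel = rng.sample(bowl, SHOVEL_SIZE)
p_hat = shovel.count("red") / SHOVEL_SIZE

print(f"Population proportion p = {p:.3f}")
print(f"Sample proportion p_hat = {p_hat:.3f}")
```

The census touches all 2400 balls, while the shovel touches only 50; the sample proportion will typically land near $$p$$ without matching it exactly.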

Specific to the Obama poll:

• If we had a way of contacting a randomly chosen sample of 2089 young Americans and polling their approval of President Obama in 2013, then
• these 2089 young Americans would be an unbiased and representative sample of all young Americans in 2013, thus
• any results based on this sample of 2089 young Americans can generalize to the entire population of all young Americans in 2013, thus
• the reported sample approval rating of 41% of these 2089 young Americans is a good guess of the true approval rating among all young Americans in 2013, thus
• instead of performing an expensive census of all young Americans in 2013, we can infer about all young Americans in 2013 using polling.

So as you can see, it was critical for the sample obtained by the Kennedy School’s Institute of Politics at Harvard University to be truly random in order to infer about all young Americans’ opinions about Obama. Was their sample truly random? It’s hard to answer such questions without knowing about the sampling methodology they used. For example, if this poll was conducted using only mobile phone numbers, people without mobile phones would be left out and therefore not represented in the sample. What if the Kennedy School’s Institute of Politics at Harvard University had conducted this poll on an internet news site? Then people who don’t read this particular internet news site would be left out. Ensuring that our samples were random was easy to do in our sampling bowl exercises; however, in a real-life situation like the Obama poll, this is much harder to do.
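The mobile-phone example can be made concrete with a small simulation. Everything about the population below is invented for illustration: 100,000 people, 80% of whom have a mobile phone, with approval rates that differ between phone owners and non-owners. None of these figures come from the actual poll; they only show how sampling from an incomplete list can shift the estimate.

```python
import random

# Hypothetical population, invented purely for illustration.
# Each person is a pair (has_mobile, approves).
population = (
    [(True, True)] * 36_000 + [(True, False)] * 44_000      # 80,000 with a mobile phone
    + [(False, True)] * 5_000 + [(False, False)] * 15_000   # 20,000 without one
)


def approval_rate(people):
    """Proportion of the given people who approve."""
    return sum(approves for _, approves in people) / len(people)


rng = random.Random(7)  # fixed seed so the sketch is reproducible
n = 2089

# Random sample from everyone: each person has an equal chance of selection.
random_sample = rng.sample(population, n)

# Random sample from mobile owners only: non-owners have zero chance.
mobile_only_frame = [person for person in population if person[0]]
mobile_sample = rng.sample(mobile_only_frame, n)

print(f"True approval rate:        {approval_rate(population):.3f}")
print(f"Rate among mobile owners:  {approval_rate(mobile_only_frame):.3f}")
print(f"Estimate, full population: {approval_rate(random_sample):.3f}")
print(f"Estimate, mobile only:     {approval_rate(mobile_sample):.3f}")
```

In this made-up population the true approval rate is 0.41, but the rate among mobile owners is 0.45. A sample drawn only from mobile owners estimates the latter, so even perfectly random selection within that list does not remove the bias.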

Learning check

Comment on the representativeness of the following sampling methodologies:

(LC7.21) The Royal Air Force wants to study how resistant all their airplanes are to bullets. They study the bullet holes on all the airplanes on the tarmac after an air battle against the Luftwaffe (German Air Force).

(LC7.22) Imagine it is 1993, a time when almost all households had landlines. You want to know the average number of people in each household in your city. You randomly pick out 500 phone numbers from the phone book and conduct a phone survey.

(LC7.23) You want to know the prevalence of illegal downloading of TV shows among students at a local college. You get the emails of 100 randomly chosen students and ask them, “How many times did you download a pirated TV show last week?”

(LC7.24) A local college administrator wants to know the average income of all graduates in the last 10 years. So they get the records of five randomly chosen graduates, contact them, and obtain their answers.