## 9.5 Case study: Are action or romance movies rated higher?

Let’s apply our knowledge of hypothesis testing to answer the question: “Are action or romance movies rated higher on IMDb?”. IMDb is a database on the internet providing information on movie and television show casts, plot summaries, trivia, and ratings. We’ll investigate if, on average, action or romance movies get higher ratings on IMDb.

### 9.5.1 IMDb ratings data

The `movies` dataset in the `ggplot2movies` package contains information on 58,788 movies that have been rated by users of IMDb.com.
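To reproduce the preview below, load the `ggplot2movies` package and print the `movies` data frame:

```
library(ggplot2movies)
movies
```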

```
# A tibble: 58,788 x 24
   title  year length budget rating votes    r1    r2    r3    r4    r5    r6
   <chr> <int>  <int>  <int>  <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 $      1971    121     NA    6.4   348   4.5   4.5   4.5   4.5  14.5  24.5
 2 $100…  1939     71     NA    6      20   0    14.5   4.5  24.5  14.5  14.5
 3 $21 …  1941      7     NA    8.2     5   0     0     0     0     0    24.5
 4 $40,…  1996     70     NA    8.2     6  14.5   0     0     0     0     0
 5 $50,…  1975     71     NA    3.4    17  24.5   4.5   0    14.5  14.5   4.5
 6 $pent  2000     91     NA    4.3    45   4.5   4.5   4.5  14.5  14.5  14.5
 7 $win…  2002     93     NA    5.3   200   4.5   0     4.5   4.5  24.5  24.5
 8 '15'   2002     25     NA    6.7    24   4.5   4.5   4.5   4.5   4.5  14.5
 9 '38    1987     97     NA    6.6    18   4.5   4.5   4.5   0     0     0
10 '49-…  1917     61     NA    6      51   4.5   0     4.5   4.5   4.5  44.5
# … with 58,778 more rows, and 12 more variables: r7 <dbl>, r8 <dbl>, r9 <dbl>,
#   r10 <dbl>, mpaa <chr>, Action <int>, Animation <int>, Comedy <int>,
#   Drama <int>, Documentary <int>, Romance <int>, Short <int>
```

We’ll focus on a random sample of 68 movies that are classified as either “action” or “romance” movies but not both. We disregard movies that are classified as both so that we can assign all 68 movies into either category. Furthermore, since the original `movies` dataset was a little messy, we provide a pre-wrangled version of our data in the `movies_sample` data frame included in the `moderndive` package. If you’re curious, you can look at the necessary data wrangling code to do this on GitHub.
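Loading the `moderndive` package and printing `movies_sample` produces the preview below:

```
library(moderndive)
movies_sample
```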

```
# A tibble: 68 x 4
   title                     year rating genre
   <chr>                    <int>  <dbl> <chr>
 1 Underworld                1985    3.1 Action
 2 Love Affair               1932    6.3 Romance
 3 Junglee                   1961    6.8 Romance
 4 Eversmile, New Jersey     1989    5   Romance
 5 Search and Destroy        1979    4   Action
 6 Secreto de Romelia, El    1988    4.9 Romance
 7 Amants du Pont-Neuf, Les  1991    7.4 Romance
 8 Illicit Dreams            1995    3.5 Action
 9 Kabhi Kabhie              1976    7.7 Romance
10 Electric Horseman, The    1979    5.8 Romance
# … with 58 more rows
```

The variables include the `title` and `year` the movie was filmed. Furthermore, we have a numerical variable `rating`, which is the IMDb rating out of 10 stars, and a binary categorical variable `genre` indicating if the movie was an `Action` or `Romance` movie. We are interested in whether `Action` or `Romance` movies got a higher `rating` on average.

Let’s perform an exploratory data analysis of this data. Recall from Subsection 2.7.1 that a boxplot is a visualization we can use to show the relationship between a numerical and a categorical variable. Another option you saw in Section 2.6 would be to use a faceted histogram. However, in the interest of brevity, let’s only present the boxplot in Figure 9.17.
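A boxplot like the one in Figure 9.17 can be created with `ggplot2` code along these lines (the axis label passed to `labs()` is our own choice; the styling of the published figure may differ):

```
library(ggplot2)
library(moderndive)

# Boxplot of IMDb rating split by the binary genre variable
ggplot(data = movies_sample, aes(x = genre, y = rating)) +
  geom_boxplot() +
  labs(y = "IMDb rating")
```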

Eyeballing Figure 9.17, romance movies have a higher median rating. Do we have reason to believe, however, that there is a *significant* difference between the mean `rating` for action movies compared to romance movies? It’s hard to say just based on this plot. The boxplot does show that the median sample rating is higher for romance movies.

However, there is a large amount of overlap between the boxes. Recall that the median isn’t necessarily the same as the mean either, depending on whether the distribution is skewed.

Let’s calculate some summary statistics split by the binary categorical variable `genre`: the number of movies, the mean rating, and the standard deviation split by `genre`. We’ll do this using `dplyr` data wrangling verbs. Notice in particular how we count the number of each type of movie using the `n()` summary function.

```
movies_sample %>%
  group_by(genre) %>% 
  summarize(n = n(), mean_rating = mean(rating), std_dev = sd(rating))
```

```
# A tibble: 2 x 4
  genre       n mean_rating std_dev
  <chr>   <int>       <dbl>   <dbl>
1 Action     32     5.275   1.36121
2 Romance    36     6.32222 1.60963
```

Observe that we have 36 romance movies with an average rating of 6.322 stars and 32 action movies with an average rating of 5.275 stars. The difference in these average ratings is thus 6.322 - 5.275 = 1.047. So there appears to be an edge of 1.047 stars in favor of romance movies. The question is, however, are these results indicative of a true difference for *all* romance and action movies? Or could we attribute this difference to chance *sampling variation*?

### 9.5.2 Sampling scenario

Let’s now revisit this study in terms of terminology and notation related to sampling we studied in Subsection 7.3.1. The *study population* is all movies in the IMDb database that are either action or romance (but not both). The *sample* from this population is the 68 movies included in the `movies_sample` dataset.

Since this sample was randomly taken from the population `movies`, it is representative of all romance and action movies on IMDb. Thus, any analysis and results based on `movies_sample` can generalize to the entire population. What are the relevant *population parameter* and *point estimates*? We introduce the fourth sampling scenario in Table 9.3.

Scenario | Population parameter | Notation | Point estimate | Symbol(s)
---|---|---|---|---
1 | Population proportion | \(p\) | Sample proportion | \(\widehat{p}\)
2 | Population mean | \(\mu\) | Sample mean | \(\overline{x}\) or \(\widehat{\mu}\)
3 | Difference in population proportions | \(p_1 - p_2\) | Difference in sample proportions | \(\widehat{p}_1 - \widehat{p}_2\)
4 | Difference in population means | \(\mu_1 - \mu_2\) | Difference in sample means | \(\overline{x}_1 - \overline{x}_2\)

So, whereas the sampling bowl exercise in Section 7.1 concerned *proportions*, the pennies exercise in Section 8.1 concerned *means*, and both the case study on whether yawning is contagious in Section 8.6 and the promotions activity in Section 9.1 concerned *differences in proportions*, we are now concerned with *differences in means*.

In other words, the population parameter of interest is the difference in population mean ratings \(\mu_a - \mu_r\), where \(\mu_a\) is the mean rating of all action movies on IMDb and, similarly, \(\mu_r\) is the mean rating of all romance movies. Additionally, the point estimate/sample statistic of interest is the difference in sample means \(\overline{x}_a - \overline{x}_r\), where \(\overline{x}_a\) is the mean rating of the \(n_a\) = 32 action movies in our sample and \(\overline{x}_r\) is the mean rating of the \(n_r\) = 36 romance movies in our sample. Based on our earlier exploratory data analysis, our estimate \(\overline{x}_a - \overline{x}_r\) is \(5.275 - 6.322 = -1.047\).

So there appears to be a slight difference of -1.047 in favor of romance movies. The question is, however, could this difference of -1.047 be merely due to chance and sampling variation? Or are these results indicative of a true difference in mean ratings for *all* romance and action movies on IMDb? To answer this question, we’ll use hypothesis testing.

### 9.5.3 Conducting the hypothesis test

We’ll be testing:

\[ \begin{aligned} H_0 &: \mu_a - \mu_r = 0\\ \text{vs } H_A&: \mu_a - \mu_r \neq 0 \end{aligned} \]

In other words, the null hypothesis \(H_0\) suggests that both romance and action movies have the same mean rating. This is the “hypothesized universe” we’ll *assume* is true. On the other hand, the alternative hypothesis \(H_A\) suggests that there is a difference. Unlike the one-sided alternative we used in the promotions exercise \(H_A: p_m - p_f > 0\), we are now considering a two-sided alternative of \(H_A: \mu_a - \mu_r \neq 0\).

Furthermore, we’ll pre-specify a low significance level of \(\alpha\) = 0.001. By setting this value low, all things being equal, there is a lower chance that the \(p\)-value will be less than \(\alpha\). Thus, there is a lower chance that we’ll reject the null hypothesis \(H_0\) in favor of the alternative hypothesis \(H_A\). In other words, we’ll reject the hypothesis that there is no difference in mean ratings for all action and romance movies, only if we have quite strong evidence. This is known as a “conservative” hypothesis testing procedure.

#### 1. `specify` variables

Let’s now perform all the steps of the `infer` workflow. We first `specify()` the variables of interest in the `movies_sample` data frame using the formula `rating ~ genre`. This tells `infer` that the numerical variable `rating` is the outcome variable, while the binary variable `genre` is the explanatory variable. Note that unlike previously when we were interested in proportions, since we are now interested in the mean of a numerical variable, we do not need to set the `success` argument.
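In code, the `specify()` step is:

```
movies_sample %>% 
  specify(formula = rating ~ genre)
```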

```
Response: rating (numeric)
Explanatory: genre (factor)
# A tibble: 68 x 2
   rating genre  
    <dbl> <fct>  
 1    3.1 Action 
 2    6.3 Romance
 3    6.8 Romance
 4    5   Romance
 5    4   Action 
 6    4.9 Romance
 7    7.4 Romance
 8    3.5 Action 
 9    7.7 Romance
10    5.8 Romance
# … with 58 more rows
```

Observe at this point that the data in `movies_sample` has not changed. The only change so far is the newly defined `Response: rating (numeric)` and `Explanatory: genre (factor)` *meta-data*.

#### 2. `hypothesize` the null

We set the null hypothesis \(H_0: \mu_a - \mu_r = 0\) by using the `hypothesize()` function. Since we have two samples, action and romance movies, we set `null` to be `"independence"` as we described in Section 9.3.
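Adding this step to the pipeline:

```
movies_sample %>% 
  specify(formula = rating ~ genre) %>% 
  hypothesize(null = "independence")
```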

```
Response: rating (numeric)
Explanatory: genre (factor)
Null Hypothesis: independence
# A tibble: 68 x 2
   rating genre  
    <dbl> <fct>  
 1    3.1 Action 
 2    6.3 Romance
 3    6.8 Romance
 4    5   Romance
 5    4   Action 
 6    4.9 Romance
 7    7.4 Romance
 8    3.5 Action 
 9    7.7 Romance
10    5.8 Romance
# … with 58 more rows
```

#### 3. `generate` replicates

After we have set the null hypothesis, we generate “shuffled” replicates assuming the null hypothesis is true by repeating the shuffling/permutation exercise you performed in Section 9.1.

We’ll repeat this resampling without replacement of `type = "permute"` a total of `reps = 1000` times. Feel free to run the code below to check out what the `generate()` step produces.
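The pipeline up to this step is shown next; its output is a data frame of 68 × 1000 = 68,000 shuffled ratings, one group of 68 per replicate:

```
movies_sample %>% 
  specify(formula = rating ~ genre) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute")
```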

#### 4. `calculate` summary statistics

Now that we have 1000 replicated “shuffles” assuming the null hypothesis \(H_0\) that both `Action` and `Romance` movies on average have the same ratings on IMDb, let’s `calculate()` the appropriate summary statistic for these 1000 replicated shuffles. From Section 9.2, summary statistics relating to hypothesis testing have a specific name: *test statistics*. Since the unknown population parameter of interest is the difference in population means \(\mu_{a} - \mu_{r}\), the test statistic of interest here is the difference in sample means \(\overline{x}_{a} - \overline{x}_{r}\).

For each of our 1000 shuffles, we can calculate this test statistic by setting `stat = "diff in means"`. Furthermore, since we are interested in \(\overline{x}_{a} - \overline{x}_{r}\), we set `order = c("Action", "Romance")`. Let’s save the results in a data frame called `null_distribution_movies`:

```
null_distribution_movies <- movies_sample %>%
  specify(formula = rating ~ genre) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "diff in means", order = c("Action", "Romance"))
null_distribution_movies
```

```
# A tibble: 1,000 x 2
   replicate      stat
       <int>     <dbl>
 1         1  0.511111
 2         2  0.345833
 3         3 -0.327083
 4         4 -0.209028
 5         5 -0.433333
 6         6 -0.102778
 7         7  0.387153
 8         8  0.16875
 9         9  0.257292
10        10  0.334028
# … with 990 more rows
```

Observe that we have 1000 values of `stat`, each representing one instance of \(\overline{x}_{a} - \overline{x}_{r}\). The 1000 values form the *null distribution*, which is the technical term for the sampling distribution of the difference in sample means \(\overline{x}_{a} - \overline{x}_{r}\) assuming \(H_0\) is true. What happened in real life? What was the observed difference in mean ratings? What was the *observed test statistic* \(\overline{x}_{a} - \overline{x}_{r}\)? Recall from our earlier data wrangling, this observed difference in means was \(5.275 - 6.322 = -1.047\). We can also compute this using the code that constructed the null distribution `null_distribution_movies`, but with the `hypothesize()` and `generate()` steps removed. Let’s save this in `obs_diff_means`:

```
obs_diff_means <- movies_sample %>%
  specify(formula = rating ~ genre) %>% 
  calculate(stat = "diff in means", order = c("Action", "Romance"))
obs_diff_means
```

```
# A tibble: 1 x 1
      stat
     <dbl>
1 -1.04722
```

#### 5. `visualize` the p-value

Lastly, in order to compute the \(p\)-value, we have to assess how “extreme” the observed difference in means of -1.047 is. We do this by comparing -1.047 to our null distribution, which was constructed in a hypothesized universe of no true difference in movie ratings. Let’s visualize both the null distribution and the \(p\)-value in Figure 9.18. Unlike our example in Subsection 9.3.1 involving promotions, since we have a two-sided \(H_A: \mu_a - \mu_r \neq 0\), we have to allow for both possibilities for *more extreme*, so we set `direction = "both"`.

```
visualize(null_distribution_movies, bins = 10) +
  shade_p_value(obs_stat = obs_diff_means, direction = "both")
```

Let’s go over the elements of this plot. First, the histogram is the *null distribution*. Second, the solid line is the *observed test statistic*, or the difference in sample means we observed in real life of \(5.275 - 6.322 = -1.047\). Third, the two shaded areas of the histogram form the *\(p\)-value*, or the probability of obtaining a test statistic just as or more extreme than the observed test statistic *assuming the null hypothesis \(H_0\) is true*.

What proportion of the null distribution is shaded? In other words, what is the numerical value of the \(p\)-value? We use the `get_p_value()` function to compute this value:
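As with `shade_p_value()`, we supply the observed statistic and the direction of the test:

```
null_distribution_movies %>% 
  get_p_value(obs_stat = obs_diff_means, direction = "both")
```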

```
# A tibble: 1 x 1
  p_value
    <dbl>
1   0.004
```

This \(p\)-value of 0.004 is very small. In other words, there is a very small chance that we’d observe a difference of 5.275 - 6.322 = -1.047 in a hypothesized universe where there was truly no difference in ratings.

But this \(p\)-value is larger than our (even smaller) pre-specified \(\alpha\) significance level of 0.001. Thus, we are inclined to fail to reject the null hypothesis \(H_0: \mu_a - \mu_r = 0\). In non-statistical language, the conclusion is: we do not have the evidence needed in this sample of data to suggest that we should reject the hypothesis that there is no difference in mean IMDb ratings between romance and action movies. We thus cannot say that a difference exists in romance and action movie ratings, on average, for all IMDb movies.

*Learning check*

**(LC9.9)** Conduct the same analysis comparing action movies versus romantic movies using the median rating instead of the mean rating. What was different and what was the same?

**(LC9.10)** What conclusions can you make from viewing the faceted histogram looking at `rating` versus `genre` that you couldn’t see when looking at the boxplot?

**(LC9.11)** Describe in a paragraph how we used Allen Downey’s diagram to conclude if a statistical difference existed between mean movie ratings for action and romance movies.

**(LC9.12)** Why are we relatively confident that the distributions of the sample ratings will be good approximations of the population distributions of ratings for the two genres?

**(LC9.13)** Using the definition of \(p\)-value, write in words what the \(p\)-value represents for the hypothesis test comparing the mean rating of romance to action movies.

**(LC9.14)** What is the value of the \(p\)-value for the hypothesis test comparing the mean rating of romance to action movies?

**(LC9.15)** Test your data wrangling knowledge and EDA skills:

- Use `dplyr` and `tidyr` to create the necessary data frame focused on only action and romance movies (but not both) from the `movies` data frame in the `ggplot2movies` package.
- Make a boxplot and a faceted histogram of this population data comparing ratings of action and romance movies from IMDb.
- Discuss how these plots compare to the similar plots produced for the `movies_sample` data.