9.5 Case study: Are action or romance movies rated higher?

Let’s apply our knowledge of hypothesis testing to answer the question: “Are action or romance movies rated higher on IMDb?”. IMDb is a database on the internet providing information on movie and television show casts, plot summaries, trivia, and ratings. We’ll investigate if, on average, action or romance movies get higher ratings on IMDb.

9.5.1 IMDb ratings data

The movies dataset in the ggplot2movies package contains information on 58,788 movies that have been rated by users of

# A tibble: 58,788 x 24
   title  year length budget rating votes    r1    r2    r3    r4    r5    r6
   <chr> <int>  <int>  <int>  <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 $      1971    121     NA  6.4     348   4.5   4.5   4.5   4.5  14.5  24.5
 2 $100…  1939     71     NA  6        20   0    14.5   4.5  24.5  14.5  14.5
 3 $21 …  1941      7     NA  8.200     5   0     0     0     0     0    24.5
 4 $40,…  1996     70     NA  8.200     6  14.5   0     0     0     0     0  
 5 $50,…  1975     71     NA  3.4      17  24.5   4.5   0    14.5  14.5   4.5
 6 $pent  2000     91     NA  4.3      45   4.5   4.5   4.5  14.5  14.5  14.5
 7 $win…  2002     93     NA  5.3     200   4.5   0     4.5   4.5  24.5  24.5
 8 '15'   2002     25     NA  6.7      24   4.5   4.5   4.5   4.5   4.5  14.5
 9 '38    1987     97     NA  6.6      18   4.5   4.5   4.5   0     0     0  
10 '49-…  1917     61     NA  6        51   4.5   0     4.5   4.5   4.5  44.5
# … with 58,778 more rows, and 12 more variables: r7 <dbl>, r8 <dbl>, r9 <dbl>,
#   r10 <dbl>, mpaa <chr>, Action <int>, Animation <int>, Comedy <int>,
#   Drama <int>, Documentary <int>, Romance <int>, Short <int>

We’ll focus on a random sample of 68 movies that are classified as either “action” or “romance” movies but not both. We disregard movies that are classified as both so that we can assign all 68 movies into either category. Furthermore, since the original movies dataset was a little messy, we provide a pre-wrangled version of our data in the movies_sample data frame included in the moderndive package. If you’re curious, you can look at the necessary data wrangling code to do this on GitHub.

# A tibble: 68 x 4
   title                     year rating genre  
   <chr>                    <int>  <dbl> <chr>  
 1 Underworld                1985    3.1 Action 
 2 Love Affair               1932    6.3 Romance
 3 Junglee                   1961    6.8 Romance
 4 Eversmile, New Jersey     1989    5   Romance
 5 Search and Destroy        1979    4   Action 
 6 Secreto de Romelia, El    1988    4.9 Romance
 7 Amants du Pont-Neuf, Les  1991    7.4 Romance
 8 Illicit Dreams            1995    3.5 Action 
 9 Kabhi Kabhie              1976    7.7 Romance
10 Electric Horseman, The    1979    5.8 Romance
# … with 58 more rows

The variables include the title and year the movie was filmed. Furthermore, we have a numerical variable rating, which is the IMDb rating out of 10 stars, and a binary categorical variable genre indicating if the movie was an Action or Romance movie. We are interested in whether Action or Romance movies got a higher rating on average.

Let’s perform an exploratory data analysis of this data. Recall from Subsection 2.7.1 that a boxplot is a visualization we can use to show the relationship between a numerical and a categorical variable. Another option you saw in Section 2.6 would be to use a faceted histogram. However, in the interest of brevity, let’s only present the boxplot in Figure 9.17.

Boxplot of IMDb rating vs. genre.

FIGURE 9.17: Boxplot of IMDb rating vs. genre.

Eyeballing Figure 9.17, romance movies have a higher median rating. Do we have reason to believe, however, that there is a significant difference between the mean rating for action movies compared to romance movies? It’s hard to say just based on this plot. The boxplot does show that the median sample rating is higher for romance movies.

However, there is a large amount of overlap between the boxes. Recall that the median isn’t necessarily the same as the mean either, depending on whether the distribution is skewed.

Let’s calculate some summary statistics split by the binary categorical variable genre: the number of movies, the mean rating, and the standard deviation split by genre. We’ll do this using dplyr data wrangling verbs. Notice in particular how we count the number of each type of movie using the n() summary function.

# A tibble: 2 x 4
  genre       n mean_rating std_dev
  <chr>   <int>       <dbl>   <dbl>
1 Action     32     5.275   1.36121
2 Romance    36     6.32222 1.60963

Observe that we have 36 movies with an average rating of 6.322 stars and 32 movies with an average rating of 5.275 stars. The difference in these average ratings is thus 6.322 - 5.275 = 1.047. So there appears to be an edge of 1.047 stars in favor of romance movies. The question is, however, are these results indicative of a true difference for all romance and action movies? Or could we attribute this difference to chance sampling variation?

9.5.2 Sampling scenario

Let’s now revisit this study in terms of terminology and notation related to sampling we studied in Subsection 7.3.1. The study population is all movies in the IMDb database that are either action or romance (but not both). The sample from this population is the 68 movies included in the movies_sample dataset.

Since this sample was randomly taken from the population movies, it is representative of all romance and action movies on IMDb. Thus, any analysis and results based on movies_sample can generalize to the entire population. What are the relevant population parameter and point estimates? We introduce the fourth sampling scenario in Table 9.3.

TABLE 9.3: Scenarios of sampling for inference
Scenario Population parameter Notation Point estimate Symbol(s)
1 Population proportion \(p\) Sample proportion \(\widehat{p}\)
2 Population mean \(\mu\) Sample mean \(\overline{x}\) or \(\widehat{\mu}\)
3 Difference in population proportions \(p_1 - p_2\) Difference in sample proportions \(\widehat{p}_1 - \widehat{p}_2\)
4 Difference in population means \(\mu_1 - \mu_2\) Difference in sample means \(\overline{x}_1 - \overline{x}_2\)

So, whereas the sampling bowl exercise in Section 7.1 concerned proportions, the pennies exercise in Section 8.1 concerned means, the case study on whether yawning is contagious in Section 8.6 and the promotions activity in Section 9.1 concerned differences in proportions, we are now concerned with differences in means.

In other words, the population parameter of interest is the difference in population mean ratings \(\mu_a - \mu_r\), where \(\mu_a\) is the mean rating of all action movies on IMDb and similarly \(\mu_r\) is the mean rating of all romance movies. Additionally the point estimate/sample statistic of interest is the difference in sample means \(\overline{x}_a - \overline{x}_r\), where \(\overline{x}_a\) is the mean rating of the \(n_a\) = 32 movies in our sample and \(\overline{x}_r\) is the mean rating of the \(n_r\) = 36 in our sample. Based on our earlier exploratory data analysis, our estimate \(\overline{x}_a - \overline{x}_r\) is \(5.275 - 6.322 = -1.047\).

So there appears to be a slight difference of -1.047 in favor of romance movies. The question is, however, could this difference of -1.047 be merely due to chance and sampling variation? Or are these results indicative of a true difference in mean ratings for all romance and action movies on IMDb? To answer this question, we’ll use hypothesis testing.

9.5.3 Conducting the hypothesis test

We’ll be testing:

\[ \begin{aligned} H_0 &: \mu_a - \mu_r = 0\\ \text{vs } H_A&: \mu_a - \mu_r \neq 0 \end{aligned} \]

In other words, the null hypothesis \(H_0\) suggests that both romance and action movies have the same mean rating. This is the “hypothesized universe” we’ll assume is true. On the other hand, the alternative hypothesis \(H_A\) suggests that there is a difference. Unlike the one-sided alternative we used in the promotions exercise \(H_A: p_m - p_f > 0\), we are now considering a two-sided alternative of \(H_A: \mu_a - \mu_r \neq 0\).

Furthermore, we’ll pre-specify a low significance level of \(\alpha\) = 0.001. By setting this value low, all things being equal, there is a lower chance that the \(p\)-value will be less than \(\alpha\). Thus, there is a lower chance that we’ll reject the null hypothesis \(H_0\) in favor of the alternative hypothesis \(H_A\). In other words, we’ll reject the hypothesis that there is no difference in mean ratings for all action and romance movies, only if we have quite strong evidence. This is known as a “conservative” hypothesis testing procedure.

1. specify variables

Let’s now perform all the steps of the infer workflow. We first specify() the variables of interest in the movies_sample data frame using the formula rating ~ genre. This tells infer that the numerical variable rating is the outcome variable, while the binary variable genre is the explanatory variable. Note that unlike previously when we were interested in proportions, since we are now interested in the mean of a numerical variable, we do not need to set the success argument.

Response: rating (numeric)
Explanatory: genre (factor)
# A tibble: 68 x 2
   rating genre  
    <dbl> <fct>  
 1    3.1 Action 
 2    6.3 Romance
 3    6.8 Romance
 4    5   Romance
 5    4   Action 
 6    4.9 Romance
 7    7.4 Romance
 8    3.5 Action 
 9    7.7 Romance
10    5.8 Romance
# … with 58 more rows

Observe at this point that the data in movies_sample has not changed. The only change so far is the newly defined Response: rating (numeric) and Explanatory: genre (factor) meta-data.

2. hypothesize the null

We set the null hypothesis \(H_0: \mu_a - \mu_r = 0\) by using the hypothesize() function. Since we have two samples, action and romance movies, we set null to be "independence" as we described in Section 9.3.

Response: rating (numeric)
Explanatory: genre (factor)
Null Hypothesis: independence
# A tibble: 68 x 2
   rating genre  
    <dbl> <fct>  
 1    3.1 Action 
 2    6.3 Romance
 3    6.8 Romance
 4    5   Romance
 5    4   Action 
 6    4.9 Romance
 7    7.4 Romance
 8    3.5 Action 
 9    7.7 Romance
10    5.8 Romance
# … with 58 more rows

3. generate replicates

After we have set the null hypothesis, we generate “shuffled” replicates assuming the null hypothesis is true by repeating the shuffling/permutation exercise you performed in Section 9.1.

We’ll repeat this resampling without replacement of type = "permute" a total of reps = 1000 times. Feel free to run the code below to check out what the generate() step produces.

4. calculate summary statistics

Now that we have 1000 replicated “shuffles” assuming the null hypothesis \(H_0\) that both Action and Romance movies on average have the same ratings on IMDb, let’s calculate() the appropriate summary statistic for these 1000 replicated shuffles. From Section 9.2, summary statistics relating to hypothesis testing have a specific name: test statistics. Since the unknown population parameter of interest is the difference in population means \(\mu_{a} - \mu_{r}\), the test statistic of interest here is the difference in sample means \(\overline{x}_{a} - \overline{x}_{r}\).

For each of our 1000 shuffles, we can calculate this test statistic by setting stat = "diff in means". Furthermore, since we are interested in \(\overline{x}_{a} - \overline{x}_{r}\), we set order = c("Action", "Romance"). Let’s save the results in a data frame called null_distribution_movies:

# A tibble: 1,000 x 2
   replicate      stat
       <int>     <dbl>
 1         1  0.511111
 2         2  0.345833
 3         3 -0.327083
 4         4 -0.209028
 5         5 -0.433333
 6         6 -0.102778
 7         7  0.387153
 8         8  0.16875 
 9         9  0.257292
10        10  0.334028
# … with 990 more rows

Observe that we have 1000 values of stat, each representing one instance of \(\overline{x}_{a} - \overline{x}_{r}\). The 1000 values form the null distribution, which is the technical term for the sampling distribution of the difference in sample means \(\overline{x}_{a} - \overline{x}_{r}\) assuming \(H_0\) is true. What happened in real life? What was the observed difference in promotion rates? What was the observed test statistic \(\overline{x}_{a} - \overline{x}_{r}\)? Recall from our earlier data wrangling, this observed difference in means was \(5.275 - 6.322 = -1.047\). We can also achieve this using the code that constructed the null distribution null_distribution_movies but with the hypothesize() and generate() steps removed. Let’s save this in obs_diff_means:

# A tibble: 1 x 1
1 -1.04722

5. visualize the p-value

Lastly, in order to compute the \(p\)-value, we have to assess how “extreme” the observed difference in means of -1.047 is. We do this by comparing -1.047 to our null distribution, which was constructed in a hypothesized universe of no true difference in movie ratings. Let’s visualize both the null distribution and the \(p\)-value in Figure 9.18. Unlike our example in Subsection 9.3.1 involving promotions, since we have a two-sided \(H_A: \mu_a - \mu_r \neq 0\), we have to allow for both possibilities for more extreme, so we set direction = "both".

Null distribution, observed test statistic, and $p$-value.

FIGURE 9.18: Null distribution, observed test statistic, and \(p\)-value.

Let’s go over the elements of this plot. First, the histogram is the null distribution. Second, the solid line is the observed test statistic, or the difference in sample means we observed in real life of \(5.275 - 6.322 = -1.047\). Third, the two shaded areas of the histogram form the \(p\)-value, or the probability of obtaining a test statistic just as or more extreme than the observed test statistic assuming the null hypothesis \(H_0\) is true.

What proportion of the null distribution is shaded? In other words, what is the numerical value of the \(p\)-value? We use the get_p_value() function to compute this value:

# A tibble: 1 x 1
1   0.004

This \(p\)-value of 0.004 is very small. In other words, there is a very small chance that we’d observe a difference of 5.275 - 6.322 = -1.047 in a hypothesized universe where there was truly no difference in ratings.

But this \(p\)-value is larger than our (even smaller) pre-specified \(\alpha\) significance level of 0.001. Thus, we are inclined to fail to reject the null hypothesis \(H_0: \mu_a - \mu_r = 0\). In non-statistical language, the conclusion is: we do not have the evidence needed in this sample of data to suggest that we should reject the hypothesis that there is no difference in mean IMDb ratings between romance and action movies. We, thus, cannot say that a difference exists in romance and action movie ratings, on average, for all IMDb movies.

Learning check

(LC9.9) Conduct the same analysis comparing action movies versus romantic movies using the median rating instead of the mean rating. What was different and what was the same?

(LC9.10) What conclusions can you make from viewing the faceted histogram looking at rating versus genre that you couldn’t see when looking at the boxplot?

(LC9.11) Describe in a paragraph how we used Allen Downey’s diagram to conclude if a statistical difference existed between mean movie ratings for action and romance movies.

(LC9.12) Why are we relatively confident that the distributions of the sample ratings will be good approximations of the population distributions of ratings for the two genres?

(LC9.13) Using the definition of \(p\)-value, write in words what the \(p\)-value represents for the hypothesis test comparing the mean rating of romance to action movies.

(LC9.14) What is the value of the \(p\)-value for the hypothesis test comparing the mean rating of romance to action movies?

(LC9.15) Test your data wrangling knowledge and EDA skills:

  • Use dplyr and tidyr to create the necessary data frame focused on only action and romance movies (but not both) from the movies data frame in the ggplot2movies package.
  • Make a boxplot and a faceted histogram of this population data comparing ratings of action and romance movies from IMDb.
  • Discuss how these plots compare to the similar plots produced for the movies_sample data.