## 11.3 Case study: Effective data storytelling

As we’ve progressed throughout this book, you’ve seen how to work with data in a variety of ways. You’ve learned effective strategies for plotting data by understanding which types of plots work best for which combinations of variable types. You’ve summarized data in spreadsheet form and calculated summary statistics for a variety of different variables. Furthermore, you’ve seen the value of statistical inference as a process to come to conclusions about a population by using sampling. Lastly, you’ve explored how to fit linear regression models and the importance of checking the conditions required so that all confidence intervals and hypothesis tests have valid interpretation. All throughout, you’ve learned many computational techniques and focused on writing R code that’s reproducible.

We now present another set of case studies, but this time on the “effective data storytelling” done by data journalists around the world. Great data stories don’t mislead the reader, but rather engulf them in understanding the importance that data plays in our lives through storytelling.

### 11.3.1 Bechdel test for Hollywood gender representation

We recommend you read and analyze Walt Hickey’s FiveThirtyEight.com article, “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women.” In it, Walt completed a multidecade study of how many movies pass the Bechdel test, an informal test of gender representation in a movie that was created by Alison Bechdel.

As you read over the article, think carefully about how Walt Hickey is using data, graphics, and analyses to tell the reader a story. In the spirit of reproducibility, FiveThirtyEight have also shared the data and R code that they used for this article. You can also find the data used in many more of their articles on their GitHub page.

ModernDive co-authors Chester Ismay and Albert Y. Kim along with Jennifer Chunn went one step further by creating the fivethirtyeight package which provides access to these datasets more easily in R. For a complete list of all 127 datasets included in the fivethirtyeight package, check out the package webpage at https://fivethirtyeight-r.netlify.app/articles/fivethirtyeight.html.

Furthermore, example “vignettes” of fully reproducible start-to-finish analyses of some of these data using dplyr, ggplot2, and other packages in the tidyverse are available here. For example, a vignette showing how to reproduce one of the plots at the end of the article on the Bechdel test is available here.

### 11.3.2 US Births in 1999

The US_births_1994_2003 data frame included in the fivethirtyeight package provides information about the number of daily births in the United States between 1994 and 2003. For more information on this data frame including a link to the original article on FiveThirtyEight.com, check out the help file by running ?US_births_1994_2003 in the console.

It’s always a good idea to preview your data, either by using RStudio’s spreadsheet View() function or using glimpse() from the dplyr package:

glimpse(US_births_1994_2003)
Rows: 3,652
Columns: 6
$year <int> 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1…$ month         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$date_of_month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …$ date          <date> 1994-01-01, 1994-01-02, 1994-01-03, 1994-01-04, 1994-0…
$day_of_week <ord> Sat, Sun, Mon, Tues, Wed, Thurs, Fri, Sat, Sun, Mon, Tu…$ births        <int> 8096, 7772, 10142, 11248, 11053, 11406, 11251, 8653, 79…

We’ll focus on the number of births for each date, but only for births that occurred in 1999. Recall from Section 3.2 we can do this using the filter() function from the dplyr package:

US_births_1999 <- US_births_1994_2003 %>%
filter(year == 1999)

As discussed in Section 2.4, since date is a notion of time and thus has sequential ordering to it, a linegraph would be a more appropriate visualization to use than a scatterplot. In other words, we should use a geom_line() instead of geom_point(). Recall that such plots are called time series plots.

ggplot(US_births_1999, aes(x = date, y = births)) +
geom_line() +
labs(x = "Date",
y = "Number of births",
title = "US Births in 1999")

We see a big dip occurring just before January 1st, 2000, most likely due to the holiday season. However, what about the large spike of over 14,000 births occurring just before October 1st, 1999? What could be the reason for this anomalously high spike?

Let’s sort the rows of US_births_1999 in descending order of the number of births. Recall from Section 3.6 that we can use the arrange() function from the dplyr function to do this, making sure to sort births in descending order:

US_births_1999 %>%
arrange(desc(births))
# A tibble: 365 x 6
year month date_of_month date       day_of_week births
<int> <int>         <int> <date>     <ord>        <int>
1  1999     9             9 1999-09-09 Thurs        14540
2  1999    12            21 1999-12-21 Tues         13508
3  1999     9             8 1999-09-08 Wed          13437
4  1999     9            21 1999-09-21 Tues         13384
5  1999     9            28 1999-09-28 Tues         13358
6  1999     7             7 1999-07-07 Wed          13343
7  1999     7             8 1999-07-08 Thurs        13245
8  1999     8            17 1999-08-17 Tues         13201
9  1999     9            10 1999-09-10 Fri          13181
10  1999    12            28 1999-12-28 Tues         13158
# … with 355 more rows

The date with the highest number of births (14,540) is in fact 1999-09-09. If we write down this date in month/day/year format (a standard format in the US), the date with the highest number of births is 9/9/99! All nines! Could it be that parents deliberately induced labor at a higher rate on this date? Maybe? Whatever the cause may be, this fact makes a fun story!

Learning check

(LC11.2) What date between 1994 and 2003 has the fewest number of births in the US? What story could you tell about why this is the case?

Time to think with data and further tell your story with data! How could statistical modeling help you here? What types of statistical inference would be helpful? What else can you find and where can you take this analysis? What assumptions did you make in this analysis? We leave these questions to you as the reader to explore and examine.

Remember to get in touch with us via our contact info in the Preface. We’d love to see what you come up with!

Please check out additional problem sets and labs at https://moderndive.com/labs as well.

### 11.3.3 Scripts of R code

An R script file of all R code used in this chapter is available here.

R code files saved as *.R files for all relevant chapters throughout the entire book are in the following table.