11.3 Case study: Effective data storytelling
As we’ve progressed throughout this book, you’ve seen how to work with data in a variety of ways. You’ve learned effective strategies for plotting data by understanding which types of plots work best for which combinations of variable types. You’ve summarized data in spreadsheet form and calculated summary statistics for a variety of different variables. Furthermore, you’ve seen the value of statistical inference as a process for drawing conclusions about a population by using sampling. Lastly, you’ve explored how to fit linear regression models and the importance of checking the required conditions so that all confidence intervals and hypothesis tests have valid interpretations. Throughout, you’ve learned many computational techniques and focused on writing R code that’s reproducible.
We now present another set of case studies, but this time on the “effective data storytelling” done by data journalists around the world. Great data stories don’t mislead the reader, but rather immerse them, through storytelling, in an understanding of the important role data plays in our lives.
11.3.1 Bechdel test for Hollywood gender representation
We recommend you read and analyze Walt Hickey’s FiveThirtyEight.com article, “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women.” In it, Walt completed a multidecade study of how many movies pass the Bechdel test, an informal test of gender representation in a movie that was created by Alison Bechdel.
As you read over the article, think carefully about how Walt Hickey is using data, graphics, and analyses to tell the reader a story. In the spirit of reproducibility, FiveThirtyEight has also shared the data and R code used for this article. You can also find the data used in many more of their articles on their GitHub page.
ModernDive co-authors Chester Ismay and Albert Y. Kim, along with Jennifer Chunn, went one step further by creating the fivethirtyeight package, which provides easier access to these datasets in R. For a complete list of all 127 datasets included in the fivethirtyeight package, check out the package webpage at https://fivethirtyeight-r.netlify.app/articles/fivethirtyeight.html.

Furthermore, example “vignettes” of fully reproducible start-to-finish analyses of some of these data using ggplot2 and other packages in the tidyverse are available here. For example, a vignette showing how to reproduce one of the plots at the end of the article on the Bechdel test is available here.
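As a minimal sketch (assuming the fivethirtyeight and dplyr packages are installed), you can load the package and preview the bechdel data frame, which contains the Bechdel test results and box-office figures behind Hickey’s article:

```r
# Load the fivethirtyeight package for its datasets and dplyr for glimpse()
library(fivethirtyeight)
library(dplyr)

# Preview the bechdel data frame used in the FiveThirtyEight article
glimpse(bechdel)
```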
11.3.2 US Births in 1999
The US_births_1994_2003 data frame included in the fivethirtyeight package provides information about the number of daily births in the United States between 1994 and 2003. For more information on this data frame, including a link to the original article on FiveThirtyEight.com, check out the help file by running ?US_births_1994_2003 in the console.
It’s always a good idea to preview your data, either by using RStudio’s spreadsheet viewer via the View() function or by using glimpse() from the dplyr package:

```
Rows: 3,652
Columns: 6
$ year          <int> 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1…
$ month         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ date_of_month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ date          <date> 1994-01-01, 1994-01-02, 1994-01-03, 1994-01-04, 1994-0…
$ day_of_week   <ord> Sat, Sun, Mon, Tues, Wed, Thurs, Fri, Sat, Sun, Mon, Tu…
$ births        <int> 8096, 7772, 10142, 11248, 11053, 11406, 11251, 8653, 79…
```
We’ll focus on the number of births for each date, but only for births that occurred in 1999. Recall from Section 3.2 that we can do this using the filter() function from the dplyr package.
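This filtering step can be sketched as follows (assuming the fivethirtyeight and dplyr packages are loaded; US_births_1999 is the name we give the result):

```r
# Keep only the rows corresponding to births that occurred in 1999
US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)
```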
As discussed in Section 2.4, since date is a notion of time and thus has a sequential ordering to it, a linegraph would be a more appropriate visualization to use than a scatterplot. In other words, we should use geom_line() instead of geom_point(). Recall that such plots are called time series plots.
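A sketch of this time series plot, assuming US_births_1999 holds the filtered 1999 data and ggplot2 is loaded (the axis labels here are our own choices):

```r
# Plot daily births over time as a linegraph (time series plot)
ggplot(US_births_1999, aes(x = date, y = births)) +
  geom_line() +
  labs(x = "Date", y = "Number of births")
```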
We see a big dip occurring just before January 1st, 2000, most likely due to the holiday season. However, what about the large spike of over 14,000 births occurring just before October 1st, 1999? What could be the reason for this anomalously high spike?
Let’s sort the rows of US_births_1999 in descending order of the number of births. Recall from Section 3.6 that we can use the arrange() function from the dplyr package to do this, making sure to sort births in descending order:
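A sketch of this sorting step, assuming US_births_1999 holds the filtered 1999 data and dplyr is loaded:

```r
# Sort the days of 1999 from most to fewest births
US_births_1999 %>%
  arrange(desc(births))
```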
```
# A tibble: 365 x 6
    year month date_of_month date       day_of_week births
   <int> <int>         <int> <date>     <ord>        <int>
 1  1999     9             9 1999-09-09 Thurs        14540
 2  1999    12            21 1999-12-21 Tues         13508
 3  1999     9             8 1999-09-08 Wed          13437
 4  1999     9            21 1999-09-21 Tues         13384
 5  1999     9            28 1999-09-28 Tues         13358
 6  1999     7             7 1999-07-07 Wed          13343
 7  1999     7             8 1999-07-08 Thurs        13245
 8  1999     8            17 1999-08-17 Tues         13201
 9  1999     9            10 1999-09-10 Fri          13181
10  1999    12            28 1999-12-28 Tues         13158
# … with 355 more rows
```
The date with the highest number of births (14,540) is in fact 1999-09-09. If we write down this date in month/day/year format (a standard format in the US), the date with the highest number of births is 9/9/99! All nines! Could it be that parents deliberately induced labor at a higher rate on this date? Maybe? Whatever the cause may be, this fact makes a fun story!
(LC11.2) What date between 1994 and 2003 has the fewest number of births in the US? What story could you tell about why this is the case?
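One possible starting point for this learning check (assuming the fivethirtyeight and dplyr packages are loaded) is to sort the full ten-year data frame in ascending order of births, so the days with the fewest births appear first:

```r
# Sort all days from 1994-2003 from fewest to most births;
# the top rows are candidates for a story about low-birth days
US_births_1994_2003 %>%
  arrange(births)
```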
Time to think with data and further tell your story with data! How could statistical modeling help you here? What types of statistical inference would be helpful? What else can you find and where can you take this analysis? What assumptions did you make in this analysis? We leave these questions to you as the reader to explore and examine.
Remember to get in touch with us via our contact info in the Preface. We’d love to see what you come up with!
Please check out additional problem sets and labs at https://moderndive.com/labs as well.
11.3.3 Scripts of R code
An R script file of all R code used in this chapter is available here.
R code files saved as *.R files for all relevant chapters throughout the entire book are in the following table.