ModernDive

4.2 “Tidy” data

Let’s now switch gears and learn about the concept of “tidy” data format with a motivating example from the fivethirtyeight package. The fivethirtyeight package (Kim, Ismay, and Chunn 2019) provides access to the datasets used in many articles published by the data journalism website, FiveThirtyEight.com. For a complete list of all 127 datasets included in the fivethirtyeight package, check out the package webpage by going to: https://fivethirtyeight-r.netlify.app/articles/fivethirtyeight.html.

Let’s focus our attention on the drinks data frame and look at its first 5 rows:

# A tibble: 5 x 5
  country    beer_servings spirit_servings wine_servings total_litres_of_pure_a…
  <chr>              <int>           <int>         <int>                   <dbl>
1 Afghanist…             0               0             0                     0  
2 Albania               89             132            54                     4.9
3 Algeria               25               0            14                     0.7
4 Andorra              245             138           312                    12.4
5 Angola               217              57            45                     5.9

After reading the help file by running ?drinks, you’ll see that drinks is a data frame containing results from a survey of the average number of servings of beer, spirits, and wine consumed in 193 countries. This data was originally reported on FiveThirtyEight.com in Mona Chalabi’s article: “Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?”.

Let’s apply some of the data wrangling verbs we learned in Chapter 3 on the drinks data frame:

  1. filter() the drinks data frame to only consider 4 countries: the United States, China, Italy, and Saudi Arabia, then
  2. select() all columns except total_litres_of_pure_alcohol by using the - sign, then
  3. rename() the variables beer_servings, spirit_servings, and wine_servings to beer, spirit, and wine, respectively.

and save the resulting data frame in drinks_smaller:

# A tibble: 4 x 4
  country       beer spirit  wine
  <chr>        <int>  <int> <int>
1 China           79    192     8
2 Italy           85     42   237
3 Saudi Arabia     0      5     0
4 USA            249    158    84

Let’s now ask ourselves a question: “Using the drinks_smaller data frame, how would we create the side-by-side barplot in Figure 4.2?”. Recall we saw barplots displaying two categorical variables in Subsection 2.8.3.

Comparing alcohol consumption in 4 countries.

FIGURE 4.2: Comparing alcohol consumption in 4 countries.

Let’s break down the grammar of graphics we introduced in Section 2.1:

  1. The categorical variable country with four levels (China, Italy, Saudi Arabia, USA) would have to be mapped to the x-position of the bars.
  2. The numerical variable servings would have to be mapped to the y-position of the bars (the height of the bars).
  3. The categorical variable type with three levels (beer, spirit, wine) would have to be mapped to the fill color of the bars.

Observe that drinks_smaller has three separate variables beer, spirit, and wine. In order to use the ggplot() function to recreate the barplot in Figure 4.2 however, we need a single variable type with three possible values: beer, spirit, and wine. We could then map this type variable to the fill aesthetic of our plot. In other words, to recreate the barplot in Figure 4.2, our data frame would have to look like this:

# A tibble: 12 x 3
   country      type   servings
   <chr>        <chr>     <int>
 1 China        beer         79
 2 Italy        beer         85
 3 Saudi Arabia beer          0
 4 USA          beer        249
 5 China        spirit      192
 6 Italy        spirit       42
 7 Saudi Arabia spirit        5
 8 USA          spirit      158
 9 China        wine          8
10 Italy        wine        237
11 Saudi Arabia wine          0
12 USA          wine         84

Observe that while drinks_smaller and drinks_smaller_tidy are both rectangular in shape and contain the same 12 numerical values (3 alcohol types by 4 countries), they are formatted differently. drinks_smaller is formatted in what’s known as “wide” format, whereas drinks_smaller_tidy is formatted in what’s known as “long/narrow” format.

In the context of doing data science in R, long/narrow format is also known as “tidy” format. In order to use the ggplot2 and dplyr packages for data visualization and data wrangling, your input data frames must be in “tidy” format. Thus, all non-“tidy” data must be converted to “tidy” format first. Before we convert non-“tidy” data frames like drinks_smaller to “tidy” data frames like drinks_smaller_tidy, let’s define “tidy” data.

4.2.1 Definition of “tidy” data

You have surely heard the word “tidy” in your life:

What does it mean for your data to be “tidy”? While “tidy” has a clear English meaning of “organized,” the word “tidy” in data science using R means that your data follows a standardized format. We will follow Hadley Wickham’s definition of “tidy” data (Wickham 2014) shown also in Figure 4.3:

A dataset is a collection of values, usually either numbers (if quantitative) or strings AKA text data (if qualitative/categorical). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a city) across attributes.

“Tidy” data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.
Tidy data graphic from R for Data Science.

FIGURE 4.3: Tidy data graphic from R for Data Science.

For example, say you have the following table of stock prices in Table 4.1:

TABLE 4.1: Stock prices (non-tidy format)
Date Boeing stock price Amazon stock price Google stock price
2009-01-01 $173.55 $174.90 $174.34
2009-01-02 $172.61 $171.42 $170.04

Although the data are neatly organized in a rectangular spreadsheet-type format, they do not follow the definition of data in “tidy” format. While there are three variables corresponding to three unique pieces of information (date, stock name, and stock price), there are not three columns. In “tidy” data format, each variable should be its own column, as shown in Table 4.2. Notice that both tables present the same information, but in different formats.

TABLE 4.2: Stock prices (tidy format)
Date Stock Name Stock Price
2009-01-01 Boeing $173.55
2009-01-01 Amazon $174.90
2009-01-01 Google $174.34
2009-01-02 Boeing $172.61
2009-01-02 Amazon $171.42
2009-01-02 Google $170.04

Now we have the requisite three columns Date, Stock Name, and Stock Price. On the other hand, consider the data in Table 4.3.

TABLE 4.3: Example of tidy data
Date Boeing Price Weather
2009-01-01 $173.55 Sunny
2009-01-02 $172.61 Overcast

In this case, even though the variable “Boeing Price” occurs just like in our non-“tidy” data in Table 4.1, the data is “tidy” since there are three variables corresponding to three unique pieces of information: Date, Boeing price, and the Weather that particular day.

Learning check

(LC4.1) What are common characteristics of “tidy” data frames?

(LC4.2) What makes “tidy” data frames useful for organizing data?

4.2.2 Converting to “tidy” data

In this book so far, you’ve only seen data frames that were already in “tidy” format. Furthermore, for the rest of this book, you’ll mostly only see data frames that are already in “tidy” format as well. This is not always the case however with all datasets in the world. If your original data frame is in wide (non-“tidy”) format and you would like to use the ggplot2 or dplyr packages, you will first have to convert it to “tidy” format. To do so, we recommend using the pivot_longer() function in the tidyr package (Wickham and Henry 2020).

Going back to our drinks_smaller data frame from earlier:

# A tibble: 4 x 4
  country       beer spirit  wine
  <chr>        <int>  <int> <int>
1 China           79    192     8
2 Italy           85     42   237
3 Saudi Arabia     0      5     0
4 USA            249    158    84

We convert it to “tidy” format by using the pivot_longer() function from the tidyr package as follows:

# A tibble: 12 x 3
   country      type   servings
   <chr>        <chr>     <int>
 1 China        beer         79
 2 China        spirit      192
 3 China        wine          8
 4 Italy        beer         85
 5 Italy        spirit       42
 6 Italy        wine        237
 7 Saudi Arabia beer          0
 8 Saudi Arabia spirit        5
 9 Saudi Arabia wine          0
10 USA          beer        249
11 USA          spirit      158
12 USA          wine         84

We set the arguments to pivot_longer() as follows:

  1. names_to here corresponds to the name of the variable in the new “tidy”/long data frame that will contain the column names of the original data. Observe how we set names_to = "type". In the resulting drinks_smaller_tidy, the column type contains the three types of alcohol beer, spirit, and wine. Since type is a variable name that doesn’t appear in drinks_smaller, we use quotation marks around it. You’ll receive an error if you just use names_to = type here.
  2. values_to here is the name of the variable in the new “tidy” data frame that will contain the values of the original data. Observe how we set values_to = "servings" since each of the numeric values in each of the beer, wine, and spirit columns of the drinks_smaller data corresponds to a value of servings. In the resulting drinks_smaller_tidy, the column servings contains the 4 \(\times\) 3 = 12 numerical values. Note again that servings doesn’t appear as a variable in drinks_smaller so it again needs quotation marks around it for the values_to argument.
  3. The third argument cols is the columns in the drinks_smaller data frame you either want to or don’t want to “tidy.” Observe how we set this to -country indicating that we don’t want to “tidy” the country variable in drinks_smaller and rather only beer, spirit, and wine. Since country is a column that appears in drinks_smaller we don’t put quotation marks around it.

The third argument here of cols is a little nuanced, so let’s consider code that’s written slightly differently but that produces the same output:

Note that the third argument now specifies which columns we want to “tidy” with c(beer, spirit, wine), instead of the columns we don’t want to “tidy” using -country. We use the c() function to create a vector of the columns in drinks_smaller that we’d like to “tidy.” Note that since these three columns appear one after another in the drinks_smaller data frame, we could also do the following for the cols argument:

With our drinks_smaller_tidy “tidy” formatted data frame, we can now produce the barplot you saw in Figure 4.2 using geom_col(). This is done in Figure 4.4. Recall from Section 2.8 on barplots that we use geom_col() and not geom_bar(), since we would like to map the “pre-counted” servings variable to the y-aesthetic of the bars.

Comparing alcohol consumption in 4 countries using geom_col().

FIGURE 4.4: Comparing alcohol consumption in 4 countries using geom_col().

Converting “wide” format data to “tidy” format often confuses new R users. The only way to learn to get comfortable with the pivot_longer() function is with practice, practice, and more practice using different datasets. For example, run ?pivot_longer and look at the examples in the bottom of the help file. We’ll show another example of using pivot_longer() to convert a “wide” formatted data frame to “tidy” format in Section 4.3.

If however you want to convert a “tidy” data frame to “wide” format, you will need to use the pivot_wider() function instead. Run ?pivot_wider and look at the examples in the bottom of the help file for examples.

You can also view examples of both pivot_longer() and pivot_wider() on the tidyverse.org webpage. There’s a nice example to check out the different functions available for data tidying and a case study using data from the World Health Organization on that webpage. Furthermore, each week the R4DS Online Learning Community posts a dataset in the weekly #TidyTuesday event that might serve as a nice place for you to find other data to explore and transform.

Learning check

(LC4.3) Take a look at the airline_safety data frame included in the fivethirtyeight data package. Run the following:

After reading the help file by running ?airline_safety, we see that airline_safety is a data frame containing information on different airline companies’ safety records. This data was originally reported on the data journalism website, FiveThirtyEight.com, in Nate Silver’s article, “Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?”. Let’s only consider the variables airlines and those relating to fatalities for simplicity:

# A tibble: 56 x 3
   airline               fatalities_85_99 fatalities_00_14
   <chr>                            <int>            <int>
 1 Aer Lingus                           0                0
 2 Aeroflot                           128               88
 3 Aerolineas Argentinas                0                0
 4 Aeromexico                          64                0
 5 Air Canada                           0                0
 6 Air France                          79              337
 7 Air India                          329              158
 8 Air New Zealand                      0                7
 9 Alaska Airlines                      0               88
10 Alitalia                            50                0
# … with 46 more rows

This data frame is not in “tidy” format. How would you convert this data frame to be in “tidy” format, in particular so that it has a variable fatalities_years indicating the incident year and a variable count of the fatality counts?

4.2.3 nycflights13 package

Recall the nycflights13 package we introduced in Section 1.4 with data about all domestic flights departing from New York City in 2013. Let’s revisit the flights data frame by running View(flights). We saw that flights has a rectangular shape, with each of its 336,776 rows corresponding to a flight and each of its 22 columns corresponding to different characteristics/measurements of each flight. This satisfied the first two criteria of the definition of “tidy” data from Subsection 4.2.1: that “Each variable forms a column” and “Each observation forms a row.” But what about the third property of “tidy” data that “Each type of observational unit forms a table”?

Recall that we saw in Subsection 1.4.3 that the observational unit for the flights data frame is an individual flight. In other words, the rows of the flights data frame refer to characteristics/measurements of individual flights. Also included in the nycflights13 package are other data frames with their rows representing different observational units (Wickham 2019a):

  • airlines: translation between two letter IATA carrier codes and airline company names (16 in total). The observational unit is an airline company.
  • planes: aircraft information about each of 3,322 planes used, i.e., the observational unit is an aircraft.
  • weather: hourly meteorological data (about 8,705 observations) for each of the three NYC airports, i.e., the observational unit is an hourly measurement of weather at one of the three airports.
  • airports: airport names and locations. The observational unit is an airport.

The organization of the information into these five data frames follows the third “tidy” data property: observations corresponding to the same observational unit should be saved in the same table, i.e., data frame. You could think of this property as the old English expression: “birds of a feather flock together.”

References

Kim, Albert Y., Chester Ismay, and Jennifer Chunn. 2019. Fivethirtyeight: Data and Code Behind the Stories and Interactives at ’Fivethirtyeight’. https://CRAN.R-project.org/package=fivethirtyeight.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software Volume 59 (Issue 10). https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf.

Wickham, Hadley. 2019a. Nycflights13: Flights That Departed Nyc in 2013. https://CRAN.R-project.org/package=nycflights13.

Wickham, Hadley, and Lionel Henry. 2020. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.