ModernDive

D.4 Chapter 4 Solutions

(LC4.1) What are common characteristics of “tidy” datasets?

Solution: Rows correspond to observations, while columns correspond to variables.

(LC4.2) What makes “tidy” datasets useful for organizing data?

Solution: Tidy datasets are an organized way of viewing data. This format is required for the ggplot2 and dplyr packages for data visualization and wrangling.

(LC4.3) Take a look the airline_safety data frame included in the fivethirtyeight data. Run the following:

After reading the help file by running ?airline_safety, we see that airline_safety is a data frame containing information on different airlines companies’ safety records. This data was originally reported on the data journalism website FiveThirtyEight.com in Nate Silver’s article “Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?”. Let’s ignore the incl_reg_subsidiaries and avail_seat_km_per_week variables for simplicity:

# A tibble: 56 x 7
   airline incidents_85_99 fatal_accidents… fatalities_85_99 incidents_00_14
   <chr>             <int>            <int>            <int>           <int>
 1 Aer Li…               2                0                0               0
 2 Aerofl…              76               14              128               6
 3 Aeroli…               6                0                0               1
 4 Aerome…               3                1               64               5
 5 Air Ca…               2                0                0               2
 6 Air Fr…              14                4               79               6
 7 Air In…               2                1              329               4
 8 Air Ne…               3                0                0               5
 9 Alaska…               5                0                0               5
10 Alital…               7                2               50               4
# … with 46 more rows, and 2 more variables: fatal_accidents_00_14 <int>,
#   fatalities_00_14 <int>

This data frame is not in “tidy” format. How would you convert this data frame to be in “tidy” format, in particular so that it has a variable incident_type_years indicating the incident type/year and a variable count of the counts?

Solution:

This can been done using the pivot_longer() function from the tidyr package:

# A tibble: 336 x 3
   airline    incident_type_years   count
   <chr>      <chr>                 <int>
 1 Aer Lingus incidents_85_99           2
 2 Aer Lingus fatal_accidents_85_99     0
 3 Aer Lingus fatalities_85_99          0
 4 Aer Lingus incidents_00_14           0
 5 Aer Lingus fatal_accidents_00_14     0
 6 Aer Lingus fatalities_00_14          0
 7 Aeroflot   incidents_85_99          76
 8 Aeroflot   fatal_accidents_85_99    14
 9 Aeroflot   fatalities_85_99        128
10 Aeroflot   incidents_00_14           6
# … with 326 more rows

If you look at the resulting airline_safety_smaller_tidy data frame in the spreadsheet viewer, you’ll see that the variable incident_type_years has 6 possible values: "incidents_85_99", "fatal_accidents_85_99", "fatalities_85_99", "incidents_00_14", "fatal_accidents_00_14", "fatalities_00_14" corresponding to the 6 columns of airline_safety_smaller we tidied.

Note that prior to tidyr version 1.0.0 released to CRAN in September 2019, this could also have been done using the gather() function from the tidyr package:

# A tibble: 336 x 3
   airline               incident_type_years count
   <chr>                 <chr>               <int>
 1 Aer Lingus            incidents_85_99         2
 2 Aeroflot              incidents_85_99        76
 3 Aerolineas Argentinas incidents_85_99         6
 4 Aeromexico            incidents_85_99         3
 5 Air Canada            incidents_85_99         2
 6 Air France            incidents_85_99        14
 7 Air India             incidents_85_99         2
 8 Air New Zealand       incidents_85_99         3
 9 Alaska Airlines       incidents_85_99         5
10 Alitalia              incidents_85_99         7
# … with 326 more rows

(LC4.4) Convert the dem_score data frame into a tidy data frame and assign the name of dem_score_tidy to the resulting long-formatted data frame.

Solution: Running the following in the console:

Let’s now compare the dem_score and dem_score_tidy. dem_score has democracy score information for each year in columns, whereas in dem_score_tidy there are explicit variables year and democracy_score. While both representations of the data contain the same information, we can only use ggplot() to create plots using the dem_score_tidy data frame.

# A tibble: 96 x 10
   country    `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
   <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Albania        -9     -9     -9     -9     -9     -9     -9     -9      5
 2 Argentina      -9     -1     -1     -9     -9     -9     -8      8      7
 3 Armenia        -9     -7     -7     -7     -7     -7     -7     -7      7
 4 Australia      10     10     10     10     10     10     10     10     10
 5 Austria        10     10     10     10     10     10     10     10     10
 6 Azerbaijan     -9     -7     -7     -7     -7     -7     -7     -7      1
 7 Belarus        -9     -7     -7     -7     -7     -7     -7     -7      7
 8 Belgium        10     10     10     10     10     10     10     10     10
 9 Bhutan        -10    -10    -10    -10    -10    -10    -10    -10    -10
10 Bolivia        -4     -3     -3     -4     -7     -7      8      9      9
# … with 86 more rows
# A tibble: 864 x 3
   country   year  democracy_score
   <chr>     <chr>           <dbl>
 1 Albania   1952               -9
 2 Albania   1957               -9
 3 Albania   1962               -9
 4 Albania   1967               -9
 5 Albania   1972               -9
 6 Albania   1977               -9
 7 Albania   1982               -9
 8 Albania   1987               -9
 9 Albania   1992                5
10 Argentina 1952               -9
# … with 854 more rows

(LC4.5) Read in the life expectancy data stored at https://moderndive.com/data/le_mess.csv and convert it to a tidy data frame.

Solution: The code is similar

We observe the same construct structure with respect to year in life_expectancy vs life_expectancy_tidy as we did in dem_score vs dem_score_tidy:

# A tibble: 202 x 67
   country  `1951` `1952` `1953`  `1954` `1955` `1956` `1957` `1958` `1959`
   <chr>     <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghan… 27.13    27.67  28.19 28.73    29.27  29.8  30.34  30.86   31.4 
 2 Albania 54.72    55.23  55.85 56.59    57.45  58.42 59.48  60.6    61.75
 3 Algeria 43.03    43.5   43.96 44.44    44.93  45.44 45.94  46.45   46.97
 4 Angola  31.05    31.59  32.14 32.6900  33.24  33.78 34.33  34.88   35.43
 5 Antigu… 58.26    58.8   59.34 59.87    60.41  60.93 61.45  61.97   62.48
 6 Argent… 61.93    62.54  63.1  63.59    64.03  64.41 64.73  65      65.22
 7 Armenia 62.67    63.13  63.6  64.0700  64.54  65    65.45  65.92   66.39
 8 Aruba   58.96    60.01  60.98 61.87    62.69  63.42 64.09  64.68   65.2 
 9 Austra… 68.710   69.11  69.69 69.84    70.16  70.03 70.31  70.86   70.43
10 Austria 65.2400  66.78  67.27 67.3     67.58  67.7  67.460 68.460  68.39
# … with 192 more rows, and 57 more variables: `1960` <dbl>, `1961` <dbl>,
#   `1962` <dbl>, `1963` <dbl>, `1964` <dbl>, `1965` <dbl>, `1966` <dbl>,
#   `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>,
#   `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>,
#   `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
#   `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
#   `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
#   `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>,
#   `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>,
#   `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>,
#   `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
#   `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>
# A tibble: 13,332 x 3
   country     year  life_expectancy
   <chr>       <chr>           <dbl>
 1 Afghanistan 1951            27.13
 2 Afghanistan 1952            27.67
 3 Afghanistan 1953            28.19
 4 Afghanistan 1954            28.73
 5 Afghanistan 1955            29.27
 6 Afghanistan 1956            29.8 
 7 Afghanistan 1957            30.34
 8 Afghanistan 1958            30.86
 9 Afghanistan 1959            31.4 
10 Afghanistan 1960            31.94
# … with 13,322 more rows