ModernDive

D.3 Chapter 3 Solutions

(LC3.1) What’s another way using the “not” operator ! to filter only the rows that are not going to Burlington, VT nor Seattle, WA in the flights data frame? Test this out using the code above.

Solution:

(LC3.2) Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five year intervals. She notices that a large number of patients have missing data points because the patient has died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor’s approach?

Solution: The missing patients may have died of lung cancer! So to ignore them might seriously bias your results! It is very important to think of what the consequences on your analysis are of ignoring missing data! Ask yourself:

  • There is a systematic reasons why certain values are missing? If so, you might be biasing your results!
  • If there isn’t, then it might be ok to “sweep missing values under the rug.”

(LC3.3) Modify the above summarize function to create summary_temp to also use the n() summary function: summarize(count = n()). What does the returned value correspond to?

Solution: It corresponds to a count of the number of observations/rows:

# A tibble: 1 x 1
  count
  <int>
1 26115

(LC3.4) Why doesn’t the following code work? Run the code line by line instead of all at once, and then look at the data. In other words, run summary_temp <- weather %>% summarize(mean = mean(temp, na.rm = TRUE)) first.

Solution: Consider the output of only running the first two lines:

# A tibble: 1 x 1
     mean
    <dbl>
1 55.2604

Because after the first summarize(), the variable temp disappears as it has been collapsed to the value mean. So when we try to run the second summarize(), it can’t find the variable temp to compute the standard deviation of.

(LC3.5) Recall from Chapter 2 when we looked at plots of temperatures by months in NYC. What does the standard deviation column in the summary_monthly_temp data frame tell us about temperatures in New York City throughout the year?

Solution:

month mean std_dev
1 35.6 10.22
2 34.3 6.98
3 39.9 6.25
4 51.7 8.79
5 61.8 9.68
6 72.2 7.55
7 80.1 7.12
8 74.5 5.19
9 67.4 8.47
10 60.1 8.85
11 45.0 10.44
12 38.4 9.98

The standard deviation is a quantification of spread and variability. We see that the period in November, December, and January has the most variation in weather, so you can expect very different temperatures on different days.

(LC3.6) What code would be required to get the mean and standard deviation temperature for each day in 2013 for NYC?

Solution:

# A tibble: 364 x 5
# Groups:   year, month [12]
    year month   day    mean std_dev
   <int> <int> <int>   <dbl>   <dbl>
 1  2013     1     1 36.9997 4.00117
 2  2013     1     2 28.7025 3.45205
 3  2013     1     3 29.9725 2.58472
 4  2013     1     4 34.94   2.45283
 5  2013     1     5 37.205  4.00500
 6  2013     1     6 40.0518 4.39562
 7  2013     1     7 40.5825 3.68319
 8  2013     1     8 40.1175 5.77457
 9  2013     1     9 43.225  5.39724
10  2013     1    10 43.85   2.95214
# … with 354 more rows

Note: group_by(day) is not enough, because day is a value between 1-31. We need to group_by(year, month, day)

(LC3.7) Recreate by_monthly_origin, but instead of grouping via group_by(origin, month), group variables in a different order group_by(month, origin). What differs in the resulting dataset?

Solution:

month origin count
1 EWR 9893
1 JFK 9161
1 LGA 7950
2 EWR 9107
2 JFK 8421
2 LGA 7423
3 EWR 10420
3 JFK 9697
3 LGA 8717
4 EWR 10531
4 JFK 9218
4 LGA 8581
5 EWR 10592
5 JFK 9397
5 LGA 8807
6 EWR 10175
6 JFK 9472
6 LGA 8596
7 EWR 10475
7 JFK 10023
7 LGA 8927
8 EWR 10359
8 JFK 9983
8 LGA 8985
9 EWR 9550
9 JFK 8908
9 LGA 9116
10 EWR 10104
10 JFK 9143
10 LGA 9642
11 EWR 9707
11 JFK 8710
11 LGA 8851
12 EWR 9922
12 JFK 9146
12 LGA 9067

In by_monthly_origin the month column is now first and the rows are sorted by month instead of origin. If you compare the values of count in by_origin_monthly and by_monthly_origin using the View() function, you’ll see that the values are actually the same, just presented in a different order.

(LC3.8) How could we identify how many flights left each of the three airports for each carrier?

Solution: We could summarize the count from each airport using the n() function, which counts rows.

origin carrier count
EWR 9E 1268
EWR AA 3487
EWR AS 714
EWR B6 6557
EWR DL 4342
EWR EV 43939
EWR MQ 2276
EWR OO 6
EWR UA 46087
EWR US 4405
EWR VX 1566
EWR WN 6188
JFK 9E 14651
JFK AA 13783
JFK B6 42076
JFK DL 20701
JFK EV 1408
JFK HA 342
JFK MQ 7193
JFK UA 4534
JFK US 2995
JFK VX 3596
LGA 9E 2541
LGA AA 15459
LGA B6 6002
LGA DL 23067
LGA EV 8826
LGA F9 685
LGA FL 3260
LGA MQ 16928
LGA OO 26
LGA UA 8044
LGA US 13136
LGA WN 6087
LGA YV 601

All remarkably similar! Note: the n() function counts rows, whereas the sum(VARIABLE_NAME) function sums all values of a certain numerical variable VARIABLE_NAME.

(LC3.9) How does the filter operation differ from a group_by followed by a summarize?

Solution:

  • filter picks out rows from the original dataset without modifying them, whereas
  • group_by %>% summarize computes summaries of numerical variables, and hence reports new values.

(LC3.10) What do positive values of the gain variable in flights correspond to? What about negative values? And what about a zero value?

Solution:

  • Say a flight departed 20 minutes late, i.e. dep_delay = 20
  • Then arrived 10 minutes late, i.e. arr_delay = 10.
  • Then gain = dep_delay - arr_delay = 20 - 10 = 10 is positive, so it “made up/gained time in the air.”
  • 0 means the departure and arrival time were the same, so no time was made up in the air. We see in most cases that the gain is near 0 minutes.
  • I never understood this. If the pilot says “we’re going make up time in the air” because of delay by flying faster, why don’t you always just fly faster to begin with?

(LC3.11) Could we create the dep_delay and arr_delay columns by simply subtracting dep_time from sched_dep_time and similarly for arrivals? Try the code out and explain any differences between the result and what actually appears in flights.

Solution: No because you can’t do direct arithmetic on times. The difference in time between 12:03 and 11:59 is 4 minutes, but 1203-1159 = 44

(LC3.12) What can we say about the distribution of gain? Describe it in a few sentences using the plot and the gain_summary data frame values.

Solution: Most of the time the gain is a little under zero, most of the time the gain is between -50 and 50 minutes. There are some extreme cases however!

(LC3.13) Looking at Figure 3.7, when joining flights and weather (or, in other words, matching the hourly weather values with each flight), why do we need to join by all of year, month, day, hour, and origin, and not just hour?

Solution: Because hour is simply a value between 0 and 23; to identify a specific hour, we need to know which year, month, day and at which airport.

(LC3.14) What surprises you about the top 10 destinations from NYC in 2013?

Solution: This question is subjective! What surprises me is the high number of flights to Boston. Wouldn’t it be easier and quicker to take the train?

(LC3.15) What are some advantages of data in normal forms? What are some disadvantages?

Solution: When datasets are in normal form, we can easily _join them with other datasets! For example, we can join the flights data with the planes data.

(LC3.16) What are some ways to select all three of the dest, air_time, and distance variables from flights? Give the code showing how to do this in at least three different ways.

Solution:

# A tibble: 336,776 x 3
   dest  air_time distance
   <chr>    <dbl>    <dbl>
 1 IAH        227     1400
 2 IAH        227     1416
 3 MIA        160     1089
 4 BQN        183     1576
 5 ATL        116      762
 6 ORD        150      719
 7 FLL        158     1065
 8 IAD         53      229
 9 MCO        140      944
10 ORD        138      733
# … with 336,766 more rows
# A tibble: 336,776 x 3
   dest  air_time distance
   <chr>    <dbl>    <dbl>
 1 IAH        227     1400
 2 IAH        227     1416
 3 MIA        160     1089
 4 BQN        183     1576
 5 ATL        116      762
 6 ORD        150      719
 7 FLL        158     1065
 8 IAD         53      229
 9 MCO        140      944
10 ORD        138      733
# … with 336,766 more rows
# A tibble: 336,776 x 6
   dest  air_time distance  gain    hours gain_per_hour
   <chr>    <dbl>    <dbl> <dbl>    <dbl>         <dbl>
 1 IAH        227     1400    -9 3.78333       -2.37885
 2 IAH        227     1416   -16 3.78333       -4.22907
 3 MIA        160     1089   -31 2.66667      -11.625  
 4 BQN        183     1576    17 3.05           5.57377
 5 ATL        116      762    19 1.93333        9.82759
 6 ORD        150      719   -16 2.5           -6.4    
 7 FLL        158     1065   -24 2.63333       -9.11392
 8 IAD         53      229    11 0.883333      12.4528 
 9 MCO        140      944     5 2.33333        2.14286
10 ORD        138      733   -10 2.300         -4.34783
# … with 336,766 more rows

(LC3.17) How could one use starts_with, ends_with, and contains to select columns from the flights data frame? Provide three different examples in total: one for starts_with, one for ends_with, and one for contains.

Solution:

# A tibble: 336,776 x 5
     day dep_time dep_delay dest  distance
   <int>    <int>     <dbl> <chr>    <dbl>
 1     1      517         2 IAH       1400
 2     1      533         4 IAH       1416
 3     1      542         2 MIA       1089
 4     1      544        -1 BQN       1576
 5     1      554        -6 ATL        762
 6     1      554        -4 ORD        719
 7     1      555        -5 FLL       1065
 8     1      557        -3 IAD        229
 9     1      557        -3 MCO        944
10     1      558        -2 ORD        733
# … with 336,766 more rows
# A tibble: 336,776 x 2
   dep_delay arr_delay
       <dbl>     <dbl>
 1         2        11
 2         4        20
 3         2        33
 4        -1       -18
 5        -6       -25
 6        -4        12
 7        -5        19
 8        -3       -14
 9        -3        -8
10        -2         8
# … with 336,766 more rows
# A tibble: 336,776 x 3
   dep_time sched_dep_time dep_delay
      <int>          <int>     <dbl>
 1      517            515         2
 2      533            529         4
 3      542            540         2
 4      544            545        -1
 5      554            600        -6
 6      554            558        -4
 7      555            600        -5
 8      557            600        -3
 9      557            600        -3
10      558            600        -2
# … with 336,766 more rows

(LC3.18) Why might we want to use the select() function on a data frame?

Solution: To narrow down the data frame, to make it easier to look at. Using View() for example.

(LC3.19) Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013.

Solution:

# A tibble: 5 x 2
  dest  avg_delay
  <chr>     <dbl>
1 CAE     41.7642
2 TUL     33.6599
3 OKC     30.6190
4 JAC     28.0952
5 TYS     24.0692

(LC3.20) Using the datasets included in the nycflights13 package, compute the available seat miles for each airline sorted in descending order. After completing all the necessary data wrangling steps, the resulting data frame should have 16 rows (one for each airline) and 2 columns (airline name and available seat miles). Here are some hints:

  1. Crucial: Unless you are very confident in what you are doing, it is worthwhile to not starting coding right away, but rather first sketch out on paper all the necessary data wrangling steps not using exact code, but rather high-level pseudocode that is informal yet detailed enough to articulate what you are doing. This way you won’t confuse what you are trying to do (the algorithm) with how you are going to do it (writing dplyr code).
  2. Take a close look at all the datasets using the View() function: flights, weather, planes, airports, and airlines to identify which variables are necessary to compute available seat miles.
  3. Figure 3.7 above showing how the various datasets can be joined will also be useful.
  4. Consider the data wrangling verbs in Table 3.2 as your toolbox!

Solution: Here are some examples of student-written pseudocode. Based on our own pseudocode, let’s first display the entire solution.

# A tibble: 16 x 2
   carrier         ASM
   <chr>         <dbl>
 1 UA      15516377526
 2 DL      10532885801
 3 B6       9618222135
 4 AA       3677292231
 5 US       2533505829
 6 VX       2296680778
 7 EV       1817236275
 8 WN       1718116857
 9 9E        776970310
10 HA        642478122
11 AS        314104736
12 FL        219628520
13 F9        184832280
14 YV         20163632
15 MQ          7162420
16 OO          1299835

Let’s now break this down step-by-step. To compute the available seat miles for a given flight, we need the distance variable from the flights data frame and the seats variable from the planes data frame, necessitating a join by the key variable tailnum as illustrated in Figure 3.7. To keep the resulting data frame easy to view, we’ll select() only these two variables and carrier:

# A tibble: 284,170 x 3
   carrier seats distance
   <chr>   <int>    <dbl>
 1 UA        149     1400
 2 UA        149     1416
 3 AA        178     1089
 4 B6        200     1576
 5 DL        178      762
 6 UA        191      719
 7 B6        200     1065
 8 EV         55      229
 9 B6        200      944
10 B6        200     1028
# … with 284,160 more rows

Now for each flight we can compute the available seat miles ASM by multiplying the number of seats by the distance via a mutate():

# A tibble: 284,170 x 4
   carrier seats distance    ASM
   <chr>   <int>    <dbl>  <dbl>
 1 UA        149     1400 208600
 2 UA        149     1416 210984
 3 AA        178     1089 193842
 4 B6        200     1576 315200
 5 DL        178      762 135636
 6 UA        191      719 137329
 7 B6        200     1065 213000
 8 EV         55      229  12595
 9 B6        200      944 188800
10 B6        200     1028 205600
# … with 284,160 more rows

Next we want to sum the ASM for each carrier. We achieve this by first grouping by carrier and then summarizing using the sum() function:

# A tibble: 16 x 2
   carrier         ASM
   <chr>         <dbl>
 1 9E        776970310
 2 AA       3677292231
 3 AS        314104736
 4 B6       9618222135
 5 DL      10532885801
 6 EV       1817236275
 7 F9        184832280
 8 FL        219628520
 9 HA        642478122
10 MQ          7162420
11 OO          1299835
12 UA      15516377526
13 US       2533505829
14 VX       2296680778
15 WN       1718116857
16 YV         20163632

However, because for certain carriers certain flights have missing NA values, the resulting table also returns NA’s. We can eliminate these by adding a na.rm = TRUE argument to sum(), telling R that we want to remove the NA’s in the sum. We saw this in Section 3.3:

# A tibble: 16 x 2
   carrier         ASM
   <chr>         <dbl>
 1 9E        776970310
 2 AA       3677292231
 3 AS        314104736
 4 B6       9618222135
 5 DL      10532885801
 6 EV       1817236275
 7 F9        184832280
 8 FL        219628520
 9 HA        642478122
10 MQ          7162420
11 OO          1299835
12 UA      15516377526
13 US       2533505829
14 VX       2296680778
15 WN       1718116857
16 YV         20163632

Finally, we arrange() the data in desc()ending order of ASM.

# A tibble: 16 x 2
   carrier         ASM
   <chr>         <dbl>
 1 UA      15516377526
 2 DL      10532885801
 3 B6       9618222135
 4 AA       3677292231
 5 US       2533505829
 6 VX       2296680778
 7 EV       1817236275
 8 WN       1718116857
 9 9E        776970310
10 HA        642478122
11 AS        314104736
12 FL        219628520
13 F9        184832280
14 YV         20163632
15 MQ          7162420
16 OO          1299835

While the above data frame is correct, the IATA carrier code is not always useful. For example, what carrier is WN? We can address this by joining with the airlines dataset using carrier is the key variable. While this step is not absolutely required, it goes a long way to making the table easier to make sense of. It is important to be empathetic with the ultimate consumers of your presented data!

# A tibble: 16 x 3
   carrier         ASM name                       
   <chr>         <dbl> <chr>                      
 1 UA      15516377526 United Air Lines Inc.      
 2 DL      10532885801 Delta Air Lines Inc.       
 3 B6       9618222135 JetBlue Airways            
 4 AA       3677292231 American Airlines Inc.     
 5 US       2533505829 US Airways Inc.            
 6 VX       2296680778 Virgin America             
 7 EV       1817236275 ExpressJet Airlines Inc.   
 8 WN       1718116857 Southwest Airlines Co.     
 9 9E        776970310 Endeavor Air Inc.          
10 HA        642478122 Hawaiian Airlines Inc.     
11 AS        314104736 Alaska Airlines Inc.       
12 FL        219628520 AirTran Airways Corporation
13 F9        184832280 Frontier Airlines Inc.     
14 YV         20163632 Mesa Airlines Inc.         
15 MQ          7162420 Envoy Air                  
16 OO          1299835 SkyWest Airlines Inc.