## D.2 Chapter 2 Solutions

**(LC2.1)** Take a look at both the `flights` and `alaska_flights` data frames by running `View(flights)` and `View(alaska_flights)` in the console. In what respect do these data frames differ? For example, think about the number of rows in each dataset.

**Solution**: `flights` contains all flight data, while `alaska_flights` contains only data from Alaska Airlines (carrier code “AS”). We can see that `flights` has 336776 rows while `alaska_flights` has only 714.

**(LC2.2)** What are some practical reasons why `dep_delay` and `arr_delay` have a positive relationship?

**Solution**: The later a plane departs, typically the later it will arrive.

**(LC2.3)** What variables in the `weather` data frame would you expect to have a negative correlation (i.e. a negative relationship) with `dep_delay`? Why? Remember that we are focusing on numerical variables here. Hint: Explore the `weather` dataset by using the `View()` function.

**Solution**: An example in the `weather` dataset is `visib`, which measures visibility in miles. As visibility increases, we would expect departure delays to decrease.

**(LC2.4)** Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaskan flights?

**Solution**: The point (0, 0) means no delay in either departure or arrival. From the point of view of Alaska Airlines, this means the flight was on time. It seems most flights are at least close to being on time.

**(LC2.5)** What are some other features of the plot that stand out to you?

**Solution**: Different people will answer this one differently. One answer is most flights depart and arrive less than an hour late.

**(LC2.6)** Create a new scatterplot using different variables in the `alaska_flights` data frame by modifying the example above.

**Solution**: There are many possibilities for this one; see the plot below. Is there a pattern in departure delay depending on when the flight is scheduled to depart? Interestingly, there seem to be only two blocks of time where flights depart.

**(LC2.7)** Why is setting the `alpha` argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot?

**Solution**: It makes the points semi-transparent, which addresses overplotting. But more importantly it hints at the (statistical) **density** and **distribution** of the points: where the points are concentrated and where they occur.

**(LC2.8)** After viewing Figure 2.4 above, give an approximate range of arrival delays and departure delays that occur the most frequently. How has that region changed compared to when you observed the same plot without the `alpha = 0.2` set in Figure 2.2?

**Solution**: The lower plot suggests that most Alaska flights from NYC depart between 12 minutes early and on time and arrive between 50 minutes early and on time.

**(LC2.9)** Take a look at both the `weather` and `early_january_weather` data frames by running `View(weather)` and `View(early_january_weather)` in the console. In what respect do these data frames differ?

**Solution**: The rows of `early_january_weather` are a subset of `weather`.

**(LC2.10)** `View()` the `flights` data frame again. Why does the `time_hour` variable uniquely identify the hour of the measurement whereas the `hour` variable does not?

**Solution**: Because to uniquely identify an hour, we need the `year`/`month`/`day`/`hour` sequence, whereas there are only 24 possible values of `hour`.
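As a minimal sketch of this point, the block below uses a made-up toy data frame (hypothetical values, not rows from `flights`) and base R only. It assumes a date-time built from `year`/`month`/`day`/`hour` behaves like `time_hour`, and it counts distinct values to show that `hour` repeats across days while the full date-time does not.

```r
# Toy data (hypothetical values): the same clock hours recorded on two days
toy <- data.frame(
  year  = c(2013, 2013, 2013, 2013),
  month = c(1, 1, 1, 1),
  day   = c(1, 1, 2, 2),
  hour  = c(5, 6, 5, 6)
)

# Build a date-time from year/month/day/hour, analogous to time_hour
toy$time_hour <- ISOdatetime(toy$year, toy$month, toy$day, toy$hour,
                             min = 0, sec = 0, tz = "UTC")

# hour alone repeats from one day to the next
n_unique_hour <- length(unique(toy$hour))

# the full date-time differs on every row
n_unique_time_hour <- length(unique(toy$time_hour))

n_unique_hour       # 2 distinct values across 4 rows
n_unique_time_hour  # 4 distinct values across 4 rows
```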

**(LC2.11)** Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis?

**Solution**: Because lines suggest connectedness and ordering.

**(LC2.12)** Why are linegraphs frequently used when time is the explanatory variable?

**Solution**: Because time is sequential: subsequent observations are closely related to each other.

**(LC2.13)** Plot a time series of a variable other than `temp` for Newark Airport in the first 15 days of January 2013.

**Solution**: Humidity is a good one to look at, since it is very closely related to the cycles of a day.

**(LC2.14)** What does changing the number of bins from 30 to 40 tell us about the distribution of temperatures?

**Solution**: The distribution doesn’t change much. But by refining the bin width, we see that the temperature data has a high degree of accuracy. What do I mean by accuracy? Looking at the `temp` variable using `View(weather)`, we see that the precision of each temperature recording is 2 decimal places.

**(LC2.15)** Would you classify the distribution of temperatures as symmetric or skewed?

**Solution**: It is rather symmetric, i.e. there are no **long tails** on only one side of the distribution.

**(LC2.16)** What would you guess is the “center” value in this distribution? Why did you make that choice?

**Solution**: The center is around 55.26°F. By running the `summary()` command, we see that the mean and median are very similar. In fact, when the distribution is symmetric the mean equals the median.
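To illustrate the link between symmetry and the mean/median comparison, here is a small sketch on made-up temperatures (hypothetical values, not the `weather` data). It shows the mean and median agreeing for a symmetric set of values, and drifting apart once a single extreme value adds a long right tail.

```r
# Toy temperatures (hypothetical values), symmetric around 55
temps <- c(35, 45, 55, 65, 75)

mean_sym   <- mean(temps)
median_sym <- median(temps)

# Add one extreme value to create a long right tail
skewed <- c(temps, 150)

mean_skew   <- mean(skewed)    # pulled upward by the extreme value
median_skew <- median(skewed)  # moves much less

# summary() reports the mean and median together, along with the quartiles
summary(temps)
summary(skewed)
```

In the skewed case the mean exceeds the median, which is the usual signature of a right tail.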

**(LC2.17)** Is this data spread out greatly from the center or is it close? Why?

**Solution**: This can only be answered relatively speaking! Let’s pick things to be relative to Seattle, WA temperatures:

While it appears that Seattle weather has a similar center of 55°F, its temperatures are almost entirely between 35°F and 75°F, for a range of about 40°F. Seattle temperatures are much less spread out than New York’s, i.e. much more consistent over the year. New York, on the other hand, has much colder days in the winter and much hotter days in the summer. Expressed differently, the middle 50% of values, as delineated by the interquartile range, is 30°F:

**(LC2.18)** What other things do you notice about the faceted plot above? How does a faceted plot help us see relationships between two variables?

**Solution**:

- Certain months have much more consistent weather (August in particular), while others have crazy variability like January and October, representing changes in the seasons.
- Because we see `temp` recordings split by `month`, we are considering the relationship between these two variables. For example, for summer months, temperatures tend to be higher.

**(LC2.19)** What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100?

**Solution**:

- They correspond to the month of the weather recording. While month is technically a number between 1-12, we’re viewing it as a categorical variable here. Specifically, this is an **ordinal categorical** variable since there is an ordering to the categories.
- 25, 50, 75, 100 are temperatures (in °F).

**(LC2.20)** For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics.

**Solution**:

- It would not work if we had a very large number of facets. For example, if we faceted by individual days rather than months, we would have 365 facets to look at. When considering all days in 2013, it could be argued that we shouldn’t care about day-to-day fluctuation in weather so much, but rather month-to-month fluctuations, allowing us to focus on seasonal trends.

**(LC2.21)** Does the `temp` variable in the `weather` dataset have a lot of variability? Why do you say that?

**Solution**: Again, as in (LC2.17), this is a relative question. I would say yes, because in New York City, you have 4 clear seasons with different weather. Whereas in Seattle WA and Portland OR, you have two seasons: summer and rain!

**(LC2.22)** What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.

**Solution**: It appears to be an outlier. Let’s revisit the use of the `filter` command to home in on it. We want all data points where the `month` is 5 and `temp < 25`:

```
# A tibble: 1 x 16
  origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
  <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
1 JFK     2013     5     8    22  13.1 12.02 95.34       80    8.05546        NA
# … with 5 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
#   time_hour <dttm>, temp_in_C <dbl>
```

There appears to be only one hour, and only at JFK, that recorded 13.1°F (-10.5°C) in the month of May. This is probably a data entry mistake! Why wasn’t the weather at least similar at EWR (Newark) and LGA (LaGuardia)?
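The filtering condition described in this solution can be sketched as follows. This is only an illustration: it uses base R's `subset()` as a stand-in for the `filter` command, and `toy_weather` is a made-up data frame with hypothetical rows rather than the real `weather` recordings.

```r
# Toy stand-in for the weather data (hypothetical rows, not real recordings)
toy_weather <- data.frame(
  origin = c("EWR", "JFK", "LGA", "JFK"),
  month  = c(5, 5, 5, 6),
  temp   = c(57.9, 13.1, 59.0, 20.0)
)

# Keep only the rows where month is 5 AND temp is below 25.
# The June row has a low temp but fails the month condition,
# and the other May rows fail the temp condition.
suspect <- subset(toy_weather, month == 5 & temp < 25)

suspect
```

Both conditions must hold at once, so only the single cold May row survives.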

**(LC2.23)** Which months have the highest variability in temperature? What reasons can you give for this?

**Solution**: We are now interested in the **spread** of the data. One measure some of you may have seen previously is the standard deviation. But in this plot we can read off the Interquartile Range (IQR):

- The distance from the 1st to the 3rd quartiles, i.e. the length of the boxes
- You can also think of this as the spread of the **middle 50%** of the data

Just from eyeballing it, it seems

- November has the biggest IQR, i.e. the widest box, so has the most variation in temperature
- August has the smallest IQR, i.e. the narrowest box, so is the most consistent temperature-wise

Here’s how we compute the exact IQR values for each month (we’ll see this in more depth in Chapter 3 of the text):

- `group` the observations by `month`, then
- for each `group`, i.e. `month`, `summarize` it by applying the summary statistic function `IQR()`, while making sure to skip over missing data via `na.rm=TRUE`, then
- `arrange` the table in `desc`ending order of `IQR`

| month | IQR |
|---|---|
| 11 | 16.02 |
| 12 | 14.04 |
| 1 | 13.77 |
| 9 | 12.06 |
| 4 | 12.06 |
| 5 | 11.88 |
| 6 | 10.98 |
| 10 | 10.98 |
| 2 | 10.08 |
| 7 | 9.18 |
| 3 | 9.00 |
| 8 | 7.02 |
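The group, summarize, and arrange steps listed above can be sketched in a few lines. Note the assumptions: this version uses base R (`tapply()` and `order()`) rather than the verbs named in the text, so that it runs without extra packages, and `toy_weather` holds made-up values, so its IQRs will not match the table.

```r
# Toy stand-in for the weather data (hypothetical values), with one missing temp
toy_weather <- data.frame(
  month = c(1, 1, 1, 1, 8, 8, 8, 8),
  temp  = c(20, 30, 40, NA, 70, 72, 74, 76)
)

# Group by month and summarize each group with IQR(), skipping missing values
iqr <- tapply(toy_weather$temp, toy_weather$month, IQR, na.rm = TRUE)

# Collect the per-month results into a table
iqr_by_month <- data.frame(
  month = as.integer(names(iqr)),
  IQR   = as.vector(iqr)
)

# Arrange the table in descending order of IQR
iqr_by_month <- iqr_by_month[order(iqr_by_month$IQR, decreasing = TRUE), ]

iqr_by_month
```

Without `na.rm = TRUE`, the month containing the missing value would raise an error inside `IQR()` instead of returning a number.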

**(LC2.24)** We looked at the distribution of the numerical variable `temp` split by the numerical variable `month` that we converted to a categorical variable using the `factor()` function. Why would a boxplot of `temp` split by the numerical variable `pressure` similarly converted to a categorical variable using the `factor()` not be informative?

**Solution**: Because there are 12 unique values of `month` yielding only 12 boxes in our boxplot. There are many more unique values of `pressure` (469 unique values in fact), because values are to the first decimal place. This would lead to 469 boxes, which is too many for people to digest.
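One quick way to anticipate this problem is to count distinct values before converting a variable with `factor()`. The sketch below uses made-up vectors (hypothetical values, far fewer than in `weather`) to show that the number of factor levels, and hence the number of boxes, equals the number of unique values.

```r
# Toy values (hypothetical): month repeats, pressure is nearly always distinct
month    <- c(1, 1, 2, 2, 3, 3)
pressure <- c(1012.1, 1012.4, 1013.0, 1009.8, 1011.5, 1010.2)

# Number of distinct values in each variable
n_month    <- length(unique(month))
n_pressure <- length(unique(pressure))

# factor() creates one level per distinct value, i.e. one box per level
n_month_levels    <- nlevels(factor(month))
n_pressure_levels <- nlevels(factor(pressure))

n_month_levels     # 3 boxes for 6 observations
n_pressure_levels  # 6 boxes for 6 observations
```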

**(LC2.25)** Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?

**Solution**: In a histogram, the bin corresponding to where an outlier lies may not be high enough for us to see. In a boxplot, outliers are explicitly drawn as separate points.

**(LC2.26)** Why are histograms inappropriate for visualizing categorical variables?

**Solution**: Histograms are for numerical variables i.e. the horizontal part of each histogram bar represents an interval, whereas for a categorical variable each bar represents only one level of the categorical variable.

**(LC2.27)** What is the difference between histograms and barplots?

**Solution**: See above.

**(LC2.28)** How many Envoy Air flights departed NYC in 2013?

**Solution**: Envoy Air is carrier code `MQ` and thus 26397 flights departed NYC in 2013.

**(LC2.29)** What was the seventh highest airline in terms of departed flights from NYC in 2013? How could we better present the table to get this answer quickly?

**Solution**: The answer is US, AKA U.S. Airways, with 20536 flights. However, picking out the seventh highest airline when the rows are sorted alphabetically by carrier code is difficult. This would be easier to do if the rows were sorted by number. We’ll learn how to do this in Chapter 3 on data wrangling.
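As a preview of the sorting idea, here is a minimal base R sketch. The carrier codes `C1`, `C2`, `C3` and their counts are hypothetical placeholders, not real airlines, and the approach (`table()` plus `sort()`) is an assumption for illustration rather than the method Chapter 3 uses.

```r
# Toy carrier column (hypothetical codes, one entry per flight)
carrier <- c("C1", "C1", "C2", "C3", "C3", "C3")

# table() counts flights per carrier, ordered alphabetically by code
counts <- table(carrier)

# Sorting by count puts the busiest carrier first
sorted <- sort(counts, decreasing = TRUE)

# The k-th highest carrier can now be read off by position
second_highest <- names(sorted)[2]

sorted
second_highest
```

With the rows ordered by count, finding the seventh highest airline is just a matter of reading the seventh row.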

**(LC2.30)** Why should pie charts be avoided and replaced by barplots?

**Solution**: In our **opinion**, comparisons using horizontal lines are easier than comparing angles and areas of circles.

**(LC2.31)** What is your opinion as to why pie charts continue to be used?

**Solution**: In our **opinion**, pie charts are generally considered a poorer method for communicating data than bar charts. People’s brains are not as good at comparing the size of angles because there is no scale, and in comparison, it is much easier to compare the heights of bars in a bar chart. However, in some circumstances, for example, when representing 25% and 75% of a sample size, if we have 2 bars, in which the higher one is three times the height of the other one, it is difficult to tell the scale of their comparison without labels. But in a pie chart, it would be easy to compare if a circle is divided into 75% and 25%. (Read more at: https://www.displayr.com/why-pie-charts-are-better-than-bar-charts/)

**(LC2.32)** What kinds of questions are not easily answered by looking at the above figure?

**Solution**: Because the red, green, and blue bars don’t all start at 0 (only red does), it makes comparing counts hard.

**(LC2.33)** What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights?

**Solution**: The different airlines prefer different airports. For example, United is mostly a Newark carrier and JetBlue is a JFK carrier. If airlines didn’t prefer airports, each color would be roughly one third of each bar.

**(LC2.34)** Why might the side-by-side (AKA dodged) barplot be preferable to a stacked barplot in this case?

**Solution**: We can easily compare the different airports for a given carrier using a single comparison line, i.e. things are lined up.

**(LC2.35)** What are the disadvantages of using a side-by-side (AKA dodged) barplot, in general?

**Solution**: It is hard to get totals for each airline.

**(LC2.36)** Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case?

**Solution**: Not that different than using side-by-side; depends on how you want to organize your presentation.

**(LC2.37)** What information about the different carriers at different airports is more easily seen in the faceted barplot?

**Solution**: Now we can also compare the different carriers **within** a particular airport easily too. For example, we can read off who the top carrier for each airport is easily using a single horizontal line.