D.2 Chapter 2 Solutions

library(nycflights13)
library(ggplot2)
library(dplyr)

(LC2.1) Take a look at both the flights and alaska_flights data frames by running View(flights) and View(alaska_flights) in the console. In what respect do these data frames differ? For example, think about the number of rows in each dataset.

Solution: flights contains data on all flights, while alaska_flights contains only flights operated by Alaska Airlines (carrier code “AS”). We can see that flights has 336776 rows while alaska_flights has only 714.
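The row counts can also be checked in code rather than with View(). This is only a sketch: it assumes alaska_flights was built by filtering flights down to carrier code "AS", and the names n_all and n_alaska are ours.

```r
library(nycflights13)
library(dplyr)

# Assumption: alaska_flights is flights restricted to carrier code "AS"
alaska_flights <- flights %>%
  filter(carrier == "AS")

# Compare the number of rows in the two data frames
n_all <- nrow(flights)
n_alaska <- nrow(alaska_flights)
c(flights = n_all, alaska_flights = n_alaska)
```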

(LC2.2) What are some practical reasons why dep_delay and arr_delay have a positive relationship?

Solution: The later a plane departs, typically the later it will arrive.

(LC2.3) What variables in the weather data frame would you expect to have a negative correlation (i.e. a negative relationship) with dep_delay? Why? Remember that we are focusing on numerical variables here. Hint: Explore the weather dataset by using the View() function.

Solution: An example in the weather dataset is visib, which measures visibility in miles. As visibility increases, we would expect departure delays to decrease.
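One rough way to check this expectation is to match each flight to the weather at its origin airport in its scheduled departure hour and compute a correlation. This is a sketch under assumptions: joining on origin and time_hour is our choice of key, and delay_weather and visib_cor are hypothetical names.

```r
library(nycflights13)
library(dplyr)

# Keep only the columns needed, then match each flight to the weather
# recorded at its origin airport during its scheduled departure hour
delay_weather <- flights %>%
  select(origin, time_hour, dep_delay) %>%
  inner_join(
    weather %>% select(origin, time_hour, visib),
    by = c("origin", "time_hour")
  )

# Correlation using only rows where both values are present
visib_cor <- cor(delay_weather$dep_delay, delay_weather$visib,
                 use = "complete.obs")
visib_cor
```

A value below zero would agree with the answer above, although a single correlation coefficient says nothing about why the two variables move together.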

(LC2.4) Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaskan flights?

Solution: The point (0, 0) means no delay in either departure or arrival. From the point of view of Alaska Airlines, this means the flight was on time. It seems most flights are at least close to being on time.

(LC2.5) What are some other features of the plot that stand out to you?

Solution: Different people will answer this one differently. One answer is that most flights depart and arrive less than an hour late.

(LC2.6) Create a new scatterplot using different variables in the alaska_flights data frame by modifying the example above.

Solution: Many possibilities for this one, see the plot below. Is there a pattern in departure delay depending on when the flight is scheduled to depart? Interestingly, there seem to be only two blocks of time in which flights depart.

ggplot(data = alaska_flights, mapping = aes(x = dep_time, y = dep_delay)) +
  geom_point()

(LC2.7) Why is setting the alpha argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot?

Solution: It makes the points transparent, which addresses overplotting. But more importantly it hints at the (statistical) density and distribution of the points: where the points are concentrated and where they occur.
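To see the effect for yourself, here is a minimal sketch of a scatterplot with transparency. It assumes alaska_flights is flights filtered to carrier "AS"; alpha_plot is just our name for the plot object.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

# Assumption: alaska_flights is flights restricted to carrier code "AS"
alaska_flights <- flights %>%
  filter(carrier == "AS")

# alpha = 0.2 makes each point 20% opaque, so darker regions mark
# places where many points overlap
alpha_plot <- ggplot(data = alaska_flights,
                     mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2)
alpha_plot
```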

(LC2.8) After viewing the Figure 2.4 above, give an approximate range of arrival delays and departure delays that occur the most frequently. How has that region changed compared to when you observed the same plot without the alpha = 0.2 set in Figure 2.2?

Solution: The lower plot suggests that most Alaska flights from NYC depart between 12 minutes early and on time and arrive between 50 minutes early and on time.

(LC2.9) Take a look at both the weather and early_january_weather data frames by running View(weather) and View(early_january_weather) in the console. In what respect do these data frames differ?

Solution: The rows of early_january_weather are a subset of weather: only the recordings at Newark Airport in the first 15 days of January 2013.

(LC2.10) View() the flights data frame again. Why does the time_hour variable uniquely identify the hour of the measurement whereas the hour variable does not?

Solution: Because to uniquely identify an hour, we need the year/month/day/hour sequence, whereas there are only 24 possible values of hour.
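Counting distinct values makes the difference concrete. This is a small sketch; hour_counts and its column names are ours.

```r
library(nycflights13)
library(dplyr)

# hour repeats every day, so it can take at most 24 distinct values,
# while time_hour combines the date and the hour into one timestamp
hour_counts <- flights %>%
  summarize(
    distinct_hour = n_distinct(hour),
    distinct_time_hour = n_distinct(time_hour)
  )
hour_counts
```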

(LC2.11) Why should linegraphs be avoided when there is not a clear ordering of the horizontal axis?

Solution: Because lines suggest connectedness and ordering.

(LC2.12) Why are linegraphs frequently used when time is the explanatory variable?

Solution: Because time is sequential: subsequent observations are closely related to each other.

(LC2.13) Plot a time series of a variable other than temp for Newark Airport in the first 15 days of January 2013.

Solution: Humidity is a good one to look at, since it is very closely related to the cycles of a day.

ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = humid)) +
  geom_line()

(LC2.14) What does changing the number of bins from 30 to 40 tell us about the distribution of temperatures?

Solution: The distribution doesn’t change much. But by refining the bin width, we see that the temperature data has a high degree of precision. What do I mean by precision? Looking at the temp variable by View(weather), we see that each temperature is recorded to 2 decimal places.
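A sketch of the two histograms being compared is below. The object names temp_hist_30 and temp_hist_40 are ours, and the white bar outline is only a styling choice.

```r
library(nycflights13)
library(ggplot2)

# Same variable, two different numbers of bins
temp_hist_30 <- ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(bins = 30, color = "white")

temp_hist_40 <- ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(bins = 40, color = "white")

# Print one after the other to compare the overall shapes
temp_hist_30
temp_hist_40
```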

(LC2.15) Would you classify the distribution of temperatures as symmetric or skewed?

Solution: It is rather symmetric, i.e. there are no long tails on only one side of the distribution.

(LC2.16) What would you guess is the “center” value in this distribution? Why did you make that choice?

Solution: The center is around 55.26°F. By running the summary() command, we see that the mean and median are very similar. In fact, when a distribution is perfectly symmetric, the mean equals the median.
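The summary() call mentioned above might look like the following sketch. The names temp_mean and temp_median are ours, and missing temperatures are dropped with na.rm = TRUE.

```r
library(nycflights13)

# Quartiles, mean, and the number of missing values in one call
summary(weather$temp)

# The two usual measures of center, skipping missing values
temp_mean <- mean(weather$temp, na.rm = TRUE)
temp_median <- median(weather$temp, na.rm = TRUE)
c(mean = temp_mean, median = temp_median)
```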

(LC2.17) Is this data spread out greatly from the center or is it close? Why?

Solution: This can only be answered relatively speaking! Let’s pick things to be relative to Seattle, WA temperatures.

While it appears that Seattle weather has a similar center of 55°F, its temperatures are almost entirely between 35°F and 75°F, for a range of about 40°F. Seattle temperatures are much less spread out than New York’s, i.e. much more consistent over the year. New York, on the other hand, has much colder days in the winter and much hotter days in the summer. Expressed differently, the middle 50% of New York values, as delineated by the interquartile range, spans 30°F.
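The interquartile range for the New York temperatures can be computed directly. This is a minimal sketch; temp_quartiles and temp_iqr are our names.

```r
library(nycflights13)

# 1st and 3rd quartiles of temp, skipping missing values
temp_quartiles <- quantile(weather$temp, probs = c(0.25, 0.75), na.rm = TRUE)
temp_quartiles

# The IQR is the distance between those two quartiles
temp_iqr <- IQR(weather$temp, na.rm = TRUE)
temp_iqr
```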

(LC2.18) What other things do you notice about the faceted plot above? How does a faceted plot help us see relationships between two variables?

Solution:

• Certain months have much more consistent weather (August in particular), while others have crazy variability like January and October, representing changes in the seasons.
• Because we see temp recordings split by month, we are considering the relationship between these two variables. For example, for summer months, temperatures tend to be higher.

(LC2.19) What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100?

Solution:

• They correspond to the month of the weather recording. While month is technically a number between 1 and 12, we’re viewing it as a categorical variable here. Specifically, this is an ordinal categorical variable since there is an ordering to the categories.
• 25, 50, 75, 100 are temperatures in °F.

(LC2.20) For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics.

Solution:

• It would not work if we had a very large number of facets. For example, if we faceted by individual days rather than months, we would have 365 facets to look at. When considering all days in 2013, it could be argued that we shouldn’t care about day-to-day fluctuation in weather so much, but rather month-to-month fluctuations, allowing us to focus on seasonal trends.

(LC2.21) Does the temp variable in the weather dataset have a lot of variability? Why do you say that?

Solution: Again, as in (LC2.17), this is a relative question. I would say yes, because in New York City you have 4 clear seasons with different weather, whereas in Seattle, WA and Portland, OR you have two seasons: summer and rain!

(LC2.22) What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.

Solution: It appears to be an outlier. Let’s revisit the use of the filter command to home in on it. We want all data points where the month is 5 and temp < 25:

weather %>%
  filter(month == 5 & temp < 25)
# A tibble: 1 x 16
origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
<chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
1 JFK     2013     5     8    22  13.1 12.02 95.34       80    8.05546        NA
# … with 5 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
#   time_hour <dttm>, temp_in_C <dbl>

Only one hourly recording, and only at JFK, shows 13.1°F (-10.5°C) in the month of May. This is probably a data entry mistake! Otherwise, why wasn’t the weather at least similar at EWR (Newark) and LGA (LaGuardia)?
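To follow up on that last question, we can put the three airports side by side for the same hour. This is a sketch that assumes the suspicious recording sits at day 8, hour 22, as in the output above; may8_hour22 is our name.

```r
library(nycflights13)
library(dplyr)

# Temperatures at all three airports for the hour of the suspect reading
may8_hour22 <- weather %>%
  filter(month == 5, day == 8, hour == 22) %>%
  select(origin, time_hour, temp) %>%
  arrange(origin)
may8_hour22
```

If EWR and LGA show ordinary May temperatures for that hour, that supports reading the JFK value as a recording error rather than a real cold snap.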

(LC2.23) Which months have the highest variability in temperature? What reasons do you think this is?

Solution: We are now interested in the spread of the data. One measure some of you may have seen previously is the standard deviation. But in this plot we can read off the Interquartile Range (IQR):

• The distance from the 1st to the 3rd quartiles i.e. the length of the boxes
• You can also think of this as the spread of the middle 50% of the data

Just from eyeballing it, it seems:

• November has the biggest IQR, i.e. the widest box, so has the most variation in temperature
• August has the smallest IQR, i.e. the narrowest box, so is the most consistent temperature-wise

Here’s how we compute the exact IQR values for each month (we’ll see this more in depth in Chapter 3 of the text):

1. group the observations by month, then
2. for each group, i.e. month, summarize it by applying the summary statistic function IQR(), while making sure to skip over missing data via na.rm = TRUE, then
3. arrange the table in descending order of IQR

weather %>%
  group_by(month) %>%
  summarize(IQR = IQR(temp, na.rm = TRUE)) %>%
  arrange(desc(IQR))
month   IQR
   11 16.02
   12 14.04
    1 13.77
    9 12.06
    4 12.06
    5 11.88
    6 10.98
   10 10.98
    2 10.08
    7  9.18
    3  9.00
    8  7.02

(LC2.24) We looked at the distribution of the numerical variable temp split by the numerical variable month that we converted to a categorical variable using the factor() function. Why would a boxplot of temp split by the numerical variable pressure similarly converted to a categorical variable using the factor() not be informative?

Solution: Because there are 12 unique values of month yielding only 12 boxes in our boxplot. There are many more unique values of pressure (469 unique values in fact), because values are to the first decimal place. This would lead to 469 boxes, which is too many for people to digest.
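The number of boxes each choice would produce can be counted directly. This is a small sketch; level_counts and its column names are ours, and missing pressure readings are left out of the count.

```r
library(nycflights13)
library(dplyr)

# Each distinct value becomes one box once the variable is
# converted with factor()
level_counts <- weather %>%
  summarize(
    month_levels = n_distinct(month),
    pressure_levels = n_distinct(pressure, na.rm = TRUE)
  )
level_counts
```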

(LC2.25) Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?

Solution: In a histogram, the bin corresponding to where an outlier lies may not be high enough for us to see. In a boxplot, outliers are explicitly marked as separate points.

(LC2.26) Why are histograms inappropriate for visualizing categorical variables?

Solution: Histograms are for numerical variables i.e. the horizontal part of each histogram bar represents an interval, whereas for a categorical variable each bar represents only one level of the categorical variable.

(LC2.27) What is the difference between histograms and barplots?

Solution: See above.

(LC2.28) How many Envoy Air flights departed NYC in 2013?

Solution: Envoy Air is carrier code MQ and thus 26397 flights departed NYC in 2013.
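One way to arrive at this number without reading it off a barplot is sketched below. It looks up the carrier code in the airlines data frame and then counts the matching rows of flights; envoy_code and envoy_flights are our names.

```r
library(nycflights13)
library(dplyr)

# Look up the two-letter code for Envoy Air in the airlines table
envoy_code <- airlines %>%
  filter(grepl("Envoy", name)) %>%
  pull(carrier)
envoy_code

# Count the flights operated under that code
envoy_flights <- flights %>%
  filter(carrier %in% envoy_code) %>%
  nrow()
envoy_flights
```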

(LC2.29) What was the seventh highest airline in terms of departed flights from NYC in 2013? How could we better present the table to get this answer quickly?

Solution: The answer is US, AKA US Airways, with 20536 flights. However, picking out the seventh highest airline when the rows are sorted alphabetically by carrier code is difficult. This would be easier to do if the rows were sorted by number of flights. We’ll learn how to do this in Chapter 3 on data wrangling.
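As a preview, a sorted table might be produced as in this sketch. It uses count() with sort = TRUE, which is one of several ways to do it; carrier_counts and seventh are our names.

```r
library(nycflights13)
library(dplyr)

# Number of flights per carrier, largest first
carrier_counts <- flights %>%
  count(carrier, sort = TRUE)
carrier_counts

# With the rows sorted, the seventh highest is simply the seventh row
seventh <- carrier_counts %>%
  slice(7)
seventh
```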

(LC2.30) Why should pie charts be avoided and replaced by barplots?

Solution: In our opinion, comparisons using horizontal lines are easier than comparing angles and areas of circles.

(LC2.31) What is your opinion as to why pie charts continue to be used?

Solution: In our opinion, pie charts are generally a poorer method for communicating data than bar charts. People’s brains are not as good at comparing the sizes of angles because there is no scale; in comparison, it is much easier to compare the heights of bars in a bar chart. However, in some circumstances pie charts can work well. For example, when representing 25% and 75% of a sample with 2 bars, where the higher one is three times the height of the other, it is difficult to tell the scale of the comparison without labels. But in a pie chart, it is easy to see that a circle is divided into 75% and 25%. (Read more at: https://www.displayr.com/why-pie-charts-are-better-than-bar-charts/)

(LC2.32) What kinds of questions are not easily answered by looking at the above figure?

Solution: Because the red, green, and blue bars don’t all start at 0 (only red does), it makes comparing counts hard.

(LC2.33) What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights?

Solution: The different airlines prefer different airports. For example, United is mostly a Newark carrier and JetBlue is a JFK carrier. If airlines didn’t prefer airports, each color would be roughly one third of each bar.
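The "one third of each bar" benchmark can be checked numerically by computing, for each carrier, the share of its flights that leave from each airport. This is a sketch; carrier_origin and prop are our names.

```r
library(nycflights13)
library(dplyr)

# Flights per carrier and airport, then each airport's share
# of that carrier's total
carrier_origin <- flights %>%
  count(carrier, origin) %>%
  group_by(carrier) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# United (UA) and JetBlue (B6), the two examples from the answer
carrier_origin %>%
  filter(carrier %in% c("UA", "B6"))
```

Shares far from one third for a carrier indicate that it concentrates its flights at particular airports.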

(LC2.34) Why might the side-by-side (AKA dodged) barplot be preferable to a stacked barplot in this case?

Solution: We can easily compare the different airports for a given carrier using a single comparison line, i.e. things are lined up.

(LC2.35) What are the disadvantages of using a side-by-side (AKA dodged) barplot, in general?

Solution: It is hard to get totals for each airline.

(LC2.36) Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case?

Solution: It is not that different from using side-by-side barplots; it depends on how you want to organize your presentation.

(LC2.37) What information about the different carriers at different airports is more easily seen in the faceted barplot?

Solution: Now we can also compare the different carriers within a particular airport easily too. For example, we can read off who the top carrier for each airport is easily using a single horizontal line.
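The top carrier at each airport can also be read off a small table instead of the plot. This sketch assumes a dplyr version that provides slice_max(); top_by_airport is our name.

```r
library(nycflights13)
library(dplyr)

# Count flights per airport and carrier, then keep the carrier with
# the most flights at each airport
top_by_airport <- flights %>%
  count(origin, carrier) %>%
  group_by(origin) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()
top_by_airport
```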