join data frames
Another common data transformation task is “joining” or “merging” two different datasets. For example, in the
flights data frame, the variable
carrier lists the carrier code for the different flights. While the corresponding airline names for
"AA" might be somewhat easy to guess (United and American Airlines), what airlines have codes
"B6"? This information is provided in a separate data frame
We see that in
carrier is the carrier code, while
name is the full name of the airline company. Using this table, we can see that
"B6" correspond to Virgin America, Hawaiian Airlines, and JetBlue, respectively. However, wouldn’t it be nice to have all this information in a single data frame instead of two separate data frames? We can do this by “joining” the
airlines data frames.
Note that the values in the variable
carrier in the
flights data frame match the values in the variable
carrier in the
airlines data frame. In this case, we can use the variable
carrier as a key variable to match the rows of the two data frames. Key variables are almost always identification variables that uniquely identify the observational units as we saw in Subsection 1.4.4. This ensures that rows in both data frames are appropriately matched during the join. Hadley and Garrett (Grolemund and Wickham 2017) created the diagram shown in Figure 3.7 to help us understand how the different data frames in the
nycflights13 package are linked by various key variables:
3.7.1 Matching “key” variable names
In both the
airlines data frames, the key variable we want to join/merge/match the rows by has the same name:
carrier. Let’s use the
inner_join() function to join the two data frames, where the rows will be matched by the variable
carrier, and then compare the resulting data frames:
Observe that the
flights_joined data frames are identical except that
flights_joined has an additional variable
name. The values of
name correspond to the airline companies’ names as indicated in the
airlines data frame.
A visual representation of the
inner_join() is shown in Figure 3.8 (Grolemund and Wickham 2017). There are other types of joins available (such as
anti_join()), but the
inner_join() will solve nearly all of the problems you’ll encounter in this book.
3.7.2 Different “key” variable names
Say instead you are interested in the destinations of all domestic flights departing NYC in 2013, and you ask yourself questions like: “What cities are these airports in?”, or “Is
"ORD" Orlando?”, or “Where is
airports data frame contains the airport codes for each airport:
However, if you look at both the
flights data frames, you’ll find that the airport codes are in variables that have different names. In
airports the airport code is in
faa, whereas in
flights the airport codes are in
dest. This fact is further highlighted in the visual representation of the relationships between these data frames in Figure 3.7.
In order to join these two data frames by airport code, our
inner_join() operation will use the
by = c("dest" = "faa") argument with modified code syntax allowing us to join two data frames where the key variable has a different name:
Let’s construct the chain of pipe operators
%>% that computes the number of flights from NYC to each destination, but also includes information about each destination airport:
# A tibble: 101 x 9 dest num_flights airport_name lat lon alt tz dst tzone <chr> <int> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> 1 ORD 17283 Chicago Ohare Intl 42.0 -87.9 668 -6 A America… 2 ATL 17215 Hartsfield Jackson… 33.6 -84.4 1026 -5 A America… 3 LAX 16174 Los Angeles Intl 33.9 -118. 126 -8 A America… 4 BOS 15508 General Edward Law… 42.4 -71.0 19 -5 A America… 5 MCO 14082 Orlando Intl 28.4 -81.3 96 -5 A America… 6 CLT 14064 Charlotte Douglas … 35.2 -80.9 748 -5 A America… 7 SFO 13331 San Francisco Intl 37.6 -122. 13 -8 A America… 8 FLL 12055 Fort Lauderdale Ho… 26.1 -80.2 9 -5 A America… 9 MIA 11728 Miami Intl 25.8 -80.3 8 -5 A America… 10 DCA 9705 Ronald Reagan Wash… 38.9 -77.0 15 -5 A America… # … with 91 more rows
In case you didn’t know,
"ORD" is the airport code of Chicago O’Hare airport and
"FLL" is the main airport in Fort Lauderdale, Florida, which can be seen in the
3.7.3 Multiple “key” variables
Say instead we want to join two data frames by multiple key variables. For example, in Figure 3.7, we see that in order to join the
weather data frames, we need more than one key variable:
origin. This is because the combination of these 5 variables act to uniquely identify each observational unit in the
weather data frame: hourly weather recordings at each of the 3 NYC airports.
We achieve this by specifying a vector of key variables to join by using the
c() function. Recall from Subsection 1.2.1 that
c() is short for “combine” or “concatenate.”
(LC3.13) Looking at Figure 3.7, when joining
weather (or, in other words, matching the hourly weather values with each flight), why do we need to join by all of
origin, and not just
(LC3.14) What surprises you about the top 10 destinations from NYC in 2013?
3.7.4 Normal forms
The data frames included in the
nycflights13 package are in a form that minimizes redundancy of data. For example, the
flights data frame only saves the
carrier code of the airline company; it does not include the actual name of the airline. For example, the first row of
carrier equal to
UA, but it does not include the airline name of “United Air Lines Inc.”
The names of the airline companies are included in the
name variable of the
airlines data frame. In order to have the airline company name included in
flights, we could join these two data frames as follows:
We are capable of performing this join because each of the data frames have keys in common to relate one to another: the
carrier variable in both the
airlines data frames. The key variable(s) that we base our joins on are often identification variables as we mentioned previously.
This is an important property of what’s known as normal forms of data. The process of decomposing data frames into less redundant tables without losing information is called normalization. More information is available on Wikipedia.
dplyr and SQL we mentioned in the introduction of this chapter use such normal forms. Given that they share such commonalities, once you learn either of these two tools, you can learn the other very easily.
(LC3.15) What are some advantages of data in normal forms? What are some disadvantages?
Grolemund, Garrett, and Hadley Wickham. 2017. R for Data Science. First. Sebastopol, CA: O’Reilly Media. https://r4ds.had.co.nz/.