3.8 Other verbs
Here are some other useful data wrangling verbs:
- select() only a subset of variables/columns.
- rename() variables/columns to have new names.
- Return only the top_n() values of a variable.
We’ve seen that the flights data frame in the nycflights13 package contains 19 different variables. You can identify the names of these 19 variables by running the glimpse() function from the dplyr package. However, say you only need two of these 19 variables, one of which is flight. You can select() these two variables.
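A minimal sketch of these two steps follows. The second variable name did not survive in the text above, so `carrier` is an assumed choice for illustration, and `flights_two` is a hypothetical name.

```r
library(dplyr)
library(nycflights13)

# Print every column of flights with its type and first few values
glimpse(flights)

# Keep only two columns. `flight` comes from the text; `carrier` is an
# assumed stand-in for the variable name missing from the text.
flights_two <- flights %>%
  select(carrier, flight)
```

Every row is kept; only the set of columns changes.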
This function makes it easier to explore large datasets since it allows us to limit the scope to only those variables we care most about. For example, if we select() only a smaller number of variables as is shown in Figure 3.9, it will make viewing the dataset in RStudio’s spreadsheet viewer more digestible.
Let’s say instead you want to drop, or de-select, certain variables. For example, consider the variable year in the flights data frame. This variable isn’t quite a “variable” because it is always 2013 and hence doesn’t change. Say you want to remove this variable from the data frame. We can deselect year by using the - sign in front of its name inside select().
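As a short sketch, dropping year could look like the following. The name `flights_no_year` is hypothetical.

```r
library(dplyr)
library(nycflights13)

# A minus sign in front of a column name drops that column and keeps the rest
flights_no_year <- flights %>%
  select(-year)

# The remaining column names no longer include year
names(flights_no_year)
```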
Another way of selecting columns/variables is by specifying a range of columns with the : operator. With one range ending at day and another ending at sched_arr_time, select() keeps all columns within those two ranges and drops the rest.
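The sketch below illustrates range selection. The starting columns of the two ranges did not survive in the text, so `month` and `arr_time` are assumptions chosen to sit just before `day` and `sched_arr_time` in the column order of flights. `flight_arr_times` is a hypothetical name.

```r
library(dplyr)
library(nycflights13)

# first:last selects every column from `first` through `last`, in the
# order the columns appear in the data frame. The range starts used here
# (month, arr_time) are assumptions.
flight_arr_times <- flights %>%
  select(month:day, arr_time:sched_arr_time)

names(flight_arr_times)
```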
The select() function can also be used to reorder columns when used with the everything() helper function. For example, suppose we want a set of variables that includes time_hour to appear immediately after a set that includes day, while not discarding the rest of the variables. Listing those variables first and then adding everything() will pick up all remaining variables.
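A sketch of such a reordering follows. Only `day` and `time_hour` are named in the surviving text, so the other variables listed (`year`, `month`, `hour`, `minute`) are assumptions, as is the name `flights_reorder`.

```r
library(dplyr)
library(nycflights13)

# Columns named explicitly come first, in the order given; everything()
# then appends all columns not already mentioned, so nothing is dropped.
flights_reorder <- flights %>%
  select(year, month, day, hour, minute, time_hour, everything())

names(flights_reorder)
```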
Lastly, the helper functions starts_with(), ends_with(), and contains() can be used to select variables/columns whose names match those conditions.
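The following sketch shows one possible call per helper. The match strings ("a", "delay", "time") and the object names are illustrative choices, not taken from the text.

```r
library(dplyr)
library(nycflights13)

# Columns whose names begin with "a": arr_time, arr_delay, air_time
a_cols <- flights %>% select(starts_with("a"))

# Columns whose names end with "delay": dep_delay, arr_delay
delay_cols <- flights %>% select(ends_with("delay"))

# Columns whose names contain "time" anywhere, e.g. dep_time and time_hour
time_cols <- flights %>% select(contains("time"))
```

Each helper matches on column names only, never on the values stored in the columns.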
Another useful function is rename(), which as you may have guessed changes the name of variables. Suppose we want to only focus on dep_time and arr_time and change dep_time and arr_time to be departure_time and arrival_time instead in the flights_time data frame.
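A minimal sketch of this select-then-rename step, assuming flights_time is built directly from flights:

```r
library(dplyr)
library(nycflights13)

# Keep the two time columns, then rename each one.
# The pattern inside rename() is new_name = old_name.
flights_time <- flights %>%
  select(dep_time, arr_time) %>%
  rename(departure_time = dep_time, arrival_time = arr_time)

names(flights_time)
```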
Note that in this case we used a single = sign within the rename(). For example, departure_time = dep_time renames the dep_time variable to have the new name departure_time. This is because we are not testing for equality like we would using ==. Instead we want to assign a new variable departure_time to have the same values as dep_time and then delete the variable dep_time. Note that new dplyr users often forget that the new variable name comes before the equal sign.
top_n values of a variable
We can also return the top n values of a variable using the top_n() function. For example, we can return a data frame of the top 10 destination airports using the example from Subsection 3.7.2. Observe that we set the number of values to return to n = 10 and wt = num_flights to indicate that we want the rows corresponding to the top 10 values of num_flights. See the help file for top_n() by running ?top_n for more information.
We can also arrange() these results in descending order of num_flights.
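The sketch below puts the top_n() and arrange() steps together. The example from Subsection 3.7.2 is not reproduced here, so it assumes num_flights is the number of flights per destination, computed with group_by() and summarize(). The object names are hypothetical. In dplyr 1.0.0 and later, top_n() is marked as superseded by slice_max(), though it still works.

```r
library(dplyr)
library(nycflights13)

# Assumed stand-in for the Subsection 3.7.2 example:
# count the number of flights to each destination
num_flights_by_dest <- flights %>%
  group_by(dest) %>%
  summarize(num_flights = n())

# Keep the rows with the 10 largest values of num_flights
# (ties, if any, would return more than 10 rows)
top_dests <- num_flights_by_dest %>%
  top_n(n = 10, wt = num_flights)

# top_n() does not sort its output, so order the rows explicitly
top_dests_sorted <- top_dests %>%
  arrange(desc(num_flights))

top_dests_sorted
```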
(LC3.16) What are some ways to select all three of the
distance variables from
flights? Give the code showing how to do this in at least three different ways.
(LC3.17) How could one use starts_with(), ends_with(), and contains() to select columns from the flights data frame? Provide three different examples in total: one for starts_with(), one for ends_with(), and one for contains().
(LC3.18) Why might we want to use the select() function on a data frame?
(LC3.19) Create a new data frame that shows the top 5 airports with the largest arrival delays from NYC in 2013.