Chapter 3 Data Wrangling
So far in our journey, we’ve seen how to look at data saved in data frames using the View() function in Chapter 1, and how to create data visualizations using the ggplot2 package in Chapter 2. In particular, we studied what we term the “five named graphs” (5NG):
- scatterplots via geom_point()
- linegraphs via geom_line()
- boxplots via geom_boxplot()
- histograms via geom_histogram()
- barplots via geom_bar()
We created these visualizations using the grammar of graphics, which maps variables in a data frame to the aesthetic attributes of one of the 5 geometric objects. We can also control other aesthetic attributes of the geometric objects, such as the size and color, as seen in the Gapminder data example in Figure 2.1.
Recall, however, that for two of our visualizations, we first needed to transform/modify existing data frames a little. For example, recall the scatterplot in Figure 2.2 of departure and arrival delays only for Alaska Airlines flights. In order to create this visualization, we first needed to pare down the flights data frame to an alaska_flights data frame consisting of only carrier == "AS" flights. Thus, alaska_flights will have fewer rows than flights. We did this using the filter() function.
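As a reminder, that paring-down step might look like the following minimal sketch. It assumes that flights is the data frame shipped in the nycflights13 package and that the dplyr package is installed; neither assumption is spelled out in this section.

```r
library(dplyr)
library(nycflights13)  # assumed source of the flights data frame

# Keep only the rows of flights whose carrier code is "AS" (Alaska Airlines)
alaska_flights <- flights %>%
  filter(carrier == "AS")

# Compare row counts: alaska_flights should have fewer rows than flights
nrow(alaska_flights)
nrow(flights)
```

The columns are untouched here; only the set of rows changes.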
In this chapter, we’ll extend this example and introduce a series of functions from the dplyr package for data wrangling that will allow you to take a data frame and “wrangle” it (transform it) to suit your needs. Such functions include:
- filter() a data frame’s existing rows to only pick out a subset of them. For example, the alaska_flights data frame.
- summarize() one or more of its columns/variables with a summary statistic. Examples of summary statistics include the median and interquartile range of temperatures as we saw in Section 2.7 on boxplots.
- group_by() its rows. In other words, assign different rows to be part of the same group. We can then combine group_by() with summarize() to report summary statistics for each group separately. For example, say you don’t want a single overall average departure delay dep_delay for all three origin airports combined, but rather three separate average departure delays, one computed for each of the three origin airports.
- mutate() its existing columns/variables to create new ones. For example, convert hourly temperature recordings from degrees Fahrenheit to degrees Celsius.
- arrange() its rows. For example, sort the rows of weather in ascending or descending order of one of its variables.
- join() it with another data frame by matching along a “key” variable. In other words, merge these two data frames together.
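To make the verbs above concrete, here is a small sketch that applies each one to toy data. The data frames weather_toy and airports_toy, their values, and all result names are hypothetical stand-ins invented for illustration, not the data used in this book. Also note that dplyr provides a family of join functions rather than a single join(); inner_join() is used here as one representative.

```r
library(dplyr)

# Hypothetical toy data: temperature readings (degrees Fahrenheit) at three airports
weather_toy <- data.frame(
  origin = c("EWR", "EWR", "JFK", "JFK", "LGA", "LGA"),
  hour   = c(1, 2, 1, 2, 1, 2),
  temp   = c(32, 41, 50, 59, 68, 77)
)

# Hypothetical lookup table keyed by airport code
airports_toy <- data.frame(
  faa  = c("EWR", "JFK", "LGA"),
  name = c("Newark", "Kennedy", "LaGuardia")
)

# filter(): keep only the rows for one airport
jfk_only <- weather_toy %>%
  filter(origin == "JFK")

# summarize(): collapse a column into summary statistics
overall <- weather_toy %>%
  summarize(median_temp = median(temp), iqr_temp = IQR(temp))

# group_by() + summarize(): one summary row per group
by_origin <- weather_toy %>%
  group_by(origin) %>%
  summarize(mean_temp = mean(temp))

# mutate(): create a new column from an existing one (Fahrenheit to Celsius)
with_celsius <- weather_toy %>%
  mutate(temp_c = (temp - 32) / 1.8)

# arrange(): sort rows, here in descending order of temp
sorted_desc <- weather_toy %>%
  arrange(desc(temp))

# inner_join(): merge two data frames by matching the key columns origin and faa
joined <- weather_toy %>%
  inner_join(airports_toy, by = c("origin" = "faa"))
```

Each verb takes a data frame and returns a new one, which is why the steps chain together naturally with the pipe operator %>%.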
Notice how we used computer_code font to describe the actions we want to take on our data frames. This is because the dplyr package for data wrangling has intuitively verb-named functions that are easy to remember.
There is a further benefit to learning to use the dplyr package for data wrangling: its similarity to the database querying language SQL (pronounced “sequel” or spelled out as “S”, “Q”, “L”). SQL (which stands for “Structured Query Language”) is used to manage large databases quickly and efficiently and is widely used by many institutions with a lot of data. While SQL is a topic left for a book or a course on database management, keep in mind that once you learn dplyr, you can learn SQL easily. We’ll talk more about their similarities in Subsection 3.7.4.