ModernDive

Chapter 3 Data Wrangling

So far in our journey, we’ve seen how to look at data saved in data frames using the glimpse() and View() functions in Chapter 1, and how to create data visualizations using the ggplot2 package in Chapter 2. In particular, we studied what we term the “five named graphs” (5NG):

  1. scatterplots via geom_point()
  2. linegraphs via geom_line()
  3. boxplots via geom_boxplot()
  4. histograms via geom_histogram()
  5. barplots via geom_bar() or geom_col()

We created these visualizations using the grammar of graphics, which maps variables in a data frame to the aesthetic attributes of one of five geometric objects. We can also control other aesthetic attributes of the geometric objects, such as size and color, as seen in the Gapminder data example in Figure 2.1.

Recall, however, that for two of our visualizations, we first needed to transform existing data frames a little. For example, recall the scatterplot in Figure 2.2 of departure and arrival delays only for Alaska Airlines flights. In order to create this visualization, we first needed to pare down the flights data frame to an alaska_flights data frame consisting of only carrier == "AS" flights. Thus, alaska_flights will have fewer rows than flights. We did this using the filter() function.
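As a reminder, that filtering step can be sketched as follows. This assumes the dplyr and nycflights13 packages are installed, and that flights is the data frame shipped with nycflights13.

```r
library(dplyr)
library(nycflights13)

# Keep only the rows of flights whose carrier code is "AS" (Alaska Airlines)
alaska_flights <- flights %>%
  filter(carrier == "AS")

# The filtered data frame has the same columns but fewer rows
nrow(flights)
nrow(alaska_flights)
```

Note the double equals sign in carrier == "AS": it tests for equality, whereas a single = would be read as an argument assignment.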

In this chapter, we’ll extend this example and we’ll introduce a series of functions from the dplyr package for data wrangling that will allow you to take a data frame and “wrangle” it (transform it) to suit your needs. Such functions include:

  1. filter() a data frame’s existing rows to only pick out a subset of them. For example, the alaska_flights data frame contains only the rows of flights where carrier == "AS".
  2. summarize() one or more of its columns/variables with a summary statistic. Examples of summary statistics include the median and interquartile range of temperatures as we saw in Section 2.7 on boxplots.
  3. group_by() its rows. In other words, assign different rows to be part of the same group. We can then combine group_by() with summarize() to report summary statistics for each group separately. For example, say you don’t want a single overall average departure delay dep_delay for all three origin airports combined, but rather three separate average departure delays, one computed for each of the three origin airports.
  4. mutate() its existing columns/variables to create new ones. For example, convert hourly temperature recordings from degrees Fahrenheit to degrees Celsius.
  5. arrange() its rows. For example, sort the rows of weather in ascending or descending order of temp.
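  6. join() it with another data frame by matching along a “key” variable. In other words, merge these two data frames together. Note that “join” here names a family of dplyr functions, such as inner_join() and left_join(), rather than a single function called join().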
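Verbs 2 through 6 above can be previewed on the nycflights13 data in a few lines each. This is only a minimal sketch, assuming dplyr and nycflights13 are installed. The result names (delay_by_origin, weather_c, flights_named, and so on) and the new column name temp_in_C are illustrative choices, and inner_join() is used here as one representative of the join family.

```r
library(dplyr)
library(nycflights13)

# 2. summarize(): a single overall mean departure delay.
#    na.rm = TRUE drops missing values before averaging.
overall_delay <- flights %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))

# 3. group_by() + summarize(): one mean departure delay per origin airport
delay_by_origin <- flights %>%
  group_by(origin) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))

# 4. mutate(): add a Celsius column computed from the Fahrenheit temp column
weather_c <- weather %>%
  mutate(temp_in_C = (temp - 32) / 1.8)

# 5. arrange(): sort rows by temp, ascending by default, descending via desc()
coldest_first <- weather_c %>% arrange(temp)
hottest_first <- weather_c %>% arrange(desc(temp))

# 6. join: attach the full airline name by matching on the key variable carrier
flights_named <- flights %>%
  inner_join(airlines, by = "carrier")
```

Each verb takes a data frame as input and returns a new data frame, which is why the steps chain naturally with the pipe operator %>%.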

Notice how we used computer_code font to describe the actions we want to take on our data frames. This is because the functions in the dplyr package are named after intuitive verbs, which makes them easy to remember.

There is a further benefit to learning to use the dplyr package for data wrangling: its similarity to the database querying language SQL (pronounced “sequel” or spelled out as “S”, “Q”, “L”). SQL (which stands for “Structured Query Language”) is used to manage large databases quickly and efficiently and is widely used by institutions with a lot of data. While SQL is a topic left for a book or a course on database management, keep in mind that once you learn dplyr, SQL becomes much easier to pick up, because many of the same ideas carry over. We’ll talk more about their similarities in Subsection 3.7.4.

Needed packages

Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section 1.3 for information on how to install and load R packages.
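The package listing itself is not reproduced here. Judging from the functions and data frames mentioned in this chapter introduction, a plausible set is dplyr for the wrangling verbs, ggplot2 for plotting, and nycflights13 for the flights and weather data frames. Treat this list as an assumption rather than the chapter's definitive one.

```r
# Assumed package set, inferred from the functions and data named above
library(dplyr)        # filter(), summarize(), group_by(), mutate(), arrange(), joins
library(ggplot2)      # geom_point() and the other "five named graphs"
library(nycflights13) # flights, weather, and related data frames
```

If any of these calls fails with an error that the package is not found, install it first with install.packages() and then load it again.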