ModernDive

3.3 summarize variables

The next common task when working with data frames is to compute summary statistics. Summary statistics are single numerical values that summarize a large number of values. Commonly known examples include the mean (also called the average) and the median (the middle value). Other examples that might not immediately come to mind include the sum, the smallest value (the minimum), the largest value (the maximum), and the standard deviation. See Appendix A.1 for a glossary of such summary statistics.

Let’s calculate two summary statistics of the temp temperature variable in the weather data frame: the mean and standard deviation (recall from Section 1.4 that the weather data frame is included in the nycflights13 package). To compute these summary statistics, we need the mean() and sd() summary functions in R. Summary functions in R take in many values and return a single value, as illustrated in Figure 3.2.

FIGURE 3.2: Diagram illustrating a summary function in R.
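
As a minimal illustration of the many-values-in, one-value-out idea, here is a sketch using a small made-up vector. The name temps and its values are hypothetical and are not taken from the weather data frame.

```r
# Hypothetical vector of three temperature readings (not from weather)
temps <- c(39, 41, 43)

# Each summary function takes all three values and returns a single number
mean_temp <- mean(temps)  # 41
sd_temp <- sd(temps)      # 2

# The result of a summary function has length one
length(mean_temp)
```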

More precisely, we’ll use the mean() and sd() summary functions within the summarize() function from the dplyr package. Note that you can also use the British English spelling, summarise(). As shown in Figure 3.3, the summarize() function takes in a data frame and returns a data frame with only one row, which holds the summary statistics.

FIGURE 3.3: Diagram of summarize() rows.

We’ll save the results in a new data frame called summary_temp that will have two columns/variables, mean and std_dev:

# A tibble: 1 x 2
   mean std_dev
  <dbl>   <dbl>
1    NA      NA
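
A sketch of code that would produce the output above, assuming the dplyr and nycflights13 packages are installed. It follows the pipeline shape quoted in LC3.4, with the column names mean and std_dev taken from the printed tibble.

```r
library(dplyr)
library(nycflights13)

# Summarize temp without any handling of missing values
summary_temp <- weather %>%
  summarize(mean = mean(temp), std_dev = sd(temp))
summary_temp
```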

Why are the values returned NA? As we saw in Subsection 2.3.1 when creating the scatterplot of departure and arrival delays for alaska_flights, NA is how R encodes missing values, where NA stands for “not available” or “not applicable.” If a value for a particular row and a particular column does not exist, NA is stored instead. Values can be missing for many reasons. Perhaps the data was collected but someone forgot to enter it? Perhaps the data was not collected at all because it was too difficult to do so? Perhaps someone entered an erroneous value that was later corrected to read as missing? You’ll often encounter issues with missing values when working with real data.

Going back to our summary_temp output: by default, R returns NA any time you try to calculate a summary statistic of a variable that has one or more NA missing values. To work around this, you can set the na.rm argument to TRUE, where rm is short for “remove”; this will ignore any NA missing values and return the summary value computed from the non-missing values only.
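
The effect of na.rm can be seen on a small made-up vector. This is only a sketch; the vector x is hypothetical.

```r
# Hypothetical vector with one missing value
x <- c(1, NA, 3)

# Default behavior (na.rm = FALSE): the missing value propagates to the result
default_mean <- mean(x)

# With na.rm = TRUE the NA is dropped, leaving the average of 1 and 3
cleaned_mean <- mean(x, na.rm = TRUE)

default_mean  # NA
cleaned_mean  # 2
```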

The code that follows computes the mean and standard deviation of all non-missing values of temp:

# A tibble: 1 x 2
   mean std_dev
  <dbl>   <dbl>
1  55.3    17.8
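
A sketch of code that would produce the output above, again assuming the dplyr and nycflights13 packages are installed. The column names mean and std_dev match the printed tibble.

```r
library(dplyr)
library(nycflights13)

# Summarize temp, dropping missing values inside each summary function
summary_temp <- weather %>%
  summarize(mean = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))
summary_temp
```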

Notice how na.rm = TRUE is passed as an argument to the mean() and sd() summary functions individually, and not to the summarize() function.
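
To see why the placement matters, consider this sketch on a hypothetical three-row data frame (toy_weather is made up for illustration). Because summarize() treats each named argument as a new column, a misplaced na.rm = TRUE becomes a column called na.rm rather than an instruction to drop missing values.

```r
library(dplyr)

# Hypothetical toy data frame with one missing temperature
toy_weather <- tibble(temp = c(39, NA, 43))

# na.rm given to summarize(): it becomes a new column, and mean() still sees the NA
misplaced <- toy_weather %>%
  summarize(mean = mean(temp), na.rm = TRUE)

# na.rm given to mean(): the NA is dropped before averaging
correct <- toy_weather %>%
  summarize(mean = mean(temp, na.rm = TRUE))

misplaced
correct
```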

However, one needs to be cautious whenever ignoring missing values as we’ve just done. In the upcoming Learning check questions, we’ll consider the possible ramifications of blindly sweeping rows with missing values “under the rug.” This is in fact why the na.rm argument to any summary statistic function in R is set to FALSE by default. In other words, R does not ignore rows with missing values by default. R is alerting you to the presence of missing data, and you should be mindful of this missingness and any potential causes of it throughout your analysis.
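
One way to stay mindful of missingness is to count the missing values before dropping them. The following is a sketch, assuming dplyr and nycflights13 are installed; the column names n_missing and prop_missing are illustrative choices.

```r
library(dplyr)
library(nycflights13)

# is.na() returns TRUE/FALSE per row; sum() counts the TRUEs, mean() gives their share
missing_check <- weather %>%
  summarize(n_missing = sum(is.na(temp)),
            prop_missing = mean(is.na(temp)))
missing_check
```

If only a small share of values is missing, dropping them is less likely to distort the summary, though the reason they are missing still matters.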

What are other summary functions we can use inside the summarize() verb to compute summary statistics? As seen in the diagram in Figure 3.2, you can use any function in R that takes many values and returns just one. Here are just a few:

  • mean(): the average
  • sd(): the standard deviation, which is a measure of spread
  • min() and max(): the minimum and maximum values, respectively
  • IQR(): interquartile range
  • sum(): the total amount when adding multiple numbers
  • n(): a count of the number of rows in each group. This particular summary function will make more sense when group_by() is covered in Section 3.4.
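
Several of these summary functions can be combined in a single summarize() call. The sketch below assumes dplyr and nycflights13 are installed; the output column names (min_temp, max_temp, iqr_temp, total_precip) are illustrative choices, and precip is the precipitation column of weather.

```r
library(dplyr)
library(nycflights13)

# Each argument creates one column in the one-row result
temp_summaries <- weather %>%
  summarize(
    min_temp = min(temp, na.rm = TRUE),
    max_temp = max(temp, na.rm = TRUE),
    iqr_temp = IQR(temp, na.rm = TRUE),
    total_precip = sum(precip, na.rm = TRUE)
  )
temp_summaries
```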

Learning check

(LC3.2) Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five-year intervals. She notices that a large number of patients have missing data points because they have died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor’s approach?

(LC3.3) Modify the earlier summarize() function code that creates the summary_temp data frame to also use the n() summary function: summarize(... , count = n()). What does the returned value correspond to?

(LC3.4) Why doesn’t the following code work? Run the code line-by-line instead of all at once, and then look at the data. In other words, run summary_temp <- weather %>% summarize(mean = mean(temp, na.rm = TRUE)) first.