The next common task when working with data frames is to compute summary statistics. Summary statistics are single numerical values that summarize a large number of values. Commonly known examples of summary statistics include the mean (also called the average) and the median (the middle value). Other examples that might not immediately come to mind include the sum, the smallest value (also called the minimum), the largest value (also called the maximum), and the standard deviation. See Appendix A.1 for a glossary of such summary statistics.
Let’s calculate two summary statistics of the temp temperature variable in the weather data frame: the mean and standard deviation (recall from Section 1.4 that the weather data frame is included in the nycflights13 package). To compute these summary statistics, we need the mean() and sd() summary functions in R. Summary functions in R take in many values and return a single value, as illustrated in Figure 3.2.
More precisely, we’ll use the mean() and sd() summary functions within the summarize() function from the dplyr package. Note that you can also use the British English spelling, summarise(). As shown in Figure 3.3, the summarize() function takes in a data frame and returns a data frame with only one row corresponding to the summary statistics.
We’ll save the results in a new data frame called summary_temp that will have two columns/variables: mean and std_dev.
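A minimal sketch of how this call might look is given here. It assumes the dplyr and nycflights13 packages are installed; the column names mean and std_dev are taken from the printed output that follows.

```r
library(dplyr)         # provides summarize() and the %>% pipe
library(nycflights13)  # provides the weather data frame

# Compute the mean and standard deviation of temp in one summarize() call.
# No na.rm argument is supplied, so the default (na.rm = FALSE) applies.
summary_temp <- weather %>%
  summarize(mean = mean(temp), std_dev = sd(temp))

# Print the resulting one-row data frame
summary_temp
```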
# A tibble: 1 x 2
   mean std_dev
  <dbl>   <dbl>
1    NA      NA
Why are the values returned NA? As we saw in Subsection 2.3.1 when creating the scatterplot of departure and arrival delays, NA is how R encodes missing values, where NA indicates “not available” or “not applicable.” If a value for a particular row and a particular column does not exist, NA is stored instead. Values can be missing for many reasons. Perhaps the data was collected but someone forgot to enter it? Perhaps the data was not collected at all because it was too difficult to do so? Perhaps there was an erroneous value that someone entered that has been corrected to read as missing? You’ll often encounter issues with missing values when working with real data.
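As a small illustration of how missing values can be detected, consider the sketch below. The vector temps and its values are made up for this example and are not taken from the weather data.

```r
# A hypothetical vector of temperature readings with one missing value
temps <- c(39.0, 41.5, NA, 44.1)

# is.na() returns TRUE for each entry that is missing
missing_flags <- is.na(temps)
missing_flags

# Summing the logical flags counts the missing entries
n_missing <- sum(missing_flags)
n_missing
```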
Going back to our summary_temp output: by default, any time you try to calculate a summary statistic of a variable that has one or more NA missing values in R, NA is returned. To work around this, you can set the na.rm argument to TRUE, where rm is short for “remove”; this will ignore any NA missing values and only return the summary value for all non-missing values.
The code that follows computes the mean and standard deviation of all non-missing values of temp:
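A sketch of such a call, again assuming the dplyr and nycflights13 packages are installed, might look like this. It follows the same pattern as the summarize() code quoted in Learning check LC3.4.

```r
library(dplyr)         # provides summarize() and the %>% pipe
library(nycflights13)  # provides the weather data frame

# na.rm = TRUE is passed to each summary function separately,
# so missing temp values are dropped before each statistic is computed.
summary_temp <- weather %>%
  summarize(mean = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))

# Print the resulting one-row data frame
summary_temp
```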
# A tibble: 1 x 2
   mean std_dev
  <dbl>   <dbl>
1  55.3    17.8
Notice how na.rm = TRUE is passed as an argument to the mean() and sd() summary functions individually, and not to the summarize() function.
However, one needs to be cautious whenever ignoring missing values as we’ve just done. In the upcoming Learning checks, we’ll consider the possible ramifications of blindly sweeping rows with missing values “under the rug.” This is in fact why the na.rm argument to any summary statistic function in R is set to FALSE by default. In other words, R does not ignore rows with missing values by default. R is alerting you to the presence of missing data, and you should be mindful of this missingness and any potential causes of it throughout your analysis.
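One way to stay mindful of missingness is to count the missing values before removing them. The sketch below does this on a small made-up data frame; toy_weather and its values are hypothetical stand-ins, not the nycflights13 data.

```r
library(dplyr)

# Hypothetical data: five temperature readings, one of them missing
toy_weather <- data.frame(temp = c(40, 50, 60, NA, 70))

missing_report <- toy_weather %>%
  summarize(
    n_missing = sum(is.na(temp)),        # number of rows where temp is NA
    mean      = mean(temp, na.rm = TRUE) # mean of the non-missing values only
  )
missing_report
```

Reporting the count of missing values next to the summary makes it easier to judge how much data the statistic actually rests on.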
What are other summary functions we can use inside the summarize() verb to compute summary statistics? As seen in the diagram in Figure 3.2, you can use any function in R that takes many values and returns just one. Here are just a few:
mean(): the average
sd(): the standard deviation, which is a measure of spread
min() and max(): the minimum and maximum values, respectively
IQR(): interquartile range
sum(): the total amount when adding multiple numbers
n(): a count of the number of rows in each group. This particular summary function will make more sense when group_by() is covered in Section 3.4.
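To see several of these functions working together, here is a sketch that applies them in a single summarize() call. The data frame toy_weather, its values, and the output column names are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical data: five temperature readings, one of them missing
toy_weather <- data.frame(temp = c(40, 50, 60, NA, 70))

toy_summary <- toy_weather %>%
  summarize(
    mean    = mean(temp, na.rm = TRUE),  # average of non-missing values
    std_dev = sd(temp, na.rm = TRUE),    # spread of non-missing values
    min     = min(temp, na.rm = TRUE),   # smallest non-missing value
    max     = max(temp, na.rm = TRUE),   # largest non-missing value
    iqr     = IQR(temp, na.rm = TRUE),   # interquartile range
    total   = sum(temp, na.rm = TRUE),   # sum of non-missing values
    count   = n()                        # number of rows, including the NA row
  )
toy_summary
```

Note that n() takes no arguments and counts rows, so it includes the row where temp is missing.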
(LC3.2) Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five-year intervals. She notices that a large number of patients have missing data points because the patient has died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor’s approach?
(LC3.3) Modify the earlier summarize() function code that creates the summary_temp data frame to also use the n() summary function: summarize(... , count = n()). What does the returned value correspond to?
(LC3.4) Why doesn’t the following code work? Run the code line-by-line instead of all at once, and then look at the data. In other words, run
summary_temp <- weather %>% summarize(mean = mean(temp, na.rm = TRUE)) first.