## 3.3 `summarize` variables

The next common task when working with data frames is to compute *summary statistics*. Summary statistics are single numerical values that summarize a large number of values. Commonly known examples of summary statistics include the mean (also called the average) and the median (the middle value). Other examples of summary statistics that might not immediately come to mind include the *sum*; the smallest value, also called the *minimum*; the largest value, also called the *maximum*; and the *standard deviation*. See Appendix A.1 for a glossary of such summary statistics.

Let’s calculate two summary statistics of the `temp` temperature variable in the `weather` data frame: the mean and standard deviation (recall from Section 1.4 that the `weather` data frame is included in the `nycflights13` package). To compute these summary statistics, we need the `mean()` and `sd()` *summary functions* in R. Summary functions in R take in many values and return a single value, as illustrated in Figure 3.2.
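As a minimal sketch of this "many values in, one value out" idea, the base R snippet below applies `mean()` and `sd()` to a small made-up vector. The name `temps` and its values are illustrative assumptions and are not taken from `weather`.

```r
# Hypothetical temperature readings (illustrative values, not from `weather`)
temps <- c(40, 50, 60)

# Each summary function collapses the whole vector into a single number
temps_mean <- mean(temps)  # (40 + 50 + 60) / 3 = 50
temps_sd   <- sd(temps)    # sqrt((100 + 0 + 100) / 2) = 10

temps_mean
temps_sd
```

Both results are vectors of length one, which is what lets them fill a single cell of a summary data frame.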

More precisely, we’ll use the `mean()` and `sd()` summary functions within the `summarize()` function from the `dplyr` package. Note that you can also use the British English spelling, `summarise()`. As shown in Figure 3.3, the `summarize()` function takes in a data frame and returns a data frame with only one row corresponding to the summary statistics.

We’ll save the results in a new data frame called `summary_temp` that will have two columns/variables: the `mean` and the `std_dev`:
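The call itself might look like the following sketch, which assumes the `dplyr` and `nycflights13` packages are installed. It is a reconstruction rather than a listing given here: both summary functions are applied to `temp` with their default arguments, which is consistent with the one-row result printed next.

```r
library(dplyr)
library(nycflights13)

# Reconstructed sketch: mean() and sd() are called with their defaults,
# so na.rm is left at FALSE
summary_temp <- weather %>%
  summarize(mean = mean(temp), std_dev = sd(temp))
summary_temp
```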

```
# A tibble: 1 x 2
   mean std_dev
  <dbl>   <dbl>
1    NA      NA
```

Why are the values returned `NA`? As we saw in Subsection 2.3.1 when creating the scatterplot of departure and arrival delays for `alaska_flights`, `NA` is how R encodes *missing values*, where `NA` indicates “not available” or “not applicable.” If a value for a particular row and a particular column does not exist, `NA` is stored instead. Values can be missing for many reasons. Perhaps the data was collected but someone forgot to enter it? Perhaps the data was not collected at all because it was too difficult to do so? Perhaps there was an erroneous value that someone entered that has been corrected to read as missing? You’ll often encounter issues with missing values when working with real data.

Going back to our `summary_temp` output, by default any time you try to calculate a summary statistic of a variable that has one or more `NA` missing values in R, `NA` is returned. To work around this fact, you can set the `na.rm` argument to `TRUE`, where `rm` is short for “remove”; this will ignore any `NA` missing values and only return the summary value for all non-missing values.
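To see the effect of `na.rm` in isolation, here is a minimal base R sketch on a made-up vector; the name `x` and its values are assumptions chosen for illustration.

```r
# Hypothetical vector with one missing value
x <- c(10, NA, 30)

default_mean <- mean(x)                # NA: the missing value propagates
clean_mean   <- mean(x, na.rm = TRUE)  # 20: the average of 10 and 30 only

default_mean
clean_mean
```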

The code that follows computes the mean and standard deviation of all non-missing values of `temp`:

```
summary_temp <- weather %>%
  summarize(mean = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))
summary_temp
```

```
# A tibble: 1 x 2
   mean std_dev
  <dbl>   <dbl>
1  55.3    17.8
```

Notice how `na.rm = TRUE` is passed as an argument to the `mean()` and `sd()` summary functions individually, and not to the `summarize()` function.
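To illustrate why the placement matters, the sketch below uses a made-up data frame called `toy` (a hypothetical name) and assumes `dplyr` is installed. When `na.rm = TRUE` is handed to `summarize()`, it is treated as one more summary column to create rather than as an instruction to drop missing values.

```r
library(dplyr)

# Hypothetical data frame with one missing value
toy <- data.frame(temp = c(10, NA, 30))

# Misplaced: na.rm = TRUE is read as a request for a new column named `na.rm`
misplaced <- toy %>%
  summarize(mean = mean(temp), na.rm = TRUE)

# Intended: na.rm = TRUE is an argument of mean() itself
intended <- toy %>%
  summarize(mean = mean(temp, na.rm = TRUE))

misplaced
intended
```

In the misplaced version the `mean` column is still `NA`, and an extra column named `na.rm` holding `TRUE` appears in the result.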

However, one needs to be cautious whenever ignoring missing values as we’ve just done. In the upcoming *Learning checks* questions, we’ll consider the possible ramifications of blindly sweeping rows with missing values “under the rug.” This is in fact why the `na.rm` argument to any summary statistic function in R is set to `FALSE` by default. In other words, R does not ignore rows with missing values by default. R is alerting you to the presence of missing data and you should be mindful of this missingness and any potential causes of this missingness throughout your analysis.
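One way to stay mindful of missingness is to measure it before removing it. The sketch below assumes the `dplyr` and `nycflights13` packages are installed; the column names `n_missing` and `prop_missing` are our own choices. It relies on `is.na()` returning `TRUE`/`FALSE` values, which `sum()` and `mean()` treat as 1 and 0.

```r
library(dplyr)
library(nycflights13)

# Count and proportion of missing temperature readings in `weather`
missing_temp <- weather %>%
  summarize(n_missing = sum(is.na(temp)),
            prop_missing = mean(is.na(temp)))
missing_temp
```

A small proportion of missing values is usually less worrying than a large one, but the reasons behind the missingness matter as much as the count.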

What are other summary functions we can use inside the `summarize()` verb to compute summary statistics? As seen in the diagram in Figure 3.2, you can use any function in R that takes many values and returns just one. Here are just a few:

- `mean()`: the average
- `sd()`: the standard deviation, which is a measure of spread
- `min()` and `max()`: the minimum and maximum values, respectively
- `IQR()`: interquartile range
- `sum()`: the total amount when adding multiple numbers
- `n()`: a count of the number of rows in each group. This particular summary function will make more sense when `group_by()` is covered in Section 3.4.
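Several of these functions can be combined in a single `summarize()` call. The following sketch assumes `dplyr` is installed and uses a made-up data frame `toy`; the data and the output column names are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical data frame (illustrative values, not from `weather`)
toy <- data.frame(temp = c(30, 40, 50, 60))

# One row out, with one column per summary statistic
toy_summary <- toy %>%
  summarize(mean_temp  = mean(temp),
            min_temp   = min(temp),
            max_temp   = max(temp),
            iqr_temp   = IQR(temp),
            total_temp = sum(temp),
            count      = n())
toy_summary
```

Note that `n()` takes no arguments and only works inside `dplyr` verbs such as `summarize()`.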

*Learning check*

**(LC3.2)** Say a doctor is studying the effect of smoking on lung cancer for a large number of patients who have records measured at five-year intervals. She notices that a large number of patients have missing data points because the patient has died, so she chooses to ignore these patients in her analysis. What is wrong with this doctor’s approach?

**(LC3.3)** Modify the earlier `summarize()` function code that creates the `summary_temp` data frame to also use the `n()` summary function: `summarize(... , count = n())`. What does the returned value correspond to?

**(LC3.4)** Why doesn’t the following code work? Run the code line-by-line instead of all at once, and then look at the data. In other words, run `summary_temp <- weather %>% summarize(mean = mean(temp, na.rm = TRUE))` first.

```
summary_temp <- weather %>%
  summarize(mean = mean(temp, na.rm = TRUE)) %>%
  summarize(std_dev = sd(temp, na.rm = TRUE))
```