ModernDive

C.1 Data wrangling

In this Section, we address some of the most common data wrangling questions we’ve encountered in student projects (shout out to Dr. Jenny Smetzer for her work setting this up!):

  • C.1.1: Dealing with missing values
  • C.1.2: Reordering bars in a barplot
  • C.1.3: Showing money on an axis
  • C.1.4: Changing values inside cells
  • C.1.5: Converting a numerical variable to a categorical one
  • C.1.6: Computing proportions
  • C.1.7: Dealing with %, commas, and $

Let’s load an example movies dataset, pare down the rows and columns a bit, and then show the first 10 rows using slice().

# A tibble: 10 x 5
   name               score rating type        millions
   <chr>              <dbl> <chr>  <chr>          <dbl>
 1 2 Fast 2 Furious 48.9000 PG-13  action      NA      
 2 A Guy Thing      39.5    PG-13  rom comedy  15.545  
 3 A Man Apart      42.9000 R      action      26.2480 
 4 A Mighty Wind    79.9000 PG-13  comedy      17.781  
 5 Agent Cody Banks 57.9000 PG     action      47.8110 
 6 Alex & Emma      35.1000 PG-13  rom comedy  14.219  
 7 American Wedding 50.7000 R      comedy     104.441  
 8 Anger Management 62.6000 PG-13  comedy     134.404  
 9 Anything Else    63.3000 R      rom comedy   3.21200
10 Bad Boys II      38.1000 R      action     138.397  

C.1.1 Dealing with missing values

You see the revenue in millions value for the movie “2 Fast 2 Furious” is NA (missing). So the following occurs when computing the median revenue:

# A tibble: 1 x 1
  mean_profit
        <dbl>
1          NA

You should always think about why a data value might be missing and what that missingness may mean. For example, imagine you are conducting a study on the effects of smoking on lung cancer and a lot of your patients’ data is missing because they died of lung cancer. If you just “sweep these patients under the rug” and ignore them, you are clearly biasing the results.

While there are statistical methods to deal with missing data they are beyond the reach of this class. The easiest thing to do is to remove all missing cases, but you should always at the very least report to the reader if you do so, as by removing the missing values you may be biasing your results.

You can do this with a na.rm = TRUE argument like so:

# A tibble: 1 x 1
  mean_profit
        <dbl>
1     43.4270

If you decide you want to remove the row with the missing data, you can use the filter function like so:

# A tibble: 10 x 5
   name               score rating type        millions
   <chr>              <dbl> <chr>  <chr>          <dbl>
 1 A Guy Thing      39.5    PG-13  rom comedy  15.545  
 2 A Man Apart      42.9000 R      action      26.2480 
 3 A Mighty Wind    79.9000 PG-13  comedy      17.781  
 4 Agent Cody Banks 57.9000 PG     action      47.8110 
 5 Alex & Emma      35.1000 PG-13  rom comedy  14.219  
 6 American Wedding 50.7000 R      comedy     104.441  
 7 Anger Management 62.6000 PG-13  comedy     134.404  
 8 Anything Else    63.3000 R      rom comedy   3.21200
 9 Bad Boys II      38.1000 R      action     138.397  
10 Bad Santa        75.8000 R      comedy      59.5230 

We see “2 Fast 2 Furious” is now gone.

C.1.2 Reordering bars in a barplot

Let’s compute the total revenue for each movie type and plot a barplot.

# A tibble: 6 x 2
  type       total_revenue
  <chr>              <dbl>
1 action            NA    
2 animated         561.306
3 comedy          2286.81 
4 drama            840.038
5 fantasy          508.580
6 rom comedy       492.282

Say we want to reorder the categorical variable type so that the bars show in a different order. We can reorder the bars by manually defining the order of the levels in the factor() command:

Or if you want to reorder type in ascending order of total_revenue, we use reorder()

Or if you want to reorder type in descending order of total_revenue, just put a - sign in front of -total_revenue in reorder():

For more advanced categorical variable (i.e. factor) manipulations, check out the forcats package. Note: forcats is an anagram of factors

C.1.4 Changing values inside cells

The rename() function in the dplyr package renames column/variable names. To “rename” values inside cells of a particular column, you need to mutate() the column using one of the three functions below. There might be other ones too, but these are the three we’ve seen the most. In these examples, we’ll change values in the variable type.

  1. if_else()
  2. recode()
  3. case_when()

if_else()

Switch all instances of rom comedy with romantic comedy using if_else() from the dplyr package. If a particular row has type == "rom comedy", then return "romantic comedy", else return whatever was originally in type. Save everything in a new variable type_new:

# A tibble: 10 x 7
   name               score rating type       millions    revenue type_new      
   <chr>              <dbl> <chr>  <chr>         <dbl>      <dbl> <chr>         
 1 2 Fast 2 Furious 48.9000 PG-13  action     NA              NA  action        
 2 A Guy Thing      39.5    PG-13  rom come…  15.545    15545000  romantic come…
 3 A Man Apart      42.9000 R      action     26.2480   26247999  action        
 4 A Mighty Wind    79.9000 PG-13  comedy     17.781    17781000  comedy        
 5 Agent Cody Banks 57.9000 PG     action     47.8110   47811001  action        
 6 Alex & Emma      35.1000 PG-13  rom come…  14.219    14219000  romantic come…
 7 American Wedding 50.7000 R      comedy    104.441   104441000  comedy        
 8 Anger Management 62.6000 PG-13  comedy    134.404   134404010  comedy        
 9 Anything Else    63.3000 R      rom come…   3.21200   3212000. romantic come…
10 Bad Boys II      38.1000 R      action    138.397   138397000  action        

Do the same here, but return "not romantic comedy" if type is not "rom comedy" and this time overwrite the original type variable

# A tibble: 10 x 6
   name               score rating type                 millions    revenue
   <chr>              <dbl> <chr>  <chr>                   <dbl>      <dbl>
 1 2 Fast 2 Furious 48.9000 PG-13  not romantic comedy  NA              NA 
 2 A Guy Thing      39.5    PG-13  romantic comedy      15.545    15545000 
 3 A Man Apart      42.9000 R      not romantic comedy  26.2480   26247999 
 4 A Mighty Wind    79.9000 PG-13  not romantic comedy  17.781    17781000 
 5 Agent Cody Banks 57.9000 PG     not romantic comedy  47.8110   47811001 
 6 Alex & Emma      35.1000 PG-13  romantic comedy      14.219    14219000 
 7 American Wedding 50.7000 R      not romantic comedy 104.441   104441000 
 8 Anger Management 62.6000 PG-13  not romantic comedy 134.404   134404010 
 9 Anything Else    63.3000 R      romantic comedy       3.21200   3212000.
10 Bad Boys II      38.1000 R      not romantic comedy 138.397   138397000 

recode()

if_else() is rather limited however. What if we want to “rename” all type so that they start with uppercase? Use recode():

# A tibble: 10 x 7
   name               score rating type       millions    revenue type_new      
   <chr>              <dbl> <chr>  <chr>         <dbl>      <dbl> <chr>         
 1 2 Fast 2 Furious 48.9000 PG-13  action     NA              NA  Action        
 2 A Guy Thing      39.5    PG-13  rom come…  15.545    15545000  Romantic Come…
 3 A Man Apart      42.9000 R      action     26.2480   26247999  Action        
 4 A Mighty Wind    79.9000 PG-13  comedy     17.781    17781000  Comedy        
 5 Agent Cody Banks 57.9000 PG     action     47.8110   47811001  Action        
 6 Alex & Emma      35.1000 PG-13  rom come…  14.219    14219000  Romantic Come…
 7 American Wedding 50.7000 R      comedy    104.441   104441000  Comedy        
 8 Anger Management 62.6000 PG-13  comedy    134.404   134404010  Comedy        
 9 Anything Else    63.3000 R      rom come…   3.21200   3212000. Romantic Come…
10 Bad Boys II      38.1000 R      action    138.397   138397000  Action        

case_when()

case_when() is a little trickier, but allows you to evaluate boolean operations using ==, >, >=, &, |, etc:

# A tibble: 108 x 7
   name             score rating type       millions    revenue type_new        
   <chr>            <dbl> <chr>  <chr>         <dbl>      <dbl> <chr>           
 1 2 Fast 2 Furi… 48.9000 PG-13  action     NA              NA  Rest            
 2 A Guy Thing    39.5    PG-13  rom come…  15.545    15545000  Small budget ro…
 3 A Man Apart    42.9000 R      action     26.2480   26247999  Rest            
 4 A Mighty Wind  79.9000 PG-13  comedy     17.781    17781000  Rest            
 5 Agent Cody Ba… 57.9000 PG     action     47.8110   47811001  Big budget acti…
 6 Alex & Emma    35.1000 PG-13  rom come…  14.219    14219000  Small budget ro…
 7 American Wedd… 50.7000 R      comedy    104.441   104441000  Rest            
 8 Anger Managem… 62.6000 PG-13  comedy    134.404   134404010  Rest            
 9 Anything Else  63.3000 R      rom come…   3.21200   3212000. Small budget ro…
10 Bad Boys II    38.1000 R      action    138.397   138397000  Big budget acti…
# … with 98 more rows

C.1.5 Converting a numerical variable to a categorical one

Sometimes we want to turn a numerical, continuous variable into a categorical variable. For instance, what if we wanted to have a variable that tells us if a movie made one hundred million dollars or more. That is to say, we can create a binary variable, which is the same thing as a categorical variable with 2 levels. We can again use the mutate() function:

# A tibble: 10 x 7
   name               score rating type        millions    revenue big_budget
   <chr>              <dbl> <chr>  <chr>          <dbl>      <dbl> <lgl>     
 1 2 Fast 2 Furious 48.9000 PG-13  action      NA              NA  NA        
 2 A Guy Thing      39.5    PG-13  rom comedy  15.545    15545000  FALSE     
 3 A Man Apart      42.9000 R      action      26.2480   26247999  FALSE     
 4 A Mighty Wind    79.9000 PG-13  comedy      17.781    17781000  FALSE     
 5 Agent Cody Banks 57.9000 PG     action      47.8110   47811001  FALSE     
 6 Alex & Emma      35.1000 PG-13  rom comedy  14.219    14219000  FALSE     
 7 American Wedding 50.7000 R      comedy     104.441   104441000  TRUE      
 8 Anger Management 62.6000 PG-13  comedy     134.404   134404010  TRUE      
 9 Anything Else    63.3000 R      rom comedy   3.21200   3212000. FALSE     
10 Bad Boys II      38.1000 R      action     138.397   138397000  TRUE      

What if you want to convert a numerical variable into a categorical variable with more than 2 levels? One way is to use the cut() command. For instance, below, we cut() the score variable, to recode it into 4 categories:

  1. 0 - 40 = bad
  2. 40.1 - 60 = so-so
  3. 60.1 - 80 = good
  4. 80.1+ = great

We set the breaking points for cutting the numerical variable with the c(0, 40, 60, 80, 100) part, and set the labels for each of these bins with the labels = c("bad", "so-so", "good", "great") part. All this action happens inside the mutate() command, so the new categorical variable score_categ is added to the data frame.

# A tibble: 10 x 7
   name               score rating type        millions    revenue score_categ
   <chr>              <dbl> <chr>  <chr>          <dbl>      <dbl> <fct>      
 1 2 Fast 2 Furious 48.9000 PG-13  action      NA              NA  so-so      
 2 A Guy Thing      39.5    PG-13  rom comedy  15.545    15545000  bad        
 3 A Man Apart      42.9000 R      action      26.2480   26247999  so-so      
 4 A Mighty Wind    79.9000 PG-13  comedy      17.781    17781000  good       
 5 Agent Cody Banks 57.9000 PG     action      47.8110   47811001  so-so      
 6 Alex & Emma      35.1000 PG-13  rom comedy  14.219    14219000  bad        
 7 American Wedding 50.7000 R      comedy     104.441   104441000  so-so      
 8 Anger Management 62.6000 PG-13  comedy     134.404   134404010  good       
 9 Anything Else    63.3000 R      rom comedy   3.21200   3212000. good       
10 Bad Boys II      38.1000 R      action     138.397   138397000  bad        

Other options with the cut function:

  • By default, if the value is exactly the upper bound of an interval, it’s included in the lessor category (e.g. 60.0 is ‘so-so’ not ‘good’), to flip this, include the argument right = FALSE.
  • You could also have R equally divide the variable into a balanced number of groups. For example, specifying breaks = 3 would create 3 groups with approximately the same number of values in each group.

C.1.6 Computing proportions

By using a group_by() followed not by a summarize() as is often the case, but rather a mutate(). So say we compute the total revenue millions for each movie rating and type:

# A tibble: 15 x 3
# Groups:   rating [4]
   rating type        millions
   <chr>  <chr>          <dbl>
 1 G      animated    495.594 
 2 PG     action       47.8110
 3 PG     animated     65.712 
 4 PG     comedy      829.616 
 5 PG     drama       160.873 
 6 PG     fantasy     147.461 
 7 PG-13  action       NA     
 8 PG-13  comedy     1208.31  
 9 PG-13  drama       306.26  
10 PG-13  fantasy     361.119 
11 PG-13  rom comedy  406.251 
12 R      action     1044.82  
13 R      comedy      248.876 
14 R      drama       372.905 
15 R      rom comedy   86.0310

Say within each movie rating (G, PG, PG-13, R), we want to know the proportion of total_millions that made by each movie type (animated, action, comedy, etc). We can:

# A tibble: 15 x 5
# Groups:   rating [4]
   rating type        millions total_millions       prop
   <chr>  <chr>          <dbl>          <dbl>      <dbl>
 1 G      animated    495.594         495.594  1        
 2 PG     action       47.8110       1251.47   0.0382038
 3 PG     animated     65.712        1251.47   0.0525077
 4 PG     comedy      829.616        1251.47   0.662912 
 5 PG     drama       160.873        1251.47   0.128547 
 6 PG     fantasy     147.461        1251.47   0.117830 
 7 PG-13  action       NA              NA     NA        
 8 PG-13  comedy     1208.31           NA     NA        
 9 PG-13  drama       306.26           NA     NA        
10 PG-13  fantasy     361.119          NA     NA        
11 PG-13  rom comedy  406.251          NA     NA        
12 R      action     1044.82         1752.63   0.596143 
13 R      comedy      248.876        1752.63   0.142001 
14 R      drama       372.905        1752.63   0.212769 
15 R      rom comedy   86.0310       1752.63   0.0490868

So for example, the 4 proportions corresponding to R rated movies are 0.596 + 0.142 + 0.213 + 0.0491 = 1.

C.1.7 Dealing with %, commas, and $

Say you have numerical data that are recorded as percentages, have commas, or are in dollar form and hence are character strings. How do you convert these to numerical values? Using the parse_number() function from the readr package inside a mutate()! Shout out to Stack Overflow

[1] 10.5
[1] 145897
[1] 1234

What about the other way around? Use the scales package!

[1] "10%"
[1] "145,897"
[1] "$1,234.50"

Congratulations. You are now an R Ninja!