ModernDive

4.1 Importing data

Up to this point, we’ve almost entirely used data stored inside of an R package. Say instead you have your own data saved on your computer or somewhere online. How can you analyze this data in R? Spreadsheet data is often saved in one of the following three formats:

First, a Comma Separated Values .csv file. You can think of a .csv file as a bare-bones spreadsheet where:

  • Each line in the file corresponds to one row of data/one observation.
  • Values for each line are separated with commas. In other words, the values of different variables are separated by commas in each row.
  • The first line is often, but not always, a header row indicating the names of the columns/variables.

Second, an Excel .xlsx spreadsheet file. This format is based on Microsoft’s proprietary Excel software. As opposed to bare-bones .csv files, .xlsx Excel files contain a lot of meta-data (data about data). Recall we saw a previous example of meta-data in Section 3.4 when adding “group structure” meta-data to a data frame by using the group_by() verb. Some examples of Excel spreadsheet meta-data include the use of bold and italic fonts, colored cells, different column widths, and formula macros.

Third, a Google Sheets file, which is a “cloud” or online-based way to work with a spreadsheet. Google Sheets allows you to download your data in both comma separated values .csv and Excel .xlsx formats. One way to import Google Sheets data in R is to go to the Google Sheets menu bar -> File -> Download as -> Select “Microsoft Excel” or “Comma-separated values” and then load that data into R. A more advanced way to import Google Sheets data in R is by using the googlesheets package, a method we leave to a more advanced data science book.

We’ll cover two methods for importing .csv and .xlsx spreadsheet data in R: one using the console and the other using RStudio’s graphical user interface, abbreviated as “GUI.”

4.1.1 Using the console

First, let’s import a Comma Separated Values .csv file that exists on the internet. The .csv file dem_score.csv contains ratings of the level of democracy in different countries spanning 1952 to 1992 and is accessible at https://moderndive.com/data/dem_score.csv. Let’s use the read_csv() function from the readr (Wickham, Hester, and Francois 2018) package to read it off the web, import it into R, and save it in a data frame called dem_score.

# A tibble: 96 x 10
   country    `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
   <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Albania        -9     -9     -9     -9     -9     -9     -9     -9      5
 2 Argentina      -9     -1     -1     -9     -9     -9     -8      8      7
 3 Armenia        -9     -7     -7     -7     -7     -7     -7     -7      7
 4 Australia      10     10     10     10     10     10     10     10     10
 5 Austria        10     10     10     10     10     10     10     10     10
 6 Azerbaijan     -9     -7     -7     -7     -7     -7     -7     -7      1
 7 Belarus        -9     -7     -7     -7     -7     -7     -7     -7      7
 8 Belgium        10     10     10     10     10     10     10     10     10
 9 Bhutan        -10    -10    -10    -10    -10    -10    -10    -10    -10
10 Bolivia        -4     -3     -3     -4     -7     -7      8      9      9
# … with 86 more rows

In this dem_score data frame, the minimum value of -10 corresponds to a highly autocratic nation, whereas a value of 10 corresponds to a highly democratic nation. Note also that backticks surround the different variable names. Variable names in R by default are not allowed to start with a number nor include spaces, but we can get around this fact by surrounding the column name with backticks. We’ll revisit the dem_score data frame in a case study in the upcoming Section 4.3.

Note that the read_csv() function included in the readr package is different than the read.csv() function that comes installed with R. While the difference in the names might seem trivial (an _ instead of a .), the read_csv() function is, in our opinion, easier to use since it can more easily read data off the web and generally imports data at a much faster speed. Furthermore, the read_csv() function included in the readr saves data frames as tibbles by default.

4.1.2 Using RStudio’s interface

Let’s read in the exact same data, but this time from an Excel file saved on your computer. Furthermore, we’ll do this using RStudio’s graphical interface instead of running read_csv() in the console. First, download the Excel file dem_score.xlsx by going to https://moderndive.com/data/dem_score.xlsx, then

  1. Go to the Files pane of RStudio.
  2. Navigate to the directory (i.e., folder on your computer) where the downloaded dem_score.xlsx Excel file is saved. For example, this might be in your Downloads folder.
  3. Click on dem_score.xlsx.
  4. Click “Import Dataset…”

At this point, you should see a screen pop-up like in Figure 4.1. After clicking on the “Import” button on the bottom right of Figure 4.1, RStudio will save this spreadsheet’s data in a data frame called dem_score and display its contents in the spreadsheet viewer.

Importing an Excel file to R.

FIGURE 4.1: Importing an Excel file to R.

Furthermore, note the “Code Preview” block in the bottom right of Figure 4.1. You can copy and paste this code to reload your data again later programmatically, instead of repeating this manual point-and-click process.

References

Wickham, Hadley, Jim Hester, and Romain Francois. 2018. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.