4.1 Importing data
Up to this point, we’ve almost entirely used data stored inside of an R package. Say instead you have your own data saved on your computer or somewhere online. How can you analyze this data in R? Spreadsheet data is often saved in one of the following three formats:
First, a Comma Separated Values
.csv file. You can think of a
.csv file as a bare-bones spreadsheet where:
- Each line in the file corresponds to one row of data/one observation.
- Values for each line are separated with commas. In other words, the values of different variables are separated by commas in each row.
- The first line is often, but not always, a header row indicating the names of the columns/variables.
Second, an Excel
.xlsx spreadsheet file. This format is based on Microsoft’s proprietary Excel software. As opposed to bare-bones
.xlsx Excel files contain a lot of meta-data (data about data). Recall we saw a previous example of meta-data in Section 3.4 when adding “group structure” meta-data to a data frame by using the
group_by() verb. Some examples of Excel spreadsheet meta-data include the use of bold and italic fonts, colored cells, different column widths, and formula macros.
Third, a Google Sheets file, which is a “cloud” or online-based way to work with a spreadsheet. Google Sheets allows you to download your data in both comma separated values
.csv and Excel
.xlsx formats. One way to import Google Sheets data in R is to go to the Google Sheets menu bar -> File -> Download as -> Select “Microsoft Excel” or “Comma-separated values” and then load that data into R. A more advanced way to import Google Sheets data in R is by using the
googlesheets package, a method we leave to a more advanced data science book.
We’ll cover two methods for importing
.xlsx spreadsheet data in R: one using the console and the other using RStudio’s graphical user interface, abbreviated as “GUI.”
4.1.1 Using the console
First, let’s import a Comma Separated Values
.csv file that exists on the internet. The
dem_score.csv contains ratings of the level of democracy in different countries spanning 1952 to 1992 and is accessible at https://moderndive.com/data/dem_score.csv. Let’s use the
read_csv() function from the
readr (Wickham, Hester, and Francois 2018) package to read it off the web, import it into R, and save it in a data frame called
# A tibble: 96 x 10 country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Albania -9 -9 -9 -9 -9 -9 -9 -9 5 2 Argentina -9 -1 -1 -9 -9 -9 -8 8 7 3 Armenia -9 -7 -7 -7 -7 -7 -7 -7 7 4 Australia 10 10 10 10 10 10 10 10 10 5 Austria 10 10 10 10 10 10 10 10 10 6 Azerbaijan -9 -7 -7 -7 -7 -7 -7 -7 1 7 Belarus -9 -7 -7 -7 -7 -7 -7 -7 7 8 Belgium 10 10 10 10 10 10 10 10 10 9 Bhutan -10 -10 -10 -10 -10 -10 -10 -10 -10 10 Bolivia -4 -3 -3 -4 -7 -7 8 9 9 # … with 86 more rows
dem_score data frame, the minimum value of
-10 corresponds to a highly autocratic nation, whereas a value of
10 corresponds to a highly democratic nation. Note also that backticks surround the different variable names. Variable names in R by default are not allowed to start with a number nor include spaces, but we can get around this fact by surrounding the column name with backticks. We’ll revisit the
dem_score data frame in a case study in the upcoming Section 4.3.
Note that the
read_csv() function included in the
readr package is different than the
read.csv() function that comes installed with R. While the difference in the names might seem trivial (an
_ instead of a
read_csv() function is, in our opinion, easier to use since it can more easily read data off the web and generally imports data at a much faster speed. Furthermore, the
read_csv() function included in the
readr saves data frames as
tibbles by default.
4.1.2 Using RStudio’s interface
Let’s read in the exact same data, but this time from an Excel file saved on your computer. Furthermore, we’ll do this using RStudio’s graphical interface instead of running
read_csv() in the console. First, download the Excel file
dem_score.xlsx by going to https://moderndive.com/data/dem_score.xlsx, then
- Go to the Files pane of RStudio.
- Navigate to the directory (i.e., folder on your computer) where the downloaded
dem_score.xlsxExcel file is saved. For example, this might be in your Downloads folder.
- Click on
- Click “Import Dataset…”
At this point, you should see a screen pop-up like in Figure 4.1. After clicking on the “Import” button on the bottom right of Figure 4.1, RStudio will save this spreadsheet’s data in a data frame called
dem_score and display its contents in the spreadsheet viewer.
Furthermore, note the “Code Preview” block in the bottom right of Figure 4.1. You can copy and paste this code to reload your data again later programmatically, instead of repeating this manual point-and-click process.
Wickham, Hadley, Jim Hester, and Romain Francois. 2018. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.