Introduction for students

This book assumes no prerequisites: no algebra, no calculus, and no prior programming/coding experience. This is intended to be a gentle introduction to the practice of analyzing data and answering questions using data the way data scientists, statisticians, data journalists, and other researchers would.

We present a map of your upcoming journey in Figure 0.1.

FIGURE 0.1: ModernDive flowchart.

You’ll first get started with data in Chapter 1 where you’ll learn about the difference between R and RStudio, start coding in R, install and load your first R packages, and explore your first dataset: all domestic departure flights from a New York City airport in 2013. Then you’ll cover the following three portions of this book (Parts 2 and 4 are combined into a single portion):

  1. Data science with tidyverse. You’ll assemble your data science toolbox using tidyverse packages. In particular, you’ll
    • Ch.2: Visualize data using the ggplot2 package.
    • Ch.3: Wrangle data using the dplyr package.
    • Ch.4: Learn about the concept of “tidy” data as a standardized data input and output format for all packages in the tidyverse. Furthermore, you’ll learn how to import spreadsheet files into R using the readr package.
  2. Data modeling with moderndive. Using these data science tools and helper functions from the moderndive package, you’ll fit your first data models. In particular, you’ll
    • Ch.5: Discover basic regression models with only one explanatory variable.
    • Ch.6: Examine multiple regression models with more than one explanatory variable.
  3. Statistical inference with infer. Once again using your newly acquired data science tools, you’ll unpack statistical inference using the infer package. In particular, you’ll:
    • Ch.7: Learn about the role that sampling variability plays in statistical inference and the role that sample size plays in this sampling variability.
    • Ch.8: Construct confidence intervals using bootstrapping.
    • Ch.9: Conduct hypothesis tests using permutation.
  4. Data modeling with moderndive (revisited). Armed with your understanding of statistical inference, you’ll revisit and review the models you constructed in Ch.5 and Ch.6. In particular, you’ll:
    • Ch.10: Interpret confidence intervals and hypothesis tests in a regression setting.

We’ll end with a discussion on what it means to “tell your story with data” in Chapter 11 by presenting example case studies.1

What we hope you will learn from this book

We hope that by the end of this book, you’ll have learned how to:

  1. Use R and the tidyverse suite of R packages for data science.
  2. Fit your first models to data, using a method known as linear regression.
  3. Perform statistical inference using sampling, confidence intervals, and hypothesis tests.
  4. Tell your story with data using these tools.

What do we mean by data stories? We mean any analysis involving data that engages the reader in answering questions with careful visuals and thoughtful discussion. Further discussions on data stories can be found in the blog post “Tell a Meaningful Story With Data.”

Over the course of this book, you will develop your “data science toolbox,” equipping yourself with tools such as data visualization, data formatting, data wrangling, and data modeling using regression.

In particular, this book will lean heavily on data visualization. In today’s world, we are bombarded with graphics that attempt to convey ideas. We will explore what makes a good graphic and what the standard ways of conveying relationships within data are. In general, we’ll use visualization as a way of building almost all of the ideas in this book.

To impart the statistical lessons of this book, we have intentionally minimized the number of mathematical formulas used. Instead, you’ll develop a conceptual understanding of statistics using data visualization and computer simulations. We hope this is a more intuitive experience than the way statistics has traditionally been taught and is commonly perceived.
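
To give a flavor of what learning through computer simulation can look like, here is a minimal sketch in base R. It is not the code used later in the book: the population of 10,000 balls, the 37.5% share of red balls, and the helper name sample_prop_red are all made up for illustration. The idea is to draw many samples of different sizes and compare how much the sample proportions vary.

```r
# Simulate sampling variability with base R only.
# Assumption: a made-up "population" of 10,000 balls, 37.5% of which are red.
set.seed(2013)  # fix the random seed so the simulation is repeatable

population <- rep(c("red", "white"), times = c(3750, 6250))

# Hypothetical helper: draw one sample of size n and
# return the proportion of red balls in that sample.
sample_prop_red <- function(n) {
  drawn <- sample(population, size = n, replace = FALSE)
  mean(drawn == "red")
}

# Repeat the sampling 1,000 times for each of three sample sizes.
sample_sizes <- c(25, 50, 100)
spread <- sapply(sample_sizes, function(n) {
  props <- replicate(1000, sample_prop_red(n))
  sd(props)  # how much the proportions vary from sample to sample
})
names(spread) <- paste0("n = ", sample_sizes)

print(round(spread, 3))
```

The printed standard deviations should shrink as the sample size grows, which is the kind of pattern a formula would state and a simulation lets you see for yourself.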

Finally, you’ll learn the importance of literate programming. By this we mean you’ll learn how to write code that is useful not just for a computer to execute, but also for readers to understand exactly what your analysis is doing and how you did it. This is part of a greater effort to encourage reproducible research (see the “Reproducible research” subsection in this Preface for more details). Hal Abelson coined the phrase that we will follow throughout this book:

Programs must be written for people to read, and only incidentally for machines to execute.
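
As a small illustration of code written for people to read, consider the sketch below. The data frame flights_example and its values are invented for this example, and base R is used so that it runs without any extra packages. The comments state the question being asked and the purpose of each step, so a reader can follow the analysis without running it.

```r
# Question: what is the average departure delay for each airline?
# Assumption: this tiny data frame is made up for illustration only.
flights_example <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "UA"),
  dep_delay = c(5, 15, 0, 30, NA)  # minutes; NA marks a missing value
)

# Step 1: keep only flights whose delay was actually recorded.
recorded <- flights_example[!is.na(flights_example$dep_delay), ]

# Step 2: compute the mean delay within each carrier.
mean_delay_by_carrier <- aggregate(
  dep_delay ~ carrier,
  data = recorded,
  FUN  = mean
)

print(mean_delay_by_carrier)
```

Descriptive names such as recorded and mean_delay_by_carrier do much of the explaining; the comments fill in why each step is there.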

We understand that there may be challenging moments as you learn to program. Both of us continue to struggle and often find ourselves using web searches to find answers and reaching out to colleagues for help. In the long run though, we all can solve problems faster and more elegantly via programming. We wrote this book as our way to help you get started, and you should know that there is a huge community of R users who are happy to help everyone along as well. This community is especially active on the internet, on various forums and websites.

Data/science pipeline

You may think of statistics as just being a bunch of numbers. We commonly hear the word “statistician” when listening to broadcasts of sporting events. Beyond describing numbers such as baseball batting averages, statistics (in particular, data analysis) plays a vital role in all of the sciences.

You’ll commonly hear the phrase “statistically significant” thrown around in the media. You’ll see articles that say, “Science now shows that chocolate is good for you.” Underpinning these claims is data analysis. By the end of this book, you’ll be able to better understand whether these claims should be trusted or whether we should be wary. Inside data analysis are many sub-fields that we will discuss throughout this book (though not necessarily in this order):

  • data collection
  • data wrangling
  • data visualization
  • data modeling
  • inference
  • correlation and regression
  • interpretation of results
  • data communication/storytelling

These sub-fields are summarized in what Grolemund and Wickham have previously termed the “data/science pipeline” in Figure 0.2.

FIGURE 0.2: Data/science pipeline.

We will begin by digging into the grey Understand portion of the cycle with data visualization, continue with a discussion of what is meant by tidy data and data wrangling, and then conclude by talking about interpreting and discussing the results of our models via Communication. These steps are vital to any statistical analysis. But why should you care about statistics?

There’s a reason that many fields require a statistics course. Scientific knowledge grows through an understanding of statistical significance and data analysis. You needn’t be intimidated by statistics. It’s not the beast that it used to be, and once it is paired with computation, you’ll see how reproducible research in the sciences particularly increases scientific knowledge.

Reproducible research

The most important tool is the mindset, when starting, that the end product will be reproducible. – Keith Baggerly

Another goal of this book is to help readers understand the importance of reproducible analyses. The hope is to get readers into the habit of making their analyses reproducible from the very beginning. This means we’ll be trying to help you build new habits. This will take practice and be difficult at times. You’ll see just why it is so important for you to keep track of your code and document it well to help yourself later and any potential collaborators as well.

Copying and pasting results from one program into a word processor is not an ideal way to conduct efficient and effective scientific research. It’s much more important for time to be spent on data collection and data analysis and not on copying and pasting plots back and forth across a variety of programs.

In traditional analyses, if an error was made with the original data, we’d need to step through the entire process again: recreate the plots and copy-and-paste all of the new plots and our statistical analysis into our document. This is error-prone and a frustrating use of time. We want to help you get away from this tedious activity so that we can spend more time doing science.

We are talking about computational reproducibility. – Yihui Xie

Reproducibility means different things in different scientific fields. Are experiments conducted in a way that another researcher could follow the steps and get similar results? In this book, we will focus on what is known as computational reproducibility. This refers to being able to pass all of one’s data analysis, datasets, and conclusions to someone else and have them get exactly the same results on their machine. This allows for time to be spent interpreting results and considering assumptions instead of the more error-prone way of starting from scratch or following a list of steps that may be different from machine to machine.
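
One small piece of computational reproducibility can be sketched in a few lines of base R. The function name run_analysis, the seed value, and the simulated heights are all hypothetical. The sketch assumes that fixing the random seed is enough to make a simulation repeatable, which holds within one R session and generally across machines running the same version of R.

```r
# Hypothetical helper: run a small random "analysis" from a fixed seed.
run_analysis <- function(seed) {
  set.seed(seed)                                        # same seed, same random draws
  simulated_heights <- rnorm(100, mean = 170, sd = 10)  # made-up data
  mean(simulated_heights)
}

first_run  <- run_analysis(seed = 76)
second_run <- run_analysis(seed = 76)

# Both runs started from the same seed, so the results match exactly.
print(identical(first_run, second_run))

# Record the R version alongside the results so that others can
# recreate the same computing environment.
session_details <- sessionInfo()
print(session_details$R.version$version.string)
```

Sharing the seed, the code, and the session details together is what lets someone else rerun the analysis rather than retrace it by hand.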

Final note for students

At this point, if you are interested in instructor perspectives on this book, ways to contribute and collaborate, or the technical details of this book’s construction and publishing, then continue with the rest of the chapter. Otherwise, let’s get started with R and RStudio in Chapter 1!

  1. Note that you’ll see different versions of the word “ModernDive” in this book: (1) moderndive refers to the R package. (2) ModernDive is an abbreviated version of Statistical Inference via Data Science: A ModernDive into R and the Tidyverse. It’s essentially a nickname we gave the book. (3) ModernDive (without italics) corresponds to both the book and the corresponding R package together as an entity.