Chapter 2 Data Visualization
We begin the development of your data science toolbox with data visualization. By visualizing data, we gain valuable insights we couldn’t initially obtain from just looking at the raw data values. We’ll use the
ggplot2 package, as it provides an easy way to customize your plots.
ggplot2 is rooted in the data visualization theory known as the grammar of graphics (Wilkinson 2005), developed by Leland Wilkinson.
At their most basic, graphics/plots/charts (we use these terms interchangeably in this book) provide a nice way to explore the patterns in data, such as the presence of outliers, distributions of individual variables, and relationships between groups of variables. Graphics are designed to emphasize the findings and insights you want your audience to understand. This does, however, require a balancing act. On the one hand, you want to highlight as many interesting findings as possible. On the other hand, you don’t want to include so much information that it overwhelms your audience.
As we will see, plots also help us to identify patterns and outliers in our data. We’ll see that a common extension of these ideas is to compare the distribution of one numerical variable, such as what are the center and spread of the values, as we go across the levels of a different categorical variable.
Wilkinson, Leland. 2005. The Grammar of Graphics (Statistics and Computing). First. Secaucus, NJ: Springer-Verlag.