1.2 The Importance of Tidy Data
The learning objectives for this section are to:
- Define tidy data and transform non-tidy data into tidy data
One unifying concept of this book is the notion of tidy data. As defined by Hadley Wickham in his 2014 paper published in the Journal of Statistical Software, a tidy dataset has the following properties:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
The purpose of defining tidy data is to highlight the fact that most data do not start out life as tidy. In fact, much of the work of data analysis may involve simply making the data tidy (at least this has been our experience). Once a dataset is tidy, it can be used as input into a variety of other functions that may transform, model, or visualize the data.
As a quick example, consider the following data illustrating death rates in Virginia in 1940 in a classic table format:
Age   Rural Male Rural Female Urban Male Urban Female
50-54       11.7          8.7       15.4          8.4
55-59       18.1         11.7       24.3         13.6
60-64       26.9         20.3       37.0         19.3
65-69       41.0         30.9       54.6         35.1
70-74       66.0         54.3       71.1         50.0
While this format is canonical and is useful for quickly observing the relationship between multiple variables, it is not tidy. This format violates the tidy form because there are variables in both the rows and columns. In this case the variables are age category, gender, and urban-ness. Finally, the death rate itself, which is the fourth variable, is presented inside the table.
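To see where each variable is hiding, it can help to inspect the object directly. The following is a minimal sketch that assumes only the built-in VADeaths matrix from base R's datasets package; the names age_groups and groups are illustrative and not part of the original text.

```r
# VADeaths is a built-in dataset (datasets package), stored as a numeric matrix
is.matrix(VADeaths)

# The age variable lives in the row names ...
age_groups <- rownames(VADeaths)
age_groups

# ... while urban-ness and gender are fused together in the column names
groups <- colnames(VADeaths)
groups

# The death rates themselves are the cell values (5 age groups x 4 columns)
dim(VADeaths)
```

None of the four variables has a column of its own, which is what the tidying step has to fix.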
Converting this data to tidy format would give us
library(tidyr)
library(dplyr)

VADeaths %>%
  as_tibble() %>%                          # matrix -> tibble; replaces the deprecated tbl_df()
  mutate(age = row.names(VADeaths)) %>%    # row names are dropped above, so restore age as a column
  gather(key, death_rate, -age) %>%        # stack the four rate columns into key/value pairs
  separate(key, c("urban", "gender"), sep = " ") %>%   # split "Rural Male" into two variables
  mutate(age = factor(age), urban = factor(urban), gender = factor(gender))
# A tibble: 20 x 4
   age   urban gender death_rate
   <fct> <fct> <fct>       <dbl>
 1 50-54 Rural Male         11.7
 2 55-59 Rural Male         18.1
 3 60-64 Rural Male         26.9
 4 65-69 Rural Male         41
 5 70-74 Rural Male         66
 6 50-54 Rural Female        8.7
 7 55-59 Rural Female       11.7
 8 60-64 Rural Female       20.3
 9 65-69 Rural Female       30.9
10 70-74 Rural Female       54.3
11 50-54 Urban Male         15.4
12 55-59 Urban Male         24.3
13 60-64 Urban Male         37
14 65-69 Urban Male         54.6
15 70-74 Urban Male         71.1
16 50-54 Urban Female        8.4
17 55-59 Urban Female       13.6
18 60-64 Urban Female       19.3
19 65-69 Urban Female       35.1
20 70-74 Urban Female       50
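Once the data are tidy, they can be passed straight to other functions. As a rough sketch of what that looks like, the code below rebuilds the tidy table and averages the death rate within each urban/gender group. The names va_tidy, va_summary, and mean_rate are illustrative, the .groups argument assumes dplyr 1.0.0 or later, and the unweighted mean across age groups is shown only to demonstrate the mechanics, not as a meaningful statistic.

```r
library(dplyr)
library(tidyr)

# Rebuild the tidy table from the built-in VADeaths matrix
va_tidy <- VADeaths %>%
  as_tibble() %>%
  mutate(age = row.names(VADeaths)) %>%
  gather(key, death_rate, -age) %>%
  separate(key, c("urban", "gender"), sep = " ")

# Because each variable is its own column, grouping and summarizing is direct
va_summary <- va_tidy %>%
  group_by(urban, gender) %>%
  summarize(mean_rate = mean(death_rate), .groups = "drop")

va_summary
```

The same grouping would be awkward to express against the original matrix, where urban-ness and gender exist only as parts of the column names.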
1.2.1 The “Tidyverse”
There are a number of R packages that take advantage of the tidy data form and can be used to do interesting things with data. Many (but not all) of these packages are written by Hadley Wickham and the collection of packages is sometimes referred to as the “tidyverse” because of their dependence on and presumption of tidy data. “Tidyverse” packages include
- ggplot2: a plotting system based on the grammar of graphics
- magrittr: defines the %>% operator for chaining functions together in a series of operations on data
- dplyr: a suite of (fast) functions for working with data frames
- tidyr: easily tidy data with spread() and gather() functions
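To make the roles of %>%, gather(), and spread() concrete, here is a small sketch of a round trip between wide and long form. The data frame wide and its values are made up for illustration, and the names long and wide_again are hypothetical. Note that the tidyr documentation marks gather() and spread() as superseded by pivot_longer() and pivot_wider(); the older functions are used here only because they are the ones named above.

```r
library(dplyr)
library(tidyr)

# A small, made-up wide table: one column per year (hypothetical data)
wide <- data.frame(country = c("A", "B"),
                   y2019 = c(1.5, 2.0),
                   y2020 = c(1.7, 2.4))

# gather() stacks the year columns into key/value pairs (wide -> long)
long <- wide %>%
  gather(key = "year", value = "rate", -country)

# spread() reverses the operation (long -> wide)
wide_again <- long %>%
  spread(key = year, value = rate)

long
wide_again
```

The pipe passes the result of each step as the first argument of the next function, so the chain reads top to bottom in the order the operations are applied.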
We will be using these packages extensively in this book.
The “tidyverse” package can be used to install all of the packages in the tidyverse at once, and loading it attaches the core packages in a single call. For example, instead of starting an R script with this:
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
You can start with this:
library(tidyverse)