9 LECTURE: Joining Data in R: Basics
9.1 Introduction
The dplyr
package provides a set of functions for joining two data frames into a single data frame based on a set of key columns. There are several functions in the *_join
family. These functions all merge together two data frames; they differ in how they handle observations that exist in one but not both data frames. Here are the four functions from this family that you will likely use the most often:
Function | What it includes in merged data frame |
---|---|
left_join |
Includes all observations in the left data frame, whether or not there is a match in the right data frame |
right_join |
Includes all observations in the right data frame, whether or not there is a match in the left data frame |
inner_join |
Includes only observations that are in both data frames |
full_join |
Includes all observations from both data frames |
9.2 The First Table
Imagine you are conduct a study and collecting data on subjects and a health outcome. Often, subjects will make multiple visits (a so-called longitudinal study) and so we will record the outcome for each visit. Similarly, we may record other information about them, such as the kind of housing they live in.
This code creates a simple table with some made up data about some hypothetical subjects’ outcomes.
library(tidyverse)
outcomes <- tibble(
id = rep(c("a", "b", "c"), each = 3),
visit = rep(0:2, 3),
outcome = rnorm(3 * 3, 3)
)
print(outcomes)
## # A tibble: 9 x 3
## id visit outcome
## <chr> <int> <dbl>
## 1 a 0 3.95
## 2 a 1 4.27
## 3 a 2 3.13
## 4 b 0 4.11
## 5 b 1 3.51
## 6 b 2 3.71
## 7 c 0 2.56
## 8 c 1 2.98
## 9 c 2 3.22
Note that subjects are labeled by their id in the id
column.
9.2.1 A Second Table
Here is some code to create a second table (we will be joining the first and second tables shortly). This table contains some data about the hypothetical subjects’ housing situation by recording the type of house they live in.
subjects <- tibble(
id = c("a", "b", "c"),
house = c("detached", "rowhouse", "rowhouse")
)
print(subjects)
## # A tibble: 3 x 2
## id house
## <chr> <chr>
## 1 a detached
## 2 b rowhouse
## 3 c rowhouse
9.3 Left Join
Now suppose we want to create a table that combines the information about houses with the information about the outcomes. We can use the left_join()
function to merge the outcomes
and subjects
tables and produce the output above.
## # A tibble: 9 x 4
## id visit outcome house
## <chr> <int> <dbl> <chr>
## 1 a 0 3.95 detached
## 2 a 1 4.27 detached
## 3 a 2 3.13 detached
## 4 b 0 4.11 rowhouse
## 5 b 1 3.51 rowhouse
## 6 b 2 3.71 rowhouse
## 7 c 0 2.56 rowhouse
## 8 c 1 2.98 rowhouse
## 9 c 2 3.22 rowhouse
The by
argument indicates the column (or columns) that the two tables have in common.
9.4 Left Join with Incomplete Data
In the previous examples, the subjects
table didn’t have a visit
column. But suppose it did? Maybe people move around during the study. We could image a table like this one.
subjects <- tibble(
id = c("a", "b", "c"),
visit = c(0, 1, 0),
house = c("detached", "rowhouse", "rowhouse"),
)
print(subjects)
## # A tibble: 3 x 3
## id visit house
## <chr> <dbl> <chr>
## 1 a 0 detached
## 2 b 1 rowhouse
## 3 c 0 rowhouse
When we left joint the tables now we get:
## # A tibble: 9 x 4
## id visit outcome house
## <chr> <dbl> <dbl> <chr>
## 1 a 0 3.95 detached
## 2 a 1 4.27 <NA>
## 3 a 2 3.13 <NA>
## 4 b 0 4.11 <NA>
## 5 b 1 3.51 rowhouse
## 6 b 2 3.71 <NA>
## 7 c 0 2.56 rowhouse
## 8 c 1 2.98 <NA>
## 9 c 2 3.22 <NA>
Notice how now if we do not have information about a subject’s housing in a given visit, the left_join()
function automatically inserts an NA
value to indicate that it is missing.
Also, in the above example, we joined on the id
and the visit
columns.
We may even have a situation where we are missing housing data for a subject completely. The following table has no information about subject a
.
subjects <- tibble(
id = c("b", "c"),
visit = c(1, 0),
house = c("rowhouse", "rowhouse"),
)
subjects
## # A tibble: 2 x 3
## id visit house
## <chr> <dbl> <chr>
## 1 b 1 rowhouse
## 2 c 0 rowhouse
But we can still join the tables together and the house
values for subject a
will all be NA
.
## # A tibble: 9 x 4
## id visit outcome house
## <chr> <dbl> <dbl> <chr>
## 1 a 0 3.95 <NA>
## 2 a 1 4.27 <NA>
## 3 a 2 3.13 <NA>
## 4 b 0 4.11 <NA>
## 5 b 1 3.51 rowhouse
## 6 b 2 3.71 <NA>
## 7 c 0 2.56 rowhouse
## 8 c 1 2.98 <NA>
## 9 c 2 3.22 <NA>
The bottom line for left_join()
is that it always retains the values in the “left” argument (in this case the outcomes
table). If there are no corresponding values in the “right” argument, NA
values will be filled in.
9.5 Inner Join
The inner_join()
function only retains the rows of both tables that have corresponding values. Here we can see the difference.
## # A tibble: 2 x 4
## id visit outcome house
## <chr> <dbl> <dbl> <chr>
## 1 b 1 3.51 rowhouse
## 2 c 0 2.56 rowhouse
9.6 Right Join
The right_join()
function is like the left_join()
function except that it gives priority to the “right” hand argument.
## # A tibble: 2 x 4
## id visit outcome house
## <chr> <dbl> <dbl> <chr>
## 1 b 1 3.51 rowhouse
## 2 c 0 2.56 rowhouse
9.7 More Examples
Often, you will have data in two separate datasets that you’d like to combine based on a common variable or variables. For example, for the World Cup example data we’ve been using, it would be interesting to add in a column with the final standing of each player’s team. We’ve included data with that information in a file called “team_standings.csv”, which can be read into the R object team_standings
with the call:
library(knitr)
library(tidyverse)
team_standings <- read_csv("data/team_standings.csv")
team_standings %>%
slice(1:3)
## # A tibble: 3 x 2
## Standing Team
## <dbl> <chr>
## 1 1 Spain
## 2 2 Netherlands
## 3 3 Germany
This data frame has one observation per team, and the team names are consistent with the team names in the worldcup
data frame.
You can use the different functions from the *_join
family to merge this team standing data with the player statistics in the worldcup
data frame. Once you’ve done that, you can use other data cleaning tools from dplyr
to quickly pull and explore interesting parts of the dataset. The main arguments for the *_join
functions are the object names of the two data frames to join and by
, which specifies which variables to use to match up observations from the two dataframes.
The “left” data frame refers to the first data frame input in the *_join
call, while the “right” data frame refers to the second data frame input into the function. For example, in the call
the world_cup
data frame is the “left” data frame and the team_standings
data frame is the “right” data frame. Therefore, using left_join
would include all rows from world_cup
, whether or not the player had a team listed in team_standings
, while right_join
would include all the rows from team_standings
, whether or not there were any players from that team in world_cup
.
Remember that if you are using piping, the first data frame (“left” for these functions) is by default the dataframe created by the code right before the pipe. When you merge data frames as a step in piped code, therefore, the “left” data frame is the one piped into the function while the “right” data frame is the one stated in the *_join
function call.
As an example of merging, say you want to create a table of the top 5 players by shots on goal, as well as the final standing for each of these player’s teams, using the worldcup
and team_standings
data. You can do this by running:
library(faraway)
data(worldcup)
worldcup %>%
mutate(Name = rownames(worldcup),
Team = as.character(Team)) %>%
select(Name, Position, Shots, Team) %>%
arrange(desc(Shots)) %>%
slice(1:5) %>%
left_join(team_standings, by = "Team") %>% # Merge in team standings
rename("Team Standing" = Standing) %>%
kable()
Name | Position | Shots | Team | Team Standing |
---|---|---|---|---|
Gyan | Forward | 27 | Ghana | 7 |
Villa | Forward | 22 | Spain | 1 |
Messi | Forward | 21 | Argentina | 5 |
Suarez | Forward | 19 | Uruguay | 4 |
Forlan | Forward | 18 | Uruguay | 4 |
In addition to the merging in this code, there are a few other interesting things to point out:
- The code uses the
as.character
function within amutate
call to change the team name from a factor to a character in theworldcup
data frame. When merging two data frames, it’s safest if the column you’re using to merge has the same class in each data frame. The “Team” column is a character class in theteam_standings
data frame but a factor class in theworldcup
data frame, so this call converts that column to a character class inworldcup
. Theleft_join
function will still perform a merge if you don’t include this call, but it will throw a warning that it is coercing the column inworldcup
to a character vector. It’s generally safer to do this yourself explictly. - It uses the
select
function both to remove columns we’re not interested in and also to put the columns we want to keep in the order we’d like for the final table. - It uses
arrange
followed byslice
to pull out the top 5 players and order them by number of shots. - For one of the column names, we want to use “Team Standing” rather than the current column name of “Standing”. This code uses
rename
at the very end to make this change right before creating the table. You can also use thecol.names
argument in thekable
function to customize all the column names in the final table, but thisrename
call is a quick fix since we just want to change one column name.
9.8 Summary
left_join()
is useful for merging a “large” data frame with a “smaller” one while retaining all the rows of the “large” data frameinner_join()
gives you the intersection of the rows between two data framesright_join()
is likeleft_join()
with the arguments reversed (likely only useful at the end of a pipeline)