9 LECTURE: Joining Data in R: Basics
9.1 Introduction
The dplyr package provides a set of functions for joining two data frames into a single data frame based on a set of key columns. There are several functions in the *_join family. These functions all merge together two data frames; they differ in how they handle observations that exist in one but not both data frames. Here are the four functions from this family that you will likely use the most often:
| Function | What it includes in merged data frame |
|---|---|
left_join |
Includes all observations in the left data frame, whether or not there is a match in the right data frame |
right_join |
Includes all observations in the right data frame, whether or not there is a match in the left data frame |
inner_join |
Includes only observations that are in both data frames |
full_join |
Includes all observations from both data frames |
9.2 The First Table
Imagine you are conduct a study and collecting data on subjects and a health outcome. Often, subjects will make multiple visits (a so-called longitudinal study) and so we will record the outcome for each visit. Similarly, we may record other information about them, such as the kind of housing they live in.
This code creates a simple table with some made up data about some hypothetical subjects’ outcomes.
library(tidyverse)
outcomes <- tibble(
id = rep(c("a", "b", "c"), each = 3),
visit = rep(0:2, 3),
outcome = rnorm(3 * 3, 3)
)
print(outcomes)## # A tibble: 9 x 3
## id visit outcome
## <chr> <int> <dbl>
## 1 a 0 3.95
## 2 a 1 4.27
## 3 a 2 3.13
## 4 b 0 4.11
## 5 b 1 3.51
## 6 b 2 3.71
## 7 c 0 2.56
## 8 c 1 2.98
## 9 c 2 3.22
Note that subjects are labeled by their id in the id column.
9.2.1 A Second Table
Here is some code to create a second table (we will be joining the first and second tables shortly). This table contains some data about the hypothetical subjects’ housing situation by recording the type of house they live in.
subjects <- tibble(
id = c("a", "b", "c"),
house = c("detached", "rowhouse", "rowhouse")
)
print(subjects)## # A tibble: 3 x 2
## id house
## <chr> <chr>
## 1 a detached
## 2 b rowhouse
## 3 c rowhouse
9.3 Left Join
Now suppose we want to create a table that combines the information about houses with the information about the outcomes. We can use the left_join() function to merge the outcomes and subjects tables and produce the output above.
## # A tibble: 9 x 4
## id visit outcome house
## <chr> <int> <dbl> <chr>
## 1 a 0 3.95 detached
## 2 a 1 4.27 detached
## 3 a 2 3.13 detached
## 4 b 0 4.11 rowhouse
## 5 b 1 3.51 rowhouse
## 6 b 2 3.71 rowhouse
## 7 c 0 2.56 rowhouse
## 8 c 1 2.98 rowhouse
## 9 c 2 3.22 rowhouse
The by argument indicates the column (or columns) that the two tables have in common.
9.4 Left Join with Incomplete Data
In the previous examples, the subjects table didn’t have a visit column. But suppose it did? Maybe people move around during the study. We could image a table like this one.
subjects <- tibble(
id = c("a", "b", "c"),
visit = c(0, 1, 0),
house = c("detached", "rowhouse", "rowhouse"),
)
print(subjects)## # A tibble: 3 x 3
## id visit house
## <chr> <dbl> <chr>
## 1 a 0 detached
## 2 b 1 rowhouse
## 3 c 0 rowhouse
When we left joint the tables now we get:
## # A tibble: 9 x 4
## id visit outcome house
## <chr> <dbl> <dbl> <chr>
## 1 a 0 3.95 detached
## 2 a 1 4.27 <NA>
## 3 a 2 3.13 <NA>
## 4 b 0 4.11 <NA>
## 5 b 1 3.51 rowhouse
## 6 b 2 3.71 <NA>
## 7 c 0 2.56 rowhouse
## 8 c 1 2.98 <NA>
## 9 c 2 3.22 <NA>
Notice how now if we do not have information about a subject’s housing in a given visit, the left_join() function automatically inserts an NA value to indicate that it is missing.
Also, in the above example, we joined on the id and the visit columns.
We may even have a situation where we are missing housing data for a subject completely. The following table has no information about subject a.
subjects <- tibble(
id = c("b", "c"),
visit = c(1, 0),
house = c("rowhouse", "rowhouse"),
)
subjects## # A tibble: 2 x 3
## id visit house
## <chr> <dbl> <chr>
## 1 b 1 rowhouse
## 2 c 0 rowhouse
But we can still join the tables together and the house values for subject a will all be NA.
## # A tibble: 9 x 4
## id visit outcome house
## <chr> <dbl> <dbl> <chr>
## 1 a 0 3.95 <NA>
## 2 a 1 4.27 <NA>
## 3 a 2 3.13 <NA>
## 4 b 0 4.11 <NA>
## 5 b 1 3.51 rowhouse
## 6 b 2 3.71 <NA>
## 7 c 0 2.56 rowhouse
## 8 c 1 2.98 <NA>
## 9 c 2 3.22 <NA>
The bottom line for left_join() is that it always retains the values in the “left” argument (in this case the outcomes table). If there are no corresponding values in the “right” argument, NA values will be filled in.
9.5 Inner Join
The inner_join() function only retains the rows of both tables that have corresponding values. Here we can see the difference.
## # A tibble: 2 x 4
## id visit outcome house
## <chr> <dbl> <dbl> <chr>
## 1 b 1 3.51 rowhouse
## 2 c 0 2.56 rowhouse
9.6 Right Join
The right_join() function is like the left_join() function except that it gives priority to the “right” hand argument.
## # A tibble: 2 x 4
## id visit outcome house
## <chr> <dbl> <dbl> <chr>
## 1 b 1 3.51 rowhouse
## 2 c 0 2.56 rowhouse
9.7 More Examples
Often, you will have data in two separate datasets that you’d like to combine based on a common variable or variables. For example, for the World Cup example data we’ve been using, it would be interesting to add in a column with the final standing of each player’s team. We’ve included data with that information in a file called “team_standings.csv”, which can be read into the R object team_standings with the call:
library(knitr)
library(tidyverse)
team_standings <- read_csv("data/team_standings.csv")
team_standings %>%
slice(1:3)## # A tibble: 3 x 2
## Standing Team
## <dbl> <chr>
## 1 1 Spain
## 2 2 Netherlands
## 3 3 Germany
This data frame has one observation per team, and the team names are consistent with the team names in the worldcup data frame.
You can use the different functions from the *_join family to merge this team standing data with the player statistics in the worldcup data frame. Once you’ve done that, you can use other data cleaning tools from dplyr to quickly pull and explore interesting parts of the dataset. The main arguments for the *_join functions are the object names of the two data frames to join and by, which specifies which variables to use to match up observations from the two dataframes.
The “left” data frame refers to the first data frame input in the *_join call, while the “right” data frame refers to the second data frame input into the function. For example, in the call
the world_cup data frame is the “left” data frame and the team_standings data frame is the “right” data frame. Therefore, using left_join would include all rows from world_cup, whether or not the player had a team listed in team_standings, while right_join would include all the rows from team_standings, whether or not there were any players from that team in world_cup.
Remember that if you are using piping, the first data frame (“left” for these functions) is by default the dataframe created by the code right before the pipe. When you merge data frames as a step in piped code, therefore, the “left” data frame is the one piped into the function while the “right” data frame is the one stated in the *_join function call.
As an example of merging, say you want to create a table of the top 5 players by shots on goal, as well as the final standing for each of these player’s teams, using the worldcup and team_standings data. You can do this by running:
library(faraway)
data(worldcup)
worldcup %>%
mutate(Name = rownames(worldcup),
Team = as.character(Team)) %>%
select(Name, Position, Shots, Team) %>%
arrange(desc(Shots)) %>%
slice(1:5) %>%
left_join(team_standings, by = "Team") %>% # Merge in team standings
rename("Team Standing" = Standing) %>%
kable()| Name | Position | Shots | Team | Team Standing |
|---|---|---|---|---|
| Gyan | Forward | 27 | Ghana | 7 |
| Villa | Forward | 22 | Spain | 1 |
| Messi | Forward | 21 | Argentina | 5 |
| Suarez | Forward | 19 | Uruguay | 4 |
| Forlan | Forward | 18 | Uruguay | 4 |
In addition to the merging in this code, there are a few other interesting things to point out:
- The code uses the
as.characterfunction within amutatecall to change the team name from a factor to a character in theworldcupdata frame. When merging two data frames, it’s safest if the column you’re using to merge has the same class in each data frame. The “Team” column is a character class in theteam_standingsdata frame but a factor class in theworldcupdata frame, so this call converts that column to a character class inworldcup. Theleft_joinfunction will still perform a merge if you don’t include this call, but it will throw a warning that it is coercing the column inworldcupto a character vector. It’s generally safer to do this yourself explictly. - It uses the
selectfunction both to remove columns we’re not interested in and also to put the columns we want to keep in the order we’d like for the final table. - It uses
arrangefollowed bysliceto pull out the top 5 players and order them by number of shots. - For one of the column names, we want to use “Team Standing” rather than the current column name of “Standing”. This code uses
renameat the very end to make this change right before creating the table. You can also use thecol.namesargument in thekablefunction to customize all the column names in the final table, but thisrenamecall is a quick fix since we just want to change one column name.
9.8 Summary
left_join()is useful for merging a “large” data frame with a “smaller” one while retaining all the rows of the “large” data frameinner_join()gives you the intersection of the rows between two data framesright_join()is likeleft_join()with the arguments reversed (likely only useful at the end of a pipeline)