9 LECTURE: Joining Data in R: Basics

9.1 Introduction

The dplyr package provides a set of functions for joining two data frames into a single data frame based on a set of key columns. There are several functions in the *_join family. These functions all merge together two data frames; they differ in how they handle observations that exist in one but not both data frames. Here are the four functions from this family that you will likely use the most often:

Function What it includes in merged data frame
left_join Includes all observations in the left data frame, whether or not there is a match in the right data frame
right_join Includes all observations in the right data frame, whether or not there is a match in the left data frame
inner_join Includes only observations that are in both data frames
full_join Includes all observations from both data frames

9.2 The First Table

Imagine you are conduct a study and collecting data on subjects and a health outcome. Often, subjects will make multiple visits (a so-called longitudinal study) and so we will record the outcome for each visit. Similarly, we may record other information about them, such as the kind of housing they live in.

This code creates a simple table with some made up data about some hypothetical subjects’ outcomes.

library(tidyverse)

outcomes <- tibble(
        id = rep(c("a", "b", "c"), each = 3),
        visit = rep(0:2, 3),
        outcome = rnorm(3 * 3, 3)
)

print(outcomes)
## # A tibble: 9 x 3
##   id    visit outcome
##   <chr> <int>   <dbl>
## 1 a         0    3.95
## 2 a         1    4.27
## 3 a         2    3.13
## 4 b         0    4.11
## 5 b         1    3.51
## 6 b         2    3.71
## 7 c         0    2.56
## 8 c         1    2.98
## 9 c         2    3.22

Note that subjects are labeled by their id in the id column.

9.2.1 A Second Table

Here is some code to create a second table (we will be joining the first and second tables shortly). This table contains some data about the hypothetical subjects’ housing situation by recording the type of house they live in.

subjects <- tibble(
        id = c("a", "b", "c"),
        house = c("detached", "rowhouse", "rowhouse")
)

print(subjects)
## # A tibble: 3 x 2
##   id    house   
##   <chr> <chr>   
## 1 a     detached
## 2 b     rowhouse
## 3 c     rowhouse

9.3 Left Join

Now suppose we want to create a table that combines the information about houses with the information about the outcomes. We can use the left_join() function to merge the outcomes and subjects tables and produce the output above.

left_join(outcomes, subjects, by = "id")
## # A tibble: 9 x 4
##   id    visit outcome house   
##   <chr> <int>   <dbl> <chr>   
## 1 a         0    3.95 detached
## 2 a         1    4.27 detached
## 3 a         2    3.13 detached
## 4 b         0    4.11 rowhouse
## 5 b         1    3.51 rowhouse
## 6 b         2    3.71 rowhouse
## 7 c         0    2.56 rowhouse
## 8 c         1    2.98 rowhouse
## 9 c         2    3.22 rowhouse

The by argument indicates the column (or columns) that the two tables have in common.

9.4 Left Join with Incomplete Data

In the previous examples, the subjects table didn’t have a visit column. But suppose it did? Maybe people move around during the study. We could image a table like this one.

subjects <- tibble(
        id = c("a", "b", "c"),
        visit = c(0, 1, 0),
        house = c("detached", "rowhouse", "rowhouse"),
)

print(subjects)
## # A tibble: 3 x 3
##   id    visit house   
##   <chr> <dbl> <chr>   
## 1 a         0 detached
## 2 b         1 rowhouse
## 3 c         0 rowhouse

When we left joint the tables now we get:

left_join(outcomes, subjects, by = c("id", "visit"))
## # A tibble: 9 x 4
##   id    visit outcome house   
##   <chr> <dbl>   <dbl> <chr>   
## 1 a         0    3.95 detached
## 2 a         1    4.27 <NA>    
## 3 a         2    3.13 <NA>    
## 4 b         0    4.11 <NA>    
## 5 b         1    3.51 rowhouse
## 6 b         2    3.71 <NA>    
## 7 c         0    2.56 rowhouse
## 8 c         1    2.98 <NA>    
## 9 c         2    3.22 <NA>

Notice how now if we do not have information about a subject’s housing in a given visit, the left_join() function automatically inserts an NA value to indicate that it is missing.

Also, in the above example, we joined on the id and the visit columns.

We may even have a situation where we are missing housing data for a subject completely. The following table has no information about subject a.

subjects <- tibble(
        id = c("b", "c"),
        visit = c(1, 0),
        house = c("rowhouse", "rowhouse"),
)

subjects
## # A tibble: 2 x 3
##   id    visit house   
##   <chr> <dbl> <chr>   
## 1 b         1 rowhouse
## 2 c         0 rowhouse

But we can still join the tables together and the house values for subject a will all be NA.

left_join(outcomes, subjects, by = c("id", "visit"))
## # A tibble: 9 x 4
##   id    visit outcome house   
##   <chr> <dbl>   <dbl> <chr>   
## 1 a         0    3.95 <NA>    
## 2 a         1    4.27 <NA>    
## 3 a         2    3.13 <NA>    
## 4 b         0    4.11 <NA>    
## 5 b         1    3.51 rowhouse
## 6 b         2    3.71 <NA>    
## 7 c         0    2.56 rowhouse
## 8 c         1    2.98 <NA>    
## 9 c         2    3.22 <NA>

The bottom line for left_join() is that it always retains the values in the “left” argument (in this case the outcomes table). If there are no corresponding values in the “right” argument, NA values will be filled in.

9.5 Inner Join

The inner_join() function only retains the rows of both tables that have corresponding values. Here we can see the difference.

inner_join(outcomes, subjects, by = c("id", "visit"))
## # A tibble: 2 x 4
##   id    visit outcome house   
##   <chr> <dbl>   <dbl> <chr>   
## 1 b         1    3.51 rowhouse
## 2 c         0    2.56 rowhouse

9.6 Right Join

The right_join() function is like the left_join() function except that it gives priority to the “right” hand argument.

right_join(outcomes, subjects, by = c("id", "visit"))
## # A tibble: 2 x 4
##   id    visit outcome house   
##   <chr> <dbl>   <dbl> <chr>   
## 1 b         1    3.51 rowhouse
## 2 c         0    2.56 rowhouse

9.7 More Examples

Often, you will have data in two separate datasets that you’d like to combine based on a common variable or variables. For example, for the World Cup example data we’ve been using, it would be interesting to add in a column with the final standing of each player’s team. We’ve included data with that information in a file called “team_standings.csv”, which can be read into the R object team_standings with the call:

library(knitr)
library(tidyverse)
team_standings <- read_csv("data/team_standings.csv")
team_standings %>% 
        slice(1:3)
## # A tibble: 3 x 2
##   Standing Team       
##      <dbl> <chr>      
## 1        1 Spain      
## 2        2 Netherlands
## 3        3 Germany

This data frame has one observation per team, and the team names are consistent with the team names in the worldcup data frame.

You can use the different functions from the *_join family to merge this team standing data with the player statistics in the worldcup data frame. Once you’ve done that, you can use other data cleaning tools from dplyr to quickly pull and explore interesting parts of the dataset. The main arguments for the *_join functions are the object names of the two data frames to join and by, which specifies which variables to use to match up observations from the two dataframes.

The “left” data frame refers to the first data frame input in the *_join call, while the “right” data frame refers to the second data frame input into the function. For example, in the call

left_join(world_cup, team_standings, by = "Team")

the world_cup data frame is the “left” data frame and the team_standings data frame is the “right” data frame. Therefore, using left_join would include all rows from world_cup, whether or not the player had a team listed in team_standings, while right_join would include all the rows from team_standings, whether or not there were any players from that team in world_cup.

Remember that if you are using piping, the first data frame (“left” for these functions) is by default the dataframe created by the code right before the pipe. When you merge data frames as a step in piped code, therefore, the “left” data frame is the one piped into the function while the “right” data frame is the one stated in the *_join function call.

As an example of merging, say you want to create a table of the top 5 players by shots on goal, as well as the final standing for each of these player’s teams, using the worldcup and team_standings data. You can do this by running:

library(faraway)
data(worldcup)
worldcup %>% 
  mutate(Name = rownames(worldcup),
         Team = as.character(Team)) %>%
  select(Name, Position, Shots, Team) %>%
  arrange(desc(Shots)) %>%
  slice(1:5) %>%
  left_join(team_standings, by = "Team") %>% # Merge in team standings
  rename("Team Standing" = Standing) %>%
  kable()
Name Position Shots Team Team Standing
Gyan Forward 27 Ghana 7
Villa Forward 22 Spain 1
Messi Forward 21 Argentina 5
Suarez Forward 19 Uruguay 4
Forlan Forward 18 Uruguay 4

In addition to the merging in this code, there are a few other interesting things to point out:

  • The code uses the as.character function within a mutate call to change the team name from a factor to a character in the worldcup data frame. When merging two data frames, it’s safest if the column you’re using to merge has the same class in each data frame. The “Team” column is a character class in the team_standings data frame but a factor class in the worldcup data frame, so this call converts that column to a character class in worldcup. The left_join function will still perform a merge if you don’t include this call, but it will throw a warning that it is coercing the column in worldcup to a character vector. It’s generally safer to do this yourself explictly.
  • It uses the select function both to remove columns we’re not interested in and also to put the columns we want to keep in the order we’d like for the final table.
  • It uses arrange followed by slice to pull out the top 5 players and order them by number of shots.
  • For one of the column names, we want to use “Team Standing” rather than the current column name of “Standing”. This code uses rename at the very end to make this change right before creating the table. You can also use the col.names argument in the kable function to customize all the column names in the final table, but this rename call is a quick fix since we just want to change one column name.

9.8 Summary

  • left_join() is useful for merging a “large” data frame with a “smaller” one while retaining all the rows of the “large” data frame

  • inner_join() gives you the intersection of the rows between two data frames

  • right_join() is like left_join() with the arguments reversed (likely only useful at the end of a pipeline)