5 Data Manipulation

R packages are collections of functions that extend the functionality of the R programming language. They provide a convenient way for users to access and utilize specialized tools for data analysis, visualization, statistical modeling, and more. Among them, tidyverse provides very useful functions for data manipulation and visualization.

tidyverse:: bundles R packages designed to make data manipulation, exploration, and visualization more efficient and intuitive. Developed by Hadley Wickham and other contributors, the Tidyverse packages share a common philosophy and syntax, emphasizing a consistent and tidy data format. The core packages, such as dplyr::, tidyr::, and ggplot2::, provide powerful tools for data wrangling, reshaping, and creating visualizations.

In this Chapter, we will use iris data, which is available by default in R. While the default data structure iris is prepared in data.frame(), I convert this data format to tibble() in this exercise. Copy and run the code below. We will use iris_sub in the following demonstration.

library(tidyverse)

set.seed(123)

iris_sub <- as_tibble(iris) %>% 
  group_by(Species) %>% 
  sample_n(3) %>% 
  ungroup()

print(iris_sub)

## # A tibble: 9 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          4.8         3.1          1.6         0.2 setosa    
## 2          5.8         4            1.2         0.2 setosa    
## 3          4.3         3            1.1         0.1 setosa    
## 4          6.9         3.1          4.9         1.5 versicolor
## 5          6.1         3            4.6         1.4 versicolor
## 6          5.8         2.6          4           1.2 versicolor
## 7          6.3         3.4          5.6         2.4 virginica 
## 8          5.7         2.5          5           2   virginica 
## 9          6.7         3.3          5.7         2.1 virginica

I cover the basics for those who are not familiar with tidyverse::, but there is plenty of documentation available for the tidyverse package.

In particular, dplyr:: and tidyr:: (and others) offer useful functions for data manipulation. Key manipulations include: row/column manipulations, group operation, reshape, and more.

5.1 Row Manipulation

5.1.1 Subset rows

The filter() function (dplyr package) is used for subsetting rows of a data frame (or tibble) based on conditions.

Single match ==

filter(iris_sub, Species == "virginica")

## # A tibble: 3 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>    
## 1          6.3         3.4          5.6         2.4 virginica
## 2          5.7         2.5          5           2   virginica
## 3          6.7         3.3          5.7         2.1 virginica

Multiple match %in%

filter(iris_sub, Species %in% c("virginica", "versicolor"))

## # A tibble: 6 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          6.9         3.1          4.9         1.5 versicolor
## 2          6.1         3            4.6         1.4 versicolor
## 3          5.8         2.6          4           1.2 versicolor
## 4          6.3         3.4          5.6         2.4 virginica 
## 5          5.7         2.5          5           2   virginica 
## 6          6.7         3.3          5.7         2.1 virginica

Except !=

filter(iris_sub, Species != "virginica")

## # A tibble: 6 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          4.8         3.1          1.6         0.2 setosa    
## 2          5.8         4            1.2         0.2 setosa    
## 3          4.3         3            1.1         0.1 setosa    
## 4          6.9         3.1          4.9         1.5 versicolor
## 5          6.1         3            4.6         1.4 versicolor
## 6          5.8         2.6          4           1.2 versicolor

Except multiple !=

filter(iris_sub, !(Species %in% c("virginica", "versicolor")))

## # A tibble: 3 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          4.8         3.1          1.6         0.2 setosa 
## 2          5.8         4            1.2         0.2 setosa 
## 3          4.3         3            1.1         0.1 setosa

Greater than >

filter(iris_sub, Sepal.Length > 5)

## # A tibble: 7 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          5.8         4            1.2         0.2 setosa    
## 2          6.9         3.1          4.9         1.5 versicolor
## 3          6.1         3            4.6         1.4 versicolor
## 4          5.8         2.6          4           1.2 versicolor
## 5          6.3         3.4          5.6         2.4 virginica 
## 6          5.7         2.5          5           2   virginica 
## 7          6.7         3.3          5.7         2.1 virginica

Greater than and equal to >=

filter(iris_sub, Sepal.Length >= 5)

## # A tibble: 7 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          5.8         4            1.2         0.2 setosa    
## 2          6.9         3.1          4.9         1.5 versicolor
## 3          6.1         3            4.6         1.4 versicolor
## 4          5.8         2.6          4           1.2 versicolor
## 5          6.3         3.4          5.6         2.4 virginica 
## 6          5.7         2.5          5           2   virginica 
## 7          6.7         3.3          5.7         2.1 virginica

Less than <

filter(iris_sub, Sepal.Length < 5)

## # A tibble: 2 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          4.8         3.1          1.6         0.2 setosa 
## 2          4.3         3            1.1         0.1 setosa

Less than and equal to <=

filter(iris_sub, Sepal.Length <= 5)

## # A tibble: 2 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          4.8         3.1          1.6         0.2 setosa 
## 2          4.3         3            1.1         0.1 setosa

Multiple conditions (AND) & (or ,)

# Sepal.Length is less than 5 AND Species equals "setosa"
filter(iris_sub,
       Sepal.Length < 5 & Species == "setosa")

## # A tibble: 2 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          4.8         3.1          1.6         0.2 setosa 
## 2          4.3         3            1.1         0.1 setosa

# same; "," works like "&"
filter(iris_sub,
       Sepal.Length < 5, Species == "setosa")

## # A tibble: 2 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          4.8         3.1          1.6         0.2 setosa 
## 2          4.3         3            1.1         0.1 setosa

Multiple conditions (OR) |

# Either Sepal.Length is less than 5 OR Species equals "setosa"
filter(iris_sub,
       Sepal.Length < 5 | Species == "setosa")

## # A tibble: 3 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          4.8         3.1          1.6         0.2 setosa 
## 2          5.8         4            1.2         0.2 setosa 
## 3          4.3         3            1.1         0.1 setosa

5.1.2 Arrange rows

The arrange() function (package dplyr) is used to reorder rows of a data frame (or tibble) based on the values in one or more columns.

Increasing/ascending order

arrange(iris_sub, Sepal.Length)

## # A tibble: 9 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          4.3         3            1.1         0.1 setosa    
## 2          4.8         3.1          1.6         0.2 setosa    
## 3          5.7         2.5          5           2   virginica 
## 4          5.8         4            1.2         0.2 setosa    
## 5          5.8         2.6          4           1.2 versicolor
## 6          6.1         3            4.6         1.4 versicolor
## 7          6.3         3.4          5.6         2.4 virginica 
## 8          6.7         3.3          5.7         2.1 virginica 
## 9          6.9         3.1          4.9         1.5 versicolor

Decreasing/descending order

arrange(iris_sub, desc(Sepal.Length))

## # A tibble: 9 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          6.9         3.1          4.9         1.5 versicolor
## 2          6.7         3.3          5.7         2.1 virginica 
## 3          6.3         3.4          5.6         2.4 virginica 
## 4          6.1         3            4.6         1.4 versicolor
## 5          5.8         4            1.2         0.2 setosa    
## 6          5.8         2.6          4           1.2 versicolor
## 7          5.7         2.5          5           2   virginica 
## 8          4.8         3.1          1.6         0.2 setosa    
## 9          4.3         3            1.1         0.1 setosa

5.1.3 Exercise

Using iris_sub dataframe, select rows that match the following contidions

Sepal.Width is greater than 3.0 and assign the new dataframe to iris_3
Species is "setosa" and assign the new dataframe to iris_setosa
Sepal.Width is greater than 3.0 AND Species is "setosa", and assign the new dataframe to iris_3_setosa

5.2 Column Manipulation

5.2.1 Select columns

The select() function (package dplyr) is used to choose specific columns from a data frame or tibble. It is particularly useful for narrowing down a dataset to focus on relevant variables.

Select one column

select(iris_sub, Sepal.Length)

## # A tibble: 9 × 1
##   Sepal.Length
##          <dbl>
## 1          4.8
## 2          5.8
## 3          4.3
## 4          6.9
## 5          6.1
## 6          5.8
## 7          6.3
## 8          5.7
## 9          6.7

Select multiple columns

select(iris_sub, c(Sepal.Length, Sepal.Width))

## # A tibble: 9 × 2
##   Sepal.Length Sepal.Width
##          <dbl>       <dbl>
## 1          4.8         3.1
## 2          5.8         4  
## 3          4.3         3  
## 4          6.9         3.1
## 5          6.1         3  
## 6          5.8         2.6
## 7          6.3         3.4
## 8          5.7         2.5
## 9          6.7         3.3

Remove one column

select(iris_sub, -Sepal.Length)

## # A tibble: 9 × 4
##   Sepal.Width Petal.Length Petal.Width Species   
##         <dbl>        <dbl>       <dbl> <fct>     
## 1         3.1          1.6         0.2 setosa    
## 2         4            1.2         0.2 setosa    
## 3         3            1.1         0.1 setosa    
## 4         3.1          4.9         1.5 versicolor
## 5         3            4.6         1.4 versicolor
## 6         2.6          4           1.2 versicolor
## 7         3.4          5.6         2.4 virginica 
## 8         2.5          5           2   virginica 
## 9         3.3          5.7         2.1 virginica

Remove multiple columns

select(iris_sub, -c(Sepal.Length, Sepal.Width))

## # A tibble: 9 × 3
##   Petal.Length Petal.Width Species   
##          <dbl>       <dbl> <fct>     
## 1          1.6         0.2 setosa    
## 2          1.2         0.2 setosa    
## 3          1.1         0.1 setosa    
## 4          4.9         1.5 versicolor
## 5          4.6         1.4 versicolor
## 6          4           1.2 versicolor
## 7          5.6         2.4 virginica 
## 8          5           2   virginica 
## 9          5.7         2.1 virginica

Select/Remove with starts_with()

# select columns starting with "Sepal"
select(iris_sub, starts_with("Sepal"))

## # A tibble: 9 × 2
##   Sepal.Length Sepal.Width
##          <dbl>       <dbl>
## 1          4.8         3.1
## 2          5.8         4  
## 3          4.3         3  
## 4          6.9         3.1
## 5          6.1         3  
## 6          5.8         2.6
## 7          6.3         3.4
## 8          5.7         2.5
## 9          6.7         3.3

# remove columns starting with "Sepal"
select(iris_sub, -starts_with("Sepal"))

## # A tibble: 9 × 3
##   Petal.Length Petal.Width Species   
##          <dbl>       <dbl> <fct>     
## 1          1.6         0.2 setosa    
## 2          1.2         0.2 setosa    
## 3          1.1         0.1 setosa    
## 4          4.9         1.5 versicolor
## 5          4.6         1.4 versicolor
## 6          4           1.2 versicolor
## 7          5.6         2.4 virginica 
## 8          5           2   virginica 
## 9          5.7         2.1 virginica

Select with ends_with()

# select columns ending with "Sepal"
select(iris_sub, ends_with("Width"))

## # A tibble: 9 × 2
##   Sepal.Width Petal.Width
##         <dbl>       <dbl>
## 1         3.1         0.2
## 2         4           0.2
## 3         3           0.1
## 4         3.1         1.5
## 5         3           1.4
## 6         2.6         1.2
## 7         3.4         2.4
## 8         2.5         2  
## 9         3.3         2.1

# remove columns ending with "Sepal"
select(iris_sub, -ends_with("Width"))

## # A tibble: 9 × 3
##   Sepal.Length Petal.Length Species   
##          <dbl>        <dbl> <fct>     
## 1          4.8          1.6 setosa    
## 2          5.8          1.2 setosa    
## 3          4.3          1.1 setosa    
## 4          6.9          4.9 versicolor
## 5          6.1          4.6 versicolor
## 6          5.8          4   versicolor
## 7          6.3          5.6 virginica 
## 8          5.7          5   virginica 
## 9          6.7          5.7 virginica

5.2.2 Add columns

The mutate() function (package dplyr) package and is used to add new variables or modify existing ones in a data frame or tibble. It allows you to apply transformations to columns without changing the original dataset.

Add a new column

# nrow() returns the number of rows of the dataframe
(x_max <- nrow(iris_sub))

## [1] 9

# create a vector from 1 to x_max
x <- 1:x_max

# add as a new column
# named `x` as `row_id` when added
mutate(iris_sub, row_id = x)

## # A tibble: 9 × 6
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    row_id
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>       <int>
## 1          4.8         3.1          1.6         0.2 setosa          1
## 2          5.8         4            1.2         0.2 setosa          2
## 3          4.3         3            1.1         0.1 setosa          3
## 4          6.9         3.1          4.9         1.5 versicolor      4
## 5          6.1         3            4.6         1.4 versicolor      5
## 6          5.8         2.6          4           1.2 versicolor      6
## 7          6.3         3.4          5.6         2.4 virginica       7
## 8          5.7         2.5          5           2   virginica       8
## 9          6.7         3.3          5.7         2.1 virginica       9

Modify an existing column

# twice `Sepal.Length` and add as a new column
mutate(iris_sub, sl_two_times = 2 * Sepal.Length)

## # A tibble: 9 × 6
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    sl_two_times
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>             <dbl>
## 1          4.8         3.1          1.6         0.2 setosa              9.6
## 2          5.8         4            1.2         0.2 setosa             11.6
## 3          4.3         3            1.1         0.1 setosa              8.6
## 4          6.9         3.1          4.9         1.5 versicolor         13.8
## 5          6.1         3            4.6         1.4 versicolor         12.2
## 6          5.8         2.6          4           1.2 versicolor         11.6
## 7          6.3         3.4          5.6         2.4 virginica          12.6
## 8          5.7         2.5          5           2   virginica          11.4
## 9          6.7         3.3          5.7         2.1 virginica          13.4

5.2.3 Exercise

Using iris_sub dataframe, select columns that match the following contidions

Select column Petal.Width and Species and assign the new dataframe to iris_pw
Select columns starting with text "Petal" and assign the new dataframe to iris_petal
Add new column pw_two_times by doubling values in column Petal.Width, and assign the new dataframe to iris_pw_two

5.3 Piping

5.3.1 Pipe `%>%`

%>% (pipe) allows sequential operations of multiple functions (hot key: Ctr + Shift + M). The pipe passes the object to the following function as the first argument. For example, the following code subset rows by Species column, then select column Sepal.Length.

## 
df_vir <- filter(iris_sub, Species == "virginica")
df_vir_sl <- select(df_vir, Sepal.Length)

print(df_vir_sl)

## # A tibble: 3 × 1
##   Sepal.Length
##          <dbl>
## 1          6.3
## 2          5.7
## 3          6.7

With piping, the above example becomes:

df_vir_sl <- iris_sub %>% 
  filter(Species == "virginica") %>% 
  select(Sepal.Length)

print(df_vir_sl)

## # A tibble: 3 × 1
##   Sepal.Length
##          <dbl>
## 1          6.3
## 2          5.7
## 3          6.7

5.3.2 Exercise

Subset iris_sub by Species column (choose only "setosa" ) and add a new column pw_two_times by doubling values in column Petal.Width. Assign the resultant dataframe to iris_pipe. USE pipe %>% in this operation.

5.4 Group Operation

5.4.1 Grouping `group_by()`

Often, we may want to calculate summary statistics by group. In such cases, the group_by() function in R provides an effective way to group the data before applying summary functions.

In the original iris_sub dataframe, no group structure is imposed, meaning that when you print the dataframe, it does not display any information about grouping.

print(iris_sub)

## # A tibble: 9 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          4.8         3.1          1.6         0.2 setosa    
## 2          5.8         4            1.2         0.2 setosa    
## 3          4.3         3            1.1         0.1 setosa    
## 4          6.9         3.1          4.9         1.5 versicolor
## 5          6.1         3            4.6         1.4 versicolor
## 6          5.8         2.6          4           1.2 versicolor
## 7          6.3         3.4          5.6         2.4 virginica 
## 8          5.7         2.5          5           2   virginica 
## 9          6.7         3.3          5.7         2.1 virginica

Now, let’s add a group structure to the dataframe. The following code groups the data based on the values in the Species column:

iris_sub %>% 
  group_by(Species)

## # A tibble: 9 × 5
## # Groups:   Species [3]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          4.8         3.1          1.6         0.2 setosa    
## 2          5.8         4            1.2         0.2 setosa    
## 3          4.3         3            1.1         0.1 setosa    
## 4          6.9         3.1          4.9         1.5 versicolor
## 5          6.1         3            4.6         1.4 versicolor
## 6          5.8         2.6          4           1.2 versicolor
## 7          6.3         3.4          5.6         2.4 virginica 
## 8          5.7         2.5          5           2   virginica 
## 9          6.7         3.3          5.7         2.1 virginica

The dataframe itself remains unchanged, but you’ll notice additional text at the top, indicating Groups: Species[3]. This means that the dataframe now contains three groups, as defined by the unique values in the Species column.

5.4.2 Summarize with `summarize()`

The group_by() function becomes especially useful when combined with functions like summarize() or mutate(). For instance, to calculate the mean values for each species in the iris_sub dataframe, you can use the following code:

iris_sub %>% 
  group_by(Species) %>% 
  summarize(mu_sl = mean(Sepal.Length))

## # A tibble: 3 × 2
##   Species    mu_sl
##   <fct>      <dbl>
## 1 setosa      4.97
## 2 versicolor  6.27
## 3 virginica   6.23

If you want to calculate multiple summary statistics, you can separate the operations with a comma , as shown below:

iris_sub %>% 
  group_by(Species) %>% 
  summarize(mu_sl = mean(Sepal.Length),
            sum_sl = sum(Sepal.Length))

## # A tibble: 3 × 3
##   Species    mu_sl sum_sl
##   <fct>      <dbl>  <dbl>
## 1 setosa      4.97   14.9
## 2 versicolor  6.27   18.8
## 3 virginica   6.23   18.7

5.4.3 Summarize with `mutate()`

The summarize() function returns a summary table, with each group represented by a single row. On the other hand, mutate() retains the original individual rows, adding summary columns. In this case, each row within the same group will have the same summary value. Remember to use ungroup() after grouping operations to prevent errors in subsequent operations.

# grouping by "Species", then take means "Speal.Length" for each species
iris_sub %>% 
  group_by(Species) %>% 
  mutate(mu_sl = mean(Sepal.Length)) %>% 
  ungroup()

## # A tibble: 9 × 6
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    mu_sl
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>      <dbl>
## 1          4.8         3.1          1.6         0.2 setosa      4.97
## 2          5.8         4            1.2         0.2 setosa      4.97
## 3          4.3         3            1.1         0.1 setosa      4.97
## 4          6.9         3.1          4.9         1.5 versicolor  6.27
## 5          6.1         3            4.6         1.4 versicolor  6.27
## 6          5.8         2.6          4           1.2 versicolor  6.27
## 7          6.3         3.4          5.6         2.4 virginica   6.23
## 8          5.7         2.5          5           2   virginica   6.23
## 9          6.7         3.3          5.7         2.1 virginica   6.23

5.5 Reshape

pivot_wider() : reshape a data frame to a wide format

iris_w <- iris %>% 
  mutate(id = rep(1:50, 3)) %>% # add an ID column
  select(id, Sepal.Length, Species) %>% 
  pivot_wider(id_cols = "id", # unique row ID based on
              values_from = "Sepal.Length", # values in each cell from
              names_from = "Species") # new column names from

print(iris_w)

## # A tibble: 50 × 4
##       id setosa versicolor virginica
##    <int>  <dbl>      <dbl>     <dbl>
##  1     1    5.1        7         6.3
##  2     2    4.9        6.4       5.8
##  3     3    4.7        6.9       7.1
##  4     4    4.6        5.5       6.3
##  5     5    5          6.5       6.5
##  6     6    5.4        5.7       7.6
##  7     7    4.6        6.3       4.9
##  8     8    5          4.9       7.3
##  9     9    4.4        6.6       6.7
## 10    10    4.9        5.2       7.2
## # ℹ 40 more rows

pivot_longer(): reshape a data frame to a long format

iris_l <- iris_w %>% 
  pivot_longer(cols = c("setosa",
                        "versicolor",
                        "virginica"), # columns with values to be reshaped
               names_to = "Species", # column IDs move to "Species"
               values_to = "Sepal.Length") # column values move to "Sepal.Length"

print(iris_l)

## # A tibble: 150 × 3
##       id Species    Sepal.Length
##    <int> <chr>             <dbl>
##  1     1 setosa              5.1
##  2     1 versicolor          7  
##  3     1 virginica           6.3
##  4     2 setosa              4.9
##  5     2 versicolor          6.4
##  6     2 virginica           5.8
##  7     3 setosa              4.7
##  8     3 versicolor          6.9
##  9     3 virginica           7.1
## 10     4 setosa              4.6
## # ℹ 140 more rows

5.6 Join

left_join(): merge data frames based on column(s)

# matching by a single column
## left join by "Species": one to one
df1 <- tibble(Species = c("A", "B", "C"),
              x = c(1, 2, 3))

df2 <- tibble(Species = c("A", "B", "C"),
              y = c(4, 5, 6))

left_join(x = df1,
          y = df2,
          by = "Species")

## # A tibble: 3 × 3
##   Species     x     y
##   <chr>   <dbl> <dbl>
## 1 A           1     4
## 2 B           2     5
## 3 C           3     6

# matching by a single column
## left join by "Species": one to many
df3 <- tibble(Species = c("A", "A", "B", "C"),
              y = c(4, 5, 6, 7))

left_join(x = df1,
          y = df3,
          by = "Species")

## # A tibble: 4 × 3
##   Species     x     y
##   <chr>   <dbl> <dbl>
## 1 A           1     4
## 2 A           1     5
## 3 B           2     6
## 4 C           3     7

# matching by a single column
## left join by "Species": one to missing
df4 <- tibble(Species = c("A", "A", "C"),
              y = c(4, 5, 7))

left_join(x = df1,
          y = df4,
          by = "Species")

## # A tibble: 4 × 3
##   Species     x     y
##   <chr>   <dbl> <dbl>
## 1 A           1     4
## 2 A           1     5
## 3 B           2    NA
## 4 C           3     7

# matching by multiple columns
## one to one
df5 <- tibble(Species = c("A", "B", "C"),
              x = c(1, 2, 3),
              z = c("cool", "awesome", "magical"))

left_join(x = df1,
          y = df5,
          by = c("Species", "x"))

## # A tibble: 3 × 3
##   Species     x z      
##   <chr>   <dbl> <chr>  
## 1 A           1 cool   
## 2 B           2 awesome
## 3 C           3 magical

# matching by multiple columns
## one to many
df6 <- tibble(Species = c("A", "A", "B", "C"),
              x = c(1, 1, 2, 3),
              z = c("cool", "cool", "awesome", "magical"))

left_join(x = df1,
          y = df6,
          by = c("Species", "x"))

## # A tibble: 4 × 3
##   Species     x z      
##   <chr>   <dbl> <chr>  
## 1 A           1 cool   
## 2 A           1 cool   
## 3 B           2 awesome
## 4 C           3 magical

# matching by multiple columns
## one to missing
df6 <- tibble(Species = c("A", "B", "C"),
              x = c(1, 2, 4),
              z = c("cool", "awesome", "magical"))

left_join(x = df1,
          y = df6,
          by = c("Species", "x"))

## # A tibble: 3 × 3
##   Species     x z      
##   <chr>   <dbl> <chr>  
## 1 A           1 cool   
## 2 B           2 awesome
## 3 C           3 <NA>

4 Data Structure

6 Visualization