5 Data Manipulation
R packages are collections of functions that extend the functionality of the R programming language. They provide a convenient way for users to access and utilize specialized tools for data analysis, visualization, statistical modeling, and more. Among them, tidyverse
provides very useful functions for data manipulation and visualization.
tidyverse::
bundles R packages designed to make data manipulation, exploration, and visualization more efficient and intuitive. Developed by Hadley Wickham and other contributors, the Tidyverse packages share a common philosophy and syntax, emphasizing a consistent and tidy data format. The core packages, such as dplyr::
, tidyr::
, and ggplot2::
, provide powerful tools for data wrangling, reshaping, and creating visualizations.
In this Chapter, we will use iris
data, which is available by default in R. While the default data structure iris
is prepared in data.frame()
, I convert this data format to tibble()
in this exercise. Copy and run the code below. We will use iris_sub
in the following demonstration.
library(tidyverse)
set.seed(123)
iris_sub <- as_tibble(iris) %>%
group_by(Species) %>%
sample_n(3) %>%
ungroup()
print(iris_sub)
## # A tibble: 9 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.8 3.1 1.6 0.2 setosa
## 2 5.8 4 1.2 0.2 setosa
## 3 4.3 3 1.1 0.1 setosa
## 4 6.9 3.1 4.9 1.5 versicolor
## 5 6.1 3 4.6 1.4 versicolor
## 6 5.8 2.6 4 1.2 versicolor
## 7 6.3 3.4 5.6 2.4 virginica
## 8 5.7 2.5 5 2 virginica
## 9 6.7 3.3 5.7 2.1 virginica
I cover the basics for those who are not familiar with tidyverse::
, but there is plenty of documentation available for the tidyverse package.
In particular, dplyr::
and tidyr::
(and others) offer useful functions for data manipulation. Key manipulations include: row/column manipulations, group operation, reshape, and more.
5.1 Row Manipulation
5.1.1 Subset rows
The filter()
function (dplyr
package) is used for subsetting rows of a data frame (or tibble) based on conditions.
Single match ==
filter(iris_sub, Species == "virginica")
## # A tibble: 3 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 6.3 3.4 5.6 2.4 virginica
## 2 5.7 2.5 5 2 virginica
## 3 6.7 3.3 5.7 2.1 virginica
Multiple match %in%
## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 6.9 3.1 4.9 1.5 versicolor
## 2 6.1 3 4.6 1.4 versicolor
## 3 5.8 2.6 4 1.2 versicolor
## 4 6.3 3.4 5.6 2.4 virginica
## 5 5.7 2.5 5 2 virginica
## 6 6.7 3.3 5.7 2.1 virginica
Except !=
filter(iris_sub, Species != "virginica")
## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.8 3.1 1.6 0.2 setosa
## 2 5.8 4 1.2 0.2 setosa
## 3 4.3 3 1.1 0.1 setosa
## 4 6.9 3.1 4.9 1.5 versicolor
## 5 6.1 3 4.6 1.4 versicolor
## 6 5.8 2.6 4 1.2 versicolor
Except multiple !=
## # A tibble: 3 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.8 3.1 1.6 0.2 setosa
## 2 5.8 4 1.2 0.2 setosa
## 3 4.3 3 1.1 0.1 setosa
Greater than >
filter(iris_sub, Sepal.Length > 5)
## # A tibble: 7 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.8 4 1.2 0.2 setosa
## 2 6.9 3.1 4.9 1.5 versicolor
## 3 6.1 3 4.6 1.4 versicolor
## 4 5.8 2.6 4 1.2 versicolor
## 5 6.3 3.4 5.6 2.4 virginica
## 6 5.7 2.5 5 2 virginica
## 7 6.7 3.3 5.7 2.1 virginica
Greater than and equal to >=
filter(iris_sub, Sepal.Length >= 5)
## # A tibble: 7 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.8 4 1.2 0.2 setosa
## 2 6.9 3.1 4.9 1.5 versicolor
## 3 6.1 3 4.6 1.4 versicolor
## 4 5.8 2.6 4 1.2 versicolor
## 5 6.3 3.4 5.6 2.4 virginica
## 6 5.7 2.5 5 2 virginica
## 7 6.7 3.3 5.7 2.1 virginica
Less than <
filter(iris_sub, Sepal.Length < 5)
## # A tibble: 2 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.8 3.1 1.6 0.2 setosa
## 2 4.3 3 1.1 0.1 setosa
Less than and equal to <=
filter(iris_sub, Sepal.Length <= 5)
## # A tibble: 2 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.8 3.1 1.6 0.2 setosa
## 2 4.3 3 1.1 0.1 setosa
Multiple conditions (AND) &
(or ,
)
# Sepal.Length is less than 5 AND Species equals "setosa"
filter(iris_sub,
Sepal.Length < 5 & Species == "setosa")
## # A tibble: 2 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.8 3.1 1.6 0.2 setosa
## 2 4.3 3 1.1 0.1 setosa
# same; "," works like "&"
filter(iris_sub,
Sepal.Length < 5, Species == "setosa")
## # A tibble: 2 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.8 3.1 1.6 0.2 setosa
## 2 4.3 3 1.1 0.1 setosa
Multiple conditions (OR) |
# Either Sepal.Length is less than 5 OR Species equals "setosa"
filter(iris_sub,
Sepal.Length < 5 | Species == "setosa")
## # A tibble: 3 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.8 3.1 1.6 0.2 setosa
## 2 5.8 4 1.2 0.2 setosa
## 3 4.3 3 1.1 0.1 setosa
5.1.2 Arrange rows
The arrange()
function (package dplyr
) is used to reorder rows of a data frame (or tibble) based on the values in one or more columns.
Increasing/ascending order
arrange(iris_sub, Sepal.Length)
## # A tibble: 9 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.3 3 1.1 0.1 setosa
## 2 4.8 3.1 1.6 0.2 setosa
## 3 5.7 2.5 5 2 virginica
## 4 5.8 4 1.2 0.2 setosa
## 5 5.8 2.6 4 1.2 versicolor
## 6 6.1 3 4.6 1.4 versicolor
## 7 6.3 3.4 5.6 2.4 virginica
## 8 6.7 3.3 5.7 2.1 virginica
## 9 6.9 3.1 4.9 1.5 versicolor
Decreasing/descending order
## # A tibble: 9 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 6.9 3.1 4.9 1.5 versicolor
## 2 6.7 3.3 5.7 2.1 virginica
## 3 6.3 3.4 5.6 2.4 virginica
## 4 6.1 3 4.6 1.4 versicolor
## 5 5.8 4 1.2 0.2 setosa
## 6 5.8 2.6 4 1.2 versicolor
## 7 5.7 2.5 5 2 virginica
## 8 4.8 3.1 1.6 0.2 setosa
## 9 4.3 3 1.1 0.1 setosa
5.1.3 Exercise
Using iris_sub
dataframe, select rows that match the following contidions
Sepal.Width
is greater than3.0
and assign the new dataframe toiris_3
Species
is"setosa"
and assign the new dataframe toiris_setosa
Sepal.Width
is greater than3.0
ANDSpecies
is"setosa"
, and assign the new dataframe toiris_3_setosa
5.2 Column Manipulation
5.2.1 Select columns
The select()
function (package dplyr
) is used to choose specific columns from a data frame or tibble. It is particularly useful for narrowing down a dataset to focus on relevant variables.
Select one column
select(iris_sub, Sepal.Length)
## # A tibble: 9 × 1
## Sepal.Length
## <dbl>
## 1 4.8
## 2 5.8
## 3 4.3
## 4 6.9
## 5 6.1
## 6 5.8
## 7 6.3
## 8 5.7
## 9 6.7
Select multiple columns
## # A tibble: 9 × 2
## Sepal.Length Sepal.Width
## <dbl> <dbl>
## 1 4.8 3.1
## 2 5.8 4
## 3 4.3 3
## 4 6.9 3.1
## 5 6.1 3
## 6 5.8 2.6
## 7 6.3 3.4
## 8 5.7 2.5
## 9 6.7 3.3
Remove one column
select(iris_sub, -Sepal.Length)
## # A tibble: 9 × 4
## Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <fct>
## 1 3.1 1.6 0.2 setosa
## 2 4 1.2 0.2 setosa
## 3 3 1.1 0.1 setosa
## 4 3.1 4.9 1.5 versicolor
## 5 3 4.6 1.4 versicolor
## 6 2.6 4 1.2 versicolor
## 7 3.4 5.6 2.4 virginica
## 8 2.5 5 2 virginica
## 9 3.3 5.7 2.1 virginica
Remove multiple columns
## # A tibble: 9 × 3
## Petal.Length Petal.Width Species
## <dbl> <dbl> <fct>
## 1 1.6 0.2 setosa
## 2 1.2 0.2 setosa
## 3 1.1 0.1 setosa
## 4 4.9 1.5 versicolor
## 5 4.6 1.4 versicolor
## 6 4 1.2 versicolor
## 7 5.6 2.4 virginica
## 8 5 2 virginica
## 9 5.7 2.1 virginica
Select/Remove with starts_with()
# select columns starting with "Sepal"
select(iris_sub, starts_with("Sepal"))
## # A tibble: 9 × 2
## Sepal.Length Sepal.Width
## <dbl> <dbl>
## 1 4.8 3.1
## 2 5.8 4
## 3 4.3 3
## 4 6.9 3.1
## 5 6.1 3
## 6 5.8 2.6
## 7 6.3 3.4
## 8 5.7 2.5
## 9 6.7 3.3
# remove columns starting with "Sepal"
select(iris_sub, -starts_with("Sepal"))
## # A tibble: 9 × 3
## Petal.Length Petal.Width Species
## <dbl> <dbl> <fct>
## 1 1.6 0.2 setosa
## 2 1.2 0.2 setosa
## 3 1.1 0.1 setosa
## 4 4.9 1.5 versicolor
## 5 4.6 1.4 versicolor
## 6 4 1.2 versicolor
## 7 5.6 2.4 virginica
## 8 5 2 virginica
## 9 5.7 2.1 virginica
Select with ends_with()
## # A tibble: 9 × 2
## Sepal.Width Petal.Width
## <dbl> <dbl>
## 1 3.1 0.2
## 2 4 0.2
## 3 3 0.1
## 4 3.1 1.5
## 5 3 1.4
## 6 2.6 1.2
## 7 3.4 2.4
## 8 2.5 2
## 9 3.3 2.1
## # A tibble: 9 × 3
## Sepal.Length Petal.Length Species
## <dbl> <dbl> <fct>
## 1 4.8 1.6 setosa
## 2 5.8 1.2 setosa
## 3 4.3 1.1 setosa
## 4 6.9 4.9 versicolor
## 5 6.1 4.6 versicolor
## 6 5.8 4 versicolor
## 7 6.3 5.6 virginica
## 8 5.7 5 virginica
## 9 6.7 5.7 virginica
5.2.2 Add columns
The mutate()
function (package dplyr
) package and is used to add new variables or modify existing ones in a data frame or tibble. It allows you to apply transformations to columns without changing the original dataset.
Add a new column
# nrow() returns the number of rows of the dataframe
(x_max <- nrow(iris_sub))
## [1] 9
# create a vector from 1 to x_max
x <- 1:x_max
# add as a new column
# named `x` as `row_id` when added
mutate(iris_sub, row_id = x)
## # A tibble: 9 × 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species row_id
## <dbl> <dbl> <dbl> <dbl> <fct> <int>
## 1 4.8 3.1 1.6 0.2 setosa 1
## 2 5.8 4 1.2 0.2 setosa 2
## 3 4.3 3 1.1 0.1 setosa 3
## 4 6.9 3.1 4.9 1.5 versicolor 4
## 5 6.1 3 4.6 1.4 versicolor 5
## 6 5.8 2.6 4 1.2 versicolor 6
## 7 6.3 3.4 5.6 2.4 virginica 7
## 8 5.7 2.5 5 2 virginica 8
## 9 6.7 3.3 5.7 2.1 virginica 9
Modify an existing column
# twice `Sepal.Length` and add as a new column
mutate(iris_sub, sl_two_times = 2 * Sepal.Length)
## # A tibble: 9 × 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species sl_two_times
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 4.8 3.1 1.6 0.2 setosa 9.6
## 2 5.8 4 1.2 0.2 setosa 11.6
## 3 4.3 3 1.1 0.1 setosa 8.6
## 4 6.9 3.1 4.9 1.5 versicolor 13.8
## 5 6.1 3 4.6 1.4 versicolor 12.2
## 6 5.8 2.6 4 1.2 versicolor 11.6
## 7 6.3 3.4 5.6 2.4 virginica 12.6
## 8 5.7 2.5 5 2 virginica 11.4
## 9 6.7 3.3 5.7 2.1 virginica 13.4
5.2.3 Exercise
Using iris_sub
dataframe, select columns that match the following contidions
Select column
Petal.Width
andSpecies
and assign the new dataframe toiris_pw
Select columns starting with text
"Petal"
and assign the new dataframe toiris_petal
Add new column
pw_two_times
by doubling values in columnPetal.Width
, and assign the new dataframe toiris_pw_two
5.3 Piping
5.3.1 Pipe %>%
%>%
(pipe) allows sequential operations of multiple functions (hot key: Ctr + Shift + M
). The pipe passes the object to the following function as the first argument. For example, the following code subset rows by Species
column, then select column Sepal.Length
.
##
df_vir <- filter(iris_sub, Species == "virginica")
df_vir_sl <- select(df_vir, Sepal.Length)
print(df_vir_sl)
## # A tibble: 3 × 1
## Sepal.Length
## <dbl>
## 1 6.3
## 2 5.7
## 3 6.7
With piping, the above example becomes:
## # A tibble: 3 × 1
## Sepal.Length
## <dbl>
## 1 6.3
## 2 5.7
## 3 6.7
5.4 Group Operation
5.4.1 Grouping group_by()
Often, we may want to calculate summary statistics by group. In such cases, the group_by()
function in R provides an effective way to group the data before applying summary functions.
In the original iris_sub
dataframe, no group structure is imposed, meaning that when you print the dataframe, it does not display any information about grouping.
print(iris_sub)
## # A tibble: 9 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.8 3.1 1.6 0.2 setosa
## 2 5.8 4 1.2 0.2 setosa
## 3 4.3 3 1.1 0.1 setosa
## 4 6.9 3.1 4.9 1.5 versicolor
## 5 6.1 3 4.6 1.4 versicolor
## 6 5.8 2.6 4 1.2 versicolor
## 7 6.3 3.4 5.6 2.4 virginica
## 8 5.7 2.5 5 2 virginica
## 9 6.7 3.3 5.7 2.1 virginica
Now, let’s add a group structure to the dataframe. The following code groups the data based on the values in the Species
column:
## # A tibble: 9 × 5
## # Groups: Species [3]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.8 3.1 1.6 0.2 setosa
## 2 5.8 4 1.2 0.2 setosa
## 3 4.3 3 1.1 0.1 setosa
## 4 6.9 3.1 4.9 1.5 versicolor
## 5 6.1 3 4.6 1.4 versicolor
## 6 5.8 2.6 4 1.2 versicolor
## 7 6.3 3.4 5.6 2.4 virginica
## 8 5.7 2.5 5 2 virginica
## 9 6.7 3.3 5.7 2.1 virginica
The dataframe itself remains unchanged, but you’ll notice additional text at the top, indicating Groups: Species[3]
. This means that the dataframe now contains three groups, as defined by the unique values in the Species
column.
5.4.2 Summarize with summarize()
The group_by()
function becomes especially useful when combined with functions like summarize()
or mutate()
. For instance, to calculate the mean values for each species in the iris_sub
dataframe, you can use the following code:
## # A tibble: 3 × 2
## Species mu_sl
## <fct> <dbl>
## 1 setosa 4.97
## 2 versicolor 6.27
## 3 virginica 6.23
If you want to calculate multiple summary statistics, you can separate the operations with a comma ,
as shown below:
iris_sub %>%
group_by(Species) %>%
summarize(mu_sl = mean(Sepal.Length),
sum_sl = sum(Sepal.Length))
## # A tibble: 3 × 3
## Species mu_sl sum_sl
## <fct> <dbl> <dbl>
## 1 setosa 4.97 14.9
## 2 versicolor 6.27 18.8
## 3 virginica 6.23 18.7
5.4.3 Summarize with mutate()
The summarize()
function returns a summary table, with each group represented by a single row. On the other hand, mutate()
retains the original individual rows, adding summary columns. In this case, each row within the same group will have the same summary value. Remember to use ungroup()
after grouping operations to prevent errors in subsequent operations.
# grouping by "Species", then take means "Speal.Length" for each species
iris_sub %>%
group_by(Species) %>%
mutate(mu_sl = mean(Sepal.Length)) %>%
ungroup()
## # A tibble: 9 × 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species mu_sl
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 4.8 3.1 1.6 0.2 setosa 4.97
## 2 5.8 4 1.2 0.2 setosa 4.97
## 3 4.3 3 1.1 0.1 setosa 4.97
## 4 6.9 3.1 4.9 1.5 versicolor 6.27
## 5 6.1 3 4.6 1.4 versicolor 6.27
## 6 5.8 2.6 4 1.2 versicolor 6.27
## 7 6.3 3.4 5.6 2.4 virginica 6.23
## 8 5.7 2.5 5 2 virginica 6.23
## 9 6.7 3.3 5.7 2.1 virginica 6.23
5.5 Reshape
pivot_wider()
: reshape a data frame to a wide format
iris_w <- iris %>%
mutate(id = rep(1:50, 3)) %>% # add an ID column
select(id, Sepal.Length, Species) %>%
pivot_wider(id_cols = "id", # unique row ID based on
values_from = "Sepal.Length", # values in each cell from
names_from = "Species") # new column names from
print(iris_w)
## # A tibble: 50 × 4
## id setosa versicolor virginica
## <int> <dbl> <dbl> <dbl>
## 1 1 5.1 7 6.3
## 2 2 4.9 6.4 5.8
## 3 3 4.7 6.9 7.1
## 4 4 4.6 5.5 6.3
## 5 5 5 6.5 6.5
## 6 6 5.4 5.7 7.6
## 7 7 4.6 6.3 4.9
## 8 8 5 4.9 7.3
## 9 9 4.4 6.6 6.7
## 10 10 4.9 5.2 7.2
## # ℹ 40 more rows
pivot_longer()
: reshape a data frame to a long format
iris_l <- iris_w %>%
pivot_longer(cols = c("setosa",
"versicolor",
"virginica"), # columns with values to be reshaped
names_to = "Species", # column IDs move to "Species"
values_to = "Sepal.Length") # column values move to "Sepal.Length"
print(iris_l)
## # A tibble: 150 × 3
## id Species Sepal.Length
## <int> <chr> <dbl>
## 1 1 setosa 5.1
## 2 1 versicolor 7
## 3 1 virginica 6.3
## 4 2 setosa 4.9
## 5 2 versicolor 6.4
## 6 2 virginica 5.8
## 7 3 setosa 4.7
## 8 3 versicolor 6.9
## 9 3 virginica 7.1
## 10 4 setosa 4.6
## # ℹ 140 more rows
5.6 Join
left_join()
: merge data frames based on column(s)
# matching by a single column
## left join by "Species": one to one
df1 <- tibble(Species = c("A", "B", "C"),
x = c(1, 2, 3))
df2 <- tibble(Species = c("A", "B", "C"),
y = c(4, 5, 6))
left_join(x = df1,
y = df2,
by = "Species")
## # A tibble: 3 × 3
## Species x y
## <chr> <dbl> <dbl>
## 1 A 1 4
## 2 B 2 5
## 3 C 3 6
# matching by a single column
## left join by "Species": one to many
df3 <- tibble(Species = c("A", "A", "B", "C"),
y = c(4, 5, 6, 7))
left_join(x = df1,
y = df3,
by = "Species")
## # A tibble: 4 × 3
## Species x y
## <chr> <dbl> <dbl>
## 1 A 1 4
## 2 A 1 5
## 3 B 2 6
## 4 C 3 7
# matching by a single column
## left join by "Species": one to missing
df4 <- tibble(Species = c("A", "A", "C"),
y = c(4, 5, 7))
left_join(x = df1,
y = df4,
by = "Species")
## # A tibble: 4 × 3
## Species x y
## <chr> <dbl> <dbl>
## 1 A 1 4
## 2 A 1 5
## 3 B 2 NA
## 4 C 3 7
# matching by multiple columns
## one to one
df5 <- tibble(Species = c("A", "B", "C"),
x = c(1, 2, 3),
z = c("cool", "awesome", "magical"))
left_join(x = df1,
y = df5,
by = c("Species", "x"))
## # A tibble: 3 × 3
## Species x z
## <chr> <dbl> <chr>
## 1 A 1 cool
## 2 B 2 awesome
## 3 C 3 magical
# matching by multiple columns
## one to many
df6 <- tibble(Species = c("A", "A", "B", "C"),
x = c(1, 1, 2, 3),
z = c("cool", "cool", "awesome", "magical"))
left_join(x = df1,
y = df6,
by = c("Species", "x"))
## # A tibble: 4 × 3
## Species x z
## <chr> <dbl> <chr>
## 1 A 1 cool
## 2 A 1 cool
## 3 B 2 awesome
## 4 C 3 magical
# matching by multiple columns
## one to missing
df6 <- tibble(Species = c("A", "B", "C"),
x = c(1, 2, 4),
z = c("cool", "awesome", "magical"))
left_join(x = df1,
y = df6,
by = c("Species", "x"))
## # A tibble: 3 × 3
## Species x z
## <chr> <dbl> <chr>
## 1 A 1 cool
## 2 B 2 awesome
## 3 C 3 <NA>