4 Data Structure
4.1 Overview
R has six basic data types:
- Character: e.g., "aquatic", "ecology" (no inherent order).
- Factor: Similar to character but includes levels, which are ordered alphabetically by default.
- Numeric: e.g., 20.0, 15.5.
- Integer: e.g., 3L, 7L (the L suffix marks a whole number as an integer; a bare 3 is stored as numeric).
- Logical: e.g., TRUE, FALSE.
- Complex: e.g., 1 + 2i (numbers with real and imaginary parts).
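A quick way to check these types is class(), which reports the class of an object. The sketch below uses made-up example values (the factor f is purely illustrative).

```r
# class() reports the data type (class) of an object
class("aquatic")   # character
class(20.0)        # numeric
class(3)           # numeric: a bare number is not stored as an integer
class(3L)          # integer: the L suffix requests an integer
class(TRUE)        # logical
class(1 + 2i)      # complex

# a factor stores its levels, sorted alphabetically by default
f <- factor(c("stream", "lake", "stream"))
class(f)           # factor
levels(f)          # "lake" "stream"
```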
These data types form the basis of R’s data structures, which include:
- Vector: A series of elements, all of the same data type.
- Matrix: Elements arranged into rows and columns, all of the same data type.
- Data frame: Similar to a matrix but allows different data types in different columns.
In the following examples, I will use (x <- something) to demonstrate code, as the outer parentheses () automatically print the contents of the object x. Without the parentheses, the values assigned to x are not displayed. For example, the following code assigns a value to x but does not print it:
x <- 1
To see the value assigned to x, you would need to type:
x
## [1] 1
Using parentheses around the assignment performs both steps, assigning the value and printing it, at once:
(x <- 1)
## [1] 1
4.2 Vector
4.2.1 Create Vector
Below are examples of atomic character vectors, numeric vectors, integer vectors, etc. There are many ways to create vector data. The following examples use c(), :, seq(), and rep():
Combine function c() combines multiple elements to create a single vector.
# ex.1a manually create a vector using c()
(x <- c(1, 3, 4, 8))
## [1] 1 3 4 8
# ex.1b character
(x <- c("a", "b", "c"))
## [1] "a" "b" "c"
# ex.1c logical
(x <- c(TRUE, FALSE, FALSE))
## [1] TRUE FALSE FALSE
Replicate function rep() replicates the same element(s) multiple times.
# ex.3a replicate same numbers or characters
(x <- rep(2, times = 5)) # replicate 2 five times
## [1] 2 2 2 2 2
# ex.3b replicate same numbers or characters
(x <- rep("a", 5)) # replicate "a" five times
## [1] "a" "a" "a" "a" "a"
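# ex.3c replicate each element of a vector
(x <- rep(c(1, 2), each = 2)) # replicate 1 and 2 two times each
## [1] 1 1 2 2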
Sequence function seq() creates a vector based on starting and ending values with a specified interval or a specified vector length.
# ex.4a use seq() function
# create a vector from 1 to 5 with interval 1
(x <- seq(1, 5, by = 1))
## [1] 1 2 3 4 5
# ex.4b use seq() function
# create a vector from 1 to 5 with 20 elements
(x <- seq(1, 5, length.out = 20))
## [1] 1.000000 1.210526 1.421053 1.631579 1.842105 2.052632 2.263158 2.473684
## [9] 2.684211 2.894737 3.105263 3.315789 3.526316 3.736842 3.947368 4.157895
## [17] 4.368421 4.578947 4.789474 5.000000
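The interval does not have to be a whole number. As a small sketch with arbitrary values, the same vector can be described either by its step or by its length:

```r
# step by 0.5 from 0 to 2
(x <- seq(0, 2, by = 0.5))
## [1] 0.0 0.5 1.0 1.5 2.0

# the same vector, specified by its length instead of its step
(y <- seq(0, 2, length.out = 5))
## [1] 0.0 0.5 1.0 1.5 2.0
```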
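Colon : creates a vector based on starting and ending integer values with an interval of one. It also works with non-integer values, but the behavior is odd: the sequence starts at the first value and increases by one, so it may stop short of the ending value (NOT recommended; use other functions for non-integer vectors).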
# ex.5a use colon `:` - interval is fixed at one
(x <- 1:5)
## [1] 1 2 3 4 5
# this works, but odd
# NOT recommended
1.5:4.7
## [1] 1.5 2.5 3.5 4.5
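If the ending value matters, seq() may be a safer choice than the colon. The following is only a sketch; the choice of five elements is arbitrary, and x_colon and x_seq are illustrative names.

```r
# colon: starts at 1.5, steps by one, stops at the last value not exceeding 4.7
x_colon <- 1.5:4.7
max(x_colon)   # 4.5, not 4.7

# seq(): both the starting and ending values are included
(x_seq <- seq(1.5, 4.7, length.out = 5))
range(x_seq)   # 1.5 4.7
```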
4.2.2 Extract Summary
Numerical or character information can be summarized using functions. Let’s try some basic functions to see how they work:
Mean function mean() calculates the arithmetic mean.
## [1] 15
Sum function sum() calculates the summation.
## [1] 45
Length function length() returns the number of elements in a vector.
## [1] 3
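The code that produced the three results above is not shown. As an assumption, the vector c(10, 15, 20) is consistent with all of them, so a sketch of the calls might look like this:

```r
# assumed input: other three-element vectors summing to 45 would also fit
x <- c(10, 15, 20)

mean(x)    # arithmetic mean
## [1] 15
sum(x)     # summation
## [1] 45
length(x)  # number of elements
## [1] 3
```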
4.2.3 Access
Element ID. Use brackets [] when accessing specific elements in an object. For example, if you want to access element #2 in the vector x, you may specify it as x[2]:
x <- c(2, 2, 3, 2, 5)
x[2] # access element #2
## [1] 2
x[c(2, 4)] # access elements #2 and 4
## [1] 2 2
x[2:4] # access elements #2-4
## [1] 2 3 2
Equation. R provides many ways to access elements that meet specific conditions. You can use mathematical symbols to specify what you need, for example:
- == equal to
- > larger than
- >= larger than or equal to
- < smaller than
- <= smaller than or equal to
- which() a function that returns the element # of each element that satisfies the specified condition
The following examples return a logical vector indicating whether each element in x satisfies the specified condition:
# creating a vector
x <- c(2, 2, 3, 2, 5)
# ex.1a equal
x == 2
## [1] TRUE TRUE FALSE TRUE FALSE
# ex.1b larger than
x > 2
## [1] FALSE FALSE TRUE FALSE TRUE
You can access elements that meet the specified condition using brackets, for example:
# ex.2a equal
x[x == 2]
## [1] 2 2 2
# ex.2b larger than
x[x > 2]
## [1] 3 5
Using which(), you can see which elements (i.e., element #) match what you need:
# ex.3a equal
which(x == 2) # returns which elements are equal to 2
## [1] 1 2 4
# ex.3b larger than
which(x > 2)
## [1] 3 5
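The two approaches are closely related. As a sketch (idx is an illustrative name), indexing with the element IDs from which() selects the same values as indexing with the logical vector directly:

```r
x <- c(2, 2, 3, 2, 5)

idx <- which(x > 2)  # element IDs: 3 5
x[idx]               # element values: 3 5

# both routes select the same elements
identical(x[idx], x[x > 2])
## [1] TRUE

# TRUE counts as 1, so sum() counts the elements meeting the condition
sum(x > 2)
## [1] 2
```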
4.2.4 Caveat
A single vector CANNOT contain multiple types of data. For example, try the following code. R coerces x into a character vector, not a numeric one.
(x <- c(1, 2, "a"))
## [1] "1" "2" "a"
# class() returns the data type of an object
class(x)
## [1] "character"
Also, the character data type is not limited to letters; numbers are treated as character when they are quoted. For example, you cannot calculate the mean of the following vector because R recognizes the numbers as character.
# enter numbers as character "1" "2"
(x <- c("1", "2"))
## [1] "1" "2"
# class() returns the data type of an object
class(x)
## [1] "character"
# cannot calculate the mean (R returns NA with a warning)
mean(x)
## [1] NA
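When the quoted values really are numbers, one possible remedy is to convert them with as.numeric() before calculating. A minimal sketch (x_num and y are illustrative names):

```r
x <- c("1", "2")

# convert character to numeric before calculating
x_num <- as.numeric(x)
class(x_num)
## [1] "numeric"
mean(x_num)
## [1] 1.5

# text that is not a number becomes NA (R also issues a warning,
# silenced here with suppressWarnings())
y <- suppressWarnings(as.numeric(c("1", "a")))
is.na(y)
## [1] FALSE  TRUE
```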
4.2.5 Exercise
- Create three numeric vectors with length 3, 6 and 20, respectively. The three vectors must be created using different functions (c(), rep(), seq()).
- Create two character vectors with length 3 and 20, respectively. The two vectors must be created using two different functions (c(), rep()).
- Copy the following script to your R script and perform the following analysis:
  - Identify element IDs of y that are greater than 2.0
  - Identify element values of y that are greater than 2.0
  - Calculate the arithmetic mean of y
4.3 Matrix
4.3.1 Create Matrix
A matrix is a set of elements (of a single data type) organized into rows and columns:
Column bind function cbind() combines two vectors by column.
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [,1] [,2]
## [1,] "a" "d"
## [2,] "b" "e"
## [3,] "c" "f"
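The code behind the two outputs above is not shown. A sketch that reproduces them, assuming the inputs are the numbers 1 to 6 and the letters a to f (m_num and m_chr are illustrative names):

```r
# numeric: two vectors become the two columns of a matrix
(m_num <- cbind(c(1, 2, 3), c(4, 5, 6)))

# character: the same, with letters
(m_chr <- cbind(c("a", "b", "c"), c("d", "e", "f")))

dim(m_num)  # 3 rows, 2 columns
## [1] 3 2
```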
Row bind function rbind() combines two vectors by row.
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [,1] [,2] [,3]
## [1,] "a" "b" "c"
## [2,] "d" "e" "f"
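As with cbind(), the code for these outputs is not shown. Assuming the same inputs (the numbers 1 to 6 and the letters a to f), a sketch that reproduces them could be:

```r
# numeric: each vector becomes one row of the matrix
(m_num <- rbind(c(1, 2, 3), c(4, 5, 6)))

# character: the same, with letters
(m_chr <- rbind(c("a", "b", "c"), c("d", "e", "f")))

dim(m_num)  # 2 rows, 3 columns
## [1] 2 3
```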
Matrix function matrix() organizes a vector into a matrix by specifying the numbers of rows and columns.
# numeric: vector 1:6 is organized into a 3 x 2 matrix
(m_x <- matrix(1:6, nrow = 3, ncol = 2))
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [,1] [,2]
## [1,] "a" "d"
## [2,] "b" "e"
## [3,] "c" "f"
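The character output above has no visible code. Assuming the input is the letters a to f, matrix() fills the matrix column by column, which reproduces it; byrow = TRUE is shown for comparison. The names m_y and m_z are illustrative.

```r
# character: six letters organized into a 3 x 2 matrix, filled column by column
(m_y <- matrix(c("a", "b", "c", "d", "e", "f"), nrow = 3, ncol = 2))

# byrow = TRUE fills the matrix row by row instead
(m_z <- matrix(c("a", "b", "c", "d", "e", "f"), nrow = 3, ncol = 2, byrow = TRUE))
m_z[1, ]
## [1] "a" "b"
```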
4.3.2 Access
Element ID. When accessing matrix elements, you need to pick row(s) and/or column(s). In brackets, specify row ID(s) before the comma and column ID(s) after the comma ([row ID, column ID]). For example:
(m_x <- matrix(1:9, nrow = 3, ncol = 3))
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
m_x[2, 3] # access an element in row #2 and column #3
## [1] 8
m_x[2,] # access elements in row #2
## [1] 2 5 8
m_x[c(2, 3),] # access elements in rows #2 and #3
## [,1] [,2] [,3]
## [1,] 2 5 8
## [2,] 3 6 9
m_x[,c(2, 3)] # access elements in columns #2 and #3
## [,1] [,2]
## [1,] 4 7
## [2,] 5 8
## [3,] 6 9
Equation. You can evaluate each element with mathematical expressions, just as with vectors:
m_x == 2 # equal
## [,1] [,2] [,3]
## [1,] FALSE FALSE FALSE
## [2,] TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE
m_x > 2 # larger than
## [,1] [,2] [,3]
## [1,] FALSE TRUE TRUE
## [2,] FALSE TRUE TRUE
## [3,] TRUE TRUE TRUE
However, care must be taken when accessing elements this way, as the result is converted to a vector and the row and column arrangement is lost:
m_x[m_x == 2] # equal
## [1] 2
m_x[m_x > 2] # larger than
## [1] 3 4 5 6 7 8 9
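The same conversion happens when a single row or column is selected. If the matrix shape needs to be kept in that case, the drop = FALSE argument of [] can help. A brief sketch:

```r
m_x <- matrix(1:9, nrow = 3, ncol = 3)

v <- m_x[m_x > 2]
is.matrix(v)  # the selection is a plain vector
## [1] FALSE

# selecting a single row also drops to a vector by default
is.matrix(m_x[2, ])
## [1] FALSE

# drop = FALSE keeps the result as a 1 x 3 matrix
dim(m_x[2, , drop = FALSE])
## [1] 1 3
```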
which() needs an additional argument, arr.ind = TRUE, to return both row and column #:
which(m_x == 2, arr.ind = TRUE)
## row col
## [1,] 2 1
which(m_x > 2, arr.ind = TRUE)
## row col
## [1,] 3 1
## [2,] 1 2
## [3,] 2 2
## [4,] 3 2
## [5,] 1 3
## [6,] 2 3
## [7,] 3 3
4.3.4 Exercise
- Create a numeric matrix with 4 rows and 4 columns using either the rbind() or cbind() function.
- Create a character matrix with 4 rows and 4 columns using the matrix() function.
- Copy the following script to your R script and perform the following analysis:
  - Identify element IDs of x that are greater than 2.0 (specify row and column IDs)
  - Identify element values of x that are greater than 2.0 and calculate the mean of the selected elements.
4.4 Data Frame
A data frame is a collection of elements organized into rows and columns, but it differs from a matrix in several ways.
- It allows for the inclusion of multiple data types in different columns.
- Each column in a data frame has a name associated with it.
- You can access columns in a data frame by their respective names using the $ operator.
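The first difference can be seen by placing numbers and text side by side. The sketch below uses made-up values and the hypothetical names m, d, tss, and lake; it assumes R 4.0 or later, where text columns in a data frame stay character by default.

```r
# in a matrix, mixing numbers and text coerces everything to character
m <- cbind(c(1.2, 2.2), c("Pristine", "Disturbed"))
class(m[, 1])
## [1] "character"

# a data frame keeps each column's own type
d <- data.frame(tss = c(1.2, 2.2), lake = c("Pristine", "Disturbed"))
class(d$tss)
## [1] "numeric"
class(d$lake)  # assumes R >= 4.0
## [1] "character"
```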
The data frame is the most commonly used data structure when manipulating ecological data. When loading a dataset from a spreadsheet (which we will discuss later), it is automatically recognized as a data frame. Let’s consider an example:
4.4.1 Creating a data frame
In the following example, the variables x and y are organized into a single data frame named df0. The variables are renamed as part of the process of creating the data frame.
# Create data frame
# Lake type
x <- c("Pristine", "Pristine", "Disturbed", "Disturbed", "Pristine")
# TSS: total suspended solids (mg/L)
y <- c(1.2, 2.2, 10.9, 50.0, 3.0)
# x is named as "LakeType" while y is named as "TSS"
(df0 <- data.frame(LakeType = x, TSS = y))
## LakeType TSS
## 1 Pristine 1.2
## 2 Pristine 2.2
## 3 Disturbed 10.9
## 4 Disturbed 50.0
## 5 Pristine 3.0
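After creating a data frame, it can be worth checking that each column has the type you expect. One way to do so, sketched here with the same df0, is str() together with dim() and colnames():

```r
x <- c("Pristine", "Pristine", "Disturbed", "Disturbed", "Pristine")
y <- c(1.2, 2.2, 10.9, 50.0, 3.0)
df0 <- data.frame(LakeType = x, TSS = y)

str(df0)       # column names, types, and the first few values
dim(df0)       # number of rows, number of columns
## [1] 5 2
colnames(df0)  # column names
## [1] "LakeType" "TSS"
```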
4.4.2 Access by columns
To access elements by column name, write $ after the data frame name, followed by the column name.
df0$LakeType # access LakeType
## [1] "Pristine" "Pristine" "Disturbed" "Disturbed" "Pristine"
df0$TSS # access TSS
## [1] 1.2 2.2 10.9 50.0 3.0
You can access elements like a matrix as well:
df0[, 1] # access column #1
## [1] "Pristine" "Pristine" "Disturbed" "Disturbed" "Pristine"
df0[1, ] # access row #1
## LakeType TSS
## 1 Pristine 1.2
df0[c(2, 4),] # access rows #2 and #4
## LakeType TSS
## 2 Pristine 2.2
## 4 Disturbed 50.0
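The conditions introduced for vectors also work with data frame columns. As a sketch using the same df0 (tss_pristine is an illustrative name, and the threshold of 3 is arbitrary), rows or values can be selected by a condition on a column:

```r
df0 <- data.frame(
  LakeType = c("Pristine", "Pristine", "Disturbed", "Disturbed", "Pristine"),
  TSS = c(1.2, 2.2, 10.9, 50.0, 3.0)
)

# rows where TSS exceeds 3
df0[df0$TSS > 3, ]

# TSS values for pristine lakes only, and their mean
tss_pristine <- df0$TSS[df0$LakeType == "Pristine"]
mean(tss_pristine)
## [1] 2.133333
```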
4.4.3 Exercise
Copy the following script to your R script and perform the following analysis:
set.seed(1)
x <- rnorm(100, mean = 10, sd = 3)
y <- rpois(100, lambda = 10)
z <- rep(c("VA", "NC"), 50)
df0 <- data.frame(temperature = x, abundance = y, state = z)
- Access the temperature column.
- Calculate the means of temperature and abundance for states VA and NC separately.