4 Data Structure

4.1 Overview

R has six basic data types:

  • Character: e.g., "aquatic", "ecology" (no inherent order).
  • Factor: Similar to character but includes levels, which are ordered alphabetically by default.
  • Numeric: e.g., 20.0, 15.5.
  • Integer: e.g., 3L, 7L (the L suffix marks an integer; a bare 3 is stored as numeric).
  • Logical: e.g., TRUE, FALSE.
  • Complex: e.g., 1 + 2i (numbers with real and imaginary parts).
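
The types listed above can be checked with class(). The following is a minimal sketch; the object name f is arbitrary and only for illustration. Note that a bare number such as 3 is stored as numeric unless the L suffix is added:

```r
# class() reports the data type of a value
class("aquatic")
## [1] "character"
class(20.0)
## [1] "numeric"
class(3)    # a bare number is numeric, not integer
## [1] "numeric"
class(3L)   # the L suffix creates an integer
## [1] "integer"
class(TRUE)
## [1] "logical"
class(1 + 2i)
## [1] "complex"

# a factor keeps its levels in sorted (alphabetical) order by default
f <- factor(c("ecology", "aquatic", "ecology"))
levels(f)
## [1] "aquatic" "ecology"
```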

These data types form the basis of R’s data structures, which include:

  • Vector: A series of elements, all of the same data type.
  • Matrix: Elements arranged into rows and columns, all of the same data type.
  • Data frame: Similar to a matrix but allows different data types in different columns.

In the following examples, I will use (x <- something) to demonstrate code, as the outer parentheses () automatically print the contents of the object x. Without the parentheses, the assigned values in x are not displayed. For example, the following code assigns a value to x but does not print it:

x <- 1

To see the value assigned to x, you would need to type:

x
## [1] 1

Using parentheses around the assignment performs both steps, assigning the value and printing it, at once:

(x <- 1)
## [1] 1

4.2 Vector

4.2.1 Create Vector

There are many ways to create a vector, whether character, numeric, integer, or another type. The following examples use c(), rep(), seq(), and the colon operator (:), in that order:

Combine function c() combines multiple elements to create a single vector.

# ex.1a manually create a vector using c()
(x <- c(1, 3, 4, 8))
## [1] 1 3 4 8
# ex.1b character
(x <- c("a", "b", "c"))
## [1] "a" "b" "c"
# ex.1c logical
(x <- c(TRUE, FALSE, FALSE))
## [1]  TRUE FALSE FALSE

Replicate function rep() replicates the same element(s) multiple times.

# ex.3a replicate a number
(x <- rep(2, times = 5)) # replicate 2 five times
## [1] 2 2 2 2 2
# ex.3b replicate a character
(x <- rep("a", times = 5)) # replicate "a" five times
## [1] "a" "a" "a" "a" "a"
# ex.3c replicate each element multiple times
(x <- rep(c(1, 2), each = 2))
## [1] 1 1 2 2
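
The difference between the each and times arguments is easiest to see side by side. This is a small illustrative sketch; the object names x_each and x_times are my own:

```r
# each = repeats every element before moving on to the next one
(x_each <- rep(c(1, 2), each = 2))
## [1] 1 1 2 2
# times = repeats the whole vector
(x_times <- rep(c(1, 2), times = 2))
## [1] 1 2 1 2
```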

Sequence function seq() creates a vector based on starting and ending values with a specified interval or a vector length.

# ex.4a use seq() function
# create a vector from 1 to 5 with interval 1
(x <- seq(1, 5, by = 1))
## [1] 1 2 3 4 5
# ex.4b use seq() function
# create a vector from 1 to 5 with 20 elements
(x <- seq(1, 5, length.out = 20))
##  [1] 1.000000 1.210526 1.421053 1.631579 1.842105 2.052632 2.263158 2.473684
##  [9] 2.684211 2.894737 3.105263 3.315789 3.526316 3.736842 3.947368 4.157895
## [17] 4.368421 4.578947 4.789474 5.000000

Colon : creates a vector based on starting and ending integer values with an interval of one. It works with non-integer values, but the behavior is odd (NOT recommended; use other functions for non-integer vectors).

# ex.5a use colon `:` - interval is fixed at one
(x <- 1:5)
## [1] 1 2 3 4 5
# this works, but the result is odd:
# it steps by one from 1.5 and stops at 4.5, never reaching 4.7
# NOT recommended
1.5:4.7
## [1] 1.5 2.5 3.5 4.5
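
For non-integer vectors, seq() makes the interval explicit, so the result is easier to predict. The sketch below is one possible alternative, with an interval of 0.5 chosen only for illustration. It also shows that the colon operator returns an integer vector when given whole numbers:

```r
# the colon operator steps by one from the starting value
(x_colon <- 1.5:4.7)
## [1] 1.5 2.5 3.5 4.5
# seq() states the interval explicitly
(x_seq <- seq(1.5, 4.5, by = 0.5))
## [1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5
# a colon sequence of whole numbers is stored as integer
class(1:5)
## [1] "integer"
```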

4.2.2 Extract Summary

Numerical or character information can be summarized using functions. Let’s try some basic functions to see how they work:

Mean function mean() calculates the arithmetic mean.

x <- c(10, 15, 20)
mean(x)
## [1] 15

Sum function sum() calculates the summation.

x <- c(10, 15, 20)
sum(x)
## [1] 45

Length function length() returns the number of elements in a vector.

x <- c(10, 15, 20)
length(x)
## [1] 3
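
These functions fit together: the arithmetic mean is the sum divided by the number of elements. The short sketch below checks this and adds min() and max(), two other base R summaries not covered above:

```r
x <- c(10, 15, 20)
# the arithmetic mean equals sum divided by length
sum(x) / length(x)
## [1] 15
# smallest and largest values
min(x)
## [1] 10
max(x)
## [1] 20
```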

4.2.3 Access

Element ID. Use brackets [] to access specific elements in an object. For example, to access element #2 in the vector x, write x[2]:

x <- c(2, 2, 3, 2, 5)
x[2] # access element #2
## [1] 2
x[c(2, 4)] # access elements #2 and 4
## [1] 2 2
x[2:4] # access elements #2-4
## [1] 2 3 2

Equation. R provides many ways to access elements that meet specific conditions. You can use comparison operators to specify what you need, for example:

  • == equal to
  • > larger than
  • >= larger than or equal to
  • < smaller than
  • <= smaller than or equal to
  • which() a function that returns the element IDs that satisfy the specified condition

The following examples return a logical vector indicating whether each element in x satisfies the specified condition:

# creating a vector
x <- c(2, 2, 3, 2, 5)

# ex.1a equal
x == 2
## [1]  TRUE  TRUE FALSE  TRUE FALSE
# ex.1b larger than
x > 2 
## [1] FALSE FALSE  TRUE FALSE  TRUE

You can access elements that meet the specified condition using brackets, for example:

# ex.2a equal
x[x == 2]
## [1] 2 2 2
# ex.2b larger than
x[x > 2]
## [1] 3 5

Using which(), you can see which element IDs match the condition:

# ex.3a equal
which(x == 2) # returns which elements are equal to 2
## [1] 1 2 4
# ex.3b larger than
which(x > 2)
## [1] 3 5
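
The logical vector and the element IDs from which() are two routes to the same elements. The sketch below illustrates this, and also uses the fact that TRUE counts as 1 in sum() to count matches. The object name id is my own:

```r
x <- c(2, 2, 3, 2, 5)
# TRUE counts as 1, so sum() gives the number of matching elements
sum(x == 2)
## [1] 3
# indexing by element IDs returns the same values
# as indexing by the logical vector
id <- which(x > 2)
x[id]
## [1] 3 5
identical(x[id], x[x > 2])
## [1] TRUE
```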

4.2.4 Caveat

A single vector CANNOT contain multiple types of data. In the following code, R coerces x into a character vector rather than keeping the numbers as numeric:

(x <- c(1, 2, "a"))
## [1] "1" "2" "a"
## class() returns a data type of an object
class(x)
## [1] "character"

Also, the character data type is not limited to letters; numbers are treated as character when quoted. For example, you cannot calculate the mean of the following vector because R recognizes the numbers as character.

## enter numbers as character "1" "2"
(x <- c("1", "2"))
## [1] "1" "2"
## class() returns a data type of an object
class(x)
## [1] "character"
## cannot calculate the mean: R returns NA with a warning
mean(x)
## [1] NA
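
If numbers have been stored as character, one way to recover them is as.numeric(). This is a minimal sketch; the object name x_num is my own. Strings that do not look like numbers, such as "a", would become NA with a warning:

```r
x <- c("1", "2")
# convert character to numeric before calculating
x_num <- as.numeric(x)
class(x_num)
## [1] "numeric"
mean(x_num)
## [1] 1.5
```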

4.2.5 Exercise

  1. Create three numeric vectors with length 3, 6 and 20, respectively. The three vectors must be created using different functions (c(), rep(), seq()).
  2. Create two character vectors with length 3 and 20, respectively. The two vectors must be created using two different functions (c(), rep()).
  3. Copy the following script to your R script and perform the following analysis:
set.seed(1)
y <- rnorm(100)
  • Identify element IDs of y that are greater than 2.0
  • Identify element values of y that are greater than 2.0
  • Calculate the arithmetic mean of y

4.3 Matrix

4.3.1 Create Matrix

A matrix is a set of elements of a single data type, organized into rows and columns.

Column bind function cbind() combines two vectors by column.

## numeric
(m_x <- cbind(c(1, 2, 3), c(4, 5, 6)))
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
## character
(m_x <- cbind(c("a", "b", "c"), c("d", "e", "f")))
##      [,1] [,2]
## [1,] "a"  "d" 
## [2,] "b"  "e" 
## [3,] "c"  "f"

Row bind function rbind() combines two vectors by row.

## numeric
(m_x <- rbind(c(1, 2, 3), c(4, 5, 6)))
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## character
(m_x <- rbind(c("a", "b", "c"), c("d", "e", "f")))
##      [,1] [,2] [,3]
## [1,] "a"  "b"  "c" 
## [2,] "d"  "e"  "f"

Matrix function matrix() organizes a vector into a matrix by specifying the numbers of rows and columns.

## numeric: vector 1:6 is organized into 3 x 2 matrix
(m_x <- matrix(1:6, nrow = 3, ncol = 2))
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
## character
(m_x <- matrix(c("a", "b", "c",
                 "d", "e", "f"),
               nrow = 3,
               ncol = 2))
##      [,1] [,2]
## [1,] "a"  "d" 
## [2,] "b"  "e" 
## [3,] "c"  "f"
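
As the output above shows, matrix() fills the matrix column by column by default. The sketch below contrasts this with the byrow = TRUE argument, which fills row by row, and uses dim() to check the numbers of rows and columns. The object names m_col and m_row are my own:

```r
# default: fill column by column
(m_col <- matrix(1:6, nrow = 3, ncol = 2))
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
# byrow = TRUE: fill row by row
(m_row <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE))
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
# dim() returns the numbers of rows and columns
dim(m_row)
## [1] 3 2
```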

4.3.2 Access

Element ID. When accessing matrix elements, you need to pick row(s) and/or column(s). In brackets, specify row ID(s) before the comma and column ID(s) after the comma ([row ID, column ID]). For example:

(m_x <- matrix(1:9, nrow = 3, ncol = 3))
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
m_x[2, 3] # access an element in row #2 and column #3
## [1] 8
m_x[2,] # access elements in row #2
## [1] 2 5 8
m_x[c(2, 3),] # access elements in rows #2 and #3
##      [,1] [,2] [,3]
## [1,]    2    5    8
## [2,]    3    6    9
m_x[,c(2, 3)] # access elements in columns #2 and #3
##      [,1] [,2]
## [1,]    4    7
## [2,]    5    8
## [3,]    6    9

Equation. You can evaluate each element with logical expressions, just like vectors:

m_x == 2 # equal
##       [,1]  [,2]  [,3]
## [1,] FALSE FALSE FALSE
## [2,]  TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE
m_x > 2 # larger than
##       [,1] [,2] [,3]
## [1,] FALSE TRUE TRUE
## [2,] FALSE TRUE TRUE
## [3,]  TRUE TRUE TRUE

However, care must be taken when accessing elements this way, as the result is returned as a vector, not a matrix:

m_x[m_x == 2] # equal
## [1] 2
m_x[m_x > 2] # larger than
## [1] 3 4 5 6 7 8 9
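
A related point applies when picking a single row or column by ID: the result is also simplified to a vector. Subsetting with a logical condition always returns a vector, but for row or column access the drop = FALSE argument keeps the matrix structure. A minimal sketch, with object names r_vec and r_mat of my own choosing:

```r
m_x <- matrix(1:9, nrow = 3, ncol = 3)
# a single row is returned as a vector by default (no dimensions)
r_vec <- m_x[2, ]
dim(r_vec)
## NULL
# drop = FALSE keeps it as a matrix with 1 row and 3 columns
r_mat <- m_x[2, , drop = FALSE]
dim(r_mat)
## [1] 1 3
```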

which() needs an additional argument, arr.ind = TRUE, to return both row and column IDs:

which(m_x == 2, arr.ind = TRUE)
##      row col
## [1,]   2   1
which(m_x > 2, arr.ind = TRUE)
##      row col
## [1,]   3   1
## [2,]   1   2
## [3,]   2   2
## [4,]   3   2
## [5,]   1   3
## [6,]   2   3
## [7,]   3   3
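
Without arr.ind = TRUE, which() returns a single position for each match, counted down the first column, then the second, and so on. The sketch below uses a small made-up matrix m_y (not part of the examples above) so that positions and values differ, and shows that the row and column IDs can be passed back into the brackets to retrieve the values:

```r
# a made-up matrix for illustration
m_y <- matrix(c(5, 1, 7, 2, 9, 3), nrow = 3, ncol = 2)
# without arr.ind: positions counted down each column in turn
which(m_y > 4)
## [1] 1 3 5
# with arr.ind: one (row, col) pair per match
id <- which(m_y > 4, arr.ind = TRUE)
# the pairs can be used to pull out the matching values
m_y[id]
## [1] 5 7 9
```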

4.3.3 Exercise

  1. Create a numeric matrix with 4 rows and 4 columns using either the rbind() or cbind() function.
  2. Create a character matrix with 4 rows and 4 columns using the matrix() function.
  3. Copy the following script to your R script and perform the following analysis:
set.seed(1)
x <- matrix(rnorm(100), nrow = 10, ncol = 10)
  • Identify element IDs of x that are greater than 2.0 (specify row and column IDs)
  • Identify element values of x that are greater than 2.0 and calculate the mean of the selected elements.

4.4 Data Frame

A data frame is a collection of elements organized into rows and columns, but it differs from a matrix in several ways.

  • It allows for the inclusion of multiple data types in different columns.
  • Each column in a data frame has a name associated with it.
  • You can access columns in a data frame by their respective names using the $ operator.

The data frame is the most commonly used data structure when manipulating ecological data. When loading a dataset from a spreadsheet (which we will discuss later), it is automatically recognized as a data frame. Let’s consider an example:

4.4.1 Creating a data frame

In the following example, the variables x and y are organized into a single data frame named df0. The variables are renamed as part of the process of creating the data frame.

# Create data frame

# Lake type
x <- c("Pristine", "Pristine", "Disturbed", "Disturbed", "Pristine")

# TSS: total suspended solids (mg/L)
y <- c(1.2, 2.2, 10.9, 50.0, 3.0)

# x is named as "LakeType" while y is named as "TSS"
(df0 <- data.frame(LakeType = x, TSS = y))
##    LakeType  TSS
## 1  Pristine  1.2
## 2  Pristine  2.2
## 3 Disturbed 10.9
## 4 Disturbed 50.0
## 5  Pristine  3.0
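
After creating a data frame, it is worth checking its size, column names, and the data type of each column. The sketch below rebuilds df0 so that it runs on its own. I set stringsAsFactors = FALSE explicitly; this is the default from R 4.0.0, while older versions would turn LakeType into a factor:

```r
x <- c("Pristine", "Pristine", "Disturbed", "Disturbed", "Pristine")
y <- c(1.2, 2.2, 10.9, 50.0, 3.0)
# stringsAsFactors = FALSE keeps LakeType as character
df0 <- data.frame(LakeType = x, TSS = y, stringsAsFactors = FALSE)

dim(df0)   # numbers of rows and columns
## [1] 5 2
names(df0) # column names
## [1] "LakeType" "TSS"
class(df0$LakeType) # each column keeps its own data type
## [1] "character"
class(df0$TSS)
## [1] "numeric"
```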

4.4.2 Access by columns

To access elements by column name, write the data frame name, then $, then the column name.

df0$LakeType # access LakeType
## [1] "Pristine"  "Pristine"  "Disturbed" "Disturbed" "Pristine"
df0$TSS # access TSS
## [1]  1.2  2.2 10.9 50.0  3.0

You can access elements like a matrix as well:

df0[, 1] # access column #1
## [1] "Pristine"  "Pristine"  "Disturbed" "Disturbed" "Pristine"
df0[1, ] # access row #1
##   LakeType TSS
## 1 Pristine 1.2
df0[c(2, 4),] # access rows #2 and #4
##    LakeType  TSS
## 2  Pristine  2.2
## 4 Disturbed 50.0
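
The conditional access used for vectors and matrices also works for data frames. The sketch below rebuilds df0 so that it runs on its own, then selects rows by a condition on one column and summarizes a column for one group. The 5 mg/L threshold and the object name tss_pristine are arbitrary choices for illustration:

```r
df0 <- data.frame(
  LakeType = c("Pristine", "Pristine", "Disturbed", "Disturbed", "Pristine"),
  TSS = c(1.2, 2.2, 10.9, 50.0, 3.0)
)
# rows where TSS exceeds 5 mg/L
df0[df0$TSS > 5, ]
##    LakeType  TSS
## 3 Disturbed 10.9
## 4 Disturbed 50.0
# mean TSS for pristine lakes only
tss_pristine <- df0$TSS[df0$LakeType == "Pristine"]
mean(tss_pristine)
## [1] 2.133333
```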

4.4.3 Exercise

Copy the following script to your R script and perform the following analysis:

set.seed(1)
x <- rnorm(100, mean = 10, sd = 3)
y <- rpois(100, lambda = 10)
z <- rep(c("VA", "NC"), 50)
df0 <- data.frame(temperature = x, abundance = y, state = z)
  • Access the temperature column.
  • Calculate the means of temperature and abundance for states VA and NC separately.