1 Descriptive Statistics

Descriptive statistics are a set of summary measures that provide a concise overview of a dataset. They help us understand the characteristics and properties of the data without delving into complex statistical analyses. Some commonly used descriptive statistics include the mean, standard deviation, and median.

To illustrate, consider measurements of fish length, \(x\). Each fish is identified by a subscript, and the lengths are denoted as follows:

\[ x_1 = 15.9\\ x_2 = 15.1\\ x_3 = 21.9\\ x_4 = 13.3\\ x_5 = 24.4 \]

Often, we use a subscript \(i\) (or any character you like) instead of an actual number to indicate a given data point. For example, we write fish length \(x_i\) for individual \(i\). Alternatively, instead of writing each data point separately, we can represent them as a vector in boldface, denoted as \(\pmb{x}\). In this case, the vector \(\pmb{x}\) can be expressed as:

\[ \pmb{x} = \{15.9, 15.1, 21.9, 13.3, 24.4\} \]
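For readers following along in R, the subscript notation maps directly onto vector indexing. The sketch below assumes the five fish lengths are stored in a vector named x; the c() function it uses is introduced in the R exercise later in this chapter.

```r
# store the five fish lengths as a vector
x <- c(15.9, 15.1, 21.9, 13.3, 24.4)

# x[i] returns the i-th element, i.e., x_i in the notation above
i <- 3
x[i]         # length of the third fish, x_3

# length(x) gives the number of data points
length(x)
```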

Now let’s see how we represent summary statistics of vector \(\pmb{x}\).

1.1 Central Tendency

1.1.1 Central Tendency Measures

Central tendency measures (Table 1.1) provide insights into the typical or central value of a dataset. There are three commonly used measures:

  • Arithmetic Mean. This is the most commonly used measure of central tendency. It represents the additive average of the data. To calculate the arithmetic mean, you sum up all the values and divide the total by the number of data points. It can be heavily influenced by extreme values, known as outliers.

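  • Geometric Mean. The geometric mean is a multiplicative average. For positive data it is never larger than the arithmetic mean (the two are equal only when all values are identical), and it is less sensitive to unusually large values. However, it is not applicable when the data contain negative values, and it collapses to zero if any value is zero.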

  • Median. The median is the value that separates the higher half from the lower half of the dataset. It represents the 50th percentile data point. The median is less affected by outliers compared to the arithmetic mean. To calculate the median, you arrange the data in ascending order and select the middle value if the dataset has an odd number of values. If the dataset has an even number of values, you take the average of the two middle values.

Table 1.1: Common measures of central tendency. \(N\) refers to the number of data points, and \(x_{(k)}\) denotes the \(k\)-th smallest value after sorting.
Measure | Formula
Arithmetic mean \(\mu\) | \(\frac{\sum_{i=1}^{N} x_i}{N}\)
Geometric mean \(\mu_{ge}\) | \(\left(\prod_{i=1}^{N} x_i\right)^{\frac{1}{N}}\)
Median \(\mu_{med}\) | \(x_{\left(\frac{N + 1}{2}\right)}\) if \(N\) is odd, or \(\frac{1}{2}\left[x_{\left(\frac{N}{2}\right)} + x_{\left(\frac{N}{2} + 1\right)}\right]\) if \(N\) is even
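The two cases of the median formula in Table 1.1 can be sketched in R as follows. The function name median_manual and the four-element vector are hypothetical, introduced only for illustration; in practice the built-in median() does the same job.

```r
# hypothetical helper implementing the median formula in Table 1.1
median_manual <- function(v) {
  v <- sort(v)               # arrange in ascending order
  n <- length(v)
  if (n %% 2 == 1) {
    v[(n + 1) / 2]                    # odd N: the middle value
  } else {
    (v[n / 2] + v[n / 2 + 1]) / 2     # even N: average of the two middle values
  }
}

odd  <- c(15.9, 15.1, 21.9, 13.3, 24.4)  # N = 5
even <- c(15.9, 15.1, 21.9, 13.3)        # N = 4 (last value dropped for illustration)

median_manual(odd)
median_manual(even)
```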

1.1.2 R Exercise

To learn more about these measures, let’s create vectors \(\pmb{x} = \{15.9, 15.1, 21.9, 13.3, 24.4\}\) and \(\pmb{y} = \{15.9, 15.1, 21.9, 53.3, 24.4\}\). \(\pmb{y}\) is identical to \(\pmb{x}\) except for one outlier value (53.3 in place of 13.3). How does this make a difference? To construct vectors in R, we use c(), a function whose name stands for “combine.” Below is the script:

# construct vectors x and y
x <- c(15.9, 15.1, 21.9, 13.3, 24.4)
y <- c(15.9, 15.1, 21.9, 53.3, 24.4)

Confirm you constructed them correctly:

x
## [1] 15.9 15.1 21.9 13.3 24.4
y
## [1] 15.9 15.1 21.9 53.3 24.4

Cool! Now we can calculate summary statistics of x and y.

Arithmetic Mean

While R has a built-in function for the arithmetic mean, mean(), let’s calculate the value from scratch:

# for vector x
n_x <- length(x) # the number of elements in x = the number of data points
sum_x <- sum(x) # summation for x
mu_x <- sum_x / n_x # arithmetic mean
print(mu_x) # print calculated value
## [1] 18.12
# for vector y; we can calculate directly too
mu_y <- sum(y) / length(y)
print(mu_y) # print calculated value
## [1] 26.12

Compare with the outputs from mean():

mean(x)
## [1] 18.12
mean(y)
## [1] 26.12

Geometric Mean

Unfortunately, there is no built-in function for the geometric mean \(\mu_{ge}\) in R (as far as I know; there are packages though). But we can again calculate the value from scratch:

# for vector x
prod_x <- prod(x) # product of vector x; x1 * x2 * x3...
n_x <- length(x)
mug_x <- prod_x^(1 / n_x) # ^ means power
print(mug_x)
## [1] 17.63648
# for vector y
mug_y <- prod(y)^(1 / length(y))
print(mug_y)
## [1] 23.28022
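For long vectors the product in prod() can overflow or underflow, so the geometric mean is often computed on the log scale instead. The sketch below shows that the two routes agree for x; the variable names mug_x_log and mug_x_prod are chosen here for illustration, and the approach assumes all values are strictly positive.

```r
x <- c(15.9, 15.1, 21.9, 13.3, 24.4)

# geometric mean as the exponential of the mean of the log values
mug_x_log <- exp(mean(log(x)))

# product-based calculation, as in the script above
mug_x_prod <- prod(x)^(1 / length(x))

# the two agree up to floating-point error
all.equal(mug_x_log, mug_x_prod)
```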

Median

Lastly, let’s do the same for the median:

# for vector x
x <- sort(x) # sort x from small to large
index <- (length(x) + 1) / 2 # (N + 1)/2 th index as length(x) is an odd number
med_x <- x[index]
print(med_x)
## [1] 15.9
# for vector y
y <- sort(y) # sort y from small to large
med_y <- y[(length(y) + 1) / 2]
print(med_y)
## [1] 21.9

Compare with the outputs from median():

median(x)
## [1] 15.9
median(y)
## [1] 21.9
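To see how much the single outlier moves each measure, the differences between y and x can be put side by side. This is a small sketch; geo_mean is a hypothetical helper defined here because base R has no geometric mean function.

```r
x <- c(15.9, 15.1, 21.9, 13.3, 24.4)
y <- c(15.9, 15.1, 21.9, 53.3, 24.4)   # same as x, with one outlier

# hypothetical helper: geometric mean from the product
geo_mean <- function(v) prod(v)^(1 / length(v))

# change in each measure caused by the outlier (y minus x)
shift <- c(
  arithmetic = mean(y) - mean(x),
  geometric  = geo_mean(y) - geo_mean(x),
  median     = median(y) - median(x)
)
print(shift)
```

The arithmetic mean moves the most. The median also moves here, because the smallest value (13.3) was replaced by the largest (53.3), which passes the middle position to the next value. However, the median of y would stay at 21.9 no matter how large the outlier became, whereas both means would keep growing.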

1.2 Variation

1.2.1 Variation Measures

Variation measures (Table 1.2) provide information about the spread of data points.

  • Variance. Variance is a statistical measure that quantifies the spread or dispersion of a dataset. It provides a numerical value that indicates how far individual data points in a dataset deviate from the mean or average value. In other words, variance measures the average squared difference between each data point and the mean. Standard deviation (SD) is the square root of variance.

  • Interquartile Range. The interquartile range (IQR) measures the spread of the middle 50% of the data, that is, the distance between the 25th and 75th percentiles. It is a robust measure of variability that is less affected by outliers than the variance.

  • Median Absolute Deviation. The median absolute deviation (MAD) plays a role similar to the variance, but provides a robust estimate of variability that is less affected by outliers. MAD is defined as the median of the absolute deviations from the data’s median.

  • Coefficient of Variation. The coefficient of variation (CV) is a statistical measure that expresses the relative variability of a dataset in relation to its mean. It is particularly useful when comparing the variability of different datasets that have different scales or units of measurement.

  • MAD/Median. MAD/Median is a statistical measure used to assess the relative variability of a dataset without assuming any specific distribution or parametric model. CV is sensitive to outliers because of its reliance on the arithmetic mean. However, MAD/Median is robust to this issue.

Table 1.2: Common measures of variation. \(N\) refers to the number of data points. \(x_l\) and \(x_h\) are the \(l^{th}\) and \(h^{th}\) percentiles (the 25th and 75th for the IQR).
Measure | Formula
Variance \(\sigma^2\) | \(\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}\)
Interquartile range (IQR) | \(\lvert x_l - x_h \rvert\)
Median Absolute Deviation (MAD) | \(\text{Median}(\lvert x_i - \mu_{med} \rvert)\)
Coefficient of Variation (CV) | \(\frac{\sigma}{\mu}\)
MAD/Median | \(\frac{\text{MAD}}{\mu_{med}}\)

1.2.2 R Exercise

Variance

# for x
sqd_x <- (x - mean(x))^2 # squared deviation from the mean
sum_sqd_x <- sum(sqd_x)
var_x <- sum_sqd_x / length(x)
print(var_x)
## [1] 18.2016
# for y
var_y <- sum((y - mean(y))^2) / length(y)
print(var_y)
## [1] 197.0816
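Note that the built-in var() will not reproduce these values: it divides by \(N - 1\) (the sample variance) rather than by \(N\) as in Table 1.2, and sd() follows the same convention. The sketch below compares the two for x; the names var_pop and var_smp are chosen here for illustration.

```r
x <- c(15.9, 15.1, 21.9, 13.3, 24.4)
n <- length(x)

var_pop <- sum((x - mean(x))^2) / n   # divisor N, as in Table 1.2
var_smp <- var(x)                     # built-in var() divides by N - 1

print(var_pop)
print(var_smp)

# the two differ only by the factor (N - 1) / N
all.equal(var_pop, var_smp * (n - 1) / n)
```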

Standard Deviation (SD)

# for x
sd_x <- sqrt(var_x) # sqrt(): square root
print(sd_x)
## [1] 4.266333
# for y
sd_y <- sqrt(var_y)
print(sd_y)
## [1] 14.03858

Coefficient of Variation (CV)

# for x
cv_x <- sd_x / mean(x)
print(cv_x)
## [1] 0.2354489
# for y
cv_y <- sd_y / mean(y)
print(cv_y)
## [1] 0.5374646

IQR

Here, we use the 25th and 75th percentiles as \(x_l\) and \(x_h\).

# for x
x_l <- quantile(x, 0.25) # quantile(): return quantile values; here the 25th percentile
x_h <- quantile(x, 0.75) # quantile(): return quantile values; here the 75th percentile
iqr_x <- abs(x_l - x_h) # abs(): absolute value
print(iqr_x)
## 25% 
## 6.8
# for y
y_q <- quantile(y, c(0.25, 0.75)) # return both percentiles as a vector
iqr_y <- abs(y_q[1] - y_q[2]) # y_q[1] = 25th percentile; y_q[2] = 75th percentile
print(iqr_y)
## 25% 
## 8.5
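R also provides a built-in IQR() function, which by default uses the same quantile definition as quantile(). The sketch below checks it against the manual calculation; the "25%" label in the output above is simply a name carried over from quantile(), and unname() removes it.

```r
x <- c(15.9, 15.1, 21.9, 13.3, 24.4)
y <- c(15.9, 15.1, 21.9, 53.3, 24.4)

# IQR() returns the 75th minus the 25th percentile
IQR(x)
IQR(y)

# manual version without the "25%" label
unname(abs(quantile(x, 0.25) - quantile(x, 0.75)))
```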

MAD

# for x
ad_x <- abs(x - median(x))
mad_x <- median(ad_x)
print(mad_x)
## [1] 2.6
# for y
mad_y <- median(abs(y - median(y)))
print(mad_y)
## [1] 6
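R has a built-in mad() function, but by default it multiplies the result by a constant of 1.4826 so that it estimates the standard deviation under a normal distribution. To match the definition in Table 1.2, set constant = 1. A brief sketch, with variable names chosen here for illustration:

```r
x <- c(15.9, 15.1, 21.9, 13.3, 24.4)

mad_manual <- median(abs(x - median(x)))   # definition in Table 1.2

mad_raw    <- mad(x, constant = 1)   # unscaled; matches the manual value
mad_scaled <- mad(x)                 # default: multiplied by 1.4826

print(c(manual = mad_manual, raw = mad_raw, scaled = mad_scaled))
```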

MAD / Median

# for x
mm_x <- mad_x / median(x)
print(mm_x)
## [1] 0.163522
# for y
mm_y <- mad_y / median(y)
print(mm_y)
## [1] 0.2739726
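To compare all of these measures for x and y at a glance, the calculations above can be wrapped in one function. This is only a sketch: variation_summary is a hypothetical helper, not a base R function, and it uses the divisor \(N\) for SD to stay consistent with Table 1.2.

```r
x <- c(15.9, 15.1, 21.9, 13.3, 24.4)
y <- c(15.9, 15.1, 21.9, 53.3, 24.4)

# hypothetical helper collecting the measures in Table 1.2
variation_summary <- function(v) {
  sd_v  <- sqrt(sum((v - mean(v))^2) / length(v))  # SD with divisor N
  mad_v <- median(abs(v - median(v)))              # unscaled MAD
  c(
    sd         = sd_v,
    iqr        = IQR(v),
    mad        = mad_v,
    cv         = sd_v / mean(v),
    mad_median = mad_v / median(v)
  )
}

# one column per vector, rounded for readability
round(cbind(x = variation_summary(x), y = variation_summary(y)), 3)
```

With the outlier, the CV more than doubles (about 0.24 to 0.54), while MAD/Median rises more modestly (about 0.16 to 0.27).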

1.3 Laboratory

1.3.1 Comparing Central Tendency Measures

How do the three measures of central tendency differ? To investigate this further, let’s perform the following exercise.

  1. Create a new vector z with length \(1000\) as exp(rnorm(n = 1000, mean = 0, sd = 0.1)), and calculate the arithmetic mean, geometric mean, and median.

  2. Draw a histogram of z using functions tibble(), ggplot(), and geom_histogram().

  3. Draw vertical lines for the arithmetic mean, geometric mean, and median on the histogram, in different colors, using the function geom_vline().

  4. Compare the values of the central tendency measures.

  5. Create a new vector z_rev as -z + max(z) + 0.1, and repeat steps 1 to 4.
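As a possible starting point for step 1, the sketch below draws z and computes the three measures. The call to set.seed() is an addition not required by the exercise: it fixes the random draw so that results can be reproduced, and the seed value itself is arbitrary.

```r
set.seed(1)   # assumption: any fixed seed works; it only makes the draw repeatable

z <- exp(rnorm(n = 1000, mean = 0, sd = 0.1))

# the three measures of central tendency for z
c(
  arithmetic = mean(z),
  geometric  = exp(mean(log(z))),
  median     = median(z)
)
```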

1.3.2 Comparing Variation Measures

Why do we have both absolute (variance, SD, MAD, IQR) and relative (CV, MAD/Median) measures of variation? To understand this, suppose we have 100 measurements of fish weight in grams (w in the following script).

w <- rnorm(100, mean = 10, sd = 1)
head(w) # show the first 6 elements in w
## [1]  9.237455  9.171337 10.834474  9.032348  9.971185 10.232525

Using this data, perform the following exercise:

  1. Convert the unit of w to “milligram” and create a new vector m.

  2. Calculate SD and MAD for w and m.

  3. Calculate CV and MAD/Median for w and m.