Introduction to Statistical Programming
This section will not teach you the basics of R and RStudio. If you need that, please refer to A brief R guide for surviving ENVX1002. You will need to be logged into Canvas to access this link.
Data Introduction
While we are not all ecologists, we will be working with two datasets from the Long Term Ecological Research (LTER) Network:
- Salamander Counts (and_vertebrates): Data from the Andrews Forest LTER site tracking salamander populations across different streams.
- Lake Ice Duration (ntl_icecover): Records from the North Temperate Lakes LTER site measuring annual ice cover duration.
Load the data to get started:
pacman::p_load(lterdatasampler)
# load the data
data("and_vertebrates", "ntl_icecover")
These are very large datasets – but don’t be intimidated! The point is to show you that statistical programming can handle big data, and it will all make sense as we go along.
Metadata
Statistical analysis starts by exploring the data’s structure. Sometimes, datasets include metadata—context that helps explain the data.
For the data that we are using today, metadata is available in the documentation. Use ?and_vertebrates and ?ntl_icecover to access it.
?and_vertebrates
?ntl_icecover
Understanding Data Structure
The str() function shows us the types of variables and how the data is “arranged”:
str(and_vertebrates)
tibble [32,209 × 16] (S3: tbl_df/tbl/data.frame)
$ year : num [1:32209] 1987 1987 1987 1987 1987 ...
$ sitecode : chr [1:32209] "MACKCC-L" "MACKCC-L" "MACKCC-L" "MACKCC-L" ...
$ section : chr [1:32209] "CC" "CC" "CC" "CC" ...
$ reach : chr [1:32209] "L" "L" "L" "L" ...
$ pass : num [1:32209] 1 1 1 1 1 1 1 1 1 1 ...
$ unitnum : num [1:32209] 1 1 1 1 1 1 1 1 1 1 ...
$ unittype : chr [1:32209] "R" "R" "R" "R" ...
$ vert_index : num [1:32209] 1 2 3 4 5 6 7 8 9 10 ...
$ pitnumber : num [1:32209] NA NA NA NA NA NA NA NA NA NA ...
$ species : chr [1:32209] "Cutthroat trout" "Cutthroat trout" "Cutthroat trout" "Cutthroat trout" ...
$ length_1_mm: num [1:32209] 58 61 89 58 93 86 107 131 103 117 ...
$ length_2_mm: num [1:32209] NA NA NA NA NA NA NA NA NA NA ...
$ weight_g : num [1:32209] 1.75 1.95 5.6 2.15 6.9 5.9 10.5 20.6 9.55 13 ...
$ clip : chr [1:32209] "NONE" "NONE" "NONE" "NONE" ...
$ sampledate : Date[1:32209], format: "1987-10-07" "1987-10-07" ...
$ notes : chr [1:32209] NA NA NA NA ...
There is also glimpse(), which requires you to load the tidyverse package:
pacman::p_load(tidyverse)
glimpse(and_vertebrates)
Rows: 32,209
Columns: 16
$ year <dbl> 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987…
$ sitecode <chr> "MACKCC-L", "MACKCC-L", "MACKCC-L", "MACKCC-L", "MACKCC-L"…
$ section <chr> "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC"…
$ reach <chr> "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L"…
$ pass <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ unitnum <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2…
$ unittype <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", "R", "R", "R"…
$ vert_index <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, …
$ pitnumber <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ species <chr> "Cutthroat trout", "Cutthroat trout", "Cutthroat trout", "…
$ length_1_mm <dbl> 58, 61, 89, 58, 93, 86, 107, 131, 103, 117, 100, 127, 99, …
$ length_2_mm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ weight_g <dbl> 1.75, 1.95, 5.60, 2.15, 6.90, 5.90, 10.50, 20.60, 9.55, 13…
$ clip <chr> "NONE", "NONE", "NONE", "NONE", "NONE", "NONE", "NONE", "N…
$ sampledate <date> 1987-10-07, 1987-10-07, 1987-10-07, 1987-10-07, 1987-10-0…
$ notes <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
You will soon realise that there are many ways to “skin a cat” in R – that is, there are multiple ways to achieve the same outcome. You can pick any method as long as you are comfortable with it.
In general, the outputs from str() and glimpse() can seem overwhelming at first. This is completely normal, especially when you’re new to data exploration. Take your time, and don’t worry if it doesn’t all make sense immediately – it will come with practice.
Previewing the Data
The head() function shows us the first few rows of a dataset:
head(and_vertebrates)
# A tibble: 6 × 16
year sitecode section reach pass unitnum unittype vert_index pitnumber
<dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 1987 MACKCC-L CC L 1 1 R 1 NA
2 1987 MACKCC-L CC L 1 1 R 2 NA
3 1987 MACKCC-L CC L 1 1 R 3 NA
4 1987 MACKCC-L CC L 1 1 R 4 NA
5 1987 MACKCC-L CC L 1 1 R 5 NA
6 1987 MACKCC-L CC L 1 1 R 6 NA
# ℹ 7 more variables: species <chr>, length_1_mm <dbl>, length_2_mm <dbl>,
# weight_g <dbl>, clip <chr>, sampledate <date>, notes <chr>
While tail() shows the last few rows:
tail(ntl_icecover)
# A tibble: 6 × 5
lakeid ice_on ice_off ice_duration year
<fct> <date> <date> <dbl> <dbl>
1 Lake Monona 2014-12-02 2015-04-02 105 2014
2 Lake Monona 2016-01-11 2016-03-13 62 2015
3 Lake Monona 2016-12-16 2017-03-07 81 2016
4 Lake Monona 2017-12-26 2018-03-29 93 2017
5 Lake Monona 2018-12-11 2019-03-31 97 2018
6 Lake Monona 2019-12-16 2020-03-20 80 2019
Quick Statistical Overview
The summary() function provides a statistical overview of your data, adapting its output based on variable types (numerical, categorical, etc.). This means that it can be useful for some datasets but not for others.
summary(and_vertebrates)
year sitecode section reach
Min. :1987 Length:32209 Length:32209 Length:32209
1st Qu.:1998 Class :character Class :character Class :character
Median :2006 Mode :character Mode :character Mode :character
Mean :2005
3rd Qu.:2012
Max. :2019
pass unitnum unittype vert_index
Min. :1.000 Min. : 1.000 Length:32209 Min. : 1.00
1st Qu.:1.000 1st Qu.: 3.000 Class :character 1st Qu.: 5.00
Median :1.000 Median : 7.000 Mode :character Median : 13.00
Mean :1.224 Mean : 7.696 Mean : 20.17
3rd Qu.:1.000 3rd Qu.:11.000 3rd Qu.: 27.00
Max. :2.000 Max. :20.000 Max. :147.00
pitnumber species length_1_mm length_2_mm
Min. : 62048 Length:32209 Min. : 19.00 Min. : 28.0
1st Qu.:13713632 Class :character 1st Qu.: 47.00 1st Qu.: 77.0
Median :18570447 Mode :character Median : 63.00 Median : 98.0
Mean :16286432 Mean : 73.83 Mean :100.5
3rd Qu.:19132429 3rd Qu.: 97.00 3rd Qu.:119.0
Max. :28180046 Max. :253.00 Max. :284.0
NA's :26574 NA's :17 NA's :19649
weight_g clip sampledate notes
Min. : 0.090 Length:32209 Min. :1987-10-06 Length:32209
1st Qu.: 1.510 Class :character 1st Qu.:1998-09-04 Class :character
Median : 6.050 Mode :character Median :2006-09-06 Mode :character
Mean : 8.903 Mean :2005-08-05
3rd Qu.: 11.660 3rd Qu.:2012-09-05
Max. :134.590 Max. :2019-09-05
NA's :13268
summary(ntl_icecover)
lakeid ice_on ice_off ice_duration
Lake Mendota:167 Min. :1851-12-13 Min. :1852-03-25 Min. : 21.0
Lake Monona :167 1st Qu.:1895-12-04 1st Qu.:1896-04-04 1st Qu.: 92.0
Lake Wingra : 0 Median :1936-12-07 Median :1937-10-01 Median :103.0
Mean :1937-03-15 Mean :1937-09-25 Mean :102.8
3rd Qu.:1978-12-09 3rd Qu.:1979-04-08 3rd Qu.:115.0
Max. :2020-01-12 Max. :2020-03-22 Max. :161.0
NA's :1 NA's :2 NA's :3
year
Min. :1851
1st Qu.:1894
Median :1936
Mean :1936
3rd Qu.:1978
Max. :2019
Summary Statistics
Mean: central tendency
The arithmetic mean is often our first step in understanding typical values in data.
Mathematically, the mean (\(\bar{x}\)) is calculated as the sum of all values (\(x_i\)) divided by the number of values (\(n\)):
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
For example, consider the following set of salamander weights (in grams): 2, 3, 5, 7, and 8. The mean weight is:
\[ \bar{x} = \frac{2 + 3 + 5 + 7 + 8}{5} = \frac{25}{5} = 5 \]
In simpler terms, the mean is what you get if you add up all the numbers and then divide by how many numbers you added. It’s a way to find the “average” value.
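You can check this arithmetic directly in R. The sketch below uses the toy weights from the example above (not the real dataset), first doing the sum-and-divide by hand and then with the built-in function:

```r
# Toy salamander weights from the worked example (grams)
weights <- c(2, 3, 5, 7, 8)
sum(weights) / length(weights)  # add them up, divide by how many: 5
mean(weights)                   # R's built-in mean() gives the same: 5
```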
In R, we can calculate the mean using the mean() function. Note that we use the $ operator to access specific variables in the dataset. It will make sense once you start typing the code, as RStudio will provide you with suggestions.
# Calculate mean salamander weight
mean(and_vertebrates$weight_g, na.rm = TRUE)
[1] 8.902859
# Calculate mean ice duration
mean(ntl_icecover$ice_duration, na.rm = TRUE)
[1] 102.8187
Missing values are called NA in R. The na.rm = TRUE argument tells R to ignore these missing values when calculating the mean. If you don’t include this argument, R will return NA as the mean because, mathematically, you can’t do calculations with unknown values.
# This will return NA
mean(and_vertebrates$weight_g)
[1] NA
Median: a “robust” central measure
The median is the middle value when all numbers are arranged in order, like finding the middle person if everyone lined up by height. This makes it especially good at representing “typical” values when there are some extreme measurements in your data.
For example, let’s take these salamander weights (in grams): 2, 3, 5, 7, 20. To find the median:
- Arrange in order: 2, 3, 5, 7, 20
- Find the middle number: 5
Even though there’s one unusually large salamander (20g), the median (5g) still represents a typical salamander weight. Compare this to the mean:
(2 + 3 + 5 + 7 + 20) ÷ 5 = 7.4g
The mean is pulled higher by the unusually large salamander, making it less representative of a typical salamander in this case. This is why we say the median is “robust” - it’s not easily influenced by extreme values or outliers.
For datasets with an even number of values, we take the average of the two middle numbers. For example, with weights 2, 3, 5, 7:
- Arrange in order: 2, 3, 5, 7
- Find the middle pair: 3 and 5
- Calculate their average: (3 + 5) ÷ 2 = 4
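The comparisons above can be reproduced with median() and mean(); the toy vectors below are the example values, not the real dataset:

```r
# Odd count with one unusually large salamander (20 g)
weights_odd <- c(2, 3, 5, 7, 20)
median(weights_odd)   # 5 -- unaffected by the extreme value
mean(weights_odd)     # 7.4 -- pulled up by the extreme value

# Even count: median is the average of the two middle values
weights_even <- c(2, 3, 5, 7)
median(weights_even)  # (3 + 5) / 2 = 4
```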
This robustness proves particularly valuable in data where extreme values are common:
- House prices in a neighborhood (a few mansion prices won’t skew the “typical” house value)
- Animal body sizes (a few giants won’t affect the “typical” size)
- Stream depths (a few very deep pools won’t change the “typical” depth)
Let’s look at our full ice cover data:
# Median ice_duration
median(ntl_icecover$ice_duration, na.rm = TRUE)
[1] 103
# Compare with mean
mean(ntl_icecover$ice_duration, na.rm = TRUE)
[1] 102.8187
From the above, you can see that both values are close, indicating that the data is not skewed by extreme values. We will learn more about how this relates to statistics and data distributions in the next few chapters.
Mode: Most frequent value
The mode represents the most frequently occurring value in a dataset, making it especially valuable for understanding the most common category or count, whereas the mean and median are more suitable for continuous numerical data.
For example, consider these salamander counts in different sections of a stream: 2, 3, 2, 4, 2, 1, 3
To find the mode:
1. Count how often each value appears:
- 1 appears once
- 2 appears three times
- 3 appears twice
- 4 appears once
2. The mode is 2 because it appears most frequently (three times)
Some datasets can have multiple modes:
- If we had: 2, 3, 2, 4, 3, 1, 3
- Both 2 and 3 appear three times
- This dataset would be “bimodal” (having two modes)
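One way to do this tallying in R is with table(), which counts how often each value appears; picking the value(s) with the highest count gives the mode (or modes, if the data are bimodal):

```r
# Stream-section counts from the example above
counts <- c(2, 3, 2, 4, 2, 1, 3)
tab <- table(counts)
tab                          # 1 appears once, 2 three times, 3 twice, 4 once
names(tab)[tab == max(tab)]  # "2" -- the most frequent value
```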
While the mode is useful for discrete data like species counts, R lacks a built-in function for it. We could write one, but we’ll use existing packages instead. It is good practice to find out if a package exists before writing your own function since you are (probably) not in this unit to become a software developer.
A Google search should lead you to multiple packages that can calculate the mode; we’ll use the modeest package for now. Note that the function is called mlv(), short for “most likely value”:
pacman::p_load(modeest)
ml <- mlv(and_vertebrates$weight_g, method = "shorth", na.rm = TRUE)
ml
[1] 2
Now that we know that the modal weight in grams is 2, we can use this information to understand the distribution of salamander weights in the dataset.
Measures of Spread
Range
The range helps us understand the full scope of variation in our measurements. For salamanders, this might reveal the difference between the smallest juvenile and the largest adult:
# Range of salamander weights
range(and_vertebrates$weight_g, na.rm = TRUE)
[1] 0.09 134.59
# Alternative using min and max
min(and_vertebrates$weight_g, na.rm = TRUE)
[1] 0.09
max(and_vertebrates$weight_g, na.rm = TRUE)
[1] 134.59
The range provides a basic understanding of data spread, showing the total span of values. It’s useful for a quick look at the extent of variation in data. However, the range doesn’t tell us how the values are distributed within that span. To understand this, we need more advanced measures like variance and standard deviation.
Variance and standard deviation
Variance and standard deviation help us understand how spread out our data is from the mean. Think of them as measuring how much values typically “deviate” or differ from the average.
Variance
Variance measures the average squared distance of each value from the mean. Let’s break this down with a simple example using salamander weights (in grams): 2, 3, 5, 7, 8
- Calculate the mean: (2 + 3 + 5 + 7 + 8) ÷ 5 = 5
- Find how far each value is from the mean:
- 2 is 3 below the mean: (2 - 5) = -3
- 3 is 2 below the mean: (3 - 5) = -2
- 5 equals the mean: (5 - 5) = 0
- 7 is 2 above the mean: (7 - 5) = 2
- 8 is 3 above the mean: (8 - 5) = 3
- Square these differences (this makes all values positive):
- (-3)² = 9
- (-2)² = 4
- 0² = 0
- 2² = 4
- 3² = 9
- Calculate the average of these squared differences: (9 + 4 + 0 + 4 + 9) ÷ 5 = 5.2
The variance in this case is 5.2 (square grams). This squared unit is a bit awkward - imagine trying to explain that salamanders vary by “5.2 square grams” from the mean! This is one of the main reasons we use standard deviation instead of variance.
Standard Deviation
The standard deviation is simply the square root of the variance. We often prefer it because:
- It’s in the same units as our original data (grams instead of square grams)
- It’s easier to interpret in relation to the mean
For our example:
- Variance = 5.2
- Standard deviation = √5.2 ≈ 2.28 grams
This means that salamander weights typically vary by about 2.28 grams above or below the mean.
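The step-by-step arithmetic above can be reproduced in R. One caveat worth knowing: R’s built-in var() and sd() divide by n - 1 (the “sample” versions) rather than by n, so on small toy data they give slightly larger answers than the hand calculation:

```r
# Toy weights from the worked example (grams)
weights <- c(2, 3, 5, 7, 8)
devs <- weights - mean(weights)      # deviations from the mean: -3 -2 0 2 3
devs^2                               # squared deviations: 9 4 0 4 9
sum(devs^2) / length(weights)        # population variance: 5.2
sqrt(sum(devs^2) / length(weights))  # population sd: about 2.28
var(weights)                         # 6.5 -- divides by n - 1, not n
sd(weights)                          # sqrt(6.5), about 2.55
```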
Let’s calculate these for our full salamander dataset:
# Variance in salamander weights
var(and_vertebrates$weight_g, na.rm = TRUE)
[1] 113.9829
# Standard deviation (square root of variance)
sd(and_vertebrates$weight_g, na.rm = TRUE)
[1] 10.67628
The mean, standard deviation, and variance are key for understanding data variability and underpin many statistical tests/models. Expect to see them often in research.
Interquartile range (IQR)
The Interquartile Range (IQR) is a robust measure of spread that tells us the range where the middle 50% of our data falls. It’s calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1).
To understand IQR, let’s first understand quartiles using a simple example with salamander weights (in grams): 2, 2, 3, 4, 5, 5, 6, 8, 10
- First, arrange the data in order (already done above)
- Find the median (Q2) - it’s 5
- Split the data into two halves:
- Lower half: 2, 2, 3, 4
- Upper half: 5, 6, 8, 10
- Find Q1 (median of lower half) = 2.5
- Find Q3 (median of upper half) = 7
- Calculate IQR = Q3 - Q1 = 7 - 2.5 = 4.5
The IQR tells us that the middle 50% of salamander weights fall within a 4.5-gram range. This is particularly useful because:
- It’s not affected by extreme values (very large or small salamanders)
- It gives us a sense of “typical” variation in the population
- It helps identify outliers: values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR are often considered outliers
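For the worked example, quantile() and IQR() do these calculations directly. One caveat: R’s default quantile algorithm (type 7) interpolates, so its quartiles here (3 and 6) differ slightly from the median-of-halves method used above (2.5 and 7). This is a known difference between quantile conventions, not an error:

```r
# Quartiles and IQR for the worked-example weights
weights <- c(2, 2, 3, 4, 5, 5, 6, 8, 10)
q <- quantile(weights, c(0.25, 0.75))  # Q1 = 3, Q3 = 6 under R's default (type 7)
iqr <- IQR(weights)                    # 6 - 3 = 3
# Outlier fences: values beyond these are candidate outliers
c(lower = q[[1]] - 1.5 * iqr, upper = q[[2]] + 1.5 * iqr)  # -1.5 and 10.5
```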
The IQR is particularly useful in studies where we want to focus on typical variations while excluding extreme values. For lake ice duration, this helps us understand normal annual patterns whilst excluding unusually warm or cold years:
# IQR for ice duration
IQR(ntl_icecover$ice_duration, na.rm = TRUE)
[1] 23
# Get quartiles (handling missing values)
quantile(ntl_icecover$ice_duration, na.rm = TRUE)
0% 25% 50% 75% 100%
21 92 103 115 161
Looking at the quartiles:
- Q1 (25th percentile): 25% of ice durations are below this value
- Q2 (50th percentile/median): The middle ice duration
- Q3 (75th percentile): 75% of ice durations are below this value
- IQR = Q3 - Q1: The range where the middle 50% of ice durations fall
This information helps us understand the typical variation in ice duration while being less sensitive to extreme weather events that might produce unusually long or short ice cover periods.