Introduction to Statistical Programming

Note

This section will not teach you the basics of R and RStudio. If you need that, please refer to A brief R guide for surviving ENVX1002. You will need to be logged into Canvas to access this link.

Data Introduction

While not all of us are ecologists, we will be working with two datasets from the Long Term Ecological Research (LTER) Network:

  1. Salamander Counts (and_vertebrates): Data from the Andrews Forest LTER site tracking salamander populations across different streams.
  2. Lake Ice Duration (ntl_icecover): Records from the North Temperate Lakes LTER site measuring annual ice cover duration.

Load the data to get started:

pacman::p_load(lterdatasampler)

# load the data
data("and_vertebrates", "ntl_icecover")

These are very large datasets – but don’t be intimidated! The point is to show you that statistical programming can handle big data, and it will all make sense as we go along.

Metadata

Statistical analysis starts by exploring the data’s structure. Sometimes, datasets include metadata—context that helps explain the data.

For the data that we are using today, metadata is available in the documentation. Use ?and_vertebrates and ?ntl_icecover to access it.

?and_vertebrates
?ntl_icecover

Understanding Data Structure

The str() function shows us the types of variables and how the data is “arranged”:

str(and_vertebrates)
tibble [32,209 × 16] (S3: tbl_df/tbl/data.frame)
 $ year       : num [1:32209] 1987 1987 1987 1987 1987 ...
 $ sitecode   : chr [1:32209] "MACKCC-L" "MACKCC-L" "MACKCC-L" "MACKCC-L" ...
 $ section    : chr [1:32209] "CC" "CC" "CC" "CC" ...
 $ reach      : chr [1:32209] "L" "L" "L" "L" ...
 $ pass       : num [1:32209] 1 1 1 1 1 1 1 1 1 1 ...
 $ unitnum    : num [1:32209] 1 1 1 1 1 1 1 1 1 1 ...
 $ unittype   : chr [1:32209] "R" "R" "R" "R" ...
 $ vert_index : num [1:32209] 1 2 3 4 5 6 7 8 9 10 ...
 $ pitnumber  : num [1:32209] NA NA NA NA NA NA NA NA NA NA ...
 $ species    : chr [1:32209] "Cutthroat trout" "Cutthroat trout" "Cutthroat trout" "Cutthroat trout" ...
 $ length_1_mm: num [1:32209] 58 61 89 58 93 86 107 131 103 117 ...
 $ length_2_mm: num [1:32209] NA NA NA NA NA NA NA NA NA NA ...
 $ weight_g   : num [1:32209] 1.75 1.95 5.6 2.15 6.9 5.9 10.5 20.6 9.55 13 ...
 $ clip       : chr [1:32209] "NONE" "NONE" "NONE" "NONE" ...
 $ sampledate : Date[1:32209], format: "1987-10-07" "1987-10-07" ...
 $ notes      : chr [1:32209] NA NA NA NA ...

There is also glimpse(), which requires you to load the tidyverse package:

pacman::p_load(tidyverse)
glimpse(and_vertebrates)
Rows: 32,209
Columns: 16
$ year        <dbl> 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987…
$ sitecode    <chr> "MACKCC-L", "MACKCC-L", "MACKCC-L", "MACKCC-L", "MACKCC-L"…
$ section     <chr> "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC"…
$ reach       <chr> "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L"…
$ pass        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ unitnum     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2…
$ unittype    <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", "R", "R", "R"…
$ vert_index  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, …
$ pitnumber   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ species     <chr> "Cutthroat trout", "Cutthroat trout", "Cutthroat trout", "…
$ length_1_mm <dbl> 58, 61, 89, 58, 93, 86, 107, 131, 103, 117, 100, 127, 99, …
$ length_2_mm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ weight_g    <dbl> 1.75, 1.95, 5.60, 2.15, 6.90, 5.90, 10.50, 20.60, 9.55, 13…
$ clip        <chr> "NONE", "NONE", "NONE", "NONE", "NONE", "NONE", "NONE", "N…
$ sampledate  <date> 1987-10-07, 1987-10-07, 1987-10-07, 1987-10-07, 1987-10-0…
$ notes       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

You will soon realise that there are many ways to “skin a cat” in R – that is, there are multiple ways to achieve the same outcome. You can pick any method as long as you are comfortable with it.

In general, the outputs from str() and glimpse() can seem overwhelming at first. This is completely normal, especially when you’re new to data exploration. Take your time, and don’t worry if it doesn’t all make sense immediately – it will come with practice.

Previewing the Data

The head() function shows us the first few rows of a dataset:

head(and_vertebrates)
# A tibble: 6 × 16
   year sitecode section reach  pass unitnum unittype vert_index pitnumber
  <dbl> <chr>    <chr>   <chr> <dbl>   <dbl> <chr>         <dbl>     <dbl>
1  1987 MACKCC-L CC      L         1       1 R                 1        NA
2  1987 MACKCC-L CC      L         1       1 R                 2        NA
3  1987 MACKCC-L CC      L         1       1 R                 3        NA
4  1987 MACKCC-L CC      L         1       1 R                 4        NA
5  1987 MACKCC-L CC      L         1       1 R                 5        NA
6  1987 MACKCC-L CC      L         1       1 R                 6        NA
# ℹ 7 more variables: species <chr>, length_1_mm <dbl>, length_2_mm <dbl>,
#   weight_g <dbl>, clip <chr>, sampledate <date>, notes <chr>

While tail() shows the last few rows:

tail(ntl_icecover)
# A tibble: 6 × 5
  lakeid      ice_on     ice_off    ice_duration  year
  <fct>       <date>     <date>            <dbl> <dbl>
1 Lake Monona 2014-12-02 2015-04-02          105  2014
2 Lake Monona 2016-01-11 2016-03-13           62  2015
3 Lake Monona 2016-12-16 2017-03-07           81  2016
4 Lake Monona 2017-12-26 2018-03-29           93  2017
5 Lake Monona 2018-12-11 2019-03-31           97  2018
6 Lake Monona 2019-12-16 2020-03-20           80  2019

Quick Statistical Overview

The summary() function provides a statistical overview of your data, adapting its output to each variable’s type: numeric variables get quartiles, means, and extremes, while character variables show only their length and storage type. This means it is more informative for some datasets than for others.

summary(and_vertebrates)
      year        sitecode           section             reach          
 Min.   :1987   Length:32209       Length:32209       Length:32209      
 1st Qu.:1998   Class :character   Class :character   Class :character  
 Median :2006   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2005                                                           
 3rd Qu.:2012                                                           
 Max.   :2019                                                           
                                                                        
      pass          unitnum         unittype           vert_index    
 Min.   :1.000   Min.   : 1.000   Length:32209       Min.   :  1.00  
 1st Qu.:1.000   1st Qu.: 3.000   Class :character   1st Qu.:  5.00  
 Median :1.000   Median : 7.000   Mode  :character   Median : 13.00  
 Mean   :1.224   Mean   : 7.696                      Mean   : 20.17  
 3rd Qu.:1.000   3rd Qu.:11.000                      3rd Qu.: 27.00  
 Max.   :2.000   Max.   :20.000                      Max.   :147.00  
                                                                     
   pitnumber          species           length_1_mm      length_2_mm   
 Min.   :   62048   Length:32209       Min.   : 19.00   Min.   : 28.0  
 1st Qu.:13713632   Class :character   1st Qu.: 47.00   1st Qu.: 77.0  
 Median :18570447   Mode  :character   Median : 63.00   Median : 98.0  
 Mean   :16286432                      Mean   : 73.83   Mean   :100.5  
 3rd Qu.:19132429                      3rd Qu.: 97.00   3rd Qu.:119.0  
 Max.   :28180046                      Max.   :253.00   Max.   :284.0  
 NA's   :26574                         NA's   :17       NA's   :19649  
    weight_g           clip             sampledate            notes          
 Min.   :  0.090   Length:32209       Min.   :1987-10-06   Length:32209      
 1st Qu.:  1.510   Class :character   1st Qu.:1998-09-04   Class :character  
 Median :  6.050   Mode  :character   Median :2006-09-06   Mode  :character  
 Mean   :  8.903                      Mean   :2005-08-05                     
 3rd Qu.: 11.660                      3rd Qu.:2012-09-05                     
 Max.   :134.590                      Max.   :2019-09-05                     
 NA's   :13268                                                               
summary(ntl_icecover)
          lakeid        ice_on              ice_off            ice_duration  
 Lake Mendota:167   Min.   :1851-12-13   Min.   :1852-03-25   Min.   : 21.0  
 Lake Monona :167   1st Qu.:1895-12-04   1st Qu.:1896-04-04   1st Qu.: 92.0  
 Lake Wingra :  0   Median :1936-12-07   Median :1937-10-01   Median :103.0  
                    Mean   :1937-03-15   Mean   :1937-09-25   Mean   :102.8  
                    3rd Qu.:1978-12-09   3rd Qu.:1979-04-08   3rd Qu.:115.0  
                    Max.   :2020-01-12   Max.   :2020-03-22   Max.   :161.0  
                    NA's   :1            NA's   :2            NA's   :3      
      year     
 Min.   :1851  
 1st Qu.:1894  
 Median :1936  
 Mean   :1936  
 3rd Qu.:1978  
 Max.   :2019  
               

Summary Statistics

Mean: central tendency

The arithmetic mean is often our first step in understanding typical values in data.

Mathematically, the mean (\(\bar{x}\)) is calculated as the sum of all values (\(x_i\)) divided by the number of values (\(n\)):

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

For example, consider the following set of salamander weights (in grams): 2, 3, 5, 7, and 8. The mean weight is:

\[ \bar{x} = \frac{2 + 3 + 5 + 7 + 8}{5} = \frac{25}{5} = 5 \]

In simpler terms, the mean is what you get if you add up all the numbers and then divide by how many numbers you added. It’s a way to find the “average” value.

In R, we can calculate the mean using the mean() function. Note that we used the $ operator to access the specific variables in the dataset. It will make sense once you start typing the code as RStudio will provide you with suggestions.

# Calculate mean salamander weight
mean(and_vertebrates$weight_g, na.rm = TRUE)
[1] 8.902859
# Calculate mean ice duration
mean(ntl_icecover$ice_duration, na.rm = TRUE)
[1] 102.8187

Missing values are represented as NA in R. The na.rm = TRUE argument tells R to ignore these missing values when calculating the mean. If you don’t include this argument, R will return NA as the mean, because mathematically you can’t do calculations with unknown values.

# This will return NA
mean(and_vertebrates$weight_g)
[1] NA

Median: a “robust” central measure

The median is the middle value when all numbers are arranged in order, like finding the middle person if everyone lined up by height. This makes it especially good at representing “typical” values when there are some extreme measurements in your data.

For example, let’s take these salamander weights (in grams): 2, 3, 5, 7, 20. To find the median:

  1. Arrange in order: 2, 3, 5, 7, 20
  2. Find the middle number: 5

Even though there’s one unusually large salamander (20g), the median (5g) still represents a typical salamander weight. Compare this to the mean:

(2 + 3 + 5 + 7 + 20) ÷ 5 = 7.4g

The mean is pulled higher by the unusually large salamander, making it less representative of a typical salamander in this case. This is why we say the median is “robust” - it’s not easily influenced by extreme values or outliers.

For datasets with an even number of values, we take the average of the two middle numbers. For example, with weights 2, 3, 5, 7:

  1. Arrange in order: 2, 3, 5, 7
  2. Find the middle pair: 3 and 5
  3. Calculate their average: (3 + 5) ÷ 2 = 4
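These calculations can be checked directly in R using the same toy weights (a quick sketch; base R only):

```r
# Salamander weights with one unusually large individual (20 g)
weights <- c(2, 3, 5, 7, 20)

median(weights)  # 5 -- unaffected by the large value
mean(weights)    # 7.4 -- pulled upward by the large value

# Even number of values: median is the average of the middle pair
median(c(2, 3, 5, 7))  # (3 + 5) / 2 = 4
```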

This robustness proves particularly valuable in data where extreme values are common:

  • House prices in a neighborhood (a few mansion prices won’t skew the “typical” house value)
  • Animal body sizes (a few giants won’t affect the “typical” size)
  • Stream depths (a few very deep pools won’t change the “typical” depth)

Let’s compare the two measures using our full lake ice data:

# Median ice_duration
median(ntl_icecover$ice_duration, na.rm = TRUE)
[1] 103
# Compare with mean
mean(ntl_icecover$ice_duration, na.rm = TRUE)
[1] 102.8187

From the above you can see that both values are close, indicating that the data is not strongly skewed by extreme values. We will learn more about how this relates to statistics and data distributions in the next few chapters.

Mode: Most frequent value

The mode is the most frequently occurring value in a dataset. It is especially useful for identifying the most common category or count, whereas the mean and median are better suited to continuous numerical data.

For example, consider these salamander counts in different sections of a stream: 2, 3, 2, 4, 2, 1, 3

To find the mode:

  1. Count how often each value appears:
    • 1 appears once
    • 2 appears three times
    • 3 appears twice
    • 4 appears once
  2. The mode is 2 because it appears most frequently (three times)

Some datasets can have multiple modes:

  • If we had: 2, 3, 2, 4, 2, 3, 3
  • Both 2 and 3 appear three times
  • This dataset would be “bimodal” (having two modes)
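The counting step can be reproduced in R with table(), which tallies how often each value occurs; the mode is then whichever value (or values) has the highest count. A minimal sketch using the counts from above:

```r
counts <- c(2, 3, 2, 4, 2, 1, 3)

tab <- table(counts)  # frequency of each value
tab

# Keep the value(s) with the maximum count -- handles bimodal data too
as.numeric(names(tab)[tab == max(tab)])  # 2
```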

While the mode is useful for discrete data like species counts, R lacks a built-in function for it. We could write one, but we’ll use existing packages instead. It is good practice to find out if a package exists before writing your own function since you are (probably) not in this unit to become a software developer.

A Google search should lead you to multiple packages that can calculate the mode; we’ll use the modeest package for now. Note that the function is called mlv(), short for “most likely value”:

pacman::p_load(modeest)
ml <- mlv(and_vertebrates$weight_g, method = "shorth", na.rm = TRUE)
ml
[1] 2

Now that we know the modal weight is 2 grams, we can use this information to understand the distribution of salamander weights in the dataset.

Measures of Spread

Range

The range helps us understand the full scope of variation in a set of measurements. For salamanders, this might reveal the difference between the smallest juvenile and largest adult:

# Range of salamander weights
range(and_vertebrates$weight_g, na.rm = TRUE)
[1]   0.09 134.59
# Alternative using min and max
min(and_vertebrates$weight_g, na.rm = TRUE)
[1] 0.09
max(and_vertebrates$weight_g, na.rm = TRUE)
[1] 134.59

The range provides a basic understanding of data spread, showing the total span of values. It’s useful for a quick look at the extent of variation in data. However, the range doesn’t tell us how the values are distributed within that span. To understand this, we need more advanced measures like variance and standard deviation.

Variance and standard deviation

Variance and standard deviation help us understand how spread out our data is from the mean. Think of them as measuring how much values typically “deviate” or differ from the average.

Variance

Variance measures the average squared distance of each value from the mean. Let’s break this down with a simple example using salamander weights (in grams): 2, 3, 5, 7, 8

  1. Calculate the mean: (2 + 3 + 5 + 7 + 8) ÷ 5 = 5
  2. Find how far each value is from the mean:
    • 2 is 3 below the mean: (2 - 5) = -3
    • 3 is 2 below the mean: (3 - 5) = -2
    • 5 equals the mean: (5 - 5) = 0
    • 7 is 2 above the mean: (7 - 5) = 2
    • 8 is 3 above the mean: (8 - 5) = 3
  3. Square these differences (this makes all values positive):
    • (-3)² = 9
    • (-2)² = 4
    • 0² = 0
    • 2² = 4
    • 3² = 9
  4. Calculate the average of these squared differences: (9 + 4 + 0 + 4 + 9) ÷ 5 = 5.2

The variance in this case is 5.2 (square grams). This squared unit is a bit awkward - imagine trying to explain that salamanders vary by “5.2 square grams” from the mean! This is one of the main reasons we use standard deviation instead of variance.

Standard Deviation

The standard deviation is simply the square root of the variance. We often prefer it because:

  • It’s in the same units as our original data (grams instead of square grams)
  • It’s easier to interpret in relation to the mean

For our example:

  • Variance = 5.2
  • Standard deviation = √5.2 ≈ 2.28 grams

This means that salamander weights typically vary by about 2.28 grams above or below the mean.
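The worked example can be reproduced in R, with one caution: the calculation above divides by n (the population variance), whereas R’s var() and sd() divide by n - 1 (the sample variance), so the built-ins give slightly larger values on small examples. A quick sketch:

```r
weights <- c(2, 3, 5, 7, 8)

# Population variance, as in the worked example: divide by n
pop_var <- mean((weights - mean(weights))^2)
pop_var        # 5.2
sqrt(pop_var)  # ~2.28 grams

# R's built-ins divide by n - 1 instead (sample variance)
var(weights)   # 6.5
sd(weights)    # ~2.55
```

For large datasets like and_vertebrates, the difference between dividing by n and n - 1 is negligible.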

Let’s calculate these for our full salamander dataset:

# Variance in salamander weights
var(and_vertebrates$weight_g, na.rm = TRUE)
[1] 113.9829
# Standard deviation (square root of variance)
sd(and_vertebrates$weight_g, na.rm = TRUE)
[1] 10.67628

The mean, standard deviation, and variance are key for understanding data variability and underpin many statistical tests/models. Expect to see them often in research.

Interquartile range (IQR)

The Interquartile Range (IQR) is a robust measure of spread that tells us the range where the middle 50% of our data falls. It’s calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1).

To understand IQR, let’s first understand quartiles using a simple example with salamander weights (in grams): 2, 2, 3, 4, 5, 5, 6, 8, 10

  1. First, arrange the data in order (already done above)
  2. Find the median (Q2) - it’s 5
  3. Split the data into two halves:
    • Lower half: 2, 2, 3, 4
    • Upper half: 5, 6, 8, 10
  4. Find Q1 (median of lower half) = 2.5
  5. Find Q3 (median of upper half) = 7
  6. Calculate IQR = Q3 - Q1 = 7 - 2.5 = 4.5
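These steps can be checked in R. One subtlety: quantile() supports several calculation methods via its type argument, and R’s default (type = 7) interpolates differently from the median-of-halves approach used above. For this example, type = 6 (which places quartiles at (n + 1) × p positions) reproduces the hand calculation:

```r
weights <- c(2, 2, 3, 4, 5, 5, 6, 8, 10)

# Quartiles matching the hand calculation
quantile(weights, c(0.25, 0.75), type = 6)  # 2.5 and 7
IQR(weights, type = 6)                      # 4.5

# R's default method gives a different answer on a sample this small
IQR(weights)  # 3
```

On large datasets like ntl_icecover, the different methods give nearly identical results.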

The IQR tells us that the middle 50% of salamander weights vary by 4.5 grams. This is particularly useful because:

  • It’s not affected by extreme values (very large or small salamanders)
  • It gives us a sense of “typical” variation in the population
  • It helps identify outliers: values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR are often considered outliers
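To see the 1.5 × IQR outlier rule in action, here is a minimal sketch with made-up weights (the fences are computed directly from the quartiles, using quantile()’s default method):

```r
x <- c(2, 3, 5, 7, 8, 30)  # one suspiciously large value

q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1

lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

x[x < lower | x > upper]  # flags 30 as a potential outlier
```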

The IQR is particularly useful in studies where we want to focus on typical variations while excluding extreme values. For lake ice duration, this helps us understand normal annual patterns whilst excluding unusually warm or cold years:

# IQR for ice duration
IQR(ntl_icecover$ice_duration, na.rm = TRUE)
[1] 23
# Get quartiles (handling missing values)
quantile(ntl_icecover$ice_duration, na.rm = TRUE)
  0%  25%  50%  75% 100% 
  21   92  103  115  161 

Looking at the quartiles:

  • Q1 (25th percentile): 25% of ice durations are below this value
  • Q2 (50th percentile/median): The middle ice duration
  • Q3 (75th percentile): 75% of ice durations are below this value
  • IQR = Q3 - Q1: The range where the middle 50% of ice durations fall

This information helps us understand the typical variation in ice duration while being less sensitive to extreme weather events that might produce unusually long or short ice cover periods.