Lecture 02: Introduction to statistical programming

ENVX1002 Statistics in Life and Environmental Sciences

Januar Harianto

The University of Sydney

Apr 2026

Learning outcomes

After this week, you will be able to:

Navigate and use the RStudio interface effectively
Execute basic R functions and understand their syntax
Describe measures of central tendency (mean, median, mode) clearly and without mathematical jargon
Describe measures of spread (range, IQR, variance, standard deviation) through practical examples
Calculate and compare statistical measures using R and Excel for biological data

Quick checklist

By now you should have…

Installed R
Installed RStudio
Created one (or two) documents in Markdown using Quarto
Read the ENVX1002 R guide (Canvas > Tool Kit)

History of statistical programming

From calculators to computers

1960s: Statistical software BMDP and SPSS.

1970s: SAS (Statistical Analysis System).

1976: Birth of S at Bell Labs. S-PLUS debuts in 1988.

The R story

Created at University of Auckland, New Zealand in 1993
Named after creators (Ross & Robert) – and inspired by S programming language
Designed specifically for statistical computing and graphics, now used in many fields
Over 22,000 packages on CRAN – extensive statistical capabilities
Supports reproducible research – a key theme in modern science (remember Lecture 01?)

Getting Started with R

Your RStudio workspace (5 min)

Left: all input, Right: all output

Working in Quarto

A Markdown-based authoring tool for reproducible documents
Write code and text together, output to HTML, PDF, Word and more
Quarto gallery
Quick demo (5 min)

Where to find help

Read “A brief R guide for surviving ENVX1002” by Dr. Geoffrey Mazue
- Available in the Tool Kit section on Canvas
- Contains essential R programming tips and best practices for this unit
Use the Help panel in RStudio (bottom-right) – type ?function_name to look up any function

Key statistical concepts

Population vs Sample

Population

All possible observations
Usually too large to measure
Example: All trees in a forest

Sample

Subset of the population
What we actually measure
Example: 100 trees measured in a forest

Most (if not all) statistical analyses are based on samples, not populations. This is why we need descriptive statistics – to summarise what our sample tells us.

Populations and their samples

Population	Sample
All koalas in Australia	150 koalas studied in NSW
Every fish in Sydney Harbour	300 fish caught in specific locations
All soil bacteria in a forest	Bacteria from 50 soil cores
All cookies in a bakery	Tasting 3 cookies to judge quality
All students at the university	The 200 students in this course

Each sample is a small piece of a much larger picture.

How well does a sample represent the population?

It depends:

Sample size: Larger samples are more likely to represent the population
Sampling method: Random samples are more likely to be representative
Population variability: More variability means larger samples are needed

We often balance these factors due to time, cost, and practical constraints. We will learn tools for this later – confidence intervals and hypothesis testing.
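
The effect of sample size can be sketched with a small simulation. The population below is simulated and the helper name `sample_means()` is made up for this illustration; the exact numbers depend on the random seed, so focus on the pattern rather than the values.

```r
# Hypothetical population: 1000 simulated tree heights (m)
set.seed(1)
population <- rnorm(1000, mean = 12, sd = 5)

# Draw `reps` random samples of a given size and record each sample mean
sample_means <- function(size, reps = 500) {
  replicate(reps, mean(sample(population, size = size)))
}

means_small <- sample_means(size = 5)    # small samples
means_large <- sample_means(size = 100)  # large samples

# Compare how far the sample means stray from the population mean
mean(population)
range(means_small)
range(means_large)
```

The means of the larger samples should fall in a much narrower band around the population mean than the means of the smaller samples.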

Samples vary

Different samples give different results – suppose we have a population of 1000 trees and we randomly sample 6 tree heights. If this is done 3 times, it is likely that the samples will be different.

The code below demonstrates this – focus on the results for now:

set.seed(258) 
population <- rnorm(1000, mean = 12, sd = 5)

# create samples
sample1 <- round(sample(population, size = 6), 1)
sample2 <- round(sample(population, size = 6), 1)
sample3 <- round(sample(population, size = 6), 1)
# show samples
for (i in 1:3) {
   cat(sprintf("Sample %d: ", i), get(paste0("sample", i)), "\n")
}

Sample 1:  13.7 14.6 14.8 9.6 6.5 10 
Sample 2:  7.6 6.1 9.9 10.1 12.5 14.9 
Sample 3:  9.8 7.9 18.4 19.1 7 26.1

Are the samples different? How different are they? This is where descriptive statistics come in.
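
As a quick preview, the three samples can be compared by summarising each with a single number. This sketch re-enters the values from the printed output above (so it does not depend on the random seed) and uses the mean and standard deviation, both of which are explained later in this lecture.

```r
# Values copied from the printed samples above
samples <- list(
  sample1 = c(13.7, 14.6, 14.8, 9.6, 6.5, 10),
  sample2 = c(7.6, 6.1, 9.9, 10.1, 12.5, 14.9),
  sample3 = c(9.8, 7.9, 18.4, 19.1, 7, 26.1)
)

# A "typical" value for each sample
sample_means <- sapply(samples, mean)
round(sample_means, 1)   # 11.5, 10.2, 14.7

# How much each sample varies
sample_sds <- sapply(samples, sd)
round(sample_sds, 1)
```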

Descriptive statistics

We use descriptive statistics to summarise and describe data, helping us compare and contrast.

Measures of central tendency – describe the “typical” value in a sample
- mean, median, mode

Measures of spread – describe how much the data varies
- standard deviation, variance (commonly used)
- range, quartiles, IQR (useful for skewed data or data with outliers)

Data types

Before we calculate any statistics, it helps to know what kind of data we are working with.

Categorical (qualitative)

Data that falls into groups or categories
Examples: blood type (A, B, AB, O), eye colour, species name
Summarised by counts or proportions
The mode is the main measure of central tendency

Continuous (quantitative)

Data measured on a numeric scale
Examples: height (cm), temperature (°C), seagrass length (cm)
Summarised by mean, median, standard deviation, etc.
Can take any value within a range

Knowing your data type helps you choose the right summary statistic. We will revisit this in more detail in Lecture 03.
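
A minimal sketch of summarising categorical data by counts and proportions. The blood types below are made-up values for illustration only.

```r
# Hypothetical blood types recorded for 10 people
blood_type <- c("O", "A", "O", "B", "A", "O", "AB", "O", "A", "O")

# Counts per category
counts <- table(blood_type)
counts

# Proportions per category (each count divided by the total)
props <- prop.table(counts)
props

# The mode is the category with the highest count
blood_mode <- names(counts[counts == max(counts)])
blood_mode   # "O"
```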

That is all for today!

We will continue with measures of central tendency and measures of spread in tomorrow’s session.

Measures of central tendency

Mean – also known as the average

Add up all your numbers
Divide by how many numbers you have

Mathematical notation

Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Where $x_i$ is each individual value, $N$ is population size, and $n$ is sample size. Notice how $\mu$ (population) and $\bar{x}$ (sample) mirror the population vs sample distinction from earlier.

Do not worry about memorising the formulas – R handles the calculation for us. The important thing is understanding what the mean represents.

Mean in Excel

Excel offers several ways to calculate the mean:

Using AVERAGE function
```
=AVERAGE(A1:A4)
```
- Type =AVERAGE(
- Select cells with your data
- Press Enter
Using AutoCalculate
- Select your data cells
- Look at bottom right
- Average shown automatically

Mean in R

I’ll show you how to do this in RStudio.

Step 1: Create a vector of values

In R, we store data in vectors (ordered sets of values of the same type):

# Create a vector of test scores
scores <- c(80, 85, 90, 95)

# Display the scores
scores

[1] 80 85 90 95

Step 2: Calculate

Either calculate the mean manually:

# Sum divided by count
sum(scores) / length(scores)

[1] 87.5

Or use the mean() function:

# The mean() function does the work for us
mean(scores)

[1] 87.5

Median – the middle value

The median is the middle number when your data is in order:

First, put your numbers in order
Find the middle value
If you have an even number of values, take the average of the two middle numbers

Example: House prices ($’000s): 450, 1100, 480, 460, 470, 420, 1400, 450, 470

Order: 420, 450, 450, 460, 470, 470, 480, 1100, 1400

Median: 470 (the 5th of the 9 ordered values)

The median is not pulled by extreme values (like the two expensive houses at $1,100k and $1,400k), making it a better summary for skewed data.
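
The steps above can be followed by hand in R. This is a sketch for illustration; the variable names `sorted`, `middle` and `even_prices` are made up here.

```r
# House prices ($'000s) from the example
prices <- c(450, 1100, 480, 460, 470, 420, 1400, 450, 470)

# Step 1: put the values in order
sorted <- sort(prices)

# Step 2: find the middle position (9 values, so position 5)
n <- length(sorted)
middle <- (n + 1) / 2
sorted[middle]               # 470

# Even number of values: average the two middle values
even_prices <- sorted[1:8]   # the 8 cheapest houses
mean(even_prices[c(4, 5)])   # (460 + 470) / 2 = 465
```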

Median in Excel

Excel provides the MEDIAN function:

=MEDIAN(A1:A9)

Type =MEDIAN(
Select your data range
Press Enter

Median in R

R does all the ordering and finding the middle for us:

# House prices
prices <- c(450, 1100, 480, 460, 470, 420, 1400, 450, 470)

# Find median
median(prices)

[1] 470

Comparing the mean and median:

# Compare with mean
mean(prices)

[1] 633.3333

Which is a better measure for house prices?
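
One way to explore this question is to see how much each measure moves when the two expensive houses are left out. In this sketch, the cut-off of 1000 is an arbitrary choice made for illustration.

```r
prices <- c(450, 1100, 480, 460, 470, 420, 1400, 450, 470)

# Keep only the houses under $1,000k
typical <- prices[prices < 1000]

# The mean shifts a lot when the two expensive houses are removed
mean(prices)      # about 633.3
mean(typical)     # about 457.1

# The median barely moves
median(prices)    # 470
median(typical)   # 460
```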

Mode – most frequent value

The mode is the value that appears most frequently in your data. It is most useful for:

Categorical data (like blood types, species names, eye colours)
Finding the most common item in a group

For numerical data, the mean or median is usually preferred. But the mode answers questions that other measures cannot:

What is the most common blood type in a population?
Which species is most frequently observed at a field site?

Mode in Excel

Excel provides several methods to find the mode, but the simplest is the MODE function, which works on numeric data:

=MODE(A1:A10)

Type =MODE(
Select your data range
Press Enter

Mode in R

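R does not have a built-in function for the statistical mode (the built-in mode() function reports how an object is stored, not its most frequent value), but there are straightforward ways to find it.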

Do it yourself with table():

# Species observed at a field site
species <- c("banksia", "eucalyptus", "banksia", "acacia",
             "eucalyptus", "banksia", "grevillea", "banksia")

# Count how often each species appears
table(species)

species
    acacia    banksia eucalyptus  grevillea 
         1          4          2          1

# Find the most common species
freq_table <- table(species)
names(freq_table[freq_table == max(freq_table)])

[1] "banksia"

Or use a package:

if(!require("modeest")) install.packages("modeest")
library(modeest)
mlv(species, method = "mfv")  # most frequent value

[1] "banksia"

The point is that when R does not have a built-in function, you can write your own solution or find a package that does the job.
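
As a sketch of "writing your own solution", the table() approach can be wrapped in a small function. The name `stat_mode()` is hypothetical and not part of base R or any package.

```r
# stat_mode(): hypothetical helper that returns the most frequent value(s)
stat_mode <- function(x) {
  freq_table <- table(x)                            # count each value
  names(freq_table[freq_table == max(freq_table)])  # keep the top count(s)
}

species <- c("banksia", "eucalyptus", "banksia", "acacia",
             "eucalyptus", "banksia", "grevillea", "banksia")
stat_mode(species)                          # "banksia"

# When values tie for most frequent, all of them are returned
stat_mode(c("A", "B", "A", "B", "O"))       # "A" "B"
```

Because the result comes from the names of the frequency table, it is always returned as text, even when the input is numeric.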

Measures of spread

A biological example

Imagine sampling seagrass blade lengths from two different sites in a marine ecosystem. Both samples have a mean length of about 15.2 cm. Are both sites the same?

Site A (Protected Bay): 15.2, 15.0, 15.3, 15.1, 15.2 centimetres
Site B (Wave-exposed Coast): 12.0, 18.0, 14.5, 16.5, 15.0 centimetres

Comparing Different Measures

# Plot seagrass lengths
library(ggplot2)
library(patchwork)

seagrass_protected <- c(15.2, 15.0, 15.3, 15.1, 15.2)
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)

# Create plots for both sites
p1 <- ggplot() +
   geom_point(aes(x = 1:5, y = seagrass_protected), size = 3) +
   geom_hline(yintercept = mean(seagrass_protected), linetype = "dashed", color = "red") +
   labs(title = "Site A: Protected Bay", x = "Measurement", y = "Length (cm)") +
   ylim(10, 20)

p2 <- ggplot() +
   geom_point(aes(x = 1:5, y = seagrass_exposed), size = 3) +
   geom_hline(yintercept = mean(seagrass_exposed), linetype = "dashed", color = "red") +
   labs(title = "Site B: Wave-exposed Coast", x = "Measurement", y = "Length (cm)") +
   ylim(10, 20)

# Combine plots side by side
p1 + p2

Why do we need measures of spread?

Central tendency (mean, median, mode) only tells part of the story
Spread tells us how much variation exists in our data

Range: Overall spread of data

IQR: Spread of middle 50% of data

Variance: Average squared deviation from mean

Standard deviation: Square root of the variance, in original units

Range – The simplest measure of spread

# Create our seagrass data
seagrass_protected <- c(15.2, 15.0, 15.3, 15.1, 15.2)  # Protected bay
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)    # Wave-exposed coast

# Calculate ranges
cat("Protected bay range:", diff(range(seagrass_protected)), "cm\n")

Protected bay range: 0.3 cm

cat("Wave-exposed range:", diff(range(seagrass_exposed)), "cm\n")

Wave-exposed range: 6 cm

The range shows us that seagrass lengths are much more variable in the wave-exposed site!

Range in Excel

Use MAX() and MIN() to calculate the range:

=MAX(A1:A10) - MIN(A1:A10)

Interquartile range (IQR): The middle 50%

The IQR tells us how spread out the middle 50% of our data is:

# Get quartiles for protected bay
quantile(seagrass_protected)

  0%  25%  50%  75% 100% 
15.0 15.1 15.2 15.2 15.3

25% of data below Q1 (1st quartile)
75% of data below Q3 (3rd quartile)
IQR = Q3 - Q1
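
To connect the quartiles to the IQR, the subtraction can be done by hand and checked against R's IQR() function. This is a sketch using the protected bay data; `q1` and `q3` are names chosen for illustration.

```r
seagrass_protected <- c(15.2, 15.0, 15.3, 15.1, 15.2)

# Extract the 1st and 3rd quartiles
q <- quantile(seagrass_protected, probs = c(0.25, 0.75))
q1 <- q[["25%"]]   # 15.1
q3 <- q[["75%"]]   # 15.2

# IQR by hand, then with the built-in function
q3 - q1
IQR(seagrass_protected)
```

Both lines give a value of about 0.1 cm.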

Why use IQR?

Ignores the most extreme values, so it is less affected by outliers than the range
Works with skewed data

Comparing Sites Using IQR

# Compare IQRs
pbay <- IQR(seagrass_protected)
pbay

[1] 0.1

exbay <- IQR(seagrass_exposed)
exbay

[1] 2

Protected bay IQR: 0.1 cm
Wave-exposed IQR: 2 cm

The larger IQR in the wave-exposed site shows more spread in the typical seagrass lengths.

IQR in Excel

Use QUARTILE.INC() to find the IQR:

For Q1: =QUARTILE.INC(A1:A10, 1)
For Q3: =QUARTILE.INC(A1:A10, 3)
For IQR: =QUARTILE.INC(A1:A10, 3) - QUARTILE.INC(A1:A10, 1)

Variance: a detailed measure of spread

Variance measures how far data points are spread from their mean by:

Finding how far each point is from the mean
Squaring these distances (to handle negative values)
Taking the average of these squared distances (for a sample, the sum is divided by n - 1 rather than n)
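
The three steps can be sketched by hand in R using the wave-exposed seagrass data. Note that R's var() calculates the sample variance, which divides by n - 1 rather than n; the variable names below are chosen for illustration.

```r
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)

# Step 1: distance of each point from the mean
deviations <- seagrass_exposed - mean(seagrass_exposed)

# Step 2: square the distances so negatives do not cancel positives
squared <- deviations^2

# Step 3: average the squared distances (sample variance uses n - 1)
n <- length(seagrass_exposed)
manual_var <- sum(squared) / (n - 1)
manual_var               # 5.075

# Same answer from the built-in function
var(seagrass_exposed)
```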

Why use variance?

Uses all data points (unlike IQR)
Squaring amplifies the effect of extreme values, making outliers stand out
Shows total spread in both directions

Key points

Measured in squared units (cm²)
Larger variance = more spread

Calculating Variance in R

# Calculate variance for both sites
cat("Protected bay variance:", var(seagrass_protected), "cm²\n")

Protected bay variance: 0.013 cm²

cat("Wave-exposed variance:", var(seagrass_exposed), "cm²\n")

Wave-exposed variance: 5.075 cm²

The larger variance in the wave-exposed site shows more spread from the mean!

Variance in Excel

Use VAR.S() to calculate the sample variance:

=VAR.S(A1:A10)

For now, always use .S (sample) in this course.

Standard deviation: a more interpretable measure

Standard deviation (SD, or $\sigma$ for population, $s$ for sample) is the square root of variance:

Tells us the “typical distance” from the mean
Easy to understand – similar to saying “± value” after a mean
A small SD means values cluster closely around the mean
A large SD means values are more spread out
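
A short check, as a sketch, that the standard deviation is the square root of the variance:

```r
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)

# Square root of the variance...
sd_by_hand <- sqrt(var(seagrass_exposed))
sd_by_hand

# ...matches the built-in sd() function (about 2.25 cm)
sd(seagrass_exposed)
```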

When and why to use it

Values are in the same units as your data (unlike variance)
Perfect for describing natural variation (height, weight, temperature)
Used in many statistical tests
Great for comparing different groups or datasets

Interpreting standard deviation (with R)

We can describe our seagrass lengths using mean ± standard deviation:

# Protected bay
mean_p <- mean(seagrass_protected)
sd_p <- sd(seagrass_protected)
cat("Protected bay:", round(mean_p, 1), "±", round(sd_p, 2), "cm\n")

Protected bay: 15.2 ± 0.11 cm

# Wave-exposed
mean_e <- mean(seagrass_exposed)
sd_e <- sd(seagrass_exposed)
cat("Wave-exposed:", round(mean_e, 1), "±", round(sd_e, 2), "cm\n")

Wave-exposed: 15.2 ± 2.25 cm

The ± tells us about the typical variation around the mean. Larger SD values indicate more spread!

SD in Excel

Use STDEV.S() to calculate the sample standard deviation:

=STDEV.S(A1:A10)

For now, always use .S (sample) in this course.

Comparing spread measures

Measure	Protected Bay	Wave-exposed Coast	What it Tells Us
Range	0.3 cm	6 cm	Overall spread (sensitive to outliers)
IQR	0.1 cm	2 cm	Middle 50% spread (ignores extremes)
Variance	0.01 cm²	5.08 cm²	Average squared distance from mean
SD	0.11 cm	2.25 cm	Typical distance from mean (in original units)
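
A table like this can be built in R in one step. The sketch below uses a hypothetical helper, `spread_summary()`, that is not part of any package.

```r
seagrass_protected <- c(15.2, 15.0, 15.3, 15.1, 15.2)
seagrass_exposed <- c(12.0, 18.0, 14.5, 16.5, 15.0)

# spread_summary(): hypothetical helper returning all four measures of spread
spread_summary <- function(x) {
  c(range = diff(range(x)), IQR = IQR(x), variance = var(x), SD = sd(x))
}

# Apply the helper to each site; one column per site, one row per measure
summary_table <- sapply(
  list(protected = seagrass_protected, exposed = seagrass_exposed),
  spread_summary
)
round(summary_table, 2)
```

Every row should show a larger value for the wave-exposed site than for the protected bay.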

Key Observations

Wave-exposed site shows consistently more variation
Each measure gives a different perspective
Choose based on your data and goals
Standard deviation is most commonly used in research papers

References and Resources

Core Reading

Read after the lecture:

Quinn & Keough (2024). Experimental Design and Data Analysis for Biologists. Cambridge University Press. Chapter 2: Things to know before proceeding.
Canvas site for lecture notes and additional resources

Thanks!

This presentation is based on the SOLES Quarto reveal.js template and is licensed under a Creative Commons Attribution 4.0 International License.