ENVX2001 Applied Statistical Methods
Apr 2026
Sampling designs… or why confidence intervals are important in stats
We will:
A farmer wants to estimate the average soil carbon on their property to apply for a carbon credit scheme. Their budget allows for 7 measurement sites.
Where should they sample, and how confident can they be in the result?
The farmer cannot change their land, so they measure it as it is. This makes their study an observational study.
The farmer needs a plan for choosing their 7 sites. The simplest approach: simple random sampling.
The farmer might be tempted to just walk to convenient spots, but that would bias the sample. Convenient locations are often atypical: near paths, buildings, or water sources.
Within a population, all units must have a probability greater than zero of being selected. Nothing can be excluded by design.
Random sampling requires a formal procedure, not just picking samples that “feel” representative.
Using R’s sample() function, the farmer randomly selects 7 grid cells from a map of their property.
The farmer uses simple random sampling to select 7 grid cells, then measures soil carbon at each site.
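A minimal sketch of this selection in R. The 10 × 10 grid is an assumption for illustration, not the farmer’s actual map:

```r
# Divide the property into a hypothetical 10 x 10 grid of cells
set.seed(42)                      # make the random draw reproducible
cells <- 1:100                    # 100 candidate grid cells
sites <- sample(cells, size = 7)  # simple random sample of 7 cells
sites
```

Because every cell has the same chance of selection, no part of the property is excluded by design.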
The measurements: 48, 56, 90, 78, 86, 71, 42 t/ha.
The farmer calculates the mean:
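In R (the variable name `carbon` is just for illustration):

```r
carbon <- c(48, 56, 90, 78, 86, 71, 42)  # soil carbon measurements (t/ha)
mean(carbon)                              # sample mean: 67.28571
```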
The result is 67.3 t/ha. But is this number the true average for the whole property?
Two studies both report a mean soil carbon of 67 t/ha. One measured 5 sites, the other 500.
Should we trust them equally?
A single number on its own does not tell us anything about uncertainty. We need a way to combine our estimate with how precise it is. That is what confidence intervals do.
Garfield by Jim Davis. © PAWS Inc.
A confidence interval (CI) gives us a range of plausible values for the population parameter, rather than a single best guess.
It combines three things:
Sample mean (\(\bar{x}\)): our estimate of the population parameter
Standard error of the mean (\(SE_{\bar{x}} = s / \sqrt{n}\)): how precise that estimate is
Critical value (\(t_{n-1}\)): from the t-distribution at our chosen confidence level (e.g. 95%). We use t rather than the normal distribution because we estimate \(\sigma\) from the sample.
These three ingredients combine into:
\[ \bar{x} \pm \left(t_{n-1} \times \frac{s}{\sqrt{n}}\right) \]
The \(t_{n-1} \times SE_{\bar{x}}\) part is the margin of error, the “plus or minus” value you often see in survey results (±3%).
To use this formula, we need to understand two things: degrees of freedom and the t-distribution.
The farmer estimated one thing from their sample: the mean. That uses up one degree of freedom:
\[ \text{df} = n - 1 \]
For the farmer: \(df = 7 - 1 = 6\). This number determines which t-distribution we use, and therefore how wide the CI is.
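You can see the effect of df on the critical value directly in R, using qt() for the t-distribution and qnorm() for the normal:

```r
# Critical values for a 95% CI: t shrinks toward the normal as df grows
qt(0.975, df = 6)   # farmer's case (n = 7): about 2.45
qt(0.975, df = 30)  # about 2.04
qnorm(0.975)        # normal limit: about 1.96
```

With only 6 degrees of freedom, the critical value is noticeably larger than 1.96, so the CI is wider.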
Imagine you have 4 numbers with a mean of 5: you can choose any 3 of them freely, but the 4th is then fixed by the mean. Only 3 values are free to vary, so df = 3.
The farmer only has 7 samples, so they do not know the true population standard deviation. They have to estimate it from the sample, and that adds uncertainty.
The t-distribution accounts for this extra uncertainty. It has heavier tails than the normal, meaning more probability in the extremes.
The solid curve is the t-distribution; the dashed curve is the normal. At \(df = 6\) (the farmer’s case), the tails are noticeably heavier. As \(df\) increases, the two curves converge.
library(ggplot2)
library(gganimate)

anim_speed <- 1
x <- seq(-4, 4, length.out = 400)
dfs <- c(1:5, seq(6, 30, by = 2))

# t-distribution density for each df value
t_curves <- do.call(rbind, lapply(dfs, function(df) {
  data.frame(x = x, density = dt(x, df), df = df)
}))
normal_curve <- data.frame(x = x, density = dnorm(x))

p <- ggplot() +
  geom_line(
    data = normal_curve, aes(x = x, y = density),
    color = "black", linetype = "dashed", linewidth = 1
  ) +
  geom_line(
    data = t_curves, aes(x = x, y = density, color = factor(df)),
    linewidth = 1
  ) +
  labs(
    title = "Degrees of Freedom: {closest_state}",
    x = "x", y = "Density",
    subtitle = "Dashed = Normal distribution; Solid = t-distribution"
  ) +
  theme(legend.position = "none") +
  transition_states(df, transition_length = anim_speed, state_length = anim_speed)

anim_save("images/t-distribution.gif", p, width = 800, height = 400, res = 150)

\[ \bar{x} \pm \left(t_{n-1} \times \frac{s}{\sqrt{n}}\right) \]
You need to be able to calculate this by hand or with a calculator.
A confidence interval is like a fishing net: each sample casts a new net (interval), and a well-built net catches the fish (the true value) most of the time, but not every time.
Of course, the true value does not move. It is our interval that shifts from sample to sample.
Common misunderstanding: A 95% CI does NOT mean there is a 95% chance the true value is inside the interval. It means that if you took 100 different samples and built 100 different CIs, about 95 of them would contain the true value.
Each click draws a new random sample from the population, computes a 95% CI, and checks whether it captures the true mean.
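A non-interactive sketch of the same repeated-sampling idea in R. The population mean and standard deviation here are made-up values for illustration:

```r
set.seed(1)
true_mean <- 70; true_sd <- 18  # hypothetical population (t/ha)
n <- 7; reps <- 10000

covered <- replicate(reps, {
  x <- rnorm(n, true_mean, true_sd)            # draw one sample of 7 sites
  moe <- qt(0.975, n - 1) * sd(x) / sqrt(n)    # margin of error
  (mean(x) - moe <= true_mean) && (true_mean <= mean(x) + moe)
})
mean(covered)  # proportion of CIs that capture the true mean: close to 0.95
```

The coverage is a long-run property of the procedure, not a probability statement about any single interval.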
Applying the CI formula to the farmer’s data:
We report: 67.3 t/ha with a 95% CI of (49.85, 84.73).
The interval is wide (about ±26% of the mean), so there is quite a bit of uncertainty in our estimate.
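The same calculation in R (the variable name `carbon` is just for illustration):

```r
carbon <- c(48, 56, 90, 78, 86, 71, 42)
xbar <- mean(carbon)                        # sample mean
se   <- sd(carbon) / sqrt(length(carbon))   # standard error of the mean
moe  <- qt(0.975, df = length(carbon) - 1) * se  # margin of error
c(xbar - moe, xbar + moe)                   # 95% CI
```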
[1] 49.84627 84.72516
Most statistical functions in R already compute confidence intervals. Understanding the manual calculation is still important because it is what these functions do behind the scenes.
The CI is too wide for the farmer to qualify for carbon credits (perhaps). What drives CI width: sample size, variability, or both?
Can we do better with a different sampling design?
Increasing sample size (if possible) would be the most straightforward way to reduce the CI width. However, the farmer only has budget for 7 sites. Can we reduce the variability instead?
This presentation is based on the SOLES Quarto reveal.js template and is licensed under a Creative Commons Attribution 4.0 International License