Lab 04 – Central limit theorem

ENVX1002 Handbook

The University of Sydney
Published

Semester 1, 2026

Facing challenges

Project 1 is due in week 5 and for many of you this may be your first university assignment; some may be nervous, while others may be more relaxed. Your demonstrators will use the start of this practical to discuss how you may be feeling for this first assessment and share ways you might want to approach and prepare for the assessment.

This is the last reflective activity for now, thank you all for contributing so far and we hope you have found some benefit in the activities. Now for some probability!

TipLearning outcomes

At the end of this computer practical, students should be able to:

  • calculate tail, interval and inverse probabilities associated with the Normal distribution
  • calculate probabilities associated sampling distribution of the sample mean by using simulation in R and using R commands.

Link to data is below:

R commands for normal distribution

  • Probability density function: \(f(x)\) = \(dnorm(x,\mu,\sigma)\)

Here are the main R functions we’ll use to work with normal distributions:

  • Probability density function (height of the curve at x): \(f(x) =\)dnorm(x, mean, sd)

  • Cumulative probability (area under curve up to x): \(F(x) = P(X \leq x) =\)pnorm(x, mean, sd)

  • Interval probability (area between two points): \(P(a \leq X \leq b) =\)pnorm(upper, mean, sd) - pnorm(lower, mean, sd)

  • Finding values for given probabilities: \(P(X \le x) =\)qnorm(probability, mean, sd)

  • Generate random values from a normal distribution: rnorm(n, mean, sd)

Normal distribution

Exercise 1 - How tall is ENVX1002?

  1. Make sure you have an existing or new project for Lab 4 and create a quarto document called Lab_4.qmd (suggestion). Save it in your project directory.

  2. During the lecture, we collected height data from the class. The data is available as a CSV file. Import it into R:

heights <- read.csv("data/heights.csv")
  1. Separate the data by sex and calculate the mean and standard deviation for each group. Graph the distribution for both genders, for example:
male <- heights |> filter(sex == "M")
female <- heights |> filter(sex == "F")

# calculate the mean and standard deviation of males
m_mean <- mean(male$height)
m_sd <- sd(male$height)
m_mean
[1] 177.3
m_sd
[1] 8.124722
# create a histogram of the data
ggplot(male, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "black") +
  labs(x = "Height (cm)", y = "Frequency", title = "Male heights")

# now do the same for females
  1. Discuss with your neighbour and class why the normal distribution is a good model for height data. Is it a good model for all data?
  1. How does your class compare to the Australian statistics? According to the Australian Bureau of Statistics (1995), measured heights for Australians aged 18-24 were:
Mean (cm) SD (cm)
Males 178.4 6.6
Females 163.9 6.6

Note that both reported and measured heights were collected. Reported heights were consistently higher than measured heights – males overestimated by ~1.6 cm and females by ~1.0 cm on average. What do you think the reason for this is?

Seeing the Central Limit Theorem in action

The Central Limit Theorem (CLT) tells us that when we take many samples from any population and calculate their means, those means will be approximately normally distributed – even if the original population is not normal.

  1. Run the following code to see this in action. We simulate 1000 sample means (each from a sample of size 30) from two very different populations: a normal distribution (like our height data) and a skewed distribution (like rainfall data).
set.seed(123)
# Simulate sampling distributions for two populations
# Population 1: Normal (like our height data)
# Population 2: Skewed (exponential, like rainfall)
normal_means <- replicate(1000, mean(rnorm(30, mean = 170, sd = 10)))
skewed_means <- replicate(1000, mean(rexp(30, rate = 0.1)))

# Combine for plotting
sim_data <- data.frame(
  sample_mean = c(normal_means, skewed_means),
  population = rep(c("Normal population", "Skewed population"), each = 1000)
)

ggplot(sim_data, aes(x = sample_mean)) +
  geom_histogram(bins = 30, fill = "steelblue", colour = "white") +
  facet_wrap(~population, scales = "free") +
  labs(x = "Sample Mean", y = "Frequency",
       title = "Sampling distributions of the mean (n = 30)")

Discuss with your neighbour: What do you notice about the shape of both sampling distributions? What does this tell us about the Central Limit Theorem?

  1. Now let’s confirm this with our own height data. First, generate a single sample of 30 heights and calculate its mean:
# Step 1: Generate ONE sample of 30 heights and calculate the mean
one_sample <- rnorm(30, mean = m_mean, sd = m_sd)
mean(one_sample)
[1] 176.0564

Now use replicate() to do this 1000 times and plot the result:

# Step 2: Now do it 1000 times using replicate()
sample_means <- replicate(1000, mean(rnorm(30, mean = m_mean, sd = m_sd)))

# Plot the distribution of sample means
ggplot(data.frame(sample_mean = sample_means), aes(x = sample_mean)) +
  geom_histogram(bins = 30, fill = "lightblue", colour = "white") +
  labs(x = "Sample Mean Height (cm)", y = "Frequency",
       title = "Distribution of 1000 sample means (n = 30)")

The sampling distribution of the mean has its own standard deviation, called the standard error (SE). It tells us how much the sample mean varies from sample to sample:

\[SE = \frac{\sigma}{\sqrt{n}}\]

Notice that as the sample size \(n\) increases, the SE decreases – larger samples give more precise estimates of the population mean. We use the SE in place of the standard deviation when calculating probabilities for the sample mean.

Finally, compare the simulated probability to the exact probability. What proportion of sample means exceed 180 cm?

# Compare: what proportion of sample means exceed 180?
sum(sample_means > 180) / 1000  # simulated
[1] 0.034
pnorm(180, mean = m_mean, sd = m_sd / sqrt(30), lower.tail = FALSE)  # exact
[1] 0.03436531

The simulated and exact probabilities should be close – this is the CLT at work!

Sampling distributions

Exercise 2 - Milkfat example

The milkfat content in milk (in %) for 120 cows are presented in the worksheet called ENVX1002_Data4.xlsx. Copy the file into your project directory and:

  1. Import the data into R.
library(readxl)
milkfat <- read_excel("data/ENVX1002_Data4.xlsx", sheet = "Milkfat")
  1. Calculate the summary statistics of Milkfat (mean, median and sd)

Note that we use $ColumnName to select a column from the data

mean(milkfat$Milkfat)
[1] 4.166083

Based on the mean milkfat of ~4.2%, these are likely Jersey cows – a breed known for high milkfat content compared to the more common Holstein-Friesian (~3.5%).

  • Could the data be normally distributed?
  1. Create a histogram and boxplot of the milk fat data. Is the data “normally distributed”?
require(ggplot2)
ggplot(milkfat, aes(x = Milkfat)) +
  geom_histogram(binwidth = 0.1, fill = "lightblue", color = "black") +
  xlab("Milkfat (%)")

  1. In the UK, breakfast milk' (orChannel Island milk’) has 5.5% fat content. What percentage of the cows in this data set is yielding breakfast milk with \(\ge\) 5.5%?
s <- sort(milkfat$Milkfat) # Sorts the data
s # Look at the sorted data
  [1] 3.47 3.56 3.58 3.66 3.66 3.70 3.70 3.72 3.74 3.74 3.77 3.81 3.82 3.83 3.86
 [16] 3.86 3.87 3.88 3.89 3.89 3.89 3.89 3.91 3.91 3.92 3.93 3.94 3.95 3.96 3.96
 [31] 3.97 3.97 3.97 3.97 3.97 3.98 3.99 3.99 4.00 4.00 4.00 4.02 4.03 4.05 4.05
 [46] 4.05 4.06 4.06 4.07 4.08 4.09 4.09 4.09 4.09 4.10 4.10 4.10 4.11 4.12 4.14
 [61] 4.15 4.16 4.16 4.17 4.17 4.18 4.20 4.20 4.20 4.20 4.22 4.23 4.24 4.24 4.24
 [76] 4.24 4.25 4.27 4.28 4.28 4.29 4.29 4.30 4.31 4.32 4.32 4.33 4.33 4.33 4.34
 [91] 4.35 4.36 4.38 4.38 4.38 4.38 4.38 4.40 4.41 4.42 4.42 4.46 4.48 4.49 4.51
[106] 4.52 4.58 4.58 4.60 4.60 4.66 4.67 4.67 4.70 4.71 4.81 4.82 4.88 4.91 5.00
length(s[s >= 5.5]) # Counts how many are >= 5.5
[1] 0
  1. In Australia, full cream milk has greater than 3.2% milk fat content. What percentage of these cows is yielding full cream milk?
## Your turn

Let \(X\) represent the milk fat content for the population of this breed of cows. Assuming the population is normal, use the sample mean and standard deviation from the previous question as estimates of the population parameters. So \(X\sim (\mu =..., \sigma^2 = ...)\).

  1. Draw a picture of the curve representing \(X\). The below example uses ggplot2 to draw the curve for \(N(4.16,0.30^2)\).
library(ggplot2)
ggplot(data.frame(x = c(4.16 - 4 * 0.3, 4.16 + 4 * 0.3)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 4.16, sd = 0.30)) +
  xlab("x") +
  ylab(expression(N(4.16, 0.30^2) ~ pdf))

  1. What is the probability that 1 cow has a fat content less than 4%? We will adapt the ggplot command above a picture of this probability and then use R to find the probability.

Hint: You may need to use the stat_function command to draw the curve and then use the pnorm command to find the probability.

ggplot(data.frame(x = c(4.16 - 4 * 0.3, 4.16 + 4 * 0.3)), aes(x = x)) +
  stat_function(
    fun = dnorm, args = list(mean = 4.16, sd = 0.30),
    geom = "area", fill = "white"
  ) +
  stat_function(
    fun = dnorm, args = list(mean = 4.16, sd = 0.30),
    xlim = c(4.16 - 4 * 0.3, 4), geom = "area", fill = "red"
  ) +
  xlab("x") +
  ylab(expression(N(4.16, 0.30^2) ~ pdf))

pnorm(4, 4.16, 0.30)
[1] 0.2969014
  1. What is the probability that 1 cow (randomly sampled) has a fat content greater than 4.5%? Try and adapt the ggplots above to draw a picture of this probability and then use R to find the probability.
  1. What milkfat percentage is exceeded by 95% of cows?

Hint: If 95% of cows exceed a value, then 5% are below it – we’re looking for the 5th percentile.

  1. For a sample of 10 cows (randomly sampled), what is the probability that the sample mean milk fat content is greater than 4.2%?

Hint: First find the distribution of the sample mean \(\overline{X}\). You will need the standard error (SE) – see Seeing the Central Limit Theorem in action if you need a refresher. Then find \(P(\overline{X}>4.2)\)

# For a sample of 10 cows, we need to find P(X_bar > 4.2)
# Mean of X_bar = population mean = 4.16
# Standard error of X_bar = population sd / sqrt(n) = 0.30 / sqrt(10)

# Calculate the standard error
se <- 0.30 / sqrt(10)

# Calculate the probability
pnorm(4.2, 4.16, se, lower.tail = FALSE)
[1] 0.336645
# Visualization
ggplot(data.frame(x = c(4.16 - 4 * se, 4.16 + 4 * se)), aes(x = x)) +
  stat_function(
    fun = dnorm, args = list(mean = 4.16, sd = se),
    geom = "area", fill = "white"
  ) +
  stat_function(
    fun = dnorm, args = list(mean = 4.16, sd = se),
    xlim = c(4.2, 4.16 + 4 * se), geom = "area", fill = "blue"
  ) +
  xlab("Sample mean milk fat content") +
  ylab("Probability density") +
  ggtitle("Distribution of sample mean (n=10)")

Standard Error of the mean

The following exercises give you practice calculating and interpreting the standard error. Recall from Seeing the Central Limit Theorem in action that \(SE = \sigma / \sqrt{n}\).

Exercise 3 - Skin cancer

A dermatologist investigating a certain type of skin cancer induced the cancer in nine rats and then treated them with a new experimental drug. For each rat she recorded the number of hours until remission of the cancer. The rats had a mean remission time of 400 hours and a standard deviation of 30 hours. From this data, calculate the standard error of the mean.

Exercise 4 - Soil carbon

An initial soil carbon survey of a farm based on 12 observations found that the sample mean \(\overline{X}\) was 1.2% and the standard deviation s was 0.4%. How many observations would be needed to estimate the mean carbon value with a standard error of 0.1%?

Take-home exercises

The following exercises are for you to complete in your own time. They provide additional practice with the concepts covered in this lab.

Exercise 5 - What’s in the media - looming state election

An article was published in the Sydney Morning Herald on Saturday 20.3.2010 about statistics related to opinion polls. Read it and find the sentences related to (i) populations versus samples (ii) standard error formula (iii) the effect of sample size on standard errors.

http://www.smh.com.au/national/demystifying-the-dark-art-of-polling-20100319-qmai.html

Exercise 6 - Extra practice

The average Australian woman has height (in cms) of 161.8 with a standard deviation of 6.

  1. The Australian Institute of Sport ran a netball training camp for the best Australian young players. How tall were the goal position players? http://www.abc.net.au/news/2015-06-14/tall-athletes-get-support-at-ais-to-stand-as-proud-netballers/6544642

  2. What is the probability of finding an Australian woman of this height or taller?

Hints:

Step 1: Using ggplot, draw a sketch of the Normal curve with the probability identified. You may need to draw a section of the right tail as the probability is small! We have provided the solution for the plotting to assist you.

Step 2: Calculate the probability in R.

1 - pnorm(189, 161.8, 6)
[1] 2.903004e-06
ggplot() +
  stat_function(
    fun = dnorm, args = list(mean = 161.8, sd = 6),
    geom = "area", fill = "white", xlim = c(180, 161.8 + 4 * 6)
  ) +
  stat_function(
    fun = dnorm, args = list(mean = 161.8, sd = 6),
    geom = "area", fill = "red", xlim = c(161.8 + 4 * 6, 189)
  ) +
  xlab("x") +
  ylab(expression(N(161.8, 6^2) ~ pdf)) +
  scale_x_continuous(breaks = 189)

  1. Dharshani Sivalingam is the tallest netball player in the world. How tall is Dharshani? https://en.wikipedia.org/wiki/Tharjini_Sivalingam What is the probability of finding an Australian woman of Dharshani’s height?
  1. Madison Brown is one of the the shortest Australian International players. How tall is Madision? https://en.wikipedia.org/wiki/Madison_Browne What percentage of Australian women are between Madison and Dharshani’s heights?
  1. If 80% of Australian women are above a certain height, what is that height?