ENVX1002 Introduction to Statistical Methods
The University of Sydney
Apr 2024

LO1. Demonstrate proficiency in utilizing R and Excel to effectively explore and describe data sets in the life sciences.
LO2. Evaluate and interpret different types of data in the natural sciences by visualising probability distributions and calculating probabilities using RStudio and Excel.
LO3. Apply parametric and non-parametric statistical inference methods to experimental data using RStudio and effectively interpret and communicate the results in the context of the data.
LO4. Apply both linear and non-linear models to describe relationships between variables using RStudio and Excel, demonstrating creativity in developing models that effectively represent complex data patterns.
LO5. Articulate statistical and modelling results clearly and convincingly in both written reports and oral presentations, working effectively as an individual and collaboratively in a team, showcasing the ability to convey complex information to varied audiences.

Adrien-Marie Legendre, Carl Friedrich Gauss, Francis Galton, & Karl Pearson
Note
Many other people contributed to the development of regression analysis, but these four are the most well-known.
# A tibble: 928 × 2
parent child
<dbl> <dbl>
1 70.5 61.7
2 68.5 61.7
3 65.5 61.7
4 64.5 61.7
5 64 61.7
6 67.5 62.2
7 67.5 62.2
8 67.5 62.2
9 66.5 62.2
10 66.5 62.2
# ℹ 918 more rows
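A sketch of how these data might be loaded and printed (assuming the Galton dataset from the HistData package, coerced to a tibble):

library(HistData)  # assumed source of the Galton parent/child height data
library(tibble)
as_tibble(Galton)  # 928 height pairs, in inches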
# Generate synthetic data
set.seed(123)
rainfall <- rnorm(100, mean = 50, sd = 10)
plant_growth <- rainfall + rnorm(100, mean = 0, sd = 5)

# Calculate correlation coefficient
correlation_coef <- cor(rainfall, plant_growth)

# Plot data using ggplot2
library(ggplot2)
p1 <- ggplot(data = data.frame(rainfall, plant_growth), aes(x = rainfall, y = plant_growth)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Positive correlation",
    x = "Rainfall", y = "Plant growth",
    subtitle = paste("Correlation coefficient: ", round(correlation_coef, 2))
  )
# Generate synthetic data
set.seed(123)
phosphorus <- rnorm(100, mean = 50, sd = 10)
chlorophyll_a <- 100 - phosphorus + rnorm(100, mean = 0, sd = 5)

# Calculate correlation coefficient
correlation_coef <- cor(phosphorus, chlorophyll_a)

# Plot data using ggplot2
p2 <- ggplot(data = data.frame(phosphorus, chlorophyll_a), aes(x = phosphorus, y = chlorophyll_a)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Negative correlation",
    x = "Phosphorus", y = "Chlorophyll-a",
    subtitle = paste("Correlation coefficient: ", round(correlation_coef, 2))
  )
library(patchwork)
p1 + p2

What about the iris dataset?

Formula:
r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}

covariance divided by the product of the standard deviations
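A quick sketch in R (with made-up vectors) showing that this formula matches the built-in cor():

# Pearson's r computed from the formula vs the built-in cor()
x <- c(61, 64, 66, 68, 70, 72)   # made-up example values
y <- c(63, 65, 66, 67, 69, 70)
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
all.equal(r_manual, cor(x, y))   # TRUE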
In Excel: the =CORREL() formula, or use the Analysis ToolPak.
In R: the cor() function.
We can also visually inspect the relationship between the two variables using a scatterplot:
… but this is not a very good way to assess the strength of the relationship between the two variables.
A strong relationship is one where the correlation coefficient is close to 1 or -1.
A weak or nonexistent relationship is one where the correlation coefficient is close to 0.
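The correlation matrix below can be reproduced with cor() on the numeric columns of the built-in iris data:

cor(iris[, 1:4])  # pairwise Pearson correlations; the Species factor is excluded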
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
All of these data have a correlation coefficient of about 0.8.
All of these data have a correlation coefficient close to zero!
Spurious correlations: a relationship between two variables does not imply that one causes the other.
If we have a hypothesis about the relationship between two variables, we can use regression analysis to test it.
What is the relationship between a response variable Y and a predictor variable x?
How much variation in Y can be explained by a relationship with x?
What is the value of Y for a given value of x?
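Each of these questions corresponds to a standard step in R; a minimal sketch using the built-in cars data as a stand-in:

# Fit a simple linear model on the built-in cars data
mod <- lm(dist ~ speed, data = cars)
coef(mod)                                        # 1. the estimated relationship
summary(mod)$r.squared                           # 2. variation in Y explained by x
predict(mod, newdata = data.frame(speed = 15))   # 3. predicted Y at x = 15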
Many types of regression models exist:
Asking ChatGPT for help with the next slide:
Using R code, can you generate some data that is useful to demonstrate simple linear regression, multiple linear regression, polynomial, exponential and logarithmic regressions in ggplot2?
Sure! Here’s an example code that generates a sample dataset and visualizes it using ggplot2 library in R.
library(ggplot2)

# Generate sample data
set.seed(136)
x <- 1:100
y <- x^2 + rnorm(100, sd = 100)

# Define regression functions
slr <- function(x, y) {
  mod <- lm(y ~ x)
  return(list(
    data.frame(x = x, y = predict(mod), model_type = "Simple Linear Regression"),
    paste("y =", round(coefficients(mod)[[2]], 2), "x +", round(coefficients(mod)[[1]], 2))
  ))
}

mlr <- function(x, y, z) {
  mod <- lm(y ~ x + z)
  return(list(
    data.frame(x = x, z = z, y = predict(mod), model_type = "Multiple Linear Regression"),
    paste("y =", round(coefficients(mod)[[3]], 2), "x +", round(coefficients(mod)[[2]], 2), "z +", round(coefficients(mod)[[1]], 2))
  ))
}

poly_reg <- function(x, y, degree) {
  mod <- lm(y ~ poly(x, degree, raw = TRUE))
  x_new <- seq(min(x), max(x), length.out = 100)
  y_new <- predict(mod, newdata = data.frame(x = x_new))
  return(list(
    data.frame(x = x_new, y = y_new, model_type = paste("Polynomial Regression (", degree, ")", sep = "")),
    paste(paste("x^", degree, sep = ""), ":", paste(round(coefficients(mod), 2), collapse = " + "))
  ))
}

exp_reg <- function(x, y) {
  mod <- lm(log(y) ~ x)
  x_new <- seq(min(x), max(x), length.out = 100)
  y_new <- exp(predict(mod, newdata = data.frame(x = x_new)))
  return(list(
    data.frame(x = x_new, y = y_new, model_type = "Exponential Regression"),
    paste("y =", round(exp(coefficients(mod)[[2]]), 2), "* e^(", round(coefficients(mod)[[1]], 2), "x", ")")
  ))
}

log_reg <- function(x, y) {
  mod <- lm(y ~ log(x))
  x_new <- seq(min(x), max(x), length.out = 100)
  y_new <- predict(mod, newdata = data.frame(x = x_new))
  return(list(
    data.frame(x = x_new, y = y_new, model_type = "Logarithmic Regression"),
    paste("y =", round(coefficients(mod)[[2]], 2), "* log(x) +", round(coefficients(mod)[[1]], 2))
  ))
}

# Create regression line dataframes and equations
reg_data <- list(slr(x, y), mlr(x, y, rnorm(100, sd = 10)), poly_reg(x, y, 3), exp_reg(x, y), log_reg(x, y))
reg_eqs <- sapply(reg_data, function(x) x[[2]])

# Plot the data and regression lines
ggplot(data.frame(x = x, y = y), aes(x, y)) +
  lapply(seq_along(reg_data), function(i) {
    geom_line(data = reg_data[[i]][[1]], aes(x, y, color = reg_data[[i]][[1]]$model_type), linewidth = 1.4)
  }) +
  ggtitle("Regression Models") +
  geom_point(alpha = .2) +
  xlab("X") +
  ylab("Y") +
  scale_color_discrete(name = "Model Type") +
  ylim(-6000, 15000) +
  theme(legend.position = "none")

Source: https://www.nature.com/articles/s41598-021-84893-4/figures/1

We want to predict an outcome Y based on a predictor x for observations i = 1, \dots, n:
Y_i = \color{royalblue}{\beta_0 + \beta_1 x_i} +\color{red}{\epsilon_i}
where
\epsilon_i \sim N(0, \sigma^2)
Note
We tend to associate \epsilon_i with the residual, which is the positive or negative difference from the predicted response, rather than with the error itself, which is the difference from the true response.
Y_i = \color{royalblue}{\beta_0 + \beta_1 x_i} +\color{red}{\epsilon_i}
Essentially a deterministic straight-line equation, y = c + mx, with added random variation that is normally distributed.
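A minimal sketch simulating from this model (the seed and parameter values are made up) and recovering the parameters with lm():

# Simulate Y_i = beta_0 + beta_1 * x_i + eps_i, with eps_i ~ N(0, sigma^2)
set.seed(42)                              # assumed seed, for reproducibility
beta_0 <- 2; beta_1 <- 0.5; sigma <- 1    # made-up "true" parameters
x <- runif(50, min = 0, max = 10)
y <- beta_0 + beta_1 * x + rnorm(50, mean = 0, sd = sigma)
coef(lm(y ~ x))                           # estimates should be close to 2 and 0.5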
Two approaches: analytical and numerical.
Analytical: the equations are used directly to find the solution, e.g. the parameter estimates that minimise the residual sum of squares.
Numerical: the computer iteratively "guesses" parameter values to find the set that minimises an objective function, in this case the residual sum of squares.
In this week’s practical we will use Solver in Excel to numerically fit linear models.
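A sketch of both approaches in R on made-up data; Solver plays the same role there that optim() plays here:

# Made-up data
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)

# Analytical: closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Numerical: minimise the residual sum of squares with optim()
rss <- function(par) sum((y - (par[1] + par[2] * x))^2)
fit <- optim(c(0, 0), rss)

rbind(analytical = c(b0, b1), numerical = fit$par)  # near-identical estimates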
What can we understand about the relationship between child and parent?
# (dplyr, ggplot2 and broom are assumed loaded)
# Null model: intercept only, i.e. predict the mean child height
null_model <- Galton %>%
  lm(child ~ 1, data = .) %>%
  broom::augment(Galton)

# Simple linear regression: child height predicted by parent height
lin_model <- Galton %>%
  lm(child ~ parent, data = .) %>%
  broom::augment(Galton)

# Stack the augmented data for faceted plotting
models <- bind_rows(null_model, lin_model) %>%
  mutate(model = rep(c("Null model", "SLR model"), each = nrow(Galton)))

# Plot both fits, with segments showing each residual
ggplot(data = models, aes(x = parent, y = child)) +
  geom_smooth(
    data = filter(models, model == "Null model"),
    method = "lm", se = FALSE, formula = y ~ 1, linewidth = 0.5
  ) +
  geom_smooth(
    data = filter(models, model == "SLR model"),
    method = "lm", se = FALSE, formula = y ~ x, linewidth = 0.5
  ) +
  geom_segment(
    aes(xend = parent, yend = .fitted),
    arrow = arrow(length = unit(0.1, "cm")),
    linewidth = 0.3, color = "darkgray"
  ) +
  geom_point(alpha = .2) +
  facet_wrap(~model) +
  xlab("Parent height (in)") +
  ylab("Child height (in)")

Done!
And then we can use summary() to get a summary of the model:
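For reference, the output below was produced by fitting the model and calling summary() on it (the object name is assumed here):

# Fit the simple linear regression and summarise it
mod <- lm(child ~ parent, data = Galton)  # assumed object name
summary(mod)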
Call:
lm(formula = child ~ parent, data = Galton)
Residuals:
Min 1Q Median 3Q Max
-7.8050 -1.3661 0.0487 1.6339 5.9264
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.94153 2.81088 8.517 <2e-16 ***
parent 0.64629 0.04114 15.711 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.239 on 926 degrees of freedom
Multiple R-squared: 0.2105, Adjusted R-squared: 0.2096
F-statistic: 246.8 on 1 and 926 DF, p-value: < 2.2e-16
This presentation is based on the SOLES Quarto reveal.js template and is licensed under a Creative Commons Attribution 4.0 International License.