```r
library(ggplot2)  # for plotting

# simulate example data
set.seed(340)
x <- runif(10, 0, 30)
y <- 5 * x + rnorm(10, 0, 40)
df <- data.frame(x, y)

# make the plot
ggplot(df, aes(x, y)) +
  geom_point(size = 2) +
  labs(x = "x", y = "y") +
  theme_classic()
```
Often in a scientific experiment, we collect data for a response variable (\(y\)) and one or more predictor variables (\(x\)).
Some interchangeable terms: the response (\(y\)) is also called the dependent or outcome variable, and the predictor (\(x\)) is also called the independent or explanatory variable.
There are several reasons why we would "model", or "create a model" of, the data we have collected, such as describing the relationship between variables, predicting new observations, and testing hypotheses about that relationship.
The simplest form of a model is a linear model.
We start with a hypothetical dataset, with \(x\) and \(y\). A linear model is essentially the line of best fit. How do we determine what the line of best fit is?
Realistically, if we fit a "line of best fit" it will not pass through all the points. There will be some error; we call these errors the residuals. The residuals (\(\epsilon_{i}\)) are the difference between the observed value of \(y\) (\(y_{i}\)) and the value predicted (\(\hat{y}_{i}\)) by the model.
Regression is a statistical method to fit a model by minimising the residuals, or error. In essence, the model that best minimises the error has the best fit. The most commonly used error measure is the sum of squared residuals. By squaring each residual, it does not matter whether the error is positive or negative, and larger errors receive a larger penalty. By summing all squared residuals together, we get a single number that represents the total error of the model.
\[ \text{sum of squared residuals} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \] Regression is the foundation of most numerical statistical models, from simple linear regression to deep learning neural networks.
The equation of a simple linear regression model is:
\[y = \beta_0 + \beta_1 x\]
where \(\beta_0\) is the y-intercept or constant, and \(\beta_1\) is the slope.
The goal of regression is to fit a straight line that will have the lowest sum of squares. This line can be described with the parameters \(\beta_0\) and \(\beta_1\).
There are two methods that we can use - the analytical method (with equations) and the numerical method (trial and error).
We will showcase these methods with an example using the famous `iris` dataset. The dataset contains 150 observations of three species of iris flowers, measuring four features (sepal length, sepal width, petal length, petal width). For our example, we will model the relationship between petal length and petal width.
For simple linear regression, we can calculate the values of \(\beta_0\) and \(\beta_1\) directly with equations.
First we calculate the slope \(\beta_1\):
\[ \beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{Cov(x,y)}{Var(x)} = \frac{SS_{xy}}{SS_{xx}} \]

We then substitute this into the equation below to get the intercept \(\beta_0\):
\[ \beta_0 = \bar{y} - \beta_1 \bar{x} \]
Using R to do our calculations:
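A minimal sketch of these calculations, assuming we use `Petal.Length` as \(x\) and `Petal.Width` as \(y\) from the built-in `iris` dataset:

```r
# predictor and response from the built-in iris dataset
x <- iris$Petal.Length
y <- iris$Petal.Width

# slope: beta_1 = SS_xy / SS_xx (equivalently cov(x, y) / var(x))
beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# intercept: beta_0 = y_bar - beta_1 * x_bar
beta_0 <- mean(y) - beta_1 * mean(x)

c(beta_0 = beta_0, beta_1 = beta_1)
#> beta_0 = -0.363076, beta_1 = 0.415755 (matching the result stated below)
```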
So the analytical method determines that our linear model is \(y = -0.363076 + 0.415755 \cdot x\).
The numerical method is a trial and error method. We start with some initial values of \(\beta_0\) and \(\beta_1\), and then update the values to minimise the sum of squares. This is the most common method that computer programs (e.g. Excel, R, Python, etc.) use, as it is more computationally efficient with very large datasets (e.g. millions of rows).
Fitting the model in R is very simple:
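A minimal sketch of the fit, assuming `x` and `y` are the iris petal length and width defined above (consistent with the `lm(formula = y ~ x)` output discussed later):

```r
mod <- lm(y ~ x)  # fit the simple linear regression y = beta_0 + beta_1 * x
summary(mod)      # coefficient estimates, standard errors, t-values and p-values
```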
The estimates of our parameters (or coefficients) are in the `Estimate` column. The `Estimate` for `(Intercept)` is our y-intercept \(\beta_0\), and the `Estimate` for `x` is our slope \(\beta_1\).
So the numerical method run by R determines that our linear model is \(y = -0.363076 + 0.415755 \cdot x\). This is exactly the same result as the analytical method (at least to 6 decimal places).
Fitting a simple linear regression can be done in two lines of code in R. But there is a series of steps that we need to go through to ensure our model is robust and we can trust the results.
This is a basic step in any data analysis. We need to understand the data we are working with. As with previous modules, we can look at summary statistics, distributions, and visualise the data.
For linear regression, we also need to look at the relationship between the predictor and response. We first look at a scatter plot to determine if we have linear data, and then we choose a suitable correlation coefficient.
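For example, a minimal sketch in base R (ggplot2 works just as well):

```r
# scatter plot of the two variables
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal Length (cm)", ylab = "Petal Width (cm)")

# Pearson's correlation coefficient
cor(iris$Petal.Length, iris$Petal.Width)
#> approximately 0.96
```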
There appears to be a strong, positive linear relationship between petal length and petal width.
The correlation coefficient is 0.96, which is very close to 1. This is almost a perfect positive linear relationship!
To use linear regression, there are several assumptions (LINE) that need to be met: Linearity, Independence of errors, Normality of residuals, and Equal variances.
If the assumptions are met, then we can be confident that the model is a good fit for the data. If they are not met, then the hypothesis test results are unreliable, the standard error estimates are unreliable, and the estimated coefficients will not describe the data well.
If the assumptions are not met, then we either need to transform our data (e.g. \(x\), \(y\), or both) with a function (e.g. square root, natural log etc.) or use a non-linear model.
An important point to remember is that the assumptions are about the residuals, not the data itself. The equation for a linear model is:
\[y = \beta_0 + \beta_1 x + \epsilon\] where \(\epsilon\) is the error term, or residual. This is the only term which adds variation to an otherwise straight line, so this is what we need to check our assumptions with.
Linear regression assumes that there is a linear relationship between \(x\) and \(y\). It does not make logical or statistical sense to fit a linear model to data that does not have a linear relationship. With non-linear data, the other assumptions will not be met either. The easiest method to check for linearity is with a scatter plot.
The independence of errors is the assumption that the residuals for one observation are independent of another observation. This is a difficult assumption to check, but it is often assumed that the data is collected in a way that the residuals are independent.
For example, if we are measuring the height of children in a class, the height of one child should not affect the height of another child. However, if all the children were siblings or identical octuplets, then the residuals would not be independent. Another case where this assumption could be broken is in time series data: the height of a child this year is not independent of the height last year.
A linear model assumes the residuals are normally distributed around the line of best fit. This is important for hypothesis testing and confidence intervals. We can check this assumption with a Q-Q plot or a histogram of the residuals.
This is also known as the assumption of constant variance or homoscedasticity. It assumes that the variance of the residuals is constant across all levels of the predictor variable (i.e. no fanning). Again, this is important for hypothesis testing and confidence intervals. We can check this assumption with a scatter plot of Residuals vs Fitted values and the Scale-Location plot.
In R, plots of the residuals can be made with the `plot()` function, using the model object (in this case `mod`) as the input. The four plots produced are 1) Residuals vs Fitted (linearity, equal variances), 2) the Normal Q-Q plot (normality), 3) the Scale-Location plot (equal variances), and 4) the Residuals vs Leverage plot (extreme outliers).
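A minimal sketch, assuming `mod` is the fitted model from above:

```r
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(mod)             # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))  # reset the plotting layout
```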
These same plots can also be made with the `autoplot()` function (from the ggfortify package). It does not require the `par()` function to set up the grid and is more aesthetically customisable.
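For example (assuming the ggfortify package is installed):

```r
library(ggfortify)  # provides an autoplot() method for lm objects
autoplot(mod)       # the same four diagnostic plots, drawn with ggplot2
```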
The Residuals vs Fitted plot shows the residuals against the "predicted" (fitted) values, i.e. points on the line. If the assumption is met, the points will be randomly scattered around the horizontal zero line. If there is a pattern (e.g. a curve, a quadratic shape), then the assumption of linearity is not met. If there is a fan shape, then the assumption of equal variance is not met.
The Normal Q-Q plot compares the residuals to a normal distribution. If the residuals are normally distributed, the points will fall on the straight line. If the points deviate from the line, then the residuals are not normally distributed.
The Scale-Location plot is used to check the assumption of equal variance. If the line is essentially horizontal and points are randomly scattered, then the assumption is met.
The Residuals vs Leverage plot is used to identify extreme outliers. Although not an official assumption, a very extreme outlier has the potential to skew the line of best fit. These influential points should be kept track of or removed from the dataset.
The examples above are for the `iris` dataset with petal length and petal width. The assumptions are met.
Below is an example of plots from data that has an exponential relationship with unequal variances (it fans in) with a couple of extreme outliers (in red).
We can see that the Residuals vs Fitted plot is curved (the relationship is not linear), the spread of the residuals is not constant (unequal variance), the points deviate from the line in the Normal Q-Q plot (non-normal residuals), and the extreme outliers (in red) stand out in the Residuals vs Leverage plot.
So now that we have confirmed the assumptions are met, the next step is to determine how good the model actually is. If the model were not a good fit to the data, then we wouldn't want to interpret it; we'd try to improve it first.
Let's break down the `summary(mod)` output. The first few lines show the model formula, `lm(formula = y ~ x)`, so this is our linear model. The `Residuals` section gives some summary statistics of the residuals.
The null hypothesis for a simple linear regression is that the slope is zero (\(\beta_1 = 0\)), and the model does not perform better than just using the mean. There is thus no relationship between \(x\) and \(y\). The alternative hypothesis is that the slope is not zero (\(\beta_1 \neq 0\)), which means there is a relationship between \(x\) and \(y\), and the model is better than using the mean.
The `Coefficients` section gives the estimates of our y-intercept and slope, as well as the standard error, the t-value, and the p-value for each estimate. The p-value is the most easily interpreted: we care about the p-value of the slope (\(x\)). If the p-value is less than 0.05, then we can reject the null hypothesis that the slope \(\beta_1\) is zero.
The `Signif. codes` section gives a visual representation of the p-value. The more stars, the smaller the p-value. `***` means the p-value is less than 0.001, `**` means less than 0.01, and `*` means less than 0.05. This is useful when we have many predictors.
The `Residual standard error` is the standard deviation of the residuals. The smaller the value, the better the model fits the data. It is in the same units as the response variable. Considering \(y\) ranges between 0.1 and 2.5, a standard error of 0.2065 is quite small.
The `Multiple R-squared` is the proportion of the variance in the response variable that is explained by the model. The more variation explained by the model, the better the fit. A value of 1 indicates a perfect fit, and a value of 0 suggests the model explains no variation at all. A value of 92.71% is very good. In simple linear regression, the \(R^2\) is actually equivalent to the correlation coefficient squared, which is also where the term comes from.
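A quick check of this equivalence (a sketch):

```r
cor(iris$Petal.Length, iris$Petal.Width)^2  # squared correlation coefficient
#> approximately 0.927, matching the Multiple R-squared of 0.9271
```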
The `Adjusted R-squared` is the same as the `Multiple R-squared`, but adjusted for the number of predictors in the model. We use the `Adjusted R-squared` in multiple linear regression, and the `Multiple R-squared` in simple linear regression.
\[ R^2 = \frac{SS_{reg}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}} \]
\[ R^2_{adj} = 1 - \frac{SS_{res}}{SS_{tot}} \cdot \frac{n-1}{n-p-1} \]

Lastly, the `F-statistic` is a statistical test of the overall significance of the model. In simple linear regression (1 predictor, 1 response), its p-value is the same as the p-value of the slope in the `Coefficients` table. The F-statistic value is 1882 in the example, and the degrees of freedom (1, 148) are the number of predictors (1) and the number of observations minus two (150 - 2 = 148), because there are two estimated parameters. This is covered in more detail in ENVX2001.
The assumptions for simple linear regression were met. The model explains 92.71% of the variation in petal width using petal length, with a residual standard error of 0.2065, and petal length is a significant predictor (p < 2e-16).
The equation of the model is \(y = -0.36 + 0.42 \cdot x\). This means that for every unit increase in petal length, petal width increases by 0.42 cm.
The y-intercept is -0.36, which is the value of petal width when petal length is zero. This is not a meaningful value in this context, as petal length cannot be zero. This is often the case with the y-intercept.
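As a usage sketch, the fitted model can also be used to predict petal width for a new petal length (assuming `mod` was fitted as `lm(y ~ x)` above; a petal length of 4 cm is just an illustrative value):

```r
predict(mod, newdata = data.frame(x = 4))  # predicted petal width (cm) at petal length 4 cm
#> about -0.363 + 0.416 * 4 = 1.30
```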
Last but not least, the result can be plotted on a scatterplot with the line of best fit.
```r
### Base R
plot(x, y, pch = 19,
     xlab = "Petal Length", ylab = "Petal Width", main = "Base R")
abline(mod, col = "red")  # using our model `mod` object
# abline(a = -0.363076, b = 0.415755, col = "red")  # manually inputting coefficients

### ggplot
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +  # ggplot will fit the model for us
  # geom_abline(intercept = -0.363076, slope = 0.415755, color = "red") +  # manually inputting coefficients
  labs(x = "Petal Length", y = "Petal Width", title = "ggplot2")
```
If the assumptions are not met, we can try transforming the data. Some common options are the square root or the natural log. If there are negative (or zero) values, we can first add a constant to ensure all values are greater than 0.
To demonstrate this, we use another well-known dataset: air quality in New York (USA). The dataset contains 153 observations of `Ozone` (ppb), as well as `Solar.R` (solar radiation, Langleys), `Wind` (mph), `Temp` (\(^\circ\)F), `Day` (1-31), and `Month` (1-12).
The correlation matrix shows that `Temp` has the highest correlation with `Ozone` (0.7), so we will model the relationship between `Ozone` and `Temp`. Given `Ozone` is a pollutant that is of interest, it is the response variable and `Temp` is the predictor.
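A sketch of this step; here `data` is assumed to be the built-in `airquality` dataset with incomplete rows removed (our assumption, to match the correlation code used later):

```r
data <- na.omit(airquality)  # assumption: drop rows with missing values first
round(cor(data), 2)          # correlation matrix; Temp vs Ozone is about 0.7

# scatter plot of the chosen pair
plot(data$Temp, data$Ozone, xlab = "Temp (deg F)", ylab = "Ozone (ppb)")
```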
The relationship does not appear linear. We can fit the model and check the other assumptions as well to be sure.
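A sketch of fitting the untransformed model and producing its diagnostic plots (the object name `mod_oz` is ours):

```r
mod_oz <- lm(Ozone ~ Temp, data = data)  # untransformed model
par(mfrow = c(2, 2))
plot(mod_oz)          # residual diagnostics discussed below
par(mfrow = c(1, 1))  # reset the plotting layout
```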
The Residuals vs Fitted plot is curved, which also signifies the relationship is not linear. The spread of points at lower fitted values is tighter than at larger values, i.e. fanning and unequal variance. The Normal Q-Q plot shows the points are not on the line, especially at the upper tail, indicating the residuals are not normally distributed. Hence the assumptions are indeed not met.
We can transform 1) just the response variable, 2) just the predictor variable, or 3) both. The easiest option is to transform the response variable (especially in multiple linear regression), and natural log transformations are more easily backtransformed for meaningful interpretation.
```r
# transformations of the response variable
data$Ozone_sqrt  <- sqrt(data$Ozone)            # square root
data$Ozone_log   <- log(data$Ozone)             # natural log
data$Ozone_log10 <- log(data$Ozone, base = 10)  # base 10 log

# Original
correlation_coef <- cor(data$Temp, data$Ozone) |> round(3)
p1 <- ggplot(data, aes(x = Temp, y = Ozone)) +
  geom_point() +
  labs(x = "Temp", y = "Ozone", title = "Original",
       subtitle = paste("Pearson's r: ", correlation_coef))

# Square root
correlation_coef <- cor(data$Temp, data$Ozone_sqrt) |> round(3)
p2 <- ggplot(data, aes(x = Temp, y = Ozone_sqrt)) +
  geom_point() +
  labs(x = "Temp", y = "Ozone", title = "Square root",
       subtitle = paste("Pearson's r: ", correlation_coef))

# Natural log
correlation_coef <- cor(data$Temp, data$Ozone_log) |> round(3)
p3 <- ggplot(data, aes(x = Temp, y = Ozone_log)) +
  geom_point() +
  labs(x = "Temp", y = "Ozone", title = "Natural log",
       subtitle = paste("Pearson's r: ", correlation_coef))

# Base 10 log
correlation_coef <- cor(data$Temp, data$Ozone_log10) |> round(3)
p4 <- ggplot(data, aes(x = Temp, y = Ozone_log10)) +
  geom_point() +
  labs(x = "Temp", y = "Ozone", title = "Base 10 log",
       subtitle = paste("Pearson's r: ", correlation_coef))

# combine the four plots
library(patchwork)  # allows ggplot objects to be combined with +
p1 + p2 + p3 + p4
```
All the transformations improve the linearity of the relationship by about the same amount (based on \(r\)). We choose the natural log transformation (because it is the easiest to backtransform), re-fit the model, and check the assumptions again.
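A sketch of the re-fit, using the `Ozone_log` column created above (the object name `mod_log` is ours):

```r
mod_log <- lm(Ozone_log ~ Temp, data = data)  # model of log(Ozone) against Temp
par(mfrow = c(2, 2))
plot(mod_log)         # re-check the residual diagnostics
par(mfrow = c(1, 1))  # reset the plotting layout
summary(mod_log)      # output used for the evaluation below
```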
Not quite textbook perfect, but definitely an improvement. The residuals are more evenly spread, and the Q-Q plot is closer to the line. The model is now ready for evaluation and interpretation.
The linear model with `Temp` explains 55.5% of the variation in log-transformed `Ozone`, with a residual standard error of 0.580 log(ppb). Even if this value were back-transformed, it would not have much meaning, which is a limitation of transformations.
`Temp` is a significant predictor of the natural log of `Ozone` (p < 2e-16). The equation of the model is \(log(\text{Ozone}) = -1.849 + 0.068 \cdot \text{Temp}\). This means that for every unit (degree Fahrenheit) increase in `Temp`, `log(Ozone)` increases by 0.068 log(ppb). To backtransform, we can use the `exp()` function. So for every unit increase in `Temp`, `Ozone` increases by \(e^{0.068} = 1.070\) times. We can also say that for every unit increase in `Temp`, `Ozone` increases by approximately 6.8%.
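A quick sketch of this backtransformation (0.068 is the slope from the summary output; with the fitted object, `exp(coef())` gives the same result):

```r
exp(0.068)           # multiplicative change in Ozone per 1 deg F increase in Temp
#> 1.070 (to 3 decimal places)
# exp(coef(mod_log)) # backtransforms both coefficients at once (assuming mod_log from above)
```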
Backtransforming the coefficients is not always as straightforward as the interpretation of the coefficients. For example, if we had used a square root transformation `sqrt(Ozone)`, then to get the effect of increasing `Temp` by one unit:

\[ \sqrt{Ozone} = \beta_0 + \beta_1 \cdot Temp \]
\[ Ozone = (\beta_0 + \beta_1 \cdot Temp)^2 \]
\[ Ozone = \beta_0^2 + 2 \beta_0 \beta_1 Temp + \beta_1^2 Temp^2 \]

So if `Temp` increases by one unit, the increase in `Ozone` is not just \(\beta_1^2\)! We end up with \(2\beta_1(\beta_0+\beta_1 \cdot Temp) + \beta_1^2\).
For natural logs, however, the backtransformation is much simpler because of log laws.
\[ log(Ozone) = \beta_0 + \beta_1 \cdot Temp \]
\[ Ozone = e^{\beta_0 + \beta_1 \cdot Temp} = e^{\beta_0} \cdot e^{\beta_1 \cdot Temp} \]

If `Temp` increases by one unit, then the new value of `Ozone` is:

\[ Ozone = e^{\beta_0} \cdot e^{\beta_1 \cdot (Temp+1)} = e^{\beta_0} \cdot e^{\beta_1 \cdot Temp} \cdot e^{\beta_1} \]

The ratio between the two is:

\[ \frac{e^{\beta_0} \cdot e^{\beta_1 \cdot Temp} \cdot e^{\beta_1}}{e^{\beta_0} \cdot e^{\beta_1 \cdot Temp}} = e^{\beta_1} \]

So for a one unit increase in `Temp`, `Ozone` increases by \(e^{\beta_1}\) times.
Interpreting as a percent change can be more meaningful - it can be done with any log transformation (substitute \(e\) below for 10 or any other base), but the quick approximation only works with natural log transformations.
If \(y\) has been transformed with a natural log (`log(y)`), then for a one-unit increase in \(x\), the percent change in \(y\) (not `log(y)`) is calculated with:

\[\Delta y \% = 100 \cdot (e^{\beta_1}-1)\]

If \(\beta_1\) is small (i.e. \(-0.25 < \beta_1 < 0.25\)), then \(e^{\beta_1} \approx 1 + \beta_1\), so \(\Delta y \% \approx 100 \cdot \beta_1\). Below are some examples; when \(\beta_1\) is 2, the "quick estimate" is off by 439%!

| \(\beta_1\) | Exact: \(100 (e^{\beta_1} - 1)\) (%) | Approximate: \(100 \cdot \beta_1\) (%) |
|---|---|---|
| -0.25 | -22.13 | -25 |
| -0.1 | -9.52 | -10 |
| 0.01 | 1.01 | 1 |
| 0.1 | 10.52 | 10 |
| 0.25 | 28.41 | 25 |
| 0.5 | 64.87 | 50 |
| 2 | 638.91 | 200 |
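The exact and approximate columns can be computed with a quick sketch:

```r
beta <- c(-0.25, -0.1, 0.01, 0.1, 0.25, 0.5, 2)
data.frame(
  beta       = beta,
  exact_pct  = round(100 * (exp(beta) - 1), 2),  # exact percent change
  approx_pct = 100 * beta                        # quick approximation
)
```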
So for our linear model \(log(\text{Ozone}) = -1.849 + 0.068 \cdot \text{Temp}\), a one unit increase in `Temp` results in approximately a 6.8% increase in `Ozone`.