Multiple Linear Regression

The equation for multiple linear regression is:

\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \epsilon_i \]

Therefore, estimating the model involves estimating the values of \(\beta_0\), \(\beta_1\), \(\beta_2\), …, \(\beta_k\).

  • \(\beta_0\) is the intercept
  • \(\beta_1\) to \(\beta_k\) are the partial regression coefficients
  • \(\epsilon_i\) is the error term (estimated by the residuals)

For simple linear regression (response and one predictor), we can visualise this with a scatterplot (response on the y-axis, predictor on the x-axis) and a line of best fit. Two variables, two dimensions, a 2D plot. When we have more than one predictor, this becomes more difficult to visualise.

Likewise, interpretation becomes more complex. If \(x_1\) increases by one unit, \(y_i\) increases by \(\beta_1\) units, provided all other variables are held constant. This is why the coefficients in multiple linear regression are referred to as partial regression coefficients.
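
To make this concrete, here is a minimal sketch of estimating the coefficients in R, using simulated data (the variable names and the "true" coefficient values below are made up purely for illustration):

set.seed(1)                                 # reproducible simulation
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)    # true coefficients: beta0 = 2, beta1 = 1.5, beta2 = -0.8

fit <- lm(y ~ x1 + x2)                      # estimate beta0, beta1 and beta2
coef(fit)                                   # intercept and partial regression coefficients
summary(fit)$coefficients                   # estimates with standard errors and p-values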

Principle of Parsimony

The principle of parsimony is the idea that the simplest explanation is the best explanation. In the context of multiple linear regression, this means we need to consider whether adding more variables is actually useful. This is because the more variables we include, the more complex the model becomes, and the more likely it is to overfit the data. If we have too few variables, we could underfit the data (low % variance explained).

If a model with 3 variables has the same predictive power as a model with 10 variables, it is better to use the model with 3 variables. A model with 3 variables is easier to fit and interpret, and it means the other 7 variables were not adding much to the model and could be culled.

The process of selecting the correct variables is called variable selection. There are methods to do this (e.g. stepwise regression with step()), but they are beyond the scope of this unit. A simple method is to remove predictors that do not appear significant in the model (ideally one by one) until all remaining variables are significant. This should be done in conjunction with checking the adjusted R-squared, to ensure the model does not decline drastically in predictive capability.
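
For reference, automated backward selection with step() looks roughly like the sketch below, where full_model and mydata are placeholder names; step() repeatedly drops the predictor whose removal most improves the AIC.

full_model <- lm(y ~ ., data = mydata)      # model with all candidate predictors
step(full_model, direction = "backward")    # drop predictors one at a time, guided by AIC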

Example

Multiple linear regression is very similar in many aspects to simple linear regression, just with more variables. We will go through the airquality dataset in R as an example. This dataset contains daily air quality measurements in New York in 1973. The research question posed by the dataset is: can we find a relationship between Ozone concentration (ppb) and meteorological data or the time of the year? And if so, how well can we predict Ozone from those variables?

Exploratory data analysis

First we will need to remove the missing values in the dataset.

summary(airquality)          # check the variables and note the missing values
data <- na.omit(airquality)  # remove rows with NAs and save as a new object (keeps the original dataset intact)

Next we will look at the relationships between Ozone and the other variables in the dataset. Below are some potential functions that can visualise the data and correlations – choose the one that works best visually for your data.

# ### Base R
# pairs(data)
# cor(data) |> round(2)
# 
# ### corrplot
# library(corrplot)
# corrplot::corrplot(cor(data), method = "circle")

### psych
psych::pairs.panels(data)

The key thing to check is the plots and correlations between Ozone and the predictors (first column, first row). The best correlated variables are Temp (\(r = 0.7\)), Wind (\(r = -0.61\)), and Solar.R (\(r = 0.35\)). Month is negligibly correlated (\(r = 0.14\)) and Day has no relationship (\(r = -0.01\)).

The relationships between Ozone and Temp, Solar.R and Wind do not look strictly linear, so we apply a pre-emptive natural log transformation to Ozone.

library(dplyr)                        # for %>%, mutate() and select() (if not already loaded)

data <- data %>%
  mutate(Ozone_log = log(Ozone)) %>%  # create the log-transformed column
  select(-Ozone) %>%                  # remove the original Ozone column
  select(Ozone_log, everything())     # reorder so that Ozone_log is first

psych::pairs.panels(data)

The correlations between Ozone_log and both Temp and Solar.R are higher, while the correlation with Wind is slightly lower. The scatterplots, however, look more linear.

Assumptions

The assumptions for multiple linear regression are the same as for simple linear regression, with one additional condition: no collinearity. Collinearity is when two or more predictors are highly correlated with each other. Strictly, the assumption is only broken by a perfect relationship (\(r = -1\) or \(1\)), but strong and very strong correlations will still have an effect. This causes problems because the model cannot distinguish between the correlated predictors, so the resulting estimated relationships are unreliable. Judging from the correlations above, this is not an issue for this dataset (a numerical check is sketched after the assumption plots below).

We fit the model with all variables and create our assumption plots.

mod <- lm(Ozone_log ~ Temp + Wind + Solar.R + Month + Day, data = data)  # full model with all predictors
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(mod)             # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage

The residuals vs fitted plot is fairly evenly distributed around the line, and the red line is mostly flat. The normal Q-Q plot sticks to the line well except for one outlier; we cross-check with the residuals vs leverage plot, but the point does not exceed the dotted Cook's distance line, so it is not extreme enough to need action. The scale-location line is slightly tilted but fairly straight, and the points are evenly distributed, so the variances are equal.
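
If you want to back up the visual check numerically, Cook's distance for each observation can be extracted straight from the fitted model; common rules of thumb only flag values above about 0.5 or 1.

max(cooks.distance(mod))   # largest Cook's distance in the model
plot(mod, which = 4)       # Cook's distance plotted for every observation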

All assumptions are met.
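
As the extra collinearity check flagged above, variance inflation factors (VIFs) can be computed with the car package (assuming it is installed); VIFs near 1 suggest little collinearity, while values above roughly 5 to 10 are usually treated as a warning sign.

car::vif(mod)   # one VIF per predictor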

Model evaluation

Like simple linear regression, there is a hypothesis test for multiple linear regression. The null hypothesis is that all partial regression coefficients are equal to zero, i.e. the mean of the response predicts it as well as the model does. The alternate hypothesis is that at least one of the partial regression coefficients is not equal to zero, i.e. there is merit in the multiple linear regression model.

\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \] \[ H_1: \text{at least one } \beta_j \neq 0 \]

summary(mod)

We can see that the p-value for the F-statistic is very low, so we reject the null hypothesis, and the model is better than using the mean of the data. The Adjusted R-squared is 65.36%, which means that 65.36% of the variance in Ozone_log can be explained by the predictors. The Residual standard error is 0.5096.
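
As an aside, the overall p-value printed by summary() can be reproduced by hand from the stored F-statistic, which makes the hypothesis test above a little more concrete:

f <- summary(mod)$fstatistic               # F value plus its two degrees of freedom
pf(f[1], f[2], f[3], lower.tail = FALSE)   # upper-tail p-value for the overall F-test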

Temp, Wind and Solar.R are all significant predictors of Ozone_log, but Month and Day are not. So we try removing them from the model.

mod <- lm(Ozone_log ~ Temp + Wind + Solar.R, data = data)
summary(mod)

Temp, Wind and Solar.R are all still significant predictors of Ozone_log. The Adjusted R-squared is 65.5% (which is higher) and the Residual standard error is 0.5086 (which is lower). This means that the model is slightly better without Month and Day.

Following the principle of parsimony, the model with just Temp, Wind and Solar.R is the more parsimonious model, and the model we will use to predict Ozone_log.
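
If you prefer a formal comparison of the two nested models over eyeballing the adjusted R-squared, a partial F-test via anova() does the job; the sketch below keeps both fits under separate names (mod_full and mod_reduced are names chosen here, not ones used above).

mod_full    <- lm(Ozone_log ~ Temp + Wind + Solar.R + Month + Day, data = data)
mod_reduced <- lm(Ozone_log ~ Temp + Wind + Solar.R, data = data)
anova(mod_reduced, mod_full)   # a large p-value suggests Month and Day add little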

Interpretation

The equation for our model is \(\log(\text{Ozone}) = -0.026 + 0.049 \cdot \text{Temp} - 0.062 \cdot \text{Wind} + 0.003 \cdot \text{Solar.R}\). For a unit increase in Temp, log(Ozone) increases by 0.049 units, so Ozone increases by \(e^{0.049} = 1.05\) times, or approximately 4.9%, given all other predictors are held constant. The most useful interpretation here is the percentage increase. Given all partial regression coefficients are smaller than 0.25 in absolute value, we can use a quick approximation: multiply the coefficient by 100 to get the approximate percentage change (the exact values are computed in the sketch after the list below).

  • For every one unit increase in Temp, Ozone increases by 4.9%, given all other predictors are held constant
  • For every one unit increase in Wind, Ozone decreases by 6.2%, given all other predictors are held constant
  • For every one unit increase in Solar.R, Ozone increases by 0.3%, given all other predictors are held constant
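
The exact percentage changes (rather than the quick approximation) can be recovered directly from the fitted coefficients:

(exp(coef(mod)[-1]) - 1) * 100   # exact % change in Ozone per one-unit increase, intercept dropped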

The variation in Ozone concentration in 1973 New York could be mostly (adjusted R-squared = 65.5%) explained by the weather, i.e. temperature, wind speed, and solar radiation.

Simple vs multiple linear regression

The key differences between simple (SLR) and multiple linear regression (MLR) are:

  • Number of predictors: SLR has one predictor, MLR has more than one.
  • Assumptions: MLR has an additional assumption of no collinearity between predictors (CLINE).
  • Hypothesis: The null hypothesis for SLR is that \(\beta_1 = 0\) whereas for MLR \(H_0: \beta_1 = \beta_2 = ... = \beta_k = 0\).
  • Variable selection: MLR has multiple variables, not all of which will be useful.
  • Model evaluation: SLR uses the Multiple R-squared, whereas MLR uses the Adjusted R-squared, which has a penalty for the extra predictors.
  • Interpretation: In SLR, the coefficient is the slope of the line. In MLR, there are multiple dimensions and the coefficient is the partial regression coefficient (don’t forget: given all other predictors are held constant).