Learning Outcomes
In this lab, you will work towards achieving learning outcomes
Lab Objectives
In this lab, we will:
- []
- []
Please work on this exercise by creating your own R Markdown file.
Exercise 1: Model quality - multiple vs adjusted R²
Data: California_streamflow spreadsheet
Import the “California_streamflow” sheet into R.
In this exercise we will use the same data as last week. To jog your memory, the dataset contains 43 years of annual precipitation measurements (in mm) taken at (originally) 6 sites in the Owens Valley in California. Model selection via partial F-tests gave us the final model below:
We will now add a totally useless variable to the dataset. This variable is a random number generated from a normal distribution with mean 3 and standard deviation 2. We use the set.seed() function to make sure that everybody gets the same random values.
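A minimal sketch of this step follows. The seed value (123) and the data frame name `streamflow` are assumptions for illustration only; use the seed given in the lab and whatever name you gave the imported data.

```r
# Number of rows in the dataset: 43 years of measurements
n_years <- 43

# Fix the random number generator so the draws are reproducible.
# NOTE: 123 is a hypothetical seed, not necessarily the one used in the lab.
set.seed(123)

# Draw one value per year from a normal distribution with mean 3 and sd 2
random_no <- rnorm(n_years, mean = 3, sd = 2)
head(random_no)

# With the imported data frame (hypothetically named `streamflow`) you would
# attach the new column like this:
# streamflow$random_no <- rnorm(nrow(streamflow), mean = 3, sd = 2)
```

Re-running `set.seed()` with the same value before `rnorm()` reproduces exactly the same numbers, which is why everybody in the class ends up with the same "useless" variable.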
We will then see what impact including a totally useless variable, such as this random number, has on two measures of model quality: the multiple R² and adjusted R² values.
Task: create two regression models:
- runoff_volume ~ rock_creek + pine_creek
- runoff_volume ~ rock_creek + pine_creek + random_no
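One way to fit the two models and pull out both quality measures is sketched below. The data frame here is a simulated placeholder (not the real California_streamflow values), and the name `streamflow`, the seed, and the simulated coefficients are all assumptions made only so the example runs on its own. Replace the placeholder with your imported data.

```r
# --- Placeholder data (simulated; NOT the real measurements) ---------------
set.seed(1)  # hypothetical seed, only for this stand-in data
n <- 43
streamflow <- data.frame(
  rock_creek = runif(n, min = 5, max = 20),
  pine_creek = runif(n, min = 5, max = 25)
)
streamflow$runoff_volume <- 10000 + 1800 * streamflow$rock_creek +
  1400 * streamflow$pine_creek + rnorm(n, sd = 5000)
streamflow$random_no <- rnorm(n, mean = 3, sd = 2)

# --- The two models from the task ------------------------------------------
fit_base   <- lm(runoff_volume ~ rock_creek + pine_creek, data = streamflow)
fit_random <- lm(runoff_volume ~ rock_creek + pine_creek + random_no,
                 data = streamflow)

# --- Collect multiple and adjusted R-squared side by side ------------------
model_quality <- data.frame(
  model         = c("base", "with random_no"),
  r_squared     = c(summary(fit_base)$r.squared,
                    summary(fit_random)$r.squared),
  adj_r_squared = c(summary(fit_base)$adj.r.squared,
                    summary(fit_random)$adj.r.squared)
)
print(model_quality)
```

The same two values also appear near the bottom of the printed `summary()` output for each model, labelled "Multiple R-squared" and "Adjusted R-squared".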
Question 2
Compare the two models in terms of their multiple R² and adjusted R² values. Which performance measure (multiple R² or adjusted R²) would you use to decide which predictors to include in your model?
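As a reminder when answering, the standard definition of adjusted R² (the one reported by `summary()` for an `lm` fit) can be written as follows. The symbols are introduced here and are not from the lab text: n is the number of observations and p is the number of predictors, not counting the intercept.

```latex
\[
R^2_{\text{adj}} = 1 - \frac{(1 - R^2)\,(n - 1)}{n - p - 1}
\]
```

With n = 43, the factor (n - 1)/(n - p - 1) is 42/40 for the two-predictor model and 42/39 for the three-predictor model.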
Exercise 2: Fish productivity
Fish communities were surveyed in lakes across a eutrophication gradient to investigate the relationship between productivity and fish diversity.
The datasheet has the following variables:
- lake_id: Unique identifier for each lake
- chla: Log-transformed Chlorophyll a
- richness: Log-transformed species richness
- evenness: Log-transformed Pielou’s evenness
- abundance: Log-transformed abundance per unit effort (NPUE)
- biomass: Log-transformed biomass per unit effort (BPUE)
- productivity: Log-transformed productivity proxy
Rows: 39 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): lake_id
dbl (6): chla, richness, evenness, abundance, biomass, productivity
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spc_tbl_ [39 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ lake_id : chr [1:39] "FTH" "XLH" "HGH" "LH" ...
$ chla : num [1:39] 4.61 4.53 3.84 3.71 2.43 ...
$ richness : num [1:39] 2.64 2.89 1.99 2.94 2.37 ...
$ evenness : num [1:39] -0.356 -0.67 -0.423 -0.405 -0.589 ...
$ abundance : num [1:39] 2.24 3.34 1.24 2.67 1.12 ...
$ biomass : num [1:39] 4.96 6.03 4.67 6.52 5.32 ...
$ productivity: num [1:39] 3.47 4.47 2.71 4.47 3.24 ...
- attr(*, "spec")=
.. cols(
.. lake_id = col_character(),
.. chla = col_double(),
.. richness = col_double(),
.. evenness = col_double(),
.. abundance = col_double(),
.. biomass = col_double(),
.. productivity = col_double()
.. )
- attr(*, "problems")=<externalptr>
Explore the data on your own before making any models. (Hint: remember that we are only interested in the numeric variables for our linear models.)
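A possible starting point for the exploration is sketched below. The data frame is a simulated placeholder with the same column names and dimensions as the datasheet (39 rows, 7 columns). Its values are random noise, and the name `fish` and the seed are assumptions for illustration, so swap in your imported data before interpreting anything.

```r
# --- Placeholder data (random noise; NOT the real survey values) -----------
set.seed(42)  # hypothetical seed, only for this stand-in data
fish <- data.frame(
  lake_id      = paste0("lake_", 1:39),
  chla         = rnorm(39),
  richness     = rnorm(39),
  evenness     = rnorm(39),
  abundance    = rnorm(39),
  biomass      = rnorm(39),
  productivity = rnorm(39)
)

# Keep only the numeric columns (drops the lake_id identifier)
fish_num <- fish[, sapply(fish, is.numeric)]

# Numeric summaries of each variable
summary(fish_num)

# Scatterplot matrix: every variable plotted against every other
pairs(fish_num)

# Pairwise correlations, rounded for readability
cor_mat <- round(cor(fish_num), 2)

# Correlation of each variable with the response
cor_mat["productivity", ]
```

The scatterplot matrix is useful for spotting both relationships with the response and relationships among the predictors themselves, while the correlation matrix puts numbers on what the plots suggest.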
Question 1
Are there any obvious relationships between the response variable (productivity) and the other variables?
Question 2
Are there any other patterns or potential issues you can see from the exploratory plots?