Lab 02 – Statistical programming

Learning Outcomes

At the end of this practical students should be able to:

Acknowledge the importance of academic integrity
Import, subset and inspect data in R
Calculate simple summary statistics using both R and Excel
Generate dynamic reports in Quarto using inline R code
Understand how to debug R code

Before we begin

Please create a new Quarto document in your project folder to practice R code and complete lab exercises. We will go through the process in Exercise 1.

The following packages are required for this practical:

tidyverse – a collection of R packages designed for data science
readxl – a package for reading Excel files
modeest – a package for estimating the mode of a distribution
lubridate – a package for working with dates and times

If you have not already installed these packages, you can do so by adding the following code into a code chunk in your Quarto document:

library(tidyverse)
library(readxl)
library(modeest)
library(lubridate)

Finally, please download the data file used in this lab here: soil.xlsx.

Academic integrity

This exercise encourages students to discuss academic integrity, and in particular the grey areas often present. Your demonstrator will provide you with a number of scenarios to discuss with each other in smaller groups, and then with the class.

If you are interested in more information on Academic Integrity at the University of Sydney, see the following link: Academic Integrity. Also ensure you have completed the Academic Honesty Education Module (AHEM). This must be complete before your first assessment is due (next week for ENVX1002).

Exercise 1: Setting up (walk-through)

Getting Started - Important!

Before starting each lab session, you will need to create a new Quarto document to work on the exercises. Follow these steps to set up your lab environment:

Open RStudio Project
- Any of the following should work:
  1. In RStudio, click on the project name in the top-right corner and select your project from a recent list. (It might already be open, otherwise it would show “Project: (None)”)
  2. In RStudio, use File > Open Project to navigate to your .Rproj file
  3. Double-click the .Rproj file directly from your file explorer

Your project name should appear in the top-right corner of RStudio.

Create a New Quarto Document
- Several ways to do it, the easiest is by clicking the “New File” button in the toolbar and selecting “Quarto Document…”
- Give your document a meaningful name (e.g., Lab 2 Exercises), leave other options as default and click “Create”
Save and Render Your Document
- You must save your document now. This acts as a simple check to see that you are working in an environment that is not restricted.
- Click File > Save or press Ctrl+S (Windows) / Cmd+S (Mac)
- Name it something meaningful like lab02_practice.qmd
- Click the “Render” button in the editor toolbar to generate the HTML output

Note

If you encounter any issues, please ask your demonstrator for help.

Follow our lead

Important

We’ve created this section to make sure you don’t get lost during the lab. Remember to read “A brief R guide for surviving ENVX1002” available in the Tool Kit section on Canvas for a better understanding of how to use R in this course.

Two golden rules

Text are formatted in Markdown (like this text)
Code must be written inside code chunks

A code chunk is what makes your document “dynamic” as it can execute code and perform all sorts of tasks.

To create a code chunk:

Quick way: Use keyboard shortcuts
- Windows/Linux: Ctrl + Alt + I
- Mac: Cmd + Option + I
Manual way: Type three backticks followed by {r}, then end with three backticks

Here’s what a code chunk looks like:

```{r}
mean(c(1, 2, 3))
```

Note: the “fence” that we use (the three backticks plus the {r}) will only be visible in Source view. Your demonstrator will show you how to switch between Source and Preview views.

Understanding Functions

This is a function:

# Input → Process → Output
mean(c(2, 4, 6)) # Takes numbers, calculates average, returns 4

You can recognise it because it has:

A name (mean)
Parentheses () to hold input data
Input data inside the parentheses (e.g., c(2, 4, 6))

A function will almost always return an output, which you can use in your code. In this case, the output is 4. Functions are the reason why we can do so much with R – they are like actions that someobe else has written for us to use so that we can complete our tasks with minimal programming.

Let’s look at three common functions for central tendency, but at the same time look at what it means to run functions on data we often call “objects”:

mean():

# Create a vector of numbers and store it in an object called "scores"
scores <- c(85, 92, 78, 95, 8)
# Calculate the mean of the scores
mean(scores)

median():

heights <- c(160, 165, 168, 170, 185)
median(heights)

mode():

votes <- c("yes", "no", "yes", "maybe", "yes")
mfv(votes)

Remember: Every function needs:

Parentheses () to work
Input data (inside the parentheses)
Sometimes extra options (like na.rm = TRUE to handle missing values)

Exercise 2: Water quality

Sulphate (SO₄) is a key indicator in water quality monitoring and can be used to assess environmental impacts from industry and agriculture. In this exercise, we will explore a dataset of SO₄ concentrations (mg/kg) in water samples.

The data is stored in a sheet called “SO4” in the MS Excel file, soil.xlsx.

so4 <- read_excel("data/soil.xlsx", sheet = "SO4")

Try writing and running the code chunks in your own Quarto document to see their outputs. Results will appear below each chunk. See Lab 01 for more information on inserting and running code chunks in Quarto.

When we load data into R for the first time, it is important to check what it looks like (and whether it loaded correctly). The str() function is a good way to quickly inspect the data:

str(so4)

Question 1

What does the output of str() tell us about the data? You may want to look at the documentation ?str or search online for more information, but ask your demonstrator if you’re still unsure about why we use this function. Relate this to the data you are working with.

Since the data is a data frame object, we have a good idea of what functions we can use to explore it. Let’s examine the first few rows of our data:

head(so4)

Let’s calculate some basic descriptive statistics. Read the code and try to understand every part of it, including the text in #.

# Calculate mean, median, and mode
mean_so4 <- mean(so4$SO4)
median_so4 <- median(so4$SO4)
mode_so4 <- mfv(so4$SO4)[1] # Most frequent value using modeest package

# Calculate measures of spread
range_so4 <- range(so4$SO4)
iqr_so4 <- IQR(so4$SO4)
var_so4 <- var(so4$SO4)
sd_so4 <- sd(so4$SO4)

Reporting your data

When reporting statistics in a Quarto document, there are two approaches we could take:

1. Basic R output:

mean_so4

[1] 61.92308

sd_so4

[1] 5.241558

median_so4

[1] 62.1

2. Inline reporting (recommended):

Consider the following reporting:

The mean SO₄ concentration in our samples is 61.92 mg/kg, with a standard deviation of 5.24 mg/kg. The median value is 62.1 mg/kg.

Using inline R code (approach 2) has several advantages:

Statistics are seamlessly integrated into your text
Numbers are automatically updated if your data changes
Results are presented in a reader-friendly format

To create inline R code, use backticks with r, like this:

The mean SO~4~ concentration in our samples is `{r} round(mean_so4, 2)` mg/kg...

Try to recreate the report above in your Quarto document, or use other objects like mode_so4 and var_so4 in your report.

Exercise 3: Using MS Excel

Why use Excel when you have R? Well, Excel is an ubiquitous tool in many industries and can be a useful complement to R for quick data exploration and analysis. Sometimes it may even be easier to use Excel for simple tasks.

Let’s calculate the same statistics using Excel to compare approaches. First:

Open the soil.xlsx file in Excel
Navigate to the “SO4_excel” sheet (we’ll use this sheet to avoid modifying the original data)
Ensure the data is properly displayed in columns

MS Excel remains a popular tool for quick and dirty data analysis and can be a valuable resource in just about any field.

Imported data in Excel. Make sure to use the “SO4_excel” sheet.

For any calculation in Excel:

Select an empty cell
Type “=” followed by the function name
Select your data range
Press Enter

For example, to calculate the mean:

Click an empty cell
Type =AVERAGE(
Select all SO4 values
Type ) and press Enter

Using the `=AVERAGE()` formula in Excel. Note that this screenshot demonstrates the selection of a specific range of data.

Question 2

Calculate these statistics for the SO₄ data in Excel:

For central tendency:
- Mean: Use =AVERAGE()
- Median: Use =MEDIAN()
- Mode: Use =MODE()
For spread:
- Range: Use =MAX() and =MIN()
- IQR: Use =QUARTILE.INC() for Q3 and Q1
- Variance: Use =VAR.S()
- Standard Deviation: Use =STDEV.S()

Write notes on how you used these formulas if necessary. You may also want to compare the results with those obtained in R and write down your thoughts on the efficiency and ease of use for each method.

Exercise 4: Soil data

In agricultural science, soil characteristics are essential for understanding soil health and fertility. The soil sheet in the soil.xlsx file contains data on soil properties at different depths, as well as lithology and land use information.

In this exercise, we’ll explore different ways to subset data in R using the soil characteristics data. First, let’s load the data:

soil <- read_excel("data/soil.xlsx", sheet = "soil")

Understanding data structure

Let’s examine our data structure:

str(soil)
head(soil)

Question 3

What do you notice about the data structure that is different from the SO₄ data? How does this affect the way we subset the data?

The soil data frame contains the following columns:

clay0: Clay content at 0 cm depth
clay60: Clay content at 60 cm depth
ec0: Electrical conductivity at 0 cm depth
ec60: Electrical conductivity at 60 cm depth
lithology: Type of soil composition
land_use: Land usage type

Basic subsetting in R

There are two main ways to subset data in R:

Using the $ operator to select columns
Using square brackets [] to select rows and columns

Using the $ operator

The $ operator allows us to select a specific column from our data frame. For example:

# Get the land use column
soil$land_use

Combined with functions, the $ operator can be used to calculate statistics on specific columns.

# Calculate the mean clay content at 0 cm depth
mean_clay0 <- mean(soil$clay0)
mean_clay0

Notice how we saved the result of the mean() function in a new object mean_clay0. This is useful for storing results and using them later in your code (e.g. for inline reporting). However, it means that the result will not be displayed in the console unless you explicitly print it by typing mean_clay0 again in a new line.

Note

The $ operator is particularly useful when you want to:

Access a single column quickly
Use column values in calculations
Create plots with specific variables

Using square brackets []

Square brackets allow more flexible subsetting using the format: dataframe[rows, columns]. It also works with vectors, lists, and matrices. Try the following examples:

soil[1:5, ] # First 5 rows, all columns
soil[, c("clay0", "clay60")] # Clay columns
soil[1:3, c("ec0", "ec60")] # First 3 rows, electrical conductivity columns

When using [], leaving the row or column section empty (with just a comma) means “select all”

Basic subsetting: Use square brackets to extract soil samples at 0cm depth that have clay content greater than 40%.

soil[soil$clay0 > 40, ]

Multiple conditions: Find samples where clay content at 60cm is greater than 30% AND electrical conductivity at 0cm is less than 0.5.

soil[soil$clay60 > 30 & soil$ec0 < 0.5, ]

Question 4

Practice: Subset the data to find samples where:

clay content at 0cm depth is less than 50%
electrical conductivity at 60cm depth is between 25 and 30, inclusive
clay content at 0cm is less than 10% OR electrical conductivity at 60cm is greater than 25

Exercise 5: Soil statistics

We will continue to work on the soil dataset and practice calculating some basic summary statistics.

Question 5

Calculate the mean, median, and mode for clay content at 0cm depth
Calculate the range, IQR, variance, and standard deviation for electrical conductivity at 60cm depth
Report these statistics in your Quarto document using inline R code

Done!

This is the end of Lab 02. Remember to save your Quarto document ~~and submit it to Canvas~~ when you’re done. If you have any questions, feel free to ask your demonstrator for help.

Bonus: Take-home exercises

Take-home exercises are optional but recommended for further practice. You can complete these exercises in your own time.

Exercise 5: Mario Kart statistics

We’re not going to go too much about the game – but here is a dataset of character attributes from the popular Mario Kart.

Download it here: mario_kart.csv

The data is stored in a CSV file, which MS Excel can open directly. To read it into R, use the read_csv() function from the readr package.

The dataset contains the following variables:

character: Character name
weight: Character’s weight class (1-5 scale)
speed: Maximum speed rating (1-5 scale)
acceleration: How quickly they reach top speed (1-5 scale)
handling: Steering control rating (1-5 scale)
traction: Grip on different surfaces (1-5 scale)
drift_rating: Skill at power-sliding around corners (1-5 scale)

Complete the following exercises using both R and Excel:

Data Import and Inspection
- Load the data into R using read_csv()
- Examine the structure using str()
- View the first few rows using head()
- What do you notice about the data?
Central Tendency Calculate for the speed attribute:
- Mean
- Median
- Mode
Which measure best represents the “typical” speed rating? Why?
Spread Analysis For the weight attribute, calculate:
- Range
- IQR
- Variance
- Standard deviation
What do these tell you about the variation in character weights?
Character Comparison
- Which characters have the highest and lowest acceleration?
- Find all characters with above-average handling
- Identify characters with both speed and weight above 4.0
Advanced Challenge
- Calculate the mean of all attributes for each character
- Who is the most “well-rounded” character (closest to average in all stats)?
- Create a report comparing Mario and Luigi’s stats using inline R code

Attribution

This lab document is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.