Lecture 03: Exploring and visualising data

ENVX1002 Statistics in Life and Environmental Sciences

Januar Harianto

The University of Sydney

Apr 2026

Learning outcomes

After this week, you will be able to:

Explore datasets using summary functions and identify data quality issues
Choose appropriate plot types for different kinds of data
Build layered visualisations in R using ggplot2
Interpret distribution shapes, including skewness and kurtosis

Quick checklist

By now you should have…

Installed R and RStudio
Completed Lecture 2 content and read the ENVX1002 R guide
Reviewed measures of central tendency and spread
Practised using R functions with the function(argument = value) syntax
Rendered a few Quarto documents in RStudio

Why explore data first?

Looking before you leap

Jumping straight into analysis without exploring your data is like navigating without a map. You might arrive somewhere, but probably not where you intended.

Data exploration helps you:

Spot problems early (missing values, outliers, errors)
Understand what shape your data takes
Choose the right analysis method
Avoid drawing conclusions from flawed data

The data exploration workflow

A practical sequence for getting to know a new dataset:

Check the structure and data types
Summarise with descriptive statistics
Visualise distributions and relationships
Flag anything unexpected
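The four steps above can be sketched on a dataset that ships with R. This is only an illustration using the built-in airquality data; the object name missing_per_column is made up for the example.

```r
# Illustrative walk-through using the built-in airquality dataset
data(airquality)

# 1. Check the structure and data types
str(airquality)

# 2. Summarise with descriptive statistics
summary(airquality)

# 3. Visualise a distribution
hist(airquality$Ozone, main = "Ozone readings", xlab = "Ozone (ppb)")

# 4. Flag anything unexpected, here the number of missing values per column
missing_per_column <- colSums(is.na(airquality))
missing_per_column
```

Each step is covered in more detail later in the lecture; the point here is only the order in which you do them.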

Quick recap: types of data

R needs to know what kind of data it is working with. Here is the breakdown:

Category	Type	Ordered?	Example
Categorical	Nominal	No	Species, blood type
Categorical	Ordinal	Yes	Pain scale (mild → severe)
Continuous	Interval	-	Temperature in °C
Continuous	Ratio	-	Height, weight, age

Categorical data groups observations into labels. Continuous data measures quantities on a scale. The distinction matters because each type calls for different summaries and different plots.
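One way these types might be represented in R is sketched below. The values are invented for illustration, and the "moderate" level is an assumption added to give the pain scale a middle step.

```r
# Nominal: unordered categories stored as a factor
species <- factor(c("Adelie", "Gentoo", "Chinstrap", "Adelie"))

# Ordinal: ordered categories, with the order stated explicitly
pain <- factor(c("mild", "severe", "moderate", "mild"),
               levels = c("mild", "moderate", "severe"),
               ordered = TRUE)

# Continuous: numeric values
height_cm <- c(171.2, 165.8, 180.4, 158.9)

class(species)    # "factor"
class(pain)       # "ordered" "factor"
class(height_cm)  # "numeric"

# Comparisons only make sense when there is an order
pain[2] > pain[1]   # TRUE: severe ranks above mild

# Different types call for different summaries
table(species)      # counts suit categorical data
mean(height_cm)     # means suit continuous data
```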

How is your data shaped?

“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey, Exploratory Data Analysis (1977)

The shape of your data influences everything: which summary statistics make sense, which plots to use, and which analyses are valid.

The normal distribution

The most common distribution you will encounter is the normal (or “bell curve”) distribution. It is symmetric, with most values clustered near the centre and fewer in the tails.

Defined by two parameters: the mean (μ) and standard deviation (σ)
Appears frequently in nature: heights, measurement errors, physiological traits

\[X \sim N(\mu, \sigma^2)\]

This notation means: the variable \(X\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\).
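A practical detail when moving from this notation to R: the notation uses the variance, but R's normal functions (rnorm(), dnorm(), pnorm()) take the standard deviation. A minimal sketch, with parameter values chosen only for illustration:

```r
# Illustrative parameters: X ~ N(25, 9), i.e. mean 25 and variance 9
mu <- 25
sigma_sq <- 9
sigma <- sqrt(sigma_sq)   # R expects the standard deviation (3), not the variance

set.seed(1)
x <- rnorm(10000, mean = mu, sd = sigma)

mean(x)  # close to 25
var(x)   # close to 9
sd(x)    # close to 3
```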

What does a normal distribution look like?

library(ggplot2)  # plotting functions used throughout this lecture

# Generate normal distribution data
set.seed(123)
normal_data <- rnorm(1000, mean = 0, sd = 1)

# Plot normal distribution
ggplot(data.frame(x = normal_data), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30,
                 fill = "skyblue",
                 colour = "black") +
  geom_density(colour = "red") +
  labs(title = "Standard Normal Distribution (μ = 0, σ = 1)",
       x = "Value",
       y = "Density")

The empirical rule (68-95-99.7)

For a normal distribution, the mean, median, and mode are all the same value. The data clusters predictably around that centre:

About 68% of values fall within 1 standard deviation of the mean
About 95% within 2 standard deviations
About 99.7% within 3 standard deviations
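These percentages can be checked directly from the normal cumulative distribution function, pnorm(). The helper name within_k_sd is made up for this sketch.

```r
# Proportion of a normal distribution within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- round(within_k_sd(1:3), 4)
coverage
# 0.6827 0.9545 0.9973
```

The result does not depend on the particular mean or standard deviation, because k is measured in standard deviations.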

The empirical rule, visualised

This pattern makes the normal distribution especially useful: if you know the mean and standard deviation, you already know where most of the data sits.

# Create a standard normal distribution
x <- seq(-4, 4, length.out = 1000)
y <- dnorm(x)
df <- data.frame(x = x, y = y)

# Plot with empirical rule highlighted
ggplot(df, aes(x = x, y = y)) +
  geom_line() +
  # Add vertical reference lines at standard deviations
  geom_vline(xintercept = c(-3, -2, -1, 0, 1, 2, 3), linetype = "dashed", colour = "gray50", alpha = 0.7) +
  geom_area(data = subset(df, x >= -1 & x <= 1), fill = "darkblue", alpha = 0.3) +
  geom_area(data = subset(df, (x >= -2 & x < -1) | (x > 1 & x <= 2)), fill = "darkgreen", alpha = 0.3) +
  geom_area(data = subset(df, (x >= -3 & x < -2) | (x > 2 & x <= 3)), fill = "darkred", alpha = 0.3) +
  annotate("text", x = 0, y = 0.2, label = "68%", colour = "darkblue") +
  annotate("text", x = 1.5, y = 0.1, label = "95%", colour = "darkgreen") +
  annotate("text", x = 2.5, y = 0.05, label = "99.7%", colour = "darkred") +
  labs(title = "Normal Distribution: Empirical Rule",
       x = "Standard Deviations from Mean",
       y = "Density")

Why does this matter?

Many biological measurements are approximately normally distributed. If you know your data is roughly normal, you can predict where most values will land and spot anything unusual.

# Simulate plant height data
set.seed(456)
plant_heights <- rnorm(200, mean = 25, sd = 3)  # Heights in cm

# Plot the distribution
ggplot(data.frame(height = plant_heights), aes(x = height)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 20,
                 fill = "#66c2a5",
                 colour = "black") +
  geom_density(colour = "#1f78b4", linewidth = 1) +
  geom_vline(xintercept = 25, linetype = "dashed", colour = "red") +
  annotate("text", x = 25.5, y = 0.05, label = "μ = 25 cm", colour = "red") +
  labs(title = "Distribution of Plant Heights in a Population",
       x = "Height (cm)",
       y = "Density")

Here, plant heights cluster around a mean of 25 cm. A plant measuring 35 cm would be more than 3 standard deviations from the mean, making it a clear outlier worth investigating.
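The "more than 3 standard deviations" claim can be worked through as a z-score. This sketch reuses the simulation parameters (mean 25 cm, standard deviation 3 cm); the variable names are illustrative.

```r
mu <- 25      # population mean used in the simulation (cm)
sigma <- 3    # population standard deviation used in the simulation (cm)

# z-score: how many standard deviations is 35 cm from the mean?
z <- (35 - mu) / sigma
z   # about 3.33

# Probability of a value at least this far from the mean, in either direction
p_extreme <- 2 * pnorm(-abs(z))
p_extreme   # well under 0.1%
```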

But what if the data is not symmetric?

Not all data follows a neat bell curve. When a distribution leans to one side, we call it skewed. The direction of the longer tail tells you the type of skew, and it shifts the mean away from the median.
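A small hand-made example shows why the mean moves and the median does not. The numbers are invented purely for illustration.

```r
# Seven typical values plus one extreme value in the right tail
values <- c(1, 2, 2, 3, 3, 3, 4, 20)

mean(values)    # 4.75, pulled towards the extreme value
median(values)  # 3, unaffected by how extreme the largest value is

# Remove the extreme value: the mean drops, the median stays put
typical <- values[values < 20]
mean(typical)    # about 2.57
median(typical)  # 3
```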

Positive skew (right-skewed)

The tail stretches to the right, pulling the mean above the median. Common in income data, reaction times, and species abundance counts.

# Generate positive skewed distribution
set.seed(123)
right_skewed <- exp(rnorm(1000, 0, 0.5))

# Calculate statistics
mean_val <- mean(right_skewed)
median_val <- median(right_skewed)

# Plot positive skewed distribution
ggplot(data.frame(x = right_skewed), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30,
                 fill = "#66c2a5", # colourblind-friendly green
                 colour = "black") +
  geom_density(colour = "#1f78b4", linewidth = 1) +
  # Add vertical lines for mean and median
  geom_vline(xintercept = mean_val, colour = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = median_val, colour = "blue", linetype = "dashed", linewidth = 1) +
  # Add annotations
  annotate("text", x = mean_val + 0.2, y = 0.5, label = paste("Mean =", round(mean_val, 2)), colour = "red") +
  annotate("text", x = median_val - 0.2, y = 0.4, label = paste("Median =", round(median_val, 2)), colour = "blue") +
  labs(title = "Positive skew (right-skewed)",
       subtitle = "Note that Mean > Median",
       x = "Value",
       y = "Density")

Negative skew (left-skewed)

The tail stretches to the left, pulling the mean below the median. Less common, but seen in exam scores near a ceiling or age at retirement.

# Mirror the right-skewed sample to create a left-skewed one
# (reuses right_skewed from the previous block, so no new random draw is needed)
left_skewed <- max(right_skewed) - right_skewed + min(right_skewed)

# Calculate statistics
mean_val <- mean(left_skewed)
median_val <- median(left_skewed)

# Plot negative skewed distribution
ggplot(data.frame(x = left_skewed), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30,
                 fill = "#8da0cb", # colourblind-friendly blue
                 colour = "black") +
  geom_density(colour = "#1f78b4", linewidth = 1) +
  # Add vertical lines for mean and median
  geom_vline(xintercept = mean_val, colour = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = median_val, colour = "blue", linetype = "dashed", linewidth = 1) +
  # Add annotations
  annotate("text", x = mean_val - 0.2, y = 0.5, label = paste("Mean =", round(mean_val, 2)), colour = "red") +
  annotate("text", x = median_val + 0.2, y = 0.4, label = paste("Median =", round(median_val, 2)), colour = "blue") +
  labs(title = "Negative skew (left-skewed)",
       subtitle = "Note that Mean < Median",
       x = "Value",
       y = "Density")

Kurtosis: how heavy are the tails?

Kurtosis describes whether a distribution has heavy tails (more extreme values than expected) or light tails (fewer extremes). It is not about how “peaked” the distribution looks, despite what some textbooks say.

Mesokurtic: a normal distribution (the baseline)
Leptokurtic: heavier tails, more outliers
Platykurtic: lighter tails, fewer outliers
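Kurtosis is usually reported as excess kurtosis, meaning the value relative to a normal distribution (which scores 0). The sketch below uses a simple moment-based estimator written for illustration; the function name excess_kurtosis is made up, and packages such as e1071 apply slightly different small-sample formulas, so their values can differ a little.

```r
# Moment-based excess kurtosis: fourth central moment over squared variance, minus 3
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m4 <- mean((x - mean(x))^4)
  m4 / m2^2 - 3
}

set.seed(42)
n <- 100000
k_meso  <- excess_kurtosis(rnorm(n))          # near 0 (mesokurtic)
k_lepto <- excess_kurtosis(rt(n, df = 5))     # clearly positive (leptokurtic)
k_platy <- excess_kurtosis(runif(n, -3, 3))   # near -1.2 (platykurtic)

c(mesokurtic = k_meso, leptokurtic = k_lepto, platykurtic = k_platy)
```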

Kurtosis, visualised

Compare these three distributions. They are all symmetric and centred on zero, but their tails behave very differently.

# Generate distributions with different kurtosis
set.seed(123)
normal <- rnorm(1000, 0, 1)  # Mesokurtic
leptokurtic <- rt(1000, df = 5)  # t-distribution with 5 df is leptokurtic
platykurtic <- runif(1000, -3, 3)  # Uniform distribution is platykurtic

# Calculate excess kurtosis with the e1071 package
# (reported relative to a normal distribution, which scores about 0)
library(e1071)
k_normal <- kurtosis(normal)
k_lepto <- kurtosis(leptokurtic)
k_platy <- kurtosis(platykurtic)

# Plot distributions
p1 <- ggplot(data.frame(x = normal), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "#a6cee3", # colourblind-friendly blue
                 colour = "black") +
  geom_density(colour = "#1f78b4", linewidth = 1) + # Darker blue
  annotate("text", x = 2, y = 0.3,
           label = paste("Kurtosis =", round(k_normal, 2)),
           colour = "#1f78b4") +
  labs(title = "Mesokurtic (normal)",
       subtitle = "Normal distribution with balanced tails",
       x = "Value",
       y = "Density")

p2 <- ggplot(data.frame(x = leptokurtic), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "#fb9a99", # colourblind-friendly pink
                 colour = "black") +
  geom_density(colour = "#e31a1c", linewidth = 1) + # Darker red
  annotate("text", x = 2, y = 0.3,
           label = paste("Kurtosis =", round(k_lepto, 2)),
           colour = "#e31a1c") +
  labs(title = "Leptokurtic (heavy-tailed)",
       subtitle = "Heavier tails, more extreme values",
       x = "Value",
       y = "Density")

p3 <- ggplot(data.frame(x = platykurtic), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "#b2df8a", # colourblind-friendly green
                 colour = "black") +
  geom_density(colour = "#33a02c", linewidth = 1) + # Darker green
  annotate("text", x = 0, y = 0.15,
           label = paste("Kurtosis =", round(k_platy, 2)),
           colour = "#33a02c") +
  labs(title = "Platykurtic (light-tailed)",
       subtitle = "Lighter tails, fewer extreme values",
       x = "Value",
       y = "Density")

# Display plots side by side (use p1 / p2 / p3 to stack them vertically)
library(patchwork)
p1 + p2 + p3

Exploring data in R

Data structures you will use

The two structures you will work with most in this course:

Vectors: a one-dimensional sequence of values of the same type

heights <- c(1.65, 1.70, 1.75, 1.80, 1.85)
heights

[1] 1.65 1.70 1.75 1.80 1.85

Data frames: a table with rows and columns (like a spreadsheet)

df <- data.frame(
  species = c("A", "B", "C", "A", "B"),
  height = c(1.65, 1.70, 1.75, 1.80, 1.85),
  weight = c(60, 65, 70, 75, 80)
)
df

  species height weight
1       A   1.65     60
2       B   1.70     65
3       C   1.75     70
4       A   1.80     75
5       B   1.85     80

Getting to know your data

A few functions that give you an immediate overview:

str(df)

'data.frame':   5 obs. of  3 variables:
 $ species: chr  "A" "B" "C" "A" ...
 $ height : num  1.65 1.7 1.75 1.8 1.85
 $ weight : num  60 65 70 75 80

summary(df)

   species              height         weight  
 Length:5           Min.   :1.65   Min.   :60  
 Class :character   1st Qu.:1.70   1st Qu.:65  
 Mode  :character   Median :1.75   Median :70  
                    Mean   :1.75   Mean   :70  
                    3rd Qu.:1.80   3rd Qu.:75  
                    Max.   :1.85   Max.   :80

unique(df$species)

[1] "A" "B" "C"

summary() is useful, but it does not always reveal the full picture.

Visualising missing data

The naniar package can show you exactly where values are missing, which is much easier to interpret than scanning raw output.

library(naniar)
vis_miss(airquality)

Seeing vs reading

Compare the vis_miss() output to the raw data below. Which one makes the missing values easier to spot?

airquality$Ozone

  [1]  41  36  12  18  NA  28  23  19   8  NA   7  16  11  14  18  14  34   6
 [19]  30  11   1  11   4  32  NA  NA  NA  23  45 115  37  NA  NA  NA  NA  NA
 [37]  NA  29  NA  71  39  NA  NA  23  NA  NA  21  37  20  12  13  NA  NA  NA
 [55]  NA  NA  NA  NA  NA  NA  NA 135  49  32  NA  64  40  77  97  97  85  NA
 [73]  10  27  NA   7  48  35  61  79  63  16  NA  NA  80 108  20  52  82  50
 [91]  64  59  39   9  16  78  35  66 122  89 110  NA  NA  44  28  65  NA  22
[109]  59  23  31  44  21   9  NA  45 168  73  NA  76 118  84  85  96  78  73
[127]  91  47  32  20  23  21  24  44  21  28   9  13  46  18  13  24  16  13
[145]  23  36   7  14  30  NA  14  18  20

summary(airquality$Ozone)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00   18.00   31.50   42.13   63.25  168.00      37
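Missing values also affect everyday calculations. Most summary functions in R return NA if any value is missing unless you ask them to skip the NAs, as this short sketch on the same column shows (the variable names are illustrative).

```r
# By default, a single NA makes the result NA
mean(airquality$Ozone)                 # NA

# na.rm = TRUE drops missing values before calculating
ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)
round(ozone_mean, 2)                   # 42.13, matching summary()

# Count and proportion of missing values
n_missing <- sum(is.na(airquality$Ozone))
n_missing                              # 37
mean(is.na(airquality$Ozone))          # about 0.24, i.e. roughly a quarter missing
```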

Histograms: the shape of continuous data

Use histograms when you want to see how a continuous variable is distributed. They reveal the centre, spread, and any skewness or outliers at a glance.

library(palmerpenguins)  # provides the penguins data frame

hist(penguins$body_mass_g,
     main = "Distribution of Penguin Body Mass",
     xlab = "Body Mass (g)",
     col = "skyblue",
     border = "white")

Bar plots: comparing categories

Bar plots show counts or proportions across groups. Use them for categorical data like species, treatments, or survey responses.

species_counts <- table(penguins$species)
barplot(species_counts,
        main = "Count of Penguins by Species",
        xlab = "Species", ylab = "Count",
        col = c("darkorange", "purple", "cyan4"),
        border = "white")
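To plot proportions instead of counts, prop.table() converts a table of counts into shares that sum to 1. The treatment vector below is invented for illustration and is not from the penguins data.

```r
# Hypothetical categorical variable
treatment <- c("control", "drug", "drug", "control", "drug", "placebo")

# Counts per category
counts <- table(treatment)
counts

# Proportions per category (sum to 1)
props <- prop.table(counts)
round(props, 2)

# Bar plot of proportions instead of counts
barplot(props, xlab = "Treatment", ylab = "Proportion")
```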

Scatterplots: relationships between variables

Scatterplots show how two continuous variables relate to each other. They are the go-to plot for spotting correlations, clusters, and outliers.

penguins_clean <- na.omit(penguins[, c("flipper_length_mm", "body_mass_g")])
plot(penguins_clean$flipper_length_mm, penguins_clean$body_mass_g,
     main = "Relationship Between Flipper Length and Body Mass",
     xlab = "Flipper Length (mm)", ylab = "Body Mass (g)",
     pch = 19, col = "darkblue")
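The strength of a linear relationship seen in a scatterplot can be put into a single number with cor(). This sketch uses the built-in airquality data rather than the penguins so that it runs without extra packages. Pearson correlation is R's default and only captures linear association, so it complements the plot rather than replacing it.

```r
# Illustrative only: temperature and ozone from the built-in airquality data
temp  <- airquality$Temp
ozone <- airquality$Ozone

# Ozone contains NAs, so use only the rows where both values are present
r <- cor(temp, ozone, use = "complete.obs")
r   # positive: warmer days tend to have higher ozone readings

plot(temp, ozone,
     xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)",
     pch = 19, col = "darkblue")
```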

Boxplots: comparing distributions across groups

Boxplots summarise the median, quartiles, and outliers for each group. They are ideal for comparing a continuous variable across categories.

boxplot(body_mass_g ~ species, data = penguins,
        main = "Body Mass by Penguin Species",
        xlab = "Species", ylab = "Body Mass (g)",
        col = c("darkorange", "purple", "cyan4"),
        border = "black")
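By default, R's boxplot() draws a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box. The function boxplot.stats() returns the numbers behind the picture; the data below are invented for illustration.

```r
# Nine ordinary values and one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

bs <- boxplot.stats(x)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats   # 1.0 3.0 5.5 8.0 9.0

# Points beyond 1.5 x IQR from the box are reported as outliers
bs$out     # 50
```

Here the box spans 3 to 8, so the upper fence sits at 8 + 1.5 × 5 = 15.5, and 50 falls well outside it.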

Introduction to ggplot2

A grammar for graphics

ggplot2 builds plots from three core ingredients:

Data: the dataset
Aesthetics (aes): which variables map to x, y, colour, size, etc.
Geometries (geom_*): how the data is drawn (points, bars, lines, etc.)

You combine these with + to build a plot layer by layer. As you get more comfortable, you can add scales, facets, and themes to refine the result.

Building a plot: start with data

We will build a scatterplot of penguin flipper length vs body mass, step by step.

library(ggplot2)
library(palmerpenguins)  # provides the penguins data frame
penguins_clean <- na.omit(penguins)

# Start with data — this creates an empty canvas
ggplot(penguins_clean)

Nothing to see yet. We need to tell ggplot2 what to plot and how to draw it.

Add aesthetics and a geometry

Map variables to axes with aes(), then choose a geometry to draw them:

ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

Two lines of code, and we already have a working scatterplot. The + operator is how you add layers in ggplot2.

Colour by group and add labels

Map a categorical variable to colour inside aes() to split by group. Use labs() to add titles and clean up axis names.

ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g, colour = species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Relationship Between Flipper Length and Body Mass",
    subtitle = "Palmer Penguins Dataset",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    colour = "Penguin Species"
  )

Customise colours and themes

scale_colour_manual() lets you pick your own palette. Themes control the overall look of the plot.

p <- ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g, colour = species)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_colour_manual(values = c("darkorange", "purple", "cyan4")) +
  labs(
    title = "Relationship Between Flipper Length and Body Mass",
    subtitle = "Palmer Penguins Dataset",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    colour = "Penguin Species"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
p

Adding more layers

Once you have a base plot saved as p, you can keep adding layers. Trend lines and facets are two of the most useful:

# Add a linear trend line per species
p + geom_smooth(method = "lm", se = FALSE)

# Split into separate panels
p + facet_wrap(~ species)

References and resources

Thanks!

This presentation is based on the SOLES Quarto reveal.js template and is licensed under a Creative Commons Attribution 4.0 International License.
