Topic 2 – Exploratory Data Analysis (EDA)

ENVX1002 Introduction to Statistical Methods

Dr. Floris van Ogtrop

The University of Sydney

Feb 2024

Topic 2 - Exploratory Data Analysis

Summary statistics:
- measures of centre;
- measures of spread (dispersion).
Graphical summaries:
- bar chart;
- histogram;
- boxplot.

Learning Outcomes

At the end of this topic students should able to:
- Calculate “by hand” summary statistics for simple datasets;
- Manually draw graphical summaries (boxplots and histograms) for simple datasets;
- Demonstrate proficiency in the use of R and Excel for calculating summary statistics and generating graphical summaries;
- Describe key features of their data using summary statistics and graphical summaries.

Types of data

Numerical:
- Continuous: yield, weight
- Discrete: weeds per m^2
Categorical:
- Binary: 2 mutually exclusive categories
- Ordinal: categories ranked in order
- Nominal: qualitative data

Presentation of data

Tables: Experimental data

Presentation of data

Tables: Observational data

Population versus Sample

Before we go calculate averages, we need to think about the difference between population and sample

We take a sample from a larger population
What information does the sample give about the population and how reliable is that information?

bakhtiarzein - https://stock.adobe.com/

Descriptive statistics

Measures of central tendency
- Mean
- Median
- Mode
Measures of spread or dispersion
- Range
- Interquartile range
- Standard deviation / Variance

library(ggplot2)
library(tidyverse)

# Generate a normal distribution
normal_dist <- tibble(x = seq(-4, 4, by = 0.01)) %>%
  mutate(y = dnorm(x))

# Plot the distribution
ggplot(normal_dist, aes(x = x, y = y)) +
  geom_line() +
  ggtitle("Standard Normal Distribution") +
  xlab("Value") +
  ylab("Density")

Motivating example

Sequestered soil carbon is worth $35/tonne if measured (1 Tonne of Carbon = 1 Australian Carbon Credit Unit = $AU35 See Clean Energy Regulator)
It costs $100 to collect and analyse one soil sample for soil carbon
The farmer needs an estimate of carbon stored on the property.
How many samples are needed to give a good estimate of carbon on the property? Is it worth measuring soil carbon for a land holder?

Source: https://www.energy.nsw.gov.au/business-and-industry/programs-grants-and-schemes/primary-industries-carbon-farming

Motivating example

Google earth image with farm and soil-landscape boundaries

Motivating example

Soil carbon content was measured at 6 points across a farm
- The amount at each location was 48, 56, 90, 78, 86, 271 (t/ha)
We will now get into some formulas and calculations ;o)

Sigma notation

\Sigma, is the greek capital letter called sigma, refers to the sum
It is a convenient way to represent long sums

\Sigma_{i=1}^n=x_1+x_2+x_3+...+x_n

- sum(c(48, 56, 78, 86, 90, 271))

total_c <- sum(c(48, 56, 78, 86, 90, 271))
print(total_c)

[1] 629

- =SUM(A1:A6)

Centre: Arithmetic mean

Population mean (\mu): sum of all values of a variable divided by the number of objects in the population;

\mu = \frac{\sum_{i=1}^{N} y_i}{N}

Sample mean (\overline{y}) is based on a subset of n objects from a population of size N

\overline{y} = \frac{\sum_{i=1}^{n} y_i}{n}

- mean(c(48, 56, 78, 86, 90, 271))

mean_c <- mean(c(48, 56, 78, 86, 90, 271))
print(mean_c)

[1] 104.8333

- =AVERAGE(A1:A6)

Centre: Median

Median is the middle number of a set of ordered observations

Population median: M=\left(\frac{N+1}{2}\right)th sorted value

Sample median: \tilde{y}=\left(\frac{n+1}{2}\right)th sorted value

- median(c(48, 56, 78, 86, 90, 271))

median_c <- median(c(48, 56, 78, 86, 90, 271))
print(median_c)

[1] 82

- =MEDIAN(A1:A6)

Centre: Mode

Mode is the most commonly occurring number in a set of observations

mode_function <- function(x) {
  uniq_x <- unique(x)
  uniq_x[which.max(tabulate(match(x, uniq_x)))]
}
data_vector <- c(1, 2, 4, 4, 3, 5, 4)
mode_value <- mode_function(data_vector)
print(mode_value)

[1] 4

- =MODE.SNGL(A1:A7)

Spread: Range

Difference between largest and smallest observations in a group of data
Note that we also refer to spread as measures of dispersion

- max(c(48, 56, 78, 86, 90, 271)) - min(c(48, 56, 78, 86, 90, 271))

range_c <- max(c(48, 56, 78, 86, 90, 271)) - min(c(48, 56, 78, 86, 90, 271))
print(range_c)

[1] 223

- =MAX(A1:A6) - MAX(A1:A6)

Spread: Inter-quartile range (IQR)

Median divides dataset into 2, quartile divides it into 4:
- 25% observations ≤ 1st quartile (Q1)
- 50% observations ≤ Median (Q2)
- 75% observations ≤ 3rd quartile (Q3)

Let’s take an easy example

1 2 3 4 5 6 7 8 9

What is Q1, Median, Q3?

quantile(c(1,2,3,4,5,6,7,8,9))

  0%  25%  50%  75% 100% 
   1    3    5    7    9

Spread: Inter-quartile range (IQR)

IQR = Q_3 - Q_1

Source: Nicholas (1999)

Spread: Inter-quartile range (IQR)

Quartiles

- quantile(c(48, 56, 78, 86, 90, 271))

quant_c <- quantile(c(48, 56, 78, 86, 90, 271))
print(quant_c)

   0%   25%   50%   75%  100% 
 48.0  61.5  82.0  89.0 271.0

- =QUARTILE.INC(A1:A6, 1) - first quartile

Spread: Inter-quartile range (IQR)

IQR

- IQR(c(48, 56, 78, 86, 90, 271))

iqr_c <- IQR(c(48, 56, 78, 86, 90, 271))
print(iqr_c)

[1] 27.5

- =QUARTILE.INC(A1:A6, 3)-QUARTILE.INC(A1:A6, 1) - third quartile - first quartile

Spread: Variance

Describes variability around the arithmetic mean

Population variance: \sigma^2 = \frac{\sum_{i=1}^{N}(y_i - \mu)^2}{N}

Sample variance: s^2 = \frac{\sum_{i=1}^{n}(y_i - \overline{y})^2}{n-1}

- var(c(48, 56, 78, 86, 90, 271))

var_c <- var(c(48, 56, 78, 86, 90, 271))
print(var_c)

[1] 6904.167

- =VAR.S(A1:A6)

Spread: Standard deviation

Describes variability around the arithmetic mean
- Variance is in units^2 as it is based on squared deviations from the mean
- Standard deviation describes variability around the mean in original units
- Standard deviation \sqrt() of the variance

Population standard deviation: \sigma = \sqrt{\frac{\sum_{i=1}^{N}(y_i - \mu)^2}{N}}

Sample standard deviation: s = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \overline{y})^2}{n-1}}

Spread: Standard deviation

R’s sd function always calculates the sample standard deviation
The denominator of sample standard is n−1 (Bessel’s correction) this is an important concept in statistics. A key property is that it gives a more accurate estimate of the population variance and standard deviation when working with a sample.

- sd(c(48, 56, 78, 86, 90, 271))

sd_c <- sd(c(48, 56, 78, 86, 90, 271))
print(sd_c)

[1] 83.09132

- =STDEV.S(A1:A6)

Spread: Coefficient of variation

Let’s take an example where we measured both nitrogen and carbon in our soil such that:

Soil nitrogen (%): 2 16 22 45 65 93

How could we find out which measurements have a greater spread given they have very different units (% versus t/ha)?
It turns out we can use the CV

CV=\left(\frac{s}{\overline{y}}\right)\times{100}

Spread: Coefficient of variation

Looking at the calculations below, which is more variable, Carbon or Nitrogen?

- sd(c(48, 56, 78, 86, 90, 271))

cv_c <- sd(c(48, 56, 78, 86, 90, 271))/mean(c(48, 56, 78, 86, 90, 271))*100
print(cv_c)

[1] 79.2604

cv_n <- sd(c(2, 16, 22, 45, 65, 93))/mean(c(2, 16, 22, 45, 65, 93))*100
print(cv_n)

[1] 84.10661

- =(STDEV.S(A1:A6)/AVERAGE(A1:A6))*100

Robustness (to outliers)

Which summary statistics should I use to describe centre?
- Example: 48, 56, 8, 86, 90, 27
- Example: 48, 56, 8, 86, 90, 271

mean - \overline{y}:

mean(c(48, 56, 8, 86, 90, 27))

[1] 52.5

mean(c(48, 56, 8, 86, 90, 271))

[1] 93.16667

median - \tilde{y}:

median(c(48, 56, 8, 86, 90, 27))

[1] 52

median(c(48, 56, 8, 86, 90, 271))

[1] 71

Robustness (to outliers)

Which summary statistics should I use to describe spread?
- Example: 48, 56, 8, 86, 90, 27
- Example: 48, 56, 8, 86, 90, 271

variance - s:

var(c(48, 56, 8, 86, 90, 27))

[1] 1038.3

var(c(48, 56, 8, 86, 90, 271))

[1] 8472.167

Inter quartile range - IQR:

IQR(c(48, 56, 8, 86, 90, 27))

[1] 46.25

IQR(c(48, 56, 8, 86, 90, 271))

[1] 39

Graphical and tabular summaries

Visualisation of data is useful for identifying
- outliers
- shape and distribution
- communicating results
- suggest modelling strategies
Bar chart
Strip chart
Boxplot
Histogram

Categorical data - table

Different types with examples:
- Binary: We spray insects and see how many die
- Nominal: We count how animals, and their species, are in a forest
- Ordinal: Different disease levels for a plant, no disease, moderate, severe
We can count the number observations belonging to each class, called frequency, f.
- Can present as a frequency table

Categorical data - Bar chart

We first tabulate the data

# Load necessary library
library(ggplot2)

# Your disease data
disease <- c("None", "Moderate", "None", "Severe", "Moderate", "Moderate", "Severe", "Moderate", "None", "Moderate")

# Order factors from no disease to severe disease 
disease = factor(disease, levels = c("None", "Moderate", "Severe"))

# Create a frequency table
disease_tbl <- table(disease)
print(disease_tbl)

disease
    None Moderate   Severe 
       3        5        2

Categorical data - Bar chart

We then plot the table in ggplot

# Convert the table to a data frame for ggplot2
disease_df <- as.data.frame(disease_tbl)

# Rename the columns appropriately
names(disease_df) <- c("Disease", "Frequency")

# Create the bar plot
p <- ggplot(disease_df, aes(x = Disease, y = Frequency)) +
  geom_bar(stat = "identity") +
  ggtitle("Frequency of Disease Categories") +
  xlab("Disease Category") +
  ylab("Frequency")

print(p)

Categorical data - Bar chart

Using tidyverse

# Load necessary libraries
library(tidyverse)

# Your disease data
disease <- c("None", "Moderate", "None", "Severe", "Moderate", "Moderate", "Severe", "Moderate", "None", "Moderate")

# Convert to tibble and count occurrences
disease_data <- tibble(disease) %>%
  mutate(disease = factor(disease, levels = c("None", "Moderate", "Severe"))) %>%
  count(disease, name = "Frequency")

# Create the bar plot
ggplot(disease_data, aes(x = disease, y = Frequency)) +
  geom_bar(stat = "identity") +
  ggtitle("Frequency of Disease Categories") +
  xlab("Disease Category") +
  ylab("Frequency")

Categorical data - Bar chart

NOTE: Bar charts should generally not used for continuous numerical data

Source: Weissgerber at al. (2015)

Numerical Data - Strip chart

Often if we have a small data set (1-5 data points), we can use a stripchart to visualise our data. We will demonstrate using our soil carbon data set.
What do we notice from the plot?

# Load necessary library
library(ggplot2)

# Your data
soil_c <- c(48, 56, 8, 86, 90, 271)

# Convert to a data frame
soil_c_df <- data.frame(Value = soil_c)

# Create the strip chart
p <- ggplot(soil_c_df, aes(x = "", y = Value)) +
  geom_jitter(width = 0) +
  ggtitle("Strip chart of soil carbon") +
  xlab("") +
  ylab("Soil carbon (t/ha)")

print(p)

Numerical Data - Boxplot

We can overlay our strip chart with a boxplot. This shows use the min/max, quartiles and median and can also show outliers.
We generally use boxplots when we have more than 5 data points.
See lecture notes for creating boxplot by hand.

Source: Nicholas (1999)

Numerical Data - Boxplot

Here is a slightly larger data set from a trial where creeping bentgrass turf was laid in an experiment to assess root growth. Eighty (80) “plugs” were randomly sampled 4 weeks after laying. Root growth was measured by averaging the length (mm) of the ten longest roots in each plug.

root_length <- c(108, 102, 100, 135, 113, 109, 92, 97, 73, 65,
68, 74, 93, 97, 118, 121, 103, 99, 90, 90,
99, 102, 106, 90, 92, 97, 100, 92, 80, 99,
103, 103, 115, 85, 96, 86, 85, 86, 91, 90,
94, 93, 93, 99, 109, 115, 110, 94, 107, 88,
101, 89, 117, 91, 112, 101, 91, 81, 80, 67,
69, 80, 86, 81, 65, 90, 99, 93, 90, 102,
72, 70, 90, 90, 87, 89, 90, 96, 108, 86)

Numerical Data - Boxplot

The follow produces a boxplot and also includes the jittered data points (red coloured)
Note that there is on outlier which is the black data pointat the top of the plot

# Load necessary library
library(ggplot2)

# Convert to a data frame
root_length_df <- data.frame(Value = root_length)

# Create the strip chart
ggplot(root_length_df, aes(x = "", y = Value)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, col = "red") +
  ggtitle("Boxplot of root length in bentgrass") +
  xlab("") +
  ylab("Root length (mm)")

Numerical Data - Histogram

Based on frequency table
- Height of each bar proportional to frequency - need to group data
- We can use histograms to describe the shape of distributions for continuous data sets that are larger than 20 data points.

# Load necessary library
library(ggplot2)

# Convert to a data frame
root_length_df <- data.frame(Value = root_length)

# Create the strip chart
ggplot(root_length_df, aes(Value)) +
  geom_histogram() +
  ggtitle("Boxplot of root length in bentgrass") +
  xlab("Root length (mm)")

Summary

The following is a rough guide for plotting continuous data

Remember for categorical data we use Tables and Bar Charts

Numerical Data - Symmetry

Throughout this unit you will be assessing the shape of distributions, in particular you will be looking at whether the distribution (histogram) of the data is symmetrical in shape;
For small data sets, you will generally compare the mean and median;
- if the mean and the median are similar it indicates that the data is symmetrical.
- if the mean and the median are not similar it indicates that the data is skewed.

mean(soil_c)

[1] 93.16667

median(soil_c)

[1] 71

What can we conclude from the mean and median of our soil carbon data?

Numerical Data - Symmetry

For larger data sets, we can look at the mean, median and the histogram to determine if it is symmetrical.

mean(root_length)

[1] 93.8625

median(root_length)

[1] 93

# Create the strip chart
ggplot(root_length_df, aes(Value)) +
  geom_histogram() +
  ggtitle("Boxplot of root length in bentgrass") +
  xlab("Root length (mm)")

Numerical Data - Symmetry

We can also calculate skewness using the following equation

g_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{y_i - \bar{y}}{s} \right)^3

in RStudio, we can use the skewness function found in the e1071 package

library(e1071)
skewness(root_length)

[1] 0.08316635

if |{g_1}|<1.0 then the dataset is approximately symmetrical. If g_1>1.0 then the data is positively skewed and if g_1<-1.0 then the data is negatively skewed

Reading

Canvas site
Notes
Quinn & Keough (2002)
- Chapter 2. Sections 2.1-2.2, p. 14-17.
- Chapter 4. Sections 4.1, p. 58-61 (stop at scatterplot)
Mead et al. (2002).
- Chapter 1.
- Chapter 2. Sections 2.1-2.3, p. 9-19

References

J. Nicholas (1999). Introduction to descriptive statistics. Mathematics Learning Centre, University of Sydney.
T. L. Weissgerber, N. M. Milic, S. J. Winham and V. D. Garovic (2015). Beyond bar and line graphs: time for a new data presentation paradigm. PLOS Biology. 13. e1002128.

Thanks!

This presentation is based on the SOLES Quarto reveal.js template and is licensed under a Creative Commons Attribution 4.0 International License.