Lecture 01b – Reproducible science

ENVX1002 Statistics in Life and Environmental Sciences

Januar Harianto

The University of Sydney

Mar 2026

Importance of statistics

Never leave a number all by itself. Never believe that one number on its own can be meaningful. If you are offered one number, always ask for at least one more. Something to compare it with.

– Hans Rosling (1948-2017)

About me

  • Lecturer in Biostatistics/Data Science, also a research software engineer
  • Marine ecophysiologist by training, studied echinoderms (sea stars, sea urchins, etc.) and coral reefs

Why learn statistics?

All of science (and industry) is increasingly data-driven and computational

“I don’t think it is relevant to my degree or future career”

Does it matter? Statistics is everywhere and can be applied in almost any field – a multi-disciplinary problem-solving tool.

Benefits

Even if you don’t become a data scientist, statistics will help you to:

  1. Evaluate claims critically - isn’t that important for… everyone?
  2. Communicate effectively - statistics is NOT an exclusive language
  3. Solve real-world problems - wouldn’t it be great to understand the statistics behind the large decisions that affect our lives (e.g. budgeting, mortgage rates, etc.)?

The joy of stats

200 countries, 200 years, 4 minutes

In your own time: The best stats you’ve ever seen

Lionel Messi is impossible

It’s not possible to shoot more efficiently from outside the penalty area than many players shoot inside it. It’s not possible to lead the world in weak-kick goals and long-range goals. It’s not possible to score on unassisted plays as well as the best players in the world score on assisted ones. It’s not possible to lead the world’s forwards both in taking on defenders and in dishing the ball to others. And it’s certainly not possible to do most of these things by insanely wide margins.

But Messi does all of this and more.

Lionel Messi standing with the Argentina national football team on the pitch

Messi playing for Argentina. Image credit: Кирилл Венедиктов, CC BY-SA 3.0 GFDL, via Wikimedia Commons

Lionel Messi is impossible

Chart comparing Messi's scoring production to other top football players, showing Messi as a statistical outlier

Source: FiveThirtyEight

Lionel Messi is impossible

Chart comparing Messi's shooting ratio to other top football players, highlighting his exceptional efficiency

Source: FiveThirtyEight

Serious stats

Which species has the longest sepals?

Photographs of three iris species -- setosa, versicolor, and virginica -- showing differences in sepal and petal shape

Source: Embedded Robotics

Note: the iris dataset is commonly used in statistics and machine learning.

Serious stats

Visualise

# load libraries
pacman::p_load(ggplot2, rstatix, gt)

# create boxplot
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  theme_classic() +
  labs(y = "Sepal Length (cm)", title = "Sepal Length by Species")
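
A numerical summary can sit alongside the boxplot. The following is a minimal sketch using only base R and the built-in iris data; the object names `species_means` and `longest` are illustrative and not part of the original slides.

```r
# mean sepal length (cm) for each species, using base R only
species_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(species_means)

# species with the largest mean sepal length
longest <- as.character(
  species_means$Species[which.max(species_means$Sepal.Length)]
)
print(longest)
```

The means are 5.006 cm (setosa), 5.936 cm (versicolor) and 6.588 cm (virginica), so virginica has the longest sepals on average. Whether those differences are larger than expected by chance is a separate question that the plot and the means alone cannot answer.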

Serious stats

Infer - the scientific method

We use formal statistical tests to determine if differences are statistically significant so that we can make inferences about the population based on the sample data.

# run ANOVA
model <- aov(Sepal.Length ~ Species, data = iris)
# extract the F statistic and p-value so they can be reported in the text
f_stat <- summary(model)[[1]]$`F value`[1]
p_val <- summary(model)[[1]]$`Pr(>F)`[1]
# tidy the ANOVA results and format them as a display table
rstatix::anova_summary(model) |>
  gt() |>
  opt_table_font(font = "Crimson Pro", add = FALSE) |>
  tab_options(table.font.size = px(28)) |>
  tab_caption(
    caption = "Table 1: One-way ANOVA results comparing sepal length between iris species"
  )

Table 1: One-way ANOVA results comparing sepal length between iris species

Effect   DFn  DFd  F        p         p<.05  ges
Species  2    147  119.265  1.67e-31  *      0.619

A one-way ANOVA revealed significant differences in mean sepal length between species (F(2, 147) = 119.26, p < .001).
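
The numbers in a sentence like this can be pulled from the fitted model instead of being typed by hand, which keeps the text and the analysis in sync. The sketch below is one way to do that in base R; the names `tab`, `p_text` and `report`, and the rule for printing small p-values, are assumptions made for illustration.

```r
# refit the model so this chunk runs on its own
model <- aov(Sepal.Length ~ Species, data = iris)
tab <- summary(model)[[1]]

# pull the numbers out of the fitted model rather than typing them by hand
f_stat <- tab$`F value`[1]
p_val <- tab$`Pr(>F)`[1]
df_num <- as.integer(tab$Df[1])  # degrees of freedom for Species
df_den <- as.integer(tab$Df[2])  # residual degrees of freedom

# p-values below .001 are reported as "p < .001"
p_text <- if (p_val < 0.001) "p < .001" else sprintf("p = %.3f", p_val)

# assemble the text that goes inside the brackets of the results sentence
report <- sprintf("F(%d, %d) = %.2f, %s", df_num, df_den, f_stat, p_text)
print(report)
```

If the data or the model change, rerunning the chunk updates the reported values automatically.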

Not always formal

The scientific method

The man of science has learned to believe in justification, not by faith, but by verification.

– Thomas Huxley (1825-1895)

Science as an enterprise

  • The scientific method – fundamental to centuries of scientific progress
  • If you discover something (or not), it should be possible for others to verify your findings independently
  • Your findings should be reproducible and replicable

No single method

Diagram of the logical framework for scientific inquiry as described by Underwood (1997)

The logical framework by Underwood (1997)

No single method

Flowchart showing variations of the scientific method as a cyclical workflow

No single method

HATPC

  • Hypothesis
  • Assumptions
  • Test statistic
  • P-value
  • Conclusion

Diagram of the HATPC framework: Hypothesis, Assumptions, Test statistic, P-value, Conclusion
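
As a rough illustration, the five HATPC steps can be mapped onto the iris example in R. This is a sketch only: the particular assumption checks (`shapiro.test()` on the residuals and `bartlett.test()` for equal variances) and the 0.05 threshold are choices made here for illustration and are not prescribed by the framework.

```r
# H: hypotheses
#    H0: mean sepal length is the same for all three species
#    H1: at least one species mean differs
model <- aov(Sepal.Length ~ Species, data = iris)

# A: assumptions (one possible set of checks, not the only option)
normality <- shapiro.test(residuals(model))  # are residuals roughly normal?
equal_var <- bartlett.test(Sepal.Length ~ Species, data = iris)  # similar spread?

# T: test statistic
tab <- summary(model)[[1]]
f_stat <- tab$`F value`[1]

# P: p-value
p_val <- tab$`Pr(>F)`[1]

# C: conclusion, using the conventional 0.05 threshold (an assumption here)
alpha <- 0.05
conclusion <- if (p_val < alpha) "reject H0" else "fail to reject H0"

print(c(F = f_stat, p = p_val))
print(conclusion)
```

The assumption checks are stored in `normality` and `equal_var` so they can be inspected before the test result is interpreted.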

Key principles

  1. Ask a question — What do you want to know?
  2. Make a prediction — What do you think will happen?
  3. Do an experiment — Collect data to test your prediction
  4. Analyse the data — Look for patterns or differences
  5. Draw a conclusion — What did you learn? Was your prediction right?

Why R? Why statistical programming?

Reproducibility crisis

Despite the scientific method, many published findings cannot be repeated by other researchers, for reasons that include weak experimental design and flawed statistical analysis.

From Nature (including image sources):

More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments.

Reproducibility crisis

Bar chart from Nature survey showing the percentage of researchers who failed to reproduce experiments

Bar chart from Nature survey showing factors that contribute to irreproducible research, with statistical analysis and experimental design among the top causes

Reproducibility and replicability

Key definitions

  • Reproducibility: the ability to re-run an analysis and obtain the same results
  • Replicability: the ability to obtain the same conclusions using a different dataset or study population

Scientific findings should be both reproducible and replicable – the tools that we use should facilitate this in the most efficient way possible.
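
Reproducibility in the sense defined above can be shown with a small script. The sketch below draws a random subsample of the built-in iris data twice and checks that both runs agree; the function name `analyse`, the subsample size of 30 and the seed value 1002 are arbitrary choices made for illustration.

```r
# a small "analysis" that involves randomness:
# the mean sepal length of 30 randomly chosen flowers
analyse <- function(seed) {
  set.seed(seed)                   # fix the random number generator
  rows <- sample(nrow(iris), 30)   # pick 30 rows at random
  mean(iris$Sepal.Length[rows])
}

run_1 <- analyse(seed = 1002)
run_2 <- analyse(seed = 1002)

# same code + same data + same seed gives exactly the same result
print(identical(run_1, run_2))  # TRUE
```

This only demonstrates reproducibility. Replicability would require collecting new data, for example measuring a different set of iris flowers, and checking whether the same conclusion holds.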

Reproducibility

How would you explain to someone how to reproduce this plot…

In Excel? Check the guide

In SPSS? Check the guide

In R

library(palmerpenguins)
boxplot(bill_length_mm ~ species, data = penguins)

An over-simplification

Those without programming knowledge will still struggle to understand and use the two lines of R code shown above.

Learn as MUCH as you can NOW

  • AI tools interface with code – you’re setting yourself up for success
  • Reproducible reporting is fundamental to ALL professions (think weekly reports, presentations, etc.)

Demonstration

Thanks!

This presentation is based on the SOLES Quarto reveal.js template and is licensed under a Creative Commons Attribution 4.0 International License.

References

  • Quinn GP, Keough MJ (2002) Experimental Design and Data Analysis for Biologists. Cambridge University Press, Cambridge. Sections 1.1-1.2, pages 1-7.
  • Underwood AJ (1997) Experiments in Ecology: Their Logical Design and Interpretation using Analysis of Variance. Cambridge University Press, Cambridge.