Describing relationships
In statistics, a relationship between two or more numerical variables means that when one variable changes, the other variable(s) also tend to change. We can describe these relationships in terms of direction, strength, and form.
We can describe the relationship between two variables qualitatively by visualising our data with scatter plots.
Scatter Plots
Direction
A relationship can be positive, negative, or non-existent.
- Positive relationship: when one variable increases, the other variable also increases, e.g. the amount of rainfall received and the resulting crop yield.
- Negative relationship: when one variable increases, the other variable decreases, e.g. the levels of insulin and glucose in blood.
- No relationship: when one variable changes, the other variable does not change, e.g. shoe size and IQ.
Strength
The strength of a relationship refers to how closely the two variables are related: weak, moderate, strong, very strong, or perfect. A weak relationship will show more scatter than a strong relationship.
Form
The form of a relationship refers to the shape of the relationship. The simplest form is a straight line or linear relationship. Some examples of non-linear forms include polynomials (e.g. quadratic, cubic), exponential, and logistic.
Correlation
Linear relationships
To measure the relationship quantitatively, we use correlation coefficients. These are numbers between -1 and 1, where -1 indicates a perfect negative relationship, 0 indicates no relationship, and 1 indicates a perfect positive relationship.
The most common correlation coefficient is the Pearson correlation coefficient (\(r\)). It measures linear relationships between two numerical variables.
Pearson’s Correlation (r) Formula:
\[ r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} \]
In essence, it is the covariance divided by the product of the standard deviations.
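As a sketch of this equivalence, the R code below (using made-up example vectors x and y) computes \(r\) directly from the formula, and again as the covariance divided by the product of the standard deviations, and checks both against the built-in result:

```r
# Made-up example data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

# Numerator: sum of products of deviations from the means
num <- sum((x - mean(x)) * (y - mean(y)))
# Denominator: square root of the product of the sums of squared deviations
den <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r_manual <- num / den

# Equivalently: covariance divided by the product of standard deviations
r_cov <- cov(x, y) / (sd(x) * sd(y))

# Both should match R's built-in cor()
all.equal(r_manual, cor(x, y))
all.equal(r_cov, cor(x, y))
```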
In R, we use the cor() function to calculate the correlation coefficient. By default, it calculates the Pearson correlation coefficient.
cor(x, y)
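For example, simulating two positively related variables (a hypothetical dataset, with the slope and noise chosen arbitrarily) and measuring their correlation:

```r
set.seed(42)               # for reproducibility
x <- rnorm(100)
y <- 2 * x + rnorm(100)    # y increases with x, plus random noise

cor(x, y)                  # Pearson by default; positive and close to 1
```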
Describing the correlation coefficient in terms of strength can be a little subjective. Below are some approximate ranges for the common terms.
- |0 - 0.1| : no relationship
- |0.1 - 0.3| : weak
- |0.3 - 0.6| : moderate
- |0.6 - 0.9| : strong
- |0.9 - 1| : very strong
Non-linear relationships
In the case of non-linear relationships, it is best not to use Pearson’s correlation coefficient. Instead, we can use Spearman’s rank correlation coefficient (\(r_{s}\)) or Kendall’s tau (\(\tau\)). Note that these relationships must still be monotonic.
- Monotonic: a relationship that is consistently increasing or decreasing
- Linear: a relationship that is increasing or decreasing at a constant rate
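To illustrate the distinction, the sketch below uses an exponential curve: y always increases with x (monotonic), but not at a constant rate (non-linear). Spearman's coefficient is exactly 1, while Pearson's \(r\) falls short of 1:

```r
# A monotonic but non-linear relationship
x <- 1:10
y <- exp(x)

cor(x, y)                       # Pearson: positive, but below 1
cor(x, y, method = "spearman")  # Spearman: exactly 1 (perfectly monotonic)
```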
To use either Spearman’s rank correlation coefficient or Kendall’s tau, we can use the cor() function with the method argument set to either "spearman" or "kendall".
cor(x, y, method = "spearman")
cor(x, y, method = "kendall")
Spearman’s rank correlation coefficient essentially ranks the data for both x and y (e.g. the smallest value is ranked 1, the second smallest 2, etc.), and then calculates the Pearson correlation coefficient on the ranks. It therefore works for any monotonic relationship.
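This "Pearson on the ranks" description can be checked directly with R's rank() function, here on made-up data with no ties (with ties, average ranks are used):

```r
# Made-up example data without ties
x <- c(10, 20, 35, 50, 90)
y <- c(1.2, 3.5, 3.0, 8.0, 20.0)

r_s_manual <- cor(rank(x), rank(y))            # Pearson on the ranks
r_s_builtin <- cor(x, y, method = "spearman")  # built-in Spearman

all.equal(r_s_manual, r_s_builtin)
```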
Kendall’s tau looks at all possible pairs of observations and determines whether they are concordant. A pair of points \((x_{i}, y_{i})\) is concordant with another pair \((x_{j}, y_{j})\) if \(x_{i} > x_{j}\) and \(y_{i} > y_{j}\), or \(x_{i} < x_{j}\) and \(y_{i} < y_{j}\) (and discordant if one coordinate increases while the other decreases). Kendall’s tau is then calculated with:
\[ \tau = \frac{\text{(number of concordant pairs)} - \text{(number of discordant pairs)}}{\text{(total number of pairs)}}\]
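A minimal sketch of this calculation, counting concordant and discordant pairs directly on made-up data. Note this is the simple (tau-a) version for data without ties; R's cor() applies a tie correction (tau-b), which agrees with tau-a when there are no ties:

```r
# Made-up example data without ties
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

n <- length(x)
concordant <- 0
discordant <- 0
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    # Positive sign: both coordinates move the same way (concordant)
    s <- sign(x[j] - x[i]) * sign(y[j] - y[i])
    if (s > 0) concordant <- concordant + 1
    if (s < 0) discordant <- discordant + 1
  }
}
tau_manual <- (concordant - discordant) / choose(n, 2)

all.equal(tau_manual, cor(x, y, method = "kendall"))
```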
Causation
Correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other to change. Some examples of spurious (i.e. coincidental) correlations can be found at Spurious Correlations.
Before conducting an experiment to collect data (or analysing existing data), it is important to have a hypothesis about the relationship between the variables. Is there reason to believe the two variables should have a relationship? If not, any ‘relationship’ found via scatter plots or correlations is unlikely to be meaningful.