An accessible walkthrough of fundamental properties of this popular, yet often misunderstood metric from a predictive modeling perspective
Photo by Josh Rakower on Unsplash
R² (R-squared), also known as the coefficient of determination, is widely used as a metric to evaluate the performance of regression models. It is commonly used to quantify goodness of fit in statistical modeling, and it is a default scoring metric for regression models both in popular statistical modeling and machine learning frameworks, from statsmodels to scikit-learn.
Despite its omnipresence, there is a surprising amount of confusion about what R² truly means, and it is not uncommon to encounter conflicting information (for example, concerning the upper or lower bounds of this metric, and its interpretation). At the root of this confusion is a “culture clash” between the explanatory and predictive modeling traditions. In fact, in predictive modeling (where evaluation is conducted out-of-sample and any modeling approach that increases performance is desirable) many properties of R² that do apply in the narrow context of explanation-oriented linear modeling no longer hold.
To help navigate this confusing landscape, this post provides an accessible narrative primer to some basic properties of R² from a predictive modeling perspective, highlighting and dispelling common confusions and misconceptions about this metric. With this, I hope to help the reader converge on a unified intuition of what R² truly captures as a measure of fit in predictive modeling and machine learning, and to highlight some of this metric’s strengths and limitations. Aiming for a broad audience which includes Stats 101 students and predictive modellers alike, I will keep the language simple and ground my arguments in concrete visualizations.
Ready? Let’s get started!
What is R²?
Let’s start from a working verbal definition of R². To keep things simple, let’s take the first high-level definition given by Wikipedia, which is a good reflection of definitions found in many pedagogical resources on statistics, including authoritative textbooks: “the proportion of the variation in the dependent variable that is predictable from the independent variable(s)”.
Anecdotally, this is also what the vast majority of students trained in using statistics for inferential purposes would probably say, if you asked them to define R². But, as we will see in a moment, this common way of defining R² is the source of many of the misconceptions and confusions related to R². Let’s dive deeper into it.
Calling R² a proportion implies that R² will be a number between 0 and 1, where 1 corresponds to a model that explains all the variation in the outcome variable, and 0 corresponds to a model that explains no variation in the outcome variable. Note: your model might also include no predictors (an intercept-only model is still a model), which is why I am focusing on variation predicted by a model rather than by independent variables.
Let’s verify if this intuition on the range of possible values is correct. To do so, let’s recall the mathematical definition of R²:
R² = 1 – (RSS / TSS)
Here, RSS is the residual sum of squares, which is defined as:
RSS = Σ(y – ŷ)²
This is simply the sum of squared errors of the model, that is, the sum of squared differences between the true values y and the corresponding model predictions ŷ.
On the other hand, TSS, the total sum of squares, is defined as follows:
TSS = Σ(y – ȳ)²
As you might notice, this term has a similar form to the residual sum of squares, but this time we are looking at the squared differences between the true values of the outcome variable y and the mean of the outcome variable ȳ. This is technically the variance of the outcome variable multiplied by the number of observations. But a more intuitive way to look at this in a predictive modeling context is the following: this term is the residual sum of squares of a model that always predicts the mean of the outcome variable. Hence, the ratio of RSS and TSS is a ratio between the sum of squared errors of your model, and the sum of squared errors of a “reference” model predicting the mean of the outcome variable.
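To make the formula concrete, here is a minimal sketch in Python with NumPy that computes R² directly from RSS and TSS as defined above. The helper name r_squared and the toy values are assumptions made up for illustration, not part of any library.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Compute R² = 1 - RSS / TSS from true values and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # RSS: sum of squared errors of the model being evaluated
    rss = np.sum((y_true - y_pred) ** 2)
    # TSS: sum of squared errors of a reference model that always predicts the mean
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - rss / tss

# Toy values, invented for illustration only
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.5, 8.5])

# Here RSS = 1 and TSS = 20, so R² is about 0.95
print(r_squared(y, y_hat))
```

Note that the function never looks at how the predictions were produced: it only compares the errors of the predictions to the errors of the mean-only reference model.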
With this in mind, let’s go on to analyse what the range of possible values for this metric is, and to verify our intuition that these should, indeed, range between 0 and 1.
What is the best possible R²?
As we have seen so far, R² is computed by subtracting the ratio of RSS and TSS from 1. Can this ever be higher than 1? Or, in other words, is it true that 1 is the largest possible value of R²? Let’s think this through by looking back at the formula.
The only scenario in which 1 minus something can be higher than 1 is if that something is a negative number. But here, RSS and TSS are both sums of squared values, that is, sums of non-negative values. As long as TSS is not zero (that is, as long as the outcome variable is not constant), the ratio of RSS and TSS can never be negative. The largest possible R² must therefore be 1.
Now that we have established that R² cannot be higher than 1, let’s try to visualize what needs to happen for our model to have the maximum possible R². For R² to be 1, RSS / TSS must be zero. This can happen if RSS = 0, that is, if the model predicts all data points perfectly.
Examples illustrating hypothetical models with R² ≈ 1 using simulated data. In all cases, the true underlying model is y = 2x + 3. The first two models fit the data perfectly, in the first case because the data has no noise and a linear model can perfectly recover the relation between x and y (left), and in the second because the model is very flexible and overfits the data (center). These are extreme cases which are hardly found in reality. In fact, the largest possible R² will often be defined by the amount of noise in the data. This is illustrated by the third plot, where due to the presence of random noise, even the true model can only achieve R² = 0.458.
In practice, this will never happen, unless you are wildly overfitting your data with an overly complex model, or you are computing R² on a ridiculously low number of data points that your model can fit perfectly. Virtually all datasets have some amount of noise that no model can account for. The largest achievable R² will therefore be determined by the amount of unexplainable noise in your outcome variable.
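The effect of noise on the best achievable R² can be sketched with a small simulation. The snippet below assumes the same true model as in the figure (y = 2x + 3), but the sample size, the range of x, the noise level, and the random seed are arbitrary choices of mine, so the numbers will not match the figure.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

def r_squared(y_true, y_pred):
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - rss / tss

# Simulated inputs; range and sample size are assumptions
x = rng.uniform(0, 1, size=1000)
y_clean = 2 * x + 3                                   # noise-free outcome
y_noisy = y_clean + rng.normal(0, 1.0, size=x.size)   # assumed noise sd of 1.0

# Predictions from the true data-generating model
true_pred = 2 * x + 3

r2_clean = r_squared(y_clean, true_pred)  # RSS is exactly 0, so R² is 1
r2_noisy = r_squared(y_noisy, true_pred)  # well below 1, despite using the true model

print(r2_clean, r2_noisy)
```

Increasing the noise standard deviation in this sketch lowers the second value further, even though the predictions always come from the true model.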
What is the worst possible R²?
So far so good. If the largest possible value of R² is 1, we can still think of R² as the proportion of variation in the outcome variable explained by the model. But let’s now move on to looking at the lowest possible value. If we buy into the definition of R² we presented above, then we must assume that the lowest possible R² is 0.
When is R² = 0? For R² to be zero, RSS/TSS must be equal to 1. This is the case if RSS = TSS, that is, if the sum of squared errors of our model is equal to the sum of squared errors of a model predicting the mean. If you are no better off than just predicting the mean, then your model is really not doing a terribly good job. There are many reasons why this can happen. One is an issue with your choice of model, for example if you are trying to model highly non-linear data with a linear model. Another is the data itself: if your outcome variable is very noisy, then a model predicting the mean might be the best you can do.
Two cases where the mean model might be the best possible (linear) model: a) the data is pure Gaussian noise (left); b) the data is highly non-linear, as it is generated using a periodic function (right).
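The pure-noise case can be sketched in a few lines. The snippet assumes an outcome that is Gaussian noise unrelated to x (sample size, range of x, and seed are arbitrary choices of mine) and compares a mean-only model with a straight line fitted by least squares on the same data.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only

def r_squared(y_true, y_pred):
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - rss / tss

x = rng.uniform(-3, 3, size=500)
y = rng.normal(0, 1, size=x.size)  # pure Gaussian noise, unrelated to x

# Mean-only model: RSS equals TSS by construction, so R² is 0
mean_pred = np.full_like(y, y.mean())
r2_mean = r_squared(y, mean_pred)

# Least-squares line fitted and scored on the same data
# (np.polyfit returns coefficients from highest degree to lowest)
slope, intercept = np.polyfit(x, y, deg=1)
r2_line = r_squared(y, slope * x + intercept)

print(r2_mean, r2_line)
```

With nothing to learn from x, the fitted line ends up with a slope close to zero, and its R² on this data is only marginally above that of the mean model.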
But is R² = 0 truly the lowest possible R²? Or, in other words, can R² ever be negative? Let’s look back at the formula. R² < 0 is only possible if RSS/TSS > 1, that is, if RSS > TSS. Can this ever be the case?
This is where things start getting interesting, as the answer to this question depends very much on contextual information that we have not yet specified, namely which type of models we are considering, and which data we are computing R² on. As we will see, whether our interpretation of R² as the…