Do not over-think about “outliers”, use a student-t distribution instead | by Daniel Manrique-Castano

A Student’s t-distribution is essentially a Gaussian distribution with heavier tails; conversely, the Gaussian distribution can be seen as a limiting case of the Student’s t-distribution. The Gaussian distribution is characterized by the mean (μ) and the standard deviation (σ). The Student’s t-distribution adds a third parameter, the degrees of freedom (df), which controls the “thickness” of the tails. The lower the degrees of freedom, the more probability is assigned to events far from the mean, which makes the distribution particularly useful for small sample sizes, such as those common in biomedicine, where normality assumptions may be questionable. As the degrees of freedom increase, the Student’s t-distribution approaches the Gaussian distribution. This can be visualized using density plots:

# Load necessary libraries
library(ggplot2)
# Set seed for reproducibility
set.seed(123)
# Define the distributions
x <- seq(-4, 4, length.out = 200)
y_gaussian <- dnorm(x)
y_t3 <- dt(x, df = 3)
y_t10 <- dt(x, df = 10)
y_t30 <- dt(x, df = 30)
# Create a data frame for plotting
df <- data.frame(x, y_gaussian, y_t3, y_t10, y_t30)
# Plot the distributions
ggplot(df, aes(x)) +
  geom_line(aes(y = y_gaussian, color = "Gaussian")) +
  geom_line(aes(y = y_t3, color = "t, df=3")) +
  geom_line(aes(y = y_t10, color = "t, df=10")) +
  geom_line(aes(y = y_t30, color = "t, df=30")) +
  labs(title = "Comparison of Gaussian and Student t-Distributions",
       x = "Value",
       y = "Density") +
  scale_color_manual(values = c("Gaussian" = "blue", "t, df=3" = "red", "t, df=10" = "green", "t, df=30" = "purple")) +
  theme_classic()

Figure 1: Comparison of Gaussian and Student t-Distributions with different degrees of freedom.

In Figure 1, you can observe that as the degrees of freedom decrease, the peak around the mean becomes lower because probability mass shifts towards the thicker tails. Because extreme values are less surprising under such a distribution, models based on the Student’s t-distribution are less sensitive to outliers. For more detailed information on this topic, refer to this blog.
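To put rough numbers on the heavier tails, the following minimal sketch compares the two-sided probability of landing more than three units away from the mean under a standard Gaussian and under Student’s t-distributions with the same degrees of freedom used in Figure 1. The cutoff of 3 is an arbitrary choice made only for illustration.

```r
# Two-sided tail probability P(|X| > cutoff) under each distribution.
# The cutoff is an illustrative assumption, not a value from the article.
cutoff <- 3
dfs <- c(3, 10, 30)

tail_normal <- 2 * pnorm(-cutoff)        # standard Gaussian
tail_t <- 2 * pt(-cutoff, df = dfs)      # Student's t, one value per df

tail_table <- data.frame(
  distribution = c("Gaussian", paste0("t, df=", dfs)),
  tail_prob = c(tail_normal, tail_t)
)
print(tail_table)
```

The tail probability is largest for df = 3 and shrinks towards the Gaussian value as the degrees of freedom grow, which matches the ordering of the curves in the density plot.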

We start by loading the necessary libraries:

library(ggplot2)
library(brms)
library(ggdist)
library(easystats)
library(dplyr)
library(tibble)
library(ghibli)

Now, let’s move on from data simulations to real data analysis. We will be working with actual data obtained from mice undergoing the rotarod test.

First, we load the dataset into our environment and set the corresponding factor levels. The dataset includes IDs for the animals, a grouping variable (Genotype), an indicator for the different days of the test (Day), and several trials for each day. For this article, we focus on modeling only one of the trials (Trial3), leaving the other trials for a future analysis on variation modeling.

Our modeling approach uses Genotype and Day as categorical predictors of the distribution of Trial3.
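As a rough illustration of what categorical predictors mean for a model, the sketch below builds a hypothetical design grid with the same factor levels as the dataset and shows how R dummy-codes it. The additive formula `~ Genotype + Day` is an assumption chosen for simplicity; it is not necessarily the formula used later in the analysis.

```r
# Hypothetical design grid: one row per Genotype x Day combination.
# This is NOT the rotarod data, only its factor structure.
design <- expand.grid(
  Genotype = factor(c("WT", "KO"), levels = c("WT", "KO")),
  Day = factor(c("1", "2"), levels = c("1", "2"))
)

# Treatment (dummy) coding: WT on Day 1 is the reference group,
# and the remaining columns flag departures from it.
X <- model.matrix(~ Genotype + Day, data = design)
print(cbind(design, X))
```

Each coefficient in such a model is then a difference relative to the reference group (WT, Day 1), which is why the order of the factor levels matters.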

In the field of biomedical science, categorical predictors or grouping factors are more prevalent than continuous predictors. Researchers in this domain tend to categorize their samples into groups or conditions and apply diverse treatments.

data <- read.csv("Data/Rotarod.csv")
data$Day <- factor(data$Day, levels = c("1", "2"))
data$Genotype <- factor(data$Genotype, levels = c("WT", "KO"))
head(data)

Let’s get an initial overview of the data using Raincloud plots, as demonstrated by Guilherme A. Franchi, PhD in this informative blog post.

edv <- ggplot(data, aes(x = Day, y = Trial3, fill = Genotype)) +
  scale_fill_ghibli_d("SpiritedMedium", direction = -1) +
  geom_boxplot(width = 0.1, outlier.color = "red") +
  xlab('Day') +
  ylab('Time (s)') +
  ggtitle("Rotarod performance") +
  theme_classic(base_size = 18, base_family = "serif") +
  theme(text = element_text(size = 18),
        axis.text.x = element_text(angle = 0, hjust = .1, vjust = 0.5, color = "black"),
        axis.text.y = element_text(color = "black"),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = "bottom") +
  scale_y_continuous(breaks = seq(0, 100, by = 20), limits = c(0, 100)) +
  # Line below adds dot plots from {ggdist} package
  stat_dots(side = "left", justification = 1.12, binwidth = 1.9) +
  # Line below adds half-violin from {ggdist} package
  stat_halfeye(adjust = .5, width = .6, justification = -.2, .width = 0, point_colour = NA)
edv

Figure 2: Exploratory data visualization.

Figure 2 presents a different view compared to the original by Guilherme A. Franchi, PhD as we are plotting two factors instead of one. However, the essence of the plot remains the same. Pay attention to the red dots, which represent extreme observations that can skew the measures of central tendency (especially the mean) towards one direction. Additionally, we notice variations in variances, indicating that modeling sigma can enhance estimation accuracy. Our next step is to model the output using the brms package.
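To see why extreme observations matter, here is a minimal sketch with made-up values (they are not taken from the rotarod dataset). It compares the sample mean, the median, and a location estimate obtained by maximum likelihood under a Student’s t-distribution. Fixing the degrees of freedom at 3 and using base R’s `optim()` are assumptions made for illustration; this is not the brms model itself.

```r
# Hypothetical latencies to fall (s) with one extreme observation.
times <- c(18, 20, 21, 22, 23, 24, 25, 95)

# Negative log-likelihood of a location-scale Student's t with fixed df (nu).
# The scale is optimized on the log scale to keep it positive.
neg_loglik <- function(par, y, nu = 3) {
  mu <- par[1]
  sigma <- exp(par[2])
  -sum(dt((y - mu) / sigma, df = nu, log = TRUE) - log(sigma))
}

# Start from robust guesses: the median and the median absolute deviation.
fit <- optim(par = c(median(times), log(mad(times))), fn = neg_loglik, y = times)
t_location <- fit$par[1]

estimates <- c(mean = mean(times), median = median(times), t_location = t_location)
print(round(estimates, 2))
```

The single extreme value pulls the mean well above the bulk of the observations, while the t-based location stays close to the median. This is the behavior we rely on when using a Student’s t likelihood instead of removing “outliers” by hand.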
