Through our everyday experience, we have an intuitive understanding of what a typical body height is for people in the population. In much of the world, adult humans are typically between 1.60 m and 1.80 m tall, while people taller than two meters are rarely encountered. This intuition can be backed up with numerical evidence by looking at a frequency distribution of body height.
These figures are based on a dataset collected by the U.S. Centers for Disease Control and Prevention (CDC) that lists, among other attributes, the height of more than 340,000 individuals [CDC 2018]. An inspection of this frequency table shows that, in fact, more than half of the people interviewed for the survey reported their height to be between 1.60 m and 1.80 m.
Even though the sample is of limited size, we are confident that our investigations allow us to draw conclusions about the population as a whole. For example, based on data alone, we can conclude with some confidence that a human being cannot grow to a height of three meters.
One important goal of stochastics is to justify such conclusions in a mathematically rigorous way. The field can be divided into two subfields:
– Probability theory deals with the mathematical definition and investigation of the concept of probability. Central objects of such an investigation are random variables: variables whose values are not specified or known precisely but are subject to uncertainty. In other words, one can only state the probability that a random variable takes values within a certain range.
– Inferential statistics is based on the assumption that statistical observations and measures, such as frequencies, means, etc., are values or realizations of random variables. Conversely, the field investigates the extent to which characteristics of random variables can be estimated from sampled data. In particular, under certain simplifying assumptions, it is possible to quantify the accuracy or error of such an estimate.
Let us examine a straightforward example of statistical inference: determining whether a coin is fair or biased by observing a sequence of coin tosses. We can assume that the outcome of tossing the coin is determined by a discrete random variable X_1 that takes on the value zero (representing tails) or one (representing heads). If we flip the same coin again, we can assume that the outcome is described by a second random variable X_2, which is independent of the first but follows the same distribution.
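To make this modeling concrete, here is a minimal sketch in Python; the function name toss and the parameter p_heads are our own illustrative choices and are not part of the original text:

```python
import random

def toss(p_heads=0.5):
    """One coin toss modeled as a Bernoulli random variable: 1 = heads, 0 = tails."""
    return 1 if random.random() < p_heads else 0

# Two tosses of the same coin correspond to two independent random variables
# X_1 and X_2 that follow the same (Bernoulli) distribution.
x1 = toss()
x2 = toss()
print(x1, x2)
```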
If we lack any evidence to support the hypothesis that the coin is biased, we may assume that the coin is fair. In other words, we expect that heads will appear with the same probability as tails. Under this assumption, known as the null hypothesis, if we were to repeat the experiment multiple times, we would expect heads to turn up about as often as tails.
Conversely, the data allow us to draw conclusions about the underlying true distribution. For example, if we were to observe very different frequencies for heads and tails, such as a 70% frequency for heads compared to 30% for tails, then — if the sample size is sufficiently large — we would be convinced that we need to correct our original assumption of equal probability. In other words, we may need to abandon our assumption that the coin is fair.
In the example above, the frequency of heads appearing in the data acts as an estimator of the probability of the random event “the coin shows heads.” Common sense suggests that our confidence in such estimates increases with the size of the sample. For instance, if the imbalance described earlier were found in only ten coin tosses (seven heads and three tails), we might not yet be convinced that we have a biased coin. It is still possible that the null hypothesis of a fair coin holds true. In everyday terms, the outcome of the experiment could also be attributed to “pure chance.” However, if we observed seventy instances of heads out of one hundred coin tosses, it would be much stronger evidence in favor of the alternative hypothesis that the coin is biased!
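The difference between the two scenarios can be quantified by asking how likely such an extreme count of heads would be if the null hypothesis of a fair coin were true. The following sketch uses only Python's standard library; the helper tail_probability is our own illustrative construction:

```python
from math import comb

def tail_probability(n, k, p=0.5):
    """Probability of observing at least k heads in n tosses of a coin with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 7 heads out of 10 tosses: not unusual for a fair coin.
print(f"P(>= 7 heads in  10 tosses) = {tail_probability(10, 7):.3f}")    # roughly 0.17

# 70 heads out of 100 tosses: vanishingly unlikely for a fair coin.
print(f"P(>= 70 heads in 100 tosses) = {tail_probability(100, 70):.1e}")  # on the order of 1e-05
```

Under the fair-coin assumption, seven or more heads in ten tosses occurs with a probability of roughly 17%, whereas seventy or more heads in one hundred tosses has a probability on the order of 10⁻⁵, which is why the larger sample is so much more convincing.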
Point estimates are among the most fundamental tools in the toolkit of statisticians and data scientists. For instance, the arithmetic mean, derived from a sizable sample of a population, provides insight into the typical value that a given variable might assume. In machine learning, we estimate model parameters from training data, which should contain an adequate number of labeled examples.
Through experience and intuition, we have come to believe that larger samples and larger amounts of training data allow for more accurate statistical procedures and better predictive models. Inferential statistics puts this intuition on a rigorous footing in the form of the law of large numbers. Furthermore, by calculating confidence intervals rather than relying solely on point estimates, we gain a deeper understanding of what constitutes a “sufficiently large sample.” Confidence intervals provide ranges of values within which we can reasonably assert that the true parameter we seek to estimate resides.
In the following sections, we will present the mathematical framework for computing confidence intervals in a self-contained manner, at the core of which lies the central limit theorem.
Chebyshev’s law of large numbers
Just as we expect the relative frequency to be a good estimator for the probability of an event or outcome of a binary variable, we expect the arithmetic mean to be a good estimator for the expected value of the random variable that produces the numeric data we observe.
It is important to note that this estimate itself is again a random variable. If we roll a die 50 times and record the average of the numbers rolled, and then repeat the experiment, we will likely get a slightly different value. If we repeat the experiment many times, the arithmetic means we record will follow some distribution. For large samples, however, we expect them to show only a small dispersion and to be centered around the true expected value. This is the key message of Chebyshev’s law of large numbers, which we will detail below.
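A quick simulation sketch makes this dispersion visible; it uses only Python's standard library, and the experiment parameters (50 rolls, 10,000 repetitions) are our own illustrative choices:

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

def mean_of_die_rolls(n_rolls=50):
    """Average of n_rolls throws of a fair six-sided die."""
    return statistics.mean(random.randint(1, 6) for _ in range(n_rolls))

# Repeat the 50-roll experiment many times and inspect the resulting means.
means = [mean_of_die_rolls() for _ in range(10_000)]

print(f"mean of the recorded averages     : {statistics.mean(means):.3f}")   # close to E[X] = 3.5
print(f"std. dev. of the recorded averages: {statistics.stdev(means):.3f}")  # much smaller than for a single roll (about 1.71)
```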
Before doing so, we introduce an important tool in probability theory: Chebyshev’s inequality. Suppose that we are given some random variable X with finite mean μ and variance σ². Then, for any ε > 0, the following holds, where Pr( · ) means “probability of”:

Pr(|X − μ| ≥ ε) ≤ σ² / ε²
This result aligns with our intuitive understanding of a measure of dispersion: the smaller the variance, the more likely it is that the random variable will take on values that are close to the mean.
For example, the probability of finding an observed value of the random variable within six standard deviations of its expected value is very high, at least 97%. In other words, the probability that a random variable takes on a value that deviates from the mean by more than six standard deviations is very low, less than 3%. This result holds for distributions of any shape as long as the expected value and variance are finite values.
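Chebyshev’s inequality can be checked empirically on a distribution that is far from normal. The following sketch is our own illustration and uses an exponential distribution, which is strongly skewed but has finite mean and variance:

```python
import random
import statistics

random.seed(42)  # fixed seed for reproducibility of this sketch

# Draw a large sample from an exponential distribution with rate 1 (mean 1, variance 1).
sample = [random.expovariate(1.0) for _ in range(500_000)]

mu = statistics.mean(sample)       # close to 1.0
sigma = statistics.pstdev(sample)  # close to 1.0

k = 6  # deviations of at least six standard deviations
empirical = sum(abs(x - mu) >= k * sigma for x in sample) / len(sample)

print(f"empirical P(|X - mu| >= 6 sigma): {empirical:.5f}")  # well below the bound
print(f"Chebyshev bound 1/k^2           : {1 / k**2:.5f}")   # 1/36, about 0.028
```

The bound of 1/36 ≈ 2.8% is respected, and, as is typical for Chebyshev’s inequality, it is quite conservative: the actual tail probability in this example is much smaller.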
Now suppose that we observe numeric values in a sample that are the realizations of random variables X_1, …, X_N. We assume that these random variables are mutually independent and follow the same distribution, a property commonly known as independent and identically distributed, or i.i.d. for short. This assumption is reasonable when the observations are the result of independently set up and identically prepared trials or when they represent a random selection from a population. However, it is important to note that this assumption may not always be justified.
In addition, we assume that the expected value μ and variance σ² of every random variable exist and are finite. Since the variables follow the same distribution, these values are the same for each of the variables. Next, we consider the following random variable, which produces the arithmetic mean:

X̄ = (X_1 + X_2 + ⋯ + X_N) / N
First, we show that the arithmetic mean X̄ is an unbiased estimator: its values distribute around the true mean μ. This result follows directly from the linearity of the expected value.
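Spelled out, the argument is a one-line computation; this short sketch uses only the i.i.d. assumption introduced above:

```latex
\mathbb{E}\left[\bar{X}\right]
  = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N} X_i\right]
  = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[X_i]
  = \frac{1}{N}\cdot N\mu
  = \mu
```

Since the expected value of X̄ equals μ, the estimator is unbiased.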