A Primer on Statistical Inference

October 31, 2023
in Data Science & ML



The law of large numbers and sound statistical reasoning are the foundation for effective statistical inference in data science. The following text draws significantly from my book, “Data Science — An Introduction to Statistics and Machine Learning” [Plaue 2023], recently published by Springer Nature.

Through our everyday experience, we have an intuitive understanding of what a typical body height is for people in the population. In much of the world, adult humans are typically between 1.60 m and 1.80 m tall, while people taller than two meters are rarely encountered. This intuition can be backed up with numerical evidence by examining a frequency distribution of body height:

These figures are based on a dataset collected by the U.S. Centers for Disease Control and Prevention (CDC) that lists, among other attributes, the height of more than 340,000 individuals [CDC 2018]. An inspection of this frequency table shows that, in fact, more than half of the people interviewed for the survey reported their height to be between 1.60 m and 1.80 m.

Even though the sample is of limited size, we are confident that our investigations allow us to draw conclusions about the population as a whole. For example, based on data alone, we can conclude with some confidence that a human being cannot grow to a height of three meters.

One important goal of stochastics is to justify such conclusions with mathematical rigor. The field can be divided into two subfields:

– Probability theory deals with the mathematical definition and investigation of the concept of probability. Central objects of this investigation are random variables: variables whose values are not specified or known precisely but are subject to uncertainty. In other words, we can only state the probability that a random variable takes values within a certain range.

– Inferential statistics is based on the assumption that statistical observations and measures, such as frequencies, means, etc., are values or realizations of random variables. Conversely, the field investigates the extent to which characteristics of random variables can be estimated from sampled data. In particular, under certain simplifying assumptions, it is possible to quantify the accuracy or error of such an estimate.

Let us examine a straightforward example of statistical inference: determining whether a coin is fair or biased by observing a sequence of coin tosses. We can assume that the outcome of tossing the coin is determined by a discrete random variable X_1 that takes on the values of zero (representing tails) or one (representing heads). If we were to flip the same coin again, we can assume that the outcome can be described by a second random variable X_2, which is independent of the first but follows the same distribution.

If we lack any evidence to support the hypothesis that the coin is biased, we may assume that the coin is fair. In other words, we expect that heads will appear with the same probability as tails. Under this assumption, known as the null hypothesis, if we were to repeat the experiment multiple times, we would expect heads to turn up about as often as tails.

Conversely, the data allow us to draw conclusions about the underlying true distribution. For example, if we were to observe very different frequencies for heads and tails, such as a 70% frequency for heads compared to 30% for tails, then — if the sample size is sufficiently large — we would be convinced that we need to correct our original assumption of equal probability. In other words, we may need to abandon our assumption that the coin is fair.

In the example above, the frequency of heads appearing in the data acts as an estimator of the probability of the random event “the coin shows heads.” Common sense suggests that our confidence in such estimates increases with the size of the sample. For instance, if the imbalance described earlier were found in only ten coin tosses (seven heads and three tails), we might not yet be convinced that we have a biased coin. It is still possible that the null hypothesis of a fair coin holds true. In everyday terms, the outcome of the experiment could also be attributed to “pure chance.” However, if we observed seventy instances of heads out of one hundred coin tosses, it would be much stronger evidence in favor of the alternative hypothesis that the coin is biased!
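The reasoning above can be checked with a quick simulation. The sketch below (assuming nothing beyond Python's standard library; the function name `prob_at_least` is our own) estimates how often a fair coin would produce at least 7 heads in 10 tosses, versus at least 70 heads in 100 tosses, purely by chance:

```python
import random

def prob_at_least(heads_needed, n_tosses, n_trials=20_000, p=0.5, seed=0):
    """Estimate Pr(#heads >= heads_needed) under the null hypothesis
    of a fair coin (p = 0.5), by repeating the experiment n_trials times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        heads = sum(rng.random() < p for _ in range(n_tosses))
        if heads >= heads_needed:
            hits += 1
    return hits / n_trials

# 7 of 10 heads is unremarkable under the null hypothesis (~17% of the time),
# while 70 of 100 heads is vanishingly rare for a fair coin.
p_small = prob_at_least(7, 10)
p_large = prob_at_least(70, 100)
print(f"Pr(>=7 heads in 10):   {p_small:.3f}")
print(f"Pr(>=70 heads in 100): {p_large:.5f}")
```

The exact binomial tail probabilities are about 0.172 and 0.00004 respectively, which is precisely why the same 70/30 imbalance is weak evidence in a sample of ten but strong evidence in a sample of one hundred.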

Point estimates are among the most fundamental tools in the toolkit of statisticians and data scientists. For instance, the arithmetic mean, derived from a sizable sample of a population, provides an insight into the typical value that a given variable might assume. In machine learning, we estimate model parameters from training data, which should cover an adequate number of labeled examples.

Through experience and intuition, we have come to believe that larger samples and larger amounts of training data allow for more accurate statistical procedures and better predictive models. Inferential statistics offer a more robust foundation for supporting this intuition, often referred to as the law of large numbers. Furthermore, we gain a deeper understanding of what constitutes a “sufficiently large sample” by calculating confidence intervals, as opposed to relying solely on point estimates. Confidence intervals provide us with ranges of values within which we can reasonably assert that the true parameter we seek to estimate resides.
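As an illustration of the difference between a point estimate and a confidence interval, here is a minimal sketch using the large-sample normal approximation (z ≈ 1.96 for 95% coverage). The data are synthetic heights drawn for the example, not the CDC dataset, and the helper name `mean_confidence_interval` is our own:

```python
import math
import random

def mean_confidence_interval(data, z=1.96):
    """Approximate 95% confidence interval for the mean,
    using the large-sample normal approximation."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)  # sample variance
    half = z * math.sqrt(var / n)                        # half-width of the interval
    return mean - half, mean + half

rng = random.Random(1)
# Synthetic body heights in metres (mean 1.70 m, s.d. 0.10 m) for illustration only.
sample = [rng.gauss(1.70, 0.10) for _ in range(400)]
lo, hi = mean_confidence_interval(sample)
print(f"95% CI for the mean height: [{lo:.3f} m, {hi:.3f} m]")
```

Note how the half-width shrinks like 1/√N: quadrupling the sample size halves the interval, which is one concrete way to decide what a "sufficiently large sample" is for a desired precision.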

In the following sections, we will present the mathematical framework for computing confidence intervals in a self-contained manner, at the core of which lies the central limit theorem.

Chebyshev’s law of large numbers

Just as we expect the relative frequency to be a good estimator for the probability of an event or outcome of a binary variable, we expect the arithmetic mean to be a good estimator for the expected value of the random variable that produces the numeric data we observe.

It is important to note that this estimate itself is again a random variable. If we roll a die 50 times and record the average number, and then we repeat the experiment, we will likely get slightly different values. If we repeat the experiment many times, the arithmetic means we recorded will follow some distribution. For large samples, however, we expect them to show only a small dispersion and to be centered around the true expected value. This is the key message of Chebyshev’s law of large numbers, which we will detail below.
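The die-rolling thought experiment can be carried out directly. The sketch below (standard library only; the helper `mean_of_rolls` is ours) repeats the "roll a die n times and average" experiment 200 times for several n and reports the spread of the resulting means:

```python
import random
import statistics

def mean_of_rolls(n_rolls, rng):
    """Average of n_rolls fair six-sided die rolls."""
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

rng = random.Random(42)
spread = {}
for n in (50, 500, 5000):
    # Repeat the whole experiment 200 times and look at how the means scatter.
    means = [mean_of_rolls(n, rng) for _ in range(200)]
    spread[n] = statistics.stdev(means)
    print(f"n={n}: mean of means={statistics.mean(means):.3f}, spread={spread[n]:.4f}")
```

The means cluster around the true expected value of 3.5, and their dispersion shrinks as n grows (roughly like 1/√n), which is exactly the behavior the law of large numbers describes.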

Before doing so, we introduce an important tool in probability theory: Chebyshev's inequality. Suppose that we are given some random variable X with finite mean μ and variance σ². Then, for any ε > 0, the following holds, where Pr( · ) means "probability of":

    Pr(|X − μ| ≥ ε) ≤ σ² / ε²

This result aligns with our intuitive understanding of a measure of dispersion: the smaller the variance, the more likely it is that the random variable will take on values that are close to the mean.

For example, the probability of finding an observed value of the random variable within six standard deviations of its expected value is very high, at least 97%. In other words, the probability that a random variable takes on a value that deviates from the mean by more than six standard deviations is very low, less than 3%. This result holds for distributions of any shape as long as the expected value and variance are finite values.
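Because Chebyshev's inequality makes no assumption about the shape of the distribution, it can be checked empirically on a deliberately non-normal example. The sketch below draws from an exponential distribution (heavily skewed, mean 1, standard deviation 1) and compares the observed tail fractions with the bound 1/k² at k standard deviations:

```python
import random
import statistics

# A heavily skewed distribution, far from normal, to stress the bound.
rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(100_000)]
mu = statistics.mean(data)
sigma = statistics.stdev(data)

for k in (2, 3, 6):
    # Fraction of observations deviating from the mean by more than k sigma.
    frac = sum(abs(x - mu) > k * sigma for x in data) / len(data)
    print(f"k={k}: observed tail {frac:.5f} <= Chebyshev bound {1 / k**2:.5f}")
```

The observed tails sit well below the bound; Chebyshev is loose for any particular distribution, but its strength is that it holds for all of them.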

Now suppose that we observe numeric values in a sample that are the realizations of random variables X_1, …, X_N. We assume that these random variables are mutually independent and follow the same distribution, a property commonly known as independent and identically distributed, or i.i.d. for short. This assumption is reasonable when the observations are the result of independently set up and identically prepared trials or when they represent a random selection from a population. However, it is important to note that this assumption may not always be justified.

In addition, we assume that the expected value μ and variance σ² of every random variable exist and are finite. Since the variables follow the same distribution, these values are the same for each variable. Next, we consider the following random variable, which produces the arithmetic mean:

    X̄ = (1/N)(X_1 + X_2 + … + X_N)

First, we show that the arithmetic mean estimator X̄ is an unbiased estimator: its values are distributed around the true mean μ. This is a result that follows directly from…


