Friday, May 16, 2025
News PouroverAI

The Death of the Static AI Benchmark | by Sandi Besen | Mar, 2024

March 22, 2024
in AI Technology
Reading Time: 5 mins read


Benchmarking as a Measure of Success

Towards Data Science

Benchmarks are often hailed as a hallmark of success. They are a celebrated way of measuring progress, whether that means running a sub-4-minute mile or excelling on standardized exams. In Artificial Intelligence (AI), benchmarks are the most common method of evaluating a model’s capability. Industry leaders such as OpenAI, Anthropic, Meta, and Google compete to one-up each other with superior benchmark scores. However, recent research studies and industry grumblings are casting doubt on whether common benchmarks truly capture the essence of a model’s ability.

Source: Dalle 3

Emerging research suggests that the training sets of some models have been contaminated with the very data they are being assessed on, raising doubts about whether their benchmark scores reflect true understanding. Actors in films can portray doctors or scientists and deliver their lines without truly grasping the underlying concepts. When Cillian Murphy played the physicist J. Robert Oppenheimer in the movie Oppenheimer, he likely did not understand the complex physics theories he spoke of. Benchmarks are meant to evaluate a model’s capabilities, but are they truly doing so if the model, like an actor, has memorized them?

Researchers at the University of Arizona found that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets, undermining the credibility of their associated benchmarks [1]. Further, researchers from the University of Science and Technology of China found that when they applied their “probing” technique to the popular MMLU benchmark [2], results decreased dramatically.

Their probing techniques were a series of methods meant to challenge a model’s understanding of a question by posing it in different ways, with different answer options but the same correct answer. The techniques included paraphrasing questions, paraphrasing choices, permuting choices, adding extra context to questions, and adding a new choice to the benchmark questions.
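Two of these techniques, permuting choices and adding a new choice, are mechanical enough to sketch in a few lines of Python. The snippet below is a minimal illustration and not the researchers' implementation: the `MCQuestion` class, the helper names, and the sample question are all hypothetical and were invented for this example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQuestion:
    """A multiple-choice item; `answer` is the index of the correct choice."""
    question: str
    choices: tuple[str, ...]
    answer: int


def permute_choices(item: MCQuestion, rng: random.Random) -> MCQuestion:
    """Shuffle the answer options while keeping track of the correct one."""
    order = list(range(len(item.choices)))
    rng.shuffle(order)
    new_choices = tuple(item.choices[i] for i in order)
    # order[j] is the original index now sitting at position j,
    # so the correct answer moved to the position holding item.answer.
    return MCQuestion(item.question, new_choices, order.index(item.answer))


def add_choice(item: MCQuestion, distractor: str) -> MCQuestion:
    """Append one extra incorrect option; the correct answer is unchanged."""
    return MCQuestion(item.question, item.choices + (distractor,), item.answer)


# Hypothetical sample item, not taken from MMLU.
item = MCQuestion(
    question="Which planet is closest to the Sun?",
    choices=("Venus", "Mercury", "Earth", "Mars"),
    answer=1,
)
probed = add_choice(permute_choices(item, random.Random(0)), "None of the above")
print(probed.choices[probed.answer])  # Mercury
```

The point of such a transformation is that the correct answer stays the same while its position and surroundings change, so a model that has only memorized "the answer is B" has nothing to fall back on.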

From the graph below, one can gather that although each tested model performed well on the unaltered “vanilla” MMLU benchmark, none performed as strongly when probing techniques were applied to different sections of the benchmark (LU, PS, DK, All).

“Vanilla” represents performance on the unaltered MMLU benchmark. The other keys represent performance on the altered sections of the MMLU benchmark: Language Understanding (LU), Problem Solving (PS), Domain Knowledge (DK), and All.
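To make the comparison concrete, the gap between vanilla and probed performance can be summarized with two simple numbers: the drop in accuracy, and the share of originally correct answers that survive probing. The sketch below is only an illustration; the function names are hypothetical and the per-question results are invented toy values, not figures from the study.

```python
def accuracy(results: list[bool]) -> float:
    """Fraction of questions answered correctly."""
    return sum(results) / len(results)


def accuracy_drop(vanilla: list[bool], probed: list[bool]) -> float:
    """How much accuracy falls when the same questions are probed."""
    return accuracy(vanilla) - accuracy(probed)


def consistency(vanilla: list[bool], probed: list[bool]) -> float:
    """Of the questions answered correctly on the vanilla benchmark,
    the fraction still answered correctly after probing."""
    kept = [p for v, p in zip(vanilla, probed) if v]
    return sum(kept) / len(kept) if kept else 0.0


# Invented toy results for four questions (True = answered correctly).
vanilla = [True, True, True, False]
probed = [True, False, True, False]

print(accuracy_drop(vanilla, probed))          # 0.25
print(round(consistency(vanilla, probed), 2))  # 0.67
```

A large drop or a low consistency score would suggest that the vanilla score overstates what the model actually understands.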

This evolving situation prompts a re-evaluation of how AI models are assessed. The need for benchmarks that both reliably demonstrate capabilities and anticipate the issues of data contamination and memorization is becoming apparent.

As models continue to evolve and are updated, potentially with benchmark data in their training sets, benchmarks will have an inherently short lifespan. Additionally, model context windows are increasing rapidly, allowing a larger amount of context to be included in the model’s response. The larger the context window, the greater the potential for contaminated data to indirectly skew the model’s learning process, biasing it towards the test examples it has seen.

To address these challenges, innovative approaches such as dynamic benchmarks are emerging. They employ tactics such as altering questions, complicating questions, introducing noise into the question, paraphrasing the question, reversing the polarity of the question, and more [3].

The figure below illustrates several methods for altering benchmark questions (either manually or generated by a language model).

Source: Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
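One of the tactics listed above, introducing noise, can be sketched without a language model at all. The example below is a minimal sketch under stated assumptions, not the framework from [3]: it assumes the question's context is already split into sentences, and the `inject_noise` helper, the context, and the distractor sentences are all hypothetical.

```python
import random


def inject_noise(
    sentences: list[str], distractors: list[str], rng: random.Random
) -> list[str]:
    """Insert one irrelevant sentence at a random position in the context.

    The original sentences keep their order, so the expected answer
    to the question should not change.
    """
    noisy = list(sentences)
    position = rng.randint(0, len(noisy))  # inclusive; len(noisy) appends
    noisy.insert(position, rng.choice(distractors))
    return noisy


# Hypothetical context and distractors, invented for illustration.
context = [
    "Maria bought three apples.",
    "She gave one apple to her brother.",
]
distractors = [
    "The market opens at nine in the morning.",
    "Her brother plays football on weekends.",
]

noisy = inject_noise(context, distractors, random.Random(42))
print(" ".join(noisy))
```

Regenerating the noisy version with a fresh seed for each evaluation run gives a benchmark whose surface form keeps changing, which makes it harder for a model to succeed by recalling a memorized test item.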

As we move forward, the imperative to align evaluation methods more closely with real-world applications becomes clear. Establishing benchmarks that accurately reflect practical tasks and challenges will not only provide a truer measure of AI capabilities but also guide the development of Small Language Models (SLMs) and AI Agents. These specialized models and agents require benchmarks that genuinely capture their potential to perform practical and helpful tasks.


