Large Language Model (LLM) Training Data Is Running Out. How Close Are We To The Limit?

May 14, 2024
in AI Technology

In the rapidly evolving fields of Artificial Intelligence and Data Science, the amount and availability of training data play a crucial role in determining the capabilities and potential of Large Language Models (LLMs). These models rely on large volumes of textual data to enhance their language understanding skills.

A recent tweet by Mark Cummins raises concerns about the possibility of depleting the global reservoir of text data needed to train these models, given the exponential growth in data consumption and the demanding requirements of next-generation LLMs. To address this issue, we examine various textual sources currently accessible across different media and compare them to the increasing demands of advanced AI models.
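Since every figure that follows is measured in tokens, it helps to see what a token actually is. Below is a minimal illustration using the open-source tiktoken library (one example tokenizer; any modern BPE tokenizer gives figures in the same ballpark). In English text, a token averages roughly three-quarters of a word.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the byte-pair encoding used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Large Language Model training data is running out."
tokens = enc.encode(text)
print(f"{len(text.split())} words -> {len(tokens)} tokens")
```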

Web Data: The English text segment of the FineWeb dataset, a subset of the Common Crawl web data, contains a remarkable 15 trillion tokens. When high-quality non-English web content is included, this corpus can double in size.
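FineWeb is publicly hosted on the Hugging Face Hub, so the corpus can be inspected directly. A minimal sketch follows, assuming the HuggingFaceFW/fineweb dataset ID and its text field as documented on the dataset card; streaming avoids downloading the full multi-terabyte dump:

```python
# pip install datasets
from datasets import load_dataset

# Stream records instead of downloading the entire corpus to disk.
fw = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for i, record in enumerate(fw):
    print(record["text"][:120].replace("\n", " "), "...")
    if i == 2:  # peek at the first three documents only
        break
```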

Code Repositories: Publicly available code repositories, like those compiled in the Stack v2 dataset, contribute approximately 0.78 trillion tokens. While this may seem small next to the other sources, the total volume of code in existence is estimated to be far larger, on the order of tens of trillions of tokens.

Academic Publications and Patents: Academic publications and patents account for around 1 trillion tokens, representing a substantial but distinct subset of textual data.

Books: Digital book collections from platforms such as Google Books and Anna’s Archive contain over 21 trillion tokens of textual content. Considering all unique books worldwide, the total token count rises to 400 trillion tokens.

Social Media Archives: Platforms like Weibo and Twitter host user-generated content contributing approximately 49 trillion tokens, and Facebook alone is estimated to hold a further 140 trillion tokens, although much of this data is off-limits due to privacy and ethical concerns.

Transcribed Audio: Transcribing publicly available audio from sources like YouTube and TikTok could add around 12 trillion tokens to the training corpus.

Private Communications: Emails and stored instant messages contain a massive amount of text, totaling about 1,800 trillion tokens. Access to this data is restricted by privacy and ethical considerations.
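Summing these estimates gives a sense of the overall ceiling. The sketch below simply tallies the figures quoted above (the article's estimates, not measurements); notably, the sources that are realistically usable add up to roughly the 60-trillion-token ceiling discussed next.

```python
# Token counts quoted above, in trillions; all values are the article's estimates.
sources_t = {
    "web, FineWeb English":                  15.0,
    "web, high-quality non-English":         15.0,   # "can double in size"
    "public code (Stack v2)":                 0.78,
    "academic papers and patents":            1.0,
    "books (Google Books, Anna's Archive)":  21.0,
    "social media (Weibo, Twitter, Facebook)": 189.0,  # 49T + Facebook's 140T
    "private communications (email, IM)":  1800.0,
}

# Social media and private communications are largely off-limits for
# privacy and ethical reasons, so exclude them from the usable pool.
restricted = {"social media (Weibo, Twitter, Facebook)",
              "private communications (email, IM)"}
usable = sum(v for k, v in sources_t.items() if k not in restricted)

print(f"all sources:          {sum(sources_t.values()):8.2f}T tokens")
print(f"realistically usable: {usable:8.2f}T tokens")  # ~64.8T, near the ~60T ceiling
```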

As current LLM training datasets approach the 15-trillion-token mark, roughly the total of available high-quality English web text, ethical and logistical challenges emerge. Tapping additional resources such as books, audio transcriptions, and non-English corpora could raise the ceiling of readable, high-quality text to roughly 60 trillion tokens.
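For a sense of how quickly these ceilings bind, consider a back-of-the-envelope calculation. It assumes the widely cited compute-optimal heuristic of roughly 20 training tokens per model parameter (from the Chinchilla scaling work; an assumption layered on top of the article's figures, not something the article states):

```python
TOKENS_PER_PARAM = 20          # Chinchilla-style heuristic (assumption)
ENGLISH_WEB_T = 15             # high-quality English web text, trillions
EXPANDED_POOL_T = 60           # ceiling with books, audio, other languages

for params_b in (70, 400, 750, 3000):   # model sizes in billions of parameters
    need_t = params_b * TOKENS_PER_PARAM / 1000
    status = ("fits in English web" if need_t <= ENGLISH_WEB_T
              else "fits in expanded pool" if need_t <= EXPANDED_POOL_T
              else "exceeds all readable text")
    print(f"{params_b:>5}B params -> ~{need_t:5.1f}T tokens  ({status})")
```

On this heuristic, a compute-optimal model of roughly 750 billion parameters would already consume all 15 trillion tokens of high-quality English web text.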

However, private data repositories managed by tech giants like Google and Facebook hold token counts in the quadrillions, and tapping them would cross clear ethical boundaries. With the supply of ethically usable text largely exhausted at that point, the future of LLM development hinges on synthetic data: generating training text rather than harvesting it becomes a crucial direction for AI research.

In summary, given growing data requirements and finite text resources, innovative approaches to LLM training are essential. Synthetic data becomes increasingly important as existing datasets approach saturation, a shift that underscores the evolving landscape of AI research and the role of synthetic data generation in continued progress and ethical compliance.
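What synthetic data creation looks like in practice varies widely. One minimal sketch, assuming access to an existing chat-completion API (the model name and prompt below are illustrative placeholders, not a pipeline described in the article), is to have a strong model rewrite seed passages into fresh training text:

```python
# pip install openai; assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def synthesize(seed: str, model: str = "gpt-4o-mini") -> str:
    """Rewrite a seed passage into a new, factually consistent training paragraph."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Rewrite the user's passage as a new, factually "
                        "consistent paragraph suitable for model training."},
            {"role": "user", "content": seed},
        ],
    )
    return response.choices[0].message.content

# Each seed passage can yield many distinct rewrites, stretching a finite corpus.
print(synthesize("The FineWeb dataset contains 15 trillion tokens of English web text."))
```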

Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning. She is a Data Science enthusiast with strong analytical and critical-thinking skills and an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.
