In the rapidly evolving fields of Artificial Intelligence and Data Science, the amount and availability of training data play a crucial role in determining what Large Language Models (LLMs) can do. These models rely on vast volumes of textual data to develop their language understanding.
A recent tweet by Mark Cummins raises the concern that we may soon deplete the global reservoir of text data needed to train these models, given the exponential growth in data consumption and the demanding requirements of next-generation LLMs. To examine this issue, we survey the textual sources currently accessible across different media and compare their size to the growing demands of advanced AI models.
Web Data: The English text segment of the FineWeb dataset, a subset of the Common Crawl web data, contains a remarkable 15 trillion tokens. When high-quality non-English web content is included, this corpus can double in size.
Code Repositories: Publicly available code, such as that compiled in the Stack v2 dataset, contributes approximately 0.78 trillion tokens. While this may seem small compared to other sources, the total volume of code worldwide is estimated at tens of trillions of tokens.
Academic Publications and Patents: Academic publications and patents account for around 1 trillion tokens, representing a substantial but distinct subset of textual data.
Books: Digital book collections from platforms such as Google Books and Anna’s Archive contain over 21 trillion tokens of text. If every unique book in existence were counted, the total would rise to roughly 400 trillion tokens.
Social Media Archives: Platforms such as Weibo and Twitter hold roughly 49 trillion tokens of user-generated content, while Facebook alone holds an estimated 140 trillion tokens. Much of this data, however, is inaccessible due to privacy and ethical concerns.
Transcribed Audio: Publicly available audio sources such as YouTube and TikTok could add around 12 trillion tokens to the training corpus once transcribed.
Private Communications: Emails and stored instant messages contain a massive amount of text, totaling about 1,800 trillion tokens, but access to this data is restricted by privacy and ethical considerations.
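Taken together, the figures above support a quick back-of-the-envelope tally. The sketch below simply sums the approximate estimates quoted in this article; the split into "publicly usable" and "restricted" sources is an illustrative assumption made for the sake of the calculation, not part of the original estimates.

```python
# Back-of-the-envelope tally of the approximate token counts quoted above.
# All figures are in trillions of tokens; the grouping into publicly usable
# versus restricted sources is an illustrative assumption.

publicly_usable = {
    "web_english_fineweb": 15.0,            # English portion of FineWeb
    "web_non_english_high_quality": 15.0,   # roughly doubles the web corpus
    "public_code_stack_v2": 0.78,           # Stack v2 public code
    "academic_publications_patents": 1.0,
    "indexed_books": 21.0,                  # Google Books, Anna's Archive, etc.
    "transcribed_audio": 12.0,              # YouTube, TikTok, etc.
}

restricted_or_private = {
    "all_unique_books": 400.0,
    "social_media": 49.0 + 140.0,           # Weibo/Twitter plus Facebook
    "private_communications": 1800.0,       # email and instant messages
}

total_public = sum(publicly_usable.values())
total_private = sum(restricted_or_private.values())

print(f"Roughly accessible text: ~{total_public:.0f} trillion tokens")
print(f"Restricted or private text: ~{total_private:.0f} trillion tokens")
```

Summing only the broadly accessible sources lands in the same ballpark as the roughly 60-trillion-token ceiling discussed below; the exact figure depends on which sources one counts as usable.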
Current LLM training datasets are already approaching the 15-trillion-token mark, roughly the amount of high-quality English text readily available, and ethical and logistical challenges mount from here. Tapping additional resources such as books, audio transcriptions, and non-English corpora offers only incremental gains, potentially expanding the maximum readable, high-quality text to around 60 trillion tokens.
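To put these token counts in perspective, a token is the unit produced by a model's tokenizer, usually a word fragment of a few characters. The minimal sketch below uses the open-source tiktoken library as one example tokenizer, chosen purely for illustration; production training pipelines use their own tokenizers, and counts vary between them.

```python
# Minimal sketch of how raw text maps to the tokens counted above, using
# tiktoken's cl100k_base encoding as one example (counts differ by tokenizer).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sample = "Large Language Models rely on vast volumes of textual data."
token_ids = enc.encode(sample)

print(f"{len(sample.split())} words -> {len(token_ids)} tokens")
```

As a rule of thumb, English text averages roughly four characters per token, so a 15-trillion-token corpus corresponds to tens of terabytes of raw text.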
By contrast, the private data repositories managed by tech giants like Google and Facebook hold token counts in the quadrillions, but they lie beyond ethical and legal reach. With morally acceptable text sources largely tapped out, the future of LLM development increasingly hinges on synthetic data, making data synthesis a crucial direction for future AI research.
In summary, the growing data requirements of LLMs and the limited supply of natural text call for innovative approaches to training. As existing datasets approach saturation, synthetic data becomes increasingly crucial for continued progress and ethical compliance, underscoring how the landscape of AI research is shifting.
Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning. She is a Data Science enthusiast with strong analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.