In the realm of artificial intelligence (AI) and large language models (LLMs), the key requirement for developing generative solutions is finding suitable training data. With the advancement of generative AI models such as ChatGPT and DALL-E, there is a growing temptation to use AI-generated outputs as training data for new AI systems. However, recent research has highlighted the dangerous consequences of this practice, a phenomenon known as “model collapse.” A study published in July 2023 by scientists at Rice University and Stanford University, titled “Self-Consuming Generative Models Go MAD,” concluded that training AI models exclusively on generative AI outputs is not advisable.
When an AI model is trained on data generated by other AI models, it ends up learning from a distorted reflection of itself. As in the game of “telephone,” each iteration of AI-generated data becomes more corrupted and detached from reality. Researchers have found that even a small amount of AI-generated content in the training data can be detrimental, causing the model’s outputs to degrade quickly into nonsense. This happens because the errors and biases present in the synthetic data are magnified as the model learns from its own generated outputs.
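To make the mechanism concrete, here is a minimal, deliberately oversimplified sketch (not from the study) in which the “generative model” is just a Gaussian fitted to its training data. Each generation is trained only on samples from the previous one; the sample size and generation count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 trains on real data drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for generation in range(1, 201):
    # "Train" a toy generative model: estimate the mean and spread of the data.
    mu, sigma = data.mean(), data.std()
    # The next generation trains only on samples produced by this model.
    data = rng.normal(loc=mu, scale=sigma, size=50)
    if generation % 25 == 0:
        print(f"generation {generation:3d}: mean={mu:+.3f}, std={sigma:.3f}")
```

Because each small sample under-represents the tails of the distribution, the estimated spread tends to shrink from one generation to the next, and the synthetic data gradually loses the diversity of the original. Real generative models are vastly more complex, but the compounding effect is similar in spirit.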
The issue of model collapse is evident across various types of AI models, from language models to image generators. While larger, more powerful models may show some resistance, there is little evidence to suggest they are immune to this problem. As AI-generated content becomes more widespread, future AI models are likely to be trained on a mix of real and synthetic data. This creates an “autophagous” (self-consuming) loop in which the model’s outputs deteriorate in quality and diversity over successive generations.
Researchers at Rice University and Stanford University conducted a detailed analysis of self-consuming generative image models trained on their own synthetic outputs. They identified three main types of self-consuming loops, compared in the toy sketch after the list below:
Fully Synthetic Loops: In these loops, models are exclusively trained on synthetic data generated by previous models. It was found that these loops inevitably lead to Model Autophagy Disorder (MAD), with the quality or diversity of generated images progressively decreasing over generations.
Synthetic Augmentation Loops: These loops incorporate a fixed set of real training data along with synthetic data, delaying but not preventing MAD.
Fresh Data Loops: In these loops, each generation of the model has access to new, previously unseen real training data, preventing MAD and maintaining the quality and diversity of generated images over generations.
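Extending the toy Gaussian sketch from earlier, the three loop structures can be compared side by side. This is only an illustration of what data each generation trains on; a one-parameter Gaussian cannot reproduce the study’s quantitative findings, and the sample sizes, generation counts, and names below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100            # samples per dataset (illustrative)
GENERATIONS = 500  # number of model generations (illustrative)

def run_loop(kind):
    """Simulate one self-consuming loop and return the final generation's std."""
    real = rng.normal(0.0, 1.0, size=N)   # fixed set of real training data
    train = real
    for _ in range(GENERATIONS):
        mu, sigma = train.mean(), train.std()          # "train" the toy model
        synthetic = rng.normal(mu, sigma, size=N)      # sample from it
        if kind == "fully synthetic":
            train = synthetic                           # model outputs only
        elif kind == "synthetic augmentation":
            train = np.concatenate([real, synthetic])   # fixed real data + synthetic
        elif kind == "fresh data":
            fresh = rng.normal(0.0, 1.0, size=N)        # new real data each generation
            train = np.concatenate([fresh, synthetic])
    return train.std()

for kind in ("fully synthetic", "synthetic augmentation", "fresh data"):
    print(f"{kind:>22s}: final std = {run_loop(kind):.3f}")
```

In this simplified setting, the fully synthetic loop visibly loses diversity (its spread collapses toward zero), while the loops that keep real data in the mix stay anchored near the original distribution; in the actual study, the fixed-real-data case only delays MAD rather than preventing it.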
Prominent figures in the AI industry recently made commitments at the White House to introduce strategies like watermarking to distinguish synthetic data from authentic data. This approach aims to help users identify artificially generated content and address the negative impacts of synthetic data on the internet. Watermarking could serve as a preventive measure against training generative models on AI-generated data, although its effectiveness in tackling MADness requires further investigation.
It is crucial to maintain a balance of real and synthetic content in training data, with proper representation of minority groups. Companies must curate datasets carefully and monitor for signs of degradation to prevent AI systems from becoming biased and unreliable. Responsible data curation and monitoring can guide the development of AI in a grounded direction that serves diverse community needs.
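As a rough illustration of the kind of monitoring this implies, the hypothetical helper below (not from the article) tracks two simple warning signs across dataset versions: the share of examples flagged as AI-generated and a crude lexical-diversity proxy.

```python
def corpus_report(examples, synthetic_flags):
    """examples: list of text strings; synthetic_flags: bools marking AI-generated items."""
    synthetic_share = sum(synthetic_flags) / len(examples)
    tokens = [tok for text in examples for tok in text.lower().split()]
    unique_token_ratio = len(set(tokens)) / max(len(tokens), 1)  # diversity proxy
    return synthetic_share, unique_token_ratio

# Compare two hypothetical dataset versions and watch the trend over time.
versions = {
    "v1": (["the quick brown fox", "a slow green turtle"], [False, False]),
    "v2": (["the quick brown fox", "the quick brown fox jumps"], [False, True]),
}
for name, (texts, flags) in versions.items():
    share, diversity = corpus_report(texts, flags)
    print(f"{name}: synthetic share={share:.0%}, unique-token ratio={diversity:.2f}")
```

A rising synthetic share or a falling diversity measure between dataset versions would be a prompt to re-curate before further training.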
About the Author
Ranjeeta Bhattacharya is a senior data scientist at BNY Mellon with over 15 years of experience in data science and technology consulting roles. She holds degrees in Computer Science and Data Science, along with various certifications in these fields, reflecting a commitment to continuous learning and knowledge sharing.