TL;DR: LLMs and other generative AI models can reproduce significant chunks of their training data, and specific prompts seem to “unlock” it. This raises current and future copyright challenges: training on copyrighted data may turn out not to infringe copyright, but legal doesn’t mean legitimate. The concerns resemble those around MegaFace, where surveillance models were trained on photos of minors without informed consent. Copyright was originally meant to incentivize cultural production, but in the age of generative AI, copyright may not be enough.
In Borges’ fable Pierre Menard, Author of the Quixote, the title character sets out to write Don Quixote word for word, not as a copy but as a new creation. Generative AI models invite the same analogy: they reproduce training data without storing it explicitly, producing outputs that are verbally identical to the original but lack the human experience that goes into cultural production.
Generative AI models like ChatGPT can emit chunks of their training data through next-word prediction, which has led to disputes over verbatim reproduction such as The New York Times’ lawsuit against OpenAI. Researchers are finding ways to extract training data from these models, for example by prompting text-to-image models like Midjourney to reproduce images from movies. Rapid advances such as Sora, OpenAI’s text-to-video model, further complicate questions of copyright and legitimacy.
Training data isn’t stored explicitly in the model itself, but it can be reconstructed with the right prompts. LLMs compress their training data while also having generative capabilities, which has prompted discussion of whether they are best understood as compression tools. The notion of “hallucination” reflects this: an LLM’s output is dream-like, steered by the prompt, and can reproduce training data as well as create new content.
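To make the idea of a prompt “unlocking” training data concrete, here is a minimal sketch in Python. It is a toy word-level next-word predictor, far cruder than a neural LLM, and everything in it (the `train` and `generate` helpers, the invented training sentence, the two-word context) is an assumption chosen for illustration. The model keeps only a table of which word follows which two-word context, never the sentence as a string, yet greedy next-word prediction from the right prefix regenerates the sentence word for word.

```python
from collections import Counter, defaultdict

ORDER = 2  # number of preceding words used as context (an illustrative choice)

def train(text, order=ORDER):
    """Count which word follows each `order`-word context in the text."""
    words = text.split()
    model = defaultdict(Counter)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        model[context][words[i + order]] += 1
    return model

def generate(model, prompt, max_words=30, order=ORDER):
    """Greedy next-word prediction: always append the most frequent continuation."""
    out = prompt.split()
    for _ in range(max_words):
        context = tuple(out[-order:])
        if context not in model:
            break  # the model has never seen this context, so it stops
        out.append(model[context].most_common(1)[0][0])
    return " ".join(out)

# Invented sentence standing in for a document in the training set.
TRAINING_TEXT = (
    "the old archivist copied every page by hand before "
    "the library burned and nobody ever learned her name"
)

model = train(TRAINING_TEXT)

# A prefix taken from the training text reconstructs the rest of it.
completion = generate(model, "the old")
print(completion == TRAINING_TEXT)

# A prefix the model never saw unlocks nothing.
unseen = generate(model, "a young")
print(unseen)
```

Real LLMs differ in every important respect (learned weights rather than count tables, sampling rather than strict greedy decoding, vastly more data), so this only illustrates the mechanism in miniature: reproduction depends on the prompt, not on a stored copy.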
Copyright may not be the best framework for judging whether training and deploying AI models is acceptable. It is worth asking what society wants when it incentivizes cultural production, and how AI affects creativity and ownership. As the field evolves, questions of legality, legitimacy, and ethics in content generation remain open.