The OpenAI Endgame – O’Reilly

February 13, 2024



Since the New York Times sued OpenAI for infringing its copyrights by using Times content for training, everyone involved with AI has been wondering about the consequences. How will this lawsuit play out? And, more importantly, how will the outcome affect the way we train and use large language models?

There are two components to this suit. First, it was possible to get ChatGPT to reproduce some Times articles very close to verbatim. That’s fairly clearly copyright infringement, though there are still important questions that could influence the outcome of the case. Reproducing the New York Times clearly isn’t the intent of ChatGPT, and OpenAI appears to have modified ChatGPT’s guardrails to make generating infringing content more difficult, though probably not impossible. Is this enough to limit any damages? It’s not clear that anybody has used ChatGPT to avoid paying for a NYT subscription.

Second, the examples in a case like this are always cherry-picked. While the Times can clearly show that OpenAI can reproduce some articles, can it reproduce any article from the Times’ archive? Could I get ChatGPT to produce an article from page 37 of the September 18, 1947 issue? Or, for that matter, an article from the Chicago Tribune or the Boston Globe? Is the entire corpus available (I doubt it), or just certain random articles? I don’t know, and given that OpenAI has modified GPT to reduce the possibility of infringement, it’s almost certainly too late to do that experiment. The courts will have to decide whether inadvertent, inconsequential, or unpredictable reproduction meets the legal definition of copyright infringement.
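
One way to run the kind of experiment described above, while it is still possible against other models, is to quantify how much of a generated continuation is copied verbatim from a known article. Here is a minimal sketch, assuming you already have the model's output and the original article as plain text; the n-gram size is an illustrative choice, not anything taken from the lawsuit.

```python
from difflib import SequenceMatcher

def verbatim_overlap(generated: str, source: str, n: int = 8) -> dict:
    """Crude measure of near-verbatim reproduction: the share of the generated
    text's word n-grams that also appear in the source, plus the longest
    contiguous run of words copied from the source."""
    gen_words, src_words = generated.split(), source.split()
    src_ngrams = {tuple(src_words[i:i + n]) for i in range(len(src_words) - n + 1)}
    gen_ngrams = [tuple(gen_words[i:i + n]) for i in range(len(gen_words) - n + 1)]
    copied = sum(1 for g in gen_ngrams if g in src_ngrams)
    longest = SequenceMatcher(None, gen_words, src_words).find_longest_match(
        0, len(gen_words), 0, len(src_words))
    return {
        "ngram_overlap": copied / max(len(gen_ngrams), 1),
        "longest_copied_words": longest.size,
    }

# Hypothetical usage: compare a model's continuation of an article's opening
# paragraph against the article itself.
# print(verbatim_overlap(model_output, nyt_article_text))
```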


The more important claim is that training a model on copyrighted content is infringement, whether or not the model is capable of reproducing that training data in its output. An inept and clumsy version of this claim was made by Sarah Silverman and others in a suit that was dismissed. The Authors Guild has its own version of this lawsuit, and it is working on a licensing model that would allow its members to opt in to a single licensing agreement. The outcome of this case could have many side effects, since it would essentially allow publishers to charge not just for the texts they produce but for how those texts are used.

It is difficult to predict what the outcome will be, though easy enough to guess. Here’s mine. OpenAI will settle with the New York Times out of court, and we won’t get a ruling. This settlement will have important consequences: it will set a de facto price on training data. And that price will no doubt be high. Perhaps not as high as the Times would like (there are rumors that OpenAI has offered something in the range of $1 million to $5 million), but high enough to deter OpenAI’s competitors. $1M is not, in and of itself, a terribly high price, and the Times reportedly thinks that it’s way too low; but realize that OpenAI will have to pay a similar amount to almost every major newspaper publisher worldwide, in addition to organizations like the Authors Guild, technical journal publishers, magazine publishers, and many other content owners. The total bill is likely to be close to $1 billion, if not more, and as models need to be updated, at least some of it will be a recurring cost.
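
A rough back-of-envelope shows how a per-publisher deal scales into a billion-dollar bill. Only the $1M–$5M per-deal range comes from the reported rumors; the number of counterparties below is an assumption for illustration.

```python
# Back-of-envelope: total licensing cost if every major content owner demands
# a NYT-style deal. The counterparty count is an assumption for illustration;
# the $1M-$5M per-deal range comes from the rumors reported above.
low_per_deal, high_per_deal = 1_000_000, 5_000_000
counterparties = 300  # assumed: newspapers, guilds, journal and magazine publishers

print(f"low estimate:  ${low_per_deal * counterparties / 1e9:.1f}B")   # ~$0.3B
print(f"high estimate: ${high_per_deal * counterparties / 1e9:.1f}B")  # ~$1.5B
```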

I suspect that OpenAI would have difficulty going higher, even given Microsoft’s investments; whatever else you may think of this strategy, OpenAI has to think about the total cost. I doubt that they are close to profitable; they appear to be running on an Uber-like business plan, in which they spend heavily to buy the market without regard for running a sustainable business. But even with that business model, billion-dollar expenses have to raise the eyebrows of partners like Microsoft.

The Times, on the other hand, appears to be making a common mistake: overvaluing its data. Yes, it has a large archive, but what is the value of old news? Furthermore, in almost any application, and especially in AI, the value of data isn’t the data itself; it’s the correlations between different datasets. The Times doesn’t own those correlations any more than I own the correlations between my browsing data and Tim O’Reilly’s. But those correlations are precisely what’s valuable to OpenAI and others building data-driven products.

Once a settlement sets the price of copyrighted training data at $1B or thereabouts, other model developers will need to pay similar amounts to license their training data: Google, Microsoft (for whatever independently developed models they have), Facebook, Amazon, and Apple. Those companies can afford it. Smaller startups (including companies like Anthropic and Cohere) will be priced out, along with every open source effort. By settling, OpenAI will eliminate much of their competition. And the good news for OpenAI is that even if they don’t settle and instead lose the case, the effect on their competition would be the same; they’d probably just end up paying more. Not only that, the Times and other publishers would be responsible for enforcing this “agreement”: negotiating with other groups that want to use their content and suing those they can’t agree with. OpenAI keeps its hands clean, and its legal budget unspent. They can win by losing—and if so, do they have any real incentive to win?

Unfortunately, OpenAI is right in claiming that a good model can’t be trained without copyrighted data (although Sam Altman, OpenAI’s CEO, has also said the opposite). Yes, we have substantial libraries of public domain literature, plus Wikipedia, plus papers in arXiv, but a language model trained only on that data would produce text that sounds like a cross between 19th-century novels and scientific papers, and that’s not a pleasant thought. The problem isn’t just text generation: will a language model whose training data has been limited to copyright-free sources require prompts to be written in an early-20th- or 19th-century style? Newspapers and other copyrighted material are an excellent source of well-edited, grammatically correct modern language. It is unreasonable to believe that a good model for modern languages can be built from sources that have fallen out of copyright.

Requiring model-building organizations to purchase the rights to their training data would inevitably leave generative AI in the hands of a small number of unassailable monopolies. (We won’t address what can or can’t be done with copyrighted material, but we will say that copyright law says nothing at all about the source of the material: you can buy it legally, borrow it from a friend, steal it, find it in the trash—none of this has any bearing on copyright infringement.)

One of the participants at the WEF roundtable “The Expanding Universe of Generative Models” reported that Altman has said he doesn’t see the need for more than one foundation model. That’s not unexpected, given my guess that his strategy is built around minimizing competition. But this is chilling: if all AI applications go through one of a small group of monopolists, can we trust those monopolists to deal honestly with issues of bias? AI developers have said a lot about “alignment,” but discussions of alignment always seem to sidestep more immediate issues like race- and gender-based bias.

Will it be possible to develop specialized applications (for example, O’Reilly Answers) that require training on a specific dataset? I’m sure the monopolists would say “of course, those can be built by fine-tuning our foundation models”; but do we know whether that’s the best way to build those applications? Or whether smaller companies will be able to afford to build those applications once the monopolists have succeeded in buying the market? Remember: Uber was once inexpensive.
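
For concreteness, here is roughly what “fine-tuning a foundation model” looks like for an application builder today, sketched against OpenAI’s hosted fine-tuning API (openai Python library v1.x). The file name, example records, and model choice are placeholders; whether this is the best way to build such an application is exactly the question raised above.

```python
# Minimal sketch: fine-tune a hosted foundation model on a domain-specific
# dataset via OpenAI's fine-tuning API. File names, records, and the model
# name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# domain_qa.jsonl: one JSON object per line, e.g.
# {"messages": [{"role": "user", "content": "How do I revert a commit?"},
#               {"role": "assistant", "content": "Use `git revert <sha>` ..."}]}
training_file = client.files.create(
    file=open("domain_qa.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # a base model that supports fine-tuning
)
print(job.id, job.status)

# Once the job finishes, the tuned model is addressed by name:
# client.chat.completions.create(
#     model=job.fine_tuned_model,
#     messages=[{"role": "user", "content": "How do I revert a commit?"}],
# )
```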

If model development is limited to a few wealthy companies, its future will be bleak. The outcome of these copyright lawsuits won’t just apply to the current generation of Transformer-based models; it will apply to any model that needs training data. Limiting model building to a small number of companies will eliminate most academic research. It would certainly be possible for most research universities to build a training corpus from content they acquired legitimately. Any good library will have the Times and other newspapers on microfilm, which can be converted to text with OCR. But if the law specifies how copyrighted material can…
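
The microfilm route above hinges on one mechanical step: turning scanned page images into machine-readable text. A minimal sketch of that step, assuming the scans have already been exported as image files and using the open source Tesseract engine via pytesseract; the directory layout and language setting are illustrative.

```python
# Sketch: convert a directory of scanned newspaper pages into a plain-text
# corpus with Tesseract OCR. Assumes Tesseract is installed locally and the
# pages exist as PNG files; all paths are hypothetical placeholders.
from pathlib import Path

import pytesseract
from PIL import Image

SCANS_DIR = Path("scans/nyt_microfilm")   # hypothetical input directory
CORPUS_DIR = Path("corpus/nyt_text")      # hypothetical output directory
CORPUS_DIR.mkdir(parents=True, exist_ok=True)

for page in sorted(SCANS_DIR.glob("*.png")):
    text = pytesseract.image_to_string(Image.open(page), lang="eng")
    (CORPUS_DIR / f"{page.stem}.txt").write_text(text, encoding="utf-8")

print(f"OCR'd {len(list(CORPUS_DIR.glob('*.txt')))} pages into {CORPUS_DIR}")
```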


