Authors suing OpenAI
Adrien Foucart, PhD in biomedical engineering.
Get sparse and irregular email updates by subscribing to https://adfoucart.substack.com. This website is guaranteed 100% human-written, ad-free and tracker-free.
Two authors, Paul Tremblay and Mona Awad, have filed a lawsuit against OpenAI for “direct copyright infringement, vicarious copyright infringement, violations of section 1202(b) of the Digital Millennium Copyright Act, unjust enrichment, violations of the California and common law unfair competition laws, and negligence”1.
When I first heard about the complaint, I was a little bit skeptical. The key element of the complaint (from what I saw in the media coverage2) was that ChatGPT was capable of generating very accurate summaries of those works. In the complaint itself, we can read in the Overview section:
Indeed, when ChatGPT is prompted, ChatGPT generates summaries of Plaintiffs’ copyrighted works—something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works.
And later on in the Factual Allegations section:
On information and belief, the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data.
While I have few doubts that OpenAI is, indeed, using a ton of copyrighted work to train its models, this doesn’t really prove it, in my opinion. It seems to imply that the language model is really summarizing works that it has somehow “ingested” into its model. But GPT isn’t really summarizing anything, of course: it’s doing its best impression of a summary of those books, which, as the complaint notes, “get some details wrong.” I find it a lot more likely that, when prompted to summarize those books, GPT is influenced more by online summaries of the books than by the books themselves.
Looking more closely at the full complaint and the exhibits3, however, I think it’s a lot more solid than it initially appears. While Tremblay and Awad use their own books as examples, their complaint is much bigger, as it’s presented as a class action on behalf of “all persons or entities domiciled in the United States that own a United States copyright in any work that was used as training data for the OpenAI Language Models.” And their case for why OpenAI’s training data includes copyrighted works is largely based on information from OpenAI themselves.
In particular, they note that OpenAI’s own GPT-3 paper describes “two internet-based books corpora (Books1 and Books2)” among its training data, which the complaint estimates at roughly 60k and 300k books respectively.
The thing is, there are not really many sources of book data with that many works in them. The suit identifies two likely candidates. The 60k books dataset could correspond to Project Gutenberg. Those books are no longer under copyright, so that’s fine. But the only internet-based book repositories that could include the 300k books from the second dataset would be the LibGen / Z-Lib / Sci-Hub / Bibliotik corpus and/or torrent collections of books, which mostly include copyrighted material.
I think that’s the strongest argument they have here: based on OpenAI’s own claims about their dataset, they almost certainly used copyrighted material to train their model. And if OpenAI wants to demonstrate that this is not the case, they may have to give a bit more information about their data… So we may finally get some openness from them – under oath.
This is not just about those two authors, however. The lawyers behind this, Joseph Saveri and Matthew Butterick, are clearly determined to make life very difficult for OpenAI and other generative AI startups. Three other authors (Sarah Silverman, Christopher Golden and Richard Kadrey) joined an identical complaint against OpenAI4 and against Meta for their own language model, LLaMA5. LLaMA is apparently trained on “the Books3 section of ThePile”6. That particular dataset comes from EleutherAI7, and Books3 is described as coming from Bibliotik, compiled by Shawn Presser and shared on Twitter8. Since this dataset is easy to find and download, this is a lot more straightforward: copyrighted books from the authors filing the complaint are in the Books3 dataset, so – according to Meta’s own papers – they were used in training LLaMA. Saveri and Butterick are also behind the lawsuits against GitHub Copilot9 and Stable Diffusion10.
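To make that “easy to find and download” point concrete, here is a minimal sketch, assuming you have pulled a local copy of the Books3 plain-text dump, of how one might check whether a given title appears in it. The directory path, the one-text-file-per-book layout, and the example titles are illustrative assumptions on my part, not something taken from the complaints.

```python
# Minimal sketch, assuming a local Books3 copy under BOOKS3_DIR with one
# .txt file per book whose filename contains the title (roughly how the
# dataset has circulated). The path and the example titles are hypothetical.
from pathlib import Path

BOOKS3_DIR = Path("~/data/books3").expanduser()  # hypothetical location


def normalise(s: str) -> str:
    """Lowercase and drop spaces/hyphens so title matching is forgiving."""
    return s.lower().replace(" ", "").replace("-", "")


def find_title(title: str) -> list[Path]:
    """Return the files whose name appears to match the given book title."""
    needle = normalise(title)
    return [p for p in BOOKS3_DIR.rglob("*.txt") if needle in normalise(p.name)]


if __name__ == "__main__":
    for title in ["The Cabin at the End of the World", "Bunny"]:
        matches = find_title(title)
        print(f"{title!r}: {len(matches)} candidate file(s)")
        for path in matches:
            print("   ", path)
```

A filename match like this is only a first pass, of course, but it illustrates why the Books3 claims are so much easier to argue than the ones about OpenAI’s undisclosed book corpora: anyone can run this kind of check against a public dataset.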
It will be interesting to see what kind of precedent the US courts decide to set here. The claim that the training sets contain copyrighted material may be strong, but that doesn’t necessarily mean that training a language model with copyrighted material is by itself infringement. What the judges will decide to do with all this information, I don’t know. I’m sure many lawyers are going to fight about it for many years, though.
Touvron et al., LLaMA: Open and Efficient Foundation Language Models, 2023. arXiv:2302.13971↩︎
Gao et al., The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020. arXiv:2101.00027↩︎