Authors suing OpenAI

Adrien Foucart, PhD in biomedical engineering.

Two authors, Paul Tremblay and Mona Awad, have filed a lawsuit against OpenAI for “direct copyright infringement, vicarious copyright infringement, violations of section 1202(b) of the Digital Millennium Copyright Act, unjust enrichment, violations of the California and common law unfair competition laws, and negligence”.1

When I first heard about the complaint, I was a little bit skeptical. The key element of the complaint (from what I saw in the media coverage2) was that ChatGPT was capable of generating very accurate summaries of those works. In the complaint itself, we can read in the Overview section:

Indeed, when ChatGPT is prompted, ChatGPT generates summaries of Plaintiffs’ copyrighted works—something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works.

And later on in the Factual Allegations section:

On information and belief, the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data.

While I have few doubts that OpenAI is, indeed, using a ton of copyrighted work to train its models, this doesn’t really prove it, in my opinion. The argument seems to imply that the language model is really summarizing works that it has somehow “ingested” into its model. But GPT isn’t really summarizing anything, of course: it’s doing its best impression of a summary of those books, which, as the complaint notes, “get some details wrong.” I find it a lot more likely that, when prompted to summarize those books, GPT is influenced more by online summaries of the books than by the books themselves.

Looking more closely at the full complaint and the exhibits3, however, I think it’s a lot more solid than it initially appears. While Tremblay and Awad use their own books as examples, their complaint is much bigger, as it’s presented as a class action on behalf of “all persons or entities domiciled in the United States that own a United States copyright in any work that was used as training data for the OpenAI Language Models.” And their case for why OpenAI’s training data includes copyrighted works is largely based on information from OpenAI themselves.

In particular, they note that:

  • GPT-1 used BookCorpus, which is copied from Smashwords.com and is – apparently – known to contain works under copyright.
  • GPT-3 adds “two internet-based books corpora,” whose nature is unspecified, but which according to OpenAI’s papers would contain respectively around 60,000 and 300,000 titles.

The thing is, there aren’t many sources of book data that large. The suit identifies the likely candidates. The 60k-title corpus could correspond to Project Gutenberg; those books are no longer under copyright, so that’s fine. But the only internet-based book repositories that could supply the 300k titles of the second corpus would be the LibGen / Z-Lib / Sci-Hub / Bibliotik corpus and/or torrent collections of books, which mostly contain copyrighted material.
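
To see where numbers of that order come from: OpenAI’s GPT-3 paper reports token counts for the two corpora (about 12 billion tokens for “Books1” and 55 billion for “Books2”), and dividing by a rough per-book average lands in the range the complaint cites. The tokens-per-book figure below is my own assumption, not something from the paper or the complaint; a minimal back-of-envelope sketch:

    # Back-of-envelope estimate: title counts from the token counts
    # reported in OpenAI's GPT-3 paper. The tokens-per-book average is
    # an assumed round figure for full-length books, not a known value.
    BOOKS1_TOKENS = 12e9     # "Books1" corpus, per the GPT-3 paper
    BOOKS2_TOKENS = 55e9     # "Books2" corpus, per the GPT-3 paper
    TOKENS_PER_BOOK = 190e3  # assumption: ~145k words/book at ~1.3 tokens/word

    print(f"Books1: ~{BOOKS1_TOKENS / TOKENS_PER_BOOK:,.0f} titles")  # ~63,000
    print(f"Books2: ~{BOOKS2_TOKENS / TOKENS_PER_BOOK:,.0f} titles")  # ~289,000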

I think that’s the strongest argument they have here: based on OpenAI’s own statements about their datasets, they almost certainly used copyrighted material to train their models. And if OpenAI wants to demonstrate that this is not the case, they may have to give a bit more information about their data… So we may finally get some openness from them – under oath.

This is not just about those two authors, however. The lawyers behind this, Joseph Saveri and Matthew Butterick, are clearly determined to make life very difficult for OpenAI and other generative AI startups. Three other authors (Sarah Silverman, Christopher Golden and Richard Kadrey) joined an identical complaint against OpenAI4 and another against Meta for their own language model, LLaMA5. LLaMA is apparently trained on “the Books3 section of ThePile”6. That particular dataset comes from EleutherAI7, and Books3 is described as coming from Bibliotik, compiled by Shawn Presser and shared on Twitter8. Since this dataset is easy to find and download, this case is a lot more straightforward: copyrighted books from the authors filing the complaint are in the Books3 dataset, so – according to Meta’s own paper – they were used in training LLaMA. Saveri and Butterick are also behind the lawsuits against GitHub Copilot9 and Stable Diffusion10.
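
Since Books3 is public, that factual claim is something anyone with a copy of the dataset can verify. A minimal sketch of such a check, assuming Books3 has been unpacked locally into plain-text files whose names contain the title; the root path and the file-naming convention are my assumptions for illustration (the example titles are books by the three plaintiffs):

    import os

    # Hypothetical check: does a given title appear among the plain-text
    # files of a locally unpacked Books3 dump? Assumes file names contain
    # the book's title; the root path below is illustrative only.
    def find_in_books3(root: str, query: str) -> list[str]:
        """Return paths of .txt files whose name contains the query."""
        query = query.lower()
        matches = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.endswith(".txt") and query in name.lower():
                    matches.append(os.path.join(dirpath, name))
        return matches

    for title in ["The Bedwetter", "Ararat", "Sandman Slim"]:
        hits = find_in_books3("books3/", title)
        print(f"{title!r}: {len(hits)} file(s) matched")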

It will be interesting to see what kind of precedent the US courts decide to set here. The claim that the training sets contain copyrighted material may be strong, but that doesn’t necessarily mean that training a language model with copyrighted material is by itself infringement. What the judges will decide to do with all this information, I don’t know. I’m sure many lawyers are going to fight about it for many years, though.


  1. Tremblay v OpenAI complaint↩︎

  2. The Guardian, July 5th, 2023↩︎

  3. Tremblay v OpenAI Exhibits (PDF)↩︎

  4. Silverman v OpenAI complaint↩︎

  5. Kadrey v Meta complaint↩︎

  6. Touvron et al., LLaMA: Open and Efficient Foundation Language Models, 2023. arXiv:2302.13971↩︎

  7. Gao et al., The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020. arXiv:2101.00027↩︎

  8. Shawn Presser on Twitter↩︎

  9. https://githubcopilotlitigation.com/↩︎

  10. https://stablediffusionlitigation.com/↩︎