[Opinion] The Galactica debacle

So I’ve made clear in the past that I’m not a huge fan of large language models as a way forward for AI. I think we just got a great example of the fundamental problems of this approach with the “Galactica” fiasco. A decent summary of the events by Will Douglas Heaven was posted on MIT Technology Review. Here I just want to give my own quick summary and personal experience, and what it all means going forward.

What the hell is Galactica, anyway?

On November 15th, Galactica was introduced by “Papers with Code” (a Meta AI project) on their social media accounts (Twitter thread, LinkedIn post), along with a now-unavailable live demo. In case Twitter no longer exists by the time you read this post, their main claims were:

Can summarize academic literature, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more.

They talked a bit about their corpus of data:

We train on a large scientific corpus of papers, reference material, knowledge bases and many other sources. Includes scientific text and also scientific modalities such as proteins, compounds and more.

How well did it work?

As you may expect from how quickly it was pulled: not very well. I tried a couple of queries when I saw the post, just to get an idea, and it was not particularly impressive in terms of “summarizing academic literature” or “write scientific code”, the two things that I briefly tried.

I tried to get it to write a review of segmentation methods and a few other basic image analysis concepts, and got badly written, repetitive and completely unsourced (and therefore unusable for any scientific purpose) text that was not necessarily “false”, but at best very surface level. Surface level would be fine if it then linked to wherever we could learn more, but in its current state at least it had no added value over just going to the Wikipedia page of whatever concept you want to write about. Or Google.

I also tried to get it to code an Otsu thresholding algorithm, and the code was messy and filled with mistakes (as in: not keeping its variable names straight, importing a bunch of useless things, etc.). It was clearly lifting things from a few different pieces of code in its training set that had Otsu in them, with no idea of what it was supposed to be doing.
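For reference, Otsu's method itself is short: build a histogram of pixel intensities, then pick the threshold that maximizes the between-class variance of the two resulting groups. Here is a minimal sketch in plain Python, assuming the image is given as a flat list of integer intensities in the range 0 to 255. The function name `otsu_threshold` and the `levels` parameter are illustrative choices for this sketch; this is not the code Galactica produced, which is not reproduced here.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    `pixels` is assumed to be a flat iterable of ints in [0, levels - 1].
    """
    pixels = list(pixels)

    # Histogram of intensity values.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of intensities of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break     # upper class empty from here on

        mu0 = sum0 / w0               # mean of the lower class
        mu1 = (sum_all - sum0) / w1   # mean of the upper class

        # Between-class variance, up to a constant factor 1 / total**2.
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t

    return best_t
```

On a perfectly bimodal input such as `[10] * 50 + [200] * 50`, every threshold from 10 to 199 separates the two groups equally well, and this sketch returns the first of them.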

As many others did, I also tested it on some more “politically charged” stuff, and it failed miserably to notice that it was writing a spirited defense of the Third Reich, which maybe should have raised a red flag somewhere in the system.

It’s just baffling that Meta thought this was ready for a public demo.

So, what happened?

Reading the arXiv preprint, the most charitable reading of the situation I can give is that there has been some, let’s say, misalignment between the research team that developed and validated the model, and the people who made and advertised the demo website. I don’t know if they’re the same people, but if they’re not, it may explain a few things.

Because the thing is, reading the original paper, it really doesn’t look like it’s supposed to “summarize academic literature” or “generate Wiki articles” or “write scientific code”. Certainly, the claim by Yann LeCun that you can “[t]ype a text and http://galactica.ai will generate a paper with relevant references, formulas, and everything” is just plain wrong.

The benchmarks that they used to validate the model are on very specific, specialized tasks, like “prompt with an equation name and generate LaTeX” (if you need that, use WolframAlpha!), describing proteins from an amino-acid sequence, or answering very specific questions (e.g. prompt “Abell 370 is a galaxy cluster located in the constellation of”, answer “Cetus”).

There are a handful of examples in the appendix of larger texts with “Wikipedia” articles and “literature survey”, but they seem very handpicked and there doesn’t appear to have been a systematic evaluation for those.

And as soon as real users got to play with the prompts, the whole thing unraveled.

What can we learn from this?

Large language models, like image generators, are cool. But they are also dumb. Even when they are very large and get very impressive scores on curated benchmarks, they don’t have any real knowledge or understanding. They generate stuff that’s likely to be accepted by a human as “something another human could say”. On very specific prompts, they will be able to get the right answer because the mapping between the tokenized input and the expected output will be very clear (in other words, it’s unlikely that there will be many words other than “Cetus” with a strong association with the words “Abell 370 is a galaxy cluster located in the constellation of” in the dataset). But anything more generic, more like what the announcement implied it could do… just doesn’t work. Or if it does, it’s very likely to be similar to DALL-E and other image generators: to really be able to use it, you have to learn its language, that is, how to tweak the prompts exactly right so that you activate the right path through the model.

But again: if you can do that, you should be able to learn how to use Wikipedia and Google Scholar, where your risk of plagiarizing someone else’s work by accident will be much reduced…

In summary: beware of claims about large language models that have not been thoroughly tested “in the wild”, because that’s the only place you can actually test them (at least if you claim that they can be useful to the general public).

19 Nov 2022 | Adrien Foucart | adrien@adfoucart.be