Brain metaphors

Adrien Foucart, PhD in biomedical engineering.


Over on Aeon, psychologist Robert Epstein wrote a very interesting piece about how we often explain the functioning of the brain as if it were a computer, retrieving memories and processing information, despite the lack of scientific grounding for that metaphor. Our brains don’t “recall” information, they “re-live” it. Information is not “stored” in neurons: our brains constantly evolve based on our experiences, and events can trigger them into re-activating the areas that were activated during a previous event, thus creating the impression of a “memory” for us.

This, I think, is the flip side of a discussion that I have had very often this past year about whether AI as it exists today can qualify as “intelligence” in any way. Just like “the computer” is not a very good metaphor for the human brain, the human brain is not a good metaphor for how AI algorithms work, yet this metaphor has completely taken over the way we think about AI. In fact, the whole vocabulary around AI is filled with human-intelligence metaphors: neural networks, learning, AIs that explain their reasoning… and obviously “intelligence” itself. These metaphors may be useful to get a surface-level intuition about how AI algorithms work, but they are not literally true and should not be taken as more than that: metaphors.

There’s this idea that, if we can make an artificial neural network that is as complex as the human neural network, then we should be capable of creating an actual intelligence. But that relies on two hypotheses that are unsupported by evidence: that artificial neural networks are good models of human neural networks (they aren’t), and that human neural networks are solely responsible for human intelligence. That doesn’t seem to be the case either. Our brains work as part of an intricate, interwoven network of systems which all interact with each other in ways that are very hard to capture. Our neurons without our endocrine system, our immune system, etc., are a very incomplete snapshot of who we are. We can’t – and maybe won’t ever be able to – upload our brains to the cloud.

Artificial Intelligence is not Human Intelligence, and it doesn’t really make sense to compare them. Any AI algorithm has capabilities and limitations, and we can study those without falling into the trap of anthropomorphism. That’s why I don’t like it when LLMs are compared on benchmarks such as law exams, or common coding exercises. Those tests were designed for humans, to test capabilities that are often difficult for human intelligence to acquire. For an AI, they don’t tell us much. Importantly, they don’t tell us that this AI is a good lawyer, or a good software developer. Yet these are still the kind of tests used to compare LLMs today. Perhaps the most used test right now is the “Massive Multitask Language Understanding (MMLU)” test. The MMLU is composed “of multiple-choice questions from various branches of knowledge”, which were “manually collected by graduate and undergraduate students from freely available sources online”. But we aren’t using LLMs to solve multiple-choice questions from various exams and tests collected around the web. For any actual task that we have in mind for such a model, we would need to prove that the MMLU is a useful proxy to evaluate the capability of the model to perform the task.

The fact that the MMLU is collected from online sources also makes it extremely difficult to use as a benchmark for modern LLMs, which are often trained on very opaque datasets, also collected from all around the web. The risks of contamination are huge, and the measures taken by LLM developers to mitigate this issue are often doubtful, if they are even described at all. The Gemini paper, for instance, states that they “search for and remove any evaluation data that may have been in our training corpus before using data for training”, but it doesn’t provide any detail as to how this search was done. If they used the same method as GPT-4 (i.e. looking for exact matches on substrings of the questions), then the risk of contamination is high.
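To make the weakness of exact substring matching concrete, here is a minimal Python sketch of that kind of check. It is not the actual GPT-4 or Gemini pipeline: the function names, the normalization step, the probe length of 50 characters and the three probes per question are all assumptions chosen for illustration, as are the example question and documents.

```python
import random
import re


def normalize(text: str) -> str:
    """Lowercase the text and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def sample_probes(question: str, n: int = 3, length: int = 50, seed: int = 0) -> list[str]:
    """Pick n substrings of the normalized question (hypothetical parameters)."""
    norm = normalize(question)
    if not norm:
        return []
    if len(norm) <= length:
        return [norm]  # short questions are used whole
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    starts = [rng.randrange(len(norm) - length + 1) for _ in range(n)]
    return [norm[s:s + length] for s in starts]


def is_contaminated(question: str, corpus: list[str]) -> bool:
    """Flag the question if any probe appears verbatim in any document."""
    probes = sample_probes(question)
    for doc in corpus:
        norm_doc = normalize(doc)
        if any(probe in norm_doc for probe in probes):
            return True
    return False


# Illustrative data, invented for this example.
question = ("Which of the following organs is primarily responsible "
            "for filtering blood in the human body?")
verbatim_doc = "Quiz archive. " + question + " A) heart B) kidney"
paraphrased_doc = ("In the human body, what organ mainly filters the blood? "
                   "A) heart B) kidney")

print(is_contaminated(question, [verbatim_doc]))     # True
print(is_contaminated(question, [paraphrased_doc]))  # False
```

The verbatim copy is caught, but the paraphrased version of the same question, with the same answer choices, goes through untouched. A model trained on the second document has effectively seen the test item, and an exact-match filter would never flag it.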

There is often a very large gap between the performance of LLMs on benchmarks and their performance in real-world applications, and this is probably part of the reason: the benchmarks are not made to evaluate a computer program, but a human brain… and the two are far too different for that to work.