Research blog
Adrien Foucart, PhD in biomedical engineering.
This website is guaranteed 100% human-written, ad-free and tracker-free. Follow updates using the RSS Feed or by following me on Mastodon
Two posts on Retraction Watch yesterday highlight yet again how the big publishers in the scientific publishing industry are failing at being gatekeepers of high-quality science, and how many scientists don’t seem to grasp that they should be writing their papers themselves.
“As a nonsense phrase of shady provenance makes the rounds, Elsevier defends its use” (Frederik Joelving, Retraction Watch).
In this first one, a “fingerprint” of automatically-generated papers is described. These fingerprints, like the “tortured phrases” identified by Guillaume Cabanac [arXiv], are basically things that no scientist will ever write and that no diligent human reviewer or editor should let through. Cabanac identifies for instance “profound neural organization” used instead of “deep neural network”, a clear sign that some automated method just tried to find synonyms in an effort to mask plagiarism, without understanding the context.
This time, the fingerprint is “vegetative electron microscope” (or microscopy), which… doesn’t really mean anything? The most likely explanation found for its appearance (in more than 20 published papers) is an OCR error: a machine tried to convert a two-column 1959 paper to plain text and accidentally merged the two columns, where “vegetative” and “electron microscopy” appeared next to each other.
But the news here isn’t the fingerprint, but rather the baffling reaction of Elsevier when asked about a 2024 paper published in one of their journals [Elsevier]. To be clear, “vegetative electron microscope” is something that no one has ever written and that doesn’t make sense. But an Elsevier spokesman wrote that “the Editor-in-Chief confirmed that ‘vegetative electron microscopy’ is a way of conveying ‘electron microscopy of vegetative structures’ so he was content with the shortened version to be used.”
This is nonsense. First because, again, no one uses this “shortcut”. Second, because the thing studied at that point of the paper is bacterial cellulose, which is different from plant cellulose and therefore not even vegetal. The presence of such an obvious mark of automated paper writing should lead to an instant retraction, and to an investigation into any previous papers published by the authors.
“As Springer Nature journal clears AI papers, one university’s retractions rise drastically” (Avery Orrall, Retraction Watch)
Writing a scientific paper with ChatGPT or other LLMs is a terrible idea. Using AI assistance to write lowers our critical thinking abilities, as even Microsoft realises [Microsoft Research].
Convincing scientists, particularly those under heavy pressure to publish a lot, not to use these tools is a tall order. The bare minimum, which the Springer Nature publisher requires, is to disclose the use of an LLM. More than one hundred papers that failed this bare minimum requirement have been retracted so far by the Neurosurgical Review journal. At least in this case the journal seems to be doing its job, but I find the response from one of the retracted authors [Sivakamavalli Jeyachandran via RetractionWatch], who disagrees with the retraction, really disheartening.
His argument is:
At some point, he refers to his undisclosed usage of an LLM as a “minor technical infraction”. I certainly disagree – and am glad that Springer seems to disagree as well.
In competitions or in original research papers which compare the results of some algorithms on a given task, the centerpiece is generally the Big Table of Results. The Big Table of Results is where you put a list of algorithms on one axis, a list of metrics on the other axis, and you put in bold the algorithm that performs best. In original research papers, it’s where you justify that your method is better than the others, with tables such as the one below. See? It’s in bold!
| Method | Result |
|---|---|
| Old Classic Baseline [1] | 0.71 |
| State-of-the-art from a few years back [2] | 0.82 |
| Previous work from the authors [3] | 0.82 |
| **This work [4]** | **0.84** |
In competitions, this gives us the leaderboard, which will look something like this:
| Rank | Team | Result |
|---|---|---|
| 1 | Big AI Research group | 0.91 |
| 2 | Big AI Company | 0.90 |
| 3 | Someone with 2 GPUs | 0.87 |
| 4 | Someone with 1 GPU | 0.84 |
| … | … | … |
| 157 | I don’t know what I’m doing | 0.42 |
Needless to say, things are a bit more complicated than that. In our new preprint, accepted at the ESANN 2025 conference and available online (PDF), we argue for a more nuanced approach to ranking where, instead of saying “this is the best method”, we compute confidence intervals on the rankings based on the assumption that the test set is a random sample from the larger population of “all cases where we may want to apply our algorithm”. We take the general procedure proposed by S. Holm in 2013¹, which uses the results of pairwise statistical tests to infer the confidence intervals.
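To give a rough sense of how pairwise tests can translate into confidence intervals on ranks, here is a minimal sketch of the general Holm-style idea (not the exact procedure from our paper): the choice of Wilcoxon signed-rank tests, the naive Bonferroni-style correction and the use of scipy are assumptions made for the illustration only.

```python
# Minimal sketch of a Holm-style rank confidence interval (illustration only,
# not the cirank implementation). Assumptions: Wilcoxon signed-rank tests on
# paired per-case scores, a naive Bonferroni correction, higher scores = better.
import numpy as np
from scipy.stats import wilcoxon

def rank_intervals(scores, alpha=0.05):
    """scores: (n_algorithms, n_cases) array of per-case scores."""
    k = scores.shape[0]
    alpha_corr = alpha / (k * (k - 1))  # one one-sided test per ordered pair
    better = np.zeros((k, k), dtype=bool)  # better[i, j]: i significantly beats j
    for i in range(k):
        for j in range(k):
            diff = scores[i] - scores[j]
            if i == j or np.all(diff == 0):
                continue
            _, p = wilcoxon(diff, alternative="greater")
            better[i, j] = p < alpha_corr
    lower = 1 + better.sum(axis=0)  # how many algorithms significantly beat i
    upper = k - better.sum(axis=1)  # k minus how many i significantly beats
    return list(zip(lower.tolist(), upper.tolist()))

# 5 algorithms evaluated on 20 test cases (random scores, so wide intervals)
rng = np.random.default_rng(0)
print(rank_intervals(rng.random((5, 20))))
```

An interval spanning the whole range then simply means that the test set cannot confidently separate that algorithm from any of the others.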
Several options for these statistical tests are evaluated using a Monte Carlo simulation on synthetic data. The procedure that appears to be the most robust based on our experiments is:
We release with this paper the cirank Python library, which you can use to compute the confidence intervals with:

from cirank import ci_ranking
import numpy as np  # for the example

# example results:
results = [np.random.random((10,)) for _ in range(5)]

# default method
rankings = ci_ranking(results)
print(rankings)
This paper (and the library) has significant limitations in scope, and both are likely to be expanded in the future – as discussed in the paper.
Reference:
A. Foucart, A. Elskens, C. Decaestecker
Ranking the scores of algorithms with confidence
ESANN 2025 (Accepted).
S. Holm. Confidence intervals for ranks. https://www.diva-portal.org/smash/get/diva2:634016/fulltext01.pdf, 2013.↩︎
Advent of Code is a yearly event, running since 2015. It is made by Eric Wastl, and it is “an Advent calendar of small programming puzzles”.
I had never heard of it until this year, when I started following the blog of Juha-Matti Santala, and saw him document his progress. I thought it would be fun – and I was right. I also thought that the practice of writing up how I came to the solutions was nice, and decided to do it as well. All of my notes on the puzzles can be found here: https://notes.adfoucart.be/aocode24. And all of my code, the good and the bad, is on my Gitlab.
One thing I really like about it is that there are basically no rules except the ones we bring with us. Eric posts a two-part puzzle every day, where you have to solve the first part in order to get the second part. You solve the puzzle by writing (or rather copy-pasting) a number into a box, and the only check that is made is whether you have the right number. It doesn’t matter which language you used, or whether your solution is clever, or efficient, or would even generalize to any inputs other than those he gave.
There is a leaderboard, but it is obviously just there for people who really want a leaderboard. The only metric used for it is how fast you solved the puzzle. I didn’t notice it was there until several days into the challenge.
So why do I do it? I decided to use it to get better at using the Python standard library efficiently, and to practice finding a good way to frame the problem so that the solution is generic and modular enough that, when part 2 is revealed, I can easily add to the existing code without modifying too many functions (ideally, none).
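To make that concrete, here is the kind of skeleton I aim for (a hypothetical example, not one of my actual solutions): parsing and solving are kept separate, so that part 2 ideally only adds a function.

```python
# Hypothetical skeleton for an Advent of Code day (not an actual puzzle solution).
# Parsing is isolated so that part 2 can reuse the parsed data without changes.
from pathlib import Path

def parse(raw: str) -> list[int]:
    """Turn the raw puzzle input into a convenient data structure."""
    return [int(line) for line in raw.splitlines() if line.strip()]

def part1(data: list[int]) -> int:
    return sum(data)

def part2(data: list[int]) -> int:
    # Added once part 2 is revealed; parse() and part1() stay untouched.
    return max(data) - min(data)

if __name__ == "__main__":
    data = parse(Path("input.txt").read_text())  # "input.txt" is a placeholder name
    print("Part 1:", part1(data))
    print("Part 2:", part2(data))
```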
Sometimes it works, sometimes not. We’re halfway through, and I’m reasonably satisfied with most of my solutions so far. If you’re into that kind of thing, I highly encourage doing it. All of the puzzles from the previous years are available as well.
Over on Aeon, psychologist Robert Epstein wrote a very interesting piece about how we often explain the functioning of the brain as if it were a computer, retrieving memories and processing information, despite the lack of scientific grounding for that metaphor. Our brains don’t “recall” information, they “re-live” it. Information is not “stored” in neurons: our brains constantly change based on our experiences, and events can trigger our brains into re-activating the areas that were activated during a previous event, thus creating the impression of a “memory” for us.
This, I think, is the flip side of a discussion that I have had very often this past year about whether AI as it exists today can qualify as “intelligence” in any way. Just like “the computer” is not a very good metaphor for the human brain, the human brain is not a good metaphor for how AI algorithms work, yet this metaphor has completely taken over the way we think about AI. In fact, the whole vocabulary around AI is filled with human-intelligence metaphors: neural networks, learning, AI that “explains its reasoning”… and obviously “intelligence” itself. These metaphors may be useful to get a surface-level intuition about how AI algorithms work, but they are very much untrue and should not be taken as more than that: metaphors.
There’s this idea that, if we can make an artificial neural network that is as complex as the human neural network, then we should be capable of creating an actual intelligence. But that rests on two hypotheses which are unsupported by evidence: that artificial neural networks are good models of human neural networks (they aren’t), and that human neural networks are solely responsible for human intelligence. That doesn’t seem to be the case either. Our brains work as part of an intricate, interwoven network of systems which all interact with each other in ways that are very hard to capture. Our neurons without our endocrine system, our immune system, etc., are a very incomplete snapshot of who we are. We can’t – and maybe won’t ever be able to – upload our brains to the cloud.
Artificial Intelligence is not Human Intelligence, and it doesn’t really make any sense to compare them in any way. Any AI algorithm has capabilities and limitations, and we can study those without falling into the trap of anthropomorphism. That’s why I don’t like it when LLMs are compared on benchmarks such as law exams, or common coding exercises. Those tests were designed for humans, to test capabilities that are often difficult for a human intelligence to acquire. For an AI, they don’t tell us much. Importantly, they don’t tell us that this AI is a good lawyer, or a good software developer. Yet these are still the kind of tests used to compare LLMs today. Perhaps the most widely used test right now is the “Massive Multitask Language Understanding” (MMLU) benchmark. The MMLU is composed “of multiple-choice questions from various branches of knowledge”, which were “manually collected by graduate and undergraduate students from freely available sources online”. But we aren’t using LLMs to solve multiple-choice questions from various exams and tests collected around the web. For any actual task that we have in mind for such a model, we would need to prove that the MMLU is a useful proxy to evaluate the capability of the model to perform that task.
The fact that the MMLU is collected from online sources also makes it extremely difficult to use as a benchmark for modern LLMs, which are often trained on very opaque datasets, also collected from all around the web. The risks of contamination are huge, and the measures taken by LLM developers to mitigate this issue are often doubtful, if they are described at all. The Gemini paper, for instance, states that they “search for and remove any evaluation data that may have been in our training corpus before using data for training”, but doesn’t provide any detail as to how this search was done. If they used the same method as GPT-4 (i.e. looking for exact matches on substrings of the questions), then the risk of contamination is high.
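For illustration, an exact-substring check of that kind boils down to something like the sketch below (a simplified version; the chunk length, the normalization and the example strings are assumptions, not GPT-4’s actual pipeline). A light paraphrase of a benchmark question slips right through it.

```python
# Simplified sketch of exact-substring decontamination (illustration only).
# A question counts as "contaminated" only if a fixed-length chunk of it appears
# verbatim in the training text, so any rephrasing defeats the check.

def is_contaminated(question: str, training_text: str, chunk_len: int = 50) -> bool:
    q = " ".join(question.lower().split())      # normalize whitespace and case
    t = " ".join(training_text.lower().split())
    starts = range(0, max(1, len(q) - chunk_len + 1), chunk_len)
    return any(q[i:i + chunk_len] in t for i in starts)

training = "the mitochondria is the powerhouse of the cell, as every student knows."
print(is_contaminated("The mitochondria is the powerhouse of the cell, as every student knows.", training))  # True
print(is_contaminated("Mitochondria serve as the powerhouse of cells, as all students know.", training))     # False
```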
There is often a very large gap between the performance of LLMs on benchmarks and their performance in real-world applications, and this is probably part of the reason: the benchmarks are not made for the evaluation of a computer program, but of a human brain… and those are way too different for it to work that way.