Research blog

Adrien Foucart, PhD in biomedical engineering.

Get sparse and irregular email updates by subscribing to https://adfoucart.substack.com. This website is guaranteed 100% human-written, ad-free and tracker-free.

Over on Aeon, psychologist Robert Epstein wrote a very interesting piece about how we often explain the functioning of the brain as if it were a computer, retrieving memories and processing information, despite the lack of scientific grounding for that metaphor. Our brains don’t “recall” information, they “re-live” it. Information is not “stored” in neurons: our brain constantly evolves based on our experiences, and events can trigger our brains into re-activating the areas that were active during a previous event, thus creating the impression of a “memory” for us.

This, I think, is the flip side of a discussion I have had very often this past year about whether AI as it exists today can qualify as “intelligence” in any way. Just as “the computer” is not a very good metaphor for the human brain, the human brain is not a good metaphor for how AI algorithms work, yet this metaphor has completely taken over the way we think about AI. In fact, the whole vocabulary around AI is filled with human-intelligence metaphors: neural networks, learning, AIs that “explain their reasoning”… and obviously “intelligence” itself. These metaphors may be useful to get a surface-level intuition of how AI algorithms work, but they are not literally true and should not be taken as more than that: metaphors.

There’s this idea that, if we can make an artificial neural network that is as complex as the human neural network, then we should be capable of creating an actual intelligence. But that rests on two hypotheses which are unsupported by evidence: that artificial neural networks are good models of human neural networks (they aren’t), and that human neural networks are solely responsible for human intelligence (that doesn’t seem to be the case either). Our brains work as part of an intricate, interwoven network of systems which all interact with each other in ways that are very hard to capture. Our neurons without our endocrine system, our immune system, etc., are a very incomplete snapshot of who we are. We can’t – and maybe won’t ever be able to – upload our brains to the cloud.

Artificial Intelligence is not Human Intelligence, and it doesn’t really make any sense to compare them in any way. Any AI algorithm has capabilities and limitations, and we can study those without falling into the trap of anthropomorphism. That’s why I don’t like it when LLMs are compared on benchmarks such as law exams or common coding exercises. Those tests were designed for humans, to test capabilities that are often difficult for humans to acquire. For an AI, passing them doesn’t tell us much. Importantly, it doesn’t tell us that this AI is a good lawyer, or a good software developer. Yet these are still the kinds of tests used to compare LLMs today. Perhaps the most widely used benchmark right now is the “Massive Multitask Language Understanding” (MMLU) test. The MMLU is composed “of multiple-choice questions from various branches of knowledge”, which were “manually collected by graduate and undergraduate students from freely available sources online”. But we aren’t using LLMs to solve multiple-choice questions from various exams and tests collected around the web. For any actual task that we have in mind for such a model, we would need to prove that the MMLU is a useful proxy for evaluating the capability of the model to perform that task.

The fact that the MMLU is collected from online sources also makes it extremely difficult to use it as a benchmark for modern LLMs, which are often trained on very opaque datasets, also collected from all around the web. The risks of contamination are huge, and the measures taken by LLM developers to mitigate this issue are often doubtful, if they are even described at all. The Gemini paper, for instance, states that they “search for and remove any evaluation data that may have been in our training corpus before using data for training”, but don’t provide any detail as to how this search was done. If they used the same method as GPT-4 (i.e. looking for exact matches on substrings of the questions), then the risk of contamination is high.
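
For context, the check described for GPT-4 boils down to exact substring matching between evaluation questions and training documents. A minimal sketch of that kind of check (my own simplification – the substring length and the sampling are assumptions, not OpenAI’s actual code) makes the limitation obvious: any paraphrased, translated or reformatted version of a question slips right through.

import random

def is_contaminated(question: str, training_docs, n_samples: int = 3, length: int = 50) -> bool:
    """Flag an evaluation question as contaminated if a few randomly sampled
    substrings of it appear verbatim in a training document."""
    q = " ".join(question.lower().split())  # crude normalization
    if len(q) <= length:
        samples = [q]
    else:
        starts = [random.randrange(len(q) - length + 1) for _ in range(n_samples)]
        samples = [q[s:s + length] for s in starts]
    for doc in training_docs:
        d = " ".join(doc.lower().split())
        if any(s in d for s in samples):
            return True
    return False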

There is often a very large gap between the performance of LLMs on benchmarks and their performance in real-world applications, and this is probably part of the reason: the benchmarks were not made to evaluate a computer program, but a human brain… and those are way too different for it to work that way.

I must confess that, sometimes, I Google myself. I typically know what I’ll find: I have enough of an online presence that most of the results are related to me (and one other Adrien Foucart who competed in Judo fifteen years ago), and it’s typically a mix of my blogs, scientific papers, social media, and from time to time a post about things I’ve written or done. Finding those is the main reason that I do this exercise every once in a while. Today, however, I got a surprising result, from a website that I had never heard about: SciSpace.

SciSpace is yet another GPT-powered chatbot, aimed at scientists who want to outsource their research to a machine. You ask a question, it answers with a summary built from scientific papers, with citations to those papers so that you can read them if you want to do some work somewhere in the process. I don’t think it’s a good idea: doing that kind of summary yourself is how you actually gain an understanding of your field, and you’ll necessarily miss a lot of the nuance of what’s happening in the field if you just read the AI-generated “summary”. So even if it worked perfectly as advertised, I wouldn’t recommend using it. But the reason I’m writing this is that it fails pretty spectacularly at its job.

It seems that SciSpace allows you to browse questions, presumably asked by other users. Google indexed a question where, surprisingly, I appeared in the answers. I say surprisingly because the question is not quite in my field: “What are the specific cultural criticisms associated with the implementation of Panopticon in various societies?”

The beginning of the answer seems to be on-topic, although since it’s a topic I know nothing about, it could all be bullshit for all I know. But it’s around the end that I suddenly appear, with this tangent:

Lastly, Adrien Foucart and colleagues critique the Panoptic Quality metric in digital pathology, illustrating the challenges of applying panoptic principles to complex, nuanced fields [10].

This is a reference to my Scientific Reports paper “Panoptic quality should be avoided as a metric for assessing cell nuclei segmentation and classification in digital pathology”, which is probably the paper I like the most out of my publications, but which has absolutely nothing to do with the Panopticon, outside of the fact that there is “Panoptic” in the title.

And this, I think, is where GPT went wrong. Because GPT predicts tokens, which encode parts of words, “Panoptic” and “Panopticon” likely share one or several tokens. That alone shouldn’t be enough to trip up the model, but it also happens that one of the names most associated with discussions of the Panopticon is Michel Foucault, and “Foucault” also shares at least one token with “Foucart”. Taken together, this probably sent SciSpace’s GPT in the wrong direction. Because, as we should all know by now, GPT has no fucking idea what it’s talking about. It’s all just a statistically likely string of tokens, with no understanding whatsoever of what’s going on.
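
The shared-token hunch is easy to check, by the way. A minimal sketch, assuming the tiktoken package and its cl100k_base encoding (the one used by GPT-3.5-era models); it only prints the actual splits, I’m not claiming in advance what they are:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["Panoptic", "Panopticon", "Foucart", "Foucault"]:
    token_ids = enc.encode(word)
    # show the token ids and the text fragment each one decodes to
    print(word, token_ids, [enc.decode([t]) for t in token_ids])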

So Foucault and the Panopticon end up mixed with Foucart and Panoptic Quality, even though these come from completely different domains. I should also note that for some reason my paper is mistakenly cited as coming from “Dental science reports” instead of “Scientific Reports”. No idea what happened there, but it’s another clear sign that this tool is absolutely useless.

From the CEO’s LinkedIn profile, the platform is “used by more than a million researchers worldwide, including Nobel Laureates”. I highly doubt, however, that Nobel prize worthy science will be done with it anytime soon.

I recently reviewed a paper which I strongly suspect, without certainty, had a little help from ChatGPT for its research. One thing that tipped me off was that it was a bit confused about the evaluation metrics: it reported, for a segmentation task, both the “F1” score and the “Dice” score, which it defined as:

$$F1 = \frac{2 \times precision \times recall}{precision + recall}$$

$$Dice = \frac{2 \times |P \cap G|}{|P| + |G|}$$

where P and G refer to the predicted and ground truth regions.

Both of these definitions are fine, but… well, before I explain the problem, let’s see what ChatGPT (version 3.5) has to say about it.

Asking the question

User:  
What is the difference between the F1 score and the Dice Similarity Coefficient ?

ChatGPT’s answer:

The F1 score and the Dice Similarity Coefficient (also known as Dice coefficient or Dice index) are both metrics commonly used to evaluate the performance of binary classification tasks, particularly in the context of image segmentation or medical imaging. While they serve similar purposes, they have some differences in how they are calculated and what aspects of the classification performance they emphasize.

  1. F1 Score:
  • The F1 score is the harmonic mean of precision and recall.
  • Precision measures the ratio of true positive predictions to the total number of positive predictions made by the classifier.
  • Recall measures the ratio of true positive predictions to the total number of actual positive instances in the dataset.
  • The F1 score balances both precision and recall, giving equal weight to both. It is calculated as: F1 = 2 * (precision * recall) / (precision + recall).
  • F1 score ranges from 0 to 1, where a higher value indicates better performance.
  2. Dice Similarity Coefficient:
  • The Dice coefficient measures the overlap between two samples.
  • It is calculated as twice the intersection of the two samples divided by the sum of the sizes of the two samples.
  • In the context of binary classification tasks, one sample typically represents the predicted segmentation mask, while the other represents the ground truth mask.
  • Dice coefficient ranges from 0 to 1, where 0 indicates no overlap between the predicted and ground truth masks, and 1 indicates perfect overlap.

Differences

  • While both metrics are used to evaluate binary classification tasks, they emphasize different aspects of classification performance.
  • F1 score emphasizes the balance between precision and recall, which is particularly useful when there is an imbalance between the classes in the dataset.
  • Dice coefficient emphasizes the overlap between the predicted and ground truth masks, which is especially relevant in tasks such as image segmentation, where the spatial agreement between the predicted and true regions is crucial.
  • The F1 score can penalize false positives and false negatives equally, while the Dice coefficient may be less sensitive to false negatives if the overlap between the predicted and true regions is still significant.

In summary, both metrics are valuable tools for evaluating classification performance, but the choice between them depends on the specific requirements and characteristics of the task at hand.

Dissecting the answer

ChatGPT is verbose as always, but the two definitions are mostly correct. Nothing much to say about the F1-Score, and for Dice the only real problem I have is with “measures the overlap,” as I’d rather say “is one of the measures of overlap,” the other main one being the Jaccard Index, also known as “Intersection over Union.”

The “differences” section is where I have way more problems. F1 score is balanced in terms of precision and recall, but it is not at all appropriate when there is a class imbalance (see Chapter 4 of my thesis for more!). And “the Dice coefficient may be less sensitive to false negatives if the overlap between the predicted and true regions is still significant” is nonsense.

But that’s still not the main issue. Let’s see if I can get ChatGPT to notice it.

One step further

User:  
Express both the F1 score and the Dice Similarity Coefficient in terms of True Positives, False Positives and False Negatives.

ChatGPT’s answer

Sure, here are the formulas for both the F1 score and the Dice Similarity Coefficient expressed in terms of True Positives (TP), False Positives (FP), and False Negatives (FN):

  • F1 Score:
    • Precision (P) = TP / (TP + FP)
    • Recall (R) = TP / (TP + FN)
    • F1 Score = 2 * (P * R) / (P + R)
  • Dice Similarity Coefficient:
    • Dice = (2 * TP) / (2 * TP + FP + FN)

In both cases, TP represents the number of true positive predictions, FP represents the number of false positive predictions, and FN represents the number of false negative predictions. These formulas allow us to calculate both metrics based on the counts of these classification outcomes.

Almost there

The formulas are still correct ! But weirdly, ChatGPT didn’t exactly do as I asked. It still puts the F1 Score in terms of Precision and Recall, not in terms of TP, FP and FN. Let’s prompt it a bit further.

User:  
You put the F1 Score in terms of P and R, give the complete formula for the F1 Score based on TP, FP and FN.

ChatGPT’s answer

Apologies for the oversight. Here’s the complete formula for the F1 score expressed in terms of True Positives (TP), False Positives (FP), and False Negatives (FN):

F1 Score = (2 * TP) / (2 * TP + FP + FN)

This formula directly uses the counts of true positives, false positives, and false negatives to compute the F1 score, without needing to calculate precision and recall separately.

There we go !

So let’s put the two formulas next to each other:

$$Dice = \frac{2 \times TP}{2 \times TP + FP + FN}$$

$$F1 Score = \frac{2 \times TP}{2 \times TP + FP + FN}$$

They are the same thing ! Dice and F1 Score are two different names for the same metric. The only real difference is where each name is typically used: F1 is the more common terminology in classification and detection problems, whereas Dice is more often used for segmentation problems. But they are the same.
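
If you don’t trust the algebra, a short numerical check does it (the counts are arbitrary):

TP, FP, FN = 80, 15, 25   # arbitrary counts

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
dice = 2 * TP / (2 * TP + FP + FN)

print(f1, dice)   # both are 0.8
assert abs(f1 - dice) < 1e-9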

All the talk about their differences was complete bullshit. But it would have been relatively difficult to spot without already knowing the information. Which is always the problem with trying to use ChatGPT as a source of knowledge.

Conclusions

It was not a very good paper, and whether the authors misunderstood evaluation metrics because of ChatGPT or all on their own, I don’t think their study will be published, at least in the journal they were aiming for.

But after more than a year of people trying to convince me that ChatGPT really is usually right, and useful as a source of information… I remain firmly convinced that it’s only “right” if you don’t know the topic and can’t fact-check the answer fully. In this case, the information is easily found on Google (even on Bing !). On Google, the first line I see with the same original prompt is: “Dice coefficient = F1 score,” from a Kaggle notebook.

Sure, “if I upgrade to the paid version” it may be better. The only thing that really makes ChatGPT-Plus better is the capacity to search the web. I’d rather do that directly on my own – and see the information in its context, so that I can critically assess it.

Note: the code for all of this is available on Gitlab here: https://gitlab.ulb.be/lisa/tcgasampler. It may even perhaps work.

The goal

I would like to have a relatively big stock of TCGA whole-slide images, from various tissue locations and cancer types, and at a relatively low resolution, so that I can quickly test algorithms such as tissue segmentation or artifact detection that generally don’t require super-high magnification.

The National Cancer Institute provides an API to access their datasets programmatically, so this is what I’d like to do:

  • Get the list of all the diagnostic tissue slides (see the sketch after this list).
  • Randomly select N slides.
  • Download and save a copy of those slides at a chosen resolution.
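
For the first step, the script below works from a manifest file, but the same list can presumably also be pulled straight from the GDC “files” endpoint. A minimal sketch – the filter fields and values are my reading of the GDC API documentation, so they may need adjusting:

import json
import requests

filters = {"op": "and", "content": [
    {"op": "=", "content": {"field": "cases.project.program.name", "value": "TCGA"}},
    {"op": "=", "content": {"field": "experimental_strategy", "value": "Diagnostic Slide"}},
    {"op": "=", "content": {"field": "data_format", "value": "SVS"}},
]}
params = {
    "filters": json.dumps(filters),
    "fields": "file_id,file_name",
    "format": "TSV",
    "size": "20000",   # large enough to get the whole list in one page
}
response = requests.get("https://api.gdc.cancer.gov/files", params=params)
with open("diagnostic_slides.tsv", "w") as fp:
    fp.write(response.text)   # columns follow the requested fields, not the manifest format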

Accessing image data

There are basically two ways to access image data for a given slide:

  • Downloading the entire whole-slide image using the “data endpoint” of the API: https://api.gdc.cancer.gov/data/{file_id}
  • Directly loading the tiles from their tile server, used by their online slide viewer: https://portal.gdc.cancer.gov/auth/api/v0/tile/{file_id}?level={level}&x={x}&y={y}.

Both options come with their own set of problems.

In the first case, we have to download a full-resolution slide, which is often 1 GB or more, just to extract the low-resolution image from it (and then delete the full WSI to avoid filling up local storage too quickly).

In the second case… we don’t have access to any of the metadata of the slide, including, crucially, the resolution and magnification level. This means that we have to kinda guess which “level” – as understood by the tile server – we have to work at. Of course, the “levels” of the tile server are not the same as the “levels” of the image pyramid in the slide file. But at least we only download the data that we want, which is a lot faster.

So I’ve tried both: with the first option, I can have exactly the resolution I want for all images, but it takes a ton of time; with the second, it’s faster, but the resolution varies from slide to slide.

The overall pipeline

We start by setting a random seed for repeatability, and then we load and shuffle the file ids from the manifest:

import random

# Fixed seed, so that the same random sample of slides can be reproduced
random.seed(RANDOM_SEED)

# The manifest is a TSV file with a header row; each following row describes one slide
with open(PATH_TO_MANIFEST, 'r') as fp:
    all_tissue_slides = [line.strip().split('\t') for line in fp.readlines()]

rows = all_tissue_slides[1:]          # drop the header row
row_ids = list(range(len(rows)))
random.shuffle(row_ids)               # shuffle indices rather than the rows themselves

We can then iterate through the row_ids to get the file_id and filename for each slide. This will randomly sample through the full TCGA dataset.
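
Concretely, the sampling loop looks something like this (a sketch; the first two columns of the manifest are assumed to be the file id and the filename):

N = 100   # number of slides to sample
for row_id in row_ids[:N]:
    file_id, filename = rows[row_id][0], rows[row_id][1]
    # ... Option A or Option B below, using file_id and filename ...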

Option A: downloading full slides

In the first option, we use the data endpoint to load the whole-slide file and save it to the disk:

data_endpt = f"{GDC_DATA_ENDPT}/{file_id}"   # the "data endpoint" URL for this slide
response = requests.get(data_endpt, headers={"Content-Type": "application/json"})
with open(os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}"), "wb") as fp:
    fp.write(response.content)

Then we can use our WholeSlide class (available in our openwholeslide package) to extract the image at the target resolution:

wsi = WholeSlide(os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}"))
region = wsi.read_full(resolution=TARGET_RESOLUTION)   # whole slide image at the target resolution
imsave(os.path.join(TCGA_LOCAL_PATH, f"images/{filename}.png"), region.as_ndarray)
wsi.slide.close()

And finally we remove the downloaded file, so that the disk doesn’t fill up:

os.remove(os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}"))

Option B: downloading tiles

Getting a tile from the GDC API requires providing the file id – which we have from the manifest – along with a level and an “x-y” position. The “x-y” position is a tile index, so that the first tile is (x=0, y=0), the one to its right is (x=1, y=0), etc. The level is related to the size of the image. It’s not really documented, but from playing a bit with the API it’s clear that the system basically starts from a 1x1 px image at level 0, then doubles the dimensions at each level. At some point it diverges from the expected 2x2, 4x4, 8x8, etc., so that the real aspect ratio is preserved. For instance, for the image with id 455d9335-c6f3-4966-8b3c-1291e2d31093, we have:

  • level 0: 1x1
  • level 1: 2x2
  • level 2: 4x3
  • level 3: 7x6
  • level 4: 14x11
  • …
  • level 9: 428x351

We can get some metadata from the API using: https://portal.gdc.cancer.gov/auth/api/v0/tile/metadata/455d9335-c6f3-4966-8b3c-1291e2d31093. It tells us the full size of the image at maximum resolution (here 109559x89665), and the tile size (here 512). So we know that above level 9, we will start to have multiple tiles.
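
We can sanity-check this level arithmetic with a few lines, assuming the usual deep-zoom convention where the top level is the full-resolution image and every level below it halves the dimensions – which matches the sizes listed above:

import math

full_width, full_height = 109559, 89665   # from the metadata endpoint

max_level = math.ceil(math.log2(max(full_width, full_height)))   # 17 for this slide
for level in range(max_level + 1):
    scale = 2 ** (max_level - level)
    print(level, math.ceil(full_width / scale), math.ceil(full_height / scale))
    # level 9 -> 428x351, as observed; level 11 -> 1712x1402, the largest level
    # that stays below 2048px in its biggest dimension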

Let’s try to grab the images at a resolution such that the image is, at most, 2048px in its largest dimension. For that, we need to use level 11. But first, we need to grab the metadata to get the aspect ratio and have an idea of the total size of the resulting image.

meta_url = f"{GDC_META_ENDPT}/{file_id}"

response_meta = requests.get(meta_url)
metadata = json.loads(response_meta.content)
full_width = int(metadata['Width'])
full_height = int(metadata['Height'])
overlap = int(metadata['Overlap'])
tile_size = int(metadata['TileSize'])

if full_width > full_height:
    expected_width = 2048
    expected_height = math.ceil(2048 * full_height/full_width)
else:
    expected_width = math.ceil(2048 * full_width / full_height)
    expected_height = 2048

downloaded_image = np.zeros((expected_height, expected_width, 3), dtype=np.uint8)

We will then use the tile_size to compute the maximum number of tiles per dimension, and slowly fill the image from the tiles. We also keep track in max_x and max_y of the real size of the resulting image so that we can crop it from the temporary black image at the end.

tile_url = lambda x, y: f"{GDC_TILE_ENDPT}/{file_id}?level=11&x={x}&y={y}"

max_tiles = (2048 // (tile_size-overlap)) + 1   # upper bound on the number of tiles per dimension
max_x = 0
max_y = 0
for y in range(max_tiles):
    for x in range(max_tiles):
        response_img = requests.get(tile_url(x, y))
        if response_img.ok:  # requests for tiles outside the image fail, so we just skip them
            image = imread(BytesIO(response_img.content))
            startx = x*(tile_size-overlap)
            starty = y*(tile_size-overlap)
            downloaded_image[starty:starty+image.shape[0], startx:startx+image.shape[1]] = image[..., :3]
            max_x = max(max_x, startx+image.shape[1])
            max_y = max(max_y, starty+image.shape[0])

imsave(down_path, downloaded_image[:max_y, :max_x])   # crop the unused black border before saving

Conclusions

With Option A, it took me about 11 hours to get 100 random images at a resolution of 15µm/px – but at least I know their resolution. It took only 2 hours to get the same number of images with Option B… but with varying resolutions.

In the end, I think both options can be useful ways of randomly sampling TCGA data without storing the full WSIs locally. Which, given their 12.95TB size, I’d rather not do.