Research blog

Adrien Foucart, PhD in biomedical engineering.

Get sparse and irregular email updates by subscribing to https://adfoucart.substack.com. This website is guaranteed 100% human-written, ad-free and tracker-free.

I recently reviewed a paper which I strongly suspect, without certainty, had a little help from ChatGPT for its research. One thing that tipped me off was that it was a bit confused about the evaluation metrics: it reported, for a segmentation task, both the “F1” score and the “Dice” score, which it defined as:

$$F1 = \frac{2 \times precision \times recall}{precision + recall}$$

$$Dice = \frac{2 \times |P \cap G|}{|P| + |G|}$$

Where P & G refer to the predicted and ground truth regions.

Both of these definitions are fine, but… well, before I explain the problem, let’s see what ChatGPT (version 3.5) has to say about it.

Asking the question

User:  
What is the difference between the F1 score and the Dice Similarity Coefficient ?

ChatGPT’s answer:

The F1 score and the Dice Similarity Coefficient (also known as Dice coefficient or Dice index) are both metrics commonly used to evaluate the performance of binary classification tasks, particularly in the context of image segmentation or medical imaging. While they serve similar purposes, they have some differences in how they are calculated and what aspects of the classification performance they emphasize.

  1. F1 Score:
  • The F1 score is the harmonic mean of precision and recall.
  • Precision measures the ratio of true positive predictions to the total number of positive predictions made by the classifier.
  • Recall measures the ratio of true positive predictions to the total number of actual positive instances in the dataset.
  • The F1 score balances both precision and recall, giving equal weight to both. It is calculated as: F1 = 2 * (precision * recall) / (precision + recall).
  • F1 score ranges from 0 to 1, where a higher value indicates better performance.
  2. Dice Similarity Coefficient:
  • The Dice coefficient measures the overlap between two samples.
  • It is calculated as twice the intersection of the two samples divided by the sum of the sizes of the two samples.
  • In the context of binary classification tasks, one sample typically represents the predicted segmentation mask, while the other represents the ground truth mask.
  • Dice coefficient ranges from 0 to 1, where 0 indicates no overlap between the predicted and ground truth masks, and 1 indicates perfect overlap.

Differences

  • While both metrics are used to evaluate binary classification tasks, they emphasize different aspects of classification performance.
  • F1 score emphasizes the balance between precision and recall, which is particularly useful when there is an imbalance between the classes in the dataset.
  • Dice coefficient emphasizes the overlap between the predicted and ground truth masks, which is especially relevant in tasks such as image segmentation, where the spatial agreement between the predicted and true regions is crucial.
  • The F1 score can penalize false positives and false negatives equally, while the Dice coefficient may be less sensitive to false negatives if the overlap between the predicted and true regions is still significant.

In summary, both metrics are valuable tools for evaluating classification performance, but the choice between them depends on the specific requirements and characteristics of the task at hand.

Dissecting the answer

ChatGPT is verbose as always, but the two definitions are mostly correct. Nothing much to say about the F1-Score, and for Dice the only real problem I have is with “measures the overlap,” as I’d rather say “is one of the measures of overlap.” The other main one being the Jaccard Index, also known as “Intersection over Union.”
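
For reference, with the same notations as above:

$$Jaccard = IoU = \frac{|P \cap G|}{|P \cup G|}$$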

The “differences” section is where I have way more problems. F1 score is balanced in terms of precision and recall, but it is not at all appropriate when there is a class imbalance (see Chapter 4 of my thesis for more!). And “the Dice coefficient may be less sensitive to false negatives if the overlap between the predicted and true regions is still significant” is nonsense.

But that’s still not the main issue. Let’s see if I can get ChatGPT to notice it.

One step further

User:  
Express both the F1 score and the Dice Similarity Coefficient in terms of True Positives, False Positives and False Negatives.

ChatGPT’s answer

Sure, here are the formulas for both the F1 score and the Dice Similarity Coefficient expressed in terms of True Positives (TP), False Positives (FP), and False Negatives (FN):

  • F1 Score:
    • Precision (P) = TP / (TP + FP)
    • Recall (R) = TP / (TP + FN)
    • F1 Score = 2 * (P * R) / (P + R)
  • Dice Similarity Coefficient:
    • Dice = (2 * TP) / (2 * TP + FP + FN)

In both cases, TP represents the number of true positive predictions, FP represents the number of false positive predictions, and FN represents the number of false negative predictions. These formulas allow us to calculate both metrics based on the counts of these classification outcomes.

Almost there

The formulas are still correct ! But weirdly, ChatGPT didn’t exactly do as I asked. It still puts the F1 Score in terms of Precision and Recall, not in terms of TP, FP and FN. Let’s prompt it a bit further.

User:  
You put the F1 Score in terms of P and R, give the complete formula for the F1 Score based on TP, FP and FN.

ChatGPT’s answer

Apologies for the oversight. Here’s the complete formula for the F1 score expressed in terms of True Positives (TP), False Positives (FP), and False Negatives (FN):

F1 Score = (2 * TP) / (2 * TP + FP + FN)

This formula directly uses the counts of true positives, false positives, and false negatives to compute the F1 score, without needing to calculate precision and recall separately.

There we go !

So let’s put the two formulas next to each other:

$$Dice = \frac{2 \times TP}{2 \times TP + FP + FN}$$

$$F1\ Score = \frac{2 \times TP}{2 \times TP + FP + FN}$$

They are the same thing ! Dice and F1 score are simply two names for the same metric. The only real difference is where they tend to be used: F1 is the more common terminology in classification and detection problems, whereas Dice is more often used for segmentation problems. But they are the same.
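
If you want to double-check, substitute the definitions of precision and recall in terms of TP, FP and FN into the F1 formula:

$$F1 = \frac{2 \times \frac{TP}{TP+FP} \times \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}} = \frac{2 \times TP^2}{TP \times (TP + FN) + TP \times (TP + FP)} = \frac{2 \times TP}{2 \times TP + FP + FN}$$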

All the talk about their differences was complete bullshit. But it would have been relatively difficult to spot without already knowing the information. Which is always the problem with trying to use ChatGPT as a source of knowledge.

Conclusions

It was not a very good paper, and whether the authors misunderstood evaluation metrics because of ChatGPT or all on their own, I don’t think their study will be published, at least in the journal they were aiming for.

But after more than a year of people trying to convince me that ChatGPT really is usually right, and useful as a source of information… I remain firmly convinced that it only looks “right” when you don’t know the topic well enough to fully fact-check the answer. In this case, the information is easily found on Google (even on Bing !). On Google, the first line I see with the same original prompt is: “Dice coefficient = F1 score,” from a Kaggle notebook.

Sure, “if I upgrade to the paid version” it may be better. The only thing that really makes ChatGPT-Plus better is the capacity to search the web. I’d rather do that directly on my own – and see the information in its context, so that I can critically assess it.

Note: the code for all of this is available on Gitlab here: https://gitlab.ulb.be/lisa/tcgasampler. It may even perhaps work.

The goal

I would like to have a relatively big stock of TCGA whole-slide images, from various tissue locations and cancer types, and at a relatively low resolution, so that I can quickly test algorithms such as tissue segmentation or artifact detection that generally don’t require super-high magnification.

The National Cancer Institute provides an API to access their datasets programmatically, so this is what I’d like to do (a sketch of the first step is shown after the list):

  • Get the list of all the diagnostic tissue slides.
  • Randomly select N slides.
  • Download and save a copy of those slides at a chosen resolution.
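
As an illustration of the first step, here is a rough sketch of how the list of diagnostic slides could be queried directly from the GDC files endpoint. I’m hedging here: the filter fields and values (experimental_strategy, "Diagnostic Slide", etc.) are assumptions on my part and should be checked against the GDC API documentation; the manifest used below provides the same information.

# Rough sketch: querying the GDC "files" endpoint for TCGA diagnostic slides.
# The filter fields and values are assumptions to verify against the GDC API docs.
import json
import requests

FILES_ENDPT = "https://api.gdc.cancer.gov/files"

filters = {"op": "and", "content": [
    {"op": "in", "content": {"field": "cases.project.program.name", "value": ["TCGA"]}},
    {"op": "in", "content": {"field": "experimental_strategy", "value": ["Diagnostic Slide"]}},
    {"op": "in", "content": {"field": "data_format", "value": ["SVS"]}},
]}
params = {"filters": json.dumps(filters),
          "fields": "file_id,file_name",
          "format": "JSON",
          "size": "20000"}  # large enough to get all hits in one response

response = requests.get(FILES_ENDPT, params=params)
hits = response.json()["data"]["hits"]  # each hit has "file_id" and "file_name"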

Accessing image data

There are basically two ways to access image data for a given slide:

  • Downloading the entire whole-slide image using the “data endpoint” of the API: https://api.gdc.cancer.gov/data/{file_id}
  • Directly loading the tiles from their tile server, used by their online slide viewer: https://portal.gdc.cancer.gov/auth/api/v0/tile/{file_id}?level={level}&x={x}&y={y}.

Both options come with their own set of problems.

In the first case, we have to download a full-resolution slide, which is often 1 GB or more, just to extract the low-resolution image from it (and then delete the full WSI to avoid filling up local storage too quickly).

In the second case… we don’t have access to any of the metadata of the slide, including, crucially, the resolution and magnification level. This means that we have to kinda guess at which “level” – as understood by the slide server – we have to work. Of course, the “levels” of the slide server are not the same as the “levels” of the image pyramid in the slide file. But at least we only download the data that we want, which is a lot faster.

So I’ve tried both: with the first option, I can have exactly the resolution I want for all images, but it takes a ton of time; with the second option, it’s faster but there will be variations in the resolution.

The overall pipeline

We start by setting a random seed for repeatability, and then we load and shuffle the file ids from the manifest:

import random

random.seed(RANDOM_SEED)  # fixed seed for repeatability
with open(PATH_TO_MANIFEST, 'r') as fp:
    all_tissue_slides = [line.strip().split('\t') for line in fp.readlines()]

rows = all_tissue_slides[1:]  # skip the manifest header row
row_ids = list(range(len(rows)))
random.shuffle(row_ids)

We can then iterate through the row_ids to get the file_id and filename for each slide. This will randomly sample through the full TCGA dataset.
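
The loop itself could look something like this – a minimal sketch, assuming the manifest’s first two columns are the file id and the filename (the usual GDC manifest layout), with N the number of slides we want:

N = 100  # number of slides to sample
for row_id in row_ids[:N]:
    file_id, filename = rows[row_id][0], rows[row_id][1]
    # ... download the slide with Option A or Option B (see below) ...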

Option A: downloading full slides

In the first option, we use the data endpoint to load the whole-slide file and save it to the disk:

data_endpt = f"{GDC_DATA_ENDPT}/{file_id}"
response = requests.get(data_endpt, headers={"Content-Type": "application/json"})
with open(os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}"), "wb") as fp:
    fp.write(response.content)

Then we can use our WholeSlide class (available in our openwholeslide package) to extract the image at the target resolution:

wsi = WholeSlide(os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}"))
region = wsi.read_full(resolution=TARGET_RESOLUTION)
imsave(os.path.join(TCGA_LOCAL_PATH, f"images/{filename}.png"), region.as_ndarray)
wsi.slide.close()

And finally we remove the downloaded file, so that the disk doesn’t fill up:

os.remove(os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}"))

Option B: downloading tiles

Getting a tile from the GDC API requires providing the file id – which we have from the manifest – alongside the level and an “x-y” position. The “x-y” position is a tile index: the first tile is (x=0, y=0), the one to its right is (x=1, y=0), etc. The level is related to the size of the image. It’s not really documented, but from playing a bit with the API it’s clear that the system basically starts from a 1x1 px image at level 0, then increases by powers of 2. At some point it diverges from the expected 2x2, 4x4, 8x8, etc., so that the real aspect ratio is preserved. For instance, for the image with id 455d9335-c6f3-4966-8b3c-1291e2d31093, we have:

  • level 0: 1x1
  • level 1: 2x2
  • level 2: 4x3
  • level 3: 7x6
  • level 4: 14x11
  • …
  • level 9: 428x351

We can get some metadata from the API using: https://portal.gdc.cancer.gov/auth/api/v0/tile/metadata/455d9335-c6f3-4966-8b3c-1291e2d31093. It tells us the full size of the image at maximum resolution (here 109559x89665), and the tile size (here 512). So we know that above level 9, we will start to have multiple tiles.
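
This looks like the usual DeepZoom convention, where the maximum level is ceil(log2(largest dimension)) and each level below halves the size. Assuming that convention holds, we can sketch how to pick the level for a target maximum dimension instead of finding it by hand:

import math

def level_for_max_dim(full_width: int, full_height: int, target: int = 2048) -> int:
    # Largest level whose biggest dimension is at most `target`, assuming a
    # DeepZoom-style pyramid (size halved at each level below the maximum).
    max_level = math.ceil(math.log2(max(full_width, full_height)))
    downscalings = math.ceil(math.log2(max(full_width, full_height) / target))
    return max_level - max(0, downscalings)

# For the example slide above (109559x89665), this gives level 11.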

Let’s try to grab the images at a resolution such that the image is, at most, 2048px in its largest dimension. For that, we need to use level 11. But first, we need to grab the metadata to get the aspect ratio and have an idea of the total size of the resulting image.

meta_url = f"{GDC_META_ENDPT}/{file_id}"

response_meta = requests.get(meta_url)
metadata = json.loads(response_meta.content)
full_width = int(metadata['Width'])
full_height = int(metadata['Height'])
overlap = int(metadata['Overlap'])
tile_size = int(metadata['TileSize'])

if full_width > full_height:
    expected_width = 2048
    expected_height = math.ceil(2048 * full_height/full_width)
else:
    expected_width = math.ceil(2048 * full_width / full_height)
    expected_height = 2048

downloaded_image = np.zeros((expected_height, expected_width, 3), dtype=np.uint8)

We will then use the tile_size to compute the maximum number of tiles per dimension, and slowly fill the image from the tiles. We also keep track in max_x and max_y of the real size of the resulting image so that we can crop it from the temporary black image at the end.

tile_url = lambda x, y: f"{GDC_TILE_ENDPT}/{file_id}?level=11&x={x}&y={y}"

max_tiles = (2048 // (tile_size-overlap)) + 1
max_x = 0
max_y = 0
for y in range(max_tiles):
    for x in range(max_tiles):
        response_img = requests.get(tile_url(x, y))
        if response_img.ok:
            image = imread(BytesIO(response_img.content))
            startx = x*(tile_size-overlap)
            starty = y*(tile_size-overlap)
            downloaded_image[starty:starty+image.shape[0], startx:startx+image.shape[1]] = image[..., :3]
            max_x = max(max_x, startx+image.shape[1])
            max_y = max(max_y, starty+image.shape[0])

imsave(down_path, downloaded_image[:max_y, :max_x])

Conclusions

With Option A, it took me about 11 hours to get 100 random images at a resolution of 15µm/px – but at least I know their resolution. It took only 2 hours to get the same number of images with Option B… but with varying resolutions.

In the end, I think both options can be useful ways of randomly sampling TCGA data without storing the full WSIs locally. Which, given their 12.95TB size, I’d rather not do.

Over on Instagram, I saw a post making fun of Javascript’s weird and unpredictable (unless you really know Javascript) typecasting behaviour. There is a very long tradition of mocking Javascript, with perhaps the most famous instance being the “Wat” video from Gary Bernhardt, complete with very dated early 2010s meme culture (I feel old).

Neither really takes the time to explain what’s going on, so I thought it would be worth quickly going through the Instagram one, because it nicely illustrates the difficulty of creating a programming language that intuitively “makes sense.” Something which Javascript spectacularly fails to do.

So, first one:

console.log(3 + 3);
6

This, obviously, is the expected behaviour. Nothing to see.

console.log("3" + "3");
33

This is logical as well, but it clearly shows the root of the issue with a dynamically typed language. Because the + operator means “arithmetic addition” (for numbers) but also “concatenation” (for strings), the creators of the language have to make some choices as to what happens when you mix both types. Which is where it starts to become weird.

console.log("3" + + 3);
33

In this case, the + + is not actually needed: we could just use a single + and get the same result. It’s just there to set up the next bit. The reasoning here is simple: we start from a string, and we have a +, so we want to concatenate something. What is that something? + 3, which is a number. But we cannot concatenate a number to a string, so it is first cast to the string 3, hence the final result: the string 33.

console.log(3 + + "3");
6

This is where it starts being a bit confusing, especially since doing 3 + "3" actually outputs 33. So the casting is done towards the string type for "3" + 3 and for 3 + "3", but the second + sign causes everything to fall back to numbers. Why? Because + in Javascript isn’t just for addition or concatenation: it can also be the unary plus operator, which converts whatever follows it to a number. And, because the unary operator has precedence over the addition operator, what Javascript evaluates is:

  1. a = + "3" = 3
  2. 3 + a = 3 + 3 = 6
console.log(3 + 3 - 3);
3

No problem here.

console.log("3" + "3" - "3");
30

What’s happening here ? The two operators, + (addition because there are things on both sides, not unary plus) and - have the same precedence, and arithmetic operators use “left-associativity” (except exponentiation, but let’s not go there), so this gets evaluated as ("3" + "3") - "3".

With "3" + "3", + will mean “concatenation,” so we get the string 33. Then, we subtract the string 3. But while addition applied to strings is concatenation, subtraction applied to string… doesn’t exist in javascript. The subtraction operator always try to “coerce both operands to numeric values” before performing the operation, so "33" - "3" is evaluated as 33 - 3 = 30.

How do other languages handle it ? Typically, most languages prefer to throw an exception rather than produce unintuitive outputs. Python, for instance, will refuse to do "3" + 3 or 3 + "3", throwing a TypeError in both cases. It does have its own weirdness, however, as "3" * 3 will work and output the string '333': the multiplication operator, applied to a string, is interpreted as a “repeat” operator in Python.
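
To make the comparison concrete, here is what a Python 3 interpreter session looks like (traceback lines trimmed):

>>> "3" + 3
TypeError: can only concatenate str (not "int") to str
>>> 3 + "3"
TypeError: unsupported operand type(s) for +: 'int' and 'str'
>>> "3" * 3
'333'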

So what’s the conclusion of all that ? Except for “don’t use Javascript unless you really, really have to,” it’s this: programming languages are often weird, but in the end they are just following their internal rules exactly. I certainly prefer languages where those rules tend to produce the behaviour that I intuitively expect. But languages are designed according to their own needs. Javascript tries to keep going as long as possible without crashing, because the goal is for end users on their web browsers to never see an error. “Forgiveness by default”: if a page crashes, users are unhappy.

Also, developers have been making fun of Javascript for so long that it would now feel weird if it became less weird.

Tissue segmentation in digital pathology is one of these weird little problems that seems like it should be trivial – and in many situations it is trivial – yet if you need a robust solution with a reasonable amount of precision, it is surprisingly difficult to get right. This is something that we examine in our latest preprint, accepted into the SIPAIM 2023 conference [read as HTML - read as PDF], written with Arthur Elskens, Olivier Debeir and Christine Decaestecker.

What are we trying to do?

This is a situation that most developers who have worked with digital pathology images have had to deal with at some point or another: you have a very large image with some tissue regions, and a lot of light gray “background,” where the backlight from the microscope shines through the glass slide.

Example of a whole-slide image from a breast cancer sample (source: TCGA, case TCGA-AC-A2FB)

The goal is to clearly separate the tissue region from the background region. There are many potential reasons for that. One of the most common is when we are extracting patches for training a deep neural network: we don’t want to extract tiles in the background region, where there is no useful information to be found.

Another case is when we are trying to quantify stuff, such as the percentage of the tissue that is covered by, for instance, tumoral cells. It’s important in that case to have a good estimation of what the total tissue area is. Kleczek et al.1 also use tissue segmentation to get a more accurate estimation of the stain colors, for a process called stain deconvolution, which is also fairly common.

Those different cases have very different requirements: for the patch extraction, for instance, we don’t really care if it’s super precise. For quantification processes, however, we do.

How do we do it?

When we look at the literature, it’s quickly apparent that most people just kinda wing it (which is what I was doing up to now as well, to be honest). Everyone has a slightly different but relatively similar pipeline, which most of the time can be summed up in a few steps (a rough sketch in code follows the list):

  • Channel reduction: convert the 3-channel color image into a single-channel “grayscale” image.
  • Thresholding: determine the best value that separates background from non-background in this single-channel image.
  • Post-processing: clean up the segmentation mask, typically with some morphological operations and by filling any holes within the tissue region.
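
To make that concrete, here is a minimal sketch of such a pipeline with scikit-image. This is not the code from our paper, just an illustration of the three steps; the choice of the saturation channel and the parameter values are arbitrary here:

import numpy as np
from skimage.color import rgb2hsv
from skimage.filters import median, threshold_otsu
from skimage.morphology import binary_closing, disk, remove_small_holes, remove_small_objects

def segment_tissue(rgb: np.ndarray) -> np.ndarray:
    # Channel reduction: the background is desaturated, the tissue is colored.
    saturation = rgb2hsv(rgb)[..., 1]
    smoothed = median(saturation, disk(5))
    # Thresholding: Otsu separates the two modes of the histogram.
    mask = smoothed > threshold_otsu(smoothed)
    # Post-processing: close gaps, fill holes, remove small specks.
    mask = binary_closing(mask, disk(5))
    mask = remove_small_holes(mask, area_threshold=500)
    mask = remove_small_objects(mask, min_size=500)
    return mask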

A few algorithms have also been developed a bit more rigorously, such as FESI2 or EntropyMasker3, but they still follow mostly the same pipeline.

Our study

The step that seems the most important in this process is the channel reduction: how do we transform the image so that the values of the pixels in the single-channel image correlate well with whether they are within the tissue region or not?

The main characteristic of the background region is that it’s light, so a common choice is to simply use the regular grayscale transform (or “inverted” grayscale so that tissue regions appear brighter than background regions). Another characteristic is that it’s grayish, or “desaturated.” So we can use the saturation in the HSV colorspace. Another approach (used for instance in the VALIS registration software4) is to first estimate the color of the background region (by looking at the brightest pixels), then compute a color distance from each pixel to this estimated background color. Pixels in the tissue should be “further” in colorspace.

We can also use the other main characteristic of the background region: it is homogeneous. This is the approach of EntropyMasker: use the local entropy – or, in other works, the Laplacian or the output of an edge detection algorithm – which will be higher in “textured” regions (such as the tissue) and lower in the background regions.
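
As an illustration, those four single-channel representations could be computed roughly like this with scikit-image (the way the background color is estimated and the entropy footprint size are arbitrary choices here, not necessarily those of the cited papers):

import numpy as np
from skimage.color import rgb2gray, rgb2hsv
from skimage.filters.rank import entropy
from skimage.morphology import disk
from skimage.util import img_as_ubyte

def channel_reductions(rgb: np.ndarray) -> dict:
    gray = rgb2gray(rgb)
    inverted_gray = 1.0 - gray                        # tissue brighter than background
    saturation = rgb2hsv(rgb)[..., 1]                 # background is desaturated
    # Color distance: estimate the background color from the brightest pixels,
    # then compute the distance of every pixel to that color.
    background_color = np.median(rgb[gray > np.percentile(gray, 95)], axis=0)
    color_dist = np.linalg.norm(rgb.astype(float) - background_color, axis=-1)
    # Local entropy: high in textured (tissue) regions, low in the background.
    local_entropy = entropy(img_as_ubyte(gray), disk(5))
    return {"gray": inverted_gray, "saturation": saturation,
            "color_dist": color_dist, "local_entropy": local_entropy}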

The goal of our study was to evaluate whether one or some of those approaches, all commonly found in the literature, are better than the others.

Our results

We used annotated data from Bándi et al.5 and from the TiGER challenge6: five slides from Bándi and 62 from TiGER as a development / learning set, and five slides from Bándi and 31 from TiGER as a test set. We worked at a low magnification level, corresponding to a resolution of around 15µm/px, in all experiments.

We first analyzed the separability of the tissue and background value distributions for the different single-channel options on the development set, and found that the local entropy provides the best separability, and the grayscale the worst.

We then looked at the results of a basic pipeline of channel reduction, median filter, thresholding and morphological post-processing, and compared the four channels’ results according to the Intersection over Union metric. For the thresholding step, we used either a “learned” threshold (based on the development set) or the Otsu algorithm. FESI and EntropyMasker were also tested for comparison.

Results on a representative image of the test set. The contours of the masks are overlaid on the RGB image (for the annotations, FESI and EntropyMasker) and on the single-channel representations (for the fixed (FT) and Otsu thresholds).

Our main conclusions are:

Our results demonstrate that transforming the RGB image to grayscale, as is very commonly done for tissue segmentation, is a suboptimal choice that should generally be avoided. Better channel reduction choices are available, focusing either on the colorization of the tissue compared to the background (e.g. with Saturation or ColorDist), or on the texture (e.g. with LocalEntropy). Pipelines with many handcrafted steps such as the one proposed in EntropyMasker may be unnecessarily complex, as a very simple pipeline obtains very similar results as long as the right channel reduction step is applied.

What about deep learning ?

Tissue segmentation can certainly also be performed using deep convolutional neural networks. Bándi et al., for instance, showed very good results using U-Net. We decided against including such methods in our analysis, however:

We purposefully chose not to focus on deep learning solutions in this study. Often, one of the objectives of tissue segmentation is to have a very simple and quick method that offers a result that doesn’t need to be pixel-perfect. Making simple, classical pipelines more robust and reliable can allow us to get good results while wasting less computing resources.

I think that, in general, we should try to find the “simplest” solution to a problem. For my future needs in tissue segmentation, I’d much rather have a robust, reliable pipeline based on channel reduction and basic operations (that can easily be translated to any framework and included in any pipeline without much fuss) rather than to have a big trained neural network that will require some heavy maintenance to keep up to date with the latest versions of deep learning libraries. I have some neural networks that were trained two years ago and which are already annoyingly complicated to run now. I’d feel a lot more comfortable with the kind of pipeline that we studied here.

So do we have a robust solution?

First, we have to refine the other steps of the pipeline, particularly the post-processing, and potentially combine different thresholds. The homemade pipeline that I used until now thresholded on a combination of the grayscale and saturation channels. I’ll likely change that to use saturation and color distance, or saturation and local entropy, instead.
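
The combination could be as simple as the intersection of two thresholded channels – a hypothetical sketch, reusing the channel_reductions function from the earlier snippet:

from skimage.filters import threshold_otsu

channels = channel_reductions(rgb)  # from the earlier sketch
sat_mask = channels["saturation"] > threshold_otsu(channels["saturation"])
ent_mask = channels["local_entropy"] > threshold_otsu(channels["local_entropy"])
tissue_mask = sat_mask & ent_mask   # or | for a more permissive union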

The code for our experiments is available on Gitlab: https://gitlab.com/adfoucart/tissue-segmentation


  1. Kleczek et al., 2020. A novel method for tissue segmentation in high-resolution H&E-stained histopathological whole-slide images. https://doi.org/10.1016/j.compmedimag.2019.101686

  2. Bug et al., 2015. Foreground Extraction for Histopathological Whole Slide Imaging. https://doi.org/10.1007/978-3-662-46224-9_72

  3. Song et al., 2023. An automatic entropy method to efficiently mask histology whole-slide images. https://doi.org/10.1038/s41598-023-29638-1

  4. Gatenbee et al., 2021. VALIS: Virtual Alignment of pathoLogy Image Series. https://doi.org/10.1101/2021.11.09.467917

  5. Bándi et al., 2017. Comparison of different methods for tissue segmentation in histopathological whole-slide images. https://doi.org/10.1109/ISBI.2017.7950590

  6. https://tiger.grand-challenge.org/