Adrien Foucart's research blog

Adding context to publications: a small experiment.

adrien@adfoucart.be — 2025-05-19

I’m making a small experiment with my publications – or at least the version that I host here.

It started like this: I think that PDFs kinda suck as a science communication medium, so I wanted to add an HTML version of my papers. But after converting some of my LaTeX sources to HTML (thanks to pandoc), and using a little bit of my limited CSS talent, I got something that was OK, but felt underexploited.

Because now, I have the opportunity to add stuff around the published text. I don’t think modifying the text itself would be right, but there are many cases where, due to length constraints or lack of knowledge at the time, things that I wrote are not quite as clear as I would like, or miss some useful information. Or where I would simply do things a bit differently now, or have got more recent results that strengthen or weaken some of the conclusions.

In the current publication model, it’s not very easy to add such context. You can cite the old publication in your newer ones and comment from there, of course, but if someone finds the old article first, there is really nothing that makes it clear what may or may not have changed since publication. But here, on my own website, I can do what I want!

The experiment

I found a nice post explaining how to easily create “sidenotes” using only CSS, which is great: I don’t want to start messing with JavaScript unless absolutely necessary. I was also inspired by (aka: stole from) the work of Arthur Perret, who has some great posts on publishing to multiple formats from text source, and uses similar sidenotes through his blog and even in the web version of his PhD dissertation (all in French).

Sidenotes seem like a really good way to add contextual information. It’s easy to make it absolutely clear that it isn’t part of the original text itself (for instance, with timestamps on each note) and that it doesn’t represent the opinion of all original authors (by signing each note). And it doesn’t distract too much from reading the text, like footnotes would.

The results

I started from the first article I published during my PhD, on artifact segmentation in digital pathology. I simply re-read it and started annotating. I quickly found things that I thought were interesting to add: links to later publications, clarifications, links to relevant source code, etc.

Moving from one publication to the next, I got more convinced that this was a good idea. It makes it a lot easier, for instance, to follow the links between the first paper, the first SNOW paper from ISBI, and the follow-up report that expands on both.

In my early articles, I also didn’t have the habit yet of creating a snapshot repository with code for replicating the results. So I could also point to later versions of the code in some of my repositories, which in some cases may make it easier to understand exactly what I was doing.

I’ve now gone through all of the publications where I’m first author, up to my latest ESANN 2025 paper.

Conclusions

I really like this format. Not only because I can correct things that I think weren’t quite right when we published the paper in the first place, but also because it allows me, in a way, to personalize the way I present that research a lot more. My publications are obviously not just mine, even when I’m the first author. But here, I can add things in my own words, without pressure, and I think it brings some life into what is often a fairly dry process.

[Opinion] Science and LLMs don't go together well

adrien@adfoucart.be — 2025-02-11

Two posts on Retraction Watch yesterday highlight yet again how the big publishers in the scientific publishing industry are failing at being gatekeepers of high-quality science, and how many scientists don’t seem to grasp that they should be writing their papers themselves.

“As a nonsense phrase of shady provenance makes the rounds, Elsevier defends its use” (Frederik Joelving, Retraction Watch).

In this first one, a “fingerprint” of automatically-generated papers is described. These fingerprints, like the “tortured phrases” identified by Guillaume Cabanac [arXiv], are basically things that no scientist will ever write and that no diligent human reviewer or editor should let through. Cabanac identifies for instance “profound neural organization” used instead of “deep neural network”, a clear sign that some automated method just tried to find synonyms in an effort to mask plagiarism, without understanding the context.

This time, the fingerprint is “vegetative electron microscope” (or microscopy), which… doesn’t really mean anything? The most likely explanation found for its appearance (in more than 20 published papers) is an OCR error: a machine tried to convert a two-column 1959 paper to plain text and accidentally merged the two columns, where “vegetative” and “electron microscopy” appeared next to each other.

But the news here isn’t the fingerprint, but rather the baffling reaction of Elsevier when asked about a 2024 paper published in one of their journals [Elsevier]. To be clear, “vegetative electron microscope” is something that no one has ever written and that doesn’t make sense. But an Elsevier spokesman wrote that “the Editor-in-Chief confirmed that ‘vegetative electron microscopy’ is a way of conveying ‘electron microscopy of vegetative structures’ so he was content with the shortened version to be used.”

This is nonsense. First because, again, no one uses this “shortcut”. Second, because the thing studied at that point of the paper is bacterial cellulose, which is different from plant cellulose and therefore not even vegetal. The presence of such an obvious mark of automated paper writing should trigger an instant retraction, and launch an investigation into any previous paper published by the authors.

“As Springer Nature journal clears AI papers, one university’s retractions rise drastically” (Avery Orrall, Retraction Watch)

Writing a scientific paper with ChatGPT or other LLMs is a terrible idea. Using AI assistance to write lowers our critical thinking abilities, as even Microsoft realises [Microsoft Research].

Convincing scientists, particularly those under heavy pressure to publish a lot, not to use these tools is a tall order. The bare minimum, which the Springer Nature publisher requires, is to disclose the use of an LLM. More than one hundred papers that failed this bare minimum requirement have been retracted so far by the Neurosurgical Review journal. At least in this case the journal seems to be doing its job, but I find the response from one of the retracted authors [Sivakamavalli Jeyachandran via RetractionWatch], who disagrees with the retraction, really disheartening.

His arguments are:

Lack of standard for detecting LLM-written content makes the retractions arbitrary. That’s true, but “it’s hard to detect fraud” is a poor argument for “not acting when fraud is detected”.
It was a letter or commentary, not a full research paper, so it doesn’t really matter. If it matters enough to be published, it matters enough to be retracted – in my opinion.
Retractions should only be for data fabrication, plagiarism or breaches of journal policy. Well, this was a breach of journal policy, so I’m not sure how that’s an argument against retraction. Also, LLM output is arguably plagiarism by essence, but that’s another debate…

At one point, he refers to his undisclosed usage of an LLM as a “minor technical infraction”. I certainly disagree – and am glad that Springer seems to disagree as well.

[Preprint] Ranking the scores of algorithms with confidence (ESANN 2025)

adrien@adfoucart.be — 2025-01-15

In competitions or in original research papers which compare the results of some algorithms on a given task, the centerpiece is generally the Big Table of Results. The Big Table of Results is where you put a list of algorithms on one axis, a list of metrics on the other axis, and you put in bold the algorithm that performs best. In original research papers, it’s where you justify that your method is better than the others, with tables such as the one below. See? It’s in bold!

Method	Result
Old Classic Baseline [1]	0.71
State-of-the-art from a few years back [2]	0.82
Previous work from the authors [3]	0.82
This work [4]	0.84

In a competition, this gives us the leaderboard, which will look something like this:

Rank	Team	Result
1	Big AI Research group	0.91
2	Big AI Company	0.90
3	Someone with 2 GPUs	0.87
4	Someone with 1 GPU	0.84
…	…	…
157	I don’t know what I’m doing	0.42

Needless to say, things are a bit more complicated than that. In our new preprint, accepted at the ESANN 2025 conference and available online (PDF), we argue for a more nuanced approach to ranking where, instead of saying “this is the best method”, we compute confidence intervals on the rankings based on the assumption that the test set is a random sample from the larger population of “all cases where we may want to apply our algorithm”. We take the general procedure proposed by S. Holm in 2013 (see the reference at the end of this post), which uses the results of pairwise statistical tests to infer the confidence interval.

Several options for these statistical tests are evaluated using a Monte Carlo simulation on synthetic data. The procedure that appears to be the most robust based on our experiments is:

Run an Iman-Davenport test comparing the m algorithms. If the null hypothesis (no significant difference between results) cannot be rejected, we stop here: all algorithms have the same confidence interval [1, m].
Compute pairwise one-sided Wilcoxon signed-rank tests, adjusting the p-values using Holm’s procedure. The rank interval for each algorithm is then: [1 + #sba, m − #swa], where #sba and #swa are the numbers of significantly better and significantly worse algorithms.
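To make the second step a bit more concrete, here is a minimal sketch of how Holm-adjusted pairwise p-values can be turned into rank intervals. This is not the cirank implementation: the function names are hypothetical, the raw p-values are assumed to have been computed beforehand (for instance with one-sided Wilcoxon signed-rank tests), and the example p-values are made up.

```python
def holm_adjust(pvalues):
    """Holm step-down adjustment; returns adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda k: pvalues[k])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, k in enumerate(order):
        # the smallest p-value is multiplied by n, the next one by n - 1, etc.
        running_max = max(running_max, (n - rank) * pvalues[k])
        adjusted[k] = min(1.0, running_max)
    return adjusted


def rank_intervals(pairwise_p, m, alpha=0.05):
    """pairwise_p[(i, j)]: raw one-sided p-value for 'algorithm i is better than j'."""
    pairs = list(pairwise_p)
    adjusted = holm_adjust([pairwise_p[pair] for pair in pairs])
    n_better = [0] * m  # number of algorithms significantly better than k
    n_worse = [0] * m   # number of algorithms significantly worse than k
    for (i, j), p in zip(pairs, adjusted):
        if p < alpha:
            n_worse[i] += 1
            n_better[j] += 1
    return [(1 + n_better[k], m - n_worse[k]) for k in range(m)]


# made-up p-values for 3 algorithms: only "0 is better than 2" survives the adjustment
example_p = {(0, 1): 0.20, (1, 0): 0.80, (0, 2): 0.001,
             (2, 0): 0.999, (1, 2): 0.04, (2, 1): 0.96}
intervals = rank_intervals(example_p, m=3)
print(intervals)
```

In this toy example, the raw p-value of 0.04 for "1 is better than 2" no longer counts as significant after Holm's adjustment, so algorithm 1 keeps the full interval [1, 3].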

We release with this paper the cirank Python library, which you can use to compute the confidence intervals with:

from cirank import ci_ranking
import numpy as np  # for the example

# example results:
results = [np.random.random((10,)) for _ in range(5)]

# default method
rankings = ci_ranking(results)
print(rankings)

This paper (and the library) have large limitations in scope, and are likely to be expanded in the future – as discussed in the paper.

Reference:

A. Foucart, A. Elskens, C. Decaestecker. Ranking the scores of algorithms with confidence. ESANN 2025 (Accepted).

S. Holm. Confidence intervals for ranks. https://www.diva-portal.org/smash/get/diva2:634016/fulltext01.pdf, 2013.

Advent of code 2024

adrien@adfoucart.be — 2024-12-12

Advent of Code is a yearly event, running since 2015. It is made by Eric Wastl, and it is “an Advent calendar of small programming puzzles”.

I had never heard of it until this year, when I started following the blog of Juha-Matti Santala, and saw him document his progress. I thought it would be fun – and I was right. I also thought that the practice of writing up how I came to the solutions was nice, and decided to do it as well. All of my notes on the puzzles can be found here: https://notes.adfoucart.be/aocode24. And all of my code, the good and the bad, is on my Gitlab.

One thing I really like about it is that there are basically no rules except the ones we bring with us. Eric posts a two-part puzzle every day, where you have to solve the first part in order to get the second part. You solve the puzzle by writing (or rather copy-pasting) a number into a box, and the only check that is made is if you have the right number. It doesn’t matter which language you used, or if your solution is clever, or efficient, or even would generalize to any other inputs than those he gave.

There is a leaderboard, but it is obviously just there for people who really want a leaderboard. The only metric used for it is how fast you solved the puzzle. I didn’t notice it was there until several days into the challenge.

So why do I do it? I decided to use it to get better at using the Python standard library efficiently, and to practice finding a good way to frame the problem so that the solution is generic and modular enough that, when part 2 is revealed, I can easily add to the existing code without modifying too many functions (ideally, none).

Sometimes it works, sometimes not. We’re halfway through, and I’m reasonably satisfied with most of my solutions so far. If you’re into that kind of thing, I highly encourage doing it. All of the puzzles from the previous years are available as well.

Brain metaphors

adrien@adfoucart.be — 2024-06-17

Over on Aeon, psychologist Robert Epstein wrote a very interesting piece about how we often explain the functioning of the brain as if it’s a computer, retrieving memories and processing information, despite the lack of scientific grounding for that metaphor. Our brains don’t “recall” information, they “re-live” it. Information is not “stored” in neurons: our brain constantly evolves based on our experiences, and events can trigger our brains into re-activating the areas that were activated during a previous event, thus creating the impression of a “memory” for us.

This, I think, is the flip side of a discussion that I have had very often this past year about whether AI as it exists today can qualify as “intelligence” in any way. Just like “the computer” is not a very good metaphor for the human brain, the human brain is not a good metaphor for how AI algorithms work, yet this metaphor has completely taken over the way we think about AI. In fact, the whole vocabulary around AI is filled with human-intelligence metaphors: neural networks, learning, AI that explain their reasoning… and obviously “intelligence” itself. These metaphors may be useful to get a surface-level intuition about how AI algorithms work, but they are very much untrue and should not be taken as more than that: metaphors.

There’s this idea that, if we can make an artificial neural network that is as complex as the human neural network, then we should be capable of creating an actual intelligence. But that makes two hypotheses which are unsupported by evidence: that artificial neural networks are good models of human neural networks (they aren’t), and that human neural networks are solely responsible for human intelligence. That doesn’t seem to be the case either. Our brains work as part of an intricate, interwoven network of systems which all interact with each other in ways that are very hard to capture. Our neurons without our endocrine system, our immune system, etc., are a very incomplete snapshot of who we are. We can’t – and maybe won’t ever be able to – upload our brains to the cloud.

Artificial Intelligence is not Human Intelligence, and it doesn’t really make any sense to compare them in any way. Any AI algorithm has capabilities and limitations, and we can study those without falling into the trap of anthropomorphism. That’s why I don’t like it when LLMs are compared on benchmarks such as law exams, or common coding exercises. Those tests were designed for humans, to test capabilities that are often difficult to acquire for human intelligence. For an AI, it doesn’t tell us much. Importantly, it doesn’t tell us that this AI is a good lawyer, or a good software developer. Yet these are still the kind of tests used to compare LLMs today. Perhaps the most used test right now is the “Massive Multitask Language Understanding (MMLU)” test. The MMLU is composed “of multiple-choice questions from various branches of knowledge”, which were “manually collected by graduate and undergraduate students from freely available sources online”. But we aren’t using LLMs to solve multiple-choice questions from various exams and tests collected around the web. For any actual task that we have in mind for such a model, we would need to prove that the MMLU is a useful proxy to evaluate the capability of the model to perform the task.

The fact that the MMLU is collected from online sources also makes it extremely difficult to use it as a benchmark for modern LLMs, which are often trained on very opaque datasets, also collected from all around the web. The risks of contamination are huge, and the measures taken by LLM developers to mitigate this issue are often doubtful, if they are even described at all. The Gemini paper, for instance, states that they “search for and remove any evaluation data that may have been in our training corpus before using data for training”, but don’t provide any detail as to how this search was done. If they used the same method as GPT-4 (i.e. looking for exact matches on substrings of the questions), then the risk of contamination is high.
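To illustrate why exact matching is a weak safeguard, here is a minimal sketch of such a check. It is not the actual GPT-4 or Gemini procedure: the normalization, the 50-character windows, the function names and the example texts are all assumptions made for the illustration.

```python
import re


def normalize(text):
    """Lowercase the text and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def exact_substring_match(question, corpus, n_chars=50):
    """Flag a question if one of three fixed windows appears verbatim in the corpus."""
    q = normalize(question)
    c = normalize(corpus)
    if len(q) <= n_chars:
        return q in c
    # windows taken at the start, the middle and the end of the question
    starts = [0, (len(q) - n_chars) // 2, len(q) - n_chars]
    return any(q[s:s + n_chars] in c for s in starts)


question = ("Which of the following organs is primarily responsible "
            "for filtering blood in the human body?")
# made-up training documents
verbatim_page = ("Quiz 12. Which of the following organs is primarily responsible for\n"
                 "filtering blood in the human body? (A) ...")
paraphrased_page = ("Q: Which organ in the human body is mainly in charge of "
                    "filtering the blood? Answer: the kidney")

found_verbatim = exact_substring_match(question, verbatim_page)
found_paraphrase = exact_substring_match(question, paraphrased_page)
print(found_verbatim, found_paraphrase)
```

The verbatim copy is caught, but the lightly reworded version of the same question goes through, even though a model trained on it would have seen the answer.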

There is often a very large gap between the performance of LLMs on benchmarks and their performance in real-world applications, and this is probably part of the reason: the benchmarks are not made for the evaluation of a computer program, but of a human brain… and those are way too different for it to work that way.

SciSpace: a quick story

adrien@adfoucart.be — 2024-06-15

I must confess that, sometimes, I Google myself. I typically know what I’ll find: I have enough of an online presence that most of the results are related to me (and one other Adrien Foucart who competed in Judo fifteen years ago), and it’s typically a mix of my blogs, scientific papers, social media, and from time to time a post about things I’ve written or done. Finding those is the main reason that I do this exercise every once in a while. Today, however, I got a surprising result, from a website that I had never heard about: SciSpace.

SciSpace is yet another GPT-powered chatbot, aimed at scientists who want to outsource their research to a machine. You ask a question, it answers with a summary built from scientific papers, with citations to those papers so that you can read them if you want to do some work somewhere in the process. I don’t think it’s a good idea: doing that kind of summary is how you actually gain an understanding of your field, and you’ll necessarily miss a lot of the nuance of what’s happening in the field if you just get the AI-generated “summary”. So even if it worked perfectly as advertised, I wouldn’t recommend using it. But the reason I’m writing this is that it fails pretty spectacularly at its job.

It seems that SciSpace allows you to browse questions, presumably asked by other users. Google indexed a question where, surprisingly, I appeared in the answers. I say surprisingly because the question is not quite in my field: “What are the specific cultural criticisms associated with the implementation of Panopticon in various societies?”

The beginning of the answer seems to be on-topic, although since it’s a topic I know nothing about, it could all be bullshit for all I know. But it’s around the end that I suddenly appear, with this tangent:

Lastly, Adrien Foucart and colleagues critique the Panoptic Quality metric in digital pathology, illustrating the challenges of applying panoptic principles to complex, nuanced fields [10].

This is a reference to my Scientific Reports paper “Panoptic quality should be avoided as a metric for assessing cell nuclei segmentation and classification in digital pathology”, which is probably the paper I like the most out of my publications, but which has absolutely nothing to do with the Panopticon, outside of the fact that there is “Panoptic” in the title.

And this, I think, is where GPT went wrong. Because GPT predicts tokens, which encode parts of words, “Panoptic” and “Panopticon” likely share one or several tokens. This alone shouldn’t be enough to trip the model, but it also happens that one of the names most associated with discussion of the Panopticon is Michel Foucault, and “Foucault” also shares at least one token with “Foucart”. This taken together probably sent SciSpace’s GPT in the wrong direction. Because, as we should all know by now, GPT has no fucking idea what it’s talking about. It’s all just a statistically likely string of tokens, with no understanding whatsoever of what’s going on.

So Foucault and the Panopticon end up mixed with Foucart and Panoptic Quality, even though these come from completely different domains. I should also note that for some reason my paper is mistakenly cited as coming from “Dental science reports” instead of “Scientific Reports”. No idea what happened there, but it’s another clear sign that this tool is absolutely useless.

From the CEO’s LinkedIn profile, the platform is “used by more than a million researchers worldwide, including Nobel Laureates”. I highly doubt, however, that Nobel prize worthy science will be done with it anytime soon.

ChatGPT on evaluation metrics

adrien@adfoucart.be — 2024-02-09

I recently reviewed a paper which I strongly suspect, without certainty, had a little help from ChatGPT for its research. One thing that tipped me off was that it was a bit confused about the evaluation metrics: it reported, for a segmentation task, both the “F1” score and the “Dice” score, which it defined as:

$$F1 = \frac{2 \times precision \times recall}{precision + recall}$$

$$Dice = \frac{2 \times |P \cap G|}{|P| + |G|}$$

Where P & G refer to the predicted and ground truth regions.

Both of these definitions are fine, but… well, before I explain the problem, let’s see what ChatGPT (version 3.5) has to say about it.

Asking the question

User:  
What is the difference between the F1 score and the Dice Similarity Coefficient ?

ChatGPT’s answer:

The F1 score and the Dice Similarity Coefficient (also known as Dice coefficient or Dice index) are both metrics commonly used to evaluate the performance of binary classification tasks, particularly in the context of image segmentation or medical imaging. While they serve similar purposes, they have some differences in how they are calculated and what aspects of the classification performance they emphasize.

F1 Score:

The F1 score is the harmonic mean of precision and recall.
Precision measures the ratio of true positive predictions to the total number of positive predictions made by the classifier.
Recall measures the ratio of true positive predictions to the total number of actual positive instances in the dataset.
The F1 score balances both precision and recall, giving equal weight to both. It is calculated as: F1 = 2 * (precision * recall) / (precision + recall).
F1 score ranges from 0 to 1, where a higher value indicates better performance.

Dice Similarity Coefficient:

The Dice coefficient measures the overlap between two samples.
It is calculated as twice the intersection of the two samples divided by the sum of the sizes of the two samples.
In the context of binary classification tasks, one sample typically represents the predicted segmentation mask, while the other represents the ground truth mask.
Dice coefficient ranges from 0 to 1, where 0 indicates no overlap between the predicted and ground truth masks, and 1 indicates perfect overlap.

Differences

While both metrics are used to evaluate binary classification tasks, they emphasize different aspects of classification performance.
F1 score emphasizes the balance between precision and recall, which is particularly useful when there is an imbalance between the classes in the dataset.
Dice coefficient emphasizes the overlap between the predicted and ground truth masks, which is especially relevant in tasks such as image segmentation, where the spatial agreement between the predicted and true regions is crucial.
The F1 score can penalize false positives and false negatives equally, while the Dice coefficient may be less sensitive to false negatives if the overlap between the predicted and true regions is still significant.

In summary, both metrics are valuable tools for evaluating classification performance, but the choice between them depends on the specific requirements and characteristics of the task at hand.

Dissecting the answer

ChatGPT is verbose as always, but the two definitions are mostly correct. Nothing much to say about the F1-Score, and for Dice the only real problem I have is with “measures the overlap,” as I’d rather say “is one of the measures of overlap.” The other main one being the Jaccard Index, also known as “Intersection over Union.”

The “differences” section is where I have way more problems. F1 score is balanced in terms of precision and recall, but it is not at all appropriate when there is a class imbalance (see Chapter 4 of my thesis for more!). And “the Dice coefficient may be less sensitive to false negatives if the overlap between the predicted and true regions is still significant” is nonsense.
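One way to see the problem with class imbalance is a toy computation (not necessarily the argument developed in the thesis). Take a hypothetical classifier with a fixed sensitivity and specificity of 0.9, and compute its F1 score for two different proportions of positive samples. The function name and the numbers are made up for the example.

```python
def f1_from_rates(prevalence, sensitivity, specificity):
    """F1 score of the positive class, computed from population-level rates."""
    tp = sensitivity * prevalence              # true positives (as a fraction of all samples)
    fn = (1 - sensitivity) * prevalence        # false negatives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    return 2 * tp / (2 * tp + fp + fn)


# same classifier, evaluated on a balanced and on a very imbalanced dataset
balanced = f1_from_rates(prevalence=0.5, sensitivity=0.9, specificity=0.9)
imbalanced = f1_from_rates(prevalence=0.01, sensitivity=0.9, specificity=0.9)
print(f"balanced: {balanced:.3f}, imbalanced: {imbalanced:.3f}")
```

The classifier behaves identically in both cases, yet the F1 score drops from 0.9 to about 0.15 when positives make up only 1% of the data: the score says as much about the dataset as about the algorithm.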

But that’s still not the main issue. Let’s see if I can get ChatGPT to notice it.

One step further

User:  
Express both the F1 score and the Dice Similarity Coefficient in terms of True Positives, False Positive and False Positives.

ChatGPT’s answer

Sure, here are the formulas for both the F1 score and the Dice Similarity Coefficient expressed in terms of True Positives (TP), False Positives (FP), and False Negatives (FN):

F1 Score:
- Precision (P) = TP / (TP + FP)
- Recall (R) = TP / (TP + FN)
- F1 Score = 2 * (P * R) / (P + R)
Dice Similarity Coefficient:
- Dice = (2 * TP) / (2 * TP + FP + FN)

In both cases, TP represents the number of true positive predictions, FP represents the number of false positive predictions, and FN represents the number of false negative predictions. These formulas allow us to calculate both metrics based on the counts of these classification outcomes.

Almost there

The formulas are still correct! But weirdly, ChatGPT didn’t exactly do as I asked. It still puts the F1 Score in terms of Precision and Recall, not in terms of TP, FP and FN. Let’s prompt it a bit further.

User:  
You put the F1 Score in terms of P and R, give the complete formula for the F1 Score based on TP, FP and FN.

ChatGPT’s answer

Apologies for the oversight. Here’s the complete formula for the F1 score expressed in terms of True Positives (TP), False Positives (FP), and False Negatives (FN):

F1 Score = (2 * TP) / (2 * TP + FP + FN)

This formula directly uses the counts of true positives, false positives, and false negatives to compute the F1 score, without needing to calculate precision and recall separately.

There we go!

So let’s put the two formulas next to each other:

$$Dice = \frac{2 \times TP}{2 \times TP + FP + FN}$$

$$F1 Score = \frac{2 \times TP}{2 \times TP + FP + FN}$$

They are the same thing! Dice and F1 Score are two different names for the same metric. The only real difference is in where each name is typically used: F1 is the more common terminology in classification and detection problems, whereas Dice is more often used for segmentation problems. But they are the same.
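This is easy to check numerically. In the small sketch below, the two masks are made-up sets of pixel indices and the function names are hypothetical. F1 is computed from precision and recall, Dice from the usual set-based definition 2|P ∩ G| / (|P| + |G|), using the fact that |P ∩ G| = TP, |P| = TP + FP and |G| = TP + FN.

```python
def confusion_counts(predicted, truth):
    """predicted, truth: sets of pixel indices labelled as foreground."""
    tp = len(predicted & truth)  # pixels in both masks
    fp = len(predicted - truth)  # predicted but not in the ground truth
    fn = len(truth - predicted)  # in the ground truth but missed
    return tp, fp, fn


def f1_score(predicted, truth):
    """F1 as the harmonic mean of precision and recall."""
    tp, fp, fn = confusion_counts(predicted, truth)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def dice_score(predicted, truth):
    """Dice as twice the overlap divided by the sum of the two region sizes."""
    return 2 * len(predicted & truth) / (len(predicted) + len(truth))


predicted = {1, 2, 3, 4, 5, 6}
truth = {4, 5, 6, 7, 8, 9, 10, 11}
f1 = f1_score(predicted, truth)
dice = dice_score(predicted, truth)
print(f1, dice)
```

With these masks, TP = 3, FP = 3 and FN = 5, and both functions give 3/7 (up to floating-point rounding).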

All the talk about their differences was complete bullshit. But it would have been relatively difficult to spot without already knowing the information. Which is always the problem with trying to use ChatGPT as a source of knowledge.

Conclusions

It was not a very good paper, and whether the authors misunderstood evaluation metrics because of ChatGPT or all on their own, I don’t think their study will be published, at least in the journal they were aiming for.

But after more than a year of people trying to convince me that ChatGPT really is usually right, and useful as a source of information… I remain firmly convinced that it’s only “right” if you don’t know the topic and can’t fact-check the answer fully. In this case, the information is easily found on Google (even on Bing!). On Google, the first line I see with the same original prompt is: “Dice coefficient = F1 score,” from a Kaggle notebook.

Sure, “if I upgrade to the paid version” it may be better. The only thing that really makes ChatGPT-Plus better is the capacity to search the web. I’d rather do that directly on my own – and see the information in its context, so that I can critically assess it.

Retrieving TCGA slides with Python and the GDC API

adrien@adfoucart.be — 2023-10-06

Note: the code for all of this is available on Gitlab here: https://gitlab.ulb.be/lisa/tcgasampler. It may even perhaps work.

The goal

I would like to have a relatively big stock of TCGA whole-slide images, from various tissue locations and cancer types, and at a relatively low resolution, so that I can quickly test algorithms such as tissue segmentation or artifact detection that generally don’t require super-high magnification.

The National Cancer Institute provides an API to access their datasets programmatically, so this is what I’d like to do:

Get the list of all the diagnostic tissue slides.
Randomly select N slides.
Download and save a copy of those slides at a chosen resolution.

Accessing image data

There are basically two ways to access image data for a given slide:

Downloading the entire whole-slide image using the “data endpoint” of the API: https://api.gdc.cancer.gov/data/{file_id}
Directly loading the tiles from their tile server, used by their online slide viewer: https://portal.gdc.cancer.gov/auth/api/v0/tile/{file_id}?level={level}&x={x}&y={y}.

Both options come with their own set of problems.

In the first case, we have to download a full-resolution slide, which is often 1 GB or more, just to extract the low-resolution image from it (and then delete the full WSI to avoid filling up local storage too quickly).

In the second case… we don’t have access to any of the metadata of the slide, including, crucially, the resolution and magnification level. This means that we have to kinda guess at which “level” – as understood by the slide server – we have to work. Of course, the “levels” of the slide server are not the same as the “levels” of the image pyramid in the slide file. But at least we only download the data that we want, which is a lot faster.

So I’ve tried both: with the first option, I can have exactly the resolution I want for all images, but it takes a ton of time; in the second option, it’s faster but I will have variations in the resolution.

The overall pipeline

We start by setting a random seed for repeatability, and then we load and shuffle the file ids from the manifest:

import random

random.seed(RANDOM_SEED)
# the manifest is a tab-separated file, with a header on the first row
with open(PATH_TO_MANIFEST, 'r') as fp:
    all_tissue_slides = [line.strip().split('\t') for line in fp.readlines()]

rows = all_tissue_slides[1:]  # skip the header row
row_ids = list(range(len(rows)))
random.shuffle(row_ids)

We can then iterate through the row_ids to get the file_id and filename for each slide. This will randomly sample through the full TCGA dataset.
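As a minimal sketch of that iteration, assuming that the first two columns of the manifest are the file id and the filename (which is what I would expect from a GDC manifest, but check yours). The manifest content, RANDOM_SEED value and N_SLIDES below are made up so that the example runs on its own.

```python
import random

RANDOM_SEED = 0  # hypothetical value
N_SLIDES = 2     # hypothetical number of slides to sample

# tiny stand-in for the parsed manifest: a header row, then one row per slide
all_tissue_slides = [
    ["id", "filename", "md5", "size", "state"],
    ["id-a", "slide_a.svs", "-", "1", "released"],
    ["id-b", "slide_b.svs", "-", "1", "released"],
    ["id-c", "slide_c.svs", "-", "1", "released"],
]

random.seed(RANDOM_SEED)
rows = all_tissue_slides[1:]
row_ids = list(range(len(rows)))
random.shuffle(row_ids)

selected = []
for row_id in row_ids[:N_SLIDES]:
    # assumption: columns 0 and 1 hold the file id and the filename
    file_id, filename = rows[row_id][:2]
    selected.append((file_id, filename))

print(selected)
```

Because the seed is fixed, running this again gives the same selection, and increasing N_SLIDES later simply extends the same shuffled order.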

Option A: downloading full slides

In the first option, we use the data endpoint to load the whole-slide file and save it to the disk:

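# URL of the slide on the data endpoint
data_endpt = f"{GDC_DATA_ENDPT}/{file_id}"
response = requests.get(data_endpt, headers={"Content-Type": "application/json"})
# save the whole-slide file in a local temporary folder
with open(os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}"), "wb") as fp:
    fp.write(response.content)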

Then we can use our WholeSlide class (available in our openwholeslide package) to extract the image at the target resolution:

wsi = WholeSlide(os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}"))
region = wsi.read_full(resolution=TARGET_RESOLUTION)
imsave(os.path.join(TCGA_LOCAL_PATH, f"images/{filename}.png"), region.as_ndarray)
wsi.slide.close()

And finally we remove the downloaded file, so that the disk doesn’t fill up:

os.remove(os.path.join(TCGA_LOCAL_PATH, f"tmp/{filename}"))

Option B: downloading tiles

Getting a tile from the GDC API requires providing the file id – which we have from the manifest – alongside the level, and an “x-y” position. The “x-y” position is a tile index, so that the first tile is (x=0, y=0), then the one to the right is (x=1, y=0), etc. The level is related to the size of the image. It’s not really documented, but from playing a bit with the API it’s clear that the system is basically just starting from a 1x1px image at level 0, then increasing by powers of 2. At some point it diverges from the expected 2x2, 4x4, 8x8, etc., so that the real aspect ratio is preserved. So for instance for the image with id 455d9335-c6f3-4966-8b3c-1291e2d31093, we have:

level 0: 1x1
level 1: 2x2
level 2: 4x3
level 3: 7x6
level 4: 14x11 …
level 9: 428x351

We can get some metadata from the API using: https://portal.gdc.cancer.gov/auth/api/v0/tile/metadata/455d9335-c6f3-4966-8b3c-1291e2d31093. It tells us the full size of the image at maximum resolution (here 109559x89665), and the tile size (here 512). So we know that above level 9, we will start to have multiple tiles.
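The sizes listed above are consistent with a pyramid where the top level holds the full image and each level below halves the dimensions, rounding up. That layout is an assumption inferred from the observed sizes, not something taken from the GDC documentation, and `level_size` and `largest_level_within` are hypothetical helper names:

```python
def top_level(full_width, full_height):
    """Number of halvings needed to bring the largest dimension down to 1px."""
    return (max(full_width, full_height) - 1).bit_length()


def level_size(full_width, full_height, level):
    """(width, height) of a pyramid level, assuming each level halves the one above."""
    max_level = top_level(full_width, full_height)
    if not 0 <= level <= max_level:
        raise ValueError(f"level must be between 0 and {max_level}")
    scale = 2 ** (max_level - level)
    # Ceiling division: dimensions are rounded up, never down to 0.
    return -(-full_width // scale), -(-full_height // scale)


def largest_level_within(full_width, full_height, max_size):
    """Highest level whose largest dimension does not exceed max_size pixels."""
    for level in range(top_level(full_width, full_height), -1, -1):
        if max(level_size(full_width, full_height, level)) <= max_size:
            return level
    return 0


for level in (0, 1, 2, 3, 4, 9):
    print(level, level_size(109559, 89665, level))
```

Under that assumption, `largest_level_within(109559, 89665, 512)` returns 9, the last level that fits in a single 512px tile, and a 2048px limit gives level 11.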

Let’s try to grab the images at a resolution such that the image is, at most, 2048px in its largest dimension. For that, we need to use level 11. But first, we need to grab the metadata to get the aspect ratio and have an idea of the total size of the resulting image.

meta_url = f"{GDC_META_ENDPT}/{file_id}"

response_meta = requests.get(meta_url)
metadata = json.loads(response_meta.content)
full_width = int(metadata['Width'])
full_height = int(metadata['Height'])
overlap = int(metadata['Overlap'])
tile_size = int(metadata['TileSize'])

if full_width > full_height:
    expected_width = 2048
    expected_height = math.ceil(2048 * full_height/full_width)
else:
    expected_width = math.ceil(2048 * full_width / full_height)
    expected_height = 2048

downloaded_image = np.zeros((expected_height, expected_width, 3), dtype=np.uint8)

We will then use the tile_size to compute the maximum number of tiles per dimension, and slowly fill the image from the tiles. We also keep track in max_x and max_y of the real size of the resulting image so that we can crop it from the temporary black image at the end.

tile_url = lambda x, y: f"{GDC_TILE_ENDPT}/{file_id}?level=11&x={x}&y={y}"

max_tiles = (2048 // (tile_size-overlap)) + 1
max_x = 0
max_y = 0
for y in range(max_tiles):
    for x in range(max_tiles):
        response_img = requests.get(tile_url(x, y))
        if response_img.ok:
            image = imread(BytesIO(response_img.content))
            startx = x*(tile_size-overlap)
            starty = y*(tile_size-overlap)
            downloaded_image[starty:starty+image.shape[0], startx:startx+image.shape[1]] = image[..., :3]
            max_x = max(max_x, startx+image.shape[1])
            max_y = max(max_y, starty+image.shape[0])

imsave(down_path, downloaded_image[:max_y, :max_x])

Conclusions

With Option A, it took me about 11 hours to get 100 random images at a resolution of 15µm/px – but at least I know their resolution. It took only 2 hours to get the same number of images with Option B… but with varying levels of resolution.

In the end, I think both options can be useful ways of randomly sampling TCGA data without storing the full WSIs locally. Which, given their 12.95TB size, I’d rather not do.

Javascript's weird typecasting rules

adrien@adfoucart.be — 2023-10-03

Over on Instagram, I saw a post making fun of Javascript’s weird and unpredictable (unless you really know Javascript) typecasting behaviour. There is a very long tradition of mocking Javascript, with perhaps the most famous instance being the “Wat” video from Gary Bernhardt, complete with very dated early 2010s meme culture (I feel old).

Neither really takes the time to explain what’s going on, so I thought it would be worth quickly going through the Instagram one, because I think it nicely illustrates the difficulty of creating a programming language that intuitively “makes sense.” Something which Javascript spectacularly fails to do.

So, first one:

console.log(3 + 3);

This, obviously, is the expected behaviour. Nothing to see.

console.log("3" + "3");

This is logical as well, but it clearly shows the root of the issue with a dynamically typed language that also converts types implicitly. Because the + operator means “arithmetic addition” (for numbers), but also “concatenation” (for strings), the creators of the language have to make some choices as to what happens when you mix both types. Which is where it starts to become weird.

console.log("3" + + 3);

In this case, the + + is not actually needed, we could just use a single + and get the same result. It’s just put there to set up the next bit. But the reasoning here is simple: we start from a string, we have a + so we want to concatenate something. What is this thing? + 3, which is a number. But we cannot concatenate a number to a string, so it is first cast into the string "3", hence the final result: the string "33".

console.log(3 + + "3");

This is where it starts being a bit confusing, especially since doing 3 + "3" actually outputs 33. So the casting is done towards the string type for "3" + 3 and for 3 + "3", but the second + sign causes everything to fall back to numbers. Why? Because + in Javascript isn’t just for addition or concatenation, it can also be the unary plus operator, which converts whatever follows it to a number. And, because the unary operator has precedence over the addition operator, what Javascript evaluates is:

a = + "3" = 3
3 + a = 3 + 3 = 6

console.log(3 + 3 - 3);

No problem here.

console.log("3" + "3" - "3");

What’s happening here? The two operators, + (addition because there are things on both sides, not unary plus) and - have the same precedence, and arithmetic operators use “left-associativity” (except exponentiation, but let’s not go there), so this gets evaluated as ("3" + "3") - "3".

With "3" + "3", + will mean “concatenation,” so we get the string "33". Then, we subtract the string "3". But while addition applied to strings is concatenation, subtraction applied to strings… doesn’t exist in Javascript. The subtraction operator always tries to “coerce both operands to numeric values” before performing the operation, so "33" - "3" is evaluated as 33 - 3 = 30.
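The three rules at play (binary + concatenates as soon as one operand is a string, unary + converts its operand to a number, - always converts both operands to numbers) can be mimicked in a few lines of Python. This is a deliberately simplified toy model that only aims to cover integers and simple numeric strings, not a faithful implementation of the Javascript specification, and the helper names are made up for the illustration:

```python
def to_number(value):
    """Rough stand-in for Javascript's numeric coercion (toy model)."""
    if isinstance(value, str):
        text = value.strip()
        if text == "":
            return 0  # in Javascript, an empty string converts to 0
        for convert in (int, float):
            try:
                return convert(text)
            except ValueError:
                pass
        return float("nan")  # anything else is treated as "not a number"
    return value


def js_plus(value):
    """Unary plus: convert the operand to a number."""
    return to_number(value)


def js_add(left, right):
    """Binary plus: concatenate if either operand is a string, otherwise add."""
    if isinstance(left, str) or isinstance(right, str):
        return str(left) + str(right)
    return left + right


def js_sub(left, right):
    """Binary minus: always coerce both operands to numbers first."""
    return to_number(left) - to_number(right)


print(repr(js_add("3", js_plus(3))))         # "3" + + 3
print(repr(js_add(3, js_plus("3"))))         # 3 + + "3"
print(repr(js_sub(js_add("3", "3"), "3")))   # "3" + "3" - "3"
```

With these rules, the three calls give the string '33', the number 6 and the number 30, matching the outputs discussed above.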

How do other languages handle it? Typically, most languages prefer to throw an exception rather than to have unintuitive outputs. Python, for instance, will refuse to do "3" + 3 or 3 + "3", throwing a TypeError in both cases. It does have its own weirdness, however, as "3" * 3 will work, and output the string '333': the multiplication operator, for strings, is interpreted as a “repeat” operator in Python.
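The Python behaviour is easy to check directly. A short sketch:

```python
# Mixing a string and a number with + is refused in both orders.
for left, right in (("3", 3), (3, "3")):
    try:
        left + right
    except TypeError as error:
        print(f"{left!r} + {right!r} raises TypeError: {error}")

# Multiplying a string by an integer repeats the string.
repeated = "3" * 3
print(repr(repeated))  # '333'
```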

So what’s the conclusion of all that? Except for “don’t use Javascript unless you really, really have to,” it’s: programming languages are often weird, but in the end they are just following their internal rules exactly. I certainly prefer languages where those rules tend to produce the behaviour that I intuitively expect. But languages are designed according to their own needs. Javascript tries to keep going as long as possible without crashing, because the goal is for end users on their web browsers to not have any errors on their ends. “Forgiveness by Default,” because if it doesn’t, users are unhappy.

Also, developers have been making fun of Javascript for so long that it would now feel weird if it became less weird.

[Preprint] Finding the best channel for tissue segmentation in whole-slide images

adrien@adfoucart.be — 2023-09-25

Tissue segmentation in digital pathology is one of these weird little problems where it seems like it should be trivial, where in fact in many situations it is trivial, yet if you need a robust solution with a reasonable amount of precision, it is surprisingly difficult to get it right. This is something that we examine in our latest preprint, accepted into the SIPAIM 2023 conference [read as HTML - read as PDF], written with Arthur Elskens, Olivier Debeir and Christine Decaestecker.

What are we trying to do?

This is a situation that most developers who have worked with digital pathology images have had to deal with at some point or another: you have a very large image with some tissue regions, and a lot of light gray “background,” where the backlight from the microscope shines through the glass slide.

Example of a whole-slide image from a breast cancer sample (source: TCGA, case TCGA-AC-A2FB)

The goal is to clearly separate the tissue region from the background region. There are many potential reasons for that. One of the most common is when we are extracting patches for training a deep neural network: we don’t want to extract tiles in the background region, where there is no useful information to be found.

Another case is when we are trying to quantify stuff, such as the percentage of the tissue that is covered by, for instance, tumoral cells. It’s important in that case to have a good estimation of what the total tissue area is. Kleczek et al. [1] also use tissue segmentation to get a more accurate estimation of the stain colors, for a process called stain deconvolution which is also fairly common.

Those different cases have very different requirements: for the patch extraction, for instance, we don’t really care if it’s super precise. For quantification processes, however, we do.

How do we do it?

When we look at the literature, it’s quickly apparent that most people just kinda wing it (which is what I was doing up to now as well, to be honest). Everyone has a slightly different but relatively similar pipeline, which most of the time can be summed up in a few steps:

Channel reduction: convert the 3-channel color image into a single-channel “grayscale” image.
Thresholding: determine the best value that separates background from non-background in this single-channel image.
Post-processing: clean up the segmentation mask, typically with some morphological operations and by filling any holes within the tissue region.

A few algorithms have also been developed a bit more rigorously, such as FESI [2] or EntropyMasker [3], but they still follow mostly the same pipeline.

Our study

The step that seems the most important in this process is the channel reduction step: how do we transform the image so that the values of the pixels in the single-channel image are well-correlated with whether they are within the tissue region or not?

The main characteristic of the background region is that it’s light, so a common choice is to simply use the regular grayscale transform (or “inverted” grayscale so that tissue regions appear brighter than background regions). Another characteristic is that it’s grayish, or “desaturated.” So we can use the saturation in the HSV colorspace. Another approach (used for instance in the VALIS registration software [4]) is to first estimate the color of the background region (by looking at the brightest pixels), then compute a color distance from each pixel to this estimated background color. Pixels in the tissue should be “further” in colorspace.
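These three channel reductions can be sketched with NumPy alone. This is an illustration under simplifying assumptions, not the code from our experiments: the background color is estimated here as the mean of the brightest 1% of pixels and the distance is a plain Euclidean distance in RGB, which is not necessarily what VALIS does, and the function names and the toy image are made up:

```python
import numpy as np


def inverted_grayscale(rgb):
    """1 - luminance, so that tissue is brighter than the light background."""
    rgb = rgb.astype(float) / 255.0
    weights = np.array([0.2125, 0.7154, 0.0721])  # luminance weights, as in scikit-image's rgb2gray
    return 1.0 - rgb @ weights


def saturation(rgb):
    """Saturation channel of HSV: (max - min) / max, or 0 for black pixels."""
    rgb = rgb.astype(float) / 255.0
    cmax = rgb.max(axis=-1)
    cmin = rgb.min(axis=-1)
    return np.divide(cmax - cmin, cmax, out=np.zeros_like(cmax), where=cmax > 0)


def color_distance(rgb, bright_fraction=0.01):
    """Euclidean RGB distance to a background color estimated from the brightest pixels."""
    rgb = rgb.astype(float) / 255.0
    brightness = rgb.mean(axis=-1)
    cutoff = np.quantile(brightness, 1.0 - bright_fraction)
    background = rgb[brightness >= cutoff].mean(axis=0)
    return np.linalg.norm(rgb - background, axis=-1)


# Toy image: light gray background with a pinkish "tissue" square.
image = np.full((32, 32, 3), 230, dtype=np.uint8)
image[8:24, 8:24] = (200, 100, 150)
for channel_fn in (inverted_grayscale, saturation, color_distance):
    channel = channel_fn(image)
    print(channel_fn.__name__, round(channel[0, 0], 3), round(channel[16, 16], 3))
```

In all three cases the tissue pixel gets a higher value than the background pixel, which is the property the thresholding step relies on.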

We can also use the other main characteristic of the background region: that it’s homogeneous. This is the approach from EntropyMasker: to use the local entropy – or, in other works, the Laplacian or the output of an edge detection algorithm – which will be higher in “textured” regions (such as the tissue) and lower in the background regions.

The goal of our study was to evaluate whether one or some of those approaches, all commonly found in the literature, are better than the others.

Our results

We use annotated data from Bándi et al. [5] and from the TiGER challenge [6]. Five slides from Bándi and 62 from TiGER are used as a development / learning set, and five slides from Bándi and 31 from TiGER as a test set. We worked at a low magnification level corresponding to a resolution of around 15µm/px in all experiments.

We first analyzed the separability of the tissue and background value distributions in the different single-channel options on the development set, and found that the local entropy provides the best separability, and the grayscale the worst.

We then looked at the results of a basic pipeline of channel reduction, median filter, thresholding and morphological post-processing, and compared the four channels’ results according to the Intersection over Union metric. For the thresholding step, we used either a “learned” threshold (based on the development set), or the Otsu algorithm. FESI and EntropyMasker were also tested for comparison.
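A basic pipeline of this kind, applied to an already reduced single-channel image, could look like the sketch below. It is not the code from our experiments: the filter size, the structuring element, the `segment_tissue` and `intersection_over_union` names and the synthetic test image are all assumptions made for the illustration.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import threshold_otsu


def segment_tissue(channel, threshold=None, median_size=5, opening_size=3):
    """Segment tissue in a single-channel image where tissue has the higher values.

    If threshold is None, the Otsu threshold is used; otherwise the given
    fixed (e.g. "learned") value is applied.
    """
    smoothed = ndi.median_filter(channel, size=median_size)
    if threshold is None:
        threshold = threshold_otsu(smoothed)
    mask = smoothed > threshold
    # Post-processing: remove small specks, then fill holes inside the tissue.
    structure = np.ones((opening_size, opening_size), dtype=bool)
    mask = ndi.binary_opening(mask, structure=structure)
    return ndi.binary_fill_holes(mask)


def intersection_over_union(mask, reference):
    """IoU between two boolean masks (defined as 1.0 if both are empty)."""
    mask = np.asarray(mask, dtype=bool)
    reference = np.asarray(reference, dtype=bool)
    union = np.logical_or(mask, reference).sum()
    if union == 0:
        return 1.0
    return np.logical_and(mask, reference).sum() / union


# Synthetic example: a noisy bright square on a darker background.
rng = np.random.default_rng(0)
reference = np.zeros((64, 64), dtype=bool)
reference[16:48, 16:48] = True
channel = np.where(reference, 0.6, 0.1) + rng.normal(0, 0.03, reference.shape)

print("Otsu :", intersection_over_union(segment_tissue(channel), reference))
print("Fixed:", intersection_over_union(segment_tissue(channel, threshold=0.35), reference))
```

On this easy example both thresholds give nearly the same mask. The median filter rounds off the corners of the square slightly, so the IoU stays just below 1.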

Results on a representative image of the test set. The contours of the masks are overlaid on the RGB image (for the annotations, FESI and EntropyMasker) and on the single-channel representations (for the fixed (FT) and Otsu thresholds).

Our main conclusions are:

Our results demonstrate that transforming the RGB image to grayscale, as is very commonly done for tissue segmentation, is a suboptimal choice that should generally be avoided. Better channel reduction choices are available, focusing either on the colorization of the tissue compared to the background (e.g. with Saturation or ColorDist), or on the texture (e.g. with LocalEntropy). Pipelines with many handcrafted steps such as the one proposed in EntropyMasker may be unnecessarily complex, as a very simple pipeline obtains very similar results as long as the right channel reduction step is applied.

What about deep learning?

Tissue segmentation can certainly also be performed using deep convolutional neural networks. Bándi et al., for instance, showed very good results using U-Net. We decided against including such methods in our analysis, however:

We purposefully chose not to focus on deep learning solutions in this study. Often, one of the objectives of tissue segmentation is to have a very simple and quick method that offers a result that doesn’t need to be pixel-perfect. Making simple, classical pipelines more robust and reliable can allow us to get good results while wasting less computing resources.

I think that, in general, we should try to find the “simplest” solution to a problem. For my future needs in tissue segmentation, I’d much rather have a robust, reliable pipeline based on channel reduction and basic operations (that can easily be translated to any framework and included in any pipeline without much fuss) than a big trained neural network that will require some heavy maintenance to keep up to date with the latest versions of deep learning libraries. I have some neural networks that were trained two years ago and which are already annoyingly complicated to run now. I’d feel a lot more comfortable with the kind of pipeline that we studied here.

So do we have a robust solution?

First we have to refine the other steps of the pipeline, most particularly the post-processing, and potentially combine different thresholds. The homemade pipeline that I used until now thresholded on a combination of the grayscale and saturation channels. I’ll likely change that to use saturation and color distance, or saturation and local entropy, instead.

The code for our experiments is available on Gitlab: https://gitlab.com/adfoucart/tissue-segmentation

[1] Kleczek et al., 2020. A novel method for tissue segmentation in high-resolution H&E-stained histopathological whole-slide images. https://doi.org/10.1016/j.compmedimag.2019.101686
[2] Bug et al., 2015. Foreground Extraction for Histopathological Whole Slide Imaging. https://doi.org/10.1007/978-3-662-46224-9_72
[3] Song et al., 2023. An automatic entropy method to efficiently mask histology whole-slide images. https://doi.org/10.1038/s41598-023-29638-1
[4] Gatenbee et al., 2021. VALIS: Virtual Alignment of pathoLogy Image Series. https://doi.org/10.1101/2021.11.09.467917
[5] Bándi et al., 2017. Comparison of different methods for tissue segmentation in histopathological whole-slide images. https://doi.org/10.1109/ISBI.2017.7950590
[6] https://tiger.grand-challenge.org/

Authors suing OpenAI

adrien@adfoucart.be — 2023-07-10

Two authors, Paul Tremblay and Mona Awad, have filed a lawsuit against OpenAI for “direct copyright infringement, vicarious copyright infringement, violations of section 1202(b) of the Digital Millennium Copyright Act, unjust enrichment, violations of the California and common law unfair competition laws, and negligence” [1].

When I first heard about the complaint, I was a little bit skeptical. The key element of the complaint (from what I saw in the media coverage [2]) was that ChatGPT was capable of generating very accurate summaries of those works. In the complaint itself, we can read in the Overview section:

Indeed, when ChatGPT is prompted, ChatGPT generates summaries of Plaintiffs’ copyrighted works—something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works.

And later on in the Factual Allegations section:

On information and belief, the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data.

While I have few doubts that OpenAI is, indeed, using a ton of copyrighted work to train its models, this doesn’t really prove it in my opinion. It seems to imply that the language model is really summarizing works that it has somehow “ingested” in its model. But GPT isn’t really summarizing anything, of course: it’s doing its best impression of a summary of those books, which, as the complaint notes, “get some details wrong.” I find it a lot more likely that, when prompted to summarize those books, GPT is influenced more by online summaries of the books than by the books themselves.

Looking more closely at the full complaint and the exhibits [3], however, I think it’s a lot more solid than it initially appears. While Tremblay and Awad use their own books as examples, their complaint is much bigger, as it’s presented as a class action on behalf of “all persons or entities domiciled in the United States that own a United States copyright in any work that was used as training data for the OpenAI Language Models.” And their case for why OpenAI’s training data includes copyrighted works is largely based on information from OpenAI themselves.

In particular, they note that:

GPT-1 used BookCorpus, which is copied from Smashwords.com and is – apparently – known to contain work under copyright.
GPT-3 adds “two internet-based books corpora,” whose nature is unspecified, but which according to OpenAI’s papers would contain respectively around 60,000 titles and 300,000 titles.

The thing is, there are not really many sources of book data with so many works in it. The suit identifies two likely candidates. The 60k books dataset could correspond to Project Gutenberg. Those books are no longer under copyright, so that’s fine. But the only internet-based books repositories that could include the 300k books from the second dataset would be the LibGen / Z-Lib / Sci-Hub / Bibliotik corpus and/or torrent collections of books, which mostly include copyrighted material.

I think that’s the strongest argument they have here: based off OpenAI’s claim on their own dataset, they almost certainly used copyrighted material to train their model. And if OpenAI wants to demonstrate that it’s not the case, they may have to give a bit more information about their data… So we may finally get some openness from them – under oath.

This is not just about those two authors, however. The lawyers behind this, Joseph Saveri and Matthew Butterick, are clearly determined to make life very difficult for OpenAI and other generative AI startups. Three other authors (Sarah Silverman, Chris Golden and Richard Kadrey) joined an identical complaint against OpenAI [4] and against Meta for their own language model, LLaMA [5]. LLaMA is apparently trained on “the Books3 section of ThePile” [6]. That particular dataset comes from EleutherAI [7], and Books3 is described as coming from Bibliotik, compiled by Shawn Presser and shared on Twitter [8]. Since this dataset is easy to find and download, this is a lot more straightforward: copyrighted books from the authors filing the complaint are in the Books3 dataset, so – according to Meta’s own papers – they were used in training LLaMA. Saveri and Butterick are also behind the lawsuits against GitHub Copilot [9] and Stable Diffusion [10].

It will be interesting to see what kind of precedent the US courts decide to set here. The claim that the training sets contain copyrighted material may be strong, but that doesn’t necessarily mean that training a language model with copyrighted material is by itself infringement. What the judges will decide to do with all this information, I don’t know. I’m sure many lawyers are going to fight about it for many years, though.

[1] Tremblay v OpenAI complaint
[2] The Guardian, July 5th, 2023
[3] Tremblay v OpenAI Exhibits (PDF)
[4] Silverman v OpenAI complaint
[5] Kadrey v Meta complaint
[6] Touvron et al., LLaMA: Open and Efficient Foundation Language Models, 2023. arXiv:2302.13971
[7] Gao et al., The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020. arXiv:2101.00027
[8] Shawn Presser on Twitter
[9] https://githubcopilotlitigation.com/
[10] https://stablediffusionlitigation.com/

[Publication] Evaluating participating methods in image analysis challenges: lessons from MoNuSAC 2020

adrien@adfoucart.be — 2023-04-14

It took a long time -- way longer than necessary, to be honest -- but our paper “Evaluating participating methods in image analysis challenges: lessons from MoNuSAC 2020” has finally been published in Pattern Recognition, and is now available online (doi:10.1016/j.patcog.2023.109600, open access for 50 days using this link).

It uses the published results of the MoNuSAC 2020 challenge, which include the prediction maps from some of the teams, to study how using complex metrics such as Panoptic Quality leads to poorer insights and hard-to-interpret results, compared to using separate, independent metrics for each sub-task (here: segmentation, classification and detection). As we demonstrate in the paper, while the ranking based on the PQ doesn’t really tell us much about the capabilities of the different algorithms, using separate metrics gives us otherwise invisible insights on the results of the challenge.

The preprint version of the paper can be downloaded from this website. Only minor modifications were made for the final paywalled version.

Misusing Hausdorff's Distance

adrien@adfoucart.be — 2023-04-13

Hausdorff’s Distance is useful – but it can be tricky

I like Hausdorff’s Distance as a segmentation metric. It’s a way to measure the distance between the contours of two objects, and provides useful information that isn’t captured by overlap metrics such as the Intersection over Union or the Dice Similarity Coefficient. I talk about it way more extensively in my thesis.

What I want to focus on here, however, is how it’s implemented, and particularly how scikit-image’s implementation leads to some mistakes in the way it’s commonly used.

The example from scikit-image’s documentation shows how the metric is computed for two sets of points. This, however, is not how it would generally be used in a segmentation task.

In a segmentation task, we will generally be comparing two binary masks: one with the annotated object, and one with the prediction, as shown here:

Looking at the hausdorff_distance method in scikit-image, we can see that it expects as arguments two images. The documentation states that it will compute the Hausdorff distance between nonzero elements of given images, which means that if we want to use it as a segmentation metric, we need to provide as input images where the contours of the objects are nonzero elements.

This, however, is a bit confusing for two reasons. First, most segmentation metrics are computed directly on the segmentation masks, so users of the library who don’t read the documentation closely enough could expect that behaviour to be implemented here. Second, scikit-image provides no method for computing a “contours image” from a segmentation mask. The find_contours method returns a list of contour point coordinates, not an image. It also uses some interpolation to compute the coordinates, meaning that we get floating-point values which can’t be used to directly recreate a “contours image” from the output of find_contours.

The easiest way to get the intended results of the method is to do something like:

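import numpy as np
from skimage.metrics import hausdorff_distance
from skimage.morphology import erosion

se = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])  # 4-connected structuring element
gt_contour = ground_truth ^ erosion(ground_truth, se)
predicted_contour = predicted ^ erosion(predicted, se)
distance = hausdorff_distance(gt_contour, predicted_contour)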

Where ground_truth and predicted are binary segmentation masks. The contours here are found by eroding the masks and subtracting the eroded mask from the original (thus only leaving the outer layer of pixels, as shown in the previous figure).

Does it matter?

The confusion about how to properly compute Hausdorff’s distance is not limited to scikit-image. The 2015 Gland Segmentation challenge, for instance, describes the metric in its post-challenge publication (Sirinukunwattana et al., 2017) as “the most extreme value from all distances between the pairs of nearest pixels on the boundaries of S and G,” but the mathematical definition on the challenge website – and their MATLAB implementation – actually compute the distances not just on the boundaries, but on all nonzero pixels in the segmentation masks, thus replicating scikit-image’s behaviour.

In code available from recent publications using scikit-image’s implementation such as Bourigault et al., MICCAI 21 (code) or Le Bescond et al., MICCAI 22 (code), we see the same implementation, where all pixels of the segmentation mask are considered instead of just the contours.

There can be a significant difference between the two ways of measuring the distance, as illustrated in the synthetic example below:

Example of synthetic “ground truth” and “predicted” segmentation masks, as well as the overlapping “contours image,” with the pair of points determining Hausdorff’s distance marked in white (using the points in the contours only) and black (using all points in the binary masks).
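The same kind of difference can be reproduced on a tiny toy case. The sketch below uses a brute-force NumPy version of the distance, written only to keep the example self-contained. It is meant to follow the same definition as scikit-image’s hausdorff_distance (distance between nonzero pixels), but it is not scikit-image’s code, and the masks and helper names are made up:

```python
import numpy as np


def contour(mask):
    """Pixels of the mask that have at least one 4-connected neighbour outside of it."""
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[1:-1, 1:-1] & padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~interior


def hausdorff(image0, image1):
    """Brute-force Hausdorff distance between the nonzero pixels of two small images."""
    points0 = np.argwhere(image0)
    points1 = np.argwhere(image1)
    distances = np.linalg.norm(points0[:, None, :] - points1[None, :, :], axis=-1)
    # Largest of the two directed distances (nearest-neighbour distance, worst case).
    return max(distances.min(axis=1).max(), distances.min(axis=0).max())


# Ground truth: a filled square. Prediction: the same square with a small hole in the middle.
ground_truth = np.zeros((25, 25), dtype=bool)
ground_truth[2:23, 2:23] = True
predicted = ground_truth.copy()
predicted[11:14, 11:14] = False

on_masks = hausdorff(ground_truth, predicted)
on_contours = hausdorff(contour(ground_truth), contour(predicted))
print(on_masks, on_contours)
```

Here the distance computed on all mask pixels is 2 pixels, while the distance computed on the contours is 8 pixels: the two conventions answer different questions, so a reported value only makes sense if we know which one was used.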

From a quick search on GitHub, I have found other examples of “misuse” of scikit-image’s confusing implementation.

So what’s next?

I have raised an issue on scikit-image’s repository. To avoid backward compatibility issues, I don’t think it’s a good idea to change the behaviour of the method in scikit-image, but I think using a clearer example of how to use it, being more explicit in the docstring and/or adding a function to directly compute the distance from the segmentation masks would be interesting improvements.

I’ll see whether the scikit-image community agrees with me (or cares at all about the issues, to start with), and hopefully we can limit the confusion in the future.

[Opinion] Bing AI: has Microsoft lost its mind?

adrien@adfoucart.be — 2023-03-08

In case you haven’t heard, Microsoft is “Reinventing search with a new AI-powered Microsoft Bing and Edge, your copilot for the web”. Surfing on the ChatGPT craze, they also further announced the integration of chatbots into Skype, so that you can just send a quick question to your AI pal in the middle of your conversation with friends and family.

Let’s put aside the fact that it necessarily means giving full access to the content of your conversations to Microsoft (no end-to-end encryption if Bing needs to listen in to offer its advice!).

I just want to quickly review the examples that Microsoft has decided to highlight in its own blog posts to showcase the capabilities of its new superpowered AI. Because… I’m starting to doubt that anyone even reads the chatbot outputs before sharing them and saying how wonderful they are.

Let’s start with the first example, in the Bing AI announcement.

I am planning a trip for our anniversary in September. What are some places we can go that are within a 3 hour flight from London Heathrow?

Relatively simple and straightforward request. We need destinations that are:

Less than 3 hours flight from London Heathrow.
Suitable / popular for a romantic getaway.

If we ask Google for a “romantic trip less than 3 hours flight from London”, we get several articles with a lot of possible choices, from inspiringtravel.co.uk, thegentlemansjournal.com, or travelrepublic.co.uk. Not all of those are suitable for the request, but it’s fairly easy to sift through (almost all include some estimation of the flight time and what to do there).

What’s Bing AI’s take?

Bing AI example from Microsoft’s blog

First one is Malaga. The provided information seems correct, and it fits the request, depending on what you consider suitable for an anniversary trip. So far, so good.

Second one is… Annecy, in France. Certainly a very nice city, but it doesn’t have an airport, and is therefore absolutely not within a 3 hour flight from London Heathrow. Ok. One mistake, let’s move on.

Third one is… Florence, Italy. Which as far as I can tell requires a minimum flight time of four hours.

That’s the first example they chose to highlight: two of the three proposed destinations do not fit the prompt. In the propositions from the Google search, you can find Paris, Amsterdam, and plenty of other destinations which arguably fit the prompt better. Also, you know, instead of the Venice of France, there is actually a 2h20 minutes flight from London to actual Venice…

Let’s see what Bing AI in Skype can do. Here, they provide three conversation examples. In the first one, they ask for some vegetarian recipes, and it delivers. I don’t really see how it’s easier than Google in this case, but fine. Second one is about cleaning up a full mailbox, and here again the results seem fine, but you could put the same thing in Google (or probably even Bing!) and get it in the first result.

The third one is where we finally see a request that requires some level of “intelligence”: “what should we do during a layover in Spain? Food an beaches ideally”. That’s a relatively easy question: Bing AI can choose from all over Spain, and information about beaches and restaurants is typically not difficult to find. Can you guess how well it performs?

Bing AI in Skype example from Microsoft’s blog

The first suggestion is fine… Almost. At first glance. As far as I can tell (because the reference is hidden in the screenshot) La Mallorquina in Barcelona is not a pastry shop, it’s a textile shop. There is a pastry shop called “La Mallorquina Formentor”, but it doesn’t seem particularly famous. There is a famous pastry shop called “La Mallorquina” in Spain… but it’s in Madrid. I’m sure Spanish – and Catalan – users will be delighted to know Bing AI confuses Madrid with Barcelona…

The second suggestion is incredible. “If you are in Port of Spain”. Port of Spain. Which is in Trinidad and Tobago, around 6000km from Spain, is the second option for a layover in Spain from Bing AI. In the example cherry-picked by Microsoft to showcase how good their system is. Seriously.

This is absurd. Microsoft may be getting some people to Bing with the hype, but I really don’t see how they will retain customers if they have such a high miss rate in their answers.

The fact that a comparatively much smaller mistake made by Bard in Google’s demo of their competing system apparently had a huge impact on their stock while Microsoft’s mistakes (and there are others in their other demos) don’t seem to have much impact on the hype either suggests that tech investors are completely irrational, or that everyone is so used to Bing being bad that “bad, but with style” is really seen as an improvement.

Anyway. I can’t wait for this hype cycle to be over so that we can move beyond large generative models, which seem more and more like a scientific dead end.

[Devlog] Getting oriented in anatomical space

adrien@adfoucart.be — 2023-01-24

CT images, 3D views and anatomical planes

Throughout my PhD, I managed to avoid dealing with the third dimension. I did some work before that on “structure from motion” for detecting trucks from traffic cameras, but that was ten years ago. Now, however, I have to get away from the comfort of two-dimensional images, like the digital pathology slides I’m most familiar with, and deal with Computed Tomography (CT) images.

The first question when dealing with CT images is… how to “look” at them? You can use some tricks to get a 3D view of the main elements, but if you want to get the “raw” information, the most common way is to use 2D slices cutting in different directions. I have some CT images of mice, which have been injected with a contrasting agent so that the liver is more visible. Let’s take a quick look using the open source 3D Slicer software:

Different views of a CT scan. Top-left: transverse plane (i.e. axes are left-right and dorsal-ventral); bottom-left: frontal plane (axes are left-right and anterior-posterior); bottom-right: sagittal plane (axes are dorsal-ventral and anterior-posterior); top-right: 3D visualisation showing the bones and most visible organs

Since one of the goals of my research is to register ex-vivo histology images to these in-vivo CT images, it would be very useful to determine a reliable frame of reference for “anatomical coordinates.” The reason that it’s useful (rather than just looking at voxel coordinates in the volume) is that with CT images, it’s very easy to reorient and/or crop the image in different ways to make it more practical to process in one way or another. A coordinate system that is independent of these transforms is quite useful.

The characteristics that I want for this system are therefore:

Have an origin point that is reasonably easy to find automatically.
Use axes that are easy to interpret (so, based on the anatomical reference).
Use physical coordinates, i.e. millimeters, rather than pixels.

Point and axes of reference

It seems obvious that the reference point, the origin of our coordinate system, will have to be related to the skeleton. The skeleton is very easy to segment, and the position of the different organs relative to the skeleton can be somewhat predictable. Also, as far as my anatomy knowledge goes, it should be present and similar in all subjects…

Looking at the images, I think an interesting candidate is the junction between the most posterior ribs and the vertebral column (so the last floating rib from this diagram). In our example image, we can easily find this spot in the 3D view of the skeleton:

3D view of the skeleton. The black arrow points to a possible candidate for an easy to find anatomical reference point

So how do we find it automatically?

Let’s start by segmenting the skeleton, using a simple thresholding (it’s fairly easy to find a good threshold automatically in this case, but for simplicity’s sake I’ll hardcode it here). We’ll use SimpleITK to load the volume, but most of the processing will be done with numpy and scikit-image. The useful thing about SimpleITK is that the image object includes information about the “spacing” (the physical size of the voxels).

import SimpleITK as sitk
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt

path_to_volume = r".\volume.mha"
volume = sitk.ReadImage(path_to_volume)
volume_np = sitk.GetArrayFromImage(volume)  # Convert to Numpy array for easier low-level processing

skeleton = volume_np > 100
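
The threshold above is hardcoded. As a rough sketch of how it could be estimated automatically, one option (an assumption here, not necessarily the method used for these volumes) is multi-level Otsu thresholding from scikit-image, keeping the highest threshold as the bone cut-off. The function name `estimate_bone_threshold` and the synthetic volume are made up for illustration.

```python
import numpy as np
from skimage.filters import threshold_multiotsu


def estimate_bone_threshold(volume_np: np.ndarray, classes: int = 3) -> float:
    """Return the highest multi-Otsu threshold.

    Assumption: the brightest of the `classes` intensity groups is bone.
    """
    thresholds = threshold_multiotsu(volume_np, classes=classes)
    return float(thresholds[-1])


# Synthetic stand-in for a CT volume: air (~0), soft tissue (~50), bone (~200).
rng = np.random.default_rng(0)
levels = rng.choice([0.0, 50.0, 200.0], size=(20, 20, 20), p=[0.5, 0.4, 0.1])
fake_volume = levels + rng.normal(scale=5.0, size=levels.shape)

threshold = estimate_bone_threshold(fake_volume)
fake_skeleton = fake_volume > threshold
```

On real data the number of classes would probably need tuning, especially with a contrast agent that brightens some soft tissue.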

In order to view this skeleton, we can “project” the volume on the three different planes:

plt.figure(figsize=(20,7))
plt.subplot(1, 3, 1)
plt.imshow(skeleton.sum(axis=0))
plt.xlabel('Left <-> Right')
plt.ylabel('Dorsal <-> Ventral')
plt.title('Transverse projection')
plt.subplot(1, 3, 2)
plt.imshow(skeleton.sum(axis=1))
plt.xlabel('Left <-> Right')
plt.ylabel('Anterior (head) <-> Posterior (tail)')
plt.title('Coronal projection')
plt.subplot(1, 3, 3)
plt.imshow(skeleton.sum(axis=2))
plt.xlabel('Dorsal <-> Ventral')
plt.ylabel('Anterior (head) <-> Posterior (tail)')
plt.title('Sagittal projection')
plt.show()

Projection of the skeleton in the three anatomical planes

Automatically finding the right place

It’s way easier to process 2D images than 3D volumes, so let’s first try to use the coronal projection to find the reference point’s left-right and anterior-posterior coordinates, and we’ll worry about the dorsal-ventral dimension later.

coronal_proj = skeleton.sum(axis=1)

plt.figure()
plt.imshow(coronal_proj)
plt.xlabel('Left <-> Right')
plt.ylabel('Anterior (head) <-> Posterior (tail)')
plt.title('Coronal projection')
plt.show()

Coronal projection of the skeleton

First, we want to find the path of the vertebral column. Since the body is mostly symmetrical in the left-right dimension, we can try to use the centroid of each row to see if it gives us something interesting:

x = np.arange(coronal_proj.shape[1])
centroids = (coronal_proj * x).sum(axis=1) / coronal_proj.sum(axis=1)
plt.figure()
plt.imshow(coronal_proj)
plt.plot(centroids, np.arange(coronal_proj.shape[0]), 'r-')
plt.xlabel('Left <-> Right')
plt.ylabel('Anterior (head) <-> Posterior (tail)')
plt.title('Coronal projection')
plt.show()

Coronal projection with “centroid” line
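
One caveat with the per-row centroid: a row with no skeleton pixels divides by zero and gives NaN. The sketch below is a guarded variant on a toy array; the helper name `row_centroids` is hypothetical, and whether empty rows ever occur depends on how the volume was cropped.

```python
import numpy as np


def row_centroids(proj: np.ndarray) -> np.ndarray:
    """Intensity-weighted column centroid of each row; NaN for empty rows."""
    x = np.arange(proj.shape[1])
    mass = proj.sum(axis=1).astype(float)
    weighted = (proj * x).sum(axis=1).astype(float)
    out = np.full(proj.shape[0], np.nan)
    # Only divide where the row actually contains something.
    np.divide(weighted, mass, out=out, where=mass > 0)
    return out


toy_proj = np.array([[0, 1, 1, 0],
                     [0, 0, 0, 0],
                     [2, 0, 0, 2]])
toy_centroids = row_centroids(toy_proj)
```

The first and last rows are symmetrical around column 1.5, and the empty middle row stays NaN instead of raising a warning.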

The line is not always well centered: the vertebral column isn’t completely straight, so the skeleton is not always symmetrical along a horizontal line of this projection. However, the centroid always remains inside the column, except near the front, where the front paws pull it off course. To adjust for that, we can use a “connected components” rule to only keep the largest object in the image:

from skimage.measure import label, regionprops

def select_main_object(binary_array: np.ndarray) -> np.ndarray:
    """Removes any bit that is not connected to the largest object"""
    labels = label(binary_array)
    objs = regionprops(labels)
    max_area, max_area_idx = 0, -1
    for obj in objs:
        if obj.area > max_area:
            max_area = obj.area
            max_area_idx = obj.label
    binary_array[labels != max_area_idx] = 0
    return binary_array

coronal_proj[select_main_object(coronal_proj > 0) == 0] = 0
x = np.arange(coronal_proj.shape[1])
centroids = (coronal_proj * x).sum(axis=1) / coronal_proj.sum(axis=1)

plt.figure()
plt.imshow(coronal_proj)
plt.plot(centroids, np.arange(coronal_proj.shape[0]), 'r-')
plt.xlabel('Left <-> Right')
plt.ylabel('Anterior (head) <-> Posterior (tail)')
plt.title('Coronal projection')
plt.show()

Cleaned-up coronal projection with “centroid” line
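
Note that `select_main_object` modifies the array it receives in place. A possible alternative, sketched here under the hypothetical name `largest_component`, leaves the input untouched and finds the largest label with `np.bincount` instead of looping over `regionprops`:

```python
import numpy as np
from skimage.measure import label


def largest_component(mask: np.ndarray) -> np.ndarray:
    """Return a boolean mask of the largest connected object in `mask`."""
    labels = label(mask)
    if labels.max() == 0:
        # No foreground at all: return an empty mask of the same shape.
        return np.zeros(mask.shape, dtype=bool)
    counts = np.bincount(labels.ravel())
    counts[0] = 0  # ignore the background label
    return labels == counts.argmax()


toy_mask = np.zeros((6, 6), dtype=bool)
toy_mask[0, 0] = True      # small isolated object
toy_mask[3:5, 2:5] = True  # larger object (6 pixels)
main_object = largest_component(toy_mask)
```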

That’s better. Now, we can use a simple heuristic to determine the location on the anterior-posterior axis of the junction that interests us. Looking at the values across the “centroid” line, the local minima correspond to the junctions between the vertebrae. We can then check if there are ribs around that junction, and the first junction with ribs starting from the posterior side corresponds to the target.

from skimage.feature import peak_local_max

centerline_x = np.round(centroids).astype('int')
centerline_value = []
for z, x in enumerate(centerline_x):
    centerline_value.append(coronal_proj[z, x])
centerline_value = np.array(centerline_value)
vertebrae = peak_local_max(centerline_value.max() - centerline_value, min_distance=30)[:, 0]

# To find the first vertebra junction with a rib attached, we look at the variance across the left-right line
# passing through the junction
vertebrae_sorted = np.sort(vertebrae)
xcs = (np.array([np.arange(coronal_proj.shape[1]) for _ in range(len(vertebrae))]).T - centroids[
    vertebrae_sorted]).T
variances = ((xcs ** 2) * coronal_proj[vertebrae_sorted]).sum(axis=1)
z = 0
for idx, variance in enumerate(variances):
    if variance > variances.min()*5:
        z = vertebrae_sorted[idx]
        break

plt.figure()
plt.imshow(coronal_proj)
plt.plot(centroids, np.arange(coronal_proj.shape[0]), 'r-')
plt.plot(centroids[z], z, 'wo')
plt.xlabel('Left <-> Right')
plt.ylabel('Anterior (head) <-> Posterior (tail)')
plt.title('Coronal projection')
plt.show()

Coronal projection with position of the first vertebra junction with a rib attached.
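
The “variance” used above is the second moment of each row around the centroid line: a row crossed by ribs has mass far from the center, so its value is much larger. Here is a small sketch of the same quantity written with broadcasting, on a toy projection; the name `lateral_spread` is made up for this example.

```python
import numpy as np


def lateral_spread(proj: np.ndarray, rows: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Sum of squared left-right offsets from the centroid, weighted by the projection."""
    x = np.arange(proj.shape[1])
    offsets = x[np.newaxis, :] - centroids[rows][:, np.newaxis]
    return (offsets ** 2 * proj[rows]).sum(axis=1)


# Rows 0 and 2: column only. Row 1: column plus two "ribs" at the edges.
toy = np.array([[0, 0, 3, 0, 0],
                [1, 0, 3, 0, 1],
                [0, 0, 3, 0, 0]], dtype=float)
toy_centroids = np.array([2.0, 2.0, 2.0])
spread = lateral_spread(toy, np.array([0, 1, 2]), toy_centroids)
```

Only the middle row has a non-zero spread, which is what the threshold on `variances` relies on.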

For the position in the left-right axis, we want to get in the center of the vertebral column, so we have to adjust the value a little bit:

# adjust x position
maxima = peak_local_max(coronal_proj[z], num_peaks=2, min_distance=50)
x = int(np.round(maxima.mean()))

plt.figure()
plt.imshow(coronal_proj)
plt.plot(centroids, np.arange(coronal_proj.shape[0]), 'r-')
plt.plot(x, z, 'wo')
plt.xlabel('Left <-> Right')
plt.ylabel('Anterior (head) <-> Posterior (tail)')
plt.title('Coronal projection')
plt.show()

Adjusted position of the junction

We still have to find the dorsal-ventral coordinates. For that, we can go back to the segmented skeleton volume and take the region in the sagittal projection around the anterior-posterior and left-right coordinates that we found:

region_of_interest = skeleton[z - 15:z + 15, :, x]
y = peak_local_max(region_of_interest.sum(axis=0), num_peaks=1)[0, 0]

plt.figure()
plt.imshow(region_of_interest, cmap=plt.cm.gray)
plt.plot(y, 15, 'bo')
plt.xlabel('Dorsal <-> Ventral')
plt.ylabel('Anterior (head) <-> Posterior (tail)')
plt.title('Sagittal plane')
plt.show()

Slice in the sagittal plane around the reference point.

Now we can check where that point is in the three planes of our original volume:

vmin = 0
vmax = 255
plt.figure(figsize=(20, 7))
plt.subplot(1, 3, 1)
plt.imshow(volume_np[z], cmap=plt.cm.gray, vmin=vmin, vmax=vmax)
plt.plot(x, y, 'bo')
plt.xlabel('Left <-> Right')
plt.ylabel('Dorsal <-> Ventral')
plt.title('Transverse plane')
plt.subplot(1, 3, 2)
plt.imshow(volume_np[:, y, :], cmap=plt.cm.gray, vmin=vmin, vmax=vmax)
plt.plot(x, z, 'bo')
plt.xlabel('Left <-> Right')
plt.ylabel('Anterior (head) <-> Posterior (tail)')
plt.title('Coronal plane')
plt.subplot(1, 3, 3)
plt.imshow(volume_np[:, :, x], cmap=plt.cm.gray, vmin=vmin, vmax=vmax)
plt.plot(y, z, 'bo')
plt.xlabel('Dorsal <-> Ventral')
plt.ylabel('Anterior (head) <-> Posterior (tail)')
plt.title('Sagittal plane')
plt.show()

CT volume in the three anatomical planes with the automatically found reference point

That looks decent enough. With that, we can determine the “origin” point of our system of reference. With the spacing of the volume, we can also compute coordinates in physical space along the three anatomical axes:

print(volume.GetSpacing())
spacing = np.array(volume.GetSpacing())
origin = np.array((x, y, z))  # left-right, dorsal-ventral, anterior-posterior
coords_px = np.array((400, 300, 150))
coords_phys = (coords_px-origin)*spacing
print(f"Coordinates {coords_px}px correspond to a point that is \n{coords_phys[0]:.2f}mm right,\n{coords_phys[1]:.2f}mm ventral, and\n{coords_phys[2]:.2f}mm anterior\nfrom the origin at {origin}px, or a distance of {np.sqrt((coords_phys**2).sum()):.2f}mm")

(0.040892000000000005, 0.040892000000000005, 0.040892000000000005)
Coordinates [400 300 150]px correspond to a point that is 
4.50mm right,
-7.93mm ventral, and
-4.21mm anterior
from the origin at [290 494 253]px, or a distance of 10.05mm
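
The conversion can be wrapped into a pair of small helpers, with the inverse mapping included. This is a minimal sketch: the function names are hypothetical, and it assumes coordinates, origin and spacing are all given in (x, y, z) order, which is the order returned by `GetSpacing()` (whereas the NumPy array from `GetArrayFromImage` is indexed as z, y, x).

```python
import numpy as np


def voxel_to_anatomical_mm(coords_px, origin_px, spacing_mm):
    """Offset from the anatomical origin, in millimeters, (x, y, z) order."""
    coords_px = np.asarray(coords_px, dtype=float)
    origin_px = np.asarray(origin_px, dtype=float)
    return (coords_px - origin_px) * np.asarray(spacing_mm, dtype=float)


def anatomical_mm_to_voxel(coords_mm, origin_px, spacing_mm):
    """Inverse mapping: millimeter offsets back to (fractional) voxel indices."""
    coords_mm = np.asarray(coords_mm, dtype=float)
    return coords_mm / np.asarray(spacing_mm, dtype=float) + np.asarray(origin_px, dtype=float)


# Values taken from the printed output above.
spacing = (0.040892, 0.040892, 0.040892)
origin = (290, 494, 253)
point_mm = voxel_to_anatomical_mm((400, 300, 150), origin, spacing)
point_px = anatomical_mm_to_voxel(point_mm, origin, spacing)
```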

Consolidating all these methods into proper functions and classes, we can check that the algorithm works on different images:

from registration.ct import find_anatomical_reference
from segmentation.ct import skeleton_from_volume
from imagetools.vectors import zyx2vector
from viz.plot import plot_volume_slices
import SimpleITK as sitk

paths = [r"E:\data\PTL\PTL01\Round C1\PTL01_20221220_in vivo\4h\PTL01_20221220_09_1C1G_4H_Rec\volume.mha",
         r"E:\data\PTL\PTL01\Round C1\PTL01_20221220_in vivo\4h\PTL01_20221220_10_1C1D_4H_Rec\volume.mha",
         r"E:\data\PTL\PTL01\Round C1\PTL01_20221220_in vivo\4h\PTL01_20221220_12_1C1G1D_4H_Rec\volume.mha"]

for p in paths:
    volume = sitk.ReadImage(p)
    volume_np = sitk.GetArrayFromImage(volume)
    skeleton = skeleton_from_volume(volume_np, acc_max=1e-5, area_min=1e3)
    origin = find_anatomical_reference(skeleton, zyx2vector(volume.GetSpacing()))
    plot_volume_slices(volume_np, point=origin, figsize=(20,7))

CT volume in the three anatomical planes with the automatically found reference point (first image)
CT volume in the three anatomical planes with the automatically found reference point (second image)
CT volume in the three anatomical planes with the automatically found reference point (third image)

Can ChatGPT write an academic paper?

adrien@adfoucart.be — 2023-01-07

Review of “A Day in the Life of ChatGPT”

Mashrin Srivastava tried an interesting ChatGPT experiment: a “paper completely written by ChatGPT” (originally published on LinkedIn, now available as a preprint on ResearchGate).

He appended to the generated paper the whole sequence of prompts and responses that was used to generate it. The original prompt was for an article to submit to a workshop on Sparsity in Neural Networks at ICLR 2023.

Title of the generated article (“A Day in the Life of ChatGPT as a researcher:”) was added by Mashrin, the rest comes from the bot

I don’t think it makes a lot of sense to review the resulting paper like I would do for a regular paper (I can’t pretend that I don’t know it’s a generated text), but I think the exercise is really interesting, so I want to take a deep dive into it.

A few disclaimers before I start:

To be honest about my own biases, I have been quite open about the fact that I don’t think ChatGPT is revolutionary or quite as useful as the hype around it makes it look, and that I’m very skeptical of the actual value of Large Language Models as anything other than fun text generators. But I’m sometimes wrong about things (allegedly), so who knows?
I’m not an expert in sparsity. I’m enough of an expert in machine learning to know why it’s an interesting topic in neural networks, and to know the basics of how to obtain it (e.g. regularization techniques), but I can’t fully judge the scientific quality of a novel paper on the topic. So in terms of accuracy I’ll focus on what I can easily check.

Okay, let’s get started !

1. Looking at the prompts sequence

I started from the Appendix, so I could follow along with the generation of the paper to get a better sense of where the ideas came from and how much “prompting” it actually needed. It’s really great of Mashrin to include all of it, as it makes the analysis way more complete. Thanks!

Generating an outline

The first prompt asked for a paper idea for the workshop, giving ChatGPT the workshop description.

Here, ChatGPT mostly takes point by point all the proposed topics from the workshop description, and arranges them into sections. This is fairly typical from what I’ve seen of ChatGPT: it tends to really want to include everything you mention in its answers.

Workshop description (left) in the prompt vs proposed outline (right) by ChatGPT

The result here is that the outline proposes a wide but superficial survey paper, which would not really be fitting for a workshop. The structure reads more like the chapters of a book than the sections of a conference paper.

So: not a great start, but nothing particularly bad right now.

Generating the introduction

Large portions of the introduction are again directly taken from the workshop description. It then rewrites the paper outline, which itself also came from the prompt. Having the outline explained in the intro is normal, but this means that right now we still don’t have any added value from ChatGPT: if we want to know what are interesting topics related to sparsity, we can read the workshop description and we’ll have as much info as with the current intro.

Introduction generated by ChatGPT

Also, at this point we have zero citations, including for assertions of facts such as:

The range of applications of DNN success (medical diagnostics, autonomous driving)… There should at least be some citation to review papers in those domains.
The large carbon footprint and e-waste.
The potential of a “sparsity” approach.

Generating the “background” section

The first problem here – caught by Mashrin – is that there are still zero citations, which becomes very problematic in what is typically a citation-heavy part of a paper.

The provided definitions are then very surface-level, and arguably wrong: for instance, “weight decay” is not by itself a pruning technique. It’s a regularization technique which can be used in conjunction with a pruning method which removes connections with weights that are close to zero.
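
To make the distinction concrete, here is a minimal NumPy sketch on a toy linear model (not a neural network, and not taken from any of the cited papers). Weight decay shrinks the weights but leaves all of them non-zero; it takes a separate magnitude-based pruning step to actually remove the small ones. All names and hyperparameters are illustrative assumptions.

```python
import numpy as np


def train_with_weight_decay(X, y, decay=0.1, lr=0.1, steps=500):
    """Gradient descent on mean squared error plus an L2 penalty on the weights."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y) + decay * w
        w -= lr * grad
    return w


def magnitude_prune(w, fraction=0.5):
    """Set the given fraction of smallest-magnitude weights to exactly zero."""
    pruned = w.copy()
    k = int(len(w) * fraction)
    if k > 0:
        pruned[np.argsort(np.abs(w))[:k]] = 0.0
    return pruned


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = train_with_weight_decay(X, y)          # small weights, but none exactly zero
w_pruned = magnitude_prune(w, fraction=0.75)  # only now is the model sparse
```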

In general, the level of the explanations at this point could be acceptable for a student assignment (assuming it’s correct, I don’t see many obvious errors but there could be), not quite for an actual research paper.

Mashrin then asks ChatGPT to provide citations and to add more details “for a research paper.” We’re moving away a bit from the “completely generated by ChatGPT” scenario, but that’s a bit nitpicky so let’s keep going.

The response doesn’t have more details, but did add citations and a reference list, so let’s go through that.

Start of the background section generated by ChatGPT

The first reference is “Sparsity refers to the explicit or implicit reduction of the number of non-zero parameters in a neural network (LeCun et al., 1990).” LeCun et al. is not in the reference list (which I think just got truncated due to limits on the output size). Much later in the process, Mashrin asks ChatGPT to provide a BibTeX citation for it, to which we get a reply mentioning a paper called “Optimal Brain Damage” by LeCun, Denker, Solla, Howard and Jackel. This is not exactly correct, as that paper is from 1989 and with only LeCun, Denker, and Solla. While the paper does talk about pruning and reducing the size of a network, it does not provide such a definition of sparsity.

So let’s check some other references:

Sparsity can be introduced in various ways, including through the use of pruning, quantization, and low-rank decomposition (Han et al., 2015): the reference is correct, but the paper really only deals with pruning (although it does mention quantization and low-rank approximation in its related works).
Pruning involves the removal of individual parameters or connections in a neural network based on their importance or relevance to the model’s performance (Hassibi et al., 1993). Pruning can be performed either manually or through the use of automated algorithms.: the reference provided has the wrong year (it’s 1992), but it exists. The paper talks about pruning, but it doesn’t really state such a definition, so it’s not ideal. Also, “manual” pruning of a deep neural network doesn’t sound like a very realistic prospect…
One common automated pruning method is weight decay, which adds a regularization term to the objective function that encourages the weights to be small (Krogh & Hertz, 1992): The reference got truncated as well, but it’s in the final paper, so let’s use the info from there. The paper does explain weight decay, but does not state that it is a pruning method (it’s explored as a way to combat noise).

Let’s stop here for a moment: the citations are generally “on topic,” but they are clearly not the source of the information.

Credit where it’s due, however: ChatGPT shows itself here to be a decent source of papers to look at to actually get to know the topic… at least from a fairly general perspective (up to now at least this is all very superficial).

Generating the “sparse training algorithms” section

Still looking at the citations, here our first is to Krogh and Hertz, 1992, but in the references the title corresponds to another paper from the same authors (from 1991). Again, the citation is “on topic,” but does not explicitly provide the assertion cited.

Start of the sparse training algorithms section generated by ChatGPT

We also here see repetitions from things already written in the previous chapter. This section barely brings any new details or new pieces of information to the table. The algorithms provided are:

Weight decay methods: this is indeed a standard way of inducing sparsity. The “Lasso / Ridge” nicknames are only used in regression models, and they don’t apply to the “regularization term” (see e.g. here for more), so that part is not correct. The last sentence is correct, but the citation is incorrect. Van der Maaten and Hinton is from 2008, not 2010, and the title is “Visualizing data using t-SNE,” not “Getting the most out of a neural network by minimizing the amount of parameter” (which doesn’t exist as far as I can tell). Also, it has nothing to do with weight decay.
Gradient-based pruning methods: the explanation is mostly incorrect (the “standard approach” would rather be to sort the gradients and remove the smallest ones rather than using a hard threshold), and neither of the cited papers (Han et al., van der Maaten et al.) talk about it.

Then it does some improv on the limitations of those two “methods” (with again van der Maaten cited for no reason).

Mashrin prompts it again for “more sparsity algorithms.” We get:

Structured pruning, pointing again to Han et al. Han et al. does not mention anything called structured pruning. It does not mention pruning an entire layer either. In fact, the definition of structured pruning here is just incorrect.
Sparse initialization. The provided source is not about sparse initialization, but about ReLU activation functions.
Column sampling. There is no “column sampling” algorithm as far as I can find. The reference does present a method to reduce parameters, but not by the described method.
Binary weights. This method does not reduce the amount of parameters or induce sparsity. Also, it’s badly described: the weights are only restricted during the forward and backward passes of the training, but they are still stored as usual. It would also not necessarily require “specialized hardware.”
Low-precision weights. That’s just quantization again, which was previously mentioned in the paper.
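
To illustrate the point about binary weights, here is a toy sketch of the general idea, written from scratch for a single linear layer rather than reproducing the cited paper: the forward pass and the gradient use binarized weights, but the update is applied to real-valued weights that are stored as usual. The parameter count is unchanged and none of the binarized weights is zero. Function names and hyperparameters are assumptions.

```python
import numpy as np


def binarize(weights):
    """Map each real-valued weight to +1 or -1."""
    return np.where(weights >= 0, 1.0, -1.0)


def train_step(latent_w, x, target, lr=0.05):
    """One update: compute with binary weights, store the result in real-valued weights."""
    w_bin = binarize(latent_w)
    error = x @ w_bin - target          # forward pass uses the binary weights
    grad = x.T @ error / len(target)    # gradient taken at the binary weights
    return latent_w - lr * grad         # update goes to the real-valued copy


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4))
target = x @ np.array([1.0, -1.0, 1.0, -1.0])
latent_w = rng.normal(scale=0.1, size=4)
for _ in range(50):
    latent_w = train_step(latent_w, x, target)
w_bin = binarize(latent_w)
```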

Mashrin asks for more again, and we get “Dynamic sparsity” (the text doesn’t really correspond to what the paper describes); “Structural sparsity,” which cites a paper by Gao et al. that doesn’t exist, and provides an incorrect definition; and “Functional sparsity,” which cites an article with the wrong authors and year (and which doesn’t correspond to the explanation, which is kind of nonsensical as “constraining the activation to follow a specific function” doesn’t mean anything: that’s just what activation functions do in general).

So, to summarize: the text rehashes and superficially develops what was in the prompt, with “algorithms” that are not particularly well explained, are often badly attributed, are badly named so that it’s hard to find more about them, or just don’t exist.

Moving on… to “novel” ideas

I’m not going to be as detailed for the rest, as it would get very repetitive very quickly, but in the next section on Hardware the first citation is already wrongly attributed, and we keep the same pattern: superficial explanations that “sound right” but are often imprecise, incomplete or just wrong, with citations that don’t match (when they exist).

In the “compression” section, we see the same techniques that were already presented before (pruning, quantization), so no new information.

At some point, Mashrin asks for a novel idea for future research in the area of compression for large-scale neural networks. This moves again away from the “all-written-by-ChatGPT” concept, but it’s interesting. Lack of novelty is often seen as one of the main limitations of LLMs. So what can ChatGPT come up with?

A novel idea on compression proposed by ChatGPT

Well, we see a reformulation of ideas that were already presented before (“adaptively adjust the level of sparsity in the network based on the specific characteristics of the input data,” which nearly matches what Dynamic Sparsity was presented as). The proposed method also doesn’t really make sense. “Jointly optimizes the network weights and the sparsity pattern of the network based on the input data?” … yes? The “sparsity pattern” is directly linked to the “network weights” (connections that can be pruned are connections with weights close to zero), and the training is obviously based on the training data, so what does that even mean? The rest of the explanation similarly makes no sense (but it “reads” nicely!)

ChatGPT is prompted for more novel ideas.

We get… really just rewordings of the same idea, or of previously explained concepts, or ideas so vague as to be completely unusable.

Anyway, I think we get the point. The conclusion is also mostly empty of actual content. It’s basically all “sparsity would be more efficient, so it’s nice, but there are tradeoffs, and it’s difficult.”

Finally, Mashrin asks for an abstract and bibtex citations of the references. The references, as we’ve seen, are mostly accurate, but sometimes made up, which is exactly what you want from an academic paper!

2. Taking a step back

So all of these results from the prompts were then compiled into a full paper. At least one citation has been changed in the full paper from the prompt results (LeCun, 1990 has become Liu, 2015; I haven’t gone through all of them), but otherwise it’s just some reformatting.

What can we make of all of this? Here are my main thoughts:

ChatGPT didn’t understand the assignment

That’s really important to note again, I think, because it shows a key problem with LLMs – they don’t have, nor understand, intent. Well, they don’t understand anything, but let’s move on from that.

The original prompt was to “suggest a paper for the below conference workshop on Sparsity in Neural Networks.” This is not the kind of venue where you try to write a global review of the whole domain discussed in the workshop. Such workshops are typically for really novel – often work-in-progress – ideas that can move the field forward in one or a few of the specific topics (just look at the list of papers from the 2022 edition). Everything that is in this ChatGPT paper would (hopefully) be considered basic, common knowledge for attendees of such a workshop.

I mean, the parts that are correct.

It’s really just an (excellent) word generator

Some people seem to get really upset when they are reminded that LLMs are “stochastic parrots,” but this paper is a great example of how true that is.

It’s really just a word generator.

The structure of the “proposed paper” is taken directly from the prompt. For all the “detailed” sections, it’s a lot of superficial, sometimes nonsensical paragraphs that are “on topic” but don’t bring anything new to the table.

It doesn’t even really work as a “quick review,” which would be fine for personal usage if not for publication, because too much of it is just empty of content or wrong. The citations are sometimes relevant, but most of them are way too old to be really good starting points to get to the state of the art.

Can ChatGPT produce valuable science?

From what I can see here: no. I can’t say exactly how I would feel seeing this as a reviewer without knowing that it was a ChatGPT-generated text, or without going through the prompt, but it would certainly be a hard reject from me.

Besides the fact that it’s not appropriate for the workshop, there are too many obvious red flags. The superficiality, the repetitions, the citations that don’t match the text… These are all things that I notice when I review papers or grade student work, and there are so many of them here that there is no chance that I would let it pass.

I’m not enough of a specialist in questions of sparsity techniques and efficiency in neural networks, so I can’t really judge how wrong it is… but that’s even more of a red flag: even as a non-specialist, it’s very obviously wrong in many places.

Can ChatGPT be useful as a study pal?

If it’s a topic you don’t actually care about and you just want to have a superficial level of knowledge to get the bare minimum to pass a class, sure.

Otherwise I would avoid it.

3. Conclusions

I was skeptical before reading the paper, but I tried to keep an open mind (believe it or not!).

I must say I’m absolutely impressed by the quality of the text. It reads like a scientific paper. I totally understand why many people see this as amazing.

But in the end… Stochastic parrot remains the best description there is. As soon as you go beyond the superficial level of “reading the text,” and you try to parse its meaning and verify what it says, it breaks down into noise.

[Opinion] ChatGPT vs Wolfram Alpha

adrien@adfoucart.be — 2022-12-29

This post was originally published on LinkedIn.

A big misconception I see with ChatGPT relates to the way it “knows” data. I’ve seen many statements like “ChatGPT has a knowledge base that ends in 2021”, or that it “cannot access Google to check its answers”, presented as limitations that may be addressed in future versions. Surely, if it “knows” stuff until 2021, it cannot be that hard to update its knowledge base with more up to date information, or even real-time information with a Google-like web scraper in the background, right?

The problem is that ChatGPT doesn’t “know” data until 2021, not in the way that is implied by the idea of “updating” it. If you ask about the birth date of Napoleon, ChatGPT doesn’t have some biographical data that it can query to retrieve the correct information. What it may have is a strong association between the words “Napoleon”, “birth date” and “August 15, 1769”, if they were often close together in its training data, which is very likely. It will also “know” from its training data that a question about a birth date is very likely to require an answer in a date-like format. Putting all of that together, it will probably give the right answer most of the time.

But this is not knowledge from a database: the information is embedded in the parameters of the network. That means that “updating” the information, or adding new information, requires retraining the model, which a) is very expensive to do, and b) may interfere with some other previously held “information”.

So a ChatGPT-like model will never be able to function as an up-to-date source of information, because you can’t just retrain it regularly with new information scraped from news sources, or Wikipedia, or wherever. Likewise, it cannot just be plugged into a search engine to get the information on-the-fly.

Something closer to this use case, an actual knowledge base that can be queried using natural language, already exists (with its own important limitations): it’s Wolfram Alpha.

If you ask Wolfram Alpha “what is the birthdate of Napoleon”, it will parse the query and, crucially, start by giving you information about how it interpreted it, and the assumptions it made. For instance: that you are referring to “Napoleon” the royal person and not the fictional character from Animal Farm. Then it will give you the result, and then dump some additional related biographical information. This is super important, because it’s a lot easier to determine if the information can be trusted or not.

Is Wolfram Alpha a perfect source of information? Certainly not. It’s not as good at parsing natural language, its answers are not formatted in a nice prose, and it doesn’t attempt to do things like write computer code or poetry. But at least, if it gives you an answer… it’s probably correct. Which in my humble opinion is fairly important if you want to use something as a knowledge source…

[Publication] Review of digital pathology segmentation challenges

adrien@adfoucart.be — 2022-12-14

Our paper “Shortcomings and areas for improvement in digital pathology image segmentation challenges” is now available in its final version in the journal Computerized Medical Imaging and Graphics (doi: 10.1016/j.compmedimag.2022.102155).

It’s probably the most important of the articles written during my thesis. We compiled information about 21 digital pathology challenges organized between 2010 and 2022 that included a segmentation task, and analyzed different aspects:

How the task definitions have evolved over the years, from simple binary or instance segmentation to semantic segmentation and instance segmentation and classification.
How the reference annotations were generated: despite the high level of inter-expert disagreement exhibited by most digital pathology tasks, many challenges only rely on a single expert to generate their annotations.
Their evaluation processes. This is in my opinion the most important part of the study, where we identify a serious lack of transparency in the exact steps of the evaluation process, and show that besides the choice of “metric” (which for segmentation tasks is generally either the IoU, the DSC or Hausdorff’s Distance), there are many choices to be made in the evaluation pipeline (rules for matching instances, aggregation process…) which are often less precisely described, despite the significant impact that they can have on the results. We also note that few challenges make their full evaluation code public.
How top-ranked methods have changed over the years, with increasingly complex deep learning architectures (and often ensembles of networks), embedded in a relatively settled pipeline: some form of stain normalization, patch extraction, data augmentation…

We take in particular a closer look at three selected challenges with some unique or less common characteristics: GlaS 2015 (which included several subtasks ranked separately), Gleason 2019 (which publicly released individual annotations from multiple experts), and MoNuSAC 2020 (which publicly released their full test set annotations, their evaluation code, and several of the participating teams’ prediction maps).

Segmentation challenges are extremely difficult to organise and properly evaluate. In order to make the “competitive” aspects of the challenge work, we often have to make choices which may make the results less reliable in terms of the scientific perspective they offer, such as ignoring inter-expert disagreement and using metrics which are not necessarily well aligned with the clinical or biological task. To ensure that we fully leverage the time and effort taken to organise those challenges, we highlight the need for more transparency in the process:

Restricting access to the datasets (and the test dataset in particular) is obviously necessary while the challenge is underway, but becomes a barrier to subsequent research once the challenge is over. Challenge websites are also often abandoned once the final ranking has been published and/or the related conference event is over. Even though all the reviewed challenges are relatively recent, a lot of the information has been lost, or has become very difficult to find, with some websites no longer available, some links to the datasets no longer working, and contact email addresses not responding as organisers have moved on to other projects.

We emphasize that, in order for challenge results to be fully reproducible (and therefore to be able to compare them to new results), it is necessary to have access “to the evaluation code, the participants’ predictions and the full dataset, including the test set annotations.”

The preprint (which is only very slightly different from the final version) will remain available here on this website.

[Devlog] Supervoxel CT liver segmentation

adrien@adfoucart.be — 2022-12-06

I have some micro-CT images of mice, where I need to segment the liver. I haven’t done a lot of 3D image processing before, and after spending a few years in the “deep learning” side of the force, I thought it would be nice to look at some good classic image processing methods and to see how I could use them. The images are not too hard, because a contrast product was injected in the mice before the acquisition, so that the liver is a bit brighter than other soft tissues in the images.

Three slices from the micro-CT volume, where liver tissue is lighter than the other soft tissues

It’s not quite as easy as doing a binary thresholding, however. If we look at the histogram of the whole volume with values scaled between 0 and 255, we seem to have four main distributions, which mostly correspond to background, soft tissue, contrasted soft tissue, and bones. The distributions, however, are very much overlapping. So if we try to just take “natural” thresholds manually, like here 130 and 200, we end up taking a lot of bones and missing some actual liver tissue. Not ideal.

Histogram of the micro-CT values. Result of 3D segmentation, taking every voxel with a value between 130 and 200. Shown using 3D Slicer.

A method that I’ve always liked in 2D is “superpixels,” where you start by making a regular grid that is deformed so that it sticks to the borders in the image, creating large irregular but mostly homogeneous “pixels.” What’s interesting when you do that is that your “new” image is a lot “smaller” (so easier to process) while keeping the most useful information (where the borders are) intact. In 3D, we don’t have pixels, we have voxels, but the process can still apply. And reducing the size of the image in a clever way like “supervoxels” seems like it could be really useful in 3D images, as the processing time can quickly become very large. This is not a particularly original idea: I quickly checked, and found a paper from 2016 that basically had most of the pipeline that I had in mind (Wu et al. 2016) for the same application (in humans, though). I’m going to simplify things a little bit however, as the presence of the contrast product makes it a little bit easier to find the region of interest.

Results obtained in (Wu et al. 2016) (slice and 3D reconstruction)

The pipeline I tested is as follows:

Compute supervoxels with SLIC
Create graph connecting neighbours, and compute basic stats per supervoxel (mean intensity, volume)
Filter out supervoxels outside of expected range of mean intensity values for the liver
Iteratively merge neighbours of similar intensities.
Select largest merged supervoxel as “liver” segment.

Let’s have a look at what we can get.

SLIC supervoxels

SLIC supervoxel labels for a slice.

SLIC is a commonly used algorithm for finding superpixels (or, in 3D, supervoxels) (Achanta et al. 2012). I use here the implementation from the SimpleITK library, which is relatively straightforward:

slic = sitk.SLIC(volume, 
                 superGridSize=(grid_size, grid_size, grid_size),
                 spatialProximityWeight=proximity_weight)

It has two main parameters to play with: the size of the “super grid,” which will determine how many superpixels there are, and the “spatial proximity weight,” which will determine whether the resulting grid will be “more regular” or if you allow more deformations. Here, I have set the grid size to 20x20x20, and the proximity weight to 20.0. As we can see in the figure, in homogeneous regions (like in the background), the supervoxels tend to be very regular, while in regions with more borders they deform to adapt to the local information.

Computing the graph

After the previous step, we have a 3D volume with, for each voxel, a label so that every voxel that has the same label belongs to the same supervoxel. We want to have a “graph” of how those supervoxels are connected together, so we need to find their neighbours. The simplest way of doing that is to iterate through the supervoxels, dilate them, and find all the other supervoxels that intersect with the dilated volume:

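# note: slic here has been converted to a numpy array;
# dilation and el (the structuring element) are assumed to be defined beforehand
idxs = np.unique(slic)
for idx in idxs:
    mask = slic == idx  # voxels belonging to the current supervoxel
    neighbours = list(np.unique(slic[dilation(mask, el)]))
    neighbours.remove(idx) # remove current supervoxel from list of neighbours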
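
To make the dilate-and-intersect idea concrete, here is a minimal toy sketch on a tiny label volume. It is not the code used for this post: `dilate_6` and `find_neighbours` are hypothetical helper names, and the dilation is a hand-written 6-connected one in plain numpy, which may differ from the structuring element used above.

```python
import numpy as np

def dilate_6(mask):
    """Binary dilation by one voxel with a 6-connected (cross-shaped) neighbourhood."""
    out = mask.copy()
    for axis in range(mask.ndim):
        fwd = [slice(None)] * mask.ndim
        bwd = [slice(None)] * mask.ndim
        fwd[axis] = slice(1, None)
        bwd[axis] = slice(None, -1)
        # grow the mask by one voxel in both directions along this axis
        out[tuple(fwd)] |= mask[tuple(bwd)]
        out[tuple(bwd)] |= mask[tuple(fwd)]
    return out

def find_neighbours(labels):
    """Map each label to the set of labels touching it (toy version, full-volume scan)."""
    neighbours = {}
    for idx in np.unique(labels):
        mask = labels == idx
        touching = set(np.unique(labels[dilate_6(mask)]).tolist())
        touching.discard(int(idx))  # a supervoxel is not its own neighbour
        neighbours[int(idx)] = touching
    return neighbours

# Toy volume with three slabs labelled 0, 1 and 2 along the last axis
labels = np.zeros((2, 2, 6), dtype=int)
labels[:, :, 2:4] = 1
labels[:, :, 4:] = 2
graph = find_neighbours(labels)
```

On this toy volume, the middle slab ends up connected to both others, while the two outer slabs are only connected to the middle one.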

And we can at the same time compute the average intensity value of the region in the volume, and save all that meta-information in a dedicated structure. The problem with this simple approach is that it takes a lot of time: for each supervoxel, we need to make a binary comparison and a morphological operation on the entire volume. And in 3D, things scale up very fast. Without the morphological operation, running this code would already take about 5-6h on my computer. To speed things up, we go through the volume by more reasonable chunks of around 20x20x20 voxels, updating the supervoxel information and the neighbours lists as we go along:

for z in range(0, volume.size.z, window_size):
    for y in range(0, volume.size.y, window_size):
        for x in range(0, volume.size.x, window_size):
            chunk_slic = slic[z:z + window_size, y:y + window_size, x:x + window_size]
            chunk_volume = volume_array[z:z + window_size, y:y + window_size, x:x + window_size]
            idxs = np.unique(chunk_slic)
            for idx in idxs:
                if idx not in supervoxels:
                    supervoxels[idx] = Supervoxel(idx=idx, total=0, volume=0)

                mask = chunk_slic == idx
                neighbours = list(np.unique(chunk_slic[dilation(mask, el)]))
                supervoxels[idx].add(total=chunk_volume[mask].sum(),
                                     volume=np.count_nonzero(mask),
                                     neighbours=neighbours)
                supervoxels[idx].add_chunk(IntVector(x=x, y=y, z=z),
                                           IntVector(x=x+window_size, y=y+window_size, z=z+window_size))

This allows us to go from 5 hours to around 30 seconds without the morphological operation, and around 3 minutes with it for a full volume. This could be further accelerated with some multithreading, but I don’t really need the speed at the moment, so that exercise will be left for another time.

Now that we have our graph, we can also filter any supervoxel that has a mean value outside of the expected range. Ideally, this range should be computed automatically, but for the moment I just use the 130-200 range that I previously defined.
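
As a rough illustration of this filtering step, the sketch below assumes the graph is stored as two plain dictionaries (mean intensity per supervoxel and a set of neighbours per supervoxel), which is a simplification of the actual structure used here. The name `filter_graph` is hypothetical. The point it tries to show is that removing a supervoxel also means removing it from the neighbours lists of the ones that remain.

```python
def filter_graph(means, neighbours, low=130.0, high=200.0):
    """Keep supervoxels whose mean intensity lies in [low, high] and prune their neighbours."""
    kept = {idx for idx, mean in means.items() if low <= mean <= high}
    return {idx: {n for n in neighbours[idx] if n in kept} for idx in kept}

# Hypothetical values: 0 looks like background, 3 like bone
means = {0: 20.0, 1: 150.0, 2: 180.0, 3: 240.0}
neighbours = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
filtered = filter_graph(means, neighbours)
```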

Iterative merging

For the last part of the method, I use a much simpler heuristic than the “graph cut” from (Wu et al. 2016). First, I build a list of all connections between neighbours, sorted based on the absolute difference in mean values. The first “connection” will therefore be between the two neighbouring supervoxels that are the most similar in terms of mean intensity. These two supervoxels are then “merged”: the smaller of the two is “consumed” by the larger (and its neighbours list is also merged with the other’s). Then we iterate through the sorted list of connections.

Two additional parameters control this merging process: a MAX_VOLUME, so that we don’t merge two supervoxels if it would result in an object that’s too large, and a MAX_DISS so that, when we reach a connection that has a dissimilarity above a given threshold (I used 10 in my experiments), we stop.

We then consider that the largest remaining supervoxel is our target object.
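
The merging heuristic described above can be sketched roughly as follows. This is a simplified reconstruction and not the actual code: it assumes the connections are sorted once from the initial mean intensities, it tracks merges with a small union-find structure instead of merging neighbours lists, and `merge_supervoxels` and its arguments are hypothetical names.

```python
def merge_supervoxels(means, volumes, edges, max_diss=10.0, max_volume=float("inf")):
    """Greedily merge neighbouring supervoxels, most similar pairs first."""
    parent = {idx: idx for idx in means}
    size = dict(volumes)

    def find(idx):
        # follow parents up to the supervoxel that "consumed" this one
        while parent[idx] != idx:
            parent[idx] = parent[parent[idx]]
            idx = parent[idx]
        return idx

    connections = sorted(edges, key=lambda e: abs(means[e[0]] - means[e[1]]))
    for a, b in connections:
        if abs(means[a] - means[b]) > max_diss:
            break  # every remaining connection is at least as dissimilar
        root_a, root_b = find(a), find(b)
        if root_a == root_b or size[root_a] + size[root_b] > max_volume:
            continue
        if size[root_a] < size[root_b]:
            root_a, root_b = root_b, root_a
        parent[root_b] = root_a  # the smaller one is consumed by the larger
        size[root_a] += size.pop(root_b)
    return {idx: find(idx) for idx in means}, size

# Hypothetical graph: a chain of four supervoxels
means = {1: 150.0, 2: 153.0, 3: 158.0, 4: 190.0}
volumes = {1: 100, 2: 80, 3: 60, 4: 50}
edges = [(1, 2), (2, 3), (3, 4)]
roots, sizes = merge_supervoxels(means, volumes, edges, max_diss=10.0)
largest = max(sizes, key=sizes.get)  # candidate "liver" segment
```

In this toy example the first three supervoxels merge into one segment, while the last one stays separate because its connection exceeds the dissimilarity threshold.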

Largest supervoxel, reconstructed in 3D Slicer.

Conclusion

The result is still a bit noisy, but I’m very happy with it given the overall simplicity of the approach. The main difficulty will probably come from the large… artery, I think? Basically: the contrast agent used to highlight the tissue is also present in the vascular system, and it’s therefore difficult to avoid taking part of that system along. It should however be possible to use some post-processing to detect regions where we are in a “tubular” shape, and filter them out.

For such “specialized” applications, creating the kind of datasets needed for deep learning methods would be extremely challenging, so it’s really interesting I think to keep some “old school” methods alive. There is still a fair amount of work required to make the pipeline more robust - and to validate it - but I think it works well as a proof-of-concept.

References

Achanta, Radhakrishna, Appu Shaji, Kevin Smith, Aurélien Lucchi, Pascal Fua, and Sabine Süsstrunk. 2012. “SLIC Superpixels Compared to State-of-the-art Superpixel Methods.” IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (11): 2274–82.
Wu, Weiwei, Zhuhuang Zhou, Shuicai Wu, and Yanhua Zhang. 2016. “Automatic Liver Segmentation on Volumetric CT Images Using Supervoxel-Based Graph Cuts.” Computational and Mathematical Methods in Medicine 2016: 1–14. https://doi.org/10.1155/2016/9093721.

[Preprint] Panoptic Quality: not always a good metric

adrien@adfoucart.be — 2022-11-28

This is the last preprint to come out of the work done during my thesis: “Why Panoptic Quality should be avoided as a metric for assessing cell nuclei segmentation and classification in digital pathology,” co-authored by my PhD advisors Christine Decaestecker and Olivier Debeir. The preprint is available on ResearchSquare, or on this website. The code to replicate the results and figures is available on GitHub.

This completes a set of studies that we’ve done, largely but not only based on the results of the MoNuSAC nuclei instance segmentation and classification challenge. In the first one (Foucart, Debeir, and Decaestecker 2022a) (PDF), we discovered some errors in the challenge’s evaluation code, which I previously talked about in this blog. In the second one (Foucart, Debeir, and Decaestecker 2022b) (PDF), we looked at the problem of using “entangled” metrics (i.e. metrics that combine multiple independent subtasks into a single score) - such as Panoptic Quality - for ranking challenges. In this one, finally, we round up our analysis and demonstrate why this metric is inadequate - and should be abandoned - for the task of nuclei instance segmentation and classification, where it is becoming “standard.”

In addition to the previously mentioned problem of combining multiple metrics (in this case, the segmentation IoU with the detection F1-Score), we show that the problem that the metric is designed to evaluate, “Panoptic Segmentation” (Kirillov et al. 2019), has some fundamental differences with the problem of “instance segmentation and classification” that we are trying to evaluate here. We also demonstrate that the IoU, used for the segmentation part, is inadequate for small objects with fuzzy, uncertain boundaries, such as nuclei.

Those three problems (entanglement, task mismatch and inadequate segmentation score) together should disqualify Panoptic Quality as a metric for ranking algorithms on the task of nuclei instance segmentation and classification.
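
To make the “entanglement” concrete, here is a small sketch of how Panoptic Quality is computed from matched instances, following the definition in (Kirillov et al. 2019): the sum of the IoUs of the true positives divided by TP + FP/2 + FN/2, which factors into a segmentation term (mean IoU of the matches) and a detection term (the F1-score). The function name and the example values are made up for illustration.

```python
def panoptic_quality(matched_ious, n_false_positives, n_false_negatives):
    """Return (PQ, SQ, RQ) given the IoUs of matched pairs (each expected to be > 0.5)."""
    n_tp = len(matched_ious)
    denominator = n_tp + 0.5 * n_false_positives + 0.5 * n_false_negatives
    if n_tp == 0 or denominator == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / n_tp   # segmentation quality: mean IoU of the matches
    rq = n_tp / denominator         # recognition quality: detection F1-score
    pq = sum(matched_ious) / denominator
    return pq, sq, rq

# Hypothetical image: three matched nuclei, one spurious detection, one missed nucleus
pq, sq, rq = panoptic_quality([0.9, 0.7, 0.6], n_false_positives=1, n_false_negatives=1)
```

Here PQ comes out around 0.55, and the single number alone does not tell whether the loss comes from the segmentation term or from the detection term.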

References

Foucart, Adrien, Olivier Debeir, and Christine Decaestecker. 2022a. “Comments on ‘MoNuSAC2020: A Multi-Organ Nuclei Segmentation and Classification Challenge’.” IEEE Transactions on Medical Imaging 41 (4): 997–99. https://doi.org/10.1109/TMI.2022.3156023.
———. 2022b. “Evaluating Participating Methods in Image Analysis Challenges: Lessons from MoNuSAC 2020.” https://doi.org/10.13140/RG.2.2.11627.00801.
Kirillov, Alexander, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollar. 2019. “Panoptic Segmentation.” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2019-June: 9396–9405. https://doi.org/10.1109/CVPR.2019.00963.

[Reading] Top ACROBATs - pathology registration

adrien@adfoucart.be — 2022-11-24

In the reviews of medical image registration that I previously summarized, the most common methods were clearly intensity-based. This means that they use the pixel intensity to measure the similarity between the moving image and the target image. The big advantage of such methods is that they are dense: every pixel contributes to the overall similarity and helps find the best transform.

This, however, is not the only available option, and it’s not always the best option. A key issue with intensity-based methods is that they will typically not work well if there are real alterations between the two images, alterations meaning here that the images are fundamentally different in some aspect, and not just “warping” of each other. This is typically the case in digital pathology registration, where we may be working with adjacent slices of tissue (so that the individual cells are not the same from one slice to the next, even though the overall structure will be very similar), and/or with different stains (e.g. the “generic” H&E stain for one, and more specific immunohistochemistry stains for the other).

This is probably why, in the recent Acrobat challenge, the top methods were rather feature-based. Let’s take a look at how they work.

The winner (possibly) - Marzahl et al.

The winner of the challenge was Christian Marzahl, from a company called Gestalt Diagnostics. I didn’t find a description of the method his team used in the challenge… but last year he published a “Robust Quad-Tree based Registration on Whole Slide Images” (Marzahl et al. 2021), so I’m going to assume his Acrobat entry was based on the same principles.

So what did they do?

They used a Quad-Tree approach to “recursively divide the WSI into image segments with successively higher resolution levels,” then to do “piece-wise affine approximation of any non-linear deformation,” using matching SIFT keypoints to determine the transformation matrix.

So, in slightly less technical terms:

At low resolution, find keypoints in the full images.
Determine best “affine” transform (which includes rotation, translation, scaling, shear…)
Recursively:
1. Cut the image in quadrants.
2. Take quadrants at higher resolution.
3. Find keypoints & best affine transform per-quadrant.
4. Repeat until either maximum “depth” is reached, or some “stopping criterion” is met.

Feature-based methods are therefore sparse: they only use a selection of keypoints to find the transform. The big advantage of using only keypoints is that regions where the tissue is damaged, or otherwise “too” different between the two images, will just not be taken into account in the computation of the transform. It’s much easier to discard outliers in this way. It’s also a lot less dependent on a “pre-registration” step. Even if the two images are badly misaligned or rotated, it’s not too difficult to find the “best transform” between sets of matching points.
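
As an illustration of why the “best transform” between two sets of matching points is easy to find, here is a minimal least-squares sketch in numpy. It is not the implementation from (Marzahl et al. 2021): the function name `estimate_affine` and the example points are hypothetical, the matches are assumed to be already paired, and a real pipeline would typically add an outlier rejection step on top.

```python
import numpy as np

def estimate_affine(src_points, dst_points):
    """Least-squares 2x3 affine matrix mapping src (x, y) points onto dst points."""
    src = np.asarray(src_points, dtype=float)
    dst = np.asarray(dst_points, dtype=float)
    # homogeneous coordinates: each row is (x, y, 1)
    design = np.hstack([src, np.ones((len(src), 1))])
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)
    return params.T  # rows give the x' and y' equations

# Hypothetical matched keypoints: a 90° rotation, 2x scaling and a translation
true_matrix = np.array([[0.0, -2.0, 5.0],
                        [2.0, 0.0, -3.0]])
src = [(0, 0), (1, 0), (0, 1), (1, 1), (3, 2)]
dst = [true_matrix @ np.array([x, y, 1.0]) for x, y in src]
estimated = estimate_affine(src, dst)
```

At least three non-collinear matches are needed for the six parameters. Since the initial misalignment never enters the computation, a large rotation between the two images is not a problem in itself, as long as the keypoints are matched correctly.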

The big disadvantage, of course, is that if the matching step is not good, the results are going to be useless.

The code from that publication is also available on GitHub: https://github.com/ChristianMarzahl/WsiRegistration, so it should be relatively easy to test on our images.

The runner-up: VALIS

The runner-up of the challenge was the team behind VALIS (“Virtual Alignment of pathoLogy Image Series”), a digital pathology registration module. So let’s also take a look at their methods (Gatenbee et al. 2021).

While Marzahl et al. were concerned with pairs of slices, VALIS aims at registering a full stack of N adjacent slices. Their pipeline contains three main parts: a pre-processing module, a feature-based rigid registration module, and an intensity-based non-rigid registration refinement module.

Pre-processing

Their pre-processing steps are:

Resize all images to the same largest dimension.
Work on low-resolution images.
Re-coloring to make the images similar. First, the RGB images are transformed into the CAM16-UCS colorspace, then assigned a constant “hue” and “colorfulness,” then converted back to RGB. The resulting images are then finally converted to greyscale, inverted, and normalized so that “they have similar distributions of pixel values.”

This normalization step is further explained:

The normalization method is inspired by (Khan, Rajpoot, Treanor, & Magee, 2014), where first the 5th percentile, average, and 95th percentile of all pixel values is determined. These target values are then used as knots in cubic interpolation, and then the pixel values of each image are fit to the target values.

Rigid registration

Their rigid registration steps are:

Keypoint detection (with BRISK) & feature extraction (with VGG descriptors).
Compute transform between adjacent slices. For the transform between slice i and slice i-1, only the features from keypoints that are also matched in slice i+1 are used.

Non-rigid transformation

This step is done after the rigid registration. They use existing methods, which they don’t describe further:

VALIS can conduct this non-rigid registration using one of three methods: Deep Flow, SimpleElastix, or Groupwise SimpleElastix.

Their results tend to show that the improvement from the non-rigid transformation is relatively small.

Conclusions?

It’s interesting to see that the two best methods on this particular challenge are relying on relatively “old” techniques with keypoints, descriptors, and “good old” image processing. It seems again that deep learning methods don’t show the same dominance for this task as they usually do in image analysis.

The good performance of feature-based methods is something to keep in mind, particularly when digital pathology is involved.

References

Gatenbee, Chandler D., Ann-Marie Baker, Sandhya Prabhakaran, Robbert J. C. Slebos, Gunjan Mandal, Eoghan Mulholland, Simon Leedham, et al. 2021. “VALIS: Virtual Alignment of pathoLogy Image Series.” Preprint. Cancer Biology. https://doi.org/10.1101/2021.11.09.467917.
Marzahl, Christian, Frauke Wilm, Franz Dressler, Lars Tharun, Sven Perner, Christof Bertram, Christine Kröger, et al. 2021. “Robust Quad-Tree Based Registration on Whole Slide Images.” In Proceedings of Machine Learning Research, 156:181–90. https://proceedings.mlr.press/v156/marzahl21a.html.

[Opinion] The Galactica debacle

adrien@adfoucart.be — 2022-11-19

So I’ve made clear in the past that I’m not a huge fan of large language models as a way forward for AI. I think we just got a great example of the fundamental problems of this approach with the “Galactica” fiasco. A decent summary of the events by Will Douglas Heaven was posted on MIT Technology Review. I just want to give here my quick summary and personal experience, and what it all means going forward.

What the hell is Galactica, anyway?

On November 15th, Galactica was introduced by “Papers with Code” (a Meta AI project) on their social media accounts (Twitter thread, LinkedIn post) and a now unavailable live demo. In case Twitter no longer exists by the time you read this post, their main claims were:

Can summarize academic literature, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more.

They talked a bit about their corpus of data:

We train on a large scientific corpus of papers, reference material, knowledge bases and many other sources. Includes scientific text and also scientific modalities such as proteins, compounds and more.

How well did it work?

As you may expect from how quickly it was pulled: not very well. I tried a couple of queries when I saw the post, just to get an idea, and it was not particularly impressive in terms of “summarizing academic literature” or “write scientific code”, the two things that I briefly tried.

I tried to get it to make a review of segmentation methods and a few other basic image analysis concepts, and got badly written, repetitive and completely unsourced (and therefore unusable for any scientific purpose) stuff that was not necessarily “false”, but at best very surface level. Surface level would be fine if it then linked to wherever we could learn more, but in its current state at least it just had no added value over just going to the Wikipedia page of whatever concept you want to write about. Or Google.

I also tried to get it to code an Otsu thresholding algorithm, and the code was messy and filled with mistakes (as in: not keeping its variable names straight, importing a bunch of useless things, etc.) It was clearly lifting things from a few different pieces of code from its training set that had Otsu in them, with no idea of what it was supposed to be doing.

As many others did, I also tested it on some more “politically charged” stuff and it miserably failed to detect that it was making a spirited defense of the Third Reich that maybe should have raised a red flag somewhere in the system.

It’s just baffling that Meta thought this was ready for a public demo.

So, what happened?

Reading the Arxiv preprint, the most charitable reading of the situation I can give is that there has been some, let’s say, misalignment between the research team that developed and validated the model, and the people that made and advertised the demo website. I don’t know if it’s the same people, but if it’s not, it may explain a few things.

Because the thing is, reading the original paper, it really doesn’t look like it’s supposed to “summarize academic literature” or “generate Wiki articles” or “write scientific code”. Certainly, the claim by Yann LeCun that you can “[t]ype a text and http://galactica.ai will generate a paper with relevant references, formulas, and everything” is just plain wrong.

The benchmarks that they used to validate the model are on very specific, specialized tasks, like “prompt with an equation name and generate LaTeX” (if you need that, use WolframAlpha!), describe proteins from an amino-acid sequence, or answer very specific questions (e.g. prompt “Abell 370 is a galaxy cluster located in the constellation of”, answer “Cetus”).

There are a handful of examples in the appendix of larger texts with “Wikipedia” articles and “literature survey”, but they seem very handpicked and there doesn’t appear to have been a systematic evaluation for those.

And as soon as real users got to play with the prompts, the whole thing unraveled.

What can we learn from this?

Large language models, like image generators, are cool. But they are also dumb. Even when they are very large and get very impressive scores on curated benchmarks, they don’t have any real knowledge or understanding. They generate stuff that’s likely to be “acceptable” to a human as “something another human could say”. On very specific prompts, they will be able to get the right answer because the mapping between the tokenized input and the expected output will be very clear (in other words, it’s unlikely that there will be many words other than “Cetus” with a strong association with the words “Abell 370 is a galaxy cluster located in the constellation of” in the dataset). But anything more generic, more like what the announcement implied it could do… just doesn’t work. Or if it does, it’s very likely to be similar to DALL-E and other image generators: to really be able to use it, you have to learn its language. How to tweak the prompts exactly right so that you activate the right path through the model.

But again: if you can do that, you should be able to learn how to use Wikipedia and Google Scholar, where your risk of plagiarizing someone else’s work by accident will be much reduced…

In summary: beware of claims from large language models that have not been thoroughly tested “in the wild”, because that’s the only place you can actually test them (at least if you claim that they can be useful to the general public).

[Devlog] Coordinates and scale

adrien@adfoucart.be — 2022-11-16

Part 1 - why it’s complicated

When working with digital pathology images, particularly in a context where we have to use multiple libraries, compare to different modalities, and work at different scales, it’s very easy to get lost in the coordinate systems. It’s (for me at least) an extremely common source of error in my code, where I mix my axes or screw up in some way the indexing of the region I’m trying to address.

Let’s take a quick example to see what I mean. To open whole-slide images (WSI), I use the Openslide library:

import openslide

slide = openslide.OpenSlide("path/to/slide.ndpi")

These WSI are multi-scale, meaning that they have been scanned at different levels of magnification in the microscope. From the OpenSlide object, we can get information about the available levels, magnification, and corresponding resolution.

max_mag = float(slide.properties[openslide.PROPERTY_NAME_OBJECTIVE_POWER])
mpp_x = float(slide.properties[openslide.PROPERTY_NAME_MPP_X])
mpp_y = float(slide.properties[openslide.PROPERTY_NAME_MPP_Y])

print(f"Maximum magnification: {max_mag}x")
print(f"Maximum resolution: {mpp_x:.2f}x{mpp_y:.2f}µm/px")
print(f"Number of levels: {slide.level_count}")
for level in range(slide.level_count):
    print(f"[{level}] {slide.level_dimensions[level]}px\t" +
          f"@{max_mag/slide.level_downsamples[level]:.2f}x\t" +
          f"{mpp_x*slide.level_downsamples[level]:.2f}x{mpp_y*slide.level_downsamples[level]:.2f}µm/px (" +
          f"downsample={slide.level_downsamples[level]})")

Maximum magnification: 20.0x
Maximum resolution: 0.46x0.46µm/px
Number of levels: 8
[0] (76800, 38016)px    @20.00x 0.46x0.46µm/px (downsample=1.0)
[1] (38400, 19008)px    @10.00x 0.92x0.92µm/px (downsample=2.0)
[2] (19200, 9504)px @5.00x  1.84x1.84µm/px (downsample=4.0)
[3] (9600, 4752)px  @2.50x  3.67x3.67µm/px (downsample=8.0)
[4] (4800, 2376)px  @1.25x  7.35x7.35µm/px (downsample=16.0)
[5] (2400, 1188)px  @0.62x  14.69x14.69µm/px (downsample=32.0)
[6] (1200, 594)px   @0.31x  29.38x29.38µm/px (downsample=64.0)
[7] (600, 297)px    @0.16x  58.76x58.77µm/px (downsample=128.0)

At this point, we have not yet accessed the image data. If we want to do that, we need to call the read_region method, which takes as input three parameters: the location (of the top-left corner), the level and the dimensions. Here, we get a first common source of error, which is that the location needs to be expressed in “pixels at the highest available resolution” (i.e. level 0), while the dimensions are expressed in pixels at the target level. In both cases, the convention used for the axis order is (x, y), with x being the “width” direction and y the “height”.

So, let’s first look at the whole image at the lowest level of magnification:

from matplotlib import pyplot as plt
%matplotlib inline

im_low = slide.read_region((0, 0), 7, (600, 297))

plt.figure()
plt.imshow(im_low)
plt.show()

Full image at lowest available magnification (~59x59µm/px)

Now if we want to look more specifically at the tissue region on the left at a higher magnification (say, 1.25x, corresponding to “level 4” in this image), we have some computations to do. We can easily determine good coordinates and dimensions at the level that we’re currently looking at. Something like (10, 130) for the top-left corner, (120, 150) for the dimensions should work. But we need to convert those values to the right coordinate systems. For the location, it’s the highest available resolution, which is 128x higher than the one we were looking at.

So we need:

location = (10*128, 130*128)

And for the dimensions, we need to look at level 4, which is at 8x the magnification of the current view, so we need:

dimensions = (120*8, 150*8)

Let’s check if our math was correct:

tissue_region = slide.read_region(location, 4, dimensions)

plt.figure()
plt.imshow(tissue_region)
plt.show()

Tissue region @1.25x magnification (~7x7µm/px)
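
The conversion done by hand above can be wrapped in a small helper. The sketch below is only an illustration: `region_at_level` is a hypothetical function (not part of OpenSlide), and the list of downsample factors is copied from the output printed earlier rather than read from the slide, so that the example runs on its own.

```python
def region_at_level(box_xy, box_wh, view_level, target_level, downsamples):
    """Convert a box picked at view_level into (location, level, size) for read_region."""
    view_ds = downsamples[view_level]
    target_ds = downsamples[target_level]
    # location is always expressed in level 0 pixels
    location = (int(round(box_xy[0] * view_ds)), int(round(box_xy[1] * view_ds)))
    # size is expressed in pixels at the target level
    scale = view_ds / target_ds
    size = (int(round(box_wh[0] * scale)), int(round(box_wh[1] * scale)))
    return location, target_level, size

# Downsample factors of the eight levels listed above
downsamples = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
location, level, size = region_at_level((10, 130), (120, 150), 7, 4, downsamples)
# location == (1280, 16640), level == 4, size == (960, 1200)
```

With a real slide, the three values could be passed directly to `slide.read_region(location, level, size)`, using `slide.level_downsamples` for the factors.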

Perfect… but as you can see, it requires some thinking to make sure we get the correct region. And it gets worse. Because read_region returns an Image object from the PIL library, but very often when we do some actual processing (or to display it like we just did with matplotlib) we want to have access directly to the pixel data, which we will typically store in a numpy array. The conversion is easily done, but we can see something interesting when we print the shape of this array:

import numpy as np
tissue_region_array = np.array(tissue_region)
print(tissue_region_array.shape)

(1200, 960, 4)

The dimensions are now expressed with a different convention: the first axis is the “rows” (or height), the second the “columns” (width), and then we have 4 channels, as the image has been opened as an RGBA image, with an Alpha channel. So if we want to address a region in this array, we have to use a (y, x) convention instead of (x, y).
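
One way to limit mistakes is to hide the axis swap in a single place. The helper below is a hypothetical sketch (the name `crop_xy` and the box values are made up) that takes a box in the (x, y) convention and applies it to a (rows, columns, channels) array of the same shape as the one above.

```python
import numpy as np

def crop_xy(array, x, y, width, height):
    """Crop a (rows, columns, channels) array using a box given in (x, y) convention."""
    return array[y:y + height, x:x + width]

# Empty array with the same shape as the tissue region above
region = np.zeros((1200, 960, 4), dtype=np.uint8)
region[300, 400] = (181, 84, 151, 255)  # the pixel at x=400, y=300
patch = crop_xy(region, x=390, y=295, width=20, height=10)
```

The resulting patch has 10 rows and 20 columns, and the marked pixel sits at row 5, column 10 of the patch.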

That’s not all. For registration tasks, a very useful library is SimpleITK. Which has, of course, its own image format, which can cause some trouble if we’re not careful:

import SimpleITK as sitk

tissue_region_sitk = sitk.GetImageFromArray(tissue_region_array)
print(tissue_region_sitk.GetSize())
print(tissue_region_sitk.GetWidth(), tissue_region_sitk.GetHeight(), tissue_region_sitk.GetDepth())

(4, 960, 1200)
4 960 1200

As we can see here, the method to read an image from a numpy array interpreted the numpy dimensions as (depth, height, width), which is not ideal if we want to process the image in SimpleITK… If we want SimpleITK to give the expected results, we need to specify that the pixel values are vectors (here, RGBA vectors):

tissue_region_sitk = sitk.GetImageFromArray(tissue_region_array, isVector=True)
print(tissue_region_sitk.GetSize())
print(tissue_region_sitk.GetWidth(), tissue_region_sitk.GetHeight(), tissue_region_sitk.GetDepth())

(960, 1200)
960 1200 0

And we are back to the (x, y) convention, like PIL. So if we want to access the same pixel in the three representations of the tissue region, we have:

px_pil = tissue_region.getpixel((400, 300))
px_np = tissue_region_array[300, 400]
px_sitk = tissue_region_sitk.GetPixel(400, 300)
print(px_pil)
print(px_np)
print(px_sitk)

(181, 84, 151, 255)
[181  84 151 255]
(181, 84, 151, 255)

Another interesting observation is that, in sitk, we can bring back some information about the location and the “real-world” resolution of the image, information that is completely lost in the PIL and Numpy representations. With the “Origin” and “Spacing”, we can put the image into a world representation. Let’s do it here using µm for our dimensions.

tissue_region_sitk.SetOrigin((10*128*mpp_x, 130*128*mpp_y)) # offset in µm from the top-left corner of the image
tissue_region_sitk.SetSpacing((16*mpp_x, 16*mpp_y)) # resolution in µm at the level 4 (which is 16x downsampled from the maximum)

And we can see that all this information is contained in the sitk.Image object.

print(tissue_region_sitk)

VectorImage (0000028FFFD0E2E0)
  RTTI typeinfo:   class itk::VectorImage
  Reference Count: 1
  Modified Time: 1676
  Debug: Off
  Object Name: 
  Observers: 
    none
  Source: (none)
  Source output name: (none)
  Release Data: Off
  Data Released: False
  Global Release Data: Off
  PipelineMTime: 0
  UpdateMTime: 0
  RealTimeStamp: 0 seconds 
  LargestPossibleRegion: 
    Dimension: 2
    Index: [0, 0]
    Size: [960, 1200]
  BufferedRegion: 
    Dimension: 2
    Index: [0, 0]
    Size: [960, 1200]
  RequestedRegion: 
    Dimension: 2
    Index: [0, 0]
    Size: [960, 1200]
  Spacing: [7.34518, 7.34585]
  Origin: [587.614, 7639.69]
  Direction: 
1 0
0 1

  IndexToPointMatrix: 
7.34518 0
0 7.34585

  PointToIndexMatrix: 
0.136144 0
0 0.136131

  Inverse Direction: 
1 0
0 1

  VectorLength: 4
  PixelContainer: 
    ImportImageContainer (0000028F825273F0)
      RTTI typeinfo:   class itk::ImportImageContainer
      Reference Count: 1
      Modified Time: 1673
      Debug: Off
      Object Name: 
      Observers: 
        none
      Pointer: 0000028F883B1040
      Container manages memory: true
      Size: 4608000
      Capacity: 4608000
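The IndexToPointMatrix in this printout is what maps a pixel index to a physical position. With an identity direction matrix, that mapping reduces to origin + index × spacing on each axis. Here is a minimal sketch of the arithmetic in plain Python, using the rounded spacing and origin values printed above. The function names are illustrative and are not SimpleITK's API (SimpleITK offers TransformIndexToPhysicalPoint for the forward mapping).

```python
# Values copied from the printout above (in µm)
spacing = (7.34518, 7.34585)
origin = (587.614, 7639.69)

def index_to_point(index):
    """Physical position (µm) of a pixel index, assuming an identity direction matrix."""
    return tuple(o + i * s for o, i, s in zip(origin, index, spacing))

def point_to_index(point):
    """Nearest pixel index for a physical position (µm)."""
    return tuple(round((p - o) / s) for p, o, s in zip(point, origin, spacing))

point = index_to_point((400, 300))
print(point)
print(point_to_index(point))  # (400, 300)
```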

Part 2 - how can we manage it more easily?

Clearly, this needs to be managed in some way. Here’s what I’m doing at the moment. First, I want to make the coordinate system more explicit. So I define a “vector” class to represent coordinates and dimensions, using dataclasses (if you want more on dataclasses and why they’re useful, I recommend ArjanCodes’ video on YouTube on the topic):

from dataclasses import dataclass
from typing import Union

@dataclass
class Vector2D:
    x: Union[int, float]
    y: Union[int, float]
    
    @property
    def xy(self):
        return self.x, self.y
    
    @property
    def yx(self):
        return self.y, self.x

coordinates = Vector2D(x=20, y=130)
print(coordinates.xy, coordinates.yx)

(20, 130) (130, 20)

Next, I want to make the “navigation” in the WSI more transparent. For that, I’ll wrap the OpenSlide class into a WholeSlide class. There are several “helper” functions that I want to add there:

Being able to find the “level” for a target magnification.
Being able to extract regions based on relative coordinates (so that I can easily find the same region at different scales).

class WholeSlide:
    def __init__(self, path: str):
        self.path = path
        self.slide = openslide.OpenSlide(path)
        
        if openslide.PROPERTY_NAME_OBJECTIVE_POWER in self.slide.properties:
            self.mag = float(self.slide.properties[openslide.PROPERTY_NAME_OBJECTIVE_POWER])
        else:
            self.mag = 1
        
        if openslide.PROPERTY_NAME_MPP_X in self.slide.properties \
                and openslide.PROPERTY_NAME_MPP_Y in self.slide.properties:
            self.mpp = Vector2D(x=float(self.slide.properties[openslide.PROPERTY_NAME_MPP_X]),
                                y=float(self.slide.properties[openslide.PROPERTY_NAME_MPP_Y]))
        else:
            self.mpp = Vector2D(x=1., y=1.)

        self.dimensions = Vector2D(x=self.slide.level_dimensions[0][0], y=self.slide.level_dimensions[0][1]) 
    
    def get_best_level_for_magnification(self, magnification: float):
        return self.slide.get_best_level_for_downsample(self.mag/magnification)
    
    def get_absolute_position(self, location: Vector2D):
        """Returns location in µm from the top-left corner"""
        return Vector2D(x=location.x*self.dimensions.x*self.mpp.x,
                        y=location.y*self.dimensions.y*self.mpp.y)
    
    def read_region(self, location: Vector2D, magnification: float, dimensions: Vector2D):
        level = self.get_best_level_for_magnification(magnification)
        dimensions_at_level = Vector2D(x=int(dimensions.x*self.slide.level_dimensions[level][0]),
                                       y=int(dimensions.y*self.slide.level_dimensions[level][1]))
        
        location_abs = Vector2D(x=int(location.x*self.dimensions.x), y=int(location.y*self.dimensions.y))
        
        # check if we need to further rescale to get to the target magnification or if we can use the level as is
        if self.slide.level_downsamples[level] == self.mag/magnification:
            return self.slide.read_region(location_abs.xy, level, dimensions_at_level.xy)
        
        print(f"Resizing from {self.mag/self.slide.level_downsamples[level]}x")
        region = self.slide.read_region(location_abs.xy, level, dimensions_at_level.xy)
        factor = self.slide.level_downsamples[level]/(self.mag/magnification)
        target_dimensions = Vector2D(x=int(dimensions_at_level.x*factor), y=int(dimensions_at_level.y*factor))
        return region.resize(target_dimensions.xy)
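The level selection and the extra resize can be checked by hand. The sketch below assumes a pyramid where each level halves the resolution (downsamples from 1 to 128) and a 20x objective, which is consistent with the outputs shown in this post but remains an assumption about this particular slide. best_level is a hypothetical pure-Python stand-in for get_best_level_for_downsample, written under the assumption that OpenSlide picks the deepest level whose downsample does not exceed the requested one, so that any remaining rescaling shrinks the image rather than enlarging it.

```python
level_downsamples = [1, 2, 4, 8, 16, 32, 64, 128]  # assumed pyramid
slide_mag = 20.0                                   # assumed objective power

def best_level(downsample):
    """Deepest level whose downsample is not larger than the requested one."""
    candidates = [i for i, d in enumerate(level_downsamples) if d <= downsample]
    return candidates[-1] if candidates else 0

def plan(magnification):
    """Return (level, magnification of that level, resize factor still needed)."""
    target_downsample = slide_mag / magnification
    level = best_level(target_downsample)
    factor = level_downsamples[level] / target_downsample
    return level, slide_mag / level_downsamples[level], factor

for m in (1.25, 1.0, 2.5):
    print(m, plan(m))
```

Under these assumptions, 1.25x and 2.5x map directly to levels 4 and 3, while 1x is read from the 1.25x level and then shrunk by a factor of 0.8.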

Let’s check that we can get the full image, and the same region at different scales using our relative coordinates:

wsi = WholeSlide("path/to/slide.ndpi")

Full image:

plt.figure()
plt.imshow(wsi.read_region(location=Vector2D(0, 0), magnification=20/128, dimensions=Vector2D(x=1, y=1)))
plt.show()

Full image at lowest available magnification (~59x59µm/px)

Region @1.25x, 1x, 2.5x

region_location = Vector2D(x=0.015, y=0.43)
region_dimensions = Vector2D(x=0.2, y=0.5)

print(f"Location={wsi.get_absolute_position(region_location)}")

plt.figure(figsize=(18,7))
plt.subplot(1, 3, 1)
plt.imshow(wsi.read_region(location=region_location, magnification=1.25, dimensions=region_dimensions))
plt.subplot(1, 3, 2)
plt.imshow(wsi.read_region(location=region_location, magnification=1., dimensions=region_dimensions))
plt.subplot(1, 3, 3)
plt.imshow(wsi.read_region(location=region_location, magnification=2.5, dimensions=region_dimensions))
plt.show()

Location=Vector2D(x=528.8527750998485, y=7505.109958220467)
Resizing from 1.25x

Tissue region @1.25x, 1x and 2.5x magnification

That’s already a lot more convenient. Now we still have to deal with the multi-library situation. What I really want is to make sure that, when I extract a region, I can keep all the relevant information (offset, magnification) along the way, and I can get the image in any format I want. So for that I’m going to create a “SlideRegion” class, and move the read_region code here into a _load method. I’ll also add as_pil, as_np and as_sitk methods to get the images in the requested formats.

class SlideRegion:
    def __init__(self, *,
                 wsi: WholeSlide,
                 location: Vector2D,
                 dimensions: Vector2D,
                 magnification: float):
        self.wsi = wsi
        self.location = location
        self.dimensions = dimensions
        self.magnification = magnification
        self.pil = None

    def _load(self):
        level = self.wsi.get_best_level_for_magnification(self.magnification)
        dimensions_at_level = Vector2D(x=int(self.dimensions.x*self.wsi.slide.level_dimensions[level][0]),
                                       y=int(self.dimensions.y*self.wsi.slide.level_dimensions[level][1]))
        
        location_abs = Vector2D(x=int(self.location.x*self.wsi.dimensions.x), 
                                y=int(self.location.y*self.wsi.dimensions.y))
        
        # check if we need to further rescale to get to the target magnification or if we can use the level as is
        if self.wsi.slide.level_downsamples[level] == self.wsi.mag/self.magnification:
            self.pil = self.wsi.slide.read_region(location_abs.xy, level, dimensions_at_level.xy)
            return
        
        print(f"Resizing from {self.wsi.mag/self.wsi.slide.level_downsamples[level]}x")
        region = self.wsi.slide.read_region(location_abs.xy, level, dimensions_at_level.xy)
        factor = self.wsi.slide.level_downsamples[level]/(self.wsi.mag/self.magnification)
        target_dimensions = Vector2D(x=int(dimensions_at_level.x*factor), y=int(dimensions_at_level.y*factor))
        self.pil = region.resize(target_dimensions.xy)

    def as_pil(self):
        if self.pil is None:
            self._load()
        return self.pil

    def as_np(self):
        if self.pil is None:
            self._load()
        return np.array(self.pil)

    def as_sitk(self, floating: bool = False):
        """Return the region as a sitk.Image. Unless floating is True, the spacing and origin (in µm) are set from the slide metadata."""
        if self.pil is None:
            self._load()
        
        image = sitk.GetImageFromArray(np.array(self.pil), isVector=True)
        
        if not floating:
            downsample = self.wsi.mag/self.magnification       
            spacing = Vector2D(x=float(self.wsi.mpp.x) * downsample,
                               y=float(self.wsi.mpp.y) * downsample)
            origin = self.wsi.get_absolute_position(self.location)
            
            image.SetSpacing((spacing.x, spacing.y))
            
            image.SetOrigin((origin.x, origin.y))

        return image
    
    def absolute_position(self):
        return self.wsi.get_absolute_position(self.location)

And we can now modify the WholeSlide class so that read_region returns a SlideRegion:

class WholeSlide:
    ...
    
    def read_region(self, location: Vector2D, magnification: float, dimensions: Vector2D):
        return SlideRegion(wsi=self, location=location, magnification=magnification, dimensions=dimensions)

So now we can get our region in the format that we want, while keeping the relevant information:

wsi = WholeSlide("path/to/slide.ndpi")

region_location = Vector2D(x=0.015, y=0.43)
region_dimensions = Vector2D(x=0.2, y=0.5)

region = wsi.read_region(location=region_location, dimensions=region_dimensions, magnification=1.25)

pos = region.absolute_position()
plt.figure()
plt.imshow(region.as_np())
plt.title(f"Location=(x={pos.x:.2f},y={pos.y:.2f})µm @{region.magnification}x")
plt.show()

Tissue region @1.25x magnification, with location information

We can check that if we get sitk images at different resolutions, we have the correct information about their origin & spacing:

region_sitk = region.as_sitk()
print(region_sitk.GetOrigin(), region_sitk.GetSpacing())
region_lowres = wsi.read_region(location=region_location, dimensions=region_dimensions, magnification=0.05).as_sitk()
print(region_lowres.GetOrigin(), region_lowres.GetSpacing())

(528.8527750998485, 7505.109958220467) (7.34517743194234, 7.345851889261283)
Resizing from 0.15625x
(528.8527750998485, 7505.109958220467) (183.62943579855852, 183.64629723153206)

This should make it a lot less likely for me to screw up in the future!

[Reading] Reviews of medical image registration

adrien@adfoucart.be — 2022-11-10

My new project involves a lot of image registration, which is an image analysis task that I haven’t really worked on much before. The goal, in the end, is to be able to “put together” information coming from multiple modalities and taken at different times: CT scans, MRI, and also histology images. This means putting everything into the same frame of reference – a registration task.

“In-vivo” MR to “ex-vivo” MR to histology images registration, from (Goubran et al. 2015)

So I obviously need to do some reading. As a starting point, I will use three big reviews of medical image registration. The first is a 2014 review (Oliveira and Tavares 2014). I always like to start from reviews from the pre-“deep learning” era when looking at a task that I’m less familiar with, as they generally give a better overview of the general pipeline and the range of possible approaches. To that, I add a chapter from the 2020 “Handbook of Medical Image Computing and Computer Assisted Intervention,” focusing on registration using “machine learning and deep learning” (Cao et al. 2020). The last one, also from 2020, surveys deep learning methods more specifically (Haskins, Kruger, and Yan 2020).

What is image registration

Image registration “can be defined as the process of aligning two or more images” (Oliveira and Tavares 2014). These may come from different modalities (e.g. CT and MRI), different times (e.g. to monitor tumor growth)… They may be 2D images (e.g. successive slices of a tissue block in digital pathology) or 3D images (e.g. MRI volume). (Cao et al. 2020) put it more mathematically as finding the transformation $\phi^*$ such that:

\[ \phi^* = \operatorname*{arg\,min}_{\phi} \, S(I_R, \phi(I_M)) \]

Where $I_R$ is the reference image (often called “fixed”), $I_M$ is the floating image (or “moving”), and $S$ is a “similarity metric” that measures how well the transformed image $\phi(I_M)$ matches the reference.

So the main elements that we have to play with are:

A transformation model
The similarity metric (how do we define what “matching” means)
An optimization strategy: how do we minimize $S$ so that we can find $\phi^*$.
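These three ingredients can be illustrated with a toy example. The sketch below is a deliberate simplification, not a method from the reviews: the transformation model is an integer translation (applied with np.roll, which wraps around at the borders), the similarity metric is the sum of squared differences, and the optimization is an exhaustive search over a small range of shifts.

```python
import numpy as np

rng = np.random.default_rng(0)
reference = rng.random((64, 64))                          # I_R, the fixed image
moving = np.roll(reference, shift=(-3, 5), axis=(0, 1))   # I_M, a shifted copy

def transform(image, dy, dx):
    """Transformation model: integer translation with wrap-around."""
    return np.roll(image, shift=(dy, dx), axis=(0, 1))

def ssd(a, b):
    """Similarity metric S: sum of squared differences (lower is better)."""
    return float(np.sum((a - b) ** 2))

# Optimization strategy: exhaustive search over all shifts in [-8, 8]
shifts = [(dy, dx) for dy in range(-8, 9) for dx in range(-8, 9)]
best = min(shifts, key=lambda s: ssd(reference, transform(moving, *s)))
print(best)  # (3, -5): the shift that undoes the one applied to the moving image
```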

Transformation models

The transformation model determines how we can modify the moving image. Broadly speaking, we can make a distinction between global models — e.g. rigid or affine transforms — which apply a single operation on the whole “image matrix,” and local or deformable models, which can be expressed as a “deformation field” where each voxel is associated to a vector pointing to its “new” position in the transformed image.

Rigid transforms are often used as a pre-registration step, to broadly align the two images before refining with a more complex and/or local model.
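As an illustration of global models, a 2D rigid or affine transform can be written as a single 3×3 matrix in homogeneous coordinates and applied to every point at once. The matrices and point values below are arbitrary examples. The check at the end shows the defining property of a rigid transform: distances between points are preserved, which is not the case for a general affine transform.

```python
import numpy as np

def rigid(angle_rad, tx, ty):
    """Rotation followed by translation, as a 3x3 homogeneous matrix."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

def apply(matrix, points):
    """Apply a 3x3 homogeneous transform to an (N, 2) array of (x, y) points."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (homogeneous @ matrix.T)[:, :2]

def distance(p):
    """Euclidean distance between the first two points of an (N, 2) array."""
    return float(np.linalg.norm(p[1] - p[0]))

points = np.array([[0.0, 0.0], [10.0, 0.0]])
affine = np.array([[2.0, 0.5, 1.0],   # scaling and shear along x
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])

print(distance(apply(rigid(np.pi / 6, 5.0, -2.0), points)))  # stays at 10 (up to rounding)
print(distance(apply(affine, points)))                        # 20.0
```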

Classic similarity metrics

This is where a lot of the complexity lies – how do you define what counts as a “good match?” It’s particularly difficult in multi-modal problems, where the nature of the information in the two images may be very different. (Oliveira and Tavares 2014) and (Cao et al. 2020) both have mostly the same list of “commonly used” intensity-based metrics:

Sum of Squared Differences / Mean Squared Differences, which assumes that “the corresponding structures in both images should have identical intensities” (Oliveira and Tavares 2014).
Correlation Ratio, (Normalized) Cross-Correlation, which assumes that “there is a linear relation between the intensities of the corresponding structures” (Oliveira and Tavares 2014).
(Normalized) Mutual Information, which assumes that “there is a functional between the variables involved, e.g. between the intensities of both images” (Oliveira and Tavares 2014).

The latter is generally recommended for multimodal registration, as the relationship between the voxel intensities of, for instance, a CT image and an MRI will not be simply linear.
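To see why the choice matters, here is a rough numpy sketch of the three families of metrics on synthetic data. The histogram-based mutual information estimate with 32 bins is one simple choice among many, and the “second modality” is just a made-up non-monotonic remapping of the first image's intensities, so the numbers only illustrate the trend.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences."""
    return float(np.sum((a - b) ** 2))

def ncc(a, b):
    """Normalized cross-correlation (Pearson correlation of the intensities)."""
    a0, b0 = a - a.mean(), b - b.mean()
    return float(np.sum(a0 * b0) / np.sqrt(np.sum(a0 ** 2) * np.sum(b0 ** 2)))

def mutual_information(a, b, bins=32):
    """Mutual information (in nats) estimated from the joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

rng = np.random.default_rng(0)
image = rng.random((128, 128))
other_modality = (image - 0.5) ** 2          # aligned, but non-linear relation
misaligned = rng.permutation(other_modality.ravel()).reshape(image.shape)

print(ssd(image, other_modality))            # large, although the images are aligned
print(ncc(image, 2 * image + 1))             # 1.0 (up to rounding): linear relation
print(ncc(image, other_modality))            # close to 0 although the images are aligned
print(mutual_information(image, other_modality) > mutual_information(image, misaligned))  # True
```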

Optimization strategies

Usually — particularly for complex transforms — we’ll have some sort of iterative process such as a gradient descent algorithm. Pre-registration with a rigid transform tends to make the process easier (or, at the very least, faster).
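As a minimal illustration of such an iterative process, the sketch below recovers a 1D translation by gradient descent on a sum of squared differences. Everything here is a toy assumption: the “images” are Gaussian bumps, the gradient is estimated by finite differences, and the step size and iteration count were picked by hand for this example.

```python
import math

xs = [-5 + 0.1 * i for i in range(201)]   # sampling grid from -5 to 15

def bump(x, center):
    """Gaussian bump of unit width, used as a stand-in for an image profile."""
    return math.exp(-0.5 * (x - center) ** 2)

def cost(t):
    """SSD between the reference (bump at 5) and the moving bump (at 3) shifted by t."""
    return sum((bump(x, 5.0) - bump(x - t, 3.0)) ** 2 for x in xs)

t, step, eps = 0.0, 0.02, 1e-4
for _ in range(200):
    gradient = (cost(t + eps) - cost(t - eps)) / (2 * eps)   # central finite difference
    t -= step * gradient

print(round(t, 3))  # 2.0
```

The starting point matters: here the two bumps already overlap at t = 0, so the gradient points in the right direction. This is the role that a rough pre-registration plays in practice.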

Typical intensity-based registration algorithm pipeline, from (Oliveira and Tavares 2014)

What about machine / deep learning?

Focusing on the multimodal registration problem, (Cao et al. 2020) and (Haskins, Kruger, and Yan 2020) give us some insights on how/where machine learning can intervene in the process.

Learning a similarity metric to improve an otherwise classic pipeline. The idea is found for instance in a deep learning method proposed in (Cheng, Zhang, and Zheng 2018). A CNN classifier is trained to “learn the correspondence of two image patches,” then the probability at the output of the CNN is used as a similarity score when registering new images. An obvious difficulty is that this requires well-aligned pairs of images as a training set.
Learning a common feature representation. Here the idea is that, since the “intensity values” don’t have the same meaning in both modalities, we should try to find an “in-between” feature space where the projections from matching features (from both images) are well correlated.
Learning an appearance mapping model between modalities, e.g. to generate a “pseudo-CT” from an MRI. This then reduces the problem to an easier, monomodal registration problem.
Fully learning the transformation itself is much more complicated. It may be possible for relatively simple models (e.g. learning the transformation matrix of an affine transform), but for a deformable model the supervision is way too impractical to create.

Evaluation of the results

The evaluation of the results is a particularly difficult problem for registration, and it’s a problem that I’ll probably look into a lot more deeply, as I’ve done for segmentation and classification methods before (Foucart 2022).

As Oliveira notes, “the image similarity measure optimisation can be used as a crude accuracy measure,” but “most similarity measures frequently used have no geometric/physical significance” (Oliveira and Tavares 2014). So the most common approach is to “manually identify a set of corresponding points in both input images (…) and use them to assess the registration accuracy.” This, however, means relying on an expert-provided “ground truth,” with all the problems that come along (this is where I point again to my thesis, I guess!)

He also mentions using the Dice similarity coefficient (which is the “per-pixel” F1 score, in classification/detection terms) to quantify “the amount of overlapping regions.”

Not mentioned but, I think, potentially useful in the same vein would be contour-based measures such as the Hausdorff distance: for instance, using a border detector and then checking that the main edges in the registered and target images are close to each other.
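For reference, these three kinds of measures are short to write down. The sketch below uses numpy on made-up landmarks and masks. The function names are illustrative, the landmark error is computed as a mean Euclidean distance (often called target registration error), and the Hausdorff distance is the brute-force symmetric version on point sets, which is only practical for small contours.

```python
import numpy as np

def mean_landmark_error(points_a, points_b):
    """Mean Euclidean distance between corresponding landmarks, given as (N, 2) arrays."""
    return float(np.mean(np.linalg.norm(points_a - points_b, axis=1)))

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two boolean masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    return float(2 * intersection / (mask_a.sum() + mask_b.sum()))

def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two point sets, by brute force."""
    d = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=2)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

landmarks_ref = np.array([[10.0, 10.0], [50.0, 20.0], [30.0, 40.0]])
landmarks_reg = landmarks_ref + np.array([3.0, 4.0])   # every landmark is off by 5 px

mask_ref = np.zeros((10, 10), dtype=bool)
mask_reg = np.zeros((10, 10), dtype=bool)
mask_ref[2:6, 2:6] = True                               # 16 pixels
mask_reg[4:8, 2:6] = True                               # 16 pixels, 8 in common

print(mean_landmark_error(landmarks_ref, landmarks_reg))  # 5.0
print(dice(mask_ref, mask_reg))                           # 0.5
print(hausdorff(landmarks_ref, landmarks_reg))            # 5.0
```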

Two slices from a mouse brain tissue block, roughly aligned manually here through my great Paint.net skills. How do we measure a “good” match? (Images acquired at the CMMI)

Conclusions

So that wraps up this first look at the state-of-the-art. It’s clear that a key difficulty here is that we have a problem with lots of possible choices and parameters, which is difficult to objectively evaluate or even to objectively pose. As an example, one of the things we need to do in the project is to register histology slices that are separated by a distance of ~100µm. So the goal is not really to match one image to another: there is no actual “match” between the two, they are different parts of the object. Instead, we are trying to place the tissue sections in the slides at their “correct” position and orientation in a 3D volume, and then register that to a CT image.

Getting good-looking results is one thing, but objectively validating that the results are “correct,” or even defining what “correct” means in this case… will be interesting.

References

Cao, Xiaohuan, Jingfan Fan, Pei Dong, Sahar Ahmad, Pew-Thian Yap, and Dinggang Shen. 2020. “Image Registration Using Machine and Deep Learning.” In Handbook of Medical Image Computing and Computer Assisted Intervention, 319–42. Elsevier. https://doi.org/10.1016/B978-0-12-816176-0.00019-3.
Cheng, Xi, Li Zhang, and Yefeng Zheng. 2018. “Deep Similarity Learning for Multimodal Medical Images.” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 6 (3): 248–52. https://doi.org/10.1080/21681163.2015.1135299.
Foucart, Adrien. 2022. “Impact of Real-World Annotations on the Training and Evaluation of Deep Learning Algorithms in Digital Pathology.” https://research.adfoucart.be/thesis/FOUCART_Adrien_dissertation.pdf.
Goubran, Maged, Sandrine de Ribaupierre, Robert R. Hammond, Catherine Currie, Jorge G. Burneo, Andrew G. Parrent, Terry M. Peters, and Ali R. Khan. 2015. “Registration of in-Vivo to Ex-Vivo MRI of Surgically Resected Specimens: A Pipeline for Histology to in-Vivo Registration.” Journal of Neuroscience Methods 241 (February): 53–65. https://doi.org/10.1016/j.jneumeth.2014.12.005.
Haskins, Grant, Uwe Kruger, and Pingkun Yan. 2020. “Deep Learning in Medical Image Registration: A Survey.” Machine Vision and Applications 31 (1-2): 8. https://doi.org/10.1007/s00138-020-01060-x.
Oliveira, Francisco P. M., and João Manuel R. S. Tavares. 2014. “Medical Image Registration: A Review.” Computer Methods in Biomechanics and Biomedical Engineering 17 (2): 73–93. https://doi.org/10.1080/10255842.2012.670855.

A new project

adrien@adfoucart.be — 2022-11-07

My PhD thesis is, finally, done.

This means some changes: for my career and, by extension, for this research blog. This post is therefore going to be way more “personal blog” than usual.

My new job

Until the end of October, I was a teaching assistant and a PhD student. I am now moving on as a post-doctoral researcher. I am, however, staying at the Université Libre de Bruxelles, and in the LISA laboratory.

I will be working on the ProTherWal project: “Proton Therapy Research in Wallonia”. This is a big research project involving different French-speaking universities in Belgium. My job as a postdoctoral researcher will be related to the image analysis aspects in particular. The focus will be on combining information coming from different image modalities: CT, PET, MRI, and histopathology.

There are some obvious links with what I’ve been doing before, but still a fairly significant change, and a good opportunity to learn new things.

What’s next for this blog?

I originally intended for this research blog to be a way to communicate on my PhD research. I haven’t really updated it quite as much as I would have liked to during the thesis, but I’m still mostly happy with what I’ve written here.

I’d like to continue to use this channel. I’d like, however, to use it slightly differently.

First, I think it’d be interesting to give a bit more space to the process of research. As I’m starting a new project, it’s a good opportunity to talk about how I go about organizing things, reading the state-of-the-art, etc. As always, the process of writing things down for this blog is largely intended as a useful exercise for me, which may be helpful to others as a byproduct. I think I’d like to update the blog more often, with generally shorter posts.

Second, I’ll probably also move a bit more often into the realm of opinions, as I’ve done in my “Is AI on the wrong path” post. There are many conversations to be had on the use of AI in medical imaging, and I hope that this blog can contribute (with a limited impact for sure) to these conversations.

You want to know when I publish something?

That’s very nice of you, thank you :)

I’ve now added an RSS feed, as the recent Twitter apocalypse has clearly renewed my interest in not depending on a particular social media for promoting what I write. I’ll also generally publish links to blog posts on my LinkedIn. As long as I remain on Twitter, which may not be very long, I am still there @adfoucart. I’m now trying out Mastodon as well, so it’s another way to find me @AFoucart@social.sciences.re.

Finally, if you want to get updates by email, I’ve set up a Substack newsletter: https://adfoucart.substack.com. The goal is to send at most one email per month, and only if I published something on the blog in the meantime.

The thesis

adrien@adfoucart.be — 2022-10-19

Seven years ago, I started working on my thesis.

On October 25th, 2022, I'll be publicly defending it. It took a bit of time, but I'm happy with the result!

The text is available from here: [ Dissertation]

What's inside?

Here are the main topics covered in the thesis

A history of deep learning for image analysis, and a presentation of the main components of deep learning pipelines.
A history of computer vision in digital pathology, and the state-of-the-art of digital pathology image analysis before deep learning.
The state-of-the-art of deep learning in digital pathology and its applications: mitosis detection, tumour classification/scoring, etc...
A study of evaluation metrics and processes: which metrics are used in digital pathology challenges, what are their behaviours and limitations, how can we choose the right metric for the right task?
A study of imperfect annotations: what is their impact on training deep networks, what learning strategies can we use when dealing with imperfections, what is the impact of the imperfections on evaluation metrics?
An application of these concepts to the task of artefact detection and segmentation.
A study of the impact of interobserver variability on the evaluation processes, and how to better include this variability into our evaluations.
A description of some problems with quality control in digital pathology challenges, and recommendations on replicability and transparency for future challenges.

The slideshow (and annotated slides) of the presentation are also available:

Slideshow [.ppsx]
Annotated slides [PDF] (with the text corresponding more or less to what was said during the presentation - or at least what I planned to say!)

Is AI on the wrong path?

adrien@adfoucart.be — 2022-10-04

A current trend in deep learning has been bothering me (and many other people, to be fair…). Deep learning has always been about creating bigger, more complex models, trained on more data, to get better results. But it seems to me that the trend is getting worse, and with big potential consequences for the future of AI research. It may just be in my head, but it seems pretty clear from where I’ve been standing in the field. Which is: slightly on the sidelines, as a researcher who works with deep learning, who regularly writes about deep learning, but who doesn’t really work directly on deep learning architectures.

I started my PhD back in 2015, right when TensorFlow was released and deep learning was just seemingly becoming accessible to all. State-of-the-art models could generally be trained from scratch in a day or so, on a (high-end) consumer-grade GPU. The combination of “Big Data” and powerful GPUs had fueled the shift to the deep learning paradigm, so it was natural that a lot of research came into pushing the boundaries of how large, how deep, how complex we could make those models.

But my impression at the time was that there was also a strong concern for efficiency, and for keeping the complexity as low as possible [1, 2]. This has always been, for me, a key design rule for a good model: it needs to be just complex enough to accurately represent the complexity of the distributions in the data. Moving beyond that complexity inevitably leads to overfitting. Another option for avoiding overfitting, of course, is to add more data. And today’s dominant trend seems to be: let’s be as complex as we can, and use pre-training on very large generic datasets (ImageNet, COCO, ADE20K, Cityscapes, Pascal VOC...) to get the required amount of data, before fine-tuning on whatever task we are actually trying to solve (if there is even one: it's often just about beating the benchmark).

I think that’s a worrying trend, for several reasons.

The first is: I’m not convinced it really works that well. It certainly does on tasks that require an extremely diverse set of features. Image generation and large language models are good examples. Stable Diffusion, DALL-E, and all of the very impressive demos of huge transformer-based models are amazing at their task, but it sometimes seems that their main task is to make transformer-based models look amazing. Maybe that’s just because I generally work in a field with highly specialized tasks, such as “mitosis counting” or “tumour grading”, where it’s a lot more useful to combine very reliable models focused on very narrow, well-defined tasks rather than trying to have a model moderately successful at many different tasks. Transformers and pre-trained networks are widely used there as well, but they haven’t really demonstrated a clear benefit over well-designed, targeted models.

The second reason I don’t like this trend is that we are centralizing AI research into the hands of a few key players. These very large, very complex models require huge amounts of resources, way beyond what’s possible with consumer-grade GPUs. That means that training those models is typically done on large clusters, powered by Amazon’s AWS, Microsoft’s Azure, or Google Cloud. Widespread use of transformers [3] (or vision transformers, etc.) brings more clients to the big cloud providers, who are, at the same time, driving a lot of the current research and setting the state-of-the-art. Transformers were introduced by a Google team. DALL-E is from OpenAI, with heavy links with Microsoft. Should we see a conspiracy by Google and Microsoft to drive AI research towards paradigms that allow them to centralize all AIs in their clouds? I don’t think it’s as intentional as that (and certainly some Google researchers like François Chollet are speaking out against the waste of resources [4]). What I do think is that the focus on benchmarks such as ImageNet as a measure of the state-of-the-art has falsely given the impression that super-large models were the only way forward, and that there is no incentive for Google or Microsoft (or even Meta, who has plenty of computational resources available to them) to look much further than that.

The extent to which some folks in deep learning research waste entire datacenter-months of computational resources to produce nothing but hot air is just so sad. It's like watching a macaque smear expensive oil paint on thousands of pristine linen canvases. François Chollet [@fchollet on Twitter]

The third and final reason is becoming even more obvious in the face of climate change and the current energy crisis: this is all very wasteful. I cannot see how using hundreds of thousands of GPU hours to find just the right hyperparameters that make a model gain a fraction of a percent on some general-purpose benchmark is a good use of our limited supplies of energy.

I understand the temptation to go big. Given what deep learning algorithms can do, it sometimes seems like “true” AI is right around the corner. If we just give it a bit more power, a bit more depth, a few billion parameters more... But I don’t think that’s going to work. And I don’t think that’s what we need AI to do right now.

In medical image analysis, the key challenge, in my opinion, is not in designing bigger, more complex networks, but in creating better datasets, clinically relevant tasks, and evaluation methods, in the absence of a clear, definitive ground truth (for more on that, see my PhD thesis [5]!). Other fields will have their own, sometimes similar, challenges. Challenges that won’t be solved by blindly applying huge models, but by actually focusing on the practical applications of those models outside of controlled benchmarks.

References

[1] Han et al., 2015. https://arxiv.org/abs/1510.00149
[2] Zela et al., 2018. https://arxiv.org/abs/1807.06906
[3] Vaswani et al., 2017. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[4] F. Chollet on Twitter. https://twitter.com/fchollet/status/1484044967907512323
[5] A. Foucart, 2022. [ Dissertation]

A list of digital pathology image analysis challenges

adrien@adfoucart.be — 2022-04-26

There are several lists of digital pathology challenges and/or datasets floating around in different publications and, of course, on grand-challenge.org, but they never fit exactly what I'm looking for, and either miss some or include some that I would consider as slightly different modalities (such as cytology).

So here is the list I'm including in my thesis, as it may be useful to someone else. I included all challenges between 2010 and 2021 that I could find that used Whole Slide Images (WSIs) and/or image patches extracted from WSIs, either with H&E or IHC staining. I report here the reference to either the post-challenge publication if it exists, or the challenge website if it doesn't, and a (very short) description of the challenge's task(s).

Name, Year	Post-challenge publication or website	Task(s)
PR in HIMA, 2010	Gurcan, 2010 [1]	Lymphocyte segmentation, centroblast detection.
MITOS, 2012	Roux, 2013 [2]	Mitosis detection.
AMIDA, 2013	Veta, 2015 [3]	Mitosis detection.
MITOS-ATYPIA, 2014	Challenge website	Mitosis detection, nuclear atypia scoring.
Brain Tumour DP Challenge, 2014	Challenge website	Necrosis region segmentation, glioblastoma multiforme / low grade glioma classification.
Segmentation of Nuclei (SNI) in DP Images, 2015	Description in TCIA wiki	Nuclei segmentation.
BIOIMAGING, 2015	Challenge website	Tumour classification.
GlaS, 2015	Sirinukunwattana, 2017 [4]	Gland segmentation.
TUPAC, 2016	Veta, 2019 [5]	Mitotic scoring, PAM50 scoring, mitosis detection.
CAMELYON, 2016	Ehteshami Bejnordi, 2017 [6]	Metastases detection.
SNI, 2016	Challenge website	Nuclei segmentation.
HER2, 2016	Qaiser, 2018 [7]	HER2 scoring.
Tissue Microarray Analysis in Thyroid Cancer Diagnosis, 2017	Wang, 2018 [8]	Prediction of BRAF gene mutation (classification), TNM stage (scoring), extension status (scoring), tumour size (regression), metastasis status (scoring).
CAMELYON, 2017	Bandi, 2019 [9]	Tumour scoring (pN-stage) in lymph nodes.
SNI, 2017	Vu, 2019 [10]	Nuclei segmentation.
SNI, 2018	Kurc, 2020 [11]	Nuclei segmentation.
ICIAR BACH, 2018	Aresta, 2019 [12]	Tumour type patch classification, tumour type segmentation.
MoNuSeg, 2018	Kumar, 2020 [13]	Nuclei segmentation.
C-NMC, 2019	Gupta, 2019 [14]	Normal/Malignant cell classification.
BreastPathQ, 2019	Petrick, 2021 [15]	Tumour cellularity assessment (regression).
PatchCamelyon, 2019	Challenge website	Metastasis patch classification.
ACDC@LungHP, 2019	Li, 2019 [16]	Lung carcinoma segmentation.
LYON, 2019	Swiderska-Chadaj, 2019 [17]	Lymphocyte detection.
PAIP, 2019	Kim, 2021 [18]	Tumour segmentation, viable tumour ratio estimation (regression).
Gleason, 2019	Challenge website	Tumour scoring, Gleason pattern region segmentation.
DigestPath, 2019	Zhu, 2021 [19]	Signet ring cell detection, lesion segmentation, benign/malignant tissue classification.
LYSTO, 2019	Challenge website	Lymphocyte counting.
BCSS, 2019	Amgad, 2019 [20]	Breast cancer regions semantic segmentation.
ANHIR, 2019	Borovec, 2020 [21]	WSI registration.
HeroHE, 2020	Conde-Sousa, 2021 [22]	HER2 scoring.
MoNuSAC, 2020	Verma, 2021 [23]	Nuclei detection, segmentation, and classification.
PANDA, 2020	Bulten, 2022 [24]	Prostate cancer Gleason scoring.
PAIP, 2020	Challenge website	Colorectal cancer MSI scoring and whole tumour area segmentation.
Seg-PC, 2021	Challenge website	Multiple myeloma plasma cells segmentation.
PAIP, 2021	Challenge website	Perineural invasion detection and segmentation.
NuCLS, 2021	Amgad, 2021 [25]	Nuclei detection, segmentation, and classification.
WSSS4LUAD, 2021	Challenge website	Tissue semantic segmentation from weak, image-level annotations.
MIDOG, 2021	Challenge website	Mitosis detection.

[1] M. N. Gurcan, A. Madabhushi, and N. Rajpoot, "Pattern Recognition in Histopathological Images: An ICPR 2010 Contest," in Lecture Notes in Computer Science, vol. 6388, 2010, pp. 226–234.

[2] L. Roux et al., "Mitosis detection in breast cancer histological images An ICPR 2012 contest," J. Pathol. Inform., vol. 4, no. 1, p. 8, 2013, doi: 10.4103/2153-3539.112693.

[3] M. Veta et al., "Assessment of algorithms for mitosis detection in breast cancer histopathology images," Med. Image Anal., vol. 20, no. 1, pp. 237–248, Feb. 2015, doi: 10.1016/j.media.2014.11.010.

[4] K. Sirinukunwattana, J. P. W. Pluim, H. Chen, et al., "Gland segmentation in colon histology images: The GlaS challenge contest," Med. Image Anal., vol. 35, pp. 489–502, 2017, doi: 10.1016/j.media.2016.08.008.

[5] M. Veta et al., "Predicting breast tumor proliferation from whole-slide images: The TUPAC16 challenge," Med. Image Anal., vol. 54, pp. 111–121, May 2019, doi: 10.1016/j.media.2019.02.012.

[6] B. Ehteshami Bejnordi et al., "Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer," JAMA, vol. 318, no. 22, p. 2199, Dec. 2017, doi: 10.1001/jama.2017.14585.

[7] T. Qaiser et al., "HER2 challenge contest: a detailed assessment of automated HER2 scoring algorithms in whole slide images of breast cancer tissues," Histopathology, vol. 72, no. 2, pp. 227–238, Jan. 2018, doi: 10.1111/his.13333.

[8] C.-W. Wang et al., "A benchmark for comparing precision medicine methods in thyroid cancer diagnosis using tissue microarrays," Bioinformatics, vol. 34, no. 10, pp. 1767–1773, May 2018, doi: 10.1093/bioinformatics/btx838.

[9] P. Bandi et al., "From Detection of Individual Metastases to Classification of Lymph Node Status at the Patient Level: The CAMELYON17 Challenge," IEEE Trans. Med. Imaging, vol. 38, no. 2, pp. 550–560, Feb. 2019, doi: 10.1109/TMI.2018.2867350.

[10] Q. D. Vu et al., "Methods for Segmentation and Classification of Digital Microscopy Tissue Images," Front. Bioeng. Biotechnol., vol. 7, Apr. 2019, doi: 10.3389/fbioe.2019.00053.

[11] T. Kurc et al., "Segmentation and Classification in Digital Pathology for Glioma Research: Challenges and Deep Learning Approaches," Front. Neurosci., vol. 14, Feb. 2020, doi: 10.3389/fnins.2020.00027.

[12] G. Aresta et al., "BACH: Grand challenge on breast cancer histology images," Med. Image Anal., vol. 56, pp. 122–139, 2019, doi: 10.1016/j.media.2019.05.010.

[13] N. Kumar et al., "A Multi-Organ Nucleus Segmentation Challenge," IEEE Trans. Med. Imaging, vol. 39, no. 5, pp. 1380–1391, 2020, doi: 10.1109/TMI.2019.2947628.

[14] A. Gupta and R. Gupta, Eds., ISBI 2019 C-NMC Challenge: Classification in Cancer Cell Imaging. Singapore: Springer Singapore, 2019.

[15] N. Petrick et al., "SPIE-AAPM-NCI BreastPathQ challenge: an image analysis challenge for quantitative tumor cellularity assessment in breast cancer histology images following neoadjuvant treatment," J. Med. Imaging, vol. 8, no. 03, May 2021, doi: 10.1117/1.JMI.8.3.034501.

[16] Z. Li et al., "Deep Learning Methods for Lung Cancer Segmentation in Whole-Slide Histopathology Images - The ACDC@LungHP Challenge 2019," IEEE J. Biomed. Heal. Informatics, vol. 25, no. 2, pp. 429–440, 2021, doi: 10.1109/JBHI.2020.3039741.

[17] Z. Swiderska-Chadaj et al., "Learning to detect lymphocytes in immunohistochemistry with deep learning," Med. Image Anal., vol. 58, p. 101547, Dec. 2019, doi: 10.1016/j.media.2019.101547.

[18] Y. J. Kim et al., "PAIP 2019: Liver cancer segmentation challenge," Med. Image Anal., vol. 67, p. 101854, 2021, doi: 10.1016/j.media.2020.101854.

[19] C. Zhu et al., "Multi-level colonoscopy malignant tissue detection with adversarial CAC-UNet," Neurocomputing, vol. 438, pp. 165–183, May 2021, doi: 10.1016/j.neucom.2020.04.154.

[20] M. Amgad et al., "Structured crowdsourcing enables convolutional segmentation of histology images," Bioinformatics, vol. 35, no. 18, pp. 3461–3467, 2019, doi: 10.1093/bioinformatics/btz083.

[21] J. Borovec et al., "ANHIR: Automatic Non-Rigid Histological Image Registration Challenge," IEEE Trans. Med. Imaging, vol. 39, no. 10, pp. 3042–3052, Oct. 2020, doi: 10.1109/TMI.2020.2986331.

[22] E. Conde-Sousa et al., "HEROHE Challenge: assessing HER2 status in breast cancer without immunohistochemistry or in situ hybridization," Nov. 2021.

[23] R. Verma et al., "MoNuSAC2020: A Multi-Organ Nuclei Segmentation and Classification Challenge," IEEE Trans. Med. Imaging, vol. 40, no. 12, pp. 3413–3423, Dec. 2021, doi: 10.1109/TMI.2021.3085712.

[24] W. Bulten et al., "Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge," Nat. Med., vol. 28, no. 1, pp. 154–163, Jan. 2022, doi: 10.1038/s41591-021-01620-2.

[25] M. Amgad et al., "NuCLS: A scalable crowdsourcing, deep learning approach and dataset for nucleus classification, localization and segmentation," no. Cche 57357, pp. 1–45, 2021.

Final comments on the results of the MoNuSAC challenge

adrien@adfoucart.be — 2022-04-05

Our comment article Comments on "MoNuSAC2020: A Multi-Organ Nuclei Segmentation and Classification Challenge" was just published in the April 2022 issue of the IEEE Transactions on Medical Imaging, alongside the authors' reply. The whole story is, in my opinion, a very interesting example of things that can go wrong with digital pathology challenges, and of some weaknesses of the scientific publication industry. But let's start from the beginning...

Summary of previous events

MoNuSAC was a nuclei detection, segmentation and classification challenge hosted at the ISBI 2020 conference. The challenge results were posted online, and a post-challenge paper was published in the IEEE Transactions on Medical Imaging [1] (available in "Early Access" on June 4th, 2021, and published in December 2021). Challenge organisers also published ground truth annotations for the test set, as well as "colour-coded predictions" on the test set for four of the top-five teams (the link for the L1 team pointing to the predictions of the PL2 team), and the evaluation code used to score the participants' submissions.

In September 2021, as I was working on an analysis of the Panoptic Quality metric [2] used in the challenge, I discovered a bug in the evaluation code and alerted the organisers on September 20th, 2021. They initially replied that "[t]he code to compute PQ is correct". After verification, I confirmed the error and sent a Jupyter Notebook demonstrating it in action.

As I was trying to assess the potential effects of the error on the challenge results, I discovered several additional issues in the evaluation process. On September 22, 2021, I sent a detailed report to the organisers explaining the different problems and offering to "collaborate with [them] in making a correction to the published results". I received no response.

On October 12th, 2021, my PhD supervisor Prof. Decaestecker contacted the managing editor of the IEEE Transactions on Medical Imaging to ask what the procedure was for submitting "comment articles" to the journal, as it seemed to be the preferred method in IEEE journals for reporting potential errors in published papers. After some back-and-forth to clarify the procedure, we submitted our comment article on October 20th, 2021. The comment article was transmitted to the original authors. On February 23rd, 2022, we received notice from the editor-in-chief that our comment article was accepted. Both our comments and a response by the original authors were finally published in the April 2022 issue of the journal [3, 4]... which brings us back to the present.

Claims and responses

Four main issues were raised in our original comment. In this section, I would like to go through our claims, the response from Verma et al., and my thoughts on that response. The four issues that we raised are:

There was a typo in the code used to compute the PQ metric, which led to False Positives being incorrectly added to the count.
There was a confusion between two classes in the detailed per-organ and per-class results, where Macrophage results were reported as Neutrophils and vice-versa.
Because of the way the code processed prediction files, when there were no ground truth objects of a given class for an image, the predicted objects of that class were not taken into account as False Positives.
The aggregation process of the metric was implemented by computing the PQ metric on each "image patch" and then computing the mean, while the published methodology implied that the aggregation would take place at the "patient" level (which would make a lot more sense given the disparity of image sizes and numbers of objects).

Let's take them one by one.

Computation of the PQ

This one is very straightforward. In their reply, the authors recognised the error and recomputed the results after correction. Fortunately, they find that "the impact of fixing this bug is small". The increase of .3%-.5% in the overall PQ for most teams with the corrected code [4, Table I] is in line with our own experiments. They also find, as we suspected, that some teams were disproportionately affected ("L8" moves three ranks up to "L5" with a 16% improvement in PQ, "L13" moves three ranks up to "L10" with a 16.5% improvement): this is because the error is particularly strong if the object labels between the "ground truth" and "predictions" are very different (as in: using different ranges of numbers), so it's likely that these teams used a different labelling process (full explanation in our comments [3] or on our Notebook on GitHub).

Class confusion

Here, the authors also admit the error, and issue an amended table for the supplementary materials, so I have no further comments to make.

Missing false positives

To our third claim, the authors argue that it comes down to a matter of methodological choice. Quoting from their reply [4, II.C]:

[T]he false positive for one class will be false negative for another class in an mutually exclusive and exhaustive multi-class multi-instance detection problem. We did not want to double count an error, and therefore our loops for error counting run over the ground truth objects. The interpretation of positives and negatives in multi-class problems is a matter interpretation until settled, and this leads to multiple ways of computing the PQ metric for multi-class problems.

I disagree with this for several reasons. To better explain what's going on (the problem, the authors' response, and why I think it's wrong), it will be easier to look at a fabricated example. In the figure below, we have a "ground truth" and a "prediction", where colours correspond to classes, numbers to ground truth instance labels, and letters to predicted instance labels.

The submission format requires each class's labels to be in a separate file, so in such a situation we would have two "ground truth" label maps (let's call them GT_Blue and GT_Green) and three "prediction" label maps (P_Blue, P_Green and P_White), each containing the labelled mask for one specific class. A pixel cannot belong to two different classes (although there is no mechanism in the code to verify that the submission is valid in this regard).

In the current implementation of the evaluation code, these files would be processed in the following way:

    Open GT_Blue and get label map.
    Open P_Blue to get corresponding predictions.
    Compute PQ_Blue based on GT_Blue and P_Blue. 
        -> 2 True Positives (1-A, 4-D) and 1 False Positive (C)
    Open GT_Green and get label map.
    Open P_Green to get corresponding predictions.
    Compute PQ_Green. 
        -> 1 True Positive (5-E) and 2 False Negatives (2, 3)
    Stop.

So P_White is never opened, and the False Positive "B" will never be counted. If we look at the full confusion matrix for this image, we should have (with rows corresponding to "ground truth" classes and columns to "predicted" classes):

	No obj	Blue	Green	White
No obj	0	0	0	1
Blue	0	2	0	0
Green	1	1	1	0
White	0	0	0	0

But with the evaluation code as it is written, we end up with:

	No obj	Blue	Green
No obj	0	0	0
Blue	0	2	0
Green	1	1	1
White	0	0	0
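The difference between the two matrices can be reproduced with a minimal Python sketch. This is not the organisers' evaluation code: the dictionaries below are hypothetical encodings of the fabricated example (object classes and the matches 1-A, 3-C, 4-D, 5-E), and the only thing that changes between the two calls is the set of classes that the loop runs over.

```python
# Class of each ground truth object (numbers) and predicted object (letters)
# in the fabricated example. These structures are illustrative assumptions.
gt_class = {1: "Blue", 2: "Green", 3: "Green", 4: "Blue", 5: "Green"}
pred_class = {"A": "Blue", "B": "White", "C": "Blue", "D": "Blue", "E": "Green"}
# Matched pairs of objects; ground truth "2" and prediction "B" are unmatched.
matches = {1: "A", 3: "C", 4: "D", 5: "E"}


def count_errors(classes):
    """Count TP / FP / FN independently for each class in `classes`."""
    counts = {}
    for c in classes:
        # A true positive needs a match where both objects have class c.
        tp = sum(
            1 for g, p in matches.items()
            if gt_class[g] == c and pred_class[p] == c
        )
        n_gt = sum(1 for cls in gt_class.values() if cls == c)
        n_pred = sum(1 for cls in pred_class.values() if cls == c)
        counts[c] = {"TP": tp, "FP": n_pred - tp, "FN": n_gt - tp}
    return counts


# Looping only over classes present in the ground truth (as described above).
as_described = count_errors(sorted(set(gt_class.values())))
# Looping over every class present in the ground truth or the predictions.
with_all_classes = count_errors(
    sorted(set(gt_class.values()) | set(pred_class.values()))
)

print(as_described)
print(with_all_classes)
```

With the first loop, "White" never appears in the result, so the unmatched "B" object disappears from the counts; with the second, it shows up as one White False Positive.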

Coming back now to the authors' reply, the statement that "the false positive for one class will be false negative for another" is incorrect... unless you count the background "no object" class, which we shouldn't, as the PQ is not computed for the background class. In our fabricated example, the false positive will not be counted at all.

What if the white "B" object had been matched with, for instance, the green "2" object, which at the moment is a "Green False Negative"? Then the "2-B" match would be both a "Green FN" and a "White FP". The idea that, in such a case, it should only be counted once, however, would be an important deviation from the PQ metric as defined in [5], and it should then be clearly explained in the methods.

As the metric is computed independently for each class, the misclassification should clearly impact both the "Green" class metric and the "White" class metric. In fact, it is the case for other misclassifications, such as the "3-C" match in our fabricated example, which is counted both as a "Green FN" and a "Blue FP". The only time such misclassifications are not counted "twice" in the evaluation code is if there is no ground truth instance of that class in that particular image patch.

This is not a matter of interpretation in this case: it is clearly a mistake. The effect of that mistake on the metric can also be very large. From our experiment on the teams' released "colour-coded" predictions, correcting this mistake led to a drop of 13-15 points in the computed PQ. For instance, looking at the SJTU_426 team, we have an increase of about 0.04 in the PQ (from 0.554 to 0.594) when correcting the PQ computation typo, similar to what the challenge organisers report, but we have a massive drop of 0.14 (from 0.594 to 0.454) when we add the missed False Positives:

SJTU_426	Challenge eval.	Corr. Typo	Corr. FP	Corr. FP + Typo
Verma et al.	0.579	0.618	-	-
Recomputed	0.554	0.594	0.424	0.454

Our replication of the results is based on the "colour-coded" prediction maps. They will be slightly different from the original submission, but should be almost identical in terms of detection performance, which is what changes between those versions of the evaluation code. So while the absolute value is not exactly the same, the delta between the different error corrections should be in the right range.

There is an actual question of interpretation on how exactly those specific false positives should be counted. In our fabricated example, if there was no "White" prediction, then PQ_White would be undefined and wouldn't be counted in the average, but if we have a single false positive, we immediately get a PQ of 0 (hence the large impact on the metric). The solution to that problem, however, isn't to just remove the false positives, but to correct the last problem that we have to address: the aggregation process.
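To make the "undefined versus 0" behaviour concrete, here is a small Python sketch of the PQ as defined in [5], i.e. the sum of the IoUs of the matched pairs divided by TP + FP/2 + FN/2. The function name and the per-class values for "Blue" and "Green" are hypothetical, chosen only to show how one extra False Positive changes the average.

```python
from statistics import mean
from typing import Optional


def panoptic_quality(iou_sum: float, tp: int, fp: int, fn: int) -> Optional[float]:
    """PQ for one class: sum of IoUs over true positives / (TP + FP/2 + FN/2).

    Returns None when the class is absent from both the ground truth and
    the predictions, since the metric is then undefined.
    """
    denominator = tp + 0.5 * fp + 0.5 * fn
    if denominator == 0:
        return None
    return iou_sum / denominator


# No "White" object anywhere: undefined, so left out of the average.
pq_white_absent = panoptic_quality(0.0, tp=0, fp=0, fn=0)
# A single "White" false positive and no ground truth: PQ drops to 0.
pq_white_one_fp = panoptic_quality(0.0, tp=0, fp=1, fn=0)

# Hypothetical PQ values for the two other classes of one image patch.
other_classes = [0.8, 0.5]
mean_without_white = mean(other_classes)
mean_with_white = mean(other_classes + [pq_white_one_fp])

print(pq_white_absent, pq_white_one_fp)
print(round(mean_without_white, 3), round(mean_with_white, 3))
```

In this toy case, the single False Positive moves the patch-level average from 0.65 to about 0.433.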

Aggregation: patient vs patch

There are two parts to the authors' reply to our concerns about the aggregation process. The first is, again, to argue that aggregating "per-patient" (i.e. computing the PQ directly at the patient level and then averaging over the 25 patients of the test set) or "per-patch" (computing the PQ at the image patch level and then averaging over the 101 image patches of the test set) is a methodological choice. The second is to state that it was clear in their methodology that they chose the second option, so it's not an error on their part that requires a correction but rather a choice that may be discussed in future works.

Concerning the latter, there may indeed have been some confusion which, as they admit, comes from a typing error in their manuscript (where they mention "25 test images" and aggregating "25 PQ_i scores", which correspond to the number of patients and not the number of patches). It led me to the impression that there was simply a certain inconsistency in their naming convention through the paper, where "image" sometimes referred to WSI and sometimes to image patch.

Part of the confusion also comes from their relatively consistent use of the term "sub-image" in the pre-challenge description of the dataset to refer to the image patches. I was regularly consulting that document as well during my analysis, so it made sense to me that "image" would in general refer to "Whole-Slide Image" and "sub-image" to "image patch". As the information in the supplementary materials shows there is only one WSI per patient, this interpretation seemed reasonable.

I can certainly believe that it wasn't their intent, however, and that it was very clear for them (and possibly for the challenge participants) that the PQ was always meant to be computed per-patch in the challenge. So the question becomes: is that a valid methodological choice?

First: what is the problem with computing the PQ "per-patch"? The difference in size between the different patches is enormous, with the smallest image in the dataset having a size of 86x33px, and the largest a size of 1760x1771px, more than 1000 times larger in number of pixels! The difference in terms of number of ground truth objects is also very large, ranging from 2 to 861 per patch. Even looking at objects of the least frequent class, we can have anywhere between 1 and 15 instances in patches where the class is represented. This means that the cost of making an error on some image patches is orders of magnitude higher than on others.
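The orders of magnitude are easy to check from the figures quoted above. A quick Python computation (variable names are mine):

```python
# Patch sizes quoted in the text, in pixels.
smallest_patch = 86 * 33
largest_patch = 1760 * 1771
pixel_ratio = largest_patch / smallest_patch

# Range of ground truth objects per patch quoted in the text.
object_ratio = 861 / 2

print(smallest_patch, largest_patch, round(pixel_ratio))
print(object_ratio)
```

The largest patch has roughly 1098 times as many pixels as the smallest, and the busiest patch has about 430 times as many ground truth objects as the sparsest.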

The better option that I identified in our comments article was to compute the PQ per-patient. This vastly decreases the size variability, and it also makes sense for a biomedical dataset. Patches taken from the same patient (and particularly, as is the case here, the same WSI) may share some common properties, so that they can't really be considered independent samples. It also increases the number of ground truth objects at the moment where the metric is computed, which avoids the effect of a single False Positive for an object not present in the ground truth immediately setting the PQ to 0, as in our previous example. The distribution of ground truth objects is more likely to be representative on a larger sample, and the PQ will therefore be less "noisy". Being able to estimate whether a particular algorithm's performance is steady over multiple patients is also very valuable information, so it makes sense to choose that over the other option identified in the authors' reply, which is to compute the PQ just once, at the level of the entire dataset.
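A minimal Python sketch of the two aggregation strategies, for a single class. The patch data is entirely hypothetical, and "per-patient" is implemented here as pooling the matches of all patches from the same patient before computing the PQ, which is one reading of "computing the PQ directly at the patient level".

```python
from collections import defaultdict
from statistics import mean


def pq(iou_sum, tp, fp, fn):
    """PQ = sum of IoUs over TP / (TP + FP/2 + FN/2); None if undefined."""
    denominator = tp + 0.5 * fp + 0.5 * fn
    return None if denominator == 0 else iou_sum / denominator


# Hypothetical patches: (patient, sum of IoUs over TP, TP, FP, FN).
patches = [
    ("patient_1", 160.0, 200, 20, 20),  # large patch with many objects
    ("patient_1", 0.0, 0, 1, 0),        # tiny patch: one FP, no ground truth
    ("patient_2", 40.0, 50, 5, 5),
]

# Per-patch: one PQ per patch, then the mean over patches.
per_patch = [pq(*patch[1:]) for patch in patches]
per_patch_mean = mean(s for s in per_patch if s is not None)

# Per-patient: pool the counts of each patient, then one PQ per patient.
totals = defaultdict(lambda: [0.0, 0, 0, 0])
for patient, *values in patches:
    for i, value in enumerate(values):
        totals[patient][i] += value
per_patient = {patient: pq(*values) for patient, values in totals.items()}
per_patient_mean = mean(s for s in per_patient.values() if s is not None)

print(round(per_patch_mean, 3), round(per_patient_mean, 3))
```

In this toy example, the single False Positive in the tiny patch pulls the per-patch mean down to about 0.485, while the per-patient mean stays at about 0.726, because the same error is weighed against the 200 objects of the neighbouring patch.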

The main drawback that the authors identify to using the "one PQ on the whole dataset" approach in their reply is that "this approach does not allow for a robust computation of confidence intervals". I'm very unconvinced by this line of reasoning. First of all, it doesn't really say anything about the per-patient proposition (which does allow for confidence intervals, although obviously with fewer samples these will be larger). Second, there is the aforementioned problem of the samples' independence. Confidence intervals should really be measured at the patient level anyway, as the patients are, in the end, the "samples" that we are studying.
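As an illustration that per-patient scores do allow for confidence intervals, here is a sketch of a percentile bootstrap over patients in Python. This is one common technique among several, not something taken from the challenge or from the reply, and the per-patient PQ values are made up for the example.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    Patients are resampled with replacement, so each patient (not each
    patch) is treated as one independent sample.
    """
    rng = random.Random(seed)
    resampled_means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-patient PQ scores (not the challenge results).
patient_scores = [0.61, 0.55, 0.70, 0.48, 0.66, 0.59, 0.52, 0.63]
lower, upper = bootstrap_ci(patient_scores)
print(round(mean(patient_scores), 3), round(lower, 3), round(upper, 3))
```

With fewer samples the interval is wider, but it reflects the variability between patients, which is the quantity of interest.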

If this is a methodological question, I don't really think it's as open as the authors' reply would suggest. In this case, it seems very clear that the per-patient option is better, and I would even say that, given the scale of the difference in the "weights" given to the errors, the "per-patch" option is in this case even incorrect. It could be justified when the extracted patches are of very similar sizes, but not in this particular dataset.

Last thoughts on this whole story

The corrected results published by Verma et al. in [4] are, in my opinion, still incorrect. The correct ranking of the challenge is, therefore, unknown, and we shouldn't draw any conclusions on the merits of the different methods proposed by the participants. I doubt that the ranking would change by a lot if all corrections were implemented, but as we have seen with the first correction, it's possible that some participants were affected by some of the errors more than others.

While I'm disappointed by the unwillingness of the organisers to directly work with us on correcting the problems, I would like to emphasise that I don't think they did a bad job overall with the challenge. In fact, as I've repeatedly highlighted, their level of transparency on their data and code is higher than any other digital pathology segmentation challenge that I've seen. That's what interested me in their data in the first place, and what allowed us to perform our study on the PQ metric [2].

Organising a successful international competition and compiling such a dataset is a very big achievement, whatever problems may remain in the results. This whole saga also highlights how important it is to increase the transparency in the reporting of challenge results in general. Without the teams' predictions and the evaluation code, results are unverifiable and unreplicable. MoNuSAC really went in the right direction here, although they fell short of releasing all participants' results, and released a "visualisation" instead of the raw prediction maps.

What we can see in every challenge that we analyse is that mistakes happen all the time: in the dataset production, in the evaluation methodology, in the code... Challenges are extremely costly and time-consuming to organise, and they use up a lot of resources (from the organisers, but also from the participants). We should really try to make sure that the results that we get from them are accurate. If challenge organisers go for a fully transparent approach to the evaluation process, then the responsibility of checking the validity of the results becomes shared by the whole community, and our trust in these results is improved.

There are also some things in the process with the journal which I think could be improved. First of all, given the nature of the error we found, I think that some form of notice should have been quickly added to the original publication on the journal's website. Between October 2021 and February 2022, there was reasonable doubt about the validity of the results and, since February 2022 and the acceptance of the authors' reply, it is certain that the original results are incorrect.

Even now that the comment and response have been published, there is still no link to those from the original publication, meaning that the original results are likely to be cited without noticing that they have been amended, and that there are additional concerns beyond what was corrected.

I do think that the journal's Editor-in-Chief and Managing Editor were reactive and tried to do things right. The fact that they had to navigate through a process that seemed obscure to everyone is worrying, however. In many ways, "peer-review" starts when the article is published and, therefore, actually available to more than a couple of "peers". Corrections and retractions should be a natural part of the publishing process. Mistakes happen. They are only a problem if they can't be detected and corrected.

References

[↑] R. Verma et al. MoNuSAC2020: A Multi-Organ Nuclei Segmentation and Classification Challenge. IEEE Transactions on Medical Imaging, 2021. doi:10.1109/TMI.2021.3085712
[↑] A. Foucart et al. Evaluating participating methods in image analysis challenges: lessons from MoNuSAC 2020. Preprint under review, 2022. doi:10.13140/RG.2.2.11627.00801
[↑] A. Foucart et al. Comments on "Monusac2020: A Multi-Organ Nuclei Segmentation and Classification Challenge". IEEE Transactions on Medical Imaging, 2022. doi:10.1109/TMI.2022.3156023.
[↑] R. Verma et al. Author's Reply to "Monusac2020: A Multi-Organ Nuclei Segmentation and Classification Challenge". IEEE Transactions on Medical Imaging, 2022. doi:10.1109/TMI.2022.3157048
[↑] A. Kirillov et al. Panoptic Segmentation. CVPR, 2019. doi:10.1109/CVPR.2019.00963

Deep Learning for Image Analysis: video tutorial

adrien@adfoucart.be — 2021-09-21

Last year, I did a video tutorial for the INFO-H-501 "Pattern Recognition and Image Analysis" course at the Université Libre de Bruxelles. I apparently forgot to also put it here, so here we are...

The video contains a bit of theory on how to approach a segmentation problem with "Deep Learning", and a guide to a working implementation of the pipeline using TensorFlow v2.

The dataset used in the video is the publicly released GlaS challenge dataset. All the code can be found here on GitHub: https://github.com/adfoucart/dlia-videos.

Computers in pathology: a long history

adrien@adfoucart.be — 2021-03-17

We (...) have an unusual situation that we wish to search for objects present in low concentration, the quantity of which is irrelevant, and the final classification of which is not agreed by different experts. At this point the engineer may well begin to feel despair. (A.I. Spriggs, 1969 [1])

This quote comes from a 1969 publication entitled "Automatic scanning for cervical smears". This paper is interesting for several reasons:

It shows that "computer-assisted pathology" is not a new idea.
It identifies difficulties in designing such systems which are still very relevant today.

Specifically, Spriggs worries that the "Carcinoma in situ" which they are trying to detect "is not a clear-cut entity", with "a whole spectrum of changes" where "nobody knows where to draw the line and when to call the lesion definitely precancerous". To sum up: "We therefore do not really know which cases we wish to find". Moreover, "the opinions of different observers also vary".

On that latter point, Spriggs also notes that the classifications or grading used by pathologists are often more "degrees of confidence felt by the observer" than measurable properties of the cells. "It is therefore nonsensical to specify for a machine that it should identify these classes." The only reason to do it is that we have to measure the performance of the machine against the opinion of the expert. This is slightly less of a problem today, as the grading systems have evolved and become more focused on quantifiable measurements, but they still often allow for large margins of interpretation. As we match our systems against these grading systems, we constrain them to "mimic" the reasoning of the pathologists instead of focusing on the underlying problem of finding relationships between the images and the evolution of the disease itself.

"Digital pathology" really started to appear in the scientific literature around the year 2000. At that time, the focus was on telepathology or virtual microscopy: allowing pathologists to move away from the microscope, to more easily share images for second opinions, and to better integrate the image information with the rest of the patient's record [2, 3, 4].

Digital pathology also relates to image analysis. The 2014 Springer book "Digital Pathology" [5], for instance, includes in its definition not only the acquisition of the specimens "in digital form", but also "their real-time evaluation", "data mining", and the "development of artificial intelligence tools".

The terminology may be recent, but the core idea behind it (linking an acquisition device to a computer to automate the analysis of a pathology sample) is about as old as computers. One of the earliest documented attempts may be the Cytoanalyzer... in the 1950s.

Fig. 1 - LIFE Magazine article on the "Cytoanalyzer", April 25th, 1955 (Google Books).

Two paragraphs in a 1955 issue of LIFE magazine, sandwiched between ads for mattresses and cars, present an "electronic gadget" which "will spot abnormal cells (...) in about a minute, saving time-consuming specialized analysis for every case". A more comprehensive description of the prototype was given by Walter E. Tolles in the Transactions of the New York Academy of Sciences [6].

The Cytoanalyzer (see Fig. 2) had three units: the power supply and computer (left), the scanner (middle) and the oscilloscopes for monitoring and presentation (right). The scanner converts the "density field" of the slide into an electric current, which is used to analyse the properties of the cells and discriminate between normal and abnormal ones.

Fig 2 - 1955 version of the cytoanalyzer, from [6].

Two clinical trials of the Cytoanalyzer were conducted, in 1958-59 and in 1959-60. An analysis of the results by Spencer & Bostrom in 1962 [7] considered its results to be "inadequate for practical application of the instrument".

Decades passed. Hardware and software vastly improved, yet even as new methods in image analysis and artificial intelligence got better at solving tasks related to pathology, they still fell short of the strict requirements of an automated system for diagnosis. In 2014, Hamilton et al. were writing that "even the most state of the art AI systems failed to significantly change practice in pathology" [8]. The problem is that medical diagnosis is generally made by integrating information from multiple sources: images from different modalities, expression of the symptoms by the patient, records of their medical history... AI systems can be very successful at relatively simpler sub-tasks (finding nuclei, delineating glands, grading morphological patterns in a region...), but they are just unable, at this point, to get the "big picture". Not to mention, of course, all the thorny obstacles to widespread adoption: trust in the system, regulatory issues, insurance and liability issues, etc., etc.

More than 65 years after the Cytoanalyzer, routine use of AI in clinical practice for pathology appears to be very close... but we're still not there yet. The performance of deep learning algorithms, combined with the widespread use of whole-slide scanners producing high-resolution digital slides, makes the field of computer-assisted histopathology a very active and optimistic one at the moment. Still, even with the excellent results of Google Health's breast cancer screening system in clinical studies [9], it's not clear that automated systems are ready for real practice.

The difficulties of our algorithms are in large part the same that were identified by Spriggs in 1969: it is difficult or impossible to get an "objective" assessment of pathological slides and, even with modern grading systems, inter-expert disagreement is high. This makes training and evaluating algorithms more difficult, and when dealing with a subject as sensitive as healthcare, any result short of near perfection will have a hard time getting adopted by the medical community... and by the patients.

References

[↑] A.I. Spriggs, "Automatic scanning for cervical smears", J. clin. Path. 22, suppl. (Coll. Path.), 3, 1-6 (1969). doi:10.1136/jcp.s2-3.1.1
[↑] J.H. Saltz, "Digital pathology - The big picture", Human Pathology 31(7), pp779-780 (2000). doi:10.1053/hupa.2000.9748
[↑] Barbareschi, M., Demichelis, F., Forti, S. & Dalla Palma, P. Digital pathology: Science fiction? Int. J. Surg. Pathol. 8, 261–263 (2000). doi:10.1177/106689690000800401
[↑] May, M. A better lens on disease. Sci. Am. 302, 74–77 (2010). doi:10.1038/scientificamerican0510-74
[↑] Sucaet, Y. & Waelput, W. Digital Pathology. (Springer, 2014). doi:10.1159/isbn.978-3-318-05846-8.
[↑] Tolles, W. E. SECTION OF BIOLOGY: THE CYTOANALYZER-AN EXAMPLE OF PHYSICS IN MEDICAL RESEARCH. Trans. N. Y. Acad. Sci. 17, 250–256 (1955). doi:10.1111/j.2164-0947.1955.tb01204.x.
[↑] Spencer, C. C. & Bostrom, R. C. Performance of the cytoanalyzer in recent clinical trials. J. Natl. Cancer Inst. 29, 267–276 (1962). doi:10.1093/jnci/29.2.267
[↑] Hamilton, P. W. et al. Digital pathology and image analysis in tissue biomarker research. Methods 70, 59–73 (2014). doi:10.1016/j.ymeth.2014.06.015
[↑] McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020). doi:10.1038/s41586-019-1799-6

AI: better than your doctor?

adrien@adfoucart.be — 2021-02-19

AI system is better than human doctors at predicting breast cancer (J. Hamzelou, NewScientist)
AI Now Diagnoses Disease Better Than Your Doctor, Study Finds (D. Leibowitz, Towards Data Science)
AI Can Outperform Doctors. So Why Don’t Patients Trust It? (C. Longoni and C. K. Morewedge, Harvard Business Review)

Image by Alex Knight from Pexels.

Machines have beaten us at chess and Go, they can drive our cars... and now they are better doctors? That last article from Harvard Business Review asks an interesting question. If machines can perform "with expert-level accuracy", why don't we trust them? In their study, the authors find that resistance to medical AI is driven by "a concern that AI providers are less able than human providers to account for consumers’ unique characteristics" [1]. The problem, then, would be mostly one of perception. AI can be better than doctors in general, but we fear that it might not be better for us personally.

It's really hard to have a purely "rational" (whatever that means) discussion about the merits of AI versus human doctors, and part of the problem is in the terminology: "Artificial Intelligence". AI is a relatively vague categorization of a certain domain of computer science, but it is perhaps more importantly a term heavily associated with its historical use in science-fiction and scientific speculation. In fiction, AI is often used as a device to explore what it means to be human. It almost always comes along with an "artificial conscience". An AI is a person - or at least it tries to be. It's a notion that we find both in stories where AI is the "good guy" who just wants to be accepted into society, and in dystopian stories where an AI becomes "self-aware" and starts a war on humans.

So when we say "AI is better than human at...", we think about those AI, the self-aware machines. But modern AI, and in particular the kind of algorithms which are behind the "better than humans" headlines, has absolutely nothing to do with any of that. Deep Learning algorithms are tools which are entirely in the hands of the engineers and doctors that use them. They have no more "personality", "wants", or "conscience" than any other tool.

AI is "better than humans" at predicting breast cancer in the same sense as a bread slicer machine is "better than humans" at slicing bread. In both cases, humans are in control of what goes in, what comes out, and how the machine is set up.

It doesn't really make sense to talk about "better than human" performances for machine learning systems, because everything that works or doesn't work in such algorithms can be traced to humans. The engineers, computer scientists or mathematicians create a mathematical model which fully determines the range of data that the system can manage, and the kind of output that it can produce. Doctors and medical experts provide the data and annotations that will determine the parameters of the model, which only learns how to best reproduce what it's been shown.

If there is a bias in the results of an algorithm, it's not because "the AI" is biased, it's because the persons who designed the dataset and the learning process were. The term "AI" was reportedly coined in 1956 by John McCarthy, at the foundational "Dartmouth workshop". At the time, computer science was in its infancy, and the dream was that every aspect of human intelligence could eventually be replicated by a computer. That idea is now mostly reserved for the field of "Artificial General Intelligence" (AGI), and it's not certain that such a thing is even possible. AI, as it's mostly used today, is not "intelligence". It's a toolbox, a set of techniques that we can use to perform various tasks, but it doesn't have an "identity". It starts and stops at the press of a button, and does nothing more and nothing less than what it's programmed to do.

What's the place of AI in medicine?

AI will never be "better than doctors" because it's a meaningless proposition, but that doesn't mean AI has no place in medicine. AI techniques provide very useful tools which can vastly improve patient care.

The best way (in my opinion) to describe existing AI is as a very well indexed library of knowledge. The very complex models that compose modern Deep Learning methods can "store" in their parameters huge amounts of medical knowledge, in the form of links between observations and desired output. Trained algorithms can afterwards process huge amounts of data in short amounts of time.

In clinical practice, this may be very useful to "flag" potentially difficult cases (for instance, if a doctor's diagnosis is different from what the algorithm says, it may be useful to review the results), or to provide a quick first diagnosis in cases where there are no doctors available. It can also serve as a form of quality control. Perhaps more importantly, it can help a lot with ongoing research. Taking large amounts of patient information along with the evolution of their diseases in retrospective studies, machine learning algorithms can detect patterns associated with clinical outcomes. Sometimes, those patterns aren't associated with things that we already know about the disease, so this may give new avenues to explore. Avenues which may lead to dead ends, or to important discoveries. To find out which, human experts will remain firmly in the loop, at least for the foreseeable future.

References

[↑] C. Longoni, A. Bonezzi, C. Morewedge, "Resistance to Medical Artificial Intelligence", Journal of Consumer Research 46:4 (2019). doi:10.1093/jcr/ucz013

Artefact detection in Digital Pathology

adrien@adfoucart.be — 2021-01-20

A digital pathology task that illustrates quite well the "imperfect nature of digital pathology annotations" that I mentioned in the previous post is artefact detection. In the first post of this blog, I presented the main steps of Digital Pathology, from the extraction of the tissue sample to the digitalization of the slide. Throughout this process, the tissue is manipulated, cut, moved, stained, cleaned, sometimes frozen, and more generally exposed to the elements. This can cause serious damage, as illustrated in the figure below. Sometimes, problems can also occur during the acquisition process, leading to blurriness or contrast issues.

Example of an artefact caused by tissue manipulation.

Depending on the severity of those issues, we may have to discard parts of the image from further analysis, redo the acquisition process or, in extreme cases, require a new slide to be produced. This can have a big negative impact on the pathology workflow, and it is also a potential source of uncertainty on the results. Typically, this "quality control" is done manually. This can take a lot of time, and it is a very subjective process which can lead to many mistakes [1]. Some algorithms had been proposed in the past to detect some specific types of artefact, such as blurry regions [2] or tissue-folds [3], but our goal was to develop a general-purpose method to catch most artefacts in one pass.

The dataset

Whole-slide images are very large. Artefacts can be of varying sizes and shapes, making them extremely tedious to annotate. This is a situation that will always lead to imperfect annotations. We had one slide annotated in detail by a histology technologist, and even in that slide many of the annotations were missing, or imprecise. To be able to get enough data to train a neural network, I quickly annotated the rest of the dataset based on the technologist's example, and I'm not an expert at all. As a result, we had a dataset with annotations that were rough, didn't really match the contours of the artefacts, and where a lot of artefacts were not marked at all. We had 26 slides in total. Slides are very large images, so while twenty-six seems like a very low number for training a neural network, we can actually extract a very large number of smaller patches from those slides to satisfy our needs.

Whole-slide images in the dataset, shown here in very low resolution.

An interesting thing about the data is that it has different types of staining, as shown in the figure above. Some of the slides have been stained with a standard "Haematoxylin & Eosin" treatment, which is typically used to show the general morphology of the tissue, while others have been stained with immunohistochemistry biomarkers, highlighting tumoral tissue in brown.

One of the first questions we always have to answer once we have a dataset is: what do we use to train our models, and what do we use to evaluate them?

In this case, as we had one slide with "better" annotations than the rest, we decided to keep it for testing. There were two good reasons for that: first, this would make the evaluation more likely to be meaningful; second, that slide came from a different source and used a different staining protocol than the others, which would help us test the "generalization" capability of our methods.

What makes the difference?

The main thing we wanted to see was: what parts of the machine learning pipeline were really important, and had an impact on the results? What we initially did was to take a relatively small, straightforward neural network architecture, and to test many different small changes in its depth ("how many layers") and width ("how many neurons per layer"). We also looked at whether we should use different levels of magnification on the image, or if we should try to "balance" the dataset by sampling more artefacts, as there were more examples of normal tissue than of artefact tissue in the data. Finally, we tried to compare using a "segmentation" output (predicting for each pixel if there is an artefact or not) or a "detection" output (predicting for the whole image patch if there is an artefact or not).

We published our first results at the "CloudTech 2018" conference [4]. What we found most interesting was that, quite clearly, the details of the network architecture didn't really matter. The balance of the dataset and how we decided to produce our "per pixel" output (directly from the network, or by combining the detection results from overlapping patches) were the main factors that influenced the final results.

Let's have a look at this result, for instance:

Effects of dataset balancing: results of the same network trained with random sampling, 25% artefacts, 50% artefacts and 75% artefacts. Blue = annotated artefact detected by the network, Red = detected by the network but not in the annotations, Black = annotated artefact not detected by the network.

Here, we have the same network, which uses patch-based detection, and has been trained using different data sampling. If we randomly sample the patches from the original dataset, we don't have enough examples of artefacts, and the network misses most of them. If we increase the percentage of artefact examples in the training batches, the network becomes a lot more sensitive. Obviously, this comes at the risk of falsely detecting normal tissue.
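To make the idea of "balancing" more concrete, here is a minimal sketch of how a training batch with a fixed proportion of artefact patches could be drawn. It is not the code we used in the paper: the function name `sample_batch`, the patch identifiers and the batch size are all made up for the illustration.

```python
import random

def sample_batch(artefact_patches, normal_patches, batch_size, artefact_fraction, rng):
    """Draw a batch of (patch, label) pairs with a fixed proportion of artefact patches."""
    n_artefact = round(batch_size * artefact_fraction)
    n_normal = batch_size - n_artefact
    # Sampling with replacement, so that the rarer class can be over-represented
    batch = [(p, 1) for p in rng.choices(artefact_patches, k=n_artefact)]
    batch += [(p, 0) for p in rng.choices(normal_patches, k=n_normal)]
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
# Hypothetical patch identifiers: artefacts are rare in the raw data
artefact_patches = [f"artefact_{i}" for i in range(50)]
normal_patches = [f"normal_{i}" for i in range(950)]

batch = sample_batch(artefact_patches, normal_patches,
                     batch_size=20, artefact_fraction=0.5, rng=rng)
n_positive = sum(label for _, label in batch)
print(n_positive, "artefact patches out of", len(batch))
```

With purely random sampling on such data, a batch of 20 patches would contain one artefact on average. Forcing the proportion to 25%, 50% or 75% is what changes the sensitivity of the trained network.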

How can we evaluate those results?

Evaluation metrics have to be related to the usage that we want to make of the results. In this case, the goals can be to assess the "quality" of the slide (or: to know how much of the slide is corrupted), and to remove damaged areas from further processing. For the latter task, we can reformulate it as: finding the "good" areas that can be used later.

There is no single metric that can really give us all that information. We computed the accuracy (which proportion of the pixels were correctly classified), the "True Positive Rate" (which proportion of the annotated artefacts did we correctly identify), the "True Negative Rate" (which proportion of the normal tissue did we correctly identify), and the "Negative Predictive Value" (which proportion of the tissue we identified as normal is actually normal). A low TPR would mean that we missed many artefacts; a low TNR that we removed a large portion of normal tissue; a low NPV that our "normal tissue" prediction is unreliable. To take the subjective nature of the evaluation into account, we also added a simple qualitative assessment, which we formulated as a binary choice: could this result be reasonably used in a digital pathology pipeline?
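As a quick illustration, these four metrics can all be computed from the same four counts (true/false positives and negatives). The sketch below uses a tiny made-up list of pixel labels, where 1 means "artefact" and 0 means "normal tissue"; the function name `pixel_metrics` is only for this example.

```python
def pixel_metrics(annotation, prediction):
    """Accuracy, TPR, TNR and NPV from two flat lists of 0/1 labels (1 = artefact).

    Assumes both classes are present in the annotation and in the prediction,
    otherwise some of the denominators would be zero.
    """
    pairs = list(zip(annotation, prediction))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # artefact, detected
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # normal, kept
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # normal, wrongly removed
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # artefact, missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "NPV": tn / (tn + fn),
    }

# Made-up example: 4 artefact pixels, 6 normal pixels
annotation = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
prediction = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

metrics = pixel_metrics(annotation, prediction)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this toy case the accuracy is 0.70 even though half of the artefacts are missed (TPR of 0.50), which is a good reminder of why a single number is not enough.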

None of those metrics were really satisfactory. The trend that "what goes around the network" had a lot more impact than minute changes within, however, was clear enough that we could draw useful conclusions from those first experiments.

The main lesson

These first experiments with artefact detection showed that the way we prepared our dataset, and the way we defined our problem, had more impact on our results than the network itself. This is particularly visible when working on a dataset with very imperfect annotations, where just fitting the raw data into a deep learning model will lead to extremely poor results.

A lot of published research in computer vision for digital pathology is done on datasets published in challenges. These datasets tend to be very "clean", with precise annotations and pre-selected regions of interest. This makes it easier to train machine learning models, and it also makes it a lot easier to test and compare algorithms with one another. But the question we wanted to explore now was: as real-world datasets from research or clinical practice are not generally as clean as challenge datasets, shouldn't we be cautious about the conclusions we draw from these challenge results?

References

[↑] A. Janowczyk et al. (2019) "HistoQC: An Open-Source Quality Control Tool for Digital Pathology Slides", JCO Clinical Cancer Analytics, DOI: 10.1200/CCI.18.00157
[↑] X. Moles Lopez et al. (2012) "An Automated Blur Detection Method for Histological Whole Slide Imaging", PLoS One 8(12), DOI: 10.1371/journal.pone.0082710
[↑] A. Kothari et al. (2013) "Eliminating tissue-fold artifacts in histopathological whole-slide images for improved image-based prediction of cancer grade", Journal of Pathology Informatics 4:22, DOI: 10.4103/2153-3539.117448
[↑] A. Foucart et al. (2018) "Artifact Identification in Digital Pathology from Weak and Noisy Supervision with Deep Residual Networks", Proc. CloudTech 2018, DOI: 10.1109/CloudTech.2018.8713350

Fifty shades of U-Net

adrien@adfoucart.be — 2021-01-11

Poster sessions in conferences are an exercise that I generally dislike. If you've never seen one, the principle is fairly simple: you take the largest halls and corridors at your disposal in the conference center, give each researcher a one meter wide space to hang a poster, and for one hour they stay next to their poster waiting for someone to come and ask questions about it instead of going for another coffee or biscuit (or sandwich, in the particularly bad cases where the poster session happens during lunch break).

A particularly awful looking poster session. (CC BY-SA 4.0, Wikipedia)

While I think it's generally inefficient as a way of presenting your research, it does provide a unique opportunity of getting a quick overview of "what's going on" in the field. By quickly browsing the posters, which are generally arranged somewhat thematically, you get a sense of the main trends of the domain at the time.

In April 2019, when I attended the ISBI conference, it felt like everywhere we looked we could see the same picture looking back at us:

U-Net architecture, from Ronneberger et al [1].

U-Net. It was introduced in 2015 at another conference (MICCAI) and quickly became a very popular choice for medical image segmentation. The two reasons for that were that it got good results on several different challenges and datasets, and that the authors released the trained network and the source code of the model, so anyone who wanted to try it out didn't have to re-implement everything.

By 2019, U-Net was everywhere. It quickly became the default option for medical image segmentation, along with several extensions and adaptations. From the ISBI poster session, it seemed like any segmentation problem could now be approached in the same way: get a bunch of annotations, fit a U-Net on your dataset, publish the results.

The problem of course is: what are we learning from all that? What's the point? Have we just solved segmentation and that's it?

One interaction I had during this poster session stuck with me. I was looking at a table of results on X-Ray images, which used the Dice (or F1) score as a metric. The results were very close to each other, and when I asked the authors whether they thought the difference was significant, they replied that even though the scores were very similar, they could see a clear difference visually on the shape of the borders of the object, which is what they were mostly interested in. Why, then, didn't they use another, more relevant metric? They apparently hadn't thought about it.

This stuck with me because it's not an isolated case. The F1/Dice score is by far the most widely used metric in segmentation problems, even for problems where we know it's not really good. Why? Because it's familiar, and if we all use the same metrics all the time it's easier to compare things together. But that means that we are now pursuing tiny improvements on metrics which aren't necessarily related to the clinical task that we are actually trying to solve.
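A toy example shows why the Dice score can be blind to the shape of the borders. In the sketch below (made-up masks, stored as sets of pixel coordinates), two predictions miss exactly the same number of pixels of a square object. One has a clean, straight border; the other has a ragged outline. The Dice score only counts pixels, so it cannot tell them apart.

```python
def dice(annotation, prediction):
    """Dice score between two non-empty sets of (row, column) pixel coordinates.

    For binary masks this is the same thing as the F1 score.
    """
    tp = len(annotation & prediction)   # pixels in both masks
    fp = len(prediction - annotation)   # predicted but not annotated
    fn = len(annotation - prediction)   # annotated but not predicted
    return 2 * tp / (2 * tp + fp + fn)

# Annotated object: a 4x4 square
annotation = {(r, c) for r in range(4) for c in range(4)}

# Prediction A: straight border, but the last column is missing
smooth = {(r, c) for r in range(4) for c in range(3)}

# Prediction B: also 4 missed pixels, scattered along the border (ragged outline)
ragged = annotation - {(0, 0), (0, 2), (3, 1), (3, 3)}

print(dice(annotation, smooth), dice(annotation, ragged))
```

Both calls return the same value (24/28), even though the two outlines look very different. If the border is what matters, a contour-based measure would be a better fit than Dice.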

An important feature of digital pathology tasks is that they are often difficult to quantify. Pathologists will typically integrate different observations from different images, along with clinical information about the patient, in order to produce their diagnosis. These assessments, however, are often subjective. That's part of the reason we want more computational methods to help with these tasks: to provide more objectivity. But we are now facing the very difficult problem of measuring how "good" an algorithm is.

Let's take a quick example and go back to the mitosis detection challenge we talked about earlier. Mitosis detection is not a "pathology" task by itself. It's used, for instance, as part of the Nottingham system for breast cancer classification, alongside other factors. When assessing how good an algorithm is, the question we should be answering is: does it make errors which would lead to a wrong diagnosis? And as soon as we get into that, we make the problem a lot more complex, because not only is that extremely hard to measure (as we can only do that on retrospective studies where we know the clinical outcome), but it only leads to more questions, such as: should we penalize an underestimation of the cancer severity more than an overestimation? Do we prefer an algorithm that's closer to the reality "on average", or an algorithm which has very few outliers?

With all these criteria, the question of "which algorithm is the best" becomes impossible to answer. So we simplify the problem. Instead of asking "which algorithm leads to a better diagnosis", we ask "which algorithm detects the same mitoses as the expert who annotated the images in the dataset", which may lead to a radically different answer.

This is what I want to focus on in my thesis (and here in the next posts): how do we correctly take into account the imperfect nature of digital pathology annotations when we train and test deep learning algorithms? How do we take into account mistakes, imprecisions, and inter-expert disagreement? In short: how do we ensure that we don't use up all our energy chasing results which don't actually teach us anything either about pathology, or about Deep Learning?

References

[↑] Ronneberger et al., "U-Net: Convolutional Networks for Biomedical Image Segmentation", Proc. MICCAI 2015. doi:10.1007/978-3-319-24574-4_28

How to train your neural network

adrien@adfoucart.be — 2020-12-14

With all the tools at our disposal now in libraries such as Google's TensorFlow or Facebook's PyTorch, deep neural networks have become increasingly easy to develop, train and use. While this is in general a very good thing, it also makes it very common to treat deep learning algorithms as "black boxes", mysterious beasts which magically transform a "training set" into a trained model with excellent performances on the accompanying "test set". Deep neural networks, however, are only black boxes if we don't open them up and take a peek.

As I've written earlier in the post "Every machine learning algorithm", a "machine learning model" is essentially a function mapping an input (the "data") to an output (the "prediction"). More complex models are characterized by a larger number of parameters in this function. What differentiates a deep neural network from a simple linear model is the much higher complexity of the model which, instead of just being able to find a "straight line" in the space of the input points, is capable of finding essentially any shape. The "shape" in question will be determined by the parameters of the network, which as we've seen in the "building blocks of Deep Learning" are the weights of the connections between the "neurons".

The big question is, of course: how do we find the value of those parameters?

The cost of an error

The first thing that we need to build a training method is a way to determine if an output is correct or not.

Let's go back to our "face detection" problem. The inputs of our model are the pixel values of a small image patch. The output value is a number representing the probability of the image containing a face.

Fig. 1 - Image patches given as input to a model. Original photo by Emma Bauso from Pexels.

The "correct output" for the image patches shown in Fig. 1 would be: [1, 1, 1, 0, 0, 0] (from left to right and top to bottom). Let's imagine that, for a given set of parameters, our model outputs the values: [0.71, 0.23, 0.55, 0.12, 0.44, 0.05]. What is the "error" made by the model? One thing we could do is just look at the difference between each pair of "expected/predicted" values: 1-0.71, 1-0.23, etc... So, we would get:

Correct output	1	1	1	0	0	0
Model prediction	.71	.23	.55	.12	.44	.05
Error	.29	.77	.45	-.12	-.44	-.05

The mean error would be the average value of this vector (here: 0.15), but it's not a very useful measure. Imagine for instance that we had a completely wrong prediction vector: [0, 0, 0, 1, 1, 1]. The error vector would be [1, 1, 1, -1, -1, -1], for an average of 0. "Positive" errors and "negative" errors are balancing each other, which is not exactly what we are looking for.

The mean absolute error would remove that problem by first taking the absolute value of the error. In our first error vector, this would lead to a cost of 0.35, and in our "completely wrong" example to a cost of 1, which seems a lot more reasonable.

The more commonly used cost functions in machine learning are the mean squared error (for regression problems), and the cross-entropy (for classification problems). Both have the particularity of giving a much higher cost for larger errors (in other words: they consider it's better to be a little wrong most of the time than to be very wrong on rare occasions).
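To make the comparison concrete, here is a small sketch that computes these different costs on the six example values used above. The formulas are the standard textbook definitions (the cross-entropy here is the binary version, with the natural logarithm), and the variable names are just for this illustration.

```python
import math

target = [1, 1, 1, 0, 0, 0]
predicted = [0.71, 0.23, 0.55, 0.12, 0.44, 0.05]

errors = [t - p for t, p in zip(target, predicted)]
n = len(errors)

# Positive and negative errors can cancel each other out
mean_error = sum(errors) / n
# Taking the absolute value first removes that problem
mean_absolute_error = sum(abs(e) for e in errors) / n
# Squaring gives more weight to the larger errors
mean_squared_error = sum(e ** 2 for e in errors) / n
# Binary cross-entropy: heavily penalizes confident wrong predictions
cross_entropy = -sum(
    t * math.log(p) + (1 - t) * math.log(1 - p)
    for t, p in zip(target, predicted)
) / n

print(round(mean_error, 2), round(mean_absolute_error, 2))  # 0.15 0.35
print(mean_squared_error, cross_entropy)
```

The effect of squaring is easy to see here: the second patch alone (error of 0.77) accounts for more than half of the sum of squared errors (0.5929 out of 1.09), while it is only about a third of the sum of absolute errors.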

Let's now imagine that we have all the time and computer power we want. We could take every single combination of parameters in our model, and for each of those compute this cost on our training dataset. Whichever combination gives us the lowest "cost" should be the best. If we had only one parameter in our model, we could make a plot showing the cost for every value of the parameter. This would typically look something like this:

Fig. 2 - Cost depending on the value of the parameter.

Great. But as we don't have all the time and computer power we want, what do we do?

Going downhill

The idea for solving this kind of problem (finding the minimum of a very complex function) goes all the way back to 1847, and a communication to the French Academy of Sciences by Augustin Cauchy [1].

Cauchy's idea is very simple. We start from an arbitrary "position" in this function (meaning in our case: arbitrary values for all parameters of the model). We can look at the gradient of the function relative to those parameters. In other words: we find the "direction" where moving the parameters makes the function go down.

Now that we have a direction, we move the parameters by a small step, and we start again. Step by step, we go down the cost function, until we hit a point where there is nowhere to go except back up. Then, we know that we've hit a minimum, as illustrated below.

Fig. 3 - An illustration of the Gradient Descent process

This is the basic process of Gradient Descent. Like all parts of a machine learning pipeline, it involves several choices, which will be part of the hyper-parameters. The main choices here are:

How do we choose the "initial parameters"?
By how much do we "move" at each step?
How do we know that we are at the global, not a local, minimum?

The answer to those will depend on the model and the dataset, but in general the initialization will be coming from some form of random generator, the movement will be determined by a learning rate which may vary as the learning progresses, and for the third question, we don't.
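Here is what this looks like in its simplest form: a minimal sketch with a single parameter, a random initialization and a fixed learning rate. The cost function is a made-up parabola with its minimum at w = 3, chosen so that we know the right answer in advance. A real network has many more parameters and a much bumpier cost function.

```python
import random

def cost(w):
    # Hypothetical one-parameter cost function, with a single minimum at w = 3
    return (w - 3.0) ** 2 + 1.0

def gradient(w):
    # Derivative of the cost with respect to the parameter
    return 2.0 * (w - 3.0)

rng = random.Random(42)
w = rng.uniform(-10.0, 10.0)  # random initialization
learning_rate = 0.1           # how much we "move" at each step

for step in range(200):
    # Move the parameter a small step in the direction where the cost goes down
    w = w - learning_rate * gradient(w)

print(w, cost(w))
```

On this toy function there is only one minimum, so the descent always ends up at the right place. That's exactly the guarantee we lose with the cost function of a real neural network.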

How far is too far?

I wrote above that we could in theory find the "best parameters" by computing the cost for every possible value they can take on our training dataset. That was not totally correct. At the very least, it depends on what we define as "best". The minimum of the cost function computed on the training dataset will give us a prediction as close as possible to the "target" that we gave to the model... But as we've seen in "Every machine learning algorithm", that's not always a good thing.

If our model is complex enough, we can end up basically storing the entire training dataset in its parameters. This leads to a situation where it is perfect at predicting the output for things that it has already seen but has a completely unpredictable behaviour everywhere else. That's overfitting, and as Deep Neural Networks often have millions of parameters, they can easily fall into that trap.

There are things that can be done in the model itself, or in the cost function, to encourage generalization. This is a whole topic by itself which I won't get into: regularization.

But at the most fundamental level, what we really need to do is to evaluate the cost on data that isn't used directly for the gradient descent. When we have a dataset that we want to use for training a machine learning model, we split it in three distinct parts: the training set is used by the model to find the best possible parameters; the validation set is used to evaluate, during the training, how well the model performs on unseen data; and finally the test set is kept aside until the very end, when we have finalized our model, to get a representative measure of how well it actually performs.

Overfitting, within that scheme, is very easy to detect. It happens when the cost keeps getting lower when computed on the training set but starts going up when computed on the validation set. This is a clear sign that we have to stop the training, or risk losing generalization capabilities.
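This stopping rule is simple enough to write in a few lines. The sketch below uses made-up cost values for eight training "epochs", and a hypothetical `patience` parameter (how many epochs without improvement we tolerate before stopping). It is one common way of doing it, not the only one.

```python
def best_stopping_epoch(validation_costs, patience=2):
    """Return the epoch with the lowest validation cost seen before stopping.

    Training is stopped once the validation cost has not improved
    for `patience` consecutive epochs.
    """
    best_epoch, best_cost = 0, validation_costs[0]
    for epoch, current_cost in enumerate(validation_costs):
        if current_cost < best_cost:
            best_epoch, best_cost = epoch, current_cost
        elif epoch - best_epoch >= patience:
            break  # no improvement for too long: stop the training
    return best_epoch

# Made-up values: the training cost keeps going down...
training_costs = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18, 0.15]
# ...but the validation cost starts going back up after epoch 3
validation_costs = [0.95, 0.70, 0.58, 0.52, 0.55, 0.61, 0.68, 0.74]

stop = best_stopping_epoch(validation_costs)
print("keep the parameters from epoch", stop)  # keep the parameters from epoch 3
```

In practice, this means saving the parameters of the model each time the validation cost improves, so that we can go back to them once the training has stopped.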

Gradient descent for neural networks

There is one last thing I want to mention before we close this series of articles on the main principles of Deep Learning.

To compute the "gradient" that we mentioned before, we need to be able to formulate the cost function as a function of the parameters of the model, and we need to be able to find its derivative. For a neural network with multiple layers, doing this directly is not practical. The relationship between the output and the weights applied in the first layer is far too complex. So what can we do?

In the 1970s and 1980s, the idea of back-propagation for neural networks was developed. Paul Werbos published a version of it in 1982 [2], and the one most commonly cited as popularizing the idea was published in 1986 by David Rumelhart, Geoffrey Hinton and Ronald Williams [3].

Fig. 4 - Illustration from Werbos, 1982 [2].

A good detailed step-by-step explanation is provided by Michael Nielsen on his website, but the main idea is this:

While it's hard to formulate the gradient with respect to all the weights in the network, it's easy to formulate the gradient with respect to the weights of the last layer. In the back-propagation algorithm, we start by computing those gradients, then we "propagate" the gradients "backward" (hence the name) to the previous layer.

In a sense, it uses the same idea as neural networks in general: instead of solving one very hard problem, we solve millions of comparatively easy problems sequentially, and let the connections between those problems add complexity.
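To see the mechanism at work, here is a minimal sketch on the smallest "network" I could think of: one input, one hidden neuron with a tanh activation, one linear output neuron, and a squared error as cost. All the values are made up. The gradient for the last weight is computed first, then passed backward to get the gradient for the first weight, and both are checked against a brute-force numerical estimate.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)  # hidden neuron
    y = w2 * h             # output neuron (linear)
    return h, y

def cost(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return (y - target) ** 2

def backprop(w1, w2, x, target):
    h, y = forward(w1, w2, x)
    d_y = 2.0 * (y - target)           # gradient of the cost w.r.t. the output
    d_w2 = d_y * h                     # last layer: easy to formulate directly
    d_h = d_y * w2                     # propagate the gradient backward...
    d_w1 = d_h * (1.0 - h ** 2) * x    # ...through tanh (derivative: 1 - tanh^2)
    return d_w1, d_w2

# Made-up weights, input and target
w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
d_w1, d_w2 = backprop(w1, w2, x, target)

# Numerical check: nudge each weight a little and see how the cost changes
eps = 1e-6
num_w1 = (cost(w1 + eps, w2, x, target) - cost(w1 - eps, w2, x, target)) / (2 * eps)
num_w2 = (cost(w1, w2 + eps, x, target) - cost(w1, w2 - eps, x, target)) / (2 * eps)

print(d_w1, num_w1)
print(d_w2, num_w2)
```

The two columns should agree up to numerical precision. The numerical check needs two evaluations of the whole network per weight, which is fine for two weights but hopeless for millions; back-propagation gets all the gradients in a single backward pass.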

Conclusion

These are the core components of all Deep Learning algorithms:

A robust "pipeline" that transforms the input data into something that the network can handle and interprets its outputs in a way that can be used.
A network architecture which creates enough complexity, through multiple layers, convolutions, pooling and other operations, to accurately represent the mapping between the input and the output.
An optimization method, generally based around gradient descent and backpropagation, to set the best weights for the connections in this network.
A large, good dataset that can provide a reliable source of training, and a good evaluation of the results through the test set.

References

[1] A. Cauchy, "Méthode générale pour la résolution des systèmes d'équations simultanées", in Compte-Rendu des séances de l'académie des sciences, 5 juillet 1847, pp. 536-538. Available online
[2] P. Werbos, "Applications of Advances in Nonlinear Sensitivity Analysis". Available online
[3] D. Rumelhart et al., "Learning representations by back-propagating errors". Available online

The building blocks of Deep Learning

adrien@adfoucart.be — 2020-12-07

If you read a paper describing a deep learning solution to any sort of problem, you'll probably end up looking at something like this:

Fig. 1 - Schematic representation of the ShortRes architecture, from [1].

Or this:

Fig. 2 - Schematic representation of the Unet architecture, from [2].

Schematic representations of deep neural networks allow us to quickly describe very complex structures. The reason we can use those representations and still convey the general idea of what the network does is that most architectures are made of the same basic building blocks. This allows us to reduce relatively complex operations into a simple "block" or "layer" whose purpose will be well understood by other researchers.

I'll take a look at those blocks in a moment, but first I want to take a magnifying glass into the network to observe the element at the base of all artificial neural networks: the artificial neuron.

The artificial neuron

In 1943, McCulloch and Pitts introduced a mathematical model for the activity of a single neuron [3]. This had a huge influence on the history of artificial intelligence: anything that can be written in a formal mathematical language can be replicated by a computer. If a computer can have individual components behaving like neurons... Why couldn't you have an artificial brain?

Fig. 3 - The artificial neuron.

As you can imagine, since it's 2020 and we don't have an artificial brain yet, it turned out to be a bit more complex than that. But the general principles of McCulloch and Pitts' neuron are still valid for modern neural networks: input signals (from the "dendrites"), which can have different weights (the "strength" of the connection), positive ("excitatory") or negative ("inhibitory"), are summed into a core unit which decides if the signal should be passed on to the output (through the "axon" and into the "synapses").

In an artificial neural network, in general, the structure of the connections (i.e. which neuron is connected to which) is fixed, but the weights of the connections are the parameters that the machine learning algorithm will try to optimize. The structure as well as the activation functions (the function that decides what the output should be based on the aggregated inputs) are hyper-parameters, set by the designer of the solution.

Individual neurons are very limited in what they can do. Connected in large networks, however, they are powerful mathematical tools. Let's start by looking at dense layers.

Dense layers

Dense layers are very easy to explain. All the neurons in the layer are connected to all the outputs from the previous layer, as illustrated below.

Fig. 4 - A very small network.

To determine the output of such a network, we can just follow through the connections. If we have two inputs $x_1$ and $x_2$, we will have $h_1 = A_h(w_{11}x_1 + w_{21}x_2)$ as the output of neuron $h_1$. We can do the same for $h_2$, then move on to the output: $y = A_y(v_1 h_1 + v_2 h_2)$. $A_y$ and $A_h$ being the activation functions of the different neurons.
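
The same computation can be written in a few lines of Python. The activation functions (a ReLU for $A_h$, a sigmoid for $A_y$) and the weight values are arbitrary choices for the sake of the example.

```python
import math


def relu(z):
    # One possible choice for A_h (an assumption; the text leaves it open).
    return max(0.0, z)


def sigmoid(z):
    # One possible choice for A_y (also an assumption).
    return 1.0 / (1.0 + math.exp(-z))


def tiny_network(x1, x2, w, v, a_h=relu, a_y=sigmoid):
    """w[i][j] is the weight from input i+1 to hidden neuron j+1."""
    h1 = a_h(w[0][0] * x1 + w[1][0] * x2)
    h2 = a_h(w[0][1] * x1 + w[1][1] * x2)
    return a_y(v[0] * h1 + v[1] * h2)


# Arbitrary weights, only for illustration.
w = [[1.0, -1.0],
     [0.5, 2.0]]
v = [1.0, -1.0]
y = tiny_network(1.0, 2.0, w, v)
```

Here $h_1 = 2$ and $h_2 = 3$, so the output is the sigmoid of $2 - 3 = -1$, roughly 0.27.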

One of the key features of neural networks is that the individual operations that are performed in the network are very simple, but we can get very complex outputs (and we can in fact approximate "virtually any function of interest" [4]) by just adding enough neurons to the network.

The problem with dense layers, however, is that the number of parameters grows very rapidly with the number of neurons, and often quickly becomes impractical. This is particularly true in image analysis. If our input is a 256x256 pixel image, that's already 65,536 input neurons (196,608 if the image is in colour). If we need, for instance, two layers of 1000 neurons each followed by a single output neuron, we will have around 66 million parameters in the model (more than 197 million for colour images). As the general "rule of thumb" in machine learning is that, to train a model, we need "a lot more" annotated examples than we have parameters, this makes things really hard.
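
The arithmetic behind those numbers can be checked with a short helper (a hypothetical function, not part of any library). It assumes one bias per neuron, which the rough count above may or may not include; either way the totals barely change.

```python
def dense_parameters(layer_sizes, bias=True):
    """Number of weights (and optionally biases) in a stack of dense layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + (n_out if bias else 0)
    return total


# Input -> 1000 neurons -> 1000 neurons -> 1 output neuron.
grey = dense_parameters([256 * 256, 1000, 1000, 1])
colour = dense_parameters([256 * 256 * 3, 1000, 1000, 1])
print(f"{grey:,} parameters (greyscale), {colour:,} (colour)")
```

Almost all of those parameters sit in the very first layer, which connects every pixel to every one of the 1000 neurons.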

Convolutional layers

Convolutional layers are arguably one of the key things that made artificial neural networks a practical solution for image analysis rather than a fun curiosity. Convolutional networks first appeared in 1980 with Fukushima's "Neocognitron" [5], but they really became a staple of Deep Learning in 1989 with Yann LeCun's "handwritten digit recognition" solution [6].

The basic concept of convolutional layers is that, instead of connecting every neuron from a layer to every neuron of the next, the connections are limited to a "neighbourhood". If an image passes through a convolutional layer, the output will also be an image, filtered by a "convolutional kernel". The values and size of the kernel will determine the effect of the filter. For instance, in the figure below, we apply a 3x3 kernel designed to get a strong response from horizontal edges.

Fig. 5 - Output of a 3x3 convolutional filter designed to detect horizontal edges. White = strong positive response (dark region above a light region), black = strong negative response (light region above a dark region), grey = weak response (no horizontal gradient). Original photo by Emma Bauso from Pexels.
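
Here is a minimal sketch of such a filter using only NumPy. The kernel values are an assumption (the ones used for the figure are not given), picked so that the sign convention matches the caption: positive where a dark region sits above a light one.

```python
import numpy as np

# Hypothetical kernel: the exact values used for the figure are not given.
kernel = np.array([[-1, -1, -1],
                   [ 0,  0,  0],
                   [ 1,  1,  1]], dtype=float)


def filter2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    As in most deep learning libraries, the kernel is not flipped, so this
    is strictly speaking a cross-correlation."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Toy image: dark (0) on the top half, light (1) on the bottom half.
image = np.zeros((6, 6))
image[3:, :] = 1.0
response = filter2d(image, kernel)
```

On this toy image, the response is zero in the flat regions and strongly positive along the row where dark turns to light. Flipping the image upside down gives the same pattern with a negative sign.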

In a convolutional neural network, each layer will consist of a set of different filters applied to the previous layer.

There are two distinct advantages here, especially for image analysis. The first is that we keep, built into the structure of the network itself, a sense of the spatial organization of the information. With a dense layer, the network has no way of knowing if two pixels are neighbours. Here, the network can extract features starting from neighbouring pixels, and only combine them together at the scale of the image later in the network. The second is that we need fewer parameters to extract useful information. In a dense network, if we learn a useful local filter at the top-left of the image using 9 connections, we have to use 9 other connections to learn the same filter at the bottom right. In a convolutional network, we learn filters which are applied on the entire image.

For instance, if instead of two dense layers with 1000 neurons, we use two convolutional layers with 64 filters (which will give us about 4 million neurons per layer), we will only need about 40,000 parameters to get what will probably be much more interesting features.
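
A rough check of that estimate, assuming 3x3 kernels, a colour input, one bias per filter, and padding so that each layer keeps the 256x256 resolution (none of which is stated explicitly above):

```python
def conv_parameters(in_channels, n_filters, kernel_size=3, bias=True):
    """Parameters of one convolutional layer with square kernels."""
    per_filter = kernel_size * kernel_size * in_channels + (1 if bias else 0)
    return n_filters * per_filter


# Colour input (3 channels), then two layers of 64 filters with 3x3 kernels.
first = conv_parameters(in_channels=3, n_filters=64)
second = conv_parameters(in_channels=64, n_filters=64)
neurons_per_layer = 256 * 256 * 64  # one output per pixel and per filter
print(first + second, neurons_per_layer)
```

Under those assumptions we get 38,720 parameters for a little over 4 million neurons per layer. Note that the parameter count does not depend on the size of the image at all.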

Pooling and upsampling

A downside of convolutional architectures is that neurons lose the overall context of the image, as they can only see a small part of it. It's very difficult to recognize objects, or to perform other high-level computer vision tasks, when every component of the algorithm has a very limited field of view.

Pooling layers are used to downsample an image. This allows the next layer to have a wider field of view, at the cost of a lower resolution. The most common pooling technique is the "max-pooling", which selects the maximum value out of a small neighbourhood and feeds it forward in the network. Pooling effectively summarizes the information contained in a neighbourhood, and makes the network invariant to small translations [7].
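
A 2x2 max-pooling can be sketched in a few lines of NumPy. The function below is only an illustration, and it only handles a single 2D feature map with even dimensions.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """2x2 max-pooling with stride 2 on a 2D array with even height and width."""
    h, w = feature_map.shape
    # Split the map into 2x2 blocks, then keep the maximum of each block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]])
pooled = max_pool_2x2(feature_map)
# pooled is [[4, 2], [2, 8]]
```

The output has half the resolution in each direction, so a 3x3 filter applied to it "sees" a 6x6 region of the original feature map.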

Upsampling layers have the opposite effect: they upscale the feature maps, either through a "transposed convolution" operation, or through interpolation. This is typical of segmentation networks, where we need the output to have the same size as the input. Many segmentation networks will go through a "downsampling" phase, extracting features with a progressively higher semantic level as the network combines low-level features and puts them into their respective context, followed by an "upsampling" phase, where those features are extrapolated into a "prediction map" giving a class probability for every pixel of the image. This can be seen in both figures 1 and 2 above.

...and all the rest

There are many other types of layers, with more specific purposes. For instance, Dropout units [8] act as a sort of "filter", randomly and temporarily removing neurons of a layer during training. This means that we actually train slightly different, smaller versions of the network in parallel. This has a similar result to training multiple classifiers and averaging their results, which is to make it more likely to generalize well.
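
As an illustration, here is a sketch of the "inverted dropout" variant, which rescales the surviving activations during training so that the layer can be left untouched afterwards. This is a common way to implement the idea, and a slight departure from the original formulation [8], which rescales the weights at test time instead.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training.

    Surviving activations are scaled by 1 / (1 - rate) ("inverted dropout"),
    so that nothing needs to change when the layer is used after training."""
    if not training or rate == 0.0:
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)


rng = np.random.default_rng(42)
activations = np.ones(10_000)
dropped = dropout(activations, rate=0.5, rng=rng)
```

With a rate of 0.5, about half of the values in `dropped` are zero and the others are doubled, so the average activation stays roughly the same.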

Batch normalization is another modifier that we can add to a layer. It standardizes its inputs by estimating their means and variances throughout training. This makes the learning process more stable, and tends to accelerate convergence [9].
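
A sketch of the training-time computation, for a batch of shape (examples, features). The running averages that are used once training is over are left out, and `gamma`, `beta` and `eps` follow the usual naming but are otherwise placeholders here.

```python
import numpy as np


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature over the batch, then rescale and shift.

    `batch` has shape (n_examples, n_features); gamma and beta are the
    learnable scale and offset, eps avoids dividing by zero."""
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    normalized = (batch - mean) / np.sqrt(var + eps)
    return gamma * normalized + beta


rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # made-up activations
out = batch_norm(batch)
```

Whatever the scale of the incoming activations, each feature of `out` has a mean close to 0 and a standard deviation close to 1 over the batch.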

Skip-connections are another very common feature of modern deep networks. The idea of a skip-connection is to add shortcuts through the network, where the output from an "earlier" layer is reintroduced later in the network. This can be done over a very short distance, as in residual units [10], which skip a few convolutions, or over large parts of the network as can be found for instance in the U-Net architecture [2].

Conclusion

Network architectures of all sizes and shapes exist, and the state of the art has moved from the relatively short networks of the "early days" of the 00s, with half a dozen layers, to increasingly deep and complex networks with more than 100 layers in the mid-10s. More recent years have seen a move back to smaller, more efficient architectures, as the focus shifted a bit from getting the absolute best performance to getting solutions which can work on regular computers and smartphones for the general public.

There is no doubt that the specific design of the network will influence how well it can model a specific dataset, how well it can perform a specific task, and how efficiently it will be able to learn.

But modern, "general-purpose" networks will work well for a variety of tasks, and more often than not, the most important part of the design process is to think about how to train it. How does a neural network learn? That will be our next topic.

References

[1] Foucart et al., "SNOW: Semi-Supervised, Noisy and/or Weak data for Deep Learning in Digital Pathology", Proc. ISBI 2019. doi:10.1109/ISBI.2019.8759545
[2] Ronneberger et al., "U-Net: Convolutional Networks for Biomedical Image Segmentation", Proc. MICCAI 2015. doi:10.1007/978-3-319-24574-4_28
[3] W. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity", Bulletin of Mathematical Biophysics, 1943. doi:10.1007/BF02478259
[4] K. Hornik et al., "Multilayer Feedforward Networks are Universal Approximators", Neural Networks, 1989. doi:10.1016/0893-6080(89)90020-8
[5] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position", Biological Cybernetics, 1980. doi:10.1007/BF00344251
[6] LeCun et al., "Back-propagation applied to handwritten zip code recognition", Neural Computation, 1989. doi:10.1162/neco.1989.1.4.541
[7] I. Goodfellow, Y. Bengio, A. Courville, "Deep Learning", MIT Press, 2016. https://www.deeplearningbook.org/
[8] N. Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research, 2014. doi:10.5555/2627435.2670313
[9] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", Proc. ICML, 2015. doi:10.5555/3045118.3045167
[10] K. He et al., "Deep Residual Learning for Image Recognition", Proc. CVPR, 2016. doi:10.1109/CVPR.2016.90

The machine learning pipeline

adrien@adfoucart.be — 2020-11-30

Solutions to machine learning tasks can generally be represented in the form of a pipeline, going from the "input", which is the raw data that it works on, to its "output", which will be its solution to the task.

More accurately, there will typically be two pipelines: one for learning, and one for "production", when the trained model can be applied to its task on new data.

Let’s go back to the "face detection" problem from my previous post. The learning pipeline, which would be applied to all the examples in the training dataset, could look something like this:

Fig. 1 - A machine learning pipeline.

Let's take it step by step.

First, we apply some pre-processing to the image. The goal of pre-processing is to make global transformations that reduce the signal of whatever we don't want to detect and increase the signal of what we want to detect. A common step would be noise reduction, or edge detection.

Then, we try to detect "candidates". This step is also very common in image processing tasks. The goal is to train the model not on one large image with several faces in it, but on many small images with a maximum of one face, as it will make the problem less complex. We could just cut the image into small fragments, or we can first use a simple metric (for instance: sample regions from areas where some borders are present) to compute possible candidates.

Third, we can compute useful features on those fragments. Feature extraction means that we replace the pixel information by a set of vectors of statistics computed on these pixels. It could be the mean value, the standard deviation, but also more complex things like the gradient directions or the presence or absence of certain pre-determined patterns. After this step, we will have replaced our image by a set of vectors. In other words, we now have our data "points", each feature corresponding to a dimension in a possibly highly dimensional feature space. If we only have two features, we could represent every candidate as a point in 2D, as we did with the "movie" data from the previous post.

Step 4 is what we generally think of when talking about "training a machine learning model": fitting the parameters of the model based on the features and the available annotations (e.g. does the fragment contain a face or not?)

Finally, we would probably have to apply some post-processing to the output of the model. This would mean taking the prediction for each fragment, putting the "positive" predictions back on the original image, and possibly using some additional rules to clean up the results. For instance: if we have two fragments which are mostly overlapping, we could merge them into one single face.

The parameters of the machine learning model are those optimized in step 4. But we also have to choose the hyper-parameters of the whole pipeline: what type of pre-processing, how we determine the candidates, which features we choose, what post-processing rules, etc.

Once the model is trained, it could then be used "in production" with the adapted pipeline on previously unseen images:

Apply same pre-processing.
Get candidates.
Extract features.
For each fragment, apply the model to the extracted features and predict if it contains a face or not.
Apply post-processing.
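
Those five steps can be sketched as a single function that chains them together. Every step is passed in as a placeholder callable: none of these names refers to a real library, and the toy stand-ins below (working on a 1D list of numbers instead of an image) are only there to show how the data flows.

```python
def detect_faces(image, preprocess, get_candidates, extract_features,
                 model, postprocess):
    """Run the five production steps in order.

    Every argument after `image` is a placeholder callable standing for one
    step of the pipeline."""
    cleaned = preprocess(image)
    fragments = get_candidates(cleaned)
    predictions = [(fragment, model(extract_features(fragment)))
                   for fragment in fragments]
    return postprocess(predictions)


# Toy stand-ins on a 1D "image", only to show the flow of data.
image = [0, 0, 9, 8, 0, 1, 7, 9]
faces = detect_faces(
    image,
    preprocess=lambda img: [min(p, 8) for p in img],   # clip bright pixels
    get_candidates=lambda img: [img[i:i + 2] for i in range(0, len(img), 2)],
    extract_features=lambda frag: sum(frag) / len(frag),   # mean value
    model=lambda mean: mean > 4,                           # stands for the trained model
    postprocess=lambda preds: [frag for frag, is_face in preds if is_face],
)
```

Swapping any one of the callables changes the behaviour of the whole pipeline, which is why the choices made outside of the model matter as much as the model itself.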

So, what's different about "Deep Learning"?

In this case, the main things that would change if we used a Deep Learning model would be to remove the pre-processing step, and to merge the feature extractions and model optimization steps into one, as illustrated below.

Fig. 2 - A deep learning pipeline.

The reason for that is that Deep Learning methods work best by removing the "human bias" from the equation. If pre-processing is needed, the Deep Learning model should be able to find it by itself. If some features are more useful than others, the model should find them from the raw data, not use those that the researchers thought about including in their tests.

It's important to note that, even with a Deep Learning method, there are still a lot of hyper-parameters outside of the network itself. We will very rarely see, for instance, a Deep Learning solution where we just dump the entire images in the network along with the position of all the faces and expect the network to learn from that. This would make the problem extremely complex, and therefore require a very large network, a huge number of examples (and, therefore, a huge amount of time-consuming annotations), and an unrealistic amount of time to train, unless you're Google.

That's why it’s always dangerous to reduce an algorithm only to the specific architecture of neural network used: many choices are made along the way which influence the end result just as much.

In the next article, we will look at the building blocks of most Deep Learning models: the different layers that form deep neural networks.

Every machine learning algorithm

adrien@adfoucart.be — 2020-11-25

One of the main things that interested us when our laboratory started working on Deep Learning for digital pathology was to determine which parts of "the deep learning pipeline" actually mattered. Many different Deep Neural Networks have been developed over the years, and in different competitions we often see multiple solutions based on networks which are very similar (or identical) obtaining largely different results [1].

Networks which win high-profile challenges are considered state-of-the-art and become widely used, but it’s not always clear how much of the success (or failure) of a given solution can be attributed to the network itself.

So, before we get into that, let’s take a moment to go through the basics. As we have said before, Deep Learning is a subset of machine learning. Let’s therefore first review the main elements of a "Machine Learning" solution, what constitutes the "machine learning pipeline", and what the building blocks of a Deep Neural Network for image analysis are.

Every machine learning problem and algorithm

Any machine learning problem is characterised by a Dataset, a Task, and an Evaluation metric.

The Dataset can take many different forms. It is the "knowledge base" from which we will train our algorithms. In the kind of problems of interest to us here, it will generally be a set of images which may be associated with a set of annotations. The annotations can be very precise (we know what every pixel of the image represents), very imprecise (we know that there is a particular object of interest somewhere in the image), or nonexistent. Depending on this level of annotation, we will talk about supervised, weakly supervised, semi-supervised or unsupervised datasets, but that’s a topic for later.

The Task is what we want to do with the dataset. Do we want to recognize faces in pictures of a crowd? To count the number of nuclei in a microscopy image? To determine the size and position of a tumour in an MRI?

Finally, the Evaluation metric describes how we are going to decide what a "good result" means for solving our tasks. This may be a harder problem than it looks. Let's take as an example a "face detection" task. The task consists in detecting the location of all the faces in a given image. Let's now say that we have three competing algorithms which obtain the results shown below. Which one is the best?

Fig. 1 - Three different results on the same image. Which is the best? Original photo by Emma Bauso from Pexels.

The first one only detected 75% of the faces, but each predicted face is correct. The second detected all faces but added a false prediction. The last one detected all the faces, but the localization is imprecise. The goal of a good evaluation metric is to provide "a quantitative measure of its performance" [2], but different measures will favour different types of performances. If we require a very large overlap between the "true" region of the face and the detected region, the third algorithm will be last, while if we allow approximate localization and simply measure which faces were detected (the sensitivity) and if there were no false detections (the specificity), it may end up being first.
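
To see how much the choice of overlap threshold matters, here is a small sketch using the "intersection over union" of two boxes. This is one common way to measure overlap, chosen here as an assumption; the box coordinates and the two thresholds are made up.

```python
def intersection_over_union(box_a, box_b):
    """Overlap between two boxes given as (left, top, right, bottom)."""
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[2], box_b[2])
    bottom = min(box_a[3], box_b[3])
    intersection = max(0, right - left) * max(0, bottom - top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


true_face = (10, 10, 50, 50)
detected = (20, 20, 60, 60)   # same size, shifted by 10 pixels
iou = intersection_over_union(true_face, detected)
is_match_strict = iou >= 0.5    # made-up "large overlap" requirement
is_match_lenient = iou >= 0.2   # made-up "approximate localization" requirement
```

The same detection counts as a miss under the strict rule and as a hit under the lenient one, which is enough to change the ranking of two algorithms.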

Any machine learning algorithm, conversely, is characterized by a Model with parameters and hyper-parameters, and by an Optimization method.

The Model defines the relationship between the input of the algorithm (the data) and its output. Let's go back to our example from our "So... what’s Deep Learning?" post. We were trying to determine if we will like a movie based on the percentage of comedy and the percentage of violence it contains.

One of the most basic models is the linear model. If $p_c$, $p_v$ are the percentages of comedy and violence, a linear model would search for a line with the equation $y = a \times p_c + b \times p_v + c$ that best separates the "movies we like" from the "movies we dislike". $a$, $b$ and $c$ are the parameters of the model. They are what the model will have to learn using the dataset. Running a simple linear classifier such as the Ridge Classifier from the scikit-learn Python library on some fake, generated data would give us something like the figure below.

Fig. 2 - Results of the RidgeClassifier on fake "movie" data.
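
The data behind the figure is not reproduced here, so the sketch below generates its own fake movies. A plain least-squares fit on +1/-1 labels stands in for the Ridge Classifier, which applies the same idea with an added regularization term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data (made up here, not the data behind the figures):
# p_c and p_v are percentages, and we "like" a movie when it has
# more comedy than violence, with some noise on the labels.
n = 200
p_c = rng.uniform(0, 100, n)
p_v = rng.uniform(0, 100, n)
liked = np.where(p_c - p_v + rng.normal(0, 10, n) > 0, 1.0, -1.0)

# Least-squares fit of y = a * p_c + b * p_v + c on the +1 / -1 labels.
X = np.column_stack([p_c, p_v, np.ones(n)])
(a, b, c), *_ = np.linalg.lstsq(X, liked, rcond=None)

# A movie is predicted as "liked" when y is positive.
predicted = np.where(a * p_c + b * p_v + c > 0, 1.0, -1.0)
accuracy = np.mean(predicted == liked)
```

The line that separates the two classes is the set of points where $y = 0$. On this fake data, the fit gives a positive weight to comedy and a negative weight to violence, as expected from the way the labels were generated.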

The hyper-parameters are all the choices that were made while designing the algorithm. A linear model is, for instance, a specific case of a polynomial model, so we could say that the degree of the polynomial model is a hyper-parameter that we have chosen to set at 1.

The Optimization method is the algorithm that we will use to learn the parameters. There are many different ways to go about this, but the main idea is that we will try to adapt the parameters step by step until they are in a position where we can't improve the results anymore. At each step, the algorithm evaluates if each parameter needs to be increased or decreased to get a better result, and by how much, according to the training data. When any change in the parameters seems to give worse results, it stops.

This also requires us to define what "improving" the results means, which comes back to the question of the evaluation metric. Without going too deep here, it's worth noting that a good evaluation metric for determining if an algorithm is the best at solving a Task isn't necessarily a good metric for optimizing the parameters. As a quick example, any metric that takes a significant amount of time to compute will not be practical for the learning process, as "learning" will generally involve doing millions or billions of such evaluations.
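
The step-by-step procedure described above can be sketched for a single parameter as follows. This is a gradient-descent-flavoured illustration with a made-up cost function, not a description of any specific library's optimizer.

```python
def minimize(cost, gradient, start, step_size=0.1, max_steps=1000):
    """Adjust a single parameter step by step until the cost stops improving."""
    value = start
    best = cost(value)
    for _ in range(max_steps):
        # The gradient tells us in which direction to move, and by how much.
        candidate = value - step_size * gradient(value)
        candidate_cost = cost(candidate)
        if candidate_cost >= best:
            break  # no change improves the result any more: stop
        value, best = candidate, candidate_cost
    return value


# Toy cost with its minimum at 3 (chosen only for illustration).
found = minimize(cost=lambda p: (p - 3) ** 2,
                 gradient=lambda p: 2 * (p - 3),
                 start=0.0)
```

The cost is evaluated at every single step, which is why a metric that is slow to compute quickly becomes a problem.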

In general, a model with more parameters is said to be more complex. It can potentially solve problems which are more difficult, but at the cost of being harder to train.

Fig. 3 - Results of the third degree polynomial model on fake "movie" data.

To end this chapter, I briefly want to touch on two important notions related to that complexity: underfitting and overfitting. A model is said to be underfitting if it isn't complex enough to accurately represent the data. It is overfitting if it is so complex that it can match the data almost perfectly, which seems like a good thing at first, but is in fact quite problematic. It means that the model has memorized all the examples from the dataset, but isn't capable of generalizing to new examples.

As an illustration, I have trained three different classifiers on the fake "movie" dataset:

a linear model (trying to find a straight line that best separates the data), shown in fig. 2 above.
a third-degree polynomial model (trying to fit a curve of the form $a \times x^3 + b \times x^2 + c \times x + d$), shown in fig. 3 above.
a “nearest neighbour” model (attributing the class of the nearest point in the training dataset), shown in fig. 4 below.

Fig. 4 - Results of the nearest neighbour model on fake "movie" data.

Figures 2, 3 and 4 show the result of the training, with the frontier between the two classes. In the linear and polynomial models, we see that there are points which are on the wrong side of the border. The nearest neighbour, on the other hand, perfectly follows the training set. But is it better?

In the table below, we can see the accuracy (percentage of correct classification) of each model on the training data, as well as on new, previously unseen examples from the same dataset:

Accuracy	Training	Test
Linear	87%	94%
Polynomial	89%	96%
Nearest Neighbour	100%	89%

The nearest neighbour may better represent the training data, but its performance on new examples is the worst of the three. This is a clear sign of overfitting. The difference between the linear and third-degree polynomial is small, but as increasing the complexity improves both the training and the testing accuracy, it's likely that the linear model is underfitting the dataset.
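
The reason the nearest neighbour model reaches 100% on the training data is easy to see in code: every training point is its own nearest neighbour. The sketch below uses freshly generated fake data (not the data behind the table above), so its test accuracy will not match the numbers in the table.

```python
import numpy as np


def nearest_neighbour_predict(train_X, train_y, X):
    """Give each row of X the label of its closest training point."""
    distances = np.linalg.norm(X[:, None, :] - train_X[None, :, :], axis=2)
    return train_y[np.argmin(distances, axis=1)]


def make_movies(n, rng):
    # Fake data, generated on the spot (not the data behind the table).
    X = rng.uniform(0, 100, size=(n, 2))             # % comedy, % violence
    noise = rng.normal(0, 20, size=n)
    y = (X[:, 0] - X[:, 1] + noise > 0).astype(int)  # 1 = liked
    return X, y


rng = np.random.default_rng(1)
train_X, train_y = make_movies(200, rng)
test_X, test_y = make_movies(200, rng)

train_acc = np.mean(nearest_neighbour_predict(train_X, train_y, train_X) == train_y)
test_acc = np.mean(nearest_neighbour_predict(train_X, train_y, test_X) == test_y)
```

The training accuracy is 100% by construction, noise included, while the accuracy on the new examples is lower: the model has memorized the noise along with the signal.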

In the next article, we will build on that to look at what constitutes a machine learning pipeline.

References

[1] G. Aresta et al., "BACH: Grand challenge on breast cancer histology images", Medical Image Analysis 56 (2019). doi:10.1016/j.media.2019.05.010
[2] I. Goodfellow, Y. Bengio, A. Courville, "Deep Learning", MIT Press, 2016. https://www.deeplearningbook.org/

MITOS12: learning from bad challenge design

adrien@adfoucart.be — 2020-03-09

The ICPR 2012 mitosis detection competition - usually called MITOS12 - proved that Deep Learning was the way to go for mitosis detection, and was influential in introducing Deep Learning into the world of Digital Pathology. It was also a flawed challenge, and there is a lot to learn from the mistakes that were made in its design. To understand those mistakes, we first need to look at what the goal of a "computer vision challenge" is.

1. How do we know if a computer vision algorithm is better than another?

Let's say you are a researcher who just developed a new algorithm for recognizing the species of a bird from a photograph in a natural setting. Now, you want to publish your research. In order to convince reviewers that your algorithm is interesting, however, you not only have to prove that it works, but you also have to prove that it improves on the current state-of-the-art.

If you are not a big fan of ethics, here's what you do: you find a set of test images where your algorithm works really, really well. Then, you implement other algorithms, but without caring too much if you do it exactly right. You test all algorithms on your test set and, surprise, you are the best! This may be very helpful in getting published, but of course it doesn't tell us anything useful.

This is of course an exaggeration... but not so far from the truth. Even if you do care for ethics, it's very hard to implement other people's methods - especially if they didn't publish the code - and to find the best parameters for every method on your dataset. A very popular way to solve this problem is to use benchmarks and challenges.

The idea of the benchmark is fairly simple. Let's take our "bird recognition" problem: at one point in time, someone publishes a large collection of annotated bird images. Everyone who works on the problem of "bird recognition" can test their algorithm on the same data, which means that you can directly compare your method to what others have published. Challenges are similar, except that there is usually a "time limit" component: someone publishes a bird dataset, and tells everyone interested to submit their method before a given date. Then, everyone is evaluated at the same time (ideally on previously unreleased test images), and the results are published. This ensures a certain fairness in the comparison, as everyone plays by the same rules.

While many challenges only attract a few participants and are quickly forgotten, some have become true references for the computer vision community. The PASCAL Visual Object Classes challenge, for instance, ran between 2005 and 2012, with researchers tasked with recognizing objects from up to 20 different classes (see figure below). Starting in 2010, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC, often simply referred to as "ImageNet") has become the reference for "generic computer vision tasks".

Classes from the PASCAL VOC challenge (source)

Many Deep Learning algorithms have become famous through their ImageNet performances, such as the previously mentioned AlexNet [1] (winner in 2012), Google's "Inception" network [2] (winner in 2014), or Microsoft's "ResNet" [3] (winner in 2015).

Challenges entered the world of digital pathology around 2012. Quoting Geert Litjens (emphasis mine):

The introduction of grand challenges in digital pathology has fostered the development of computerized digital pathology techniques. The challenges that evaluated existing and new approaches for analysis of digital pathology images are: EM segmentation challenge 2012 for the 2D segmentation of neuronal processes, mitosis detection challenges in ICPR 2012 and AMIDA 2013, GLAS for gland segmentation and, CAMELYON16 and TUPAC for processing breast cancer tissue samples.
G. Litjens et al [4]

MITOS12 [5] was one of the first computer vision contests with a specific digital pathology task. Let's take a closer look at it.

2. MITOS12: The Challenge

First of all, why do we want to count mitoses? I'll let the experts explain (emphasis mine):

Mitotic count is an important parameter in breast cancer grading as it gives an evaluation of the aggressiveness of the tumor. However, consistency, reproducibility and agreement on mitotic count for the same slide can vary largely among pathologists. An automatic tool for this task may help for reaching a better consistency, and at the same time reducing the burden of this demanding task for pathologists.
L. Roux et al [5]

There are 3 mitoses in this image from the MITOS12 dataset, according to the pathologist who annotated it. Good luck finding them!

The following process was used to create the challenge:

A team of the pathology department of the Pitié-Salpêtrière Hospital selected five slides stained with H&E, each coming from a different patient.
Each slide was scanned with three different scanners (Aperio, Hamamatsu and a multispectral microscope).
One pathologist manually annotated all mitotic cells in 50 regions (10 per slide). There were 326 annotated mitoses in the dataset. An example of a full region is shown above.
The 50 regions were split into a "training set" of 35 regions and a "test set" of 15 regions.
The training set (images and annotations) was released in November 2011, and different teams started developing and training their algorithms.
The test set (images only) was provided in August 2012, and participating teams sent their predictions.
All participants were evaluated using the following metric:
- Count the number of "True Positives" (TP), meaning the number of mitotic cells that were correctly predicted by the algorithm.
- Count the number of "False Negatives" (FN), meaning the number of mitotic cells that were not detected by the algorithm.
- Count the number of "False Positives" (FP), meaning the number of non-mitotic cells that were incorrectly detected as mitosis by the algorithm.
- Compute the F1 Score, defined as $\frac{2 TP}{2 TP + FN + FP}$. The F1 score is a very common metric for classification tasks, which gives equal importance to the "precision" (are "positive predictions" accurate?) and "recall" (do we find all "positive examples"?).
The organizers published a ranking of the algorithms according to this metric.

The MITOS12 results were computed separately on the images from the three different scanners. Most teams only submitted results for the "Aperio" scanner. The best results on that dataset were achieved by Dan Cireşan's team, IDSIA [6]. Their algorithm correctly detected 70 of the 100 mitoses in the test set, with only 9 false detections, for a winning F1 score of 0.7821, the runners-up achieving scores of 0.7184 and 0.7094.
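To make the metric concrete, here is a minimal sketch that recomputes the winning score from the counts quoted above (70 detected out of 100, 9 false detections). The function name is mine, not something taken from the challenge's evaluation code.

```python
def f1_score(tp: int, fn: int, fp: int) -> float:
    """F1 = 2TP / (2TP + FN + FP), as defined in the challenge."""
    denominator = 2 * tp + fn + fp
    return 2 * tp / denominator if denominator else 0.0


# Counts reported for the winning entry on the Aperio test set:
# 70 of the 100 annotated mitoses detected (so 30 missed), 9 false detections.
tp, fn, fp = 70, 30, 9

precision = tp / (tp + fp)  # are "positive predictions" accurate?
recall = tp / (tp + fn)     # do we find all "positive examples"?
f1 = f1_score(tp, fn, fp)

print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
# F1 comes out at 0.7821, matching the published score
```

The same value can be obtained as the harmonic mean of precision and recall, which is why the F1 score is said to weigh the two equally.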

3. What is wrong with those results?

There are numerous problems with the challenge design, which were actually acknowledged by the authors in this paragraph from the Discussion of their article:

An improved version of this successful challenge will involve a much larger number of mitosis, images from more slides and multiple pathologists’ collaborative/cooperative annotations. Besides, some slides will be dedicated to test only without any HPF of these slides included in the training data set.
L. Roux et al [5]

There are four different issues here, so let's take them one at a time.

1) The small number of mitoses, with only around 200 in the training set, is certainly a problem. As mitoses can have a very variable appearance, it is unlikely that all of the possible "morphologies" of mitotic cells will be represented in the dataset. Machine Learning algorithms are therefore unlikely to detect the morphologies they have never seen.

2) The small number of slides is even more problematic, especially as these few slides come from an equally small number of patients. Having more patients means more diversity, and less risk of a bias in the dataset. For instance, if a patient happens to have more mitoses than the others (because she has a more malignant cancer), and also happens to have some independent morphological characteristics in her breast tissue, it is very likely that the algorithm will pick up those independent features and use them to predict the presence of mitotic cells, even if those features would be completely meaningless for other patients.

3) The fact that only one pathologist annotated the slides is also worrying. In the introduction to the challenge, they list as a reason for needing automatic counting the lack of inter-pathologist agreement on mitotic count, citing a study by Christopher Malon and his colleagues [7]. In that study, three different pathologists were asked to classify 4204 nuclei as mitotic or non-mitotic. Pathologists were allowed to put "Maybe" as an answer. Even excluding those "maybe" answers from the comparison (and therefore comparing only cases where both pathologists were "reasonably certain" of their choice), the agreement between two pathologists was at most 93.5% of "same classification", and in the worst case at 64.7%. That same study also compares each pathologist against the "majority label". The F1 score of a pathologist against the majority, using the data from that study, varies between 0.704 and 0.997. In the MITOS12 challenge results, the top 3 teams are all within that range.

Now that doesn't mean that those three teams are necessarily better than, or even as good as pathologists. It's not fair to compare them on different datasets. But the point is, the difference in performance between the top teams in the challenge is small enough that a different annotator might have led to a completely different ranking.
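For readers who want to see how such agreement figures are obtained, here is a small sketch of the two measures mentioned above: the share of identical labels once the "maybe" answers are excluded, and the F1 score of one set of labels against a reference. The labels are invented for illustration, and treating a "maybe" as "not detected" in the F1 computation is my own assumption, not necessarily what Malon's study does.

```python
def agreement_rate(labels_a, labels_b):
    """Share of identical labels, ignoring nuclei where either rater said 'maybe'."""
    pairs = [(a, b) for a, b in zip(labels_a, labels_b)
             if a != "maybe" and b != "maybe"]
    if not pairs:
        return float("nan")
    return sum(a == b for a, b in pairs) / len(pairs)


def f1_against(reference, predicted, positive="mitotic"):
    """F1 score of `predicted` when `reference` is taken as the ground truth."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    denominator = 2 * tp + fn + fp
    return 2 * tp / denominator if denominator else 0.0


# Invented labels for six nuclei (not data from the study).
rater_1 = ["mitotic", "mitotic", "non-mitotic", "maybe", "non-mitotic", "mitotic"]
rater_2 = ["mitotic", "non-mitotic", "non-mitotic", "mitotic", "non-mitotic", "mitotic"]

print(agreement_rate(rater_1, rater_2))            # 4 of the 5 comparable nuclei agree
print(f1_against(reference=rater_2, predicted=rater_1))
```

Swapping which rater plays the role of the reference changes what counts as a false positive or a false negative, which is exactly why a ranking built on a single annotator is fragile.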

4) The last problem acknowledged by Roux's publication is the worst one, at least in terms of methodology: they didn't split the training set and the test set properly. Ideally, when we test an algorithm, we want to make sure that it is capable of handling new cases. The best way to do that is to have a test set as independent as possible from the training set.

In biomedical images, that typically means: testing on other slides, taken from other patients, if possible taken with another acquisition machine. Changing more variables means that the results are a lot more meaningful. The best algorithms would then have to be those that really learned ways to describe the object of interest - in this case the mitosis. In her master's thesis in our lab, Elisabeth Gruwé tested the same algorithm twice: first using the "official" training-set / test-set split from the contest, and then using a "correct" split, by putting one patient aside for testing and training on the four others. The results on the official split were close to those of the three winning teams (0.68), while the results on the correct split were significantly worse (0.54).
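A patient-level split like the "correct" one described above is easy to sketch. The snippet below is only an illustration under my own assumptions: the region and patient identifiers are invented, and it enumerates every possible held-out patient, which is not necessarily the exact protocol used in the thesis.

```python
from collections import defaultdict


def leave_one_patient_out(regions):
    """Yield (held_out_patient, train, test) so that no patient is on both sides."""
    by_patient = defaultdict(list)
    for region_id, patient_id in regions:
        by_patient[patient_id].append(region_id)

    for held_out in sorted(by_patient):
        test = list(by_patient[held_out])
        train = [region
                 for patient, patient_regions in by_patient.items()
                 if patient != held_out
                 for region in patient_regions]
        yield held_out, train, test


# 5 patients with 10 regions each, as in MITOS12; the identifiers are made up.
regions = [(f"P{p}_R{r}", f"P{p}") for p in range(1, 6) for r in range(1, 11)]

for held_out, train, test in leave_one_patient_out(regions):
    print(held_out, len(train), len(test))  # 40 training regions, 10 test regions
```

The important property is that the split is made on the patient identifier and not on the region: a region-level split, like the official one, lets the algorithm see tissue from every patient during training.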

4. Does it matter?

Does the ranking matter? In terms of visibility, probably. The methods proposed by challenge winners tend to be copied, modified, adapted, and become the norm, while runners-up may be completely ignored... even if their results are functionally equivalent. If we look at the publications of the three MITOS12 winning methods, a certain trend is visible. The article by the winner, Cireşan's team [6], was published at the biggest biomedical imaging conference and has been cited more than a thousand times (according to Google Scholar). The article by the runner-up, Humayun Irshad from the University of Grenoble, was published in a good journal [8] and has been cited about 150 times. The article by the third-placed team, Ashkan Tashk and colleagues from Shiraz University of Technology, was published in the proceedings of an obscure Iranian conference [9] and has been cited about 30 times.

Now the rankings are not the only explanation for this difference in visibility, and the number of citations is not a direct reflection of the influence of a paper. Dan Cireşan was part of a well-established research team with Jürgen Schmidhuber, a Deep Learning pioneer. Humayun Irshad's thesis director was Ludovic Roux, the organizer of the challenge, which is kind of a problem of its own, but ensured that he got some visibility in the follow-up articles. Ashkan Tashk and the Iranian team certainly didn't have the same recognition beforehand - or after.

Two years later, an extended version of the challenge was proposed at the ICPR 2014 conference, MITOS-ATYPIA 14. It provided more data, and the annotations were made by two different pathologists, with a third one looking at all cases where the first two were in disagreement. The data included a confidence score for each mitosis based on the agreement or disagreement of the pathologists... and it was correctly split at the patient level. In 2016, Hao Chen and his Hong Kong team published their results on both the 2012 and 2014 datasets [10]. On the 2012 dataset, they achieved an F1 score of 0.788, "beating" Cireşan's entry. They also beat all other existing publications on the 2014 dataset... with a score of 0.482. Comparing the two datasets, they say:

One of the most difficult challenges in [the 2014] dataset is the variability of tissue appearance, mostly resulted from the different conditions during the tissue acquisition process. As a result, the dataset is much more challenging than that in 2012 ICPR.
H. Chen et al [10]

But from a purely machine learning perspective, this doesn't sound quite right. Yes, the increased variability in the test set is more challenging, but the increased variability in the training set should help the algorithms. The huge drop in performance is likely to be in large part due to the incorrect setup of the 2012 challenge. The 2014 edition, however, attracted far fewer participants, and didn't get the same visibility... probably because the results were, for obvious reasons, a lot worse.

References

[1] A. Krizhevsky, I. Sutskever, G. Hinton (2012) "ImageNet Classification with Deep Convolutional Neural Networks", Communications of the ACM, 60(6), DOI: 10.1145/3065386
[2] C. Szegedy et al. (2015) "Going deeper with convolutions", CVPR 2015, DOI: 10.1109/CVPR.2015.7298594
[3] K. He, X. Zhang, S. Ren, J. Sun (2015) "Deep residual learning for image recognition", CVPR 2016, DOI: 10.1109/CVPR.2016.90
[4] G. Litjens et al. (2017) "A survey on deep learning in medical image analysis", Medical Image Analysis, 42, 60-88, DOI: 10.1016/j.media.2017.07.005
[5] L. Roux et al. (2013) "Mitosis detection in breast cancer histological images An ICPR 2012 contest", Journal of Pathology Informatics, 4, 1, DOI: 10.4103/2153-3539.112693
[6] D. Cireşan, A. Giusti, L. Gambardella, J. Schmidhuber (2013) "Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks", Proceedings of MICCAI 2013 in Lecture Notes in Computer Science, 8150, 411-418, DOI: 10.1007/978-3-642-40763-5_51
[7] C. Malon et al. (2012) "Mitotic figure recognition: agreement among pathologists and computerized detector", Analytical Cellular Pathology, 35(2), 97-100, DOI: 10.3233/ACP-2011-0029
[8] H. Irshad (2013) "Automated mitosis detection in histopathology using morphological and multi-channel statistics features", Journal of Pathology Informatics, 4, 10, DOI: 10.4103/2153-3539.112695
[9] A. Tashk, M. Helfroush, H. Danyali, M. Akbarzadeh (2013) "An automatic mitosis detection method for breast cancer histopathology slide images based on objective and pixel-wise textural features classification", Proceedings of CIKT, DOI: 10.1109/IKT.2013.6620101
[10] H. Chen, Q. Dou, X. Wang, J. Qin, P. Heng (2016) "Mitosis Detection in Breast Cancer Histology Images via Deep Cascaded Networks", AAAI Conference on Artificial Intelligence, AAAI Publications

A timeline of Deep Learning in Digital Pathology

adrien@adfoucart.be — 2020-03-02

1. Who was first?

In 2017, Geert Litjens and his colleagues from Radboud University published a comprehensive survey of deep learning methods in medical image analysis [1]. It's possible that they missed some early papers that could qualify as "Deep Learning": as I've written before, the boundaries of what is or isn't Deep Learning are unclear, and articles written before 2012 are unlikely to use that terminology. If such articles exist, however, they have been forgotten in the large pile of never-cited research that fails to be picked up by Google Scholar, Scopus, or other large research databases. Litjens' survey therefore remains the reference on the matter.

So who was first to use "Deep Learning in Digital Pathology"? The turning point seems to come from the MICCAI 2013 conference in Nagoya, with two pioneering articles: Angel Cruz-Roa's automated skin cancer detection [2], and Dan Cireşan's mitosis detection in breast cancer [3].

Cruz-Roa uses H&E stained images from skin tissue, to try to automatically determine if there is a malignant cancer in the sample. As illustrated in the figure below, the difference between cancer and non-cancer is based on morphological criteria which are very difficult to define.

Example of H&E images from cancer and non-cancer skin tissue, taken from [2]

Their algorithm produces both a classification (cancer or not) at the image level, and what they call a "digital staining", which is basically a probability map of where the cancer regions are (see figure below). This is a very important feature for machine learning methods in biomedical imaging, related to the concept of interpretability. A machine learning algorithm which only produces a "diagnosis", but is unable to "explain" how this diagnosis came to be, cannot be trusted. I will most certainly come back to that idea later: the "reasoning" that machine learning algorithms (deep or not) make is sometimes more a reflection of biases and artefacts in the data used to train them than of an understanding of the pathology. Having an output which includes an explanation of the diagnosis is therefore essential to control whether any "weird stuff" is happening.

Output of Cruz-Roa's algorithm, taken from [2]

Cireşan also uses H&E images, taken from breast cancer tissue. I'll let him introduce the problem:

Mitosis detection is very hard.
Cireşan, Giusti, Gambardella & Schmidhuber [3]

Mitosis - the process of cell multiplication - is a relatively rare event, meaning that, in the images which are available to train the algorithm, only a very small fraction of the pixels will be part of a nucleus, and an even smaller fraction will be part of a nucleus going through mitosis. In addition to mitosis being rare, the appearance of the cell nucleus will be very different depending on which stage of mitosis the cell is currently in. To get an idea of how difficult the task is, we can just look at these examples from the article (click on the image for a larger version):

Examples of mitotic and non-mitotic cells, with the "mitosis probability" given by Cireşan's algorithm and the "true classification", taken from [3]

Cireşan's results were far from perfect, but they were impressive enough to be a milestone in the domain: as the winner of the "ICPR 2012 mitosis detection competition", the method got a lot of attention... despite the many methodological issues with the competition itself, which is the topic for another post.

2. Invasion of the Deep Learners

By showing that Deep Learning was a way to get good results on digital pathology tasks, Cireşan and Cruz-Roa opened up the floodgates. Litjens lists many different applications in the subsequent years: bacterial colony counting, classification of mitochondria, classification and grading of different types of cancer, detection of metastases... Mostly on H&E images, but also sometimes using immunohistochemistry, Deep Learning invaded the domain.

A few highly influential works that I would like to mention here, and that I will probably write about more later:

In 2015, Olaf Ronneberger, Philipp Fischer and Thomas Brox introduced the U-Net network architecture [4], winning challenges in cell segmentation and cell tracking at the ISBI 2015 conference. This particular architecture is now probably the most widely used in biomedical imaging.
In 2016, Andrew Janowczyk and Anant Madabhushi published their Deep Learning for digital pathology tutorial [5], a very practical article on how to approach various use cases in digital pathology, with well-annotated datasets that they also made public. For researchers in the field, this is probably one of the best available resources to get started.
In 2017, Korsuk Sirinukunwattana published the results of the "GlaS" gland segmentation in colon histology images challenge [6]. This challenge was influential in two ways: first, by providing a high-quality dataset of colon histology images with very accurate annotations, and second by demonstrating how much Deep Learning had penetrated the field of digital pathology. Of the 6 methods deemed good enough to be described in the results article, 5 used deep learning approaches.

3. Looking forward

After these pioneering works, the future of the field may seem a little dull. If we have deep neural networks that work for most digital pathology tasks... what is there left to do?

Fortunately, finding a good "deep learning network" is only a part of the "digital pathology pipeline". Everything that comes around the network - from the constitution of the datasets to the way the results are evaluated - is often more important to the final result. That is going to be a large part of what I will write about in future posts. Questions surrounding how data from challenges and data from real-world application may differ, questions about the way we evaluate algorithms, about how we declare winners and losers in ways that may not always reflect how useful the algorithms really are. For that, a good starting point will be to take a closer look at the aforementioned MITOS12 challenge from ICPR 2012.

References

[1] G. Litjens et al. (2017) "A survey on deep learning in medical image analysis", Medical Image Analysis, 42, 60-88, DOI: 10.1016/j.media.2017.07.005
[2] A. Cruz-Roa, J. Ovalle, A. Madabhushi, F. Osorio (2013) "A Deep Learning Architecture for Image Representation, Visual Interpretability and Automated Basal-Cell Carcinoma Cancer Detection", Proceedings of MICCAI 2013 in Lecture Notes in Computer Science, 8150, 403-410, DOI: 10.1007/978-3-642-40763-5_50
[3] D. Cireşan, A. Giusti, L. Gambardella, J. Schmidhuber (2013) "Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks", Proceedings of MICCAI 2013 in Lecture Notes in Computer Science, 8150, 411-418, DOI: 10.1007/978-3-642-40763-5_51
[4] O. Ronneberger, P. Fischer, T. Brox (2015) "U-Net: Convolutional Networks for Biomedical Image Segmentation", Proceedings of MICCAI 2015 in Lecture Notes in Computer Science, 9351, 234-241, DOI: 10.1007/978-3-319-24574-4_28
[5] A. Janowczyk, A. Madabhushi (2016) "Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases", Journal of Pathology Informatics, 7, 29, DOI: 10.4103/2153-3539.186902
[6] K. Sirinukunwattana et al. (2017) "Gland Segmentation in Colon Histology Images: The GlaS Challenge Contest", Medical Image Analysis, 35, 489-502, DOI: 10.1016/j.media.2016.08.008

So... what's Deep Learning?

adrien@adfoucart.be — 2020-02-24

1. Defining Deep Learning

It seems like Deep Learning should have an easy, clear-cut definition. Yet... Wikipedia, on this topic, displays a remarkable example of circular citation - or Citogenesis, as the always-relevant XKCD would put it. The Wikipedia definition is a "summary" of five definitions from a Microsoft Research paper, most of which are themselves taken from earlier versions of the same Wikipedia article.

The most direct definition from a reputable source that I could find is probably from the "Deep Learning" Nature Review of AI-superstars Yann LeCun, Yoshua Bengio and Geoffrey Hinton (emphasis mine):

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.
LeCun, Bengio & Hinton [1]

In a slightly more convoluted way, Ian Goodfellow, Yoshua Bengio (again) and Aaron Courville, in their "Deep Learning" book, introduce the topic this way (emphasis mine):

The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images. (...) A solution is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept deﬁned through its relation to simpler concepts.
Goodfellow, Bengio & Courville [2]

These definitions basically boil down to: it's AI, with machine learning, with layers. So... What's machine learning? And, while we're at it, what is AI?

Timeline of AI, Machine Learning and Deep Learning, by NVIDIA. [source]

2. What is Artificial Intelligence?

The history of Artificial Intelligence as a practical, computer-science-related field of research, goes back to the early days of computers themselves, around and right after World War II. The "ultimate goal" of AI, as illustrated in Alan Turing's best-known paper [3], was to create a computer which could be - at least in specific conditions - indistinguishable from a human.

This, unsurprisingly, is a difficult task. In fact, this particular goal, which is now generally referred to as "Artificial General Intelligence", is still mostly the domain of science-fiction.

One of the early avenues of research in AI, in the tradition of "trying to replicate human intelligence", was the artificial neuron. As early as 1943, McCulloch and Pitts [4] proposed a way to represent neurons in a mathematical model which could be replicated on a computer. They were followed by many others, but while their research was interesting, it proved to be largely impractical. Neural networks, quite simply, did not work. Artificial (General) Intelligence seemed altogether impossible. If AI as it was conceived couldn't be done, the next best thing was to change the definition of AI to something more forgiving.

In "Artificial Intelligence: foundations of computational agents" [5], Poole & Mackworth propose such a definition. AI, in their view, studies computational agents (agents whose actions and decisions can be implemented in a physical device, like a computer) that act intelligently. An agent acts intelligently when its actions are appropriate for its circumstances and its goals, when it is flexible to changing environments and changing goals, and when it learns from experience.

This places any artificial intelligence in the context of task or problem solving. The job of an AI is not to be "like a human", but to have "human-like" (or better-than-human) capabilities in one or several specific tasks.

An interesting aspect of AI as defined by Poole & Mackworth is the capacity to learn from experience. Machine Learning is a subset of AI dealing with this particular aspect of "intelligence": how can a machine learn from experience?

3. Machine Learning

To understand the basic idea of Machine Learning, it's interesting to look at one of its earliest algorithms, "Nearest Neighbor", with a version described already in 1951 [6].

Let's take an example. Imagine that you are a Big Streaming Company, and you want your AI to decide which movies or series in your catalog you should recommend to a particular user. Let's assume that you actually want to recommend something that the user will like, and not just something that you want to promote. For every movie or TV series, you have two pieces of information: how much violence, and how much comedy there is, as a percentage of the total runtime of the movie. For everything that the user has already seen, you also know (don't ask how) whether they liked it or not.

You could therefore represent all of the movies that the user has seen on a nice graph, like this:

How do you know if you will like the next movie?

What you do now is take the movie that you want to recommend, and put it on the same graph. Then, you check if the closest point is something that the user liked or not. If not, you don't recommend the movie.

Obviously, this is a ridiculous example, but the main idea is there: the algorithm uses past experience to predict new behaviour. Of course, this will work a lot better if you have more data, and if you have better ways of describing this data.
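The whole recommendation procedure fits in a few lines. This is a minimal sketch of a 1-nearest-neighbour rule using the Euclidean distance; the movies and their violence/comedy percentages are invented, and a real system would obviously use more neighbours and better descriptors.

```python
import math


def nearest_neighbor_liked(seen, candidate):
    """Return the 'liked' flag of the already-seen movie closest to the candidate."""
    closest = min(seen, key=lambda movie: math.dist(movie[0], candidate))
    return closest[1]


# ((violence %, comedy %), liked?) for everything the user has seen.
# All values are made up for the example.
seen = [
    ((80, 5), False),
    ((60, 20), False),
    ((10, 70), True),
    ((20, 50), True),
    ((5, 90), True),
]

print(nearest_neighbor_liked(seen, (15, 65)))  # True: closest point is (10, 70)
print(nearest_neighbor_liked(seen, (70, 10)))  # False: closest point is (80, 5)
```

Note that nothing is "trained" here: the past experience is simply stored, and all the work happens when a new movie has to be evaluated.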

There are many, many more complex and more accurate algorithms in Machine Learning. But in the end, they fundamentally do the exact same thing: put all of the data (the "past experience") into some space that describes it as well as possible, and then find in that space a "rule" that best predicts what to do with any future event. This rule can be as simple as a straight line (or, to be precise in the more-than-two-dimensions case, a hyperplane) dividing the "space" in two, or as intricate as a function with millions of parameters able to model any boundary shape, as in modern artificial neural networks.
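As an illustration of the simplest kind of rule, here is a sketch of a linear decision rule in the violence/comedy space of the recommendation example. The weights are hand-picked for the example rather than learned; finding good values for them from the data is precisely what the "learning" in Machine Learning refers to.

```python
def linear_rule(weights, bias):
    """Build a rule that says on which side of the hyperplane w.x + b = 0 a point lies."""
    def decide(point):
        score = sum(w * x for w, x in zip(weights, point)) + bias
        return score > 0
    return decide


# Hypothetical rule: recommend when there is more comedy than violence,
# i.e. when -1 * violence + 1 * comedy > 0.
recommend = linear_rule(weights=(-1.0, 1.0), bias=0.0)

print(recommend((10, 70)))  # True: comedy dominates
print(recommend((80, 5)))   # False: violence dominates
```

With two descriptors the boundary is a straight line; adding more descriptors only makes the weight vector longer, the rule itself stays the same.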

4. Deep Learning and the neural networks comeback

While diverse machine learning algorithms such as Decision Trees, Support Vector Machines, and many others were being developed, the "artificial neural network" world was not completely inactive. One of the major issues with neural networks was how to efficiently "train" them with new examples. This was vastly improved in 1986 with the backpropagation algorithm [7], which is still used today. In 1989, Yann LeCun and his team used it in what is now considered one of the first practical applications of "modern" artificial neural networks, to recognize hand-written ZIP codes for the US Postal system [8].

In 1997, IBM's Deep Blue beat world champion Garry Kasparov in a six-game chess match [9]. In terms of public perception, this certainly gave AI enthusiasts a boost. Deep Blue, however, was an "Intelligence" only if you used the most forgiving definition. It didn't learn anything, it didn't reason anything: it was a pure, brute-force mechanism. Deep Blue simply took the current situation of the game, and computed all possible outcomes for all possible moves, for the next 10 to 20 moves. It did use "previous experience", in the form of thousands of previously played human-vs-human grandmaster games, to determine what a "winning move" was. But in the end, it mostly relied on the fact that chess is a game with fixed rules and a finite number of outcomes. It worked because it had more processing power than what was previously available, not because it was innovative.

LeCun's success put neural networks back on the map, but they were still a curiosity. In most applications, they were impractical, took way too long to train, and didn't usually perform better than other machine learning approaches. But with the 21st century came two game changers in the machine learning world: Big Data and fast GPUs. Big Data - the ability to store huge amounts of data on everything, thanks to cheap hard drives - gave us the ability to improve machine learning in general. Fast GPUs made training larger, more useful neural networks a reality. Quoting Dan Cireşan and his colleagues in 2010:

All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning.
Cireşan, Meier, Gambardella & Schmidhuber [10]

Around the same time, the idea that "larger" neural networks were actually "deeper" neural networks, with neurons organized in "layers", each layer connected to the next starting from the raw data all the way up to the output, became common usage.
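To give an idea of what "layers connected to the next" means in practice, here is a toy sketch of data flowing through a two-layer network. The weights are arbitrary numbers picked by hand (a real network learns them with backpropagation), and the choice of ReLU as the non-linearity between layers is my own assumption for the example.

```python
def relu(values):
    """Non-linearity applied between layers: negative values are set to zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: each output neuron is a weighted sum of all inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(raw_data, layers):
    """Feed the raw data through each layer in order, up to the output."""
    activations = raw_data
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:  # no non-linearity after the output layer
            activations = relu(activations)
    return activations


# 3 inputs -> 2 hidden neurons -> 1 output, with arbitrary weights.
layers = [
    ([[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]], [0.0, 0.5]),
    ([[1.0, -2.0]], [0.1]),
]

output = forward([1.0, 2.0, 3.0], layers)
print(output)  # a single value, roughly -2.4
```

Making the network "deeper" simply means adding more entries to the `layers` list, so that each layer works on the output of the previous one instead of on the raw data.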

By 2012, the achievements of "Deep" neural networks became impossible to ignore. On ImageNet, the largest visual object recognition challenge, Alex Krizhevsky's AlexNet [11] dominated that year's field. Deep Learning approaches have since then consistently beaten "classical" machine learning methods on just about everything. Most prominently, Deep Learning has become the standard solution for computer vision and language processing. In the world of AI, Deep Learning is now the law of the land.

Conclusion?

To summarize:

Artificial Intelligence describes any (artificial) system that takes (intelligent) actions based on its environment.
Machine Learning describes a subset of AI which uses past events (the "learning" dataset) to create a model of "how to react to new events" (the "decision rule").
Deep Learning describes a subset of Machine Learning where the model is learned in a "layered" manner, with simple rules learned from the raw data, which feed into more complex rules, which feed into more complex rules, etc, etc, until you get to more abstract concepts. To take a computer vision example: from pixels, to shapes and colors, to eyes and noses, to recognizable faces.

These definitions are fuzzy. The boundaries between Deep Learning and "non-deep" Machine Learning are unclear, as are sometimes the boundaries between Machine Learning and "old fashioned AI". That's fine: we don't need every method to fit into a well-defined box.

All right. Now that we have defined what Digital Pathology and Deep Learning are, the next question will be: how has Deep Learning been applied to Digital Pathology?

References

[1] Y. LeCun, Y. Bengio, G. Hinton (2015) "Deep Learning", Nature, 521, 436-444, DOI: 10.1038/nature14539
[2] I. Goodfellow, Y. Bengio, A. Courville (2016) "Deep Learning", MIT Press https://www.deeplearningbook.org/
[3] A. Turing (1950) "Computing machinery and intelligence", MIND, 59, 433-60, DOI: 10.1093/mind/LIX.236.433
[4] W. McCulloch, W. Pitts (1943) "A logical calculus of the ideas immanent in nervous activity", Bulletin of Mathematical Biophysics, 5, 115–133.
[5] D. Poole, A. Mackworth (2017) "Artificial Intelligence: foundations of computational agents (Second Edition)", available online
[6] E. Fix, J. Hodges (1951) "Discriminatory Analysis - Nonparametric Discrimination: Consistency Properties", USAF School of Aviation Medicine - University of California, Berkeley, available online
[7] D. Rumelhart, G. Hinton, R. Williams (1986) "Learning representations by back-propagating errors", Nature, 323, 533-536, DOI: 10.1038/323533a0
[8] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, et al. (1989) "Back-propagation applied to handwritten zip code recognition", Neural Computation, 1(4), 541-551, DOI: 10.1162/neco.1989.1.4.541
[9] Deep Blue (chess computer) on Wikipedia, seen on Feb 12th, 2020.
[10] D. Cireşan, U. Meier, L. Gambardella, J. Schmidhuber (2010) "Deep Big Simple Neural Nets Excel on Hand-written Digit Recognition", Neural Computation, 22(12), 3207-3220, DOI: 10.1162/NECO_a_00052
[11] A. Krizhevsky, I. Sutskever, G. Hinton (2012) "ImageNet Classification with Deep Convolutional Neural Networks", Communications of the ACM, 60(6), DOI: 10.1145/3065386

So... what's Digital Pathology?

adrien@adfoucart.be — 2020-02-17

Let's start from the beginning. When I started my thesis, the topic we settled on was "Deep Learning in Digital Pathology". It's vague - but that was kind of the point. Deep Learning and Digital Pathology were both recent trends at the time, so trying to look at what could be done with it in general seemed like a good idea.

We have two parts in "Deep Learning in Digital Pathology". The first one, Deep Learning, is where I have spent most of my thesis. That's the part that concerns me as a biomedical engineer specialized in image analysis, and where I can contribute the most. The second part, however, is just as important. It's the application, what we want to use Deep Learning for: Digital Pathology. I am certainly not an expert in Digital Pathology - and even less of an expert in not-digital histopathology - but understanding what the methods I'm going to develop may be used for does seem like a good idea, so let's briefly get into it.

1. Histopathology

The goal of histopathology is to examine human tissue (usually taken from a tumour or some other possibly diseased area) under a microscope, to formulate a diagnosis or to get a better understanding of a disease. The process, in short, is as follows:

Cut a small bit of tissue from someone's body (typically during surgery, or a biopsy)
"Fix" it in formalin (or freeze it) so the tissue keeps its shape, and then embed it in a paraffin block.
Cut the block in very thin slices using a microtome, which is basically a very small and precise meat slicer.
Stain the tissue to make whatever we are looking for more visible.
Have a trained pathologist examine the resulting slide under a microscope to get a diagnosis.

Why do we have to stain the tissue? Because cells are mostly water, water tends to be transparent [citation needed], and transparent things are hard to look at with visible light. Fortunately, some chemical pigments have properties which are very useful for pathologists. For instance, in the late 19th and early 20th century, it was discovered that we could use hematoxylin to stain the nuclei of cells in blue, and eosin to stain the cytoplasm in pink [1].

This produces "Hematoxylin & Eosin" - H&E - images like this one below, where the structure of the tissue is easy to analyse for the pathologist:

H&E stained image, from [2].

Slightly more recently, we realized that we could use the properties of antigens and antibodies to get some more specific staining. The idea is this: in our body, some proteins - antibodies - are designed to specifically bind to some target molecules - antigens - as part of the immune response. We can "hack" this process by binding a staining agent to an antibody and therefore "highlight", in the tissue, places where the related antigen is present. This method is called immunohistochemistry, or IHC. For instance, in the image below, we have the same part of the tissue stained with H&E on one side, and with an IHC marker (anti-pan-cytokeratin, to be precise) on the other. The IHC marker highlights the cells which are part of a tumour, which is rather useful information to have in histopathology.

(left) H&E versus (right) IHC-stained images, from colorectal cancer tissue.

2. Digital pathology

So where does the "digital" part fit in all this?

The problem with the process above is that it requires the trained pathologist to be physically in front of the microscope, with the slide in it. There are a number of drawbacks to this. One is that it's hard to get a second opinion from a specialist from somewhere else. Another is that comparing a tissue to, for instance, another sample taken some months or years before requires finding the physical slide in the archives of the hospital.

How do we solve that? With digital scanners. Very expensive, very high resolution, very precise digital scanners. The entire slide can be scanned at multiple levels of magnification to produce gigapixel images, which can be viewed on a computer. The pathologist can then access the image in a "virtual microscope".
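To give a rough idea of the scale involved, here is a small back-of-the-envelope sketch. The slide dimensions and the "divide by two at each level" rule are assumptions I picked for illustration: actual scanners and file formats each have their own conventions.

```python
def pyramid_levels(width, height, n_levels, factor=2):
    """Return the (width, height) in pixels of each magnification level.

    Level 0 is the full-resolution scan. Each following level is assumed
    to be downsampled by `factor` in both directions (hypothetical rule).
    """
    levels = []
    for level in range(n_levels):
        scale = factor ** level
        levels.append((width // scale, height // scale))
    return levels


# Hypothetical slide: 100,000 x 80,000 pixels at the highest magnification.
levels = pyramid_levels(100_000, 80_000, n_levels=4)

# Number of pixels at full resolution, in gigapixels.
gigapixels = levels[0][0] * levels[0][1] / 1e9

for level, (w, h) in enumerate(levels):
    print(f"level {level}: {w} x {h} pixels")
print(f"full resolution: {gigapixels} gigapixels")
```

With these made-up numbers, the full-resolution level alone is 8 gigapixels, which is why a viewer would typically load only the region and level currently on screen rather than the whole image.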

And once you start to have the slides as digital objects, you open the door to many possibilities, from tools to easily annotate the slides (for instance, for teaching purposes, or simply to quickly document the reasoning behind a diagnosis) to automated analysis of certain aspects of the tissue. In particular, some quantitative analyses are very hard to do for a human expert in an objective manner (like evaluating "what percentage of the tissue shows this marker?"), yet relatively easy (or at least possible) to do for a well-designed algorithm.
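As a toy illustration of the "what percentage of the tissue shows this marker?" question, here is a deliberately naive sketch. It assumes the marker shows up as brownish pixels, the rest of the tissue as bluish pixels, and the empty glass as near-white pixels. The function name and both thresholds are made up for the example, and a real tool would need something far more robust.

```python
def marker_percentage(pixels, background_threshold=220, brown_margin=30):
    """Percentage of tissue pixels that look marker-positive.

    `pixels` is an iterable of (r, g, b) tuples with values in 0-255.
    Both thresholds are arbitrary values chosen for this sketch.
    """
    tissue = 0
    positive = 0
    for r, g, b in pixels:
        # Near-white pixels are assumed to be empty glass, not tissue.
        if min(r, g, b) >= background_threshold:
            continue
        tissue += 1
        # Crude "brownish" test: red channel clearly above blue channel.
        if r - b >= brown_margin:
            positive += 1
    return 100.0 * positive / tissue if tissue else 0.0


# Tiny fake "image": 4 background, 1 brownish and 3 bluish pixels.
fake_pixels = [(255, 255, 255)] * 4 + [(150, 100, 60)] + [(80, 90, 160)] * 3
result = marker_percentage(fake_pixels)
print(result)  # 25.0
```

The algorithm is not smart, but it gives the same answer every time for the same image, which is exactly what is hard to get from a human estimating a percentage by eye.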

Digital image acquisition are becoming commonplace, and associated image analysis solutions are viewed by most as the next critical step in automated histological analysis.
Laoighse Mulrane [3]

In the more than 10 years since Mulrane's paper, there has indeed been a wide range of image analysis applications in digital pathology. And, in more recent years, as in most image analysis applications, one type of strategy seems to have very quickly surpassed all others: Deep Learning.

I guess that's a teaser for the "next episode"?

References

[1] M. Titford (2009), "Progress in the Development of Microscopical Techniques for Diagnostic Pathology", Journal of Histotechnology, 32 (1), pp. 9-19. DOI: 10.1179/his.2009.32.1.9
[2] K. Sirinukunwattana et al. (2017), "Gland segmentation in colon histology images: The glas challenge contest", Med. Image Anal., vol. 35, pp. 489-502. DOI: 10.1016/j.media.2016.08.008
[3] L. Mulrane et al. (2008), "Automated image analysis in histopathology: a valuable tool in medical diagnostics", Expert Rev. Mol. Diagn., 8 (6), pp. 707-725. DOI: 10.1586/14737159.8.6.707