Deep Learning in Digital Pathology

Research in the course of a PhD thesis, by Adrien Foucart.

There are several lists of digital pathology challenges and/or datasets floating around in different publications and, of course, on grand-challenge.org, but they never fit exactly what I'm looking for: they either miss some challenges or include some that I would consider slightly different modalities (such as cytology).

So here is the list I'm including in my thesis, as it may be useful to someone else. I included all challenges between 2010 and 2021 that I could find that used Whole Slide Images (WSIs) and/or image patches extracted from WSIs, either with H&E or IHC staining. I report here the reference to either the post-challenge publication if it exists, or the challenge website if it doesn't, and a (very short) description of the challenge's task(s).

Name, Year | Post-challenge publication or website | Task(s)
PR in HIMA, 2010 | Gurcan, 2010 [1] | Lymphocyte segmentation, centroblast detection.
MITOS, 2012 | Roux, 2013 [2] | Mitosis detection.
AMIDA, 2013 | Veta, 2015 [3] | Mitosis detection.
MITOS-ATYPIA, 2014 | Challenge website | Mitosis detection, nuclear atypia scoring.
Brain Tumour DP Challenge, 2014 | Challenge website | Necrosis region segmentation, glioblastoma multiforme / low grade glioma classification.
Segmentation of Nuclei (SNI) in DP Images, 2015 | Description in TCIA wiki | Nuclei segmentation.
BIOIMAGING, 2015 | Challenge website | Tumour classification.
GlaS, 2015 | Sirinukunwattana, 2017 [4] | Gland segmentation.
TUPAC, 2016 | Veta, 2019 [5] | Mitotic scoring, PAM50 scoring, mitosis detection.
CAMELYON, 2016 | Ehteshami Bejnordi, 2017 [6] | Metastases detection.
SNI, 2016 | Challenge website | Nuclei segmentation.
HER2, 2016 | Qaiser, 2018 [7] | HER2 scoring.
Tissue Microarray Analysis in Thyroid Cancer Diagnosis, 2017 | Wang, 2018 [8] | Prediction of BRAF gene mutation (classification), TNM stage (scoring), extension status (scoring), tumour size (regression), metastasis status (scoring).
CAMELYON, 2017 | Bandi, 2019 [9] | Tumour scoring (pN-stage) in lymph nodes.
SNI, 2017 | Vu, 2019 [10] | Nuclei segmentation.
SNI, 2018 | Kurc, 2020 [11] | Nuclei segmentation.
ICIAR BACH, 2018 | Aresta, 2019 [12] | Tumour type patch classification, tumour type segmentation.
MoNuSeg, 2018 | Kumar, 2020 [13] | Nuclei segmentation.
C-NMC, 2019 | Gupta, 2019 [14] | Normal/malignant cell classification.
BreastPathQ, 2019 | Petrick, 2021 [15] | Tumour cellularity assessment (regression).
PatchCamelyon, 2019 | Challenge website | Metastasis patch classification.
ACDC@LungHP, 2019 | Li, 2019 [16] | Lung carcinoma segmentation.
LYON, 2019 | Swiderska-Chadaj, 2019 [17] | Lymphocyte detection.
PAIP, 2019 | Kim, 2021 [18] | Tumour segmentation, viable tumour ratio estimation (regression).
Gleason, 2019 | Challenge website | Tumour scoring, Gleason pattern region segmentation.
DigestPath, 2019 | Zhu, 2021 [19] | Signet ring cell detection, lesion segmentation, benign/malignant tissue classification.
LYSTO, 2019 | Challenge website | Lymphocyte counting.
BCSS, 2019 | Amgad, 2019 [20] | Breast cancer regions semantic segmentation.
ANHIR, 2019 | Borovec, 2020 [21] | WSI registration.
HeroHE, 2020 | Conde-Sousa, 2021 [22] | HER2 scoring.
MoNuSAC, 2020 | Verma, 2021 [23] | Nuclei detection, segmentation and classification.
PANDA, 2020 | Bulten, 2022 [24] | Prostate cancer Gleason scoring.
PAIP, 2020 | Challenge website | Colorectal cancer MSI scoring and whole tumour area segmentation.
Seg-PC, 2021 | Challenge website | Multiple myeloma plasma cells segmentation.
PAIP, 2021 | Challenge website | Perineural invasion detection and segmentation.
NuCLS, 2021 | Amgad, 2021 [25] | Nuclei detection, segmentation and classification.
WSSS4LUAD, 2021 | Challenge website | Tissue semantic segmentation from weak, image-level annotations.
MIDOG, 2021 | Challenge website | Mitosis detection.

[1] M. N. Gurcan, A. Madabhushi, and N. Rajpoot, "Pattern Recognition in Histopathological Images: An ICPR 2010 Contest," in Lecture Notes in Computer Science, vol. 6388, 2010, pp. 226–234.

[2] L. Roux et al., "Mitosis detection in breast cancer histological images An ICPR 2012 contest," J. Pathol. Inform., vol. 4, no. 1, p. 8, 2013, doi: 10.4103/2153-3539.112693.

[3] M. Veta et al., "Assessment of algorithms for mitosis detection in breast cancer histopathology images," Med. Image Anal., vol. 20, no. 1, pp. 237–248, Feb. 2015, doi: 10.1016/j.media.2014.11.010.

[4] K. Sirinukunwattana et al., "Gland segmentation in colon histology images: The glas challenge contest," Med. Image Anal., vol. 35, pp. 489–502, 2017, doi: 10.1016/j.media.2016.08.008.

[5] M. Veta et al., "Predicting breast tumor proliferation from whole-slide images: The TUPAC16 challenge," Med. Image Anal., vol. 54, pp. 111–121, May 2019, doi: 10.1016/j.media.2019.02.012.

[6] B. Ehteshami Bejnordi et al., "Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer," JAMA, vol. 318, no. 22, p. 2199, Dec. 2017, doi: 10.1001/jama.2017.14585.

[7] T. Qaiser et al., "HER2 challenge contest: a detailed assessment of automated HER2 scoring algorithms in whole slide images of breast cancer tissues," Histopathology, vol. 72, no. 2, pp. 227–238, Jan. 2018, doi: 10.1111/his.13333.

[8] C.-W. Wang et al., "A benchmark for comparing precision medicine methods in thyroid cancer diagnosis using tissue microarrays," Bioinformatics, vol. 34, no. 10, pp. 1767–1773, May 2018, doi: 10.1093/bioinformatics/btx838.

[9] P. Bandi et al., "From Detection of Individual Metastases to Classification of Lymph Node Status at the Patient Level: The CAMELYON17 Challenge," IEEE Trans. Med. Imaging, vol. 38, no. 2, pp. 550–560, Feb. 2019, doi: 10.1109/TMI.2018.2867350.

[10] Q. D. Vu et al., "Methods for Segmentation and Classification of Digital Microscopy Tissue Images," Front. Bioeng. Biotechnol., vol. 7, Apr. 2019, doi: 10.3389/fbioe.2019.00053.

[11] T. Kurc et al., "Segmentation and Classification in Digital Pathology for Glioma Research: Challenges and Deep Learning Approaches," Front. Neurosci., vol. 14, Feb. 2020, doi: 10.3389/fnins.2020.00027.

[12] G. Aresta et al., "BACH: Grand challenge on breast cancer histology images," Med. Image Anal., vol. 56, pp. 122–139, 2019, doi: 10.1016/j.media.2019.05.010.

[13] N. Kumar et al., "A Multi-Organ Nucleus Segmentation Challenge," IEEE Trans. Med. Imaging, vol. 39, no. 5, pp. 1380–1391, 2020, doi: 10.1109/TMI.2019.2947628.

[14] A. Gupta and R. Gupta, Eds., ISBI 2019 C-NMC Challenge: Classification in Cancer Cell Imaging. Singapore: Springer Singapore, 2019.

[15] N. Petrick et al., "SPIE-AAPM-NCI BreastPathQ challenge: an image analysis challenge for quantitative tumor cellularity assessment in breast cancer histology images following neoadjuvant treatment," J. Med. Imaging, vol. 8, no. 03, May 2021, doi: 10.1117/1.JMI.8.3.034501.

[16] Z. Li et al., "Deep Learning Methods for Lung Cancer Segmentation in Whole-Slide Histopathology Images - The ACDC@LungHP Challenge 2019," IEEE J. Biomed. Heal. Informatics, vol. 25, no. 2, pp. 429–440, 2021, doi: 10.1109/JBHI.2020.3039741.

[17] Z. Swiderska-Chadaj et al., "Learning to detect lymphocytes in immunohistochemistry with deep learning," Med. Image Anal., vol. 58, p. 101547, Dec. 2019, doi: 10.1016/j.media.2019.101547.

[18] Y. J. Kim et al., "PAIP 2019: Liver cancer segmentation challenge," Med. Image Anal., vol. 67, p. 101854, 2021, doi: 10.1016/j.media.2020.101854.

[19] C. Zhu et al., "Multi-level colonoscopy malignant tissue detection with adversarial CAC-UNet," Neurocomputing, vol. 438, pp. 165–183, May 2021, doi: 10.1016/j.neucom.2020.04.154.

[20] M. Amgad et al., "Structured crowdsourcing enables convolutional segmentation of histology images," Bioinformatics, vol. 35, no. 18, pp. 3461–3467, 2019, doi: 10.1093/bioinformatics/btz083.

[21] J. Borovec et al., "ANHIR: Automatic Non-Rigid Histological Image Registration Challenge," IEEE Trans. Med. Imaging, vol. 39, no. 10, pp. 3042–3052, Oct. 2020, doi: 10.1109/TMI.2020.2986331.

[22] E. Conde-Sousa et al., "HEROHE Challenge: assessing HER2 status in breast cancer without immunohistochemistry or in situ hybridization," Nov. 2021.

[23] R. Verma et al., "MoNuSAC2020: A Multi-Organ Nuclei Segmentation and Classification Challenge," IEEE Trans. Med. Imaging, vol. 40, no. 12, pp. 3413–3423, Dec. 2021, doi: 10.1109/TMI.2021.3085712.

[24] W. Bulten et al., "Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge," Nat. Med., vol. 28, no. 1, pp. 154–163, Jan. 2022, doi: 10.1038/s41591-021-01620-2.

[25] M. Amgad et al., "NuCLS: A scalable crowdsourcing, deep learning approach and dataset for nucleus classification, localization and segmentation," 2021.

Our comment article Comments on "MoNuSAC2020: A Multi-Organ Nuclei Segmentation and Classification Challenge" was just published in the April 2022 issue of the IEEE Transactions on Medical Imaging, alongside the authors' reply. The whole story is, in my opinion, a very interesting example of things that can go wrong with digital pathology challenges, and of some weaknesses of the scientific publication industry. But let's start from the beginning...

MoNuSAC 2020 Banner

Summary of previous events

MoNuSAC was a nuclei detection, segmentation and classification challenge hosted at the ISBI 2020 conference. The challenge results were posted online, and a post-challenge paper was published in the IEEE Transactions on Medical Imaging [1] (available in "Early Access" on June 4th, 2021, and published in December 2021). The challenge organisers also published the ground truth annotations for the test set, "colour-coded" predictions on the test set for four of the top five teams (with the link for the first-place team actually pointing to the predictions of the second-place team), and the evaluation code used to score the participants' submissions.

In September 2021, as I was working on an analysis of the Panoptic Quality metric [2] used in the challenge, I discovered a bug in the evaluation code and alerted the organisers on September 20th, 2021. They initially replied that "[t]he code to compute PQ is correct". After verification, I confirmed the error and sent a Jupyter Notebook demonstrating it in action.

As I was trying to assess the potential effects of the error on the challenge results, I discovered several additional issues in the evaluation process. On September 22, 2021, I sent a detailed report to the organisers explaining the different problems and offering to "collaborate with [them] in making a correction to the published results". I received no response.

On October 12th, 2021, my PhD supervisor Prof. Decaestecker contacted the managing editor of the IEEE Transactions on Medical Imaging to ask what the procedure was for submitting "comment articles" to the journal, as it seemed to be the preferred method in IEEE journals for reporting potential errors in published papers. After some back-and-forth to clarify the procedure, we submitted our comment article on October 20th, 2021. The comment article was transmitted to the original authors. On February 23rd, 2022, we received notice from the editor-in-chief that our comment article was accepted. Both our comments and a response by the original authors were finally published in the April 2022 issue of the journal [3, 4]... which brings us back to the present.

Claims and responses

Four main issues were raised in our original comment. In this section, I would like to go through our claims, the response from Verma et al., and my thoughts on that response. The four issues that we raised are:

  • There was a typo in the code used to compute the PQ metric, which led to False Positives being incorrectly added to the count.
  • There was a confusion between two classes in the detailed per-organ and per-class results, where Macrophage results were reported as Neutrophils and vice-versa.
  • Because of the way the code processed prediction files, when there were no ground truth objects of a given class for an image, the predicted objects of that class were not taken into account as False Positives.
  • The aggregation process of the metric was implemented by computing the PQ metric on each "image patch" and then computing the mean, while the published methodology implied that the aggregation would take place at the "patient" level (which would make a lot more sense given the disparity of image sizes and numbers of objects).

Let's take them one by one.

Computation of the PQ

This one is very straightforward. In their reply, the authors recognised the error and recomputed the results after correction. Fortunately, they find that "the impact of fixing this bug is small". The increase of 0.3%-0.5% in the overall PQ for most teams with the corrected code [4, Table I] is in line with our own experiments. They also find, as we suspected, that some teams were disproportionately affected (the team ranked 8th moves up three ranks to 5th with a 16% improvement in PQ, and the team ranked 13th moves up to 10th with a 16.5% improvement): this is because the error is particularly strong if the object labels of the "ground truth" and the "predictions" are very different (as in: using different ranges of numbers), so it's likely that these teams used a different labelling process (full explanation in our comments [3] or in our Notebook on GitHub).
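As a reminder of what is being computed, here is a minimal sketch of a per-class PQ computation in Python, assuming two instance label maps where 0 is the background and each positive integer identifies one object. This is only an illustration of the metric as defined in [5], not the challenge's evaluation code; the names and structure are mine.

    import numpy as np

    def panoptic_quality(gt: np.ndarray, pred: np.ndarray) -> float:
        """Per-class PQ for two instance label maps (0 = background)."""
        gt_ids = [i for i in np.unique(gt) if i != 0]
        pred_ids = [i for i in np.unique(pred) if i != 0]

        matched_pred = set()
        tp, iou_sum = 0, 0.0
        for g in gt_ids:
            g_mask = gt == g
            best_iou, best_p = 0.0, None
            for p in pred_ids:
                if p in matched_pred:
                    continue
                p_mask = pred == p
                inter = np.logical_and(g_mask, p_mask).sum()
                union = np.logical_or(g_mask, p_mask).sum()
                iou = inter / union if union > 0 else 0.0
                if iou > best_iou:
                    best_iou, best_p = iou, p
            if best_iou > 0.5:  # an IoU > 0.5 match is unique by construction
                tp += 1
                iou_sum += best_iou
                matched_pred.add(best_p)

        fp = len(pred_ids) - len(matched_pred)  # unmatched predictions
        fn = len(gt_ids) - tp                   # unmatched ground truth objects
        denom = tp + 0.5 * fp + 0.5 * fn
        return iou_sum / denom if denom > 0 else float("nan")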

Class confusion

Here, the authors also admit the error, and issue an amended table for the supplementary materials, so I have no further comments to make.

Missing false positives

In response to our third claim, the authors argue that it comes down to a matter of methodological choice. Quoting from their reply [4, II.C]:

[T]he false positive for one class will be false negative for another class in an mutually exclusive and exhaustive multi-class multi-instance detection problem. We did not want to double count an error, and therefore our loops for error counting run over the ground truth objects. The interpretation of positives and negatives in multi-class problems is a matter interpretation until settled, and this leads to multiple ways of computing the PQ metric for multi-class problems.

I disagree with this for several reasons. To better explain what's going on (the problem, the authors' response, and why I think it's wrong), it will be easier to look at a fabricated example. In the figure below, we have a "ground truth" and a "prediction", where colours correspond to classes, numbers to ground truth instance labels, and letters to predicted instance labels.

Fabricated example of missing false positive case

The submission format requires each class's labels to be in a separate file, so in such a situation we would have two "ground truth" label maps (let's call them GT_Blue and GT_Green) and three "prediction" label maps (P_Blue, P_Green and P_White), each containing the labelled mask for one specific class. A pixel cannot belong to two different classes (although there is no mechanism in the code to verify that the submission is valid in this regard).

In the current implementation of the evaluation code, these files would be processed in the following way:

    Open GT_Blue and get label map.
    Open P_Blue to get corresponding predictions.
    Compute PQ_Blue based on GT_Blue and P_Blue. 
        -> 2 True Positives (1-A, 4-D) and 1 False Positive (C)
    Open GT_Green and get label map.
    Open P_Green to get corresponding predictions.
    Compute PQ_Green. 
        -> 1 True Positive (5-E) and 2 False Negatives (2, 3)
    Stop.

So P_White is never opened, and the False Positive "B" will never be counted. If we look at the full confusion matrix for this image, we should have (with rows corresponding to "ground truth" classes and columns to "predicted" classes):

            No obj   Blue   Green   White
No obj           0      0       0       1
Blue             0      2       0       0
Green            1      1       1       0
White            0      0       0       0

But with the evaluation code as it is written, we end up with:

            No obj   Blue   Green   White
No obj           0      0       0       0
Blue             0      2       0       0
Green            1      1       1       0
White            0      0       0       0

Coming back now to the authors' reply, the statement that "the false positive for one class will be false negative for another" is incorrect... unless you count the background "no object" class, which we shouldn't, as the PQ is not computed for the background class. In our fabricated example, the false positive will not be counted at all.

What if the white "B" object had been matched with, for instance, the green "2" object, which at the moment is a "Green False Negative"? Then the "2-B" match would be both a "Green FN" and a "White FP". The idea that, in such a case, it should only be counted once would, however, be an important deviation from the PQ metric as defined in [5], and it should then be clearly explained in the methods.

As the metric is computed independently for each class, the misclassification should clearly impact both the "Green" class metric and the "White" class metric. In fact, it is the case for other misclassifications, such as the "3-C" match in our fabricated example, which is counted both as a "Green FN" and a "Blue FP". The only time such misclassifications are not counted "twice" in the evaluation code is if there is no ground truth instance of that class in that particular image patch.
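To make the distinction concrete, here is a small sketch contrasting the two behaviours, assuming the per-class instance label maps are stored in dictionaries and reusing the panoptic_quality function from the sketch above. The names and data layout are illustrative, not taken from the challenge's code.

    import numpy as np

    def per_class_pq_gt_classes_only(gt_maps: dict, pred_maps: dict) -> dict:
        # Mirrors the behaviour described above: a class with no ground truth
        # object in this patch is never evaluated, so its false positives
        # (like the white "B" object in the fabricated example) are dropped.
        shape = next(iter(gt_maps.values())).shape
        empty = np.zeros(shape, dtype=int)
        return {cls: panoptic_quality(gt_maps[cls], pred_maps.get(cls, empty))
                for cls in gt_maps}

    def per_class_pq_all_classes(gt_maps: dict, pred_maps: dict) -> dict:
        # Evaluates every class present in either the ground truth or the
        # predictions, so a spurious prediction of an absent class counts as
        # a false positive (and drives that class's PQ to 0 on this patch).
        shape = next(iter(gt_maps.values())).shape
        empty = np.zeros(shape, dtype=int)
        return {cls: panoptic_quality(gt_maps.get(cls, empty),
                                      pred_maps.get(cls, empty))
                for cls in set(gt_maps) | set(pred_maps)}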

This is not a matter of interpretation in this case: it is clearly a mistake. The effect of that mistake on the metric can also be very large. In our experiment on the teams' released "colour-coded" predictions, correcting this mistake led to a drop of 13-15% in the computed PQ. For instance, looking at the SJTU_426 team, we have a 0.4% increase in the PQ when correcting the PQ computation typo (same as the challenge organisers), but a massive 14% drop in PQ when we add the missed False Positives:

SJTU_426        Challenge eval.   Corr. Typo   Corr. FP   Corr. FP + Typo
Verma et al.    0.579             0.618        -          -
Recomputed      0.554             0.594        0.424      0.454

Our replication of the results is based on the "colour-coded" prediction maps. They will be slightly different from the original submissions, but should be almost identical in terms of detection performance, which is what changes between those versions of the evaluation code. So while the absolute values are not exactly the same, the deltas between the different error corrections should be in the right range.
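For reference, that reconstruction can be done along these lines, assuming one fixed RGB colour per class. The colour values below are placeholders (not necessarily the ones used in the released visualisations), and touching objects of the same class get merged, which is exactly why the recomputed PQ can only approximate the original submissions.

    import numpy as np
    from skimage.measure import label

    # Hypothetical colour -> class mapping; the actual colours used in the
    # released visualisations may differ.
    CLASS_COLOURS = {
        "Epithelial": (255, 0, 0),
        "Lymphocyte": (255, 255, 0),
        "Macrophage": (0, 255, 0),
        "Neutrophil": (0, 0, 255),
    }

    def colour_coded_to_label_maps(rgb: np.ndarray) -> dict:
        """rgb: HxWx3 uint8 image -> {class name: instance label map}."""
        label_maps = {}
        for cls, colour in CLASS_COLOURS.items():
            mask = np.all(rgb == np.array(colour, dtype=np.uint8), axis=-1)
            # Connected components stand in for the original instance labels.
            label_maps[cls] = label(mask)
        return label_maps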

There is an actual question of interpretation on how exactly those specific false positives should be counted. In our fabricated example, if there was no "White" prediction, then PQ_White would be undefined and wouldn't be counted in the average, but with a single false positive, we immediately get a PQ of 0 (hence the large impact on the metric). The solution to that problem, however, isn't to just remove the false positives... but to correct the last problem that we have to address: the aggregation process.

Aggregation: patient vs patch

There are two parts to the authors' reply to our concerns about the aggregation process. The first part is, again, to argue that aggregating "per-patient" (i.e. computing the PQ directly at the patient level and then averaging over the 25 patients of the test set) or "per-patch" (computing the PQ at the image patch level and then averaging over the 101 image patches of the test set) is a methodological choice. The second is to state that it was clear in their methodology that they chose the second option, so it's not an error on their part that requires a correction, but rather a choice that may be discussed in future works.

Concerning the latter, there may indeed have been some confusion which, as they admit, comes from a typing error in their manuscript (where they mention "25 test images" and aggregating "25 PQ_i scores", which correspond to the number of patients and not the number of patches). This led me to the impression that there was simply some inconsistency in their naming convention throughout the paper, where "image" sometimes referred to a WSI and sometimes to an image patch.

Part of the confusion also comes from their relatively consistent use of the term "sub-image" in the pre-challenge description of the dataset to refer to the image patches. I was regularly consulting that document as well during my analysis, so it made sense to me that "image" would in general refer to "Whole-Slide Image" and "sub-image" to "image patch". As the information in the supplementary materials shows that there is only one WSI per patient, this interpretation seemed reasonable.

I can certainly believe that it wasn't their intent, however, and that it was very clear for them (and possibly for the challenge participants) that the PQ was always meant to be computed per-patch in the challenge. So the question becomes: is that a valid methodological choice?

First: what is the problem with computing the PQ "per-patch"? The difference in size between the different patches is enormous, with the smallest image in the dataset having a size of 86x33px, and the largest a size of 1760x1771px, more than 1,000 times larger in number of pixels! The difference in terms of number of ground truth objects is also very large, ranging from 2 to 861 per patch. Even looking at objects of the least frequent class, we can have anywhere between 1 and 15 instances in patches where the class is represented. This means that the cost of making an error on some image patches is orders of magnitude higher than on others.

The better option that I identified in our comment article was to compute the PQ per-patient. This vastly decreases the size variability, and it also makes sense for a biomedical dataset. Patches taken from the same patient (and particularly, as is the case here, from the same WSI) may share some common properties, so they can't really be considered independent samples. It also increases the number of ground truth objects at the moment the metric is computed, which avoids the effect of a single False Positive for a class absent from the ground truth immediately setting the PQ to 0, as in our previous example. The distribution of ground truth objects is more likely to be representative on a larger sample, and the PQ will therefore be less "noisy". Being able to estimate whether a particular algorithm's performance is steady over multiple patients is also very valuable information, so it makes sense to choose that over the other option identified in the authors' reply, which is to compute the PQ just once, at the level of the entire dataset.
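To make the difference between the two aggregation schemes concrete, here is a sketch assuming that the raw matching statistics (TP, FP, FN and summed IoU) were kept for each image patch of a given class. The names and data layout are mine, not the challenge's.

    import math
    from collections import defaultdict
    from statistics import mean

    def pq_from_counts(tp: int, fp: int, fn: int, iou_sum: float) -> float:
        denom = tp + 0.5 * fp + 0.5 * fn
        return iou_sum / denom if denom > 0 else float("nan")

    def aggregate_per_patch(stats):
        # stats: list of (patient_id, tp, fp, fn, iou_sum) tuples, one per patch.
        # PQ is computed on each patch, then averaged: every patch weighs the
        # same, whatever its size or number of objects.
        pqs = [pq_from_counts(tp, fp, fn, iou) for _, tp, fp, fn, iou in stats]
        return mean(pq for pq in pqs if not math.isnan(pq))

    def aggregate_per_patient(stats):
        # Matching statistics are pooled over all patches of a patient, PQ is
        # computed once per patient, then averaged over the patients.
        pooled = defaultdict(lambda: [0, 0, 0, 0.0])
        for patient_id, tp, fp, fn, iou in stats:
            acc = pooled[patient_id]
            acc[0] += tp; acc[1] += fp; acc[2] += fn; acc[3] += iou
        pqs = [pq_from_counts(*acc) for acc in pooled.values()]
        return mean(pq for pq in pqs if not math.isnan(pq))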

The main drawback that the authors identify to the "one PQ on the whole dataset" approach in their reply is that "this approach does not allow for a robust computation of confidence intervals". I'm very unconvinced by this line of reasoning. First of all, it doesn't really say anything about the per-patient proposition (which does allow for confidence intervals, although obviously with fewer samples these will be larger). Second, there is the aforementioned problem of the samples' independence. Confidence intervals should really be measured at the patient level anyway, as the patients are, in the end, the "samples" that we are studying.

If this is a methodological question, I don't really think it's as open as the authors' reply would suggest. In this case, it seems very clear that the per-patient option is better, and I would even say that, given the scale of the difference in weight given to the errors, the "per-patch" option is in this case incorrect. It could be justified when the extracted patches are of very similar sizes, but not in this particular dataset.

Last thoughts on this whole story

The corrected results published by Verma et al. in [4] are, in my opinion, still incorrect. The correct ranking of the challenge is, therefore, unknown, and we shouldn't draw any conclusions on the merits of the different methods proposed by the participants. I doubt that the ranking would change by a lot if all corrections were implemented, but as we have seen with the first correction, it's possible that some participants were affected by some of the errors more than others.

While I'm disappointed by the unwillingness of the organisers to work directly with us on correcting the problems, I would like to emphasise that I don't think they did a bad job overall with the challenge. In fact, as I've repeatedly highlighted, their level of transparency on their data and code is higher than that of any other digital pathology segmentation challenge that I've seen. That's what interested me in their data in the first place, and what allowed us to perform our study on the PQ metric [2].

Organising a successful international competition and compiling such a dataset is a very big achievement, whatever problems may remain in the results. This whole saga also highlights how important it is to increase the transparency in the reporting of challenge results in general. Without the teams' predictions and the evaluation code, results are unverifiable and unreplicable. MoNuSAC really went in the right direction here, although they fell short of releasing all participants' results, and released a "visualisation" instead of the raw prediction maps.

What we can see in every challenge that we analyse is that mistakes happen all the time: in the dataset production, in the evaluation methodology, in the code... Challenges are extremely costly and time-consuming to organise, and they use up a lot of resources (from the organisers, but also from the participants). We should really try to make sure that the results that we get from them are accurate. If challenge organisers go for a fully transparent approach to the evaluation process, then the responsibility of checking the validity of the results becomes shared by the whole community, and our trust in these results is improved.

There are also some things in the process with the journal which I think could be improved. First of all, given the nature of the error we found, I think that some form of notice should have been quickly added to the original publication on the journal's website. Between October 2021 and February 2022, there was reasonable doubt about the validity of the results and, since February 2022 and the acceptance of the authors' reply, it is certain that the original results are incorrect.

Even now that the comment and response have been published, there is still no link to them from the original publication, meaning that the original results are likely to be cited by readers who do not notice that they have been amended, or that there are additional concerns beyond what was corrected.

I do think that the journal's Editor-in-Chief and Managing Editor were responsive and tried to do things right. The fact that they had to navigate a process that seemed obscure to everyone is worrying, however. In many ways, "peer review" starts when the article is published and, therefore, actually available to more than a couple of "peers". Corrections and retractions should be a natural part of the publishing process. Mistakes happen. They are only a problem if they can't be detected and corrected.

References

Last year, I did a video tutorial for the INFO-H-501 "Pattern Recognition and Image Analysis" course at the Université Libre de Bruxelles. I apparently forgot to also put it here, so here we are...

The video contains a bit of theory on how to approach a segmentation problem with "Deep Learning", and a guide to a working implementation of the pipeline using TensorFlow 2.

The dataset used in the video is the publicly released GlaS challenge dataset. All the code can be found here on GitHub: https://github.com/adfoucart/dlia-videos.
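As a taste of what the video covers, here is a minimal sketch of a patch-based segmentation model in TensorFlow 2 / Keras. It is not the tutorial's actual code: the patch size and the (very shallow) encoder-decoder are placeholder choices, and the data loading from the GlaS images is left out.

    import tensorflow as tf

    PATCH_SIZE = 256  # assumed training patch size

    def build_model() -> tf.keras.Model:
        # A tiny encoder-decoder ("U-Net-like") network for binary
        # gland / background segmentation.
        inputs = tf.keras.Input(shape=(PATCH_SIZE, PATCH_SIZE, 3))
        x = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
        skip = x
        x = tf.keras.layers.MaxPooling2D()(x)
        x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.UpSampling2D()(x)
        x = tf.keras.layers.Concatenate()([x, skip])
        x = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(x)
        outputs = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(x)
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="binary_crossentropy")
        return model

    # model.fit(...) would then be called on (image patch, mask patch) pairs,
    # e.g. built as a tf.data.Dataset from the GlaS training images.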

We (...) have an unusual situation that we wish to search for objects present in low concentration, the quantity of which is irrelevant, and the final classification of which is not agreed by different experts. At this point the engineer may well begin to feel despair. (A.I. Spriggs, 1969 [1])

This quote comes from a 1969 publication entitled "Automatic scanning for cervical smears". This paper is interesting for several reasons:

  1. It shows that "computer-assisted pathology" is not a new idea.
  2. It identifies difficulties in designing such systems which are still very relevant today.

Specifically, Spriggs worries that the "Carcinoma in situ" which they are trying to detect "is not a clear-cut entity", with "a whole spectrum of changes" where "nobody knows where to draw the line and when to call the lesion definitely precancerous". To sum up: "We therefore do not really know which cases we wish to find". Moreover, "the opinions of different observers also vary".

On that latter point, Spriggs also notes that the classifications or gradings used by pathologists are often more "degrees of confidence felt by the observer" than measurable properties of the cells. "It is therefore nonsensical to specify for a machine that it should identify these classes." The only reason to do it is that we have to measure the performance of the machine against the opinion of the expert. This is slightly less of a problem today, as grading systems have evolved and become more focused on quantifiable measurements, but they still often allow for large margins of interpretation. As we match our systems against these gradings, we constrain them to "mimic" the reasoning of the pathologists instead of focusing on the underlying problem of finding relationships between the images and the evolution of the disease itself.

"Digital pathology" really started to appear in the scientific literature around the year 2000. At that time, the focus was on telepathology or virtual microscopy: allowing pathologists to move away from the microscope, to more easily share images for second opinions, and to better integrate the image information with the rest of the patient's record [2, 3, 4].

Digital pathology also relates to image analysis. The 2014 Springer book "Digital Pathology" [5], for instance, includes in its definition not only the acquisition of the specimens "in digital form", but also "their real-time evaluation", "data mining", and the "development of artificial intelligence tools".

The terminology may be recent, but the core idea behind it (linking an acquisition device to a computer to automate the analysis of a pathology sample) is about as old as computers. One of the earliest documented attempts may be the Cytoanalyzer... in the 1950s.

Fig. 1 - LIFE Magazine article on the "Cytoanalyzer", April 25th, 1955 (Google Books).

Two paragraphs in a 1955 issue of LIFE magazine, sandwiched between ads for mattresses and cars, present an "electronic gadget" which "will spot abnormal cells (...) in about a minute, saving time-consuming specialized analysis for every case". A more comprehensive description of the prototype was given by Walter E. Tolles in the Transactions of the New York Academy of Sciences [6].

The Cytoanalyzer (see Fig. 2) had three units: the power supply and computer (left), the scanner (middle) and the oscilloscopes for monitoring and presentation (right). The scanner converted the "density field" of the slide into an electric current, which was used to analyse the properties of the cells and discriminate between normal and abnormal ones.

Fig. 2 - 1955 version of the Cytoanalyzer, from [6].

Two clinical trials of the Cytoanalyzer were conducted, in 1958-59 and in 1959-60. An analysis of the results by Spencer & Bostrom in 1962 [7] judged them "inadequate for practical application of the instrument".

Decades passed. Hardware and software vastly improved, yet even as new methods in image analysis and artificial intelligence got better at solving tasks related to pathology, they still fell short of the strict requirements of an automated system for diagnosis. In 2014, Hamilton et al. wrote that "even the most state of the art AI systems failed to significantly change practice in pathology" [8]. The problem is that medical diagnosis is generally made by integrating information from multiple sources: images from different modalities, expression of the symptoms by the patient, records of their medical history... AI systems can be very successful at relatively simpler sub-tasks (finding nuclei, delineating glands, grading morphological patterns in a region...), but they are just unable, at this point, to get the "big picture". Not to mention, of course, all the thorny obstacles to widespread adoption: trust in the system, regulatory issues, insurance and liability issues, etc.

More than 65 years after the Cytoanalyzer, routine use of AI in clinical practice for pathology appears to be very close... but we're still not there yet. The performance of deep learning algorithms, combined with the widespread use of whole-slide scanners producing high-resolution digital slides, makes the field of computer-assisted histopathology a very active and optimistic one at the moment. Still, even with the excellent results of Google Health's breast cancer screening system in clinical studies [9], it's not clear that automated systems are ready for real practice.

The difficulties of our algorithms are in large part the same that were identified by Spriggs in 1969: it is difficult or impossible to get an "objective" assessment of pathological slides and, even with modern grading systems, inter-expert disagreement is high. This makes training and evaluating algorithms more difficult, and when dealing with a subject as sensitive as healthcare, any result short of near perfection will have a hard time getting adopted by the medical community... and by the patients.

References

  1. A.I. Spriggs, "Automatic scanning for cervical smears", J. Clin. Path. 22, suppl. (Coll. Path.), 3, 1-6 (1969). doi:10.1136/jcp.s2-3.1.1
  2. J.H. Saltz, "Digital pathology - The big picture", Human Pathology 31(7), pp. 779-780 (2000). doi:10.1053/hupa.2000.9748
  3. Barbareschi, M., Demichelis, F., Forti, S. & Dalla Palma, P. Digital pathology: Science fiction? Int. J. Surg. Pathol. 8, 261–263 (2000). doi:10.1177/106689690000800401
  4. May, M. A better lens on disease. Sci. Am. 302, 74–77 (2010). doi:10.1038/scientificamerican0510-74
  5. Sucaet, Y. & Waelput, W. Digital Pathology. (Springer, 2014). doi:10.1159/isbn.978-3-318-05846-8
  6. Tolles, W. E. Section of Biology: The Cytoanalyzer - an example of physics in medical research. Trans. N. Y. Acad. Sci. 17, 250–256 (1955). doi:10.1111/j.2164-0947.1955.tb01204.x
  7. Spencer, C. C. & Bostrom, R. C. Performance of the cytoanalyzer in recent clinical trials. J. Natl. Cancer Inst. 29, 267–276 (1962). doi:10.1093/jnci/29.2.267
  8. Hamilton, P. W. et al. Digital pathology and image analysis in tissue biomarker research. Methods 70, 59–73 (2014). doi:10.1016/j.ymeth.2014.06.015
  9. McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020). doi:10.1038/s41586-019-1799-6