This dissertation studies how the reality of digital pathology annotations affects modern image analysis algorithms, as well as the evaluation processes that we use to determine which algorithms are better. In the ideal supervised learning scenario, we have access to a “ground truth”: the output that we want from the algorithms. This “ground truth” is assumed to be unique, and algorithms are typically evaluated by comparing their actual output against it. In the world of biomedical imaging, and more specifically in digital pathology, the reality is very different from this ideal scenario. Image analysis tasks in digital pathology aim to replicate assessments made by highly trained experts, and these assessments can be complex and difficult, and therefore come with varying levels of subjectivity. As a result, the annotations provided by these experts (and typically treated as “ground truth” in the training and evaluation of deep learning algorithms) are necessarily associated with some uncertainty. Our work focuses on different aspects of the impact of this uncertainty, as detailed below. (Note: the table has been updated from the official dissertation, as some publications were not yet accepted at the time the thesis was submitted.)
Table 0.1. Publications included in this dissertation.

|Title and journal/conference|
|---|
|Artifact identification in digital pathology from weak and noisy supervision with deep residual networks. 4th International Conference on Cloud Computing Technologies and Applications (CloudTech).|
|SNOW: Semi-supervised, Noisy and/or weak data for deep learning in digital pathology. 16th International Symposium on Biomedical Imaging (ISBI).|
|Strategies to Reduce the Expert Supervision Required for Deep Learning-Based Segmentation of Histopathological Images. Frontiers in Medicine (as second author).|
|SNOW supervision in digital pathology: managing imperfect annotations for segmentation in deep learning. Preprint on ResearchSquare.|
|Processing multi-expert annotations in digital pathology: a study of the Gleason2019 challenge. 17th International Symposium on Medical Information Processing and Analysis (SIPAIM).|
|Comments on “Monusac2020: A Multi-Organ Nuclei Segmentation and Classification Challenge.” IEEE Transactions on Medical Imaging.|
|Evaluating participating methods in image analysis challenges: lessons from MoNuSAC 2020. Pattern Recognition.|
|Shortcomings and areas for improvement in digital pathology image segmentation challenges. Computerized Medical Imaging and Graphics.|
|Panoptic quality should be avoided as a metric for assessing cell nuclei segmentation and classification in digital pathology. Scientific Reports.|
In the rest of this introduction, we summarize the context of digital pathology image analysis, we explain how our publications (listed in Table 0.1) contribute to the domain, and finally we explain the structure of the dissertation and the main contributions of our work.
In the 1960s, the Cytophotometric Data Conversion (CYDAC) system was developed to acquire microscopic images of cells onto magnetic tapes. Based on those images, Prewitt and Mendelsohn explored the possibility of developing computer vision algorithms to automatically differentiate between four cell types. Their 1966 method, illustrated in Figure 0.1, relied on the extraction of optical density features from the images. Samples from the four cell types could then be projected onto a two-dimensional feature space, which could be separated into four corresponding quadrants. Based on their results, Prewitt and Mendelsohn appeared cautiously optimistic about the prospects of image analysis for contributing to automated microscopic diagnosis, writing [18, p. 1052]:
This preliminary success encourages us to try the method on larger and more inclusive populations, but future research alone will tell whether this approach will yield the discriminatory power to master the normal blood smear and to go beyond it.
Technology steadily improved in the following years: better acquisition devices, better storage mechanisms, improved computer vision algorithms. Yet the basic premise of Prewitt and Mendelsohn’s approach remained relevant well into the beginning of the 21st century. A 2007 method by Doyle et al. for the grading of prostate cancer, for instance, follows a similar overall structure, as shown in Figure 0.2. The images are larger, in colour, and of better quality. Instead of two features, they have more than 100. Finally, instead of a manual separation of the feature space into four quadrants, they use a Support Vector Machine (SVM) classifier to discriminate between four cancer types.
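The core idea behind the quadrant-based approach can be sketched in a few lines. This is purely illustrative, not the original 1966 or 2007 code: the threshold values, feature names, and the mapping of quadrants to cell types are all hypothetical stand-ins.

```python
# Illustrative sketch of quadrant-based classification in a 2D feature
# space, in the spirit of Prewitt and Mendelsohn's approach. Thresholds,
# feature names, and the quadrant-to-class mapping are hypothetical.

THRESH_X = 0.5  # hypothetical threshold on the first optical-density feature
THRESH_Y = 0.5  # hypothetical threshold on the second optical-density feature

QUADRANT_LABELS = {
    (False, False): "cell type 1",
    (False, True): "cell type 2",
    (True, False): "cell type 3",
    (True, True): "cell type 4",
}  # hypothetical assignment of the four quadrants to the four cell types

def classify(feat_x: float, feat_y: float) -> str:
    """Assign a class based on which quadrant the feature vector falls into."""
    return QUADRANT_LABELS[(feat_x > THRESH_X, feat_y > THRESH_Y)]

print(classify(0.2, 0.8))  # "cell type 2"
```

The later SVM-based approach replaces the two hand-chosen thresholds with a decision boundary learned from labelled samples in a much higher-dimensional feature space, but the overall structure (features in, class label out) is the same.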
The 2009 review of “histopathological image analysis” by Gurcan et al. came at this critical time in the history of computer vision, at the peak of what is now often considered the “traditional approach”: typically, pre-processing, extraction of “handcrafted features,” dimensionality reduction or feature selection, and finally classification with, for instance, SVMs or decision trees. In just a few years, this “traditional approach” would be largely replaced by the “deep learning” approach. Where the traditional approach can generally be thought of as “expert driven,” deep learning is primarily “raw-data driven”: instead of explicitly defining the processing steps and the features to be extracted, an algorithm derives those steps from the data themselves. Making a “raw-data driven” approach work, however, unsurprisingly requires a large amount of data. The transition from traditional computer vision to deep learning is therefore intimately linked to another transition: from “traditional” to “digital” pathology.
Data acquisition in “traditional” pathology required manually taking pictures of regions of interest, a time-consuming process. The lack of data storage (and of bandwidth for transmission) also severely limited how many images could reasonably be used for image analysis research. Whole-slide scanners, able to digitize high-resolution microscopy images relatively easily, were initially developed in the late 1990s and became widely adopted in the late 2000s. Combined with lower prices for data storage and increased availability of high-bandwidth internet access, they created the opportunity for raw-data driven approaches to become viable. Alongside this increased access to data came an increased access to computing power, and particularly to parallel computing power, in the form of “General Purpose” Graphical Processing Units (GPGPUs). The CUDA platform was launched in 2007 by NVIDIA, and GPUs have since largely moved from being “just” about graphics to being essential tools for parallel computing, particularly well suited to image processing.
The early 2010s were when it all came together, with widespread access to GPUs, large quantities of data from digital pathology, and the release of easy-to-use software for the development of deep neural networks (such as Theano in 2007, Caffe in 2013, and Keras and TensorFlow in 2015).
When the work on this thesis started in 2015, “deep learning” was in a transitional phase. It was no longer a niche studied by a handful of specialized research teams, and the “deep learning revolution” was already well underway, but it did not yet appear to be the only widely used approach, as it is now. Deep neural networks had by then been used in digital pathology tasks with very promising results, yet despite the apparent success of such methods, their adoption in clinical practice still seemed a long way off. Seven years later, it is still unclear what place exactly deep learning methods will be able to take in the pathology workflow. To quote a 2021 review by van der Laak et al. [21, p. 780]:
Even though promising results for deep learning [computational pathology] algorithms have been shown in many studies, it is still too early to distinguish the hope from the hype. (…)
What could be achieved relatively soon is AI algorithms that work in conjunction with pathologists, rather than as stand-alone solutions, to remove the need for tedious, repetitive work, such as identifying lymph node metastases, or to increase the quality of diagnostic grading.
As we started working on potential applications for deep learning in digital pathology, such as the detection of artefacts, a key characteristic of pathology datasets drew our attention: the frequent absence of genuine “ground truth.”
The rise in popularity of deep learning came in large part through image analysis challenges, the most famous example being the “ImageNet Large Scale Visual Recognition Challenge.” In ImageNet, the target classes are relatively straightforward. In total, more than one million images spanning 1,000 classes were used in the challenge. These are ideal conditions for deep learning algorithms: a large quantity of data, with a reliable ground truth. If the supervision says that an image shows a bicycle, a sock, a cucumber, or a triceratops, it can be safely assumed that this label is correct.
In digital pathology, there is no such abundance of annotated data. With whole-slide scanners, and the efforts of organisations such as the US National Cancer Institute to aggregate images acquired from multiple hospitals and make them available to the public, it has become easier to find large quantities of images. But knowing the exact content of these images is a very different matter.
The objects of interest in digital pathology image analysis are, in general, very loosely defined. The boundaries of the objects (which can be very fuzzy) and their exact nature are not necessarily straightforward to determine. The task of automatically determining a diagnosis based on the images is even more complex. Digital pathology datasets are therefore necessarily smaller, and the available “expert annotations” often cannot be considered as the “ground truth,” but rather as an expert’s opinion. The question of how the inevitable imperfections of digital pathology datasets influence the results of deep learning algorithms thus became a major focus of our work.
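A small back-of-the-envelope computation makes the stakes concrete. Assuming a binary task with symmetric label noise (a simplification not taken from the dissertation), even a model that agrees with the true underlying labels 90% of the time will appear weaker when its predictions are scored against imperfect annotations:

```python
# Back-of-the-envelope sketch (assumption: binary task, symmetric label
# noise). A prediction counts as "correct" only when it matches the noisy
# annotation, so the apparent accuracy is:
#   apparent = true_acc * (1 - p) + (1 - true_acc) * p
# where p is the probability that an annotation is wrong.

def apparent_accuracy(true_acc: float, noise_rate: float) -> float:
    """Expected accuracy measured against labels flipped with prob. noise_rate."""
    return true_acc * (1 - noise_rate) + (1 - true_acc) * noise_rate

# A 90%-accurate model scored against annotations that are 10% wrong
# appears to be only 82% accurate:
print(round(apparent_accuracy(0.90, 0.10), 2))  # 0.82
```

The same ceiling applies symmetrically to training: the learning signal itself is partly wrong, which is why the impact of imperfect supervision on both learning and evaluation runs through this whole dissertation.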
A less studied aspect of imperfect annotations is their impact on the evaluation of image analysis algorithms. In challenges and other publications, evaluation metrics are computed as if the annotations in the test set were certain. Yet the same imperfections that make the learning process more difficult also make the evaluation process less reliable. The question of interobserver variability is particularly relevant here: when experts disagree, how do we really measure the relative performance of algorithms? This question led us to our study of the Gleason 2019 challenge. Also important in the evaluation of deep learning algorithms is the variability due to the randomized nature of the algorithms themselves: from their initial conditions to data augmentation techniques and the randomness of the optimization process. The importance of using proper statistical tools in the assessment of the algorithms, and of recognizing the limits of the conclusions we can draw from those results, is something that we examined in the evaluation of our experiments, and that is further developed in this work.
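One simple statistical tool of the kind referred to here is a paired permutation test on per-image scores, rather than a bare comparison of mean scores. The sketch below is illustrative only: the per-image Dice values are hypothetical, and this is one of several reasonable test choices, not the specific procedure used in the dissertation.

```python
import random

# Illustrative sketch: comparing two algorithms with a paired permutation
# test on per-image scores. The Dice values below are hypothetical.

scores_a = [0.81, 0.74, 0.88, 0.69, 0.77, 0.83, 0.72, 0.79]
scores_b = [0.78, 0.75, 0.84, 0.70, 0.73, 0.80, 0.71, 0.76]

def paired_permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided p-value for mean(a - b) != 0, via random sign flips."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis, each per-image difference is equally
        # likely to have either sign, so we flip signs at random.
        perm = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(perm)) / len(perm) >= observed:
            hits += 1
    return hits / n_perm

p_value = paired_permutation_test(scores_a, scores_b)
```

A high p-value here would warn that the observed gap between the two mean scores could plausibly be explained by chance alone, something a single aggregated number cannot convey.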
As we reviewed the results of different digital pathology challenges and explored the choices made by challenge organisers, we identified several points deserving attention. First, we noticed that several challenges suffered from a lack of quality control in the published dataset and/or in the evaluation process. We had already noticed some problems in the Gleason 2019 challenge, and we further found issues with the MoNuSAC 2020 challenge, which were reported in a comment article that led to a correction of the previously published results. Second, it became clear that choosing an evaluation metric that fits a particular digital pathology task is far from trivial. We used the MoNuSAC challenge results to study how the choice of metric can hide or reveal important insights on the strengths and weaknesses of the participating teams’ results.
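One facet of why these evaluation choices are far from trivial can be shown with a toy example (hypothetical counts, not MoNuSAC data): with the very same metric, the ranking of two teams can flip depending on whether scores are averaged per image or pooled over all objects.

```python
# Illustrative sketch with hypothetical counts: the same metric (Dice/F1)
# can rank two teams differently depending on the aggregation strategy.

def dice(tp, fp, fn):
    """Dice / F1 score from true positive, false positive, false negative counts."""
    return 2 * tp / (2 * tp + fp + fn)

# (TP, FP, FN) per image: one image with many objects, one with few.
team_a = [(90, 10, 10), (1, 9, 9)]
team_b = [(60, 40, 40), (6, 4, 4)]

def mean_per_image(counts):
    """Average of per-image Dice scores (each image weighted equally)."""
    return sum(dice(*c) for c in counts) / len(counts)

def pooled(counts):
    """Dice over counts summed across images (each object weighted equally)."""
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return dice(tp, fp, fn)

# Per-image averaging ranks team B first; pooling ranks team A first.
print(mean_per_image(team_a), mean_per_image(team_b))  # 0.5  0.6
print(pooled(team_a), pooled(team_b))                  # ~0.83  0.6
```

Neither aggregation is "wrong"; they answer different questions (performance per case versus performance per object), which is exactly why the evaluation process deserves as much scrutiny as the algorithms themselves.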
This is therefore the central question to be explored in this work: what is the impact of real-world annotations in digital and computational pathology? This dissertation is structured as follows. In Chapter 1 and Chapter 2, we provide context on what “deep learning” and “digital pathology” are, on their definitions and their history. In Chapter 3, we review the state of the art of deep learning in digital pathology, with a particular focus on the various competitions organized since 2010. Chapter 4 examines evaluation metrics and processes in detail, reviews their use in digital pathology competitions, and, through several experiments and original analyses, outlines their limitations and biases and provides recommendations for future research. The impact of incomplete, imprecise, and noisy annotations on the learning and evaluation processes is explored in Chapter 5. In Chapter 6, we look at a practical case with our work on artefact detection and segmentation. The question of interobserver variability is the focus of Chapter 7. Our findings and subsequent recommendations on quality control problems in competitions are explained in Chapter 8. A general discussion of our findings and a conclusion can be found in Chapter 9. In addition to these chapters, a description of some of the main datasets used throughout this work can be found in Annex A. In Annex B, we describe the deep learning models used in our experimental work, as well as some from the state of the art that are commonly used in digital pathology. Code for reproducing the experimental results and figures is available as supplementary materials on GitHub.
Our main contributions to the state of the art are the following. First, we studied the effects of imperfect annotations on deep learning algorithms and proposed adapted learning strategies to counteract adverse effects. Second, we analysed the behaviour of evaluation metrics and proposed guidelines to improve the choices made in the evaluation processes. Third, we demonstrated how the integration of interobserver variability into the evaluation process can provide better insights into the results of image analysis algorithms, and better leverage the annotations from multiple experts. Finally, we reviewed digital pathology challenges, found important shortcomings in their design choices and quality control, and demonstrated the need for increased transparency.
The past decade has seen profound changes in the field of computer vision, and in the field of pathology. The prospect of being able to integrate automated methods into the pathology pipeline, to help in the diagnostic process or in the search for reliable biomarkers that would make these diagnoses easier to obtain, appears increasingly likely. The adoption of such methods in clinical practice, however, requires a large amount of trust: trust in the capacity of deep learning methods to navigate the biases and limitations of human-made datasets, and trust that the apparent results of these methods reflect their true abilities when confronted with new examples in real-world settings. A strong movement to improve digital pathology competitions and benchmarks has been underway in the past few years. With this work, we hope to contribute to that movement, and to help bridge the gap between the clean setting of machine learning benchmarks and the much messier setting of the clinical world.