Abstract

This dissertation studies how the reality of digital pathology annotations affects modern image analysis algorithms, as well as the evaluation processes that we use to determine which algorithms are better.

In the ideal supervised learning scenario, we have access to a “ground truth”: the output that we want from the algorithms. This “ground truth” is assumed to be unique, and algorithms are typically evaluated by comparing their actual output to it. In biomedical imaging, and more specifically in digital pathology, the reality is very different from this ideal scenario. Image analysis tasks in digital pathology try to replicate assessments made by highly trained experts. These assessments can be complex and difficult, and therefore come with varying degrees of subjectivity.

As a result, the annotations provided by these experts (and typically considered as “ground truth” in the training and evaluation of deep learning algorithms) are necessarily associated with some uncertainty. Our main contributions to the state of the art are the following:

First, we studied the effects of imperfect annotations on deep learning algorithms and proposed adapted learning strategies to counteract adverse effects.

Second, we analysed the behaviour of evaluation metrics and proposed guidelines to improve the choices made in the evaluation processes.

Third, we demonstrated how the integration of interobserver variability into the evaluation process can provide better insights into the results of image analysis algorithms, and better leverage the annotations from multiple experts.

Finally, we reviewed digital pathology challenges, found important shortcomings in their design choices and quality control, and demonstrated the need for increased transparency.

Impact of real-world annotations on the training and evaluation of deep learning algorithms in digital pathology

Thesis presented by Adrien FOUCART
with a view to obtaining the PhD Degree in biomedical engineering (“Docteur en ingénierie biomédicale”)
Academic year 2022-2023

Supervisor: Professor Christine DECAESTECKER
Co-supervisor: Professor Olivier DEBEIR
Laboratory of Image Synthesis and Analysis

Thesis jury:

Hugues BERSINI (Université Libre de Bruxelles, Chair)
Gauthier LAFRUIT (Université Libre de Bruxelles, Secretary)
Geert LITJENS (Computational Pathology Group, Radboud University Medical Center)
Raphaël MARÉE (Montefiore Institute, Université de Liège)
Isabelle SALMON (Hôpital Érasme)
Gianluca BONTEMPI (Université Libre de Bruxelles)
Olivier DEBEIR (Université Libre de Bruxelles)
Christine DECAESTECKER (Université Libre de Bruxelles)

Front matter

---- Table of contents

---- Acknowledgments

---- List of abbreviations

---- Notations sheet

Introduction

1. Deep learning in computer vision

---- 1.1 The roots of deep learning

---- 1.2 Defining deep learning

---- 1.3 Computer vision tasks

---- 1.4 The deep learning pipeline

---- 1.5 Deep model architectures

---- 1.6 Lost in a hyper-parametric world

---- 1.7 Conclusions

2. Digital pathology and computer vision

---- 2.1 History of computer-assisted pathology

---- 2.2 The digital pathology workflow

---- 2.3 Image analysis in digital pathology, before deep learning

---- 2.4 Characteristics of histopathological image analysis problems

3. Deep learning in digital pathology

---- 3.1 Mitosis detection in breast cancer

---- 3.2 Tumour classification and scoring

---- 3.3 Detection, segmentation, and classification of small structures

---- 3.4 Segmentation and classification of regions

---- 3.5 Public datasets for image analysis in digital pathology

---- 3.6 Summary of the state-of-the-art

---- 3.7 The deep learning pipeline in digital pathology

4. Evaluation metrics and processes

---- 4.1 Definitions

---- 4.2 Metrics in digital pathology challenges

---- 4.3 State-of-the-art of the analyses of metrics

---- 4.4 Experiments and original analyses

---- 4.5 Recommendations for the evaluation of digital pathology image analysis tasks

---- 4.6 Conclusion

5. Deep learning with imperfect annotations

---- 5.1 Imperfect annotations

---- 5.2 Datasets and network architectures

---- 5.3 Experiments on the effects of SNOW supervision

---- 5.4 Experiments on learning strategies

---- 5.5 Comparison with similar experiments

---- 5.6 Impact on evaluation metrics

---- 5.7 Conclusions

6. Artefact detection and segmentation

---- 6.1 State of the art before deep learning

---- 6.2 Experimental results

---- 6.3 Prototype for a quality control application

---- 6.4 Recent advances in artefact segmentation

---- 6.5 Discussion

7. Interobserver variability

---- 7.1 Pathology

---- 7.2 Computer vision

---- 7.3 Evaluation from multiple experts

---- 7.4 Insights from the Gleason 2019 challenge

---- 7.5 Discussion

8. Quality control in challenges

---- 8.1 MITOS12: dataset management

---- 8.2 Gleason 2019: annotation errors and evaluation uncertainty

---- 8.3 MoNuSAC 2020: errors in the evaluation code

---- 8.4 Discussion and recommendations: reproducibility and trust

9. Discussion and conclusions

---- 9.1 Deep learning with real-world annotations

---- 9.2 Evaluation with real-world annotations

---- 9.3 Improving digital pathology challenges

---- 9.4 Conclusions: predicting the future

Annex A. Description of the datasets

---- A.1 MITOS12

---- A.2 GlaS 2015

---- A.3 Janowczyk's epithelium dataset

---- A.4 Gleason 2019

---- A.5 MoNuSAC 2020

---- A.6 Artefact dataset

---- References

Annex B. Description of the networks

---- B.1 Networks used in our experimental work

---- B.2 Selected networks from the state-of-the-art

---- References

Acknowledgments

This thesis is the product of seven years working in the LISA laboratory, as a researcher, a PhD student, and a teaching assistant. Many people have contributed, in one way or another, to the realisation of this work, and I would like to thank them here:

My supervisors, Pr. Christine Decaestecker and Pr. Olivier Debeir, for their support and their valuable feedback throughout the thesis, and for giving me the opportunity to start this thesis in the first place.

The past and present researchers, professors, and members of the lab that I had the pleasure to interact with during these years, for their ideas and inputs, and for the shared discussions, coffees, drinks, burgers, and all the things that make the work environment a better place. I particularly would like to thank Arlette Grave for her support through the various administrative processes of the university. My thanks also go to Isabelle Salmon of DIAPath, and the staff at Erasme Hospital and at the CMMI that contributed materials and feedback.

Being a teaching assistant was extremely stimulating and provided countless opportunities for procrastination, so I need to thank all the students who kept me busy during these years. In particular, I would like to thank Élisabeth Gruwé and Alain Zheng for their contribution to this research through their excellent work during their master's theses.

I also want to thank Seyed Alireza Fatemi Jahromi and Amirreza Mahbod for sharing valuable data from their own research, and the NVIDIA Corporation for providing us with a Titan X GPU, which was used in most of the experimental work of this thesis.

I would not be where I am today without my parents Lucy Dever and Dominique Foucart, and the undoubtedly very effective education and support that they have provided to their children. My brothers François and Renaud must also get some thanks (and/or blame) for their contributions, as does my uncle Thierry Dever, without whom I probably wouldn’t have gotten as comfortable with computers.

Finally, for her support (even during the last months of this thesis), I thank my partner Céline Mathieu. And last (but certainly not least) my daughter Margot, for making sure that I didn’t stay too focused.

List of abbreviations

We provide here, in alphabetical order, a list of abbreviations used in this thesis.

ACC – Accuracy
AE – Auto-Encoder
AI – Artificial Intelligence
AP – Average Precision
ASSD – Average Symmetric Surface Distance
AUPRC – Area Under the PR Curve
AUROC – Area Under the ROC curve
BCC – Basal Cell Carcinoma
BP – Backpropagation
CE – Cross-Entropy
CM – Confusion Matrix
CNN – Convolutional Neural Network
CUDA – Compute Unified Device Architecture (software layer for GPU programming)
CYDAC – Cytophotometric Data Conversion System
DA – Data Augmentation
DCNN – Deep Convolutional Neural Network
DL – Deep Learning
DNN – Deep Neural Network
DP – Digital Pathology
DQ – Detection Quality
DSC – Dice Similarity Coefficient
ER+ – Oestrogen receptor positive
FCN – Fully Convolutional Network
FN – False Negative
FNN – Feedforward Neural Network
FP – False Positive
FPR – False Positive Rate
GAN – Generative Adversarial Network
GlaS – Gland Segmentation challenge
GM – Geometric Mean
GMDH – Group Method of Data Handling
GPU – Graphics Processing Unit
H&E – Haematoxylin and Eosin
HD – Hausdorff Distance
HER2 – Human epidermal growth factor receptor 2 (also the name of a 2016 competition)
HSV – Hue-Saturation-Value colour space
ICPR – International Conference on Pattern Recognition
IDC – Invasive Ductal Carcinoma
IHC – Immunohistochemistry
ILSVRC – ImageNet Large Scale Visual Recognition Challenge
IoU – Intersection over Union
ISBI – IEEE International Symposium on Biomedical Imaging
ISUP – International Society of Urological Pathology
κ,κ_U,κ_L,κ_Q – Cohen’s kappa (in general); Unweighted, Linear and Quadratic Cohen’s kappa.
LSTM – Long Short-Term Memory
mAP – Mean Average Precision
MCC – Matthews Correlation Coefficient
MDS – Multi-Dimensional Scaling
MICCAI – Medical Image Computing and Computer Assisted Interventions (scientific society and conference)
MIL – Multiple Instance Learning
ML – Machine Learning
MLP – Multi-Layer Perceptron
MNIST – Modified National Institute of Standards and Technology (handwritten digits dataset)
MoNuSAC – Multi-organ Nuclei Segmentation and Classification challenge
MP – Max Pooling
MRI – Magnetic Resonance Imaging
MSE – Mean-Squared Error
NCM – Normalized Confusion Matrix
NPV – Negative Predictive Value
NSD – Normalized Surface Distance
PCA – Principal Component Analysis
PPV – Positive Predictive Value
PQ – Panoptic Quality
PR curve – Precision-Recall curve
PRE – Precision
PR in HIMA – Pattern Recognition in Histological Image Analysis (2010 challenge)
R – Rate of agreement
REC – Recall
ReLU – Rectified Linear Unit
ResNet – Network architecture based on “residual units”
RGB – Red-Green-Blue colour space
ROC – Receiver Operating Characteristic
SEN – Sensitivity
SGD – Stochastic Gradient Descent
SNOW – Semi-Supervised, NOisy and/or Weak
SPE – Specificity
SQ – Segmentation Quality
SSL – Semi-Supervised Learning
STAPLE – Simultaneous Truth and Performance Level Estimation
SVM – Support Vector Machine
TCGA – The Cancer Genome Atlas
TCIA – The Cancer Imaging Archive
TMA – Tissue Microarray
TN – True Negative
TNM – Tumour-Node-Metastasis (cancer staging system)
TNR – True Negative Rate
TP – True Positive
TPR – True Positive Rate
UL – Unsupervised Learning
U-Net – Network architecture based on “long-skip” connections
WSI – Whole-Slide Imaging / Whole-Slide Image
WSL – Weakly Supervised Learning
XML – Extensible Markup Language

Notations sheet

While ad-hoc mathematical conventions are sometimes adopted in the manuscript depending on the context (and are thus explained in the text), we reference here the main notations that we use throughout the thesis.

$x,y,t$
Input, output, target output of a system (algorithm, neuron, function…)
Bold letters generally denote vectors

$w$
Parameter (weight) of a model

$L(y,t)$
Loss function

$T=\{T_i\}$, $P=\{P_i\}$
Sets of target and predicted items

$|\cdot|$
Cardinality of a set

$CM$
Confusion matrix, with $CM_{ij}$ the confusion between target class $i$ and predicted class $j$

$m,c$
Number of classes/categories in a dataset, class index.

$N,i$
Number of samples in a dataset, sample index

$\lambda_c$
Class imbalance parameter

$\pi_c$
Class proportionality parameter

$\delta$
Distance parameter

$Normal(μ,σ)$
Normal distribution for random sampling with mean $μ$ and standard deviation $σ$

$Uniform(a,b)$
Uniform distribution for random sampling with bounds $[a,b]$
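
As an illustration of these notations, the following minimal Python sketch builds a confusion matrix in which entry $CM_{ij}$ counts the samples with target class $i$ and predicted class $j$, and draws one random sample from each of the two distributions. The function name `confusion_matrix`, the toy labels and the distribution parameters are assumptions made for this example only; they are not taken from the experimental code of the thesis.

```python
import random


def confusion_matrix(targets, predictions, m):
    """Return an m x m nested list cm, where cm[i][j] counts the samples
    with target class i and predicted class j."""
    cm = [[0] * m for _ in range(m)]
    for t_i, y_i in zip(targets, predictions):
        cm[t_i][y_i] += 1
    return cm


# Hypothetical toy data: N = 6 samples, m = 3 classes.
t = [0, 0, 1, 1, 2, 2]  # target outputs
y = [0, 1, 1, 1, 2, 0]  # predicted outputs
cm = confusion_matrix(t, y, m=3)

# Accuracy (ACC) is the sum of the diagonal divided by the number of samples N.
accuracy = sum(cm[c][c] for c in range(3)) / len(t)

# Random sampling, with example parameters chosen for illustration.
rng = random.Random(0)                   # fixed seed for reproducibility
normal_sample = rng.gauss(0.0, 1.0)      # Normal(mu=0, sigma=1)
uniform_sample = rng.uniform(-1.0, 1.0)  # Uniform(a=-1, b=1)
```

With these toy labels, the rows of `cm` correspond to target classes and the columns to predicted classes, so the off-diagonal entries are the confusions between classes.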