Impact of real-world annotations on the training and evaluation of deep learning algorithms in digital pathology

By Adrien Foucart. This is the web version of my PhD dissertation, defended in October 2022. It can alternatively be downloaded as a PDF (original manuscript). A (re-)recording of the public defense presentation can also be viewed on YouTube (32:09).
Cite as: Foucart, A. (2022). Impact of real-world annotations on the training and evaluation of deep learning algorithms in digital pathology (PhD Thesis). Université libre de Bruxelles, Brussels.


This dissertation studies how the reality of digital pathology annotations affects modern image analysis algorithms, as well as the evaluation processes that we use to determine which algorithms are better.

In the ideal supervised learning scenario, we have access to a “ground truth”: the output that we want from the algorithms. This “ground truth” is assumed to be unique, and the evaluation of algorithms is typically based on comparing it to the actual output of the algorithm. In the world of biomedical imaging, and more specifically in digital pathology, the reality is very different from this ideal scenario. Image analysis tasks in digital pathology are trying to replicate assessments made by highly trained experts, and these assessments can be complex and difficult, and therefore come with different levels of subjectivity.

As a result, the annotations provided by these experts (and typically considered as “ground truth” in the training and evaluation of deep learning algorithms) are necessarily associated with some uncertainty. Our main contributions to the state-of-the-art are the following:

First, we studied the effects of imperfect annotations on deep learning algorithms and proposed adapted learning strategies to counteract adverse effects.

Second, we analysed the behaviour of evaluation metrics and proposed guidelines to improve the choices made in the evaluation processes.

Third, we demonstrated how the integration of interobserver variability into the evaluation process can provide better insights into the results of image analysis algorithms, and better leverage the annotations from multiple experts.

Finally, we reviewed digital pathology challenges, found important shortcomings in their design choices and in their quality control, and demonstrated the need for increased transparency.


Impact of real-world annotations on the training and evaluation of deep learning algorithms in digital pathology

Thesis presented by Adrien FOUCART
with a view to obtaining the PhD Degree in biomedical engineering (“Docteur en ingénierie biomédicale”)
Academic year 2022-2023

Supervisor: Professor Christine DECAESTECKER
Co-supervisor: Professor Olivier DEBEIR
Laboratory of Image Synthesis and Analysis

Thesis jury:

Hugues BERSINI (Université Libre de Bruxelles, Chair)
Gauthier LAFRUIT (Université Libre de Bruxelles, Secretary)
Geert LITJENS (Computational Pathology Group, Radboud University Medical Center)
Raphaël MARÉE (Montefiore Institute, Université de Liège)
Isabelle SALMON (Hôpital Érasme)
Gianluca BONTEMPI (Université Libre de Bruxelles)
Olivier DEBEIR (Université Libre de Bruxelles)
Christine DECAESTECKER (Université Libre de Bruxelles)

Table of contents

  • Front matter
  • ---- Table of contents
  • ---- Acknowledgments
  • ---- List of abbreviations
  • ---- Notations sheet
  • Introduction
  • 1. Deep learning in computer vision
  • ---- 1.1 The roots of deep learning
  • ---- 1.2 Defining deep learning
  • ---- 1.3 Computer vision tasks
  • ---- 1.4 The deep learning pipeline
  • ---- 1.5 Deep model architectures
  • ---- 1.6 Lost in a hyper-parametric world
  • ---- 1.7 Conclusions
  • 2. Digital pathology and computer vision
  • ---- 2.1 History of computer-assisted pathology
  • ---- 2.2 The digital pathology workflow
  • ---- 2.3 Image analysis in digital pathology, before deep learning
  • ---- 2.4 Characteristics of histopathological image analysis problems
  • 3. Deep learning in digital pathology
  • ---- 3.1 Mitosis detection in breast cancer
  • ---- 3.2 Tumour classification and scoring
  • ---- 3.3 Detection, segmentation, and classification of small structures
  • ---- 3.4 Segmentation and classification of regions
  • ---- 3.5 Public datasets for image analysis in digital pathology
  • ---- 3.6 Summary of the state-of-the-art
  • ---- 3.7 The deep learning pipeline in digital pathology
  • 4. Evaluation metrics and processes
  • ---- 4.1 Definitions
  • ---- 4.2 Metrics in digital pathology challenges
  • ---- 4.3 State-of-the-art of the analyses of metrics
  • ---- 4.4 Experiments and original analyses
  • ---- 4.5 Recommendations for the evaluation of digital pathology image analysis tasks
  • ---- 4.6 Conclusion
  • 5. Deep learning with imperfect annotations
  • ---- 5.1 Imperfect annotations
  • ---- 5.2 Datasets and network architectures
  • ---- 5.3 Experiments on the effects of SNOW supervision
  • ---- 5.4 Experiments on learning strategies
  • ---- 5.5 Comparison with similar experiments
  • ---- 5.6 Impact on evaluation metrics
  • ---- 5.7 Conclusions
  • 6. Artefact detection and segmentation
  • ---- 6.1 State of the art before deep learning
  • ---- 6.2 Experimental results
  • ---- 6.3 Prototype for a quality control application
  • ---- 6.4 Recent advances in artefact segmentation
  • ---- 6.5 Discussion
  • 7. Interobserver variability
  • ---- 7.1 Pathology
  • ---- 7.2 Computer vision
  • ---- 7.3 Evaluation from multiple experts
  • ---- 7.4 Insights from the Gleason 2019 challenge
  • ---- 7.5 Discussion
  • 8. Quality control in challenges
  • ---- 8.1 MITOS12: dataset management
  • ---- 8.2 Gleason 2019: annotation errors and evaluation uncertainty
  • ---- 8.3 MoNuSAC 2020: errors in the evaluation code
  • ---- 8.4 Discussion and recommendations: reproducibility and trust
  • 9. Discussion and conclusions
  • ---- 9.1 Deep learning with real-world annotations
  • ---- 9.2 Evaluation with real-world annotations
  • ---- 9.3 Improving digital pathology challenges
  • ---- 9.4 Conclusions: predicting the future
  • Annex A. Description of the datasets
  • ---- A.1 MITOS12
  • ---- A.2 GlaS 2015
  • ---- A.3 Janowczyk's epithelium dataset
  • ---- A.4 Gleason 2019
  • ---- A.5 MoNuSAC 2020
  • ---- A.6 Artefact dataset
  • ---- References
  • Annex B. Description of the networks
  • ---- B.1 Networks used in our experimental work
  • ---- B.2 Selected networks from the state-of-the-art.
  • ---- References

    Acknowledgments

    This thesis is the product of seven years working in the LISA laboratory, as a researcher, a PhD student, and a teaching assistant. Many people have contributed, in one way or another, to the realisation of this work, and I would like to thank them here:

    My supervisors, Pr. Christine Decaestecker and Pr. Olivier Debeir, for their support and their valuable feedback throughout the thesis, and for giving me the opportunity to start this thesis in the first place.

    The past and present researchers, professors, and members of the lab that I had the pleasure to interact with during these years, for their ideas and inputs, and for the shared discussions, coffees, drinks, burgers, and all the things that make the work environment a better place. I particularly would like to thank Arlette Grave for her support through the various administrative processes of the university. My thanks also go to Isabelle Salmon of DIAPath, and the staff at Erasme Hospital and at the CMMI that contributed materials and feedback.

    Being a teaching assistant was extremely stimulating and provided countless opportunities for procrastination, so I need to thank all the students who kept me busy during these years. In particular, I would like to thank Élisabeth Gruwé and Alain Zheng for their contribution to this research through their excellent work during their master's theses.

    I also want to thank Seyed Alireza Fatemi Jahromi and Amirreza Mahbod for sharing valuable data from their own research, and the NVIDIA Corporation for providing us with a Titan X GPU, which was used in most of the experimental work of this thesis.

    I would not be where I am today without my parents Lucy Dever and Dominique Foucart, and the undoubtedly very effective education and support that they have provided to their children. My brothers François and Renaud must also get some thanks (and/or blame) for their contributions, as does my uncle Thierry Dever, without whom I probably wouldn’t have gotten as comfortable with computers.

    Finally, for her support (even during the last months of this thesis), I thank my partner Céline Mathieu. And last (but certainly not least) my daughter Margot, for making sure that I didn’t stay too focused.

    List of abbreviations

    We provide here, in alphabetical order, a list of abbreviations used in this thesis.

    ACC – Accuracy
    AE – Auto-Encoder
    AI – Artificial Intelligence
    AP – Average Precision
    ASSD – Average Symmetric Surface Distance
    AUPRC – Area Under the PR Curve
    AUROC – Area Under the ROC curve
    BCC – Basal Cell Carcinoma
    BP – Backpropagation
    CE – Cross-Entropy
    CM – Confusion Matrix
    CNN – Convolutional Neural Network
    CUDA – Compute Unified Device Architecture (software layer for GPU programming)
    CYDAC – Cytophotometric Data Conversion System
    DA – Data Augmentation
    DCNN – Deep Convolutional Neural Network
    DL – Deep Learning
    DNN – Deep Neural Network
    DP – Digital Pathology
    DQ – Detection Quality
    DSC – Dice Similarity Coefficient
    ER+ – Oestrogen receptor positive
    FCN – Fully Convolutional Network
    FN – False Negative
    FNN – Feedforward Neural Network
    FP – False Positive
    FPR – False Positive Rate
    GAN – Generative Adversarial Network
    GlaS – Gland Segmentation challenge
    GM – Geometric Mean
    GMDH – Group Method of Data Handling
    GPU – Graphics Processing Unit
    H&E – Haematoxylin and Eosin
    HD – Hausdorff’s Distance
    HER2 – Human epidermal growth factor receptor 2 (also the name of a 2016 competition)
    HSV – Hue-Saturation-Value colour space
    ICPR – International Conference on Pattern Recognition
    IDC – Invasive Ductal Carcinoma
    IHC – Immunohistochemistry
    ILSVRC – ImageNet Large Scale Visual Recognition Challenge
    IoU – Intersection over Union
    ISUP – International Society of Urological Pathology
    ISBI – IEEE International Symposium on Biomedical Imaging
    $\kappa$, $\kappa_U$, $\kappa_L$, $\kappa_Q$ – Cohen’s kappa (in general); unweighted, linear, and quadratic Cohen’s kappa
    LSTM – Long Short-Term Memory
    mAP – Mean Average Precision
    MCC – Matthews Correlation Coefficient
    MDS – Multi-Dimensional Scaling
    MICCAI – Medical Image Computing and Computer Assisted Interventions (scientific society and conference)
    MIL – Multiple Instance Learning
    ML – Machine Learning
    MLP – Multi-Layer Perceptron
    MNIST – Modified National Institute of Standards and Technology (handwritten digits dataset)
    MoNuSAC – Multi-organ Nuclei Segmentation and Classification challenge
    MP – Max Pooling
    MRI – Magnetic Resonance Imaging
    MSE – Mean-Squared Error
    NCM – Normalized Confusion Matrix
    NPV – Negative Predictive Value
    NSD – Normalized Surface Distance
    PCA – Principal Component Analysis
    PPV – Positive Predictive Value
    PQ – Panoptic Quality
    PR curve – Precision-Recall curve
    PRE – Precision
    PR in HIMA – Pattern Recognition in Histological Image Analysis (2010 challenge)
    R – Rate of agreement
    REC – Recall
    ReLU – Rectified Linear Units
    ResNet – Network architecture based on “residual units”
    RGB – Red-Green-Blue colour space
    ROC – Receiver Operating Characteristic
    SEN – Sensitivity
    SGD – Stochastic Gradient Descent
    SNOW – Semi-Supervised, NOisy and/or Weak
    SPE – Specificity
    SQ – Segmentation Quality
    SSL – Semi-Supervised Learning
    STAPLE – Simultaneous Truth and Performance Level Estimation
    SVM – Support Vector Machine
    TCGA – The Cancer Genome Atlas
    TCIA – The Cancer Imaging Archive
    TMA – Tissue Microarray
    TN – True Negative
    TNM – Tumour-Node-Metastasis (cancer staging system)
    TNR – True Negative Rate
    TP – True Positive
    TPR – True Positive Rate
    UL – Unsupervised Learning
    U-Net – Network architecture based on “long-skip” connections
    WSI – Whole-Slide Imaging / Whole-Slide Image
    WSL – Weakly Supervised Learning
    XML – Extensible Markup Language

    Notations sheet

    While ad-hoc mathematical conventions are sometimes adopted in the manuscript depending on the context (and are then explained in the text), we list here the main notations that we use throughout the thesis.

    Input, output, target output of a system (algorithm, neuron, function…)
    Bold letters generally denote vectors

    Parameter (weight) of a model

    Loss function

    $T = \{T_i\}$, $P = \{P_i\}$
    Sets of target and predicted items

    Cardinality of a set

    $CM$
    Confusion matrix, with $CM_{ij}$ the confusion between target class $i$ and predicted class $j$

    Number of classes/categories in a dataset, class index.

    Number of samples in a dataset, sample index

    Class imbalance parameter

    Class proportionality parameter

    Distance parameter

    Normal distribution for random sampling with mean $\mu$ and standard deviation $\sigma$

    Uniform distribution for random sampling with bounds $[a, b]$
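
As an illustration of the conventions listed above, the following minimal Python sketch builds a confusion matrix whose entry at row i and column j counts the items of target class i predicted as class j, and draws one sample from a normal and one from a uniform distribution. It is only a sketch under stated assumptions: the function and variable names (`confusion_matrix`, `targets`, `predictions`, `n_classes`) and the toy labels are hypothetical and are not taken from the thesis code.

```python
import random


def confusion_matrix(targets, predictions, n_classes):
    """Return CM as a nested list, where CM[i][j] counts the items
    with target class i that were predicted as class j."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(targets, predictions):
        cm[t][p] += 1
    return cm


# Toy labels (hypothetical data, for illustration only).
targets = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(targets, predictions, n_classes=3)

# Cardinality of the set of target items.
n_samples = len(targets)

# Random sampling from a normal distribution (mean 0, standard deviation 1)
# and from a uniform distribution with bounds [-1, 1].
rng = random.Random(0)
normal_draw = rng.gauss(0.0, 1.0)
uniform_draw = rng.uniform(-1.0, 1.0)
```

With rows indexing the target class and columns the predicted class, the diagonal of `cm` holds the correctly classified items, and the sum of all entries equals the number of samples.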