Impact of real-world annotations on the training and evaluation of deep learning algorithms in digital pathology

By Adrien Foucart. This is the web version of my PhD dissertation, defended in October, 2022. It can alternatively be downloaded as a PDF (original manuscript). A (re-)recording of the public defense presentation can also be viewed on YouTube (32:09).
Cite as: Foucart, A. (2022). Impact of real-world annotations on the training and evaluation of deep learning algorithms in digital pathology (PhD Thesis). Université libre de Bruxelles, Brussels.

9. Discussion and conclusions

Large convolutional neural networks have become the dominant technique in image analysis. The transition from “traditional” methods based on “handcrafted features” to these “deep learning” techniques has been very quick. In digital pathology, while deep learning methods are the state-of-the-art for the most common tasks of detection, classification, and segmentation, they seem to remain in a constant state of “promising results” for future clinical applications [18], [2].

The adoption of such methods into clinical practice requires building trust in their results. Trust, in turn, requires explainability (why do these methods produce their results?) and replicability of the results. Explainability and replicability are important for the inner workings of the deep learning algorithm itself, but they must also apply to the other elements of an “automated” digital pathology pipeline, from the collection of the dataset to the evaluation procedure.

A key characteristic of digital pathology that makes this trust-building difficult is the unavoidable uncertainty that surrounds the annotation process. Supervised deep learning algorithms and evaluation processes are built around the existence of a singular, absolute “ground truth” to which the predictions of the algorithm can be compared. As we have seen throughout this thesis, this absolute ground truth does not exist in digital pathology.

Real-world annotations in digital pathology are imperfect. For segmentation problems, precise annotations are so time-consuming, and deep learning algorithms require so much data, that obtaining these annotations from senior experts is impractical. Even for less time-consuming forms of annotations (such as patient-level or image-level labels for classification or grading), high interobserver variability implies a large uncertainty on the validity of any label, which can only be mitigated through time-consuming consensus processes. If imperfections cannot be avoided, then they must be a part of the learning process and of the evaluation process.

In this discussion, we summarize the key findings of this thesis.

Predicting what the future holds is a risky proposition in any field. It is impossible to know whether deep learning will be the technology that brings the long-awaited prospect of automating at least the most tedious pathology tasks to fruition. We however conclude this thesis with our best vision for what lies ahead.

9.1 Deep learning with real-world annotations

A perfect dataset is not necessary for deep learning algorithms. Our SNOW experiments, and other independent works on the topic [17], have shown that, for segmentation tasks in particular, standard network architectures such as U-Net are robust to large imprecisions on the object boundaries and to a small amount of noise on the class labels. These results have important implications for the best strategies for annotating digital pathology datasets for the purpose of training deep learning algorithms.

There is always a trade-off between the quality and the quantity of the annotations provided. For the constitution of a training set, focusing on quantity would therefore appear to be the better strategy. For segmentation tasks, this could mean making quick polygonal approximations of the objects of interest rather than detailed outlines. The most important aspect may be to provide a diverse set of examples: different patients, acquisition devices, laboratories executing the cutting, fixation, and staining procedures, etc. More generally, our results highlight the importance of a continuous collaboration between actors from the medical field – such as pathologists – and developers of image analysis systems. By critically analysing datasets and the impact of the different types of imperfections they may contain, the latter can direct the annotation efforts of the medical experts where they are most needed. Likewise, pathologists can use their expertise to ensure that the definition of the tasks, the constitution of the datasets, and the nature of the annotations are relevant to their practice.

While deep learning algorithms are naturally robust to some level of imperfections, their performance can be further improved by adapted learning strategies. From our experiments, it seems that relatively simple methods can be applied to mitigate the effects of imperfect annotations. In general, the methods that work well rely on expanding the dataset from a core of “more certain” labels. In our “GA” method, this was done by a first training pass on only the “positive” regions of the dataset (i.e. regions containing some annotated objects of interest), followed by a second training pass that included the full dataset and re-labelled “uncertain” regions based on the output of the first pass. Recent works have shown that this approach can be pushed further by iteratively improving the labels, essentially propagating the labels to noisy or unlabelled regions of the latent feature space [10], [15], [12], [1]. Unsupervised or self-supervised approaches may also help identify “natural” classes present in the data, either as a pre-training step before supervised fine-tuning, or with a posteriori expert input to associate the “natural” classes with the target classes. As an example, in her 2022 master’s thesis, Rania Charkaoui [4] used the self-supervised method from Ciga et al. [6] to find a feature space in which artefact detection could be framed as an outlier detection problem.
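
To make the “expand from a core of more certain labels” idea concrete, the sketch below outlines a two-pass training loop in the spirit of our “GA” method. It is a simplified illustration only: the model interface (scikit-learn-style fit and predict_proba returning per-pixel class probabilities) and the confidence threshold are assumptions, not the actual implementation used in our experiments.

```python
import numpy as np

def two_pass_training(model, patches, labels, is_positive, threshold=0.9):
    """Two-pass training with re-labelling of uncertain regions (sketch).

    patches     : array of image patches
    labels      : per-pixel integer label maps (possibly noisy / incomplete)
    is_positive : boolean flag per patch, True if the patch contains at
                  least one annotated object of interest
    The model is a hypothetical segmentation model with fit/predict_proba.
    """
    # Pass 1: train only on the "more certain" positive regions.
    model.fit(patches[is_positive], labels[is_positive])

    # Re-label uncertain (unannotated) patches from the first-pass output,
    # keeping only the pixels where the prediction is confident.
    relabelled = labels.copy()
    for i in np.where(~is_positive)[0]:
        prob = model.predict_proba(patches[i])       # shape (H, W, n_classes)
        confident = prob.max(axis=-1) >= threshold
        relabelled[i][confident] = prob[confident].argmax(axis=-1)

    # Pass 2: train on the full dataset with the updated labels.
    model.fit(patches, relabelled)
    return model
```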

Special care has to be taken around the question of class imbalance. Digital pathology datasets often feature a large class imbalance in their distributions, and the impact of noise may be particularly strong on minority classes. In the annotations of the Gleason 2019 dataset, for instance, Karimi et al. [11] found a much larger inter-pathologist disagreement on the minority “grade 5” Gleason pattern label than on the much more common benign, grade 3 and grade 4 annotations. Typical remedies for class imbalance based on simply over-sampling the minority class may therefore amplify this noise and hurt the overall performance of the algorithms. In such cases, having access to annotations from multiple experts could make it easier to focus the learning (and data augmentation) on examples that are part of a larger consensus.
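
One way to exploit multi-expert annotations in this setting is sketched below: sampling weights that compensate for class imbalance while down-weighting contentious examples. The helper and its inputs are hypothetical, shown only to illustrate the idea rather than a method evaluated in this thesis.

```python
import numpy as np

def consensus_sampling_weights(labels_per_expert, class_counts):
    """Sampling weights favouring minority classes without amplifying noise.

    labels_per_expert : (n_samples, n_experts) array of integer class labels
    class_counts      : array of per-class frequencies, indexed by class id
    """
    labels_per_expert = np.asarray(labels_per_expert)
    # Majority label per example, and the fraction of experts agreeing with it.
    majority = np.apply_along_axis(lambda r: np.bincount(r).argmax(), 1, labels_per_expert)
    agreement = (labels_per_expert == majority[:, None]).mean(axis=1)
    # Inverse class frequency handles imbalance; the agreement factor keeps
    # over-sampling focused on examples backed by a larger consensus.
    weights = agreement / np.asarray(class_counts)[majority]
    return weights / weights.sum()
```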

Using annotations from multiple experts rather than a consensus may in general enrich the learning process. While a consensus of experts provides a more “certain” annotation, it also removes the information about which examples are more likely to be incorrect (or at least subject to diverging opinions). A practice that should be avoided is to completely remove contentious examples from the dataset. While this is likely to improve the performance of algorithms trained on that consensual data, it comes at the cost of a false sense of confidence in the results (if this removal of “harder” examples is also done on the test set). Several protocols used for generating consensus annotations in challenges followed a “multi-pass” approach, where individual annotations are first made by several experts, and non-consensual cases are then debated and corrected. With this approach, it would be possible to provide, ideally, the individual annotations alongside the consensus. At the very least, an indication of which cases were not immediately consensual (as was done for instance in the MITOS-ATYPIA-14[^67] challenge) would greatly help in identifying which examples are likely to be difficult.

9.2 Evaluation with real-world annotations

Parallel to the problem of learning from imperfect annotations is that of evaluating algorithms given these imperfections. As the lack of a ground truth often cannot be avoided in digital pathology tasks, any quantitative evaluation metric is bound to carry an uncertainty on its value. The effects and scope of that uncertainty are highly dependent on the type of task and the nature of the target object.

In segmentation problems, uncertainty on the exact boundaries of the objects can, for instance, have a large effect on overlap-based metrics like the IoU if the target objects are small. Noise or interobserver disagreement in the class labels will likewise add uncertainty to classification or detection metrics. These uncertainties come on top of those that are inherent to all machine learning problems, arising from the aspects of the pipeline that are subject to the effects of randomness: the sampling of the data, the data augmentation, and, in the case of deep neural networks, the initialization of the network parameters.
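
The size effect can be made concrete with a minimal simulation, assuming square objects and a uniform one-pixel over-segmentation of the boundary (the exact numbers are only illustrative):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def square_mask(size, side):
    """Centered square object of the given side length in a size x size image."""
    mask = np.zeros((size, size), dtype=bool)
    start = (size - side) // 2
    mask[start:start + side, start:start + side] = True
    return mask

for side in (10, 100):
    truth = square_mask(256, side)
    pred = square_mask(256, side + 2)  # boundary annotated one pixel too wide on each side
    print(f"object side {side:>3} px -> IoU = {iou(truth, pred):.3f}")

# The same one-pixel boundary error costs about 0.31 IoU on the 10 px object,
# but only about 0.04 IoU on the 100 px object.
```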

With all these sources of uncertainty, it is very easy to draw incorrect conclusions on the comparative performance of different algorithms based on the quantitative evaluation alone. A critical analysis of the dataset and of the metrics is necessary to ensure a proper interpretation. Simulations and analyses of the annotations can greatly help in providing ranges of values for the different sources of uncertainty, and in setting correspondences between the values of the metrics and their interpretation. An example would be to compare individual experts to the consensus, or to each other, using the same evaluation metrics as for the algorithms. This can inform on whether the difference between two algorithms could possibly be attributed to the selection of the experts or to the consensus method. Simulations based on small perturbations of the annotations are similarly useful.
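
As an illustration, such a comparison can use the same metric for experts and algorithms alike; the sketch below (with hypothetical inputs) computes the range of expert-versus-consensus Dice scores, which gives a reference interval against which an algorithm’s score can be interpreted.

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two boolean masks."""
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def interobserver_range(expert_masks, consensus_mask):
    """Min and max expert-vs-consensus Dice scores.

    expert_masks   : list of boolean masks, one per expert
    consensus_mask : boolean mask produced by the consensus procedure
    """
    scores = [dice(mask, consensus_mask) for mask in expert_masks]
    return min(scores), max(scores)

# An algorithm whose Dice against the consensus falls within (or above) this
# range cannot, on that basis alone, be declared worse than the experts.
```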

The effects of randomness on the training process of deep learning methods can be harder to evaluate. Ignoring them, however, may also lead to incorrect conclusions. In most cases, research on deep learning for digital pathology tasks is still in the process of improving the overall pipelines and methodology, rather than validating a specific trained network for a clinical setting. The question that challenges and other publications based on performance comparisons ask is therefore not “is this trained model the best?” but rather “is this proposed methodology the best?” Assessing how robust the results are to changes in the random conditions, or to small changes in the constitution of the dataset, is a necessary part of such an evaluation. Likewise, the resources necessary to obtain a given set of results (such as computing power and time) are an important factor that is often neglected in the evaluations. How useful to clinical practice is a method that cannot be replicated without access to large clusters of GPUs or TPUs, or without putting clinical data into the hands of third-party, for-profit companies? The benefits of greater computational efficiency for the practical implementation of automated methods in digital pathology are certainly important.

A possible way to mitigate the uncertainty of the evaluation is to take greater care in the annotation process for the test set. This can be done either by involving more pathologists to form a stronger consensus, or by directing them to focus most of their time and effort on the test set, possibly to the detriment of the quality of the training set annotations. As the test set is generally smaller than the training set, this may be a good compromise to make the quantitative analysis of the results more trustworthy. While this will reduce the uncertainty, it will not eliminate it. Properly analysing the uncertainty that remains is important for the discussion of the results. As was the case for the learning process, keeping track of the individual annotations, when possible, will also enrich this discussion. It allows for comparisons not only to the consensus, but also to the individual experts. This can help make the distinction between algorithms that diverge from the consensus but remain within the realm of what some experts may predict, and algorithms that are mistaken in their own unique way. It also removes any bias that may come from the consensus mechanism itself, whether it is an automated process like STAPLE, a majority vote, or a consensus reached through discussion. In the latter case, for instance, the seniority of the experts may have a large influence on their weight in the final decision, but so could other factors that influence team decisions, such as persuasiveness or gender [7], which are less correlated with the likelihood of being correct.
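
For reference, a majority vote over binary expert masks can be written in a few lines; the sketch below is only that simple baseline (STAPLE, by contrast, iteratively estimates a reliability weight for each expert and is not implemented here).

```python
import numpy as np

def majority_vote(expert_masks):
    """Majority-vote consensus of binary masks.

    expert_masks : array of shape (n_experts, H, W), boolean
    """
    expert_masks = np.asarray(expert_masks)
    # A pixel is kept if at least half of the experts annotated it.
    return expert_masks.mean(axis=0) >= 0.5
```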

The choice of the evaluation metric itself is far from trivial, as we have shown in Chapter 4. It has to be informed by the nature of the task, by the characteristics of the dataset, and by the relative impact of different types of errors in the targeted clinical application, where, for instance, confusion between some classes may be more harmful than between others. Digital pathology tasks are often complex and multi-faceted. Our analyses show that using simple, independent metrics is generally preferable to trying to capture all the desired aspects in a single value. Simple metrics are easier to interpret and provide richer information on the relative merits of different methods.

The example of the “Panoptic Quality” for nuclei instance segmentation and classification clearly shows the risks of using a metric that is not adapted to the studied problem. The influence of object size on overlap-based metrics like the IoU can cause important issues in tasks such as nuclei segmentation. This, compounded by the confusion between the classification and detection definitions of the F1-Score, creates many possible sources of confusion in the interpretation of the results.
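
To see how these different aspects interact, the sketch below follows the published single-class definition of the Panoptic Quality: a “Recognition Quality” term (equivalent to a detection F1-Score) multiplied by a “Segmentation Quality” term (the mean IoU of the matched instances). It is a simplified illustration, not the evaluation code of any particular challenge.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def panoptic_quality(gt_instances, pred_instances, match_threshold=0.5):
    """Single-class Panoptic Quality over lists of boolean instance masks.

    Matches at IoU > 0.5 are guaranteed to be one-to-one, so a simple
    greedy matching is sufficient.
    """
    matched_ious = []
    used_preds = set()
    for gt in gt_instances:
        for j, pred in enumerate(pred_instances):
            if j in used_preds:
                continue
            overlap = iou(gt, pred)
            if overlap > match_threshold:
                matched_ious.append(overlap)
                used_preds.add(j)
                break
    tp = len(matched_ious)
    fp = len(pred_instances) - tp
    fn = len(gt_instances) - tp
    if tp + fp + fn == 0:
        return 0.0  # no instances in either set
    rq = tp / (tp + 0.5 * fp + 0.5 * fn)                        # detection F1-Score
    sq = float(np.mean(matched_ious)) if matched_ious else 0.0  # mean IoU of matched pairs
    return sq * rq
```

Because the Segmentation Quality term is an IoU averaged over matched objects, the small size of nuclei penalizes it (and the matching step itself) for boundary errors that may be negligible in practice, while the Recognition Quality term behaves like a detection F1-Score, not the classification F1-Score it is sometimes confused with.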

An over-reliance on quantitative metrics may therefore sometimes be detrimental to the best interpretation of the results. Qualitative analysis can be a richer source of information, although such analysis comes with its own constraints and biases. It is impractical when comparing many algorithms and requires the test set to remain relatively small. In tasks where the annotation process is particularly time-consuming, like segmentation, it may however be both more informative and easier to ask pathologists to compare the results of different algorithms (for instance by scoring how much they agree with each algorithm’s prediction) than to ask them to provide a full set of annotations on which quantitative measurements could be made.

Another possibility to reduce the uncertainty of the annotations is to use an external source of validation for the test set annotations, one that is not available to the algorithms. A common case is to have information from IHC slides or other modalities available to the test set annotators, while the algorithm only has access to H&E-stained WSIs. The drawback of such a setup is that, to compare the algorithms’ performance to that of the experts, another independent expert panel needs to be constituted and given only the same information as the algorithm. This largely increases the human resources necessary for conducting the study.

A common thread to this question of real-world annotations is the importance of keeping human experts in the loop through the entire process. The constitution of the dataset, the annotation process, the evaluation process and metrics, and the interpretation of the results, cannot be done without the active involvement of medical experts. To do otherwise increases the risk of wasting a lot of energy chasing performances which will not translate to practical improvements for the pathologist or the patient.

9.3 Improving digital pathology challenges

Several digital pathology challenges have been featured extensively in this thesis. The organisation of such challenges has been largely beneficial to the research community, by bringing attention to important tasks where automated image analysis may be of help, and by making large datasets available.

The complexity of organizing these challenges, however, makes the likelihood of mistakes or weaknesses relatively high. Judging from the challenges that provide a large amount of transparency on their methods, datasets and/or evaluation code, mistakes appear to be relatively common. We identified the incorrect annotation maps in Gleason 2019 [8] and the mistakes in the evaluation code of MoNuSAC 2020 [9], and mentioned in Chapter 8 the confusion between the stated task and the evaluation code in the Seg-PC 2021 challenge. Many competitions do not provide the information that would allow other researchers to properly analyse their results.

This lack of transparency is an important weakness that has a detrimental impact on the trust that we can place in their results. The main hub that centralizes information on biomedical challenges today is the grand-challenge.org website[^68]. Many challenges hosted on that platform, unfortunately, keep important information on their methodology and their data locked for participants only (or, for the test set annotations, sometimes for everyone) long after the challenge has ended. When post-challenge publications are made, many challenges do not update their pages to provide a reference to them. In our review of segmentation challenges, it sometimes required significant effort to find substantial information. In some cases, the websites have disappeared and only some snapshots on archiving services are available. As all of these challenges are rather recent, the oldest being the 2010 PR in HIMA competition, this trend is very worrying for the future.

The most visible part of the results, in most cases, is the “leaderboard,” with the ranking of the participants and the values of the metric(s) chosen by the organizers on the predictions of the algorithms. In the absence of the test set annotations, and of either the participants’ predictions or the code of their algorithms, such a leaderboard is very limited in the information it provides. It is impossible, for instance, to recompute alternative metric(s) which may provide additional insights relevant to tasks adjacent to the one envisioned by the challenge organizers, or simply another point of view on that same task. If the evaluation code is not available, it is furthermore impossible to verify that the implementation of the evaluation corresponds to the published methodology. Aside from the potential for malicious manipulation of the results, the main risk is simply to have mistakes such as those made in the MoNuSAC challenge. Such mistakes are easy to make in digital pathology pipelines, where the evaluation methodology may involve multiple steps and heavily customized metrics not available in standard machine learning libraries. Trust requires replicability, and for challenges replicability requires transparency. Focusing on the leaderboard also emphasizes the competitive aspect of the challenges, rather than their scientific output, which is generally to be found in the discussion of the strengths and weaknesses of the proposed methods.

Transparency is not just good for future researchers. It can also outsource some of the responsibility for the quality control of the challenge to all participants. The mistakes in the MoNuSAC challenge evaluation, for instance, were visible to all participants from the start of the submission period, as all the code was publicly available. Encouraging challenge participants to audit the code and the data could be a good way to involve them as partners in the process of scientific discovery, rather than limiting their role to being competitors. The responsibility for quality control is also shared by the organizers of the conferences where challenges are often hosted, such as MICCAI or ISBI, and by reviewers and editors of journals where their results are published. While it may be unrealistic to expect reviewers or conference organizers to perform the quality control themselves, they could implement stronger requirements in terms of transparency and quality control procedures. While the efforts made by the MICCAI society with the requirements of the “BIAS” transparency reports [13] go in the right direction, replicability is still far from achievable from the information that challenges generally release.

A way forward for challenge organizers could be to move beyond adversarial competitions and towards a more cooperative dynamic. In current competitions, dozens or hundreds of participants compete in parallel, many with very similar solutions and encountering the same problems. The incentives are such that competitors are adversaries not only of one another but also, in a way, of the challenge organizers: if errors exist in the data or the evaluation code that can be exploited to obtain better results according to the challenge metrics, then the adversarial challenge model rewards the exploitation of those errors rather than their disclosure. Finding the right incentives to reward cooperation is certainly tricky. This could mean focusing the results on solutions rather than teams, with participants being able to contribute to different solutions and being rewarded according to those contributions. Instead of having multiple teams independently working on slight variations of the same pipeline, each intermediary result could then inform all the other participants on which directions may be of interest. The post-challenge publication should similarly be focused on the solutions and not on the teams, to emphasize the scientific insights rather than the often small differences in the quantitative metrics between the top teams. As an example, the PANDA challenge publication [3] does not contain any team rankings in the main text. The competition results are reported in supplementary materials, but the main article focuses on an overall perspective of the proposed methods and the shared characteristics of those that performed well.

9.4 Conclusions: predicting the future

Deep learning solutions to digital pathology tasks now routinely outperform trained pathologists… if the tasks are performed in controlled settings. The gap between outperforming a pathologist on a curated dataset and being ready for deployment in a clinical setting, however, is as large as the gap between trusting a self-driving car on a private circuit and letting it drive autonomously in the middle of a large city.

It is unlikely that deep learning methods will bridge that gap in the near future in the form of fully automated pathology pipelines. This, however, does not mean that such methods cannot find a place in daily clinical or laboratory practice.

If automated methods should not be entrusted with making an initial diagnosis, they could however serve as a line of defence against errors in a pathologist’s diagnosis. As an automated method can operate in the background on any slide that is processed in a digital pathology pipeline, it can potentially flag cases where its automated diagnosis differs from that of the pathologist, prompting a second opinion or a review by the initial pathologist. For such a system to be accepted by clinicians, however, would probably require a great emphasis on the explainability of the algorithm: it is not enough to contradict a pathologist’s opinion; the features or WSI regions that prompted the alternative opinion should be highlighted. Pathologists could then, for instance, quickly assess whether they overlooked a useful part of the slide.

Tasks that are not as sensitive in nature as diagnosis but can otherwise facilitate or speed up the pathology workflow are also good candidates for automation. Artefact detection (and quality assessment of the slide preparation and acquisition) is a typical example. Such tasks can make the work of histology technologists and pathologists easier, without putting the patient’s outcome at risk if some data fall outside of the algorithm’s working domain, where its outputs may be unpredictable.

Finally, there is a large potential for deep learning methods as a way to find new biomarkers. Exploiting the very large amount of data from routine scanning of WSIs to find learned features that are correlated with patient outcome is a good way to potentially harness the strengths of deep learning methods with limited potential for harm. The development of such methods would therefore be focused on the exploration and understanding of the features and their potential relationship with biological processes rather than on their pure predictive performance.

As usage of machine learning methods in pathology becomes a reality, the ethical and regulatory issues surrounding deep learning also become more pressing. While most of these issues are largely outside of the scope of this thesis, they cannot be completely left aside. Several recent studies outline some of those issues [5], [16].

Selection biases in the datasets (such as having a large majority of white male subjects) can have a large impact on data-centric methods such as deep learning algorithms, which may perform poorly on samples outside of the majority population. Chauhan and Gullapalli [5] give the example of a commercial prediction algorithm used to identify high-risk patients based on several biomarkers, demographic information and known comorbidities. The algorithm was trained to predict healthcare costs for the patients, which were seen as a proxy for their healthcare needs. Obermeyer et al. [14], however, found that because the United States healthcare system spends less money on average for patients identifying as “Black” than for patients identifying as “White,” the algorithm tended to predict lower costs for Black patients. This means that, based on the algorithm’s recommendations, they would receive less additional care than White patients at similar risk levels. This type of bias can be very difficult to discover in very complex deep learning models and highlights the importance of good data selection and of good model explainability.

Another important issue is the collection and usage of patient data. Deep learning models are improved by compiling very large datasets. These datasets may be further improved by linking patient information across multiple modalities: digital pathology WSIs, alongside radiological and genetic information, blood test results, and any available personal data. As more personal data are collected, however, anonymisation of the data becomes more difficult, and the risks of privacy violations become much more important (particularly, as noted by Sorell et al. [16], when the data contain information that relates to the patient’s wealth, sexual orientation and practices, or political orientation). They contend, however, that while “absolutely irreversible deidentification is, if possible at all, very difficult” [16, p. 280], anonymized data in large datasets in practice require very sophisticated and expensive methods to be linked to known identities. Still, this at the very least requires great care in the collection and curation of the dataset to avoid exposing sensitive personal data and violating existing privacy laws such as the GDPR.

The potential use of these data in commercial applications may pose a more serious ethical problem. Patients who agreed to the collection of their data for research purposes may not agree to the inclusion of these data in a model trained by a commercial entity in order to sell a product. The “informed consent” form of the TCGA project[^69] warns that the collected data may “lead to the development of new diagnostic tests, new drugs or other commercial products.” Publicly available datasets, however, do not always disclose the exact terms that the patients agreed to. Furthermore, there is a difference between the data leading to the development of commercial products, and the data being integrated into the model used in a commercial product. Controlling which data companies trained their models on, and whether they respected license requirements, is extremely complicated. It is not yet clear whether the current regulatory framework will be capable of dealing with all the questions that arise from the development of commercial products based on deep learning methods and digital pathology data.

Deep learning in digital pathology is a multi-disciplinary field, requiring the collaboration of scientists coming from very different worlds. To wildly caricature: machine learning scientists, used to nicely curated benchmark datasets and to results backed up by mathematics, meeting pathologists, who work in a world of constantly evolving guidelines and of the messy variability and unpredictability of human diseases. The success of such a collaboration requires machine learning scientists to be better attuned to the realities of clinical practice, and to the impact of this reality on the datasets that they work on. It also requires pathologists to better understand the strengths and limitations of the algorithms whose purpose is to help them in their decision making. If this cross-disciplinary understanding is absent, we run the risk of producing models whose performances only appear good because of faulty dataset design or a poor understanding of the underlying clinical needs behind the computer vision task. Conversely, pathologists may not properly assess the extent to which some results may or may not be trusted, and be led towards faulty diagnoses (or stop trusting those results altogether).

The potential for improving patient care, however, is undeniable. By analysing the impact of real-world annotations on deep learning methods in digital pathology, we hope to help better understand the interaction between those two worlds. In this way, we can work towards ensuring that the large investment in time and resources by pathologists and machine learning scientists alike is well spent and moves the state-of-the-art in directions that best serve the needs of the patients.

[1] G. Algan and I. Ulusoy, “MetaLabelNet: Learning to Generate Soft-Labels From Noisy-Labels,” IEEE Transactions on Image Processing, vol. 31, pp. 4352–4362, 2022, doi: 10.1109/TIP.2022.3183841.
[2] V. Baxi, R. Edwards, M. Montalto, and S. Saha, “Digital pathology and artificial intelligence in translational medicine and clinical practice,” Modern Pathology, vol. 35, no. 1, pp. 23–32, Jan. 2022, doi: 10.1038/s41379-021-00919-2.
[3] W. Bulten et al., “Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge,” Nature Medicine, vol. 28, no. 1, pp. 154–163, Jan. 2022, doi: 10.1038/s41591-021-01620-2.
[4] R. Charkaoui, “Artefacts detection and ischemic time prediction based on TCGA and GTEx samples quality assessment,” Master Thesis, Université Libre de Bruxelles, 2022.
[5] C. Chauhan and R. R. Gullapalli, “Ethics of AI in Pathology,” The American Journal of Pathology, vol. 191, no. 10, pp. 1673–1683, Oct. 2021, doi: 10.1016/j.ajpath.2021.06.011.
[6] O. Ciga, T. Xu, and A. L. Martel, “Self supervised contrastive learning for digital histopathology,” Machine Learning with Applications, vol. 7, Mar. 2022, doi: 10.1016/j.mlwa.2021.100198.
[7] K. Coffman, C. B. Flikkema, and O. Shurchkov, “Gender stereotypes in deliberation and team decisions,” Games and Economic Behavior, vol. 129, pp. 329–349, Sep. 2021, doi: 10.1016/j.geb.2021.06.004.
[8] A. Foucart, O. Debeir, and C. Decaestecker, “Processing multi-expert annotations in digital pathology: a study of the Gleason 2019 challenge,” in 17th International Symposium on Medical Information Processing and Analysis, Dec. 2021, p. 4, doi: 10.1117/12.2604307.
[9] A. Foucart, O. Debeir, and C. Decaestecker, “Comments on ‘MoNuSAC2020: A Multi-Organ Nuclei Segmentation and Classification Challenge’,” IEEE Transactions on Medical Imaging, vol. 41, no. 4, pp. 997–999, Apr. 2022, doi: 10.1109/TMI.2022.3156023.
[10] D. Karimi, H. Dou, S. K. Warfield, and A. Gholipour, “Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis,” Medical Image Analysis, vol. 65, p. 101759, Oct. 2020, doi: 10.1016/j.media.2020.101759.
[11] D. Karimi, G. Nir, L. Fazli, P. C. Black, L. Goldenberg, and S. E. Salcudean, “Deep Learning-Based Gleason Grading of Prostate Cancer From Histopathology Images—Role of Multiscale Decision Aggregation and Data Augmentation,” IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 5, pp. 1413–1426, May 2020, doi: 10.1109/JBHI.2019.2944643.
[12] S. Li, Z. Gao, and X. He, “Superpixel-Guided Iterative Learning from Noisy Labels for Medical Image Segmentation,” 2021, pp. 525–535, doi: 10.1007/978-3-030-87193-2_50.
[13] L. Maier-Hein et al., “BIAS: Transparent reporting of biomedical image analysis challenges,” Medical Image Analysis, vol. 66, p. 101796, Dec. 2020, doi: 10.1016/j.media.2020.101796.
[14] Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan, “Dissecting racial bias in an algorithm used to manage the health of populations,” Science, vol. 366, no. 6464, pp. 447–453, Oct. 2019, doi: 10.1126/science.aax2342.
[15] H. Song, M. Kim, D. Park, Y. Shin, and J.-G. Lee, “Learning From Noisy Labels With Deep Neural Networks: A Survey,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–19, 2022, doi: 10.1109/TNNLS.2022.3152527.
[16] T. Sorell, N. Rajpoot, and C. Verrill, “Ethical issues in computational pathology,” Journal of Medical Ethics, vol. 48, no. 4, pp. 278–284, Apr. 2022, doi: 10.1136/medethics-2020-107024.
[17] Ș. Vădineanu, D. Pelt, O. Dzyubachyk, and J. Batenburg, “An Analysis of the Impact of Annotation Errors on the Accuracy of Deep Learning for Cell Segmentation,” MIDL, pp. 1–17, 2022.
[18] J. van der Laak, G. Litjens, and F. Ciompi, “Deep learning in histopathology: the path to the clinic,” Nature Medicine, vol. 27, no. 5, pp. 775–784, May 2021, doi: 10.1038/s41591-021-01343-4.