Our paper “Shortcomings and areas for improvement in digital pathology image segmentation challenges” is now available in its final version in the journal Computerized Medical Imaging and Graphics (doi: 10.1016/j.compmedimag.2022.102155).
It’s probably the most important of the articles written during my thesis. We compiled information about 21 digital pathology challenges organized between 2010 and 2022 that included a segmentation task, and analyzed different aspects:
- How the task definitions have evolved over the years, from simple binary or instance segmentation to semantic segmentation, and to combined instance segmentation and classification.
- How the reference annotations were generated: despite the high level of inter-expert disagreement exhibited by most digital pathology tasks, many challenges rely on only a single expert to generate their annotations.
- Their evaluation processes. This is, in my opinion, the most important part of the study, where we identify a serious lack of transparency in the exact steps of the evaluation process, and show that besides the choice of “metric” (which for segmentation tasks is generally either the IoU, the DSC or the Hausdorff distance), there are many choices to be made in the evaluation pipeline (rules for matching instances, aggregation process…) which are often described less precisely, despite the significant impact that they can have on the results (see the sketch after this list). We also note that few challenges make their full evaluation code public.
- How top-ranked methods have changed over the years, with increasingly complex deep learning architectures (and often ensemble of networks), embedded in a relatively settled pipeline: some form of stain normalization, patch extraction, data augmentation…
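To illustrate why the aggregation step mentioned above matters, here is a minimal, hypothetical sketch (not taken from the paper or from any challenge’s actual evaluation code): the same pair of predictions yields very different scores depending on whether per-image DSC values are averaged, or all pixels are pooled before computing a single “global” DSC.

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice similarity coefficient (DSC) between two binary masks."""
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2 * intersection / total if total > 0 else 1.0

# Two toy test images: one large object segmented almost perfectly,
# one small object segmented poorly.
target_a = np.zeros((100, 100), dtype=bool); target_a[10:90, 10:90] = True
pred_a   = np.zeros((100, 100), dtype=bool); pred_a[12:88, 12:88] = True

target_b = np.zeros((100, 100), dtype=bool); target_b[40:50, 40:50] = True
pred_b   = np.zeros((100, 100), dtype=bool); pred_b[45:55, 45:55] = True

# Aggregation choice 1: average the per-image DSC values.
per_image = np.mean([dice(pred_a, target_a), dice(pred_b, target_b)])

# Aggregation choice 2: pool all pixels and compute a single global DSC.
pooled = dice(np.concatenate([pred_a.ravel(), pred_b.ravel()]),
              np.concatenate([target_a.ravel(), target_b.ravel()]))

print(f"per-image average DSC: {per_image:.3f}")  # ~0.60
print(f"pooled (global) DSC:   {pooled:.3f}")     # ~0.94
```

On these toy masks the per-image average is about 0.60 while the pooled score is about 0.94, simply because the large, well-segmented image dominates the pooled statistics. If a challenge report does not state which of these (or other) aggregation rules was used, the reported scores cannot be reproduced or meaningfully compared.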
We take in particular a closer look at three selected challenges with some unique or less common characteristics: GlaS 2015 (which included several subtasks ranked separately), Gleason 2019 (which publicly released individual annotations from multiple experts), and MoNuSAC 2020 (which publicly released their full test set annotations, their evaluation code, and several of the participating teams’ prediction maps).
Segmentation challenges are extremely difficult to organise and properly evaluate. To make the “competitive” aspects of a challenge work, we often have to accept compromises that make the results less reliable from a scientific perspective, such as ignoring inter-expert disagreement or using metrics that are not necessarily well aligned with the clinical or biological task. To ensure that the time and effort taken to organise these challenges is fully leveraged, we highlight the need for more transparency in the process:
Restricting access to the datasets (and the test dataset in particular) is obviously necessary while the challenge is underway, but becomes a barrier to subsequent research once the challenge is over. Challenge websites are also often abandoned once the final ranking has been published and/or the related conference event is over. Even though all the reviewed challenges are relatively recent, a lot of the information has been lost, or has become very difficult to find, with some websites no longer available, some links to the datasets no longer working, and contact email addresses not responding as organisers have moved on to other projects.
We emphasize that, in order for challenge results to be fully reproducible (and therefore comparable with new results), it is necessary to have access “to the evaluation code, the participants’ predictions and the full dataset, including the test set annotations.”
The preprint (which is only very slightly different from the final version) will remain available here on this website.