Impact of real-world annotations on the training and evaluation of deep learning algorithms in digital pathology

By Adrien Foucart. This is the web version of my PhD dissertation, defended in October 2022. It can alternatively be downloaded as a PDF (original manuscript). A (re-)recording of the public defense presentation can also be viewed on YouTube (32:09).
Cite as: Foucart, A. (2022). Impact of real-world annotations on the training and evaluation of deep learning algorithms in digital pathology (PhD Thesis). Université libre de Bruxelles, Brussels.

B. Description of the networks

This section provides a reference for the different DCNNs used in our experiments throughout this thesis (B.1), as well as some commonly found architectures from the state of the art (B.2). The main characteristics of the described networks are summarized in Table B.1.

Table B.1. Summary of the main characteristics of the described networks.
| Name | Reference(s) | Short description | # parameters |
| --- | --- | --- | --- |
| ShortRes | Foucart et al. [1]–[3] | Small segmentation network with “short-skip” connections. | 500k |
| U-Net | Ronneberger et al. [4] | Segmentation network with long-skip connections. | ~30M |
| PAN | Foucart et al. [2], [3] | Segmentation network with long-skip connections and short-skip connections. | 10M |
| LeNet-5 | LeCun et al. [5] | Classification network with convolutional, subsampling and dense layers. | 60k |
| AlexNet | Krizhevsky et al. [6] | Classification network with convolutional, subsampling and dense layers. | 60M |
| ResNet | He et al. [7] | Classification network with short-skip connections and batch normalization. | 250k-20M depending on the chosen depth. |
| DenseNet | Huang et al. [8] | Classification network with “dense blocks,” where every convolutional layer has a direct connection to every subsequent layer. | 1M-25M depending on the chosen depth. |
| PSPNet | Zhao et al. [9] | Semantic segmentation network with a focus on multi-resolution processing of the image by the network. | Depends on the chosen encoder. |
| HoVer-Net | Graham et al. [10] | Instance segmentation and classification network, designed for digital pathology. | 40M |
| EfficientNet | Tan et al. [11] | Classification network designed to be easily scalable, with high performance at a relatively low number of parameters. | 5M-70M depending on the scaling factor. |
| Faster R-CNN | Ren et al. [12] | Detection network combining a “region proposal network,” which finds candidate bounding boxes, and a classifier which is applied to those candidates, sharing an encoder. | Depends on the chosen encoder. |

B.1 Networks used in our experimental work

B.1.1 ShortRes

We introduced the “ShortRes” network in our work on artefact detection [1], and used it in our subsequent work on SNOW annotations [2], [3]. The goal of the architecture is to be very small (to be trainable on a single GPU in a reasonable amount of time in 2016), and to incorporate the short-skip connections of ResNet [7] for better convergence. A schematic representation of the architecture is presented in Figure B.1. The macro-architecture follows a classic encoder-decoder path. The encoder is a succession of residual blocks and max-pooling layers, while the decoder uses transposed convolutions and residual blocks. All 2D convolutions use 3x3 kernels with a Leaky ReLU activation function. The network has a total of ~500k trainable parameters.

Figure B.1. ShortRes architecture, shown here for a 256x256px input, a network width of 64 and a segmentation output, as used in our SNOW experiments [2], [3]. Upsampling layers are transposed convolutions. Encoder part is shown in blue, decoder in green.

In our original artefact experiments [1], several versions of the network were tested: with different widths, added or removed residual blocks, and a “classification” version where the decoder was replaced with several dense layers.
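As a rough illustration of the short-skip connections and Leaky ReLU activations mentioned above, the sketch below applies a residual block to a small feature vector. It is not the ShortRes implementation: the 3x3 convolutions are replaced by a hypothetical `transform` function, and both the Leaky ReLU slope (0.01) and the position of the activation after the addition are assumptions made for the example.

```python
def leaky_relu(x, slope=0.01):
    """Leaky ReLU: identity for positive inputs, small slope otherwise."""
    return x if x > 0 else slope * x


def residual_block(features, transform):
    """Short-skip connection: add the block input to the output of `transform`.

    `transform` is a hypothetical stand-in for the block's convolutions.
    """
    transformed = transform(features)
    return [leaky_relu(t + f) for t, f in zip(transformed, features)]


def zero_transform(features):
    """A transform whose weights are all zero, so it contributes nothing."""
    return [0.0 for _ in features]


x = [1.0, -2.0, 3.0]
y = residual_block(x, zero_transform)
print(y)
```

With a transform that outputs zeros, the block simply passes its input through (up to the activation), which gives an intuition for why such connections are expected to ease convergence.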

B.1.2 U-Net

U-Net was introduced in 2015 by Ronneberger et al. [4], and quickly became a state-of-the-art network for biomedical image segmentation. In its original version (see Figure B.2), the encoder and the decoder were symmetrical, each made of a succession of 2D convolutions with 3x3 kernels and ReLU activation, with 2x2 max-pooling in the encoder and, conversely, transposed convolutions in the decoder. The main innovation of the architecture was the inclusion of “long-skip” connections, which re-introduced feature maps from the encoder into the decoder, with the aim of providing a better spatial context to the decoder for a more accurate segmentation. The network has a total of ~30M trainable parameters.

Figure B.2. U-Net architecture, from Ronneberger et al. [4].
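The order of magnitude of the parameter count can be checked with a short calculation. The sketch below assumes the layout of the original publication (four pooling levels, widths going from 64 to 1024, one input channel, two output classes, biases on every layer); the function names are hypothetical, and other implementations (for instance with batch normalization) will give slightly different totals.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D (transposed) convolution."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def unet_params(in_channels=1, n_classes=2, base_width=64, depth=4):
    """Count trainable parameters of a U-Net-like network (assumed layout)."""
    total = 0
    channels = in_channels
    width = base_width
    # Encoder and bottleneck: two 3x3 convolutions per level,
    # with the width doubling after each pooling.
    for _ in range(depth + 1):
        total += conv_params(channels, width, 3) + conv_params(width, width, 3)
        channels = width
        width *= 2
    # Decoder: a 2x2 transposed convolution halves the width, then two 3x3
    # convolutions process the concatenation with the long-skip feature maps.
    for _ in range(depth):
        half = channels // 2
        total += conv_params(channels, half, 2)
        total += conv_params(2 * half, half, 3) + conv_params(half, half, 3)
        channels = half
    # Final 1x1 convolution producing the class scores.
    total += conv_params(channels, n_classes, 1)
    return total


total_params = unet_params()
print(f"{total_params:,} trainable parameters")
```

Under these assumptions the total lands close to 31 million, consistent with the ~30M figure quoted above. Note that the first decoder convolution at each level sees twice as many input channels because of the long-skip concatenation.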

As the network became more popular, and computing resources became more readily available, alternative versions of the network were developed, mainly by replacing the encoder part with encoders taken from other architectures (for instance ResNet or EfficientNet, presented in section B.2), often pre-trained on large datasets such as ImageNet.

B.1.3 Perfectly Adequate Network (PAN)

Figure B.3. PAN architecture, used in our SNOW experiments [2], [3].

We introduced the PAN architecture in our SNOW experiments [2], [3]. It combines the short-skip connections from ResNet, the long-skip connections from U-Net, and a final segmentation built from decoder feature maps taken at different levels (see Figure B.3). The goal was to ensure that our results on the effects of imperfect annotations could be reproduced with different macro- and micro-architectural choices, and with different sizes in terms of trainable parameters (~10M for PAN).

B.2 Selected networks from the state of the art

B.2.1 LeNet-5

The LeNet family of neural networks comes from the work of LeCun et al. on handwritten digit recognition [5], [13], and set the standard for the “classification” macro-architecture. LeNet-5, perhaps the best-known version, is shown in Figure B.4.

It includes a succession of convolutional and subsampling layers, followed by dense layers leading to the final class probabilities. Unlike most pooling functions in common use today, which usually have no trainable parameters, the output $y_S$ of each subsampling layer was computed as $y_S = w\sum_{i} x_i + b$, with $w$ and $b$ being trainable parameters and $x_i$ the inputs, taken from a 2x2 neighbourhood in the previous layer.

Figure B.4. LeNet-5 architecture, originally published in LeCun et al. [5].
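The trainable subsampling of LeNet-5 (a weighted sum of each 2x2 neighbourhood plus a bias) can be sketched in a few lines. This is a minimal illustration on a single feature map stored as a list of lists; the function name is hypothetical, and the squashing function applied to the result in the original network is left out.

```python
def trainable_subsample(feature_map, w, b):
    """2x2 non-overlapping subsampling: y = w * sum(x_i) + b."""
    rows, cols = len(feature_map), len(feature_map[0])
    output = []
    for r in range(0, rows - 1, 2):
        out_row = []
        for c in range(0, cols - 1, 2):
            neighbourhood_sum = (
                feature_map[r][c] + feature_map[r][c + 1]
                + feature_map[r + 1][c] + feature_map[r + 1][c + 1]
            )
            out_row.append(w * neighbourhood_sum + b)
        output.append(out_row)
    return output


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
# With w = 0.25 and b = 0, the layer reduces to plain average pooling.
pooled = trainable_subsample(feature_map, w=0.25, b=0.0)
print(pooled)
```

Setting `w = 0.25` and `b = 0` recovers the usual parameter-free average pooling, which shows that the LeNet-5 layer is a generalization of it with one weight and one bias per feature map.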

Also of note, the connections in LeNet-5 did not cover the entire “width” of the network, meaning that not all feature maps of layer S2, for instance, were connected to all feature maps of layer C3. This was for performance reasons, and to force feature maps “to extract different (hopefully complementary) features.” LeNet-5 used hyperbolic tangents as an activation function throughout the network.
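The effect of this partial connectivity on the number of parameters can be illustrated with the C3 layer. The sketch below assumes the connection scheme reported by LeCun et al. [5]: 5x5 kernels, six C3 maps connected to three S2 maps each, nine connected to four, and one connected to all six. Variable and function names are hypothetical.

```python
KERNEL_SIZE = 5  # C3 uses 5x5 kernels

# Number of S2 feature maps feeding each of the 16 C3 feature maps
# (assumed from the connection table of the original publication).
inputs_per_c3_map = [3] * 6 + [4] * 9 + [6] * 1


def layer_params(inputs_per_map, kernel_size=KERNEL_SIZE):
    """One kernel per incoming connection, plus one bias per output map."""
    return sum(n * kernel_size * kernel_size + 1 for n in inputs_per_map)


sparse_params = layer_params(inputs_per_c3_map)
full_params = layer_params([6] * 16)  # if every S2 map fed every C3 map
print(sparse_params, full_params)
```

Under these assumptions the partially connected layer needs 1516 parameters instead of 2416 for a fully connected one, a reduction of a bit more than a third.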

B.2.2 AlexNet

AlexNet was Krizhevsky et al.’s [6] winning entry into the 2012 ImageNet competition. The macro-architecture was very similar to the LeNet architecture, following the classic encoder-discriminator structure suitable for classification tasks, but with a number of trainable parameters several orders of magnitude larger (~60M instead of ~60k).

The encoder part was split into two largely independent paths which could be trained on two separate GPUs, thus accelerating the training. Subsampling was done with a max-pooling operation, and the chosen activation function was the ReLU.

Figure B.5. AlexNet architecture, originally published in Krizhevsky et al. [6].

B.2.3 ResNet

The “ResNet” family of architectures from He et al. [7] popularized short-skip connections by demonstrating great results on the ImageNet dataset. They contain a succession of “residual” blocks (shown in Figure B.6), made of two or three convolutional layers with ReLU activation functions. Subsampling is not done by pooling, but by using a stride of 2 in some of the convolutional layers. The number of parameters depends on the chosen depth, with the shallowest versions having ~250k parameters and the deepest version ~20M. Batch normalization is used after each convolution (before the activation function).

Figure B.6. Two types of residual blocks from the ResNet architecture, originally published in He et al. [7]. The version on the left is for residual blocks where the dimensions of the feature maps remain the same, and the version on the right is a “bottleneck” block with subsampling in the 3x3 convolution.
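The subsampling obtained with a strided convolution follows from the standard output-size formula for convolutions. The sketch below is only an illustration of that arithmetic: the padding of 1 and the 56x56 input are assumptions chosen for the example, and the function name is hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# A 3x3 convolution with padding 1 preserves the size with a stride of 1...
same_size = conv_output_size(56, kernel=3, stride=1, padding=1)
# ...and halves it with a stride of 2, replacing a pooling layer.
halved_size = conv_output_size(56, kernel=3, stride=2, padding=1)
print(same_size, halved_size)
```

In a block that subsamples this way, the skip connection also has to be brought to the new dimensions before the addition, since the input and the output of the block no longer have the same shape.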

B.2.4 DenseNet

DenseNet, introduced in Huang et al. [8], pushes the notion of skip-connections to the extreme by using “dense blocks,” where every convolutional layer is connected to every subsequent convolutional layer. Subsampling is done between the dense blocks with a 1x1 convolution and a 2x2 average pooling (see Figure B.7). A particularity of DenseNet is that it works very well with a very “narrow” network, meaning that there are few feature maps per layer. ReLU activation functions are used throughout the network.

Figure B.7. DenseNet architecture, originally published in Huang et al. [8].
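Because each layer of a dense block receives the feature maps of all preceding layers, the number of input channels grows linearly inside a block even when each layer only produces a few maps. The sketch below tracks that growth, assuming the configuration usually reported for DenseNet-121 (64 initial feature maps, growth rate of 32, blocks of 6, 12, 24 and 16 layers, and transitions halving the number of maps). Function names are hypothetical.

```python
def dense_block_channels(in_channels, n_layers, growth_rate):
    """Channels at the output of a dense block.

    Each layer sees all previous feature maps and appends `growth_rate` new ones.
    """
    channels = in_channels
    for _ in range(n_layers):
        channels += growth_rate
    return channels


def transition_channels(in_channels, compression=0.5):
    """Channels after the 1x1 convolution of a transition between blocks."""
    return int(in_channels * compression)


GROWTH_RATE = 32
BLOCK_SIZES = (6, 12, 24, 16)

channels = 64  # feature maps entering the first dense block (assumed)
for index, n_layers in enumerate(BLOCK_SIZES):
    channels = dense_block_channels(channels, n_layers, GROWTH_RATE)
    if index < len(BLOCK_SIZES) - 1:
        channels = transition_channels(channels)
print(channels)
```

Each individual layer only adds 32 feature maps here, which is what is meant by a “narrow” network, while the concatenations still give the last layers access to several hundred maps.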

B.2.5 PSPNet

PSPNet, introduced by Zhao et al. [9], is a semantic segmentation network with a strong focus on multi-resolution processing for natural scenes. The overall architecture is shown in Figure B.8. The network starts with a pre-trained DCNN encoder (ResNet-50 was used in the original publication). The feature map is then pooled with different pooling sizes to have a representation of the features at different levels of resolution. Each of these representations passes through a 1x1 convolution to reduce the number of features and is then up-sampled with bilinear interpolation. The up-sampled representations are finally concatenated with the encoder’s feature map, and a last convolutional layer provides the final prediction.

Figure B.8. PSPNet architecture, originally published in Zhao et al. [9].
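The multi-resolution pooling step can be sketched on a toy feature map. The example below assumes the four pyramid levels of the original publication (1x1, 2x2, 3x3 and 6x6 bins) and an encoder producing 2048 feature maps, each level being reduced to a quarter of that number. The 1x1 convolutions and the up-sampling are not implemented, only the pooling and the channel bookkeeping; all names are hypothetical.

```python
def average_pool_to_bins(feature_map, bins):
    """Average-pool a square 2D map into bins x bins cells.

    The map size is assumed to be divisible by `bins`.
    """
    size = len(feature_map)
    cell = size // bins
    pooled = []
    for bin_row in range(bins):
        row = []
        for bin_col in range(bins):
            values = [
                feature_map[r][c]
                for r in range(bin_row * cell, (bin_row + 1) * cell)
                for c in range(bin_col * cell, (bin_col + 1) * cell)
            ]
            row.append(sum(values) / len(values))
        pooled.append(row)
    return pooled


BIN_SIZES = (1, 2, 3, 6)
feature_map = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
pyramid = {bins: average_pool_to_bins(feature_map, bins) for bins in BIN_SIZES}

# Channel bookkeeping for the concatenation (assumed 2048 encoder channels).
encoder_channels = 2048
reduced_channels = encoder_channels // len(BIN_SIZES)
concat_channels = encoder_channels + len(BIN_SIZES) * reduced_channels
print(pyramid[1], concat_channels)
```

The coarsest level summarizes the whole map in a single value per channel (a global context), while the finest level keeps local detail; the concatenation roughly doubles the number of channels seen by the last convolution.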

B.2.6 HoVer-Net

Originally introduced as the “XY Network,” the network now known as HoVer-Net was designed by Graham et al. [10] specifically for nuclei instance segmentation and classification in digital pathology. It is characterized by the use of three different paths for the decoder, which share the same architecture but are trained on different outputs (see Figure B.9). The first is a classic binary segmentation decoder (the “Nuclear Pixel Branch”) that produces a per-pixel nuclei/non-nuclei probability. The second is a decoder with two channels (the “HoVer Branch”), trained on the per-pixel regression task of predicting the X-Y vector pointing towards the centre of the nucleus. The third decoder is trained on the semantic segmentation task of predicting the nucleus type. The three branches are trained simultaneously. Residual blocks are used in the encoder, and dense blocks in the decoders.

Figure B.9. HoVer-Net architecture, originally published in Graham et al. [10].
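To make the regression target of the HoVer Branch more concrete, the sketch below builds horizontal and vertical maps from a labelled instance mask. It is a simplified illustration and not the reference implementation: the sign convention (pixel position minus centre of mass, so that negating the values gives the vector pointing towards the centre) and the per-nucleus scaling to [-1, 1] are assumptions, and the function name is hypothetical.

```python
def hover_maps(instance_mask):
    """Per-pixel horizontal and vertical offsets from each nucleus centre.

    `instance_mask` is a 2D list of integer labels, with 0 for background.
    Offsets are scaled to [-1, 1] within each nucleus (assumed normalisation).
    """
    rows, cols = len(instance_mask), len(instance_mask[0])
    h_map = [[0.0] * cols for _ in range(rows)]
    v_map = [[0.0] * cols for _ in range(rows)]

    # Group pixel coordinates by nucleus label.
    pixels = {}
    for r in range(rows):
        for c in range(cols):
            label = instance_mask[r][c]
            if label != 0:
                pixels.setdefault(label, []).append((r, c))

    for coords in pixels.values():
        centre_r = sum(r for r, _ in coords) / len(coords)
        centre_c = sum(c for _, c in coords) / len(coords)
        # Fall back to 1.0 to avoid dividing by zero for one-pixel-wide nuclei.
        max_dr = max(abs(r - centre_r) for r, _ in coords) or 1.0
        max_dc = max(abs(c - centre_c) for _, c in coords) or 1.0
        for r, c in coords:
            h_map[r][c] = (c - centre_c) / max_dc
            v_map[r][c] = (r - centre_r) / max_dr
    return h_map, v_map


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
h_map, v_map = hover_maps(mask)
print(h_map[2])
```

Since the values change sign abruptly between two touching nuclei, such maps carry the information needed to separate instances that a binary segmentation alone would merge.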

B.2.7 EfficientNet

More than a network architecture, Tan et al. [11] propose with EfficientNet a method for scaling up convolutional neural networks so as to obtain high performance with a low number of parameters. The method, called “compound model scaling,” uses a single parameter to simultaneously scale the depth (i.e. number of layers), width (i.e. number of channels in the feature maps) and image resolution. They propose a baseline architecture (“EfficientNet-B0”) which uses residual blocks of increasing width and decreasing resolution, as usual for encoders. Then, using the compound scaling parameter, scaled-up versions of the architecture are built (from “B1” to “B7”).

Figure B.10. Compound model scaling in EfficientNet, originally published in Tan et al. [11].
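The compound scaling rule can be written down in a few lines. The sketch below assumes the base coefficients reported by Tan et al. [11] for the B0 baseline (1.2 for depth, 1.1 for width and 1.15 for resolution), chosen so that increasing the compound parameter by one roughly doubles the computational cost. The function names are hypothetical, and the exact correspondence between values of the parameter and the published B1 to B7 variants is not asserted here.

```python
ALPHA = 1.2   # depth base coefficient (assumed from the publication)
BETA = 1.1    # width base coefficient
GAMMA = 1.15  # resolution base coefficient


def compound_scaling(phi):
    """Multipliers applied to the baseline for a compound coefficient `phi`."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def cost_multiplier(phi):
    """Approximate increase in FLOPs: linear in depth, quadratic in the others."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    factors = compound_scaling(phi)
    print(phi, {name: round(value, 2) for name, value in factors.items()},
          round(cost_multiplier(phi), 2))
```

The three dimensions are thus tied together by one parameter instead of being tuned separately, which is the main idea of the method.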

B.2.8 Faster R-CNN and Mask R-CNN

Faster R-CNN, proposed by Ren et al. [12], is the successor of R-CNN [14] and Fast R-CNN [15], previously introduced by Girshick et al. for object detection based on bounding box localisation. It follows a classification macro-architecture, with two decoders. The first one is the “Region Proposal Network,” which uses a sliding window to find candidate regions based on an “objectness” score. The second is trained to determine the class of the candidate regions. The encoder is shared by both decoders.

Figure B.11. Faster R-CNN architecture, originally published in Ren et al. [12].
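At each position of the sliding window, the Region Proposal Network scores a fixed set of reference boxes (“anchors”), and candidate boxes are compared to the annotated objects with the intersection over union (IoU). The sketch below assumes the three scales and three aspect ratios of the original publication (nine anchors per position); function names are hypothetical, and the network predicting the objectness scores is not implemented.

```python
import math

SCALES = (128, 256, 512)         # anchor side length for a square box, in pixels
ASPECT_RATIOS = (0.5, 1.0, 2.0)  # height / width


def anchors_at(cx, cy):
    """Anchor boxes (x_min, y_min, x_max, y_max) centred on (cx, cy)."""
    boxes = []
    for scale in SCALES:
        for ratio in ASPECT_RATIOS:
            # Keep the area equal to scale**2 while changing the aspect ratio.
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            boxes.append((cx - width / 2, cy - height / 2,
                          cx + width / 2, cy + height / 2))
    return boxes


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


anchors = anchors_at(300, 300)
ground_truth = (236, 236, 364, 364)  # hypothetical annotated box
best_anchor = max(anchors, key=lambda box: iou(box, ground_truth))
print(len(anchors), best_anchor, round(iou(best_anchor, ground_truth), 2))
```

During training, anchors that overlap an annotated box strongly are used as positive examples for the objectness score and the others as negative ones, so the choice of scales and ratios conditions which objects can be proposed at all.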

References

[1] A. Foucart, O. Debeir, and C. Decaestecker, “Artifact Identification in Digital Pathology from Weak and Noisy Supervision with Deep Residual Networks,” in 2018 4th International Conference on Cloud Computing Technologies and Applications (Cloudtech), Nov. 2018, pp. 1–6. doi: 10.1109/CloudTech.2018.8713350.

[2] A. Foucart, O. Debeir, and C. Decaestecker, “SNOW: Semi-Supervised, Noisy And/Or Weak Data For Deep Learning In Digital Pathology,” in 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Apr. 2019, pp. 1869–1872. doi: 10.1109/ISBI.2019.8759545.

[3] A. Foucart, O. Debeir, and C. Decaestecker, “Snow Supervision in Digital Pathology: Managing Imperfect Annotations for Segmentation in Deep Learning,” 2020, doi: 10.21203/rs.3.rs-116512.

[4] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, 2015, pp. 234–241. doi: 10.1007/978-3-319-24574-4_28.

[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998, doi: 10.1109/5.726791.

[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances In Neural Information Processing Systems, pp. 1–9, 2012.

[7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016, vol. 2016-Decem, pp. 770–778. doi: 10.1109/CVPR.2016.90.

[8] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.

[9] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid Scene Parsing Network,” 2017. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2017/html/Zhao_Pyramid_Scene_Parsing_CVPR_2017_paper.html

[10] S. Graham et al., “Hover-Net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images,” Medical Image Analysis, vol. 58, p. 101563, Dec. 2019, doi: 10.1016/j.media.2019.101563.

[11] M. Tan and Q. Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” in Proceedings of the 36th International Conference on Machine Learning, 2019, vol. 97, pp. 6105–6114. [Online]. Available: https://proceedings.mlr.press/v97/tan19a.html

[12] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in Advances in Neural Information Processing Systems, 2015, vol. 28. [Online]. Available: https://proceedings.neurips.cc/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf

[13] Y. LeCun et al., “Backpropagation Applied to Handwritten Zip Code Recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989, doi: 10.1162/neco.1989.1.4.541.

[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” Nov. 2013.

[15] R. Girshick, “Fast R-CNN,” in 2015 IEEE International Conference on Computer Vision (ICCV), Dec. 2015, vol. 11-18-Dece, pp. 1440–1448. doi: 10.1109/ICCV.2015.169.