Deep Learning in Digital Pathology
Research in the course of a PhD thesis, by Adrien Foucart.
The ICPR 2012 mitosis detection competition - usually called MITOS12 - proved that Deep Learning was the way to go for mitosis detection, and was influential in introducing Deep Learning into the world of Digital Pathology. It was also a flawed challenge, and there is a lot to learn from the mistakes that were made in its design. To understand those mistakes, we first need to look at what the goal of a "computer vision challenge" is.
Let's say you are a researcher who just developed a new algorithm for recognizing the species of a bird from a photograph in a natural setting. Now, you want to publish your research. In order to convince reviewers that your algorithm is interesting, however, you not only have to prove that it works, but also that it improves on the current state of the art.
If you are not a big fan of ethics, here's what you do: you find a set of test images where your algorithm works really, really well. Then, you implement the other algorithms, without caring too much about whether you do it exactly right. You test all the algorithms on your test set and, surprise, you are the best! This may be very helpful in getting published, but of course it doesn't tell us anything useful.
This is of course an exaggeration... but not so far from the truth. Even if you do care about ethics, it's very hard to implement other people's methods - especially if they didn't publish their code - and to find the best parameters for every method on your dataset. A very popular way to solve this problem is to use benchmarks and challenges.
The idea of the benchmark is fairly simple. Let's take our "bird recognition" problem: at one point in time, someone publishes a large collection of annotated bird images. Everyone who works on the problem of "bird recognition" can test their algorithm on the same data, which means that you can directly compare your method to what others have published. Challenges are similar, except that there is usually a "time limit" component: someone publishes a bird dataset and tells everyone interested to submit their method before a given date. Then, everyone is evaluated at the same time (ideally on previously unreleased test images), and the results are published. This ensures a certain fairness in the comparison, as everyone plays by the same rules.
While many challenges only attract a few participants and are quickly forgotten, some have become true references for the computer vision community. The PASCAL Visual Object Classes challenge, for instance, ran between 2005 and 2012, with researchers tasked with recognizing objects from up to 20 different classes (see figure below). Starting in 2010, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC, often simply referred to as "ImageNet") has become the reference for generic computer vision tasks.
Many Deep Learning algorithms have become famous through their ImageNet performances, such as the previously mentioned AlexNet  (winner in 2012), Google's "Inception" network  (winner in 2014), or Microsoft's "ResNet"  (winner in 2015).
Challenges entered the world of digital pathology around 2012. Quoting Geert Litjens (emphasis mine):
The introduction of grand challenges in digital pathology has fostered the development of computerized digital pathology techniques. The challenges that evaluated existing and new approaches for analysis of digital pathology images are: EM segmentation challenge 2012 for the 2D segmentation of neuronal processes, mitosis detection challenges in ICPR 2012 and AMIDA 2013, GLAS for gland segmentation and, CAMELYON16 and TUPAC for processing breast cancer tissue samples.
MITOS12 was one of the first computer vision contests with a specific digital pathology task. Let's take a closer look at it.
First of all, why do we want to count mitoses? I'll let the experts explain (emphasis mine):
Mitotic count is an important parameter in breast cancer grading as it gives an evaluation of the aggressiveness of the tumor. However, consistency, reproducibility and agreement on mitotic count for the same slide can vary largely among pathologists. An automatic tool for this task may help for reaching a better consistency, and at the same time reducing the burden of this demanding task for pathologists.
The following process was used to create the challenge:
The MITOS12 results were computed separately on the images from the three different scanners. Most teams only submitted results for the "Aperio" scanner. The best results on that dataset were achieved by Dan Cireşan's team, IDSIA. Their algorithm correctly detected 70 of the 100 mitoses in the test set, with only 9 false detections, for a winning F1 score of 0.7821, the runners-up achieving scores of 0.7184 and 0.7094.
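For reference, the F1 score is the harmonic mean of the detection precision and recall. A quick sanity check on IDSIA's reported numbers (70 of the 100 mitoses detected, 9 false detections) reproduces the winning score:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision (tp / detections) and recall (tp / mitoses)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# IDSIA's reported counts: 70 of 100 mitoses detected, 9 false detections.
print(round(f1_score(tp=70, fp=9, fn=30), 4))  # → 0.7821
```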
There are numerous problems with the challenge design, which were actually acknowledged by the authors in this paragraph from the Discussion of their article:
An improved version of this successful challenge will involve a much larger number of mitosis, images from more slides and multiple pathologists’ collaborative/cooperative annotations. Besides, some slides will be dedicated to test only without any HPF of these slides included in the training data set.
There are four different issues here, so let's take them one at a time.
1) The small number of mitoses, with only around 200 in the training set, is certainly a problem. As mitoses can have a very variable appearance, it is unlikely that all of the possible "morphologies" of mitotic cells are represented in the dataset. Machine learning algorithms will therefore be unable to detect the ones they have never seen.
2) The small number of slides is even more problematic - or, more precisely, a small number of slides from a small number of patients. Having more patients means more diversity, and less risk of bias in the dataset. For instance, if a patient happens to have more mitoses than the others (because she has a more malignant cancer), and also happens to have some independent morphological characteristics in her breast tissue, it is very likely that the algorithm will pick up those independent features and use them to predict the presence of mitotic cells, even if those features would be completely meaningless for other patients.
3) The fact that only one pathologist annotated the slides is also worrying. In the introduction to the challenge, they list as a reason for needing automatic counting the lack of inter-pathologist agreement on mitotic counts, citing a study by Christopher Malon and his colleagues. In that study, three different pathologists were asked to classify 4204 nuclei as mitotic or non-mitotic. Pathologists were allowed to put "Maybe" as an answer. Even excluding those "maybe" from the comparison (and therefore comparing only cases where both pathologists were "reasonably certain" of their choice), the agreement between two pathologists was at most 93.5% of "same classification", and in the worst case 64.7%. That same study also compares each pathologist against the "majority label". The F1 score of a pathologist against the majority, using the data from that study, varies between 0.704 and 0.997. In the MITOS12 challenge results, the top 3 teams are all within that range.
Now, that doesn't mean that those three teams are necessarily better than, or even as good as, pathologists. It's not fair to compare them on different datasets. But the point is: the difference in performance between the top teams in the challenge is small enough that a different annotator might have led to a completely different ranking.
4) The last problem acknowledged by Roux's publication is the worst one, at least in terms of methodology: they didn't split the training set and the test set properly. Ideally, when we test an algorithm, we want to make sure that it is capable of handling new cases. The best way to do that is to have a test set as independent as possible from the training set.
In biomedical imaging, that typically means: testing on other slides, taken from other patients, if possible acquired with another machine. Changing more variables means that the results are a lot more meaningful: the best algorithms would then be those that really learned to describe the object of interest - in this case, the mitoses. In her master's thesis in our lab, Elisabeth Gruwé tested the same algorithm using either the "official" training/test split from the contest or a "correct" split, putting one patient aside for testing and training on the four others. The results on the official split were close to those of the three winning teams (0.68); the results on the correct split were significantly worse (0.54).
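What a "correct" split looks like in code is simple enough to sketch: hold out whole patients, never individual images. The image and patient names below are made up for illustration, not the actual MITOS12 identifiers:

```python
import random

# Hypothetical dataset: each image (HPF) is tagged with the patient it came from.
images = [
    ("hpf_00", "patient_A"), ("hpf_01", "patient_A"),
    ("hpf_02", "patient_B"), ("hpf_03", "patient_B"),
    ("hpf_04", "patient_C"), ("hpf_05", "patient_C"),
    ("hpf_06", "patient_D"), ("hpf_07", "patient_E"),
]

def patient_level_split(images, test_ratio=0.2, seed=0):
    """Hold out whole patients, so no patient contributes to both sets."""
    patients = sorted({p for _, p in images})
    random.Random(seed).shuffle(patients)
    n_test = max(1, int(len(patients) * test_ratio))
    test_patients = set(patients[:n_test])
    train = [img for img, p in images if p not in test_patients]
    test = [img for img, p in images if p in test_patients]
    return train, test

train, test = patient_level_split(images)
assert not set(train) & set(test)  # no leakage between the two sets
```

Libraries like scikit-learn offer grouped splitters that do the same thing, but the principle is the whole point: the grouping variable (the patient) decides the split, not the individual image.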
Does the ranking matter? In terms of visibility, probably. The methods proposed by challenge winners tend to be copied, modified, adapted, and become the norm, while runners-up may be completely ignored... even if their results are functionally equivalent. If we look at the publications of the three MITOS12 winning methods, a certain trend is visible. The winner, Cireşan's team's article, was published in the biggest biomedical imaging conference and has been cited more than a thousand times (according to Google Scholar). The runner-up, Humayun Irshad from the University of Grenoble, published in a good journal and has been cited about 150 times. The third, Ashkan Tashk and a team from Shiraz University of Technology, published in the proceedings of an obscure Iranian conference and was cited about 30 times.
Now, the rankings are not the only explanation for this difference in visibility, and the number of citations is not a direct reflection of the influence of a paper. Dan Cireşan was part of a well-established research team with Jürgen Schmidhuber, a Deep Learning pioneer. Humayun Irshad's thesis director was Ludovic Roux, the organizer of the challenge - which is kind of a problem of its own, but ensured that he got some visibility in the follow-up articles. Ashkan Tashk and the Iranian team certainly didn't have the same recognition beforehand - or after.
Two years later, an extended version of the challenge was proposed at the ICPR 2014 conference, MITOS-ATYPIA 14. It provided more data, and the annotations were made by two different pathologists, with a third one looking at all cases where the first two were in disagreement. The data included a confidence score for each mitosis based on the agreement or disagreement of the pathologists... and it was correctly split at the patient level. In 2016, Hao Chen and his Hong Kong team published their results on both the 2012 and 2014 datasets. On the 2012 dataset, they achieve an F1 score of 0.788, "beating" Cireşan's entry. They also beat all other existing publications on the 2014 dataset... with a score of 0.482. Comparing the two datasets, they say:
One of the most difficult challenges in [the 2014] dataset is the variability of tissue appearance, mostly resulted from the different conditions during the tissue acquisition process. As a result, the dataset is much more challenging than that in 2012 ICPR.
But from a purely machine learning perspective, this doesn't sound quite right. Yes, the increased variability in the test set is more challenging, but the increased variability in the training set should help the algorithms. The huge drop in performance is likely due in large part to the incorrect setup of the 2012 challenge. The 2014 edition, however, attracted far fewer participants, and didn't get the same visibility... probably because the results were, for obvious reasons, a lot worse.
In 2017, Geert Litjens and his colleagues from Radboud University published a comprehensive survey of deep learning methods in medical image analysis . It's possible that they missed some early papers that could qualify as "Deep Learning": as I've written before, the boundaries of what is or isn't Deep Learning are unclear, and articles written before 2012 are unlikely to use that terminology. If such articles exist, however, they have been forgotten in the large pile of never-cited research that fails to be picked-up by Google Scholar, Scopus, or other large research databases. Litjens' survey therefore remains the reference on the matter.
So who was first to use "Deep Learning in Digital Pathology"? The turning point seems to come from the MICCAI 2013 conference in Nagoya, with two pioneering articles: Angel Cruz-Roa's automated skin cancer detection , and Dan Cireşan's mitosis detection in breast cancer .
Cruz-Roa uses H&E stained images from skin tissue, to try to automatically determine if there is a malignant cancer in the sample. As illustrated in the figure below, the difference between cancer and non-cancer is based on morphological criteria which are very difficult to define.
Their algorithm produces both a classification (cancer or not) at the image level, and what they call a "digital staining", which is basically a probability map of where the cancer regions are (see figure below). This is a very important feature for machine learning methods in biomedical imaging, related to the concept of interpretability. A machine learning algorithm which only produces a "diagnosis", but is unable to "explain" how this diagnosis came to be, cannot be trusted. I will most certainly come back to that idea later: the "reasoning" that machine learning algorithms (deep or not) perform is sometimes more a reflection of biases and artefacts in their training data than of an understanding of the pathology. Having an output which includes an explanation of the diagnosis is therefore essential to check whether any "weird stuff" is happening.
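To make the "digital staining" idea concrete, here is a minimal sliding-window sketch: a patch classifier is applied across the whole image, and the per-patch probabilities form the map, from which an image-level label can be derived. The classifier here is a stand-in that just reacts to patch darkness, not Cruz-Roa's actual model:

```python
def digital_staining(image, classifier, patch=32, stride=16):
    """Slide a window over the image (a list of pixel rows); the per-patch
    'cancer' probabilities form a coarse map, and the image-level label
    is derived from the map."""
    h, w = len(image), len(image[0])
    heatmap = []
    for y in range(0, h - patch + 1, stride):
        row = []
        for x in range(0, w - patch + 1, stride):
            window = [r[x:x + patch] for r in image[y:y + patch]]
            row.append(classifier(window))
        heatmap.append(row)
    is_positive = max(max(r) for r in heatmap) > 0.5
    return heatmap, is_positive

# Stand-in patch classifier: darker patches get a higher probability.
def toy_classifier(window):
    values = [v for r in window for v in r]
    return 1.0 - sum(values) / len(values)

# 64x64 toy "slide" with one dark quadrant: the 3x3 map localizes it.
img = [[0.2 if y < 32 and x < 32 else 1.0 for x in range(64)] for y in range(64)]
heatmap, is_positive = digital_staining(img, toy_classifier)
```

The point of returning the map rather than just the label is exactly the interpretability argument above: a pathologist can check whether the highlighted regions make sense.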
Cireşan also uses H&E images, taken from breast cancer tissue. I'll let him introduce the problem:
Mitosis detection is very hard.
Mitosis - the process of cell division - is a relatively rare event, meaning that, in the images which are available to train the algorithm, only a very small fraction of the pixels will be part of a nucleus, and an even smaller fraction part of a nucleus going through mitosis. In addition to being rare, the appearance of the cell nucleus will be very different depending on which stage of the mitosis the cell is in. To get an idea of how difficult the task is, we can just look at these examples from the article (click on the image for a larger version):
Cireşan's results were far from perfect, but they were impressive enough to be a milestone in the domain: as a winner of the "ICPR 2012 mitosis detection competition", it got a lot of attention... despite the many methodological issues with the competition itself, which is the topic for another post.
By showing that Deep Learning was a way to get good results on digital pathology tasks, Cireşan and Cruz-Roa opened up the floodgates. Litjens lists many different applications in the subsequent years: bacterial colony counting, classification of mitochondria, classification and grading of different types of cancer, detection of metastases... Mostly on H&E images, but also sometimes using immunohistochemistry, Deep Learning invaded the domain.
A few highly influential works that I would like to mention here, and that I will probably write about more later:
After these pioneering works, the future of the field may seem a little dull. If we have deep neural networks that work for most digital pathology tasks... what is there left to do?
Fortunately, finding a good "deep learning network" is only a part of the "digital pathology pipeline". Everything that comes around the network - from the constitution of the datasets to the way the results are evaluated - is often more important to the final result. That is going to be a large part of what I will write about in future posts. Questions surrounding how data from challenges and data from real-world applications may differ, questions about the way we evaluate algorithms, and about how we declare winners and losers in ways that may not always reflect how useful the algorithms really are. For that, a good starting point will be to take a closer look at the aforementioned MITOS12 challenge from ICPR 2012.
It seems like Deep Learning should have an easy, clear-cut definition. Yet... Wikipedia, on this topic, displays a remarkable example of circular citation - or Citogenesis, as the always-relevant XKCD would put it. The Wikipedia definition is a "summary" of five definitions from a Microsoft Research paper, most of which are themselves taken from earlier versions of the same Wikipedia article.
The most direct definition from a reputable source that I could find is probably from the "Deep Learning" Nature Review of AI-superstars Yann LeCun, Yoshua Bengio and Geoffrey Hinton (emphasis mine):
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.
In a slightly more convoluted way, Ian Goodfellow, Yoshua Bengio (again) and Aaron Courville, in their "Deep Learning" book, introduce the topic this way (emphasis mine):
The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images. (...) A solution is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept deﬁned through its relation to simpler concepts.
These definitions basically boil down to: it's AI, with machine learning, with layers. So... What's machine learning? And, while we're at it, what is AI?
The history of Artificial Intelligence as a practical, computer-science-related field of research, goes back to the early days of computers themselves, around and right after World War II. The "ultimate goal" of AI, as illustrated in Alan Turing's best-known paper , was to create a computer which could be - at least in specific conditions - indistinguishable from a human.
This, unsurprisingly, is a difficult task. In fact, this particular goal, which is now generally referred to as "Artificial General Intelligence", is still mostly the domain of science-fiction.
One of the early avenues of research in AI, in the tradition of "trying to replicate human intelligence", was the artificial neuron. As early as 1943, McCulloch and Pitts  proposed a way to represent neurons in a mathematical model which could be replicated on a computer. They were followed by many others, but while their research was interesting, it proved to be largely impractical. Neural networks, quite simply, did not work. Artificial (General) Intelligence seemed altogether impossible. If AI as it was conceived couldn't be done, the next best thing was to change the definition of AI to something more forgiving.
In "Artificial Intelligence: foundations of computational agents", Poole & Mackworth propose such a definition. AI, in their view, studies computational agents (agents whose actions and decisions can be implemented in a physical device, like a computer) that act intelligently - meaning that their actions are appropriate for their circumstances and their goals, that they are flexible to changing environments and changing goals, and that they learn from experience.
This places any artificial intelligence in the context of task or problem solving. The job of an AI is not to be "like a human", but to have "human-like" (or better-than-human) capabilities in one or several specific tasks.
An interesting aspect of AI as defined by Poole & Mackworth is the capacity to learn from experience. Machine Learning is a subset of AI dealing with this particular aspect of "intelligence": how can a machine learn from experience?
To understand the basic idea of Machine Learning, it's interesting to look at one of its earliest algorithms, "Nearest Neighbor", a version of which was described as early as 1951.
Let's take an example. Imagine that you are a Big Streaming Company, and you want your AI to decide which movies or series in your catalog you should recommend to a particular user. Let's assume that you actually want to recommend something that the user will like, and not just something that you want to promote. For every movie or TV series, you have two pieces of information: how much violence and how much comedy there is, as a percentage of the total runtime. For everything that the user has already seen, you also know (don't ask how) whether they liked it or not.
You could therefore represent all of the movies that the user has seen on a nice graph, like this:
What you do now is take the movie that you want to recommend and put it on the same graph. Then, you check whether the closest point is something that the user liked. If not, you don't recommend the movie.
Obviously, this is a ridiculous example, but the main idea is there: the algorithm uses past experience to predict new behaviour. Of course, this will work a lot better if you have more data, and if you have better ways of describing this data.
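The whole toy "recommendation algorithm" fits in a few lines. Here is a minimal 1-nearest-neighbour sketch (the movie coordinates are, of course, made up):

```python
import math

# Each seen movie: (violence %, comedy %) -> did the user like it?
seen = [((80, 5), False), ((10, 70), True), ((60, 40), True), ((90, 10), False)]

def recommend(candidate, history):
    """1-nearest-neighbour rule: recommend iff the closest seen movie was liked."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    _, liked = min(history, key=lambda entry: dist(entry[0], candidate))
    return liked

print(recommend((15, 65), seen))  # closest to (10, 70), which the user liked → True
```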
There are many, many more complex and more accurate algorithms in Machine Learning. But in the end, they fundamentally do the same thing: put all of the data (the "past experience") into some space that describes it as well as possible, and then find in that space a "rule" that best predicts what to do with any future event. This rule can be as simple as a straight line (or, to be precise in the more-than-two-dimensions case, a hyperplane) dividing the space in two, or an intricate function with millions of parameters able to model any boundary shape, as in modern artificial neural networks.
While diverse machine learning algorithms such as Decision Trees, Support Vector Machines, and many others were being developed, the "artificial neural network" world was not completely inactive. One of the major issues with neural networks - how to efficiently "train" them with new examples - was vastly improved upon in 1986 with the backpropagation algorithm, which is still used today. In 1989, Yann LeCun and his team used it in what is now considered one of the first practical applications of "modern" artificial neural networks, recognizing hand-written ZIP codes for the US Postal system.
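To give an idea of what backpropagation actually computes, here is a toy sketch: a tiny 2-2-1 sigmoid network trained on the XOR problem with plain gradient descent. It's nothing more than the chain rule applied layer by layer; whether such a small net fully solves XOR depends on the random starting weights, but the loss reliably decreases:

```python
import math
import random

sig = lambda z: 1 / (1 + math.exp(-z))

# A tiny 2-2-1 sigmoid network trained by backpropagation on XOR.
random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]  # hidden weights
b1 = [0.0, 0.0]                                                     # hidden biases
W2 = [random.uniform(-1, 1) for _ in range(2)]                      # output weights
b2 = 0.0
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(x):
    h = [sig(W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i]) for i in range(2)]
    return h, sig(W2[0] * h[0] + W2[1] * h[1] + b2)

def train_epoch(lr=0.5):
    """One pass over the data: forward, then gradients via the chain rule."""
    global b2
    loss = 0.0
    for x, t in data:
        h, y = forward(x)
        loss += (y - t) ** 2
        dy = 2 * (y - t) * y * (1 - y)           # error signal at the output
        for i in range(2):
            dh = dy * W2[i] * h[i] * (1 - h[i])  # ...propagated to the hidden layer
            W2[i] -= lr * dy * h[i]
            b1[i] -= lr * dh
            for j in range(2):
                W1[i][j] -= lr * dh * x[j]
        b2 -= lr * dy
    return loss

losses = [train_epoch() for _ in range(2000)]  # total loss per epoch
```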
In 1997, IBM's Deep Blue beat world champion Garry Kasparov in a six-game chess match. In terms of public perception, this certainly gave AI enthusiasts a boost. Deep Blue, however, was an "intelligence" only under the most forgiving definition. It didn't learn anything, it didn't reason about anything: it was a pure brute-force mechanism. Deep Blue simply took the current situation of the game and computed all possible outcomes of all possible moves, 10 to 20 moves ahead. It did use "previous experience", in the form of thousands of previously played human-vs-human grandmaster games, to determine what a "winning move" looked like. But in the end, it mostly relied on the fact that chess is a game with fixed rules and a finite number of outcomes. It worked because it had more processing power than what was previously available, not because it was innovative.
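The "compute all possible outcomes" strategy is easier to see on a game much smaller than chess. Here is a sketch of the same exhaustive game-tree idea on one-pile Nim (each player removes 1 or 2 stones, and taking the last stone wins):

```python
from functools import lru_cache

# Toy stand-in for Deep Blue's exhaustive search: one-pile Nim, where each
# player removes 1 or 2 stones and whoever takes the last stone wins.
@lru_cache(maxsize=None)
def best_outcome(stones):
    """Return True if the player to move can force a win."""
    if stones == 0:
        return False  # the previous player took the last stone and won
    # Try every legal move; a position is winning if some move leaves
    # the opponent in a losing position.
    return any(not best_outcome(stones - take) for take in (1, 2) if take <= stones)

print([n for n in range(1, 10) if not best_outcome(n)])  # → [3, 6, 9]
```

Unlike Nim, chess has far too many positions to search to the end, which is why Deep Blue cut the tree off after 10 to 20 moves and scored the leaves heuristically; the principle, however, is the same.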
LeCun's success put neural networks back on the map, but they were still a curiosity. In most applications, they were impractical, took way too long to train, and didn't usually perform better than other machine learning approaches. But with the 21st century came two game changers in the machine learning world: Big Data and fast GPUs. Big Data - the ability to store huge amounts of data on everything, thanks to cheap hard drives - gave us the ability to improve machine learning in general. Fast GPUs made training larger, more useful neural networks a reality. Quoting Dan Cireşan and his colleagues in 2010:
All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning.
Around the same time, the idea that "larger" neural networks were actually "deeper" neural networks, with neurons organized in "layers", each layer connected to the next starting from the raw data all the way up to the output, became common usage.
By 2012, the achievements of "deep" neural networks became impossible to ignore. On ImageNet, the largest visual object recognition challenge, Alex Krizhevsky's AlexNet dominated that year's field. Deep Learning approaches have since consistently beaten "classical" machine learning methods on just about everything. Most notably, Deep Learning has become the standard solution for computer vision and language processing. In the world of AI, it is now the law of the land.
These definitions are fuzzy. The boundaries between Deep Learning and "non-deep" Machine Learning are unclear, as are sometimes the boundaries between Machine Learning and "old fashioned AI". That's fine: we don't need every method to fit into a well-defined box.
All right. Now that we have defined what Digital Pathology and Deep Learning are, the next question will be: how has Deep Learning been applied to Digital Pathology?
Let's start from the beginning. When I started my thesis, the topic we settled on was "Deep Learning in Digital Pathology". It's vague - but that was kind of the point. Deep Learning and Digital Pathology were both recent trends at the time, so trying to look at what could be done with it in general seemed like a good idea.
We have two parts in "Deep Learning in Digital Pathology". The first one, Deep Learning, is where I have spent most of my thesis. That's the part that concerns me as a biomedical engineer specialized in image analysis, and where I can contribute the most. The second part, however, is just as important. It's the application, what we want to use Deep Learning for: Digital Pathology. I am certainly not an expert in Digital Pathology - and even less of an expert in not-digital histopathology - but understanding what the methods I'm going to develop may be used for does seem like a good idea, so let's briefly get into it.
The goal of histopathology is to examine human tissue (usually taken from a tumour or some other possibly diseased area) under a microscope, to formulate a diagnosis or to get a better understanding of a disease. The process, in short, is as follows:
Why do we have to stain the tissue? Because cells are mostly water, water tends to be transparent, and transparent things are hard to look at with visible light. Fortunately, some chemical pigments have properties which are very useful for pathologists. For instance, in the late 19th and early 20th century, it was discovered that we could use hematoxylin to stain the nuclei of cells in blue, and eosin to stain the cytoplasm in pink.
This produces "Hematoxylin & Eosin" - H&E - images like this one below, where the structure of the tissue is easy to analyse for the pathologist:
Slightly more recently, we realized that we could use the properties of antigens and antibodies to get some more specific staining. The idea is this: in our body, some cells produce antibodies, designed to specifically bind to certain proteins - antigens - as a defence mechanism to produce an immune response. We can "hack" this process by binding a staining agent to an antibody and therefore "highlight", in the tissue, places where the related antigens are present. This method is called immunohistochemistry, or IHC. For instance, in the image below, we have the same part of the tissue stained with H&E on one side, and with an IHC marker (anti-pan-cytokeratin, to be precise) on the other. The IHC marker highlights the cells which are part of a tumour, which is rather useful information to have in histopathology.
So where does the "digital" part fit in all this?
The problem with the process above is that it requires the trained pathologist to be physically in front of the microscope, with the slide in it. There are a number of drawbacks to this. One is that it's hard to get a second opinion from a specialist from somewhere else. Another is that comparing a tissue to, for instance, another sample taken some months or years before requires finding the physical slide in the archives of the hospital.
How do we solve that? With digital scanners. Very expensive, very high resolution, very precise digital scanners. The entire slide can be scanned at multiple levels of magnification to produce Gigapixel images, which can be viewed on a computer. The pathologist can then access the image in a "virtual microscope".
And once you start to have the slides as digital objects, you open the door to many possibilities, from tools to easily annotate the slides (for instance, for teaching purposes, or simply to quickly document the reasoning behind a diagnosis) to automated analysis of certain aspects of the tissue. In particular, some quantitative analyses are very hard for a human expert to perform objectively (like evaluating "what percentage of the tissue shows this marker?"), yet relatively easy (or at least possible) for a well-designed algorithm.
Digital image acquisition are becoming commonplace, and associated image analysis solutions are viewed by most as the next critical step in automated histological analysis.
In the more than 10 years since Mulrane's paper, there has indeed been a wide range of image analysis applications in digital pathology. And, in more recent years, as in most image analysis applications, one type of strategy seems to have very quickly surpassed all others: Deep Learning.
I guess that's a teaser for the "next episode"?